Skip to main content

CSV Files

This will mostly be an example of a specific kind of file you might work with, and some built-in machinery specifically for that type of file.

That type of file in question: CSVs.

What are CSV Files

CSV (short for "comma-separated value(s)") files are effectively the file type for tables and spreadsheets.

So, consider this table of elements (in the "periodic table of elements" sense):

ElementSymbolAtomic Mass
HydrogenH1.00784
HeliumHe4.0026
LithiumLi6.941
BerylliumBe9.0121
BoronB10.811

Look how beautifully it is rendered on this website.

A CSV is basically the pure data/content of that table, it is much less aesthetically pleasing, but it contains all the same information:

Element,Symbol,Atomic Mass
Hydrogen,H,1.00784
Helium,He,4.0026
Lithium,Li,6.941
Beryllium,Be,9.0121
Boron,B,10.811

Observe the parallels between the table and the CSV. Honestly, the CSV just looks like the table but without the formatting (and with the commas).

The format has a couple characteristics:

  • The header row of the table (if it exists) is the first row of the CSV.
  • Every non-header row in the table, becomes a line in the CSV, called a record.
  • Every column in the table represents a field, the values in columns for a given row are separated by commas in the CSV.

A CSV may not have a header row, so what each column represents may just need to be known to the programmer or user. For instance, I could have a file named elements.csv that just has:

Hydrogen,H,1.00784
Helium,He,4.0026
Lithium,Li,6.941
Beryllium,Be,9.0121
Boron,B,10.811

Why cut of the header? Because it is annoying one-off row to deal with. Why keep the header? Because it contains useful information about what the file means/represents. There are arguments both ways.

Direct CSV Handling (Lists of Lists)

We'll look at unpacking, manipulating, and re-packaging a CSV both will only the machinery we already have (basic I/O and string methods) and with machinery provided by the csv module.

For both of these, we will be starting with the file elements.csv with the following contents:

Hydrogen,H,1.00784
Helium,He,4.0026
Lithium,Li,6.941
Beryllium,Be,9.0121
Boron,B,10.811

Our goal will be go from this "element name, periodic symbol, atomic mass" layout to a layout of "atomic number, periodic symbol, element name, atomic mass". This will be done by swapping the element name and periodic symbol for each row, and sticking the atomic number at the front.

The updated table will be written to updated_elements.csv.

Normal I/O

So here is an approach with all the stuff we already have, broken up across these tabs to reduce the horizontal space consumed.

As a first step, let's open the file, read the contents, and split it into lines. Nothing new here over last lecture. I'll be using the .read().splitlines() approach:

import pprint

with open("elements.csv", "r") as elem_file:
elem_rows = elem_file.read().splitlines()

pprint.pprint(elem_rows)

The above code will produce the output below.

['Hydrogen,H,1.00784',
'Helium,He,4.0026',
'Lithium,Li,6.941',
'Beryllium,Be,9.0121',
'Boron,B,10.811']

CSV Module: Reader and Writer

CSVs introduce the need to do that initial split on commas or the later join on commas (depending on if we need to read from or write to a CSV).

This stuff is kind of annoying. As an alternative, there is some machinery that handles it for us in the built-in csv module.

For reading, the csv module provides us with Readers. These are created with csv.reader(FILE_OBJECT), where FILE_OBJECT is an opened file. Not a file name, but the opened file object. This function returns an iterable of the desired broken-up rows (each row as a list), which can in turn be cast to a list of these row-lists.

Behold how we get the entire 2D-list construction with just this:

import csv
import pprint

with open("elements.csv", "r") as elem_file:
reader = csv.reader(elem_file)
elem_table = list(reader)

pprint.pprint(elem_table)

The above code will produce the output below.

[['Hydrogen', 'H', '1.00784'],
['Helium', 'He', '4.0026'],
['Lithium', 'Li', '6.941'],
['Beryllium', 'Be', '9.0121'],
['Boron', 'B', '10.811']]

caution

Material from here on was not part of the lecture, but is stuff I think is related and worth mentioning.


Handling CSVs as Lists of Dictionaries

Each row/entry/record in the CSV has columns that represent fields. The location/order of the fields is often more incidental or aesthetic (it doesn't change the informational content).

In order words, a dictionary (where the field names are the keys) is actually often a more appropriate way to represent a given record. And a list thereof a better way to represent the informational content of the CSV.

Normal I/O

As before, everything can be done with basic I/O, string manipulation, and some comprehensions.

The actual reading in of the CSV isn't different at first, in that we still do need to break the input into rows and values (in order to store values in the dictionaries).

For each row (which has been broken up into values), we need to repackage those values into a dictionary. An operation over every row is ... a comprehension!

import pprint

with open("elements.csv", "r") as elem_file:
elem_rows = elem_file.read().splitlines()
elem_table = [row.split(",") for row in elem_rows]
elem_dicts = [{"Element":row[0], "Symbol":row[1], "Atomic Mass":row[2]} for row in elem_table]

pprint.pprint(elem_dicts)

The above code will produce the output below.

[{'Atomic Mass': '1.00784', 'Element': 'Hydrogen', 'Symbol': 'H'},
{'Atomic Mass': '4.0026', 'Element': 'Helium', 'Symbol': 'He'},
{'Atomic Mass': '6.941', 'Element': 'Lithium', 'Symbol': 'Li'},
{'Atomic Mass': '9.0121', 'Element': 'Beryllium', 'Symbol': 'Be'},
{'Atomic Mass': '10.811', 'Element': 'Boron', 'Symbol': 'B'}]

CSV Module: DictReader and DictWriter

This dictionary packing and unpacking is annoying. As an alternative, there is some machinery that handles it for us in the built-in csv module.

Instead of a Reader, we now have a DictReader. This is created with csv.DictReader(FILE_OBJECT, FIELDNAME_LIST), where FILE_OBJECT is as in scv.reader() and FIELDNAME_LIST is a list of the field names, in the order they are read. (We'll later see that we don't need FIELDNAME_LIST if the CSV has a header row.) This function returns an iterable of the desired row-dictionaries, which can in turn be cast to a list.

Behold how we get the entire dictionary construction with just this:

import csv
import pprint

with open("elements.csv", "r") as elem_file:
reader = csv.DictReader(elem_file, ["Element","Symbol","Atomic Mass"])
elem_table = list(reader)

pprint.pprint(elem_table)

The above code will produce the output below.

[{'Atomic Mass': '1.00784', 'Element': 'Hydrogen', 'Symbol': 'H'},
{'Atomic Mass': '4.0026', 'Element': 'Helium', 'Symbol': 'He'},
{'Atomic Mass': '6.941', 'Element': 'Lithium', 'Symbol': 'Li'},
{'Atomic Mass': '9.0121', 'Element': 'Beryllium', 'Symbol': 'Be'},
{'Atomic Mass': '10.811', 'Element': 'Boron', 'Symbol': 'B'}]

CSV Module with Header Row

There is one more thing to show off about DictReader and DictWriter.

Consider a CSV with a header row. So imagine elements.csv was:

Element,Symbol,Atomic Mass
Hydrogen,H,1.00784
Helium,He,4.0026
Lithium,Li,6.941
Beryllium,Be,9.0121
Boron,B,10.811

If we don't specify the field names for DictReader, it will use the first row in the CSV.

For the DictWriter, it always needs the field names given. But we get a .writeheader() method that writes that list of field names as the header of the csv.

So if the CSV we were reading had a header row, and we wanted to write a header row in the transformed CSV, we could do this:

import csv

with open("elements.csv", "r") as elem_file:
reader = csv.DictReader(elem_file)
elem_table = list(reader)

for i in range(len(elem_table)):
row = elem_table[i] # updates to row will affect elem_table (mutability)
row["Atomic Number"] = i+1 # Add number to row at the appropriate field

with open("updated_elements.csv", "w") as new_elem_file:
writer = csv.DictWriter(new_elem_file, ["Atomic Number","Symbol","Element","Atomic Mass"])
writer.writeheader()
writer.writerows(elem_table)

After running that code, the contents of the file updated_elements.csv would be:

Atomic Number,Symbol,Element,Atomic Mass
1,H,Hydrogen,1.00784
2,He,Helium,4.0026
3,Li,Lithium,6.941
4,Be,Beryllium,9.0121
5,B,Boron,10.811