While command-line tools allow for several quick, out-of-the-box data transformations, we turn to Python to do anything a bit more custom.
This snippet opens a file in read-only mode (the default), loads the entire contents of the file as a string into full_text, and prints it out.
with open('myfile.txt') as f:
    full_text = f.read()
    print(full_text)
The with open(...) as f statement uses what is called a "context manager". After opening a file, we generally want to close it so that the file handle is released and does not leak. The context manager will do this for us.
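To make the role of the context manager concrete, here is a minimal sketch of roughly what the with statement saves us from writing by hand. The file name demo.txt and the temporary directory are assumptions made only so the sketch runs on its own; it is written for Python 3.

```python
import os
import tempfile

# Hypothetical demo file, created in a temporary directory so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
with open(path, 'w') as f:
    f.write('some text\n')

# Roughly what "with open(path) as f:" does for us behind the scenes.
f = open(path)
try:
    full_text = f.read()
finally:
    # Runs even if read() raises, so the file is always closed.
    f.close()

print(full_text)
print(f.closed)
```

The try/finally form works, but it is easy to forget the close() call, which is why the with form is generally preferred.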
This snippet opens a file in write mode and writes the word 'hello' with a newline character at the end.
with open('testwrite.txt', 'w') as f:
    f.write('hello')
    f.write('\n')
You can append to the end of a file by opening it in append mode, 'a', as in with open('testwrite.txt', 'a') as f:.
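As a small illustration of the difference between write mode and append mode, the sketch below writes a file, appends to it, and reads it back. The temporary directory is an assumption used only to keep the example self-contained.

```python
import os
import tempfile

# Hypothetical location for the demo file.
path = os.path.join(tempfile.mkdtemp(), 'testwrite.txt')

# 'w' creates the file (or truncates it if it already exists).
with open(path, 'w') as f:
    f.write('hello\n')

# 'a' keeps the existing contents and adds to the end.
with open(path, 'a') as f:
    f.write('world\n')

with open(path) as f:
    contents = f.read()

print(contents)
```

Opening the same file again with 'w' instead of 'a' would have discarded the first line.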
Create a file called name.txt with your full name in it.
Write a Python script that:
- reads name.txt into a variable my_name, and then
- writes a new file named hello.txt with the contents Hello, my name is <my_name>.
This snippet opens a file in read-only mode and uses the csv module to instantiate a csv.DictReader. The DictReader parses the CSV and returns a dictionary for each record, where the keys of the dictionary are the column names from the header row of the CSV. To get all of the rows into a single variable, we run rows = list(reader). reader is what is referred to as an iterator in Python: it yields one row at a time. Running the list function exhausts the iterator and gives us the contents of the reader as a list.
import csv
with open('myfile.csv') as f:
    reader = csv.DictReader(f)
    rows = list(reader)
for row in rows:
    print(row)
Note that since we have loaded the entire CSV into memory in the variable rows, we can now put our for loop outside of the context manager, since we no longer need access to the file, f.
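For contrast, here is a sketch of the streaming alternative: looping over the reader directly, which has to happen inside the context manager because the file must still be open. The sample data and temporary file are made up for illustration, and the newline='' argument assumes Python 3.

```python
import csv
import os
import tempfile

# Hypothetical sample CSV so the sketch runs on its own.
path = os.path.join(tempfile.mkdtemp(), 'myfile.csv')
with open(path, 'w') as f:
    f.write('name,age\nRachel,34\nMonica,34\n')

names = []
with open(path, newline='') as f:
    reader = csv.DictReader(f)
    # Rows are read one at a time, so the loop must stay inside the with block.
    for row in reader:
        names.append(row['name'])
    # The reader is now exhausted, so asking for the rows again gives nothing.
    leftover = list(reader)

print(names)
print(leftover)
```

Streaming avoids holding the whole file in memory, at the cost of only being able to pass over the rows once.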
You can also open a TSV file in the same manner by passing the delimiter argument to csv.DictReader.
import csv
with open('myfile.tsv') as f:
    reader = csv.DictReader(f, delimiter='\t')
    rows = list(reader)
for row in rows:
    print(row)
We will use csv.writer to write CSV files. csv.DictWriter is a higher-level abstraction you can also use, but the examples below use csv.writer.
import csv
with open('testwrite.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(['col1', 'col2'])
    writer.writerow(['val1', 'val2'])
    writer.writerow(['val1', 'val2'])
    writer.writerow(['val1', 'val2'])
You can read more about the csv module here: https://docs.python.org/2/library/csv.html
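Since csv.DictWriter is mentioned above as an alternative, here is a minimal sketch of how it might be used to write rows that are already dictionaries. The data, the column names, and the temporary file are assumptions for illustration; the newline='' argument assumes Python 3.

```python
import csv
import os
import tempfile

# Hypothetical output location and made-up rows.
path = os.path.join(tempfile.mkdtemp(), 'testwrite.csv')
rows = [
    {'col1': 'val1', 'col2': 'val2'},
    {'col1': 'val3', 'col2': 'val4'},
]

with open(path, 'w', newline='') as f:
    # fieldnames fixes the column order and is used for the header row.
    writer = csv.DictWriter(f, fieldnames=['col1', 'col2'])
    writer.writeheader()
    for row in rows:
        writer.writerow(row)

with open(path, newline='') as f:
    written = f.read()

print(written)
```

The main convenience is that each value is matched to its column by key, so the order of keys within each dictionary does not matter.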
Write a Python script that defines a list of dicts named vegetables like so:
vegetables = [
    {"name": "eggplant"},
    {"name": "tomato"},
    {"name": "corn"},
    ...
]
Write a Python program that:
- Loops through each vegetable
- In the loop, writes the name of each vegetable and the length of its name into a CSV
hints:
- Don't forget to first write a header row to the CSV
- To get the length of any string, use the built-in len function. For example, len('dhrumil') is 7.
This snippet reads test.json and loads the contents into the variable data. The result is a dict if the top-level JSON value is an object, or a list if it is an array.
import json
with open('test.json') as f:
    data = json.load(f)
This snippet writes the list rows to testwrite.json as JSON.
import json
rows = [
    {"name": "Rachel", "age": 34},
    {"name": "Monica", "age": 34},
    {"name": "Phoebe", "age": 37}
]
with open('testwrite.json', 'w') as f:
    json.dump(rows, f)
- Read vegetables.csv into a variable called vegetables.
- Write vegetables as a JSON file called vegetables.json. It should look like this:
[
    {"name": ..., "length": ...},
    {"name": ..., "length": ...},
]
Write a Python program that outputs a unique list of superhero powers:
- Reads superheroes.json (in this folder)
- Creates an empty array called powers
- Loops through the members of the squad, and appends the powers of each to the powers array
- Prints those powers to the terminal
hint: To get the unique elements in a list, use the built-in set type. For example, try running list(set([1, 1, 2, 3])) in your Python console. Alternatively, you can use an if statement to add a power to the list only if it is not already in there.
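The two approaches from the hint can be sketched side by side. The list of powers here is made up for illustration and is not taken from superheroes.json.

```python
# Made-up data with duplicates.
powers = ['flight', 'strength', 'flight', 'speed', 'strength']

# Approach 1: a set drops duplicates, but does not promise any particular order.
unique_with_set = list(set(powers))

# Approach 2: an if statement keeps the first occurrence of each item, in order.
unique_in_order = []
for power in powers:
    if power not in unique_in_order:
        unique_in_order.append(power)

print(sorted(unique_with_set))
print(unique_in_order)
```

The set version is shorter; the if version is the one to reach for when the original order matters.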
Let's read superheroes.json (in this folder) and output a flat CSV of members. The columns should be: name, age, secretIdentity, powers, squadName, homeTown, formed, secretBase, active. Any column that is top level, such as squadName, should just be repeated for every row.
Here is an example set of steps:
- Read superheroes.json
- Write a header to the CSV file
- Loop over the members, and for each member write a row to the CSV file
HINT: Powers will need to be transformed from a list to a string. You could use str(powers) to do this, or you could use ', '.join(['str1', 'str2', 'str3']) to make it a comma separated list.
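To see the difference between the two options in the hint, here is a short sketch using the placeholder strings from the hint itself rather than real data.

```python
powers = ['str1', 'str2', 'str3']

# str() gives the Python representation, brackets and quotes included.
as_repr = str(powers)

# join() gives just the items, separated by the string it is called on.
as_joined = ', '.join(powers)

# join() only accepts strings, so other types need converting first.
numbers_joined = ', '.join(str(n) for n in [1, 2, 3])

print(as_repr)
print(as_joined)
print(numbers_joined)
```

The joined form is usually easier to read in a spreadsheet, since it carries no Python syntax.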
We can use the datetime module to parse dates and convert them from one format to another. We will primarily be using the datetime.datetime.strptime and datetime.datetime.strftime methods. Check http://strftime.org/ for the format string codes.
import datetime
raw_date = "2017-01-11"
date_format = "%Y-%m-%d"
parsed_date = datetime.datetime.strptime(raw_date, date_format)
print(parsed_date.strftime("%x"))  # 01/11/17
- Set a variable birthday = "1-May-12".
- Parse the date using datetime.datetime.strptime.
- Use strftime to output a date that looks like "5/1/2012".
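As one more illustration of strptime and strftime, the sketch below converts a day-first date into year-month-day form. The input date is made up. Because %m and %d are zero-padded, the last line shows one possible way to get unpadded numbers by using the attributes of the parsed date directly.

```python
import datetime

# Made-up input in day/month/year order.
raw_date = "11/01/2017"
parsed = datetime.datetime.strptime(raw_date, "%d/%m/%Y")

# strftime with %m and %d always pads to two digits.
iso_date = parsed.strftime("%Y-%m-%d")

# The month, day, and year attributes are plain integers, so they print unpadded.
unpadded = "{}/{}/{}".format(parsed.month, parsed.day, parsed.year)

print(iso_date)
print(unpadded)
```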
We can use for loops and if statements to filter through data.
rows = [
    {"name": "Rachel", "age": 34},
    {"name": "Monica", "age": 34},
    {"name": "Phoebe", "age": 37}
]
# filter to age < 37
for row in rows:
    if row['age'] < 37:
        print(row)
# filter whitelist names
whitelist_names = ['Rachel', 'Phoebe']
for row in rows:
    if row['name'] in whitelist_names:
        print(row)
# blacklist names
blacklist_names = ['Rachel']
for row in rows:
    if row['name'] not in blacklist_names:
        print(row)
- Read vegetables.csv into a variable called vegetables.
- Loop through vegetables and filter down to only green vegetables using a whitelist.
- Output another CSV called green_vegetables.csv.
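The filtering loops shown earlier print matching rows as they go. When the filtered rows are needed for a later step, such as writing them to a file, one option is to collect them into a new list. The sketch below does this with list comprehensions, reusing the sample rows from the filtering example.

```python
rows = [
    {"name": "Rachel", "age": 34},
    {"name": "Monica", "age": 34},
    {"name": "Phoebe", "age": 37},
]

# Keep only the rows where age < 37.
under_37 = [row for row in rows if row['age'] < 37]

# Keep only the rows whose name is in the whitelist.
whitelist_names = ['Rachel', 'Phoebe']
whitelisted = [row for row in rows if row['name'] in whitelist_names]

print(under_37)
print(whitelisted)
```

A plain for loop that appends to an empty list would do the same job; the comprehension is just a more compact spelling.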
We can use for loops, if statements, and dicts to group data.
from pprint import pprint
cars = [
    {"model": "Yaris", "make": "Toyota"},
    {"model": "Auris", "make": "Toyota"},
    {"model": "Camry", "make": "Toyota"},
    {"model": "Prius", "make": "Toyota"},
    {"model": "Civic", "make": "Honda"},
    {"model": "Model 3", "make": "Tesla"},
]
cars_by_make = {}
for car in cars:
    make = car['make']
    if make in cars_by_make:
        cars_by_make[make].append(car)
    else:
        cars_by_make[make] = [car]
pprint(cars_by_make)
# {'Honda': [{'make': 'Honda', 'model': 'Civic'}],
#  'Tesla': [{'make': 'Tesla', 'model': 'Model 3'}],
#  'Toyota': [{'make': 'Toyota', 'model': 'Yaris'},
#             {'make': 'Toyota', 'model': 'Auris'},
#             {'make': 'Toyota', 'model': 'Camry'},
#             {'make': 'Toyota', 'model': 'Prius'}]}
number_of_cars_by_make = {}
for car in cars:
    make = car['make']
    if make in number_of_cars_by_make:
        number_of_cars_by_make[make] += 1
    else:
        number_of_cars_by_make[make] = 1
pprint(number_of_cars_by_make)
# {'Honda': 1, 'Tesla': 1, 'Toyota': 4}
- Use Excel to add a column color to vegetables.csv.
- Read vegetables.csv into a variable called vegetables.
- Group vegetables by color as a variable vegetables_by_color.
- Output vegetables_by_color into a JSON file called vegetables_by_color.json.
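The grouping and counting loops in the cars example can also be written with two helpers from the standard library's collections module. This is an alternative sketch, not the approach used in the example itself, and it uses a shortened version of the cars list.

```python
from collections import Counter, defaultdict

cars = [
    {"model": "Yaris", "make": "Toyota"},
    {"model": "Auris", "make": "Toyota"},
    {"model": "Camry", "make": "Toyota"},
    {"model": "Prius", "make": "Toyota"},
    {"model": "Civic", "make": "Honda"},
    {"model": "Model 3", "make": "Tesla"},
]

# defaultdict(list) creates an empty list the first time a key is seen,
# which replaces the if/else check.
cars_by_make = defaultdict(list)
for car in cars:
    cars_by_make[car['make']].append(car)

# Counter tallies how many times each make appears.
number_of_cars_by_make = Counter(car['make'] for car in cars)

print(dict(cars_by_make))
print(dict(number_of_cars_by_make))
```

Both objects behave like ordinary dicts, so they can be passed to json.dump in the same way.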