Binary file added Maincharacters_bargraph.png
35 changes: 33 additions & 2 deletions README.md
@@ -1,3 +1,34 @@
# TextMining
# Romeo & Juliet Text Mining
### Gracey Wilson

This is the base repo for the text mining and analysis project for Software Design at Olin College.
### Project Overview

I chose to analyze the script of Shakespeare's Romeo and Juliet. Specifically, I was curious how often characters speak when compared by gender. My goal for the project was to develop some numbers and perhaps a graphic or two comparing how often male vs. female characters speak in the play. From a software design perspective, I also wanted to practice structuring a program on my own without any guiding scaffolding, and to be able to justify the choices I made as the process moved forward. To access the text and work with it repeatedly, I downloaded the play from the Project Gutenberg website and saved a local copy using Python's pickle module.

### Implementation

As mentioned in the project overview, I downloaded the text from the Project Gutenberg website and used Python's pickle module to keep a local copy. Most functions return dictionaries keyed by the way each character is referred to in the text (their abbreviated character name), mapping that name to the character's gender, the number of times they speak, or both. The main actions of the program are counting the number of characters who speak in the script by gender, counting the number of times each of those characters speaks, and finding the average number of times characters of each gender speak.
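
The dictionary-based flow described above can be sketched in a few lines. This is a minimal illustration, not the project's actual code: the three-character cast, the toy script, and the function names `count_speeches` and `average_by_gender` are all assumptions made up for the example.

```python
from collections import Counter

# Hypothetical miniature cast and script, for illustration only.
char_gender = {'Rom.': 'Male', 'Jul.': 'Female', 'Nurse.': 'Female'}
tokens = 'Rom. But soft! Jul. Ay me! Rom. She speaks. Nurse. Madam!'.split()


def count_speeches(char_gender, tokens):
    """Map each abbreviated name to how many times it appears as a token."""
    seen = Counter(token for token in tokens if token in char_gender)
    return {name: seen[name] for name in char_gender}


def average_by_gender(char_gender, speeches):
    """Average number of speeches per character, grouped by gender."""
    totals, cast_sizes = Counter(), Counter()
    for name, gender in char_gender.items():
        totals[gender] += speeches[name]
        cast_sizes[gender] += 1
    return {gender: totals[gender] / cast_sizes[gender] for gender in cast_sizes}


speeches = count_speeches(char_gender, tokens)
averages = average_by_gender(char_gender, speeches)
print(speeches)
print(averages)
```

Keeping the name-to-gender and name-to-count mappings in separate dictionaries means either one can be swapped out for another play without touching the counting logic.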

I tried to make most functions as general as possible in the hopes that the program could be used on other text files, especially plays. For instance, I included a text file as an input argument for all the functions I could. However, there are still parts of the program that are very specific to the text I chose to work with. For instance, when I came across a few character names that consist of two words rather than one, I hard-coded the program to recognize them. If I were to use this program with another text file, I would need to consider in what format those characters are referred to and perhaps do some specific hard coding to handle any outliers in that specific case.
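
One possible way to reduce the hard-coding is to pass the set of two-word names in as an argument. The sketch below is an assumption about how that could look, not the approach the scripts in this repo take: `merge_two_word_names` and its `two_word_names` parameter are hypothetical, and unlike the current code it consumes the second word after merging.

```python
def merge_two_word_names(tokens, two_word_names):
    """Return a new token list in which each known two-word name is one token."""
    merged = []
    i = 0
    while i < len(tokens):
        pair = ' '.join(tokens[i:i + 2])  # current token plus the next one, if any
        if pair in two_word_names:
            merged.append(pair)
            i += 2  # skip the second word so it is not counted on its own
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Toy token list, for illustration only.
tokens = ['M.', 'Wife.', 'Hold', 'me', 'not.', '1.', 'Serv.', 'Away!']
result = merge_two_word_names(tokens, {'M. Wife.', '1. Serv.'})
print(result)
```

Because the names are data rather than code, a different play would only need a different set passed in.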

### Results

On the most basic level, when comparing the average number of times all male characters speak to the average number of times all female characters speak, the female characters actually speak more often than the male characters. However, it's worth noting that there are only 4 female characters, one of whom is Juliet, while there are 23 male characters, several of whom are servants with fewer than 5 lines, which likely skewed the averages.
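
To see how a handful of very small roles can pull an average down, it may help to compare the mean with the median. The counts below are invented for illustration and are not results from the play; only the standard library's `statistics` module is used.

```python
from statistics import mean, median

# Invented speech counts, NOT taken from the play: a few large roles
# plus several very small ones on the male side.
male_counts = [100, 50, 2, 1, 1]
female_counts = [80, 60, 40, 4]

male_mean, male_median = mean(male_counts), median(male_counts)
female_mean, female_median = mean(female_counts), median(female_counts)

print('male   mean', male_mean, 'median', male_median)
print('female mean', female_mean, 'median', female_median)
```

In this toy data the male mean is dragged well below the largest male roles by the small parts, which is the kind of skew described above.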

In order to get more useful but perhaps also more subjective data, we can compare characters individually based on the size of their role in the play (i.e. Romeo vs. Juliet, the patriarchal figures vs. the matriarchal figures, etc.). Below is a bar graph showing a few of these comparisons:


![Bar graph comparing how often the main characters speak](https://github.com/graceyw/TextMining/blob/master/Maincharacters_bargraph.png)


Because there are only 4 female characters, we actually run out of characters of equal standing to compare; many characters such as Benvolio (shown above), Tybalt, the Prince and many others do not have female counterparts. What's especially interesting is that when the play was written, all the characters would have likely been played by men anyway.

From this kind of data we cannot necessarily draw strong conclusions on whether male characters in the play generally speak more than female characters. However, assuming male and female characters speak on average for similar lengths of time (more about testing that assumption in the following section), we *can* say that at any given time in the play, it is substantially more likely for a male character to be speaking than a female character.
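
The "more likely to be speaking" claim amounts to comparing each gender's share of all speeches. A minimal sketch under stated assumptions: the function name `share_by_gender` and the toy counts are hypothetical, and groups tagged 'Mixed' are left out of the total.

```python
def share_by_gender(char_gender, speeches, genders=('Male', 'Female')):
    """Percentage of speeches given by each listed gender."""
    totals = {gender: 0 for gender in genders}
    for name, count in speeches.items():
        gender = char_gender[name]
        if gender in totals:  # e.g. 'Mixed' groups are skipped
            totals[gender] += count
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {gender: 0.0 for gender in genders}
    return {gender: 100 * total / grand_total for gender, total in totals.items()}


# Toy data, for illustration only.
char_gender = {'Rom.': 'Male', 'Jul.': 'Female', 'Chor.': 'Mixed'}
speeches = {'Rom.': 3, 'Jul.': 1, 'Chor.': 1}
shares = share_by_gender(char_gender, speeches)
print(shares)
```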

### Reflection

Overall, I feel I was successful in making progress on my learning goals during this project. I practiced tackling a project without any given scaffolding, managed to answer my questions using the limited skills I have in Python, and strengthened the scope of those skills along the way. In future projects I aim to practice thinking out the whole script and what each function will do before beginning to write. I believe this will help me foresee issues and design better programs before I get in too deep.

If I were to continue this project, I would be interested in tracking how many words each character says rather than just how many times they speak. Some characters might speak fewer than 20 times but often give a 20-line soliloquy, while others might only say a line or two each time. If instead, or in addition, I wanted to refine the work I did during this iteration of the project, I would try a weighting system for each character (e.g. main characters' voices carry more weight in the overall average than supporting characters') to get more relevant results than a simple average of all the times the characters speak.
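
Counting words per character could be sketched by remembering who is currently speaking and crediting every following word to them. This is only an illustration: `words_per_character` is a hypothetical name, and the sketch assumes speaker abbreviations only ever appear as speech labels, so stage directions and any names mentioned inside dialogue would be miscounted.

```python
def words_per_character(tokens, names):
    """Credit each word to the most recent speaker label seen before it."""
    counts = {name: 0 for name in names}
    current_speaker = None
    for token in tokens:
        if token in counts:
            current_speaker = token  # a new speech starts here
        elif current_speaker is not None:
            counts[current_speaker] += 1
    return counts


# Toy script, for illustration only.
tokens = 'Rom. But soft! Jul. Ay me! Rom. She speaks.'.split()
word_counts = words_per_character(tokens, ['Rom.', 'Jul.'])
print(word_counts)
```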

In conclusion, I enjoyed working on this project and am looking forward to learning more, both about 16th-century literature and about software design!
16 changes: 16 additions & 0 deletions miningthetext.py
@@ -0,0 +1,16 @@
'''Gracey Wilson
Software Design Spring 2017
Mini Project 3: Text Mining

A script that grabs the text from ProjectGutenberg online and pickles it'''

import pickle

import requests

romeo_juliet_full_text = requests.get('http://www.gutenberg.org/cache/epub/1112/pg1112.txt').text
print(romeo_juliet_full_text)

# Save the text locally so the analysis scripts can reload it without downloading again
with open('romeo_juliet_full_text.pickle', 'wb') as f:
    pickle.dump(romeo_juliet_full_text, f)
129 changes: 129 additions & 0 deletions percentByGender.py
@@ -0,0 +1,129 @@
'''Mini Project 5: Revisiting Text Mining
Software Design Spring 2017
Gracey Wilson

This script parses the text of Romeo and Juliet, counts how often each character speaks,
and returns the percentage of lines spoken by male vs female characters.'''

import pickle

# Reload the pickled text and close the file as soon as it has been read
with open('romeo_juliet_full_text.pickle', 'rb') as input_file:
    reloaded_copy_of_texts = pickle.load(input_file)
list_of_words = reloaded_copy_of_texts.split() # Make the words in the textfile a list of strings

char_dict = {'Chor.' : 'Mixed',
'Samp.' : 'Male',
'Greg.' : 'Male',
'Abr.' : 'Male',
'Bal.' : 'Male',
'Ben.' : 'Male',
'Tyb.' : 'Male',
'Officer.' : 'Male',
'Citizens.' : 'Mixed',
'Cap.' : 'Male',
'Wife.' : 'Female',
'Cap. Wife.' : 'Female', # only 2 lines. 'Wife.' and 'Cap. Wife' are both Mrs. Cap
'Mon.' : 'Male',
'M. Wife.' : 'Female',
'Prince.' : 'Male',
'Rom.' : 'Male',
'Par.' : 'Male',
'Serv.' : 'Male',
'Nurse.' : 'Female',
'Jul.' : 'Female',
'Mer.' : 'Male',
'Friar.' : 'Male',
'Laur.' : 'Male',
'John.' : 'Male',
'Peter.' : 'Male',
'Apoth.' : 'Male',
'1. Serv.' : 'Male',
'2. Serv.' : 'Male',
'3. Serv.' : 'Male',
'2. Cap.' : 'Male',}


def count_character_gender(dict_name):
    '''Counts how many male and female characters there are in the play.'''
    number_of_male_characters = 0
    number_of_female_characters = 0
    for value in dict_name.values():
        if value == 'Male':
            number_of_male_characters = number_of_male_characters + 1
        elif value == 'Female':
            number_of_female_characters = number_of_female_characters + 1
    answer = 'There are ' + str(number_of_male_characters) + ' speaking male characters and ' + str(number_of_female_characters) + ' speaking female characters.'
    print(answer)


def handle_2word_chars(list_of_words):
    '''Deals with 2-word character abbreviations in the script,
    e.g. 'M.' becomes 'M. Wife.' like it appears in the dictionary.
    NOTE: Does not account for 'Cap. Wife.' (because 'Cap.' is a
    different character, so combining any string that starts with
    'Cap.' with the string that comes after it would not increase
    overall accuracy). The second word stays in the list after the
    merge, so it is still counted on its own as well.'''
    for i in range(len(list_of_words) - 1): # stop one short so that i+1 is always a valid index
        if list_of_words[i] in ['M.', '1.', '2.', '3.']: # first half of a 2-word name
            list_of_words[i] = list_of_words[i] + ' ' + list_of_words[i+1]
    return list_of_words


def parse(abbr_character_name,text_file_name):
    '''Takes abbreviated character name as input (ie "Samp." for Sampson)
    Counts and returns # of times each abbreviated name appears (and
    therefore the number of times the character speaks in the text).
    NOTE: 'Cap. Wife.' and 'Wife.' are the same person but this script
    is not currently aware of that fact (which is why it thinks there
    are 5 female characters when in reality there are 4. This also
    causes discrepancies in the final calculation of the averages.)'''
    number_of_mentions = 0
    for word in text_file_name:
        if word == abbr_character_name:
            number_of_mentions = number_of_mentions + 1
    return number_of_mentions


def parse_text_for_mentions(char_dict, text_file_name):
    '''Counts number of times each key of char_dict appears in text_file_name.
    Takes a dictionary that maps character name to gender and a text file (in the form of a list of strings)
    Returns a dictionary that maps character names to number of times they speak.'''
    mentions = {}
    for key in char_dict.keys():
        number_of_mentions = parse(key, text_file_name)
        mentions[key] = number_of_mentions
    return mentions


def percent_breakdown(name_number,char_dict):
    '''Calculates the percentage of lines spoken by male and female characters.
    Takes 2 dictionaries as input arguments:
    1. name_number: maps character names to how many times they speak
    2. char_dict: maps character names to gender
    Creates a dictionary mapping genders to lines spoken.
    Prints and returns the percentage of lines spoken by each gender.'''

    gender_lines = {'Male': 0, 'Female': 0}
    for name, times_speaking in name_number.items():
        gender = char_dict[name]
        if gender in gender_lines: # 'Mixed' groups are left out of the comparison
            gender_lines[gender] += times_speaking

    male_lines = gender_lines['Male']
    female_lines = gender_lines['Female']
    total_lines = male_lines + female_lines
    male_percent = (male_lines / total_lines) * 100
    female_percent = (female_lines / total_lines) * 100
    print('Male characters say '+str(male_percent)+'% of the lines. Female characters say '+str(female_percent)+'% of the lines.')
    return male_percent, female_percent

if __name__ == '__main__':
    count_character_gender(char_dict)
    modified_text = handle_2word_chars(list_of_words)
    name_number = parse_text_for_mentions(char_dict,modified_text)
    percent_breakdown(name_number,char_dict)
191 changes: 191 additions & 0 deletions processingthetext.py
@@ -0,0 +1,191 @@
'''Mini Project 3: Text Mining
Software Design Spring 2017
Gracey Wilson

This script parses the text of Romeo and Juliet, counts how often each character
speaks, and returns the average number of times male and female characters speak.'''

import pickle

# Reload the pickled text and close the file as soon as it has been read
with open('romeo_juliet_full_text.pickle', 'rb') as input_file:
    reloaded_copy_of_texts = pickle.load(input_file)
list_of_words = reloaded_copy_of_texts.split() # Make the words in the textfile a list of strings

char_dict = {'Chor.' : 'Mixed',
'Samp.' : 'Male',
'Greg.' : 'Male',
'Abr.' : 'Male',
'Bal.' : 'Male',
'Ben.' : 'Male',
'Tyb.' : 'Male',
'Officer.' : 'Male',
'Citizens.' : 'Mixed',
'Cap.' : 'Male',
'Wife.' : 'Female',
'Cap. Wife.' : 'Female', # only 2 lines. 'Wife.' and 'Cap. Wife' are both Mrs. Cap
'Mon.' : 'Male',
'M. Wife.' : 'Female',
'Prince.' : 'Male',
'Rom.' : 'Male',
'Par.' : 'Male',
'Serv.' : 'Male',
'Nurse.' : 'Female',
'Jul.' : 'Female',
'Mer.' : 'Male',
'Friar.' : 'Male',
'Laur.' : 'Male',
'John.' : 'Male',
'Peter.' : 'Male',
'Apoth.' : 'Male',
'1. Serv.' : 'Male',
'2. Serv.' : 'Male',
'3. Serv.' : 'Male',
'2. Cap.' : 'Male',}


def count_character_gender(dict_name):
    '''Counts how many male and female characters there are in the play.'''
    number_of_male_characters = 0
    number_of_female_characters = 0
    for value in dict_name.values():
        if value == 'Male':
            number_of_male_characters = number_of_male_characters + 1
        elif value == 'Female':
            number_of_female_characters = number_of_female_characters + 1
    answer = 'There are ' + str(number_of_male_characters) + ' speaking male characters and ' + str(number_of_female_characters) + ' speaking female characters.'
    print(answer)

count_character_gender(char_dict)


def handle_2word_chars(list_of_words):
    '''Deals with 2-word character abbreviations in the script,
    e.g. 'M.' becomes 'M. Wife.' like it appears in the dictionary.
    NOTE: Does not account for 'Cap. Wife.' (because 'Cap.' is a
    different character, so combining any string that starts with
    'Cap.' with the string that comes after it would not increase
    overall accuracy). The second word stays in the list after the
    merge, so it is still counted on its own as well.'''
    for i in range(len(list_of_words) - 1): # stop one short so that i+1 is always a valid index
        if list_of_words[i] in ['M.', '1.', '2.', '3.']: # first half of a 2-word name
            list_of_words[i] = list_of_words[i] + ' ' + list_of_words[i+1]
    return list_of_words


def parse(abbr_character_name,text_file_name):
    '''Takes abbreviated character name as input (ie "Samp." for Sampson)
    Counts and returns # of times each abbreviated name appears (and
    therefore the number of times the character speaks in the text).
    NOTE: 'Cap. Wife.' and 'Wife.' are the same person but this script
    is not currently aware of that fact (which is why it thinks there
    are 5 female characters when in reality there are 4. This also
    causes discrepancies in the final calculation of the averages.)'''
    number_of_mentions = 0
    for word in text_file_name:
        if word == abbr_character_name:
            number_of_mentions = number_of_mentions + 1
    return number_of_mentions


modified_text = handle_2word_chars(list_of_words)
# print(modified_text) # unit test; returns text with 2-word names in single strings

# parse('1. Serv.',modified_text) # unit test; returns number of mentions for given character name


def parse_text_for_mentions(char_dict, text_file_name):
    '''Counts number of times each key of char_dict appears in text_file_name.
    Takes a dictionary that maps character name to gender and a text file (in the form of a list of strings)
    Returns a dictionary that maps character names to number of times they speak.'''
    mentions = {}
    for key in char_dict.keys():
        number_of_mentions = parse(key, text_file_name)
        mentions[key] = number_of_mentions
    return mentions

# Count the mentions once and keep the result
name_number = parse_text_for_mentions(char_dict,modified_text)


# print(name_number) # unit test; returns dictionary with character names as keys and # of times mentioned as value


def average_times_speaking(name_number,char_dict):
    '''Calculates average number of times male and female characters speak.
    Takes 2 dictionaries as input arguments:
    1. name_number: maps character names to how many times they speak
    2. char_dict: maps character names to gender
    Returns average number of times characters of each gender speak. '''
    for key,value in name_number.items():
        name_number[key] = (char_dict[key],value)
    print(name_number)
    mention_count = { 'Male': (0, 0),
                      'Female': (0, 0),
                      'Mixed': (0, 0) }
    for name, (gender, mentions) in name_number.items():
        chars, total_mentions = mention_count[gender]
        mention_count[gender] = (chars + 1, total_mentions + mentions)

    return {gender: total / chars for gender, (chars, total) in mention_count.items()}

print(average_times_speaking(name_number,char_dict))

# '''Creates a dictionary that corresponds full
# character names to abbreviated character names.'''
# char_dict = {'Chor.' : 'Chorus',
# 'Samp.' : 'Sampson',
# 'Greg.' : 'Gregory',
# 'Abr.' : 'Abram',
# 'Bal.' : 'Balthasar',
# 'Ben.' : 'Benvolio',
# 'Tyb.' : 'Tybalt',
# 'Officer.' : 'Officer',
# 'Citizens.' : 'Citizens',
# 'Cap.' : 'Mr. Capulet',
# 'Wife.' : 'Mrs. Capulet',
# 'Cap. Wife.' : 'Old Lady Capulet',
# 'Mon.' : 'Mr. Montague',
# 'M. Wife.' : 'Mrs. Montague',
# 'Prince.' : 'Prince Escalus',
# 'Rom.' : 'Romeo',
# 'Par.' : 'Count Paris',
# 'Serv.' : 'Servant - the Clown',
# 'Nurse.' : 'Nurse',
# 'Jul.' : 'Juliet',
# 'Mer.' : 'Mercutio',
# '1.' : '1st Servingman',
# '2.' : '2nd Servingman',
# '3.' : '3rd Servingman',
# '2. Cap.' : '2nd Capulet man',
# 'Friar.' : 'Friar Laurence',
# 'Laur.' : 'Friar Laurence',
# 'John.' : 'Friar John',
# 'Peter.' : 'Peter, servant to the Nurse',
# 'Apoth.' : 'Apothecary'}

# '''Characters:
# Chorus.
# Escalus, Prince of Verona.
# Paris, a young Count, kinsman to the Prince.
# Montague, heads of two houses at variance with each other.
# Capulet, heads of two houses at variance with each other.
# An old Man, of the Capulet family.
# Romeo, son to Montague.
# Tybalt, nephew to Lady Capulet.
# Mercutio, kinsman to the Prince and friend to Romeo.
# Benvolio, nephew to Montague, and friend to Romeo
# Friar Laurence, Franciscan.
# Friar John, Franciscan.
# Balthasar, servant to Romeo.
# Abram, servant to Montague.
# Sampson, servant to Capulet.
# Gregory, servant to Capulet.
# Peter, servant to Juliet's nurse.
# An Apothecary.
# Three Musicians.
# An Officer.
# Lady Montague, wife to Montague.
# Lady Capulet, wife to Capulet.
# Juliet, daughter to Capulet.
# Nurse to Juliet.'''
Binary file added romeo_juliet_full_text.pickle
Binary file not shown.