diff --git a/.gitignore b/.gitignore deleted file mode 100644 index a47adf9..0000000 --- a/.gitignore +++ /dev/null @@ -1,131 +0,0 @@ - -# Created by https://www.gitignore.io/api/macos,pycharm,jupyternotebook,visualstudiocode -# Edit at https://www.gitignore.io/?templates=macos,pycharm,jupyternotebook,visualstudiocode - -### JupyterNotebook ### -.ipynb_checkpoints -*/.ipynb_checkpoints/* - -# Remove previous ipynb_checkpoints -# git rm -r .ipynb_checkpoints/ -# - -### macOS ### -# General -.DS_Store -.AppleDouble -.LSOverride - -# Icon must end with two \r -Icon - -# Thumbnails -._* - -# Files that might appear in the root of a volume -.DocumentRevisions-V100 -.fseventsd -.Spotlight-V100 -.TemporaryItems -.Trashes -.VolumeIcon.icns -.com.apple.timemachine.donotpresent - -# Directories potentially created on remote AFP share -.AppleDB -.AppleDesktop -Network Trash Folder -Temporary Items -.apdisk - -### PyCharm ### -# Covers JetBrains IDEs: IntelliJ, RubyMine, PhpStorm, AppCode, PyCharm, CLion, Android Studio and WebStorm -# Reference: https://intellij-support.jetbrains.com/hc/en-us/articles/206544839 - -# User-specific stuff -.idea/**/workspace.xml -.idea/**/tasks.xml -.idea/**/usage.statistics.xml -.idea/**/dictionaries -.idea/**/shelf - -# Generated files -.idea/**/contentModel.xml - -# Sensitive or high-churn files -.idea/**/dataSources/ -.idea/**/dataSources.ids -.idea/**/dataSources.local.xml -.idea/**/sqlDataSources.xml -.idea/**/dynamic.xml -.idea/**/uiDesigner.xml -.idea/**/dbnavigator.xml - -# Gradle -.idea/**/gradle.xml -.idea/**/libraries - -# Gradle and Maven with auto-import -# When using Gradle or Maven with auto-import, you should exclude module files, -# since they will be recreated, and may cause churn. Uncomment if using -# auto-import. -# .idea/modules.xml -# .idea/*.iml -# .idea/modules - -# CMake -cmake-build-*/ - -# Mongo Explorer plugin -.idea/**/mongoSettings.xml - -# File-based project format -*.iws - -# IntelliJ -out/ - -# mpeltonen/sbt-idea plugin -.idea_modules/ - -# JIRA plugin -atlassian-ide-plugin.xml - -# Cursive Clojure plugin -.idea/replstate.xml - -# Crashlytics plugin (for Android Studio and IntelliJ) -com_crashlytics_export_strings.xml -crashlytics.properties -crashlytics-build.properties -fabric.properties - -# Editor-based Rest Client -.idea/httpRequests - -# Android studio 3.1+ serialized cache file -.idea/caches/build_file_checksums.ser - -### PyCharm Patch ### -# Comment Reason: https://github.com/joeblau/gitignore.io/issues/186#issuecomment-215987721 - -# *.iml -# modules.xml -# .idea/misc.xml -# *.ipr - -# Sonarlint plugin -.idea/sonarlint - -### VisualStudioCode ### -.vscode/* -!.vscode/settings.json -!.vscode/tasks.json -!.vscode/launch.json -!.vscode/extensions.json - -### VisualStudioCode Patch ### -# Ignore all local history of files -.history - -# End of https://www.gitignore.io/api/macos,pycharm,jupyternotebook,visualstudiocode \ No newline at end of file diff --git a/.ipynb_checkpoints/main-checkpoint.ipynb b/.ipynb_checkpoints/main-checkpoint.ipynb new file mode 100644 index 0000000..e5b0ecc --- /dev/null +++ b/.ipynb_checkpoints/main-checkpoint.ipynb @@ -0,0 +1,557 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "M0HGrNOzyqt8" + }, + "source": [ + "# Bag of Words Lab\n", + "\n", + "## Introduction\n", + "\n", + "**Bag of words (BoW)** is an important technique in text mining and [information retrieval](https://en.wikipedia.org/wiki/Information_retrieval). 
It turns the content of text into vectors of numbers, which makes it possible to use mathematics and computer programs to analyze and compare documents.\n", + "\n", + "A BoW contains the following information:\n", + "\n", + "1. A dictionary of all the terms (words) in the text documents. The terms are normalized for letter case (e.g. `Ironhack` => `ironhack`), tense (e.g. `had` => `have`), singular form (e.g. `students` => `student`), etc.\n", + "1. The number of occurrences of each normalized term in each document.\n", + "\n", + "For example, assume we have three text documents:\n", + "\n", + "DOC 1: **Ironhack is cool.**\n", + "\n", + "DOC 2: **I love Ironhack.**\n", + "\n", + "DOC 3: **I am a student at Ironhack.**\n", + "\n", + "The BoW of the above documents looks like this:\n", + "\n", + "| TERM | DOC 1 | DOC 2 | DOC 3 |\n", + "|---|---|---|---|\n", + "| a | 0 | 0 | 1 |\n", + "| am | 0 | 0 | 1 |\n", + "| at | 0 | 0 | 1 |\n", + "| cool | 1 | 0 | 0 |\n", + "| i | 0 | 1 | 1 |\n", + "| ironhack | 1 | 1 | 1 |\n", + "| is | 1 | 0 | 0 |\n", + "| love | 0 | 1 | 0 |\n", + "| student | 0 | 0 | 1 |\n", + "\n", + "\n", + "The vector of each document in a BoW can be high-dimensional, since there can be as many terms as there are words in the language. Data scientists use these vectors to represent the content of the documents. For instance, DOC 1 is represented with `[0, 0, 0, 1, 0, 1, 1, 0, 0]`, DOC 2 is represented with `[0, 0, 0, 0, 1, 1, 0, 1, 0]`, and DOC 3 is represented with `[1, 1, 1, 0, 1, 1, 0, 0, 1]`. Two documents are considered similar if their vector representations are similar.\n", + "\n", + "In practice there are many additional techniques to improve text mining accuracy, such as removing [stop words](https://en.wikipedia.org/wiki/Stop_words) (i.e. common words such as `a`, `I`, and `to` that don't contribute much meaning), using synonym lists (e.g. treating `New York City` the same as `NYC` and `Big Apple`), and removing HTML tags if the data sources are webpages. In Module 3 you will learn how to use those advanced techniques for [natural language processing](https://en.wikipedia.org/wiki/Natural_language_processing), a component of text mining.\n", + "\n", + "In real text mining projects, data analysts use packages such as Scikit-Learn and NLTK, which you will learn in Module 3, to extract a BoW from texts. In this exercise, however, we would like you to create the BoW manually with Python, because doing so helps you better understand the concept and practice the Python skills you have learned so far." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sZlv0ZHlyqt-" + }, + "source": [ + "## The Challenge\n", + "\n", + "We need to create a BoW from a list of documents. The documents (`doc1.txt`, `doc2.txt`, and `doc3.txt`) can be found in the same directory as this notebook. You will read the content of each document into an array of strings named `corpus`.\n", + "\n", + "*What is a corpus (plural: corpora)? See [Text corpus](https://en.wikipedia.org/wiki/Text_corpus) for a reference.*\n", + "\n", + "Your challenge is to use Python to generate the BoW of these documents. 
Your BoW should look like this:\n", + "\n", + "```python\n", + "bag_of_words = ['a', 'am', 'at', 'cool', 'i', 'ironhack', 'is', 'love', 'student']\n", + "\n", + "term_freq = [\n", + " [0, 0, 0, 1, 0, 1, 1, 0, 0],\n", + " [0, 0, 0, 0, 1, 1, 0, 1, 0],\n", + " [1, 1, 1, 0, 1, 1, 0, 0, 1],\n", + "]\n", + "```\n", + "\n", + "The code below reads the content of a text file:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "TlpxS-_e_zmH" + }, + "outputs": [], + "source": [ + "with open('C:\\\\...doc1.txt', 'r') as file:\n", + " data_in_file = file.read()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "WlGsnTNu_0XG" + }, + "outputs": [], + "source": [ + "data_in_file" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "pDFTIJVz_4Vp" + }, + "source": [ + "But naturally, if we have many files, we don't want to open and read each one explicitly, one by one. Let's define the `docs` array that contains the paths of `doc1.txt`, `doc2.txt`, and `doc3.txt`." + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": { + "id": "C8K6MQQayqt_" + }, + "outputs": [], + "source": [ + "docs = ['doc1.txt', 'doc2.txt', 'doc3.txt']" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "F_I0Fcqayqt_" + }, + "source": [ + "Define an empty array named `corpus` that will contain the content strings of the docs. Loop `docs` and read the content of each doc (see cell above) into the `corpus` array." + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": { + "id": "Uk6N-vogyquA" + }, + "outputs": [], + "source": [ + "# Write your code here\n", + "corpus = []\n", + "for doc in docs:\n", + " try:\n", + " # Attempt to open and read each document\n", + " with open(doc, 'r') as file:\n", + " content = file.read()\n", + " \n", + " # Append the content to the corpus list\n", + " corpus.append(content)\n", + " except FileNotFoundError:\n", + " # Handle the specific exception if the file is not found\n", + " print(f\"Warning: {doc} was not found.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qPn2JMW_yquA" + }, + "source": [ + "Print `corpus`." + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": { + "id": "Gg31CafSyquA" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "['Ironhack is cool.', 'I love Ironhack.', 'I am a student at Ironhack.']\n" + ] + } + ], + "source": [ + "print(corpus)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bkzzdrIsyquA" + }, + "source": [ + "You expected to see:\n", + "\n", + "```['ironhack is cool', 'i love ironhack', 'i am a student at ironhack']```\n", + "\n", + "But you actually saw:\n", + "\n", + "```['Ironhack is cool.', 'I love Ironhack.', 'I am a student at Ironhack.']```\n", + "\n", + "This is because you haven't done two important steps:\n", + "\n", + "1. Remove punctuation from the strings\n", + "\n", + "1. Convert strings to lowercase\n", + "\n", + "Write your code below to process `corpus` (convert to lower case and remove special characters)." 
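, + "\n", + "*Hint: there is more than one way to do this. The cell below uses `str.translate`; the standard `re` module also works. A minimal `re` sketch (the name `text` is just an illustrative placeholder):*\n", + "\n", + "```python\n", + "import re\n", + "\n", + "text = 'Ironhack is cool.'\n", + "# Strip everything that is not a word character or whitespace, then lowercase\n", + "print(re.sub(r'[^\\w\\s]', '', text).lower())  # ironhack is cool\n", + "```"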
+ ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": { + "id": "hr19FpCRyquA" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "['ironhack is cool', 'i love ironhack', 'i am a student at ironhack']\n" + ] + } + ], + "source": [ + "# Write your code here\n", + "import string\n", + "\n", + "# Original list of strings (re-created here so this cell can run on its own)\n", + "corpus = [\"Ironhack is cool.\", \"I love Ironhack.\", \"I am a student at Ironhack.\"]\n", + "\n", + "# Loop over each element by index to modify in place\n", + "for i in range(len(corpus)):\n", + " # Convert to lowercase and remove punctuation\n", + " corpus[i] = corpus[i].translate(str.maketrans('', '', string.punctuation)).lower()\n", + "\n", + "# The corpus list is now updated in place\n", + "print(corpus)\n", + "# Output: ['ironhack is cool', 'i love ironhack', 'i am a student at ironhack']" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "te53ZNQ5yquA" + }, + "source": [ + "Now define `bag_of_words` as an empty array. It will be used to store the unique terms in `corpus`." + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": { + "id": "VRpMaq7HyquB" + }, + "outputs": [], + "source": [ + "bag_of_words = []" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wSjETDxByquB" + }, + "source": [ + "Loop through `corpus`. In each loop, do the following:\n", + "\n", + "1. Break the string into an array of terms.\n", + "1. Create a sub-loop to iterate the terms array.\n", + " * In each sub-loop, you'll check if the current term is already contained in `bag_of_words`. If not in `bag_of_words`, append it to the array." + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": { + "id": "hH55NCTjyquB" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "['ironhack', 'is', 'cool', 'i', 'love', 'am', 'a', 'student', 'at']\n" + ] + } + ], + "source": [ + "\n", + "# Main loop over each document in corpus\n", + "# (avoid naming the loop variable 'string', which would shadow the imported string module)\n", + "for text in corpus:\n", + " # Split the string into a list of terms\n", + " terms = text.split()\n", + " \n", + " # Sub-loop over each term in the list\n", + " for term in terms:\n", + " # Check if the term is already in bag_of_words\n", + " if term not in bag_of_words:\n", + " # Append the term to bag_of_words if it's not already present\n", + " bag_of_words.append(term)\n", + "\n", + "print(bag_of_words)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ucETg_76yquB" + }, + "source": [ + "Print `bag_of_words`. You should see:\n", + "\n", + "```['ironhack', 'is', 'cool', 'i', 'love', 'am', 'a', 'student', 'at']```\n", + "\n", + "If not, fix your code in the previous cell." + ] + }, + { + "cell_type": "code", + "execution_count": 38, + "metadata": { + "id": "RDNezDxvyquB" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "['ironhack', 'is', 'cool', 'i', 'love', 'am', 'a', 'student', 'at']\n" + ] + } + ], + "source": [ + "print(bag_of_words)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "nZxZ9oCkyquB" + }, + "source": [ + "Now we define an empty array called `term_freq`. Loop `corpus` for a second time. In each loop, create a sub-loop to iterate the terms in `bag_of_words`. Count how many times each term appears in each doc of `corpus`. Append the term-frequency array to `term_freq`." 
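, + "\n", + "*Hint: Python's built-in `list.count` tallies how many times a value occurs in a list; a minimal illustration (the `terms` list here is just an example):*\n", + "\n", + "```python\n", + "terms = ['i', 'love', 'ironhack']\n", + "print(terms.count('ironhack'))  # 1\n", + "```"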
+ ] + }, + { + "cell_type": "code", + "execution_count": 41, + "metadata": { + "id": "S-q_Xw-7yquC" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[[1, 1, 1, 0, 0, 0, 0, 0, 0], [1, 0, 0, 1, 1, 0, 0, 0, 0], [1, 0, 0, 1, 0, 1, 1, 1, 1]]\n" + ] + } + ], + "source": [ + "# Write your code here\n", + "term_freq = []\n", + "\n", + "# Outer loop to process each document in the corpus\n", + "for document in corpus:\n", + " # Split the document into terms\n", + " terms = document.split()\n", + " \n", + " # Initialize a list to store the frequency of each word in bag_of_words for this document\n", + " doc_term_freq = []\n", + " \n", + " # Inner loop to process each word in bag_of_words\n", + " for word in bag_of_words:\n", + " # Count occurrences of the word in the current document\n", + " count = terms.count(word)\n", + " # Append this count to doc_term_freq\n", + " doc_term_freq.append(count)\n", + " \n", + " # Append the term frequency list for this document to term_freq\n", + " term_freq.append(doc_term_freq)\n", + "\n", + "# Display the term frequency for each document\n", + "print(term_freq)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "C5rTgoo7yquC" + }, + "source": [ + "Print `term_freq`. You should see:\n", + "\n", + "```[[1, 1, 1, 0, 0, 0, 0, 0, 0], [1, 0, 0, 1, 1, 0, 0, 0, 0], [1, 0, 0, 1, 0, 1, 1, 1, 1]]```" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "35ESP-61yquC" + }, + "source": [ + "**If your output is correct, congratulations! You've solved the challenge!**\n", + "\n", + "If not, go back and check for errors in your code." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "eNahUeB4yquC" + }, + "source": [ + "## Bonus Question\n", + "\n", + "Now you want to improve your previous solution by removing the stop words from the corpus. The idea is that you only want to add terms that are not in the `stop_words` list to the `bag_of_words` array.\n", + "\n", + "Requirements:\n", + "\n", + "1. Move all your previous code from `main.ipynb` to the cell below.\n", + "1. 
Improve your solution by ignoring stop words in `bag_of_words`.\n", + "\n", + "After you're done, your `bag_of_words` should be:\n", + "\n", + "```['ironhack', 'cool', 'love', 'student']```\n", + "\n", + "And your `term_freq` should be:\n", + "\n", + "```[[1, 1, 0, 0], [1, 0, 1, 0], [1, 0, 0, 1]]```" + ] + }, + { + "cell_type": "code", + "execution_count": 49, + "metadata": { + "id": "XDroiBGYyquC" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "['ironhack', 'cool', 'love', 'student']" + ] + }, + "execution_count": 49, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "stop_words = ['all', 'six', 'less', 'being', 'indeed', 'over', 'move', 'anyway', 'fifty', 'four', 'not', 'own', 'through', 'yourselves', 'go', 'where', 'mill', 'only', 'find', 'before', 'one', 'whose', 'system', 'how', 'somewhere', 'with', 'thick', 'show', 'had', 'enough', 'should', 'to', 'must', 'whom', 'seeming', 'under', 'ours', 'has', 'might', 'thereafter', 'latterly', 'do', 'them', 'his', 'around', 'than', 'get', 'very', 'de', 'none', 'cannot', 'every', 'whether', 'they', 'front', 'during', 'thus', 'now', 'him', 'nor', 'name', 'several', 'hereafter', 'always', 'who', 'cry', 'whither', 'this', 'someone', 'either', 'each', 'become', 'thereupon', 'sometime', 'side', 'two', 'therein', 'twelve', 'because', 'often', 'ten', 'our', 'eg', 'some', 'back', 'up', 'namely', 'towards', 'are', 'further', 'beyond', 'ourselves', 'yet', 'out', 'even', 'will', 'what', 'still', 'for', 'bottom', 'mine', 'since', 'please', 'forty', 'per', 'its', 'everything', 'behind', 'un', 'above', 'between', 'it', 'neither', 'seemed', 'ever', 'across', 'she', 'somehow', 'be', 'we', 'full', 'never', 'sixty', 'however', 'here', 'otherwise', 'were', 'whereupon', 'nowhere', 'although', 'found', 'alone', 're', 'along', 'fifteen', 'by', 'both', 'about', 'last', 'would', 'anything', 'via', 'many', 'could', 'thence', 'put', 'against', 'keep', 'etc', 'amount', 'became', 'ltd', 'hence', 'onto', 'or', 'con', 'among', 'already', 'co', 'afterwards', 'formerly', 'within', 'seems', 'into', 'others', 'while', 'whatever', 'except', 'down', 'hers', 'everyone', 'done', 'least', 'another', 'whoever', 'moreover', 'couldnt', 'throughout', 'anyhow', 'yourself', 'three', 'from', 'her', 'few', 'together', 'top', 'there', 'due', 'been', 'next', 'anyone', 'eleven', 'much', 'call', 'therefore', 'interest', 'then', 'thru', 'themselves', 'hundred', 'was', 'sincere', 'empty', 'more', 'himself', 'elsewhere', 'mostly', 'on', 'fire', 'am', 'becoming', 'hereby', 'amongst', 'else', 'part', 'everywhere', 'too', 'herself', 'former', 'those', 'he', 'me', 'myself', 'made', 'twenty', 'these', 'bill', 'cant', 'us', 'until', 'besides', 'nevertheless', 'below', 'anywhere', 'nine', 'can', 'of', 'your', 'toward', 'my', 'something', 'and', 'whereafter', 'whenever', 'give', 'almost', 'wherever', 'is', 'describe', 'beforehand', 'herein', 'an', 'as', 'itself', 'at', 'have', 'in', 'seem', 'whence', 'ie', 'any', 'fill', 'again', 'hasnt', 'inc', 'thereby', 'thin', 'no', 'perhaps', 'latter', 'meanwhile', 'when', 'detail', 'same', 'wherein', 'beside', 'also', 'that', 'other', 'take', 'which', 'becomes', 'you', 'if', 'nobody', 'see', 'though', 'may', 'after', 'upon', 'most', 'hereupon', 'eight', 'but', 'serious', 'nothing', 'such', 'why', 'a', 'off', 'whereby', 'third', 'i', 'whole', 'noone', 'sometimes', 'well', 'amoungst', 'yours', 'their', 'rather', 'without', 'so', 'five', 'the', 'first', 'whereas', 'once']\n", + "\n", + "# Write your code below\n", + "for word in 
list(bag_of_words):  # iterate over a copy so it is safe to remove items from the original\n", + " if word in stop_words:\n", + " # Removing items from a list while looping over that same list would skip elements\n", + " bag_of_words.remove(word)\n", + "\n", + "bag_of_words\n" + ] + }, + { + "cell_type": "code", + "execution_count": 50, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[[1, 1, 0, 0], [1, 0, 1, 0], [1, 0, 0, 1]]\n" + ] + } + ], + "source": [ + "term_freq = []\n", + "\n", + "# Outer loop to process each document in the corpus\n", + "for document in corpus:\n", + " # Split the document into terms\n", + " terms = document.split()\n", + " \n", + " # Initialize a list to store the frequency of each word in bag_of_words for this document\n", + " doc_term_freq = []\n", + " \n", + " # Inner loop to process each word in bag_of_words\n", + " for word in bag_of_words:\n", + " # Count occurrences of the word in the current document\n", + " count = terms.count(word)\n", + " # Append this count to doc_term_freq\n", + " doc_term_freq.append(count)\n", + " \n", + " # Append the term frequency list for this document to term_freq\n", + " term_freq.append(doc_term_freq)\n", + "\n", + "# Display the term frequency for each document\n", + "print(term_freq)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2D0dq58ryquC" + }, + "source": [ + "## Additional Challenge for the Nerds\n", + "\n", + "In Module 3 we will learn Scikit-Learn, which has the BoW feature built in. Try to use Scikit-Learn to generate the BoW for this challenge and check whether the output is the same as yours. You will need to do some googling to find out how to use Scikit-Learn to generate a BoW.\n", + "\n", + "**Notes:**\n", + "\n", + "* To install Scikit-Learn, use `pip install scikit-learn`.\n", + "\n", + "* Scikit-Learn's `CountVectorizer` only removes stop words if you ask it to (e.g. by passing `stop_words='english'`). Its default tokenizer does, however, ignore single-character tokens, which is why `a` and `i` are missing from the output below.\n", + "\n", + "* Scikit-Learn's output has a slightly different format from the output example demonstrated above. That's OK; you don't need to convert the Scikit-Learn output.\n", + "\n", + "The Scikit-Learn output will look like this:\n", + "\n", + "```python\n", + "# BoW:\n", + "{u'love': 5, u'ironhack': 3, u'student': 6, u'is': 4, u'cool': 2, u'am': 0, u'at': 1}\n", + "\n", + "# term_freq:\n", + "[[0 0 1 1 1 0 0]\n", + " [0 0 0 1 0 1 0]\n", + " [1 1 0 1 0 0 1]]\n", + " ```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.14.2" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/README.md b/README.md deleted file mode 100644 index 9a58063..0000000 --- a/README.md +++ /dev/null @@ -1,46 +0,0 @@ -Ironhack Logo - -# Lab | Intro to Python with Bag of Words - -## Introduction - -In this lab, we will practice a few fundamental Python concepts. We will use as motivation the creation of something called a Bag of Words (BoW) model. BoW is an essential technique in Natural Language Processing which we will cover in Module 3. For the time being, **you don't need to fully understand how a Bag of Words is used or even what it is in detail**. The exercise serves for you to train for loops and lists and strings. 
The Natural Language Processing stuff is just the backdrop, you'll have plenty of time to delve into that. - - -### Getting Started - -In your Terminal, navigate into the directory `your-code` of this lab that contains `main.ipynb`, `doc1.txt`, `doc2.txt`, and `doc3.txt`. Start Jupyter Notebook by executing `jupyter notebook`. A webpage should automatically open for you but in case not, go to [http://localhost:8888](http://localhost:8888). Then click the link to each ipynb file to complete the challenges. - -## Deliverables - -`main.ipynb` with your responses. - -## Submission - -Upon completion, add your deliverables to git. Then git commit, push your branch to the remote and make a merge request as taught in class. - -## Resources - -* [The `re` Library](https://docs.python.org/3/library/re.html) - -* [F-strings](https://www.python.org/dev/peps/pep-0498/) - -* [Regular Expressions](https://developers.google.com/edu/python/regular-expressions) - -* [Python Input and Output (how to read file content)](https://docs.python.org/3/tutorial/inputoutput.html) - -* [How to Remove Punctuation in Python String](https://www.quora.com/How-do-I-remove-punctuation-from-a-Python-string) - -* [Convert String to Lowercase in Python](https://docs.python.org/3/library/stdtypes.html#str.lower) - -* [Break Python String into Array](https://docs.python.org/3/library/stdtypes.html#str.split) - -* [What is Text Corpus?](https://en.wikipedia.org/wiki/Text_corpus) - -* [A Gentle Introduction to the Bag-of-Words Model](https://machinelearningmastery.com/gentle-introduction-bag-words-model/) - -## Additional Reading - -If you are a research-type person, you will find [this article](http://rstb.royalsocietypublishing.org/content/royptb/366/1567/1101.full.pdf) interesting. Scientists used techniques based on BoW to calculate the frequency of words used cross 17 world languages. They found there is a consistent pattern in terms of the frequency of words being used in human languages. Some mad scientists even [want to use this technique to analyze dolphin language](http://grantome.com/grant/NSF/PHY-1530544) because they believe they can build corpora based on the sounds dolphins make, correlate the dolphin language corpora with human language corpora, and potentially understand what dolphins speak. :astonished: :astonished: :astonished: - -Data analytics is now entering almost every discipline and profession. You will want to reflect on how you will apply your data analytics skills to the fields you are familiar with -- in creative ways. There are tons of fun secrets waiting for you to discover with data analytics. diff --git a/your-code/doc1.txt b/doc1.txt similarity index 100% rename from your-code/doc1.txt rename to doc1.txt diff --git a/your-code/doc2.txt b/doc2.txt similarity index 100% rename from your-code/doc2.txt rename to doc2.txt diff --git a/your-code/doc3.txt b/doc3.txt similarity index 100% rename from your-code/doc3.txt rename to doc3.txt diff --git a/main.ipynb b/main.ipynb new file mode 100644 index 0000000..e5b0ecc --- /dev/null +++ b/main.ipynb @@ -0,0 +1,557 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "M0HGrNOzyqt8" + }, + "source": [ + "# Bag of Words Lab\n", + "\n", + "## Introduction\n", + "\n", + "**Bag of words (BoW)** is an important technique in text mining and [information retrieval](https://en.wikipedia.org/wiki/Information_retrieval). 
It turns the content of text into vectors of numbers, which makes it possible to use mathematics and computer programs to analyze and compare documents.\n", + "\n", + "A BoW contains the following information:\n", + "\n", + "1. A dictionary of all the terms (words) in the text documents. The terms are normalized for letter case (e.g. `Ironhack` => `ironhack`), tense (e.g. `had` => `have`), singular form (e.g. `students` => `student`), etc.\n", + "1. The number of occurrences of each normalized term in each document.\n", + "\n", + "For example, assume we have three text documents:\n", + "\n", + "DOC 1: **Ironhack is cool.**\n", + "\n", + "DOC 2: **I love Ironhack.**\n", + "\n", + "DOC 3: **I am a student at Ironhack.**\n", + "\n", + "The BoW of the above documents looks like this:\n", + "\n", + "| TERM | DOC 1 | DOC 2 | DOC 3 |\n", + "|---|---|---|---|\n", + "| a | 0 | 0 | 1 |\n", + "| am | 0 | 0 | 1 |\n", + "| at | 0 | 0 | 1 |\n", + "| cool | 1 | 0 | 0 |\n", + "| i | 0 | 1 | 1 |\n", + "| ironhack | 1 | 1 | 1 |\n", + "| is | 1 | 0 | 0 |\n", + "| love | 0 | 1 | 0 |\n", + "| student | 0 | 0 | 1 |\n", + "\n", + "\n", + "The vector of each document in a BoW can be high-dimensional, since there can be as many terms as there are words in the language. Data scientists use these vectors to represent the content of the documents. For instance, DOC 1 is represented with `[0, 0, 0, 1, 0, 1, 1, 0, 0]`, DOC 2 is represented with `[0, 0, 0, 0, 1, 1, 0, 1, 0]`, and DOC 3 is represented with `[1, 1, 1, 0, 1, 1, 0, 0, 1]`. Two documents are considered similar if their vector representations are similar.\n", + "\n", + "In practice there are many additional techniques to improve text mining accuracy, such as removing [stop words](https://en.wikipedia.org/wiki/Stop_words) (i.e. common words such as `a`, `I`, and `to` that don't contribute much meaning), using synonym lists (e.g. treating `New York City` the same as `NYC` and `Big Apple`), and removing HTML tags if the data sources are webpages. In Module 3 you will learn how to use those advanced techniques for [natural language processing](https://en.wikipedia.org/wiki/Natural_language_processing), a component of text mining.\n", + "\n", + "In real text mining projects, data analysts use packages such as Scikit-Learn and NLTK, which you will learn in Module 3, to extract a BoW from texts. In this exercise, however, we would like you to create the BoW manually with Python, because doing so helps you better understand the concept and practice the Python skills you have learned so far." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sZlv0ZHlyqt-" + }, + "source": [ + "## The Challenge\n", + "\n", + "We need to create a BoW from a list of documents. The documents (`doc1.txt`, `doc2.txt`, and `doc3.txt`) can be found in the same directory as this notebook. You will read the content of each document into an array of strings named `corpus`.\n", + "\n", + "*What is a corpus (plural: corpora)? See [Text corpus](https://en.wikipedia.org/wiki/Text_corpus) for a reference.*\n", + "\n", + "Your challenge is to use Python to generate the BoW of these documents. 
Your BoW should look like this:\n", + "\n", + "```python\n", + "bag_of_words = ['a', 'am', 'at', 'cool', 'i', 'ironhack', 'is', 'love', 'student']\n", + "\n", + "term_freq = [\n", + " [0, 0, 0, 1, 0, 1, 1, 0, 0],\n", + " [0, 0, 0, 0, 1, 1, 0, 1, 0],\n", + " [1, 1, 1, 0, 1, 1, 0, 0, 1],\n", + "]\n", + "```\n", + "\n", + "The code below reads the content of a text file:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "TlpxS-_e_zmH" + }, + "outputs": [], + "source": [ + "with open('C:\\\\...doc1.txt', 'r') as file:\n", + " data_in_file = file.read()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "WlGsnTNu_0XG" + }, + "outputs": [], + "source": [ + "data_in_file" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "pDFTIJVz_4Vp" + }, + "source": [ + "But naturally, if we have many files, we don't want to open and read each one explicitly, one by one. Let's define the `docs` array that contains the paths of `doc1.txt`, `doc2.txt`, and `doc3.txt`." + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": { + "id": "C8K6MQQayqt_" + }, + "outputs": [], + "source": [ + "docs = ['doc1.txt', 'doc2.txt', 'doc3.txt']" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "F_I0Fcqayqt_" + }, + "source": [ + "Define an empty array named `corpus` that will contain the content strings of the docs. Loop `docs` and read the content of each doc (see cell above) into the `corpus` array." + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": { + "id": "Uk6N-vogyquA" + }, + "outputs": [], + "source": [ + "# Write your code here\n", + "corpus = []\n", + "for doc in docs:\n", + " try:\n", + " # Attempt to open and read each document\n", + " with open(doc, 'r') as file:\n", + " content = file.read()\n", + " \n", + " # Append the content to the corpus list\n", + " corpus.append(content)\n", + " except FileNotFoundError:\n", + " # Handle the specific exception if the file is not found\n", + " print(f\"Warning: {doc} was not found.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qPn2JMW_yquA" + }, + "source": [ + "Print `corpus`." + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": { + "id": "Gg31CafSyquA" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "['Ironhack is cool.', 'I love Ironhack.', 'I am a student at Ironhack.']\n" + ] + } + ], + "source": [ + "print(corpus)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bkzzdrIsyquA" + }, + "source": [ + "You expected to see:\n", + "\n", + "```['ironhack is cool', 'i love ironhack', 'i am a student at ironhack']```\n", + "\n", + "But you actually saw:\n", + "\n", + "```['Ironhack is cool.', 'I love Ironhack.', 'I am a student at Ironhack.']```\n", + "\n", + "This is because you haven't done two important steps:\n", + "\n", + "1. Remove punctuation from the strings\n", + "\n", + "1. Convert strings to lowercase\n", + "\n", + "Write your code below to process `corpus` (convert to lower case and remove special characters)." 
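, + "\n", + "*Hint: there is more than one way to do this. The cell below uses `str.translate`; the standard `re` module also works. A minimal `re` sketch (the name `text` is just an illustrative placeholder):*\n", + "\n", + "```python\n", + "import re\n", + "\n", + "text = 'Ironhack is cool.'\n", + "# Strip everything that is not a word character or whitespace, then lowercase\n", + "print(re.sub(r'[^\\w\\s]', '', text).lower())  # ironhack is cool\n", + "```"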
+ ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": { + "id": "hr19FpCRyquA" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "['ironhack is cool', 'i love ironhack', 'i am a student at ironhack']\n" + ] + } + ], + "source": [ + "# Write your code here\n", + "import string\n", + "\n", + "# Original list of strings (re-created here so this cell can run on its own)\n", + "corpus = [\"Ironhack is cool.\", \"I love Ironhack.\", \"I am a student at Ironhack.\"]\n", + "\n", + "# Loop over each element by index to modify in place\n", + "for i in range(len(corpus)):\n", + " # Convert to lowercase and remove punctuation\n", + " corpus[i] = corpus[i].translate(str.maketrans('', '', string.punctuation)).lower()\n", + "\n", + "# The corpus list is now updated in place\n", + "print(corpus)\n", + "# Output: ['ironhack is cool', 'i love ironhack', 'i am a student at ironhack']" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "te53ZNQ5yquA" + }, + "source": [ + "Now define `bag_of_words` as an empty array. It will be used to store the unique terms in `corpus`." + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": { + "id": "VRpMaq7HyquB" + }, + "outputs": [], + "source": [ + "bag_of_words = []" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wSjETDxByquB" + }, + "source": [ + "Loop through `corpus`. In each loop, do the following:\n", + "\n", + "1. Break the string into an array of terms.\n", + "1. Create a sub-loop to iterate the terms array.\n", + " * In each sub-loop, you'll check if the current term is already contained in `bag_of_words`. If not in `bag_of_words`, append it to the array." + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": { + "id": "hH55NCTjyquB" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "['ironhack', 'is', 'cool', 'i', 'love', 'am', 'a', 'student', 'at']\n" + ] + } + ], + "source": [ + "\n", + "# Main loop over each document in corpus\n", + "# (avoid naming the loop variable 'string', which would shadow the imported string module)\n", + "for text in corpus:\n", + " # Split the string into a list of terms\n", + " terms = text.split()\n", + " \n", + " # Sub-loop over each term in the list\n", + " for term in terms:\n", + " # Check if the term is already in bag_of_words\n", + " if term not in bag_of_words:\n", + " # Append the term to bag_of_words if it's not already present\n", + " bag_of_words.append(term)\n", + "\n", + "print(bag_of_words)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ucETg_76yquB" + }, + "source": [ + "Print `bag_of_words`. You should see:\n", + "\n", + "```['ironhack', 'is', 'cool', 'i', 'love', 'am', 'a', 'student', 'at']```\n", + "\n", + "If not, fix your code in the previous cell." + ] + }, + { + "cell_type": "code", + "execution_count": 38, + "metadata": { + "id": "RDNezDxvyquB" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "['ironhack', 'is', 'cool', 'i', 'love', 'am', 'a', 'student', 'at']\n" + ] + } + ], + "source": [ + "print(bag_of_words)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "nZxZ9oCkyquB" + }, + "source": [ + "Now we define an empty array called `term_freq`. Loop `corpus` for a second time. In each loop, create a sub-loop to iterate the terms in `bag_of_words`. Count how many times each term appears in each doc of `corpus`. Append the term-frequency array to `term_freq`." 
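, + "\n", + "*Hint: Python's built-in `list.count` tallies how many times a value occurs in a list; a minimal illustration (the `terms` list here is just an example):*\n", + "\n", + "```python\n", + "terms = ['i', 'love', 'ironhack']\n", + "print(terms.count('ironhack'))  # 1\n", + "```"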
+ ] + }, + { + "cell_type": "code", + "execution_count": 41, + "metadata": { + "id": "S-q_Xw-7yquC" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[[1, 1, 1, 0, 0, 0, 0, 0, 0], [1, 0, 0, 1, 1, 0, 0, 0, 0], [1, 0, 0, 1, 0, 1, 1, 1, 1]]\n" + ] + } + ], + "source": [ + "# Write your code here\n", + "term_freq = []\n", + "\n", + "# Outer loop to process each document in the corpus\n", + "for document in corpus:\n", + " # Split the document into terms\n", + " terms = document.split()\n", + " \n", + " # Initialize a list to store the frequency of each word in bag_of_words for this document\n", + " doc_term_freq = []\n", + " \n", + " # Inner loop to process each word in bag_of_words\n", + " for word in bag_of_words:\n", + " # Count occurrences of the word in the current document\n", + " count = terms.count(word)\n", + " # Append this count to doc_term_freq\n", + " doc_term_freq.append(count)\n", + " \n", + " # Append the term frequency list for this document to term_freq\n", + " term_freq.append(doc_term_freq)\n", + "\n", + "# Display the term frequency for each document\n", + "print(term_freq)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "C5rTgoo7yquC" + }, + "source": [ + "Print `term_freq`. You should see:\n", + "\n", + "```[[1, 1, 1, 0, 0, 0, 0, 0, 0], [1, 0, 0, 1, 1, 0, 0, 0, 0], [1, 0, 0, 1, 0, 1, 1, 1, 1]]```" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "35ESP-61yquC" + }, + "source": [ + "**If your output is correct, congratulations! You've solved the challenge!**\n", + "\n", + "If not, go back and check for errors in your code." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "eNahUeB4yquC" + }, + "source": [ + "## Bonus Question\n", + "\n", + "Now you want to improve your previous solution by removing the stop words from the corpus. The idea is that you only want to add terms that are not in the `stop_words` list to the `bag_of_words` array.\n", + "\n", + "Requirements:\n", + "\n", + "1. Move all your previous code from `main.ipynb` to the cell below.\n", + "1. 
Improve your solution by ignoring stop words in `bag_of_words`.\n", + "\n", + "After you're done, your `bag_of_words` should be:\n", + "\n", + "```['ironhack', 'cool', 'love', 'student']```\n", + "\n", + "And your `term_freq` should be:\n", + "\n", + "```[[1, 1, 0, 0], [1, 0, 1, 0], [1, 0, 0, 1]]```" + ] + }, + { + "cell_type": "code", + "execution_count": 49, + "metadata": { + "id": "XDroiBGYyquC" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "['ironhack', 'cool', 'love', 'student']" + ] + }, + "execution_count": 49, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "stop_words = ['all', 'six', 'less', 'being', 'indeed', 'over', 'move', 'anyway', 'fifty', 'four', 'not', 'own', 'through', 'yourselves', 'go', 'where', 'mill', 'only', 'find', 'before', 'one', 'whose', 'system', 'how', 'somewhere', 'with', 'thick', 'show', 'had', 'enough', 'should', 'to', 'must', 'whom', 'seeming', 'under', 'ours', 'has', 'might', 'thereafter', 'latterly', 'do', 'them', 'his', 'around', 'than', 'get', 'very', 'de', 'none', 'cannot', 'every', 'whether', 'they', 'front', 'during', 'thus', 'now', 'him', 'nor', 'name', 'several', 'hereafter', 'always', 'who', 'cry', 'whither', 'this', 'someone', 'either', 'each', 'become', 'thereupon', 'sometime', 'side', 'two', 'therein', 'twelve', 'because', 'often', 'ten', 'our', 'eg', 'some', 'back', 'up', 'namely', 'towards', 'are', 'further', 'beyond', 'ourselves', 'yet', 'out', 'even', 'will', 'what', 'still', 'for', 'bottom', 'mine', 'since', 'please', 'forty', 'per', 'its', 'everything', 'behind', 'un', 'above', 'between', 'it', 'neither', 'seemed', 'ever', 'across', 'she', 'somehow', 'be', 'we', 'full', 'never', 'sixty', 'however', 'here', 'otherwise', 'were', 'whereupon', 'nowhere', 'although', 'found', 'alone', 're', 'along', 'fifteen', 'by', 'both', 'about', 'last', 'would', 'anything', 'via', 'many', 'could', 'thence', 'put', 'against', 'keep', 'etc', 'amount', 'became', 'ltd', 'hence', 'onto', 'or', 'con', 'among', 'already', 'co', 'afterwards', 'formerly', 'within', 'seems', 'into', 'others', 'while', 'whatever', 'except', 'down', 'hers', 'everyone', 'done', 'least', 'another', 'whoever', 'moreover', 'couldnt', 'throughout', 'anyhow', 'yourself', 'three', 'from', 'her', 'few', 'together', 'top', 'there', 'due', 'been', 'next', 'anyone', 'eleven', 'much', 'call', 'therefore', 'interest', 'then', 'thru', 'themselves', 'hundred', 'was', 'sincere', 'empty', 'more', 'himself', 'elsewhere', 'mostly', 'on', 'fire', 'am', 'becoming', 'hereby', 'amongst', 'else', 'part', 'everywhere', 'too', 'herself', 'former', 'those', 'he', 'me', 'myself', 'made', 'twenty', 'these', 'bill', 'cant', 'us', 'until', 'besides', 'nevertheless', 'below', 'anywhere', 'nine', 'can', 'of', 'your', 'toward', 'my', 'something', 'and', 'whereafter', 'whenever', 'give', 'almost', 'wherever', 'is', 'describe', 'beforehand', 'herein', 'an', 'as', 'itself', 'at', 'have', 'in', 'seem', 'whence', 'ie', 'any', 'fill', 'again', 'hasnt', 'inc', 'thereby', 'thin', 'no', 'perhaps', 'latter', 'meanwhile', 'when', 'detail', 'same', 'wherein', 'beside', 'also', 'that', 'other', 'take', 'which', 'becomes', 'you', 'if', 'nobody', 'see', 'though', 'may', 'after', 'upon', 'most', 'hereupon', 'eight', 'but', 'serious', 'nothing', 'such', 'why', 'a', 'off', 'whereby', 'third', 'i', 'whole', 'noone', 'sometimes', 'well', 'amoungst', 'yours', 'their', 'rather', 'without', 'so', 'five', 'the', 'first', 'whereas', 'once']\n", + "\n", + "# Write your code below\n", + "for word in 
list(bag_of_words):  # iterate over a copy so it is safe to remove items from the original\n", + " if word in stop_words:\n", + " # Removing items from a list while looping over that same list would skip elements\n", + " bag_of_words.remove(word)\n", + "\n", + "bag_of_words\n" + ] + }, + { + "cell_type": "code", + "execution_count": 50, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[[1, 1, 0, 0], [1, 0, 1, 0], [1, 0, 0, 1]]\n" + ] + } + ], + "source": [ + "term_freq = []\n", + "\n", + "# Outer loop to process each document in the corpus\n", + "for document in corpus:\n", + " # Split the document into terms\n", + " terms = document.split()\n", + " \n", + " # Initialize a list to store the frequency of each word in bag_of_words for this document\n", + " doc_term_freq = []\n", + " \n", + " # Inner loop to process each word in bag_of_words\n", + " for word in bag_of_words:\n", + " # Count occurrences of the word in the current document\n", + " count = terms.count(word)\n", + " # Append this count to doc_term_freq\n", + " doc_term_freq.append(count)\n", + " \n", + " # Append the term frequency list for this document to term_freq\n", + " term_freq.append(doc_term_freq)\n", + "\n", + "# Display the term frequency for each document\n", + "print(term_freq)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2D0dq58ryquC" + }, + "source": [ + "## Additional Challenge for the Nerds\n", + "\n", + "In Module 3 we will learn Scikit-Learn, which has the BoW feature built in. Try to use Scikit-Learn to generate the BoW for this challenge and check whether the output is the same as yours. You will need to do some googling to find out how to use Scikit-Learn to generate a BoW.\n", + "\n", + "**Notes:**\n", + "\n", + "* To install Scikit-Learn, use `pip install scikit-learn`.\n", + "\n", + "* Scikit-Learn's `CountVectorizer` only removes stop words if you ask it to (e.g. by passing `stop_words='english'`). Its default tokenizer does, however, ignore single-character tokens, which is why `a` and `i` are missing from the output below.\n", + "\n", + "* Scikit-Learn's output has a slightly different format from the output example demonstrated above. That's OK; you don't need to convert the Scikit-Learn output.\n", + "\n", + "The Scikit-Learn output will look like this:\n", + "\n", + "```python\n", + "# BoW:\n", + "{u'love': 5, u'ironhack': 3, u'student': 6, u'is': 4, u'cool': 2, u'am': 0, u'at': 1}\n", + "\n", + "# term_freq:\n", + "[[0 0 1 1 1 0 0]\n", + " [0 0 0 1 0 1 0]\n", + " [1 1 0 1 0 0 1]]\n", + " ```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.14.2" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/your-code/main.ipynb b/your-code/main.ipynb deleted file mode 100644 index 577fd55..0000000 --- a/your-code/main.ipynb +++ /dev/null @@ -1 +0,0 @@ -{"cells":[{"cell_type":"markdown","metadata":{"id":"M0HGrNOzyqt8"},"source":["# Bag of Words Lab\n","\n","## Introduction\n","\n","**Bag of words (BoW)** is an important technique in text mining and [information retrieval](https://en.wikipedia.org/wiki/Information_retrieval). It turns the content of text into vectors of numbers which makes it possible to use mathematics and computer programs to analyze and compare documents.\n","\n","A BoW contains the following information:\n","\n","1. 
A dictionary of all the terms (words) in the text documents. The terms are normalized in terms of the letter case (e.g. `Ironhack` => `ironhack`), tense (e.g. `had` => `have`), singular form (e.g. `students` => `student`), etc.\n","1. The number of occurrences of each normalized term in each document.\n","\n","For example, assume we have three text documents:\n","\n","DOC 1: **Ironhack is cool.**\n","\n","DOC 2: **I love Ironhack.**\n","\n","DOC 3: **I am a student at Ironhack.**\n","\n","The BoW of the above documents looks like below:\n","\n","| TERM | DOC 1 | DOC 2 | Doc 3 |\n","|---|---|---|---|\n","| a | 0 | 0 | 1 |\n","| am | 0 | 0 | 1 |\n","| at | 0 | 0 | 1 |\n","| cool | 1 | 0 | 0 |\n","| i | 0 | 1 | 1 |\n","| ironhack | 1 | 1 | 1 |\n","| is | 1 | 0 | 0 |\n","| love | 0 | 1 | 0 |\n","| student | 0 | 0 | 1 |\n","\n","\n","The vector of each document in BoW can be high-dimensional since it can have as many terms as there exist words in the language. Data scientists use these vectors to represent the content of the documents. For instance, DOC 1 is represented with `[0, 0, 0, 1, 0, 1, 1, 0, 0]`, DOC 2 is represented with `[0, 0, 0, 0, 1, 1, 0, 1, 0]`, and DOC 3 is represented with `[1, 1, 1, 0, 1, 1, 0, 0, 1]`. Two documents are considered similar if their vector representations are similar.\n","\n","In real practice there are many additional techniques to improve the text mining accuracy such as using [stop words](https://en.wikipedia.org/wiki/Stop_words) (i.e. neglecting common words such as `a`, `I`, `to` that don't contribute much meaning), synonym list (e.g. consider `New York City` the same as `NYC` and `Big Apple`), and HTML tag removal if the data sources are webpages. In Module 3 you will learn how to use those advanced techniques for [natural language processing](https://en.wikipedia.org/wiki/Natural_language_processing), a component of text mining.\n","\n","In real text mining projects data analysts use packages such as Scikit-Learn and NLTK, which you will learn in Module 3, to extract BoW from texts. In this exercise, however, we would like you to create BoW manually with Python. This is because by manually creating BoW you can better understand the concept and also practice the Python skills you have learned so far."]},{"cell_type":"markdown","metadata":{"id":"sZlv0ZHlyqt-"},"source":["## The Challenge\n","\n","We need to create a BoW from a list of documents. The documents (`doc1.txt`, `doc2.txt`, and `doc3.txt`) can be found in the `your-code` directory of this exercise. You will read the content of each document into an array of strings named `corpus`.\n","\n","*What is a corpus (plural: corpora)? Read the reference in the README file.*\n","\n","Your challenge is to use Python to generate the BoW of these documents. 
Your BoW should look like below:\n","\n","```python\n","bag_of_words = ['a', 'am', 'at', 'cool', 'i', 'ironhack', 'is', 'love', 'student']\n","\n","term_freq = [\n"," [0, 0, 0, 1, 0, 1, 1, 0, 0],\n"," [0, 0, 0, 0, 1, 1, 0, 1, 0],\n"," [1, 1, 1, 0, 1, 1, 0, 0, 1],\n","]\n","```\n","\n","The code below reads the content of a file of text:"]},{"cell_type":"code","source":["with open('C:\\\\...doc1.txt', 'r') as file:\n"," data_in_file = file.read()"],"metadata":{"id":"TlpxS-_e_zmH"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["data_in_file"],"metadata":{"id":"WlGsnTNu_0XG"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["But Naturally, if we have many files, we don't want to open and read each one explcitly one by one. Let's define the `docs` array that contains the paths of `doc1.txt`, `doc2.txt`, and `doc3.txt`."],"metadata":{"id":"pDFTIJVz_4Vp"}},{"cell_type":"code","execution_count":null,"metadata":{"id":"C8K6MQQayqt_"},"outputs":[],"source":["docs = ['doc1.txt', 'doc2.txt', 'doc3.txt']"]},{"cell_type":"markdown","metadata":{"id":"F_I0Fcqayqt_"},"source":["Define an empty array named `corpus` that will contain the content strings of the docs. Loop `docs` and read the content of each doc (see cell above) into the `corpus` array."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"Uk6N-vogyquA"},"outputs":[],"source":["# Write your code here\n"]},{"cell_type":"markdown","metadata":{"id":"qPn2JMW_yquA"},"source":["Print `corpus`."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"Gg31CafSyquA"},"outputs":[],"source":[]},{"cell_type":"markdown","metadata":{"id":"bkzzdrIsyquA"},"source":["You expected to see:\n","\n","```['ironhack is cool', 'i love ironhack', 'i am a student at ironhack']```\n","\n","But you actually saw:\n","\n","```['Ironhack is cool.', 'I love Ironhack.', 'I am a student at Ironhack.']```\n","\n","This is because you haven't done two important steps:\n","\n","1. Remove punctuation from the strings\n","\n","1. Convert strings to lowercase\n","\n","Write your code below to process `corpus` (convert to lower case and remove special characters)."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"hr19FpCRyquA"},"outputs":[],"source":["# Write your code here"]},{"cell_type":"markdown","metadata":{"id":"te53ZNQ5yquA"},"source":["Now define `bag_of_words` as an empty array. It will be used to store the unique terms in `corpus`."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"VRpMaq7HyquB"},"outputs":[],"source":[]},{"cell_type":"markdown","metadata":{"id":"wSjETDxByquB"},"source":["Loop through `corpus`. In each loop, do the following:\n","\n","1. Break the string into an array of terms.\n","1. Create a sub-loop to iterate the terms array.\n"," * In each sub-loop, you'll check if the current term is already contained in `bag_of_words`. If not in `bag_of_words`, append it to the array."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"hH55NCTjyquB"},"outputs":[],"source":["# Write your code here\n"]},{"cell_type":"markdown","metadata":{"id":"ucETg_76yquB"},"source":["Print `bag_of_words`. You should see:\n","\n","```['ironhack', 'is', 'cool', 'i', 'love', 'am', 'a', 'student', 'at']```\n","\n","If not, fix your code in the previous cell."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"RDNezDxvyquB"},"outputs":[],"source":[]},{"cell_type":"markdown","metadata":{"id":"nZxZ9oCkyquB"},"source":["Now we define an empty array called `term_freq`. 
Loop `corpus` for a second time. In each loop, create a sub-loop to iterate the terms in `bag_of_words`. Count how many times each term appears in each doc of `corpus`. Append the term-frequency array to `term_freq`."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"S-q_Xw-7yquC"},"outputs":[],"source":["# Write your code here\n"]},{"cell_type":"markdown","metadata":{"id":"C5rTgoo7yquC"},"source":["Print `term_freq`. You should see:\n","\n","```[[1, 1, 1, 0, 0, 0, 0, 0, 0], [1, 0, 0, 1, 1, 0, 0, 0, 0], [1, 0, 0, 1, 0, 1, 1, 1, 1]]```"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"63Y_cfsjyquC"},"outputs":[],"source":[]},{"cell_type":"markdown","metadata":{"id":"35ESP-61yquC"},"source":["**If your output is correct, congratulations! You've solved the challenge!**\n","\n","If not, go back and check for errors in your code."]},{"cell_type":"markdown","metadata":{"id":"eNahUeB4yquC"},"source":["## Bonus Question\n","\n","Now you want to improve your previous solution by removing the stop words from the corpus. The idea is you only want to add terms that are not in the `stop_words` list to the `bag_of_words` array.\n","\n","Requirements:\n","\n","1. Move all your previous codes from `main.ipynb` to the cell below.\n","1. Improve your solution by ignoring stop words in `bag_of_words`.\n","\n","After you're done, your `bag_of_words` should be:\n","\n","```['ironhack', 'cool', 'love', 'student']```\n","\n","And your `term_freq` should be:\n","\n","```[[1, 1, 0, 0], [1, 0, 1, 0], [1, 0, 0, 1]]```"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"XDroiBGYyquC"},"outputs":[],"source":["stop_words = ['all', 'six', 'less', 'being', 'indeed', 'over', 'move', 'anyway', 'fifty', 'four', 'not', 'own', 'through', 'yourselves', 'go', 'where', 'mill', 'only', 'find', 'before', 'one', 'whose', 'system', 'how', 'somewhere', 'with', 'thick', 'show', 'had', 'enough', 'should', 'to', 'must', 'whom', 'seeming', 'under', 'ours', 'has', 'might', 'thereafter', 'latterly', 'do', 'them', 'his', 'around', 'than', 'get', 'very', 'de', 'none', 'cannot', 'every', 'whether', 'they', 'front', 'during', 'thus', 'now', 'him', 'nor', 'name', 'several', 'hereafter', 'always', 'who', 'cry', 'whither', 'this', 'someone', 'either', 'each', 'become', 'thereupon', 'sometime', 'side', 'two', 'therein', 'twelve', 'because', 'often', 'ten', 'our', 'eg', 'some', 'back', 'up', 'namely', 'towards', 'are', 'further', 'beyond', 'ourselves', 'yet', 'out', 'even', 'will', 'what', 'still', 'for', 'bottom', 'mine', 'since', 'please', 'forty', 'per', 'its', 'everything', 'behind', 'un', 'above', 'between', 'it', 'neither', 'seemed', 'ever', 'across', 'she', 'somehow', 'be', 'we', 'full', 'never', 'sixty', 'however', 'here', 'otherwise', 'were', 'whereupon', 'nowhere', 'although', 'found', 'alone', 're', 'along', 'fifteen', 'by', 'both', 'about', 'last', 'would', 'anything', 'via', 'many', 'could', 'thence', 'put', 'against', 'keep', 'etc', 'amount', 'became', 'ltd', 'hence', 'onto', 'or', 'con', 'among', 'already', 'co', 'afterwards', 'formerly', 'within', 'seems', 'into', 'others', 'while', 'whatever', 'except', 'down', 'hers', 'everyone', 'done', 'least', 'another', 'whoever', 'moreover', 'couldnt', 'throughout', 'anyhow', 'yourself', 'three', 'from', 'her', 'few', 'together', 'top', 'there', 'due', 'been', 'next', 'anyone', 'eleven', 'much', 'call', 'therefore', 'interest', 'then', 'thru', 'themselves', 'hundred', 'was', 'sincere', 'empty', 'more', 'himself', 'elsewhere', 'mostly', 'on', 'fire', 
'am', 'becoming', 'hereby', 'amongst', 'else', 'part', 'everywhere', 'too', 'herself', 'former', 'those', 'he', 'me', 'myself', 'made', 'twenty', 'these', 'bill', 'cant', 'us', 'until', 'besides', 'nevertheless', 'below', 'anywhere', 'nine', 'can', 'of', 'your', 'toward', 'my', 'something', 'and', 'whereafter', 'whenever', 'give', 'almost', 'wherever', 'is', 'describe', 'beforehand', 'herein', 'an', 'as', 'itself', 'at', 'have', 'in', 'seem', 'whence', 'ie', 'any', 'fill', 'again', 'hasnt', 'inc', 'thereby', 'thin', 'no', 'perhaps', 'latter', 'meanwhile', 'when', 'detail', 'same', 'wherein', 'beside', 'also', 'that', 'other', 'take', 'which', 'becomes', 'you', 'if', 'nobody', 'see', 'though', 'may', 'after', 'upon', 'most', 'hereupon', 'eight', 'but', 'serious', 'nothing', 'such', 'why', 'a', 'off', 'whereby', 'third', 'i', 'whole', 'noone', 'sometimes', 'well', 'amoungst', 'yours', 'their', 'rather', 'without', 'so', 'five', 'the', 'first', 'whereas', 'once']\n","\n","# Write your code below\n"]},{"cell_type":"markdown","metadata":{"id":"2D0dq58ryquC"},"source":["## Additional Challenge for the Nerds\n","\n","We will learn Scikit-Learn in Module 3 which has built in the BoW feature. Try to use Scikit-Learn to generate the BoW for this challenge and check whether the output is the same as yours. You will need to do some googling to find out how to use Scikit-Learn to generate BoW.\n","\n","**Notes:**\n","\n","* To install Scikit-Learn, use `pip install sklearn`.\n","\n","* Scikit-Learn removes stop words by default. You don't need to manually remove stop words.\n","\n","* Scikit-Learn's output has slightly different format from the output example demonstrated above. It's ok, you don't need to convert the Scikit-Learn output.\n","\n","The Scikit-Learn output will look like below:\n","\n","```python\n","# BoW:\n","{u'love': 5, u'ironhack': 3, u'student': 6, u'is': 4, u'cool': 2, u'am': 0, u'at': 1}\n","\n","# term_freq:\n","[[0 0 1 1 1 0 0]\n"," [0 0 0 1 0 1 0]\n"," [1 1 0 1 0 0 1]]\n"," ```"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"Tp_4ILcNyquC"},"outputs":[],"source":[]}],"metadata":{"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.6.8"},"colab":{"provenance":[]}},"nbformat":4,"nbformat_minor":0} \ No newline at end of file