From cc718748a425f339c9bc96c59e5d3e860ba2b32b Mon Sep 17 00:00:00 2001
From: Tumraya <Adair.guijt@gmail.com>
Date: Sat, 17 Jan 2026 07:21:16 +0100
Subject: [PATCH] updated main.ipynb

---
 doc1.txt   |   1 +
 doc2.txt   |   1 +
 doc3.txt   |   1 +
 main.ipynb | 557 +++++++++++++++++++++++
 4 files changed, 560 insertions(+)
 create mode 100644 doc1.txt
 create mode 100644 doc2.txt
 create mode 100644 doc3.txt
 create mode 100644 main.ipynb
diff --git a/doc1.txt b/doc1.txt
new file mode 100644
index 0000000..e66288d
--- /dev/null
+++ b/doc1.txt
@@ -0,0 +1 @@
+Ironhack is cool.
\ No newline at end of file
diff --git a/doc2.txt b/doc2.txt
new file mode 100644
index 0000000..b21feac
--- /dev/null
+++ b/doc2.txt
@@ -0,0 +1 @@
+I love Ironhack.
\ No newline at end of file
diff --git a/doc3.txt b/doc3.txt
new file mode 100644
index 0000000..653c5b7
--- /dev/null
+++ b/doc3.txt
@@ -0,0 +1 @@
+I am a student at Ironhack.
\ No newline at end of file
diff --git a/main.ipynb b/main.ipynb
new file mode 100644
index 0000000..e5b0ecc
--- /dev/null
+++ b/main.ipynb
@@ -0,0 +1,557 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "M0HGrNOzyqt8"
+   },
+   "source": [
+    "# Bag of Words Lab\n",
+    "\n",
+    "## Introduction\n",
+    "\n",
+    "**Bag of words (BoW)** is an important technique in text mining and [information retrieval](https://en.wikipedia.org/wiki/Information_retrieval). It turns the content of text into vectors of numbers, which makes it possible to use mathematics and computer programs to analyze and compare documents.\n",
+    "\n",
+    "A BoW contains the following information:\n",
+    "\n",
+    "1. A dictionary of all the terms (words) in the text documents. The terms are normalized with respect to letter case (e.g. `Ironhack` => `ironhack`), tense (e.g. `had` => `have`), singular form (e.g. `students` => `student`), etc.\n",
+    "1. The number of occurrences of each normalized term in each document.\n",
+    "\n",
+    "For example, assume we have three text documents:\n",
+    "\n",
+    "DOC 1: **Ironhack is cool.**\n",
+    "\n",
+    "DOC 2: **I love Ironhack.**\n",
+    "\n",
+    "DOC 3: **I am a student at Ironhack.**\n",
+    "\n",
+    "The BoW of the above documents looks like this:\n",
+    "\n",
+    "| TERM | DOC 1 | DOC 2 | DOC 3 |\n",
+    "|---|---|---|---|\n",
+    "| a | 0 | 0 | 1 |\n",
+    "| am | 0 | 0 | 1 |\n",
+    "| at | 0 | 0 | 1 |\n",
+    "| cool | 1 | 0 | 0 |\n",
+    "| i | 0 | 1 | 1 |\n",
+    "| ironhack | 1 | 1 | 1 |\n",
+    "| is | 1 | 0 | 0 |\n",
+    "| love | 0 | 1 | 0 |\n",
+    "| student | 0 | 0 | 1 |\n",
+    "\n",
+    "\n",
+    "The vector of each document in a BoW can be high-dimensional, since it can have as many dimensions as there are words in the language. Data scientists use these vectors to represent the content of the documents. For instance, DOC 1 is represented by `[0, 0, 0, 1, 0, 1, 1, 0, 0]`, DOC 2 by `[0, 0, 0, 0, 1, 1, 0, 1, 0]`, and DOC 3 by `[1, 1, 1, 0, 1, 1, 0, 0, 1]`. Two documents are considered similar if their vector representations are similar.\n",
+    "\n",
+    "In practice there are many additional techniques to improve text-mining accuracy, such as removing [stop words](https://en.wikipedia.org/wiki/Stop_words) (common words such as `a`, `I`, and `to` that don't contribute much meaning), using a synonym list (e.g. treating `New York City` the same as `NYC` and `Big Apple`), and removing HTML tags if the data sources are webpages. In Module 3 you will learn how to use those advanced techniques for [natural language processing](https://en.wikipedia.org/wiki/Natural_language_processing), a component of text mining.\n",
+    "\n",
+    "In real text mining projects, data analysts use packages such as Scikit-Learn and NLTK, which you will learn in Module 3, to extract a BoW from texts. In this exercise, however, we would like you to create the BoW manually with Python, because building it by hand helps you better understand the concept and practice the Python skills you have learned so far."
+   ]
+  },
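+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "To make \"similar vectors\" concrete, here is a quick optional sketch (plain Python, no extra libraries) that compares the example vectors above with cosine similarity:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import math\n",
+    "\n",
+    "# BoW vectors for DOC 1-3, copied from the table above\n",
+    "doc1 = [0, 0, 0, 1, 0, 1, 1, 0, 0]\n",
+    "doc2 = [0, 0, 0, 0, 1, 1, 0, 1, 0]\n",
+    "doc3 = [1, 1, 1, 0, 1, 1, 0, 0, 1]\n",
+    "\n",
+    "def cosine_similarity(u, v):\n",
+    "    # dot(u, v) / (|u| * |v|)\n",
+    "    dot = sum(a * b for a, b in zip(u, v))\n",
+    "    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))\n",
+    "    return dot / norm\n",
+    "\n",
+    "print(cosine_similarity(doc1, doc2))  # ~0.33: the docs share only 'ironhack'\n",
+    "print(cosine_similarity(doc2, doc3))  # ~0.47: the docs share 'i' and 'ironhack'"
+   ]
+  },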
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "sZlv0ZHlyqt-"
+   },
+   "source": [
+    "## The Challenge\n",
+    "\n",
+    "We need to create a BoW from a list of documents. The documents (`doc1.txt`, `doc2.txt`, and `doc3.txt`) can be found in the `your-code` directory of this exercise. You will read the content of each document into an array of strings named `corpus`.\n",
+    "\n",
+    "*What is a corpus (plural: corpora)? Read the reference in the README file.*\n",
+    "\n",
+    "Your challenge is to use Python to generate the BoW of these documents. Your BoW should look like this:\n",
+    "\n",
+    "```python\n",
+    "bag_of_words = ['a', 'am', 'at', 'cool', 'i', 'ironhack', 'is', 'love', 'student']\n",
+    "\n",
+    "term_freq = [\n",
+    "    [0, 0, 0, 1, 0, 1, 1, 0, 0],\n",
+    "    [0, 0, 0, 0, 1, 1, 0, 1, 0],\n",
+    "    [1, 1, 1, 0, 1, 1, 0, 0, 1],\n",
+    "]\n",
+    "```\n",
+    "\n",
+    "The code below reads the content of a text file:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "TlpxS-_e_zmH"
+   },
+   "outputs": [],
+   "source": [
+    "with open('C:\\\\...doc1.txt', 'r') as file:\n",
+    "    data_in_file = file.read()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "WlGsnTNu_0XG"
+   },
+   "outputs": [],
+   "source": [
+    "data_in_file"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "pDFTIJVz_4Vp"
+   },
+   "source": [
+    "Naturally, if we have many files, we don't want to open and read each one explicitly, one by one. Let's define the `docs` array that contains the paths of `doc1.txt`, `doc2.txt`, and `doc3.txt`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {
+    "id": "C8K6MQQayqt_"
+   },
+   "outputs": [],
+   "source": [
+    "docs = ['doc1.txt', 'doc2.txt', 'doc3.txt']"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "F_I0Fcqayqt_"
+   },
+   "source": [
+    "Define an empty array named `corpus` that will contain the content strings of the docs. Loop over `docs` and read the content of each doc (see the cell above) into the `corpus` array."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 13,
+   "metadata": {
+    "id": "Uk6N-vogyquA"
+   },
+   "outputs": [],
+   "source": [
+    "# Write your code here\n",
+    "corpus = []\n",
+    "for doc in docs:\n",
+    "    try:\n",
+    "        # Attempt to open and read each document\n",
+    "        with open(doc, 'r') as file:\n",
+    "            content = file.read()\n",
+    "\n",
+    "        # Append the content to the corpus list\n",
+    "        corpus.append(content)\n",
+    "    except FileNotFoundError:\n",
+    "        # Warn instead of crashing if the file is not found\n",
+    "        print(f\"Warning: {doc} was not found.\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "qPn2JMW_yquA"
+   },
+   "source": [
+    "Print `corpus`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 14,
+   "metadata": {
+    "id": "Gg31CafSyquA"
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "['Ironhack is cool.', 'I love Ironhack.', 'I am a student at Ironhack.']\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(corpus)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "bkzzdrIsyquA"
+   },
+   "source": [
+    "You expected to see:\n",
+    "\n",
+    "```['ironhack is cool', 'i love ironhack', 'i am a student at ironhack']```\n",
+    "\n",
+    "But you actually saw:\n",
+    "\n",
+    "```['Ironhack is cool.', 'I love Ironhack.', 'I am a student at Ironhack.']```\n",
+    "\n",
+    "This is because you haven't done two important steps:\n",
+    "\n",
+    "1. Remove punctuation from the strings\n",
+    "\n",
+    "1. Convert strings to lowercase\n",
+    "\n",
+    "Write your code below to process `corpus` (convert to lowercase and remove special characters)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 20,
+   "metadata": {
+    "id": "hr19FpCRyquA"
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "['ironhack is cool', 'i love ironhack', 'i am a student at ironhack']\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Write your code here\n",
+    "import string\n",
+    "\n",
+    "# Process the corpus built in the previous cell in place:\n",
+    "# remove punctuation, then convert to lowercase\n",
+    "for i in range(len(corpus)):\n",
+    "    corpus[i] = corpus[i].translate(str.maketrans('', '', string.punctuation)).lower()\n",
+    "\n",
+    "print(corpus)\n",
+    "# Output: ['ironhack is cool', 'i love ironhack', 'i am a student at ironhack']"
+   ]
+  },
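+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As an optional aside, a regular expression can do the same cleanup. This is just a sketch of an equivalent approach; it also catches punctuation that isn't listed in `string.punctuation`:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import re\n",
+    "\n",
+    "# Drop every character that is neither a word character nor whitespace,\n",
+    "# then lowercase (produces the same result for this corpus)\n",
+    "cleaned = [re.sub(r'[^\\w\\s]', '', doc).lower() for doc in corpus]\n",
+    "print(cleaned)"
+   ]
+  },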
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "te53ZNQ5yquA"
+   },
+   "source": [
+    "Now define `bag_of_words` as an empty array. It will be used to store the unique terms in `corpus`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 21,
+   "metadata": {
+    "id": "VRpMaq7HyquB"
+   },
+   "outputs": [],
+   "source": [
+    "bag_of_words = []"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "wSjETDxByquB"
+   },
+   "source": [
+    "Loop through `corpus`. In each loop, do the following:\n",
+    "\n",
+    "1. Break the string into an array of terms.\n",
+    "1. Create a sub-loop to iterate over the terms array.\n",
+    "    * In each sub-loop, check whether the current term is already contained in `bag_of_words`. If not, append it to the array."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 22,
+   "metadata": {
+    "id": "hH55NCTjyquB"
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "['ironhack', 'is', 'cool', 'i', 'love', 'am', 'a', 'student', 'at']\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Main loop over each document string in corpus\n",
+    "# (the loop variable is named doc so it doesn't shadow the string module)\n",
+    "for doc in corpus:\n",
+    "    # Split the string into a list of terms\n",
+    "    terms = doc.split()\n",
+    "\n",
+    "    # Sub-loop over each term in the list\n",
+    "    for term in terms:\n",
+    "        # Append the term to bag_of_words if it's not already present\n",
+    "        if term not in bag_of_words:\n",
+    "            bag_of_words.append(term)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "ucETg_76yquB"
+   },
+   "source": [
+    "Print `bag_of_words`. You should see:\n",
+    "\n",
+    "```['ironhack', 'is', 'cool', 'i', 'love', 'am', 'a', 'student', 'at']```\n",
+    "\n",
+    "If not, fix your code in the previous cell."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 38,
+   "metadata": {
+    "id": "RDNezDxvyquB"
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "['ironhack', 'is', 'cool', 'i', 'love', 'am', 'a', 'student', 'at']\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(bag_of_words)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "nZxZ9oCkyquB"
+   },
+   "source": [
+    "Now define an empty array called `term_freq`. Loop over `corpus` a second time. In each loop, create a sub-loop to iterate over the terms in `bag_of_words` and count how many times each term appears in the current doc. Append each doc's term-frequency array to `term_freq`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 41,
+   "metadata": {
+    "id": "S-q_Xw-7yquC"
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "[[1, 1, 1, 0, 0, 0, 0, 0, 0], [1, 0, 0, 1, 1, 0, 0, 0, 0], [1, 0, 0, 1, 0, 1, 1, 1, 1]]\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Write your code here\n",
+    "term_freq = []\n",
+    "\n",
+    "# Outer loop to process each document in the corpus\n",
+    "for document in corpus:\n",
+    "    # Split the document into terms\n",
+    "    terms = document.split()\n",
+    "\n",
+    "    # Frequency of each word in bag_of_words for this document\n",
+    "    doc_term_freq = []\n",
+    "\n",
+    "    # Inner loop to process each word in bag_of_words\n",
+    "    for word in bag_of_words:\n",
+    "        # Count occurrences of the word in the current document\n",
+    "        count = terms.count(word)\n",
+    "        # Append this count to doc_term_freq\n",
+    "        doc_term_freq.append(count)\n",
+    "\n",
+    "    # Append the term frequency list for this document to term_freq\n",
+    "    term_freq.append(doc_term_freq)\n",
+    "\n",
+    "# Display the term frequency for each document\n",
+    "print(term_freq)"
+   ]
+  },
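+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "For reference, here is an optional, more compact sketch that produces the same counts with `collections.Counter` from the standard library:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from collections import Counter\n",
+    "\n",
+    "# Tally each document once; Counter returns 0 for missing keys,\n",
+    "# so indexing with every word in bag_of_words rebuilds term_freq\n",
+    "counts = [Counter(doc.split()) for doc in corpus]\n",
+    "term_freq_alt = [[c[word] for word in bag_of_words] for c in counts]\n",
+    "print(term_freq_alt)"
+   ]
+  },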
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "C5rTgoo7yquC"
+   },
+   "source": [
+    "Print `term_freq`. You should see:\n",
+    "\n",
+    "```[[1, 1, 1, 0, 0, 0, 0, 0, 0], [1, 0, 0, 1, 1, 0, 0, 0, 0], [1, 0, 0, 1, 0, 1, 1, 1, 1]]```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "35ESP-61yquC"
+   },
+   "source": [
+    "**If your output is correct, congratulations! You've solved the challenge!**\n",
+    "\n",
+    "If not, go back and check for errors in your code."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "eNahUeB4yquC"
+   },
+   "source": [
+    "## Bonus Question\n",
+    "\n",
+    "Now you want to improve your previous solution by removing the stop words from the corpus. The idea is to add only terms that are not in the `stop_words` list to the `bag_of_words` array.\n",
+    "\n",
+    "Requirements:\n",
+    "\n",
+    "1. Move all your previous code into the cell below.\n",
+    "1. Improve your solution by ignoring stop words in `bag_of_words`.\n",
+    "\n",
+    "After you're done, your `bag_of_words` should be:\n",
+    "\n",
+    "```['ironhack', 'cool', 'love', 'student']```\n",
+    "\n",
+    "And your `term_freq` should be:\n",
+    "\n",
+    "```[[1, 1, 0, 0], [1, 0, 1, 0], [1, 0, 0, 1]]```"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 49,
+   "metadata": {
+    "id": "XDroiBGYyquC"
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "['ironhack', 'cool', 'love', 'student']"
+      ]
+     },
+     "execution_count": 49,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "stop_words = ['all', 'six', 'less', 'being', 'indeed', 'over', 'move', 'anyway', 'fifty', 'four', 'not', 'own', 'through', 'yourselves', 'go', 'where', 'mill', 'only', 'find', 'before', 'one', 'whose', 'system', 'how', 'somewhere', 'with', 'thick', 'show', 'had', 'enough', 'should', 'to', 'must', 'whom', 'seeming', 'under', 'ours', 'has', 'might', 'thereafter', 'latterly', 'do', 'them', 'his', 'around', 'than', 'get', 'very', 'de', 'none', 'cannot', 'every', 'whether', 'they', 'front', 'during', 'thus', 'now', 'him', 'nor', 'name', 'several', 'hereafter', 'always', 'who', 'cry', 'whither', 'this', 'someone', 'either', 'each', 'become', 'thereupon', 'sometime', 'side', 'two', 'therein', 'twelve', 'because', 'often', 'ten', 'our', 'eg', 'some', 'back', 'up', 'namely', 'towards', 'are', 'further', 'beyond', 'ourselves', 'yet', 'out', 'even', 'will', 'what', 'still', 'for', 'bottom', 'mine', 'since', 'please', 'forty', 'per', 'its', 'everything', 'behind', 'un', 'above', 'between', 'it', 'neither', 'seemed', 'ever', 'across', 'she', 'somehow', 'be', 'we', 'full', 'never', 'sixty', 'however', 'here', 'otherwise', 'were', 'whereupon', 'nowhere', 'although', 'found', 'alone', 're', 'along', 'fifteen', 'by', 'both', 'about', 'last', 'would', 'anything', 'via', 'many', 'could', 'thence', 'put', 'against', 'keep', 'etc', 'amount', 'became', 'ltd', 'hence', 'onto', 'or', 'con', 'among', 'already', 'co', 'afterwards', 'formerly', 'within', 'seems', 'into', 'others', 'while', 'whatever', 'except', 'down', 'hers', 'everyone', 'done', 'least', 'another', 'whoever', 'moreover', 'couldnt', 'throughout', 'anyhow', 'yourself', 'three', 'from', 'her', 'few', 'together', 'top', 'there', 'due', 'been', 'next', 'anyone', 'eleven', 'much', 'call', 'therefore', 'interest', 'then', 'thru', 'themselves', 'hundred', 'was', 'sincere', 'empty', 'more', 'himself', 'elsewhere', 'mostly', 'on', 'fire', 'am', 'becoming', 'hereby', 'amongst', 'else', 'part', 'everywhere', 'too', 'herself', 'former', 'those', 'he', 'me', 'myself', 'made', 'twenty', 'these', 'bill', 'cant', 'us', 'until', 'besides', 'nevertheless', 'below', 'anywhere', 'nine', 'can', 'of', 'your', 'toward', 'my', 'something', 'and', 'whereafter', 'whenever', 'give', 'almost', 'wherever', 'is', 'describe', 'beforehand', 'herein', 'an', 'as', 'itself', 'at', 'have', 'in', 'seem', 'whence', 'ie', 'any', 'fill', 'again', 'hasnt', 'inc', 'thereby', 'thin', 'no', 'perhaps', 'latter', 'meanwhile', 'when', 'detail', 'same', 'wherein', 'beside', 'also', 'that', 'other', 'take', 'which', 'becomes', 'you', 'if', 'nobody', 'see', 'though', 'may', 'after', 'upon', 'most', 'hereupon', 'eight', 'but', 'serious', 'nothing', 'such', 'why', 'a', 'off', 'whereby', 'third', 'i', 'whole', 'noone', 'sometimes', 'well', 'amoungst', 'yours', 'their', 'rather', 'without', 'so', 'five', 'the', 'first', 'whereas', 'once']\n",
+    "\n",
+    "# Write your code below\n",
+    "# Rebuild bag_of_words without stop words. (Calling remove() while\n",
+    "# iterating over the same list skips elements, so build a new list.)\n",
+    "bag_of_words = [word for word in bag_of_words if word not in stop_words]\n",
+    "\n",
+    "bag_of_words"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 50,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "[[1, 1, 0, 0], [1, 0, 1, 0], [1, 0, 0, 1]]\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Recompute term_freq with the filtered bag_of_words\n",
+    "term_freq = []\n",
+    "\n",
+    "# Outer loop to process each document in the corpus\n",
+    "for document in corpus:\n",
+    "    # Split the document into terms\n",
+    "    terms = document.split()\n",
+    "\n",
+    "    # Frequency of each word in bag_of_words for this document\n",
+    "    doc_term_freq = []\n",
+    "\n",
+    "    # Inner loop to process each word in bag_of_words\n",
+    "    for word in bag_of_words:\n",
+    "        # Count occurrences of the word in the current document\n",
+    "        count = terms.count(word)\n",
+    "        # Append this count to doc_term_freq\n",
+    "        doc_term_freq.append(count)\n",
+    "\n",
+    "    # Append the term frequency list for this document to term_freq\n",
+    "    term_freq.append(doc_term_freq)\n",
+    "\n",
+    "# Display the term frequency for each document\n",
+    "print(term_freq)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "2D0dq58ryquC"
+   },
+   "source": [
+    "## Additional Challenge for the Nerds\n",
+    "\n",
+    "We will learn Scikit-Learn, which has a built-in BoW feature, in Module 3. Try to use Scikit-Learn to generate the BoW for this challenge and check whether its output matches yours. You will need to do some googling to find out how to use Scikit-Learn to generate a BoW.\n",
+    "\n",
+    "**Notes:**\n",
+    "\n",
+    "* To install Scikit-Learn, use `pip install scikit-learn`.\n",
+    "\n",
+    "* Scikit-Learn's `CountVectorizer` does not remove stop words unless you ask it to (e.g. with `stop_words='english'`), but its default tokenizer ignores single-character tokens such as `a` and `i`, which is why they are missing from the output below.\n",
+    "\n",
+    "* Scikit-Learn's output has a slightly different format from the output example demonstrated above. That's OK; you don't need to convert the Scikit-Learn output.\n",
+    "\n",
+    "The Scikit-Learn output will look like this:\n",
+    "\n",
+    "```python\n",
+    "# BoW:\n",
+    "{'love': 5, 'ironhack': 3, 'student': 6, 'is': 4, 'cool': 2, 'am': 0, 'at': 1}\n",
+    "\n",
+    "# term_freq:\n",
+    "[[0 0 1 1 1 0 0]\n",
+    " [0 0 0 1 0 1 0]\n",
+    " [1 1 0 1 0 0 1]]\n",
+    "```"
+   ]
+  },
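+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "If you want to check your attempt, here is a minimal sketch of one way to do it with Scikit-Learn's `CountVectorizer` (assuming `scikit-learn` is installed):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from sklearn.feature_extraction.text import CountVectorizer\n",
+    "\n",
+    "# Fit the vectorizer on the corpus; pass stop_words='english'\n",
+    "# if you also want the stop words removed\n",
+    "vectorizer = CountVectorizer()\n",
+    "X = vectorizer.fit_transform(corpus)\n",
+    "\n",
+    "print(vectorizer.vocabulary_)  # term -> column index mapping\n",
+    "print(X.toarray())             # term-frequency matrix, one row per doc"
+   ]
+  },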
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "colab": {
+   "provenance": []
+  },
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.14.2"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}