This repository contains the implementation and report for an NLP course homework focused on Persian text processing on the Taaghche comments dataset. It covers:
- Document-level sentiment classification (Negative / Neutral / Positive)
- Word/token-level sequence labeling (NER-style tagging) using crawled book metadata (book name, author, translator, publisher)
- A lightweight web crawler + extractor for collecting Taaghche book-page metadata to support entity list construction and labeling.
Report: see `document/Report.pdf` (source: `document/Report.tex`).
- `DocClassifier_Base.ipynb`: baseline document sentiment classifier (TF‑IDF + Logistic Regression), including preprocessing, class balancing, hyperparameter search, and evaluation.
- `DocClassifier_transformer.ipynb`: transformer-based document sentiment classifier using a Persian BERT model (e.g., `HooshvareLab/bert-fa-base-uncased`) with TensorFlow / Hugging Face.
- `WordClassifier_Base.ipynb`: baseline token-level classifier (sequence labeling) using classical preprocessing + a BiLSTM-style model for entity tagging.
- `WordClassifier_Transformer.ipynb`: transformer-based token-level classifier (BERT for token classification).
- `crawler/crawler.py`: downloads Taaghche book pages by iterating over book IDs and saving HTML files.
- `crawler/extractor.py`: parses saved HTML pages, extracts embedded JSON-LD, and writes book metadata to CSV.
- `document/Report.pdf`: compiled project report
- `document/Report.tex`: LaTeX source of the report and explanations
- `document/img/`: figures used in the report
Other directories (e.g., datasets/, resources/, stopwords/) are used by the notebooks for data and assets; exact contents may depend on how/where the notebooks were executed (local/Colab/Kaggle).
- Labeling: ratings are mapped to sentiment classes via thresholds (negative/neutral/positive).
- Preprocessing: Persian-character filtering and whitespace normalization.
- Class balancing: downsampling to equalize class sizes.
- Baseline: TF‑IDF + Logistic Regression (grid search over `ngram_range`, `max_features`, `C`).
- Transformer: fine-tuning a Persian BERT model for sequence classification.
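The baseline pipeline above can be sketched as follows. The rating thresholds and hyperparameter grids here are illustrative assumptions, not the exact values used in `DocClassifier_Base.ipynb`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

def rate_to_label(rate: float) -> str:
    # Illustrative thresholds; the cut-offs used in the notebook may differ.
    if rate <= 2.0:
        return "negative"
    if rate < 4.0:
        return "neutral"
    return "positive"

# TF-IDF features feeding a Logistic Regression classifier.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Grid over the same hyperparameters named in the README; values are examples.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "tfidf__max_features": [10_000, 50_000],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
# search.fit(train_texts, train_labels)  # run on the balanced training split
```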
- Crawls Taaghche book pages (by numeric ID range).
- Extracts JSON-LD metadata (book name, author(s), translator(s), publisher).
- Outputs CSV chunks for later analysis / entity list creation.
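The JSON-LD extraction step can be sketched as below. Field names follow schema.org `Book` markup; the exact keys and shapes on Taaghche pages may differ from this assumption, and the real `extractor.py` may handle more cases:

```python
import json
from bs4 import BeautifulSoup

def extract_book_metadata(html: str) -> dict:
    """Pull book metadata out of embedded JSON-LD (minimal sketch)."""
    soup = BeautifulSoup(html, "html.parser")
    for script in soup.find_all("script", type="application/ld+json"):
        try:
            data = json.loads(script.string or "")
        except json.JSONDecodeError:
            continue  # skip malformed blocks
        if isinstance(data, dict) and data.get("@type") == "Book":
            return {
                "name": data.get("name"),
                "author": [a.get("name") for a in data.get("author", [])
                           if isinstance(a, dict)],
                "publisher": (data.get("publisher") or {}).get("name"),
            }
    return {}
```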
- Creates labels for tokens using book metadata fields and a BIO tagging scheme (e.g., `B-Author`, `I-Author`, `O`).
- Trains:
  - a baseline BiLSTM model
  - a transformer token-classification model
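The BIO labeling idea can be illustrated with a hypothetical helper that tags occurrences of a known entity span (e.g., an author name from the crawled metadata) inside a tokenized comment; the notebooks' matching logic may be more involved:

```python
def bio_tag(tokens: list, entity_tokens: list, entity_type: str) -> list:
    """Mark every occurrence of `entity_tokens` in `tokens` with
    B-/I-<entity_type> tags; all other tokens get "O"."""
    tags = ["O"] * len(tokens)
    n = len(entity_tokens)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == entity_tokens:
            tags[i] = f"B-{entity_type}"
            for j in range(i + 1, i + n):
                tags[j] = f"I-{entity_type}"
    return tags
```

For example, tagging a two-token author name yields `["O", "B-Author", "I-Author"]` for a sentence where the name starts at the second token.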
The report includes experiments and evaluation (classification report, confusion matrix, and overall metrics). In the baseline document sentiment setup (TF‑IDF + Logistic Regression), the reported accuracy is approximately 0.61 on the evaluated split (see document/Report.tex / document/Report.pdf for full details and figures).
Run the notebooks in Jupyter/Colab/Kaggle:
- `DocClassifier_Base.ipynb`
- `DocClassifier_transformer.ipynb`
- `WordClassifier_Base.ipynb`
- `WordClassifier_Transformer.ipynb`
Because the notebooks were executed in different environments (paths like Colab/Kaggle appear in the report), you may need to adjust:
- dataset paths (e.g., `datasets/taghche.csv`)
- output paths
- GPU/runtime settings
From the repository root:
```bash
pip install requests beautifulsoup4 pandas
python crawler/crawler.py
python crawler/extractor.py
```

This produces:
- HTML pages in `book_pages/`
- extracted CSV files in `output_csv/`
Note: Be mindful of rate limits and respectful crawling. The script uses delays and threading; you may want to reduce concurrency depending on network/policy constraints.
Typical packages used across notebooks/scripts include:
`pandas`, `numpy`, `scikit-learn`, `tensorflow` (transformer notebooks), `transformers`, `hazm` (Persian preprocessing/tokenization), `beautifulsoup4`, `requests`, `matplotlib`, `seaborn`, `plotly`
- Ilia Hashemi Rad
- AmirMohammad Fakhimi
- AmirMahdi Namjoo
(See the report for identifiers and full submission details.)
This project is licensed under the terms of the LICENSE file in this repository.