This repository contains the implementation and report for an NLP course homework focused on Persian text processing on the Taaghche comments dataset. It covers:
- Document-level sentiment classification (Negative / Neutral / Positive)
- Word/token-level sequence labeling (NER-style tagging) using crawled book metadata (book name, author, translator, publisher)
- A lightweight web crawler + extractor for collecting Taaghche book-page metadata to support entity list construction and labeling.
Report: see `document/Report.pdf` (source: `document/Report.tex`).
- `DocClassifier_Base.ipynb`: baseline document sentiment classifier (TF‑IDF + Logistic Regression), including preprocessing, class balancing, hyperparameter search, and evaluation.
- `DocClassifier_transformer.ipynb`: transformer-based document sentiment classifier using a Persian BERT model (e.g., `HooshvareLab/bert-fa-base-uncased`) with TensorFlow / Hugging Face.
- `WordClassifier_Base.ipynb`: baseline token-level classifier (sequence labeling) using classical preprocessing + a BiLSTM-style model for entity tagging.
- `WordClassifier_Transformer.ipynb`: transformer-based token-level classifier (BERT for token classification).
- `crawler/crawler.py`: downloads Taaghche book pages by iterating over book IDs and saving HTML files.
- `crawler/extractor.py`: parses saved HTML pages, extracts embedded JSON-LD, and writes book metadata to CSV.
- `document/Report.pdf`: compiled project report
- `document/Report.tex`: LaTeX source of the report and explanations
- `document/img/`: figures used in the report
Other directories (e.g., datasets/, resources/, stopwords/) are used by the notebooks for data and assets; exact contents may depend on how/where the notebooks were executed (local/Colab/Kaggle).
- Labeling: ratings are mapped to sentiment classes via thresholds (negative/neutral/positive).
- Preprocessing: Persian-character filtering and whitespace normalization.
- Class balancing: downsampling to equalize class sizes.
- Baseline: TF‑IDF + Logistic Regression (grid search over `ngram_range`, `max_features`, `C`).
- Transformer: fine-tuning a Persian BERT model for sequence classification.
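The baseline pipeline above can be sketched as follows. The rating thresholds and hyperparameter grids here are illustrative assumptions, not the exact values used in `DocClassifier_Base.ipynb`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

def rate_to_label(rate: float) -> str:
    # Illustrative thresholds; the cut-offs used in the notebook may differ.
    if rate <= 2.0:
        return "negative"
    if rate < 4.0:
        return "neutral"
    return "positive"

# TF-IDF features feeding a Logistic Regression classifier.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Grid over the same hyperparameters named in the README; values are examples.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "tfidf__max_features": [10_000, 50_000],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
# search.fit(train_texts, train_labels)  # run on the balanced training split
```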
- Crawls Taaghche book pages (by numeric ID range).
- Extracts JSON-LD metadata (book name, author(s), translator(s), publisher).
- Outputs CSV chunks for later analysis / entity list creation.
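The JSON-LD extraction step can be sketched as below. Field names follow schema.org `Book` markup; the exact keys and shapes on Taaghche pages may differ from this assumption, and the real `extractor.py` may handle more cases:

```python
import json
from bs4 import BeautifulSoup

def extract_book_metadata(html: str) -> dict:
    """Pull book metadata out of embedded JSON-LD (minimal sketch)."""
    soup = BeautifulSoup(html, "html.parser")
    for script in soup.find_all("script", type="application/ld+json"):
        try:
            data = json.loads(script.string or "")
        except json.JSONDecodeError:
            continue  # skip malformed blocks
        if isinstance(data, dict) and data.get("@type") == "Book":
            return {
                "name": data.get("name"),
                "author": [a.get("name") for a in data.get("author", [])
                           if isinstance(a, dict)],
                "publisher": (data.get("publisher") or {}).get("name"),
            }
    return {}
```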
- Creates labels for tokens using book metadata fields and a BIO tagging scheme (e.g., `B-Author`, `I-Author`, `O`).
- Trains:
  - a baseline BiLSTM model
  - a transformer token-classification model
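The BIO labeling idea can be illustrated with a hypothetical helper that tags occurrences of a known entity span (e.g., an author name from the crawled metadata) inside a tokenized comment; the notebooks' matching logic may be more involved:

```python
def bio_tag(tokens: list, entity_tokens: list, entity_type: str) -> list:
    """Mark every occurrence of `entity_tokens` in `tokens` with
    B-/I-<entity_type> tags; all other tokens get "O"."""
    tags = ["O"] * len(tokens)
    n = len(entity_tokens)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == entity_tokens:
            tags[i] = f"B-{entity_type}"
            for j in range(i + 1, i + n):
                tags[j] = f"I-{entity_type}"
    return tags
```

For example, tagging a two-token author name yields `["O", "B-Author", "I-Author"]` for a sentence where the name starts at the second token.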
The report includes experiments and evaluation (classification report, confusion matrix, and overall metrics). In the baseline document sentiment setup (TF‑IDF + Logistic Regression), the reported accuracy is approximately 0.61 on the evaluated split (see document/Report.tex / document/Report.pdf for full details and figures).
Run the notebooks in Jupyter/Colab/Kaggle:
- `DocClassifier_Base.ipynb`
- `DocClassifier_transformer.ipynb`
- `WordClassifier_Base.ipynb`
- `WordClassifier_Transformer.ipynb`
Because the notebooks were executed in different environments (paths like Colab/Kaggle appear in the report), you may need to adjust:
- dataset paths (e.g., `datasets/taghche.csv`)
- output paths
- GPU/runtime settings
From the repository root:
```bash
pip install requests beautifulsoup4 pandas
python crawler/crawler.py
python crawler/extractor.py
```

This produces:
- HTML pages in `book_pages/`
- extracted CSV files in `output_csv/`
Note: Be mindful of rate limits and respectful crawling. The script uses delays and threading; you may want to reduce concurrency depending on network/policy constraints.
Typical packages used across notebooks/scripts include:
`pandas`, `numpy`, `scikit-learn`, `tensorflow` (transformer notebooks), `transformers`, `hazm` (Persian preprocessing/tokenization), `beautifulsoup4`, `requests`, `matplotlib`, `seaborn`, `plotly`
- Ilia Hashemi Rad
- AmirMohammad Fakhimi
- AmirMahdi Namjoo
(See the report for identifiers and full submission details.)
This project is licensed under the terms of the LICENSE file in this repository.