Short Text Language Identification: A Comparative Study

![Python](https://img.shields.io/badge/Python-3.8%2B-blue) ![Scikit-learn](https://img.shields.io/badge/scikit--learn-1.3.0-orange) ![License](https://img.shields.io/badge/License-MIT-green.svg)

📊 Project Overview

This project implements and evaluates four machine learning models for short text language identification across 17 languages using identical preprocessing and feature engineering pipelines. We analyze three distinct datasets to understand how data characteristics impact model performance.

🎯 Key Objectives

  • Compare performance of Logistic Regression, Linear SVM, Multinomial Naive Bayes, and Random Forest
  • Evaluate across three diverse datasets: Tatoeba, Twitter, and WiLI-2018
  • Analyze the impact of data quality on language identification accuracy
  • Identify which languages are easiest/hardest to classify

📈 Results Summary

| Dataset | Best Model | Accuracy | Macro-F1 | Status |
|---|---|---|---|---|
| Tatoeba | Linear SVM | 99.38% | 0.9938 | ✅ Excellent |
| WiLI-2018 | Linear SVM | 97-100% | N/A | ✅ Excellent |
| Twitter | Multinomial NB | 33.33% | 0.2222 | ⚠️ Data-limited |

๐Ÿ† Key Findings

  1. โœ… Linear SVM achieves state-of-the-art performance (99.38% accuracy on clean data)
  2. โœ… Character n-grams (2-4) are highly effective for curated text
  3. โœ… Data quality > Model complexity (66% accuracy gap vs. 2.76% between models)
  4. โœ… Unique scripts = Perfect classification (Arabic, Hebrew, Japanese: 100% accuracy)
  5. โš ๏ธ Traditional methods require sufficient data (Twitter dataset had only 3 test samples)

🔤 Supported Languages (17)

| Code | Language | Script | Accuracy (Tatoeba) |
|---|---|---|---|
| ara | Arabic | Arabic | 100% |
| deu | German | Latin | 99.50% |
| eng | English | Latin | 99.50% |
| epo | Esperanto | Latin | 99.50% |
| fra | French | Latin | 99.00% |
| heb | Hebrew | Hebrew | 100% |
| hun | Hungarian | Latin | 99.75% |
| ita | Italian | Latin | 98.27% |
| jpn | Japanese | Japanese | 100% |
| nds | Low Saxon | Latin | 97.76% |
| nld | Dutch | Latin | 98.75% |
| pol | Polish | Latin | 100% |
| por | Portuguese | Latin | 98.75% |
| rus | Russian | Cyrillic | 97.51% |
| spa | Spanish | Latin | 97.76% |
| tur | Turkish | Latin | 99.50% |
| ukr | Ukrainian | Cyrillic | 97.49% |

๐Ÿ—‚๏ธ Datasets

1. Tatoeba Dataset

  • Source: Tatoeba Project
  • Type: Curated translation sentences
  • Samples: 17,000 (1,000 per language)
  • Quality: โญโญโญโญโญ Clean, grammatically correct
  • Performance: 99.38% accuracy

2. Twitter Dataset

  • Source: Kaggle
  • Type: Social media posts
  • Samples: Limited (3 test samples)
  • Quality: โญ Noisy, informal, insufficient
  • Performance: 33.33% accuracy (data-limited)

3. WiLI-2018 Dataset

  • Source: Wikipedia articles
  • Type: Formal encyclopedic text
  • Samples: ~17,000
  • Quality: โญโญโญโญโญ Professional, edited
  • Performance: 97-100% per-language accuracy

๐Ÿ› ๏ธ Installation

Prerequisites

  • Python 3.8 or higher
  • pip package manager
  • Jupyter Notebook or Google Colab

Quick Start

```bash
# Clone the repository
git clone https://github.com/1ms23cs199-dotcomValueError/ShortTextLanguageID.git
cd ShortTextLanguageID

# Install dependencies
pip install pandas numpy regex scikit-learn scipy matplotlib seaborn tqdm

# Run the notebook
jupyter notebook ShortTextLanguageID.ipynb
```

Dependencies

```text
pandas >= 2.0.0
numpy >= 1.24.0
regex >= 2023.10.3
scikit-learn >= 1.3.0
scipy >= 1.11.0
matplotlib >= 3.7.0
seaborn >= 0.12.0
tqdm >= 4.65.0
```

🚀 Usage

Google Colab (Recommended)

```python
# Clone repository
!git clone https://github.com/1ms23cs199-dotcomValueError/ShortTextLanguageID.git
%cd ShortTextLanguageID

# Install dependencies
!pip install -q pandas numpy regex scikit-learn scipy matplotlib seaborn tqdm

# Run the notebook or execute cells
```

Kaggle

```python
# Clone repository
!git clone https://github.com/1ms23cs199-dotcomValueError/ShortTextLanguageID.git
%cd ShortTextLanguageID

# Install dependencies (most are pre-installed)
!pip install -q regex

# Import and run
```

Local Jupyter

```bash
jupyter notebook ShortTextLanguageID.ipynb
```

๐Ÿ“ Project Structure

ShortTextLanguageID/
โ”œโ”€โ”€ ShortTextLanguageID.ipynb    # Main analysis notebook
โ”œโ”€โ”€ README.md                     # This file
โ”œโ”€โ”€ data/                         # Downloaded datasets (auto-generated)
โ”‚   โ”œโ”€โ”€ tatoeba_balanced.csv
โ”‚   โ”œโ”€โ”€ twitter_data.csv
โ”‚   โ””โ”€โ”€ wili_data.csv
โ””โ”€โ”€ results/                      # Output files (auto-generated)
    โ”œโ”€โ”€ confusion_matrices/
    โ”œโ”€โ”€ model_comparisons/
    โ””โ”€โ”€ per_language_performance/

🤖 Models Evaluated

1. Logistic Regression

  • Type: Linear classifier
  • Tatoeba Accuracy: 99.00%
  • Pros: Fast, interpretable
  • Cons: Slightly lower accuracy than SVM

2. Linear SVM ๐Ÿ†

  • Type: Support Vector Machine (linear kernel)
  • Tatoeba Accuracy: 99.38% (BEST)
  • Pros: Highest accuracy, fast training
  • Cons: No built-in probability estimates

3. Multinomial Naive Bayes

  • Type: Probabilistic classifier
  • Tatoeba Accuracy: 99.29%
  • Pros: Very fast, close to SVM performance
  • Cons: Assumes feature independence

4. Random Forest

  • Type: Ensemble (tree-based)
  • Tatoeba Accuracy: 96.62%
  • Pros: Handles non-linear patterns
  • Cons: Slower training, lower accuracy on sparse features

🔬 Methodology

1. Data Collection

  • Download from Tatoeba, Twitter (Kaggle), WiLI-2018
  • Balance datasets (1,000 samples per language)

2. Preprocessing

  • Text cleaning (lowercase, strip whitespace)
  • Remove special characters
  • Normalize Unicode
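The cleaning steps above can be sketched as follows (a minimal illustration; the exact normalization form and regex rules in the notebook may differ):

```python
import re
import unicodedata

def clean_text(text: str) -> str:
    """Normalize Unicode, lowercase, strip special characters/digits,
    and collapse whitespace (illustrative, not the notebook's code)."""
    text = unicodedata.normalize("NFKC", text)
    text = text.lower().strip()
    # Keep letters from any script plus spaces; drop punctuation and digits.
    text = re.sub(r"[^\w\s]|\d|_", " ", text)
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("  Héllo, Wörld! 123 "))  # héllo wörld
```

Note that the character filter is Unicode-aware, so accented Latin, Cyrillic, Hebrew, Arabic, and Japanese characters all survive cleaning.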

3. Feature Engineering

  • Character n-grams (n=2, 3, 4)
  • TF-IDF vectorization
  • 10,000 max features
  • Sublinear TF scaling

4. Model Training

  • 80-20 train-test split
  • Stratified sampling (balanced classes)
  • Identical hyperparameters across datasets
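Under these settings, the training loop can be sketched as below. The toy texts and labels are hypothetical stand-ins for the real corpora, and the hyperparameters shown are reasonable defaults rather than the notebook's exact values:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus (the real datasets hold 1,000 samples per language).
texts = ["guten morgen", "good morning", "bonjour le monde",
         "wie geht es dir", "how are you", "comment allez vous"] * 10
labels = ["deu", "eng", "fra", "deu", "eng", "fra"] * 10

# 80-20 stratified split, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Linear SVM": LinearSVC(),
    "Multinomial NB": MultinomialNB(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

scores = {}
for name, clf in models.items():
    # Identical TF-IDF front end for every model (see Feature Engineering).
    pipe = make_pipeline(
        TfidfVectorizer(analyzer="char", ngram_range=(2, 4),
                        max_features=10_000, sublinear_tf=True),
        clf,
    )
    pipe.fit(X_train, y_train)
    scores[name] = pipe.score(X_test, y_test)
    print(f"{name}: {scores[name]:.2%}")
```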

5. Evaluation

  • Accuracy, Macro-F1 score
  • Per-language precision, recall, F1
  • Confusion matrices
  • Cross-dataset comparison
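All of these metrics come straight from `sklearn.metrics`; a small worked example with hypothetical predictions for three languages:

```python
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, f1_score)

# Hypothetical true/predicted labels.
y_true = ["deu", "eng", "fra", "deu", "eng", "fra"]
y_pred = ["deu", "eng", "fra", "deu", "fra", "fra"]

print("Accuracy:", accuracy_score(y_true, y_pred))             # 5/6 correct
print("Macro-F1:", f1_score(y_true, y_pred, average="macro"))
print(classification_report(y_true, y_pred, digits=4))         # per-language P/R/F1
print(confusion_matrix(y_true, y_pred, labels=["deu", "eng", "fra"]))
```

Macro-F1 averages the per-language F1 scores with equal weight, so a language with few samples counts as much as a large one — the reason it is reported alongside accuracy.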

📊 Detailed Results

Model Comparison (Tatoeba Dataset)

| Model | Accuracy | Macro-F1 | Training Time |
|---|---|---|---|
| Linear SVM | 99.38% | 0.9938 | ~15 sec |
| Multinomial NB | 99.29% | 0.9929 | ~5 sec |
| Logistic Regression | 99.00% | 0.9900 | ~10 sec |
| Random Forest | 96.62% | 0.9661 | ~180 sec |

Easiest Languages to Classify

Languages with 100% accuracy:

  • 🇸🇦 Arabic (ara) - unique script
  • 🇮🇱 Hebrew (heb) - unique script
  • 🇯🇵 Japanese (jpn) - unique script
  • 🇵🇱 Polish (pol) - Latin script, but highly distinctive diacritics

Hardest Languages to Classify

Languages with the most confusion (Romance family, ordered by accuracy):

  • 🇪🇸 Spanish (spa) - 97.76%
  • 🇮🇹 Italian (ita) - 98.27%
  • 🇵🇹 Portuguese (por) - 98.75%

Reason: shared Latin script and similar character patterns. Ukrainian (97.49%) and Russian (97.51%) show the same effect within Cyrillic.


💡 Key Insights

1. Data Quality Dominates Performance

  • 66-percentage-point gap between datasets (Tatoeba 99.38% vs. Twitter 33.33%)
  • 2.76-percentage-point gap between models on same data
  • Conclusion: Invest in data quality > algorithm selection (24x more impact)

2. Character N-Grams Are Highly Effective

  • Capture language-specific orthographic patterns
  • Examples:
    • English: "th", "ing", "tion"
    • German: "sch", "ich", "ein"
    • French: "eau", "tion", "ent"
    • Japanese: "です", "ます", "した"
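To make this concrete, here is a tiny helper (not from the notebook) that enumerates the overlapping character n-grams a vectorizer would extract from a string:

```python
def char_ngrams(text, n_min=2, n_max=4):
    """List all character n-grams of length n_min..n_max (illustrative)."""
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]

print(char_ngrams("tion", 2, 3))  # ['ti', 'io', 'on', 'tio', 'ion']
print(char_ngrams("sch", 3, 3))   # ['sch']
```

Because such substrings recur with language-specific frequencies, TF-IDF over them separates languages even in short sentences.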

3. Linear Models Outperform Complex Ones

  • Linear SVM (99.38%) > Random Forest (96.62%)
  • Reason: Sparse, high-dimensional TF-IDF features favor linear decision boundaries
  • Random Forest struggles with sparse binary features

4. Script Uniqueness = Easy Classification

  • Languages with unique alphabets: 100% accuracy
  • Languages sharing scripts (Latin, Cyrillic): Slight confusion

5. Social Media Text Requires Different Approaches

  • Traditional methods failed on Twitter (33.33%)
  • Issues: Insufficient data (3 test samples), noise, slang, code-switching
  • Recommendation: Use deep learning (mBERT, XLM-R) for social media

๐Ÿ“ Limitations

Twitter Dataset

  • โš ๏ธ Only 3 test samples (statistically invalid)
  • โš ๏ธ Possible task mismatch (sentiment labels vs. language ID)
  • โš ๏ธ High class imbalance (models predict only majority class)
  • โš ๏ธ No meaningful learning occurred

Conclusion: Results demonstrate failure mode when data is insufficient, highlighting the importance of proper dataset curation.


🔮 Future Work

  1. 📌 Collect a properly sized Twitter dataset (1,000+ samples per language)
  2. 📌 Test on code-switched text (bilingual users)
  3. 📌 Compare with transformer models (mBERT, XLM-RoBERTa)
  4. 📌 Develop hybrid word+character n-gram features
  5. 📌 Build language family-specific classifiers
  6. 📌 Evaluate on streaming text (character-by-character detection)
  7. 📌 Handle extremely short text (<20 characters)
  8. 📌 Add more low-resource languages

🎓 Academic Context

This project was developed as part of a comparative study on short text language identification methods. The goal was to evaluate traditional machine learning approaches across datasets with varying characteristics and quality levels.

Institution: [Your College/University]
Course: [Your Course Name]
Date: December 2025


👥 Contributors

  • Student 1: Tatoeba Dataset Analysis
  • Student 2: Twitter Dataset Analysis
  • Student 3: WiLI-2018 Dataset Analysis
  • Student 4: Comparative Analysis & Report Compilation

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


๐Ÿ™ Acknowledgments


📞 Contact

Project Repository: https://github.com/1ms23cs199-dotcomValueError/ShortTextLanguageID

For questions or collaborations:


📚 References

  1. Joulin, A., et al. (2016). "FastText.zip: Compressing text classification models"
  2. Jaech, A., et al. (2016). "Phonological pun-derstanding"
  3. Thoma, M. (2018). "The WiLI benchmark dataset for written language identification"
  4. Tatoeba Project. (2024). "Collection of sentences and translations"

🌟 Star This Repository

If you found this project helpful, please consider giving it a ⭐ on GitHub!


Last Updated: December 2025
