Short Text Language Identification: A Comparative Study

![Python](https://img.shields.io/badge/Python-3.8%2B-blue) ![Scikit-learn](https://img.shields.io/badge/scikit--learn-1.3.0-orange) ![License](https://img.shields.io/badge/License-MIT-green.svg)

📊 Project Overview

This project implements and evaluates four machine learning models for short text language identification across 17 languages using identical preprocessing and feature engineering pipelines. We analyze three distinct datasets to understand how data characteristics impact model performance.

🎯 Key Objectives

  • Compare performance of Logistic Regression, Linear SVM, Multinomial Naive Bayes, and Random Forest
  • Evaluate across three diverse datasets: Tatoeba, Twitter, and WiLI-2018
  • Analyze the impact of data quality on language identification accuracy
  • Identify which languages are easiest/hardest to classify

📈 Results Summary

| Dataset | Best Model | Accuracy | Macro-F1 | Status |
|---|---|---|---|---|
| Tatoeba | Linear SVM | 99.38% | 0.9938 | ✅ Excellent |
| WiLI-2018 | Linear SVM | 97-100% | N/A | ✅ Excellent |
| Twitter | Multinomial NB | 33.33% | 0.2222 | ⚠️ Data-limited |

๐Ÿ† Key Findings

  1. โœ… Linear SVM achieves state-of-the-art performance (99.38% accuracy on clean data)
  2. โœ… Character n-grams (2-4) are highly effective for curated text
  3. โœ… Data quality > Model complexity (66% accuracy gap vs. 2.76% between models)
  4. โœ… Unique scripts = Perfect classification (Arabic, Hebrew, Japanese: 100% accuracy)
  5. โš ๏ธ Traditional methods require sufficient data (Twitter dataset had only 3 test samples)

🔤 Supported Languages (17)

| Code | Language | Script | Accuracy (Tatoeba) |
|---|---|---|---|
| ara | Arabic | Arabic | 100% |
| deu | German | Latin | 99.50% |
| eng | English | Latin | 99.50% |
| epo | Esperanto | Latin | 99.50% |
| fra | French | Latin | 99.00% |
| heb | Hebrew | Hebrew | 100% |
| hun | Hungarian | Latin | 99.75% |
| ita | Italian | Latin | 98.27% |
| jpn | Japanese | Japanese | 100% |
| nds | Low Saxon | Latin | 97.76% |
| nld | Dutch | Latin | 98.75% |
| pol | Polish | Latin | 100% |
| por | Portuguese | Latin | 98.75% |
| rus | Russian | Cyrillic | 97.51% |
| spa | Spanish | Latin | 97.76% |
| tur | Turkish | Latin | 99.50% |
| ukr | Ukrainian | Cyrillic | 97.49% |

๐Ÿ—‚๏ธ Datasets

1. Tatoeba Dataset

  • Source: Tatoeba Project
  • Type: Curated translation sentences
  • Samples: 17,000 (1,000 per language)
  • Quality: โญโญโญโญโญ Clean, grammatically correct
  • Performance: 99.38% accuracy

2. Twitter Dataset

  • Source: Kaggle
  • Type: Social media posts
  • Samples: Limited (3 test samples)
  • Quality: โญ Noisy, informal, insufficient
  • Performance: 33.33% accuracy (data-limited)

3. WiLI-2018 Dataset

  • Source: Wikipedia articles
  • Type: Formal encyclopedic text
  • Samples: ~17,000
  • Quality: โญโญโญโญโญ Professional, edited
  • Performance: 97-100% per-language accuracy

๐Ÿ› ๏ธ Installation

Prerequisites

  • Python 3.8 or higher
  • pip package manager
  • Jupyter Notebook or Google Colab

Quick Start

```bash
# Clone the repository
git clone https://github.com/1ms23cs199-dotcomValueError/ShortTextLanguageID.git
cd ShortTextLanguageID

# Install dependencies
pip install pandas numpy regex scikit-learn scipy matplotlib seaborn tqdm

# Run the notebook
jupyter notebook ShortTextLanguageID.ipynb
```

Dependencies

```text
pandas >= 2.0.0
numpy >= 1.24.0
regex >= 2023.10.3
scikit-learn >= 1.3.0
scipy >= 1.11.0
matplotlib >= 3.7.0
seaborn >= 0.12.0
tqdm >= 4.65.0
```

🚀 Usage

Google Colab (Recommended)

```python
# Clone repository
!git clone https://github.com/1ms23cs199-dotcomValueError/ShortTextLanguageID.git
%cd ShortTextLanguageID

# Install dependencies
!pip install -q pandas numpy regex scikit-learn scipy matplotlib seaborn tqdm

# Run the notebook or execute cells
```

Kaggle

```python
# Clone repository
!git clone https://github.com/1ms23cs199-dotcomValueError/ShortTextLanguageID.git
%cd ShortTextLanguageID

# Install dependencies (most are pre-installed)
!pip install -q regex

# Import and run
```

Local Jupyter

```bash
jupyter notebook ShortTextLanguageID.ipynb
```

๐Ÿ“ Project Structure

ShortTextLanguageID/
โ”œโ”€โ”€ ShortTextLanguageID.ipynb    # Main analysis notebook
โ”œโ”€โ”€ README.md                     # This file
โ”œโ”€โ”€ data/                         # Downloaded datasets (auto-generated)
โ”‚   โ”œโ”€โ”€ tatoeba_balanced.csv
โ”‚   โ”œโ”€โ”€ twitter_data.csv
โ”‚   โ””โ”€โ”€ wili_data.csv
โ””โ”€โ”€ results/                      # Output files (auto-generated)
    โ”œโ”€โ”€ confusion_matrices/
    โ”œโ”€โ”€ model_comparisons/
    โ””โ”€โ”€ per_language_performance/

🤖 Models Evaluated

1. Logistic Regression

  • Type: Linear classifier
  • Tatoeba Accuracy: 99.00%
  • Pros: Fast, interpretable
  • Cons: Slightly lower accuracy than SVM

2. Linear SVM ๐Ÿ†

  • Type: Support Vector Machine (linear kernel)
  • Tatoeba Accuracy: 99.38% (BEST)
  • Pros: Highest accuracy, fast training
  • Cons: No built-in probability estimates

3. Multinomial Naive Bayes

  • Type: Probabilistic classifier
  • Tatoeba Accuracy: 99.29%
  • Pros: Very fast, close to SVM performance
  • Cons: Assumes feature independence

4. Random Forest

  • Type: Ensemble (tree-based)
  • Tatoeba Accuracy: 96.62%
  • Pros: Handles non-linear patterns
  • Cons: Slower training, lower accuracy on sparse features

🔬 Methodology

1. Data Collection

  • Download from Tatoeba, Twitter (Kaggle), WiLI-2018
  • Balance datasets (1,000 samples per language)

2. Preprocessing

  • Text cleaning (lowercase, strip whitespace)
  • Remove special characters
  • Normalize Unicode
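The cleaning steps above can be sketched as follows (a minimal illustration; the exact normalization form and regex rules in the notebook may differ):

```python
import re
import unicodedata

def clean_text(text: str) -> str:
    """Normalize Unicode, lowercase, strip special characters/digits,
    and collapse whitespace (illustrative, not the notebook's code)."""
    text = unicodedata.normalize("NFKC", text)
    text = text.lower().strip()
    # Keep letters from any script plus spaces; drop punctuation and digits.
    text = re.sub(r"[^\w\s]|\d|_", " ", text)
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("  Héllo, Wörld! 123 "))  # héllo wörld
```

Note that the character filter is Unicode-aware, so accented Latin, Cyrillic, Hebrew, Arabic, and Japanese characters all survive cleaning.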

3. Feature Engineering

  • Character n-grams (n=2, 3, 4)
  • TF-IDF vectorization
  • 10,000 max features
  • Sublinear TF scaling

4. Model Training

  • 80-20 train-test split
  • Stratified sampling (balanced classes)
  • Identical hyperparameters across datasets
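Under these settings, the training loop can be sketched as below. The toy texts and labels are hypothetical stand-ins for the real corpora, and the hyperparameters shown are reasonable defaults rather than the notebook's exact values:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus (the real datasets hold 1,000 samples per language).
texts = ["guten morgen", "good morning", "bonjour le monde",
         "wie geht es dir", "how are you", "comment allez vous"] * 10
labels = ["deu", "eng", "fra", "deu", "eng", "fra"] * 10

# 80-20 stratified split, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Linear SVM": LinearSVC(),
    "Multinomial NB": MultinomialNB(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

scores = {}
for name, clf in models.items():
    # Identical TF-IDF front end for every model (see Feature Engineering).
    pipe = make_pipeline(
        TfidfVectorizer(analyzer="char", ngram_range=(2, 4),
                        max_features=10_000, sublinear_tf=True),
        clf,
    )
    pipe.fit(X_train, y_train)
    scores[name] = pipe.score(X_test, y_test)
    print(f"{name}: {scores[name]:.2%}")
```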

5. Evaluation

  • Accuracy, Macro-F1 score
  • Per-language precision, recall, F1
  • Confusion matrices
  • Cross-dataset comparison
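All of these metrics come straight from `sklearn.metrics`; a small worked example with hypothetical predictions for three languages:

```python
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, f1_score)

# Hypothetical true/predicted labels.
y_true = ["deu", "eng", "fra", "deu", "eng", "fra"]
y_pred = ["deu", "eng", "fra", "deu", "fra", "fra"]

print("Accuracy:", accuracy_score(y_true, y_pred))             # 5/6 correct
print("Macro-F1:", f1_score(y_true, y_pred, average="macro"))
print(classification_report(y_true, y_pred, digits=4))         # per-language P/R/F1
print(confusion_matrix(y_true, y_pred, labels=["deu", "eng", "fra"]))
```

Macro-F1 averages the per-language F1 scores with equal weight, so a language with few samples counts as much as a large one — the reason it is reported alongside accuracy.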

📊 Detailed Results

Model Comparison (Tatoeba Dataset)

| Model | Accuracy | Macro-F1 | Training Time |
|---|---|---|---|
| Linear SVM | 99.38% | 0.9938 | ~15 sec |
| Multinomial NB | 99.29% | 0.9929 | ~5 sec |
| Logistic Regression | 99.00% | 0.9900 | ~10 sec |
| Random Forest | 96.62% | 0.9661 | ~180 sec |

Easiest Languages to Classify

Languages with 100% accuracy:

  • 🇸🇦 Arabic (ara) - unique script
  • 🇮🇱 Hebrew (heb) - unique script
  • 🇯🇵 Japanese (jpn) - unique script
  • 🇵🇱 Polish (pol) - Latin script, but highly distinctive diacritics

Hardest Languages to Classify

Languages with the most confusion (Romance family, ordered by accuracy):

  • 🇪🇸 Spanish (spa) - 97.76%
  • 🇮🇹 Italian (ita) - 98.27%
  • 🇵🇹 Portuguese (por) - 98.75%

Reason: shared Latin script and similar character patterns. Ukrainian (97.49%) and Russian (97.51%) show the same effect within Cyrillic.


💡 Key Insights

1. Data Quality Dominates Performance

  • 66-percentage-point gap between datasets (Tatoeba 99.38% vs. Twitter 33.33%)
  • 2.76-percentage-point gap between models on same data
  • Conclusion: Invest in data quality > algorithm selection (24x more impact)

2. Character N-Grams Are Highly Effective

  • Capture language-specific orthographic patterns
  • Examples:
    • English: "th", "ing", "tion"
    • German: "sch", "ich", "ein"
    • French: "eau", "tion", "ent"
    • Japanese: "です", "ます", "した"
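To make this concrete, here is a tiny helper (not from the notebook) that enumerates the overlapping character n-grams a vectorizer would extract from a string:

```python
def char_ngrams(text, n_min=2, n_max=4):
    """List all character n-grams of length n_min..n_max (illustrative)."""
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]

print(char_ngrams("tion", 2, 3))  # ['ti', 'io', 'on', 'tio', 'ion']
print(char_ngrams("sch", 3, 3))   # ['sch']
```

Because such substrings recur with language-specific frequencies, TF-IDF over them separates languages even in short sentences.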

3. Linear Models Outperform Complex Ones

  • Linear SVM (99.38%) > Random Forest (96.62%)
  • Reason: Sparse, high-dimensional TF-IDF features favor linear decision boundaries
  • Random Forest struggles with sparse binary features

4. Script Uniqueness = Easy Classification

  • Languages with unique alphabets: 100% accuracy
  • Languages sharing scripts (Latin, Cyrillic): Slight confusion

5. Social Media Text Requires Different Approaches

  • Traditional methods failed on Twitter (33.33%)
  • Issues: Insufficient data (3 test samples), noise, slang, code-switching
  • Recommendation: Use deep learning (mBERT, XLM-R) for social media

๐Ÿ“ Limitations

Twitter Dataset

  • โš ๏ธ Only 3 test samples (statistically invalid)
  • โš ๏ธ Possible task mismatch (sentiment labels vs. language ID)
  • โš ๏ธ High class imbalance (models predict only majority class)
  • โš ๏ธ No meaningful learning occurred

Conclusion: Results demonstrate failure mode when data is insufficient, highlighting the importance of proper dataset curation.


🔮 Future Work

  1. 📌 Collect a properly sized Twitter dataset (1,000+ samples per language)
  2. 📌 Test on code-switched text (bilingual users)
  3. 📌 Compare with transformer models (mBERT, XLM-RoBERTa)
  4. 📌 Develop hybrid word+character n-gram features
  5. 📌 Build language family-specific classifiers
  6. 📌 Evaluate on streaming text (character-by-character detection)
  7. 📌 Handle extremely short text (<20 characters)
  8. 📌 Add more low-resource languages

🎓 Academic Context

This project was developed as part of a comparative study on short text language identification methods. The goal was to evaluate traditional machine learning approaches across datasets with varying characteristics and quality levels.

Institution: [Your College/University]
Course: [Your Course Name]
Date: December 2025


👥 Contributors

  • Student 1: Tatoeba Dataset Analysis
  • Student 2: Twitter Dataset Analysis
  • Student 3: WiLI-2018 Dataset Analysis
  • Student 4: Comparative Analysis & Report Compilation

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


๐Ÿ™ Acknowledgments


📞 Contact

Project Repository: https://github.com/1ms23cs199-dotcomValueError/ShortTextLanguageID

For questions or collaborations:


📚 References

  1. Joulin, A., et al. (2016). "FastText.zip: Compressing text classification models"
  2. Jaech, A., et al. (2016). "Phonological pun-derstanding"
  3. Thoma, M. (2018). "The WiLI benchmark dataset for written language identification"
  4. Tatoeba Project. (2024). "Collection of sentences and translations"

🌟 Star This Repository

If you found this project helpful, please consider giving it a ⭐ on GitHub!


Last Updated: December 2025
