

This project implements and evaluates four machine learning models for short text language identification across 17 languages using identical preprocessing and feature engineering pipelines. We analyze three distinct datasets to understand how data characteristics impact model performance.
- Compare performance of Logistic Regression, Linear SVM, Multinomial Naive Bayes, and Random Forest
- Evaluate across three diverse datasets: Tatoeba, Twitter, and WiLI-2018
- Analyze the impact of data quality on language identification accuracy
- Identify which languages are easiest/hardest to classify
| Dataset | Best Model | Accuracy | Macro-F1 | Status |
|---|---|---|---|---|
| Tatoeba | Linear SVM | 99.38% | 0.9938 | ✅ Excellent |
| WiLI-2018 | Linear SVM | 97-100% | N/A | ✅ Excellent |
| Twitter | Multinomial NB | 33.33% | 0.2222 | ⚠️ Data-limited |
- ✅ Linear SVM achieves state-of-the-art performance (99.38% accuracy on clean data)
- ✅ Character n-grams (2-4) are highly effective for curated text
- ✅ Data quality > model complexity (a 66-point accuracy gap between datasets vs. 2.76 points between models)
- ✅ Unique scripts = perfect classification (Arabic, Hebrew, Japanese: 100% accuracy)
- ⚠️ Traditional methods require sufficient data (the Twitter dataset had only 3 test samples)
| Code | Language | Script | Accuracy (Tatoeba) |
|---|---|---|---|
| ara | Arabic | Arabic | 100% |
| deu | German | Latin | 99.50% |
| eng | English | Latin | 99.50% |
| epo | Esperanto | Latin | 99.50% |
| fra | French | Latin | 99.00% |
| heb | Hebrew | Hebrew | 100% |
| hun | Hungarian | Latin | 99.75% |
| ita | Italian | Latin | 98.27% |
| jpn | Japanese | Japanese | 100% |
| nds | Low Saxon | Latin | 97.76% |
| nld | Dutch | Latin | 98.75% |
| pol | Polish | Latin | 100% |
| por | Portuguese | Latin | 98.75% |
| rus | Russian | Cyrillic | 97.51% |
| spa | Spanish | Latin | 97.76% |
| tur | Turkish | Latin | 99.50% |
| ukr | Ukrainian | Cyrillic | 97.49% |
**Tatoeba**
- Source: Tatoeba Project
- Type: Curated translation sentences
- Samples: 17,000 (1,000 per language)
- Quality: ⭐⭐⭐⭐⭐ Clean, grammatically correct
- Performance: 99.38% accuracy
**Twitter**
- Source: Kaggle
- Type: Social media posts
- Samples: Limited (3 test samples)
- Quality: ⭐ Noisy, informal, insufficient
- Performance: 33.33% accuracy (data-limited)
**WiLI-2018**
- Source: Wikipedia articles
- Type: Formal encyclopedic text
- Samples: ~17,000
- Quality: ⭐⭐⭐⭐⭐ Professional, edited
- Performance: 97-100% per-language accuracy
- Python 3.8 or higher
- pip package manager
- Jupyter Notebook or Google Colab
```bash
# Clone the repository
git clone https://github.com/1ms23cs199-dotcomValueError/ShortTextLanguageID.git
cd ShortTextLanguageID

# Install dependencies
pip install pandas numpy regex scikit-learn scipy matplotlib seaborn tqdm

# Run the notebook
jupyter notebook ShortTextLanguageID.ipynb
```

Dependencies:

```text
pandas >= 2.0.0
numpy >= 1.24.0
regex >= 2023.10.3
scikit-learn >= 1.3.0
scipy >= 1.11.0
matplotlib >= 3.7.0
seaborn >= 0.12.0
tqdm >= 4.65.0
```
```python
# Clone repository
!git clone https://github.com/1ms23cs199-dotcomValueError/ShortTextLanguageID.git
%cd ShortTextLanguageID

# Install dependencies
!pip install -q pandas numpy regex scikit-learn scipy matplotlib seaborn tqdm

# Run the notebook or execute cells
```

Alternatively, where most dependencies are pre-installed (e.g. Google Colab):

```python
# Clone repository
!git clone https://github.com/1ms23cs199-dotcomValueError/ShortTextLanguageID.git
%cd ShortTextLanguageID

# Install dependencies (most are pre-installed)
!pip install -q regex

# Import and run
```

Project structure:

```text
ShortTextLanguageID/
├── ShortTextLanguageID.ipynb    # Main analysis notebook
├── README.md                    # This file
├── data/                        # Downloaded datasets (auto-generated)
│   ├── tatoeba_balanced.csv
│   ├── twitter_data.csv
│   └── wili_data.csv
└── results/                     # Output files (auto-generated)
    ├── confusion_matrices/
    ├── model_comparisons/
    └── per_language_performance/
```
**Logistic Regression**
- Type: Linear classifier
- Tatoeba Accuracy: 99.00%
- Pros: Fast, interpretable
- Cons: Slightly lower accuracy than SVM
**Linear SVM**
- Type: Support Vector Machine (linear kernel)
- Tatoeba Accuracy: 99.38% (BEST)
- Pros: Highest accuracy, fast training
- Cons: None significant
**Multinomial Naive Bayes**
- Type: Probabilistic classifier
- Tatoeba Accuracy: 99.29%
- Pros: Very fast, close to SVM performance
- Cons: Assumes feature independence
**Random Forest**
- Type: Ensemble (tree-based)
- Tatoeba Accuracy: 96.62%
- Pros: Handles non-linear patterns
- Cons: Slower training, lower accuracy on sparse features
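The four models above map onto scikit-learn roughly as sketched below; the hyperparameters shown are illustrative defaults, not necessarily the ones used in the notebook:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier

# The four classifiers compared in this study (settings are illustrative)
MODELS = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Linear SVM": LinearSVC(),                 # best on Tatoeba (99.38%)
    "Multinomial NB": MultinomialNB(),         # fastest to train
    "Random Forest": RandomForestClassifier(n_estimators=100),
}
```

Each model is trained on the same TF-IDF character n-gram features, so the comparison isolates the classifier itself.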
- Download from Tatoeba, Twitter (Kaggle), WiLI-2018
- Balance datasets (1,000 samples per language)
- Text cleaning (lowercase, strip whitespace)
- Remove special characters
- Normalize Unicode
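A minimal sketch of this cleaning step (the function name `clean_text` and the exact regex are illustrative, not taken from the notebook):

```python
import re
import unicodedata

def clean_text(text: str) -> str:
    """Lowercase, normalize Unicode, strip special characters and extra whitespace."""
    text = unicodedata.normalize("NFKC", text)   # normalize Unicode forms
    text = text.lower().strip()                  # lowercase, strip whitespace
    # keep letters (any script) and spaces; drop digits, punctuation, symbols
    text = re.sub(r"[^\w\s]|\d", "", text)
    text = re.sub(r"\s+", " ", text)             # collapse repeated whitespace
    return text.strip()

print(clean_text("  Héllo, Wörld! 123 "))  # → héllo wörld
```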
- Character n-grams (n=2, 3, 4)
- TF-IDF vectorization
- 10,000 max features
- Sublinear TF scaling
- 80-20 train-test split
- Stratified sampling (balanced classes)
- Identical hyperparameters across datasets
- Accuracy, Macro-F1 score
- Per-language precision, recall, F1
- Confusion matrices
- Cross-dataset comparison
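Putting the steps together, a minimal end-to-end sketch (the toy corpus below is illustrative; real runs use the downloaded datasets):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, f1_score

# Toy corpus standing in for the balanced per-language samples
texts = ["the cat sat here", "das ist ein haus", "le chat est ici",
         "this is a house", "ich habe ein buch", "elle est ici aussi"] * 10
labels = ["eng", "deu", "fra", "eng", "deu", "fra"] * 10

# Character n-grams (2-4), TF-IDF with sublinear scaling, capped vocabulary
vec = TfidfVectorizer(analyzer="char", ngram_range=(2, 4),
                      max_features=10_000, sublinear_tf=True)
X = vec.fit_transform(texts)

# Stratified 80-20 split keeps classes balanced
X_tr, X_te, y_tr, y_te = train_test_split(
    X, labels, test_size=0.2, stratify=labels, random_state=42)

clf = LinearSVC().fit(X_tr, y_tr)
pred = clf.predict(X_te)
acc = accuracy_score(y_te, pred)
macro_f1 = f1_score(y_te, pred, average="macro")
print(f"accuracy={acc:.2f} macro-F1={macro_f1:.2f}")
```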
| Model | Accuracy | Macro-F1 | Training Time |
|---|---|---|---|
| Linear SVM | 99.38% | 0.9938 | ~15 sec |
| Multinomial NB | 99.29% | 0.9929 | ~5 sec |
| Logistic Regression | 99.00% | 0.9900 | ~10 sec |
| Random Forest | 96.62% | 0.9661 | ~180 sec |
Languages with 100% accuracy (mostly unique scripts; Polish, though Latin-script, has highly distinctive diacritics):
- 🇸🇦 Arabic (ara)
- 🇮🇱 Hebrew (heb)
- 🇯🇵 Japanese (jpn)
- 🇵🇱 Polish (pol)
Languages with most confusion (Romance family):
- 🇵🇹 Portuguese (por) - 98.75%
- 🇪🇸 Spanish (spa) - 97.76%
- 🇮🇹 Italian (ita) - 98.27%
Reason: Shared Latin script and similar character patterns
- 66-percentage-point gap between datasets (Tatoeba 99.38% vs. Twitter 33.33%)
- 2.76-percentage-point gap between models on same data
- Conclusion: Invest in data quality > algorithm selection (24x more impact)
- Capture language-specific orthographic patterns
- Examples:
- English: "th", "ing", "tion"
- German: "sch", "ich", "ein"
- French: "eau", "tion", "ent"
- Japanese: "です", "ます"
- Linear SVM (99.38%) > Random Forest (96.62%)
- Reason: Sparse, high-dimensional TF-IDF features favor linear decision boundaries
- Random Forest struggles with sparse binary features
- Languages with unique alphabets: 100% accuracy
- Languages sharing scripts (Latin, Cyrillic): Slight confusion
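This is easy to see at the character level: a single code point's Unicode name already identifies the script, so unique-script languages are nearly trivially separable (a simplified illustration, not code from the notebook):

```python
import unicodedata

def dominant_script(text: str) -> str:
    """Return the Unicode script word of the first alphabetic character (simplified)."""
    for ch in text:
        if ch.isalpha():
            return unicodedata.name(ch).split()[0]  # e.g. 'ARABIC', 'CYRILLIC', 'LATIN'
    return "UNKNOWN"

print(dominant_script("مرحبا"))   # ARABIC
print(dominant_script("привет"))  # CYRILLIC
print(dominant_script("hello"))   # LATIN
```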
- Traditional methods failed on Twitter (33.33%)
- Issues: Insufficient data (3 test samples), noise, slang, code-switching
- Recommendation: Use deep learning (mBERT, XLM-R) for social media
- ⚠️ Only 3 test samples (statistically invalid)
- ⚠️ Possible task mismatch (sentiment labels vs. language ID)
- ⚠️ High class imbalance (models predict only the majority class)
- ⚠️ No meaningful learning occurred
Conclusion: Results demonstrate failure mode when data is insufficient, highlighting the importance of proper dataset curation.
- Collect a properly sized Twitter dataset (1,000+ samples per language)
- Test on code-switched text (bilingual users)
- Compare with transformer models (mBERT, XLM-RoBERTa)
- Develop hybrid word + character n-gram features
- Build language-family-specific classifiers
- Evaluate on streaming text (character-by-character detection)
- Handle extremely short text (<20 characters)
- Add more low-resource languages
This project was developed as part of a comparative study on short text language identification methods. The goal was to evaluate traditional machine learning approaches across datasets with varying characteristics and quality levels.
Institution: [Your College/University]
Course: [Your Course Name]
Date: December 2025
- Student 1: Tatoeba Dataset Analysis
- Student 2: Twitter Dataset Analysis
- Student 3: WiLI-2018 Dataset Analysis
- Student 4: Comparative Analysis & Report Compilation
This project is licensed under the MIT License - see the LICENSE file for details.
- Tatoeba Project for multilingual sentence corpus
- WiLI-2018 Benchmark for Wikipedia language identification dataset
- Kaggle for Twitter sentiment dataset
- scikit-learn community for machine learning tools
Project Repository: https://github.com/1ms23cs199-dotcomValueError/ShortTextLanguageID
For questions or collaborations:
- Open an issue on GitHub
- Email: [your.email@example.com]
- Joulin, A., et al. (2016). "FastText.zip: Compressing text classification models"
- Jaech, A., et al. (2016). "Phonological pun-derstanding"
- Thoma, M. (2018). "The WiLI benchmark dataset for written language identification"
- Tatoeba Project. (2024). "Collection of sentences and translations"
If you found this project helpful, please consider giving it a ⭐ on GitHub!
Last Updated: December 2025