description: A comparative study of various language classifiers based on different combinations of linguistic features vs a baseline n-grams based model
---
## About the Project
A machine learning system that identifies Hindi and Marathi text written in Devanagari script. It combines multiple feature extraction techniques and classification models for a comparative study; both classifiers are Naive Bayes models. This is my course project for CL2 (Computational Linguistics 2).
- Supports Hindi and Marathi language identification
- Multiple feature extraction methods:
  - Character frequency analysis
  - Word length statistics
  - Character class distribution (vowels, consonants, matras)
  - N-gram analysis
  - Morphological analysis
  - POS tagging features (optional)
  - TF-IDF features
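
As a rough illustration of a few of the surface features listed above (character class distribution, word length statistics, and character n-grams), here is a minimal standard-library sketch. It is not the notebook's actual implementation: the function names are hypothetical, and the Unicode ranges used to define vowels, consonants, and matras are an assumption based on the Devanagari block (U+0900 to U+097F).

```python
from collections import Counter

# Assumed character classes, taken from the Devanagari Unicode block.
VOWELS = range(0x0905, 0x0915)      # independent vowels, U+0905..U+0914
CONSONANTS = range(0x0915, 0x093A)  # consonants, U+0915..U+0939
MATRAS = range(0x093E, 0x094D)      # dependent vowel signs, U+093E..U+094C


def char_class_distribution(text):
    """Return the share of vowels, consonants, and matras among classified characters."""
    counts = Counter()
    for ch in text:
        cp = ord(ch)
        if cp in VOWELS:
            counts["vowel"] += 1
        elif cp in CONSONANTS:
            counts["consonant"] += 1
        elif cp in MATRAS:
            counts["matra"] += 1
    total = sum(counts.values())
    return {
        name: (counts[name] / total if total else 0.0)
        for name in ("vowel", "consonant", "matra")
    }


def word_length_stats(text):
    """Return mean and max word length, measured in code points (not grapheme clusters)."""
    lengths = [len(word) for word in text.split()]
    if not lengths:
        return {"mean": 0.0, "max": 0}
    return {"mean": sum(lengths) / len(lengths), "max": max(lengths)}


def char_ngrams(text, n=2):
    """Count overlapping character n-grams, including those that span spaces."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))
```

Each function returns a plain dictionary or `Counter`, so the outputs can be merged into one feature vector per sentence before being passed to a classifier.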
## Setup
Run `setup.sh` to install required packages and download necessary data files.
## Usage
Run all cells in `model_comparison.ipynb`.

> If the notebook does not create the `<data_size>` directory automatically, create it manually before running.
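
If you prefer to create the directory from Python rather than by hand, a minimal sketch follows. It assumes the directory lives at `results/<data_size>`, as described in the Results section; `ensure_results_dir` is a hypothetical helper and is not part of the notebook.

```python
from pathlib import Path


def ensure_results_dir(data_size, base="results"):
    """Create <base>/<data_size> if it is missing and return its path.

    The `results/<data_size>` layout is an assumption; adjust `base`
    if the notebook writes its output somewhere else.
    """
    path = Path(base) / str(data_size)
    # parents=True creates `base` as well; exist_ok=True makes reruns safe.
    path.mkdir(parents=True, exist_ok=True)
    return path
```

For example, calling `ensure_results_dir(1000)` before running the notebook would create `results/1000` if it does not already exist.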
## Results
Results of the model comparison are available in `results/<data_size>`, where `<data_size>` is the configured size of the combined training and testing data used for the models.