Commit 2419c53

feat: language identification project page
1 parent 553832f commit 2419c53

3 files changed: 38 additions, 12 deletions

src/content/docs/index.mdx

Lines changed: 1 addition & 1 deletion
@@ -8,7 +8,7 @@ hero:
     file: ../../assets/houston.webp
   actions:
     - text: Stuff I've Made
-      link: /projects/placeholder
+      link: /projects/lid
       icon: right-arrow
     - text: Resume
       link: https://drive.google.com/file/d/1L8lr5-B45VnHHbTIvZRya9fNBJOjI8h3/

src/content/docs/projects/lid.md

Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
+---
+title: Language Identification Analysis
+description: A comparative study of language classifiers built on different combinations of linguistic features versus a baseline n-gram model
+---
+
+## About the Project
+
+A machine learning system for identifying Hindi and Marathi text written in Devanagari script, using multiple feature extraction techniques and classification models for a comparative study. Both classifiers used are Naive Bayes models. This is my course project for CL2 (Computational Linguistics 2).
+
+[`GitHub`](https://github.com/bitmap4/language-identification-analysis)
+[`Report`](https://drive.google.com/file/d/1AOhGJupvoLIHYv3SttUXIJ8m7ax3K5HE)
+
+
+## Features
+
+- Supports Hindi and Marathi language identification
+- Multiple feature extraction methods:
+  - Character frequency analysis
+  - Word length statistics
+  - Character class distribution (vowels, consonants, matras)
+  - N-gram analysis
+  - Morphological analysis
+  - POS tagging features (optional)
+  - TF-IDF features
+
+## Setup
+
+Run `setup.sh` to install the required packages and download the necessary data files.
+
+## Usage
+
+Run all cells in `model_comparison.ipynb`.
+> If the notebook does not create a directory named `<data_size>` automatically, create it manually.
+
+## Results
+
+Results of the model comparison are available in `results/<data_size>`, where `<data_size>` is the configured size of the combined training and testing data used for the models.
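
The new `lid.md` page lists character class distribution and n-gram analysis among its features, with Naive Bayes as the classifier. As a rough illustration of how those pieces could fit together, here is a minimal standard-library sketch. It is not the project's code: the names `char_class_distribution`, `char_ngrams`, and `NgramNaiveBayes` are hypothetical, the Unicode ranges cover only the core Devanagari letters, and the four training sentences are toy examples invented for this sketch.

```python
from collections import Counter
import math

# Core Devanagari ranges (an assumption: extended and nukta letters are ignored).
VOWELS = range(0x0904, 0x0915)      # independent vowels
CONSONANTS = range(0x0915, 0x093A)  # consonants
MATRAS = range(0x093E, 0x094D)      # dependent vowel signs (matras)


def char_class_distribution(text):
    """Return the share of vowels, consonants and matras among classified characters."""
    counts = Counter()
    for ch in text:
        cp = ord(ch)
        if cp in VOWELS:
            counts["vowel"] += 1
        elif cp in CONSONANTS:
            counts["consonant"] += 1
        elif cp in MATRAS:
            counts["matra"] += 1
    total = sum(counts.values()) or 1  # avoid division by zero on empty input
    return {k: counts[k] / total for k in ("vowel", "consonant", "matra")}


def char_ngrams(text, n=2):
    """Count overlapping character n-grams (code-point based)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


class NgramNaiveBayes:
    """Multinomial Naive Bayes over character n-grams with add-one smoothing."""

    def __init__(self, n=2):
        self.n = n
        self.ngram_counts = {}       # label -> Counter of n-grams
        self.doc_counts = Counter()  # label -> number of training texts
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_ngrams(text, self.n)
            self.ngram_counts.setdefault(label, Counter()).update(grams)
            self.doc_counts[label] += 1
            self.vocab.update(grams)
        return self

    def predict(self, text):
        grams = char_ngrams(text, self.n)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.ngram_counts.items():
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(self.doc_counts[label] / total_docs)  # log prior
            for gram, c in grams.items():
                score += c * math.log((counts[gram] + 1) / denom)  # smoothed likelihood
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy data invented for illustration; not taken from the project's dataset.
TEXTS = ["मैं घर जा रहा हूँ", "यह किताब अच्छी है", "मी घरी जात आहे", "हे पुस्तक चांगले आहे"]
LABELS = ["hi", "hi", "mr", "mr"]

model = NgramNaiveBayes(n=2).fit(TEXTS, LABELS)
print(model.predict("ती घरी आहे"))  # prints: mr
```

Character bigrams pick up cues such as the Marathi copula `आहे` versus the Hindi `है`, which is one reason an n-gram baseline is a natural point of comparison for richer linguistic features.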

src/content/docs/projects/placeholder.md

Lines changed: 0 additions & 11 deletions
This file was deleted.
