Information Retrieval - Vietnamese Legal Search

An information retrieval project on the Zalo AI Legal Text Retrieval VN dataset. The current pipeline is dense retrieval with BGE-M3 + FAISS over the full corpus.

Project Structure

.
├── app.py
├── embed_bge.py
├── evaluate_bge_20k.py
├── data/
│   └── zalo_ai_legal_text_retrieval_vn/
│       ├── corpus.jsonl
│       ├── queries.jsonl
│       ├── queries_unique.jsonl
│       └── qrels/
│           ├── train.jsonl
│           └── test.jsonl
├── artifacts/
│   ├── bge_m3_legal_full/
│   └── bge_m3_queries_full/
├── pyproject.toml
└── src/
    ├── cli.py
    └── search_tfidf/
        ├── bge_m3_engine.py
        ├── champion_bm25/
        ├── documents.py
        ├── engine.py
        ├── io.py
        └── text_utils.py

Data

corpus.jsonl: the full legal corpus, 61,425 documents.
queries.jsonl: the original queries; some query ids are duplicated.
queries_unique.jsonl: queries deduplicated by _id, used for embedding and evaluation.
qrels/train.jsonl: relevance labels for the train split.
qrels/test.jsonl: relevance labels for the test split.
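
The deduplication behind queries_unique.jsonl can be illustrated with a short sketch. It assumes each JSONL record carries an `_id` field (as stated above) and keeps the first record seen for each id; the `text` field and the keep-first policy are assumptions for illustration, not confirmed details of the dataset or the repository.

```python
import json


def dedupe_by_id(lines):
    """Keep the first record seen for each `_id`, preserving input order."""
    seen = set()
    unique = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if record["_id"] in seen:
            continue  # a later duplicate of an id we already kept
        seen.add(record["_id"])
        unique.append(record)
    return unique


# Hypothetical records; the "text" field name is an assumption.
raw_lines = [
    '{"_id": "q1", "text": "first query"}',
    '{"_id": "q2", "text": "second query"}',
    '{"_id": "q1", "text": "first query"}',
]
unique_queries = dedupe_by_id(raw_lines)
print([record["_id"] for record in unique_queries])
```

In practice the lines would come from iterating over an open file handle for queries.jsonl rather than from an in-memory list.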

Installation

cd InformationRetrieval
pip install -r requirements.txt

If you are running the project as an editable package:

pip install -e .

Embed the Full Corpus with BGE-M3

A simple script to embed the corpus:

cd /kaggle/InformationRetrieval && python embed_bge.py \
  --input data/zalo_ai_legal_text_retrieval_vn/corpus.jsonl \
  --output-dir artifacts/bge_m3_legal_full \
  --batch-size 32 \
  --max-length 1024 \
  --device cuda \
  --fp16

Output:

artifacts/bge_m3_legal_full/bge_m3.index
artifacts/bge_m3_legal_full/bge_m3_embeddings.npy
artifacts/bge_m3_legal_full/bge_m3_meta.joblib

Notes:

BGE-M3 supports a context of up to roughly 8192 tokens.
--max-length 1024 is fast and a reasonable choice for experiments on the full corpus.
If the GPU has enough VRAM, try --max-length 2048 or 4096.
If you run out of memory, lower the batch size to --batch-size 16 or --batch-size 8.
--fp16 reduces VRAM usage and is usually faster on CUDA.
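
The advice to lower the batch size after an out-of-memory error can also be automated. The sketch below is not part of embed_bge.py: `encode_in_batches` and `fake_encode_batch` are hypothetical names, and a plain `MemoryError` stands in for a real CUDA out-of-memory error so that the example runs without a GPU.

```python
def encode_in_batches(texts, encode_batch, batch_size=32, min_batch_size=1):
    """Encode texts batch by batch, halving the batch size after a memory error."""
    vectors = []
    start = 0
    while start < len(texts):
        batch = texts[start:start + batch_size]
        try:
            vectors.extend(encode_batch(batch))
        except MemoryError:
            if batch_size <= min_batch_size:
                raise  # cannot shrink any further
            batch_size = max(min_batch_size, batch_size // 2)
            continue  # retry the same position with a smaller batch
        start += len(batch)
    return vectors, batch_size


def fake_encode_batch(batch):
    # Hypothetical stand-in for a GPU encoder: fails above 8 texts per batch.
    if len(batch) > 8:
        raise MemoryError("simulated out-of-memory")
    return [[float(len(text))] for text in batch]


texts = [f"document {i}" for i in range(20)]
vectors, final_batch_size = encode_in_batches(texts, fake_encode_batch, batch_size=32)
print(len(vectors), final_batch_size)
```

With a real encoder, the `except` clause would need to catch whichever exception the framework raises on out-of-memory instead of `MemoryError`.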

Embed Queries

Use the unique query file to avoid embedding duplicate queries:

cd /kaggle/InformationRetrieval && python embed_bge.py \
  --input data/zalo_ai_legal_text_retrieval_vn/queries_unique.jsonl \
  --output-dir artifacts/bge_m3_queries_full \
  --batch-size 32 \
  --max-length 256 \
  --device cuda \
  --fp16

Output:

artifacts/bge_m3_queries_full/bge_m3.index
artifacts/bge_m3_queries_full/bge_m3_embeddings.npy
artifacts/bge_m3_queries_full/bge_m3_meta.joblib

Evaluate Retrieval

Evaluate on the test set:

cd /kaggle/InformationRetrieval && python evaluate_bge_20k.py \
  --corpus-dir artifacts/bge_m3_legal_full \
  --query-dir artifacts/bge_m3_queries_full \
  --qrels data/zalo_ai_legal_text_retrieval_vn/qrels/test.jsonl

Evaluate on the train set:

cd /kaggle/InformationRetrieval && python evaluate_bge_20k.py \
  --corpus-dir artifacts/bge_m3_legal_full \
  --query-dir artifacts/bge_m3_queries_full \
  --qrels data/zalo_ai_legal_text_retrieval_vn/qrels/train.jsonl

By default, the script computes:

Recall@1,3,5,10,20,50,100
Hit@1,3,5,10,20,50,100
MRR@1,3,5,10,20,50,100
nDCG@1,3,5,10,20,50,100
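
For reference, the four metric families listed above can be written out for a single query as below. This is a minimal sketch under the assumption of binary relevance and a ranking without duplicate document ids; `metrics_at_k` is a hypothetical helper, and evaluate_bge_20k.py may define or average the metrics differently.

```python
import math


def metrics_at_k(ranked_ids, relevant_ids, k):
    """Recall@k, Hit@k, MRR@k and nDCG@k for one query, with binary relevance."""
    top_k = ranked_ids[:k]
    hits = [1 if doc_id in relevant_ids else 0 for doc_id in top_k]

    recall = sum(hits) / len(relevant_ids) if relevant_ids else 0.0
    hit = 1.0 if any(hits) else 0.0

    # Reciprocal rank of the first relevant document within the top k.
    mrr = 0.0
    for rank, is_relevant in enumerate(hits, start=1):
        if is_relevant:
            mrr = 1.0 / rank
            break

    # DCG with gain 1 for relevant documents, normalized by the ideal DCG.
    dcg = sum(h / math.log2(rank + 1) for rank, h in enumerate(hits, start=1))
    ideal_count = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_count + 1))
    ndcg = dcg / idcg if idcg > 0 else 0.0

    return {"recall": recall, "hit": hit, "mrr": mrr, "ndcg": ndcg}


# Hypothetical ranking and relevance labels for one query.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
result = metrics_at_k(ranked, relevant, k=3)
print(result)
```

Dataset-level scores are usually the mean of these per-query values over all queries that have at least one relevance label.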

Search a Single Query

If the full-corpus index has already been built, you can search from the CLI:

cd /kaggle/InformationRetrieval && PYTHONPATH=src python -m cli search-bge \
  --model-dir artifacts/bge_m3_legal_full \
  --query "Công an xã xử phạt lỗi không mang bằng lái xe có đúng không?" \
  --top-k 5 \
  --device cuda
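
Conceptually, a dense search embeds the query and returns the documents whose vectors score highest against it. The sketch below does this by brute force in pure Python, assuming cosine similarity (inner product on L2-normalized vectors); whether the FAISS index in this project is configured that way is an assumption, and `top_k_inner_product` is a hypothetical name used only for illustration.

```python
import math


def normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def top_k_inner_product(query_vec, doc_vecs, doc_ids, k):
    """Return the k (doc_id, score) pairs with the highest cosine similarity."""
    q = normalize(query_vec)
    scored = []
    for doc_id, vec in zip(doc_ids, doc_vecs):
        d = normalize(vec)
        score = sum(a * b for a, b in zip(q, d))
        scored.append((score, doc_id))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(doc_id, score) for score, doc_id in scored[:k]]


# Hypothetical 3-dimensional embeddings; real BGE-M3 dense vectors are much larger.
doc_ids = ["law_a", "law_b", "law_c"]
doc_vecs = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]]
hits = top_k_inner_product([0.9, 0.1, 0.0], doc_vecs, doc_ids, k=2)
print(hits)
```

A FAISS index performs the same ranking far more efficiently than this loop, which is what makes searching the full 61,425-document corpus practical.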

Baseline TF-IDF/BM25

The baseline code still lives in:

src/search_tfidf/engine.py: TF-IDF + cosine similarity.
src/search_tfidf/bm25_engine.py: BM25.

To build the baseline on the legal corpus:

PYTHONPATH=src python -m cli build-bm25 \
  --input data/zalo_ai_legal_text_retrieval_vn/corpus.jsonl \
  --model-dir artifacts/bm25_legal_full

Champion BM25 Inference

This module is intended only for the Zalo AI Legal Text Retrieval VN dataset and the tokenized artifacts in artifacts/bm25_underthesea.

Build the BM25 champion-list model once:

python src/champion-list/champion_bm25/build_model.py --input data/zalo_ai_legal_text_retrieval_vn/corpus.jsonl --model-dir artifacts/bm25_legal_full --champion-size 9000

If you already have a tokenized corpus and do not want to tokenize it again, use this command instead:

python src/champion-list/champion_bm25/build_model.py --corpus-tokenized artifacts/bm25_underthesea/corpus_doc_id.jsonl --model-dir artifacts/bm25_legal_8000 --champion-size 8000

Then run inference directly on the query text:

python src/champion-list/champion_bm25/search.py --model-dir artifacts/bm25_legal_full --query "Mức phạt khi quay đầu xe ô tô trên đường cao tốc" --top-k 5

This flow works as follows:

build_model.py builds and saves inverted_index, champion_index, idf, doc_lengths, and avgdl.
search.py only loads bm25_model.joblib, tokenizes the query with underthesea, and then searches.
The corpus is not re-tokenized and the index is not rebuilt at inference time.
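
To make the build and search split concrete, here is a minimal sketch of BM25 with champion lists that produces the same named structures. It is not the repository's build_model.py: the K1 and B values, the IDF variant, and ranking each term's postings by term frequency are assumptions, and `build_model` and `search` are illustrative function names.

```python
import math
from collections import Counter, defaultdict

K1, B = 1.5, 0.75  # common BM25 defaults; the repository's values are not stated


def build_model(tokenized_docs, champion_size):
    """Build the BM25 structures from a mapping {doc_id: [token, ...]}."""
    doc_lengths = {doc_id: len(tokens) for doc_id, tokens in tokenized_docs.items()}
    avgdl = sum(doc_lengths.values()) / len(doc_lengths)

    inverted_index = defaultdict(dict)  # term -> {doc_id: term frequency}
    for doc_id, tokens in tokenized_docs.items():
        for term, tf in Counter(tokens).items():
            inverted_index[term][doc_id] = tf

    n_docs = len(tokenized_docs)
    idf = {
        term: math.log(1 + (n_docs - len(postings) + 0.5) / (len(postings) + 0.5))
        for term, postings in inverted_index.items()
    }

    # Champion list: per term, keep only the champion_size postings with highest tf.
    champion_index = {
        term: dict(sorted(postings.items(), key=lambda kv: kv[1], reverse=True)[:champion_size])
        for term, postings in inverted_index.items()
    }
    return {
        "inverted_index": dict(inverted_index),
        "champion_index": champion_index,
        "idf": idf,
        "doc_lengths": doc_lengths,
        "avgdl": avgdl,
    }


def search(model, query_tokens, top_k, use_champion=True):
    """Score only the documents found in the postings of the query terms."""
    index = model["champion_index"] if use_champion else model["inverted_index"]
    scores = defaultdict(float)
    for term in query_tokens:
        for doc_id, tf in index.get(term, {}).items():
            norm = 1 - B + B * model["doc_lengths"][doc_id] / model["avgdl"]
            scores[doc_id] += model["idf"][term] * tf * (K1 + 1) / (tf + K1 * norm)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Hypothetical pre-tokenized documents.
docs = {"d1": ["phat", "xe", "xe"], "d2": ["phat", "duong"], "d3": ["xe", "cao", "toc"]}
model = build_model(docs, champion_size=1)
champion_results = search(model, ["xe"], top_k=5)
full_results = search(model, ["xe"], top_k=5, use_champion=False)
print(champion_results, full_results)
```

The trade-off is visible even on this toy corpus: the champion index scores fewer candidates per term, which lowers latency, but a relevant document outside a term's champion list can only be retrieved through another query term.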

Evaluate Champion BM25

To measure the metrics of the BM25 champion list on the test set, using the same prebuilt inference model:

python src/champion-list/champion_bm25/evaluate.py --model-dir artifacts/bm25_legal_6000 --queries-tokenized artifacts/bm25_underthesea/queries_test_tokens.jsonl --qrels data/zalo_ai_legal_text_retrieval_vn/qrels/test.jsonl --mode champion

To evaluate on the train set:

python src/champion-list/champion_bm25/evaluate.py --model-dir artifacts/bm25_legal_6000 --queries-raw data/zalo_ai_legal_text_retrieval_vn/queries_unique.jsonl --qrels data/zalo_ai_legal_text_retrieval_vn/qrels/train.jsonl --mode champion

To compare full BM25 and champion BM25 on the test set:

python src/champion-list/champion_bm25/evaluate.py --model-dir artifacts/bm25_legal_6000 --queries-tokenized artifacts/bm25_underthesea/queries_test_tokens.jsonl --qrels data/zalo_ai_legal_text_retrieval_vn/qrels/test.jsonl --mode both

To compare full BM25 and champion BM25 on the train set:

python src/champion-list/champion_bm25/evaluate.py --model-dir artifacts/bm25_legal_6000 --queries-raw data/zalo_ai_legal_text_retrieval_vn/queries_unique.jsonl --qrels data/zalo_ai_legal_text_retrieval_vn/qrels/train.jsonl --mode both

Notes:

--queries-tokenized is for queries that are already tokenized, which suits latency benchmarks.
--queries-raw tokenizes the queries with underthesea at runtime.
--mode both prints results for both full BM25 and the BM25 champion list in a single run.
--champion-size is only needed at the model build step; inference and evaluation from --model-dir use the champion size saved with the model.

Notes on Git

Do not commit the files in artifacts/ to GitHub: the embeddings and indexes are very large, and GitHub blocks files over 100MB. Keep artifacts locally or in Kaggle output, or use Git LFS if you really need to version them.

Ignore the generated artifacts:

artifacts/bge_m3_*/
*.npy
*.index
*.joblib
