Information retrieval project on the Zalo AI Legal Text Retrieval VN dataset. The current pipeline is dense retrieval with BGE-M3 + FAISS over the full corpus.
.
├── app.py
├── embed_bge.py
├── evaluate_bge_20k.py
├── data/
│   └── zalo_ai_legal_text_retrieval_vn/
│       ├── corpus.jsonl
│       ├── queries.jsonl
│       ├── queries_unique.jsonl
│       └── qrels/
│           ├── train.jsonl
│           └── test.jsonl
├── artifacts/
│   ├── bge_m3_legal_full/
│   └── bge_m3_queries_full/
├── pyproject.toml
└── src/
    ├── cli.py
    └── search_tfidf/
        ├── bge_m3_engine.py
        ├── champion_bm25/
        ├── documents.py
        ├── engine.py
        ├── io.py
        └── text_utils.py
- corpus.jsonl: the full legal corpus, 61,425 documents.
- queries.jsonl: the original queries; some query ids are duplicated.
- queries_unique.jsonl: queries deduplicated by _id, used for embedding and evaluation.
- qrels/train.jsonl: relevance labels for the train split.
- qrels/test.jsonl: relevance labels for the test split.
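The deduplication behind queries_unique.jsonl can be illustrated with a minimal sketch. It assumes each JSONL record carries an `_id` field and that the first occurrence of an id is the one to keep; the function name `dedupe_by_id` and the `text` field in the sample are hypothetical, not taken from the repository.

```python
import json


def dedupe_by_id(lines, id_field="_id"):
    """Keep the first record seen for each id; later duplicates are dropped."""
    seen = set()
    unique = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        key = record[id_field]
        if key in seen:
            continue
        seen.add(key)
        unique.append(record)
    return unique


# Illustrative records only; the real files may carry other fields.
sample = [
    '{"_id": "q1", "text": "first question"}',
    '{"_id": "q2", "text": "second question"}',
    '{"_id": "q1", "text": "first question"}',
]
unique_queries = dedupe_by_id(sample)
print([q["_id"] for q in unique_queries])
```

On a real file, the same function could be fed an open file handle, since iterating a text file yields one line at a time.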
cd InformationRetrieval
pip install -r requirements.txt
To install as an editable package instead:
pip install -e .
A simple script embeds the corpus:
cd /kaggle/InformationRetrieval && python embed_bge.py \
  --input data/zalo_ai_legal_text_retrieval_vn/corpus.jsonl \
  --output-dir artifacts/bge_m3_legal_full \
  --batch-size 32 \
  --max-length 1024 \
  --device cuda \
  --fp16
Output:
artifacts/bge_m3_legal_full/bge_m3.index
artifacts/bge_m3_legal_full/bge_m3_embeddings.npy
artifacts/bge_m3_legal_full/bge_m3_meta.joblib
Notes:
- BGE-M3 supports a context of up to about 8192 tokens.
- --max-length 1024 is fast and a reasonable setting for experiments on the full corpus.
- If the GPU has enough VRAM, try --max-length 2048 or 4096.
- If you run out of memory, reduce to --batch-size 16 or --batch-size 8.
- --fp16 reduces VRAM usage and is usually faster on CUDA.
Use the unique query file to avoid embedding duplicate queries:
cd /kaggle/InformationRetrieval && python embed_bge.py \
  --input data/zalo_ai_legal_text_retrieval_vn/queries_unique.jsonl \
  --output-dir artifacts/bge_m3_queries_full \
  --batch-size 32 \
  --max-length 256 \
  --device cuda \
  --fp16
Output:
artifacts/bge_m3_queries_full/bge_m3.index
artifacts/bge_m3_queries_full/bge_m3_embeddings.npy
artifacts/bge_m3_queries_full/bge_m3_meta.joblib
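Once corpus and query embeddings exist, dense retrieval reduces to a nearest-neighbour search over vectors. The toy sketch below shows that idea in pure Python using cosine similarity (inner product on L2-normalized vectors). Whether the repository's FAISS index uses this exact metric is an assumption, and `top_k_inner_product` is a hypothetical helper, not part of the project.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def top_k_inner_product(query_vec, doc_vecs, doc_ids, k):
    """Brute-force search: score every document, return the k best (id, score) pairs."""
    q = l2_normalize(query_vec)
    scored = []
    for doc_id, vec in zip(doc_ids, doc_vecs):
        d = l2_normalize(vec)
        score = sum(a * b for a, b in zip(q, d))  # cosine similarity
        scored.append((doc_id, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Two-dimensional toy vectors; real BGE-M3 dense vectors are much larger.
doc_ids = ["d1", "d2", "d3"]
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
results = top_k_inner_product([2.0, 0.1], doc_vecs, doc_ids, k=2)
print(results)
```

An exhaustive loop like this is only practical for tiny inputs; a vector index such as FAISS exists to do the same search efficiently over tens of thousands of documents.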
Evaluate on the test set:
cd /kaggle/InformationRetrieval && python evaluate_bge_20k.py \
  --corpus-dir artifacts/bge_m3_legal_full \
  --query-dir artifacts/bge_m3_queries_full \
  --qrels data/zalo_ai_legal_text_retrieval_vn/qrels/test.jsonl
Evaluate on the train set:
cd /kaggle/InformationRetrieval && python evaluate_bge_20k.py \
  --corpus-dir artifacts/bge_m3_legal_full \
  --query-dir artifacts/bge_m3_queries_full \
  --qrels data/zalo_ai_legal_text_retrieval_vn/qrels/train.jsonl
By default the script computes:
Recall@1,3,5,10,20,50,100
Hit@1,3,5,10,20,50,100
MRR@1,3,5,10,20,50,100
nDCG@1,3,5,10,20,50,100
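For readers unfamiliar with these four metrics, here is a minimal per-query sketch using their standard definitions with binary relevance. It is not the code in evaluate_bge_20k.py; the function name `metrics_at_k` is hypothetical, and it assumes the ranked list contains no duplicate document ids.

```python
import math


def metrics_at_k(ranked_ids, relevant_ids, k):
    """Recall, Hit, MRR and nDCG at cutoff k for one query (binary relevance)."""
    top = ranked_ids[:k]
    hits = [1 if doc_id in relevant_ids else 0 for doc_id in top]

    recall = sum(hits) / len(relevant_ids) if relevant_ids else 0.0
    hit = 1.0 if any(hits) else 0.0

    # Reciprocal rank of the first relevant document within the top k.
    mrr = 0.0
    for rank, h in enumerate(hits, start=1):
        if h:
            mrr = 1.0 / rank
            break

    # DCG with gain 1 per relevant document, normalized by the ideal ordering.
    dcg = sum(h / math.log2(rank + 1) for rank, h in enumerate(hits, start=1))
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    ndcg = dcg / idcg if idcg > 0 else 0.0

    return {"recall": recall, "hit": hit, "mrr": mrr, "ndcg": ndcg}


# One relevant document is found at rank 2; the other is outside the top 3.
example = metrics_at_k(["d3", "d1", "d7", "d2"], {"d1", "d2"}, k=3)
print(example)
```

Dataset-level numbers are typically the mean of these per-query values over all queries that have relevance labels.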
If the full-corpus index already exists, you can search from the CLI:
cd /kaggle/InformationRetrieval && PYTHONPATH=src python -m cli search-bge \
  --model-dir artifacts/bge_m3_legal_full \
  --query "Công an xã xử phạt lỗi không mang bằng lái xe có đúng không?" \
  --top-k 5 \
  --device cuda
The baseline code still lives in:
- src/search_tfidf/engine.py: TF-IDF + cosine similarity.
- src/search_tfidf/bm25_engine.py: BM25.
To build the baseline on the legal corpus:
PYTHONPATH=src python -m cli build-bm25 \
  --input data/zalo_ai_legal_text_retrieval_vn/corpus.jsonl \
  --model-dir artifacts/bm25_legal_full
This module is used only with the Zalo AI Legal Text Retrieval VN dataset and the tokenized artifacts in artifacts/bm25_underthesea.
Build the BM25 champion-list model once:
python src/champion-list/champion_bm25/build_model.py \
  --input data/zalo_ai_legal_text_retrieval_vn/corpus.jsonl \
  --model-dir artifacts/bm25_legal_full \
  --champion-size 9000
If you already have a tokenized corpus and do not want to tokenize it again, use this command:
python src/champion-list/champion_bm25/build_model.py \
  --corpus-tokenized artifacts/bm25_underthesea/corpus_doc_id.jsonl \
  --model-dir artifacts/bm25_legal_8000 \
  --champion-size 8000
Then run inference directly on query text:
python src/champion-list/champion_bm25/search.py \
  --model-dir artifacts/bm25_legal_full \
  --query "Mức phạt khi quay đầu xe ô tô trên đường cao tốc" \
  --top-k 5
This flow works as follows:
- build_model.py builds and saves inverted_index, champion_index, idf, doc_lengths, and avgdl.
- search.py only loads bm25_model.joblib, tokenizes the query with underthesea, then searches.
- The corpus is not re-tokenized and the index is not rebuilt at inference time.
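The build-once, search-many flow above can be sketched in a few lines. This is an illustration of the general champion-list technique, not the repository's build_model.py or search.py. The BM25 constants, the IDF variant, the rule for choosing champions (top documents per term by that term's BM25 weight), and the function names are all assumptions.

```python
import math
from collections import Counter, defaultdict

K1, B = 1.5, 0.75  # common BM25 defaults; the repository's values are not stated


def term_score(tf, doc_len, avgdl, idf):
    """BM25 contribution of one term to one document."""
    return idf * tf * (K1 + 1) / (tf + K1 * (1 - B + B * doc_len / avgdl))


def build_model(tokenized_docs, champion_size):
    """Build the structures once: postings, lengths, idf, and per-term champions."""
    inverted_index = defaultdict(dict)
    doc_lengths = {}
    for doc_id, tokens in tokenized_docs.items():
        doc_lengths[doc_id] = len(tokens)
        for term, tf in Counter(tokens).items():
            inverted_index[term][doc_id] = tf
    n_docs = len(tokenized_docs)
    avgdl = sum(doc_lengths.values()) / n_docs
    idf = {term: math.log(1 + (n_docs - len(post) + 0.5) / (len(post) + 0.5))
           for term, post in inverted_index.items()}
    champion_index = {}
    for term, post in inverted_index.items():
        ranked = sorted(
            post,
            key=lambda d: term_score(post[d], doc_lengths[d], avgdl, idf[term]),
            reverse=True,
        )
        champion_index[term] = ranked[:champion_size]  # keep only the best docs
    return {"inverted_index": dict(inverted_index), "champion_index": champion_index,
            "idf": idf, "doc_lengths": doc_lengths, "avgdl": avgdl}


def search(model, query_tokens, top_k):
    """Score only documents that appear in a champion list of some query term."""
    candidates = set()
    for term in query_tokens:
        candidates.update(model["champion_index"].get(term, []))
    scores = {}
    for doc_id in candidates:
        total = 0.0
        for term in query_tokens:
            tf = model["inverted_index"].get(term, {}).get(doc_id, 0)
            if tf:
                total += term_score(tf, model["doc_lengths"][doc_id],
                                    model["avgdl"], model["idf"][term])
        scores[doc_id] = total
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_k]


# Toy pre-tokenized corpus; the real pipeline tokenizes with underthesea.
docs = {"a": ["phạt", "xe", "xe"], "b": ["xe", "ô_tô"], "c": ["đất", "đai"]}
model = build_model(docs, champion_size=1)
results = search(model, ["xe"], top_k=5)
print(results)
```

With champion_size=1 only document "a" is a candidate for the term "xe", so "b" is never scored. That is the trade-off of the technique: fewer candidates to score per query, at the risk of missing documents that full BM25 would have ranked.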
To measure BM25 champion-list metrics on the test set using the same prebuilt inference model:
python src/champion-list/champion_bm25/evaluate.py \
  --model-dir artifacts/bm25_legal_6000 \
  --queries-tokenized artifacts/bm25_underthesea/queries_test_tokens.jsonl \
  --qrels data/zalo_ai_legal_text_retrieval_vn/qrels/test.jsonl \
  --mode champion
To evaluate on the train set:
python src/champion-list/champion_bm25/evaluate.py \
  --model-dir artifacts/bm25_legal_6000 \
  --queries-raw data/zalo_ai_legal_text_retrieval_vn/queries_unique.jsonl \
  --qrels data/zalo_ai_legal_text_retrieval_vn/qrels/train.jsonl \
  --mode champion
To compare full BM25 and champion-list BM25 on the test set:
python src/champion-list/champion_bm25/evaluate.py \
  --model-dir artifacts/bm25_legal_6000 \
  --queries-tokenized artifacts/bm25_underthesea/queries_test_tokens.jsonl \
  --qrels data/zalo_ai_legal_text_retrieval_vn/qrels/test.jsonl \
  --mode both
To compare full BM25 and champion-list BM25 on the train set:
python src/champion-list/champion_bm25/evaluate.py \
  --model-dir artifacts/bm25_legal_6000 \
  --queries-raw data/zalo_ai_legal_text_retrieval_vn/queries_unique.jsonl \
  --qrels data/zalo_ai_legal_text_retrieval_vn/qrels/train.jsonl \
  --mode both
Notes:
- --queries-tokenized is for queries that are already tokenized, which suits latency benchmarks.
- --queries-raw tokenizes queries with underthesea at runtime.
- --mode both prints results for both full BM25 and champion-list BM25 in a single run.
- --champion-size is needed only at the model build step; inference and evaluation read the stored champion size from --model-dir.
Do not commit files under artifacts/ to GitHub: the embeddings and indexes are very large.
GitHub blocks files over 100MB. Keep artifacts locally or in Kaggle output, or use Git LFS if you really need to version them.
Ignore the generated artifacts:
artifacts/bge_m3_*/
*.npy
*.index
*.joblib
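As a safeguard before pushing, a small script can list any files that exceed the size limit mentioned above. This is a sketch, not part of the repository: `find_large_files` is a hypothetical helper, and treating the limit as 100 MiB (100 * 1024 * 1024 bytes) is an assumption about how the 100MB figure is counted.

```python
import os

# Assumed interpretation of the 100MB limit: 100 MiB.
GITHUB_LIMIT_BYTES = 100 * 1024 * 1024


def find_large_files(root, limit_bytes=GITHUB_LIMIT_BYTES):
    """Return sorted (path, size_in_bytes) pairs for files larger than limit_bytes."""
    large = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            size = os.path.getsize(path)
            if size > limit_bytes:
                large.append((path, size))
    return sorted(large)


if __name__ == "__main__":
    # os.walk yields nothing if the directory does not exist, so this is safe to run.
    for path, size in find_large_files("artifacts"):
        print(f"{path}: {size / (1024 * 1024):.1f} MiB")
```

Running it from the repository root reports oversized files under artifacts/; an empty output means nothing there exceeds the limit.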