- pdfplumber + BM25 (original): 0.7133333
- fitz (read table) + remove punctuation + BM25: 0.7200000
- Removing punctuation: no difference
- OCR: no difference
- Self-defined function
  - (256, 0): 0.7533333
- RecursiveCharacterTextSplitter, settings given as (chunk_size, chunk_overlap):
  - (100, 0): 0.76
  - (100, 50): 0.7533333
  - (200, 0): 0.7533333
  - (200, 50): 0.7600000
  - (200, 100): 0.7666667
  - (400, 0): 0.7266667
  - (400, 100): 0.7466667
  - (400, 200): 0.7266667
  - (800, 400): 0.7333333
  - (200, 100), top_n_bm25=5, distiluse-base-multilingual-cased-v1: 0.7866667
  - (200, 100), top_n_bm25=10, distiluse-base-multilingual-cased-v1: 0.7866667
  - (200, 100), top_n_bm25=10, paraphrase-multilingual-MiniLM-L12-v2: 0.7933333
  - (200, 100), top_n_bm25=10, bce-embedding-base_v1: 0.8733333 / 00:03:58
  - (200, 100), top_n_bm25=10, bce-reranker-base_v1: 0.8800000 / 00:04:58
  - (200, 100), top_n_bm25=10, bce-embedding-base_v1, top_n_embed=5, bce-reranker-base_v1: 0.8666667 / 00:06:23
  - (200, 100), paraphrase-multilingual-MiniLM-L12-v2: 0.7466667
  - normalize_embeddings: no difference
  - (200, 100), bce-embedding-base_v1: 0.8333333
  - (200, 100), bce-embedding-base_v1, top_n_embed=10, bce-reranker-base_v1: 0.9200000 / 1:20:29
  - (200, 100), bce-reranker-base_v1: 0.9200000 / 1:24:26
  - max_length=512, bce-reranker-base_v1: 0.9333333 / 1:06:33
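
The (chunk_size, chunk_overlap) settings and the "BM25 top-n, then rerank" runs above can be sketched roughly as follows. This is a minimal stdlib-only illustration, not the code behind these numbers: `chunk_text` is a plain sliding window (simpler than RecursiveCharacterTextSplitter, which also splits on separators), and `char_overlap` and `longest_common_substring` are hypothetical toy scorers standing in for BM25 and a reranker model such as bce-reranker-base_v1.

```python
from typing import Callable, List, Sequence


def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into character windows of chunk_size that overlap by chunk_overlap."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


def retrieve_then_rerank(
    query: str,
    chunks: Sequence[str],
    first_stage: Callable[[str, str], float],
    reranker: Callable[[str, str], float],
    top_n: int = 10,
) -> List[int]:
    """Keep the top_n chunk indices by first_stage score, then order them by reranker score."""
    by_first = sorted(
        range(len(chunks)),
        key=lambda i: first_stage(query, chunks[i]),
        reverse=True,
    )
    candidates = by_first[:top_n]
    return sorted(candidates, key=lambda i: reranker(query, chunks[i]), reverse=True)


def char_overlap(query: str, chunk: str) -> float:
    """Toy first-stage scorer (stand-in for BM25): number of shared distinct characters."""
    return float(len(set(query) & set(chunk)))


def longest_common_substring(query: str, chunk: str) -> float:
    """Toy reranker (stand-in for a cross-encoder): longest query substring found in chunk."""
    best = 0
    for i in range(len(query)):
        for j in range(i + 1, len(query) + 1):
            if j - i > best and query[i:j] in chunk:
                best = j - i
    return float(best)
```

With a real setup, `first_stage` would wrap a BM25 index and `reranker` a model call; `top_n` plays the role of `top_n_bm25` in the runs above. The runs without `top_n_bm25` presumably score every chunk with the model, which would be consistent with their much longer times.
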
- Baseline: 71.3%
- Read tables from PDFs: 72%
- Remove punctuation: 74%
- Use bigrams: 80.0%
- Use trigrams: 82.0%
- Use 4-grams: 79.3%
- Remove stopwords: 82.66%
- Synonym expansion: 81.33%
- Fixed synonym expansion: 82.0%
- Use pkuseg tokenizer: 83.33%
- Use BERT NER: 84.66%
- Reranking: 92.66%
- Embedding: 82.66%
- Smart segment: 93.33%
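
The n-gram steps in the list above can be illustrated with a small sketch. It assumes the n-grams are overlapping character n-grams (a common choice for Chinese text, though the notes do not say which kind was used) and scores them with a minimal Okapi BM25 written from scratch. The names `char_ngrams` and `bm25_scores` and the defaults `k1=1.5`, `b=0.75` are illustrative assumptions, not taken from the experiments.

```python
import math
from collections import Counter
from typing import List, Sequence


def char_ngrams(text: str, n: int) -> List[str]:
    """Return overlapping character n-grams, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


def bm25_scores(
    query: Sequence[str],
    docs: Sequence[Sequence[str]],
    k1: float = 1.5,
    b: float = 0.75,
) -> List[float]:
    """Score each tokenized document against the tokenized query with Okapi BM25.

    Uses the non-negative IDF variant log(1 + (N - df + 0.5) / (df + 0.5))
    and counts each distinct query term once.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs if n_docs else 0.0
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in set(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Length normalization: longer-than-average documents are penalized.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Switching between the bigram, trigram, and 4-gram runs then amounts to changing `n` in `char_ngrams` for both the query and the documents.
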