My Contribution

  • pdfplumber + BM25 (original): 0.7133333
  • fitz (read table) + remove punctuation + BM25: 0.7200000
    • Removing punctuation made no difference
  • OCR made no difference

fitz + (remove punctuation, segmentation) + BM25

  • Self-defined function
    • (256, 0): 0.7533333
  • RecursiveCharacterTextSplitter: (chunk_size, chunk_overlap)
    • (100, 0): 0.7600000
    • (100, 50): 0.7533333
    • (200, 0): 0.7533333
    • (200, 50): 0.7600000
    • (200, 100): 0.7666667
    • (400, 0): 0.7266667
    • (400, 100): 0.7466667
    • (400, 200): 0.7266667
    • (800, 400): 0.7333333
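
The (chunk_size, chunk_overlap) pairs above control how each document is cut into passages before BM25 scoring. As a rough illustration only, here is a minimal fixed-window chunker. It is a simplification: RecursiveCharacterTextSplitter additionally prefers to split on separators, which this sketch does not attempt, and the function name `chunk_text` is hypothetical.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Cut text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    Simplified stand-in for a separator-aware splitter (assumption).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks
```

With an overlap, a sentence that straddles a window boundary still appears whole in at least one chunk, which may be why (200, 100) edges out (200, 0) in the list above.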

fitz + RecursiveCharacterTextSplitter + BM25 + Embedding

  • (200, 100), top_n_bm25=5, distiluse-base-multilingual-cased-v1: 0.7866667
  • (200, 100), top_n_bm25=10, distiluse-base-multilingual-cased-v1: 0.7866667
  • (200, 100), top_n_bm25=10, paraphrase-multilingual-MiniLM-L12-v2: 0.7933333
  • (200, 100), top_n_bm25=10, bce-embedding-base_v1: 0.8733333 / 00:03:58
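
The two-stage setup in this section (BM25 keeps the top `top_n_bm25` candidates, a second model picks among them) can be sketched as below. This is a dependency-free illustration, not the code behind the numbers: `bm25_scores`, `retrieve_then_rerank`, `tokenize`, and `second_stage_score` are hypothetical names, the BM25 constants are common defaults rather than values from this log, and the second stage is a pluggable callable standing in for an embedding similarity or a reranker score.

```python
import math
from collections import Counter
from typing import Callable, Sequence


def bm25_scores(query_tokens: Sequence[str],
                docs_tokens: Sequence[Sequence[str]],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Okapi BM25 score of every document for one query."""
    n_docs = len(docs_tokens)
    avg_len = (sum(len(d) for d in docs_tokens) / n_docs) or 1.0
    doc_freq = Counter()
    for doc in docs_tokens:
        doc_freq.update(set(doc))  # count each term once per document

    scores = []
    for doc in docs_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avg_len))
        scores.append(score)
    return scores


def retrieve_then_rerank(query: str, docs: Sequence[str],
                         tokenize: Callable[[str], list[str]],
                         second_stage_score: Callable[[str, str], float],
                         top_n_bm25: int = 10) -> int:
    """Return the index of the best document after both stages."""
    scores = bm25_scores(tokenize(query), [tokenize(d) for d in docs])
    # Stage 1: keep only the top_n_bm25 documents by BM25 score.
    candidates = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)[:top_n_bm25]
    # Stage 2: let the (more expensive) model choose among the candidates.
    return max(candidates, key=lambda i: second_stage_score(query, docs[i]))
```

The second stage only ever sees what BM25 lets through, so a relevant passage that BM25 ranks below `top_n_bm25` cannot be recovered later. The runs further down that drop BM25 entirely avoid this cap, at the price of much longer run times.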

fitz + RecursiveCharacterTextSplitter + BM25 + Reranker

  • (200, 100), top_n_bm25=10, bce-reranker-base_v1: 0.8800000 / 00:04:58

fitz + RecursiveCharacterTextSplitter + BM25 + Embedding + Reranker

  • (200, 100), top_n_bm25=10, bce-embedding-base_v1, top_n_embed=5, bce-reranker-base_v1: 0.8666667 / 00:06:23

fitz + RecursiveCharacterTextSplitter + Embedding

  • (200, 100), paraphrase-multilingual-MiniLM-L12-v2: 0.7466667
    • normalize_embeddings made no difference
  • (200, 100), bce-embedding-base_v1: 0.8333333

fitz + RecursiveCharacterTextSplitter + Embedding + Reranker

  • (200, 100), bce-embedding-base_v1, top_n_embed=10, bce-reranker-base_v1: 0.9200000 / 01:20:29

fitz + RecursiveCharacterTextSplitter + Reranker

  • (200, 100), bce-reranker-base_v1: 0.9200000 / 01:24:26

fitz + SmartSegmentation + Reranker

  • max_length=512, bce-reranker-base_v1: 0.9333333 / 01:06:33
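
This log does not describe how SmartSegmentation works. One plausible reading, offered purely as an assumption, is that it packs whole sentences into passages of at most `max_length` characters, so that passages break at sentence boundaries rather than at fixed offsets. The sketch below illustrates that idea; `pack_sentences` is a hypothetical name and the punctuation set is a guess.

```python
import re


def pack_sentences(text: str, max_length: int = 512) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_length characters.

    Assumed interpretation of sentence-aware segmentation, not the original code.
    """
    # Split right after sentence-ending punctuation (CJK and ASCII) or a newline.
    sentences = [s for s in re.split(r"(?<=[。！？!?\n])", text) if s.strip()]

    chunks, current = [], ""
    for sentence in sentences:
        # A single sentence longer than max_length is hard-split.
        while len(sentence) > max_length:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_length])
            sentence = sentence[max_length:]
        if len(current) + len(sentence) > max_length:
            chunks.append(current)  # current chunk is full, start a new one
            current = sentence
        else:
            current += sentence
    if current:
        chunks.append(current)
    return chunks
```

A limit of 512 would match the input length of many BERT-style rerankers, though whether that is the reason for `max_length=512` here is not stated.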

Other Contribution

  • Baseline: 71.3%
  • Read tables from PDFs: 72%
  • Remove punctuation: 74%
  • Use bigrams: 80.0%
  • Use trigrams: 82.0%
  • Use 4-grams: 79.3%
  • Remove stopwords: 82.66%
  • Synonym expansion: 81.33%
  • Fixed synonym expansion: 82.0%
  • Use pkuseg tokenizer: 83.33%
  • Use BERT NER: 84.66%
  • Reranking: 92.66%
  • Embedding: 82.66%
  • Smart segment: 93.33%
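
The bigram, trigram, and 4-gram entries above change the token unit that BM25 indexes. The log does not say whether these were character or word n-grams; the sketch below assumes character n-grams, a common choice for Chinese text where there are no spaces between words. `char_ngrams` is a hypothetical helper.

```python
def char_ngrams(text: str, n: int) -> list[str]:
    """Overlapping character n-grams of text, ignoring whitespace.

    Assumes character-level n-grams; the original experiments may differ.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    chars = [c for c in text if not c.isspace()]
    if len(chars) < n:
        # Too short for a full n-gram: keep the text as a single token.
        return ["".join(chars)] if chars else []
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]
```

For example, `char_ngrams("保險契約", 2)` yields `["保險", "險契", "契約"]`. Longer n-grams are more specific but match less often, which is consistent with the score peaking at trigrams and dropping again at 4-grams.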