feat: Add normalized metadata reranking and query expansion (Fixes #40) #94
zohaib-7035 wants to merge 2 commits into INCF:main
QuantumByte-01 left a comment
The reranking logic is sound (bounded multipliers, log-normalization) and the test structure is good. Three issues to fix:
1. Factual error in QUERY_SYNONYMS (critical)
"mouse brain": ["Rattus norvegicus", ...]
Rattus norvegicus is rat, not mouse. Mouse is Mus musculus. Expanding a 'mouse brain' query to return rat datasets would give users wrong results. Fix the mapping.
2. expand_query is only wired into smart_knowledge_search
smart_knowledge_search is only called when filters are provided. The primary search paths (general_search, general_search_async) don't use expand_query. Either apply it consistently across all search entry points, or add a comment explaining why it is intentionally selective.
3. Float precision risk in tests
assert ranked[0]["_rerank_multiplier"] == 1.30 # can be 1.3000000000000003
assert ranked[0]["_score"] == 130.0
IEEE 754 float arithmetic (1.0 + 0.10 + 0.15 + 0.05) can produce 1.3000000000000003. Use pytest.approx:
assert ranked[0]["_rerank_multiplier"] == pytest.approx(1.30)
assert ranked[0]["_score"] == pytest.approx(130.0)
…pand_query consistently, use pytest.approx
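The first two review points could be addressed along the following lines. This is only a sketch under stated assumptions: the `QUERY_SYNONYMS` entries other than the corrected mouse/rat species names are illustrative, and the function signatures (including the `backend` callable standing in for the real retriever) are hypothetical, not the project's actual API. `general_search_async` would presumably follow the same pattern.

```python
# Hypothetical synonym table. Only the "mouse brain" key and the two species
# names come from the review; every other entry is illustrative.
QUERY_SYNONYMS = {
    "mouse brain": ["Mus musculus", "murine brain"],
    "rat brain": ["Rattus norvegicus"],
    "eeg": ["electroencephalography"],
    "fmri": ["functional magnetic resonance imaging"],
}


def expand_query(query):
    """Append synonyms for each known phrase found in the query (case-insensitive)."""
    lowered = query.lower()
    extras = []
    for phrase, synonyms in QUERY_SYNONYMS.items():
        if phrase in lowered:
            for synonym in synonyms:
                # Skip synonyms already present in the query or already added.
                if synonym.lower() not in lowered and synonym not in extras:
                    extras.append(synonym)
    return " ".join([query, *extras]) if extras else query


def general_search(query, backend):
    """Unfiltered entry point (assumed signature); expands before searching."""
    return backend(expand_query(query))


def smart_knowledge_search(query, filters, backend):
    """Filtered entry point (assumed signature); uses the same expansion step."""
    return backend(expand_query(query), filters)
```

Routing every entry point through the same `expand_query` call keeps filtered and unfiltered searches consistent. Note that the plain substring match used here can also fire inside longer words; a word-boundary match may be preferable in practice.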
Hi @QuantumByte-01, I've addressed all three requested changes.
All tests are now passing. Ready for your re-review; let me know if anything else is needed. Thanks!
Summary
This PR re-introduces the metadata reranking and query expansion features requested in issue #40, with the scoring logic rewritten to address the maintainer feedback about distorted search results and flat score additions.
Changes
- … math.log10() to citation counts to safely compress massive outliers (e.g. 10,000 citations).
- … 0.0 to 1.0 based on the dataset batches.
- … + 10.0), we apply max-capped multipliers: Citations (max 1.15x), Year (max 1.10x), Trusted Source (max 1.05x). This strictly restricts the absolute maximum possible metadata boost to +30%.
- … mouse brain, eeg, fmri, etc).
- … test_metadata_rerank.py mathematically proving the boundary multiplier limits are protected safely.
Test Output Proof (Passing Screenshots)
Notice how the multipliers are mathematically bounded and all pytest boundary tests pass cleanly!
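The bounded-multiplier scheme listed under Changes might look roughly like the sketch below. It is not the PR's actual code. The input field names (`base_score`, `citations`, `year`, `trusted`) are hypothetical, and summing the three capped bonuses into one multiplier is an assumption based on the +30% ceiling and the `1.0 + 0.10 + 0.15 + 0.05` arithmetic quoted in the review. Only `_rerank_multiplier` and `_score` appear in the reviewed tests.

```python
import math

# Caps from the PR description; combining them additively is an assumption.
MAX_CITATION_BONUS = 0.15
MAX_YEAR_BONUS = 0.10
MAX_TRUSTED_BONUS = 0.05


def _min_max(values):
    """Scale values to [0.0, 1.0] within one batch; a constant batch maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def rerank(results):
    """Sort results by base score times a metadata multiplier in [1.0, 1.30]."""
    if not results:
        return []
    # log10(1 + c) compresses outliers such as 10,000 citations.
    citations = _min_max([math.log10(1 + r.get("citations", 0)) for r in results])
    years = _min_max([r.get("year", 0) for r in results])
    ranked = []
    for result, cit, year in zip(results, citations, years):
        multiplier = (
            1.0
            + MAX_CITATION_BONUS * cit
            + MAX_YEAR_BONUS * year
            + (MAX_TRUSTED_BONUS if result.get("trusted") else 0.0)
        )
        ranked.append({
            **result,
            "_rerank_multiplier": multiplier,
            "_score": result["base_score"] * multiplier,
        })
    ranked.sort(key=lambda r: r["_score"], reverse=True)
    return ranked


batch = [
    {"base_score": 100.0, "citations": 0, "year": 2000, "trusted": False},
    {"base_score": 100.0, "citations": 10_000, "year": 2024, "trusted": True},
]
ranked = rerank(batch)
# Compare floats with a tolerance, in the spirit of the pytest.approx suggestion.
assert math.isclose(ranked[0]["_rerank_multiplier"], 1.30)
assert math.isclose(ranked[0]["_score"], 130.0)
```

Because every normalized signal lies in [0.0, 1.0], the multiplier cannot exceed 1.30 regardless of how extreme a citation count is. The closing assertions use `math.isclose` as a standard-library stand-in for `pytest.approx`.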