1 change: 1 addition & 0 deletions README.md
@@ -45,6 +45,7 @@ QueryGym implements the following query reformulation methods:
| **MuGI** | Multi-granularity information expansion with adaptive concatenation | [Zhang et al., 2024](https://arxiv.org/abs/2401.06311) |
| **LameR** | Context-based passage synthesis using retrieved documents | [Mackie et al., 2023](https://arxiv.org/abs/2304.14233) |
| **CSQE** | Context-based sentence-level query expansion (KEQE + CSQE) | [Lee et al., 2024](https://arxiv.org/abs/2402.18031) |
| **ThinkQE** | Multi-round reasoning-based query expansion with corpus feedback | [Le et al., 2025](https://arxiv.org/abs/2506.09260) |
| **Query2E** | Query to entity/keyword expansion | [Jagerman et al., 2023](https://arxiv.org/abs/2305.03653)|

For detailed usage and parameters, see the [Methods Reference](https://querygym.readthedocs.io/en/latest/user-guide/methods-reference/).
3 changes: 2 additions & 1 deletion docs/getting-started/quickstart.md
@@ -48,6 +48,7 @@ Available methods:
- `lamer` - Context-based passage synthesis
- `query2e` - Query to entity expansion
- `csqe` - Context-based sentence extraction
- `thinkqe` - Multi-round reasoning-based query expansion

### 3. Reformulate Queries

@@ -137,7 +138,7 @@ See [Loading Datasets](../user-guide/datasets.md) for more details.

## Context-Based Reformulation

-Some methods (like `lamer`, `csqe`) use retrieved contexts:
+Some methods (like `lamer`, `csqe`, `thinkqe`) use retrieved contexts:

```python
import querygym as qg
67 changes: 65 additions & 2 deletions docs/user-guide/methods-reference.md
@@ -15,6 +15,7 @@ Complete reference guide for all query reformulation methods in QueryGym, includ
- [LameR](#lamer)
- [Query2E](#query2e)
- [CSQE](#csqe)
- [ThinkQE](#thinkqe)

---

@@ -561,6 +562,68 @@ result = reformulator.reformulate(qg.QueryItem("q1", "quantum computing"))

---

### ThinkQE

**Method Name:** `"thinkqe"`
**Requires Context:** Yes
**Description:** Multi-round query expansion with retrieved passage feedback. Each round uses the original query plus newly retrieved passages to generate pseudo-passages, appends them to the retrieval query, and retrieves again.

#### Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `keep_passage_num` | int | `5` | Number of retrieved passages kept for prompting |
| `gen_num` | int | `2` | Number of expansions generated per round |
| `num_interaction` | int | `3` | Number of expansion rounds after baseline retrieval |
| `accumulate` | bool | `True` | Accumulate all previous expansions into later rounds |
| `use_passage_filter` | bool | `True` | Blacklist passages repeated from two rounds ago |
| `repeat_weight` | float | `3` | Divisor for adaptive query repetition |
| `search_k` | int | `keep_passage_num` | Retrieval depth for each round before filtering; use `1000` to mirror the original archive runs |
| `max_demo_len` | int | `None` | Optional word truncation length for each passage |
| `no_thinking` | bool | `False` | Prefill a closing `</think>` tag to disable reasoning traces |
| `searcher` | object | `None` | Pre-configured searcher instance (recommended) |
| `searcher_type` | str | `"pyserini"` | Type of searcher to create |
| `searcher_kwargs` | dict | `{}` | Keyword arguments for searcher initialization |
| `index` | str | `None` | Pyserini index name (legacy format) |
| `temperature` | float | `0.7` | Sampling temperature (via `llm_config`) |
| `max_tokens` | int | `32768` | Maximum tokens per generation (via `llm_config`) |

#### Usage Example

```python
import querygym as qg
from pyserini.search.lucene import LuceneSearcher

pyserini_searcher = LuceneSearcher.from_prebuilt_index("msmarco-v1-passage")
searcher = qg.wrap_pyserini_searcher(pyserini_searcher, answer_key="contents")

reformulator = qg.create_reformulator(
    "thinkqe",
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-14B",
    params={
        "searcher": searcher,
        "keep_passage_num": 5,
        "gen_num": 2,
        "num_interaction": 3,
        "accumulate": True,
        "use_passage_filter": True,
        "repeat_weight": 3,
        "search_k": 1000,
        "max_demo_len": 128,
    },
    llm_config={"temperature": 0.7, "max_tokens": 32768}
)

result = reformulator.reformulate_batch([qg.QueryItem("q1", "quantum computing")])[0]
```
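The round structure described for ThinkQE can be sketched in plain Python. This is an illustration under stated assumptions, not QueryGym's implementation: `thinkqe_rounds`, `truncate_words`, and the `search`/`generate` callables are hypothetical stand-ins, and the filter rule used here (skip passages that were shown to the model two rounds earlier) is only one reading of `use_passage_filter`.

```python
from typing import Callable, List, Optional, Sequence, Set, Tuple

Hit = Tuple[str, str]  # (doc_id, passage_text); a hypothetical hit shape


def truncate_words(text: str, max_words: Optional[int]) -> str:
    """Keep at most max_words whitespace-separated words (None means no limit)."""
    if max_words is None:
        return text
    return " ".join(text.split()[:max_words])


def thinkqe_rounds(
    query: str,
    search: Callable[[str, int], List[Hit]],
    generate: Callable[[str, Sequence[str]], List[str]],
    keep_passage_num: int = 5,
    num_interaction: int = 3,
    accumulate: bool = True,
    use_passage_filter: bool = True,
    search_k: Optional[int] = None,
    max_demo_len: Optional[int] = None,
) -> List[str]:
    """Return the expansions available after the last round (illustrative only)."""
    search_k = search_k or keep_passage_num
    expansions: List[str] = []
    shown_by_round: List[Set[str]] = []  # doc ids shown to the LLM in each round
    retrieval_query = query              # the first search is the baseline retrieval

    for _ in range(num_interaction):
        hits = search(retrieval_query, search_k)

        # Assumed filter rule: drop passages already shown two rounds ago.
        blacklist: Set[str] = set()
        if use_passage_filter and len(shown_by_round) >= 2:
            blacklist = shown_by_round[-2]
        kept = [hit for hit in hits if hit[0] not in blacklist][:keep_passage_num]
        shown_by_round.append({doc_id for doc_id, _ in kept})

        # Prompt the LLM with the original query plus (optionally truncated) passages.
        passages = [truncate_words(text, max_demo_len) for _, text in kept]
        new_expansions = generate(query, passages)

        # Either grow the pool of expansions or keep only the latest round.
        expansions = expansions + new_expansions if accumulate else new_expansions
        retrieval_query = "\n".join([query, *expansions])

    return expansions
```

With `accumulate=True`, three rounds and two generations per round leave six expansions in the final retrieval query; with `accumulate=False`, only the two from the last round remain.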

#### Output Format

- **Concatenation:** `(query × adaptive_repeat) + expansion_1 + expansion_2 + ...` using newline joins
- **Metadata:** Includes `round_history`, `gen_num`, `keep_passage_num`, `accumulated_count`, `q_repeat`, and per-round raw response counts
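
The concatenation rule above can be sketched as follows. The exact formula for the adaptive repeat count is not stated here, so this sketch assumes a length-based rule in which `repeat_weight` divides the ratio of expansion length to query length; `build_expanded_query` is a hypothetical helper, not a QueryGym function.

```python
from typing import List


def build_expanded_query(query: str, expansions: List[str], repeat_weight: float = 3) -> str:
    """Join the repeated query and the expansions with newlines (illustrative only)."""
    expansion_words = sum(len(text.split()) for text in expansions)
    query_words = max(1, len(query.split()))

    # Assumed rule: repeat the query more often as the expansions get longer,
    # so the original terms are not drowned out. Always keep at least one copy.
    q_repeat = max(1, int(expansion_words / (query_words * repeat_weight)))

    return "\n".join([query] * q_repeat + expansions)
```

Under this assumption, a larger `repeat_weight` yields fewer copies of the query for the same amount of generated text.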

---

## Quick Reference Table

| Method | Requires Context | Key Parameters | Default LLM Config |
@@ -573,12 +636,13 @@ result = reformulator.reformulate(qg.QueryItem("q1", "quantum computing"))
| **LameR** | Yes | `retrieval_k`, `gen_passages`, `searcher` | temp=1.0, max_tokens=128 |
| **Query2E** | No | `mode`, `max_keywords`, `num_examples` (fs) | temp=0.3, max_tokens=256 |
| **CSQE** | Yes | `retrieval_k`, `gen_num`, `searcher` | temp=1.0, max_tokens=1024 |
| **ThinkQE** | Yes | `keep_passage_num`, `gen_num`, `num_interaction`, `searcher` | temp=0.7, max_tokens=32768 |

---

## Tips and Best Practices

-1. **Context-Based Methods (LameR, CSQE):**
+1. **Context-Based Methods (LameR, CSQE, ThinkQE):**
   - Always provide a `searcher` instance or configure `searcher_type`/`searcher_kwargs`
   - Use `qg.wrap_pyserini_searcher()` for easy integration with Pyserini
   - Set an appropriate `retrieval_k` for your needs (default: 10); ThinkQE uses `keep_passage_num` and `search_k` instead
@@ -604,4 +668,3 @@ result = reformulator.reformulate(qg.QueryItem("q1", "quantum computing"))
- [API Reference](../api/methods.md) - Technical API documentation
- [Query Reformulation Guide](reformulation.md) - Usage tutorials
- [Examples](https://github.com/ls3-lab/QueryGym/tree/main/examples) - Complete workflow examples

21 changes: 21 additions & 0 deletions docs/user-guide/reformulation.md
@@ -93,6 +93,26 @@ reformulator = qg.create_reformulator("csqe", model="gpt-4")
results = reformulator.reformulate_batch(queries, contexts=contexts)
```

### ThinkQE

Multi-round reasoning-based expansion with iterative corpus feedback.

```python
# Assumes `searcher` was created earlier, e.g. with qg.wrap_pyserini_searcher(...)
reformulator = qg.create_reformulator(
    "thinkqe",
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-14B",
    params={
        "searcher": searcher,
        "num_interaction": 3,
        "keep_passage_num": 5,
        "gen_num": 2,
        "accumulate": True,
        "use_passage_filter": True,
        "search_k": 1000,
    },
)
```
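Reasoning models such as the one configured above typically emit a `<think>...</think>` trace before the final answer, so only the text after the closing tag is useful as an expansion. A minimal sketch of separating the two, assuming that tag format; `split_reasoning` is a hypothetical helper and not part of QueryGym:

```python
from typing import Tuple

THINK_OPEN = "<think>"
THINK_CLOSE = "</think>"


def split_reasoning(response: str) -> Tuple[str, str]:
    """Split a raw LLM response into (reasoning_trace, answer)."""
    # rpartition splits on the LAST closing tag, so the answer is whatever follows it.
    head, sep, tail = response.rpartition(THINK_CLOSE)
    if not sep:
        # No closing tag found: treat the whole response as the answer.
        return "", response.strip()
    reasoning = head.replace(THINK_OPEN, "", 1).strip()
    return reasoning, tail.strip()
```

Splitting on the last closing tag keeps the answer clean even if the trace itself happens to mention the tag.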

## Method Comparison

| Method | Requires Context | Type | Best For |
@@ -105,6 +125,7 @@ results = reformulator.reformulate_batch(queries, contexts=contexts)
| lamer | Yes | Context synthesis | Re-ranking |
| query2e | No | Entity expansion | Entity queries |
| csqe | Yes | Sentence extraction | Precision-focused |
| thinkqe | Yes | Iterative reasoning | Multi-round feedback |

## Custom Parameters

3 changes: 1 addition & 2 deletions examples/querygym_pyserini/README.md
@@ -127,7 +127,7 @@ python examples/querygym_pyserini/reformulate_queries.py \
```

**Options:**
-- `--method`: QueryGym method (genqr, genqr_ensemble, query2doc, qa_expand, mugi, lamer, query2e, csqe)
+- `--method`: QueryGym method (genqr, genqr_ensemble, query2doc, qa_expand, mugi, lamer, query2e, csqe, thinkqe)
- `--model`: LLM model name (e.g., qwen2.5:7b, llama3.1:8b, gpt-4, etc.)
- `--base-url`: LLM API endpoint (e.g., http://localhost:11434/v1)
- `--api-key`: LLM API key
@@ -525,4 +525,3 @@ python examples/querygym_pyserini/pipeline.py \
```

**Note:** The config file (`reformulation_config.yaml`) contains pre-configured settings for all methods, including complex method parameters. CLI arguments override config file values.
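
The override order in the note can be sketched with `argparse`. This is a minimal illustration, not the script's actual logic: `resolve_settings` is a hypothetical helper, the flat config dictionary is an assumed shape, and only the `--method`, `--model`, and `--base-url` flags listed above are modeled.

```python
import argparse
from typing import Any, Dict, List, Optional


def resolve_settings(config: Dict[str, Any], argv: Optional[List[str]] = None) -> Dict[str, Any]:
    """Merge config-file values with CLI flags, letting the CLI win (illustrative only)."""
    parser = argparse.ArgumentParser()
    # default=None distinguishes "flag not given" from an explicit value.
    parser.add_argument("--method", type=str, default=None)
    parser.add_argument("--model", type=str, default=None)
    parser.add_argument("--base-url", type=str, default=None)
    args = parser.parse_args(argv)

    settings = dict(config)  # start from the config file values
    for key, value in vars(args).items():
        if value is not None:  # only flags that were actually passed override
            settings[key] = value
    return settings
```

Using `None` as the default for every flag is what makes the precedence work: a flag that was not passed never overwrites a value from the config file.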

3 changes: 1 addition & 2 deletions examples/querygym_pyserini/reformulate_queries.py
@@ -300,7 +300,7 @@ def main():
    parser.add_argument(
        '--method',
        type=str,
-        help='QueryGym reformulation method (genqr, genqr_ensemble, query2doc, etc.)'
+        help='QueryGym reformulation method (genqr, genqr_ensemble, query2doc, lamer, csqe, thinkqe, etc.)'
    )
    parser.add_argument(
        '--model',
@@ -485,4 +485,3 @@ def main():

if __name__ == '__main__':
    main()

19 changes: 18 additions & 1 deletion examples/querygym_pyserini/reformulation_config.yaml
@@ -118,6 +118,24 @@ methods:
      gen_num: 2 # Number of expansions for both KEQE and CSQE (default: 2)
    # Note: Searcher is automatically configured from dataset registry

  # ThinkQE: Multi-round reasoning-based query expansion (requires retrieval)
  thinkqe:
    enabled: true
    model: "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B"
    llm:
      temperature: 0.7
      max_tokens: 32768
    params:
      keep_passage_num: 5
      gen_num: 2
      num_interaction: 3
      accumulate: true
      use_passage_filter: true
      repeat_weight: 3
      max_demo_len: 128
      search_k: 1000
    # Note: Searcher is automatically configured from dataset registry

# Example: Multiple configurations for the same method
# You can define multiple variants with different parameters:
#
@@ -136,4 +154,3 @@ methods:
# temperature: 0.9
# params:
# n_generations: 7
