feat: Add GET/POST /task/list endpoint#277
feat: Add GET/POST /task/list endpoint#277saathviksheerla wants to merge 3 commits intoopenml:mainfrom
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
WalkthroughAdds a new 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
📝 Coding Plan
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #277 +/- ##
=======================================
Coverage ? 52.94%
=======================================
Files ? 34
Lines ? 1526
Branches ? 128
=======================================
Hits ? 808
Misses ? 716
Partials ? 2 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Hey - I've found 4 issues
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location path="src/routers/openml/tasks.py" line_range="362-363" />
<code_context>
+ )
+ # build a reverse map: dataset_id -> task_id
+ # needed because quality rows come back keyed by did, but our tasks dict is keyed by task_id
+ did_to_task_id = {t["did"]: tid for tid, t in tasks.items()}
+ for row in qualities_result.all():
+ tid = did_to_task_id.get(row.data)
+ if tid is not None:
</code_context>
<issue_to_address>
**issue (bug_risk):** Qualities are only attached to one task per dataset, dropping qualities for other tasks sharing the same dataset.
Because `did_to_task_id` maps each dataset id to a single `task_id`, only the last task using a given dataset receives qualities; earlier tasks end up with empty `quality` lists. To handle multiple tasks per dataset, map `did -> list[task_id]` and, when iterating over `qualities_result`, attach each quality row to all tasks in that list.
</issue_to_address>
### Comment 2
<location path="tests/routers/openml/task_test.py" line_range="36-42" />
<code_context>
+ assert all(t["task_type_id"] == 1 for t in tasks)
+
+
+async def test_list_tasks_filter_tag(py_api: httpx.AsyncClient) -> None:
+ """Filter by tag returns only tasks with that tag."""
+ response = await py_api.post("/tasks/list", json={"tag": "OpenML100"})
+ assert response.status_code == HTTPStatus.OK
+ tasks = response.json()
+ assert len(tasks) > 0
+ assert all("OpenML100" in t["tag"] for t in tasks)
+
+
</code_context>
<issue_to_address>
**suggestion (testing):** Add coverage for the remaining filters (`data_tag`, `task_id`, `data_id`, `data_name`, and other quality ranges).
Current tests exercise `task_type_id`, `tag`, `pagination`, and `number_instances`, but not the other supported filters: `data_tag`, `task_id`, `data_id`, `data_name`, `number_features`, `number_classes`, and `number_missing_values`. This leaves several WHERE-clause branches untested.
Consider adding targeted tests such as:
- `data_tag`: only tasks whose dataset has the given tag are returned.
- `task_id`: multiple IDs return exactly those tasks (and correct ordering/uniqueness if relevant).
- `data_id`: only tasks for the specified dataset(s) are returned.
- `data_name`: verifies expected matching behavior (case-(in)sensitivity as intended).
- One or more quality ranges using `number_features` or `number_classes` to exercise `_quality_clause` beyond `NumberOfInstances`.
Suggested implementation:
```python
async def test_list_tasks_filter_type(py_api: httpx.AsyncClient) -> None:
"""Filter by task_type_id returns only tasks of that type."""
response = await py_api.post("/tasks/list", json={"task_type_id": 1})
assert response.status_code == HTTPStatus.OK
tasks = response.json()
assert all(t["task_type_id"] == 1 for t in tasks)
async def test_list_tasks_default(py_api: httpx.AsyncClient) -> None:
"""Default call returns active tasks with correct shape."""
response = await py_api.post("/tasks/list", json={})
assert response.status_code == HTTPStatus.OK
tasks = response.json()
assert isinstance(tasks, list)
assert len(tasks) > 0
# verify shape of first task
task = tasks[0]
async def test_list_tasks_filter_data_tag(py_api: httpx.AsyncClient) -> None:
"""Filter by data_tag returns only tasks whose dataset has that tag."""
# Get a tag actually present on at least one dataset
response = await py_api.post("/tasks/list", json={})
assert response.status_code == HTTPStatus.OK
tasks = response.json()
# Find first task that has at least one data tag
tagged_task = next(
t for t in tasks if t.get("data_tags") is not None and len(t.get("data_tags")) > 0
)
data_tag = tagged_task["data_tags"][0]
response = await py_api.post("/tasks/list", json={"data_tag": data_tag})
assert response.status_code == HTTPStatus.OK
filtered = response.json()
assert len(filtered) > 0
assert all(
data_tag in t.get("data_tags", []) for t in filtered
), "All returned tasks should have the dataset tag used for filtering"
async def test_list_tasks_filter_task_id(py_api: httpx.AsyncClient) -> None:
"""Filter by task_id returns exactly the requested tasks."""
# Discover a small set of valid task IDs from the default listing
response = await py_api.post("/tasks/list", json={})
assert response.status_code == HTTPStatus.OK
tasks = response.json()
assert len(tasks) >= 2
requested_ids = sorted(t["task_id"] for t in tasks[:3])
response = await py_api.post("/tasks/list", json={"task_id": requested_ids})
assert response.status_code == HTTPStatus.OK
filtered = response.json()
returned_ids = sorted(t["task_id"] for t in filtered)
assert set(returned_ids) == set(
requested_ids
), "Filtering by task_id should return exactly the requested tasks"
# If the API defines a particular ordering, assert on that here as well
# (e.g. sorted by task_id). The current check ensures no duplicates.
assert len(returned_ids) == len(set(returned_ids))
async def test_list_tasks_filter_data_id(py_api: httpx.AsyncClient) -> None:
"""Filter by data_id returns only tasks from those datasets."""
response = await py_api.post("/tasks/list", json={})
assert response.status_code == HTTPStatus.OK
tasks = response.json()
assert len(tasks) > 0
# Take a couple of distinct data_ids from the available tasks
data_ids = []
for t in tasks:
if t["data_id"] not in data_ids:
data_ids.append(t["data_id"])
if len(data_ids) == 2:
break
response = await py_api.post("/tasks/list", json={"data_id": data_ids})
assert response.status_code == HTTPStatus.OK
filtered = response.json()
assert len(filtered) > 0
assert all(
t["data_id"] in data_ids for t in filtered
), "All returned tasks should belong to one of the requested data_ids"
async def test_list_tasks_filter_data_name(py_api: httpx.AsyncClient) -> None:
"""Filter by data_name returns only tasks whose dataset has that name."""
response = await py_api.post("/tasks/list", json={})
assert response.status_code == HTTPStatus.OK
tasks = response.json()
assert len(tasks) > 0
data_name = tasks[0]["data_name"]
# Use the exact name observed in the listing
response = await py_api.post("/tasks/list", json={"data_name": data_name})
assert response.status_code == HTTPStatus.OK
filtered = response.json()
assert len(filtered) > 0
assert all(
t["data_name"] == data_name for t in filtered
), "All returned tasks should have the requested data_name"
async def test_list_tasks_filter_number_features_range(py_api: httpx.AsyncClient) -> None:
"""Filter by number_features quality range returns tasks within that range."""
# Probe existing tasks to discover a concrete number_features value
response = await py_api.post("/tasks/list", json={})
assert response.status_code == HTTPStatus.OK
tasks = response.json()
# Find a task with a defined number_features value
task_with_features = next(
t for t in tasks if t.get("number_features") is not None
)
num_features = task_with_features["number_features"]
# Build a narrow range around that concrete value to exercise the quality clause
payload = {
"number_features": {
"min": num_features,
"max": num_features,
}
}
response = await py_api.post("/tasks/list", json=payload)
assert response.status_code == HTTPStatus.OK
filtered = response.json()
assert len(filtered) > 0
assert all(
t.get("number_features") is not None
and payload["number_features"]["min"]
<= t["number_features"]
<= payload["number_features"]["max"]
for t in filtered
), "All returned tasks should have number_features within the requested range"
```
These tests assume:
- Response objects include `data_tags` (a list of strings on each task), `task_id`, `data_id`, `data_name`, and `number_features`.
- The `/tasks/list` request model accepts:
- `data_tag` as a single string.
- `task_id` and `data_id` as lists of integers.
- `data_name` as a string.
- Quality ranges in the form `{"number_features": {"min": ..., "max": ...}}`.
To fully align with your existing API:
1. Adjust the field names (`data_tags` vs. `dataset_tags`, etc.) in the assertions to match the actual task schema.
2. Align the request payload shapes with whatever is already used in the existing `number_instances` test (e.g., if it uses `number_features_min` / `number_features_max` instead of a nested dict, mirror that here).
3. If `task_id` / `data_id` only accept a single integer instead of a list, modify `requested_ids` / `data_ids` accordingly and assert on a single returned task.
4. If data-name filtering is case-insensitive, extend `test_list_tasks_filter_data_name` to issue a request with `data_name` in a different case and assert that the same tasks are returned.
</issue_to_address>
### Comment 3
<location path="tests/routers/openml/task_test.py" line_range="56-62" />
<code_context>
+ assert ids1 != ids2
+
+
+async def test_list_tasks_number_instances_range(py_api: httpx.AsyncClient) -> None:
+ """number_instances range filter returns tasks whose dataset matches."""
+ response = await py_api.post(
+ "/tasks/list",
+ json={"number_instances": "100..1000"},
+ )
+ assert response.status_code == HTTPStatus.OK
+ assert len(response.json()) > 0
+
+
</code_context>
<issue_to_address>
**suggestion (testing):** Consider adding a negative test for invalid range syntax to ensure it does not result in a server error.
Since `_quality_clause` raises `ValueError` when `range_` doesn’t match `integer_range_regex`, it would be valuable to add a test that covers an invalid `number_instances` value (e.g. `"abc"` or `"1...2"`) and verifies:
- The response status is a validation error (e.g. 422) rather than 500.
- The error body matches the API’s standard error format.
This guards against accidentally surfacing raw `ValueError`s as 500s if validation behavior changes later.
```suggestion
async def test_list_tasks_pagination_offset(py_api: httpx.AsyncClient) -> None:
"""Offset returns different results than no offset."""
r1 = await py_api.post("/tasks/list", json={"pagination": {"limit": 5, "offset": 0}})
r2 = await py_api.post("/tasks/list", json={"pagination": {"limit": 5, "offset": 5}})
ids1 = [t["task_id"] for t in r1.json()]
ids2 = [t["task_id"] for t in r2.json()]
assert ids1 != ids2
async def test_list_tasks_number_instances_range(py_api: httpx.AsyncClient) -> None:
"""number_instances range filter returns tasks whose dataset matches."""
response = await py_api.post(
"/tasks/list",
json={"number_instances": "100..1000"},
)
assert response.status_code == HTTPStatus.OK
assert len(response.json()) > 0
async def test_list_tasks_number_instances_invalid_range(py_api: httpx.AsyncClient) -> None:
"""Invalid number_instances range does not raise server error and returns validation error."""
response = await py_api.post(
"/tasks/list",
json={"number_instances": "1...2"},
)
assert response.status_code == HTTPStatus.UNPROCESSABLE_ENTITY
body = response.json()
assert isinstance(body, dict)
# Assert the error body follows the API's standard validation error format.
# Adjust these expectations if your API uses a different envelope.
assert "detail" in body
```
</issue_to_address>
### Comment 4
<location path="src/routers/openml/tasks.py" line_range="220" />
<code_context>
+
+@router.post(path="/list", description="Provided for convenience, same as `GET` endpoint.")
+@router.get(path="/list")
+async def list_tasks( # noqa: PLR0913
+ pagination: Annotated[Pagination, Body(default_factory=Pagination)],
+ task_type_id: Annotated[int | None, Body(description="Filter by task type id.")] = None,
</code_context>
<issue_to_address>
**issue (complexity):** Consider extracting small helper functions for IN-clause construction, range parsing, WHERE-clause assembly, dataset-status subquery, and task-enrichment to make `list_tasks` shorter and easier to follow.
You can keep the new behavior but reduce complexity by extracting a few focused helpers and reusing them.
### 1. Extract a generic `IN` clause builder
You repeat `",".join(...)` and f-string interpolation for IDs and names. A tiny helper keeps SQL construction consistent and reduces visual noise:
```python
def _in_clause(values: list[int | str]) -> str:
# Callers are responsible for correctness of values (already sanitized IDs or enums)
return ", ".join(str(v) for v in values)
```
Usage:
```python
task_ids_str = _in_clause(tasks.keys())
did_list = _in_clause([t["did"] for t in tasks.values()])
basic_inputs_str = _in_clause([f"'{i}'" for i in BASIC_TASK_INPUTS])
qualities_str = _in_clause([f"'{q}'" for q in QUALITIES_TO_SHOW])
```
This shrinks the `list_tasks` body and centralizes the `IN (...)` formatting.
### 2. Split range parsing from quality SQL rendering
`_quality_clause` currently parses and renders SQL. Splitting the concerns makes both parts easier to read and test.
```python
def _parse_integer_range(range_: str | None) -> tuple[str, str | None] | None:
if not range_:
return None
match = re.match(integer_range_regex, range_)
if not match:
msg = f"`range_` not a valid range: {range_}"
raise ValueError(msg)
start, end = match.groups()
end_num = end[2:] if end else None # strip ".."
return start, end_num
```
Then a generic quality clause builder:
```python
def _quality_clause(quality: str, range_: str | None) -> str:
parsed = _parse_integer_range(range_)
if not parsed:
return ""
start, end = parsed
value_expr = f"`value` BETWEEN {start} AND {end}" if end else f"`value`={start}"
return f"""
AND t.`task_id` IN (
SELECT ti.`task_id` FROM task_inputs ti
WHERE ti.`input`='source_data' AND ti.`value` IN (
SELECT `data` FROM data_quality
WHERE `quality`='{quality}' AND {value_expr}
)
)
"""
```
This leaves `list_tasks` with just the high-level mapping:
```python
where_number_instances = _quality_clause("NumberOfInstances", number_instances)
...
```
### 3. Extract WHERE-clause assembly into a helper
The top of `list_tasks` builds many related conditions inline. Move that to a dedicated function that returns both fragments and the parameter dict:
```python
def _build_task_where_clauses(
status: TaskStatusFilter,
task_type_id: int | None,
tag: str | None,
data_tag: str | None,
task_id: list[int] | None,
data_id: list[int] | None,
data_name: str | None,
number_instances: str | None,
number_features: str | None,
number_classes: str | None,
number_missing_values: str | None,
) -> tuple[str, dict[str, Any]]:
if status == TaskStatusFilter.ALL:
where_status = ""
else:
where_status = f"AND IFNULL(ds.`status`, 'in_preparation') = '{status}'"
where_type = "" if task_type_id is None else "AND t.`ttid` = :task_type_id"
where_tag = "" if tag is None else \
"AND t.`task_id` IN (SELECT `id` FROM task_tag WHERE `tag` = :tag)"
where_data_tag = "" if data_tag is None else \
"AND d.`did` IN (SELECT `id` FROM dataset_tag WHERE `tag` = :data_tag)"
where_task_id = "" if not task_id else f"AND t.`task_id` IN ({_in_clause(task_id)})"
where_data_id = "" if not data_id else f"AND d.`did` IN ({_in_clause(data_id)})"
where_data_name = "" if data_name is None else "AND d.`name` = :data_name"
where_number_instances = _quality_clause("NumberOfInstances", number_instances)
where_number_features = _quality_clause("NumberOfFeatures", number_features)
where_number_classes = _quality_clause("NumberOfClasses", number_classes)
where_number_missing_values = _quality_clause(
"NumberOfMissingValues", number_missing_values
)
where_sql = (
where_status
+ where_type
+ where_tag
+ where_data_tag
+ where_task_id
+ where_data_id
+ where_data_name
+ where_number_instances
+ where_number_features
+ where_number_classes
+ where_number_missing_values
)
params = {
"task_type_id": task_type_id,
"tag": tag,
"data_tag": data_tag,
"data_name": data_name,
}
return where_sql, params
```
Then `list_tasks` reduces to:
```python
where_sql, params = _build_task_where_clauses(
status, task_type_id, tag, data_tag, task_id, data_id,
data_name, number_instances, number_features,
number_classes, number_missing_values,
)
query = text(f"""
SELECT ...
FROM ...
WHERE 1=1
{where_sql}
GROUP BY ...
LIMIT {pagination.limit} OFFSET {pagination.offset}
""")
result = await expdb.execute(query, parameters=params)
```
### 4. Extract the dataset-status subquery
Moving the status subquery out of the main query makes the main SQL much easier to read:
```python
LATEST_DATASET_STATUS_SUBQUERY = """
SELECT ds1.did, ds1.status
FROM dataset_status ds1
WHERE ds1.status_date = (
SELECT MAX(ds2.status_date)
FROM dataset_status ds2
WHERE ds1.did = ds2.did
)
"""
```
Then in `list_tasks`:
```python
query = text(f"""
SELECT ...
FROM task t
...
LEFT JOIN ({LATEST_DATASET_STATUS_SUBQUERY}) ds ON ds.`did` = d.`did`
WHERE 1=1
{where_sql}
...
""")
```
### 5. Unify enrichment of `tasks` (inputs/qualities/tags)
You can use a generic helper for the “append or init list” pattern to avoid repetition:
```python
def _append_task_attr(
tasks: dict[int, dict[str, Any]],
task_id: int,
key: str,
value: Any,
) -> None:
tasks[task_id].setdefault(key, []).append(value)
```
Then:
```python
# inputs
for row in inputs_result.all():
_append_task_attr(
tasks,
row.task_id,
"input",
{"name": row.input, "value": row.value},
)
# qualities
for row in qualities_result.all():
tid = did_to_task_id.get(row.data)
if tid is not None:
_append_task_attr(
tasks,
tid,
"quality",
{"name": row.quality, "value": str(row.value)},
)
# tags
for row in tags_result.all():
_append_task_attr(tasks, row.id, "tag", row.tag)
```
And a single “ensure keys” helper:
```python
def _ensure_task_keys(task: dict[str, Any], keys: list[str]) -> None:
for key in keys:
task.setdefault(key, [])
```
Usage:
```python
for task in tasks.values():
_ensure_task_keys(task, ["input", "quality", "tag"])
```
---
These extractions keep all current behavior and SQL intact, but make `list_tasks` mostly a high‑level orchestration over small, single-purpose helpers, which should address the complexity concerns.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| async def test_list_tasks_filter_tag(py_api: httpx.AsyncClient) -> None: | ||
| """Filter by tag returns only tasks with that tag.""" | ||
| response = await py_api.post("/tasks/list", json={"tag": "OpenML100"}) | ||
| assert response.status_code == HTTPStatus.OK | ||
| tasks = response.json() | ||
| assert len(tasks) > 0 | ||
| assert all("OpenML100" in t["tag"] for t in tasks) |
There was a problem hiding this comment.
suggestion (testing): Add coverage for the remaining filters (data_tag, task_id, data_id, data_name, and other quality ranges).
Current tests exercise task_type_id, tag, pagination, and number_instances, but not the other supported filters: data_tag, task_id, data_id, data_name, number_features, number_classes, and number_missing_values. This leaves several WHERE-clause branches untested.
Consider adding targeted tests such as:
data_tag: only tasks whose dataset has the given tag are returned.task_id: multiple IDs return exactly those tasks (and correct ordering/uniqueness if relevant).data_id: only tasks for the specified dataset(s) are returned.data_name: verifies expected matching behavior (case-(in)sensitivity as intended).- One or more quality ranges using
number_featuresornumber_classesto exercise_quality_clausebeyondNumberOfInstances.
Suggested implementation:
async def test_list_tasks_filter_type(py_api: httpx.AsyncClient) -> None:
"""Filter by task_type_id returns only tasks of that type."""
response = await py_api.post("/tasks/list", json={"task_type_id": 1})
assert response.status_code == HTTPStatus.OK
tasks = response.json()
assert all(t["task_type_id"] == 1 for t in tasks)
async def test_list_tasks_default(py_api: httpx.AsyncClient) -> None:
"""Default call returns active tasks with correct shape."""
response = await py_api.post("/tasks/list", json={})
assert response.status_code == HTTPStatus.OK
tasks = response.json()
assert isinstance(tasks, list)
assert len(tasks) > 0
# verify shape of first task
task = tasks[0]
async def test_list_tasks_filter_data_tag(py_api: httpx.AsyncClient) -> None:
"""Filter by data_tag returns only tasks whose dataset has that tag."""
# Get a tag actually present on at least one dataset
response = await py_api.post("/tasks/list", json={})
assert response.status_code == HTTPStatus.OK
tasks = response.json()
# Find first task that has at least one data tag
tagged_task = next(
t for t in tasks if t.get("data_tags") is not None and len(t.get("data_tags")) > 0
)
data_tag = tagged_task["data_tags"][0]
response = await py_api.post("/tasks/list", json={"data_tag": data_tag})
assert response.status_code == HTTPStatus.OK
filtered = response.json()
assert len(filtered) > 0
assert all(
data_tag in t.get("data_tags", []) for t in filtered
), "All returned tasks should have the dataset tag used for filtering"
async def test_list_tasks_filter_task_id(py_api: httpx.AsyncClient) -> None:
"""Filter by task_id returns exactly the requested tasks."""
# Discover a small set of valid task IDs from the default listing
response = await py_api.post("/tasks/list", json={})
assert response.status_code == HTTPStatus.OK
tasks = response.json()
assert len(tasks) >= 2
requested_ids = sorted(t["task_id"] for t in tasks[:3])
response = await py_api.post("/tasks/list", json={"task_id": requested_ids})
assert response.status_code == HTTPStatus.OK
filtered = response.json()
returned_ids = sorted(t["task_id"] for t in filtered)
assert set(returned_ids) == set(
requested_ids
), "Filtering by task_id should return exactly the requested tasks"
# If the API defines a particular ordering, assert on that here as well
# (e.g. sorted by task_id). The current check ensures no duplicates.
assert len(returned_ids) == len(set(returned_ids))
async def test_list_tasks_filter_data_id(py_api: httpx.AsyncClient) -> None:
"""Filter by data_id returns only tasks from those datasets."""
response = await py_api.post("/tasks/list", json={})
assert response.status_code == HTTPStatus.OK
tasks = response.json()
assert len(tasks) > 0
# Take a couple of distinct data_ids from the available tasks
data_ids = []
for t in tasks:
if t["data_id"] not in data_ids:
data_ids.append(t["data_id"])
if len(data_ids) == 2:
break
response = await py_api.post("/tasks/list", json={"data_id": data_ids})
assert response.status_code == HTTPStatus.OK
filtered = response.json()
assert len(filtered) > 0
assert all(
t["data_id"] in data_ids for t in filtered
), "All returned tasks should belong to one of the requested data_ids"
async def test_list_tasks_filter_data_name(py_api: httpx.AsyncClient) -> None:
"""Filter by data_name returns only tasks whose dataset has that name."""
response = await py_api.post("/tasks/list", json={})
assert response.status_code == HTTPStatus.OK
tasks = response.json()
assert len(tasks) > 0
data_name = tasks[0]["data_name"]
# Use the exact name observed in the listing
response = await py_api.post("/tasks/list", json={"data_name": data_name})
assert response.status_code == HTTPStatus.OK
filtered = response.json()
assert len(filtered) > 0
assert all(
t["data_name"] == data_name for t in filtered
), "All returned tasks should have the requested data_name"
async def test_list_tasks_filter_number_features_range(py_api: httpx.AsyncClient) -> None:
"""Filter by number_features quality range returns tasks within that range."""
# Probe existing tasks to discover a concrete number_features value
response = await py_api.post("/tasks/list", json={})
assert response.status_code == HTTPStatus.OK
tasks = response.json()
# Find a task with a defined number_features value
task_with_features = next(
t for t in tasks if t.get("number_features") is not None
)
num_features = task_with_features["number_features"]
# Build a narrow range around that concrete value to exercise the quality clause
payload = {
"number_features": {
"min": num_features,
"max": num_features,
}
}
response = await py_api.post("/tasks/list", json=payload)
assert response.status_code == HTTPStatus.OK
filtered = response.json()
assert len(filtered) > 0
assert all(
t.get("number_features") is not None
and payload["number_features"]["min"]
<= t["number_features"]
<= payload["number_features"]["max"]
for t in filtered
), "All returned tasks should have number_features within the requested range"These tests assume:
- Response objects include
data_tags(a list of strings on each task),task_id,data_id,data_name, andnumber_features. - The
/tasks/listrequest model accepts:data_tagas a single string.task_idanddata_idas lists of integers.data_nameas a string.- Quality ranges in the form
{"number_features": {"min": ..., "max": ...}}.
To fully align with your existing API:
- Adjust the field names (
data_tagsvs.dataset_tags, etc.) in the assertions to match the actual task schema. - Align the request payload shapes with whatever is already used in the existing
number_instancestest (e.g., if it usesnumber_features_min/number_features_maxinstead of a nested dict, mirror that here). - If
task_id/data_idonly accept a single integer instead of a list, modifyrequested_ids/data_idsaccordingly and assert on a single returned task. - If data-name filtering is case-insensitive, extend
test_list_tasks_filter_data_nameto issue a request withdata_namein a different case and assert that the same tasks are returned.
|
|
||
| @router.post(path="/list", description="Provided for convenience, same as `GET` endpoint.") | ||
| @router.get(path="/list") | ||
| async def list_tasks( # noqa: PLR0913 |
There was a problem hiding this comment.
issue (complexity): Consider extracting small helper functions for IN-clause construction, range parsing, WHERE-clause assembly, dataset-status subquery, and task-enrichment to make list_tasks shorter and easier to follow.
You can keep the new behavior but reduce complexity by extracting a few focused helpers and reusing them.
1. Extract a generic IN clause builder
You repeat ",".join(...) and f-string interpolation for IDs and names. A tiny helper keeps SQL construction consistent and reduces visual noise:
def _in_clause(values: list[int | str]) -> str:
# Callers are responsible for correctness of values (already sanitized IDs or enums)
return ", ".join(str(v) for v in values)Usage:
task_ids_str = _in_clause(tasks.keys())
did_list = _in_clause([t["did"] for t in tasks.values()])
basic_inputs_str = _in_clause([f"'{i}'" for i in BASIC_TASK_INPUTS])
qualities_str = _in_clause([f"'{q}'" for q in QUALITIES_TO_SHOW])This shrinks the list_tasks body and centralizes the IN (...) formatting.
2. Split range parsing from quality SQL rendering
_quality_clause currently parses and renders SQL. Splitting the concerns makes both parts easier to read and test.
def _parse_integer_range(range_: str | None) -> tuple[str, str | None] | None:
if not range_:
return None
match = re.match(integer_range_regex, range_)
if not match:
msg = f"`range_` not a valid range: {range_}"
raise ValueError(msg)
start, end = match.groups()
end_num = end[2:] if end else None # strip ".."
return start, end_numThen a generic quality clause builder:
def _quality_clause(quality: str, range_: str | None) -> str:
parsed = _parse_integer_range(range_)
if not parsed:
return ""
start, end = parsed
value_expr = f"`value` BETWEEN {start} AND {end}" if end else f"`value`={start}"
return f"""
AND t.`task_id` IN (
SELECT ti.`task_id` FROM task_inputs ti
WHERE ti.`input`='source_data' AND ti.`value` IN (
SELECT `data` FROM data_quality
WHERE `quality`='{quality}' AND {value_expr}
)
)
"""This leaves list_tasks with just the high-level mapping:
where_number_instances = _quality_clause("NumberOfInstances", number_instances)
...3. Extract WHERE-clause assembly into a helper
The top of list_tasks builds many related conditions inline. Move that to a dedicated function that returns both fragments and the parameter dict:
def _build_task_where_clauses(
status: TaskStatusFilter,
task_type_id: int | None,
tag: str | None,
data_tag: str | None,
task_id: list[int] | None,
data_id: list[int] | None,
data_name: str | None,
number_instances: str | None,
number_features: str | None,
number_classes: str | None,
number_missing_values: str | None,
) -> tuple[str, dict[str, Any]]:
if status == TaskStatusFilter.ALL:
where_status = ""
else:
where_status = f"AND IFNULL(ds.`status`, 'in_preparation') = '{status}'"
where_type = "" if task_type_id is None else "AND t.`ttid` = :task_type_id"
where_tag = "" if tag is None else \
"AND t.`task_id` IN (SELECT `id` FROM task_tag WHERE `tag` = :tag)"
where_data_tag = "" if data_tag is None else \
"AND d.`did` IN (SELECT `id` FROM dataset_tag WHERE `tag` = :data_tag)"
where_task_id = "" if not task_id else f"AND t.`task_id` IN ({_in_clause(task_id)})"
where_data_id = "" if not data_id else f"AND d.`did` IN ({_in_clause(data_id)})"
where_data_name = "" if data_name is None else "AND d.`name` = :data_name"
where_number_instances = _quality_clause("NumberOfInstances", number_instances)
where_number_features = _quality_clause("NumberOfFeatures", number_features)
where_number_classes = _quality_clause("NumberOfClasses", number_classes)
where_number_missing_values = _quality_clause(
"NumberOfMissingValues", number_missing_values
)
where_sql = (
where_status
+ where_type
+ where_tag
+ where_data_tag
+ where_task_id
+ where_data_id
+ where_data_name
+ where_number_instances
+ where_number_features
+ where_number_classes
+ where_number_missing_values
)
params = {
"task_type_id": task_type_id,
"tag": tag,
"data_tag": data_tag,
"data_name": data_name,
}
return where_sql, paramsThen list_tasks reduces to:
where_sql, params = _build_task_where_clauses(
status, task_type_id, tag, data_tag, task_id, data_id,
data_name, number_instances, number_features,
number_classes, number_missing_values,
)
query = text(f"""
SELECT ...
FROM ...
WHERE 1=1
{where_sql}
GROUP BY ...
LIMIT {pagination.limit} OFFSET {pagination.offset}
""")
result = await expdb.execute(query, parameters=params)4. Extract the dataset-status subquery
Moving the status subquery out of the main query makes the main SQL much easier to read:
LATEST_DATASET_STATUS_SUBQUERY = """
SELECT ds1.did, ds1.status
FROM dataset_status ds1
WHERE ds1.status_date = (
SELECT MAX(ds2.status_date)
FROM dataset_status ds2
WHERE ds1.did = ds2.did
)
"""Then in list_tasks:
query = text(f"""
SELECT ...
FROM task t
...
LEFT JOIN ({LATEST_DATASET_STATUS_SUBQUERY}) ds ON ds.`did` = d.`did`
WHERE 1=1
{where_sql}
...
""")5. Unify enrichment of tasks (inputs/qualities/tags)
You can use a generic helper for the “append or init list” pattern to avoid repetition:
def _append_task_attr(
tasks: dict[int, dict[str, Any]],
task_id: int,
key: str,
value: Any,
) -> None:
tasks[task_id].setdefault(key, []).append(value)Then:
# inputs
for row in inputs_result.all():
_append_task_attr(
tasks,
row.task_id,
"input",
{"name": row.input, "value": row.value},
)
# qualities
for row in qualities_result.all():
tid = did_to_task_id.get(row.data)
if tid is not None:
_append_task_attr(
tasks,
tid,
"quality",
{"name": row.quality, "value": str(row.value)},
)
# tags
for row in tags_result.all():
_append_task_attr(tasks, row.id, "tag", row.tag)And a single “ensure keys” helper:
def _ensure_task_keys(task: dict[str, Any], keys: list[str]) -> None:
for key in keys:
task.setdefault(key, [])Usage:
for task in tasks.values():
_ensure_task_keys(task, ["input", "quality", "tag"])These extractions keep all current behavior and SQL intact, but make list_tasks mostly a high‑level orchestration over small, single-purpose helpers, which should address the complexity concerns.
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (3)
tests/routers/openml/task_test.py (3)
7-25: Assert the defaultactivecontract.This currently checks shape only, so it still passes if the endpoint stops applying the default status filter. Please assert that every returned task has
status == "active".🧪 Small test hardening
tasks = response.json() assert isinstance(tasks, list) assert len(tasks) > 0 + assert all(task["status"] == "active" for task in tasks) # verify shape of first task🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/routers/openml/task_test.py` around lines 7 - 25, The test test_list_tasks_default should enforce the default "active" contract: after fetching response from the "/tasks/list" endpoint and validating tasks is a non-empty list (as shown), iterate over the returned tasks (e.g., for task in tasks) and assert task["status"] == "active" for every item; add this assertion to the test_list_tasks_default function so the endpoint is required to apply the default active status filter.
28-33: Avoid a vacuous pass in the type-filter test.
all(t["task_type_id"] == 1 for t in tasks)is true for[], so a regression that returns no rows still passes.🧪 Small test hardening
assert response.status_code == HTTPStatus.OK tasks = response.json() + assert len(tasks) > 0 assert all(t["task_type_id"] == 1 for t in tasks)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/routers/openml/task_test.py` around lines 28 - 33, The test test_list_tasks_filter_type currently uses all(...) which vacuously passes for an empty list; modify the test to first assert that the response contains at least one task (e.g., assert tasks), then keep the existing check that all(t["task_type_id"] == 1 for t in tasks). Locate the test function test_list_tasks_filter_type and the POST to "/tasks/list" (json={"task_type_id": 1}) and add a non-empty assertion before the all(...) assertion so the test fails if no rows are returned.
65-72: Check the filtered values, not just that something came back.
assert len(response.json()) > 0does not provenumber_instancesis working. This test would still pass if the filter were ignored completely.🧪 Stronger assertion
) assert response.status_code == HTTPStatus.OK - assert len(response.json()) > 0 + tasks = response.json() + assert len(tasks) > 0 + for task in tasks: + qualities = {q["name"]: q["value"] for q in task["quality"]} + assert 100 <= float(qualities["NumberOfInstances"]) <= 1000🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/routers/openml/task_test.py` around lines 65 - 72, The test test_list_tasks_number_instances_range currently only asserts that something was returned; update it to validate the filter by parsing response.json() (ensure it's a list), then for each returned task object (from the POST to "/tasks/list" with json {"number_instances":"100..1000"}) extract the numeric value (e.g., task["number_instances"] or task["dataset"]["number_instances"] depending on payload shape), coerce it to int, and assert each value is between 100 and 1000 inclusive so the test fails if the filter is ignored or misapplied.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/routers/openml/tasks.py`:
- Around line 277-307: The SELECT query built into the variable query currently
lacks an ORDER BY, making pagination via pagination.limit and pagination.offset
unstable; modify the SQL assembled in query (the multi-line text(...) that
references status_subquery and the various {where_*} clauses) to append a
deterministic ORDER BY before the LIMIT/OFFSET — e.g. order by a stable unique
column such as t.`task_id` (and secondary columns like d.`did` if needed) so
results are consistently sorted across pages.
- Around line 360-368: The current did_to_task_id mapping only keeps one task_id
per dataset, so replace it with a mapping from dataset id to a list of task ids
(e.g., build did_to_task_ids from tasks by collecting tid into lists for each
t["did"]), then when iterating qualities_result.all() look up the list
(did_to_task_ids.get(row.data, [])) and append the quality entry to each
corresponding tasks[tid] (using the same setdefault("quality", []) logic) so
every task sharing the same source_data receives the quality rows.
- Around line 218-233: The current combined handler list_tasks is using
Body(...) for all parameters which breaks GET semantics; split into two
endpoints: keep the existing POST /list handler (function list_tasks_post) that
accepts Pagination, task_type_id, tag, etc. from Body(...) and keep using
Depends(expdb_connection) for expdb, and create a new GET /list handler
(function list_tasks_get) that accepts the same parameters from Query(...) (or
via a Depends-based query model) instead of Body so query-string filters
populate correctly; ensure both handlers delegate shared logic to a common
helper (e.g., _list_tasks_impl) that takes the normalized args (Pagination,
TaskStatusFilter, lists like task_id/data_id, number_* ranges, expdb) to avoid
duplicating business logic.
---
Nitpick comments:
In `@tests/routers/openml/task_test.py`:
- Around line 7-25: The test test_list_tasks_default should enforce the default
"active" contract: after fetching response from the "/tasks/list" endpoint and
validating tasks is a non-empty list (as shown), iterate over the returned tasks
(e.g., for task in tasks) and assert task["status"] == "active" for every item;
add this assertion to the test_list_tasks_default function so the endpoint is
required to apply the default active status filter.
- Around line 28-33: The test test_list_tasks_filter_type currently uses
all(...) which vacuously passes for an empty list; modify the test to first
assert that the response contains at least one task (e.g., assert tasks), then
keep the existing check that all(t["task_type_id"] == 1 for t in tasks). Locate
the test function test_list_tasks_filter_type and the POST to "/tasks/list"
(json={"task_type_id": 1}) and add a non-empty assertion before the all(...)
assertion so the test fails if no rows are returned.
- Around line 65-72: The test test_list_tasks_number_instances_range currently
only asserts that something was returned; update it to validate the filter by
parsing response.json() (ensure it's a list), then for each returned task object
(from the POST to "/tasks/list" with json {"number_instances":"100..1000"})
extract the numeric value (e.g., task["number_instances"] or
task["dataset"]["number_instances"] depending on payload shape), coerce it to
int, and assert each value is between 100 and 1000 inclusive so the test fails
if the filter is ignored or misapplied.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 318531a5-1859-48cb-9afb-9606ff2ddb84
📒 Files selected for processing (2)
src/routers/openml/tasks.pytests/routers/openml/task_test.py
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
tests/routers/openml/task_test.py (1)
92-96: Test doesn't verify GET with query parameters.This test only exercises GET without parameters, which works because all endpoint parameters have defaults. It doesn't verify that GET with query parameters (e.g.,
?task_type_id=1) actually filters results. Given theBody(...)parameter binding issue in the endpoint, such a test would currently fail—which would help surface the problem.Consider adding a test case like:
async def test_list_tasks_get_with_query_params(py_api: httpx.AsyncClient) -> None: """GET /tasks/list with query params should filter results.""" response = await py_api.get("/tasks/list", params={"task_type_id": 1}) assert response.status_code == HTTPStatus.OK tasks = response.json() assert all(t["task_type_id"] == 1 for t in tasks)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/routers/openml/task_test.py` around lines 92 - 96, Add a new test that calls the same endpoint as test_list_tasks_get but supplies query parameters to ensure filtering works: create async def test_list_tasks_get_with_query_params(py_api: httpx.AsyncClient) that performs py_api.get("/tasks/list", params={"task_type_id": 1}), asserts response.status_code == HTTPStatus.OK, parses response.json() into tasks, and asserts all returned items have t["task_type_id"] == 1; reference the existing test_list_tasks_get to place the new test nearby and validate the query-parameter behavior for the /tasks/list endpoint.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/routers/openml/task_test.py`:
- Around line 82-89: The test test_list_tasks_no_results should assert
error["code"] equals the exact string instead of using substring containment;
update the assertion that currently checks "372" in error["code"] to a direct
equality check comparing error["code"] to the full expected string (use the same
serialized form used elsewhere, i.e., str(exc.code) format) so that
error["code"] is asserted with equality rather than containment.
---
Nitpick comments:
In `@tests/routers/openml/task_test.py`:
- Around line 92-96: Add a new test that calls the same endpoint as
test_list_tasks_get but supplies query parameters to ensure filtering works:
create async def test_list_tasks_get_with_query_params(py_api:
httpx.AsyncClient) that performs py_api.get("/tasks/list",
params={"task_type_id": 1}), asserts response.status_code == HTTPStatus.OK,
parses response.json() into tasks, and asserts all returned items have
t["task_type_id"] == 1; reference the existing test_list_tasks_get to place the
new test nearby and validate the query-parameter behavior for the /tasks/list
endpoint.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 946d27de-85ae-4812-a074-8a3f092dadea
📒 Files selected for processing (2)
src/routers/openml/tasks.pytests/routers/openml/task_test.py
Description
Implements
GET /tasks/listandPOST /tasks/listendpoints, migrating from the PHP API.Fixes: #23
Filters supported
task_type_id,tag,data_tag,status,limit,offset,task_id,data_id,data_name,number_instances,number_features,number_classes,number_missing_valuesImplementation notes
datasets.py— filters via request body, both GET and POST decorators on the same functiondatabase/file needed)active(matches PHP behavior)list_datasetstask,task_type,task_inputs,dataset,dataset_status,data_quality,task_tagChecklist