
feat: Add GET/POST /task/list endpoint #277

Open
saathviksheerla wants to merge 3 commits into openml:main from saathviksheerla:feat/task-list-endpoint

Conversation

@saathviksheerla
Contributor

Description

Implements GET /tasks/list and POST /tasks/list endpoints, migrating from the PHP API.

Fixes: #23

Filters supported

task_type_id, tag, data_tag, status, limit, offset, task_id, data_id, data_name, number_instances, number_features, number_classes, number_missing_values

Implementation notes

  • Follows the pattern established in datasets.py — filters via request body, both GET and POST decorators on the same function
  • Inline SQL in router (no new database/ file needed)
  • Default status filter: active (matches PHP behavior)
  • Qualities returned are the same 10 as in list_datasets
  • Tables: task, task_type, task_inputs, dataset, dataset_status, data_quality, task_tag
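
The default status filter noted above maps naturally onto a conditional SQL fragment. A minimal sketch, assuming a `TaskStatusFilter` enum and an `IFNULL` fallback to `in_preparation` (the names mirror reviewer excerpts later in this thread; the PR's actual code may differ):

```python
from enum import Enum

class TaskStatusFilter(str, Enum):
    ACTIVE = "active"
    DEACTIVATED = "deactivated"
    IN_PREPARATION = "in_preparation"
    ALL = "all"

def status_clause(status: TaskStatusFilter) -> str:
    """Build the WHERE fragment for the status filter.

    'all' disables the filter entirely; otherwise datasets without a
    status row fall back to 'in_preparation', matching the PHP behavior.
    """
    if status is TaskStatusFilter.ALL:
        return ""
    return f"AND IFNULL(ds.`status`, 'in_preparation') = '{status.value}'"
```

With the default `TaskStatusFilter.ACTIVE`, only tasks whose dataset's latest status resolves to `active` survive the filter.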

Checklist

  • I have performed a self-review of my own pull request
  • Tests pass locally
  • I have commented my code in hard-to-understand areas, and provided or updated docstrings as needed
  • I have added tests that cover the changes

@coderabbitai
Contributor

coderabbitai bot commented Mar 17, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 2def25ad-96db-42d8-8fda-ef16221cdef4

📥 Commits

Reviewing files that changed from the base of the PR and between 73817ee and 4e9ee5d.

📒 Files selected for processing (1)
  • tests/routers/openml/task_test.py

Walkthrough

Adds a new /tasks/list endpoint (GET/POST) that returns task summaries with pagination and multiple optional filters: task_type_id, tag, data_tag, status, task_id, data_id, data_name, and numeric range filters for instances/features/classes/missing values. Introduces TaskStatusFilter, QUALITIES_TO_SHOW, BASIC_TASK_INPUTS, _quality_clause, and list_tasks, which builds dynamic WHERE clauses, queries joined task/type/dataset/status tables, batch-fetches inputs/qualities/tags, assembles results, and raises NoResultsError when no matches. Adds tests covering filtering, pagination, ranges, GET handling, and validation.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): docstring coverage is 78.57%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
✅ Passed checks (4 passed)
  • Title check (✅ Passed): the pull request title accurately describes the main change: implementing GET/POST /tasks/list endpoints.
  • Description check (✅ Passed): the pull request description is clearly related to the changeset, outlining the implementation of the /tasks/list endpoints with supported filters and technical details.
  • Linked Issues check (✅ Passed): the PR implements the GET /tasks/list endpoints required by issue #23 with extensive filtering capabilities (task_type_id, tag, data_tag, status, pagination, task_id, data_id, data_name, number_instances/features/classes/missing_values).
  • Out of Scope Changes check (✅ Passed): all changes are directly related to implementing the /tasks/list endpoints: new router function with filters, supporting enums/constants, and comprehensive test coverage for the new functionality.


@codecov

codecov bot commented Mar 17, 2026

Codecov Report

❌ Patch coverage is 23.61111% with 55 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@c9ada15). Learn more about missing BASE report.

  • src/routers/openml/tasks.py: 23.61% patch coverage, 55 lines missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main     #277   +/-   ##
=======================================
  Coverage        ?   52.94%           
=======================================
  Files           ?       34           
  Lines           ?     1526           
  Branches        ?      128           
=======================================
  Hits            ?      808           
  Misses          ?      716           
  Partials        ?        2           


@saathviksheerla saathviksheerla marked this pull request as ready for review March 17, 2026 10:15
Contributor

@sourcery-ai sourcery-ai bot left a comment


Hey - I've found 4 issues

Prompt for AI Agents
Please address the comments from this code review:

## Individual Comments

### Comment 1
<location path="src/routers/openml/tasks.py" line_range="362-363" />
<code_context>
+    )
+    # build a reverse map: dataset_id -> task_id
+    # needed because quality rows come back keyed by did, but our tasks dict is keyed by task_id
+    did_to_task_id = {t["did"]: tid for tid, t in tasks.items()}
+    for row in qualities_result.all():
+        tid = did_to_task_id.get(row.data)
+        if tid is not None:
</code_context>
<issue_to_address>
**issue (bug_risk):** Qualities are only attached to one task per dataset, dropping qualities for other tasks sharing the same dataset.

Because `did_to_task_id` maps each dataset id to a single `task_id`, only the last task using a given dataset receives qualities; earlier tasks end up with empty `quality` lists. To handle multiple tasks per dataset, map `did -> list[task_id]` and, when iterating over `qualities_result`, attach each quality row to all tasks in that list.
</issue_to_address>
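
To make the suggested fix concrete, here is a hedged sketch using in-memory stand-ins for the query results (the dict shapes are assumptions modeled on the excerpt, not the PR's actual rows). Mapping `did -> list[task_id]` lets every task sharing a dataset receive that dataset's qualities:

```python
from collections import defaultdict

# Hypothetical stand-ins: tasks keyed by task_id, each carrying its
# dataset id under "did"; quality rows keyed by dataset id ("data").
tasks = {
    1: {"did": 10, "quality": []},
    2: {"did": 10, "quality": []},  # shares a dataset with task 1
    3: {"did": 20, "quality": []},
}
quality_rows = [
    {"data": 10, "quality": "NumberOfInstances", "value": 150},
    {"data": 20, "quality": "NumberOfInstances", "value": 42},
]

# Map did -> list[task_id] instead of did -> single task_id
did_to_task_ids: dict[int, list[int]] = defaultdict(list)
for tid, t in tasks.items():
    did_to_task_ids[t["did"]].append(tid)

# Attach each quality row to ALL tasks using that dataset
for row in quality_rows:
    for tid in did_to_task_ids.get(row["data"], []):
        tasks[tid]["quality"].append(
            {"name": row["quality"], "value": str(row["value"])}
        )
```

With the original single-valued map, task 1 would end up with an empty `quality` list; with the list-valued map, tasks 1 and 2 both carry the dataset's qualities.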

### Comment 2
<location path="tests/routers/openml/task_test.py" line_range="36-42" />
<code_context>
+    assert all(t["task_type_id"] == 1 for t in tasks)
+
+
+async def test_list_tasks_filter_tag(py_api: httpx.AsyncClient) -> None:
+    """Filter by tag returns only tasks with that tag."""
+    response = await py_api.post("/tasks/list", json={"tag": "OpenML100"})
+    assert response.status_code == HTTPStatus.OK
+    tasks = response.json()
+    assert len(tasks) > 0
+    assert all("OpenML100" in t["tag"] for t in tasks)
+
+
</code_context>
<issue_to_address>
**suggestion (testing):** Add coverage for the remaining filters (`data_tag`, `task_id`, `data_id`, `data_name`, and other quality ranges).

Current tests exercise `task_type_id`, `tag`, `pagination`, and `number_instances`, but not the other supported filters: `data_tag`, `task_id`, `data_id`, `data_name`, `number_features`, `number_classes`, and `number_missing_values`. This leaves several WHERE-clause branches untested.

Consider adding targeted tests such as:
- `data_tag`: only tasks whose dataset has the given tag are returned.
- `task_id`: multiple IDs return exactly those tasks (and correct ordering/uniqueness if relevant).
- `data_id`: only tasks for the specified dataset(s) are returned.
- `data_name`: verifies expected matching behavior (case-(in)sensitivity as intended).
- One or more quality ranges using `number_features` or `number_classes` to exercise `_quality_clause` beyond `NumberOfInstances`.

Suggested implementation:

```python
async def test_list_tasks_filter_type(py_api: httpx.AsyncClient) -> None:
    """Filter by task_type_id returns only tasks of that type."""
    response = await py_api.post("/tasks/list", json={"task_type_id": 1})
    assert response.status_code == HTTPStatus.OK
    tasks = response.json()
    assert all(t["task_type_id"] == 1 for t in tasks)




async def test_list_tasks_default(py_api: httpx.AsyncClient) -> None:
    """Default call returns active tasks with correct shape."""
    response = await py_api.post("/tasks/list", json={})
    assert response.status_code == HTTPStatus.OK
    tasks = response.json()
    assert isinstance(tasks, list)
    assert len(tasks) > 0
    # verify shape of first task; field names assumed from sibling tests
    task = tasks[0]
    assert "task_id" in task


async def test_list_tasks_filter_data_tag(py_api: httpx.AsyncClient) -> None:
    """Filter by data_tag returns only tasks whose dataset has that tag."""
    # Get a tag actually present on at least one dataset
    response = await py_api.post("/tasks/list", json={})
    assert response.status_code == HTTPStatus.OK
    tasks = response.json()
    # Find first task that has at least one data tag
    tagged_task = next(
        t for t in tasks if t.get("data_tags") is not None and len(t.get("data_tags")) > 0
    )
    data_tag = tagged_task["data_tags"][0]

    response = await py_api.post("/tasks/list", json={"data_tag": data_tag})
    assert response.status_code == HTTPStatus.OK
    filtered = response.json()
    assert len(filtered) > 0
    assert all(
        data_tag in t.get("data_tags", []) for t in filtered
    ), "All returned tasks should have the dataset tag used for filtering"


async def test_list_tasks_filter_task_id(py_api: httpx.AsyncClient) -> None:
    """Filter by task_id returns exactly the requested tasks."""
    # Discover a small set of valid task IDs from the default listing
    response = await py_api.post("/tasks/list", json={})
    assert response.status_code == HTTPStatus.OK
    tasks = response.json()
    assert len(tasks) >= 2

    requested_ids = sorted(t["task_id"] for t in tasks[:3])

    response = await py_api.post("/tasks/list", json={"task_id": requested_ids})
    assert response.status_code == HTTPStatus.OK
    filtered = response.json()
    returned_ids = sorted(t["task_id"] for t in filtered)

    assert set(returned_ids) == set(
        requested_ids
    ), "Filtering by task_id should return exactly the requested tasks"
    # If the API defines a particular ordering, assert on that here as well
    # (e.g. sorted by task_id). The current check ensures no duplicates.
    assert len(returned_ids) == len(set(returned_ids))


async def test_list_tasks_filter_data_id(py_api: httpx.AsyncClient) -> None:
    """Filter by data_id returns only tasks from those datasets."""
    response = await py_api.post("/tasks/list", json={})
    assert response.status_code == HTTPStatus.OK
    tasks = response.json()
    assert len(tasks) > 0

    # Take a couple of distinct data_ids from the available tasks
    data_ids = []
    for t in tasks:
        if t["data_id"] not in data_ids:
            data_ids.append(t["data_id"])
        if len(data_ids) == 2:
            break

    response = await py_api.post("/tasks/list", json={"data_id": data_ids})
    assert response.status_code == HTTPStatus.OK
    filtered = response.json()
    assert len(filtered) > 0
    assert all(
        t["data_id"] in data_ids for t in filtered
    ), "All returned tasks should belong to one of the requested data_ids"


async def test_list_tasks_filter_data_name(py_api: httpx.AsyncClient) -> None:
    """Filter by data_name returns only tasks whose dataset has that name."""
    response = await py_api.post("/tasks/list", json={})
    assert response.status_code == HTTPStatus.OK
    tasks = response.json()
    assert len(tasks) > 0

    data_name = tasks[0]["data_name"]

    # Use the exact name observed in the listing
    response = await py_api.post("/tasks/list", json={"data_name": data_name})
    assert response.status_code == HTTPStatus.OK
    filtered = response.json()
    assert len(filtered) > 0
    assert all(
        t["data_name"] == data_name for t in filtered
    ), "All returned tasks should have the requested data_name"


async def test_list_tasks_filter_number_features_range(py_api: httpx.AsyncClient) -> None:
    """Filter by number_features quality range returns tasks within that range."""
    # Probe existing tasks to discover a concrete number_features value
    response = await py_api.post("/tasks/list", json={})
    assert response.status_code == HTTPStatus.OK
    tasks = response.json()
    # Find a task with a defined number_features value
    task_with_features = next(
        t for t in tasks if t.get("number_features") is not None
    )
    num_features = task_with_features["number_features"]

    # Build a narrow range around that concrete value to exercise the quality clause
    payload = {
        "number_features": {
            "min": num_features,
            "max": num_features,
        }
    }
    response = await py_api.post("/tasks/list", json=payload)
    assert response.status_code == HTTPStatus.OK
    filtered = response.json()
    assert len(filtered) > 0
    assert all(
        t.get("number_features") is not None
        and payload["number_features"]["min"]
        <= t["number_features"]
        <= payload["number_features"]["max"]
        for t in filtered
    ), "All returned tasks should have number_features within the requested range"

```

These tests assume:
- Response objects include `data_tags` (a list of strings on each task), `task_id`, `data_id`, `data_name`, and `number_features`.
- The `/tasks/list` request model accepts:
  - `data_tag` as a single string.
  - `task_id` and `data_id` as lists of integers.
  - `data_name` as a string.
  - Quality ranges in the form `{"number_features": {"min": ..., "max": ...}}`.

To fully align with your existing API:
1. Adjust the field names (`data_tags` vs. `dataset_tags`, etc.) in the assertions to match the actual task schema.
2. Align the request payload shapes with whatever is already used in the existing `number_instances` test (e.g., if it uses `number_features_min` / `number_features_max` instead of a nested dict, mirror that here).
3. If `task_id` / `data_id` only accept a single integer instead of a list, modify `requested_ids` / `data_ids` accordingly and assert on a single returned task.
4. If data-name filtering is case-insensitive, extend `test_list_tasks_filter_data_name` to issue a request with `data_name` in a different case and assert that the same tasks are returned.
</issue_to_address>

### Comment 3
<location path="tests/routers/openml/task_test.py" line_range="56-62" />
<code_context>
+    assert ids1 != ids2
+
+
+async def test_list_tasks_number_instances_range(py_api: httpx.AsyncClient) -> None:
+    """number_instances range filter returns tasks whose dataset matches."""
+    response = await py_api.post(
+        "/tasks/list",
+        json={"number_instances": "100..1000"},
+    )
+    assert response.status_code == HTTPStatus.OK
+    assert len(response.json()) > 0
+
+
</code_context>
<issue_to_address>
**suggestion (testing):** Consider adding a negative test for invalid range syntax to ensure it does not result in a server error.

Since `_quality_clause` raises `ValueError` when `range_` doesn’t match `integer_range_regex`, it would be valuable to add a test that covers an invalid `number_instances` value (e.g. `"abc"` or `"1...2"`) and verifies:
- The response status is a validation error (e.g. 422) rather than 500.
- The error body matches the API’s standard error format.

This guards against accidentally surfacing raw `ValueError`s as 500s if validation behavior changes later.

```suggestion
async def test_list_tasks_pagination_offset(py_api: httpx.AsyncClient) -> None:
    """Offset returns different results than no offset."""
    r1 = await py_api.post("/tasks/list", json={"pagination": {"limit": 5, "offset": 0}})
    r2 = await py_api.post("/tasks/list", json={"pagination": {"limit": 5, "offset": 5}})
    ids1 = [t["task_id"] for t in r1.json()]
    ids2 = [t["task_id"] for t in r2.json()]
    assert ids1 != ids2


async def test_list_tasks_number_instances_range(py_api: httpx.AsyncClient) -> None:
    """number_instances range filter returns tasks whose dataset matches."""
    response = await py_api.post(
        "/tasks/list",
        json={"number_instances": "100..1000"},
    )
    assert response.status_code == HTTPStatus.OK
    assert len(response.json()) > 0


async def test_list_tasks_number_instances_invalid_range(py_api: httpx.AsyncClient) -> None:
    """Invalid number_instances range does not raise server error and returns validation error."""
    response = await py_api.post(
        "/tasks/list",
        json={"number_instances": "1...2"},
    )
    assert response.status_code == HTTPStatus.UNPROCESSABLE_ENTITY
    body = response.json()
    assert isinstance(body, dict)
    # Assert the error body follows the API's standard validation error format.
    # Adjust these expectations if your API uses a different envelope.
    assert "detail" in body
```
</issue_to_address>

### Comment 4
<location path="src/routers/openml/tasks.py" line_range="220" />
<code_context>
+
+@router.post(path="/list", description="Provided for convenience, same as `GET` endpoint.")
+@router.get(path="/list")
+async def list_tasks(  # noqa: PLR0913
+    pagination: Annotated[Pagination, Body(default_factory=Pagination)],
+    task_type_id: Annotated[int | None, Body(description="Filter by task type id.")] = None,
</code_context>
<issue_to_address>
**issue (complexity):** Consider extracting small helper functions for IN-clause construction, range parsing, WHERE-clause assembly, dataset-status subquery, and task-enrichment to make `list_tasks` shorter and easier to follow.

You can keep the new behavior but reduce complexity by extracting a few focused helpers and reusing them.

### 1. Extract a generic `IN` clause builder

You repeat `",".join(...)` and f-string interpolation for IDs and names. A tiny helper keeps SQL construction consistent and reduces visual noise:

```python
def _in_clause(values: list[int | str]) -> str:
    # Callers are responsible for correctness of values (already sanitized IDs or enums)
    return ", ".join(str(v) for v in values)
```

Usage:

```python
task_ids_str = _in_clause(tasks.keys())
did_list = _in_clause([t["did"] for t in tasks.values()])
basic_inputs_str = _in_clause([f"'{i}'" for i in BASIC_TASK_INPUTS])
qualities_str = _in_clause([f"'{q}'" for q in QUALITIES_TO_SHOW])
```

This shrinks the `list_tasks` body and centralizes the `IN (...)` formatting.

### 2. Split range parsing from quality SQL rendering

`_quality_clause` currently parses and renders SQL. Splitting the concerns makes both parts easier to read and test.

```python
def _parse_integer_range(range_: str | None) -> tuple[str, str | None] | None:
    if not range_:
        return None
    match = re.match(integer_range_regex, range_)
    if not match:
        msg = f"`range_` not a valid range: {range_}"
        raise ValueError(msg)
    start, end = match.groups()
    end_num = end[2:] if end else None  # strip ".."
    return start, end_num
```

Then a generic quality clause builder:

```python
def _quality_clause(quality: str, range_: str | None) -> str:
    parsed = _parse_integer_range(range_)
    if not parsed:
        return ""
    start, end = parsed
    value_expr = f"`value` BETWEEN {start} AND {end}" if end else f"`value`={start}"
    return f"""
        AND t.`task_id` IN (
            SELECT ti.`task_id` FROM task_inputs ti
            WHERE ti.`input`='source_data' AND ti.`value` IN (
                SELECT `data` FROM data_quality
                WHERE `quality`='{quality}' AND {value_expr}
            )
        )
    """
```

This leaves `list_tasks` with just the high-level mapping:

```python
where_number_instances = _quality_clause("NumberOfInstances", number_instances)
...
```

### 3. Extract WHERE-clause assembly into a helper

The top of `list_tasks` builds many related conditions inline. Move that to a dedicated function that returns both fragments and the parameter dict:

```python
def _build_task_where_clauses(
    status: TaskStatusFilter,
    task_type_id: int | None,
    tag: str | None,
    data_tag: str | None,
    task_id: list[int] | None,
    data_id: list[int] | None,
    data_name: str | None,
    number_instances: str | None,
    number_features: str | None,
    number_classes: str | None,
    number_missing_values: str | None,
) -> tuple[str, dict[str, Any]]:
    if status == TaskStatusFilter.ALL:
        where_status = ""
    else:
        where_status = f"AND IFNULL(ds.`status`, 'in_preparation') = '{status}'"

    where_type = "" if task_type_id is None else "AND t.`ttid` = :task_type_id"
    where_tag = "" if tag is None else \
        "AND t.`task_id` IN (SELECT `id` FROM task_tag WHERE `tag` = :tag)"
    where_data_tag = "" if data_tag is None else \
        "AND d.`did` IN (SELECT `id` FROM dataset_tag WHERE `tag` = :data_tag)"
    where_task_id = "" if not task_id else f"AND t.`task_id` IN ({_in_clause(task_id)})"
    where_data_id = "" if not data_id else f"AND d.`did` IN ({_in_clause(data_id)})"
    where_data_name = "" if data_name is None else "AND d.`name` = :data_name"

    where_number_instances = _quality_clause("NumberOfInstances", number_instances)
    where_number_features = _quality_clause("NumberOfFeatures", number_features)
    where_number_classes = _quality_clause("NumberOfClasses", number_classes)
    where_number_missing_values = _quality_clause(
        "NumberOfMissingValues", number_missing_values
    )

    where_sql = (
        where_status
        + where_type
        + where_tag
        + where_data_tag
        + where_task_id
        + where_data_id
        + where_data_name
        + where_number_instances
        + where_number_features
        + where_number_classes
        + where_number_missing_values
    )

    params = {
        "task_type_id": task_type_id,
        "tag": tag,
        "data_tag": data_tag,
        "data_name": data_name,
    }

    return where_sql, params
```

Then `list_tasks` reduces to:

```python
where_sql, params = _build_task_where_clauses(
    status, task_type_id, tag, data_tag, task_id, data_id,
    data_name, number_instances, number_features,
    number_classes, number_missing_values,
)

query = text(f"""
    SELECT ...
    FROM ...
    WHERE 1=1
        {where_sql}
    GROUP BY ...
    LIMIT {pagination.limit} OFFSET {pagination.offset}
""")
result = await expdb.execute(query, parameters=params)
```

### 4. Extract the dataset-status subquery

Moving the status subquery out of the main query makes the main SQL much easier to read:

```python
LATEST_DATASET_STATUS_SUBQUERY = """
    SELECT ds1.did, ds1.status
    FROM dataset_status ds1
    WHERE ds1.status_date = (
        SELECT MAX(ds2.status_date)
        FROM dataset_status ds2
        WHERE ds1.did = ds2.did
    )
"""
```

Then in `list_tasks`:

```python
query = text(f"""
    SELECT ...
    FROM task t
    ...
    LEFT JOIN ({LATEST_DATASET_STATUS_SUBQUERY}) ds ON ds.`did` = d.`did`
    WHERE 1=1
        {where_sql}
    ...
""")
```

### 5. Unify enrichment of `tasks` (inputs/qualities/tags)

You can use a generic helper for the “append or init list” pattern to avoid repetition:

```python
def _append_task_attr(
    tasks: dict[int, dict[str, Any]],
    task_id: int,
    key: str,
    value: Any,
) -> None:
    tasks[task_id].setdefault(key, []).append(value)
```

Then:

```python
# inputs
for row in inputs_result.all():
    _append_task_attr(
        tasks,
        row.task_id,
        "input",
        {"name": row.input, "value": row.value},
    )

# qualities
for row in qualities_result.all():
    tid = did_to_task_id.get(row.data)
    if tid is not None:
        _append_task_attr(
            tasks,
            tid,
            "quality",
            {"name": row.quality, "value": str(row.value)},
        )

# tags
for row in tags_result.all():
    _append_task_attr(tasks, row.id, "tag", row.tag)
```

And a single “ensure keys” helper:

```python
def _ensure_task_keys(task: dict[str, Any], keys: list[str]) -> None:
    for key in keys:
        task.setdefault(key, [])
```

Usage:

```python
for task in tasks.values():
    _ensure_task_keys(task, ["input", "quality", "tag"])
```

---

These extractions keep all current behavior and SQL intact, but make `list_tasks` mostly a high‑level orchestration over small, single-purpose helpers, which should address the complexity concerns.
</issue_to_address>


Comment on lines +36 to +42
async def test_list_tasks_filter_tag(py_api: httpx.AsyncClient) -> None:
"""Filter by tag returns only tasks with that tag."""
response = await py_api.post("/tasks/list", json={"tag": "OpenML100"})
assert response.status_code == HTTPStatus.OK
tasks = response.json()
assert len(tasks) > 0
assert all("OpenML100" in t["tag"] for t in tasks)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (testing): Add coverage for the remaining filters (data_tag, task_id, data_id, data_name, and other quality ranges).

Current tests exercise task_type_id, tag, pagination, and number_instances, but not the other supported filters: data_tag, task_id, data_id, data_name, number_features, number_classes, and number_missing_values. This leaves several WHERE-clause branches untested.

Consider adding targeted tests such as:

  • data_tag: only tasks whose dataset has the given tag are returned.
  • task_id: multiple IDs return exactly those tasks (and correct ordering/uniqueness if relevant).
  • data_id: only tasks for the specified dataset(s) are returned.
  • data_name: verifies expected matching behavior (case-(in)sensitivity as intended).
  • One or more quality ranges using number_features or number_classes to exercise _quality_clause beyond NumberOfInstances.

Suggested implementation:

async def test_list_tasks_filter_type(py_api: httpx.AsyncClient) -> None:
    """Filter by task_type_id returns only tasks of that type."""
    response = await py_api.post("/tasks/list", json={"task_type_id": 1})
    assert response.status_code == HTTPStatus.OK
    tasks = response.json()
    assert all(t["task_type_id"] == 1 for t in tasks)




async def test_list_tasks_default(py_api: httpx.AsyncClient) -> None:
    """Default call returns active tasks with correct shape."""
    response = await py_api.post("/tasks/list", json={})
    assert response.status_code == HTTPStatus.OK
    tasks = response.json()
    assert isinstance(tasks, list)
    assert len(tasks) > 0
    # verify shape of first task
    task = tasks[0]


async def test_list_tasks_filter_data_tag(py_api: httpx.AsyncClient) -> None:
    """Filter by data_tag returns only tasks whose dataset has that tag."""
    # Get a tag actually present on at least one dataset
    response = await py_api.post("/tasks/list", json={})
    assert response.status_code == HTTPStatus.OK
    tasks = response.json()
    # Find first task that has at least one data tag
    tagged_task = next(
        t for t in tasks if t.get("data_tags") is not None and len(t.get("data_tags")) > 0
    )
    data_tag = tagged_task["data_tags"][0]

    response = await py_api.post("/tasks/list", json={"data_tag": data_tag})
    assert response.status_code == HTTPStatus.OK
    filtered = response.json()
    assert len(filtered) > 0
    assert all(
        data_tag in t.get("data_tags", []) for t in filtered
    ), "All returned tasks should have the dataset tag used for filtering"


async def test_list_tasks_filter_task_id(py_api: httpx.AsyncClient) -> None:
    """Filter by task_id returns exactly the requested tasks."""
    # Discover a small set of valid task IDs from the default listing
    response = await py_api.post("/tasks/list", json={})
    assert response.status_code == HTTPStatus.OK
    tasks = response.json()
    assert len(tasks) >= 2

    requested_ids = sorted(t["task_id"] for t in tasks[:3])

    response = await py_api.post("/tasks/list", json={"task_id": requested_ids})
    assert response.status_code == HTTPStatus.OK
    filtered = response.json()
    returned_ids = sorted(t["task_id"] for t in filtered)

    assert set(returned_ids) == set(
        requested_ids
    ), "Filtering by task_id should return exactly the requested tasks"
    # If the API defines a particular ordering, assert on that here as well
    # (e.g. sorted by task_id). The current check ensures no duplicates.
    assert len(returned_ids) == len(set(returned_ids))


async def test_list_tasks_filter_data_id(py_api: httpx.AsyncClient) -> None:
    """Filter by data_id returns only tasks from those datasets."""
    response = await py_api.post("/tasks/list", json={})
    assert response.status_code == HTTPStatus.OK
    tasks = response.json()
    assert len(tasks) > 0

    # Take a couple of distinct data_ids from the available tasks
    data_ids = []
    for t in tasks:
        if t["data_id"] not in data_ids:
            data_ids.append(t["data_id"])
        if len(data_ids) == 2:
            break

    response = await py_api.post("/tasks/list", json={"data_id": data_ids})
    assert response.status_code == HTTPStatus.OK
    filtered = response.json()
    assert len(filtered) > 0
    assert all(
        t["data_id"] in data_ids for t in filtered
    ), "All returned tasks should belong to one of the requested data_ids"


async def test_list_tasks_filter_data_name(py_api: httpx.AsyncClient) -> None:
    """Filter by data_name returns only tasks whose dataset has that name."""
    response = await py_api.post("/tasks/list", json={})
    assert response.status_code == HTTPStatus.OK
    tasks = response.json()
    assert len(tasks) > 0

    data_name = tasks[0]["data_name"]

    # Use the exact name observed in the listing
    response = await py_api.post("/tasks/list", json={"data_name": data_name})
    assert response.status_code == HTTPStatus.OK
    filtered = response.json()
    assert len(filtered) > 0
    assert all(
        t["data_name"] == data_name for t in filtered
    ), "All returned tasks should have the requested data_name"


async def test_list_tasks_filter_number_features_range(py_api: httpx.AsyncClient) -> None:
    """Filter by number_features quality range returns tasks within that range."""
    # Probe existing tasks to discover a concrete number_features value
    response = await py_api.post("/tasks/list", json={})
    assert response.status_code == HTTPStatus.OK
    tasks = response.json()
    # Find a task with a defined number_features value
    task_with_features = next(
        t for t in tasks if t.get("number_features") is not None
    )
    num_features = task_with_features["number_features"]

    # Build a narrow range around that concrete value to exercise the quality clause
    payload = {
        "number_features": {
            "min": num_features,
            "max": num_features,
        }
    }
    response = await py_api.post("/tasks/list", json=payload)
    assert response.status_code == HTTPStatus.OK
    filtered = response.json()
    assert len(filtered) > 0
    assert all(
        t.get("number_features") is not None
        and payload["number_features"]["min"]
        <= t["number_features"]
        <= payload["number_features"]["max"]
        for t in filtered
    ), "All returned tasks should have number_features within the requested range"

These tests assume:

  • Response objects include data_tags (a list of strings on each task), task_id, data_id, data_name, and number_features.
  • The /tasks/list request model accepts:
    • data_tag as a single string.
    • task_id and data_id as lists of integers.
    • data_name as a string.
    • Quality ranges in the form {"number_features": {"min": ..., "max": ...}}.

To fully align with your existing API:

  1. Adjust the field names (data_tags vs. dataset_tags, etc.) in the assertions to match the actual task schema.
  2. Align the request payload shapes with whatever is already used in the existing number_instances test (e.g., if it uses number_features_min / number_features_max instead of a nested dict, mirror that here).
  3. If task_id / data_id only accept a single integer instead of a list, modify requested_ids / data_ids accordingly and assert on a single returned task.
  4. If data-name filtering is case-insensitive, extend test_list_tasks_filter_data_name to issue a request with data_name in a different case and assert that the same tasks are returned.


@router.post(path="/list", description="Provided for convenience, same as `GET` endpoint.")
@router.get(path="/list")
async def list_tasks( # noqa: PLR0913

issue (complexity): Consider extracting small helper functions for IN-clause construction, range parsing, WHERE-clause assembly, dataset-status subquery, and task-enrichment to make list_tasks shorter and easier to follow.

You can keep the new behavior but reduce complexity by extracting a few focused helpers and reusing them.

1. Extract a generic IN clause builder

You repeat ",".join(...) and f-string interpolation for IDs and names. A tiny helper keeps SQL construction consistent and reduces visual noise:

from collections.abc import Iterable


def _in_clause(values: Iterable[int | str]) -> str:
    # Accepts any iterable (dict keys, comprehensions); callers are responsible
    # for correctness of values (already sanitized IDs or quoted enums)
    return ", ".join(str(v) for v in values)

Usage:

task_ids_str = _in_clause(tasks.keys())
did_list = _in_clause([t["did"] for t in tasks.values()])
basic_inputs_str = _in_clause([f"'{i}'" for i in BASIC_TASK_INPUTS])
qualities_str = _in_clause([f"'{q}'" for q in QUALITIES_TO_SHOW])

This shrinks the list_tasks body and centralizes the IN (...) formatting.

2. Split range parsing from quality SQL rendering

_quality_clause currently parses and renders SQL. Splitting the concerns makes both parts easier to read and test.

def _parse_integer_range(range_: str | None) -> tuple[str, str | None] | None:
    if not range_:
        return None
    match = re.match(integer_range_regex, range_)
    if not match:
        msg = f"`range_` not a valid range: {range_}"
        raise ValueError(msg)
    start, end = match.groups()
    end_num = end[2:] if end else None  # strip ".."
    return start, end_num

Then a generic quality clause builder:

def _quality_clause(quality: str, range_: str | None) -> str:
    parsed = _parse_integer_range(range_)
    if not parsed:
        return ""
    start, end = parsed
    value_expr = f"`value` BETWEEN {start} AND {end}" if end else f"`value`={start}"
    return f"""
        AND t.`task_id` IN (
            SELECT ti.`task_id` FROM task_inputs ti
            WHERE ti.`input`='source_data' AND ti.`value` IN (
                SELECT `data` FROM data_quality
                WHERE `quality`='{quality}' AND {value_expr}
            )
        )
    """

This leaves list_tasks with just the high-level mapping:

where_number_instances = _quality_clause("NumberOfInstances", number_instances)
...

3. Extract WHERE-clause assembly into a helper

The top of list_tasks builds many related conditions inline. Move that to a dedicated function that returns both fragments and the parameter dict:

def _build_task_where_clauses(
    status: TaskStatusFilter,
    task_type_id: int | None,
    tag: str | None,
    data_tag: str | None,
    task_id: list[int] | None,
    data_id: list[int] | None,
    data_name: str | None,
    number_instances: str | None,
    number_features: str | None,
    number_classes: str | None,
    number_missing_values: str | None,
) -> tuple[str, dict[str, Any]]:
    if status == TaskStatusFilter.ALL:
        where_status = ""
    else:
        where_status = f"AND IFNULL(ds.`status`, 'in_preparation') = '{status}'"

    where_type = "" if task_type_id is None else "AND t.`ttid` = :task_type_id"
    where_tag = "" if tag is None else \
        "AND t.`task_id` IN (SELECT `id` FROM task_tag WHERE `tag` = :tag)"
    where_data_tag = "" if data_tag is None else \
        "AND d.`did` IN (SELECT `id` FROM dataset_tag WHERE `tag` = :data_tag)"
    where_task_id = "" if not task_id else f"AND t.`task_id` IN ({_in_clause(task_id)})"
    where_data_id = "" if not data_id else f"AND d.`did` IN ({_in_clause(data_id)})"
    where_data_name = "" if data_name is None else "AND d.`name` = :data_name"

    where_number_instances = _quality_clause("NumberOfInstances", number_instances)
    where_number_features = _quality_clause("NumberOfFeatures", number_features)
    where_number_classes = _quality_clause("NumberOfClasses", number_classes)
    where_number_missing_values = _quality_clause(
        "NumberOfMissingValues", number_missing_values
    )

    # Join with newlines so adjacent "AND ..." clauses never run together
    where_sql = "\n".join(
        clause
        for clause in (
            where_status,
            where_type,
            where_tag,
            where_data_tag,
            where_task_id,
            where_data_id,
            where_data_name,
            where_number_instances,
            where_number_features,
            where_number_classes,
            where_number_missing_values,
        )
        if clause
    )

    params = {
        "task_type_id": task_type_id,
        "tag": tag,
        "data_tag": data_tag,
        "data_name": data_name,
    }

    return where_sql, params

Then list_tasks reduces to:

where_sql, params = _build_task_where_clauses(
    status, task_type_id, tag, data_tag, task_id, data_id,
    data_name, number_instances, number_features,
    number_classes, number_missing_values,
)

query = text(f"""
    SELECT ...
    FROM ...
    WHERE 1=1
        {where_sql}
    GROUP BY ...
    LIMIT {pagination.limit} OFFSET {pagination.offset}
""")
result = await expdb.execute(query, parameters=params)

4. Extract the dataset-status subquery

Moving the status subquery out of the main query makes the main SQL much easier to read:

LATEST_DATASET_STATUS_SUBQUERY = """
    SELECT ds1.did, ds1.status
    FROM dataset_status ds1
    WHERE ds1.status_date = (
        SELECT MAX(ds2.status_date)
        FROM dataset_status ds2
        WHERE ds1.did = ds2.did
    )
"""

Then in list_tasks:

query = text(f"""
    SELECT ...
    FROM task t
    ...
    LEFT JOIN ({LATEST_DATASET_STATUS_SUBQUERY}) ds ON ds.`did` = d.`did`
    WHERE 1=1
        {where_sql}
    ...
""")

5. Unify enrichment of tasks (inputs/qualities/tags)

You can use a generic helper for the “append or init list” pattern to avoid repetition:

def _append_task_attr(
    tasks: dict[int, dict[str, Any]],
    task_id: int,
    key: str,
    value: Any,
) -> None:
    tasks[task_id].setdefault(key, []).append(value)

Then:

# inputs
for row in inputs_result.all():
    _append_task_attr(
        tasks,
        row.task_id,
        "input",
        {"name": row.input, "value": row.value},
    )

# qualities
for row in qualities_result.all():
    tid = did_to_task_id.get(row.data)
    if tid is not None:
        _append_task_attr(
            tasks,
            tid,
            "quality",
            {"name": row.quality, "value": str(row.value)},
        )

# tags
for row in tags_result.all():
    _append_task_attr(tasks, row.id, "tag", row.tag)

And a single “ensure keys” helper:

def _ensure_task_keys(task: dict[str, Any], keys: list[str]) -> None:
    for key in keys:
        task.setdefault(key, [])

Usage:

for task in tasks.values():
    _ensure_task_keys(task, ["input", "quality", "tag"])

These extractions keep all current behavior and SQL intact, but make list_tasks mostly a high‑level orchestration over small, single-purpose helpers, which should address the complexity concerns.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (3)
tests/routers/openml/task_test.py (3)

7-25: Assert the default active contract.

This currently checks shape only, so it still passes if the endpoint stops applying the default status filter. Please assert that every returned task has status == "active".

🧪 Small test hardening
     tasks = response.json()
     assert isinstance(tasks, list)
     assert len(tasks) > 0
+    assert all(task["status"] == "active" for task in tasks)
     # verify shape of first task
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/routers/openml/task_test.py` around lines 7 - 25, The test
test_list_tasks_default should enforce the default "active" contract: after
fetching response from the "/tasks/list" endpoint and validating tasks is a
non-empty list (as shown), iterate over the returned tasks (e.g., for task in
tasks) and assert task["status"] == "active" for every item; add this assertion
to the test_list_tasks_default function so the endpoint is required to apply the
default active status filter.

28-33: Avoid a vacuous pass in the type-filter test.

all(t["task_type_id"] == 1 for t in tasks) is true for [], so a regression that returns no rows still passes.

🧪 Small test hardening
     assert response.status_code == HTTPStatus.OK
     tasks = response.json()
+    assert len(tasks) > 0
     assert all(t["task_type_id"] == 1 for t in tasks)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/routers/openml/task_test.py` around lines 28 - 33, The test
test_list_tasks_filter_type currently uses all(...) which vacuously passes for
an empty list; modify the test to first assert that the response contains at
least one task (e.g., assert tasks), then keep the existing check that
all(t["task_type_id"] == 1 for t in tasks). Locate the test function
test_list_tasks_filter_type and the POST to "/tasks/list" (json={"task_type_id":
1}) and add a non-empty assertion before the all(...) assertion so the test
fails if no rows are returned.

65-72: Check the filtered values, not just that something came back.

assert len(response.json()) > 0 does not prove number_instances is working. This test would still pass if the filter were ignored completely.

🧪 Stronger assertion
     )
     assert response.status_code == HTTPStatus.OK
-    assert len(response.json()) > 0
+    tasks = response.json()
+    assert len(tasks) > 0
+    for task in tasks:
+        qualities = {q["name"]: q["value"] for q in task["quality"]}
+        assert 100 <= float(qualities["NumberOfInstances"]) <= 1000
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/routers/openml/task_test.py` around lines 65 - 72, The test
test_list_tasks_number_instances_range currently only asserts that something was
returned; update it to validate the filter by parsing response.json() (ensure
it's a list), then for each returned task object (from the POST to "/tasks/list"
with json {"number_instances":"100..1000"}) extract the numeric value (e.g.,
task["number_instances"] or task["dataset"]["number_instances"] depending on
payload shape), coerce it to int, and assert each value is between 100 and 1000
inclusive so the test fails if the filter is ignored or misapplied.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/routers/openml/tasks.py`:
- Around line 277-307: The SELECT query built into the variable query currently
lacks an ORDER BY, making pagination via pagination.limit and pagination.offset
unstable; modify the SQL assembled in query (the multi-line text(...) that
references status_subquery and the various {where_*} clauses) to append a
deterministic ORDER BY before the LIMIT/OFFSET — e.g. order by a stable unique
column such as t.`task_id` (and secondary columns like d.`did` if needed) so
results are consistently sorted across pages.
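A minimal sketch of that fix, as a hypothetical helper (the real query lives in src/routers/openml/tasks.py and interpolates more clauses):

```python
def with_stable_order(select_sql: str, limit: int, offset: int) -> str:
    """Append a deterministic ORDER BY before LIMIT/OFFSET so pages never overlap.

    Ordering by the unique primary key t.`task_id` makes pagination stable even
    when other columns tie; a secondary column like d.`did` is only needed when
    the primary sort key is not unique.
    """
    return f"{select_sql}\nORDER BY t.`task_id`\nLIMIT {limit} OFFSET {offset}"


paged = with_stable_order("SELECT t.`task_id` FROM task t WHERE 1=1", limit=10, offset=20)
```

Without the ORDER BY, MySQL is free to return rows in any order, so consecutive pages can repeat or skip tasks.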
- Around line 360-368: The current did_to_task_id mapping only keeps one task_id
per dataset, so replace it with a mapping from dataset id to a list of task ids
(e.g., build did_to_task_ids from tasks by collecting tid into lists for each
t["did"]), then when iterating qualities_result.all() look up the list
(did_to_task_ids.get(row.data, [])) and append the quality entry to each
corresponding tasks[tid] (using the same setdefault("quality", []) logic) so
every task sharing the same source_data receives the quality rows.
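The one-to-many mapping described above can be sketched with plain dicts (row shapes here are assumed for illustration, not taken from the actual module):

```python
from collections import defaultdict

# Hypothetical task rows keyed by task_id, each carrying its source dataset id `did`
tasks = {
    1: {"did": 10, "quality": []},
    2: {"did": 10, "quality": []},  # two tasks share dataset 10
    3: {"did": 11, "quality": []},
}

# Map each dataset id to *all* task ids that use it, not just the last one seen
did_to_task_ids: dict[int, list[int]] = defaultdict(list)
for tid, t in tasks.items():
    did_to_task_ids[t["did"]].append(tid)

# Hypothetical quality rows as (data, quality, value) tuples
quality_rows = [(10, "NumberOfInstances", 150.0), (11, "NumberOfClasses", 2.0)]
for data, quality, value in quality_rows:
    for tid in did_to_task_ids.get(data, []):
        tasks[tid].setdefault("quality", []).append({"name": quality, "value": str(value)})
```

With the single-valued `did_to_task_id` dict, task 1 (or 2) would silently lose its qualities; the list-valued mapping fans each quality row out to every task that shares the dataset.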
- Around line 218-233: The current combined handler list_tasks is using
Body(...) for all parameters which breaks GET semantics; split into two
endpoints: keep the existing POST /list handler (function list_tasks_post) that
accepts Pagination, task_type_id, tag, etc. from Body(...) and keep using
Depends(expdb_connection) for expdb, and create a new GET /list handler
(function list_tasks_get) that accepts the same parameters from Query(...) (or
via a Depends-based query model) instead of Body so query-string filters
populate correctly; ensure both handlers delegate shared logic to a common
helper (e.g., _list_tasks_impl) that takes the normalized args (Pagination,
TaskStatusFilter, lists like task_id/data_id, number_* ranges, expdb) to avoid
duplicating business logic.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 318531a5-1859-48cb-9afb-9606ff2ddb84

📥 Commits

Reviewing files that changed from the base of the PR and between c9ada15 and 1a507f8.

📒 Files selected for processing (2)
  • src/routers/openml/tasks.py
  • tests/routers/openml/task_test.py

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
tests/routers/openml/task_test.py (1)

92-96: Test doesn't verify GET with query parameters.

This test only exercises GET without parameters, which works because all endpoint parameters have defaults. It doesn't verify that GET with query parameters (e.g., ?task_type_id=1) actually filters results. Given the Body(...) parameter binding issue in the endpoint, such a test would currently fail—which would help surface the problem.

Consider adding a test case like:

async def test_list_tasks_get_with_query_params(py_api: httpx.AsyncClient) -> None:
    """GET /tasks/list with query params should filter results."""
    response = await py_api.get("/tasks/list", params={"task_type_id": 1})
    assert response.status_code == HTTPStatus.OK
    tasks = response.json()
    assert all(t["task_type_id"] == 1 for t in tasks)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/routers/openml/task_test.py` around lines 92 - 96, Add a new test that
calls the same endpoint as test_list_tasks_get but supplies query parameters to
ensure filtering works: create async def
test_list_tasks_get_with_query_params(py_api: httpx.AsyncClient) that performs
py_api.get("/tasks/list", params={"task_type_id": 1}), asserts
response.status_code == HTTPStatus.OK, parses response.json() into tasks, and
asserts all returned items have t["task_type_id"] == 1; reference the existing
test_list_tasks_get to place the new test nearby and validate the
query-parameter behavior for the /tasks/list endpoint.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/routers/openml/task_test.py`:
- Around line 82-89: The test test_list_tasks_no_results should assert
error["code"] equals the exact string instead of using substring containment;
update the assertion that currently checks "372" in error["code"] to a direct
equality check comparing error["code"] to the full expected string (use the same
serialized form used elsewhere, i.e., str(exc.code) format) so that
error["code"] is asserted with equality rather than containment.
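As a concrete illustration of why containment is too loose (the error payload shape below is assumed, not taken from the actual response model):

```python
# Hypothetical no-results error payload
error = {"code": "372", "message": "No results"}

# Substring containment would also pass for codes like "1372" or "3720";
# exact equality pins the contract down
assert error["code"] == "372"
assert "372" in "1372"  # demonstrates the false positive containment allows
```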


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 946d27de-85ae-4812-a074-8a3f092dadea

📥 Commits

Reviewing files that changed from the base of the PR and between 1a507f8 and 73817ee.

📒 Files selected for processing (2)
  • src/routers/openml/tasks.py
  • tests/routers/openml/task_test.py



Development

Successfully merging this pull request may close these issues.

GET /task/list/{filters}
