diff --git a/docs/contributing/tests.md b/docs/contributing/tests.md
index 5886d9e1..ade0760f 100644
--- a/docs/contributing/tests.md
+++ b/docs/contributing/tests.md
@@ -1,98 +1,229 @@
 # Writing Tests
 
-tl;dr:
- - Setting up the `py_api` fixture to test directly against a REST API endpoint is really slow, only use it for migration/integration tests.
- - Getting a database fixture and doing a database call is slow, consider mocking if appropriate.
+This guide specifies the required testing strategy for the OpenML server API.
+All tests MUST follow this strategy to ensure correctness while maintaining fast test execution.
 
-## Overhead from Fixtures
-Sometimes, you want to interact with the REST API through the `py_api` fixture,
-or want access to a database with `user_test` or `expdb_test` fixtures.
-Be warned that these come with considerable relative overhead, which adds up when running thousands of tests.
+## Testing Strategy Overview
+
+Write tests at FOUR levels. Each level serves a specific purpose:
+
+| Level | Purpose | Overhead |
+|-------|---------|----------|
+| Migration | Verify Python API matches PHP API | ~300ms |
+| Integration | Test full FastAPI stack | ~60ms |
+| Database | Test DB functions directly | ~4ms |
+| Unit | Test input/output processing | ~1ms |
+
+## 1. Migration Tests
+
+You MUST write migration tests for every endpoint to verify correctness during PHP to Python migration.
+
+**When to write these:** Test each endpoint against the PHP API to verify equivalent behavior.
+
+**Key requirement:** Use `persisted_flow` (or similar) fixture to persist data to the database before the test, so the PHP API can also access it. Without persistence, the PHP API cannot see the test data.
+
+### Example
 
 ```python
-@pytest.mark.parametrize('execution_number', range(5000))
-def test_private_dataset_owner_access(
-        execution_number,
-        expdb_test: Connection,
-        user_test: Connection,
-        py_api: TestClient,
+@pytest.mark.mut
+async def test_flow_exists(
+    persisted_flow: Flow,
+    py_api: httpx.AsyncClient,
+    php_api: httpx.AsyncClient,
 ) -> None:
-    fetch_user(ApiKey.REGULAR_USER, user_test)  # accesses only the user db
-    get_estimation_procedures(expdb_test)  # accesses only the experiment db
-    py_api.get("/does/not/exist")  # only queries the api
+    path = f"exists/{persisted_flow.name}/{persisted_flow.external_version}"
+    py_response, php_response = await asyncio.gather(
+        py_api.get(f"/flows/{path}"),
+        php_api.get(f"/flow/{path}"),
+    )
+
+    assert py_response.status_code == php_response.status_code
+    assert py_response.json() == {"flow_id": persisted_flow.id}
+```
+
+### Persisting Data for PHP API Tests
+
+When testing against the PHP API, you MUST persist data to the database because:
+- The PHP API reads from the same database as Python
+- The `expdb_test` fixture rolls back changes after each test by default
+- Without persistence, PHP API will not see the test data
+
+Use `persist=True` in `temporary_records` or use `persisted_*` fixtures:
+
+```python
+# Option 1: Use persisted fixture (recommended)
+@pytest.fixture
+async def persisted_flow(flow: Flow, expdb_test: AsyncConnection) -> AsyncIterator[Flow]:
+    await expdb_test.commit()  # Persist to database
+    yield flow
+    await expdb_test.rollback()  # Rollback after test
+    # Cleanup...
+
+# Option 2: Use persist parameter
+async with temporary_records(connection, insert_queries, delete_queries, persist=True):
+    # Data is visible to both APIs
     pass
 ```
 
-When individually adding/removing components, we measure (for 5000 repeats, n=1):
-
-| expdb | user | api | exp call | user call | api get |  time (s) |
-|-------|------|-----|----------|-----------|---------|----------:|
-|  ❌   |  ❌  | ❌  | ❌      | ❌        | ❌      |      1.78 |
-|  ✅   |  ❌  | ❌  | ❌      | ❌        | ❌      |      3.45 |
-|  ❌   |  ✅  | ❌  | ❌      | ❌        | ❌      |      3.22 |
-|  ❌   |  ❌  | ✅  | ❌      | ❌        | ❌      |    298.48 |
-|  ✅   |  ✅  | ❌  | ❌      | ❌        | ❌      |      4.44 |
-|  ✅   |  ✅  | ✅  | ❌      | ❌        | ❌      |    285.69 |
-|  ✅   |  ❌  | ❌  | ✅      | ❌        | ❌      |      4.91 |
-|  ❌   |  ✅  | ❌  | ❌      | ✅        | ❌      |      5.81 |
-|  ✅   |  ✅  | ✅  | ✅      | ✅        | ✅      |    307.91 |
-
-Adding a fixture that just returns some value adds only minimal overhead (1.91s),
-so the burden comes from establishing the database connection itself.
-
-We make the following observations:
-
-- Adding a database fixture adds the same overhead as instantiating an entirely new test.
-- Overhead of adding multiple database fixtures is not free
-- The `py_api` fixture adds two orders of magnitude more overhead
-
-We want our tests to be fast, so we want to avoid using these fixtures when we reasonably can.
-We restrict usage of `py_api` fixtures to integration/migration tests, since it is very slow.
-These only run on CI before merges.
-For database fixtures
-
-We will write some fixtures that can be used to e.g., get a `User` without accessing the database.
-The validity of these users will be tested against the database in only a single test.
-
-### Mocking
-Mocking can help us reduce the reliance on database connections in tests.
-A mocked function can prevent accessing the database, and instead return a predefined value instead.
-
-It has a few upsides:
- - It's faster than using a database fixture (see below).
- - The test is not dependent on the database: you can run the test without a database.
-
-But it also has downsides:
- - Behavior changes in the database, such as schema changes, are not automatically reflected in the tests.
- - The database layer (e.g., queries) are not actually tested.
-
-Basically, the mocked behavior may not match real behavior when executed on a database.
-For this reason, for each mocked entity, we should add a test that verifies that if the database layer
-is invoked with the database, it returns the expected output that matches the mock.
-This is additional overhead in development, but hopefully it pays back in more granular test feedback and faster tests.
-
-On the speed of mocks, consider these two tests:
-
-```diff
-@pytest.mark.parametrize('execution_number', range(5000))
-def test_private_dataset_owner_access(
-        execution_number,
-        admin,
-+        mocker,
--        expdb_test: Connection,
+## 2. Integration Tests
+
+You SHOULD write integration tests for critical paths that exercise the full FastAPI stack.
+
+**When to write these:** Test expected paths that require the complete application stack (client, dependency injection, database). Limit to ONE test per kind of expected response.
+
+**Important:** Creating the FastAPI TestClient has significant overhead (~60ms per test). Therefore, you MUST limit integration tests to critical paths only. Use database or unit tests for additional coverage.
+
+### Example
+
+```python
+async def test_flow_exists(flow: Flow, py_api: httpx.AsyncClient) -> None:
+    """Test flow exists - found case."""
+    response = await py_api.get(f"/flows/exists/{flow.name}/{flow.external_version}")
+    assert response.status_code == HTTPStatus.OK
+    assert response.json() == {"flow_id": flow.id}
+
+
+async def test_flow_exists_not_exists(py_api: httpx.AsyncClient) -> None:
+    """Test flow exists - not found case."""
+    response = await py_api.get("/flows/exists/foo/bar")
+    assert response.status_code == HTTPStatus.NOT_FOUND
+```
+
+## 3. Database Tests
+
+You SHOULD write database tests to verify database query behavior directly.
+
+**When to write these:** Test database queries, ORM functions, and data retrieval logic.
+
+**Advantage:** Fast execution (~4ms per test), no FastAPI overhead.
+
+### Example
+
+```python
+async def test_database_flow_exists(flow: Flow, expdb_test: AsyncConnection) -> None:
+    retrieved_flow = await database.flows.get_by_name(flow.name, flow.external_version, expdb_test)
+    assert retrieved_flow is not None
+    assert retrieved_flow.id == flow.id
+
+
+async def test_database_flow_not_found(expdb_test: AsyncConnection) -> None:
+    retrieved_flow = await database.flows.get_by_name("foo", "bar", expdb_test)
+    assert retrieved_flow is None
+```
+
+### Database Fixtures and Rollback
+
+The `expdb_test` and `user_test` fixtures use automatic rollback:
+
+```python
+@contextlib.asynccontextmanager
+async def automatic_rollback(engine: AsyncEngine) -> AsyncIterator[AsyncConnection]:
+    async with engine.connect() as connection:
+        transaction = await connection.begin()
+        yield connection
+        if transaction.is_active:
+            await transaction.rollback()
+```
+
+**Behavior:**
+- Default: Changes are rolled back after each test (safe, no cleanup needed)
+- Use `persist=True` or `persisted_*` fixtures ONLY when data must be visible to external systems (e.g., PHP API)
+
+## 4. Unit Tests
+
+You SHOULD write unit tests to verify input/output processing and error handling.
+
+**When to write these:** Test edge cases, error handling, input validation, and output formatting. Use `pytest.mark.parametrize` to test multiple inputs efficiently.
+
+**Advantage:** Fast execution (~1ms per test), no database or API overhead.
+
+### Example
+
+```python
+async def test_flow_exists_calls_db_correctly(
+    name: str,
+    external_version: str,
+    expdb_test: AsyncConnection,
+    mocker: MockerFixture,
 ) -> None:
-+    mock = mocker.patch('database.datasets.get')
-+    class Dataset(NamedTuple):
-+        uploader: int
-+        visibility: Visibility
-+    mock.return_value = Dataset(uploader=1, visibility=Visibility.PRIVATE)
-
-    _get_dataset_raise_otherwise(
-        dataset_id=1,
-        user=admin,
--        expdb=expdb_test,
-+        expdb=None,
+    mocked_db = mocker.patch("database.flows.get_by_name", new_callable=mocker.AsyncMock)
+    await flow_exists(name, external_version, expdb_test)
+    mocked_db.assert_called_once_with(
+        name=name,
+        external_version=external_version,
+        expdb=mocker.ANY,
     )
+
+
+async def test_flow_exists_handles_not_found(
+    mocker: MockerFixture, expdb_test: AsyncConnection
+) -> None:
+    mocker.patch("database.flows.get_by_name", return_value=None)
+    with pytest.raises(FlowNotFoundError) as error:
+        await flow_exists("foo", "bar", expdb_test)
+    assert error.value.status_code == HTTPStatus.NOT_FOUND
+```
+
+### When to Mock
+
+| Use Case | Approach |
+|----------|----------|
+| Testing input/output processing | Mock the database call |
+| Testing error handling | Mock return value to trigger error |
+| Testing database behavior | Use database tests instead |
+
+## Test Markers
+
+Use pytest markers to control test execution:
+
+- `@pytest.mark.mut` - Migration tests (run against both PHP and Python APIs)
+- Default - Fast unit/database/integration tests (run locally and on CI)
+
+```bash
+# Run only fast tests locally
+pytest -m "not mut"
+
+# Run all tests including migration
+pytest -m "mut"
 ```
-There is only a single database call in the test. It fetches a record on an indexed field and does not require any joins.
-Despite the database call being very light, the database-included test is ~50% slower than the mocked version (3.50s vs 5.04s).
+
+## Fixtures Reference
+
+| Fixture | Purpose | Rollback | Use When |
+|---------|---------|----------|----------|
+| `expdb_test` | DB connection to experiment db | Automatic | Testing DB queries |
+| `user_test` | DB connection to user db | Automatic | Testing user data |
+| `py_api` | FastAPI test client with DB | Automatic | Integration tests |
+| `php_api` | PHP API client | N/A | Migration tests |
+| `flow` | Creates temp flow in DB | Automatic | Unit/DB tests |
+| `persisted_flow` | Persists flow for both APIs | Manual cleanup | Migration tests |
+
+## Performance Data
+
+Understanding fixture overhead helps you write efficient tests.
+
+### Fixture Overhead (5000 iterations)
+
+| Fixtures Used | Time (s) |
+|---------------|----------|
+| None | 1.78 |
+| `expdb_test` | 3.45 |
+| `user_test` | 3.22 |
+| `py_api` | 298.48 |
+| `expdb_test` + `user_test` | 4.44 |
+| All three | 307.91 |
+
+### Key Observations
+
+- **Database fixtures** add ~2ms overhead each
+- **FastAPI TestClient** adds ~60ms overhead (two orders of magnitude more than DB)
+- Adding multiple database fixtures is not free
+- Use `pytest.mark.parametrize` at the database/unit level for multiple inputs
+
+### Guidelines
+
+1. DO use database tests with `pytest.mark.parametrize` for multiple inputs
+2. DO mock database calls in unit tests for speed
+3. DO limit integration tests to one per expected response type
+4. DO NOT use `py_api` fixture unless testing the full FastAPI stack
+5. DO NOT create a new database fixture per input in parameterized tests