AP-677 create tind provider hook#58

Open
jason-raitz wants to merge 8 commits into main from AP-677_create-tind-provider-hook

Contributor

@jason-raitz jason-raitz commented May 1, 2026

Done

  • bumped the python-tind-client version for easier connections without setting default_storage_dir
  • refactored FetchTind into /mokelumne/providers/hooks/tind.py and removed the original
  • created a provider entrypoint for the new provider
    • I deviated slightly from the ticket: this is a mokelumne provider whose TindHook class implements a connection type called tind. I could be convinced to change this. It differs from the usual setup because many custom providers are assumed to be external imports, not part of the project.
  • persist the connection in the airflow metastore?
  • connection is visible and editable in the Airflow Connection GUI

TODO?

  • add some customizations like the ones @danschmidt5189 suggested
  • Possibly rebase and refactor things based on the timing of @awilfox's AP-670 merge

@jason-raitz
Contributor Author

To persist the Connection in the metastore, we need to add it while the airflow-init container runs.

Done this way, the Connections page in the GUI looks like this:
[Screenshot 2026-05-04 at 11 32 28 AM]

And this is the Connection editor modal for the Tind Connection:
[Screenshot 2026-05-04 at 11 32 55 AM]

 - manually adds Tind API Url and Key to the metastore during airflow-init
 - allows the user to see and edit the connection in the Airflow GUI
@jason-raitz jason-raitz marked this pull request as ready for review May 4, 2026 15:54
Member

@anarchivist anarchivist left a comment

mostly looks good. my main comments are as follows:

  • i don't know why we're reverting back to the old TIND_API_KEY/TIND_BASE_URL environment variables - it doesn't seem necessary.
  • there are a lot of unrelated changes to the compose file that seem unnecessary to the PR.
  • i know we didn't want to add tests for FetchTind but i am curious about what our plans are to add tests for the TINDHook.

Comment thread docker-compose.yml
Comment on lines +123 to +133
if [ -n "${TIND_API_KEY:-}" ]; then
  airflow connections add tind_default \
    --conn-type tind \
    --conn-host "${TIND_BASE_URL}" \
    --conn-password "${TIND_API_KEY}" \
    --conn-description "TIND API connection" || \
  airflow connections set tind_default \
    --conn-type tind \
    --conn-host "${TIND_BASE_URL}" \
    --conn-password "${TIND_API_KEY}"
fi
Member

@anarchivist anarchivist May 4, 2026

i'm wondering if we actually want to do this. isn't the point to do away with the environment variables altogether (other than AIRFLOW_CONN_TIND_DEFAULT, which was also removed from example.env)?

Member

FWIW I like Jason's method here.

Thoughts:

  1. This is solely about the dev workflow.
  2. Doing it this way retains our simple method: scaffold .env, then run docker compose up --build.
  3. This also retains the ability to modify the connection via the UI, which you lose with AIRFLOW_CONN_TIND_DEFAULT.

I'm honestly struggling to think of a case where we'd want to use AIRFLOW_CONN_TIND_DEFAULT. Any thoughts…?

Member

@anarchivist anarchivist May 6, 2026

well, it's in example.env, so my view is that it's optional anyway.

i don't love that we're effectively reverting to requiring an environment variable pair for something that's managed as a connection. using the airflow cli to set it also feels heavy-handed.
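
For readers following this thread, here is a minimal Python sketch of the seeding step under discussion. It only builds the argument list for the `airflow connections add` call quoted in the diff and skips it when `TIND_API_KEY` is unset, mirroring the shell guard. The function name `tind_connection_add_argv` is hypothetical and is not part of the PR.

```python
from __future__ import annotations

from collections.abc import Mapping


def tind_connection_add_argv(env: Mapping[str, str]) -> list[str] | None:
    """Return the `airflow connections add` argv, or None when no key is set."""
    api_key = env.get("TIND_API_KEY", "")
    if not api_key:
        # Mirrors the shell guard `[ -n "${TIND_API_KEY:-}" ]`.
        return None
    return [
        "airflow", "connections", "add", "tind_default",
        "--conn-type", "tind",
        "--conn-host", env.get("TIND_BASE_URL", ""),
        "--conn-password", api_key,
        "--conn-description", "TIND API connection",
    ]
```

In a container where the Airflow CLI is installed, the returned list could be handed to `subprocess.run(argv, check=True)`. Keeping the argument construction separate from the call makes the guard easy to unit test.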

Comment thread example.env
AIRFLOW__API_AUTH__JWT_SECRET=
# @see https://airflow.apache.org/docs/apache-airflow/stable/security/secrets/fernet.html#generating-fernet-key
AIRFLOW__CORE__FERNET_KEY=
AIRFLOW_CONN_TIND_DEFAULT='{"conn_type": "http","password": "your-tind-key-here","host": "https://digicoll.lib.berkeley.edu/api/v1","schema": "https"}'
Member

wait, why are we taking this out?

Member

Per my comment above: if AIRFLOW_CONN_TIND_DEFAULT is set you can't change the value via the UI. @jason-raitz's method is basically the best of both worlds: you still get ENV-based initialization, but you can also change the value later if you want to (e.g. for interactive testing).
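
For comparison, the environment-variable form being removed is a JSON string held in `AIRFLOW_CONN_TIND_DEFAULT`. Airflow reads such variables at lookup time rather than storing them in the metastore, which is why they cannot be edited in the UI. A small sketch of building that value with the field names from example.env; the helper name `tind_conn_env_value` is hypothetical.

```python
import json


def tind_conn_env_value(host: str, api_key: str, conn_type: str = "http") -> str:
    """Serialize connection fields in the JSON form shown in example.env."""
    return json.dumps(
        {
            "conn_type": conn_type,
            "password": api_key,
            "host": host,
            "schema": "https",
        }
    )
```

Passing `conn_type="tind"` would target the connection type this PR introduces instead of the generic `http` type used in the old example.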

Comment thread README.md
| `AIRFLOW__API__BASE_URL` | Where Airflow's webserver is reachable. | `AIRFLOW__API__BASE_URL="http://localhost:8080"` |
| `AIRFLOW__CORE__FERNET_KEY` | [Fernet](https://airflow.apache.org/docs/apache-airflow/stable/security/secrets/fernet.html) encryption key used to encrypt Airflow secrets | `AIRFLOW__CORE__FERNET_KEY="somebase64value="` |
| `AIRFLOW__API_AUTH__JWT_SECRET` | Secret key used to sign JWT tokens for Airflow's API authentication. The default value used in development and testing should be replaced in production. | `AIRFLOW__API_AUTH__JWT_SECRET="some32bytesecret"` |
| `AIRFLOW_CONN_TIND_DEFAULT` | Airflow connection json string for TIND access.<br>Note: the Connection params listed in the example are all needed! | `AIRFLOW_CONN_TIND_DEFAULT='{"conn_type": "http","password": "your-tind-key-here","host": "https://digicoll.lib.berkeley.edu/api/v1","schema": "https"}'` |
Member

ditto this?

Comment thread docker-compose.yml
Member

general comment - there are a lot of stylistic changes to the compose file in this PR. why are they being included in this PR?

Contributor Author

I've got a linter installed that actively fixes linting issues on save. I can turn that off...

Member

I second @anarchivist — it's not specific to this PR, and absent agreement within the team and enforcement via CI/tests this sort of change is liable to be unintentionally reverted. I think it's out-of-scope for this PR, but something we could address later via test enforcement.

Comment thread requirements.txt
# langchain-core
# libcst
# uvicorn
pyyaml-ft==8.0.0 \
Member

do we know why this dependency is no longer necessary?

Comment thread mokelumne/providers/__init__.py Outdated
@@ -0,0 +1,22 @@
"""Mokelumne Airflow provider."""
Member

Suggested change
- """Mokelumne Airflow provider."""
+ """Mokelumne Airflow providers."""

Comment thread mokelumne/providers/hooks/tind.py Outdated
 - Airflow 3 prefers provider.yaml for ui field behavior
Member

@danschmidt5189 danschmidt5189 left a comment

Overall looks very good, I like the direction this is taking. My big question is whether this should be a TIND provider, specifically, rather than "Mokelumne". That makes the most sense to me and is an obvious candidate for extraction / publication down the road. And as I noted in my comment, if we scope this to TIND we should still be able to vendor multiple other custom providers in the ./providers directory by grouping them into subdirs.

Comment thread mokelumne/providers/__init__.py Outdated
from typing import Any


def get_provider_info() -> dict[str, Any]:
Member

As we noted, this duplicates information already present in the provider.yaml. Looking at community providers it seemed like they either eat that cost (and maintain multiple copies of this info) or use a build script to generate this method based on some static configuration file.

@@ -0,0 +1,19 @@
package-name: mokelumne
Member

Big picture question: Should this be a TIND provider (rather than all of Mokelumne)? That is more tightly scoped, and potentially something we'd want to publish. I could definitely see us adding to that package pretty soon (e.g. TindQueryOperator, TindDownloadOperator, …).

Based on what we saw with community providers, we could still vendor multiple providers using a directory structure like:

./providers/tind/ (this PR)
./providers/mokelumne/ (future)

where each has its own provider.yaml and get_provider_info() method.
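
To make the proposal concrete, a TIND-scoped `get_provider_info()` might look roughly like the sketch below. The keys follow the provider metadata layout Airflow uses for connection types. The package name and hook module path are hypothetical placeholders, since the thread has not settled on them.

```python
from typing import Any


def get_provider_info() -> dict[str, Any]:
    """Describe a hypothetical TIND-only provider to Airflow's provider discovery."""
    return {
        "package-name": "tind-provider",  # hypothetical package name
        "name": "TIND",
        "description": "Hook for the TIND API.",
        "connection-types": [
            {
                "connection-type": "tind",
                # Hypothetical module path under ./providers/tind/
                "hook-class-name": "providers.tind.hooks.tind.TindHook",
            }
        ],
    }
```

A second provider under ./providers/mokelumne/ would carry its own copy of this function with its own package name and connection types.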

try:
client = FetchTind.from_connection(orig_run_id, conn="tind_default")
filemd = client.get_first_file_metadata(tind_id)
hook = TindHook()
Member

Nit: hook seems a little too generic here—maybe tind_hook or just tind?

Aside: I suspect we'll eventually support injecting the tind_conn_id (defaulting to tind_default). We don't quite have a use-case at the moment, but you could imagine that being useful for testing or sandboxing. For example, you could hand a specific collection-scoped TIND connection to student workers.
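
A minimal sketch of what injecting the connection id could look like, assuming a factory that falls back to `tind_default`. `StubTindHook` is a stand-in for the PR's TindHook, and `make_tind_hook` and the `tind_students` connection id are hypothetical names used only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass

DEFAULT_TIND_CONN_ID = "tind_default"


@dataclass
class StubTindHook:
    """Stand-in for TindHook; only records which connection it was given."""

    conn_id: str = DEFAULT_TIND_CONN_ID


def make_tind_hook(tind_conn_id: str | None = None) -> StubTindHook:
    """Build a hook for the given connection id, defaulting to tind_default."""
    return StubTindHook(conn_id=tind_conn_id or DEFAULT_TIND_CONN_ID)
```

A task could then call `make_tind_hook("tind_students")` in a sandboxed run and `make_tind_hook()` everywhere else.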

def __init__(self, conn_id: str = "tind_default") -> None:
    super().__init__()
    self.conn_id = conn_id
    self._client: TINDClient | None = None
Member

Initially it didn't sit right with me that we deferred _client initialization, but based on our review of the community's providers I think that's just a nit. This is all fine.

A note on what I said in our pairing session: Python attributes are not read-only by default, so conn_id could in principle be reassigned after _client is created. Callers have no reason to do that, so I consider the risk of confusing behavior low.
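
As background for this point: plain instance attributes in Python can be reassigned by any caller unless they are wrapped in a property without a setter. The sketch below shows deferred client creation with `conn_id` exposed read-only. It is an illustration under that assumption, not the PR's code, and `StubClient` and `LazyHook` are hypothetical stand-ins for TINDClient and TindHook.

```python
from __future__ import annotations


class StubClient:
    """Stand-in for TINDClient; records the connection id it was built for."""

    def __init__(self, conn_id: str) -> None:
        self.conn_id = conn_id


class LazyHook:
    """Hypothetical hook that builds its client on first use and caches it."""

    def __init__(self, conn_id: str = "tind_default") -> None:
        self._conn_id = conn_id
        self._client: StubClient | None = None

    @property
    def conn_id(self) -> str:
        # No setter is defined, so `hook.conn_id = ...` raises AttributeError.
        return self._conn_id

    def get_conn(self) -> StubClient:
        # Deferred initialization: create the client once, then reuse it.
        if self._client is None:
            self._client = StubClient(self._conn_id)
        return self._client
```

With this shape the cached client and the connection id cannot drift apart through ordinary attribute assignment, at the cost of a few extra lines per hook.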

self.conn_id = conn_id
self._client: TINDClient | None = None

def get_conn(self) -> TINDClient:
Member

Ditto this — I originally didn't love that this method was called get_conn, implying that it returns a Connection, but after reviewing a few community providers I see that's the standard way of naming this sort of method. LGTM!

3 participants