AP-677 create tind provider hook#58
Conversation
- manually adds Tind API Url and Key to the metastore during airflow-init - allows the user to see and edit the connection in the Airflow GUI
anarchivist
left a comment
There was a problem hiding this comment.
mostly looks good. my main comments are as follows:
- i don't know why we're reverting back to the old
TIND_API_KEY/TIND_BASE_URLenvironment variables - it doesn't seem necessary. - there are a lot of unrelated changes to the compose file that seem unnecessary to the PR.
- i know we didn't want to add tests for
FetchTindbut i am curious about what are plans are to add tests for theTINDHook.
| if [ -n "${TIND_API_KEY:-}" ]; then | ||
| airflow connections add tind_default \ | ||
| --conn-type tind \ | ||
| --conn-host "${TIND_BASE_URL}" \ | ||
| --conn-password "${TIND_API_KEY}" \ | ||
| --conn-description "TIND API connection" || \ | ||
| airflow connections set tind_default \ | ||
| --conn-type tind \ | ||
| --conn-host "${TIND_BASE_URL}" \ | ||
| --conn-password "${TIND_API_KEY}" | ||
| fi |
There was a problem hiding this comment.
i'm wondering if we actually want to do this. isn't the point to do away with the environment variables altogether (other than AIRFLOW_CONN_TIND_DEFAULT, which was also removed from example.env?)
There was a problem hiding this comment.
FWIW I like Jason's method here.
Thoughts:
- This is solely about the dev workflow.
- Doing it this way retains our simple method: scaffold .env, then run
docker compose up --build. - This also retains the ability to modify the connection via the UI, which you lose with
AIRFLOW_CONN_TIND_DEFAULT.
I'm honestly struggling to think of a case where we'd want to use AIRFLOW_CONN_TIND_DEFAULT. Any thoughts…?
There was a problem hiding this comment.
well, it's in example.env, so my view is that it's optional anyway.
i don't love that we're effectively reverting to requiring an environment variable pair for something that's managed as a connection. using the airflow cli to set it also feels heavy handed.
| AIRFLOW__API_AUTH__JWT_SECRET= | ||
| # @see https://airflow.apache.org/docs/apache-airflow/stable/security/secrets/fernet.html#generating-fernet-key | ||
| AIRFLOW__CORE__FERNET_KEY= | ||
| AIRFLOW_CONN_TIND_DEFAULT='{"conn_type": "http","password": "your-tind-key-here","host": "https://digicoll.lib.berkeley.edu/api/v1","schema": "https"}' |
There was a problem hiding this comment.
wait, why are we taking this out?
There was a problem hiding this comment.
Per my comment above: if AIRFLOW_CONN_TIND_DEFAULT is set you can't change the value via the UI. @jason-raitz's method is basically the best of both worlds: you still get ENV-based initialization, but you can also change the value later if you want to (i.e. for interactive testing).
| | `AIRFLOW__API__BASE_URL` | Where Airflow's webserver is reachable. | `AIRFLOW__API__BASE_URL="http://localhost:8080"` | | ||
| | `AIRFLOW__CORE__FERNET_KEY` | [Fernet](https://airflow.apache.org/docs/apache-airflow/stable/security/secrets/fernet.html) encryption key used to encrypt Airflow secrets | `AIRFLOW__CORE__FERNET_KEY="somebase64value="` | | ||
| | `AIRFLOW__API_AUTH__JWT_SECRET` | Secret key used to sign JWT tokens for Airflow's API authentication. The default value used in development and testing should be replaced in production. | `AIRFLOW__API_AUTH__JWT_SECRET="some32bytesecret"` | | ||
| | `AIRFLOW_CONN_TIND_DEFAULT` | Airflow connection json string for TIND access.<br>Note: the Connection params listed in the example are all needed! | `AIRFLOW_CONN_TIND_DEFAULT='{"conn_type": "http","password": "your-tind-key-here","host": "https://digicoll.lib.berkeley.edu/api/v1","schema": "https"}'` | |
There was a problem hiding this comment.
general comment - there are a lot of stylistic changes to the compose file in this PR. why are they being included in this PR
There was a problem hiding this comment.
I've got a linter installed and actively fixing linting issues on save. I can turn that off....
There was a problem hiding this comment.
I second @anarchivist — it's not specific to this PR, and absent agreement within the team and enforcement via CI/tests this sort of change is liable to be unintentionally reverted. I think it's out-of-scope for this PR, but something we could address later via test enforcement.
| # langchain-core | ||
| # libcst | ||
| # uvicorn | ||
| pyyaml-ft==8.0.0 \ |
There was a problem hiding this comment.
do we know why this dependency is no longer necessary?
| @@ -0,0 +1,22 @@ | |||
| """Mokelumne Airflow provider.""" | |||
There was a problem hiding this comment.
| """Mokelumne Airflow provider.""" | |
| """Mokelumne Airflow providers.""" |
- Airflow 3 prefers provider.yaml for ui field behavior
danschmidt5189
left a comment
There was a problem hiding this comment.
Overall looks very good, I like the direction this is taking. My big question is whether this should be a TIND provider, specifically, rather than "Mokelumne". That makes the most sense to me and is an obvious candidate for extraction / publication down the road. And as I noted in my comment, if we scope this to TIND we should still be able to vendor multiple other custom providers in the ./providers directory by grouping them into subdirs.
| from typing import Any | ||
|
|
||
|
|
||
| def get_provider_info() -> dict[str, Any]: |
There was a problem hiding this comment.
As we noted, this duplicates information already present in the provider.yaml. Looking at community providers it seemed like they either eat that cost (and maintain multiple copies of this info) or use a build script to generate this method based on some static configuration file.
| AIRFLOW__API_AUTH__JWT_SECRET= | ||
| # @see https://airflow.apache.org/docs/apache-airflow/stable/security/secrets/fernet.html#generating-fernet-key | ||
| AIRFLOW__CORE__FERNET_KEY= | ||
| AIRFLOW_CONN_TIND_DEFAULT='{"conn_type": "http","password": "your-tind-key-here","host": "https://digicoll.lib.berkeley.edu/api/v1","schema": "https"}' |
There was a problem hiding this comment.
Per my comment above: if AIRFLOW_CONN_TIND_DEFAULT is set you can't change the value via the UI. @jason-raitz's method is basically the best of both worlds: you still get ENV-based initialization, but you can also change the value later if you want to (i.e. for interactive testing).
| @@ -0,0 +1,19 @@ | |||
| package-name: mokelumne | |||
There was a problem hiding this comment.
Big picture question: Should this be a TIND provider (rather than all of Mokelumne)? That is more tightly scoped, and potentially something we'd want to publish. I could definitely see us adding to that package pretty soon (e.g. TindQueryOperator, TindDownloadOperator, …).
Based on what we saw with community providers, we could still vendor multiple providers using a directory structure like:
./providers/tind/ (this PR)
./providers/mokelumne/ (future)
Where each had its own provider.yaml and get_provider_info() method.
| try: | ||
| client = FetchTind.from_connection(orig_run_id, conn="tind_default") | ||
| filemd = client.get_first_file_metadata(tind_id) | ||
| hook = TindHook() |
There was a problem hiding this comment.
Nit: hook seems a little too generic here—maybe tind_hook or just tind?
Aside: I suspect we'll eventually support injecting the tind_conn_id (defaulting to tind_default). We don't quite have a use-case at the moment, but you could imagine that being useful for testing or sandboxing. For example, you could hand a specific collection-scoped TIND connection to student workers.
| def __init__(self, conn_id: str = "tind_default") -> None: | ||
| super().__init__() | ||
| self.conn_id = conn_id | ||
| self._client: TINDClient | None = None |
There was a problem hiding this comment.
Initially it didn't sit right with me that we deferred _client initialization, but based on our review of the community's providers I think that's just a nit. This is all fine.
I'll also correct what I said in our pairing session: Python attributes are read-only by default, so conn_id and _client are effectively tied together and there's no risk of the conn_id being modified by a caller in a confusing way.
| self.conn_id = conn_id | ||
| self._client: TINDClient | None = None | ||
|
|
||
| def get_conn(self) -> TINDClient: |
There was a problem hiding this comment.
Ditto this — I originally didn't love that this method was called get_conn, implying that it returns a Connection, but after reviewing a few community providers I see that's the standard way of naming this sort of method. LGTM!
There was a problem hiding this comment.
I second @anarchivist — it's not specific to this PR, and absent agreement within the team and enforcement via CI/tests this sort of change is liable to be unintentionally reverted. I think it's out-of-scope for this PR, but something we could address later via test enforcement.
| if [ -n "${TIND_API_KEY:-}" ]; then | ||
| airflow connections add tind_default \ | ||
| --conn-type tind \ | ||
| --conn-host "${TIND_BASE_URL}" \ | ||
| --conn-password "${TIND_API_KEY}" \ | ||
| --conn-description "TIND API connection" || \ | ||
| airflow connections set tind_default \ | ||
| --conn-type tind \ | ||
| --conn-host "${TIND_BASE_URL}" \ | ||
| --conn-password "${TIND_API_KEY}" | ||
| fi |
There was a problem hiding this comment.
FWIW I like Jason's method here.
Thoughts:
- This is solely about the dev workflow.
- Doing it this way retains our simple method: scaffold .env, then run
docker compose up --build. - This also retains the ability to modify the connection via the UI, which you lose with
AIRFLOW_CONN_TIND_DEFAULT.
I'm honestly struggling to think of a case where we'd want to use AIRFLOW_CONN_TIND_DEFAULT. Any thoughts…?


Done
TODO?