Update ingestion helper to use cloud run services #1999

Open
vish-cs wants to merge 1 commit into datacommonsorg:master from vish-cs:workflow

@vish-cs vish-cs commented May 11, 2026

Updated the ingestion workflow, Cloud Build, and Terraform scripts to use Cloud Run services instead of Cloud Functions for the helpers.
Moved the aggregation helper logic into the ingestion helper to consolidate both under a single Docker image. We can always deploy them as independent services if required.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request migrates the import automation infrastructure from Cloud Functions to Cloud Run services, containerizing the ingestion and import helpers and consolidating aggregation logic. Several critical issues were identified: the aggregation_utils.py file is missing from the PR, the SPANNER_GRAPH_DATABASE_ID environment variable is not configured for the new service, and removing DDL management from Terraform will break database initialization. Furthermore, the build process is brittle due to remote schema fetching, and a stale default URL persists in the update script.
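One way to surface a missing SPANNER_GRAPH_DATABASE_ID early is a fail-fast check when the service starts. The following is only a minimal sketch and not code from this PR: the helper name `require_env` and the `REQUIRED_ENV_VARS` tuple are hypothetical, and how the ingestion helper actually reads its configuration is an assumption.

```python
import os

# Hypothetical list of settings the service cannot run without.
# Only SPANNER_GRAPH_DATABASE_ID is named in the review above.
REQUIRED_ENV_VARS = ("SPANNER_GRAPH_DATABASE_ID",)


def require_env(names, environ=None):
    """Return a dict of the requested variables, or raise if any is unset or empty.

    `environ` defaults to os.environ; passing a plain dict makes the check
    easy to exercise in tests.
    """
    environ = os.environ if environ is None else environ
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    return {name: environ[name] for name in names}
```

Calling `require_env(REQUIRED_ENV_VARS)` once at startup would make a misconfigured Cloud Run revision fail on boot rather than on the first request that touches Spanner.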

Comment thread import-automation/workflow/ingestion-helper/main.py
Comment thread import-automation/terraform/main.tf
Comment thread import-automation/terraform/main.tf
Comment thread import-automation/executor/scripts/update_import_version.sh
Comment thread import-automation/workflow/ingestion-helper/Dockerfile
Comment thread import-automation/workflow/ingestion-helper/schema.sql
Comment thread import-automation/workflow/import-helper/Dockerfile
# Fetch proto file from GitHub
RUN curl -o storage.proto https://raw.githubusercontent.com/datacommonsorg/import/master/pipeline/data/src/main/proto/storage.proto

RUN curl -o schema.sql https://raw.githubusercontent.com/datacommonsorg/import/master/pipeline/spanner/src/main/resources/spanner_schema.sql

I think you should remove this. The point is that we want to use the schema.sql in this directory. With this line, you're overwriting schema.sql with what's in the import/ repo.

And FYI: for the June 15th milestone, we will be switching over to the new schema, which I believe will no longer need this line:
RUN curl -o storage.proto https://raw.githubusercontent.com/datacommonsorg/import/master/pipeline/data/src/main/proto/storage.proto
So we will be able to remove it within a month, and schema.sql will then be the complete schema.

@gmechali gmechali left a comment

Just one change needed for the schema download in the Dockerfile, but otherwise LGTM.
Please share this with Sandeep as well, so he doesn't waste time starting from an old branch!
