
Fix Google Drive spreadsheet crawler table grouping#2754

Merged
Tian-2017 merged 2 commits into main from fix-g-drive-ingestion-data-source-and-crawler-level on May 19, 2026

@Tian-2017 Tian-2017 commented May 19, 2026

Why

Google Drive spreadsheet imports write raw data under a dataset-specific path:

<department>/<output_folder>/<data_set_name>/import_year=...

The crawler was scanning the shared <department>/<output_folder> prefix with table level 3. For some imports, Glue grouped the dataset folder as a partition and created a generic table such as g_drive instead of the expected dataset table.

This caused land_reg_registered_leases_greater_london_2026_04 to land successfully in S3 but not appear as its own raw-zone Glue table.
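The grouping problem described above can be illustrated with a small sketch. It assumes Glue's table level counts the bucket as level 1 and each folder below it as one more level, and that the department folder sits directly under the bucket. The bucket name `example-raw-zone` and the `housing/g_drive` folders are hypothetical placeholders, not values taken from this PR.

```python
def table_root_for_level(bucket: str, key_prefix: str, table_level: int) -> str:
    """Return the S3 path that would act as the table root at a given level.

    Assumption: the bucket is level 1 and every folder below it adds one level.
    """
    segments = [bucket] + [part for part in key_prefix.split("/") if part]
    if not 1 <= table_level <= len(segments):
        raise ValueError(f"table_level must be between 1 and {len(segments)}")
    return "s3://" + "/".join(segments[:table_level]) + "/"


# Hypothetical bucket and folder names, used only for illustration.
bucket = "example-raw-zone"
key_prefix = (
    "housing/g_drive/"
    "land_reg_registered_leases_greater_london_2026_04/import_year=2026/"
)

# Level 3 stops at the shared output folder, so the dataset folder
# sits below the table root and can be read as a partition.
old_root = table_root_for_level(bucket, key_prefix, 3)

# One level deeper, the dataset folder itself becomes the table root.
new_root = table_root_for_level(bucket, key_prefix, 4)
```

Under these assumptions, level 3 resolves to the shared `g_drive` folder, which matches the generic table name seen in the failed import.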

Parking imports appeared to work because the parking g-drive folder contains many datasets with different schemas. In that case, Glue could not safely group everything into one shared g_drive table, so it created dataset-specific tables such as hackney_carpark with a warning message "Found schemas don't match at level 3; Created table hackney_carpark". That behaviour is incidental, not guaranteed. The crawler config is still too broad and should be fixed so every Google Drive import reliably crawls its own dataset path.

What Changed

  • Updated the spreadsheet import crawler target to scan the dataset-specific raw prefix.
  • Updated the crawler table level to match that path so the dataset folder becomes the table root.
  • Kept the Glue job output path unchanged.
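The diff itself is not reproduced in this conversation, so the following is only a rough sketch of what a dataset-specific crawler target and matching table level could look like. It is written in Python and uses the field names of the Glue crawler API (`Targets`, `S3Targets`, `Configuration`, `TableLevelConfiguration`). The helper name `dataset_crawler_settings` and all argument values are hypothetical, and the level calculation assumes the bucket counts as level 1.

```python
import json


def dataset_crawler_settings(
    bucket: str, department: str, output_folder: str, data_set_name: str
) -> dict:
    """Build crawler settings that scan one dataset folder only.

    Hypothetical helper: the real change in this PR may be expressed differently.
    """
    prefix = f"{department}/{output_folder}/{data_set_name}"
    # Assumption: bucket is level 1, so the dataset folder's level is
    # one more than the number of folders in the prefix.
    table_level = 1 + len([part for part in prefix.split("/") if part])
    return {
        "Targets": {"S3Targets": [{"Path": f"s3://{bucket}/{prefix}/"}]},
        # Glue expects the crawler configuration as a JSON string.
        "Configuration": json.dumps(
            {"Version": 1.0, "Grouping": {"TableLevelConfiguration": table_level}}
        ),
    }


# Placeholder values for illustration only.
settings = dataset_crawler_settings(
    "example-raw-zone",
    "housing",
    "g_drive",
    "land_reg_registered_leases_greater_london_2026_04",
)
```

Deriving the table level from the prefix keeps the two settings in step, which is the pairing the first two bullets describe.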

Validation

  • Manually tested the equivalent crawler config in prod.
  • Confirmed Glue created housing-raw-zone.land_reg_registered_leases_greater_london_2026_04.
  • Confirmed the table has the expected import_year, import_month, import_day, and import_date partitions.
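The partition check above could be scripted along these lines. This is a minimal sketch, not the procedure used in the manual test: the helper name `missing_partition_keys` is hypothetical, and the sample dictionary only mimics the `PartitionKeys` shape of a Glue table description.

```python
EXPECTED_PARTITIONS = ["import_year", "import_month", "import_day", "import_date"]


def missing_partition_keys(table: dict, expected=EXPECTED_PARTITIONS) -> list:
    """Return the expected partition names that the table does not declare."""
    found = {key["Name"] for key in table.get("PartitionKeys", [])}
    return [name for name in expected if name not in found]


# Sample input shaped like the "Table" entry of a Glue get_table response.
sample_table = {
    "Name": "land_reg_registered_leases_greater_london_2026_04",
    "PartitionKeys": [
        {"Name": name, "Type": "string"} for name in EXPECTED_PARTITIONS
    ],
}

missing = missing_partition_keys(sample_table)
```

With boto3 available, the same function could be fed `boto3.client("glue").get_table(DatabaseName="housing-raw-zone", Name=...)["Table"]` in place of the sample dictionary.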

@Tian-2017 Tian-2017 requested review from a team as code owners May 19, 2026 13:12
@Tian-2017 Tian-2017 merged commit 84543c8 into main May 19, 2026
16 checks passed
@Tian-2017 Tian-2017 deleted the fix-g-drive-ingestion-data-source-and-crawler-level branch May 19, 2026 15:37