Fix Google Drive spreadsheet crawler table grouping by Tian-2017 · Pull Request #2754 · LBHackney-IT/Data-Platform

Tian-2017 · 2026-05-19T13:12:31Z

Why

Google Drive spreadsheet imports write raw data under a dataset-specific path:

<department>/<output_folder>/<data_set_name>/import_year=...

The crawler was scanning the shared <department>/<output_folder> prefix with table level 3. For some imports, Glue grouped the dataset folder as a partition and created a generic table such as g_drive instead of the expected dataset table.

As a result, land_reg_registered_leases_greater_london_2026_04 landed successfully in S3 but did not appear as its own raw-zone Glue table.

Parking imports appeared to work because the parking g-drive folder contains many datasets with different schemas. In that case, Glue could not safely group everything into one shared g_drive table, so it created dataset-specific tables such as hackney_carpark with a warning message "Found schemas don't match at level 3; Created table hackney_carpark". That behaviour is incidental, not guaranteed. The crawler config is still too broad and should be fixed so every Google Drive import reliably crawls its own dataset path.
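
To illustrate the grouping behaviour described above, here is a minimal sketch of how a table level picks out a folder in an S3 path. It assumes Glue counts the bucket as level 1, so the first folder is level 2. The bucket name `example-raw-zone`, the `housing` and `g_drive` folder values, the file name, and the helper `folder_at_table_level` are hypothetical and not taken from this PR.

```python
from urllib.parse import urlparse


def folder_at_table_level(s3_uri: str, table_level: int) -> str:
    """Return the path component that sits at the given table level.

    Assumption: the bucket is level 1 and each folder below it adds one level.
    """
    parsed = urlparse(s3_uri)
    components = [parsed.netloc] + [part for part in parsed.path.split("/") if part]
    if not 1 <= table_level <= len(components):
        raise ValueError(f"table level {table_level} is outside the path depth")
    return components[table_level - 1]


# Hypothetical object location following the layout described above.
example_object = (
    "s3://example-raw-zone/housing/g_drive/"
    "land_reg_registered_leases_greater_london_2026_04/"
    "import_year=2026/data.parquet"
)

print(folder_at_table_level(example_object, 3))  # the shared output folder
print(folder_at_table_level(example_object, 4))  # the dataset folder
```

Under these assumptions, level 3 resolves to the shared output folder, which would explain a generic `g_drive` table, while one level deeper resolves to the dataset folder.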

What Changed

Updated the spreadsheet import crawler target to scan the dataset-specific raw prefix.
Updated the crawler table level to match that path so the dataset folder becomes the table root.
Kept the Glue job output path unchanged.
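
The changes listed above could be sketched roughly as follows. This is not the diff from this PR, which is not reproduced here. It only shows how a dataset-specific target and a matching table level might be derived together. The keys `Targets`, `S3Targets`, `Configuration` and `Grouping.TableLevelConfiguration` mirror the Glue crawler API fields; the function name, bucket name and folder values are hypothetical, and the level calculation assumes the bucket counts as level 1.

```python
import json


def build_crawler_settings(bucket: str, department: str,
                           output_folder: str, data_set_name: str) -> dict:
    """Build crawler target and grouping settings for one dataset path."""
    prefix_parts = [department, output_folder, data_set_name]
    target_path = "s3://" + "/".join([bucket] + prefix_parts) + "/"

    # Assumption: bucket is level 1, so the dataset folder sits at
    # 1 + number of folders in the prefix.
    table_level = 1 + len(prefix_parts)

    configuration = {
        "Version": 1.0,
        "Grouping": {"TableLevelConfiguration": table_level},
    }
    return {
        "Targets": {"S3Targets": [{"Path": target_path}]},
        "Configuration": json.dumps(configuration),
    }


settings = build_crawler_settings(
    bucket="example-raw-zone",  # hypothetical bucket name
    department="housing",
    output_folder="g_drive",
    data_set_name="land_reg_registered_leases_greater_london_2026_04",
)
print(settings["Targets"]["S3Targets"][0]["Path"])
print(settings["Configuration"])
```

Deriving the level from the same prefix parts as the target path keeps the two values from drifting apart if the path layout changes.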

Validation

Manually tested the equivalent crawler config in prod.
Confirmed Glue created housing-raw-zone.land_reg_registered_leases_greater_london_2026_04.
Confirmed the table has the expected import_year, import_month, import_day, and import_date partitions.
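
A partition check like the one above could be scripted along these lines. This is a sketch only: the validation in this PR was manual. The function reads the `Table.PartitionKeys` field of a Glue `get_table` response; the sample response below is hand-written for illustration and is not real output, and the column types in it are assumptions.

```python
EXPECTED_PARTITIONS = ["import_year", "import_month", "import_day", "import_date"]


def missing_partition_keys(get_table_response: dict,
                           expected=EXPECTED_PARTITIONS) -> list:
    """Return the expected partition keys that the table does not have."""
    partition_keys = get_table_response["Table"].get("PartitionKeys", [])
    present = {key["Name"] for key in partition_keys}
    return [name for name in expected if name not in present]


# Hand-written stand-in for a get_table response (illustrative, not real output).
sample_response = {
    "Table": {
        "Name": "land_reg_registered_leases_greater_london_2026_04",
        "PartitionKeys": [
            {"Name": "import_year", "Type": "string"},
            {"Name": "import_month", "Type": "string"},
            {"Name": "import_day", "Type": "string"},
            {"Name": "import_date", "Type": "string"},
        ],
    }
}

print(missing_partition_keys(sample_response))
```

An empty list means every expected partition key is present on the table.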

sonarqubecloud · 2026-05-19T13:13:10Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

Tian-2017 added 2 commits May 19, 2026 13:56

fix g-drive ingestion data source and crawler level

0f905f2

refactor: remove catalog_table local variable and update outputs for consistency

c896438

Tian-2017 requested review from a team as code owners May 19, 2026 13:12

annajgibson approved these changes May 19, 2026

Tian-2017 merged commit 84543c8 into main May 19, 2026
16 checks passed

Tian-2017 deleted the fix-g-drive-ingestion-data-source-and-crawler-level branch May 19, 2026 15:37
