Draft
Conversation
b41a990 to
65c59bf
Compare
- Created 4 additional staging models: * stg__pluto_input_research.sql * stg__pluto_pts.sql * stg__dcp_mappluto.sql * stg__previous_pluto.sql - Fixed remaining non-stg__ references in 13 SQL files - All source tables now consistently use staging models - Total staging models: 40 (up from 36)
- Created 01a_dbt_staging.sh to run dbt staging models - Script runs between data load and legacy SQL build - Materializes 40 staging models before 02_build.sh runs - Added pluto_build/README.md documenting build sequence - Legacy SQL can now reference stg__ tables
- Moved 9 CSV files from pluto_build/data/ to seeds/ - Configured seeds in dbt_project.yml (+quote_columns, +schema: public) - Documented all seeds in seeds/_seeds.yml - Updated 01a_dbt_staging.sh to run 'dbt seed' before staging models - Deleted 01_load_local_csvs.sh (replaced by dbt seed) - Deleted sql/_create.sql (replaced by dbt seed) - Updated README.md with seed documentation - No SQL changes needed - seeds create same table names
- Update GitHub workflow to call 01a_dbt_staging.sh instead of removed 01_load_local_csvs.sh - Remove duplicate dbt seed call from 07_custom_qaqc.sh to avoid reloading seeds - Seeds are now loaded exactly once via 01a_dbt_staging.sh Closes data-engineering-n58.3
- Add --profiles-dir . to all dbt commands in 01a_dbt_staging.sh and 07_custom_qaqc.sh - Move 'cd ..' before dbt deps/debug in 01a_dbt_staging.sh - Fix schema config deprecation in dbt_project.yml (add + prefix to tests.schema) - Ensures dbt uses local profiles.yml in GHA workflows
- Add column_types config for ignored_bbls_for_unit_count_test (bbl, pluto_version as text) - Add column_types config for pluto_input_research (bbl as text) - Add column_types config for pluto_input_condolot_descriptiveattributes (condno, parid as text) - Remove incorrect column_types from zoning_district_class_descriptions - Fixes 'integer out of range' errors when loading BBL values
- Change condno -> CondNO and parid -> PARID to match CSV header - Fixes integer out of range error
- Change seeds schema from 'public' to BUILD_ENGINE_SCHEMA to match build scripts - Update stg__pluto_input_research to reference seed with ref() instead of source() - Ensures build scripts can find seed tables in the correct schema
- Seeds were loading to doubled schema (target_schema + custom_schema) - dbt automatically uses BUILD_ENGINE_SCHEMA from profiles.yml as target - Removing +schema config fixes: ar_dbtify_pluto_staging_models_ar_dbtify_pluto_staging_models -> ar_dbtify_pluto_staging_models - Matches green_fast_track pattern Closes data-engineering-n58.5
…ttributes - Change parid -> PARID (quoted) - Change landsize -> LandSize (quoted) - Change story -> Story (quoted) - Change yearbuilt -> YearBuilt (quoted) Seeds with +quote_columns preserve original CSV case
- Change c.code -> c."Code" (quoted) - Change c.type -> c."Type" (quoted) Completes seed column case audit - all other seeds use lowercase
- Set bsmnt_type, bsmntgradient, bsmtcode as text - Fixes type mismatch error: operator does not exist: text = integer
Changes: - Add dbt run for stg__pluto_input_geocodes at start of 02_build.sh - Update create_rpad_geo.sql to reference stg__pluto_input_geocodes - Update create_cama_primebbl.sql to reference stg__pluto_input_geocodes - Update primebbl.sql to reference stg__pluto_input_geocodes This ensures the staging model is materialized as a table before SQL scripts mutate it (adding columns, renaming bbl). The raw recipe source is a view to another schema, so we need the materialized staging model.
Changes: - Add materialized='table' config to stg__pluto_input_geocodes.sql - Remove duplicate dbt run from 02_build.sh (already runs in 01a_dbt_staging.sh) The staging model must be a table (not view) because create_rpad_geo.sql mutates it (adds columns, renames bbl). It's already built in 01a_dbt_staging.sh so no additional dbt run needed in 02_build.sh.
fdfcd53 to
2d6c7d5
Compare
Add dbt intermediate models for PLUTO simple transformations
- int_pluto__far: Calculate built FAR and lookup max FAR by zoning district
- int_pluto__irrlotcode: Transform irregular lot codes (I->Y)
- int_pluto__sanitation: Extract sanitboro and sanitdistrict
- int_pluto__landuse: Lookup landuse and calculate areasource for vacant lots
- int_pluto__ownertype: Lookup owner type from COLP and calculate exempt properties
All models:
- Output bbl + calculated fields
- Include unique BBL indexes for join performance
- Have schema.yml with unique/not_null tests on bbl
Part of epic de-74o: DBT'ify pluto SQL files
Integrate dbt intermediate models into PLUTO build
Changes:
- Add 'pluto_simple_enrichment' tag to all 5 intermediate models
- Create product/pluto_enriched.sql to assemble enriched data
- Create apply_dbt_enrichments.sql to UPDATE pluto table
- Update 02_build.sh to run dbt models and apply enrichments
- Remove legacy SQL file calls from 02_build.sh
- Delete replaced SQL files: far.sql, irrlotcode.sql, sanitboro.sql, landuse.sql, ownertype.sql
The build process now:
1. Creates base pluto table
2. Runs dbt models tagged with 'pluto_simple_enrichment'
3. Applies enrichments back to pluto table via batch UPDATE
4. Continues with remaining build steps
Closes de-74o.6
Fix sqlfluff violations in dbt models
- Remove redundant ELSE NULL clauses
- Fix table aliasing consistency
- Replace USING with explicit ON clauses
- Fix indentation
All models now pass sqlfluff linting with postgres dialect and jinja templater.
Remove direnv from dbt run command in build script
The build script will rely on environment variables already being set
before execution. Using relative path (cd ..) to run dbt from the
product directory.
Reference pluto table directly instead of as source
Changed all models to reference 'pluto' directly instead of using
{{ source('recipe_sources', 'pluto') }}. This prevents dbt from trying
to validate the source exists during parsing, which would fail since
the pluto table is created by SQL scripts before dbt runs.
Use target.schema to reference pluto table in dbt models
Changed all models to use {{ target.schema }}.pluto instead of just 'pluto'
so dbt knows which schema to look in. The pluto table is created by SQL scripts
in BUILD_ENGINE_SCHEMA, which is the same schema dbt targets.
Fix dbt model errors and pass sqlfluff
Fixes:
- int_pluto__landuse: Remove COALESCE with non-existent p.landuse column
- int_pluto__ownertype: Add DISTINCT ON to deduplicate BBLs from stg__dcp_colp
- Format SELECT targets on separate lines for sqlfluff
All models now build successfully and pass sqlfluff linting.
Add 5 new dbt intermediate models for PLUTO
Models created:
- int_pluto__edesignation: E-designation number (10,925 rows)
- int_pluto__latlong: Latitude/longitude/centroid from coordinates
- int_pluto__condono: Formatted condo numbers (859,212 rows)
- int_pluto__lpc: Historic districts and landmarks (859,212 rows)
- int_pluto__numericfields: Cleaned numeric fields (859,212 rows)
All models:
- Build successfully
- Have unique BBL indexes
- Include schema.yml with tests
- Pass sqlfluff linting
- Tagged with pluto_simple_enrichment
Closes de-74o.7, de-74o.8, de-74o.9, de-74o.10, de-74o.11
Integrate second batch of dbt models into build
Changes:
- Move dbt run command to after geocodes.sql (latlong needs xcoord populated)
- Remove SQL file calls for: edesignation, lpc, numericfields, condono, latlong
- Update pluto_enriched.sql to join 5 new intermediate models
- Update apply_dbt_enrichments.sql to apply 10 new fields
- Delete 5 replaced SQL files
Now running dbt enrichments after geocodes populates coordinates, allowing
latlong transformation to work correctly. All 10 dbt models are integrated.
Add MIH area dbt models (spatial - not yet integrated)
Created 3 models in intermediate/miharea/:
- int_mih__cleaned: Clean MIH option names
- int_mih__lot_overlap: Spatial overlap calculations
- int_pluto__miharea: Pivot to mih_opt1-4 columns
Note: These models require pluto.geom which is added later in the build
(plutogeoms.sql runs at line 54, dbt runs at line 29). These need a
separate 'pluto_spatial_enrichment' tag and must run after geometries
are in place. Deferring full integration for now.
Closes de-74o.12, de-74o.13
Integrate MIH models with pluto_late_stage tag and cleanup
Changes:
- Add 'pluto_late_stage' tag for spatial models that run after geometries
- Add second dbt run in 02_build.sh after latlong.sql (line ~92)
- Update pluto_enriched.sql to join MIH and transitzone models
- Update apply_dbt_enrichments.sql to apply mih_opt1-4, trnstzone fields
- Delete miharea.sql and transitzone.sql from pluto_build/sql/
- Create placeholder int_pluto__transitzone.sql (returns no rows for now)
The pluto_late_stage tag allows spatial models to run after plutogeoms.sql
adds the geom column. MIH models are fully functional. Transitzone needs
full implementation later (see de-74o.13).
Complete transitzone dbt models with pluto_late_stage tag
Created 5 models in intermediate/transitzone/:
- int_tz__atomic_geoms: Decompose multipolygons for performance
- int_tz__tax_blocks: Create sub-blocks from tax blocks
- int_tz__block_to_tz_ranked: Block-level transit zone coverage
- int_tz__bbl_to_tz_ranked: Lot-level coverage for ambiguous blocks
- int_pluto__transitzone: Final BBL->transit zone assignment
Changes:
- Tag all transitzone models with 'pluto_late_stage'
- Reference dcp_transit_zone_ranks seed for priority rankings
- Add seed definition to _seeds.yml
- Update schema files
These models run in the late-stage dbt command after geometries are added.
Closes de-74o.13
Add pluto_late_stage tag to all supporting models
Added tag to:
- int_mih__cleaned
- int_mih__lot_overlap
- int_tz__atomic_geoms
- int_tz__tax_blocks
- int_tz__block_to_tz_ranked
- int_tz__bbl_to_tz_ranked
This allows 'dbt run --select tag:pluto_late_stage' to build all
dependencies in the correct order based on {{ ref() }} relationships.
DBT correctly builds models in dependency order: cleaned and atomic_geoms
first, then dependent models. Errors expected in test env since pluto.geom
doesn't exist yet - will work in actual build after plutogeoms.sql.
Add flood_flag dbt model to late-stage enrichment
Created int_pluto__flood_flag.sql:
- Flags lots intersecting FEMA 1% annual chance floodplains
- firm07_flag: 2007 floodplain
- pfirm15_flag: 2015 preliminary floodplain
- Uses ST_INTERSECTS with ST_SUBDIVIDE for performance
- Tagged with pluto_late_stage (requires pluto.geom)
Changes:
- Remove duplicate latlong.sql call from 02_build.sh (line 86)
- Remove flood_flag.sql call from 02_build.sh
- Delete flood_flag.sql
- Update pluto_enriched.sql to join flood_flag model
- Update apply_dbt_enrichments.sql to apply firm07_flag, pfirm15_flag
- Add schema.yml entry
Total: 19 models converted, 15 SQL files deleted, 34 fields managed by dbt
Fix sqlfluff aliasing violation in apply_dbt_enrichments.sql
Fixed AL01 violation - table aliasing consistency.
Fix pluto_enriched build order and apply enrichments once
Issues fixed:
- pluto_enriched wasn't being built (has tag 'pluto_enrichment' not in selection)
- apply_dbt_enrichments.sql was called twice (after each dbt run)
Solution:
- Explicitly include 'pluto_enriched' in both dbt run commands
- Only call apply_dbt_enrichments.sql ONCE at the end after late-stage models
- pluto_enriched gets rebuilt in both stages, final version has all fields
Build flow:
1. Run simple enrichment models + pluto_enriched (partial)
2. Continue SQL build steps
3. Run late-stage models + pluto_enriched (complete with all fields)
4. Apply ALL enrichments to pluto table in single UPDATE
Consolidate dbt runs to single execution at end of build
Changed from 2 dbt runs to 1:
- Removed dbt run after geocodes.sql (line 31)
- Single dbt run now at line 87: --select tag:pluto_simple_enrichment tag:pluto_late_stage pluto_enriched
- apply_dbt_enrichments.sql runs once after all models built
Benefits:
- Simpler build flow
- Only materializes pluto_enriched once (not twice)
- All enrichments applied in single batch UPDATE
- Easier to understand and maintain
Consolidate to single pluto_enrichment tag
Changed all models from pluto_simple_enrichment/pluto_late_stage to pluto_enrichment:
- 10 simple models
- 3 MIH models
- 5 transit zone models
- 1 flood flag model
Build script now uses: --select tag:pluto_enrichment pluto_enriched
Simpler and clearer since everything runs at the end anyway.
Remove lat/long/centroid from update_empty_coord.sql
The update_empty_coord.sql script only needs to populate xcoord/ycoord
from geometries when they're missing. The latitude, longitude, and centroid
fields are now calculated by int_pluto__latlong dbt model which runs at
the end.
Flow:
1. update_empty_coord.sql fills missing xcoord/ycoord (from geom)
2. int_pluto__latlong transforms all xcoord/ycoord to lat/long/centroid
3. apply_dbt_enrichments.sql updates pluto with those values
Revert latlong to SQL, keep dbt model commented out
Issue: spatialjoins.sql (line 79) needs the centroid column to exist
before dbt runs (line 85). Original latlong.sql created this column.
Solution:
- Restore latlong.sql to create/populate latitude, longitude, centroid columns
- Add latlong.sql call back to 02_build.sh (after update_empty_coord.sql)
- Keep int_pluto__latlong.sql model but comment out in pluto_enriched.sql
- Comment out lat/long/centroid in apply_dbt_enrichments.sql
The dbt model remains for future refactoring when we can better handle
the column creation timing. For now, SQL handles these fields directly.
Fix ST_SUBDIVIDE usage in flood_flag model
Error: set-returning functions not allowed in JOIN conditions
Solution: Move ST_SUBDIVIDE to separate CTEs before joining:
- firm07_subdivided: subdivide geometries first
- firm07_intersections: then join with subdivided geoms
- Same for pfirm15
This matches the pattern used in the original flood_flag.sql.
Optimize flood_flag model by splitting into separate indexed models
Issue: int_pluto__flood_flag was taking >30 minutes due to scanning all
859k BBLs multiple times with complex CTEs.
Solution: Break into 3 models with indexed BBLs:
1. int_flood__firm07_bbls - Only BBLs that intersect 2007 floodplain
2. int_flood__pfirm15_bbls - Only BBLs that intersect 2015 floodplain
3. int_pluto__flood_flag - Simple indexed JOIN of the above
Benefits:
- Each intermediate model uses INNER JOIN (only processes intersecting rows)
- BBL indexes on intermediate models speed up final LEFT JOIN
- Matches original SQL pattern (direct updates, not full table scans)
- dbt can materialize and index intermediate results
This should reduce runtime from >30min to ~1-2 minutes.
Simplify flood_flag to single model, only return flagged BBLs
Issue: Splitting into 3 models caused spatial work to run twice (once per
intermediate model) instead of being optimized by query planner.
Key insight: Original SQL only UPDATED matching rows, didn't scan all 859k BBLs.
Our LEFT JOIN to all pluto BBLs was the problem!
Solution: Single model with all CTEs, but final SELECT only returns BBLs that
have at least one flag (using UNION of firm07 and pfirm15 BBLs).
This matches original SQL behavior:
- Spatial joins only touch intersecting rows
- Query planner can optimize both subdivisions together
- Only ~10-20k BBLs returned (not 859k)
- apply_dbt_enrichments.sql LEFT JOINs to apply flags
Revert flood_flag back to SQL
The dbt model was too slow (>10 minutes) despite multiple optimization
attempts. The issue is fundamental: dbt models need to scan data to build
the enrichment table, whereas the original SQL directly UPDATEs only
matching rows using spatial indexes.
Changes:
- Restore flood_flag.sql to pluto_build/sql/
- Add flood_flag.sql call back to 02_build.sh (line 78)
- Delete int_pluto__flood_flag.sql dbt model
- Remove flood fields from pluto_enriched.sql and apply_dbt_enrichments.sql
The original SQL approach is much faster for sparse spatial intersections
where only a small fraction of BBLs match (~10-20k out of 859k).
Actually create flood_flag.sql file
Previous commit said it restored the file but it wasn't actually created.
Add PLUTO UPDATE statement dependency analysis
- Created dependency graph analysis for remaining SQL files
- Identified 42 UPDATE statements across 23 files
- Grouped updates into 7 migration categories
- No circular dependencies detected - all leaf candidates
- Provides recommended migration order for dbt conversion
Closes de-74o.14
Add execution order and inter-group dependency analysis to PLUTO dependencies
Key findings:
- No circular dependencies (backfill.sql is unused)
- latlong and numericfields must run BEFORE dbt models
- Identified 4 migration waves with clear safety boundaries
- CAMA, Zoning, Classification are safe to migrate first
Add PLUTO data flow analysis and dbt migration strategy
Answers key questions:
- When rows inserted: bbl.sql (line 22 of 02_build.sh)
- Data flow: INSERT → UPDATE (allocated) → UPDATE (geocodes) → many UPDATEs → dbt → apply back
- Recommends progressive migration strategy over big bang rewrite
- Identifies 4 migration waves with clear boundaries
Add comprehensive PLUTO dbt migration strategy
Key insights:
- PLUTO is mutable state table (INSERT + progressive UPDATEs)
- 7 fields updated multiple times (mostly progressive refinement)
- Hybrid SQL/DBT architecture is optimal
- DBT for business logic, SQL for spatial/procedural operations
- Don't dbt'ify initial population - current pattern works
- Focus on migrating 40+ simple UPDATE files to dbt
Add analysis: Should pluto_rpad_geo be a dbt model?
Strong YES recommendation:
- Enables targeted re-running (key requirement)
- Removes 7 SQL UPDATE files by absorbing logic
- Eliminates mutable intermediate state
- Better dependency management via dbt DAG
- 2-3 hour migration effort with low risk
Creates int__dof_pts_propmaster and int__pluto_rpad_geo models
Add missing staging models for transit zones and MIH
- Create stg__dcp_transit_zones.sql staging model
- Create stg__dcp_gis_mandatory_inclusionary_housing.sql staging model
- Update int_tz__atomic_geoms to use staging model (fixes geometry column reference)
- Update int_mih__cleaned to use staging model
- Update qaqc_int__transit_zones_questionable_assignments to use staging model
Fixes GHA build errors where these source tables weren't available in the build schema.
Add session summary for PLUTO dbt migration analysis
Complete analysis session deliverables:
- 4 documentation files (dependencies, data flow, strategy, rpad_geo)
- 2 analysis scripts (reusable)
- Epic updated with end goal (pure dbt project)
- Clear migration path: Wave 0 (rpad_geo) → Waves 1-4 → Phase 5 (pure dbt)
- All questions answered, ready to execute
Add refined migration plan based on requirements
Key decisions:
- Wave 0 (rpad_geo) first - enables targeted re-running (main goal)
- Fast iteration, delete SQL files immediately
- Validate against nightly_qa.pluto (spot checks)
- Performance must be equal or better
- Migrate ALL files including complex spatial
- Solo work with quick per-file workflow
- 60 hours estimated across 4 waves
Update PLUTO migration: rename SQL files instead of deleting, add source comment to models
Add dev mode sampling plan for fast PLUTO iteration
Wave 0: DBT'ify pluto_rpad_geo as intermediate models
- Created models/intermediate/rpad/int__dof_pts_propmaster.sql
- Migrated from create_pts.sql
- Added primebbl logic from primebbl.sql
- Created models/intermediate/rpad/int__pluto_rpad_geo.sql
- Migrated from create_rpad_geo.sql
- Incorporated logic from 5 SQL files:
- zerovacantlots.sql (vacant lot adjustments)
- lotarea.sql (lot area calculations)
- primebbl.sql (prime BBL assignment)
- apdate.sql (date formatting)
- geocode_billingbbl.sql (billing BBL parsing)
- Implemented PLUTO_DEV_MODE for fast iteration (~100 BBLs, 20 per borough)
- Updated 02_build.sh to run dbt models after preprocessing.sql
- Updated 8 downstream SQL files to reference new dbt models:
- address.sql, bbl.sql, bldgclass.sql, cama_bldgarea_1.sql
- cama_easements.sql, create_allocated.sql, geocode_notgeocoded.sql, geocodes.sql
- Renamed 7 legacy SQL files to *_migrated.sql
Closes de-74o.15
migrate CAMA
fix rpad_geo
2d6c7d5 to
bf5f256
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.