
[PULL REQUEST] Add Employment Estimates#192

Open
bryce-sandag wants to merge 24 commits into main from LEHD-Employment

Conversation

@bryce-sandag

Describe this pull request. What changes are being made?

Adds the employment estimates functionality. This is the base for employment estimates built from publicly available data. It starts with LEHD LODES data at the block level by 2-digit NAICS (with NAICS 72 split into NAICS 721 and 722) and scales employment to QCEW county-level controls.
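
As a rough illustration of the final scaling step described above, here is a minimal pandas sketch. It is not the code in python/employment.py: the function name and the column names (block, industry_code, jobs, control) are hypothetical, and it assumes one control value per industry for the county.

```python
import pandas as pd


def control_to_county(jobs: pd.DataFrame, controls: pd.DataFrame) -> pd.DataFrame:
    """Scale block-level jobs so each industry sums to its county control.

    Column names (block, industry_code, jobs, control) are hypothetical.
    """
    # Uncontrolled county total per industry, broadcast back to every row
    county_total = jobs.groupby("industry_code")["jobs"].transform("sum")
    # County control for each row's industry (NaN if the industry has no control)
    control = jobs["industry_code"].map(controls.set_index("industry_code")["control"])
    # Scale factor; industries with zero uncontrolled jobs stay at zero
    factor = (control / county_total).where(county_total > 0, 0.0)
    return jobs.assign(jobs=jobs["jobs"] * factor)


# Toy data, not real LODES or QCEW values
jobs = pd.DataFrame(
    {
        "block": ["A", "B", "A", "B"],
        "industry_code": ["11", "11", "721", "721"],
        "jobs": [10.0, 30.0, 5.0, 15.0],
    }
)
controls = pd.DataFrame({"industry_code": ["11", "721"], "control": [80.0, 10.0]})
scaled = control_to_county(jobs, controls)
```

After scaling, the jobs in each industry sum to that industry's control. An industry with no matching control ends up as NaN in this sketch, which an input validation step would be expected to catch.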

What issues does this pull request address?

close #185
close #186
close #188
close #189

Additional context

Issue #188 was completed by generalizing read_sql_query_acs() to work on more than just ACS data. As a result, the function was renamed to read_sql_query_custom().

Some functionality remains to be added in future issues (yet to be created), such as adding the job categories not covered by LEHD LODES and QCEW.

@bryce-sandag bryce-sandag requested a review from Copilot February 4, 2026 19:16
@bryce-sandag bryce-sandag self-assigned this Feb 4, 2026
@bryce-sandag bryce-sandag added the enhancement New feature or request label Feb 4, 2026
Contributor

Copilot AI left a comment

Pull request overview

This pull request adds employment estimates functionality to the codebase, implementing LEHD LODES data processing at the block level by 2-digit NAICS codes (with special handling to split NAICS 72 into 721 and 722) and scaling to QCEW county level controls.

Changes:

  • Added new employment module with functions to retrieve LODES data, apply NAICS 72 splits, aggregate to MGRA level, and control to QCEW totals
  • Renamed read_sql_query_acs() to read_sql_query_custom() to support querying ACS, LEHD LODES, and EDD point-level data with enhanced year lookback functionality
  • Created SQL queries for employment data retrieval and database schema updates for storing employment inputs/outputs

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 2 comments.

Show a summary per file
| File | Description |
| --- | --- |
| wiki/Utility.md | Updated documentation to reflect renamed function and expanded data source support |
| sql/employment/xref_block_to_mgra.sql | New query to retrieve Census block to MGRA crosswalk |
| sql/employment/get_naics72_split.sql | New query to split NAICS 72 into 721 and 722 using EDD point-level data |
| sql/employment/get_mgra.sql | New query to retrieve distinct MGRAs for a run |
| sql/employment/get_lodes_data.sql | New query to retrieve LEHD LODES employment data by block and industry code |
| sql/employment/QCEW_control.sql | New query to retrieve QCEW county-level employment controls |
| sql/create_objects.sql | Added database tables for employment control inputs and job outputs |
| python/utils.py | Renamed and enhanced query function with improved year lookback logic and added LEHD database engine |
| python/pop_type.py | Updated function call to use renamed utility function |
| python/parsers.py | Added employment module to configuration validation |
| python/hs_hh.py | Updated function call to use renamed utility function |
| python/hh_characteristics.py | Updated function calls to use renamed utility function |
| python/employment.py | New module implementing employment estimates workflow |
| python/ase.py | Updated function calls to use renamed utility function |
| main.py | Integrated employment module into main execution flow |
| config.yml | Added employment flag to debug configuration |

Contributor

@Eric-Liu-SANDAG Eric-Liu-SANDAG left a comment

Make sure all SQL files use only spaces. There's some tabs here and there

I haven't looked at any output yet, is there some [run_id] which has employment numbers yet?

Contributor

  • What does the comment "--Drop Temp table and ok and spatial index if they exist" mean? Also, this doesn't drop any spatial index
  • What's the point of DECLARE @qry NVARCHAR(max)?
  • If you have long math equations, would very much prefer new lines start with an operator
  • I'm confused by the usage of COALESCE. Wouldn't ISNULL be a lot more clear?
  • The column [jobs] is very poorly named. It should be something like [average_monthly_jobs]
  • My preferred formatting for longer CASE statements is:
SELECT
    CASE 
        WHEN LEFT([code], 3) = '721' THEN '721'
        WHEN LEFT([code], 3) = '722' THEN '722'
        ELSE NULL 
    END AS [industry_code],
    ...
FROM ...
  • Shorter case statements can remain like CASE WHEN [emp_m1] IS NOT NULL THEN 1 ELSE 0 END
  • Your final IF/ELSE statement has no worst case scenario for years before 2010
  • Not to be too pedantic, but [pct_721] and [pct_722] are technically proportions and not percentages
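
To make the proportion point concrete, here is a hedged pandas sketch of how [pct_721] and [pct_722] could be applied on the Python side. It assumes one row of proportions per block and that the two values sum to 1. The function name and the block, industry_code, and jobs columns are hypothetical, and the numbers are toy data.

```python
import pandas as pd


def split_naics72(lodes: pd.DataFrame, split: pd.DataFrame) -> pd.DataFrame:
    """Replace NAICS 72 rows with 721 and 722 rows using per-block proportions."""
    # Proportions must sum to 1 (values like 0.25, not percentages like 25)
    if not ((split["pct_721"] + split["pct_722"] - 1).abs() < 1e-9).all():
        raise ValueError("pct_721 + pct_722 must equal 1 for every block")

    columns = ["block", "industry_code", "jobs"]
    is_72 = lodes["industry_code"] == "72"
    # Attach each block's proportions to its NAICS 72 rows
    naics72 = lodes[is_72].merge(split, on="block", how="left")
    parts = [
        naics72.assign(industry_code=code, jobs=naics72["jobs"] * naics72[column])[columns]
        for code, column in (("721", "pct_721"), ("722", "pct_722"))
    ]
    # Blocks with no proportions produce NaN jobs, to be caught by validation
    return pd.concat([lodes.loc[~is_72, columns], *parts], ignore_index=True)


lodes = pd.DataFrame(
    {
        "block": ["A", "A", "B"],
        "industry_code": ["11", "72", "72"],
        "jobs": [10.0, 100.0, 40.0],
    }
)
split = pd.DataFrame({"block": ["A", "B"], "pct_721": [0.25, 0.5], "pct_722": [0.75, 0.5]})
result = split_naics72(lodes, split)
```

Because the proportions sum to 1 in each block, total jobs are conserved by the split.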

Author

  • Fixed the comment; dropping the table also drops the spatial index
  • Removed DECLARE @qry NVARCHAR(max); it was not needed
  • Changed lines to start with an operator where appropriate
  • Changed COALESCE to ISNULL
  • Changed [jobs] to [average_monthly_jobs]
  • Fixed formatting for longer CASE statements
  • Running for a year before 2010 returns the message 'EDD point-level data does not exist'. This is by design, as we don't want to run for any years prior to 2010
  • Keeping the names [pct_721] and [pct_722] as they are

Author

@GregorSchroeder
Added the missing parenthesis in the ELSE IF @year = 2015 section so that the addition and division are evaluated in the intended order

I also tested 2014 and 2016, and both return no data. This may not be a big deal, since the data is pulled in using utils.read_sql_query_fallback, which will just grab data for the previous year, but it may be worth looking into

Contributor

@Eric-Liu-SANDAG is tasked with integrating 2014 into [EMPCORE]
2016 data does not seem to exist so the fallback will revert to 2015
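
The fallback behaviour discussed above can be sketched roughly as follows. This is not the implementation in python/utils.py: fetch_with_year_fallback, fake_fetch, and the AVAILABLE set are hypothetical stand-ins, and the toy year coverage is for illustration only.

```python
from typing import Callable, Tuple

import pandas as pd


def fetch_with_year_fallback(
    fetch: Callable[[int], pd.DataFrame], year: int, min_year: int
) -> Tuple[pd.DataFrame, int]:
    """Return data for `year`, or for the closest earlier year that has rows."""
    for candidate in range(year, min_year - 1, -1):
        data = fetch(candidate)
        if not data.empty:
            return data, candidate
    raise ValueError(f"No data found between {min_year} and {year}")


# Toy stand-in for a SQL query: only these years return rows (not real coverage)
AVAILABLE = {2013, 2015, 2017}


def fake_fetch(year: int) -> pd.DataFrame:
    rows = [{"year": year, "pct_721": 0.2, "pct_722": 0.8}] if year in AVAILABLE else []
    return pd.DataFrame(rows, columns=["year", "pct_721", "pct_722"])


_, used_2016 = fetch_with_year_fallback(fake_fetch, 2016, min_year=2010)
print(used_2016)  # 2015
```

Returning the year that was actually used alongside the data makes it easy to log or store which year a run fell back to.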

@bryce-sandag
Author

> I haven't looked at any output yet, is there some [run_id] which has employment numbers yet?

run_id = 22 has jobs data in the ws database. It was produced after addressing the first two comments from @Eric-Liu-SANDAG

@bryce-sandag
Author

@Eric-Liu-SANDAG @GregorSchroeder
run_id = 39 has jobs data in the ws database. It was produced after addressing the comments from Eric

Contributor

To avoid url changes, download the PDF and add to a new documentation folder in the repo. Probably add a README.md in the new documentation folder which notes the original source, aka the url

Contributor

@Eric-Liu-SANDAG Eric-Liu-SANDAG Feb 5, 2026

Dude GitHub completely ate my comments on this file? I don't see them on my previous request changes...

  • Need to restructure employment.py to match the format of the other modules. For example, see how pop_type.py:run_pop() is structured, with explicit inputs, outputs, validation, and insertion
  • The validation part is super super important. Make sure to run on both inputs and outputs
  • A lot of the processing can be combined via chained operators to remove a ton of the intermediate variables. For example, in aggregate_lodes_to_mgra(), you have the variables lehd_to_mgra, lehd_to_mgra_summed, and final_lehd_to_mgra, which feels excessive
  • As a continuation of the above, I'm not a fan of variable names like jobs_frame (why not just jobs? We already know it's a pd.DataFrame) and final_etc
  • Remove self-explanatory comments like # Add run_id column
  • Be extremely careful with your usages of utils.integerize_1d(). If you look at the other locations where it is used, it is nearly always preceded by some kind of sort. This is because a single row or value being different can completely change the output of utils.integerize_1d() due to its random nature. I would recommend that you sort values beforehand and, additionally, do two consecutive runs back to back and write a script to ensure that outputs are the same between runs. Note that SQL scripts are not guaranteed to return rows in the same order each time unless you use an explicit ORDER BY and ensure that there are no ties
  • Add some comments at the top similar to other modules. See pop_type.py for example
  • Surely there's a better way to cross join in pandas without using that weird key thing
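
A few of the points above (chained operators, sorting before any order-sensitive step, and the cross join) can be sketched in pandas as follows. These are illustrative rewrites, not the functions in python/employment.py. The function names, column names, and toy data are hypothetical, and the sketch assumes each block maps to exactly one MGRA. `merge(how="cross")` requires pandas 1.2 or later.

```python
import pandas as pd


def mgra_industry_grid(mgras: pd.DataFrame, industries: pd.DataFrame) -> pd.DataFrame:
    """Every MGRA paired with every industry code, without a dummy key column."""
    return mgras.merge(industries, how="cross")


def aggregate_to_mgra_sketch(lodes: pd.DataFrame, xref: pd.DataFrame) -> pd.DataFrame:
    """Block-level jobs summed to MGRA in one chain, with a deterministic row order."""
    return (
        lodes.merge(xref, on="block", how="inner")
        .groupby(["mgra", "industry_code"], as_index=False)["jobs"]
        .sum()
        # Explicit sort so any later order-sensitive step sees the same input each run
        .sort_values(["mgra", "industry_code"], ignore_index=True)
    )


mgras = pd.DataFrame({"mgra": [1, 2]})
industries = pd.DataFrame({"industry_code": ["11", "721", "722"]})
lodes = pd.DataFrame(
    {
        "block": ["A", "B", "C", "A"],
        "industry_code": ["11", "11", "721", "722"],
        "jobs": [3.0, 4.0, 5.0, 6.0],
    }
)
xref = pd.DataFrame({"block": ["A", "B", "C"], "mgra": [1, 1, 2]})

grid = mgra_industry_grid(mgras, industries)
first = aggregate_to_mgra_sketch(lodes, xref)
# Simulate SQL returning rows in a different order on a second run
second = aggregate_to_mgra_sketch(lodes.sample(frac=1, random_state=0), xref)
pd.testing.assert_frame_equal(first, second)  # raises if the two runs differ
```

The same `pd.testing.assert_frame_equal` comparison could be applied to the outputs of two consecutive full runs to confirm they match.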


Labels

enhancement New feature or request

Projects

None yet

3 participants