Conversation
This update separates out some of the functionality, then addresses the ability to look back multiple years to grab the most recently available data. It also integrates the ability to split NAICS 72 into NAICS 721 and 722.
Changed the name of the function and updated all the spots where `read_sql_query_acs` was being used.
Pull request overview
This pull request adds employment estimates functionality to the codebase, implementing LEHD LODES data processing at the block level by 2-digit NAICS codes (with special handling to split NAICS 72 into 721 and 722) and scaling to QCEW county level controls.
Changes:
- Added new employment module with functions to retrieve LODES data, apply NAICS 72 splits, aggregate to MGRA level, and control to QCEW totals
- Renamed `read_sql_query_acs()` to `read_sql_query_custom()` to support querying ACS, LEHD LODES, and EDD point-level data with enhanced year lookback functionality
- Created SQL queries for employment data retrieval and database schema updates for storing employment inputs/outputs
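The year-lookback behavior described above can be sketched as follows. The function name, signature, and `params` key here are illustrative assumptions, not the actual `read_sql_query_custom()` API:

```python
import pandas as pd

def read_sql_query_with_lookback(sql, engine, year, max_lookback=5):
    """Sketch of the lookback idea (hypothetical name and signature):
    query the requested year, then step back one year at a time until
    data is found or the lookback window is exhausted."""
    for candidate in range(year, year - max_lookback - 1, -1):
        df = pd.read_sql_query(sql, engine, params={"year": candidate})
        if not df.empty:
            return df, candidate  # also report which year actually had data
    raise ValueError(f"no data within {max_lookback} years of {year}")
```

Returning the year that was actually used makes it easy to log when the fallback kicked in (e.g. a 2016 request being served with 2015 data).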
Reviewed changes
Copilot reviewed 16 out of 16 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| wiki/Utility.md | Updated documentation to reflect renamed function and expanded data source support |
| sql/employment/xref_block_to_mgra.sql | New query to retrieve Census block to MGRA crosswalk |
| sql/employment/get_naics72_split.sql | New query to split NAICS 72 into 721 and 722 using EDD point-level data |
| sql/employment/get_mgra.sql | New query to retrieve distinct MGRAs for a run |
| sql/employment/get_lodes_data.sql | New query to retrieve LEHD LODES employment data by block and industry code |
| sql/employment/QCEW_control.sql | New query to retrieve QCEW county-level employment controls |
| sql/create_objects.sql | Added database tables for employment control inputs and job outputs |
| python/utils.py | Renamed and enhanced query function with improved year lookback logic and added LEHD database engine |
| python/pop_type.py | Updated function call to use renamed utility function |
| python/parsers.py | Added employment module to configuration validation |
| python/hs_hh.py | Updated function call to use renamed utility function |
| python/hh_characteristics.py | Updated function calls to use renamed utility function |
| python/employment.py | New module implementing employment estimates workflow |
| python/ase.py | Updated function calls to use renamed utility function |
| main.py | Integrated employment module into main execution flow |
| config.yml | Added employment flag to debug configuration |
Eric-Liu-SANDAG left a comment:
Make sure all SQL files use only spaces; there are some tabs here and there.
I haven't looked at any output yet. Is there a [run_id] which has employment numbers yet?
- What? `--Drop Temp table and ok and spatial index if they exist`. Also, this doesn't drop any spatial index.
- What's the point of `DECLARE @qry NVARCHAR(max)`?
- If you have long math equations, would very much prefer new lines to start with an operator.
- I'm confused by the usage of `COALESCE`. Wouldn't `ISNULL` be a lot more clear?
- The column `[jobs]` is very poorly named. It should be something like `[average_monthly_jobs]`.
- My preferred formatting for longer `CASE` statements is:

```sql
SELECT
    CASE
        WHEN LEFT([code], 3) = '721' THEN '721'
        WHEN LEFT([code], 3) = '722' THEN '722'
        ELSE NULL
    END AS [industry_code],
    ...
FROM ...
```

- Shorter `CASE` statements can remain like `CASE WHEN [emp_m1] IS NOT NULL THEN 1 ELSE 0 END`.
- Your final `IF`/`ELSE` statement has no worst-case scenario for years before 2010.
- Not to be too pedantic, but `[pct_721]` and `[pct_722]` are technically proportions and not percentages.
- Fixed the comment; the drop table also drops the spatial index.
- Removed `DECLARE @qry NVARCHAR(max)`, it was not needed.
- Made changes where appropriate to start lines with an operator.
- Changed `COALESCE` to `ISNULL`.
- Changed `[jobs]` to `[average_monthly_jobs]`.
- Fixed formatting for longer `CASE` statements.
- Running before 2010 returns the message `'EDD point-level data does not exist'`. This is by design, as we don't want to run for any years prior to 2010.
- Fine as `[pct_721]` and `[pct_722]`.
@GregorSchroeder
Corrected a missing parenthesis in the `ELSE IF @year = 2015` section to do the correct addition and division.
Also tested 2014 and 2016, which return no data. This may not be a big deal since the data is pulled in using `utils.read_sql_query_fallback`, so it will just grab data for the previous year, but it may be worth looking into.
@Eric-Liu-SANDAG is tasked with integrating 2014 into [EMPCORE].
2016 data does not seem to exist, so the fallback will revert to 2015.
To avoid URL changes, download the PDF and add it to a new documentation folder in the repo. Probably add a README.md in the new documentation folder which notes the original source, i.e., the URL.
Dude, GitHub completely ate my comments on this file? I don't see them on my previous request-changes review...
- Need to restructure `employment.py` to match the format of the other modules. For example, see how `pop_type.py:run_pop()` is structured, with explicit inputs, outputs, validation, and insertion.
- The validation part is super, super important. Make sure to run it on both inputs and outputs.
- A lot of the processing can be combined via chained operators to remove a ton of the intermediate variables. For example, in `aggregate_lodes_to_mgra()`, you have the variables `lehd_to_mgra`, `lehd_to_mgra_summed`, and `final_lehd_to_mgra`, which feels excessive.
- As a continuation of the above, not a fan of variable names like `jobs_frame` (why not just `jobs`? we already know it's a `pd.DataFrame`) and `final_etc`.
- Remove self-explanatory comments like `# Add run_id column`.
- Be extremely careful with your usages of `utils.integerize_1d()`. If you look at other locations where it is used, it is nearly always preceded by some kind of sort. This is because a single row or value being different can completely change the output of `utils.integerize_1d()` due to its random nature. I would recommend that you sort values beforehand and, additionally, do two consecutive runs back to back and write a script to ensure that outputs are the same between runs. Note, SQL scripts are not guaranteed to output rows in the same order each time, unless you do an explicit `ORDER BY` and ensure that there are no ties.
- Add some comments at the top similar to other modules. See `pop_type.py` for example.
- Surely there's a better way to cross join in pandas without using that weird `key` thing.
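On the cross-join point above: since pandas 1.2, `merge` supports `how="cross"` directly, so the temporary constant `key` column is no longer needed. A minimal sketch with made-up frames:

```python
import pandas as pd

mgra = pd.DataFrame({"mgra": [1, 2]})
industry = pd.DataFrame({"industry_code": ["721", "722"]})

# Old pattern: add a constant key, merge on it, then drop it
old = (
    mgra.assign(key=1)
    .merge(industry.assign(key=1), on="key")
    .drop(columns="key")
)

# pandas >= 1.2: a true cross join, no helper column needed
new = mgra.merge(industry, how="cross")

assert old.equals(new)  # same 4-row Cartesian product
```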
Describe this pull request. What changes are being made?
Add the employment estimates functionality. This is the base for employment estimates using publicly available data. It starts with LEHD LODES data at the block level by 2-digit NAICS, except that NAICS 72 is split into NAICS 721 and 722, and scales the employment to QCEW county-level controls.
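The NAICS 72 split can be sketched in pandas as below. The frames and column names (`jobs_72`, `pct_721`, `pct_722`) are illustrative assumptions, not the actual columns used in `employment.py` or `get_naics72_split.sql`:

```python
import pandas as pd

# Hypothetical block-level LODES jobs for NAICS 72, plus EDD-derived
# 721/722 proportions per block (names are illustrative)
lodes = pd.DataFrame({"block": ["A", "B"], "jobs_72": [100.0, 50.0]})
splits = pd.DataFrame(
    {"block": ["A", "B"], "pct_721": [0.6, 0.2], "pct_722": [0.4, 0.8]}
)

# Apportion each block's NAICS 72 jobs by the proportions, then
# reshape to one row per (block, industry_code)
df = lodes.merge(splits, on="block")
df["721"] = df["jobs_72"] * df["pct_721"]
df["722"] = df["jobs_72"] * df["pct_722"]
jobs = df.melt(
    id_vars="block",
    value_vars=["721", "722"],
    var_name="industry_code",
    value_name="jobs",
)
```

Since `pct_721` and `pct_722` are proportions that sum to one within a block, total jobs are preserved by the split.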
What issues does this pull request address?
close #185
close #186
close #188
close #189
Additional context
Issue #188 was completed by changing the functionality of `read_sql_query_acs()` to work on more than just ACS data, which resulted in renaming the function to `read_sql_query_custom()`. There is still some functionality to be added in future issues, such as adding missing job categories not covered by LEHD LODES and QCEW.
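The control step described in the PR (scaling employment to QCEW county-level totals, then rounding back to whole jobs) can be sketched as below. This is a generic largest-remainder rounding on deterministically sorted data, not the repo's actual `utils.integerize_1d()` implementation, which the review notes is random and therefore needs a sort beforehand:

```python
import numpy as np
import pandas as pd

def scale_to_control(jobs: pd.Series, control: int) -> pd.Series:
    """Scale job counts so they sum to the county control, then round
    back to integers with largest-remainder rounding. Illustrative
    sketch only; the repo uses utils.integerize_1d() instead."""
    scaled = jobs * control / jobs.sum()
    floors = np.floor(scaled).astype(int)
    remainder = int(control - floors.sum())
    # Deterministic tie handling: stable sort by fractional part, so the
    # same sorted input always yields the same integerized output
    order = (scaled - floors).sort_values(ascending=False, kind="stable").index
    floors.loc[order[:remainder]] += 1
    return floors

mgra_jobs = pd.Series([10.0, 20.0, 30.0], index=["m1", "m2", "m3"])
controlled = scale_to_control(mgra_jobs, 100)
```

Running this twice on identically sorted input gives identical results, which is the reproducibility property the review asks to verify with back-to-back runs.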