generated from SANDAG/Estimates-Forecasts-Template
[PULL REQUEST] Add Employment Estimates #192
Status: Open. bryce-sandag wants to merge 24 commits into `main` from `LEHD-Employment`.
Commits (24, all by bryce-sandag):

- a0c011d: #185 initial commit of work done
- c66e9b8: #189 - block level naics72 split query
- 2c28142: #185 separate lodes data pull from join to mgra
- 86019c7: #185 #188 #189 Update logic for employment estimates
- dec3eaf: #185 change "jobs" in output to "value" and fix year left in SQL query
- 3243c07: #185 #186 Create output/input table in SQL and add ability to output …
- b7e8062: #188 change function read_sql_query_acs to read_sql_query_custom
- 8410963: #188 update wiki for change in read_sql_query_acs to read_sql_query_c…
- 3be99f5: #185 reset config
- a2b6275: #185 reset config
- 3788834: #185 cleanup in a few spots
- 6015a5a: #185 remove output folder used during testing
- 0cc89b9: #185 Change connection when grabbing mgras
- 827d066: #185 #188 addressed first 2 comments from @Eric-Liu-SANDAG in pull re…
- b8dfdb6: #185 address pull request feedback for get_lodes_data.sql
- f20e4c6: #185 Update utils.py and utility.md for update to read_sql_query_fall…
- 26a00d6: #185 #189 addressed pull request feedback and added parenthesis for c…
- 5f376d5: #185 remove get_mgra.sql and use as string directly
- 824b8ef: #185 fix using only spaces vs tabs in sql files
- 223fcc4: #185 remove year left in query from testing
- 46cda0e: #185 fix table being called for [inputs].[mgra]
- f7a109b: #185 #188 update utility.md and utils.py
- b8bb79c: #185 #189 Better format based on feedback and fix year set when was d…
- 73bca20: #185 better formatting to match rest of estimates program
New file (+252 lines); the review comments identify it as `employment.py`:

```python
import numpy as np
import pandas as pd
import sqlalchemy as sql

import python.utils as utils

generator = np.random.default_rng(utils.RANDOM_SEED)


def run_employment(year: int):
    """Control function to create jobs data by industry_code (NAICS) at the MGRA level.

    Get the LEHD LODES data, aggregate to the MGRA level using the block to MGRA
    crosswalk, then apply control totals from QCEW using integerization.

    Functionality is split apart for code encapsulation (function inputs not included):
        get_LODES_data - Get LEHD LODES data for a specified year, including
            special handling for industry_code 72 (Accommodation and Food Services)
        xref_block_to_mgra - Get crosswalk from Census blocks to MGRAs
        aggregate_lodes_to_mgra - Aggregate LODES data to MGRA level using allocation
            percentages from the block to MGRA crosswalk
        get_control_totals - Load QCEW employment data as county total controls
        apply_employment_controls - Apply control totals to employment data using
            utils.integerize_1d()
        _insert_jobs - Store both the control totals and controlled employment
            inputs/outputs to the production database

    Args:
        year (int): estimates year
    """
    # Check MGRA version and raise error if not 'mgra15'
    if utils.MGRA_VERSION != "mgra15":
        raise ValueError(
            f"Employment module only works with MGRA_VERSION = 'mgra15'. "
            f"Current MGRA_VERSION is '{utils.MGRA_VERSION}'."
        )

    jobs_inputs = _get_jobs_inputs(year)
    # TODO _validate_jobs_inputs here before proceeding

    jobs_outputs = _create_jobs_output(jobs_inputs)
    # TODO _validate_jobs_outputs here before proceeding

    _insert_jobs(jobs_inputs, jobs_outputs)


def get_LODES_data(year: int) -> pd.DataFrame:
    """Retrieve LEHD LODES data for a specified year.

    Args:
        year (int): The year for which to retrieve LEHD LODES data.
    """
    with utils.LEHD_ENGINE.connect() as con:
        with open(utils.SQL_FOLDER / "employment/get_lodes_data.sql") as file:
            lodes_data = utils.read_sql_query_fallback(
                max_lookback=2,
                sql=sql.text(file.read()),
                con=con,
                params={"year": year},
            )

    with utils.GIS_ENGINE.connect() as con:
        with open(utils.SQL_FOLDER / "employment/get_naics72_split.sql") as file:
            split_naics_72 = utils.read_sql_query_fallback(
                max_lookback=3,
                sql=sql.text(file.read()),
                con=con,
                params={"year": year},
            )

    # Separate industry_code 72 from other industries
    lodes_72 = lodes_data[lodes_data["industry_code"] == "72"].copy()
    lodes_other = lodes_data[lodes_data["industry_code"] != "72"].copy()

    # Join industry_code 72 data with split percentages
    lodes_72_split = lodes_72.merge(split_naics_72, on="block", how="left")

    # Create rows for industry_code 721
    lodes_721 = lodes_72_split[["year", "block"]].copy()
    lodes_721["industry_code"] = "721"
    lodes_721["jobs"] = lodes_72_split["jobs"] * lodes_72_split["pct_721"]

    # Create rows for industry_code 722
    lodes_722 = lodes_72_split[["year", "block"]].copy()
    lodes_722["industry_code"] = "722"
    lodes_722["jobs"] = lodes_72_split["jobs"] * lodes_72_split["pct_722"]

    # Combine all data
    combined_data = pd.concat([lodes_other, lodes_721, lodes_722], ignore_index=True)
    combined_data = combined_data[["year", "block", "industry_code", "jobs"]]

    return combined_data


def aggregate_lodes_to_mgra(
    combined_data: pd.DataFrame, xref: pd.DataFrame, year: int
) -> pd.DataFrame:
    """Aggregate LODES data to MGRA level using allocation percentages.

    Args:
        combined_data (pd.DataFrame): LODES data with columns: year, block, industry_code, jobs
        xref (pd.DataFrame): Crosswalk with columns: block, mgra, allocation_pct
        year (int): The year for which to aggregate data

    Returns:
        pd.DataFrame: Aggregated data at MGRA level with columns: year, mgra, industry_code, jobs
    """
    # Get MGRA data from SQL
    with utils.ESTIMATES_ENGINE.connect() as con:
        mgra_data = pd.read_sql_query(
            sql=sql.text(
                """
                SELECT DISTINCT [mgra]
                FROM [inputs].[mgra]
                WHERE run_id = :run_id
                ORDER BY [mgra]
                """
            ),
            con=con,
            params={"run_id": utils.RUN_ID},
        )

    # Get unique industry codes and cross join with MGRA data
    unique_industries = combined_data["industry_code"].unique()
    jobs = mgra_data.merge(
        pd.DataFrame({"industry_code": unique_industries}), how="cross"
    )
    jobs["year"] = year
    jobs = jobs[["year", "mgra", "industry_code"]]

    # Join combined_data to xref and calculate allocated jobs
    lehd_to_mgra = combined_data.merge(xref, on="block", how="inner")
    lehd_to_mgra["value"] = lehd_to_mgra["jobs"] * lehd_to_mgra["allocation_pct"]

    # Join summed data to jobs, keeping all MGRAs and industry codes
    jobs = jobs.merge(
        lehd_to_mgra.groupby(["year", "mgra", "industry_code"], as_index=False)[
            "value"
        ].sum(),
        on=["year", "mgra", "industry_code"],
        how="left",
    )
    jobs["value"] = jobs["value"].fillna(0)
    jobs["run_id"] = utils.RUN_ID
    jobs = jobs[["run_id", "year", "mgra", "industry_code", "value"]]

    return jobs


def _get_jobs_inputs(year: int) -> dict[str, pd.DataFrame]:
    """Get input data related to jobs for a specified year.

    Args:
        year (int): The year for which to retrieve input data.

    Returns:
        dict[str, pd.DataFrame]: A dictionary containing input DataFrames related to jobs.
    """
    # Store results here
    jobs_inputs = {}

    jobs_inputs["LODES_data"] = get_LODES_data(year)

    with utils.LEHD_ENGINE.connect() as con:
        # Get crosswalk from Census blocks to MGRAs
        with open(utils.SQL_FOLDER / "employment/xref_block_to_mgra.sql") as file:
            jobs_inputs["xref_block_to_mgra"] = pd.read_sql_query(
                sql=sql.text(file.read()),
                con=con,
                params={"mgra_version": utils.MGRA_VERSION},
            )

        # Get regional employment control totals from QCEW
        with open(utils.SQL_FOLDER / "employment/QCEW_control.sql") as file:
            jobs_inputs["control_totals"] = utils.read_sql_query_fallback(
                sql=sql.text(file.read()),
                con=con,
                params={"year": year},
            )
        jobs_inputs["control_totals"]["run_id"] = utils.RUN_ID

    jobs_inputs["lehd_jobs"] = aggregate_lodes_to_mgra(
        jobs_inputs["LODES_data"], jobs_inputs["xref_block_to_mgra"], year
    )

    return jobs_inputs


def _create_jobs_output(
    jobs_inputs: dict[str, pd.DataFrame],
) -> dict[str, pd.DataFrame]:
    """Apply control totals to employment data using utils.integerize_1d().

    Args:
        jobs_inputs (dict[str, pd.DataFrame]): Input DataFrames, including
            "lehd_jobs" (LEHD LODES data at the MGRA level) and
            "control_totals" (employment control totals from QCEW).

    Returns:
        dict[str, pd.DataFrame]: Controlled employment data.
    """
    jobs_outputs = {}
    # Create a copy of the input data for controlled results
    jobs_outputs["results"] = jobs_inputs["lehd_jobs"].copy()

    # Get unique industry codes
    industry_codes = jobs_inputs["lehd_jobs"]["industry_code"].unique()

    # Apply integerize_1d to each industry code
    for industry_code in industry_codes:
        # Filter original data for this industry
        industry_mask = jobs_inputs["lehd_jobs"]["industry_code"] == industry_code

        # Get control value for this industry
        control_value = jobs_inputs["control_totals"][
            jobs_inputs["control_totals"]["industry_code"] == industry_code
        ]["value"].iloc[0]

        # Apply integerize_1d and update the controlled results
        jobs_outputs["results"].loc[industry_mask, "value"] = utils.integerize_1d(
            data=jobs_inputs["lehd_jobs"].loc[industry_mask, "value"],
            control=control_value,
            methodology="weighted_random",
            generator=generator,
        )

    return jobs_outputs


def _insert_jobs(
    jobs_inputs: dict[str, pd.DataFrame], jobs_outputs: dict[str, pd.DataFrame]
) -> None:
    """Insert input and output data related to jobs to the database."""
    with utils.ESTIMATES_ENGINE.connect() as con:
        jobs_inputs["control_totals"].to_sql(
            name="controls_jobs",
            con=con,
            schema="inputs",
            if_exists="append",
            index=False,
        )

        jobs_outputs["results"].to_sql(
            name="jobs", con=con, schema="outputs", if_exists="append", index=False
        )
```
Dude, GitHub completely ate my comments on this file? I don't see them on my previous "request changes" review...

- `employment.py`: match the format of the other modules. For example, see how `pop_type.py:run_pop()` is structured, with explicit inputs, outputs, validation, and insertion
- `aggregate_lodes_to_mgra()`: you have the variables `lehd_to_mgra`, `lehd_to_mgra_summed`, and `final_lehd_to_mgra`, which feels excessive
- `jobs_frame` (why not just `jobs`, we already know it's a `pd.DataFrame`) and `final_` etc.
- `# Add run_id column`
- `utils.integerize_1d()`: if you look at other locations it is used, it is nearly always preceded by some kind of sort. This is because a single row or value being different can completely change the output of `utils.integerize_1d()` due to its random nature. I would recommend that you sort values beforehand and, additionally, do two consecutive runs back to back and write a script to ensure that outputs are the same between runs. Note, the output of SQL scripts is not guaranteed to come back in the same order each time, unless you do an explicit `ORDER BY` and ensure that there are no ties
- See `pop_type.py` for example
- `key` thing
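The reviewer's reproducibility point can be demonstrated with a small stand-in. `utils.integerize_1d()` is internal to this repo, so the sketch below uses a hypothetical `weighted_random_integerize` (floor each value, then hand out the remaining total at random, weighted by fractional remainders); the names, values, and helper are illustrative only. With a sorted, tie-free input order and a fixed seed, two consecutive runs match:

```python
import numpy as np
import pandas as pd

def weighted_random_integerize(
    values: pd.Series, control: int, rng: np.random.Generator
) -> pd.Series:
    """Floor each value, then distribute the remaining total one unit at a time,
    weighted by each row's fractional remainder (stand-in for utils.integerize_1d)."""
    floored = np.floor(values).astype(int)
    remainder = int(control) - int(floored.sum())
    frac = values - floored
    if remainder > 0 and frac.sum() > 0:
        picks = rng.choice(
            values.index.to_numpy(),
            size=remainder,
            replace=False,
            p=(frac / frac.sum()).to_numpy(),
        )
        floored.loc[picks] += 1
    return floored

# MGRA-level fractional jobs, keyed by a made-up mgra id
jobs = pd.Series([3.4, 2.2, 1.9, 4.5], index=[10, 3, 7, 1])

# Sorting first pins down the row order, so the same seed gives the same draw;
# with an unstable order (e.g. SQL output without ORDER BY), results can differ.
run1 = weighted_random_integerize(jobs.sort_index(), control=12, rng=np.random.default_rng(42))
run2 = weighted_random_integerize(jobs.sort_index(), control=12, rng=np.random.default_rng(42))
assert run1.equals(run2)
assert run1.sum() == 12
```

This is the back-to-back comparison the reviewer suggests scripting: the integerized values are random within an industry, but the run is repeatable as long as the input order and seed are fixed, and every run hits the control total exactly.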