A command-line tool to extract structured JSON from a Form 700 PDF report from the California Fair Political Practices Commission database.
Note: This tool is currently a prototype and may not be further developed.
Usage: fppc700-extract [OPTIONS] PATH
Options:
--silent Do not print anything to the screen
--help Show this message and exit.
The following command extracts the data from an FPPC Form 700 named "Legislator_2025_Annual.pdf" into a JSON file named "Legislator_2025_Annual.json" in the same directory as the PDF:
fppc700-extract Legislator_2025_Annual.pdf
The package is also a Python library, so it can be called from Python code as well as from the command line. After adding the package as a dependency of your project, call the extract_form_700 function, which returns the JSON representation:
from fppc700extract import extract_form_700
pdf_path = "PUT_YOUR_PATH_HERE"
silent = True
extracted = extract_form_700(pdf_path, silent)
You can install the CLI tool from this GitHub repository using pip or uv:
pip install git+https://github.com/CalMatters/fppc700-extract.git
uv pip install "git+https://github.com/CalMatters/fppc700-extract.git"
We made this tool because we needed it! We're sharing it publicly in case other folks need it or have ideas for improvement.
We've extracted data from the Form 700 documents filed by the entire Legislature over the past few years (filings regarding 2022, 2023, 2024, and 2025), which was a relatively time-consuming process. However, starting in 2025, AB1170 required legislators to submit their reports electronically, which means that all the documents have exactly the same layout.
The test_fppc700extract.py file tests the output of the main extract_form_700 function. It runs the function on a fixture PDF and compares the returned data to a JSON fixture, locking the behavior in place.
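The fixture comparison described above can be sketched roughly as follows. This is not the project's actual test: `fake_extract` is a hypothetical stand-in for extract_form_700 (so the sketch runs without a PDF), `matches_fixture` is an illustrative helper name, and the sketch assumes the extracted data is a JSON-compatible Python object.

```python
import json
import tempfile
from pathlib import Path


def fake_extract(pdf_path):
    """Hypothetical stand-in for extract_form_700 so the sketch runs on its own."""
    return {"source": Path(pdf_path).name, "schedules": []}


def matches_fixture(actual, fixture_path):
    """Compare extracted data against the JSON stored in a fixture file."""
    expected = json.loads(Path(fixture_path).read_text(encoding="utf-8"))
    return actual == expected


# Write a throwaway JSON fixture, then compare the "extracted" data to it.
with tempfile.TemporaryDirectory() as tmp:
    fixture = Path(tmp) / "expected.json"
    fixture.write_text(
        json.dumps({"source": "sample.pdf", "schedules": []}),
        encoding="utf-8",
    )
    ok = matches_fixture(fake_extract("fixtures/sample.pdf"), fixture)
```

In a real pytest test the final comparison would be an `assert`, so any change in the extractor's output for the fixture PDF fails the run.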
Run test(s):
uv run pytest
This tool uses pdfplumber to extract well-formed data from California Form 700 documents. It OCRs the data out of the PDF and shapes it to match Disclosure Disco's data model, leaving behind a JSON file of the shaped data.
The main logic of the app relies on the JSON files in /layouts. Each file is an array of objects, and each object describes the bounding box of the desired text content (coordinates) and what to call that data (name). Each name must be unique within its file.
Each layout has a bounding box for the schedule title at the top of every page except the first, which is always a cover page.
Here's an example from Schedule A1, which documents stock investments:
{
  "name": "investment-1-1-fmv-1m-plus",
  "coordinates": [
    [
      160, // left
      188 // top
    ],
    [
      165, // right
      194 // bottom
    ]
  ],
  "checkbox": true // optional
}
The script goes through the PDF page by page. For each layout entry it crops the page to the bounding box given by coordinates (each pair is [x, y]; the first pair is the top-left corner and the second pair is the bottom-right corner), OCRs the cropped page snippet, and associates that data with the name value. The checkbox key is optional; if it is true, the value associated with that name is a boolean based on whether a checkbox is detected.
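As a rough illustration of the cropping step, the sketch below flattens one layout entry into the `(x0, top, x1, bottom)` tuple that pdfplumber's `Page.crop` accepts. The helper name `entry_to_bbox` is hypothetical and not part of this package, and the commented-out lines assume a PDF path you supply and an arbitrary page index.

```python
import json

# One layout entry shaped like the Schedule A1 example
# (the // comments are removed so the text is valid JSON).
ENTRY_JSON = """
{
  "name": "investment-1-1-fmv-1m-plus",
  "coordinates": [[160, 188], [165, 194]],
  "checkbox": true
}
"""


def entry_to_bbox(entry):
    """Flatten [[left, top], [right, bottom]] into (x0, top, x1, bottom)."""
    (left, top), (right, bottom) = entry["coordinates"]
    if right <= left or bottom <= top:
        raise ValueError(f"empty bounding box for {entry['name']!r}")
    return (left, top, right, bottom)


entry = json.loads(ENTRY_JSON)
bbox = entry_to_bbox(entry)
is_checkbox = entry.get("checkbox", False)  # the key is optional

# With pdfplumber installed, the crop itself could look like this:
#   import pdfplumber
#   with pdfplumber.open("PUT_YOUR_PATH_HERE") as pdf:
#       text = pdf.pages[1].crop(bbox).extract_text()
```

How the tool decides whether a checkbox is marked is not described here, so the sketch only reads the flag.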
The data extracted from each page is passed to a parsing function chosen according to the page's schedule, which transforms the data into the model expected by Disclosure Disco. For example, a Schedule D page is passed to parse_schedule_d_gifts() for transformation.
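One common way to route pages to per-schedule parsers is a lookup table, sketched below. This is an assumption about structure, not the tool's actual code: the `PARSERS` registry, the `parse_page` helper, the body of `parse_schedule_d_gifts`, and the `gift-1-source` field name are all placeholders for illustration.

```python
def parse_schedule_d_gifts(page_data):
    """Stand-in for the real parser; the actual transformation is not shown here."""
    return {"schedule": "D", "fields": dict(page_data)}


# Hypothetical registry: schedule identifier -> parsing function.
PARSERS = {"D": parse_schedule_d_gifts}


def parse_page(schedule, page_data):
    """Route one page of extracted name/value pairs to its schedule parser."""
    try:
        parser = PARSERS[schedule]
    except KeyError:
        raise ValueError(f"no parser registered for schedule {schedule!r}") from None
    return parser(page_data)


result = parse_page("D", {"gift-1-source": "Example Donor"})
```

A table like this keeps each schedule's transformation in its own function and makes an unsupported schedule fail loudly rather than silently producing empty data.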
Once all of the pages of document formId.pdf have been parsed, the data is sent to Disclosure Disco as well as written to a file named formId.json.
Reports for filing year 2024 and later are required to be filed electronically, so the layout should be stable. Running this tool on reports from years before 2024 might yield inaccurate results.
Adjusting the data contained in the /layouts/*.json files can change the output of the script. You can use the web app in /layout-editor to visually debug and adjust a layout file. The editor doesn't open or save files, so copy and paste the JSON from the file into the editor and back.
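After hand-editing a layout, a quick sanity check can catch the two mistakes the format makes easy: duplicate names and boxes whose corners are swapped. The sketch below is a hypothetical helper (`check_layout` is not part of this package) and assumes only the name and coordinates keys described earlier.

```python
import json
from collections import Counter


def check_layout(entries):
    """Return a list of human-readable problems found in a layout array."""
    problems = []
    # Names must be unique within a layout file.
    counts = Counter(e.get("name") for e in entries)
    for name, n in counts.items():
        if name is None:
            problems.append(f"{n} entries missing a name")
        elif n > 1:
            problems.append(f"duplicate name: {name}")
    # Each box needs a top-left pair followed by a bottom-right pair.
    for e in entries:
        try:
            (left, top), (right, bottom) = e.get("coordinates")
        except (TypeError, ValueError):
            problems.append(f"{e.get('name')}: coordinates must be two [x, y] pairs")
            continue
        if right <= left or bottom <= top:
            problems.append(f"{e.get('name')}: bottom-right is not below and right of top-left")
    return problems


layout = json.loads(
    '[{"name": "a", "coordinates": [[160, 188], [165, 194]]},'
    ' {"name": "a", "coordinates": [[10, 20], [5, 30]]}]'
)
problems = check_layout(layout)
```

To check a real file, load it with `json.load` and pass the resulting list to `check_layout`; an empty list means neither problem was found.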
If you end up using this tool, please get in touch and share your use case with us by opening an issue or sending an email to jeremia@calmatters.org.