fppc700-extract

A command-line tool to extract structured JSON from a Form 700 PDF report from the California Fair Political Practices Commission database.

Note: This tool is currently a prototype and may not be further developed.

Usage: fppc700-extract [OPTIONS] PATH

Options:
  --silent  Do not print anything to the screen
  --help    Show this message and exit.

Example

The following command extracts the data from an FPPC Form 700 named "Legislator_2025_Annual.pdf" into a JSON file named "Legislator_2025_Annual.json" in the same directory as the PDF:

fppc700-extract Legislator_2025_Annual.pdf

The package is also a Python library, so you can call it from Python code instead of going through the command-line interface. After adding the package as a dependency of your project, call the extract_form_700 function, which returns the JSON representation:

from fppc700extract import extract_form_700

pdf_path = "PUT_YOUR_PATH_HERE"
silent = True
extracted = extract_form_700(pdf_path, silent)

Installation

You can install the CLI tool from this GitHub repository using pip or uv:

pip install git+https://github.com/CalMatters/fppc700-extract.git
uv pip install "git+https://github.com/CalMatters/fppc700-extract.git"

Motivation

We made this tool because we needed it! And we're sharing it publicly in case other folks need it or have ideas for improvement.

We've extracted data from the Form 700 documents filed by the entire Legislature over the past few years (filings regarding 2022, 2023, 2024, and 2025), which was a relatively time-consuming process. However, starting in 2025, AB1170 required legislators to submit their reports electronically, which means that all the documents have exactly the same layout.

Tests

The test_fppc700extract.py file tests the output of the main extract_form_700 function. It runs the function on a fixture PDF and compares the returned data to a JSON fixture, locking the behavior in place.

Run test(s):

uv run pytest

How it works

This tool uses pdfplumber to extract well-formed data from California Form 700 documents. It OCRs the data out of the PDF and shapes it to match Disclosure Disco's data model, leaving behind a JSON file of the shaped data.

The main logic of the app is driven by the JSON files in /layouts. Each file is an array of objects, and each object describes the bounding box of the desired text content (coordinates) and what to call that data (name). Each name must be unique within its file.

Each layout includes a bounding box for the schedule title, which appears at the top of every page except the first, which is always a cover page.

Here's an example from Schedule A1, which documents stock investments:

{
  "name": "investment-1-1-fmv-1m-plus",
  "coordinates": [
    [
      160,          // left
      188           // top
    ],
    [
      165,          // right
      194           // bottom
    ]
  ],
  "checkbox": true  // optional
}
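Because each name must be unique within a layout file, a small check can catch mistakes after hand-editing. The following is a minimal sketch, not part of the tool: the function name find_layout_problems and the sample layout text are hypothetical, and it assumes the files on disk are plain JSON without the // annotations shown in the example.

```python
import json
from collections import Counter


def find_layout_problems(entries):
    """Return a list of human-readable problems found in a layout array."""
    problems = []
    # Every name must be unique within one layout file.
    counts = Counter(entry.get("name") for entry in entries)
    for name, count in counts.items():
        if count > 1:
            problems.append(f"duplicate name: {name!r} appears {count} times")
    for entry in entries:
        name = entry.get("name")
        coords = entry.get("coordinates")
        # Expect [[left, top], [right, bottom]].
        if (
            not isinstance(coords, list)
            or len(coords) != 2
            or any(not isinstance(p, list) or len(p) != 2 for p in coords)
        ):
            problems.append(f"{name!r}: coordinates must be two [x, y] pairs")
            continue
        (left, top), (right, bottom) = coords
        if left >= right or top >= bottom:
            problems.append(f"{name!r}: box has no area")
    return problems


# Hypothetical layout content with a deliberately repeated name.
layout_text = """
[
  {"name": "investment-1-1-fmv-1m-plus", "coordinates": [[160, 188], [165, 194]], "checkbox": true},
  {"name": "investment-1-1-fmv-1m-plus", "coordinates": [[170, 188], [175, 194]], "checkbox": true}
]
"""
problems = find_layout_problems(json.loads(layout_text))
print(problems)
```

An empty list means the layout passed both checks.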

The script goes through the PDF page by page. For each layout entry, it crops the page to the bounding box given by coordinates (each pair is [x, y]; the first pair is the top-left corner and the second pair is the bottom-right corner), OCRs the cropped snippet, and associates the resulting data with the name value. The checkbox key is optional; if it is true, the value associated with that name is a boolean based on whether a checkbox is detected.
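The crop step can be sketched as follows. This is an illustration under assumptions rather than the tool's actual code: to_bbox and read_entry are hypothetical names, page is assumed to be a pdfplumber Page (whose crop method takes an (x0, top, x1, bottom) tuple), and checkbox detection is reduced to "is there any text in the box" as a placeholder.

```python
def to_bbox(coordinates):
    """Flatten [[left, top], [right, bottom]] into (x0, top, x1, bottom)."""
    (left, top), (right, bottom) = coordinates
    return (left, top, right, bottom)


def read_entry(page, entry):
    """Crop `page` to the entry's box and return (name, value).

    `page` is assumed to be a pdfplumber Page. The checkbox branch is a
    placeholder: it only reports whether any text was found in the box.
    """
    cropped = page.crop(to_bbox(entry["coordinates"]))
    text = (cropped.extract_text() or "").strip()
    if entry.get("checkbox"):
        return entry["name"], bool(text)
    return entry["name"], text


bbox = to_bbox([[160, 188], [165, 194]])
```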

Each page's extracted data is passed to a parsing function chosen by the page's schedule, which transforms the data into the model expected by Disclosure Disco. For example, a Schedule D page is passed to parse_schedule_d_gifts() for transformation.
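One common way to route pages to parsers is a lookup table keyed by schedule. The sketch below assumes that approach; only the name parse_schedule_d_gifts comes from this README, while its body, parse_schedule_a1_investments, PARSERS, parse_page, and the field names are hypothetical stand-ins.

```python
def parse_schedule_d_gifts(fields):
    # Stand-in body; the real parser shapes the data for Disclosure Disco.
    return {"schedule": "D", "fields": fields}


def parse_schedule_a1_investments(fields):  # hypothetical name
    return {"schedule": "A1", "fields": fields}


# Map a schedule identifier to the function that handles it.
PARSERS = {
    "A1": parse_schedule_a1_investments,
    "D": parse_schedule_d_gifts,
}


def parse_page(schedule, fields):
    """Route one page's extracted fields to the parser for its schedule."""
    try:
        parser = PARSERS[schedule]
    except KeyError:
        raise ValueError(f"no parser registered for schedule {schedule!r}") from None
    return parser(fields)


result = parse_page("D", {"gift-1-source": "Example Donor"})
```

Raising on an unknown schedule makes a missing parser visible instead of silently dropping a page.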

Once all of the pages of document formId.pdf have been parsed, the data is sent to Disclosure Disco as well as written to a file named formId.json.
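Deriving the output file name from the PDF name can be done with pathlib. This is a minimal sketch of that naming rule only; write_json_beside_pdf and the sample data are hypothetical, and the tool's own writing code may differ.

```python
import json
import tempfile
from pathlib import Path


def write_json_beside_pdf(pdf_path, data):
    """Write `data` to a .json file next to the PDF and return its path."""
    json_path = Path(pdf_path).with_suffix(".json")
    json_path.write_text(json.dumps(data, indent=2), encoding="utf-8")
    return json_path


# Use a temporary directory so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    out = write_json_beside_pdf(Path(tmp) / "formId.pdf", {"filer": "example"})
    written_name = out.name
    round_trip = json.loads(out.read_text(encoding="utf-8"))
```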

Reports for filing year 2024 and later are required to be filed electronically, so the layout should be stable. Running this tool on reports from years before 2024 might yield inaccurate results.

Adjusting JSON /layouts files

Adjusting the data contained in the /layouts/*.json files can change the output of the script. You can use the web app in /layout-editor to visually debug and adjust a layout file. It doesn't open or save files, so copy and paste the JSON from the file into the editor and back.

Please let us know if you use this tool!

If you end up using this tool, please get in touch and share your use case with us by opening an issue or sending an email to jeremia@calmatters.org.
