161 changes: 152 additions & 9 deletions docs/llms-full.txt
@@ -14714,6 +14714,94 @@ Execute a YAML-based validation workflow.
pipeline or version control system, allowing you to maintain validation rules alongside your
code.

### Governance Metadata

YAML workflows support governance metadata via `owner`, `consumers`, and `version` top-level
keys. These are forwarded to the `Validate` constructor and embedded in the validation report:

```python
import pointblank as pb

yaml_config = '''
tbl: small_table
tbl_name: sales_pipeline
owner: Data Engineering
consumers: [Analytics, Finance, Compliance]
version: "2.1.0"
steps:
- col_vals_not_null:
    columns: [a, b]
'''

result = pb.yaml_interrogate(yaml_config)
print(f"Owner: {result.owner}")
print(f"Consumers: {result.consumers}")
print(f"Version: {result.version}")
```

### Aggregate Validations

YAML supports aggregate validation methods for checking column-level statistics. These methods
validate that a column's sum, average, or standard deviation meets a threshold:

```python
import pointblank as pb

yaml_config = '''
tbl: small_table
steps:
- col_sum_gt:
    columns: [d]
    value: 0
- col_avg_le:
    columns: [a]
    value: 10
'''

result = pb.yaml_interrogate(yaml_config)
result
```

The 15 available aggregate methods follow the pattern `col_{stat}_{comparator}` where
`{stat}` is `sum`, `avg`, or `sd` and `{comparator}` is `gt`, `lt`, `ge`,
`le`, or `eq`.
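
The naming pattern can be enumerated mechanically. The following is a small sketch that uses only the standard library to build the method names from the statistic and comparator tokens listed above; it does not call Pointblank itself.

```python
from itertools import product

# Statistic and comparator tokens as listed in the text above
stats = ["sum", "avg", "sd"]
comparators = ["gt", "lt", "ge", "le", "eq"]

# Build every `col_{stat}_{comparator}` name
aggregate_methods = [
    f"col_{stat}_{comparator}" for stat, comparator in product(stats, comparators)
]

print(len(aggregate_methods))  # 15
print(aggregate_methods[:5])
```

Three statistics times five comparators gives the 15 names, which matches the count stated above.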

### Data Freshness

Check that a date/datetime column has recent data using `data_freshness`:

```yaml
tbl: events.csv
steps:
- data_freshness:
    columns: event_date
    freshness: "24h"
```
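
Conceptually, a freshness check compares the newest value in the column against the current time. The sketch below illustrates that idea in plain Python. The `parse_age()` and `is_fresh()` helpers are hypothetical and not part of Pointblank, and the parsing assumes only hour (`h`) and day (`d`) suffixes in strings such as `"24h"`.

```python
from datetime import datetime, timedelta


def parse_age(spec: str) -> timedelta:
    """Parse a hypothetical age string such as "24h" or "7d"."""
    units = {"h": "hours", "d": "days"}
    number, unit = spec[:-1], spec[-1]
    if unit not in units:
        raise ValueError(f"unsupported unit in {spec!r}")
    return timedelta(**{units[unit]: int(number)})


def is_fresh(values: list[datetime], max_age: str, now: datetime) -> bool:
    """Return True if the newest value is no older than `max_age`."""
    return now - max(values) <= parse_age(max_age)


# A fixed "now" keeps the example reproducible
now = datetime(2024, 1, 2, 12, 0)
event_dates = [datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 2, 8, 30)]

print(is_fresh(event_dates, "24h", now))  # True: the newest row is 3.5 hours old
```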

### Active Parameter Shortcut

The `active=` parameter controls whether a validation step runs. It supports boolean values
and Python expression shortcuts:

```yaml
steps:
- col_vals_gt:
    columns: [d]
    value: 100
    active: false  # Skip this step

- col_vals_not_null:
    columns: [a]
    active: true  # Always run (default)
```

### Null Percentage Check

Use `col_pct_null` to validate that the percentage of null values in a column is within bounds:

```yaml
steps:
- col_pct_null:
    columns: [a, b]
    value: 0.05
```
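
For intuition, the quantity being checked is the fraction of missing entries in a column. The sketch below computes it in plain Python with a hypothetical `null_fraction()` helper; how Pointblank compares this number against the configured `value` is not shown here and is not assumed.

```python
def null_fraction(values: list) -> float:
    """Return the share of entries that are None (0.0 for an empty list)."""
    if not values:
        return 0.0
    return sum(value is None for value in values) / len(values)


# 20 values, one of them missing
column_a = [1, None] + list(range(18))

print(null_fraction(column_a))  # 0.05
```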

### Using `set_tbl=` to Override the Table

The `set_tbl=` parameter allows you to override the table specified in the YAML configuration.
@@ -14903,6 +14991,39 @@ Validate YAML configuration against the expected structure.
source ('tbl') exists or is accessible. Data source validation occurs during execution with
`yaml_interrogate()`.

Supported Top-level Keys
------------------------
The following top-level keys are recognized in the YAML configuration:

- `tbl`: data source specification (required)
- `steps`: list of validation steps (required)
- `tbl_name`: human-readable table name
- `label`: validation description
- `df_library`: DataFrame library (`"polars"`, `"pandas"`, `"duckdb"`)
- `lang`: language code
- `locale`: locale setting
- `brief`: global brief template
- `thresholds`: global failure thresholds
- `actions`: global failure actions
- `final_actions`: actions triggered after all steps complete
- `owner`: data owner (governance metadata)
- `consumers`: data consumers (governance metadata)
- `version`: validation version string (governance metadata)
- `reference`: reference table for comparison-based validations

Unknown top-level keys are rejected, which catches typos like `tbl_nmae` or `step`.
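
The idea of rejecting unknown keys can be sketched with the standard library alone. The `check_top_level_keys()` helper below is hypothetical and is not Pointblank's implementation; the key list is copied from the section above, and the "did you mean" hint is an illustrative addition.

```python
import difflib

# Top-level keys as listed in the section above
KNOWN_KEYS = {
    "tbl", "steps", "tbl_name", "label", "df_library", "lang", "locale",
    "brief", "thresholds", "actions", "final_actions", "owner", "consumers",
    "version", "reference",
}


def check_top_level_keys(config: dict) -> None:
    """Raise ValueError for the first key that is not in KNOWN_KEYS."""
    for key in config:
        if key in KNOWN_KEYS:
            continue
        # Suggest the closest known key, if any is similar enough
        suggestion = difflib.get_close_matches(key, sorted(KNOWN_KEYS), n=1)
        hint = f" Did you mean '{suggestion[0]}'?" if suggestion else ""
        raise ValueError(f"Unknown top-level key '{key}'.{hint}")


try:
    check_top_level_keys({"tbl": "small_table", "tbl_nmae": "sales", "steps": []})
except ValueError as err:
    print(err)
```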

Supported Validation Methods
----------------------------
In addition to all standard validation methods (e.g., `col_vals_gt`, `rows_distinct`,
`col_schema_match`), the following methods are also supported:

- `col_pct_null`: check the percentage of null values in a column
- `data_freshness`: check that data is recent
- aggregate methods: `col_sum_gt`, `col_sum_lt`, `col_sum_ge`, `col_sum_le`,
`col_sum_eq`, `col_avg_gt`, `col_avg_lt`, `col_avg_ge`, `col_avg_le`,
`col_avg_eq`, `col_sd_gt`, `col_sd_lt`, `col_sd_ge`, `col_sd_le`, `col_sd_eq`

See Also
--------
yaml_interrogate : execute YAML-based validation workflows
@@ -14993,6 +15114,28 @@ Convert YAML validation configuration to equivalent Python code.
The generated code includes all configuration parameters, thresholds, and maintains the exact
same validation logic as the original YAML workflow.

Governance metadata (`owner`, `consumers`, `version`) and `reference` are also rendered
in the generated Python code:

```python
import pointblank as pb

yaml_config = '''
tbl: small_table
tbl_name: Sales Pipeline
owner: Data Engineering
consumers: [Analytics, Finance]
version: "2.1.0"
steps:
- col_vals_not_null:
    columns: [a]
- col_sum_gt:
    columns: [d]
    value: 0
'''

python_code = pb.yaml_to_python(yaml_config)
print(python_code)
```

This function is also useful for educational purposes, helping users understand how YAML
configurations map to the underlying Python API calls.

@@ -15844,31 +15987,31 @@ generate_dataset(schema: 'Schema', n: 'int' = 100, seed: 'int | None' = None, ou

Supported Countries
-------------------
-The `country=` parameter currently supports 71 countries with full locale data:
+The `country=` parameter currently supports 75 countries with full locale data:

-**Europe (32 countries):** Austria (`"AT"`), Belgium (`"BE"`), Bulgaria (`"BG"`),
+**Europe (33 countries):** Austria (`"AT"`), Belgium (`"BE"`), Bulgaria (`"BG"`),
 Croatia (`"HR"`), Cyprus (`"CY"`), Czech Republic (`"CZ"`), Denmark (`"DK"`),
 Estonia (`"EE"`), Finland (`"FI"`), France (`"FR"`), Germany (`"DE"`), Greece (`"GR"`),
 Hungary (`"HU"`), Iceland (`"IS"`), Ireland (`"IE"`), Italy (`"IT"`), Latvia (`"LV"`),
 Lithuania (`"LT"`), Luxembourg (`"LU"`), Malta (`"MT"`), Netherlands (`"NL"`),
 Norway (`"NO"`), Poland (`"PL"`), Portugal (`"PT"`), Romania (`"RO"`), Russia (`"RU"`),
 Slovakia (`"SK"`), Slovenia (`"SI"`), Spain (`"ES"`), Sweden (`"SE"`),
-Switzerland (`"CH"`), United Kingdom (`"GB"`)
+Switzerland (`"CH"`), Ukraine (`"UA"`), United Kingdom (`"GB"`)

-**Americas (9 countries):** Argentina (`"AR"`), Brazil (`"BR"`), Canada (`"CA"`),
-Chile (`"CL"`), Colombia (`"CO"`), Costa Rica (`"CR"`), Mexico (`"MX"`),
-Peru (`"PE"`), United States (`"US"`)
+**Americas (11 countries):** Argentina (`"AR"`), Brazil (`"BR"`), Canada (`"CA"`),
+Chile (`"CL"`), Colombia (`"CO"`), Costa Rica (`"CR"`), Ecuador (`"EC"`),
+Mexico (`"MX"`), Panama (`"PA"`), Peru (`"PE"`), United States (`"US"`)

 **Asia-Pacific (17 countries):** Australia (`"AU"`), Bangladesh (`"BD"`),
 China (`"CN"`), Hong Kong (`"HK"`), India (`"IN"`), Indonesia (`"ID"`),
 Japan (`"JP"`), Malaysia (`"MY"`), New Zealand (`"NZ"`), Pakistan (`"PK"`),
 Philippines (`"PH"`), Singapore (`"SG"`), South Korea (`"KR"`),
 Sri Lanka (`"LK"`), Taiwan (`"TW"`), Thailand (`"TH"`), Vietnam (`"VN"`)

-**Middle East & Africa (13 countries):** Algeria (`"DZ"`), Egypt (`"EG"`),
+**Middle East & Africa (14 countries):** Algeria (`"DZ"`), Egypt (`"EG"`),
 Ethiopia (`"ET"`), Ghana (`"GH"`), Kenya (`"KE"`), Morocco (`"MA"`),
-Nigeria (`"NG"`), Senegal (`"SN"`), South Africa (`"ZA"`), Tunisia (`"TN"`),
-Turkey (`"TR"`), Uganda (`"UG"`), United Arab Emirates (`"AE"`)
+Nigeria (`"NG"`), Saudi Arabia (`"SA"`), Senegal (`"SN"`), South Africa (`"ZA"`),
+Tunisia (`"TN"`), Turkey (`"TR"`), Uganda (`"UG"`), United Arab Emirates (`"AE"`)

Pytest Fixture
--------------
16 changes: 8 additions & 8 deletions docs/user-guide/test-data-generation.qmd
@@ -486,24 +486,24 @@ Letters I, O, Q, and U are excluded from plate generation, matching real-world r

### Supported Countries

-Pointblank currently supports 71 countries with full locale data for realistic test data generation.
+Pointblank currently supports 75 countries with full locale data for realistic test data generation.
 You can use either ISO 3166-1 alpha-2 codes (e.g., `"US"`) or alpha-3 codes (e.g., `"USA"`).

-**Europe (32 countries):**
+**Europe (33 countries):**

-- Austria (`AT`), Belgium (`BE`), Bulgaria (`BG`), Croatia (`HR`), Cyprus (`CY`), Czech Republic (`CZ`), Denmark (`DK`), Estonia (`EE`), Finland (`FI`), France (`FR`), Germany (`DE`), Greece (`GR`), Hungary (`HU`), Iceland (`IS`), Ireland (`IE`), Italy (`IT`), Latvia (`LV`), Lithuania (`LT`), Luxembourg (`LU`), Malta (`MT`), Netherlands (`NL`), Norway (`NO`), Poland (`PL`), Portugal (`PT`), Romania (`RO`), Russia (`RU`), Slovakia (`SK`), Slovenia (`SI`), Spain (`ES`), Sweden (`SE`), Switzerland (`CH`), United Kingdom (`GB`)
+- Austria (`AT`), Belgium (`BE`), Bulgaria (`BG`), Croatia (`HR`), Cyprus (`CY`), Czech Republic (`CZ`), Denmark (`DK`), Estonia (`EE`), Finland (`FI`), France (`FR`), Germany (`DE`), Greece (`GR`), Hungary (`HU`), Iceland (`IS`), Ireland (`IE`), Italy (`IT`), Latvia (`LV`), Lithuania (`LT`), Luxembourg (`LU`), Malta (`MT`), Netherlands (`NL`), Norway (`NO`), Poland (`PL`), Portugal (`PT`), Romania (`RO`), Russia (`RU`), Slovakia (`SK`), Slovenia (`SI`), Spain (`ES`), Sweden (`SE`), Switzerland (`CH`), Ukraine (`UA`), United Kingdom (`GB`)

-**Americas (9 countries):**
+**Americas (11 countries):**

-- Argentina (`AR`), Brazil (`BR`), Canada (`CA`), Chile (`CL`), Colombia (`CO`), Costa Rica (`CR`), Mexico (`MX`), Peru (`PE`), United States (`US`)
+- Argentina (`AR`), Brazil (`BR`), Canada (`CA`), Chile (`CL`), Colombia (`CO`), Costa Rica (`CR`), Ecuador (`EC`), Mexico (`MX`), Panama (`PA`), Peru (`PE`), United States (`US`)

 **Asia-Pacific (17 countries):**

 - Australia (`AU`), Bangladesh (`BD`), China (`CN`), Hong Kong (`HK`), India (`IN`), Indonesia (`ID`), Japan (`JP`), Malaysia (`MY`), New Zealand (`NZ`), Pakistan (`PK`), Philippines (`PH`), Singapore (`SG`), South Korea (`KR`), Sri Lanka (`LK`), Taiwan (`TW`), Thailand (`TH`), Vietnam (`VN`)

-**Middle East & Africa (13 countries):**
+**Middle East & Africa (14 countries):**

-- Algeria (`DZ`), Egypt (`EG`), Ethiopia (`ET`), Ghana (`GH`), Kenya (`KE`), Morocco (`MA`), Nigeria (`NG`), Senegal (`SN`), South Africa (`ZA`), Tunisia (`TN`), Turkey (`TR`), Uganda (`UG`), United Arab Emirates (`AE`)
+- Algeria (`DZ`), Egypt (`EG`), Ethiopia (`ET`), Ghana (`GH`), Kenya (`KE`), Morocco (`MA`), Nigeria (`NG`), Saudi Arabia (`SA`), Senegal (`SN`), South Africa (`ZA`), Tunisia (`TN`), Turkey (`TR`), Uganda (`UG`), United Arab Emirates (`AE`)

Additional countries and expanded coverage are planned for future releases.
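
As a quick cross-check, the regional counts in the updated headings of this section can be summed to confirm the stated total. The numbers below are copied from those headings; nothing here queries Pointblank.

```python
# Region sizes taken from the updated headings in this section
region_counts = {
    "Europe": 33,
    "Americas": 11,
    "Asia-Pacific": 17,
    "Middle East & Africa": 14,
}

total = sum(region_counts.values())
print(total)  # 75
```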

@@ -835,7 +835,7 @@ By incorporating test data generation into your process, you can:

 - quickly prototype validation rules before working with production data
 - create reproducible test fixtures for automated testing and CI/CD pipelines
-- generate locale-specific data for internationalization testing across 71 countries
+- generate locale-specific data for internationalization testing across 75 countries
 - ensure coherent relationships between related fields like names, emails, addresses, jobs, and
 license plates
 - produce datasets of any size with consistent, realistic values
4 changes: 4 additions & 0 deletions pointblank/countries/__init__.py
@@ -251,6 +251,7 @@
"DE", # Germany
"DK", # Denmark
"DZ", # Algeria
"EC", # Ecuador
"EE", # Estonia
"EG", # Egypt
"ES", # Spain
@@ -283,13 +284,15 @@
"NL", # Netherlands
"NO", # Norway
"NZ", # New Zealand
"PA", # Panama
"PE", # Peru
"PH", # Philippines
"PK", # Pakistan
"PL", # Poland
"PT", # Portugal
"RO", # Romania
"RU", # Russia
"SA", # Saudi Arabia
"SE", # Sweden
"SG", # Singapore
"SI", # Slovenia
@@ -299,6 +302,7 @@
"TN", # Tunisia
"TR", # Turkey
"TW", # Taiwan
"UA", # Ukraine
"UG", # Uganda
"US", # United States
"VN", # Vietnam