
[Python] Support tree-model TsFiles in TsFileDataFrame #816

Open
Young-Leo wants to merge 7 commits into apache:develop from Young-Leo:tsdf-tree

Conversation

@Young-Leo
Contributor

Summary

This PR teaches the Python TsFileDataFrame to read tree-model TsFiles
(in addition to the existing table-model support) and, in the same series of
commits, slims the underlying dataset index so it scales to wide / sparse
schemas without paying for phantom (device, field) cells.

The user-facing dataset surface (__len__, list_timeseries, __getitem__,
.loc, aligned reads) is unchanged for both models — tree-model files
just become loadable through the same API.

What changed

Tree-model support

  • Detect the model kind at reader open. An empty table-schema map ⇒ tree
    model; otherwise table model.
  • For tree files, synthesize one virtual TableEntry:
    • table name = the shared root segment of every device path
    • tag columns = _col_1 .. _col_N (one positional column per remaining
      path segment, padded with None for shorter devices)
    • fields = union of measurements across all devices
  • Per-device measurement ownership is preserved by registering only the
    (device_id, field_idx) pairs that are actually written on disk in
    series_stats_by_ref, so the dataset never advertises phantom series.
  • Tree-model reads route through query_table_on_tree with client-side
    device filtering. This works around two cwrapper limitations on the
    current native build (query_tree_by_row rejects multi-segment device
    paths, and successive query_table_on_tree calls leak duplicate col_*
    columns); both are documented inline at the cwrapper boundary.
  • Tree-mode rendering: drop the leading table column, use _col_i
    headers, print None tag cells as "None", and surface the model
    marker in the repr header.
  • Mixing table-model and tree-model files in one load set is rejected
    with a clear error.
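The virtual-table synthesis described above can be sketched in plain Python. This is a minimal stand-in, not the PR's actual code: the function name `synthesize_virtual_table` and the input shape (device path → measurement list) are assumptions for illustration.

```python
from typing import Dict, List, Optional, Tuple

def synthesize_virtual_table(
    device_measurements: Dict[str, List[str]],
) -> Tuple[str, List[List[Optional[str]]], List[str]]:
    """Build the single virtual TableEntry for a tree-model file.

    Returns (table_name, tag_rows, fields):
      - table_name: the shared root segment of every device path
      - tag_rows: per device, the _col_1.._col_N values (remaining path
        segments, padded with None for shorter devices)
      - fields: union of measurements across all devices
    """
    split = {dev: dev.split(".") for dev in device_measurements}
    roots = {segs[0] for segs in split.values()}
    if len(roots) != 1:
        raise ValueError("devices do not share a single root segment")
    table_name = roots.pop()
    # One positional tag column per path segment beyond the root.
    depth = max(len(segs) - 1 for segs in split.values())
    tag_rows = [
        segs[1:] + [None] * (depth - (len(segs) - 1))
        for segs in split.values()
    ]
    fields = sorted({m for ms in device_measurements.values() for m in ms})
    return table_name, tag_rows, fields
```

A device two segments deep and a device one segment deep end up in the same table, with the shorter one's trailing tag cells set to None, matching the padding rule in the bullet list.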

Dataset index slim-down

  • SeriesStats becomes a NamedTuple (~120 B vs. the previous ~360 B
    per-series dict).
  • _DerivedCache removed; lookups computed lazily on top of existing
    state.
  • Per-reader device_refs: List[List[DeviceRef]] collapsed into a
    pre-aggregated device_time_bounds:
    List[Tuple[Optional[int], Optional[int]]], so _query_aligned reads
    bounds in O(1).
  • Drop the redundant series_ref_set (use series_ref_map keys).
  • Phase 6: unify table/tree semantics so the table-model branch no
    longer pads series_stats_by_ref with empty placeholders for
    schema-declared but never-written cells. The dataset view is now
    strictly “real devices × real fields” in both models.
  • Cleanup: rename _LogicalIndex to _DataFrameCatalog, shorten the five
    internal field names (devices / device_index / device_time_bounds
    / series / series_shards), inline the now-trivial
    iter_owned_series_refs wrapper.
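The NamedTuple swap is the largest single win in the list above. A minimal sketch of the idea (the field names here are illustrative, not the PR's actual SeriesStats schema): a NamedTuple instance is a flat tuple of payload slots, while the equivalent per-series dict carries its own hash table and key references.

```python
import sys
from typing import NamedTuple

class SeriesStats(NamedTuple):
    """Slim per-series index entry; field names are hypothetical."""
    row_count: int
    start_time: int
    end_time: int
    chunk_count: int

stats = SeriesStats(row_count=100, start_time=0, end_time=99, chunk_count=3)

# The dict equivalent of the same entry, as the old index stored it.
as_dict = stats._asdict()

# The tuple's shallow size is a fraction of the dict's, and the gap
# compounds across tens of thousands of series.
print(sys.getsizeof(stats), sys.getsizeof(dict(as_dict)))
```

Attribute access (`stats.row_count`) stays as readable as the dict lookup, which is why get_series_info_by_ref can keep exposing the old dict shape on top of it.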

New public API

  • TsFileDataFrame.model — read-only model marker ("table" or "tree").
  • TsFileDataFrame.list_timeseries_metadata() — per-series metadata as a
    flat tabular view (works identically for both models).

Compatibility

  • No changes to the existing dataset surface. Existing user code that
    loads table-model TsFiles continues to work without modification.
  • No changes to the on-disk format, the cwrapper, or the C++/Java sides.
  • SeriesStats integer fields tighten from Optional[int] to int. The
    surrounding get_series_info_by_ref still exposes them as the existing
    dict shape, so callers do not see an API change.

Memory impact

Two benchmarks were used because the savings land in different shapes of
schema.

Bench A — 30 k devices × 1 field, full density

Step                                 Tracked size   Δ
baseline                             81.20 MB       —
SeriesStats NamedTuple               70.67 MB       −10.53 MB
_DerivedCache removal                59.82 MB       −10.85 MB
device_time_bounds aggregation       56.40 MB       −3.42 MB
series_ref_set removal               54.40 MB       −2.00 MB
drop phantom (device, field) cells   54.40 MB       0

End-to-end: 81.20 → 54.40 MB (−33 %). Dropping phantom cells saves
nothing here because every device writes its single declared field;
there are no skipped cells to prune.

Bench B — 5 k devices × 5 fields × 60 % density (sparse / wide)

Component             Before     After
series_ref_map        15.93 MB   9.55 MB
series_stats_by_ref   7.53 MB    4.51 MB
tracked total         26.50 MB   16.30 MB

Dropping phantom cells alone yields −38 % on this fixture; the
sparser and wider the schema, the larger the win.

Testing

  • python -m pytest python/tests/test_tsfile_dataset.py → 41 / 41 pass.
  • Four new tree-model tests cover metadata + repr layout, single-series
    and multi-field aligned reads, list_timeseries_metadata column
    shape, and the mixed-model load rejection.
  • One new sparse-schema test (test_dataset_omits_table_model_phantom_series_for_skipped_cells)
    proves Tablet-skipped cells stay out of list_timeseries, __len__,
    series_ref_map, and that tsdf[skipped_path] raises KeyError.

Young-Leo added 7 commits May 13, 2026 22:08
Detect model kind at reader open, synthesize a virtual TableEntry for
tree files (root segment as table name, _col_1.._col_N for positional
path-depth headers, union of all device measurements as fields), and
preserve per-device measurement ownership via series_stats_by_ref so
the dataset layer never registers phantom (device, field) pairs.

The dataset surface (__len__, list_timeseries, __getitem__, .loc) is
unchanged for both models. New public API: read-only df.model and
df.list_timeseries_metadata(). Mixing table and tree files in one
load set is rejected with a clear error.

Tree-model reads route through query_table_on_tree with client-side
device filtering, working around two cwrapper bugs (multi-segment
path rejection and duplicate col_* leak across queries).

Tree-mode rendering omits the leading "table" column, uses _col_i
headers, prints None tag cells as "None"; the repr header now carries
the model marker.

Tests: 4 new tree-model tests cover metadata + repr layout, single-
series and aligned reads, list_timeseries_metadata column shape, and
mixed-model load rejection. All 40 tests pass.
Phase 1a-5 of the Python-side memory optimization for TsFileDataFrame
indexes at 30k-device scale: 81.20 MB -> 48.90 MB (-32.30 MB, -39.8%).

- metadata: introduce SeriesStats NamedTuple + empty_series_stats()
  singleton; replace per-series 6-key dict (~360 B) with NamedTuple
  (~120 B). [Phase 2: -10.53 MB]
- dataframe: drop the _DerivedCache class entirely; inline its three
  members (catalog/order/path lookups) into _LogicalIndex methods
  computed lazily from existing data. [Phase 4: -10.85 MB]
- dataframe: replace _LogicalIndex.device_refs (List[List[DeviceRef]])
  with device_bounds (List[Tuple[Optional[int], Optional[int]]])
  aggregated at register time so _query_aligned reads bounds in O(1)
  without holding per-reader DeviceRef tuples. [Phase 5: -3.42 MB]
- dataframe: drop redundant series_ref_set; use series_ref_map keys for
  membership checks. [Phase 1a: -2.00 MB]
- reader: get_series_info_by_ref reads SeriesStats attributes and exposes
  them as the existing dict shape (no API change for callers).
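The Phase 5 change above can be sketched as a fold at register time: instead of retaining every per-reader DeviceRef, each device slot keeps only a running (min start, max end). The function name and list-of-tuples layout here are illustrative assumptions, not the PR's code.

```python
from typing import List, Optional, Tuple

Bounds = Tuple[Optional[int], Optional[int]]

def register_device_ref(
    bounds: List[Bounds], device_idx: int, start: int, end: int
) -> None:
    """Fold one DeviceRef's time range into pre-aggregated bounds.

    Each device slot holds only the running (min start, max end), so a
    later aligned query can read the bounds in O(1) without scanning
    per-reader DeviceRef lists.
    """
    # Grow the list so device_idx is addressable; untouched slots stay
    # (None, None), meaning "no data registered yet".
    while len(bounds) <= device_idx:
        bounds.append((None, None))
    lo, hi = bounds[device_idx]
    bounds[device_idx] = (
        start if lo is None else min(lo, start),
        end if hi is None else max(hi, end),
    )
```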

40/40 dataset tests still pass. See notes/TsFileDataFrame实施总结.md
section 6 for per-phase measurement breakdown.
Unify table/tree-model semantics so the dataset surface only carries
real series. Previously the table-model branch padded series_stats_by_ref
with empty_series_stats() placeholders for every (device, field) cell
declared in the schema but never written, producing a Cartesian-product
index that grew linearly with schema width even when most cells were
empty.

After this change, both models populate series_stats_by_ref only for
cells with statistic.row_count > 0. The dataset view becomes 'real
devices x real fields', matching the principle that 'as many devices
(and series) exist as were physically written'.

Changes:
- metadata: SeriesStats fields tighten from Optional[int] to int;
  delete _EMPTY_SERIES_STATS constant and empty_series_stats() helper;
  drop unused Optional import.
- reader: _metadata_field_stats now filters on statistic.has_statistic
  + row_count > 0 (was timeline_statistic); placeholder branch in
  _cache_metadata_table_model removed; iter_owned_series_refs docstring
  rewritten as a single unified description for both models.
- tests: new test_dataset_omits_table_model_phantom_series_for_skipped_cells
  proves Tablet-skipped cells stay out of list_timeseries / __len__ /
  series_ref_map and tsdf[skipped_path] raises KeyError.
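The pruning rule behind this commit reduces to one filter: register a (device, field) cell only when its statistic has rows. A minimal sketch, with an assumed input shape (written row counts keyed by cell) standing in for the real chunk statistics:

```python
from typing import Dict, Iterable, Tuple

Cell = Tuple[str, str]  # (device, field)

def index_real_cells(
    schema_fields: Iterable[str],
    devices: Iterable[str],
    written: Dict[Cell, int],  # row_count per cell actually on disk
) -> Dict[Cell, int]:
    """Keep only cells that were physically written.

    The old table-model branch padded every schema-declared cell with an
    empty placeholder (a Cartesian product of devices x fields); this
    version admits a cell only when row_count > 0.
    """
    fields = list(schema_fields)
    return {
        (dev, field): written[(dev, field)]
        for dev in devices
        for field in fields
        if written.get((dev, field), 0) > 0
    }
```

On a sparse schema the result grows with the number of written cells, not with devices × fields, which is exactly where the sparse-bench savings come from.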

Bench impact:
- 30k devices x 1 field full-density (existing bench): 0 change
  (no phantoms existed in this scenario).
- 5k devices x 5 fields x 60% density (new sparse bench):
  series_ref_map 15.93 -> 9.55 MB, series_stats_by_ref 7.53 -> 4.51 MB,
  tracked total ~26.5 -> 16.30 MB (~38% reduction).

41/41 dataset tests pass.
After Phase 6 unified table/tree-model semantics, iter_owned_series_refs
became a no-op wrapper around catalog.series_stats_by_ref.keys(). Its
docstring still existed only to explain the (now-impossible) phantom-
series concern. Remove the method and inline direct iteration at the
3 call sites:

- reader.iter_series_paths / iter_series_refs: iterate the catalog dict
  directly (matches the pattern already used by series_count).
- dataframe._register_reader: drop the getattr fallback (which existed
  only to handle a hypothetical reader without iter_owned_series_refs);
  iterate catalog.series_stats_by_ref directly.

Net: -22 / +9 lines, no behavior change. 41/41 dataset tests pass.
Rename the cross-file unified index to _DataFrameCatalog and shorten its field names for clarity:

- device_order -> devices
- device_index_by_key -> device_index
- device_bounds -> device_time_bounds
- series_refs_ordered -> series
- series_ref_map -> series_shards

Also slim the SeriesStats docstring in metadata.py and update tests to the new names. No behavior change.
Slim docstrings on _cache_metadata_tree_model, _read_series_by_row_tree and _read_arrow_tree to the one-line essentials.
tsdf.loc[..., [name, idx]] where both refs resolve to the same logical series previously appended the read fragment once per spec entry, producing duplicated timestamps and NaN-padded rows downstream. Append once per series in _query_aligned and add a regression test.
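The fix above amounts to deduplicating specs after resolution while preserving first-seen order. A sketch of that step in isolation (the helper name and the callable-based resolver are assumptions; the real code does this inline in _query_aligned):

```python
from typing import Callable, Hashable, List, Sequence

def resolve_unique_series(
    specs: Sequence[object],
    resolve: Callable[[object], Hashable],
) -> List[Hashable]:
    """Resolve .loc specs to series refs, keeping each ref once.

    tsdf.loc[..., [name, idx]] may name the same logical series twice
    (once by name, once by position); appending its read fragment per
    spec produced duplicated timestamps and NaN-padded rows, so the
    fragments must be appended once per resolved series.
    """
    seen = set()
    ordered: List[Hashable] = []
    for spec in specs:
        ref = resolve(spec)
        if ref not in seen:
            seen.add(ref)
            ordered.append(ref)
    return ordered
```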