@kirk0830 kirk0830 commented Dec 5, 2025

Summary by CodeRabbit

  • Bug Fixes

    • More robust, memory-safe trajectory loading with stronger per-frame validation, consistent atom/PBC checks, and user-visible warnings for mismatches.
  • New Features

    • Options to include additional electronic data (Hamiltonian, overlap, density matrix, eigenvalues) when loading datasets.
  • Documentation

    • Expanded parameter and usage descriptions for dataset construction and trajectory loading.
  • Refactor

    • Unified data-loading flow that consolidates per-frame atomic data into a consistent dataset format.


@kirk0830 kirk0830 changed the title from "feat: Refactor trajectory data loading and improve the docstrings" to "refactor: Refactor trajectory data loading and improve the docstrings" Dec 5, 2025
coderabbitai bot commented Dec 5, 2025

Walkthrough

Adds typed flags to _TrajData.from_ase_traj, a detailed docstring, and memory-safe ASE Trajectory reading; constructs consolidated per-frame arrays (positions, atomic_numbers, optional cell), validates natoms/nframes/PBC consistency (with warnings), and centralizes data construction in DefaultDataset.__init__ via a build_data helper.

Changes

Cohort: Data loading & dataset init
File(s): dptb/data/dataset/_default_dataset.py
Summary:
- _TrajData.from_ase_traj: signature now typed to include get_Hamiltonian, get_overlap, get_DM, get_eigenvalues (bools) and info: Optional[Dict]; new docstring; uses ASE Trajectory as a context manager for memory-safe reading; computes nframes and natoms from the trajectory and updates/validates info; assembles a consolidated data dict with pos, atomic_numbers, and optional cell (None for missing frames); validates shapes and PBC consistency and emits warnings on mismatches; returns a new _TrajData instance; asserts that incompatible flags (e.g., Hamiltonian and DM) are not requested together.
- DefaultDataset.__init__: added type hints and docstring; introduces a build_data helper to dispatch pos_type to _TrajData.from_ase_traj or _TrajData.from_text_data; replaces per-file ad-hoc subdata construction with calls to build_data; documents memory-safety behavior.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20–30 minutes

  • Inspect _TrajData.from_ase_traj for correct ASE Trajectory context-manager usage and resource cleanup.
  • Verify natoms/nframes derivation, shape validations for pos/cell, and PBC-consistency logic (warnings vs errors).
  • Confirm flag compatibility checks (Hamiltonian vs DM) and that forwarded flags/params through build_data are correct.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00%, which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled.
Title check | ✅ Passed | The PR title accurately describes the main changes: refactoring trajectory data loading logic and improving docstrings, which aligns with the key modifications in the changeset.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
dptb/data/dataset/_default_dataset.py (1)

417-417: Remove unused variable assignment.

The pbc variable is assigned but never used after line 417.

Apply this diff:

             assert all(attr in info for attr in ["r_max", "pbc"])
-            pbc = info["pbc"] # not used?
             self.raw_data.append(
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 84d18b8 and 0164e2c.

📒 Files selected for processing (1)
  • dptb/data/dataset/_default_dataset.py (4 hunks)
🧰 Additional context used
🪛 Ruff (0.14.7)
dptb/data/dataset/_default_dataset.py

174-174: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


208-208: print() call in assert statement is likely unintentional

Remove print

(RUF030)


417-417: Local variable pbc is assigned to but never used

Remove assignment to unused variable pbc

(F841)
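
The first two Ruff findings can be illustrated with a minimal sketch. The function `load_frames` and its parameters are hypothetical and do not come from the PR; only the two lint patterns (RUF013 and RUF030) are the point.

```python
from typing import Dict, Optional


def load_frames(nframes: int, info: Optional[Dict] = None) -> Dict:
    """Hypothetical loader, used only to illustrate the two lint fixes."""
    # RUF013: a default of None needs an explicit Optional[...] (or `Dict | None`)
    # annotation; `info: Dict = None` would be an implicit Optional.
    info = {} if info is None else dict(info)

    # RUF030: the assert message should be a plain string. Writing
    # `assert cond, print("...")` prints the text and then raises with
    # a message of None, because print() returns None.
    assert nframes > 0, f"nframes must be positive, got {nframes}"

    info["nframes"] = nframes
    return info
```

Copying `info` before updating it is a choice made for this sketch; the reviewed code may well update the caller's dictionary in place.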

🔇 Additional comments (9)
dptb/data/dataset/_default_dataset.py (9)

195-197: LGTM!

Good defensive programming with a clear, actionable error message for mutually exclusive options.


205-212: LGTM! Good use of context manager and frame validation.

The context manager ensures proper resource cleanup, and the frame count validation with info update is a robust approach to handle inconsistencies.
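
A rough sketch of the pattern praised here, written without ASE so it runs on its own. `FakeTrajectory` is a hypothetical stand-in for a reader that supports `with` and `len()`, and `sync_nframes` is an illustrative name; whether the PR warns when it corrects `info` is an assumption.

```python
import warnings
from typing import Dict, List


class FakeTrajectory:
    """Hypothetical stand-in for a trajectory reader (not the ASE class)."""

    def __init__(self, frames: List[str]):
        self.frames = frames
        self.closed = False

    def __enter__(self) -> "FakeTrajectory":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        # Runs even if the body of the `with` block raises,
        # so the underlying handle is always released.
        self.closed = True

    def __len__(self) -> int:
        return len(self.frames)


def sync_nframes(traj: FakeTrajectory, info: Dict) -> Dict:
    """Return a copy of `info` whose 'nframes' matches the trajectory."""
    info = dict(info)
    with traj as reader:
        nframes = len(reader)
    if info.get("nframes") != nframes:
        # Assumption: the mismatch is reported and the file's value wins.
        warnings.warn(
            f"nframes in info ({info.get('nframes')}) differs from the "
            f"trajectory ({nframes}); using the trajectory value."
        )
        info["nframes"] = nframes
    return info
```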


214-222: LGTM! Robust atom count validation.

The use of np.unique() to ensure consistent atom counts across frames is good defensive programming, and the info update when mismatches are detected is appropriate.
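
The same consistency check can be sketched with a plain `set` in place of `np.unique()`, which keeps the example dependency-free. `check_natoms` and the `"natoms"` key are illustrative names assumed for this sketch.

```python
from typing import Dict, Sequence


def check_natoms(natoms_per_frame: Sequence[int], info: Dict) -> Dict:
    """Require one atom count across all frames and record it in a copy of info."""
    unique = sorted(set(natoms_per_frame))
    # An empty trajectory also fails here, since it has no atom count at all.
    assert len(unique) == 1, (
        f"Number of atoms is not consistent across frames: {unique}"
    )
    natoms = unique[0]

    info = dict(info)
    if info.get("natoms") != natoms:
        # The value read from the trajectory overrides a stale entry.
        info["natoms"] = natoms
    return info
```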


224-240: LGTM! Comprehensive PBC handling.

The validation logic properly handles various PBC configurations and edge cases, with appropriate warnings when info and trajectory settings differ.
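
One way such a PBC check might look, as a sketch under stated assumptions: `resolve_pbc` is a hypothetical helper, a scalar flag is expanded to three axes, and the value from `info` is kept when the two sources disagree. The review does not say which side wins in the PR.

```python
import warnings
from typing import List, Sequence, Union


def resolve_pbc(
    info_pbc: Union[bool, Sequence[bool]],
    traj_pbc: Sequence[Sequence[bool]],
) -> List[bool]:
    """Compare the configured PBC with the per-frame PBC of a trajectory."""
    # Normalise a scalar flag to one flag per axis.
    if isinstance(info_pbc, bool):
        expected = [info_pbc] * 3
    else:
        expected = [bool(x) for x in info_pbc]
    assert len(expected) == 3, "pbc must be a bool or three bools"

    # Every frame must carry the same PBC setting.
    unique = {tuple(bool(x) for x in frame) for frame in traj_pbc}
    assert len(unique) == 1, "PBC setting in trajectory is not consistent."
    found = list(unique.pop())

    if found != expected:
        # Assumption: warn and keep the configured value.
        warnings.warn(
            f"pbc in info {expected} differs from trajectory {found}."
        )
    return expected
```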


242-269: Good data assembly with proper validation.

The frame-by-frame data collection and final validation ensure data consistency. The cell handling logic (lines 264-268) correctly enforces that either all frames have cells or none do, preventing hybrid scenarios.
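
The all-or-none rule for cells can be sketched as follows. `consolidate_cells` is a hypothetical helper, cells are modelled as nested lists rather than arrays, and raising `ValueError` is a choice made for this sketch; the PR may use an assertion instead.

```python
from typing import List, Optional, Sequence

Cell = List[List[float]]


def consolidate_cells(cells: Sequence[Optional[Cell]]) -> Optional[List[Cell]]:
    """Return all cells, or None if no frame has one; reject mixed input."""
    missing = [c is None for c in cells]
    if all(missing):
        # Non-periodic trajectory: no cell in any frame.
        return None
    if any(missing):
        # Hybrid case: some frames have a cell and some do not.
        raise ValueError(
            "Some frames have a cell and some do not; "
            "mixed trajectories are rejected."
        )
    return list(cells)
```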


360-395: LGTM! Excellent documentation.

The comprehensive docstring with detailed parameter descriptions significantly improves code maintainability and usability.


396-399: LGTM! Good abstraction.

The local helper function cleanly centralizes the logic for selecting the appropriate data constructor based on pos_type, improving maintainability.


418-425: LGTM! Consistent data loading.

The use of the build_data helper ensures uniform data construction across different pos_type variants.


444-444: LGTM! Helpful type annotation.

The type annotation improves code clarity and IDE support.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (2)
dptb/data/dataset/_default_dataset.py (2)

216-244: Consider consolidating trajectory iterations.

The trajectory is iterated three times (lines 216, 229, 244) to validate atom counts, PBC settings, and collect data. While the current approach provides clear validation, combining into a single pass would improve performance for large trajectories.

Consider refactoring to:

# Single-pass validation and collection
# (assumes numpy is imported as np and traj is the open Trajectory)
natoms_list, pbc_list = [], []
atomic_numbers, positions, cell = [], [], []

for atoms in traj:
    natoms_list.append(len(atoms))
    pbc_list.append(atoms.pbc.tolist())
    
    atomic_numbers.append(atoms.get_atomic_numbers())
    positions.append(atoms.get_positions())
    cell_read = atoms.get_cell()
    cell.append(None if np.allclose(cell_read, np.zeros((3, 3)), atol=1e-6) else cell_read)

# Validate consistency
natoms = np.unique(natoms_list)
assert len(natoms) == 1, "Number of atoms in trajectory file is not consistent."
natoms = natoms[0]

pbc_unique = np.unique(pbc_list, axis=0)
assert len(pbc_unique) == 1, "PBC setting in trajectory file is not consistent."
pbc = pbc_unique[0].tolist()

203-203: Remove unnecessary initialization.

Line 203 initializes these variables to None, but they're immediately reassigned to empty lists at line 243. The initial assignment is redundant.

Apply this diff:

-        atomic_numbers, positions, cell = None, None, None
-        # use the context manager to avoid memory leak
         with Trajectory(traj_file[0], 'r') as traj:
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0164e2c and 46c7f94.

📒 Files selected for processing (1)
  • dptb/data/dataset/_default_dataset.py (4 hunks)
🧰 Additional context used
🪛 Ruff (0.14.7)
dptb/data/dataset/_default_dataset.py

417-417: Local variable pbc is assigned to but never used

Remove assignment to unused variable pbc

(F841)

🔇 Additional comments (5)
dptb/data/dataset/_default_dataset.py (5)

170-194: LGTM! Well-documented method signature.

The type hints and comprehensive docstring improve code clarity. The past review comments regarding type hints and the typo have been properly addressed.


195-201: LGTM! Clear validation logic.

The assertion prevents conflicting data loads, and the single-trajectory enforcement is appropriate.


256-268: LGTM! Robust data validation and assembly.

The validation ensures trajectory data is complete and consistent, and the cell handling correctly supports both periodic and non-periodic systems while rejecting hybrid cases.


360-395: LGTM! Excellent documentation.

The comprehensive docstring and type hints significantly improve code clarity and maintainability.


396-399: LGTM! Clean abstraction.

The build_data helper centralizes the loading logic and makes the code more maintainable by routing to the appropriate constructor based on pos_type.

info=info)
self.raw_data.append(subdata)
assert all(attr in info for attr in ["r_max", "pbc"])
pbc = info["pbc"] # not used?

⚠️ Potential issue | 🟡 Minor

Remove unused variable assignment.

Static analysis correctly identifies that pbc is assigned but never used. The comment acknowledges this uncertainty.

Apply this diff:

         assert all(attr in info for attr in ["r_max", "pbc"])
-        pbc = info["pbc"] # not used?
         self.raw_data.append(
🧰 Tools
🪛 Ruff (0.14.7)

417-417: Local variable pbc is assigned to but never used

Remove assignment to unused variable pbc

(F841)

🤖 Prompt for AI Agents
In dptb/data/dataset/_default_dataset.py around line 417, the code assigns pbc =
info["pbc"] but never uses it. Remove the unused assignment (and the accompanying
comment that questions its usage) so that no unused local remains. If pbc is
intended to be used, use it appropriately instead; most likely the
pbc = info["pbc"] line can simply be deleted.

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
dptb/data/dataset/_default_dataset.py (1)

416-417: Remove unused variable assignment.

The pbc variable is assigned but never used. The comment acknowledges this uncertainty. This was previously flagged but remains unaddressed.

Apply this diff:

         assert all(attr in info for attr in ["r_max", "pbc"])
-        pbc = info["pbc"] # not used?
         self.raw_data.append(

As per past review comments, this issue should be resolved.

🧹 Nitpick comments (1)
dptb/data/dataset/_default_dataset.py (1)

195-197: Consider using and instead of * for boolean logic.

The multiplication operator works for the boolean check, but using and would be more idiomatic and clearer.
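
A small sketch of why the two spellings behave the same for real booleans yet differ otherwise. The function names are illustrative; the precedence note relies on `*` binding more tightly than `not` in Python.

```python
def flags_ok_mul(get_hamiltonian, get_dm):
    """Original style: `not a * b` parses as `not (a * b)`."""
    return not get_hamiltonian * get_dm


def flags_ok_and(get_hamiltonian, get_dm):
    """Suggested style: explicit boolean logic."""
    return not (get_hamiltonian and get_dm)


# For real bools the two forms agree on every combination.
for h in (False, True):
    for d in (False, True):
        assert flags_ok_mul(h, d) == flags_ok_and(h, d)

# They diverge once a flag is not a bool: `and` short-circuits on None,
# whereas None * True raises TypeError.
assert flags_ok_and(None, True) is True
```

The `and` form therefore states the intent directly and does not depend on arithmetic being defined for whatever value a caller passes.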

Apply this diff:

-        assert not get_Hamiltonian * get_DM, \
+        assert not (get_Hamiltonian and get_DM), \
             "Hamiltonian and Density Matrix can only loaded one at a time, " + \
             "for which will occupy the same attribute in the AtomicData."
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 46c7f94 and d23e630.

📒 Files selected for processing (1)
  • dptb/data/dataset/_default_dataset.py (4 hunks)
🧰 Additional context used
🪛 Ruff (0.14.7)
dptb/data/dataset/_default_dataset.py

417-417: Local variable pbc is assigned to but never used

Remove assignment to unused variable pbc

(F841)

🔇 Additional comments (5)
dptb/data/dataset/_default_dataset.py (5)

170-194: LGTM! Excellent documentation and type safety improvements.

The type hints and detailed docstring significantly improve code clarity and maintainability. The Optional[Dict] type hint is correctly applied, and the docstring thoroughly documents the method's behavior including the info update semantics.


203-269: Excellent memory-safe refactoring with thorough validation!

The context manager prevents memory leaks from persistent Trajectory file handles. The refactored logic provides:

  • Clear extraction of nframes and natoms with consistency validation
  • Automatic info updates when trajectory data differs
  • PBC handling with appropriate warnings for mismatches
  • Consolidated data construction with proper shape validation
  • Clean separation between cell present/absent cases

This is a significant improvement over ad-hoc per-frame accumulation.


360-395: LGTM! Clear documentation of the refactored constructor.

The type hints and comprehensive docstring improve code maintainability. The documentation clearly explains each parameter's purpose, including transparent notes about unused parameters retained for compatibility.


396-399: LGTM! Clean abstraction centralizes data loading.

The build_data helper provides a flexible dispatch mechanism that routes to the appropriate builder based on pos_type, with a sensible fallback to from_text_data. This improves maintainability and makes future extensions easier.
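
The dispatch-with-fallback idea might look roughly like this. `TrajDataSketch` is a hypothetical stand-in for `_TrajData`, the constructor signatures are simplified, and the `"ase"` key is an assumed value of `pos_type`; the real helper is a local function inside `DefaultDataset.__init__`.

```python
from typing import Callable, Dict


class TrajDataSketch:
    """Hypothetical stand-in for _TrajData with two alternative constructors."""

    def __init__(self, source: str, root: str):
        self.source = source
        self.root = root

    @classmethod
    def from_ase_traj(cls, root: str, **kwargs) -> "TrajDataSketch":
        return cls("ase", root)

    @classmethod
    def from_text_data(cls, root: str, **kwargs) -> "TrajDataSketch":
        return cls("text", root)


def build_data(pos_type: str, root: str, **kwargs) -> TrajDataSketch:
    """Pick a constructor by pos_type, falling back to the text loader."""
    builders: Dict[str, Callable[..., TrajDataSketch]] = {
        "ase": TrajDataSketch.from_ase_traj,  # assumed key, for illustration
    }
    builder = builders.get(pos_type, TrajDataSketch.from_text_data)
    return builder(root, **kwargs)
```

With a table like this, supporting a further format would mean adding one entry rather than another branch at each call site.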


418-425: LGTM! Clean integration of the build_data helper.

The call to build_data correctly passes all necessary parameters including the pos_type dispatch key and flags for loading optional data. The code is clear and maintainable.
