TCKDB Integration #890

Open

calvinp0 wants to merge 12 commits into main from tckdb-imp

Conversation

@calvinp0 (Member)

Initial work on integrating the TCKDB adapter into ARC.

calvinp0 added 8 commits May 15, 2026 00:11
Two related bugs in SSHClient._connect:

1. ssh.connect() never received key_filename, so paramiko fell back to
   the SSH agent and hardcoded defaults (~/.ssh/id_rsa, id_ed25519, ...),
   silently ignoring servers[<name>]['key'] from settings.py. With no
   agent running, this manifests as "Incorrect padding" when the default
   id_rsa is in OpenSSH format.

2. load_system_host_keys(filename=self.key) was passing the private-key
   path as a known_hosts file, corrupting the host_keys store. The call
   was masked by AutoAddPolicy() and is otherwise meaningless. Drop the
   argument so paramiko loads the real system known_hosts.

Same fix as origin/AzureServer (a53d6ba), applied surgically.
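
A minimal sketch of the corrected connect flow, assuming paramiko and illustrative names (the actual SSHClient._connect attributes differ):

import paramiko

def connect_with_key(address: str, username: str, key_path: str) -> paramiko.SSHClient:
    """Sketch: honor the configured private key and leave known_hosts alone."""
    ssh = paramiko.SSHClient()
    ssh.load_system_host_keys()          # no filename argument: load the real system known_hosts
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh.connect(hostname=address,
                username=username,
                key_filename=key_path,   # servers[<name>]['key'] is now actually passed through
                )
    return ssh
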
Adds arc/tckdb/ package with:
- config.py: TCKDBConfig dataclass parsed from input YAML 'tckdb' block
- adapter.py: HTTP adapter for posting payloads to a TCKDB instance
- payload_writer.py: serializes ARC species/reaction artifacts to disk
- idempotency.py: avoids re-uploading completed jobs
- *_test.py: unit tests for each module

ARC.py reads the optional 'tckdb' key from the input YAML, builds a
TCKDBConfig, and attaches it to the ARC object before execute(). When
absent, behavior is unchanged.

Docs: docs/tckdb-integration.md.
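
As a rough illustration of the wiring described above (the dataclass fields and YAML keys here are assumptions, not the PR's actual schema):

from dataclasses import dataclass
from typing import Optional

@dataclass
class TCKDBConfigSketch:
    """Hypothetical stand-in for the TCKDBConfig parsed from the input YAML 'tckdb' block."""
    url: str
    api_key: Optional[str] = None
    upload_species: bool = True
    upload_reactions: bool = True

def attach_tckdb(arc_object, input_dict: dict) -> None:
    """Build a config from the optional 'tckdb' block and attach it; no-op when absent."""
    block = input_dict.get('tckdb')
    if block is None:
        return                                   # no 'tckdb' key: behavior unchanged
    arc_object.tckdb = TCKDBConfigSketch(**block)
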
Three defensive parsing fixes from origin/AzureServer:

1. parse_running_jobs_ids: lstrip() PBS qstat lines before splitting.
   PBS output is column-aligned with leading spaces, so the existing
   `status_line.split('.')[0]` returned '' on every row, making ARC
   silently lose track of running jobs.

2. submit_job: guard each cluster_soft branch with `stdout and ...`
   before indexing stdout[0]. When qsub writes to stderr and stdout
   is empty, the elif cascade IndexError'd and masked the real
   submission error. The PBS branch was the most exposed (no string
   guard at all).

3. submit_job: add a "Memory specification can not be satisfied"
   stderr handler alongside the existing "Requested node
   configuration is not available" hint.
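
The two guard patterns might look roughly like this (function names and surrounding structure are illustrative, not ARC's exact code):

def parse_running_job_ids(qstat_lines: list) -> list:
    """Sketch of fix 1: strip PBS column padding before splitting out the job id."""
    job_ids = []
    for line in qstat_lines:
        line = line.lstrip()             # PBS output is column-aligned with leading spaces
        if line and line[0].isdigit():
            job_ids.append(line.split('.')[0])
    return job_ids

def parse_submitted_job_id(stdout: list, cluster_soft: str):
    """Sketch of fix 2: never index stdout[0] before confirming stdout is non-empty."""
    if cluster_soft.lower() == 'pbs' and stdout:
        return stdout[0].strip().split('.')[0]
    return None                          # empty stdout: surface the stderr submission error instead
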
PBS jobs left their full remote work directory on the cluster after
every run, eventually filling user quota and causing future qsub
submissions to fail with cryptic errors. Three coordinated changes:

- ssh.py: new SSHClient.remove_dir(remote_path) — wraps `rm -r` and
  raises ServerError on stderr.
- adapter.py: new JobAdapter.remove_remote_files() — calls remove_dir
  on the job's remote_path, no-op for local servers or unset paths.
- scheduler.py: invoke job.remove_remote_files() in end_job's success
  path, after rotors_dict update and before save_restart_dict. Wrapped
  in try/except so cleanup failure logs a warning rather than aborting
  the run — losing the cluster cleanup is preferable to losing the
  result.

Same shape as origin/AzureServer.
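
In rough outline, the scheduler-side hook looks like this (names follow the commit message; the rest is an assumption):

import logging

logger = logging.getLogger(__name__)

def cleanup_after_success(job) -> None:
    """Sketch: trigger remote cleanup, but never let a cleanup failure abort the run."""
    try:
        job.remove_remote_files()        # no-op for local servers or unset remote paths
    except Exception as e:
        logger.warning(f'Failed to remove remote files of {job.job_name}: {e}; '
                       f'keeping the result and continuing.')
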
determine_job_status() can raise JobError when the remote output file
is missing or unreadable over SFTP — previously this propagated and
aborted the entire scheduler run instead of triggering the existing
re-run logic. Treat it the same as IOError so transient remote-side
failures lead to a job re-submit, not a crash.
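
The change amounts to widening one except clause, roughly (import path and surrounding names are assumptions):

from arc.exceptions import JobError      # assumption: ARC's JobError lives here

def job_status_is_readable(job) -> bool:
    """Sketch: treat JobError like IOError so a transient remote failure re-submits rather than crashes."""
    try:
        job.determine_job_status()
        return True
    except (IOError, JobError):
        return False                     # caller re-runs the job via the existing re-run logic
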
_get_additional_job_info unconditionally tried to download
remote_path/job.log for every cluster_soft. job.log is HTCondor's
native event log; PBS, Slurm, OGE/SGE inline the ESS command directly
in submit.sh and never produce a job.log. Result on PBS: every job
logged a misleading "check that submit script has -o/-e flags"
warning even when the run succeeded.

Gate the third path directly on cluster_soft == 'htcondor' so the
intent is explicit. Inspired by origin/AzureServer, which used a
structural proxy (`'job.sh' in files_to_upload`) — same behavior, but
the cluster_soft check reads as the actual condition rather than a
filename coincidence.

Skipping the null-byte read handler from AzureServer; trigger is
unknown and the change adds two-mode read complexity for a problem
that hasn't reproduced.
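
A sketch of the explicit gate (the helper shape is an assumption):

def maybe_fetch_job_log(cluster_soft: str, download) -> None:
    """Sketch: only HTCondor writes a native job.log, so only download it there."""
    if cluster_soft.lower() != 'htcondor':
        return                           # PBS/Slurm/OGE/SGE inline the ESS command in submit.sh
    download('job.log')                  # HTCondor's event log in the job's remote_path
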
PipeRun.submit_to_scheduler invokes qsub/sbatch on the orchestrator
machine via local_submit_job, and the worker (`python -m
arc.scripts.pipe_worker`) reads pipe_root from the local filesystem.
When the resolved server for the task's engine is remote (e.g., a PBS
cluster reached over SSH), submission fails its `shutil.which` check at
pipe_run.py:270, returns 'errored' silently, and the pipe run stays
registered in active_pipes — the scheduler then loops indefinitely
because the planner already removed those tasks from _pending_pipe_sp
and reported them as piped.

Add an upfront check in should_use_pipe: if the engine's first server
in ess_settings is anything other than 'local', refuse pipe. The
existing fallback at scheduler.py:546-554 then routes those tasks
through run_sp_job, which uses SSHClient correctly.

This is a band-aid until full remote-pipe support lands (tracked on
the pipe-ssh-support branch). It preserves pipe's efficiency for
local/HTCondor users while preventing the silent deadlock for everyone
else.
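
The upfront check is roughly (ess_settings maps an engine to an ordered list of server names; the rest is illustrative):

def pipe_allowed_for_engine(engine: str, ess_settings: dict) -> bool:
    """Sketch: refuse pipe mode unless the engine's first resolved server is 'local'."""
    servers = ess_settings.get(engine) or []
    if not servers or servers[0] != 'local':
        return False                     # scheduler falls back to run_sp_job, which uses SSHClient
    return True
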
still in progress
Copilot AI review requested due to automatic review settings May 14, 2026 21:12

@github-advanced-security left a comment


CodeQL found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.


Copilot AI left a comment


Pull request overview

This PR introduces initial ARC ↔ TCKDB integration by adding a post-run upload sweep (driven by output/output.yml), TCKDB payload/sidecar writing + idempotency, and several supporting schema/provenance enhancements across ARC output, parsers, and job execution.

Changes:

  • Add a TCKDB upload sweep (species/reaction modes), CLI re-runner, payload sidecars, and idempotency-key utilities.
  • Extend ARC output/provenance to support richer thermo points, TS path-search artifacts (NEB/GSM), IRC direction tracking, and constraints.
  • Improve operational robustness (Arkane stderr handling, SSH pooling, remote file handling, TS adapter filtering) and update documentation.

Reviewed changes

Copilot reviewed 65 out of 69 changed files in this pull request and generated 11 comments.

Summary per file:
docs/tckdb_upload_boundary.md Defines ARC→TCKDB semantic boundary and guardrail expectations.
docs/output_yml_schema.md Updates output.yml schema docs (thermo_points, GSM log).
docs/gaussian.md Adds Gaussian run/troubleshooting notes (new doc).
arc/testing/test_JobAdapter/calcs/Species/spc1/opt_a472/submit.sh Adds PBS submit fixture for JobAdapter tests.
arc/testing/test_JobAdapter/calcs/Species/spc1/opt_a472/input.gjf Adds Gaussian input fixture for JobAdapter tests.
arc/testing/test_JobAdapter/calcs/Species/spc1/opt_a1313/submit.sh Adds PBS submit fixture for JobAdapter tests.
arc/testing/test_JobAdapter/calcs/Species/spc1/opt_a1313/input.gjf Adds Gaussian input fixture for JobAdapter tests.
arc/testing/test_JobAdapter/calcs/Species/spc1_and_2_others/conf_opt_a472/submit.sh Adds PBS submit fixture for multi-species JobAdapter tests.
arc/testing/test_JobAdapter/calcs/Species/spc1_and_2_others/conf_opt_a472/input.gjf Adds Gaussian input fixture for multi-species JobAdapter tests.
arc/testing/test_JobAdapter/calcs/Species/spc1_and_2_others/conf_opt_a1313/submit.sh Adds PBS submit fixture for multi-species JobAdapter tests.
arc/testing/test_JobAdapter/calcs/Species/spc1_and_2_others/conf_opt_a1313/input.gjf Adds Gaussian input fixture for multi-species JobAdapter tests.
arc/testing/test_JobAdapter_ServerTimeLimit/calcs/Species/spc1/opt_101/err.txt Adds PBS walltime-limit stderr fixture.
arc/testing/test_JobAdapter_scan/calcs/Species/methanol_and_5_others/scan_a472/input.gjf Adds Gaussian scan input fixture.
arc/testing/test_JobAdapter_scan/calcs/Species/methanol_and_5_others/scan_a1313/input.gjf Adds Gaussian scan input fixture.
arc/tckdb/sweep.py Implements end-of-run sweep dispatch and artifact uploading.
arc/tckdb/payload_writer.py Writes payload JSON + sidecar metadata; replay gating helper.
arc/tckdb/payload_writer_test.py Adds unit tests for payload/sidecar writing and replay gate.
arc/tckdb/idempotency.py Adds idempotency-key builders for payloads and artifacts.
arc/tckdb/idempotency_test.py Adds unit tests for idempotency-key stability/distinctness.
arc/tckdb/constraints.py Adds constraint dataclass + serializer for TCKDB payloads.
arc/tckdb/config.py Adds TCKDB config parsing, API-key resolution, artifact-kind validation.
arc/tckdb/cli.py Adds standalone CLI to rerun sweep against existing project output.yml.
arc/tckdb/cli_test.py Adds tests for CLI parsing and dispatch behavior.
arc/tckdb/__init__.py Exposes adapter/config public API.
arc/statmech/arkane.py Unifies tunneling method constant; fixes Arkane success signal to output.py existence; maps thermo_points.
arc/statmech/arkane_test.py Adds tests for Arkane stderr classification + output.py success gating.
arc/species/species.py Ensures monoatomic species get a trivial final_xyz; adds rotor scan software field; renames cp_data→thermo_points in ThermoData.
arc/species/species_test.py Adds tests for monoatomic final_xyz behavior; updates xyz fixtures expectations.
arc/settings/submit.py Clarifies example submit templates; adds ORCA auxiliary file copies; renames example keys and adds back-compat aliases.
arc/settings/submit_test.py Updates tests to validate example keys/aliases without restricting user overrides.
arc/settings/settings.py Updates ORCA NEB default level; expands candidate RMG install paths.
arc/scripts/save_arkane_thermo.py Emits richer thermo_points (Cp/H/S/G) instead of cp_data.
arc/scripts/get_species_corrections.py Adds helper script to compute AEC/BAC totals/components via Arkane in RMG env.
arc/scripts_test.py Updates tests for thermo_points extraction.
arc/scheduler.py Adds TS-guess path routing (NEB vs GSM), IRC direction tracking, TS adapter filtering, remote cleanup hook, and scan software tracking.
arc/scheduler_test.py Adds tests for TS guess paths routing and GSM path key seeding.
arc/reaction/reaction_test.py Adds regression test for opt-done gate with monoatomic reactant.
arc/parser/parser.py Adds rich IRC parser wrapper, scan absolute energies wrapper, GSM trajectory filename handling.
arc/parser/parser_test.py Adds tests for rich IRC parsing behavior.
arc/parser/constraints_test.py Adds tests for Gaussian/ORCA constraint parsing.
arc/parser/adapters/orca.py Adds ORCA input constraint parser.
arc/parser/adapter.py Adds default hook for absolute scan energies in Hartree.
arc/job/trsh.py Adds “DispUnconverged” detection and no_tight troubleshooting path.
arc/job/trsh_test.py Adds tests for displacement-only unconverged detection.
arc/job/ssh.py Adds delayed-existence retry for downloads; improves job-id parsing; SSH connect key handling; adds remove_dir.
arc/job/ssh_pool.py Adds process-lifetime SSH connection pool.
arc/job/pipe/pipe_coordinator.py Prevents pipe mode for engines resolving to remote servers.
arc/job/pipe/pipe_coordinator_test.py Adds test ensuring pipe is disabled when engine resolves to remote server.
arc/job/adapters/ts/xtbgsm_test.py Adds tests for GSM provenance capture and ograd preservation behavior.
arc/job/adapters/ts/xtb_gsm.py Records GSM stringfile path as TSGuess log provenance; tracks node-output dir.
arc/job/adapters/ts/orca_neb.py Makes NEB input refer to local reactant/product xyz filenames (not absolute).
arc/job/adapters/scripts/xtb_gsm/ograd Preserves per-node energy/gradient/xtbout files for provenance.
arc/job/adapters/gaussian.py Applies no_tight troubleshooting flag to drop opt=tight.
arc/job/adapter.py Adds SSH reuse (pool/shared) across upload/submit/download; submit-script validation; remote cleanup; HTCondor job.log gating.
ARC.py Wires TCKDB sweep into main execution and closes SSH pool on exit.
ARC_test.py Adds tests intended to cover ARC.py/TCKDB sweep wiring.
.gitignore Adds ignores for agent folders and ARC.egg-info.
Comments suppressed due to low confidence (1)

docs/gaussian.md:33

  • In the Gaussian input template, MULTIPLICTY is misspelled; this should be MULTIPLICITY to avoid confusion for users copying the template.


Comment thread arc/settings/submit.py
Comment on lines +1217 to +1224
# Back-compat aliases. The original keys (``'local'``, ``'pbs_sample'``,
# ``'server3'``) shipped with the repo for years and downstream forks
# may still reference them by those names. Aliasing keeps existing
# tests + user setups working while the ``EXAMPLE_*`` names are the
# documented, recommended way to refer to these example templates.
submit_scripts['local'] = submit_scripts['EXAMPLE_slurm_supercloud']
submit_scripts['pbs_sample'] = submit_scripts['EXAMPLE_pbs']
submit_scripts['server3'] = submit_scripts['EXAMPLE_slurm_alongd_supercloud']
Comment thread ARC_test.py
Comment on lines +18 to +24
# ARC.py is the top-level entry script; import its sweep helper directly.
import importlib.util
_ARC_PY = Path(__file__).parent / "ARC.py"
_spec = importlib.util.spec_from_file_location("arc_entry", _ARC_PY)
_arc_entry = importlib.util.module_from_spec(_spec)
_spec.loader.exec_module(_arc_entry)

Comment thread arc/tckdb/sweep.py
Comment on lines +181 to +187
# We treat both "ts_label set but TS missing/non-converged" and
# "no ts_label at all" as the partial case. The latter is rare
# (ARC normally emits a ts_label even on failure), but a
# reaction record with reactants/products and no TS is exactly
# the partial-shape we want to allow under the flag.
is_partial = bool(ts_label) and not ts_converged

Comment on lines +282 to +285
The contract a discovery-based replay tool MUST honor: if a
sidecar is marked ``is_partial: true`` (or the on-disk file's name
carries the ``.partial`` infix), do **not** replay it as a
complete bundle. Partial computed-reaction sidecars omit
Comment thread arc/job/ssh.py
    time.sleep(1.0)
else:
    logger.debug(f'{remote_file_path} does not exist on {self.server}; '
                 f'skipping download.')
Comment thread docs/gaussian.md
Comment on lines +67 to +70
## Troubleshooting

There are many situations where we can gaussian job errors and gaussian in the output file reports the error code.

Comment on lines +11 to +12
PBS_O_WORKDIR="/home/calvin.p/runs/arc_projects/calvin.p/runs/ARC_Projects/test/spc1/opt_a472"
cd "$PBS_O_WORKDIR"
Comment on lines +11 to +12
PBS_O_WORKDIR="/home/calvin.p/runs/arc_projects/calvin.p/runs/ARC_Projects/test/spc1/opt_a1313"
cd "$PBS_O_WORKDIR"
Comment on lines +11 to +12
PBS_O_WORKDIR="/home/calvin.p/runs/arc_projects/calvin.p/runs/ARC_Projects/test/spc1_and_2_others/conf_opt_a472"
cd "$PBS_O_WORKDIR"
Comment on lines +11 to +12
PBS_O_WORKDIR="/home/calvin.p/runs/arc_projects/calvin.p/runs/ARC_Projects/test/spc1_and_2_others/conf_opt_a1313"
cd "$PBS_O_WORKDIR"