TCKDB Integration #890

Open

calvinp0 wants to merge 12 commits into main from tckdb-imp

Conversation

@calvinp0 (Member)

Initial work on integrating the TCKDB adapter into ARC.

calvinp0 added 8 commits May 15, 2026 00:11
Two related bugs in SSHClient._connect:

1. ssh.connect() never received key_filename, so paramiko fell back to
   the SSH agent and hardcoded defaults (~/.ssh/id_rsa, id_ed25519, ...),
   silently ignoring servers[<name>]['key'] from settings.py. With no
   agent running, this manifests as "Incorrect padding" when the default
   id_rsa is in OpenSSH format.

2. load_system_host_keys(filename=self.key) was passing the private-key
   path as a known_hosts file, corrupting the host_keys store. The call
   was masked by AutoAddPolicy() and is otherwise meaningless. Drop the
   argument so paramiko loads the real system known_hosts.

Same fix as origin/AzureServer (a53d6ba), applied surgically.
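
A minimal sketch of the corrected connect flow, assuming paramiko and illustrative names (the actual SSHClient._connect attributes differ):

import paramiko

def connect_with_key(address: str, username: str, key_path: str) -> paramiko.SSHClient:
    """Sketch: honor the configured private key and leave known_hosts alone."""
    ssh = paramiko.SSHClient()
    ssh.load_system_host_keys()          # no filename argument: load the real system known_hosts
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh.connect(hostname=address,
                username=username,
                key_filename=key_path,   # servers[<name>]['key'] is now actually passed through
                )
    return ssh
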
Adds arc/tckdb/ package with:
- config.py: TCKDBConfig dataclass parsed from input YAML 'tckdb' block
- adapter.py: HTTP adapter for posting payloads to a TCKDB instance
- payload_writer.py: serializes ARC species/reaction artifacts to disk
- idempotency.py: avoids re-uploading completed jobs
- *_test.py: unit tests for each module

ARC.py reads the optional 'tckdb' key from the input YAML, builds a
TCKDBConfig, and attaches it to the ARC object before execute(). When
absent, behavior is unchanged.

Docs: docs/tckdb-integration.md.
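
As a rough illustration of the wiring described above (the dataclass fields and YAML keys here are assumptions, not the PR's actual schema):

from dataclasses import dataclass
from typing import Optional

@dataclass
class TCKDBConfigSketch:
    """Hypothetical stand-in for the TCKDBConfig parsed from the input YAML 'tckdb' block."""
    url: str
    api_key: Optional[str] = None
    upload_species: bool = True
    upload_reactions: bool = True

def attach_tckdb(arc_object, input_dict: dict) -> None:
    """Build a config from the optional 'tckdb' block and attach it; no-op when absent."""
    block = input_dict.get('tckdb')
    if block is None:
        return                                   # no 'tckdb' key: behavior unchanged
    arc_object.tckdb = TCKDBConfigSketch(**block)
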
Three defensive parsing fixes from origin/AzureServer:

1. parse_running_jobs_ids: lstrip() PBS qstat lines before splitting.
   PBS output is column-aligned with leading spaces, so the existing
   `status_line.split('.')[0]` returned '' on every row, making ARC
   silently lose track of running jobs.

2. submit_job: guard each cluster_soft branch with `stdout and ...`
   before indexing stdout[0]. When qsub writes to stderr and stdout
   is empty, the elif cascade IndexError'd and masked the real
   submission error. The PBS branch was the most exposed (no string
   guard at all).

3. submit_job: add a "Memory specification can not be satisfied"
   stderr handler alongside the existing "Requested node
   configuration is not available" hint.
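
The two guard patterns might look roughly like this (function names and surrounding structure are illustrative, not ARC's exact code):

def parse_running_job_ids(qstat_lines: list) -> list:
    """Sketch of fix 1: strip PBS column padding before splitting out the job id."""
    job_ids = []
    for line in qstat_lines:
        line = line.lstrip()             # PBS output is column-aligned with leading spaces
        if line and line[0].isdigit():
            job_ids.append(line.split('.')[0])
    return job_ids

def parse_submitted_job_id(stdout: list, cluster_soft: str):
    """Sketch of fix 2: never index stdout[0] before confirming stdout is non-empty."""
    if cluster_soft.lower() == 'pbs' and stdout:
        return stdout[0].strip().split('.')[0]
    return None                          # empty stdout: surface the stderr submission error instead
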
PBS jobs left their full remote work directory on the cluster after
every run, eventually filling user quota and causing future qsub
submissions to fail with cryptic errors. Three coordinated changes:

- ssh.py: new SSHClient.remove_dir(remote_path) — wraps `rm -r` and
  raises ServerError on stderr.
- adapter.py: new JobAdapter.remove_remote_files() — calls remove_dir
  on the job's remote_path, no-op for local servers or unset paths.
- scheduler.py: invoke job.remove_remote_files() in end_job's success
  path, after rotors_dict update and before save_restart_dict. Wrapped
  in try/except so cleanup failure logs a warning rather than aborting
  the run — losing the cluster cleanup is preferable to losing the
  result.

Same shape as origin/AzureServer.
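
In rough outline, the scheduler-side hook looks like this (names follow the commit message; the rest is an assumption):

import logging

logger = logging.getLogger(__name__)

def cleanup_after_success(job) -> None:
    """Sketch: trigger remote cleanup, but never let a cleanup failure abort the run."""
    try:
        job.remove_remote_files()        # no-op for local servers or unset remote paths
    except Exception as e:
        logger.warning(f'Failed to remove remote files of {job.job_name}: {e}; '
                       f'keeping the result and continuing.')
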
determine_job_status() can raise JobError when the remote output file
is missing or unreadable over SFTP — previously this propagated and
aborted the entire scheduler run instead of triggering the existing
re-run logic. Treat it the same as IOError so transient remote-side
failures lead to a job re-submit, not a crash.
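
The change amounts to widening one except clause, roughly (import path and surrounding names are assumptions):

from arc.exceptions import JobError      # assumption: ARC's JobError lives here

def job_status_is_readable(job) -> bool:
    """Sketch: treat JobError like IOError so a transient remote failure re-submits rather than crashes."""
    try:
        job.determine_job_status()
        return True
    except (IOError, JobError):
        return False                     # caller re-runs the job via the existing re-run logic
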
_get_additional_job_info unconditionally tried to download
remote_path/job.log for every cluster_soft. job.log is HTCondor's
native event log; PBS, Slurm, OGE/SGE inline the ESS command directly
in submit.sh and never produce a job.log. Result on PBS: every job
logged a misleading "check that submit script has -o/-e flags"
warning even when the run succeeded.

Gate the third path directly on cluster_soft == 'htcondor' so the
intent is explicit. Inspired by origin/AzureServer, which used a
structural proxy (`'job.sh' in files_to_upload`) — same behavior, but
the cluster_soft check reads as the actual condition rather than a
filename coincidence.

Skipping the null-byte read handler from AzureServer; trigger is
unknown and the change adds two-mode read complexity for a problem
that hasn't reproduced.
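
A sketch of the explicit gate (the helper shape is an assumption):

def maybe_fetch_job_log(cluster_soft: str, download) -> None:
    """Sketch: only HTCondor writes a native job.log, so only download it there."""
    if cluster_soft.lower() != 'htcondor':
        return                           # PBS/Slurm/OGE/SGE inline the ESS command in submit.sh
    download('job.log')                  # HTCondor's event log in the job's remote_path
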
PipeRun.submit_to_scheduler invokes qsub/sbatch on the orchestrator
machine via local_submit_job, and the worker (`python -m
arc.scripts.pipe_worker`) reads pipe_root from the local filesystem.
When the resolved server for the task's engine is remote (e.g., a PBS
cluster reached over SSH), submission fails its `shutil.which` check at
pipe_run.py:270, returns 'errored' silently, and the pipe run stays
registered in active_pipes — the scheduler then loops indefinitely
because the planner already removed those tasks from _pending_pipe_sp
and reported them as piped.

Add an upfront check in should_use_pipe: if the engine's first server
in ess_settings is anything other than 'local', refuse pipe. The
existing fallback at scheduler.py:546-554 then routes those tasks
through run_sp_job, which uses SSHClient correctly.

This is a band-aid until full remote-pipe support lands (tracked on
the pipe-ssh-support branch). It preserves pipe's efficiency for
local/HTCondor users while preventing the silent deadlock for everyone
else.
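
The upfront check is roughly (ess_settings maps an engine to an ordered list of server names; the rest is illustrative):

def pipe_allowed_for_engine(engine: str, ess_settings: dict) -> bool:
    """Sketch: refuse pipe mode unless the engine's first resolved server is 'local'."""
    servers = ess_settings.get(engine) or []
    if not servers or servers[0] != 'local':
        return False                     # scheduler falls back to run_sp_job, which uses SSHClient
    return True
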
still in progress
Copilot AI review requested due to automatic review settings May 14, 2026 21:12

@github-advanced-security left a comment


CodeQL found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.


Copilot AI left a comment


Pull request overview

This PR introduces initial ARC ↔ TCKDB integration by adding a post-run upload sweep (driven by output/output.yml), TCKDB payload/sidecar writing + idempotency, and several supporting schema/provenance enhancements across ARC output, parsers, and job execution.

Changes:

  • Add a TCKDB upload sweep (species/reaction modes), CLI re-runner, payload sidecars, and idempotency-key utilities.
  • Extend ARC output/provenance to support richer thermo points, TS path-search artifacts (NEB/GSM), IRC direction tracking, and constraints.
  • Improve operational robustness (Arkane stderr handling, SSH pooling, remote file handling, TS adapter filtering) and update documentation.

Reviewed changes

Copilot reviewed 65 out of 69 changed files in this pull request and generated 11 comments.

Summary per file:
docs/tckdb_upload_boundary.md Defines ARC→TCKDB semantic boundary and guardrail expectations.
docs/output_yml_schema.md Updates output.yml schema docs (thermo_points, GSM log).
docs/gaussian.md Adds Gaussian run/troubleshooting notes (new doc).
arc/testing/test_JobAdapter/calcs/Species/spc1/opt_a472/submit.sh Adds PBS submit fixture for JobAdapter tests.
arc/testing/test_JobAdapter/calcs/Species/spc1/opt_a472/input.gjf Adds Gaussian input fixture for JobAdapter tests.
arc/testing/test_JobAdapter/calcs/Species/spc1/opt_a1313/submit.sh Adds PBS submit fixture for JobAdapter tests.
arc/testing/test_JobAdapter/calcs/Species/spc1/opt_a1313/input.gjf Adds Gaussian input fixture for JobAdapter tests.
arc/testing/test_JobAdapter/calcs/Species/spc1_and_2_others/conf_opt_a472/submit.sh Adds PBS submit fixture for multi-species JobAdapter tests.
arc/testing/test_JobAdapter/calcs/Species/spc1_and_2_others/conf_opt_a472/input.gjf Adds Gaussian input fixture for multi-species JobAdapter tests.
arc/testing/test_JobAdapter/calcs/Species/spc1_and_2_others/conf_opt_a1313/submit.sh Adds PBS submit fixture for multi-species JobAdapter tests.
arc/testing/test_JobAdapter/calcs/Species/spc1_and_2_others/conf_opt_a1313/input.gjf Adds Gaussian input fixture for multi-species JobAdapter tests.
arc/testing/test_JobAdapter_ServerTimeLimit/calcs/Species/spc1/opt_101/err.txt Adds PBS walltime-limit stderr fixture.
arc/testing/test_JobAdapter_scan/calcs/Species/methanol_and_5_others/scan_a472/input.gjf Adds Gaussian scan input fixture.
arc/testing/test_JobAdapter_scan/calcs/Species/methanol_and_5_others/scan_a1313/input.gjf Adds Gaussian scan input fixture.
arc/tckdb/sweep.py Implements end-of-run sweep dispatch and artifact uploading.
arc/tckdb/payload_writer.py Writes payload JSON + sidecar metadata; replay gating helper.
arc/tckdb/payload_writer_test.py Adds unit tests for payload/sidecar writing and replay gate.
arc/tckdb/idempotency.py Adds idempotency-key builders for payloads and artifacts.
arc/tckdb/idempotency_test.py Adds unit tests for idempotency-key stability/distinctness.
arc/tckdb/constraints.py Adds constraint dataclass + serializer for TCKDB payloads.
arc/tckdb/config.py Adds TCKDB config parsing, API-key resolution, artifact-kind validation.
arc/tckdb/cli.py Adds standalone CLI to rerun sweep against existing project output.yml.
arc/tckdb/cli_test.py Adds tests for CLI parsing and dispatch behavior.
arc/tckdb/__init__.py Exposes adapter/config public API.
arc/statmech/arkane.py Unifies tunneling method constant; fixes Arkane success signal to output.py existence; maps thermo_points.
arc/statmech/arkane_test.py Adds tests for Arkane stderr classification + output.py success gating.
arc/species/species.py Ensures monoatomic species get a trivial final_xyz; adds rotor scan software field; renames cp_data→thermo_points in ThermoData.
arc/species/species_test.py Adds tests for monoatomic final_xyz behavior; updates xyz fixtures expectations.
arc/settings/submit.py Clarifies example submit templates; adds ORCA auxiliary file copies; renames example keys and adds back-compat aliases.
arc/settings/submit_test.py Updates tests to validate example keys/aliases without restricting user overrides.
arc/settings/settings.py Updates ORCA NEB default level; expands candidate RMG install paths.
arc/scripts/save_arkane_thermo.py Emits richer thermo_points (Cp/H/S/G) instead of cp_data.
arc/scripts/get_species_corrections.py Adds helper script to compute AEC/BAC totals/components via Arkane in RMG env.
arc/scripts_test.py Updates tests for thermo_points extraction.
arc/scheduler.py Adds TS-guess path routing (NEB vs GSM), IRC direction tracking, TS adapter filtering, remote cleanup hook, and scan software tracking.
arc/scheduler_test.py Adds tests for TS guess paths routing and GSM path key seeding.
arc/reaction/reaction_test.py Adds regression test for opt-done gate with monoatomic reactant.
arc/parser/parser.py Adds rich IRC parser wrapper, scan absolute energies wrapper, GSM trajectory filename handling.
arc/parser/parser_test.py Adds tests for rich IRC parsing behavior.
arc/parser/constraints_test.py Adds tests for Gaussian/ORCA constraint parsing.
arc/parser/adapters/orca.py Adds ORCA input constraint parser.
arc/parser/adapter.py Adds default hook for absolute scan energies in Hartree.
arc/job/trsh.py Adds “DispUnconverged” detection and no_tight troubleshooting path.
arc/job/trsh_test.py Adds tests for displacement-only unconverged detection.
arc/job/ssh.py Adds delayed-existence retry for downloads; improves job-id parsing; SSH connect key handling; adds remove_dir.
arc/job/ssh_pool.py Adds process-lifetime SSH connection pool.
arc/job/pipe/pipe_coordinator.py Prevents pipe mode for engines resolving to remote servers.
arc/job/pipe/pipe_coordinator_test.py Adds test ensuring pipe is disabled when engine resolves to remote server.
arc/job/adapters/ts/xtbgsm_test.py Adds tests for GSM provenance capture and ograd preservation behavior.
arc/job/adapters/ts/xtb_gsm.py Records GSM stringfile path as TSGuess log provenance; tracks node-output dir.
arc/job/adapters/ts/orca_neb.py Makes NEB input refer to local reactant/product xyz filenames (not absolute).
arc/job/adapters/scripts/xtb_gsm/ograd Preserves per-node energy/gradient/xtbout files for provenance.
arc/job/adapters/gaussian.py Applies no_tight troubleshooting flag to drop opt=tight.
arc/job/adapter.py Adds SSH reuse (pool/shared) across upload/submit/download; submit-script validation; remote cleanup; HTCondor job.log gating.
ARC.py Wires TCKDB sweep into main execution and closes SSH pool on exit.
ARC_test.py Adds tests intended to cover ARC.py/TCKDB sweep wiring.
.gitignore Adds ignores for agent folders and ARC.egg-info.
Comments suppressed due to low confidence (1)

docs/gaussian.md:33

  • In the Gaussian input template, MULTIPLICTY is misspelled; this should be MULTIPLICITY to avoid confusion for users copying the template.


Comment thread arc/settings/submit.py
Comment on lines +1217 to +1224
# Back-compat aliases. The original keys (``'local'``, ``'pbs_sample'``,
# ``'server3'``) shipped with the repo for years and downstream forks
# may still reference them by those names. Aliasing keeps existing
# tests + user setups working while the ``EXAMPLE_*`` names are the
# documented, recommended way to refer to these example templates.
submit_scripts['local'] = submit_scripts['EXAMPLE_slurm_supercloud']
submit_scripts['pbs_sample'] = submit_scripts['EXAMPLE_pbs']
submit_scripts['server3'] = submit_scripts['EXAMPLE_slurm_alongd_supercloud']
Comment thread ARC_test.py
Comment on lines +18 to +24
# ARC.py is the top-level entry script; import its sweep helper directly.
import importlib.util
_ARC_PY = Path(__file__).parent / "ARC.py"
_spec = importlib.util.spec_from_file_location("arc_entry", _ARC_PY)
_arc_entry = importlib.util.module_from_spec(_spec)
_spec.loader.exec_module(_arc_entry)

Comment thread arc/tckdb/sweep.py
Comment on lines +181 to +187
# We treat both "ts_label set but TS missing/non-converged" and
# "no ts_label at all" as the partial case. The latter is rare
# (ARC normally emits a ts_label even on failure), but a
# reaction record with reactants/products and no TS is exactly
# the partial-shape we want to allow under the flag.
is_partial = bool(ts_label) and not ts_converged

Comment on lines +282 to +285
The contract a discovery-based replay tool MUST honor: if a
sidecar is marked ``is_partial: true`` (or the on-disk file's name
carries the ``.partial`` infix), do **not** replay it as a
complete bundle. Partial computed-reaction sidecars omit
Comment thread arc/job/ssh.py
    time.sleep(1.0)
else:
    logger.debug(f'{remote_file_path} does not exist on {self.server}; '
                 f'skipping download.')
Comment thread docs/gaussian.md
Comment on lines +67 to +70
## Troubleshooting

There are many situations where we can gaussian job errors and gaussian in the output file reports the error code.

Comment on lines +11 to +12
PBS_O_WORKDIR="/home/calvin.p/runs/arc_projects/calvin.p/runs/ARC_Projects/test/spc1/opt_a472"
cd "$PBS_O_WORKDIR"
Comment on lines +11 to +12
PBS_O_WORKDIR="/home/calvin.p/runs/arc_projects/calvin.p/runs/ARC_Projects/test/spc1/opt_a1313"
cd "$PBS_O_WORKDIR"
Comment on lines +11 to +12
PBS_O_WORKDIR="/home/calvin.p/runs/arc_projects/calvin.p/runs/ARC_Projects/test/spc1_and_2_others/conf_opt_a472"
cd "$PBS_O_WORKDIR"
Comment on lines +11 to +12
PBS_O_WORKDIR="/home/calvin.p/runs/arc_projects/calvin.p/runs/ARC_Projects/test/spc1_and_2_others/conf_opt_a1313"
cd "$PBS_O_WORKDIR"