This guide describes dataset conversion, embedding formats, and evaluation workflows.
For embedding generation details, see docs/EMBEDDINGS.md.
The fastest way to see 3DCS in action is to run the bundled demo with pre-computed GemNet embeddings:
pip install -e .
python examples/demo.py all
This downloads the HF dataset and evaluates the included fixture embeddings for both
chirality and trajectory benchmarks. See examples/demo.py for details.
Convert the raw RDKit molecule data into Hugging Face datasets:
python -m three_dbench convert chirality \
--input-pkl data/chirality/chirality_bench_conformers_noised_only.pkl \
--output-dir data/hf/chirality
python -m three_dbench convert traj \
--mol-pkl-dir data/traj/mol_pkl \
--energy-dir data/traj/npz_data \
--output-dir data/hf/traj
python -m three_dbench convert rotation \
--lmdb-root data/rotation/results \
--output-dir data/hf/rotation
Use --no-mol-blocks to skip MolBlock storage when disk space is limited. Rotation
evaluation requires MolBlocks to compute RMSD, so do not pass this flag when converting the rotation dataset.
Chirality embeddings:
- Flat array: one embedding per conformer, aligned with the dataset order
- Use the offset field in the dataset rows to verify alignment if needed
- Supported files: .npz, .npy, .pkl
- Use --embedding-key for NPZ or pickled dicts
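The alignment check mentioned in the list above can be sketched roughly as follows. This assumes each dataset row carries integer offset and n_conformers fields, and that offset is the index of the row's first conformer in the flat array. That reading of offset, and the helper name check_flat_alignment, are assumptions for illustration and not part of the three_dbench API.

```python
import numpy as np


def check_flat_alignment(rows, embeddings):
    """Return a list of problems found; an empty list means the layout looks aligned.

    Assumption: row["offset"] is the index of the row's first conformer in the
    flat array, and row["n_conformers"] is how many consecutive vectors it owns.
    """
    problems = []
    cursor = 0
    for i, row in enumerate(rows):
        if row["offset"] != cursor:
            problems.append(f"row {i}: offset {row['offset']} != expected {cursor}")
        cursor = row["offset"] + row["n_conformers"]
    if cursor != embeddings.shape[0]:
        problems.append(f"dataset needs {cursor} vectors, array has {embeddings.shape[0]}")
    return problems


# Synthetic rows standing in for real dataset rows (hypothetical values).
rows = [
    {"offset": 0, "n_conformers": 3},
    {"offset": 3, "n_conformers": 2},
]
good = np.zeros((5, 8))
bad = np.zeros((4, 8))
print(check_flat_alignment(rows, good))
print(check_flat_alignment(rows, bad))
```

With a dataset loaded from disk, the same function could be called with the dataset rows in place of the synthetic list, provided the field names match.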
Rotation embeddings:
- Flat array (default): embeddings aligned with offset and n_conformers
- By key: a dict mapping key to a (n_conformers, dim) array
- Supported files: .npz or .pkl dicts for by-key layout
Trajectory embeddings:
- Directory of rmd17_*.npz files (default)
- Or a dict mapping mol_type to (n_frames, dim) arrays
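To illustrate how the two trajectory layouts relate, here is a minimal sketch that reads a directory of rmd17_*.npz files into a dict of arrays. The helper name load_traj_embeddings, the use of the file stem as the dict key, the arr_0 default, and the demo file names are all assumptions; the keys the evaluator actually expects for mol_type may differ.

```python
import tempfile
from pathlib import Path

import numpy as np


def load_traj_embeddings(directory, embedding_key="arr_0"):
    """Load every rmd17_*.npz in `directory` into a dict keyed by file stem."""
    out = {}
    for path in sorted(Path(directory).glob("rmd17_*.npz")):
        with np.load(path) as data:
            # Materialize the array before the archive is closed.
            out[path.stem] = np.asarray(data[embedding_key])
    return out


# Demo on a temporary directory with two fake molecules (hypothetical names).
with tempfile.TemporaryDirectory() as tmp:
    np.savez(Path(tmp) / "rmd17_aspirin.npz", arr_0=np.zeros((10, 4)))
    np.savez(Path(tmp) / "rmd17_ethanol.npz", arr_0=np.zeros((7, 4)))
    embeddings = load_traj_embeddings(tmp)

for name, arr in embeddings.items():
    print(name, arr.shape)
```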
python -m three_dbench evaluate chirality \
--dataset-dir data/hf/chirality \
--embeddings data/chirality/unimol/1.npz \
--embedding-key arr_0 \
--model-name unimol

python -m three_dbench evaluate rotation \
--dataset-dir data/hf/rotation \
--embeddings /path/to/rotation_embeddings.npz \
--embedding-key arr_0 \
--model-name my_model
To use a dict keyed by rotation entry IDs:
python -m three_dbench evaluate rotation \
--dataset-dir data/hf/rotation \
--embeddings /path/to/rotation_by_key.pkl \
--layout by-key \
--model-name my_model

python -m three_dbench evaluate traj \
--dataset-dir data/hf/traj/energies \
--embeddings data/traj/results/unimol \
--embedding-key arr_0 \
--model-name unimol
The scripts under examples/ download the dataset from HF if it is not cached
locally, then run evaluation using existing embeddings.
python examples/run_chirality_from_hf.py \
--repo-id EscheWang/3dcs \
--config chirality \
--embeddings data/chirality/unimol/1.npz \
--embedding-key arr_0 \
--model-name unimol \
--output-dir results/chirality/unimol

python examples/run_traj_from_hf.py \
--repo-id EscheWang/3dcs \
--config traj_energies \
--embeddings-dir data/traj/results/unimol \
--embedding-key arr_0 \
--model-name unimol \
--output-dir results/traj/unimol

python examples/run_rotation_from_hf.py \
--repo-id EscheWang/3dcs \
--config rotation \
--embeddings-dir data/rotation/results/gemnet \
--embedding-key gemnet \
--model-name gemnet \
--output-dir results/rotation/gemnet

python examples/run_rotation_from_hf.py \
--repo-id EscheWang/3dcs \
--config rotation \
--embeddings-dir data/rotation/results/gemnet \
--embedding-key gemnet \
--model-name gemnet \
--output-dir results/rotation/gemnet_quick \
--shards 0 \
--max-keys 200
This section shows the full pipeline when you generate embeddings with your own model. It covers dataset download, embedding generation, format validation, and evaluation.
# Chirality dataset
python - <<'PY'
from datasets import load_dataset
ds = load_dataset("EscheWang/3dcs", name="chirality", split="train")
ds.save_to_disk("data/hf/chirality")
PY
# Trajectory energies
python - <<'PY'
from datasets import load_dataset
ds = load_dataset("EscheWang/3dcs", name="traj_energies", split="train")
ds.save_to_disk("data/hf/traj/energies")
PY
# Rotation dataset
python - <<'PY'
from datasets import load_dataset
ds = load_dataset("EscheWang/3dcs", name="rotation", split="train")
ds.save_to_disk("data/hf/rotation")
PY
Use MolBlocks from the HF datasets to feed your model and generate fixed-length vectors.
Detailed formats and examples are in docs/EMBEDDINGS.md.
Minimal skeleton for chirality:
from datasets import load_from_disk
from three_dbench.datasets.serialization import mol_from_block
import numpy as np

ds = load_from_disk("data/hf/chirality")
vectors = []
for row in ds:
    # Rebuild one molecule per stored MolBlock.
    mols = [mol_from_block(b, sanitize=False) for b in row["mol_blocks"]]
    for mol in mols:
        vec = your_model(mol)  # placeholder for your own model; must return shape (dim,)
        vectors.append(vec)
# Stack into a flat (total_conformers, dim) array, in dataset order.
embeddings = np.stack(vectors, axis=0)
np.savez("embeddings/my_model_chirality.npz", arr_0=embeddings)
Embedding arrays must align with the dataset order. For flat arrays, the total number
of vectors must equal the sum of n_conformers in the dataset.
from datasets import load_from_disk
import numpy as np
ds = load_from_disk("data/hf/chirality")
expected = int(sum(ds["n_conformers"]))
arr = np.load("embeddings/my_model_chirality.npz")["arr_0"]
assert arr.shape[0] == expected

python -m three_dbench evaluate chirality \
--dataset-dir data/hf/chirality \
--embeddings embeddings/my_model_chirality.npz \
--embedding-key arr_0 \
--model-name my_model \
--output-dir results/chirality/my_model

python -m three_dbench evaluate traj \
--dataset-dir data/hf/traj/energies \
--embeddings embeddings/traj_my_model \
--embedding-key arr_0 \
--model-name my_model \
--output-dir results/traj/my_model
embeddings/traj_my_model should contain one rmd17_*.npz file per molecule.
For rotation, embeddings must be aligned with offset and n_conformers in the
rotation dataset. If you generate per-key embeddings, build a dict mapping key to
(n_conformers, dim) and save as a pickle.
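As a rough sketch of the by-key layout described above, the snippet below slices a flat array into per-key blocks and pickles the resulting dict. It assumes each rotation row exposes key, offset, and n_conformers fields and that offset is the start index of the row's block in the flat array. The helper name flat_to_by_key and the demo keys are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np


def flat_to_by_key(rows, flat):
    """Slice a flat (total, dim) array into {key: (n_conformers, dim)} blocks."""
    by_key = {}
    for row in rows:
        start = row["offset"]
        stop = start + row["n_conformers"]
        if stop > flat.shape[0]:
            raise ValueError(
                f"{row['key']}: slice {start}:{stop} exceeds {flat.shape[0]} rows"
            )
        by_key[row["key"]] = flat[start:stop]
    return by_key


# Synthetic rows standing in for rotation dataset rows (hypothetical keys).
rows = [
    {"key": "entry_a", "offset": 0, "n_conformers": 2},
    {"key": "entry_b", "offset": 2, "n_conformers": 3},
]
flat = np.arange(5 * 4, dtype=np.float32).reshape(5, 4)
by_key = flat_to_by_key(rows, flat)

# Round-trip through a pickle file to confirm the dict survives serialization.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "rotation_by_key.pkl"
    with open(path, "wb") as fh:
        pickle.dump(by_key, fh)
    with open(path, "rb") as fh:
        restored = pickle.load(fh)

print({k: v.shape for k, v in restored.items()})
```

A file written this way is the kind of input the by-key evaluation command expects, assuming the dict keys match the dataset's key values.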
python -m three_dbench evaluate rotation \
--dataset-dir data/hf/rotation \
--embeddings embeddings/rotation_by_key.pkl \
--layout by-key \
--model-name my_model \
--output-dir results/rotation/my_model
- If the evaluation errors with a length mismatch, check the total number of conformers.
- Rotation evaluation requires MolBlocks; do not use --no-mol-blocks during conversion.
- Trajectory evaluation uses energy windows; make sure embeddings cover all frames.
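One way to check frame coverage before running the trajectory evaluation is sketched below. It compares a per-molecule frame count against the first dimension of each embedding array. The expected_frames mapping is hypothetical: you would need to derive it from the energies dataset yourself, and the helper name and demo molecule names are assumptions rather than part of three_dbench.

```python
import numpy as np


def find_frame_mismatches(expected_frames, embeddings):
    """Return {mol_type: message} for molecules that are missing or mis-sized."""
    issues = {}
    for mol_type, n_frames in expected_frames.items():
        arr = embeddings.get(mol_type)
        if arr is None:
            issues[mol_type] = "missing embeddings"
        elif arr.shape[0] != n_frames:
            issues[mol_type] = f"expected {n_frames} frames, got {arr.shape[0]}"
    return issues


# Hypothetical frame counts and embeddings for three molecules.
expected_frames = {"aspirin": 10, "ethanol": 7, "toluene": 5}
traj_embeddings = {
    "aspirin": np.zeros((10, 4)),  # complete
    "ethanol": np.zeros((6, 4)),   # one frame short
}
for mol_type, message in find_frame_mismatches(expected_frames, traj_embeddings).items():
    print(f"{mol_type}: {message}")
```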
Each evaluation writes:
- summary.csv: aggregated statistics
- details.csv or *_per_molecule.json: per-sample metrics
- config.json (trajectory)
Output directory defaults to results/{task}/{model_name} unless overridden.
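For post-processing, the outputs can be located and read with the standard library alone. The sketch below mirrors the documented default path and loads summary.csv without assuming any column names. Both helper names are hypothetical, and the demo file contents are made up purely so the example runs.

```python
import csv
import tempfile
from pathlib import Path


def resolve_output_dir(task, model_name, output_dir=None):
    """Mirror the documented default: results/{task}/{model_name} unless overridden."""
    return Path(output_dir) if output_dir else Path("results") / task / model_name


def read_summary(output_dir):
    """Return summary.csv as a list of dicts, without assuming column names."""
    with open(Path(output_dir) / "summary.csv", newline="") as fh:
        return list(csv.DictReader(fh))


# Demo with a stand-in summary.csv; the column names here are made up.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / resolve_output_dir("chirality", "my_model")
    out.mkdir(parents=True)
    (out / "summary.csv").write_text("metric,value\nexample_metric,0.5\n")
    summary_rows = read_summary(out)

print(summary_rows)
```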