
MetaStack v3: 1.1792 sliding bpb, 10L BigramHash SmearGate OrthoInit SWA #205

Open
xinpw8 wants to merge 3 commits into openai:main from xinpw8:main

Conversation


@xinpw8 xinpw8 commented Mar 20, 2026

10-layer GPT with BigramHash embeddings, SmearGate, OrthoInit, SWA (30 checkpoints), Muon WD=0.04, mixed int5(MLP)/int6(attn) quantization, 2% magnitude pruning. Sliding window eval at stride=64, seq_len=1024.

val_bpb: 1.1792 (sliding window int6+zstd22, 12.1MB artifact) Trained 600s on 8xH100 SXM, seed 1337, step 7819.

Includes: search harness (ProteinLite), Vast.ai deploy pipeline, live GPU monitor, v1/v2 lineage, test suite.
<3
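For readers unfamiliar with the stride=64 sliding-window protocol: each seq_len window is re-scored in full, but only the positions not yet counted contribute to the average, so every token after the first window is evaluated with near-full left context. A minimal sketch of the bookkeeping (the `score_fn` hook and helper name are illustrative, not from this repo):

```python
def sliding_window_bpb(tokens, score_fn, seq_len=1024, stride=64):
    """Average bits per token under sliding-window evaluation.

    score_fn(window) must return one bits value per position in `window`.
    Each window re-scores its whole context, but only the tail positions
    that have not been counted yet contribute to the running total.
    """
    total_bits, scored, start = 0.0, 0, 0
    while scored < len(tokens):
        end = min(start + seq_len, len(tokens))
        bits = score_fn(tokens[start:end])  # bits for every window position
        n_new = end - scored                # positions not yet counted
        total_bits += sum(bits[-n_new:])
        scored = end
        start += stride
    return total_bits / len(tokens)


# a context-free "model" that assigns 8 bits to every byte -> bpb is exactly 8
uniform = lambda window: [8.0] * len(window)
print(sliding_window_bpb(list(range(300)), uniform, seq_len=16, stride=4))  # 8.0
```

With a real model, `score_fn` would run a forward pass over the window; the stride trades evaluation cost (more overlapping forward passes) against context length per scored token.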

xinpw8 added 2 commits March 20, 2026 03:43
10-layer GPT with BigramHash embeddings, SmearGate, OrthoInit,
SWA (30 checkpoints), Muon WD=0.04, mixed int5(MLP)/int6(attn)
quantization, 2% magnitude pruning. Sliding window eval at
stride=64, seq_len=1024.

val_bpb: 1.1792 (sliding window int6+zstd22, 12.1MB artifact)
Trained 600s on 8xH100 SXM, seed 1337, step 7819.

Includes: search harness (ProteinLite), Vast.ai deploy pipeline,
live GPU monitor, v1/v2 lineage, test suite.
Copilot AI review requested due to automatic review settings March 20, 2026 11:41

Copilot AI left a comment


Pull request overview

This PR introduces a standalone hyperparameter search harness (“ProteinLite” + runner/log parser), a live monitoring utility, and a Vast.ai deployment pipeline to support rapid Parameter Golf experimentation and remote execution, alongside new MetaStack v2/v3 record artifacts.

Changes:

  • Add search/ harness (config loader, run launcher, log parsing, live-status tracking, optimizer) plus unit tests.
  • Add Vast.ai deployment scripts (remote bootstrap, config rendering, DDP smoke checks) and a live GPU/training monitor.
  • Add new record folders/config presets for MetaStack v2 WD + MetaStack v3 competitive submission metadata.

Reviewed changes

Copilot reviewed 33 out of 34 changed files in this pull request and generated 9 comments.

Summary per file:
tools/monitor.py Live SSH/local GPU + log + search-run status monitor.
search/__init__.py Declares search package.
search/config.py YAML-backed config loading for runner/search space/settings.
search/log_parser.py Parses training logs into structured metrics + objective selection.
search/protein_lite.py Implements ProteinLite optimizer (Sobol warm start + GP guidance).
search/runner.py Builds commands/env and launches runs while teeing output.
search/run_search.py Orchestrates search loop, live status JSON, and result persistence.
search_configs/metastack_v2_wd_smoke.yaml Smoke search preset for v2 WD.
search_configs/metastack_v2_wd_sliding_local.yaml Local sliding-eval search preset for v2 WD.
search_configs/metastack_v2_wd_sliding_remote.yaml Remote 8-GPU sliding-eval search preset for v2 WD.
tests/test_search_configs.py Unit test coverage for config presets.
tests/test_search_live_status.py Unit tests for live status parsing/phase tracking.
tests/test_search_log_parser.py Unit tests for log parsing/objective selection.
tests/test_search_protein_lite.py Unit tests for deterministic suggestions + scoring semantics.
tests/test_search_run_search.py Unit tests for observation building + log path resolution.
tests/test_search_runner.py Unit tests for command building + launch output streaming.
deploy/vast/requirements.remote.txt Remote runtime deps for search/training bootstrap.
deploy/vast/rsync_excludes.txt Excludes large/ephemeral artifacts from rsync deploy.
deploy/vast/render_remote_config.py Rewrites configs for remote workdir/python/log/output paths.
deploy/vast/remote_bootstrap_and_run.sh End-to-end remote preflight/bootstrap/tests/data fetch/search runner.
deploy/vast/Dockerfile.amd64 Builds an amd64 runtime image with remote deps installed.
deploy/vast/ddp_smoke.py Minimal NCCL all-reduce sanity check.
deploy/vast/deploy_and_launch.py Syncs repo to Vast instance and launches remote workflow over SSH.
deploy/vast/deploy_and_launch.sh Thin wrapper to run deploy_and_launch.py.
deploy/vast/create_vast_template.sh Helper to create Vast template with required env/search params.
deploy/vast/build_amd64_image.sh Helper to build/push amd64 docker image for Vast runs.
records/track_10min_16mb/2026-03-19_MetaStack_v1/train_gpt.py Adds v1 lineage trainer snapshot to records.
records/track_10min_16mb/2026-03-20_MetaStack_v2_WD/train_gpt.py Adds v2 WD trainer snapshot to records.
records/track_10min_16mb/2026-03-20_MetaStack_v2_WD/README.md Documents v2 WD thesis/results/repro steps.
records/track_10min_16mb/2026-03-20_MetaStack_v2_WD/EXPERIMENT_LEDGER.md Append-only experiment ledger for v2 WD.
records/track_10min_16mb/2026-03-20_MetaStack_v3_competitive/README.md Documents v3 competitive model details + repro command.
records/track_10min_16mb/2026-03-20_MetaStack_v3_competitive/submission.json Submission metadata for v3 competitive artifact.
.gitignore Ignores Vast venv + deploy run outputs + logs.


Comment on lines +350 to +363
```python
    for r in completed:
        rid = r.get("run_id", "?")[:35]
        st = r.get("status", "?")[:12]
        pre = r.get("terminal_prequant_bpb")
        i6 = r.get("final_int6_bpb")
        sl = r.get("sliding_window_bpb")
        sz = r.get("int6_artifact_bytes")
        color = "\033[32m" if st == "completed" else "\033[31m"
        pre_s = f"{pre:.4f}" if pre else "N/A"
    i6_s = f"{i6:.4f}" if i6 else "N/A"
    sl_s = f"{sl:.4f}" if sl else "N/A"
    sz_s = format_bytes(sz) if sz else "N/A"
    lines.append(f" {rid:<35} {color}{st:<12}\033[0m {pre_s:>8} {i6_s:>8} {sl_s:>8} {sz_s:>10}")
    lines.append("")
```

Copilot AI Mar 20, 2026


The formatting block for completed search runs is mis-indented: i6_s/sl_s/sz_s and the lines.append(...) call are outside the for r in completed: loop, which will raise UnboundLocalError (or even a syntax/indentation error depending on editor) and only print one row. Re-indent those lines inside the loop (and consider using is not None checks instead of if pre so 0.0 doesn’t display as N/A).
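The truthiness pitfall mentioned at the end is easy to demonstrate in isolation (standalone snippet, not repo code):

```python
pre = 0.0  # a genuine measured metric that happens to be exactly zero

# truthiness check: 0.0 is falsy, so a real measurement is hidden as N/A
wrong = f"{pre:.4f}" if pre else "N/A"
# explicit None check: only a missing value renders as N/A
right = f"{pre:.4f}" if pre is not None else "N/A"

print(wrong)  # N/A
print(right)  # 0.0000
```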

Comment on lines +28 to +33
```python
def run_remote(ssh_cmd: str, cmd: str, timeout: int = 10) -> str | None:
    try:
        result = subprocess.run(
            ssh_cmd.split() + [cmd],
            capture_output=True, text=True, timeout=timeout,
        )
```

Copilot AI Mar 20, 2026


run_remote() tokenizes the provided SSH command with ssh_cmd.split(), which breaks when flags include quoted values (e.g. -o StrictHostKeyChecking=accept-new) or paths with spaces. Use shlex.split(ssh_cmd) to parse the SSH command line robustly.
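The difference is easy to see with a quoted identity-file path (values illustrative):

```python
import shlex

ssh_cmd = 'ssh -i "/home/me/my key.pem" -o StrictHostKeyChecking=accept-new host'

naive = ssh_cmd.split()        # splits inside the quoted path
robust = shlex.split(ssh_cmd)  # honors shell quoting rules

print(naive[2])   # "/home/me/my   (broken token)
print(robust[2])  # /home/me/my key.pem
```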

Comment on lines +41 to +56
```python
def run_local(cmd: str, timeout: int = 10) -> str | None:
    try:
        result = subprocess.run(
            cmd, shell=True, capture_output=True, text=True, timeout=timeout,
        )
        if result.returncode == 0:
            return result.stdout
    except (subprocess.TimeoutExpired, Exception):
        pass
    return None


def run_cmd(ssh_cmd: str | None, cmd: str, timeout: int = 10) -> str | None:
    if ssh_cmd:
        return run_remote(ssh_cmd, cmd, timeout)
    return run_local(cmd, timeout)
```

Copilot AI Mar 20, 2026


run_local() executes cmd with shell=True while many call sites interpolate user-provided workdir/paths into the command string. This is both fragile (spaces/quoting) and exposes command-injection risk if the monitor is pointed at an untrusted workdir. Prefer subprocess.run([...]) with argument lists (and quote/escape remote paths similarly).

Comment on lines +99 to +135
```python
WD_RE = re.compile(r"weight_decay\s+token:([\d.]+)\s+head:([\d.]+)\s+muon:([\d.]+)\s+scalar:([\d.]+)")


def parse_training_log(text: str) -> dict:
    """Parse key metrics from a training log."""
    info: dict = {}
    last_step = None
    last_val = None
    for line in text.splitlines():
        m = STEP_RE.search(line)
        if m:
            step, total = int(m.group(1)), int(m.group(2))
            last_step = {"step": step, "total": total, "time_ms": int(m.group(5)), "avg_ms": float(m.group(6))}
            if m.group(3):
                last_val = {"val_loss": float(m.group(3)), "val_bpb": float(m.group(4)), "step": step, "total": total}
        m = QUANT_RE.search(line)
        if m:
            info["quant"] = {"int8_bpb": float(m.group(1)), "int6_bpb": float(m.group(2)),
                            "int8_sz": int(m.group(3)), "int6_sz": int(m.group(4))}
        m = ROUNDTRIP_RE.search(line)
        if m:
            info.setdefault("roundtrips", {})[m.group(1)] = {"bpb": float(m.group(3)), "time_ms": int(m.group(4))}
        m = SLIDING_RE.search(line)
        if m and "sliding_window:progress" not in line:
            info["sliding"] = {"bpb": float(m.group(2)), "complete": True}
        m = SLIDING_PROG_RE.search(line)
        if m:
            info["sliding_progress"] = {"pct": float(m.group(1)), "wps": float(m.group(2))}
        if STOP_RE.search(line):
            info["stopped_early"] = True
        m = PARAMS_RE.search(line)
        if m:
            info["params"] = int(m.group(1))
        m = WD_RE.search(line)
        if m:
            info["wd"] = {"token": m.group(1), "muon": m.group(3), "scalar": m.group(4)}
    if last_step:
```

Copilot AI Mar 20, 2026


parse_training_log() matches weight_decay token:<...> head:<...> muon:<...> scalar:<...> but the captured head value is dropped and the remaining values are stored as strings. This makes the displayed WD incomplete/inconsistent; store all four fields and parse them as floats for reliable formatting/comparisons.

Comment on lines +9 to +19
```python
ROOT = Path("/home/spark-advantage/parameter-golf")


class SearchConfigPresetTests(unittest.TestCase):
    def test_v2_wd_sliding_local_has_weight_decay_knobs(self):
        config = load_search_config(ROOT / "search_configs/metastack_v2_wd_sliding_local.yaml")
        self.assertIn("MUON_WEIGHT_DECAY", config.search_space)
        self.assertIn("SCALAR_WEIGHT_DECAY", config.search_space)
        self.assertEqual(config.fixed_env["TOKEN_WEIGHT_DECAY"], 0.0)
        self.assertEqual(config.fixed_env["HEAD_WEIGHT_DECAY"], 0.0)
        self.assertIn("2026-03-20_MetaStack_v2_WD/train_gpt.py", str(config.runner.script_path))
```

Copilot AI Mar 20, 2026


This test hard-codes ROOT = Path("/home/spark-advantage/parameter-golf"), which will fail on CI and any developer machine with a different checkout location. Derive the repo root from __file__ (e.g., Path(__file__).resolve().parents[1]) or use relative paths so the test is portable.
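The portable pattern looks like this, assuming the test file lives one directory below the repo root (paths illustrative):

```python
from pathlib import Path

# tests/test_search_configs.py -> parents[0] is tests/, parents[1] is the repo root
ROOT = Path(__file__).resolve().parents[1]

CONFIG = ROOT / "search_configs" / "metastack_v2_wd_sliding_local.yaml"
```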

Comment on lines +127 to +151
```bash
start_monitors() {
  (
    printf 'timestamp,index,name,util_gpu,util_mem,mem_used_mb,mem_total_mb,power_w,temp_c\n'
    while true; do
      local now
      now="$(timestamp)"
      while IFS=, read -r idx name util_gpu util_mem mem_used mem_total power temp; do
        [[ -z "${idx:-}" ]] && continue
        printf '%s,%s,%s,%s,%s,%s,%s,%s,%s\n' "$now" "$idx" "$name" "$util_gpu" "$util_mem" "$mem_used" "$mem_total" "$power" "$temp"
      done < <(nvidia-smi --query-gpu=index,name,utilization.gpu,utilization.memory,memory.used,memory.total,power.draw,temperature.gpu --format=csv,noheader,nounits 2>/dev/null || true)
      sleep 5
    done
  ) >"$GPU_MONITOR_LOG" &
  GPU_MONITOR_PID="$!"

  (
    printf 'timestamp,fs_size_bytes,fs_used_bytes,fs_avail_bytes,mem_total_bytes,mem_available_bytes,load1\n'
    while true; do
      local now fs size used avail mem_total mem_avail load1
      now="$(timestamp)"
      read -r size used avail < <(df -B1 --output=size,used,avail "$WORKDIR" | awk 'NR==2 {print $1, $2, $3}')
      read -r mem_total mem_avail < <(free -b | awk '/^Mem:/ {print $2, $7}')
      read -r load1 _ < /proc/loadavg
      printf '%s,%s,%s,%s,%s,%s,%s\n' "$now" "$size" "$used" "$avail" "$mem_total" "$mem_avail" "$load1"
      sleep 15
```

Copilot AI Mar 20, 2026


start_monitors() runs its loops inside subshells (...) but uses local now / local now fs ... inside those subshells. In bash, local is only valid inside a function, so this will print an error and can stop the monitor loops. Remove local (or restructure into a real helper function invoked normally).

```bash
run_unit_tests() {
  CURRENT_PHASE="unit_tests"
  write_status "$CURRENT_PHASE" "running" "running deployment unit tests"
  run_logged "unit tests" "$PYTHON_BIN" -m unittest tests.test_search_runner tests.test_vast_render_remote_config
```

Copilot AI Mar 20, 2026


The remote bootstrap script runs python -m unittest tests.test_search_runner tests.test_vast_render_remote_config, but tests/test_vast_render_remote_config.py is not present in this PR/repo. This will make every remote run fail during the unit test phase; either add the missing test module or remove/replace it with the intended test(s).

Suggested change

```diff
-run_logged "unit tests" "$PYTHON_BIN" -m unittest tests.test_search_runner tests.test_vast_render_remote_config
+run_logged "unit tests" "$PYTHON_BIN" -m unittest tests.test_search_runner
```

```bash
VAST_BIN="${VAST_BIN:-$ROOT_DIR/.venv-vastai/bin/vastai}"
TEMPLATE_NAME="${PG_TEMPLATE_NAME:-parameter-golf-8xh100}"
IMAGE="${PG_TEMPLATE_IMAGE:-nvcr.io/nvidia/pytorch:25.12-py3}"
DISK_SPACE="${PG_TEMPLATE_DISK_GB:-200}"
```

Copilot AI Mar 20, 2026


Defaults are inconsistent: DISK_SPACE defaults to 200GB, but SEARCH_PARAMS includes disk_space>=300. With defaults, the created template configuration won’t match its own search filter expectations. Align the default PG_TEMPLATE_DISK_GB with the search params (or drop disk_space>=... from the default filter).

Suggested change

```diff
-DISK_SPACE="${PG_TEMPLATE_DISK_GB:-200}"
+DISK_SPACE="${PG_TEMPLATE_DISK_GB:-300}"
```

Comment on lines +8 to +11
```python
from scipy.stats.qmc import Sobol
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, DotProduct, Matern, WhiteKernel
```

Copilot AI Mar 20, 2026


This module unconditionally imports scipy and scikit-learn, but the repo’s top-level requirements.txt doesn’t include them. As-is, a clean install will fail to import search.protein_lite (and the added unit tests will fail). Add these dependencies to the standard requirements (or make the import optional with a clear error message/instructions).

Suggested change (the original suggestion omitted the `import warnings` that its fallback path needs; it is added here):

```diff
-from scipy.stats.qmc import Sobol
-from sklearn.gaussian_process import GaussianProcessRegressor
-from sklearn.gaussian_process.kernels import ConstantKernel, DotProduct, Matern, WhiteKernel
+import warnings
+
+try:
+    from scipy.stats.qmc import Sobol
+    from sklearn.gaussian_process import GaussianProcessRegressor
+    from sklearn.gaussian_process.kernels import ConstantKernel, DotProduct, Matern, WhiteKernel
+    _SCIPY_SKLEARN_AVAILABLE = True
+except ImportError:  # pragma: no cover - optional dependencies
+    _SCIPY_SKLEARN_AVAILABLE = False
+
+    class _MissingOptionalDependency:
+        """Placeholder that raises a clear error if used without the optional deps."""
+
+        def __init__(self, *args, **kwargs):
+            raise ImportError(
+                "Optional dependencies 'scipy' and 'scikit-learn' are required for "
+                "Sobol sampling and Gaussian process models in 'search.protein_lite'. "
+                "Install them with:\n\n    pip install scipy scikit-learn\n"
+            )
+
+    Sobol = _MissingOptionalDependency
+    GaussianProcessRegressor = _MissingOptionalDependency
+    ConstantKernel = _MissingOptionalDependency
+    DotProduct = _MissingOptionalDependency
+    Matern = _MissingOptionalDependency
+    WhiteKernel = _MissingOptionalDependency
+
+    warnings.warn(
+        "Optional dependencies 'scipy' and 'scikit-learn' are not installed. "
+        "Functionality requiring Sobol sampling or Gaussian process models in "
+        "'search.protein_lite' will not be available.",
+        RuntimeWarning,
+    )
```

- monitor.py: fix indentation bug in search results display loop
- monitor.py: include dropped 'head' weight_decay field in parsed log
- test_search_configs.py: derive repo root from __file__ instead of hardcoded path
- create_vast_template.sh: align disk default (300GB) with search filter
- commit missing test_vast_render_remote_config.py referenced by bootstrap