Fix kernel panic after unattended kernel upgrade #3

Open: pnc wants to merge 11 commits into eleostech:main from pnc:kernel-panic

Conversation

pnc (Member) commented May 15, 2026

The update-initramfs symlink (to /bin/true) made the machine unbootable after updating the kernel (no real surprise there).

Claude explanation:

When unattended-upgrades installed a new kernel, its postinst called update-initramfs which silently did nothing. GRUB picked up the new vmlinuz but with no initrd line. On next boot the kernel couldn't load virtio_blk (module in initramfs), so the root disk was invisible and it panicked with "VFS: Unable to mount root fs".

Fix: divert only during first-boot provisioning (guarded by boot-finished), then restore in runcmd so future kernel upgrades generate a working initramfs.
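
For readers who want to see the shape of that guard, here is a rough sketch. It is not the code from this PR: the function names and the `.real` suffix are made up for illustration, and it assumes the marker is cloud-init's usual `/var/lib/cloud/instance/boot-finished` file and that the tool lives at `/usr/sbin/update-initramfs`.

```shell
#!/bin/sh
# Sketch only: names and paths are assumptions, not the PR's actual code.

# cloud-init creates this marker once the first boot has completed.
BOOT_FINISHED="${BOOT_FINISHED:-/var/lib/cloud/instance/boot-finished}"
TOOL=/usr/sbin/update-initramfs

# True only while first-boot provisioning is still in progress.
first_boot_in_progress() {
    [ ! -e "$BOOT_FINISHED" ]
}

# bootcmd side: park the real tool and put a no-op in its place.
# bootcmd runs on every boot, so the guard keeps later boots untouched.
divert_update_initramfs() {
    first_boot_in_progress || return 0
    dpkg-divert --local --rename --divert "$TOOL.real" --add "$TOOL"
    ln -sf /bin/true "$TOOL"
}

# runcmd side: drop the no-op, bring the real tool back, rebuild once.
restore_update_initramfs() {
    [ -e "$TOOL.real" ] || return 0   # nothing diverted, nothing to undo
    rm -f "$TOOL"
    dpkg-divert --local --rename --remove "$TOOL"
    update-initramfs -u -k all
}
```

The `restore_update_initramfs` half is also roughly what an already-provisioned VM would need to run by hand before its next kernel upgrade.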

Add e2e test that installs a second kernel flavor, verifies the initrd is created, reboots, and confirms the VM comes back on the new kernel.

The update-initramfs diversion (to /bin/true) was permanent — applied
in bootcmd with no restore.  When unattended-upgrades installed a new
kernel, its postinst called update-initramfs which silently did nothing.
GRUB picked up the new vmlinuz but with no initrd line.  On next boot
the kernel couldn't load virtio_blk (module in initramfs), so the root
disk was invisible and it panicked with "VFS: Unable to mount root fs".

Fix: divert only during first-boot provisioning (guarded by
boot-finished), then restore in runcmd so future kernel upgrades
generate a working initramfs.

Add e2e test that installs a second kernel flavor, verifies the initrd
is created, reboots, and confirms the VM comes back on the new kernel.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
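
The failure mode above (a new vmlinuz with no matching initrd) is easy to check for directly. A minimal sketch of such a check follows; `missing_initrds` is a hypothetical helper, not the e2e test from this PR, and it assumes Debian's `vmlinuz-<version>` / `initrd.img-<version>` naming under `/boot`.

```shell
# Print every kernel version under a boot directory that has a vmlinuz
# but no matching initrd.img. Prints nothing when all kernels are covered.
missing_initrds() {
    bootdir="${1:-/boot}"
    for kernel in "$bootdir"/vmlinuz-*; do
        [ -e "$kernel" ] || continue          # glob matched nothing
        version="${kernel##*/vmlinuz-}"
        [ -e "$bootdir/initrd.img-$version" ] || echo "$version"
    done
}
```

Running `missing_initrds` with no argument on an affected VM should list the kernel that would fail to boot.
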
@pnc pnc requested a review from ddellacosta May 15, 2026 12:05
pnc (Member, Author) commented May 15, 2026

@ddellacosta I finally had a little time to dig into the kernel panic we both saw. The call was coming from inside the house (overzealous "speed" hack!)

pnc (Member, Author) commented May 15, 2026

@ddellacosta You will probably want to undivert/unsymlink (per the "Undo" bit) your current VM and re-run update-initramfs so you don't lose your work again.

@ddellacosta

Gotcha, will give it a shot!

pnc and others added 6 commits May 15, 2026 14:14
The cloud kernel package name includes the arch suffix
(linux-image-cloud-arm64 vs linux-image-cloud-amd64). Use
dpkg --print-architecture inside the guest to pick the right one.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
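
As an illustration of the arch lookup described in that commit, something like the following would do. The function name is hypothetical; the optional argument exists only so the helper can be exercised outside a Debian guest.

```shell
# Name of the Debian cloud kernel package for an architecture.
# With no argument, asks dpkg for the architecture of the running system.
cloud_kernel_package() {
    arch="${1:-$(dpkg --print-architecture)}"
    echo "linux-image-cloud-$arch"
}
```

Inside the guest this would typically be used as `apt-get install -y "$(cloud_kernel_package)"`.
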
Set MODULES=dep in initramfs-tools so only modules for detected
hardware (virtio) are included, instead of hundreds of bare-metal
drivers. Also bump the kernel install test timeout from 300s to 600s
as a safety margin under TCG emulation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
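
One plausible way to apply that setting is a drop-in file under initramfs-tools' `conf.d` directory, sketched below. The helper name and the drop-in file name `modules` are assumptions; the PR may set it elsewhere (for example directly in `initramfs.conf`).

```shell
# Write a drop-in that tells initramfs-tools to include only the modules
# needed for the hardware detected on this machine.
write_modules_dep() {
    confdir="${1:-/etc/initramfs-tools/conf.d}"
    mkdir -p "$confdir"
    printf 'MODULES=dep\n' > "$confdir/modules"
}
```

The setting only takes effect the next time an initramfs is generated, so it has to be in place before the kernel install runs.
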
The kernel reboot test hardcoded a partition UUID specific to the
current Debian cloud image. Query it from the running VM via
grub-probe so the test survives base image updates.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The previous /v2/* wildcard allowed pulling any Docker image, which is too
broad for a security-focused sandbox.  Scope to library/hello-world
(used by the e2e test) and add the bare /v2/ endpoint required for
registry version checks.  Also add CloudFront CDN (some Docker blob
redirects land there instead of R2) and the uv installer redirect
path on releases.astral.sh.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
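
To make the narrowed scope concrete, here is a sketch of the path rule as a shell `case` match. The project's real allowlist format is not shown in this PR, so treat the function as a hypothetical illustration of which registry paths pass and which do not.

```shell
# Return 0 if a registry request path is on the allowlist, 1 otherwise.
registry_path_allowed() {
    case "$1" in
        /v2/)                      return 0 ;;  # bare version-check endpoint
        /v2/library/hello-world/*) return 0 ;;  # image used by the e2e test
        *)                         return 1 ;;  # everything else is denied
    esac
}
```

Under the old `/v2/*` rule the last branch would never have been reached for any image path.
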
The official install script runs `claude install` after downloading
the binary, which maps ~70 GB of virtual memory and gets OOM-killed
in 512 MB VMs.  Download the binary directly via curl and create
the symlink ourselves, bypassing the problematic subcommand.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Verify that both tools are installed and on PATH after cloud-init
provisioning.  Also update CLAUDE.md to note that tests run inside
this VM.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

pnc and others added 4 commits May 21, 2026 22:07
The manual binary download was a workaround for OOM in 512M VMs.
With the test VM bumped to 2G (next commit), the official installer
works correctly.  This reverts to `curl | bash` which handles
version detection, binary placement, and symlink creation.

This reverts the cloud-init portion of 67bef94.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

512M was not enough for the Claude Code binary: it maps ~70GB of
virtual address space on startup and fails silently (rc=255) when
the kernel denies the allocation.  Bump to 2G so the official
installer and runtime both work.

Also collect file type, shared library, dmesg, and memory info
when claude --version fails, so future failures are diagnosable
without guesswork.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
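
A diagnostics collector along those lines could look like the sketch below. The function name, section headers, and the choice of `/proc/meminfo` for memory info are assumptions, not the test code from this PR. Each probe tolerates a missing tool so the collector itself never fails.

```shell
# Print diagnostics for a binary that failed to start.
collect_diagnostics() {
    bin="$1"
    echo "== file type =="
    if command -v file >/dev/null 2>&1; then
        file "$bin" || true
    else
        echo "(file not installed)"
    fi
    echo "== shared libraries =="
    if command -v ldd >/dev/null 2>&1; then
        ldd "$bin" 2>&1 || true       # static binaries make ldd exit non-zero
    else
        echo "(ldd not installed)"
    fi
    echo "== kernel log (tail) =="
    dmesg 2>/dev/null | tail -n 20 || true
    echo "== memory =="
    if [ -r /proc/meminfo ]; then
        head -n 5 /proc/meminfo
    else
        echo "(no /proc/meminfo)"
    fi
}
```

A typical call site would be `claude --version || collect_diagnostics "$(command -v claude)"`.
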
Two fixes for the Claude Code install failure on Linux CI:

1. Change the TCG CPU model from qemu64 to max.  The Claude Code
   x86_64 binary uses instructions that qemu64 doesn't emulate,
   causing an invalid-opcode trap (visible in dmesg).

2. Download the binary directly instead of using the official
   installer.  The installer runs `claude install` which maps ~70 GB
   of virtual memory — even with cpu=max, this takes so long under
   TCG that cloud-init times out.

Validated locally by reproducing the CI failure:
  - TCG + qemu64 + official installer → invalid opcode (FAIL)
  - TCG + max   + official installer → cloud-init timeout (FAIL)
  - TCG + max   + direct download    → PASS

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add a section explaining how to reproduce CI failures locally
by forcing TCG mode with QEMU_ACCEL=tcg, and the expected
reproduce → fix → verify → full suite workflow.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
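
As a sketch of how `QEMU_ACCEL` might map onto QEMU flags: the function below is hypothetical, and the `kvm` default and `-cpu host` pairing are assumptions about the non-TCG path. Only the `tcg` plus `max` combination comes from the commits above.

```shell
# Print accelerator and CPU flags for QEMU based on QEMU_ACCEL.
# Under TCG, the "max" model enables every CPU feature the emulator
# supports, which avoids the invalid-opcode trap seen with qemu64.
qemu_accel_args() {
    case "${QEMU_ACCEL:-kvm}" in
        tcg) echo "-accel tcg -cpu max" ;;
        *)   echo "-accel ${QEMU_ACCEL:-kvm} -cpu host" ;;
    esac
}
```

Exporting `QEMU_ACCEL=tcg` before launching the test VM then reproduces the slow, fully emulated CI environment on a local machine.
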
@pnc pnc marked this pull request as draft May 22, 2026 16:12
@pnc pnc marked this pull request as ready for review May 22, 2026 18:06
pnc (Member, Author) commented May 22, 2026

@ddellacosta At long last, passing tests.
