Postmortem: Claude Code Hanging in Sandbox
Date: 2026-01-19
Severity: P0
Duration: ~4 hours
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
SUMMARY
Production agent runs failed silently: Claude Code started but never produced output, and each run timed out after 15+ minutes.
Root Cause: stdin was configured as "pipe" but never closed, causing Claude Code to hang waiting for EOF.
Why CI missed it: CI uses mock-claude, which doesn't check stdin state. Real Claude Code does.
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
THE BUG
Before (hangs - stdin pipe never closed):
spawn(cmd, args, { stdio: ["pipe", "pipe", "pipe"] })
After (works - stdin is /dev/null, immediate EOF):
spawn(cmd, args, { stdio: ["ignore", "pipe", "pipe"] })
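To make the difference concrete, here is a minimal sketch that reproduces the hang with a hypothetical stand-in child (a small Node script that waits for EOF on stdin) rather than real Claude Code. The names runChild and CHILD_SCRIPT and the timeout values are illustrative assumptions, not part of the actual fix. The sketch also shows that calling child.stdin.end() on a "pipe" stdin is another way to deliver EOF.

```javascript
const { spawn } = require("node:child_process");

// Hypothetical stand-in for a CLI that waits for EOF on stdin before working.
const CHILD_SCRIPT = `
  process.stdin.resume();
  process.stdin.on("end", () => { console.log("done"); });
`;

// Spawn the stand-in with the given stdin mode and report whether it finished
// before the deadline. closeStdin only has an effect when stdinMode is "pipe".
function runChild(stdinMode, { closeStdin = false, timeoutMs = 5000 } = {}) {
  return new Promise((resolve) => {
    const child = spawn(process.execPath, ["-e", CHILD_SCRIPT], {
      stdio: [stdinMode, "pipe", "pipe"],
    });
    let stdout = "";
    let timedOut = false;
    child.stdout.on("data", (chunk) => { stdout += chunk; });

    if (stdinMode === "pipe" && closeStdin) {
      child.stdin.end(); // closing the write end delivers EOF to the child
    }

    // Kill the child if it is still waiting when the deadline passes.
    const timer = setTimeout(() => {
      timedOut = true;
      child.kill("SIGKILL");
    }, timeoutMs);

    child.on("close", () => {
      clearTimeout(timer);
      resolve({ timedOut, stdout: stdout.trim() });
    });
  });
}

async function main() {
  console.log("pipe, never closed:", await runChild("pipe", { timeoutMs: 500 }));
  console.log("pipe, closed:", await runChild("pipe", { closeStdin: true }));
  console.log("ignore:", await runChild("ignore"));
}
main();
```

Only the first case should hit the timeout. "ignore" is the simpler fix when the parent never needs to write to the child, since there is no pipe left to forget to close.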
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
HOW WE FOUND IT
SSH into sandbox
cat /tmp/vm0-agent-*.log → empty (no Claude output)
ps aux | grep claude → process alive, using 23% memory
ps -p 510 -o wchan → ep_pol (blocked in an epoll wait, i.e. idle waiting on I/O)
ls -la /proc/510/fd/0 → stdin connected to pipe
Manual "claude --print hello" → works (TTY mode: stdin is the terminal, not an unclosed pipe)
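The /proc check in the steps above can be scripted. The following is a rough sketch, assuming Linux and permission to read the target process's /proc entries. The helper names classifyFdTarget and inspectStdin are hypothetical and were not part of the investigation.

```javascript
const fs = require("node:fs");

// Map a /proc/<pid>/fd/0 symlink target to a coarse stdin kind.
function classifyFdTarget(target) {
  if (target.startsWith("pipe:")) return "pipe";
  if (target === "/dev/null") return "null";
  if (target.startsWith("/dev/pts/") || target.startsWith("/dev/tty")) return "tty";
  if (target.startsWith("socket:")) return "socket";
  return "other";
}

// Linux only: resolve where a process's stdin (fd 0) points.
// Throws if the process is gone, fd 0 is closed, or access is denied.
function inspectStdin(pid) {
  const target = fs.readlinkSync(`/proc/${pid}/fd/0`);
  return { target, kind: classifyFdTarget(target) };
}

// Demo: inspect this script's own stdin when running on Linux.
if (process.platform === "linux") {
  try {
    console.log(inspectStdin(process.pid));
  } catch (err) {
    console.error("could not inspect stdin:", err.message);
  }
}
```

A kind of "pipe" on a process that is alive but silent is the pattern seen in this incident. "null" is what the fixed spawn call should show.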
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
WHY NO ROLLBACK
The release included a database migration, so a forward fix was safer than a rollback.
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
INVESTIGATION NOISE
Runner npm publish failure (@vm0/core not built) was unrelated but consumed investigation time.
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
ACTION ITEMS
✅ Fix stdin → "ignore" (#1316)
✅ Add spawn unit tests (#1319)
✅ Fix CI publish jobs (#1306, #1318)
🔲 Add real Claude test in CI (TODO)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
KEY LESSON
Mock ≠ Real: CI must include at least one test with real Claude Code to catch behavior differences like stdin handling.
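One possible shape for such a test is sketched below. It assumes the CI environment has the real CLI installed and authenticated, and that a one-minute deadline is acceptable. The helper name smokeTest and its default timeout are illustrative, not the team's actual test code.

```javascript
const { spawn } = require("node:child_process");

// Run a command with the production stdio settings (stdin ignored) and fail
// fast on a hang, a non-zero exit, or empty output.
function smokeTest(cmd, args, timeoutMs = 60000) {
  return new Promise((resolve, reject) => {
    const child = spawn(cmd, args, { stdio: ["ignore", "pipe", "pipe"] });
    let stdout = "";
    child.stdout.on("data", (chunk) => { stdout += chunk; });
    child.stderr.resume(); // drain stderr so a full pipe cannot block the child

    const timer = setTimeout(() => {
      child.kill("SIGKILL");
      reject(new Error(`${cmd} did not finish within ${timeoutMs} ms`));
    }, timeoutMs);

    // Covers spawn failures such as the binary not being installed.
    child.on("error", (err) => {
      clearTimeout(timer);
      reject(err);
    });

    child.on("close", (code) => {
      clearTimeout(timer);
      if (code !== 0) return reject(new Error(`${cmd} exited with code ${code}`));
      if (stdout.trim() === "") return reject(new Error(`${cmd} produced no output`));
      resolve(stdout);
    });
  });
}
```

In CI this could be invoked as smokeTest("claude", ["--print", "hello"]), the same command used for the manual check during the investigation. To guard against a regression, the stdio array would need to come from the production spawn helper rather than being hard-coded as it is here.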