Guest Download Failure Due to Parallel Race Condition on Overlapping Mount Paths
What Happened
On 2026-04-10, agent jobs that mount multiple storages failed during VM initialization. The guest-download binary, which runs inside the VM to download and extract storage archives, crashed with ENOENT errors from canonicalize while processing skills storages.
Root Cause
guest-download extracts storage archives in parallel (up to 4 concurrent threads). A recent change added a remove_dir_all(target_path) call at the start of each thread to clean stale files on VM reuse (keep-alive).
The storage mount paths have a guaranteed parent-child overlap:
- Instructions mount at /home/user/.claude
- Skills mount at /home/user/.claude/skills/{name}
When threads run concurrently, the parent path's remove_dir_all deletes child directories already created by sibling threads, causing those threads to fail with ENOENT.
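The interleaving described above can be replayed sequentially so that it fails every time. This is a minimal sketch, not the guest-download code: the function name replay_race, the skill name "demo", and the use of a scratch root directory in place of /home/user are assumptions made for illustration.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Replays the failing interleaving in a fixed order and returns the error
/// kind seen by the "skills" worker, or None if canonicalize succeeded.
/// `root` stands in for /home/user (assumption: any writable scratch dir).
fn replay_race(root: &Path) -> io::Result<Option<io::ErrorKind>> {
    let parent = root.join(".claude"); // instructions mount
    let child = parent.join("skills").join("demo"); // skills mount, hypothetical name

    // Skills worker: creates its target directory.
    fs::create_dir_all(&child)?;

    // Instructions worker: pre-cleans its own target, which is an ancestor
    // of the skills target, so the directory created above is deleted too.
    fs::remove_dir_all(&parent)?;
    fs::create_dir_all(&parent)?;

    // Skills worker: resumes and resolves its target path, which is gone.
    match fs::canonicalize(&child) {
        Ok(_) => Ok(None),
        Err(e) => Ok(Some(e.kind())),
    }
}
```

Calling replay_race on an empty scratch directory should yield Some(ErrorKind::NotFound), the ENOENT seen in the incident. In the real binary the same ordering only occurs when the thread scheduler happens to produce it, which is why the failure was intermittent.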
Impact
Timeline (UTC)
| Time | Event |
|---|---|
| 2026-04-09 16:38 | Code change merged: added remove_dir_all pre-cleanup in parallel download threads |
| 2026-04-10 ~02:00 | Job failures reported on prod-3 |
| 2026-04-10 ~02:45 | Root cause identified via prod SSH log analysis |
| 2026-04-10 03:21 | Fix merged and deployed |
Fix
- Removed remove_dir_all from download_and_extract(). Threads now only do create_dir_all + streaming tar extraction.
- Disabled --keep-alive in CI and production to avoid VM reuse until proper stale file cleanup is implemented (#8757).
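For illustration only, one way stale-file cleanup could avoid this race is to do all deletion sequentially before any worker starts, and to delete only the targets that are not nested under another target. This sketch is an assumption, not the design tracked in #8757; cleanup_roots and prepare_targets are hypothetical names, and the tar extraction step is omitted.

```rust
use std::fs;
use std::io;
use std::path::PathBuf;
use std::thread;

/// Targets that are not nested under any other target. Removing these
/// also removes every nested target, so nothing else needs deleting.
fn cleanup_roots(targets: &[PathBuf]) -> Vec<PathBuf> {
    targets
        .iter()
        .filter(|t| !targets.iter().any(|o| o != *t && t.starts_with(o)))
        .cloned()
        .collect()
}

/// Hypothetical two-phase preparation: sequential cleanup, then parallel setup.
fn prepare_targets(targets: &[PathBuf]) -> io::Result<()> {
    // Phase 1 (sequential): remove stale trees before any worker exists,
    // so no worker can have its directory deleted underneath it.
    for root in cleanup_roots(targets) {
        match fs::remove_dir_all(&root) {
            Ok(()) => {}
            // Nothing to clean on a fresh VM.
            Err(e) if e.kind() == io::ErrorKind::NotFound => {}
            Err(e) => return Err(e),
        }
    }

    // Phase 2 (parallel): workers only create directories. Archive
    // extraction would follow here in each worker.
    thread::scope(|s| {
        let handles: Vec<_> = targets
            .iter()
            .map(|t| s.spawn(move || fs::create_dir_all(t)))
            .collect();
        handles
            .into_iter()
            .try_for_each(|h| h.join().expect("worker panicked"))
    })
}
```

With the two mount paths from this incident, cleanup_roots returns only the instructions path, since the skills path sits beneath it. The sketch spawns one thread per target for brevity rather than capping concurrency at 4.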