Affected
Major outage from 2:14 AM to 3:51 AM, Operational from 3:51 AM to 12:00 AM
- Postmortem
Guest Download Failure Due to Parallel Race Condition on Overlapping Mount Paths
What Happened
On 2026-04-10, agent jobs with multiple storages failed during VM initialization. The guest-download binary, which runs inside the VM to download and extract storage archives, crashed with canonicalize ENOENT errors when processing skills storages.
Root Cause
guest-download extracts storage archives in parallel (up to 4 concurrent threads). A recent change added a remove_dir_all(target_path) call at the start of each thread to clean stale files on VM reuse (keep-alive).
The storage mount paths have a guaranteed parent-child overlap:
- Instructions mount at /home/user/.claude
- Skills mount at /home/user/.claude/skills/{name}
When threads run concurrently, the parent path's remove_dir_all deletes child directories already created by sibling threads, causing those threads to fail with ENOENT.
Impact
Scope: All jobs
Duration: ~11 hours (2026-04-09 16:38 UTC — 2026-04-10 03:21 UTC)
Timeline (UTC)
Time | Event
2026-04-09 16:38 | Code change merged: added remove_dir_all pre-cleanup in parallel download threads
2026-04-10 ~02:00 | Job failures reported on prod-3
2026-04-10 ~02:45 | Root cause identified via prod SSH log analysis
2026-04-10 03:21 | Fix merged and deployed
Fix
- Removed remove_dir_all from download_and_extract(); threads now only do create_dir_all + streaming tar extraction
- Disabled --keep-alive in CI and production to avoid VM reuse until proper stale file cleanup is implemented (#8757)
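The interleaving described under Root Cause, and the effect of removing the pre-cleanup, can be replayed deterministically on a single thread. The sketch below is illustrative only and is not the actual guest-download code: the function names, the `root` parameter (standing in for /home/user), and the skill name `example` are assumptions, and only standard-library `std::fs` calls are used.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Replays the failing order of operations. `root` stands in for /home/user
/// (hypothetical parameter so the sketch can run in a scratch directory).
fn replay_racy_order(root: &Path) -> io::Result<PathBuf> {
    let parent = root.join(".claude"); // instructions mount
    let child = parent.join("skills").join("example"); // skills mount (hypothetical name)

    // "Skills thread": pre-clean its own target, then create it.
    // The result is ignored because the path may not exist on a fresh VM.
    let _ = fs::remove_dir_all(&child);
    fs::create_dir_all(&child)?;

    // "Instructions thread", scheduled slightly later: pre-clean the parent
    // mount. This also deletes the child directory created just above.
    let _ = fs::remove_dir_all(&parent);
    fs::create_dir_all(&parent)?;

    // "Skills thread" resumes and resolves its target path, which is gone.
    fs::canonicalize(&child)
}

/// Same order of operations without the pre-cleanup, as in the fix:
/// each step only calls create_dir_all, which never removes anything.
fn replay_fixed_order(root: &Path) -> io::Result<PathBuf> {
    let parent = root.join(".claude");
    let child = parent.join("skills").join("example");

    fs::create_dir_all(&child)?;
    // Succeeds even though the directory already exists.
    fs::create_dir_all(&parent)?;

    fs::canonicalize(&child)
}
```

Called against an empty scratch directory, `replay_racy_order` returns an error of kind `io::ErrorKind::NotFound` (ENOENT on Linux), while `replay_fixed_order` returns the resolved child path. Because `create_dir_all` is idempotent and non-destructive, the order in which the parent and child threads run no longer matters; the trade-off is that stale files from a reused VM are not cleaned, which is why keep-alive was disabled.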
- Resolved: This incident has been resolved.
- Identified: We are continuing to work on a fix for this incident.
- Investigating: We are currently investigating this incident.