VM0 - storage download failed – Incident details

storage download failed

Resolved
Major outage
Started 21 days ago
Lasted about 2 hours

Affected

Runner

Major outage from 2:14 AM to 3:51 AM, Operational from 3:51 AM to 12:00 AM

Updates
  • Postmortem

    Guest Download Failure Due to Parallel Race Condition on Overlapping Mount Paths

    What Happened

    On 2026-04-10, agent jobs with multiple storages failed during VM initialization. The guest-download binary, which runs inside the VM to download and extract storage archives, crashed with ENOENT errors from canonicalize while processing skills storages.

    Root Cause

    guest-download extracts storage archives in parallel (up to 4 concurrent threads). A recent change added a remove_dir_all(target_path) call at the start of each thread to clean stale files on VM reuse (keep-alive).

    The storage mount paths have a guaranteed parent-child overlap:

    - Instructions mount at /home/user/.claude

    - Skills mount at /home/user/.claude/skills/{name}

    When threads run concurrently, the parent path's remove_dir_all deletes child directories already created by sibling threads, causing those threads to fail with ENOENT.
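
The interleaving described above can be replayed in a single thread, which makes the failure deterministic. This is a minimal sketch, not the guest-download source: `replay_race`, the scratch root under the system temp directory, and the skill name `demo` are all assumptions made for illustration.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Replays the failing order of operations without real threads.
/// `root` is a scratch directory standing in for /home/user (assumption).
fn replay_race(root: &Path) -> io::Result<io::ErrorKind> {
    let parent = root.join(".claude"); // instructions mount
    let child = parent.join("skills").join("demo"); // skills mount, "demo" is a made-up name

    // The skills thread gets as far as creating its target directory.
    fs::create_dir_all(&child)?;

    // The instructions thread then runs its pre-cleanup on the parent path,
    // which removes the child directory along with everything else.
    fs::remove_dir_all(&parent)?;
    fs::create_dir_all(&parent)?;

    // When the skills thread resolves its target, the directory is gone.
    match fs::canonicalize(&child) {
        Ok(_) => Err(io::Error::new(
            io::ErrorKind::Other,
            "child directory unexpectedly survived",
        )),
        Err(e) => Ok(e.kind()),
    }
}

fn main() -> io::Result<()> {
    let root = std::env::temp_dir().join(format!("guest-download-race-{}", std::process::id()));
    fs::create_dir_all(&root)?;
    let kind = replay_race(&root)?;
    println!("canonicalize failed with {:?}", kind);
    fs::remove_dir_all(&root)?;
    Ok(())
}
```

In the real binary the ordering depends on thread scheduling, so the failure would only appear when the parent's cleanup happens to run after a sibling has created its directory.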

    Impact

    • Scope: All jobs

    • Duration: ~11 hours (2026-04-09 16:38 UTC to 2026-04-10 03:21 UTC)

    Timeline (UTC)

    Time                 Event
    2026-04-09 16:38     Code change merged: added remove_dir_all pre-cleanup in parallel download threads
    2026-04-10 ~02:00    Job failures reported on prod-3
    2026-04-10 ~02:45    Root cause identified via prod SSH log analysis
    2026-04-10 03:21     Fix merged and deployed

    Fix

    • Removed remove_dir_all from download_and_extract(); threads now only run create_dir_all followed by streaming tar extraction

    • Disabled --keep-alive in CI and production to avoid VM reuse until proper stale file cleanup is implemented (#8757)
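
One possible shape for race-free stale file cleanup is to do it once, in a single thread, before any extraction thread starts, and to clean only the mount paths that are not nested under another mount path. The sketch below is an illustration of that idea and not the change tracked in #8757; `cleanup_roots` and the `/home/user/workspace` path are hypothetical.

```rust
use std::path::PathBuf;

/// Returns the mount paths that are not nested under another mount path.
/// Cleaning only these removes every stale file under any mount, and no
/// cleanup can delete a directory that another cleanup is about to touch.
fn cleanup_roots(mounts: &[PathBuf]) -> Vec<PathBuf> {
    let mut sorted: Vec<PathBuf> = mounts.to_vec();
    // Paths compare component by component, so a parent sorts before its descendants.
    sorted.sort();

    let mut roots: Vec<PathBuf> = Vec::new();
    for path in sorted {
        // starts_with matches whole components: /a/bc is not under /a/b.
        if !roots.iter().any(|root| path.starts_with(root)) {
            roots.push(path);
        }
    }
    roots
}

fn main() {
    let mounts = vec![
        PathBuf::from("/home/user/.claude/skills/demo"), // "demo" is a made-up skill name
        PathBuf::from("/home/user/.claude"),
        PathBuf::from("/home/user/workspace"), // hypothetical extra mount
    ];
    // A caller would remove each root sequentially here, then start the
    // parallel extraction threads, which only create directories and extract.
    for root in cleanup_roots(&mounts) {
        println!("{}", root.display());
    }
}
```

With this ordering the destructive step finishes before any concurrent work begins, so the parent-child overlap between the instructions and skills mounts no longer matters.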

  • Resolved
    This incident has been resolved.
  • Identified
    We are continuing to work on a fix for this incident.
  • Investigating
    We are currently investigating this incident.