<?xml version="1.0" encoding="UTF-8"?>
<feed xml:lang="en-US" xmlns="http://www.w3.org/2005/Atom">
  <id>tag:status.vm0.ai,2005:/history</id>
  <link rel="alternate" type="text/html" href="https://status.vm0.ai"/>
  <link rel="self" type="application/atom+xml" href="https://status.vm0.ai/history.atom"/>
  <title>VM0 Status - Incident history</title>
  <updated>2026-04-24T03:06:08.995+00:00</updated>
  <author>
    <name>VM0</name>
  </author>
  
<entry>
  <id>tag:status.vm0.ai,2005:Incident/cmocbwfxz003ie7yg902jt5xx</id>
  <published>2026-04-24T03:06:08.995+00:00</published>
  <updated>2026-04-24T03:06:08.995+00:00</updated>
  <link rel="alternate" type="text/html" href="https://status.vm0.ai/incident/cmocbwfxz003ie7yg902jt5xx"/>
  <title>Due to a performance degradation with the upstream provider, our Sonnet 4.6 model is currently affected.</title>

  <content type="html">
  <![CDATA[
    <p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 45 minutes</p>
    <p><strong>Affected Components:</strong> API Service</p>
    <p><small>Apr <var data-var='date'> 24</var>, <var data-var='time'>03:06:08</var> GMT+0</small><br /><strong>Investigating</strong> -
  We suggest temporarily switching to other models.</p>
<p><small>Apr <var data-var='date'> 24</var>, <var data-var='time'>03:50:56</var> GMT+0</small><br /><strong>Resolved</strong> -
  This incident has been resolved.</p>

        ]]>
  </content>
</entry>

<entry>
  <id>tag:status.vm0.ai,2005:Incident/cmo8htmpf0053d4cajuvbcelh</id>
  <published>2026-04-21T10:40:51.048+00:00</published>
  <updated>2026-04-21T10:40:51.048+00:00</updated>
  <link rel="alternate" type="text/html" href="https://status.vm0.ai/incident/cmo8htmpf0053d4cajuvbcelh"/>
  <title>Currently unable to log in to the website.</title>

  <content type="html">
  <![CDATA[
    <p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 57 minutes</p>
    <p><strong>Affected Components:</strong> Web Page</p>
    <p><small>Apr <var data-var='date'> 21</var>, <var data-var='time'>10:40:51</var> GMT+0</small><br /><strong>Investigating</strong> -
  We are currently investigating this incident.</p>
<p><small>Apr <var data-var='date'> 21</var>, <var data-var='time'>11:02:54</var> GMT+0</small><br /><strong>Identified</strong> -
  We are continuing to work on a fix for this incident. A hotfix is being prepared for deployment to the live environment.</p>
<p><small>Apr <var data-var='date'> 21</var>, <var data-var='time'>11:33:59</var> GMT+0</small><br /><strong>Monitoring</strong> -
  We are currently releasing a hotfix to resolve this issue.</p>
<p><small>Apr <var data-var='date'> 21</var>, <var data-var='time'>11:37:38</var> GMT+0</small><br /><strong>Resolved</strong> -
  This incident has been resolved.</p>
<p><small>Apr <var data-var='date'> 21</var>, <var data-var='time'>11:40:46</var> GMT+0</small><br /><strong>Postmortem</strong> -
  Today we set out to add a permission-related feature to the system. While implementing a new authentication method, we introduced a breaking change that prevented web users from logging in.

This change should have been caught by our automated checks. However, the check was accidentally bypassed, allowing the bug to reach the production environment.

We have since revoked the ability to bypass the check to prevent similar incidents.</p>

        ]]>
  </content>
</entry>

<entry>
  <id>tag:status.vm0.ai,2005:Incident/cmo87eiuo00113ul7sh57uf2p</id>
  <published>2026-04-21T05:49:09.951+00:00</published>
  <updated>2026-04-21T05:49:09.951+00:00</updated>
  <link rel="alternate" type="text/html" href="https://status.vm0.ai/incident/cmo87eiuo00113ul7sh57uf2p"/>
  <title>The system is unable to run any tasks.</title>

  <content type="html">
  <![CDATA[
    <p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 4 minutes</p>
    <p><strong>Affected Components:</strong> Runner</p>
    <p><small>Apr <var data-var='date'> 21</var>, <var data-var='time'>05:49:09</var> GMT+0</small><br /><strong>Investigating</strong> -
  We are currently investigating this incident.</p>
<p><small>Apr <var data-var='date'> 21</var>, <var data-var='time'>05:50:58</var> GMT+0</small><br /><strong>Identified</strong> -
  We are continuing to work on a fix for this incident.</p>
<p><small>Apr <var data-var='date'> 21</var>, <var data-var='time'>05:52:41</var> GMT+0</small><br /><strong>Resolved</strong> -
  This incident has been resolved.</p>

        ]]>
  </content>
</entry>

<entry>
  <id>tag:status.vm0.ai,2005:Incident/cmo2ugmg50g0gz3zj3tc7npdk</id>
  <published>2026-04-17T11:48:01.967+00:00</published>
  <updated>2026-04-17T11:48:01.967+00:00</updated>
  <link rel="alternate" type="text/html" href="https://status.vm0.ai/incident/cmo2ugmg50g0gz3zj3tc7npdk"/>
  <title>The Google Workspace connector is currently not working properly.</title>

  <content type="html">
  <![CDATA[
    <p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 1 hour and 18 minutes</p>
    <p><strong>Affected Components:</strong> Connector, Web Page, Runner, API Service, Storage</p>
    <p><small>Apr <var data-var='date'> 17</var>, <var data-var='time'>11:48:01</var> GMT+0</small><br /><strong>Investigating</strong> -
  We are currently investigating this incident.</p>
<p><small>Apr <var data-var='date'> 17</var>, <var data-var='time'>12:07:14</var> GMT+0</small><br /><strong>Identified</strong> -
  We are continuing to work on a fix for this incident.</p>
<p><small>Apr <var data-var='date'> 17</var>, <var data-var='time'>12:40:02</var> GMT+0</small><br /><strong>Monitoring</strong> -
  We implemented a fix and are currently monitoring the result.</p>
<p><small>Apr <var data-var='date'> 17</var>, <var data-var='time'>13:05:59</var> GMT+0</small><br /><strong>Resolved</strong> -
  This incident has been resolved.</p>
<p><small>Apr <var data-var='date'> 17</var>, <var data-var='time'>13:23:06</var> GMT+0</small><br /><strong>Postmortem</strong> -
  The problem stems from a recent update in which we modified the token refresh rules for the OAuth connector. Our goal was to implement a more proactive refresh mechanism to ensure a valid token is available whenever the system runs.
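
A proactive refresh rule of this general kind can be pictured as a simple time check. The TypeScript below is a minimal sketch under stated assumptions, not the connector's actual logic: the `StoredToken` shape, the `shouldRefresh` helper, and the safety margin are all hypothetical names introduced for illustration.

```typescript
// Hypothetical token shape; the field names are assumptions for illustration.
interface StoredToken {
  accessToken: string;
  expiresAtMs: number; // absolute expiry time, in epoch milliseconds
}

// Returns true when the token should be refreshed before the next run:
// it has already expired, or it will expire within `marginMs`.
function shouldRefresh(token: StoredToken, nowMs: number, marginMs: number): boolean {
  return nowMs + marginMs >= token.expiresAtMs;
}
```

With a margin of a few minutes, a token that is about to expire is refreshed before a run starts instead of failing partway through it.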

However, this new refresh rule introduced a bug specific to the Google OAuth refresh process. Because the issue was difficult to reproduce in our automated checks, it was deployed to the live environment.

To prevent a recurrence, we are implementing automated validation of the OAuth connector workflow so that issues of this kind are caught before release.</p>

        ]]>
  </content>
</entry>

<entry>
  <id>tag:status.vm0.ai,2005:Incident/cmnydkazx02elz659ubr0pxta</id>
  <published>2026-04-14T08:43:55.642+00:00</published>
  <updated>2026-04-14T09:50:09.956+00:00</updated>
  <link rel="alternate" type="text/html" href="https://status.vm0.ai/incident/cmnydkazx02elz659ubr0pxta"/>
  <title>We are experiencing a failure with the Agent task runner, which is currently preventing our Agents from executing tasks.</title>

  <content type="html">
  <![CDATA[
    <p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 1 hour and 6 minutes</p>
    <p><strong>Affected Components:</strong> Runner</p>
    <p><small>Apr <var data-var='date'> 14</var>, <var data-var='time'>09:50:09</var> GMT+0</small><br /><strong>Resolved</strong> -
  This incident has been resolved.</p>
<p><small>Apr <var data-var='date'> 14</var>, <var data-var='time'>08:43:55</var> GMT+0</small><br /><strong>Investigating</strong> -
  We are currently investigating this incident.</p>
<p><small>Apr <var data-var='date'> 14</var>, <var data-var='time'>09:35:13</var> GMT+0</small><br /><strong>Monitoring</strong> -
  We implemented a fix and are currently monitoring the result.</p>
<p><small>Apr <var data-var='date'> 14</var>, <var data-var='time'>09:50:08</var> GMT+0</small><br /><strong>Resolved</strong> -
  This incident has been resolved.</p>

        ]]>
  </content>
</entry>

<entry>
  <id>tag:status.vm0.ai,2005:Incident/cmns9w8cj06a5zia892px46qr</id>
  <published>2026-04-10T02:14:36.526+00:00</published>
  <updated>2026-04-10T02:14:36.526+00:00</updated>
  <link rel="alternate" type="text/html" href="https://status.vm0.ai/incident/cmns9w8cj06a5zia892px46qr"/>
  <title>Storage download failed</title>

  <content type="html">
  <![CDATA[
    <p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 1 hour and 37 minutes</p>
    <p><strong>Affected Components:</strong> Runner</p>
    <p><small>Apr <var data-var='date'> 10</var>, <var data-var='time'>02:14:36</var> GMT+0</small><br /><strong>Investigating</strong> -
  We are currently investigating this incident.</p>
<p><small>Apr <var data-var='date'> 10</var>, <var data-var='time'>03:00:13</var> GMT+0</small><br /><strong>Identified</strong> -
  We are continuing to work on a fix for this incident.</p>
<p><small>Apr <var data-var='date'> 10</var>, <var data-var='time'>03:51:11</var> GMT+0</small><br /><strong>Resolved</strong> -
  This incident has been resolved.</p>
<p><small>Apr <var data-var='date'> 10</var>, <var data-var='time'>04:35:41</var> GMT+0</small><br /><strong>Postmortem</strong> -
  # Guest Download Failure Due to Parallel Race Condition on Overlapping Mount Paths

### What Happened

On 2026-04-10, agent jobs with multiple storages failed during VM initialization. The `guest-download` binary, which runs inside the VM to download and extract storage archives, crashed with `canonicalize` ENOENT errors when processing skills storages.

### Root Cause

`guest-download` extracts storage archives in parallel (up to 4 concurrent threads). A recent change added a `remove_dir_all(target_path)` call at the start of each thread to clean stale files on VM reuse (keep-alive).

The storage mount paths have a guaranteed parent-child overlap:

\- Instructions mount at `/home/user/.claude`

\- Skills mount at `/home/user/.claude/skills/{name}`

When threads run concurrently, the parent path&#039;s `remove_dir_all` deletes child directories already created by sibling threads, causing those threads to fail with ENOENT.
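
The failure mode can be replayed deterministically with plain filesystem calls. The TypeScript below is an illustrative sketch only. It is not the `guest-download` code, it runs the two workers' steps sequentially in one problematic order rather than on real threads, and the temporary directory and the skill name `example` are assumptions standing in for the real mount paths.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Replays one problematic interleaving of two extraction workers, in sequence.
// Returns the error code seen by the "skills" worker, or null if none occurred.
function replayOverlapRace(): string | null {
  // A temporary directory stands in for /home/user (assumption for the demo).
  const root = fs.mkdtempSync(path.join(os.tmpdir(), "overlap-race-"));
  const instructionsPath = path.join(root, ".claude");
  const skillsPath = path.join(instructionsPath, "skills", "example");
  try {
    // Worker B (skills) creates its target directory first.
    fs.mkdirSync(skillsPath, { recursive: true });
    // Worker A (instructions) then runs its pre-cleanup on the parent path,
    // which also deletes the child directory that worker B just created.
    fs.rmSync(instructionsPath, { recursive: true, force: true });
    // Worker B resolves its target path, which no longer exists.
    fs.realpathSync(skillsPath);
    return null;
  } catch (err) {
    return (err as { code?: string }).code ?? null;
  } finally {
    // Remove the temporary directory used by the demo.
    fs.rmSync(root, { recursive: true, force: true });
  }
}
```

Calling `replayOverlapRace()` returns the error code from the final path resolution, which corresponds to the ENOENT failure described above.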

### Impact

* **Scope:** All jobs
* **Duration:** \~11 hours (2026-04-09 16:38 UTC — 2026-04-10 03:21 UTC)

### Timeline (UTC)

| Time               | Event                                                                                |
| ------------------ | ------------------------------------------------------------------------------------ |
| 2026-04-09 16:38   | Code change merged — added remove\_dir\_all pre-cleanup in parallel download threads |
| 2026-04-10 \~02:00 | Job failures reported on prod-3                                                      |
| 2026-04-10 \~02:45 | Root cause identified via prod SSH log analysis                                      |
| 2026-04-10 03:21   | Fix merged and deployed                                                              |

### Fix

* Removed `remove_dir_all` from `download_and_extract()` — threads now only do `create_dir_all` \+ streaming tar extraction
* Disabled `--keep-alive` in CI and production to avoid VM reuse until proper stale file cleanup is implemented (#8757).</p>

        ]]>
  </content>
</entry>

<entry>
  <id>tag:status.vm0.ai,2005:Incident/cmnpv44m100l0xnamhsd9whby</id>
  <published>2026-04-08T09:45:18.190+00:00</published>
  <updated>2026-04-08T09:45:18.190+00:00</updated>
  <link rel="alternate" type="text/html" href="https://status.vm0.ai/incident/cmnpv44m100l0xnamhsd9whby"/>
  <title>We are currently experiencing an outage. Users of the VM0 Managed Token service may be affected.</title>

  <content type="html">
  <![CDATA[
    <p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 1 hour and 14 minutes</p>
    <p><strong>Affected Components:</strong> Runner</p>
    <p><small>Apr <var data-var='date'> 8</var>, <var data-var='time'>09:45:18</var> GMT+0</small><br /><strong>Investigating</strong> -
  We are currently investigating this incident.</p>
<p><small>Apr <var data-var='date'> 8</var>, <var data-var='time'>09:53:24</var> GMT+0</small><br /><strong>Investigating</strong> -
  This incident is related to our most recent release. We are currently investigating the situation and working to resolve an issue that occurred during the deployment process.</p>
<p><small>Apr <var data-var='date'> 8</var>, <var data-var='time'>10:22:56</var> GMT+0</small><br /><strong>Monitoring</strong> -
  We have submitted a fix and are monitoring to confirm that the problem is resolved.</p>
<p><small>Apr <var data-var='date'> 8</var>, <var data-var='time'>10:46:12</var> GMT+0</small><br /><strong>Resolved</strong> -
  The issue has been fixed.</p>
<p><small>Apr <var data-var='date'> 8</var>, <var data-var='time'>10:57:19</var> GMT+0</small><br /><strong>Postmortem</strong> -
  # Postmortem: Agent Task Execution Failure Due to Model ID Format Mismatch

### What Happened

On 2026-04-08, all vm0 agent task runs failed to start. Users were unable to execute any agent tasks across all built-in model provider types.

### Root Cause

The Claude Code CLI and Anthropic API require model IDs in hyphenated format (e.g., `claude-opus-4-6`). However, the model ID constants in our codebase and the values stored in the database used dot notation (e.g., `claude-opus-4.6`). When vm0 attempted to launch an agent, the CLI rejected the model ID at startup, blocking all task execution.
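
The mismatch between the two formats is small enough to show in a few lines. The TypeScript below is a minimal sketch, not the change that was shipped: the helper names `toHyphenatedModelId` and `isHyphenatedModelId` are hypothetical, and the only rule assumed is the one stated above, that dots must become hyphens.

```typescript
// Hypothetical helper: converts a dot-notation model ID to hyphenated form.
// An ID that is already hyphenated is returned unchanged.
function toHyphenatedModelId(modelId: string): string {
  return modelId.split(".").join("-");
}

// Hypothetical guard: true when the ID contains no dot-notation segment.
function isHyphenatedModelId(modelId: string): boolean {
  return !modelId.includes(".");
}
```

A guard like `isHyphenatedModelId` could run in a unit test over the model ID constants, so that a dot-notation value is flagged before it reaches the CLI.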

### Impact

* **Scope:** 100% of agent task runs across `vm0`, `anthropic-api-key`, and `claude-code-oauth-token` provider types
* **Duration:** \~30 minutes (09:47 UTC — 10:17 UTC)
* **Not affected:** OpenRouter and Vercel AI Gateway providers (use a separate `anthropic/claude-*` naming convention)

### Timeline (UTC)

| Time  | Event                   |
| ----- | ----------------------- |
| 09:47 | Fix PR opened (#8511)   |
| 10:17 | PR merged and deployed  |
| 10:17 | Task execution restored |

### Fix

1. Renamed all model ID constants in `MODEL_PROVIDER_TYPES` and `VM0_MODEL_TO_PROVIDER` from dot to hyphen format
2. Ran DB migration `0230` to backfill affected rows in `model_providers.selected_model`, `vm0_api_keys.model`, and `credit_pricing.model`
3. Updated dev seed scripts and all related test suites.</p>
<p><small>Apr <var data-var='date'> 8</var>, <var data-var='time'>10:58:59</var> GMT+0</small><br /><strong>Resolved</strong> -
  This incident has been resolved.</p>

        ]]>
  </content>
</entry>

<entry>
  <id>tag:status.vm0.ai,2005:Incident/cmn77hak4000cqot3pmimzkuh</id>
  <published>2026-03-26T08:23:50.672+00:00</published>
  <updated>2026-03-26T08:23:50.672+00:00</updated>
  <link rel="alternate" type="text/html" href="https://status.vm0.ai/incident/cmn77hak4000cqot3pmimzkuh"/>
  <title>We are currently experiencing service outages.</title>

  <content type="html">
  <![CDATA[
    <p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 3 minutes</p>
    <p><strong>Affected Components:</strong> Web Page, API Service</p>
    <p><small>Mar <var data-var='date'> 26</var>, <var data-var='time'>08:23:50</var> GMT+0</small><br /><strong>Investigating</strong> -
  Agents are temporarily unable to hold conversations or perform other tasks. We are currently investigating this incident.</p>
<p><small>Mar <var data-var='date'> 26</var>, <var data-var='time'>08:27:03</var> GMT+0</small><br /><strong>Resolved</strong> -
  This incident has been resolved.</p>
<p><small>Mar <var data-var='date'> 26</var>, <var data-var='time'>08:28:16</var> GMT+0</small><br /><strong>Postmortem</strong> -
  During the latest release, our deployment script encountered several issues that prevented the database migration from running as planned.

However, the application service deployment proceeded on the assumption that the database migration had succeeded.

This left the live application and the database on inconsistent versions, making it impossible to create new event activity.
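
One general way to avoid this kind of mismatch is a pipeline that stops at the first failed step, so the application rollout never runs after a failed migration. The TypeScript below is a sketch of that idea under stated assumptions, not a description of our deployment script: the `Step` type, `runPipeline`, and the step names are hypothetical.

```typescript
// A step reports success (true) or failure (false). Hypothetical shape.
type Step = { name: string; run: () => boolean };

// Runs steps in order and stops at the first failure, so later steps
// (for example, the application rollout) never run after a failed migration.
function runPipeline(steps: Step[]): { completed: string[]; failedAt: string | null } {
  const completed: string[] = [];
  for (const step of steps) {
    if (!step.run()) {
      return { completed, failedAt: step.name };
    }
    completed.push(step.name);
  }
  return { completed, failedAt: null };
}
```

With a `migrate` step placed before a `deploy` step, a failed migration is reported in `failedAt` and the deploy step is skipped.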

We are investigating why the database upgrade scripts failed to run correctly.</p>

        ]]>
  </content>
</entry>

<entry>
  <id>tag:status.vm0.ai,2005:Incident/cmmvj8e1j05gjezxdenn7qed4</id>
  <published>2026-03-18T04:19:36.603+00:00</published>
  <updated>2026-03-18T04:19:36.603+00:00</updated>
  <link rel="alternate" type="text/html" href="https://status.vm0.ai/incident/cmmvj8e1j05gjezxdenn7qed4"/>
  <title>Service temporarily unavailable</title>

  <content type="html">
  <![CDATA[
    <p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 30 minutes</p>
    <p><strong>Affected Components:</strong> Web Page, API Service</p>
    <p><small>Mar <var data-var='date'> 18</var>, <var data-var='time'>04:19:36</var> GMT+0</small><br /><strong>Investigating</strong> -
  We are currently investigating this incident.</p>
<p><small>Mar <var data-var='date'> 18</var>, <var data-var='time'>04:49:34</var> GMT+0</small><br /><strong>Resolved</strong> -
  This incident has been resolved.</p>

        ]]>
  </content>
</entry>

<entry>
  <id>tag:status.vm0.ai,2005:Incident/cmmn1whyg006ddyuxih5nnm4z</id>
  <published>2026-03-12T05:52:18.976+00:00</published>
  <updated>2026-03-12T05:52:18.976+00:00</updated>
  <link rel="alternate" type="text/html" href="https://status.vm0.ai/incident/cmmn1whyg006ddyuxih5nnm4z"/>
  <title>The website is not allowing users to log in normally.</title>

  <content type="html">
  <![CDATA[
    <p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 38 minutes</p>
    <p><strong>Affected Components:</strong> Web Page</p>
    <p><small>Mar <var data-var='date'> 12</var>, <var data-var='time'>05:52:18</var> GMT+0</small><br /><strong>Investigating</strong> -
  We are currently investigating this incident.</p>
<p><small>Mar <var data-var='date'> 12</var>, <var data-var='time'>05:52:49</var> GMT+0</small><br /><strong>Investigating</strong> -
  The login page displays a blank screen. Because the username and password fields are not visible, users cannot complete the login process.</p>
<p><small>Mar <var data-var='date'> 12</var>, <var data-var='time'>05:54:25</var> GMT+0</small><br /><strong>Identified</strong> -
  The issue originated with the most recent security update.</p>
<p><small>Mar <var data-var='date'> 12</var>, <var data-var='time'>06:22:48</var> GMT+0</small><br /><strong>Monitoring</strong> -
  We implemented a fix and are currently monitoring the result.</p>
<p><small>Mar <var data-var='date'> 12</var>, <var data-var='time'>06:30:07</var> GMT+0</small><br /><strong>Resolved</strong> -
  This incident has been resolved.</p>

        ]]>
  </content>
</entry>

<entry>
  <id>tag:status.vm0.ai,2005:Incident/cmli1k8ru0042hh84qomwdnme</id>
  <published>2026-02-11T13:04:13.947+00:00</published>
  <updated>2026-02-11T13:04:13.947+00:00</updated>
  <link rel="alternate" type="text/html" href="https://status.vm0.ai/incident/cmli1k8ru0042hh84qomwdnme"/>
  <title>www.vm0.ai cannot directly redirect to platform.vm0.ai.</title>

  <content type="html">
  <![CDATA[
    <p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 33 minutes</p>
    <p><strong>Affected Components:</strong> Web Page</p>
    <p><small>Feb <var data-var='date'> 11</var>, <var data-var='time'>13:04:13</var> GMT+0</small><br /><strong>Investigating</strong> -
  We are currently investigating this incident.</p>
<p><small>Feb <var data-var='date'> 11</var>, <var data-var='time'>13:14:00</var> GMT+0</small><br /><strong>Identified</strong> -
  We are continuing to work on a fix for this incident.</p>
<p><small>Feb <var data-var='date'> 11</var>, <var data-var='time'>13:36:50</var> GMT+0</small><br /><strong>Resolved</strong> -
  This incident has been resolved.</p>

        ]]>
  </content>
</entry>

<entry>
  <id>tag:status.vm0.ai,2005:Incident/cmlayuga407c3a1jxpcjvfqui</id>
  <published>2026-02-06T13:28:00.000+00:00</published>
  <updated>2026-02-06T13:28:00.000+00:00</updated>
  <link rel="alternate" type="text/html" href="https://status.vm0.ai/incident/cmlayuga407c3a1jxpcjvfqui"/>
  <title>platform.vm0.ai cannot be opened.</title>

  <content type="html">
  <![CDATA[
    <p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 16 minutes</p>
    
    <p><small>Feb <var data-var='date'> 6</var>, <var data-var='time'>13:28:00</var> GMT+0</small><br /><strong>Investigating</strong> -
  We are currently investigating this incident.</p>
<p><small>Feb <var data-var='date'> 6</var>, <var data-var='time'>14:18:09</var> GMT+0</small><br /><strong>Identified</strong> -
  We determined that the issue originated from a recent deployment of the platform frontend code.</p>
<p><small>Feb <var data-var='date'> 6</var>, <var data-var='time'>14:29:55</var> GMT+0</small><br /><strong>Resolved</strong> -
  This incident has been resolved.</p>
<p><small>Feb <var data-var='date'> 6</var>, <var data-var='time'>14:38:03</var> GMT+0</small><br /><strong>Postmortem</strong> -
  This incident was caused by a code refactoring that consolidated references to CLERK\_PUBLISHABLE\_KEY across several websites. Because the variable name was mistakenly omitted from the deployment script, the platform could not find the legacy CLERK\_PUBLISHABLE\_KEY in the production environment, which caused page failures and made [platform.vm0.ai](http://platform.vm0.ai) unusable.

The related API services and container services were not affected.
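
A missing variable of this kind can be surfaced before deployment with a small check that fails the build. The TypeScript below is a minimal sketch under stated assumptions, not our build tooling: `findMissingEnv` and `assertEnv` are hypothetical helpers, and the list of required names would have to be maintained by hand.

```typescript
// Map of variable names to values; undefined means the variable is not set.
type Env = { [name: string]: string | undefined };

// Returns the names of required variables that are missing or blank.
function findMissingEnv(required: string[], env: Env): string[] {
  return required.filter((name) => {
    const value = env[name];
    return value === undefined || value.trim() === "";
  });
}

// Throws when any required variable is missing, which would fail the build step.
function assertEnv(required: string[], env: Env): void {
  const missing = findMissingEnv(required, env);
  if (missing.length > 0) {
    throw new Error("Missing required environment variables: " + missing.join(", "));
  }
}
```

In a Node.js build script, calling `assertEnv(["CLERK_PUBLISHABLE_KEY"], process.env)` before bundling would stop the build if the key were absent.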

As follow-up remediation, we plan to validate required environment variables during the build phase so that problematic code is not deployed. We also plan to introduce e2e testing for [platform.vm0.ai](http://platform.vm0.ai) to ensure that the happy-path workflow functions normally.</p>

        ]]>
  </content>
</entry>

<entry>
  <id>tag:status.vm0.ai,2005:Incident/cmkkqswpo002w10pflokikw48</id>
  <published>2026-01-19T02:40:00.000+00:00</published>
  <updated>2026-01-19T02:40:00.000+00:00</updated>
  <link rel="alternate" type="text/html" href="https://status.vm0.ai/incident/cmkkqswpo002w10pflokikw48"/>
  <title>Agent can&#039;t run</title>

  <content type="html">
  <![CDATA[
    <p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 4 hours and 58 minutes</p>
    <p><strong>Affected Components:</strong> Runner</p>
    <p><small>Jan <var data-var='date'> 19</var>, <var data-var='time'>02:40:00</var> GMT+0</small><br /><strong>Investigating</strong> -
  We are currently investigating this incident.</p>
<p><small>Jan <var data-var='date'> 19</var>, <var data-var='time'>02:45:00</var> GMT+0</small><br /><strong>Identified</strong> -
  We are continuing to work on a fix for this incident.</p>
<p><small>Jan <var data-var='date'> 19</var>, <var data-var='time'>02:50:00</var> GMT+0</small><br /><strong>Identified</strong> -
  The issue stems from a recent runner deployment, and the team is working on a fix.</p>
<p><small>Jan <var data-var='date'> 19</var>, <var data-var='time'>05:50:18</var> GMT+0</small><br /><strong>Identified</strong> -
  We have traced the problem to a recent database change, and the team is repairing the data that caused it.</p>
<p><small>Jan <var data-var='date'> 19</var>, <var data-var='time'>06:21:45</var> GMT+0</small><br /><strong>Identified</strong> -
  We have now restored normal operation for both the database and task dispatcher, and are currently working on getting Claude Code in the sandbox back online.</p>
<p><small>Jan <var data-var='date'> 19</var>, <var data-var='time'>07:03:49</var> GMT+0</small><br /><strong>Identified</strong> -
  We have confirmed that the issue lies in the way VM0 invokes Claude Code. A minimal fix has been finalized and is currently being redeployed.</p>
<p><small>Jan <var data-var='date'> 19</var>, <var data-var='time'>07:38:24</var> GMT+0</small><br /><strong>Resolved</strong> -
  This incident has been resolved.</p>
<p><small>Jan <var data-var='date'> 19</var>, <var data-var='time'>08:23:59</var> GMT+0</small><br /><strong>Postmortem</strong> -
  Postmortem: Claude Code Hanging in Sandbox

Date: 2026-01-19  
Severity: P0  
Duration: \~4 hours

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

SUMMARY

Production agent runs were failing silently. Claude Code started but never produced output, timing out after 15+ minutes.

Root Cause: stdin was configured as &quot;pipe&quot; but never closed, causing Claude Code to hang waiting for EOF.

Why CI missed it: CI uses mock-claude which doesn&#039;t check stdin state. Real Claude Code does.

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

THE BUG

Before (hangs - stdin pipe never closed):  
spawn(cmd, args, { stdio: \[&quot;pipe&quot;, &quot;pipe&quot;, &quot;pipe&quot;\] })

After (works - stdin is /dev/null, immediate EOF):  
spawn(cmd, args, { stdio: \[&quot;ignore&quot;, &quot;pipe&quot;, &quot;pipe&quot;\] })
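
The effect of the fixed setting can be checked with a small self-contained script. The TypeScript below is an illustrative sketch, not the VM0 runner code: the inline child script is a hypothetical stand-in for Claude Code that reads stdin until EOF, and spawnSync is used instead of spawn only to keep the example short.

```typescript
import { spawnSync } from "node:child_process";

// Hypothetical stand-in for the agent process: reads all of stdin until EOF,
// then prints how many bytes it received.
const childScript =
  'const n = require("fs").readFileSync(0).length; console.log("stdin bytes: " + n);';

// Runs the child with stdin set to "ignore", so it sees EOF immediately.
function runWithIgnoredStdin(): { status: number | null; stdout: string } {
  const result = spawnSync(process.execPath, ["-e", childScript], {
    stdio: ["ignore", "pipe", "pipe"],
    encoding: "utf8",
    timeout: 10000, // safety net so the demo itself cannot hang
  });
  return { status: result.status, stdout: (result.stdout ?? "").trim() };
}
```

Because stdin is ignored, the child's read returns at once with zero bytes and the process exits instead of waiting for input.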

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

HOW WE FOUND IT

1. SSH into sandbox
2. cat /tmp/vm0-agent-\*.log → empty (no Claude output)
3. ps aux | grep claude → process alive, using 23% memory
4. ps -p 510 -o wchan → ep\_pol (waiting on I/O)
5. ls -la /proc/510/fd/0 → stdin connected to pipe
6. Manual &quot;claude --print hello&quot; → works (TTY mode)

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

WHY NO ROLLBACK

Release included database migration. Forward-fix was safer.

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

INVESTIGATION NOISE

Runner npm publish failure (@vm0/core not built) was unrelated but consumed investigation time.

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

ACTION ITEMS

✅ Fix stdin → &quot;ignore&quot; ([#1316](https://github.com/vm0-ai/vm0/pull/1316))  
✅ Add spawn unit tests ([#1319](https://github.com/vm0-ai/vm0/pull/1319))  
✅ Fix CI publish jobs ([#1306](https://github.com/vm0-ai/vm0/pull/1306), [#1318](https://github.com/vm0-ai/vm0/pull/1318))  
🔲 Add real Claude test in CI (TODO)

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

KEY LESSON

Mock ≠ Real: CI must include at least one test with real Claude Code to catch behavior differences like stdin handling.</p>

        ]]>
  </content>
</entry>

</feed>