<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>VM0 Status - Incident history</title>
    <link>https://status.vm0.ai</link>
    <description>VM0</description>
    <pubDate>Fri, 24 Apr 2026 03:06:08 +0000</pubDate>
    
<item>
  <title>Our Sonnet 4.6 model is currently affected by a performance degradation at the upstream provider.</title>
  <description>
    Type: Incident
    Duration: 45 minutes

    Affected Components: API Service
    Apr 24, 03:06:08 GMT+0 - Investigating - We suggest temporarily switching to other models.
    Apr 24, 03:50:56 GMT+0 - Resolved - This incident has been resolved.
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 45 minutes</p>
    <p><strong>Affected Components:</strong> API Service</p>
    &lt;p&gt;&lt;small&gt;Apr &lt;var data-var=&#039;date&#039;&gt; 24&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;03:06:08&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Investigating&lt;/strong&gt; -
  We suggest temporarily switching to other models.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Apr &lt;var data-var=&#039;date&#039;&gt; 24&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;03:50:56&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Resolved&lt;/strong&gt; -
  This incident has been resolved.&lt;/p&gt;
]]>
  </content:encoded>
  <pubDate>Fri, 24 Apr 2026 03:06:08 +0000</pubDate>
  <link>https://status.vm0.ai/incident/cmocbwfxz003ie7yg902jt5xx</link>
  <guid>https://status.vm0.ai/incident/cmocbwfxz003ie7yg902jt5xx</guid>
</item>

<item>
  <title>Currently unable to log in to the website.</title>
  <description>
    Type: Incident
    Duration: 57 minutes

    Affected Components: Web Page
    Apr 21, 10:40:51 GMT+0 - Investigating - We are currently investigating this incident.
    Apr 21, 11:02:54 GMT+0 - Identified - We are continuing to work on a fix for this incident. A hotfix is being prepared for deployment to the live environment.
    Apr 21, 11:33:59 GMT+0 - Monitoring - We are currently releasing a hotfix to resolve this issue.
    Apr 21, 11:37:38 GMT+0 - Resolved - This incident has been resolved.
    Apr 21, 11:40:46 GMT+0 - Postmortem - While adding a permission-related feature to the system, we implemented a new authentication method that introduced a breaking change, preventing web users from logging in.

This change should have been caught by our automated checks. However, the check was accidentally bypassed, allowing the bug to reach the production environment.

We have since revoked the ability to bypass these checks to prevent similar incidents.
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 57 minutes</p>
    <p><strong>Affected Components:</strong> Web Page</p>
    &lt;p&gt;&lt;small&gt;Apr &lt;var data-var=&#039;date&#039;&gt; 21&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;10:40:51&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Investigating&lt;/strong&gt; -
  We are currently investigating this incident.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Apr &lt;var data-var=&#039;date&#039;&gt; 21&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;11:02:54&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  We are continuing to work on a fix for this incident. A hotfix is being prepared for deployment to the live environment.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Apr &lt;var data-var=&#039;date&#039;&gt; 21&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;11:33:59&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Monitoring&lt;/strong&gt; -
  We are currently releasing a hotfix to resolve this issue.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Apr &lt;var data-var=&#039;date&#039;&gt; 21&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;11:37:38&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Resolved&lt;/strong&gt; -
  This incident has been resolved.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Apr &lt;var data-var=&#039;date&#039;&gt; 21&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;11:40:46&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Postmortem&lt;/strong&gt; -
  While adding a permission-related feature to the system, we implemented a new authentication method that introduced a breaking change, preventing web users from logging in.

This change should have been caught by our automated checks. However, the check was accidentally bypassed, allowing the bug to reach the production environment.

We have since revoked the ability to bypass these checks to prevent similar incidents.&lt;/p&gt;
]]>
  </content:encoded>
  <pubDate>Tue, 21 Apr 2026 10:40:51 +0000</pubDate>
  <link>https://status.vm0.ai/incident/cmo8htmpf0053d4cajuvbcelh</link>
  <guid>https://status.vm0.ai/incident/cmo8htmpf0053d4cajuvbcelh</guid>
</item>

<item>
  <title>The system is unable to run any tasks.</title>
  <description>
    Type: Incident
    Duration: 4 minutes

    Affected Components: Runner
    Apr 21, 05:49:09 GMT+0 - Investigating - We are currently investigating this incident.
    Apr 21, 05:50:58 GMT+0 - Identified - We are continuing to work on a fix for this incident.
    Apr 21, 05:52:41 GMT+0 - Resolved - This incident has been resolved.
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 4 minutes</p>
    <p><strong>Affected Components:</strong> Runner</p>
    &lt;p&gt;&lt;small&gt;Apr &lt;var data-var=&#039;date&#039;&gt; 21&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;05:49:09&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Investigating&lt;/strong&gt; -
  We are currently investigating this incident.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Apr &lt;var data-var=&#039;date&#039;&gt; 21&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;05:50:58&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  We are continuing to work on a fix for this incident.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Apr &lt;var data-var=&#039;date&#039;&gt; 21&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;05:52:41&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Resolved&lt;/strong&gt; -
  This incident has been resolved.&lt;/p&gt;
]]>
  </content:encoded>
  <pubDate>Tue, 21 Apr 2026 05:49:09 +0000</pubDate>
  <link>https://status.vm0.ai/incident/cmo87eiuo00113ul7sh57uf2p</link>
  <guid>https://status.vm0.ai/incident/cmo87eiuo00113ul7sh57uf2p</guid>
</item>

<item>
  <title>The Google Workspace connector is currently not working properly.</title>
  <description>
    Type: Incident
    Duration: 1 hour and 18 minutes

    Affected Components: Connector, Web Page, Runner, API Service, Storage
    Apr 17, 11:48:01 GMT+0 - Investigating - We are currently investigating this incident.
    Apr 17, 12:07:14 GMT+0 - Identified - We are continuing to work on a fix for this incident.
    Apr 17, 12:40:02 GMT+0 - Monitoring - We implemented a fix and are currently monitoring the result.
    Apr 17, 13:05:59 GMT+0 - Resolved - This incident has been resolved.
    Apr 17, 13:23:06 GMT+0 - Postmortem - The problem stems from a recent update in which we modified the refresh rules for the OAuth connector. Our goal was a more proactive refresh mechanism, ensuring a valid token is available whenever the system runs.

However, the new refresh rule introduced a bug specific to the Google OAuth refresh process. Because the issue was difficult to reproduce in our automated tests, it was unfortunately deployed to the live environment.

To prevent this from happening again, we are implementing automated validation for the OAuth connector workflow so that these types of issues are caught before release.
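As an illustration, a proactive refresh rule of the kind described above can be sketched as a time-to-expiry check. This is a hedged sketch with hypothetical names (`Token`, `needs_refresh`), not vm0's actual connector code:

```rust
use std::time::{Duration, SystemTime};

// Illustrative token carrying only the field the check needs.
struct Token {
    expires_at: SystemTime,
}

// Proactive rule: refresh when the token is within `margin` of expiry
// (or already expired), so a valid token is on hand when a run starts.
fn needs_refresh(token: &Token, margin: Duration) -> bool {
    match token.expires_at.duration_since(SystemTime::now()) {
        Ok(remaining) => remaining < margin, // close to expiry
        Err(_) => true,                      // already expired
    }
}

fn main() {
    let margin = Duration::from_secs(300); // refresh if < 5 minutes remain
    let soon = Token { expires_at: SystemTime::now() + Duration::from_secs(60) };
    let fresh = Token { expires_at: SystemTime::now() + Duration::from_secs(3600) };
    println!("{} {}", needs_refresh(&soon, margin), needs_refresh(&fresh, margin)); // true false
}
```

The safety margin is the key design choice: it must exceed the longest gap between the check and the token's first use during a run.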
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 1 hour and 18 minutes</p>
    <p><strong>Affected Components:</strong> Connector, Web Page, Runner, API Service, Storage</p>
    &lt;p&gt;&lt;small&gt;Apr &lt;var data-var=&#039;date&#039;&gt; 17&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;11:48:01&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Investigating&lt;/strong&gt; -
  We are currently investigating this incident.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Apr &lt;var data-var=&#039;date&#039;&gt; 17&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;12:07:14&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  We are continuing to work on a fix for this incident.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Apr &lt;var data-var=&#039;date&#039;&gt; 17&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;12:40:02&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Monitoring&lt;/strong&gt; -
  We implemented a fix and are currently monitoring the result.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Apr &lt;var data-var=&#039;date&#039;&gt; 17&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;13:05:59&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Resolved&lt;/strong&gt; -
  This incident has been resolved.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Apr &lt;var data-var=&#039;date&#039;&gt; 17&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;13:23:06&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Postmortem&lt;/strong&gt; -
  The problem stems from a recent update in which we modified the refresh rules for the OAuth connector. Our goal was a more proactive refresh mechanism, ensuring a valid token is available whenever the system runs.

However, the new refresh rule introduced a bug specific to the Google OAuth refresh process. Because the issue was difficult to reproduce in our automated tests, it was unfortunately deployed to the live environment.

To prevent this from happening again, we are implementing automated validation for the OAuth connector workflow so that these types of issues are caught before release.&lt;/p&gt;
]]>
  </content:encoded>
  <pubDate>Fri, 17 Apr 2026 11:48:01 +0000</pubDate>
  <link>https://status.vm0.ai/incident/cmo2ugmg50g0gz3zj3tc7npdk</link>
  <guid>https://status.vm0.ai/incident/cmo2ugmg50g0gz3zj3tc7npdk</guid>
</item>

<item>
  <title>A failure in the Agent task runner is currently preventing Agents from executing tasks.</title>
  <description>
    Type: Incident
    Duration: 1 hour and 6 minutes

    Affected Components: Runner
    Apr 14, 08:43:55 GMT+0 - Investigating - We are currently investigating this incident.
    Apr 14, 09:35:13 GMT+0 - Monitoring - We implemented a fix and are currently monitoring the result.
    Apr 14, 09:50:08 GMT+0 - Resolved - This incident has been resolved.
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 1 hour and 6 minutes</p>
    <p><strong>Affected Components:</strong> Runner</p>
    &lt;p&gt;&lt;small&gt;Apr &lt;var data-var=&#039;date&#039;&gt; 14&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;08:43:55&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Investigating&lt;/strong&gt; -
  We are currently investigating this incident.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Apr &lt;var data-var=&#039;date&#039;&gt; 14&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;09:35:13&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Monitoring&lt;/strong&gt; -
  We implemented a fix and are currently monitoring the result.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Apr &lt;var data-var=&#039;date&#039;&gt; 14&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;09:50:08&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Resolved&lt;/strong&gt; -
  This incident has been resolved.&lt;/p&gt;
]]>
  </content:encoded>
  <pubDate>Tue, 14 Apr 2026 08:43:55 +0000</pubDate>
  <link>https://status.vm0.ai/incident/cmnydkazx02elz659ubr0pxta</link>
  <guid>https://status.vm0.ai/incident/cmnydkazx02elz659ubr0pxta</guid>
</item>

<item>
  <title>Storage download failed</title>
  <description>
    Type: Incident
    Duration: 1 hour and 37 minutes

    Affected Components: Runner
    Apr 10, 02:14:36 GMT+0 - Investigating - We are currently investigating this incident.
    Apr 10, 03:00:13 GMT+0 - Identified - We are continuing to work on a fix for this incident.
    Apr 10, 03:51:11 GMT+0 - Resolved - This incident has been resolved.
    Apr 10, 04:35:41 GMT+0 - Postmortem - # Guest Download Failure Due to Parallel Race Condition on Overlapping Mount Paths

### What Happened

On 2026-04-10, agent jobs with multiple storages failed during VM initialization. The `guest-download` binary, which runs inside the VM to download and extract storage archives, crashed with `canonicalize` ENOENT errors when processing skills storages.

### Root Cause

`guest-download` extracts storage archives in parallel (up to 4 concurrent threads). A recent change added a `remove_dir_all(target_path)` call at the start of each thread to clean stale files on VM reuse (keep-alive).

The storage mount paths have a guaranteed parent-child overlap:

- Instructions mount at `/home/user/.claude`
- Skills mount at `/home/user/.claude/skills/{name}`

When threads run concurrently, the parent path&#039;s `remove_dir_all` deletes child directories already created by sibling threads, causing those threads to fail with ENOENT.

### Impact

* **Scope:** All jobs
* **Duration:** ~11 hours (2026-04-09 16:38 UTC — 2026-04-10 03:21 UTC)

### Timeline (UTC)

| Time               | Event                                                                                |
| ------------------ | ------------------------------------------------------------------------------------ |
| 2026-04-09 16:38  | Code change merged — added `remove_dir_all` pre-cleanup in parallel download threads |
| 2026-04-10 ~02:00 | Job failures reported on prod-3                                                      |
| 2026-04-10 ~02:45 | Root cause identified via prod SSH log analysis                                      |
| 2026-04-10 03:21   | Fix merged and deployed                                                              |

### Fix

* Removed `remove_dir_all` from `download_and_extract()` — threads now only do `create_dir_all` + streaming tar extraction
* Disabled `--keep-alive` in CI and production to avoid VM reuse until proper stale file cleanup is implemented (#8757) 
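The fix relies on `create_dir_all` being idempotent (it tolerates already-existing components), which makes concurrent creation of overlapping parent/child paths safe in a way the `remove_dir_all` pre-cleanup was not. A minimal sketch of that fixed pre-step, with illustrative paths and names rather than the actual `guest-download` code:

```rust
use std::fs;
use std::path::{Path, PathBuf};
use std::thread;

// One thread per storage path, as in the parallel extraction described
// above. Each thread only calls create_dir_all, so the overlapping
// ".claude" (parent) and ".claude/skills/demo" (child) paths no longer
// race the way the remove_dir_all pre-cleanup did.
fn concurrent_create(base: &Path) -> bool {
    let _ = fs::remove_dir_all(base); // fresh demo dir, done serially up front
    let paths: Vec<PathBuf> = vec![
        base.join(".claude"),             // parent mount
        base.join(".claude/skills/demo"), // nested child mount
    ];
    let handles: Vec<_> = paths
        .into_iter()
        .map(|p| thread::spawn(move || fs::create_dir_all(&p).is_ok()))
        .collect();
    handles.into_iter().all(|h| h.join().unwrap())
}

fn main() {
    let base = std::env::temp_dir().join("vm0-race-sketch");
    println!("{}", concurrent_create(&base)); // true: both threads succeed
}
```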
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 1 hour and 37 minutes</p>
    <p><strong>Affected Components:</strong> Runner</p>
    &lt;p&gt;&lt;small&gt;Apr &lt;var data-var=&#039;date&#039;&gt; 10&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;02:14:36&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Investigating&lt;/strong&gt; -
  We are currently investigating this incident.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Apr &lt;var data-var=&#039;date&#039;&gt; 10&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;03:00:13&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  We are continuing to work on a fix for this incident.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Apr &lt;var data-var=&#039;date&#039;&gt; 10&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;03:51:11&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Resolved&lt;/strong&gt; -
  This incident has been resolved.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Apr &lt;var data-var=&#039;date&#039;&gt; 10&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;04:35:41&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Postmortem&lt;/strong&gt; -
  # Guest Download Failure Due to Parallel Race Condition on Overlapping Mount Paths

### What Happened

On 2026-04-10, agent jobs with multiple storages failed during VM initialization. The `guest-download` binary, which runs inside the VM to download and extract storage archives, crashed with `canonicalize` ENOENT errors when processing skills storages.

### Root Cause

`guest-download` extracts storage archives in parallel (up to 4 concurrent threads). A recent change added a `remove_dir_all(target_path)` call at the start of each thread to clean stale files on VM reuse (keep-alive).

The storage mount paths have a guaranteed parent-child overlap:

- Instructions mount at `/home/user/.claude`
- Skills mount at `/home/user/.claude/skills/{name}`

When threads run concurrently, the parent path&#039;s `remove_dir_all` deletes child directories already created by sibling threads, causing those threads to fail with ENOENT.

### Impact

* **Scope:** All jobs
* **Duration:** ~11 hours (2026-04-09 16:38 UTC — 2026-04-10 03:21 UTC)

### Timeline (UTC)

| Time               | Event                                                                                |
| ------------------ | ------------------------------------------------------------------------------------ |
| 2026-04-09 16:38  | Code change merged — added `remove_dir_all` pre-cleanup in parallel download threads |
| 2026-04-10 ~02:00 | Job failures reported on prod-3                                                      |
| 2026-04-10 ~02:45 | Root cause identified via prod SSH log analysis                                      |
| 2026-04-10 03:21  | Fix merged and deployed                                                              |

### Fix

* Removed `remove_dir_all` from `download_and_extract()` — threads now only do `create_dir_all` + streaming tar extraction
* Disabled `--keep-alive` in CI and production to avoid VM reuse until proper stale file cleanup is implemented (#8757).&lt;/p&gt;
]]>
  </content:encoded>
  <pubDate>Fri, 10 Apr 2026 02:14:36 +0000</pubDate>
  <link>https://status.vm0.ai/incident/cmns9w8cj06a5zia892px46qr</link>
  <guid>https://status.vm0.ai/incident/cmns9w8cj06a5zia892px46qr</guid>
</item>

<item>
  <title>We are currently experiencing an outage. Users of the VM0 Managed Token service may be affected.</title>
  <description>
    Type: Incident
    Duration: 1 hour and 14 minutes

    Affected Components: Runner
    Apr 8, 09:45:18 GMT+0 - Investigating - We are currently investigating this incident.
    Apr 8, 09:53:24 GMT+0 - Investigating - This incident is related to our most recent release. We are currently investigating the situation and working to resolve an issue that occurred during the deployment process.
    Apr 8, 10:22:56 GMT+0 - Monitoring - We have submitted the fix code and are observing whether the problem has been resolved.
    Apr 8, 10:46:12 GMT+0 - Resolved - Fixed.
    Apr 8, 10:57:19 GMT+0 - Postmortem - # Postmortem: Agent Task Execution Failure Due to Model ID Format Mismatch

### What Happened

On 2026-04-08, all vm0 agent task runs failed to start. Users were unable to execute any agent tasks across all built-in model provider types.

### Root Cause

The Claude Code CLI and Anthropic API require model IDs in hyphenated format (e.g., `claude-opus-4-6`). However, the model ID constants in our codebase and the values stored in the database used dot notation (e.g., `claude-opus-4.6`). When vm0 attempted to launch an agent, the CLI rejected the model ID at startup, blocking all task execution.

### Impact

* **Scope:** 100% of agent task runs across `vm0`, `anthropic-api-key`, and `claude-code-oauth-token` provider types
* **Duration:** ~30 minutes (09:47 UTC — 10:17 UTC)
* **Not affected:** OpenRouter and Vercel AI Gateway providers (use a separate `anthropic/claude-*` naming convention)

### Timeline (UTC)

| Time  | Event                   |
| ----- | ----------------------- |
| 09:47 | Fix PR opened (#8511)   |
| 10:17 | PR merged and deployed  |
| 10:17 | Task execution restored |

### Fix

1. Renamed all model ID constants in `MODEL_PROVIDER_TYPES` and `VM0_MODEL_TO_PROVIDER` from dot to hyphen format
2. Ran DB migration `0230` to backfill affected rows in `model_providers.selected_model`, `vm0_api_keys.model`, and `credit_pricing.model`
3. Updated dev seed scripts and all related test suites
    Apr 8, 10:58:59 GMT+0 - Resolved - This incident has been resolved.
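The dot-to-hyphen rename in step 1 amounts to a simple ID normalization. A hypothetical sketch (the helper name is illustrative, not the real migration code):

```rust
// Hypothetical helper illustrating the rename: the CLI accepts the
// hyphenated "claude-opus-4-6" but rejected the stored dot-notation
// "claude-opus-4.6".
fn normalize_model_id(id: &str) -> String {
    id.replace('.', "-")
}

fn main() {
    println!("{}", normalize_model_id("claude-opus-4.6")); // claude-opus-4-6
    println!("{}", normalize_model_id("claude-opus-4-6")); // already hyphenated: unchanged
}
```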
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 1 hour and 14 minutes</p>
    <p><strong>Affected Components:</strong> Runner</p>
    &lt;p&gt;&lt;small&gt;Apr &lt;var data-var=&#039;date&#039;&gt; 8&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;09:45:18&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Investigating&lt;/strong&gt; -
  We are currently investigating this incident.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Apr &lt;var data-var=&#039;date&#039;&gt; 8&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;09:53:24&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Investigating&lt;/strong&gt; -
  This incident is related to our most recent release. We are currently investigating the situation and working to resolve an issue that occurred during the deployment process.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Apr &lt;var data-var=&#039;date&#039;&gt; 8&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;10:22:56&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Monitoring&lt;/strong&gt; -
  We have submitted the fix code and are observing whether the problem has been resolved.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Apr &lt;var data-var=&#039;date&#039;&gt; 8&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;10:46:12&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Resolved&lt;/strong&gt; -
  Fixed.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Apr &lt;var data-var=&#039;date&#039;&gt; 8&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;10:57:19&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Postmortem&lt;/strong&gt; -
  # Postmortem: Agent Task Execution Failure Due to Model ID Format Mismatch

### What Happened

On 2026-04-08, all vm0 agent task runs failed to start. Users were unable to execute any agent tasks across all built-in model provider types.

### Root Cause

The Claude Code CLI and Anthropic API require model IDs in hyphenated format (e.g., `claude-opus-4-6`). However, the model ID constants in our codebase and the values stored in the database used dot notation (e.g., `claude-opus-4.6`). When vm0 attempted to launch an agent, the CLI rejected the model ID at startup, blocking all task execution.

### Impact

* **Scope:** 100% of agent task runs across `vm0`, `anthropic-api-key`, and `claude-code-oauth-token` provider types
* **Duration:** ~30 minutes (09:47 UTC — 10:17 UTC)
* **Not affected:** OpenRouter and Vercel AI Gateway providers (use a separate `anthropic/claude-*` naming convention)

### Timeline (UTC)

| Time  | Event                   |
| ----- | ----------------------- |
| 09:47 | Fix PR opened (#8511)   |
| 10:17 | PR merged and deployed  |
| 10:17 | Task execution restored |

### Fix

1. Renamed all model ID constants in `MODEL_PROVIDER_TYPES` and `VM0_MODEL_TO_PROVIDER` from dot to hyphen format
2. Ran DB migration `0230` to backfill affected rows in `model_providers.selected_model`, `vm0_api_keys.model`, and `credit_pricing.model`
3. Updated dev seed scripts and all related test suites.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Apr &lt;var data-var=&#039;date&#039;&gt; 8&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;10:58:59&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Resolved&lt;/strong&gt; -
  This incident has been resolved.&lt;/p&gt;
]]>
  </content:encoded>
  <pubDate>Wed, 8 Apr 2026 09:45:18 +0000</pubDate>
  <link>https://status.vm0.ai/incident/cmnpv44m100l0xnamhsd9whby</link>
  <guid>https://status.vm0.ai/incident/cmnpv44m100l0xnamhsd9whby</guid>
</item>

<item>
  <title>We are currently experiencing some service outages online.</title>
  <description>
    Type: Incident
    Duration: 3 minutes

    Affected Components: Web Page, API Service
    Mar 26, 08:23:50 GMT+0 - Investigating - The agents are temporarily unable to perform conversations or other tasks. We are currently investigating this incident.
    Mar 26, 08:27:03 GMT+0 - Resolved - This incident has been resolved.
    Mar 26, 08:28:16 GMT+0 - Postmortem - Our most recent deployment script encountered several issues during the latest release, which prevented the database migration from executing as planned.

However, the application service proceeded with the deployment under the assumption that the database migration had succeeded.

This resulted in a version inconsistency between the live application and the database, making it impossible to create new event activities.

We are investigating why the database upgrade scripts failed to execute correctly.
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 3 minutes</p>
    <p><strong>Affected Components:</strong> Web Page, API Service</p>
    &lt;p&gt;&lt;small&gt;Mar &lt;var data-var=&#039;date&#039;&gt; 26&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;08:23:50&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Investigating&lt;/strong&gt; -
  The agents are temporarily unable to perform conversations or other tasks. We are currently investigating this incident.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Mar &lt;var data-var=&#039;date&#039;&gt; 26&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;08:27:03&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Resolved&lt;/strong&gt; -
  This incident has been resolved.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Mar &lt;var data-var=&#039;date&#039;&gt; 26&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;08:28:16&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Postmortem&lt;/strong&gt; -
  Our most recent deployment script encountered several issues during the latest release, which prevented the database migration from executing as planned.

However, the application service proceeded with the deployment under the assumption that the database migration had succeeded.

This resulted in a version inconsistency between the live application and the database, making it impossible to create new event activities.

We are investigating why the database upgrade scripts failed to execute correctly.&lt;/p&gt;
]]>
  </content:encoded>
  <pubDate>Thu, 26 Mar 2026 08:23:50 +0000</pubDate>
  <link>https://status.vm0.ai/incident/cmn77hak4000cqot3pmimzkuh</link>
  <guid>https://status.vm0.ai/incident/cmn77hak4000cqot3pmimzkuh</guid>
</item>

<item>
  <title>Service temporarily unavailable</title>
  <description>
    Type: Incident
    Duration: 30 minutes

    Affected Components: Web Page, API Service
    Mar 18, 04:19:36 GMT+0 - Investigating - We are currently investigating this incident.
    Mar 18, 04:49:34 GMT+0 - Resolved - This incident has been resolved.
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 30 minutes</p>
    <p><strong>Affected Components:</strong> Web Page, API Service</p>
    &lt;p&gt;&lt;small&gt;Mar &lt;var data-var=&#039;date&#039;&gt; 18&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;04:19:36&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Investigating&lt;/strong&gt; -
  We are currently investigating this incident.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Mar &lt;var data-var=&#039;date&#039;&gt; 18&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;04:49:34&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Resolved&lt;/strong&gt; -
  This incident has been resolved.&lt;/p&gt;
]]>
  </content:encoded>
  <pubDate>Wed, 18 Mar 2026 04:19:36 +0000</pubDate>
  <link>https://status.vm0.ai/incident/cmmvj8e1j05gjezxdenn7qed4</link>
  <guid>https://status.vm0.ai/incident/cmmvj8e1j05gjezxdenn7qed4</guid>
</item>

<item>
  <title>The website is not allowing users to log in normally.</title>
  <description>
    Type: Incident
    Duration: 38 minutes

    Affected Components: Web Page
    Mar 12, 05:52:18 GMT+0 - Investigating - We are currently investigating this incident.
    Mar 12, 05:52:49 GMT+0 - Investigating - When opening the login page, a blank screen is displayed. Since the username and password fields are not visible, users are unable to complete the login process.
    Mar 12, 05:54:25 GMT+0 - Identified - The issue originated with the most recent security update.
    Mar 12, 06:22:48 GMT+0 - Monitoring - We implemented a fix and are currently monitoring the result.
    Mar 12, 06:30:07 GMT+0 - Resolved - This incident has been resolved.
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 38 minutes</p>
    <p><strong>Affected Components:</strong> Web Page</p>
    &lt;p&gt;&lt;small&gt;Mar &lt;var data-var=&#039;date&#039;&gt; 12&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;05:52:18&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Investigating&lt;/strong&gt; -
  We are currently investigating this incident.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Mar &lt;var data-var=&#039;date&#039;&gt; 12&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;05:52:49&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Investigating&lt;/strong&gt; -
  When opening the login page, a blank screen is displayed. Since the username and password fields are not visible, users are unable to complete the login process.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Mar &lt;var data-var=&#039;date&#039;&gt; 12&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;05:54:25&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  The issue originated with the most recent security update.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Mar &lt;var data-var=&#039;date&#039;&gt; 12&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;06:22:48&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Monitoring&lt;/strong&gt; -
  We implemented a fix and are currently monitoring the result.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Mar &lt;var data-var=&#039;date&#039;&gt; 12&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;06:30:07&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Resolved&lt;/strong&gt; -
  This incident has been resolved.&lt;/p&gt;
]]>
  </content:encoded>
  <pubDate>Thu, 12 Mar 2026 05:52:18 +0000</pubDate>
  <link>https://status.vm0.ai/incident/cmmn1whyg006ddyuxih5nnm4z</link>
  <guid>https://status.vm0.ai/incident/cmmn1whyg006ddyuxih5nnm4z</guid>
</item>

<item>
  <title>www.vm0.ai does not redirect to platform.vm0.ai.</title>
  <description>
    Type: Incident
    Duration: 33 minutes

    Affected Components: Web Page
    Feb 11, 13:04:13 GMT+0 - Investigating - We are currently investigating this incident.
    Feb 11, 13:14:00 GMT+0 - Identified - We are continuing to work on a fix for this incident.
    Feb 11, 13:36:50 GMT+0 - Resolved - This incident has been resolved.
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 33 minutes</p>
    <p><strong>Affected Components:</strong> Web Page</p>
    &lt;p&gt;&lt;small&gt;Feb &lt;var data-var=&#039;date&#039;&gt; 11&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;13:04:13&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Investigating&lt;/strong&gt; -
  We are currently investigating this incident..&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Feb &lt;var data-var=&#039;date&#039;&gt; 11&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;13:14:00&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  We are continuing to work on a fix for this incident..&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Feb &lt;var data-var=&#039;date&#039;&gt; 11&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;13:36:50&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Resolved&lt;/strong&gt; -
  This incident has been resolved.&lt;/p&gt;
]]>
  </content:encoded>
  <pubDate>Wed, 11 Feb 2026 13:04:13 +0000</pubDate>
  <link>https://status.vm0.ai/incident/cmli1k8ru0042hh84qomwdnme</link>
  <guid>https://status.vm0.ai/incident/cmli1k8ru0042hh84qomwdnme</guid>
</item>

<item>
  <title>platform.vm0.ai cannot be opened.</title>
  <description>
    Type: Incident
    Duration: 16 minutes

    
    Feb 6, 13:28:00 GMT+0 - Investigating - We are currently investigating this incident. Feb 6, 14:18:09 GMT+0 - Identified - We determined that the issue originated from a recent deployment of the platform frontend code. Feb 6, 14:29:55 GMT+0 - Resolved - This incident has been resolved. Feb 6, 14:38:03 GMT+0 - Postmortem - This incident was caused by a code refactoring that consolidated references to CLERK\_PUBLISHABLE\_KEY across several websites. However, because the variable name was omitted from the deployment script, the platform failed to locate the legacy CLERK\_PUBLISHABLE\_KEY in the production environment, resulting in page failures that made [platform.vm0.ai](http://platform.vm0.ai) unusable.

The related API services and container services were not affected.

The follow-up remediation plan primarily includes validating required environment variables during the build phase so that problematic code cannot be deployed. Additionally, we will introduce e2e testing for [platform.vm0.ai](http://platform.vm0.ai) to ensure the happy-path workflow functions normally.
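
The remediation above can be sketched as a small build-phase guard. This is an illustrative sketch only, not VM0's actual build script; the helper name and the variable list are assumptions (only CLERK\_PUBLISHABLE\_KEY is named in this incident):

```javascript
// Fail the build early when a required environment variable is absent,
// so a misconfigured deploy never reaches production.
const REQUIRED_ENV = ["CLERK_PUBLISHABLE_KEY"]; // extend as needed

function assertEnv(env = process.env) {
  const missing = REQUIRED_ENV.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing required env vars: ${missing.join(", ")}`);
  }
}

// e.g. called at the top of the build script:
// assertEnv(); // throws and aborts the build when a key is unset
```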
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 16 minutes</p>
    
    &lt;p&gt;&lt;small&gt;Feb &lt;var data-var=&#039;date&#039;&gt; 6&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;13:28:00&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Investigating&lt;/strong&gt; -
  We are currently investigating this incident.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Feb &lt;var data-var=&#039;date&#039;&gt; 6&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;14:18:09&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  We determined that the issue originated from a recent deployment of the platform frontend code.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Feb &lt;var data-var=&#039;date&#039;&gt; 6&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;14:29:55&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Resolved&lt;/strong&gt; -
  This incident has been resolved.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Feb &lt;var data-var=&#039;date&#039;&gt; 6&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;14:38:03&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Postmortem&lt;/strong&gt; -
  This incident was caused by a code refactoring that consolidated references to CLERK\_PUBLISHABLE\_KEY across several websites. However, because the variable name was omitted from the deployment script, the platform failed to locate the legacy CLERK\_PUBLISHABLE\_KEY in the production environment, resulting in page failures that made [platform.vm0.ai](http://platform.vm0.ai) unusable.

The related API services and container services were not affected.

The follow-up remediation plan primarily includes validating required environment variables during the build phase so that problematic code cannot be deployed. Additionally, we will introduce e2e testing for [platform.vm0.ai](http://platform.vm0.ai) to ensure the happy-path workflow functions normally.&lt;/p&gt;
]]>
  </content:encoded>
  <pubDate>Fri, 6 Feb 2026 13:28:00 +0000</pubDate>
  <link>https://status.vm0.ai/incident/cmlayuga407c3a1jxpcjvfqui</link>
  <guid>https://status.vm0.ai/incident/cmlayuga407c3a1jxpcjvfqui</guid>
</item>

<item>
  <title>Agent can&#039;t run</title>
  <description>
    Type: Incident
    Duration: 4 hours and 58 minutes

    Affected Components: Runner
    Jan 19, 02:40:00 GMT+0 - Investigating - We are currently investigating this incident. Jan 19, 02:45:00 GMT+0 - Identified - We are continuing to work on a fix for this incident. Jan 19, 02:50:00 GMT+0 - Identified - The issue stems from a recent runner deployment, and the team is working on a fix Jan 19, 05:50:18 GMT+0 - Identified - We traced the problem to a recent database change; the team is repairing the data that caused it Jan 19, 06:21:45 GMT+0 - Identified - We have now restored normal operation for both the database and task dispatcher, and are currently working on getting Claude Code in the sandbox back online. Jan 19, 07:03:49 GMT+0 - Identified - We have confirmed that the issue lies in the way VM0 invokes Claude Code. The minimal fix has been finalized and is currently being redeployed. Jan 19, 07:38:24 GMT+0 - Resolved - This incident has been resolved. Jan 19, 08:23:59 GMT+0 - Postmortem - Postmortem: Claude Code Hanging in Sandbox

Date: 2026-01-19  
Severity: P0  
Duration: \~4 hours

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

SUMMARY

Production agent runs were failing silently. Claude Code started but never produced output, timing out after 15+ minutes.

Root Cause: stdin was configured as &quot;pipe&quot; but never closed, causing Claude Code to hang waiting for EOF.

Why CI missed it: CI uses mock-claude which doesn&#039;t check stdin state. Real Claude Code does.

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

THE BUG

Before (hangs - stdin pipe never closed):  
spawn(cmd, args, { stdio: \[&quot;pipe&quot;, &quot;pipe&quot;, &quot;pipe&quot;\] })

After (works - stdin is /dev/null, immediate EOF):  
spawn(cmd, args, { stdio: \[&quot;ignore&quot;, &quot;pipe&quot;, &quot;pipe&quot;\] })

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

HOW WE FOUND IT

1. SSH into sandbox
2. cat /tmp/vm0-agent-\*.log → empty (no Claude output)
3. ps aux | grep claude → process alive, using 23% memory
4. ps -p 510 -o wchan → ep\_pol (waiting on I/O)
5. ls -la /proc/510/fd/0 → stdin connected to pipe
6. Manual &quot;claude --print hello&quot; → works (TTY mode)

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

WHY NO ROLLBACK

Release included database migration. Forward-fix was safer.

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

INVESTIGATION NOISE

Runner npm publish failure (@vm0/core not built) was unrelated but consumed investigation time.

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

ACTION ITEMS

✅ Fix stdin → &quot;ignore&quot; ([#1316](https://github.com/vm0-ai/vm0/pull/1316))  
✅ Add spawn unit tests ([#1319](https://github.com/vm0-ai/vm0/pull/1319))  
✅ Fix CI publish jobs ([#1306](https://github.com/vm0-ai/vm0/pull/1306), [#1318](https://github.com/vm0-ai/vm0/pull/1318))  
🔲 Add real Claude test in CI (TODO)

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

KEY LESSON

Mock ≠ Real: CI must include at least one test with real Claude Code to catch behavior differences like stdin handling. 
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 4 hours and 58 minutes</p>
    <p><strong>Affected Components:</strong> Runner</p>
    &lt;p&gt;&lt;small&gt;Jan &lt;var data-var=&#039;date&#039;&gt; 19&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;02:40:00&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Investigating&lt;/strong&gt; -
  We are currently investigating this incident.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Jan &lt;var data-var=&#039;date&#039;&gt; 19&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;02:45:00&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  We are continuing to work on a fix for this incident.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Jan &lt;var data-var=&#039;date&#039;&gt; 19&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;02:50:00&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  The issue stems from a recent runner deployment, and the team is working on a fix.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Jan &lt;var data-var=&#039;date&#039;&gt; 19&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;05:50:18&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  We traced the problem to a recent database change; the team is repairing the data that caused it.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Jan &lt;var data-var=&#039;date&#039;&gt; 19&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;06:21:45&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  We have now restored normal operation for both the database and task dispatcher, and are currently working on getting Claude Code in the sandbox back online.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Jan &lt;var data-var=&#039;date&#039;&gt; 19&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;07:03:49&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  We have confirmed that the issue lies in the way VM0 invokes Claude Code. The minimal fix has been finalized and is currently being redeployed.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Jan &lt;var data-var=&#039;date&#039;&gt; 19&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;07:38:24&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Resolved&lt;/strong&gt; -
  This incident has been resolved.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Jan &lt;var data-var=&#039;date&#039;&gt; 19&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;08:23:59&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Postmortem&lt;/strong&gt; -
  Postmortem: Claude Code Hanging in Sandbox

Date: 2026-01-19  
Severity: P0  
Duration: \~4 hours

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

SUMMARY

Production agent runs were failing silently. Claude Code started but never produced output, timing out after 15+ minutes.

Root Cause: stdin was configured as &quot;pipe&quot; but never closed, causing Claude Code to hang waiting for EOF.

Why CI missed it: CI uses mock-claude which doesn&#039;t check stdin state. Real Claude Code does.

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

THE BUG

Before (hangs - stdin pipe never closed):  
spawn(cmd, args, { stdio: \[&quot;pipe&quot;, &quot;pipe&quot;, &quot;pipe&quot;\] })

After (works - stdin is /dev/null, immediate EOF):  
spawn(cmd, args, { stdio: \[&quot;ignore&quot;, &quot;pipe&quot;, &quot;pipe&quot;\] })

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

HOW WE FOUND IT

1. SSH into sandbox
2. cat /tmp/vm0-agent-\*.log → empty (no Claude output)
3. ps aux | grep claude → process alive, using 23% memory
4. ps -p 510 -o wchan → ep\_pol (waiting on I/O)
5. ls -la /proc/510/fd/0 → stdin connected to pipe
6. Manual &quot;claude --print hello&quot; → works (TTY mode)

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

WHY NO ROLLBACK

Release included database migration. Forward-fix was safer.

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

INVESTIGATION NOISE

Runner npm publish failure (@vm0/core not built) was unrelated but consumed investigation time.

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

ACTION ITEMS

✅ Fix stdin → &quot;ignore&quot; ([#1316](https://github.com/vm0-ai/vm0/pull/1316))  
✅ Add spawn unit tests ([#1319](https://github.com/vm0-ai/vm0/pull/1319))  
✅ Fix CI publish jobs ([#1306](https://github.com/vm0-ai/vm0/pull/1306), [#1318](https://github.com/vm0-ai/vm0/pull/1318))  
🔲 Add real Claude test in CI (TODO)

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

KEY LESSON

Mock ≠ Real: CI must include at least one test with real Claude Code to catch behavior differences like stdin handling.&lt;/p&gt;
]]>
  </content:encoded>
  <pubDate>Mon, 19 Jan 2026 02:40:00 +0000</pubDate>
  <link>https://status.vm0.ai/incident/cmkkqswpo002w10pflokikw48</link>
  <guid>https://status.vm0.ai/incident/cmkkqswpo002w10pflokikw48</guid>
</item>

  </channel>
  </rss>