VM0 Status - Incident history

Due to a performance degradation with the upstream provider, our Sonnet 4.6 model is currently affected.

2026-04-24T03:06:08.995+00:00

Type: Incident

Duration: 45 minutes

Affected Components: API Service

Apr 24, 03:06:08 GMT+0
Investigating - We suggest temporarily switching to other models.

Apr 24, 03:50:56 GMT+0
Resolved - This incident has been resolved.

Currently unable to log in to the website.

2026-04-21T10:40:51.048+00:00

Type: Incident

Duration: 57 minutes

Affected Components: Web Page

Apr 21, 10:40:51 GMT+0
Investigating - We are currently investigating this incident.

Apr 21, 11:02:54 GMT+0
Identified - We are continuing to work on a fix for this incident. A hotfix is being prepared for deployment to the live environment.

Apr 21, 11:33:59 GMT+0
Monitoring - We are currently releasing a hotfix to resolve this issue.

Apr 21, 11:37:38 GMT+0
Resolved - This incident has been resolved.

Apr 21, 11:40:46 GMT+0
Postmortem - Today, while adding a permission-related feature to the system, we implemented a new authentication method and introduced a breaking change that prevented web users from logging in. Our automated checks should have caught this change, but the check was accidentally bypassed, which let the bug reach the production environment. We have since revoked the ability to bypass checks to prevent similar incidents.

The system is unable to run any tasks.

2026-04-21T05:49:09.951+00:00

Type: Incident

Duration: 4 minutes

Affected Components: Runner

Apr 21, 05:49:09 GMT+0
Investigating - We are currently investigating this incident.

Apr 21, 05:50:58 GMT+0
Identified - We are continuing to work on a fix for this incident.

Apr 21, 05:52:41 GMT+0
Resolved - This incident has been resolved.

The Google Workspace connector is currently not working properly.

2026-04-17T11:48:01.967+00:00

Type: Incident

Duration: 1 hour and 18 minutes

Affected Components: Connector, Web Page, Runner, API Service, Storage

Apr 17, 11:48:01 GMT+0
Investigating - We are currently investigating this incident.

Apr 17, 12:07:14 GMT+0
Identified - We are continuing to work on a fix for this incident.

Apr 17, 12:40:02 GMT+0
Monitoring - We implemented a fix and are currently monitoring the result.

Apr 17, 13:05:59 GMT+0
Resolved - This incident has been resolved.

Apr 17, 13:23:06 GMT+0
Postmortem - The problem stems from a recent update in which we modified the update rules for the OAuth connector. Our goal was a more proactive refresh mechanism, so that a valid token is available whenever the system runs. However, the new refresh rule introduced a bug specific to the Google OAuth refresh process. Because the issue was difficult to reproduce in our automated checks, it was deployed to the live environment. To prevent a recurrence, we are implementing automated validation for the OAuth connector workflow so that issues of this type are caught before release.
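A proactive refresh rule of the kind described above can be sketched as follows. This is only an illustration under stated assumptions: the `OAuthToken` shape, the `needsRefresh` helper, and the 5-minute margin are hypothetical and are not taken from the VM0 codebase.

```typescript
// Hypothetical token shape; the connector's real types are not shown in the postmortem.
interface OAuthToken {
  accessToken: string;
  expiresAtMs: number; // absolute expiry time, in milliseconds since the Unix epoch
}

// Assumed safety margin: refresh when 5 minutes or less of validity remain.
const REFRESH_MARGIN_MS = 5 * 60 * 1000;

// Returns true when the token should be refreshed before a run starts.
function needsRefresh(
  token: OAuthToken,
  nowMs: number,
  marginMs: number = REFRESH_MARGIN_MS
): boolean {
  return token.expiresAtMs - nowMs <= marginMs;
}

const now = 1000000;
const fresh: OAuthToken = { accessToken: "example", expiresAtMs: now + 60 * 60 * 1000 };
const nearlyExpired: OAuthToken = { accessToken: "example", expiresAtMs: now + 60 * 1000 };

console.log(needsRefresh(fresh, now)); // false: one hour of validity left
console.log(needsRefresh(nearlyExpired, now)); // true: one minute of validity left
```

Keeping the rule a pure function of the expiry time and the current time makes it straightforward to cover with automated checks, independent of any specific OAuth provider.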

We are experiencing a failure with the Agent task runner, which is currently preventing our Agents from executing tasks.

2026-04-14T08:43:55.642+00:00

Type: Incident

Duration: 1 hour and 6 minutes

Affected Components: Runner

Apr 14, 08:43:55 GMT+0
Investigating - We are currently investigating this incident.

Apr 14, 09:35:13 GMT+0
Monitoring - We implemented a fix and are currently monitoring the result.

Apr 14, 09:50:08 GMT+0
Resolved - This incident has been resolved.

Storage download failed

2026-04-10T02:14:36.526+00:00

Type: Incident

Duration: 1 hour and 37 minutes

Affected Components: Runner

Apr 10, 02:14:36 GMT+0
Investigating - We are currently investigating this incident.

Apr 10, 03:00:13 GMT+0
Identified - We are continuing to work on a fix for this incident.

Apr 10, 03:51:11 GMT+0
Resolved - This incident has been resolved.

Apr 10, 04:35:41 GMT+0
Postmortem - Guest Download Failure Due to Parallel Race Condition on Overlapping Mount Paths

### What Happened

On 2026-04-10, agent jobs with multiple storages failed during VM initialization. The `guest-download` binary, which runs inside the VM to download and extract storage archives, crashed with `canonicalize` ENOENT errors when processing skills storages.

### Root Cause

`guest-download` extracts storage archives in parallel (up to 4 concurrent threads). A recent change added a `remove_dir_all(target_path)` call at the start of each thread to clean stale files on VM reuse (keep-alive).

The storage mount paths have a guaranteed parent-child overlap:

- Instructions mount at `/home/user/.claude`
- Skills mount at `/home/user/.claude/skills/{name}`

When threads run concurrently, the parent path's `remove_dir_all` deletes child directories already created by sibling threads, causing those threads to fail with ENOENT.

### Impact

* **Scope:** All jobs
* **Duration:** ~11 hours (2026-04-09 16:38 UTC to 2026-04-10 03:21 UTC)

### Timeline (UTC)

| Time              | Event                                                                                |
| ----------------- | ------------------------------------------------------------------------------------ |
| 2026-04-09 16:38  | Code change merged: added `remove_dir_all` pre-cleanup in parallel download threads  |
| 2026-04-10 ~02:00 | Job failures reported on prod-3                                                      |
| 2026-04-10 ~02:45 | Root cause identified via prod SSH log analysis                                      |
| 2026-04-10 03:21  | Fix merged and deployed                                                              |

### Fix

* Removed `remove_dir_all` from `download_and_extract()`; threads now only do `create_dir_all` + streaming tar extraction
* Disabled `--keep-alive` in CI and production to avoid VM reuse until proper stale file cleanup is implemented (#8757).
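The parent-child overlap described in this postmortem can be detected ahead of time. The sketch below is illustrative only and is not the `guest-download` implementation: `stripTrailingSlashes`, `isAncestor`, and `findOverlaps` are hypothetical helpers, and the skill name `example` and the `/home/user/workspace` mount are made up. A caller could use the result to run any destructive cleanup on a parent path before the parallel extraction phase begins.

```typescript
// Strip trailing slashes so "/a/b/" and "/a/b" compare equal.
function stripTrailingSlashes(path: string): string {
  const trimmed = path.replace(/\/+$/, "");
  return trimmed === "" ? "/" : trimmed;
}

// True when `child` lies strictly inside `parent`.
function isAncestor(parent: string, child: string): boolean {
  const p = stripTrailingSlashes(parent);
  const c = stripTrailingSlashes(child);
  if (p === c) return false;
  // Compare against "parent/" so "/a/b" is not treated as an ancestor of "/a/bc".
  const prefix = p === "/" ? "/" : p + "/";
  return c.slice(0, prefix.length) === prefix;
}

// Returns every [parent, child] pair among the given mount paths.
function findOverlaps(paths: string[]): Array<[string, string]> {
  const pairs: Array<[string, string]> = [];
  for (const parent of paths) {
    for (const child of paths) {
      if (isAncestor(parent, child)) pairs.push([parent, child]);
    }
  }
  return pairs;
}

const mounts = [
  "/home/user/.claude/skills/example", // "example" is a made-up skill name
  "/home/user/.claude",
  "/home/user/workspace", // made-up unrelated mount
];
const overlaps = findOverlaps(mounts);
console.log(JSON.stringify(overlaps));
// [["/home/user/.claude","/home/user/.claude/skills/example"]]
```

Any non-empty result means that a recursive delete on the parent path can remove a directory another worker is using, so such deletes need to be ordered before the children are created rather than run concurrently with them.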

We are currently experiencing an outage. Users of the V0 Managed Token service may be affected.

2026-04-08T09:45:18.190+00:00

Type: Incident

Duration: 1 hour and 14 minutes

Affected Components: Runner

Apr 8, 09:45:18 GMT+0
Investigating - We are currently investigating this incident.

Apr 8, 09:53:24 GMT+0
Investigating - This incident is related to our most recent release. We are investigating and working to resolve an issue that occurred during the deployment process.

Apr 8, 10:22:56 GMT+0
Monitoring - We have submitted a fix and are monitoring whether it resolves the problem.

Apr 8, 10:46:12 GMT+0
Resolved - Fixed.

Apr 8, 10:57:19 GMT+0
Postmortem - Agent Task Execution Failure Due to Model ID Format Mismatch

### What Happened

On 2026-04-08, all vm0 agent task runs failed to start. Users were unable to execute any agent tasks across all built-in model provider types.

### Root Cause

The Claude Code CLI and Anthropic API require model IDs in hyphenated format (e.g., `claude-opus-4-6`). However, the model ID constants in our codebase and the values stored in the database used dot notation (e.g., `claude-opus-4.6`). When vm0 attempted to launch an agent, the CLI rejected the model ID at startup, blocking all task execution.

### Impact

* **Scope:** 100% of agent task runs across `vm0`, `anthropic-api-key`, and `claude-code-oauth-token` provider types
* **Duration:** ~30 minutes (09:47 UTC to 10:17 UTC)
* **Not affected:** OpenRouter and Vercel AI Gateway providers (use a separate `anthropic/claude-*` naming convention)

### Timeline (UTC)

| Time  | Event                   |
| ----- | ----------------------- |
| 09:47 | Fix PR opened (#8511)   |
| 10:17 | PR merged and deployed  |
| 10:17 | Task execution restored |

### Fix

1. Renamed all model ID constants in `MODEL_PROVIDER_TYPES` and `VM0_MODEL_TO_PROVIDER` from dot to hyphen format
2. Ran DB migration `0230` to backfill affected rows in `model_providers.selected_model`, `vm0_api_keys.model`, and `credit_pricing.model`
3. Updated dev seed scripts and all related test suites.
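The dot-to-hyphen conversion described in the fix can be illustrated with a small sketch. The helper names `toHyphenatedModelId` and `isHyphenatedModelId` are hypothetical, and the validation pattern is an assumption about what a hyphenated ID looks like; neither is taken from the vm0 codebase or from migration `0230`.

```typescript
// Hypothetical helper: converts a dot-notation model ID to the hyphenated form.
function toHyphenatedModelId(modelId: string): string {
  return modelId.replace(/\./g, "-");
}

// Assumed check: lowercase letters and digits in groups separated by single hyphens.
// It targets the built-in provider IDs only, not the separate
// `anthropic/claude-*` convention mentioned in the postmortem.
function isHyphenatedModelId(modelId: string): boolean {
  return /^[a-z0-9]+(-[a-z0-9]+)*$/.test(modelId);
}

console.log(toHyphenatedModelId("claude-opus-4.6")); // claude-opus-4-6
console.log(isHyphenatedModelId("claude-opus-4.6")); // false
console.log(isHyphenatedModelId("claude-opus-4-6")); // true
```

A check like `isHyphenatedModelId` could run in a test over every model ID constant, so that a format mismatch fails before release instead of at CLI startup.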

Apr 8, 10:58:59 GMT+0
Resolved - This incident has been resolved.

We are currently experiencing some service outages online.

2026-03-26T08:23:50.672+00:00

Type: Incident

Duration: 3 minutes

Affected Components: Web Page, API Service

Mar 26, 08:23:50 GMT+0
Investigating - The agents are temporarily unable to perform conversations or other tasks. We are currently investigating this incident.

Mar 26, 08:27:03 GMT+0
Resolved - This incident has been resolved.

Mar 26, 08:28:16 GMT+0
Postmortem - During the latest release, our deployment script encountered several issues that prevented the database migration from running as planned. The application service was deployed anyway, on the assumption that the migration had succeeded. The resulting inconsistency between the live application and the database versions made it impossible to create a new event activity. We are investigating why the database upgrade scripts failed to run correctly.
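The failure mode above, where a later step proceeds although an earlier one failed, can be avoided by having each step report its result and stopping at the first failure. The following is a minimal sketch under stated assumptions: `DeployStep`, `runDeployment`, and the step labels are hypothetical and do not describe the actual deployment script.

```typescript
// Hypothetical step shape: each step reports success (true) or failure (false).
interface DeployStep {
  label: string;
  run: () => boolean;
}

// Runs steps in order and stops at the first failure.
// `executed` lists the labels of the steps that were actually started.
function runDeployment(steps: DeployStep[]): { executed: string[]; succeeded: boolean } {
  const executed: string[] = [];
  for (const step of steps) {
    executed.push(step.label);
    if (!step.run()) {
      return { executed: executed, succeeded: false };
    }
  }
  return { executed: executed, succeeded: true };
}

const outcome = runDeployment([
  { label: "migrate-database", run: () => false }, // simulate a failed migration
  { label: "deploy-application", run: () => true },
]);
console.log(outcome.succeeded); // false
console.log(outcome.executed.join(",")); // migrate-database
```

Because the application step never starts after a failed migration, the live application and the database cannot drift apart in the way described.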

Service temporarily unavailable

2026-03-18T04:19:36.603+00:00

Type: Incident

Duration: 30 minutes

Affected Components: Web Page, API Service

Mar 18, 04:19:36 GMT+0
Investigating - We are currently investigating this incident.

Mar 18, 04:49:34 GMT+0
Resolved - This incident has been resolved.

The website is not allowing users to log in normally.

2026-03-12T05:52:18.976+00:00

Type: Incident

Duration: 38 minutes

Affected Components: Web Page

Mar 12, 05:52:18 GMT+0
Investigating - We are currently investigating this incident.

Mar 12, 05:52:49 GMT+0
Investigating - When opening the login page, a blank screen is displayed. Since the username and password fields are not visible, users are unable to complete the login process.

Mar 12, 05:54:25 GMT+0
Identified - The issue originated with the most recent security update.

Mar 12, 06:22:48 GMT+0
Monitoring - We implemented a fix and are currently monitoring the result.

Mar 12, 06:30:07 GMT+0
Resolved - This incident has been resolved.

www.vm0.ai cannot directly redirect to platform.vm0.ai.

2026-02-11T13:04:13.947+00:00

Type: Incident

Duration: 33 minutes

Affected Components: Web Page

Feb 11, 13:04:13 GMT+0
Investigating - We are currently investigating this incident.

Feb 11, 13:14:00 GMT+0
Identified - We are continuing to work on a fix for this incident.

Feb 11, 13:36:50 GMT+0
Resolved - This incident has been resolved.

The platform.vm0.ai cannot be opened.

2026-02-06T13:28:00.000+00:00

Type: Incident

Duration: 16 minutes

Feb 6, 13:28:00 GMT+0
Investigating - We are currently investigating this incident.

Feb 6, 14:18:09 GMT+0
Identified - We determined that the issue originated from a recent deployment of the platform frontend code.

Feb 6, 14:29:55 GMT+0
Resolved - This incident has been resolved.

Feb 6, 14:38:03 GMT+0
Postmortem - This incident was caused by a code refactoring that consolidated references to `CLERK_PUBLISHABLE_KEY` across several websites. Because the variable name was omitted from the deployment script, the platform could not find the legacy `CLERK_PUBLISHABLE_KEY` in the production environment, which caused page failures and made [platform.vm0.ai](http://platform.vm0.ai) unusable. The related API services and container services were not affected. As follow-up remediation, we plan to validate required environment variables during the build phase so that problematic code is not deployed, and to introduce e2e testing for [platform.vm0.ai](http://platform.vm0.ai) to ensure the happy-path workflow functions normally.
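The build-phase validation mentioned in the remediation plan could look roughly like the sketch below. It is an illustration only: `findMissingEnvVars`, the `Env` type, and the `OTHER_SETTING` variable are hypothetical, and the list of required variables contains only the one name that appears in the postmortem.

```typescript
// Environment as a plain map so the check can be tested without a real process.
type Env = { [key: string]: string | undefined };

// Returns the names of required variables that are missing or blank in `env`.
function findMissingEnvVars(required: string[], env: Env): string[] {
  return required.filter((key) => {
    const value = env[key];
    return value === undefined || value.trim() === "";
  });
}

const requiredVars = ["CLERK_PUBLISHABLE_KEY"];
const exampleEnv: Env = { OTHER_SETTING: "1" }; // made-up environment without the key

const missing = findMissingEnvVars(requiredVars, exampleEnv);
if (missing.length > 0) {
  // A real build script would fail here instead of only logging.
  console.log("Missing required environment variables: " + missing.join(", "));
}
```

In a Node.js build script, `process.env` could be passed as the `env` argument and the build aborted when the returned list is not empty.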

Agent can't run

2026-01-19T02:40:00.000+00:00

Type: Incident

Duration: 4 hours and 58 minutes

Affected Components: Runner

Jan 19, 02:40:00 GMT+0
Investigating - We are currently investigating this incident.

Jan 19, 02:45:00 GMT+0
Identified - We are continuing to work on a fix for this incident.

Jan 19, 02:50:00 GMT+0
Identified - The problem comes from a recent runner deployment, and the team is working on a fix.

Jan 19, 05:50:18 GMT+0
Identified - We traced the problem to a recent database change, and the team is repairing the data that caused it.

Jan 19, 06:21:45 GMT+0
Identified - We have restored normal operation for both the database and the task dispatcher, and are working on getting Claude Code in the sandbox back online.

Jan 19, 07:03:49 GMT+0
Identified - We have confirmed that the issue lies in the way VM0 calls Claude Code. A minimal fix has been finalized and is currently being redeployed.

Jan 19, 07:38:24 GMT+0
Resolved - This incident has been resolved.

Jan 19, 08:23:59 GMT+0
Postmortem - Claude Code Hanging in Sandbox

Date: 2026-01-19
Severity: P0
Duration: ~4 hours

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
SUMMARY

Production agent runs were failing silently. Claude Code started but never produced output, timing out after 15+ minutes.

Root Cause: stdin was configured as "pipe" but never closed, causing Claude Code to hang waiting for EOF.

Why CI missed it: CI uses mock-claude, which doesn't check stdin state. Real Claude Code does.

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
THE BUG

Before (hangs - stdin pipe never closed):

    spawn(cmd, args, { stdio: ["pipe", "pipe", "pipe"] })

After (works - stdin is /dev/null, immediate EOF):

    spawn(cmd, args, { stdio: ["ignore", "pipe", "pipe"] })

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
HOW WE FOUND IT

1. SSH into sandbox
2. cat /tmp/vm0-agent-*.log → empty (no Claude output)
3. ps aux | grep claude → process alive, using 23% memory
4. ps -p 510 -o wchan → ep_pol (waiting on I/O)
5. ls -la /proc/510/fd/0 → stdin connected to pipe
6. Manual "claude --print hello" → works (TTY mode)

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
WHY NO ROLLBACK

Release included database migration. Forward-fix was safer.

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
INVESTIGATION NOISE

Runner npm publish failure (@vm0/core not built) was unrelated but consumed investigation time.

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
ACTION ITEMS

✅ Fix stdin → "ignore" ([#1316](https://github.com/vm0-ai/vm0/pull/1316))
✅ Add spawn unit tests ([#1319](https://github.com/vm0-ai/vm0/pull/1319))
✅ Fix CI publish jobs ([#1306](https://github.com/vm0-ai/vm0/pull/1306), [#1318](https://github.com/vm0-ai/vm0/pull/1318))
🔲 Add real Claude test in CI (TODO)

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
KEY LESSON

Mock ≠ Real: CI must include at least one test with real Claude Code to catch behavior differences like stdin handling.
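A unit-level guard for the stdin setting described in this postmortem might look like the sketch below. It is an assumption about what such a check could be, not the content of the spawn tests added in #1319: `StdioMode`, `StdioTriple`, `agentStdio`, and `stdinNeedsExplicitClose` are hypothetical names.

```typescript
// The three stdio modes used in the postmortem's before/after snippets.
type StdioMode = "pipe" | "ignore" | "inherit";
type StdioTriple = [StdioMode, StdioMode, StdioMode]; // [stdin, stdout, stderr]

// Hypothetical options builder for non-interactive agent runs.
function agentStdio(): StdioTriple {
  // stdin is "ignore" so the child sees EOF immediately; stdout and stderr stay piped.
  return ["ignore", "pipe", "pipe"];
}

// True when stdin is a pipe that the parent process would have to close itself.
function stdinNeedsExplicitClose(stdio: StdioTriple): boolean {
  return stdio[0] === "pipe";
}

console.log(stdinNeedsExplicitClose(agentStdio())); // false
console.log(stdinNeedsExplicitClose(["pipe", "pipe", "pipe"])); // true
```

If a caller does need a piped stdin in Node.js, ending the stream with `child.stdin.end()` after writing is what delivers EOF to the child process.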