Resolved
No impact has been observed for the last 90 minutes; the issue is fully resolved.
Monitoring
We continue to observe the recovery; queuing times have been back at normal levels for an hour now. The team continues to monitor the situation.
Identified
We have confirmed that functionality has recovered across all regions, and queuing times have been back at expected levels for the last 30 minutes. We are continuing to make changes and monitor the situation.
Identified
We have added more capacity to our EU shard and are observing recovery in that region. Some workloads may still see slightly longer queuing times, but wait times are improving. The team continues to monitor the situation and is still adding capacity.
Identified
The team is currently working on adding compute capacity to the EU shard. Queue times for starting new jobs remain degraded.
Identified
We continue to use US compute capacity to reduce pressure on the EU compute shard, and we are seeing a slow recovery. In the meantime, we are preparing mitigations to restore EU capacity. AMD64 and ARM64 capacity are still experiencing degraded creation times. macOS capacity is operational.
Identified
We are mitigating the impact of the unavailable compute capacity in our EU shard by using capacity from our US shard. We still see high queue times for all jobs and a full outage for ARM64 instances.
Identified
The team has identified a scheduling component that is overloaded with requests. The entire team is working to relieve the load on this component.
Investigating
We are investigating an outage affecting the creation of new instances in our EU compute shard.