Vespa Cloud - New nodes can fail to provision - Incident details

New nodes can fail to provision

Resolved
Degraded performance
Started 3 days ago
Lasted about 7 hours

Affected

Enclave

Operational from 7:30 AM to 10:06 AM, Degraded performance from 10:06 AM to 12:47 PM, Operational from 12:47 PM to 2:03 PM

AWS

Operational from 7:30 AM to 10:06 AM, Degraded performance from 10:06 AM to 12:47 PM, Operational from 12:47 PM to 2:03 PM

Zones

Degraded performance from 7:30 AM to 9:07 AM, Operational from 7:30 AM to 10:06 AM, Degraded performance from 9:07 AM to 12:47 PM, Operational from 12:47 PM to 2:03 PM

dev

Operational from 7:30 AM to 10:06 AM, Degraded performance from 10:06 AM to 12:47 PM, Operational from 12:47 PM to 2:03 PM

dev.aws-euw1-az1

Operational from 7:30 AM to 10:06 AM, Degraded performance from 10:06 AM to 12:47 PM, Operational from 12:47 PM to 2:03 PM

dev.aws-us-east-1c

Operational from 7:30 AM to 10:06 AM, Degraded performance from 10:06 AM to 12:47 PM, Operational from 12:47 PM to 2:03 PM

Updates
  • Resolved
    This incident has been resolved.
  • Monitoring

    The rollout is complete and new nodes are able to provision.

    We are monitoring the situation and will post a final update later today.

  • Update

    We have found an edge case that has been slowing down the rollout, and we are in the process of mitigating it.

    For now, new hosts are still impacted.

    Next update at 12:15 or earlier.

    There is no impact for existing nodes.

  • Update

    The rollout is still in progress, and provisioning of new nodes on AWS is still impacted.

    Existing nodes are not impacted.

  • Update

    We have implemented a fix and it is currently rolling out fleet-wide.

    Provisioning of new nodes is still impacted, but we expect to resolve this within 30 minutes. We will post an update as soon as possible, and by 11:00 UTC at the latest.

    Existing nodes are not affected.

  • Update

    We are continuing to work on a fix for this incident. We have noticed failures for new hosts in more zones.

    The ticket has been updated with the new zones.

    Existing nodes are unaffected, only provisioning of new nodes is impacted.

  • Identified

    We have identified the root cause and started working on a workaround.

  • Investigating

    We have observed that new nodes are failing to provision.

    Existing nodes are unaffected.