Platform Latency
Incident Report for Avochato

What Happened

During routine auto-scaling triggered by the automated rotation of application servers, the Avochato platform suffered network failures while brokering client-side websocket requests to application servers. An application-layer fix for client-side JavaScript errors experienced by some customers inadvertently amplified the volume of retry requests, creating an insurmountable queue of requests to our websocket broker database.

Secure websocket connections deliver real-time notifications and app updates in the live inbox, and have built-in retry mechanisms to keep clients connected even when they intermittently lose connectivity. A high volume of concurrent retry requests timed out and filled the retry queue, where they continued to time out and fail at a compounding rate as browsers interacted with Avochato.

This led to an effective denial of service: the retry mechanisms generated an overwhelming volume of requests that compounded with peak platform usage by our user base. Exponential back-off did not keep the request rate from individual clients below a threshold our network could process expediently. Unlike server-side resources, which we control directly, the Avochato engineering team had no effective means to prevent problematic clients from reconnecting, and rushed to isolate and stem the root cause, specifically by deauthenticating certain sessions remotely.
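Avochato has not published its client code, but the failure mode described above is the classic "thundering herd" of reconnecting websocket clients. A minimal sketch of capped exponential back-off with jitter, the standard technique for keeping aggregate retry volume below a safe threshold, might look like the following (all names and parameters here are illustrative, not Avochato's actual implementation):

```javascript
// Sketch of capped exponential back-off with full jitter for websocket
// reconnects. Names and parameter values are illustrative assumptions.
function reconnectDelayMs(attempt, baseMs = 1000, capMs = 60000) {
  // Delay grows as 2^attempt but is clamped to a ceiling, so late
  // retries settle at a predictable, low steady-state rate.
  const exp = Math.min(capMs, baseMs * 2 ** attempt);
  // Full jitter spreads clients out randomly so they do not retry in
  // lockstep after a shared outage.
  return Math.random() * exp;
}

function scheduleReconnect(attempt, maxAttempts, connectFn) {
  // Giving up after a fixed number of attempts bounds the total load a
  // single disconnected client can place on the broker.
  if (attempt >= maxAttempts) return null;
  return setTimeout(connectFn, reconnectDelayMs(attempt));
}
```

Without the cap and the jitter, every client that lost its connection at the same moment retries at the same moment, which is how a recovery mechanism can itself become a denial of service.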

Avochato servers remained operational and reachable on the open internet during the impact period, but interactions with the app were queued at the network level, causing extreme delays for end-users and API requests, as well as delays in tagging data, uploading contacts, and making outbound or routing incoming calls.

The incident persisted while the massive queue of requests was processed; the Avochato engineering team did not have tools available to clear the queue without risking data loss.


Resolution

The Avochato platform auto-scaled application servers in response to the increase in traffic to handle the peak in usage.

Engineers were alerted and immediately began triaging reports of latency. After evaluating network traffic and logs, our team identified the root cause and began developing mechanisms to stem websocket retry requests. Successive mitigations by the engineering team reduced, but could not eliminate, the above-average in-app latency while problematic clients remained online. Some cohorts of users were securely logged out remotely to prevent their clients from overloading Avochato, and back-off mechanisms were modified to dramatically increase the period between retry requests.

Meanwhile, upgrades to the open-source websocket broker libraries used by the platform were identified, patched, tested, and deployed to production application servers to address the root cause. Additional logging was also implemented to better quantify the volume of these requests for internal triage.

Functionality to securely reload or disable runaway clients has been developed and deployed to production to prevent the root cause from recurring across the platform.
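The report does not describe how this remote reload/disable capability works; one common design is a server-pushed control message that a client handles ahead of its normal message processing, acting as a per-client circuit breaker. A hypothetical sketch (message types and the `client` interface are assumptions, not Avochato's API):

```javascript
// Sketch of a server-pushed "kill switch" handler: the server can tell a
// runaway client to reload itself or to stop retrying. Message type names
// and the client interface are illustrative assumptions.
function handleControlMessage(msg, client) {
  switch (msg.type) {
    case 'force_reload':
      // Fetch fresh application code, discarding the buggy retry loop.
      client.reload();
      return 'reloaded';
    case 'suspend_retries':
      // Server-side circuit breaker: this client stops reconnecting
      // until explicitly re-enabled.
      client.retriesEnabled = false;
      return 'suspended';
    default:
      // Unknown control messages are ignored for forward compatibility.
      return 'ignored';
  }
}
```

The key property is that the server, not the client, decides when a client's retry behavior has become toxic, restoring the control asymmetry the team lacked during the incident.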

Additional points of failure were identified at the networking level, and upgrades to those parts of the system have been proposed and prioritized to prevent this type of service disruption in the future.

Final Thoughts

We know how critical real-time conversations are to your team, and how important it is to be able to serve your customers throughout the business day. Our team is committed to responding as promptly as possible to incoming support requests and providing as much information as possible during incidents.

Thank you again for choosing Avochato,

Christopher Neale

CTO and Co-founder

Posted Mar 11, 2021 - 12:09 PST

This incident has been resolved and we are observing normal page load times, but we are continuing to investigate the root cause.
Posted Mar 09, 2021 - 15:37 PST
Our team has deployed a software update to address the root cause and is monitoring the results.
Posted Mar 09, 2021 - 14:25 PST
We are continuing to work on a fix for this issue.
Posted Mar 09, 2021 - 14:14 PST
We are continuing to work on a fix for this issue.
Posted Mar 09, 2021 - 13:58 PST
We are continuing to work on a fix for this issue.
Posted Mar 09, 2021 - 12:32 PST
We are continuing to work on a fix for this issue.
Posted Mar 09, 2021 - 11:50 PST
We have identified the issue and engineers are working on a resolution.
Posted Mar 09, 2021 - 10:46 PST
We are continuing to investigate this issue.
Posted Mar 09, 2021 - 10:24 PST
We are currently investigating this issue.
Posted Mar 09, 2021 - 10:03 PST
This incident affected: API and Mobile.