Client Latency and Platform Outage

Incident Report for Avochato

Postmortem

What Happened

Starting in the afternoon, routine Conversation Management automation within the Avochato Platform began running on a disproportionately large body of background work using the default priority queue. This was ultimately due to a combination of account-specific settings, infrastructure constraints, and the timing of the load across the Avochato platform. The platform suffered growing latency in a series of waves, a short maintenance window of hard downtime, and another wave of latency as we addressed the root cause of the issue. All Avochato services were impacted.

This led to an exponentially growing number of concurrent background jobs running and competing for all platform resources. Ultimately, fixing the issue required putting the platform into maintenance mode while replacing hardware used in our cloud services. To clarify, this was not a planned or routine maintenance window, but the user experience was the same: app users would see a maintenance page (or an error page for some users) and be unable to access the inbox. This was done in the interest of time, and the engineering team will revise this process in the future.

During this period it was not clear where the runaway automation described above originated, but it caused the Avochato Platform to attempt to queue a new type of asynchronous job designed to push data to websockets. Because jobs and websockets use the same hardware, the influx consumed 100% of memory: jobs that could not find an available websocket could not complete, and more and more jobs of that nature piled up waiting to publish to a websocket.

The source of this issue relates specifically to a platform upgrade deployed in recent weeks to reduce the turnaround time for users to send messages and receive notifications. While this worked functionally for our customer base, it moved the burden to a different part of the architecture in a way that scaled disproportionately under specific circumstances, and without proper limits on concurrent throughput. As a result, our platform was unable to process additional web requests (meaning high page load times) and queued a massive excess of background jobs in a short period (meaning delays in messages and a lack of real-time notifications, inbox updates, etc.).
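
To illustrate the kind of limit on concurrent throughput referred to above, here is a minimal sketch in Python. It is not the Avochato implementation. The names `MAX_CONCURRENT_PUSHES`, `push_to_websocket`, and `run_jobs` are hypothetical, and the cap of 4 is an arbitrary assumption. It shows a semaphore capping how many jobs of one type run at once, so that a burst waits its turn instead of running all at once.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical cap on how many websocket-push jobs may run at the same time.
MAX_CONCURRENT_PUSHES = 4

_slots = threading.BoundedSemaphore(MAX_CONCURRENT_PUSHES)
_lock = threading.Lock()
_running = 0
peak_running = 0  # highest number of jobs observed running at once


def push_to_websocket(payload):
    """Stand-in for a job that publishes data to a websocket."""
    global _running, peak_running
    with _slots:  # blocks while MAX_CONCURRENT_PUSHES jobs are already running
        with _lock:
            _running += 1
            peak_running = max(peak_running, _running)
        time.sleep(0.01)  # simulate the publish taking some time
        with _lock:
            _running -= 1
    return payload


def run_jobs(payloads, workers=16):
    """Run all jobs on a worker pool that is larger than the cap."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(push_to_websocket, payloads))


results = run_jobs(range(50))
```

In this sketch all 50 jobs still complete, but no more than `MAX_CONCURRENT_PUSHES` of them are ever inside the guarded section at once, even though the pool has 16 workers.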

Additionally, because of the platform failure, the latency and eventual outage left our team unable to respond to many customers who reached out to us during the impacted period in the timely manner to which they are accustomed.

The Engineering team prepared and deployed a migration to switch these new job types from the default priority queue into a new lower-priority queue to constrain their impact. This patch was deployed per our usual high-availability deployment process, which involves taking one-third of our application servers offline at a time, reducing platform capacity while we deploy.
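
The idea behind moving a job type to a lower-priority queue can be sketched as follows. This is an illustration under stated assumptions, not our production code: the queue names, the job type names (`websocket_push`, `send_message`), and the in-memory `JobQueue` class are all hypothetical. Workers in this sketch always serve the default queue before the low-priority one.

```python
import heapq
import itertools

# Hypothetical queue names and priorities; a lower number is served first.
QUEUE_PRIORITY = {"default": 0, "low": 1}

# Hypothetical mapping from job type to queue. Unlisted job types use "default".
JOB_QUEUE = {"websocket_push": "low"}


class JobQueue:
    """In-memory stand-in for a job store that serves higher-priority queues first."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # preserves first-in, first-out order within a queue

    def enqueue(self, job_type, payload):
        queue = JOB_QUEUE.get(job_type, "default")
        entry = (QUEUE_PRIORITY[queue], next(self._order), job_type, payload)
        heapq.heappush(self._heap, entry)

    def dequeue(self):
        _, _, job_type, payload = heapq.heappop(self._heap)
        return job_type, payload

    def __len__(self):
        return len(self._heap)


jobs = JobQueue()
jobs.enqueue("websocket_push", "inbox update 1")
jobs.enqueue("send_message", "hello")
jobs.enqueue("websocket_push", "inbox update 2")
jobs.enqueue("send_message", "goodbye")

# Both send_message jobs are served before either websocket_push job.
served = [jobs.dequeue() for _ in range(len(jobs))]
```

A strict ordering like this constrains the impact of the low-priority jobs on everything else, at the cost that those jobs can be delayed for as long as the default queue stays busy.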

Regardless, in order to handle the overall volume of queued work and return to normalcy, Engineering took emergency steps to replace the cloud computing instance storing the jobs with one twice its size. This could not be done without postponing the queued work while we switched the infrastructure. Every effort was made to avoid dropping background jobs, though ultimately not all jobs could be saved. These emergency steps (during which Avochato switched into maintenance mode in order to purge the system of the busy processes) led to a short period of hard downtime and the loss of queued jobs, including processing contact CSV uploads, creating broadcast audiences, sending messages, and displaying notifications.

Once the necessary hardware was replaced, the root source of the resource-intensive automation continued to create excess jobs. However, the added capacity gave engineers the ability to reduce the noise, identify the source, and design a final resolution to treat the cause instead of the symptom.

Another migration was prepared to make it easy for admins to turn off functionality for specific sources of automation. Once it was deployed, systems administrators were able to eliminate the source of the resource-intensive automation once and for all. New safeguards were also installed for taking expedient, atomic actions in the future that will not require hardware or software deployments.
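
As a rough illustration of a safeguard of this kind, the sketch below shows a per-source switch that is checked each time automation runs, so a source can be turned off without a deployment. It is a simplified assumption of how such a switch might look, not a description of our internal tooling: `AutomationSwitches`, `run_automation`, and the source name `account-123/auto-reply` are hypothetical.

```python
import threading


class AutomationSwitches:
    """Hypothetical in-memory store of per-source off switches.

    A real system would keep this in shared storage so every worker sees a change.
    """

    def __init__(self):
        self._disabled = set()
        self._lock = threading.Lock()

    def disable(self, source):
        with self._lock:
            self._disabled.add(source)

    def enable(self, source):
        with self._lock:
            self._disabled.discard(source)

    def is_enabled(self, source):
        with self._lock:
            return source not in self._disabled


switches = AutomationSwitches()


def run_automation(source, action):
    """Run the action only if its source has not been switched off."""
    if not switches.is_enabled(source):
        return "skipped"
    action()
    return "ran"


sent = []
first = run_automation("account-123/auto-reply", lambda: sent.append("reply"))
switches.disable("account-123/auto-reply")  # an admin turns the source off at runtime
second = run_automation("account-123/auto-reply", lambda: sent.append("reply"))
```

Because the check happens at the moment each automation runs, switching a source off takes effect on the next run and can be reversed just as quickly with `enable`.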

This ultimately returned our systems to normal as of yesterday evening.

Next Steps

Engineering has drafted and is prioritizing a series of TODOs regarding infrastructure points of failure, is implementing in-app indicators for when the system is under similar periods of stress, and is working to restore any impacted accounts that got into a bad state due to the actions taken during the period. Infrastructure planning has been prioritized to reduce the burden on specific parts of our architecture and to prevent any single part from bearing the multiple responsibilities that led to this failure.

We are continuing to monitor platform latency and take proactive steps to prevent unforeseen combinations of Avochato automation from ever impacting the core inbox experience.

We understand the level of trust you place in the Avochato Platform to communicate with those most important to you.

On behalf of our team, thank you for your patience, and thank you for choosing Avochato,

Christopher Neale, CTO and co-founder

Posted Nov 20, 2020 - 12:13 PST

Resolved

This incident has been resolved and our team is continuing to monitor the stability of the platform and process outstanding queued work.

Posted Nov 19, 2020 - 16:54 PST

Monitoring

We are monitoring the resolution of the incident and services are being rolled back online.

Posted Nov 19, 2020 - 16:30 PST

Update

The Avochato Platform is entering a temporary maintenance period.

Posted Nov 19, 2020 - 16:17 PST

Identified

We are continuing to experience delays in serving pages and handling messages.

Our ops team is deploying a patch to our infrastructure and we will monitor the result.

Posted Nov 19, 2020 - 16:03 PST

Update

We are continuing to investigate this issue.

Posted Nov 19, 2020 - 14:46 PST

Investigating

We are currently investigating this issue.

Posted Nov 19, 2020 - 14:46 PST

This incident affected: avochato.com, API, and Mobile.