Conversation list slowdown

Incident Report for Avochato

Postmortem

What Happened

The morning after a data migration to improve inbox search performance and add new search functionality, a series of complex queries caused our search service to enter a bad state and become throttled during the impact period.

This appeared to be caused by our Elasticsearch shards exceeding a critical size limit. We crossed that threshold because of the additional columns added to support new indices, multiplied across the size of the production dataset.

As a result, searches (including the default inbox experience, contacts list, etc.) timed out until we could completely reboot the search infrastructure.

Impact

Customer data (including conversations, messages, tags, etc.) was not lost, and messages continued to deliver as intended during the period. Most functionality remained available, including account and user management, though the experience of searching through the inbox was severely degraded.

Customers who were online trying to use the app or API during the impact period would not have easily been able to look up conversations or contacts. It was also difficult to create and populate broadcasts during the period. Users were still able to navigate to conversations directly from links and from notifications.

Resolution

Engineers used our metrics to identify that the issue was specifically related to our search infrastructure. The team scaled up additional instances of our Elasticsearch infrastructure to try to solve the problem.

The team then rebuilt the Elasticsearch cluster, as they were not able to "reboot" it, and began routing traffic to the new cluster. This resolved the issue for customers on a rolling basis, as some connections hit the new instance while other data still routed to bad shards. The rebuild took about 1 hour and 15 minutes to finish syncing all data and building new Elasticsearch indices, so unfortunately some customers were impacted for this entire period.

We have since moved from using 8 Elasticsearch shards to 16, and reindexed the dataset, which cut the total size of each shard in half.
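As a rough illustration of why doubling the shard count halves the size of each shard, the arithmetic can be sketched as below. The total index size is a hypothetical figure (this report does not state the real one), the helper name is ours, and the sketch assumes documents are spread evenly across primary shards.

```python
def per_shard_size_gb(total_index_gb: float, primary_shards: int) -> float:
    """Average primary shard size, assuming an even spread of documents."""
    if primary_shards <= 0:
        raise ValueError("primary_shards must be positive")
    return total_index_gb / primary_shards


# Hypothetical total index size; the real figure is not given in this report.
total_gb = 800.0

before = per_shard_size_gb(total_gb, 8)   # shard count before the change
after = per_shard_size_gb(total_gb, 16)   # shard count after reindexing

print(f"8 shards:  {before} GB per shard")
print(f"16 shards: {after} GB per shard")
print(f"ratio: {after / before}")
```

Whatever the real total is, the ratio stays the same: the same dataset divided across twice as many shards yields shards half the size.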

Additional Notes

In a separate issue that occurred around the same time, the www.avochato.com domain was automatically flagged by Avast Antivirus' anti-phishing browser extension. We believe this was a false positive, but it prevented Avast users who had the extension installed from viewing Avochato. Users who whitelisted Avochato in their extension were able to continue to log in, and the team worked quickly to submit an appeal. Avast has since removed us from their phishing blacklist.

Avochato poses no known phishing threat to its users, but we encourage users who suspect phishing attack vectors to submit their reports to www.avochato.com/bugbounty

Thanks again for your patience while we resolved this issue, and for being an Avochato customer,

Christopher, CTO & CISO

Posted Nov 15, 2021 - 10:31 PST

Resolved

This incident has been resolved, but we will continue to monitor during the rest of the weekend.

Posted Oct 23, 2021 - 12:57 PDT

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Oct 23, 2021 - 10:04 PDT

Identified

The issue has been identified and a fix is being implemented.

Posted Oct 23, 2021 - 09:36 PDT

Update

We are continuing to investigate long delays when loading lists of conversations and contacts in the inbox. We have deployed a patch to resolve one of the identified sources of latency and are monitoring the results.

Posted Oct 23, 2021 - 08:42 PDT

Update

We are continuing to investigate this issue.

Posted Oct 23, 2021 - 08:05 PDT

Investigating

We are currently investigating this issue.

Posted Oct 23, 2021 - 08:05 PDT

This incident affected: avochato.com, API, and Mobile.