The morning after a data migration intended to improve inbox search performance and add new search functionality, a series of complex queries caused our search service to enter a bad state and become throttled during the impact period.
This appeared to be due to our Elasticsearch shards exceeding a critical size limit. We exceeded the threshold because of additional columns that were added to support new indices, multiplied by the size of the production dataset.
As a result, search results (including the default inbox experience, contacts list, etc.) timed out until we could completely rebuild the search infrastructure.
Customer data (including conversations, messages, tags, etc.) was not lost, and messages continued to be delivered as intended during the period. Most functionality remained available, including account and user management, though searching through the inbox was severely degraded.
Customers who were online trying to use the app or API during the impact period would not have been able to easily look up conversations or contacts. It was also difficult to create and populate broadcasts during the period. Users were still able to navigate to conversations directly from links and from notifications.
Engineers used our metrics to identify that the issue was specifically related to our search infrastructure. The team scaled up additional instances of our Elasticsearch infrastructure to try to solve the problem.
The team then rebuilt the Elasticsearch cluster, as they were not able to "reboot" it, and began routing traffic to the new cluster. This resolved the issue for customers on a rolling basis, as some connections hit the new cluster while others were still routed to bad shards. The rebuild took about 1 hour and 15 minutes to finish syncing all data into new Elasticsearch indices, so unfortunately some customers were impacted for this entire period.
We have since moved from using 8 Elasticsearch shards to 16, and reindexed the dataset, which cut the total size of each shard in half.
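As a rough illustration of the arithmetic behind this change, the sketch below assumes documents are spread evenly across primary shards. The function name and the total index size are hypothetical and chosen only for the example; they are not figures from this report.

```python
def per_shard_size_gb(total_index_gb: float, primary_shards: int) -> float:
    """Approximate size of each primary shard, assuming an even spread of documents."""
    if primary_shards <= 0:
        raise ValueError("primary_shards must be positive")
    return total_index_gb / primary_shards


# Hypothetical total index size; the real production figure is not stated here.
TOTAL_INDEX_GB = 400.0

before = per_shard_size_gb(TOTAL_INDEX_GB, 8)   # original shard count
after = per_shard_size_gb(TOTAL_INDEX_GB, 16)   # shard count after reindexing

print(f"8 shards: {before:.1f} GB each; 16 shards: {after:.1f} GB each")
```

For the same dataset, doubling the shard count halves the approximate size of each shard, which is the effect described above. Real shard sizes vary with document routing and segment overhead, so this is only an estimate.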
In a separate issue that occurred around the same time, the www.avochato.com domain was automatically flagged by Avast Antivirus' anti-phishing browser extension. We believe this was a false positive, but it left Avast users who had the extension installed unable to view Avochato. Users who whitelisted Avochato in their extension were able to continue to log in, and the team worked quickly to submit an appeal. Avast has since removed us from their phishing blacklist.
Avochato poses no known phishing threat to its users, but we encourage users who suspect phishing attack vectors to submit their reports to www.avochato.com/bugbounty.
Thanks again for your patience while we resolved this issue, and for being an Avochato customer,
Christopher, CTO & CISO