Title: Reactive by Example

Type: 40min talk

Tags: case-study scalability reactive resilience


Status: accepted

A cool story about the evolution of our monitoring infrastructure, from a naive first approach to a highly resilient system.

How do we handle 4M metrics per minute and over 1K concurrent connections?
Which strategies did we try, and where did they fail?
Which techniques and technologies do we use to achieve this?
How do we handle errors and failures at this scale?
What can we still improve?