Spoke at a conference

Gamification of Observability


Athletes, firemen and doctors train every day to be the best at their chosen profession. As engineers, we spend much of our time getting things to production and making sure our infrastructure doesn’t burn down outright. However, we spend very little time learning to understand and respond to outages. Things like Infrastructure as Code, Service Discovery and Config Management can help, and have helped, us to quickly build and rebuild infrastructure, but we haven’t spent nearly enough time training ourselves to review, monitor and respond to outages. Does our platform degrade gracefully? What does a high CPU load really mean? What can we learn from level 1 outages to be able to run our platforms more reliably? In this talk we’ll discuss the need for, and the options for, creating a game day culture: one where we as engineers not only write, maintain and operate our software platforms, but also actively pursue ways to learn about and predict their (non-functional) behaviour. We’ll look at tools like Prometheus, Loki, Tempo and toxiproxy as ways to prepare teams to tweak their testing and monitoring setup and work instructions, so they can quickly observe, react to and resolve problems.

https://osmc.de/talks/en-gamification-of-observability/