Improved Alerting with Atlas Streaming Eval | by Netflix Technology Blog | Apr, 2023

Ruchir Jha, Brian Harrington, Yingwu Zhao
TL;DR
- Streaming alert evaluation scales much better than the traditional approach of polling time-series databases.
- It allows us to overcome high dimensionality/cardinality limitations of the time-series database.
- It opens doors to support more exciting use cases.
Engineers want their alerting system to be realtime, reliable, and actionable. While actionability is subjective and may vary by use case, reliability is non-negotiable. In other words, false positives are bad but false negatives are the absolute worst!
A few years ago, we were paged by our SRE team because our Metrics Alerting System was falling behind: critical application health alerts reached engineers 45 minutes late! As we investigated the alerting delay, we found that the number of configured alerts had recently increased dramatically, by 5 times! The alerting system queried Atlas, our time series database, on a cron for each configured alert query, and was seeing an increased throttle rate and excessive retries with backoffs. This, in turn, increased the time between two consecutive checks for an alert, causing a global slowdown for all alerts. On further investigation, we discovered that one user had programmatically created tens of thousands of new alerts. This user represented a platform team at Netflix, and their goal was to build alerting automation for their users.
While we were able to put out the immediate fire by disabling the newly created alerts, this incident raised some important concerns about the scalability of our alerting system. We also heard from other platform teams at Netflix who wanted to build similar automation for their users but, given our state at the time, would not have been able to do so without impacting Mean Time To Detect (MTTD) for everyone else. In fact, we were looking at an order of magnitude increase in the number of alert queries over just the next 6 months!
Since querying Atlas was the bottleneck, our first instinct was to scale it up to meet the increased alert query demand; however, we soon realized that would increase Atlas cost prohibitively. Atlas is an in-memory time-series database that ingests multiple billions of time-series per day and retains the last two weeks of data. It is already one of the largest services at Netflix in both size and cost. While Atlas is architected around compute & storage separation, and we could theoretically just scale the query layer to meet the increased query demand, every query, regardless of its type, has a data component that needs to be pushed down to the storage layer. To serve the increasing number of push down queries, the in-memory storage layer would need to scale up as well, and it became clear that this would push the already expensive storage costs far higher. Moreover, common database optimizations like caching recently queried data don't really work for alerting queries because, generally speaking, the last received datapoint is required for correctness. Take, for example, this alert query that checks if errors as a % of total RPS exceed a threshold of 50% for 4 out of the last 5 minutes:
name,errors,:eq,:sum,
name,rps,:eq,:sum,
:div,
100,:mul,
50,:gt,
5,:rolling-count,4,:gt,
Say the datapoint received for the last time interval leads to a positive evaluation for this query; relying on stale/cached data would either increase MTTD or result in the perception of a false negative, at least until the missing data is fetched and evaluated. It became clear to us that we needed to solve the scalability problem with a fundamentally different approach. Hence, we started down the path of alert evaluation via real-time streaming metrics.
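To make the evaluation semantics concrete, here is a minimal Java sketch of how the example query could be evaluated incrementally. It is illustrative only, not the atlas-eval implementation; the class name, the zero-RPS handling, and the window bookkeeping are assumptions. It divides errors by RPS, scales to a percentage, compares against the threshold, and keeps a rolling count of crossings over the last 5 intervals, mirroring the 5,:rolling-count,4,:gt step of the query.

import java.util.ArrayDeque;
import java.util.Deque;

// Minimal sketch (not the atlas-eval implementation) of evaluating the example
// query incrementally: compute errors as a percentage of RPS, compare against
// the threshold, and keep a rolling count of crossings over the last 5 intervals.
public class RollingCountEval {

  private static final int WINDOW = 5;          // 5,:rolling-count
  private static final int MIN_CROSSINGS = 4;   // ...,4,:gt
  private static final double THRESHOLD = 50.0; // 50,:gt

  // threshold-crossing results for the most recent WINDOW intervals
  private final Deque<Boolean> window = new ArrayDeque<>();

  // Feed the time-aligned datapoints for one interval; returns true if the alert fires.
  public boolean update(double errors, double rps) {
    double errorPct = rps == 0.0 ? 0.0 : (errors / rps) * 100.0; // assumed zero-RPS handling
    window.addLast(errorPct > THRESHOLD);
    if (window.size() > WINDOW) {
      window.removeFirst();
    }
    long crossings = window.stream().filter(b -> b).count();
    return crossings > MIN_CROSSINGS; // mirrors the trailing 4,:gt in the query
  }
}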
High Level Architecture
The idea, at a high level, was to almost entirely avoid the need to query the Atlas database and to transition most alert queries to streaming evaluation.
Alert queries are submitted either via our Alerting UI or by API clients, and are then saved to a custom config database that supports streaming config updates (full snapshot + update notifications). The Alerting Service receives these config updates and hashes every new or updated alert query for evaluation to one of its nodes by leveraging Edda Slots. The node responsible for evaluating a query starts by breaking it down into a set of "data expressions" and with them subscribes to an upstream "broker" service. Data expressions define what data needs to be sourced in order to evaluate a query. For the example query listed above, the data expressions are name,errors,:eq,:sum and name,rps,:eq,:sum. The broker service acts as a subscription manager that maps a data expression to a set of subscriptions. In addition, it maintains a Query Index of all active data expressions, which is consulted to discern whether an incoming datapoint is of interest to an active subscriber. The internals here are outside the scope of this blog post.
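As a rough illustration of the broker's role as a subscription manager, the following Java sketch maps each data expression to its subscribers and consults that index for every incoming datapoint. It is a deliberately simplified assumption: a data expression is reduced to a single required tag, whereas real Atlas data expressions (and the real Query Index) are considerably richer.

import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArraySet;

// Simplified sketch of the broker acting as a subscription manager.
public class BrokerSketch {

  // Hypothetical stand-in for a parsed data expression such as name,errors,:eq,:sum
  record DataExpr(String tagKey, String tagValue) {}

  record Datapoint(Map<String, String> tags, long timestamp, double value) {}

  // Index of which subscribers care about which data expression
  private final Map<DataExpr, Set<String>> subscriptions = new ConcurrentHashMap<>();

  // Called when an alerting node subscribes for the data it needs
  public void subscribe(String subscriberId, DataExpr expr) {
    subscriptions.computeIfAbsent(expr, e -> new CopyOnWriteArraySet<>()).add(subscriberId);
  }

  // Consult the index to decide whether an incoming datapoint interests any subscriber
  public void onDatapoint(Datapoint dp) {
    subscriptions.forEach((expr, subs) -> {
      if (expr.tagValue().equals(dp.tags().get(expr.tagKey()))) {
        subs.forEach(sub -> forward(sub, dp)); // push only to interested subscribers
      }
    });
  }

  private void forward(String subscriberId, Datapoint dp) {
    System.out.println("forwarding to " + subscriberId + ": " + dp);
  }
}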
Next, the Alerting service (via the atlas-eval library) maps the received data points for a data expression to the alert query that needs them. For alert queries that resolve to more than one data expression, we align the incoming data points for each of those data expressions on the same time boundary before emitting the collected values to the final eval step. For the example above, the final eval step is responsible for computing the ratio and maintaining the rolling-count, which keeps track of the number of intervals in which the ratio crossed the threshold, as shown below:
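The alignment step can be pictured with a small sketch (again, an assumption-laden illustration rather than atlas-eval code): values for the errors and rps data expressions are buffered per time boundary, and the pair is forwarded to the final eval step only once both have arrived for that interval.

import java.util.HashMap;
import java.util.Map;
import java.util.function.BiConsumer;

// Sketch of aligning the two data expressions of the example query on the
// same time boundary before the final eval step.
public class TimeAligner {

  private final Map<Long, Double> errorsByTime = new HashMap<>();
  private final Map<Long, Double> rpsByTime = new HashMap<>();
  private final BiConsumer<Double, Double> finalEval; // e.g. the rolling-count sketch above

  public TimeAligner(BiConsumer<Double, Double> finalEval) {
    this.finalEval = finalEval;
  }

  public void onErrors(long timestamp, double value) {
    errorsByTime.put(timestamp, value);
    tryEmit(timestamp);
  }

  public void onRps(long timestamp, double value) {
    rpsByTime.put(timestamp, value);
    tryEmit(timestamp);
  }

  // Emit only when both expressions have reported for this time boundary
  private void tryEmit(long timestamp) {
    Double errors = errorsByTime.get(timestamp);
    Double rps = rpsByTime.get(timestamp);
    if (errors != null && rps != null) {
      errorsByTime.remove(timestamp);
      rpsByTime.remove(timestamp);
      finalEval.accept(errors, rps); // ratio + rolling count happen downstream
    }
  }
}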
The atlas-eval library supports streaming evaluation for most, if not all, Query, Data, Math and Stateful operators supported by Atlas today. Certain operators such as offset, integral, and des are not supported on the streaming path.
OK, Results?
First and foremost, we have successfully alleviated our initial scalability problem with the polling based architecture. Today, we run 20X the number of queries we used to run a few years ago, with ease and at a fraction of what it would have cost to scale up the Atlas storage layer to serve the same volume. Several platform teams at Netflix programmatically generate and maintain alerts on behalf of their users without having to worry about impacting other users of the system. We are able to maintain strong SLAs around Mean Time To Detect (MTTD) regardless of the number of alerts being evaluated by the system.
Additionally, streaming evaluation allowed us to relax restrictions around high cardinality that our users were previously running into: alert queries that were rejected by the Atlas backend before, due to cardinality constraints, are now getting checked correctly on the streaming path. In addition, we are able to use Atlas Streaming to monitor and alert on some very high cardinality use cases, such as metrics derived from free-form log data.
We also switched Telltale, our holistic application health monitoring system, from polling a metrics cache to using realtime Atlas Streaming. The fundamental idea behind Telltale is to detect anomalies on SLI metrics (for example, latency, error rates, etc.). When such anomalies are detected, Telltale is able to compute correlations with similar metrics emitted from either upstream or downstream services. In addition, it computes correlations between SLI metrics and custom metrics like the log-derived metrics mentioned above. This has proven valuable towards reducing Mean Time to Recover (MTTR). For example, we are now able to correlate increased error rates with an increased rate of specific exceptions occurring in logs, and even point to an exemplar stacktrace, as shown below:
Our logs pipeline fingerprints every log message and attaches a (very high cardinality) fingerprint tag to a log events counter that is then emitted to Atlas Streaming. Telltale consumes this metric in a streaming fashion to identify fingerprints that correlate with anomalies seen in SLI metrics. Once an anomaly is found, we query the logs backend with the fingerprint hash to obtain the exemplar stacktrace. What's more, we are now able to identify correlated anomalies (and exceptions) occurring in services that may be N hops away from the affected service. A system like Telltale becomes more effective as more services are onboarded (and, for that matter, the full service graph), because otherwise it becomes difficult to root cause the problem, especially in a microservices-based architecture. A few years ago, as noted in this blog, only a couple of hundred services were using Telltale; thanks to Atlas Streaming we have now managed to onboard thousands of other services at Netflix.
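The following sketch illustrates the general idea behind log fingerprinting; the normalization rules and hash shown here are assumptions for illustration, not the actual Netflix logs pipeline. Variable parts of a message (numbers, ids) are collapsed so that repeated occurrences of the same underlying exception hash to the same fingerprint, which can then be attached as a tag on a log events counter.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

// Hedged sketch of log fingerprinting: normalize away variable tokens so that
// similar messages hash to the same fingerprint tag value.
public class LogFingerprint {

  // Collapse numbers and long hex/uuid-like tokens so similar messages match
  static String normalize(String message) {
    return message
        .replaceAll("[0-9a-fA-F-]{8,}", "#")
        .replaceAll("\\d+", "#");
  }

  static String fingerprint(String message) {
    try {
      MessageDigest sha = MessageDigest.getInstance("SHA-256");
      byte[] digest = sha.digest(normalize(message).getBytes(StandardCharsets.UTF_8));
      return HexFormat.of().formatHex(digest, 0, 8); // short, stable fingerprint
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    String msg = "TimeoutException calling service-x id=4821 after 2500ms";
    // In the real pipeline, this tag would be attached to a counter sent to Atlas Streaming.
    System.out.println("log.events counter tags: {fingerprint=" + fingerprint(msg) + "}");
  }
}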
Finally, we realized that once you remove limits on the number of monitored queries, and start supporting much higher metric dimensionality/cardinality without impacting the cost/performance profile of the system, it opens doors to many exciting new possibilities. For example, to make alerts more actionable, we may now be able to compute correlations between SLI anomalies and custom metrics with high cardinality dimensions; an alert on elevated HTTP error rates might be able to point to impacted customer cohorts by linking to precisely correlated exemplars. This would help developers with reproducibility.
Transitioning to the streaming path has been a long journey for us. One of the challenges was the difficulty of debugging scenarios where the streaming path did not agree with what is returned by querying the Atlas database. This is especially true when either the data is not available in Atlas or the query is not supported because of (say) cardinality constraints. This is one of the reasons it has taken us years to get here. That said, early signs indicate that the streaming paradigm may help with tackling a cardinal problem in observability: effective correlation between the metrics & events verticals (logs, and potentially traces in the future). We are excited to explore the opportunities that this presents for observability in general.