Improving code review time at Meta

- Code reviews are one of the most important parts of the software development process
- At Meta we've recognized the need to make code reviews as fast as possible without sacrificing quality
- We're sharing several tools and steps we've taken at Meta to reduce the time spent waiting for code reviews
When done well, code reviews can catch bugs, teach best practices, and ensure high code quality. At Meta we call an individual set of changes made to the codebase a "diff." While we like to move fast at Meta, every diff must be reviewed, without exception. But, as the Code Review team, we also understand that when reviews take longer, people get less done.
We've studied several metrics to learn more about the code review bottlenecks that lead to unhappy developers, and we've used that data to build features that help speed up the code review process without sacrificing review quality. We've found a correlation between slow diff review times (P75) and engineer dissatisfaction. Our tools for surfacing diffs to the right reviewers at key moments in the code review lifecycle have significantly improved the diff review experience.
What makes a diff review feel slow?
To answer this question, we started by looking at our data. We track a metric that we call "Time In Review," which measures how long a diff is waiting on review across all of its individual review cycles. We only account for the time when the diff is actually waiting on reviewer action.
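To make the definition concrete, here's a minimal sketch (not our internal implementation) that sums only the intervals in which a diff is waiting on a reviewer, across its review cycles:

```python
from datetime import datetime, timedelta

# Hypothetical review-cycle timestamps for one diff: each pair marks when the
# diff started waiting on a reviewer and when a reviewer finally acted.
review_cycles = [
    (datetime(2021, 3, 1, 9, 0), datetime(2021, 3, 1, 11, 30)),   # first review pass
    (datetime(2021, 3, 2, 14, 0), datetime(2021, 3, 3, 10, 0)),   # re-review after an update
]

def time_in_review(cycles: list[tuple[datetime, datetime]]) -> timedelta:
    """Sum only the spans in which the diff was waiting on reviewer action."""
    return sum((acted - requested for requested, acted in cycles), timedelta())

print(time_in_review(review_cycles))  # 22:30:00 across the two cycles
```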

What we found surprised us. When we looked at the data in early 2021, our median (P50) hours in review for a diff was only a few hours, which we felt was quite good. At P75, however (i.e., the slowest 25 percent of reviews), we saw diff review time increase by as much as a day.
We analyzed the correlation between Time In Review and user satisfaction (as measured by a company-wide survey). The results were clear: The longer someone's slowest 25 percent of diffs took to review, the less satisfied they were with their code review process. We now had our north star metric: P75 Time In Review.
Driving down Time In Review would not only make people more satisfied with their code review process, it would also increase the productivity of every engineer at Meta. Driving down Time In Review for our diffs means our engineers spend significantly less time waiting on reviews, making them more productive and more satisfied with the overall review process.
Balancing speed with quality
However, simply optimizing for review speed could lead to negative side effects, like encouraging rubber-stamp reviewing. We needed a guardrail metric to protect against negative unintended consequences. We settled on "Eyeball Time": the total amount of time reviewers spend looking at a diff. An increase in rubber-stamping would lead to a decrease in Eyeball Time.
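As a hedged sketch of how a guardrail metric like this can be applied when evaluating a feature, the decision rule and thresholds below are invented purely for illustration:

```python
def launch_decision(time_in_review_delta: float, eyeball_time_delta: float) -> str:
    """Illustrative rule: ship only if Time In Review improves without a
    meaningful drop in Eyeball Time (our proxy for rubber-stamping).
    Deltas are relative changes, e.g., -0.07 means a 7 percent reduction."""
    if eyeball_time_delta <= -0.02:
        return "hold: possible rubber-stamping"
    if time_in_review_delta < 0:
        return "ship"
    return "iterate"

print(launch_decision(time_in_review_delta=-0.07, eyeball_time_delta=0.01))  # ship
```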
Now that we have established our goal metric, Time In Review, and our guardrail metric, Eyeball Time, what comes next?
Build, experiment, and iterate
Nearly every product team at Meta uses experimental and data-driven processes to launch and iterate on features. However, this process is still very new for internal tools teams like ours. There are a number of challenges (sample size, randomization, network effects) that we've had to overcome that product teams do not face. We address these challenges with new data foundations for running network experiments and with techniques to reduce variance and increase sample size. This extra effort is worth it: by laying the experimental groundwork, we can later prove the impact and effectiveness of the features we're building.
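As one illustration of the kind of variance-reduction technique involved, the sketch below shows a generic CUPED-style adjustment using a pre-experiment covariate (such as an engineer's historical review hours); it's an example of the general idea rather than a description of our exact setup:

```python
import numpy as np

def cuped_adjust(metric: np.ndarray, covariate: np.ndarray) -> np.ndarray:
    """CUPED-style adjustment: remove the portion of the experiment metric
    explained by a pre-experiment covariate. The adjusted metric keeps the
    same mean but has lower variance, so effects are easier to detect."""
    theta = np.cov(metric, covariate)[0, 1] / np.var(covariate, ddof=1)
    return metric - theta * (covariate - covariate.mean())

rng = np.random.default_rng(0)
pre = rng.normal(24.0, 6.0, size=1_000)              # pre-period review hours per engineer
post = 0.8 * pre + rng.normal(0.0, 3.0, size=1_000)  # experiment-period review hours
print(post.var(), cuped_adjust(post, pre).var())     # variance drops sharply after adjustment
```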

Next reviewable diff
The inspiration for this feature came from an unlikely place: video streaming services. It's easy to binge-watch shows on certain streaming services because of how seamless the transition is from one episode to the next. What if we could do that for code reviews? By queueing up diffs, we could encourage a diff review flow state, allowing reviewers to make the most of their time and mental energy.
And so Next Reviewable Diff was born. We use machine learning to identify a diff that the current reviewer is highly likely to want to review next, and we surface that diff to the reviewer when they finish their current code review. We make it easy for reviewers to cycle through potential next diffs and to quickly remove themselves as a reviewer if a diff isn't relevant to them.
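Here's a minimal sketch of the queueing idea, with a simple file-overlap heuristic standing in for the machine learning model:

```python
from dataclasses import dataclass

@dataclass
class Diff:
    diff_id: int
    author: str
    files: list[str]

def relevance_score(diff: Diff, reviewer_recent_files: set[str]) -> float:
    """Stand-in for the ML model's prediction that the reviewer will want this
    diff next; here, just the overlap with files they recently touched."""
    return len(set(diff.files) & reviewer_recent_files) / max(len(diff.files), 1)

def next_reviewable_diff(candidates: list[Diff], reviewer_recent_files: set[str]) -> Diff | None:
    """After a reviewer finishes their current review, surface the pending diff
    they are most likely to want to review next."""
    if not candidates:
        return None
    return max(candidates, key=lambda d: relevance_score(d, reviewer_recent_files))

queue = [Diff(1, "alice", ["feed/ranker.py"]), Diff(2, "bob", ["ads/ui.tsx"])]
print(next_reviewable_diff(queue, reviewer_recent_files={"feed/ranker.py"}))
```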
After its launch, we found that this feature resulted in a 17 percent overall increase in review actions per day (such as accepting a diff, commenting, etc.), and that engineers who use this flow perform 44 percent more review actions than the average reviewer!
Improving reviewer recommendations
The choice of reviewers an author selects for a diff is important. Diff authors want reviewers who will review their code well and quickly, and who are experts in the code their diff touches. Historically, Meta's reviewer recommender looked at a limited set of data to make its recommendations, leading to problems with new data and staleness as engineers changed teams.
We built a new reviewer recommendation system that incorporates work-hours awareness and file ownership information. This allows reviewers who are available to review a diff, and who are more likely to be great reviewers for it, to be prioritized. We also rewrote the model that powers these recommendations to support backtesting and automatic retraining.
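The sketch below shows how work-hours awareness and file ownership could combine into a reviewer ranking; the field names, weights, and scoring are illustrative rather than the actual model:

```python
from datetime import datetime, timezone

def recommend_reviewers(diff_files: list[str], candidates: list[dict], top_k: int = 3) -> list[str]:
    """Rank candidate reviewers by ownership overlap with the diff's files,
    down-weighting anyone currently outside their working hours."""
    hour = datetime.now(timezone.utc).hour
    scored = []
    for c in candidates:
        ownership = len(set(diff_files) & set(c["owned_files"])) / max(len(diff_files), 1)
        available = c["work_start_utc"] <= hour < c["work_end_utc"]
        scored.append((ownership * (1.0 if available else 0.3), c["name"]))
    return [name for _, name in sorted(scored, reverse=True)[:top_k]]

candidates = [
    {"name": "dana", "owned_files": ["search/index.py"], "work_start_utc": 9, "work_end_utc": 17},
    {"name": "eli", "owned_files": ["search/index.py", "search/query.py"], "work_start_utc": 16, "work_end_utc": 24},
]
print(recommend_reviewers(["search/index.py", "search/query.py"], candidates))
```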
The result? A 1.5 percent increase in diffs reviewed within 24 hours, and an increase in top-three recommendation accuracy (how often the actual reviewer is among the top three suggested) from under 60 percent to nearly 75 percent. As an added bonus, the new model was also 14 times faster (P90 latency)!
Stale Diff Nudgebot
We know that a small proportion of stale diffs can make engineers unhappy, even if their diffs are otherwise reviewed quickly. Slow reviews have other effects, too: the code itself becomes stale, authors have to context switch, and overall productivity drops. To directly address this, we built Nudgebot, which was inspired by research done at Microsoft.
For diffs that have been waiting an extra long time for review, Nudgebot determines the subset of reviewers who are most likely to review the diff. It then sends them a chat ping with the appropriate context for the diff, along with a set of quick actions that let recipients jump right into reviewing.
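A minimal sketch of the nudging flow is below; the three-day cutoff, the send_chat_ping helper, and the likely_reviewers callback are hypothetical stand-ins for the real system:

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(days=3)  # illustrative cutoff, not the actual threshold

def send_chat_ping(reviewer: str, message: str) -> None:
    """Hypothetical chat integration; the real bot attaches quick actions
    (open the diff, accept, comment) so recipients can jump straight in."""
    print(f"-> {reviewer}: {message}")

def nudge_stale_diffs(pending_diffs: list[dict], likely_reviewers) -> None:
    now = datetime.now(timezone.utc)
    for diff in pending_diffs:
        waiting = now - diff["waiting_since"]
        if waiting < STALE_AFTER:
            continue
        # Ping only the subset of reviewers most likely to act on this diff.
        for reviewer in likely_reviewers(diff):
            send_chat_ping(reviewer, f"D{diff['id']} has been waiting {waiting.days} days for review")
```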
Our experiment with Nudgebot had great results. The average Time In Review for all diffs dropped 7 percent (adjusted to exclude weekends), and the percentage of diffs that waited longer than three days for review dropped 12 percent! The success of this feature was separately published as well.

What comes next?
Our current and future work is focused on questions like:
- What’s the proper set of individuals to be reviewing a given diff?
- How can we make it simpler for reviewers to have the knowledge they should give a top quality assessment?
- How can we leverage AI and machine studying to enhance the code assessment course of?
We’re regularly pursuing solutions to those questions, and we’re trying ahead to discovering extra methods to streamline developer processes sooner or later!
Are you interested by constructing the way forward for developer productiveness? Join us!
Acknowledgements
We’d wish to thank the next individuals for his or her assist and contributions to this put up: Louise Huang, Seth Rogers, and James Saindon.