Dragan Stepanović
Duration: 56 min
Published: March 15, 2024

Transcript

[00:00:00] Okay, so welcome everyone. I'm going to talk about a slightly controversial topic today: async code reviews. There's been a lot of talk about async code reviews lately in the industry, comparing blocking and non-blocking async code reviews. What we're going to talk about today are the blocking async code reviews, which are very prevalent in the teams that, at least, I have gotten to work with. My name is Dragan Stepanović. I work as a senior principal engineer at a company called Talabat, which is part of Delivery Hero Group, and I'm based in Berlin. There are a couple of things I've been very interested in throughout my career: extreme programming, theory of constraints, Lean, and systems thinking. For the last couple of years I've been trying to connect the dots between them. I also rant on social networks. And I saw Alberto just joined. There's one terrifying fact about me: I put mayo on pizza, so yeah.
[00:01:12] Cool. So, I'm not sure how many people have heard this quote: nobody ever got fired for buying IBM. I wasn't part of the industry back then, but apparently during the '80s IBM had a huge market share as a vendor, and if you were working in a procurement department, you had a lot of questions to answer if you didn't go with IBM. So everyone went with IBM, which is kind of interesting, because it gives you protection from the downside risk if anything goes wrong, right? You went with the herd, everyone is doing the same thing: hey, it's IBM, who could have known? And when I think about this quote, I think about the ways of working that I mostly get to see in teams. I'm also curious to hear about your experience, but this is the visual I have. Let's say we have two developers, Emma and Luka, working individually on the same team. At the start of the iteration, Emma starts working on ticket number one and Luka starts working on ticket number two. If we follow Emma and her work: she does a bit of coding, hopefully tests, hopefully in a test-driven development way. At some point she figures out: okay, I think I'm done, I'd like to invite my peers to give me feedback. The way we do that is we raise a pull request and invite our peers to review. But Luka is busy: he's working on ticket number two, and there are a bunch of other things going on, different meetings, checking Slack, email, reviewing other PRs, et cetera. Many things competing for Luka's time.
[00:03:07] Emma knows that Luka is not able to respond immediately, and she wants to be a good employee; she doesn't want to twiddle her thumbs. So she says: okay, while I'm waiting, I can start working on something else. And Emma pulls in ticket number three.
[00:03:24] Then some time passes, and at one point Emma figures out that Luka is not responding and tries to remind him again; she asks Luka once more to take a look at her PR. Eventually Luka comes around, asks for some changes, and they go back and forth, trying to converge on a solution both of them are satisfied with. The pull request gets approved and merged. So this is async code review, or PR-based async code review. Just out of curiosity, how many people here work in teams that follow this workflow? Cool, okay, so the majority of the room.
[00:04:03] If there's one thing I'd like you to take away from this talk, it's that async means delays. That's by design: if there are no delays, then it's not async, right? The reason I'm talking about this topic is that I did a study that has now spanned over three years. To give you a bit more context on why: I often get invited by different teams to help and coach on different ways of working. There's a certain share of engineering managers and developers who are interested and curious about different ways of working, but who aren't really sure where they currently stand, in some data-informed way, relative to the direction we'd like to move in. There's a reason I also love the quote 'meet people where they are': it's not really important where we currently are. What's important is the direction, and the trend.
[00:05:08] Over the course of the study, I have analyzed more than 40,000 pull requests. The things I'm going to talk about today are mostly systemic behaviors, systemic patterns, that I got to observe. Some of the things we observed were kind of intuitive, I expected those, and some were pretty much a surprise for me as well. So I'm here to share the learnings and the journey I had with this study. The teams involved are typical product development teams, so no open source software. Although we also know that we adopted the whole idea of pull requests from the open source community, in a completely different context, but that's perhaps a discussion for some other day.
[00:05:57] So, what was I curious to see? I was curious to understand engagement, wait time, and size. The reason is that I had a hunch: with my background in extreme programming and highly collaborative ways of working, pair and mob programming, I wanted to understand what engagement I get to see on pull requests, depending on how much wait time, how much latency, there is in the process, and how that relates to size. Going into engagement: why was I curious about it? One reason is this delay, right? Async. A metaphor I had: I noticed that when I have a phone call with someone and there's a delay in the line, the communication tends to die off pretty quickly, because of the delays that are part of it. And when it comes to choked feedback, what I mean is that, compared to a synchronous face-to-face conversation, providing feedback in written form over some tool tends to be very different, and of a different kind. It also feels a bit more expensive, not only because of the delays involved, but also because the written form is prone to miscommunication, to not being able to see non-verbal cues, et cetera. Besides, you mostly get to review something after the thing has already been done, so you don't have the ability to course-correct, while in a face-to-face conversation you can cut off a wrong path sooner. So there's a reason I also call this high-latency, low-throughput feedback. Now, let's go into the first scatter plot. This is a data set of 500 pull requests, and on the X-axis you can see the size in lines of code. This was a proxy metric, mostly for the effort that went into building a pull request. There are different ways to go about it; one is to measure the touch time, the processing time. But what I found during the study is that it tends to be very prone to outliers, so to say, because often, when people work on something, they're not working on it the whole time: there are lots of breaks, different interruptions, et cetera. So size was a really nice, simple proxy metric. And on the Y-axis there's engagement, and the definition of engagement I used is the number of non-trivial comments that I get to see on a PR.
[00:08:30] Now, what is a non-trivial comment? A non-trivial comment is a comment that is not trivial, and a trivial comment has a very simple definition: a comment that has four or fewer words in it. Why? Because I got to observe, and probably you did as well, lots of LGTM, looks good to me, plus one, thumbs up, the kind of feedback that doesn't really provide value to the other side. What can we see from this scatter plot? Not much: most of the pull requests are less than 500 lines of code, and most of them have fewer than 12 non-trivial comments. But one thing I started thinking about is that if you invest quite some time building a feature and, say, a week passes and you get two comments, it's not the same as making a very small change and getting back the same number of comments. Now, the quality of the comments was not part of the study, but it's interesting to see what happens if we normalize the Y-axis by size. At that point things start getting interesting: when we put engagement per size on the Y-axis and plot it by size, what we get to see, and this was prevalent in all of the data sets I analyzed for teams doing async code reviews, is that as we increase the size of the pull request, the engagement per size dies off exponentially.
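To make those two metrics concrete, here is a minimal sketch of how they could be computed from raw PR data (the record layout and field names are hypothetical, not from the study; the four-word threshold is the one from the talk):

```python
# Hypothetical PR record: {"size_loc": int, "comments": [str, ...]}

def non_trivial_comments(pr: dict) -> int:
    """Engagement: comments with more than four words ("LGTM", "+1" don't count)."""
    return sum(1 for c in pr["comments"] if len(c.split()) > 4)

def engagement_per_size(pr: dict) -> float:
    """Engagement normalized by PR size in lines of code."""
    return non_trivial_comments(pr) / pr["size_loc"]

prs = [
    {"size_loc": 12, "comments": ["What happens if the input list is empty here?", "LGTM"]},
    {"size_loc": 900, "comments": ["LGTM", "+1"]},
]
for pr in prs:
    print(pr["size_loc"], non_trivial_comments(pr), engagement_per_size(pr))
# The small PR ends up with far higher engagement per line than the big one,
# which is the pattern the scatter plots show across the data sets.
```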
[00:10:05] And I think there's an intuitive part to this. I'm pretty sure lots of people have had the experience with a huge pull request, right? Just 'looks good to me', stamp it, approve it, and then hope that nothing is going to break in production.
[00:10:19] But the important thing here, when it comes to code review as a process for building quality in by embedding human judgment in the process, is that for the bigger pull requests we delay the process, and on top of that there's a lack of engagement, which is a precondition for being able to build quality in, because we're basing the process on the ability to get feedback. I'm not saying anything about the left-hand side: I'm not claiming that having a lot of comments guarantees quality, not really. But what I am saying is that if you don't have the ability to get feedback, which quality has as a precondition, then you're also not able to build quality in, at least to the extent you'd like. That's the reason I also love the quote 'never had a huge PR that didn't look good to me'. I have a t-shirt with it that I wear at conferences; I didn't bring it this time. I'm sure a lot of people have run into this, what's it called, phenomenon: the smaller the pull request, the more engagement you get to see, and vice versa; at 500 lines of code everything is fine, let's just roll. And there's a reason for that. If I get a pull request after someone has been working on it for a week, with a lot of code changes introduced, there's not much I can do about it: at that point the thing has already been done. Either there's quality or there isn't, and there's no ability to course-correct. Plus, if I'm the author and I invested one week into building this feature, and then someone comes back at me and says, hey, for 80% of this you took the wrong turn already on your first day, it's emotionally painful, right? The sunk cost fallacy kicks in, and then it's also difficult from that side to course-correct. And recently I bumped into this video, which I'm not sure is going to load.
[00:12:28] Yeah. Or maybe it is going to load, let's see.
[00:12:35] So yeah.
[00:12:38] This is the way I picture reviewers of a big pull request trying to build quality in. At that point it becomes theater, more than really doing it.
[00:12:49] Okay, so then: wait time. Wait time was the interesting part for me, at least in terms of what I learned. If we look at ticket number one in this case, and focus on that: at the bottom of the slide I also plotted the cycle time, or the lead time, from starting work on this ticket until it's released, or deployed, or merged; it doesn't really matter, because we're mostly focusing on this part. What you get to see is that there are parts of its cycle time where we effectively spend time working on the item: Emma is coding, Luka is providing feedback, Emma is incorporating the feedback, et cetera. But there's also a significant chunk of time where the item just sits in a queue, waiting for someone's attention. That's completely unproductive time, besides the fact that it delays the delivery. And what you get to observe in teams that work individually, and it's hard to find a team that works individually and doesn't do async code reviews, is that the percentage of wait time within the cycle time is huge. The wait time starts dominating the cycle time, and there's a reason and math behind that: queueing theory and Little's law. I'm not going to go into that, but what you get to observe is that most of the time, items are just waiting for someone's attention.
[00:14:25] The way I went about measuring the wait time: I used an approximation, which is sometimes good, sometimes not as good. And it's really important, when we measure something in a delivery process, to understand when the approximations are not good enough. The way I went about it follows from how we usually work in this process: we develop a feature and then
raise a pull request, and at that point we invite someone for feedback. So the wait time runs from that moment, which I captured in different ways, until the pull request has been merged. Now, this approximation rests on a couple of assumptions that I won't have time to go through here, but one case where it doesn't hold is when I see a lot of feedback after a pull request is raised, or a lot of follow-up commits. That tells me there's already rework going on, and rework shouldn't be counted as wait time.
[00:15:30] Then, looking at some of the typical results I got to observe: here we have 500 merged pull requests that took around six months to push through the system of work, from the start time of the first pull request until the last pull request was merged. For this team, the cumulative wait time was close to 28 months. Now, this number is not precise, but the order of magnitude lends itself to really nice conversations, right?
[00:16:07] The reason I think it's really important to understand the effects of lots of wait time in a system of work is this analogy, a game we played as kids: the hot-and-cold game. One person hides an item, the other goes looking for it, and the person who hid the item gives feedback at regular intervals. Now imagine you have two teams: one team gets feedback every second, and the other gets feedback every minute. Which of these two teams would you bet, on average, finds the item sooner? Pretty much everyone goes with the team that has the lower latency in the process. And the important part is that, because of all the waste, all the wait time accumulated in the system of work, we also delay delivery, which means that our learning cadence, the rate of learning we have, decelerates. We run experiments less often, and because of that, we learn less often.
[00:17:14] Okay, so this is the scatter plot of wait time by size. On the Y-axis you can see the wait time in hours. Again, from this scatter plot alone, not much: most of the pull requests took less than 200 hours, which is huge, and people often get surprised by the numbers when they see them. But think about it: if I make a small change, renaming a couple of methods or variables, something of smaller size, and I invest, say, 20 or 30 minutes to do that and then wait one or two days for a review, that's not the same as investing one week and then waiting one or two days for a review. The relative cost of review is different in the two cases. So what happens when we normalize the Y-axis by size? Here we have the wait time in minutes per line of code. And this was interesting, because what we get to see here, and again, all of the data sets I analyzed exhibited the same pattern, is that as we reduce the size of the pull request, the wait time per size goes up exponentially. The way I interpret that is that the cost of code review per size goes up exponentially as we reduce the size of the pull request. So reviews, per size of the pull request, get more and more expensive as we make pull requests smaller.
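As a rough sketch of the wait-time proxy and its normalization (timestamps and field names are hypothetical; real data would come from the Git hosting tool's API):

```python
from datetime import datetime

# Hypothetical merged-PR record with the two timestamps the proxy needs.
pr = {
    "opened_at": datetime(2024, 3, 1, 9, 0),    # PR raised, review requested
    "merged_at": datetime(2024, 3, 4, 15, 30),  # PR merged
    "size_loc": 40,
}

# Wait-time proxy from the talk: time from raising the PR until merge.
wait_minutes = (pr["merged_at"] - pr["opened_at"]).total_seconds() / 60

# Normalizing by size gives the "cost of review per line" metric.
wait_minutes_per_loc = wait_minutes / pr["size_loc"]

print(f"wait: {wait_minutes:.0f} min, per line: {wait_minutes_per_loc:.1f} min/loc")
# Caveat from the talk: this proxy overestimates waiting when there is heavy
# post-review rework (many follow-up commits), since rework is not wait time.
```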
[00:18:59] And that was a surprise for me. Often when I go to teams and there are huge PRs, the default advice is: let's do smaller pull requests. We want smaller pull requests because we learned the lesson about small batches from Lean. The reasons are that they take less time to write and they're quicker to review: as a reviewer, I can squeeze a 10- or 15-minute review into my calendar instead of a one-day review. As we got to see, smaller PRs also get higher engagement than bigger PRs, and they're less risky, because fewer things change. And when there's a problem, it's easier to troubleshoot, because there's a smaller haystack in which to search for the needle, so to say. When it comes to the DORA research: smaller PRs, which as a side effect give you higher integration frequency, positively contribute to all four DORA metrics. So we want smaller pull requests. But what we get to observe is that the cost of having them goes up exponentially, so the system is pushing back against us. One scenario to keep in mind when we talk about this pattern: I'm pretty sure most developers have experienced slow-running test suites. If I have a test suite that takes 20 minutes to run, I'm not going to run it after every line of code change, because it doesn't make any economic sense; most of the time I'd just sit there waiting for the test suite to give me feedback. Effectively, the system is telling us: hey, getting feedback from tests is super expensive, so use it wisely. And the way we use it wisely is to run it less often, which means we accumulate more changes, which means we slide back into bigger PRs. For that reason I love this quote from Don Reinertsen: while you may ignore economics, it won't ignore you. We need to understand the economics of the system we have in place in order to understand what intervention points we have. Another thing to keep in mind: try making small changes in this kind of environment. If it takes me one or two days to get a review on a small change, that's most often going to inhibit me from making the change at all, or I'm going to batch it with something else, and we're back to bigger pull requests. And the reason I think this is important is that I think teams with lower latency in communication tend to have healthier code bases.
Because you change the system to correspond to your new mental model when you want to, not only when it's cheap enough. A healthier code base in turn means being more responsive to change, less rework, et cetera. So there are lots of second-, third-, and fourth-order effects of these kinds of systemic patterns.
[00:22:09] Now, talking about these second- and third-order effects: look at this part that I highlighted here, 'can you review my pull request please?' and these begging hands. How many people here have been part of teams where you got to see repeated asks for reviewing pull requests, whether gentle reminders, or begging, or outright escalation to the engineering manager, or whatever? This is interesting: I also observed that people have very different ways of begging for pull request reviews, some more gentle, some less. But my point here is that the amount of begging in the system is proportional to the work in process. The more things you're working on as a team, and if you're working individually you already have too high work in process, the less responsive the system becomes. Because of that you get delays, latency in the process, and because of that people feel the need to repeatedly ask for their code, or document, or whatever it is, to be reviewed.
[00:23:25] When I tweeted this, one person said they have a term for it: merge begging. So if you're in need of a term for that, there you go.
[00:23:38] Another thing I observed, when it comes to these second- and third-order effects, is pull request nitpicking, where a nitpick is defined as a reviewer asking for some small change, where small is relative to the effort. Let's say the reviewer found a better name for a variable or a method, or there's an extra space, or formatting, or whatever the case, and they feel the need to say: hey, sorry for this nitpick, but can you incorporate this change? I also started searching for articles on this, and I found a lot of them: how to stop nitpicking in code reviews, how to handle nitpicky code reviewers, how to deal with a team member who keeps nitpicking, et cetera. But the thing I observed is that when you reduce the cost of reviews, and with it the cost of change, you get to see fewer nitpicks and fewer people having to excuse themselves with 'yeah, I know this is a nitpick'. For example, if I'm pairing with someone and I've just extracted a method and my pair finds a better name for it, I have only one line of code to lose. But when there's latency in the whole process, the change becomes very expensive, and then we have to find a relatively bigger change in order to justify it, so to say.
[00:25:16] So I tried to make sense of the whole idea of helping teams that do async code reviews reduce the size of their pull requests. What you see here is a causal loop diagram, a tool from the systems thinking toolbox, and I'm going to give you a short crash course on it. These labels are variables, connected by causal links. There are two types of causal links: positive and negative. A positive link, marked with a plus, means the two connected variables move in the same direction, up or down, while a negative link means they move in opposite directions. When you put down these variables and draw the links between them, you might discover some feedback loops. The two kinds of feedback loops from systems thinking represented here are the reinforcing feedback loop and the balancing feedback loop. A reinforcing feedback loop feeds off itself, a kind of snowball effect, while a balancing feedback loop seeks a goal.
[00:26:31] Say we have a thermostat set to 21 degrees. If it's hot outside, the system is going to work until it matches this number, and then it stops. If it gets even hotter outside, it works even more, but when it reaches the threshold, it stops: whenever there's a discrepancy, it tries to stabilize the system. Now let's walk through this causal loop diagram, because there's a story behind it. The context is a team doing async code reviews that wants to reduce the size of its pull requests. If you reduce the size of the pull request, the motivational incentive to review goes up, which is a good thing, because I'm more in favor of reviewing a smaller pull request than a bigger one. Because of that, the time the author spends waiting for a review goes down, which is also a good thing. That in turn reduces the perceived cost of the code review, which drives the pull request size even lower. So it incentivizes smaller pull requests.
[00:27:34] And this is a reinforcing feedback loop that is desirable, the one we want to have. But it's not the only thing going on. If we look at what else happens when you reduce the size of the pull request in teams doing async code reviews: the number of PRs to review per unit of time goes up. If we halve the size of the pull request, we get twice as many pull requests to review per unit of time. Push this further and you get a higher number of interruptions for the reviewer. And since everyone tries to protect their personal psychological flow, and no one wants to be interrupted too often, to context-switch and stop what they're working on, this drives the motivational incentive to review down, which is pushing us back in the other direction,
incentivizing us to increase the size of the pull request. So this is the balancing feedback loop that pushes against that desirable reinforcing feedback loop. The problem here is the conflict we get to see in the motivational incentive to review, which is mostly driven by the number of interruptions. But that's not the only problem we have when we reduce the size of the pull request, because something else happens too, and you saw it with Emma when she was waiting for Luka: she pulled in something else. The temptation to start working on something new goes up, which means even more work in process, more pull requests to review, and the system pushing against us even more.
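As a small illustration of the crash course, the two loops can be encoded as signed links; a loop is reinforcing if the product of its link signs is positive, and balancing if it is negative. The variable names below paraphrase the narration, not the actual slide:

```python
# Each link: (cause, effect, sign); +1 = same direction, -1 = opposite.
REINFORCING = [  # smaller PRs -> cheaper, faster reviews -> even smaller PRs
    ("pr_size", "incentive_to_review", -1),
    ("incentive_to_review", "time_waiting_for_review", -1),
    ("time_waiting_for_review", "perceived_review_cost", +1),
    ("perceived_review_cost", "pr_size", +1),
]
BALANCING = [  # smaller PRs -> more PRs -> more interruptions -> bigger PRs again
    ("pr_size", "prs_per_unit_of_time", -1),
    ("prs_per_unit_of_time", "reviewer_interruptions", +1),
    ("reviewer_interruptions", "incentive_to_review", -1),
    ("incentive_to_review", "pr_size", -1),
]

def loop_type(links: list[tuple[str, str, int]]) -> str:
    product = 1
    for _, _, sign in links:
        product *= sign
    return "reinforcing" if product > 0 else "balancing"

print(loop_type(REINFORCING))  # reinforcing
print(loop_type(BALANCING))    # balancing
```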
[00:29:12] Shifting gears now to flow efficiency. We're at FlowCon, so I'm pretty sure lots of people are familiar with it, but shortly: flow efficiency is the metric where we try to understand how much of the cycle time of a work item is spent actually working on it, compared to it just waiting for someone's attention, sitting in a queue. The lower the flow efficiency, the worse our process: the more wait time, the more waste we have in the system, and vice versa. I crunched the data, and this is one of the patterns that was, again, observable across these data sets. On the Y-axis, just think of it as flow efficiency; I'm not going to spend too much time describing the exact metric, but it's a proxy for flow efficiency. What you get to observe is that flow efficiency starts plummeting below a certain pull request size. And that is interesting, because if you do a thought experiment and say you have a 300-lines-of-code change to push through the system of work, you can do it in at least two ways: 15 PRs of 20 lines of code each, or one PR of 300 lines of code. With this kind of behavior, the cumulative lead time to get 15 PRs of 20 lines of code through the system of work is way longer than the cumulative lead time of one PR of 300 lines of code.
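A minimal sketch of a flow-efficiency proxy and of that thought experiment (the numbers are made up for illustration; the study's exact metric is not reproduced here):

```python
def flow_efficiency(processing_hours: float, wait_hours: float) -> float:
    """Share of cycle time spent actually working on the item."""
    return processing_hours / (processing_hours + wait_hours)

# Thought experiment from the talk: one 300-LOC PR vs. 15 PRs of 20 LOC,
# where every PR pays the queueing cost (waiting for a reviewer) separately.
one_big = flow_efficiency(processing_hours=10, wait_hours=30)
# Each small PR carries 1/15 of the work but still sits ~a day in the queue.
each_small = flow_efficiency(processing_hours=10 / 15, wait_hours=24)

print(f"one big PR: {one_big:.0%}, each small PR: {each_small:.0%}")
# one big PR: 25%, each small PR: 3% -- flow efficiency plummets as size drops.
```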
[00:30:46] That is also interesting because it means the throughput for the smaller pull requests goes down: we incur more wait time per cycle time, or the wait-to-processing-time ratio goes up, and because of that the throughput plummets. One way to think about it: when we reduce the size of the pull request, the processing time per size goes down or stays the same, it doesn't matter, but the wait time, as we saw, goes up exponentially. The reason is that with smaller pull requests, with smaller batches, in a system that is fully utilized, you incur the cost of bringing in dependencies far more often: with smaller pull requests, I need my peers more often to review them. That drives the wait-to-processing-time ratio up, which kills the flow efficiency, which leads to lower throughput.
[00:31:48] And when you think about async work: the promise of async work is that you can start working on something without the other person you're going to need being available. So think about the incentives you now have in the system. What you're effectively doing is making the cost of starting new work zero, which tilts the arrival rate in the system, meaning you get a higher inflow of new work, which increases the work in process, more inventory, and slower delivery in the end.
[00:32:23] Because of that, if we take Little's law into account, the work in process, as you can see here, just shoots up, and as you have more inventory, the cycle times skyrocket. And that's not the only thing.
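Little's law, the math referenced here, in a worked sketch (the numbers are illustrative, not from the study):

```python
# Little's law: average cycle time = average WIP / average throughput.
throughput = 5  # items finished per week, held constant for the example

for wip in (5, 15, 45):  # pulling in more and more work while blocked
    cycle_time_weeks = wip / throughput
    print(f"WIP={wip:2d} -> average cycle time {cycle_time_weeks:.0f} week(s)")
# WIP=5 -> 1 week; WIP=15 -> 3 weeks; WIP=45 -> 9 weeks.
# With throughput held fixed, every extra in-flight item directly
# lengthens cycle time: more inventory, slower delivery.
```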
[00:32:41] When you have high work in process, you have more blockages in the system and more delays, because the system is less responsive. And humans are really good at pulling in more work when they're blocked.
[00:32:54] So you pull in even more work in process than you already have, and this is a really nasty reinforcing feedback loop to have. Now, think about it in this case: we have this wait-to-processing-time ratio in a fully utilized system, and again, if you're working individually, your utilization is too high. Try estimating in this kind of system. What happens is that you're probably estimating the effort, but are you estimating the wait time? You have no idea how long it's going to be, yet the wait time is the thing that dominates the cycle time.
[00:33:36] So there's a reason why I say that estimating in a fully loaded system is a poor man's attempt at achieving predictability. You have bigger problems to solve before trying to do any kind of estimation.
[00:33:49] Another second-order effect of long delays: think about ordering a jacket from an online clothing store, and say it takes them three or four days to deliver the item. I want to buy a jacket and I have a couple of candidates. What happens is that I'm not going to order the first one, wait three days, then order the second one and wait three more days. I'm going to batch all of these things and order all of them at once. And then, most probably, I'm going to choose only one of the jackets, so I'm going to return n minus one back
to the shop, right? This is the speculative inventory that starts accumulating in the system because delays are long. There's an online clothing store in Germany, I don't know if they also operate in France, Zalando, and they apparently say they have a problem with a high return rate. My take is that they don't have a problem with a high return rate; they have a problem with delivery lead times, and because of that they have lots of speculative inventory flowing through the system, which also causes their partners to pile up inventory that will get pulled into delivery.
[00:35:11] So for that reason, I say that speculative inventory in the system increases with delays. And when you think about pull requests: when there are lots of delays, you often get to see over-engineered solutions, because I'm not sure when the next time will be that I get to tweak this design, so I do it all from the get-go. But it's speculative, right? It's not emergent.
[00:35:38] Okay, so to sum it up: low quality, or rather the inability to build quality into the bigger pull requests, and throughput going down for the smaller pull requests. So we're caught between a rock and a hard place: we have to choose whether we make a trade-off on losing throughput, or we lose quality.
[00:36:01] For anyone who has had the chance to read Don Reinertsen's book, The Principles of Product Development Flow: this is a typical batch size optimization U-curve. You're trying to find a sweet spot between the different forces that are part of the system, and here we're talking mostly about the transaction cost, because these delays are a form of transaction cost.
[00:36:23] Okay. So, growing up as an engineer, I learned to say 'there's always a trade-off' so often. But the thing I also discovered is that some trade-offs actually do not exist, because the underlying assumptions are flawed. So it's really important to understand what assumptions sit behind the trade-off we're claiming is there, because I think there are lots of cases where you can have a win-win; it really depends on the context. And speaking of win-win: one of the big myths that was busted by the DORA research, with the results presented in the Accelerate book, is that when you talk about DevOps, it's not either throughput or stability; you have both of them, or neither. So what I'm trying to say is that maybe we can have our cake and eat it too. Let's try to find one of the options for going about this problem. What we observe is an exponentially rising cost of code review as we reduce the size of the pull request, leading to lower throughput for the smaller pull requests. We want small pull requests, but we don't want to lose throughput. How can we go about that? Let's work backwards from there.
[00:37:54] If we want to keep the throughput the same, and not lose it exponentially as we reduce the size of the pull request, we must not incur a higher cost of code review per size as we reduce the size. For that to happen, the actors' reaction time, where by actors I mean authors and reviewers, has to get exponentially faster and faster as we reduce the size of the pull request. And for that, the availability of the actors has to go exponentially up as we reduce the size of the pull request. If you remember Emma and Luka: they weren't able to react to each other's requests because they were busy with other things, so high work in process was killing availability.
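Written out as a sketch of that backwards reasoning (notation is mine, not from the talk: $s$ is PR size in lines of code, $W(s)$ the wait time for a PR of that size, $c$ a constant):

```latex
\[
\text{constant throughput} \;\Rightarrow\; \frac{W(s)}{s} \le c
\;\Rightarrow\; W(s) \le c \cdot s \;\xrightarrow{\; s \to 0 \;}\; 0
\]
```

The study instead observed $W(s)/s$ rising exponentially as $s$ shrinks, so to hold the bound, the actors' reaction time has to fall toward zero along with the batch size, that is, toward being continuously available.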
[00:38:46] So if we look at ticket number one and pull out its timeline: the gaps between the arrows represent the delays that currently exist in the process, and let's say we want to reduce the size of the pull request. What we're saying is that, as we reduce the size of the pull request, in order not to lose throughput, Emma and Luka need to react faster and faster to each other's requests. To the point that, eventually, the pull request is going to be small enough that they would need to work together in order not to lose throughput. So that was the conclusion of the study: in order not to exponentially lose throughput while reducing the average size of the pull request, people need to get exponentially closer and closer in time, which leads us to continuous code reviews.
[00:39:39] So, going back to the problem we saw here, the conflict in the motivational incentive to review, which was caused by the increasing number of interruptions. Well, why? Let me have some water first.
[00:39:58] Right: if we're working on the same thing at the same time, I cannot get interrupted, because both of us are working on the same thing. And that leads me to this parallel universe, which I call co-creation patterns: a joint term for pair and mob programming, or ensemble, or software teaming as it's also called nowadays. When we think about the economics of the system, the important positive side effect of working together is that you have guaranteed availability of the reviewers. The cost of a code review becomes zero, and I can have as much of it as I want: the other person is available all the time, so I'm able to get a review of every small change that I make. And what happens because of that? The transaction cost, the cost of transferring a batch from one stage to another, from development to review, gets minimized, gets close to zero. So you can have the optimal batch size all the way to the left.
[00:41:03] Then think about how this scatter plot would look if we had continuous code reviews. The wait time is going to be close to zero, or zero. And the engagement, I don't know, I didn't measure it, but I would assume it goes up, at least; we saw on the engagement side of things that when you have lower latency in the process, you also get to see higher engagement. Another interesting thing to think about: when there's an outage in production at your team or your company, how do you go about working? Do you do async messages on Slack, emails, et cetera? No. You want to minimize the mean time to recover, so you jump on a call and solve the thing together. This is a lesson from Eli Goldratt and the theory of constraints: you can often find patterns in emergency situations that you can actually apply to your day-to-day work. You have the same goal with delivery: you want shorter lead times, because we want to accelerate the learning cadence, so you can apply the same thing day to day. So think about what we're trying to optimize here: we want the size of the pull request to go down, we want the wait time per size to go down, and we don't want to lose engagement. Those are the things we're optimizing for.
[00:42:29] You might have noticed this bar on the right-hand side, which I call the PR score. What I did is map the relationships between the variables we're trying to optimize for into a formula: size, times wait time in minutes, divided by engagement. You want the size to go down, the wait time to go down, and the engagement not to be lost, to stay constant or go up. The lower this number, the better the process.
[00:42:58] So when we think about working together: the size of the pull request, of the change, is minimal; it can be as low as one line of code. The wait time in minutes is zero, because you have guaranteed availability. And it doesn't even really matter whether engagement stays the same or goes up. I use one plus engagement because I often run into pull requests that have zero non-trivial comments, and you don't want to divide by zero. What this evaluates to is zero. So this is saying that if you optimize for these things, you get the best result as a byproduct of the way of working.
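The formula as described, in a minimal sketch (the function name and the example numbers are mine, for illustration):

```python
def pr_score(size_loc: int, wait_minutes: float, engagement: int) -> float:
    """Lower is better: small PRs, little waiting, high engagement.

    The 1 + engagement in the denominator avoids dividing by zero for
    PRs with no non-trivial comments."""
    return (size_loc * wait_minutes) / (1 + engagement)

# A typical async PR: 400 lines, two days in the queue, 3 non-trivial comments.
print(pr_score(400, wait_minutes=2 * 24 * 60, engagement=3))  # 288000.0
# Continuous review (pair/mob): one line, zero wait -> the score is 0.
print(pr_score(1, wait_minutes=0, engagement=0))              # 0.0
```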
[00:43:42] And that's the reason I also say that the optimal size of a pull request is one line of code that is reviewed as it's being typed. I personally don't know of a better way to achieve that than pair and mob programming. I'm completely open to learning new ways to do it, but this is, at least from my perspective, the best we currently have in our toolbox.
[00:44:08] So, plotting the pull requests of the teams that have been doing async code reviews: here on the Y-axis we can see the score, which I had to plot on a log scale because the results were so high, or so bad. That's one world that I get to see. And then there's this other world, the parallel universe of continuous code reviews, pair and mob programming, which is very far away from it, again on a log scale.
[00:44:37] One more thing I wanted to share: a couple of years back I worked with a team where we did trunk-based development and mob programming by default, all of the XP practices: test-driven development, really great refactoring and design skills, et cetera. One day I was thinking: okay, we're working this way, I'm curious what integration frequency we actually get to see. These are the results from a team whose product has 3 million daily active customers and many millions in revenue flowing through the system. What I got to see was that in a time span of nine hours and 20 minutes, we had 107 commits landing on main. Translated into the pull request world, that means a team of five people had 107 merged pull requests in that time span, which is very different from what you otherwise get to observe in teams not doing these practices. The mean time to integrate was 2.3 minutes, meaning that every 2.3 minutes you had a new change that landed on main and was in a deployable state. We didn't deploy all the changes, because essentially what we did was shift the bottleneck to the delivery pipeline, and that's a good thing to have. But there are certain preconditions for being able to achieve this. I also get to hear a lot about trunk-based development and how to go about it. Fun fact about trunk-based development: it's a byproduct of all of these practices; you unlock it as a reward for doing these things, because you shift left on quality and you're able to build it in sooner.
[00:46:35] So, throughput or quality? I would say throughput and quality. I think there's a way for us to get a win-win here. And I'd like to finish with this point: I think we've been told all along that we'll achieve all this and more if we limit and delay our interactions as humans. I think, or hope, that you now also have a data-informed reason not to believe so.
[00:47:05] I wrote an article for InfoQ some time ago, and it gained a lot of traction in the community; in it I talk a bit about this study and a bunch of other things I haven't had a chance to cover today. If you're interested, feel free to check it out. With that, I'd like to thank you for your time and attention, and we can take questions.
[00:47:49] Hello. Thank you very much, I loved the presentation. I spent the better part of the last 15 years working with teams that were doing pair programming by default, doing trunk-based development, all of the practices you pointed to. But I've seen an evolution in the industry, especially over the last five to ten years, where teams have moved very far away from that, toward more of these asynchronous pull request models, and I would suspect they run into some of the delays that you've used mathematics and graphing to point out. I'm curious, though: you mentioned something earlier about how the tools we use, Git specifically, came out of an open source community, trying to solve an open source problem, which is inherently different from the way businesses structure their engineering organizations. The question I have is: Git is an inherently better tool than SVN, which is what we used to use for versioning code, but the GitHub model of using pull requests, I don't understand how companies have adopted that as a de facto standard. So what is your observation there?
[00:48:51] Yeah, that's a very important question that I don't know the answer to. I think it emerged at one point in our industry, and I'm not sure what the driver was. I think there's a lot of vendor activity going on, promoting tools that lead in this direction, without trying to name any of them. And there's also the fact that, from what I've heard, the number of developers doubles every five years, which means that at any point in time at least 50% of developers have less than five years of experience. So we're already past the threshold where most people don't even remember what came before; this is a new world for most of the people I get to work with. There's also this assumption that something older is worse, or not as good as what we have today, this need for new shiny things. So yeah, I don't have an answer for that; I think it's a thing that emerged somehow, with lots of different contributing factors. But there's definitely also this thing with Git: again, talking about the economics of the system, Git really reduced the cost of branching, made it close to zero. It's very easy to branch, so the incentive you get is more branching. So I think it's also about trying to understand the incentives and the economics that surround a given tool's usage, which I think also contributed. Yeah, that's my take.
[00:50:37] Thank you, a wonderful talk. I'm going to wait for the recording to send to all the teams I'm working with. You've been describing the dynamics of pull requests within a single team, between two people. In a multi-team, scaled environment, I assume you're going to have the same dynamics, but between teams, when one team sends a request to another team. I have my own opinions on that, but I'd like to hear yours: what are your heuristics, practices, and ideas for solving this at the scale of multiple teams collaborating on the same code base? Thank you.
[00:51:13] Yeah, I think there are a couple of things, and it really depends on the time horizon for the intervention: there are short-term, mid-term, and long-term things you can do about it. I'm a big fan of strategic domain-driven design, which I think helps a lot with understanding where we can unlock more of the value.
[00:51:34] And sometimes, at least in one of the recent cases, it also helps to have a bit more fluidity in the teams: dynamic re-teaming, or somewhat less strict boundaries, until we find better boundaries than the ones we currently have. Also, at my current company we're using the full-kit way of scheduling work, which means: don't start the work if all of the skills needed to get the thing from start to end are not available. It's a scheduling mechanism that inhibits pulling in more work than needed, so people don't try to front-run each other, and as a result there's less wait time involved. So yeah, some of these are short-term interventions, some are mid- and long-term, but that's my take.
[00:52:28] Thank you very much for the presentation. I feel like you're optimizing here for what you call throughput, which is actually the lead time of one PR. Did you try to look at what happens when, instead of the lead time of one PR, you look at the throughput of a team? Because with pair programming, you're basically saying that to get one ticket done, one feature done, you need two people. So you're already somehow dividing the available throughput at the team level by two. It works for one PR, but how does that scale at the team level? Is that something you've tried to look at?
[00:53:05] Yeah, this is a question I often get, and I could talk for hours about it, but I think we need to distinguish between trying to remove the waste of idle workers and trying to remove the waste of idle work. What this means is that if you're optimizing for utilization of people, what you get as a result is an unresponsive system with huge delays and long lead times. And lead times are a leading metric for throughput: if your lead times are going up, it's impossible to have higher throughput, and vice versa. So the focus is not on how busy or utilized certain people are; the focus is on delivering value sooner. When you start optimizing for that, you don't care how many people work on an item. I mean, there are heuristics for how to form the group, but you need all of the skills required to get the thing from start to end. It might be two people, three, four, five, it doesn't matter; you want to optimize for getting the value sooner and accelerating the learning cadence. This is the typical resource efficiency versus flow efficiency discussion, and there are a lot of articles and videos on that, theory of constraints as well. It's counterintuitive; that's the reason most companies are not doing it. But it's also a competitive advantage for your company, one that is still undiscovered. I know it sounds unrealistic, but, I mean, Eli Goldratt spent three or four decades talking about this, yeah.
[00:54:57] Hi Dragan, I'm Flavian. Thank you for the wonderful talk, it was a blast. You talked about quality, and that as engagement goes up, quality goes up as well. Have you looked at the quality of the finished product too, from the standpoint of QA and PMs, and also from a user standpoint, the number of bugs in production? Have you observed or measured effects on those variables?
[00:55:32] Yes. So one of the points I mentioned was that I'm not saying that if you have high engagement, you have higher quality. I was talking only about the right-hand side: engagement is a precondition for being able to build quality in, for getting human judgment into the process, and if you don't have it, then you're not able to build in the quality you want. So no, I didn't measure the metrics you're talking about. But what I observed, again, is that when you have lower latency, in the sense that people start and finish work together, you get to observe higher quality, because the mere fact that you're able to cut off a wrong path sooner means you avoid a lot of the rework that would otherwise be waiting for you. Because of that you have less rework and fewer incidents, which means you unlock more time for value-added activities, and because of that you get more productive; it's like extending the capacity of the team, in a sense. That's my take: I didn't measure this, but this is my experience over the course of the years. Thanks.
[00:56:42] Okay, no more questions. Thank you.