Your Improvement Worked. Or Did It?
The team introduced a WIP limit six weeks ago. Cycle time improved. The retrospective declared it a success.
Two months later cycle time is back where it was. There is a disagreement in the room about whether the limit ever worked. Nobody wrote down what they expected to happen, so there is no way to know.
This is the default state of delivery improvement: interventions tried, outcomes felt, declarations made. No measurement discipline. No way to learn.
Why "it feels better" fails
Delivery improvement rarely happens in a controlled environment. Teams change. Demand shifts. Seasons affect throughput. A single quarter might include a holiday slowdown, a reorg, a new product push, and a major incident.
In this environment, gut feel is unreliable for four specific reasons.
Regression to the mean. If you introduce an intervention during an unusually bad period, almost anything will look like it worked — the system was going to improve regardless. The intervention gets the credit for mean reversion.
Observer effect. Teams that know an intervention is being evaluated often behave differently during the evaluation period. The changed behaviour, not the intervention itself, is the cause of the improvement.
Seasonal and demand effects. Cycle time in the two weeks before a release is different from cycle time mid-sprint. Comparing across these periods produces noise, not signal.
Placebo. When teams believe a change will help, they report that it helped. Retrospectives are conversations about how people feel, not measurements of what happened.
None of these failure modes are signs of a bad team. They are the natural conditions of knowledge work. The only way to separate them from genuine intervention effect is to decide what you will measure before you intervene.
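Regression to the mean, in particular, is easy to demonstrate. The sketch below is illustrative, not real team data: it simulates a stable system that is pure noise, then "intervenes" right after the worst four-week stretch, which is exactly when an anxious team is most likely to try something.

```python
import random

random.seed(42)

# 52 weeks of cycle-time data from a stable system: no trend,
# no real change, just noise around a mean of 15 days.
weeks = [random.gauss(15, 3) for _ in range(52)]

# The team "intervenes" right after the worst four-week stretch.
worst_start = max(range(len(weeks) - 8), key=lambda i: sum(weeks[i:i + 4]))
before = weeks[worst_start:worst_start + 4]
after = weeks[worst_start + 4:worst_start + 8]

avg = lambda xs: sum(xs) / len(xs)
print(f"avg cycle time before: {avg(before):.1f} days")
print(f"avg cycle time after:  {avg(after):.1f} days")
# "After" is usually lower. That is mean reversion, not improvement:
# the intervention here did literally nothing.
```

Run it with different seeds and the do-nothing intervention "works" most of the time, which is precisely why a frozen baseline matters.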
The three pre-commitments
A valid intervention test requires three decisions made before the intervention runs — not after.
A frozen baseline window. Identify the time period before the intervention that will serve as your comparison. It should be long enough to contain meaningful variation: at minimum eight to twelve data points. Do not extend or adjust it after the intervention based on how the comparison looks.
A written prediction. Write down what you expect to happen, in measurable terms: "Cycle time P85 will drop from 18 days to 12 days within eight weeks." Not "we expect improvement." Specific metric. Specific direction. Specific magnitude. Specific time window. A prediction you couldn't be wrong about is not a prediction.
A single designated watch metric. One primary metric that will carry the decision. Secondary metrics can inform, but the call is made on the primary. This prevents post-hoc reinterpretation — discovering after the fact that the metric that improved is the one that mattered after all.
These three commitments are not bureaucracy. They are the minimum required to learn anything from the intervention.
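The three commitments fit on an index card, or in a tiny record written before the change goes in. A minimal sketch in Python; the dates and example values are invented for illustration, and the field names are not a standard of any kind:

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)  # frozen: the record cannot be edited after the fact
class InterventionTest:
    intervention: str
    baseline_start: date   # frozen comparison window, fixed before the change
    baseline_end: date     # long enough for 8 to 12 data points
    watch_metric: str      # the single metric that carries the decision
    predicted_from: float  # where the watch metric is now
    predicted_to: float    # predicted value: direction and magnitude
    deadline: date         # when the signal must have registered

# Written down BEFORE the WIP limit goes in. Dates are invented;
# the prediction matches the worked example above.
test = InterventionTest(
    intervention="WIP limit of 4 on the team board",
    baseline_start=date(2024, 1, 8),
    baseline_end=date(2024, 3, 4),
    watch_metric="cycle time P85 (days)",
    predicted_from=18.0,
    predicted_to=12.0,
    deadline=date(2024, 4, 29),  # eight weeks after the change
)
```

The `frozen=True` flag is the point: the baseline window and prediction physically cannot be adjusted after the comparison starts looking inconvenient.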
Signal lag by intervention type
How long before the signal registers depends on what you changed.
WIP limit changes typically register within two to four weeks. The effect on cycle time is direct and fast-acting: items that were waiting behind a full queue move sooner. A few completed sprints are enough to see the shift.
Capacity changes take longer. Adding a reviewer, a tester, or a deployer changes throughput, but the new hire needs time to onboard, the queue needs time to drain, and the signal needs time to accumulate. Four to eight weeks is the minimum observation window.
Process redesign is the slowest. Changing how work flows through the system takes time to stabilise, time for teams to adapt, and time for the new steady state to become visible in the data. Eight to twelve weeks before drawing conclusions.
Patience is not indefinite. It is calibrated to what you changed.
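These ranges can be fixed as minimum watch windows the day the intervention starts. A small sketch; the type names are illustrative, and taking the upper end of each range is a conservative assumption rather than a prescription:

```python
from datetime import date, timedelta

# Minimum watch windows drawn from the ranges above,
# taking the upper end of each range to be conservative.
WATCH_WEEKS = {
    "wip_limit": 4,
    "capacity": 8,
    "process_redesign": 12,
}

def earliest_verdict_date(intervention_type: str, started: date) -> date:
    """Earliest date at which a verdict may honestly be called."""
    return started + timedelta(weeks=WATCH_WEEKS[intervention_type])

print(earliest_verdict_date("wip_limit", date(2024, 3, 4)))  # 2024-04-01
```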
Calling it
At the end of the watch window, one of three verdicts is possible.
Confirmed: The watch metric moved in the predicted direction by approximately the predicted magnitude. The intervention worked. Document what you did, update the system, consider whether the intervention should become policy.
Refuted: The watch metric did not move, or moved in the wrong direction. The intervention did not produce the expected effect. This is not failure — it is information. The system is telling you something. Investigate the prediction: was the diagnosis wrong? Was the intervention applied correctly? Is there a confounding factor?
Inconclusive: The watch metric moved but not enough to distinguish from noise, or the data quality was insufficient. Do not declare success on inconclusive results. Either extend the window if the intervention is still in place, or accept that this test could not produce a verdict and design a cleaner test next time.
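One way to make the three verdicts mechanical is to pre-commit a noise band (say, the baseline window's typical week-to-week spread) alongside the prediction. The sketch below is one reading of the rules above, not the article's prescription: the noise band, the 50% magnitude threshold for "approximately the predicted magnitude", and the choice to treat within-noise movement as inconclusive rather than refuted are all judgment calls to settle before the window opens.

```python
def call_verdict(predicted_from: float, predicted_to: float,
                 observed: float, noise_band: float) -> str:
    """Verdict on the watch metric at the end of the watch window."""
    predicted_move = predicted_to - predicted_from
    observed_move = observed - predicted_from

    if abs(observed_move) <= noise_band:
        # Moved, but not enough to distinguish from noise.
        return "inconclusive"
    if observed_move * predicted_move < 0:
        # Clear movement in the wrong direction.
        return "refuted"
    if abs(observed_move) >= 0.5 * abs(predicted_move):
        # Right direction, roughly the predicted magnitude.
        return "confirmed"
    # Right direction, but well short of the prediction.
    return "inconclusive"

# Prediction: cycle time P85 drops from 18 to 12 days. Noise band: 2 days.
print(call_verdict(18, 12, 11.5, 2.0))  # confirmed
print(call_verdict(18, 12, 17.5, 2.0))  # inconclusive: within noise
print(call_verdict(18, 12, 21.0, 2.0))  # refuted: wrong direction
```

Whatever thresholds you pick, the discipline is the same as with the prediction itself: decide them before looking at the post-intervention data, not after.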
Insight
An intervention without a written prediction isn't a test. It's a guess with a retrospective.
Most teams skip the prediction because committing to a specific outcome feels risky. If the prediction is wrong, the intervention looks like a failure. But a prediction that can't be wrong is not measuring anything. The discomfort of the written prediction is precisely what makes it useful — it forces a genuine hypothesis rather than a vague hope.
The retrospective didn't reveal whether it worked. It revealed how the team felt about the decision to try it.
The teams that get better at delivery over time are not the ones that try the most interventions. They are the ones that learn from each one. Learning requires a prediction. Without a prediction, the data is just numbers and the retrospective is just a conversation.