Flow & Measurement

Your Improvement Worked. Or Did It?

24 Sept 2025 · 8 min

The team introduced a WIP limit six weeks ago. Cycle time improved. The retrospective declared it a success.

Two months later cycle time is back where it was. There is a disagreement in the room about whether the limit ever worked. Nobody wrote down what they expected to happen, so there is no way to know.

This is the default state of delivery improvement: interventions tried, outcomes felt, declarations made. No measurement discipline. No way to learn.

Why "it feels better" fails

Delivery improvement rarely happens in a controlled environment. Teams change. Demand shifts. Seasons affect throughput. A single quarter might include a holiday slowdown, a reorg, a new product push, and a major incident.

In this environment, gut feel is unreliable for four specific reasons.

Regression to the mean. If you introduce an intervention during an unusually bad period, almost anything will look like it worked — the system was going to improve regardless. The intervention gets the credit for mean reversion.

Observer effect. Teams who know an intervention is being evaluated often behave differently during the evaluation period. The changed behaviour is the cause of improvement, not the intervention itself.

Seasonal and demand effects. Cycle time in the two weeks before a release is different from cycle time mid-sprint. Comparing across these periods produces noise, not signal.

Placebo. When teams believe a change will help, they report that it helped. Retrospectives are conversations about how people feel, not measurements of what happened.

None of these failure modes are signs of a bad team. They are the natural conditions of knowledge work. The only way to separate them from genuine intervention effect is to decide what you will measure before you intervene.
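A quick simulation makes the first of these concrete. The data below is entirely synthetic, just noise around a stable mean, yet "intervening" right after the worst stretch still looks like an improvement.

```python
import random

random.seed(7)

# Synthetic weekly cycle-time values from a system that is NOT changing:
# pure noise around a stable mean of 15 days.
weeks = [random.gauss(15, 3) for _ in range(52)]

# "Intervene" right after the worst four-week stretch, the way a team
# usually reaches for a fix when things hurt most.
worst_start = max(range(len(weeks) - 8), key=lambda i: sum(weeks[i:i + 4]))
before = weeks[worst_start:worst_start + 4]
after = weeks[worst_start + 4:worst_start + 8]

print(f"four weeks before the 'intervention': {sum(before) / 4:.1f} days")
print(f"four weeks after the 'intervention':  {sum(after) / 4:.1f} days")
# The second number is almost always lower even though nothing changed:
# regression to the mean, not an intervention effect.
```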

[Chart: cycle time by week (weeks 0–16, 0–24 days), showing the baseline mean, the frozen baseline window, the intervention point, the watch window, and the predicted target of ≤12 days.]
A valid intervention test: frozen baseline establishes the comparison point, the intervention is marked explicitly, and the watch window measures what actually changed.

The three pre-commitments

A valid intervention test requires three decisions made before the intervention runs — not after.

A frozen baseline window. Identify the time period before the intervention that will serve as your comparison. It should be long enough to contain meaningful variation: at minimum eight to twelve data points. Do not extend or adjust it after the intervention based on how the comparison looks.

A written prediction. Write down what you expect to happen, in measurable terms: "Cycle time P85 will drop from 18 days to 12 days within eight weeks." Not "we expect improvement." Specific metric. Specific direction. Specific magnitude. Specific time window. A prediction you couldn't be wrong about is not a prediction.

A single designated watch metric. One primary metric that will carry the decision. Secondary metrics can inform, but the call is made on the primary. This prevents post-hoc reinterpretation — discovering after the fact that the metric that improved is the one that mattered after all.

These three commitments are not bureaucracy. They are the minimum required to learn anything from the intervention.
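If you record these tests anywhere more durable than a retrospective note, the three commitments fit in one small, frozen record written before the intervention starts. The structure and field names below are a sketch with invented example values, not a prescribed format.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)  # frozen: the record cannot be edited after the fact
class InterventionTest:
    intervention: str
    baseline_start: date        # frozen baseline window (8-12+ data points)
    baseline_end: date
    watch_metric: str           # the single metric that carries the decision
    predicted_value: float      # specific magnitude...
    predicted_by: date          # ...within a specific time window
    secondary_metrics: tuple[str, ...] = ()  # may inform, never decide


# Example values are illustrative only.
test = InterventionTest(
    intervention="WIP limit of 4 per column",
    baseline_start=date(2025, 7, 1),
    baseline_end=date(2025, 8, 26),
    watch_metric="cycle_time_p85_days",
    predicted_value=12.0,       # "P85 drops from 18 days to 12 days"
    predicted_by=date(2025, 10, 21),
    secondary_metrics=("throughput_per_week",),
)
```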

Signal lag by intervention type

How long it takes for the signal to register depends on what you changed.

[Chart: weeks before the signal registers in cycle time data. WIP limit: 2–4 weeks; capacity add: 4–8 weeks; tool adoption: 6–10 weeks; process redesign: 8–12 weeks.]
Signal lag varies by intervention type. A WIP limit shows up in 2–4 weeks. A process redesign requires 8–12. Patience is calibrated, not indefinite.

WIP limit changes typically register within two to four weeks. The effect on cycle time is direct and fast-acting: items that were waiting behind a full queue move sooner. A few completed sprints are enough to see the shift.

Capacity changes take longer. Adding a reviewer, a tester, or a deployer changes throughput, but the new hire needs time to onboard, the queue needs time to drain, and the signal needs time to accumulate. Four to eight weeks is the minimum observation window.

Process redesign is the slowest. Changing how work flows through the system takes time to stabilise, time for teams to adapt, and time for the new steady state to become visible in the data. Eight to twelve weeks before drawing conclusions.

Patience is not indefinite. It is calibrated to what you changed.
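In tooling, the lag table can be data rather than judgment, so the watch window falls out of the intervention type. The ranges below mirror the chart above; the function and names are illustrative.

```python
from datetime import date, timedelta

# Minimum and maximum weeks before the signal registers, by intervention type
# (ranges taken from the chart above).
SIGNAL_LAG_WEEKS = {
    "wip_limit": (2, 4),
    "capacity_add": (4, 8),
    "tool_adoption": (6, 10),
    "process_redesign": (8, 12),
}


def watch_window(intervention_type: str, start: date) -> tuple[date, date]:
    """Earliest date worth looking at the data, and the date to call the verdict."""
    low, high = SIGNAL_LAG_WEEKS[intervention_type]
    return start + timedelta(weeks=low), start + timedelta(weeks=high)


earliest, verdict_due = watch_window("wip_limit", date(2025, 9, 2))
print(f"don't judge before {earliest}, call it by {verdict_due}")
```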

Calling it

At the end of the watch window, one of three verdicts is possible.

Confirmed: The watch metric moved in the predicted direction by approximately the predicted magnitude. The intervention worked. Document what you did, update the system, consider whether the intervention should become policy.

Refuted: The watch metric did not move, or moved in the wrong direction. The intervention did not produce the expected effect. This is not failure — it is information. The system is telling you something. Investigate the prediction: was the diagnosis wrong? Was the intervention applied correctly? Is there a confounding factor?

Inconclusive: The watch metric moved but not enough to distinguish from noise, or the data quality was insufficient. Do not declare success on inconclusive results. Either extend the window if the intervention is still in place, or accept that this test could not produce a verdict and design a cleaner test next time.

[Diagram: Confirmed (metric moved as predicted), Refuted (metric moved in the wrong direction), Inconclusive (signal indistinguishable from noise).]
Three possible verdicts after the watch window closes. Only confirmed or refuted produce learning. Inconclusive requires a cleaner test next time — not a declaration either way.
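Once the prediction is written down, the verdict can be close to mechanical. The sketch below treats a move smaller than the baseline's own spread as noise; that threshold, and the helper around it, are illustrative assumptions rather than a formal statistical test.

```python
import statistics


def p85(samples: list[float]) -> float:
    """85th-percentile cycle time, the metric the prediction was written against."""
    return statistics.quantiles(samples, n=100)[84]


def verdict(baseline: list[float], watch: list[float], predicted: float) -> str:
    """Confirmed / refuted / inconclusive, judged against the written prediction.

    A move smaller than the baseline's standard deviation is treated as noise,
    a crude threshold chosen for illustration only.
    """
    noise = statistics.stdev(baseline)
    moved = p85(baseline) - p85(watch)   # positive means cycle time fell
    if abs(moved) < noise:
        return "inconclusive"
    if p85(watch) <= predicted:
        return "confirmed"
    return "refuted"


# Illustrative numbers only; the written prediction was P85 <= 12 days.
baseline_days = [14, 16, 19, 18, 21, 15, 17, 20, 16, 18, 19, 17]
watch_days = [9, 10, 11, 10, 11, 9, 12, 10]
print(verdict(baseline_days, watch_days, predicted=12.0))
```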

Insight

An intervention without a written prediction isn't a test. It's a guess with a retrospective.

Most teams skip the prediction because committing to a specific outcome feels risky. If the prediction is wrong, the intervention looks like a failure. But a prediction that can't be wrong is not measuring anything. The discomfort of the written prediction is precisely what makes it useful — it forces a genuine hypothesis rather than a vague hope.

The retrospective didn't reveal whether it worked. It revealed how the team felt about the decision to try it.

The teams that get better at delivery over time are not the ones that try the most interventions. They are the ones that learn from each one. Learning requires a prediction. Without a prediction, the data is just numbers and the retrospective is just a conversation.