Measure the System, Not the People: a Case for SDMs

A few years ago I took over the agile practice of a global engineering structure, a direct-to-consumer operation with an aggressive roadmap for the year and no way to answer the most basic management question: are we keeping up? This essay is about how I mapped that structure, designed four flow KPIs (Workload, Team Overhead, Productivity, and Efficiency), and implemented the process that kept them alive. It’s a field report for Service Delivery Managers, enterprise agilists, and any manager or specialist who needs to build visibility where today there is only opinion - and to land practices in a structure that resists them.

The SDM, or Service Delivery Manager, is the Kanban Method role accountable for the health of the service a team delivers: it tends to the flow, the risk, and the cadences, and it measures the production system rather than the people who run it. That’s the seat I’m writing this case from.

The scenario: everyone “used Kanban”

There were about eight product teams, organized into value streams in a service-oriented topology, running a global e-commerce platform. Ask any of them what their work method was and the answer came without hesitation: Kanban. In practice, the teams worked without a method: what existed was a board with columns. No explicit policies, no WIP limits, no metrics, no review cadences. Calling that Kanban, full stop, is a fallacy that deserves its own essay; for now, what matters is what it cost.

The cost was visibility. Nobody could say whether the delivery pace was good or slow, whether the teams’ capacity could carry the roadmap, whether it made sense to ask for more people or spin up more teams. Conversations with management were perception contests: whoever told the better story walked away with the priority.

And there was a question almost nobody was asking: are the teams overloaded? That concern was mine and the developers’ - nobody else’s. But if you manage delivery, you can’t treat overload as sentimentality. A saturated system delivers less, fails more, and burns out the very people you’ll need next quarter. In a structure with that many teams, trusting that everything was being well prioritized, well understood, and well built was a luxury we couldn’t afford.

Before the metrics, a principle

I could have started by plugging a tool into Jira and dumping dashboards on the teams. It’s the most common mistake, and the assessment I ran convinced me not to make it. The design decision that came before any formula was this: measure the system, not the people.

Almost everything follows from that:

  • No story points. Velocity measures an internal estimation agreement, not delivery capability. I wanted flow metrics: counts and times the board itself produces. Less playful agility, more enterprise agility: adaptive, actionable, objective.
  • Workload carrying the same weight as result metrics. Chasing productivity and efficiency without watching the health of the production system is not sustainable. If workload isn’t on the same panel, the panel becomes a whip.
  • Each team compares against its own history. Every production system has its own characteristics; comparing team A with team B is hallway statistics. The first layer of reading is always the team’s context; then the value stream, then the structure.

Four KPIs engineering teams understand

The four indicators were calculated per closed month, team by team, from Jira statuses, with a flow analytics tool doing the collection. The output was never a naked number: it was a classification into ranges that any manager could read at a glance and any developer could explain.

One caveat before the formulas: the ranges below were calibrated for that context, by watching those teams’ history. Copy the design, not the numbers.

1. Workload - is the system too full?

It measures how much work is in the team’s hands relative to the team’s size. The calculation uses only the coding statuses (from “in development” to “ready for build”), because the point of view here is the development team.

WIP/Dev = average WIP in the period ÷ number of devs on the team

Classification WIP/Dev
Very Light < 0.5
Light 0.5 to 0.8
Optimal 0.8 to 1.5
Heavy 1.5 to 1.9
Very Heavy > 1.9

The optimal range sits close to one item per person - and that’s no coincidence. Above it, effort spreads across parallel threads, the sense of priority dissolves, and firefighting becomes routine. Workload was the KPI that turned overload from impression into data: instead of “the team seems underwater,” we could say “this team has been Very Heavy for two months.”

How this number came out of Jira (detail - skip if you like)
The analytics tool sampled, day by day, how many items were in the coding statuses (from “in development” to “ready for build”) and took the average over the closed month: that’s the average WIP. The divisor was the number of devs who actually worked in the period, not the nominal headcount - anyone on vacation or leave dropped out of the count, otherwise the team looked roomier than it was. It all came straight from Jira statuses, with no one filling in a spreadsheet by hand.
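A minimal sketch of that calculation, assuming the daily WIP counts have already been exported from the analytics tool as a plain list. The function names are hypothetical, and the boundary handling (a value sitting exactly on a shared edge, such as 0.8, goes to the higher range) is my assumption, since the table leaves the edges open.

```python
from statistics import mean


def wip_per_dev(daily_wip: list[int], active_devs: int) -> float:
    """Average of the daily WIP samples divided by the devs who actually worked."""
    if not daily_wip or active_devs <= 0:
        raise ValueError("need at least one daily sample and one active dev")
    return mean(daily_wip) / active_devs


def classify_workload(ratio: float) -> str:
    """Map WIP/Dev onto the Workload ranges (edge handling is an assumption)."""
    if ratio < 0.5:
        return "Very Light"
    if ratio < 0.8:
        return "Light"
    if ratio < 1.5:
        return "Optimal"
    if ratio <= 1.9:
        return "Heavy"
    return "Very Heavy"


# Hypothetical month: 20 daily samples, 5 nominal devs, 1 of them on leave.
samples = [6] * 10 + [8] * 10
ratio = wip_per_dev(samples, active_devs=4)
print(ratio, classify_workload(ratio))
```

With the nominal headcount of 5, the same month would come out at 1.4 and read as Optimal. That gap is what the divisor rule is there to prevent.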

2. Team Overhead - how much work waits after the code?

Here I called overhead everything that had left the coder’s hands but hadn’t reached production yet: build queues, integrated testing, approvals, UAT. The mechanics are the same as Workload, swapping the statuses observed.

PC WIP/Dev = average post-coding WIP in the period ÷ number of devs

Classification PC WIP/Dev
Very Low < 1.0
Low 1.0 to 1.5
Medium 1.5 to 3.0
High 3.0 to 5.0
Very High > 5.0

The lower, the better. And high overhead is rarely the development team’s fault: it exposes downstream bottlenecks (environments, approvals, deploy windows) and it anticipates rework and risky deploys. This was the KPI that pulled discussions out of “the team is slow” and put them where they belonged: in the design of the flow after the code.

How this number came out of Jira (detail - skip if you like)
Same mechanics as Workload, changing only the window of statuses: here the daily average counted the items already out of coding and not yet in production - build queues, integrated testing, approvals, UAT. The divisor stays the same as Workload’s - the devs who actually worked in the period - on purpose: overhead is read against the capacity of those who feed the pipeline, so it stays comparable to Workload on the same panel.
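The day-by-day sampling itself can be sketched from status intervals. This is an illustration, not the analytics tool's actual logic: the record shape (entry date into post-coding, arrival date in production), the choice to sample every calendar day, and the convention that an item stops counting on the day it reaches production are all assumptions.

```python
from datetime import date, timedelta
from typing import Optional

# Hypothetical record: (entered_post_coding, reached_production).
# reached_production is None while the item still waits downstream.
Item = tuple[date, Optional[date]]


def average_post_coding_wip(items: list[Item], first_day: date, last_day: date) -> float:
    """Count the items sitting in post-coding on each day, then average."""
    days = (last_day - first_day).days + 1
    total = 0
    for offset in range(days):
        day = first_day + timedelta(days=offset)
        total += sum(
            1
            for entered, left in items
            if entered <= day and (left is None or day < left)
        )
    return total / days


def classify_overhead(ratio: float) -> str:
    """Map PC WIP/Dev onto the Team Overhead ranges (edge handling assumed)."""
    if ratio < 1.0:
        return "Very Low"
    if ratio < 1.5:
        return "Low"
    if ratio < 3.0:
        return "Medium"
    if ratio <= 5.0:
        return "High"
    return "Very High"


# Hypothetical June: 8 items wait all month, 4 go to production on the 16th.
items = [(date(2024, 5, 20), None)] * 8 + [(date(2024, 5, 28), date(2024, 6, 16))] * 4
avg_wip = average_post_coding_wip(items, date(2024, 6, 1), date(2024, 6, 30))
print(avg_wip, classify_overhead(avg_wip / 4))  # 4 devs active in the period
```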

This KPI wasn’t born from a catalog. It was born from a behavior I watched every month: when demands entered the UAT queue, the teams dove into new work and underestimated the chance of the “almost done” items coming back - every now and then there were adjustments to make, the so-called defects. In their estimates of effort and work concurrency, devs had a bias toward ignoring exactly what they themselves had put into the UAT queue. It was essential to stop starting and start finishing. But we also couldn’t afford to leave developers idle, waiting for the full testing of their items in UAT; we were between a rock and a hard place. The golden rule we agreed on: whatever sits furthest to the right of the board takes priority, because WIP had to shrink whenever possible.

There was an aggravating factor: deployments were dammed up into releases. And for me, work is only done when it’s in production, working properly. While it waits for the release, an item carries rework potential - and the whole organization had a huge bias toward ignoring that. When that rework showed up, it landed as hidden work undermining productivity: a thankless load, because the planned items ran late on its account, yet important and necessary. I built two kinds of awareness the structure didn’t have: that the items in a release can still generate work, and that this stage of the software lifecycle cannot go without management or visibility. Team Overhead was the way to put a number on it.

3. Productivity - compared to whom? To the team itself

The month’s throughput compared with the previous month’s. That simple, on purpose: the index doesn’t compare teams with each other, it compares a team with its own record.

Prod Index = month’s TP ÷ previous month’s TP

Classification Prod Index
High Increase > 1.5
Increase 1.2 to 1.5
Stable 0.8 to 1.2
Decrease 0.5 to 0.8
High Decrease < 0.5

Context defines what good looks like. A newly formed team should alternate between Stable and Increase: it’s building rhythm. A mature team with low turnover tends toward Stable, and that’s fine: stability is what makes forecasting possible. One isolated Decrease is a conversation; three in a row are a diagnosis.

How this number came out of Jira (detail - skip if you like)
TP (throughput) was the count of items that crossed into “done” within the closed month - pure delivery, no story points. The index is just the ratio between the month’s TP and the previous month’s: a team that delivered 18 items after 15 lands at 1.2 (Increase). Canceled items don’t enter the count; a reopened item only counts again when it’s redelivered.
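As a sketch, the index and its reading fit in a few lines. The function names are hypothetical. Two choices are my assumptions: a value exactly on a shared edge goes to the higher range (which matches the 18-after-15 example landing in Increase), and a High Decrease counts toward a run of decreases.

```python
def prod_index(tp_month: int, tp_previous: int) -> float:
    """Month's throughput divided by the previous month's."""
    if tp_previous <= 0:
        raise ValueError("previous month's throughput must be positive")
    return tp_month / tp_previous


def classify_productivity(index: float) -> str:
    """Map the Prod Index onto its ranges (edge handling is an assumption)."""
    if index < 0.5:
        return "High Decrease"
    if index < 0.8:
        return "Decrease"
    if index < 1.2:
        return "Stable"
    if index <= 1.5:
        return "Increase"
    return "High Increase"


def decrease_streak(monthly_tp: list[int]) -> int:
    """Count consecutive declining months, ending at the latest month."""
    streak = 0
    for i in range(len(monthly_tp) - 1, 0, -1):
        label = classify_productivity(prod_index(monthly_tp[i], monthly_tp[i - 1]))
        if label not in ("Decrease", "High Decrease"):
            break
        streak += 1
    return streak


print(classify_productivity(prod_index(18, 15)))  # the example from the text
print(decrease_streak([20, 15, 11, 8]))  # hypothetical four-month history
```

A streak of three is the point where, in the terms used above, the conversation becomes a diagnosis.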

This design didn’t pass without a fight. The pressure was continuous, coming from the organization’s general governance, from stakeholders, and from my own leadership: standardized delivery rulers that all teams, all of them, should follow and be measured by. But I knew what set each team apart, from the basics to the structural - headcount, stack, frontend or backend, whether it was a platform team or an orchestrator team highly dependent on teams outside our structure, whether it owned an engine and depended on nobody. I argued, counter-argued, swam against the tide, brought data and narrative straight from the trenches. A single ruler would buy a false sense of comfort through conformity: the appearance of implemented standardization, producing vanity metrics, not actionable ones.

To give you a sense of how absurd it would have been: I had an orchestrator team with a cycle time around 23 to 30 days, and a frontend team, owning web components, with a cycle time of 3 to 5 days. Could the two be measured by the same ruler? Of course not. The alternative I proposed was the historical verification of each team through its own metrics, with a collection of flow and delivery data I had been preparing since the previous year. With it, team profiles were properly mapped, and evolution could be proposed team by team - based on history and on the narratives from the SDRs (Service Delivery Reviews), where the people involved surfaced the learnings: what came easy and what came hard.

4. Efficiency - how much throughput is eaten by problems?

Quality is a factor of efficiency: every pause to fix something is capacity that didn’t become delivery. The index relates the period’s throughput (TP: items delivered in the month) to the problems that emerged in it, split into two types: bugs (problems from items already delivered, coming back to the board) and defects (sub-bugs of items still in development).

Each type becomes a ratio: how many net problems showed up per item delivered in the month. Net because I subtract the ones canceled along the way - opened by mistake, duplicated, or invalid. They’re two sibling numbers, one for bugs and one for defects, each divided by throughput:

TP/NBR (bugs) = (new bugs - canceled bugs) ÷ TP

TP/NDR (defects) = (new defects - canceled defects) ÷ TP

Ineff Index = (TP/NDR × 2 + TP/NBR) ÷ 3

The names came from the dashboard and mislead at first glance: despite the “TP” up front, each calculation divides the net problems by throughput, not the other way around. Read each ratio as “net bugs (or defects) per item delivered.” If a team closed 20 deliveries and accumulated 4 net bugs in the month, its TP/NBR is 0.2 - one bug for every five items, or 20%. The Ineff Index just merges the two ratios into a single score, giving defects double weight.

Classification Ineff Index
High Efficiency < 20%
Normal 20% to 40%
Low Efficiency > 40%

Defects carry two thirds of the index on purpose: they speak about the flow right now, while bugs usually speak about past deliveries. A team with a high Ineff Index is accelerating on a slippery surface: plenty of effort, little traction.

How this number came out of Jira (detail - skip if you like)
Bugs and defects were distinct types in Jira, so each could be summed per month and have its cancellations subtracted without mixing the counts. TP is the same throughput as in Productivity. Both ratios came straight from that count of monthly events - no daily sampling here, unlike Workload.
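A minimal sketch of the two ratios and the weighted index, assuming the monthly counts are already in hand. The function names and the defect numbers in the example are hypothetical; the bug numbers reproduce the 4-net-bugs-over-20-deliveries case from the text. Treating exactly 20% and 40% as Normal is my assumption.

```python
def net_ratio(new: int, canceled: int, throughput: int) -> float:
    """Net problems (new minus canceled) per item delivered in the month."""
    if throughput <= 0:
        raise ValueError("throughput must be positive")
    return (new - canceled) / throughput


def ineff_index(defect_ratio: float, bug_ratio: float) -> float:
    """Weighted mean of the two ratios: defects count twice, bugs once."""
    return (defect_ratio * 2 + bug_ratio) / 3


def classify_efficiency(index: float) -> str:
    """Map the Ineff Index onto its ranges (edge handling is an assumption)."""
    if index < 0.20:
        return "High Efficiency"
    if index <= 0.40:
        return "Normal"
    return "Low Efficiency"


tp = 20  # items delivered in the month
bug_ratio = net_ratio(new=5, canceled=1, throughput=tp)     # 4 net bugs
defect_ratio = net_ratio(new=9, canceled=1, throughput=tp)  # 8 net defects
index = ineff_index(defect_ratio, bug_ratio)
print(f"{index:.0%}", classify_efficiency(index))
```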

Here too I swam against the tide. Governance wanted to measure efficiency by counting bugs, and that was it: how many bugs per feature, how many bugs per team. I stepped back and questioned the concepts: what does it mean to be efficient? It isn’t speed - it’s executing the best way, without unnecessary effort or cost. If the development pipeline flowed well, with demands properly documented and no ambiguity about what was expected, devs and QAs would produce more, with higher quality. That’s why I doubled down on the culture of logging defects. When the QA just told the dev what to fix, the insight was lost: nobody got to understand what was usually a problem of development quality, therefore of flow, therefore of efficiency. Defects put operational cost and friction into the process; bringing their causes and their statistics to the surface was crucial.

The effect showed up in the SDRs, the monthly reviews I’ll get to shortly: we started discussing the defects behind each demand, drawing learnings, and we stopped suffering from certain root causes (or, at the very least, mitigated them a lot).

The reading is compositional

None of these KPIs says much on its own. Workload Heavy + Productivity Decrease suggests saturation: the system is full and that’s why it delivers less; the answer is limiting WIP, not demanding pace. Workload Optimal + Productivity Decrease calls for a different conversation: something changed in the team or in the demand. Productivity Increase + Efficiency Low is the classic false positive: throughput went up by pushing problems forward. Whoever reads one indicator in isolation and acts on it tends to make the system worse while claiming to manage it.
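The three combinations named above can be written down as a small lookup. It is only an illustration: the function name is hypothetical, and grouping Very Heavy with Heavy, High Decrease with Decrease, and High Increase with Increase is my assumption. Anything outside the named patterns falls back to the team's context rather than to a verdict.

```python
def read_composition(workload: str, productivity: str, efficiency: str) -> str:
    """Return the reading for the KPI combinations described in the essay."""
    decreasing = productivity in {"Decrease", "High Decrease"}
    increasing = productivity in {"Increase", "High Increase"}

    if workload in {"Heavy", "Very Heavy"} and decreasing:
        return "saturation: limit WIP instead of demanding pace"
    if workload == "Optimal" and decreasing:
        return "something changed in the team or in the demand"
    if increasing and efficiency == "Low Efficiency":
        return "false positive: throughput rose by pushing problems forward"
    return "no named pattern: read the team's context first"


print(read_composition("Heavy", "Decrease", "Normal"))
print(read_composition("Optimal", "Increase", "Low Efficiency"))
```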

The implementation is the product

If I had merely published these formulas on a wiki, this essay wouldn’t exist. What made the KPIs work was everything that came before and around them.

A real assessment, with every role in the flow. Before implementing anything, I sat down with the people living each stage of the work, upstream to downstream: project managers, product managers, solution architects, engineering managers, developers, leadership. I mapped how demand was born, where it traveled, where it waited. Every metric was adjusted and agreed upon with those people before it existed on any dashboard. An imposed metric becomes theater: it gets ignored, or it gets gamed.

Training as a track, not a talk. I built a training track with three steps: flow management and the Kanban Method for everyone (the basics nobody had); predictability through forecasting with actionable metrics; and deeper modules for the structure’s managers, covering the metrics and predictability practices in detail. Everyone knew which metrics existed, how they were collected, and how to interpret them, directly or in composition. People fear the metric they don’t understand.

Cadence: the Service Delivery Review. I implemented recurring SDRs, the Kanban cadence that looks at the health of the service a team provides. Always over the closed month, with data prepared before the meeting: cycle time at the 85th percentile (total, pre-development, and development), WIP aging, flow efficiency, arrival rate against delivery rate. The team looked at the real items from that period and discussed how it had been working. The metric was never the star of the meeting; it was the starting point of the conversation.
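For readers who have not computed a percentile cycle time before, here is a sketch using the nearest-rank method. The method is an assumption on my part: flow tools differ in how they interpolate, and the essay does not say which one was used. The function name and the sample data are hypothetical.

```python
import math


def percentile_nearest_rank(values: list[float], percent: int) -> float:
    """Smallest value such that at least `percent`% of the values are <= it."""
    if not values:
        raise ValueError("need at least one cycle time")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(percent * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical cycle times, in days, for the items delivered in a closed month.
cycle_times_days = [2, 3, 3, 4, 5, 5, 6, 8, 13, 21]
p85 = percentile_nearest_rank(cycle_times_days, 85)
print(p85)
```

Here the result reads as: at least 85% of the month's items were delivered in 13 days or less, which is a steadier basis for a forecast than the average.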

And I insisted that the SDR be an agenda for the whole squad. Before, retrospectives belonged to the dev team alone, and they followed no method: the three little columns of “what went well, what went wrong, action plan,” without ever looking at the work. The so-called hidden defects, when they surfaced in some dev’s account, reached the designer, the QA, or the PM through a game of telephone. Everyone could and should be in that inspection-and-adaptation agenda, hearing it firsthand, without noise, understanding the pains of the production floor and where the processes needed to improve. That was a mindset I worked hard to grow into the structure’s culture. And I got there.

A lean agility team. I led a small group of agilists covering the teams with a minimum-practices plan: recurring SDRs and a steady eye on flow health. There was no agilist per squad, and none was missed: with a clear method and agreed-upon metrics, few people can sustain a lot of structure. The Kanban Maturity Model served as a map for depth decisions; the process was customized for that structure’s challenges, not transplanted from a book.

What an SDM can take from this

If I had to reduce this experience to a reusable sequence:

  1. Map before you measure. Assessment with every role in the flow, upstream to downstream. The KPI design comes from there, not from the benchmark of the season.
  2. Agree before you deploy. An agreed metric generates conversation; an imposed metric generates defense.
  3. Measure flow, not people. WIP, throughput, times, problems. No individual indicators: the system is the unit of management.
  4. Classify into ranges, not numeric targets. “Optimal” and “Heavy” guide decisions; “WIP/Dev 1.47” becomes a target to be gamed.
  5. Read in composition. One KPI alone lies easily; four together tell a story.
  6. Train the interpreters. The training track is worth as much as the formula.
  7. Give it cadence. Without a recurring SDR, a metric is a photo on the wall. With one, it’s a navigation instrument.

The most valuable result never showed up on any dashboard: the conversations changed. Asking for more people stopped being arm wrestling and became capacity analysis. Overload stopped being a complaint and became a red range on a panel management actually followed. We didn’t walk out of there with perfect teams - we walked out with a structure able to see itself. For anyone managing delivery, that’s the first KPI that matters.

And if you’re the SDM or the specialist trying to land something like this in your structure, use this case without ceremony. Rigid leadership, with little will to understand before prescribing, rarely softens with theory; it softens with data, with narrative straight from the trenches, and with a path that has already worked somewhere.