Velocity and Story Points - They don't add up!


August 22nd 2012


I love Agile's idea of velocity in theory - that after accumulating a few weeks/months of data, a team can derive the average number of story points it can implement per sprint, and then use this as a basis for knowing both how much it can commit to in the next sprint (short term) and also when the project will be finished (long term).

In practice, however, I believe any calculation of velocity (based in story points) is doomed to be dangerously inaccurate and misleading for either the short or long term. Here's why...

The first problem is that story points are not additive. Mike Cohn touches on (but does not fully address) this in a recent post. Using his example, imagine a team that uses story point buckets of 1, 2, 3, 5, and 8. Further, consider the median number of hours to complete user stories of each size:

Story Points    Median Hours
1               21
2               52
3               64
5               100
8               111

Now assume your team's velocity is determined to be 16 points. It would seem that you could pluck any combination of user stories off the backlog (according to the business owner's prioritization, of course!), and be fairly confident that you could complete this work so long as the stories all sum to 16 points. It's easy to illustrate, however, why this doesn't work. For example, take just two different combinations of stories:

Story Combination               Actual Hours
8, 8 = 16                       111 + 111 = 222
5, 3, 2, 2, 2, 1, 1 = 16        100 + 64 + 52 + 52 + 52 + 21 + 21 = 362

Note that although both combinations sum to 16 story points, the actual hours differ significantly: 222 hours for the first combination of stories and 362 for the second. That 140-hour gap is not a negligible amount! It could very realistically cause a delay in a release (or alternatively, a copious amount of slack time for the development team). Either way, something is wrong.
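For anyone who wants to try this with their own numbers, here is a minimal Python sketch of the arithmetic. The median hours are copied from the table above; the function and variable names are just my own choices for illustration.

```python
# Median hours per story-point bucket, copied from the table above.
MEDIAN_HOURS = {1: 21, 2: 52, 3: 64, 5: 100, 8: 111}


def expected_hours(stories):
    """Sum the median hours for a list of story-point sizes."""
    return sum(MEDIAN_HOURS[points] for points in stories)


combo_a = [8, 8]
combo_b = [5, 3, 2, 2, 2, 1, 1]

for combo in (combo_a, combo_b):
    # Both combinations add up to 16 points, but not to the same hours.
    print(sum(combo), "points ->", expected_hours(combo), "hours")
```

Swapping in your own team's medians is a quick way to see how far apart two "equal" sprints can land.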

The root problem here is in the correspondence of story points to actual hours. To be able to calculate velocity, story points must preserve not only a relative ordering, but their numeric ratios as well. In other words, it must be true that two 1 point stories equal the same amount of effort as one 2 point story. Unfortunately, story points in practice are little more than ordinal values - e.g. we can rely on a 2 point story being bigger than a 1 point story, but not necessarily on a 2 pointer being twice as big as a 1 pointer.
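One rough way to check whether the ratios hold is to divide the median hours by the point value for each size. If story points behaved like real numbers, the result would be roughly constant. The sketch below uses the median hours from the example table; the helper name is hypothetical.

```python
# Median hours per story-point bucket, from the example table.
MEDIAN_HOURS = {1: 21, 2: 52, 3: 64, 5: 100, 8: 111}


def hours_per_point(median_hours):
    """Return the hours-per-point rate for each story size."""
    return {points: hours / points for points, hours in median_hours.items()}


rates = hours_per_point(MEDIAN_HOURS)
for points, rate in rates.items():
    print(f"{points}-point story: {rate:.1f} hours per point")

# Ratio between the most and least "expensive" point; 1.0 would mean
# the scale is perfectly additive.
spread = max(rates.values()) / min(rates.values())
print(f"spread: {spread:.2f}x")
```

With the example data, a point costs anywhere from about 14 hours (in an 8) to 26 hours (in a 2), so a "point" is clearly not a fixed unit of effort.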

In general, the worse this correspondence between story points and actual hours is on your project, the more unreliable your velocity will be. To ensure that the correspondence is consistent, therefore, a project manager must be vigilant about calculating these statistics (as in the table above) and then presenting them to the team, so that the team can best adjust their estimates (for example, using the data above, if the team wanted to keep its notion of a "1", then it would need to adjust down its assessment of a "2").
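As a minimal sketch of how a project manager might produce those statistics, assume the tracker can export completed stories as (story points, actual hours) pairs. The sample records below are invented purely for illustration, and the function name is hypothetical.

```python
from collections import defaultdict
from statistics import median

# Hypothetical completed stories as (story points, actual hours).
# These numbers are made up for illustration only.
completed = [
    (1, 18), (1, 21), (1, 30),
    (2, 40), (2, 52), (2, 61),
    (3, 64), (3, 70), (3, 55),
]


def median_hours_by_points(stories):
    """Group actual hours by story size and take the median of each group."""
    buckets = defaultdict(list)
    for points, hours in stories:
        buckets[points].append(hours)
    return {points: median(hours) for points, hours in sorted(buckets.items())}


medians = median_hours_by_points(completed)

# If the team keeps its current notion of a "1", these are the hours each
# larger size would need to represent for the ratios to hold.
targets = {points: points * medians[1] for points in medians}

for points in medians:
    print(f"{points} points: median {medians[points]}h, ratio-preserving target {targets[points]}h")
```

Showing the team the gap between the median and the target for each size is one way to prompt the kind of re-calibration described above.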

Assume then that the correspondence between story points and actual hours is perfect. There's still a second problem: teams very seldom track actual hours. Instead, "actuals", as in the example above, are most often derived from the hour-based estimates given to tasks during the planning phase. For example, a simple "Add order" user story might be broken into three tasks, and each of these tasks would be given an hour estimate prior to starting the sprint.

Summing all these tasks, the project manager would get the total number of hours for that story, which then becomes the starting point for the burn-down. The problem is, of course, that these hour estimates seldom match how long the tasks actually took, and so using these "actual hours" to measure the correspondence of story points to actual hours will be spurious. Again, velocity is chimerical.

Finally, a third problem, as a number of experts have pointed out, is that a team's assessment of story points adjusts over time. For example, what the team used to think of as a "2" a few months back may now be considered a "3". This type of adjustment is fine, again, if story points are only to be used as relative values, but if they are used to calculate velocity, it destroys any hope for accuracy, because it is simply not mathematically valid to add together values measured on different scales.

As an analogy, if half-way through your sophomore year your college switched from a 4 point scale to a 5 point scale, they obviously couldn't calculate your GPA by just adding all your grades together and dividing by the number of units. The scale changed! Again, the point is that because story points aren't additive, velocity just doesn't work.

In the end, I completely understand that story points and velocity were designed specifically to be low-ceremony estimates, and further that the purpose of agile is to spend more time delivering working software and less time on producing plans for delivering software. That's great. However, there seems to be a very pervasive and dogmatic belief in the robustness of velocity, and it's just not empirically warranted. The idea is great in concept, but it's based on a faulty premise: that story points can be used like numbers.

Personally, I severely doubt that velocity can be used reliably for long term forecasting, and definitely not for sprint planning. Just my thoughts - what do you think?

I'm an "old" programmer who has been blogging for almost 20 years now. In 2017, I started Highline Solutions, a consulting company that helps with software architecture and full-stack development. I have two degrees from Carnegie Mellon University, one practical (Information and Decision Systems) and one not so much (Philosophy - thesis here). Pittsburgh, PA is my home where I live with my wife and 3 energetic boys.
Comments (10)
Chris
August 24, 2012
Another beef I have with story points is that they don't track at all the value to the company. I can write down 4 SP for twiddling my thumbs or 3 for revamping a UI widget that makes deployments twice as fast and they are graded as the same thing.

Granted, a metric of "biz ROI / hours" is hard to come up with, but that's what really should be tracked.

Oh, and SP also miss it when you decrease the amount of time it takes to do something. People will naturally reduce their SP counts if a task is less painful to do. So improving the infrastructure won't show up as improved velocity.
Michael
August 30, 2012
I believe that story points serve as a "rough" estimate. In the teams I work with, story point estimates are made quickly (a few minutes to be sure we understand the story, then quickly discuss and reach a consensus estimate). They are quantized (must round off to some Fibonacci number) which means that any given estimate is necessarily imperfect.

As such, they provide a cheap (didn't take long to generate) but rough (not perfectly accurate) estimate, and they have to be respected as such. Story point estimates would not be useful to answer questions like "Will this project deliver in October or November?", but they ARE useful for questions like "Would this be a 3-month project or a 1 year project?" For some purposes, a more precise estimate is needed, and then it may be necessary to invest a few hours to a few weeks to perform detailed work to generate a more precise estimate. However, I think that such situations are rare: people *want* perfect estimates ahead of time but rarely *need* them. Also I think that people are usually fooling themselves: most (usually waterfall) projects with precise up-front estimates later discover that those estimates are not accurate.

One of the strengths of story points is that everyone (including the customer) REALIZES that they are rough and don't correspond to a precise delivery date -- something that can be difficult to explain for estimates expressed in hours.
Urs
September 15, 2012
Do you have a better estimation method? Better = better accuracy without much more cost

As Michael points out, we too are using our Story Point estimates to give us a feeling of how long it will take, and not as a basis to set a deadline.

Additionally, I think that changes of the scope due to feedback and things learnt while implementing have even more impact on how wrong the estimate is: customer: "the stuff you showed me in the Sprint Review led to a new idea, let's do this now instead of the old thing."
This happens quite a lot in my project and the result is that even if our estimates were accurate, they'd still be obsolete by then.

Therefore, we strictly work in order of priority (or order as it is called today) and make the best of the time we have.
Andrew
September 18, 2012
One other point that has been made is that the discrepancy between story point sizes and velocity is mitigated given a homogeneous sprint composition. Presuming that your average sprint does not vary wildly (e.g. four '13s' one sprint and twenty-six '2s' the next), then using historical data for sprint planning is still relevant. How typically this is the case, I don't know.

Changing the "story point" buckets midstream doesn't necessarily seem to be a viable solution, as it will almost certainly affect the group's psychological precedent for story sizing, which has hopefully converged to a common understanding given sufficient time. I think the lightweight solution may be "behind the scenes", where the proportionate historical time estimates for various story point sizes are used for planning (instead of the story points themselves). This should be pretty straightforward to pull off with minor mastery of Excel and a few extra minutes, and could potentially yield big accuracy gains. Also relevant is the increasing [magnitude of] standard deviations associated with larger stories, indicating that average-story-point-time-to-complete may not be the proper metric for relative sizing in terms of sprint planning (maybe some weighted average or something weighted towards the "high end").
Ben
September 19, 2012
Thanks for all the comments! (and sorry it took so long to reply)

@Chris - Hopefully the prioritization of the backlog will ensure that the highest-value features are implemented, and in my experience product owners do do a rough calculation of ROI - e.g. "this user story was rated an 8, but it's very important to the business, so let's do it". Good point about the subjectivity of story points - I have definitely seen teams inflate story points for tasks that are "boring" or "unpleasant".

@Michael - Really great points, and I very much appreciate your explanation. I think insofar as everyone has the expectation that story points are a very rough estimate, then there's no problem. You're right, people shouldn't infer whether the project will finish exactly in October or November, but should be able to infer whether it's 3 months' or 1 year's worth of work.

I guess my biggest complaint/concern is not with story points in isolation, but with when they meet the concept of velocity. Because story points use numeric symbols (e.g. "1", "2", "5", etc.), people (in my experience) ascribe an inappropriate amount of trust/rigor to them, because they are not actual numbers! (i.e. they don't have the mathematical properties of numbers) If we just used t-shirt sizes, there probably wouldn't ever be this confusion...which is what I've tried to push on my teams. As soon as we use numbers, however, then people start saying things like "we finished 57 story points last sprint, so we should be able to do the same every sprint", which I think is an unreliable assertion. The most reliable way to know how much you can fit in the next sprint, or when you'll eventually deliver, is to break down user stories into tasks and estimate in hours. As you asserted though (and I think I agree with you), this is not something that most teams *need*.

@Urs - Thanks. I think that's a great perspective - "work in order of priority (or order as it is called today) and make the best of the time we have". Many teams spend a lot of effort estimating and planning, and it's not always clear (as Michael mentioned) that this time is worthwhile. This seems like a very Kanban frame of mind.

@Andrew - It would be interesting to mine sprint data to understand composition of user stories, standard deviations, etc. Would be interested in hearing more if you do this.
Grant
September 20, 2012
Ben,

Interesting argument but I see a couple of flaws.

I’m making a couple of assumptions here:
• The margin of error is directly proportional to the size of the estimate (e.g. a user story sized @ 8sp will have a larger margin of error than a user story sized @ 1sp)
• The point in time at which a user story is sized using story points is closer to the mouth of the Cone of Uncertainty than the point at which a detailed task estimate is provided (since they are both estimates taken at different points in the SDLC, they will both have a margin of error, but it will be greater when a user story is sized using SP).

Given that the data in your example is fictional, it’s probably not a good idea to nit-pick on the details. But the example above is staged so that there is a rough relationship across all of the data except for the last SP value (i.e. 8). Armed with the 2 assumptions stated above, I would take a more in-depth look at the data values collected for 8 SP and see what the std dev is for those values. If the range is as high as assumed, then perhaps the issue here is that when the team estimates something at 8 SP, they really have little idea of the level of effort required to complete the user story. Based on the data above, it would make more sense for the team to break such stories down into the smaller sizes, where a good relationship is established and the story points do prove to be additive.

My point is really that the use of story points does not just end with an initial establishment of the values, which are then assumed to “work themselves out.” Given that story points are novel to most teams and the original values chosen by a team are somewhat random, I would expect the initial data gathering to highlight a poor relationship between the story points (much worse than the one in your example above). The key here is to review and improve based upon feedback from the data collection exercises that you describe above. There are options in your example that the team could choose that would provide a rough relationship to serve as an input to sprint planning and release forecasting. But it’s important that the team make the decisions on these options, as they are the ones doing the estimates.

-Grant
Abro Zacheria
March 31, 2013
Excellent article and simple to understand.
John
June 12, 2013
I think the issue is that you are looking at apples and oranges when comparing hours and story points. Story points are meant as an estimation of complexity for the story. On the teams I coach, I (privately) keep track of how many story points each individual on the team bangs out each sprint, that way I can estimate output of changes in the make up of the team from sprint to sprint (Example: Joe is always good for 16 story points per sprint and he'll be on vacation, so we'll estimate that the team can do 16 fewer points this sprint than last sprint).

Also, it doesn't account for differences in skill sets. What is a two hour task for one individual could be a two day task for another, yet the complexity stayed the same.
Juanal
October 10, 2013
I agree with your article, Ben. From the very first moment I read material about using arbitrary story points and then applying arithmetic operations to them, alarms went off in my mind.

I believe using story points that are not proportional to time units is a bad idea. People will always have the natural tendency of using arithmetic operations. For example, they'll think about “how many story points do I have in total?”, and as you mention in this article, the average will be used to determine speed. At least having them in a uniform scale is more accurate than doing it over inconsistent scales.
Dogman
July 15, 2016
I couldn't agree with this article more! SP math doesn't add up.

1. Yes, SP's should be used by developers to quickly estimate
2. SP's should be honestly recognized as icons that represent hourly estimates with means, medians, and standard deviations. Go ahead! Use Fibonacci numbers, shirt sizes, animals, whatever! But be intellectually honest and admit that they stand for real numbers in the background.
3. Historical sprint data should be used to refine what each SP icon stands for numerically
4. Make developers aware of how variant their SP icons are
5. Use an SP-to-Hour calculator during sprint planning to use real math to plan capacity (I'd recommend using the high side of your standard deviation distribution in your calculator)
6. Similarly, use your SP-to-Hour calculator to help with release planning. Your SP-to-Hour calculator should have a low and high range, and that can drive your low and high estimates of PBI backlog duration.

Thanks @Ben, great article.