Saturday, April 13, 2013

Software Estimation - Techniques and Process

TL;DR

In software estimation, counting and computing are more reliable than judgement alone. Start your estimation by counting something that closely reflects the project size, is available early, is consistent between projects, and is statistically meaningful. To catch as many omitted tasks as possible, check the work against some sort of Work Breakdown Structure. Best Case and Worst Case make a good start for judgement, but there is a hidden problem in adding them up; the Most Likely Case and the Expected Case formula were introduced to help with that. Estimates can also be made by comparing the new project to similar projects in the past.
Not everything in a software project can be counted (easily). A family of techniques known as proxy-based estimation helps overcome this challenge; these techniques rely heavily on an organization's ability to process its raw historical data.
Software estimation can also take advantage of the wisdom of crowds, provided the right environment is created to support it.

Count, Compute, Judge

People tend to mistake estimation for judgement (or, in other words, guessing). However, researchers have found that judgement alone is the least accurate form of estimation. Experts do tend to outperform newcomers when estimating by judgement alone, but that advantage comes from the wide range of historical data, experience, and painful stories the experts have accumulated over their careers. When the estimation was made in a field where they had no experience, no difference in performance was observed.

In the light of statistics, counting and computing are demonstrably more reliable. Always count related things first, then compute what you can't count, and finalize the estimation with calibration data. Use judgement only as a last resort.

There are many things you can count in a software project. Per the cone of uncertainty, the later in the project, the finer the level of granularity you can count at.
In order to avoid being paralyzed by choices, there are a few rules of thumb we can use to decide what to count.
  • As size is the strongest influence on a software project, count something that closely reflects the project size
  • Things that can be counted early in the project are better than things we have to wait for until later
  • To get the benefit of historical data (like professional estimators do), count something that is consistent between projects
  • Count something that would result in a relatively large number (20 or more) so that we can take advantage of the law of large numbers: errors on the high side and errors on the low side cancel each other out to some degree. The simulation below illustrates this.
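To see why 20 or more items helps, here is a minimal simulation of the cancelling effect (purely illustrative; the item size of 10 units and the ±50% error range are made-up assumptions): as the number of independently counted items grows, the error of the rolled-up total shrinks.

    import random

    def aggregate_error(n_items, trials=10_000):
        """Average absolute error of a total built from n_items noisy counts."""
        total = 0.0
        for _ in range(trials):
            true_total = n_items * 10
            counted = sum(10 * random.uniform(0.5, 1.5) for _ in range(n_items))
            total += abs(counted - true_total) / true_total
        return total / trials

    for n in (1, 5, 20, 100):
        print(f"{n:>3} items -> average error {aggregate_error(n):.1%}")

With one item the total is off by roughly 25% on average; with 20 items the high and low errors largely cancel and the total lands within a few percent.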

Decomposition by Work Breakdown Structure

In the last post, we listed omitted activities as one of the major sources of estimation error. Omitted activities consist of forgotten features and forgotten tasks. Forgotten features can be reduced by thorough requirements engineering and experience; in general, this takes much practice and retrospection to improve. Forgotten tasks, on the other hand, can be reduced dramatically by checking our work against an activity-based work breakdown structure.
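As a sketch of what "checking against a WBS" can look like mechanically (the activity names below are hypothetical; real activity-based checklists, such as the ones McConnell provides, are far longer and organization-specific), compare the activities covered by your task list against the standard list and flag what is missing:

    # Hypothetical activity-based WBS checklist.
    WBS_ACTIVITIES = {
        "requirements", "architecture", "coding", "unit testing",
        "integration", "system testing", "user documentation", "deployment",
    }

    # Activities actually covered by the current task list.
    estimated_tasks = {"coding", "unit testing", "integration"}

    # Anything in the WBS but not in the plan is a candidate forgotten task.
    forgotten = WBS_ACTIVITIES - estimated_tasks
    print("Activities with no estimate:", sorted(forgotten))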


Justify Probability Statement

Judgement is very subjective, yet we cannot avoid it: after all the counting and computing, judgement is needed to produce the actual numbers. An educated guess is better than a blind one, and here is how we can make one. Since an estimate is a probability statement, we should stop using single-point numbers as estimates. Best Case and Worst Case make a good start: a range is more likely to catch the actual hours somewhere in the middle, and it makes us more comfortable.

But there is a problem with adding up best and worst cases. Let's say that each of the individual Best Case estimates is 25% likely, meaning that you have only a 25% chance of doing as well or better than the estimate. The odds of delivering any individual task according to a Best Case estimate are not great: only 1 in 4 (25%). But the odds of delivering all the tasks are vanishingly small. To deliver both the first task and the second task on time, you have to beat 1 in 4 odds for the first task and 1 in 4 odds for the second task. Statistically, those odds are multiplied together, so the odds of completing both tasks on time are only 1 in 16. To complete all 10 tasks on time you have to multiply the 1/4s 10 times, which gives you odds of only about 1 in 1,000,000, or 0.000095%. The odds of 1 in 4 might not seem so bad at the individual task level, but the combined odds kill software schedules. The statistics of combining a set of Worst Case estimates work similarly. (McConnell, 2006)

Because of that, we introduce the Most Likely Case, in the hope that its sum will be closer to the actual result. Still, developers' "most likely" estimates tend to be optimistic. A technique called the Program Evaluation and Review Technique (PERT) allows us to calculate an Expected Case, using a formula derived from statistical studies:
Expected Case = [Best Case + (4 x Most Likely Case) + Worst Case] / 6
Or, if the organization has a history of consistent optimism:
Expected Case = [Best Case + (3 x Most Likely Case) + (2 x Worst Case)] / 6
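Both formulas are trivial to put into code. A minimal sketch (the function name and sample numbers are mine, not from the book):

    def expected_case(best, most_likely, worst, consistent_optimism=False):
        """Expected Case from a three-point (PERT-style) estimate."""
        if consistent_optimism:
            return (best + 3 * most_likely + 2 * worst) / 6
        return (best + 4 * most_likely + worst) / 6

    # A task estimated at 4h best case, 6h most likely, 12h worst case:
    print(expected_case(4, 6, 12))        # ~6.7 hours
    print(expected_case(4, 6, 12, True))  # ~7.7 hours, penalizing optimism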

Estimating by Analogy

The basic idea is to create a new estimate by comparing the new project to a similar project in the past. Again, the old rule applies: count first, then compute, and use judgement last. The steps, each labeled with whether it counts, computes, or judges:

1. Break the similar previous project into pieces, using requirements and the WBS. (Count)
2. Compare the size of the new project to the old one, piece by piece. (Judge)
3. Build up the estimate of the new project's size as a percentage of the old project's size. (Compute)
4. Create an effort estimate based on the size of the new project compared to the size of the previous project. (Compute)
5. Calibrate the result. (Judge)
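A minimal sketch of those five steps, with made-up numbers (the piece names, sizes, and the old project's effort are all assumptions for illustration):

    # Count: sizes of the old project's pieces, e.g. in lines of code.
    old_pieces = {"ui": 12_000, "business rules": 9_000, "database": 6_000}

    # Judge: per piece, how big the new project looks relative to the old one.
    relative_size = {"ui": 1.5, "business rules": 1.0, "database": 0.5}

    # Compute: new size as a percentage of the old size.
    old_total = sum(old_pieces.values())
    new_total = sum(old_pieces[p] * relative_size[p] for p in old_pieces)
    ratio = new_total / old_total

    # Compute: scale the old project's actual effort by the size ratio.
    old_effort_months = 30
    new_effort = old_effort_months * ratio
    print(f"New project is ~{ratio:.0%} of the old one: "
          f"~{new_effort:.1f} staff-months, before calibration (Judge)")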

There are areas where analogy doesn't work well, such as business rules. But still:
"One contrast between the estimate created using analogy plus decomposition, and the un-decomposed approach, is that in the latter, uncertainty in one area can spread to other areas." (McConnell, 2006)

Proxy-based Estimate

Not all activities in the software development process result in code, nor can everything be counted (easily): for instance, how many test cases a feature needs, how many defects should be expected, or how many pages of user documentation will be written. A family of estimation techniques known as proxy-based techniques helps overcome these challenges. The idea is to find another metric that is correlated with what we ultimately want to estimate. Once the proxy is found, we estimate or count the number of proxy items, and then use a calculation based on historical data to convert from the proxy count to the estimate we really want. The insight behind these techniques is that developers cannot estimate absolute size accurately, but they can compare relative sizes pretty well: it is hard to tell whether a task will take 4 or 6 hours, but relatively easy to state that one task is twice as hard as another. By making relative comparisons to the past, we tell the future.

Where can we find the proxy data? There are three main sources: industry average data, organizational historical data, and project-specific data, in order of increasing accuracy.
  • Data from different organizations within the same industry varies widely, by as much as a factor of 10. If we use the average productivity for our industry, we won't account for the possibility that our organization might be at the top end of the productivity range or at the bottom. (McConnell, 2006)
  • The majority of projects in an organization are often similar in size and developed under similar organizational influences, so estimates based on the organization's own history will not be subject to much error.
  • Project-specific data is useful in the same way historical data is. Furthermore, using data from the project itself accounts for the influences that are unique to that specific project. The sooner we can begin basing our estimates on data from the project itself, the sooner our estimates become truly accurate.
A few popular approaches in this family are Story Points, Fuzzy Logic, T-shirt Sizing, and Standard Components. Lacking processed historical data, we do not use proxy-based estimation very often ourselves; let's take a quick look anyway.

Story Points are very well known across Scrum teams. They are used to measure the relative effort of a story, in numeric units. Story Points only start to become useful after the first few iterations, once the team can count the number of story points it delivered and compute its velocity. You can easily find articles about Story Points on the Internet.
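Converting the proxy (points) into a forecast is then simple arithmetic. A sketch with invented history:

    # Invented history: story points delivered in the first three iterations.
    delivered = [18, 22, 20]
    velocity = sum(delivered) / len(delivered)  # points per iteration

    # Remaining backlog, already sized in points by the team.
    backlog_points = 120
    print(f"Velocity ~{velocity:.1f} points/iteration; "
          f"~{backlog_points / velocity:.1f} iterations left for the backlog")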

Fuzzy Logic works just like Story Points, except that instead of numeric measurements, people classify size as Very Small, Small, Medium, Large, and Very Large. The favorite argument in "Fuzzy Logic vs Story Points" debates is that a numeric scale implies you can perform numeric operations on the numbers: multiplication, addition, subtraction, and so on. But if that relationship isn't valid, if a 12-point story doesn't require four times the effort of a 3-point story, then the number 12 isn't any more valid than the Large and Very Large sizes.

T-shirt Sizing is a derivative of Fuzzy Logic in which business value is brought to the table. Sales and marketing staff will say, "How can I know whether I want that feature if I don't know how much it costs?", and a good estimator will say, "I can't tell you what it will cost until we've done more detailed requirements work." The two groups appear to be at an impasse. By representing both business value and development cost on the same coarse scale, nontechnical stakeholders can make decisions based on net business value. (The numeric values below are for illustration purposes.)

                         Development Cost
Business Value      Extra Large   Large   Medium   Small
Extra Large              0          4       6        7
Large                   -4          0       2        3
Medium                  -6         -2       0        1
Small                   -7         -3      -1        0
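The illustrative matrix above happens to fall out of a simple rule: give each size a weight and take net value = business-value weight minus cost weight. The weights below (8/4/2/1) are my assumption, chosen to reproduce the table, not a standard scale:

    # Assumed weights per size, chosen so the matrix above falls out.
    WEIGHT = {"Extra Large": 8, "Large": 4, "Medium": 2, "Small": 1}

    def net_business_value(value_size, cost_size):
        """Net business value of a feature: value weight minus cost weight."""
        return WEIGHT[value_size] - WEIGHT[cost_size]

    print(net_business_value("Extra Large", "Medium"))  # 6, worth doing
    print(net_business_value("Small", "Large"))         # -3, probably not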

Standard Components is the most straightforward of the family. If we have developed a number of programs that are architecturally similar to each other, and we possess a certain amount of historical data about them, we can estimate the number of standard components the new program will contain and compute its size from past sizes.
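A sketch of that computation, with hypothetical historical averages (the component kinds and all the numbers are made up):

    # Hypothetical historical averages: size per standard component, in LOC.
    AVG_LOC = {"screen": 800, "report": 500, "table": 300, "batch job": 1_200}

    # Estimated counts of each standard component in the new program.
    counts = {"screen": 12, "report": 8, "table": 20, "batch job": 3}

    size = sum(AVG_LOC[kind] * n for kind, n in counts.items())
    print(f"Estimated size: ~{size:,} LOC")  # feeds an effort model next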

When proxy-based estimation is not effective

When using proxy-based estimation, it's important to remember the Law of Large Numbers: the rolled-up number has a validity that the underlying numbers do not have. If you don't have a large enough number of items to estimate, the statistics of this approach won't work properly, and you should look for another method.

Collect historical data

"The weather today won't always be the same as it was yesterday, but it's more likely to be like yesterday's weather than like anything else." (Beck and Fowler, 2001)
The most important reason to use historical data from our own organization is that it improves estimation accuracy. Historical data takes organizational influences into account: estimating these influences one by one is difficult and error-prone, while historical data adjusts for all of them, even when identifying the specifics is hard. The data also helps us avoid subjectivity and unfounded optimism. There is an effect known as the Second Project Effect, where a lot of assumptions are made from what you learned on the last project: "We know the business logic better this time", "There was a lot of turnover last time, we won't have it this time (?!)", or "We will do a better job at requirements management". With historical data, we use a simpler assumption: the next project will go about the same as the last project did, because productivity is in fact an organizational attribute that cannot easily be varied from project to project (Putnam and Myers, 1992).

Group Review

This is not actually a technique at all, but rather a set of rules of thumb for conducting an estimation review as a group. The goal of reviewing in a group is to harness the wisdom of the crowd, so before looking at the rules, let's talk about this effect.

Wisdom of the crowd is the idea that aggregating information in groups results in decisions that are often better than any single member of the group could have made. Not all crowds (groups) are wise; consider, for example, mobs, or crazed investors in a stock market bubble. These key criteria separate wise crowds from irrational ones:

  • Diversity of opinion: Each person should have private information, even if it's just an eccentric interpretation of the known facts.
  • Independence: People's opinions aren't determined by the opinions of those around them.
  • Decentralization: People are able to specialize and draw on local knowledge.
  • Aggregation: Some mechanism exists for turning private judgments into a collective decision.
http://en.wikipedia.org/wiki/The_Wisdom_of_Crowds

When the decision-making environment is not set up to accept the crowd, the benefits of individual judgments and private information are lost, and the crowd can only do as well as its smartest member, rather than perform better. These extremes cause that failure:

  • Homogeneity: Too little diversity within the crowd removes the variance in approach, thought process, and private information that its wisdom depends on.
  • Centralization: A hierarchical management bureaucracy limits the advantage of the wisdom of low-level engineers.
  • Division: Information held by one subdivision is not accessible to another.
  • Imitation: Where choices are visible and made in sequence, an "information cascade" can form, in which only the first few decision makers gain anything by thinking for themselves.
  • Emotionality: Emotional factors, such as a feeling of belonging, can lead to peer pressure, herd instinct, and in extreme cases collective hysteria.
http://en.wikipedia.org/wiki/The_Wisdom_of_Crowds

Based on that study, Steve McConnell suggested this set of rules for group reviews.
  • Have each team member estimate pieces of the project individually, and then meet to compare your estimates. Discuss differences in the estimates enough to understand the sources of the differences. Work until you reach consensus on the high and low ends of estimation ranges.
  • Don't just average your estimates and accept that. You can compute the average, but you need to discuss the differences among individual results rather than take the calculated average automatically. Convergence among the estimates tells you that you probably have a good estimate; spread tells you that there are probably factors you have overlooked and need to understand better.
  • Arrive at a consensus estimate that the whole group accepts. If you reach an impasse, you can't vote; you must discuss differences and obtain agreement from all group members.



References

Beck, Kent, and Martin Fowler. 2001. Planning Extreme Programming. Boston, MA: Addison-Wesley.
McConnell, Steve. 2006. Software Estimation: Demystifying the Black Art. Redmond, WA: Microsoft Press. (Chapters: "Individual Expert Judgment", "Estimation by Analogy", "Calibration and Historical Data".)
Putnam, Lawrence H., and Ware Myers. 1992. Measures for Excellence: Reliable Software on Time, Within Budget. Englewood Cliffs, NJ: Yourdon Press.
