A Critical Look At Some Pair Programming Research

6 Apr 2004

Science is what we have learned about how to keep from fooling ourselves

Richard Feynman

The last post examined some of the psychological and statistical effects that can adversely affect the reliability of anecdotal evidence. It is important to keep these effects in mind when evaluating the importance of our colleague's stories of success with particular software development techniques. But this doesn't mean that claims based upon empirical evidence are necessarily trustworthy. The experimental process is also susceptible to these forces, where they manifest as methodological errors and biased analysis of results. This post will illustrate some of these effects in action, using a well-known empirical study of pair programming as an example.

The Williams Experiment

In the Fall of 1999 at the University of Utah Laurie Williams conducted an experiment in pair programming upon students in the Senior Software Engineering course [1]. Here is a brief outline of her experiment:

The class contained 41 students. On the first day of class, 35 of the 41 students (85%) indicated a preference for working collaboratively over individually.
The students were divided into two groups - 13 students in the control group, 28 students in the experimental group
The control and experimental groups were given the same four programming assignments over the next six weeks. The 13 students in the control group completed the assignments individually using Humphrey's PSP (Personal Software Process). The 28 students in the experimental group completed the assignments in 14 pairs using William's CSP (Collaborative Software Process).
On several occasions the students in the experimental group took anonymous surveys designed to assess their relative enjoyment of pair programming and the degree of confidence they felt in their solutions.
The quality of the assignments was assessed by running them against a suite of automated tests

The overall results were:

The average elapsed time for assignment completion was about 15% less for pairs than for individuals
The average quality level of the assignments was about 15% higher for pairs than for individuals
90% or more of survey responses indicated that pair programming was more enjoyable.
About 95% of responses indicated a higher confidence in solutions that were pair programmed.

Ever since then, the phrase "a 15% increase in development cost for a 15% improvement in quality" has been cited many times, without qualification, as being the value proposition for pair programming.

In a subsequent presentation [3] Williams claimed that the only experimental variable at work here was pair programming. However a closer examination of the experimental method reveals that this was far from the case. There were numerous uncontrolled factors that could account in whole or part for the results achieved. Many of these factors are mentioned only in passing in Williams' paper, and some are entirely unacknowledged.

Population Bias

The 28 students in the experimental group were not chosen randomly - they were picked from amongst the 35 that initially indicated a preference for working collaboratively. This means that 6 students who indicated that preference were rejected (on an unspecified basis) and put in the control group. This creates a population bias in several ways.

All of the students in the experimental group have had their preference accommodated by the researchers, whereas 6 of the 13 in the control group have not. It is possible that this influences their subsequent performance.
All of the students in the experimental group are volunteers, which could be correlated with greater ability, enthusiasm or other personality characteristics that their non-volunteer counterparts in the control group (of which there are 7) don't possess.
Williams also notes that the collaborative pairs did additional assignments to keep the overall workload the same between the two groups. This could advantage the pairs by giving them extra practice and extra opportunity to receive feedback on their performance (which could be additionally motivating).
13 of the 14 pairs were self-selecting i.e. the partners requested to pair with each other. So the experiment is not evaluating pair programming in general, but a particular instance of it where participants are free to select their own partner, and to program only with them for the duration of the project.

The mechanisms by which the population was selected may well bias them towards favorable responses to survey questions, for a few reasons:

The ability of the pairs to select their own partners means that they are partnered with someone whom they are already on good terms with.
All members of the experimental group volunteered to be in that group, and so they have an ego investment in the consequences of their decision. Furthermore, they all indicated a preference for pair programming prior to participating in the study. To draw negative conclusions about the resulting experience is to concede a prior decision was wrong.

As Robert Cialdini notes [8]:

Once we have made a choice or taken a stand, we will encounter personal and interpersonal pressures to behave consistently with that commitment. Those pressures will cause us to respond in ways that justify our earlier decision.

All these sources of population bias are given short shrift at the tail end of Williams' ACM paper [1]:

It must also be noted that the majority of those involved in the study and of those that agreed to do complete [sic] the survey are self-selected pair-programmers. Further study is needed to examine the eventual satisfaction of programmers who are forced to pair-program despite their adamant resistance.

The Hawthorne Effect

The tendency to better short-term performance when knowingly under scrutiny is called the Hawthorne Effect. This experiment provides the typical circumstances under which the effect manifests. Williams acknowledges the observational pressure the students are under:

The students were aware of the importance of the experiment, the need to keep accurate information, and that each person (whether in the control or experimental group) was a very important part of the outcome.

The students could not be unaware of the vested interest that Williams has in achieving a pro-collaboration outcome. It is well known that collaborative software development is at the heart of her academic and professional identity. She is also a known champion of collaborative development in an educational context.

Consider also that the students timed their own work efforts and submitted them using a web-based tool. One can only presume that these submissions were not anonymous, otherwise the results could not be used to academically grade the students. Consequently, the students must be aware that their performance is traceable back to themselves.

The experimental group is therefore particularly motivated (consciously or otherwise) to perform well so as to create an outcome that Williams considers favorable, knowing that they can be associated with this outcome. The control group is oppositely motivated (consciously or otherwise) knowing that high performance on their part could oppose the preferred hypothesis of an authority figure.

To appreciate the students' position, imagine your boss has instituted a new quality program in your workplace, of which he is an enthusiastic proponent. To evaluate its efficacy, he distributes questionnaires asking "Did this program work?" The questionnaires are not anonymous, and you know he can see the responses. Do you think he'll get an accurate indication of people's perceptions?

Williams claims to have mitigated against the Hawthorne Effect, but misses the point entirely:

Since both groups will receive the same information about the study and in all lecture materials, the Hawthorne effect should not pose a thread to this study.

The Placebo Effect

A significant methodological flaw in the experiment is that it was not independently run. Williams notes:

All students attended the same classes, received the same instruction, and participated in class discussions on the pros and cons of pair programming.

The paper does not say who delivered this information and moderated these discussions, but if it was Williams herself then she will likely have communicated her pro-collaboration bias to her students well and truly by the time the experiment is over, thereby creating a sense of positive expectation in the experimental group and negative expectation in the control group. Even if it was not Williams, then her history of research in collaborative software development education constitutes a contextual bias towards pro-collaboration outcomes.

This group may well have been unintentionally primed to succumb to the placebo effect. They have been given an accounting of pair programming with a positive bias, and they are therefore predisposed to bring their own experience into alignment with that.

Results + Observational Bias = Desired Outcome

Note that technical problems prevented the accurate recording of completion times for Program 4, so it has been excluded from the result set.

In analyzing these results, Williams chooses to dismiss the data for Program 1 as being atypical. She claims that the pairs are "jelling" over this time, and adjusting to an unfamiliar work mode. She attempts to justify this by referring to the observance of a similar pattern by Nosek, but there is no mention of such an effect in the Nosek paper [4]. In fact, Nosek's pairs performed immediately and on a task of only 45 minutes duration. Furthermore, she describes an earlier trial run of her CSP process that was conducted in the Summer of 1999, where all students in the class programmed collaboratively. She boasts:

... each of the the ten collaborative groups turned in eight project and all 80 were on time. Additionally, all projects were of very high quality. The average grade on all 80 assignments was 98%.

So why did this group not experience a markedly lower performance during their "jelling" period as well?

She also makes references to the rapidity with which teams jell that, if relevant, would seem to contradict the decision to dismiss the entirety of Program 1 as an adjustment period. She claims:

It doesn't take many victorious, clean compiles or declarations of "We just got through our test with no defects!" for the teams to celebrate their union - and to feel as one jelled, collaborative team.<

And later:

In industry, this adjustment period has historically taken hours or days, depending upon the individuals.

It is inconsistent to both celebrate the rapidity with which pairs jell, and then dismiss the first quarter of one's data set to allow time for them to jell. This appears to be a clear case of observational bias - the post facto rejection of data that is not consistent with one's preferred hypothesis.

This is a clear case of observational bias - the dismissal of data that does not fit with the researcher's desired outcome.

Not Statistically Significant

The bombshell is that the completion times for individuals and pairs was not statistically significant after Program 1! Williams confesses:

... after the first program, the difference between the time for individuals and pairs was no longer statistically significant. (... For the difference in time values, p = 0.380, which indicates that there is almost a 40% chance the difference in time values would be observed by chance.)

In [2], the following conjecture is made based on the "Lines of Code" results:

... they [pairs] consistently implemented the same functionality as the individuals in fewer lines of code. â?¦ We believe this is an indication that the pairs had better designs.

I find the leap from "lines of code" to "better designs" to be a little hasty, to say the least.

Conclusions

The results of empirical investigations tend to carry a certain weight of authority - and rightly so. But part of the reason that empirically derived results carry such weight is the assumption that they have been derived through well-controlled experimentation by researchers who are fully cognizant of the need to account for and mitigate against the effects of methodological and observational bias. Without such mitigation, the results have only the appearance of authenticity and none of the credibility that we would otherwise attribute them with.

The Williams experiment is so open to the effects of bias and so poorly controlled that the results are next to meaningless - generalization based upon them, even more so.

In brief, the following features of the experiment are in question:

The experimental group was not selected randomly
Pairings within the experimental group were not selected randomly
The experimental group undertook additional assignments that the control group did not
The experimental group used the CSP, the control group the PSP
The experiment was not independently run
Of four assignments, the last did not result in usable data, the first was dismissed as atypical and the completion time data was not statistically significant.

What I want to know is - why, as a community, do software developers eat this stuff up so uncritically? Why are there so many people, including Williams herself, quoting this experiment and its conclusions as if they were some sort of vindication of pair programming, when they are in fact meaningless? Where are the critical examinations of her experimental method? I've found only one other - that provided by Stephens & Rosenberg in "Extreme Programming Refactored". I wonder, have those that appeal to this experiment actually read the thesis describing the experimental method at all? In a scientific community, such pseudo-experimentation would be laughed at. In software development, we can't seem to be even bothered thinking about it.

Addendum: The Nosek Experiment

In 1998 John Nosek published the results of an experiment in pair programming [4]. Williams' use of the word "strengthening" in the title of her paper is in part a reference to this earlier work, which she mentions in her paper's introduction. I mention it here because in several ways it is methodologically superior to the Williams experiment:

The participants (15 professional programmers) were divided into control and experimental groups on a truly random basis.
Programmatic results were assessed on both functionality and readability.
All raw experimental data was reported, not just percentages

Nosek was also more conservative in his subsequent analysis. Unfortunately little can be generalized from his results because the experiment only involved the completion of a single 45 minute task.

References

L. Williams et al, "Strengthening the Case for Pair Programming"
L. Williams and A. Cockburn, "The Costs and Benefits of Pair Programming"
L. Williams, "Pair Programming", presentation - North Carolina State University
J. Nosek, "The Case for Collaborative Programming", Communications of the ACM, vol. 41 no. 3, 1998, pp. 105-108
L. Williams and R. Upchurch, "In Support of Student Pair Programming"
J. Nawrocki and A. Wojciechowski, "Experimental Evaluation of Pair Programming"
F. Brunyate, "A Survey of Current Research on the Efficacy of Pair Programming"
R. Cialdini, "Influence: The Psychology of Persuasion", 1993, Quill
J. Meltzoff, "Critical Thinking About Research", 2003, American Psychological Association