And Should you Care?
This is about the reproducibility of results in Computer Systems, the results reported in the papers we shed blood, sweat, and tears to get into our hyper-competitive conferences (definitely the latter two; the first is not widely documented). Are they helping us progress as fast and as efficiently as they could? Are our software artifacts reusable? Are our data resources reusable? Are our insights generalizable?
This is the first of a two-part series. In this part, I look at the evidence to see if there is a problem. In the second part, I look at some promising directions that are being explored and further ones that we need to embark on.
I realize there have been reams written on this topic. But we seem to be on a journey where the destination recedes farther away even as we take some tentative steps. And there have been several promising steps. First, there is the growing adoption of artifact evaluation at our conferences, with a badge of approval awarded to artifacts that survive some testing (most do). Second, there are funded efforts, medium to large, to provide curated datasets for the community to prove or disprove their hypotheses. This includes an effort by us, the University of Illinois at Urbana-Champaign, and the University of Texas at Austin to create a public dataset of computer system usage and failure data from operational computing clusters at various universities. This has been funded by the NSF through an initial grant and, in a reassuring reassertion of trust, through a newly awarded grant. Third, there is recognition of the effort that goes into providing such usable software and datasets, and of its value to the community, through awards at some of our top conferences2.
However, there is a feeling that reproducibility initiatives are falling ever farther behind the pace of claimed new innovations. Too often the data used to claim victory in a paper is locked away behind padlocks; too often software artifacts are either closed source, or open sourced but unusable; too often flawed evaluations by the authors of a software package fall flat when another group attempts replication.
The hypothesis that we have a problem with reproducibility in science and engineering broadly has been well validated by respected bodies like the National Academies of Sciences, Engineering, and Medicine (NASEM), the National Science Foundation (NSF), and the National Institutes of Health (NIH)1. I will therefore resist repeating that case. And those of us in CS research face this whenever we have to show our results in the context of the best prior work. The software package or hardware device that looked sparkling in the original paper starts to look rather tawdry when we take it out for a spin.
There have been some well-publicized cases of such missteps, in our community and others. The Machine Learning community has faced this increasingly, as the ever more powerful models being reported on often could not be replicated, and building models from scratch was so costly that it was out of reach for many interested folks. ML is an area where sharing the code is not enough: one also has to share the hyperparameters of the model, and for the more complex models these can run into the hundreds of thousands. The community, to its credit, took some substantive steps to correct this situation, such as a reproducibility initiative at its premier conference, NeurIPS, starting in 2016. Any paper submitted to the conference has to check off elements from a reproducibility checklist. However, this does not affect the outcome of the submission, and so, anecdotally, the beneficial effects of this initiative are yet to be widely felt. The most heroic attempt was made by Edward Raff in a NeurIPS 2019 paper: he manually attempted to implement 255 papers published between 1984 and 2017 (note, this means not using the original authors' code even if available), recording features of each paper and performing statistical analysis of the results. The conclusion: reproducibility has not improved in the last 35 years. An earlier paper, from 2018, had determined that only about a quarter of the results from 400 papers at the top AI conferences (AAAI and IJCAI) were reproducible.
Many of us in the computer systems area can tell long tales of trying to replicate experiments with systems on which ML pixie dust has been sprinkled, where the ingredients of that pixie dust (be it model details or hyperparameters) are missing. This introduces a new twist: we need faithful replication not just of the code but also of the data and models that the code operates on.
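That pixie dust can be pinned down mechanically. As a minimal sketch (the field names, values, and file name here are hypothetical, chosen only for illustration), an experiment can emit a run manifest that records every hyperparameter, the random seed, and a hash of the exact training data alongside the code, giving another team everything they need to rerun it:

```python
import hashlib
import json

# Hypothetical experiment configuration: every value that influences the
# result is written down explicitly, not left implicit in the code.
config = {
    "model": "resnet18",            # assumed model name, for illustration
    "learning_rate": 3e-4,
    "batch_size": 64,
    "seed": 1234,                   # fixing the seed aids replication
    "dataset_sha256": "0" * 64,     # placeholder: hash of the exact dataset
}

def save_run_manifest(config, path="run_manifest.json"):
    """Serialize the full configuration so another team can rerun the run,
    and return a digest that compactly identifies this configuration."""
    with open(path, "w") as f:
        json.dump(config, f, indent=2, sort_keys=True)
    # Hashing the canonical (sorted-key) JSON gives a stable run identifier.
    return hashlib.sha256(
        json.dumps(config, sort_keys=True).encode()
    ).hexdigest()
```

Nothing here is specific to ML; the same discipline of capturing configuration and data provenance applies to any systems experiment.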
Why should you care?
As a researcher.
Our work can go further, faster, if we can truly stand on the shoulders of the giants, or even the normally sized people, who have gone before us. Our time and efforts, and those of our team members, can be spent more effectively. We would be spared a familiar dilemma: how much effort should we spend replicating a prior system or result that we want to use as a comparison, so that we are being fair to it, without spending so much that we have no time left to showcase our own innovation?
Consider a utopia in which there is a well-curated library of datasets, well-documented hardware IP blocks, and software modules that we can plug and play and then add our own to, the digital equivalent of the magical childhood moments of putting Lego building blocks together. In such a utopia, our efforts would be better directed at creating innovative technology to make the world a better place, a little faster.
As a practitioner.
When I found myself a few years back in the middle of starting a company out of the discoveries from our lab, my two enterprising students and I were faced with the task of vetting the claims of other solutions, some academic and some commercial. We spent many frustrating hours in that process: are we not getting it because we are making a rookie mistake, or is there some pixie dust that needs to be sprinkled for the package to show its true magical colors? (Pixie dust is the non-technical term for a specific library we need to link to, a specific value of some parameter, or one of a myriad other causative factors.) Serving as the technical advisor to two multinational software companies, I see similar hand-wringing at the unit level when a couple of engineers want to pitch their new idea to management. Has it been done before (and no, flashy slides do not equate to it being done), or should we spend our team's time on this idea and take a risk with our careers?
In the commercial space, there are some understandable reasons for keeping the prototype opaque — so as not to divulge the secret sauce. But at least it should be possible to take it out for a trial run to see what promise it has, without being besieged by marketing folks.
As a member of the tax-paying public.
The issue of reproducibility is most passionately debated in the musty hallways of academia and industrial research labs. But it matters for the wider public as well, even if we do not quite realize it yet and even though it will never make its way into an election campaign manifesto. After all, it is tax dollars that fund much of the research we are talking about. In the US in 2017, a full $120B of Research and Development (R&D) funding came from the federal government3. In Computer Science specifically, a large majority of the funding for fundamental research comes from public sources. So the tax dollars will go much farther if researchers can build on others' (tax-funded) research.
To be Continued
In this post, we have looked at mounting evidence backing up the uneasy feeling that a lot of shiny innovations in Computer Systems are not replicable. We are therefore not moving forward as efficiently as we should. This affects us as researchers and practitioners of the field, for sure. But it also affects a far wider set: the broad tax-paying public that funds most of the research and development in Computer Science and Engineering. The promising news is that the technical community has been doing something to fix the situation. We will visit these efforts in the next post, along with the big moves we still need to make to come out on the winning side.
 The technical definitions are as follows. Replicability allows a different researcher to obtain the same results for an experiment under exactly the same conditions and using exactly the same artifacts; i.e., an independent researcher can reliably repeat someone else's experiment ("Different team, same experimental setup"). Reproducibility enables a researcher other than the authors to obtain the same results for an experiment under different conditions and using her self-developed artifacts ("Different team, different experimental setup"). For the purpose of this post, we confine ourselves to the easier reach of Replicability. Reproducibility is needed for an engineering principle to become widely used, and it almost always has to follow Replicability.
 Such awards often have the term "community award" in their names. They are, encouragingly, becoming more commonplace at our top conferences.
 The share of total U.S. R&D funded by the federal government stood at 22% in 2017. The total US R&D budget was $548 billion. The component called basic research, which is mostly done at universities, is 17% of the total and 42% of the funding for all basic research is provided by the federal government.