Designing the Reproducibility Program for NeurIPS 2020

--

Joelle Pineau, McGill & FAIR

Koustuv Sinha, McGill & FAIR

Reproducibility Chairs, Organizing Committee NeurIPS 2020

In 2019, the Neural Information Processing Systems (NeurIPS) conference launched its first-ever Reproducibility Program, to support high-quality writing and the exchange of research results across the machine-learning community. Having learned from this first experience, we’re returning as Reproducibility Chairs for NeurIPS 2020. We’re excited to share what we learned from last year, and to describe what we have in store for this year.

Code submission policy

The NeurIPS 2019 code-submission policy encouraged all accepted papers to be accompanied by the necessary software and data artifacts to reproduce the results. It is worth noting that code submission was not mandatory, and the code was not expected to be used during the review process to decide on the soundness of the work. The guidance was “expect[ed] (…) only for accepted papers, and only by the camera-ready deadline”. Given this flexibility, about 40% of authors reported that they had provided a link to code at the submission stage, and 75% of accepted papers were accompanied by a link to code at their final submission. These numbers compared quite favorably to the roughly 50% of accepted papers at NeurIPS 2018 that included a link to code, an impressive increase in just one year!

For 2020, authors are strongly encouraged (though still not required) to upload their code as part of the supplementary material at submission time, to help reviewers assess the quality of the work. Furthermore, we are now providing guidelines and templates for code submission. These were prepared by our friends at Papers with Code, after their careful analysis of the impact (as measured by GitHub stars) of different components of submitted repositories. Their analysis is described in further detail here. We highly recommend the analysis to anyone planning to submit a NeurIPS 2020 paper that includes empirical results.
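
As a rough, hypothetical sketch of the kind of components such guidelines emphasize (a README with results and run commands, a dependency specification, and training/evaluation code), a quick self-check of a repository before submission could look like the following; the file names are illustrative assumptions, not an official NeurIPS requirement.

```python
# Hypothetical pre-submission self-check: does the repository contain the
# kinds of components that code-submission guidelines typically emphasize?
# File names below are illustrative assumptions, not an official requirement.
from pathlib import Path

EXPECTED_COMPONENTS = {
    "README with results and run commands": ["README.md"],
    "dependency specification": ["requirements.txt", "environment.yml", "setup.py"],
    "training code": ["train.py"],
    "evaluation code": ["eval.py", "evaluate.py"],
}

def check_repo(repo_dir: str) -> None:
    root = Path(repo_dir)
    for component, candidate_files in EXPECTED_COMPONENTS.items():
        found = any((root / name).exists() for name in candidate_files)
        status = "ok" if found else "missing?"
        print(f"[{status:>8}] {component}")

if __name__ == "__main__":
    check_repo(".")
```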

These changes were shaped by feedback from 2019 reviewers. After each review, reviewers were asked whether they looked at the code: 13% said yes (21% said no, and the rest said it did not apply, perhaps in many cases because code was not available at that time). Furthermore, in the cases where code was not provided, when asked whether they wished it had been available, 21% of respondents said yes (36% said no; 43% said it did not apply). Perhaps most interestingly, we found that the availability of code at submission (as indicated by authors) was highly positively associated with the reviewer score (p < 1e-08).
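
This post does not spell out the exact statistical test behind that association, so the snippet below is only a minimal sketch of how such a relationship could be probed, assuming a per-submission table with a mean reviewer score and a self-reported code-availability flag; the column names and data are hypothetical, not the official NeurIPS analysis.

```python
# Minimal sketch: do reviewer scores differ between submissions with and
# without a code link? Data and column names are hypothetical; this is not
# the analysis pipeline behind the official NeurIPS 2019 numbers.
import pandas as pd
from scipy.stats import mannwhitneyu

df = pd.DataFrame({
    "mean_review_score": [6.2, 4.8, 7.1, 5.0, 6.6, 3.9, 5.8, 6.9],
    "code_at_submission": [True, False, True, False, True, False, False, True],
})

with_code = df.loc[df["code_at_submission"], "mean_review_score"]
without_code = df.loc[~df["code_at_submission"], "mean_review_score"]

# Two-sided Mann-Whitney U test: compares the two score distributions
# without assuming they are normally distributed.
stat, p_value = mannwhitneyu(with_code, without_code, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p_value:.3g}")
```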

Reproducibility Challenge

The NeurIPS 2019 Reproducibility Challenge officially started in early November 2019, right after the final paper submission deadline, so that participants could benefit from any code submitted by authors. At that point in the review process, the authors’ identities were also known, allowing collaborative interaction between participants and authors. We used OpenReview.net to enable communication between authors and challenge participants. A total of 173 papers were claimed for reproduction, a 92% increase over the previous Reproducibility Challenge, held at ICLR 2019. Participants came from 73 different institutions distributed around the world, including 63 universities and 10 industrial labs.

Once submitted, all reproducibility reports were shared openly with the community on OpenReview, and also underwent a review cycle (by reviewers of the NeurIPS conference) to select a small number of high-quality reports, which will be published in an upcoming edition of the journal ReScience. People may be looking for a simple yes/no answer to the question: Is this paper reproducible? However, there is rarely such a concise outcome to a reproducibility study. Most reports produced during the challenge offer a much more detailed and nuanced account of their efforts, including the level of fidelity with which they could reproduce the methods, results, and claims of each paper. These reports provide a much richer source of knowledge for any researcher seeking to reproduce or build on another team’s work.

For 2020, we are planning a new edition of the Reproducibility Challenge, sometime in Fall 2020. Details will be announced later. Interested participants can sign up here for details.

Reproducibility Checklist

The ML reproducibility checklist was first released during NeurIPS 2018 to provide the community with a practical tool for checking that a machine-learning paper contains the necessary components and evidence to support its claims. While such a checklist cannot provide rigorous guarantees about the completeness or correctness of the claims, it at least serves as a guide for authors and reviewers, and helps set standard expectations in the field.

The checklist (v 1.2) was then deployed at NeurIPS 2019 during the paper submission process. Analyzing the responses provided by authors, who filled out a checklist for each submitted paper and updated it later in the case of accepted papers, we observe several interesting trends. It is reassuring to see that 97% of submissions are said to contain “a clear description of the mathematical setting, algorithm, and/or model”. Since we expect all papers to contain these items, the 3% No/NA answers might reflect a margin of error in how authors interpreted the question. Next, we notice that 89% of submission authors answered in the affirmative when asked, for all figures and tables that present empirical results, whether they include “a description of how experiments were run”. This is reasonably consistent with the fact that 9% of NeurIPS 2019 submissions indicated “Theory” as their primary subject area, and thus may not contain empirical results.

One set of responses raises interesting questions, namely the pair asking authors to check whether they include “(a) a clear definition of the specific measure or statistics used to report results” and “(b) clearly defined error bars”. We were surprised to note that for 87% of papers the authors see value in clearly defining the metrics and statistics used, yet for 36% of papers the authors judge that error bars are not applicable to their results. There seems to be some resistance within the community to characterizing uncertainty when reporting empirical results.
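
On that last point, the checklist does not prescribe any particular way of computing error bars; as a small, generic sketch (not an official recommendation), results reported over several random seeds can be summarized with a mean and a bootstrap confidence interval rather than a single number. The scores below are made up.

```python
# Generic sketch: report a mean and a 95% bootstrap confidence interval over
# repeated runs. The scores are made up; in a paper, state what the interval
# covers (random seeds, data splits, ...) and how many runs were used.
import numpy as np

rng = np.random.default_rng(0)
scores = np.array([71.2, 73.5, 70.8, 72.9, 71.6])  # e.g., accuracy over 5 seeds

bootstrap_means = np.array([
    rng.choice(scores, size=scores.size, replace=True).mean()
    for _ in range(10_000)
])
low, high = np.percentile(bootstrap_means, [2.5, 97.5])

print(f"accuracy: {scores.mean():.1f} "
      f"(95% bootstrap CI: {low:.1f}-{high:.1f}, n={scores.size} seeds)")
```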

For 2020, we have updated the checklist to provide a better grouping of questions, organizing algorithms, theoretical claims, datasets, code, and experimental results into separate sections with 2–6 questions each. We have also removed or regrouped a few questions, and aligned some of the questions with other checklists that have been tried this year (e.g., at EMNLP 2020), so that we can start comparing trends between related communities. Similarly to last year, the new ML Reproducibility Checklist (v 2.0) will be incorporated into the NeurIPS 2020 paper-submission process within the CMT interface.

Conclusion

Over the last year, we have seen many people join the conversation around reproducibility, both within the ML community, and more broadly across other disciplines of computer science. The discussions show a deep interest in improving our standards for both experimental research and paper writing. Between customized checklists, a steady flow of open-source projects, workshops on the topic of reproducibility, new venues for publishing reproducibility reports, and people willing to devote time and effort to reproducing recent results, we are raising the bar and helping the community move faster and contribute more robust scientific findings.

Anyone interested in a deeper dive into the findings, results, and analysis of the NeurIPS 2019 reproducibility program, including several more results concerning the ML reproducibility checklist, is invited to read our recent arXiv publication.

We look forward to continuing to engage with the NeurIPS community throughout the year!
