Announcing the NeurIPS 2021 Datasets and Benchmarks Track

Joaquin Vanschoren and Serena Yeung

There are no good models without good data (Sambasivan et al. 2021). The vast majority of the NeurIPS community focuses on algorithm design, but researchers often can’t easily find good datasets to evaluate their algorithms in a way that is maximally useful for the community and/or practitioners. Hence, many resort to data that are conveniently available but not representative of real applications. For instance, many algorithms are evaluated only on toy problems, or on data that are plagued with bias, which could lead to biased models or misleading results, and subsequent public criticism of the field (Paullada et al. 2020).

Researchers are often incentivized to benchmark their methods on a handful of popular datasets that are well established in the field, since state-of-the-art results on these key benchmark datasets help to secure a paper acceptance. Conversely, evaluations on lesser-known real-world datasets, and other benchmarking efforts to connect models to real-world impacts, are often harder to publish and are consequently devalued within the field.

In all, there are currently not enough incentives at NeurIPS to work and publish on data and benchmarks, as evidenced by the lack of papers on this topic. In recent NeurIPS conferences, very few (fewer than 5) accepted papers per year focus on proposing new datasets, and only about 10 focus on systematic benchmarking of algorithms across a wide range of datasets. This is partially due to publishing and reviewing guidelines that are meaningful for algorithmic papers but less so for dataset and benchmark papers. For instance, datasets often cannot be reviewed in a double-blind fashion, but they do require additional specific checks, such as a proper description of how the data were collected, whether they show intrinsic bias, and whether they will remain accessible.

We therefore propose a new track at NeurIPS as an incubator to bootstrap publication on data and benchmarks. It will serve as a venue for publications, talks, and posters, as well as a forum for discussions on how to improve dataset development and data-oriented work more broadly. Submissions to the track will be part of the NeurIPS conference, presented alongside the main conference papers, and published in an associated journal. For this, we plan to establish a subjournal of JMLR called Datasets for Machine Learning Research (DMLR). Submissions to this track will be reviewed according to a set of stringent criteria specifically designed for datasets and benchmarks. In addition to a scientific paper, authors must also submit supplementary materials that provide full detail on how the data were collected and organized, what kind of information they contain, how they should be used ethically and responsibly, and how they will be made available and maintained. Authors are free to describe this to the best of their ability. For instance, dataset papers could make use of dataset documentation frameworks, such as datasheets for datasets, dataset nutrition labels, data statements for NLP, and accountability frameworks. For benchmarks, best practices on reproducibility should be followed.

In addition, we welcome submissions that detail advanced practices in data collection and curation that are of general interest, even if the data themselves cannot be shared. Audits of existing datasets, and systematic analyses of existing systems on novel datasets that yield important new insights, are also in scope. As part of this track, we aim to gather advice on best practices in constructing, documenting, and using datasets, including examples of known exemplary as well as problematic datasets, and to create a website that makes this information easily accessible.

Unlike other tracks, this track will use single-blind review, since datasets cannot always be transferred to an anonymous platform. We leave the choice of hosting platform to the creators, but make it clear that publication comes with certain responsibilities, especially that the data remain accessible (possibly through a curated interface) and that the authors bear responsibility for their maintenance (e.g., resolving rights violations).

There are some existing related efforts in the broader community, such as dataset descriptors (e.g., Nature Scientific Data) or papers on the state of the AI field (e.g., the AI Index Report). However, dataset journals tend to focus purely on the data and less on their relation to machine learning, and projects such as the AI Index are very broad and do not focus on new experimental evaluations or technical improvements of such evaluations. This track will bring together and span these related efforts from a machine learning-centric perspective. We anticipate the output to be a rich body of publications on topics such as new datasets and benchmarks, novel analyses of datasets and data curation methods, evaluation and metrics, and societal impacts such as ethical considerations.

If you have exciting datasets, benchmarks, or ideas to share, we warmly welcome you to submit to this new track. To allow near-continuous submission, we will have two deadlines, this year on the 4th of June and the 23rd of August 2021. Submissions will be reviewed through OpenReview to facilitate additional public discussion, and the most appreciated submissions will also be featured in an inaugural symposium at NeurIPS 2021. Please see the call for papers for further details.

For any questions, ideas and remarks, please contact us at

We would like to thank Emily Denton, Isabelle Guyon, Neil Lawrence, Marc’Aurelio Ranzato, and Olga Russakovsky for their valued feedback on this blog post.
