Data Availability for Research
The Non-Paper Session will take place in Room Mondego.
Christophe Diot
Google Research
Academics have relied on data to support their research for years, and even more so with the rise of AI research. However, accessing realistic data, whether traffic traces or workloads, has proven very difficult. Most relevant datasets are owned by private companies, which are often reluctant to share them because the data may reveal proprietary information or be costly to extract. This is particularly frustrating because industry researchers continue to publish results that are difficult to verify without access to the underlying data.

This session will address the broader problem of making “realistic” datasets available to the research community. Topics include, but are not limited to: identifying what data academics need and how they can collaborate with industry to gain access; inferring data from commercial networks through measurement; running academic datasets and workloads on commercial networks; calibrating synthetic datasets so that their feature distributions and properties match those of production data; and leveraging large-scale platforms to approximate the characteristics of industry datasets.

We invite researchers from both academia and industry to present creative ideas or ongoing projects aimed at building industry-like datasets or inferring the properties of industry-owned ones. We will also hold a panel discussion bringing together industry and academic researchers to explore this recurring issue and to draft a list of actions that would help academia access privately owned datasets and workloads.