NIO-VE: A View from the Edge

Network Trace Data from Mobile Devices Via the National Internet Observatory

Presenters

Presenter	Role	Institution
David Choffnes, PhD	PI, National Internet Observatory	Northeastern University
Scott Cambo, PhD	Director of Data Science, National Internet Observatory	Northeastern University

Tutorial Timetable

45 min	NIO Overview + Q&A	Overview of the context of Internet research today, NIO infrastructure, data collection basics, research vision, and the pros and cons of the observatory model of data collection.
30 min	Data Collection + Q&A	Details on the types of data that NIO collects (mobile, desktop, and survey data) and on the NIO sample/participants. Includes mechanisms of data collection as well as examples of what gets captured.
30 min	Parsed Datasets / Data Products + Interactive Dashboard Activity + Q&A	Brief explanation of network traffic data and how raw, scraped data gets processed and structured to create organized data collections or data products for researchers. Followed by detailed overviews of selected data products available to the research community, along with example data analyses, and an active exploration session in which participants interact with dashboards to visually understand trends and patterns in the collected data.
15 min	Break
15 min	Aggregated Dataset Overview + Q&A	Overview of an aggregated dataset and some example analyses to motivate further analysis by the tutorial participants.
30 min	Hands-on Dataset Exploration Session + Q&A	Participants access and explore the above aggregated dataset, built using NIO, on their own machines.
45 min	Research Examples + Q&A + Break	Examples of research conducted using NIO data, to provide concrete ideas on how NIO supports research that seeks to understand important aspects of network behavior in the wild.
15 min	Onboarding Process and Researcher Experience + Q&A	An overview of what happens after a researcher submits a proposal requesting access to NIO data and the infrastructure for conducting research.
15 min	General Q&A + Future Datasets Feedback	Open-ended Q&A session with all participants, speakers, and organizers, and space for feedback about what participants would like NIO to collect that is not currently being collected.

Summary

Individuals are increasingly spending a significant portion of their lives online. Currently, more than half of the world's population has access to the Internet [1], with notably higher percentages in developed countries (e.g., 89% of US adults). The Internet plays a pivotal role in connecting people and serves as a primary medium for obtaining and disseminating information. The shift towards more extensive online engagement worldwide presents unique opportunities for researchers across various disciplines to study the intersection of network and human behavior on an unprecedented scale.

However, virtually none of the data required for such studies is available for academic research. The vast amount of data created about online human behavior is siloed, proprietary, and generally unavailable for independent validation. Many existing network traffic studies focus on cellular networks, campus networks, or limited deployments of proxied traffic. The collected datasets are often not publicly available (e.g., Internet provider data are considered confidential and proprietary, and made available only to collaborators); are siloed; are rarely well documented and thus not replicable; and are often decontextualized from other information about individuals. The collection of such data has also become increasingly challenging and restricted in recent years, largely due to (valid) privacy and ethical concerns.

Some researchers rely on data from companies, including AT&T, Comcast, and Arbor Networks, that collect network traffic and in some cases make it available to a small set of collaborators under restrictive terms. Although such data provide unique insights, these data collection systems are entirely proprietary, making it impossible to assess many dimensions of scientific validity, from sample to instrumentation quality. Other common limitations of the current instruments include (i) highly aggregated/sampled datasets that make it impossible to attribute network behavior to specific users and apps; and (ii) a view that is limited to what is visible from a single network provider or a small set of vantage points.

The National Internet Observatory (NIO) aims to help address these challenges by serving as an open, large-scale, secure, and privacy-preserving observatory of online behavior to enable academic research without relying on bespoke data collection, proprietary sources, or partnerships with industry. NIO participants install a browser extension and/or mobile apps to donate their online activity data along with comprehensive survey responses. The infrastructure offers researchers access to a suite of structured, parsed content data for selected domains to enable analyses and understanding of Internet use in the US. This is all conducted within a robust research ethics framework, emphasizing ongoing informed consent and multiple layers of technical and legal interventions to protect the values at stake in data collection, data access, and research [2].

This tutorial aims to provide a brief overview of the NIO infrastructure, the data collected, the participants, and the researcher intake process. The organizers will present concrete examples of research being conducted with this new source of data, in order to motivate and inform tutorial participants' own ideas around the kind of research they can conduct with NIO. This tutorial will be interactive, including a facilitated hands-on session where participants will themselves interact with and explore an aggregated, de-identified dataset built using data collected from NIO participants. An open-ended discussion with participants will enable feedback for future data products NIO can provide to the research community.

Outline

The tutorial is 4 hours: one continuous half-day session with breaks to facilitate extended discussions between participants. Our tutorial consists of different sections, including presentations and interactive activities (see schedule above for details). Attendees will use interactive dashboards in their own laptops' web browsers to explore the data we collect and trends in network traffic generated from NIO participants' devices. In the hands-on data exploration session, tutorial participants simply need their own devices with Python or R installed, and introductory data science or analytics experience to explore the dataset themselves, with tutorial organizers providing examples and guiding the exploration as needed. The aggregated dataset will use real data collected from NIO participants, such as flow-level network traffic summaries from mobile devices; this will enable participants to directly interact with NIO data and understand the possibilities of working with our data donations.
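
To give a sense of the level of Python involved in the hands-on session, the following is a minimal sketch of the kind of aggregation one might run on flow-level traffic summaries. It is an illustration only: the rows are synthetic, and the column names (app_category, protocol, bytes_sent, bytes_received) and the helper total_bytes_by are hypothetical assumptions, not the schema or tooling of the actual NIO dataset.

```python
import csv
import io
from collections import defaultdict

# Synthetic rows standing in for an aggregated, de-identified flow summary.
# The column names are hypothetical and are NOT the NIO schema.
SAMPLE_CSV = """app_category,protocol,bytes_sent,bytes_received
social,TCP,1200,54000
social,UDP,800,23000
video,UDP,500,910000
video,TCP,300,120000
news,TCP,400,15000
"""


def total_bytes_by(rows, key):
    """Sum sent + received bytes for each distinct value of `key`."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += int(row["bytes_sent"]) + int(row["bytes_received"])
    return dict(totals)


# Parse the in-memory CSV into a list of dictionaries, one per flow summary.
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

by_category = total_bytes_by(rows, "app_category")
by_protocol = total_bytes_by(rows, "protocol")

# Print categories from heaviest to lightest total traffic volume.
for category, total in sorted(by_category.items(), key=lambda kv: -kv[1]):
    print(f"{category:>8}: {total:>9,d} bytes")
```

The sketch uses only the Python standard library so that it runs on any installation; the same grouping could equally be written with a data frame library or in R.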

Expected Audience and Prerequisites

The tutorial is designed for academic researchers from all SIGCOMM-associated disciplines interested in online activity data collection and trace and survey data-based research. There are no formal prerequisites for participation, and the contents will be relevant and accessible to all conference participants. A background in network trace data research can help participants better understand the challenges and opportunities that NIO presents, but it is not required. Introductory data science skills in Python/R will help with the hands-on session.

By the end of the tutorial, participants will:

Learn about various methods of online activity data collection (and their pros and cons).
Learn about a new infrastructure and framework for data collection.
Learn about the research and analytical possibilities enabled by NIO and data donations, including the cross-platform potential of working with participants' network traffic and mobile app activity, the kinds of research enabled by digital trace data being linked with survey data, and the interdisciplinary uses of these alternative data collection methods.
Understand the data collected by NIO and how it could inform their own research.

Prerequisites

Attendees are strongly encouraged to complete the researcher intake process for data access on the National Internet Observatory platform before attending. Doing so will enable you to interact with the highly sensitive data we furnish to qualified researchers. This process can take multiple months due to the need for an executed Data Usage Agreement by your organization's counsel, so attendees should begin this process as soon as possible.

Laptop Requirements

Any reasonably modern laptop/OS with a few GB of free space should suffice. Python and/or R must be installed for the hands-on session.

Biographies

Dr. David Choffnes (choffnes@ccs.neu.edu; primary contact) is a Professor in the Khoury College of Computer Sciences at Northeastern University, USA. He obtained his PhD in Computer Science from Northwestern University. His research interests broadly span Internet measurements, networking, and distributed systems, with a focus on privacy, security, and transparency across IoT, mobile, and web modalities. He is a PI of the National Internet Observatory NSF Midscale Research Infrastructure, leading the mobile data collection team. His work has been published in SIGCOMM, CoNEXT, NSDI, IMC, and many other interdisciplinary venues.

Dr. Scott Allen Cambo (s.cambo@northeastern.edu) is a Senior Data Scientist and the Director of Data Science at the National Internet Observatory (NIO) at Northeastern University, USA. He earned his PhD in Technology and Social Behavior from Northwestern University. His research explores and validates computational methods for analyzing subjective differences between manual labeling approaches using crowdsourced labor and automated labeling approaches using machine learning. In his private sector work, Scott has held a variety of critical data science roles at civic tech and responsible AI startups, where he developed products for Human-AI Collaboration and designed algorithm auditing processes. More recently, he served as General Manager for the AI Incident Database to improve the way we collect, annotate, and share data regarding AI harm.

Additional Information

We will post additional details as we have them.

References

[1] https://internetlivestats.com/internet-users
[2] Meyer, M.N., Basl, J., Choffnes, D. et al. Enhancing the ethics of user-sourced online data collection and sharing. Nat Comput Sci 3, 660–664 (2023). https://doi.org/10.1038/s43588-023-00490-7