Invited Speakers and Program

Invited Speakers

Anna Rogers, University of Copenhagen

What kinds of questions have we been asking? A taxonomy for QA/RC benchmarks

This talk provides an overview of the current landscape of resources for Question Answering and Reading Comprehension, highlighting the current lacunae for future work. I will also present a new taxonomy of "skills" targeted by QA/RC datasets and discuss various ways in which questions may be unanswerable.

Sam Bowman, Assistant Professor, New York University & Visiting Researcher (Sabbatical), Anthropic

Why Adversarially-Collected Test Sets Don’t Work as Benchmarks

Dynamic and/or adversarial data collection can be quite useful as a way of collecting training data for machine-learning models, identifying the conditions under which these models fail, and conducting online head-to-head comparisons between models. However, it is essentially impossible to use these practices to build usable static benchmark datasets for evaluating or comparing future models. I defend this position using a mix of conceptual and empirical arguments, focusing on the claims (i) that adversarial data collection can skew the distribution of phenomena such as to make it unrepresentative of the intended task, and (ii) that adversarial data collection can arbitrarily shift the rankings of models on its resulting test sets to disfavor systems that are qualitatively similar to the current state of the art.

Jordan Boyd-Graber, Associate Professor, University of Maryland at College Park

Incentives for Experts to Create Adversarial QA and Fact-Checking Examples

I'll discuss two examples of our work putting experienced writers in front of a retrieval-driven adversarial authoring system: question writing and fact-checking. For question answering, we develop a retrieval-based adversarial authoring platform and create incentives to get people to use our system in the first place, write interesting questions humans can answer, and challenge a QA system. While the best humans lose to computer QA systems on normal questions, computers struggle to answer our adversarial questions. We then turn to fact checking, creating a new game (Fool Me Twice) to solicit difficult-to-verify claims (which can be either true or false) and to test how difficult the claims are both for humans and computers. We argue that the focus on retrieval is important for knowledge-based adversarial examples because it highlights diverse information, prevents frustration in authors, and takes advantage of users' expertise.

Lora Aroyo, Research Scientist, Google

Data Excellence: Better Data for Better AI

The efficacy of machine learning (ML) models depends on both algorithms and data. Training data defines what we want our models to learn, and testing data provides the means by which their empirical progress is measured. Benchmark datasets define the entire world within which models exist and operate, yet research continues to focus on critiquing and improving the algorithmic aspect of the models rather than critiquing and improving the data with which our models operate. If “data is the new oil,” we are still missing work on the refineries by which the data itself could be optimized for more effective use. In this talk, I will discuss data excellence and lessons learned from software engineering to achieve the scale and rigor in assessing data quality.

Sherry Tongshuang Wu, Assistant Professor, Carnegie Mellon University (CMU HCII)

Model-in-the-loop Data Collection: What Roles does the Model Play?

Assistive models have been shown to be useful for supporting humans in creating challenging datasets, but how exactly do they help? In this talk, I will discuss different roles of assistive models in counterfactual data collection (i.e., perturbing existing text inputs to gain insight into task model decision boundaries), and the characteristics associated with these roles. I will use three examples (CheckList, Polyjuice, Tailor) to demonstrate how our objectives shift when we perturb texts for evaluation, explanation, and improvement, and how that changes the corresponding assistive models from enhancing human goals (requiring model controllability) to competing with human bias (requiring careful data reranking). I will conclude by exploring additional roles that these models can play to become more effective.

Program (14th July, 2022)

09:00 – 09:10: Opening remarks

09:10 – 09:45: Invited Talk 1: Anna Rogers

09:45 – 10:20: Invited Talk 2: Jordan Boyd-Graber

10:20 – 10:35: Collaborative Progress: MLCommons Introduction

10:35 – 10:50: Coffee Break

10:50 – 11:10: Best Paper Talk: Margaret Li and Julian Michael

11:10 – 11:45: Invited Talk 3: Sam Bowman

11:45 – 12:20: Invited Talk 4: Sherry Tongshuang Wu

12:20 – 13:20: Lunch

13:20 – 13:55: Invited Talk 5: Lora Aroyo

13:55 – 14:55: Panel on The Future of Data Collection moderated by Adina Williams. Panelists: Anna Rogers, Jordan Boyd-Graber, Sam Bowman, Sherry Tongshuang Wu, Lora Aroyo, Douwe Kiela & Swabha Swayamdipta.

14:55 – 15:10: Coffee Break

15:10 – 15:20: Shared Task Introduction: Max Bartolo

15:20 – 15:30: Shared Task Presentations: Team Fireworks

15:30 – 15:40: Shared Task Presentations: Team Longhorns

15:40 – 15:50: Shared Task Presentations: Team Supersamplers

15:50 – 16:50: Poster Session

16:50 – 17:00: Closing Remarks

18:30 – 21:30: The DADC Social Event
