Accepted Papers
The following tiny papers have been accepted to the EvalEval Workshop at NeurIPS 2024:
Oral Presentations
- Provocation: Who benefits from “inclusion” in Generative AI?, (Slides)
  Samantha Dalal, Siobhan Mackenzie Hall, Nari Johnson
- (Mis)use of Nude Images in Machine Learning Research, (Slides)
  Arshia Arya, Princessa Cintaqia, Deepak Kumar, Allison McDonald, Lucy Qin, Elissa M Redmiles
- Evaluating Refusal, (Slides)
  Shira Abramovich, Anna Ma
- JMMMU: A Japanese Massive Multi-discipline Multimodal Understanding Benchmark, (Slides)
  Shota Onohara, Atsuyuki Miyai, Yuki Imajuku, Kazuki Egashira, Jeonghun Baek, Xiang Yue, Graham Neubig, Kiyoharu Aizawa
- Critical human-AI use scenarios and interaction modes for societal impact evaluations
  Lujain Ibrahim, Saffron Huang, Lama Ahmad, Markus Anderljung
- Cascaded to End-to-End: New Safety, Security, and Evaluation Questions for Audio Language Models, (Slides)
  Luxi He, Xiangyu Qi, Inyoung Cheong, Prateek Mittal, Danqi Chen, Peter Henderson
- GenAI Evaluation Maturity Framework (GEMF), (Slides)
  Yilin Zhang, Frank J Kanayet
- AIR-BENCH 2024: Safety Evaluation Based on Risk Categories from Regulations and Policies
  Yi Zeng, Yu Yang, Andy Zhou, Jeffrey Ziwei Tan, Yuheng Tu, Yifan Mai, Kevin Klyman, Minzhou Pan, Ruoxi Jia, Dawn Song, Percy Liang, Bo Li
- Evaluating Generative AI Systems is a Social Science Measurement Challenge
  Hanna Wallach, Meera Desai, Nicholas J Pangakis, A. Feder Cooper, Angelina Wang, Solon Barocas, Alexandra Chouldechova, Chad Atalla, Su Lin Blodgett, Emily Corvi, P. Alex Dow, Jean Garcia-Gathright, Alexandra Olteanu, Stefanie Reed, Emily Sheng, Dan Vann, Jennifer Wortman Vaughan, Matthew Vogel, Hannah Washington, Abigail Z Jacobs
Poster Presentations
- Evaluations Using Wikipedia without Data Contamination: From Trusting Articles to Trusting Edit Processes
  Lucie-Aimée Kaffee, Isaac Johnson
- Can Vision-Language Models Replace Human Annotators: A Case Study with CelebA Dataset
  Haoming Lu, Feifei Zhong
- Using Scenario-Writing for Identifying and Mitigating Impacts of Generative AI, (Poster)
  Kimon Kieslich, Nicholas Diakopoulos, Natali Helberger
- Troubling taxonomies in GenAI evaluation
  Glen Berman, Ned Cooper, Wesley Deng, Ben Hutchinson
- Is ETHICS about ethics? Evaluating the ETHICS benchmark
  Leif Hancox-Li, Borhane Blili-Hamelin
- Provocation on Expertise in Social Impact Evaluations for Generative AI (and Beyond)
  Zoe Kahn, Nitin Kohli
- Rethinking CyberSecEval: An LLM-Aided Approach to Evaluation Critique, (Poster)
  Suhas Hariharan, Zainab Ali Majid, Jaime Raldua Veuthey, Jacob Haimes
- Contamination Report for Multilingual Benchmarks
  Sanchit Ahuja, Varun Gumma, Sunayana Sitaram
- Towards Leveraging News Media to Support Impact Assessment of AI Technologies
  Mowafak Allaham, Kimon Kieslich, Nicholas Diakopoulos
- Motivations for Reframing Large Language Model Benchmarking for Legal Applications, (Poster)
  Riya Ranjan, Megan Ma
- A Framework for Evaluating LLMs Under Task Indeterminacy
  Luke Guerdan, Hanna Wallach, Solon Barocas, Alexandra Chouldechova
- Dimensions of Generative AI Evaluation Design
  P. Alex Dow, Jennifer Wortman Vaughan, Solon Barocas, Chad Atalla, Alexandra Chouldechova, Hanna Wallach
- Statistical Bias in Bias Benchmark Design
  Hannah Powers, Ioana Baldini, Dennis Wei, Kristin Bennett
- Rethinking Artistic Copyright Infringements in the Era of Text-to-Image Generative Models
  Mazda Moayeri, Samyadeep Basu, Sriram Balasubramanian, Priyatham Kattakinda, Atoosa Chegini, Robert Brauneis, Soheil Feizi
- Gaps Between Research and Practice When Measuring Representational Harms Caused by LLM-Based Systems, (Poster)
  Emma Harvey, Emily Sheng, Su Lin Blodgett, Alexandra Chouldechova, Jean Garcia-Gathright, Alexandra Olteanu, Hanna Wallach
- Surveying Surveys: Surveys’ Role in Evaluating AI’s Labor Market Impact
  Cassandra Duchan Solis, Pamela Mishkin
- Fairness Dynamics During Training
  Krishna Patel, Nivedha Sivakumar, Barry-John Theobald, Luca Zappella, Nicholas Apostoloff
- Democratic Perspectives and Institutional Capture of Crowdsourced Evaluations, (Poster)
  parth sarin, Michelle Bao
- LLMs and Personalities: Inconsistencies Across Scales
  Tosato Tommaso, Lemay David, Mahmood Hegazy, Irina Rish, Guillaume Dumas
- Assessing Bias in Metric Models for LLM Open-Ended Generation Bias Benchmarks
  Nathaniel Demchak, Xin Guan, Zekun Wu, Ziyi Xu, Adriano Koshiyama, Emre Kazim