Accepted Papers

The following tiny papers have been accepted to the EvalEval Workshop at NeurIPS 2024:

Oral Presentations

  • Provocation: Who benefits from “inclusion” in Generative AI?
    Samantha Dalal, Siobhan Mackenzie Hall, Nari Johnson

  • (Mis)use of nude images in machine learning research
    Arshia Arya, Princessa Cintaqia, Deepak Kumar, Allison McDonald, Lucy Qin, Elissa M Redmiles

  • Evaluating Refusal
    Shira Abramovich, Anna Ma

  • JMMMU: A Japanese Massive Multi-discipline Multimodal Understanding Benchmark
    Shota Onohara, Atsuyuki Miyai, Yuki Imajuku, Kazuki Egashira, Jeonghun Baek, Xiang Yue, Graham Neubig, Kiyoharu Aizawa

  • Critical human-AI use scenarios and interaction modes for societal impact evaluations
    Lujain Ibrahim, Saffron Huang, Lama Ahmad, Markus Anderljung

  • Cascaded to End-to-End: New Safety, Security, and Evaluation Questions for Audio Language Models
    Luxi He, Xiangyu Qi, Inyoung Cheong, Prateek Mittal, Danqi Chen, Peter Henderson

  • GenAI Maturity Evaluation Framework (GEMF)
    Yilin Zhang, Frank J Kanayet

  • AIR-Bench 2024: Safety Evaluation Based on Risk Categories
    Kevin Klyman

  • Evaluating Generative AI Systems is a Social Science Measurement Challenge
    Hanna Wallach, Meera Desai, Nicholas J Pangakis, A. Feder Cooper, Angelina Wang, Solon Barocas, Alexandra Chouldechova, Chad Atalla, Su Lin Blodgett, Emily Corvi, P. Alex Dow, Jean Garcia-Gathright, Alexandra Olteanu, Stefanie Reed, Emily Sheng, Dan Vann, Jennifer Wortman Vaughan, Matthew Vogel, Hannah Washington, Abigail Z Jacobs

Poster Presentations

  • Evaluations Using Wikipedia without Data Leakage: From Trusting Articles to Trusting Edit Processes
    Lucie-Aimée Kaffee, Isaac Johnson

  • Can Vision-Language Models Replace Human Annotators: A Case Study with CelebA Dataset
    Haoming Lu, Feifei Zhong

  • Using Scenario-Writing for Identifying and Mitigating Impacts of Generative AI
    Kimon Kieslich, Nicholas Diakopoulos, Natali Helberger

  • Troubling taxonomies in GenAI evaluation
    Glen Berman, Ned Cooper, Wesley Deng, Ben Hutchinson

  • Is ETHICS about ethics? Evaluating the ETHICS benchmark
    Leif Hancox-Li, Borhane Blili-Hamelin

  • Provocation on Expertise in Social Impact Evaluations for Generative AI (and Beyond)
    Zoe Kahn, Nitin Kohli

  • Rethinking CyberSecEval: An LLM-Aided Approach to Evaluation Critique
    Suhas Hariharan, Zainab Ali Majid, Jaime Raldua Veuthey, Jacob Haimes

  • Contamination Report for Multilingual Benchmarks
    Sanchit Ahuja, Varun Gumma, Sunayana Sitaram

  • Towards Leveraging News Media to Support Impact Assessment of AI Technologies
    Mowafak Allaham, Kimon Kieslich, Nicholas Diakopoulos

  • Motivations for Reframing Large Language Model Benchmarking for Legal Applications
    Riya Ranjan, Megan Ma

  • A Framework for Evaluating LLMs Under Task Indeterminacy
    Luke Guerdan, Hanna Wallach, Solon Barocas, Alexandra Chouldechova

  • Dimensions of Generative AI Evaluation Design
    P. Alex Dow, Jennifer Wortman Vaughan, Solon Barocas, Chad Atalla, Alexandra Chouldechova, Hanna Wallach

  • Statistical Bias in Bias Benchmark Design
    Hannah Powers, Ioana Baldini, Dennis Wei, Kristin Bennett

  • Rethinking Artistic Copyright Infringements in the Era of Text-to-Image Generative Models
    Mazda Moayeri, Samyadeep Basu, Sriram Balasubramanian, Priyatham Kattakinda, Atoosa Chegini, Robert Brauneis, Soheil Feizi

  • Gaps Between Research and Practice When Measuring Representational Harms Caused by LLM-Based Systems
    Emma Harvey, Emily Sheng, Su Lin Blodgett, Alexandra Chouldechova, Jean Garcia-Gathright, Alexandra Olteanu, Hanna Wallach

  • Surveying Surveys: Surveys’ Role in Evaluating AI’s Labor Market Impact
    Cassandra Duchan Solis

  • Fairness Dynamics During Training
    Krishna Patel, Nivedha Sivakumar, Barry-John Theobald, Luca Zappella, Nicholas Apostoloff

  • Democratic Perspectives and Corporate Captures of Crowdsourced Evaluations
    parth sarin, Michelle Bao

  • LLMs and Personalities: Inconsistencies Across Scales
    Tommaso Tosato, David Lemay, Mahmood Hegazy, Irina Rish, Guillaume Dumas

  • Assessing Bias in Metric Models for LLM Open-Ended Generation Bias Benchmarks
    Nathaniel Demchak, Xin Guan, Zekun Wu, Ziyi Xu, Adriano Koshiyama, Emre Kazim