9:00 - 9:15 AM | ☕ Coffee
9:15 - 9:30 AM | 👋 Welcome and Introduction
- Opening Remarks
- Overview of Workshop Structure and Objectives

9:30 - 10:30 AM | 🎤 Opening Panel: Reflections on the Landscape
- Panel Discussion on AI Evaluation Challenges
- Panelists: Abeba Birhane, Su Lin Blodgett, Abigail Jacobs, Lee Wan Sie
- Topics:
  - Underlying frameworks and incentive structures
  - Defining robust evaluations and contextual challenges
  - Multimodal evaluation needs (text, images, audio, video)
- Q&A

10:30 - 11:30 AM | 💭 Oral Session 1: Provocations and Ethics in AI Evaluation
- Presentations (25 min):
  - "Provocation: Who benefits from 'inclusion' in Generative AI?"
  - "(Mis)use of nude images in machine learning research"
  - "Evaluating Refusal"
- Breakout (35 min):
  - Group Discussion (20 min): Ethics and Bias in Evaluation Design, Refusal and Boundary Setting, Research Ethics and Data Usage
  - Report Back (15 min)

11:30 AM - 12:30 PM | 🌏 Oral Session 2: Multimodal and Cross-Cultural Evaluation Methods
- Presentations (25 min):
  - "JMMMU: A Japanese Massive Multi-discipline Multimodal Understanding Benchmark"
  - "Critical human-AI use scenarios and interaction modes for societal impact evaluations"
  - "Cascaded to End-to-End: New Safety, Security, and Evaluation Questions for Audio Language Models"
- Breakout (35 min):
  - Group Discussion (20 min): Language, Image, Audio, Video, Cross-Culture
  - Report Back (15 min)

12:30 - 2:30 PM | 🍽️ Lunch and Poster Session
- 12:30 - 1:15 PM: Lunch and Networking
- 1:15 - 2:30 PM: Poster Presentations

2:30 - 3:00 PM | 📊 Oral Session 3: Systematic Approaches to AI Impact Assessment
- Presentations:
  - "GenAI Evaluation Maturity Framework (GEMF)"
  - "AIR-Bench 2024: Safety Evaluation Based on Risk Categories"
  - "Evaluating Generative AI Systems is a Social Science Measurement Challenge"

3:00 - 3:30 PM | 🔄 Break

3:30 - 4:05 PM | 💡 Oral Session 3 Breakout
- Group Discussion (20 min):
  - Choosing Evaluations: Selecting relevant evaluations from a large repository
  - Reviewing Tools and Datasets: Assessment of current tools and gaps
  - Evaluating Reliability and Validity: Exploring construct validity and ranking methods
- Report Back (15 min)

4:05 - 5:00 PM | 🤝 What's Next? Coalition Development
- Recap and Teasers (15 min):
  - Overview of coalition groups
- Interactive Discussion (40 min):
  - Measurement Modeling
  - Developing Criteria for Evaluating Evaluations
  - Documentation: Creating Proposed Documentation Standards
  - Eval Repository: Building Out Resource Repositories
  - Scorecard/Checklist: Conducting Reviews and Publishing Annual Scorecards

5:00 - 5:30 PM | 👋 Closing Session
- Summary of Key Insights and Next Steps