All times are in Pacific Time (Vancouver, BC local time)

Time | Session | Description
9:00 - 9:15 AM ☕ Coffee
9:15 - 9:30 AM 👋 Welcome and Introduction
  • Opening Remarks
  • Overview of Workshop Structure and Objectives
9:30 - 10:30 AM 🎤 Opening Panel: Reflections on the Landscape
  • Panel Discussion on AI Evaluation Challenges
  • Panelists: Abeba Birhane, Su Lin Blodgett, Abigail Jacobs, Lee Wan Sie
  • Topics:
    • Underlying frameworks and incentive structures
    • Defining robust evaluations and contextual challenges
    • Multimodal evaluation needs (text, images, audio, video)
  • Q&A
10:30 - 11:30 AM 💭 Oral Session 1: Provocations and Ethics in AI Evaluation
  • Presentations (25 min):
    • "Provocation: Who benefits from 'inclusion' in Generative AI?"
    • "(Mis)use of nude images in machine learning research"
    • "Evaluating Refusal"
  • Breakout (35 min):
    • Group Discussion (20 min):
      • Ethics and Bias in Evaluation Design
      • Refusal and Boundary Setting
      • Research Ethics and Data Usage
    • Report Back (15 min)
11:30 AM - 12:30 PM 🌏 Oral Session 2: Multimodal and Cross-Cultural Evaluation Methods
  • Presentations (25 min):
    • "JMMMU: A Japanese Massive Multi-discipline Multimodal Understanding Benchmark"
    • "Critical human-AI use scenarios and interaction modes for societal impact evaluations"
    • "Cascaded to End-to-End: New Safety, Security, and Evaluation Questions for Audio Language Models"
  • Breakout (35 min):
    • Group Discussion (20 min): Language, Image, Audio, Video, Cross-Cultural Evaluation
    • Report Back (15 min)
12:30 - 2:30 PM 🍽️ Lunch and Poster Session
  • 12:30 - 1:15 PM: Lunch and Networking
  • 1:15 - 2:30 PM: Poster Presentations
2:30 - 3:00 PM 📊 Oral Session 3: Systematic Approaches to AI Impact Assessment
  • Presentations:
    • "GenAI Evaluation Maturity Framework (GEMF)"
    • "AIR-Bench 2024: Safety Evaluation Based on Risk Categories"
    • "Evaluating Generative AI Systems is a Social Science Measurement Challenge"
3:00 - 3:30 PM 🔄 Break
3:30 - 4:05 PM 💡 Oral Session 3 Breakout
  • Group Discussion (20 min):
    • Choosing Evaluations: Selecting relevant evaluations from a large repository
    • Reviewing Tools and Datasets: Assessment of current tools and gaps
    • Evaluating Reliability and Validity: Exploring construct validity and ranking methods
  • Report Back (15 min)
4:05 - 5:00 PM 🤝 What's Next? Coalition Development
  • Recap and Teasers (15 min):
    • Overview of coalition groups
  • Interactive Discussion (40 min):
    • Measurement Modeling
    • Developing Criteria for Evaluating Evaluations
    • Documentation: Creating Proposed Documentation Standards
    • Eval Repository: Building Out Resource Repositories
    • Scorecard/Checklist: Conducting Reviews and Publishing Annual Scorecards
5:00 - 5:30 PM 👋 Closing Session
  • Summary of Key Insights and Next Steps