Vishal Srivastava
The paper shows that black-box safety evaluations of AI systems have fundamental limitations in predicting deployment risk, especially when a model's behavior depends on unobserved (latent) variables that are rare during evaluation but common during deployment.
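As a rough way to see why rarity matters, here is a minimal formalization (the notation is mine, not the paper's). Suppose a latent variable $h \in \{0, 1\}$ triggers unsafe behavior exactly when $h = 1$, with

$$
P_{\text{eval}}(h = 1) = \epsilon, \qquad P_{\text{deploy}}(h = 1) = p \gg \epsilon.
$$

Then after $n$ black-box queries drawn from the evaluation distribution, the probability that the evaluation ever observes the misbehavior is

$$
1 - (1 - \epsilon)^n \le n\epsilon,
$$

which stays near zero for any feasible query budget $n$, even while the per-query failure rate in deployment is $p$.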
When testing AI systems, we often assume that performance in a controlled evaluation will predict behavior in real-world deployment. This paper shows that the assumption can fail whenever the AI's behavior depends on hidden factors that are rare or absent during testing but present during actual use. Under those conditions, no black-box evaluation, meaning one that only queries the model without inspecting its internals, can fully predict deployment risk. The authors conclude that complementary measures, such as greater model transparency and runtime monitoring, are needed to ensure safety.
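The same point can be seen in a toy simulation. The sketch below is my own illustration, not anything from the paper: the `model_response` function and the trigger probabilities are made up, and the latent trigger stands in for whatever hidden factor separates evaluation from deployment. A black-box evaluator querying under the evaluation distribution estimates a near-zero failure rate, while the identical model fails frequently under the deployment distribution.

```python
import random

random.seed(0)

def model_response(trigger_present: bool) -> str:
    """Toy model whose behavior depends on a latent variable that a
    black-box evaluator cannot observe from inputs and outputs alone.
    (Hypothetical stand-in, not the paper's construction.)"""
    return "unsafe" if trigger_present else "safe"

def estimated_failure_rate(p_trigger: float, n_queries: int = 200_000) -> float:
    """Black-box estimate: query the model n_queries times on inputs
    drawn from a distribution where the latent trigger appears with
    probability p_trigger."""
    failures = sum(
        model_response(random.random() < p_trigger) == "unsafe"
        for _ in range(n_queries)
    )
    return failures / n_queries

# The trigger is rare during evaluation but common during deployment.
print(f"evaluation estimate: {estimated_failure_rate(1e-6):.6f}")  # ~0.000000
print(f"deployment estimate: {estimated_failure_rate(0.3):.6f}")   # ~0.300000
```

No realistic number of evaluation queries closes this gap, which is the intuition behind the paper's call for measures beyond black-box testing.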