Congratulations to Rom Himelstein and Amit Levi, graduate students in the Faculty of Computer Science, supervised by Prof. Avi Mendelson and Dr. Brit Youngmann, on receiving an Honorable Mention in the Israeli National AI Safety Research Award.
Their paper, "Silenced Biases: The Dark Side LLMs Learned to Refuse," was presented at AAAI 2026 in Singapore and was selected for an oral presentation from among tens of thousands of submissions.
In their research, the team uncovered a significant flaw in the way bias is commonly measured in large language models (LLMs). They demonstrated that, in many cases, models appear to be "fair" simply because they refuse to answer sensitive questions, while biases and stereotypes remain embedded in their internal representations.
To address this issue, the researchers developed a novel method for measuring these hidden biases more accurately, even when a model attempts to avoid answering. They showed that many existing benchmarks—widely used both in academic research and by leading technology companies—may significantly underestimate the level of bias present in AI models.
For example, to evaluate whether a model associates certain professions with men or women, or positive and negative traits with different national, religious, or ethnic groups, the model is presented with multiple-choice questions. One of the answer options is typically "Cannot be determined based on the given information." When the model selects this option, conventional evaluation metrics often interpret it as evidence of fairness.
However, the researchers found that a model's refusal to answer does not necessarily indicate the absence of bias, nor does it accurately reflect the knowledge or preferences encoded in its internal representations.
By prompting models to provide answers instead of refusing, the researchers uncovered substantial stereotypes that remained hidden under conventional evaluation methods.
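The measurement gap described above can be illustrated with a minimal sketch. This is not the researchers' code or their metric. The answer labels, the `stereotype_rate` helper, and the toy answer lists are all hypothetical, chosen only to show how counting "cannot be determined" as a fair answer can hide a preference that appears once the model commits to an answer.

```python
# Illustrative sketch only: labels, helper, and data are hypothetical,
# not taken from the paper or from the Silenced Bias Benchmark.

UNKNOWN = "unknown"  # stands in for "Cannot be determined based on the given information."


def stereotype_rate(answers):
    """Fraction of all questions answered with the stereotyped option.

    Any other answer, including UNKNOWN, counts as not biased, which is
    how a refusal ends up being read as evidence of fairness.
    """
    if not answers:
        return 0.0
    return sum(a == "stereotyped" for a in answers) / len(answers)


# Toy data: the same ten questions, answered twice by a hypothetical model.
# First pass: the model is free to pick the "cannot be determined" option.
default_answers = [UNKNOWN] * 9 + ["stereotyped"]
# Second pass: the model is made to commit to one of the substantive options.
forced_answers = ["stereotyped"] * 8 + ["anti_stereotyped"] * 2

apparent = stereotype_rate(default_answers)  # what a conventional metric sees
hidden = stereotype_rate(forced_answers)     # what surfaces without refusals

print(f"apparent bias: {apparent:.2f}")
print(f"hidden bias:   {hidden:.2f}")
print(f"gap:           {hidden - apparent:.2f}")
```

In this toy case the model looks nearly unbiased when it is allowed to decline, yet picks the stereotyped option in most questions once it has to choose.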
Based on these findings, they developed a new evaluation methodology and introduced the Silenced Bias Benchmark (SBB), a benchmark containing thousands of questions spanning dozens of topics and a wide range of demographic groups. Using SBB, they evaluated several families of large language models and demonstrated a significant gap between a model's outward behavior and the hidden preferences embedded within it.
Congratulations on this outstanding achievement!