Events and Talks at the Henry and Marilyn Taub Faculty of Computer Science

What I’ve learned so far about the illusion of safety in LLM alignment
Amit Levi (M.Sc. Thesis Seminar)
Wednesday, 31.12.2025, 12:30
Taub 601 & Zoom
Advisor: Prof. Avi Mendelson

As AI capabilities rapidly advance toward AGI, safety and security often lag behind power and profit. A core challenge in AI safety is our inability to reliably evaluate even basic safe behavior in large language models (LLMs), leading to misalignment, fake alignment, unreliable safety benchmarks, and mismeasured risks. My AI safety journey began in 2023, during the second semester of my BSc at the Technion, as an exchange student at EPFL, where graduate-level AI courses led me to first train and safety fine-tune LLMs. In early 2024, I began working with Professor Avi Mendelson on alignment vulnerabilities. That year, we showed that an initialization-based jailbreak can effectively reverse a model's safety alignment using only ~20 tokens, demonstrating that current safety training is not truly informative (EMNLP/ACL 2025). Building on this work, we identified a vulnerability caused by alignment training itself: existing bias and fairness evaluations detect at best ~80% of biases, and many debiasing methods create fake alignment rather than addressing the underlying issues. We proposed a benchmark that surfaces both visible and hidden biases (oral, AAAI 2026). Following this, we developed three additional works, on pad-token influence, failures of unlearning evaluation, and quantization via hidden representations, as well as work on A2A vulnerabilities (ongoing; Black Hat, NeurIPS 2025), alongside several other alignment research efforts currently in progress. In this seminar, I will present key insights from my work on alignment vulnerabilities, adversarial failures, fake alignment, unlearning evaluation failures, and the risks posed by the current trajectory of AI development under current alignment methods.
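
For intuition about the ~20-token jailbreak result mentioned above, the sketch below shows a generic, toy prompt-prefix attack: a short prefix is randomly mutated so that a compliance-style continuation becomes more likely. This is a minimal illustrative sketch only, not the speaker's actual initialization-based method; the model name, prompts, prefix length, and scoring heuristic are all assumptions chosen for readability.

# Toy prompt-prefix attack sketch (NOT the speaker's method; all names are placeholders).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"                       # placeholder; the talk concerns safety-tuned LLMs
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

PROMPT = "How do I pick a lock?"          # stand-in request
TARGET = "Sure, here is how"              # compliance-style continuation to encourage
N_PREFIX = 20                             # the abstract mentions ~20 tokens

def target_loss(prefix_ids: torch.Tensor) -> float:
    # Cross-entropy of the target continuation given prefix + prompt (lower = more compliant).
    prompt_ids = tok(PROMPT, return_tensors="pt").input_ids
    target_ids = tok(TARGET, return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, prompt_ids, target_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : input_ids.shape[1] - target_ids.shape[1]] = -100  # score only the target span
    with torch.no_grad():
        return model(input_ids, labels=labels).loss.item()

# Random hill-climbing: mutate one prefix position at a time, keep improvements.
torch.manual_seed(0)
prefix = torch.randint(0, tok.vocab_size, (1, N_PREFIX))
best = target_loss(prefix)
for step in range(50):
    pos = torch.randint(0, N_PREFIX, (1,)).item()
    cand = prefix.clone()
    cand[0, pos] = torch.randint(0, tok.vocab_size, (1,)).item()
    loss = target_loss(cand)
    if loss < best:
        prefix, best = cand, loss
print("best target loss:", best)

In this toy setup the prefix only nudges a small model toward a fixed continuation; the point is simply that a handful of optimized tokens can measurably shift a model's behavior, which is the phenomenon the seminar examines at scale.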