Artificial Intelligence (AI), particularly neural networks, has become central to a wide array of applications — from language modeling to text-to-image generation. Despite these achievements, ensuring the robustness of AI models remains a significant challenge. Robustness refers to the ability of models to maintain performance across diverse inputs and avoid issues such as out-of-distribution failures, generation of harmful or incorrect content, and the propagation of social biases. Addressing robustness is crucial for deploying reliable AI systems in real-world scenarios.
Motivated by these challenges, this thesis aims to improve the understanding, evaluation, and ultimately the robustness of AI models through interpretability-based methods. Interpretability research, which seeks to elucidate the decision-making processes of these models, offers a promising, customizable, and cost-effective pathway to addressing these challenges. In this seminar, I will present our research on enhancing AI robustness by applying insights from interpretability studies, focusing on mitigating social biases, reducing harmful content, improving adaptability, and addressing hallucinations.