
Events and Talks at the Henry and Marilyn Taub Faculty of Computer Science

Uncertainty Estimation of Large Language Models in Long-Form Generation via Deterministic Labeling
Ido Amit (M.Sc. Seminar Talk)
Wednesday, 18.03.2026, 14:00
Advisor: Prof. Ran El-Yaniv

As LLMs generate increasingly long outputs, effective uncertainty estimation must identify errors at a fine-grained level rather than discard entire responses. While such methods exist, evaluating uncertainty at any resolution (from a single token to an entire generation) is challenging and highly sensitive to label imperfections, making zero-noise benchmarks essential; yet long-form generation benchmarks tend to rely on fallible labels rather than deterministic ground truth.

We introduce Single-answer Atomic Long-form Target (SALT), a benchmark of six procedurally generated tasks, each with a single deterministic long-form textual ground truth, enabling unit-level evaluation of correctness, calibration, and ranking without external judges. Equipped with SALT, our analysis of 50+ LLMs reveals key insights: we identify which confidence functions dominate each uncertainty aspect and show that effective ranking benefits more from coarser evaluation resolutions. SALT further enables precise calibration tracking throughout generation, revealing a divergence in the accuracy–calibration relationship: high- and low-performing models exhibit calibration degradation ($\rho=0.87$) and improvement ($\rho=-0.92$), respectively.
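To illustrate the kind of unit-level calibration measurement such deterministic labels make possible, here is a minimal sketch of a binned expected calibration error (ECE) over per-unit confidences and correctness labels. This is a generic, hypothetical illustration, not code from SALT; the function name and binning scheme are assumptions.

```python
# Hypothetical sketch: unit-level calibration measurement, assuming each
# generated unit carries a confidence score in [0, 1] and a deterministic
# 0/1 correctness label (as a zero-noise benchmark would supply).
# Not taken from the SALT codebase; names and binning are illustrative.

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean gap between average confidence and accuracy."""
    n = len(confidences)
    assert n == len(correct) and n > 0
    bins = [[] for _ in range(n_bins)]
    for conf, label in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0 into last bin
        bins[idx].append((conf, label))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        avg_acc = sum(y for _, y in b) / len(b)
        ece += (len(b) / n) * abs(avg_conf - avg_acc)
    return ece
```

Tracking this quantity over successive prefixes of a generation is one simple way to observe calibration drifting as output length grows.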

Finally, we demonstrate that reasoning, whether elicited via Chain-of-Thought prompting or internalized through training, introduces a trade-off: it improves accuracy while degrading confidence ranking. These findings directly impact risk-critical applications requiring reliable error identification and mitigation.