Room impulse responses (RIRs) provide an indirect acoustic probe of scene geometry: as an agent moves, the recorded reverberation changes with nearby walls, openings, and free space. However, converting acoustic observations into maps is inherently ambiguous, since different reflector configurations can produce similar responses, especially when observations are sparse or motion is limited.
We study active acoustic scene reconstruction, where an agent must choose sensing poses that improve its geometric belief rather than follow a predefined scan. We introduce a cross-modal acoustic world model that encodes histories of RIRs and known poses into a motion-conditioned latent state used for both local occupancy decoding and future acoustic-latent prediction. At test time, candidate trajectories are rolled out in latent space, decoded into imagined occupancy maps, and scored by predicted map-space information gain. We construct a synthetic benchmark of paired acoustic trajectories, poses, and floor-plan geometry. Experiments show improved local acoustic-to-geometry reconstruction over geometric and passive baselines, and closed-loop mapping that matches or improves on frontier-based exploration while substantially reducing collisions.
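
The test-time scoring step can be illustrated with a minimal sketch. The abstract does not define how map-space information gain is computed, so the code below assumes one common choice: the reduction in total Bernoulli entropy between the current occupancy belief and each imagined occupancy map. The names `map_entropy` and `select_trajectory` are hypothetical, and the imagined maps are toy arrays standing in for decoded latent rollouts.

```python
import numpy as np

def map_entropy(p, eps=1e-9):
    """Total Bernoulli entropy (in nats) of an occupancy-probability grid."""
    p = np.clip(p, eps, 1.0 - eps)
    return float(-(p * np.log(p) + (1.0 - p) * np.log(1.0 - p)).sum())

def select_trajectory(current_belief, imagined_maps):
    """Pick the candidate whose imagined map reduces entropy the most.

    Assumption: information gain = H(current belief) - H(imagined map).
    Returns the index of the best candidate and the gain of every candidate.
    """
    h_now = map_entropy(current_belief)
    gains = np.array([h_now - map_entropy(m) for m in imagined_maps])
    return int(np.argmax(gains)), gains

# Toy example: a fully uncertain 4x4 belief and two imagined outcomes.
belief = np.full((4, 4), 0.5)
cand_a = belief.copy()
cand_a[:2, :] = 0.9   # candidate A resolves the top half of the map
cand_b = belief.copy()
cand_b[0, 0] = 0.9    # candidate B resolves a single cell
best, gains = select_trajectory(belief, [cand_a, cand_b])
```

In this toy case the first candidate is selected because it resolves more cells. A full system would obtain the imagined maps from the world model's decoder rather than constructing them by hand.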