Vision-Language Models (VLMs) such as CLIP have transformed the field by enabling joint reasoning across modalities, zero-shot transfer, and improved multimodal alignment. Despite their success and widespread adoption, embeddings derived from CLIP exhibit notable limitations, including weak object binding, poor relation comprehension, and limited interpretability stemming from entangled feature representations. This work, conducted under the supervision of Prof. Guy Gilboa, investigates methods for decomposing and analyzing CLIP's embedding space using a range of statistical and decomposition techniques. Our approach seeks to improve performance, interpretability, and robustness across multiple applications, including image classification and image editing. By addressing these foundational representational challenges, this research contributes to a deeper understanding of multimodal embedding geometry and advances the interpretability of modern VLMs.
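To make the idea of decomposing the embedding space concrete, the minimal sketch below extracts CLIP text embeddings and applies PCA to inspect their principal directions. This is an illustrative example only, not the project's actual pipeline: the model name ("openai/clip-vit-base-patch32"), the prompt set, and the choice of PCA as the decomposition method are assumptions made for the sake of the example.

```python
# Illustrative sketch: inspect the structure of CLIP embeddings with PCA.
# Assumptions: Hugging Face transformers CLIP checkpoint and scikit-learn PCA;
# the actual project may use different models and decomposition techniques.
import torch
from sklearn.decomposition import PCA
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

prompts = [
    "a photo of a cat", "a photo of a dog", "a sketch of a cat",
    "a sketch of a dog", "a photo of a red car", "a photo of a blue car",
]
inputs = tokenizer(prompts, padding=True, return_tensors="pt")

with torch.no_grad():
    # (num_prompts, 512) embeddings from CLIP's text projection head
    emb = model.get_text_features(**inputs)
# CLIP compares embeddings on the unit sphere, so normalize first
emb = emb / emb.norm(dim=-1, keepdim=True)

# Decompose the centered embeddings into principal directions and check how
# much variance each direction explains -- a first look at entanglement.
pca = PCA(n_components=5)
coords = pca.fit_transform(emb.numpy())
print("explained variance ratio:", pca.explained_variance_ratio_)
print("low-dimensional coordinates:\n", coords)
```

In this kind of analysis, the explained-variance spectrum and the per-prompt coordinates give a rough picture of how semantic factors (object category, style, color) are distributed across embedding directions, which motivates the more targeted decomposition techniques studied in the project.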