Proteins are fundamental to biological systems, and accurately representing them is essential both for understanding biological function and for drug discovery. Recent protein language models learn representations from amino acid sequences alone, yet proteins are inherently multidimensional, characterized by structure, dynamics, and molecular interactions. This thesis investigates how integrating multidimensional protein knowledge can enhance protein language models and improve biological understanding.
We introduce GOProteinGNN for integrating protein knowledge graphs, FusionProt for sequence–structure fusion, ProtLigand for leveraging protein–ligand interactions, and DynamicsPLM for modeling conformational dynamics. Across diverse biological tasks, these approaches improve performance and produce biologically meaningful representations.
This research was also applied in a drug discovery laboratory, demonstrating the practical value of multidimensional protein representations. Overall, this thesis shows that integrating functional, structural, dynamic, and interaction-based information substantially enhances protein representation learning and supports advances in biomedical research.