Abstract: Given a query in one modality, video-text retrieval aims to retrieve the most similar samples from the database in another modality. The primary challenge lies in the alignment of ...