In our work, we introduce Arabic Contrastive Language-Image Pre-training (AraCLIP), a model designed for Arabic image retrieval tasks, built upon the Contrastive Language-Image Pre-training (CLIP) architecture. AraCLIP leverages Knowledge Distillation to transfer cross-modal knowledge from English to Arabic, enhancing its ability to understand Arabic text and retrieve relevant images. Unlike existing multilingual models, AraCLIP is uniquely positioned to understand the intricacies of the Arabic language, including specific terms, cultural nuances, and contextual constructs. Using the CLIP architecture as our foundation, we introduce an approach that integrates the textual and visual modalities, enabling AraCLIP to retrieve images from Arabic textual queries. We offer an online demonstration that lets users input Arabic prompts and compare AraCLIP's performance with state-of-the-art multilingual models. We conduct comprehensive experiments to evaluate AraCLIP across diverse datasets, including Arabic XTD-10 and Arabic Flickr8k. Our results show AraCLIP's superior image retrieval accuracy, demonstrating its effectiveness in handling Arabic queries. AraCLIP represents a significant advancement in cross-lingual image retrieval, offering promising applications in Arabic language processing and beyond.
The cross-lingual Arabic framework consists of three main stages, as shown below:
In Stage 1, we illustrate the English text model and the image model trained with contrastive learning. These models are strongly aligned in the embedding space, which is the assumption our approach builds on: given an image and its caption, their embeddings should show a high cosine similarity. They have already been trained with contrastive learning to capture the similarity between image-text pairs from their embeddings. Note that the English text model and the image model can be used separately; a minimal sketch of this alignment check is given below.
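As a concrete illustration of this alignment, the following sketch loads a standard CLIP checkpoint from Hugging Face (the checkpoint name and the example file name are assumptions for illustration, not necessarily the backbone used in AraCLIP) and computes the cosine similarity between an image and an English caption, using the text and image encoders separately:

```python
# Minimal sketch of the Stage 1 alignment check: an image and its English caption
# should land close together (high cosine similarity) in CLIP's shared space.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")        # illustrative checkpoint
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")                                        # any local image
caption = "a man surfing on the water with a windsurf board"

inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    # The two encoders can be called independently of each other.
    img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])

# Cosine similarity between L2-normalized embeddings.
img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
print((img_emb @ txt_emb.T).item())
```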
In Stage 2, we train the Arabic text model. We take the English text model from the first stage as the teacher model and a pre-trained Arabic language model, AraBERT in this work, as the student model. The teacher receives the English caption, and the student receives the translated and pre-processed Arabic caption. The Arabic text model (student) then learns to produce an embedding of the Arabic caption that matches the embedding of the English caption produced by the English model, and we minimize the Mean Squared Error (MSE) between the two output embeddings.
It is important to highlight that this approach differs from the original training objective of CLIP, which trains the model to correlate pairs of images and texts through cosine similarity. While cosine similarity could be used directly as the teacher-learning objective, prior research has shown that minimizing MSE yields a more informative learning signal. In this stage we use teacher-learning (Knowledge Distillation): the student model learns the teacher's features so that it becomes connected to the image model's feature space, as sketched below.
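The sketch below shows one such distillation step under our assumptions: the frozen CLIP text tower plays the teacher, AraBERT (mean-pooled and linearly projected to the teacher's dimension) plays the student, and the loss is the MSE between the two sentence embeddings. The checkpoint names and the projection head are illustrative choices, not necessarily the exact AraCLIP recipe.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer, CLIPTextModelWithProjection, CLIPTokenizer

# Frozen teacher: the English CLIP text encoder (illustrative checkpoint).
teacher = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-base-patch32").eval()
teacher_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

# Trainable student: AraBERT plus a linear projection into the teacher's embedding space.
student = AutoModel.from_pretrained("aubmindlab/bert-base-arabertv2")
student_tok = AutoTokenizer.from_pretrained("aubmindlab/bert-base-arabertv2")
proj = nn.Linear(student.config.hidden_size, teacher.config.projection_dim)

optimizer = torch.optim.AdamW(list(student.parameters()) + list(proj.parameters()), lr=2e-5)
mse = nn.MSELoss()

def train_step(english_caption: str, arabic_caption: str) -> float:
    # Teacher embedding of the English caption (no gradients, teacher stays frozen).
    with torch.no_grad():
        t_in = teacher_tok([english_caption], return_tensors="pt", padding=True, truncation=True)
        t_emb = teacher(**t_in).text_embeds                       # (1, projection_dim)

    # Student embedding of the Arabic caption: mean-pool token states, then project.
    s_in = student_tok([arabic_caption], return_tensors="pt", padding=True, truncation=True)
    hidden = student(**s_in).last_hidden_state                    # (1, seq_len, hidden_size)
    mask = s_in["attention_mask"].unsqueeze(-1).float()
    pooled = (hidden * mask).sum(1) / mask.sum(1)
    s_emb = proj(pooled)                                          # (1, projection_dim)

    # Pull the student's Arabic embedding toward the teacher's English embedding.
    loss = mse(s_emb, t_emb)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because only the text side is retrained, the original image encoder is left untouched, and the student inherits its connection to the image features through the teacher.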
Stage 3 covers the evaluation of the trained Arabic text model (student model) together with the image model, to check its performance in matching Arabic text to images. In our work, we evaluated it on different datasets (Arabic XTD-10 and Arabic Flickr8k) using several retrieval metrics; a sketch of the Recall@k computation is given below.
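As an illustration of this evaluation, the sketch below computes Recall@k for image retrieval, assuming precomputed Arabic text embeddings and image embeddings for a test set in which caption i belongs to image i. The specific metric choice (Recall@1/5/10) is a common image-retrieval protocol and is assumed here for illustration.

```python
import torch

def recall_at_k(text_emb: torch.Tensor, image_emb: torch.Tensor, k: int) -> float:
    # Rank all images for every caption by cosine similarity and check whether
    # the matching image (same index as the caption) appears in the top k.
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    sims = text_emb @ image_emb.T                      # (N_text, N_image) cosine similarities
    topk = sims.topk(k, dim=-1).indices                # indices of the k best images per caption
    targets = torch.arange(text_emb.size(0)).unsqueeze(-1)
    return (topk == targets).any(dim=-1).float().mean().item()

# Usage (with embeddings produced by the Arabic text model and the image model):
# for k in (1, 5, 10):
#     print(f"Recall@{k}:", recall_at_k(arabic_text_embeddings, image_embeddings, k))
```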
For training, we used the subset of (synthetic) Conceptual Captions that has been filtered by ViT-L.
All datasets used in this project can be found under Arabic-Clip on Hugging Face.
The figure below shows a result from the complex category, where the sentence carries some complexity: رجل يتزلج على الماء بلوح شراعي (a man surfs on the water with a windsurf board). For this example, AraCLIP was better than mCLIP at retrieving images related to the query: AraCLIP retrieved images that contain a sail, while mCLIP failed to do so. Both models retrieve images containing some of the objects mentioned in the input text.
In the figure below, we provide an example of a sentence whose objects appear clearly in the retrieved images: كلب يهاجم قطة بينما القطة تحت مقعد خشبي (a dog attacks a cat while the cat is under a wooden bench). Our model was able to capture the objects even though they are not clear in the image: it ranked the related image highest, and the other retrieved images contain objects related to the input sentence. mCLIP, in contrast, struggles to place the correct image first, although it does retrieve it among its later results.
@inproceedings{al2024araclip,
title={AraCLIP: Cross-Lingual Learning for Effective Arabic Image Retrieval},
author={Al-Barham, Muhammad and Afyouni, Imad and Almubarak, Khalid and Elnagar, Ashraf and Turky, Ayad and Hashem, Ibrahim},
booktitle={Proceedings of The Second Arabic Natural Language Processing Conference},
pages={102--110},
year={2024}
}