Andreas Koukounas
Georgios Mastrapas
Michael Günther
Bo Wang
Scott Martens
Isabelle Mohr
Saba Sturua
Mohammad Kalim Akram
Joan Fontanals Martínez
Saahil Ognawala
Susana Guzman
Maximilian Werk
Nan Wang
Han Xiao
Bajaj et al. (2016) Bajaj, P., Campos, D., Craswell, N., Deng, L., Gao, J., Liu, X., Majumder, R., McNamara, A., Mitra, B., Nguyen, T., et al. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. arXiv preprint arXiv:1611.09268, 2016. URL https://arxiv.org/abs/1611.09268.
Bowman et al. (2015) Bowman, S., Angeli, G., Potts, C., and Manning, C. D. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 632–642, 2015. doi: 10.18653/v1/D15-1075. URL https://aclanthology.org/D15-1075.
Chen et al. (2024) Chen, J., Xiao, S., Zhang, P., Luo, K., Lian, D., and Liu, Z. BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation. arXiv preprint arXiv:2402.03216, 2024. URL https://arxiv.org/abs/2402.03216.
Chen et al. (2023) Chen, L., Li, J., Dong, X., Zhang, P., He, C., Wang, J., Zhao, F., and Lin, D. ShareGPT4V: Improving Large Multi-Modal Models with Better Captions. arXiv preprint arXiv:2311.12793, 2023. URL https://arxiv.org/abs/2311.12793.
Chen et al. (2015) Chen, X., Fang, H., Lin, T., Vedantam, R., Gupta, S., Dollár, P., and Zitnick, C. L. Microsoft COCO Captions: Data Collection and Evaluation Server. arXiv preprint arXiv:1504.00325, 2015. URL http://arxiv.org/abs/1504.00325.
Cherti et al. (2023) Cherti, M., Beaumont, R., Wightman, R., Wortsman, M., Ilharco, G., Gordon, C., Schuhmann, C., Schmidt, L., and Jitsev, J. Reproducible Scaling Laws for Contrastive Language-Image Learning. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2818–2829, 2023. doi: 10.1109/CVPR52729.2023.00276. URL https://doi.ieeecomputersociety.org/10.1109/CVPR52729.2023.00276.
Devlin et al. (2019) Devlin, J., Chang, M., Lee, K., and Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Burstein, J., Doran, C., and Solorio, T. (eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, pp. 4171–4186. Association for Computational Linguistics, 2019. doi: 10.18653/V1/N19-1423. URL https://doi.org/10.18653/v1/n19-1423.
Fang et al. (2023) Fang, Y., Sun, Q., Wang, X., Huang, T., Wang, X., and Cao, Y. EVA-02: A Visual Representation for Neon Genesis. arXiv preprint arXiv:2303.11331, 2023. URL https://arxiv.org/abs/2303.11331.
Günther et al. (2023) Günther, M., Mastrapas, G., Wang, B., Xiao, H., and Geuter, J. Jina Embeddings: A Novel Set of High-Performance Sentence Embedding Models. In Tan, L., Milajevs, D., Chauhan, G., Gwinnup, J., and Rippeth, E. (eds.), Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023), pp. 8–18, Singapore, 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.nlposs-1.2. URL https://aclanthology.org/2023.nlposs-1.2.
Günther et al. (2023) Günther, M., Ong, J., Mohr, I., Abdessalem, A., Abel, T., Akram, M. K., Guzman, S., Mastrapas, G., Sturua, S., Wang, B., Werk, M., Wang, N., and Xiao, H. Jina Embeddings 2: 8192-Token General-Purpose Text Embeddings for Long Documents. arXiv preprint arXiv:2310.19923, 2023. URL https://arxiv.org/abs/2310.19923.
Hodosh et al. (2013) Hodosh, M., Young, P., and Hockenmaier, J. Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics. Journal of Artificial Intelligence Research, 47:853–899, 2013. doi: 10.1613/jair.3994. URL https://www.jair.org/index.php/jair/article/view/10833.
Ilharco et al. (2021) Ilharco, G., Wortsman, M., Wightman, R., Gordon, C., Carlini, N., Taori, R., Dave, A., Shankar, V., Namkoong, H., Miller, J., Hajishirzi, H., Farhadi, A., and Schmidt, L. OpenCLIP (0.1). Zenodo, 2021. doi: 10.5281/zenodo.5143773. URL https://doi.org/10.5281/zenodo.5143773. Software.
Kossen et al. (2023) Kossen, J., Collier, M., Mustafa, B., Wang, X., Zhai, X., Beyer, L., Steiner, A., Berent, J., Jenatton, R., and Kokiopoulou, E. Three Towers: Flexible Contrastive Learning with Pretrained Image Models. arXiv preprint arXiv:2305.16999, 2023. URL https://arxiv.org/abs/2305.16999.
Kwiatkowski et al. (2019) Kwiatkowski, T., Palomaki, J., Redfield, O., Collins, M., Parikh, A., Alberti, C., Epstein, D., Polosukhin, I., Kelcey, M., Devlin, J., Lee, K., Toutanova, K. N., Jones, L., Chang, M.-W., Dai, A., Uszkoreit, J., Le, Q., and Petrov, S. Natural Questions: a Benchmark for Question Answering Research. Transactions of the Association for Computational Linguistics, 7:452–466, 2019. doi: 10.1162/tacl_a_00276. URL https://aclanthology.org/Q19-1026.
Loshchilov & Hutter (2017) Loshchilov, I. and Hutter, F. Fixing Weight Decay Regularization in Adam. arXiv preprint arXiv:1711.05101v1, 2017. URL https://arxiv.org/abs/1711.05101v1.
Mohr et al. (2024) Mohr, I., Krimmel, M., Sturua, S., Akram, M. K., Koukounas, A., Günther, M., Mastrapas, G., Ravishankar, V., Martínez, J. F., Wang, F., Liu, Q., Yu, Z., Fu, J., Ognawala, S., Guzman, S., Wang, B., Werk, M., Wang, N., and Xiao, H. Multi-Task Contrastive Learning for 8192-Token Bilingual Text Embeddings. arXiv preprint arXiv:2402.17016, 2024. URL https://arxiv.org/abs/2402.17016.
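Muennighoff et al. (2023) Muennighoff, N., Tazi, N., Magne, L., and Reimers, N. MTEB: Massive Text Embedding Benchmark. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pp. 2014–2037, 2023. doi: 10.18653/v1/2023.eacl-main.148. URL https://aclanthology.org/2023.eacl-main.148.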
Ni et al. (2022) Ni, J., Qu, C., Lu, J., Dai, Z., Ábrego, G. H., Ma, J., Zhao, V. Y., Luan, Y., Hall, K. B., Chang, M., and Yang, Y. Large Dual Encoders Are Generalizable Retrievers. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, pp. 9844–9855, 2022. doi: 10.18653/V1/2022.EMNLP-MAIN.669. URL https://doi.org/10.18653/v1/2022.emnlp-main.669.
Oquab et al. (2024) Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., Assran, M., Ballas, N., Galuba, W., Howes, R., Huang, P.-Y., Li, S.-W., Misra, I., Rabbat, M., Sharma, V., Synnaeve, G., Xu, H., Jegou, H., Mairal, J., Labatut, P., Joulin, A., and Bojanowski, P. DINOv2: Learning Robust Visual Features without Supervision. arXiv preprint arXiv:2304.07193, 2024. URL https://arxiv.org/abs/2304.07193.
Press et al. (2021) Press, O., Smith, N. A., and Lewis, M. Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation. arXiv preprint arXiv:2108.12409, 2021. URL https://arxiv.org/abs/2108.12409.
Radford et al. (2021) Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning, pp. 8748–8763, 2021. URL https://proceedings.mlr.press/v139/radford21a.html.
Reimers & Gurevych (2019) Reimers, N. and Gurevych, I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Inui, K., Jiang, J., Ng, V., and Wan, X. (eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, pp. 3980–3990. Association for Computational Linguistics, 2019. doi: 10.18653/V1/D19-1410. URL https://doi.org/10.18653/v1/D19-1410.
Schuhmann et al. (2021) Schuhmann, C., Vencu, R., Beaumont, R., Kaczmarczyk, R., Mullis, C., Katta, A., Coombes, T., Jitsev, J., and Komatsuzaki, A. LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs. arXiv preprint arXiv:2111.02114, 2021. URL https://arxiv.org/abs/2111.02114.
Schuhmann et al. (2022) Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C. W., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., Wortsman, M., Schramowski, P., Kundurthy, S. R., Crowson, K., Schmidt, L., Kaczmarczyk, R., and Jitsev, J. LAION-5B: An open large-scale dataset for training next generation image-text models. In Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., and Oh, A. (eds.), Advances in Neural Information Processing Systems 35 (NeurIPS 2022) Datasets and Benchmarks Track, volume 35, pp. 25278–25294, 2022.
Sun et al. (2023) Sun, Q., Fang, Y., Wu, L., Wang, X., and Cao, Y. EVA-CLIP: Improved Training Techniques for CLIP at Scale. arXiv preprint arXiv:2303.15389, 2023. URL https://arxiv.org/abs/2303.15389.
Sun et al. (2024) Sun, Q., Wang, J., Yu, Q., Cui, Y., Zhang, F., Zhang, X., and Wang, X. EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters. arXiv preprint arXiv:2402.04252, 2024. URL https://arxiv.org/abs/2402.04252.
Thomee et al. (2016) Thomee, B., Shamma, D. A., Friedland, G., Elizalde, B., Ni, K., Poland, D., Borth, D., and Li, L. YFCC100M: The New Data in Multimedia Research. Communications of the ACM, 59(2):64–73, 2016. doi: 10.1145/2812802. URL https://doi.org/10.1145/2812802.
Van den Oord et al. (2018) Van den Oord, A., Li, Y., and Vinyals, O. Representation Learning with Contrastive Predictive Coding. arXiv preprint arXiv:1807.03748, 2018. URL http://arxiv.org/abs/1807.03748.
Wang & Liu (2021) Wang, F. and Liu, H. Understanding the Behaviour of Contrastive Loss. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2495–2504, 2021. doi: 10.1109/CVPR46437.2021.00252. URL https://ieeexplore.ieee.org/document/9577669.
Wang et al. (2022) Wang, L., Yang, N., Huang, X., Jiao, B., Yang, L., Jiang, D., Majumder, R., and Wei, F. Text Embeddings by Weakly-Supervised Contrastive Pre-training. arXiv preprint arXiv:2212.03533, 2022. URL https://arxiv.org/abs/2212.03533.
Yang et al. (2018) Yang, Z., Qi, P., Zhang, S., Bengio, Y., Cohen, W. W., Salakhutdinov, R., and Manning, C. D. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. In Riloff, E., Chiang, D., Hockenmaier, J., and Tsujii, J. (eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018, pp. 2369–2380, 2018. doi: 10.18653/V1/D18-1259. URL https://doi.org/10.18653/v1/d18-1259.
Young et al. (2014) Young, P., Lai, A., Hodosh, M., and Hockenmaier, J. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78, 2014. doi: 10.1162/tacl_a_00166. URL https://aclanthology.org/Q14-1006.
Zhai et al. (2022) Zhai, X., Wang, X., Mustafa, B., Steiner, A., Keysers, D., Kolesnikov, A., and Beyer, L. LiT: Zero-Shot Transfer with Locked-image text Tuning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, pp. 18102–18112. IEEE, 2022. doi: 10.1109/CVPR52688.2022.01759. URL https://doi.org/10.1109/CVPR52688.2022.01759.
Zhai et al. (2023) Zhai, X., Mustafa, B., Kolesnikov, A., and Beyer, L. Sigmoid Loss for Language Image Pre-Training. arXiv preprint arXiv:2303.15343, 2023. URL https://arxiv.org/abs/2303.15343.
Zhang et al. (2024) Zhang, B., Zhang, P., Dong, X., Zang, Y., and Wang, J. Long-CLIP: Unlocking the Long-Text Capability of CLIP. arXiv preprint arXiv:2403.15378, 2024. URL https://arxiv.org/abs/2403.15378.
Zhao et al. (2023) Zhao, R., Chen, H., Wang, W., Jiao, F., Long, D. X., Qin, C., Ding, B., Guo, X., Li, M., Li, X., and Joty, S. Retrieving Multimodal Information for Augmented Generation: A Survey. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 4736–4756, 2023. doi: 10.18653/V1/2023.FINDINGS-EMNLP.314. URL https://doi.org/10.18653/v1/2023.findings-emnlp.314.