Previously, we introduced TextIn's open-source ACGE text embedding model and its basic usage. Today, let's dive deep into the technical framework that powers this cutting-edge model.
The Evolution of Text Embeddings
From Word2Vec to BERT, and now to large language models (LLMs), embedding technologies have continuously evolved. They play a crucial role in various applications, from traditional search and QA systems to modern Retrieval-Augmented Generation (RAG) pipelines.
Figure 1: Flow diagram of an embedding-based retrieval system
Three Core Technologies Behind ACGE
1. SimCSE: Harnessing the Power of Contrastive Learning
SimCSE (Simple Contrastive Learning of Sentence Embeddings) uses contrastive learning, which is like teaching a model to play "spot the difference" with text. Here's how it works:
Figure 2: Comparison of supervised and unsupervised SimCSE approaches
Unsupervised Approach
Imagine having two slightly different versions of the same sentence - like looking at the same picture through different Instagram filters. In SimCSE, these two versions come from encoding the same sentence twice with different dropout masks. The model learns to recognize that these are basically the same thing, just with minor variations, while learning to distinguish them from completely different sentences in the same batch.
Supervised Approach
This is more like having a teacher who provides explicit examples:
- "These two sentences mean the same thing" (positive pairs)
- "These sentences contradict each other" (hard negative pairs)
The beauty of this approach is that it helps the model develop a nuanced understanding of semantic relationships - not just matching words, but grasping meaning.
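To make this concrete, here is a minimal sketch of the in-batch contrastive (InfoNCE) objective that SimCSE optimizes, written in PyTorch with illustrative names; it is a sketch of the technique, not ACGE's actual training code. Each sentence's two views should score higher with each other than with every other sentence in the batch.
import torch
import torch.nn.functional as F

def simcse_loss(z1, z2, temperature=0.05):
    # z1, z2: [batch, dim] embeddings of the same sentences under two different
    # dropout masks (unsupervised) or of premise/entailment pairs (supervised)
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    # Pairwise cosine similarities between every z1[i] and every z2[j]
    sim = z1 @ z2.T / temperature  # shape: [batch, batch]
    # The diagonal (i, i) entries are the positives; the other columns act as in-batch negatives
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(sim, labels)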
2. EWC: The Memory Keeper
Ever tried learning a new skill while trying not to forget an old one? That's exactly what Elastic Weight Consolidation (EWC) helps our model do. It's particularly crucial for embedding models that need to handle multiple tasks without dropping the ball on any of them.
Figure 3: Elastic Weight Consolidation (EWC) training strategy visualization
How EWC Works:
- Importance Assessment: Uses the Fisher information matrix to estimate which parameters are most important for previously learned tasks
- Selective Protection: Adds constraints to protect important parameters while learning new tasks
- Balanced Learning: Uses a hyperparameter λ to balance between preserving old knowledge and acquiring new skills
Think of it like having a really good teacher who knows exactly which fundamentals you need to keep practicing while learning advanced concepts.
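In code, EWC boils down to adding a quadratic penalty that anchors important parameters to their values from the previous task. The sketch below is a simplified PyTorch illustration with hypothetical variable names; it assumes a diagonal Fisher information estimate has already been computed for each parameter.
def ewc_penalty(model, old_params, fisher, lam=1.0):
    # model: any torch.nn.Module being fine-tuned on the new task
    # old_params: parameter values saved after finishing the previous task
    # fisher: per-parameter (diagonal) Fisher information estimates, same shapes as the parameters
    # lam: the lambda hyperparameter balancing old knowledge vs. new learning
    penalty = 0.0
    for name, param in model.named_parameters():
        if name in fisher:
            # Parameters with high Fisher information are pulled strongly back toward
            # their old values; unimportant ones remain free to move
            penalty = penalty + (fisher[name] * (param - old_params[name]) ** 2).sum()
    return (lam / 2.0) * penalty

# During training on the new task (sketch):
# loss = new_task_loss + ewc_penalty(model, old_params, fisher)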
3. MRL: The Matryoshka Approach
Named after Russian nesting dolls, Matryoshka Representation Learning (MRL) is perhaps one of the most innovative aspects of ACGE. It's all about efficiency and flexibility.
Figure 4: Matryoshka Representation Learning (MRL) training and inference process
Key Benefits:
- Reduced Embedding Size: Achieves up to 14x reduction while maintaining accuracy
- Faster Retrieval: Significant speed-up in large-scale retrieval tasks
- Improved Performance: Better handling of long-tail classification tasks
Think of it like having a Swiss Army knife where you can use just the tools you need for each specific task, rather than carrying the whole toolbox everywhere.
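Conceptually, MRL trains the encoder so that every prefix of the embedding (e.g. the first 256, 512, or 1024 dimensions) works as a standalone representation. Here is a hedged sketch of that training idea, reusing a contrastive loss like simcse_loss above; the dimension schedule and equal weighting are illustrative, not ACGE's exact recipe.
NESTED_DIMS = [256, 512, 1024, 1792]  # illustrative nesting; the largest is the full embedding

def matryoshka_loss(z1, z2, base_loss):
    # Apply the same training objective to truncated prefixes of the embeddings,
    # forcing each prefix to be a usable representation on its own.
    # (The MRL paper allows per-dimension weights; equal weights are used here for simplicity.)
    total = 0.0
    for d in NESTED_DIMS:
        total = total + base_loss(z1[:, :d], z2[:, :d])
    return total

# Example: loss = matryoshka_loss(z1, z2, simcse_loss)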
Practical Implementation
from sentence_transformers import SentenceTransformer
sentences = [
"这是一个测试句子", # "This is a test sentence"
"另一个测试句子" # "Another test sentence"
]
model = SentenceTransformer('acge_text_embedding')
# encode() returns the full-dimension embeddings; normalize_embeddings=True
# L2-normalizes them so cosine similarity reduces to a dot product.
# Recommended output dimensions are 1024 or 1792 (see the MRL note below).
embeddings = model.encode(sentences, normalize_embeddings=True)
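Because ACGE is trained with MRL, you can shrink the embeddings at inference time by keeping only the leading dimensions and re-normalizing. A minimal sketch, reusing model and sentences from above and assuming a target dimension of 1024 (check the model card for the officially recommended usage):
import numpy as np

matryoshka_dim = 1024  # assumed target size; 1792 is the full dimension
raw = model.encode(sentences, normalize_embeddings=False)       # un-normalized full embeddings
truncated = raw[:, :matryoshka_dim]                             # keep only the leading dimensions
truncated = truncated / np.linalg.norm(truncated, axis=1, keepdims=True)  # re-normalize to unit length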
Why ACGE Stands Out
Compared to other top models on the C-MTEB leaderboard, ACGE offers several distinct advantages:
- Smaller model size = lower resource requirements
- 1024 token input length covers most use cases
- Flexible output dimensions for resource optimization
- Strong performance across various NLP tasks
Future Directions
The field of embedding models continues to evolve, with several exciting trends:
- Development of lightweight models for broader accessibility
- Enhanced interpretability
- Improved generalization through multi-task learning
- Specialized developments in RAG applications
Resources
Find the model here:
- Hugging Face: acge_text_embedding
References
[1] Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, 2019.
[2] Jui-Ting Huang, Ashish Sharma, Shuying Sun, Li Xia, David Zhang, Philip Pronin, Janani Padmanabhan, Giuseppe Ottaviano, and Linjun Yang. Embedding-based Retrieval in Facebook Search. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020.
[3] Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Qianyu Guo, Meng Wang, and Haofen Wang. Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv:2312.10997, 2023.
[4] Tianyu Gao, Xingcheng Yao, and Danqi Chen. SimCSE: Simple Contrastive Learning of Sentence Embeddings. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6894–6910, 2021.
[5] James Kirkpatrick, Razvan Pascanu, Neil C. Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming Catastrophic Forgetting in Neural Networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017.
[6] Aditya Kusupati, Gantavya Bhatt, Aniket Rege, Matthew Wallingford, Aditya Sinha, Vivek Ramanujan, William Howard-Snyder, Kaifeng Chen, Sham Kakade, Prateek Jain, et al. Matryoshka Representation Learning. Advances in Neural Information Processing Systems, 35:30233–30249, 2022.