Deep Dive: Technical Framework Behind ACGE's Chinese Text Embedding Model
2024-11-01 14:50:59

Previously, we introduced TextIn's open-source ACGE text embedding model and its basic usage. Today, let's dive deep into the technical framework that powers this cutting-edge model.

The Evolution of Text Embeddings

From Word2Vec to BERT, and now to large language models (LLMs), embedding technologies have continuously evolved. They play a crucial role in various applications, from traditional search and QA systems to modern Retrieval-Augmented Generation (RAG) pipelines.

Figure 1: Flow diagram of an embedding-based retrieval system
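
To make the flow concrete, here is a minimal sketch of embedding-based retrieval with cosine similarity. The corpus and query are made-up examples, and the model name simply reuses the one from the usage snippet later in this post; this is an illustration, not TextIn's production pipeline.

import numpy as np
from sentence_transformers import SentenceTransformer

# Toy corpus and query, for illustration only.
corpus = [
    "How do I reset my password?",
    "Shipping usually takes 3 to 5 business days.",
    "Refunds are processed within one week.",
]
query = "When will my order arrive?"

model = SentenceTransformer('acge_text_embedding')
corpus_emb = model.encode(corpus, normalize_embeddings=True)   # shape: (num_docs, dim)
query_emb = model.encode([query], normalize_embeddings=True)   # shape: (1, dim)

# With L2-normalized vectors, the dot product equals cosine similarity.
scores = (corpus_emb @ query_emb.T).ravel()
top_k = np.argsort(-scores)[:2]
for idx in top_k:
    print(f"{scores[idx]:.3f}  {corpus[idx]}")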


Three Core Technologies Behind ACGE

1. SimCSE: Harnessing the Power of Contrastive Learning

SimCSE (Simple Contrastive Learning of Sentence Embeddings) uses contrastive learning, which is like teaching a model to play "spot the difference" with text. Here's how it works:

Figure 2: Comparison of supervised and unsupervised SimCSE approaches

Unsupervised Approach

Imagine having two slightly different versions of the same sentence - like looking at the same picture through different Instagram filters. In practice, SimCSE creates these two views by encoding the same sentence twice with different dropout masks. The model learns to recognize that these are essentially the same thing, just with minor variations, while learning to distinguish them from completely different sentences.

Supervised Approach

This is more like having a teacher who provides explicit examples:

  • "These two sentences mean the same thing" (positive pairs)
  • "These sentences contradict each other" (hard negative pairs)

The beauty of this approach is that it helps the model develop a nuanced understanding of semantic relationships - not just matching words, but grasping meaning.
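
For readers who want the mechanics, here is a minimal sketch of the unsupervised SimCSE objective (InfoNCE over in-batch negatives), following the SimCSE paper rather than ACGE's actual training code:

import torch
import torch.nn.functional as F

def simcse_unsup_loss(z1, z2, temperature=0.05):
    # z1, z2: (batch, dim) embeddings of the same sentences encoded twice
    # with different dropout masks.
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    sim = z1 @ z2.T / temperature                       # (batch, batch) cosine similarities
    labels = torch.arange(z1.size(0), device=z1.device)
    # Diagonal entries are the positive pairs; the other entries in each row
    # act as in-batch negatives.
    return F.cross_entropy(sim, labels)

The supervised variant works the same way, except the positives come from labeled paraphrase pairs and the hard negatives from contradiction pairs rather than from dropout noise.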

2. EWC: The Memory Keeper

Ever tried learning a new skill while trying not to forget an old one? That's exactly what Elastic Weight Consolidation (EWC) helps our model do. It's particularly crucial for embedding models that need to handle multiple tasks without dropping the ball on any of them.

Figure 3: Elastic Weight Consolidation (EWC) training strategy visualization

How EWC Works:

  1. Importance Assessment: Uses the Fisher information matrix to identify which parameters are crucial for previously learned tasks
  2. Selective Protection: Adds constraints to protect important parameters while learning new tasks
  3. Balanced Learning: Uses a hyperparameter λ to balance between preserving old knowledge and acquiring new skills

Think of it like having a really good teacher who knows exactly which fundamentals you need to keep practicing while learning advanced concepts.
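
As a rough sketch of the idea (not ACGE's actual training code), the EWC term adds a quadratic penalty that pulls parameters judged important for earlier tasks back toward their old values, weighted by the Fisher information and scaled by λ:

import torch

def ewc_penalty(model, fisher, old_params, lam):
    # fisher and old_params map parameter names to tensors captured after the previous task.
    # Penalty: (lam / 2) * sum_i F_i * (theta_i - theta_i_old)^2
    penalty = 0.0
    for name, param in model.named_parameters():
        if name in fisher:
            penalty = penalty + (fisher[name] * (param - old_params[name]) ** 2).sum()
    return 0.5 * lam * penalty

# During training on a new task:
# loss = new_task_loss + ewc_penalty(model, fisher, old_params, lam)

A larger λ preserves more of the old behavior at the cost of slower adaptation; a smaller λ does the opposite.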

3. MRL: The Matryoshka Approach

Named after Russian nesting dolls, Matryoshka Representation Learning (MRL) is perhaps one of the most innovative aspects of ACGE. It's all about efficiency and flexibility.

Figure 4: Matryoshka Representation Learning (MRL) training and inference process

Key Benefits:

  • Reduced Embedding Size: Achieves up to 14x reduction while maintaining accuracy
  • Faster Retrieval: Significant speed-up in large-scale retrieval tasks
  • Improved Performance: Better handling of long-tail classification tasks

Think of it like having a Swiss Army knife where you can use just the tools you need for each specific task, rather than carrying the whole toolbox everywhere.
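
To show how the nesting is learned, here is a hedged sketch of an MRL-style objective in the classification setting described in the MRL paper: the same loss is applied to several nested prefixes of the full embedding, so every prefix remains a usable representation on its own. The dimension list and classifier heads are illustrative, not ACGE's actual configuration.

import torch.nn.functional as F

def matryoshka_loss(full_emb, labels, heads, dims=(64, 128, 256, 512, 1024, 1792)):
    # heads: dict mapping each dim d to a linear classifier nn.Linear(d, num_classes).
    total = 0.0
    for d in dims:
        logits = heads[d](full_emb[:, :d])   # use only the first d dimensions
        total = total + F.cross_entropy(logits, labels)
    return total

At inference time nothing special is needed: you simply keep the first d dimensions of the full embedding and re-normalize, as shown at the end of the next section.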

Practical Implementation

from sentence_transformers import SentenceTransformer

sentences = [
    "这是一个测试句子",  # "This is a test sentence"
    "另一个测试句子"    # "Another test sentence"
]

model = SentenceTransformer('acge_text_embedding')
# Returns L2-normalized embeddings at the model's full output dimension;
# 1024 or 1792 dimensions are the recommended sizes.
embeddings = model.encode(sentences, normalize_embeddings=True)
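
Because of the Matryoshka training, smaller embeddings can be obtained by truncating the full vectors to a prefix and re-normalizing. A sketch, continuing from the snippet above (the choice of 1024 follows the recommendation in the comment):

import numpy as np

matryoshka_dim = 1024
# Encode without normalization, keep only the leading dimensions, then re-normalize.
full = model.encode(sentences, normalize_embeddings=False)
truncated = full[:, :matryoshka_dim]
truncated = truncated / np.linalg.norm(truncated, axis=1, keepdims=True)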

Why ACGE Stands Out

Compared to other top models on the C-MTEB leaderboard, ACGE offers several distinct advantages:

  • Smaller model size = lower resource requirements
  • 1024 token input length covers most use cases
  • Flexible output dimensions for resource optimization
  • Strong performance across various NLP tasks

Future Directions

The field of embedding models continues to evolve, with several exciting trends:

  • Development of lightweight models for broader accessibility
  • Enhanced interpretability
  • Improved generalization through multi-task learning
  • Specialized developments in RAG applications

Resources

Find the model here:

References

[1] Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, 2019.

[2] Jui-Ting Huang, Ashish Sharma, Shuying Sun, Li Xia, David Zhang, Philip Pronin, Janani Padmanabhan, Giuseppe Ottaviano, and Linjun Yang. Embedding-based Retrieval in Facebook Search. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020.

[3] Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Qianyu Guo, Meng Wang, and Haofen Wang. Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv preprint arXiv:2312.10997, 2023.

[4] Tianyu Gao, Xingcheng Yao, and Danqi Chen. SimCSE: Simple Contrastive Learning of Sentence Embeddings. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6894–6910, 2021.

[5] James Kirkpatrick, Razvan Pascanu, Neil C. Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114:3521–3526, 2017.

[6] Aditya Kusupati, Gantavya Bhatt, Aniket Rege, Matthew Wallingford, Aditya Sinha, Vivek Ramanujan, William Howard-Snyder, Kaifeng Chen, Sham Kakade, Prateek Jain, et al. Matryoshka Representation Learning. Advances in Neural Information Processing Systems, 35:30233–30249, 2022.
