In a breakthrough for natural language processing, the ACGE text embedding model has hit a remarkable milestone - over 30,000 downloads in a single month on Hugging Face. But what makes this open-source model so special? Let's dive in.
The Power of Text Embeddings
Before we get into the specifics, let's break down why text embeddings matter. In our increasingly digital world, making sense of text data has become crucial across industries. Whether you're analyzing customer sentiment on social media, searching through vast document repositories, or building advanced chatbots, you need a way to make text comprehensible to machines.
This is where text embeddings come in. They transform human-readable text into dense vectors - essentially converting words and sentences into mathematical representations that computers can process efficiently. Think of it as creating a "GPS coordinate system" for meaning, where similar concepts end up close to each other in this mathematical space.
Meet ACGE: A Game-Changing Model
ACGE text embedding, developed by the TextIn team, claimed the top spot on the C-MTEB (Chinese Massive Text Embedding Benchmark) leaderboard in March. What's even more exciting? It's completely open-source and available on both Hugging Face and GitHub.
Key Features That Set It Apart
1. Superior Recall Performance
- Uses contrastive learning to pull the embeddings of positive pairs closer together while pushing negative pairs apart (see the loss sketch after this list)
- Results in more accurate semantic representations and stronger retrieval performance
2. Robust Generalization
- Trained on diverse, high-quality, large-scale datasets
- Demonstrates exceptional performance across different domains and tasks
3. Balanced Learning
- Employs multi-task mixed training with task-specific loss functions
- Implements continuous learning to prevent catastrophic forgetting when incorporating new data
4. Enhanced Processing Speed
- Leverages Matryoshka Representation Learning (MRL)
- Supports flexible embedding dimensions (recommended: 1024 or 1792)
- Reduces storage requirements while maintaining performance
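To make the contrastive objective from point 1 concrete, here is a minimal sketch of an InfoNCE-style loss with in-batch negatives. This is a generic formulation for illustration, not ACGE's published training code; the temperature value and batch construction are assumptions.
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb, positive_emb, temperature=0.05):
    # query_emb, positive_emb: (batch, dim) tensors where row i of each
    # forms a positive pair and every other row serves as a negative.
    # The temperature of 0.05 is an illustrative choice, not ACGE's.
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(positive_emb, dim=-1)
    # Cosine similarity of every query against every candidate.
    logits = q @ p.T / temperature
    # Each query's positive sits on the diagonal.
    labels = torch.arange(q.size(0), device=q.device)
    # Cross-entropy pulls positive pairs together and pushes
    # in-batch negatives apart.
    return F.cross_entropy(logits, labels)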
Real-World Applications
ACGE is already making waves in several key areas:
Document Classification
By combining OCR technology with ACGE's powerful text encoding capabilities, organizations can build robust, general-purpose classification models that understand document context and content.
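As a rough illustration of that pipeline, the sketch below feeds OCR'd text through ACGE and trains a lightweight classifier on the frozen embeddings. The texts and labels are placeholders; a real system would plug in the output of its own OCR engine.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

# Placeholder OCR output and labels; a real pipeline supplies these.
texts = ["Invoice total due 2024-05-01 ...", "Employment contract between ..."]
labels = ["invoice", "contract"]

model = SentenceTransformer('acge_text_embedding')
features = model.encode(texts, normalize_embeddings=True)

# A simple classifier on top of the frozen embeddings.
clf = LogisticRegression(max_iter=1000).fit(features, labels)
print(clf.predict(model.encode(["Amount payable on invoice #42"],
                               normalize_embeddings=True)))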
Long Document Information Extraction
Using document parsing engines and hierarchical slicing techniques, ACGE generates vector indices that make it easier to extract and process information from lengthy documents with high precision.
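One hedged way to realize this: slice the document into chunks, embed each chunk, and keep the resulting matrix as a brute-force vector index. The fixed-size slicer below is a stand-in for a real parsing engine with hierarchical slicing, and a production system would use an approximate-nearest-neighbor index rather than a raw matrix.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('acge_text_embedding')

# Naive fixed-size slicing stands in for a real document parsing engine.
def slice_document(text, size=300):
    return [text[i:i + size] for i in range(0, len(text), size)]

document = "..."  # long document text goes here
chunks = slice_document(document)
index = model.encode(chunks, normalize_embeddings=True)  # shape: (n_chunks, dim)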
Knowledge Q&A Systems
The model excels at pinpointing relevant information within documents, enabling accurate question-answering systems through vector indexing and precise content location.
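Reusing the model, chunks, and index from the previous sketch, answering a question reduces to embedding the query and taking the nearest chunks; the question text and top-k cutoff here are illustrative.
question = "What is the contract's termination clause?"
q_emb = model.encode([question], normalize_embeddings=True)

# With normalized embeddings, the dot product is cosine similarity.
scores = (index @ q_emb.T).ravel()
for i in scores.argsort()[::-1][:3]:
    print(scores[i], chunks[i])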
Getting Started with ACGE
Want to try it out? Here's a quick example using the sentence-transformers library to compute similarity between two texts:
from sentence_transformers import SentenceTransformer

sentences = ["Company A is a great company", "Tell me about Company A"]
model = SentenceTransformer('acge_text_embedding')

# Encoding once is enough; with normalized embeddings, the dot
# product below is exactly cosine similarity.
embeddings = model.encode(sentences, normalize_embeddings=True)
similarity = embeddings @ embeddings.T
print(similarity)  # 2x2 matrix of pairwise similarities
You can also customize the vector dimensions using Matryoshka Representation Learning:
from sklearn.preprocessing import normalize
from sentence_transformers import SentenceTransformer

sentences = ["Data 1", "Data 2"]
model = SentenceTransformer('acge_text_embedding')

# Encode without normalization so the vectors can be truncated first.
embeddings = model.encode(sentences, normalize_embeddings=False)

# MRL packs the most informative features into the leading dimensions,
# so truncating to a smaller size keeps most of the signal.
matryoshka_dim = 1024
embeddings = embeddings[..., :matryoshka_dim]

# Re-normalize so dot products are cosine similarities again.
embeddings = normalize(embeddings, norm="l2", axis=1)
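Note that normalization has to happen after truncation: chopping dimensions off an already-normalized vector leaves it with a norm below one, which would distort similarity scores.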
Reproducing C-MTEB Benchmark Results
Want to validate ACGE's performance on the C-MTEB benchmark? The script below reproduces the reported results; it assumes the C_MTEB package from the FlagEmbedding repository and the mteb library are installed.
import torch
import argparse
import functools
import numpy as np
from typing import List, Dict

from C_MTEB.tasks import *
from sentence_transformers import SentenceTransformer
from mteb import MTEB, DRESModel


class RetrievalModel(DRESModel):
    """Wraps the encoder so MTEB evaluates it as a retrieval (DRES) model."""

    def __init__(self, encoder, **kwargs):
        self.encoder = encoder

    def encode_queries(self, queries: List[str], **kwargs) -> np.ndarray:
        # ACGE needs no instruction prefix, so the template is the identity.
        input_texts = ['{}'.format(q) for q in queries]
        return self._do_encode(input_texts)

    def encode_corpus(self, corpus: List[Dict[str, str]], **kwargs) -> np.ndarray:
        input_texts = ['{} {}'.format(doc.get('title', ''), doc['text']).strip() for doc in corpus]
        return self._do_encode(input_texts)

    @torch.no_grad()
    def _do_encode(self, input_texts: List[str]) -> np.ndarray:
        return self.encoder.encode(
            sentences=input_texts,
            batch_size=512,
            normalize_embeddings=True,
            convert_to_numpy=True
        )


def get_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('--model_name_or_path', default="acge_text_embedding", type=str)
    parser.add_argument('--task_type', default=None, type=str)
    parser.add_argument('--pooling_method', default='cls', type=str)
    parser.add_argument('--output_dir', default='zh_results', type=str, help='output directory')
    parser.add_argument('--max_len', default=1024, type=int, help='max sequence length')
    return parser.parse_args()


if __name__ == '__main__':
    args = get_args()

    # Half precision speeds up encoding; normalization is baked in here so
    # every downstream encode call returns unit vectors.
    encoder = SentenceTransformer(args.model_name_or_path).half()
    encoder.encode = functools.partial(encoder.encode, normalize_embeddings=True)
    encoder.max_seq_length = int(args.max_len)

    task_names = [t.description["name"]
                  for t in MTEB(task_types=args.task_type, task_langs=['zh', 'zh-CN']).tasks]

    # Retrieval tasks go through the DRES wrapper; all other tasks use the encoder directly.
    TASKS_WITH_PROMPTS = ["T2Retrieval", "MMarcoRetrieval", "DuRetrieval", "CovidRetrieval",
                          "CmedqaRetrieval", "EcomRetrieval", "MedicalRetrieval", "VideoRetrieval"]

    for task in task_names:
        evaluation = MTEB(tasks=[task], task_langs=['zh', 'zh-CN'])
        if task in TASKS_WITH_PROMPTS:
            evaluation.run(RetrievalModel(encoder), output_folder=args.output_dir, overwrite_results=False)
        else:
            evaluation.run(encoder, output_folder=args.output_dir, overwrite_results=False)
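Assuming the script is saved as, say, eval_c_mteb.py, a run looks like:
python eval_c_mteb.py --model_name_or_path acge_text_embedding --output_dir zh_results
The mteb library writes one result file per task into the output directory.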