ACGE Text Embedding: The Open-Source Model Taking Chinese NLP By Storm
2024-11-01 14:20:00

In a breakthrough for natural language processing, the ACGE text embedding model has hit a remarkable milestone - over 30,000 downloads in a single month on Hugging Face. But what makes this open-source model so special? Let's dive in.

The Power of Text Embeddings

Before we get into the specifics, let's break down why text embeddings matter. In our increasingly digital world, making sense of text data has become crucial across industries. Whether you're analyzing customer sentiment on social media, searching through vast document repositories, or building advanced chatbots, you need a way to make text comprehensible to machines.

This is where text embeddings come in. They transform human-readable text into dense vectors - essentially converting words and sentences into mathematical representations that computers can process efficiently. Think of it as creating a "GPS coordinate system" for meaning, where similar concepts end up close to each other in this mathematical space.
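
To make the idea concrete, here is a toy sketch with hand-made three-dimensional vectors (real embeddings have hundreds or thousands of dimensions and come from a model, not hand-tuning): cosine similarity is high for vectors pointing in a similar direction and low otherwise.

import numpy as np

# Toy 3-dimensional "embeddings"; real models produce much longer vectors
cat = np.array([0.9, 0.2, 0.1])
kitten = np.array([0.8, 0.3, 0.1])
invoice = np.array([0.1, 0.2, 0.9])

def cosine(a, b):
    # Cosine similarity: close to 1.0 for related concepts, near 0.0 for unrelated ones
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(cat, kitten))   # high score: nearby points in the embedding space
print(cosine(cat, invoice))  # low score: distant points in the embedding space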

Meet ACGE: A Game-Changing Model

ACGE text embedding, developed by the TextIn team, claimed the top spot on the Chinese MTEB (C-MTEB, the Massive Text Embedding Benchmark) leaderboard this past March. What's even more exciting? It's completely open-source and available on both Hugging Face and GitHub.

Key Features That Set It Apart

1. Superior Recall Performance

  • Uses contrastive learning techniques to minimize distance between positive pairs while maximizing distance between negative pairs
  • Results in more accurate semantic representations and better retrieval performance (a minimal sketch of this contrastive objective appears after this list)

2. Robust Generalization

  • Trained on diverse, high-quality, large-scale datasets
  • Demonstrates exceptional performance across different domains and tasks

3. Balanced Learning

  • Employs multi-task mixed training with task-specific loss functions
  • Implements continuous learning to prevent catastrophic forgetting when incorporating new data

4. Enhanced Processing Speed

  • Leverages Matryoshka Representation Learning (MRL)
  • Supports flexible embedding dimensions (recommended: 1024 or 1792)
  • Reduces storage requirements while maintaining performance
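
To illustrate the contrastive objective behind the recall improvements in feature 1, here is a minimal, generic sketch of an in-batch InfoNCE-style loss in PyTorch. This shows the general technique only, not ACGE's actual training code, and the tensors are random placeholders.

import torch
import torch.nn.functional as F

def info_nce_loss(query_emb, pos_emb, temperature=0.05):
    # Normalize so the dot products below are cosine similarities
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(pos_emb, dim=-1)
    logits = q @ p.T / temperature        # (batch, batch) similarity matrix
    labels = torch.arange(q.size(0))      # diagonal entries are the positive pairs
    # Cross-entropy pulls each query toward its positive and pushes it away
    # from every other passage in the batch (the in-batch negatives)
    return F.cross_entropy(logits, labels)

queries = torch.randn(8, 1024)
positives = queries + 0.1 * torch.randn(8, 1024)  # positives lie close to their queries
print(info_nce_loss(queries, positives).item())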

Real-World Applications

ACGE is already making waves in several key areas:

Document Classification

By combining OCR technology with ACGE's powerful text encoding capabilities, organizations can build robust, general-purpose classification models that understand document context and content.
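
As a rough sketch of that pipeline, the snippet below assumes the OCR step has already produced plain text and uses scikit-learn's LogisticRegression as a stand-in classifier; the example documents and labels are made up for illustration.

from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

model = SentenceTransformer('acge_text_embedding')

# Hypothetical OCR output and labels for two document types
texts = [
    "INVOICE No. 2024-001  Total due: 1,200.00",
    "Employment contract between Party A and Party B, effective January 1",
]
labels = ["invoice", "contract"]

# Encode the OCR text and train a lightweight classifier on top of the vectors
X = model.encode(texts, normalize_embeddings=True)
clf = LogisticRegression(max_iter=1000).fit(X, labels)

new_doc = model.encode(["Total payable amount for invoice 2024-017"], normalize_embeddings=True)
print(clf.predict(new_doc))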

Long Document Information Extraction

Using document parsing engines and hierarchical slicing techniques, ACGE generates vector indices that make it easier to extract and process information from lengthy documents with high precision.
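
A simplified sketch of that workflow is shown below, assuming the parsing engine has already produced plain text; the fixed-size slicing stands in for the hierarchical slicing mentioned above.

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('acge_text_embedding')

def slice_text(text, size=500, overlap=50):
    # Naive fixed-size slices; a real pipeline would slice along the parsed
    # document hierarchy (sections, paragraphs, tables) instead
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

document_text = "..."  # plain text produced by the document parsing engine
chunks = slice_text(document_text)
index = model.encode(chunks, normalize_embeddings=True)  # one vector per slice

query = model.encode(["When does the contract expire?"], normalize_embeddings=True)
best = int(np.argmax(index @ query.T))
print(chunks[best])  # the slice most likely to contain the requested information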

Knowledge Q&A Systems

The model excels at pinpointing relevant information within documents, enabling accurate question-answering systems through vector indexing and precise content location.
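
In practice this usually takes the form of top-k retrieval over a vector index; the sketch below uses a small in-memory list of made-up passages in place of a real knowledge base.

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('acge_text_embedding')

# Hypothetical passages extracted from company documents
passages = [
    "The warranty period for all hardware products is 24 months from delivery.",
    "Refund requests must be submitted within 14 days of purchase.",
    "Support is available Monday to Friday, 9:00-18:00.",
]
passage_vecs = model.encode(passages, normalize_embeddings=True)

def retrieve(question, top_k=2):
    # Rank passages by cosine similarity to the question; the top hits are the
    # content an answer generator or extractor would read
    q = model.encode([question], normalize_embeddings=True)
    scores = (passage_vecs @ q.T).squeeze(-1)
    order = np.argsort(-scores)[:top_k]
    return [(passages[i], float(scores[i])) for i in order]

print(retrieve("How long is the hardware warranty?"))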

Getting Started with ACGE

Want to try it out? Here's a quick example using the sentence-transformers library to compute pairwise similarity between sentences:

from sentence_transformers import SentenceTransformer

sentences = ["Company A is a great company", "Tell me about Company A"]
model = SentenceTransformer('acge_text_embedding')

# With normalize_embeddings=True the vectors are unit length, so the dot
# products below are cosine similarities between every pair of sentences
embeddings = model.encode(sentences, normalize_embeddings=True)
similarity = embeddings @ embeddings.T
print(similarity)

You can also customize the vector dimensions using Matryoshka Representation Learning:

from sklearn.preprocessing import normalize
from sentence_transformers import SentenceTransformer

sentences = ["Data 1", "Data 2"]
model = SentenceTransformer('acge_text_embedding')

# Encode without normalization, truncate to the desired dimensionality,
# then re-normalize the shortened vectors
embeddings = model.encode(sentences, normalize_embeddings=False)
matryoshka_dim = 1024
embeddings = embeddings[..., :matryoshka_dim]  # keep only the first 1024 dimensions
embeddings = normalize(embeddings, norm="l2", axis=1)

Reproducing C-MTEB Benchmark Results

Want to validate ACGE's performance on the C-MTEB benchmark? Here's the complete code to reproduce our benchmark results:

import torch
import argparse
import functools
import numpy as np
from C_MTEB.tasks import *
from typing import List, Dict
from sentence_transformers import SentenceTransformer
from mteb import MTEB, DRESModel


class RetrievalModel(DRESModel):
    """Wraps the encoder with the query/corpus interface that MTEB's
    retrieval tasks expect."""

    def __init__(self, encoder, **kwargs):
        self.encoder = encoder

    def encode_queries(self, queries: List[str], **kwargs) -> np.ndarray:
        # Queries are encoded as-is; no instruction prefix is added
        return self._do_encode(queries)

    def encode_corpus(self, corpus: List[Dict[str, str]], **kwargs) -> np.ndarray:
        # Concatenate the optional title with the passage text
        input_texts = ['{} {}'.format(doc.get('title', ''), doc['text']).strip() for doc in corpus]
        return self._do_encode(input_texts)

    @torch.no_grad()
    def _do_encode(self, input_texts: List[str]) -> np.ndarray:
        return self.encoder.encode(
            sentences=input_texts,
            batch_size=512,
            normalize_embeddings=True,
            convert_to_numpy=True
        )


def get_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('--model_name_or_path', default="acge_text_embedding", type=str)
    parser.add_argument('--task_type', default=None, type=str)
    parser.add_argument('--pooling_method', default='cls', type=str)
    parser.add_argument('--output_dir', default='zh_results',
                        type=str, help='output directory')
    parser.add_argument('--max_len', default=1024, type=int, help='max length')
    return parser.parse_args()


if __name__ == '__main__':
    args = get_args()
    encoder = SentenceTransformer(args.model_name_or_path).half()
    encoder.encode = functools.partial(encoder.encode, normalize_embeddings=True)
    encoder.max_seq_length = int(args.max_len)

    task_names = [t.description["name"] for t in MTEB(task_types=args.task_type,
                                                      task_langs=['zh', 'zh-CN']).tasks]
    TASKS_WITH_PROMPTS = ["T2Retrieval", "MMarcoRetrieval", "DuRetrieval", "CovidRetrieval", "CmedqaRetrieval",
                          "EcomRetrieval", "MedicalRetrieval", "VideoRetrieval"]
    for task in task_names:
        evaluation = MTEB(tasks=[task], task_langs=['zh', 'zh-CN'])
        if task in TASKS_WITH_PROMPTS:
            evaluation.run(RetrievalModel(encoder), output_folder=args.output_dir, overwrite_results=False)
        else:
            evaluation.run(encoder, output_folder=args.output_dir, overwrite_results=False)

