Recently, TextIn's PDF parser completed its latest product iteration, achieving a remarkable feat: parsing 100-page PDFs in under 2 seconds. Here's a breakdown of the performance metrics:
Note: "P50" represents the median response time, meaning half of the samples were faster and half were slower. Similarly, "P90" indicates that 90% of parsing operations completed within 1.75 seconds, reflecting the experience of most users.
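Percentile metrics like these can be computed directly from raw latency samples. A minimal sketch, using Python's standard library and made-up sample values (these numbers are illustrative, not TextIn's actual measurements):

```python
import statistics

# Illustrative latency samples in seconds (made-up numbers, not TextIn data)
latencies = [1.2, 1.4, 1.5, 1.6, 1.6, 1.7, 1.75, 1.8, 1.9, 2.4]

# P50 is simply the median: half the samples are faster, half slower
p50 = statistics.median(latencies)

# statistics.quantiles with n=10 yields the nine cut points P10..P90;
# index 8 is the 90th percentile
p90 = statistics.quantiles(latencies, n=10)[8]
```

With enough samples, P90 is a better proxy for "the experience of most users" than the mean, since a few slow outliers can drag the mean upward without affecting the percentile much.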
After the release, a user in our technical discussion group raised an intriguing question: For individual users, waiting 5-6 seconds for a long document to parse seems acceptable. So why does the TextIn team keep pushing to optimize parsing speed, even compressing 100-page document parsing to under 2 seconds?
Let's dive into the reasons behind this need for speed.
Why Do We Need Lightning-Fast PDF Parsers?
1.1 Empowering Big Data Scenarios
Consider the financial big data industry as an example. Financial data service providers often deal with large-scale data input in short timeframes. To provide timely and accurate big data query and retrieval services, especially during peak periods like annual report seasons, these companies need T+0 database updates with high-quality standards.
Traditional data entry methods, involving data cleaning and extensive regex use for web content extraction, fall short when dealing with document data. Complex, multi-format documents often become data "black holes," where crucial information is difficult to efficiently and accurately transform into usable data.
As enterprise-level applications of Large Language Models (LLMs) become more prevalent, the financial big data industry is shifting paradigms. The new preferred model is "Data + Document Parsing + LLM + Prompt."
Compared to writing regular expressions, creating prompts is more maintainable and user-friendly. By leveraging the strong reading, comprehension, and generation capabilities of LLMs, professionals can sharpen their content interpretation and data analysis. The key challenge now becomes: how do we convert document content into LLM-friendly formats for analysis?
Our answer: Efficient, stable, and reliable document parsers.
In financial big data scenarios, an effective document parser should have:
- Lightning-fast parsing speed: During peak data entry periods, newly released financial reports and legal documents need to be processed and made available on the same day.
- Accurate table structure reproduction: Reports often contain complex structures like borderless tables, multi-page tables, merged cells, and dense tables. Precision is crucial, as even minor errors in content input can significantly impact database quality.
- Excellent compatibility: The parser should handle various layouts effectively, minimizing parsing failures.
Speed is key to efficient business operations. Annual reports from listed companies typically run to 200-300 pages; TextIn's document parser can process annual reports from thousands of companies within 8 hours, supporting big data companies in making timely data updates.
1.2 Accelerating Large Language Model Training
The performance of LLMs largely depends on their ability to understand and generate human language. This challenging task, given the complexity and diversity of natural language, makes LLM training crucial.
During training, LLMs process massive amounts of text data to learn language patterns, semantics, and context, developing capabilities in translation, dialogue, question-answering, and text generation.
The quality of LLM training significantly impacts model performance and applications. High-quality training enables better semantic understanding and text inference. Aside from sufficient computational resources, high-quality data is paramount. In simple terms, the data "fed" to the model determines its performance.
Fast, accurate document parsing tools can significantly accelerate LLM training:
- Accurate recognition of complex layouts and good format compatibility improve pre-training data quality.
- LLM pre-training often involves tens of millions of pages of text data. For instance, processing 30 million pages of PDF documents typically takes about two weeks with current tools. Using TextIn's current version, this could be reduced to 5 working days, saving over a week of processing time.
- In the efficiency-driven AI era, a week can make a crucial difference in development cycles. A good document parsing tool can serve as an "accelerator" for LLM development and application.
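The time savings quoted above can be sanity-checked with a back-of-the-envelope throughput calculation. A sketch, where the page count and durations come from the text and the round-the-clock, single-pipeline assumption is our simplification:

```python
# Back-of-the-envelope check on the quoted figures: 30 million pages,
# ~2 weeks with typical tools vs ~5 days with a faster parser.
# Assumes continuous round-the-clock processing (our simplification).
PAGES = 30_000_000

def pages_per_second(total_pages: int, days: float) -> float:
    """Average sustained rate needed to finish total_pages in `days`."""
    return total_pages / (days * 24 * 3600)

typical_rate = pages_per_second(PAGES, days=14)  # ~24.8 pages/s
faster_rate = pages_per_second(PAGES, days=5)    # ~69.4 pages/s
```

In practice, pre-training pipelines parallelize across many workers, so these rates describe aggregate throughput rather than a single parser instance.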
1.3 Enhancing User Experience for Individual Users
Currently, the user experience for document Q&A still has room for improvement. For instance, when users upload large scanned documents or books, they often exceed the LLM's document size limit or face parsing failures after long loading times.
Also, due to network speed and computational power limitations, waiting for document parsing is common in LLM Q&A products, typically ranging from seconds to minutes.
For individual users, high-speed, stable document parsing can significantly improve the user experience. It can help overcome barriers posed by scanned documents and complex layouts in LLM Q&A applications, enabling more accurate information retrieval and enhancing overall Q&A performance.
TextIn vs. X: Parsing Speed Comparison of Current Products
How does TextIn's 2-second parsing speed for 100 pages stack up in the industry?
To answer this, let's look at some speed test data.
We conducted a comparative speed test using a corporate annual report as an example.
The chosen annual report file was 38.8MB, containing 49 pages with various charts, data, and certificate pages, as shown in the images below.
We used TextIn, LlamaParse, and a popular domestic LLM Q&A product to parse the document.
LlamaParse, created by LlamaIndex, is a technology for parsing and representing PDF files for efficient retrieval and context augmentation through the LlamaIndex framework. It is suited to complex PDF documents and is currently a widely discussed open-source parser. Because using conversational LLMs for document parsing and Q&A is a common scenario for individual users, we also included a popular LLM Q&A product. We tested these two products against TextIn using the same file, with the following speed test results:
Note: In TextIn's API output, the duration is in milliseconds.
For TextIn and LlamaParse, we used API calls and test scripts to directly observe the runtime. For the LLM product, we uploaded a PDF and observed the "Uploading..." and "Parsing..." states. The end-to-end time in the table is the sum of upload and parsing times. The "Uploading" state corresponds to an XHR request in the control panel, while the "Parsing" state corresponds to a "parse_process" request.
We've listed the parsing speed and end-to-end speed (including upload time) for each product. All tests were conducted under the same network conditions. LlamaParse doesn't support separate parsing speed measurement, so only end-to-end speed is available.
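The API-side measurement described above amounts to timing a round trip with a monotonic clock. A generic sketch of that pattern (the wrapped call is whatever client function you use; no TextIn-specific API is assumed):

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) and return (elapsed_seconds, result).

    time.perf_counter is monotonic and intended for interval timing,
    so it is not affected by system clock adjustments.
    """
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return time.perf_counter() - start, result

# Usage idea (not executed here): wrap your HTTP client call, e.g.
#   elapsed, response = timed(client.parse_pdf, pdf_bytes)
# and record `elapsed` as the end-to-end parse time for one document.
```

Running the call several times and reporting percentiles (as in the P50/P90 figures earlier) smooths out network jitter between individual runs.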
For the same document, TextIn's PDF parser demonstrated a significant speed advantage. In enterprise-level scenarios, where millions or tens of millions of document pages must be processed, parsing speed becomes a crucial factor affecting business implementation and LLM development efficiency.
Try the Current Version of TextIn's PDF to Markdown Parser
If you have immediate needs, you can use TextIn's document parser on demand.
Developers can register an account on the TextIn platform and try the latest version of TextIn's PDF to Markdown parser at any time.
Visit: http://textin.ai/experience/pdf_to_markdown
If you want to try code calls, you can also visit the corresponding API documentation:
http://textin.ai/document/pdf_to_markdown
The platform provides a Playground to help developers pre-debug the interface.
Click the "API Debug" button on the page to enter the debugging page.
Here you can configure some interface parameters, and after initiating the call, the results will appear on the right side.
If you want to use Python for calls, you can refer to the general sample code on the platform or join TextIn's Discord group to get more comprehensive demo code.
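For orientation, here is the rough shape such a Python call might take. The endpoint URL, header names, and response field below are hypothetical placeholders; substitute the real values from the API documentation before use:

```python
import json
import urllib.request

# Hypothetical endpoint -- check the official API documentation at
# textin.ai/document/pdf_to_markdown for the real contract before use.
API_URL = "https://api.example.com/pdf_to_markdown"

def build_request(pdf_bytes: bytes, app_id: str, secret: str) -> urllib.request.Request:
    """Package a PDF payload as an HTTP POST request to the parser."""
    return urllib.request.Request(
        API_URL,
        data=pdf_bytes,
        headers={
            "x-app-id": app_id,        # credential header names are placeholders
            "x-secret-code": secret,
            "Content-Type": "application/octet-stream",
        },
        method="POST",
    )

def markdown_from_response(body: bytes) -> str:
    """Extract the Markdown text from a JSON response (field name assumed)."""
    return json.loads(body).get("markdown", "")

# To actually send the request:
#   with urllib.request.urlopen(build_request(data, app_id, secret)) as resp:
#       md = markdown_from_response(resp.read())
```

The sketch uses only the standard library; the platform's own sample code will show the exact endpoint, authentication headers, and response schema.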
Our PDF to Markdown Parser now offers a free trial quota of 1,000 PDF pages, which can be claimed by joining our Discord group. We welcome you to connect with our team and share your opinions or suggestions.