Today, I’m excited to introduce a new “evaluation tool” that offers a convenient, quick, and comprehensive way to assess the capabilities of PDF parsers.
Yes, this evaluation tool is developed by us.
Its primary purpose is to assist users who need to parse PDFs in selecting the most suitable product for their specific scenarios.
During our interactions with users, we’ve discovered that their needs are highly diverse and vary greatly depending on the focus: annual reports, financial statements, academic papers, policy documents, internal company files, textbooks, exam papers, formulas, and so on.
Although every parsing product aims to become an “all-rounder,” it is normal at this stage of development for their capabilities to differ.
Therefore, we are providing this evaluation tool to help save time and effort in the “selection” and “testing” processes, allowing users to better focus on their business scenarios.
The evaluation criteria are divided into five dimensions, focusing on quantitative evaluation of tables, paragraphs, headings, reading order, and formulas.
Let me briefly introduce how to use this evaluation tool.
Without further ado, here’s the access link:
https://github.com/intsig/markdown_tester
It’s very easy to use, and you can evaluate any samples you want to test.
First, run install.sh to install the necessary packages:
./install.sh
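If you’re starting from scratch, a typical sequence looks like the following (this assumes a standard git checkout; the chmod step is only needed if the script isn’t already marked executable):
git clone https://github.com/intsig/markdown_tester.git
cd markdown_tester
chmod +x install.sh
./install.sh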
Place the samples to be evaluated in the following manner:
dataset/
├── pred/
│   ├── gpt-4o/
│   ├── vendor_A/
│   ├── vendor_B/
│   ├── ...
├── gt/
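For example, assuming each parser writes one Markdown file per sample and the files under gt/ use the same filenames as their counterparts under pred/ (an assumption inferred from the layout above, so please check the repository README), the folders could be populated like this, where the /path/to/... folders are placeholders for wherever your files live:
mkdir -p dataset/pred/vendor_A dataset/gt
cp /path/to/vendor_A_outputs/*.md dataset/pred/vendor_A/
cp /path/to/ground_truth/*.md dataset/gt/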
Then run the evaluation command, which is used as follows:
python run_test.py --pred_path path_to_pred_md --gt_path path_to_gt_md
path_to_pred_md: The folder containing the prediction files
path_to_gt_md: The folder containing the ground truth files
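With the layout above, a concrete run might look like this (whether --pred_path should point at the parent pred/ folder or at a single vendor’s subfolder is an assumption worth verifying against the repository README):
python run_test.py --pred_path dataset/pred --gt_path dataset/gt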
Let's use a test case to demonstrate the usage. The output results are as follows:
"Table" data results:
And a visual "radar chart":
Whether you are a document-processing expert or simply someone who needs document parsing, this tool helps you quickly and efficiently evaluate how various parsing products perform in your business scenario.
Let me explain why we decided to make this “Swiss Army knife” tool, originally intended for internal use, publicly available.
Recently, we have received increasing requests for evaluation tools. Our users and industry colleagues have found evaluating parsing products quite challenging: results are either assessed end-to-end or inspected visually. The former makes it hard to pinpoint how much the parser itself contributes, while the latter is time-consuming and covers only a small subset of samples.
One of our clients, who primarily builds LLM-powered QA bots, previously compared parsing products by spot-checking QA quality and manually cross-referencing the source documents to infer parsing capability. Aside from the labor such evaluations require, the approach is haphazard and imprecise.
After using the testing tool we shared, our clients no longer need to rely on "visual observation" of parsing results.
The challenges in evaluation today stem from the evolving needs and product forms driven by advancements in large language models. For instance, traditional OCR technology might only output the position and value of each cell when handling tables. However, when using large models for QA, what we need is the data content within the table, and the clearer the data, the higher the quality of the answers. Thus, we prefer to present this data in comma-separated or Markdown formats.
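To make the contrast concrete, here is a toy, hand-written illustration (not the output of any specific product): a cell-level OCR result records where each value sits, while a Markdown rendering of the same table is far easier for a large model to consume directly.
Cell-level output (position + value):
(row 1, col 1) Quarter    (row 1, col 2) Revenue
(row 2, col 1) Q1         (row 2, col 2) 120
The same table as Markdown:
| Quarter | Revenue |
| ------- | ------- |
| Q1      | 120     |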
In addition to tables, elements such as headings, text paragraphs, and single-column versus double-column layouts face similar issues. This means that the methods previously used to evaluate OCR quality may no longer apply in the context of large models and RAG.
So, how do we compare the performance of different document parsing products in business scenarios? How much improvement do product updates actually bring to real-world usage?
Our goal in making the evaluation tool public is to make these questions open and transparent. Thus, during the design and optimization of this tool, we focused on the following elements:
- Defining the main goals and key metrics for evaluation
- Selecting evaluation metrics that accurately reflect performance
- Reducing unnecessary complexity
- Ensuring adherence to industry standards and best practices
- Making evaluation results easy to interpret and understand
- Maintaining transparency in the evaluation process
We hope our evaluation tool will help solve the challenges of assessing the products you need. Going forward, we will continue to “sharpen the knife,” expanding the evaluation dimensions and the range of vendors covered to make this tool even more useful.