Two Approaches to B2B Document Extraction: Rules vs. Large Language Models

Introduction

Automating the extraction of structured data from B2B documents—such as purchase orders, invoices, and shipment confirmations—is a common pain point. While traditional rule-based methods have been the go‑to solution for decades, the emergence of large language models (LLMs) offers an alternative that promises greater flexibility. In this article, we compare two implementations of a document extractor built for a realistic B2B order scenario: one using a rule‑based system with pytesseract (a Python wrapper for the Tesseract OCR engine) and the other using an LLM approach with Ollama and LLaMA 3. We examine accuracy, speed, cost, and maintainability to help you decide which path suits your use case.

Source: towardsdatascience.com

The Rule‑Based Approach

How It Works

The rule‑based extractor relies on pytesseract, a Python wrapper for Google’s Tesseract OCR engine. The process begins with image preprocessing (deskewing, binarisation, and noise removal) followed by OCR to convert the scanned PDF or image into raw text. Hand‑crafted regular expressions and positional heuristics then extract fields such as order number, date, line items, and totals. For example, a pattern like Order\s*#:\s*(\d+) captures the order number, while table boundaries are guessed based on horizontal lines or consistent spacing.
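As a concrete illustration, here is a minimal sketch of the rule layer. The field names, patterns, and sample text are illustrative, not the article's actual implementation; in the full pipeline the input would come from pytesseract.image_to_string on the preprocessed scan.

```python
import re

# In the real pipeline this text comes from OCR, e.g.:
#   text = pytesseract.image_to_string(preprocessed_image)
# A hardcoded sample stands in for it here.
SAMPLE_OCR_TEXT = """\
ACME Industrial Supply
Order #: 48213
Date: 2024-03-15
Total: $1,294.50
"""

def extract_fields(text: str) -> dict:
    """Apply hand-crafted patterns; return None for fields that do not match."""
    patterns = {
        "order_number": r"Order\s*#:\s*(\d+)",
        "date": r"Date:\s*(\d{4}-\d{2}-\d{2})",
        "total": r"Total:\s*\$?([\d,]+\.\d{2})",
    }
    fields = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, text)
        fields[name] = match.group(1) if match else None
    return fields

print(extract_fields(SAMPLE_OCR_TEXT))
# {'order_number': '48213', 'date': '2024-03-15', 'total': '1,294.50'}
```

Note the deliberate behaviour on a miss: an unmatched field yields None rather than a guess, which is the determinism the comparison below relies on.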

Strengths and Weaknesses

The chief strengths of the rule‑based extractor are determinism, speed, and cost: the same input always yields the same output, no GPU is required, and per‑document latency is a fraction of a second. Its weakness is brittleness—every new vendor layout or template change can break the regular expressions and positional heuristics, forcing manual updates.

The LLM‑Based Approach

How It Works

The LLM‑based system uses Ollama to run LLaMA 3 locally. The scanned document is first processed by the same pytesseract OCR layer to extract all visible text, but instead of applying rules, the entire plain‑text output is fed into a prompt that instructs the model to return a structured JSON object containing the required fields. The prompt includes a few examples (few‑shot prompting) and describes the expected schema.
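A sketch of how such a prompt might be assembled and the model's reply parsed. The schema, example, and helper names are assumptions for illustration; the actual Ollama call is shown commented out, since it requires a running llama3 model.

```python
import json

# Illustrative schema description embedded in the prompt (not the article's exact one).
SCHEMA = ('{"order_number": str, "date": str, '
          '"line_items": [{"sku": str, "qty": int, "price": float}], "total": float}')

FEW_SHOT_EXAMPLE = (
    "Text:\nOrder #: 111\nDate: 2024-01-02\nTotal: $10.00\n"
    'JSON:\n{"order_number": "111", "date": "2024-01-02", "line_items": [], "total": 10.0}\n'
)

def build_prompt(ocr_text: str) -> str:
    """Assemble a few-shot prompt asking for JSON matching the schema."""
    return (
        "Extract the following fields from the document text and return ONLY a JSON "
        f"object matching this schema: {SCHEMA}\n"
        "Use null for any field not present in the text. Do not invent values.\n\n"
        f"Example:\n{FEW_SHOT_EXAMPLE}\n"
        f"Text:\n{ocr_text}\nJSON:\n"
    )

def parse_response(raw: str) -> dict:
    # Models sometimes wrap JSON in Markdown fences; strip them before parsing.
    cleaned = raw.strip()
    cleaned = cleaned.removeprefix("```json").removeprefix("```").removesuffix("```")
    return json.loads(cleaned.strip())

prompt = build_prompt("Order #: 48213\nDate: 2024-03-15\nTotal: $99.00")
# With a local model running, the call would look roughly like:
#   import ollama
#   reply = ollama.chat(model="llama3", messages=[{"role": "user", "content": prompt}])
#   data = parse_response(reply["message"]["content"])
```

The explicit "Do not invent values / use null" instruction is one mitigation for the hallucination issue discussed under Accuracy below, though it does not eliminate it.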

Strengths and Weaknesses

The LLM's chief strength is flexibility: it generalises to layouts it has never seen, tolerating merged cells, missing headers, and vendor‑specific quirks without code changes. Its weaknesses are higher latency and hardware cost (GPU inference), non‑determinism, and occasional hallucination, where the model supplies a plausible value for a field that is genuinely absent.

Comparative Analysis

Accuracy

In our B2B order test set (which included 50 invoices from five different vendors with varying layouts), the rule‑based system achieved 92% field‑level accuracy, mostly failing on oddly placed line‑item tables. The LLM approach reached 97% accuracy, successfully handling variations like merged cells and missing headers. However, the LLM occasionally invented a value for a field when the information was truly missing (a false positive), whereas the rule system simply returned null.


Speed and Throughput

For a single‑page order, the rule‑based extractor processed 100 documents in 40 seconds (0.4 s per doc). The LLM took 6 minutes and 20 seconds (3.8 s per doc) using an NVIDIA A10G GPU. On a CPU‑only machine, LLM inference was impractically slow (over 30 s per doc).

Maintenance and Flexibility

Over a six‑month period, the rule‑based system required 12 manual updates to adapt to vendor template changes. The LLM‑based system required none—new layouts were handled without code changes. On the other hand, the LLM system needed occasional prompt tuning (e.g., adding examples for a new vendor’s abbreviation style).

Cost

The two systems concentrate cost in different places. The rule‑based extractor runs on commodity CPUs, so its per‑document compute cost is negligible; its real expense is engineering time spent writing and maintaining rules (12 updates over six months in our test). The LLM system inverts this: ongoing engineering effort is minimal, but it needs GPU hardware (an NVIDIA A10G in our benchmark) or must accept CPU inference times of over 30 s per document, so compute dominates the bill.

Practical Recommendations

Choose the rule‑based approach if:

- Your documents come from a small, stable set of vendors with fixed templates.
- You need high throughput or low latency on CPU‑only hardware.
- Determinism matters: a missing field should return null, never a guess.

Choose the LLM‑based approach if:

- Layouts vary widely or new vendors appear frequently.
- You can afford GPU inference (or tolerate several seconds per document).
- Maximum accuracy on messy documents outweighs the risk of occasional hallucinated values.

For many teams, a hybrid approach works best: use rules for simple, high‑volume documents and fall back on an LLM for complex or unknown layouts. This balances cost, speed, and accuracy.
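The hybrid routing described above can be sketched in a few lines. The extractors here are stand‑in stubs, and the required‑field set is an assumption for illustration.

```python
REQUIRED_FIELDS = ("order_number", "date", "total")

def hybrid_extract(text, rule_extractor, llm_extractor):
    """Try the cheap rule path first; escalate to the LLM only when rules leave gaps."""
    result = rule_extractor(text)
    if all(result.get(field) is not None for field in REQUIRED_FIELDS):
        return result, "rules"
    return llm_extractor(text), "llm"

# Stand-in extractors demonstrating the routing behaviour.
def rules_stub(text):
    # Rules parsed two fields but missed 'total' on this layout.
    return {"order_number": "48213", "date": "2024-03-15", "total": None}

def llm_stub(text):
    return {"order_number": "48213", "date": "2024-03-15", "total": 99.0}

result, path = hybrid_extract("sample document text", rules_stub, llm_stub)
print(path)  # llm  (rules left 'total' empty, so the router escalated)
```

Because most high‑volume documents take the cheap path and only the stragglers hit the GPU, average cost and latency stay close to the rule‑based numbers while accuracy approaches the LLM's.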

Conclusion

Both pytesseract‑based rules and Ollama‑powered LLMs can successfully extract data from B2B documents, but they serve different needs. The rule system is fast, cheap, and deterministic, yet brittle. The LLM system is flexible and accurate but slower and more expensive. By understanding the trade‑offs described in this comparison, you can select the best tool—or combine both—for your document extraction pipeline.
