How to Build a B2B Document Extractor: Rule-Based vs. LLM Approaches

Introduction

Extracting structured data from B2B documents—such as purchase orders, invoices, or delivery notes—is a common challenge. Two primary approaches exist: a traditional rule-based method using pytesseract for OCR and regex for parsing, and a modern LLM-based method using Ollama with LLaMA 3. This guide walks you through building both versions of the same document extractor, comparing their strengths and tradeoffs using a realistic B2B order scenario. By the end, you'll be able to choose the right approach for your own projects.

Source: towardsdatascience.com

What You Need

- Python 3.8 or later, with a virtual environment
- The pytesseract, pdf2image, Pillow, and requests packages
- Tesseract OCR installed system-wide
- Ollama installed and running, with the LLaMA 3 model pulled
- One or more scanned B2B PDFs (purchase orders, invoices, or delivery notes) to test against

Step-by-Step Instructions

Step 1: Set Up the Environment

Create a new Python virtual environment and install all required packages:

pip install pytesseract pdf2image Pillow requests

Ensure Tesseract OCR is installed globally (sudo apt install tesseract-ocr on Linux, or download the Windows installer). Also install and start Ollama, then pull the LLaMA 3 model:

ollama pull llama3

Step 2: Convert PDF to Images

B2B documents are often scanned PDFs. Use pdf2image to turn each page into a PNG image. Write a function that takes a PDF path, converts each page to an image, saves the images to disk, and returns their file paths.

Step 3: Perform OCR with pytesseract

For each image, call pytesseract.image_to_string() to extract raw text. This step is identical for both rule-based and LLM approaches, as they both need the text first. Store the extracted text per page.

Step 4: Build the Rule-Based Extractor

Use regular expressions and string logic to locate fields like Order Number, Date, Client Name, and Line Items. For example:
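A minimal rule-based sketch; the field labels (`Order Number:`, `Date:`, `Client:`) and the ISO date format are assumptions about this particular document layout and will need adjusting for yours:

```python
import re


def extract_fields(text):
    """Pull header fields out of OCR text with regular expressions."""
    patterns = {
        "order_number": r"Order\s*Number[:\s]+([A-Z0-9-]+)",
        "date": r"Date[:\s]+(\d{4}-\d{2}-\d{2})",
        "client_name": r"Client[:\s]+(.+)",
    }
    fields = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, text, re.IGNORECASE)
        fields[name] = match.group(1).strip() if match else None
    return fields
```

Each missed match yields `None` rather than an exception, which makes failures easy to count when comparing against the LLM later.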

This method is fast and predictable, but fragile if the document format changes.


Step 5: Build the LLM-Based Extractor

Instead of writing rules, send the extracted text to LLaMA 3 via Ollama’s API, using a structured prompt that asks the model to return specific fields as JSON:

prompt = f"""
Extract the following information from this purchase order:
- order_number
- date
- client_name
- line_items (array of objects with 'item', 'quantity', 'price')
Return only valid JSON.

Text:
{text}
"""

Use the requests library to call Ollama:

response = requests.post(
    'http://localhost:11434/api/generate',
    json={'model': 'llama3', 'prompt': prompt, 'stream': False},
)

Parse the JSON from the response.
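With `stream=False`, Ollama returns a single JSON object whose `response` field holds the model's text. A sketch of recovering the extracted data from it (the brace-scanning fallback is a pragmatic assumption, since models sometimes wrap the JSON in prose or code fences despite the instruction):

```python
import json


def parse_llm_json(response):
    """Extract the JSON object from an Ollama /api/generate reply."""
    raw = response.json()["response"]
    # Tolerate stray text around the JSON by taking the outermost braces.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("No JSON object found in model output")
    return json.loads(raw[start:end + 1])
```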

Step 6: Compare Outputs

Run both extractors on the same set of PDFs and compare:

- Accuracy: which fields each extractor gets right
- Robustness: how each handles a slightly different document layout
- Speed and cost: regex runs in milliseconds; a local LLM call takes seconds per page
- Failure modes: silent misses for the rules, plausible-looking hallucinations for the LLM

The original experiment showed that the rule-based approach failed on a slightly different document format, while the LLM gracefully adapted—but hallucinated one item.
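One way to sketch the comparison, assuming both extractors return dicts keyed by the field names from Step 5's prompt:

```python
def diff_extractions(rule_based, llm_based):
    """Return the fields where the two extractors disagree."""
    mismatches = {}
    for key in set(rule_based) | set(llm_based):
        if rule_based.get(key) != llm_based.get(key):
            mismatches[key] = {
                "rule_based": rule_based.get(key),
                "llm": llm_based.get(key),
            }
    return mismatches
```

Fields where only one extractor produced a value are the interesting ones: a `None` on the rule-based side often means a format change, while an extra value on the LLM side may be a hallucination.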

Tips for Success

- Validate the LLM's output before trusting it; a hallucinated line item looks just as well-formed as a real one.
- Keep the regex patterns in one place and pair them with sample documents as test fixtures, so format changes surface as failing tests rather than silent misses.
- Scan at 300 DPI or higher; OCR errors in the raw text degrade both approaches equally.

By following these steps, you can build your own B2B document extractor and decide which approach best fits your needs. For a deep dive into the original comparison, see the full article.
