Microservice Architecture for OCR and LLM Pipelines in Production

An arXiv paper details a microservice architecture for OCR and LLM pipelines that handles thousands of multi-page documents per hour, with practical insights from production.

Paper Summary and Core Contribution

Yao Fehlis and eleven co-authors published the paper on arXiv on 12 May 2026. It describes a microservice architecture that runs classification, OCR, and LLM-based field extraction on thousands of multi-page documents per hour. The work focuses on production constraints rather than new model research. It separates GPU inference from CPU orchestration, uses asynchronous queues for IO-heavy steps, and applies independent horizontal scaling. The authors report concrete measurements from batch profiling on real workloads.

Design Decisions in the Microservice Setup

The architecture treats each stage as an independent service. Classification runs first to route documents, then OCR extracts text, and finally an LLM parses structured fields. Hybrid classification combines lightweight rules with a small model to reduce unnecessary GPU calls. GPU-bound inference stays isolated from CPU-bound coordination so that workers handling queues do not compete for accelerator time. Asynchronous processing handles file I/O, database writes, and network calls to external storage without blocking inference threads.
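The rules-first routing described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the regex rules, the labels, and the `model_fallback` callable are all hypothetical stand-ins, and the real system may combine rules and model differently.

```python
import re
from dataclasses import dataclass
from typing import Callable

@dataclass
class Classification:
    label: str
    source: str  # "rule" if a cheap rule matched, "model" otherwise

# Hypothetical rules applied to cheap text, e.g. a filename or embedded text layer.
RULES = [
    (re.compile(r"\binvoice\b", re.IGNORECASE), "invoice"),
    (re.compile(r"\bpurchase order\b", re.IGNORECASE), "purchase_order"),
]

def classify(text_hint: str, model_fallback: Callable[[str], str]) -> Classification:
    """Try lightweight rules first; call the model only when no rule matches."""
    for pattern, label in RULES:
        if pattern.search(text_hint):
            return Classification(label, "rule")
    # Assumed to be the expensive path (a call to a small model service).
    return Classification(model_fallback(text_hint), "model")
```

Documents that match a rule never reach the model, which is the mechanism by which a hybrid classifier can cut unnecessary GPU calls.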

Horizontal scaling applies separately to each service. OCR workers scale with CPU cores while inference services scale with available GPUs. This avoids the common pattern of tying every component to a single deployment unit. The authors note that message queues decouple the stages, which allows back-pressure to build when a downstream stage such as the LLM calls cannot keep pace with the stage feeding it. In practice this means adding more OCR pods does not automatically increase the end-to-end rate if GPU capacity remains fixed.
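Back-pressure between two decoupled stages can be illustrated with a bounded in-process queue. The sketch below uses `asyncio.Queue` as a stand-in for a real message broker. The stage names, the sleep that simulates a slower consumer, and the `maxsize` value are assumptions for illustration only.

```python
import asyncio

async def ocr_stage(pages, out_queue):
    for page in pages:
        # put() suspends while the queue is full: this is the back-pressure.
        await out_queue.put(f"text:{page}")
    await out_queue.put(None)  # sentinel: no more work

async def llm_stage(in_queue, results, max_depth_seen):
    while True:
        # Record the deepest backlog observed, a rough proxy for queue depth.
        max_depth_seen[0] = max(max_depth_seen[0], in_queue.qsize())
        item = await in_queue.get()
        if item is None:
            break
        await asyncio.sleep(0.001)  # simulated slower downstream inference
        results.append(item)

async def run_pipeline(pages, maxsize=2):
    queue = asyncio.Queue(maxsize=maxsize)
    results, max_depth = [], [0]
    await asyncio.gather(
        ocr_stage(pages, queue),
        llm_stage(queue, results, max_depth),
    )
    return results, max_depth[0]
```

Because the queue is bounded, the producer is paused instead of piling up unbounded work, so the backlog never exceeds `maxsize` however fast the upstream stage runs.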

Latency and Saturation Findings

Batch profiling revealed that OCR, not the LLM parsing step, accounts for the majority of end-to-end latency. Even with fast vision models, character recognition on multi-page scans dominates the timeline. The second observation concerns concurrency limits: the system saturates at the capacity of the shared GPU inference service, not at a particular number of worker processes. Once GPU utilization peaks, requests from additional CPU workers simply queue up without raising throughput.
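The saturation behavior can be captured in a toy model: throughput grows with the number of workers until it hits the shared GPU ceiling. The function and the numbers in the usage note are illustrative assumptions, not measurements from the paper.

```python
def throughput_pages_per_s(workers: int, per_worker_rate: float, gpu_capacity: float) -> float:
    """Toy model: workers add throughput linearly until shared GPU capacity caps it."""
    if workers < 0:
        raise ValueError("workers must be non-negative")
    return min(workers * per_worker_rate, gpu_capacity)
```

With an assumed 2 pages/s per worker and a GPU ceiling of 10 pages/s, the curve flattens at five workers: a sixth, seventh, or eighth worker leaves the result at 10 pages/s.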

These measurements suggest monitoring strategies focused on GPU queue depth and OCR service latency rather than generic worker counts. Teams running similar pipelines can profile the OCR stage first when tuning for higher volume. Adjusting batch sizes inside the OCR service produced clearer gains than simply increasing concurrency at the orchestration layer.
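Batching inside the OCR service might look like the sketch below, which groups pages so that each inference call handles several pages at once. `ocr_batch_fn` is a hypothetical callable standing in for whatever OCR engine is used. The paper does not specify this interface.

```python
from typing import Callable, Iterable, Iterator, List, Tuple, TypeVar

T = TypeVar("T")

def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of up to batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final partial batch
        yield batch

def run_ocr(pages: Iterable[T], batch_size: int,
            ocr_batch_fn: Callable[[List[T]], List[str]]) -> Tuple[List[str], int]:
    """Run OCR batch by batch; return the texts and the number of inference calls."""
    texts: List[str] = []
    calls = 0
    for batch in batched(pages, batch_size):
        texts.extend(ocr_batch_fn(batch))
        calls += 1
    return texts, calls
```

Ten pages at a batch size of 4 take three inference calls instead of ten. Whether that improves throughput depends on the engine and hardware, so the batch size is something to tune under profiling.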

Practical Takeaways for Implementation

When building comparable systems, start by profiling the OCR component under representative document sizes. Measure both per-page time and variance across document types. Then allocate GPU resources based on the inference stage that follows classification, not on total worker count. Asynchronous task queues, such as Celery for Python or BullMQ for Node.js, fit the IO-bound segments well and let the inference containers stay lightweight and GPU-attached.
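A first profiling pass needs little more than the standard library. The sketch below times a call and summarizes per-page timings by document type. The sample layout (pairs of document type and seconds per page) and the function names are assumptions made for illustration.

```python
import statistics
import time
from collections import defaultdict

def time_call(fn, *args):
    """Return (result, elapsed_seconds) for a single call."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

def summarize(samples):
    """samples: iterable of (doc_type, seconds_per_page) pairs."""
    by_type = defaultdict(list)
    for doc_type, seconds in samples:
        by_type[doc_type].append(seconds)
    summary = {}
    for doc_type, values in by_type.items():
        summary[doc_type] = {
            "pages": len(values),
            "mean_s": statistics.fmean(values),
            # Sample standard deviation needs at least two data points.
            "stdev_s": statistics.stdev(values) if len(values) > 1 else 0.0,
            "max_s": max(values),
        }
    return summary
```

Wrapping each OCR call in `time_call` and passing the collected pairs to `summarize` gives a per-type view of both average cost and spread, which is the variance the paragraph above recommends measuring.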

Separate configuration for each microservice also simplifies rollouts. Changes to the LLM prompt or extraction schema can deploy without touching the OCR service. This reduces risk when documents arrive in new formats. The paper does not prescribe specific frameworks, but the pattern aligns with standard Kubernetes setups that define separate Deployments and horizontal pod autoscalers for CPU and GPU workloads.

FAQs

How does the architecture handle variable document lengths? It processes pages in parallel within the OCR service and aggregates results before the LLM stage, so longer documents increase queue time rather than blocking other jobs.

Does the paper recommend specific OCR engines? No, it focuses on placement of OCR within the pipeline and its measured latency contribution rather than endorsing particular libraries.

What limits scaling once GPUs are saturated? Additional workers wait on inference results, so throughput plateaus until more GPU capacity or model optimizations are added.
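The per-page parallelism mentioned in the first answer can be sketched with `asyncio`. Here `ocr_page` is a hypothetical placeholder for a call to the OCR service, and the concurrency limit is an assumed parameter, not a value from the paper.

```python
import asyncio

async def ocr_page(page_no: int) -> str:
    # Hypothetical stand-in for a network call to the OCR service.
    await asyncio.sleep(0)
    return f"page {page_no} text"

async def ocr_document(page_numbers, max_concurrency: int = 4) -> str:
    """OCR pages concurrently, then join the results in page order."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def limited(page_no: int) -> str:
        async with semaphore:  # cap in-flight pages per document
            return await ocr_page(page_no)

    # gather() returns results in the order the awaitables were passed in.
    texts = await asyncio.gather(*(limited(p) for p in page_numbers))
    return "\n".join(texts)
```

The semaphore keeps one long document from monopolizing the OCR service, and the aggregated text is only handed to the LLM stage once every page has finished.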
