AI/ML · 2025 · Featured

MetaExtract

Extract 45,000+ metadata fields from any document

Python · FastAPI · React · Docker · Tesseract OCR · PostgreSQL · Redis · Celery

Problem

Organizations need to extract structured metadata from thousands of heterogeneous documents (invoices, contracts, medical records, research papers), each with its own format and field layout.

Approach

Built a FastAPI backend with a modular extraction pipeline. Each document type gets a dedicated parser combining regex patterns, layout analysis, and ML classification. Results are normalized into a unified schema and served via a REST API, with a React dashboard for monitoring.

Technical Implementation

Architecture

Modular FastAPI backend with pluggable parsers per document type. Async job queue for batch processing. PostgreSQL for metadata storage, Redis for caching.
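
A minimal sketch of what "pluggable parsers per document type" could look like, assuming a simple registry keyed by document type. The names `ExtractionResult`, `register_parser`, `parse_invoice`, and `extract` are hypothetical and are not taken from the project; the regex is only an illustrative rule.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict


# Hypothetical result type; the real project's schema is not shown in the source.
@dataclass
class ExtractionResult:
    doc_type: str
    fields: Dict[str, str] = field(default_factory=dict)


Parser = Callable[[str], ExtractionResult]

# Registry mapping a document type to the parser that handles it.
_PARSERS: Dict[str, Parser] = {}


def register_parser(doc_type: str) -> Callable[[Parser], Parser]:
    """Decorator that plugs a parser into the registry under doc_type."""
    def decorator(fn: Parser) -> Parser:
        _PARSERS[doc_type] = fn
        return fn
    return decorator


@register_parser("invoice")
def parse_invoice(text: str) -> ExtractionResult:
    # Illustrative regex rule only; real parsers would combine several signals.
    match = re.search(r"Invoice\s*(?:No\.?|#)\s*:?\s*(\S+)", text, re.IGNORECASE)
    fields = {"invoice_number": match.group(1)} if match else {}
    return ExtractionResult("invoice", fields)


def extract(doc_type: str, text: str) -> ExtractionResult:
    """Dispatch to the parser registered for doc_type."""
    try:
        parser = _PARSERS[doc_type]
    except KeyError:
        raise ValueError(f"no parser registered for {doc_type!r}") from None
    return parser(text)
```

With a registry like this, supporting a new document type means adding one decorated function, and the dispatch code stays unchanged.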

OCR Pipeline

Tesseract OCR with custom preprocessing (deskewing, denoising, contrast normalization). Layout analysis using heuristic + ML-based region detection.
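
To illustrate one of the preprocessing steps named above, here is a dependency-free sketch of contrast normalization as a min-max stretch over a grayscale image. It is an assumption about what such a step might do, not the project's actual preprocessing; the `Image` alias and `stretch_contrast` are hypothetical names, and a production pipeline would more likely use an imaging library.

```python
from typing import List

# Hypothetical representation: rows of grayscale pixel values in 0..255.
Image = List[List[int]]


def stretch_contrast(image: Image) -> Image:
    """Linearly rescale pixel values so the darkest maps to 0 and the brightest to 255."""
    pixels = [p for row in image for p in row]
    if not pixels:
        return [row[:] for row in image]
    lo, hi = min(pixels), max(pixels)
    if lo == hi:
        # Flat image: there is no contrast to stretch, so return a copy.
        return [row[:] for row in image]
    span = hi - lo
    # Integer arithmetic keeps the result deterministic (floor division).
    return [[(p - lo) * 255 // span for p in row] for row in image]
```

For example, a faint scan whose values only span 100 to 200 is expanded to use the full 0 to 255 range, which tends to make text edges easier for OCR to separate from the background.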

Schema Normalization

Unified schema mapping 45,000+ field variations to standardized outputs. Per-field confidence scoring with fallback rules.
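
A minimal sketch of the normalization idea, under stated assumptions: a small alias table stands in for the full variant mapping, the highest-confidence candidate wins when several labels map to the same field, and a result below an assumed threshold is flagged for review as one possible fallback rule. `ALIASES`, `REVIEW_THRESHOLD`, `NormalizedField`, `canonical_name`, and `normalize` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical alias table; the real mapping covers far more variants.
ALIASES: Dict[str, str] = {
    "invoice no": "invoice_number",
    "invoice #": "invoice_number",
    "inv number": "invoice_number",
    "dob": "date_of_birth",
    "date of birth": "date_of_birth",
}

REVIEW_THRESHOLD = 0.6  # assumed cut-off, not taken from the source


@dataclass(frozen=True)
class NormalizedField:
    name: str
    value: str
    confidence: float
    needs_review: bool


def canonical_name(raw_label: str) -> Optional[str]:
    """Map a raw label such as 'Invoice No:' to its canonical field name."""
    key = " ".join(raw_label.lower().replace(":", " ").split())
    return ALIASES.get(key)


def normalize(
    candidates: Iterable[Tuple[str, str, float]]
) -> Dict[str, NormalizedField]:
    """Keep the highest-confidence candidate per canonical field."""
    best: Dict[str, NormalizedField] = {}
    for raw_label, value, confidence in candidates:
        name = canonical_name(raw_label)
        if name is None:
            continue  # unknown labels are out of scope for this sketch
        current = best.get(name)
        if current is None or confidence > current.confidence:
            best[name] = NormalizedField(
                name, value.strip(), confidence, confidence < REVIEW_THRESHOLD
            )
    return best
```

Each candidate is a `(raw_label, value, confidence)` tuple, so the output records both the chosen value and whether it should go to a reviewer.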

Performance

Handles 1,000+ pages/hour on a single instance. Sub-2s response for single-page documents.

Integration

REST API with webhook callbacks. React dashboard for monitoring queue status and reviewing extraction accuracy.

Outcomes

→Processing 45,000+ distinct field types from heterogeneous document formats
→Reduced manual data entry by ~80% for healthcare document workflows
→Confidence scoring enables human review queue prioritization
→Deployed in production handling thousands of documents weekly
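
One way confidence scores could drive review prioritization is sketched below, assuming each document is ranked by its weakest field and only documents under a threshold are queued. The `review_order` function, its input shape, and the `0.8` default are hypothetical and not taken from the project.

```python
import heapq
from typing import Dict, List, Tuple


def review_order(
    documents: Dict[str, Dict[str, float]], threshold: float = 0.8
) -> List[str]:
    """Return IDs of documents needing review, least confident first.

    documents maps a document ID to its per-field confidence scores.
    A document is queued when its weakest field falls below threshold.
    """
    heap: List[Tuple[float, str]] = []
    for doc_id, field_confidences in documents.items():
        if not field_confidences:
            continue  # nothing extracted; handled elsewhere in a real system
        weakest = min(field_confidences.values())
        if weakest < threshold:
            heapq.heappush(heap, (weakest, doc_id))
    # Pop in ascending order of weakest-field confidence.
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]
```

Documents whose fields all clear the threshold never enter the queue, which matches the shift from full-document checking to exception review.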

Ownership & scope

Owned system design and implementation end-to-end: extraction pipeline, normalization logic, API layer, and operator-facing monitoring flow.

Constraints

→Heterogeneous source documents with inconsistent layouts and noisy scans
→Need for usable confidence scores to support human review queues
→Production reliability requirements in document-heavy healthcare operations

Trade-offs

→Used a hybrid rules + ML approach instead of model-only extraction to keep outputs debuggable and reliable by field type
→Prioritized high-confidence core fields first, then expanded long-tail field coverage incrementally

What changed

→Manual extraction and copy-paste became API-backed structured outputs
→Review teams moved from full-document checking to confidence-prioritized exception review

Workflow artifacts

→Queue-level extraction status view for operations
→Per-field confidence and normalization trace for reviewer decisions
→Unified output schema mapping thousands of field variants

Result

Production-grade metadata extraction system built for document-heavy healthcare workflows.
