MetaExtract
Extract 45,000+ metadata fields from any document
Problem
Organizations need to extract structured metadata from thousands of heterogeneous documents — invoices, contracts, medical records, research papers — each with different formats and field layouts.
Approach
Built a FastAPI backend with a modular extraction pipeline. Each document type gets a dedicated parser that combines regex patterns, layout analysis, and ML classification. Results are normalized into a unified schema and served via a REST API, with a React dashboard for monitoring.
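The per-document-type parser idea can be sketched as a small registry. This is a minimal illustration, not the project's actual code: the `PARSERS` registry, the `register` decorator, the `extract` function, and both regex patterns are hypothetical names and rules chosen for the example.

```python
import re
from typing import Callable, Dict

# Hypothetical registry: maps a document type to its parser function.
PARSERS: Dict[str, Callable[[str], dict]] = {}


def register(doc_type: str):
    """Decorator that plugs a parser into the registry under doc_type."""
    def wrap(fn: Callable[[str], dict]) -> Callable[[str], dict]:
        PARSERS[doc_type] = fn
        return fn
    return wrap


@register("invoice")
def parse_invoice(text: str) -> dict:
    # Illustrative regex rule only; a real parser would also use layout and ML signals.
    m = re.search(r"Invoice\s*(?:No\.?|#)\s*:?\s*(\w+)", text, re.IGNORECASE)
    return {"invoice_number": m.group(1) if m else None}


@register("contract")
def parse_contract(text: str) -> dict:
    m = re.search(r"Effective Date:\s*(\d{4}-\d{2}-\d{2})", text)
    return {"effective_date": m.group(1) if m else None}


def extract(doc_type: str, text: str) -> dict:
    """Dispatch to the parser registered for doc_type."""
    parser = PARSERS.get(doc_type)
    if parser is None:
        raise ValueError(f"no parser registered for {doc_type!r}")
    return {"doc_type": doc_type, "fields": parser(text)}
```

Adding a new document type then means writing one decorated function, without touching the dispatch logic.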
Technical Implementation
Architecture
Modular FastAPI backend with pluggable parsers per document type. Async job queue for batch processing. PostgreSQL for metadata storage, Redis for caching.
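As a rough sketch of the async batch queue, the following uses only `asyncio` from the standard library. The source does not say which queue technology is used, so treat this as an assumption-laden illustration; `process_document`, `worker`, and `run_batch` are hypothetical names, and the processing step is a placeholder.

```python
import asyncio


async def process_document(doc_id: str) -> dict:
    # Placeholder for the real extraction call (OCR, parsing, normalization).
    await asyncio.sleep(0)
    return {"doc_id": doc_id, "status": "done"}


async def worker(queue: asyncio.Queue, results: list) -> None:
    """Pull document IDs off the queue until cancelled."""
    while True:
        doc_id = await queue.get()
        try:
            results.append(await process_document(doc_id))
        except Exception as exc:
            # Record the failure instead of killing the worker.
            results.append({"doc_id": doc_id, "status": "failed", "error": str(exc)})
        finally:
            queue.task_done()


async def run_batch(doc_ids, concurrency: int = 4) -> list:
    """Process doc_ids with a fixed number of concurrent workers."""
    queue: asyncio.Queue = asyncio.Queue()
    results: list = []
    for doc_id in doc_ids:
        queue.put_nowait(doc_id)
    workers = [asyncio.create_task(worker(queue, results)) for _ in range(concurrency)]
    await queue.join()  # wait until every queued document has been handled
    for w in workers:
        w.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return results
```

A call such as `asyncio.run(run_batch(["doc-1", "doc-2"]))` returns one result dictionary per document, in completion order.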
OCR Pipeline
Tesseract OCR with custom preprocessing (deskewing, denoising, contrast normalization). Layout analysis combines heuristic and ML-based region detection.
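To make one of the preprocessing steps concrete, here is a plain-Python sketch of contrast normalization as a min-max stretch. It is an assumption that the pipeline uses this particular method; a production version would more likely run on an image library's array type. The `Image` alias and `normalize_contrast` are hypothetical names.

```python
from typing import List

Image = List[List[int]]  # grayscale pixels in 0-255, row-major


def normalize_contrast(img: Image) -> Image:
    """Linearly stretch pixel intensities so they span the full 0-255 range."""
    lo = min(min(row) for row in img)
    hi = max(max(row) for row in img)
    if hi == lo:
        # Flat image: nothing to stretch, return an unchanged copy.
        return [row[:] for row in img]
    # Integer arithmetic, rounding down, keeps every output within 0-255.
    return [[(p - lo) * 255 // (hi - lo) for p in row] for row in img]
```

A faint scan whose pixels only span 100 to 200 is mapped onto the full range, which tends to give the OCR engine cleaner separation between ink and background.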
Schema Normalization
Unified schema mapping 45,000+ field variations to standardized outputs. Confidence scoring per field with fallback rules.
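The mapping from label variations to a canonical field might look like the sketch below. The alias table, the `NormalizedField` type, and the specific fallback rule (unmapped labels get zero confidence) are assumptions made for illustration, not the system's real schema.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical alias table: many source labels map to one canonical field.
ALIASES = {
    "patient name": "patient_name",
    "pt name": "patient_name",
    "name of patient": "patient_name",
    "dob": "date_of_birth",
    "date of birth": "date_of_birth",
    "birth date": "date_of_birth",
}


@dataclass
class NormalizedField:
    name: Optional[str]   # canonical name, or None if the label is unmapped
    value: str
    confidence: float
    source_label: str     # original label, kept as a trace for reviewers


def _key(label: str) -> str:
    # Lowercase, replace punctuation with spaces, and collapse whitespace.
    cleaned = "".join(c if c.isalnum() else " " for c in label.lower())
    return " ".join(cleaned.split())


def normalize(label: str, value: str, extractor_confidence: float) -> NormalizedField:
    canonical = ALIASES.get(_key(label))
    if canonical is None:
        # Assumed fallback rule: keep the value but zero the confidence so it gets reviewed.
        return NormalizedField(None, value.strip(), 0.0, label)
    return NormalizedField(canonical, value.strip(), extractor_confidence, label)
```

Keeping `source_label` on the output gives reviewers a way to see which variation produced a given canonical field.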
Performance
Handles 1,000+ pages/hour on a single instance. Sub-2s response for single-page documents.
Integration
REST API with webhook callbacks. React dashboard for monitoring queue status and reviewing extraction accuracy.
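One way a webhook callback could be built and checked is shown below. The source only states that webhook callbacks exist; the payload shape, the HMAC signing step, and the `X-Signature-SHA256` header name are all assumptions added for illustration.

```python
import hashlib
import hmac
import json
from typing import Dict, Tuple


def build_webhook(job_id: str, status: str, secret: bytes) -> Tuple[bytes, Dict[str, str]]:
    """Return (body, headers) for a job-completion callback."""
    body = json.dumps({"job_id": job_id, "status": status}, sort_keys=True).encode("utf-8")
    signature = hmac.new(secret, body, hashlib.sha256).hexdigest()
    headers = {
        "Content-Type": "application/json",
        "X-Signature-SHA256": signature,  # hypothetical header name
    }
    return body, headers


def verify_webhook(body: bytes, headers: Dict[str, str], secret: bytes) -> bool:
    """Receiver-side check that the body was signed with the shared secret."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking information through comparison timing.
    return hmac.compare_digest(expected, headers.get("X-Signature-SHA256", ""))
```

Signing the body lets the receiving service reject callbacks that were tampered with or did not come from the extraction API.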
Outcomes
- →Processes 45,000+ distinct field types across heterogeneous document formats
- →Reduced manual data entry by ~80% for healthcare document workflows
- →Confidence scoring enables human review queue prioritization
- →Deployed in production handling thousands of documents weekly
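Confidence-based review prioritization can be sketched as follows: each document is ranked by its least confident field, and only documents below a threshold enter the queue. The `review_order` function and the 0.85 threshold are hypothetical values chosen for the example.

```python
import heapq
from typing import Dict, List


def review_order(documents: Dict[str, Dict[str, float]], threshold: float = 0.85) -> List[str]:
    """Return IDs of documents needing review, least confident first.

    documents maps a document ID to {field_name: confidence}.
    """
    heap: list = []
    for doc_id, fields in documents.items():
        # A document with no extracted fields is treated as fully uncertain.
        worst = min(fields.values(), default=0.0)
        if worst < threshold:
            heapq.heappush(heap, (worst, doc_id))
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]
```

Documents whose fields all clear the threshold never reach a reviewer, which is where the reduction in manual checking would come from under this scheme.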
Ownership & scope
Owned system design and implementation end-to-end: extraction pipeline, normalization logic, API layer, and operator-facing monitoring flow.
Constraints
- →Heterogeneous source documents with inconsistent layouts and noisy scans
- →Need for usable confidence scores to support human review queues
- →Production reliability requirements in document-heavy healthcare operations
Trade-offs
- →Used a hybrid rules + ML approach instead of model-only extraction to keep outputs debuggable and reliable for each field type
- →Prioritized high-confidence core fields first, then expanded long-tail field coverage incrementally
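The hybrid rules + ML trade-off can be illustrated with a rule-first extractor that falls back to a model only when no rule matches. Everything here is a sketch under assumptions: `extract_date`, `stub_model`, and the fixed 0.99 rule confidence are hypothetical, and the stub stands in for a trained model.

```python
import re
from typing import Callable, Optional, Tuple

Result = Tuple[Optional[str], float, str]  # (value, confidence, source)
Model = Callable[[str], Tuple[Optional[str], float]]


def extract_date(text: str, ml_model: Model) -> Result:
    # Rule first: an explicit ISO date is unambiguous, and a regex match is easy to debug.
    m = re.search(r"\b(\d{4}-\d{2}-\d{2})\b", text)
    if m:
        return m.group(1), 0.99, "rule"  # assumed confidence for rule hits
    # Otherwise defer to the model and pass its own confidence through.
    value, confidence = ml_model(text)
    return value, confidence, "ml"


def stub_model(text: str) -> Tuple[Optional[str], float]:
    # Stand-in for a trained model; always returns a fixed low-confidence guess.
    return "2024-03-05", 0.6
```

Recording the source of each value ("rule" or "ml") is what keeps the output debuggable: a wrong value can be traced to a specific pattern or to the model.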
What changed
- →Manual extraction and copy-paste became API-backed structured outputs
- →Review teams moved from full-document checking to confidence-prioritized exception review
Workflow artifacts
- →Queue-level extraction status view for operations
- →Per-field confidence and normalization trace for reviewer decisions
- →Unified output schema mapping thousands of field variants
Result
Production-grade metadata extraction system built for document-heavy healthcare workflows.