Case Study

Data Extract & Enrich – CV Parsing API

New ELT pipelines and a data warehouse on AWS that extract unstructured text from documents and images, enrich it with business logic and embeddings, and return clean, structured JSON ready for downstream platforms.

OCR & Amazon Textract
Computer Vision redaction
Embedding-based match score
Structured JSON Output
Ready for ATS/CRM ingestion
{
  "candidate_id": "anon_7c2f...",
  "name": "[REDACTED]",
  "experience": [
    { "company": "RetailCo", "role": "Data Analyst", "from": "2021-04", "to": "2024-06" }
  ],
  "skills": ["SQL", "Python", "Power BI", "NLP"],
  "education": [ { "institution": "[REDACTED]", "degree": "BSc" } ],
  "pii_redactions": ["face_obfuscation", "email", "phone", "address"],
  "bias_redactions": ["university", "marital_status", "hobbies"],
  "match_score": 0.86,
  "matched_skills": ["NLP", "SQL", "Python"],
  "meta": { "source": "cv.pdf", "processed_ms": 812, "version": "v1.9" }
}
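
A consumer of this payload might sanity-check it before ingestion. The sketch below is illustrative only: the required keys are read off the sample above, while the 0 to 1 range for match_score and the rule that matched_skills must be a subset of skills are assumptions, not documented guarantees of the API. The function name validate_parsed_cv is hypothetical.

```python
# Keys taken from the sample payload shown above.
REQUIRED_KEYS = {
    "candidate_id", "name", "experience", "skills", "education",
    "pii_redactions", "bias_redactions", "match_score",
    "matched_skills", "meta",
}


def validate_parsed_cv(payload: dict) -> list[str]:
    """Return a list of problems; an empty list means the payload looks usable."""
    problems = []

    missing = REQUIRED_KEYS - set(payload)
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")

    # Assumption: match_score is a number in [0, 1], as the sample (0.86) suggests.
    score = payload.get("match_score")
    is_number = isinstance(score, (int, float)) and not isinstance(score, bool)
    if not is_number or not 0.0 <= score <= 1.0:
        problems.append("match_score must be a number between 0 and 1")

    # Assumption: every matched skill should also appear in the skills list.
    skills = set(payload.get("skills", []))
    unknown = [s for s in payload.get("matched_skills", []) if s not in skills]
    if unknown:
        problems.append(f"matched_skills not present in skills: {unknown}")

    return problems
```

Returning a list of problems instead of raising on the first one lets an ingestion job log every issue with a record in a single pass.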

The Challenge

Global recruiters and platforms needed an enterprise‑grade API to parse CVs across file types, extract accurate data from documents and images, remove bias signals and PII, and deliver a consistent schema in real time — all under strict security requirements.

Pain points
What we observed
Unstructured, inconsistent CV formats
Bias from visual & personal signals
Hard to map to clean, queryable schemas

The Solution

We built a real‑time parsing API on AWS using Python (Flask), orchestrated ELT into a warehouse, and added an agentic workflow for extraction, validation, and enrichment. Amazon Textract powers OCR; Computer Vision detects and obfuscates faces; PII and bias indicators are redacted. A vector embedding framework scores candidate‑to‑role fit for instant search and ranking.

1. Ingest: Upload via API; extract text with OCR from PDFs & images
2. Normalize: Map fields to a canonical schema; dedupe & validate
3. Redact: Remove PII (emails, phone numbers, addresses) and bias signals
4. Enrich: Derive skills, seniority, gaps; compute embedding match score
5. Deliver: Return structured JSON; push to ELT and data warehouse
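
The five stages above can be sketched as plain functions chained together. This is a minimal illustration under stated assumptions, not the production implementation: the "Key: value" input format, the regular expressions, and every function name are hypothetical. OCR is stubbed out (the real system uses Amazon Textract), and embedding scoring is omitted in favour of a simple skill intersection.

```python
import json
import re

# Deliberately simple patterns for illustration; real PII detection needs more care.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def ingest(raw_text: str) -> str:
    # Stand-in for OCR: here the text is assumed to be extracted already.
    return raw_text


def normalize(text: str) -> dict:
    # Map loose "Key: value" lines onto a small canonical record; dedupe skills.
    record = {"name": None, "skills": [], "contact": ""}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip().lower(), value.strip()
        if key == "skills":
            parts = (s.strip() for s in value.split(","))
            record["skills"] = list(dict.fromkeys(s for s in parts if s))
        elif key in ("name", "contact"):
            record[key] = value
    return record


def redact(record: dict) -> dict:
    # Drop the contact field, mask the name, and note which PII types were found.
    contact = record.get("contact", "")
    out = {k: v for k, v in record.items() if k != "contact"}
    out["name"] = "[REDACTED]"
    out["pii_redactions"] = [
        label
        for label, pattern in (("email", EMAIL_RE), ("phone", PHONE_RE))
        if pattern.search(contact)
    ]
    return out


def enrich(record: dict, role_skills: set) -> dict:
    # Stand-in for embedding scoring: list the skills the role also asks for.
    matched = [s for s in record["skills"] if s in role_skills]
    return {**record, "matched_skills": matched}


def deliver(record: dict) -> str:
    return json.dumps(record, indent=2)


def parse_cv(raw_text: str, role_skills: set) -> str:
    return deliver(enrich(redact(normalize(ingest(raw_text))), role_skills))
```

Keeping each stage a pure function of its input makes the stages easy to test in isolation and to reorder, for example to run redaction before any data is persisted.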

Architecture Snapshot
AWS‑native, enterprise‑grade
Runtime: AWS + Flask
Pipelines: ELT to warehouse
Relevance: Embeddings
Security: ISO‑27001 aligned

The Results

Up to 75% faster CV screening
Time saved for talent teams

Automated parsing and enrichment reduced manual sifting time by up to 75%, enabling recruiters to focus on candidate engagement.

Bias‑aware shortlists
Redaction by default

Obfuscation of photos plus removal of personal & background indicators helps reduce unconscious bias in early screening.

Instant search & match
Embedding‑powered

A vector index enables fast similarity search across the candidate database and relevance scoring against job descriptions.
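
As a rough illustration of embedding-based matching, the sketch below ranks candidates by cosine similarity between a job-description vector and candidate vectors. The source does not state which similarity measure or index is used, so cosine similarity, the toy three-dimensional vectors, and the function names are all assumptions. It also scans every candidate linearly, whereas a real vector index would typically use approximate nearest-neighbour search.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(job_vec: list[float], candidates: dict) -> list[tuple]:
    """Return (candidate_id, score) pairs, best match first."""
    scored = [
        (cand_id, cosine_similarity(job_vec, vec))
        for cand_id, vec in candidates.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy embeddings; real ones would come from an embedding model.
job = [1.0, 0.0, 1.0]
pool = {
    "cand_a": [1.0, 0.0, 1.0],
    "cand_b": [0.0, 1.0, 0.0],
    "cand_c": [1.0, 1.0, 1.0],
}
ranking = rank_candidates(job, pool)
```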

Integration‑ready
White‑label API

Delivered as a white‑label integration for global companies and recruitment platforms with configurable schemas and SLAs.

Security, Compliance & Ops

Built on AWS with ISO‑27001 aligned controls, audit logging, and PII minimisation. Processing runs in real time, with confidence thresholds and human‑in‑the‑loop escalation when required.
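
One way to picture confidence thresholds with human-in-the-loop escalation is the sketch below. The 0.9 threshold, the 0 to 1 confidence scale, the example confidence values, and the names FieldResult and route_extraction are all hypothetical; an OCR service that reports confidence as a percentage would need rescaling first.

```python
from dataclasses import dataclass


@dataclass
class FieldResult:
    name: str
    value: str
    confidence: float  # assumed 0.0 to 1.0


def route_extraction(fields: list[FieldResult], threshold: float = 0.9) -> dict:
    """Accept high-confidence fields; flag the rest for human review."""
    accepted = {f.name: f.value for f in fields if f.confidence >= threshold}
    needs_review = [f.name for f in fields if f.confidence < threshold]
    return {
        "accepted": accepted,
        "needs_review": needs_review,
        "escalate": bool(needs_review),  # any low-confidence field triggers review
    }


decision = route_extraction([
    FieldResult("company", "RetailCo", 0.97),
    FieldResult("role", "Data Analyst", 0.72),
])
```

Escalating per field instead of per document keeps the reviewer's task small: only the uncertain values need a second look.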

Controls included
Trust by design
Encryption in transit & at rest
Role‑based access & least privilege
PII redaction & face obfuscation
Full audit trail & observability

"The ability to turn applications into clear, plain-language summaries will be transformative for how quickly hiring managers make decisions."

CEO, Applicant Tracking System

Need enterprise‑grade parsing and enrichment?

Turn unstructured documents into accurate, compliant data your platforms can trust.