New ELT pipelines and a data warehouse on AWS that extract unstructured text from documents and images, enrich it with logic and embeddings, and return clean, structured JSON ready for downstream platforms.
{
"candidate_id": "anon_7c2f...",
"name": "[REDACTED]",
"experience": [
{ "company": "RetailCo", "role": "Data Analyst", "from": "2021-04", "to": "2024-06" }
],
"skills": ["SQL", "Python", "Power BI", "NLP"],
"education": [ { "institution": "[REDACTED]", "degree": "BSc" } ],
"pii_redactions": ["face_obfuscation", "email", "phone", "address"],
"bias_redactions": ["university", "marital_status", "hobbies"],
"match_score": 0.86,
"matched_skills": ["NLP", "SQL", "Python"],
"meta": { "source": "cv.pdf", "processed_ms": 812, "version": "v1.9" }
}
Global recruiters and platforms needed an enterprise‑grade API to parse CVs across file types, extract accurate data from documents and images, remove bias signals and PII, and deliver a consistent schema in real time, all under strict security requirements.
We built a real‑time parsing API on AWS using Python (Flask), orchestrated ELT into a warehouse, and added an agentic workflow for extraction, validation, and enrichment. Amazon Textract powers OCR; computer vision detects and obfuscates faces; PII and bias indicators are redacted. A vector embedding framework scores candidate‑to‑role fit for instant search and ranking.
Upload via API; extract text with OCR from PDFs & images
Map fields to a canonical schema; dedupe & validate
Remove PII (emails, phone numbers, addresses) and bias signals
Derive skills, seniority, gaps; compute embedding match score
Return structured JSON; push to ELT and data warehouse
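As a rough illustration of the PII removal step above, the sketch below redacts emails and phone numbers from extracted text and records which redaction types were applied, mirroring the `pii_redactions` field in the sample JSON. The regular expressions and the `redact_pii` helper are assumptions made for this example, not the production rules; real PII detection (addresses in particular) needs much broader coverage than two patterns.

```python
import re

# Hypothetical patterns for illustration only; not the production rule set.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

PLACEHOLDER = "[REDACTED]"


def redact_pii(text):
    """Replace emails and phone numbers with a placeholder.

    Returns the redacted text and the list of redaction types that fired,
    in the order they were applied.
    """
    applied = []

    # Emails first, so digits inside an address are not seen by the phone pattern.
    text, count = EMAIL_RE.subn(PLACEHOLDER, text)
    if count:
        applied.append("email")

    text, count = PHONE_RE.subn(PLACEHOLDER, text)
    if count:
        applied.append("phone")

    return text, applied


if __name__ == "__main__":
    sample = "Contact: jane@example.com, +44 20 7946 0958"
    print(redact_pii(sample))
```

Returning the list of applied redactions alongside the text keeps an audit trail of what was removed without retaining the removed values themselves.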
Automated parsing and enrichment reduced manual sifting time by up to 75%, enabling recruiters to focus on candidate engagement.
Obfuscation of photos plus removal of personal & background indicators helps reduce unconscious bias in early screening.
A vector index enables lightning‑fast search across the candidate database and accurate relevance scoring against job descriptions.
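One common way to score relevance between embeddings is cosine similarity. The sketch below ranks candidates against a job description vector under that assumption; the source does not state which similarity measure or index is used, and the `rank_candidates` helper and the toy two-dimensional vectors are hypothetical. A linear scan stands in for the vector index here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(job_vector, candidate_vectors):
    """Return (candidate_id, score) pairs sorted from best to worst match.

    candidate_vectors maps a candidate ID to its embedding.
    This brute-force scan is a stand-in for a real vector index.
    """
    scored = [
        (candidate_id, cosine_similarity(job_vector, vector))
        for candidate_id, vector in candidate_vectors.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    job = [1.0, 0.0]
    candidates = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
    for candidate_id, score in rank_candidates(job, candidates):
        print(candidate_id, round(score, 3))
```

A production index would typically use approximate nearest-neighbour search rather than scoring every candidate, trading a small amount of recall for much lower latency.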
Delivered as a white‑label integration for global companies and recruitment platforms with configurable schemas and SLAs.
Built on AWS with ISO‑27001 aligned controls, audit logging, and PII minimisation. Real‑time processing with confidence thresholds and human‑in‑the‑loop escalation when required.
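The confidence thresholds and human‑in‑the‑loop escalation mentioned above could work along these lines: each extracted field carries a confidence score, and anything below a threshold is held back for review. This is a minimal sketch under stated assumptions; the `route_fields` helper, the 0.9 default threshold, and the field names are all hypothetical.

```python
def route_fields(fields, threshold=0.9):
    """Split extracted fields into auto-accepted values and names needing review.

    fields maps a field name to a (value, confidence) pair, with confidence
    in the range 0.0 to 1.0. The 0.9 default is an illustrative choice.
    """
    accepted = {}
    needs_review = []
    for name, (value, confidence) in fields.items():
        if confidence >= threshold:
            accepted[name] = value
        else:
            # Low-confidence fields are escalated to a human reviewer.
            needs_review.append(name)
    return accepted, needs_review


if __name__ == "__main__":
    extracted = {
        "role": ("Data Analyst", 0.97),
        "company": ("RetailCo", 0.95),
        "from": ("2021-04", 0.62),
    }
    print(route_fields(extracted))
```

Routing per field rather than per document means a single uncertain value does not block the rest of the record from being returned.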
"The ability to turn applications into clear, plain-language summaries will be transformative for how quickly hiring managers make decisions."
Turn unstructured documents into accurate, compliant data your platforms can trust.