Migrating OCR extraction data, rules, and page metadata for certificate processing and data extraction workflows.
Migration of extraction and OCR data from DTX 1.0 to DTX 2.0. This is Phase 4 — depends on Area 4 (Certificates & Documents).
The extraction system in DTX 1.0 used a two-phase approach: (1) external OCR via the AgencyPort Turnstile API (ocr_event), and (2) local coordinate-based field extraction using rule templates (extraction, extraction_rule, page, page_attribute).
DTX 2.0 replaced the external Turnstile API with a fully local Poppler-based extraction pipeline (docutrax_extractor app). The local extraction architecture is structurally preserved — Page, PageAttribute, ExtractionRule, and Extraction models all exist in DTX 2.0 and closely mirror the DTX 1.0 schema. However, ocr_event has no corresponding model in DTX 2.0 (see Decision #6).
Key architectural change: DTX 1.0 relied on an external API for text extraction and OCR of scanned documents. DTX 2.0 uses Poppler pdftotext for text extraction only — scanned/image-only PDFs return ERR_NO_TEXT and are not supported. This is a capability regression that the team should be aware of.
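The boundary can be made concrete with a small sketch (hypothetical helper name — the real logic lives in docutrax_extractor/services/pdf_parser.py): a PDF whose pdftotext output is empty yields ERR_NO_TEXT instead of being routed to an OCR fallback.

```python
# Sketch of the DTX 2.0 capability boundary (illustrative helper, not the
# actual pdf_parser.py code): text-based PDFs proceed, image-only PDFs fail.

def classify_pdftotext_output(text: str) -> str:
    """Return an extraction status for raw pdftotext output.

    DTX 1.0 would have fallen back to the Turnstile OCR API for empty
    output; DTX 2.0 simply reports ERR_NO_TEXT.
    """
    if not text.strip():
        return "ERR_NO_TEXT"  # scanned/image-only PDF: unsupported in 2.0
    return "TEXT_AVAILABLE"   # proceed to fingerprinting and rule extraction
```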
| DTX 1.0 Table | Rows (sample) | Type | DTX 2.0 Target |
|---|---|---|---|
| page | 11 | Reference/config | Page (docutrax_extractor) |
| page_attribute | ~100 | Reference/config | PageAttribute (docutrax_extractor) |
| extraction_rule | ~100+ | Configuration | ExtractionRule (docutrax_extractor) |
| extraction | ~100+ | Transactional | Extraction (docutrax_extractor) |
| ocr_event | ~100+ | Transactional | No target model |
| DTX 2.0 Model | App | File | Lines |
|---|---|---|---|
| Page | docutrax_extractor | docutrax_extractor/models.py | 8–25 |
| PageAttribute | docutrax_extractor | docutrax_extractor/models.py | 28–50 |
| ExtractionRule | docutrax_extractor | docutrax_extractor/models.py | 53–126 |
| Extraction | docutrax_extractor | docutrax_extractor/models.py | 129–168 |
| Component | File | Purpose |
|---|---|---|
| PDFParser | docutrax_extractor/services/pdf_parser.py | 2,290-line extraction engine |
| trigger_pdf_parsing() | docutrax_profiles/signals.py:450–673 | Auto-parse on Document upload |
| parse_coi_pdf_task | docutrax_profiles/tasks.py:38–114 | Celery async task (exists but unused) |
| auto_parse_coi_pdf | docutrax_profiles/signals.py:676–748 | Post-save signal on Document |
| Depends On | What's Needed |
|---|---|
| Area 2: Customers/Accounts | extraction.so_id → Account mapping (for validation only, column dropped) |
| Area 3: Vendors/Records | ocr_event.vendor_id → Record mapping (for validation only) |
| Area 4: Certificates/Documents | extraction.certificate_id → Certificate mapping; deferred certificate fields |
| Depended On By | What It Provides |
|---|---|
| Area 10: Reporting | Extraction metrics for reports |
The page table is a small reference table (11 rows) defining known ACORD 25 form templates. Each row is a version + variant combination (e.g., ACORD 25 2016/03 "native", ACORD 25 2016/03 "StateFarmA4").
Column mapping:
| DTX 1.0 | DTX 2.0 | Action |
|---|---|---|
| page_id | page_id (PK) | Direct |
| producer | producer | Direct |
| model | model | Direct |
| version | version | Direct |
| title | — | Drop (always "CERTIFICATE OF LIABILITY INSURANCE", not in 2.0 model) |
| width | width | Direct |
| height | height | Direct |
| variant | variant | Direct |
Approach:
- Migrate all 11 rows with preserved page_id values (other tables reference these IDs).
- Drop the title column — it is a constant value and DTX 2.0 does not store it.
- Note: One row (page_id=6, StateFarmA4) uses A4 dimensions (595x842) while all others use US Letter (612x792).
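The row transform is small; a hedged sketch with plain dicts standing in for ORM rows (column names taken from the mapping above):

```python
def migrate_page_row(row: dict) -> dict:
    """Map a DTX 1.0 `page` row to DTX 2.0 Page fields.

    page_id is preserved verbatim because page_attribute, extraction_rule,
    and extraction all reference it; `title` is dropped (constant value).
    """
    keep = ("page_id", "producer", "model", "version", "width", "height", "variant")
    return {col: row[col] for col in keep}
```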
The page_attribute table contains fingerprint anchors — expected text locations used to identify which ACORD form template a PDF matches. The DTX 2.0 fingerprinting algorithm (pdf_parser.py:527–606) matches PDF word coordinates against these anchors within a ±3px tolerance. It does not compare text content — only coordinates.
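A minimal sketch of that coordinate-only matching (simplified from the logic at pdf_parser.py:527–606; the function name and data shapes here are illustrative assumptions):

```python
TOLERANCE = 3.0  # ±3px, hardcoded in the 2.0 parser

def fingerprint_score(anchors, words, tol=TOLERANCE):
    """Count how many template anchors have a PDF word at the expected spot.

    `anchors` and `words` are lists of (xMin, yMin, xMax, yMax) tuples.
    Text content is deliberately ignored — only coordinates are compared.
    """
    passed = 0
    for ax1, ay1, ax2, ay2 in anchors:
        if any(
            abs(ax1 - wx1) <= tol and abs(ay1 - wy1) <= tol
            and abs(ax2 - wx2) <= tol and abs(ay2 - wy2) <= tol
            for wx1, wy1, wx2, wy2 in words
        ):
            passed += 1
    return passed, len(anchors)  # fp_pass, fp_total
```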
Column mapping:
| DTX 1.0 | DTX 2.0 | Action |
|---|---|---|
| page_attribute_id | attribute_id (PK) | Rename |
| page_id | page (FK) | FK reference, direct |
| producer | producer | Direct (denormalized from Page) |
| model | model | Direct (denormalized from Page) |
| version | version | Direct (denormalized from Page) |
| variant | variant | Direct (denormalized from Page) |
| fieldname | fieldname | Direct (3-letter code: IMP, VER, NAI, etc.) |
| exp_value | — | Drop (expected text string — not in 2.0 model, not used by algorithm) |
| width | — | Drop (denormalized from Page, not in 2.0 model) |
| height | — | Drop (denormalized from Page, not in 2.0 model) |
| xMin | xMin | Direct |
| yMin | yMin | Direct |
| xMax | xMax | Direct |
| yMax | yMax | Direct |
Approach:
- Migrate all rows with preserved PK values.
- Drop exp_value — DTX 2.0's fingerprinting uses coordinate matching only, not text comparison. The fieldname (3-letter code like "IMP", "VER", "NAI") serves as the human-readable label.
- Drop width, height — redundant with the referenced Page record.
- Note: Pages 8–11 in the page table have no corresponding page_attribute rows in the sample. If the full dataset also lacks them, those templates have no fingerprinting and will never match.
The extraction_rule table defines field-level extraction rules — where to look on the page for each data field (bounding boxes, anchor text, regex patterns). This is the "brain" of the extraction engine. The sample contains 100 rules, all for page_id=1 (ACORD 25 2010/05 "native"). The full dataset likely has rules for other page templates.
Column mapping:
| DTX 1.0 | DTX 2.0 | Action |
|---|---|---|
| extraction_rule_id | extraction_rule_id (PK) | Direct |
| page_id | page (FK) | FK reference, direct |
| fieldname | fieldname | Direct |
| fieldtype | fieldtype | Map values (see below) |
| pattern | pattern | Direct (regex patterns) |
| above_str | above_str | Direct |
| below_str | below_str | Direct |
| left_str | left_str | Direct |
| right_str | right_str | Direct |
| above_offset | above_offset | Direct |
| below_offset | below_offset | Direct |
| left_offset | left_offset | Direct |
| right_offset | right_offset | Direct |
| above_ref | above_ref | Direct (TOP/BOTTOM/LEFT/RIGHT) |
| below_ref | below_ref | Direct |
| left_ref | left_ref | Direct |
| right_ref | right_ref | Direct |
| above_targ | above_targ | Direct (always NULL in sample) |
| below_targ | below_targ | Direct (always NULL in sample) |
| left_targ | left_targ | Direct |
| right_targ | right_targ | Direct |
| pagesection | — | Drop (section grouping, not in 2.0) |
| box_xMin | — | Drop (bounding box, 2.0 uses boundary markers instead) |
| box_yMin | — | Drop |
| box_xMax | — | Drop |
| box_yMax | — | Drop |
| above_residx | — | Drop (always NULL) |
| below_residx | — | Drop (always NULL) |
| left_residx | — | Drop (always NULL) |
| right_residx | — | Drop (always NULL) |
| above_custxp | — | Drop (custom XPath fragments — DTX 2.0 uses xpath_query or dynamic generation) |
| below_custxp | — | Drop (always NULL) |
| left_custxp | — | Drop |
| right_custxp | — | Drop |
| fielddefault | — | Drop (always NULL) |
| metatag | — | Drop (post-extraction field mapping, handled differently in 2.0) |
| pattern_start | — | Drop (boundary strings, not in 2.0) |
| pattern_end | — | Drop (boundary strings, not in 2.0) |
| mask | — | Drop (format validation, moved into fieldtype transformation) |
| box_element | — | Drop (extraction granularity: word/line/block/flow — hardcoded in 2.0 parser) |
| — | method | New — leave NULL (custom extraction method names, populated as needed) |
| — | xpath_query | New — leave NULL (dynamic query generation in 2.0 parser) |
Fieldtype value mapping:
| DTX 1.0 | DTX 2.0 | Notes |
|---|---|---|
| SINGLE_LINE_TEXT | TEXT | |
| MULTI_LINE_TEXT | TEXT | |
| NUMBER | TEXT | No NUMBER type in 2.0 |
| LIMIT | LIMIT | |
| CHECKBOX | CHECKBOX | |
| YNBOX | YNBOX | |
| DATE | DATE | |
| PHONE_NUMBER | PHONE | |
| PHONE_NUMBER_EXT | PHONE | |
| EMAIL_ADDRESS | EMAIL | |
| pagesection | — | Drop row — these are section divider pseudo-rules, not actual extraction rules |
Approach:
- Migrate all rules except rows where fieldtype = 'pagesection' (section dividers, not real rules).
- Map fieldtype values per the table above.
- Leave method and xpath_query as NULL — DTX 2.0's parser generates these dynamically.
- Preserve PK values since extraction records reference these implicitly via page_id.
- Pre-migration action: Verify which left_custxp / right_custxp values contain XPath constraints that may need manual re-implementation as method entries in DTX 2.0. The sample shows constraints like [@yMin >= 200].
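The fieldtype mapping and the pagesection filter can be expressed as one small transform; a sketch (the mapping dict mirrors the table above, and the generator function is an illustrative name, not existing code):

```python
FIELDTYPE_MAP = {
    "SINGLE_LINE_TEXT": "TEXT",
    "MULTI_LINE_TEXT": "TEXT",
    "NUMBER": "TEXT",          # no NUMBER type in 2.0
    "LIMIT": "LIMIT",
    "CHECKBOX": "CHECKBOX",
    "YNBOX": "YNBOX",
    "DATE": "DATE",
    "PHONE_NUMBER": "PHONE",
    "PHONE_NUMBER_EXT": "PHONE",
    "EMAIL_ADDRESS": "EMAIL",
}

def migrate_rule_rows(rows):
    """Yield migrated extraction_rule rows, skipping pagesection pseudo-rules."""
    for row in rows:
        if row["fieldtype"] == "pagesection":
            continue  # section divider, not a real extraction rule
        out = dict(row, fieldtype=FIELDTYPE_MAP[row["fieldtype"]])
        out["method"] = None       # populated as needed in 2.0
        out["xpath_query"] = None  # generated dynamically by the 2.0 parser
        yield out
```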
The extraction table records individual extraction events — which template matched, what confidence score, how many fields were extracted. These are historical audit records of past PDF processing.
Column mapping:
| DTX 1.0 | DTX 2.0 | Action |
|---|---|---|
| extraction_id | extraction_id (PK) | Direct |
| page_id | page (FK) | FK reference, direct (NULL for 13 error rows) |
| certificate_id | certificate (FK) | FK reference, remap via Area 4 mapping (NULL for error rows) |
| status | status | Map values: SUCCESS → PASS (if DTX 2.0 uses PASS) |
| confidence | confidence | Direct |
| fp_total | fp_total | Direct |
| fp_pass | fp_pass | Direct |
| rules | rules | Direct (always 112 when populated) |
| fieldsWithData | fieldsWithData | Direct |
| scantime | scantime | Direct (milliseconds) |
| create_time | create_time | Direct (see Known Issue #1) |
| repo_file_id | — | Drop (DTX 2.0 links via Document, not repo_file) |
| pagewidth | — | Drop (available from Page model) |
| pageheight | — | Drop (available from Page model) |
| raw | — | Drop (always NULL in sample) |
| layoutxml | — | Drop (always NULL in sample) |
| metaxml | — | Drop (always NULL in sample) |
| so_id | — | Drop (derivable from certificate → RequirementsGroup → Account) |
| ch_id | — | Drop (derivable from certificate chain) |
| ph_id | — | Drop (derivable from certificate chain) |
| filename | — | Drop (available via Document) |
| filepath | — | Drop (server paths not applicable in 2.0) |
| fields | — | Drop (same as rules in all sample rows) |
| PDFversion | — | Drop (PDF metadata, not stored in 2.0) |
| tolerance | — | Drop (constant 3.0, hardcoded in 2.0 parser) |
| — | extracted_data | New — see Decision #5 |
| — | content_hash | New — see Decision #5 |
Status value mapping:
| DTX 1.0 | DTX 2.0 | Notes |
|---|---|---|
| SUCCESS | Verify actual value used by pdf_parser.py | Check process_document() return values |
| ERR_NO_TEXT | ERR_NO_TEXT | Same value used |
| ERR_NO_PAGE_TEMPLATE_FOUND | ERR_NO_PAGE_TEMPLATE_FOUND | Same value used |
Pre-migration action: Confirm the exact status string the DTX 2.0 parser writes for successful extractions. If it uses "SUCCESS" already, no mapping needed. If it uses "PASS", map accordingly.
Approach:
- Migrate extraction records with preserved PKs.
- Remap certificate_id via the Area 4 certificate ID mapping.
- Drop 14 columns that are either always NULL, redundant, or not in the 2.0 model.
- Handle error rows (13 in sample) where page_id, certificate_id, and metrics are NULL — migrate as-is with NULLs preserved.
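Putting the approach together as a sketch: cert_map stands in for the Area 4 certificate ID mapping, and STATUS_MAP assumes the pre-migration check confirms a SUCCESS → PASS rename (otherwise it collapses to the identity). All names are illustrative.

```python
KEEP = ("extraction_id", "page_id", "status", "confidence", "fp_total",
        "fp_pass", "rules", "fieldsWithData", "scantime", "create_time")
STATUS_MAP = {"SUCCESS": "PASS"}  # pending the pre-migration status check

def migrate_extraction_row(row, cert_map):
    """Map a DTX 1.0 extraction row, remapping certificate_id via Area 4.

    Error rows keep their NULL page_id / certificate_id and metrics as-is.
    """
    out = {col: row[col] for col in KEEP}
    out["status"] = STATUS_MAP.get(row["status"], row["status"])
    old_cert = row["certificate_id"]
    out["certificate_id"] = cert_map[old_cert] if old_cert is not None else None
    out["extracted_data"] = None  # Decision #5: leave NULL for migrated rows
    out["content_hash"] = None    # Decision #5: cache is disabled anyway
    return out
```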
DTX 2.0 added two fields not present in DTX 1.0:
extracted_data (JSONField) — Stores all extracted field values as a JSON object. In DTX 2.0, this is populated by pdf_parser.py during extraction. DTX 1.0 did not store the raw extraction results — it immediately wrote them into Certificate and coverage model fields.
content_hash (CharField, indexed) — SHA-256 hash of the source PDF, used for extraction caching (currently disabled in DTX 2.0 per signals.py:564–568).
Options for extracted_data:
| Option | Pros | Cons |
|---|---|---|
| A. Leave NULL | Simple, no data to reconstruct | Historical extractions won't show in 2.0 extraction detail views |
| B. Backfill from Certificate + coverage records | Complete data | Complex reverse-engineering; field names may not match exactly |
Options for content_hash:
| Option | Pros | Cons |
|---|---|---|
| A. Leave NULL | Simple | Historical extractions won't match cache lookups (cache is disabled anyway) |
| B. Compute from repo_file data | Complete | Requires access to original PDF files on DTX 1.0 file server |
Recommendation: Leave both fields NULL for migrated records. The extraction cache is currently disabled, and historical extraction detail views are not critical. New extractions in DTX 2.0 will populate both fields.
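For reference, if Option B for content_hash were ever pursued, the value is a SHA-256 of the source PDF bytes; a minimal sketch (the exact digest convention, e.g. hex vs. raw, should be confirmed against pdf_parser.py/signals.py before use):

```python
import hashlib

def content_hash(pdf_bytes: bytes) -> str:
    """SHA-256 hex digest of the source PDF, as used for cache lookups."""
    return hashlib.sha256(pdf_bytes).hexdigest()
```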
The ocr_event table (34 columns, 100 rows in sample) records calls to the external AgencyPort Turnstile API (turnstile.agencyport.com). Each row logs an OCR processing attempt including API status codes, response times, confidence grades (A–F), form identification, and extraction technique (Text vs OCR).
DTX 2.0 does not have an OcrEvent model. The external Turnstile API was replaced by the local Poppler-based extraction pipeline. The Extraction model serves as the sole processing record.
Key data in ocr_event with no 2.0 equivalent:
- ExtractionTechnique — "Text" (50 rows) vs "OCR" (20 rows): Indicates whether the PDF had embedded fonts or required image OCR. DTX 2.0 only handles text-based PDFs.
- confidence grade (A–F): Letter grades from the external API, different from the 0–1 numeric confidence in extraction.
- api_statusCode / api_statusDesc: External API response details.
- FormName: ACORD form identification (some with OCR-misread values like .CORD 25, \CORD 25).
- ProductVersion: Turnstile API version.
The team must decide (Decision #6) whether to drop the ocr_event data entirely or to add a new field (e.g., legacy_ocr_technique) to the Extraction model to preserve the Text/OCR distinction for migrated records.

Capability gap: DTX 1.0's Turnstile integration could process scanned/image PDFs via OCR (20 of 100 sample rows used the "OCR" technique). DTX 2.0 cannot — scanned PDFs fail with ERR_NO_TEXT. This is a known functional regression.
Area 4 (Certificates & Documents) deferred three certificate fields to this area:
| DTX 1.0 Field | Deferred Reason | Resolution |
|---|---|---|
| certificate.ocr | "OCR flag — Area 7" | Resolved. This flag indicated whether the certificate was produced via OCR. DTX 2.0 does not store it. The extraction technique is implicit: if an Extraction record links to the Certificate, it was auto-extracted; otherwise it was manually created. No migration action needed. |
| certificate.page_name | "OCR-related — Area 7" | Resolved. This was the ACORD form name (e.g., "ACORD_25_2016_03"). In DTX 2.0, it is derivable from Certificate → Extraction → Page (producer + model + version). No separate field needed. |
| certificate.starting_page | "OCR-related — Area 7" | Resolved. This indicated which page of a multi-page PDF the certificate data came from. DTX 2.0 does not store this (the current parser handles single-page extraction). Drop — not critical for migration. |
All three deferred references are resolved. No migration action required.
The sample shows duplicate certificate_id values in the extraction table: 72235 appears 2 times, 72302 appears 3 times. These map to the same repo_file_id values, indicating the same file was extracted multiple times (likely re-runs or retries).
DTX 2.0's Extraction model allows multiple extraction records per certificate (no unique constraint). Migrate all rows, including duplicates — they represent the actual processing history. The most recent extraction (highest extraction_id) for each certificate reflects the final result.
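Selecting each certificate's final result after migration is then a max-by-PK grouping; a sketch in plain Python (a Django equivalent might use an annotate/Max aggregation, but that is not shown in the source):

```python
def latest_extractions(rows):
    """Return the highest-extraction_id row per certificate_id.

    Duplicates (e.g. certificates 72235 and 72302 in the sample) are kept
    in the migrated table; this just picks each certificate's final result.
    """
    latest = {}
    for row in rows:
        cert = row["certificate_id"]
        if cert is None:
            continue  # error rows have no certificate
        if cert not in latest or row["extraction_id"] > latest[cert]["extraction_id"]:
            latest[cert] = row
    return latest
```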
| # | Issue | Impact | Mitigation |
|---|---|---|---|
| 1 | Batch timestamp anomaly: 87 of 100 extraction rows share a create_time of 2019-02-14 16:12:48 | These are not real extraction timestamps — they were backfilled. Historical timing data is unreliable. | Migrate as-is. Flag in documentation that pre-migration create_time values may not reflect actual processing time. |
| 2 | Zero-confidence SUCCESS rows: 9 extraction rows have status=SUCCESS but confidence=0.000 | May indicate a false-positive success status or poor-quality extractions with no fingerprint matches. | Migrate as-is. Post-migration validation should flag these for review. |
| 3 | OCR-misread FormName in ocr_event: values like .CORD 25, \CORD 25, ACORDACORD 25 | Only relevant if ocr_event data is preserved (Decision #6). Shows the Turnstile API returned OCR'd form names, not clean identifiers. | Document for reference; no action needed if ocr_event is not migrated. |
| 4 | Pages 8–11 missing fingerprints: 4 page templates (AON_RISK, Brixmor, iText, Maffey) have no page_attribute rows in the sample | These variants may never match during extraction. Check whether the full dataset has fingerprints for these pages. | Pre-migration action: query the full DTX 1.0 database for page_attribute rows where page_id IN (8, 9, 10, 11). |
| 5 | extraction_rule only covers page_id=1 in sample: all 100 sample rows are for the ACORD 25 2010/05 "native" template | The full dataset likely has rules for all 11 page templates; the sample is not representative. | Pre-migration action: query the full DTX 1.0 database for extraction_rule counts per page_id. |
| 6 | left_custxp / right_custxp XPath fragments: some extraction rules contain custom XPath constraints like [@yMin >= 200] | These are dropped in migration (Decision #3). If any rules depend on these constraints for correct extraction, they may produce incorrect results in DTX 2.0. | Pre-migration action: identify which rules have non-NULL custxp values and test extraction accuracy without them. |
| 7 | Extraction cache disabled in DTX 2.0: signals.py:564–568 disables content_hash cache lookup due to "coverage data quality issues" | Historical content_hash values would not be used even if populated. | No action — aligns with Decision #5's recommendation to leave content_hash NULL. |
| # | Action | Blocking? |
|---|---|---|
| 1 | Confirm DTX 2.0 extraction status values: check pdf_parser.py process_document() for the exact status string written on success (SUCCESS vs PASS). | Yes — needed for Decision #4 status mapping. |
| 2 | Query full page_attribute coverage: check DTX 1.0 for page_attribute rows for page_ids 8–11. | No — but affects data completeness. |
| 3 | Query full extraction_rule counts: get per-page_id row counts from DTX 1.0 to understand full rule coverage. | No — but needed for accurate migration scope. |
| 4 | Audit custxp constraints: list all extraction_rule rows with non-NULL left_custxp or right_custxp and assess whether the XPath fragments are critical for extraction accuracy. | No — but may require a DTX 2.0 ExtractionRule model extension or use of the method column. |
Step 1: Seed page table (11 rows, reference data)
Step 2: Seed page_attribute table (fingerprints, reference data)
Step 3: Seed extraction_rule table (config, with fieldtype mapping)
Step 4: Migrate extraction records (historical, with certificate_id remapping)
Step 5: Resolve deferred Area 4 references (no action — all resolved)
| Area | Reference | Status |
|---|---|---|
| Area 2: Customers | extraction.so_id (dropped) | Resolved — column not migrated |
| Area 3: Vendors | ocr_event.vendor_id | Resolved — depends on Decision #6 |
| Area 4: Certificates | extraction.certificate_id remap; deferred ocr, page_name, starting_page | Resolved — Decision #4, Decision #7 |
| Area 11: File Repository | extraction.repo_file_id (dropped), ocr_event.repo_file_id | Resolved — columns not migrated |
| # | Decision | Status | Approach |
|---|---|---|---|
| 1 | page → Page (seed data) | Documented | See decision details below |
| 2 | page_attribute → PageAttribute (seed data) | Documented | See decision details below |
| 3 | extraction_rule → ExtractionRule (configuration) | Documented | See decision details below |
| 4 | extraction → Extraction (historical records) | Documented | See decision details below |
| 5 | New Extraction fields (extracted_data, content_hash) | Documented | See decision details below |
| 6 | ocr_event → No target model | Documented | See decision details below |
| 7 | Deferred certificate fields from Area 4 | Documented | See decision details below |
| 8 | Duplicate extraction records | Documented | See decision details below |