📋 Migration Summary

Source Tables: 5
Target Models: 4
Decisions: 8

Migrating OCR extraction data, rules, and page metadata for certificate processing and data extraction workflows.

Area 7: Extraction & OCR

Overview

Migration of extraction and OCR data from DTX 1.0 to DTX 2.0. This is Phase 4 — depends on Area 4 (Certificates & Documents).

The extraction system in DTX 1.0 used a two-phase approach: (1) external OCR via the AgencyPort Turnstile API (ocr_event), and (2) local coordinate-based field extraction using rule templates (extraction, extraction_rule, page, page_attribute).

DTX 2.0 replaced the external Turnstile API with a fully local Poppler-based extraction pipeline (docutrax_extractor app). The local extraction architecture is structurally preserved — Page, PageAttribute, ExtractionRule, and Extraction models all exist in DTX 2.0 and closely mirror the DTX 1.0 schema. However, ocr_event has no corresponding model in DTX 2.0 (see Decision #6).

Key architectural change: DTX 1.0 relied on an external API for text extraction and OCR of scanned documents. DTX 2.0 uses Poppler pdftotext for text extraction only — scanned/image-only PDFs return ERR_NO_TEXT and are not supported. This is a capability regression that the team should be aware of.
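
For illustration, the text-availability check amounts to running pdftotext and inspecting its output; a minimal sketch (assumes Poppler's pdftotext is on PATH; the function name is illustrative, and the real pipeline lives in docutrax_extractor/services/pdf_parser.py):

```python
import subprocess

ERR_NO_TEXT = "ERR_NO_TEXT"  # status string described above

def check_pdf_has_text(pdf_path: str) -> str | None:
    """Return ERR_NO_TEXT for image-only PDFs, None if text was found."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],  # "-" sends extracted text to stdout
        capture_output=True, text=True, check=True,
    )
    # A scanned/image-only PDF yields no extractable text.
    return ERR_NO_TEXT if not result.stdout.strip() else None
```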

Source Tables (5)

| DTX 1.0 Table | Rows (sample) | Type | DTX 2.0 Target |
|---|---|---|---|
| page | 11 | Reference/config | Page (docutrax_extractor) |
| page_attribute | ~100 | Reference/config | PageAttribute (docutrax_extractor) |
| extraction_rule | ~100+ | Configuration | ExtractionRule (docutrax_extractor) |
| extraction | ~100+ | Transactional | Extraction (docutrax_extractor) |
| ocr_event | ~100+ | Transactional | No target model |

Target Models

| DTX 2.0 Model | App | File | Lines |
|---|---|---|---|
| Page | docutrax_extractor | docutrax_extractor/models.py | 8–25 |
| PageAttribute | docutrax_extractor | docutrax_extractor/models.py | 28–50 |
| ExtractionRule | docutrax_extractor | docutrax_extractor/models.py | 53–126 |
| Extraction | docutrax_extractor | docutrax_extractor/models.py | 129–168 |

Supporting Infrastructure

| Component | File | Purpose |
|---|---|---|
| PDFParser | docutrax_extractor/services/pdf_parser.py | 2,290-line extraction engine |
| trigger_pdf_parsing() | docutrax_profiles/signals.py:450–673 | Auto-parse on Document upload |
| parse_coi_pdf_task | docutrax_profiles/tasks.py:38–114 | Celery async task (exists but unused) |
| auto_parse_coi_pdf | docutrax_profiles/signals.py:676–748 | Post-save signal on Document |

Dependencies

| Depends On | What's Needed |
|---|---|
| Area 2: Customers/Accounts | extraction.so_id → Account mapping (for validation only, column dropped) |
| Area 3: Vendors/Records | ocr_event.vendor_id → Record mapping (for validation only) |
| Area 4: Certificates/Documents | extraction.certificate_id → Certificate mapping; deferred certificate fields |

| Depended On By | What It Provides |
|---|---|
| Area 10: Reporting | Extraction metrics for reports |

Decisions

Decision #1: page → Page (seed data)

The page table is a small reference table (11 rows) defining known ACORD 25 form templates. Each row is a version + variant combination (e.g., ACORD 25 2016/03 "native", ACORD 25 2016/03 "StateFarmA4").

Column mapping:

| DTX 1.0 | DTX 2.0 | Action |
|---|---|---|
| page_id | page_id (PK) | Direct |
| producer | producer | Direct |
| model | model | Direct |
| version | version | Direct |
| title |  | Drop (always "CERTIFICATE OF LIABILITY INSURANCE", not in 2.0 model) |
| width | width | Direct |
| height | height | Direct |
| variant | variant | Direct |

Approach:

- Migrate all 11 rows with preserved page_id values (other tables reference these IDs).
- Drop the title column — it is a constant value and DTX 2.0 does not store it.
- Note: one row (page_id=6, StateFarmA4) uses A4 dimensions (595x842) while all others use US Letter (612x792).
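
A data-migration sketch for this step (assumes the 11 legacy rows have been exported from DTX 1.0; the loader, migration dependency, and the illustrative row values are assumptions, with field names following the mapping above):

```python
from django.db import migrations

def fetch_legacy_page_rows():
    # Hypothetical loader: in practice these 11 rows come from the
    # DTX 1.0 page table or a fixture exported from it.
    return [
        {"page_id": 1, "producer": "ACORD", "model": "25",
         "version": "2010/05", "width": 612, "height": 792,
         "variant": "native"},
        # ... remaining 10 rows, including page_id=6 (StateFarmA4, 595x842)
    ]

def seed_pages(apps, schema_editor):
    Page = apps.get_model("docutrax_extractor", "Page")
    for row in fetch_legacy_page_rows():
        Page.objects.update_or_create(
            page_id=row["page_id"],  # preserved PK; referenced by other tables
            defaults={k: row[k] for k in
                      ("producer", "model", "version",
                       "width", "height", "variant")},
            # title column dropped: constant string, not in the 2.0 model
        )

class Migration(migrations.Migration):
    dependencies = [("docutrax_extractor", "0001_initial")]  # assumed
    operations = [migrations.RunPython(seed_pages, migrations.RunPython.noop)]
```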


Decision #2: page_attribute → PageAttribute (seed data)

The page_attribute table contains fingerprint anchors — expected text locations used to identify which ACORD form template a PDF matches. The DTX 2.0 fingerprinting algorithm (pdf_parser.py:527–606) matches PDF word coordinates against these anchors within a ±3px tolerance. It does not compare text content — only coordinates.
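
In effect, the fingerprint check is a per-coordinate tolerance comparison; a simplified sketch (the actual implementation is pdf_parser.py:527–606; names here are illustrative):

```python
TOLERANCE = 3.0  # +/-3px, hardcoded in the DTX 2.0 parser

def anchor_matches(word_box: tuple, anchor) -> bool:
    """True if a PDF word's bounding box sits within tolerance of an anchor.

    word_box: (xMin, yMin, xMax, yMax) from Poppler word coordinates.
    anchor:   a PageAttribute row with xMin/yMin/xMax/yMax fields.
    Text content is deliberately NOT compared, only coordinates.
    """
    expected = (anchor.xMin, anchor.yMin, anchor.xMax, anchor.yMax)
    return all(abs(got - want) <= TOLERANCE
               for got, want in zip(word_box, expected))
```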

Column mapping:

| DTX 1.0 | DTX 2.0 | Action |
|---|---|---|
| page_attribute_id | attribute_id (PK) | Rename |
| page_id | page (FK) | FK reference, direct |
| producer | producer | Direct (denormalized from Page) |
| model | model | Direct (denormalized from Page) |
| version | version | Direct (denormalized from Page) |
| variant | variant | Direct (denormalized from Page) |
| fieldname | fieldname | Direct (3-letter code: IMP, VER, NAI, etc.) |
| exp_value |  | Drop (expected text string — not in 2.0 model, not used by algorithm) |
| width |  | Drop (denormalized from Page, not in 2.0 model) |
| height |  | Drop (denormalized from Page, not in 2.0 model) |
| xMin | xMin | Direct |
| yMin | yMin | Direct |
| xMax | xMax | Direct |
| yMax | yMax | Direct |

Approach:

- Migrate all rows with preserved PK values.
- Drop exp_value — DTX 2.0's fingerprinting uses coordinate matching only, not text comparison. The fieldname (3-letter code like "IMP", "VER", "NAI") serves as the human-readable label.
- Drop width and height — redundant with the referenced Page record.
- Note: pages 8–11 in the page table have no corresponding page_attribute rows in the sample. If the full dataset also lacks them, those templates have no fingerprinting and will never match.


Decision #3: extraction_rule → ExtractionRule (configuration)

The extraction_rule table defines field-level extraction rules — where to look on the page for each data field (bounding boxes, anchor text, regex patterns). This is the "brain" of the extraction engine. The sample contains 100 rules, all for page_id=1 (ACORD 25 2010/05 "native"). The full dataset likely has rules for other page templates.

Column mapping:

| DTX 1.0 | DTX 2.0 | Action |
|---|---|---|
| extraction_rule_id | extraction_rule_id (PK) | Direct |
| page_id | page (FK) | FK reference, direct |
| fieldname | fieldname | Direct |
| fieldtype | fieldtype | Map values (see below) |
| pattern | pattern | Direct (regex patterns) |
| above_str | above_str | Direct |
| below_str | below_str | Direct |
| left_str | left_str | Direct |
| right_str | right_str | Direct |
| above_offset | above_offset | Direct |
| below_offset | below_offset | Direct |
| left_offset | left_offset | Direct |
| right_offset | right_offset | Direct |
| above_ref | above_ref | Direct (TOP/BOTTOM/LEFT/RIGHT) |
| below_ref | below_ref | Direct |
| left_ref | left_ref | Direct |
| right_ref | right_ref | Direct |
| above_targ | above_targ | Direct (always NULL in sample) |
| below_targ | below_targ | Direct (always NULL in sample) |
| left_targ | left_targ | Direct |
| right_targ | right_targ | Direct |
| pagesection |  | Drop (section grouping, not in 2.0) |
| box_xMin |  | Drop (bounding box, 2.0 uses boundary markers instead) |
| box_yMin |  | Drop |
| box_xMax |  | Drop |
| box_yMax |  | Drop |
| above_residx |  | Drop (always NULL) |
| below_residx |  | Drop (always NULL) |
| left_residx |  | Drop (always NULL) |
| right_residx |  | Drop (always NULL) |
| above_custxp |  | Drop (custom XPath fragments — DTX 2.0 uses xpath_query or dynamic generation) |
| below_custxp |  | Drop (always NULL) |
| left_custxp |  | Drop |
| right_custxp |  | Drop |
| fielddefault |  | Drop (always NULL) |
| metatag |  | Drop (post-extraction field mapping, handled differently in 2.0) |
| pattern_start |  | Drop (boundary strings, not in 2.0) |
| pattern_end |  | Drop (boundary strings, not in 2.0) |
| mask |  | Drop (format validation, moved into fieldtype transformation) |
| box_element |  | Drop (extraction granularity: word/line/block/flow — hardcoded in 2.0 parser) |
|  | method | New — leave NULL (custom extraction method names, populated as needed) |
|  | xpath_query | New — leave NULL (dynamic query generation in 2.0 parser) |

Fieldtype value mapping:

| DTX 1.0 | DTX 2.0 | Notes |
|---|---|---|
| SINGLE_LINE_TEXT | TEXT | |
| MULTI_LINE_TEXT | TEXT | |
| NUMBER | TEXT | No NUMBER type in 2.0 |
| LIMIT | LIMIT | |
| CHECKBOX | CHECKBOX | |
| YNBOX | YNBOX | |
| DATE | DATE | |
| PHONE_NUMBER | PHONE | |
| PHONE_NUMBER_EXT | PHONE | |
| EMAIL_ADDRESS | EMAIL | |
| pagesection |  | Drop row — these are section divider pseudo-rules, not actual extraction rules |

Approach:

- Migrate all rules except rows where fieldtype = 'pagesection' (section dividers, not real rules).
- Map fieldtype values per the table above.
- Leave method and xpath_query as NULL — DTX 2.0's parser generates these dynamically.
- Preserve PK values since extraction records reference these implicitly via page_id.
- Pre-migration action: verify which left_custxp / right_custxp values contain XPath constraints that may need manual re-implementation as method entries in DTX 2.0. The sample shows constraints like [@yMin >= 200].
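
The fieldtype translation reduces to a small lookup plus a skip rule; a sketch (the function name is illustrative):

```python
# Mapping from the fieldtype table above.
FIELDTYPE_MAP = {
    "SINGLE_LINE_TEXT": "TEXT",
    "MULTI_LINE_TEXT": "TEXT",
    "NUMBER": "TEXT",  # no NUMBER type in 2.0
    "LIMIT": "LIMIT",
    "CHECKBOX": "CHECKBOX",
    "YNBOX": "YNBOX",
    "DATE": "DATE",
    "PHONE_NUMBER": "PHONE",
    "PHONE_NUMBER_EXT": "PHONE",
    "EMAIL_ADDRESS": "EMAIL",
}

def map_rule_fieldtype(legacy_fieldtype: str) -> str | None:
    """Return the DTX 2.0 fieldtype, or None if the row should be dropped."""
    if legacy_fieldtype == "pagesection":
        return None  # section divider pseudo-rule, not a real extraction rule
    return FIELDTYPE_MAP[legacy_fieldtype]
```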


Decision #4: extraction → Extraction (historical records)

The extraction table records individual extraction events — which template matched, what confidence score, how many fields were extracted. These are historical audit records of past PDF processing.

Column mapping:

| DTX 1.0 | DTX 2.0 | Action |
|---|---|---|
| extraction_id | extraction_id (PK) | Direct |
| page_id | page (FK) | FK reference, direct (NULL for 13 error rows) |
| certificate_id | certificate (FK) | FK reference, remap via Area 4 mapping (NULL for error rows) |
| status | status | Map values: SUCCESS → PASS (if DTX 2.0 uses PASS) |
| confidence | confidence | Direct |
| fp_total | fp_total | Direct |
| fp_pass | fp_pass | Direct |
| rules | rules | Direct (always 112 when populated) |
| fieldsWithData | fieldsWithData | Direct |
| scantime | scantime | Direct (milliseconds) |
| create_time | create_time | Direct (see Known Issue #1) |
| repo_file_id |  | Drop (DTX 2.0 links via Document, not repo_file) |
| pagewidth |  | Drop (available from Page model) |
| pageheight |  | Drop (available from Page model) |
| raw |  | Drop (always NULL in sample) |
| layoutxml |  | Drop (always NULL in sample) |
| metaxml |  | Drop (always NULL in sample) |
| so_id |  | Drop (derivable from certificate → RequirementsGroup → Account) |
| ch_id |  | Drop (derivable from certificate chain) |
| ph_id |  | Drop (derivable from certificate chain) |
| filename |  | Drop (available via Document) |
| filepath |  | Drop (server paths not applicable in 2.0) |
| fields |  | Drop (same as rules in all sample rows) |
| PDFversion |  | Drop (PDF metadata, not stored in 2.0) |
| tolerance |  | Drop (constant 3.0, hardcoded in 2.0 parser) |
|  | extracted_data | New — see Decision #5 |
|  | content_hash | New — see Decision #5 |

Status value mapping:

| DTX 1.0 | DTX 2.0 | Notes |
|---|---|---|
| SUCCESS | Verify actual value used by pdf_parser.py | Check process_document() return values |
| ERR_NO_TEXT | ERR_NO_TEXT | Same value used |
| ERR_NO_PAGE_TEMPLATE_FOUND | ERR_NO_PAGE_TEMPLATE_FOUND | Same value used |

Pre-migration action: Confirm the exact status string the DTX 2.0 parser writes for successful extractions. If it uses "SUCCESS" already, no mapping needed. If it uses "PASS", map accordingly.

Approach:

- Migrate extraction records with preserved PKs.
- Remap certificate_id via the Area 4 certificate ID mapping.
- Drop 14 columns that are either always NULL, redundant, or not in the 2.0 model.
- Handle error rows (13 in sample) where page_id, certificate_id, and metrics are NULL — migrate as-is with NULLs preserved.
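
Pulled together, a per-row migration sketch (assumes a dict of DTX 1.0 columns per row and the Area 4 ID mapping; the success status string is pending pre-migration action #1):

```python
def migrate_extraction_row(apps, row, cert_id_map, success_status):
    """Sketch: migrate one DTX 1.0 extraction row into the 2.0 model.

    row:            dict of DTX 1.0 extraction columns
    cert_id_map:    Area 4's DTX 1.0 -> DTX 2.0 certificate ID mapping
    success_status: confirmed 2.0 success string ("SUCCESS" or "PASS")
    """
    Extraction = apps.get_model("docutrax_extractor", "Extraction")
    status = success_status if row["status"] == "SUCCESS" else row["status"]
    Extraction.objects.create(
        extraction_id=row["extraction_id"],  # preserved PK
        page_id=row["page_id"],              # stays NULL for error rows
        # .get() keeps certificate NULL for the 13 error rows
        certificate_id=cert_id_map.get(row["certificate_id"]),
        status=status,
        confidence=row["confidence"],
        fp_total=row["fp_total"],
        fp_pass=row["fp_pass"],
        rules=row["rules"],
        fieldsWithData=row["fieldsWithData"],
        scantime=row["scantime"],
        create_time=row["create_time"],  # unreliable backfill, Known Issue #1
        # extracted_data and content_hash left NULL per Decision #5
    )
```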


Decision #5: New Extraction fields (extracted_data, content_hash)

DTX 2.0 added two fields not present in DTX 1.0:

  1. extracted_data (JSONField) — Stores all extracted field values as a JSON object. In DTX 2.0, this is populated by pdf_parser.py during extraction. DTX 1.0 did not store the raw extraction results — it immediately wrote them into Certificate and coverage model fields.

  2. content_hash (CharField, indexed) — SHA-256 hash of the source PDF, used for extraction caching (currently disabled in DTX 2.0 per signals.py:564–568).

Options for extracted_data:

| Option | Pros | Cons |
|---|---|---|
| A. Leave NULL | Simple, no data to reconstruct | Historical extractions won't show in 2.0 extraction detail views |
| B. Backfill from Certificate + coverage records | Complete data | Complex reverse-engineering; field names may not match exactly |

Options for content_hash:

| Option | Pros | Cons |
|---|---|---|
| A. Leave NULL | Simple | Historical extractions won't match cache lookups (cache is disabled anyway) |
| B. Compute from repo_file data | Complete | Requires access to original PDF files on DTX 1.0 file server |
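
If Option B were ever pursued for content_hash, the computation itself is straightforward; a sketch (assumes read access to the original PDFs on the DTX 1.0 file server, and that the 2.0 field stores a SHA-256 hex digest as described above):

```python
import hashlib

def compute_content_hash(pdf_path: str) -> str:
    """SHA-256 hex digest of a source PDF, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(pdf_path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()
```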

Recommendation: Leave both fields NULL for migrated records. The extraction cache is currently disabled, and historical extraction detail views are not critical. New extractions in DTX 2.0 will populate both fields.


Decision #6: ocr_event → No target model

The ocr_event table (34 columns, 100 rows in sample) records calls to the external AgencyPort Turnstile API (turnstile.agencyport.com). Each row logs an OCR processing attempt including API status codes, response times, confidence grades (A–F), form identification, and extraction technique (Text vs OCR).

DTX 2.0 does not have an OcrEvent model. The external Turnstile API was replaced by the local Poppler-based extraction pipeline. The Extraction model serves as the sole processing record.

Key data in ocr_event with no 2.0 equivalent:

- ExtractionTechnique — "Text" (50 rows) vs "OCR" (20 rows): indicates whether the PDF had embedded fonts or required image OCR. DTX 2.0 only handles text-based PDFs.
- confidence grade (A–F): letter grades from the external API, different from the 0–1 numeric confidence in extraction.
- api_statusCode / api_statusDesc: external API response details.
- FormName: ACORD form identification (some with OCR-misread values like .CORD 25, \CORD 25).
- ProductVersion: Turnstile API version.

The team must decide:

  1. Archive and discard — The external API no longer exists. These records are historical artifacts of a deprecated integration. No 2.0 model needed.
  2. Create an OcrEvent model — If audit trail of past OCR processing is valuable (e.g., to understand which documents were originally OCR'd vs text-extracted, or to track the migration from external to internal processing).
  3. Merge selected fields into Extraction — Add a column (e.g., legacy_ocr_technique) to the Extraction model to preserve the Text/OCR distinction for migrated records (see the sketch below).
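
If option 3 is chosen, the model change is small; a sketch (the field name comes from option 3 above; placement, length, and choices are assumptions):

```python
from django.db import models

class Extraction(models.Model):
    # ... existing fields per docutrax_extractor/models.py ...

    # Option 3 sketch: preserve the Turnstile Text-vs-OCR distinction for
    # migrated records only; stays NULL for all new DTX 2.0 extractions.
    legacy_ocr_technique = models.CharField(
        max_length=4,
        null=True,
        blank=True,
        choices=[("Text", "Text"), ("OCR", "OCR")],
    )
```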

Capability gap: DTX 1.0's Turnstile integration could process scanned/image PDFs via OCR (20 of 100 sample rows used "OCR" technique). DTX 2.0 cannot — scanned PDFs fail with ERR_NO_TEXT. This is a known functional regression.


Decision #7: Deferred certificate fields from Area 4

Area 4 (Certificates & Documents) deferred three certificate fields to this area:

| DTX 1.0 Field | Deferred Reason | Resolution |
|---|---|---|
| certificate.ocr | "OCR flag — Area 7" | Resolved. This flag indicated whether the certificate was produced via OCR. DTX 2.0 does not store this flag. The extraction technique is implicit: if an Extraction record links to the Certificate, it was auto-extracted; otherwise it was manually created. No migration action needed. |
| certificate.page_name | "OCR-related — Area 7" | Resolved. This was the ACORD form name (e.g., "ACORD_25_2016_03"). In DTX 2.0, this is derivable from Certificate → Extraction → Page (producer + model + version). No separate field needed. |
| certificate.starting_page | "OCR-related — Area 7" | Resolved. This indicated which page of a multi-page PDF the certificate data came from. DTX 2.0 does not store this (the current parser handles single-page extraction). Drop — not critical for migration. |

All three deferred references are resolved; no migration action is required. For reference, the page_name derivation is sketched below.
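
A sketch of that derivation (the 'extractions' related name and the underscore-joined format are assumptions about the 2.0 model):

```python
def derive_page_name(certificate) -> str | None:
    """Reconstruct the old certificate.page_name from Certificate ->
    Extraction -> Page, e.g. "ACORD_25_2016_03"."""
    extraction = certificate.extractions.order_by("-extraction_id").first()
    if extraction is None or extraction.page is None:
        return None  # manually created certificate, never auto-extracted
    page = extraction.page
    return f"{page.producer}_{page.model}_{page.version.replace('/', '_')}"
```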


Decision #8: Duplicate extraction records

The sample shows duplicate certificate_id values in the extraction table: 72235 appears 2 times, 72302 appears 3 times. These map to the same repo_file_id values, indicating the same file was extracted multiple times (likely re-runs or retries).

DTX 2.0's Extraction model allows multiple extraction records per certificate (no unique constraint). Migrate all rows, including duplicates — they represent the actual processing history. The most recent extraction (highest extraction_id) for each certificate reflects the final result.
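
A post-migration sanity query for the duplicates can be phrased against the 2.0 model; a sketch (the FK accessor names and import path are assumptions):

```python
from django.db.models import Count, Max

from docutrax_extractor.models import Extraction  # assumed import path

def duplicate_extraction_summary():
    """Certificates with more than one extraction, plus the final record.

    The highest extraction_id per certificate reflects the final result.
    """
    return (Extraction.objects
            .exclude(certificate__isnull=True)
            .values("certificate_id")
            .annotate(runs=Count("extraction_id"),
                      final_extraction_id=Max("extraction_id"))
            .filter(runs__gt=1))
```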


Known Issues

| # | Issue | Impact | Mitigation |
|---|---|---|---|
| 1 | Batch timestamp anomaly: 87 of 100 extraction rows share a create_time of 2019-02-14 16:12:48 | These are not real extraction timestamps — they were backfilled. Historical timing data is unreliable. | Migrate as-is. Flag in documentation that pre-migration create_time values may not reflect actual processing time. |
| 2 | Zero-confidence SUCCESS rows: 9 extraction rows have status=SUCCESS but confidence=0.000 | May indicate false positive success status or poor-quality extractions with no fingerprint matches. | Migrate as-is. Post-migration validation should flag these for review. |
| 3 | OCR-misread FormName in ocr_event: values like .CORD 25, \CORD 25, ACORDACORD 25 | Only relevant if ocr_event data is preserved (Decision #6). Shows the Turnstile API returned OCR'd form names, not clean identifiers. | Document for reference; no action needed if ocr_event is not migrated. |
| 4 | Pages 8–11 missing fingerprints: 4 page templates (AON_RISK, Brixmor, iText, Maffey) have no page_attribute rows in the sample | These variants may never match during extraction. Check if the full dataset has fingerprints for these pages. | Pre-migration action: query the full DTX 1.0 database for page_attribute rows where page_id IN (8, 9, 10, 11). |
| 5 | extraction_rule only covers page_id=1 in the sample: all 100 sample rows are for the ACORD 25 2010/05 "native" template | The full dataset likely has rules for all 11 page templates; the sample is not representative. | Pre-migration action: query the full DTX 1.0 database for extraction_rule counts per page_id. |
| 6 | left_custxp / right_custxp XPath fragments: some extraction rules contain custom XPath constraints like [@yMin >= 200] | These are dropped in migration (Decision #3). If any rules depend on these constraints for correct extraction, they may produce incorrect results in DTX 2.0. | Pre-migration action: identify which rules have non-NULL custxp values and test extraction accuracy without them. |
| 7 | Extraction cache disabled in DTX 2.0: signals.py:564–568 disables content_hash cache lookup due to "coverage data quality issues" | Historical content_hash values would not be used even if populated. | No action — aligns with Decision #5's recommendation to leave content_hash NULL. |

Pre-Migration Actions

| # | Action | Blocking? |
|---|---|---|
| 1 | Confirm DTX 2.0 extraction status values: check pdf_parser.py process_document() for the exact status string written on success (SUCCESS vs PASS). | Yes — needed for Decision #4 status mapping. |
| 2 | Query full page_attribute coverage: check DTX 1.0 for page_attribute rows for page_ids 8–11. | No — but affects data completeness. |
| 3 | Query full extraction_rule counts: get per-page_id row counts from DTX 1.0 to understand full rule coverage. | No — but needed for accurate migration scope. |
| 4 | Audit custxp constraints: list all extraction_rule rows with non-NULL left_custxp or right_custxp and assess whether the XPath fragments are critical for extraction accuracy. | No — but may require a DTX 2.0 ExtractionRule model extension or use of the method column. |
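
Actions 2–4 are plain queries against the legacy schema; a sketch (assumes a DB-API connection to the DTX 1.0 database; table and column names are from the source schema):

```python
AUDIT_QUERIES = {
    "page_attribute coverage for pages 8-11":
        "SELECT page_id, COUNT(*) FROM page_attribute "
        "WHERE page_id IN (8, 9, 10, 11) GROUP BY page_id",
    "extraction_rule counts per page":
        "SELECT page_id, COUNT(*) FROM extraction_rule GROUP BY page_id",
    "rules with custom XPath constraints":
        "SELECT extraction_rule_id, fieldname, left_custxp, right_custxp "
        "FROM extraction_rule "
        "WHERE left_custxp IS NOT NULL OR right_custxp IS NOT NULL",
}

def run_audits(conn):
    """Print each audit result; conn is a DB-API connection to DTX 1.0."""
    cur = conn.cursor()
    for label, sql in AUDIT_QUERIES.items():
        cur.execute(sql)
        print(label, cur.fetchall())
    cur.close()
```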

Migration Order

Step 1: Seed page table (11 rows, reference data)
Step 2: Seed page_attribute table (fingerprints, reference data)
Step 3: Seed extraction_rule table (config, with fieldtype mapping)
Step 4: Migrate extraction records (historical, with certificate_id remapping)
Step 5: Resolve deferred Area 4 references (no action — all resolved)

Cross-Area References

| Area | Reference | Status |
|---|---|---|
| Area 2: Customers | extraction.so_id (dropped) | Resolved — column not migrated |
| Area 3: Vendors | ocr_event.vendor_id | Resolved — depends on Decision #6 |
| Area 4: Certificates | extraction.certificate_id remap; deferred ocr, page_name, starting_page | Resolved — Decision #4, Decision #7 |
| Area 11: File Repository | extraction.repo_file_id (dropped), ocr_event.repo_file_id | Resolved — columns not migrated |

Decision Summary

| # | Decision | Status | Approach |
|---|---|---|---|
| 1 | page → Page (seed data) | Documented | See decision details above |
| 2 | page_attribute → PageAttribute (seed data) | Documented | See decision details above |
| 3 | extraction_rule → ExtractionRule (configuration) | Documented | See decision details above |
| 4 | extraction → Extraction (historical records) | Documented | See decision details above |
| 5 | New Extraction fields (extracted_data, content_hash) | Documented | See decision details above |
| 6 | ocr_event → No target model | Documented | See decision details above |
| 7 | Deferred certificate fields from Area 4 | Documented | See decision details above |
| 8 | Duplicate extraction records | Documented | See decision details above |