Migrating OCR extraction data, rules, and page metadata for certificate processing and data extraction workflows.
Migration of extraction and OCR data from DTX 1.0 to DTX 2.0. This is Phase 4 — depends on Area 4 (Certificates & Documents).
The extraction system in DTX 1.0 used a two-phase approach: (1) external OCR via the AgencyPort Turnstile API (ocr_event), and (2) local coordinate-based field extraction using rule templates (extraction, extraction_rule, page, page_attribute).
DTX 2.0 replaced the external Turnstile API with a fully local Poppler-based extraction pipeline (docutrax_extractor app). The local extraction architecture is structurally preserved — Page, PageAttribute, ExtractionRule, and Extraction models all exist in DTX 2.0 and closely mirror the DTX 1.0 schema. However, ocr_event has no corresponding model in DTX 2.0 (see Decision #6).
Key architectural change: DTX 1.0 relied on an external API for text extraction and OCR of scanned documents. DTX 2.0 uses Poppler pdftotext for text extraction only — scanned/image-only PDFs return ERR_NO_TEXT and are not supported. This is a capability regression that the team should be aware of.
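The boundary can be made concrete with a small sketch (hypothetical helper name — the real logic lives in docutrax_extractor/services/pdf_parser.py): a PDF whose pdftotext output is empty yields ERR_NO_TEXT instead of being routed to an OCR fallback.

```python
# Sketch of the DTX 2.0 capability boundary (illustrative helper, not the
# actual pdf_parser.py code): text-based PDFs proceed, image-only PDFs fail.

def classify_pdftotext_output(text: str) -> str:
    """Return an extraction status for raw pdftotext output.

    DTX 1.0 would have fallen back to the Turnstile OCR API for empty
    output; DTX 2.0 simply reports ERR_NO_TEXT.
    """
    if not text.strip():
        return "ERR_NO_TEXT"  # scanned/image-only PDF: unsupported in 2.0
    return "TEXT_AVAILABLE"   # proceed to fingerprinting and rule extraction
```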
| DTX 1.0 Table | Rows (sample) | Type | DTX 2.0 Target |
|---|---|---|---|
| page | 11 | Reference/config | Page (docutrax_extractor) |
| page_attribute | ~100 | Reference/config | PageAttribute (docutrax_extractor) |
| extraction_rule | ~100+ | Configuration | ExtractionRule (docutrax_extractor) |
| extraction | ~100+ | Transactional | Extraction (docutrax_extractor) |
| ocr_event | ~100+ | Transactional | No target model |
| DTX 2.0 Model | App | File | Lines |
|---|---|---|---|
| Page | docutrax_extractor | docutrax_extractor/models.py | 8–25 |
| PageAttribute | docutrax_extractor | docutrax_extractor/models.py | 28–50 |
| ExtractionRule | docutrax_extractor | docutrax_extractor/models.py | 53–126 |
| Extraction | docutrax_extractor | docutrax_extractor/models.py | 129–168 |
| Component | File | Purpose |
|---|---|---|
| PDFParser | docutrax_extractor/services/pdf_parser.py | 2,290-line extraction engine |
| trigger_pdf_parsing() | docutrax_profiles/signals.py:450–673 | Auto-parse on Document upload |
| parse_coi_pdf_task | docutrax_profiles/tasks.py:38–114 | Celery async task (exists but unused) |
| auto_parse_coi_pdf | docutrax_profiles/signals.py:676–748 | Post-save signal on Document |
| Depends On | What's Needed |
|---|---|
| Area 2: Customers/Accounts | extraction.so_id → Account mapping (for validation only, column dropped) |
| Area 3: Vendors/Records | ocr_event.vendor_id → Record mapping (for validation only) |
| Area 4: Certificates/Documents | extraction.certificate_id → Certificate mapping; deferred certificate fields |
| Depended On By | What It Provides |
|---|---|
| Area 10: Reporting | Extraction metrics for reports |
The page table is a small reference table (11 rows) defining known ACORD 25 form templates. Each row is a version + variant combination (e.g., ACORD 25 2016/03 "native", ACORD 25 2016/03 "StateFarmA4").
Column mapping:
| DTX 1.0 | DTX 2.0 | Action |
|---|---|---|
| page_id | page_id (PK) | Direct |
| producer | producer | Direct |
| model | model | Direct |
| version | version | Direct |
| title | — | Drop (always "CERTIFICATE OF LIABILITY INSURANCE", not in 2.0 model) |
| width | width | Direct |
| height | height | Direct |
| variant | variant | Direct |
Approach:
- Migrate all 11 rows with preserved page_id values (other tables reference these IDs).
- Drop the title column — it is a constant value and DTX 2.0 does not store it.
- Note: One row (page_id=6, StateFarmA4) uses A4 dimensions (595x842) while all others use US Letter (612x792).
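The row transform is small; a hedged sketch with plain dicts standing in for ORM rows (column names taken from the mapping above):

```python
def migrate_page_row(row: dict) -> dict:
    """Map a DTX 1.0 `page` row to DTX 2.0 Page fields.

    page_id is preserved verbatim because page_attribute, extraction_rule,
    and extraction all reference it; `title` is dropped (constant value).
    """
    keep = ("page_id", "producer", "model", "version", "width", "height", "variant")
    return {col: row[col] for col in keep}
```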
The page_attribute table contains fingerprint anchors — expected text locations used to identify which ACORD form template a PDF matches. The DTX 2.0 fingerprinting algorithm (pdf_parser.py:527–606) matches PDF word coordinates against these anchors within a ±3px tolerance. It does not compare text content — only coordinates.
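A minimal sketch of that coordinate-only matching (simplified from the logic at pdf_parser.py:527–606; the function name and data shapes here are illustrative assumptions):

```python
TOLERANCE = 3.0  # ±3px, hardcoded in the 2.0 parser

def fingerprint_score(anchors, words, tol=TOLERANCE):
    """Count how many template anchors have a PDF word at the expected spot.

    `anchors` and `words` are lists of (xMin, yMin, xMax, yMax) tuples.
    Text content is deliberately ignored — only coordinates are compared.
    """
    passed = 0
    for ax1, ay1, ax2, ay2 in anchors:
        if any(
            abs(ax1 - wx1) <= tol and abs(ay1 - wy1) <= tol
            and abs(ax2 - wx2) <= tol and abs(ay2 - wy2) <= tol
            for wx1, wy1, wx2, wy2 in words
        ):
            passed += 1
    return passed, len(anchors)  # fp_pass, fp_total
```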
Column mapping:
| DTX 1.0 | DTX 2.0 | Action |
|---|---|---|
| page_attribute_id | attribute_id (PK) | Rename |
| page_id | page (FK) | FK reference, direct |
| producer | producer | Direct (denormalized from Page) |
| model | model | Direct (denormalized from Page) |
| version | version | Direct (denormalized from Page) |
| variant | variant | Direct (denormalized from Page) |
| fieldname | fieldname | Direct (3-letter code: IMP, VER, NAI, etc.) |
| exp_value | — | Drop (expected text string — not in 2.0 model, not used by algorithm) |
| width | — | Drop (denormalized from Page, not in 2.0 model) |
| height | — | Drop (denormalized from Page, not in 2.0 model) |
| xMin | xMin | Direct |
| yMin | yMin | Direct |
| xMax | xMax | Direct |
| yMax | yMax | Direct |
Approach:
- Migrate all rows with preserved PK values.
- Drop exp_value — DTX 2.0's fingerprinting uses coordinate matching only, not text comparison. The fieldname (3-letter code like "IMP", "VER", "NAI") serves as the human-readable label.
- Drop width, height — redundant with the referenced Page record.
- Note: Pages 8–11 in the page table have no corresponding page_attribute rows in the sample. If the full dataset also lacks them, those templates have no fingerprinting and will never match.
The extraction_rule table defines field-level extraction rules — where to look on the page for each data field (bounding boxes, anchor text, regex patterns). This is the "brain" of the extraction engine. The sample contains 100 rules, all for page_id=1 (ACORD 25 2010/05 "native"). The full dataset likely has rules for other page templates.
Column mapping:
| DTX 1.0 | DTX 2.0 | Action |
|---|---|---|
| extraction_rule_id | extraction_rule_id (PK) | Direct |
| page_id | page (FK) | FK reference, direct |
| fieldname | fieldname | Direct |
| fieldtype | fieldtype | Map values (see below) |
| pattern | pattern | Direct (regex patterns) |
| above_str | above_str | Direct |
| below_str | below_str | Direct |
| left_str | left_str | Direct |
| right_str | right_str | Direct |
| above_offset | above_offset | Direct |
| below_offset | below_offset | Direct |
| left_offset | left_offset | Direct |
| right_offset | right_offset | Direct |
| above_ref | above_ref | Direct (TOP/BOTTOM/LEFT/RIGHT) |
| below_ref | below_ref | Direct |
| left_ref | left_ref | Direct |
| right_ref | right_ref | Direct |
| above_targ | above_targ | Direct (always NULL in sample) |
| below_targ | below_targ | Direct (always NULL in sample) |
| left_targ | left_targ | Direct |
| right_targ | right_targ | Direct |
| pagesection | — | Drop (section grouping, not in 2.0) |
| box_xMin | — | Drop (bounding box, 2.0 uses boundary markers instead) |
| box_yMin | — | Drop |
| box_xMax | — | Drop |
| box_yMax | — | Drop |
| above_residx | — | Drop (always NULL) |
| below_residx | — | Drop (always NULL) |
| left_residx | — | Drop (always NULL) |
| right_residx | — | Drop (always NULL) |
| above_custxp | — | Drop (custom XPath fragments — DTX 2.0 uses xpath_query or dynamic generation) |
| below_custxp | — | Drop (always NULL) |
| left_custxp | — | Drop |
| right_custxp | — | Drop |
| fielddefault | — | Drop (always NULL) |
| metatag | — | Drop (post-extraction field mapping, handled differently in 2.0) |
| pattern_start | — | Drop (boundary strings, not in 2.0) |
| pattern_end | — | Drop (boundary strings, not in 2.0) |
| mask | — | Drop (format validation, moved into fieldtype transformation) |
| box_element | — | Drop (extraction granularity: word/line/block/flow — hardcoded in 2.0 parser) |
| — | method | New — leave NULL (custom extraction method names, populated as needed) |
| — | xpath_query | New — leave NULL (dynamic query generation in 2.0 parser) |
Fieldtype value mapping:
| DTX 1.0 | DTX 2.0 | Notes |
|---|---|---|
| SINGLE_LINE_TEXT | TEXT | |
| MULTI_LINE_TEXT | TEXT | |
| NUMBER | TEXT | No NUMBER type in 2.0 |
| LIMIT | LIMIT | |
| CHECKBOX | CHECKBOX | |
| YNBOX | YNBOX | |
| DATE | DATE | |
| PHONE_NUMBER | PHONE | |
| PHONE_NUMBER_EXT | PHONE | |
| EMAIL_ADDRESS | EMAIL | |
| pagesection | — | Drop row — these are section divider pseudo-rules, not actual extraction rules |
Approach:
- Migrate all rules except rows where fieldtype = 'pagesection' (section dividers, not real rules).
- Map fieldtype values per the table above.
- Leave method and xpath_query as NULL — DTX 2.0's parser generates these dynamically.
- Preserve PK values since extraction records reference these implicitly via page_id.
- Pre-migration action: Verify which left_custxp / right_custxp values contain XPath constraints that may need manual re-implementation as method entries in DTX 2.0. The sample shows constraints like [@yMin >= 200].
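The fieldtype mapping and the pagesection filter can be expressed as one small transform; a sketch (the mapping dict mirrors the table above, and the generator function is an illustrative name, not existing code):

```python
FIELDTYPE_MAP = {
    "SINGLE_LINE_TEXT": "TEXT",
    "MULTI_LINE_TEXT": "TEXT",
    "NUMBER": "TEXT",          # no NUMBER type in 2.0
    "LIMIT": "LIMIT",
    "CHECKBOX": "CHECKBOX",
    "YNBOX": "YNBOX",
    "DATE": "DATE",
    "PHONE_NUMBER": "PHONE",
    "PHONE_NUMBER_EXT": "PHONE",
    "EMAIL_ADDRESS": "EMAIL",
}

def migrate_rule_rows(rows):
    """Yield migrated extraction_rule rows, skipping pagesection pseudo-rules."""
    for row in rows:
        if row["fieldtype"] == "pagesection":
            continue  # section divider, not a real extraction rule
        out = dict(row, fieldtype=FIELDTYPE_MAP[row["fieldtype"]])
        out["method"] = None       # populated as needed in 2.0
        out["xpath_query"] = None  # generated dynamically by the 2.0 parser
        yield out
```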
The extraction table records individual extraction events — which template matched, what confidence score, how many fields were extracted. These are historical audit records of past PDF processing.
Column mapping:
| DTX 1.0 | DTX 2.0 | Action |
|---|---|---|
| extraction_id | extraction_id (PK) | Direct |
| page_id | page (FK) | FK reference, direct (NULL for 13 error rows) |
| certificate_id | certificate (FK) | FK reference, remap via Area 4 mapping (NULL for error rows) |
| status | status | Map values: SUCCESS → PASS (if DTX 2.0 uses PASS) |
| confidence | confidence | Direct |
| fp_total | fp_total | Direct |
| fp_pass | fp_pass | Direct |
| rules | rules | Direct (always 112 when populated) |
| fieldsWithData | fieldsWithData | Direct |
| scantime | scantime | Direct (milliseconds) |
| create_time | create_time | Direct (see Known Issue #1) |
| repo_file_id | — | Drop (DTX 2.0 links via Document, not repo_file) |
| pagewidth | — | Drop (available from Page model) |
| pageheight | — | Drop (available from Page model) |
| raw | — | Drop (always NULL in sample) |
| layoutxml | — | Drop (always NULL in sample) |
| metaxml | — | Drop (always NULL in sample) |
| so_id | — | Drop (derivable from certificate → RequirementsGroup → Account) |
| ch_id | — | Drop (derivable from certificate chain) |
| ph_id | — | Drop (derivable from certificate chain) |
| filename | — | Drop (available via Document) |
| filepath | — | Drop (server paths not applicable in 2.0) |
| fields | — | Drop (same as rules in all sample rows) |
| PDFversion | — | Drop (PDF metadata, not stored in 2.0) |
| tolerance | — | Drop (constant 3.0, hardcoded in 2.0 parser) |
| — | extracted_data | New — see Decision #5 |
| — | content_hash | New — see Decision #5 |
Status value mapping:
| DTX 1.0 | DTX 2.0 | Notes |
|---|---|---|
| SUCCESS | Verify actual value used by pdf_parser.py | Check process_document() return values |
| ERR_NO_TEXT | ERR_NO_TEXT | Same value used |
| ERR_NO_PAGE_TEMPLATE_FOUND | ERR_NO_PAGE_TEMPLATE_FOUND | Same value used |
Pre-migration action: Confirm the exact status string the DTX 2.0 parser writes for successful extractions. If it uses "SUCCESS" already, no mapping needed. If it uses "PASS", map accordingly.
Approach:
- Migrate extraction records with preserved PKs.
- Remap certificate_id via the Area 4 certificate ID mapping.
- Drop 14 columns that are either always NULL, redundant, or not in the 2.0 model.
- Handle error rows (13 in sample) where page_id, certificate_id, and metrics are NULL — migrate as-is with NULLs preserved.
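Putting the approach together as a sketch: cert_map stands in for the Area 4 certificate ID mapping, and STATUS_MAP assumes the pre-migration check confirms a SUCCESS → PASS rename (otherwise it collapses to the identity). All names are illustrative.

```python
KEEP = ("extraction_id", "page_id", "status", "confidence", "fp_total",
        "fp_pass", "rules", "fieldsWithData", "scantime", "create_time")
STATUS_MAP = {"SUCCESS": "PASS"}  # pending the pre-migration status check

def migrate_extraction_row(row, cert_map):
    """Map a DTX 1.0 extraction row, remapping certificate_id via Area 4.

    Error rows keep their NULL page_id / certificate_id and metrics as-is.
    """
    out = {col: row[col] for col in KEEP}
    out["status"] = STATUS_MAP.get(row["status"], row["status"])
    old_cert = row["certificate_id"]
    out["certificate_id"] = cert_map[old_cert] if old_cert is not None else None
    out["extracted_data"] = None  # Decision #5: leave NULL for migrated rows
    out["content_hash"] = None    # Decision #5: cache is disabled anyway
    return out
```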
DTX 2.0 added two fields not present in DTX 1.0:
extracted_data (JSONField) — Stores all extracted field values as a JSON object. In DTX 2.0, this is populated by pdf_parser.py during extraction. DTX 1.0 did not store the raw extraction results — it immediately wrote them into Certificate and coverage model fields.
content_hash (CharField, indexed) — SHA-256 hash of the source PDF, used for extraction caching (currently disabled in DTX 2.0 per signals.py:564–568).
Options for extracted_data:
| Option | Pros | Cons |
|---|---|---|
| A. Leave NULL | Simple, no data to reconstruct | Historical extractions won't show in 2.0 extraction detail views |
| B. Backfill from Certificate + coverage records | Complete data | Complex reverse-engineering; field names may not match exactly |
Options for content_hash:
| Option | Pros | Cons |
|---|---|---|
| A. Leave NULL | Simple | Historical extractions won't match cache lookups (cache is disabled anyway) |
| B. Compute from repo_file data | Complete | Requires access to original PDF files on DTX 1.0 file server |
Recommendation: Leave both fields NULL for migrated records. The extraction cache is currently disabled, and historical extraction detail views are not critical. New extractions in DTX 2.0 will populate both fields.
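For reference, if Option B for content_hash were ever pursued, the value is a SHA-256 of the source PDF bytes; a minimal sketch (the exact digest convention, e.g. hex vs. raw, should be confirmed against pdf_parser.py/signals.py before use):

```python
import hashlib

def content_hash(pdf_bytes: bytes) -> str:
    """SHA-256 hex digest of the source PDF, as used for cache lookups."""
    return hashlib.sha256(pdf_bytes).hexdigest()
```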
The ocr_event table (34 columns, 100 rows in sample) records calls to the external AgencyPort Turnstile API (turnstile.agencyport.com). Each row logs an OCR processing attempt including API status codes, response times, confidence grades (A–F), form identification, and extraction technique (Text vs OCR).
DTX 2.0 does not have an OcrEvent model. The external Turnstile API was replaced by the local Poppler-based extraction pipeline. The Extraction model serves as the sole processing record.
Key data in ocr_event with no 2.0 equivalent:
- ExtractionTechnique — "Text" (50 rows) vs "OCR" (20 rows): Indicates whether the PDF had embedded fonts or required image OCR. DTX 2.0 only handles text-based PDFs.
- confidence grade (A–F): Letter grades from the external API, different from the 0–1 numeric confidence in extraction.
- api_statusCode / api_statusDesc: External API response details.
- FormName: ACORD form identification (some with OCR-misread values like .CORD 25, \CORD 25).
- ProductVersion: Turnstile API version.
The team must decide (Decision #6) whether to drop the ocr_event data entirely or to add a new field (e.g., legacy_ocr_technique) to the Extraction model to preserve the Text/OCR distinction for migrated records.

Capability gap: DTX 1.0's Turnstile integration could process scanned/image PDFs via OCR (20 of 100 sample rows used the "OCR" technique). DTX 2.0 cannot — scanned PDFs fail with ERR_NO_TEXT. This is a known functional regression.
Area 4 (Certificates & Documents) deferred three certificate fields to this area:
| DTX 1.0 Field | Deferred Reason | Resolution |
|---|---|---|
| certificate.ocr | "OCR flag — Area 7" | Resolved. This flag indicated whether the certificate was produced via OCR. DTX 2.0 does not store it. The extraction technique is implicit: if an Extraction record links to the Certificate, it was auto-extracted; otherwise it was manually created. No migration action needed. |
| certificate.page_name | "OCR-related — Area 7" | Resolved. This was the ACORD form name (e.g., "ACORD_25_2016_03"). In DTX 2.0, it is derivable from Certificate → Extraction → Page (producer + model + version). No separate field needed. |
| certificate.starting_page | "OCR-related — Area 7" | Resolved. This indicated which page of a multi-page PDF the certificate data came from. DTX 2.0 does not store this (the current parser handles single-page extraction). Drop — not critical for migration. |
All three deferred references are resolved. No migration action required.
The sample shows duplicate certificate_id values in the extraction table: 72235 appears 2 times, 72302 appears 3 times. These map to the same repo_file_id values, indicating the same file was extracted multiple times (likely re-runs or retries).
DTX 2.0's Extraction model allows multiple extraction records per certificate (no unique constraint). Migrate all rows, including duplicates — they represent the actual processing history. The most recent extraction (highest extraction_id) for each certificate reflects the final result.
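Selecting each certificate's final result after migration is then a max-by-PK grouping; a sketch in plain Python (a Django equivalent might use an annotate/Max aggregation, but that is not shown in the source):

```python
def latest_extractions(rows):
    """Return the highest-extraction_id row per certificate_id.

    Duplicates (e.g. certificates 72235 and 72302 in the sample) are kept
    in the migrated table; this just picks each certificate's final result.
    """
    latest = {}
    for row in rows:
        cert = row["certificate_id"]
        if cert is None:
            continue  # error rows have no certificate
        if cert not in latest or row["extraction_id"] > latest[cert]["extraction_id"]:
            latest[cert] = row
    return latest
```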
| # | Issue | Impact | Mitigation |
|---|---|---|---|
| 1 | Batch timestamp anomaly: 87 of 100 extraction rows share a create_time of 2019-02-14 16:12:48 | These are not real extraction timestamps — they were backfilled. Historical timing data is unreliable. | Migrate as-is. Flag in documentation that pre-migration create_time values may not reflect actual processing time. |
| 2 | Zero-confidence SUCCESS rows: 9 extraction rows have status=SUCCESS but confidence=0.000 | May indicate a false-positive success status or poor-quality extractions with no fingerprint matches. | Migrate as-is. Post-migration validation should flag these for review. |
| 3 | OCR-misread FormName in ocr_event: values like .CORD 25, \CORD 25, ACORDACORD 25 | Only relevant if ocr_event data is preserved (Decision #6). Shows the Turnstile API returned OCR'd form names, not clean identifiers. | Document for reference; no action needed if ocr_event is not migrated. |
| 4 | Pages 8–11 missing fingerprints: 4 page templates (AON_RISK, Brixmor, iText, Maffey) have no page_attribute rows in the sample | These variants may never match during extraction. Check whether the full dataset has fingerprints for these pages. | Pre-migration action: query the full DTX 1.0 database for page_attribute rows where page_id IN (8, 9, 10, 11). |
| 5 | extraction_rule only covers page_id=1 in sample: all 100 sample rows are for the ACORD 25 2010/05 "native" template | The full dataset likely has rules for all 11 page templates; the sample is not representative. | Pre-migration action: query the full DTX 1.0 database for extraction_rule counts per page_id. |
| 6 | left_custxp / right_custxp XPath fragments: some extraction rules contain custom XPath constraints like [@yMin >= 200] | These are dropped in migration (Decision #3). If any rules depend on these constraints for correct extraction, they may produce incorrect results in DTX 2.0. | Pre-migration action: identify which rules have non-NULL custxp values and test extraction accuracy without them. |
| 7 | Extraction cache disabled in DTX 2.0: signals.py:564–568 disables content_hash cache lookup due to "coverage data quality issues" | Historical content_hash values would not be used even if populated. | No action — aligns with Decision #5's recommendation to leave content_hash NULL. |
| # | Action | Blocking? |
|---|---|---|
| 1 | Confirm DTX 2.0 extraction status values: check pdf_parser.py process_document() for the exact status string written on success (SUCCESS vs PASS). | Yes — needed for Decision #4 status mapping. |
| 2 | Query full page_attribute coverage: check DTX 1.0 for page_attribute rows for page_ids 8–11. | No — but affects data completeness. |
| 3 | Query full extraction_rule counts: get per-page_id row counts from DTX 1.0 to understand full rule coverage. | No — but needed for accurate migration scope. |
| 4 | Audit custxp constraints: list all extraction_rule rows with non-NULL left_custxp or right_custxp and assess whether the XPath fragments are critical for extraction accuracy. | No — but may require a DTX 2.0 ExtractionRule model extension or use of the method column. |
Step 1: Seed page table (11 rows, reference data)
Step 2: Seed page_attribute table (fingerprints, reference data)
Step 3: Seed extraction_rule table (config, with fieldtype mapping)
Step 4: Migrate extraction records (historical, with certificate_id remapping)
Step 5: Resolve deferred Area 4 references (no action — all resolved)
| Area | Reference | Status |
|---|---|---|
| Area 2: Customers | extraction.so_id (dropped) | Resolved — column not migrated |
| Area 3: Vendors | ocr_event.vendor_id | Resolved — depends on Decision #6 |
| Area 4: Certificates | extraction.certificate_id remap; deferred ocr, page_name, starting_page | Resolved — Decision #4, Decision #7 |
| Area 11: File Repository | extraction.repo_file_id (dropped), ocr_event.repo_file_id | Resolved — columns not migrated |
| # | Decision | Status | Approach |
|---|---|---|---|
| 1 | page → Page (seed data) | Documented | See decision details below |
| 2 | page_attribute → PageAttribute (seed data) | Documented | See decision details below |
| 3 | extraction_rule → ExtractionRule (configuration) | Documented | See decision details below |
| 4 | extraction → Extraction (historical records) | Documented | See decision details below |
| 5 | New Extraction fields (extracted_data, content_hash) | Documented | See decision details below |
| 6 | ocr_event → No target model | Documented | See decision details below |
| 7 | Deferred certificate fields from Area 4 | Documented | See decision details below |
| 8 | Duplicate extraction records | Documented | See decision details below |