// THE SOURCEGRAPH FRAMEWORK
SourceGraph is the ten-stage pipeline that runs behind every Paper Data tool. The same pipeline powers annual-report extraction, deck parsing, transcript structuring, and document comparison — so every output carries the same source-anchor guarantee.
> sourcegraph run ar_FY24.pdf
[01] ingest ok · 284 pages
[02] classify ok · Annual Report
[03] layout ok · 1,842 blocks
[04] extract ok · 12,406 entities
[05] anchor ok · 12,406 anchors
[06] validate ok · 0 mismatches
[07] normalize ok · canonical mapping
[08] export ok · .xlsx .csv .json
[09] compare skipped (single doc)
[10] audit trail ok · 47 KB
// THE TEN STAGES
01 //
PDFs, scanned PDFs, and image-based reports are accepted. Image pages are OCR'd; text-based pages are parsed directly so original characters and positions are preserved.
02 //
The document is classified — annual report, quarterly results, investor presentation, earnings transcript, shareholding pattern, or generic financial PDF — to load the right extraction profile.
03 //
Each page is parsed into structural blocks: titles, paragraphs, tables (including merged-cell and multi-page tables), figures, footnotes, and headers/footers.
04 //
Line items, dates, periods, currencies, units, percentages, and references (e.g. 'Note 17') are extracted into typed entities — not just strings.
05 //
Every entity is anchored to a coordinate: {file, page, block_id, table_id, row_id, cell_id, char_span}. This anchor travels with the value into every export.
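As a rough illustration of how an anchor could travel with a value, here is a minimal Python sketch. The field names follow the coordinate listed above; the class names `SourceAnchor` and `AnchoredValue` and the helper `to_export_row` are hypothetical and are not taken from Paper Data's own code.

```python
from dataclasses import dataclass, asdict
from typing import Optional, Tuple
import json


@dataclass(frozen=True)
class SourceAnchor:
    """Coordinate of an extracted value (field names from the anchor description)."""
    file: str
    page: int
    block_id: str
    table_id: Optional[str]  # assumption: None when the value is not in a table
    row_id: Optional[str]
    cell_id: Optional[str]
    char_span: Tuple[int, int]


@dataclass(frozen=True)
class AnchoredValue:
    """A typed value bundled with the anchor it was extracted from (hypothetical)."""
    label: str
    value: float
    unit: str
    period: str
    anchor: SourceAnchor


def to_export_row(item: AnchoredValue) -> dict:
    """Flatten value and anchor into one row so the anchor is never dropped."""
    row = {
        "label": item.label,
        "value": item.value,
        "unit": item.unit,
        "period": item.period,
    }
    row.update(asdict(item.anchor))
    row["char_span"] = list(item.anchor.char_span)  # JSON-friendly list
    return row


receivables = AnchoredValue(
    label="Trade receivables",
    value=4180,
    unit="INR Cr",
    period="FY24",
    anchor=SourceAnchor(
        file="ar_FY24.pdf", page=118, block_id="b_412",
        table_id="t_3", row_id="r_7", cell_id="c_2",
        char_span=(1841, 1847),
    ),
)
row = to_export_row(receivables)
print(json.dumps(row, indent=2))
```

Freezing both dataclasses is a design choice in this sketch: once a value is anchored, later stages can copy it but cannot silently detach it from its coordinate.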
06 //
Sub-totals are recomputed, statement totals are cross-checked, and disclosed totals are reconciled against extracted line items. Mismatches are flagged for review.
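The cross-check can be pictured as recomputing a total from its components and comparing it with the disclosed figure. The sketch below assumes a simple sum and a rounding tolerance; the names `LineItem`, `Mismatch`, and `check_total`, the tolerance value, and all figures other than the 4,180 trade receivables are illustrative, not Paper Data's actual rules.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LineItem:
    label: str
    value: float
    page: int  # source page, so a reviewer knows where to look


@dataclass
class Mismatch:
    label: str
    disclosed: float
    recomputed: float
    page: int


def check_total(components: List[LineItem], disclosed: LineItem,
                tolerance: float = 0.5) -> List[Mismatch]:
    """Recompute a total and flag it if it differs from the disclosed one.

    The tolerance is an assumption: it absorbs rounding in reported figures.
    """
    recomputed = sum(item.value for item in components)
    if abs(recomputed - disclosed.value) > tolerance:
        return [Mismatch(disclosed.label, disclosed.value, recomputed, disclosed.page)]
    return []


# Illustrative figures only.
current_assets = [
    LineItem("Inventories", 1200, 118),
    LineItem("Trade receivables", 4180, 118),
    LineItem("Cash and cash equivalents", 950, 118),
]
matching = check_total(current_assets, LineItem("Total current assets", 6330, 118))
flagged = check_total(current_assets, LineItem("Total current assets", 6340, 118))
print(matching)
print(flagged)
```

Returning a list of mismatches rather than raising an error keeps the pipeline running, which fits the stated behaviour of flagging for review instead of failing.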
07 //
Line items are mapped to a canonical ontology (Schedule III / IFRS / US GAAP). Units, currencies, and periods are normalized into separate metadata columns.
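One way to think about this stage is a lookup from reported labels to canonical keys, plus splitting the unit string into its own columns. The sketch below is an assumption-laden toy: the mapping table, the canonical key names, and the function `normalize` are hypothetical, and a real ontology mapping would be far larger. The scale factors are standard (1 crore = 10,000,000; 1 lakh = 100,000).

```python
# Hypothetical label-to-key table; a real mapping would cover a full ontology.
CANONICAL_LABELS = {
    "trade receivables": "TradeReceivables",
    "sundry debtors": "TradeReceivables",
    "inventories": "Inventories",
}

# Multipliers for common reporting scales.
UNIT_SCALES = {"Cr": 10_000_000, "Lakh": 100_000, "Mn": 1_000_000}


def normalize(label: str, value: float, unit: str, period: str) -> dict:
    """Map a reported line item to a canonical key with separate metadata columns.

    Assumes the unit is written as '<currency> <scale>', e.g. 'INR Cr'.
    """
    currency, scale = unit.split()
    return {
        "canonical_key": CANONICAL_LABELS.get(label.strip().lower()),  # None if unmapped
        "reported_label": label,
        "value": value,
        "value_base": value * UNIT_SCALES[scale],  # value in base currency units
        "currency": currency,
        "scale": scale,
        "period": period,
    }


normalized = normalize("Trade receivables", 4180, "INR Cr", "FY24")
print(normalized)
```

Keeping both the reported value and the scaled value means the export can show the figure as printed in the document while still allowing comparison across filings that report in different scales.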
08 //
Outputs are produced as .xlsx workbooks, CSV bundles, and JSON with full source anchors. An HTTP API returns the same data programmatically.
09 //
Documents from different periods are aligned by canonical ontology and section, then diffed at the level of values, sentences, and slide structures, with source links to both sides of each difference.
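For values, the alignment step might look like the following sketch: two periods keyed by canonical line item are compared, and each difference keeps the anchors from both documents. The function `diff_periods`, the dictionary layout, the file name `ar_FY23.pdf`, and every figure except the FY24 trade receivables are made up for illustration.

```python
from typing import Dict, List, Optional

Entry = Dict[str, object]  # {"value": ..., "anchor": {...}}


def diff_periods(old: Dict[str, Entry], new: Dict[str, Entry]) -> List[dict]:
    """Compare two periods keyed by canonical line item.

    Unchanged items are skipped; every reported difference carries the
    anchors from both sides so a reviewer can open either document.
    """
    changes = []
    for key in sorted(set(old) | set(new)):
        a: Optional[Entry] = old.get(key)
        b: Optional[Entry] = new.get(key)
        if a is not None and b is not None and a["value"] == b["value"]:
            continue
        if a is None:
            status = "added"
        elif b is None:
            status = "removed"
        else:
            status = "changed"
        changes.append({
            "key": key,
            "status": status,
            "old_value": a["value"] if a else None,
            "new_value": b["value"] if b else None,
            "old_anchor": a["anchor"] if a else None,
            "new_anchor": b["anchor"] if b else None,
        })
    return changes


# Illustrative data only.
fy23 = {
    "TradeReceivables": {"value": 3900, "anchor": {"file": "ar_FY23.pdf", "page": 112}},
    "Inventories": {"value": 1200, "anchor": {"file": "ar_FY23.pdf", "page": 112}},
}
fy24 = {
    "TradeReceivables": {"value": 4180, "anchor": {"file": "ar_FY24.pdf", "page": 118}},
    "Inventories": {"value": 1200, "anchor": {"file": "ar_FY24.pdf", "page": 118}},
    "Goodwill": {"value": 75, "anchor": {"file": "ar_FY24.pdf", "page": 120}},
}
changes = diff_periods(fy23, fy24)
for change in changes:
    print(change["key"], change["status"], change["old_value"], change["new_value"])
```

Keying on the canonical item rather than the printed label is what lets a line that was renamed between filings still line up, provided both labels map to the same key.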
10 //
Every transformation is recorded. From raw character span to canonical output row, the entire path is reproducible and inspectable.
// SOURCE ANCHOR · SPEC
Every extracted value, whether it ends up in Excel, CSV, JSON, or the API, carries the same anchor object. The anchor is what makes every figure traceable to its exact position in the source PDF.
{
"file": "ar_FY24.pdf",
"page": 118,
"block_id": "b_412",
"table_id": "t_3",
"row_id": "r_7",
"cell_id": "c_2",
"char_span": [1841, 1847],
"section": "Balance Sheet — Standalone",
"label": "Trade receivables",
"value": 4180,
"unit": "INR Cr",
"period": "FY24"
}
// FAQ
Every output cell carries an identifier of the form {file, page, table_id, row_id, cell_id}. Clicking the reference re-opens the original PDF at the exact position the value was extracted from.
Scanned PDFs are supported. Image-based pages are OCR'd before layout parsing, and the character positions returned by OCR are used to compute source anchors.
The validation stage cross-checks sub-totals and statement totals. Any mismatch is flagged with the relevant source page so a human can review.
An API is available. The same data that powers the web tools can be retrieved through an HTTP API, including the full source-anchor object for every extracted value.
// EARLY ACCESS
Paper Data is currently in private beta. Request access to start converting your financial documents into source-linked tables.