// THE SOURCEGRAPH FRAMEWORK

Document in.
Source-anchored data out.

SourceGraph is the ten-stage pipeline that runs behind every Paper Data tool. The same pipeline powers annual-report extraction, deck parsing, transcript structuring, and document comparison — so every output carries the same source-anchor guarantee.

sourcegraph · pipeline
> sourcegraph run ar_FY24.pdf
[01] ingest      ok · 284 pages
[02] classify    ok · Annual Report
[03] layout      ok · 1,842 blocks
[04] extract     ok · 12,406 entities
[05] anchor      ok · 12,406 anchors
[06] validate    ok · 0 mismatches
[07] normalize   ok · canonical mapping
[08] export      ok · .xlsx .csv .json
[09] compare     skipped (single doc)
[10] audit trail ok · 47 KB

// THE TEN STAGES

01 //

Document ingestion

PDFs, scanned PDFs, and image-based reports are accepted. Image pages are OCR'd; text-based pages are parsed directly so original characters and positions are preserved.

02 //

Document classification

The document is classified — annual report, quarterly results, investor presentation, earnings transcript, shareholding pattern, or generic financial PDF — to load the right extraction profile.

03 //

Layout parsing

Each page is parsed into structural blocks: titles, paragraphs, tables (including merged-cell and multi-page tables), figures, footnotes, and headers/footers.

04 //

Financial entity extraction

Line items, dates, periods, currencies, units, percentages, and references (e.g. 'Note 17') are extracted into typed entities — not just strings.
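
To illustrate what "typed entities, not just strings" could look like, here is a minimal Python sketch. The class names, the regular expressions, and the rule that a parenthesized amount is negative are assumptions made for illustration; they are not SourceGraph's actual extraction profile.

```python
import re
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Amount:
    value: Decimal


@dataclass(frozen=True)
class Percentage:
    value: Decimal


@dataclass(frozen=True)
class NoteRef:
    number: int


# Hypothetical patterns; a real profile would cover many more shapes.
NOTE_RE = re.compile(r"^Note\s+(\d+)$", re.IGNORECASE)
PCT_RE = re.compile(r"^(-?\d[\d,]*(?:\.\d+)?)\s*%$")
AMOUNT_RE = re.compile(r"^\(?(-?\d[\d,]*(?:\.\d+)?)\)?$")


def to_entity(text):
    """Turn one cell string into a typed entity, or None if unrecognized."""
    text = text.strip()
    m = NOTE_RE.match(text)
    if m:
        return NoteRef(int(m.group(1)))
    m = PCT_RE.match(text)
    if m:
        return Percentage(Decimal(m.group(1).replace(",", "")))
    m = AMOUNT_RE.match(text)
    if m:
        value = Decimal(m.group(1).replace(",", ""))
        # Assumption: accounting-style parentheses mean a negative amount.
        if text.startswith("(") and text.endswith(")"):
            value = -value
        return Amount(value)
    return None
```

With types in place, a later stage can sum `Amount` values or follow a `NoteRef` without re-parsing text.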

05 //

Source anchoring

Every entity is anchored to a coordinate: {file, page, block_id, table_id, row_id, cell_id, char_span}. This anchor travels with the value into every export.

06 //

Validation

Sub-totals are recomputed, statement totals are cross-checked, and disclosed totals are reconciled against extracted line items. Mismatches are flagged for review.
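
The sub-total check can be sketched in a few lines of Python. The function name, the `tolerance` parameter, and the figures in the usage example are hypothetical; only the idea of recomputing a total and flagging a mismatch comes from the description above.

```python
from decimal import Decimal


def check_total(line_items, disclosed_total, tolerance=Decimal("0")):
    """Recompute a sub-total and compare it with the disclosed figure.

    line_items: mapping of label -> Decimal value
    tolerance:  allowed absolute difference (assumed, e.g. for rounding)
    """
    recomputed = sum(line_items.values(), Decimal("0"))
    difference = recomputed - disclosed_total
    return {
        "recomputed": recomputed,
        "disclosed": disclosed_total,
        "difference": difference,
        "mismatch": abs(difference) > tolerance,
    }


# Made-up figures for illustration only.
items = {
    "Inventories": Decimal("2000"),
    "Trade receivables": Decimal("4180"),
    "Cash and cash equivalents": Decimal("500"),
}
result = check_total(items, Decimal("6690"))
```

Here the recomputed total differs from the disclosed one by 10, so the result is marked as a mismatch for review.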

07 //

Normalization

Line items are mapped to a canonical ontology (Schedule III / IFRS / US GAAP). Units, currencies, and periods are normalized into separate metadata columns.
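
A rough sketch of the normalization step, assuming a small hand-written synonym table and a two-part unit string such as "INR Cr". The `CANONICAL` and `SCALE` tables and all field names are illustrative assumptions, not the real ontology mapping.

```python
# Hypothetical synonym table: raw label (lowercased) -> canonical item key.
CANONICAL = {
    "trade receivables": "trade_receivables",
    "sundry debtors": "trade_receivables",
    "inventories": "inventories",
}

# Scale words and their multipliers (1 crore = 10,000,000; 1 lakh = 100,000).
SCALE = {"Cr": 10_000_000, "Lakh": 100_000, "Mn": 1_000_000}


def normalize(label, value, unit, period):
    """Map a raw line item to a canonical key and split the unit into columns."""
    currency, scale_name = unit.split()
    return {
        "canonical_item": CANONICAL.get(label.strip().lower()),  # None if unmapped
        "raw_label": label,
        "value": value,
        "currency": currency,
        "scale": scale_name,
        "scale_factor": SCALE[scale_name],
        "period": period,
    }


row = normalize("Sundry Debtors", 4180, "INR Cr", "FY24")
```

The reported value is kept as disclosed; currency, scale, and period sit beside it in separate columns rather than being folded into the number.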

08 //

Export to Excel, CSV, JSON, and API

Outputs are produced as .xlsx workbooks, CSV bundles, and JSON with full source anchors. An HTTP API returns the same data programmatically.

09 //

Document comparison

Documents from different periods are aligned by canonical ontology and section, then diffed — values, sentences, and slide structures — with both-side source links.
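
For numeric values, the align-then-diff idea can be sketched as below. It assumes each period has already been reduced to a mapping from canonical item key to a value and its anchor; the function name, the FY23 figure, and the file name `ar_FY23.pdf` are made up for the example.

```python
def diff_periods(old, new):
    """Align two periods by canonical item key and diff the values.

    old, new: mapping of canonical key -> {"value": number, "anchor": dict}
    Items present on only one side are kept, with None on the other side.
    """
    rows = []
    for key in sorted(old.keys() | new.keys()):
        a, b = old.get(key), new.get(key)
        both = a is not None and b is not None
        rows.append({
            "item": key,
            "old": a["value"] if a is not None else None,
            "new": b["value"] if b is not None else None,
            "change": b["value"] - a["value"] if both else None,
            "old_anchor": a["anchor"] if a is not None else None,
            "new_anchor": b["anchor"] if b is not None else None,
        })
    return rows


# Hypothetical inputs for illustration.
fy23 = {"trade_receivables": {"value": 3900,
                              "anchor": {"file": "ar_FY23.pdf", "page": 112}}}
fy24 = {"trade_receivables": {"value": 4180,
                              "anchor": {"file": "ar_FY24.pdf", "page": 118}},
        "inventories": {"value": 2000,
                        "anchor": {"file": "ar_FY24.pdf", "page": 118}}}
rows = diff_periods(fy23, fy24)
```

Each row keeps both anchors, so a changed figure can be traced to its page in either document.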

10 //

Audit trail

Every transformation is recorded. From raw character span to canonical output row, the entire path is reproducible and inspectable.
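
One simple way to picture such a trail is an append-only list of steps, each storing the stage name with the value before and after. The `AuditTrail` class and its methods are a hypothetical sketch, not the format SourceGraph writes.

```python
from dataclasses import dataclass, field


@dataclass
class AuditTrail:
    steps: list = field(default_factory=list)

    def record(self, stage, before, after):
        """Append one transformation and pass the new value through."""
        self.steps.append({"stage": stage, "before": before, "after": after})
        return after

    def path(self):
        """Stage names in the order they were applied."""
        return [step["stage"] for step in self.steps]


trail = AuditTrail()
raw = "4,180"                                            # raw characters
value = trail.record("extract", raw, int(raw.replace(",", "")))
row = trail.record("normalize", value, {"item": "trade_receivables",
                                        "value": value})
```

Reading `trail.steps` from first to last shows how the raw characters became the output row.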

// SOURCE ANCHOR · SPEC

One pointer travels with every value.

Every extracted value — whether it ends up in Excel, CSV, JSON, or the API — carries the same anchor object. The anchor is what makes every figure traceable to its origin in the original PDF.

anchor.schema.json
{
  "file":      "ar_FY24.pdf",
  "page":      118,
  "block_id":  "b_412",
  "table_id":  "t_3",
  "row_id":    "r_7",
  "cell_id":   "c_2",
  "char_span": [1841, 1847],
  "section":   "Balance Sheet — Standalone",
  "label":     "Trade receivables",
  "value":     4180,
  "unit":      "INR Cr",
  "period":    "FY24"
}
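
A consumer of the JSON export might load the anchor into a typed object before using it. The sketch below assumes the field names shown above; the `Anchor` class, `parse_anchor`, and the `locator` string format are illustrative and not part of any published client library.

```python
import json
from dataclasses import dataclass

ANCHOR_FIELDS = ("file", "page", "block_id", "table_id",
                 "row_id", "cell_id", "char_span")


@dataclass(frozen=True)
class Anchor:
    file: str
    page: int
    block_id: str
    table_id: str
    row_id: str
    cell_id: str
    char_span: tuple


def parse_anchor(raw):
    """Parse a JSON anchor object, failing loudly if a coordinate is missing."""
    data = json.loads(raw)
    missing = [name for name in ANCHOR_FIELDS if name not in data]
    if missing:
        raise ValueError(f"anchor is missing fields: {missing}")
    start, end = data["char_span"]
    coords = {name: data[name] for name in ANCHOR_FIELDS}
    coords["char_span"] = (start, end)
    return Anchor(**coords)


def locator(anchor):
    """Human-readable pointer, in a format assumed for this example."""
    return (f"{anchor.file} p.{anchor.page} "
            f"{anchor.table_id}/{anchor.row_id}/{anchor.cell_id}")


EXAMPLE = """{
  "file": "ar_FY24.pdf", "page": 118, "block_id": "b_412",
  "table_id": "t_3", "row_id": "r_7", "cell_id": "c_2",
  "char_span": [1841, 1847], "label": "Trade receivables", "value": 4180
}"""
anchor = parse_anchor(EXAMPLE)
```

Non-coordinate fields such as `label` and `value` are ignored by `parse_anchor`, so the same function works on the full object shown above.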

// FAQ

What does 'source-linked' mean in practice?

Every output cell carries the anchor {file, page, block_id, table_id, row_id, cell_id, char_span}. Clicking the reference re-opens the original PDF at the exact position the value was extracted from.

Does SourceGraph handle scanned PDFs?

Yes. Image-based pages are OCR'd before layout parsing, and the character positions returned by OCR are used to compute source anchors.

How are extraction errors handled?

The validation stage cross-checks sub-totals and statement totals. Any mismatch is flagged with the relevant source page so a human can review.

Can I use SourceGraph via an API?

Yes. The same data that powers the web tools is available through an HTTP API, including the full source-anchor object for every extracted value.

// EARLY ACCESS

Join Paper Data early access

Paper Data is currently in private beta. Request access to start converting your financial documents into source-linked tables.
