How Document Processing Works

When you upload a document to Virza, it doesn’t just get stored. It goes through a multi-stage AI pipeline that transforms a static file into a research-ready artifact. Each stage extracts structured knowledge that powers search, chat, citations, and evidence analysis.

The processing lifecycle

Every document passes through four phases. You can use the document as soon as Phase 2 completes. Later phases add progressively deeper analysis.


Upload → Security scan → Core extraction → AI enrichment → Ready
        (instant)       (15-30s)         (30-90s)
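
As a rough illustration of the lifecycle, here is a minimal Python sketch of a check for whether a document can be used yet. The status names are the ones that appear on this page; the `DocStatus` enum and the `is_usable` helper are hypothetical and are not part of any Virza API.

```python
from enum import Enum


class DocStatus(Enum):
    # Display names come from this page; the enum itself is illustrative only.
    PROCESSING = "Processing"
    AVAILABLE = "Available"
    ENRICHING = "Enriching"
    READY = "Ready"
    READY_WITH_WARNINGS = "Ready with warnings"
    INFECTED = "Infected"


# Statuses a document can have once Phase 2 (core extraction) has completed.
USABLE = {
    DocStatus.AVAILABLE,
    DocStatus.ENRICHING,
    DocStatus.READY,
    DocStatus.READY_WITH_WARNINGS,
}


def is_usable(status: DocStatus) -> bool:
    """Return True if the document can already be read and searched."""
    return status in USABLE
```

With this sketch, `is_usable(DocStatus.ENRICHING)` is true while `is_usable(DocStatus.PROCESSING)` is false, which matches the rule that a document is usable once Phase 2 completes.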

Phase 1: Upload and security (instant)

Your file is uploaded directly to secure storage via a presigned URL. The file data never passes through our API server. Before any processing begins, the file enters a quarantine zone where it is scanned for malware.

File integrity verified (SHA-256 hash)
File type validated (PDF, DOCX, or TXT only)
Virus scan completed
File size and page count checked against your plan limits

If the virus scan detects a threat, the document is immediately quarantined and deleted. You will see an Infected status. If file validation fails (wrong format, too large, too many pages), you will see a specific error explaining what went wrong. See Failure Modes for all possible errors.
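
If you want to catch obvious problems before uploading, the checks above can be approximated locally. The sketch below is an assumption-laden example, not Virza's implementation: `sha256_of` and `precheck` are hypothetical helper names, the size limit is a parameter you supply, and the type check looks only at the file extension.

```python
import hashlib
from pathlib import Path

# Formats listed on this page as accepted.
ALLOWED_EXTENSIONS = {".pdf", ".docx", ".txt"}


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large documents are not loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def precheck(path: Path, max_bytes: int) -> list[str]:
    """Return a list of problems found; an empty list means the file looks fine.

    max_bytes is a stand-in for your plan's size limit (an assumed value,
    not one taken from Virza). The extension check does not inspect content.
    """
    problems = []
    if path.suffix.lower() not in ALLOWED_EXTENSIONS:
        problems.append("unsupported file type")
    if path.stat().st_size > max_bytes:
        problems.append("size limit exceeded")
    return problems
```

A local check like this cannot replace the server-side virus scan or page count check. It only catches a wrong format or an oversized file early.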

Phase 2: Core extraction (15–30 seconds)

The document enters the extraction pipeline. Your document status changes to Processing.

What’s extracted	How it works
Full text	Docling AI parser extracts text, headers, captions, and footnotes with document structure preserved. Falls back to PyMuPDF for compatibility.
Metadata	Title, authors, abstract, DOI, journal, and publication year are identified from the document and enriched via CrossRef and arXiv APIs
Sections	The document is segmented into semantic sections (Abstract, Introduction, Methods, Results, Discussion, etc.)
Tables	Tables are detected with bounding boxes, confidence scores, headers, and row data
Figures	Figures are cropped, deduplicated (perceptual hashing prevents repeated logos/headers), and saved as previews
Equations	LaTeX equations are extracted from mathematical content

After this phase, your document status changes to Available: you can read it, search for it, and see its structure.
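
To give a feel for how perceptual hashing can filter out repeated logos and headers during figure extraction, here is a small sketch using an average hash, one simple kind of perceptual hash. This page does not say which perceptual hash Virza uses, so the algorithm choice, the function names, and the distance threshold are all assumptions. Images are represented as small grayscale grids to keep the example dependency-free; a real pipeline would first downscale each figure to such a grid.

```python
def average_hash(pixels: list[list[int]]) -> int:
    """Average hash of a small grayscale grid (values 0..255).

    Each pixel contributes one bit: 1 if it is brighter than the grid mean.
    """
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def dedupe(figures: dict[str, list[list[int]]], max_distance: int = 2) -> list[str]:
    """Keep the first figure of each group of near-identical hashes.

    max_distance is an assumed threshold, not a value taken from Virza.
    """
    kept: list[tuple[str, int]] = []
    for name, pixels in figures.items():
        h = average_hash(pixels)
        if all(hamming(h, other) > max_distance for _, other in kept):
            kept.append((name, h))
    return [name for name, _ in kept]
```

Because the hash depends on relative brightness rather than exact bytes, a logo repeated on every page with slight rendering differences still collapses to one entry.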

Phase 3: AI enrichment (30–90 seconds)

Additional AI stages run in parallel to deepen the extraction. Your document status shows Enriching: you can already use it while enrichment completes.

What’s produced	Description	Available on
Executive summary	AI-generated overview of key findings	All plans
Citations	References parsed via GROBID, matched to your library, DOIs resolved via CrossRef	All plans
Artifact descriptions	AI-generated captions for tables and figures	All plans
Search embeddings	Vector representations enabling semantic search	All plans
Context enrichment	Links the document to related content in your workspace	All plans
Vision descriptions	AI vision model analyzes charts, plots, and diagrams	Pro+
Structured tables	Tables become queryable structured data (not just images)	Pro+
Document structure	Semantic structure analysis: sections, arguments, evidence flow	Pro+
Academic embeddings	Discipline-aware embeddings for academic search	Pro+
Multi-level summaries	TLDR + structured breakdown + detailed narrative	Pro+
QA pairs	Pre-generated question-answer pairs for rapid comprehension	Pro+
Claims extraction	Typed claims with p-values, effect sizes, and variable tracking	Enterprise
Methodology scoring	Automated quality assessment of research methods	Enterprise
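
The plan column in the table above can be read as a minimum tier per enrichment stage. The sketch below encodes that reading. The stage identifiers, the `stages_for` helper, and the tier name "base" (a placeholder for whatever the entry-level plan is called) are hypothetical, and "Pro+" is assumed to mean Pro and Enterprise.

```python
# Tiers in ascending order. "base" is a placeholder name for the entry plan.
TIER_ORDER = ["base", "pro", "enterprise"]

# Minimum tier per enrichment stage, transcribed from the table above.
STAGE_MIN_TIER = {
    "executive_summary": "base",
    "citations": "base",
    "artifact_descriptions": "base",
    "search_embeddings": "base",
    "context_enrichment": "base",
    "vision_descriptions": "pro",
    "structured_tables": "pro",
    "document_structure": "pro",
    "academic_embeddings": "pro",
    "multi_level_summaries": "pro",
    "qa_pairs": "pro",
    "claims_extraction": "enterprise",
    "methodology_scoring": "enterprise",
}


def stages_for(plan: str) -> list[str]:
    """Return the enrichment stages a given plan tier would receive."""
    rank = TIER_ORDER.index(plan)
    return [
        stage
        for stage, min_tier in STAGE_MIN_TIER.items()
        if TIER_ORDER.index(min_tier) <= rank
    ]
```

Under these assumptions, the entry tier receives 5 stages, Pro receives 11, and Enterprise receives all 13.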

Phase 4: Ready

All stages complete. Your document is fully searchable, chat-ready, and all extracted artifacts are available.

If any non-critical stage failed (for example, an embedding provider was temporarily unavailable), the document is marked Ready with warnings: core functionality works, but some enrichment artifacts may be missing.
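
The "Ready with warnings" behavior can be pictured as running independent stages in parallel and recording failures rather than aborting. The following is a minimal sketch of that idea, not Virza's actual scheduler: `run_enrichment` is a hypothetical function, and it treats every stage passed to it as non-critical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_enrichment(stages: dict[str, Callable[[], object]]) -> tuple[str, list[str]]:
    """Run stage callables concurrently; return (status, names of failed stages).

    Assumption: all stages here are non-critical, so a failure only
    downgrades the final status instead of failing the document.
    """
    failed = []
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn) for name, fn in stages.items()}
        for name, future in futures.items():
            try:
                future.result()  # re-raises any exception from the stage
            except Exception:
                failed.append(name)
    status = "Ready with warnings" if failed else "Ready"
    return status, failed
```

The list of failed stage names is what would let an interface show which enrichment artifacts are missing.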

Processing time

Document type	Typical time	Notes
Standard academic paper (5–20 pages)	15–60 seconds	Most papers complete in under 30 seconds
Long report (50–100 pages)	1–3 minutes	More figures and tables mean more extraction time
Book-length document (100+ pages)	3–10 minutes	Enterprise plans support up to 5,000 pages

Pro and higher plans get priority processing. Your documents are processed ahead of the standard queue.


What affects processing quality

Text-based PDFs produce the best results. Publisher PDFs from journals and arXiv have consistent layouts that the parser is trained on.
Scanned PDFs (image-only) trigger OCR, which works well for clean scans but may produce lower-quality text from poor images.
Password-protected PDFs cannot be processed. Remove the password before uploading.
DOCX files have good text extraction but limited metadata and citation parsing compared to PDF.
TXT files have basic support with no metadata extraction.

Duplicate detection

If you upload the same file twice (identical bytes), Virza detects the duplicate via SHA-256 hash matching and links the upload to the existing processed version, so no additional processing is needed.

If you upload the same paper from a different source (for example, an arXiv preprint and the published version), Virza detects the match via DOI and notifies you of the existing copy.
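
The two duplicate checks can be sketched as two lookups: one keyed by content hash, one keyed by DOI. Everything named below (`Library`, `normalize_doi`, the result labels) is hypothetical and only illustrates the idea. The DOI normalization relies on DOIs being case-insensitive and strips common URL or `doi:` prefixes.

```python
import hashlib
from typing import Optional


def normalize_doi(doi: str) -> str:
    """Lowercase a DOI and strip common prefixes so equal DOIs compare equal."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi


class Library:
    """In-memory stand-in for a document library (illustrative only)."""

    def __init__(self) -> None:
        self.by_hash: dict[str, str] = {}
        self.by_doi: dict[str, str] = {}

    def add(self, doc_id: str, data: bytes, doi: Optional[str] = None):
        """Return (kind, existing_id); kind is 'new', 'exact_duplicate' or 'same_paper'."""
        digest = hashlib.sha256(data).hexdigest()
        if digest in self.by_hash:
            # Identical bytes: reuse the already processed version.
            return "exact_duplicate", self.by_hash[digest]
        self.by_hash[digest] = doc_id
        if doi:
            key = normalize_doi(doi)
            if key in self.by_doi:
                # Different bytes, same DOI: notify about the existing copy.
                return "same_paper", self.by_doi[key]
            self.by_doi[key] = doc_id
        return "new", None
```

The hash check runs first because it is exact and lets processing be skipped entirely. The DOI check only flags a probable match between files whose bytes differ.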

Metadata is editable

If Virza’s automatic detection gets something wrong, click on any metadata field to correct it manually. This is common for preprints, working papers, and non-standard formats.

Troubleshooting

Issue	What happened	What to do
Size limit exceeded	File is larger than your plan allows	Compress the PDF or upgrade your plan
Page limit exceeded	Document has more pages than your plan allows	Split the document or upgrade
Encryption detected	PDF is password-protected	Remove the password and re-upload
Parse timeout	Document structure was too complex for the parser (300s limit)	Try re-uploading; contact support if it persists
OCR failed	Scanned PDF couldn’t be reliably read	Use a cleaner scan or the text-based version if available
Stuck in “Processing”	A pipeline stage may have timed out	Try re-uploading; contact support if it persists

For the complete reference of all error states, see Failure Modes & Partial Results.