How Document Processing Works
When you upload a document to Virza, it doesn’t just get stored. It goes through a multi-stage AI pipeline that transforms a static file into a research-ready artifact. Each stage extracts structured knowledge that powers search, chat, citations, and evidence analysis.
The processing lifecycle
Every document passes through four phases. You can use the document as soon as Phase 2 completes. Later phases add progressively deeper analysis.
Upload → Security scan (instant) → Core extraction (15–30s) → AI enrichment (30–90s) → Ready
Phase 1: Upload and security (instant)
Your file is uploaded directly to secure storage via a presigned URL. The file data never passes through our API server. Before any processing begins, the file enters a quarantine zone where it is scanned for malware.
- File integrity verified (SHA-256 hash)
- File type validated (PDF, DOCX, or TXT only)
- Virus scan completed
- File size and page count checked against your plan limits
If the virus scan detects a threat, the document is immediately quarantined and deleted. You will see an Infected status. If file validation fails (wrong format, too large, too many pages), you will see a specific error explaining what went wrong. See Failure Modes for all possible errors.
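The checks listed above can be pictured as a single validation pass that returns every problem it finds. This is a minimal sketch, not Virza's implementation: `PlanLimits`, `validate_upload`, and the error codes are hypothetical names, the limits are placeholders, and checking the file extension is a simplification (a real validator would likely inspect file contents as well).

```python
import hashlib
import os
from dataclasses import dataclass

# Formats accepted according to this page.
ALLOWED_EXTENSIONS = {".pdf", ".docx", ".txt"}


@dataclass(frozen=True)
class PlanLimits:
    """Hypothetical per-plan limits; the real values are not stated here."""
    max_bytes: int
    max_pages: int


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the file bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def validate_upload(filename: str, data: bytes, page_count: int,
                    expected_sha256: str, limits: PlanLimits) -> list[str]:
    """Run the pre-processing checks and return a list of error codes.

    An empty list means the file passed. The error codes are illustrative.
    """
    errors = []
    # Integrity: the stored bytes must hash to the digest computed at upload.
    if sha256_hex(data) != expected_sha256:
        errors.append("integrity_mismatch")
    # File type: only PDF, DOCX, and TXT are accepted.
    extension = os.path.splitext(filename)[1].lower()
    if extension not in ALLOWED_EXTENSIONS:
        errors.append("unsupported_type")
    # Size and page count are compared against the plan limits.
    if len(data) > limits.max_bytes:
        errors.append("size_limit_exceeded")
    if page_count > limits.max_pages:
        errors.append("page_limit_exceeded")
    return errors
```

Collecting all errors instead of stopping at the first one lets the user see every reason a file was rejected in a single attempt.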
Phase 2: Core extraction (15–30 seconds)
The document enters the extraction pipeline. Your document status changes to Processing.
| What’s extracted | How it works |
|---|---|
| Full text | Docling AI parser extracts text, headers, captions, and footnotes with document structure preserved. Falls back to PyMuPDF for compatibility. |
| Metadata | Title, authors, abstract, DOI, journal, and publication year are identified from the document and enriched via CrossRef and arXiv APIs |
| Sections | The document is segmented into semantic sections (Abstract, Introduction, Methods, Results, Discussion, etc.) |
| Tables | Tables are detected with bounding boxes, confidence scores, headers, and row data |
| Figures | Figures are cropped, deduplicated (perceptual hashing prevents repeated logos/headers), and saved as previews |
| Equations | LaTeX equations are extracted from mathematical content |
After this phase, your document status changes to Available: you can read it, search for it, and see its structure.
Phase 3: AI enrichment (30–90 seconds)
Additional AI stages run in parallel to deepen the extraction. Your document status shows Enriching: you can already use it while enrichment completes.
| What’s produced | Description | Available on |
|---|---|---|
| Executive summary | AI-generated overview of key findings | All plans |
| Citations | References parsed via GROBID, matched to your library, DOIs resolved via CrossRef | All plans |
| Artifact descriptions | AI-generated captions for tables and figures | All plans |
| Search embeddings | Vector representations enabling semantic search | All plans |
| Context enrichment | Links the document to related content in your workspace | All plans |
| Vision descriptions | AI vision model analyzes charts, plots, and diagrams | Pro+ |
| Structured tables | Tables become queryable structured data (not just images) | Pro+ |
| Document structure | Semantic structure analysis: sections, arguments, evidence flow | Pro+ |
| Academic embeddings | Discipline-aware embeddings for academic search | Pro+ |
| Multi-level summaries | TLDR + structured breakdown + detailed narrative | Pro+ |
| QA pairs | Pre-generated question-answer pairs for rapid comprehension | Pro+ |
| Claims extraction | Typed claims with p-values, effect sizes, and variable tracking | Enterprise |
| Methodology scoring | Automated quality assessment of research methods | Enterprise |
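One way to read the "Available on" column is as a minimum tier per enrichment output. The sketch below encodes the table that way. The tier name `BASE` and the snake_case stage keys are assumptions made for illustration; they are not Virza plan names or API identifiers.

```python
from enum import IntEnum


class Plan(IntEnum):
    # Assumed tier ordering. "BASE" stands in for whatever the lowest plan is called.
    BASE = 0
    PRO = 1
    ENTERPRISE = 2


# Minimum tier for each enrichment output, transcribed from the table above.
MIN_PLAN = {
    "executive_summary": Plan.BASE,
    "citations": Plan.BASE,
    "artifact_descriptions": Plan.BASE,
    "search_embeddings": Plan.BASE,
    "context_enrichment": Plan.BASE,
    "vision_descriptions": Plan.PRO,
    "structured_tables": Plan.PRO,
    "document_structure": Plan.PRO,
    "academic_embeddings": Plan.PRO,
    "multi_level_summaries": Plan.PRO,
    "qa_pairs": Plan.PRO,
    "claims_extraction": Plan.ENTERPRISE,
    "methodology_scoring": Plan.ENTERPRISE,
}


def enabled_stages(plan: Plan) -> list[str]:
    """Return the enrichment outputs a given plan would receive."""
    return [name for name, minimum in MIN_PLAN.items() if plan >= minimum]
```

Under these assumptions, `enabled_stages(Plan.PRO)` returns the five outputs available on all plans plus the six marked Pro+.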
Phase 4: Ready
All stages complete. Your document is fully searchable, chat-ready, and all extracted artifacts are available.
If any non-critical stage failed (for example, an embedding provider was temporarily unavailable), the document is marked Ready with warnings: core functionality works, but some enrichment artifacts may be missing.
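The "Ready with warnings" behavior can be illustrated with stages that run concurrently and report failures instead of aborting the whole run. This is a sketch under stated assumptions: `run_enrichment` and the stage names are hypothetical, and every stage is treated as non-critical, which this page does not confirm for the real pipeline.

```python
import asyncio


async def run_enrichment(stages):
    """Run all stages concurrently and derive the final document status.

    `stages` maps a stage name to a zero-argument coroutine function.
    Returns (status, names_of_failed_stages).
    """
    names = list(stages)
    # return_exceptions=True keeps one failing stage from cancelling the rest.
    results = await asyncio.gather(
        *(stages[name]() for name in names), return_exceptions=True
    )
    failed = [name for name, result in zip(names, results)
              if isinstance(result, Exception)]
    status = "Ready with warnings" if failed else "Ready"
    return status, failed


async def summarize():
    # Stand-in for a stage that succeeds.
    return "summary text"


async def embed():
    # Stand-in for a stage whose provider is temporarily unavailable.
    raise ConnectionError("embedding provider unavailable")
```

For example, `asyncio.run(run_enrichment({"summary": summarize, "embeddings": embed}))` yields the status "Ready with warnings" and reports `embeddings` as the missing artifact, while the summary is still produced.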
Processing time
| Document type | Typical time | Notes |
|---|---|---|
| Standard academic paper (5–20 pages) | 15–60 seconds | Most papers complete in under 30 seconds |
| Long report (50–100 pages) | 1–3 minutes | More figures and tables mean more extraction time |
| Book-length document (100+ pages) | 3–10 minutes | Enterprise plans support up to 5,000 pages |
Pro and higher plans get priority processing. Your documents are processed ahead of the standard queue.
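Priority processing can be pictured as a two-tier queue in which priority documents are always taken first and order within a tier is first in, first out. This is only an illustration: the page states that Pro and higher plans are processed ahead of the standard queue, but the strict two-tier scheme and the `ProcessingQueue` class are assumptions.

```python
import heapq
import itertools


class ProcessingQueue:
    """Two-tier queue: priority documents first, FIFO within each tier."""

    def __init__(self):
        self._heap = []
        # Monotonic counter used as a tie-breaker to preserve submission order.
        self._counter = itertools.count()

    def submit(self, doc_id: str, priority_plan: bool) -> None:
        """Enqueue a document; priority_plan is True for Pro and higher."""
        tier = 0 if priority_plan else 1  # lower tier number is served first
        heapq.heappush(self._heap, (tier, next(self._counter), doc_id))

    def next_document(self) -> str:
        """Pop the document that should be processed next."""
        return heapq.heappop(self._heap)[2]
```

A strict scheme like this can starve the standard tier under heavy priority load, so a production scheduler would usually add some fairness mechanism on top.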
What affects processing quality
- Text-based PDFs produce the best results. Publisher PDFs from journals and arXiv have consistent layouts that the parser is trained on.
- Scanned PDFs (image-only) trigger OCR, which works well for clean scans but may produce lower-quality text from poor images.
- Password-protected PDFs cannot be processed. Remove the password before uploading.
- DOCX files have good text extraction but limited metadata and citation parsing compared to PDF.
- TXT files have basic support with no metadata extraction.
Duplicate detection
If you upload the same file twice (identical bytes), Virza detects the duplicate via SHA-256 hash matching and links to the existing processed version. No additional processing is needed.
If you upload the same paper from a different source (for example, an arXiv preprint and the published version), Virza detects the match via DOI and notifies you of the existing copy.
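The two checks described above can be sketched as a pair of lookups: an exact match on the SHA-256 digest of the bytes, then a match on the DOI. The `Library` class, its return values, and the DOI normalization rules are assumptions for illustration, not Virza's actual data model.

```python
import hashlib
from typing import Optional


def normalize_doi(doi: str) -> str:
    """Lowercase a DOI and strip common prefixes (DOIs are case-insensitive)."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi


class Library:
    """In-memory index of documents by content hash and by DOI (hypothetical)."""

    def __init__(self):
        self._by_hash = {}
        self._by_doi = {}

    def add(self, doc_id: str, data: bytes, doi: Optional[str] = None) -> None:
        """Register a processed document under its hash and, if known, its DOI."""
        self._by_hash[hashlib.sha256(data).hexdigest()] = doc_id
        if doi:
            self._by_doi[normalize_doi(doi)] = doc_id

    def check(self, data: bytes, doi: Optional[str]) -> tuple[str, Optional[str]]:
        """Classify an upload as an exact duplicate, the same paper, or new."""
        digest = hashlib.sha256(data).hexdigest()
        if digest in self._by_hash:
            return ("exact_duplicate", self._by_hash[digest])
        if doi and normalize_doi(doi) in self._by_doi:
            return ("same_paper", self._by_doi[normalize_doi(doi)])
        return ("new", None)
```

The hash check runs first because it needs no extraction at all, whereas the DOI is only known once metadata has been identified.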
Metadata is editable
If Virza’s automatic detection gets something wrong, click on any metadata field to correct it manually. This is common for preprints, working papers, and non-standard formats.
Troubleshooting
| Issue | What happened | What to do |
|---|---|---|
| Size limit exceeded | File is larger than your plan allows | Compress the PDF or upgrade your plan |
| Page limit exceeded | Document has more pages than your plan allows | Split the document or upgrade |
| Encryption detected | PDF is password-protected | Remove the password and re-upload |
| Parse timeout | Document structure was too complex for the parser (300s limit) | Try re-uploading; contact support if it persists |
| OCR failed | Scanned PDF couldn’t be reliably read | Use a cleaner scan or the text-based version if available |
| Stuck in “Processing” | A pipeline stage may have timed out | Try re-uploading. If the issue persists, contact support |
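If you track document status from a script, a bounded wait makes a stuck document show up as a clear timeout instead of an endless loop. The sketch below is hypothetical: `wait_until_usable` and the injected `get_status` callable are illustrative names, since this page does not describe a status API, and the 600-second default is an assumption based on the upper end of the processing times listed earlier. The status strings are the ones used on this page.

```python
import time

# Statuses in which the document can already be used, per this page.
USABLE = {"Available", "Enriching", "Ready", "Ready with warnings"}
# Terminal failure named on this page; other error states are not listed here.
FAILED = {"Infected"}


def wait_until_usable(get_status, timeout_s=600.0, interval_s=2.0,
                      sleep=time.sleep, clock=time.monotonic) -> str:
    """Poll get_status() until the document is usable, failed, or timed out.

    get_status is a caller-supplied function returning the current status
    string. sleep and clock are injectable so the loop can be tested.
    """
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status in USABLE or status in FAILED:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"still '{status}' after {timeout_s} seconds")
        sleep(interval_s)
```

A `TimeoutError` here corresponds to the "Stuck in Processing" row above: re-upload the document, and contact support if it happens again.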
For the complete reference of all error states, see Failure Modes & Partial Results.