Supported Formats

Virza handles the most common academic document formats.

Fully supported

Format	Extension	Notes
PDF	`.pdf`	Best support. Academic papers, preprints, reports. Full text extraction, metadata detection, and citation parsing.
Microsoft Word	`.docx`	Good support. Text and basic formatting extracted.
Plain text	`.txt`	Basic support. No metadata extraction.
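
If you are preparing a batch of files, it can help to sort them by support level before uploading. The following is a minimal sketch that only mirrors the table above. The names `SUPPORT_LEVELS`, `support_level`, and `group_by_support` are hypothetical helpers for illustration and are not part of Virza.

```python
from pathlib import Path

# Support levels copied from the table above.
# This mapping and the helpers below are illustrative assumptions,
# not a Virza API.
SUPPORT_LEVELS = {
    ".pdf": "best",
    ".docx": "good",
    ".txt": "basic",
}


def support_level(filename: str) -> str:
    """Return the documented support level, or "unsupported" if not listed."""
    # Compare extensions case-insensitively so "paper.PDF" matches ".pdf".
    suffix = Path(filename).suffix.lower()
    return SUPPORT_LEVELS.get(suffix, "unsupported")


def group_by_support(filenames):
    """Group file names by support level, preserving input order."""
    groups = {}
    for name in filenames:
        groups.setdefault(support_level(name), []).append(name)
    return groups


if __name__ == "__main__":
    print(group_by_support(["paper.pdf", "draft.docx", "notes.txt", "book.epub"]))
```

This checks the extension only. It cannot tell a text-based PDF from a scanned one, so the guidance on PDF quality still applies.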

Best results with PDF

PDF documents from academic publishers and preprint servers (arXiv, bioRxiv, SSRN) give the best results because they follow standard layouts that our parser is trained to understand.

For best parsing quality:

Use text-based PDFs: Avoid scanned images. If you can select the text in your PDF reader, the file is text-based.
Publisher PDFs: Direct downloads from journals produce better results than screenshots or re-saved files.
arXiv papers: Excellent parsing results due to consistent formatting.

Scanned (image-only) PDFs have limited support. Virza attempts OCR, but results vary. Use the text-based version if one is available.

Planned formats

We’re working on support for:

EPUB: E-book format
HTML: Web pages and articles
Markdown: `.md` files
BibTeX: `.bib` citation files for bulk import

Have a format request? Let us know at [email protected].