Supported Formats

Virza handles the most common academic document formats.

Fully supported

| Format | Extension | Notes |
| --- | --- | --- |
| PDF | .pdf | Best support. Academic papers, preprints, reports. Full text extraction, metadata detection, and citation parsing. |
| Microsoft Word | .docx | Good support. Text and basic formatting extracted. |
| Plain text | .txt | Basic support. No metadata extraction. |
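
If you want to check files before uploading them, the table above can be expressed as a small lookup. This is only a sketch: the names `SUPPORT_LEVELS` and `support_level` are hypothetical and not part of Virza, and matching on the file extension alone is an assumption about how you might pre-filter files, not a description of how Virza detects formats.

```python
from pathlib import Path

# Support levels transcribed from the table above.
# Hypothetical helper names; Virza does not expose this mapping.
SUPPORT_LEVELS = {
    ".pdf": "best",
    ".docx": "good",
    ".txt": "basic",
}


def support_level(filename: str) -> str:
    """Return the documented support level for a file, based on its extension."""
    # Compare case-insensitively so "Paper.PDF" matches ".pdf".
    extension = Path(filename).suffix.lower()
    return SUPPORT_LEVELS.get(extension, "unsupported")


if __name__ == "__main__":
    for name in ["preprint.pdf", "draft.docx", "notes.txt", "book.epub"]:
        print(f"{name}: {support_level(name)}")
```

An extension check cannot tell a text-based PDF from a scanned one, so a `.pdf` file reported as "best" here may still parse poorly if it contains only images.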

Best results with PDF

PDF documents from academic publishers and preprint servers (arXiv, bioRxiv, SSRN) give the best results because they follow standard layouts that our parser is trained to understand.

For best parsing quality:

  • Text-based PDFs: Use these rather than scanned images. If you can select the text in your PDF reader, the file is text-based.
  • Publisher PDFs: Direct downloads from journals produce better results than screenshots or re-saved files.
  • arXiv papers: Excellent parsing results due to consistent formatting.

Scanned (image-only) PDFs have limited support. Virza attempts OCR, but the quality of the extracted text may vary. If a text-based version of the document is available, use that instead.

Planned formats

We’re working on support for:

  • EPUB: E-book format
  • HTML: Web pages and articles
  • Markdown: .md files
  • BibTeX: .bib citation files for bulk import

Have a format request? Let us know at [email protected].
