Hunter Heidenreich, Yosheb Getachew, Olivia Dinica, Ben Elliott
PubMed-OCR is a large, annotated corpus of scientific articles from PubMed Central, designed to support OCR-related research and development.
The PubMed-OCR dataset consists of scientific articles that have been processed with Optical Character Recognition (OCR). It contains over 209,000 articles, covering about 1.5 million pages and approximately 1.3 billion words. Each page is annotated with text layout information: the positions of words, lines, and paragraphs. The resource is intended to help researchers develop and evaluate OCR technologies and the applications built on them. Its current limitations are that the annotations come from a single OCR engine and rely in part on heuristic methods.
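To make the word, line, and paragraph annotation levels concrete, the following is a minimal Python sketch of how such a hierarchy could be represented in memory. It is not the dataset's actual schema: the class names (`Box`, `Word`, `Line`, `Paragraph`), the field names, and the corner-based coordinate convention are all assumptions made for illustration. Deriving a line's box as the union of its word boxes is likewise an illustrative choice, not a description of how the corpus was built.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box (hypothetical convention: x0,y0 top-left; x1,y1 bottom-right)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def union(self, other: "Box") -> "Box":
        """Smallest box that encloses both self and other."""
        return Box(
            min(self.x0, other.x0),
            min(self.y0, other.y0),
            max(self.x1, other.x1),
            max(self.y1, other.y1),
        )


def enclosing_box(boxes: List[Box]) -> Box:
    """Union of a non-empty list of boxes; raises ValueError if the list is empty."""
    if not boxes:
        raise ValueError("cannot compute the enclosing box of zero boxes")
    result = boxes[0]
    for b in boxes[1:]:
        result = result.union(b)
    return result


@dataclass
class Word:
    text: str
    box: Box


@dataclass
class Line:
    words: List[Word] = field(default_factory=list)

    @property
    def text(self) -> str:
        return " ".join(w.text for w in self.words)

    @property
    def box(self) -> Box:
        return enclosing_box([w.box for w in self.words])


@dataclass
class Paragraph:
    lines: List[Line] = field(default_factory=list)

    @property
    def text(self) -> str:
        return "\n".join(line.text for line in self.lines)

    @property
    def box(self) -> Box:
        return enclosing_box([line.box for line in self.lines])


# Toy example with made-up coordinates (not taken from the corpus).
line1 = Line([Word("Optical", Box(10, 10, 60, 22)), Word("Character", Box(65, 10, 130, 22))])
line2 = Line([Word("Recognition", Box(10, 26, 95, 38))])
paragraph = Paragraph([line1, line2])
```

With a structure like this, a consumer can move between granularities (for example, retrieving the region of a whole paragraph from its words) without storing redundant coordinates. A released file format may well store boxes at every level explicitly instead.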