PubMed-OCR: PMC Open Access OCR Annotations

Source: arXiv

Hunter Heidenreich, Yosheb Getachew, Olivia Dinica, Ben Elliott

Categories: cs.CV, cs.CL, cs.DL, cs.LG
Jan 16, 2026 | 10,900 views

One-line Summary

PubMed-OCR is a large, annotated corpus of scientific articles from PubMed Central, designed to support OCR-related research and development.

Plain-language Overview

The PubMed-OCR project provides a dataset of scientific articles that have been processed with Optical Character Recognition (OCR). The dataset contains over 209,000 articles, covering about 1.5 million pages and approximately 1.3 billion words. Each page is annotated with layout information: the positions of words, lines, and paragraphs. The resource is intended to help researchers develop and evaluate OCR technologies and applications. Its main limitations are that the annotations currently come from a single OCR engine and rely on some heuristic methods.
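To make the idea of word-, line-, and paragraph-level positions more concrete, here is a minimal Python sketch of how such nested annotations could be represented. The class and field names (`BBox`, `Word`, `Line`, `Paragraph`) are hypothetical and chosen for illustration; this summary does not describe the dataset's actual schema. The last lines derive rough per-article and per-page averages from the rounded totals quoted above, so they are approximations only.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BBox:
    """Axis-aligned box in page coordinates (hypothetical layout)."""
    x0: float
    y0: float
    x1: float
    y1: float


def union(boxes: List[BBox]) -> BBox:
    """Smallest box that encloses every box in the list."""
    return BBox(
        min(b.x0 for b in boxes),
        min(b.y0 for b in boxes),
        max(b.x1 for b in boxes),
        max(b.y1 for b in boxes),
    )


@dataclass
class Word:
    text: str
    box: BBox


@dataclass
class Line:
    words: List[Word]

    @property
    def box(self) -> BBox:
        # A line's position is taken here as the union of its word boxes.
        return union([w.box for w in self.words])

    @property
    def text(self) -> str:
        return " ".join(w.text for w in self.words)


@dataclass
class Paragraph:
    lines: List[Line]

    @property
    def box(self) -> BBox:
        # A paragraph's position is the union of its line boxes.
        return union([ln.box for ln in self.lines])


# Made-up example: one line with two words.
line = Line([
    Word("PubMed", BBox(10, 20, 60, 32)),
    Word("OCR", BBox(65, 21, 95, 33)),
])
paragraph = Paragraph([line])

# Rough averages from the rounded totals in the overview (approximate).
articles, pages, words = 209_000, 1_500_000, 1_300_000_000
pages_per_article = pages / articles
words_per_page = words / pages
```

Under these rounded figures the corpus works out to roughly 7 pages per article and somewhat under 900 words per page. Deriving line and paragraph boxes from word boxes is only one possible design; a released schema may instead store each level's coordinates explicitly.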

Technical Details