PaperPulse - AI/ML Summarization Platform

One-line Summary

PubMed-OCR is a large, annotated corpus of scientific articles from PubMed Central, designed to support OCR-related research and development.

Plain-language Overview

The PubMed-OCR project provides a dataset of scientific articles that have been processed with Optical Character Recognition (OCR). The dataset contains over 209,000 articles, covering about 1.5 million pages and approximately 1.3 billion words. Each page is annotated with text layout information, such as the positions of words, lines, and paragraphs. The resource is intended to help researchers develop and evaluate new OCR technologies and applications. Its current limitation is that it relies on a single OCR engine and some heuristic methods.
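To make the word, line, and paragraph layout annotations more concrete, here is a minimal Python sketch of how such a hierarchy could be represented. It is not the dataset's actual schema: the class names, the `(x0, y0, x1, y1)` box format, and the `union_boxes` helper are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical box format: (x0, y0, x1, y1) in page coordinates.
# This is an illustrative assumption, not the dataset's documented schema.
Box = Tuple[float, float, float, float]


def union_boxes(boxes: List[Box]) -> Box:
    """Return the smallest box that encloses every box in the list."""
    if not boxes:
        raise ValueError("cannot take the union of zero boxes")
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


@dataclass
class Word:
    text: str
    box: Box


@dataclass
class Line:
    words: List[Word] = field(default_factory=list)

    @property
    def text(self) -> str:
        return " ".join(w.text for w in self.words)

    @property
    def box(self) -> Box:
        return union_boxes([w.box for w in self.words])


@dataclass
class Paragraph:
    lines: List[Line] = field(default_factory=list)

    @property
    def text(self) -> str:
        return "\n".join(line.text for line in self.lines)

    @property
    def box(self) -> Box:
        return union_boxes([line.box for line in self.lines])


# Toy example with made-up coordinates.
para = Paragraph(lines=[
    Line(words=[Word("PubMed", (10, 10, 60, 22)), Word("OCR", (65, 10, 95, 22))]),
    Line(words=[Word("corpus", (10, 26, 58, 38))]),
])
```

In this sketch only words store their own boxes. Line and paragraph boxes are derived from them, which is one simple way to keep the three levels consistent with each other.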

PubMed-OCR: PMC Open Access OCR Annotations

One-line Summary

Plain-language Overview

Technical Details

Methodology

Data

Results