The ISRI OCR Performance Toolkit
The OCR Performance Toolkit (OCRtk) combines software for measuring the accuracy of optical character recognition (OCR) output with a large and diverse corpus of scanned page images and corresponding ground-truth text. Both were developed at ISRI during the mid-1990s as part of a program to compare leading OCR technologies.
Feeling overwhelmed? You might want to start with the OCR Frontiers Toolkit -- a single CD containing a sample of interesting data and pre-compiled binaries for several platforms.
News
- 28 May 2007: The Analytic Tools are now distributed under the terms of the Apache License, Version 2.0.
- 28 Mar 2005: Obviously, CAN-SPAM has not stopped email address scraping from web pages (surprise). Apologies to those who subscribed to the announcements list only to find a new source of spam. Since the list is announce-only, posting has been restricted to a person at ISRI.
- 17 Feb 2005: Thanks to Andre Steenveld for noticing that greyscale zone files were missing from the M and N collections. If you have already downloaded Ng.tgz and/or Mg-part?.tgz, you do not need to download them again. Packages containing only the missing zone files are provided below.
- 1 Mar 2005: We have created a mailing list for announcements.
Analytic Tools
The ISRI Analytic Tools are 19 programs designed to measure the accuracy of, and experiment with, OCR output. The following metrics are available:
- character accuracy
- accuracy by character class
- word accuracy
- non-stopword accuracy
- cost of zoning errors
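To make the first and third metrics concrete, here is a minimal sketch. It is not the toolkit's own code. It assumes character accuracy is (n - errors) / n, where n is the number of ground-truth characters and errors is the minimum number of insertions, deletions, and substitutions needed to correct the OCR output (the Levenshtein distance). The word accuracy function is a deliberate simplification that ignores word order; the function names are hypothetical.

```python
import re
from collections import Counter


def edit_distance(truth: str, ocr: str) -> int:
    """Minimum number of insertions, deletions and substitutions
    needed to turn `ocr` into `truth` (Levenshtein distance)."""
    prev = list(range(len(ocr) + 1))
    for i, t in enumerate(truth, start=1):
        curr = [i]
        for j, o in enumerate(ocr, start=1):
            cost = 0 if t == o else 1
            curr.append(min(prev[j] + 1,          # character missing from OCR
                            curr[j - 1] + 1,      # extra character in OCR
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def character_accuracy(truth: str, ocr: str) -> float:
    """Assumed definition: (n - errors) / n, with n = len(truth).
    The value can be negative if the OCR output has more errors
    than the ground truth has characters."""
    n = len(truth)
    if n == 0:
        raise ValueError("ground truth must not be empty")
    return (n - edit_distance(truth, ocr)) / n


def word_accuracy(truth: str, ocr: str) -> float:
    """Simplified word accuracy: fraction of ground-truth words
    (runs of letters, case-insensitive) that also occur in the OCR
    output, counted with multiplicity. Word order is ignored."""
    truth_words = Counter(re.findall(r"[a-z]+", truth.lower()))
    ocr_words = Counter(re.findall(r"[a-z]+", ocr.lower()))
    total = sum(truth_words.values())
    if total == 0:
        raise ValueError("ground truth contains no words")
    matched = sum((truth_words & ocr_words).values())
    return matched / total


if __name__ == "__main__":
    print(character_accuracy("recognition", "recogniti0n"))
    print(word_accuracy("the quick brown fox", "the qu1ck brown fox"))
```

A single substituted character in an 11-character string gives 10/11, and one damaged word out of four gives 0.75. An alignment-based matcher would be needed to handle reordered or repeated text properly.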
Download
- User Guide (295k PDF)
- Source Distribution (221k Gzipped Tar file)
- Stephen V. Rice's dissertation (Rice96): Measuring the Accuracy of Page-Reading Systems (497k PDF)
Images and Ground Truth
The table below shows the nine datasets, along with the numbers of pages and characters in each and the ISRI Annual Tests that used them. Since image quality is a major factor controlling recognition accuracy, most datasets were manually scanned multiple times to produce bitonal images at 200, 300, and 400dpi, and 300dpi greyscale. The B and L sets also have Standard- and Fine-mode fax images. An x marks an image type available for a sample; in the TOTAL row, the image-type columns give page counts.

| Name | Description | # Pages | # Chars | Bitonal 200 | Bitonal 300 | Bitonal 400 | Grey 300 | Fax Fine | Fax Std | Annual Test(s) |
|---|---|---|---|---|---|---|---|---|---|---|
| 2 | DOE Sample 2 | 460 | 817,946 | | x | | | | | 1993, 94 |
| M | Magazine Sample | 200 | 666,134 | x | x | x | x | | | 1994, 95 |
| N | Newspaper Sample | 200 | 492,080 | x | x | x | x | | | 1995 |
| B | Business Letter Sample | 200 | 319,756 | x | x | x | x | x | x | 1995, 96 |
| L | Legal Document Sample | 300 | 372,098 | x | x | x | x | x | x | 1996 |
| S | Spanish Newspaper Sample | 144 | 348,091 | x | x | x | x | | | 1995, 96 |
| 3 | DOE Sample 3 | 785 | 1,463,512 | x | x | x | x | | | 1995, 96 |
| R | Annual Report Sample | 300 | 892,266 | x | x | x | x | | | 1996 |
| Z | Magazine Sample 2 | 300 | 1,244,171 | x | x | x | x | | | 1996 |
| TOTAL | | 2,889 | 6,616,054 | 2,429 | 2,889 | 2,429 | 2,429 | 500 | 500 | |
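The TOTAL row can be cross-checked with a few sums. The sketch below copies the page and character counts from the table. Which samples carry which image types is an assumption here, inferred from the text (B and L have fax images) and from the per-column totals (only DOE Sample 2 appears to lack the 200dpi, 400dpi, and greyscale scans).

```python
# (pages, characters) per sample, copied from the table above.
SAMPLES = {
    "2": (460, 817_946),
    "M": (200, 666_134),
    "N": (200, 492_080),
    "B": (200, 319_756),
    "L": (300, 372_098),
    "S": (144, 348_091),
    "3": (785, 1_463_512),
    "R": (300, 892_266),
    "Z": (300, 1_244_171),
}

# Assumption (inferred, not stated row by row in the source):
# sample 2 has 300dpi bitonal images only; B and L also have fax images.
BITONAL_300_ONLY = {"2"}
FAX_SAMPLES = {"B", "L"}

total_pages = sum(pages for pages, _ in SAMPLES.values())
total_chars = sum(chars for _, chars in SAMPLES.values())

# Pages scanned at 200dpi, 400dpi and 300dpi greyscale.
multi_scan_pages = sum(
    pages for name, (pages, _) in SAMPLES.items()
    if name not in BITONAL_300_ONLY
)

# Pages with standard- and fine-mode fax images.
fax_pages = sum(
    pages for name, (pages, _) in SAMPLES.items()
    if name in FAX_SAMPLES
)

print(total_pages, total_chars, multi_scan_pages, fax_pages)
# prints: 2889 6616054 2429 500
```

These values agree with the TOTAL row, which supports the assumed availability pattern.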
Naming Conventions
Each page has an id of the form DDDD_PPP, where DDDD indicates the document number and PPP the page number within the document. A given page has multiple TIFF image files:
- DDDD_PPP.2B: 200dpi, bitonal, fixed thresholding
- DDDD_PPP.3B: 300dpi, bitonal, fixed thresholding
- DDDD_PPP.4B: 400dpi, bitonal, fixed thresholding
- DDDD_PPP.3G: 300dpi, 8-bit greyscale
- DDDD_PPP.3A: 300dpi, bitonal, scanner built-in adaptive thresholding
- DDDD_PPP.SF: standard-mode fax
- DDDD_PPP.FF: fine-mode fax
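The naming scheme above lends itself to a small lookup. The following is a sketch only: it assumes DDDD is exactly four digits, PPP exactly three digits, and that suffixes are upper-case as listed. The function name and the sample file names are made up for illustration and are not taken from the datasets.

```python
import re
from typing import NamedTuple, Optional

# Suffix meanings as listed in the naming conventions.
SUFFIXES = {
    "2B": "200dpi, bitonal, fixed thresholding",
    "3B": "300dpi, bitonal, fixed thresholding",
    "4B": "400dpi, bitonal, fixed thresholding",
    "3G": "300dpi, 8-bit greyscale",
    "3A": "300dpi, bitonal, scanner built-in adaptive thresholding",
    "SF": "standard-mode fax",
    "FF": "fine-mode fax",
}

# Assumption: four-digit document number, three-digit page number.
_NAME = re.compile(r"^(\d{4})_(\d{3})\.([0-9A-Z]{2})$")


class ImageInfo(NamedTuple):
    document: str     # DDDD
    page: str         # PPP
    suffix: str       # e.g. "3B"
    description: str  # human-readable image type


def parse_image_name(filename: str) -> Optional[ImageInfo]:
    """Return the parts of a DDDD_PPP.XX image name, or None if the
    name does not match the pattern or the suffix is unknown."""
    match = _NAME.match(filename)
    if match is None:
        return None
    document, page, suffix = match.groups()
    description = SUFFIXES.get(suffix)
    if description is None:
        return None
    return ImageInfo(document, page, suffix, description)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    for name in ("1234_005.3G", "1234_005.FF", "notes.txt"):
        print(name, "->", parse_image_name(name))
```

Returning None for unrecognised names keeps the function usable as a filter when walking an unpacked archive that also contains zone and ground-truth files.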
Download
Because of their size, greyscale images have been pulled out into separate packages.
Bitonal image, zone file, and ground-truth packages:
2b.tgz (20MB)
3b.tgz (133MB)
Bb.tgz (43MB)
Lb.tgz (60MB)
Mb.tgz (117MB)
Nb.tgz (62MB)
Rb.tgz (85MB)
Sb.tgz (46MB)
Zb.tgz (178MB)
Greyscale image packages:
3g-part1.tgz (1.8GB)
3g-part2.tgz (1.8GB)
Bg.tgz (684MB)
Lg-part1.tgz (738MB)
Lg-part2.tgz (749MB)
Mg-part1.tgz (696MB)
Mg-part2.tgz (698MB)
Ng.tgz (673MB)
Rg-part1.tgz (752MB)
Rg-part2.tgz (712MB)
Sg.tgz (539MB)
Zg-part1.tgz (993MB)
Zg-part2.tgz (1003MB)
Missing greyscale zones (if you downloaded Ng or Mg before 17 Feb 2005):
ISRI Annual Tests of OCR Accuracy
In 1992-1996, ISRI conducted 5 tests of the performance of leading OCR systems. There was some attempt to compare products to stimulate competition and to advance the technology, but the overall goal was to establish what performance range was provided by contemporary systems and to identify where improvements were most needed. All of the annual tests are provided here.
Download
- "A Report on the Accuracy of OCR Devices", 1992 (384k PDF)
- "An Evaluation of OCR Accuracy", 1993 (160k PDF)
- "The Third Annual Test of OCR Accuracy", 1994 (412k PDF)
- "The Fourth Annual Test of OCR Accuracy", 1995 (3.6MB PDF)
- "The Fifth Annual Test of OCR Accuracy", 1996 (2.1MB PDF)
OCR Frontiers Toolkit
The Frontiers Toolkit is a companion to the book Optical Character Recognition: An Illustrated Guide to the Frontier, by Rice, Nagy, and Nartker (RiceEtal99). It includes all page images (280) used in the book, manually-determined zone information, ground-truth text, and OCR output from three leading OCR systems.
Download
- User Guide (77k PDF)
- Source Distribution (25MB Gzipped Tar file)
Tesseract OCR
An ideal complement to the OCRtk is the open-source Tesseract OCR engine. Tesseract OCR was developed at HP Labs until 1995 and was released to the open-source community in 2005. It ranked very highly in The Fourth Annual Test of OCR Accuracy.
Mailing Lists
We currently have an announcements-only mailing list, ocrtk-announce@isri.unlv.edu. If you wish to subscribe, send a message to ocrtk-announce-request@isri.unlv.edu (note: this is a human).
References
- [KanaiEtal95] Junichi Kanai, Stephen V. Rice, Thomas A. Nartker, and George Nagy. Automated Evaluation of OCR Zoning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(1):86-90, January 1995.
- [NagyEtal00] George Nagy, Thomas A. Nartker, and Stephen V. Rice. Optical Character Recognition: An Illustrated Guide to the Frontier. In Proc. IS&T/SPIE 2000 Intl. Symp. on Electronic Imaging Science and Technology, volume 3967, pages 58-69, San Jose, CA, January 2000. (invited).
- [NartkerEtal05] Thomas A. Nartker, Stephen V. Rice, and Steven E. Lumos. Software Tools and Test Data for Research and Testing of Page-Reading OCR Systems. In Proc. IS&T/SPIE 2005 Intl. Symp. on Electronic Imaging Science and Technology, January 2005.
- [Rice96] Stephen V. Rice. Measuring the Accuracy of Page-Reading Systems. PhD thesis, University of Nevada, Las Vegas, 1996.
- [RiceEtal94] Stephen V. Rice, Junichi Kanai, and Thomas A. Nartker. An Algorithm for Matching OCR-Generated Text Strings. International Journal of Pattern Recognition and Artificial Intelligence, 8(5):1259-1268, 1994.
- [RiceEtal99] Stephen V. Rice, George Nagy, and Thomas A. Nartker. Optical Character Recognition: An Illustrated Guide to the Frontier. Kluwer Academic Publishers, April 1999.