The ISRI OCR Performance Toolkit

The OCR Performance Toolkit (OCRtk) combines software for measuring the accuracy of optical character recognition (OCR) output with a large and diverse corpus of scanned page images and corresponding ground-truth text. Both were developed at ISRI in the mid-1990s as part of a program to compare leading OCR technologies.

Feeling overwhelmed? You might want to start with the OCR Frontiers Toolkit -- a single CD containing a sample of interesting data and pre-compiled binaries for several platforms.

News

  • 28 May 2007: The Analytic Tools are now distributed under the terms of the Apache License, Version 2.0.
  • 28 Mar 2005: Obviously, CAN-SPAM has not stopped email address scraping from web pages (surprise). Apologies to those who subscribed to the announcements list only to find a new source of spam. Since the list is announce-only, posting has been restricted to a person at ISRI.
  • 17 Feb 2005: Thanks to Andre Steenveld for noticing that greyscale zone files were missing from the M and N collections. If you have already downloaded Ng.tgz and/or Mg-part?.tgz, you do not need to download them again. Packages containing only the missing zone files are provided below.
  • 1 Mar 2005: We have created a mailing list for announcements.

Analytic Tools

The ISRI Analytic Tools are 19 programs designed to measure the accuracy of OCR output and to experiment with it. The following metrics are available:

  • character accuracy
  • accuracy by character class
  • word accuracy
  • non-stopword accuracy
  • cost of zoning errors

Aggregation of results and confidence intervals are also supported. The programs are written in C and should compile on most platforms without changes. Pre-compiled binaries of most of the Analytic Tools are included in the Frontiers Toolkit below.
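
As a rough illustration of the first metric, character accuracy is commonly defined as (n - e) / n, where n is the number of ground-truth characters and e is the minimum number of character insertions, deletions, and substitutions needed to correct the OCR output. The Python sketch below assumes that definition and uses a plain Levenshtein distance. It is not the toolkit's code, the function names are hypothetical, and the C programs may differ in details.

```python
def edit_distance(truth: str, ocr: str) -> int:
    """Minimum number of single-character insertions, deletions, and substitutions."""
    # previous[j] holds the distance between the processed prefix of truth and ocr[:j].
    previous = list(range(len(ocr) + 1))
    for i, t_char in enumerate(truth, start=1):
        current = [i]
        for j, o_char in enumerate(ocr, start=1):
            cost = 0 if t_char == o_char else 1
            current.append(min(previous[j] + 1,          # character missing from OCR output
                               current[j - 1] + 1,       # extra character in OCR output
                               previous[j - 1] + cost))  # match or substitution
        previous = current
    return previous[-1]


def character_accuracy(truth: str, ocr: str) -> float:
    """Assumed definition: (n - errors) / n, with n = len(truth)."""
    n = len(truth)
    if n == 0:
        raise ValueError("ground truth must not be empty")
    return (n - edit_distance(truth, ocr)) / n
```

Under this definition the result can be negative when the OCR output contains many spurious characters, since each one counts as an error against a fixed n.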

Download

Images and Ground Truth

The table below lists the nine datasets, with the number of pages and characters in each and the ISRI Annual Tests that used them. Because image quality is a major factor in recognition accuracy, most datasets were manually scanned several times to produce bitonal images at 200, 300, and 400 dpi and greyscale images at 300 dpi. The B and L sets also have Standard- and Fine-mode fax images.

Sample                          Number of           Bitonal              Grey   Fax         Annual
Name  Description               # Pages  # Chars    200    300    400    300    Fine  Std   Test(s)
2     DOE Sample 2              460      817,946    -      yes    -      -      -     -     1993, 94
M     Magazine Sample           200      666,134    yes    yes    yes    yes    -     -     1994, 95
N     Newspaper Sample          200      492,080    yes    yes    yes    yes    -     -     1995
B     Business Letter Sample    200      319,756    yes    yes    yes    yes    yes   yes   1995, 96
L     Legal Document Sample     300      372,098    yes    yes    yes    yes    yes   yes   1996
S     Spanish Newspaper Sample  144      348,091    yes    yes    yes    yes    -     -     1995, 96
3     DOE Sample 3              785      1,463,512  yes    yes    yes    yes    -     -     1995, 96
R     Annual Report Sample      300      892,266    yes    yes    yes    yes    -     -     1996
Z     Magazine Sample 2         300      1,244,171  yes    yes    yes    yes    -     -     1996
TOTAL                           2,889    6,616,054  2,429  2,889  2,429  2,429  500   500

In the image columns, "yes" marks a version that is available and "-" one that is not. The TOTAL entries in those columns are page counts.

For each page, manually keyed ground truth is provided, along with manually determined zone information for each resolution.

Naming Conventions

Each page has an id of the form DDDD_PPP, where DDDD indicates the document number and PPP the page number within the document. A given page has multiple TIFF image files:

DDDD_PPP.2B   200 dpi, bitonal, fixed thresholding
DDDD_PPP.3B   300 dpi, bitonal, fixed thresholding
DDDD_PPP.4B   400 dpi, bitonal, fixed thresholding
DDDD_PPP.3G   300 dpi, 8-bit greyscale
DDDD_PPP.3A   300 dpi, bitonal, scanner built-in adaptive thresholding
DDDD_PPP.SF   standard-mode fax
DDDD_PPP.FF   fine-mode fax

Fax images were made by a Xerox 7024 fax machine; all others were made by a Fujitsu M3096G scanner.
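
The naming scheme above can be sketched in Python as follows. This is only an illustration: the helper names and the page id "1234_001" are hypothetical, and the code assumes that the document and page numbers are separated by a single underscore.

```python
# Image variants listed above, keyed by file extension.
IMAGE_SUFFIXES = {
    "2B": "200 dpi, bitonal, fixed thresholding",
    "3B": "300 dpi, bitonal, fixed thresholding",
    "4B": "400 dpi, bitonal, fixed thresholding",
    "3G": "300 dpi, 8-bit greyscale",
    "3A": "300 dpi, bitonal, scanner built-in adaptive thresholding",
    "SF": "standard-mode fax",
    "FF": "fine-mode fax",
}


def split_page_id(page_id: str) -> tuple[str, str]:
    """Split 'DDDD_PPP' into (document number, page number), both kept as strings."""
    document, page = page_id.split("_")
    return document, page


def image_filenames(page_id: str) -> list[str]:
    """Candidate TIFF file names for one page, one per image variant."""
    return [f"{page_id}.{suffix}" for suffix in IMAGE_SUFFIXES]
```

Not every sample provides every variant (fax images exist only for the B and L sets, for example), so the returned names are candidates to check against the unpacked files rather than a guaranteed listing.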

Each image has a corresponding zone file, e.g., DDDD_PPP.2BZ, DDDD_PPP.3BZ, ... For 3A files, use the 3BZ zone file. The format of the zone file is one zone per line, each having the five columns "left", "top", "width", "height", and "type".
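
A minimal Python sketch for locating and reading a zone file is shown below. It assumes, beyond what is stated above, that the five columns are separated by whitespace, that the four geometry values are integers, and that the type is a single token. The `Zone` class and both function names are hypothetical.

```python
from typing import NamedTuple


class Zone(NamedTuple):
    left: int
    top: int
    width: int
    height: int
    zone_type: str


def zone_filename(image_filename: str) -> str:
    """Map an image file name to its zone file name (3A images share the 3BZ file)."""
    stem, suffix = image_filename.rsplit(".", 1)
    if suffix == "3A":
        suffix = "3B"
    return f"{stem}.{suffix}Z"


def parse_zone_file(text: str) -> list[Zone]:
    """Parse zone file contents, one zone per non-blank line."""
    zones = []
    for line in text.splitlines():
        fields = line.split()  # assumption: whitespace-separated columns
        if not fields:
            continue
        if len(fields) != 5:
            raise ValueError(f"expected 5 columns, got {len(fields)}: {line!r}")
        left, top, width, height = (int(value) for value in fields[:4])
        zones.append(Zone(left, top, width, height, fields[4]))
    return zones
```

If the real files use a different separator or non-integer values, only the `split` and `int` steps would need to change.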

For each zone on each page, there is a corresponding ground-truth text file, DDDD_PPP.Z01, DDDD_PPP.Z02, etc.

Finally, each sample has a PAGES file listing all page ids in the sample; files PAGES_1 through PAGES_5, which define the Page Quality Groups; and a ZTYPES file listing all zone types used. For the S sample, there are also PAGES.ARG, PAGES.MEX, and PAGES.SPA, which indicate whether the source document originated in Argentina, Mexico, or Spain.
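
The sketch below shows one way these index files might be used from Python. It assumes that each PAGES file holds one page id per line and that ground-truth zone files are numbered consecutively from Z01; neither detail is stated above, and all function names are hypothetical.

```python
def read_page_ids(pages_text: str) -> list[str]:
    """Page ids from the contents of a PAGES-style file (assumed one id per line)."""
    return [line.strip() for line in pages_text.splitlines() if line.strip()]


def quality_group_by_page(group_texts: dict[int, str]) -> dict[str, int]:
    """Map each page id to its Page Quality Group.

    group_texts maps a group number (1 to 5) to the contents of PAGES_<n>.
    """
    groups = {}
    for group, text in group_texts.items():
        for page_id in read_page_ids(text):
            groups[page_id] = group
    return groups


def ground_truth_filenames(page_id: str, zone_count: int) -> list[str]:
    """Names DDDD_PPP.Z01, DDDD_PPP.Z02, ... for the given number of zones."""
    return [f"{page_id}.Z{index:02d}" for index in range(1, zone_count + 1)]
```

For example, `ground_truth_filenames("1234_001", 3)` yields the three names ending in `.Z01`, `.Z02`, and `.Z03` for a hypothetical page with three zones.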

Download

Because of their size, greyscale images have been pulled out into separate packages.

Bitonal images, zone files, ground-truth packages:

2b.tgz (20MB)
3b.tgz (133MB)
Bb.tgz (43MB)
Lb.tgz (60MB)
Mb.tgz (117MB)
Nb.tgz (62MB)
Rb.tgz (85MB)
Sb.tgz (46MB)
Zb.tgz (178MB)

Greyscale image packages:

3g-part1.tgz (1.8GB)
3g-part2.tgz (1.8GB)
Bg.tgz (684MB)
Lg-part1.tgz (738MB)
Lg-part2.tgz (749MB)
Mg-part1.tgz (696MB)
Mg-part2.tgz (698MB)
Ng.tgz (673MB)
Rg-part1.tgz (752MB)
Rg-part2.tgz (712MB)
Sg.tgz (539MB)
Zg-part1.tgz (993MB)
Zg-part2.tgz (1003MB)

Missing greyscale zones (if you downloaded Ng or Mg before 17 Feb 2005):

Mg-zones.tgz (11KB)
Ng-zones.tgz (18KB)

ISRI Annual Tests of OCR Accuracy

From 1992 to 1996, ISRI conducted five annual tests of the performance of leading OCR systems. The tests compared products in part to stimulate competition and advance the technology, but the overall goal was to establish the performance range of contemporary systems and to identify where improvements were most needed. All of the annual tests are provided here.

Download

OCR Frontiers Toolkit

The Frontiers Toolkit is a companion to the book Optical Character Recognition: An Illustrated Guide to the Frontier, by Rice, Nagy, and Nartker (RiceEtal99). It includes all 280 page images used in the book, manually determined zone information, ground-truth text, and OCR output from three leading OCR systems.

  ftk-1.0
    bin                     precompiled binary versions of the analytic tools
      Irix
      Linux
      Solaris
      Win32
    data
      2.1-5.7               corresponding to section numbers from the Frontiers book
        *_3.tif             sample pages: bitonal, 300 dpi, CCITT Group 4 compressed
        *_3.sdf             bounding box for the snippet used in the book
        *_3.zon             bounding boxes for all (manually determined) zones
        *.txt               ground-truth text for all zones
        *.z01, *.z02,...    ground-truth text for individual zones
    man
      man1                  Unix-style man pages for analytic tools
    samples                 sample OCR output for 3 leading OCR systems
      ocr1
      ocr2
      ocr3
    ftk.pdf                 main documentation for the analytic tools
    README.txt

Download

Tesseract OCR

An ideal complement to the OCRtk is the open-source Tesseract OCR engine. Tesseract OCR was developed at HP Labs until 1995 and was released to the open-source community in 2005. It ranked very highly in The Fourth Annual Test of OCR Accuracy.

Mailing Lists

We currently have an announcements-only mailing list, ocrtk-announce@isri.unlv.edu. If you wish to subscribe, send a message to ocrtk-announce-request@isri.unlv.edu (note: this address is read by a person, not an automated list manager).

References

[KanaiEtal95]
Junichi Kanai, Stephen V. Rice, Thomas A. Nartker, and George Nagy. Automated Evaluation of OCR Zoning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(1):86-90, January 1995.

[NagyEtal00]
George Nagy, Thomas A. Nartker, and Stephen V. Rice. Optical Character Recognition: An illustrated guide to the frontier. In Proc. IS&T/SPIE 2000 Intl. Symp. on Electronic Imaging Science and Technology, volume 3967, pages 58-69, San Jose, CA, January 2000. (invited).

[NartkerEtal05]
Thomas A. Nartker, Stephen V. Rice, and Steven E. Lumos. Software Tools and Test Data for Research and Testing of Page-Reading OCR Systems. In Proc. IS&T/SPIE 2005 Intl. Symp. on Electronic Imaging Science and Technology, January 2005.

[Rice96]
Stephen V. Rice. Measuring the Accuracy of Page-Reading Systems. PhD thesis, University of Nevada, Las Vegas, 1996.

[RiceEtal94]
Stephen V. Rice, Junichi Kanai, and Thomas A. Nartker. An algorithm for matching OCR-generated text strings. International Journal of Pattern Recognition and Artificial Intelligence, 8(5):1259-1268, 1994.

[RiceEtal99]
Stephen V. Rice, George Nagy, and Thomas A. Nartker. Optical Character Recognition: An Illustrated Guide to the Frontier. Kluwer Academic Publishers, April 1999.

Contact

Information / questions / help: support@isri.unlv.edu

 

r1.62 - 21 Jun 2007 - 14:04 - ISRI.StevenLumos