
Does anyone have a suggestion for digitizing + OCR'ing a printed corpus with images?

I have 1,200 pages of text sprinkled with essential photographs. Assuming I have perfect scans of the pages, what are my options for preserving the layout of the original text and allowing me to feed this to a program?

yannis
wnewport

1 Answer


DjVu (e.g. http://djvu.sf.net) is a fairly standard format for scans. It provides a "backstore" (a hidden text layer) for the plain text produced by OCR, including formatting, at least when the file is generated with the commercial tools.

PDF can work the same way: the page shows the scanned image but is backed by formatted text, so copy and paste works. The ABBYY set of OCR applications can create such PDFs.

PDF is much more common, but it usually requires more space than DjVu for the same data at the same quality.
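
To give a rough idea of what such a text layer offers a program, here is a minimal Python sketch. It assumes the OCR text has already been extracted from the DjVu or PDF file as words with page coordinates. `Word`, `words_to_lines`, and `line_tolerance` are hypothetical names made up for this illustration and are not part of any DjVu, PDF, or ABBYY API. The sketch only shows how positioned words can be turned back into lines in reading order.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One OCR'd word with its position on the page (hypothetical structure)."""
    text: str
    x: float  # left edge, in page units
    y: float  # top edge, in page units (0 = top of the page)


def words_to_lines(words, line_tolerance=5.0):
    """Group positioned words into text lines, top to bottom, left to right.

    Words whose y coordinate is within `line_tolerance` of the first word
    of the current line are treated as belonging to that line. The default
    tolerance is an arbitrary assumption and depends on scan resolution.
    """
    lines = []
    for word in sorted(words, key=lambda w: (w.y, w.x)):
        if lines and abs(lines[-1][0].y - word.y) <= line_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Within each line, order the words by horizontal position.
    return [
        " ".join(w.text for w in sorted(line, key=lambda w: w.x))
        for line in lines
    ]
```

This handles a single column only. Pages where photographs interrupt the text, or multi-column pages, would need the image regions and column boundaries taken into account as well.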

  • Thanks, I've used DjVu before on the user side. At a glance, it seems there are more/better tools for PDF and Python. – wnewport May 03 '11 at 05:59