Linux, OCR and PDF – problem solved

Imagine you've scanned a book into a PDF file on Linux, with two book-pages on every pdf-page, a lot of extra white-space, and perhaps the wrong page orientation. Worst of all, there is no full-text search, and thus no full-text indexing for desktop search engines. I consider this problem solved on Linux!

[UPDATE 2013-03-29] Three years later, I wrote up a better solution to OCR scans on Linux. The explanations and comments from this thread might still be helpful.

In general, the first step is to convert the PDF (or whatever format you scanned to, such as TIFF, PNG or DjVu) into the hOCR format: the extracted text in an XHTML file with layout annotations. A second program then converts the hOCR file back into a searchable PDF.
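To make the intermediate format less abstract: an hOCR file is ordinary XHTML in which each recognized element carries its bounding box in a title attribute. The following is only a rough sketch that strips the markup to recover the plain text. The sample line and the function name hocr_to_text are made up for illustration, and a real file would be better handled with a proper XML parser.

```shell
# hocr_to_text: crude tag stripper (hypothetical helper, not part of any OCR tool).
# Reads hOCR/XHTML on stdin, writes the remaining text on stdout.
hocr_to_text() {
    sed -e 's/<[^>]*>//g' \
        -e 's/^[[:space:]]*//' \
        -e 's/[[:space:]]*$//' | grep -v '^$'
}

# Illustrative input only; real hOCR files contain many such spans.
sample='<span class="ocr_line" title="bbox 120 80 900 130">Hello <b>world</b></span>'
printf '%s\n' "$sample" | hocr_to_text
```

The bbox values are what a tool like hocr2pdf needs in order to place the invisible text behind the right spot of the page image.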

There are several ways to solve this problem (you can also download my solution directly, if you wish). To extract the text from a scan, you need OCR software such as gocr, ocrad, tesseract or cuneiform. I got the best results with tesseract and the worst with gocr, but the most convenient way to produce hOCR files was Cuneiform. Cuneiform is Russian software that was once among the best proprietary OCR programs in the world; it has since been ported to Linux.
In the future (maybe in two years), the OCRopus project will have a nice UI, which may make it another good way to do OCR on Linux.

To convert a hOCR file into a searchable, indexable PDF, I only know of hocr2pdf from the ExactImage package.

I got the best results when each pdf-page is cropped and split in two, so that the OCR program receives PNG files that each contain exactly one book-page and nothing else (graphics are OK). This requires some batch processing.
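The arithmetic behind the split can be sketched as follows, assuming the cropped pdf-page has a known width and height in pixels (at 300dpi) and the two book-pages sit side by side. The function name split_geometry is hypothetical; the output uses ImageMagick's WIDTHxHEIGHT+X+Y geometry syntax.

```shell
# split_geometry WIDTH HEIGHT: print the two crop geometries for the
# left and the right book-page of an already cropped pdf-page.
split_geometry() (
    width=$1
    height=$2
    half=$((width / 2))
    echo "${half}x${height}+0+0"                   # left book-page
    echo "$((width - half))x${height}+${half}+0"   # right book-page
)

split_geometry 2500 2000
```

Each printed geometry could then be passed to something like convert page.png -crop GEOMETRY +repage half.png, where +repage drops the canvas offset that ImageMagick otherwise keeps after a crop.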

To get fast results without much work, I wrote a shell script that calls the pdf-to-image converters, the OCR software and hocr2pdf in the right sequence with the right command-line options. The script is neither perfect nor beautiful, but you may be able to use it as a model for your own script.

What the script does:

- splits one pdf into many (one pdf-file per pdf-page) via pdftk
- converts each pdf-page into a monochrome image with 300dpi via ImageMagick & ghostscript
- converts each pdf-page into two images, one per book-page (after rotating & cropping the pdf-page appropriately) via ImageMagick
- OCRs each book-page via Cuneiform
- converts each book-page into PDF format via ExactImage
- merges all book-pages into one PDF file via pdfjam (& LaTeX)
- writes metadata (optionally) via pdftk

So the dependencies are convert (ImageMagick), ghostscript, pdftk, pdfjam, hocr2pdf (ExactImage) and cuneiform. In Ubuntu, you can run sudo apt-get install imagemagick ghostscript pdftk pdfjam exactimage to get most of the dependencies. Cuneiform, however, must be installed by hand (grab the .tar.bz2 file from their launchpad website and read the readme.txt installation instructions; you may need to run sudo apt-get install cmake first).
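Before running anything, it can help to check which of these tools are actually on the PATH. A minimal sketch, assuming the usual executable names (gs for Ghostscript, pdfjoin from the pdfjam package); the function name missing_tools is made up:

```shell
# missing_tools NAME...: print every command from the list that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Prints nothing if everything is installed.
missing_tools convert gs pdftk pdfjoin hocr2pdf cuneiform
```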

The dependency on ImageMagick could be dropped, because ExactImage provides equivalent tools (and is faster).

There are three minor issues to discuss:

- After installing Cuneiform, it will complain about a missing library "libpuma.so" even if it's there. Solution: sudo ldconfig
- If you're looking for the program hocr2pdf, it's in the Debian package "exactimage".
- You will get many warning messages because of malformed PDFs. This is not really a problem and will be fixed in future versions of pdftk.

You can either download the script to make PDFs searchable and indexable under Linux here or copy&paste the code from below:
#!/bin/bash
# usage: pdfocr.sh document.pdf orientation split left top right bottom lang author title
#   orientation: one of 0,1,2,3, meaning the amount of rotation by 90°
#   split: either 0 (already single-paged) or 1 (2 book-pages per pdf-page)
#   left top right bottom: the coordinates to crop (after rotation!)
#   lang: a language as in "cuneiform -l"
#   author, title: used for the PDF metadata
# all values relative to a resolution of 300dpi
#
# usage example:
#   ./pdfocr.sh SomeFile.pdf 0 0 0 0 2500 2000 ger SomeAuthor SomeTitle
# will process a PDF with one page per pdf-page, cropping to width 2500 and height 2000

if [ $# -lt 8 ]; then
    echo "usage: pdfocr.sh document.pdf orientation split left top right bottom lang author title" >&2
    exit 1
fi

# one pdf-file per pdf-page (pg_0001.pdf, ...); also writes doc_data.txt
pdftk "$1" burst dont_ask

for f in pg_*.pdf
do
    echo "pre-processing $f ..."
    # render at 300dpi, rotate, reduce to monochrome
    convert -quiet -density 300 "$f" -rotate $((90*$2)) -monochrome -normalize "$f.png"
    # crop; +repage drops the canvas offset so that the split below starts at 0,0
    convert -quiet "$f.png" -crop "${6}x${7}+${4}+${5}" +repage "$f.png"
    if [ "1" = "$3" ]; then
        convert -quiet "$f.png" -crop "$(($6/2))x${7}+0+0" +repage "$f.1.png"
        convert -quiet "$f.png" -crop "0x${7}+$(($6/2))+0" +repage "$f.2.png"
        rm -f "$f.png"
    else
        echo "no splitting"
    fi
    rm -f "$f"
done

for f in pg_*.png
do
    echo "processing $f ..."
    convert "$f" "$f.bmp"
    cuneiform -l "$8" -f hocr -o "$f.hocr" "$f.bmp"
    # slightly blurred copy as the visible page image
    convert "$f" -blur 0.4 "$f.bmp"
    hocr2pdf -i "$f.bmp" -s -o "$f.pdf" < "$f.hocr"
    rm -f "$f" "$f.bmp" "$f.hocr"
done

# metadata for pdftk; ${10} needs the braces, $10 would be $1 followed by "0"
echo "InfoKey: Author" > in.info
echo "InfoValue: $9" >> in.info
echo "InfoKey: Title" >> in.info
echo "InfoValue: ${10}" >> in.info
echo "InfoKey: Creator" >> in.info
echo "InfoValue: PDF OCR scan script" >> in.info

# the glob stays unquoted so that the shell expands it; --fitpaper takes a value
pdfjoin --fitpaper true --tidy --outfile "$1.ocr1.pdf" pg_*.png.pdf
rm -f pg_*.png.pdf

pdftk "$1.ocr1.pdf" update_info doc_data.txt output "$1.ocr2.pdf"
pdftk "$1.ocr2.pdf" update_info in.info output "$1-ocr.pdf"

rm -f "$1.ocr1.pdf" "$1.ocr2.pdf" doc_data.txt in.info
rm -rf pg_*_files
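One shell detail matters for the title argument, which is the tenth parameter: positional parameters above nine need braces, because $10 is read as $1 followed by a literal 0. A small demonstration (show_title is a throwaway name, and the arguments mirror the usage example):

```shell
# show_title: print the tenth positional parameter in both spellings.
show_title() {
    echo "unbraced: $10"    # parsed as ${1} followed by the character 0
    echo "braced: ${10}"    # the actual tenth argument
}

show_title SomeFile.pdf 0 0 0 0 2500 2000 ger SomeAuthor SomeTitle
```

The first line shows the file name with a 0 appended, the second the title.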

If you think something in this code is wrong, poor or ugly, send me an email with corrections.

Happy scanning & searching in PDFs!


2010-01-19

Comments

Very nice and works very well. Would you also know of a way to get the result into a Word/OpenOffice document?
A single script that could do either or both would be interesting, so that the “original” PDF could be kept, or so that portions of the document that were NOT recognized could be corrected.

Have you tried skipping the "hocr2pdf" step and looking at the .hocr files in a web browser instead? These are actually HTML files, so you could try the HTML-import feature of your Office application.

But I haven't tried this yet and I guess the result won't be very usable. For the purpose of editing text, I would use a simpler approach, not using hOCR but directly converting the pdf-files to pure text files with Tesseract. Scripts to do this can be found elsewhere.

I have spent the last few hours doing some heavy scanning and OCRing and I found your page. First of all, thanks a lot - I didn't use your script, but I semi-manually performed the same steps, and it was a big help.

Secondly, I wanted to make an observation. I might have side-stepped this problem if I'd used your script, but I am not sure, and having finally figured it out I had to share this somewhere. :-) If your pages are monochrome, it's important that they are in a 1bpp file format when submitted to hocr2pdf - it's not smart enough (and I can't really blame it) to notice this by itself. I suspect your script works fine - I hit this because my inputs were PDFs produced by gscan2pdf, so I burst them into individual files for cleanup and OCR using pdftoppm, which produced 24bpp output. As a result my PDF file ballooned from 15MB before bursting-and-OCRing to 114MB when reassembled. Converting the image files to PBM before giving them to hocr2pdf reduced my final PDF size to 14MB. (I assume I saved some extra space as a result of cleaning up noise on the bitmaps after bursting.)
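For anyone wanting to try the 1bpp route, here is a hedged sketch. The comment above does not say which tool did the conversion, so using ImageMagick's convert with a fixed 50% threshold is an assumption, and both function names are made up. is_pbm only inspects the file's magic number, so it can be used to check what actually gets handed to hocr2pdf.

```shell
# to_pbm IN OUT: write a bilevel (1 bit per pixel) PBM copy of an image.
# Assumes ImageMagick; the 50% threshold is a guess that suits clean scans.
to_pbm() {
    convert "$1" -threshold 50% "pbm:$2"
}

# is_pbm FILE: succeed if FILE starts with a PBM magic number
# (P1 = plain/ASCII PBM, P4 = raw/binary PBM).
is_pbm() {
    case "$(head -c 2 "$1")" in
        P1|P4) return 0 ;;
        *)     return 1 ;;
    esac
}
```

Usage would be along the lines of to_pbm page.png page.pbm && is_pbm page.pbm before calling hocr2pdf -i page.pbm.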

Thanks for an excellent and detailed solution to this problem.

The next problem is a command-line search facility. I have a few hundred OCR-ed PDF files that I would like to search with regex strings, and have some program return a list of filename and page # combinations where the search string was found.

Does such a program exist?
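One way to approximate this with standard tools: pdftotext (from poppler-utils) separates pages with a form feed character, so awk can treat every page as one record and report the record number. A minimal sketch; the function names grep_pages and search_pdfs are made up, and the regex is matched against the whole page, so ^ and $ anchor to the page rather than to a line.

```shell
# grep_pages REGEX: read form-feed-separated page text on stdin and print
# the number of every page that matches the extended regular expression.
grep_pages() {
    PAGE_RE=$1 awk 'BEGIN { RS = "\f" } $0 ~ ENVIRON["PAGE_RE"] { print NR }'
}

# search_pdfs REGEX FILE...: print FILE:PAGE for each matching page.
# Needs pdftotext; "-" sends the extracted text to stdout.
search_pdfs() {
    re=$1
    shift
    for pdf in "$@"; do
        pdftotext "$pdf" - | grep_pages "$re" | while read -r page; do
            printf '%s:%s\n' "$pdf" "$page"
        done
    done
}
```

A call might look like search_pdfs 'OCR|hOCR' *.pdf.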

This is fantastic, thank you!

I've been waiting for years to be able to do something like this without making files immense (there is a program on the Mac that I have used, but it bloats the files 5-10x and isn't very accurate).

I adapted this script to use Tesseract 3.00 for English text on Ubuntu 10.04.1: http://wwww.ubuntuforums.org/showpost.php?p=10327088&postcount=5

The accuracy is pretty good, too, at least in English.

Hi there,

I have a general problem with the ocr-ing step.
I have a perfectly readable page here that stubbornly resists every attempt at OCR on either Linux or Mac OS X, using your proposal (cuneiform), but also tesseract and ocroscript.
I tried increasing the resolution up to 1200 in the convert-to-png step, but to no avail.
The OCR step either fails completely (Mac OS X, segmentation fault in cuneiform) or else only produces gibberish.
Adobe Acrobat Professional OCRs this file perfectly.
Maybe you have an idea how to tweak some of your parameters to get this thing OCRed? You can find the file here:
files.me.com/bjrnfrdnnd2/3sg35f

I looked at your file.

With 1200 dpi, a cuneiform bug prevents OCRing, because the file is too large. See https://bugs.launchpad.net/cuneiform-linux/+bug/349110

Your problem seems to be related to font rendering issues. Using the -monochrome switch in convert produces an image with bad fonts which are unlikely to be OCRed properly. I think it has nothing to do with the rendering resolution (however, I guess 600dpi is fine for your data).

Using GIMP, I converted the image to a high-contrast monochrome image which worked well in cuneiform (at 300 dpi). The OCR quality was not perfect, but tuning the parameters you'll get much better results.

Thanks for your message.
I tried also 300, 600 and 1200 dpi using your script, but none worked. Your solution seems logical (using high contrast), but using gimp means that you leave the world of batch processing (and you wouldn't want to start gimp for every one of the 100 pages that you want to scan, would you?).
So while gimp certainly is a workaround for one page, do you have an idea which of the parameters in the convert step in your script to tune in order to get cuneiform to work on the file?

Gimp can be used for batch-processing and convert has lots of options which might help as well. Just take a look at their manpages ("man convert") or google for more documentation.

I actually already tried that: using edge detection, threshold and sharpening with convert. I never succeeded in producing anything that cuneiform would be able to OCR. I was even unable to find any suitable gimp operation that would produce an image that cuneiform would be able to OCR.
So if you still remember what your successful gimp operation was, please tell me.

I get
" pdfjam ERROR: pg_*.png.pdf not found"

Any idea?

@Drew
It seems the hocr2pdf step failed. Can you check manually which step fails? Maybe I should add that you should execute this script in a shell, in the folder where your file is.
Anyway, I have no idea what's wrong - try debugging ;-)

@Drew
If you edit the shell script at line 54, removing the quotation marks around "pg_*.png.pdf", that is, change "pg_*.png.pdf" to pg_*.png.pdf the filename expansion should work properly. At least it worked for me, I had the same problem. But, after that, I'm getting the following error on the pdfjoin call :
" pdfjam: Effective call for this run of pdfjam:
/usr/bin/pdfjam --fitpaper 'true' --rotateoversize 'true' --suffix joined --fitpaper '--no-tidy' --outfile 0010775_Manual_Eletronico_H61H2_M2_RevC.pdf.ocr1.pdf -- pg_0001.pdf.png.pdf - pg_0002.pdf.png.pdf - pg_0003.pdf.png.pdf - pg_0004.pdf.png.pdf - pg_0005.pdf.png.pdf - pg_0006.pdf.png.pdf - pg_0007.pdf.png.pdf - pg_0008.pdf.png.pdf - pg_0009.pdf.png.pdf - pg_0010.pdf.png.pdf - pg_0011.pdf.png.pdf - pg_0012.pdf.png.pdf - pg_0013.pdf.png.pdf - pg_0014.pdf.png.pdf - pg_0015.pdf.png.pdf - pg_0016.pdf.png.pdf - pg_0017.pdf.png.pdf - pg_0018.pdf.png.pdf - pg_0019.pdf.png.pdf - pg_0020.pdf.png.pdf - pg_0021.pdf.png.pdf - pg_0022.pdf.png.pdf - pg_0023.pdf.png.pdf - pg_0024.pdf.png.pdf - pg_0025.pdf.png.pdf - pg_0026.pdf.png.pdf - pg_0027.pdf.png.pdf - pg_0028.pdf.png.pdf -
pdfjam: Calling pdflatex...
pdfjam: FAILED.
The call to 'pdflatex' resulted in an error.
If '--no-tidy' was used, you can examine the
log file at
/var/tmp/pdfjam-sVj7OI/a.log
to try to diagnose the problem.
pdfjam ERROR: Output file not written
"

So, what does the logfile tell?

Beats me. There is no /var/tmp/pdfjam-sVj7OI/a.log file, unfortunately.

I had the same missing log file problem. I substituted the pdfjoin line with

pdfjam --outfile $1.ocr1.pdf --a4paper pg_*.png.pdf

which made the script finish successfully.

Unfortunately, I still get no satisfying results. The resulting PDF is virtually not searchable. When running pdftotext on it, most of the text is missing and the recognized text is of poor quality.

But this is not due to bad original scans! If I directly do the OCR with cuneiform on JPGs, the plain text output is superb. My problem is the conversion from recognized text to a searchable PDF.

Any ideas to that?

First, I have my own version of the script. Here, right, left, bottom and top are the number of pixels to crop from the image, and middle is how much to remove from the middle when splitting.
http://pastebin.com/pEx1zmCn

But I have some problems with hocr2pdf. I have checked the hocr files and they contain almost all the text. But after hocr2pdf has run, the resulting pdf contains garbage. The lines are cropped and some text is too large.
http://peecee.dk/upload/view/313549

Solved my problem. Apparently hocr2pdf works better with tiff.
So I edited my earlier version of the script. In a small benchmark with four two-sided pages it was 3.5x faster. Furthermore, it "handles" errors where cuneiform fails to read any text and crashes: it just uses the page from the original pdf.
http://pastebin.com/6ag39WnW

New version, more use of ghostscript.
http://pastebin.com/FKz6LRs7

I am looking for a program that adjusts (rotates etc.) scanned text pages automatically, i.e. analyses the scan for the best rotation angle and the best trapezoidal correction for each scanned page. Such a solution should exist, as Google's scans of old books are always corrected very well for these problems.

I'm hopeful that this feature will be built into OCRopus directly and better:
http://code.google.com/p/ocropus/issues/detail?id=146&q=searchable%20pdf#makechanges
Adobe Acrobat (and our printers at work) can take a PDF scan and 'hide' OCR text behind the words, so that it looks like the scan, but lets you search or select text from the OCR layer. This is what I'm really wanting.

I haven't tried it yet, but my reading of your script is that it will only show the OCR layer in the resulting PDF, rather than the scan with the OCR hidden behind. Is that correct?

The script does the same, i.e. "hiding" the OCRed text behind an image of the scanned page. I hope OCRopus will supersede something like my script soon :-)

Hope this is of use to someone - I make extensive use of creating PDFs from scanned OCRed images, and though I would rather use open source, I simply haven't found any of the options to perform well enough. There are however two solutions I'd recommend that work under Linux (with Wine or Crossover), even if they're not open source. The first is PDF XChange Viewer from Tracker (portable and installable versions), which works well under WINE. It does occasionally crash if you move too quickly through a large PDF for a few hundred pages by pressing page down, but otherwise seems rock solid. Its latest version has OCR (hidden but aligned text layer under image) built in, and remains free (beer). The other is very much not free (beer) but it is really really effective, that is ReadIris (I have the corporate edition), which, although they don't support it for Linux in any way, runs perfectly well under Crossover (I've OCRed tens of thousands of pages with no crashes) and produces very compact PDFs.

I keep watching the open source options - when they get close I'll be thrilled to switch, but they haven't got there yet for me.

One other extremely useful tool, potentially, is Infix PDF Editor (also works in Wine) - its killer feature for me is that it will happily edit even PDFs created with Adobe's Clearscan function, something Acrobat itself can't do, so it's very handy for correcting OCR output.

P.S. whatever OCR software you're using, results can be much improved if you have the time to run the original page images through ScanTailor (open source/multi platform) - it's voodoo magic freakily good software! (P.P.S. I have no connection with any of the above, it's just the workflow I've arrived at through several years of experimentation)

Hello,
when I use your script, I obtain this :

pdfjam: This is pdfjam version 2.05.
pdfjam: Reading any site-wide or user-specific defaults...
(none found)
pdfjam ERROR: pg_*.png.pdf not found
Error: Failed to open PDF file:
2010_01_00.pdf.ocr1.pdf
Errors encountered. No output created.
Done. Input errors, so no output created.
Error: Failed to open PDF file:
2010_01_00.pdf.ocr2.pdf
Errors encountered. No output created.
Done. Input errors, so no output created.

Why ?

Thanks.
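A note on the "pg_*.png.pdf not found" message: a pattern that sits inside quotation marks, like "pg_*.png.pdf", is passed on literally instead of being expanded by the shell, as the earlier comment about removing the quotation marks points out. The same message can also appear when no such files exist because an earlier step (for example hocr2pdf) failed. The difference between the quoted and the unquoted form can be seen with this small self-contained sketch (count_args and demo_glob are made-up names, and the files are empty dummies):

```shell
# count_args ARG...: print how many arguments were received.
count_args() { echo "$#"; }

# demo_glob: create two dummy files and count the arguments a command
# would receive for the quoted and for the unquoted pattern.
demo_glob() {
    dir=$(mktemp -d) || return 1
    touch "$dir/pg_0001.pdf.png.pdf" "$dir/pg_0002.pdf.png.pdf"
    (
        cd "$dir" || exit 1
        count_args "pg_*.png.pdf"   # quoted: one literal argument
        count_args pg_*.png.pdf     # unquoted: one argument per file
    )
    rm -rf "$dir"
}

demo_glob
```

If the unquoted form still fails, listing the working directory after the OCR loop shows whether any pg_*.png.pdf files were produced at all.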

I need to extract text from pdf files.
Isn't there anything recognizable as text in the pdf encoding?
The browser (any) recognizes the text portions of the pdf screen display. How does it do that?
I just think that an OCR approach is not very reliable.
I need to do the extractions in an automated fashion - i.e. via a php script with no manual intervention.

If I load a pdf file with a browser (like Firefox) and do a ctl-A, ctl-C, and a ctl-V to a notepad or wordpad screen - I end up with the document converted to readable text. Isn't there some way to automate that process?
I tried and I just don't understand all this stuff about loading and executing OCR software programmatically.

I have looked through this and I'm afraid it's too complex and involves several elements I'm not familiar with. How can I find someone who can help me for a fee?
I just need a function callable from php to extract text from simple pdf files.

Dear Ron Z, if you can extract text from a PDF with your browser as you described, then the text is already "in the PDF", so no OCR is necessary. There are tools to get the text from a PDF in this case, like "pdftotext" or similar. I don't know whether something like that is available in PHP, but surely it is for Unix/Linux. For real OCR, maybe there is some pay-for webservice that can help. For example, the new Google Drive (previously Google Docs) can do OCR and they seem to have open APIs. Good luck!
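Whether a PDF already carries text can be tested from a script. A minimal sketch, assuming pdftotext from poppler-utils is installed; the function names are made up, and a scripting language such as PHP could call the same command through its usual shell-execution facilities.

```shell
# has_text: succeed if stdin contains at least one non-whitespace character.
has_text() {
    grep -q '[^[:space:]]'
}

# pdf_has_text_layer FILE: succeed if pdftotext extracts any text from FILE.
pdf_has_text_layer() {
    pdftotext "$1" - 2>/dev/null | has_text
}
```

For example, pdf_has_text_layer scan.pdf || echo "needs OCR" separates pure image scans from PDFs whose text can be extracted directly.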

Sounds good but so many tools to use!

For the last 3 steps, you can use a PDF software called PDF Studio to import image files as a PDF and add document metadata. It would be nice if PDF Studio had some OCR integration with Cuneiform.

i made a pdf2text with ocr based on what the article taught me in case someone wants it ready:

https://github.com/cirosantilli/bash/blob/master/bin/pdfocr2txt.sh

pdf2txt with ocr based on your article in case someone wants one ready:

https://github.com/cirosantilli/bash/blob/2c7cfd1fb77e8fafab66c229067d30994d16b3f9/pdfocr2txt.sh

Thank you very much for this great job!

I'd like to point out that Cuneiform is now also available from the Ubuntu repos. So,

$ sudo apt-get install cuneiform

... must work.

Regards!

Very Nice!
How useful is it for mathematics pdfs (with a lot of math formulas in the pdf)?

Thanks

Formulas don't parse correctly (most of the time). This is a highly non-trivial problem, much harder than just Latin character OCR.

Hi Konrad,

First of all, I owe you a huge thanks. Your detailed posts about OCR in Linux saved me some precious time.

As your script wasn't working on my machine, due to a bug on pdfjam when it calls pdflatex, I decided to make my own simplified version. You can find it at: https://gist.github.com/dllud/8892741

Main improvements/differences:
- Dropped the use of pdfjam. Instead, I use pdftk to merge PDFs, which besides being bug free, is faster and does not recompress PDFs (as nate suggested).
- Added support for doing OCR with tesseract besides cuneiform. As you explain on "Linux, OCR and PDF: Scan to PDF/A", tesseract gives the best results (also true for me).
- Removed the option for cropping the PDF pages. Besides being confusing when one first approaches the script (it took me some time to check the size of my PDF pages in pixels), I found little use for it. Most unOCRed PDFs I get need no cropping. Now the division of pages in half, when split is set, is done using relative (%) sizes.
- Removed the option to rotate the pages. Again, found little use for it.

I hope it is useful for someone.

Regards!

I had to add a "+matte -compress none" here:

...
echo "processing $f ..."
convert +matte -compress none "$f" "$f.bmp"

Because I got this error:

X.pdf.png.bmp is a compressed BMP. Only uncompressed BMP files are supported

Now that tesseract can directly export pdf files, some of this script can easily be refactored out. I hope you update the article with this information.

I don't have time right now ... but why don't you try to do this? I'll be happy linking to your blog then.
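For reference, a rough sketch of what the tesseract-only route can look like. Recent tesseract versions accept "pdf" as an output config and write OUTPUTBASE.pdf with the text layer behind the page image. The helper names run and ocr_page and the DRY_RUN switch are made up for this sketch, and the file name and language code are placeholders.

```shell
# run CMD...: execute CMD, or only print it when DRY_RUN is set.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# ocr_page IMAGE LANG: OCR one page image into a searchable PDF that is
# named like the image, with the extension replaced by .pdf.
ocr_page() {
    run tesseract "$1" "${1%.*}" -l "$2" pdf
}

# Show the command for one page without running tesseract.
DRY_RUN=1 ocr_page pg_0001.png eng
```

The per-page PDFs could then be merged with pdftk pg_*.pdf cat output result.pdf, which replaces the cuneiform, hocr2pdf and pdfjoin steps of the original script.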