Tuesday, January 19th, 2010 | Author: Konrad Voelkel
Imagine you've scanned a book into a PDF file on Linux, such that every pdf-page contains two book-pages, there is a lot of additional white-space, and maybe the page orientation is wrong. And, worst of all, there is no full-text search, hence no full-text indexing for desktop search engines. I consider this problem solved on Linux!
[UPDATE 2013-03-29] Three years later, I wrote up a better solution to OCR scans on Linux. The explanations and comments from this thread might still be helpful.
In general, the first thing you have to do is convert the PDF (or, basically, any file format you've scanned to, like TIFF or PNG or DjVu ...) into the hOCR format, which is the extracted text in an XHTML file with layout annotations. Then you can apply a program that converts the hOCR file back into a searchable PDF.
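To give an idea of the round-trip, here is a minimal sketch on a single page image (the file names are made up; the tools are introduced below):

cuneiform -l eng -f hocr -o page.hocr page.bmp
hocr2pdf -i page.bmp -s -o page.pdf < page.hocr

The first command OCRs the page image and writes the recognized text with its layout coordinates to page.hocr; the second overlays that text on the image and writes a searchable one-page PDF.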
There are several approaches to this problem (however, you can download my solution directly, if you wish). To extract the text from a scan, you have to use OCR software such as gocr, ocrad, tesseract or cuneiform. I achieved the best results with tesseract and the worst with gocr, but the most convenient way to produce hOCR files was Cuneiform. Cuneiform is Russian software, once among the best proprietary OCR packages in the world; it has now been ported to Linux.
In the future (maybe in two years), the OCRopus project will have a nice UI; then that may be another good way to do OCR on Linux.
I got the best results when each pdf-page is cropped and split in two, so that the files processed by the OCR program are PNG files containing exactly one book-page without additional stuff (graphics are OK). To do this, you need some batch-processing.
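For example, rotating a page image and cutting it into two halves can be done with ImageMagick like this (the file names and the 2500x2000 geometry are just placeholders; the script below derives them from its arguments):

convert -rotate 90 -density 300 page.png rotated.png
convert -crop 1250x2000+0+0 rotated.png left.png
convert -crop 0x2000+1250+0 rotated.png right.png

The crop geometry is width x height + x-offset + y-offset; a width of 0 means "up to the right edge".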
To get fast results without much work, I wrote a shell script that calls pdf-to-image converters, OCR software and hocr2pdf in the right sequence with the right command-line options. The shell script is neither perfect nor beautiful, but maybe you can use it as a model for your own script to suit your needs.
What the script does:
- splits one pdf into many (one pdf-file per pdf-page) via pdftk
- converts each pdf-page into a monochrome image at 300dpi via ImageMagick & ghostscript
- converts each pdf-page into two images for each book-page (after rotating & cropping the pdf-page appropriately) via ImageMagick
- OCRs each book-page via Cuneiform
- converts each book-page into PDF format via ExactImage
- merges all book-pages into one PDF file via pdfjam (& LaTeX)
- writes metadata (optionally) via pdftk
So the dependencies are convert (ImageMagick), ghostscript, pdftk, pdfjam, hocr2pdf (ExactImage) and cuneiform. In Ubuntu, you can run
sudo apt-get install imagemagick ghostscript pdftk pdfjam exactimage
to get most of the dependencies. Cuneiform, however, must be installed by hand (grab the .tar.bz2 file from their Launchpad website and read the installation instructions in readme.txt; you may also need sudo apt-get install cmake).
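Building Cuneiform follows the usual CMake routine, roughly like this (the version number is only an example; readme.txt has the authoritative steps):

tar xjf cuneiform-linux-1.1.0.tar.bz2
cd cuneiform-linux-1.1.0
mkdir builddir
cd builddir
cmake .. -DCMAKE_BUILD_TYPE=Release
make
sudo make install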
The dependency on ImageMagick could be dropped, because ExactImage provides the same tools (and ExactImage is even faster).
There are three minor issues to discuss:
- After installing Cuneiform, it may complain about a missing library "libpuma.so" even if it's there. Solution: make sure the directory where libpuma.so was installed (e.g. /usr/local/lib64) is known to the dynamic linker, for example by adding it to /etc/ld.so.conf and running sudo ldconfig.
- If you're looking for the program hocr2pdf, it's in the Debian package "exactimage".
- You will get many warning messages because of malformed PDFs. This is not really a problem and will be fixed in future versions of pdftk.
You can either download the script to make PDFs searchable and indexable under Linux here, or copy & paste the code below:
echo "usage: pdfocr.sh document.pdf orientation split left top right bottom lang author title"
# where orientation is one of 0,1,2,3, meaning the amount of rotation by 90°
# and split is either 0 (already single-paged) or 1 (2 book-pages per pdf-page)
# and (left top right bottom) are the coordinates to crop (after rotation!)
# and lang is a language as in "cuneiform -l".
# and author,title are used for the PDF metadata
# all values relative to a resolution of 300dpi
# usage examples:
# ./pdfocr.sh SomeFile.pdf 0 0 0 0 2500 2000 ger SomeAuthor SomeTitle
# will process a PDF with one page per pdf-page, cropping to width 2500 and height 2000
pdftk "$1" burst dont_ask
for f in pg_*.pdf
echo "pre-processing $f ..."
convert -quiet -rotate $[90*$2] -monochrome -normalize -density 300 "$f" "$f.png"
convert -quiet -crop $6x$7+$4+$5 "$f.png" "$f.png"
if [ "1" = "$3" ];
convert -quiet -crop $[$6/2]x$7+0+0 "$f.png" "$f.1.png"
convert -quiet -crop 0x$7+$[$6/2]+0 "$f.png" "$f.2.png"
rm -f "$f.png"
echo no splitting
rm -f "$f"
for f in pg_*.png
echo "processing $f ..."
convert "$f" "$f.bmp"
cuneiform -l $8 -f hocr -o "$f.hocr" "$f.bmp"
convert -blur 0.4 "$f" "$f.bmp"
hocr2pdf -i "$f.bmp" -s -o "$f.pdf" < "$f.hocr" rm -f "$f" "$f.bmp" "$f.hocr" done echo "InfoKey: Author" > in.info
echo "InfoValue: $9" >> in.info
echo "InfoKey: Title" >> in.info
echo "InfoValue: $10" >> in.info
echo "InfoKey: Creator" >> in.info
echo "InfoValue: PDF OCR scan script" >> in.info
pdfjoin --fitpaper --tidy --outfile "$1.ocr1.pdf" "pg_*.png.pdf"
rm -f pg_*.png.pdf
pdftk "$1.ocr1.pdf" update_info doc_data.txt output "$1.ocr2.pdf"
pdftk "$1.ocr2.pdf" update_info in.info output "$1-ocr.pdf"
rm -f "$1.ocr1.pdf" "$1.ocr2.pdf" doc_data.txt in.info
rm -rf pg_*_files
And if you think something in this code is wrong, bad or ugly, you can write me an email with corrections.
Happy scanning & searching in PDFs!