Linux, OCR and PDF: Scan to PDF/A

Friday, March 29th, 2013

The (by far) most visited post on this blog is from 2010, about OCRing a PDF in GNU/Linux (Optical Character Recognition), and it contains a small shell script that has been improved by others several times. After buying a new flatbed scanner, I re-investigated how to scan and OCR PDFs, how to produce DjVu files that are incredibly small, and how to get the metadata right. It turns out that what I really wanted all along was to create PDF/A-compliant documents (I just didn't know what PDF/A was before). Let me explain the details after presenting the quick solution. At the end, there is a shell script that scans directly to PDF/A.

A PDF/A file is a document that probably ends in .pdf, complies with the PDF 1.4 standard (no more, no less), has OCRed text in the background layer to allow for full-text search, has valid metadata in XMP format (yay!), and uses Mixed Raster Compression (MRC), which allows quite small documents (though DjVu is still slightly smaller in my experience). Actually, that was more or less PDF/A-1b, the basic version. There is now also PDF/A-2, where you can use better compression (JPEG2000), transparencies and layers, since it is based on PDF 1.7. The "A" in PDF/A stands for "archive-able".

Quick solution

In Debian or Ubuntu GNU/Linux, if you like graphical user interfaces,
sudo apt-get install scantailor
will bring you all you need. Under the hood, it uses a command-line tool:
sudo apt-get install unpaper
You can also get ScanTailor and unpaper from their websites.

Long Story

Step 1: produce high-quality input data

To find good scanning hardware that is Linux-compatible, just compare technical specifications and prices as usual, and once you have a shortlist of 1--5 devices you might buy, check it against the list of Linux-supported devices here.
An alternative to the usual flatbed-scanner setup is to construct something yourself, like an open-source book scanner, another open-source book scanner, or a slide-scanner made from a camera.

I use a recent Canon model (LiDE 210) that works without quirks in Ubuntu Linux 12.10. I use scanimage on the command line and the XSane GUI (though it looks a bit old-fashioned), so let me tell you about the available options in XSane. Using other scanning software on Linux most probably means using another UI for the SANE library, so the options are the same.

For OCR, the best mode is "Gray" or "Color", but not lineart. The resolution should be 300 or 600 DPI; more is usually not necessary and slows down the post-processing. If you're low on memory, high DPI values might even make the post-processing impossible. There is a green-blueish button for automatic gamma, brightness and contrast, which makes sense to use after acquiring a preview of your scan; I recommend the default enhancement values (1,0,0), since we can post-process later in the proper tools. Some post-processing tools have problems with 16-bit images, so I recommend using 8-bit (in the "Bit depth"/"Standard Options" window of XSane). For most post-processing tools, it is convenient to have the scans in TIFF or PNG format. With TIFF, you have to make sure that lossless compression is activated in the SANE configuration (see also these scanning tips).
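
For reference, a minimal scanimage call with these settings could look like the following (just a sketch; the exact option names and supported values depend on your scanner's SANE backend, so check "scanimage --help"):
scanimage --mode Gray --resolution 300 --depth 8 --format=tiff > scan.tiff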

Speckles and black borders on a document can make it really hard for OCR software, so you should try to get your scan as clean as possible. It may help to acquire a preview and crop manually.

To make sure the end result contains all available relevant metadata, I recommend putting as much information as possible into the filename right away, such as a date attached to the scanned piece (if it is a letter or a photo) and some context. This will later make it easy to move this information into the PDF, especially if you intend to scan many pieces at once.

If you want to generate PDF/A compliant PDFs, one solution is to use LaTeX, where you just insert your scan(s) as embedded images, and the metadata where it belongs. There is a tutorial for PDF/A compliant PDFs out of LaTeX, though it doesn't touch the issue of embedding scanned images or OCRed text.

Step 2 (optional): use unpaper to remove artifacts

UnPaper is a very useful piece of software for removing paper artifacts from your scans. In principle, this enables you to get printouts of your scan that look like actual re-prints, not photocopies. This is especially useful for the purpose of OCR.

The standard interface for UnPaper is the command line, but there are also GUIs available. Some of them are still at an early stage of development, like GScan2PDF; others seem to be discontinued, like OCRFeeder; so I recommend using ScanTailor (download!).

ScanTailor is designed with book scanning in mind, so it is optimized for scanning two pages at once and later splitting them into two separate pages. This was useful to me when I wanted to scan a large number of photos, four at a time, and split them afterwards in ScanTailor.

Warning: with high resolution come large files, so the post-processing that happens in ScanTailor can be slow. If you have a whole book to scan, I would recommend finding out the right parameters by hand and using the command-line UnPaper instead of ScanTailor.

UnPaper and ScanTailor take image files like TIFF or PNG and give back TIFF.
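
If you go the command-line route, a minimal unpaper invocation might look like this (a sketch with placeholder file names; I convert to PNM first since unpaper is happiest with PNM input, and you should check "unpaper --help" for the exact option names in your version):
convert scan.tif scan.pnm
unpaper --layout single scan.pnm cleaned.pnm
For book scans with two pages per sheet, the "--layout double" mode together with "--output-pages 2" is meant to split them; consult the manual for the exact output-file syntax.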

Step 3a: compress into a djvu

DjVu files are known for their incredible compression. However, the magic ingredient behind it is Mixed Raster Compression (MRC), which you can also use in PDFs. Since PDF/A is the archival standard (there is no DjVu/A) and future tools will enable MRC in PDF, DjVu will become even less important.

There is already a wonderfully detailed tutorial online on how to digitize books to DjVu, even with a section covering OCR.

As far as I know, this must be done on the command-line, since no free GUI is available.
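
As a minimal sketch of that command-line route with the DjVuLibre tools (file names are placeholders; cjb2 expects bitonal PBM input, c44 handles photos, djvm bundles the pages):
convert page1.tif -monochrome page1.pbm
cjb2 -clean page1.pbm page1.djvu
c44 photo.ppm photo.djvu
djvm -c book.djvu page1.djvu photo.djvu
The tutorial linked above covers the details, including how to add the OCR layer.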

Step 3b: compress into a pdf

To convert a bunch of TIFFs to PDF, there is tiff2pdf. You can supply some metadata on the command-line, to be included in the PDF.

Example usage:
tiff2pdf -o outputfile.pdf -z -u m -p "A4" -F inputfile.tif

The switch "-z" enables lossless compression, instead you could use "-j -q 95" for 95% quality JPEG compression. The switch "-p "A4"" specifies the paper size, which could also be "letter". The switch "-F" causes the TIFF to fit the entire PDF page, to avoid borders.

Another example:
tiff2pdf -o outputfile.pdf -z -u m -p "A4" -F -c "tiff2pdf" -a "Author Name" -t "Document Title" -s "Document Subject" -k "keyword1,keyword2,keyword3" -e 20130324103000 inputfile.tif

This line will include the given metadata (creator, author, title, subject, keywords and, via "-e", the creation date) in the resulting PDF.

Step 3c: OCR

Between post-processing the scans and compressing them into a PDF, we might want to run OCR on them. I still use tesseract/hocr2pdf for that, since the Tesseract engine tends to give me the best results, and hocr2pdf is the only solution I know of that can "hide" the recognized text in a layer behind the scanned image, to give you true full-text search without degrading the scan quality at all.

With whatever input data you have, I recommend the following:
convert -normalize -density 300 -depth 8 "inputfile.ext" "normalized-input.png"
since tesseract really works best with normalized images at density 300 and bit-depth 8, in PNG format.

Tesseract is language-sensitive. If you do
tesseract -l deu -psm 1 "normalized-input.png" "output.pdf" hocr
it will assume German text (deu = deutsch = German), while the switch "-l eng" changes that to English. There are many other languages available (see "man tesseract"), and you can build your own.

To merge back the hocr data into the PDF, you need to convert the PNG to JPEG and run hocr2pdf:
convert "normalized-input.png" "normalized-input.jpg"
hocr2pdf -i "normalized-input.jpg" -s -o "output.pdf" < "output.pdf.html"

To get the metadata right, you might want to use pdftk and its dump_data and update_info commands. Take a look at the final shell script below for this.
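
In its simplest form, that round-trip looks like this (a sketch with placeholder file names; you can edit the InfoKey/InfoValue pairs in the dumped file before writing them back):
pdftk original.pdf dump_data > metadata.txt
pdftk ocred.pdf update_info metadata.txt output final.pdf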

Step 4 (optional): validate

Standards are only good as long as you can validate them. This is possible for PDF/A with JHOVE, the JSTOR/Harvard Object Validation Environment (pronounced "jove"). Though it still has some bugs, it is the only viable free alternative to Adobe's Windows-only Preflight mode (which is still better, I admit).

After extracting the JHOVE files to some directory "jhove", you have to edit the file "jhove/conf/jhove.conf" and change the jhoveHome entry to the actual directory (ending in "/jhove").

After you got that right, run
java -jar jhove/bin/JhoveView.jar
to get the interactive program. You can change the configuration there as well. Once, I had the strange issue that I had to change the directory via the UI tool to make the CLI tool work...

If you prefer to stay on the command-line, to automate your workflow, try
java -jar jhove/bin/JhoveApp.jar -m PDF-hul "filename.pdf"
and watch out for the lines beginning with "Status" and "ErrorMessage".

You'll notice that most documents have some errors, but these don't affect reading the documents. It is actually quite hard to get a PDF/A-conforming document!

I did a little survey of my own archive of PDFs, mostly from the arXiv and mathematical journals, about 500 PDFs in total. The errors (which also occur in files that seem to be generated from a TeX source and in files from JSTOR or journal homepages) were:

  • InfoMessage: Too many fonts to report; some fonts omitted.: Total fonts = ...
  • InfoMessage: Outlines contain recursive references.
  • ErrorMessage: Improperly formed date
  • ErrorMessage: Lexical error
  • InfoMessage: File header gives version as 1.4, but catalog dictionary gives version as 1.6
  • ErrorMessage: Invalid page dictionary object
  • ErrorMessage: Invalid outline dictionary item
  • ErrorMessage: Invalid object number in cross-reference stream
  • ErrorMessage: Invalid destination object
  • ErrorMessage: Invalid Resources Entry in document
  • ErrorMessage: Malformed dictionary
  • ErrorMessage: Malformed filter
  • ErrorMessage: No PDF header
  • ErrorMessage: No PDF trailer
  • ErrorMessage: Unexpected error in findFonts: java.lang.ClassCastException: edu.harvard.hul.ois.jhove.module.pdf.PdfSimpleObject cannot be cast to edu.harvard.hul.ois.jhove.module.pdf.PdfDictionary
  • ErrorMessage: Unexpected error in findFonts: java.lang.ClassCastException: edu.harvard.hul.ois.jhove.module.pdf.PdfStream cannot be cast to edu.harvard.hul.ois.jhove.module.pdf.PdfDictionary

The last two are obviously bugs in JHOVE. The "too many fonts to report" info message appeared about 100 times. About 100 files (not the same ones, but with some overlap) out of the total 500 were invalid PDF/A. Nevertheless, all these files are perfectly readable. It is not clear whether they would be readable on other devices, like a Kindle or an Android device. I have also encountered printing errors with malformed PDFs in the past, so I recommend getting rid of these errors at least in the files you produce after scanning.
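
If you want to run such a survey on your own archive, a small shell loop along these lines will do (using the same JhoveApp.jar invocation as above):
for f in *.pdf; do
  echo "== $f"
  java -jar jhove/bin/JhoveApp.jar -m PDF-hul "$f" | egrep "Status|ErrorMessage"
done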

One Shell Script to Rule Them All

This is a script to call from the command-line, to scan and OCR directly to PDF/A.

usage:
./scan-archive.sh filename.pdf title subject keywords

example usage:
konrad@sagebird:~/Documents/scans$ ./scan-archive.sh Letter-20130324-Bankaccount-closing.pdf "Letter from the bank" finances bank,account,closing

full script (also available on pastebin):
#!/usr/bin/env bash
echo "usage: ./scan-archive.sh filename.pdf title subject keywords"
echo "scanning \"$2\" on \"$3\" about \"$4\"... ($1)"
scanimage --mode Color --depth 8 --resolution 600 --format pnm > out.pnm
echo "processing... ($1)"
scantailor-cli --color-mode=black_and_white --despeckle=normal out.pnm ./
rm -rf cache out.pnm
tiff2pdf -o "$1" -z -u m -p "A4" -F -c "scanimage+unpaper+tiff2pdf+pdftk+imagemagick+tesseract+exactimage" -a "Author Name" -t "$2" -s "$3" -k "$4" out.tif
rm -f out.tif
echo "converting to PDF 1.4 ($1)..."
mv "$1" "$1.bak"
pdftk "$1.bak" dump_data > data_dump.info
pdftk "$1.bak" cat output "$1.bk2" flatten
echo "OCR in lang deu... ($1)"
convert -normalize -density 300 -depth 8 "$1.bk2" "$1.png"
tesseract -l deu -psm 1 "$1.png" "$1" hocr
convert "$1.png" "$1.jpg"
hocr2pdf -i "$1.jpg" -s -o "$1.bk2" < "$1.html" echo "Inserting metadata... ($1)" pdftk "$1.bk2" update_info data_dump.info output "$1" rm -f "$1.bak" "$1.bk2" data_dump.info rm -f "$1.png" "$1.jpg" "$1.html" "$1.pdf" echo "done. wrote file. ($1)" echo "validating... ($1)" java -jar jhove/bin/JhoveApp.jar -m PDF-hul "$1" |egrep "Status|Message"

You should obviously customize "Author Name", and you might want to skip the validation step at the end. In other environments, "A4" might be better replaced with "Letter" or "A3", depending on your scan format. Purists might want to skip the conversion to JPEG, which I used to get smaller files. With JPEG2000, the same compression technique that powers DjVu (MRC) becomes possible.

Maybe one should try the suggestions here for other Tesseract UIs, but I'll stick to the command-line for now. Any other suggestions?



Category: English, Not Mathematics


20 Responses

  1. Hi,

    Thanks for your very interesting post.
    I have one question though: How do you tell jhove to check that the pdf is not only valid, but also complies to the pdf/a rules?

    Thanks

  2. That is not possible, since parts of the PDF/A rules are not machine-checkable AFAIK (like "metadata matches data" -- jhove is unable to understand the semantics of your document).

  3. Hi,

    One more remark. Before the validation of the final PDF file against the PDF/A standard, it is required to use the respective tool(s) to generate it according to the latter standard. Looking at your script, I do not think that this has been done already.

    Searching the internet I found this interesting post, explaining how to generate a PDF/A file from a PDF file using the cmd line:
    http://stackoverflow.com/questions/1659147/how-to-use-ghostscript-to-convert-pdf-to-pdf-a-or-pdf-x

    basically, only one command is required: gs -dPDFA -dBATCH -dNOPAUSE -dUseCIEColor -sProcessColorModel=DeviceCMYK -sDEVICE=pdfwrite -sPDFACompatibilityPolicy=1 -sOutputFile=output_filename.pdf input_filename.pdf

    This might be interesting for you.

    Regards

  4. Thanks for the remark. I shouldn't have left that out!

  5. Hi,

    Because I was not fully satisfied by any of the solutions I found on the web (but many of them, like yours, were a very good source of inspiration), I developed my own script to OCR my PDF files.

    The solutions I found on the web:
    - Either produced PDF files with misplaced text under the image (making copy/paste impossible)
    - Or they did not display correctly some escaped HTML characters located in the hocr file produced by the OCR engine
    - Or they changed the resolution of the embedded images
    - Or they generated PDF files having a ridiculously big size
    - Or they crashed when trying to OCR some of my PDF files
    - Or they did not produce valid PDF files (even though they were readable with my current PDF reader)
    - On top of that, none of them produced PDF/A files (a format dedicated to long-term storage / archiving)

    Thanks once more for the inspiration.

    If you have any interrest, you can find my solution here:
    https://github.com/fritz-hh/OCRmyPDF

    Regards

    fritz

  6. Hello, you have a very good write-up here.

    I would suggest you encode the tif files using the JBIG2 compression method instead of JPEG. JBIG2 is a bitonal image compression algorithm, like djvulibre's cjb2 (the jb2 in cjb2 stands for JBIG2 I believe). JBIG2 encoded files are much, much smaller and better quality than JPEG, but they have to be in black and white.

    There is a free software encoder for linux, called jbig2enc. You can get it here: https://github.com/agl/jbig2enc

    An example of how to use it is as follows: jbig2 -4 -s -p *.tif && pdf.py output >out.pdf

    but read the manual for more information.

    So first you process your scans with ScanTailor in black and white mode, and then you encode those TIFFs with jbig2enc. You can encode any color images to PDF with ImageMagick using the JPEG or JPEG2000 compression algorithm. However, make sure to pass the density option to ImageMagick with the correct DPI, or else the result can look messed up.

    Finally when you merge the pdf files together, you want to use a tool that will merge the files without recompressing, or else your pdf file will lose quality from the recompression, and gain in size, because the tool will probably use JPEG to encode it. For this purpose, I like to use pdftk. Be warned though, ghostscript will recompress your image, so don't use this tool to merge pdfs.

    You merge them like so: pdftk file1.pdf file2.pdf file3.pdf cat output merged.pdf

    Using this method, I have made a 600 dpi, 350-page scanned book at 1.5 MB. Pretty small, I think! I have yet to add OCR to it, but that will not increase the file size that much.

    For PDF bookmarks I use a tool called JPDFBookMarks. It's free software, written in Java, and it works under Linux.

    The final missing piece for me is OCR, and it looks like you have covered how to do that part in your blog entry here. Also thanks for the info about jhove, that looks like a useful tool.

    Wow this comment got really long, maybe I should make my own blog post about this. Hope this helps you, definitely should check out using JBIG2 compression, it is worth it!

  7. @nate thanks a lot, these are very good suggestions which I'll definitely try the next time I scan something!

  8. Hi Konrad,

    a good solution for me works with the tool gscan2pdf.

    sudo apt-get install gscan2pdf

    It also performs OCR via tesseract. Some things are missing, but in my opinion it's a good "all-in-one" GUI solution for scanning, basic page editing (cropping) and OCR ;-)

    Bye
    Patrick

  9. thanks! you saved me weeks of research!

  10. Hello, I am the same "nate" from early in the comment thread. Since that time, I have fleshed out my comment here into a fully formed post on my site about my process of book scanning in Linux. You can read it here:

    I hope it is useful to everyone interested in book scanning in Linux. I cover scanning with different kinds of scanners (flatbed, formfeed, handheld), post-processing the scanned images with ScanTailor, and packaging the final scans into either PDF or DjVu format, including OCR, indexing, and bundling. I also go into more detail about the JBIG2 compression size reduction technique I mentioned earlier, explain how to use all the tools exactly, and provide links to where you can get them. If anyone has any feedback you can just leave a comment on my post, or on here, etc. Happy Scanning!

    The release notes for the latest version of Tesseract state that they "added support for PDF output with searchable text." I'm not sure how that will play with hocr2pdf, but it may mean one step fewer in the conversion.

  12. Thank you for mentioning this! I tried to install the latest tesseract from sources but after finally getting all dependencies and building it, it just segfaults. I guess I'll try again in a few months when the package is ready in Debian or Ubuntu. ... which reminds me that I have to switch Distros, given that I dislike the last few iterations of Ubuntu.

  13. Konrad,

    I am new to manipulating scanned PDF documents. I have a LARGE library of books that are simply scanned PDFs, and I would like to have the ability to index the text yet not lose the large number of pictures in the books. I'm really not sure where in your script I would pick up, since I don't need to actually scan the files. Any help would be appreciated.

    Chris
    USA

  14. Hello,

    first of all, thank you for your posts. They are very interesting and I think I'll use some of the described methods in my workflow.

    I'm searching for something and perhaps you have some hints. After the OCR is done, I would like to improve the accuracy. My first idea is to run a spellchecker to correct the misspelled words, and to add some text where nothing is recognized. Do you have any idea what I could use? Thank you!

    Cédric
