Linux, OCR and PDF: Scan to PDF/A

Friday, March 29th, 2013

By far the most visited post on this blog is one from 2010 about OCRing (Optical Character Recognition) a PDF on GNU/Linux; it contains a small shell script that others have improved several times. After buying a new flatbed scanner, I re-investigated how to scan and OCR PDFs, how to produce incredibly small DjVu files, and how to get the metadata right. It turns out that what I had really wanted all along was to create PDF/A-compliant documents (I just didn't know what PDF/A was). I will present the quick solution first and explain the details afterwards. The end result is a shell script that scans directly to PDF/A.

Continue reading «Linux, OCR and PDF: Scan to PDF/A»

Category: English, Not Mathematics | 19 Comments

Mass renaming papers with BibTex+JabRef export filters

Monday, June 28th, 2010

If you manage your (scientific) references, such as journal articles, arXiv papers and textbooks, in a reference management system that uses BibTex as its storage/export format, and you have local copies of your files, then the following might be of interest:

I wrote a JabRef export filter that takes a BibTex file with file links (that is, BibTex fields of the form file={somefile.pdf}) and writes a Linux shell script that renames the files systematically according to the scheme [bibtexkey] - [authors] - [title].[extension]. JabRef can then find the files again via its automatic file association mechanism. I use lower-case bibtexkeys, but the export filter is easily adaptable; read about it on the JabRef custom export filter documentation page.

Just create (or download) a file named "renamer.layout" that contains this single line:
\begin{file}mv "\format[FileLink]{\file}" "\format[ToLowerCase,FormatChars]{\bibtexkey} - \format[AuthorNatBib,ToLowerCase,FormatChars,RemoveBrackets]{\author} - \format[FormatChars,RemoveBrackets,ToLowerCase]{\title}.\format[Replace(.*:,),ToLowerCase]{\file}"\end{file}
Then open JabRef and go to the menu entry Options->Manage custom exports->Add new, where you enter (for example) "renamer" as the Export name, the full path to your renamer.layout file in the Main layout file field, and "sh" as the File extension.

Next, open your BibTex file (.bib) in JabRef, select the menu entry File->Export, and choose your newly created export filter renamer (*.sh) in the Files of Type drop-down menu. This gives you a shell script which, when executed, renames all files linked from the BibTex document into a standardised format (and moves them all into the directory from which you execute the script).
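To make the naming scheme concrete, here is a minimal shell sketch of the target name the exported script is aiming for. It is not what JabRef produces: the function target_name and the sample entry are made up for illustration, the author string is passed in already formatted, and the extension is simply taken from the file name.

```sh
#!/bin/sh
# Hypothetical helper, not part of JabRef: builds a name of the form
# "[bibtexkey] - [authors] - [title].[extension]", all in lower case.
target_name() {
    key=$1
    authors=$2
    title=$3
    file=$4
    ext=${file##*.}   # assumption: the extension is whatever follows the last dot
    printf '%s - %s - %s.%s\n' "$key" "$authors" "$title" "$ext" \
        | tr '[:upper:]' '[:lower:]'
}

# Dry run with a made-up entry: print the mv command instead of executing it.
old='Doe Some Title.PDF'
new=$(target_name 'Doe2010' 'Doe and Roe' 'Some Title' "$old")
printf 'mv "%s" "%s"\n' "$old" "$new"
```

Printing the mv commands first, as above, is a cheap way to check the new names before anything is actually moved.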

Continue reading «Mass renaming papers with BibTex+JabRef export filters»

Category: English, Mathematics | Leave a Comment

A survey of GNU/Linux shortcomings

Sunday, February 14th, 2010

A long time ago, I switched from Micro$oft Windows to GNU/Linux. Since Ubuntu came along, I even recommend GNU/Linux to people who are not computer freaks. Sadly, Ubuntu is not perfect. In particular, some applications are still missing. What follows is a wish-list of future Ubuntu features and applications. Some of these are available on Windows or Mac OS X; most aren't.
Continue reading «A survey of GNU/Linux shortcomings»

Category: English | One Comment

Managing the paper's metadata

Monday, January 25th, 2010

Today in the series "How to do XYZ with software?":

Annotations and other metadata issues

(You might not want to read this if you're not using Linux or if you're not a developer.)
Continue reading «Managing the paper's metadata»

Category: English | 2 Comments

Managing papers

Saturday, January 23rd, 2010

Today in the series "How to do XYZ with software?":

How to manage papers?

I have lots of PDFs on my hard disk, and most of them are half-read or unread. Since I'm studying mathematics, these PDFs are lecture notes, research papers, my own notes and several more-or-less relevant books. How do I organise them? It's a problem.

Continue reading «Managing papers»

Category: English, Mathematics | Comments off

Linux, OCR and PDF – problem solved

Tuesday, January 19th, 2010

Imagine you've scanned a book into a PDF file on Linux such that every PDF page contains two book pages, there is a lot of extra white space, and maybe the page orientation is wrong. Worst of all, there is no full-text search, and thus no full-text indexing for desktop search engines. I consider this problem solved on Linux!
Continue reading «Linux, OCR and PDF – problem solved»

Category: English | 41 Comments