Managing the paper’s metadata

Today in the series "How to do XYZ with software?":

Annotations and other metadata issues

(You might not want to read this if you're not using Linux or if you're not a developer)

On Linux there is no good system to annotate PDFs. There is PDFedit, which is slow and has a horrible user interface; it is obviously intended for modifying technical aspects of PDFs, not for fast annotation. There is Evince, my favourite PDF/PostScript/DjVu reader, but the project "annotations in Evince" hasn't come very far. And there is Xournal, a tablet PC application which is very comfortable when it comes to annotating PDFs. Sadly, it is not as comfortable as Evince when it comes to reading PDFs. And in the end I want my notes to be exportable in some open format, so they won't get lost.

Maybe you have heard of the KDE application Okular, which allows annotations (Okular works fine under Linux and because of the nature of KDE4, it may even work under Windows and Mac OS X). These annotations are stored in an additional XML file. This has advantages and disadvantages, the advantage being that I can share the PDFs without sharing the annotations. The disadvantage, however, is that I have to stick to one program (Okular) if I use Okular annotations. I want my metadata to be included in the PDF, in the XMP format.

This problem is not yet solved even by Adobe: the Acrobat Reader (in its professional variant) takes annotations in an obscure closed-source format, stored somewhere else than in the PDF itself. That's why I'm either writing notes by hand, after printing the PDF, or I'm taking notes in the Tomboy note-taking application. If someone knows a solution to this disaster, please tell me! Maybe the right direction is to put the Okular annotations optionally into the XMP stream and add this ability to Evince, too.

(image "Emergency Exit" licensed from semanticwebcompany under a Creative Commons Attribution-Noncommercial-Share Alike 2.0 Generic license)

General document metadata: author and title

Okay, let's try to forget annotations for a moment. What about the simple metadata "author" and "title"? They're often used in desktop search engines and document management tools, so it would be nice to have correct metadata in PDFs (and PS and DjVu files, too). How to do this with Linux? I recently spent hours investigating this question.

With jPdf Tweak, a Java application (runs under Linux, Windows, Mac OS X), you can edit nearly all PDF metadata and do much more. It is basically a graphical user interface for the iText library, which is the same library that also powers the popular command-line tool pdftk. Sadly, the interface is not really usable for editing many PDFs in a row, and it has no batch processing capabilities. If you have just one or two PDFs to edit, this seems to be the perfect tool for you. I also found out that JabRef is able to write XMP metadata. Nice!

(image "Metadata sticks" licensed from Gideon Burton under a Creative Commons Attribution-Share Alike 2.0 Generic license)

My frustration with the tools available culminated in writing my own document-metadata-tool. What I have done so far is a short command-line hack written in bash and Python that takes a PDF, prints its metadata using pdftk, asks for a new author, title, keywords, year and URI and, if something new is entered, writes this metadata into the PDF, using pdftk again. It also prints out BibTeX code that includes URI and file-links, for direct import into JabRef.

This works somehow, but: the code is ugly, it doesn't work with PS or DjVu, it doesn't write XMP metadata (you can do this via JabRef after importing the BibTeX file), it doesn't offer a nice graphical user interface for processing huge amounts of documents, and a strange bug prevents some PDFs from being manipulated (this seems to be related to a bug in the iText library used by pdftk and to pre-damaged PDFs). If you're interested in the code, leave a comment, then I'll publish it here somewhere. At least for me it was very useful.
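For readers who want to experiment before the script is published, here is a minimal Python sketch of the same workflow. It is not the script described above. It assumes a pdftk version whose "dump_data" output uses "InfoBegin" / "InfoKey:" / "InfoValue:" records and whose "update_info" accepts the same layout (older versions may omit "InfoBegin"). The function names, the @misc entry type and the ":path:PDF" layout of the file field (my understanding of JabRef's convention) are assumptions made for illustration.

```python
import subprocess
import tempfile
from pathlib import Path


def parse_info(dump):
    """Collect InfoKey/InfoValue pairs from `pdftk file.pdf dump_data` output."""
    info = {}
    key = None
    for line in dump.splitlines():
        if line.startswith("InfoKey: "):
            key = line[len("InfoKey: "):]
        elif line.startswith("InfoValue: ") and key is not None:
            info[key] = line[len("InfoValue: "):]
            key = None
    return info


def format_info(info):
    """Render a dict in the record layout assumed for `pdftk update_info`."""
    lines = []
    for key, value in info.items():
        lines += ["InfoBegin", f"InfoKey: {key}", f"InfoValue: {value}"]
    return "\n".join(lines) + "\n"


def bibtex_entry(cite_key, info, pdf_path, uri=""):
    """Build a BibTeX entry; the file field layout is an assumption."""
    fields = {
        "author": info.get("Author", ""),
        "title": info.get("Title", ""),
        "url": uri,
        "file": f":{pdf_path}:PDF",
    }
    body = ",\n".join(
        f"  {name} = {{{value}}}" for name, value in fields.items() if value
    )
    return f"@misc{{{cite_key},\n{body}\n}}"


def read_pdf_info(pdf_path):
    """Run pdftk and return the Info dictionary of a PDF."""
    result = subprocess.run(
        ["pdftk", str(pdf_path), "dump_data"],
        check=True, capture_output=True, text=True,
    )
    return parse_info(result.stdout)


def write_pdf_info(pdf_in, pdf_out, changes):
    """Write only the changed keys; pdftk keeps the other Info entries."""
    with tempfile.TemporaryDirectory() as tmp:
        info_file = Path(tmp) / "info.txt"
        info_file.write_text(format_info(changes))
        subprocess.run(
            ["pdftk", str(pdf_in), "update_info", str(info_file),
             "output", str(pdf_out)],
            check=True,
        )
```

A typical session would call read_pdf_info("paper.pdf"), ask the user for new values, pass the non-empty ones to write_pdf_info("paper.pdf", "paper-new.pdf", changes) and print bibtex_entry(...) for import into JabRef. Writing to a second file is deliberate, because pdftk does not modify its input in place.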

Get rid of PostScript documents

I decided to abandon PostScript files, that means I converted every PS file into PDF format, using the following shell script:

    ps2pdf -sPAPERSIZE=a4 \
        -dMaxSubsetPct=100 -dCompatibilityLevel=1.3 \
        -dSubsetFonts=true -dEmbedAllFonts=true \
        -dAutoFilterColorImages=false \
        -dAutoFilterGrayImages=false \
        -dColorImageFilter=/FlateEncode \
        -dGrayImageFilter=/FlateEncode \
        -dMonoImageFilter=/FlateEncode \
        "$1" "$1.pdf"

You can either save this text into a file "abandonps.sh" and make it executable via "chmod u+x abandonps.sh" or you replace $1 with the filename. Maybe you will want to use "letter" instead of "a4", too. I'm very happy that some old PostScript documents not only load much faster after being converted but there is also (somehow magically) full-text search for some documents now!
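If you have a whole directory tree of PostScript files, calling the script by hand gets tedious. The following Python sketch builds the same ps2pdf command line for every ".ps" file below a folder. The function names and the recursive "*.ps" search are my own choices for illustration, and it assumes that ps2pdf is on your PATH; the options themselves are the ones from the script.

```python
import subprocess
from pathlib import Path

# The same Ghostscript options as in the shell script.
OPTIONS = [
    "-dMaxSubsetPct=100",
    "-dCompatibilityLevel=1.3",
    "-dSubsetFonts=true",
    "-dEmbedAllFonts=true",
    "-dAutoFilterColorImages=false",
    "-dAutoFilterGrayImages=false",
    "-dColorImageFilter=/FlateEncode",
    "-dGrayImageFilter=/FlateEncode",
    "-dMonoImageFilter=/FlateEncode",
]


def ps2pdf_command(ps_path, papersize="a4"):
    """Return the argument list for converting one PostScript file."""
    ps_path = Path(ps_path)
    # Same naming as "$1.pdf" in the script: paper.ps becomes paper.ps.pdf
    pdf_path = ps_path.with_name(ps_path.name + ".pdf")
    return ["ps2pdf", f"-sPAPERSIZE={papersize}", *OPTIONS,
            str(ps_path), str(pdf_path)]


def convert_tree(root, papersize="a4"):
    """Convert every *.ps file below root, stopping at the first failure."""
    for ps_path in sorted(Path(root).rglob("*.ps")):
        subprocess.run(ps2pdf_command(ps_path, papersize), check=True)
```

Calling convert_tree(".", papersize="letter") would convert everything below the current directory with the "letter" paper size. The original PS files are left untouched, so you can check the results before deleting anything.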

For a general way to get full-text search and indexable documents (with Linux), have a look at another article on this blog:
Linux, OCR and PDF - problem solved

The ideal document metadata editor

I did some research on how to write an ideal document metadata editor. ExifTool, a Perl library and command-line application (for Linux, Windows and Mac OS X), seems to be able to manipulate DjVu metadata (currently only reading; when I try writing I get "Writing of AIFF files is not yet supported"). However, the company that owns the DjVu format seems to stick to their own metadata format, so XMP in DjVu won't be available officially. To manipulate PostScript metadata it seems possible to just write a text-based parser, because the metadata is stored as plain text in the first few lines of a PS document. It may be even better to use ExifTool's PostScript capabilities. To manipulate PDF metadata, it seems to me the best solution is to use the iText library directly (not via pdftk and the command-line, to exploit Java's so-called platform-independence).
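To make the idea of a text-based PostScript parser more concrete, here is a minimal read-only sketch in Python. It assumes that the metadata sits in the header comments of the file in the form "%%Key: value" (for example "%%Title:" and "%%Creator:") and that the header ends at "%%EndComments" or at the first line that is not a comment. The function names and the 8 KB read limit are made up for this example, and writing metadata back is not covered.

```python
def parse_ps_header(text):
    """Collect "%%Key: value" comments from the start of a PostScript file."""
    meta = {}
    for line in text.splitlines():
        if not line.startswith("%"):
            break  # the first non-comment line ends the header
        if line.startswith("%%EndComments"):
            break  # explicit end of the header comments
        if line.startswith("%%") and ":" in line:
            key, _, value = line[2:].partition(":")
            meta[key.strip()] = value.strip()
    return meta


def read_ps_metadata(path, limit=8192):
    """Read only the beginning of the file; the header is in the first lines."""
    with open(path, "rb") as handle:
        head = handle.read(limit)
    # latin-1 never fails to decode, which is enough for a header scan
    return parse_ps_header(head.decode("latin-1"))
```

Values are returned as plain strings exactly as they appear in the file, so a title written in parentheses keeps its parentheses. Stopping at the end of the header is a deliberate choice: later parts of a document can contain comments with the same keys, and those should not overwrite the header values.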
In general, to manipulate XMP metadata, it seems best to use Exempi, which is an open-source implementation based on Adobe's own XMP SDK. There are Python bindings available, too.
If you know of any Nautilus/Dolphin/other file-manager extension that already does some or all of these things, please tell me where to get it!

Comments

Read further down in your DjVu XMP reference:

http://www.djvu.org/forum/phpbb/viewtopic.php?t=530

You will see that there is now a scheme to embed XMP in DjVu. However, as you point out exiftool doesn't yet support writing of AIFF-format files (upon which DjVu is based).

I haven't tried this out, but it might be what you are (were) looking for:

https://github.com/grigio/xmp-manager

I hear your pain... I have a large bunch of (work-related) scientific documents in different formats (pdf, djvu, epub) that I'd like to use together with a desktop search engine (Recoll - the only one that supports .djvu). Recoll can be made to work alongside TMSU (http://www.tmsu.org/) but also has noticeable disadvantages. It doesn't have a GUI either, but I managed to get the devs interested in making some Nautilus scripts: https://github.com/oniony/TMSU/issues/62 XMP Manager appears to be dead? I also found TAGSpaces https://www.tagspaces.org/ , which seems to be fairly advanced (GUI-wise, I haven't tried it out yet) but of course does not work with Recoll - but perhaps it suits your needs?

...I forgot: I'd indeed like to try out your script!

I just had a brief look at TMSU and Tagspaces, which both look interesting, so thank you for that.

Then I found https://github.com/Glutanimate/PDFMtEd which might be what you're after (at least for PDFs, I haven't tried anything else yet).

A brief update:

- I read in the Evince mailing-list that the devs are working on annotation-features.
- Here's one tool (sadly for Windows) that lets you batch-edit metadata in PDFs: http://www.hexonic.de/index.php/hexonic-pdf-metadata-editor
- And another one - java-based: http://broken-by.me/pdf-metadata-editor/ After asking for support of other file-types, the dev asked me to open a ticket for it here: https://github.com/zaro/pdf-metadata-editor/issues/6 Perhaps you can chime in?
I've also read about "Adobe Bridge" but found nothing about the supported file types.