LuaTeX is a powerful extension of TeX that makes it possible to fully harness the capabilities of traditional TeX systems without the large development overhead this usually entails. LuaTeX embeds the Lua scripting language, which runs alongside the TeX engine. Lua, a multi-paradigm programming language, provides a convenient bridge to parts of the TeX engine through interfaces and callback hooks, making it easier to perform complex tasks and customize the typesetting process.
This blog entry explores how we can use LuaTeX to embed metadata – or just about any data – in PDF documents.
Creating an object
PDF files can contain a variety of content types such as text, pictures, metadata, forms, interactive elements and so on. In the PDF file format, all these parts are packed into containers called objects. In LuaTeX we can use the pdf backend library to create a PDF object. The obj function creates a PDF object and returns its object number. These are the allowed parameters of the function:
pdf.obj {
    type           = <string>,
    immediate      = <boolean>,
    objnum         = <number>,
    attr           = <string>,
    compresslevel  = <number>,
    objcompression = <boolean>,
    file           = <string>,
    string         = <string>,
    nolength       = <boolean>,
}
For example, to create a simple object that contains the string “hello world!”, we can use the following snippet:
local new_obj = pdf.obj {
type = 'stream',
attr = '/Type /Text /Subtype /Plain',
immediate = true,
compresslevel = 0,
string = 'hello world!',
}
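Because pdf.obj returns the object number, the result can be turned into an indirect reference of the form "N 0 R", which is how other PDF objects point to it. The following is a minimal sketch of that step. The helper name indirect_ref is hypothetical and not part of LuaTeX, and the pdf.obj call is guarded so the helper can also be tried in a plain Lua interpreter.

```lua
-- Hypothetical helper (not a LuaTeX function): format an object number
-- as a PDF indirect reference. Generation number 0 is assumed here.
local function indirect_ref(objnum)
  return string.format('%d 0 R', objnum)
end

-- The pdf and texio tables only exist inside LuaTeX, so guard the call.
if pdf then
  local new_obj = pdf.obj {
    type = 'stream',
    attr = '/Type /Text /Subtype /Plain',
    immediate = true,
    compresslevel = 0,
    string = 'hello world!',
  }
  -- Write the reference to the log so the object can be found in the PDF.
  texio.write_nl('created object ' .. indirect_ref(new_obj))
end
```

Keeping the number in a local variable, as above, is what later allows the object to be referenced from another dictionary.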
To write this object at the end of the PDF document we can run the code in the finish_pdffile callback.
luatexbase.add_to_callback('finish_pdffile', function()
    pdf.obj {
        type = 'stream',
        attr = '/Type /Text /Subtype /Plain',
        immediate = true,
        compresslevel = 0,
        string = 'hello world!',
    }
end, 'finish')
Alternatively, we can create the object at the point in the document where we want it, using \directlua from LaTeX:
\directlua{%
pdf.obj {
type = 'stream',
attr = '/Type /Text /Subtype /Plain',
immediate = true,
compresslevel = 0,
string = 'hello world!',
}
}%
Adding Metadata to the PDF file
PDF metadata contains information such as the title, author or modification dates of the document. Starting with PDF 1.4, metadata can be stored either in metadata streams, which contain XML data, or in a document information dictionary. While the document information dictionary, as a key-value store, is rather restricted in the content it can hold, metadata streams can contain arbitrary XML constructs. For metadata streams Adobe recommends the Extensible Metadata Platform (XMP) standard (sections 1.6.1 and 2.2). XMP is based on the RDF language, which records information as triples of subject, predicate and object. Here is an example of what an uncompressed metadata stream can look like; it describes Earth as a sphere.
<< /Type /Metadata /Subtype /XML /Length 1706 >>
stream
<?xpacket begin='' id='W5M0MpCehiHzreSzNTczkc9d'?>
<x:xmpmeta xmlns:x="adobe:ns:meta/">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-...">
<rdf:Description rdf:about="http://example_uri.net#Earth" xmlns:ex="http://example_uri.net">
<ex:has_shape>
<ex:Sphere/>
</ex:has_shape>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>
<?xpacket end='w'?>
endstream
endobj
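For comparison, the document information dictionary mentioned above only holds simple key-value pairs. The following is a rough sketch of how its entries could be assembled in Lua, assuming a LuaTeX version that provides pdf.setinfo. The helpers pdf_literal and info_entries and the sample values are hypothetical, and the sketch only handles plain ASCII text.

```lua
-- Hypothetical helper: wrap a Lua string as a PDF literal string.
-- Backslashes and parentheses are escaped with a backslash.
local function pdf_literal(s)
  local escaped = s:gsub('([\\()])', '\\%1')
  return '(' .. escaped .. ')'
end

-- Hypothetical helper: turn a list of {key, value} pairs into the text
-- of information dictionary entries, preserving the given order.
local function info_entries(list)
  local parts = {}
  for _, kv in ipairs(list) do
    parts[#parts + 1] = '/' .. kv[1] .. ' ' .. pdf_literal(kv[2])
  end
  return table.concat(parts, ' ')
end

-- Sample values, made up for illustration.
local entries = info_entries {
  { 'Title', 'Earth (a sphere)' },
  { 'Author', 'Jane Doe' },
}

-- Only hand the entries to the backend when running inside LuaTeX.
if pdf then
  pdf.setinfo(entries)
end
```

The flat list of keys and string values shows why this route is less flexible than a metadata stream: there is no place for nested or linked statements.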
PDF files can have multiple packets of XMP metadata. The “main” Metadata object can be referenced in the document catalog, the top-level dictionary of the PDF, so an application that understands PDF can find the newest metadata packet. Let’s see how we can achieve this in LuaTeX.
metadata_obj = pdf.obj {
type = 'stream',
attr = '/Type /Metadata /Subtype /XML',
immediate = true,
compresslevel = 0,
string = "<?xpacket begin='' id='W5M0MpCehiHzreSzNTczkc9d'?> \
<x:xmpmeta xmlns:x='adobe:ns:meta/'> \
<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-...'> \
<rdf:Description rdf:about='http://example_uri.net#Earth' xmlns:ex='http://example_uri.net'> \
<ex:has_shape> \
<ex:Sphere/> \
</ex:has_shape > \
</rdf:Description> \
</rdf:RDF> \
</x:xmpmeta > \
<?xpacket end='w'?>",
}
-- add the new object to the catalog as its /Metadata entry,
-- written as an indirect reference of the form "/Metadata N 0 R"
local catalog = pdf.getcatalog() or ''
pdf.setcatalog(catalog .. string.format(' /Metadata %d 0 R', metadata_obj))
And that’s how you can save a custom XMP metadata object with LuaTeX. Of course the general principle here applies to arbitrary additional data added to a PDF file. The XMP standard is not enforced by PDF.
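To illustrate the point about arbitrary data, here is a small sketch that stores a JSON string in a stream object. The helper stream_spec, the payload, and the /CustomData and /JSON names are all made up for this example; they are not defined by PDF or LuaTeX.

```lua
-- Hypothetical helper: build the argument table for an uncompressed,
-- immediately written stream object holding the given data.
local function stream_spec(data, attr)
  return {
    type = 'stream',
    attr = attr,
    immediate = true,
    compresslevel = 0,
    string = data,
  }
end

-- Example payload; the /Type and /Subtype names are invented for illustration.
local spec = stream_spec('{"project": "demo", "version": 1}',
                         '/Type /CustomData /Subtype /JSON')

-- Only create the object when running inside LuaTeX.
if pdf then
  local objnum = pdf.obj(spec)
  texio.write_nl('custom data stored in object ' .. objnum)
end
```

An object created this way is present in the file, but a PDF application will only reach it through the document structure if some other object refers to it by its number.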
If you are interested in a practical example that applies these techniques, consider checking out a paper I wrote with my colleagues Oliver Karras and Ildar Baimuratov about a new LaTeX package that embeds the main scientific contributions of a paper in PDF metadata here, and the code repository of the project here.