linux tools that deal with metadata - pdf

there is pdfinfo for PDFs, exifinfo for images, ffprobe for multimedia and so on. are the a collection or a standardized toolset for extracting filetype dependent(like pdf, image, doc, odt) metadata of all files in linux?
or even distinct tool that are file-format specific for most common file types like ppt, epub and other file types we commonly find in internet downloads.

Exiftool can extract metadata from more than just images. See Supported File Types.

Related

Extract xml from ZUGFeRD PDF with Ghostscript

We would like to automate the processing of Zugferd invoices.
Is there a way to extract and save the xml files embedded in the PDF using Ghostscript?
as mentioned by KenS Ghostscript can help assemble Zugferd files but not extract the contents. Below we can see those contents in the source xml (lower) and a good !? PDF where the plain text is visible (upper part of image is PDF viewed in WordPad) and can be easily extracted as text. However nothing about PDF extraction is reliable since the format of one PDF is rarely the same as the next unless you make it so.
Many PDF readers have the ability to export such attachments as the source file and many PDF libraries will allow for extraction of the named file in a scripted fashion.
The samples above are from currently very up to date Open Source Java application https://www.mustangproject.org/
For very simple cross platform use there is pdfdetach which can save any attachments by name or all attachments

Can QPDF utility be used to extract attachments from a PDF file?

I have a PDF file with other PDF files attached to it. Acrobat shows them in "Attachments" tab and allows to open them in turn.
QPDF documentations says something about extracting attachments but I failed to find any particular commands that do that.
Is it possible to extract these attachments and have them stored on the disk as separate PDF files?
UPDATE: Just a notice to explain better what you can see in the UI: "Attachments" tab was present in older versions of Acrobat, as well as a special page of the container document recommending to download newer version of Acrobat (this page seems to be really existing as it is shown in other viewers as well as on preview image). Latest versions of Acrobat (Reader) skip this page and get you to the first attached document, with the list of all attachments shown on the left side of the screen.
I found an old GitHub issue which a little bit clarify the possibilities of attachment extraction.
It is possible to extract attachments from PDF files using the qpdf
library by understanding the PDF file structure and pulling the
attachments out "manually" by knowing which objects to extract. There
is nothing in the public API at the moment nor in the command-line
tool that enables you to work with attachments as a first-class thing,
but there is an item in the TODO list, and there is some private code
used internally to detect cases where attachments are encrypted
differently from the rest of the file. The main reason, aside from
lack of time, that attachments are not more directly supported is
because there have been various ways that they are stored in the file,
and I don't know whether I have examples of all of them. I'm reluctant
to add a feature for attachments that may miss some attachments in
some older PDF files.
https://github.com/qpdf/qpdf/issues/24
So, it seems it is possible but you should examine the details of the pdf file.
Starting with qpdf 10.2, you can work with file attachments in PDF files from the command line. The following options are available:
http://qpdf.sourceforge.net/files/qpdf-manual.html#ref.attachments

What is a pdf bcmap file?

I am using a pdfjs viewer in my web application and it comes with all these bcmap files. I traced the network traffic and they are not being called for.
I don't really want to add these files to version control or the issue tracking system b/c there are so many of them, if they will not be needed.
What is a bcmap file?
The word "bcmap" stands for "binary cmap".
CMaps (Character Maps) are text files that are used in PostScript and other Adobe products to map character codes to character glyphs in CID fonts.
See this document by Adobe to see what CID fonts are good for. They are mostly used when dealing with East Asian writing systems. (This technology is a legacy technology, so it should not be used in pdfs created by modern tools)
pdfjs needs the CMap file when it wants to display such CID fonts. For that, you need to provide the CMaps.
You specify the url to the folder where the CMaps are stored via settings on the PDFJS global object.
PDFJS.cMapUrl = '../web/cmaps/';
By default, pdfjs will try to load a file with the name of the required CMap and no extension, for example "../web/cmaps/Hankaku".
If you enable the setting cMapPacked like this:
PDFJS.cMapPacked = true;
pdfjs will instead try to read a compressed version of the CMap file with the extension ".bcmap", for example "../web/cmaps/Hankaku.bcmap".
The compression itself is done with the tool at https://github.com/mozilla/pdf.js/tree/master/external/cmapscompress.
Conclusion: Include the files and set the PDFJS options correctly if there is a possibility that you need to display pdfs with East Asian texts that were created by legacy pdf creation tools. Don't include the files if you are certain you won't need to display such files.

What is a "packed PDF", and how can it be read?

I have been sent versions of "packed PDF" files where the top-level PDF contains child PDFs.
The top-level PDF acts primarily as a container. The packing is not always evident in Adobe reader (e.g. when pdftk is used to pack the link does not show). I can find little by Googling for this term nor in my 2012 book ("Whittington", "PDF Explained", O'Reilly).
Is this a standard part of PDF? If so I'd be grateful for pointers. And can PDFBox analyze it?
Concerning your question whether using PDF as a container file format is a standard part of PDF:
Yes, it is. ISO 32000-1:2008 describes it in section 7.11.4 Embedded file streams.
Most prominent are files associated to some document page, see 12.5.6.15, File Attachment Annotations, and those associated with the document as a whole through the EmbeddedFiles entry (PDF 1.4) in the PDF document’s name dictionary (see 7.7.4, Name Dictionary).
#JesseGood's link to PDF File Specification on the PDFBox site explains how to deal with the latter ones.
I'm not very knowledgeable concerning PDFBox and, therefore, don't know whether it allows easy access to the other kind of attachments, too. If it does not, you will essentially have to iterate the annotations of all pages to find the file attachment annotations and handle the contents according to the PDF specification.

Why are ePub files so much smaller than mobi or PDF files for the same book

When I buy ebooks I download all of the available formats. I've noticed that the file sizes for the various formats can be markedly different and epub is typically much smaller.
For example:
PDF - 5.7mb;
ePub - 2.7mb;
Mobi - 8.1mb.
Or:
PDF - 4.5mb;
ePub - 1.8mb;
Mobi - 5.3mb.
I've flipped through them and tried to confirm that the contents are the same and they seem to be (i.e. no large images missing). Can anyone explain why epub is so much smaller than the other two?
The mobi versions can be larger because they include the legacy mobi format, the new KF8 format and a copy of the original epub, this is assuming the mobi file was generated with the latest version of kindlegen.
For the PDF's I'm guessing (and that's all it is here) that embedded fonts may be the cause of a larger file size, another thing that comes into play here is image optimisation. Depending on the image optimisation settings used when the PDF was created will largely affect the final file size.
Epub's are basically just a bunch HTML, CSS and image files with a few XML files for defining the books metadata, chapter order and table of contents navigation. The epub file is really just a zip file with a .epub extension and since it doesn't have 3 copies of the same book like the Kindle version does it will always be much smaller.
Because the epubs are similar to a website. An epub book is made from XHTML & CSS2 & some features like CSS3, then the software that reads epub interpret that file and make a visual representation from that code.
.epub files are compressed (in fact, they are just zip files).
.mobi files are not compressed. If you zip a mobi file, you may get a smaller file than the epub.
Incidentally, this makes text searching much faster on mobi files than on epub.
That depends on the format of the mobi that you have. As you must be already aware, an epub file can be converted into any ebook format that you choose - you can consider the epub format as the base for any other format.
I am guessing that the mobi file that you have has the original epub embedded inside it. This is to assist editing tools (as direct editing of mobi files is cumbersome). Also, some mobi files contain several versions of the mobi(mobi-7 and KF8) to maintain backward compatibility with readers that do not support the latest format.
You can find more information about the file formats here