Can QPDF utility be used to extract attachments from a PDF file? - pdf

I have a PDF file with other PDF files attached to it. Acrobat shows them in "Attachments" tab and allows to open them in turn.
QPDF documentations says something about extracting attachments but I failed to find any particular commands that do that.
Is it possible to extract these attachments and have them stored on the disk as separate PDF files?
UPDATE: Just a notice to explain better what you can see in the UI: "Attachments" tab was present in older versions of Acrobat, as well as a special page of the container document recommending to download newer version of Acrobat (this page seems to be really existing as it is shown in other viewers as well as on preview image). Latest versions of Acrobat (Reader) skip this page and get you to the first attached document, with the list of all attachments shown on the left side of the screen.

I found an old GitHub issue which a little bit clarify the possibilities of attachment extraction.
It is possible to extract attachments from PDF files using the qpdf
library by understanding the PDF file structure and pulling the
attachments out "manually" by knowing which objects to extract. There
is nothing in the public API at the moment nor in the command-line
tool that enables you to work with attachments as a first-class thing,
but there is an item in the TODO list, and there is some private code
used internally to detect cases where attachments are encrypted
differently from the rest of the file. The main reason, aside from
lack of time, that attachments are not more directly supported is
because there have been various ways that they are stored in the file,
and I don't know whether I have examples of all of them. I'm reluctant
to add a feature for attachments that may miss some attachments in
some older PDF files.
https://github.com/qpdf/qpdf/issues/24
So, it seems it is possible but you should examine the details of the pdf file.

Starting with qpdf 10.2, you can work with file attachments in PDF files from the command line. The following options are available:
http://qpdf.sourceforge.net/files/qpdf-manual.html#ref.attachments

Related

Extract xml from ZUGFeRD PDF with Ghostscript

We would like to automate the processing of Zugferd invoices.
Is there a way to extract and save the xml files embedded in the PDF using Ghostscript?
as mentioned by KenS Ghostscript can help assemble Zugferd files but not extract the contents. Below we can see those contents in the source xml (lower) and a good !? PDF where the plain text is visible (upper part of image is PDF viewed in WordPad) and can be easily extracted as text. However nothing about PDF extraction is reliable since the format of one PDF is rarely the same as the next unless you make it so.
Many PDF readers have the ability to export such attachments as the source file and many PDF libraries will allow for extraction of the named file in a scripted fashion.
The samples above are from currently very up to date Open Source Java application https://www.mustangproject.org/
For very simple cross platform use there is pdfdetach which can save any attachments by name or all attachments

I'd like to recognize the text of all pdfs on my computer and save them without moving them from their locations. Is it possible?

I've tried using Adobe Acrobat X Pro to "recognize text in multiple files."
When I start this process and it asks for the directory, I've chose C:, my main hard drive.
It took hours to load and when it did, the list of files it generated included word documents as well. Adobe said I couldn't proceed until I removed the problem files.
Once I removed all the pdfs Adobe flagged as having errors (like password protection) and the prompt remained, I assumed it meant the word documents in the list.
So I manually removed those too. But Adobe still said that I couldn't proceed until problem files were removed and there weren't any remaining files in the list that adobe had flagged as having issues.
My firm is trying to make sure all pdfs we have are searcheable. Currently, some are and some aren't. Our goal is to make them all searchable without removing them from their varied locations.
I think you can do this using a combination of
regular java : to list all files in a directory that match a given criterium (e.g. their name ends with '.pdf')
iText : to iterate over the PDF document and extract all images
Tess4J : a port of Tesseract (google OCR engine) for java, to turn the extracted images back into text
Unless I am much mistaken, Tesseract even offers a crude version of this workflow for you. But only for 1 pdf at a time. So you'd still need some windows/linux scripting to pipe in all files of a given directory.

Get selected "PostScript" from PDF

I wasn't able to find anything on the internet and I get the feeling that what I want is not such a trivial thing. To make a long story short: I'd like to get my hands on the underlying code that describes the PDF document of a selected area from a .pdf file. I've been looking for libraries or open source readers but couldn't find anything useful yet.
Does there exist something that might be able to accomplish my needs here or anything that might be reused (like an open source reader) to get there a little faster and not having to write everything from scratch?
You can convert a whole PDF document to PostScript using pdftops, one of the utilities from the poppler PDF rendering library.
This utility enables you to convert individual pages, which is at least a start.
If you just want to extract bitmapped images, try pdfimages from the same package. This extraction can also be restricted to individual pages.
The poppler library was originally written for UNIX-like systems, but there are a couple of windows builds available.
The open source tool from iText called iText RUPS does what you want, showing you all the PDF commands for a particular PDF and allow you to visualize the structure and relationships.
http://sourceforge.net/projects/itextrups/

Save data as editable Pdf

We have a software, which creates user reports and saves them into pdf documents. We're using Ghostscript for this.
I'm aware that PDF is "normally" an export format which is not editable, but one of our customer needs the possibility (for legal reasons) to edit these files.
I thought it can be possible to save the text in fillable forms (like adobe acrobat offers) and save it that way. Is it possible to create Text within a fillable form in a PDF and save it (with free tools like Ghostscript), so that the user can edit it later?
I read the Ghostscript documentation, but I didn't find anything.
GhostScript isn't really a terrific tool for this. You'd be better off with a PDF generation library which can add the appropriate annotations to the page - if you're wedded to using annotations.
If the "content" must be edited by end users, using widget annotations is not a horribly bad way of doing things, except that every end user needs to have a copy of Acrobat and if only some people are allowed to edit, you will likely have to play with owner password protection and permissions in order prevent anyone from changing field contents.
As for free tools, depending on the usage you could use iText or iTextSharp.
If you are required to be able to take the content of the document and be able to make changes to it on the fly, that's a trickier beast. If you can afford it (and it's certainly not free), my company Atalasoft, publishes a product that I wrote that lets you build PDF documents from scratch or from templates and embed the .NET objects that create the content into the PDF itself, which means that you can read those objects back out and change the content with a site-specific application, for example.

Link to pages of PDFs created with GhostScript

We have created PDFs from converting individual PostScript pages into a single PDF (and embedding appropriate fonts) using GhostScript.
We've found that an individual page of the PDF cannot be linked to; for example, through the usage of
http://xxxx/yyy.pdf#page=24
There must be something within the PDF that makes this not possible. Are there any specific GhostScript options that should be passed when creating the PDF that would allow this type of page-destination link to work?
There are no specific pdfwrite (the Ghostscript device which actually produces PDF) options to do this. Without knowing why the (presumably) web browser or plugin won't open the file at the specified page its a little difficult to offer any more guidance.
What are you using to view the PDF files ?
Can you make a very simple file that fails ? Can you make that file public ?
If I can reproduce the problem, and the file is sufficiently simple, it may be possible to determine the problem. By the way, which version of Ghostscript are you using ?