We would like to automate the processing of ZUGFeRD invoices.
Is there a way to extract and save the XML files embedded in the PDF using Ghostscript?
As KenS mentioned, Ghostscript can help assemble ZUGFeRD files but cannot extract their contents. The image below shows those contents in the source XML (lower part) and in a reasonably good PDF where the plain text is visible and can easily be extracted as text (upper part: the PDF viewed in WordPad). However, nothing about PDF extraction is reliable, since the format of one PDF is rarely the same as the next unless you make it so.
Many PDF readers can export such attachments as the source file, and many PDF libraries allow a named file to be extracted in a scripted fashion.
The samples above are from the Mustang project, an open-source Java application that is currently kept very up to date: https://www.mustangproject.org/
For very simple cross-platform use there is pdfdetach, which can save a single attachment by name or all attachments at once.
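As a rough sketch of how pdfdetach could be scripted, the Python helper below builds the command line and runs it. It assumes the Poppler build of pdfdetach is on the PATH; the `-savefile` option belongs to that build and may be missing from older versions, which only offer `-save` with an index. The function names and the file names used with them are illustrative only.

```python
import subprocess
from pathlib import Path


def pdfdetach_command(pdf_path, out_dir, name=None):
    """Build a pdfdetach command line (assumes Poppler's pdfdetach).

    name=None  -> save every attachment into out_dir (-saveall)
    name="x"   -> save only the attachment called "x" (-savefile)
    """
    cmd = ["pdfdetach"]
    if name is None:
        # With -saveall, -o names the target directory.
        cmd += ["-saveall", "-o", str(out_dir)]
    else:
        # With -savefile, -o names the output file itself.
        cmd += ["-savefile", name, "-o", str(Path(out_dir) / name)]
    cmd.append(str(pdf_path))
    return cmd


def extract_attachments(pdf_path, out_dir, name=None):
    """Run pdfdetach and return the names of the files now in out_dir."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(pdfdetach_command(pdf_path, out_dir, name), check=True)
    return sorted(p.name for p in Path(out_dir).iterdir())
```

Calling `extract_attachments("invoice.pdf", "out")` would then leave the embedded XML in the `out` directory, provided the PDF actually stores it as an embedded file. Running `pdfdetach -list invoice.pdf` first shows the attachment names.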
Related
I have a PDF file with other PDF files attached to it. Acrobat shows them in the "Attachments" tab and allows me to open them in turn.
The QPDF documentation says something about extracting attachments, but I failed to find any particular commands that do that.
Is it possible to extract these attachments and have them stored on the disk as separate PDF files?
UPDATE: Just a note to better explain what you can see in the UI: the "Attachments" tab was present in older versions of Acrobat, as was a special page of the container document recommending that you download a newer version of Acrobat (this page really does exist, as it is shown in other viewers as well as in the preview image). The latest versions of Acrobat (Reader) skip this page and take you to the first attached document, with the list of all attachments shown on the left side of the screen.
I found an old GitHub issue that clarifies the possibilities of attachment extraction a little.
It is possible to extract attachments from PDF files using the qpdf
library by understanding the PDF file structure and pulling the
attachments out "manually" by knowing which objects to extract. There
is nothing in the public API at the moment nor in the command-line
tool that enables you to work with attachments as a first-class thing,
but there is an item in the TODO list, and there is some private code
used internally to detect cases where attachments are encrypted
differently from the rest of the file. The main reason, aside from
lack of time, that attachments are not more directly supported is
because there have been various ways that they are stored in the file,
and I don't know whether I have examples of all of them. I'm reluctant
to add a feature for attachments that may miss some attachments in
some older PDF files.
https://github.com/qpdf/qpdf/issues/24
So it seems to be possible, but you have to examine the details of the PDF file.
Starting with qpdf 10.2, you can work with file attachments in PDF files from the command line. The available options are described in the manual:
http://qpdf.sourceforge.net/files/qpdf-manual.html#ref.attachments
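A minimal sketch of driving those options from Python follows. It assumes qpdf 10.2 or later is on the PATH and relies on two of its documented options, `--list-attachments` and `--show-attachment=key`, the latter writing the attachment's bytes to standard output. The function names and file names are illustrative only.

```python
import subprocess


def qpdf_list_command(pdf_path):
    """Command that prints the attachment keys stored in the PDF."""
    return ["qpdf", "--list-attachments", str(pdf_path)]


def qpdf_show_command(pdf_path, key):
    """Command that writes one attachment's bytes to standard output."""
    return ["qpdf", f"--show-attachment={key}", str(pdf_path)]


def list_attachments(pdf_path):
    """Return qpdf's raw listing text, left unparsed on purpose."""
    result = subprocess.run(
        qpdf_list_command(pdf_path), check=True, capture_output=True, text=True
    )
    return result.stdout


def save_attachment(pdf_path, key, out_path):
    """Write the attachment identified by key to out_path."""
    result = subprocess.run(
        qpdf_show_command(pdf_path, key), check=True, capture_output=True
    )
    with open(out_path, "wb") as fh:
        fh.write(result.stdout)
```

For example, `save_attachment("container.pdf", "first.pdf", "first.pdf")` would store the attachment whose key is `first.pdf` as a separate file on disk. The key to pass is whatever `list_attachments` reports for that file.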
I use iText to read a PDF document containing an XFA form.
I convert it to XML, read data from the XML and insert it into a database.
But if I don't have an XFA form in the PDF, how can I efficiently read data from the PDF?
It depends on your expectations.
You can use text extraction to retrieve all the text on a certain page. How you then process the text is up to you. (e.g. regular expressions)
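To illustrate the regular-expression step, here is a small Python sketch that pulls a few fields out of text that has already been extracted from a page. The sample text, the field names and the patterns are all made up for the example. Real documents will need their own patterns, and this does not show the text extraction itself.

```python
import re

# Hypothetical text as it might come out of a page after text extraction.
SAMPLE_TEXT = """Invoice No: 2024-0042
Date: 2024-03-15
Total: 1,234.50 EUR
"""

# One pattern per field; group 1 captures the value we want.
FIELD_PATTERNS = {
    "invoice_number": re.compile(r"Invoice No:\s*(\S+)"),
    "date": re.compile(r"Date:\s*(\d{4}-\d{2}-\d{2})"),
    "total": re.compile(r"Total:\s*([\d.,]+)"),
}


def parse_fields(text):
    """Return a dict of field name -> captured string (None if not found)."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1) if match else None
    return fields


print(parse_fields(SAMPLE_TEXT))
```

A missing field comes back as `None` rather than raising an error, which makes it easier to flag documents that do not follow the expected layout.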
You can also opt for using pdf2Data, an iText7 add-on that allows you to match documents against templates. pdf2Data seems like a good fit, since it produces XML files as its output.
More information on pdf2Data can be found here http://itextpdf.com/itext7/pdf2Data
We receive WordML documents, which are basically XML files generated from MS Word documents and which contain all the formatting instructions as well. We now have a requirement to convert these files to PDF. I looked at iText's XMLWorker for this conversion, but it simply removed all the XML tags and gave me all the content as a single paragraph in the PDF, with no formatting.
How can I make sure that the generated PDF contains the text with the correct formatting from the WordML document?
iText's product XMLWorker requires you to handle each XML element manually (unless you have HTML as input). The XML schema for MS Word documents is extremely complicated, so you'd be working on that for a few years to get something that looks even remotely ok. In short, XMLWorker doesn't do what you think it does.
If you want MS Word to PDF conversion, you need another kind of solution. XDocReport (MIT license) is one of these, and it has plugins for both iText 2 (LGPL license) and iText 5 (AGPL license). Results are not perfect though.
How can I use PostScript to insert an SWF file into my PDF file so that the Flash content can process other data from the PDF file?
Flash content inside a PDF file won't be able to 'process data from the PDF file'.
You can't (easily) insert content into a PDF file using PostScript. Although it is a programming language, the task makes my mind boggle.
If you are trying to add something to a PDF file using Ghostscript (as the commenter yms above suggested), the short answer is that you can't. The longer answer is that you might be able to, with some PostScript programming, but you haven't supplied enough information to tell. And it still wouldn't be able to 'process other data from the PDF file'.
Does anyone know how I can create one PDF file from multiple PPT files?
It could be done by writing a script or a program, but an existing program that does it would be best.
I searched the web for something like this but didn't find anything.
If you want to convert the PPT/PPTX files to PDF and then join those converted PDF files into a single PDF using either .NET or Java, you may try Aspose.Slides and Aspose.Pdf.Kit components.
Aspose.Slides allows you to convert the PPT/PPTX files to PDF, and Aspose.Pdf.Kit allows you to join the PDF files into a single PDF. Please see whether this solution works for your scenario.
Disclosure: I work as a developer evangelist at Aspose.