Itext: insert PDFs *while* generating a PDF - pdf

All the examples I can find seem to assume you are merging existing PDFs, and in particular sticking a bunch of PDFs at the end of another PDF.
In my situation I am generating something analogous to a bibliography but if the reference is to a file of PDF format, they want the contents of that PDF inserted inline, immediately after the citation, not out of place at the end of the file.
Note that there is no guarantee that the external PDFs use the same page size, rotation, etc, as the PDF document I'm generating.
Is there a way to do this? I've tried modifying the itext example on how NOT to do this (with a PdfWriter) but I get an unbalanced save/restore state message. I'm also considering doing this by post-processing as all the examples do, but I'm not quite sure how I'd go about looping through the bibliography PDF and determining where to insert the pages of the external PDFs.
Thanks.

Related

How to merge PDFs into a PDFA1b with watermarks using iText5

Here is what I need to do:
Merge several PDF documents (which may or may not be PDFA) into one PDFA1b.
Add a watermark (a simple text label) on each page of the resulting PDF.
It has to be with iText 5
I have looked at this official merging example: http://developers.itextpdf.com/examples/merging-pdf-documents/adding-cover-page-existing-pdf
But can this method be used to create a PDFA, and also add watermarks?
Or am I stuck with using this other method which he specifically says not to use: http://developers.itextpdf.com/examples/merging-pdf-documents-itext5/how-not-merge-documents
You can create files that conform to PDF/A-1b with just about any PDF library including iText. PDF/A, in general, is a subset of ISO 32000 (PDF) so it's really just a matter of using the tool to do what you need to with the files but not adding anything that is forbidden by PDF/A-1b (in your case).
The thing to be aware of is that iText or any of the other libraries that "support" PDF/A, will not prevent you from modifying PDF in a way that is forbidden by PDF/A... you just need to know what those things are.
So... before merging, you'll want to be sure that the input files don't have any annotations or form fields or any other interactive content.
After merging, add your watermark as page content and be sure your XMP metadata is conforming and you should be OK.

Cannot select PDF from top to bottom

I'm using pdftotext to extract info from a pdf. Currently using the -raw option. I do have a few problems with the PDFs I'm working with. If I select the text from top to bottom it selects in the following fashion.
PDF content:
A
B
C
It selects A then C and then B. So when I extract the text it is presented in the same way. Is there a way to reformat the PDF so I can select the content from top to bottom?
NOTE: I'm aware that if I omit the "raw" option the layout will be preserved, but it seems to be buggy when the document includes tables so raw works better for me.
Yes, you can reformat the PDF so that the content is returned from top to bottom. This is not something that can be easily done using Adobe Acrobat or any other viewer that I am aware of and here is why.
From the documentation of pdftotext, the -raw option is defined as
Keep the text in content stream order. This is a hack which often "undoes" column formatting, etc. Use of raw mode is no longer recommended.
"content stream order" is the important piece in the description.
In PDFs, the content on the page does not have to be written in the content stream (the instructions that are interpreted to display the page) in the order that a human would read the content when the page is rendered. The internals of PDFs do not care about the ordering, they were designed to reproduce the same visualization of a document on a variety of platforms. Since all that matters to PDF is the visualization, applications or libraries that write PDF tend to not order the content stream in any meaningful way.
So you can reorder the instructions in a content stream so that they are in the order a human would read them, it is not an easy task to do by hand and using a library that understands PDF to manipulate the content stream would be one way of doing this. Another way is to look for a more advanced tool to use to extract text from the PDF (there are a number of tools that will look at the placement of the content on a page rather than just where it appears in the content stream).
I am not aware of anything that will reorder the content stream in the PDF based on where the content appears on the page automatically though.

Enabling select and copy of text content in PDF

What can prevent PDF-1.4 document's content from being selectable and copyable?
I'm generating PDF-1.4 documents using TTF fonts, which are successfully embedded in it (see screenshot below).
Yet I can't select and copy the text from the document. I have studied the PDF-1.4 spec and found only one mention of copy-protecting the document, which has a prerequisite of first encrypting it. And I don't encrypt the document.
So, ideally, I'd like to discover an exhaustive list of reasons, that can prevent the PDF text from being copied, and ways to control that.
There is only one reason, you are embedding your fonts partially. The information you are storing there is the minimum required for drawing the glyphs, but it is not enough for allowing text extraction. For example, in Acrobat Professional, optimizing a file for reducing file size will have this effect, since everything that is not strictly required for presenting the content will be discarded.

PDF data extraction gives symbols/gibberish?

I have a piece of software called PDF2XL which is normally great for extracting tables of data from PDF files. I've used it with hundreds of files before.
This one file though, gives me gibberish output that I can't even copy and paste into this textarea correctly. All sorts of unicode weirdness.
If I copy and paste as per normal into excel/notepad I get the same issue.
I assume it's something to do with a messed up character encoding header in the PDF file? How can I change this? I'm on Windows and have no software that can edit PDFs, so if I need to edit/re-save it, please recommend a free piece of SW to do it.
Thanks!
There are an increasing number of PDF files the used subsetted fonts which is basically a custom encoding. Normally the font descriptor in the PDF should have a ToUnicode table to allow the text extraction to decode the font encoding and return the correct text.
Some PDF producers are doing this on purpose to prevent easy PDF text extraction for things such as financial reports. If there is only one font then you could manually decode the font but in my experience I have seen PDF's with multiple random encodings which makes it nearly impossible to decode automatically.
One way to test for these types of PDF's is to open the file in Acrobat, select some text, copy it and then paste it into Notepad. If the text is garbled then the PDF is using a subsetted font and there is not much more you can do. If Acrobat can't extract the text correctly then nothing else can. It may as well be a page of hieroglyphs.

A better file format than PDF or EPUB?

My client wants us to build a custom document viewer for their app. (It really, truly needs to be custom, because there are a ton of application-specific features they need.)
We built one for them last year that took PDFs, generated page images, and backed the images using a hidden layer of text that could be selected and copied. We did it in Flex. It was a nightmare. PDF is horrid.
This year, we need to build one in HTML 5 with similar requirements, except that most of the documents now are in Word or HTML, that is, they have reflowable text, instead of the fixed layout and glyphs of PDF. But they still want to do PDF in the same viewer.
I'm thinking that we need to convert all documents to some common file format that can handle both reflowable text and also the fixed-position glyphs of PDF. (Each document would probably support one or the other, but not both). It would be nice if it were an XML-like markup language that would say:
<text>here's some text</text>
-- or --
<glyph letter="a" name="my_a_glyph" position="10,10"/>
<image src="my_image" position="20,20"/>
or something like that.
Is there any existing file format out there that can handle it? EPUB won't do the fixed-position text, and PDF sucks in too many ways to describe.
I think you can look at FB2 (FictionBook 2) format . That is an XML-based format, designed for publishing books. It includes images, though I am not sure if they can be aligned absolutely.
Also, you can simply go with HTML and do HTML-to-PDF rendering when needed (there exists various components and libraries for this). I don't see (or you have not listed) any reasons why this way doesn't work.
GROFF? Maybe build a macro library to customize it, as needed.
Groff/troff/nroff, the "run off" programs of Unix, can output to postscript or HTML. The jump from postscript to PDF is built in to some PDF viewers; there are also several existing programs for it, pstopdf, for example.
GROFF has some fixed layout options and some flow-like options. With GROFF, it's almost easier to base most of the printout on flowing text, within proscribed bounds.