How to merge PDFs into a PDF/A-1b file with watermarks using iText 5

Here is what I need to do:
Merge several PDF documents (which may or may not be PDF/A) into one PDF/A-1b file.
Add a watermark (a simple text label) on each page of the resulting PDF.
It has to be done with iText 5.
I have looked at this official merging example: http://developers.itextpdf.com/examples/merging-pdf-documents/adding-cover-page-existing-pdf
But can this method be used to create a PDF/A file, and also add watermarks?
Or am I stuck with this other method, which the author specifically says not to use: http://developers.itextpdf.com/examples/merging-pdf-documents-itext5/how-not-merge-documents

You can create files that conform to PDF/A-1b with just about any PDF library, including iText. PDF/A is, in general, a subset of ISO 32000 (PDF), so it's really just a matter of using the tool to do what you need with the files while not adding anything that PDF/A-1b forbids (in your case).
The thing to be aware of is that iText, or any of the other libraries that "support" PDF/A, will not prevent you from modifying a PDF in a way that is forbidden by PDF/A... you just need to know what those things are.
So... before merging, you'll want to be sure that the input files don't have any annotations or form fields or any other interactive content.
After merging, add your watermark as page content, make sure your XMP metadata conforms, and you should be OK.
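
For the XMP part of that advice, here is a minimal sketch in plain Java of the packet text that identifies a file as PDF/A-1b. The class and method names (PdfA1bXmp, buildPacket, escapeXml) are made up for illustration and are not part of iText; the pdfaid:part and pdfaid:conformance properties are the identification entries that PDF/A expects. This covers identification only: the merged file still has to be free of forbidden content, and a PDF/A validator should have the final word.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not an iText class): builds the XMP packet text
// that declares PDF/A part 1, conformance level B.
class PdfA1bXmp {

    // Escape the characters that would break the XML element content.
    static String escapeXml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Returns the XMP packet as a string; encode it as UTF-8 before use.
    static String buildPacket(String title) {
        String nl = "\n";
        return "<?xpacket begin=\"\uFEFF\" id=\"W5M0MpCehiHzreSzNTczkc9d\"?>" + nl
             + "<x:xmpmeta xmlns:x=\"adobe:ns:meta/\">" + nl
             + " <rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\">" + nl
             + "  <rdf:Description rdf:about=\"\"" + nl
             + "    xmlns:pdfaid=\"http://www.aiim.org/pdfa/ns/id/\"" + nl
             + "    xmlns:dc=\"http://purl.org/dc/elements/1.1/\">" + nl
             // The two properties that mark the file as PDF/A-1b.
             + "   <pdfaid:part>1</pdfaid:part>" + nl
             + "   <pdfaid:conformance>B</pdfaid:conformance>" + nl
             + "   <dc:title><rdf:Alt><rdf:li xml:lang=\"x-default\">"
             + escapeXml(title) + "</rdf:li></rdf:Alt></dc:title>" + nl
             + "  </rdf:Description>" + nl
             + " </rdf:RDF>" + nl
             + "</x:xmpmeta>" + nl
             + "<?xpacket end=\"w\"?>";
    }

    public static void main(String[] args) {
        // "Merged document" is a placeholder title.
        byte[] xmp = buildPacket("Merged document").getBytes(StandardCharsets.UTF_8);
        System.out.println(new String(xmp, StandardCharsets.UTF_8));
    }
}
```

The resulting bytes would then be handed to whatever call your iText 5 writer or stamper offers for setting XMP metadata; I believe that is setXmpMetadata(byte[]), but check it against the version you are using. Any values you also store in the document information dictionary (title, author and so on) should agree with what the packet says.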

Related

Itext: insert PDFs *while* generating a PDF

All the examples I can find seem to assume you are merging existing PDFs, and in particular sticking a bunch of PDFs at the end of another PDF.
In my situation I am generating something analogous to a bibliography but if the reference is to a file of PDF format, they want the contents of that PDF inserted inline, immediately after the citation, not out of place at the end of the file.
Note that there is no guarantee that the external PDFs use the same page size, rotation, etc, as the PDF document I'm generating.
Is there a way to do this? I've tried modifying the iText example on how NOT to do this (with a PdfWriter), but I get an unbalanced save/restore state message. I'm also considering doing this by post-processing, as all the examples do, but I'm not quite sure how I'd go about looping through the bibliography PDF and determining where to insert the pages of the external PDFs.
Thanks.

Extract font and its corresponding cmap in PDF

I have tried several ways to extract fonts from a PDF, viz. FontForge, MuPDF, a PDF parser in C#, and also some Python scripts. But I am confused about how to get the exact pairing of a font and its CMap embedded in the PDF. Please point me to the right approach for getting exact pairs of fonts and their CMaps.
As mentioned in my first comment, that should be easy using iText or iTextSharp or any other such library that allows you to access low-level PDF objects.
In case of iText(Sharp), ListUsedFonts.java and ListUsedFonts.cs can present starting points for you; they inspect all the font dictionaries in a PDF file accessible via at least one page. Instead of the simple output of those examples, simply export all the information you need. For this, ISO 32000-1:2008 should be your reference guide.

Load auto paged pdf in iOS like iBook

In iBooks, when you open a PDF, it is automatically reformatted and repaginated. For example, a document that has 5 pages on an iPhone contains only 2 pages when viewed on an iPad.
When you change the text size, the pages are also updated automatically.
How to do this using CGPDFDocumentRef?
I'm assuming you are talking about Apple iBooks on the iPad? Are you sure you are observing the behavior of a PDF and not an ePub file?
The native format of iBooks is either ePub or the format created by iBooks Author.
PDF files are usually (in the vast majority of cases) used in a non-reflowing way. Reproducing the exact visual appearance of pages - explicitly without reflow - is exactly why PDF was invented.
There are constructs you can add to PDF files to make them a little more like formats such as HTML and ePub; these constructs can tag text with styles, logically define paragraphs, columns and tables, and so on. Usually they are used to make a PDF file suitable for long-term archiving (according to the ISO PDF/A standard) or accessible (suitable for reading by screen-reader software for vision-impaired people, for example). Such a PDF file is commonly referred to as a tagged PDF.
As far as I know iBooks doesn't actually support tagged PDFs (meaning, it doesn't use the information in such a PDF file to reflow the file). And as far as I know you cannot create the necessary tags and structure with the built-in iOS library.
If your target app is iBooks, you'd probably be better off looking into generating ePub...

Enabling select and copy of text content in PDF

What can prevent PDF-1.4 document's content from being selectable and copyable?
I'm generating PDF-1.4 documents using TTF fonts, which are successfully embedded in it (see screenshot below).
Yet I can't select and copy the text from the document. I have studied the PDF-1.4 spec and found only one mention of copy-protecting the document, which has a prerequisite of first encrypting it. And I don't encrypt the document.
So, ideally, I'd like to discover an exhaustive list of reasons, that can prevent the PDF text from being copied, and ways to control that.
There is only one reason: you are embedding your fonts partially. The information you are storing there is the minimum required for drawing the glyphs, but it is not enough to allow text extraction. For example, in Acrobat Professional, optimizing a file to reduce its size will have this effect, since everything that is not strictly required for presenting the content will be discarded.

Save out a new PDF with updates from users

In my iOS app, I would like to regenerate an existing pdf into another pdf after the users are done annotating on the existing pdf.
My regenerated pdf should be an exact replica of the existing pdf but should have embedded annotations and highlights etc which can be opened and viewed on desktops as well.
I have done some research on this including the solutions proposed on other SO posts. I have tried libharu etc.
But somehow I am not able to convert an existing PDF into a replica PDF. I am only able to add annotations to a new PDF that I create using libharu.
So my problem now is how to copy the existing PDF as is into my regenerated PDF. Any pointers would be very helpful.
My understanding is that a library that can save back out a PDF with "true" annotations (those that can be hidden in Acrobat, for example) is not something that exists in a FOSS solution.
LibHaru, for example, only supports creating new PDFs, not editing or appending existing PDFs. From their homepage:
At this moment libHaru does not support reading and editing existing
PDF files and it's unlikely this support will ever appear.
You can render the PDF page by page and then re-save it with some additional information. This S.O. question has a reasonable-looking piece of code. That approach will save any "annotations" as part of the page image in the PDF itself, though, rather than as true annotations.
You might try a paid library like PDFNet.