PDF files Stitching with cover page and disclaimer pages - pdf

The possibility to stitch multiple PDF files as one merged PDF file.
We need to stitch multiple PDF financial Report as a single merged PDF file( One PDF report package).
The first page of each report is the cover page, then follows by report body, and last
one or more pages contain disclaimer info. The table content on the cover page is
different for each of the seperated Reports before stiting; and the disclaimers info is also some difference from each other.
What we want to do is that the merged PDF package after stitching should contains
one cover page with a new table content info, such as the page numbers for each section
in this stitched PDF report package; and one set of combined disclosures info on the last
one or several pages should covered all disclaimer info, no missing and no repeating any
of the disclaimer info from the speared reports before stitching.
The report bodies themselves should be concatenated together in the between of the cover
page and first disclaimer pager after merged PDF package.

If you are using either Java or C#, you can probably use the iText PDF library. In particular, see the 'addPage()' method of the PdfCopy class:
PdfCopy Class addPage method

If commercial products are an option for you, and you are targeting Windows, you could use Amyuni PDF Creator ActiveX (C++, Delphi, VB, ASP Classic, etc) or Amyuni PDF Converter .Net (C#, VB.NET etc) for this task, specifically the Append method. You can get customer support during evaluation period.
From the documentation:
Append Method
The Append method can be used to append or concatenate a PDF file to
the current document. Syntax:
C++: HRESULT Append ([in] BSTR FileName,[in] BSTR Password)
C#: void Append (string fileName, string password)
VB: Sub Append (FileName As String, Password As String)
You may also need to use DeletePage and Merge methods. Merge method is for drawing the content of a PDF file on top of another PDF file.
Usual disclaimer applies.

Related

How to merge PDFs into a PDFA1b with watermarks using iText5

Here is what I need to do:
Merge several PDF documents (which may or may not be PDFA) into one PDFA1b.
Add a watermark (a simple text label) on each page of the resulting PDF.
It has to be with iText 5
I have looked at this official merging example: http://developers.itextpdf.com/examples/merging-pdf-documents/adding-cover-page-existing-pdf
But can this method be used to create a PDFA, and also add watermarks?
Or am I stuck with using this other method which he specifically says not to use: http://developers.itextpdf.com/examples/merging-pdf-documents-itext5/how-not-merge-documents
You can create files that conform to PDF/A-1b with just about any PDF library including iText. PDF/A, in general, is a subset of ISO 32000 (PDF) so it's really just a matter of using the tool to do what you need to with the files but not adding anything that is forbidden by PDF/A-1b (in your case).
The thing to be aware of is that iText or any of the other libraries that "support" PDF/A, will not prevent you from modifying PDF in a way that is forbidden by PDF/A... you just need to know what those things are.
So... before merging, you'll want to be sure that the input files don't have any annotations or form fields or any other interactive content.
After merging, add your watermark as page content and be sure your XMP metadata is conforming and you should be OK.

ghostscript extract pages containing a text string

i need to programmatically extract from a multipage pdf, only the pages containing a text string. Is it possible or i need some other tools? I'm working on aix.
thanx in advance
OK firstly Ghostscript doesn't extract pages from PDF files. It creates brand new PDF files whose visual appearance should be the same as the original, but whose content will be different.
There is no way to do this with Ghostscript in a single pass. You could use the txtwrite device to extract the text then grep through the output files for the text you want, note the page numbers and then run another pass to get those pages into new files.
Be aware that extracting text from a PDF file is far from guaranteed to work! That was not the intent of the original PDF format.
Also note that GHostscript currently only allows for handling a single range of pages, First->Last, so if you have a discontinuous set (eg pages 1, 3, 5, 7 etc) then you will have to run this step multiple times.

A Table of Contents Page for a Scanned PDF

I was given some really old but very useful hand-written notes recently and in a bid to preserve them, I had them scanned into a file in the PDF format. What I have is a 35 page PDF but I want to add a contents page at the beginning so that I can use the first page to click my way to a specific topic.
More precisely,
I want a page which says
Topic 1
Topic 2
Topic 3
...
Each one should be linked to a page of my choosing.
I've explored a lot of standard tools out there to help me with this, like LibreOffice, pdftk etc. but the solution does not appear to be in the form of a simple application and a few clicks. My hunch is that this will require a program written in a suitable language. The way I'd want this program to work as follows:
ProgramName Input.pdf CustomTOC.txt
Where CustomTOC.txt could be a simple ASCII table containing two columns, one column being the title and the second column being the page number. The output of this program will be another PDF file which contains one page appended at the beginning of Input.pdf containing a table of contents with hyperlinks to the right pages.
I have managed to solve this problem though I don't think this is the best way to do it. I have written a Python program that accepts two mandatory inputs - the input PDF file and '|' separated ASCII table containing columns and page numbers. A third optional output can be the name of a PDF file which contains the output. If this is not provided then the original input file is rewritten.
How the code works? Uses a system call to 'pdftk' for bursting the PDF file into its constituent pages. Writes a .tex file which contains a \listoffigures command for the first page with the package hyperref ensuring it links to the figures. The later part of the .tex code contains several figure insertion statements where the PDF file corresponding to each page is inserted, providing captions only to those PDFs for which there is an entry in the provided TOC table.
Why the code is not ideal? It relies on too many dependencies. It relies on a system call to the pdftk package, it requires that LaTeX be also installed on the machine with the graphics package. In the current version of the code, the PDFs on each page do have some offset which I am trying to solve using geometry package with custom margin settings. I will try to post the code once this problem is solved.
A more ideal solution. That which does not require LaTeX and can use some PDF library within Python to achieve the same effect. Comments and suggestions welcome!

Page Templates with Form XObject in PDF

I'm writing a PDF generation library and wanted to add the the ability to use other PDFs as templates. The specification notes a TemplateInstantiatedproperty on pages with the alias of the template object should be all that is needed.
Here is a gist of the pdf content:
https://gist.github.com/tyre/89c12f8203181f078001
The template itself is stored in object 16 and the page in object 19.
qpdf --check reports the PDF as invalid:
WARNING: tmp/alpaca.pdf: file is damaged
WARNING: tmp/alpaca.pdf (file position 32089): xref not found
WARNING: tmp/alpaca.pdf: Attempting to reconstruct cross-reference table
checking tmp/alpaca.pdf
PDF Version: 1.7
File is not encrypted
File is not linearized
I'm afraid your PDF document is completely and utterly broken and that you have misunderstood a number of key concepts. You cannot simply incorporate a complete PDF file into another PDF file in the way you have done and expect that to work.
The template system you are referring to is intended to include "hidden" pages - not referenced in the pages tree in the PDF file - in the context of an interactive form document (or interactive document in general). That doesn't sound like what you are intending to do. And these pages need to be valid PDF pages. You can in other words not just include the original PDF document verbatim and expect the PDF reader to sort things out; you need to insert a syntactically correct PDF page object.
What you want to do is take the content of a document and apply that as a background to a document. This most commonly is done using XObjects. Pseudo-code for this could be:
Open the original PDF document
Open the "template" document
Read the template document and copy all elements from the template page into a newly created XObject in the original PDF document.
Modify the page contents of the pages in the original PDF document to paint the new XObject at the beginning of the page description of the existing pages.
It's important to note that again, you're not supposed to simply insert the template document into the stream for the newly created XObject. You will have to create a valid XObject that contains a properly formed resources dictionary referencing all resources needed by your XObject, and that contains the content stream from your template document.
As already indicated in comments, the PDF presented by the OP is structurally defect, the cross reference table position and entries are wrong. Furthermore the transition from one PDF revision to a next update looks questionable. Essentially, therefore, the OP will have to provide a sample PDF which is at least syntactically correct.
That been said, though, the OP indicated he was
writing a PDF generation library and wanted to add the the ability to use other PDFs as templates. The specification notes a TemplateInstantiatedproperty on pages with the alias of the template object should be all that is needed.
The Named Pages mechanism is not meant for something like that. Its main current use (if it is used at all) is in the context of spawning page templates by Acroform actions.
For using pages from other PDFs, one can simply copy them (and the referenced other objects) from the source PDF if they are to be used as separate pages as is; and if multiple templates are to be put onto a single target page, one can wrap the copied sources into form xobjects and include them in the target page.

iTextSharp - when extracting a page it fails to carry over Adobe rectangle highlighting important info

Per the following site...
http://forums.asp.net/t/1630140.aspx?extracting+pdf+pages+using+itextsharp
...I use the function ExtractPages to produce a new PDF based on range of page numbers. My problem is that I noticed a PDF that had a rectangle on the 2nd page was not extracted along with the page. This causes me some fear that perhaps Adobe comments are not being carried over as well as the pages get extracted.
Is there a way I can adjust this code to take into consideration to bring over comments and objects like rectangles to the new PDF when ExtractPages is called? Am I missing a syntax or is that not available with version 5.5.0 of iTextSharp?
Your use of the verb extract in the context of extracting pages is confusing. People will think you want to extract text from a page. In reality, you want to import or copy pages.
The example you refer to uses PdfWriter. That's wrong: you should use PdfStamper (if only one existing PDF is involved) or PdfCopy (if multiple existing PDFs are involved). See my answer to the question How to keep original rotate page in itextSharp (dll) to find out why the example on forums.asp.net is a really, really bad example.
The fact that a page has "a rectangle" is irrelevant. Maybe the rectangle is an annotation. In that case, you're throwing that rectangle away by using the wrong example. Maybe the origin of the page is different from 0,0...
If your purpose is to create a new PDF containing only a selection of pages of the original PDF, please read my answer to Function that I can use to remove a single page from a PDF using iText