How to add the content of a PDF page to another PDF page with PDFBox

I have two PDFs, each a single page with limited content. I want to merge the content of one PDF into the other on the same single page (it should not create a second page to hold the added content). Is there any way to do this using PDFBox or PDF Clown?

Related

How to manage bookmarks from XSL-FO after merging multiple PDF using PDFBox

I have written an XSL-FO file as a template for generating a PDF containing 3 bookmarks, each one of them pointing to a different PDF.
The thing is, I generate the 3 PDFs and then merge them together using PDFBox.
At the time of generation, the first PDF doesn't know about the other 2 so the merged PDF bookmark links don't work.
Should I rethink how I managed my bookmarks (inside XSL-FO) and use PDFBox tools instead? If so, how should I proceed?

Create thumbnails using MigraDoc or PDFsharp

We have a need to take a single PDF file, break it into separate page thumbnails, and based on user input, put together selected pages into a new PDF document.
Can someone show a quick example of how to take a single PDF document and generate a thumbnail preview of each page using either MigraDoc or PDFsharp?
Those who read FAQ lists will know that neither PDFsharp nor MigraDoc can render PDF files.
To create thumbnails from PDF pages you have to render them.
You'll need a different library to create thumbnails.
http://pdfsharp.net/wiki/PDFsharpFAQ.ashx

Export PDF Page contents to individual pages

I have a pdf document which contains more than one page within each page.
The original document is only 2 pages - size A4, but has multiple pages on each of the 2 pages.
I need to export each of these "pages within each page" to an individual pdf page.
I have tried increasing the zoom of the pages and printing from there, but it prints incorrectly.
What could I do within Adobe Reader or a similar program to export each of these pages as its own PDF page?
Link to PDF
Within Acrobat reader, you could make a clever use of custom poster printing (possibly to print as a new PDF):
https://apple.stackexchange.com/questions/12305/split-a-single-page-pdf-into-multiple-pages
Otherwise you can do any of these:
Splitting single page into two pages with ghostscript
Alternatively you could use other tools such as Inkscape to do the splitting.

Page Templates with Form XObject in PDF

I'm writing a PDF generation library and wanted to add the ability to use other PDFs as templates. The specification notes that a TemplateInstantiated property on pages, with the alias of the template object, should be all that is needed.
Here is a gist of the pdf content:
https://gist.github.com/tyre/89c12f8203181f078001
The template itself is stored in object 16 and the page in object 19.
qpdf --check reports the PDF as invalid:
WARNING: tmp/alpaca.pdf: file is damaged
WARNING: tmp/alpaca.pdf (file position 32089): xref not found
WARNING: tmp/alpaca.pdf: Attempting to reconstruct cross-reference table
checking tmp/alpaca.pdf
PDF Version: 1.7
File is not encrypted
File is not linearized
I'm afraid your PDF document is completely and utterly broken and that you have misunderstood a number of key concepts. You cannot simply incorporate a complete PDF file into another PDF file in the way you have done and expect that to work.
The template system you are referring to is intended to include "hidden" pages (pages not referenced in the page tree of the PDF file) in the context of an interactive form document, or an interactive document in general. That doesn't sound like what you are intending to do. And these pages need to be valid PDF pages. In other words, you cannot just include the original PDF document verbatim and expect the PDF reader to sort things out; you need to insert a syntactically correct PDF page object.
What you want to do is take the content of a document and apply that as a background to a document. This most commonly is done using XObjects. Pseudo-code for this could be:
Open the original PDF document
Open the "template" document
Read the template document and copy all elements from the template page into a newly created XObject in the original PDF document.
Modify the page contents of the pages in the original PDF document to paint the new XObject at the beginning of the page description of the existing pages.
It's important to note that again, you're not supposed to simply insert the template document into the stream for the newly created XObject. You will have to create a valid XObject that contains a properly formed resources dictionary referencing all resources needed by your XObject, and that contains the content stream from your template document.
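The last step of the pseudo-code, painting the XObject ahead of the existing page description, can be sketched at the content-stream level roughly as follows. This is only an illustration in plain Java, not a call into any PDF library: the class name, the method name and the resource name `Tpl0` are hypothetical, and the sketch assumes the page content is already available as decoded operator text and that a form XObject with that name has been registered under the page's /Resources /XObject dictionary. The `q`/`Q` pairs save and restore the graphics state so that the template and the original content do not disturb each other.

```java
// Hypothetical helper (not part of PDFBox or any other library): builds a page
// content stream that paints a form XObject before the page's own operators.
final class TemplateUnderlay {

    private TemplateUnderlay() {}

    /**
     * @param xobjectName name of the form XObject in /Resources /XObject,
     *                    without the leading slash (assumed to be registered)
     * @param pageContent the page's existing, already decoded content stream
     * @return a new content stream that paints the XObject first
     */
    static String prependFormPaint(String xobjectName, String pageContent) {
        // Keep the sketch simple: accept only names that need no #xx escaping.
        if (xobjectName == null || !xobjectName.matches("[A-Za-z0-9_.-]+")) {
            throw new IllegalArgumentException("unsupported XObject name: " + xobjectName);
        }
        StringBuilder sb = new StringBuilder();
        sb.append("q\n");                                   // save graphics state
        sb.append('/').append(xobjectName).append(" Do\n"); // paint the form XObject
        sb.append("Q\n");                                   // restore graphics state
        sb.append("q\n");                                   // isolate the original content
        sb.append(pageContent);
        if (!pageContent.endsWith("\n")) {
            sb.append('\n');
        }
        sb.append("Q\n");
        return sb.toString();
    }
}
```

For example, `TemplateUnderlay.prependFormPaint("Tpl0", "BT /F1 12 Tf (Hi) Tj ET")` yields a stream that starts with `q`, `/Tpl0 Do`, `Q` and then contains the original text operators wrapped in their own `q`/`Q` pair. A real implementation would let a PDF library write this stream and manage the resources dictionary.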
As already indicated in the comments, the PDF presented by the OP is structurally defective: the cross-reference table position and entries are wrong. Furthermore, the transition from one PDF revision to the next update looks questionable. Essentially, therefore, the OP will have to provide a sample PDF which is at least syntactically correct.
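The "xref not found" warning corresponds to a simple consistency check: the number following the last `startxref` keyword must be the byte offset of a cross-reference section. A minimal sketch of that check in plain Java is shown here. The class and method names are hypothetical, and the sketch only recognises classic cross-reference tables that begin with the `xref` keyword; a file using a cross-reference stream has an object header at that offset and would be reported as `false`.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical checker (not qpdf's implementation): verifies that the
// startxref offset of a PDF points at a classic "xref" table.
final class XrefOffsetCheck {

    private XrefOffsetCheck() {}

    /** Returns the offset written after the last "startxref", or -1 if absent or malformed. */
    static long readStartXref(byte[] pdf) {
        // ISO-8859-1 maps every byte to one char, so string indices equal byte offsets.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        int keyword = text.lastIndexOf("startxref");
        if (keyword < 0) {
            return -1;
        }
        int i = keyword + "startxref".length();
        while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
            i++;
        }
        int start = i;
        while (i < text.length() && text.charAt(i) >= '0' && text.charAt(i) <= '9') {
            i++;
        }
        if (i == start) {
            return -1; // no digits after the keyword
        }
        try {
            return Long.parseLong(text.substring(start, i));
        } catch (NumberFormatException e) {
            return -1; // digit run too long to be a valid offset
        }
    }

    /** True if the bytes at the startxref offset spell the keyword "xref". */
    static boolean pointsAtXrefTable(byte[] pdf) {
        long offset = readStartXref(pdf);
        if (offset < 0 || offset + 4 > pdf.length) {
            return false;
        }
        String found = new String(pdf, (int) offset, 4, StandardCharsets.ISO_8859_1);
        return found.equals("xref");
    }
}
```

Calling `XrefOffsetCheck.pointsAtXrefTable(bytes)` on the bytes of a generated file gives a quick sanity test before handing the file to a stricter validator such as `qpdf --check`.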
That being said, though, the OP indicated he was
writing a PDF generation library and wanted to add the ability to use other PDFs as templates. The specification notes that a TemplateInstantiated property on pages, with the alias of the template object, should be all that is needed.
The Named Pages mechanism is not meant for something like that. Its main current use (if it is used at all) is in the context of spawning page templates by AcroForm actions.
For using pages from other PDFs, one can simply copy them (and the other objects they reference) from the source PDF if they are to be used as separate pages as is; if multiple templates are to be put onto a single target page, one can wrap the copied sources into form XObjects and include them in the target page.

How to identify the "Table of contents" page using PDFBox

I am using the Apache PDFBox framework to read PDF text content.
I have to get the content from the "Table of Contents" page (if present in the PDF), so I need to be able to identify the table of contents page through the PDFBox API.
Kindly provide your suggestions.
The table of contents in a PDF file is not easily identified by any structure you can just pull from the PDF document. You will have to do text extraction and identify the table of contents by its properties.
PDF in general doesn't contain content structure such as table of contents, chapters, headers, footers or even paragraphs or lines of text.
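One property that often distinguishes a table of contents is that most of its lines end in a page number, usually after dot leaders or a run of spaces. The following is a rough heuristic sketch in plain Java, assuming the text of each page has already been extracted (for instance with PDFBox's `PDFTextStripper`). The class name, the regular expression and the thresholds are all assumptions to be tuned per document set, and extraction that collapses spacing will reduce the hit rate.

```java
import java.util.regex.Pattern;

// Hypothetical heuristic (not a PDFBox API): decides whether the extracted
// text of one page looks like a table of contents.
final class TocHeuristic {

    // A TOC-like line: some title text, then at least three dots and/or
    // whitespace characters, then a page number of one to four digits.
    private static final Pattern ENTRY =
            Pattern.compile("^\\s*\\S.*?[.\\s]{3,}\\d{1,4}\\s*$");

    private TocHeuristic() {}

    /**
     * @param pageText   extracted text of a single page
     * @param minEntries minimum number of TOC-like lines required
     * @param minRatio   minimum share of non-blank lines that must be TOC-like
     */
    static boolean looksLikeToc(String pageText, int minEntries, double minRatio) {
        int entries = 0;
        int nonBlank = 0;
        for (String line : pageText.split("\\R")) { // split on any line break
            if (line.trim().isEmpty()) {
                continue;
            }
            nonBlank++;
            if (ENTRY.matcher(line).matches()) {
                entries++;
            }
        }
        return entries >= minEntries && entries >= minRatio * nonBlank;
    }
}
```

A caller would run `TocHeuristic.looksLikeToc(text, 3, 0.5)` on each page's text in turn and treat the first matching pages as table of contents candidates, then confirm them with further checks such as a "Contents" heading.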