Extracting a page from PDF document (using PDFBox)

I'm trying to break a PDF down into individual pages. Although it functionally works, the PDF for each page ends up being almost the size of the original PDF (250 MB). I've seen some references to deleting annotations, which might include links to other pages/resources. I've tried the code below, but no luck. Can someone let me know what I'm doing wrong?
(The code below is in Kotlin.) I've also tried using addPage instead of importPage, since the latter creates a deep copy. Same result.
doc.pages.forEachIndexed { idx: Int, p: PDPage ->
    val newDoc = PDDocument()
    val newPage = newDoc.importPage(p)
    newPage.annotations = null
    newPage.resources = null
    newDoc.save("/tmp/$idx.pdf")
    newDoc.close()
}

Related

Adding an Annotation to a PdfFormXObject so the Annotation is reusable

I'm using iText 7 to construct reusable PDF components that I reuse across multiple pages within a document. I'm using iText-dotnet for this task (v7), using F# as the language. (This shouldn't be hard to follow for non-F# people as it's just iText calls :D)
I know how to add annotations to a Page, that isn't the issue. Adding the annotation to the page is as simple as page.AddAnnotation(newAnnotation).
Where I'm having difficulty is that there is no "Page" associated with a Canvas when you use a PdfFormXObject() to render a PDF fragment.
let template = new PdfFormXObject(rect)
let templateCanvas = PdfCanvas(template, pageContext.Canvas.GetPdfDocument())
let newCanvas = new Canvas(templateCanvas, rect)
Once I have the new Canvas, I try to write to the Canvas and add the Annotation via Page.AddAnnotation(). The problem is that there is no Page attached to the PdfFormXObject!
// Create the destination and annotation (destPage is the pageNumber)
let dest = PdfExplicitDestination.CreateFitB(destPage)
let action = PdfAction.CreateGoTo(dest)
let annotation = PdfLinkAnnotation(rect)
let border = iText.Kernel.Pdf.PdfAnnotationBorder(0f, 0f, 0f)
// set up the Annotation with action and display information
annotation
    .SetHighlightMode(PdfAnnotation.HIGHLIGHT_PUSH)
    .SetAction(action)
    .SetBorder(border)
|> ignore
// Try adding the annotation to the page BOOM! (There is *NO* page (null) associated with newCanvas)
newCanvas.GetPage().AddAnnotation(annotation) |> ignore // HELP HERE: Is there another way to do this?
The issue is that I do not know of a different way to set the Annotation on the canvas. Is there a way to render the annotation and just add the annotation directly to the canvas as raw PDF instructions?
Alternatively, is there a way to create a different reusable PDF fragment in iText so I can also reuse the GoTo annotation?
N.B. I could split off the annotations and then apply them every time I use the PdfFormXObject() on a new page, but that sort of defeats the purpose of reusing PDF fragments (templates) in my final PDF to reduce its size.
If you can point me in the right direction, that would be great.
Again, this is not about how to add an annotation to a Page(); that's easy. It's about how to add an annotation to a PdfFormXObject (or a similar mechanism that I'm unaware of for constructing reusable PDF fragments).
-- As per John's comments below:
I cannot seem to find any reference to single use annotations.
I'm aware of the following example link, so I modified it to look like this:
private static void Main(string[] args)
{
    try
    {
        PdfDocument pdfDocument = new PdfDocument(new PdfWriter("TestMultiLink.pdf"));
        Document document = new Document(pdfDocument);
        string destinationName = "MyForwardDestination";
        // Create a PdfStringDestination to use more than once.
        var stringDestination = new PdfStringDestination(destinationName);
        for (int page = 1; page <= 50; page++)
        {
            document.Add(new Paragraph().SetFontSize(100).Add($"{page}"));
            switch (page)
            {
                case 1: // First use of PdfStringDestination
                    document.Add(new Paragraph(new Link("Click here for a forward jump", stringDestination).SetFontSize(20)));
                    break;
                case 3: // Re-use the stringDestination
                    document.Add(new Paragraph(new Link("Click here for a forward jump", stringDestination).SetFontSize(10)));
                    break;
                case 42:
                    pdfDocument.AddNamedDestination(destinationName, PdfExplicitDestination.CreateFit(pdfDocument.GetLastPage()).GetPdfObject());
                    break;
            }
            if (page < 50)
                document.Add(new AreaBreak(AreaBreakType.NEXT_PAGE));
        }
        document.Close();
    }
    catch (Exception e)
    {
        Console.WriteLine($"Ouch: {e.Message}");
    }
}
If you dig into the iText source for iText.Layout.Link, you'll see that the String Destination is added as an Annotation. Therefore, I'm not sure if John's answer is true anymore.
Does anyone know how I can convert the Annotation to a Dictionary, and how I would go about adding the raw PdfDictionary info to the PdfFormXObject?
Thanks

@johnwhitington is correct.
Per the PDF specification, annotations can only be added to a page; they cannot be added to a form XObject. This is not a limitation of iText or any other PDF library.
Annotations cannot be reused; each annotation is a distinct object.

jspdf Some data is being cut

I'm using the jsPDF library to generate PDF files in an HTML page.
That part works really well.
But I have a problem with the resulting PDF.
The data is about three pages long, but when I check the downloaded PDF file, I see only one page and the rest is truncated.
Here's my code:
let pdfName = this.contractlist_detail.title
var doc = new jsPDF();
var NotoSansCJKjp;
doc.addFileToVFS('NotoSansCJKjp-Regular.ttf', VFS);
doc.addFont('NotoSansCJKjp-Regular.ttf', 'NotoSansCJKjp', 'Bold');
doc.setFont('NotoSansCJKjp', 'Bold');
doc.setFontSize(12);
var paragraph = data;
var lines = doc.splitTextToSize(paragraph, 150);
doc.text(15, 15, lines)
doc.save(pdfName + '.pdf');
How do I make all of my data visible in the downloaded PDF without it being truncated?

The jsPDF library doesn't handle multiple pages on its own. You need to add pages manually when content would be cropped (and you also need to work out yourself whether the text is cropped).
Here is the method to add a new page:
addPage method
A demo showing how to use this method is available in the "two page hello world" section.
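The manual page handling described above can be sketched roughly as follows. This is a minimal illustration, not jsPDF's own pagination: writePaged and its option names are made up for this example, the page height and line height are passed in explicitly as assumptions you would tune to your page format and font size, and the only jsPDF calls it relies on are doc.text(text, x, y) and doc.addPage().

```javascript
// Hypothetical helper: writes one line at a time and starts a new page
// whenever the next baseline would fall below the bottom margin.
// `doc` is expected to offer text(text, x, y) and addPage(), as jsPDF does.
function writePaged(doc, lines, { left, top, bottom, pageHeight, lineHeight }) {
  let y = top;
  let pageCount = 1;
  for (const line of lines) {
    if (y > pageHeight - bottom) {
      doc.addPage();   // continue on a fresh page
      pageCount += 1;
      y = top;         // restart at the top margin
    }
    doc.text(line, left, y);
    y += lineHeight;
  }
  return pageCount;
}

// Possible usage with the question's variables (values are assumptions;
// 297 is the height of a default A4 page in millimetres):
//   var lines = doc.splitTextToSize(paragraph, 150);
//   writePaged(doc, lines, { left: 15, top: 15, bottom: 15, pageHeight: 297, lineHeight: 6 });
//   doc.save(pdfName + '.pdf');
```

Writing line by line keeps the vertical position under your own control, which makes it easy to tell when a page is full.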

HTML string to PDF conversion

I need to create various reports in PDF format and email them to a specific person. I managed to load an HTML template into a string and am replacing certain "custom markers" with real data. At the end I have a fully viewable HTML file. This file must now be printed into PDF format, which I am able to do after following this link: https://www.appcoda.com/pdf-generation-ios/. My problem is that I do not understand how to determine the number of pages from the HTML file, as the PDF renderer requires creating the document page by page.

I know this is an old thread, but I would like to leave this answer here. I also used the same tutorial you mentioned, and here's what I did to produce multiple pages. Just modify the drawPDFUsingPrintPageRenderer method like this:
func drawPDFUsingPrintPageRenderer(printPageRenderer: UIPrintPageRenderer) -> NSData! {
    let data = NSMutableData()
    UIGraphicsBeginPDFContextToData(data, CGRect.zero, nil)
    // Let the renderer lay out every page before drawing
    printPageRenderer.prepare(forDrawingPages: NSMakeRange(0, printPageRenderer.numberOfPages))
    let bounds = UIGraphicsGetPDFContextBounds()
    // Half-open range: also safe when numberOfPages is 0
    for i in 0..<printPageRenderer.numberOfPages {
        UIGraphicsBeginPDFPage()
        printPageRenderer.drawPage(at: i, in: bounds)
    }
    UIGraphicsEndPDFContext()
    return data
}
In your custom PrintPageRenderer you can read numberOfPages to get the total page count.

How to check multiple PDF files for annotations/comments?

Problem: I routinely receive PDF reports and annotate (highlight etc.) some of them. I had the bad habit of saving the annotated PDFs together with the non-annotated PDFs. I now have hundreds of PDF files in the same folder, some annotated and some not. Is there a way to check every PDF file for annotations and copy only the annotated ones to a new folder?
Thanks a lot!
I'm on Win 7 64bit, I have Adobe Acrobat XI installed and I'm able to do some beginner coding in Python and Javascript
Please ignore the following suggestion, since the answers already solved the problem.
EDIT: Following Mr. Wyss' suggestion, I created the following code for Acrobat's Javascript console to be run only once at the beginning:
counter = 1;
// Open a new report
var rep = new Report();
rep.size = 1.2;
rep.color = color.blue;
rep.writeText("Files WITH Annotations");
Then this code should be applied to all PDFs:
this.syncAnnotScan();
annots = this.getAnnots();
path = this.path;
if (annots) {
    rep.color = color.black;
    rep.writeText(" ");
    rep.writeText(counter.toString() + "- " + path);
    rep.writeText(" ");
    if (counter % 20 == 0) {
        rep.breakPage();
    }
    counter++;
}
And, at last, one code to be run only once at the end:
//Now open the report
var docRep = rep.open("files_with_annots.pdf");
There are two problems with this solution:
1. The "Action Wizard" seems to always apply the same code afresh to each PDF. That means the "counter" variable, for instance, is meaningless: it will always be 1. More importantly, the variable "rep" will be unassigned when the middle code is run on different PDFs.
2. How can I make the code that should be run only once run only at the beginning or at the end, instead of running every time for every single PDF (as it does by default)?
Thank you very much again for your help!

This would be possible using the Action Wizard to put together an action.
The function to determine whether there are annotations in the document would be done in Acrobat JavaScript. Roughly, the core function would look like this:
this.syncAnnotScan(); // updates all annots
var myAnnots = this.getAnnots();
if (myAnnots != null) {
    // do something if there are annots
} else {
    // do something if there are no annots
}
And that should get you there.
I am not completely positive, but I think there is also a Preflight check which tells you whether there are annotations in the document. If so, you would create a Preflight droplet, which would sort out the annotated and not annotated documents.

Mr. Wyss is right, here's a step-by-step guide:
In Acrobat XI Pro, go to the 'Tools' panel on the right side
Click on the 'Action Wizard' tab (you must first make it visible, though)
Click on 'Create New Action...', choose 'More tools' > 'Execute Javascript' and add it to right-hand pane > click on 'Execute Javascript' > 'Specify Settings' (uncheck 'prompt user' if you want) > paste this code:
this.syncAnnotScan();
var annots = this.getAnnots();
var fname = this.documentFileName;
// Replace every comma (a plain string pattern would only replace the first one)
fname = fname.replace(/,/g, ";");
var errormsg = "";
if (annots) {
    try {
        this.saveAs({
            cPath: "/c/folder/" + fname,
            bPromptToOverwrite: false // make this 'true' if you want to be prompted on overwrites
        });
    } catch (e) {
        for (var i in e) {
            errormsg += (i + ": " + e[i] + " / ");
        }
        app.alert({
            cMsg: "Error! Unable to save the file under this name ('" + fname + "'- possibly an unicode string?) See this: " + errormsg,
            cTitle: "Damn you Acrobat"
        });
    }
}
annots = 0;
Save and run it! All your annotated PDFs will be saved to 'c:\folder' (but only if this folder already exists!)
Be sure to first enable JavaScript under 'Edit' > 'Preferences...' > 'Javascript' > 'Enable Acrobat Javascript'.
VERY IMPORTANT: Acrobat's JS has a bug that doesn't allow documents to be saved with commas (",") in their names (e.g., "Meeting with suppliers, May 11th.pdf" will produce an error). Therefore, the code above substitutes ";" for every ",".
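One detail worth knowing about the comma substitution: in JavaScript, calling replace with a plain string pattern changes only the first match, so a file name containing several commas needs a regular expression with the g flag. A minimal sketch of the difference (sanitizeFileName is a made-up helper name, not part of the Acrobat API, and the file name is just an example):

```javascript
// Hypothetical helper: replace every comma in a file name with a semicolon.
function sanitizeFileName(name) {
  // The /g flag applies the replacement to all commas, not just the first.
  return name.replace(/,/g, ";");
}

// With a string pattern, only the first comma is replaced.
var firstOnly = "Minutes, May 11th, final.pdf".replace(",", ";");

// With the global regular expression, all commas are replaced.
var allCommas = sanitizeFileName("Minutes, May 11th, final.pdf");
```

Here firstOnly still contains a comma after "11th", while allCommas contains none.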

Need alternative to local or remote goto/destinations with merged documents

BACKGROUND
I have a java program that analyzes data and creates a pdf report using itext 5.
I recently had to add a summary of major problems at the start of the document so a user would not have to read over a hundred pages to find problems. Problems are only discovered when serially looking through the data.
I solved the problem by creating 3 pdf documents and then merging them, a start/title pdf, the summary of problems pdf, and the body or analysis pdf. (Basically splitting the original document at the point I wanted to insert the summary)
I use PdfReader and PdfCopy to combine the documents. I am able to keep the chapter bookmarks OK.
THE PROBLEM
As I encounter a significant problem I add it to the 'summary' document. I want to add a link in the summary to point to the problem in the body.
I tried to use Chunk.setLocalDestination and setLocalGoto but realized why that did not work, so I tried using setLocalDestination and setRemoteGoto (with and without 'file://'), but that did not work either. (Also, I used the final pdf document name in the RemoteGoto, not the temporary pdf document name.)
I do not want to use bookmarks because that seems wrong and would not look right.
I am hoping someone could suggest an alternate method or make a suggestion.
To recap, in my current code I create a Chunk with setLocalDestination, and that chunk goes into the 'body' document. At the same time I create a Chunk with setRemoteGoto, which is put into the summary document. I was hoping the link would work once the documents were combined, but when the link is clicked, you go to the first page of the combined document.
Thanks.....
PS I have both "iText in Action" books
CLARIFICATION 3/5/2014
What I was calling 'bookmarks' are really Chapter class entities that are inserted into sections of the 3 documents as they are being created.
After saving the 3 documents, PdfReader is used to open each and PdfCopy is used to put them into a new, final document.
I get the data from the Chapters, which creates the 'bookmarks' on the left side of the Pdf reader used by the user, e.g. Acrobat Reader.
int thisPdfPages = reader.getNumberOfPages();
reader.consolidateNamedDestinations();
java.util.List<HashMap<String, Object>> bookmarks = SimpleBookmark.getBookmark(reader);
if (bookmarks != null) {
    if (pageOffset != 0) {
        if (debug3) auditLogger.log("Shifting pages by " + pageOffset );
        SimpleBookmark.shiftPageNumbers(bookmarks, pageOffset, null);
    }
    masterBookmarks.addAll(bookmarks);
}
for (int i = 0; i < thisPdfPages;) {
    page = copy.getImportedPage(reader, ++i);
    stamp = copy.createPageStamp(page);
    // add page numbers
    ColumnText.showTextAligned(stamp.getUnderContent(), Element.ALIGN_CENTER, new Phrase(String.format("page %d of %d", start + i, totalPages)), 297.5f, 28, 0);
    stamp.alterContents();
    copy.addPage(page);
}
PRAcroForm form = reader.getAcroForm();
if (form != null) {
    copy.copyAcroForm(reader);
}
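The page shifting in the merge loop above is plain offset arithmetic: each part's pages move by the number of pages that precede it in the final file. A minimal sketch of that arithmetic, written in JavaScript rather than as iText code; pageOffsets, mergedPageNumber and the page counts are made up for illustration:

```javascript
// Hypothetical helper: given the page counts of the parts in merge order,
// return how many pages precede each part in the merged file.
function pageOffsets(pageCounts) {
  const offsets = [];
  let total = 0;
  for (const count of pageCounts) {
    offsets.push(total);
    total += count;
  }
  return offsets;
}

// 1-based page number in the merged file for a 1-based page of one part.
function mergedPageNumber(pageCounts, partIndex, localPage) {
  return pageOffsets(pageCounts)[partIndex] + localPage;
}

// Example parts: title (1 page), summary (4 pages), body (120 pages).
const exampleCounts = [1, 4, 120];
const bodyPageTen = mergedPageNumber(exampleCounts, 2, 10);
```

With these example counts the body starts after 5 pages, so its local page 10 becomes page 15 of the merged file.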
When analyzing the data I have 2 documents open, a base document which contains all the details and a summary document which contains notable events over some thresholds.
//NOTE section is part of the 'body' document
//NOTE summaryPhrase is a part of the 'summary' document
String linkName = "summaryPf_" + networkid ;
//create Link target
section.add(new Chunk("CHANGE TO EMPTY STRING WHEN WORKING").setLocalDestination( linkName ));
//create Link
Chunk linkChunk = new Chunk( "[Link] " );
Font linkFont = new Font( regularFont );
linkFont.setColor(BaseColor.BLUE);
linkFont.setStyle( Font.UNDERLINE );
linkChunk.setFont( linkFont );
boolean useLocal = true;
// both local and remote goto's fail
if (useLocal) {
    linkChunk.setLocalGoto( linkName);
} else {
    // all permutations of setting filename fail,
    // but it does bring up a permissions dialog when the link is clicked.
    //String remotePdfName = "file://./" + pdfReportName ;
    //String remotePdfName = "file://" + pdfReportName ;
    //String remotePdfName = "file:" + pdfReportName ;
    String remotePdfName = pdfReportName ;
    linkChunk.setRemoteGoto( remotePdfName, linkName);
}
// add link to summary document
summaryPhrase.add( linkChunk );
summaryPhrase.add( String.format("There were %d devices with ping failures", summaryCount));
summaryPhrase.add( Chunk.NEWLINE );
}
If I use setLocalGoto, clicking the link in the final document takes you to the first page.
If I use setRemoteGoto, a dialog asks for permission to go to a document, but the document fails to open; I tried several permutations of the filename.
