How to fetch MediaBox of PDF pages without parsing whole file?

Is there a way to use Apache PDFBox to read the MediaBox Rectangle of all the pages in a PDF without parsing the entire file? I currently use the following code, which takes a long time for files over 1.5 GB.
// Can I avoid this load() call, which tries to parse the entire PDF?
// I can only use a temp file instead of main memory, because the application restricts memory usage.
PDDocument pdfDocument = PDDocument.load(pdfFile, MemoryUsageSetting.setupTempFileOnly());
// Get the page's media box. Note that getPage() takes a zero-based index, so 1 is the second page.
PDRectangle mediaBox = pdfDocument.getPage(1).getMediaBox();

Related

Ghostscript - create a pdf with multiple identical pages and keep size down

I'm trying to use Ghostscript to create a PDF with multiple identical pages. I will later use this together with another multi-page PDF to stamp unique information onto every page.
Is it possible to use Ghostscript to create such a PDF and keep the size of the final file down? Maybe there is a flag I have not noticed that does this better than the script below?
I have tried a regular merge command like the one below, but the resulting PDF grows a lot: the original file of 2,061MB, merged into a 100-page PDF, results in a final size of 46,117MB.
"C:\Program Files\gs\gs9.20\bin\gswin64.exe"^
-dBATCH^
-dNOPAUSE^
-q^
-sDEVICE=pdfwrite^
-sOutputFile=outputpdf.pdf^
"inputpdf.pdf"^
"inputpdf.pdf"^
"inputpdf.pdf"(and so on 100 times)

You can construct such a file manually easily enough, which is much smaller, by reusing the page content stream for each page.
However, Ghostscript's pdfwrite device won't do that, not least because it can't. It cannot know in advance that the page it's about to receive is the same as the previous page. As a result, it will create a new page content stream for each page and create new content for it.
Note that resources (forms, patterns, colour spaces, image XObjects etc) which are used on each page will be reused on other pages.
However, it seems to me that you're already getting better than a 4:1 ratio (2 KB * 100 pages = 200 KB, and the final file is 46 KB), though in fairness a good bit of that 2 KB is 'stuff' around the page.
Without seeing your input file I can't really comment any further, but frankly I doubt it's possible to make it any smaller without hand-crafting the file. What's the problem with a 46 KB file anyway?
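To illustrate what "reusing the page content stream" means, here is a minimal sketch in plain Java (standard library only, unrelated to Ghostscript). It hand-builds a PDF in which every page object points at the same content stream. The class name SharedContentPdf, the Letter-sized MediaBox and the one-line placeholder drawing are assumptions made up for the illustration; a real file would carry the actual page content and its resources.
```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Illustration only: hand-builds a PDF whose pages all share one content stream. */
final class SharedContentPdf {

    static byte[] build(int pageCount) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        List<Integer> offsets = new ArrayList<>(); // byte offset of each object, in object-number order
        write(out, "%PDF-1.4\n");

        // Pages are objects 4, 5, ...; collect their references for the /Kids array.
        StringBuilder kids = new StringBuilder();
        for (int i = 0; i < pageCount; i++) {
            kids.append(4 + i).append(" 0 R ");
        }
        String content = "0 0 m 200 200 l S"; // placeholder drawing: a single diagonal line

        // Object 1: catalog. Object 2: page tree root. Object 3: the only content stream.
        addObject(out, offsets, "<< /Type /Catalog /Pages 2 0 R >>");
        addObject(out, offsets, "<< /Type /Pages /Count " + pageCount + " /Kids [ " + kids
                + "] /MediaBox [0 0 612 792] /Resources << >> >>");
        addObject(out, offsets, "<< /Length " + content.length() + " >>\nstream\n"
                + content + "\nendstream");

        // Every page points at object 3 instead of carrying its own copy of the stream.
        for (int i = 0; i < pageCount; i++) {
            addObject(out, offsets, "<< /Type /Page /Parent 2 0 R /Contents 3 0 R >>");
        }

        // Cross-reference table: one 20-byte entry per object, then the trailer.
        int xrefStart = out.size();
        StringBuilder tail = new StringBuilder("xref\n0 " + (offsets.size() + 1) + "\n");
        tail.append("0000000000 65535 f \n");
        for (int offset : offsets) {
            tail.append(String.format(Locale.ROOT, "%010d 00000 n \n", offset));
        }
        tail.append("trailer\n<< /Size ").append(offsets.size() + 1).append(" /Root 1 0 R >>\n");
        tail.append("startxref\n").append(xrefStart).append("\n%%EOF\n");
        write(out, tail.toString());
        return out.toByteArray();
    }

    private static void addObject(ByteArrayOutputStream out, List<Integer> offsets, String body) {
        offsets.add(out.size());
        write(out, offsets.size() + " 0 obj\n" + body + "\nendobj\n");
    }

    private static void write(ByteArrayOutputStream out, String text) {
        byte[] bytes = text.getBytes(StandardCharsets.US_ASCII);
        out.write(bytes, 0, bytes.length);
    }

    public static void main(String[] args) {
        System.out.println("1 page:    " + build(1).length + " bytes");
        System.out.println("100 pages: " + build(100).length + " bytes");
    }
}
```
Each extra page costs only a small page object, a /Kids entry and a cross-reference entry, which is why a hand-crafted file can stay far smaller than one where every page gets its own stream.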

PdfDocument's copyPagesTo method or PdfCanvas's copyAsFormXObject to copy content from PDF to PDF

I followed the guide at this URL: http://developers.itextpdf.com/content/itext-7-jump-start-tutorial/chapter-6-reusing-existing-pdf-documents
Following that guide, I had a problem where some content from the PDF was not copied into the destination PDF when using copyAsFormXObject (which I submitted a support ticket for). An alternative I found in the meantime was that I could use the PdfDocument's copyPagesTo method and simply open the page that was copied with getPage on the destination PDF. From that, I can create a PdfCanvas from the existing page and do our transformations (such as scaling) on the object.
This seems to work exactly like the code in the aforementioned guide, except that for the PDFs where content was previously missing, the content now appears to be copied.
Are there any drawbacks to using the copyPagesTo method to copy the content as opposed to what the guide suggests (copyAsFormXObject)? Performance, memory, or extraneous non-visible content, etc.?
Code that exhibits this problem:
PdfDocument pdf = new PdfDocument(new PdfWriter(dest));
PdfDocument origPdf = new PdfDocument(new PdfReader(src));
PdfPage origPage = origPdf.getPage(1);
PdfPage page = pdf.addNewPage();
PdfCanvas canvas = new PdfCanvas(page);
PdfFormXObject pageCopy = origPage.copyAsFormXObject(pdf);
canvas.addXObject(pageCopy, 0, 0);
pdf.close();
origPdf.close();
Code that does not:
PdfDocument pdf = new PdfDocument(new PdfWriter(dest));
PdfDocument origPdf = new PdfDocument(new PdfReader(src));
origPdf.copyPagesTo(1,2,pdf);
pdf.close();
origPdf.close();

I've provided code and answers for the specific problem on your support ticket.
As for the difference between copyPagesTo() and copyAsFormXObject() for copying pages:
copyPagesTo() is a high-level method that copies over the entire page, maintaining all structure and adding any applicable resources to the new document.
With copyAsFormXObject(), you first need to transform the page into an XObject, essentially turning it into an appearance stream. If this page needs additional settings or resources to be displayed correctly, such as a different page size or fonts that were not stored on the page itself, they need to be manually set or added. XObjects are always added at absolute positions, so the position needs to be specified too.
While copying using low-level methods such as XObjects grants a lot more control over what the result can look like, they come with their own dangers and pitfalls. For ubiquitous tasks such as copying pages, it is better to use the high-level methods to avoid such possible problems.
EDIT:
We've decided that this behaviour is a bug and that copyAsFormXObject() should include the used resources even if they're stored at the /Pages level. This will be fixed in a later release of iText.
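For readers wondering what "stored at the /Pages level" means: in PDF, a page may inherit certain entries (such as /Resources, /MediaBox, /CropBox and /Rotate) from its ancestors in the page tree, so code that looks only at the page's own dictionary can miss them. The sketch below is a toy model in plain Java and is not iText's API; PageTreeNode and resolveInherited are hypothetical names used only to show the lookup.
```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of a page tree node (a /Pages node or a /Page leaf); not an iText class. */
final class PageTreeNode {
    final PageTreeNode parent; // null for the root /Pages node
    final Map<String, Object> entries = new HashMap<>();

    PageTreeNode(PageTreeNode parent) {
        this.parent = parent;
    }

    /** Looks up an inheritable key on this node first, then on each ancestor in turn. */
    Object resolveInherited(String key) {
        for (PageTreeNode node = this; node != null; node = node.parent) {
            Object value = node.entries.get(key);
            if (value != null) {
                return value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        PageTreeNode root = new PageTreeNode(null);
        root.entries.put("/Resources", "fonts declared on the /Pages node");
        PageTreeNode page = new PageTreeNode(root);

        // The page's own dictionary has no /Resources entry...
        System.out.println(page.entries.get("/Resources"));
        // ...but walking up the tree finds the inherited one.
        System.out.println(page.resolveInherited("/Resources"));
    }
}
```
A copy routine that reads only the page's own entries would drop such inherited resources, which matches the symptom described in the question.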

PDFBox - document is empty after loading

I am using Apache PDFBox for rendering thumbnails of PDF documents. To do this, I load the PDF and use the first page as the thumbnail. The problem is that one particular document does not seem to be loaded correctly. For all other documents, it works as expected.
ByteArrayInputStream is = new ByteArrayInputStream(pdfData);
PDDocument pdf = PDDocument.load(is, true);
List<PDPage> pages = pdf.getDocumentCatalog().getAllPages(); //pages is empty here
The PDF file has 238 pages and is around 6.5 MB in size.

Assuming that you're using a 1.8.* version, please use the non-sequential parser:
PDDocument pdf = PDDocument.loadNonSeq(is, null);
The non-sequential parser is successful in certain cases where the old parser fails, e.g. for PDFs that have had revisions. Another advantage is that no extra code is needed for "protected" PDFs that are encrypted with the empty password.
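As background on why revisions matter: a PDF that has been saved incrementally ends with a fresh "startxref" entry for each revision, and the last one points at the newest cross-reference section. As far as I understand it, a non-sequential reader starts from that pointer rather than reading objects in file order. The sketch below is plain Java, not PDFBox code; StartXrefLocator and its methods are hypothetical names, and the 1024-byte tail used in main is an arbitrary assumption.
```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Illustration only: finds the newest cross-reference offset from the end of a PDF. */
final class StartXrefLocator {

    /** Reads only the last tailSize bytes of the file and extracts the startxref offset. */
    static long findLastStartXref(Path file, int tailSize) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            long start = Math.max(0L, raf.length() - tailSize);
            byte[] tail = new byte[(int) (raf.length() - start)];
            raf.seek(start);
            raf.readFully(tail);
            return parseStartXref(tail);
        }
    }

    /** Returns the number following the last "startxref" keyword, or -1 if none is found. */
    static long parseStartXref(byte[] tail) {
        String text = new String(tail, StandardCharsets.ISO_8859_1);
        int keyword = text.lastIndexOf("startxref");
        if (keyword < 0) {
            return -1;
        }
        String rest = text.substring(keyword + "startxref".length()).trim();
        int end = 0;
        while (end < rest.length() && Character.isDigit(rest.charAt(end))) {
            end++;
        }
        return end == 0 ? -1 : Long.parseLong(rest.substring(0, end));
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 1) {
            System.out.println(findLastStartXref(Paths.get(args[0]), 1024));
        }
    }
}
```
Because only the tail of the file is read, the cost of this step does not depend on the total file size.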

Hyperlink in existing PDF

I am trying to add a hyperlink based off of known position coordinates in the PDF. I have tried editing the physical pdf code and have added a link, but in the process deleted other content on the pdf.
[/Rect [ x x x x ]
/Action
<</Subtype /URI/URI (http://www.xxxxx.com/)>>
/Subtype /Link
/ANN pdfmark
Is there any way of adding the hyperlink without corrupting the existing PDF? Would it be a better approach to convert to a different file format, add the link, and convert back? Possible commercial use rules out some GNU-licensed products.

Debenu Quick PDF Library also provides a solution. I also recommend that you don't edit the 'physical code' of the PDF file (with Notepad or similar tools), because it is unlikely to work, in this case or in others.
Here is sample code showing how to do it with the Debenu Quick PDF Library:
/* Add a link to a webpage*/
// Set the origin for the co-ordinates to be the top left corner of the page.
DPL.SetOrigin(1);
// Adding a link to an external web page using the AddLinkToWeb function.
DPL.AddLinkToWeb(200, 100, 60, 20, "www.debenu.com", 0);
// Hyperlinks and text are two separate elements in a PDF,
// so we'll draw some text now so that you know
// where the hyperlink is located on the page.
DPL.DrawText(205, 114, "Click me!");
// When the Debenu Quick PDF Library object is initiated a blank document
// is created and selected in memory by default. So
// all we need to do now is save the document to
// the local hard disk to see the changes that we've made.
DPL.SaveToFile("link_to_web.pdf");
Member of Debenu
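A note on the coordinates: PDF's default user space has its origin at the bottom-left corner of the page, while the sample above switches to a top-left origin. The sketch below, in plain Java and independent of any PDF library, shows the conversion and the kind of link annotation dictionary that ends up in the file. It assumes the four numbers mean left, top, width and height, and a page height of 792 points (US Letter); LinkRect and its methods are hypothetical names.
```java
import java.util.Locale;

/** Illustration only: maps a top-left-origin box to a PDF /Rect and a link annotation. */
final class LinkRect {

    /** Converts a top-left-origin box into {llx, lly, urx, ury} in bottom-left-origin space. */
    static double[] toPdfRect(double left, double top, double width, double height,
                              double pageHeight) {
        double lowerLeftY = pageHeight - top - height;
        return new double[] { left, lowerLeftY, left + width, pageHeight - top };
    }

    /** Formats a link annotation with a URI action; the URI must not contain unescaped parentheses. */
    static String linkAnnotation(double[] rect, String uri) {
        return String.format(Locale.ROOT,
                "<< /Type /Annot /Subtype /Link /Rect [%.2f %.2f %.2f %.2f] "
                        + "/A << /S /URI /URI (%s) >> >>",
                rect[0], rect[1], rect[2], rect[3], uri);
    }

    public static void main(String[] args) {
        double[] rect = toPdfRect(200, 100, 60, 20, 792);
        System.out.println(linkAnnotation(rect, "http://www.debenu.com"));
    }
}
```
A library call such as the one above hides this bookkeeping, including adding the annotation to the page's /Annots array and updating the cross-reference data, which is why editing the file by hand tends to corrupt it.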

The Docotic.Pdf library can add hyperlinks to existing PDFs. The library is not *GPL-licensed and can be used in commercial solutions after purchasing a license.
Below is code that adds a hyperlink to the first page of a PDF.
using System;
using System.Drawing;
using BitMiracle.Docotic.Pdf;
public static void AddHyperlink()
{
    // NOTE:
    // When used in trial mode, the library imposes some restrictions.
    // Please visit http://bitmiracle.com/pdf-library/trial-restrictions.aspx
    // for more information.
    using (PdfDocument pdf = new PdfDocument("input.pdf"))
    {
        // Pages are zero-indexed, so this is the first page.
        PdfPage page = pdf.Pages[0];
        RectangleF rectWithLink = new RectangleF(10, 70, 200, 100);
        page.AddHyperlink(rectWithLink, new Uri("http://google.com"));
        pdf.Save("output.pdf");
    }
}
Disclaimer: I work for the vendor of the library.

Adding image to existing PDF (vb.net)

We have a program that generates PDF documents, and the staff member who uses these documents needs to hand-sign all the generated pages (some 700+). What I would like to do is take a scanned image of his signature and insert it on every page of the existing PDF.
My question, then, is: what is the easiest way to do this using VB.NET?

You can automate that process by using a PDF editing library. For example, use PDFLib 2.1, which is an open source project. Download it from http://pdflib.codeplex.com/ and try editing your pages.
It exposes a function named GetPages which returns a list of the PDF pages. By iterating through every page, you can edit it or add new content to it.

You can quite easily add an image to all pages of a PDF with the help of the Docotic.Pdf library.
Here is sample code (VB.NET):
Public Shared Sub AddImageToAllPages()
    Using pdf As New PdfDocument("input.pdf")
        Dim image As PdfImage = pdf.AddImage("image.png")
        For Each page As PdfPage In pdf.Pages
            page.Canvas.DrawImage(image, 100, 100)
        Next
        pdf.Save("out.pdf")
    End Using
End Sub
and here is the same for C#:
public static void AddImageToAllPages()
{
    using (PdfDocument pdf = new PdfDocument("input.pdf"))
    {
        PdfImage image = pdf.AddImage("image.png");
        foreach (PdfPage page in pdf.Pages)
            page.Canvas.DrawImage(image, 100, 100);
        pdf.Save("out.pdf");
    }
}
The code opens the PDF, loads the image and draws it on every page. The image is reused, so the PDF does not grow much: only one copy of the added image is stored in the output PDF.
Disclaimer: I work for the vendor of the library.
