Minimizing IO and Memory Usage When Extracting Pages from a PDF

I am working on a cloud-hosted web application that needs to serve up extracted pages from a library of larger PDFs. For example, 5 pages from a 50,000 page PDF that is > 1 GB in size.
To facilitate this, I am using iTextSharp to extract page ranges from the large PDFs, following the approach advised in this blog article.
The trouble I am running into is that during testing, I have found that the PdfReader is reading the entire source PDF in order to extract the few pages I need. I know enough about PDF structure to be dangerous, and I know that resources can be spread around such that random read access all over the file is going to be expected, but I was hoping to avoid the need to read ALL the file content.
I even found several mentions of RandomAccessFileOrArray being the silver bullet for high memory usage when opening large PDFs, but alas, even when I use that, the source PDF is still being read in its entirety.
Is there a more efficient method (using iText or otherwise) to access just the content I need from the source PDF in order to extract a few pages?
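For reference, here is a minimal sketch of the extract-a-range approach in Java with iText 5 (iTextSharp mirrors this API almost one-to-one); the file names and page range are placeholders of my own. Passing a RandomAccessFileOrArray together with a null owner password selects PdfReader's partial-read mode, which resolves objects lazily instead of parsing the whole file up front.

    import com.itextpdf.text.Document;
    import com.itextpdf.text.pdf.PdfCopy;
    import com.itextpdf.text.pdf.PdfReader;
    import com.itextpdf.text.pdf.RandomAccessFileOrArray;

    import java.io.FileOutputStream;

    public class PageRangeExtractor {
        public static void main(String[] args) throws Exception {
            // Hypothetical paths and page range, purely for illustration.
            String sourcePath = "huge-source.pdf";
            String targetPath = "extracted-pages.pdf";
            int firstPage = 100;
            int lastPage = 104;

            // The RandomAccessFileOrArray + null password constructor puts
            // PdfReader into its "partial" mode, which resolves objects on
            // demand instead of parsing every object up front.
            PdfReader reader = new PdfReader(
                    new RandomAccessFileOrArray(sourcePath), null);
            try {
                Document document = new Document();
                PdfCopy copy = new PdfCopy(document, new FileOutputStream(targetPath));
                document.open();
                for (int page = firstPage; page <= lastPage; page++) {
                    copy.addPage(copy.getImportedPage(reader, page));
                }
                document.close();
            } finally {
                reader.close();
            }
        }
    }

Even in partial mode, the reader still has to load the cross-reference table and any resources (fonts, images, shared content streams) referenced by the selected pages, so how much of the file is actually touched depends on how the source PDF is organized; a linearized ("fast web view") source tends to keep a page's resources close together.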

Related

How to directly stream large content to PDF with minimal memory footprint?

I am trying to stream large content (say 200 MB) of formatted data to PDF with a minimal memory footprint (say 20 MB per client/thread). The PDF format descends from Adobe PostScript, and writing raw PDF syntax directly is complex. I have been using the following APIs to stream content to PDF.
Jasper Reports
iText
The problem I am facing with Jasper Reports is that it needs all the input data to be held in memory and only supports an OutputStream. There is a function in Jasper Reports that accepts an InputStream of data, but in the backend Jasper loads the whole InputStream into memory, effectively exhausting it.
The problem with iText is that it is commercial. I am now looking to write my own Java API to stream formatted data, including tables and images, directly to PDF. I have referred to the following books to understand the PDF structure:
PDF Structure by Adobe
PDF Explained (O'Reilly)
The above books cover only basic PDF formatting such as text and 2D graphics. How do I draw tables, icons, and all the other formatting that I can generate with HTML/CSS into the PDF?
I need some pointers on understanding the PDF structure in depth. Or is there already a Java API which supports direct streaming of input content to PDF without holding the entire data set in memory?
Note: Headless browsers (PhantomJS, wkhtmltopdf), Apache FOP, and Apache PDFBox render PDFs by loading the entire data set into memory.
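Setting the licensing question aside, the streaming behaviour itself is achievable with iText's writer model: content added to a Document is flushed to the OutputStream as pages are completed rather than buffered whole. A rough sketch, assuming iText 5's "large table" mechanism and hypothetical file names:

    import com.itextpdf.text.Document;
    import com.itextpdf.text.pdf.PdfPTable;
    import com.itextpdf.text.pdf.PdfWriter;

    import java.io.BufferedReader;
    import java.io.FileOutputStream;
    import java.io.FileReader;

    public class StreamingPdfReport {
        public static void main(String[] args) throws Exception {
            // Hypothetical paths; data.csv stands in for the 200 MB feed and is
            // assumed to have three fields per line.
            Document document = new Document();
            PdfWriter.getInstance(document, new FileOutputStream("report.pdf"));
            document.open();

            // An "incomplete" table may be added to the document repeatedly;
            // rows that have already been laid out are written to the
            // OutputStream and released, so memory stays roughly proportional
            // to one page of content rather than to the whole data set.
            PdfPTable table = new PdfPTable(3);
            table.setComplete(false);

            try (BufferedReader in = new BufferedReader(new FileReader("data.csv"))) {
                String line;
                int rows = 0;
                while ((line = in.readLine()) != null) {
                    for (String cell : line.split(",")) {
                        table.addCell(cell);
                    }
                    if (++rows % 500 == 0) {
                        document.add(table);   // flush the rows gathered so far
                    }
                }
            }
            table.setComplete(true);
            document.add(table);               // flush the remaining rows
            document.close();
        }
    }

Whether this fits a 20 MB per thread budget would need to be measured, but the key point is that the OutputStream receives finished pages incrementally, so memory use tracks the working set of one page plus unflushed rows, not the full input.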

Can PDFBox load a source PDF once then save multiple, variable page ranges as individual PDFs?

I am writing a system that processes very large PDFs, up to 400,000 pages with 100,000 individual statements per PDF. My task is to quickly split each PDF into individual statements. This is complicated by the fact that the statements vary in page count, so I can't do a simple split on every 4th page.
I'm using parallel processing on a 36-core AWS instance to speed up the job, but the initial split of a 400,000-page PDF into 36 chunks is very, very slow, although processing the resulting 11,108-page chunks is very quick, so there's a lot of overhead for a good result in the end.
The way I think this could be done even faster would be to write a process using PDFBox that loads the source PDF into memory one time (versus calling command-line utilities like pdftk or cpdf 36 times to split the massive PDF), then have it listen on a port so the children of my other process can tell it to split pages x-y into a PDF named z.
Is this possible with PDFBox, and if so, what methods would I use to accomplish it?
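Something along these lines looks feasible. Below is a rough sketch with PDFBox 2.x (the class and file names are mine, and the socket/IPC layer that would call extract() is left out): load the big document once, backed by temp files rather than the heap, then build each statement by importing the relevant pages into a fresh PDDocument.

    import org.apache.pdfbox.io.MemoryUsageSetting;
    import org.apache.pdfbox.pdmodel.PDDocument;

    import java.io.File;

    public class StatementSplitter {
        private final PDDocument source;

        public StatementSplitter(File bigPdf) throws Exception {
            // Back the parsed document with temp files instead of the heap so a
            // 400,000 page file does not have to fit in memory at once.
            this.source = PDDocument.load(bigPdf, MemoryUsageSetting.setupTempFileOnly());
        }

        // firstPage/lastPage are 1-based; PDFBox's getPage() is 0-based.
        public void extract(int firstPage, int lastPage, File target) throws Exception {
            try (PDDocument statement = new PDDocument()) {
                for (int p = firstPage; p <= lastPage; p++) {
                    // importPage copies the page object but still references
                    // resources owned by the source, so keep the source open
                    // until the new document has been saved.
                    statement.importPage(source.getPage(p - 1));
                }
                statement.save(target);
            }
        }

        public void close() throws Exception {
            source.close();
        }
    }

One caveat: PDDocument is not thread-safe, so if several children request ranges at the same time, calls into the shared source document should be serialized, or each worker thread should hold its own loaded copy.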

How to increase the page turn speed

Are there some general rules for creating PDF documents in order to achieve an optimal page turn speed?
I recently created a few PDF documents (without graphics) using Microsoft Word and realized that my ebook reader (Sony) leafs through them at a slower pace than some PDF documents containing more pages and graphics.
What features of PDF documents or ebook readers affect the page turn speed?
How should these features be configured to increase the page turn speed?
Thanks
There are a few major items that could affect PDF document rendering speeds in the general case. These items are, in order of severity:
Vector graphics - this is by far the worst offender. Too many vector graphics can bring document rendering speeds to a crawl
High-resolution images - even if the images appear small in the document, the originals stored in the document could be high resolution. This leads to slow rendering speeds as well
Font problems - this category is expanded on below
I imagine font issues are what come into play here. Rendering speed can be affected by the number of fonts in a PDF document, and the use of multi-byte and CID fonts can sometimes trip up a mobile viewer.
Another factor could be the version of Word used. Older versions don't generate PDF documents as well as newer versions do.

PDF tools to analyze PDF attributes

Are there any PDF tools that generate information about the loading time and memory usage needed to display a PDF in a browser, as well as the total number of elements inside the PDF?
Unfortunately, not really. I've done some of this research, not for PDF in a browser but (and perhaps this is what you are looking at as well) for PDF on mobile devices.
There are a number of factors that contribute and that to some extent can be tested for:
Whether or not big images exist in the PDF and what resolution they are. This is linked directly to memory usage.
What compression method is used for the images. Decompressing JPEG 2000 images in particular can increase load time significantly. Even worse, because JPEG 2000 can be progressively decompressed, it can give the appearance of a really bad PDF until the images have been fully decompressed and loaded (this is especially ugly on somewhat older tablets, for example).
How complex the transparency effects are that are used in the document.
How many fonts are used in the document.
How many line-art objects (vector elements) with a large number of nodes (points) are used on a page.
You can test what is in the document using Acrobat Pro to some extent (there is a well-hidden tool, available when you save an optimised PDF file, that can audit how much space each class of object uses in a PDF document). You can also use a preflight solution such as pdfToolbox from callas (I'm affiliated with this company) or PitStop from Enfocus; these tools allow you to get a report with the results of custom checks such as image resolution, compression, vector objects, color spaces, etc.
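If a scriptable inventory is enough (sizes of embedded images, which fonts each page references), a library can get you part of the way. Here is a rough sketch using Apache PDFBox 2.x, which the answer above does not mention, so treat it as one possible tool rather than the recommended one; it reports images and fonts per page but says nothing about timing or rendering cost:

    import org.apache.pdfbox.cos.COSName;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.PDPage;
    import org.apache.pdfbox.pdmodel.PDResources;
    import org.apache.pdfbox.pdmodel.graphics.PDXObject;
    import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;

    import java.io.File;

    public class PdfInventory {
        public static void main(String[] args) throws Exception {
            try (PDDocument doc = PDDocument.load(new File(args[0]))) {
                int pageNo = 0;
                for (PDPage page : doc.getPages()) {
                    pageNo++;
                    PDResources res = page.getResources();
                    if (res == null) {
                        continue;
                    }
                    // List embedded images with their pixel dimensions.
                    for (COSName name : res.getXObjectNames()) {
                        PDXObject xo = res.getXObject(name);
                        if (xo instanceof PDImageXObject) {
                            PDImageXObject img = (PDImageXObject) xo;
                            System.out.printf("page %d: image %s %dx%d px%n",
                                    pageNo, name.getName(), img.getWidth(), img.getHeight());
                        }
                    }
                    // List the fonts referenced by the page.
                    for (COSName name : res.getFontNames()) {
                        System.out.printf("page %d: font %s%n",
                                pageNo, res.getFont(name).getName());
                    }
                }
            }
        }
    }

Measuring actual load time and memory in a browser would still have to be done in the browser itself (for example with its profiling tools); a PDF-side inventory only tells you which of the risk factors above are present.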

Is it possible to extract tiff files from PDFs without external libraries?

I was able to use Ned Batchelder's Python code, which I converted to C++, to extract JPEGs from PDF files. I'm wondering if the same technique can be used to extract TIFF files and, if so, does anyone know the appropriate offsets and markers to find them?
Thanks,
David
PDF files may contain different kinds of image data (not surprisingly).
The most common cases are:
Fax data (CCITT Group 3 and 4)
Raw raster data with decoding parameters and an optional palette, all compressed with Deflate or LZW
JPEG data
Recently, I (as a developer of a PDF library) have started noticing more and more PDFs with JBIG2 image data. Also, JPEG 2000 data can sometimes be put into a PDF.
I should say that you probably can extract JPEG/JBIG2/JPEG 2000 data into corresponding *.jpeg / *.jp2 / *.jpx files without external libraries, but be prepared for all kinds of weird PDFs emitted by broken generators. Also, PDFs quite often use object streams, so you'll need to implement a sophisticated PDF parser.
Fax data (i.e. what you probably call TIFF) should at least be packed into a valid TIFF container. You can borrow some code for that from the open-source libtiff, for example.
And then comes raw raster data. I don't think it makes sense to try to extract such data without the help of a library. You could do that, of course, but it would take months of work.
So, if you are trying to extract only a specific kind of image data from a set of PDFs all created with the same generator, then your task is probably feasible. In all other cases I would recommend saving time, money, and hair by using a library for the task.
PDF files store JPEGs as actual JPEGs (DCT and JPX encoding), so in most cases you can rip the data out. With TIFFs, you are looking for CCITT data (but you will need to add a header to the data to make it a TIFF). I wrote two blog articles on images in PDF files at http://www.jpedal.org/PDFblog/2010/09/understanding-the-pdf-file-format-images/ and http://www.jpedal.org/PDFblog/2011/07/extract-raw-jpeg-images-from-a-pdf-file/ which might help.
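To make the "rip the data out" part concrete for JPEGs, here is a brute-force sketch in Java that scans the raw PDF bytes for SOI (FF D8 FF) and EOI (FF D9) markers, much like the Python recipe mentioned in the question. It only works when the /DCTDecode stream is stored verbatim (not additionally Flate-compressed or tucked inside an object stream), it can be fooled by embedded thumbnails, and it deliberately does no real PDF parsing; the output file names are arbitrary:

    import java.io.FileOutputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class JpegRipper {
        public static void main(String[] args) throws Exception {
            byte[] pdf = Files.readAllBytes(Paths.get(args[0]));
            int count = 0;
            for (int i = 0; i + 2 < pdf.length; i++) {
                // Look for the JPEG SOI marker (FF D8 FF) ...
                if ((pdf[i] & 0xFF) == 0xFF && (pdf[i + 1] & 0xFF) == 0xD8
                        && (pdf[i + 2] & 0xFF) == 0xFF) {
                    int end = -1;
                    // ... then the matching EOI marker (FF D9).
                    for (int j = i + 2; j + 1 < pdf.length; j++) {
                        if ((pdf[j] & 0xFF) == 0xFF && (pdf[j + 1] & 0xFF) == 0xD9) {
                            end = j + 2;
                            break;
                        }
                    }
                    if (end > 0) {
                        try (FileOutputStream out =
                                new FileOutputStream("image-" + (++count) + ".jpg")) {
                            out.write(pdf, i, end - i);
                        }
                        i = end;
                    }
                }
            }
            System.out.println("wrote " + count + " candidate JPEGs");
        }
    }

CCITT (fax/TIFF-style) data has no such in-band markers, which is exactly why you have to read the image dictionary (width, height, K, BlackIs1, and so on) and synthesize a TIFF header around the raw stream, as the answers above describe.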