Can PDFBox load a source PDF once then save multiple, variable page ranges as individual PDFs? - pdf

I am writing a system that processes very large PDFs, up to 400,000 pages with 100,000 individual statements per PDF. My task is to split these PDFs quickly into individual statements. This is complicated by the fact that the statements vary in page count, so I can't simply split on every 4th page.
I'm using parallel processing on a 36-core AWS instance to speed up the job, but the initial split of a 400,000-page PDF into 36 chunks is very, very slow, even though processing the resulting 11,108-page chunks is very quick, so there's a lot of up-front overhead for a good end result.
The way I think this could be done even faster is to write a process using PDFBox that loads the source PDF into memory once (instead of calling command-line utilities like pdftk or cpdf 36 times to split the massive PDF) and then listens on a port for the children of my other process to tell it to split pages x-y into a PDF named z.
Is this possible with PDFBox, and if so, which methods would I use to accomplish it?
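One way this could look with PDFBox 2.x is to load the document once and then copy each requested range out with PageExtractor. Below is a minimal sketch of just the splitting step (the file name, memory threshold, and example range are placeholders, PDFBox 3.x loads via Loader.loadPDF instead, and the port-listening part is omitted):

import java.io.File;
import org.apache.pdfbox.io.MemoryUsageSetting;
import org.apache.pdfbox.multipdf.PageExtractor;
import org.apache.pdfbox.pdmodel.PDDocument;

public class RangeSplitter {
    private final PDDocument source;

    public RangeSplitter(File sourceFile) throws Exception {
        // Load the large source PDF once; the mixed memory setting spills to temp
        // files above the given threshold instead of holding everything on the heap.
        this.source = PDDocument.load(sourceFile, MemoryUsageSetting.setupMixed(512L * 1024 * 1024));
    }

    // Copy pages firstPage..lastPage (1-based, inclusive) into a new PDF named outName.
    public void saveRange(int firstPage, int lastPage, String outName) throws Exception {
        PageExtractor extractor = new PageExtractor(source, firstPage, lastPage);
        try (PDDocument part = extractor.extract()) {
            part.save(outName);
        }
    }

    public void close() throws Exception {
        source.close();
    }
}

The listener process would then translate each incoming "pages x-y into z" request into a saveRange(x, y, z) call against the already-loaded document. Saving still re-serializes the copied pages and their resources, so each range has a nonzero cost, but the expensive parse of the 400,000-page file happens only once.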

Related

How to append a one-page PDF to a huge PDF file

I have a huge PDF file (1 TB or more) and a small PDF that has only one page. Is there any way to append the small PDF to the end of the huge one at an acceptable cost in time, without rewriting them to disk?
I have tried MuPDF and PoDoFo but got nowhere.
Thanks.

Minimizing IO and Memory Usage When Extracting Pages from a PDF

I am working on a cloud-hosted web application that needs to serve up extracted pages from a library of larger PDFs. For example, 5 pages from a 50,000-page PDF that is > 1 GB in size.
To facilitate this, I am using iTextSharp to extract page ranges from the large PDFs using the advised approach found in this blog article.
The trouble I am running into is that during testing, I have found that the PdfReader is reading the entire source PDF in order to extract the few pages I need. I know enough about PDF structure to be dangerous, and I know that resources can be spread around such that random read access all over the file is going to be expected, but I was hoping to avoid the need to read ALL the file content.
I even found several mentions of RandomAccessFileOrArray being the silver bullet for high memory usage when opening large PDFs, but alas, even when I use it, the source PDF is still being read in its entirety.
Is there a more efficient method (using iText or otherwise) to access just the content I need from the source PDF in order to extract a few pages?
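For reference, the page-range extraction the question describes usually looks like the following in iText 5 (Java shown here; iTextSharp mirrors the same classes). The file names and page numbers are placeholders, and how much of the file the partial-read constructor actually touches still depends on how the PDF's cross-reference table and shared resources are laid out:

import java.io.FileOutputStream;
import com.itextpdf.text.Document;
import com.itextpdf.text.pdf.PdfCopy;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.RandomAccessFileOrArray;

public class ExtractPages {
    public static void main(String[] args) throws Exception {
        // The RandomAccessFileOrArray constructor selects iText's partial-reading mode,
        // which resolves objects lazily instead of parsing the whole file up front.
        PdfReader reader = new PdfReader(new RandomAccessFileOrArray("library.pdf"), null);
        Document document = new Document();
        PdfCopy copy = new PdfCopy(document, new FileOutputStream("excerpt.pdf"));
        document.open();
        for (int page = 101; page <= 105; page++) {   // example: pages 101-105, 1-based
            copy.addPage(copy.getImportedPage(reader, page));
        }
        document.close();
        reader.close();
    }
}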

Atomic writes with Ghostscript output devices

I'm using Ghostscript to convert PDF pages to PNG using the following on the command line:
gs -dDOINTERPOLATE -sDEVICE=pnggray -r200x200 -o 'page%%d.png' filename.pdf
My intent is to take in large PDFs and do other work with the PNGs as they are built, cleaning them up after I'm done. However, it seems that the output PNGs aren't generated atomically -- that is, they become available before they're complete. Is there a way to get Ghostscript to generate these files atomically, or some way I can access them as the command runs without encountering incomplete files?
No, there isn't. Ghostscript opens the file for writing as soon as it begins the page. It writes the data either in one large lump when the page is complete, or in a series of horizontal stripes (at large page sizes or high resolutions).
Since it might be writing the page in a series of bands, it has to open the file up front.
You could write an application around Ghostscript using its API, which produces a callback on page completion that you could then use to trigger your other processing.

Optimized conversion of many PDF files to PNG if most of them have 90% the same content

I'm using ImageMagick to convert a few hundred thousand PDF files to PNG files, and ImageMagick takes about ten seconds per file. Now, most of these PDF files are automatically generated grading certificates, so it's basically just a bunch of PDF files with the forms filled in with different numbers. There are also a few simple raster images on each PDF. I mean, one option is to just throw computing power at it, but that means money as well as making sure they all end up in the right place when they come back. Another option is to just wait it out on our current computer. But I did the calculations, and we won't even be able to keep up with the certificates we get in real time.
Now, the option I'm hoping to pursue is to somehow take advantage of the fact that most of these files are very similar, so if we have some sort of pre-computed template to use, we can skip the process of calculating the entire PDF file every time the conversion is done. I'd do a quick check to see if the PDF fits any of the templates, run the optimized conversion if it does, and just do a full conversion if it doesn't.
Of course, my understanding of the PDF file format is intermediate at best, and I don't even know if this idea is practical or not. Would it require making a custom version of ImageMagick? Maybe contributing to the ImageMagick source code? Or is there some solution out there already that does exactly what I need? (We've all spent weeks on a project, then had this happen, I imagine.)
OK, I have had a look at this. I took your PDF and converted it to a JPEG like this, until you tell me the actual parameters you prefer:
convert -density 288 image.pdf image.jpg
and it takes 8 seconds and results in a 2448x3168 pixel JPEG of 1.6 MB, enough for a page-sized print.
Then I copied your file 99 times so I had 100 PDFs, and processed them sequentially like this:
#!/bin/bash
# Convert each PDF in the current directory to a JPEG with the same basename.
for p in *.pdf; do
  new="${p%%pdf}jpg"
  echo "$new"
  convert -density 288 "$p" "$new"
done
and that took 14 minutes 32 seconds, an average of 8.7 seconds per file.
Then I tried using GNU Parallel to do exactly the same 100 PDF files, like this:
time parallel -j 8 convert -density 288 {} {.}.jpg ::: *.pdf
which kept all 8 cores of my CPU very busy and processed the same 100 PDFs in 3 minutes 12 seconds, averaging 1.92 seconds each, a 4.5x speed-up. I'd say that's well worth the effort for a pretty simple command line.
Depending on your preferred parameters for convert there may be further enhancements possible...
The solution in my case ended up being to use MuPDF (thanks @Vadim) from the command line, which is about ten times faster than Ghostscript (the library ImageMagick uses). MuPDF fails on about 1% of the PDF files, though, due to improper formatting, which Ghostscript handles reasonably well, so I just wrote an exception handler that falls back to ImageMagick in those cases. Even so, it took about 24 hours on an 8-core server to process all the PDF files.
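A rough sketch of that fallback logic, written in Java for illustration (it checks exit codes rather than catching exceptions, assumes MuPDF's mutool and ImageMagick's convert are on the PATH, and the 288 dpi value is just an example):

import java.io.File;
import java.io.IOException;

public class PdfToPng {

    // Convert one PDF to PNG, trying MuPDF first and falling back to ImageMagick.
    public static void convert(File pdf, File png) throws IOException, InterruptedException {
        // MuPDF is roughly ten times faster but rejects some malformed PDFs.
        if (run("mutool", "draw", "-r", "288", "-o", png.getPath(), pdf.getPath()) != 0) {
            // ImageMagick (which delegates PDF rasterization to Ghostscript) is slower
            // but more tolerant of broken files.
            if (run("convert", "-density", "288", pdf.getPath(), png.getPath()) != 0) {
                throw new IOException("Both converters failed for " + pdf);
            }
        }
    }

    private static int run(String... command) throws IOException, InterruptedException {
        return new ProcessBuilder(command).inheritIO().start().waitFor();
    }
}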

Speeding up Solr Indexing

I am working on speeding up my Solr indexing. I just want to know how many threads (if any) Solr uses for indexing by default. Is there a way to increase or decrease that number?
When you index a document, several steps are performed:
1. the document is analyzed,
2. data is put in the RAM buffer,
3. when the RAM buffer is full, data is flushed to a new segment on disk,
4. if there are more than ${mergeFactor} segments, segments are merged.
The first two steps run in as many threads as you have clients sending data to Solr, so if you want Solr to run three threads for these steps, all you need to do is send data to Solr from three threads.
You can configure the number of threads to use for the fourth step if you use a ConcurrentMergeScheduler (http://lucene.apache.org/java/3_0_1/api/core/org/apache/lucene/index/ConcurrentMergeScheduler.html). However, there is no way to configure the maximum number of threads from the Solr configuration files, so what you need to do is write a custom class that calls setMaxThreadCount in its constructor.
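As a sketch, such a class could be as small as the following (the class name and thread count are made up; setMaxThreadCount is the Lucene 3.x method linked above). You would then reference it from the mergeScheduler entry in solrconfig.xml:

import org.apache.lucene.index.ConcurrentMergeScheduler;

// Hypothetical wrapper that fixes the merge thread count at construction time,
// since Solr's configuration files expose no direct setting for it.
public class FixedThreadMergeScheduler extends ConcurrentMergeScheduler {
    public FixedThreadMergeScheduler() {
        setMaxThreadCount(3); // example value; tune for your hardware
    }
}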
My experience is that the main ways to improve indexing speed with Solr are :
buying faster hardware (especially I/O),
sending data to Solr from several threads (as many threads as cores is a good start),
using the Javabin format,
using faster analyzers.
Although StreamingUpdateSolrServer looks interesting for improving indexing performance, it doesn't support the Javabin format. Since Javabin parsing is much faster than XML parsing, I got better performance by sending bulk updates (800 in my case, but with rather small documents) using CommonsHttpSolrServer and the Javabin format.
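For illustration, a bulk Javabin update with the SolrJ client of that era looks roughly like this (the URL, batch size, and field names are placeholders):

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.impl.BinaryRequestWriter;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BulkIndexer {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        // Send updates in the binary Javabin format instead of XML.
        server.setRequestWriter(new BinaryRequestWriter());

        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        for (int i = 0; i < 800; i++) {              // batch of 800, as in the answer above
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", Integer.toString(i));
            doc.addField("text", "document body " + i);
            batch.add(doc);
        }
        server.add(batch);   // one bulk request
        server.commit();
    }
}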
You can read http://wiki.apache.org/lucene-java/ImproveIndexingSpeed for further information.
This article describes an approach to scaling indexing with SolrCloud, Hadoop and Behemoth. This is for Solr 4.0, which hadn't been released when this question was originally posted.
You can store the content in external storage, such as files: for any fields that contain very large content, set stored="false" on the corresponding field in the schema and keep that field's content in an external file, using some efficient file-system hierarchy.
This reduced indexing time by 40 to 45%. Searching, however, becomes somewhat slower: searches took about 25% more time than a normal search.