Atomic writes with Ghostscript output devices - pdf

I'm using Ghostscript to convert PDF pages to PNG using the following on the command line:
gs -dDOINTERPOLATE -sDEVICE=pnggray -r200x200 -o 'page%%d.png' filename.pdf
My intent is to take in large PDFs and do other work with the PNGs as they are built, cleaning them up after I'm done. However, it seems that the output PNGs aren't generated atomically -- that is, they become available before they're complete. Is there a way to get Ghostscript to generate these files atomically, or some way I can access them as the command runs without encountering incomplete files?

No, there isn't. Ghostscript opens the file for writing as soon as the page begins. It writes the data either in one large lump when the page is complete, or in a series of horizontal stripes (at large page sizes or high resolutions).
Since it might be writing the page in a series of bands, it has to open the file up front.
You could write an application around Ghostscript using the API, which can give you a callback on page completion that you could then use to trigger your other processing.

Related

Ghostscript Text Extraction Time?

I am extracting text from a PDF, and for that I am using Ghostscript v9.52.
The time taken by Ghostscript with the default txtwrite command is ~400ms, and the command is:
-dSafer -dBATCH -dNOPAUSE -sPDFPassword=thispdf -device="txtwrite" stdout pdf.pdf
Then I tried lowering the rendering resolution, which saved some time; I was able to get it down to ~300ms:
-dSafer -dBATCH -dNOPAUSE -r2 -dDEVICEWIDTHPOINTS=50 -dDEVICEHEIGHTPOINTS=50 -dFIXEDMEDIA -sPDFPassword=thispdf -device="txtwrite" stdout pdf.pdf
I have no idea how setting a low resolution is helping here.
How can I speed up text extraction to somewhere near 100ms, if that is possible?
If that's how long it's taking, then that's the length of time it takes. The interpreter has to be started up and establish a full working PostScript environment, then fully interpret the input, including all the fonts, and pass that to the output device. The output device records the font, point size, orientation, colour and position, and attempts to calculate the Unicode code points for all the text. Then, depending on the options (which you haven't given), it may reorder the text before output. Then it outputs the text, closes the input and output files, releases all the memory used and cleanly shuts down the interpreter.
You haven't given an example of the file you are using, but half a second doesn't seem like a terribly long time to do all that.
In part you can blame all the broken PDF files out there; every time another broken file turns up ('but Acrobat reads it') another test has to be made and a workaround established, and generally all of these slow the interpreter down.
The resolution will have no effect, and I find it very hard to believe the media size makes any difference at all, since it's not used. Don't use NOGC; that's a debugging tool and will cause the memory usage to increase.
The most likely way to improve performance would be not to shut the interpreter down between jobs, since the startup and shutdown are probably the largest part of the time spent when it's that quick. Of course that means you can't simply fork Ghostscript, which likely means doing some development with the API, and that would potentially mean you were infringing the AGPL, depending on what your eventual plans are for this application.
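If your workload allows it, one command-line-level way to get part of that benefit is to hand a whole batch of PDFs to a single invocation, so the startup and shutdown cost is paid once per batch rather than once per file. This is a rough sketch only, with placeholder filenames, assuming the files share the same password and you can tolerate all the extracted text landing in one output file:
# One interpreter start-up and shutdown for the whole batch; the text of every file is concatenated into alltext.txt
gs -dSAFER -sDEVICE=txtwrite -sPDFPassword=thispdf -o alltext.txt first.pdf second.pdf third.pdf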
If you would like to supply a complete command line and an example file I could look at it (you could also profile the program yourself) but I very much doubt that your goal is attainable, and definitely not reliably attainable for any random input file.

Ghostscript to compress a batch of PDFs

I have no experience of programming.
My PDFs won't display images on the iPad in PDFExpert or GoodNotes as the images are in JPEG2000, from what I could find on the internet.
These are large PDFs, up to 1500-2000 pages with images. One of these was an 80MB or so file. I tried printing it with Foxit to convert the images from JPEG2000 to JPG, but the file size jumped to 800MB... plus it takes too long.
I stumbled upon Ghostscript, but I have NO clue how to use the command line interface.
I am very short on time. I pretty much need a step-by-step guide for a small script that converts all my PDFs in one go.
Very sorry about my inexperience and helplessness. Can someone spoon-feed me the steps for this?
EDIT: I want to switch the JPEG2000 to any other format that produces less of an increase in file size and causes a minimal loss in quality (within reason). I have no clue how to use Ghostscript. I basically want to change the compression on the images to something that will display correctly on the iPad while maintaining the quality of the rest of the text, as well as the embedded bookmarks.
I'll repeat that I have NO experience with command line...I don't even know how to point GS to the folder my PDFs are in...
You haven't really said what it is you want. 'Convert' PDFs how, exactly?
Note that switching from JPX (JPEG2000) to JPEG will result in a quality loss, because the image data will be quantised (with a different quantisation scheme to JPX) by the JPEG encoder. You can use a lossless compression scheme instead, but then you won't get the same kind of compression. You won't get the same compression ratio as JPX anyway no matter what you use; the result will be larger.
A simple Ghostscript command would be:
gs -sDEVICE=pdfwrite -o out.pdf in.pdf
Because JPEG2000 encoding is (or at least, was) patent encumbered, the pdfwrite device doesn't write images as JPX; by default it will write them several times with different compression schemes, and then use the one that gives the best compression (practically always JPEG).
Getting better results will require a more complex command line, but you'll also have to be more explicit about what exactly you want to achieve, and what the perceived problem with the simplistic command line is.
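As a sketch only (the resolution and filter choices here are assumptions you would tune to your own files), a more explicit command line might take control of how pdfwrite downsamples and re-encodes the images:
gs -sDEVICE=pdfwrite \
   -dDownsampleColorImages=true -dColorImageResolution=150 -dColorImageDownsampleType=/Bicubic \
   -dAutoFilterColorImages=false -dColorImageFilter=/DCTEncode \
   -o out.pdf in.pdf
Swapping /DCTEncode for /FlateEncode gives lossless image compression (at the cost of a larger file, as noted above), and raising or lowering ColorImageResolution trades file size against image quality.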
[EDIT]
Well, giving help on executing a command line is a bit off-topic for Stack Overflow; this is supposed to be a site for software developers :-)
Without knowing what operating system you are using, it's hard to give you detailed instructions. I also have no idea what an iPad uses; I don't generally use Apple devices, and my only experience is with Macs.
Presumably you know where (in which directory) you installed Ghostscript. Either open a command shell there and type the command ./gs, or execute the command by giving the full path, such as:
/usr/bin/gs
I thought the arguments on the command line were self-explanatory, but....
The -sDEVICE=pdfwrite switch tells Ghostscript to use the pdfwrite device, as you might guess from the name, that device writes PDF files as its output.
The -o switch is the name (and full path if required) of the output file.
The final argument is the name (and again, the full path if it's not in the current directory) of the input file.
So a command might look like:
/usr/bin/gs -sDEVICE=pdfwrite -o /home/me/output.pdf /home/me/input.pdf
Or if Ghostscript and the input file are in the same directory:
./gs -sDEVICE=pdfwrite -o out.pdf input.pdf
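Since you want to convert all your PDFs in one go, a minimal batch version (a sketch only, assuming a Unix-like shell, that gs is on your PATH or you substitute the full path as above, and that you run it from the folder containing the PDFs) would be:
# Write each converted file, under its original name, into a 'converted' subdirectory
mkdir -p converted
for f in *.pdf; do
  gs -sDEVICE=pdfwrite -o "converted/$f" "$f"
done
The originals are left untouched, so you can compare the results before deleting anything.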

Can PDFBox load a source PDF once then save multiple, variable page ranges as individual PDFs?

I am writing a system that is processing very large PDFs, up to 400,000 pages with 100,000 individual statements per PDF. My task is to quickly split this PDF up into individual statements. This is made complicated by the fact that the statements vary in page count so I can't do a simple split on every 4th page.
I'm using parallel processing on a 36-core AWS instance to speed up the job, but doing an initial split of a 400,000-page PDF into 36 chunks is very, very slow, although processing the resulting 11,108-page chunks is very quick, so there's a lot of up-front overhead for a good result in the end.
The way I think this could be done even faster would be to write a process using PDFBox that loads the source PDF into memory one time (versus calling command-line utilities like pdftk or cpdf 36 times to split the massive PDF), then have it listen on a port for the children of my other process to tell it to split pages x-y into a PDF named z.
Is this possible with PDFBox and if so what methods would I use to accomplish it?

Optimized conversion of many PDF files to PNG if most of them have 90% the same content

I'm using ImageMagick to convert a few hundred thousand PDF files to PNG files, and ImageMagick takes about ten seconds to convert each one. Now, most of these PDF files are automatically generated grading certificates, so it's basically just a bunch of PDF files with the forms filled in with different numbers. There are also a few simple raster images on each PDF. I mean, one option is to just throw computing power at it, but that means money as well as making sure they all end up in the right place when they come back. Another option is to just wait it out on our current computer. But I did the calculations here, and we won't even be able to keep up with the certificates we get in real time.
Now, the option I'm hoping to pursue is to somehow take advantage of the fact that most of these files are very similar, so if we have some sort of pre-computed template to use, we can skip the process of calculating the entire PDF file every time the conversion is done. I'd do a quick check to see if the PDF fits any of the templates, run the optimized conversion if it does, and just do a full conversion if it doesn't.
Of course, my understanding of the PDF file format is intermediate at best, and I don't even know if this idea is practical or not. Would it require making a custom version of ImageMagick? Maybe contributing to the ImageMagick source code? Or is there some solution out there already that does exactly what I need? (We've all spent weeks on a project, then had this happen, I imagine.)
Ok, I have had a look at this. I took your PDF and converted it to a JPEG like this (until you tell me the actual parameters you prefer):
convert -density 288 image.pdf image.jpg
and it takes 8 seconds and results in a 2448x3168 pixel JPEG of 1.6MB, enough for a page-sized print.
Then I copied your file 99 times so I had 100 PDFs, and processed them sequentially like this:
#!/bin/bash
for p in *.pdf; do
  new="${p%%pdf}jpg"
  echo "$new"
  convert -density 288 "$p" "$new"
done
and that took 14 minutes 32 seconds, or an average of 8.7 seconds each.
Then I tried using GNU Parallel to do exactly the same 100 PDF files, like this:
time parallel -j 8 convert -density 288 {} {.}.jpg ::: *.pdf
which kept all 8 cores of my CPU very busy, and it processed the same 100 PDFs in 3 minutes 12 seconds, averaging 1.92 seconds each, a 4.5x speed-up. I'd say that's well worth the effort for a pretty simple command line.
Depending on your preferred parameters for convert there may be further enhancements possible...
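One further tweak that may be worth trying (an untested assumption on my part, not something measured above) is to stop each convert process from starting its own internal threads, since eight of them are already running side by side:
time parallel -j 8 convert -limit thread 1 -density 288 {} {.}.jpg ::: *.pdf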
The solution in my case ended up being to use MuPDF (thanks #Vadim) from the command line, which is about ten times faster than Ghostscript (the renderer ImageMagick delegates PDF handling to). MuPDF fails on about 1% of the PDF files though, due to improper formatting, which Ghostscript is able to handle reasonably well, so I just wrote an exception handler to fall back to ImageMagick in those cases. Even so, it took about 24 hours on an 8-core server to process all the PDF files.
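A MuPDF command line for this kind of conversion looks roughly like the following (the resolution value is a placeholder; older MuPDF releases ship the same functionality as a standalone mudraw binary rather than mutool draw):
mutool draw -r 288 -o page%d.png input.pdf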

Converting multi-page PDFs to several JPGs using ImageMagick and/or GhostScript

I am trying to convert a multi-page PDF file into a bunch of JPEGs, one for each page in the PDF. I have spent hours and hours looking up how to do this, and eventually I discovered that I need Ghostscript installed. So I did that (from this website: http://downloads.ghostscript.com/public/ ; I used the most recent link, "ghostscript-9.05.tar.gz", from Feb 8, 2012).
However, even with this installed/downloaded, I am still unable to do what I want. Should I have this saved somewhere special, like in the same folder as ImageMagick?
What I have figured out so far is this:
In Command Prompt I change the working directory to the ImageMagick folder, where that is saved.
I then type
convert "<full file path to pdf>" "<full file path to jpg>"
This is followed by a giant blob of error. It begins with:
Unrecoverable error: rangecheck in .setuserparams
Operand stack:
Followed by a blurb of unreadable numbers and caps. It ends with:
While reading gs_lev2.ps:
%%[ Error: invalidaccess; OffendingCommand: put ]%%
Needless to say, after hours and hours of deliberation, I don't think I am any closer to doing the seemingly simple task of converting this PDF into a JPG.
What I would like are some step by step instructions on how to make this work. Don't leave out anything, no matter how "obvious" it might seem (especially anything involving ghostscript). This has been troubling me and my supervisor for months now.
For further clarification, we are on a Windows XP operating system. The eventual intention is to call these command lines in R, the statistical language, and run it in a script. In addition, I have been able to successfully convert JPGs to PNG format and vice versa, but PDF just is not working.
Help!!!
You don't need ImageMagick for this; Ghostscript can do it all alone. (If you used ImageMagick, it couldn't do that conversion itself; it HAS to use Ghostscript as its 'delegate'.)
Try this for directly using Ghostscript:
c:\path\to\gswin32c.exe ^
-o page_%03d.jpg ^
-sDEVICE=jpeg ^
d:/path/to/input.pdf
This will create a new JPEG for each page, and the filenames will increment as page_001.jpg, page_002.jpg,...
Note, this will also create JPEGs which use all the default settings of the jpeg device (one of the most important ones will be that the resolution will be 72dpi).
If you need a higher (or lower) resolution for your images, you can add other options:
gswin32c.exe ^
-o page_%03d.jpg ^
-sDEVICE=jpeg ^
-r300 ^
-dJPEGQ=100 ^
d:/path/to/input.pdf
-r300 sets the resolution to 300dpi and -dJPEGQ=100 sets the highest JPEG quality level (Ghostscript's default is 75).
Also note, please: JPEG is not well suited to represent shapes with sharp edges and high contrast in good quality (such as you typically see in black-on-white text pages with small characters).
The (lossy) JPEG compression method is optimized for continuous-tone pictures + photos, and not for line graphics. Therefore it is sub-optimal for such PostScript or PDF input pages which mainly contain text. Here, the lossy compression of the JPEG format will result in poorer quality output even if the input is excellent. See also the JPEG FAQ for more details on this topic.
You may get better image output by choosing PNG as the output format (PNG uses a lossless compression):
gswin32c.exe ^
-o page_%03d.png ^
-sDEVICE=png16m ^
-r150 ^
d:/path/to/input.pdf
The png16m device produces 24-bit RGB color. You could swap this for pnggray (for pure grayscale output), png256 (for 8-bit color), png16 (4-bit color), pngmono (black and white only) or pngmonod (an alternative black-and-white device).
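For example, a grayscale variant of the command above (same placeholder paths) just swaps the device name:
gswin32c.exe -o page_%03d.png -sDEVICE=pnggray -r150 d:/path/to/input.pdf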
There are numerous SaaS services that will do this for you too. HyPDF and Blitline come to mind.