Cross-platform library for separating and converting PDFs to images?

I'm currently planning an application that involves manipulating PDFs. My goal is a program that I can pass a PDF as input, and which then saves separated grayscale images of the colour channels that the PDF consists of as output. This is basically a simple RIP.
I currently have a working solution based on Ghostscript, but I want to rewrite the application to optimise speed and usability. (Ghostscript doesn't separate PDFs, for example.)
Do you know of any other open source libraries that I may find useful to achieve this?

Did you ever try to run (I'm assuming Windows here):
mkdir separated
gswin32c.exe ^
-o separated/page_%04d.tif ^
-sDEVICE=tiffsep ^
d:/path/to/input.pdf
(you can also try -sDEVICE=tiffsep1) and then look at the files you get in the separated subdirectory? Is that not a case of Ghostscript separating PDFs, in your mind?
The device tiffsep creates multiple output files:
A single 32-bit composite CMYK file (tiff32nc format) per PDF page.
Multiple tiffgray files (each compressed with LZW) per PDF page, one for each separation.
The device tiffsep1 behaves similarly, but...
...doesn't create the composite output file...
...and it creates tiffg4 output files for the separations.
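For comparison, the same call with the tiffsep1 device looks like this (only the device name changes; paths are the same as in the example above):
rem tiffsep1: no composite file, G4-compressed tiffgray separations
gswin32c.exe ^
-o separated/page_%04d.tif ^
-sDEVICE=tiffsep1 ^
d:/path/to/input.pdf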

Related

Batch annotation for similar PDF files

I have many PDF files with the same layout, in which I need to fill in some details with plain text and sign in certain places with a company stamp.
I am trying to automate things, so I am looking for a way to fill in the details and stamp the first file, and then "copy-paste" that operation to all the other files.
I managed to find:
online websites allowing me to annotate one file at a time
documentation on how to use pdftk and additional tools to create a script, but this can take a lot of manual operations on the command line (e.g. scaling the signature JPEG file, positioning it, adding text in the right location, etc.), which is very tedious.
Is there any way to annotate the first file using visual tools (like #1) and then extract a script of commands that performs these annotations from the command line?
I could then use that script on multiple files (one possible sketch follows below).
Thank you,
-Moshe
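One possible direction, sketched rather than tested: pdftk has stamp and multistamp operations that overlay a one-page (or page-per-page, for multistamp) PDF onto an input file. If your visual tool can export the annotations of the first file as a PDF overlay (stamp.pdf below is a hypothetical name), the batch part reduces to a loop:
# stamp.pdf: overlay exported once from a visual annotation tool (hypothetical name)
# writes a stamped copy of every PDF in the current directory
for f in *.pdf; do
    pdftk "$f" stamp stamp.pdf output "stamped_$f"
done
Use multistamp instead of stamp if different pages need different annotations.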

How to Tesseract multiple .tif files?

A total self-taught noob here. I'm using the Windows Command Prompt to run Tesseract OCR.
I managed to find the right command to get as output a two-layer PDF file with the original scanned page but also a searchable text layer.
tesseract filename.tif output -l ita pdf
Quite simple for me too.
But how do I repeat this operation for all the 200+ .tif files in the folder without doing it manually? It makes no difference to me whether I get many output PDFs or a single output PDF.
Thanks to everyone who will help me.
I found a way in the meantime: create a text file containing the list of paths to each .tif file (with the command dir /s /b *.tif > listname.txt) and then use it as input for Tesseract.
Maybe there's a faster way, but this works.
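For the record, a plain for loop in cmd.exe does the same without the intermediate list file (a sketch; inside a .bat file you would double the % signs):
rem run Tesseract on every .tif in the folder; %~nf is the file name without its extension
for %f in (*.tif) do tesseract "%f" "%~nf" -l ita pdf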

Why does the combination pdf2ps / ps2pdf shrink the PDF?

When researching how to compress a bunch of PDFs with pictures inside (ideally in a lossless fashion, but I'll settle for lossy) I found that a lot of people recommend doing this:
$ pdf2ps file.pdf
$ ps2pdf file.ps
This works! The resulting file is smaller and looks at least good enough.
How / why does this work?
Which settings can I tweak in this process?
If there is some lossy conversion, which one is that?
Where is the catch?
People who recommend this procedure rarely do so from a background of expertise or knowledge -- it's rather based on gut feelings.
The detour of generating a new PDF via PostScript and back (also called "refrying a PDF") is never going to give you optimal results. Sometimes it is useful, e.g. in cases where the original PDF won't print at all or cannot be processed by another application. But these cases are very rare.
In any case, this "roundtrip" conversion will never lead to the same PDF file as initially.
Also, pdf2ps and ps2pdf aren't independent tools at all: they are just simple wrapper scripts around a Ghostscript (gs or gswin32c.exe) command line. You can check that yourself by running:
cat $(which ps2pdf)
cat $(which pdf2ps)
This will also reveal the (default) parameters these simple wrappers use for the respective conversions.
If you are unlucky, you will have an ancient Ghostscript installed. The PostScript generated by pdf2ps will then be Level 1 PS, which is "lossy" for many fonts that more modern PDF files may use, resulting in the rasterization of previously vector fonts. Not exactly the output you'd like to look at...
Since both tools use Ghostscript anyway (just behind your back), you are better off running Ghostscript yourself. This gives you more control over the parameters it uses. Especially advantageous is the fact that this way you can get a direct PDF->PDF conversion, without any detour via an intermediate PostScript file.
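A minimal example of such a direct PDF->PDF pass (a sketch; /screen, /ebook, /printer and /prepress are Ghostscript's predefined quality presets, trading file size against image quality):
# -o implies -dBATCH -dNOPAUSE; pick the preset that matches your quality needs
gs -o smaller.pdf -sDEVICE=pdfwrite -dPDFSETTINGS=/ebook input.pdf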
Here are a few answers that give some hints about which parameters you could use in order to drive down the file size of your output PDF in a semi-controlled way:
Optimize PDF files (with Ghostscript or other) (StackOverflow)
Remove / Delete all images from a PDF using Ghostscript or ImageMagick (StackOverflow)

Use ghostscript to delete a page (not extracting a range)

I know Ghostscript can use -dFirstPage -dLastPage to only make a file from a range of pages, but I need it (or another command-line program) to delete the 2nd page of any PDF where the page range is not explicitly given. I thought this would be far easier, because most printers let you specify "1,3-end", and I have been using PDFCreator to do it that way.
The one way I can think of doing it (very, very messy) is to extract page 1, extract pages 3 to end, and then merge the two PDFs. But I also don't know how to have GS determine the number of pages.
Use the right tool for the job!
For reasons outlined by KenS, Ghostscript is not the best tool for what you want to achieve. A better tool for this task is pdftk. To remove the 2nd page from input.pdf, run this command line:
pdftk input.pdf cat 1 3-end output output.pdf
OK, first things first: if you use Ghostscript's pdfwrite device you are NOT extracting, deleting, or performing any other 'manipulation' operation on your source PDF file. I keep reiterating this, but I'm going to say it again.
When you pass an input file through Ghostscript, it is completely interpreted into a series of graphical primitives which are passed to the device; in general, the device will render these primitives to a bitmap. In the case of the 'high level' devices such as pdfwrite, the primitives are re-assembled into a brand new file, in the case of pdfwrite a PDF file.
This flexibility allows for input in a number of different page description languages (PostScript, PDF, PCL, PCL-XL, XPS) and then output in a few different high level formats (PostScript, EPS, flavours of PDF, XPS, PCL, PCL-XL).
But the new file bears no relation to the original, other than its appearance.
Now, having got that out of the way... You can use the pdf_info.ps PostScript program, supplied in the 'toolbin' directory of the Ghostscript installation, to get a variety of information about PDF files; one of the things you can get is the number of pages in the PDF. But you don't actually need it here: run the file once with -dLastPage=1, then run it again with -dFirstPage=3 (don't set LastPage), and then run both resulting files through pdfwrite once more to create a file with the pages of each combined.
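Spelled out, that two-pass approach looks like this (a sketch; the file names are placeholders):
# keep page 1 only
gs -o first.pdf -sDEVICE=pdfwrite -dLastPage=1 input.pdf
# keep pages 3 through the end
gs -o rest.pdf -sDEVICE=pdfwrite -dFirstPage=3 input.pdf
# combine both parts; page 2 is gone
gs -o no-page-2.pdf -sDEVICE=pdfwrite first.pdf rest.pdf
If you do want the page count, pdf_info.ps can report it (assuming the command is run from the directory that contains Ghostscript's toolbin):
gs -dNODISPLAY -q -sFile=input.pdf toolbin/pdf_info.ps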

Ghostscript skips characters when merging PDFs

I have a problem when using Ghostscript (version 8.71) on Ubuntu to merge PDF files created with wkhtmltopdf.
The problem, which I experience on random occasions, is that some characters get lost in the merge process and are replaced by nothing (or a space) in the merged PDF. The original PDFs look fine, but after merging some characters are missing.
Note that a missing character, such as the number 9 or the letter a, can be lost in one place in the document but show up fine somewhere else, so it is not a rendering problem or a font issue as such.
The command I am using is:
gs \
-q \
-dNOPAUSE \
-sDEVICE=pdfwrite \
-sOutputFile=/tmp/outputfilename \
-dBATCH \
/var/www/documents/docs/input1.pdf \
/var/www/documents/docs/input2.pdf \
/var/www/documents/docs/input3.pdf
Has anyone else experienced this? Or, even better, does anyone know a solution for it?
I've seen this happen when the names of embedded font subsets are identical, but the real contents of these subsets are different (containing different glyph sets).
Check all your input files for the fonts used. Use Poppler's pdffonts utility for this:
for i in input*.pdf; do
    pdffonts "${i}" | tee "${i}.pdffonts.txt"
done
Look for the font names used in each PDF.
My theory/bet is that you'll see identical font names (names similar to BAAAAA+ArialMT) used across different input files.
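If that theory holds, the pdffonts output for two different input files will show the same subset name. A made-up, abridged illustration (real pdffonts output has more columns):
# input1.pdf.pdffonts.txt
name                type        emb sub
BAAAAA+ArialMT      TrueType    yes yes
# input2.pdf.pdffonts.txt -- different glyphs inside, but the same name
name                type        emb sub
BAAAAA+ArialMT      TrueType    yes yes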
The BAAAAA+ prefix used in subset font names is supposed to be random (though the official specification is not very clear about this). Some applications use predictable prefixes, however, starting with BAAAAA+, then CAAAAA+, DAAAAA+ etc. (OpenOffice.org and LibreOffice are notorious for this). This means that the prefix BAAAAA+ gets used in every single file where at least one subset font is used...
It can easily happen that your input files do not use the exact same subset of characters. However, the identical names could make Ghostscript think the fonts really are the same. It then (falsely) 'optimizes' the merged PDF and embeds only one of the two font instances (both having the same name, for example BAAAAA+Arial). This instance, however, may not include some glyphs which were part of the other instance(s).
This leads to some characters missing in the merged output.
I know that more recent versions of Ghostscript have seen a heavy overhaul of their font handling code. Maybe you'll have more luck with Ghostscript v9.06 (the most recent release to date).
I'm very much interested in investigating this in greater detail. If you can provide a sample of your input files (as well as the merged output produced by GS v8.71), I can test whether it works better with v9.06.
What you could do to avoid this problem
Try to always embed fonts as full sets, not subsets:
I don't know if and how you can control to have full font embedding when using wkhtmltopdf.
If you generate your input PDFs from Libre/OpenOffice, you're out of luck and you'll have no control over it.
If you use Acrobat to generate your input PDFs, you can tweak font embedding details in the Distiller settings.
If Ghostscript generates your input PDFs, the command-line parameter to enforce full font embedding is:
gs -o output.pdf -sDEVICE=pdfwrite -dSubsetFonts=false input.file
Some types of fonts cannot be embedded fully, but only subsetted (TrueType, Type3, CIDFontType0, CIDFontType1, CIDFontType2). See this answer to the question "Why doesn't Acrobat Distiller embed all fonts fully?" for more details.
Do the following only if you are sure that no one else gets to see, print, or use your individual input files: do not embed the fonts at all, and only embed them when merging the final result PDF from your inputs with Ghostscript.
I don't know if and how you can control to have no font embedding when using wkhtmltopdf.
If you generate your input PDFs from Libre/OpenOffice, you're out of luck and you'll have no control over it.
If you use Acrobat to generate your input PDFs, you can tweak font embedding details in the Distiller settings.
If Ghostscript generates your input PDFs, the command-line parameters to prevent font embedding are:
gs -o output.pdf -sDEVICE=pdfwrite -dEmbedAllFonts=false -c "<</AlwaysEmbed [ ]>> setdistillerparams" -f input.file
Some types of fonts cannot be left un-embedded; they will always be embedded (Type3, CIDFontType1). See the same answer for more details.
Do not use Ghostscript for merging PDFs, but rather pdftk. pdftk is a 'dumber' utility than Ghostscript (at least older versions of pdftk are) when it comes to merging PDFs, and this dumbness can be an advantage...
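Applied to the three files from the question, such a merge would look like this (a sketch using pdftk's cat operation):
pdftk /var/www/documents/docs/input1.pdf /var/www/documents/docs/input2.pdf /var/www/documents/docs/input3.pdf cat output /tmp/outputfilename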
Update
To answer once more, but this time more explicitly (following the extra question from @sacohe in the comments below): in many (not all) cases the following procedure will work.
Re-'distill' the input PDF files with the help of Ghostscript (preferably the most recent version from the 9.0x series).
The command to use is this (or similar):
gs -o redistilled-out.pdf -sDEVICE=pdfwrite input.pdf
The resulting output PDFs should then use different (unique) prefixes in the font names, even where the input PDFs used the same name prefix for different font (subsets).
This procedure worked for me when I processed a sample of original input files provided to me by 'Mr R', the author of the original question. After that fix, the "skipped character problem" was gone in the final result (a merged PDF created from the fixed input files).
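If many input files are affected, the re-distilling step can be scripted (a sketch, assuming bash; the fixed copies go into a separate directory):
mkdir -p fixed
# re-distill each input; pdfwrite assigns fresh subset prefixes in its output
for f in input*.pdf; do
    gs -o "fixed/${f}" -sDEVICE=pdfwrite "${f}"
done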
I wanted to give some feedback: unfortunately, the re-processing trick doesn't seem to work with Ghostscript 8.70 (as shipped in Red Hat/CentOS releases) and files exported as PDF from Word 2010 (which seems to use the ABCDEE+ prefix for everything). And I haven't been able to find any pre-built versions of Ghostscript 9 for my platform.
You mention that older versions of pdftk might work. We moved away from pdftk (newer versions) to gs because some PDF files would cause pdftk to core-dump. @Kurt, do you think that trying to find an older version of pdftk might help? If so, which version do you recommend?
Another ugly method that halfway works is to use:
-sDEVICE=pdfwrite -dCompatibilityLevel=1.2 -dHaveTrueType=false
which converts the fonts to bitmaps. But it then causes the characters on the page to look a bit light (not a big deal), selecting text is off by about one line height (mildly annoying), and, worst of all, even though the characters display OK, copy/paste gives random garbage in the text.
(I was hoping this would be a comment, but I guess I can't do that. Is the answer closed?)
From what I can tell, this issue is fixed in Ghostscript version 9.21. We were having a similar issue where merged PDFs were missing characters, and while @Kurt Pfeifle's suggestion of re-distilling those PDFs did work, it seemed a little infeasible/silly to us. Some of our merged PDFs consisted of 600 or more individual PDFs, and re-distilling every single one of those just to merge them seemed nuts.
Our production version of Ghostscript was 9.10, which was causing this problem. But when I ran some tests on 9.21, the problem seemed to vanish. I have been unable to produce a document with missing or mangled characters using GS 9.21, so I think that's the real solution here.