Batch PDF Watermark [PDF -> JPG -> PDF] - pdf

I am working with over thousands PDF files for a Sheet Music publisher.
All of these PDF files needs a preview PDF. A watermark for PDF files can easily be removed so I am asking for a true way to watermark our PDF:s in a batch operation.
PDF->Apply Watermark->JPG->Back to PDF
How can I do this? Is there a good tool for this operations?

The free route
ImageMagick can do the complete process for you, especially with the composite command's -watermark operator.
#!/bin/sh
# ImageMagick will pick the correct conversion formats based on filename suffixes, or maybe actual binary content?
InputPDF=$1
WatermarkImg=$2
OutputPDF=$3
pdfToImage=pdfToImage.png
imageWithWatermark=imageWithWatermark.png
# Convert PDF to image
convert \
-density 300 \
-trim \
"$InputPDF" \
-quality 100 \
-flatten \
-sharpen 0x1.0 \
$pdfToImage
# Add watermark to intermediate image
composite \
-dissolve 15 \
-tile \
"$WatermarkImg" \
$pdfToImage \
$imageWithWatermark
# Convert intermediate image back to PDF
convert \
$imageWithWatermark \
"$OutputPDF"
# Clean up
rm $pdfToImage $imageWithWatermark
I find the PDF to image conversion acceptable in terms of quality, though you can see some differences when looking at the before and after side-by-side, especially in how bolded glyphs seem less bold:
You can check this good post and its answers for a number of options for converting a PDF to an image, Convert PDF to image with high resolution.
I checked out PDFtoPPM, which was also highly mentioned in that thread, and I still see some degrading of the bolded fonts when converted:
Some more tiling Magick
I used this copyright symbol from Wikimedia Commons and this ImageMagick script:
#!/bin/sh
Infile="Copyright.png"
Outfile="Copyright_tiled.png"
h2=$(convert $Infile -format "%[fx:round(h/2)]" info:)
convert $Infile \
\( -clone 0 -roll +0+"$h2" \) \
+append \
-write mpr:sometile \
+delete \
-size 1224X1584 \
tile:mpr:sometile \
$Outfile
to create this staggered tiling (1224X1584 is the page size (8.5in x 11in) multiplied by 72 px/in, times 2 for a good density of tiles):

And here it is unwatermarked again
#ZachYoung I used some different image magic, also scriptable, the point is:-
Although "What's done cannot be undone" Macbeth (Act 5.1. 63-4) is very true especially within a PDF or image. We also know and expect that it too applies to any PDF (de)constructs. Thus depending on value of a forgery it will always be worth engineering a partially reversed copy, fit for scrutiny or use, but will like the watermarked copy, still not be the original, however all the same, may look almost just as good.
The Idiom implies don't bother yourself about it. Its best not done in the first place.
The nearest to best, is use a watermark exactly the same as the text outlines, like this:-

Related

Add white border to PDF (change paper format)

I have to change a given PDF from A4 (210mm*297mm) to 216mm*303mm.
The additional 6 mm for each dimension should be set as white border of 3mm on each side. The original content of the PDF pages should be centered on the output pages.
I tried with convert:
convert in.pdf -bordercolor "#FFFFFF" -border 9 out.pdf
This gives me exactly the needed result but I loose very much sharpness of the original images in the PDF. It is all kind of blurry.
I also checked with
convert in.pdf out.pdf
which does no changes at all but also screws up the images.
So I tried Ghostcript but did not get any result. The best approach I found so far from a German side is:
gs -sOutputFile=out.pdf -sDEVICE=pdfwrite -g6120x8590 \
-c "<</Install{1 1 scale 8.5 8.5}>> setpagedevice" \
-dNOPAUSE -dBATCH in.pdf
but I get Error: /typecheck in --.postinstall--.
By default, Imagemagick converts input PDF files into images with 72dpi. This is awfully low resolution, as you experienced firsthad. The output of Imagemagick is always a raster image, so if your input PDF was text, it will no longer be.
If you don't mind the output PDF's getting bigger, you can simply increase the ratio Imagemagick is probing the original PDF using -density option, like this:
convert -density 600 in.pdf -bordercolor "#FFFFFF" -border 9 out.pdf
I used 600 because it is the sweet spot that works well for OCR. I recomment trying 300, 450, 600, 900 and 1200 and picking the best one that doesn't get unwieldably huge.
Shifting the content on the media is not especially hard, but it does mean altering the content stream of the PDF file, which most PDF manipulation packages avoid, with good reason.
The code you quote above really won't work, it leaves garbage on the operand stack, and the PLRM explicitly states that it is followed by an implicit initgraphics which will reset all the standard parameters anyway.
You could try instead setting a /BeginPage procedure to translate the origin, which will probably work:
<</BeginPage {8.5 8.5 translate} >> setpagedevice
Note that you aren't simply manipulating the original PDF file; Ghostscript takes the original PDF file, interprets it into graphics primitives, then reassembles those primitives into a new PDF file, this has implications... For example, if an image is DCT encoded (a JPEG) in the original, it will be decompressed before being passed into the output file. You probably don't want to reapply DCT encoding as this will introduce visible artefacts.
A simpler alternative, but involving multiple processing steps and therefore more potential for problems, is to first convert the PDF to PostScript with the ps2write device, specifying your media size, and also the -dCenterPages switch, then use the pdfwrite device to turn the resulting PostScript into a new PDF file.
Instead of
-g6120x8590 \
-c "<</Install{1 1 scale 8.5 8.5}>> setpagedevice"
(which is wrong), you should use:
-g6120x8590 \
-c "<</Install{8.5 8.5 translate}>> setpagedevice"
or
-g6120x8590 \
-c "<</Install{3 25.4 div 72 mul dup translate}>> setpagedevice"
(which lets Ghostscript calculate the "3mm == 8.5pt" itself...)

Can iTextSharp reduce dpi resolution of a pdf?

I'm trying to upload hi-res PDF files to our servers, but would like to generate a smaller PDF file size so that it loads quickly on my web application by reducing the dpi resolution.
Is this something that iTextSharp can do? Or is there another free software that can achieve this?
PDF files, in general, do not have DPI. Raster images embedded in a PDF file do. What you can do, is to extract the images embedded in your PDF file, resize them to a lower resolution, and put them back in your file.
There is a chapter about this topic in the book iText in Action.
Ghostscript is Free Software (if you want), and it can downsample PDFs any way you want (well, downsample the pixel images that may be embedded on its pages).
Example commandline, which downsamples all images to 72dpi (provided they have a resolution that's more than 144dpi). I'll not use the shortest command, but I deliberately try to enumerate all potentially useful parameters, so that you can experiment:
gs \
-o downsampled.pdf \
-sDEVICE=pdfwrite \
-dColorImageDownsampleThreshold=2.0 \
-dGrayImageDownsampleThreshold=2.0 \
-dMonoImageDownsampleThreshold=2.0 \
-dColorImageDownsampleType=/Bicubic \
-dGrayImageDownsampleType=/Bicubic \
-dMonoImageDownsampleType=/Bicubic \
-dDownsampleColorImages=true \
-dDownsampleGrayImages=true \
-dDownsampleMonoImages=true \
-dColorImageResolution=72 \
-dGrayImageResolution=72 \
-dMonoImageResolution=72 \
-dAutoFilterColorImages=false \
-dAutoFilterGrayImages=false \
\
-dEncodeColorImages=true \
-dEncodeGrayImages=true \
-dEncodeMonoImages=true \
-dColorImageFilter=/DCTEncode \
-dGrayImageFilter=/DCTEncode \
-dMonoImageFilter=/CCITTFaxEncode \
input.pdf
If you want to downsample all color images (that is, also the ones from 73dpi to 144dpi), then use -dColorImageDownsampleThreshold=1.0 (Ghostscript's default is =1.5); the same goes for other *ImageDownsampleThreshold settings.
For the *ImageDownsampleTypes -- you can also experiment with values of /Average or /Subsample instead of my suggested /Bicubic. And you are of course als free to use different settings for resolution, sampling type and thresholds across the mono, gray and color image types.

Convert multipage PDF to PNG and back (Linux)

I have a lot of PDF documents that I want to convert to PNG, edit in Gimp, and then save back to the multipage Acrobat file. I'm filling out forms and adding scanned signature, trying to avoid printing, signing, then scanning back in, with the ability to type the information I need to enter.
I've been trying to use Imagemagick to convert to png files, which seems to work fine. I use the command convert -quality 100 -density 300x300 multipage.pdf single%d.png
(I'm not really sure if the quality parameter is right for png).
But I'm having problems with saving back to PDF. Some of the files have the wrong page size, and I've tried every command and procedure I can find, but there are always a few odd sizes. The resolution seems to vary so that it looks good at a certain zoom level, but either a few pages are specified at about 2" wide, or they are 8.5x11 but the others are about 35" wide. I've tried making sure Gimp had the canvass size and resolution correct, and to save the resolution in the file, but that doesn't seem to matter.
The command I use to save the files is convert -page letter -adjoin single*.png multipage.pdf I've tried other parameters, but none seemed to matter.
If anyone has any ideas or alternatives, I'd appreciate it.
"I'm not really sure if the quality parameter is right for PNG."
For PNG output, the -quality setting is very unlike JPEG's quality setting (which simply is an integer from 0 to 100).
For PNG it is composed by two single digits:
The first digit (tens) is (largely) the zlib compression level, and it may go from 0 to 9.
(However the setting of 0 has a special meaning: when you use it you'll get Huffman compression, not zlib compression level 0. This is often better... Weird but true.)
The second digit is the PNG data encoding filter type (before it is compressed):
0 is none,
1 is "sub",
2 is "up",
3 is "average",
4 is "Paeth", and
5 is "adaptive".
In practical terms that means:
For illustrations with solid sequences of color a "none" filter (-quality 00) is typically the most appropriate.
For photos of natural landscapes an "adaptive" filtering (-quality 05) is generally the best.
"I'm having problems with saving back to PDF. Some of the files have the wrong page size, and I've tried every command and procedure I can find [...] but either a few pages are specified at about 2" wide, or they are 8.5x11 but the others are about 35" wide."
Not having available your PNG files, I created a few simple ones with different dimensions to verify the different commands (as I wasn't sure myself any more). Indeed, the one you used:
convert -page letter -adjoin single*.png multipage.pdf
does create all PDF pages in (same) letter size, but it places my sample of (differently sized) PNGs always on the lower left corner of the PDF page. (Should a PNG exceed the PDF page size, it does scale them down to make them fit -- but it doesn't scale up smaller PNGs to fill the available page space.)
The following modification to the command will place the PNGs into the center of each PDF page:
convert \
-page letter \
-adjoin \
single*.png \
-gravity center \
multipage.pdf
If this is still not good enough for you, you can enforce a (possibly non-proportional!) scaling to almost fill the letter area by adding a -scale '590!x770!' parameter (this will leave a border of 11 pt at each edge of the page):
convert \
-page letter \
-adjoin \
single*.png \
-gravity center \
-scale '590!x770!' \
multipage.pdf
To leave away the extra border, use -scale '612!x792!'. -- Should you want only upward scaling to happen if required while keeping the aspect ratio of the PNG, use -scale '590<x770<':
convert \
-page letter \
-adjoin \
single*.png \
-gravity center \
-scale '590<x770<' \
multipage.pdf
Why not just use Xournal? That's what I use to annotate PDFs

Replacing vector images in a PDF with raster images

Is there any easy (scriptable) way to convert a PDF with vector images into a PDF with raster images? In other words, I want to generate a PDF with the exact same (un-rasterized) text but with each vector image replaced with a rasterized version.
I occasionally read PDFs of technical articles on my Kindle, and have found that reading a PDF directly is frustrating. Thankfully, Amazon's automatic conversion of PDFs to the Kindle format does a good job of reflowing the text portions of most of PDFs I have tried. However, while raster images seem to make it through the conversion process fine, vector images get horribly mangled. It would be great if I could easily convert a PDF so that all of its vector images were rasterized.
I am interested in any possible solutions, but a Linux- or Windows-based one would be preferable.
I had a similar issue, and solved it using ImageMagics convert tool (http://www.imagemagick.org/script/index.php). That comes with linux and runs fine on Windows/Cygwin or OS X
convert -density 300 largeVectorFileFromR.pdf out.pdf
With -density 300 you control resolution (as DPI).
Downside: Text is rasterized as well, I understand that Michael does not want this.
After some days searching for some solution, based on "Remove all text from PDF file" and "How to add a picture onto an existing pdf file?" I found a (ugly) scriptable solution:
gs -o /tmp/onlytxt.pdf -sDEVICE=pdfwrite -dFILTERVECTOR -dFILTERIMAGE $INPUT_FILE && \
gs -o /tmp/graphics.pdf -sDEVICE=pdfwrite -dFILTERTEXT $INPUT_FILE && \
convert -density $DPI -quality 100 /tmp/graphics.pdf /tmp/graphics.png && \
convert -density $DPI -quality 100 /tmp/graphics.png /tmp/graphics.pdf && \
pdftk /tmp/graphics.pdf stamp /tmp/onlytxt.pdf output $OUTPUT_FILE && \
rm /tmp/onlytxt.pdf /tmp/graphics.pdf /tmp/graphics.png
were we have three variables INPUT_FILE, OUTPUT_FILE, and DPI. We split the textual and graphical contents via Ghostscript, convert the graphical image to a raster image (PNG) and join the two using pdftk.
I've been using this successfully to convert huge vector images for use in scientific papers.
Pitstop Pro v2 update 3 from Enfocus can do exactly that. It has an action called "Rasterize page content, keeping text" which works pretty well. It is a plugin to Adobe Acrobat so it requires a little more but is also available as a server solution.
It's a little complicated, but you asked for any possible solution. Furthermore this solution is not automatable.
1) Open the pdf with the vector images in Inkscape. Then select the whole image with the select tool (F1)
2) If the vector image is consistant of more than one svg graphic press Ctrl + G (Object --> Group)
3) cut the grouped svg image Ctrl + x
4) open a new InkScape Window Ctrl + n and paste the image Ctrl + v
5) choose File --> export Bitmap (Shift + Ctrl + e), maybe you want to increase the dpi
6) go back to the first InkScape window, File --> import (Ctrl + i) and choose the previously exported bitmap
7) place the bitmap to the location where the svg image was
Save the pdf and the vector image is replaced by a bitmap image.
Here's one way to solve your problem:
Step 1: Use an online PDF-to-HTML converter, like the one here:
http://www.idrsolutions.com/online-pdf-to-html5-converter/
This tool converts the PDF into a set of images and a text overlay. The vector images should be converted to raster at this point.
Step 2: Convert the HTML+images back into PDF:
http://pdfcrowd.com/#convert_by_upload+with_options
The resulting PDF will have all the vector images rasterized, and all text will remain text, so you can select, copy, etc.
Convert the pdf to djvu with https://jwilk.net/software/pdf2djvu converter. Uncheck "antialias fonts,vectors..". It will reduce file size significantly and improve document load times.
I used the following:
gswin32c -o "%2" -dFirstPage=1 -dLastPage=1 -sDEVICE=pngalpha -r72x72 -dUseCropBox -dFitPage "%1" -dBATCH -dNOPAUSE
where %1 is the input file and %2 is the output. This can be used with LaTeX, the generated PNG has the same ratio and page size as the original PDF so the relative position of the image will not change.
Note that in Linux, you may need to use gs rather than gswin32c.
You can also set the page range and then print the pages back to PDF. The downside is that the text gets rasterized as well.
inkscape is the best solution, I quickly made this rather unoptimized batch file that does exactly that and you can play with it and change options. ImageMacick convert, gs, or pdftoimages don't work as good as inkscape they either don't export the layers or export but with bad quality :
#!/bin/bash
#set -xev
ORIGINAL_FOLDER=`pwd`
JPEGS=`mktemp -d`
unzip "$1" -d "$JPEGS"
cd "$JPEGS"
# expang the pdf in pdf pages
pdftk combined_to_do.pdf burst output pg_%04d.pdf
#1) print the pdf's to pngs as they are seen with alpha, layers, transparency etc, this cannot be done by ImageMacick convert or pdftoimages
ls ./pg*.pdf | xargs -L1 -I {} inkscape {} -z --export-dpi=300 --export-area-drawing --export-png={}.png
#2) Second change to jpgs
rm *.pdf
ls ./p*.png | xargs -L1 -I {} convert {} -quality 100 -density 300 {}.jpg
#3) This to make a pdf file out of every jpg image without loss of either resolution or quality:
ls -1 ./*jpg | xargs -L1 -I {} img2pdf {} -o {}.pdf
#4) This to concatenate the pdfpages into one:
pdftk *.jpg.pdf cat output combined.pdf
#5) And last I add an OCRed text layer that doesn't change the quality of the scan in the pdfs so they can be searchable:
pypdfocr combined.pdf
cp "$JPEGS/combined_ocr.pdf" "$ORIGINAL_FOLDER/$1_ocr.pdf"
cp "$JPEGS/combined.pdf" "$ORIGINAL_FOLDER/$1.pdf"
Based on Civ Lins solution, I came up with this:
#!/usr/bin/env sh
gs -o /tmp/onlytxt.pdf -sDEVICE=pdfwrite -dFILTERVECTOR -dFILTERIMAGE $1 && \
gs -o /tmp/graphics.pdf -sDEVICE=pdfimage24 -dFILTERTEXT -r600 -dDownScaleFactor=6 $1 && \
pdftk /tmp/graphics.pdf multistamp /tmp/onlytxt.pdf output $2 && \
rm /tmp/onlytxt.pdf /tmp/graphics.pdf
(In contrast to the previous solution, it handles multipage PDFs and uses gs to directly render the rasterized image without the detour of convert.)

Converting (any) PDF to black (K)-only CMYK

This is related to:
Converting PDF to CMYK (with identify recognizing CMYK).
Script (or some other means) to convert RGB to CMYK in PDF?
... but a bit more specific here: say I have an RGB PDF, where the text color is "rich black" (R:0 G:0 B:0 gone to C:100 M:100 Y:100 K:100), and diverse images and vector graphics.
I would like to convert this to a CMYK PDF, using a free command line tool (so it is batch scriptable under Linux), which
has contents only in the black (K) channel:
Preserves vector graphics (+ text glyphs) - colors become grayscale in black (K) channel only
Images get converted to grayscale in black (K) channel only
Thanks in advance for any answers,
Cheers!
As hinted in my comment to #Mark Storer, it turns out that forcing a gray print only on the K plate in CMYK, may not be so trivial ... I guess it depends much on what is being used as "preflight" preview device - for Linux, the only thing I can find is ghostscript with tiffsep, which is what I use for 'sanity check' regarding CMYK separations.
Anyways, I got a lot of help in this thread on comp.lang.postscript:
PDF to PDF (gs?): rich RGB black to plain K (CMYK) black? - comp.lang.postscript | Google Groups
... and one workflow that works for me is:
Convert PDF to PS using ghostscript's ps2write
Use ghostscript to convert this PS back to PDF, while executing replacement functions in HackRGB-cmyk-inv.ps
Use ghostscript's tiffsep to check actual separations
In respect to, say, this PDF generated by OpenOffice: blah-slide.pdf, the command lines would be:
# PDF to PS using `ps2write` device of `ghostscript`
gs \
-dNOPAUSE \
-dBATCH \
-sDEVICE=ps2write \
-sOutputFile=./blah-slide-gsps2w.ps \
./blah-slide.pdf
# PS to PDF using replacement function in HackRGB-cmyk-inv.ps
gs \
-dNOPAUSE \
-dBATCH \
-sDEVICE=pdfwrite \
-sOutputFile=./blah-slide-hackRGB-cmyk-inv.pdf \
./HackRGB-cmyk-inv.ps \
./blah-slide-gsps2w.ps
# check separations
gs \
-dNOPAUSE \
-dBATCH \
-dSAFER \
-sDEVICE=tiffsep \
-dFirstPage=1 \
-dLastPage=1 \
-sOutputFile=p%02d.tif \
blah-slide-hackRGB-cmyk-inv.pdf \
\
&& eog p01.tif 2>/dev/null
This should only work on RGB values where R=G=B (and hopefully grayscale values), and only on text colors, and it also flattens text information - but it should be possible to confirm via tiffsep that the text indeed ends up only on the K plate.
As mentioned in the newsgroup post, this is not extensively tested, but looks promising so far...
Cheers!
As an improvement to sdaau's great answer, I can recommend using pdftops from xpdf for converting pdf to ps, instead of ghostscript ps2write, because the latter e.g. causes the font to become staircasey, and is said to not preserve the original pdf accurately. Compare by zooming into text areas of the resulting pdfs.
I suggest you convert the PDF using GS twice. Once to a Shades Of Gray colorspace, and then to CMYK.
I'm not sure it'll work, but I'd be a bit surprised if it didn't. G->CMYK sounds like a brain-dead X -> 0 0 0 X conversion. At least if you stick to "device gray" and "device CMYK" instead of some calibrated color space that'll tweak things this way and that.