Is there a way to convert images embedded in a PDF from jpg/gif/whatever to png or gif? - pdf

The biggest part of the question is in the title...
I have big PDF files made from concatenated scanned documents, something like press articles: text + images. The important part is the text, not the pictures...
That's why I thought (according to this article) of compressing all the images in the PDF as png or gif...
Thanks for all your suggestions; I have already spent too much time trying to optimize my Ghostscript command-line options :-p
FYI, this is my current Ghostscript 9.14 command line in production:
gs -q -sDEVICE=pdfwrite \
-dSAFER -dNOPAUSE -dBATCH -dQUIET -dPDFSETTINGS=/ebook \
-dColorImageResolution=150 -dGrayImageResolution=150 -dMonoImageResolution=800 \
-dPreserveOPIComments=false -dPreserveOverprintSettings=false \
-dUCRandBGInfo=/Remove -dProcessColorModel=/DeviceRGB -dMaxInlineImageSize=0 \
-dDetectDuplicateImages=true -dFastWebView=false -dUseFlateCompression=true \
-dAutoFilterGrayImages=false -dAutoFilterColorImages=false \
-dColorImageDownsampleThreshold=1.2 \
-sOUTPUTFILE=/tmp/screen_20140602103745.pdf \
-c "512000000 setvmthreshold /QFactor 0.80 /Blend 1 /ColorTransform 1 /HSamples [2 1 1 2] /VSamples [2 1 1 2]" \
-f /usr/bases/dicodrp/pdf/pdf_concatenes/20140602103745.pdf
I get about 40% compression and something just readable, but I think I can improve the readability just by changing the image compression type (I get those noisy JPEG artifacts...)
No, I can't increase the dpi because that will increase the file size... :-)
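A minimal sketch of what switching from JPEG to lossless Flate (the compression PNG uses) could look like. pdfwrite does not store PNG or GIF files as such; it picks a PDF image filter, so the closest equivalent is to turn off automatic filter selection and ask for /FlateEncode. The function name flate_recompress and the DRY_RUN variable are hypothetical helpers, not part of Ghostscript, and the 150 dpi targets are simply copied from the command above. Flate removes JPEG artifacts, but on photographic scans it may well produce larger files, so compare sizes before adopting it.

```shell
# Hypothetical helper: re-encode colour and gray images with lossless Flate.
# Usage: flate_recompress input.pdf output.pdf
# Set DRY_RUN=1 to print the command instead of running it.
flate_recompress() {
    in_pdf=$1
    out_pdf=$2
    # Build the gs command line as this function's positional parameters.
    set -- gs -q -dSAFER -dNOPAUSE -dBATCH -sDEVICE=pdfwrite \
        -dDownsampleColorImages=true -dColorImageResolution=150 \
        -dDownsampleGrayImages=true -dGrayImageResolution=150 \
        -dAutoFilterColorImages=false -dColorImageFilter=/FlateEncode \
        -dAutoFilterGrayImages=false -dGrayImageFilter=/FlateEncode \
        "-sOutputFile=$out_pdf" "$in_pdf"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}
```

For example, DRY_RUN=1 flate_recompress in.pdf out.pdf prints the command so it can be reviewed before it is run on production files.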

Related

ghostscript: convert PDF into GRAY preserving pure Black for text

I need to convert RGB PDF into a CMYK/GRAY PDF.
I use the following command line:
gs -dSAFER -dBATCH -dNOPAUSE -sDEVICE=pdfwrite \
-dEmbedAllFonts=true \
-dPDFSETTINGS="/prepress" \
\
-sColorConversionStrategy=$2 \
-sColorConversionStrategyForImages=$2 \
-dProcessColorModel=$3 \
\
-dAutoFilterColorImages=false -dColorImageFilter=/FlateEncode \
-dAutoFilterGrayImages=false -dGrayImageFilter=/FlateEncode \
-dMonoImageFilter=/FlateEncode \
\
-dDownsampleColorImages=false \
-dDownsampleGrayImages=false \
-dDownsampleMonoImages=false \
where $3 is /DeviceGray or /DeviceCMYK and
$2 is CMYK or Gray.
Unfortunately, in gray mode the text is only 91% black.
In the CMYK mode the text is 100% black.
How can I set the text to 100% black in the gray mode?
I use GS 9.26 and no special ICC profiles.
You cannot convert to Gray while preserving text as black; the text will be converted to gray as well.
You could provide an ICC profile which converts R=G=B=0 into a CIE colour which, when mapped back through the Gray ICC profile, results in 100% gray, which is (obviously) the same as black. I'm afraid it's up to you to source suitable ICC profiles.
With the current version of Ghostscript you don't need to (and shouldn't) set the ProcessColorModel when using ColorConversionStrategy; it will be set for you.
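As a rough sketch of that last point, the conversion can be written with ColorConversionStrategy alone. The helper name to_gray and the DRY_RUN variable are hypothetical; the optional third argument passes a gray ICC profile through -sOutputICCProfile, and whether pure black survives depends entirely on the profile you supply.

```shell
# Hypothetical helper: convert to gray and let Ghostscript choose the
# ProcessColorModel itself. An optional third argument names a gray ICC
# output profile (a file you have to source yourself).
# Usage: to_gray input.pdf output.pdf [gray_profile.icc]
# Set DRY_RUN=1 to print the command instead of running it.
to_gray() {
    in_pdf=$1
    out_pdf=$2
    profile=${3:-}
    set -- gs -dSAFER -dBATCH -dNOPAUSE -sDEVICE=pdfwrite \
        -sColorConversionStrategy=Gray
    if [ -n "$profile" ]; then
        set -- "$@" "-sOutputICCProfile=$profile"
    fi
    set -- "$@" "-sOutputFile=$out_pdf" "$in_pdf"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}
```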

How to modify JPEG compression in PDF files using ghostscript

I'd like to reduce PDF file size not only by reducing image DPI but also by changing quality settings of JPEG compression.
First I tried:
gs -dNOPAUSE -dQUIET -dBATCH -sDEVICE=pdfwrite -dPDFSETTINGS=/screen \
-dColorImageResolution=120 -dGrayImageResolution=120 \
-dUseFlateCompression=false -sOutputFile=test1.pdf \
-c "<< /GrayImageDict << /Blend 1 /VSamples [1 1 1 1] /QFactor 0.1 /HSamples [1 1 1 1] >> /ColorImageDict << /Blend 1 /VSamples [1 1 1 1] /QFactor 0.1 /HSamples [1 1 1 1] >> >> setdistillerparams " \
-f test.ps
Second I changed Gray- / ColorImageDict entries and tried:
gs -dNOPAUSE -dQUIET -dBATCH -sDEVICE=pdfwrite -dPDFSETTINGS=/screen \
-dColorImageResolution=120 -dGrayImageResolution=120 \
-dUseFlateCompression=false -sOutputFile=test2.pdf \
-c "<< /GrayImageDict << /Blend 1 /VSamples [2 1 1 2] /QFactor 2.4 /HSamples [2 1 1 2] >> /ColorImageDict << /Blend 1 /VSamples [2 1 1 2] /QFactor 2.4 /HSamples [2 1 1 2] >> >> setdistillerparams " \
-f test.ps
But the results of both commands are identical in size.
Any suggestions what's the mistake / misunderstanding or how to increase JPEG compression otherwise?
(Version: GPL Ghostscript 9.16)
It would be best to share an example file and command lines so that others can reproduce your findings. Without that, it's not possible to say why you are getting the results you are.
Your command line isn't ideal. You've used one of the canned PDFSETTINGS; I wouldn't do that if I were you. Use the defaults and alter the ones you want to change. You've also mixed command line switches and a PostScript call to setdistillerparams. It would be better to use just setdistillerparams.
The most likely reason is that you aren't getting JPEG in the output; note that you haven't disabled auto filter selection. As described in the distiller params documentation, UseFlateCompression only affects the page compression, not images, and in any event is fixed to true (see the Ghostscript documentation) in Ghostscript. Setting it to false does nothing.
So I'd suggest you post a sample file and we take it from there.
Oh and you should use the current version, 9.16 is 2 years old.
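A hedged sketch of the advice above: no canned PDFSETTINGS, everything set through a single setdistillerparams call, automatic filter selection switched off and DCTEncode requested explicitly, so that ColorImageDict and GrayImageDict are the dictionaries actually consulted. The helper name jpeg_recompress, the DRY_RUN variable and the default QFactor of 0.76 are assumptions for illustration; a higher QFactor means stronger compression and lower quality.

```shell
# Hypothetical helper: force JPEG (DCT) image compression with a given QFactor.
# Usage: jpeg_recompress input.ps output.pdf [qfactor]
# Set DRY_RUN=1 to print the command instead of running it.
jpeg_recompress() {
    in_file=$1
    out_pdf=$2
    qfactor=${3:-0.76}
    # One DCT dictionary shared by colour and gray images.
    dct="<< /QFactor $qfactor /Blend 1 /HSamples [2 1 1 2] /VSamples [2 1 1 2] >>"
    params="<< /AutoFilterColorImages false /ColorImageFilter /DCTEncode"
    params="$params /ColorImageDict $dct"
    params="$params /AutoFilterGrayImages false /GrayImageFilter /DCTEncode"
    params="$params /GrayImageDict $dct"
    params="$params /DownsampleColorImages true /ColorImageResolution 120"
    params="$params /DownsampleGrayImages true /GrayImageResolution 120"
    params="$params >> setdistillerparams"
    set -- gs -dNOPAUSE -dQUIET -dBATCH -sDEVICE=pdfwrite \
        "-sOutputFile=$out_pdf" -c "$params" -f "$in_file"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}
```

Running it twice with clearly different values, for instance jpeg_recompress test.ps test1.pdf 0.1 and jpeg_recompress test.ps test2.pdf 2.4, and comparing the file sizes is a quick way to check whether the images are really being written as JPEG.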

ghostscript cmyk black value in pdf

I'm still trying to convert an RGB-pdf to a CMYK-pdf using PSOcoated_v3.icc as outputProfile (see my earlier question). I'm now convinced that Ghostscript treats the profile like lcms2 does (and apparently also Photoshop). However, when using Ghostscript to write a PDF file the black still looks washed out, so I decided to dig into the PDF file and grab the CMYK color value.
The situation is as follows: I start with an RGB-pdf exported from Inkscape which is simply a black rectangle filling up the entire page; let's name that file black.pdf. Now I convert the PDF via
gs -dBATCH -dNOPAUSE -dNOCACHE \
-sDEVICE=pdfwrite \
-sProcessColorModel=DeviceCMYK \
-sColorConversionStrategy=CMYK \
-sOutputICCProfile=PSOcoated_v3.icc \
-sDefaultRGBProfile=sRGB2014.icc \
-dOverrideICC=true \
-dRenderIntent=1 \
-sOutputFile=black.cmyk.pdf \
black.pdf
and examine the content of the resulting PDF. The print commands for the rectangle differ depending on whether I use gs 9.20 from the Debian repository or the gs 9.22 binary from the Ghostscript website.
With version 9.20 I get
q 0.1 0 0 0.1 0 0 cm
/R7 gs
0.722 0.675 0.671 0.882 k
0 0.0195313 10902.9 7748.55 re
f
Q
and for version 9.22
q 0.1 0 0 0.1 0 0 cm
/R7 gs
1 1 1 0 k
0 0.0195313 10902.9 7748.55 re
f
Q
In both cases the cmyk-black value is different from [0.83, 0.67, 0.51, 0.95] which I would expect using the PSOcoated_v3 profile and relative colorimetric intent.
Simply changing to -sDEVICE=tiff32nc yields the expected CMYK representation for black.
Any ideas?
By the way, is the output color profile saved within the PDF?

stamp a pdf with an image

I am currently writing an application in which one of the processes is to stamp an existing one-page PDF document with an image provided by the user. The stamp needs to be scaled and positioned correctly on the PDF.
I've successfully followed the instructions in Kurt Pfeifle's answer to "Stamp PDF file with control for position of stamp file".
In the answer, Kurt:
1. Creates a stamp on the fly using Ghostscript.
2. Creates an empty A4-sized PDF with the stamp positioned on it.
3. Merges the newly created PDF with the original PDF using pdftk.
As I said, this all works great. However, if I do the same process with my own image file (converted to PDF), something goes wrong with the sizing in the second step. The size given in the command seems to be ignored, and instead the PDF gets the same size as the image. See the output below for a comparison of the original command with the original stamp as PDF and my modified command using a converted image.
Original working command:
gs \
-o A4-stamp.pdf \
-sDEVICE=pdfwrite \
-g5950x8420 \
-c "<</PageOffset [280 790]>> setpagedevice" \
-f stamp-small.pdf
Modified command with image
gs \
-o A4-image.pdf \
-sDEVICE=pdfwrite \
-g5950x8420 \
-c "<</PageOffset [280 790]>> setpagedevice" \
-f image.pdf
As can be seen, the size and ratio is all wrong, and should match the original.
The original stamp-small.pdf (from original answer) can be generated like this:
gs \
-o stamp-small.pdf \
-sDEVICE=pdfwrite \
-g3200x500 \
-c "/Helvetica-Bold findfont 36 scalefont setfont" \
-c "0 .8 0 0 setcmykcolor" \
-c "12 12 moveto" \
-c "(This is my stamp) show" \
-c "showpage"
The image I used in the command is the following, but the same thing happens with any image I have tried, after converting the image to pdf:
convert image.png image.pdf
There seem to be some issues related to:
- transparency in your PNG image (transparency is not supported by PDF)
- convert's output from JPG to PDF (some kind of bug in convert?)
In short, without going into the details of the problems, you can use:
convert image.png -size 640x562 xc:white +swap -compose over -composite image.jpg
This flattens the PNG transparency onto a white background and converts the image to JPG. Note the -size: here it matches the image you added in this post, but it should be set to the correct size for your stamp.
img2pdf image.jpg -o image.pdf
This properly adds the JPG image to a PDF.
gs -o A4-image.pdf -sDEVICE=pdfwrite -g5950x8420 -c "<</PageOffset [100 500]>> setpagedevice" -f image.pdf
The best way I have found so far is:
1. Convert your image to PDF, e.g.
rsvg-convert -f pdf -o stamp.pdf in.svg
2. Change the scale and position of the image, e.g. towards the top right (7cm, 12cm), writing the result to an explicitly named file
pdfjam --paper 'a4paper' --scale 0.3 --offset '7cm 12cm' --outfile stamp-a4.pdf stamp.pdf
3. Overlay the stamp onto some page, e.g. page 4
qpdf in.pdf --overlay stamp-a4.pdf --to=4 -- out.pdf
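The three steps can be tied together in a small wrapper. This is only a sketch: the function names run_step and stamp_page, the DRY_RUN variable and the intermediate file names stamp.pdf and stamp-a4.pdf are assumptions for illustration, and the scale and offset are the example values from the steps above. The qpdf call uses the form in which the overlay file comes first, its options follow, and a bare -- closes the overlay options before the output file.

```shell
# Hypothetical helper: run a command, or just print it when DRY_RUN=1.
run_step() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Hypothetical wrapper around the three steps.
# Usage: stamp_page stamp.svg input.pdf page output.pdf
stamp_page() {
    svg=$1
    pdf=$2
    page=$3
    out=$4
    # 1. Convert the SVG stamp to PDF.
    run_step rsvg-convert -f pdf -o stamp.pdf "$svg" &&
    # 2. Scale it and place it on an A4 page.
    run_step pdfjam --paper a4paper --scale 0.3 --offset '7cm 12cm' \
        --outfile stamp-a4.pdf stamp.pdf &&
    # 3. Overlay the A4 stamp page onto the chosen page of the input.
    run_step qpdf "$pdf" --overlay stamp-a4.pdf "--to=$page" -- "$out"
}
```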

How to convert a PDF to grayscale from command line avoiding to be rasterized?

I'm trying to convert to grayscale this PDF: https://dl.dropboxusercontent.com/u/10351891/page-27.pdf
Ghostscript (v 9.10) with the pdfwrite device fails with an "Unable to convert color space to Gray, reverting strategy to LeaveColorUnchanged." message.
I'm able to convert it through an intermediary PS file (using gs, pdftops (v 0.24.3) or pdf2ps), but this conversion rasterizes the whole PDF.
I tried a lot of other things: normalizing the PDF using qpdf (v 5.0.1) or pdftk (v 1.44), transforming it to an SVG file and back to a PDF via Inkscape (v 0.48.4)... nothing seems to work.
The only solution I found (which is not suitable for me in a production environment) is to use Preview on my Mac and apply a Quartz Gray Tone filter manually or with an Automator script.
Has anyone found another working way to do it?
Or is it possible to normalize the PDF or fix the issue to prevent the Ghostscript message "Unable to convert color space..." or to force the color space in another way?
Thanks!
gs \
-sDEVICE=pdfwrite \
-sProcessColorModel=DeviceGray \
-sColorConversionStrategy=Gray \
-dOverrideICC \
-o out.pdf \
-f page-27.pdf
This command converts your file to grayscale (GS 9.10).
A bit late in the day, but the top answer doesn't work for me with a different file. The underlying problem appears to be old code in Ghostscript, for which there is a later version that is not enabled by default. More on that here: http://bugs.ghostscript.com/show_bug.cgi?id=694608
The page above also gives a command that works for me:
gs \
-sDEVICE=pdfwrite \
-dProcessColorModel=/DeviceGray \
-dColorConversionStrategy=/Gray \
-dPDFUseOldCMS=false \
-o out.pdf \
-f in.pdf
Use the most recent code (not yet released) and set ColorConversionStrategy=Gray
If you crack into the file, you'll find that most of the colors are determined through an RGB ICC based color space (look for 8 0 R to find all the references to this colorspace). Perhaps gs is complaining about that?
Who knows.
The takeaway is that converting a page from one colorspace to another without affecting the content is non-trivial. You need to be able to render the page, trap all changes to the current color/colorspace and substitute an equivalent in the target space. You also need to convert all image XObjects in the wrong colorspace, which requires decoding the image data and re-encoding it in the target space, and all form XObjects, which is a task similar to converting the parent page, since form XObjects (I think your doc has 4) also contain resources and a content stream of page marking operators (which may include more XObjects).
It's certainly doable, but the process is nearly the same as rendering but with some fairly special-purpose code.
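As a partial sanity check after any conversion, the image XObjects at least can be inspected with pdfimages -list from poppler-utils (an assumption: it is not mentioned in the question). The sketch below filters that listing for images whose colour column is not gray; the function name non_gray_images is hypothetical. It says nothing about text, vector content or form XObjects, and rows reported as icc or index still need a manual look, since they may or may not be gray underneath.

```shell
# Hypothetical check: list embedded images whose colour column is not "gray".
# Usage: pdfimages -list out.pdf | non_gray_images
non_gray_images() {
    # Skip the two header lines of the listing;
    # field 1 is the page number, field 6 the colour space.
    awk 'NR > 2 && $6 != "gray" { print "page " $1 ": " $6 }'
}
```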
Very late response, but the following command should work:
convert -colorspace GRAY input.pdf input_gray.pdf
In Linux:
Install pdftk
apt-get install pdftk
Once you have installed pdftk, save the script as graypdf.sh with the following code
#!/bin/bash
# convert pdf to grayscale, preserving metadata
# "AFAIK graphicx has no feature for manipulating colorspaces. " http://groups.google.com/group/latexusersgroup/browse_thread/thread/5ebbc3ff9978af05
# "> Is there an easy (or just standard) way with pdflatex to do a > conversion from color to grayscale when a PDF file is generated? No." ... "If you want to convert a multipage document then you better have pdftops from the xpdf suite installed because Ghostscript's pdf to ps doesn't produce nice Postscript." http://osdir.com/ml/tex.pdftex/2008-05/msg00006.html
# "Converting a color EPS to grayscale" - http://en.wikibooks.org/wiki/LaTeX/Importing_Graphics
# "\usepackage[monochrome]{color} .. I don't know of a neat automatic conversion to monochrome (there might be such a thing) although there was something in Tugboat a while back about mapping colors on the fly. I would probably make monochrome versions of the pictures, and name them consistently. Then conditionally load each one" http://newsgroups.derkeiler.com/Archive/Comp/comp.text.tex/2005-08/msg01864.html
# "Here comes optional.sty. By adding \usepackage{optional} ... \opt{color}{\includegraphics[width=0.4\textwidth]{intro/benzoCompounds_color}} \opt{grayscale}{\includegraphics[width=0.4\textwidth]{intro/benzoCompounds}} " - http://chem-bla-ics.blogspot.com/2008/01/my-phd-thesis-in-color-and-grayscale.html
# with gs:
# http://handyfloss.net/2008.09/making-a-pdf-grayscale-with-ghostscript/
# note - this strips metadata! so:
# http://etutorials.org/Linux+systems/pdf+hacks/Chapter+5.+Manipulating+PDF+Files/Hack+64+Get+and+Set+PDF+Metadata/
COLORFILENAME=$1
OVERWRITE=$2
FNAME=${COLORFILENAME%.pdf}
# NOTE: pdftk does not work with logical page numbers / pagination;
# gs kills it as well;
# so check for existence of 'pdfmarks' file in calling dir;
# if there, use it to correct gs logical pagination
# for example, see
# http://askubuntu.com/questions/32048/renumber-pages-of-a-pdf/65894#65894
PDFMARKS=
if [ -e pdfmarks ] ; then
  PDFMARKS="pdfmarks"
  echo "$PDFMARKS exists, using..."
  # convert to gray pdf - this strips metadata!
  gs -sOutputFile="$FNAME-gs-gray.pdf" -sDEVICE=pdfwrite \
    -sColorConversionStrategy=Gray -dProcessColorModel=/DeviceGray \
    -dCompatibilityLevel=1.4 -dNOPAUSE -dBATCH "$COLORFILENAME" "$PDFMARKS"
else # not really needed ?!
  gs -sOutputFile="$FNAME-gs-gray.pdf" -sDEVICE=pdfwrite \
    -sColorConversionStrategy=Gray -dProcessColorModel=/DeviceGray \
    -dCompatibilityLevel=1.4 -dNOPAUSE -dBATCH "$COLORFILENAME"
fi
# dump metadata from original color pdf
## pdftk $COLORFILENAME dump_data output $FNAME.data.txt
# also: pdfinfo -meta $COLORFILENAME
# grep to avoid BookmarkTitle/Level/PageNumber:
# (without an output file, dump_data writes to stdout)
pdftk "$COLORFILENAME" dump_data | grep 'Info\|Pdf' > "$FNAME.data.txt"
# "pdftk can take a plain-text file of these same key/value pairs and update a PDF's Info dictionary to match. Currently, it does not update the PDF's XMP stream."
pdftk "$FNAME-gs-gray.pdf" update_info "$FNAME.data.txt" output "$FNAME-gray.pdf"
# (http://wiki.creativecommons.org/XMP_Implementations : Exempi ... allows reading/writing XMP metadata for various file formats, including PDF ... )
# clean up
rm "$FNAME-gs-gray.pdf"
rm "$FNAME.data.txt"
if [ "$OVERWRITE" = "y" ] ; then
  echo "Overwriting $COLORFILENAME..."
  mv "$FNAME-gray.pdf" "$COLORFILENAME"
fi
# BUT NOTE:
# Mixing TEX & PostScript : The GEX Model - http://www.tug.org/TUGboat/Articles/tb21-3/tb68kost.pdf
# VTEX is a (commercial) extended version of TEX, sold by MicroPress, Inc. Free versions of VTEX have recently been made available, that work under OS/2 and Linux. This paper describes GEX, a fast fully-integrated PostScript interpreter which functions as part of the VTEX code-generator. Unless specified otherwise, this article describes the functionality in the free- ware version of the VTEX compiler, as available on CTAN sites in systems/vtex.
# GEX is a graphics counterpart to TEX. .. Since GEX may exercise subtle influence on TEX (load fonts, or change TEX registers), GEX is op- tional in VTEX implementations: the default oper- ation of the program is with GEX off; it is enabled by a command-line switch.
# \includegraphics[width=1.3in, colorspace=grayscale 256]{macaw.jpg}
# http://mail.tug.org/texlive/Contents/live/texmf-dist/doc/generic/FAQ-en/html/FAQ-TeXsystems.html
# A free version of the commercial VTeX extended TeX system is available for use under Linux, which among other things specialises in direct production of PDF from (La)TeX input. Sadly, it's no longer supported, and the ready-built images are made for use with a rather ancient Linux kernel.
# NOTE: another way to capture metadata; if converting via ghostscript:
# http://compgroups.net/comp.text.pdf/How-to-specify-metadata-using-Ghostscript
# first:
# grep -a 'Keywo' orig.pdf
# /Author(xxx)/Title(ttt)/Subject()/Creator(LaTeX)/Producer(pdfTeX-1.40.12)/Keywords(kkkk)
# then - copy this data in a file prologue.ini:
#/pdfmark where {pop} {userdict /pdfmark /cleartomark load put} ifelse
#[/Author(xxx)
#/Title(ttt)
#/Subject()
#/Creator(LaTeX with hyperref package + gs w/ prologue)
#/Producer(pdfTeX-1.40.12)
#/Keywords(kkkk)
#/DOCINFO pdfmark
#
# finally, call gs on the orig file,
# asking to process pdfmarks in prologue.ini:
# gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 \
# -dPDFSETTINGS=/screen -dNOPAUSE -dQUIET -dBATCH -dDOPDFMARKS \
# -sOutputFile=out.pdf in.pdf prologue.ini
# then the metadata will be in output too (which is stripped otherwise;
# note bookmarks are preserved, however).
Give the file execution permissions:
chmod +x graypdf.sh
And execute it like this:
./graypdf.sh input.pdf
It will create a file input-gray.pdf in the same location as the initial file.
gs -dQUIET -dBATCH -dNOPAUSE -r150 -sDEVICE=pdfwrite -sProcessColorModel=DeviceGray -sColorConversionStrategy=Gray -dOverrideICC -sOutputFile=output.pdf input.pdf
You can use something which I created. It gives you the option to choose the specific page numbers that you want to convert to grayscale. Handy if you don't want to grayscale the entire pdf. https://github.com/shoaibkhan94/PdfGrayscaler.