Replace colors in PDF using Ghostscript

How can I convert a single color from a PDF document into another color, for example convert all instances of #ff0000 (red) to #ffffff (white)?
I've seen a number of ghostscript commands doing something similar (using setcolor, setcolortransfer), but I can't find a solution for this exact problem.
For example, the following will create an image-negative of the input PDF:
gs -o output.pdf -sDEVICE=pdfwrite -c "{1 exch sub}{1 exch sub}{1 exch sub}{1 exch sub} setcolortransfer" -f input.pdf
I'd like to move past this to a higher level of control, replacing a single color with a different color (not its negative).

Essentially, you can't (or at least not using Ghostscript).
Firstly, you seem to be assuming that the colours will be specified in RGB, when in fact they could be specified in CMYK, ICC, CalRGB or Lab. You also need to consider Indexed colour spaces.
Secondly, Ghostscript does not 'edit' PDF files; when you send a PDF file as input to Ghostscript it is fully interpreted into graphics primitives, and the primitives are processed.
When the output is PDF the primitives are reassembled into a new PDF file. The goal of this process is that the visual appearance of the new PDF file should match the original. It is NOT the same PDF file; its internals will likely be completely different.
Finally, how do you plan to handle images? Will you process those byte by byte to massage the colours? Or do you plan to ignore them? The same goes for shadings, where the colours aren't even present in the PDF file directly, but are generated from functions.
Without knowing why you want to do this I can't even offer a different approach other than: decompress the PDF file, read it and replace the colours manually.
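If you do go the manual route, a very rough sketch of it would look like the following, assuming the colour happens to be written with literal RGB operators such as 1 0 0 rg / 1 0 0 RG (which, per the caveats above, is far from guaranteed; file names are illustrative):
qpdf --qdf --object-streams=disable input.pdf uncompressed.pdf
# replace red fill/stroke colours with white; the replacement is the same
# length, so the stream /Length values stay valid
sed -i 's/1 0 0 rg/1 1 1 rg/g; s/1 0 0 RG/1 1 1 RG/g' uncompressed.pdf
# tidy up offsets after hand-editing a QDF file, then (optionally) recompress
fix-qdf uncompressed.pdf > edited.pdf
qpdf edited.pdf output.pdf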

Related

Avoiding fragmenting of text extracted from PDF after processing with Ghostscript

After processing with Ghostscript, I sometimes see whitespace breaking up the words as seen with pdftotext or in a PDF viewer when searching or selecting. Possibly unrelated but the anomalies seem to correspond with kerning variations in the rendered font.
Is there a way to avoid this?
For example, from GS 9.23 (also occurred with earlier versions):
gs -sDEVICE=pdfwrite \
-dNOPAUSE -dQUIET -dPARANOIDSAFER -dBATCH \
-sOutputFile=./output.pdf input.pdf
Excerpt from pdftotext input.pdf:
Review this manual before
operating deep cleaner
while pdftotext output.pdf gives:
Re vie w t his m a nua l be fore
ope ra t ing de e p c le a ne r
Ghostscript and the pdfwrite device (as explained in VectorDevices.htm) do not simply 'fiddle' with the input when producing a PDF file. The input (from whatever source; PDF, PostScript, XPS, PCL, PCL-XL) is fully interpreted into marking operations, and those marking operations are sent to the device, which turns them back into PDF constructs.
So the low level (PDF) format describing the page need not bear any relation to the low level format of the input. In particular you cannot expect the PDF operations in the input to be reflected in the output.
The visual appearance will be the same (or should be, because that's the main goal), but the actual operations won't be.
The reason for the difference in the text output is that, basically, there is no 'metadata' in a PDF file that describes words, paragraphs, columns etc. When you extract text from a PDF file what you actually get is a series of character codes and positions.
It's up to the text extraction code to try to make some sense of that. I'd guess that pdftotext is using the rather naive approach of assuming that text strings are words.
This is problematic because there are numerous different ways to handle kerning, justification and other spacing in PDF. You could do something like:
(Te) Tj
10 0 Td
(st) Tj
Or :
[(Te) 2 (st)] TJ
The pdfwrite device doesn't know what the original was, so what it emits could be either of those, depending on some heuristics. The chances of it matching the original are low.
I suspect that pdftotext would regard the first operation as "Te st" and the second operation as "Test".
One possible solution would be to use Ghostscript's txtwrite device to extract the text; it might do a better job.
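A minimal invocation would be something along these lines (file names are illustrative):
gs -sDEVICE=txtwrite -o extracted.txt output.pdf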
As with your other question, it would be best to supply examples when asking these kinds of questions, because without that it's pretty much guesswork.
TL;DR
Is there a way to avoid this?
No.

How to get bounding boxes of elements in EPS files

I need to check if an EPS/PDF file contains any vector elements.
First I convert the PDF to EPS and remove all text elements and images from the file like this:
pdftocairo -f $page_number -l $page_number -eps $input - | sed '/BT/,/ET/ d' | sed '/^8 dict dup begin$/,/^Q$/ c Q' > $output
But how can I then check if any elements are written to the canvas?
What do you mean, exactly, by 'vector elements'? Anything except an actual bitmap image? Why do you care? Perhaps if you explained what you want to achieve it would be easier to help you.
Note that the approach you are using is by no means guaranteed to work; there can easily be 'elements' in the file which won't be removed by your rather basic approach to finding images.
You could use Ghostscript; run the file to a bitmap and specify -dFILTERTEXT and -dFILTERIMAGES. Then examine the pixels of the bitmap to see if any are non-white. If they are, then there was vector content in the file. You could probably use something like ImageMagick to count the colours and see if there's more than 1.
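A minimal sketch of that first approach, assuming a single-page file (file names and resolution are illustrative; identify is part of ImageMagick):
# drop text and images, render only what is left
gs -o filtered.png -sDEVICE=png16m -r72 -dFILTERTEXT -dFILTERIMAGES input.pdf
# count distinct colours; more than 1 suggests vector content
identify -format '%k\n' filtered.png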
Or run the file to bitmap twice, once normally, and once with -dFILTERVECTOR. Compare the two bitmaps (MD5 on them would be sufficient). If there are no differences then there was no vector content.
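And a sketch of the second approach (again, names and resolution are illustrative):
# render twice at low resolution: once as-is, once with vector marks dropped
gs -o full_%03d.png -sDEVICE=png16m -r72 input.pdf
gs -o novec_%03d.png -sDEVICE=png16m -r72 -dFILTERVECTOR input.pdf
# identical checksums for every page => no vector content
md5sum full_*.png novec_*.png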
Any PDF that has vector elements will use at least one of the path painting operators. According to chapter 8 of the PDF standard, those are:
S, s, f, F, f*, B, B*, b, b*, n
Of course, since PDF files can be complex, you'll also need it in a standard form. You can do that using the qpdf program's QDF format. (apt install qpdf if you don't have it).
qpdf -qdf schedule.pdf - | egrep -m1 -q '\b[SsfFBbn]\*?$' && echo Yup
That'll print "Yup" if the file schedule.pdf has vector graphics in it.
Note: I think this will do the job for you, but it is not foolproof. It's possible to get false negatives if your PDF is loading vectors from an external file, embedding raw PostScript, or doing some other trickiness. And, of course, it can have false positives (for example, a file that draws a completely transparent, 0pt dot in white ink on a white background).
Other answers have addressed identifying the drawing operators in a plain text stream. For the other question,
But how can I then check if any elements are written to the canvas?
For this, the elements need to be part of a content stream that is referred to
in the /Contents member of the Page object.
If you read in all of the pdf objects, there will be a tree connecting all the content streams to the Root object declared in the trailer.
Trailer : /Root is a reference to the Document Catalog object
Document Catalog : /Pages is an array of Page objects or Pages nodes
Page : /Contents is an array of references to Content Stream objects that draw the elements of the page
It is possible for there to be stray content stream objects that are not referenced in the Document tree. By traversing the Pages tree you could collect any and all actual content and then feed that result to one of the solutions from the other answers.
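A rough way to do that collection with qpdf's inspection options (the object number below is purely illustrative; --show-pages prints the real ones for your file):
# list each page object and the content streams it references
qpdf --show-pages schedule.pdf
# dump one of those streams, decompressed, and look for painting operators
qpdf --show-object=12 --filtered-stream-data schedule.pdf | egrep -m1 -q '\b[SsfFBbn]\*?$' && echo Yup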

Losslessly Compress PDF Generated from PostScript

I am generating multiple EPS files, which contain several PostScript drawing commands that are not necessarily encoded efficiently. The first update in the answer to this question describes similar inefficiencies.
Each of my EPS files is around 18 MB, and the resulting PDF files are around 3 MB. I am generating the PDF files using epstopdf, which enables some sort of compression by default.
Are there any suggestions for how to further reduce the resulting PDF file sizes without changing the quality (e.g. rasterizing the vector graphics)?
I tried reducing the precision of the coordinates from 8 digits past the decimal to 3. This reduced the EPS file sizes to about 14 MB, but, counter-intuitively, the PDF file sizes slightly increased.
Update 1: The EPS files contain several occurrences of the sample code below for different coordinates and colors.
newpath
1 setlinejoin
1 setlinecap
<<
/BBox [322 384.0417 615.0087 651.9958]
/Domain [322 384.0417 615.0087 651.9958]
/ShadingType 6
/ColorSpace [/DeviceRGB]
/DataSource
[
0
350.00000000 651.99583594
336.00000000 645.75890880
336.00000000 645.75890880
322.00000000 639.52198166
339.17140372 627.26533984
339.17140372 627.26533984
356.34280743 615.00869803
370.19224806 621.16169097
370.19224806 621.16169097
384.04168868 627.31468392
367.02084434 639.65525993
367.02084434 639.65525993
0.23047 0.29688 0.75
0.23047 0.29688 0.75
0.41081 0.54141 0.93366
0.41112 0.54178 0.93388
]
>>
gsave
322 615.0087 62.04169 36.98714 rectclip
shfill
grestore
Update 2: I have been able to reduce the PDF file sizes by about 15% by using pdftocairo, followed by gs -dCompatibilityLevel=1.4 -dPDFSETTINGS=/default -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dDetectDuplicateImages=true -sOutputFile=out.pdf in_.pdf.
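(For reference, a plausible reconstruction of that two-step pipeline, since only the Ghostscript half was quoted; original.pdf stands in for the epstopdf output:)
pdftocairo -pdf original.pdf in_.pdf
gs -dCompatibilityLevel=1.4 -dPDFSETTINGS=/default -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dDetectDuplicateImages=true -sOutputFile=out.pdf in_.pdf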
PostScript is a programming language and PDF is not, so often you can actually create a smaller PostScript program than the resulting PDF file.
The 'inefficiencies' you mention in your EPS program, and the precision of the input numbers, are completely irrelevant to the size of the PDF file. The operators in PDF do not have the same names as the operators in PostScript, so a 'moveto' in PostScript does not simply get transliterated into a 'moveto' in the resulting PDF file. The precision of numbers in the output PDF file is not tied to the precision of the numbers in the input.
In addition, PostScript interpreters often use fixed-precision arithmetic (Ghostscript, for example, uses 24:8), so (e.g.) 1.5 on the input may not be produced as 1.5 on the output; it may instead become 1.49999999.
So the result of this, basically, is that nobody can tell why your PDF files are as large as they are without seeing them. I would suggest that a 6:1 reduction in size is pretty reasonable, personally. If you post a representative example somewhere it's possible someone could look at it and might be able to offer some suggestions, but without seeing the content it's not really possible to tell.
NB rendering the content would most likely increase the size of the PDF file, unless you render at a really low resolution.
EDIT
The supplied example is simply a shading dictionary; the PDF file will contain almost exactly the same data for that particular construct. It's already about as compact as you could expect. I very much doubt this is the sort of thing occupying 18 MB of source; that would be an enormous number of shadings. There is no realistic way to make that smaller, and rendering it to a bitmap (even at very low resolution) would actually make it larger.
It's entirely possible the EPS contains things like a bitmap preview, which will, of course, be removed when creating a PDF. It may also (depending on the creating application) contain the original document, stored as comments, which will also be removed when creating a PDF file. Without seeing the original EPS it's not really possible to suggest much.
I'm afraid posting little bits of the file isn't going to help really.

Make all text in pdf slightly thicker/fatter. Like simulating dot-gain in offset printing?

I would like to make all text in a PDF (from a customer) slightly thicker/fatter to simulate how it will look when printed (normal dot gain) on an offset press.
If I use the PitStop-plugin in Acrobat I can convert all text to outline and then add a stroke to the outline so that it will be thicker/fatter.
However, those are manual steps and I need to automate it completely.
My thought was to go with Ghostscript, and I've managed to convert the text to outlines, but I can't find whether there's a way to add a stroke or something similar within Ghostscript.
My current command is:
gs -o output.pdf -dNoOutputFonts -StrokeWidth=2 -sDEVICE=pdfwrite input.pdf
I've tried to add -StrokeWidth=2, but that had no effect (I don't even know what kind of measure it wants).
Any ideas/solutions?
Best Regards
Niclas Rådström
It is possible to insert a redefinition of the 'show' operator which strokes the outline of the character with a line width of your choice, in addition to the regular show operation. But I agree with Max Wyss that something actually designed to address your problem, like a dot-gain profile, would be a better bet.
/show{gsave dup true charpath .025 setlinewidth stroke grestore //show}def
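For a PostScript input this redefinition can be prepended on the Ghostscript command line, as in the sketch below; note it only affects text drawn with show (not ashow, xshow, glyphshow and friends, and not text in files that bind show early), and for a PDF input you would first have to convert it to PostScript for the redefinition to take effect.
gs -o thickened.pdf -sDEVICE=pdfwrite -c "/show { gsave dup true charpath .025 setlinewidth stroke grestore //show } def" -f input.ps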

Adding encoding in postscript, ghostscript renders text correctly, but converting to PDF does not show the characters

We have to construct a PostScript file that contains Arabic text as well as English text.
Ghostscript shows the Arabic text correctly, but converting it to PDF does not show the Arabic letters.
PS file contains the following:
/TraditionalArabic findfont dup
length dict
copy begin
/Encoding Encoding 256 array copy def
Encoding 1 /kafinitialarabic put
Encoding 2 /behinitialarabic put
Encoding 3 /yehmedialarabic put
Encoding 4 /seenfinalarabic put
Encoding 5 /eacute put
Encoding 6 /a put
/ArabicTradDict currentdict definefont pop
end
%%Page: 1 1
%%BeginPageSetup
%%PageMedia: Color Weight Type
<< /MediaColor (Blue)/MediaWeight 75 /MediaType () /xx {2.803464567 mul} def /xx {2.83464567 mul} def /PageSize [240 xx 345 xx]>> setpagedevice
%%EndPageSetup
/ArabicTradDict 18 selectfont
72 xx 300 xx moveto
(\004\003\002\001) show
showpage
To run Ghostscript, I run it from the command line so as to include all Windows fonts:
gswin64.exe -sFONTPATH=%windir%/fonts -dEmbedAllFonts=true
To convert the PS file to a PDF file, I run the following command:
gswin64.exe -dBATCH -dNOPAUSE -sOutputFile=c:/Users/mob/Desktop/TimesNewRomanPSMT.pdf -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress -dCompressFonts=false -dSubsetFonts=false -sFONTPATH=%windir%/fonts -dEmbedAllFonts=true -f c:/Users/mob/Desktop/TimesNewRomanPSMT.ps
When converting to PDF, the Arabic characters are not shown correctly, but appear as meaningless squares...
If I use the Adobe tool to convert to PDF, the resulting PDF is the same, except that eacute (code 005), if included in the PS file, does show after conversion, whereas with the previous command line none of the characters added to the Encoding are shown correctly.
Any help with that?
Thanks to KenS's hints I was able to solve my problem. The encoding used the wrong character names, such as kafinitialarabic (wrong in the sense that the PDF could not understand them); every name ending in 'arabic' was wrong, because the Traditional Arabic font does not use those names for its characters. To find out what names it actually understands, I converted the TTF font to AFM and PFA (that is, converted the TrueType font to a Type 42 font, which will be understood once embedded in the PostScript file when converting to PDF) using the following command:
C:\Program Files\gs\gs9.10\bin>gswin64c.exe -dNODISPLAY -q -- ttf2pf.ps times timesPS timesAFM
where times is the TTF font name. I then checked the generated PFA file for the characters I wanted to add: instead of kafinitialarabic there was kafinitial, for kafmedialarabic there was kafmedial, and so on...
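As a hypothetical illustration (only kafinitial and kafmedial were confirmed above; the other names below are assumptions following the same pattern and must be checked against the generated PFA), the Encoding entries from the original file would become something like:
% glyph names below are assumed; verify each against the generated .pfa
Encoding 1 /kafinitial put
Encoding 2 /behinitial put
Encoding 3 /yehmedial put
Encoding 4 /seenfinal put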
It works fine now when I add those names to the Encoding, but instead of adding all those characters to the dictionary I want to find a way to use the font as we would normally use it with setfont in PostScript, if that is possible...
As already suggested, you need to ensure the glyph names you use are in the font you use, or create a new font.
I haven't found anything that will choose the correct glyph from the set of initial, medial, final, isolated, depending on context, though.
I resorted to writing a program which takes Unicode Arabic, reverses the Arabic characters, and then decides which form of each character to use based on its position in a word and on whether the previous or next characters are forced into isolated or final forms. Unfortunately I had to embed quite a lot of intrinsic knowledge about the font in use and the glyph names it has (as well as typos in them) into the program.
If that's of interest, I've stuck it on github, but it's very raw and initial.
It does work, though.
https://github.com/gbjk/arabic2ps
The font I used was a traditional arabic font, with quite a few idiosyncrasies.