Avoiding fragmenting of text extracted from PDF after processing with Ghostscript

After processing with Ghostscript, I sometimes see whitespace breaking up words, both in pdftotext output and in a PDF viewer when searching or selecting. Possibly unrelated, but the anomalies seem to correspond with kerning variations in the rendered font.
Is there a way to avoid this?
For example, from GS 9.23 (also occurred with earlier versions):
gs -sDEVICE=pdfwrite \
-dNOPAUSE -dQUIET -dPARANOIDSAFER -dBATCH \
-sOutputFile=./output.pdf input.pdf
Excerpt from pdftotext input.pdf:
Review this manual before
operating deep cleaner
while pdftotext output.pdf:
Re vie w t his m a nua l be fore
ope ra t ing de e p c le a ne r

Ghostscript and the pdfwrite device (as explained in VectorDevices.htm) do not simply 'fiddle' with the input when producing a PDF file. The input (from whatever source: PDF, PostScript, XPS, PCL, PCL-XL) is fully interpreted into marking operations, and those marking operations are sent to the device, which turns them back into PDF constructs.
So the low level (PDF) format describing the page need not bear any relation to the low level format of the input. In particular you cannot expect the PDF operations in the input to be reflected in the output.
The visual appearance will be the same (or should be, because that's the main goal), but the actual operations won't be.
The text output differs because, basically, there is no 'metadata' in a PDF file that describes words, paragraphs, columns etc. When you extract text from a PDF file, what you actually get is a series of character codes and positions.
It's up to the text extraction code to try to make some sense of that. I'd guess that pdftotext is using the rather naive approach of assuming that text strings are words.
This is problematic because there are numerous different ways to handle kerning, justification and other spacings in PDF. You could do something like :
(Te) Tj
10 0 Td
(st) Tj
Or :
[(Te) 2 (st)] TJ
The pdfwrite device doesn't know what the original was, so what it emits could be either of those, depending on some heuristics. The chances of it matching the original are low.
I suspect that pdftotext would regard the first operation as "Te st" and the second operation as "Test"
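To make that guess concrete, here is a minimal Python sketch of two extraction strategies applied to the same positioned text. It is not pdftotext's actual algorithm: the Run type, the fixed 5-unit glyph advance and the gap threshold are all invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Run:
    text: str   # characters shown by one Tj (or one string in a TJ array)
    x: float    # horizontal start position, in text-space units

GLYPH_WIDTH = 5.0  # hypothetical fixed advance per character

def naive_extract(runs):
    # Naive approach: every text string is assumed to be a separate word.
    return " ".join(r.text for r in runs)

def gap_extract(runs, threshold=2.0):
    # Position-based approach: emit a space only when the horizontal gap
    # between the end of one run and the start of the next is large enough.
    out = []
    prev_end = None
    for r in runs:
        if prev_end is not None and r.x - prev_end > threshold:
            out.append(" ")
        out.append(r.text)
        prev_end = r.x + GLYPH_WIDTH * len(r.text)
    return "".join(out)

# "(Te) Tj 10 0 Td (st) Tj": with 5-unit glyphs, "st" starts where "Te" ends.
runs = [Run("Te", 0.0), Run("st", 10.0)]
print(naive_extract(runs))  # Te st
print(gap_extract(runs))    # Test
```

An extractor of the second kind is less sensitive to how the producer happened to split the strings, which is why a different extraction tool can give different results on the same file.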
One possible solution would be to use Ghostscript's txtwrite device to extract the text; it might do a better job.
As with your other question, it would be best to supply examples when asking these kinds of questions, because without them it's pretty much guesswork.
TL;DR
Is there a way to avoid this?
No.

Related

Can't get Ghostscript "viewraw.ps" or "viewrgb.ps" programs to work (scrambled output)

I've had good results in the past using the "viewjpeg.ps" PostScript program included with Ghostscript to place JPEG images into generated PDFs. Now I'm trying to do the same for bitmaps, and I just haven't been able to make it work. My hunch is that the program I need is either "viewraw.ps" or "viewrgb.ps," and I can see that the parameters expected will be a bit different from those passed to "viewjpeg.ps."
So far this is what I have:
"C:\Program Files\gs\gs9.10\bin\gswin64c.exe" -q -sDEVICE=pdfwrite -DNOSAFER -r200x200 -sOutputFile=o.pdf z:\home\dell\reporting\viewrgb.ps -c "(out002.bmp) 6800 viewrgb"
This gets pretty close to what I want, but my bitmap (though clearly identifiable) is scrambled in the output PDF: compressed vertically, upside-down, and somewhat wrong in color.
I have attempted to address these issues by tweaking the "width" parameter (6800 above). My bitmap is 1,700 pixels wide, and uses 4 bytes per pixel, so 1,700 * 4 = 6,800 seemed like a logical choice. I've also tried 1,700 (width in pixels) and 54,400 (bits per image row). 5,100 (3 * 1,700) seemed to work best, but it's still wrong.
Note that "viewjpeg.ps" does not expect a "width" parameter, so I haven't had to deal with this before. (It was an examination of "viewrgb.ps" that made me realize this parameter was required.)
Can anyone spot my mistake, or maybe point me to an example that uses "viewraw.ps" or "viewrgb.ps"?

You haven't said (or I missed it) what format your 'bitmaps' are, and you haven't supplied an example to look at so I can't tell (or experiment).
You say your output is 4 bytes per pixel so that's either CMYK or something like RGBa. Either way viewrgb isn't going to work, because it only expects 3 channels. It's intended to view the output of the Ghostscript bitrgb device.
Viewraw just reads raw data, straight image samples, no header IIRC and it's CMYK, so unless your 4 bytes are CMYK then it's not going to be correct either.
Since both of these are raw formats, they don't expect a header. If your image format includes a header, it will be treated as image data, which will certainly cause the image to be drawn incorrectly.
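As a rough illustration of the mismatch, the Python sketch below turns 4-byte-per-pixel data into the headerless 3-byte-per-pixel samples that a 3-channel viewer would expect. It assumes the channel order is R, G, B followed by a fourth byte to discard, that rows are stored top-down without padding, and that you know the header size. None of that is confirmed for your file; a Windows BMP, for instance, normally stores rows bottom-up and channels in B, G, R order, which this sketch does not handle. The function name and parameters are hypothetical.

```python
def drop_fourth_channel(data: bytes, width: int, height: int,
                        header_size: int = 0) -> bytes:
    """Return 3-byte-per-pixel samples from 4-byte-per-pixel input."""
    pixels = data[header_size:]          # skip the (assumed) header bytes
    expected = width * height * 4        # bytes needed for the stated size
    if len(pixels) < expected:
        raise ValueError("not enough sample data for the stated dimensions")
    out = bytearray()
    for i in range(0, expected, 4):
        out += pixels[i:i + 3]           # keep three channels, drop the fourth
    return bytes(out)

# Two pixels, each followed by a fourth byte (255) that gets discarded.
sample = bytes([1, 2, 3, 255, 4, 5, 6, 255])
print(list(drop_fourth_channel(sample, width=2, height=1)))  # [1, 2, 3, 4, 5, 6]
```

With three bytes per pixel, the row length in bytes would be 3 times the width in pixels.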
Both of these PostScript programs will display a usage message on the back channel if you invoke them incorrectly.
You don't need -dNOSAFER with such an old version of Ghostscript (9.10).
-r has little effect on pdfwrite and will have no effect at all when you feed it an image as input; you should probably omit that.

Losslessly Compress PDF Generated from PostScript

I am generating multiple EPS files, which contain several PostScript drawing commands that are not necessarily encoded efficiently. The first update in the answer to this question describes similar inefficiencies.
Each of my EPS files is around 18 MB, and the resulting PDF files are around 3 MB. I am generating the PDF files using epstopdf, which enables some sort of compression by default.
Are there any suggestions for how to further reduce the resulting PDF file sizes without changing the quality (e.g. rasterizing the vector graphics)?
I tried reducing the precision of the coordinates from 8 digits past the decimal to 3. This reduced the EPS file sizes to about 14 MB, but, counter-intuitively, the PDF file sizes slightly increased.
Update 1: The EPS files contain several occurrences of the sample code below for different coordinates and colors.
newpath
1 setlinejoin
1 setlinecap
<<
/BBox [322 384.0417 615.0087 651.9958]
/Domain [322 384.0417 615.0087 651.9958]
/ShadingType 6
/ColorSpace [/DeviceRGB]
/DataSource
[
0
350.00000000 651.99583594
336.00000000 645.75890880
336.00000000 645.75890880
322.00000000 639.52198166
339.17140372 627.26533984
339.17140372 627.26533984
356.34280743 615.00869803
370.19224806 621.16169097
370.19224806 621.16169097
384.04168868 627.31468392
367.02084434 639.65525993
367.02084434 639.65525993
0.23047 0.29688 0.75
0.23047 0.29688 0.75
0.41081 0.54141 0.93366
0.41112 0.54178 0.93388
]
>>
gsave
322 615.0087 62.04169 36.98714 rectclip
shfill
grestore
Update 2: I have been able to reduce the PDF file sizes by about 15% by using pdftocairo, followed by gs -dCompatibilityLevel=1.4 -dPDFSETTINGS=/default -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dDetectDuplicateImages=true -sOutputFile=out.pdf in_.pdf.

PostScript is a programming language and PDF is not, so often you can actually create a smaller PostScript program than the resulting PDF file.
The 'inefficiencies' you mention in your EPS program, and the precision of the input numbers, are completely irrelevant to the size of the PDF file. The operators in PDF do not have the same names as the operators in PostScript, so a 'moveto' in PostScript does not simply get transliterated into a 'moveto' in the resulting PDF file. The precision of numbers in the output PDF file is not tied to the precision of the numbers in the input.
In addition, PostScript interpreters often use a fixed precision arithmetic (Ghostscript for example uses 24:8), so (eg) 1.5 on the input may not be produced as 1.5 on the output, it may instead become 1.49999999.
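A rough Python sketch of what an 8-bit fractional part does to input coordinates. The helper names and the round-to-nearest behaviour are assumptions for illustration; this does not reproduce Ghostscript's internal arithmetic.

```python
FRACTION_BITS = 8  # assumed: the fractional part of a 24:8 fixed-point value

def to_fixed(value: float) -> int:
    # Scale by 2**8 and round to the nearest representable step (assumption).
    return round(value * (1 << FRACTION_BITS))

def from_fixed(fixed: int) -> float:
    return fixed / (1 << FRACTION_BITS)

# Coordinates taken from the shading example, plus 0.1 for comparison.
for original in (350.0, 651.99583594, 0.1):
    restored = from_fixed(to_fixed(original))
    print(original, "->", restored, "error:", abs(restored - original))
```

Under these assumptions any value is stored to the nearest 1/256, so the extra digits in the input are lost before output is written, whether you supply 3 or 8 of them.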
So the result of this, basically, is that nobody can tell why your PDF files are as large as they are without seeing them. Personally, I would suggest that a 6:1 reduction in size is pretty reasonable. If you post a representative example somewhere, it's possible someone could look at it and might be able to offer some suggestions, but without seeing the content it's not really possible to tell.
NB rendering the content would most likely increase the size of the PDF file, unless you render at a really low resolution.
EDIT
The supplied example is simply a shading dictionary; the PDF file will contain almost exactly the same data for that particular construct. It's already about as compact as you could expect. I very much doubt that this is the sort of thing occupying 18MB of source; that would be an enormous amount of shadings. There is no realistic way to make that smaller, and rendering it to a bitmap (even at very low resolution) would actually make it larger.
It's entirely possible the EPS contains things like a bitmap preview, which will, of course, be removed when creating a PDF. It may also (depending on the creating application) contain the original document, stored as comments, which will also be removed when creating a PDF file. Without seeing the original EPS it's not really possible to suggest much.
I'm afraid posting little bits of the file isn't going to help really.

Make all text in pdf slightly thicker/fatter. Like simulating dot-gain in offset printing?

I would like to make all text in a pdf (from a customer) slightly thicker/fatter to simulate how it will look when printed (normal dot-gain) in an offset press.
If I use the PitStop-plugin in Acrobat I can convert all text to outline and then add a stroke to the outline so that it will be thicker/fatter.
However, those are manual steps and I need to automate it completely.
My thought was to go with GhostScript, and I've managed to convert the text to outlines, but I can't find out whether there's a way to add a stroke or something similar within GhostScript.
My current command is:
gs -o output.pdf -dNoOutputFonts -StrokeWidth=2 -sDEVICE=pdfwrite input.pdf
I've tried to add: -StrokeWidth=2 but that gave me no effect (I don't even know what kind of measure it wants)
Any ideas/solutions?
Best Regards
Niclas Rådström

It is possible to insert a redefinition of the 'show' operator which strokes the outline of each character with a line width of your choice, in addition to the regular show operation. But I agree with Max Wyss that something actually designed to address your problem, like a dot gain profile, would be a better bet.
/show{gsave dup true charpath .025 setlinewidth stroke grestore //show}def

Replace colors in PDF using ghostscript

How can I convert a single color from a PDF document into another color, for example convert all instances of #ff0000 (red) to #ffffff (white)?
I've seen a number of ghostscript commands doing something similar (using setcolor, setcolortransfer), but I can't find a solution for this exact problem.
For example, the following will create an image-negative of the input PDF:
gs -o output.pdf -sDEVICE=pdfwrite -c "{1 exch sub}{1 exch sub}{1 exch sub}{1 exch sub} setcolortransfer" -f input.pdf
I'd like to go beyond this with a finer level of control: replacing a single color with a different color (not its negative).

Essentially, you can't (or at least not using Ghostscript).
Firstly you seem to be assuming that the colours will be specified in RGB when in fact they could be specified in CMYK, ICC, CalRGB or Lab. You also need to consider Indexed colour spaces.
Secondly, Ghostscript does not 'edit' PDF files. When you send a PDF file as input to Ghostscript, it is fully interpreted to graphics primitives and the primitives are processed.
When the output is PDF the primitives are reassembled into a new PDF file. The goal of this process is that the visual appearance of the new PDF file should match the original. It is NOT the same PDF file, its internals will likely be completely different.
Finally, how do you plan to handle images ? Will you process those byte by byte to massage the colours ? Or do you plan to ignore them ? Same goes for shadings, where the colours aren't even present in the PDF file directly, but are generated from functions.
Without knowing why you want to do this I can't even offer a different approach other than; decompress the PDF file, read it and replace the colours manually.
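For what the manual approach might look like, here is a hedged Python sketch that rewrites one colour in an already decompressed page content stream. It only recognises the DeviceRGB fill and stroke operators (rg and RG) with literal operands and an exact match; it will not touch CMYK, ICC-based or Indexed colours, images or shadings, for the reasons given above. The function name and the sample stream are made up for illustration.

```python
import re

# Three numeric operands followed by the DeviceRGB fill (rg) or stroke (RG) operator.
RGB_OP = re.compile(
    rb"(?<![\w.])(\d*\.?\d+)\s+(\d*\.?\d+)\s+(\d*\.?\d+)\s+(rg|RG)\b"
)

def replace_rgb(content: bytes, old, new) -> bytes:
    old = tuple(float(v) for v in old)

    def swap(match):
        values = tuple(float(match.group(i)) for i in (1, 2, 3))
        if values != old:
            return match.group(0)  # a different colour: leave untouched
        operands = " ".join("%g" % v for v in new).encode("ascii")
        return operands + b" " + match.group(4)

    return RGB_OP.sub(swap, content)

# Hypothetical content stream: a red-filled rectangle with a blue stroke colour.
stream = b"q 1 0 0 rg 0 0 10 10 re f 0 0 1 RG Q"
print(replace_rgb(stream, (1, 0, 0), (1, 1, 1)))
# b'q 1 1 1 rg 0 0 10 10 re f 0 0 1 RG Q'
```

Even on a simple file, the edited stream would still have to be written back with its length and the cross-reference table kept consistent, which this sketch does not attempt.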

How to search my PDF with grep?

I have followed ideas from this thread but it does not work.
https://unix.stackexchange.com/questions/6704/how-can-i-grep-in-pdf-files
pdftotext PercivalWalden.pdf - | grep 'Slepian'
pdftotext PercivalWalden.pdf - | grep 'Naive'
pdftotext PercivalWalden.pdf - | grep 'Filter'
I know for sure that 'Filter' appears at least 100 times in this book.
Any ideas?

If you really can grep a given string (that you can 'see' and read on a rendered or printed PDF page) from a PDF, even with the help of pdftotext, then you must be very lucky indeed.
First off: most of the advice from the link you provided to unix.stackexchange.com is very uninformed (to put it most politely). Most of the answers there are clearly written by people who are not familiar with the huge range of PDF variations out there.
In your case, you are trying to convert the file with the help of pdftotext first, streaming the output to stdout.
There are many types of PDF where pdftotext cannot extract the text at all. The reasons for this may be (listings below not complete):
The "text" that you see is not based on using a font. It may be one big raster image generated by a scan or other production process, then embedded into a PDF file shell. This may make the page only appear to be text strings.
The "text" that you see is not based on using a font. It may be a series of small vector drawings (or small raster images), that only look like text strings to our eyes and brain.
There are many software applications, which do convert fonts to so-called 'outlines'. The reason for this seemingly strange behaviour may be:
Circumvent licensing problems (when a certain font disallows its embedding).
Impose a handicap upon attempts to extract the text.
Accidentally wrong setting in the PDF generating application.
The font is embedded as a subset in the PDF file (by the PDF generating software -- users usually do not have much control over the details of this operation) and uses a 'custom' encoding, but the file does not provide a toUnicode table to map the glyphs to characters.
'Glyphs' are the well-defined shapes in each font drawn on screen. Glyphs map to characters for the computer -- our eyes merely see these shapes and our brains translate these to characters without needing a toUnicode table. Programs like pdftotext require a toUnicode table to reverse the translation of glyphs back to characters.
You can use a command line utility named pdffonts to gain a first insight into the fonts used by your PDF file. Example output:
pdffonts paper-projectiris---final.pdf
name                       type         encoding       emb sub uni object ID
-------------------------- ------------ -------------- --- --- --- ---------
TCQJEF+CMCSC10             Type 1       Builtin        yes yes no      96  0
VPAFLY+CMBX12              Type 1       Builtin        yes yes no      97  0
CWAIXW+CMTI12              Type 1       Builtin        yes yes no      98  0
OBMDLT+CMR12               Type 1       Builtin        yes yes no      99  0
In this case, text extraction (and your method of grepping for strings) should work: even though the column named uni (which tells whether a toUnicode map is embedded in the PDF file) says no for every font, the encoding column does not contain Custom but Builtin (meaning that a glyph->character mapping is provided with the font file, which is of type Type 1).
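If you want to run that check over many files, the rule of thumb above can be sketched in Python. This assumes pdffonts output shaped like the sample (two header lines, font names without spaces) and treats "Custom encoding with no toUnicode map" as the warning sign. It is a heuristic based on the explanation above, not a guarantee that extraction will or will not work, and the function names are hypothetical.

```python
SAMPLE = """\
name type encoding emb sub uni object ID
-------------------------- ------------ -------------- --- --- --- ---------
TCQJEF+CMCSC10 Type 1 Builtin yes yes no 96 0
VPAFLY+CMBX12 Type 1 Builtin yes yes no 97 0
CWAIXW+CMTI12 Type 1 Builtin yes yes no 98 0
OBMDLT+CMR12 Type 1 Builtin yes yes no 99 0
"""

def parse_pdffonts(output: str):
    fonts = []
    for line in output.splitlines()[2:]:      # skip the two header lines
        if not line.strip():
            continue
        # Split from the right: the type column ("Type 1") contains a space.
        rest, encoding, emb, sub, uni, _obj, _gen = line.rsplit(None, 6)
        name, font_type = rest.split(None, 1)
        fonts.append({"name": name, "type": font_type, "encoding": encoding,
                      "emb": emb, "sub": sub, "uni": uni})
    return fonts

def likely_unextractable(font) -> bool:
    # Heuristic: custom encoding and no toUnicode map to translate it.
    return font["uni"] == "no" and font["encoding"].lower() == "custom"

fonts = parse_pdffonts(SAMPLE)
print([f["name"] for f in fonts if likely_unextractable(f)])  # []
```

To use it on a real file, you would pass in the text printed by pdffonts for that file instead of SAMPLE.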
To sum it up: Without access to your PDF file it is impossible to tell why you cannot "grep" for the strings you are looking for!
