How to use ps2pdf and force it to keep plain text if the original pdf contains real text?
Sometimes, if a PDF has some areas with background colors, it converts the whole PDF to an image.
How to force ps2pdf to keep plain text?
Syntax:
pdf2ps file.pdf file.pdf.ps
ps2pdf -dPDFSETTINGS=/screen -dColorImageResolution=50 -dGrayImageResolution=50 file.pdf.ps file_output.pdf
PDF example
www.bluemachines.dk/pdf_comp/dyn.pdf
The first answer is to drive Ghostscript directly; don't use ps2pdf (or pdf2ps).
If you are getting text converted to an image, then it's most likely because the original PDF file has transparency, which cannot be represented in PostScript. The only way to deal with that is to render the area of transparency.
There is no way to maintain the Encoding of the text, though in general it won't change. However this is highly dependent on the font and encoding used in the input. I can't say more without seeing an example.
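As a rough sketch of what driving Ghostscript directly could look like for the command in the question: the pdfwrite device reads the PDF itself, so the PostScript round trip disappears. The function name recompress_pdf and the GS override variable are made up for illustration; the switches are the ones from the question plus -sDEVICE=pdfwrite and -o.

```shell
# Hypothetical helper: recompress a PDF without going through PostScript.
# $1 = input PDF, $2 = output PDF.
# GS may be set to point at a specific Ghostscript binary (an assumption
# made for this sketch, not a Ghostscript convention).
recompress_pdf() {
    "${GS:-gs}" -o "$2" \
        -sDEVICE=pdfwrite \
        -dPDFSETTINGS=/screen \
        -dColorImageResolution=50 \
        -dGrayImageResolution=50 \
        "$1"
}
```

Usage would be recompress_pdf file.pdf file_output.pdf. If areas are still rendered to images, one thing that may be worth checking is the output compatibility level: PDF versions before 1.4 cannot hold transparency, so adding -dCompatibilityLevel=1.4 could help if the preset in use selects an older level.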
Related
Referring to this post, GhostScript Conversion Font Issues, is it safe to assume that GhostScript's PS-to-PDF conversions still do not guarantee cut-&-paste text from the converted document? Because I too am getting garbled copy-&-paste results with formatted documents, although it works with plain text files.
sample Word document .DOC
printed to PostScript by MS PS Driver
converted to PDF by GhostScript
On the color issue, I am using the Microsoft PS Class Driver to print documents to PostScript format files, and then convert them to PDF format with the GhostScript v9.20 DLL (sample source and outputs attached above). The options used are as follows:
-dNOPAUSE
-dBATCH
-dSAFER
-sDEVICE=pdfwrite
-sColorConversionStrategy=/RGB
-dProcessColorModel=/DeviceRGB
However, it is converted without color. Have I missed some option?
You can never guarantee getting a PDF file with text you can cut and paste from a PostScript program. There is no guarantee that there is any ToUnicode information in the PostScript program, and without that, if the font is subset as here, then there is no way to know what the Unicode code point for a given glyph is.
Regarding colour, the PostScript file you have supplied contains no colour, so it's not Ghostscript; the problem is in the way you have produced the PostScript. At a guess, you have used a printer definition (PPD file) which is for a monochrome printer.
You might be able to improve the text by playing with the options for downloading fonts, the basic problem is that your PostScript program doesn't contain the information we need to be able to construct a ToUnicode CMap. Without that we are forced to assume that the character codes are ASCII, and in your case, because the fonts are subset, they are not ASCII.
For some reason the content of your PostScript appears to be downloading the font as bitmaps. This is ugly, doesn't scale well, and might be the source of your inability to get ToUnicode data inserted. It may also be caused by the fonts you are using, you might try some standard system fonts (if you aren't already) like TimesNewRoman.
While it's great that you supplied an example to look at, I'd suggest that in future you make the example smaller, much smaller. There's really no need for 13 pages of repeated content in this case. More content means it takes more time to decipher, so try to keep example files to the minimum required to demonstrate the problem.
In short, it looks like both your problems are due to the way you are (or the application) generating the PostScript.
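One possible way to check whether a PostScript file carries any colour before looking at the conversion options is Ghostscript's inkcov device, which prints per-page CMYK coverage figures. The sketch below assumes each output line has the form "C M Y K CMYK OK"; the function name reports_cmy and the GS override variable are hypothetical.

```shell
# Hypothetical helper: exit 0 if any page reports cyan, magenta or yellow
# coverage, exit 1 otherwise.
# Assumes inkcov prints one line per page ending in "CMYK OK", with the
# four coverage figures in the first four fields.
# A "no colour" result is the informative case; non-zero C/M/Y might also
# come from greys specified in RGB (an assumption about colour conversion).
reports_cmy() {
    "${GS:-gs}" -q -o - -sDEVICE=inkcov "$1" |
        awk '$5 == "CMYK" && ($1 + $2 + $3) > 0 { found = 1 }
             END { exit !found }'
}
```

For example, reports_cmy file.ps || echo "no colour found in the PostScript" would point at the PostScript generation step rather than at the pdfwrite options.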
I want to convert the white text in this PDF into black text and generate a new PDF with the changed text.
I have found this
http://www.artifex.com/files/Ghostscript_Color_Architecture.pdf
which mentions settings like -sTextICCProfile but using black_output.icc from
http://www(dot)ghostscript.com/doc/toolbin/color/icc_creator/effects/
like so:
gs -o test.pdf -sTextICCProfile=black_output.icc out.pdf
does not change the text colour to black.
Is the usage of the .icc profile incorrect? Is it even the right approach?
Is there a way to achieve this with postscript?
Example PDF
The usage of the ICCProfile is correct...
However, that usage is for rendering, it has no effect on the pdfwrite device at all (because it doesn't render the input, it turns it into a PDF file). So no, this is not the correct approach.
There is no real means to do what you want with Ghostscript. Technically it's probably possible, but it wouldn't be easy. You also apparently haven't posted an example of the PDF file. It's entirely possible that the 'text' is not actually text. It may be an image, or vectors, which look like text.
There may also be transparency involved, which would complicate the matter still further.
I'm converting PDF to JPG with gs.
Does gs substitute embedded fonts? How exactly does this work? If I embed all fonts that are used in the PDF, does gs still look for some substitution, or can it use the embedded font data?
So does embedding fonts in a PDF mean that all glyphs used in the PDF with that font are embedded, and I don't need to have that font in my gs font path?
Thanks!
When you’re outputting a JPEG file, you’re in effect outputting an image. This means that Ghostscript renders the page as an image, then compresses the image using JPEG. JPEG is lossy; to avoid reduced legibility of the text, use a lossless format such as PNG instead. JPEG is basically only good for photographs, where lossless compression would produce files that are much too big.
In a bitmap image, there are no fonts, only pixels – so, for text rendering (e.g. black text on a white page), Ghostscript will create a bitmap image consisting only of greyscale pixels (by means of anti-aliasing), then save that.
To be able to do that, Ghostscript must have access to the fonts at the time of PDF rendering and JPEG creation. This means that the fonts either must be installed on the system (and in your font path), or embedded in the PDF in the first place. They are not necessary to view the JPEG file.
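A minimal sketch of rendering to a lossless format, as suggested above. The helper name render_pages, the GS override variable, the 150 dpi resolution and the output name pattern are illustrative assumptions; png16m, -r, -dTextAlphaBits and -dGraphicsAlphaBits are standard Ghostscript switches.

```shell
# Hypothetical helper: render every page of a PDF to a 24-bit PNG.
# $1 = input PDF, $2 = optional output pattern (%03d becomes the page number).
# The alpha-bits switches turn on anti-aliasing for text and graphics.
render_pages() {
    "${GS:-gs}" -o "${2:-page-%03d.png}" \
        -sDEVICE=png16m -r150 \
        -dTextAlphaBits=4 -dGraphicsAlphaBits=4 \
        "$1"
}
```

If JPEG output is required after all, -sDEVICE=jpeg together with a high -dJPEGQ value is the corresponding variation. Either way the fonts are only needed while Ghostscript renders, not to view the result.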
When I print a PDF-file with a PS-driver and then convert the PS-file to a searchable PDF with ghostscript (pdfwrite device) something is wrong with the final pdf file. It becomes corrupt.
In some cases, the space character disappears, and in other cases the text width becomes too large, so text overlaps other text.
The settings for gs are -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dEmbedAllFonts=true -dSubsetFonts=false -sOutputFile=output.pdf input.ps
I am wondering whether Ghostscript just can't produce good output when the input file is a PDF.
If I print a word-document everything works fine!
Are there any other solutions, like using an XPS driver and converting the XPS file to a searchable PDF instead? Are there any solutions out there that can do this?
I use gs 9.07.
Best regards
Joe
Why are you going through the step of printing the PDF file to a PostScript file? Ghostscript is already capable of accepting a PDF file as input.
This simply adds more confusion, it certainly won't add anything useful.
It's not possible to say what the problem 'might' be without seeing the original PDF file and the PostScript file produced by your driver. My guess would be that whatever application is processing the PDF hasn't embedded the font, or that the PostScript driver hasn't been able to convert the font into something suitable for PostScript, resulting in the font being missing in the output, and the pdfwrite device having to substitute 'something else' for the missing font.
Ghostscript (more accurately the pdfwrite device) is perfectly capable of producing a decent PDF file when the input is PDF, but your input isn't PDF, it's PostScript!
To be perfectly honest, if your original PDF file isn't 'searchable', it's very unlikely that the PDF file produced by pdfwrite will be either, no matter whether you use the original PDF or mangle it into PostScript instead.
The usual reasons why a PDF file is not 'searchable' are that there is no ToUnicode information and that the font is encoded with a custom encoding and does not use standard glyph names. If this is the case, there is nothing you can do with the PDF file except OCR it.
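A quick way to see whether the original PDF is 'searchable' in the first place is to ask Ghostscript to extract its text. The sketch below uses the txtwrite device on the first page only; the helper name extract_text and the GS override variable are made up for illustration.

```shell
# Hypothetical helper: print whatever text Ghostscript can extract from
# page 1 of a PDF. Empty or garbled output suggests the file lacks usable
# ToUnicode/encoding information, so pdfwrite output is unlikely to be
# searchable either.
extract_text() {
    "${GS:-gs}" -q -o - -sDEVICE=txtwrite -dFirstPage=1 -dLastPage=1 "$1"
}
```

If extract_text input.pdf prints readable text, feeding the PDF straight to pdfwrite (gs -o output.pdf -sDEVICE=pdfwrite input.pdf) avoids the PostScript detour entirely.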
I'm trying to convert PDFs to PCL (using Ghostscript, but I'd love to hear alternative suggestions), and every driver (Ghostscript device), including all of the built-ins and gutenprint, generates PCL files many times larger than the input PDF. This is the problem: I need my PCL to be about as small as the input.
Given that the text doesn't show up in the PCL file, I guess that Ghostscript is rasterizing the text. Is there a way to prevent GS generally, or just gutenprint, from doing that? I'd rather either have it embed the fonts, or not even embed the fonts (leave it to the printer to render the fonts)?
Unfortunately, there doesn't seem to be any documentation on this point.
There are 3 (I think) types of font in PCL. There are rendered bitmaps, TrueType fonts (in later versions) and the HPGL stick font.
PDF and PostScript have Type 1, Type 2 (CFF), Type 3 and Type 42 (TrueType, but not the same as PCL) fonts, and CIDFonts based on any of the preceding types.
The only font type the two have in common is TrueType, so in order to retain text, any font which was not TrueType would have to be converted into TrueType. This is not a simple task. So Ghostscript simply renders the text, which is guaranteed to work.
PDF is, in general, a much richer format than PCL; there are many PDF constructs (fonts, shading, stroke/fill in a single operation, transparency) which cannot be represented in PCL. So it's entirely possible that the increase in size has nothing to do with text and fonts.
In fact, I believe that the PXL drivers in Ghostscript simply render the entire page to a bitmap at the required resolution, and then wrap that up with enough PCL to be successfully sent to a printer. (I could be mistaken on this point though)
Basically, you are not going to get PCL of a similar size to your PDF out of Ghostscript.
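To measure how much the output actually grows for a particular file, something along these lines could be used. It is only a sketch: pxlmono is one of Ghostscript's PCL XL devices, while the helper name pdf_to_pxl, the GS override variable and the 300 dpi default are assumptions made for illustration.

```shell
# Hypothetical helper: convert a PDF to PCL XL so the sizes can be compared.
# $1 = input PDF, $2 = output file, $3 = optional resolution in dpi.
pdf_to_pxl() {
    "${GS:-gs}" -o "$2" -sDEVICE=pxlmono -r"${3:-300}" "$1"
}
```

After pdf_to_pxl input.pdf output.pxl, running wc -c input.pdf output.pxl shows both sizes side by side. Lowering the resolution is the obvious knob to try when rendered content dominates, at the cost of print quality.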
Here is a way to 'prevent Ghostscript from rasterizing text'. But its output will be PostScript. You may, however, succeed in converting this PostScript to PCL5e in an additional step.
The method will convert all glyphs into outline shapes for its PostScript output, and it does not work for its PDF or PCL output. The key here is the -dNOCACHE parameter:
gs -o somepdf.ps -dNOCACHE -sDEVICE=pswrite somepdf.pdf
Of course, converting font glyphs to outlines will take more space than keeping the original fonts embedded, because "fonts" are a space-optimized concept to store, retrieve and render glyph shapes.
Once you have this PostScript, you may be able to convert it to PCL5e with the help of either of the methods you tried before for PDF input (including {Apache?} FOP).
However, I have no idea if the output will be much smaller than versions with rasterized fonts (or even wholly rasterized pages). But it may be worth a test.
Now vote down this answer too...
Update
Apparently, from version 9.15 (to be released during September/October 2014), Ghostscript will support a new command line parameter:
-dNoOutputFonts
which will cause the output devices pdfwrite, ps2write and eps2write to "'flatten' glyphs into 'basic' marking operations (rather than writing fonts to the output)".
That means that the above command should be replaced by this:
gs -o somepdf.ps -dNoOutputFonts -sDEVICE=ps2write somepdf.pdf
Caveats: I've tested this with a few input files using a self-compiled Ghostscript based on current Git sources. It worked flawlessly in each case.