I'm trying to convert pdfs to text files.
I use this command to perform the conversion:
gs -dBATCH -dNOPAUSE -sDEVICE=txtwrite -sOutputFile=output.txt input.pdf
Ghostscript version is 9.07.
I get all the text shown in PDF. I'd like to preserve the blank lines in the text file if possible.
Thanks
You should upgrade: the current version of Ghostscript is 9.18, and 9.19 will be released very shortly. Each of the interim versions includes fixes to the txtwrite device.
Although it is true that PDF files do not include blank lines, the txtwrite device does have a mode whereby it will attempt to produce a reasonable representation of the original layout by using spaces and blank lines in a text file.
This is the default action in the current version of txtwrite, so you ought to be getting this already, unless you have selected a different TextFormat.
This mode is highly heuristic, easily fooled, doesn't cope well with superscripts, subscripts, significant point size changes and possibly other attributes which make the layout difficult to reproduce. Obviously without seeing your input file, there's nothing more I can tell you.
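To rule out a different TextFormat being selected, the layout mode can be requested explicitly. This is only a sketch: it assumes a Ghostscript recent enough that txtwrite accepts -dTextFormat, where 3 is the documented default (UTF-8 text approximating the original layout). The function name pdf_to_text and the GS override variable are made up for illustration and are not part of Ghostscript.

```shell
# Hypothetical helper: extract text while asking txtwrite for its
# layout-approximating mode explicitly instead of relying on the default.
# GS is an assumed override for the binary name (e.g. gswin64c on Windows).
pdf_to_text() {
    # $1 = input PDF, $2 = output text file
    ${GS:-gs} -q -dBATCH -dNOPAUSE -sDEVICE=txtwrite -dTextFormat=3 \
        -sOutputFile="$2" "$1"
}
```

Usage would be: pdf_to_text input.pdf output.txt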
We have already created single-page PDFs with trim and bleed boxes, greyscaled using an ICC profile. We then use Ghostscript to combine them into a multi-page PDF; however, after it has combined them, the trim and bleed boxes disappear and the greyscale reverts to color. We can use the Ghostscript greyscale command, but this doesn't help with the trim/bleed boxes, which we need for imposition.
This is what we are using:
$command = 'gs -sDEVICE=pdfwrite -dNOPAUSE -dBATCH -dSAFER -sOutputFile="' . $outputPath . '" ' . implode(' ', $pdfFiles);
We'd be glad of any help or suggestions. We process a high volume, so we are currently using PDFTK to combine the files, which keeps the boxes but doesn't fix the greyscale issue either.
You have not stated which operating system you are using, nor which version of Ghostscript, and you haven't supplied an example file.
The pdfwrite device goes to considerable effort not to alter the colour space or values of the input. If the input is in DeviceGray, then the output will be in DeviceGray, unless you specifically request a different space using the ColorConversionStrategy switch. What exactly do you mean by "the greyscale reverts to colour"? Does the PDF display differently? Does some other tool report the file as 'colour'?
There's really nothing anyone can suggest without a lot more information, in particular an example input file and ideally the file after you've run it through Ghostscript using the pdfwrite device.
Please note that Ghostscript's pdfwrite device does not 'combine' PDF files. The actual process is complex, and though the end result may appear to be the original files 'combined' that's not what is going on behind the scenes. The actual process is documented here.
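For reference, explicitly requesting a grey output space through ColorConversionStrategy can be sketched as below. This assumes a Ghostscript version whose pdfwrite accepts -sColorConversionStrategy=Gray. It only addresses the colour space and makes no attempt to carry TrimBox or BleedBox entries across. The helper name to_gray_pdf and the GS variable are illustrative.

```shell
# Hypothetical helper: run several PDFs through pdfwrite while requesting
# DeviceGray output. GS is an assumed override for the gs binary name.
to_gray_pdf() {
    # $1 = output PDF; remaining arguments = input PDFs, in page order
    out=$1
    shift
    ${GS:-gs} -q -dBATCH -dNOPAUSE -dSAFER -sDEVICE=pdfwrite \
        -sColorConversionStrategy=Gray -dProcessColorModel=/DeviceGray \
        -sOutputFile="$out" "$@"
}
```

Usage would be: to_gray_pdf combined.pdf page1.pdf page2.pdf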
I'm trying to convert a PDF to PDF/A. At every pass I'm getting the error "GPL Ghostscript 9.19: Annotation set to non-printing, not permitted in PDF/A, reverting to normal PDF output".
The PDF was previously generated from HTML by wkhtmltopdf. With the error being quite vague, I've done some research around PDF annotations. I've confirmed the PDF has no annotations; flattening annotations (though there aren't any) hasn't worked, and I tried the -dShowAnnots=false switch, all to no avail. I've also tried it with a variety of different PDFs and I'm getting the same error on them all.
The command I'm using to do the conversion is "gs -dPDFA=2 -dNOOUTERSAVE -sProcessColorModel=DeviceRGB -sDEVICE=pdfwrite -o output.pdf /Users/work/Documents/Projects/pdf-generator-service-tests/PDFA_def.ps -dPDFACompatibilityPolicy=1 input.pdf"
I tried creating a basic PDF page from Google's homepage using wkhtmltopdf https://google.com putput.pdf and again, no joy (this is an example of the PDFs I've tried to convert, for people who may want to try and replicate the issue).
I thought the error was quite specific: PDF/A does not permit annotations to be set to non-printing. You haven't included an actual example of the kind of file causing you a problem, so I can't possibly comment on the presence of any annotations, but I assure you that it's not possible to get this message without having annotations.
Since you've already set PDFACompatibilityPolicy to 1 there's not much else I can say. You could open a bug report and attach the file there, or post a link to one here. Without that I can't say much.
Oh and you don't say which version of Ghostscript you are using, or where you sourced it from. Occasionally packagers break things so it might be worth trying to build from source.
One point: you execute the PDFA_def.ps file before setting PDFACompatibilityPolicy=1. That's probably not going to work; you'll want to switch those two around. You should set the controls before you do any input, or stuff might go awry. Trying to change midstream isn't really a good idea.
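As a sketch, the same switches from the question with every control placed before any input file would look like this. The helper name to_pdfa and the GS variable are made up for illustration; the third argument stands for wherever your PDFA_def.ps lives.

```shell
# Hypothetical helper: all switches first, then PDFA_def.ps, then the input.
# GS is an assumed override for the gs binary name.
to_pdfa() {
    # $1 = input PDF, $2 = output PDF, $3 = path to PDFA_def.ps
    ${GS:-gs} -dPDFA=2 -dPDFACompatibilityPolicy=1 -dNOOUTERSAVE \
        -sProcessColorModel=DeviceRGB -sDEVICE=pdfwrite \
        -o "$2" "$3" "$1"
}
```

Usage would be: to_pdfa input.pdf output.pdf PDFA_def.ps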
I used gs (v9.21) to convert a PDF with annotations set to non-printing (hyperref) to a PDF/A compliant file. Annotations will not be present in the output file but, in my case, that was not an issue.
The command I used is:
gs -dPDFA=2 -dBATCH -dNOPAUSE -dPDFACompatibilityPolicy=1 -dUseCIEColor -sProcessColorModel=DeviceGray -sDEVICE=pdfwrite -sOutputFile=output_file.pdf input_file.pdf
Notes:
-dPDFACompatibilityPolicy=1 instead of -sPDFACompatibilityPolicy=1. The latter does not force gs to elide the annotation while the former does.
I used -dUseCIEColor because pdfa validation (https://www.pdf-online.com/osa/validate.aspx) failed with an issue related to the color space. This parameter is deprecated but I did not find any other way around this issue. For more details, see Convert PS files to PDF/A via Ghostscript, color space problems
Like KenS said, it's hard to know anything without a PDF to look at but since you're having trouble with the Google home page converted to PDF, I suspect that it's the external links that are causing the problem. Links are annotations and in PDF/A, external links are not permitted. Any link in HTML when converted to PDF will be considered external.
When I run a PDF which was originally created with LibreOffice on Linux through Ghostscript 9.19 on OS X to produce another (flattened) PDF, the output is perfect except for one problem. All emdashes in the entire document have been replaced with a standard hyphen (awkwardly followed by half a space). Oddly enough, if I highlight the resulting "hyphen+space", my context menu shows that I've selected an emdash, so the underlying text is still an emdash; it is just rendering the wrong glyph.
I can reproduce this on multiple documents from the same source, and I'm assuming there's a setting or switch somewhere that can help resolve this.
I don't know whether the font used makes a difference, but for the sake of reference, the body text of my document is set in Arno Pro. When I use a modern version of LibreOffice on OS X to make a sample document also containing an emdash in Arno Pro, the same problem is not exhibited, so it seems to be specific to the software which originally made these PDF files.
These PDFs are of legacy projects that I am not set-up to re-produce at this time, so I need to prepare them for reprinting using the existing files.
How do I retain emdash glyphs when running a command such as the following?
gs -dSAFER -dBATCH -dNOPAUSE -dNOCACHE -sDEVICE=pdfwrite \
-sColorConversionStrategy=/LeaveColorUnchanged \
-dAutoFilterColorImages=true -dAutoFilterGrayImages=true \
-sOutputFile=output.pdf input.pdf
I can add an example of the input PDF to this question if needed.
Without seeing the PDF file it isn't possible to give you an answer. Most likely the font isn't embedded, or if it is embedded doesn't have an emdash glyph.
Copy and paste uses the ToUnicode CMap, so it isn't dependent on the font. It's simply a list of character codes and the Unicode code point associated with each, when using a given font.
Note that this doesn't mean 'the underlying text is still an emdash'. The ToUnicode information is utterly separate from the font end of things, it is effectively metadata and bears no real relation to the font or rendering.
Put the file on DropBox and post the URL and someone can look into it. I'll be on vacation for the next few days though, but maybe someone else will look.
Note that in PDF you don't necessarily specify characters and positions as a list of consecutive characters; you can specify the position of each individually, or you can specify widths which override the width in the font, etc. So there almost certainly is only one glyph; the 'white space' you refer to is probably just that, white space, not another glyph.
I should also point out (I do this a lot) that Ghostscript never 'flattens', concatenates, merges, or performs any similar operation on PDF files. When using Ghostscript and the pdfwrite device, the original input (in whatever format) is fully interpreted into graphics marking operations and sent to the device. The device executes the marking operations; in the case of a rendering device, it scan-converts and writes to a bitmap. In the case of pdfwrite, it creates PDF operators.
The result of this is that the output PDF file bears no relation to the input PDF, other than its visual appearance.
You also don't say which version of Ghostscript you are using....
I am looking for a way to 'outline' all text/fonts in a PDF file, i.e. convert them to curves.
I would prefer to do this without having to convert the PDF to PostScript and back. Also, I would like to use free lightweight cross-platform tools that can be automated from the command line, such as Ghostscript or MuPDF.
Yes, you can use Ghostscript to achieve what you want.
I. For Ghostscript versions up to 9.14
You need to go through 2 steps:
Convert the PDF to a PostScript file, but use the side effect of a relatively unknown parameter: it is called -dNOCACHE. This will convert all used fonts to outline shapes:
gs -o somepdf.ps -dNOCACHE -sDEVICE=pswrite somepdf.pdf
Convert the PS back to PDF (and, maybe delete the intermediate PS again):
gs -o somepdf-with-outlines.pdf -sDEVICE=pdfwrite somepdf.ps
rm somepdf.ps
This method is not reliable long-term, because the Ghostscript developers have stated that -dNOCACHE may not be present in future versions.
Note: the resulting PDF will very likely be larger than the original one. Also, without additional command line parameters, all images in the original PDF will likely be reprocessed according to Ghostscript's built-in defaults, which can lead to unwanted side effects. These can be avoided by adding command line parameters that override those defaults.
II. Ghostscript versions 9.15 or newer
Ghostscript version 9.15 (released in September 2014) supports a new command line parameter:
-dNoOutputFonts
This will cause the output devices pdfwrite, ps2write and eps2write "to 'flatten' glyphs into 'basic' marking operations (rather than writing fonts to the output)".
This means: the two steps described for pre-9.15 GS versions can be avoided. The desired result can be achieved with a single command:
gs -o file-with-outlines.pdf -dNoOutputFonts -sDEVICE=pdfwrite file.pdf
Note: the same caveat is true as already noted in part I. If your PDF includes images, there may be unwanted side effects introduced by the simple command line above. To avoid these, you need to add more specific parameters.
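What "more specific parameters" means depends on the file, but as a rough sketch, the following combines -dNoOutputFonts with the standard pdfwrite image switches that turn off downsampling and select lossless Flate compression instead of automatic filter selection. Whether you need all of them is an assumption to check against your own files; the helper name and the GS variable are made up for illustration.

```shell
# Hypothetical helper: outline all fonts while leaving image data as
# untouched as pdfwrite allows (no downsampling, lossless compression).
# GS is an assumed override for the gs binary name.
outline_fonts_keep_images() {
    # $1 = input PDF, $2 = output PDF
    ${GS:-gs} -o "$2" -sDEVICE=pdfwrite -dNoOutputFonts \
        -dDownsampleColorImages=false -dDownsampleGrayImages=false \
        -dDownsampleMonoImages=false \
        -dAutoFilterColorImages=false -dColorImageFilter=/FlateEncode \
        -dAutoFilterGrayImages=false -dGrayImageFilter=/FlateEncode \
        "$1"
}
```

Usage would be: outline_fonts_keep_images file.pdf file-with-outlines.pdf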
This commit adds a new switch -dNoOutputFonts to the Ghostscript pdfwrite and ps2write devices which will produce a PDF file (or PostScript, depending on the selected device) where all the glyphs have been created as vectors, not as text.
You will need at least version 9.15 of Ghostscript to get this feature. Be aware that the PDF file will almost certainly be larger and copy/paste/search will (obviously) not work.
III. Ghostscript version 9.54.0 (Windows 10)
I found a method that preserves all fonts flawlessly as vectors without any visual errors and with just two printing steps, after Ghostscript is first installed and configured correctly.
(Note! You must add the Ghostscript bin and lib folders to your Windows PATH in order to get Ghostscript to do anything)
Instructions here
Print your PDF file (the one containing vector-based fonts or other vector elements) from Acrobat Reader to a file, YourFile.prn, using the Microsoft PS Class Driver. (To install this driver: Control Panel - Devices - Printers & Scanners - Add a printer or scanner. First let Windows search for a connected printer for a while; when it stops, select "The printer that I want is not listed" - Add a local printer or network printer with manual settings - Next - Use an existing port: File: (Print to File) - Next - Microsoft: Microsoft PS Class Driver - Next.)
Open Command prompt, navigate to the folder where YourFile.prn file is located and type: "C:\Program Files\gs\gs9.54.0\bin\gswin64c.exe" -dNOPAUSE -dNOCACHE -dBATCH -sDEVICE=eps2write -sOutputFile=YourFile.eps YourFile.prn
If you have a constant need to do this you can also create prn2eps.bat file containing the following:
"C:\Program Files\gs\gs9.54.0\bin\gswin64c.exe" -dNOPAUSE -dNOCACHE -dBATCH -sDEVICE=eps2write -sOutputFile=%1.eps %1.prn
To use that bat file you just need to type: prn2eps YourFile.
(Note! you must have the bat file and Yourfile.prn in the same directory)
For some reason the newest Ghostscript ps2epsi function didn't work in Windows 10, and Adobe-made PDFs had minor but consistent errors in some font characters when I imported them into non-Adobe design software as PDFs. Over the years I have found that EPS is one of the most reliable formats when vectors must be preserved from one application to another. Often it is enough to print the PDF to PDF again using a different printer driver, or to make a single file format change using Ghostscript, but not always.
I've been using Ghostscript to convert my single figure plots rendered in PDF to PNG:
gswin32c -sDEVICE=png16m -r300x300 -sOutputFile=junk.png ^
-dBATCH -dNOPAUSE Figure_001-a.pdf
This works in the sense I get a PNG out and it contains the plot.
But it contains a huge amount of white space as well (an example source image: http://cdsweb.cern.ch/record/1258681/files/Figure_001-a.pdf).
If you view it in Acrobat you'll note there is no white space around the plot. If you use the above command line you'll find the plot is only about 1/3 of the space.
When doing the same thing with an EPS file I run into the same problem. However, there is the command-line parameter -dEPSCrop that one can pass to get the PS rendering engine to pay attention to the BoundingBox.
I need a similar argument for rendering PDFs. I was not able to find it in the docs (nor even -dEPSCrop, actually).
I had exactly the same issue. I fixed it by adding the -dUseArtBox switch.
Example:
/usr/bin/gs -dUseArtBox -dBATCH -dNOPAUSE -sDEVICE=pngalpha -sOutputFile=output.png input.pdf
Note: the -dUseArtBox switch has been supported since Ghostscript version 9.07
-dUseArtBox
Sets the page size to the ArtBox rather than the MediaBox. The art box defines the extent of the page's meaningful content (including potential white space) as intended by the page's creator. The art box is likely to be the smallest box. It can be useful when one wants to crop the page as much as possible without losing the content.
There are various options to control which "media size" Ghostscript renders a given input to:
-dPDFFitPage
-dUseTrimBox
-dUseCropBox
With PDFFitPage Ghostscript will render to the current page device size (usually the default page size).
With UseTrimBox it will use the TrimBox (and it will at the same time set the PageSize to that value).
With UseCropBox it will use the CropBox (and it will at the same time set the PageSize to that value).
By default (given no parameter), Ghostscript will render using the MediaBox.
For your example, it looks like adding "-dUseCropBox" will do the job you're expecting.
Note, you can additionally control the overall size of your output by using "-sPAPERSIZE" (select amongst all pre-defined values Ghostscript knows) or (for more flexibility) use "-dDEVICEWIDTHPOINTS=NNN -dDEVICEHEIGHTPOINTS=NNN".
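Put together with the device from the question, a CropBox-based render might look like the sketch below. The 300 dpi default mirrors the -r300x300 in the question; the helper name render_cropped_png and the GS variable are made up for illustration.

```shell
# Hypothetical helper: render a PDF to PNG using the CropBox as page size.
# GS is an assumed override for the binary name (e.g. gswin32c on Windows).
render_cropped_png() {
    # $1 = input PDF, $2 = output PNG, $3 = resolution in dpi (default 300)
    ${GS:-gs} -q -dBATCH -dNOPAUSE -dUseCropBox -sDEVICE=png16m \
        -r"${3:-300}" -sOutputFile="$2" "$1"
}
```

Usage would be: render_cropped_png Figure_001-a.pdf junk.png

For a multi-page input, put %d in the output name (for example page-%d.png) so that each page gets its own file.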
Have you tried using pdfcrop using pdftex (comes with texlive for example) or (not tried yet) the python script pdfcrop?
I have a similar workflow using the first tool mentioned.
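Roughly, that workflow can be sketched as a two-step helper: crop with the TeX Live pdfcrop (called as pdfcrop input.pdf output.pdf), then render the cropped file. The helper name, the -crop.pdf intermediate file name, and the PDFCROP and GS override variables are assumptions for illustration.

```shell
# Hypothetical helper: crop margins with pdfcrop, then render with gs.
# PDFCROP and GS are assumed overrides for the two binary names.
crop_then_render() {
    # $1 = input PDF, $2 = output PNG
    tmp="${1%.pdf}-crop.pdf"
    ${PDFCROP:-pdfcrop} "$1" "$tmp" &&
    ${GS:-gs} -q -dBATCH -dNOPAUSE -sDEVICE=png16m -r300 \
        -sOutputFile="$2" "$tmp"
}
```

Usage would be: crop_then_render Figure_001-a.pdf Figure_001-a.png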