Ghostscript: convert PDF to EPS with embedded fonts rather than outlined curves

I use the following command to convert a PDF to EPS:
gswin32 -dNOCACHE -dNOPAUSE -dBATCH -dSAFER -sDEVICE=epswrite -dLanguageLevel=2 -sOutputFile=test.eps -f test.pdf
I then use the following command to convert the EPS to another PDF (test2.pdf) to view the EPS figure.
gswin32 -dSAFER -dNOPLATFONTS -dNOPAUSE -dBATCH -dEPSCrop -sDEVICE=pdfwrite -dPDFSETTINGS=/printer -dCompatibilityLevel=1.4 -dMaxSubsetPct=100 -dSubsetFonts=true -dEmbedAllFonts=true -sOutputFile=test2.pdf -f test.eps
I found that the text in the generated test2.pdf has been converted to outline curves, and there are no fonts embedded anymore either.
Is it possible to convert a PDF to EPS without converting text to outlines? I mean, to an EPS with embedded fonts and text.
Also, after the conversion (test.pdf -> test.eps -> test2.pdf), the height and width of the PDF figure (test2.pdf) are a little bit smaller than in the original PDF (test.pdf):
(screenshots comparing the page sizes of test.pdf and test2.pdf omitted)
Is it possible to keep the width and height of the figure after conversion?
Here is the test.pdf: https://dl.dropboxusercontent.com/u/45318932/test.pdf
I tried KenS's suggestion:
gswin32 -dNOPAUSE -dBATCH -dSAFER -sDEVICE=eps2write -dLanguageLevel=2 -sOutputFile=test.eps -f test.pdf
gswin32 -dSAFER -dNOPLATFONTS -dNOPAUSE -dBATCH -dEPSCrop -sDEVICE=pdfwrite -dPDFSETTINGS=/printer -dCompatibilityLevel=1.4 -dMaxSubsetPct=100 -dSubsetFonts=true -dEmbedAllFonts=true -sOutputFile=test2.pdf -f test.eps
I can see that the converted test2.pdf has a very weird font:
that is different from the original font in test.pdf:
When I copy the text from test2.pdf, I only get a couple of symbols like:
✕ ✖ ✗✘✙ ✚✛
Here is the test2.pdf: https://dl.dropboxusercontent.com/u/45318932/test2.pdf
I was using the latest Ghostscript 9.15. So what is the problem?

I just noticed you are using epswrite; you don't want to do that. That device is terrible and has been deprecated (and now removed). Use the eps2write device instead (you will need a relatively recent version of Ghostscript).
There's nothing you can do with epswrite except throw it away; it makes terrible EPS files. It also can't make level 2 files, no matter what you set -dLanguageLevel to.
Oh, and don't use -dNOCACHE; that prevents fonts from being processed and decomposes everything to outlines or bitmaps.
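Putting those points together, a corrected first step might look like this (a sketch based on the commands in the question, with -dNOCACHE dropped and eps2write substituted):
gswin32 -dNOPAUSE -dBATCH -dSAFER -sDEVICE=eps2write -sOutputFile=test.eps -f test.pdf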
UPDATE
You set subset fonts to true. By doing so the character codes which are used are more or less random. The first glyph in the document (say for example the 'H' in 'Hello World') gets the code 1, the second one (eg 'e') gets the code 2 and so on.
If you have a ToUnicode CMap, then Acrobat and other readers can convert these character codes to Unicode code points; without it, the readers have to fall back on heuristics, the final one being 'treat it as ASCII'. Because the encoding arrangement isn't ASCII, you get gibberish. MS Windows' PostScript output can contain additional ToUnicode information, but that's not something we try to mimic in ps2write. After all, presumably you had a PDF file already....
Every time you do a conversion you run the risk of this kind of degradation; you should really try to minimise this in your workflow.
The problem is even worse in this case: the input PDF file has a TrueType CID Font. Basic language level 2 PostScript can't handle CIDFonts (IIRC support for them was introduced in PostScript version 2015). Since eps2write only emits basic level 2, it cannot write the font as a CIDFont. So instead it captures the glyph outlines and stores them in a type 3 font.
However, our EPS/PS output doesn't attempt to embed ToUnicode information in the PostScript (it's non-standard, very few applications can make use of it, and it therefore makes the files larger for little benefit). In addition, CIDFonts use multiple (2 or more) bytes for the character code, so there's no way to encode the type 3 fonts as ASCII.
Fundamentally you cannot use Ghostscript to go PDF->PS->PDF and still be able to copy/paste/search text, if the input contains CIDFonts.
By the way, there's no point in setting -dLanguageLevel at all. eps2write only creates level 2 output.

I used Inkscape to convert a .pdf to an .eps. Just open the .pdf file in Inkscape, choose high mesh in the import options, and save it as an EPS file.

Related

Ghostscript won't generate PDF/A with UTF16BE text string detected in DOCINFO - in spite of PDFACompatibilityPolicy saying otherwise

I am trying to convert normal PDF files to PDF/A with this command line:
gs -dPDFA -dBATCH -dNOPAUSE -sProcessColorModel=DeviceCMYK -sDEVICE=pdfwrite -sPDFACompatibilityPolicy=1 -sOutputFile=output.pdf input.pdf
However, I get the message
GPL Ghostscript 9.26: UTF16BE text string detected in DOCINFO cannot be represented in XMP for PDF/A1, reverting to normal PDF output
and gs reverts to normal PDF.
Apparently, the message stems from this code fragment of gs, but there we read that the message can occur only when pdev->PDFACompatibilityPolicy == 0. My understanding was that the parameter -sPDFACompatibilityPolicy=1 in the command line has the purpose of preventing this.
Q: Why does gs behave as if the desired policy were 0 instead of 1? Is there another way to set the policy to 1?
Also, just because it makes me curious:
Q: Is there a way to see what kind of strange DOCINFO is causing the original problem, or to prevent it in the first place? Using Acrobat Reader, I cannot see anything "suspicious" in the file. If it helps: the input.pdf is generated on Windows from Word (and I even tried the UseISO19005-1 setting, which should produce PDF/A to begin with, but the problem occurs anyway).
You have put -sPDFACompatibilityPolicy=1. That, I'm afraid, is incorrect. Ghostscript has two kinds of switches: -s, which deals with string values, and -d, which deals with numeric and name values (names in PostScript begin with '/').
You've assigned a string value of '1' to the parameter PDFACompatibilityPolicy, which (internally) expects a numeric value. For reasons to do with the fact that these values are required to be accessible from the PostScript environment, we can't flag the type confusion as an error. Instead we leave the actual control at its default value of 0.
If you instead set -dPDFACompatibilityPolicy=1 I expect you will see the behaviour you expect.
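For example, the command line from the question would then become:
gs -dPDFA -dBATCH -dNOPAUSE -sProcessColorModel=DeviceCMYK -sDEVICE=pdfwrite -dPDFACompatibilityPolicy=1 -sOutputFile=output.pdf input.pdf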
As for seeing the data, without looking at the PDF file I cannot tell. However, if you stop in the debugger at that point and look at p->data you will be able to see what the data is. If you look at pairs + i instead of pairs + i + 1 you will be able to see the key which is associated with the value from the DOCINFO pdfmark.
You won't be able to see anything 'suspicious' by looking at the file in Acrobat, because Acrobat will translate the UTF16BE into whatever your system requires in order to display the text correctly. It may even be that the text is plain ASCII; that can still be represented as UTF16.
If you open the file in a text editor you may be able to see the relevant string (note that the BOM in the Ghostscript source is given in octal, so that's 0xFE 0xFF in hexadecimal), provided it's not in a compressed object stream.
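As a rough sketch (assuming GNU grep and a bash-like shell; this only helps when the string is not inside a compressed object stream), you could locate the byte offsets of UTF16BE BOMs like this:
grep -aob $'\376\377' input.pdf
Each reported offset is a candidate UTF16BE string that you can then inspect in a hex editor.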
Examining the source of the latest Ghostscript (9.50), it seems that the PDFACompatibilityPolicy values in this case (see devices/vector/gdevpdfm.c around line 1951) select the error-handling behaviour as follows:
0 will revert to normal PDF output (not really what I wanted)
1 will discard PDFINFO (even worse)
2 will throw an error (even even worse)
any other value is ignored in the switch and works as a pass-through!
So, in my case, the whole thing was solved simply by setting
-dPDFACompatibilityPolicy=3
Ghostscript does not complain, does not abort PDF/A output, does not discard the PDFINFO, and, most importantly, veraPDF checker still verifies the PDF as perfectly okay.
I'm not commenting on how ugly this solution is, but it works just great. Since all other switch statements just assume compatibility policy 0 if anything above 2 gets passed in, this "shortcut" seems to be an unintended, but very useful bug.
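For reference, applying this workaround to the command line from the question gives (a sketch, file names as in the question):
gs -dPDFA -dBATCH -dNOPAUSE -sProcessColorModel=DeviceCMYK -sDEVICE=pdfwrite -dPDFACompatibilityPolicy=3 -sOutputFile=output.pdf input.pdf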
The answer of exa is not quite correct. Ghostscript will continue its output, but the resulting PDF will not pass the veraPDF validator.
At the moment I'm busy trying to make Ghostscript produce a valid ZUGFeRD invoice PDF. For that, the PDF needs to be a valid PDF/A-3 (a, b or u) file.
Problem with the Answer
If you just use -dPDFACompatibilityPolicy=3, veraPDF won't validate the PDF.
Instead you should fix the file with the right encoding.
In my case the PDF looked like this (screenshot of the DOCINFO omitted).
How to resolve it:
Create a new file (example "pdfmarks") with this content:
[ /Title (Foo Title)
/Author (Foo Bar)
/Subject (Foo Bar Subject)
/Keywords ()
/ModDate (D:20061204092842)
/CreationDate (D:20061204092842)
/Creator (Foo Bar)
/Producer (Foo Bar)
/DOCINFO pdfmark
(Note that there is no closing square bracket ']'.)
Run gs like this:
Windows:
"C:\Program Files\gs\gs9.53.3\bin\gswin64c.exe" -dSAFER -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -sOutputFile=/path/to/output.pdf /path/to/input.pdf /path/to/pdfmarks
Linux:
gs -dSAFER -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -sOutputFile=/path/to/output.pdf /path/to/input.pdf /path/to/pdfmarks
You can either include your own metadata and any further processing in the same run, or call gs a second time; see the sketch below.
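A sketch of the two-pass variant on Linux (the PDF/A switches, colour strategy and PDFA_def.ps are placeholders borrowed from the other answers in this thread; adjust the paths to your setup):
gs -dPDFA=3 -dBATCH -dNOPAUSE -sColorConversionStrategy=RGB -sDEVICE=pdfwrite -dPDFACompatibilityPolicy=1 -sOutputFile=/path/to/final.pdf /path/to/PDFA_def.ps /path/to/output.pdf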
I hope I could save you some time with this.

GhostScript PDF 1.5 (from tiff to PDF with ImageMagick) convert to PDF/A

I need to create a PDF/A from a folder of TIFF files.
Creating a PDF (1.5) works with ImageMagick.
But converting this PDF to a PDF/A using Ghostscript is a problem.
My Ghostscript cmd:
-dPDFA=2 -dNOOUTERSAVE -sProcessColorModel=DeviceRGB -sDEVICE=pdfwrite -o "C:\Temp\TestData\TIFF to PDF Imagemagick\pdfa.pdf" "C:\Temp\TestData\TIFF to PDF Imagemagick\PDFA_def.ps" -dPDFACompatibilityPolicy=1 "C:\Temp\TestData\TIFF to PDF Imagemagick\test.pdf"
Also tried:
-dPDFA=2 -dBATCH -dNOPAUSE -sColorConversionStrategy=RGB -sDEVICE=pdfwrite -sPDFACompatibilityPolicy=1 -sOutputFile="C:\Temp\TestData\TIFF to PDF Imagemagick\pdfa.pdf" "C:\Temp\TestData\TIFF to PDF Imagemagick\PDFA_def.ps" "C:\Temp\TestData\TIFF to PDF Imagemagick\test.pdf"
my PDFA_def.ps is the GS standard with:
/ICCProfile (AdobeRGB1998.icc) % Customise
The created PDF(/A?) does not pass the "Verify compliance with PDF/A-2b" preflight in Adobe Acrobat:
Error
Metadata missing (XMP)
PDF/A entry missing
Syntax problem: Indirect object “endobj” keyword not preceded by an EOL marker
Syntax problem: Stream dictionary improperly formatted
It also fails the https://www.pdf-online.com/osa/validate.aspx validator:
File pdfa.pdf
Compliance pdf1.5
Result Document does not conform to PDF/A.
Details
Validating file "pdfa.pdf" for conformance level pdf1.5
XML line 10:212: xmlParseCharRef: invalid xmlChar value 0.
The document does not conform to the requested standard.
The document's meta data is either missing or inconsistent or corrupt.
The document does not conform to the PDF 1.5 standard.
Done.
I also tried veraPDF ....
What kind of settings have I forgotten?
Well, there are quite a few problems here.
You haven't said what version of Ghostscript you are using, nor have you supplied an example file to experiment with. You also haven't given the back channel output which might contain additional information.
You can't use the supplied model PDFA_def.ps without modification; at the very least you need to modify the /ICCProfile entry to point to a real, valid ICC profile. I suspect this has caused pdfwrite to abort PDF/A-2 production, which would normally be mentioned in the back channel output.
You haven't set ColorConversionStrategy; just setting the ProcessColorModel is not sufficient, pdfwrite will mostly ignore it. If you don't tell pdfwrite that you want colours converted to a different space, it will preserve them unchanged, regardless of the process color model.
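For illustration, the entry in PDFA_def.ps looks roughly like this (the path here is a placeholder; it must point at an ICC profile that actually exists on disk):
/ICCProfile (C:/Temp/TestData/AdobeRGB1998.icc) % Customise
def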
With this command it's now running:
-dPDFA=2 -sColorConversionStrategy=RGB -sDEVICE=pdfwrite -dPDFACompatibilityPolicy=1 -dNOPAUSE -dBATCH -o "C:\Temp\TestData\tiff2pdfa\pdfatest.pdf" "C:\Temp\TestData\tiff2pdfa\PDFA\PDFA_def.ps" "C:\Temp\TestData\tiff2pdfa\test.pdf"
Thanks to:
Batch Convert PDF to PDF/A - MARK BERRY
But I still get some errors:
GPL Ghostscript 9.25: UTF16BE text string detected in DOCINFO cannot be represented
in XMP for PDF/A 1, discarding DOCINFO
Processing pages 1 through 56.
Page 1
GPL Ghostscript 9.25: Setting Overprint Mode to 1
not permitted in PDF/A-2, overprint mode not set
Should I be thinking about this "Overprint Mode"?

Ghostscript's pdfwrite to grayscale results in wrong graylevel

I am trying to convert a PDF file (test.pdf, attached below) using Ghostscript (9.20 on Windows) so that it only uses the gray colorspace (not RGB or CMY):
gswin64c.exe -sDEVICE=pdfwrite -sProcessColorModel=DeviceGray -sColorConversionStrategy=Gray -dOverrideICC -dUseCIEColor -o gray.pdf -f test.pdf
The result indeed only uses gray colors:
>gswin64c.exe -o - -sDEVICE=inkcov gray.pdf
GPL Ghostscript 9.20 (2016-09-26)
Copyright (C) 2016 Artifex Software, Inc. All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
Processing pages 1 through 1.
Page 1
0.00000 0.00000 0.00000 0.92673 CMYK OK
(I need to use -dUseCIEColor, otherwise the CMY values are >0; this is a separate problem which I haven't yet solved...)
My problem: The resulting gray.pdf uses significantly different graylevels than the original test.pdf (open in your PDF viewer and compare for yourself).
Does anyone see my mistake or what I should do differently to get the same PDF but in grayscale rather than RGB colorspace?
Thank you very much!
test.pdf: https://drive.google.com/open?id=0BzjatAIrG6P3S2F5Vng4cUhUS0U
gray.pdf: https://drive.google.com/open?id=0BzjatAIrG6P3cEtTY3JaaTJCS2c
You are doing a multiple conversion, and not managing the colour space conversions at all.
Firstly you convert the original colour into a CIEBased colour space (and the space varies depending on the number of components in the original space). Since you don't specify Colour Rendering Dictionaries, this is an uncontrolled conversion, you are using the defaults.
You then embark on another conversion from CIEBased (which cannot, in general, be represented in PDF anyway, so would always result in an additional conversion) into DeviceGray. Again you haven't supplied any ICC profiles for this conversion, so you are using the default ones.
If you insist on using -dUseCIEColor (which I would very strongly advise against, controlling this is hard) then you need to supply Colour Rendering Dictionaries to control the conversion from device space into CIE space, and also ICC profiles to control the subsequent conversion from CIE space into DeviceGray.
But I strongly suspect that you will get better results by not using -dUseCIEColor, just like Ghostscript tells you.
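A minimal sketch of that advice, i.e. the command from the question with -dUseCIEColor and -dOverrideICC removed (the exact gray levels will then depend on the default ICC profiles):
gswin64c.exe -sDEVICE=pdfwrite -sProcessColorModel=DeviceGray -sColorConversionStrategy=Gray -o gray.pdf -f test.pdf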
I can only guess at what you need based on the source file. There's a DeviceRGB 0.5/0.5/0.5 filled rectangle, and I suspect you want it to become 0.5 DeviceGray.
The solutions and speculations below will work for that and similar cases only. (E.g., I have no idea what the "CMY values" you write about are, i.e. whether there are DeviceCMYK or ICC-based or any other colours in your files.) There are simple formulas to convert between device color spaces (see the PDF Reference), and one of them indeed maps equal values in DeviceRGB to the same value in DeviceGray. To make it work, use Ghostscript 9.10:
"C:\Program Files\gs\gs9.10\bin\gswin32c.exe" -sDEVICE=pdfwrite -sProcessColorModel=DeviceGray -sColorConversionStrategy=Gray -dUseFastColor -o test_1.pdf -f test.pdf
Note the switch -dUseFastColor. You'll get a "correct" 0.5 grayscale filled rectangle.
To make it work in versions between 9.10 and 9.20 (excluding both), I had to add another switch: -dPDFUseOldCMS. Again, this gives a 0.5 grayscale filled rectangle.
As the last switch name indicates, these simple conversions were probably considered deprecated, and it looks like they were scrapped in 9.20.
Instead, a wonderful new CMS engine was introduced (in 9.10). Except it doesn't work for high-level devices (pdfwrite included); it has been either switched off or broken for many releases.
I was unable to make it actually use the -sOutputICCProfile option for any combination of device- or ICC-based colors in the source and command line options, for either DeviceCMYK or DeviceGray (or ICC-based) output: the produced files always contain the same color values.
I'd appreciate it if someone could show that I'm wrong with a counterexample.
It worked, actually (partly -- for device source colors only), in 9.10:
"C:\Program Files\gs\gs9.10\bin\gswin32c.exe" -sDEVICE=pdfwrite -sProcessColorModel=DeviceGray -sColorConversionStrategy=Gray -sOutputICCProfile=sgray.icc -o test_2.pdf -f test.pdf
Using different ICC profiles results in different (and, it looks, correct) output. To convert from equal RGB values to the same gray value, one would need a grayscale profile with the same gamma as the (default) sRGB profile. Just use the free ICC Profile Inspector to extract a curve from sRGB and import it into e.g. sgray.icc (distributed with Ghostscript).
The advantage of using a profile to convert RGB to Gray, preserving gamma, as opposed to the "simple formula" described above, may or may not be worth the effort. Check for your files and purposes.

Cropped PCL after gswin PDF to PCL conversion

I have a PDF, which I want to convert to PCL
I convert PDF to PCL using the following command:
(gs 8.70)
gswin32c.exe -q -dNOPAUSE -dBATCH \
-sDEVICE=ljetplus -dDuplex=false -dTumble=false \
-sPAPERSIZE=a4 -sOutputFile="d:\doc1.pcl" \
-f"d:\doc1.pdf" -c -quit
When I view or print the output PCL, it is cropped. I would expect the output to start right at the edge of the paper (at least in the viewer).
Is there any way to get the whole output without moving the contents of the page away from the paper edge?
I tried the -dPDFFitPage option, which works, but results in a scaled output.
You are using -sPAPERSIZE=a4. This causes the PCL to render for A4 sized media.
Very likely, your input PDF is made for a non-A4 size. That leaves you with 3 options:
...you either use that exact page size for the PCL too (which your printer possibly cannot handle),
...or you have to add -dPDFFitPage (as you tried, but didn't like),
...or you skip the -sPAPERSIZE=... parameter altogether (which most likely will automatically use the same size as the PDF, and which your printer possibly cannot handle...); see the sketch after this list.
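A sketch of the third option, i.e. the command from the question with -sPAPERSIZE dropped:
gswin32c.exe -q -dNOPAUSE -dBATCH -sDEVICE=ljetplus -dDuplex=false -dTumble=false -sOutputFile="d:\doc1.pcl" -f "d:\doc1.pdf"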
Update 1:
In case ljetplus is not a hard requirement for your requested PCL format variant, you could try this:
gs -sDEVICE=pxlmono -o pxlmono.pcl a4-fo.pdf
gs -sDEVICE=pxlcolor -o pxlcolor.pcl a4-fo.pdf
Update 2:
I can confirm now that even the most recent version of Ghostscript (v9.06) cannot handle non-Letter page sizes for ljetplus output.
I'd regard this as a bug... but it could well be that it won't be fixed, even if reported at the GS bug tracker. However, the least that can be expected is that it will get documented as a known limitation for ljetplus output...

PDF text extraction from given coordinates

I would like to extract text from a portion (using coordinates) of PDF using Ghostscript.
Can anyone help me out?
Yes, with Ghostscript, you can extract text from PDFs. But no, it is not the best tool for the job. And no, you cannot do it in "portions" (parts of single pages). What you can do: extract the text of a certain range of pages only.
First: Ghostscript's txtwrite output device (not so good)
gs \
-dBATCH \
-dNOPAUSE \
-sDEVICE=txtwrite \
-dFirstPage=3 \
-dLastPage=5 \
-sOutputFile=- \
/path/to/your/pdf
This will output all text contained on pages 3-5 to stdout. If you want output to a text file, use
-sOutputFile=textfilename.txt
gs Update:
Recent versions of Ghostscript have seen major improvements in the txtwrite device and bug fixes. See recent Ghostscript changelogs (search for txtwrite on that page) for details.
Second: Ghostscript's ps2ascii.ps PostScript utility (better)
This one requires you to download the latest version of the file ps2ascii.ps from the Ghostscript Git source code repository. You'd have to convert your PDF to PostScript, then run this command on the PS file:
gs \
-q \
-dNODISPLAY \
-P- \
-dSAFER \
-dDELAYBIND \
-dWRITESYSTEMDICT \
-dSIMPLE \
/path/to/ps2ascii.ps \
input.ps \
-c quit
If the -dSIMPLE parameter is not defined, each output line contains some additional info beyond the pure text content about fonts and fontsize used.
If you replace that parameter by -dCOMPLEX, you'll get additional infos about colors and images used.
Read the comments inside ps2ascii.ps to learn more about this utility. It's not comfortable to use, but for me it worked in most cases where I needed it....
Third: XPDF's pdftotext CLI utility (more comfortable than Ghostscript)
A more comfortable way to do text extraction: use pdftotext (available for Windows as well as Linux/Unix or Mac OS X). This utility is based either on Poppler or on XPDF. This is a command you could try:
pdftotext \
-f 13 \
-l 17 \
-layout \
-opw supersecret \
-upw secret \
-eol unix \
-nopgbrk \
/path/to/your/pdf
- | less
This will display the page range 13 (first page) to 17 (last page) of a double-password-protected PDF file (using the user and owner passwords secret and supersecret), preserve the layout, use the Unix EOL convention, insert no pagebreaks between PDF pages, and pipe the result through less...
pdftotext -h displays all available commandline options.
Of course, both tools only work on the text parts of PDFs (if they have any). Oh, and mathematical formulas also won't work too well... ;-)
pdftotext Update:
Recent versions of Poppler's pdftotext now have options to extract "a portion (using coordinates) of PDF" pages, like the OP asked for. The parameters are:
-x <int> : top left corner's x-coordinate of crop area
-y <int> : top left corner's y-coordinate of crop area
-W <int> : crop area's width in pixels (defaults to 0)
-H <int> : crop area's height in pixels (defaults to 0)
Best, if used with the -layout parameter.
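For example, to extract only a region of page 13 (the coordinates below are placeholders for illustration):
pdftotext -f 13 -l 13 -layout -x 72 -y 72 -W 288 -H 144 /path/to/your/pdf - | less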
Fourth: MuPDF's mutool draw command can also extract text
The cross-platform, open source MuPDF application (made by the same company that also develops Ghostscript) has bundled a command line tool, mutool. To extract text from a PDF with this tool, use:
mutool draw -F txt the.pdf
This will emit the extracted text to <stdout>. Use -o filename.txt to write it into a file.
Fifth: PDFLib's Text Extraction Toolkit (TET) (best of all... but it is PayWare)
TET, the Text Extraction Toolkit from the pdflib family of products, can find the x-y-coordinates of text content in a PDF file (and much more). TET has a commandline interface, and it's the most powerful of all text extraction tools I'm aware of. (It can even handle ligatures...) Quote from their website:
Geometry
TET provides precise metrics for the text, such as the position on the page, glyph widths, and text direction. Specific areas on the page can be excluded or included in the text extraction, e.g. to ignore headers and footers or margins.
In my experience, while it does not sport the most straightforward CLI interface you can imagine, once you get used to it, it will do what it promises to do, for most PDFs you throw at it...
And there are even more options:
podofotxtextract (CLI tool) from the PoDoFo project (Open Source)
calibre (normally a GUI program to handle eBooks, Open Source) has a commandline option that can extract text from PDFs
AbiWord (a GUI word processor, Open Source) can import PDFs and save its files as .txt: abiword --to=txt --to-name=output.txt input.pdf
I'm not sure Ghostscript can accept coordinates, but you can convert the PDF to an image and send it to an OCR engine, either as a subimage cropped to the given coordinates or as the whole image along with the coordinates. Some OCR APIs accept a rectangle parameter to narrow the region for OCR.
Look at VietOCR for a working example, which uses Tesseract as its OCR engine and GhostScript as PDF-to-image converter.
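A rough sketch of that pipeline from the command line (page number, resolution and file names are examples only):
gs -dBATCH -dNOPAUSE -sDEVICE=png16m -r300 -dFirstPage=1 -dLastPage=1 -sOutputFile=page1.png input.pdf
tesseract page1.png page1
This renders page 1 to page1.png at 300 dpi and writes the recognized text to page1.txt; cropping to the given coordinates would be done on the image before the OCR step.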
Debenu Quick PDF Library can extract text from a defined area on a page. The SetTextExtractionArea function lets you specify the x and y coordinates and then you can also specify the width and height of the area.
Left = The horizontal coordinate of the left edge of the area
Top = The vertical coordinate of the top edge of the area
Width = The width of the area
Height = The height of the area
Then the GetPageText function can be called immediately after this to extract the text from that defined area.
Here's an example using C# (though the library is multi-platform and can be used with many different programming languages):
DPL.LoadFromFile(@"Sample.pdf", "");
DPL.SetOrigin(1); // Sets 0,0 coordinate position to top left of page, default is bottom left
DPL.SetTextExtractionArea(35, 35, 229, 30); // Left, Top, Width, Height
string ExtractedContent = DPL.GetPageText(8);
Console.WriteLine(ExtractedContent);
Using GetPageText it is also possible to return just the text located in that area or the text located in that area as well as information about the text's font such as name, color and size.