PDF text box markups missing in the converted TIFF file (Ghostscript)

I am trying to convert a PDF to TIFF. The PDF is available at the link below:
Original PDF
http://bugs.ghostscript.com/attachment.cgi?id=7736
I currently have Ghostscript 9.02 installed on my system.
I am using the command below to convert the PDF files to TIFF.
gswin32 -dSAFER -dNOPAUSE -dBATCH -q -sPAPERSIZE=a4 -r300 -sDEVICE=tiffg4
-dPDFFitPage -dGraphicsAlphaBits=1 -dTextAlphaBits=1
-sOutputFile="d:/temp/test/ConvertedPage%06d.tiff"
"d:/temp/test/TextBoxMarkupfile.pdf"
There are 3 marked-up text boxes on the second page. After conversion,
their text is missing from the TIFF file.
Is there an option in Ghostscript to include that text
in the converted image?
If there are any workarounds, please suggest them.
Thanks,
Rajesh

A similar question+answer recommended pdftk.
I used a two-step process of [pdf] >> pdftk >> ImageMagick:
pdftk original_form.pdf output flattened_form.pdf flatten
convert -define quantum:polarity=min-is-white -endian MSB
-units PixelsPerInch -density 204x196 -monochrome
-compress Fax -sample 1728
"flattened_form.pdf" "final.tif"
Since ImageMagick uses Ghostscript for PDF conversion, it should be possible to use GS directly, but I've gotten better quality from ImageMagick; perhaps I'm just not using the right GS settings.
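The two-step flatten-then-rasterize process above can be scripted. Here is a minimal sketch in Python, assuming `pdftk` and ImageMagick's `convert` are both on the PATH. The function names and the default intermediate file name are hypothetical, and the options are copied from the commands above rather than tuned.

```python
import subprocess

def flatten_cmd(src_pdf, flat_pdf):
    # Step 1: pdftk flattens the PDF so form/markup content becomes page content.
    return ["pdftk", src_pdf, "output", flat_pdf, "flatten"]

def fax_tiff_cmd(flat_pdf, out_tif):
    # Step 2: the same ImageMagick options as in the command above.
    return [
        "convert",
        "-define", "quantum:polarity=min-is-white",
        "-endian", "MSB",
        "-units", "PixelsPerInch",
        "-density", "204x196",
        "-monochrome",
        "-compress", "Fax",
        "-sample", "1728",
        flat_pdf, out_tif,
    ]

def pdf_to_fax_tiff(src_pdf, out_tif, flat_pdf="flattened_form.pdf"):
    # check=True raises CalledProcessError if either tool exits with an error.
    subprocess.run(flatten_cmd(src_pdf, flat_pdf), check=True)
    subprocess.run(fax_tiff_cmd(flat_pdf, out_tif), check=True)
```

Calling `pdf_to_fax_tiff("original_form.pdf", "final.tif")` would run both steps in order. On Windows, a bare `convert` may resolve to the system disk utility of the same name, so the full path to ImageMagick's executable may be needed there.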

Related

Use Ghostscript to convert each page of a PDF to images and the output is still PDF

I know that Adobe Acrobat Reader DC can select the Microsoft Print to PDF printer to output to a PDF file with Print As Image checked in the Advanced Print Setup dialog. However, I want to do this from the command line. I tried the following command, but it failed to convert each page to an image (note that the output file is still a PDF).
gs -o 0.999.watermask.compact.screen.pdf -sDEVICE=pdfwrite -dDetectDuplicateImages=true -dPDFSETTINGS=/screen 0.999.watermask.pdf
Your switches include -dDetectDuplicateImages=true, which should be superfluous under the circumstances, and the device can be any one of four, as pointed out by KenS.
gs -o 0.999.watermask.compact.screen.pdf -sDEVICE=pdfimage32 -dPDFSETTINGS=/screen 0.999.watermask.pdf
If you want to emulate MS Print As Image PDF on Windows, you would find the result in some ways inferior (and often many times bigger). But for comparison it would be:
NOTE: "%%printer%%..." is for use in a batch file; on the command line use "%printer%..."
gswin64c.exe -sDEVICE=mswinpr2 -dNoCancel -o "%%printer%%Microsoft Print to PDF" -dPDFSETTINGS=/screen -f "0.999.watermask.pdf"

GhostScript PDF 1.5 (from tiff to PDF with ImageMagick) convert to PDF/A

I need to create a PDF/A from a folder of TIFF files.
Creating a PDF (1.5) with ImageMagick works.
But converting this PDF to PDF/A using Ghostscript is a problem.
My Ghostscript command:
-dPDFA=2 -dNOOUTERSAVE -sProcessColorModel=DeviceRGB -sDEVICE=pdfwrite -o "C:\Temp\TestData\TIFF to PDF Imagemagick\pdfa.pdf" "C:\Temp\TestData\TIFF to PDF Imagemagick\PDFA_def.ps" -dPDFACompatibilityPolicy=1 "C:\Temp\TestData\TIFF to PDF Imagemagick\test.pdf"
Also tried:
-dPDFA=2 -dBATCH -dNOPAUSE -sColorConversionStrategy=RGB -sDEVICE=pdfwrite -sPDFACompatibilityPolicy=1 -sOutputFile="C:\Temp\TestData\TIFF to PDF Imagemagick\pdfa.pdf" "C:\Temp\TestData\TIFF to PDF Imagemagick\PDFA_def.ps" "C:\Temp\TestData\TIFF to PDF Imagemagick\test.pdf"
My PDFA_def.ps is the standard one shipped with GS, with:
/ICCProfile (AdobeRGB1998.icc) % Customise
The created PDF/? is not passing the "Verify compliance with PDF/A-2b" preflight in Adobe Acrobat:
Error
Metadata missing (XMP)
PDF/A entry missing
Syntax problem: Indirect object “endobj” keyword not preceded by an EOL marker
Syntax problem: Stream dictionary improperly formatted
It also fails the https://www.pdf-online.com/osa/validate.aspx validator:
File pdfa.pdf
Compliance pdf1.5
Result Document does not conform to PDF/A.
Details
Validating file "pdfa.pdf" for conformance level pdf1.5
XML line 10:212: xmlParseCharRef: invalid xmlChar value 0.
The document does not conform to the requested standard.
The document's meta data is either missing or inconsistent or corrupt.
The document does not conform to the PDF 1.5 standard.
Done.
Also tried VeraPDF ....
What kind of settings have I forgotten?
Well there's quite a few problems here.
You haven't said what version of Ghostscript you are using, nor have you supplied an example file to experiment with. You also haven't given the back channel output which might contain additional information.
You can't use the supplied model PDFA_def.ps without modification. At the very least you need to modify the /ICCProfile entry to point to a real, valid ICC profile. I suspect this has caused pdfwrite to abort PDF/A-2 production, which would normally be mentioned in the back channel output.
You haven't set -dColorConversionStrategy. Just setting the ProcessColorModel is not sufficient, because pdfwrite will mostly ignore it. If you don't tell pdfwrite that you want colours converted to a different space, it will preserve them unchanged, regardless of the process color model.
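The /ICCProfile point can be checked before Ghostscript is run. This is a rough sketch in Python that pulls the path out of a PDFA_def.ps-style file and tests whether it names an existing file. The function names and the `base_dir` parameter are hypothetical. The simple pattern assumes the path contains no parentheses or PostScript escapes, and it does not skip commented-out lines.

```python
import os
import re

# Matches an entry such as: /ICCProfile (AdobeRGB1998.icc) % Customise
ICC_ENTRY = re.compile(r"/ICCProfile\s*\(([^)]*)\)")

def find_icc_profile(ps_text):
    """Return the path named by the first /ICCProfile entry, or None."""
    match = ICC_ENTRY.search(ps_text)
    return match.group(1) if match else None

def icc_profile_exists(ps_text, base_dir="."):
    """True if the /ICCProfile entry names an existing file.

    Relative paths are resolved against base_dir (an assumption of this
    sketch); absolute paths are used as they are.
    """
    path = find_icc_profile(ps_text)
    if path is None:
        return False
    return os.path.isfile(os.path.join(base_dir, path))
```

For the entry shown in the question, `find_icc_profile` returns `AdobeRGB1998.icc`, and `icc_profile_exists` is true only if a file of that name is actually present in `base_dir`.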
With this command it's now running:
-dPDFA=2 -sColorConversionStrategy=RGB -sDEVICE=pdfwrite -dPDFACompatibilityPolicy=1 -dNOPAUSE -dBATCH -o "C:\Temp\TestData\tiff2pdfa\pdfatest.pdf" "C:\Temp\TestData\tiff2pdfa\PDFA\PDFA_def.ps" "C:\Temp\TestData\tiff2pdfa\test.pdf"
Thanks to:
Batch Convert PDF to PDF/A - MARK BERRY
But I still get some errors:
GPL Ghostscript 9.25: UTF16BE text string detected in DOCINFO cannot be represented
in XMP for PDF/A 1, discarding DOCINFO
Processing pages 1 through 56.
Page 1
GPL Ghostscript 9.25: Setting Overprint Mode to 1
not permitted in PDF/A-2, overprint mode not set
Should I be concerned about this "Overprint Mode" message?

ghostscript - remove only specific text in PDF file

Ghostscript can generate a new PDF without text from a source PDF with this simple command:
gs -o output_no_text.pdf -sDEVICE=pdfwrite -dFILTERTEXT input.pdf
My goal is to delete just one specific fixed string from the PDF, not all the text. Is there a parameter that does this?

Ghostscript: convert PDF to EPS with embedded fonts rather than outlined curves

I use the following command to convert a PDF to EPS:
gswin32 -dNOCACHE -dNOPAUSE -dBATCH -dSAFER -sDEVICE=epswrite -dLanguageLevel=2 -sOutputFile=test.eps -f test.pdf
I then use the following command to convert the EPS to another PDF (test2.pdf) to view the EPS figure.
gswin32 -dSAFER -dNOPLATFONTS -dNOPAUSE -dBATCH -dEPSCrop -sDEVICE=pdfwrite -dPDFSETTINGS=/printer -dCompatibilityLevel=1.4 -dMaxSubsetPct=100 -dSubsetFonts=true -dEmbedAllFonts=true -sOutputFile=test2.pdf -f test.eps
I found that the text in the generated test2.pdf has been converted to outline curves, and no font is embedded anymore.
Is it possible to convert PDF to EPS without converting text to outlines, that is, to an EPS with embedded fonts and text?
Also, after the conversion (test.pdf -> test.eps -> test2.pdf), the height and width of the PDF figure (test2.pdf) are slightly smaller than those of the original PDF (test.pdf).
Is it possible to keep the width and height of the figure after conversion?
Here is the test.pdf: https://dl.dropboxusercontent.com/u/45318932/test.pdf
I tried KenS's suggestion:
gswin32 -dNOPAUSE -dBATCH -dSAFER -sDEVICE=eps2write -dLanguageLevel=2 -sOutputFile=test.eps -f test.pdf
gswin32 -dSAFER -dNOPLATFONTS -dNOPAUSE -dBATCH -dEPSCrop -sDEVICE=pdfwrite -dPDFSETTINGS=/printer -dCompatibilityLevel=1.4 -dMaxSubsetPct=100 -dSubsetFonts=true -dEmbedAllFonts=true -sOutputFile=test2.pdf -f test.eps
The converted test2.pdf has a very strange font that is different from the original font in test.pdf.
When I copy the text from test2.pdf, I only get a couple of symbols like:
✕ ✖ ✗✘✙ ✚✛
Here is the test2.pdf: https://dl.dropboxusercontent.com/u/45318932/test2.pdf
I was using the latest Ghostscript 9.15. So what is the problem?
I just noticed you are using epswrite; you don't want to do that. That device is terrible and has been deprecated (and now removed). Use the eps2write device instead (you will need a relatively recent version of Ghostscript).
There's nothing you can do with epswrite except throw it away; it makes terrible EPS files. It also can't make level 2 files, no matter what you set -dLanguageLevel to.
Oh, and don't use -dNOCACHE; that prevents fonts being processed and decomposes everything to outlines or bitmaps.
UPDATE
You set subset fonts to true. By doing so the character codes which are used are more or less random. The first glyph in the document (say for example the 'H' in 'Hello World') gets the code 1, the second one (eg 'e') gets the code 2 and so on.
If you have a ToUnicode CMap, then Acrobat and other readers can convert these character codes to Unicode code points. Without that, the readers have to fall back on heuristics, the final one being 'treat it as ASCII'. Because the encoding arrangement isn't ASCII, you get gibberish. MS Windows' PostScript output can contain additional ToUnicode information, but that's not something we try to mimic in ps2write. After all, presumably you had a PDF file already....
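A toy model in Python makes the effect concrete. It is only an illustration of the numbering scheme described above, not the actual pdfwrite logic, and the function name is hypothetical. Glyphs get codes 1, 2, 3... in order of first appearance, and the inverse table stands in for a ToUnicode CMap.

```python
def subset_encode(text):
    """Toy model: number glyphs 1, 2, 3... in order of first appearance."""
    code_for = {}
    codes = []
    for ch in text:
        if ch not in code_for:
            code_for[ch] = len(code_for) + 1
        codes.append(code_for[ch])
    # Inverse table, playing the role of a ToUnicode CMap.
    to_unicode = {code: ch for ch, code in code_for.items()}
    return codes, to_unicode

codes, to_unicode = subset_encode("Hello World")

# With the mapping, a reader can recover the original text.
with_map = "".join(to_unicode[c] for c in codes)

# Without it, the last-resort heuristic treats each code as ASCII.
without_map = "".join(chr(c) for c in codes)

print(codes)              # [1, 2, 3, 3, 4, 5, 6, 4, 7, 3, 8]
print(with_map)           # Hello World
print(repr(without_map))  # a run of control characters, not readable text
```

In this model 'H' is 1, 'e' is 2 and 'l' is 3, so reading the codes as ASCII yields control characters rather than letters. That corresponds to the gibberish seen when copying text from the converted file.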
Every time you do a conversion you run the risk of this kind of degradation, you should really try and minimise this in your workflow.
The problem is even worse in this case: the input PDF file has a TrueType CID font. Basic language level 2 PostScript can't handle CIDFonts (IIRC these were introduced in version 2015). Since eps2write only emits basic level 2, it cannot write the font as a CIDFont. So instead it captures the glyph outlines and stores them in a type 3 font.
However, our EPS/PS output doesn't attempt to embed ToUnicode information in the PostScript (it's non-standard, very few applications can make use of it, and it therefore makes the files larger for little benefit). In addition, CIDFonts use multiple (2 or more) bytes for the character code, so there's no way to encode the type 3 fonts as ASCII.
Fundamentally you cannot use Ghostscript to go PDF->PS->PDF and still be able to copy/paste/search text, if the input contains CIDFonts.
By the way, there's no point in setting -dLanguageLevel at all. eps2write only creates level 2 output.
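The advice in this answer amounts to three changes to the original command line. As a small illustration in Python (the helper name is hypothetical), it can be applied mechanically to an argument list:

```python
def modernize_eps_args(args):
    """Apply the three suggestions above to a Ghostscript argument list."""
    cleaned = []
    for arg in args:
        if arg == "-dNOCACHE":
            continue  # per the answer: decomposes text to outlines or bitmaps
        if arg.startswith("-dLanguageLevel="):
            continue  # eps2write only creates level 2 output anyway
        if arg == "-sDEVICE=epswrite":
            arg = "-sDEVICE=eps2write"  # epswrite is deprecated and removed
        cleaned.append(arg)
    return cleaned

original = ["gswin32", "-dNOCACHE", "-dNOPAUSE", "-dBATCH", "-dSAFER",
            "-sDEVICE=epswrite", "-dLanguageLevel=2",
            "-sOutputFile=test.eps", "-f", "test.pdf"]

print(" ".join(modernize_eps_args(original)))
# gswin32 -dNOPAUSE -dBATCH -dSAFER -sDEVICE=eps2write -sOutputFile=test.eps -f test.pdf
```

This only tidies the command. It does not make the text searchable again when the input contains CIDFonts, for the reasons given above.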
I used Inkscape to convert a .pdf to .eps. Just open the .pdf file in Inkscape, choose high mesh in the import options, and save it as an EPS file.

Split multi-page PDF into JPG (or PNG...) files using ImageMagick

I'm facing a small problem with ImageMagick, which I have found to be a marvellous tool so far, but here it doesn't do what I expect (N.B.: I work on Windows 7).
I read that, to split a 3-page (for example) PDF file, you just have to do:
img2img My3pageFile.pdf SplittedImage.jpg
and then ImageMagick would automatically create SplittedImage-1.jpg, SplittedImage-2.jpg and SplittedImage-3.jpg.
Instead, I get an error message like this (please believe me when I say that the file "benef.pdf" does exist on D:):
D:\>img2img benef.pdf benef.jpg
img2img: `%s': %s "gswin32c.exe" -q -dQUIET -dSAFER -dPARANOIDSAFE -dBATCH
-dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dEPSCrop -dAlignToPixels=0 -dGridFitTT=0
"-sDEVICE=pnmraw" -dTextAlphaBits=4 -dGraphicsAlphaBits=4 "-g595x842" "-r72x72"
"-sOutputFile=C:/Users/ADM-A2~1/AppData/Local/Temp/magick-o3McMMZQ" "-fC:/Users
/ADM-A2~1/AppData/Local/Temp/magick-EOLT_ZO2" "-fC:/Users/ADM-A2~1/AppData/Local
/Temp/magick-mUWMMcc0".
img2img: Postscript delegate failed `benef.pdf'.
img2img: missing an image filename `benef.jpg'.
The answer is simply to download and install Ghostscript from the following address, after which the command I gave at the beginning works perfectly well.
So here's the link:
http://downloads.ghostscript.com/public/
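Since the "Postscript delegate failed" error above went away once Ghostscript was installed, a quick check for a Ghostscript executable can save some guesswork. This is a minimal sketch in Python. The list of executable names is an assumption (`gswin32c` appears in the error message above; `gswin64c` and `gs` are common alternatives), and the function name is hypothetical.

```python
import shutil

# Candidate executable names: an assumption, adjust for your installation.
GS_NAMES = ("gswin64c", "gswin32c", "gs")

def find_ghostscript(names=GS_NAMES):
    """Return the full path of the first Ghostscript executable on PATH, or None."""
    for name in names:
        path = shutil.which(name)
        if path:
            return path
    return None

if __name__ == "__main__":
    found = find_ghostscript()
    print(found if found else "No Ghostscript executable found on PATH")
```

ImageMagick may locate Ghostscript by other means than PATH (on Windows, for example, through the registry), so a `None` result here is only a hint, not proof that the delegate will fail.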