A background grid appears after using ps2pdf

Morning, everyone,
Quick question about ps2pdf. I use it to convert graphics that I produce directly in PostScript to PDF. While the PS files show no visual problem, I see a grid in my PDF viewer. At first I thought the problem was in the viewer, but it remains present when I compile my TeX files containing the figures with pdfLaTeX. Do you have any ideas for settings that could "fix" this display? Thanks in advance :)

Evince is independent of Ghostscript as far as PDF files are concerned, but I don't know how it goes about viewing PostScript files.
I believe what you are seeing is an artefact of the PDF rendering engine in use, and the way the PDF file is constructed (which is itself dependent on the way the PostScript is constructed).
Much of the content is drawn by creating little rectangles which are intended to butt up against each other (and basically do). However, depending on the resolution, the numerical precision of the calculations and the accuracy of the co-ordinates, these rectangles may not quite touch exactly; there is a tiny theoretical gap between them.
You can see this occur with Adobe Acrobat, and zooming in and out changes where the lines appear (it changes the effective resolution, thereby changing the calculations from user space to device space, ie to the actual pixels on screen).
I cannot say for sure that the same problem exists with Evince, but I expect it does. With Acrobat I can turn off anti-aliasing, which is where the problem really arises. Acrobat is attempting to insert an anti-aliased pixel between the two rectangles, which leads to these faint lines. Turning it off (in Acrobat X: Edit -> Preferences -> Page Display -> Smooth Line Art) makes the lines disappear.
Ghostscript doesn't apply anti-aliasing by default, so these lines don't appear when rendering either the PostScript or the PDF files, but if I turn on anti-aliasing (-dGraphicsAlphaBits=4) then Ghostscript renders the lines in both the PostScript and the PDF file.
Essentially I think the problem is that your PDF viewer is using anti-aliasing and your PostScript viewer isn't, so they don't look the same.
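If you want to see the effect for yourself, rendering the same file with Ghostscript with and without anti-aliasing should reproduce it (the file names here are just placeholders):

    # default: no anti-aliasing, no faint grid
    gs -sDEVICE=png16m -r300 -o plain.png figure.pdf
    # with graphics anti-aliasing enabled, the grid lines appear
    gs -sDEVICE=png16m -r300 -dGraphicsAlphaBits=4 -o antialiased.png figure.pdf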

Related

Ghostscript PDF Transparent Objects Removal

I have several .pdf files containing recurring transparent objects (text).
(They also contain non-transparent objects (text and vectors) as well as images.)
These are not watermarks made by Acrobat or other tools; they sit in the background as styling.
Removing them manually is impossible, since the content on the PDF pages is mixed with the text on the page (grouping).
Is there a way to set the opacity of the translucent objects to 0, or even to remove them completely from the PDF, with Ghostscript?
Using Adobe Acrobat Preflight and removing them as images removes all of the images in the PDF, instead of only the transparent objects.
How can this be achieved with the help of Ghostscript and the appropriate PostScript code?
These answers
https://stackoverflow.com/a/29657475/9921462
https://stackoverflow.com/a/37858893/9921462
were helpful for learning how to get at objects and images, but not for filtering specifically for transparent objects.
Any ideas are appreciated as well.
I suspect that the reason you can't remove the objects using Acrobat is not because they are transparent. It's much more likely that they are described by a Form XObject. Those can't normally be edited by Acrobat.
You can use -dNOTRANSPARENCY and the pdfwrite device to produce a new PDF file with no transparency, but that will eliminate all the transparency in the output file.
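A minimal command line for that (the file names are placeholders) would be along these lines:

    # re-distil the PDF with all transparency ignored
    gs -sDEVICE=pdfwrite -dNOTRANSPARENCY -o no-transparency.pdf input.pdf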
Fundamentally there's no real way to do what you want, other than by manually editing the PDF file in an editor. You should go back to the original document and do the work there. PDF is not intended as an editable format.

How is hidden text stored in OCR-enhanced PDF files

// EDIT 26.03.2018 - Anyone who wants to continue my work can have a look at my source files: https://github.com/n0l0cale/ocr-sampledata
I'm looking for some details about PDF files. It's most important for me that the files remain usable for a very long time, and if possible OCR should be applied automatically to new files (which does not seem to be really possible with Adobe Acrobat...).
For that I've been looking at different solutions for OCRing my PDF files. I found three candidates which seem to do what they should (more or less), but all three variants have their pros & cons... and they seem to use different approaches to storing the data in the PDF files. Let me explain:
a File OCRed with Adobe Acrobat:
https://github.com/n0l0cale/ocr-sampledata/blob/master/A4%20sample_ACROBAT.pdf
results in a file that Acrobat is able to open in one step (no preloading of any background layer), and after running a preflight script I'm able to see the text that is stored hidden:
a File OCRed with Abby Finereader:
https://github.com/n0l0cale/ocr-sampledata/blob/master/A4%20sample_ABBY.pdf
does not seem suitable for the default Adobe preflight script, as it does not display any additional layers:
But as far as I was able to reproduce it, these files seem to have a background text layer which contains the OCRed text and which underlies the image shown to the user. Unfortunately this seems to be loaded separately, which is confusing when opening the file with Adobe Acrobat...
a File OCRed with Tesseract 4 (Alpha):
https://github.com/n0l0cale/ocr-sampledata/blob/master/A4%20sample_TESSERACT_oem2.pdf
is also doing some weird magic with the hidden text part:
But in all three cases I'm able to search for words in the files and see the text using "Remove hidden information" and selecting "hidden text":
I'm seriously confused.... Does anyone know how these programs are storing their hidden text information really?
S.
P.S.: For those wondering what this ominous preflight script is: https://theblog.adobe.com/hidden-gems-in-acrobat-dc-how-to-optimize-hidden-ocr-text/
Does anyone know how these programs are storing their hidden text information really?
You have correctly found out that the approach of Abby Finereader is different from that of Adobe Acrobat and of Tesseract:
Abby creates a page content stream in which first the text is drawn normally on the page and eventually covered by the scanned image.
Acrobat and Tesseract create content streams in which first the image is drawn and then the text is drawn invisibly (using text rendering mode 3 which draws nothing).
The difference between the latter two results is the choice of font used:
Acrobat uses regular standard 14 fonts for which a PDF viewer has a font program to render them as normal glyphs.
Tesseract uses a font called GlyphLessFont, for which it embeds a font program into the result file. When rendered, the glyphs in this font do not show as normal Latin glyphs but merely as empty space.
Considering the visual effect you observed for the Abby result, the approach used by Acrobat or Tesseract might be preferable.
Whether one prefers fonts with visually recognizable glyphs (as used by Acrobat) or without (as used by Tesseract) is mostly a matter of taste; they are used only in the invisible rendering mode anyway.
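For illustration only, a page content stream following the Acrobat/Tesseract approach might look roughly like this (the resource names /Im0 and /F1 and the numbers are made up):

    q 612 0 0 792 0 0 cm /Im0 Do Q    % draw the scanned page image first
    BT
      3 Tr                            % text rendering mode 3: neither fill nor stroke, i.e. invisible
      /F1 10 Tf                       % select the font used for the OCR text
      72 720 Td                       % position the text
      (recognised word) Tj            % the OCR result: searchable, but never visibly drawn
    ET

The Abby approach simply reverses the order: the text is drawn normally first and the scanned image is painted over it.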

Issue with ghostscript rendering PPT into PDF

I've been tinkering with Ghostscript with a port monitor (on an HP PCL 6 Universal driver) to convert print jobs into PDF. I've tested with a few applications such as Word, Excel, Adobe Reader, Microsoft Edge etc., and they all work properly.
However, upon testing Microsoft PowerPoint 2016, it seems that some graphics cannot be rendered properly through Ghostscript.
Actual Slide Below
Output From Ghostscript in PDF Below
I've tested this with some other PDF generators such as BioPDF and CutePDF, as well as Adobe PDF, and they all produce the same output as above.
Just wondering, has anyone tried this and faced similar issues before? If so, could someone point me in the right direction?
What you are doing isn't a single-step PowerPoint-to-PDF conversion, and Ghostscript is not rendering the PowerPoint. In fact, if you are creating a PDF file, Ghostscript isn't (ideally) rendering anything.
What's actually happening is that you are asking PowerPoint to print to a canvas, which is then passed to the PostScript printer driver. That produces PostScript which is sent to the Port. Your (and others') port monitor then sends the PostScript to the 'Distiller' (in your case Ghostscript and the pdfwrite device). The Distiller reformats the vector drawing commands into PDF and builds a PDF file from them. It doesn't render (turn into a bitmap image) anything unless forced to.
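In Ghostscript terms, that distilling step is essentially just (paths are placeholders):

    gs -sDEVICE=pdfwrite -o output.pdf printjob.ps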
Obviously there are several places along that road where the problem could creep in. Given that you say the Adobe product (the others you mention all use Ghostscript) has the same problem, I think it's safe to assume that the problem isn't Ghostscript.
This also means that you aren't using the driver you think you are. Adobe can't handle PCL as an input medium as far as I'm aware, and nor can Ghostscript. GhostPCL will handle PCL as an input, but that's not what you say you are using.
Of course you haven't linked to an example file to demonstrate the problem, nor supplied an example command line, so this is all supposition.
Now if, somehow, you are using a PCL6 device, then the problem is most likely due to the presence of rasterOps in the output. RasterOps are part of the PCL imaging model; they do not exist in PDF and are a form of transparency. There are three ways to handle such content for a PDF output device: firstly, render the whole page content to an image; secondly, ignore the rasterOps objects; thirdly, treat the rasterOps as opaque.
GhostPCL and the pdfwrite device take the third option. So it's just conceivable that your original content has some transparent objects which are being handled as rasterOps by the PCL printer driver, and then rendered as opaque by GhostPCL and the pdfwrite device.
If that's somehow the case then the solution is simple; don't use a PCL printer driver, use the PostScript one.
If you post a link to a (simple, eg single page) example of what you are sending to Ghostscript, and a command line, then I can look at it. Please don't send me the PowerPoint, I can't use it and even if I could, my print setup would not match yours. I need the data being sent to Ghostscript.
[EDIT after looking at files]
I don't mean to sound like I'm lecturing; the problem is that people find these results in Google searches and then try to apply them based on a poor understanding of what's happening. So I find it best to be really clear in my answers about what's going on. It saves questions later :-)
The first thing I see is that the PCL is indeed PCL, and if you try running that through Ghostscript it throws horrible errors and exits. So presumably you aren't doing that.
The PostScript file contains nothing except huge images, rendered presumably at 600 dpi. It contains 2 pages, and the two pages look like your images above. That is why the PostScript is more than 20 times larger than the PCL file.
But... if I open the .ppt file with OpenOffice (4.0.0 is what I have to hand) I see exactly the same thing. I don't, I'm afraid, have a copy of Microsoft PowerPoint, but from what I see here there are two conclusions:
firstly that the PDF I get looks pretty much like the PowerPoint when viewed with OpenOffice at least. So there's something 'interesting' about your PowerPoint.
secondly, even if that's not what you expect, it's what's in the PostScript program. That means that either PowerPoint rendered the slide to a bitmap or the Windows printing system/HP driver did.
Now, if I run the PCL through GhostPCL instead of Ghostscript (rendering, not producing a PDF) then the result is more like what I think you are expecting. However, when sent to a PDF file the result is horrible. Which strongly suggests to me that there's some form of transparency involved; PostScript doesn't support transparency at all, and PCL does it through rasterOps.
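Something like the following GhostPCL invocations (the executable name varies between builds, and the file names are placeholders) reproduces the two cases:

    # render the PCL directly to images: looks close to the expected slide
    gpcl6 -sDEVICE=png16m -r600 -o page%d.png job.pcl
    # convert the PCL to PDF via pdfwrite: the rasterOps are treated as opaque
    gpcl6 -sDEVICE=pdfwrite -o job.pdf job.pcl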
I'm afraid that this means that the problem lies either in PowerPoint, the Windows print system or the PostScript printer driver you are using. Since the PCL is at least close to what you expect, I suspect that this means PowerPoint is doing the right thing, and it's the printer driver messing up. It appears you are using the Windows PostScript printer driver.
So there's no way you can 'fix' this for files like this, at least not with Ghostscript. You would need to 'fix' the Windows PostScript printer driver, or possibly the Windows print system. You could try reporting a bug to Microsoft, presumably these files print incorrectly when sent to physical PostScript printers too.

PDF with OCR text visible, how to hide it from existing PDF

I have several PDF files that have been OCR-processed (not by me). They contain both the scanned image and the OCR text. They seem to work fine in some viewers (iPhone/iPad), but not in others (Preview.app on macOS) which makes them somewhat awkward to read.
From googling around, it seems that the text & image may be layered incorrectly or there is a problem with the fonts used? I'm not even sure I'm using the correct vocabulary, as most hits I get are worthless.
Is it possible to use ghostscript or something to batch-fix these files?
Example of "bad" rendering:
It's impossible to say what's wrong with the PDF file (or viewer) without seeing the PDF file, which also makes it hard to propose solutions!
You could certainly run the file through Ghostscript to the pdfwrite device, and use the -dFILTERTEXT switch to not process the text. The resulting document would therefore not contain the offending text, but would still contain the image.
Of course, this would then not be possible to search or highlight.
You could instead use -dFILTERIMAGE which would remove the original image leaving the text behind. But then anything in the original document which was not text would now be missing.
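For reference, the two variants would look something like this (file names are placeholders):

    # keep the scanned image, drop the OCR text
    gs -sDEVICE=pdfwrite -dFILTERTEXT -o image-only.pdf input.pdf
    # keep the text, drop the images
    gs -sDEVICE=pdfwrite -dFILTERIMAGE -o text-only.pdf input.pdf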
The usual 'best practice' is to have the text drawn in rendering mode 3, which makes no marks. This allows you to see the original image without the OCR'ed text interfering. It's possible that the viewer you are using is not honouring the text rendering mode, which would be a (fairly serious) bug in the viewer. The most recent versions of macOS seem to have some nasty bugs in the Quartz PDF rendering engine.
The other way to do this is to draw the text first, then put the original image on top of it, but that's hard to get wrong; I suspect it's more likely the text rendering mode.
EDIT
The PDF file first draws the text, then draws the image on top of the text. The underlying text should not appear. mkl is quite correct in his comment.
The correct way to fix this is to fix the consumer which is rendering it incorrectly. As I mentioned above, the latest version of Quartz seems to have some fairly serious bugs; you might choose to raise this as a bug with Apple.
The only other solution would be to run this through something which will remove the text. Ghostscript can do this, but there are implications: firstly, it will no longer be possible to search/copy/paste text from the document; secondly, you would need to run quite a complex command line in order to prevent the decompressed JPX images being recompressed as JPEG, which would probably compromise quality; finally, the resulting file size would be larger.

Blurry results on EPS/PDF import [duplicate]

I want to create a visualization of a matrix for some academic work. I decided to go about this by having the pixels in the image correspond to the values in the matrix. I created the nice small png that follows:
When properly scaled up, you get a very reasonable image:
This is a screenshot from within Inkscape. However, when I export this as a PDF, both Evince and Chrome do a terrible job of upscaling what should be very trivial, and instead I get something that looks like this:
The PDF itself seems to scale well enough for printing, but unfortunately I do a lot of my editing without printing, and this looks unacceptable. I did find this incredibly old thread about people who seemed to have a similar issue with Chrome's PDF viewer, and the "solution" was to just upscale the raster graphics. That is a solution, but it is terribly inefficient.
Is anyone aware of a way to change the PDF so that it gets upscaled appropriately? Maybe a config change in Evince or Chrome that will render these properly? Even a nice way to go from a raster image to a vector image might be suitable?
The comments aggregated into an answer...
An image dictionary in a PDF has an (optional) boolean entry Interpolate. It is specified as a flag indicating whether image interpolation shall be performed by a conforming reader.
The program used by the OP to create the PDF, Inkscape, seems to have explicitly set this flag to true. Editing the PDF to unset this flag creates a file which looks as desired by the OP.
(Another solution, proposed in an Inkscape forum thread eventually found by the OP, is to save the PDF with high-resolution bitmaps embedded: File -> Inkscape Preferences -> Bitmaps -> Resolution for Create Bitmap Copy, and set it to 6000 dpi.)
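The entry in question lives in the dictionary of the image XObject; a made-up example of such a dictionary looks like this:

    << /Type /XObject
       /Subtype /Image
       /Width 16
       /Height 16
       /BitsPerComponent 8
       /ColorSpace /DeviceRGB
       /Filter /FlateDecode
       /Interpolate true    % set to false, or remove the entry, to disable interpolation
    >>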
The fact that interpolation looks different in different viewers and on different output media is by design. The PDF specification states on interpolation:
A conforming Reader may choose to not implement this feature of PDF, or may use any specific implementation of interpolation that it wishes.
A different way to get around this problem (especially as some PDF viewers have a tendency not to live up to the specification and, e.g., interpolate regardless of that flag) would be to use vector graphics here, drawing the bitmap pixels as rectangles. The result should be optimal.
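A minimal PostScript sketch of that idea, assuming a small greyscale matrix (the values and cell size are made up), could look like this:

    %!PS
    % make each matrix cell a 10-point square
    10 10 scale
    % one rectfill per cell: x y width height, with the cell value as the grey level
    0.0 setgray 0 1 1 1 rectfill
    0.5 setgray 1 1 1 1 rectfill
    0.8 setgray 0 0 1 1 rectfill
    0.2 setgray 1 0 1 1 rectfill
    showpage

Because each cell is a filled rectangle rather than an image sample, viewers have nothing to interpolate and the edges stay crisp at any zoom level.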