PDF Table Lines Missing from Ghostscript

I am trying to convert a PDF file to an image format (ideally PNG), but some of the table lines do not render in the output, which is an issue since the purpose of my conversion is to use computer vision on it.
I unfortunately do not have access to the file used to generate the PDF.
Thank you in advance for your help!
Attached is the Ghostscript rendering vs. the actual PDF:
Original
Ghostscript
EDIT: Thanks for the answers. Here is what I had already tried:
Changing the scaling and the antialiasing settings (I doubt that any combination of these will work in Ghostscript at this point)
Converting to PostScript and then to PNG/PDF
Saving from a Browser
Saving from various virtual printers to PDF
Using Poppler to do the rendering
All to no avail. Digging deeper, I found some interesting things which may be helpful. Ghostscript does recognize the lines when using -sDEVICE=x11 and -sDEVICE=ps2write. That is, using Ghostscript to visualize the PDF does work, but not to process it into anything other than PostScript.
Also, printing to a PDF from Adobe Acrobat does fix my problem; however, this is something I need to be able to do from the command line on thousands of files.
Hope this helps!
EDIT2:
Link to an affected file:
https://transfer.sh/PuIF90/e176ad9824ddc6cb5e6aead2d389c131-filer.pdf

I thought that I would share the fix that I found. It turns out that a bunch of the PDFs we need to process were generated by a specific HTML5-to-PDF conversion tool which turns each line in the PDF into a rectangle with size 0. The solution for me has been to automate decompressing the PDFs and searching the resulting text for "A A A A re", where each "A" is a number. Should the last or next-to-last A (the height or width) be a zero, I change it to 1.
For instance (once again, after decompressing the PDF):
1000 2000 0 14 re
to
1000 2000 1 14 re
Hope this helps someone else out there, and let me know if there is a more elegant way of doing this; I am still a beginner in all things PDF.
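For what it's worth, here is a minimal sketch of that rewrite in Java. The file names and the choice of 1 as the replacement size are just illustrative, and it assumes the PDF has first been expanded to text form (e.g. with qpdf --qdf --object-streams=disable):

import java.nio.file.*;
import java.util.regex.*;

public class FixZeroRects {
    public static void main(String[] args) throws Exception {
        // Read a PDF already expanded to text form, e.g. with:
        //   qpdf --qdf --object-streams=disable in.pdf in-qdf.pdf
        // ISO-8859-1 maps every byte to a char, so binary parts survive.
        String src = new String(Files.readAllBytes(Paths.get("in-qdf.pdf")), "ISO-8859-1");

        // "x y w h re" appends a rectangle to the current path; a zero
        // width or height gives a degenerate rectangle that some
        // renderers drop. Bump any zero dimension up to 1.
        Matcher m = Pattern.compile("([-\\d.]+ [-\\d.]+ )([-\\d.]+) ([-\\d.]+)( re)")
                .matcher(src);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String w = m.group(2).matches("-?0(\\.0*)?") ? "1" : m.group(2);
            String h = m.group(3).matches("-?0(\\.0*)?") ? "1" : m.group(3);
            m.appendReplacement(out, Matcher.quoteReplacement(m.group(1) + w + " " + h + m.group(4)));
        }
        m.appendTail(out);

        Files.write(Paths.get("out-qdf.pdf"), out.toString().getBytes("ISO-8859-1"));
        // The edits change stream contents, so the /Length entries will be
        // wrong afterwards; run fix-qdf (ships with qpdf) on the result
        // before handing the file back to Ghostscript.
    }
}

Since the edit changes stream lengths, the fix-qdf utility that comes with qpdf exists to repair the /Length entries afterwards.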

Related

Issue with Ghostscript rendering PPT into PDF

I've been tinkering with Ghostscript and a port monitor (on an HP PCL 6 Universal driver) to convert print jobs into PDF. I've tested it with a few applications such as Word, Excel, Adobe Reader, Microsoft Edge, etc., and they all work properly.
However, upon testing Microsoft PowerPoint 2016, it seems that some graphics cannot be rendered properly through Ghostscript.
Actual Slide Below
Output From Ghostscript in PDF Below
I've tested this even with some other PDF generators such as BioPDF, CutePDF, and Adobe PDF, and they all result in the same output as above.
Has anyone faced similar issues before? If so, could someone point me in the right direction?
What you are doing isn't a single-step PowerPoint-to-PDF conversion, and Ghostscript is not rendering the PowerPoint. In fact, if you are creating a PDF file, Ghostscript isn't (ideally) rendering anything.
What's actually happening is that you are asking PowerPoint to print to a canvas, which is then passed to the PostScript printer driver. That produces PostScript, which is sent to the port. Your (and others') port monitor then sends the PostScript to the 'distiller' (in your case Ghostscript and the pdfwrite device). The distiller reformats the vector drawing commands into PDF syntax and builds a PDF file from them. It doesn't render (turn into a bitmap image) anything unless forced to.
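For reference, that distilling stage on its own amounts to handing the captured PostScript to Ghostscript's pdfwrite device. A minimal sketch of invoking it programmatically (the gs flags are the standard batch ones; the file names are placeholders):

import java.io.IOException;

public class Distill {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Hand the captured PostScript to Ghostscript's pdfwrite device;
        // pdfwrite re-emits the vector drawing commands as PDF rather
        // than rendering them to a bitmap.
        Process gs = new ProcessBuilder(
                "gs", "-dBATCH", "-dNOPAUSE",
                "-sDEVICE=pdfwrite",
                "-sOutputFile=out.pdf",
                "captured.ps")
                .inheritIO()
                .start();
        System.exit(gs.waitFor());
    }
}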
Obviously there are several places along that road where the problem could creep in. Given that you say the Adobe product has the same problem (the others you mention all use Ghostscript), I think it's safe to assume the problem isn't Ghostscript.
This also means that you aren't using the driver you think you are. Adobe can't handle PCL as an input medium as far as I'm aware, and nor can Ghostscript. GhostPCL will handle PCL as an input, but that's not what you say you are using.
Of course you haven't linked to an example file to demonstrate the problem, nor supplied an example command line, so this is all supposition.
Now if, somehow, you are using a PCL6 device, then the problem is most likely due to the presence of rasterOps in the output. RasterOps are part of the PCL imaging model; they do not exist in PDF and are a form of transparency. There are three ways to handle such content for a PDF output device: firstly, render the whole page content to an image; secondly, ignore the rasterOps objects; thirdly, treat the rasterOps as opaque.
GhostPCL and the pdfwrite device take the third option. So it's just conceivable that your original content has some transparent objects which are handled as rasterOps by the PCL printer driver, and then rendered as opaque by GhostPCL and the pdfwrite device.
If that's somehow the case then the solution is simple; don't use a PCL printer driver, use the PostScript one.
If you post a link to a (simple, eg single page) example of what you are sending to Ghostscript, and a command line, then I can look at it. Please don't send me the PowerPoint, I can't use it and even if I could, my print setup would not match yours. I need the data being sent to Ghostscript.
[EDIT after looking at files]
I don't mean to sound like I'm lecturing; the problem is that people find these results in Google searches and then try to apply them based on a poor understanding of what's happening. So I find it best to be really clear in my answers about what's going on. It saves questions later :-)
The first thing I see is that the PCL is indeed PCL, and if you try running that through Ghostscript it throws horrible errors and exits. So presumably you aren't doing that.
The PostScript file contains nothing except huge images, rendered presumably at 600 dpi. It contains 2 pages, and the two pages look like your images above. That is why the PostScript is more than 20 times larger than the PCL file.
But.... If I open the .ppt file with OpenOffice (4.0.0 is what I have to hand), I see exactly the same thing. I don't, I'm afraid, have a copy of Microsoft PowerPoint, but from what I see here there are two conclusions:
firstly, the PDF I get looks pretty much like the PowerPoint, when viewed with OpenOffice at least, so there's something 'interesting' about your PowerPoint;
secondly, even if that's not what you expect, it's what's in the PostScript program. That means that either PowerPoint rendered the slide to a bitmap, or the Windows printing system/HP driver did.
Now, if I run the PCL through GhostPCL instead of Ghostscript (rendering, not producing a PDF), then the result is more like what I think you are expecting. However, when sent to a PDF file, the result is horrible. This strongly suggests to me that there's some form of transparency involved: PostScript doesn't support transparency at all, and PCL does it through rasterOps.
I'm afraid this means that the problem lies in PowerPoint, the Windows print system, or the PostScript printer driver you are using. Since the PCL is at least close to what you expect, I suspect PowerPoint is doing the right thing and it's the printer driver messing up. It appears you are using the Windows PostScript printer driver.
So there's no way you can 'fix' this for files like this, at least not with Ghostscript. You would need to 'fix' the Windows PostScript printer driver, or possibly the Windows print system. You could try reporting a bug to Microsoft, presumably these files print incorrectly when sent to physical PostScript printers too.

Adjusting format of PDF to print it faster

I am using a combination of iTextSharp and PdfSharp to assemble a large PDF file for printing to a Canon Oce VarioPrint 6000 series printer. The PDF is replacing a postscript file.
Both this new file and the old are transferred to the printer via an LPR command.
The postscript file would take maybe 10 minutes to rip to the printer. My PDF version of the same file is taking over 30 minutes to process before it is ready to print.
Can anyone give me pointers into ways I could change the way this file is written / created that would decrease the processing time on the Vario?
EDIT: I took the file that was ripping so slowly and ran it through Acrobat Preflight, and it found many RGB images that it wanted to convert to CMYK. When I look at the PDF, though, they are all black-and-white logos, so I had Preflight do a fixup to convert all images to black and white.
I also noticed the Preflight was consolidating backgrounds. Half of the pages have the same logo on them, so leveraging this conversion is probably also helpful.
When I LPR'd that file, it copied and ripped in less than 5 minutes! So I guess the real question is: how can I do that programmatically?
I am modifying the title and tags.
Thanks!
An equivalent result to the Preflight repair process can, in this case, be obtained using iText (or in my case, iTextSharp). I replaced the PdfSharp method of aggregating the PDFs with the PdfSmartCopy class. Combined with iText's reader.RemoveUnusedObjects(), this brought down the size of the output PDF significantly, and my rip time on the printer dropped to the same as, or below, the rip times we previously had with the PostScript file. Very pleased.
So the duplicated RGB images that were probably contributing to the long processing time were pruned by the smart copy removing duplicates.
More info on PdfSmartCopy can be found at: http://api.itextpdf.com/itext/com/itextpdf/text/pdf/PdfSmartCopy.html
and in Bruno's book, iText In Action, more specifically in Chapter 6.
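For anyone wanting to reproduce this, a minimal sketch of that kind of aggregation with iText 5's PdfSmartCopy (Java, matching the API link above; iTextSharp mirrors the same class names, and the file names are placeholders):

import java.io.FileOutputStream;
import com.itextpdf.text.Document;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfSmartCopy;

public class SmartMerge {
    public static void main(String[] args) throws Exception {
        Document document = new Document();
        // Unlike plain PdfCopy, PdfSmartCopy detects redundant objects
        // (such as the same logo image imported on many pages) and
        // writes each one only once.
        PdfSmartCopy copy = new PdfSmartCopy(document, new FileOutputStream("merged.pdf"));
        document.open();
        for (String path : new String[] { "part1.pdf", "part2.pdf" }) {
            PdfReader reader = new PdfReader(path);
            reader.removeUnusedObjects();  // drop objects nothing references
            for (int i = 1; i <= reader.getNumberOfPages(); i++) {
                copy.addPage(copy.getImportedPage(reader, i));
            }
            copy.freeReader(reader);  // flush this reader's content into the output
            reader.close();
        }
        document.close();
    }
}

PdfSmartCopy compares the content of indirect objects as it copies and reuses identical ones, which is what collapses the repeated logo images here.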

Font issue with PDFtk

I'm having difficulties filling in a form with pdftk when the text fields use TrueType fonts.
Font files (.ttf) are added to /Library/Fonts (OS X Mavericks)
The form is created with Adobe Acrobat Pro
The form includes normal (non form) text using these fonts
The form text fields also use these fonts
The form can successfully be filled and printed using Adobe Acrobat Pro and even Preview
However, pdftk throws an error when trying to fill it using the command:
pdftk ./my_form.pdf fill_form my_data.fdf output ./the_output.pdf
The output is:
Unhandled Java Exception in create_output():
java.lang.ArrayIndexOutOfBoundsException: 0
at pdftk.com.lowagie.text.pdf.DocumentFont.fillEncoding(pdftk)
at pdftk.com.lowagie.text.pdf.DocumentFont.doType1TT(pdftk)
at pdftk.com.lowagie.text.pdf.DocumentFont.<init>(pdftk)
at pdftk.com.lowagie.text.pdf.AcroFields.getAppearance(pdftk)
at pdftk.com.lowagie.text.pdf.AcroFields.setField(pdftk)
at pdftk.com.lowagie.text.pdf.AcroFields.setFields(pdftk)
If I change the font of the text inputs to Helvetica, Times Roman or Courier, pdftk will successfully create a PDF. Oddly though, Arial and Georgia also throw the same error.
I have tried, to no avail, to embed the fonts in the PDF using Ghostscript, as suggested in this question: How to repair a PDF file and embed missing fonts. gs may have embedded the fonts, but it removes the form fields, so the resulting PDF can't be fed back into pdftk.
A working resolution would be greatly appreciated.
I was getting the same java.lang.ArrayIndexOutOfBoundsException: 0 error using pdftk to fill forms on an Adobe Acrobat generated PDF. This question is super old, but I couldn't find a consistent answer on stackoverflow or elsewhere so I figured I'd post my fix.
What ended up working for me:
Opening the PDF in the OS X app Preview
Clicking into a form field, adding text then deleting that text (so nothing is actually changed)
Saving it
Running the PDF through pdftk again
I'm not that familiar with encoding or PDFs in general, but saving the PDF with Preview seems to fix the encoding or at least get it to a place where pdftk can work with it. Good luck.
This was causing a huge headache for me for 2 days. It turns out I was focusing on the wrong end of the problem.
A nice alternative that isn't as manual and only has to be done once is to enter some text in a field of the source PDF form, in your case ./my_form.pdf. I don't know EXACTLY why this works, but it does. That way, if you want to create a new file at any time, you don't have to go through this trouble :)

How to troubleshoot badly rendered PDF file

I have a small PDF file, which is supposed to display just the string "Hello World!".
Unfortunately, it displays black boxes instead of the characters. I suppose there is some problem with the fonts, but I am not sure.
Is there a way to diagnose and troubleshoot this issue? All I see on the Internet is advice to do this and do that, which helps some and not others (nothing helped me). It looks like shooting in the dark to me.
Here is a concrete example. Why does this PDF display black squares instead of the string Hello World?
EDIT
A bit of context: I am trying to convert a trivial HTML file to PDF using the wkhtmltopdf tool. It is an absolute frustration because, according to Internet searches, the tool is supposed to work, and to do it quite well. But it does not work for me, and nothing I do changes this! Unfortunately, this tool seems to be the only free tool for converting HTML to PDF. This is a huge bummer.
If you want to find out whether a PDF is valid or what is wrong with it, there are a few general steps you can take:
1) Open it in Adobe Acrobat or Adobe Reader (on a desktop platform, not a tablet device). For a very long time the PDF format was owned by Adobe, and the way their software handles PDF is still close to the gold standard. However, there is a caveat with this: Acrobat is very, very smart in the way it handles PDF files, and it will overlook or actively correct a number of mistakes other PDF engines might have a problem with...
2) Get yourself a preflight tool. These tools were invented for use in graphic arts, but have applications outside of it too. Popular examples are callas pdfToolbox (warning, I'm affiliated with this vendor!) or the "Preflight" plug-in you'll find in Adobe Acrobat Pro (which is actually also callas technology under the hood). Then preflight specifically against the PDF/A-1b or PDF/A-2b standard.
That last point deserves some more explanation. You should pick a PDF/A compliant preflight profile because the PDF/A (or PDF for Archival) standard is extremely picky. Its goal is to make sure that PDF files will still be readable in exactly the same way 50 years from now, so it tests a whole range of properties of the file itself and of the different components in it. You might be able to ignore some of the errors you get (because some of them will be connected to the fact that the PDF/A identification isn't correct, for example), but I wouldn't ignore any other errors unless you understand exactly what they mean and why they aren't relevant.
PS: Can you make your test file available some other way? The file you shared in your question is useless I think. When I do "Download" I get a PDF file that doesn't contain text and doesn't have fonts in it. Those rectangles you see are exactly that - rectangles. So this PDF renders fine - it's the PDF generation process (or the fact that you stored the file on Google docs - I really have no clue what that might do) that went berserk apparently.
In addition to David's hints (first using a known good viewer and then some preflight tool), there is a third level in the inspection process:
3) Inspect the PDF with your own eyes and with the PDF specification (made available by Adobe here) at hand in a text viewer (for a first impression) and (if the cause of the issue at hand is not immediately visible) then in a PDF browsing tool (for in-depth analysis).
This step is quite cumbersome at first but after some time you learn your way around in the PDFs.
A sample for such a PDF browser tool is RUPS but there are others around, too.
'Small PDF file supposed to display "Hello World!"'
Not correct. The file you linked to does not contain any code that could render pixels on screen or on paper that a human brain would read as "Hello World!". The file indeed contains only vector drawing operations, which result in 12 black boxes.
The command line tool pdffonts does not indicate any font being used in the file:
pdffonts so-file-#15858199.pdf
What could still cause the "rendering" of the words you are looking for: some vector or pixel drawing code contained in the PDF. To find out about this, you'll have to look into the low level source code of the PDF.
The original file is 1,570 bytes. So this task does not look overly big.
'Is there a way to diagnose and troubleshoot this issue?'
Using qpdf, a "command-line program that does structural, content-preserving transformations on PDF files", you can expand all contained streams (which are normally compressed):
qpdf --qdf --object-streams=disable so-file-#15858199.pdf qdf-#15858199.pdf
The resulting file, qdf-#15858199.pdf, is 3,875 bytes. Now open it in a text editor. PDF object no. 6 (lines 66-219) contains the contents of the page. Lines 123-194 contain only the operators m (moveto), l (lineto) and h (closepath). These lines form 12 distinct groups of drawing commands, where each group represents the path for one of the 12 black boxes you see rendered on screen or printed on paper:
102.400001 12.8000001 m
268.800004 12.8000001 l
268.800004 179.200002 l
102.400001 179.200002 l
102.400001 12.8000001 l
h
Line 196 contains
f
which is the fill operator that actually fills black into the (closed) path constructed so far. Nothing in the other lines (which I didn't analyze in detail) does any drawing that could resemble the shapes of any glyphs.
'Unfortunately, this tool seems the only free tool to convert HTML to PDF'
Not correct either.
1.
Assuming your "free" is meant as free as in liberty, then an alternative option is HTMLDOC.
HTMLDOC does not support specific fonts which may be assigned to your HTML input via CSS, but it does a good job in converting one or multiple HTML documents into a single PDF book containing chapters, page-numbering, page headers and footers and more. For all options available, see its full documentation.
2.
Assuming your "free" is meant as free as in beer, then an alternative option (for private usage only) could be PrinceXML.
PrinceXML does an extraordinarily good job when it comes to support almost all CSS features your HTML document may be using. See its documentation and also some of the sample PDF files produced by PrinceXML.

How to convert PDF to an Image without text

I would like to know if it's possible to convert a PDF to an image without fonts. My goal is to have only the image, without the text.
And if yes, can I do it with ImageMagick/Ghostscript?
Here an example
The final image: http://crocodoc_public.s3.amazonaws.com/8b8aa154-45e3-41f9-a465-628e1b2e955d/images/page-001.png
and the original PDF: http://crocodoc.com/demo/efwpa (page 2). We can see that the text is an overlay over the image; what I want is to do the same.
So if I got you right, what you want is to remove some text from your PDF (not fonts), and you want to do it programmatically. I suspect you already know that this will only be possible if the text is placed on some kind of separate layer in your PDF files. You can try to utilize iText for that. Beware, this will mean you will have to invest some days in learning how to use that library.
I too am on the lookout for something like that.
While playing with ImageMagick I tried a command and got some unexpected results:
convert input.pdf -blur 0x0 output.jpg
This removes the text layers from the PDFs I tried.
I cannot guarantee that this will work for you, or that it is the right way to achieve this, but you may try.
You can do that with Adobe Acrobat. Select the text with the touch up tool and delete it. I don't think you can do that with Ghostscript. You could consider editing the PDF by hand (qpdf helps).