Is there any PDF analyser / debugger to debug a PDF file?

Is there any validator or PDF analyser which can tell me what is wrong with a PDF I created with a hint or indicator which object in my PDF is wrong or something like that?
I would like to create and understand the PDF file format better and I think I should be pretty close to a working PDF but I can not find the problem in it and why PDF readers are not able to read it.
Isn't there a program or an online service which can give me at least a hint what is wrong with my pdf structure or where the problem occurs or even tell me what is wrong? How to debug something like that?
Here is a link to the PDF (just the attached image converted to a PDF):
https://nonepatchwork.patchwork3d.de/create_pdf/created_pdf.pdf
Best regards and thank you very much in advance
Fuchur

The request for a validator or PDF analyser is off topic on Stack Overflow; it is better suited to the Software Recommendations site. In this answer, therefore, I'll focus on analyzing the provided example files.
created_pdf.pdf
Here a number of issues immediately leap to the eye, in particular:
Your page object 5 points to object 6 as content stream, but object 6 is not a content stream but an image xobject! (You probably meant to point at object 7.)
All your cross reference table offsets are wrong.
The Size entry in the trailer is wrong.
There is an /ID between trailer dictionary and startxref.
There probably are more issues, but start by fixing these.
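As a rough way to check the cross reference offsets yourself, here is a minimal Python sketch. It assumes a classic (non-stream) cross reference table with a single subsection, and it finds objects by naively scanning for "N G obj" headers, which can be fooled by binary stream data. The function name check_xref_offsets is made up for this illustration.

```python
import re


def check_xref_offsets(data: bytes):
    """Return (object number, offset in xref table, offset found by scanning)
    for every in-use entry whose table offset does not match the scan."""
    # Where each "N G obj" header really starts, keyed by object number.
    actual = {}
    for m in re.finditer(rb"(?<![0-9])(\d+)\s+(\d+)\s+obj\b", data):
        actual[int(m.group(1))] = m.start()

    # Locate the "xref" keyword itself (not "startxref") and its subsection header.
    xref = re.search(rb"(?<!start)xref\s+(\d+)\s+(\d+)\s+", data)
    if xref is None:
        raise ValueError("no classic cross reference table found")
    first, count = int(xref.group(1)), int(xref.group(2))

    # Each entry is "oooooooooo ggggg n" or "... f"; take the first `count` of them.
    entries = re.findall(rb"(\d{10}) (\d{5}) ([nf])", data[xref.end():])[:count]

    problems = []
    for number, (offset, _generation, kind) in enumerate(entries, start=first):
        if kind != b"n":
            continue  # free entries carry no object offset
        if actual.get(number) != int(offset):
            problems.append((number, int(offset), actual.get(number)))
    return problems
```

An empty result only means the table agrees with the scan; it does not prove the rest of the file is valid.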
created_pdf_2.pdf
Here you fixed the errors listed above, but the file still does not display as expected; Adobe Acrobat Reader in particular reports an error.
Looking at the image dictionary the cause becomes clear:
6 0 obj <<
/Type /XObject
/Subtype /Image
/Width 595.276
/Height 841.89
/ColorSpace /DeviceRGB
/BitsPerComponent 8
/Filter/DCTDecode
/Interpolate true
/Length 707
>>
...
The values of Width and Height are floats, which is invalid; both must be integers. Furthermore, inspecting the actual image data in the stream shows that the values are completely incorrect: the image is only 20×20 pixels in size.
Thus, replace the Width and Height entries by
/Width 20
/Height 20
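If you want to read the real dimensions out of the DCT (JPEG) data rather than guess, the following Python sketch walks the JPEG marker segments up to the first start-of-frame marker. It assumes you pass it the raw bytes of the image stream; the function name jpeg_size is made up for this illustration.

```python
import struct


def jpeg_size(data: bytes):
    """Return (width, height) taken from the first SOF segment of a JPEG."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG (missing SOI marker)")
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError("expected a marker at offset %d" % pos)
        marker = data[pos + 1]
        if marker == 0xFF:
            pos += 1  # fill byte before the real marker
            continue
        if marker == 0x01 or 0xD0 <= marker <= 0xD9:
            pos += 2  # standalone markers carry no length field
            continue
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        # SOF0..SOF15, excluding DHT (C4), JPG (C8) and DAC (CC).
        if 0xC0 <= marker <= 0xCF and marker not in (0xC4, 0xC8, 0xCC):
            _precision, height, width = struct.unpack(">BHH", data[pos + 4:pos + 9])
            return width, height
        pos += 2 + length  # skip marker plus segment (length includes itself)
    raise ValueError("no SOF marker found")
```

The two numbers it returns are what belongs in the Width and Height entries of the image dictionary.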

Related

Use Ghostscript to set PDF natural language via pdfmarks

I'm setting metadata on PDFs using Ghostscript and pdfmarks. I'm able to set just about everything I need (e.g. Title, Author, Bookmarks) using pdfmarks. However, I cannot set the Natural Language. I'm sure I'm just missing the correct syntax, as I've looked over the Adobe documentation and see it listed in there.
This is what I have tried:
[ /Type /Catalog /Lang (en-US) /StPNE pdfmark
[ /Subtype /document /Lang (en-US) /StPNE pdfmark
Neither of these works, unfortunately. Does anyone know the correct syntax to add a language?
That's a logical structure pdfmark StPNE, but the last pdfmark reference I can find (version 9 from 2008) does not list /Lang as a legal attribute for a logical structure pdfmark.
I note that the PDF specification does permit /Lang to be a member of a logical structure element, but that doesn't mean there's a pdfmark for it. I think Adobe stopped updating the pdfmark reference with new content for new versions of the PDF specification.
/Type /Catalog won't be legal either.
Can you explain which part of the resulting PDF you are trying to add this to? Ghostscript only implements the pdfmarks listed in the pdfmark reference, and I don't think it fully implements all of those currently.
[EDIT]
I just checked and Ghostscript's pdfwrite device does not implement the StPNE pdfmark at all, so that's not going to do anything.
[further edit]
It may be (looking at the PDF specification) that what you want is to set a key called /Lang in the Catalog object of the PDF file. Obviously I'm not certain but....
[{Catalog} <</Lang (en-US)>> /PUT pdfmark
puts a key called /Lang in the Catalog dictionary, and assigns it the string value (en-US). That may be sufficient, I can't tell.
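If you drive Ghostscript from a script, a small Python sketch like the following can generate that pdfmark line and an argument list. It assumes the usual pattern of passing the pdfmark file to gs as an extra input after the PDF; the helper names lang_pdfmark and gs_command are made up for this illustration, and I have not verified the result against your files.

```python
import re


def lang_pdfmark(lang: str) -> str:
    """Build the pdfmark line that puts /Lang into the Catalog dictionary."""
    # Restrict to plain language tags so nothing needs escaping inside ( ).
    if not re.fullmatch(r"[A-Za-z0-9]+(-[A-Za-z0-9]+)*", lang):
        raise ValueError("unexpected language tag: %r" % lang)
    return "[{Catalog} << /Lang (%s) >> /PUT pdfmark\n" % lang


def gs_command(src: str, dst: str, marks_file: str) -> list:
    """Argument list for a pdfwrite run that also reads the pdfmark file."""
    return ["gs", "-dBATCH", "-dNOPAUSE", "-sDEVICE=pdfwrite",
            "-sOutputFile=" + dst, src, marks_file]
```

Write the string from lang_pdfmark to a file and pass that file's name as marks_file.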

Scaling PDF file using ghostscript

Our system takes 8.5 x 11 PDF files (only) and does things to them. Sometimes customers hand us files to manipulate into the right shape. We're working to automate scaling non-standard sized PDF files into 8.5 x 11.
We've been able to handle most files we've tested with ghostscript, but we have this one customer submitted file that we are unable to handle. (And unfortunately we can't recreate the condition and, of course, can't post the customer's data.)
The file is PDF v1.7 and contains seven 8.5x11 pages followed by four pages that are 25.5 x 45.33 inches. I don't know how they were generated (Adobe Acrobat 10.1.2 per pdfinfo).
We have gradually added a series of parameters to our gs command until we arrived at this:
gs -sDEVICE=pdfwrite -sOutputFile=$final_file -dBATCH -dNOPAUSE -sPAPERSIZE=letter -q -r720 -g6120x7920 -dPDFFitPage -dFIXEDMEDIA $files_to_convert
This seems to work fine for our other files, but for this ONE file, the 25.5 x 45.33 pages are not scaled to letter size. Here are the measurements for pages 7 and 8 of the output file, per pdfinfo:
Page 7 size: 612 x 792 pts (letter)
Page 7 rot: 0
Page 8 size: 1836 x 3264 pts
Page 8 rot: 0
I've read that PostScript has Policies, PageSize options, but I'm not aware of such a thing with PDF. And if it exists, I don't know how to alter it using ghostscript.
How can I make sure all pages are scaled to letter?
Well, Ghostscript uses PostScript as its scripting language, so anything you can do in PostScript you can do to a PDF file.
I really wouldn't use -g with pdfwrite, because -g specifies the size in pixels, and since pdfwrite is a vector device that doesn't really work well. Use DEVICEHEIGHTPOINTS and DEVICEWIDTHPOINTS instead.
Don't set -sPAPERSIZE either; you can't set the media to be letter in one place and something different (the -g switch) elsewhere.
It's not really possible to tell you what's going on exactly with your PDF file without seeing it, and you haven't really explained what's wrong. You imply that the pages are not being scaled, but you don't say what size they are being drawn at. You also don't say why you think the pages are 'legal' size when viewed in Acrobat.
If you are saying that the pages in question are 'legal' but the media is much larger, then that is entirely possible and would suggest that the pages have a CropBox. Ghostscript uses the MediaBox for page sizes, Acrobat uses a plethora of different boxes, but usually defaults to the CropBox.
If you want Ghostscript to use the CropBox then just tell it -dUseCropBox.
Alternatively post an example somewhere and I can look at it.
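Put together, the switches suggested above could be assembled like this. This is a minimal Python sketch that only builds the argument list; the function name fit_to_letter_command is made up for this illustration, and the 612 x 792 point size is the letter size reported by pdfinfo in the question.

```python
def fit_to_letter_command(src: str, dst: str, use_crop_box: bool = False) -> list:
    """Build a gs argument list that fits every page onto letter media."""
    cmd = [
        "gs", "-q", "-dBATCH", "-dNOPAUSE",
        "-sDEVICE=pdfwrite",
        # Media size in points instead of -g (pixels) and -sPAPERSIZE.
        "-dDEVICEWIDTHPOINTS=612",
        "-dDEVICEHEIGHTPOINTS=792",
        "-dFIXEDMEDIA",
        "-dPDFFitPage",
    ]
    if use_crop_box:
        # Size pages from the CropBox rather than the MediaBox.
        cmd.append("-dUseCropBox")
    cmd += ["-sOutputFile=" + dst, src]
    return cmd
```

The list can be handed to subprocess.run; whether -dUseCropBox is needed depends on whether the oversized pages actually carry a CropBox.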

A very weird pdf document: only half visible in Adobe reader

I found a very weird PDF document here:
This is the PDF document
When opening it in Adobe reader, only half of the contents are visible; while if I change to SumatraPDF reader, then all the contents are visible.
What is happening to this document? and how can I fix it so that it is normal in Adobe reader?
Acrobat X says 'an error exists on this page...' which is why only half of it is visible. It draws up to the point where the error occurs.
SumatraPDF is based on MuPDF and clearly MuPDF is simply more tolerant of this particular class of broken PDF file. Acrobat is normally quite tolerant and doesn't even bother to issue warnings most of the time, sadly.
Ghostscript gives me 2 warnings; first that it expected a number and didn't get one, so it replaced it with 0, and second that an invalid shading was ignored.
The actual problem is the shading dictionary in object 90:
90 0 obj
<<
/BBox [ 0.0260000005 0.467999995 0.973999977 ]
Bounding boxes are required to have 4 values and this one only has 3, so it's not valid.
It's not easy to fix a PDF file; the best solution is to make it afresh with a fixed tool. The file is compressed, so you'll need to decompress it before you can modify it, and then you'll have to guess what the missing value ought to be.
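Once the file is decompressed, a quick scan for malformed bounding boxes is possible with a few lines of Python. This is only a sketch: it assumes the /BBox arrays are written literally in the uncompressed data (not as indirect references), and the function name find_bad_bboxes is made up for this illustration.

```python
import re


def find_bad_bboxes(data: bytes):
    """Return (byte offset, values) for every /BBox array without exactly 4 entries."""
    bad = []
    for m in re.finditer(rb"/BBox\s*\[([^\]]*)\]", data):
        values = m.group(1).split()
        if len(values) != 4:
            bad.append((m.start(), values))
    return bad
```

It tells you where the broken array is, but not what the missing value should have been.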

PDF special searching iOS

I know that there's a great source that works on iOS for PDF searching: PDFKitten.
But I have encountered some PDF files for which searching with this source doesn't work. I tried opening these files in the 'Preview' app on Mac and searching there, and it works.
I uploaded one file here.
You can check by opening this file in the 'Preview' app and searching for the word 'ra'. It works perfectly. But if you drag this file into the PDFKitten source and configure it so that the source opens it, then try to search, it doesn't work.
I inspected the source; it handles all the text-showing operators, including Tj, ', ", and TJ. I placed some log lines in the callbacks for these operators and saw that the callbacks are not called.
Can you give me some suggestions or ideas?
If I understand the code correctly, PDFKitten looks for fonts only in the /Font entry of the /Resources dictionary of the page. At least that's my interpretation of the method fontCollectionWithPage of Scanner the result of which is queried by setFont in pdfScannerCallbacks to set the current font object.
Furthermore there is no callback for the Do operator (i.e. the operator used to inject the contents of a XObject resource into the page content). Unless CGPDFScannerScan interprets this operator under the hood, the content of included XObjects is not scanned at all. This would match your observation that the text setting operator callbacks never get called.
Your file mundo1.pdf, though, does not have any immediate /Font entry in the /Resources dictionaries of its pages. Instead, all the actual content of each page is wrapped into a single /XObject resource. These XObjects in turn have their own /Resources dictionaries, which contain a /Font entry defining the fonts used for the respective page.
Thus, PDFKitten does not know anything about the fonts used in your file, especially about their encodings, and so cannot extract the text from the PDF contents. Maybe it does not even get to see the PDF contents to interpret.
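To illustrate what a scanner would have to do instead, here is a minimal Python sketch that looks for fonts not only in the page's own /Resources but also in the /Resources of nested form XObjects. The plain dictionaries are a hypothetical stand-in for parsed PDF objects; this is neither PDFKitten code nor the Core Graphics API.

```python
def collect_fonts(resources: dict, scope: tuple = (), _seen: set = None) -> list:
    """Return (scope, font name) pairs, where scope is the chain of XObject
    names leading to the resource dictionary that defines the font."""
    seen = set() if _seen is None else _seen
    if id(resources) in seen:
        return []  # guard against resource dictionaries that reference each other
    seen.add(id(resources))

    found = [(scope, name) for name in resources.get("Font", {})]
    for xname, xobj in resources.get("XObject", {}).items():
        if xobj.get("Subtype") == "Form":
            found += collect_fonts(xobj.get("Resources", {}), scope + (xname,), seen)
    return found


# A page shaped like the one described above: no fonts directly on the page,
# one form XObject that brings its own font.
page_resources = {
    "XObject": {
        "Xf1": {"Subtype": "Form", "Resources": {"Font": {"F1": {"Subtype": "Type1"}}}},
    },
}
```

Font names are kept together with their scope because the same name may refer to different fonts in different resource dictionaries.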
I would, therefore, propose you post this issue on the PDFKitten issue management site.
By the way, this PDF construct fully conforms to the PDF spec. Nonetheless, it looks like an inappropriate use of the iText library. The author of the software using iText like that should review their code and start using better suited classes of the iText library.

What to use instead of QuickLook.framework to handle HTML?

I was using
QuickLook.framework
to show a pdf file in the most simple way but now I need to display an HTML document instead but what it displays is plain unreadable text that starts with
%PDF-1.3 %Äåòåë§ó ÐÄÆ 4 0 obj << /Length 5 0 R /Filter /FlateDecode >> stream
So QuickLook obviously isn't a good fit for displaying HTML.
What can I use instead that works similarly?
Or can I adapt QuickLook to use it for HTML?
On iOS, UIWebView can display PDF files without problems. I have done this on numerous occasions and it is very simple; you get a lot for free (scrolling, zooming, etc.).
QuickLook cannot display HTML content.