How to debug a corrupt pdf file? [duplicate] - pdf

This question already has answers here:
How do you debug PDF files?
(8 answers)
Closed 5 years ago.
I'm generating PDF files using a Ruby library called "prawn". I have one particular file that Adobe Reader seems to consider "corrupt". It displays fine in both Preview and Adobe Reader, but Adobe Reader gives errors like:
Sometimes I get:
"Could not find the XObject named '%s'."
Other times I get:
"Could not find the XObject named 'Im4'."
Then always I get:
"An error exists on this page. Acrobat may not display the page
correctly. Please contact the person who created the PDF document to
correct the problem."
Is there a way to open a PDF with some tool and have it tell you what is technically wrong with the file? I'm sure I could figure it out quickly with something like this...
thanks
Joel

A PDF is a dump of PDF objects, so it sounds like objects are missing or the references pointing to them are wrong. You can open a PDF in a text editor and see the cross-reference table, and you can see the PDF objects in Acrobat (I wrote a blog article on this at http://pdf.jpedal.org/java-pdf-blog/bid/10479/Viewing-PDF-objects).
Your best bet might be to take an open-source tool like iText, which can read PDFs, and add some debugging code to get it to show the object structures.
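As a rough illustration of the "missing object" idea above, here is a minimal Python sketch (Python is my choice, and `find_dangling_refs` is a hypothetical helper, not part of any library). It scans the raw bytes for object definitions (`N G obj`) and indirect references (`N G R`) and reports references that have no matching definition. It assumes the objects are written as plain text, so it will not see anything stored in compressed object streams, and binary stream data can occasionally produce false matches.

```python
import re

# "12 0 obj" starts an indirect object definition.
OBJ_DEF = re.compile(rb"(?<!\d)(\d+)\s+(\d+)\s+obj\b")
# "12 0 R" is an indirect reference to object 12, generation 0.
OBJ_REF = re.compile(rb"(?<!\d)(\d+)\s+(\d+)\s+R\b")


def find_dangling_refs(pdf_bytes: bytes) -> set[tuple[int, int]]:
    """Return (object number, generation) pairs that are referenced but never defined."""
    defined = {(int(n), int(g)) for n, g in OBJ_DEF.findall(pdf_bytes)}
    referenced = {(int(n), int(g)) for n, g in OBJ_REF.findall(pdf_bytes)}
    return referenced - defined
```

To try it on a file, read it in binary mode, e.g. `find_dangling_refs(open("broken.pdf", "rb").read())` (the file name is a placeholder). An XObject such as `/Im4` pointing at an object number that shows up in the result would be a good first suspect.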

The general post on debugging PDFs, "How do you debug PDF files?", might also be helpful, as tools such as RUPS and pdfstreamdump are mentioned there.

Related

pdfbox embedding subset font for annotations - part 2

I am creating a separate question, stemming from this one. The code used is almost the same. The reason is that the original problem was about subsetting a font with PDFBox, which I more or less dealt with. However, I then ran into another problem: the annotations, and how the fonts used in them are interpreted, particularly by Acrobat Reader DC.
I tried different combinations of fonts and embedding options and got rather desperate. I had a feeling that the way programs that interpret PDF files handle these things is non-standard. I think I read somewhere that the PDF format deliberately leaves the display of annotations unstandardized, to give interpreters the freedom to handle them in their own way, since the main purpose of annotations is interaction with the user. TL;DR: I cannot understand why Acrobat Reader DC doesn't like the annotations I have created and saved with PDFBox. I even opened a question on Adobe's friendly and helpful User Community forum. But, as I expected, someone suggested that I investigate this question with the PDFBox team instead.
Anything is possible, but rather than posting a question to the PDFBox mailing list (I could never get used to mailing lists or understand how to use them efficiently, by the way), I want to open a question here because I hope that it could help others understand the PDF format better.
I am basically rephrasing my question from Adobe's forums here. Here is an example (Google Drive link) with FreeText annotations (it seems to make no difference if I use Stamp annotations instead). It causes problems when opened by Adobe Acrobat Reader DC, (file) version 21.001.20149.37945 (I think this corresponds to the April 16th '21 update). Specifically, the problem happens when the Comments pane is opened by the user, either manually or automatically.
Manually:
link
Automatically:
link
While experimenting, I also tried to unset the "Use local fonts" option in Preferences -> Page Display. I had the impression that maybe Acrobat Reader will be more eager to show the error message once it is not allowed to substitute the erroneously embedded fonts with the possible local fonts. I am not sure if this is true.
The error that I get is the infamous "Cannot extract the embedded font XXXXXX+SomeFontName" as seen in the below picture:
link
The same problems happen also if I use full font embed (subsetting option set to false when using PDType0Font.load). I also tried to embed OpenSans font instead of LiberationSans, also tried to manually convert LiberationSans to a TTF font with fewer glyphs using FontForge, even tried to use Windows ARIALN.TTF, thinking that maybe the font is the problem. All cause the same behavior in Acrobat Reader DC. I have also tried to run Acrobat Reader 2019 Pro Preflight on the document and in the profile that scans the document for the possible font inconsistencies, it reports no errors.
Of course, when I use e.g. PDType1Font.HELVETICA instead of custom TTF font, I do not get the above errors. But I cannot use it because it does not contain the glyphs for the Unicode characters that I use. Does anybody have a better idea?
Thank you very much!
EDIT: to make myself clear, the error does not ALWAYS appear. It appears constantly on some machines (e.g. I can reproduce it fairly reliably on Windows 7 64-bit with the latest Acrobat Reader DC installed), while on my Windows 10 64-bit machine with the same version of Acrobat Reader DC it sometimes appears and sometimes does not. I haven't figured out why or in which cases. That made me wonder about the font, but no, I checked that too: the font I am using opens up all right on the machine where the problem is fairly constant.
UPDATE: At my wits' end again, I created a blank page with Apache OpenOffice, exported it to PDF, opened it with Acrobat Reader DC (latest version), added a FreeTextTypewriter annotation (View -> Tools -> Comment -> Open) with 4 Greek letters in the ArialNarrow font, saved it, and reopened it with Acrobat Reader DC. It gives me the same error (cannot extract the embedded font...). So could this be a Reader problem? They have made this very difficult to diagnose. Here is the file, but I do not expect it to show errors on other machines. It's one of those moments when you start to believe in magic and the power of prayer (and a good sleep).
UPDATE 30/04/2021
So, to sum things up, I haven't come up with a solution yet, but I have produced three files, created with PDFBox, OpenPDF (an iText5 fork) and Acrobat Reader DC itself (which can append annotations and save; I just added a simple Text box with Greek text through the Comment pane). They all trigger the above error message when opened by Acrobat Reader DC. I have posted details in the Acrobat Reader forum here (same link as in the comment).
I have added the code that I used to create the OpenPDF example file here, and the 3 example files are in the same repository here.

Putting an iframe overlaid on a pdf document in a browser extension

I created a browser extension that lets you look up words in Wikipedia or Wiktionary without needing to open a new tab ( https://addons.mozilla.org/en-US/firefox/addon/in-page-lookup/ , almost done porting to Chrome). It is very useful when you are doing research and come across a word you don't know or want to know more about. The only thing is, a lot of research content is in PDF format. A long time ago (around 2013) I had an older version of the extension based on the old Firefox add-on framework, and that did let iframes show up over PDF documents, but this has not been the case for many years. I don't think the extension is even recognized in PDF documents: I get "Error: Could not establish connection. Receiving end does not exist" and there is no extension content script on the PDF page. So, my question is, is it possible to put an iframe over a PDF document? Do I need to work on the background side, and if so, how? Thanks.

PDF downloading instead of opening in new tab

This is not a back-end programming question. I can only modify the markup or script (or the document itself). The reason I'm asking here is that all my searches for appropriate terms inevitably lead to questions and solutions about programming this functionality. I'm not trying to force it via programming; I have to find out why this PDF is behaving differently.
So:
I have a bunch of links to PDFs on a page. Most of them open in new tabs, but one of them, the most recent, starts to open in a tab, but then the tab closes and the PDF gets downloaded as a file instead. All markup is consistent; there's nothing different about the odd one out except the actual URL.
You can see this here:
http://calwater.mwnewsroom.com/Investor-Relations/Financial-Reports/Annual-Reports
All annual reports up to 2012 open in a new tab, but 2013 downloads instead.
This leads me to believe that there is some meta-data property of the PDF itself that tells it how to open, and that, in this case, the 2013 PDF was created using different settings.
Apparently, the PDF was saved out to PDF from InDesign.
Does anyone have any insight?
Problem solved. There was simply an error in the string (like an extra period) that references the attachment such that it couldn't tell it was a PDF. Fixing the reference fixed the problem.
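For anyone hitting something similar, one way to compare the odd link with the working ones is to look at what the server reports for each URL. The sketch below is an assumption about typical browser behavior rather than a description of this particular site: browsers generally show a PDF in a tab when the response is served as `application/pdf` and is not marked as an attachment. Both function names are hypothetical helpers.

```python
from urllib.request import Request, urlopen


def fetch_headers(url: str) -> dict[str, str]:
    """Issue a HEAD request and return the response headers with lowercased names."""
    with urlopen(Request(url, method="HEAD")) as resp:
        return {name.lower(): value for name, value in resp.headers.items()}


def likely_displays_inline(headers: dict[str, str]) -> bool:
    """Heuristic: served as application/pdf and not flagged as an attachment."""
    content_type = headers.get("content-type", "").split(";")[0].strip().lower()
    disposition = headers.get("content-disposition", "").strip().lower()
    return content_type == "application/pdf" and not disposition.startswith("attachment")
```

Running `likely_displays_inline(fetch_headers(url))` for a working link and for the misbehaving one should show whether the server is identifying the two files differently.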

Where does Preview store PDF annotations on OS X Lion?

I'm working on a tool in Python to extract highlighted passages from PDF files. I regularly highlight PDFs in Preview on OS X Lion but haven't found a good tool to extract these passages. Other apps exist that do allow you to highlight and export such as Skim but I figure there has to be a way to extract the ones I add in Preview.
I figured that the highlights would be stored in the HFS+ extended attributes for the PDF file but after looking at them using xattr it seems that they're stored elsewhere. I also looked at PDFKit but I only saw how to create annotations rather than locate them.
If someone could tell me where to find the highlights/annotations or point me at some documentation that explains this I would really appreciate it.
When using PDFKit you can get the annotations from any PDFPage instance.
[myPDFPage annotations] will return an array of annotations for that particular page.
See the docs for more info.
Technically speaking, highlighting parts of a PDF is adding an annotation to the file. These annotations are PDF objects defined in the PDF specification. They are stored inside the PDF file itself, i.e. they do modify the original file! That's why you'll not find a trace of the highlights in the HFS+ extended attributes...
So the answer to the question of your title line is: Preview stores the highlights inside the PDF file as fully compliant PDF objects.
The answer to your real question implied in your text ('I want to extract the highlighted passages') was well answered by sosborn.
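Since the question mentions a Python tool, here is a minimal sketch of how one might confirm that the highlights live inside the file. Per the PDF specification, a highlight is an annotation with `/Subtype /Highlight`, and its `/QuadPoints` array gives the corners of the highlighted regions. The helper name `highlight_quadpoints` is hypothetical, and the sketch assumes the annotation objects are stored uncompressed; if Preview writes them into compressed object streams, this raw scan will find nothing and a real PDF library would be needed. Mapping the coordinates back to the underlying text is a separate step not shown here.

```python
import re

HIGHLIGHT = re.compile(rb"/Subtype\s*/Highlight\b")
QUADPOINTS = re.compile(rb"/QuadPoints\s*\[([^\]]*)\]")


def highlight_quadpoints(pdf_bytes: bytes) -> list[list[float]]:
    """Return the /QuadPoints numbers of each uncompressed highlight annotation found."""
    results = []
    # Split on "endobj" so that each chunk holds roughly one indirect object.
    for chunk in pdf_bytes.split(b"endobj"):
        if HIGHLIGHT.search(chunk):
            match = QUADPOINTS.search(chunk)
            numbers = match.group(1).decode("ascii").split() if match else []
            results.append([float(x) for x in numbers])
    return results
```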

Adobe Reader Error Codes

I am programmatically creating PDFs, and a recent change to my generator is creating documents that crash both Mac Preview and Adobe Reader on my Mac. Before Adobe Reader crashes, it reports:
There was an error processing a page.
There was a problem reading this document (18).
I suspect that that "18" might give me some information on what is wrong with the PDF I've created. Is there a document explaining the meaning of these status codes?
Hold down the Ctrl key while pressing OK and you should be able to load past this point in the document and possibly get more details.
What tool are you using to create the PDF (Aspose)?
I wasn't able to locate any info on the Adobe error code, so I ended up installing xpdf via Darwinports. Loading my PDF with xpdf spit out much more useful error information and I was able to track down the problem. (I was creating a circular reference in a form when I copied content from one document to another.)