Removing screen reader data in PDF so Microsoft Edge can open it? - pdf

I have run into a Microsoft Edge bug that has been around for a long time and doesn't seem to get any attention: Microsoft Edge doesn’t open some PDF files if they have data for screen readers.
I have an application that generates a PDF, which is then printed. To support Microsoft Edge and work around the bug, I am thinking of opening the PDF with PDFBox and stripping out any data that gives Edge trouble. However, the issue is slim on details, and I can't find any info on what specifically triggers the problem for Edge. Does anyone have experience with this and can suggest what specifically I should be stripping out to make a PDF open in Edge?
[Edit]: Just to add, if I download the PDF and open it in Edge, it still won't open, even though the same local PDF opens fine in Chrome, IE11, and Firefox.

The file has some weirdness: if you open it with Notepad++, it will show that there is some data after %%EOF. Anyway, try this code, which removes some unneeded stuff.
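// Load the source PDF.
PDDocument doc = PDDocument.load(new File("myfile.pdf"));
PDDocumentCatalog cat = doc.getDocumentCatalog();
// Drop catalog entries that only affect how a viewer presents the document.
cat.getCOSObject().removeItem(COSName.PAGE_MODE);
cat.getCOSObject().removeItem(COSName.VIEWER_PREFERENCES);
// The root node of the page tree should not have a /Parent entry.
PDPageTree pageTree = cat.getPages();
pageTree.getCOSObject().removeItem(COSName.PARENT);
doc.save("myfile2.pdf");
doc.close();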
It is possible that none of the three "removeItem" calls are needed, or only some of them; I can't test it myself.
If it still doesn't work, please ping me again and I'll try another idea (setting the mediabox at the page level).
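The stray data after %%EOF mentioned above can also be checked programmatically instead of in a text editor. The following is a minimal sketch using only the Java standard library; the class and method names are made up for illustration, and it is not established here that trailing data is what actually upsets Edge.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

/** Hypothetical helper: reports how much data follows the last %%EOF marker. */
final class PdfTrailingData {
    private static final byte[] EOF_MARKER =
            "%%EOF".getBytes(StandardCharsets.US_ASCII);

    /** Returns the offset of the last %%EOF marker, or -1 if there is none. */
    static int lastEofOffset(byte[] pdf) {
        for (int i = pdf.length - EOF_MARKER.length; i >= 0; i--) {
            boolean match = true;
            for (int j = 0; j < EOF_MARKER.length; j++) {
                if (pdf[i + j] != EOF_MARKER[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                return i;
            }
        }
        return -1;
    }

    /** Counts non-whitespace bytes after the last %%EOF; -1 if no marker exists. */
    static int trailingBytes(byte[] pdf) {
        int offset = lastEofOffset(pdf);
        if (offset < 0) {
            return -1;
        }
        int count = 0;
        for (int i = offset + EOF_MARKER.length; i < pdf.length; i++) {
            byte b = pdf[i];
            // A line break or blank after the marker is normal and not counted.
            if (b != '\r' && b != '\n' && b != ' ' && b != '\t') {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.out.println("usage: PdfTrailingData <file.pdf>");
            return;
        }
        byte[] pdf = Files.readAllBytes(Paths.get(args[0]));
        System.out.println("Non-whitespace bytes after last %%EOF: " + trailingBytes(pdf));
    }
}
```

A result greater than 0 means something other than whitespace follows the final marker, which would match what Notepad++ shows for the file in question.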

Related

pdfbox embedding subset font for annotations - part 2

I am creating a separate question, stemming from this one. The code used is almost the same. The reason is that the original problem was about subsetting a font with PDFBox, which I have more or less dealt with. However, I have now run into another problem: annotations, and how the fonts used in them are interpreted by Acrobat Reader DC in particular.
I tried different combinations of fonts and embedding options and got rather desperate. I had a feeling that the way these things are handled by the programs that interpret PDF files is non-standard. I think I read somewhere that the PDF format deliberately leaves the display of annotations non-standardized, to give interpreters the freedom to handle them in their own way, since the main purpose of annotations is interaction with the user. TL;DR: I cannot understand why Acrobat Reader DC doesn't like the annotations I have created and saved with PDFBox. I even opened a question on Adobe's friendly and helpful User Community forum. But, as I expected, someone suggested that I investigate this question with the PDFBox team instead.
Anything is possible, but rather than writing to the PDFBox mailing list (I could never get used to mailing lists or understand how to use them efficiently, by the way), I want to open a question here because I hope that it could help others understand the PDF format better.
I am basically rephrasing the question from the Adobe forums here. Here is an example (Google Drive link) with FreeText annotations (it seems to make no difference if I use Stamp annotations instead). It causes problems when opened by Adobe Acrobat Reader DC (file) version 21.001.20149.37945 (I think this corresponds to the April 16th '21 update). Specifically, the problem happens when the Comments pane is opened by the user, either manually or automatically.
Manually:
link
Automatically:
link
While experimenting, I also tried unsetting the "Use local fonts" option in Preferences -> Page Display. I had the impression that Acrobat Reader might be more eager to show the error message once it is not allowed to substitute possible local fonts for the erroneously embedded ones. I am not sure if this is true.
The error that I get is the infamous "Cannot extract the embedded font XXXXXX+SomeFontName", as seen in the picture below:
link
The same problems also happen if I use full font embedding (subsetting option set to false when using PDType0Font.load). I also tried embedding the OpenSans font instead of LiberationSans, tried manually converting LiberationSans to a TTF font with fewer glyphs using FontForge, and even tried Windows ARIALN.TTF, thinking that maybe the font was the problem. All cause the same behavior in Acrobat Reader DC. I have also run Acrobat Reader 2019 Pro Preflight on the document, and the profile that scans the document for possible font inconsistencies reports no errors.
Of course, when I use e.g. PDType1Font.HELVETICA instead of a custom TTF font, I do not get the above errors. But I cannot use it because it does not contain the glyphs for the Unicode characters that I use. Does anybody have a better idea?
Thank you very much!
EDIT: To be clear, the error does not ALWAYS appear. It appears constantly on some machines (e.g. I am using Windows 7 64-bit with the latest Acrobat Reader DC installed to reproduce it fairly reliably), while on my Windows 10 64-bit machine with the same version of Acrobat Reader DC it sometimes appears and sometimes does not. I haven't figured out why or in what cases. That made me suspect the font, but no, I checked that too: the font I am using opens fine on the machine where the problem is fairly constant.
UPDATE: At my wits' end again, I created a blank page with Apache OpenOffice, exported it to PDF, opened it with Acrobat Reader DC (latest version), added a FreeTextTypewriter annotation (View -> Tools -> Comment -> Open) with 4 Greek letters in the ArialNarrow font, saved it, and reopened it with Acrobat Reader DC, and it gives me the same error (cannot extract the embedded font...). So could this be a Reader problem? But they made this so difficult to diagnose. Here is the file, but I do not expect it to show errors on other machines. It's one of those moments when you start to believe in magic and the power of prayer (and a good sleep).
UPDATE 30/04/2021
So, to sum things up, I haven't come up with a solution yet, but I have three files, created with PDFBox, OpenPDF (an iText5 fork), and Acrobat Reader DC itself (which can append annotations and save; I just added a simple text box with Greek text through the Comment pane), and they all trigger the above error message when opened by Acrobat Reader DC. I have posted details in the Acrobat Reader forum here (same link as in the comment).
I have added the code that I used to create the OpenPDF example file here, and the three example files are in the same repository here.

Putting an iframe overlaid on a pdf document in a browser extension

I created a browser extension that lets you look up words in Wikipedia or Wiktionary without needing to open a new tab ( https://addons.mozilla.org/en-US/firefox/addon/in-page-lookup/ , almost done porting to Chrome). It is very useful when you are doing research and come across a word you don't know or want to know more about. The only thing is, a lot of research content is in PDF format. A long time ago (around 2013) I had an older version of the extension based on the old Firefox add-on framework, and that did let iframes show up over PDF documents, but this has not been the case for many years. I don't think the extension is even recognized in PDF documents: I get "Error: Could not establish connection. Receiving end does not exist" and there is no extension content script on the PDF page. So, my question is, is it possible to put an iframe over a PDF document? Do I need to work on the background side, and if so, how? Thanks.

Edit texts in a PDF on Chrome using Chrome inspect

Is there any way to modify text in a PDF on Chrome using the Chrome inspect tool? I am stuck because, with the Chrome inspect tool, I am able to modify text on other websites and even in PowerPoint presentations opened in Chrome, but with PDFs I cannot. Does anyone know how to do it?
Edit: Yes, I know that changes made through Chrome DevTools are temporary, but usually I'm able to make those changes, even if they're temporary. With PDFs I can't.
There are differences in the way some browsers handle PDF data.
Chromium-based browsers are more traditional in that the PDF plug-in is based on a Foxit/Skia collaboration. So you need to understand that, in that case, the downloaded PDF you are viewing is the binary application/pdf file, already outside of the HTML wrapper.
Just as you cannot edit the PDF text in Acrobat Reader, the most you can do is incrementally add comments/annotations or field data to the end of the file before saving it as a secondary download. The server cannot see your changes unless you submit the file as an upload.
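Because incremental saves append a new section to the end of the file, and each such section ends with its own %%EOF marker, counting the markers gives a rough hint of whether a PDF has been saved this way. The sketch below uses only the Java standard library; the class and method names are hypothetical, and the count is only a heuristic, since linearized ("fast web view") PDFs also contain more than one marker.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: counts %%EOF markers as a rough revision hint. */
final class PdfEofCounter {
    /** Returns the number of non-overlapping %%EOF markers in the given bytes. */
    static int countEofMarkers(byte[] pdf) {
        byte[] marker = "%%EOF".getBytes(StandardCharsets.US_ASCII);
        int count = 0;
        int i = 0;
        while (i <= pdf.length - marker.length) {
            boolean match = true;
            for (int j = 0; j < marker.length; j++) {
                if (pdf[i + j] != marker[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                count++;
                i += marker.length; // skip past the marker just found
            } else {
                i++;
            }
        }
        return count;
    }
}
```

A count above 1 suggests, but does not prove, that annotations or field data were appended after the original file was written.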
With Firefox and Google Docs there is often a different approach, where the PDF is "Repr"oduced as an "Ex"ample (a ReprEx of the PDF), so it is built as a hybrid of image and text overlay that emulates that part of the real PDF source. When you save the underlying downloaded PDF, whether before or after viewing it, the saved file would not necessarily include any browser-based HTML editing.
There are other techniques for other cases, but to answer the OP's basic question most simply, the answer is NO: you cannot change a PDF body, only add notes etc. via extensions. Microsoft's variant of Chrome, i.e. Edge, has some built-in annotation ability and thus does not need a separate extension.
I found this question because I was googling a similar situation: I wanted to manipulate type sizes and margins on a PDF in the inspector via Chrome. I found that Firefox DevTools will allow you to view those styles and even alter the content of the PDF while in the browser. I am late to the game but hope this provides answers for someone else in the future.

Error when opening PhantomJS generated PDF in Adobe Acrobat Reader DC

I am generating a PDF using PhantomJS, and it opens fine with the Mac's built-in Preview, Google Docs, and a few other tools that I tested it on. However, when I open it using Adobe Acrobat Reader DC version 15.010.20056, I receive one of the most unhelpful messages of all time.
After this, my PDF is only partially generated. This happens both on PCs and Macs. I have no idea how to debug this or even start to figure out the cause.
In case anyone is wondering, PhantomJS doesn't render Tiling Patterns properly: it sets both the X and Y offsets to 0, which does not conform to the PDF specification. This is one of the many reasons that a PDF generated by PhantomJS renders differently depending on which viewer you open it with.

Exported PDFs from Mathematica 8 won't print

UPDATE: I wrote to Wolfram support about this and will update the post if they can resolve the problem. Sorry for spamming SO with a technical support question, but here it remains in case anyone else is having the same issue.
Is anyone else having this problem with Mathematica 8? I recently upgraded and noticed that when I export Graphics to a PDF file, although the file appears fine on my computer, it prints as a blank page. For example, try
Rectangle[{1,1}]//
Graphics//
Export["~/test.pdf",#]&
which creates a PDF file containing a black square. This file opens fine, but if I send it to my department printer I just get a blank page. If I don't export the graphics but print the notebook from MM, no problem, the graphics print as expected. If I use MM 7 to do exactly the same thing, the PDF file prints as expected. Exporting to PNG in MM8 seems to work fine. And, using the context menu Save Graphics As ... or File > Save Selection As ... to create a PDF containing just the graphic also works. However, these graphics eventually get included in a TeX document, and it would be far better if I could continue using the script I've got that doesn't require any button clicking to generate them.
I'm running MM 8.0.0.0 on Mac OS 10.6.7. I have not been able to test this on another printer yet, but this printer has never given me problems before and prints other PDF documents fine. Any ideas why this is happening?
Wolfram Research responds:
...
This issue has been reported by other users as well and our developers are currently looking into it. I have added your details to the report so you can be notified when this is resolved.
In the meantime, the alternatives that you could try are:
Try a different printer.
Rasterize the image with the function 'Rasterize' before exporting. If the rasterized image loses some resolution, you could use the option 'ImageResolution' to edit this.
Rasterize[image, ImageResolution -> xxx]
Surely this is a bug (please report it to support@wolfram.com), but you can work around the problem by selecting the graphic and choosing File > Save Selection As... from the menu (or Save Graphic As... from the contextual menu). This produces a slightly different file that doesn't appear to exhibit the undesirable behavior we observe from Export[].
These problematic files, and LaTeX PDFs that include them, can be properly printed by Adobe Reader 10.1.2. That's if you're okay with installing and using a 450MB PDF reader.
I reproduced the problem (leading me to this question) with Mathematica 8.0.4.0 on Mac OS X 10.7.2. Wolfram suggested lame workarounds like Rasterize and told me
This issue has been addressed by our developers, and a fix will be included in a future version of Mathematica.