pdfbox embedding subset font for annotations - part 2 - pdf

I am creating a separate question that stems from this one; the code used is almost the same. The original problem was about subsetting a font with PDFBox, which I more or less dealt with. I then ran into another problem: annotations, and how the fonts used in them are interpreted, particularly by Acrobat Reader DC.
I tried different combinations of fonts and embedding options and got rather desperate. I have a feeling that the way the programs that interpret PDF files handle these things is non-standard. I think I read somewhere that the PDF format deliberately leaves the display of annotations non-standardized, to give interpreters freedom to handle them in their own way, since the main purpose of annotations is interaction with the user. TL;DR: I cannot understand why Acrobat Reader DC doesn't like the annotations I have created and saved with PDFBox. I even opened a question on Adobe's friendly and helpful User Community forum. But, as I expected, someone suggested that I investigate this with the PDFBox team instead.
Anything is possible, but rather than writing to the PDFBox mailing list (I could never get used to mailing lists or work out how to use them efficiently, btw), I want to open a question here, because I hope it could help others understand the PDF format better.
I am basically rephrasing the question from the Adobe forums here. Here is an example (Google Drive link) with FreeText annotations (it seems to make no difference if I use Stamp annotations instead). It causes problems when opened by Adobe Acrobat Reader DC (file) version 21.001.20149.37945 (I think this corresponds to the April 16th '21 update). Specifically, the problem happens when the Comments pane is opened, either manually by the user or automatically.
Manually:
link
Automatically:
link
While experimenting, I also tried to unset the "Use local fonts" option in Preferences -> Page Display. I had the impression that maybe Acrobat Reader will be more eager to show the error message once it is not allowed to substitute the erroneously embedded fonts with the possible local fonts. I am not sure if this is true.
The error that I get is the infamous "Cannot extract the embedded font XXXXXX+SomeFontName" as seen in the below picture:
link
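For readers unfamiliar with the name in that message: the six capital letters and the plus sign in front of the font name are the subset tag that the PDF format prescribes for subsetted fonts, so a name without it normally means the font is not a subset. A minimal sketch, assuming you already have the font name as a string (the function name and the sample names are made up for illustration):

```python
import re

# A subset tag is exactly six uppercase letters followed by "+",
# e.g. "ABCDEF+LiberationSans". Names without it are not subset-tagged.
SUBSET_TAG = re.compile(r"^([A-Z]{6})\+(.+)$")

def split_subset_tag(font_name):
    """Return (tag, base_name); tag is None when there is no subset prefix."""
    match = SUBSET_TAG.match(font_name)
    if match is None:
        return None, font_name
    return match.group(1), match.group(2)

# Hypothetical names for illustration only.
print(split_subset_tag("ABCDEF+LiberationSans"))
print(split_subset_tag("Helvetica"))
```

This only inspects the name; it says nothing about whether the embedded font program itself is valid.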
The same problems happen also if I use full font embed (subsetting option set to false when using PDType0Font.load). I also tried to embed OpenSans font instead of LiberationSans, also tried to manually convert LiberationSans to a TTF font with fewer glyphs using FontForge, even tried to use Windows ARIALN.TTF, thinking that maybe the font is the problem. All cause the same behavior in Acrobat Reader DC. I have also tried to run Acrobat Reader 2019 Pro Preflight on the document and in the profile that scans the document for the possible font inconsistencies, it reports no errors.
Of course, when I use e.g. PDType1Font.HELVETICA instead of custom TTF font, I do not get the above errors. But I cannot use it because it does not contain the glyphs for the Unicode characters that I use. Does anybody have a better idea?
Thank you very much!
EDIT: to make myself clear, the error does not ALWAYS appear. It appears constantly on some machines (e.g. I can reproduce it fairly reliably on Windows 7 64-bit with the latest Acrobat Reader DC installed), while on my Windows 10 64-bit machine with the same version of Acrobat Reader DC it sometimes appears and sometimes does not; I haven't figured out why or in which cases. That made me suspect the font itself, but no: I checked that too, and the font I am using opens fine on the machine where the problem is fairly constant.
UPDATE: at my wits' end again, I created a blank page with Apache OpenOffice, exported it to PDF, opened it with Acrobat Reader DC (latest version), added a FreeTextTypewriter annotation (View -> Tools -> Comment -> Open) with 4 Greek letters in the ArialNarrow font, saved it, and reopened it with Acrobat Reader DC. It gives me the same error (cannot extract the embedded font...). So this could be a Reader problem? But they made this so difficult to diagnose. Here is the file, but I do not expect it to show errors on other machines. It's one of those moments when you start to believe in magic and the power of prayer (and a good sleep).
UPDATE 30/04/2021
So, to sum things up: I haven't come up with a solution yet, but I have produced three files, created with PDFBox, OpenPDF (iText5 fork) and Acrobat Reader DC itself (it can append annotations and save; I just added a simple text box with Greek text through the Comment pane), and they all trigger the above error message when opened by Acrobat Reader DC. I have posted details in the Acrobat Reader forum here (same link as in the comment).
I have added the code that I used to create the OpenPDF example file here, and the three example files are in the same repository here.

Related

PDF downloading instead of opening in new tab

This is not a back-end programming question. I can only modify the markup or script (or the document itself). The reason I'm asking here is that all my searches for appropriate terms inevitably lead to questions and solutions about programming this functionality. I'm not trying to force it via programming; I have to find out why this PDF is behaving differently.
So:
I have a bunch of links to PDFs on a page. Most of them open in new tabs, but one of them, the most recent, starts to open in a tab, but then the tab closes and the PDF gets downloaded as a file instead. All markup is consistent; there's nothing different about the odd man out except the actual URL.
You can see this here:
http://calwater.mwnewsroom.com/Investor-Relations/Financial-Reports/Annual-Reports
All annual reports up to 2012 open in a new tab, but 2013 downloads instead.
This leads me to believe that there is some meta-data property of the PDF itself that tells it how to open, and that, in this case, the 2013 PDF was created using different settings.
Apparently, the PDF was saved out to PDF from InDesign.
Does anyone have any insight?
Problem solved. There was simply an error in the string (like an extra period) that references the attachment such that it couldn't tell it was a PDF. Fixing the reference fixed the problem.
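To illustrate how a stray period can stop something from recognizing a file as a PDF, here is a minimal sketch. It assumes the type is inferred from the file extension in the URL path, which the answer above does not confirm; the URLs and the function name are hypothetical.

```python
from posixpath import splitext
from urllib.parse import urlparse

def looks_like_pdf(url):
    """True when the URL path ends in a ".pdf" extension (case-insensitive)."""
    path = urlparse(url).path
    _, ext = splitext(path)
    return ext.lower() == ".pdf"

# Hypothetical URLs for illustration; the second has a stray trailing period.
good = "http://example.com/reports/annual-2012.pdf"
bad = "http://example.com/reports/annual-2013.pdf."

print(looks_like_pdf(good), looks_like_pdf(bad))
```

With the trailing period, the extension seen by this check is "." rather than ".pdf", so the reference no longer looks like a PDF.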

How to troubleshoot badly rendered PDF file

I have a small PDF file, which is supposed to display just the string "Hello World!".
Unfortunately, it displays black boxes instead of the characters. I suppose there is some problem with the fonts, but I am not sure.
Is there a way to diagnose and troubleshoot this issue? All I see on the Internet is advice to do this or that, which helps some people and not others (nothing helped me). It looks like shooting in the dark to me.
Here is a concrete example. Why does this PDF display black squares instead of the string Hello World ?
EDIT
A bit of the context. I am trying to convert a trivial HTML to PDF using the wkhtmltopdf tool. It is an absolute frustration, because according to the Internet searches the tool is supposed to work and do it quite well. But the thing does not work for me and nothing I do changes this! Unfortunately, this tool seems the only free tool to convert HTML to PDF. This is a huge bummer.
If you want to find out whether a PDF is valid or what is wrong with it, there are a few general steps you can take:
1) Open it in Adobe Acrobat or Adobe Reader (on a desktop platform, not a tablet device). For a very long time the PDF format was owned by Acrobat and the way their software handles PDF is still close to the gold standard. However, there is a caveat with this; Acrobat is very, very smart in the way it handles PDF files and it will overlook or actively correct a number of mistakes other PDF engines might have a problem with...
2) Get yourself a preflight tool. These tools were invented for use in graphic arts, but have applications outside of it too. Popular examples are callas pdfToolbox (warning, I'm affiliated with this vendor!) or the "Preflight" plug-in you'll find in Adobe Acrobat Pro (which is actually also callas technology under the hood). Then preflight specifically against the PDF/A-1b or PDF/A-2b standard.
That last point deserves some more explanation. You should pick a PDF/A compliant preflight profile because the PDF/A (or PDF for Archival) standard is extremely picky. Its goal is to make sure that PDF files will still be readable in exactly the same way 50 years from now, and to ensure that, it tests a whole range of properties of the file itself and the different components in it. You might be able to ignore some of the errors you get (because some of them will be connected to the fact that the PDF/A identification isn't correct, for example), but I wouldn't ignore any other errors unless you understand exactly what they mean and why they aren't relevant.
PS: Can you make your test file available some other way? The file you shared in your question is useless I think. When I do "Download" I get a PDF file that doesn't contain text and doesn't have fonts in it. Those rectangles you see are exactly that - rectangles. So this PDF renders fine - it's the PDF generation process (or the fact that you stored the file on Google docs - I really have no clue what that might do) that went berserk apparently.
In addition to David's hints (first using a known good viewer and then some preflight tool), there is a third level in the inspection process:
3) Inspect the PDF with your own eyes and with the PDF specification (made available by Adobe here) at hand in a text viewer (for a first impression) and (if the cause of the issue at hand is not immediately visible) then in a PDF browsing tool (for in-depth analysis).
This step is quite cumbersome at first but after some time you learn your way around in the PDFs.
An example of such a PDF browser tool is RUPS, but there are others around, too.
'Small PDF file supposed to display "Hello World!"'
Not correct. The file you linked to does not contain any code that could render pixels on screen or on paper that a human brain would read as "Hello World!". The file indeed does only contain vector drawing operations which result in 12 black boxes.
The command line tool pdffonts does not indicate any font being used in the file:
pdffonts so-file-#15858199.pdf
What could still cause the "rendering" of the words you are looking for: some vector or pixel drawing code contained in the PDF. To find out about this, you'll have to look into the low level source code of the PDF.
The original file is 1,570 bytes, so this task does not look overly large.
'Is there a way to diagnose and troubleshoot this issue?'
Using qpdf, a "command-line program that does structural, content-preserving transformations on PDF files", you can expand all contained streams (which are normally compressed):
qpdf --qdf --object-streams=disable so-file-#15858199.pdf qdf-#15858199.pdf
The resulting file, qdf-#15858199.pdf, is 3,875 bytes. Now open it in a text editor. PDF object no. 6 (lines 66-219) contains the contents of the page. Lines 123-194 contain only the operators m (moveto), l (lineto) and h (closepath). These lines contain 12 different groups of drawing commands, where each one represents the path for one of the 12 black boxes you see rendered on screen or printed on paper:
102.400001 12.8000001 m
268.800004 12.8000001 l
268.800004 179.200002 l
102.400001 179.200002 l
102.400001 12.8000001 l
h
Line 196 contains
f
which is the fill operator; it actually fills the (closed) path constructed so far with black color. Nothing in the other lines (which I didn't analyze in detail) does any drawing that may resemble the shapes of any glyphs.
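To make the reading of those path operators concrete, here is a rough sketch that collects the subpaths from such a content-stream excerpt and computes their bounding boxes. It is not a PDF parser: it only understands m, l and h on whitespace-separated tokens, ignores every other operator, and the function names are my own.

```python
def parse_subpaths(content):
    """Collect closed subpaths built from m (moveto), l (lineto), h (closepath)."""
    subpaths, current, operands = [], [], []
    for token in content.split():
        if token in ("m", "l"):
            point = (operands[-2], operands[-1])
            if token == "m":
                current = [point]      # moveto starts a new subpath
            else:
                current.append(point)  # lineto extends the current one
            operands = []
        elif token == "h":
            if current:
                subpaths.append(current)
            current, operands = [], []
        else:
            try:
                operands.append(float(token))
            except ValueError:
                operands = []          # some other operator: ignored here
    return subpaths

def bounding_box(points):
    """Return (min_x, min_y, max_x, max_y) for a list of (x, y) points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)

# The group of drawing commands quoted above.
sample = """
102.400001 12.8000001 m
268.800004 12.8000001 l
268.800004 179.200002 l
102.400001 179.200002 l
102.400001 12.8000001 l
h
"""

boxes = [bounding_box(points) for points in parse_subpaths(sample)]
print(len(boxes), boxes[0])
```

Run over the whole expanded page content, a script along these lines should report one axis-aligned rectangle per group, which is consistent with boxes rather than glyph outlines.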
'Unfortunately, this tool seems the only free tool to convert HTML to PDF'
Not correct either.
1. Assuming your "free" is meant as free as in liberty, then an alternative option is HTMLDOC.
HTMLDOC does not support specific fonts which may be assigned to your HTML input via CSS, but it does a good job in converting one or multiple HTML documents into a single PDF book containing chapters, page-numbering, page headers and footers and more. For all options available, see its full documentation.
2. Assuming your "free" is meant as free as in beer, then an alternative option (for private usage only) could be PrinceXML.
PrinceXML does an extraordinarily good job when it comes to supporting almost all CSS features your HTML document may be using. See its documentation and also some of the sample PDF files produced by PrinceXML.

pdf see current line ruler

I'm looking for an accessibility tool to make it easier to read PDFs.
In short, it should be possible to easily see which line is being read (a bit like a ruler laid under the text), to avoid losing the line that is being read.
I was wondering if anyone knows of a solution for this, for example a plugin for Adobe Acrobat Reader, etc.
Any suggestions are welcome.
I don't think there is a plug-in for Acrobat Reader. You may want to look at ZoomText or ClaroRead. Of course, these only work if the PDF contains actual text, not images of text.
A low tech solution would be to open a Notepad doc and size it how you need. If you are on Win7 you could do this with sticky notes.
Another approach I've used is to convert the PDF to HTML and then run a server with it. This is fairly simple to accomplish using Live Server in VScode.
In the Chrome browser, we may then use accessibility extensions, such as ReadingBuddies, that have reading ruler functions.
Otherwise consider,
Use a PDF reader that has a built-in reading ruler feature, such as Adobe Acrobat Reader DC or Foxit Reader.
Use a PDF reader that allows you to add a reading ruler as an annotation, such as Xodo PDF Reader.
Use an online tool that allows you to view PDFs with a reading ruler, such as Smallpdf's PDF Reader.
Use a screen ruler tool, such as the one offered by How-To Geek, to measure the PDF on your screen.
The academic term is sometimes called RSVP (Rapid Serial Visual Presentation), there are patented hardware and software versions but in principle it is simply a translucent masking added to the viewport. see https://softwarerecs.stackexchange.com/questions/28582/is-there-an-equivalent-to-a-reading-guide-strip-for-windows-os-x-or-linux and http://www.see-n-read.com/products/esee-n-read-2/
Ten years later, it's 2023, so software such as browsers should include such features. Edge supports Immersive Reader on some sites, but not on StackOverflow! The above example uses an Edge extension: https://microsoftedge.microsoft.com/addons/detail/screen-mask/dfanfcmhbdocjfpmnoebccndgmhlincl Others are available for other browsers: https://chrome.google.com/webstore/detail/reading-ruler/phiedfcbjfjagnjikfbobmldbpmdcpfk
To get the Reader Mode options in Chrome or Edge, look at the available flags.
However, if you save the page as PDF and use Read Aloud, it is then used there!
Some PDF readers, like Skim on the Mac, include such an accessibility option.
However, the simplest option is:
Most PDF readers can be resized so that the viewport focuses on single lines, and with auto scrolling that allows for more focused "line by line" reading without the audio, plus fast and easy adjustment or enlarging for PDFs whose lines vary in height or include illustrations.
Note that, as in the above PDF, where much of the text is actually one or two lines out of order, it is not trivial for a PDF reader to work out which text baseline should come next. In reality, "Read Aloud" will read two lines of varying height, then jump to the top of the page, then back to the second visible line. PDF lines are not stored in visible order, nor do they have the constant height or spacing you might expect.

Exported PDFs from Mathematica 8 won't print

UPDATE: I wrote to Wolfram support about this and will update the post if they can resolve the problem. Sorry for spamming SO with a technical support question, but here it remains in case anyone else is having the same issue.
Is anyone else having this problem with Mathematica 8? I recently upgraded and noticed that when I export Graphics to a PDF file, although the file appears fine on my computer, it prints as a blank page. For example, try
Rectangle[{1,1}]//
Graphics//
Export["~/test.pdf",#]&
which creates a PDF file containing a black square. This file opens fine, but if I send it to my department printer I just get a blank page. If I don't export the graphics but print the notebook from MM, no problem, the graphics print as expected. If I use MM 7 to do exactly the same thing, the PDF file prints as expected. Exporting to PNG in MM8 seems to work fine. And, using the context menu Save Graphics As ... or File > Save Selection As ... to create a PDF containing just the graphic also works. However, these graphics eventually get included in a TeX document, and it would be far better if I could continue using the script I've got that doesn't require any button clicking to generate them.
I'm running MM 8.0.0.0 on Mac OS 10.6.7. I have not been able to test this on another printer yet, but this printer has never given me problems before and prints other PDF documents fine. Any ideas why this is happening?
Wolfram Research responds:
...
This issue has been reported by other users as well and our developers are currently looking into it. I have added your details to the report so you can be notified when this is resolved.
In the meantime, the alternatives that you could try are:
Try a different printer.
Rasterize the image with the function 'Rasterize' before exporting. If the rasterized image loses some resolution, you could use the option 'ImageResolution' to edit this.
Rasterize[image, ImageResolution -> xxx]
Surely this is a bug (please report it to support@wolfram.com), but you can work around the problem by selecting the graphic and choosing File > Save Selection As... from the menu (or Save Graphic As... from the contextual menu). This produces a slightly different file that doesn't appear to exhibit the undesirable behavior we observe from Export[].
These problematic files, and LaTeX PDFs that include them, can be properly printed by Adobe Reader 10.1.2. That's if you're okay with installing and using a 450MB PDF reader.
I reproduced the problem (leading me to this question) with Mathematica 8.0.4.0 on Mac OS X 10.7.2. Wolfram suggested lame workarounds like Rasterize and told me
This issue has been addressed by our developers, and a fix will be included in a future version of Mathematica.

PDF Outline Text - Automation of Acrobat Sequences

I have built an application that automates the filling out of form fields inside a pdf. It then takes various assets and combines them together to generate a "print ready" product. All of this is accomplished using the magic of iTextSharp. When form fields are populated, they are then flattened to text. The problem is that even with the fonts embedded they aren't really attached to the form fields in a meaningful way (like straight text elements are) and the printers are complaining that the pdf is generating licensing errors due to this. I researched this a bit and it just seems to be the nature of how form fields are.
The artists we are working with requested that we research a way to "outline" the text that is created from flattening the form fields. I found that running the PDF Optimizer with a custom preset allows for Text Outlining in Acrobat, and even better I can generate an Acrobat Sequence that runs this command on the pdf. The problem is that Sequences can not be automated, at all.
I found a plug-in called AutoBatch that allows for the execution of Sequences on the command line through a batch file. The downside is that this would require installing Acrobat Pro and the plug-in on the server this application will be running on. Further, it seems like an overkill solution just to outline the text in the pdf. For all I know at this point, iTextSharp may allow me to do this programmatically, but searching for such a thing on Google returns few results and nothing relevant.
So the question: Is there a better way to outline text in a pdf than the current solution I have implemented or am I kind of stuck?
TLDR; PDF is generated w/ non-standard fonts. I need to "outline" this text to send it to the printer. Currently using AutoBatch Acrobat Plug-In to execute Acrobat Sequence from the Command Line. Seems excessive, wondering if anyone knows a better way to automate font outlining.
I am also in a printing environment and have used forms for "Box Covers" plenty of times to shorten the code used to produce box covers.
I simply use "pdfStamper.FormFlattening = true;" and the printers (Xerox DP180 and DC5000) have no problems using the PDF.
The moment I leave out FormFlattening the printer gives a lot of errors regarding the PDF.
If you are using FormFlattening then check if the printer has the font locally installed in order for it to reference the font from the print engine instead of the PDF resources.