Embedded OpenType (CFF) font in a PDF shows strange behaviour in some viewers - pdf

When embedding a subsetted OpenType font with CFF outlines (Noto Sans HK) in a PDF generated by my own library, I am seeing some rather strange behaviour. The PDF shows no glyphs (unselectable blanks) in Mac Preview and a random assortment of .notdef glyphs and spaces in Adobe Reader, with no errors in either.
But here's the deal: it renders perfectly with Poppler in a Docker image with no fonts installed (I have completely removed every pre-installed font so there would be no silent substitutions) and Chrome on my Mac (without the font being installed).
Furthermore, I have also compared the rendering of my PDF in Chrome to that of a reference PDF using the same font created with Cairo, and as shown below overlaying my PDF on the Cairo one at 50% opacity shows they are definitely identical.
Chrome rendering (Noto HK top, PingFang HK bottom):
Preview rendering (Noto HK invisible, PingFang HK as expected):
Other HK Chinese CFF fonts like PingFang HK render perfectly in every PDF reader I have tested, but Noto Sans HK just won't. As far as embedding restrictions go, FontBook shows Noto Sans HK as having "no restrictions", so nothing there either.
I am embedding all fonts as CIDFontType0C fonts with Identity-H encoding, and although I'm not providing ToUnicode maps yet as they are the next thing on the roadmap, that should make no difference to rendering.
Noto HK Font objects (Widths removed for conciseness):
6 0 obj
<< /Ascent 1160 /CapHeight 733 /Descent -288 /Flags 4 /FontBBox [ -991 -1050 2930 1810 ] /FontFile3 10 0 R /FontName /NZGUSD+NotoSansHK-Thin /ItalicAngle 0 /StemV 58 /Type /FontDescriptor >>
endobj
7 0 obj
<< /BaseFont /NZGUSD+NotoSansHK-Thin /DescendantFonts [ 8 0 R ] /Encoding /Identity-H /Subtype /Type0 /Type /Font >>
endobj
8 0 obj
<< /BaseFont /NZGUSD+NotoSansHK-Thin /CIDSystemInfo << /Ordering (Identity) /Registry (Adobe) /Supplement 0 >> /FontDescriptor 6 0 R /Subtype /CIDFontType0 /Type /Font /W 9 0 R >>
endobj
Equivalent PingFang objects:
11 0 obj
<< /Ascent 1060 /CapHeight 860 /Descent -340 /Flags 4 /FontBBox [ -72 -212 1126 952 ] /FontFile3 15 0 R /FontName /DYBBAB+PingFangHK-Regular /ItalicAngle 0 /StemV 95 /Type /FontDescriptor >>
endobj
12 0 obj
<< /BaseFont /DYBBAB+PingFangHK-Regular /DescendantFonts [ 13 0 R ] /Encoding /Identity-H /Subtype /Type0 /Type /Font >>
endobj
13 0 obj
<< /BaseFont /DYBBAB+PingFangHK-Regular /CIDSystemInfo << /Ordering (Identity) /Registry (Adobe) /Supplement 0 >> /FontDescriptor 11 0 R /Subtype /CIDFontType0 /Type /Font /W 14 0 R >>
endobj
Relevant Page objects:
3 0 obj
<< /F4v0 12 0 R /F5v0 7 0 R >>
endobj
4 0 obj
<< /Contents 5 0 R /CropBox [ 2.5 4 595 842 ] /MediaBox [ 0 0 600 850 ] /Parent 2 0 R /Resources << /Font 3 0 R >> /Type /Page >>
endobj
5 0 obj
<< /Length 462 >>
stream
q 1 1 1 rg 0 0 600 850 re F Q BT /F5v0 15.000000 Tf 0 0 0 rg 0 Tr 27.500000 802.000000 Td [<0AFD292728192FFF3162282746BB112F14E410E20E96201D0D820A9111440EC016922CB046A10AFD0EC039AF1D0B272D17D431C92A2B4F4D384719160F2C29C9297634F34F4D1846>] TJ ET BT /F4v0 15.000000 Tf 0 0 0 rg 0 Tr 27.500000 780.280000 Td [<05487DE1129E161216D412A7726A08C175A77465074A7A1706A504E4748207710B1814B5726605480771641D0E4D12580BD481D113A37267628146D107BE7E0D1358AD3772670C18>] TJ ET endstream
endobj
I'm using HarfBuzz to generate subsets with the HB_SUBSET_FLAGS_RETAIN_GIDS flag set, and when I view the generated subset in FontForge, the glyphs expected are present with the correct GIDs.
Minimal reproducible PDF (not linearised or compressed for readability)
Edit:
Some further investigation showed that embedding the same font as a CIDFontType2 font instead of CIDFontType0 makes Preview show the desired result, which is beyond bizarre to me. Adobe Reader still shows the .notdefs, and Poppler warns about using the wrong type (unsurprisingly) but still renders the PDF fine. My assumption is Preview and Poppler are interpreting the embedded font as CIDFontType0 correctly and ignoring the incorrect /Subtype I've provided.
The question still remains of why Preview would correctly display the font when it's embedded incorrectly, but not otherwise.
Edit 2:
When the font is embedded whole, the result is mostly the same, although now rather than seeing nothing I get a few random characters instead:
In Chrome the result is the same as before:
The glyphs being rendered definitely do not correspond to the glyph IDs being provided (again, verified with FontForge).
As before, PingFang and other fonts render perfectly in either case.
I'm starting to think I might be missing an edge case here with respect to glyph indexing, where Cairo and other PDF generators are remapping GIDs to low numbers so they have no issues, but I'm retaining the original GIDs (still fitting in 2 bytes, but could be an implementation limitation I haven't seen?).
I'll try remapping the GIDs to see if that helps and report back.

This is happening because of a misunderstanding on my part of how CID fonts work in PDFs.
Let me explain.
When using a font in PDF you will provide several structures (font descriptor, font dictionary, and for Type0 a descendant font) describing the font, and categorising it into one of the predefined types (Type0, Type1, Type3, or TrueType), and in the case of Type0 a subtype (/CIDFontType0 or /CIDFontType2).
What I didn't understand was that Type0 fonts with subtype /CIDFontType0 have one further implicit distinction: those whose CFF data uses CIDFont operators in its Top DICT (CID-keyed fonts), and those that don't (which includes all CFF2 fonts).
The way glyph lookup works differs based on the type of font used too:
With "Simple" fonts (Type1, TrueType) you would use the actual string ((like this) or <0074006800690073>) as the operand to text showing operators, whereas for Composite fonts (Type0) you would typically use hex encoded strings of CIDs (<DEADBEEF...>).
When using Identity mappings with CID fonts, CID == GID, so we can use GIDs directly in these strings, unless you're using a CID font with CFF outlines that has CIDFont operators in its Top DICT. In this (now rather rare) case, CIDs may or may not equal GIDs. In my testing Noto Sans HK was the only font that used a different mapping, which is why the other fonts worked fine.
What I needed was to parse the charset array referenced from the Top DICT and look up the GID in question to obtain a SID. Normally each SID corresponds to a string in the String INDEX, but in CID-keyed CFF fonts the charset entries are CIDs rather than string IDs. Once the CID is obtained, it can be used to encode text in the PDF.
In my case, 人 (U+4EBA) had a GID of 2813, but the PDF reader interpreted that as a CID, which in this case didn't exist. When using the CID of 9749 instead, however, the glyph is shown as expected.
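The lookup described above can be sketched in Python. This is a minimal sketch, not the code from my library: the function names `parse_cff_charset` and `encode_text` are made up for illustration, and it assumes you have already located the charset offset (Top DICT operator 15) and the glyph count (from the CharStrings INDEX) in the CFF table. The sample bytes at the end are hand-built, not taken from Noto Sans HK.

```python
import struct


def parse_cff_charset(cff: bytes, offset: int, num_glyphs: int) -> list:
    """Return a list indexed by GID whose values are the charset entries.

    For CID-keyed CFF fonts these entries are CIDs; GID 0 (.notdef) is
    implicit and always maps to 0.
    """
    fmt = cff[offset]
    pos = offset + 1
    cids = [0]
    if fmt == 0:
        # Format 0: one 16-bit entry per glyph after .notdef.
        for _ in range(num_glyphs - 1):
            cids.append(struct.unpack_from(">H", cff, pos)[0])
            pos += 2
    elif fmt in (1, 2):
        # Formats 1 and 2: ranges of (first, nLeft); nLeft is 8 or 16 bits wide.
        layout, size = (">HB", 3) if fmt == 1 else (">HH", 4)
        while len(cids) < num_glyphs:
            first, n_left = struct.unpack_from(layout, cff, pos)
            cids.extend(range(first, first + n_left + 1))
            pos += size
        del cids[num_glyphs:]  # the last range may overshoot the glyph count
    else:
        raise ValueError(f"unknown charset format {fmt}")
    return cids


def encode_text(gids, gid_to_cid) -> str:
    """Build the hex string operand for Tj/TJ with an Identity-H encoding."""
    return "<" + "".join(f"{gid_to_cid[gid]:04X}" for gid in gids) + ">"


# Hand-built format 2 charset: GIDs 1..3 -> CIDs 1..3, GID 4 -> CID 9749.
sample = bytes([2]) + struct.pack(">HH", 1, 2) + struct.pack(">HH", 9749, 0)
gid_to_cid = parse_cff_charset(sample, 0, 5)
print(encode_text([4], gid_to_cid))
```

For fonts without CIDFont operators the table is not needed, since CID == GID there and the GIDs can be written out directly.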

Related

Create PDF file from application/pdf server response in Chrome Dev Tools

Is there a way to convert the below response which I got from the Chrome Dev Tools to a PDF file?
%PDF-1.5
%âãÏÓ
38 0 obj
<<
/Type /XObject
/Subtype /Image
/Width 189
/Height 100
/ColorSpace /DeviceGray
/Matte [0 0 0]
/BitsPerComponent 8
/Interpolate false
/Filter /FlateDecode
/Length 3345
>>
.....
I know this is the raw PDF data, so I tried to create a PDF file from it in Node.js, but the output was a blank PDF file.
There's some conversion missing which I am not able to figure out. Any sort of help is deeply appreciated.

How to extract rotation/transformation information for PDF extracted images (i.e. how do viewers know to rotate 180°?)

I am using a ScanSnap scanner which generates PDF-1.3 where it will auto-correct the orientation (rotate 0 or 180 degrees) of scanned documents when the PDF is viewed within Adobe Reader. OCR is done by the scanning software and I am assuming the orientation is determined then and encoded into the PDF.
Note that I know I can use Tesseract or other OCR tools to determine if rotation is needed, but I do not want to, as the scanner software seems to have already determined it and is telling PDF viewers whether rotation is needed (or not).
When I use image extraction tools (like xpdf pdfimages or Python libraries), they do not rotate the JPEG images by 180 degrees when that is needed.
NB: pdfimages extracts the raw image data from the PDF file, without
performing any additional transforms. Any rotation, clipping, color
inversion, etc. done by the PDF content stream is ignored.
I have scanned a document twice with rotation (0 degrees, and 180 degrees).
I cannot seem to reverse engineer what is telling Adobe/Foxit to rotate (or not) the image when viewing. I have looked at the PDF-1.3 specification document and compared the PDF binary data between the orientation-corrected and uncorrected files, but I cannot determine what is correcting the orientation.
No /Page/Rotate (defaults to 0) in PDF
No EXIF orientation in JPEG
I do not see any transformation matrix (cm operator) in PDF
In both cases the PDF binary looks like the following (stopped at the JPEG streamed data)
UPDATED: links to PDF files rotated-180 rotated-0
%PDF-1.3
%âãÏÓ
1 0 obj
<</Metadata 20 0 R/Pages 2 0 R/Type/Catalog>>
endobj
2 0 obj
<</MediaBox[0.0 0.0 606.6 794.88]/Count 1/Type/Pages/Kids[4 0 R]>>
endobj
4 0 obj
<</Parent 2 0 R/Contents 18 0 R/PieceInfo<</PSL<</Private<</V(3.2.9)>>/LastModified(D:20190201125524-00'00')>>>>/MediaBox[0.0 0.0 606.6 794.88]/Resources<</XObject<</Im0 5 0 R>>/Font<</C0_0 11 0 R/T1_0 16 0 R>>/ProcSet[/PDF/Text/ImageC]>>/Type/Page/LastModified(D:20190201085524-04'00')>>
endobj
5 0 obj
<</Subtype/Image/Length 433576/Filter/DCTDecode/Name/X/BitsPerComponent 8/ColorSpace/DeviceRGB/Width 1685/Height 2208/Type/XObject>>stream
Does anyone know how PDF viewers know to rotate an image by 180 degrees (or not)? Is it metadata within the PDF or JPEG image which can be extracted? Do Adobe and other viewers do something dynamically on opening a document to determine if orientation correction is needed?
I'm no expert with PDF specification. But I was hoping someone may have already found a solution to this problem.
The image Im0 in the resources of the page in "internetfile-180.pdf" is not rotated:
But the image Im0 in the resources of the page in "internetfile.pdf" is rotated:
In the viewer both look upright, so in "internetfile.pdf" a technique must be used that rotates the image.
There are two major techniques for this:
Setting the Rotate property of the page accordingly, i.e. here to 180.
Applying a rotation transformation to the current transformation matrix in the content stream of the page.
Let's look at the page dictionary first, a bit pretty-printed:
4 0 obj
<<
/Parent 2 0 R
/Contents 13 0 R
/PieceInfo
<<
/PSL
<<
/Private <</V (3.2.9)>>
/LastModified (D:20190204142537-00'00')
>>
>>
/MediaBox [0.0 0.0 608.64 792.24]
/Resources
<<
/XObject <</Im0 5 0 R>>
/Font <</T1_0 11 0 R>>
/ProcSet [/PDF /Text /ImageC]
>>
/Type /Page
/LastModified (D:20190204102537-04'00')
>>
As we see, there is no Rotate entry present. Thus, we'll have to look at the page content stream. According to the page dictionary it's in object 13, generation 0.
That object is a stream object with deflated stream data:
13 0 obj
<<
/Length 4014
/Filter /FlateDecode
>>
stream
H‰”WÛŽÛF}Ÿ¯Ð[lÀÓÓ÷˾e½
[...]
ÿüòÛÿ ´ß
endstream
endobj
After inflating the stream data, they start like this:
q
-608.3999939 0 0 -792.9600067 608.3999939 792.9600067 cm
/Im0 Do
Q
[...]
And this is indeed an application of the second technique: the cm instruction applies the rotation, and the Do instruction paints the image with the rotation active!
In detail, the cm instruction applies the affine transformation represented by the matrix
-608.3999939 0 0
0 -792.9600067 0
608.3999939 792.9600067 1
In other words:
x' = -608.3999939 * x + 608.3999939
y' = -792.9600067 * y + 792.9600067
This transformation actually is a combination of a rotation by 180°, a horizontal scaling by 608.3999939 and a vertical scaling by 792.9600067, and a translation by 608.3999939 horizontally and 792.9600067 vertically.
The Do instruction now paints the image. Here one needs to know that this instruction first scales the image to fit into the unit 1×1 square at the origin and then applies the current transformation matrix.
Thus, the image is drawn rotated by 180°, effectively filling the whole 608.64×792.24 MediaBox of the page.
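To check this numerically, here is a small Python sketch (the helper name `apply_cm` is just for illustration) that applies the six cm operands to the corners of the unit square into which Do places the image:

```python
def apply_cm(matrix, point):
    """Apply a PDF transformation matrix [a b c d e f] to a point (x, y)."""
    a, b, c, d, e, f = matrix
    x, y = point
    return (a * x + c * y + e, b * x + d * y + f)


# Operands of the cm instruction from the inflated content stream.
CM = (-608.3999939, 0.0, 0.0, -792.9600067, 608.3999939, 792.9600067)

# Corners of the 1x1 unit square the image is first scaled into.
UNIT_SQUARE = {
    "lower-left": (0, 0),
    "lower-right": (1, 0),
    "upper-right": (1, 1),
    "upper-left": (0, 1),
}

mapped = {name: apply_cm(CM, corner) for name, corner in UNIT_SQUARE.items()}
for name, (x, y) in mapped.items():
    print(f"{name:12s} -> ({x:.2f}, {y:.2f})")
```

The lower-left corner of the image ends up at the upper-right corner of the page and vice versa, which is exactly what a rotation by 180° about the page centre does.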
mkl answered the question correctly doing all the hard work decoding the PDF for me.
I thought I would add my Python (PyPDF2) code that searches for the rotation condition found above, in case it helps someone else.
import re

import PyPDF2

# A number as it appears in a content stream: optional sign, then a decimal or an integer.
FLOAT_REG = r'([-+]?(?:\d*\.\d+|\d+))'
CM_REG = re.compile(r'\s+'.join([FLOAT_REG] * 6) + r'\s+cm')

input1 = PyPDF2.PdfFileReader(open(filepath, "rb"))
totalPages = input1.getNumPages()
for pgNum in range(0, totalPages):
    page0 = input1.getPage(pgNum)
    # Look for a transformation matrix that rotates the page by 180 degrees.
    # (The ScanSnap iX500 encodes the PDF with a cm transformation matrix so that viewers rotate it 180 degrees.)
    # See https://stackoverflow.com/questions/54483013/how-to-extract-rotation-transformation-information-for-pdf-extracted-images-i-e
    # See 'PDF 1.3 Reference Manual March 11, 1999', Section 3.10 "Transformation matrices", applied to the scanned image:
    # [[a b 0]
    #  [c d 0]
    #  [e f 1]]
    isPageRotated180 = False
    # latin-1 maps every byte to a character, so binary data in the stream cannot raise a decode error.
    pgContent = page0['/Contents'].getData().decode('latin-1')
    m = CM_REG.search(pgContent)
    if m:
        (a, b, c, d, e, f) = map(float, m.groups())
        # For the rotated scans the matrix is [-w 0 0 -h w h], so a == -e and d == -f.
        isPageRotated180 = (a == -e and d == -f)

Create Highlight PDF annotations with Ghostscript

I have the following PostScript file containing a pdfmark to create a highlight annotation:
%PS
/Courier 30 selectfont
15 15 moveto
(Test)show
[ /Rect [0 0 80 30]
/Subtype /Highlight
/Color [.8 .8 0]
/QuadPoints [10 40 90 40 10 10 90 10]
/Contents (Test annotation)
/ANN pdfmark
showpage
(Note that the coordinates of the /QuadPoints field are not in the order the specs define, as Adobe implements it differently.)
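For reference, the ordering used above can be sketched in Python. The helper name `quadpoints_for_rect` is hypothetical, and the order it produces (upper-left, upper-right, lower-left, lower-right) is simply the one from my pdfmark example, which is what Adobe's viewers accept in practice rather than what the specification text describes.

```python
def quadpoints_for_rect(llx, lly, urx, ury):
    """Return QuadPoints for an axis-aligned box in the order used above:
    upper-left, upper-right, lower-left, lower-right."""
    return [llx, ury, urx, ury, llx, lly, urx, lly]


def format_quadpoints(points):
    """Format the numbers as a PostScript/PDF array."""
    return "[" + " ".join(f"{p:g}" for p in points) + "]"


quad = quadpoints_for_rect(10, 10, 90, 40)
print("/QuadPoints", format_quadpoints(quad))
```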
Ghostscript creates a PDF with an annotation from that, but there are two issues:
It works in Adobe Reader and Okular, but it's not clickable in Evince.
More important: The highlighted area isn't a rectangle but has rounded left and right edges, as can be seen from the following screenshot:
Why is that and how can I get straight edges?
You should start by looking at the content of the PDF file and seeing what Ghostscript (or more accurately the pdfwrite device) has put in there. Posting an example PDF file to look at would be a sensible move too, and would also tell us which version of Ghostscript you are using.
BTW that header should be %!PS; you missed off the '!'. Of course, since it's a comment it doesn't matter to the PostScript interpreter.
Now here's the output from Adobe Acrobat Distiller for the annotation, using the code in your question:
1 0 obj
<</Type/Annot/Subtype/Highlight/Rect[0 0 80 30]/C[.8 .8 0]/QuadPoints[10 40 90 40 10 10 90 10]/Contents(Test annotation)>>
endobj
And here's the same from Ghostscript's pdfwrite device:
8 0 obj
<</Type/Annot
/Rect [0 0 80 30]
/C [0.8 0.8 0]
/QuadPoints [10 40 90 40 10 10 90 10]
/Contents(Test annotation)
/Subtype/Highlight>>endobj
These are essentially identical.
So to answer your questions:
If it works in Acrobat, then perhaps you should ask the Evince developers this question.
The rounded edges are drawn by the application which reads the PDF annotation. Since Acrobat draws them that way, everyone else does the same (including Ghostscript's PDF interpreter). If you don't like it you will have to change the viewing application.

Ghostscript re-encoding embedded font

I am using Ghostscript (9.14) with the pdfwrite driver to "clean up" PDFs prior to distribution. While it works very well in general, I have noticed that it frequently re-encodes embedded fonts, which often has the effect of preventing sensible text extraction for searching etc.
An example file before ghostscript processing is here: http://download.vistair.com/ghostscript/in.pdf
and the result after processing with ghostscript is here: http://download.vistair.com/ghostscript/out.pdf
Sensible text extraction is possible with the input file, but not with the output file.
Looking in the PDF, in the input file we have:
obj 9 0
Type: /Font
Referencing: 12 0 R, 14 0 R
<<
/BaseFont /GCCBBY+TT187t00
/Encoding 12 0 R
/FirstChar 1
/FontDescriptor 14 0 R
/LastChar 41
/Subtype /TrueType
/Type /Font
/Widths [352 684 633 973 596 427 636 636 636 636 751 632 684 616 695 787 989 421 748 686 575 601 521 633 521 394 274 607 633 623 623 274 352 364 698 623 623 592 592 592 636]
>>
obj 12 0
Type: /Encoding
Referencing:
<<
/BaseEncoding /WinAnsiEncoding
/Differences [1/space/S/u/m/e/r/two/zero/one/four/H/E/A/T/R/O/W/I/N/B/F/a/c/h/s/t/i/o/n/p/b/l/f/period/C/d/g/y/v/k/endash]
/Type /Encoding
>>
In the ghostscript-processed file this has become:
obj 8 0
Type: /Font
Referencing: 9 0 R
<<
/BaseFont /OWPYKO+TT187t00
/FontDescriptor 9 0 R
/Type /Font
/FirstChar 2
/LastChar 6
/Widths [ 684 633 973 596 427]
/Subtype /TrueType
>>
So the font encoding information has been lost and the text is no longer extractable.
Is there a way to stop ghostscript re-encoding existing embedded fonts (or at least preserve any existing font encoding)?
To be blunt, no. It's a TrueType font, and they always get converted to a symbolic font (for complex reasons to do with the way that Ghostscript works).
In the past we did emit an Encoding, because Acrobat will use an Encoding for a TrueType font (even for a symbolic font, which it should not do). However, the PDF spec is quite clear that symbolic fonts should not specify an Encoding, and it reached the point where doing so was creating more problems than it solved, so we stopped doing it.

putting hyperlink in pdf/postscript around a circle

As you can see, there are several IDs around the circle, and I don't know their exact coordinates (they are difficult to work out). So I was wondering if anyone has an idea of how to attach a hyperlink to each ID, so that clicking on an ID takes the user to the corresponding web page.
I put the code HERE
This circle is generated by a PostScript script!
The text is drawn using constructions like this:
247 ux 160.65 uy moveto
(GH6) show stroke
You need to add a pdfmark operation. The exact pdfmark you want to use depends on what you are trying to open, and where. If you want to open another PDF file you can use a Link annotation with a GoToR action; if you want to open a web page you can use a Launch action or possibly a custom action, depending on what application is viewing the PDF file. I'm going to assume you want a Launch action.
The Launch pdfmark should look something like:
[/Rect [50 425 295 445]
/Action /Launch
/Border [0 0 2]
/Color [.7 0 0]
/URI (http://www.adobe.com)
/Subtype /Link
/ANN pdfmark
Obviously you need to calculate the Rect parameters so that clicking in the area of the text will launch the destination.
The way to do this is to use the PostScript path operators. First we need to save the current setup, then convert the text to a path, then calculate the bounding box of the path. Then we can use those co-ordinates for our Rect parameters.
Eg:
247 ux 160.65 uy moveto
(GH6)
dup % copy the string
gsave % save the current environment
exch % bring the string copy to the top of the stack
[ /Rect % Put a mark and name on stack
3 -1 roll % Bring string copy to top
true
charpath % create a path equivalent to drawing the text
flattenpath % flatten curves
pathbbox % get the bounding box
% we now have our box on the stack
% stack is: (GH6) [ /Rect llx lly urx ury
% So put the other parameters in place
/Action /Launch
/Border [0 0 2]
/Color [.7 0 0]
/URI (www.dummy.com)
/Subtype /Link
/ANN
pdfmark % and execute the pdfmark
grestore % put the graphics state back
show stroke
Some of the text is shown via a slightly different idiom:
241 ux 84.65 uy moveto
(45.0) dup stringwidth pop 2 div neg 0 rmoveto show
You can do exactly the same as above; just put the dup...grestore construction after the rmoveto and before the show.
Caveat: I haven't tested this at all, but it should show you how to proceed.
Whatever portion of the PostScript program draws the numerical IDs also needs to include a pdfmark which has a /Dest of the URI for the web page. It may well also need to specify an /AP appearance stream.
This is probably trivial to do in the original PostScript program but as BryanH implies, impossible to give pointers on without seeing the original PostScript.
Assuming, of course, that the numbers are drawn by the PostScript program, and the tool converting the PostScript to PDF understands the pdfmark extension operator.
The example from KenS is exactly what I've been looking for, but with one small change:
[ pathbbox ]
ie
/Arial findfont 20 scalefont setfont
100 200 moveto (riverdrums)
dup gsave exch
[ /Rect 3 -1 roll true charpath flattenpath [ pathbbox ]
/Action << /Subtype /URI /URI (http://riverdrums.com) >>
/Border [0 0 0]
/Subtype /Link
/ANN pdfmark
grestore
show
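Since the circle has many IDs, it may be easier to generate this fragment for each label than to write it by hand. Below is a rough Python sketch of that idea. The function names and the example URL are made up for illustration, plain user-space coordinates are assumed (the original program goes through its own ux/uy procedures), and the emitted PostScript is just the fragment above with the label and URL filled in.

```python
def ps_string(text):
    """Quote text as a PostScript string literal, escaping \\, ( and )."""
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return f"({escaped})"


def linked_label(x, y, label, url):
    """Return PostScript that shows `label` at (x, y) and wraps it in a URI link."""
    return "\n".join([
        f"{x} {y} moveto {ps_string(label)}",
        "dup gsave exch",
        "[ /Rect 3 -1 roll true charpath flattenpath [ pathbbox ]",
        f"/Action << /Subtype /URI /URI {ps_string(url)} >>",
        "/Border [0 0 0]",
        "/Subtype /Link",
        "/ANN pdfmark",
        "grestore",
        "show",
    ])


# Hypothetical label position and target URL.
fragment = linked_label(247, 160.65, "GH6", "http://www.example.com/GH6")
print(fragment)
```

The fragment still relies on a font having been selected earlier in the PostScript program, as in the example above.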