How to correctly understand TrueType cmap's subtable Format 4?

The following is the information that the TrueType font format documentation provides regarding the fields of the "Format 4: Segment mapping to delta values" subtable, which may be used in the cmap font table (the one used for mapping character codes to glyph indices):
Type Name Description
1. uint16 format Format number is set to 4.
2. uint16 length This is the length in bytes of the subtable.
3. uint16 language For requirements on use of the language field, see “Use of the language field in 'cmap' subtables” in this document.
4. uint16 segCountX2 2 × segCount.
5. uint16 searchRange 2 × (2**floor(log2(segCount)))
6. uint16 entrySelector log2(searchRange/2)
7. uint16 rangeShift 2 × segCount - searchRange
8. uint16 endCode[segCount] End characterCode for each segment, last=0xFFFF.
9. uint16 reservedPad Set to 0.
10. uint16 startCode[segCount] Start character code for each segment.
11. int16 idDelta[segCount] Delta for all character codes in segment.
12. uint16 idRangeOffset[segCount] Offsets into glyphIdArray or 0
13. uint16 glyphIdArray[ ] Glyph index array (arbitrary length)
(Note: I numbered the fields as to allow referencing them)
Most fields, such as 1. format, 2. length, 3. language, 9. reservedPad, are trivial basic info and understood.
Other fields, 4. segCountX2, 5. searchRange, 6. entrySelector, 7. rangeShift, I see as a somewhat odd way to store precomputed values, but basically being only a redundant way to store the number of segments segCount (implicitly). Those fields also give me no major headache.
Lastly there remain the fields that represent arrays. Per each segment there is a field 8. endCode, 10. startCode, 11. idDelta and 12. idRangeOffset, and there might/might not even be a field 13. glyphIdArray. Those are the fields I still struggle to interpret correctly and which this question is about.
To allow for a most helpful answer allow me to sketch quickly my take on those fields:
Working basically segment by segment, each segment maps character codes from startCode to endCode to the indexes of the font's glyphs (reflecting the order they appear in the glyf table).
having the character code as input
having the glyph index as output
the segment is determined by iterating through the segments, checking that the input value is inside the range startCode to endCode.
with the segment thus found, the respective fields idRangeOffset and idDelta are determined as well.
idRangeOffset conveys a special meaning
case A) idRangeOffset being set to the special value 0 means that the output can be
calculated from the input value (character code) and the idDelta. (I think it is either glyphId = inputCharCode + idDelta or glyphId = inputCharCode - idDelta)
case B) idRangeOffset being not 0 something different happens, which is part of what I seek an answer about here.
With respect to case B) the documentation states:
If the idRangeOffset value for the segment is not 0, the mapping of
character codes relies on glyphIdArray. The character code offset from
startCode is added to the idRangeOffset value. This sum is used as an
offset from the current location within idRangeOffset itself to index
out the correct glyphIdArray value. This obscure indexing trick works
because glyphIdArray immediately follows idRangeOffset in the font
file. The C expression that yields the glyph index is:
glyphId = *(idRangeOffset[i]/2
+ (c - startCode[i])
+ &idRangeOffset[i])
which I think provides a way to map a continuous input range (hence "segment") to a list of values stored in the field glyphIdArray, presumably as a way to provide output values that cannot be computed via idDelta because they are unordered/non-consecutive. This at least is my reading of what the documentation describes as "obscure".

Because glyphIdArray[] follows idRangeOffset[] in the TrueType file, the code segment in question
glyphId = *(&idRangeOffset[i]
+ idRangeOffset[i]/2
+ c - startCode[i])
points to the memory address of the desired position in glyphIdArray[]. To elaborate on why:
&idRangeOffset[i] points to the memory address of idRangeOffset[i]
moving forward idRangeOffset[i] bytes (or idRangeOffset[i]/2 uint16's) brings you to the relevant section of glyphIdArray[]
c - startCode[i] is the position in glyphIdArray[] that contains the desired ID value
From here, in the event that this ID is not zero, you add idDelta[i] (modulo 65536) to obtain the glyph number corresponding to c.
It is important to point out that *(&idRangeOffset[i] + idRangeOffset[i]/2 + (c - startCode[i])) is really pseudocode: you are not dereferencing an address in your program's memory, but computing an offset within the font file.
In a more modern language without pointers, the above code segment translates to:
glyphIndexArray[i - segCount + idRangeOffset[i]/2 + (c - startCode[i])]
The &idRangeOffset[i] in the original code segment has been replaced by i - segCount (where segCount = segCountX2/2). This is because the range offset (idRangeOffset[i]/2) is relative to the memory address &idRangeOffset[i].
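Putting both cases together, the whole lookup can be sketched in C. This is a minimal sketch, assuming the subtable's arrays have already been parsed into a struct (the struct and function names below are mine, not from the spec), and that idRangeOffset and glyphIdArray are contiguous in memory, as they are in the font file:

```c
#include <stdint.h>

/* Hypothetical container for an already-parsed format 4 subtable.
 * glyphIdArray is assumed to follow idRangeOffset contiguously. */
typedef struct {
    uint16_t segCount;
    const uint16_t *endCode;       /* segCount entries, last = 0xFFFF */
    const uint16_t *startCode;     /* segCount entries */
    const int16_t  *idDelta;       /* segCount entries */
    const uint16_t *idRangeOffset; /* segCount entries; glyphIdArray follows */
} CmapFormat4;

uint16_t lookup_glyph(const CmapFormat4 *t, uint16_t c)
{
    for (uint16_t i = 0; i < t->segCount; i++) {
        if (c <= t->endCode[i]) {
            if (c < t->startCode[i])
                return 0; /* character not covered by any segment */
            if (t->idRangeOffset[i] == 0)
                /* case A: arithmetic mapping, modulo 65536 */
                return (uint16_t)(c + t->idDelta[i]);
            /* case B: the "obscure indexing trick" -- the offset is
             * relative to &idRangeOffset[i] itself, in bytes, hence /2 */
            const uint16_t *p = &t->idRangeOffset[i]
                              + t->idRangeOffset[i] / 2
                              + (c - t->startCode[i]);
            uint16_t glyphId = *p;
            if (glyphId == 0)
                return 0; /* missing glyph */
            return (uint16_t)(glyphId + t->idDelta[i]);
        }
    }
    return 0;
}
```

Note that this walks the segments linearly; the searchRange/entrySelector/rangeShift fields exist so an implementation can binary-search endCode instead.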

Related

In PDF format syntax should number 1.e10 be written as 10000000000.?

Looking at the PDF Reference ver 1.7 about how objects of type number are written according to valid syntax, it informs:
Note: PDF does not support the PostScript syntax for numbers with
nondecimal radices (such as 16#FFFE ) or in exponential format (such
as 6.02E23 ).
However it also does not mandate a maximum range the numbers should be in. This seems to suggest it would be correct to write
1.00E10 as 10000000000
or
1.00E-50 as 0.00000000000000000000000000000000000000000000000001
This question has hence 2 aspects:
a) is the notation correct (as provided in the examples)?
b) does pdf format expect implementations to use (or at least fall back
to some bigint/bigfloat handling) of numbers, as it seems to not provide
any range for the numbers?
First of all, for normative information on PDF you should refer to the appropriate ISO standards, in particular ISO 32000. Yes, Part 1 (ISO 32000-1) in particular is derived from the PDF reference 1.7 without that many changes, but not without changes either. (Ok, in some situations one has to consult the old PDF reference, too, to understand some of these changes.)
Adobe has published a copy thereof (with "ISO" in the page headers removed) on its web site: https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf
Now to your question:
According to ISO 32000, both part 1 and 2:
An integer shall be written as one or more decimal digits optionally preceded by a sign. [...]
A real value shall be written as one or more decimal digits with an optional sign and a leading, trailing, or embedded PERIOD (2Eh) (decimal point).
(section 7.3.3 "Numeric Objects")
Thus, concerning your question a)
is the notation correct (as provided in the examples)?
Yes, 10000000000 is an integer valued numeric object, 0.00000000000000000000000000000000000000000000000001 is a real valued numeric object.
Concerning your question b)
does pdf format expect implementations to use (or at least fall back to some bigint/bigfloat handling) of numbers, as it seems to not provide any range for the numbers?
No, in the same section as quoted above you also find
The range and precision of numbers may be limited by the internal representations used in the computer on which the conforming reader is running; Annex C gives these limits for typical implementations.
and Annex C recommends at least the following limits:
integer   2,147,483,647     Largest integer value; equal to 2^31 − 1.
integer   −2,147,483,648    Smallest integer value; equal to −2^31.
real      ±3.403 × 10^38    Largest and smallest real values (approximate).
real      ±1.175 × 10^−38   Nonzero real values closest to 0 (approximate). Values closer than these are automatically converted to 0.
real      5                 Number of significant decimal digits of precision in fractional part (approximate).
(ISO 32000-1)
Integers
Integer values (such as object numbers) can often be expressed within 32 bits.
Real numbers
Modern computers often represent and process real numbers using IEEE Standard for Floating-Point Arithmetic (IEEE 754) single or double precision.
(ISO 32000-2)
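Since PDF forbids exponent notation, a writer has to expand values like 1.0e10 into plain decimal form itself. A minimal C sketch of such a serializer (the function name is mine; the 5-digit cap reflects the Annex C precision recommendation):

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write a double into buf using PDF-compatible
 * number syntax: no exponent, optional sign, plain decimal point.
 * Fractional precision is capped at 5 digits, matching the minimum
 * precision Annex C recommends for conforming readers. */
void pdf_write_real(char *buf, size_t n, double v)
{
    snprintf(buf, n, "%.5f", v);   /* e.g. 1.0e10 -> "10000000000.00000" */
    /* trim trailing zeros, and the '.' itself if nothing follows it */
    char *dot = strchr(buf, '.');
    if (dot) {
        char *end = buf + strlen(buf) - 1;
        while (end > dot && *end == '0') *end-- = '\0';
        if (end == dot) *end = '\0';
    }
}
```

For example, pdf_write_real(buf, sizeof buf, 1.0e10) produces "10000000000", which is a valid integer-valued numeric object; values far below 10^−38 would simply round to "0", consistent with the Annex C behavior quoted above.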

truetype font ARGS_ARE_XY_VALUES meaning?

In the glyf table, if a glyph is a composite glyph, I don't understand what it means if the flag ARGS_ARE_XY_VALUES is not set. The msdn docs say
the first point number indicates the point that is to be matched to the new glyph. The second number indicates the new glyph’s “matched” point. Once a glyph is added, its point numbers begin directly after the last glyphs (endpoint of first glyph + 1).
But I have NO idea what it means:
What is a "point number"? Is it an index into the glyph's points?
What does "matched to the new glyph" mean?
What is a "point number"? Is it an index into the glyph's points?
Yes. It’s an index into the array of pairs of coordinates that make up the glyph’s outline (as defined in the glyph’s contour data).
What does "matched to the new glyph" mean?
It means that the new component glyph of that composite/compound glyph is to be positioned so that the coordinates of its ‘match point’ are equal to those of the ‘match point’ of the base component glyph. In other words: so that the indicated points for the two components match. This is repeated for each new component glyph, with the point numbers/indices of the already matched components being treated as if it were a single, base component glyph.
Apple's TrueType spec is a bit clearer about the meaning of this flag. It says that if the ARGS_ARE_XY_VALUES flag is not set that:
1st short contains the index of matching point in compound being constructed
2nd short contains index of matching point in component
Source: https://developer.apple.com/fonts/TrueType-Reference-Manual/RM06/Chap6glyf.html
In other words, let m be the first short and n be the second, then the coordinates of point n of the new component should have the same coordinates as point m of the so far constructed compound glyph.
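The positioning described above can be sketched as a simple translation: compute the offset between the two matched points, then shift every point of the component by it. This is a sketch under the assumptions stated (the struct and function names are mine, and real implementations work on the glyph's contour data and apply any component transform first):

```c
#include <stdint.h>

typedef struct { int16_t x, y; } Point;

/* Hypothetical sketch: when ARGS_ARE_XY_VALUES is NOT set, the two
 * "arguments" are point indices m and n, not an x/y offset. The new
 * component is shifted so its point n lands exactly on point m of the
 * compound glyph assembled so far. */
void place_component(const Point *compound, uint16_t m,
                     Point *component, uint16_t nPoints, uint16_t n)
{
    int16_t dx = compound[m].x - component[n].x;
    int16_t dy = compound[m].y - component[n].y;
    for (uint16_t i = 0; i < nPoints; i++) {
        component[i].x += dx;
        component[i].y += dy;
    }
}
```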

u32 filter matching clarification

I've been following through the tutorial for u32 pattern matching here: Link
Most of it is straightforward until I get to the section where the IP header length is grabbed, using the following:
0>>22&0x3C
I don't understand why this was chosen instead of:
0>>24&0x0F
From my understanding, the filter chosen will shift the first byte 22 to the right, then apply a mask to strip the first and last 2 bits off, giving us the correct lower nibble for the IP header length. The second will complete the full shift to the right, only needing to strip the first 4 bits.
My question is, why was the first chosen and not the second? I believe it's because of the multiply that needs to take place, but I don't understand what effect that operation would have if both filters would return the correct value.
The IP header length is specified in 32-bit words rather than 8-bit bytes, so whatever value is in the IHL field needs to be multiplied by 4, which can be accomplished by a shift left of 2; therefore, instead of shifting right by 24, masking by 0x0F and then shifting left by two, the author decided to only shift right by 22 and mask by 0x3C.
That is, the two operations don't return the same value: the first operation returns the value of the second multiplied by 4. To get the same result as the first you would use
0>>24&0x0F<<2
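The arithmetic is easy to check outside the filter syntax. A quick sketch in C (the example header word is mine: version 4, IHL 5, i.e. a 20-byte header):

```c
#include <stdint.h>

/* Extract the IPv4 header length two ways from the first 32-bit word
 * of the header. The version sits in the top nibble, IHL in the next. */
static uint32_t ihl_bytes(uint32_t word) { return (word >> 22) & 0x3C; } /* already in bytes */
static uint32_t ihl_words(uint32_t word) { return (word >> 24) & 0x0F; } /* in 32-bit words */
```

With word = 0x45000000, ihl_words() yields 5 while ihl_bytes() yields 20: the >>22 variant leaves the IHL nibble shifted left by 2, which is exactly the multiply-by-4 the byte count needs.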

Fortran: can you explain this formatting string

I have a Fortran program which I need to modify, so I'm reading it and trying to understand. Can you please explain what the formatting string in the following statement means:
write(*,'(1p,(5x,3(1x,g20.10)))') x(jr,1:ncols)
http://www.fortran.com/F77_std/rjcnf0001-sh-13.html
Briefly: you are writing three general (g) format floats per line. Each float has a total field width of 20 characters and 10 places to the right of the decimal. Large-magnitude numbers are in exponential form.
The 1xs are simply added spaces (which could as well have been accomplished by increasing the field width, ie, g21.10, since the numbers are right justified). The 5x puts an additional 5 spaces at the beginning of each line.
The somewhat tricky thing here is the leading 1p, which is a scale factor. It causes the mantissa of all exponential-form numbers produced by the following g format to be multiplied by 10, and the exponent changed accordingly, ie instead of the default,
g17.10 -> b0.1234567890E+12
you get:
1p,g17.10 -> b1.2345678900E+11
b denotes a blank in the output. Be sure to allow room for a - in your field width count...
For completeness: in the case of a scale factor greater than one, the number of decimal places is reduced (preserving the total precision), ie,
3p,g17.10 -> b123.45678900E+09 ! note only 8 digits after the decimal
That is, 1p buys you a digit of precision over the default, but you don't get any more. Negative scale factors cost you precision, preserving the 10 digits:
-7p,g17.10 -> b0.0000000123E+19
I should add, the p scale factor edit descriptor does something completely different on input. Read the docs...
I'd like to add slightly to George's answer. Unfortunately this is a very nasty (IMO) part of Fortran. In general, bear in mind that a Fortran format specification is automatically repeated as long as there are values remaining in the input/output list, so it isn't necessary to provide formats for every value to be processed.
Scale factors
In the output, all floating point values following kP are multiplied by 10^k. Fields containing exponents (E) have their exponent reduced by k, unless the exponent format is fixed by using EN (engineering) or ES (scientific) descriptors. Scaling does not apply to G editing, unless the value is such that E editing is applied. Thus, there is a difference between (1P,G20.10) and (1P,F20.10).
Grouping
A format like n() repeats the descriptors within parentheses n times before proceeding.

Converting meshes to metaballs

I'm doing a project where I need to convert an existing polygonal mesh into a static shape made from metaballs (blobs). I have voxelized the mesh with binvox to "a .raw file" (according to the description at binvox), but I have no clue of how it stores the data, and therefore don't know how to load it.
Question 1: Is there any non-PhD way to do so, ie, create a metaball model from a polygonal mesh?
Question 2: Has anyone ever used the said .raw file format from binvox, and if you did, how?
RLE: Run-Length Encoding
The binary voxel data
The binary data consists of pairs of bytes. The first byte of each pair is the value byte and is either 0 or 1 (1 signifies the presence of a voxel). The second byte is the count byte and specifies how many times the preceding voxel value should be repeated (so obviously the minimum count is 1, and the maximum is 255).
http://www.cs.princeton.edu/~min/binvox/binvox.html
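Decoding those (value, count) pairs is straightforward. A minimal sketch in C (the function name is mine; it assumes the binvox header has already been consumed, so data points at the raw pair stream):

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch of a decoder for binvox's RLE voxel stream: the data is
 * (value, count) byte pairs, where value is 0 or 1 and count is 1..255.
 * Expands into out (one byte per voxel), capped at out_len, and
 * returns the number of voxels decoded. */
size_t decode_binvox_rle(const uint8_t *data, size_t data_len,
                         uint8_t *out, size_t out_len)
{
    size_t n = 0;
    for (size_t i = 0; i + 1 < data_len; i += 2) {
        uint8_t value = data[i];     /* 0 = empty, 1 = voxel present */
        uint8_t count = data[i + 1]; /* repeat count, 1..255 */
        for (uint8_t k = 0; k < count && n < out_len; k++)
            out[n++] = value;
    }
    return n;
}
```

The expanded buffer then holds one byte per grid cell, in the traversal order the binvox page documents, which you can feed into whatever metaball-placement scheme you settle on.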