Why do DocBook generated XHTML5 Section titles have ASCII #c2 characters in them? - docbook-5

I noticed my generated XHTML5 numbered section titles have a Â between the number and the title string. I thought this was a generation error. But no, the gentext file of my DocBook distribution, common/en.xml, actually specifies this.
Line 338 of common/en.xml:
<l:template name="section" text="%n. %t"/>
The dot and space following the %n are, when viewed in a hex editor, the byte values C2 and A0, which are the Â and NBSP characters respectively (when read as Latin-1). I can understand NBSP. But why the Â?
I understand I can change this in my customization layer. But the default seems odd.
I'm using docbook-xsl-ns-1.77.1.

That is because the encoding is UTF-8, which is the normal Unicode encoding for text these days. In UTF-8, any character above 0x7F is represented by a sequence of 2, 3, or 4 bytes depending on how many significant code bits it contains.
The 0xC2 is one of the bytes that start a 2-byte sequence. In binary, it's 1100 0010. The two leading 1 bits denote a 2-byte sequence, and the bottom five bits are the first five bits of the encoded character. The second byte, 0xA0, is 1010 0000. The single leading 1 bit (followed by a 0 bit) denotes a continuation byte, and the bottom 6 bits are the bottom bits of the encoded character.
Putting the bottom five bits from the first byte together with the bottom six bits from the second, we get 000 1010 0000, i.e. 0xA0, or U+00A0, which is indeed the nonbreaking space.
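To see this concretely, here is a minimal Python sketch (my illustration, not part of the stylesheets) that decodes the two bytes both by hand and with the built-in decoder:

# The two bytes found in common/en.xml between %n and %t.
raw = bytes([0xC2, 0xA0])

# Manual decoding of a 2-byte UTF-8 sequence: concatenate the low
# 5 bits of the lead byte with the low 6 bits of the continuation byte.
code_point = ((raw[0] & 0b00011111) << 6) | (raw[1] & 0b00111111)
print(hex(code_point))        # 0xa0 -> U+00A0, NO-BREAK SPACE

# The built-in decoder agrees:
print(raw.decode('utf-8'))    # a single no-break space
print(raw.decode('latin-1'))  # 'Â' + NBSP -- what you see when the
                              # UTF-8 bytes are misread as Latin-1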

PDF TJ operator

Is it possible to determine whether a number in a TJ operator represents a space between words?
Example: [(Sta)28(ry)-333(Plzenec,)]TJ
The number 28 is not enough for a space, whereas 333 should be a space according to the actual font size. The font size is 9.96.
First of all please be aware that there is no absolute limit number separating numbers for spaces between words from spaces for kerning. All you can do is develop heuristics which will fail for some documents, usually for very tightly set ones.
Now remember how those numbers are applied when calculating the text displacement tx or ty from the origin of the last character before the number to the origin of the first character thereafter:
tx = ((w0 - TJnumber/1000) * Tfs + Tc + Tw) * Th
ty = (w0 - TJnumber/1000) * Tfs + Tc + Tw
(ISO 32000-1, section 9.4.4 Text Space Details, also discussed here; w0 is the glyph's displacement, Tfs the font size, Tc the character spacing, Tw the word spacing, and Th the horizontal scaling)
Thus, first of all such a number only widens the gap to the next character if it's negative.
Furthermore, the number is applied before the font size is multiplied; thus, one does not have to take the font size into account as I incorrectly claimed in a comment to the question.
The number (scaled by 1/1000) is directly subtracted from the glyph's displacement. So one can compare it with the glyph displacements of the font in question to get an impression of the meaning of the number.
The glyph displacements essentially are the numbers from the corresponding font's Widths or W array (defaulting to the MissingWidth / DW value) scaled by 1/1000. As both the TJ numbers and the widths are scaled by 1/1000, you can directly compare them.
Thus, an obvious option would be to compare the absolute value of negative TJ numbers to the width of the space glyph in the font in question. This differs from font to font, e.g. it's 600 for Courier, 278 for Helvetica, and 250 for Times-Roman.
Spaces between words created by TJ numbers don't necessarily have to be as wide as the full space glyph of the font, but a relevant fraction of it, e.g. half its value (YMMV), can be used as minimum for interpreting a TJ number as a space between words.
Unfortunately, though, if a PDF generator creates all spaces between words by TJ numbers and none by space glyphs, and if the font is embedded as a subset only, there is no need to embed the space glyph at all. In that case you might want to use other glyphs to compare with; often the width of a capital 'M' is used as a measure for the widths of a font, so you might want to use a relevant fraction thereof, e.g. one fifth (YMMV again).
You can improve your heuristics
by also taking the character spacing value Tc into account: If Tc / Tfs is negative with a relevant absolute value, the text is tightly set. In that case you might want to lessen the limit number determined as above. Or
by an analysis of all the TJ numbers in your text or those in the surrounding text. Here I can only guess, though, what might be acceptable heuristics...
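As a rough illustration of the basic comparison described above, here is a Python sketch; the function name, the default width, and the one-half threshold are my own assumptions, not a tested implementation:

def is_word_break(tj_number, space_width=250, fraction=0.5):
    """Heuristically decide whether a TJ number acts as a word break.

    tj_number   -- a number from a TJ array; only negative numbers
                   widen the gap to the next glyph
    space_width -- width of the space glyph from the font's Widths
                   array, in 1/1000 text space units (250 = Times-Roman)
    fraction    -- how much of a full space still counts as a word break
    """
    if tj_number >= 0:
        return False
    return -tj_number >= fraction * space_width

for n in (28, -333):
    print(n, is_word_break(n))   # 28 -> False, -333 -> True

For the question's font size of 9.96, the -333 widens the gap by 0.333 × 9.96 ≈ 3.3 text space units, while a full Times-Roman space is 0.250 × 9.96 ≈ 2.5, so it clearly qualifies as a word break.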

PDF 1.3 text conversion with escape sequence and character code

I've got a PDF 1.3 file from which I want to extract the text.
But in the stream there are two different types of text:
some plain text and some character-coded text with escape sequences.
Here is an example:
/TextClip BMC
BT
/T1_2 1 Tf
0 Tc 0 Tw 7 Tr 16.2626 0 0 16.2626 37.2512 581.738 Tm
(Test Test)Tj
ET
EMC
q
/GS0 gs
67.6799985 0 0 -13.4399997 37.439994 594.2399583 cm
/Im47 Do
Q
Q
Q
q
37.499 569.52 179.713 8.34 re
W n
q
/GS0 gs
180.959996 0 0 -9.5999998 36.959999 578.3999755 cm
/Im48 Do
Q
Q
q
37.499 569.52 179.713 8.34 re
W n
q
/TextClip BMC
BT
0 Tc 0 Tw 7 Tr 9.899 0 0 9.899 37.2512 569.7178 Tm
[(\000E\000V\000d\000e\000\003\000E\000V\000d\000e)]TJ
ET
EMC
In this example the text "Test Test" appears twice: once as plain text and once with the escape sequence \000E\000V\000d\000e\000\003\000E\000V\000d\000e.
I only knew that three digits after a backslash form an octal character code. But in my example there are sometimes four and sometimes three characters after the backslash.
The fourth character after the escape sequence is 15 off from the correct ASCII code (\000E is the character "T"). But what is the correct conversion?
The block \000\003 should be a space character. What is the conversion trick there?
Regards
The encoding of the string arguments of text showing instructions like TJ and Tj depends on the PDF font in question, cf. the specification
A string operand of a text-showing operator shall be interpreted as a sequence of character codes identifying the glyphs to be painted.
With a simple font, each byte of the string shall be treated as a separate character code. The character code shall then be looked up in the font’s encoding to select the glyph, as described in 9.6.6, "Character Encoding".
With a composite font (PDF 1.2), multiple-byte codes may be used to select glyphs. In this instance, one or more consecutive bytes of the string shall be treated as a single character code. The code lengths and the mappings from codes to glyphs are defined in a data structure called a CMap, described in 9.7, "Composite Fonts".
(section 9.4.3 - Text-Showing Operators - in ISO 32000-1)
The font used for the first text showing operation
(Test Test)Tj
probably is a simple font with an ASCII'ish encoding, probably WinAnsiEncoding. The font itself is selected two lines above in
/T1_2 1 Tf
so you only have to look up the font resource T1_2 in the associated resources (the resources of the page, if you are showing us an excerpt of a page content stream) to verify this.
The font used in the second text showing operation
[(\000E\000V\000d\000e\000\003\000E\000V\000d\000e)]TJ
appears to be a composite font with a double-byte encoding, probably Identity-H, and the underlying font program appears to have the glyph codes most often found in TrueType fonts. You should look for a ToUnicode mapping in that PDF font for easy decoding.
The instruction in which this font is selected is not among the instructions you posted; it must be somewhere above. The selection has been saved as part of the graphics state (by one of the earlier q instructions) and restored again (by one of the Q instructions between the two text-showing instructions you shared).
I only knew that three digits after a backslash form an octal character code. But in my example there are sometimes four and sometimes three characters after the backslash.
No, in your example there always are escape sequences with three octal digits. The character thereafter is a separate byte, i.e. you have the bytes '\000', 'E', '\000', 'V', '\000', 'd', '\000', 'e', '\000', '\003', '\000', 'E', '\000', 'V', '\000', 'd', '\000', and 'e'.
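To illustrate how such a literal string breaks down into bytes, here is a small Python sketch; the helper is my own and handles only octal escapes, not the full escape syntax of the PDF specification:

import re

def pdf_literal_bytes(s):
    # Split the contents of a PDF literal string into raw bytes.
    # Handles only octal escapes (a backslash followed by 1-3 octal
    # digits); a full parser must also treat the other escapes
    # defined in ISO 32000-1, section 7.3.4.2.
    out = bytearray()
    i = 0
    while i < len(s):
        m = re.match(r'\\([0-7]{1,3})', s[i:])
        if m:
            out.append(int(m.group(1), 8))
            i += len(m.group(0))
        else:
            out.append(ord(s[i]))
            i += 1
    return bytes(out)

print(pdf_literal_bytes(r'\000E\000V\000d\000e\000\003\000E\000V\000d\000e'))
# b'\x00E\x00V\x00d\x00e\x00\x03\x00E\x00V\x00d\x00e'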
As mentioned above, this looks like a double-byte encoding with in particular the mappings
\000E -> 'T'
\000V -> 'e'
\000d -> 's'
\000e -> 't'
\000\003 -> ' ' (space)
This appears to be a glyph encoding often found in TrueType fonts which for Latin letters merely means a constant offset to their Unicode codes.
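Putting that observation into code, here is a minimal Python sketch; the offset of 15 and the mapping of code 3 to space are inferred from this single sample and are not a general rule, so a real extractor must use the font's ToUnicode CMap instead:

def decode_example(raw):
    # Decode big-endian 2-byte codes, assuming (from this sample only)
    # that Latin letters sit 15 below their Unicode values and that
    # code 3 is the space glyph.
    chars = []
    for hi, lo in zip(raw[0::2], raw[1::2]):
        code = (hi << 8) | lo
        chars.append(' ' if code == 3 else chr(code + 15))
    return ''.join(chars)

raw = b'\x00E\x00V\x00d\x00e\x00\x03\x00E\x00V\x00d\x00e'
print(decode_example(raw))   # Test Test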
But there also are many different multi-byte encodings in common use; sometimes they are even ad-hoc encodings created only for the font on the page at hand.
Thus, if you seriously want to do text extraction from PDFs, you really have to study the PDF specification and implement along its requirements instead of hoping for some conversion hack.
Adobe has published a copy of the old PDF specification ISO 32000-1 on their web page at https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf

Detect if Base 64 string is image or text

Is there a way to detect if the Base 64 string contained in an NSData instance is an image or a text or any other object?
You can't generally just look at the base-64 string and decide, but you can decode the first few bytes of data, look at the hex codes (you can do this by decoding your base-64 string into an NSData and just NSLog it or examine it in the debugger), and draw some conclusions. For example:
Image files generally start with special byte sequences (e.g. JPEGs start with the hex bytes FF D8; PNGs generally start with the hex bytes 89 50 4E 47 0D 0A 1A 0A, i.e. 89, "PNG", CR, LF, DOS EOF, LF). Note, there are a dizzying number of different image formats, so this is a non-trivial exercise, but sometimes you can get lucky and it will be self-evident that it's one of these common formats when you glance at the first few bytes.
NSKeyedArchiver archives generally start with the string "bplist".
ASCII text consists of codes between 20 and 7E (with linefeeds represented by 0A; carriage return plus linefeed represented by 0D 0A; tab characters as 09; etc.). Then again, if it were text, it's unlikely they'd be base-64 encoding it.
If it were UTF-8, it would conform to the coding pattern outlined here. For example, you can look at the high bits of the first byte that might conceivably represent a UTF-8 character and conclude (a) how many bytes the character is represented by and (b) which high bits will be turned on in those subsequent bytes. You can often quickly look at the data and confirm whether it conforms to this UTF-8 pattern or not (this is especially easy for most western languages).
If the first three bytes were EF BB BF, that often indicates a UTF-8 byte order mark.
This is, by no means, an exhaustive list of codes, but just a few that leapt out at me.
To do this programmatically and do so exhaustively would be a non-trivial exercise. But if you're just "eye-balling" a base-64 string and trying to draw some logical inferences, decode it and look at the hex bytes and you can quickly narrow down the possibilities, at the very least. If you're unsure about how to interpret it, update your question with the hex representation of the decoded base-64 string (just the first 16-32 bytes, please), and we might be able to point you in the right direction.
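A quick sketch of that kind of sniffing (in Python for brevity; the signatures are the well-known ones, the rest is a deliberately naive illustration):

import base64

def sniff(b64_data):
    # Guess the payload type from the first bytes of base-64 data.
    data = base64.b64decode(b64_data)
    if data.startswith(b'\xff\xd8'):
        return 'JPEG'
    if data.startswith(b'\x89PNG\r\n\x1a\n'):
        return 'PNG'
    if data.startswith(b'bplist'):
        return 'binary plist (e.g. an NSKeyedArchiver archive)'
    if data.startswith(b'\xef\xbb\xbf'):
        return 'UTF-8 text with BOM'
    if all(0x20 <= b <= 0x7e or b in (0x09, 0x0a, 0x0d) for b in data[:32]):
        return 'probably ASCII text'
    return 'unknown'

print(sniff(base64.b64encode(b'\x89PNG\r\n\x1a\n' + b'\x00' * 8)))  # PNG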
It is impossible to reliably distinguish a plain text string from a Base64 image string. The only way is to check whether your string is a valid Base64 string: if it is, it is probably an image; if not, you can be sure it is text.
How to check whether a string is valid Base64 is covered in How to check whether the string is base64 encoded or not.
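For completeness, a sketch of that validity check in Python (the validate=True flag makes the decoder reject characters outside the Base64 alphabet):

import base64, binascii

def looks_like_base64(s):
    # True if s decodes as Base64 -- a weak hint, not proof of an image.
    try:
        base64.b64decode(s, validate=True)
        return len(s) % 4 == 0
    except binascii.Error:
        return False

print(looks_like_base64('iVBORw0KGgo='))  # True
print(looks_like_base64('hello world!'))  # False (space is not in the alphabet)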

Hexadecimal numbers vs. hexadecimal encoding (with base64 as well)

Encoding with hexadecimal numbers seems to be different from using hexadecimals to represent numbers. For example, the hex number 0x40 to me should be equal to 64, or BA_{64}, but when I put it through this hex to base64 converter, I get the output QA==, which to me is equal to some number times 64. Why is this?
Also, when I check the integer value of the hex string deadbeef I get 3735928559, but when I check it in other places I get 222 173 190 239. Why is this?
Addendum: So I guess it is because it is easier to break the number into bit chunks than to treat it as a whole number when encoding? That is pretty confusing to me, but I guess I get it.
You may wish to read this:
http://en.wikipedia.org/wiki/Base64
In summary, base64 specifies a particular encoding, which assigns different values to the letters than their ASCII codes.
For the second part: one source is treating the entire string as a 32-bit integer, and the other is dividing it into bytes and giving the value of each byte.
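Both points are easy to reproduce with a short Python sketch (assuming the converter simply base64-encodes the raw bytes):

import base64

# 0x40 is the single byte 0100 0000. Base64 regroups bits into 6-bit
# units: 010000 | 00(0000) -> 'Q', 'A', plus '==' padding.
print(base64.b64encode(bytes([0x40])))   # b'QA=='

# 'deadbeef' as one 32-bit integer vs. as four separate bytes:
print(int('deadbeef', 16))               # 3735928559
print(list(bytes.fromhex('deadbeef')))   # [222, 173, 190, 239]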

Do certain characters take more bytes than others?

I'm not very experienced with lower-level things such as how many bytes a character takes. I tried to find out whether one character equals one byte, but without success.
I need to set a delimiter used for socket connections between a server and clients. This delimiter has to be as small (in bytes) as possible, to minimize bandwidth.
The current delimiter is "#". Would choosing another delimiter decrease my bandwidth?
It depends on what character encoding you use to translate between characters and bytes (which are not at all the same thing):
In ASCII or ISO 8859, each character is represented by one byte
In UTF-32, each character is represented by 4 bytes
In UTF-8, each character uses between 1 and 4 bytes
In ISO 2022, it's much more complicated
US-ASCII characters (of which # is one) will take only 1 byte in UTF-8, which is the most popular encoding that allows multibyte characters.
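A quick way to check this for any candidate delimiter is to encode it and count the bytes; a Python sketch:

# Byte length of a few characters in common encodings.
for ch in ('#', 'é', '€'):
    for enc in ('utf-8', 'utf-16-le', 'utf-32-le'):
        print(repr(ch), enc, len(ch.encode(enc)))
# '#' is 1 byte in UTF-8 -- as small as a delimiter can get.
# 'é' needs 2 bytes and '€' needs 3 in UTF-8; UTF-16-LE uses 2 bytes
# for each of these, and UTF-32-LE always uses 4.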
It depends on the encoding. In single-byte character sets such as ANSI and the various ISO 8859 character sets, it is one byte per character. Some encodings, such as UTF-8, are variable-width: the number of bytes needed to encode a character depends on the character being encoded.
The answer, of course, is that it depends. If you are in a pure ASCII environment, then yes, every char takes 1 byte, but if you are in a Unicode environment (all of Windows, for example), then chars can range from 1 to 4 bytes in size.
If you choose a char from the ASCII set, then yes, your delimiter is as small as possible.
No, all characters are 1 byte, unless you're using Unicode or wide characters (for accents and other symbols for example).
A character is 1 byte, or 8 bits, long, which gives 256 possible combinations to form characters with. One-byte characters are called ASCII characters. They use only 7 bits (the 8th bit, though available, was historically reserved for other uses such as parity) to form the standard alphabet and the various symbols used when teletypes and typewriters were still common.
You can find an ASCII chart and what numbers correspond to what characters here.