I recently discovered an issue with IE10. We have a web page that displays English text beside a translation in Japanese. Some of the Japanese characters display as squares. In the view source page all characters are correctly rendered. The database also has the characters correctly rendered. The unusual part is that when I block the characters with the cursor they convert to the correct characters.
IE10 I believe has a bug.
Anyone having similar issue or know of a fix? Checked all language settings, regional settings, browser font settings and many other tests. Nothing corrects this issue.
This issue was related to a dual byte character which some fonts and windows applications will support.
Some older fonts may use a two hex character representation to present a single character. Some fonts support this and some do not.
In this case the characters at issue were the following…..
ジ
シ and ゙
The latter two which I think are special characters that combined are intended to represent ジ.
The Unicode Standard from the Unicode ISO web site table defines them like so…..
Decimal Character HEX Name
12472 ジ 30B8 KATAKANA LETTER ZI
12471 シ 30B7 KATAKANA LETTER SI
12441 っ゙ 3099 COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK (combined with small tu (っ))
So some fonts use 12471 + 12441 to make 12472. This is what I found. But the actual string has 12471 + 12441 and not 12472 or the hex: 0x30B7, 0x3099 and not 0x30B8.
Any time a font being used does not support this binding, a box is displayed. The challenge is that it may be as simple as someone creating a birthday card using a non-compliant UTF8 font that could cause a PC to not allow the character to display correctly.
Related
Situation: I've a PDF using version 1.6. In that PDF, there are several streams. There were compressed text (Flate) in that streams, so I decompressed these streams. After that, I extracted the Tj-parts of the corresponding, decompressed streams. I assumed that there would be readable text between the brackets before the Tj command, but the result was the following:
Actual Question: As I have no idea, what I've got thre, I would like to know what type of content it is. Furthermore: Is it possible to get a plain text out of these string or do I need further information to extract plain texts?
Further research: The PDFs, which I try to analyze where generated by iTextSharp (seems to be an C# Library for generating PDFs). Don't know whether it is a relevant information, but it might be that that Library uses a special way of encrypt it's text data or something...
I assumed that there would be readable text between the brackets before the Tj command
This assumption only holds for simple PDFs.
To quote from the PDF specification (ISO 32000-1):
A string operand of a text-showing operator shall be interpreted as a sequence of character codes identifying the glyphs to be painted.
With a simple font, each byte of the string shall be treated as a separate character code. The character code shall then be looked up in the font’s encoding to select the glyph, as described in 9.6.6, "Character Encoding".
With a composite font (PDF 1.2), multiple-byte codes may be used to select glyphs. In this instance, one or more consecutive bytes of the string shall be treated as a single character code. The code lengths and the mappings from codes to glyphs are defined in a data structure called a CMap, described in 9.7, "Composite Fonts".
(Section 9.4.3 - Text-Showing Operators - ISO 32000-1)
Thus,
I would like to know what type of content it is.
As quoted above, these "strings" consist of single-byte or multi-byte character codes. These codes depend on the current font's encoding. Each font object in a PDF can have a different encoding.
Those encodings may be some standard encoding (MacRomanEncoding, MacExpertEncoding, or WinAnsiEncoding) or some custom encoding. In particular in case of embedded font subsets you often find encodings where 1 is the code of the first glyph drawn on a page, 2 is the code for the second, different glyph, 3 for the third, different one, etc.
Furthermore: Is it possible to get a plain text out of these string or do I need further information to extract plain texts?
As the encoding of the string arguments of text showing instructions depends on the current font, you at least need to keep track of the current font name (Tf instruction) and look up encoding information (Encoding or ToUnicode map) from the current font object.
Section 9.10 - Extraction of Text Content - of ISO 32000-1 explains this in some more detail.
Furthermore, the order of the text showing instructions need not be the order of reading. The word "Hello" can e.g. be shown by first drawing the 'o', then going left, then the 'el', then again left, then the 'H', then going right, and finally the remaining 'l'. And two words need not be separated by a space glyph, there simply might be a text positioning instruction going right a bit.
Thus, in general you also have to keep track of the position of the separate strings drawn.
My program makes calculations on physics vectors and it allows copy/pasting from websites and then tries to parse them into the x, y, and z components automatically. I've come across one website (http://mathinsight.org/cross_product_examples) that has (3,−3,1). While that looks normal, that minus is actually not recognized by VB. Visually, it is longer than the normal minus (− and -), but return the same Unicode of 45. This picture shows the Unicode for every character (I added a minus in front of the first 3 for comparison) in the Textbox. Also, from this website, I had to use Ctrl+c because right clicking shows that this is not simple HTML.
One is valid (the first), but the second gives VB fits as shown below. Either it won't compile (shown by the blue line below) or a simple assignment (the second one) wrecks havok on my form.
I have tried using
vectorString.Replace("–", "-")
and pasting in the longer dash for the target string and a normal keystroke dash as the replacement, but nothing happens. I'm guessing that since they both have the same Unicode.
Is there some way to convert the longer, invalid dash into the one recognized by VB? I tried using dash symbol that Word likes to replace the minus sign with and it comes up as Unicode 150. So, apparently there are at least three different kinds of dashes. Any thoughts?
The character from Math Insight is U+2212, minus sign. The character you tried using in your Replace call is U+2013, en dash. That's why your replace didn't work.
Beyond the standard ASCII hyphen (-, U+0045), there are two common dashes: the en dash (–, U+2013) and the em dash (—, U+2014). There is also a figure dash (‒, U+2012), but it is not as common.
I've been looking into a PDF file to understand how it is built.
I noticed that InDesign has created PDFs with text as below (after decompression using pdftk).
0 Tc /Span<</ActualText<FEFF0009>>> BDC
4.018 -0.2 Td
( )Tj
I understand the role of ActualText (for copy/paste/searching) but I'm wondering exactly how I should be interpreting the FEFF0009. It looks like a UTF-16 string with BOM chars to represent a tab character. This seems incorrect as it's really a space. I'm wondering if there is a special meaning here?
.. This seems incorrect as it's really a space.
No, it's really a tab.
14.9.4 Replacement Text
NOTE 1: Just as alternate descriptions can be provided for images and other items that do not translate naturally into text (as described in the preceding sub-clause), replacement text can be specified for content that does translate into text but that is represented in a nonstandard way.
(PDF 32000-1:2008)
The PDF text engine does not support the concept of 'tabs'. In this case, InDesign mimicked the function of a tab character by inserting a space in the text stream, and it could set the space width to match the distance spanned by the original tab or use a large relative positioning for the rest of the text (which it did here: the horizontal displacement of 4.018 in your code snippet).
The general idea is that a space is rendered on the position of the tab, but when you copy this text and paste somewhere else you get a tab character. I suppose the 'space' is only inserted to have something to copy.
How could i convert a Greek string, to Unicode with VB.NET, without knowing the source encoding?
Without knowing you can't do something very reliable.
But if you know for sure it will be Greek, then you can try the supported Greek code pages:
windows-737 = OEM - Greek 437G
windows-869 = OEM - Modern Greek
windows-875 = IBM EBCDIC - Modern Greek
windows-1253 = Windows - Greek
windows-10006 = MAC - Greek I
windows-20423 = IBM EBCDIC - Greek
windows-28597 = ISO 8859-7 Greek
The most likely one is 1253 (not 1250 as above).
But you can try all of them, one at the time, then check if the resulting characters are in the Greek (and maybe Latin, if you want to accept that).
For validation you can use RegExp with \p (http://msdn.microsoft.com/en-us/library/az24scfc.aspx#character_classes) and using the desired Unicode blocks (http://msdn.microsoft.com/en-us/library/20bw873z.aspx#SupportedNamedBlocks).
You can try [\p{IsBasicLatin}\p{IsGreek}]* (and maybe add IsGreekExtended, although you will not get that from any of the listed code pages).
If you get something else (let's say Cyrillic) you know you got the wrong code page.
Sorry, but without knowing the code page all you do is guess. And there is only so much you can do to improve that guess.
i have some truetype fonts and a programm takes these fonts so that a user can select a font he like to put some symbols around. The programm save these information (which font name und character code) in a file. (I dont have the source of this programm)
Now i have to reed these file into another programm (vb.net) and get the character from the character code. And here comes the problem.
If i'll try chr(144) i'll get an empty char back ... but in the font which the user has selected befor, the character, which display a symbol, exists with the character ç.
Have i to load the font on runtime or what i have to?
I have tried already CharW(144) but with the same result: I'll get an empty char but i need to get the ç
Kind regards
Nico
According to the Extended Latin-1 code chart, ç is U+00E8 (232 in decimal) so I suggest you try ChrW(232).
The value returned by Chr depends on the current thread's default encoding (and I seem to remember it's possible to provoke some odd results) - I would try to avoid it if possible. If you know the encoding you need to use, then use it explicitly with Encoding.GetString etc. Otherwise, stick to Unicode values wherever possible.