String to Unicode conversion with VB.NET - vb.net

How can I convert a Greek string to Unicode with VB.NET without knowing the source encoding?

Without knowing the source encoding, you can't do anything very reliable.
But if you know for sure it will be Greek, then you can try the supported Greek code pages:
windows-737 = OEM - Greek 437G
windows-869 = OEM - Modern Greek
windows-875 = IBM EBCDIC - Modern Greek
windows-1253 = Windows - Greek
windows-10006 = MAC - Greek I
windows-20423 = IBM EBCDIC - Greek
windows-28597 = ISO 8859-7 Greek
The most likely one is 1253 (not 1250 as above).
But you can try all of them, one at a time, then check whether the resulting characters are all in the Greek block (and maybe Latin, if you want to accept that).
For validation you can use RegExp with \p (http://msdn.microsoft.com/en-us/library/az24scfc.aspx#character_classes) and using the desired Unicode blocks (http://msdn.microsoft.com/en-us/library/20bw873z.aspx#SupportedNamedBlocks).
You can try [\p{IsBasicLatin}\p{IsGreek}]* (and maybe add IsGreekExtended, although you will not get that from any of the listed code pages).
If you get something else (let's say Cyrillic) you know you got the wrong code page.
Sorry, but without knowing the code page all you do is guess. And there is only so much you can do to improve that guess.
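The try-each-code-page-then-validate approach can be sketched roughly as follows. This is only an illustration in Python, not VB.NET: the codec names are Python's (cp1253 and iso8859_7 correspond to code pages 1253 and 28597), code page 20423 is left out of the sketch, the function name guess_greek_decodings is hypothetical, and the explicit character ranges stand in for \p{IsBasicLatin} and \p{IsGreek}.

```python
import re

# Candidate Greek codecs, most likely first (Python codec names, an assumption
# of this sketch; .NET identifies them by code page number instead).
GREEK_CODECS = ["cp1253", "iso8859_7", "cp737", "cp869", "mac_greek", "cp875"]

# Basic Latin (U+0000-U+007F) plus Greek and Coptic (U+0370-U+03FF).
ALLOWED = re.compile(r"[\u0000-\u007F\u0370-\u03FF]*")


def guess_greek_decodings(raw: bytes):
    """Return (codec, text) pairs whose decoded text is only Latin/Greek."""
    results = []
    for name in GREEK_CODECS:
        try:
            text = raw.decode(name)
        except UnicodeDecodeError:
            # Some byte has no mapping in this code page: wrong guess.
            continue
        if ALLOWED.fullmatch(text):
            results.append((name, text))
    return results


if __name__ == "__main__":
    sample = "Καλημέρα".encode("cp1253")
    for codec, text in guess_greek_decodings(sample):
        print(codec, text)
```

More than one candidate can pass the check (1253 and ISO 8859-7 share most letter positions), so this narrows the guess rather than settling it.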

Related

Base64 Encoded String for Filename

I can't think of an OS (Linux, Windows, Unix) where this would cause an issue, but maybe someone here can tell me if this approach is undesirable.
I would like to use a base64 encoded string as a filename. Something like gH9JZDP3+UEXeZz3+ng7Lw==. Is this likely to cause issues anywhere?
Edit: I will likely keep this to a max of 24 characters
Edit: It looks like I have a character that will cause issues. The function that generates my string is producing strings like: J2db3/pULejEdNiB+wZRow==
You will notice that this has a / which is going to cause issues.
According to this site the / is a valid base64 character so I will not be able to use a base64 encoded string for a filename.
No. You can not use a base64 encoded string for a filename. This is because the / character is valid for base64 strings which will cause issues with file systems.
https://base64.guru/learn/base64-characters
Alternatives:
You could use base64 and then replace unwanted characters but a better option would be to hex encode your original string using a function like bin2hex().
The official RFC 4648 states:
An alternative alphabet has been suggested that would use "~" as the 63rd character. Since the "~" character has special meaning in some file system environments, the encoding described in this section is recommended instead. The remaining unreserved URI character is ".", but some file system environments do not permit multiple "." in a filename, thus making the "." character unattractive as well.
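The encoding that RFC 4648 recommends in that section is its URL- and filename-safe alphabet, which uses - and _ in place of + and /. Below is a minimal sketch of both alternatives (that alphabet, and hex encoding as the counterpart of bin2hex()), written in Python for illustration; the function names are hypothetical.

```python
import base64


def filename_safe_token(data: bytes) -> str:
    """Base64 with the RFC 4648 URL- and filename-safe alphabet ('-' and '_')."""
    return base64.urlsafe_b64encode(data).decode("ascii")


def hex_token(data: bytes) -> str:
    """Hex encoding: longer, but only 0-9 and a-f ever appear."""
    return data.hex()


if __name__ == "__main__":
    # The problematic value from the question, decoded back to its raw bytes.
    raw = base64.b64decode("J2db3/pULejEdNiB+wZRow==")
    print(filename_safe_token(raw))
    print(hex_token(raw))
```

For 16 bytes of input the base64 form stays at 24 characters, while the hex form is 32 characters long.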
I also found this on Server Fault:
There is no such thing as a "Unix" filesystem. Nor a "Windows" filesystem come to that. Do you mean NTFS, FAT16, FAT32, ext2, ext3, ext4, etc. Each have their own limitations on valid characters in names.
Also, your question title and question refer to two totally different concepts? Do you want to know about the subset of legal characters, or do you want to know what wildcard characters can be used in both systems?
http://en.wikipedia.org/wiki/Ext3 states "all bytes except NULL and '/'" are allowed in filenames.
http://msdn.microsoft.com/en-us/library/aa365247(VS.85).aspx describes the generic case for valid filenames "regardless of the filesystem". In particular, the following characters are reserved < > : " / \ | ? *
Windows also places restrictions on not using device names for files: CON, PRN, AUX, NUL, COM1, COM2, COM3, etc.
Most commands in Windows and Unix-based operating systems accept * as a wildcard and ? as a single-character wildcard.
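The restrictions quoted above can be turned into a rough pre-check for a candidate filename. This is a sketch under assumptions, not a complete validator: the function name looks_portable is hypothetical, the LPT1 to LPT9 entries come from the Microsoft naming page rather than from the quote, and other rules (length limits, trailing dots or spaces) are not covered.

```python
# Characters reserved on Windows, as listed in the quoted answer.
WINDOWS_RESERVED_CHARS = set('<>:"/\\|?*')

# Reserved device names; the LPT entries are an addition from the Microsoft
# naming documentation, not from the quoted text.
WINDOWS_DEVICE_NAMES = (
    {"CON", "PRN", "AUX", "NUL"}
    | {f"COM{i}" for i in range(1, 10)}
    | {f"LPT{i}" for i in range(1, 10)}
)


def looks_portable(name: str) -> bool:
    """Rough check that a name avoids the reserved characters and device names."""
    if not name or "\0" in name:
        return False
    if any(ch in WINDOWS_RESERVED_CHARS for ch in name):
        return False
    # A device name followed by an extension (e.g. NUL.txt) is reserved too.
    stem = name.split(".", 1)[0].upper()
    return stem not in WINDOWS_DEVICE_NAMES


if __name__ == "__main__":
    for candidate in ["gH9JZDP3+UEXeZz3+ng7Lw==", "J2db3/pULejEdNiB+wZRow==", "nul.txt"]:
        print(candidate, looks_portable(candidate))
```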
And this other one:
Base64 only contains A–Z, a–z, 0–9, +, / and =. So the list of characters not to be used is: all possible characters minus the ones mentioned above.
For special purposes . and _ are possible, too.
Which means that instead of the standard / base64 character, you should use _ or .; both on UNIX and Windows.
Many programming languages allow you to replace all / with _ or ., as it's only a single character and can be accomplished with a simple loop.
In Windows, you should be fine as long as you conform to the Windows naming conventions:
https://learn.microsoft.com/en-us/windows/win32/fileio/naming-a-file#naming-conventions.
As far as I know, / is the only one of the reserved characters that a base64-encoded string can contain.
The other thing that is probably going to be a problem is the length of the file name.

Saving CSV file with degree symbol and ASCII encoded

I have a string variable txt that contains the "°" degree symbol. I would like to save the string to an ASCII-encoded CSV file. I use the procedure below, but the "°" symbol is converted to "?". Do you have any idea how to save the degree symbol properly?
Public Sub Write_File(ByVal txt As String, ByVal fName As String)
    Try
        Using OutFile As New StreamWriter(fName, False, Text.Encoding.ASCII)
            OutFile.Write(txt)
        End Using
        Me.Write_Log("Successfully Exported")
    Catch ex As Exception
        Me.Write_Log("Write Error during export")
    End Try
End Sub
Encoding.ASCII is for the standard 7-bit ASCII encoding, which does not contain a degree symbol at all. In order to get a degree symbol in ASCII, you would have to use one of the many 8-bit ASCII encodings. For English, you'd probably be most interested in using the ISO 8859-1 code page, since that's the most standard-ish one there is of the bunch. For instance, instead of using Encoding.ASCII, you could do something like this:
Using OutFile As New StreamWriter(fName, False, Text.Encoding.GetEncoding("iso-8859-1"))
    OutFile.Write(txt)
End Using
For a complete list of available encodings, use the Encoding.GetEncodings method, or look at the list of supported ones in the MSDN documentation.
Of course, none of the various 8-bit ASCII encodings are compatible with each other, so, if you do use that, the degree symbol will be a completely different symbol when viewed on a system that uses a different code page by default. That is precisely why UTF-8 has become the new standard. Usage of 8-bit ASCII is widely discouraged since it is practically unworkable in multi-cultural scenarios. If you can use UTF-8 instead, I would. If you must use ASCII, it's best to stick to the standard 7-bit encoding. If you must use an 8-bit ASCII encoding, please do so sparingly and with full awareness of its drawbacks.
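To make the difference between these encodings concrete, here is a small sketch of the bytes each one produces for a string containing the degree symbol. It is written in Python purely as an illustration; the function name encoded_forms is hypothetical, and the "replace" error mode is used to mimic the "?" substitution described in the question.

```python
DEGREE = "\u00b0"  # the degree sign


def encoded_forms(text: str) -> dict:
    """Encode the same text three ways to compare the resulting bytes."""
    return {
        # 7-bit ASCII has no degree sign, so it is replaced with "?".
        "ascii": text.encode("ascii", errors="replace"),
        # ISO 8859-1 stores it as a single byte.
        "iso-8859-1": text.encode("iso-8859-1"),
        # UTF-8 stores it as two bytes.
        "utf-8": text.encode("utf-8"),
    }


forms = encoded_forms("25" + DEGREE)

if __name__ == "__main__":
    for name, data in forms.items():
        print(name, data)
```

Whoever reads the file has to decode it with the same encoding that was used to write it; reading the ISO 8859-1 bytes as UTF-8, for example, fails because the lone 0xB0 byte is not valid UTF-8.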
One more thing. You mention the degree symbol as being character 167 (0xA7) in your desired target encoding. If that is the case, you may actually be wanting IBM437 encoding rather than ISO 8859-1. IBM437 is the old code page that was used by default in MS-DOS. If you really need to use that code page, you may have additional trouble for two reasons. As you'll see in the MSDN article, that code page is not well supported in the .NET framework. In my testing, outputting the Unicode string containing the degree symbol using that encoding did not work properly. Therefore, you may find yourself needing to use a byte array to represent the data rather than a String variable (which is Unicode). For instance:
File.WriteAllBytes("Test.txt", {167})
The second problem is that IBM437 is likely not the default code page for your Windows OS, so even when it is written to the file as byte value 167, it won't actually look like a degree symbol when you view it in a Windows application such as Notepad.
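As a cross-check on the byte values, the sketch below uses Python's cp437 codec, which is assumed here to match the IBM437 code page. In that codec byte 167 decodes to º (the masculine ordinal indicator, which merely resembles a degree sign), while the degree sign itself encodes to byte 248. That may be one reason why encoding a Unicode degree symbol with this code page does not yield byte 167.

```python
CODEC = "cp437"  # Python's name for the old MS-DOS code page (assumed = IBM437)

# What byte 167 means in this code page.
byte_167 = bytes([167]).decode(CODEC)

# Which byte the real degree sign (U+00B0) maps to in this code page.
degree_bytes = "\u00b0".encode(CODEC)

# The same byte 167 read as ISO 8859-1 is a different character again.
byte_167_latin1 = bytes([167]).decode("iso-8859-1")

if __name__ == "__main__":
    print(repr(byte_167), degree_bytes, repr(byte_167_latin1))
```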

IE 10 not rendering Japanese correctly

I recently discovered an issue with IE10. We have a web page that displays English text beside a translation in Japanese. Some of the Japanese characters display as squares. In the view-source page all characters are rendered correctly. The database also has the characters stored correctly. The unusual part is that when I select the characters with the cursor, they change to the correct characters.
I believe IE10 has a bug.
Is anyone having a similar issue, or does anyone know of a fix? I have checked all language settings, regional settings and browser font settings, and run many other tests. Nothing corrects this issue.
This issue was related to a character stored as a two-code-point sequence, which only some fonts and Windows applications support.
Some older text represents a single character as two code points (a base character followed by a combining mark). Some fonts support this and some do not.
In this case the characters at issue were the following:
ジ
シ and ゙
I think the latter two are special characters that, combined, are intended to represent ジ.
The Unicode Standard code charts define them like so:
Decimal  Character  Hex   Name
12472    ジ         30B8  KATAKANA LETTER ZI
12471    シ         30B7  KATAKANA LETTER SI
12441    っ゙        3099  COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK (shown combined with small tu (っ))
So some fonts render 12471 + 12441 as 12472. What I found is that the actual string contains 12471 + 12441 rather than 12472 (in hex: 0x30B7, 0x3099 rather than 0x30B8).
Any time the font being used does not support this combination, a box is displayed. The challenge is that it may be as simple as someone creating a birthday card using a non-compliant UTF-8 font that could cause a PC to not display the character correctly.
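The relationship between the two forms can be checked with Unicode normalization. The sketch below is in Python and only illustrates the point; normalizing to NFC before output is a possible workaround that the answer above does not mention, so treat it as an assumption rather than the fix that was actually applied.

```python
import unicodedata

# KATAKANA LETTER SI followed by the combining voiced sound mark.
decomposed = "\u30b7\u3099"

# NFC normalization composes the pair into the single precomposed letter.
composed = unicodedata.normalize("NFC", decomposed)

if __name__ == "__main__":
    print([hex(ord(ch)) for ch in decomposed])
    print([hex(ord(ch)) for ch in composed])
```

After normalization the string holds the single code point 0x30B8, which does not rely on the font supporting the combining mark.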

XCode - Display Vietnamese : Unicode problem

I need to display Vietnamese in my app, but right now I cannot show the words in the correct format. For example, for text such as "&#code;" I cannot convert it to Vietnamese; it just displays "&#code;".
Can anyone help me handle Unicode text like this?
Thanks a lot!
Tisa
Just write the Unicode string directly inside @"..." without escaping. Strictly speaking, that's non-portable, but as long as you use it just for Objective-C, it should be OK. It should work on a modern Xcode toolchain.
In general, you need to understand that &#...; is a way to escape a Unicode character in HTML, not in a C string. In C, if you want to be most portable, you need to use \x escapes. Some newer compilers accept \u... and \U... for Unicode characters.
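If the text arrives with HTML numeric character references already in it, they have to be decoded before display. A minimal sketch of that decoding step, in Python rather than Objective-C and with a hypothetical function name, assuming the input uses references such as &#7879; (decimal for U+1EC7):

```python
import html


def decode_references(text: str) -> str:
    """Turn HTML character references such as &#7879; into real characters."""
    return html.unescape(text)


if __name__ == "__main__":
    print(decode_references("Vi&#7879;t"))
```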

vb.net font chr()

I have some TrueType fonts, and a program uses these fonts so that a user can select a font they like and place some symbols. The program saves this information (the font name and character code) in a file. (I don't have the source of this program.)
Now I have to read this file into another program (VB.NET) and get the character from the character code. And here comes the problem.
If I try Chr(144) I get an empty char back, but in the font the user selected before, the character that displays the symbol exists as the character ç.
Do I have to load the font at runtime, or what do I have to do?
I have already tried ChrW(144) with the same result: I get an empty char, but I need to get the ç.
Kind regards
Nico
According to the Latin-1 Supplement code chart, ç is U+00E7 (231 in decimal), so I suggest you try ChrW(231).
The value returned by Chr depends on the current thread's default encoding (and I seem to remember it's possible to provoke some odd results) - I would try to avoid it if possible. If you know the encoding you need to use, then use it explicitly with Encoding.GetString etc. Otherwise, stick to Unicode values wherever possible.
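The difference between a Unicode code point and a byte value interpreted through a code page can be illustrated as follows. The sketch is in Python, not VB.NET, and the choice of cp437 and ISO 8859-1 as example code pages is an assumption made only to show that the same number can mean different characters.

```python
# A Unicode code point is unambiguous: 231 (U+00E7) is always ç.
from_code_point = chr(231)

# A byte value only means something once a code page is chosen.
as_cp437 = bytes([144]).decode("cp437")        # an accented capital E
as_latin1 = bytes([144]).decode("iso-8859-1")  # an invisible control character

if __name__ == "__main__":
    print(repr(from_code_point), repr(as_cp437), repr(as_latin1))
```

Under ISO 8859-1 the value 144 is a non-printing control character, which is consistent with getting what looks like an empty char back.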