Is there any way to check if the string is UNICODE using VB.net.
Best Regards
inchikka
You need to read the file using the Encoding that the file is written in.
It appears to be a non Unicode file that you are trying to read as Unicode, or possibly a different Unicode encoding than the default UTF-8 (could be UTF-16 for example).
StreamWriter has several constructors that the an Encoding as parameter.
You can do it by validating each character in the string against the 128 characters in the ASCII table. If the character is not found there then it might be a unicode character.
Is that what you mean?
Related
This is a basic question, but I can't find anything on it, since I don't know what to search — each of my tries have come up with unrelated results.
If I use Text.Encoding.ASCII.GetBytes to convert a string into ASCII, does each byte represent exactly one character? Does the following code work as exactly intended in all circumstances (for all Strings other than the examples)?
Dim t1() As Byte = Text.Encoding.ASCII.GetBytes("Hello ")
Dim t2() As Byte = Text.Encoding.ASCII.GetBytes("World")
Dim msg As String = Text.Encoding.ASCII.GetString(t1.Concat(t2).ToArray)
Now msg should be "Hello World".
I would like this to work as I don't want to have to convert data I receive back to Strings in order to manipulate it before it is sent again.
What if I used something other than ASCII (like UTF-8, for example)?
If I use Text.Encoding.ASCII.GetBytes to convert a string into ASCII, does each byte represent exactly one character?
Yes. ASCII is a 7bit encoding, it does not support multi-byte characters. Any Unicode codepoint above U-007F will get converted to a ? character in ASCII.
If you were to use UTF-7 instead, for instance, it can encode individual Unicode codepoints into a sequence of multiple ASCII characters.
Does the following code work as exactly intended in all circumstances (for all Strings other than the examples)?
In your particular example, yes (provided you are using LINQ's Concat() method - there are other ways to concat arrays together). There is no data loss.
But for other examples, just know that you will have data loss if you convert non-ASCII characters to ASCII, or otherwise mismatch encodings between GetBytes() and GetString().
You can certainly manipulate byte arrays. Just make sure the arrays are in the same encoding if you merge them together.
.NET strings are counted sequences of UTF-16 code units (char), one or two of which encode a Unicode codepoint (int Char.ConvertToUtf32 ). Some codepoints are "combining characters", which when applied to a preceding "base character" form a grapheme (which is then rendered by a font into a glyph).
An encoder from Unicode to an encoding of another character set should attempt to preserve graphemes. In .NET, a grapheme is called a "text element."
So, yes, you can combine encoded byte sequences as long as you haven't defeated the encoder by converting parts of a grapheme into different byte sequences. If you are breaking a string into two before encoding, see TextElementEnumerator and StringInfo class.
I have string variable txt. It contains "°" degree symbol. I would like to save string into CSV file ASCII encoded. I use the procedure below But the "°" symbol is converted to "?". Do you have any idea how to save properly degree symbol?
Public Sub Write_File(ByVal txt As String, ByVal fName As String)
Try
Using OutFile As New StreamWriter(fName, False, Text.Encoding.ASCII)
OutFile.Write(txt)
End Using
Me.Write_Log("Succesfully Exported")
Catch ex As Exception
Me.Write_Log("Write Error during export")
End Try
End Sub
Encoding.ASCII is for the standard 7-bit ASCII encoding, which does not contain a degree symbol at all. In order to get a degree symbol in ASCII, you would have to use one of the many 8-bit ASCII encodings. For English, you'd probably be most interested in using the ISO 8859-1 code page, since that's the most standard-ish one there is of the bunch. For instance, instead of using Encoding.ASCII, you could do something like this:
Using OutFile As New StreamWriter(fName, False, Text.Encoding.GetEncoding("iso-8859-1"))
OutFile.Write(txt)
End Using
For a complete list of available encodings, use the Encoding.GetEncodings method, or look at the list of supported ones in the MSDN documentation.
Of course, none of the various 8-bit ASCII encodings are compatible with each other, so, if you do use that, the degree symbol will be a completely different symbol when viewed on a system that uses a different code page by default. That is precisely why UTF-8 has become the new standard. Usage of 8-bit ASCII is widely discouraged since it is practically unworkable in multi-cultural scenarios. If you can use UTF-8 instead, I would. If you must use ASCII, it's best to stick to the standard 7-bit encoding. If you must use an 8-bit ASCII encoding, please do so sparingly and with full awareness of its drawbacks.
One more thing. You mention the degree symbol as being character 167 (0xA7) in your desired target encoding. If that is the case, you may actually be wanting IBM437 encoding rather than ISO 8859-1. IBM437 is the old code page that was used by default in MS-DOS. If you really need to use that code page, you may have additional trouble for two reasons. As you'll see in the MSDN article, that code page is not well supported in the .NET framework. In my testing, outputting the Unicode string containing the degree symbol using that encoding did not work properly. Therefore, you may find yourself needing to use a byte array to represent the data rather than a String variable (which is Unicode). For instance:
File.WriteAllBytes("Test.txt", {167})
The second problem is that IBM437 is likely not the default code page for your windows OS, so even when it is written to the file as byte value 167, it won't actually look like a degree symbol when you view it in a windows application such as notepad.
I am parsing some data using NSJSONSerialization. After parsing, I get strings like ä ; and %#339; which i think has something to do with encoding. But NSJSONSerialzation doesn't ask for what encoding it requires, it i guess detects it by itself. So my question is, how can I get proper strings instead of these weird ä ; and %#339;.
NSJSONSerialization assumes the encoding is one of the Unicode encodings. Make sure the data you pass to it is in UTF-8 (or UTF-16). ä is C3 A4 in UTF-8 or E4 in UTF-16.
Note that the default encoding for HTTP if none is specified is ISO-8859-1, so it may be that you are passing ISO-8859-1 data instead of UTF-8.
In options try NSJSONReadingMutableLeaves, it must return NSMutableString.. For more take a look at the docs.
I have a string "Artîsté". I use json_encode from PHP on it and I get "Art\u00eest\u00e9".
How do I convert that to an NSString? I have tried many things and none of them work I always end up getting Artîsté
For Example:
NSString stringWithUTF8String:"Art\u00c3\u00aest\u00c3\u00a9"];//Artîsté
#"Art\u00c3\u00aest\u00c3\u00a9"; //Artîsté
You can use CFStringCreateFromExternalRepresentation with the kCFStringEncodingNonLossyASCII encoding to parse the \uXXXX escape sequences. Check out my answer here:
Converting escaped UTF8 characters back to their original form
The problem is your input string:
"Art\u00c3\u00aest\u00c3\u00a9"
does in fact literally mean "Artîsté". \u00c3 is 'Ã', \u00ae is '®', and \u00a9 is '©'.
Whatever is producing your input string is receiving UTF-8 input but expecting something else (e.g., cp1252, ISO-8859-1, or ISO-8859-15)
I'm programming in VB.NET using Visual Studio 2008.
I need to define a string literal containing the character "÷" equivalent to Chr(247).
I understand that internally VS uses UTF-16 encoding, but when the source file is written to disk it contains the single byte value F7 for this character.
This source file is processed by another program that uses UTF-8 encoding by default, so it fails to interpret this character correctly, attempting to combine it with the following single-byte character.
What encoding would correctly interpret the single byte F7 as the single character ÷?
Alternatively, is there a way of expressing a non-ASCII literal that uses only ASCII characters - like using some kind of escape sequence?
well, i always thought that by default VS uses UTF-8 to save files. But ÷ is F7 in encoding ISO 8859-1. If this is not enough for you go here: how to change source file encoding in csharp project (visual studio / msbuild machine)?