Can ASCII byte arrays be manipulated as arrays without converting to String form? - vb.net

This is a basic question, but I can't find anything on it, since I don't know what to search for; each of my attempts has come up with unrelated results.
If I use Text.Encoding.ASCII.GetBytes to convert a string into ASCII, does each byte represent exactly one character? Does the following code work exactly as intended in all circumstances (for all Strings other than the examples)?
Dim t1() As Byte = Text.Encoding.ASCII.GetBytes("Hello ")
Dim t2() As Byte = Text.Encoding.ASCII.GetBytes("World")
Dim msg As String = Text.Encoding.ASCII.GetString(t1.Concat(t2).ToArray)
Now msg should be "Hello World".
I would like this to work, as I don't want to have to convert data I receive back to Strings just to manipulate it before it is sent on again.
What if I used something other than ASCII (like UTF-8, for example)?

If I use Text.Encoding.ASCII.GetBytes to convert a string into ASCII, does each byte represent exactly one character?
Yes. ASCII is a 7-bit encoding; it does not support multi-byte characters. Any Unicode codepoint above U+007F will get converted to a ? character in ASCII.
If you were to use UTF-7 instead, for instance, a single Unicode codepoint could be encoded as a sequence of multiple ASCII characters.
Does the following code work as exactly intended in all circumstances (for all Strings other than the examples)?
In your particular example, yes (provided you are using LINQ's Concat() method - there are other ways to concatenate arrays together). There is no data loss.
But for other examples, just know that you will have data loss if you convert non-ASCII characters to ASCII, or otherwise mismatch encodings between GetBytes() and GetString().
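For instance, here's a quick sketch of that lossiness: "é" has no ASCII representation, so it round-trips as "?":
Dim lossy() As Byte = Text.Encoding.ASCII.GetBytes("café")
Console.WriteLine(Text.Encoding.ASCII.GetString(lossy)) ' prints "caf?" - the é is gone for good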
You can certainly manipulate byte arrays. Just make sure the arrays are in the same encoding if you merge them together.
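For example, here is a minimal sketch of in-place manipulation that relies on each byte being exactly one character (the upper-casing loop is just an illustration, not from the question):
Dim data() As Byte = Text.Encoding.ASCII.GetBytes("hello world")
For i As Integer = 0 To data.Length - 1
    ' ASCII lower-case letters (97-122) sit exactly 32 above their upper-case forms
    If data(i) >= 97 AndAlso data(i) <= 122 Then
        data(i) = CByte(data(i) - 32)
    End If
Next
Console.WriteLine(Text.Encoding.ASCII.GetString(data)) ' "HELLO WORLD"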

.NET strings are counted sequences of UTF-16 code units (Char), one or two of which encode a Unicode codepoint (see Char.ConvertToUtf32).
An encoder from Unicode to an encoding of another character set should attempt to preserve graphemes. In .NET, a grapheme is called a "text element."
So, yes, you can combine encoded byte sequences as long as you haven't defeated the encoder by converting parts of a grapheme into different byte sequences. If you are breaking a string into two before encoding, see the TextElementEnumerator and StringInfo classes.
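Here is a minimal sketch of that idea (the string and variable names are mine; SubstringByTextElements needs .NET 4.0 or later):
' Imports System.Globalization, System.Text, System.Linq
Dim s As String = "e" & ChrW(&H301) & "tude" ' "étude" built with a combining acute accent
Dim info As New StringInfo(s)
' SubstringByTextElements splits on grapheme boundaries, so the combining
' accent stays attached to its base "e" even though the halves are encoded separately
Dim part1() As Byte = Encoding.UTF8.GetBytes(info.SubstringByTextElements(0, 1))
Dim part2() As Byte = Encoding.UTF8.GetBytes(info.SubstringByTextElements(1))
Dim roundTrip As String = Encoding.UTF8.GetString(part1.Concat(part2).ToArray()) ' "étude" again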

Related

How to correct incoming data containing non-ASCII chars to Unicode before saving to the database

I have a web service API in VB.NET that accepts a string, but I cannot control the data coming into this API. I sometimes receive chars in between words in this format (–, Á, •ï€, ââ€ï€, etc.). Is there a way for me to handle these or convert these characters to their correct symbols before saving to the database?
I know the best solution would be to go after the source where the characters get malformed, but I'll keep that as plan B.
My code already uses UTF-8 as its encoding, but what if the client that uses my API messed up and inadvertently sent a malformed char through the API? Can I clean that string and convert the malformed char to the correct symbol?
If you only want to accept ASCII characters, you could remove non-ASCII characters by encoding and decoding the string - the default ASCII encoding uses "?" as a substitute for unrecognized characters, so you probably want to override that:
' Imports System.Text
Dim input As String = "âh€eÁlâl€o¢wïo€râlâd€ï€"
Dim ascii As Encoding = Encoding.GetEncoding(
    "us-ascii",
    New EncoderReplacementFallback(" "),
    New DecoderReplacementFallback(" ")
)
Dim bytes() As Byte = ascii.GetBytes(input)
Dim output As String = ascii.GetString(bytes)
Output:
h e l l o w o r l d
The replacement given to the En/DecoderReplacementFallback can be empty if you just want to drop the non-ASCII characters.
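For example, the same setup as above with empty replacements:
Dim stripAscii As Encoding = Encoding.GetEncoding(
    "us-ascii",
    New EncoderReplacementFallback(""),
    New DecoderReplacementFallback("")
)
' stripAscii.GetString(stripAscii.GetBytes(input)) now yields "helloworld"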
You could use a different encoding than ASCII if you want to accept more characters - but I would imagine that most of the characters you listed are valid in most European character sets.
While your question is kind of vague, I can guide you toward something you could potentially do:
Sub Main()
    Dim allowedValues = "abcdefghijklmnopqrstuvwxyz".ToCharArray()
    Dim someGoodSomeBad = "###$##okay##$##"
    Dim onlyGood = New String(someGoodSomeBad.ToCharArray().Where(Function(x) allowedValues.Contains(x)).ToArray())
    Console.WriteLine(onlyGood)
End Sub
The first line defines the valid characters; in my example I chose lowercase alpha characters, but you could add more letters, numbers, and symbols too. Basically you are creating a whitelist of acceptable characters that you, the developer, control.
The next line simulates output from your API that has some good and some bad characters.
The next part is really simpler than it looks. I am turning the string into an array of characters, then finding ONLY the characters that match my whitelist in a lambda statement. Then I turn the result back into an array, because .NET can construct a new String from a char array.
I then get a 'good' string, where 'good' is whatever the whitelist defines.
The bigger question, though, is WHY your Web API is sending garbled data over. It should be sending well-formed JSON or XML that can then be parsed and strongly typed to models. Doing what I have shown above is more of a band-aid than a real fix to the underlying problem, and it will have MANY holes.

Inserting string as regular string in mongodb

The pymongo documentation says that BSON strings are UTF-8 encoded, so PyMongo must ensure that any strings it stores contain only valid UTF-8 data. Unicode strings (<type 'unicode'>) are encoded UTF-8 first. The reason our example string is represented in the Python shell as u'Mike' instead of 'Mike' is that PyMongo decodes each BSON string to a Python unicode string, not a regular str.
So I understand that to get rid of the Unicode literal 'u', I will have to call json.dumps() on the document returned by the query.
The documentation also says that regular strings (<type 'str'>) are validated and stored unaltered, and I am assuming that the query result also comes back as a regular string and not a Unicode string.
I created a dictionary with regular string types and inserted it into the DB, but when I retrieve it, I get the strings back as Unicode. Any idea how I can avoid this? The purpose is to avoid calling json.dumps() on the query result; I need to fetch a large number of documents from the DB, and json.dumps() is taking quite some time. The strings that I am storing contain ASCII data, so I don't need Unicode strings.
The assumption that a regular string is returned back as a regular string was not correct. The str is stored unaltered rather than being encoded to UTF-8, because it is already valid UTF-8; but when decoding during the query, everything is converted back to Unicode.
Source:
Automatic string to unicode object conversion
How can I get pymongo to always return str and not unicode?

Saving CSV file with degree symbol and ASCII encoded

I have a string variable txt that contains the "°" degree symbol. I would like to save the string into an ASCII-encoded CSV file. I use the procedure below, but the "°" symbol is converted to "?". Do you have any idea how to save the degree symbol properly?
Public Sub Write_File(ByVal txt As String, ByVal fName As String)
    Try
        Using OutFile As New StreamWriter(fName, False, Text.Encoding.ASCII)
            OutFile.Write(txt)
        End Using
        Me.Write_Log("Successfully Exported")
    Catch ex As Exception
        Me.Write_Log("Write Error during export")
    End Try
End Sub
Encoding.ASCII is for the standard 7-bit ASCII encoding, which does not contain a degree symbol at all. In order to get a degree symbol in ASCII, you would have to use one of the many 8-bit extended ASCII encodings. For English, you'd probably be most interested in the ISO 8859-1 code page, since that's the most standard of the bunch. For instance, instead of using Encoding.ASCII, you could do something like this:
Using OutFile As New StreamWriter(fName, False, Text.Encoding.GetEncoding("iso-8859-1"))
    OutFile.Write(txt)
End Using
For a complete list of available encodings, use the Encoding.GetEncodings method, or look at the list of supported ones in the MSDN documentation.
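For example, a quick way to dump that list at run time:
' List every encoding the runtime knows about
For Each info As EncodingInfo In Encoding.GetEncodings()
    Console.WriteLine("{0,-6} {1}", info.CodePage, info.Name)
Next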
Of course, none of the various 8-bit ASCII encodings are compatible with each other, so, if you do use that, the degree symbol will be a completely different symbol when viewed on a system that uses a different code page by default. That is precisely why UTF-8 has become the new standard. Usage of 8-bit ASCII is widely discouraged since it is practically unworkable in multi-cultural scenarios. If you can use UTF-8 instead, I would. If you must use ASCII, it's best to stick to the standard 7-bit encoding. If you must use an 8-bit ASCII encoding, please do so sparingly and with full awareness of its drawbacks.
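For instance, here is a minimal sketch of the UTF-8 route (reusing fName from your procedure), which round-trips the degree symbol without any code-page guesswork:
Using OutFile As New StreamWriter(fName, False, Text.Encoding.UTF8)
    OutFile.Write("Temperature: 23°C") ' the ° survives as the two-byte UTF-8 sequence &HC2 &HB0
End Using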
One more thing. You mention the degree symbol as being character 167 (0xA7) in your desired target encoding. If that is the case, you may actually want the IBM437 encoding rather than ISO 8859-1. IBM437 is the old code page that was used by default in MS-DOS. If you really need to use that code page, you may run into additional trouble, for two reasons. As you'll see in the MSDN article, that code page is not well supported in the .NET Framework. In my testing, outputting a Unicode string containing the degree symbol using that encoding did not work properly. Therefore, you may find yourself needing to use a byte array to represent the data rather than a String variable (which is Unicode). For instance:
File.WriteAllBytes("Test.txt", New Byte() {167})
The second problem is that IBM437 is likely not the default code page for your Windows OS, so even when the degree symbol is written to the file as byte value 167, it won't actually look like a degree symbol when you view it in a Windows application such as Notepad.

Representing data types, e.g. Chars, Strings, Integers, etc.

I am a .NET Developer and I do not believe I know enough about encoding. I have read this article: http://www.joelonsoftware.com/articles/Unicode.html.
Say I declare this string:
Dim TestString As String = "1"
I believe this will be represented as a Unicode character. Say I declare this integer:
Dim TestInt As Integer = 1
How is this represented? I assume that Unicode is not used, i.e. it is only used for Strings and Chars? Is that correct? Therefore I believe that on a 32-bit machine 1 would simply be represented as:
00000000 00000000 00000000 00000001
Do numeric data types have byte order marks: http://en.wikipedia.org/wiki/Byte_order_mark ?
All strings in .NET are UTF-16. From the language spec:
Visual Basic .NET defines the following primitive types:
...
The Char value type, which represents a single Unicode character and maps to System.Char...
The String reference type, which represents a sequence of Unicode characters and maps to System.String...
Why should an integral value type like an Integer be represented with Unicode in computer memory? Unicode is (citing Wikipedia):
a computing industry standard for the consistent encoding, representation and handling of text expressed in most of the world's writing systems.
So yes, it's only used for Strings and Chars.
Also note that an Integer will always be a 4-byte signed integer, no matter whether you use a 32-bit or 64-bit machine.
Byte order marks are an entirely different topic. As already said in a comment, they're used in text files or streams, not in in-memory numeric types.
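As a quick illustration of the in-memory layout (a sketch; the output shown is for a little-endian x86/x64 machine):
Dim bytes() As Byte = BitConverter.GetBytes(1) ' the raw bytes of the Integer 1
Console.WriteLine(bytes.Length)                ' 4, on 32-bit and 64-bit machines alike
Console.WriteLine(String.Join(" ", bytes))     ' "1 0 0 0" - little-endian, and no BOM anywhere
Console.WriteLine(BitConverter.IsLittleEndian) ' True on x86/x64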

How to check if a string is Unicode - vb.net

Is there any way to check whether a string is Unicode using VB.NET?
You need to read the file using the Encoding that the file is written in.
It appears that you have a non-Unicode file that you are trying to read as Unicode, or possibly the file uses a different Unicode encoding than the default UTF-8 (it could be UTF-16, for example).
StreamReader has several constructors that take an Encoding as a parameter.
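For example, a minimal sketch assuming the file turns out to be UTF-16, which .NET calls Encoding.Unicode (the file name is just a placeholder):
Using reader As New IO.StreamReader("input.txt", Text.Encoding.Unicode)
    Dim contents As String = reader.ReadToEnd()
End Using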
You can do it by validating each character in the string against the 128 characters in the ASCII table. If a character is not found there, then it might be a Unicode (non-ASCII) character.
Is that what you mean?
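Something like this minimal sketch (the function name is mine), which flags any character outside the 7-bit range:
Function ContainsNonAscii(ByVal s As String) As Boolean
    For Each c As Char In s
        If AscW(c) > 127 Then Return True ' outside the 128-character ASCII table
    Next
    Return False
End Function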