How to convert incoming data containing non-ASCII characters to proper Unicode before saving to the database - vb.net

I have a web service API in VB.NET that accepts a string, but I cannot control the data coming into it. I sometimes receive characters between words in this format (–, Á, •ï€,ââ€ï€, etc. ). Is there a way for me to handle these, or convert these characters to their correct symbols, before saving to the database?
I know that the best solution would be to go after the source where the characters get malformed, but I'll keep that as plan B.
My code already uses UTF-8 as its encoding, but what if a client that uses my API gets it wrong and inadvertently sends malformed characters through the API? Can I clean that string and convert the malformed characters to the correct symbols?

If you only want to accept ASCII characters, you could remove non-ASCII characters by encoding and decoding the string - the default ASCII encoding uses "?" as a substitute for unrecognized characters, so you probably want to override that:
' Using System.Text
Dim input As String = "âh€eÁlâl€o¢wïo€râlâd€ï€"
Dim ascii As Encoding = Encoding.GetEncoding(
"us-ascii",
New EncoderReplacementFallback(" "),
New DecoderReplacementFallback(" ")
)
Dim bytes() As Byte = ascii.GetBytes(input)
Dim output As String = ascii.GetString(bytes)
Output:
h e l l o w o r l d
The replacement given to the En/DecoderReplacementFallback can be empty if you just want to drop the non-ASCII characters.
You could use a different encoding than ASCII if you want to accept more characters - but I would imagine that most of the characters you listed are valid in most European character sets.
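If the garbled text comes from UTF-8 bytes that were decoded as Windows-1252 somewhere upstream (an assumption, but sequences such as "Ã©" or "â€" are typical of it), it can sometimes be reversed rather than stripped. The following is a minimal sketch of that idea; the module and function names are hypothetical, and it targets .NET Core / .NET 5+, where code page 1252 has to be registered first. Sequences that have already lost characters along the way cannot be recovered like this.

```vbnet
Imports System.Text

Module MojibakeRepair
    ' Hypothetical helper: tries to undo "UTF-8 bytes decoded as Windows-1252".
    ' Returns the input unchanged when it does not look like that kind of damage.
    Function Repair(input As String) As String
        ' Needed on .NET Core / .NET 5+ so that code page 1252 is available.
        ' Assumption: on .NET Framework this line can be dropped.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance)

        ' Strict encodings: throw instead of silently substituting "?".
        Dim cp1252 As Encoding = Encoding.GetEncoding(1252,
            EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback)
        Dim strictUtf8 As New UTF8Encoding(False, True)

        Try
            ' Recover the original bytes, then decode them as UTF-8.
            Dim rawBytes() As Byte = cp1252.GetBytes(input)
            Return strictUtf8.GetString(rawBytes)
        Catch ex As EncoderFallbackException
            ' Contains characters outside Windows-1252: not this kind of damage.
            Return input
        Catch ex As DecoderFallbackException
            ' Bytes are not valid UTF-8: the text was probably fine already.
            Return input
        End Try
    End Function

    Sub Main()
        ' "cafÃ©" written with ChrW so the source file stays pure ASCII.
        Dim garbled As String = "caf" & ChrW(&HC3) & ChrW(&HA9)
        Console.WriteLine(Repair(garbled) = "caf" & ChrW(&HE9))   ' True
        Console.WriteLine(Repair("hello") = "hello")              ' True
    End Sub
End Module
```

Because both encodings are strict, text that was already correct (for example a properly encoded "café") fails the UTF-8 decode and is returned untouched instead of being damaged further.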

Your question is a bit vague, but here is something you could potentially do:
Sub Main()
Dim allowedValues = "abcdefghijklmnopqrstuvwxyz".ToCharArray()
Dim someGoodSomeBad = "###$##okay##$##"
Dim onlyGood = New String(someGoodSomeBad.ToCharArray().Where(Function(x) allowedValues.Contains(x)).ToArray)
Console.WriteLine(onlyGood)
End Sub
The first line defines the valid characters. In my example I chose lowercase letters; you could add more letters and digits too. Basically you are creating a whitelist of acceptable characters that you, the developer, maintain.
The next line stands in for input to your API that has some good and some bad characters.
The next part is simpler than it looks. I convert the string to an array of characters, then keep ONLY the characters that match my whitelist, using a lambda. Then I convert the result to an array again, because the String constructor in .NET takes a char array.
I then get a good string, where 'good' is whatever the whitelist says it is.
The bigger question, though, is WHY your Web API is receiving garbled data. It should be receiving well-formed JSON or XML that can then be parsed and strongly typed into models. Doing what I have shown above is more of a band-aid than a real fix for the underlying problem, and it will have MANY holes.

Related

Can ASCII arrays be manipulated as arrays without converting to String form?

This is a basic question, but I can't find anything on it, since I don't know what to search — each of my tries have come up with unrelated results.
If I use Text.Encoding.ASCII.GetBytes to convert a string into ASCII, does each byte represent exactly one character? Does the following code work as exactly intended in all circumstances (for all Strings other than the examples)?
Dim t1() As Byte = Text.Encoding.ASCII.GetBytes("Hello ")
Dim t2() As Byte = Text.Encoding.ASCII.GetBytes("World")
Dim msg As String = Text.Encoding.ASCII.GetString(t1.Concat(t2).ToArray)
Now msg should be "Hello World".
I would like this to work as I don't want to have to convert data I receive back to Strings in order to manipulate it before it is sent again.
What if I used something other than ASCII (like UTF-8, for example)?
If I use Text.Encoding.ASCII.GetBytes to convert a string into ASCII, does each byte represent exactly one character?
Yes. ASCII is a 7-bit encoding; it does not support multi-byte characters. Any Unicode codepoint above U+007F will get converted to a ? character in ASCII.
If you were to use UTF-7 instead, for instance, it can encode individual Unicode codepoints into a sequence of multiple ASCII characters.
Does the following code work as exactly intended in all circumstances (for all Strings other than the examples)?
In your particular example, yes (provided you are using LINQ's Concat() method - there are other ways to concat arrays together). There is no data loss.
But for other examples, just know that you will have data loss if you convert non-ASCII characters to ASCII, or otherwise mismatch encodings between GetBytes() and GetString().
You can certainly manipulate byte arrays. Just make sure the arrays are in the same encoding if you merge them together.
.NET strings are counted sequences of UTF-16 code units (Char), one or two of which encode a Unicode codepoint (see Char.ConvertToUtf32, which returns the codepoint as an Integer). Some codepoints are "combining characters", which when applied to a preceding "base character" form a grapheme (which is then rendered by a font into a glyph).
An encoder from Unicode to an encoding of another character set should attempt to preserve graphemes. In .NET, a grapheme is called a "text element."
So, yes, you can combine encoded byte sequences as long as you haven't defeated the encoder by converting parts of a grapheme into different byte sequences. If you are breaking a string into two before encoding, see TextElementEnumerator and StringInfo class.
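To illustrate the points above, here is a small sketch (the module and function names are made up for the example). It shows that a non-ASCII character occupies more than one byte in UTF-8, that ASCII replaces it with "?", and that concatenating the bytes of two separately encoded strings round-trips as long as the same encoding is used on both sides.

```vbnet
Imports System.Linq
Imports System.Text

Module ByteConcatDemo
    ' Hypothetical helper: encodes both strings as UTF-8, joins the bytes,
    ' and decodes the combined array with the same encoding.
    Function JoinUtf8(first As String, second As String) As String
        Dim b1() As Byte = Encoding.UTF8.GetBytes(first)
        Dim b2() As Byte = Encoding.UTF8.GetBytes(second)
        Return Encoding.UTF8.GetString(b1.Concat(b2).ToArray())
    End Function

    Sub Main()
        Dim eAcute As String = ChrW(&HE9)   ' e with acute accent, U+00E9

        ' One character, but two bytes in UTF-8.
        Console.WriteLine(Encoding.UTF8.GetBytes(eAcute).Length)                       ' 2

        ' ASCII cannot represent it, so the default fallback substitutes "?".
        Console.WriteLine(Encoding.ASCII.GetString(Encoding.ASCII.GetBytes(eAcute)))   ' ?

        ' Whole strings encoded separately can be concatenated safely.
        Console.WriteLine(JoinUtf8("caf" & eAcute, " au lait") = "caf" & eAcute & " au lait")   ' True
    End Sub
End Module
```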

standard keyboard character not included in Base64?

I have a function that generates a random Base64 String
Public Shared Function GenerateSalt() As String
Dim rng As RNGCryptoServiceProvider = New RNGCryptoServiceProvider
Dim buff(94) As Byte
rng.GetBytes(buff)
Return Convert.ToBase64String(buff)
End Function
This will always return a 128 Character String. I then take that string and divide it into 4 substrings. I then merge that all back into one big string called MasterSalt like so
MasterSalt = (Salt.Substring(1,32)) + "©" + (Salt.Substring(32,32)) + "©" + etc...
I am doing this because I then put all of this into an array and say Split(MasterSalt, "©")
My concern is I am not overly confident in the stability of using "©" as the delimiter to define where the string should be split. However I have to use something that is not going to be included in the randomly generated base64string. I would like it to be something that can be found on a standard keyboard if possible. So to be clear my question is: is there a glyph or character on a standard keyboard that would never be included in a randomly generated base64string??
Base64 uses 64 characters to encode 6 bits of the content at a time as values 0-63:
A-Z (0-25)
a-z (26-51)
0-9 (52-61)
+ (62)
/ (63)
...and it uses = as filler at the end if required.
Any other character will be available for you to use as a delimiter, for example space, period and minus.
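As a rough sketch of how that could look, the example below splits a 128-character Base64 salt into four 32-character parts and joins them with a period. The names are hypothetical, and RandomNumberGenerator.Create() is swapped in for RNGCryptoServiceProvider, which is marked obsolete in newer .NET versions.

```vbnet
Imports System.Linq
Imports System.Security.Cryptography

Module SaltDemo
    ' Hypothetical helper: builds a salt of four Base64 parts separated by ".".
    Function BuildMasterSalt() As String
        ' Upper bound 94 means 95 bytes, which encode to 128 Base64 characters.
        Dim buff(94) As Byte
        Using rng As RandomNumberGenerator = RandomNumberGenerator.Create()
            rng.GetBytes(buff)
        End Using
        Dim salt As String = Convert.ToBase64String(buff)

        ' Substring is zero-based: the parts start at 0, 32, 64 and 96.
        Dim parts = Enumerable.Range(0, 4).Select(Function(i) salt.Substring(i * 32, 32))
        Return String.Join(".", parts)
    End Function

    Sub Main()
        Dim master As String = BuildMasterSalt()
        Dim pieces() As String = master.Split("."c)
        Console.WriteLine(pieces.Length)                  ' 4
        Console.WriteLine(String.Concat(pieces).Length)   ' 128
    End Sub
End Module
```

Since "." never appears in the output of Convert.ToBase64String, the split always yields exactly the four original parts.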

Code for converting long string to pass in URL

I am trying to take a string like "Hello my name is Nick" and transform it to "Hello+my+name+is+Nick" to be passed through a URL. This would be easily done by replacing all the spaces with a + char however I also need to replace all special characters (. , ! &) with their ASCII values. I have searched the net but cannot find anything. I wonder if anyone knows of existing code to do this as its a fairly common task?
I think you're looking for this: HttpUtility.UrlEncode Method (String)
Handles non-URL compliant characters and spaces.
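A minimal usage sketch follows. The module name is made up, and in a .NET Framework project this assumes a reference to System.Web has been added.

```vbnet
Imports System.Web

Module UrlEncodeDemo
    Sub Main()
        ' Spaces become "+".
        Console.WriteLine(HttpUtility.UrlEncode("Hello my name is Nick"))   ' Hello+my+name+is+Nick

        ' Reserved characters such as "&" are percent-encoded.
        Console.WriteLine(HttpUtility.UrlEncode("Tom & Jerry"))             ' Tom+%26+Jerry
    End Sub
End Module
```

Encode each query-string value on its own rather than the whole URL, so that the separators between parameters stay intact.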

VBA - Read file byte by byte on system with Asian locale

I am trying to convert a file from binary to text, by simply replacing each character with the hexadecimal code. For example, character 'c' will be replaced by '63'.
I have a code which is working fine in normal systems, but it breaks down in the PC where I need to use it as it has default locale set to Chinese.
I am using the following statements to read a byte -
ch$ = " "
Get #f%, , ch$
I suspect there is a problem when I am reading the file byte by byte, as it is skipping certain bytes because they form composite characters. It's probably reading 2 bytes which form an Asian character as one byte. It is thus forming a much smaller file than the expected size.
How can I read the file byte by byte?
Full code is pasted here: http://pastebin.com/kjpSnqzV
Your suspicion is correct. VB file reading automatically converts strings into Unicode from the default code page on the PC. On an Asian code page, some characters are represented as more than one byte.
I advise you to use a Byte variable rather than a string - that will stop VB being over helpful.
Dim ch As Byte
Get #f%, , ch
Another possible problem with the original code is that some byte sequences are illegal on Asian code pages (they don't represent valid characters). So your code could experience errors for some input files, but presumably you want it to work with any file.

How can I write special character in VB code

I have a SQL statement using special characters (ex: ('), (/), (&)) and I don't know how to write them in my VB.NET code. Please help me. Thanks.
Find out the Unicode code point for the character (from http://www.unicode.org) and then use ChrW to convert from the code point to the character. (To put this in another string, use concatenation. I'm somewhat surprised that VB doesn't have an escape sequence, but there we go.)
For example, for the Euro sign (U+20AC) you'd write:
Dim euro as Char = ChrW(&H20AC)
The advantage of this over putting the character directly into source code is that your source code stays "just pure ASCII" - which means you won't have any strange issues with any other program trying to read it, diff it, etc. The disadvantage is that it's harder to see the symbol in the code, of course.
The most common way seems to be to append a character of the form Chr(34)... 34 represents a double quote character. The character codes can be found from the windows program "charmap"... just windows/Run... and type charmap
If you are passing strings to be processed as SQL statement try doubling the characters for example.
"SELECT * FROM MyRecords WHERE MyRecords.MyKeyField = ""With a "" Quote"" "
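Doubling applies to the quote characters only: "" inside a VB string literal and '' inside a SQL string literal. Characters such as / and & need no special treatment in a VB string.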
The ' character can be doubled up to allow it into a string e.g
lSQLSTatement = "Select * from temp where name = 'fred''s'"
Will search for all records where name = fred's
Three points:
1) The example characters you've given are not special characters. They're directly available on your keyboard. Just press the corresponding key.
2) To type characters that don't have a corresponding key on the keyboard, use this on Windows:
Alt + (the character's code, typed on the numeric keypad)
For example, to type ¿, hold Alt and key in 168, which is the code for that character in the OEM code page (code page 437). It is not an ASCII code, since ASCII stops at 127.
You can use this method to type a special character in practically any program, not just a VB.NET text editor.
3) What you are probably looking for is what is called 'escaping' characters in a string. VB.NET string literals have no backslash escapes: write "" for a double quote inside a VB string, and '' for a single quote inside a SQL string literal.
Chr() is probably the most popular.
ChrW() can be used if you want to generate unicode characters
The ControlChars class contains some special and 'invisible' characters, plus the quote - for example, ControlChars.Quote
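The options mentioned in these answers can be put side by side in a short sketch. The module and the SqlQuote helper are hypothetical names used only for illustration.

```vbnet
Module SpecialCharsDemo
    ' Hypothetical helper: wraps a value in single quotes for a SQL string
    ' literal, doubling any single quotes it contains.
    Function SqlQuote(value As String) As String
        Return "'" & value.Replace("'", "''") & "'"
    End Function

    Sub Main()
        ' A double quote inside a VB string literal is written twice...
        Dim doubled As String = "He said ""hi"""
        ' ...or appended via ControlChars.Quote.
        Dim viaControlChars As String = "He said " & ControlChars.Quote & "hi" & ControlChars.Quote
        Console.WriteLine(doubled = viaControlChars)        ' True

        ' ChrW builds a character from its Unicode code point (here the euro sign).
        Console.WriteLine(AscW(ChrW(&H20AC)) = &H20AC)      ' True

        Console.WriteLine("SELECT * FROM temp WHERE name = " & SqlQuote("fred's"))
        ' SELECT * FROM temp WHERE name = 'fred''s'
    End Sub
End Module
```

For values that come from users, parameterized commands are the safer route, since they avoid building the quoting by hand.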