How to remove the Byte Order Mark from a UTF-8 BinaryWriter? - vb.net

I am generating a CSV file that contains double-byte characters. Right now I am using a BinaryWriter with UTF-8 encoding. The problem is that the generated CSV file has a BOM prefix (preamble). How can I remove the preamble?
I am writing the preamble to the BinaryWriter because without it the file still shows unrecognized characters instead of the double-byte characters.
I tried a different encoding constructor, for example Dim encoding As New System.Text.UTF8Encoding(False), which didn't work.
Dim encoding As Encoding = Encoding.UTF8
Using bw As New BinaryWriter(fs, encoding)
    bw.Write(encoding.GetPreamble())
    bw.Write(data)
    bw.Flush()
    Using n As New Ionic.Zip.ZipFile(encoding)
        n.CompressionLevel = Ionic.Zlib.CompressionLevel.BestCompression
        fs.Position = 0
        n.AddEntry(fileName, fs)
        Response.Clear()
        Response.ContentType = "application/octet-stream"
        n.Save(Response.OutputStream)
    End Using
    bw.Close()
    bw.Dispose()
End Using
This code generates a correct CSV file inside the Zip file, with correct double-byte characters, but with the preamble as a prefix of the whole data, like: "���"
I want to remove this unrecognized prefix, but if I remove the line bw.Write(encoding.GetPreamble()) I lose the encoding entirely: the double-byte characters are no longer recognized, and the prefix is still there.
Changing the encoding constructor to Dim encoding As New System.Text.UTF8Encoding(False) broke the encoding as well.
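For reference, here is a minimal sketch of what the preamble is and what each BinaryWriter.Write overload emits. The sample text and the module name are hypothetical, and the type of data in the question is not shown: if it is a String, BinaryWriter.Write(String) puts a length prefix in front of the encoded bytes, which may be part of the prefix described above, while the Byte() overload writes the bytes alone.

```vbnet
Imports System
Imports System.IO
Imports System.Text

Module PreambleDemo
    Sub Main()
        ' Encoding.UTF8 reports the three BOM bytes as its preamble.
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetPreamble()))   ' EF-BB-BF

        ' UTF8Encoding(False) reports an empty preamble.
        Dim noBom As New UTF8Encoding(False)
        Console.WriteLine(noBom.GetPreamble().Length)                           ' 0

        ' Hypothetical stand-in for the CSV text: "fu" followed by U+00DF (sharp s).
        Dim sample As String = "fu" & ChrW(&HDF)

        Using ms As New MemoryStream()
            Using bw As New BinaryWriter(ms, noBom)
                bw.Write(sample)   ' String overload: length prefix, then the encoded bytes
            End Using
            Console.WriteLine(BitConverter.ToString(ms.ToArray()))              ' 04-66-75-C3-9F
        End Using

        Using ms As New MemoryStream()
            Using bw As New BinaryWriter(ms, noBom)
                bw.Write(noBom.GetBytes(sample))   ' Byte() overload: the encoded bytes only
            End Using
            Console.WriteLine(BitConverter.ToString(ms.ToArray()))              ' 66-75-C3-9F
        End Using
    End Sub
End Module
```

Neither overload writes a BOM by itself; in the question's code the BOM comes only from the explicit bw.Write(encoding.GetPreamble()) call.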

Related

What am I supposed to do in .NET with a UTF8 Encoded string?

I am using the Google Chrome Native Messaging which says that it supplies UTF8 encoded JSON. Found here.
I am pretty sure my code is fairly standard and pretty much a copy from answers here in C#. For example see this SO question.
Private Function OpenStandardStreamIn() As String
    Dim MsgLength As Integer = 0
    Dim InputData As String = ""
    Dim LenBytes As Byte() = New Byte(3) {} 'first 4 bytes are length
    Dim StdIn As System.IO.Stream = Console.OpenStandardInput() 'open the stream
    StdIn.Read(LenBytes, 0, 4) 'length
    MsgLength = System.BitConverter.ToInt32(LenBytes, 0) 'convert length to Int
    Dim Buffer As Char() = New Char(MsgLength - 1) {} 'create Char array for remaining bytes
    Using Reader As System.IO.StreamReader = New System.IO.StreamReader(StdIn) 'Using to auto dispose of stream reader
        While Reader.Peek() >= 0 'while the next byte is not Null
            Reader.Read(Buffer, 0, Buffer.Length) 'add to the buffer
        End While
    End Using
    InputData = New String(Buffer) 'convert buffer to string
    Return InputData
End Function
The problem I have is that when the JSON includes characters such as ß Ü Ö Ä, the whole string seems to be different and I cannot deserialize it. It is readable and my log shows the string is fine, but something is different. As long as the string does NOT include these characters, deserialization works fine. I am not supplying the JavascriptSerializer code as this is not the problem.
I have tried creating the StreamReader with different Encodings such as
New System.IO.StreamReader(StdIn, Encoding.GetEncoding("iso-8859-1"), True)
however the ß Ä etc are then not correct.
What I don't understand is: if the string is UTF-8 and .NET uses UTF-16, how am I supposed to make sure the conversion is done properly?
UPDATE
I have been doing some testing. If I receive a string with fuß, the message length (provided by native messaging) is 4 but the number of Chars in the buffer is 3; if the string is fus, the message length is 3 and the number of characters is 3. Why is that?
With the above code the Buffer object is 1 element too big, and that is why there is a problem. If I simply use the Read method on the stream, it works fine. It appears that Google Messaging sends a different message length when the ß is in the string.
If I want to use the above code, how can I know that the message length is not right?
"Each message is serialized using JSON, UTF-8 encoded and is preceded with 32-bit message length in native byte order. The maximum size of a single message from the native messaging host is 1 MB." This implies that the message length is in bytes, also, that the length is not part of the message (and so its length is not included in length).
Your confusion seems to stem from one of two things:
UTF-8 encodes a Unicode codepoint in 1 to 4 code units. (A UTF-8 code unit is 8 bits, one byte.)
Char is a UTF-16 code unit. (A UTF-16 code unit is 16 bits, two bytes. UTF-16 encodes a Unicode codepoint in 1 to 2 code units.)
There is no way to tell how many codepoints or UTF-16 code units are in the message until after it is converted (or scanned, but then you might as well just convert it).
Then, presumably, the stream will either be found to be closed, or the next thing to read will be another length and message.
So,
Private Iterator Function Messages(stream As Stream) As IEnumerable(Of String)
    Using reader = New BinaryReader(stream)
        Try
            While True
                Dim length = reader.ReadInt32
                Dim bytes = reader.ReadBytes(length)
                Dim message = Encoding.UTF8.GetString(bytes)
                Yield message
            End While
        Catch e As EndOfStreamException
            ' Expected when the sender is done
            Return
        End Try
    End Using
End Function
Usage
Messages(stream).ToList()
or
For Each message In Messages(stream)
    Debug.WriteLine(message)
Next message
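The difference the question observed (length 4 but 3 Chars for fuß) can be reproduced directly. This is a minimal sketch; the module name and the sample strings are only for illustration, and ß is built with ChrW so the result does not depend on how the source file is saved.

```vbnet
Imports System
Imports System.Text

Module LengthDemo
    Sub Main()
        ' "fus" and "fu" followed by U+00DF (sharp s), the two samples from the question.
        Dim samples As String() = {"fus", "fu" & ChrW(&HDF)}

        For Each sample As String In samples
            ' Number of UTF-8 code units (bytes): this is what the length prefix counts.
            Dim byteCount As Integer = Encoding.UTF8.GetByteCount(sample)
            ' Number of UTF-16 code units (Chars): this is what String.Length counts.
            Dim charCount As Integer = sample.Length
            Console.WriteLine(String.Format("{0} bytes, {1} Chars", byteCount, charCount))
        Next
        ' Prints "3 bytes, 3 Chars" and then "4 bytes, 3 Chars".
    End Sub
End Module
```

Sizing a Char buffer from the byte length therefore over-allocates whenever the message contains a character that needs more than one byte in UTF-8.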
If you're displaying the output of this code in a console, this would definitely happen, because the Windows console doesn't display Unicode characters. If that isn't the case, try using a StringBuilder to convert the data in your StdIn stream to a string.

Read Data From The Byte Array Returned From Web Service

I have a web service which returns data as a byte array. Now I want to read that data in my console project. How can I do that? I have already added the required references to access the web service. I am using VB.NET with VS2012. Thanks. My web service method is as follows.
Public Function GetFile() As Byte()
    Dim response As Byte()
    Dim filePath As String = "D:\file.txt"
    response = File.ReadAllBytes(filePath)
    Return response
End Function
Something like,
Dim result As String
Using data As New MemoryStream(response)
    Using reader As New StreamReader(data)
        result = reader.ReadToEnd()
    End Using
End Using
If you knew the encoding, let's say it was UTF-8, you could do,
Dim result = System.Text.Encoding.UTF8.GetString(response)
Following on from your comments, I think you are asserting this.
Dim response As Byte() 'Is the bytes of a Base64 encoded string.
So, we know all the bytes will be valid ASCII (because it's Base64), so the string encoding is interchangeable.
Dim base64Encoded As String = System.Text.Encoding.UTF8.GetString(response)
Now, base64Encoded is the Base64 string representation of some binary.
Dim decodedBinary As Byte() = Convert.FromBase64String(base64Encoded)
So, we've changed the encoded Base64 into the binary it represents. Now, because I can see that in your example you are reading a file called "D:\file.txt", I'm going to assume that the contents of the file are a character-encoded string, but I don't know the encoding of the string. The StreamReader class has some logic in the constructor that can make an educated guess at the character encoding.
Dim result As String
Using data As New MemoryStream(decodedBinary)
    Using reader As New StreamReader(data)
        result = reader.ReadToEnd()
    End Using
End Using
Hopefully, result now contains the contents of the text file.

Determine if a text file without BOM is UTF8 or ASCII VB.NET

I'm using VB.NET 2008.
Dim oEncoding As Encoding
Dim oReader As StreamReader
Dim sReadString As String
oReader = New StreamReader("TextNonBOM.txt", System.Text.Encoding.Default, True)
sReadString = oReader.ReadToEnd().ToLower()
oEncoding = oReader.CurrentEncoding
oReader.Close()
How can I tell whether a file without a BOM is UTF-8 or ASCII?
If your only choices are UTF-8 and ASCII, you don't have to detect anything. All ASCII is valid UTF-8, so you can always decode as UTF-8.
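If you still want to report which of the two a file is, a minimal sketch could look like the following. Classify is a hypothetical helper, not a framework method. It relies on two facts: ASCII uses only byte values below 128, and a UTF8Encoding created with throwOnInvalidBytes set to True throws a DecoderFallbackException on input that is not valid UTF-8.

```vbnet
Imports System
Imports System.Text

Module EncodingCheck
    ' Hypothetical helper: classifies a BOM-less byte sequence.
    Function Classify(bytes As Byte()) As String
        ' Pure ASCII uses only byte values 0..127.
        If Array.TrueForAll(bytes, Function(b) b < 128) Then
            Return "ASCII"
        End If

        Try
            ' No BOM emitted, throw on invalid byte sequences.
            Dim strict As New UTF8Encoding(False, True)
            strict.GetString(bytes)
            Return "UTF-8"
        Catch ex As DecoderFallbackException
            Return "neither"
        End Try
    End Function

    Sub Main()
        Console.WriteLine(Classify(Encoding.ASCII.GetBytes("fus")))            ' ASCII
        Console.WriteLine(Classify(Encoding.UTF8.GetBytes("fu" & ChrW(&HDF)))) ' UTF-8
        Console.WriteLine(Classify(New Byte() {&H66, &HFF}))                   ' neither
    End Sub
End Module
```

For the file in the question, the input would be File.ReadAllBytes("TextNonBOM.txt"). Either way, decoding as UTF-8 gives the right text in both of the first two cases.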

How can I read Greek characters from my database in my web app?

I've got Greek text stored in my Access database. For some reason it doesn't appear in Greek; it uses other symbols instead.
e.g. Ãëþóóá instead of Γλώσσα
I can convert it in my windows app like this:
Dim encoder As Encoding = Encoding.GetEncoding(1253)
Dim valueInBytes As Byte() = System.IO.File.ReadAllBytes(languageFilePath)
languageValue = encoder.GetString(valueInBytes)
However, I now need to use the values in my web app. But the ReadAllBytes method is not available to me. I've tried using GetBytes instead, but this doesn't seem to produce the same results.
Dim encoder As Encoding = Encoding.GetEncoding(1253)
Dim valueInBytes As Byte() = encoder.GetBytes(languageValue)
languageValue = encoder.GetString(valueInBytes)
What am I doing wrong?
The first one seems to have nothing to do with text in a variable; you're reading from a file.
Dim encoder As Encoding = Encoding.GetEncoding(1253)
Dim valueInBytes As Byte() = System.IO.File.ReadAllBytes(languageValue)
languageValue = encoder.GetString(valueInBytes)
ReadAllBytes should be supported in most frameworks, so there should not be a problem with this on the server.
The other code seems to be doing something completely different. You are converting the string to bytes and back again in the same encoding, which changes nothing. To get this to work you need to find out which encoding Access thought the text was in and encode with that. However, the text may still not have survived the round trip, as Access may be doing some normalisation of the Unicode.
Dim encoder As Encoding = Encoding.GetEncoding(1253)
Dim accessencoder As Encoding = Encoding.GetEncoding({{Access's encoding number here}})
Dim valueInBytes As Byte() = accessencoder.GetBytes(languageValue)
languageValue = encoder.GetString(valueInBytes)
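As a sketch of that round trip, assume the text was mis-decoded as Windows-1252. That is an assumption, not something the question states, but it is consistent with the sample: the characters Ãëþóóá correspond to the bytes C3 EB FE F3 F3 E1 in code page 1252, and those same bytes spell Γλώσσα in code page 1253. The module name is hypothetical. On .NET Framework both code pages are available out of the box; on .NET Core and later they have to be registered first through CodePagesEncodingProvider.

```vbnet
Imports System
Imports System.Text

Module GreekRepair
    Sub Main()
        ' The garbled value from the question.
        Dim garbled As String = "Ãëþóóá"

        ' Assumed wrong encoding (Windows-1252) and the intended one (Windows-1253, Greek).
        Dim wrongEncoding As Encoding = Encoding.GetEncoding(1252)
        Dim greekEncoding As Encoding = Encoding.GetEncoding(1253)

        ' Recover the original bytes, then decode them as Greek.
        Dim rawBytes As Byte() = wrongEncoding.GetBytes(garbled)
        Dim repaired As String = greekEncoding.GetString(rawBytes)

        Console.WriteLine(BitConverter.ToString(rawBytes))   ' C3-EB-FE-F3-F3-E1
        Console.WriteLine(repaired)
    End Sub
End Module
```

This only works if every original byte survived the wrong decoding; bytes that had no mapping in the wrong code page are lost and cannot be recovered this way.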

Converting non-Unicode to Unicode

I'm trying to convert a non-Unicode string like this, '¹ûº¤¡¾­¢º¤ìñ©2', to Unicode like this, 'ໃຊ້ໃນຄົວເຮືອນ', which is in Lao. I tried the code below and it returns '??????'. Any idea how I can convert the string?
Public Shared Function ConvertAsciiToUnicode(asciiString As String) As String
    ' Create two different encodings.
    Dim encAscii As Encoding = Encoding.ASCII
    Dim encUnicode As Encoding = Encoding.Unicode
    ' Convert the string into a byte[].
    Dim asciiBytes As Byte() = encAscii.GetBytes(asciiString)
    ' Perform the conversion from one encoding to the other.
    Dim unicodeBytes As Byte() = Encoding.Convert(encAscii, encUnicode, asciiBytes)
    ' Convert the new byte[] into a char[] and then into a string.
    ' This is a slightly different approach to converting to illustrate
    ' the use of GetCharCount/GetChars.
    Dim unicodeChars As Char() = New Char(encUnicode.GetCharCount(unicodeBytes, 0, unicodeBytes.Length) - 1) {}
    encUnicode.GetChars(unicodeBytes, 0, unicodeBytes.Length, unicodeChars, 0)
    Dim unicodeString As New String(unicodeChars)
    ' Return the new unicode string
    Return unicodeString
End Function
Your 8-bit encoded Lao text is not in ASCII, but in some codepage like IBM CP1133 or Microsoft LC0454, or, most likely, the Thai codepage 874. You have to find out which one it is.
It matters how you obtained (read, received, computed) the input string. Once it is a .NET String it is already Unicode and is easy to output in UTF-8, for example like this:
Dim writer As New StreamWriter("myfile.txt", True, System.Text.Encoding.UTF8)
writer.Write(mystring)
writer.Close()
Here is the whole in-memory conversion:
Dim cp874_input As Byte() ' the raw input bytes, still in codepage 874 (not yet UTF-8)
...
Dim converted As Byte() = Encoding.Convert(Encoding.GetEncoding(874), Encoding.UTF8, cp874_input)
The number 874 identifies the codepage your input is in. Whether a particular operating system installation supports this codepage is another question, but your own system will almost certainly support it if you just used it to compose your Stack Overflow question.