Determine if a text file without a BOM is UTF-8 or ASCII in VB.NET

I'm using VB.NET 2008.
Dim oEncoding As Encoding
Dim oReader As StreamReader
Dim sReadString As String
oReader = New StreamReader("TextNonBOM.txt", System.Text.Encoding.Default, True)
sReadString = oReader.ReadToEnd().ToLower()
oEncoding = oReader.CurrentEncoding
oReader.Close()
Without a BOM, how can I tell whether the file is UTF-8 or ASCII?

If your only choices are UTF-8 and ASCII, you don't have to detect anything. All ASCII is valid UTF-8, so you can always decode as UTF-8.
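If the file might turn out to be neither ASCII nor UTF-8, a strict decoder can confirm the assumption. The following is a minimal sketch, not a definitive implementation: the module name and the helper TryDecodeUtf8 are hypothetical, and the file name is taken from the question.

```vbnet
Imports System
Imports System.IO
Imports System.Text

Module Utf8CheckSketch
    ' Hypothetical helper: returns the decoded text, or Nothing if the
    ' bytes are not valid UTF-8.
    Function TryDecodeUtf8(ByVal bytes As Byte()) As String
        ' First argument False: no BOM when encoding.
        ' Second argument True: throw on invalid byte sequences instead of
        ' silently substituting U+FFFD.
        Dim strictUtf8 As New UTF8Encoding(False, True)
        Try
            Return strictUtf8.GetString(bytes)
        Catch ex As DecoderFallbackException
            Return Nothing
        End Try
    End Function

    Sub Main()
        Dim text As String = TryDecodeUtf8(File.ReadAllBytes("TextNonBOM.txt"))
        Console.WriteLine(If(text Is Nothing, "Not valid UTF-8", "Valid UTF-8 (or plain ASCII)"))
    End Sub
End Module
```

Pure ASCII input always passes this check, because every ASCII byte sequence is also valid UTF-8.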

Related

How to remove the byte order mark from a UTF-8 BinaryWriter?

I am generating a CSV file that contains double-byte characters. Right now I am using a BinaryWriter with UTF-8 encoding. The problem is that the generated CSV file has a BOM prefix (preamble). How can I remove the preamble?
I am writing the preamble to the BinaryWriter because without it the file still shows unrecognized characters instead of the double-byte characters.
I tried a different encoding constructor, for example Dim encoding As New System.Text.UTF8Encoding(False), which didn't work.
Dim encoding As Encoding = Encoding.UTF8
Using bw As New BinaryWriter(fs, encoding)
bw.Write(encoding.GetPreamble())
bw.Write(data)
bw.Flush()
Using n As New Ionic.Zip.ZipFile(encoding)
n.CompressionLevel = Ionic.Zlib.CompressionLevel.BestCompression
fs.Position = 0
n.AddEntry(fileName, fs)
Response.Clear()
Response.ContentType = "application/octet-stream"
n.Save(Response.OutputStream)
End Using
bw.Close()
bw.Dispose()
End Using
This code generates a correct CSV file inside the Zip file, with correct double-byte characters, but with the preamble prefixed to the whole data, like: "ï»¿"
I want to remove this unrecognized prefix, but when I remove the line bw.Write(encoding.GetPreamble()) I lose the encoding entirely: the double-byte characters are no longer recognized and the prefix is still there.
Changing the encoding constructor to Dim encoding As New System.Text.UTF8Encoding(False) broke the encoding as well.
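One possible approach, sketched under assumptions: if data is a String, BinaryWriter.Write(String) adds its own length prefix in front of the text, which is separate from the BOM. A StreamWriter built with UTF8Encoding(False) writes only the encoded characters. The module name, the MemoryStream, and the sample content below are hypothetical stand-ins for the stream and data in the question.

```vbnet
Imports System
Imports System.IO
Imports System.Text

Module NoBomSketch
    Sub Main()
        ' Hypothetical CSV content containing one non-ASCII character (U+00E9).
        Dim data As String = "caf" & ChrW(&HE9) & ",1"
        Dim bytes As Byte()

        Using ms As New MemoryStream()
            ' UTF8Encoding(False) tells the writer not to emit the preamble.
            Using sw As New StreamWriter(ms, New UTF8Encoding(False))
                sw.Write(data)
            End Using
            ' MemoryStream.ToArray still works after the stream is closed.
            bytes = ms.ToArray()
        End Using

        ' Dump the bytes as hex to inspect the start of the output.
        Console.WriteLine(BitConverter.ToString(bytes))
    End Sub
End Module
```

Some programs rely on the BOM to recognize a file as UTF-8, so characters may still display incorrectly in a viewer that guesses a different encoding, even though the bytes themselves are correct.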

ASCII to Base32

I am trying to convert the 10 characters entered in a text box in my VB project to Base32. Here is my code. I am getting this error:
Value of type 'String' cannot be converted to 'Byte()'. WindowsApplication2
Private Sub Ok_Click(sender As Object, e As EventArgs) Handles Ok.Click
Dim DataToEncode As Byte() = txtbox.Text
Dim Base32 As String
Base32 = DataToEncode.ToBase32String()
Auth.Text = Base32
End Sub
The value in txtbox.Text is a string which can't be automatically converted to a byte array. So the line Dim DataToEncode As Byte() = txtbox.Text can't be compiled. To get the ASCII representation of a string use the System.Text.Encoding.ASCII.GetBytes() method.
Dim DataToEncode As Byte() = System.Text.Encoding.ASCII.GetBytes(txtbox.Text)
Also, strings in VB.NET do not store ASCII values; they use UTF-16.
As the error indicates, you're trying to take a string (the contents of txtbox.Text) and put it in a variable of type Byte(), an array of bytes. A string isn't a byte array; it's a logical sequence of characters that can have different representations in bytes. Do you want to treat it as a UTF-8-encoded string? An ASCII string? A full-blown UTF-32 string? All these are different byte representations of what might be the same textual data.
Once you know the representation you care about, use the System.Text.Encoding classes to convert the text to a Byte() and pass that to your method.
Try converting the string into a byte array using the GetBytes method:
Dim DataToEncode As Byte() = Encoding.UTF8.GetBytes(txtbox.Text)
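To make the difference between the byte representations concrete, here is a minimal sketch that encodes one hypothetical sample string four ways and prints the resulting byte counts. The module name and the sample string are illustrative and not part of the question.

```vbnet
Imports System
Imports System.Text

Module ByteRepresentationsSketch
    Sub Main()
        ' Hypothetical input: "hello" with U+00E9 (e with acute) as the second character.
        Dim sample As String = "h" & ChrW(&HE9) & "llo"

        Console.WriteLine(Encoding.ASCII.GetBytes(sample).Length)    ' 5: U+00E9 is replaced by "?"
        Console.WriteLine(Encoding.UTF8.GetBytes(sample).Length)     ' 6: U+00E9 takes two bytes
        Console.WriteLine(Encoding.Unicode.GetBytes(sample).Length)  ' 10: UTF-16, two bytes per character here
        Console.WriteLine(Encoding.UTF32.GetBytes(sample).Length)    ' 20: four bytes per character
    End Sub
End Module
```

For text that contains only ASCII characters, the ASCII and UTF-8 results are byte-for-byte identical, so either works as input to a Base32 routine.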

Converting UTF-8 to windows-1255 encoding in VB.NET

I am trying to convert a string encoded in UTF-8 to windows-1255 in VB.NET with no luck. Admittedly, I don't know VB but have tried using an example at MSDN and modifying it to my needs:
Public Function Utf82Hebrew(ByVal Str As String) As String
Dim ascii As Encoding = Encoding.GetEncoding("windows-1255")
Dim unicode As Encoding = Encoding.Unicode
' Convert the string into a byte array.
Dim unicodeBytes As Byte() = unicode.GetBytes(Str)
' Perform the conversion from one encoding to the other.
Dim asciiBytes As Byte() = Encoding.Convert(unicode, ascii, unicodeBytes)
' Convert the new byte array into a char array and then into a string.
Dim asciiChars(ascii.GetCharCount(asciiBytes, 0, asciiBytes.Length)-1) As Char
ascii.GetChars(asciiBytes, 0, asciiBytes.Length, asciiChars, 0)
Dim asciiString As New String(asciiChars)
Utf82Hebrew = asciiString
End Function
This function doesn't actually do anything: the string remains in UTF-8. However, if I change this line:
Dim ascii As Encoding = Encoding.GetEncoding("windows-1255")
To this:
Dim ascii As Encoding = Encoding.ASCII
Then the function returns question marks in the place of the string.
Does anyone know how to properly convert a UTF-8 string to a specific encoding (in this case, windows-1255), and/or what I'm doing wrong in the above code?
Thanks in advance.
I modified your code.
It is very straightforward to convert text from one encoding into another.
This is how you should do it in VB.Net.
The default Microsoft Windows (Western European) code page is 1252, not 1255; 1255 is the Hebrew code page.
Public Function Utf82Hebrew(ByVal Str As String) As String
Dim ascii As System.Text.Encoding = System.Text.Encoding.GetEncoding("1252")
Dim unicode As System.Text.Encoding = System.Text.Encoding.Unicode
' Convert the string into a byte array.
Dim unicodeBytes As Byte() = unicode.GetBytes(Str)
' Perform the conversion from one encoding to the other.
Dim asciiBytes As Byte() = System.Text.Encoding.Convert(unicode, ascii, unicodeBytes)
' Decode the converted bytes back into a string.
Dim asciiString As String = ascii.GetString(asciiBytes)
Utf82Hebrew = asciiString
End Function
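Note that a .NET String is always UTF-16 internally, so a function that takes a String and returns a String cannot hand back "windows-1255 text"; an encoding only applies once the text becomes bytes. A minimal sketch of producing windows-1255 bytes follows. It assumes the .NET Framework, where this code page is available by default (newer .NET versions may need the code pages encoding provider registered first), and the module name and sample letters are hypothetical.

```vbnet
Imports System
Imports System.Text

Module HebrewBytesSketch
    Sub Main()
        ' Hypothetical input: the Hebrew letters alef, bet, gimel.
        Dim text As String = ChrW(&H5D0) & ChrW(&H5D1) & ChrW(&H5D2)

        Dim hebrew As Encoding = Encoding.GetEncoding("windows-1255")

        ' The encoded form exists only as bytes, e.g. for a file or a network stream.
        ' windows-1255 is a single-byte code page, so this yields one byte per letter.
        Dim hebrewBytes As Byte() = hebrew.GetBytes(text)
        Console.WriteLine(BitConverter.ToString(hebrewBytes))

        ' Decoding with the same encoding gives back the original string.
        Console.WriteLine(hebrew.GetString(hebrewBytes) = text)
    End Sub
End Module
```

Write hebrewBytes to the file or stream directly; converting them back into a String first returns the text to UTF-16.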

Converting non-Unicode to Unicode

I'm trying to convert a non-Unicode string like this, '¹ûº¤¡¾­¢º¤ìñ©2' to Unicode like this, 'ໃຊ້ໃນຄົວເຮືອນ' which is in Lao. I tried with the code below and its return value is like this, '??????'. Any idea how can I convert the string?
Public Shared Function ConvertAsciiToUnicode(asciiString As String) As String
' Create two different encodings.
Dim encAscii As Encoding = Encoding.ASCII
Dim encUnicode As Encoding = Encoding.Unicode
' Convert the string into a byte[].
Dim asciiBytes As Byte() = encAscii.GetBytes(asciiString)
' Perform the conversion from one encoding to the other.
Dim unicodeBytes As Byte() = Encoding.Convert(encAscii, encUnicode, asciiBytes)
' Convert the new byte[] into a char[] and then into a string.
' This is a slightly different approach to converting to illustrate
' the use of GetCharCount/GetChars.
Dim unicodeChars As Char() = New Char(encUnicode.GetCharCount(unicodeBytes, 0, unicodeBytes.Length) - 1) {}
encUnicode.GetChars(unicodeBytes, 0, unicodeBytes.Length, unicodeChars, 0)
Dim unicodeString As New String(unicodeChars)
' Return the new unicode string
Return unicodeString
End Function
Your 8-bit encoded Lao text is not in ASCII, but in some codepage like IBM CP1133 or Microsoft LC0454, or most likely, the Thai codepage 874. You have to find out which one it is.
It matters how you have obtained (read, received, computed) the input string. By the time you make it a string it is already in Unicode and is easy to output in UTF-8, for example, like this:
Dim writer As New StreamWriter("myfile.txt", True, System.Text.Encoding.UTF8)
writer.Write(mystring)
writer.Close()
Here is the whole in-memory conversion:
Dim cp874Input As Byte() ' raw input bytes, encoded in code page 874
...
Dim converted As Byte() = Encoding.Convert(Encoding.GetEncoding(874), Encoding.UTF8, cp874Input)
The number 874 identifies the code page of your input. Whether a particular operating system installation supports this code page is another question, but your own system will almost certainly support it if you just used it to compose your Stack Overflow question.
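The "??????" result in the question comes from Encoding.ASCII, which replaces every character above U+007F with a question mark. A minimal sketch of that effect, and of decoding raw bytes with a code page instead, is shown below. It assumes the .NET Framework (newer .NET versions may need the code pages encoding provider registered before GetEncoding(874) works), and the module name and byte values are hypothetical placeholders for your real input.

```vbnet
Imports System
Imports System.Text

Module CodePageSketch
    Sub Main()
        ' Two non-ASCII characters (U+00B9 and U+00FB), similar to the garbled input.
        Dim garbled As String = ChrW(&HB9) & ChrW(&HFB)

        ' Encoding.ASCII cannot represent them, so each becomes "?" (byte 0x3F).
        Dim lossy As Byte() = Encoding.ASCII.GetBytes(garbled)
        Console.WriteLine(BitConverter.ToString(lossy))   ' 3F-3F

        ' Hypothetical raw input: bytes as they arrived, before any decoding.
        Dim rawBytes As Byte() = {&HA1, &HA2}

        ' Decode with the code page the data was actually written in.
        Dim decoded As String = Encoding.GetEncoding(874).GetString(rawBytes)
        Console.WriteLine(decoded.Length)
    End Sub
End Module
```

The key point is to start from the original bytes; once the text has passed through Encoding.ASCII, the information needed to recover it is gone.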

Converting String to List of Bytes

This has to be incredibly simple, but I must not be looking in the right place.
I'm receiving this string via an FTDI USB connection:
'UUU'
I would like to receive this as a byte array of
[85,85,85]
In Python, I would convert a string to a byte array like this:
[ord(c) for c in 'UUU']
I've looked around, but haven't figured this out. How do I do this in Visual Basic?
Use the Encoding class with the correct encoding.
C#:
// Assuming the string should be encoded as UTF-8
Encoding utf8 = Encoding.UTF8;
byte[] bytes = utf8.GetBytes("UUU");
VB.NET:
Dim utf8 As Encoding = Encoding.UTF8
Dim bytes As Byte() = utf8.GetBytes("UUU")
It depends on what kind of encoding you want to use, but for UTF-8 this works; you could change it to UTF-16 if needed.
Dim strText As String = "UUU"
Dim encText As New System.Text.UTF8Encoding()
Dim btText() As Byte
btText = encText.GetBytes(strText)