What default encoding is used when using StreamWriter to write into a file with no Encoding parameter? - vb.net

I had a situation where we produce a file for our client, and the file would contain some special characters like accented i or a (í, á) etc.
Originally, we used this code to open file for output:
Using sw As StreamWriter = New StreamWriter(fullpath, True)
However, the í and á would show up in the file as two-byte sequences, with hex codes c3 ad for the í and c3 a1 for the á.
We fixed the issue by enforcing the Windows-1252 encoding when writing to the file (which is the same as Encoding.Default, but according to MSDN we should NOT be using Encoding.Default):
Using sw As StreamWriter = New StreamWriter(fullpath, True, Text.Encoding.GetEncoding(1252))
Question: if Encoding.Default is not really a default encoding when no Encoding parameter was supplied, which encoding is the default default (pardon the pun) encoding?
Question 2: probably the same answer as Question 1, but what is the default default encoding for StreamReader if you don't specify an Encoding parameter?

Well, you didn't really fix the issue. The bytes "c3 ad for the í" are exactly what Encoding.UTF8 produces.
Which is what StreamWriter is already using when you don't pass an encoding. However, it uses the UTF8Encoding constructor that takes the encoderShouldEmitUTF8Identifier argument and passes false, so the file gets no BOM (Byte Order Mark). The BOM tells the program that reads the file unequivocally what Unicode encoding is used. Sadly, Microsoft cannot force a BOM because the Unicode consortium, in a highly uncharacteristic moment of temporary insanity, made a BOM optional. StreamReader likewise defaults to UTF-8 when no Encoding is given, but it checks the start of the file for a BOM first.
It probably works now because the program that reads the file falls back to the system's default encoding when it can't find the BOM. You might have guessed correctly at 1252, it is common, but certainly no guarantee. Fix:
Using sw As StreamWriter = New StreamWriter(fullpath, True, Encoding.UTF8)
Do beware the True argument you use. Which appends text to the file. If the file already contains text then you can't get the BOM added anymore. Also a rather nasty problem if the file got started with a different encoding, you certainly don't want to get a mix. Do everything you can to avoid having to use True.
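To make the byte patterns above concrete, here is a small sketch. It uses Python rather than VB.NET purely to show the bytes; the codec names "utf-8", "utf-8-sig" and "cp1252" are Python's stand-ins for UTF-8 without BOM, UTF-8 with BOM, and Windows-1252, and the sample text is made up.

```python
# Illustration only: Python codecs standing in for the .NET encodings.
text = "í á"

utf8_no_bom = text.encode("utf-8")        # what a BOM-less UTF-8 writer emits
utf8_with_bom = text.encode("utf-8-sig")  # same bytes, prefixed with the BOM ef bb bf
cp1252 = text.encode("cp1252")            # one byte per character

print(utf8_no_bom.hex(" "))    # c3 ad 20 c3 a1
print(utf8_with_bom.hex(" "))  # ef bb bf c3 ad 20 c3 a1
print(cp1252.hex(" "))         # ed 20 e1

# A reader that finds no BOM and falls back to Windows-1252 sees two
# characters for every accented letter instead of one.
misread = utf8_no_bom.decode("cp1252")
print(len(text), len(misread))  # 3 5
```

The c3 ad / c3 a1 pairs from the question are therefore valid UTF-8; they only look wrong when the consumer assumes a single-byte code page.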

Related

Characters not displayed correctly when reading CSV file

I have an issue when trying to read a string from a .CSV file. When I execute the application and the text is shown in a textbox, certain characters such as "é" or "ó" are shown as a question mark symbol.
The idea is that this code reads the whole CSV file and then splits each line into variables depending on the first word of the line.
The code I'm using to read is:
Dim test() As String
test = IO.File.ReadAllLines("Libro1.csv")
Dim test_chart As String = Array.Find(test, Function(x) (x.StartsWith("sample")))
Dim test_chart_div() As String = test_chart.Split(";")
variable1 = test_chart_div(1)
variable2 = test_chart_div(2)
...etc
I have also tried with:
Dim test() As String
test = IO.File.ReadAllLines("Libro1.csv", System.Text.Encoding.UTF8)
But neither of them works. The .csv file is supposed to be UTF-8. The "web options" that you can see when saving the file in Excel show encoding UTF-8. I also tried the trick of changing the file extension to HTML and opening it with the browser to see that the encoding is also correct.
Can someone advise anything else I can try?
Thanks in advance.
When an Excel file is exported using the CSV (Comma Separated) output format, the Encoding selected in Tools -> Web Options -> Encoding of Excel's Save As... dialog doesn't actually generate the expected result:
the Text file is saved using the Encoding relative to the current Language selected in the Excel Application, not the Unicode (UTF16-LE) or UTF-8 Encoding selected (which is ignored) nor the default Encoding determined by the current System Language.
To import the CSV file, you can use the Encoding.GetEncoding() method to specify the Name or CodePage of the Encoding used in the machine that generated the file: again, not the Encoding related to System Language, but the Encoding of the Language that the Excel Application is currently using.
CodePage 1252 (Windows-1252) and ISO-8859-1 are commonly used in Latin1 zone.
Based on the symbols you're referring to, this is most probably the original encoding used.
In Windows, use the former. ISO-8859-1 is still used, mostly in old Web Pages (or Web Pages created without care for the Encoding used).
As a note, CodePage 1252 and ISO-8859-1 are not exactly the same Encoding: Windows-1252 assigns printable characters such as € and the curly quotes to bytes 0x80-0x9F, where ISO-8859-1 has control codes.
If you find documentation that states the opposite, the documentation is wrong.
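A minimal sketch of why the UTF-8 read produces question marks, again in Python only to expose the bytes. The sample line is invented, and "cp1252" is an assumption about what Excel wrote on the asker's machine.

```python
# Assumption: Excel saved the CSV as Windows-1252, not UTF-8.
raw = "sample;é;ó".encode("cp1252")

# Decoding those bytes as UTF-8 fails for every accented letter; with
# errors="replace" each one becomes U+FFFD, the replacement character.
wrong = raw.decode("utf-8", errors="replace")
right = raw.decode("cp1252")

print(wrong == "sample;\ufffd;\ufffd")  # True
print(right)                            # sample;é;ó

# Windows-1252 and ISO-8859-1 disagree in the 0x80-0x9F range.
print(b"\x80".decode("cp1252"))         # €
print(repr(b"\x80".decode("latin-1")))  # '\x80' (a control code)
```

In VB.NET the equivalent of the `right` line would be passing Encoding.GetEncoding(1252) to ReadAllLines, as the answer describes.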

wxWidgets string on windows

I have this code:
wxString tmp(wxT("Información del usuario"));
wxStaticBoxSizer* sbSizer1 = new wxStaticBoxSizer (wxVERTICAL, panel, tmp);
This shows strange symbols instead of ó in Windows, but in Linux it shows the letter correctly. Any ideas?
The value of the string in your code depends on the encoding of your source file and also the charset used by your compiler. If your source file itself is in Unicode (whether it's UTF-8 or UTF-16) and the compiler knows that, then you can use L"..." to create a wide string literal. If not, or you're not sure, you can always use wxString::FromUTF8() and spell out the UTF-8 bytes explicitly, e.g. wxString::FromUTF8("Informaci\xc3\xb3n...") will always work, because the literal then contains only ASCII characters and escapes.
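As a rough illustration of what the escaped literal contains, and of what happens when the same bytes are interpreted with a Windows code page, here is a Python sketch. It is not wxWidgets code, and treating the Windows side as "cp1252" is an assumption.

```python
# The escaped literal from the answer, as raw bytes. The source text of
# this literal is pure ASCII, so the source file encoding cannot affect it.
literal = b"Informaci\xc3\xb3n del usuario"

# Interpreted as UTF-8 (what FromUTF8 expects), the bytes give the right text.
print(literal.decode("utf-8"))   # Información del usuario

# Interpreted as Windows-1252, the two bytes of ó become two characters.
print(literal.decode("cp1252"))  # InformaciÃ³n del usuario
```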

Saving CSV file with degree symbol and ASCII encoded

I have a string variable txt. It contains the "°" degree symbol. I would like to save the string into a CSV file, ASCII encoded. I use the procedure below, but the "°" symbol is converted to "?". Do you have any idea how to save the degree symbol properly?
Public Sub Write_File(ByVal txt As String, ByVal fName As String)
Try
Using OutFile As New StreamWriter(fName, False, Text.Encoding.ASCII)
OutFile.Write(txt)
End Using
Me.Write_Log("Succesfully Exported")
Catch ex As Exception
Me.Write_Log("Write Error during export")
End Try
End Sub
Encoding.ASCII is for the standard 7-bit ASCII encoding, which does not contain a degree symbol at all. In order to get a degree symbol in ASCII, you would have to use one of the many 8-bit ASCII encodings. For English, you'd probably be most interested in using the ISO 8859-1 code page, since that's the most standard-ish one there is of the bunch. For instance, instead of using Encoding.ASCII, you could do something like this:
Using OutFile As New StreamWriter(fName, False, Text.Encoding.GetEncoding("iso-8859-1"))
OutFile.Write(txt)
End Using
For a complete list of available encodings, use the Encoding.GetEncodings method, or look at the list of supported ones in the MSDN documentation.
Of course, none of the various 8-bit ASCII encodings are compatible with each other, so, if you do use that, the degree symbol will be a completely different symbol when viewed on a system that uses a different code page by default. That is precisely why UTF-8 has become the new standard. Usage of 8-bit ASCII is widely discouraged since it is practically unworkable in multi-cultural scenarios. If you can use UTF-8 instead, I would. If you must use ASCII, it's best to stick to the standard 7-bit encoding. If you must use an 8-bit ASCII encoding, please do so sparingly and with full awareness of its drawbacks.
One more thing. You mention the degree symbol as being character 167 (0xA7) in your desired target encoding. If that is the case, you may actually be wanting IBM437 encoding rather than ISO 8859-1. IBM437 is the old code page that was used by default in MS-DOS. (Strictly speaking, byte 167 in IBM437 is the masculine ordinal indicator º, which looks very much like a degree sign; the degree sign proper is byte 248, 0xF8, in IBM437 and byte 176, 0xB0, in ISO 8859-1.) If you really need to use that code page, you may have additional trouble for two reasons. As you'll see in the MSDN article, that code page is not well supported in the .NET framework. In my testing, outputting the Unicode string containing the degree symbol using that encoding did not work properly. Therefore, you may find yourself needing to use a byte array to represent the data rather than a String variable (which is Unicode). For instance:
File.WriteAllBytes("Test.txt", {167})
The second problem is that IBM437 is likely not the default code page for your Windows OS, so even when it is written to the file as byte value 167, it won't actually look like a degree symbol when you view it in a Windows application such as Notepad.
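The byte values involved can be checked with a short sketch. Python is used here only as a convenient way to look up code page tables; the codec names "latin-1" and "cp437" are Python's names for ISO 8859-1 and IBM437.

```python
degree = "\u00b0"  # the degree sign

# 7-bit ASCII has no degree sign, so a lossy encoder substitutes "?".
print(degree.encode("ascii", errors="replace"))  # b'?'

# Each single-byte code page puts it somewhere else.
print(degree.encode("latin-1")[0])  # 176 (0xB0)
print(degree.encode("cp437")[0])    # 248 (0xF8)

# UTF-8 needs two bytes for it.
print(degree.encode("utf-8").hex(" "))  # c2 b0

# Byte 167 is a different character in each code page, and in neither
# case is it the degree sign itself.
print(bytes([167]).decode("cp437"))    # º (masculine ordinal indicator)
print(bytes([167]).decode("latin-1"))  # §
```

This is why a file written with one code page shows a different symbol when opened under another.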

Copying a text file line by line vb.net half the size?

I'm trying to modify specific lines in a 6 gig text file (SQL script). So I read it in with IO.StreamReader.ReadLine and write to a new file with IO.StreamWriter.WriteLine. If the line matches a certain condition, I'm modifying it before I write it.
The problem is, the resulting file is almost exactly half the size of the original file (the original is 1.999582... times larger)...
I'm trying to make sure the encoding is the same using:
sw = New IO.StreamWriter(NewFilepath, False, sr.CurrentEncoding)
But it doesn't make a difference, the new file is half the size of the old...
Where are you setting the encoding for your StreamReader, sr? If you are not doing this explicitly, and if you are creating the StreamWriter before you perform any reads of your file (my best guess), then the CurrentEncoding of the StreamReader may still change afterwards, because it autodetects the encoding from the source file on the first read.
From MSDN on StreamReader.CurrentEncoding:
"The current character encoding used by the current reader. The value can be different after the first call to any Read method of StreamReader, since encoding autodetection is not done until the first call to a Read method."
To determine the encoding, you can read the first line of the file with the StreamReader (remember to write that line to the new file as well) and only then create the writer:
sw = New IO.StreamWriter(NewFilepath, False, sr.CurrentEncoding)
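A halved file size is what one would expect if the original is UTF-16 and the copy ends up as UTF-8; that is an assumption about the asker's file, not something the question confirms. The Python sketch below shows the size ratio and a BOM check; `sniff_bom` is a hypothetical helper that only approximates what a reader's autodetection does on its first read.

```python
import codecs

def sniff_bom(first_bytes: bytes, default: str = "utf-8") -> str:
    """Hypothetical helper: guess the encoding from a leading BOM."""
    if first_bytes.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if first_bytes.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"  # Python's utf-16 codec uses the BOM to pick the byte order
    return default       # nothing detected: fall back to the default

line = "INSERT INTO t VALUES (1);\r\n"  # made-up sample line
original = codecs.BOM_UTF16_LE + line.encode("utf-16-le")

# Before looking at any bytes, only the default is known.
print(sniff_bom(b""))            # utf-8
# After looking at the start of the file, the real encoding is known.
print(sniff_bom(original[:4]))   # utf-16

# Writing the same text as UTF-8 halves the size for ASCII-only content.
print(len(line.encode("utf-16-le")), len(line.encode("utf-8")))  # 54 27
```

Non-ASCII characters take two or more bytes in UTF-8, which may be why the asker's ratio is slightly below 2.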

Encoding issue in I/O with Jena

I'm generating some RDF files with Jena. The whole application works with utf-8 text. The source code as well is stored in utf-8.
When I print a string containing non-English characters on the console, I get the right format, e.g. Est un lieu généralement officielle assis....
Then, I use the RDF writer to output the file:
Model m = loadMyModelWithMultipleLanguages()
log.info( getSomeStringFromModel(m) ) // log4j, correct output
RDFWriter w = m.getWriter( "RDF/XML" ) // default enc: utf-8
w.setProperty("showXmlDeclaration","true") // optional
OutputStream out = new FileOutputStream(pathToFile)
w.write( m, out, "http://someurl.org/base/" )
// file contains garbled text
The RDF file starts with: <?xml version="1.0"?>. If I add utf-8 nothing changes.
By default the text should be encoded to utf-8.
The resulting RDF file validates OK, but when I open it with any editor/visualiser (vim, Firefox, etc.), non-English text is all messed up: Est un lieu g√©n√©ralement officielle assis ... or Est un lieu g\u221A\u00A9n\u221A\u00A9ralement officielle assis....
(Either way, this is obviously not acceptable from the user's viewpoint).
The same issue happens with any output format supported by Jena (RDF, NT, etc.).
I can't really find a logical explanation to this.
The official documentation doesn't seem to address this issue.
Any hint or tests I can run to figure it out?
My guess would be that your strings are messed up, and your printStringFromModel() method just happens to output them in a way that accidentally makes them display correctly, but it's rather hard to say without more information.
You're instructing Jena to include an XML declaration in the RDF/XML file, but don't say what encoding (if any) Jena declares in the XML declaration. This would be helpful to know.
You're also not showing how you're printing the strings in the printStringFromModel() method.
Also, in Firefox, go to the View menu and then to Character Encoding. What encoding is selected? If it's not UTF-8, then what happens when you select UTF-8? Do you get it to show things correctly when selecting some other encoding?
Edit: The snippet you show in your post looks fine and should work. My best guess is that the code that reads your source strings into a Jena model is broken, and reads the UTF-8 source as ISO-8859-1 or something similar. You should be able to confirm or disconfirm that by checking the length() of one of the offending strings: if each of the troublesome characters like é is counted as two, then the error is on reading; if it's correctly counted as one, then it's on writing.
My hint/answer would be to inspect the byte sequence in 3 places:
The data source. Using a hex editor, confirm that the é character in your source data is represented by the expected utf-8 hex sequence 0xc3a9.
In memory. Right after your call to printStringFromModel, put a breakpoint and inspect the bytes in the string (or convert to hex and print them out).
The output file. Again, use a hex editor to check that the byte sequence is 0xc3a9.
This will tell exactly what is happening to the bytes as they travel along the path of your program, and also where they deviate from the expected 0xc3a9.
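The length check and the byte inspection can be sketched as follows. Python is used only to show the bytes; reading the source as "latin-1" or "mac_roman" is an assumption about what might be going wrong, not a diagnosis of the asker's code.

```python
original = "é"                          # U+00E9
utf8_bytes = original.encode("utf-8")
print(utf8_bytes.hex(" "))              # c3 a9

# If UTF-8 input is read with a single-byte charset, one character
# turns into two in memory.
misread = utf8_bytes.decode("latin-1")
print(len(original), len(misread))      # 1 2

# Writing that misread string back out as UTF-8 doubles the damage.
print(misread.encode("utf-8").hex(" "))  # c3 83 c2 a9

# The \u221A\u00A9 pair quoted in the question is what the same two
# bytes look like when interpreted as Mac OS Roman.
print(utf8_bytes.decode("mac_roman") == "\u221a\u00a9")  # True
```

If the in-memory string already has length 2 for é, the damage happened while reading; if it has length 1 but the file shows four bytes, it happened while writing.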
The best way to address this would be to package up the smallest unit of your code that you can that demonstrates the issue, and submit a complete, runnable test case as a ticket on the Jena Jira.