I have followed this process:
Open Notepad and enter some text: "Hello World"
Save the file with ANSI encoding as: C:\HelloWorld.txt
I then run the following query:
SELECT * FROM OPENROWSET(BULK 'C:\HelloWorld.txt', SINGLE_CLOB) AS test;
The text appears in a column called: BulkColumn.
I then do this:
Open Notepad and enter some text: "Hello World"
Save the file with Unicode encoding as: C:\HelloWorld.txt
I then run the following query:
SELECT * FROM OPENROWSET(BULK N'C:\HelloWorld.txt', SINGLE_NCLOB) AS test;
The error I get is:
SINGLE_NCLOB requires a UNICODE (widechar) input file. The file specified is not Unicode.
Why is this?
You need to double-check how you saved the "Unicode" file. In Windows / .NET / SQL Server, the term "Unicode" refers specifically to "UTF-16 Little Endian (LE)". When dealing with UTF-16 Big Endian (BE), it will be referred to as "Unicode Big Endian" or "Big Endian Unicode". UTF-8 is always UTF-8.
I created a file in Notepad and went to "Save As" and selected "Unicode" from the "Encoding" drop-down and it worked just fine with the statement you are using:
SELECT *
FROM OPENROWSET(BULK N'C:\temp\OPENROWSET_BULK_NCLOB-test.txt', SINGLE_NCLOB) AS [Test];
If I re-saved it with any other encoding, I got the error message you are seeing.
I also used Notepad++ and in the "Encoding" menu selected "Encode in UCS-2 Little Endian". UCS-2 and UTF-16 are identical for Code Points U+0000 through U+FFFF and there is no UTF-16 option in Notepad++ so this was the closest thing. And yep, it also worked.
So somehow you did not actually save the file as "Unicode". If you selected "Unicode big endian" in Notepad, that is not "Unicode" in terms of how Windows is using that term, even if it is a valid Unicode encoding.
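The Windows meaning of "Unicode" can be verified at the byte level: a file saved with Notepad's "Unicode" option begins with the UTF-16 LE byte order mark FF FE, which is exactly what SINGLE_NCLOB checks for. Here is a minimal Python sketch (the file name is illustrative, not taken from the question):

```python
# Write "Hello World" the way Notepad's "Unicode" option does:
# UTF-16 Little Endian, preceded by the BOM bytes FF FE.
data = "Hello World".encode("utf-16-le")

with open("HelloWorld.txt", "wb") as f:
    f.write(b"\xff\xfe")   # UTF-16 LE byte order mark
    f.write(data)

# Inspect the first two bytes: FF FE marks UTF-16 LE ("Unicode"),
# whereas FE FF would mark UTF-16 BE ("Unicode big endian").
with open("HelloWorld.txt", "rb") as f:
    head = f.read(2)
print(head == b"\xff\xfe")  # True for a file SINGLE_NCLOB will accept
```

A file saved as "Unicode big endian" or UTF-8 would fail this check, which is consistent with the error message above.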
Related
I am reading in data from a .xlsx file, apparently encoded in ANSI(?). LabVIEW takes the data just fine, and when I create a text file based on that data, it looks fine when opened/viewed with ANSI encoding (in Notepad++ or plain Notepad). The problem is that Notepad++ defaults to UTF-8, so not many people know to change the encoding to "ANSI", and the ° symbol does not translate well.
I use the Report Generation Toolkit Excel Get Data VI to get the data from excel and return it as a 2D string array in LabVIEW.
I am assuming it is encoded in ANSI because when I open the text file (the .xml file that I insert the Excel data into) in Notepad++, I get two characters where my degree symbol ° should be; when I change the encoding from UTF-8 to ANSI, the data appears as I read it. Also, when I open the .xml file in Notepad, the degree symbol shows normally.
I have an issue when trying to read a string from a .CSV file. When I execute the application and the text is shown in a textbox, certain characters such as "é" or "ó" are shown as a question mark symbol.
The idea is that this code reads the whole CSV file and then splits each line into variables depending on the first word of the line.
The code I'm using to read is:
Dim test() As String
test = IO.File.ReadAllLines("Libro1.csv")
Dim test_chart As String = Array.Find(test, Function(x) x.StartsWith("sample"))
Dim test_chart_div() As String = test_chart.Split(";"c)
variable1 = test_chart_div(1)
variable2 = test_chart_div(2)
...etc
I have also tried with:
Dim test() As String
test = IO.File.ReadAllLines("Libro1.csv", System.Text.Encoding.UTF8)
But neither of them works. The .csv file is supposed to be UTF-8. The "web options" that you can see when saving the file in Excel show UTF-8 encoding. I also tried the trick of changing the file extension to HTML and opening it with the browser, which confirms that the encoding looks correct.
Can someone advise anything else I can try?
Thanks in advance.
When an Excel file is exported using the CSV (Comma Separated) output format, the encoding selected in Tools -> Web Options -> Encoding in Excel's Save As... dialog doesn't actually produce the expected result:
the text file is saved using the encoding of the language currently selected in the Excel application, not the Unicode (UTF-16 LE) or UTF-8 encoding selected (which is ignored), nor the default encoding determined by the current system language.
To import the CSV file, you can use the Encoding.GetEncoding() method to specify the Name or CodePage of the Encoding used in the machine that generated the file: again, not the Encoding related to System Language, but the Encoding of the Language that the Excel Application is currently using.
CodePage 1252 (Windows-1252) and ISO-8859-1 are commonly used in the Latin-1 zone.
Based on the symbols you're referring to, one of these is most probably the original encoding used.
In Windows, use the former; ISO-8859-1 is still used, mostly in old web pages (or web pages created without care for the encoding used).
As a note, CodePage 1252 and ISO-8859-1 are not exactly the same Encoding, there are subtle differences.
If you find documentation that states the opposite, the documentation is wrong.
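That difference sits in the 0x80-0x9F byte range: Windows-1252 assigns printable characters there (such as the euro sign), while ISO-8859-1 reserves it for C1 control characters. A short Python illustration, using Python's codec names as stand-ins for the .NET ones in the answer:

```python
# The accented letters in question are identical in both encodings ...
assert "é".encode("cp1252") == "é".encode("latin-1") == b"\xe9"
assert "ó".encode("cp1252") == "ó".encode("latin-1") == b"\xf3"

# ... but 0x80 is the euro sign in Windows-1252 ...
print(b"\x80".decode("cp1252"))              # €
# ... and an unprintable C1 control character in ISO-8859-1.
print(hex(ord(b"\x80".decode("latin-1"))))   # 0x80
```

This is why decoding a Windows-1252 file as ISO-8859-1 usually "works" for accented Latin letters but silently corrupts characters like €, ™, and the curly quotes.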
Example:
Open "C:\...\someFile.txt" For Output As #1
Print #1, someString
Close #1
If someString contains non-ASCII characters, how are they encoded? (UTF-8, Latin-1, some codepage depending on the Windows locale, ...)
On my system, the code above seems to use Windows-1252, but since neither the documentation of the Open statement nor the documentation of the Print # statement mentions string encodings, I cannot be sure whether this is some built-in default or some system setting, and I'm looking for an authoritative answer.
Note: Thanks to everyone suggesting alternatives for how to create files with specific encodings (ADODB.Stream, Scripting.FileSystemObject, etc.) - they are appreciated. This question, however, is about understanding the exact behavior of legacy code, so I am only interested in the behavior of the code quoted above.
Testing indicates that the VBA Print command converts Unicode strings to the single-byte character set of the code page for the current Windows "Language for non-Unicode programs" system locale. This can be illustrated with the following code, which attempts to write the Greek word Ώπα:
Option Compare Database
Option Explicit
Sub GreekTest()
    Dim someString As String
    someString = ChrW(&H38F) & ChrW(&H3C0) & ChrW(&H3B1) ' "Ώπα"
    Open "C:\Users\Gord\Desktop\someFile.txt" For Output As #1
    Print #1, someString
    Close #1
End Sub
When run with Windows set to the default locale for US English, the resulting file contains the bytes
3F 70 61
which correspond to the Windows-1252 characters ?pa. Windows-1252 is the character set most commonly (but incorrectly) referred to as "ANSI".
However, after changing the Windows "non-Unicode" locale setting to Greek (Greece)
the same VBA code writes a file containing the bytes
BF F0 E1
which correspond to the Windows-1253 (Greek) characters Ώπα.
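The Windows-1253 half of this result can be reproduced directly in Python. Note one caveat: the 3F 70 61 bytes on the US English locale come from Windows' "best-fit" character mapping (π → p, α → a), which Python's codecs deliberately do not emulate, so only the Greek-locale bytes match exactly:

```python
greek = "\u038f\u03c0\u03b1"  # Ώπα

# Under the Greek code page (Windows-1253) every character has a slot,
# matching the BF F0 E1 bytes observed above.
print(greek.encode("cp1253").hex())  # bff0e1

# Under Windows-1252 none of the three characters exist. VBA's Print
# goes through Windows' best-fit conversion (giving ? p a = 3F 70 61);
# Python's "replace" handler substitutes '?' for all of them instead.
print(greek.encode("cp1252", errors="replace"))  # b'???'
```

Either way, the takeaway is the same: Print # narrows Unicode strings to the "Language for non-Unicode programs" code page, losing anything that code page cannot represent.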
We have some string values stored in an Oracle DB. We are writing these values to a .DAT file. The code snippet in our package looks like this:
Opening file :
l_file := UTL_FILE.fopen(l_dir, l_file_name, 'W');
Writing to file :
UTL_FILE.putf(l_file, ('E' ||';'|| TO_CHAR(l_rownum) ||';'|| TO_CHAR(v_cdr_session_id_seq) ||';' || l_record_data));
String value in the DB : "Owner’s address"
String value in the .DAT file : "Owner’s address"
Question is: how do we avoid those garbled characters when writing to an output file?
I assume your database uses character set AL32UTF8 (which is the default nowadays). In such case try this:
l_file := UTL_FILE.FOPEN_NCHAR(l_dir, l_file_name, 'W');
UTL_FILE.PUT_NCHAR(l_file, 'E;' || l_rownum || ';' || v_cdr_session_id_seq || ';' || l_record_data);
Note for function FOPEN_NCHAR: Even though the contents of an NVARCHAR2 buffer may be AL16UTF16 or UTF8 (depending on the national character set of the database), the contents of the file are always read and written in UTF8.
Summarising from comments, your Linux session and Vim configuration are both using UTF-8, but your terminal emulator software is using the Windows-1252 code page. That renders the 'curly' right quote mark you have, ’, which is Unicode code point U+2019, as ’.
You need to change your emulator's configuration from Windows-1252 to UTF-8. Exactly how depends on which emulator you are using. For example, in PuTTY you can change the current session by right-clicking on the window title bar and choosing "Change settings...", then going to the "Window" category and its "Translation" subcategory, and changing the value from the default "Win1252 (Western)" to "UTF-8".
If you have the file open in Vim you can press Ctrl-L to redraw it. That will only affect the current session, though; to make the change permanent you'll need to make the same change to the stored session settings, from the "New session..." dialog: load your current settings, and remember to save the changes before actually opening the new session.
Other emulators have similar settings but in different places. For XShell, according to their web site:
You can easily switch to UTF8 encoding by selecting Unicode (UTF8) in the Encoding list on the Standard toolbar.
It looks like you can also set it for a session, or as the default.
I am exporting text files from two queries in MS Access 2010. The queries are from different linked ODBC tables (the tables differ only in data; structure and data types are the same). I set up an export specification to export both text files in UTF-8 encoding. Now here comes the troublesome part: when I export the queries and open them in Notepad, one is in UTF-8 and the other is in ANSI. I don't know how this is possible when both queries have the same export specification, and it is driving me crazy.
This is my VBA code to export queries:
DoCmd.TransferText acExportDelim, "miniflow", "qry01_CZ_test", "C:\TEST_CZ.txt", False
DoCmd.TransferText acExportDelim, "miniflow", "qry01_SK_test", "C:\TEST_SK.txt", False
I also tried to modify it by adding 65001 as the code page argument, but the results were the same.
Do you have any idea what could be wrong?
Don't rely on the File Open dialog in Notepad to tell you whether a text file is encoded as "ANSI" or UTF-8. That is just Notepad's "guess" based on whether the file begins with the bytes EF BB BF, which is the UTF-8 Byte Order Mark (BOM).
Many (most?) Windows applications will include the UTF-8 BOM at the beginning of a text file that is UTF-8 encoded. Some Unicode purists insist, often quite vigorously, that the BOM is not required for UTF-8 files and should be excluded, but that is the way Windows applications tend to behave.
Unfortunately, Access does not always follow that pattern when it exports files to text. A UTF-8 text file exported from Access may omit the BOM and that can confuse applications like Notepad if they assume that a UTF-8 encoded file will always include the BOM as the first three bytes of the file.
For a more reliable way of determining the encoding of a text file, consider using an application like Notepad++ to open the file. It will differentiate between UTF-8 files with a BOM (which it designates as "UTF-8") and UTF-8 files without a BOM (which it designates as "ANSI as UTF-8").
To illustrate, consider the following Access table
When exported to text (CSV) with UTF-8 encoding,
the File Open dialog in Notepad reports that it is encoded as "ANSI"
but a hex editor shows that it is in fact encoded as UTF-8 (the character é is encoded as C3 A9, not simply E9 as would be the case for true "ANSI" encoding)
and Notepad++ recognizes it as "ANSI as UTF-8"
in other words, a UTF-8 encoded file without a BOM.
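Both the BOM heuristic and the C3 A9 vs E9 distinction for é can be sketched in Python (the file name below is illustrative):

```python
import codecs

# é is two bytes (C3 A9) in UTF-8 but a single byte (E9) in
# Windows-1252, the encoding Notepad labels "ANSI".
assert "é".encode("utf-8") == b"\xc3\xa9"
assert "é".encode("cp1252") == b"\xe9"

# Write a UTF-8 file without a BOM, the way Access sometimes does.
with open("no_bom.txt", "wb") as f:
    f.write("é".encode("utf-8"))

# Notepad's heuristic: report UTF-8 only when the file starts with
# the BOM bytes EF BB BF; otherwise call it "ANSI".
with open("no_bom.txt", "rb") as f:
    head = f.read(3)
print(head.startswith(codecs.BOM_UTF8))  # False: Notepad would say "ANSI"
```

A hex dump of the file still shows C3 A9, so the content is genuinely UTF-8; only the BOM-based guess is wrong, which matches the behaviour described above.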