UTL_FILE operations produce special characters in the output file - sql

We have some string values stored in an Oracle DB. We are writing these values to a .DAT file. The query snippet in our package looks like this:
Opening the file:
l_file := UTL_FILE.fopen(l_dir, l_file_name, 'W');
Writing to the file:
UTL_FILE.putf(l_file, ('E' ||';'|| TO_CHAR(l_rownum) ||';'|| TO_CHAR(v_cdr_session_id_seq) ||';' || l_record_data));
String value in the DB : "Owner’s address"
String value in the .DAT file : "Owner’s address"
The question is: how can we avoid those special characters when writing to an output file?

I assume your database uses the character set AL32UTF8 (which is the default nowadays). In that case, try this:
l_file := UTL_FILE.FOPEN_NCHAR(l_dir, l_file_name, 'W');
UTL_FILE.PUT_NCHAR(l_file, 'E;' || l_rownum || ';' || v_cdr_session_id_seq || ';' || l_record_data);
Note for function FOPEN_NCHAR: Even though the contents of an NVARCHAR2 buffer may be AL16UTF16 or UTF8 (depending on the national character set of the database), the contents of the file are always read and written in UTF8.
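For completeness, a minimal end-to-end sketch of the NCHAR variant; the directory object name, file name and literal record values below are placeholders, not taken from the original package:
DECLARE
  l_file UTL_FILE.FILE_TYPE;
BEGIN
  -- 'DATA_DIR' and 'output.dat' are hypothetical; substitute your l_dir and l_file_name
  l_file := UTL_FILE.FOPEN_NCHAR('DATA_DIR', 'output.dat', 'W');
  -- PUT_LINE_NCHAR writes the NVARCHAR2 buffer plus a line terminator; the file content is UTF-8
  UTL_FILE.PUT_LINE_NCHAR(l_file, N'E;1;42;Owner''s address');
  UTL_FILE.FCLOSE(l_file);
END;
/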

Summarising from the comments: your Linux session and Vim configuration are both using UTF-8, but your terminal emulator software is using the Windows-1252 codepage. That renders the 'curly' right-quote mark you have, ’, which is Unicode codepoint U+2019, as ’.
You need to change your emulator's configuration from Windows-1252 to UTF-8. Exactly how depends on which emulator you are using. For example, in PuTTY you can change the current session by right-clicking on the window title bar and choosing "Change Settings...", then going to the "Window" category and its "Translation" subcategory, and changing the value from the default "Win1252 (Western)" to "UTF-8".
If you have the file open in Vim you can press Ctrl-L to redraw it. That will only affect the current session, though; to make the change permanent you'll need to make the same change to the stored session settings from the "New Session..." dialog - loading your current settings, and remembering to save the changes before actually opening the new session.
Other emulators have similar settings but in different places. For XShell, according to their web site:
You can easily switch to UTF8 encoding by selecting Unicode (UTF8) in the Encoding list on the Standard toolbar.
It looks like you can also set it for a session, or as the default.

Related

Characters not displayed correctly when reading CSV file

I have an issue when trying to read a string from a .CSV file. When I execute the application and the text is shown in a textbox, certain characters such as "é" or "ó" are shown as a question mark symbol.
The idea is that this code reads the whole CSV file and then splits each line into variables depending on the first word of the line.
The code I'm using to read is:
Dim test() As String
test = IO.File.ReadAllLines("Libro1.csv")
Dim test_chart As String = Array.Find(test, Function(x) (x.StartsWith("sample")))
Dim test_chart_div() As String = test_chart.Split(";"c)
variable1 = test_chart_div(1)
variable2 = test_chart_div(2)
' ...etc
I have also tried with:
Dim test() As String
test = IO.File.ReadAllLines("Libro1.csv", System.Text.Encoding.UTF8)
But neither of them works. The .csv file is supposed to be UTF-8: the "Web Options" you can see when saving the file in Excel show the encoding as UTF-8. I also tried the trick of changing the file extension to HTML and opening it with the browser to check that the encoding is correct.
Can someone advise anything else I can try?
Thanks in advance.
When an Excel file is exported using the CSV (Comma Separated) output format, the encoding selected in Tools -> Web Options -> Encoding of Excel's Save As... dialog doesn't actually produce the expected result:
the text file is saved using the encoding of the language currently selected in the Excel application - not the Unicode (UTF-16 LE) or UTF-8 encoding selected (which is ignored), nor the default encoding determined by the current system language.
To import the CSV file, you can use the Encoding.GetEncoding() method to specify the name or code page of the encoding used on the machine that generated the file: again, not the encoding related to the system language, but the encoding of the language the Excel application is currently using.
CodePage 1252 (Windows-1252) and ISO-8859-1 are commonly used in the Latin-1 zone.
Based on the symbols you're referring to, this is most probably the original encoding used.
In Windows, use the former. ISO-8859-1 is still used, mostly in old web pages (or web pages created without care for the encoding used).
As a note, CodePage 1252 and ISO-8859-1 are not exactly the same encoding; there are subtle differences.
If you find documentation that states the opposite, that documentation is wrong.
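For example, assuming the file was actually written with the Windows-1252 codepage (the file name is taken from the question), the import would look like this:
Dim test() As String
test = IO.File.ReadAllLines("Libro1.csv", System.Text.Encoding.GetEncoding(1252))
' equivalently: System.Text.Encoding.GetEncoding("Windows-1252")
' (on .NET Core / .NET 5+, the code-page encodings require registering System.Text.Encoding.CodePagesEncodingProvider first)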

SQL Server adding extra special characters in query result

I am trying to extract some records into a file using the BCP command in SQL Server. However, when the file is generated, there are extra spaces between the values for each column.
To test, I wrote a basic SQL query as simple as this:
select 'ABC', 40, 'TEST','NOTWORKING'
When we copy the output of the above query and paste it into Notepad, the output comes out as
ABC 40 TEST NOTWORKING
Notice the space between each value? The file that the system generates using the BCP command also has the same spaces in the output file, which is incorrect. What I want to see in the output file is
ABC40TESTNOTWORKING
What could be causing this issue? I am simply amazed to see such a weird issue and hope it can be fixed by some change or setting. Please help.
Sample BCP command
EXEC xp_cmdshell 'bcp "select ''ABC'', 40, ''TEST'',''NOTWORKING''" queryout "E:\Testfile.txt" -c -T -S""'
Output in the File - Testfile.txt
ABC 40 TEST NOTWORKING
There are probably tabs between the values. If you want a single value, use concat():
select CONCAT('ABC', 40, 'TEST', 'NOTWORKING')
There's no issue. The command line has no field terminator argument, so the default is used: a tab. That's described in the docs:
-t field_term
Specifies the field terminator. The default is \t (tab character). Use this parameter to override the default field terminator. For more information, see Specify Field and Row Terminators (SQL Server).
If you specify the field terminator in hexadecimal notation in a bcp.exe command, the value will be truncated at 0x00. For example, if you specify 0x410041, 0x41 will be used.
If field_term begins with a hyphen (-) or a forward slash (/), do not include a space between -t and the field_term value.
The link points to an entire article that explains how to use terminators, for each of the bulk operations.
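For illustration only, this is roughly how the question's command would look with the default tab overridden by a semicolon terminator (path and server argument kept as in the question); to get no separator at all, the simplest route is still to concatenate the values in the query itself, as shown above:
EXEC xp_cmdshell 'bcp "select ''ABC'', 40, ''TEST'',''NOTWORKING''" queryout "E:\Testfile.txt" -c -t";" -T -S""'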
As for the Copy/Paste operation, it has nothing to do with SQL Server. SQL Server has no UI, it's a service. I suspect what was pasted in Notepad was copied from an SSMS grid.
SSMS is a client tool just like any other. When you copy data from it into the clipboard, it decides what to put there and what format to use. That format can be plain text, using spaces and tabs for layout, RTF, HTML etc.
Plain text with tabs as field separators is probably the best choice for any tool, as it preserves the visual layout up to a point and uses only a single character as a separator. A fixed-length layout using spaces could also be used but that would add characters that may well be part of a field.
Encodings and codepages
-c exports the data using the user's default codepage. This means that text stored in varchar fields using a different codepage (collation) may get mangled. Unicode characters that can't be represented in that codepage will also get mangled and appear as something else, or as ?.
-c
Performs the operation using a character data type. This option does not prompt for each field; it uses char as the storage type, without prefixes and with \t (tab character) as the field separator and \r\n (newline character) as the row terminator. -c is not compatible with -w.
It's better to export the file as UTF-16 using -w.
-w
Performs the bulk copy operation using Unicode characters. This option does not prompt for each field; it uses nchar as the storage type, no prefixes, \t (tab character) as the field separator, and \n (newline character) as the row terminator. -w is not compatible with -c.
The codepage can be specified using the -C parameter. -C 1252, for example, will export the data using Windows' Latin-1 (Western European) codepage, while 1253 will use the Greek codepage.
-C { ACP | OEM | RAW | code_page }
Specifies the code page of the data in the data file. code_page is relevant only if the data contains char, varchar, or text columns with character values greater than 127 or less than 32.
SQL Server 2016 and later can also export text as UTF8 with -C 65001. Earlier versions don't support UTF8.
Versions prior to version 13 (SQL Server 2016 (13.x)) do not support code page 65001 (UTF-8 encoding). Versions beginning with 13 can import UTF-8 encoding to earlier versions of SQL Server.
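As a rough sketch, the question's command adapted to export UTF-16 with -w, and a UTF-8 variant for SQL Server 2016 and later, would look like this (path and server argument as in the question):
EXEC xp_cmdshell 'bcp "select ''ABC'', 40, ''TEST'',''NOTWORKING''" queryout "E:\Testfile.txt" -w -T -S""'
-- UTF-8, SQL Server 2016 and later only:
EXEC xp_cmdshell 'bcp "select ''ABC'', 40, ''TEST'',''NOTWORKING''" queryout "E:\Testfile.txt" -c -C 65001 -T -S""'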
All this is described in bcp's online documentation.
This subject is so important for any database that it has an entire section in the docs that describes data formats and considerations, using format files to specify different settings per column, and guidelines for ensuring compatibility with other applications.

Update/insert/retrieve accented character in DB?

I am using Oracle 12c.
When I run @F:\update.sql from SQL*Plus, the accented character é is displayed as a junk character when I retrieve it from either SQL*Plus or SQL Developer.
But when I run the individual statement from SQL*Plus and then retrieve it from SQL*Plus, it displays the correct character; when I retrieve it from SQL Developer, it again displays the junk character.
The update.sql content is this:
update employee set name ='é' where id= 1;
What I want is that when I run @F:\update.sql, it inserts/updates/retrieves the value in the correct format, whether from SQL*Plus or any other tool.
For information: when I run
SELECT * FROM NLS_DATABASE_PARAMETERS WHERE PARAMETER LIKE '%CHARACTERSET%'
I get the following information:
PARAMETER VALUE
------------------------------ ----------------------------------------
NLS_CHARACTERSET WE8MSWIN1252
NLS_NCHAR_CHARACTERSET AL16UTF16
When I run @.[%NLS_LANG%] from the command prompt I see
SP2-0310: unable to open file ".[AMERICAN_AMERICA.WE8MSWIN1252]"
I am not familiar with SQL Developer, but I can give a solution for SQL*Plus.
I presume you would like to work with Windows CP1252.
First of all, ensure that the file F:\update.sql is saved in the CP1252 encoding. Many editors call this encoding ANSI, which is the same (let's skip the details about the difference between the term ANSI and Windows-1252).
Then before you run the script enter
chcp 1252
in order to switch the encoding of your cmd.exe to CP1252. By default, the encoding of cmd.exe is most likely CP850 or CP437, which are different.
Then set NLS_LANG environment variable to character set WE8MSWIN1252, e.g.
set NLS_LANG=AMERICAN_AMERICA.WE8MSWIN1252
After that your script should work fine with SQL*Plus. SQL*Plus inherits the encoding (or "character set", if you prefer this term) from parent cmd.exe. NLS_LANG tells the Oracle driver which character set you are using.
Example Summary:
chcp 1252
set NLS_LANG=.WE8MSWIN1252
sqlplus username/password@db @F:\update.sql
Some notes: to set the encoding of cmd.exe permanently, see this answer: Unicode characters in Windows command line - how?
NLS_LANG can be set either as Environment Variable or in your Registry at HKLM\SOFTWARE\Wow6432Node\ORACLE\KEY_%ORACLE_HOME_NAME%\NLS_LANG (for 32-bit Oracle Client), resp. HKLM\SOFTWARE\ORACLE\KEY_%ORACLE_HOME_NAME%\NLS_LANG (for 64-bit Oracle Client).
For SQL Developer, check your options; somewhere it should be possible to define the encoding of SQL files.
You are not forced to use Windows-1252. The same approach also works for other encodings, for example WE8ISO8859P1 (i.e. ISO-8859-1, chcp 28591) or UTF-8. However, in the case of UTF-8 your SQL script may contain characters that are not supported by the database character set WE8MSWIN1252; such characters would be replaced by a placeholder (e.g. ¿).
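For example, the ISO-8859-1 variant of the summary above would be (and the script file itself must then be saved in ISO-8859-1):
chcp 28591
set NLS_LANG=.WE8ISO8859P1
sqlplus username/password@db @F:\update.sql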

Progress codepage: export data to SQL via dtsx

In our Progress (V9.1B) installation the codepage used is ibm8858-1, on a SCO Unix machine.
In Progress I dump data as follows:
export stream strA to /u/usr/ppd/tesk.txt convert target "1252".
for each tab1 no-lock :
put tab1.field1 ";" tab1.field2 skip.
end.
When I use vi to open the file, I see the French words as follows: Apr\212s instead of après.
Normally I ftp that file to a PC and then load it into SQL via a dtsx.
But in SQL the French characters are also not in the correct format.
Does somebody know how I should export the data (text) and how I need to import it into SQL Server 2005? I currently also use codepage 1252 in my dtsx.
Tkx,
Jac
Are you 100% sure the field doesn't contain "Apr\212s"? I'm also unsure whether there is an ibm8858-1 codepage; I'm guessing you might be mixing it up with iso8859-1 (Latin-1, which covers e.g. the Scandinavian letters).
Whenever I encounter codepage problems the characters are garbled, not converted to something that looks like an ASCII escape code.
You can always try (assuming it's really iso8859-1)
convert source "iso8859-1" target "1252".
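Applied to the export in the question (keeping the rest of the statement exactly as posted), that would be:
export stream strA to /u/usr/ppd/tesk.txt convert source "iso8859-1" target "1252".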

Encoding issue in I/O with Jena

I'm generating some RDF files with Jena. The whole application works with UTF-8 text, and the source code is stored in UTF-8 as well.
When I print a string containing non-English characters on the console, I get the right output, e.g. Est un lieu généralement officielle assis....
Then, I use the RDF writer to output the file:
Model m = loadMyModelWithMultipleLanguages();
log.info( getSomeStringFromModel(m) );        // log4j, correct output
RDFWriter w = m.getWriter( "RDF/XML" );       // default enc: utf-8
w.setProperty("showXmlDeclaration", "true");  // optional
OutputStream out = new FileOutputStream(pathToFile);
w.write( m, out, "http://someurl.org/base/" );
// file contains garbled text
The RDF file starts with <?xml version="1.0"?>. If I add the utf-8 encoding declaration, nothing changes.
By default the text should be encoded as UTF-8.
The resulting RDF file validates ok, but when I open it with any editor/visualiser (vim, Firefox, etc.), non-English text is all messed up: Est un lieu généralement officielle assis ... or Est un lieu g\u221A\u00A9n\u221A\u00A9ralement officielle assis....
(Either way, this is obviously not acceptable from the user's viewpoint).
The same issue happens with any output format supported by Jena (RDF, NT, etc.).
I can't really find a logical explanation to this.
The official documentation doesn't seem to address this issue.
Any hint or tests I can run to figure it out?
My guess would be that your strings are messed up, and your printStringFromModel() method just happens to output them in a way that accidentally makes them display correctly, but it's rather hard to say without more information.
You're instructing Jena to include an XML declaration in the RDF/XML file, but don't say what encoding (if any) Jena declares in the XML declaration. This would be helpful to know.
You're also not showing how you're printing the strings in the printStringFromModel() method.
Also, in Firefox, go to the View menu and then to Character Encoding. What encoding is selected? If it's not UTF-8, then what happens when you select UTF-8? Do you get it to show things correctly when selecting some other encoding?
Edit: The snippet you show in your post looks fine and should work. My best guess is that the code that reads your source strings into a Jena model is broken, and reads the UTF-8 source as ISO-8859-1 or something similar. You should be able to confirm or disconfirm that by checking the length() of one of the offending strings: If each of the troublesome characters like é are counted as two, then the error is on reading; if it's correctly counted as one, then it's on writing.
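A quick way to run that check, reusing the names from the snippet in the question (purely illustrative):
Model m = loadMyModelWithMultipleLanguages();
String s = getSomeStringFromModel(m);   // e.g. contains "généralement"
// Read correctly, "généralement" has length 12; if its UTF-8 bytes were decoded as
// ISO-8859-1, each é counts as two characters and the length comes out as 14.
log.info("length = " + s.length());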
My hint/answer would be to inspect the byte sequence in 3 places:
The data source. Using a hex editor, confirm that the é character in your source data is represented by the expected UTF-8 byte sequence 0xC3 0xA9.
In memory. Right after your call to printStringFromModel, put a breakpoint and inspect the bytes in the string (or convert them to hex and print them out).
The output file. Again, use a hex editor to check that the byte sequence is 0xC3 0xA9.
This will tell you exactly what is happening to the bytes as they travel along the path of your program, and also where they deviate from the expected 0xC3 0xA9.
The best way to address this would be to package up the smallest unit of your code that you can that demonstrates the issue, and submit a complete, runnable test case as a ticket on the Jena Jira.