Oracle SQL XML: offending characters in XML file (UTF-8)

I'm trying to find a way of either replacing or deleting offending characters in the Oracle SQL XML files I'm creating. The structure of the XML file is correct, but the company I'm sending the files to can't load them because of the offending characters. I'm using an Oracle 11g Release 2 database.
What can I do and what are my options?
Examples of these offending characters are listed below the XML snippet; both I and the company I'm sending the files to are using the Unicode UTF-8 encoding.
An example of a tag that it does not like is below, for ZOË WANAMAKER:
<prodAssociatedParty>
<apType>ACTOR</apType>
<lastName>ZOË WANAMAKER</lastName>
</prodAssociatedParty>
Ë (0xCB), É (0xC9), Ï (0xCF), £ (0xA3), Ç (0xC7), Ò (0xD2), Ü (0xDC)
Thanks in advance for any advice.

Thanks for all your replies. In the end I put the code below into my PL/SQL because, as some of you said, even though you have <?xml version="1.0" encoding="UTF-8"?> in your SQL XML code, the file was being stored in a different encoding. So I needed to force it to write/store the XML file in UTF-8 format.
If you look up DBMS_XSLPROCESSOR.clob2file you will see that a number of parameters are passed to this procedure, one of them being the character set to use for the output file, which in this case, for UTF-8, is nls_charset_id('AL32UTF8').
-- the final argument forces the output file to be written in UTF-8
DBMS_XSLPROCESSOR.clob2file(l_clob, l_directory, l_file_name || '.xml', nls_charset_id('AL32UTF8'));
Thanks, guys.

Obviously your XML file has <?xml version="1.0" encoding="UTF-8"?> as its declaration, but in fact it is stored with a different encoding.
Do not declare your XML file with <?xml version="1.0" encoding="UTF-8"?> just because "everybody does it". If you declare UTF-8 then you also have to save it as UTF-8. Check the save options in your editor, or the settings in the application which creates the file.
I assume the XML file is saved in Windows-1252 encoding. Try <?xml version="1.0" encoding="ISO-8859-1"?> instead.
Windows-1252 is very similar to ISO 8859-1 (see ISO 8859-1 vs. ISO 8859-15 vs. Windows-1252 vs. Unicode), so it should work unless your XML contains any of € Š š Ž ž Œ œ Ÿ.
However, according to the XML specification only UTF-8 and UTF-16 are mandatory and the ISO 8859-x encodings are optional, so the target application may not be able to read the file. In that case you have to convert your XML file to UTF-8.
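If conversion is needed, a small re-encoding step is enough. Below is a minimal Java sketch (the file names are hypothetical) that reads the bytes as Windows-1252 and writes them back out as UTF-8; remember to update the XML declaration to match the new encoding:

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ReencodeXml {
    public static void main(String[] args) throws Exception {
        // Decode the raw bytes using the encoding the file is actually in
        byte[] raw = Files.readAllBytes(Paths.get("in.xml"));
        String text = new String(raw, Charset.forName("windows-1252"));
        // Write the same text back out encoded as UTF-8
        Files.write(Paths.get("out.xml"), text.getBytes(StandardCharsets.UTF_8));
    }
}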

Related

How to set the encoding in a Spark dataframe while reading xml through sqlContext.read.format("com.databricks.spark.xml")

I have an XML file with encoding="UTF-8" which contains a few French letters inside an element.
Example: <Name>Áudio</Name>
I'm unable to read the XML through
sqlContext.read.format("com.databricks.spark.xml")
.option("rowTag", "root_Tag")
.load("file:/Users/test.xml");
It shows "_corrupt_record", but if I remove the French character, it works perfectly.
I believe the issue is because of the encoding. How can I set the encoding in sqlContext while reading XML?
I also tested with .option("charset", "UTF-8") when reading, but it does not work. Please help me to resolve the problem.
I think you need to specify the option with lowercase (utf-8), as in the sketch below.
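For reference, here is the full read as a minimal Java sketch, assuming the Spark 1.x sqlContext and the spark-xml package from the question; only the charset value is lowercased:

import org.apache.spark.sql.DataFrame;

// 'sqlContext' is assumed to already exist, as in the question
DataFrame df = sqlContext.read()
    .format("com.databricks.spark.xml")
    .option("rowTag", "root_Tag")
    .option("charset", "utf-8")   // lowercase, as suggested above
    .load("file:/Users/test.xml");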

exporting text file with utf-8 encoding in ms access

I am exporting text files from 2 queries in MS Access 2010. The queries are from different linked ODBC tables (the tables differ only in data; structure and data types are the same). I set up an export specification to export the text files in UTF-8 encoding for both. Now here comes the troublesome part: when I export the queries and open the files in Notepad, one is in UTF-8 and the other is in ANSI. I don't know how this is possible when both queries have the same export specification, and it is driving me crazy.
This is my VBA code to export the queries:
' the final HasFieldNames argument must be True or False ("no" is not valid VBA)
DoCmd.TransferText acExportDelim, "miniflow", "qry01_CZ_test", "C:\TEST_CZ.txt", False
DoCmd.TransferText acExportDelim, "miniflow", "qry01_SK_test", "C:\TEST_SK.txt", False
I also tried to modify it by adding 65001 as the CodePage argument, but the results were the same.
Do you have any idea what could be wrong?
Don't rely on the File Open dialog in Notepad to tell you whether a text file is encoded as "ANSI" or UTF-8. That is just Notepad's "guess" based on whether the file begins with the bytes EF BB BF, which is the UTF-8 Byte Order Mark (BOM).
Many (most?) Windows applications will include the UTF-8 BOM at the beginning of a text file that is UTF-8 encoded. Some Unicode purists insist, often quite vigorously, that the BOM is not required for UTF-8 files and should be excluded, but that is the way Windows applications tend to behave.
Unfortunately, Access does not always follow that pattern when it exports files to text. A UTF-8 text file exported from Access may omit the BOM and that can confuse applications like Notepad if they assume that a UTF-8 encoded file will always include the BOM as the first three bytes of the file.
For a more reliable way of determining the encoding of a text file, consider using an application like Notepad++ to open the file. It will differentiate between UTF-8 files with a BOM (which it designates as "UTF-8") and UTF-8 files without a BOM (which it designates as "ANSI as UTF-8").
To illustrate, consider an Access table containing the character é. When it is exported to text (CSV) with UTF-8 encoding, the File Open dialog in Notepad reports that the file is encoded as "ANSI", but a hex editor shows that it is in fact encoded as UTF-8 (the character é is encoded as C3 A9, not simply E9 as would be the case for true "ANSI" encoding), and Notepad++ recognizes it as "ANSI as UTF-8"; in other words, a UTF-8 encoded file without a BOM.
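Since Notepad's guess hinges entirely on those three bytes, you can check for them yourself. Here is a minimal Java sketch (the class and method names are my own) that reports whether a file starts with the UTF-8 BOM:

import java.io.FileInputStream;
import java.io.IOException;

public class BomCheck {
    // True only if the file starts with the UTF-8 BOM bytes EF BB BF
    static boolean hasUtf8Bom(String path) throws IOException {
        try (FileInputStream in = new FileInputStream(path)) {
            byte[] head = new byte[3];
            int n = in.read(head);
            return n == 3
                    && (head[0] & 0xFF) == 0xEF
                    && (head[1] & 0xFF) == 0xBB
                    && (head[2] & 0xFF) == 0xBF;
        }
    }
}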

Delete the first characters before parsing an XML (SAX)

I have two xml files, apparently identical, named wrong.xml and good.xml.
The content is as follows:
<?xml version="1.0" encoding="utf-16"?>
<tag>
</tag>
The problem is that the XMLReader class (org.xml.sax.XMLReader) reports the following error when parsing wrong.xml:
Content is not allowed in prolog
The reason is that hidden characters exist before the prolog.
I only managed to see these characters by using a basic Java file reader, and I can see that the first and second bytes read as -1 and -2, that is 0xFF 0xFE, which is the UTF-16 little-endian byte order mark (BOM).
'-1''-2'<?xml version>......
Notepad, UltraEdit32, WordPad, Notepad++, etc. cannot see them.
My real problem is that I need to read the xml from an FTP automatically, so I need some way to delete these characters before parsing with XMLReader, without reading the whole document, because some documents are very big.
How do I delete the first characters of a file?
You'll have to remove those characters before the parser sees them, but you don't need to read the whole file and write it back out again with those characters removed.
A SAX parser can read from an InputSource based on a Reader. There are many implementations of the Reader interface for reading from a file, URL or other data source, but you can also wrap whatever your primary Reader is in a FilterReader extension that you code to perform the changes needed to the data before it goes on.
It isn't difficult to code an extension of FilterReader that drops the first two characters but passes on everything else, and that will do just what you need. If the need to drop those characters isn't known until runtime, but can be detected then in a sensible way, the filter can be applied only when needed. It might make sense to drop all characters before the first '<', as in the sketch below.
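A minimal sketch of such a filter (the class name is my own); it silently discards everything before the first '<' and passes the rest through unchanged:

import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;

// Drops any characters (BOM or other junk) that appear before the first
// '<' so they never reach the SAX parser; everything after passes through.
class SkipToMarkupReader extends FilterReader {
    private boolean started = false;

    SkipToMarkupReader(Reader in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int c;
        while ((c = super.read()) != -1) {
            if (started || c == '<') {
                started = true;
                return c;
            }
        }
        return -1;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        int n = 0;
        int c;
        while (n < len && (c = read()) != -1) {
            buf[off + n++] = (char) c;
        }
        return n == 0 ? -1 : n;
    }
}

Wrap whatever Reader you build around the FTP stream, for example new InputSource(new SkipToMarkupReader(new InputStreamReader(ftpStream, "UTF-16"))), and hand that to the parser; the filter reads the stream lazily, so big documents are never loaded into memory.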

Illegal XML character when importing into SQL (Mac Roman)

I have an xml that says its encoding is UTF-8. When I use OPENXML to import data into SQL, I always get "XML parsing: line xxxxxx, character xx, illegal xml character".
Right now I can go to each line and replace the offending character with a legal one and it goes well. Sometimes there may be more than 5 Mac Roman characters and it becomes tedious to replace them. I am currently using Notepad++ and there is probably a way to do this there.
Can anyone suggest whether anything can be done at the SQL level, or does the file have to be checked before it is run through SQL?
So far, most of the characters found are x95, x92, x96, xbc, xbd, xb0.
Thanks.
In your question, you did not specify whether the illegal characters you had to remove were Unicode or not, or whether the file was really expected to contain UTF-8 characters. Unlike ASCII, UTF-8 treats some byte combinations as illegal, so if you declare the text file to be encoded in UTF-8, you might not be able to read it successfully to the end (such a thing could never happen with ASCII).
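One way to test this up front is to decode the raw bytes strictly as UTF-8 and see whether decoding fails. A minimal Java sketch (the class and method names are my own):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Utf8Check {
    // True only if every byte sequence in the file is legal UTF-8
    static boolean isValidUtf8(Path file) throws IOException {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(Files.readAllBytes(file)));
            return true;
        } catch (CharacterCodingException e) {
            return false; // hit an illegal byte combination
        }
    }
}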
So it is possible that by removing <?xml version="1.0" encoding="UTF-8"?> you just stopped declaring UTF-8 and let some non-Unicode encoding be assumed for your file, so reading the data passed. You did not have many foreign characters like ľťčý in the file, did you? Normally, it is a must to check what happened to those after the import. It might happen that your import passes without error, but the city name Čadca becomes äadca and somebody will thank your company for rendering his address unreadable.

Problem with the word "Nestlé" in an XML doc (UTF-8 encoding) using NSXMLParser. Any idea?

We are using NSXMLParser in Objective-C to parse our XML documents, which are all UTF-8 encoded. One document has the string "Nestlé" in it (as in ...<title>Nestlé Novelties</title>...). The parser just quit, reporting an error with error code=9, due to the accented letter "é" at the end of the word "Nestlé". Furthermore, we tried using IE, Chrome and Safari to show the same document directly; they reported a similar encoding error.
We are using UTF-8 for all incoming XML documents, which means that all of them have "<?xml version="1.0" encoding="UTF-8" ?>" at the top of the document.
Is this an encoding problem? If so, how do we solve this? What encoding should we use for all of our XML documents? Thanks in advance!
Barclay
Have you checked the file with a hex editor to verify that the "é" is indeed UTF-8, i.e. 0xC3 0xA9?
In HTML, I would use the numeric character reference Nestl&#233; instead. Does that work for your application?
Something I saw just now in an example XML file was that a string containing user-defined input (which happened to include é characters) had the contents of the containing tag wrapped in a CDATA section. This has the effect of making the parser pass the characters contained therein straight through instead of interpreting them as markup, as shown below.
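For the title element above, the CDATA form would look like this (a sketch; the rest of the document is unchanged):

<title><![CDATA[Nestlé Novelties]]></title>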