Problem with word "Nestlé" in an XML doc (UTF-8 encoding) using NSXMLParser. Any idea? - objective-c

We are using NSXMLParser in Objective-C to parse our XML documents, which are all UTF-8 encoded. One document has the string "Nestlé" in it (as in ...<title>Nestlé Novelties</title>...). The parser just quit, reporting an error with error code=9, because of the French letter "é" at the end of the word "Nestlé". Furthermore, we tried opening the same document directly in IE, Chrome, and Safari. They reported a similar encoding error.
We are using UTF-8 for all incoming XML documents, which means that all of them have "<?xml version="1.0" encoding="UTF-8" ?>" at the top of the document.
Is this an encoding problem? If so, how do we solve it? What encoding should we use for all of our XML documents? Thanks in advance!
Barclay

Have you checked the file with a hex editor to verify that the "é" is indeed UTF-8, 0xC3 0xA9 ?
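If no hex editor is at hand, the same check can be scripted. The following is a minimal sketch in Python (not the asker's Objective-C); the function name `first_non_ascii` and the sample strings are made up for illustration. It prints the bytes around the first non-ASCII byte so you can see whether "é" is stored as 0xC3 0xA9 (UTF-8) or as a single 0xE9 (Latin-1 / Windows-1252).

```python
def first_non_ascii(data: bytes, context: int = 4):
    """Return (offset, hex dump of nearby bytes) for the first byte >= 0x80, or None."""
    for offset, byte in enumerate(data):
        if byte >= 0x80:
            start = max(0, offset - context)
            return offset, data[start:offset + context].hex(" ")
    return None


# Hypothetical sample data: the same title saved in two different encodings.
utf8_bytes = "<title>Nestlé</title>".encode("utf-8")
latin1_bytes = "<title>Nestlé</title>".encode("latin-1")

print(first_non_ascii(utf8_bytes))    # the dump contains "c3 a9"
print(first_non_ascii(latin1_bytes))  # the dump contains a lone "e9"
```

For a real file, pass in the result of `open(path, "rb").read()`. A lone 0xE9 in a document declared as UTF-8 would explain the parser error.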

In HTML, I would use Nestl&eacute;. Does that work for your application?

Something I saw just now in an example XML file: a string containing user-defined input (which happened to include é characters) had the contents of its containing tag wrapped in a CDATA section. This makes the parser treat the enclosed characters as literal text rather than markup, although the bytes still have to be valid in the document's declared encoding.

Related

Characters not displayed correctly when reading CSV file

I have an issue when trying to read a string from a .CSV file. When I execute the application and the text is shown in a textbox, certain characters such as "é" or "ó" are shown as a question mark symbol.
The idea is that this code reads the whole CSV file and then splits each line into variables depending on the first word of the line.
The code I'm using to read is:
Dim test() As String
test = IO.File.ReadAllLines("Libro1.csv")
Dim test_chart As String = Array.Find(test, Function(x) x.StartsWith("sample"))
Dim test_chart_div() As String = test_chart.Split(";")
variable1 = test_chart_div(1)
variable2 = test_chart_div(2)
...etc
I have also tried with:
Dim test() As String
test = IO.File.ReadAllLines("Libro1.csv", System.Text.Encoding.UTF8)
But neither of them works. The .csv file is supposed to be UTF-8. The "web options" that you can see when saving the file in Excel show the encoding as UTF-8. I also tried the trick of changing the file extension to HTML and opening it in the browser to confirm that the encoding is correct.
Can someone advise anything else I can try?
Thanks in advance.
When an Excel file is exported using the CSV Comma Separated output format, the Encoding selected in Tools -> Web Option -> Encoding of Excel's Save As... dialog doesn't actually generate the expected result:
the Text file is saved using the Encoding relative to the current Language selected in the Excel Application, not the Unicode (UTF16-LE) or UTF-8 Encoding selected (which is ignored) nor the default Encoding determined by the current System Language.
To import the CSV file, you can use the Encoding.GetEncoding() method to specify the Name or CodePage of the Encoding used in the machine that generated the file: again, not the Encoding related to System Language, but the Encoding of the Language that the Excel Application is currently using.
CodePage 1252 (Windows-1252) and ISO-8859-1 are commonly used in Latin1 zone.
Based on the symbols you're referring to, this is most probably the original encoding used.
In Windows, use the former. ISO-8859-1 is still used, mostly in old Web Pages (or Web Pages created without care for the Encoding used).
As a note, CodePage 1252 and ISO-8859-1 are not exactly the same Encoding; there are subtle differences.
If you find documentation that states the opposite, the documentation is wrong.
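The effect described above can be reproduced in a few lines. This is a sketch in Python rather than VB.NET, and the helper `split_fields` and the sample line are hypothetical. It decodes the same semicolon-separated bytes three ways: as Windows-1252 (the assumed real encoding), as UTF-8 (what the question tried), and as ISO-8859-1 (to show where the two Latin1 encodings differ).

```python
def split_fields(raw_line: bytes, encoding: str) -> list:
    """Decode one CSV line with the given encoding and split it on ';'."""
    # errors="replace" turns undecodable bytes into U+FFFD instead of raising.
    text = raw_line.decode(encoding, errors="replace")
    return text.rstrip("\r\n").split(";")


# Hypothetical line, stored the way a Windows-1252 export would store it.
raw_line = "sample;café;€5\r\n".encode("cp1252")

as_cp1252 = split_fields(raw_line, "cp1252")   # accents and the euro sign survive
as_utf8 = split_fields(raw_line, "utf-8")      # "é" becomes a replacement character
as_latin1 = split_fields(raw_line, "latin-1")  # "é" survives, the euro byte 0x80 does not

print(as_cp1252)
print(as_utf8)
print(as_latin1)
```

In .NET the corresponding step would be passing Encoding.GetEncoding(1252) to the reading call instead of Encoding.UTF8, as the answer suggests.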

How to set the encoding in a Spark dataframe while reading XML through sqlContext.read.format("com.databricks.spark.xml")

I have an XML file with encoding="UTF-8" which contains a few French letters inside an element.
Example <Name>Áudio</Name>;
I'm unable to read the XML through
sqlContext.read.format("com.databricks.spark.xml")
.option("rowTag", "root_Tag")
.load("file:/Users/test.xml");
It shows "_corrupt_record", but if I remove the French character, it works perfectly.
I believe the issue is caused by the encoding. How can I specify the encoding in sqlContext while reading XML?
I also tested with .option("charset","UTF-8") when reading, but it does not work. Please help me resolve the problem.
I think you need to specify the option with lowercase (utf-8)

Oracle SQL XML Offending characters in XML file (UTF-8)

I'm trying to find a way of either replacing or deleting offending characters from the Oracle SQL XML files I'm creating. The structure of the XML file is correct, but the company I'm sending the files to can't load them because of the offending characters in the XML file. I'm using an Oracle 11g Release 2 database.
What can I do and what are my options?
A screenshot below shows an example of these offending characters. Both I and the company I'm sending the files to are using the Unicode UTF-8 encoding.
An example of a tag that it does not like is below for ZOË WANAMAKER
<prodAssociatedParty>
<apType>ACTOR</apType>
<lastName>ZOË WANAMAKER</lastName>
</prodAssociatedParty>
Ë (0xCB), É (0xC9), Ï (0xCF), £ (0xA3), Ç (0xC7), Ò (0xD2), Ü (0xDC)
Thanks in Advance for any advice.
Thanks for all your replies. In the end I put the line below into my PL/SQL. As some of you said, even though you have version="1.0" encoding="UTF-8" in your SQL XML code, the file was being stored in a different encoding. So I needed to force it to write and store the XML file in UTF-8 format.
If you look up DBMS_XSLPROCESSOR.clob2file, there are a number of parameters passed to this procedure, one being the character set to use for the output file. In this case, for UTF-8, it was nls_charset_id('AL32UTF8').
DBMS_XSLPROCESSOR.clob2file(l_clob, l_directory, l_file_name||'.xml',nls_charset_id('AL32UTF8'));
thanks Guys
Obviously your XML file has <?xml version="1.0" encoding="UTF-8"?> as its declaration, but in fact it is stored in a different encoding.
Do not declare your XML file with <?xml version="1.0" encoding="UTF-8"?> just because "everybody does it". If you declare UTF-8, then you also have to save the file as UTF-8. Check the save options in your editor, or the settings of the application that creates the file.
I assume the XML file is saved in Windows-1252 encoding. Try <?xml version="1.0" encoding="ISO-8859-1"?> instead.
Windows-1252 is very similar to ISO 8859-1 (see "ISO 8859-1 vs. ISO 8859-15 vs. Windows-1252 vs. Unicode"), so it should work unless your XML contains any of € Š š Ž ž Œ œ Ÿ.
However, according to the XML specification, only UTF-8 and UTF-16 are mandatory and ISO 8859-x is optional, so the target application may not be able to read the file. In that case you have to convert your XML file to UTF-8.
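If the file cannot be regenerated at the source, the conversion to UTF-8 can also be done afterwards. Below is a minimal sketch in Python; the function name `to_utf8` is invented here, and the fallback encoding "cp1252" is only the assumption made in the answer above, not something the question confirms. Bytes that are already valid UTF-8 are returned unchanged.

```python
def to_utf8(data: bytes, fallback_encoding: str = "cp1252") -> bytes:
    """Return data as UTF-8 bytes, transcoding from fallback_encoding if needed."""
    try:
        data.decode("utf-8")  # strict check: raises on any invalid byte sequence
        return data
    except UnicodeDecodeError:
        # Assumption: the bytes were written as Windows-1252. Python's cp1252
        # codec raises on the few byte values that code page leaves undefined.
        return data.decode(fallback_encoding).encode("utf-8")


# Hypothetical sample: the element from the question, saved as Windows-1252.
stored = "<lastName>ZOË WANAMAKER</lastName>".encode("cp1252")
converted = to_utf8(stored)
print(converted.decode("utf-8"))
```

After such a conversion the encoding="UTF-8" declaration matches the bytes on disk, which is what the receiving parser checks.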

Delete the first characters before parsing an XML file (SAX)

I have two apparently identical XML files named wrong.xml and good.xml.
The content is as follows:
<?xml version="1.0" encoding="utf-16"?>
<tag>
</tag>
The problem is that the XMLReader class (org.xml.sax.XMLReader) reports the following error when parsing wrong.xml:
Content is not allowed in prolog
The reason is that there are hidden characters before the prolog.
I could only see these characters by using a basic Java file reader, which shows that the first and second characters are -1 and -2.
'-1''-2'<?xml version>......
Notepad, UltraEdit32, WordPad, Notepad++, etc. do not show them either.
My real problem is that I need to read the XML from an FTP server automatically, so I need a way to delete these characters before parsing with XMLReader, without parsing the whole document, because some documents are very big.
How do I delete the first characters of a file?
You'll have to remove those characters before the parser sees them, but you don't need to read the whole file and write it back out again with those characters removed.
A SAX parser can read from an InputSource based on a Reader. There are many implementations of Reader for reading from a file, URL, or other data source, but you can also wrap whatever your primary Reader is in a FilterReader extension that you code to make the needed changes to the data before it is passed on.
It isn't difficult to code an extension of FilterReader that drops the first two characters but passes on everything else, and that will do just what you need. If the need to drop those characters isn't known until runtime, but can be detected then in a sensible way, the reader can drop them only when needed. It might make sense to drop all characters before the first '<'.
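The same wrapping idea can be sketched outside Java. The values -1 and -2 from the question match the signed-byte form of 0xFF 0xFE, which is the UTF-16 little-endian byte-order mark, so the sketch below assumes that is what the hidden characters are. It is written in Python, not as a FilterReader; the function name `text_reader_without_bom` is made up, and UTF-32 marks are deliberately not handled. It peeks at the first bytes, consumes a mark only if one is present, and streams the rest without loading the whole document.

```python
import codecs
import io

# Byte-order marks this sketch recognises, with the encoding each one implies.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def text_reader_without_bom(raw, default_encoding="utf-8"):
    """Wrap a binary stream in a text reader, skipping a leading BOM if present."""
    buffered = io.BufferedReader(raw)
    head = buffered.peek(4)[:4]  # look at the first bytes without consuming them
    for bom, encoding in BOMS:
        if head.startswith(bom):
            buffered.read(len(bom))  # consume only the mark itself
            return io.TextIOWrapper(buffered, encoding=encoding)
    return io.TextIOWrapper(buffered, encoding=default_encoding)


# Hypothetical stand-in for wrong.xml: a UTF-16LE document with a leading mark.
document = '<?xml version="1.0" encoding="utf-16"?>\n<tag>\n</tag>\n'
stream = io.BytesIO(codecs.BOM_UTF16_LE + document.encode("utf-16-le"))

reader = text_reader_without_bom(stream)
first_line = reader.readline()
print(first_line)
```

The returned reader decodes lazily, so a large document downloaded over FTP is not held in memory all at once.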

Illegal XML character when importing into SQL (Mac Roman)

I have an XML file that says its encoding is UTF-8. When I use openxml to import the data into SQL, I always get "XML parsing: line xxxxxx, character xx, illegal xml character".
Right now I can go to each line and replace the offending character with a legal one, and then the import goes well. Sometimes there may be more than 5 Mac Roman characters, and it becomes tedious to replace them. I am currently using Notepad++, and there is probably a way to do this with it.
Can anyone suggest whether anything can be done at the SQL level, or does the file have to be checked before it is run in SQL?
So far, most of the characters found are, x95, x92, x96, xbc, xbd, xbo.
Thanks.
In your question, you did not specify whether the illegal characters you had to remove were Unicode or not, or whether the file was really expected to contain UTF-8 characters. Unlike ASCII, UTF-8 has byte combinations that are illegal, so if you declare the text file to be encoded in UTF-8, you might not be able to read it successfully to the end (such a thing could never happen with ASCII).
So it is possible that by removing <?xml version="1.0" encoding="UTF-8"?> you just declared some non-Unicode encoding for your file (instead of the previously declared UTF-8), so reading the data passed. You did not have many foreign characters like ľťčý in the file, did you? Normally, you must check what happened to those after the import. It might happen that your import passes without error, but the city name Čadca becomes äadca, and somebody will thank your company for rendering his address unreadable.
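Instead of hunting for the bad characters line by line, the illegal byte sequences can be located before the import. The sketch below is in Python and runs outside SQL; the function name `invalid_utf8_offsets` and the sample bytes are hypothetical. It returns the offset of every position where strict UTF-8 decoding fails. The last lines show what two of the reported bytes would be under Windows-1252, which is only an assumption about where they came from.

```python
def invalid_utf8_offsets(data: bytes) -> list:
    """Return the offsets at which strict UTF-8 decoding of data fails."""
    offsets = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the remainder is valid UTF-8
        except UnicodeDecodeError as err:
            # err.start and err.end are relative to the slice data[pos:].
            offsets.append(pos + err.start)
            pos += err.end
    return offsets


# Hypothetical sample containing two of the byte values from the question.
sample = b"<a>It\x92s \x95 ok</a>"
bad = invalid_utf8_offsets(sample)
print(bad)

# Assumption: the bytes are Windows-1252 punctuation rather than Mac Roman.
print(b"\x92".decode("cp1252"), b"\x95".decode("cp1252"))
```

Once the offsets are known, the file can be fixed at the source or transcoded as a whole, and the imported data should still be checked for silently changed characters as described above.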