I have to process an XML file from XSLT, and the file has %20 in its name.
Filename: Sample%20Documnet.XML
In the XSLT I have written this:
<xsl:variable name="readDoc" select="document('Sample%20Documnet.XML')"/>
But I am getting this error: Could not find file 'C:\xslt\Sample Documnet.XML'.
I think %20 is being converted to a space internally, which I don't want. Is there any way to stop this behavior?
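One thing that might be worth trying: document() takes a URI reference, and in a URI %20 stands for a space, so a literal percent sign would itself have to be escaped as %25, giving document('Sample%2520Documnet.XML'). Whether the processor in use honors this is an assumption that has not been verified here. The Java sketch below (class and method names are made up for illustration) only shows how a standard URI library escapes such a file name.

```java
import java.io.File;

// Hypothetical helper: shows how a file name containing a literal "%20"
// is escaped when it is turned into a file: URI.
class PercentNameDemo {
    static String toUriString(String fileName) {
        // File.toURI() quotes '%' as "%25", so "%20" becomes "%2520".
        return new File(fileName).toURI().toString();
    }

    public static void main(String[] args) {
        // The printed URI ends with "Sample%2520Documnet.XML".
        System.out.println(toUriString("Sample%20Documnet.XML"));
    }
}
```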
Related
I have an XML file with encoding="UTF-8" which contains a few French letters inside an element.
Example: <Name>Áudio</Name>
I'm unable to read the XML with:
sqlContext.read.format("com.databricks.spark.xml")
.option("rowTag", "root_Tag")
.load("file:/Users/test.xml");
It shows "_corrupt_record", but if I remove the French character, it works perfectly.
I believe the issue is caused by the encoding. How can I set the encoding in sqlContext while reading the XML?
I also tested with .option("charset","UTF-8") when reading, but it does not work. Please help me resolve the problem.
I think you need to specify the option value in lowercase (utf-8).
I need to find a solution using XSLT 1! Most of the XML files sent to us are well formed, but someone makes a mess by adding unescaped characters (& < > ...). Is there any way to replace these on my side? I tried XSLT 2, but the replace function does not work because I use the XSLT processor from Microsoft.
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:saxon="http://saxon.sf.net/"
exclude-result-prefixes="xs saxon"
version="2.0">
<xsl:param name="path" select="'file:///E:/foo.xml'"></xsl:param>
<xsl:template match="/">
<xsl:copy-of select="unparsed-text($path)"></xsl:copy-of>
<xsl:copy-of select="saxon:parse(replace(unparsed-text($path), '&amp;', '&amp;amp;'))"/>
</xsl:template>
</xsl:stylesheet>
Any other suggestions on how to solve this issue? For example, I have an input XML file like:
<?xml version="1.0" encoding="UTF-8"?>
<name>Stack & Exchange</name>
And it fails on the '&' character.
Please advise!
Thank you
Conclusion
Two observations:
XSLT requires at least a well-formed XML document as input, so I can't use it to correct malformed XML
(it is an XML transformation language).
In order to use replace() to escape invalid characters in the XML input, I need to make sure that I use an XSLT 2.0 processor
(I use the Microsoft XSLT 1.0 processor).
I see two options:
If I receive an error on input, investigate and validate manually and send back the error message. THIS IS WHAT I AM TRYING TO AVOID! (Use text tools like Notepad++ or Excel to find the issue.)
Write a correcting parser in a .NET language to fix the file before loading it as XML.
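As a rough illustration of the second option, here is a minimal sketch in Java rather than .NET (the class and method names are hypothetical). It escapes only ampersands that do not already start an entity or character reference. It does not repair stray '<' or '>', and it would also rewrite ampersands inside CDATA sections or comments, so treat it as a starting point under those assumptions.

```java
import java.util.regex.Pattern;

// Hypothetical pre-processing step: run on the raw text before XML parsing.
class AmpersandFixer {
    // Matches '&' that is NOT followed by a named reference (&amp;),
    // a decimal reference (&#38;) or a hex reference (&#x26;).
    private static final Pattern BARE_AMP =
        Pattern.compile("&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)");

    static String escapeBareAmpersands(String rawXml) {
        return BARE_AMP.matcher(rawXml).replaceAll("&amp;");
    }

    public static void main(String[] args) {
        // prints <name>Stack &amp; Exchange</name>
        System.out.println(escapeBareAmpersands("<name>Stack & Exchange</name>"));
    }
}
```

The same regular expression should carry over to .NET's Regex class, since it uses only a negative lookahead and character classes.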
I have two apparently identical XML files named wrong.xml and good.xml.
The content is as follows:
<?xml version="1.0" encoding="utf-16"?>
<tag>
</tag>
The problem is that the XMLReader class (org.xml.sax.XMLReader) reports the following error when parsing wrong.xml:
Content is not allowed in prolog
The reason is that there are hidden characters before the prolog.
I could only see these characters by using a basic Java file reader, which shows that the first and second characters are -1 and -2.
'-1''-2'<?xml version>......
Notepad, UltraEdit32, WordPad, Notepad++, etc. do not show them.
My real problem is that I need to read the XML from an FTP server automatically, so I need a way to delete these characters before parsing with XMLReader, without processing the whole document first, because some documents are very big.
How do I delete the first characters of a file?
You'll have to remove those characters before the parser sees them, but you don't need to read the whole file and write it back out again with those characters removed.
A SAX parser can read from an InputSource based on a Reader. There are many Reader subclasses for reading from a file, URL or other data source, but you can also wrap whatever your primary Reader is in a FilterReader subclass that you code to make the needed changes to the data before it is passed on.
It isn't difficult to code a subclass of FilterReader that drops the first two characters but passes on everything else, and that will do just what you need. If the need to drop those characters isn't known until runtime, but can be detected then in a sensible way, the reader can drop them only when needed. It might make sense to drop all characters before the first '<'.
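A minimal sketch of such a FilterReader follows, using the "drop everything before the first '<'" variant. The class name is hypothetical. As a side note, -1 and -2 are what the bytes 0xFF 0xFE look like as signed bytes, which matches a UTF-16 little-endian byte order mark, so this sketch assumes the underlying Reader was created with a charset that suits the file.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;

// Hypothetical FilterReader that discards all characters before the first '<'.
class SkipToFirstTagReader extends FilterReader {
    private boolean started = false;

    SkipToFirstTagReader(Reader in) {
        super(in);
    }

    // Consumes the junk prefix; returns '<', or -1 if the input ends first.
    private int skipPrefix() throws IOException {
        int c;
        while ((c = in.read()) != -1 && c != '<') {
            // discard
        }
        started = true;
        return c;
    }

    @Override
    public int read() throws IOException {
        return started ? in.read() : skipPrefix();
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (started) {
            return in.read(buf, off, len);
        }
        if (len == 0) {
            return 0;
        }
        int first = skipPrefix();
        if (first == -1) {
            return -1;
        }
        buf[off] = (char) first;
        // Fill the rest of the buffer straight from the wrapped reader.
        int n = in.read(buf, off + 1, len - 1);
        return n <= 0 ? 1 : n + 1;
    }
}
```

It would be used as new InputSource(new SkipToFirstTagReader(reader)), with the resulting InputSource passed to XMLReader.parse. Only the prefix is examined, so the rest of a large document streams through unchanged.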
I have an XML file that says its encoding is UTF-8. When I use OPENXML to import the data into SQL, I always get "XML parsing: line xxxxxx, character xx, illegal xml character".
Right now I can go to each line and replace the character with a legal one, and then it works. Sometimes there may be more than 5 Mac Roman characters and it becomes tedious to replace them. I am currently using Notepad++, and there is probably a way to do this.
Can anyone suggest whether anything can be done at the SQL level, or does the file have to be checked before it is run in SQL?
So far, most of the characters found are x95, x92, x96, xbc, xbd, xbo.
Thanks.
In your question, you did not specify whether the illegal characters you had to remove were Unicode or not, or whether the file was really expected to contain UTF-8 characters. Unlike ASCII, UTF-8 has byte combinations that are illegal, so if you declare the text file to be encoded in UTF-8, you might not be able to read it successfully to the end (such a thing could never happen with ASCII).
So it is possible that by removing <?xml version="1.0" encoding="UTF-8"?> you just declared some non-Unicode encoding for your file (instead of the previously declared UTF-8), so reading the data passed. You did not have many foreign characters like ľťčý in the file, did you? Normally, you must check what happened to those after the import. It might happen that your import passes without error, but the city name Čadca becomes äadca and somebody will thank your company for rendering his address unreadable.
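One way to handle this before the file reaches SQL is to decode it strictly as UTF-8 and fall back to a legacy code page only when that fails, then write the text back out as real UTF-8. The sketch below is in Java, with hypothetical names. The fallback to windows-1252 is an assumption, based on bytes such as 0x92, 0x95 and 0x96 being common punctuation in that code page; if the data really is Mac Roman, a different charset would be needed.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: strict UTF-8 decode with a legacy fallback.
class LenientDecoder {
    static String decode(byte[] raw) {
        try {
            // REPORT makes the decoder throw instead of silently substituting.
            return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(raw))
                .toString();
        } catch (CharacterCodingException e) {
            // Assumption: the bytes are windows-1252, not Mac Roman.
            return new String(raw, Charset.forName("windows-1252"));
        }
    }

    // Re-encode so the file really matches its encoding="UTF-8" declaration.
    static byte[] toUtf8(byte[] raw) {
        return decode(raw).getBytes(StandardCharsets.UTF_8);
    }
}
```

This falls back for the whole input at once. A file that mixes valid UTF-8 with stray legacy bytes would need a per-line or per-byte strategy, and the result should still be checked by eye, as described above.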
We are using NSXMLParser in Objective-C to parse our XML documents, which are all UTF-8 encoded. One document has the string "Nestlé" in it (as in ...<title>Nestlé Novelties</title>...). The parser just quits, reporting an error with error code=9, due to the French letter "e" at the end of the word "Nestle". Furthermore, we tried using IE, Chrome and Safari to show the same document directly. They reported a similar encoding error.
We are using UTF-8 for all incoming XML documents, which means that all of them have "<?xml version="1.0" encoding="UTF-8" ?>" at the top of the document.
Is this an encoding problem? If so, how do we solve it? What encoding should we use for all of our XML documents? Thanks in advance!
Barclay
Have you checked the file with a hex editor to verify that the "é" is indeed UTF-8, i.e. 0xC3 0xA9?
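If no hex editor is at hand, a few lines of code can show what to look for. This Java sketch (the class and method names are made up) prints the bytes that "é" produces in UTF-8 and in ISO-8859-1. A lone 0xE9 in a file declared as UTF-8 would be one plausible cause of the parser error.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: renders the encoded bytes of a string as hex.
class HexDump {
    static String hex(String text, Charset cs) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(cs)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "\u00e9" is é, written as an escape to avoid source-encoding issues.
        System.out.println(hex("\u00e9", StandardCharsets.UTF_8));      // prints C3 A9
        System.out.println(hex("\u00e9", StandardCharsets.ISO_8859_1)); // prints E9
    }
}
```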
In HTML, I would use Nestl&eacute;. Does that work for your application?
Something I saw just now in an example XML file: a string containing user-defined input (which happened to include é characters) had the contents of the containing tag wrapped in a CDATA section. This makes the parser treat the contents as literal character data rather than markup. Note that CDATA does not change how the bytes are decoded, so it may not help if the é is not valid UTF-8 in the first place.