Parsing XML with embedded node in text to DataFrame - pandas

I have an XML like this:
<root>
<epig>
string1
<tit>string2</tit>
string3
</epig>
</root>
I'm trying to build a data frame with the following:
dftext = pd.read_xml("filename.xml", xpath='root/epig')
which returns a data frame with a column epig containing string1 and a column tit containing string2, but string3 has disappeared. This is the current output:
epig      tit
string1   string2
The data frame output instead should be:
epig              tit
string1+string3   string2
Where's my error?

In XML terms, the <epig> element has three child nodes: two text nodes and the <tit> element. To retrieve the second text node with Python's etree library, you would use the .tail attribute of the tit element. In pandas, read_xml (a convenience method designed for flat XML, not for every XML shape) parses only the first text node because it does not iterate across multiple text nodes.
For this special use case of multiple text nodes, consider re-styling the XML with XSLT, the special-purpose language designed to transform XML files, which is supported in read_xml using the stylesheet argument and default lxml parser (not etree parser).
XSLT (save as .xsl, a special .xml file)
The stylesheet below concatenates both text nodes into a new <epig> child element, which becomes a sibling of <tit>. Both sit under a new parent <item>, which is the element the xpath argument then selects.
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" omit-xml-declaration="no" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="/root">
<xsl:copy>
<xsl:apply-templates select="epig"/>
</xsl:copy>
</xsl:template>
<xsl:template match="epig">
<item>
<epig>
<xsl:value-of select="normalize-space(concat(text()[1], text()[2]))"/>
</epig>
<xsl:copy-of select="tit"/>
</item>
</xsl:template>
</xsl:stylesheet>
Python
Below will parse all <item> nodes of the flattened output of XSLT.
dftext = pd.read_xml("filename.xml", xpath=".//item", stylesheet="style.xsl")
dftext
# epig tit
# 0 string1 string3 string2
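If XSLT is not an option, the .tail approach mentioned at the start of this answer can be sketched with the standard library's xml.etree.ElementTree. This is only a minimal sketch, assuming each <epig> has at most one <tit> child; the helper name epig_rows and the inline SAMPLE string are made up for illustration and are not part of pandas.

```python
import xml.etree.ElementTree as ET

# Inline copy of the XML from the question (illustrative stand-in for the file).
SAMPLE = """<root>
<epig>
string1
<tit>string2</tit>
string3
</epig>
</root>"""


def epig_rows(xml_text):
    """Build one dict per <epig>, joining the text before and after <tit>."""
    root = ET.fromstring(xml_text)
    rows = []
    for epig in root.iter("epig"):
        tit = epig.find("tit")
        # epig.text is the text node before <tit>.
        before = (epig.text or "").strip()
        # tit.tail is the text node that follows </tit>.
        after = (tit.tail or "").strip() if tit is not None else ""
        rows.append({
            "epig": " ".join(part for part in (before, after) if part),
            "tit": tit.text if tit is not None else None,
        })
    return rows


rows = epig_rows(SAMPLE)
print(rows)
```

The resulting list of dicts can then be passed to pd.DataFrame(rows) to get the epig and tit columns.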

Related

XSLT: variables and "empty" labels

I have an XML datafile containing, among other things, a string of arbitrarily many comma-separated values. I want those values to be displayed in a web browser as a list with one value per line. So I wrote an XSLT template that takes this string, displays the first value followed by a properly namespaced line-break tag (<br/>), and recurses with the remainder of the string. In effect, the commas are being replaced by HTML <br/> tags.
Now, when I store the result of calling that template in an xsl:variable, and display that through xsl:value-of, then the HTML tags disappear: what is shown is the string minus the commas.
When I display the result directly by having the xsl:call-template in place of the xsl:value-of, all is fine, and the values appear in a list.
So, what's going on?
Is this behavior an implementation artifact, or is it standard XSLT?
Use xsl:copy-of instead of xsl:value-of if you want to output nodes (like your br elements). xsl:value-of creates a simple text node with the string value(s) selected.
Here is an example that shows the difference between xsl:value-of and xsl:copy-of. Note that it is not the use of a variable holding newly created br elements that makes the difference; it is simply that xsl:value-of creates a text() node from the string conversion of the selection:
<xsl:stylesheet
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="1.0">
<xsl:output method="html" indent="yes" version="5" doctype-system="about:legacy-doctype"/>
<xsl:variable name="var">Phrase 1.<br/>Phrase 2.<br/>Phrase 3.</xsl:variable>
<xsl:template match="/">
<html>
<head>
<title>.NET XSLT Fiddle Example</title>
</head>
<body>
<section>
<h1>Example 1: value-of</h1>
<xsl:value-of select="$var"/>
</section>
<section>
<h1>Example 2: copy-of</h1>
<xsl:copy-of select="$var"/>
</section>
<xsl:apply-templates select="//p"/>
<xsl:apply-templates select="//p" mode="copy-of"/>
</body>
</html>
</xsl:template>
<xsl:template match="p">
<section>
<h1>Example 1: value-of</h1>
<xsl:value-of select="."/>
</section>
</xsl:template>
<xsl:template match="p" mode="copy-of">
<section>
<h1>Example 1: copy-of</h1>
<xsl:copy-of select="."/>
</section>
</xsl:template>
</xsl:stylesheet>
https://xsltfiddle.liberty-development.net/gWmuiJy/1
Output is
Example 1: value-of
Phrase 1.Phrase 2.Phrase 3.
Example 2: copy-of
Phrase 1.
Phrase 2.
Phrase 3.
Example 1: value-of
Line 1.Line 2.Line 3.
Example 1: copy-of
Line 1.
Line 2.
Line 3.
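The same distinction can be seen outside XSLT. As a rough analogy in Python (standard library only, and not an XSLT processor), taking the string value of an element drops its child elements, while serializing its content keeps them. The <var> wrapper element here is just an illustrative stand-in for the variable.

```python
import xml.etree.ElementTree as ET

# Stand-in for the variable's content: text interleaved with <br/> elements.
var = ET.fromstring("<var>Phrase 1.<br/>Phrase 2.<br/>Phrase 3.</var>")

# Analogous to xsl:value-of: concatenate the text nodes only, so <br/> vanishes.
string_value = "".join(var.itertext())

# Analogous to xsl:copy-of: keep the child elements as markup.
# ET.tostring() on a child also emits the text that follows it (its tail).
copied = (var.text or "") + "".join(
    ET.tostring(child, encoding="unicode") for child in var
)

print(string_value)
print(copied)
```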
It seems that you hit the boundaries of the RTF ("Result tree fragment"):
When you use an XML fragment to initialize a variable or a parameter, then the variable or parameter is of the
"result tree fragment" datatype. This is an XSLT 1.0 specific datatype [just like node-set, but slightly different].
A result tree fragment is equivalent to a node-set that contains just the root node.
You cannot apply operators like "/", "//" or predicate on a result tree fragments. They are only applicable for node-set datatypes.
[...]
a) In XSLT 1.0
The resolution of this is to convert the result tree fragment into a node-set. I am not aware of any oracle specific xpath extension functions that can do this trick for you.
You could use EXSLT to achieve this.
b) Use XSLT 2.0
You can code your transformations in XSLT 2.0. XSLT 2.0 deprecates ResultTreeFragments i.e. if you are modeling an XSLT 2.0 transformation, and you create a variable or a parameter that holds a tree fragment, it is implicitly a node sequence.
So in plain XSLT 1.0, without an extension function such as EXSLT's node-set(), you cannot navigate into such a variable. Using XSLT 2.0 or 3.0 avoids the problem altogether.
Is this behavior an implementation artifact, or is it standard XSLT?
It is standard for XSLT-1.0, but not for XSLT-2.0+.

How to put 'and' condition to remove an xml element from an xslt

I am new to XSLT and need to remove an element from an XML file based on a condition applied in XSLT. I have to remove an element based on two conditions on the same attribute.
Here is my dummy code:
<xsl:template match="@* | node()">
<xsl:copy>
<xsl:apply-templates select="@* | node()" />
</xsl:copy>
</xsl:template>
<xsl:template match="cdm:attributeList/cdm:attribute[cdm:attributeName = 'format'] and cdm:attributeList/cdm:attribute[cdm:attributeValue = 'XYZ']" />
While running the XSLT file I get the following errors:
1. Extra illegal tokens: 'and'
2. "exclude-result-prefixes" attribute is not allowed on the xsl:output element!
Could someone please help me out in this?
Thanks.
The match pattern is invalid: the and operator is not allowed between two patterns in a match attribute. Put both conditions on the same element as stacked predicates instead, which here is equivalent to combining them with and:
<xsl:template match="cdm:attributeList/cdm:attribute[cdm:attributeName = 'format'][cdm:attributeValue = 'XYZ']" />
As for the second error, exclude-result-prefixes is an attribute of <xsl:stylesheet>, not of <xsl:output>. So if you do not want the output nodes to carry the cdm namespace declaration, modify the <xsl:stylesheet> element as below:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:cdm="http://someurl" exclude-result-prefixes="cdm">
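To make the semantics of the stacked predicates concrete: both conditions must hold for the same cdm:attribute element. Below is a minimal Python sketch of that same filter using the standard library. The sample document and its element layout are assumptions inferred from the match pattern, and http://someurl is the placeholder namespace from the snippet above.

```python
import xml.etree.ElementTree as ET

NS = "http://someurl"  # placeholder namespace URI, as in the stylesheet snippet

# Hypothetical input: two attributes named 'format' with different values.
SAMPLE = f"""<cdm:root xmlns:cdm="{NS}">
  <cdm:attributeList>
    <cdm:attribute>
      <cdm:attributeName>format</cdm:attributeName>
      <cdm:attributeValue>XYZ</cdm:attributeValue>
    </cdm:attribute>
    <cdm:attribute>
      <cdm:attributeName>format</cdm:attributeName>
      <cdm:attributeValue>ABC</cdm:attributeValue>
    </cdm:attribute>
  </cdm:attributeList>
</cdm:root>"""


def q(local):
    """Return the namespace-qualified tag name ElementTree expects."""
    return f"{{{NS}}}{local}"


root = ET.fromstring(SAMPLE)
for attr_list in root.iter(q("attributeList")):
    # Copy the list first so removing children does not disturb the iteration.
    for attr in list(attr_list.findall(q("attribute"))):
        # Both conditions are tested against the same <cdm:attribute>.
        if (attr.findtext(q("attributeName")) == "format"
                and attr.findtext(q("attributeValue")) == "XYZ"):
            attr_list.remove(attr)

remaining = [a.findtext(q("attributeValue")) for a in root.iter(q("attribute"))]
print(remaining)
```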

Looping a variable length array with namespaces in XSLT

My previous question[1] is related to this. I found the answer for that. Now I want to loop a variable length array with namespaces. My array:
<ns:array xmlns:ns="http://www.example.org">
<value>755</value>
<value>5861</value>
<value>4328</value>
<value>2157</value>
<value>1666</value>
</ns:array>
My XSLT code (I have added the namespace to the root element):
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:ns1="http://www.example.org">
<xsl:template match="/">
<xsl:variable name="number" select="ns:array" />
<xsl:for-each select="$number">
<xsl:value-of select="$number" />
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
[1]https://stackoverflow.com/questions/20287219/looping-a-variable-length-array-in-xslt
IMHO you confused yourself by introducing a variable called number that actually contains a node set rather than a single number. As a consequence, you used the variable as if it were a single item/node, which does not yield the desired result (presumably, since you did not really tell us what you want to do with the values).
Also, I think your question does not really have anything to do with namespace issues as such. You just have to make sure that the namespace prefixes in your select expressions are declared in the stylesheet and bound to the same namespace URIs as in your input file.
I would suggest doing without the variable and changing the way you retrieve the current value:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:ns="http://www.example.org">
<xsl:template match="/">
<xsl:for-each select="ns:array/value">
<!-- Inside here you can work with the `value` tag as the _current node_.
There are two most likely ways to do this. -->
<!-- a) Copy the whole tag to the output: -->
<xsl:copy-of select="." />
<!-- or b1) Copy the text part contained in the tag to the output: -->
<xsl:value-of select="." />
<!-- If you want to be on the safe side with respect to white space
you can also use this b2). This handles the case where your output
must not contain any surrounding white space but your input XML has
some. -->
<xsl:value-of select="normalize-space(.)" />
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
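One detail worth checking in the sample input: only ns:array carries the prefix, so its value children are in no namespace at all. A small Python sketch with the standard library illustrates this. It is only a way to inspect the input, not a replacement for the stylesheet.

```python
import xml.etree.ElementTree as ET

# The array from the question, copied inline.
SAMPLE = """<ns:array xmlns:ns="http://www.example.org">
<value>755</value>
<value>5861</value>
<value>4328</value>
<value>2157</value>
<value>1666</value>
</ns:array>"""

root = ET.fromstring(SAMPLE)

# The root element is namespace-qualified ...
print(root.tag)

# ... but its children are plain, unqualified <value> elements,
# so they are selected without any prefix.
values = [v.text for v in root.findall("value")]
print(values)
```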

Handling 0x19 in XSLT 1.0

I have encountered a problem when input xml contains the character 0x19. I have created a demo xslt to reproduce the issue.
My demo xslt looks like this:
<?xml version="1.0" encoding="utf-8"?>
<xsl:transform version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:msxsl="urn:schemas-microsoft-com:xslt" exclude-result-prefixes="msxsl">
<xsl:output method="xml" indent="yes"/>
<xsl:param name="param1"/>
<xsl:template match="/">
<Value>
<xsl:value-of select="$param1"/>
</Value>
</xsl:template>
</xsl:transform>
I am passing the character 0x19 as param1. The below output gets generated.
<Value>&#x19;</Value>
which is invalid XML. How can I get it right?
You are correct that
<Value>&#x19;</Value>
is not well-formed XML 1.0. XML 1.0 does not allow any control characters below U+0020 except U+0009 (tab), U+000A (LF) and U+000D (CR), not even when expressed as numeric character references, so it is simply not possible to include that character in an XML 1.0 document. The processor is wrong to produce that output; it should raise an error to complain that you've tried to insert an illegal character in the output.
However, it is well-formed XML 1.1, which allows control characters as numeric character references (&#...;) but not as literals. If your processor supports this (and the downstream components that will receive your output support it too), then it may be sufficient to add version="1.1" to the xsl:output instruction
<xsl:output method="xml" indent="yes" version="1.1"/>
to tell it to output XML 1.1 instead of XML 1.0.
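If switching to XML 1.1 is not possible, one alternative (a different technique from the one above) is to strip the characters that XML 1.0 forbids before the value is passed to the transform. Below is a minimal Python sketch, assuming the caller controls the parameter value; the function name strip_invalid_xml10 is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Complement of the XML 1.0 Char production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML10 = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml10(text):
    """Remove every character that cannot appear in an XML 1.0 document."""
    return _INVALID_XML10.sub("", text)


cleaned = strip_invalid_xml10("abc\x19def")

# For comparison: an XML 1.0 parser rejects the character reference outright.
try:
    ET.fromstring("<Value>&#x19;</Value>")
    parsed_ok = True
except ET.ParseError:
    parsed_ok = False

print(cleaned, parsed_ok)
```

Whether silently dropping the character is acceptable depends on the data; replacing it with a visible placeholder is an equally simple variation.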

Getting particular substring using xslt1.0

I have a string as below.
<freeForm>
<text>mnr.getValue().put("xyz","pqr");</text>
</freeForm>
From the above XML fragment I need to get the string xyz.
Please provide pointers on how to achieve this using XSLT 1.0.
Use this XPath expression (when it is placed inside an XSLT attribute delimited by single quotes, each apostrophe must be written as &apos;):
substring-before(
substring-after(/*/*, '"'),
'"'
)
Here is a short, complete XSLT transformation that evaluates this XPath expression and outputs the result of evaluating it:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="/">
<xsl:value-of select=
'substring-before(
substring-after(/*/*, &apos;"&apos;),
&apos;"&apos;
)'/>
</xsl:template>
</xsl:stylesheet>
When this transformation is applied on the provided XML document:
<freeForm>
<text>mnr.getValue().put("xyz","pqr");</text>
</freeForm>
the wanted, correct result is produced:
xyz
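For comparison, here is the same extraction in Python. This is a sketch that mirrors the two XPath 1.0 functions with small helpers (the helper names are made up for this example) and assumes a non-empty separator.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<freeForm>
<text>mnr.getValue().put("xyz","pqr");</text>
</freeForm>"""


def substring_after(s, sep):
    """Like XPath substring-after(): empty string when sep is not found."""
    _, found, tail = s.partition(sep)
    return tail if found else ""


def substring_before(s, sep):
    """Like XPath substring-before(): empty string when sep is not found."""
    head, found, _ = s.partition(sep)
    return head if found else ""


# Equivalent of /*/* here: the <text> child of the root element.
text = ET.fromstring(SAMPLE).findtext("text")
result = substring_before(substring_after(text, '"'), '"')
print(result)
```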