Create Smaller XML based on value of element - lxml

On Python 3.7, I am looking to create a subset of an XML document. For example, the larger XML is:
<data>
<student>
<result>
<grade>A</grade>
</result>
<details>
<name>John</name>
<id>100</id>
<age>16</age>
<email>john@mail.com</email>
</details>
</student>
<student>
<result>
<grade>B</grade>
</result>
<details>
<name>Alice</name>
<id>101</id>
<age>17</age>
<email>alice@mail.com</email>
</details>
</student>
<student>
<result>
<grade>F</grade>
</result>
<details>
<name>Bob</name>
<id>102</id>
<age>16</age>
<email>bob@mail.com</email>
</details>
</student>
<student>
<result>
<grade>A</grade>
</result>
<details>
<name>Hannah</name>
<id>103</id>
<age>17</age>
<email>hannah@mail.com</email>
</details>
</student>
</data>
and I am looking for a new XML like the one below. Which student blocks are kept depends on a list of ids, in this case 101 and 102. All other student blocks will be deleted.
<data>
<student>
<result>
<grade>B</grade>
</result>
<details>
<name>Alice</name>
<id>101</id>
<age>17</age>
<email>alice@mail.com</email>
</details>
</student>
<student>
<result>
<grade>F</grade>
</result>
<details>
<name>Bob</name>
<id>102</id>
<age>16</age>
<email>bob@mail.com</email>
</details>
</student>
</data>
I.e. the output XML will depend on a list of ids, in this case ['101', '102'].
This is what I tried:
import lxml.etree
#Original Large XML
tree = etree.parse(open('students.xml'))
root = tree.getroot()
results = root.findall('student')
textnumbers = [r.find('details/id').text for r in results]
print(textnumbers)
required_ids = ['101','102']
wanted = tree.xpath("//student/details/[not(@id in required_ids)]")
for node in unwanted:
    node.getparent().remove(node)
#New Smaller XML
tree.write(open('student_output.xml', 'wb'))
But I am getting an error of "Invalid expression" for
wanted = tree.xpath("//student/details/[not(@id in required_ids)]")
I know it's a long read, but I am fairly new to Python. Thanks in advance for your help.

I think you can do it like this:
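from lxml import etree as ET

required_ids = ['101', '102']

# iterparse yields each element once its end tag has been read
for event, element in ET.iterparse('students.xml'):
    # Drop every student whose id is not in the wanted list
    if element.tag == 'student' and element.xpath('.//id/text()')[0] not in required_ids:
        element.clear()
        element.getparent().remove(element)
    # The root element ends last; by then only the wanted students remain
    if element.tag == 'data':
        ET.dump(element)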
Instead of the dump you would of course want to write to a file, that is, use
    if element.tag == 'data':
        tree = ET.ElementTree(element)
        tree.write('student_output.xml')
Your attempt fails because you can't simply use a Python list variable inside an XPath expression, and "in" is not an XPath 1.0 operator. A predicate also cannot directly follow a slash, as in details/[...].
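As a minimal sketch of the same filtering idea, the snippet below does the membership test in Python rather than in XPath. It uses the standard library's xml.etree.ElementTree instead of lxml (an assumption made so that it runs without extra packages) and a trimmed copy of the sample data. The function name keep_students is hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample document (assumption: only the fields needed here)
SAMPLE = """<data>
  <student><result><grade>A</grade></result><details><name>John</name><id>100</id></details></student>
  <student><result><grade>B</grade></result><details><name>Alice</name><id>101</id></details></student>
  <student><result><grade>F</grade></result><details><name>Bob</name><id>102</id></details></student>
  <student><result><grade>A</grade></result><details><name>Hannah</name><id>103</id></details></student>
</data>"""


def keep_students(xml_text, required_ids):
    """Return the root element with only the students whose id is wanted."""
    root = ET.fromstring(xml_text)
    wanted = set(required_ids)
    # Iterate over a copy, because we remove children while looping
    for student in list(root.findall('student')):
        if student.findtext('details/id') not in wanted:
            root.remove(student)
    return root


root = keep_students(SAMPLE, ['101', '102'])
kept_ids = [s.findtext('details/id') for s in root.findall('student')]
print(kept_ids)
```

To write the result to a file, the root can be wrapped with ET.ElementTree(root) and saved with its write() method, as in the answer above.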

Related

Extracting XML data using SQL

I would like to be able to extract specific data from an XML type in Oracle, in my example for the customer named "Arshad Ali".
This is the XML data that was inserted:
<ROOT>
<Customers>
<Customer CustomerName="Arshad Ali" CustomerID="C001">
<Orders>
<Order OrderDate="2012-07-04T00:00:00" OrderID="10248">
<OrderDetail Quantity="5" ProductID="10" />
<OrderDetail Quantity="12" ProductID="11" />
<OrderDetail Quantity="10" ProductID="42" />
</Order>
</Orders>
<Address> Address line 1, 2, 3</Address>
</Customer>
<Customer CustomerName="Paul Henriot" CustomerID="C002">
<Orders>
<Order OrderDate="2011-07-04T00:00:00" OrderID="10245">
<OrderDetail Quantity="12" ProductID="11" />
<OrderDetail Quantity="10" ProductID="42" />
</Order>
</Orders>
<Address> Address line 5, 6, 7</Address>
</Customer>
<Customer CustomerName="Carlos Gonzlez" CustomerID="C003">
<Orders>
<Order OrderDate="2012-08-16T00:00:00" OrderID="10283">
<OrderDetail Quantity="3" ProductID="72" />
</Order>
</Orders>
<Address> Address line 1, 4, 5</Address>
</Customer>
</Customers>
</ROOT>
Using get clob, I was able to extract all of the customers.
I was wondering if anyone could help me extract the data for a specific customer. I tried the following but was unsuccessful:
SELECT extract(OBJECT_VALUE, '/root/Customers') "customer"
FROM mytable2
WHERE existsNode(OBJECT_VALUE, '/customers[CustomerName="Arshad Ali" CustomerID="C001"]')
= 1;
The case and exact names of the XML nodes matter, and attributes are addressed with @:
SELECT extract(OBJECT_VALUE,
         '/ROOT/Customers/Customer[@CustomerName="Arshad Ali"][@CustomerID="C001"]') "customer"
FROM mytable2
WHERE existsnode(OBJECT_VALUE,
        '/ROOT/Customers/Customer[@CustomerName="Arshad Ali"][@CustomerID="C001"]') = 1
If you only want to search by name then only use that attribute:
SELECT extract(OBJECT_VALUE,
         '/ROOT/Customers/Customer[@CustomerName="Arshad Ali"]') "customer"
FROM mytable2
WHERE existsnode(OBJECT_VALUE,
        '/ROOT/Customers/Customer[@CustomerName="Arshad Ali"]') = 1
But extract() and existsnode() are deprecated; use xmlquery() and xmlexists() instead:
SELECT xmlquery('/ROOT/Customers/Customer[@CustomerName="Arshad Ali"][@CustomerID="C001"]'
         passing object_value
         returning content) "customer"
FROM mytable2
WHERE xmlexists('/ROOT/Customers/Customer[@CustomerName="Arshad Ali"][@CustomerID="C001"]'
        passing object_value)
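The effect of case-sensitive names and attribute predicates can be tried outside the database. The following is only a sketch, using Python's standard xml.etree.ElementTree on a shortened copy of the customer data (the shortening is an assumption). It illustrates the XPath behaviour, not Oracle's functions.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the customer document (orders omitted for brevity)
DOC = """<ROOT>
  <Customers>
    <Customer CustomerName="Arshad Ali" CustomerID="C001">
      <Address> Address line 1, 2, 3</Address>
    </Customer>
    <Customer CustomerName="Paul Henriot" CustomerID="C002">
      <Address> Address line 5, 6, 7</Address>
    </Customer>
  </Customers>
</ROOT>"""

root = ET.fromstring(DOC)  # root is the <ROOT> element

# Correct case, attributes selected with @ in two chained predicates
exact = root.findall(
    "./Customers/Customer[@CustomerName='Arshad Ali'][@CustomerID='C001']")

# Lower-case element names do not match: XML names are case-sensitive
wrong_case = root.findall("./customers/customer[@CustomerName='Arshad Ali']")

exact_ids = [c.get('CustomerID') for c in exact]
print(exact_ids, len(wrong_case))
```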

Karate: Match repeating element in xml

I'm trying to match a repeating element in an XML response against a Karate schema.
XML message
* def xmlResponse =
"""
<Envelope>
<Header/>
<Body>
<Response>
<Customer>
<keys>
<primaryKey>1111111</primaryKey>
</keys>
<simplePay>false</simplePay>
</Customer>
<serviceGroupList>
<serviceGroup>
<name>XXXX</name>
<count>1</count>
<parentName>DDDDD</parentName>
<pendingCount>0</pendingCount>
<pendingHWSum>0.00</pendingHWSum>
</serviceGroup>
<serviceGroup>
<name>ZZZZZ</name>
<count>0</count>
<parentName/>
<pendingCount>3</pendingCount>
<pendingHWSum>399.00</pendingHWSum>
</serviceGroup>
</serviceGroupList>
</Response>
</Body>
</Envelope>
"""
I want to match each serviceGroup against the following Karate schema:
Given def serviceGroupItem =
"""
<serviceGroup>
<name>##string</name>
<count>##string</count>
<parentName>##string</parentName>
<pendingCount>##string</pendingCount>
<pendingHWSum>##string</pendingHWSum>
</serviceGroup>
"""
This is what I tried:
* xml serviceGroupListItems = get xmlResponse //serviceGroupList
* match each serviceGroupListItems == serviceGroupItem
But it doesn't work. Any idea how I can make it work?
You have to match each serviceGroup element, not the serviceGroupList that contains them:
* xml serviceGroupListItems = get xmlResponse //serviceGroupList
* match each serviceGroupListItems.serviceGroupList.serviceGroup == serviceGroupItem.serviceGroup

XPath doesn't provide proper tag

I'm trying to get the <bsc> tag from the XML below.
If I execute a request like this:
WITH x(col) AS (select'<document xmlns="http://example.com/digital/back/" xmlns:ns2="http://example.com/digital/back/complexId" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="">
<header>
<docId>13a2f29a28b12ecb</docId>
<dt>2018-12-10T11:59:48.112+03:00</dt>
</header>
<pay>
<reqTransfer id="154638">
<source>
<card>
<virtualCardNum>4B74C1EE187</virtualCardNum>
<bsc>VISA</bsc>
</card>
</source>
</reqTransfer>
</pay>
</document>
'::xml)
SELECT xpath('/document/pay/reqTransfer/source/card/bsc/text()', col) AS bsc
FROM x;
I get {}, but if I replace the document start tag
<document xmlns="http://example.com/digital/back/" xmlns:ns2="http://example.com/digital/back/complexId" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="">
with <document> or even <document xmlns="">, I get { VISA } - that is right.
What should I do to replace <document xmlns="..."> with <document> or get { VISA } without replacement?
If you are working with XML namespaces, you need to declare them in your XPath queries too, i.e. use
SELECT xpath('/d:document/d:pay/d:reqTransfer/d:source/d:card/d:bsc/text()', col,
             ARRAY[ARRAY['d', 'http://example.com/digital/back/']]) AS bsc
FROM x;
http://sqlfiddle.com/#!17/9eecb/24719
See also:
how to ignore namespaces with XPath
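The same prefix-to-URI mapping idea can be sketched in Python with the standard xml.etree.ElementTree. This is an illustration on a shortened copy of the document, not PostgreSQL code. The prefix d is arbitrary; only the namespace URI has to match the document.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the document; the default namespace is what matters here
DOC = """<document xmlns="http://example.com/digital/back/">
  <pay>
    <reqTransfer id="154638">
      <source>
        <card>
          <virtualCardNum>4B74C1EE187</virtualCardNum>
          <bsc>VISA</bsc>
        </card>
      </source>
    </reqTransfer>
  </pay>
</document>"""

# Map a prefix of our choosing to the document's default namespace URI
NS = {'d': 'http://example.com/digital/back/'}

root = ET.fromstring(DOC)  # root is the <document> element

# Without prefixes the path looks for elements in no namespace: no match
unqualified = root.find('pay/reqTransfer/source/card/bsc')

# With every step prefixed, the element is found
qualified = root.find('d:pay/d:reqTransfer/d:source/d:card/d:bsc', NS)

print(unqualified, qualified.text)
```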

Find element or attribute value anywhere in XML

I am trying to find the value of an element / attribute regardless of where it exists in the XML.
XML:
<?xml version="1.0" encoding="UTF-8"?>
<cXML payloadID="12345677-12345567" timestamp="2017-07-26T09:11:05">
<Header>
<From>
<Credential domain="1212">
<Identity>01235 </Identity>
<SharedSecret/>
</Credential>
</From>
<To>
<Credential domain="1212">
<Identity>01234</Identity>
</Credential>
</To>
<Sender>
<UserAgent/>
<Credential domain="8989">
<Identity>10678</Identity>
<SharedSecret>Testing123</SharedSecret>
</Credential>
</Sender>
</Header>
<Request deploymentMode="Prod">
<ConfirmationRequest>
<ConfirmationHeader noticeDate="2017-07-26T09:11:05" operation="update" type="detail">
<Total>
<Money>0.00</Money>
</Total>
<Shipping>
<Description>Delivery</Description>
</Shipping>
<Comments>WO# generated</Comments>
</ConfirmationHeader>
<OrderReference orderDate="2017-07-25T15:22:11" orderID="123456780000">
<DocumentReference payloadID="5678-4567"/>
</OrderReference>
<ConfirmationItem quantity="1" lineNumber="1">
<ConfirmationStatus quantity="1" type="detail">
<ItemIn quantity="1">
<ItemID>
<SupplierPartID>R954-89</SupplierPartID>
</ItemID>
<ItemDetail>
<UnitPrice>
<Money currency="USD">0.00</Money>
</UnitPrice>
<Description>Test Descritpion 1</Description>
<UnitOfMeasure>QT</UnitOfMeasure>
</ItemDetail>
</ItemIn>
</ConfirmationStatus>
</ConfirmationItem>
<ConfirmationItem quantity="1" lineNumber="2">
<ConfirmationStatus quantity="1" type="detail">
<ItemIn quantity="1">
<ItemID>
<SupplierPartID>Y954-89</SupplierPartID>
</ItemID>
<ItemDetail>
<UnitPrice>
<Money currency="USD">0.00</Money>
</UnitPrice>
<Description>Test Descritpion 2</Description>
<UnitOfMeasure>QT</UnitOfMeasure>
</ItemDetail>
</ItemIn>
</ConfirmationStatus>
</ConfirmationItem>
</ConfirmationRequest>
</Request>
</cXML>
I want to get the value of the payloadID on the DocumentReference element. This is what I have tried so far:
BEGIN
Declare @Xml xml
Set @Xml = cast('..The XML From Above..' as xml)
END
--no value comes back
Select c.value('(/*/DocumentReference/@payloadID)[0]','nvarchar(max)') from @Xml.nodes('//cXML') x(c)
--no value comes back
Select c.value('@payloadID','nvarchar(max)') from @Xml.nodes('/cXML/*/DocumentReference') x(c)
--check if element exists and it does
Select @Xml.exist('//DocumentReference');
I tried this in an XPath editor: //DocumentReference/@payloadID
This does work, but I am not sure what the equivalent syntax is in SQL
Calling .nodes() (as suggested in a comment) is unnecessary overhead...
Better try it like this:
SELECT @Xml.value('(//DocumentReference/@payloadID)[1]','nvarchar(max)')
And be aware that XPath starts counting at 1. Your example with [0] cannot work:
--no value comes back
Select c.value('(/*/DocumentReference/@payloadID)[0]','nvarchar(max)') from...
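Both points, searching at any depth and counting from 1, can be illustrated with a small sketch in Python's standard xml.etree.ElementTree on a shortened copy of the cXML document. In the original, DocumentReference sits four levels below cXML, which is why a path such as /cXML/*/DocumentReference, which allows only one intermediate level, finds nothing. ElementTree does not select attributes with /@name, so the sketch reads the attribute with get() instead. This is an analogy, not T-SQL.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the cXML document (header and item details omitted)
CXML = """<cXML payloadID="12345677-12345567">
  <Request deploymentMode="Prod">
    <ConfirmationRequest>
      <OrderReference orderID="123456780000">
        <DocumentReference payloadID="5678-4567"/>
      </OrderReference>
      <ConfirmationItem quantity="1" lineNumber="1"/>
      <ConfirmationItem quantity="1" lineNumber="2"/>
    </ConfirmationRequest>
  </Request>
</cXML>"""

root = ET.fromstring(CXML)

# './/' searches all descendants, like '//' in the answer's XPath
doc_ref = root.find('.//DocumentReference')
payload_id = doc_ref.get('payloadID')

# Only one wildcard level below the root: DocumentReference is deeper, no match
too_shallow = root.find('./*/DocumentReference')

# Positions are 1-based: [1] selects the first ConfirmationItem
first_item = root.find('.//ConfirmationItem[1]')
first_line = first_item.get('lineNumber')

print(payload_id, too_shallow, first_line)
```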

Read XML data in SQL

I want to query data from XML. I have managed to retrieve data from another set of XML data, but this one is a bit problematic.
Below you see the data and the query that does not retrieve any data.
DECLARE @xml XML
SET @xml=N'<DocumentXML>
<LoadApplicationResult xmlns:i="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://schemas.datacontract.org/2004/07/Reaktor.Applikator.DTO">
<Application>
<EmbeddedProductList>
<EmbeddedProduct>
<Flag>false</Flag>
<CustomData>
<root xmlns="">
<Guaranteer ChangeTime="2012-04-28T08:50:07.5706054+02:00" ChangedBy="sven" OldValue="">
<Text>4</Text>
</Guaranteer>
<PercentGuarantee ChangeTime="2012-04-28T08:50:07.5706054+02:00" ChangedBy="sven" OldValue="">
<Number>100</Number>
</PercentGuarantee>
</root>
</CustomData>
<DataChangeTime>2014-04-28T08:50:07.5706054+02:00</DataChangeTime>
<ID>12</ID>
<FinanceSeparately>false</FinanceSeparately>
<Guid>5349efcd-457c-4423-b4bb-a28f97dd5e64</Guid>
<PluginData i:nil="true" />
<PriceCalcTime>2014-04-28T08:50:09.2580946+02:00</PriceCalcTime>
<Data>
<root xmlns="">
<root TableId="192">
<Generic.TypeCode>abba</Generic.TypeCode>
</root>
</root>
</Data>
</EmbeddedProduct>
<EmbeddedProduct>
<Flag>false</Flag>
<CustomData i:nil="true" />
<DataChangeTime>1954-10-03T00:00:00</DataChangeTime>
<ID>30</ID>
<FinanceSeparately>false</FinanceSeparately>
<Guid>d587b9b4-94df-4d9b-ba0d-2fdc62823a17</Guid>
<PluginData i:nil="true" />
<PriceCalcTime>2014-04-28T08:49:55.8831802+02:00</PriceCalcTime>
<Data>
<root xmlns="">
<root TableId="013">
<EmbProd.CMSPrice>0</EmbProd.CMSPrice>
<EmbProd.MonthFee Operator="DBLMUL" Target="CUSTOM.EPTermFee.ADD" Source="XPATH://PaySeries[1]/TermLength" DFValue="200">200</EmbProd.MonthFee>
</root>
<root TableId="759" GroupText="210" GroupText0="210">
<Flag>ink</Flag>
<Generic.TypeCode>fil</Generic.TypeCode>
</root>
</root>
</Data>
</EmbeddedProduct>
<EmbeddedProduct>
<Flag>false</Flag>
<CustomData>
<root xmlns="" />
</CustomData>
<DataChangeTime>2012-04-26T14:41:26.4232222+02:00</DataChangeTime>
<ID>16</ID>
<FinanceSeparately>false</FinanceSeparately>
<Guid>c2e2343f-a5d6-43c8-aa18-c43419d20165</Guid>
<PluginData i:nil="true" />
<PriceCalcTime>2014-04-28T08:49:55.8831802+02:00</PriceCalcTime>
<Data>
<root xmlns="">
<root TableId="102">
<EmbProd.MonthFee Operator="DBLMUL" Target="CUSTOM.EPTermFee.ADD" Source="XPATH://PaySeries[1]/TermLength" DFValue="300">300</EmbProd.MonthFee>
<EP.GenericCost Target="COST">114</EP.GenericCost>
</root>
<root TableId="102" GroupText="11" GroupText0="7">
<EP.TermCount Target="DBLMUL">13</EP.TermCount>
</root>
<root TableId="102" GroupText="210" GroupText0="210">
<Generic.TypeCode>frodinge</Generic.TypeCode>
</root>
</root>
</Data>
</EmbeddedProduct>
</EmbeddedProductList>
</Application>
</LoadApplicationResult>
</DocumentXML>'
SELECT tab.col.value('(Flag)[1]', 'nvarchar(max)') AS Flag
,tab.col.value('(Data/root/EmbProd.MonthFee)[1]', 'nvarchar(max)') AS Value
,tab.col.value('(ID)[1]', 'nvarchar(max)') AS Product
FROM @xml.nodes('/DocumentXML//LoadApplicationResult/Application/EmbeddedProductList/EmbeddedProduct') AS Tab(col)
The expected output should look like this:
+-------+-------+---------+
| Flag | Value | Product |
+-------+-------+---------+
| false | | 12 |
| false | 200 | 30 |
| true | 300 | 16 |
+-------+-------+---------+
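You need to specify the namespace. The elements under LoadApplicationResult are in its default namespace, so they need a prefix. The inner root elements reset it with xmlns="" and stay unprefixed. Note also that EmbProd.MonthFee sits two root levels below Data:
WITH XMLNAMESPACES ( 'http://schemas.datacontract.org/2004/07/Reaktor.Applikator.DTO' as x)
SELECT tab.col.value('(x:Flag)[1]', 'nvarchar(max)') AS Flag
      ,tab.col.value('(x:Data/root/root/EmbProd.MonthFee)[1]', 'nvarchar(max)') AS Value
      ,tab.col.value('(x:ID)[1]', 'nvarchar(max)') AS Product
FROM @xml.nodes('DocumentXML/x:LoadApplicationResult/x:Application/x:EmbeddedProductList/x:EmbeddedProduct') AS Tab(col);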
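The mix of prefixed and unprefixed steps can be checked with a small sketch in Python's standard xml.etree.ElementTree, run on a heavily shortened copy of the document (two products, most fields dropped, which is an assumption for brevity). It mirrors the structure of the query above but is not T-SQL.

```python
import xml.etree.ElementTree as ET

# Prefix of our choosing mapped to the document's default namespace URI
NS = {'x': 'http://schemas.datacontract.org/2004/07/Reaktor.Applikator.DTO'}

# Heavily shortened copy of the document
DOC = """<DocumentXML>
  <LoadApplicationResult xmlns="http://schemas.datacontract.org/2004/07/Reaktor.Applikator.DTO">
    <Application>
      <EmbeddedProductList>
        <EmbeddedProduct>
          <Flag>false</Flag>
          <ID>12</ID>
          <Data>
            <root xmlns="">
              <root TableId="192">
                <Generic.TypeCode>abba</Generic.TypeCode>
              </root>
            </root>
          </Data>
        </EmbeddedProduct>
        <EmbeddedProduct>
          <Flag>false</Flag>
          <ID>30</ID>
          <Data>
            <root xmlns="">
              <root TableId="013">
                <EmbProd.MonthFee>200</EmbProd.MonthFee>
              </root>
            </root>
          </Data>
        </EmbeddedProduct>
      </EmbeddedProductList>
    </Application>
  </LoadApplicationResult>
</DocumentXML>"""

root = ET.fromstring(DOC)  # <DocumentXML> itself is in no namespace

rows = []
product_path = ('x:LoadApplicationResult/x:Application/'
                'x:EmbeddedProductList/x:EmbeddedProduct')
for product in root.findall(product_path, NS):
    rows.append((
        product.findtext('x:Flag', namespaces=NS),
        # The root elements carry xmlns="", so these steps take no prefix
        product.findtext('x:Data/root/root/EmbProd.MonthFee', namespaces=NS),
        product.findtext('x:ID', namespaces=NS),
    ))

print(rows)
```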