U-SQL with XmlExtractor - elements inside elements - azure-data-lake

In U-SQL I am trying to get a list of elements inside elements, using the XmlExtractor. But I cannot get the nested collection.
It is a list of items, which has locations. With the XmlExtractor I can get a collection of elements, but I don't see how I can get a collection that contains a collection. An XML sample is shown below.
Any ideas?
<root>
<Item>
<Header>
<id>111</id>
</Header>
<Body>
<Locations>
<Location>
<Station>k4</Station>
<Timestamp>2017-08-30T02:04:18.2506945+02:00</Timestamp>
</Location>
<Location>
<Station>k5</Station>
<Timestamp>2017-08-30T02:04:18.2506945+02:00</Timestamp>
</Location>
</Locations>
</Body>
</Item>
<Item>
<Header>
<id>222</id>
</Header>
<Body>
<Locations>
<Location>
<Station>k4</Station>
<Timestamp>2017-08-30T02:12:36.1218601+02:00</Timestamp>
</Location>
<Location>
<Station>k5</Station>
<Timestamp>2017-08-30T02:12:36.1218601+02:00</Timestamp>
</Location>
</Locations>
</Body>
</Item>
</root>

Solved by writing an extractor that reads the XML into a single string and then calls a method that uses XPath and returns a SQL.ARRAY of strings, each string holding the comma-separated values of one result. The result looks like this:
111;k4,2017-08-30T02:04:18.2506945+02:00
111;k5,2017-08-30T02:04:18.2506945+02:00
222;k4,2017-08-30T02:12:36.1218601+02:00
222;k5,2017-08-30T02:12:36.1218601+02:00
The standard XmlExtractor cannot do this. I also decided that it is better to postpone parsing the XML until after it has been extracted, because several steps may need to work on the same XML.
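The custom extractor itself is not shown in the post. As a rough illustration of the flattening it describes (one output row per Location, prefixed with the id of its parent Item), here is a minimal sketch in Python rather than U-SQL/C#. The function name flatten_items is hypothetical, and the sample is a condensed copy of the XML from the question.

```python
import xml.etree.ElementTree as ET

# Condensed copy of the sample document from the question.
SAMPLE = """<root>
  <Item>
    <Header><id>111</id></Header>
    <Body><Locations>
      <Location><Station>k4</Station><Timestamp>2017-08-30T02:04:18.2506945+02:00</Timestamp></Location>
      <Location><Station>k5</Station><Timestamp>2017-08-30T02:04:18.2506945+02:00</Timestamp></Location>
    </Locations></Body>
  </Item>
  <Item>
    <Header><id>222</id></Header>
    <Body><Locations>
      <Location><Station>k4</Station><Timestamp>2017-08-30T02:12:36.1218601+02:00</Timestamp></Location>
      <Location><Station>k5</Station><Timestamp>2017-08-30T02:12:36.1218601+02:00</Timestamp></Location>
    </Locations></Body>
  </Item>
</root>"""


def flatten_items(xml_text):
    """Return one 'id;station,timestamp' string per nested Location."""
    root = ET.fromstring(xml_text)
    rows = []
    # Outer loop: one pass per Item (the parent collection).
    for item in root.findall("Item"):
        item_id = item.findtext("Header/id")
        # Inner loop: the nested collection, relative to the current Item.
        for location in item.findall("Body/Locations/Location"):
            station = location.findtext("Station")
            timestamp = location.findtext("Timestamp")
            rows.append(f"{item_id};{station},{timestamp}")
    return rows


if __name__ == "__main__":
    for row in flatten_items(SAMPLE):
        print(row)
```

The point is only the two-level loop: the inner path is evaluated relative to each Item, which is the same role CROSS APPLY plays in the SQL answers further down.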

Azure SQL Database has powerful abilities to shred XML. Maybe if this is already in your estate/architecture it might make a simple alternative to custom code? A simple example:
DECLARE @xml XML = '<root>
<Item>
<Header>
<id>111</id>
</Header>
<Body>
<Locations>
<Location>
<Station>k4</Station>
<Timestamp>2017-08-30T02:04:18.2506945+02:00</Timestamp>
</Location>
<Location>
<Station>k5</Station>
<Timestamp>2017-08-30T02:04:18.2506945+02:00</Timestamp>
</Location>
</Locations>
</Body>
</Item>
<Item>
<Header>
<id>222</id>
</Header>
<Body>
<Locations>
<Location>
<Station>k4</Station>
<Timestamp>2017-08-30T02:12:36.1218601+02:00</Timestamp>
</Location>
<Location>
<Station>k5</Station>
<Timestamp>2017-08-30T02:12:36.1218601+02:00</Timestamp>
</Location>
</Locations>
</Body>
</Item>
</root>'
/*
111;k4,2017-08-30T02:04:18.2506945+02:00
111;k5,2017-08-30T02:04:18.2506945+02:00
222;k4,2017-08-30T02:12:36.1218601+02:00
222;k5,2017-08-30T02:12:36.1218601+02:00
*/
SELECT
r.c.value('(Header/id/text())[1]', 'int' ) id,
b.c.value('(Station/text())[1]', 'varchar(10)' ) station,
b.c.value('(Timestamp/text())[1]', 'varchar(40)' ) [timestamp],
b.c.value('(Timestamp/text())[1]', 'datetimeoffset' ) [timestamp2]
FROM @xml.nodes('root/Item') r(c)
CROSS APPLY r.c.nodes('Body/Locations/Location') b(c)
You can do something similar if the XML is stored in a table also.

Here is a script that achieves the desired results using the extractors provided.
USE master;
REFERENCE SYSTEM ASSEMBLY [System.Xml];
REFERENCE ASSEMBLY master.[Microsoft.Analytics.Samples.Formats.Xml];
@e = EXTRACT a string, b string
FROM "CollectTest.xml"
USING new Microsoft.Analytics.Samples.Formats.Xml.XmlDomExtractor(rowPath:"Item",
columnPaths:new SQL.MAP<string, string> { {"Header", "a"}, {"Body", "b"} });
@f = SELECT @e.a, t.c, t.d
FROM @e
CROSS APPLY new Microsoft.Analytics.Samples.Formats.Xml.XmlApplier("b","Location", new SQL.MAP<string,string> { {"Station", "c"}, {"Timestamp", "d"} }) AS t(c string, d string);
OUTPUT @f TO "foo.txt" USING Outputters.Tsv(outputHeader:true);
OUTPUT @e TO "foo2.txt" USING Outputters.Tsv(outputHeader:true);
The first rowset @e uses the XmlDomExtractor to create a rowset containing the Header (the "ID") in column a and the child XML in column b.
The second rowset @f then uses XmlApplier to extract the values from the nested XML and cross-applies them to the correct rows. The sample XML was copied from the post above and saved in the USQLDataRoot folder as "CollectTest.xml".
Note: I got lazy, so the output for Header contains some unwanted node syntax, but adding an intermediate XPath or XmlApplier step between @e and @f should solve this.

Related

Retrieving All instances of an 3rd level XML field from an XML column

I have an XML data field in one of my tables that essentially looks like this:
<App xmlns='http://Namespace1'>
<Package xmlns='http://Namespace2'>
<Item>
<ItemDetails xmlns='http://Namespace3'>
<ItemName>ItemNameValue</ItemName>
</ItemDetails>
other_item_stuff
</Item>
<Item>
<ItemDetails>
<ItemName>ItemNameValue</ItemName>
</ItemDetails>
</Item>
...
</Package>
</App>
I need to get all of the ItemNameValues from the XML.
I have tried to adapt many examples found on the web to my purpose, but have failed miserably. The best I seem to be able to do is get one ItemName per Package.
I think that CROSS APPLY is where I need to go, but the syntax to retrieve all the itemdetail.itemname eludes me.
This is my latest failure (returns nothing):
WITH XMLNAMESPACES(
'http://Namespace1' AS xsd,
'http://www.w3.org/2001/XMLSchema-instance' AS xsi,
'http://Namespace2' AS ns1,
'http://Namespace3' AS ns2)
SELECT Items.d.value('(ns2:ItemDetails/ItemName/text())[1]','varchar(200)') AS ItemName
FROM MyTable
CROSS APPLY XMLDataColumn.nodes('/xsd:App/ns1:Package/ns1:Item') Items(d)
I hope to get several records from each XML field, but can only ever get the first element.
The biggest problem in this issue is the XML itself:
<App xmlns="http://Namespace1">
<Package xmlns="http://Namespace2">
<Item>
<ItemDetails xmlns="http://Namespace3">
<ItemName>ItemNameValue</ItemName>
</ItemDetails>
other_item_stuff
</Item>
<Item>
<ItemDetails>
<ItemName>ItemNameValue</ItemName>
</ItemDetails>
</Item>
...
</Package>
</App>
Two major Problems:
The namespaces are all declared as default namespaces (they do not include a prefix). All nodes within a node share the same default namespace, if there is nothing else stated explicitly.
The first <ItemDetails> is living within namespace http://Namespace3, while the second <ItemDetails> is living within namespace http://Namespace2 (inherited from <Package>)
That means: If you can - by any chance - change the construction of the XML, try to do this first.
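To make the inheritance of default namespaces concrete, here is a small sketch in Python (not T-SQL) that prints the fully qualified name of each ItemName element. The values "first" and "second" are placeholders I substituted for ItemNameValue so the two elements can be told apart, and qualified_item_names is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Same shape as the XML in the question; "first"/"second" are placeholder values.
DOC = """<App xmlns="http://Namespace1">
  <Package xmlns="http://Namespace2">
    <Item>
      <ItemDetails xmlns="http://Namespace3">
        <ItemName>first</ItemName>
      </ItemDetails>
    </Item>
    <Item>
      <ItemDetails>
        <ItemName>second</ItemName>
      </ItemDetails>
    </Item>
  </Package>
</App>"""


def qualified_item_names(xml_text):
    """Return (qualified tag, text) for every ItemName, in document order."""
    root = ET.fromstring(xml_text)
    # ElementTree stores tags as "{namespace-uri}localname".
    return [(el.tag, el.text) for el in root.iter() if el.tag.endswith("}ItemName")]


if __name__ == "__main__":
    for tag, text in qualified_item_names(DOC):
        print(tag, text)
```

The first ItemName resolves to http://Namespace3 (inherited from its ItemDetails), the second to http://Namespace2 (inherited from Package), which is why a single prefixed path cannot match both.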
If you have to deal with this, you can try this clean, but clumsy approach.
WITH XMLNAMESPACES(
'http://Namespace1' AS ns1,
'http://www.w3.org/2001/XMLSchema-instance' AS xsi,
'http://Namespace2' AS ns2,
'http://Namespace3' AS ns3)
SELECT COALESCE(Items.d.value('(ns2:ItemDetails/ns2:ItemName/text())[1]','varchar(200)')
,Items.d.value('(ns3:ItemDetails/ns3:ItemName/text())[1]','varchar(200)')) AS ItemName
FROM @xml.nodes('/ns1:App/ns2:Package/ns2:Item') Items(d);
Another approach is to use a namespace wildcard, but be aware of ambiguous names...
SELECT Items.d.value('(*:ItemDetails/*:ItemName/text())[1]','varchar(200)') AS ItemName
FROM @xml.nodes('/*:App/*:Package/*:Item') Items(d)

Build New XML From Stored XML Value

We store rather large XML blobs (in a column of XML type) and I'm pursuing a skunkworks project to try to build up a subset of the XML on the fly when needed.
Let's say I have this XML blob stored in our database table in a given column:
<root>
<header>
<id>1</id>
<name id="foo">Name</name>
</header>
<body>
<items>
<addItem>
<val>1</val>
</addItem>
<observeItem>
<val>2</val>
</observeItem>
</items>
</body>
</root>
What I want is to recreate the above document structure but include only one of the children of items, for example:
<root>
<header>
<id>1</id>
<name id="foo">Name</name>
</header>
<body>
<items>
<observeItem>
<val>2</val>
</observeItem>
</items>
</body>
</root>
That is the result I would want if I were interested in just the observeItem record (the items element can have any number of children, but I'll only ever be interested in a single one of them).
I know I can do something like SELECT @XML.query('//items/child::*[2]') to get just a given child item, but how would I build up the full original document in a query with just one of those children?
I've come up with a solution, but I'm not entirely pleased with it:
DECLARE @XML XML = '
<root>
<header>
<id>1</id>
<name id="foo">Name</name>
</header>
<body>
<items>
<addItem>
<val>1</val>
</addItem>
<observeItem>
<val>2</val>
</observeItem>
</items>
</body>
</root>'
DECLARE @NthChild INT = 2
SELECT
@XML.query('//header'),
@XML.query('//items/child::*[sql:variable("@NthChild")]') AS 'items'
FOR XML PATH('root')
I don't like having to specify the root explicitly nor the items, but I think this approach could get me by.
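The query above rebuilds the document piece by piece. A different technique, shown here only as a sketch in Python rather than T-SQL, is to prune the unwanted children of items in place, so that neither root nor items has to be named when producing the output. keep_nth_item is a hypothetical helper name, and the 1-based index mirrors the NthChild variable above.

```python
import xml.etree.ElementTree as ET

# The stored document from the question.
DOC = """<root>
  <header>
    <id>1</id>
    <name id="foo">Name</name>
  </header>
  <body>
    <items>
      <addItem><val>1</val></addItem>
      <observeItem><val>2</val></observeItem>
    </items>
  </body>
</root>"""


def keep_nth_item(xml_text, n):
    """Return the parsed document with only the n-th (1-based) child of items kept."""
    root = ET.fromstring(xml_text)
    items = root.find("body/items")
    # Iterate over a copy of the child list, because we remove while looping.
    for index, child in enumerate(list(items), start=1):
        if index != n:
            items.remove(child)
    return root


if __name__ == "__main__":
    pruned = keep_nth_item(DOC, 2)
    print(ET.tostring(pruned, encoding="unicode"))
```

Everything outside items (the header, its attributes, the body wrapper) is carried over untouched because nothing is reconstructed.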

SQL Server Grabbing Value from XML parameter to use in later query

I am really new to SQL Server and stored procedures to begin with. I need to be able to parse an incoming XML file for a specific element's value and compare/save it later in the procedure.
Several things are stacked against me. For one, the element I need is buried deep inside the document. I have had no luck searching for it by name using methods similar to this:
select CurrentBOD = c.value('(local-name(.))[1]', 'VARCHAR(MAX)'),
c.value('(.)[1]', 'VARCHAR(MAX)') from @xml.nodes('PutMessage/payload/content/AcknowledgePartsOrder/ApplicationArea/BODId') as BODtable(c)
It always returns null.
So, I am trying something similar to this:
declare @BODtable TABLE(FieldName VARCHAR(MAX),
FieldValue VARCHAR(MAX))
SELECT
FieldName = nodes.value('local-name(.)', 'varchar(50)'),
FieldValue = nodes.value('(.)[1]', 'varchar(50)')
FROM
@xml.nodes('//*') AS BODtable(nodes)
declare @CurrentBOD VARCHAR(36)
set @CurrentBOD = ''
SET @CurrentBOD = (SELECT FieldValue from @BODtable WHERE FieldName = 'BODId')
This provides me the list of node names and values correctly (I test this in a query and BODtable has all elements listed with the correct values), but when I set @CurrentBOD it comes up null.
Am I missing an easier way to do this? Am I messing these two approaches up somehow?
Here is a part of the xml I am parsing for reference:
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:wsa="http://schemas.xmlsoap.org/ws/2004/03/addressing" xmlns:wsse="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd" xmlns:wsu="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd">
<soap:Header>
<payloadManifest xmlns="???">
<c contentID="Content0" namespaceURI="???" element="AcknowledgePartsOrder" version="4.0" />
</payloadManifest>
<wsa:Action>http://www.starstandards.org/webservices/2005/10/transport/operations/PutMessage</wsa:Action>
<wsa:MessageID>uuid:df8c66af-f364-4b8f-81d8-06150da14428</wsa:MessageID>
<wsa:ReplyTo>
<wsa:Address>http://schemas.xmlsoap.org/ws/2004/03/addressing/role/anonymous</wsa:Address>
</wsa:ReplyTo>
<wsa:To>???</wsa:To>
<wsse:Security soap:mustUnderstand="1">
<wsu:Timestamp wsu:Id="Timestamp-bd91e76f-c212-4555-9b23-f66f839672bd">
<wsu:Created>2013-01-03T21:52:48Z</wsu:Created>
<wsu:Expires>2013-01-03T21:53:48Z</wsu:Expires>
</wsu:Timestamp>
<wsse:UsernameToken xmlns:wsu="???" wsu:Id="???">
<wsse:Username>???</wsse:Username>
<wsse:Password Type="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-username-token-profile-1.0#PasswordText">???</wsse:Password>
<wsse:Nonce>???</wsse:Nonce>
<wsu:Created>2013-01-03T21:52:48Z</wsu:Created>
</wsse:UsernameToken>
</wsse:Security>
</soap:Header>
<soap:Body>
<PutMessage xmlns="??????">
<payload>
<content id="???">
<AcknowledgePartsOrder xmlns="???" xmlns:xsi="???" xsi:schemaLocation="??? ???" revision="???" release="???" environment="???n" lang="en-US" bodVersion="???">
<ApplicationArea>
<Sender>
<Component>???</Component>
<Task>???</Task>
<ReferenceId>???</ReferenceId>
<CreatorNameCode>???</CreatorNameCode>
<SenderNameCode>???</SenderNameCode>
<DealerNumber>???</DealerNumber>
<PartyId>???</PartyId>
<LocationId />
<ServiceId />
</Sender>
<CreationDateTime>2013-01-03T21:52:47</CreationDateTime>
<BODId>71498800-c098-4885-9ddc-f58aae0e5e1a</BODId>
<Destination>
<DestinationNameCode>???</DestinationNameCode>
You need to respect the XML namespaces!
First of all, your target XML node <BODId> is inside the <soap:Envelope> and <soap:Body> tags - both need to be included in your selection.
Secondly, both the <PutMessage> as well as the <AcknowledgePartsOrder> nodes appear to have default XML namespaces (those xmlns=.... without a prefix) - and those must be respected when you select your data using XPath.
So assuming that <PutMessage xmlns="urn:pm"> and <AcknowledgePartsOrder xmlns="urn:apo"> (those are just guesses on my part; replace them with the actual XML namespaces that you haven't shown us here), you should be able to use this XPath to get what you're looking for:
;WITH XMLNAMESPACES('http://schemas.xmlsoap.org/soap/envelope/' AS soap,
'urn:pm' AS ns, 'urn:apo' AS apo)
SELECT
XC.value('(apo:BODId)[1]', 'varchar(100)')
FROM
@YourXmlVariable.nodes('/soap:Envelope/soap:Body/ns:PutMessage/ns:payload/ns:content/apo:AcknowledgePartsOrder/apo:ApplicationArea') AS XT(XC)
This does return the expected value (71498800-c098-4885-9ddc-f58aae0e5e1a) in my case.

How to ignore XML namespace when creating SQL request?

I have many rows in a DB which contain XML data field. XML approximately looks like this:
<CabasEstimateReply xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="https://cabmb.cab.se/schemas/CABMBGeneralSchemas/CABASEstimateReply/2006-11-16/">
<Estimate xmlns="">
<WorkshopCompanyId>C002006893</WorkshopCompanyId>
<EstimateId>1-SE-AEB965-634921885183891313</EstimateId>
</Estimate>
<EstimateReply xmlns="">
**<EstimateReplyCode>ReplyStatus1</EstimateReplyCode>**
<EstimateReplyVersion>1</EstimateReplyVersion>
<EstimateReplyDate>2013-05-31T11:40:18.6227322+03:00</EstimateReplyDate>
<EstimateReplyComment />
<EstimateReplyMessage>Kunden betalar : 8692 Fakturaadress : Trygg Hansa</EstimateReplyMessage>
<EstimateReplyMessageCompressMethod />
<EstimateReplyReference>010704</EstimateReplyReference>
<EstimateReplyForthcomingInspectionDate />
</EstimateReply>
<Vehicle xmlns="">
<VehicleRegNo>XND108</VehicleRegNo>
<VehicleMake>BMW</VehicleMake>
<VehicleModel>525I TOURING</VehicleModel>
<VehicleModelYear />
<VehicleModelMonth />
<VehicleVINCode />
<VehicleChassiNo>NL51010CM95684</VehicleChassiNo>
<VehicleFirstRegistered>2006-02-23T00:00:00</VehicleFirstRegistered>
<Imported>null</Imported>
</Vehicle>
I need to get the value of EstimateReplyCode (marked in bold) with a SQL query. I'm doing it like this:
;WITH XMLNAMESPACES(DEFAULT 'https://cabmb.cab.se/schemas/CABMBGeneralSchemas/CABASEstimateReply/2006-11-16/')
select [Data],
Data.value('(/CabasEstimateReply/EstimateReply/EstimateReplyCode)[1]', 'nvarchar(64)') AS ReplyCode
from EstimateReplyRawData
But I get only NULL values for ReplyCode. When I converted the XML to a string, stripped the namespaces, and converted it back to XML, everything worked, so I suppose the issue is the namespace. What am I doing wrong here?
If you really want to ignore namespaces, you can use namespace wildcards.
select [Data],
Data.value('(/*:CabasEstimateReply/*:EstimateReply/*:EstimateReplyCode)[1]', 'nvarchar(64)') AS ReplyCode
from EstimateReplyRawData
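Why the original path returned NULL is worth spelling out: the child elements carry xmlns="", which resets the default namespace, so EstimateReply and EstimateReplyCode are in no namespace at all, while a DEFAULT declaration in the query applies to every step of the path. The sketch below reproduces that behaviour in Python (3.8 or later for the {*} wildcard) rather than T-SQL. The document is a trimmed copy of the one in the question, and the three function names are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = "https://cabmb.cab.se/schemas/CABMBGeneralSchemas/CABASEstimateReply/2006-11-16/"

# Trimmed copy of the document from the question.
DOC = f"""<CabasEstimateReply xmlns="{NS}">
  <EstimateReply xmlns="">
    <EstimateReplyCode>ReplyStatus1</EstimateReplyCode>
  </EstimateReply>
</CabasEstimateReply>"""


def strict_lookup(xml_text):
    """Look for both child steps in the root's namespace (like DEFAULT in the query)."""
    root = ET.fromstring(xml_text)
    return root.findtext("d:EstimateReply/d:EstimateReplyCode", namespaces={"d": NS})


def no_namespace_lookup(xml_text):
    """Look for the child steps in no namespace, which is where xmlns="" puts them."""
    root = ET.fromstring(xml_text)
    return root.findtext("EstimateReply/EstimateReplyCode")


def wildcard_lookup(xml_text):
    """Match the child steps in any namespace or none, like *: in the answer."""
    root = ET.fromstring(xml_text)
    return root.findtext("{*}EstimateReply/{*}EstimateReplyCode")


if __name__ == "__main__":
    print(strict_lookup(DOC))
    print(no_namespace_lookup(DOC))
    print(wildcard_lookup(DOC))
```

Only the root element is in the cabmb namespace, so a path that qualifies just the root and leaves the inner steps unqualified would be the stricter alternative to the wildcard.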

Configuring namespace for sp_xml_preparedocument

I have an RSS xml with this format:
<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
<channel>
<title></title>
<link></link>
<description></description>
<language></language>
<lastBuildDate></lastBuildDate>
<generator></generator>
<docs></docs>
<managingEditor></managingEditor>
<webMaster></webMaster>
<ttl></ttl>
<item>
<title></title>
<link></link>
<description></description>
<guid isPermaLink="false"></guid>
<pubDate></pubDate>
<author></author>
<dc:date></dc:date>
<dc:publisher></dc:publisher>
<dc:language></dc:language>
</item>
<item>
<title></title>
<link></link>
<description></description>
<guid isPermaLink="false"></guid>
<pubDate></pubDate>
<author></author>
<dc:date></dc:date>
<dc:publisher></dc:publisher>
<dc:language></dc:language>
</item>
</channel>
</rss>
And I want to parse it with sp_xml_preparedocument in SQLServer.
My problem is the namespace. Three tags in each item have a namespace prefix, and I don't know how to specify them.
I have tried this:
EXEC sp_xml_preparedocument @hDoc OUTPUT, @xmlContent,'<item xmlns:dc="http://purl.org/dc/elements/1.1/"/>'
but it parses only the first item and ignores the rest!
Any idea?
The fact that you are only getting one row has nothing to do with the namespace. You have some error in your openxml query against @hDoc.
There might be reasons for you to still use openxml but until you show the query that is not working for you I will suggest you use the XML data type instead.
with xmlnamespaces('http://purl.org/dc/elements/1.1/' as dc)
select C.N.value('(title/text())[1]', 'nvarchar(100)') as channel_title,
I.N.value('(title/text())[1]', 'nvarchar(100)') as item_title,
I.N.value('(dc:publisher/text())[1]', 'nvarchar(100)') as publisher
from @XML.nodes('/rss/channel') as C(N)
cross apply C.N.nodes('item') as I(N);
The namespace needs to be defined as a character type:
EXEC sp_xml_preparedocument @hDoc OUTPUT, @xmlContent,'<item xmlns:dc="http://purl.org/dc/elements/1.1/"/>'
[ xpath_namespaces ]
Specifies the namespace declarations that are used in row and column XPath expressions in OPENXML. xpath_namespaces is a text parameter: char, nchar, varchar, nvarchar, text, ntext or xml.
The default value is <root xmlns:mp="urn:schemas-microsoft-com:xml-metaprop">. xpath_namespaces provides the namespace URIs for the prefixes used in the XPath expressions in OPENXML by means of a well-formed XML document. xpath_namespaces declares the prefix that must be used to refer to the namespace urn:schemas-microsoft-com:xml-metaprop; this provides metadata about the parsed XML elements. Although you can redefine the namespace prefix for the metaproperty namespace by using this technique, this namespace is not lost. The prefix mp is still valid for urn:schemas-microsoft-com:xml-metaprop even if xpath_namespaces contains no such declaration.
http://msdn.microsoft.com/en-us/library/ms187367.aspx