Configuring namespace for sp_xml_preparedocument - sql

I have an RSS xml with this format:
<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
<channel>
<title></title>
<link></link>
<description></description>
<language></language>
<lastBuildDate></lastBuildDate>
<generator></generator>
<docs></docs>
<managingEditor></managingEditor>
<webMaster></webMaster>
<ttl></ttl>
<item>
<title></title>
<link></link>
<description></description>
<guid isPermaLink="false"></guid>
<pubDate></pubDate>
<author></author>
<dc:date></dc:date>
<dc:publisher></dc:publisher>
<dc:language></dc:language>
</item>
<item>
<title></title>
<link></link>
<description></description>
<guid isPermaLink="false"></guid>
<pubDate></pubDate>
<author></author>
<dc:date></dc:date>
<dc:publisher></dc:publisher>
<dc:language></dc:language>
</item>
</channel>
</rss>
And I want to parse it with sp_xml_preparedocument in SQLServer.
My problem is the "namespce" field. There are three tags in each item which has namespace, and I don't know how to specify them.
I have tried this:
EXEC sp_xml_preparedocument #hDoc OUTPUT, #xmlContent,'<item xmlns:dc="http://purl.org/dc/elements/1.1/"/>'
but it just parse the first item and forget the rest!
Any idea?

The fact that you are only getting one row has nothing to do with the namespace. You have some error in your openxml query against #hDoc.
There might be reasons for you to still use openxml but until you show the query that is not working for you I will suggest you use the XML data type instead.
with xmlnamespaces('http://purl.org/dc/elements/1.1/' as dc)
select C.N.value('(title/text())[1]', 'nvarchar(100)') as channel_title,
I.N.value('(title/text())[1]', 'nvarchar(100)') as item_title,
I.N.value('(dc:publisher/text())[1]', 'nvarchar(100)') as publisher
from #XML.nodes('/rss/channel') as C(N)
cross apply C.N.nodes('item') as I(N);
SQL Fiddle

The namespace needs to be defined as a character type:
EXEC sp_xml_preparedocument #hDoc OUTPUT, #xmlContent,'<item xmlns:dc="http://purl.org/dc/elements/1.1/"/>'
[ xpath_namespaces ]
Specifies the namespace declarations that are used in row and column XPath expressions in OPENXML. xpath_namespaces is a text parameter: char, nchar, varchar, nvarchar, text, ntext or xml.
The default value is . xpath_namespaces provides the namespace URIs for the prefixes used in the XPath expressions in OPENXML by means of a well-formed XML document. xpath_namespaces declares the prefix that must be used to refer to the namespace urn:schemas-microsoft-com:xml-metaprop; this provides metadata about the parsed XML elements. Although you can redefine the namespace prefix for the metaproperty namespace by using this technique, this namespace is not lost. The prefix mp is still valid for urn:schemas-microsoft-com:xml-metaprop even if xpath_namespaces contains no such declaration.
http://msdn.microsoft.com/en-us/library/ms187367.aspx

Related

Retrieving All instances of an 3rd level XML field from an XML column

I have an XML data field in one of my tables that essentially looks like this:
<App xmlns='http://Namespace1'>
<Package xmlns='http://Namespace2'>
<Item>
<ItemDetails xmlns='http://Namespace3'>
<ItemName>ItemNameValue</ItemName>
</ItemDetails>
other_item_stuff
</Item>
<Item>
<ItemDetails>
<ItemName>ItemNameValue</ItemName>
</ItemDetails>
</Item>
...
</Package>
</App>
I need to get all of the ItemNameValues from the XML.
I have tried to adapt many examples found on the web to my purpose, but have failed miserably. The best I seem to be able to do is get one ItemName per Package.
I think that CROSS APPLY is where I need to go, but the syntax to retrieve all the itemdetail.itemname eludes me.
This is my latest failure (returns nothing):
WITH XMLNAMESPACES(
'http://Namespace1' AS xsd,
'http://www.w3.org/2001/XMLSchema-instance' AS xsi,
'http://Namespace2' AS ns1,
'http://Namespace3' AS ns2)
Items.d.value('(ns2:ItemDetails/ItemName/text())[1]','varchar(200)') as
ItemName
FROM MyTable
CROSS APPLY XMLDataColumn.nodes('/xsd:App/ns1:Package/ns1:Item') Items(d)
I hope to get several records from each XML field, but can only ever get the first element.
The biggest problem in this issue is the XML itself:
<App xmlns="http://Namespace1">
<Package xmlns="http://Namespace2">
<Item>
<ItemDetails xmlns="http://Namespace3">
<ItemName>ItemNameValue</ItemName>
</ItemDetails>
other_item_stuff
</Item>
<Item>
<ItemDetails>
<ItemName>ItemNameValue</ItemName>
</ItemDetails>
</Item>
...
</Package>
</App>
Two major Problems:
The namespaces are all declared as default namespaces (they do not include a prefix). All nodes within a node share the same default namespace, if there is nothing else stated explicitly.
The first <ItemDetails> is living within namespace http://Namespace3, while the second <ItemDetails> is living within namespace http://Namespace2 (inherited from <Package>)
That means: If you can - by any chance - change the construction of the XML, try to do this first.
If you have to deal with this, you can try this clean, but clumsy approach.
WITH XMLNAMESPACES(
'http://Namespace1' AS ns1,
'http://www.w3.org/2001/XMLSchema-instance' AS xsi,
'http://Namespace2' AS ns2,
'http://Namespace3' AS ns3)
SELECT COALESCE(Items.d.value('(ns2:ItemDetails/ns2:ItemName/text())[1]','varchar(200)')
,Items.d.value('(ns3:ItemDetails/ns3:ItemName/text())[1]','varchar(200)')) AS ItemName
FROM #xml.nodes('/ns1:App/ns2:Package/ns2:Item') Items(d);
Another approach is to use a namespace wildcard, but be aware of ambigous names...
SELECT Items.d.value('(*:ItemDetails/*:ItemName/text())[1]','varchar(200)') AS ItemName
FROM #xml.nodes('/*:App/*:Package/*:Item') Items(d)

U-SQL with XmlExtractor - elements inside elements

In U-SQL I am trying to get a list of elements inside elements, using the XmlExtractor. But I cannot get the nested collection.
It is a list of items, which has locations. With the XmlExtractor I can get a collection of elements, but I don't see how I can get a collection that contains a collection. An XML sample is shown below.
Any ideas?
<root>
<Item>
<Header>
<id>111</id>
</Header>
<Body>
<Locations>
<Location>
<Station>k4</Station>
<Timestamp>2017-08-30T02:04:18.2506945+02:00</Timestamp>
</Location>
<Location>
<Station>k5</Station>
<Timestamp>2017-08-30T02:04:18.2506945+02:00</Timestamp>
</Location>
</Locations>
</Body>
</Item>
<Item>
<Header>
<id>222</id>
</Header>
<Body>
<Locations>
<Location>
<Station>k4</Station>
<Timestamp>2017-08-30T02:12:36.1218601+02:00</Timestamp>
</Location>
<Location>
<Station>k5</Station>
<Timestamp>2017-08-30T02:12:36.1218601+02:00</Timestamp>
</Location>
</Locations>
</Body>
</Item>
</root>
Solved by making an extractor that takes the XML in one string, and then calls a method using xpath, returning an SQL.Array, where the string has comma separated values of of the result. The result looks like this:
111;k4,2017-08-30T02:04:18.2506945+02:00
111;k5,2017-08-30T02:04:18.2506945+02:00
222;k4,2017-08-30T02:12:36.1218601+02:00
222;k5,2017-08-30T02:12:36.1218601+02:00
The standard XmlExtractor cannot do this, and I also decided that it is better to postpone the parsing of the xml to after it has been extracted, because there can be multiple steps on the same xml.
Azure SQL Database has powerful abilities to shred XML. Maybe if this is already in your estate/architecture it might make a simple alternative to custom code? A simple example:
DECLARE #xml XML = '<root>
<Item>
<Header>
<id>111</id>
</Header>
<Body>
<Locations>
<Location>
<Station>k4</Station>
<Timestamp>2017-08-30T02:04:18.2506945+02:00</Timestamp>
</Location>
<Location>
<Station>k5</Station>
<Timestamp>2017-08-30T02:04:18.2506945+02:00</Timestamp>
</Location>
</Locations>
</Body>
</Item>
<Item>
<Header>
<id>222</id>
</Header>
<Body>
<Locations>
<Location>
<Station>k4</Station>
<Timestamp>2017-08-30T02:12:36.1218601+02:00</Timestamp>
</Location>
<Location>
<Station>k5</Station>
<Timestamp>2017-08-30T02:12:36.1218601+02:00</Timestamp>
</Location>
</Locations>
</Body>
</Item>
</root>'
/*
111;k4,2017-08-30T02:04:18.2506945+02:00
111;k5,2017-08-30T02:04:18.2506945+02:00
222;k4,2017-08-30T02:12:36.1218601+02:00
222;k5,2017-08-30T02:12:36.1218601+02:00
*/
SELECT
r.c.value('(Header/id/text())[1]', 'int' ) id,
b.c.value('(Station/text())[1]', 'varchar(10)' ) station,
b.c.value('(Timestamp/text())[1]', 'varchar(40)' ) [timestamp],
b.c.value('(Timestamp/text())[1]', 'datetimeoffset' ) [timestamp2]
FROM #xml.nodes('root/Item') r(c)
CROSS APPLY r.c.nodes('Body/Locations/Location') b(c)
You can do something similar if the XML is stored in a table also.
My results:
Here is a script that achieves the desired results using the extractors provided.
USE master;
REFERENCE SYSTEM ASSEMBLY [System.Xml]
REFERENCE ASSEMBLY master.[Microsoft.Analytics.Samples.Formats.Xml]
#e = EXTRACT a string, b string
FROM "CollectTest.xml"
USING new Microsoft.Analytics.Samples.Formats.Xml.XmlDomExtractor(rowPath:"Item",
columnPaths:new SQL.MAP<string, string> { {"Header", "a"}, {"Body", "b"} });
#f = SELECT #e.a, t.c, t.d
FROM #e
CROSS APPLY new Microsoft.Analytics.Samples.Formats.Xml.XmlApplier("b","Location", new SQL.MAP<string,string> { {"Station", "c"}, {"Timestamp", "d"} }) AS t(c string, d string);
OUTPUT #f TO "foo.txt" USING Outputters.Tsv(outputHeader:true);
OUTPUT #e TO "foo2.txt" USING Outputters.Tsv(outputHeader:true);
The first rowset #e uses the XmlDomExtractor to create a row set containing "ID" in col a and the child XML code in col b.
The second rowset #f then uses XmlApplier to extract the values from the nested xml code and cross apply it to the correct rows. The sample xml was copied from the post above and saved in the USQLDataRoot folder as "CollectTest.xml."
Note: Got lazy and the output for Header contains some unwanted node syntax but adding an intermediate xpath or XmlApplier step between #e and #f should solve this.

SQL SERVER xml with CDATA

I have a table in my database with a column containing xml. The column type is nvarchar(max). The xml is formed in this way
<root>
<child>....</child>
.
.
<special>
<event><![CDATA[text->text]]></event>
<event><![CDATA[text->text]]></event>
...
</special>
</root>
I have not created the db, I cannot change the way information is stored in it but I can retrieve it with a select. For the extraction I use
select cast(replace(xml,'utf-8','utf-16')as xml)
from table
It works well except for cdata, whose content in the query output is: text -> text
Is there a way to retrieve also the CDATA tags?
Well, this is - as far as I know - not possible on normal ways...
The CDATA section has one sole reason: include invalid characters within XML for lazy people...
CDATA is not seen as needed at all and therefore is not really supported by normal XML methods. Or in other words: It is supported in the way, that the content is properly escaped. There is no difference between correctly escaped content and not-escaped content within CDATA actually! (Okay, there are some minor differences like including ]]> within a CDATA-section and some more tiny specialties...)
The big question is: Why?
What are you trying to do with this afterwards?
Try this. the included text is given as is:
DECLARE #xml XML =
'<root>
<special>
<event><![CDATA[text->text]]></event>
<event><![CDATA[text->text]]></event>
</special>
</root>'
SELECT t.c.query('text()')
FROM #xml.nodes('/root/special/event') t(c);
So: Please explain some more details: What do you really want?
If your really need nothing more than the wrapping CDATA you might use this:
SELECT '<![CDATA[' + t.c.value('.','varchar(max)') + ']]>'
FROM #xml.nodes('/root/special/event') t(c);
Update: Same with outdated FROM OPENXML
I just tried how the outdated approach with FROM OPENXML handles this and found, that there is absolutely no indication in the resultset, that the given text was within a CDATA section originally. The "Some value here" is exactly returned in the same way as the text within CDATA:
DECLARE #doc XML =
'<root>
<child>Some value here </child>
<special>
<event><![CDATA[text->text]]></event>
<event><![CDATA[text->text]]></event>
</special>
</root>';
DECLARE #hnd INT;
EXEC sp_xml_preparedocument #hnd OUTPUT, #doc;
SELECT * FROM OPENXML (#hnd, '/root',0);
EXEC sp_xml_removedocument #hnd;
This is how to include cdata on child nodes in XML, using pure SQL. But; it's not ideal.
SELECT 1 AS tag,
null AS parent,
'10001' AS 'Customer!1!Customer_ID!Element',
'AirBallon Captain' AS 'Customer!1!Title!cdata',
'Customer!1' = (
SELECT
2 AS tag,
NULL AS parent,
'Wrapped in cdata, using explicit' AS 'Location!2!Title!cdata'
FOR XML EXPLICIT)
FOR XML EXPLICIT, ROOT('Customers')
CDATA is included, but Child element is encoded using
>
instead of >
Which is so weird from a sensable point of view. I'm sure there are technical explanations, but they are stupid, because there is no difference in the FOR XML specification.
You could include the option type on the inner child node and then loose cdata too..
BUT WHY OH WHY?!?!?!?! would you (Microsoft) remove cdata, when I just added it?
<Customers>
<Customer>
<Customer_ID>10001</Customer_ID>
<Title><![CDATA[AirBallon Captain]]></Title>
<Location>
<Title><![CDATA[wrapped in cdata, using explicit]]></Title>
</Location>
</Customer>
</Customers>

SQL Server Grabbing Value from XML parameter to use in later query

I am really new to SQL Server and stored procedures to begin with. I need to be able to parse an incoming XML file for a specific element's value and compare/save it later in the procedure.
I have a few things stacked against me. One the Element I need is buried deeply inside the document. I have had no luck in searching for it by name using methods similar to this:
select CurrentBOD = c.value('(local-name(.))[1]', 'VARCHAR(MAX)'),
c.value('(.)[1]', 'VARCHAR(MAX)') from #xml.nodes('PutMessage/payload/content/AcknowledgePartsOrder/ApplicationArea/BODId') as BODtable(c)
It always returns null.
So, I am trying something similar to this:
declare #BODtable TABLE(FieldName VARCHAR(MAX),
FieldValue VARCHAR(MAX))
SELECT
FieldName = nodes.value('local-name(.)', 'varchar(50)'),
FieldValue = nodes.value('(.)[1]', 'varchar(50)')
FROM
#xml.nodes('//*') AS BODtable(nodes)
declare #CurrentBOD VARCHAR(36)
set #CurrentBOD = ''
SET #CurrentBOD = (SELECT FieldValue from #BODtable WHERE FieldName = 'BODId')
This provides me the list of node names and values correctly (I test this in a query and BODtable has all elements listed with the correct values), but when I set #CurrentBOD it comes up null.
Am I missing an easier way to do this? Am I messing these two approaches up somehow?
Here is a part of the xml I am parsing for reference:
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:wsa="http://schemas.xmlsoap.org/ws/2004/03/addressing" xmlns:wsse="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity- secext-1.0.xsd" xmlns:wsu="http://docs.oasis-open.org/wss/2004/01/oasis-200401- wss-wssecurity-utility-1.0.xsd">
<soap:Header>
<payloadManifest xmlns="???">
<c contentID="Content0" namespaceURI="???" element="AcknowledgePartsOrder" version="4.0" />
</payloadManifest>
<wsa:Action>http://www.starstandards.org/webservices/2005/10/transport/operations/PutMessage</wsa:Action>
<wsa:MessageID>uuid:df8c66af-f364-4b8f-81d8-06150da14428</wsa:MessageID>
<wsa:ReplyTo>
<wsa:Address>http://schemas.xmlsoap.org/ws/2004/03/addressing/role/anonymous</wsa:Address>
</wsa:ReplyTo>
<wsa:To>???</wsa:To>
<wsse:Security soap:mustUnderstand="1">
<wsu:Timestamp wsu:Id="Timestamp-bd91e76f-c212-4555-9b23-f66f839672bd">
<wsu:Created>2013-01-03T21:52:48Z</wsu:Created>
<wsu:Expires>2013-01-03T21:53:48Z</wsu:Expires>
</wsu:Timestamp>
<wsse:UsernameToken xmlns:wsu="???" wsu:Id="???">
<wsse:Username>???</wsse:Username>
<wsse:Password Type="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-username-token-profile-1.0#PasswordText">???</wsse:Password>
<wsse:Nonce>???</wsse:Nonce>
<wsu:Created>2013-01-03T21:52:48Z</wsu:Created>
</wsse:UsernameToken>
</wsse:Security>
</soap:Header>
<soap:Body>
<PutMessage xmlns="??????">
<payload>
<content id="???">
<AcknowledgePartsOrder xmlns="???" xmlns:xsi="???" xsi:schemaLocation="??? ???" revision="???" release="???" environment="???n" lang="en-US" bodVersion="???">
<ApplicationArea>
<Sender>
<Component>???</Component>
<Task>???</Task>
<ReferenceId>???</ReferenceId>
<CreatorNameCode>???</CreatorNameCode>
<SenderNameCode>???</SenderNameCode>
<DealerNumber>???</DealerNumber>
<PartyId>???</PartyId>
<LocationId />
<ServiceId />
</Sender>
<CreationDateTime>2013-01-03T21:52:47</CreationDateTime>
<BODId>71498800-c098-4885-9ddc-f58aae0e5e1a</BODId>
<Destination>
<DestinationNameCode>???</DestinationNameCode>
You need to respect the XML namespaces!
First of all, your target XML node <BODId> is inside the <soap:Envelope> and <soap:Body> tags - both need to be included in your selection.
Secondly, both the <PutMessage> as well as the <AcknowledgePartsOrder> nodes appear to have default XML namespaces (those xmlns=.... without a prefix) - and those must be respected when you select your data using XPath.
So assuming that <PutMessage xmlns="urn:pm"> and <AcknowledgePartsOrder xmlns="urn:apo"> (those are just guesses on my part - replace with the actual XML namespaces that you haven't shown use here), you should be able to use this XPath to get what you're looking for:
;WITH XMLNAMESPACES('http://schemas.xmlsoap.org/soap/envelope/' AS soap,
'urn:pm' AS ns, 'urn:apo' AS apo)
SELECT
XC.value('(apo:BODId)[1]', 'varchar(100)')
FROM
#YourXmlVariable.nodes('/soap:Envelope/soap:Body/ns:PutMessage/ns:payload/ns:content/apo:AcknowledgePartsOrder/apo:ApplicationArea') AS XT(XC)
This does return the expected value (71498800-c098-4885-9ddc-f58aae0e5e1a) in my case.

OPENXML, Convert Base64 to Binary

For an import/export process, we put binary data into XML as Base64 encoded strings. The issue came up when getting the values back out...
We're using OPENXML because performance on 2005/2008 is horrid using nodes() - it doesn't scale well at all. They fixed the performance issue in SQL Server 2012, but for the sake of legacy support (2005+) it's not a realistic option, and MS doesn't appear to want to backport things (assuming even possible).
Here's some relevant info on the subject.
Ideally, I'm looking for a single SQL statement using OPENXML to shred an XML document that contains binary data encoded to Base64, and provide a result set there that data is correctly rendered as binary data. I have one solution that doesn't use nodes, hoping someone has something better.
You can specify your column as XML and use .value to get the data as varbinary.
Something like this.
declare #XML xml
set #XML =
'<root>
<item>
<ID>1</ID>
<Col>Um93MQ==</Col>
</item>
<item>
<ID>2</ID>
<Col>Um93Mg==</Col>
</item>
</root>'
declare #idoc int
exec sp_xml_preparedocument #idoc out, #xml
select T.ID,
T.Col.value('.', 'varbinary(max)')
from openxml(#idoc, '/root/item', 2)
with (ID int,
Col xml) as T
exec sp_xml_removedocument #idoc
Or you could make use of CLR (if that is an option for you) something like this:
using System;
using System.Text;
public class CLRTest
{
public static byte[] ConvertBase64ToBinary(string str)
{
if (str == null)
{
return null;
}
return Convert.FromBase64String(str);
}
}