SQL and escaped XML data - sql

I have a table with a mix of escaped and non-escaped XML. Of course, the data I need is escaped. For example, I have:
<Root>
<InternalData>
<Node>
<ArrayOfComment>
<Comment&gt
<SequenceNo>1</SequenceNo>
<IsDeleted>false</IsDeleted>
<TakenByCode>397</TakenByCode>
</Comment&gt
</ArrayOfComment>
</Node>
</InternalData>
</Root>
As you can see, the data in the Node tag is all escaped. I can use a query to obtain the Node data, but how can I convert it to XML in SQL so that it can be parsed and broken up? I'm pretty new to using XML in SQL, and I can't seem to find any examples of this.
Thanks

You have not given enough information about your end goal, but this will get you very close. FYI - You had two missing ; both after comment&gt
declare #xml xml
set #xml = '
<Root>
<InternalData>
<Node>
<ArrayOfComment>
<Comment>
<SequenceNo>1</SequenceNo>
<IsDeleted>false</IsDeleted>
<TakenByCode>397</TakenByCode>
</Comment>
</ArrayOfComment>
</Node>
</InternalData>
</Root>
'
select convert(xml, n.c.value('.', 'varchar(max)'))
from #xml.nodes('Root/InternalData/Node/text()') n(c)
Output
<ArrayOfComment>
<Comment>
<SequenceNo>1</SequenceNo>
<IsDeleted>false</IsDeleted>
<TakenByCode>397</TakenByCode>
</Comment>
</ArrayOfComment>
The result is an XML column that you can put into a variable or cross-apply into directly to get data from the XML fragment.

Your best bet might be to look into a HTML Decoding UDF. I did a quick search and found this one:
http://www.andreabertolotto.net/Articles/HTMLDecodeUDF.aspx
You may want to modify it so it only decodes > and <. The one above seems to go above and beyond your needs.
UPDATE
#Cyberkiwi's solution seems to be a bit cleaner. I will leave this up in case the version of SQL Server you are running doesn't support his solution.

Related

Check if XML nodes are empty in SQL

Hi I am new to XML manipulation, my question would be if there is a possibility of detecting if the XML node is an empty node like this: <gen:nodeName />
I am able to manipulate single nodes however I would be interested if there is an approach like a loop or recursive function that could save some time doing manual labor looking trough every single node. I have no idea how to approach this problem though.
Thanks for help.
You did not specify the dialect of SQL ([sql] is not enough, please specify always the RDBMS incl. version).
This is for SQL-Server, but the semantics should be the same.
DECLARE #xml XML=
N'<root>
<SelfClosing />
<NoContent></NoContent>
<BlankContent> </BlankContent>
<HasContent>blah</HasContent>
<HasContent>other</HasContent>
</root>';
SELECT #xml.query(N'/root/*') AS AnyBelowRoor --All elements
,#xml.query(N'/root/*[text()]') AS AnyWithTextNode --blah and other
,#xml.query(N'/root/*[not(text())]') AS NoText --no text
,#xml.query(N'/root/*[text()="blah"]') AS AnyWithTextNode--blah only
The <SelfClosing /> is semantically the same as the <NoContent><NoContent>. There is no difference.
It might be a surprise, but a blank as content is taken as empty too.
So the check for empty or not empty is the check for the existance of a text() node. one can negate this with not() to find all without a text().
Interesting: The result for NoText comes back as this (SQL-Server)
<SelfClosing />
<NoContent />
<BlankContent />
The three elements are implicitly returned in the shortest format.

Field value query with special character and unfiltered search returning unexpected results?

Field value query is giving unexpected results when any special character(#,=,#,$,%,^,*) is passed.
please find the 4 sample docs I have inserted in to ML.
<root>
<journalTitle>Dinesh</journalTitle>
<sourceType>JA</sourceType>
<title>title1</title>
<volume>volume0</volume>
</root>
<root>
<journalTitle>Gayari</journalTitle>
<sourceType>JA</sourceType>
<title>title1</title>
<volume>volume0</volume>
</root>
<root>
<journalTitle>Dixit</journalTitle>
<sourceType>JA</sourceType>
<title>title1</title>
<volume>volume0</volume>
</root>
<root>
<journalTitle>Singla</journalTitle>
<sourceType>JA</sourceType>
<title>title1</title>
<volume>volume0</volume>
</root>
CTS Query :
cts:search(
fn:doc(),
cts:field-value-query("Sample","######*()", ("unwildcarded")),
"unfiltered"
)
On running this query I am getting all the documents.
As per my understanding, it should return an empty sequence.
please find below the field I have created.
Field (in XML format) :
<field xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://marklogic.com/xdmp/database">
<field-name>Sample</field-name>
<field-path>
<path>/root/journalTitle</path>
<weight>1.0</weight>
</field-path>
<word-lexicons/>
<included-elements/>
<excluded-elements/>
<tokenizer-overrides/>
</field>
Index setting:
If I will add any alphabet(s) in the search string it will give me the correct results.
Like:
##$%F
=====S
df===$d
Please help me to resolve this issue?
Try passing "exact" as an option to cts:field-value-query:
cts:search(
fn:doc(),
cts:field-value-query("Sample","######*()", ("exact")),
"unfiltered"
)
MarkLogic has an index for exact values to help in cases like this. Note it's only on when you have both case sensitive and diacritic sensitive indexes enabled (which you do). I know this works for cts:element-value-query so I expect it will for cts:field-value-query as well.
Use instead the 'exact' option in the field-value-query.
This requires the fast diacritic- and case-sensitive options, but you already have those enabled.
You can also try xdmp:plan before and after using 'exact' to see the effect on the query plan.
In the 'tokenizer overrides' option for your field, add these special character(#,=,#,$,%,^,*) as words (select 'word').
These special characters are not considered for matching by default. You need to override the default tokenizer to include them as words.
May I know what output are you expecting on passing this cts:element-word-query(xs:QName("journalTitle"),"=====S") for the above given for xmls.
Changing the one character searches to true in database config, resolves the issue in element-word-query.

SQL SERVER xml with CDATA

I have a table in my database with a column containing xml. The column type is nvarchar(max). The xml is formed in this way
<root>
<child>....</child>
.
.
<special>
<event><![CDATA[text->text]]></event>
<event><![CDATA[text->text]]></event>
...
</special>
</root>
I have not created the db, I cannot change the way information is stored in it but I can retrieve it with a select. For the extraction I use
select cast(replace(xml,'utf-8','utf-16')as xml)
from table
It works well except for cdata, whose content in the query output is: text -> text
Is there a way to retrieve also the CDATA tags?
Well, this is - as far as I know - not possible on normal ways...
The CDATA section has one sole reason: include invalid characters within XML for lazy people...
CDATA is not seen as needed at all and therefore is not really supported by normal XML methods. Or in other words: It is supported in the way, that the content is properly escaped. There is no difference between correctly escaped content and not-escaped content within CDATA actually! (Okay, there are some minor differences like including ]]> within a CDATA-section and some more tiny specialties...)
The big question is: Why?
What are you trying to do with this afterwards?
Try this. the included text is given as is:
DECLARE #xml XML =
'<root>
<special>
<event><![CDATA[text->text]]></event>
<event><![CDATA[text->text]]></event>
</special>
</root>'
SELECT t.c.query('text()')
FROM #xml.nodes('/root/special/event') t(c);
So: Please explain some more details: What do you really want?
If your really need nothing more than the wrapping CDATA you might use this:
SELECT '<![CDATA[' + t.c.value('.','varchar(max)') + ']]>'
FROM #xml.nodes('/root/special/event') t(c);
Update: Same with outdated FROM OPENXML
I just tried how the outdated approach with FROM OPENXML handles this and found, that there is absolutely no indication in the resultset, that the given text was within a CDATA section originally. The "Some value here" is exactly returned in the same way as the text within CDATA:
DECLARE #doc XML =
'<root>
<child>Some value here </child>
<special>
<event><![CDATA[text->text]]></event>
<event><![CDATA[text->text]]></event>
</special>
</root>';
DECLARE #hnd INT;
EXEC sp_xml_preparedocument #hnd OUTPUT, #doc;
SELECT * FROM OPENXML (#hnd, '/root',0);
EXEC sp_xml_removedocument #hnd;
This is how to include cdata on child nodes in XML, using pure SQL. But; it's not ideal.
SELECT 1 AS tag,
null AS parent,
'10001' AS 'Customer!1!Customer_ID!Element',
'AirBallon Captain' AS 'Customer!1!Title!cdata',
'Customer!1' = (
SELECT
2 AS tag,
NULL AS parent,
'Wrapped in cdata, using explicit' AS 'Location!2!Title!cdata'
FOR XML EXPLICIT)
FOR XML EXPLICIT, ROOT('Customers')
CDATA is included, but Child element is encoded using
>
instead of >
Which is so weird from a sensable point of view. I'm sure there are technical explanations, but they are stupid, because there is no difference in the FOR XML specification.
You could include the option type on the inner child node and then loose cdata too..
BUT WHY OH WHY?!?!?!?! would you (Microsoft) remove cdata, when I just added it?
<Customers>
<Customer>
<Customer_ID>10001</Customer_ID>
<Title><![CDATA[AirBallon Captain]]></Title>
<Location>
<Title><![CDATA[wrapped in cdata, using explicit]]></Title>
</Location>
</Customer>
</Customers>

Remove XML attribute based on value

I have an XML variable with only one element in it. I need to check if this element has a particular attribute, and if it does, i need to check if that attribute has a specific value, and if it does, i need to remove that attribute from the XML element.
So lets say I have
DECLARE #Xml XML
SET #XML =
'<person
FirstName="Harvey"
LastName="Saayman"
MobileNumber="Empty"
/>'
The MobileNumber attribute may or may not be there, if it is, and the value is "Empty", i need to change my XML variable to this:
'<person
FirstName="Harvey"
LastName="Saayman"
/>'
I'm a complete SQL XML noob and have no idea how to go about this, any ideas?
Use the modify() DML clause to modify the XML nodes. On this case something like:
SET #XML.modify('delete (/person/#MobileNumber)[1]')
This XML workshop can be helpfull to have a deeper understanding of the DML clauses delete, insert, replace, etc.
SET #XML.modify('delete /person/#MobileNumber[. = "Empty"]')

Import Xml nodes as Xml column with SSIS

I'm trying to use the Xml Source to shred an XML source file however I do not want the entire document shredded into tables. Rather I want to import the xml Nodes into rows of Xml.
a simplified example would be to import the document below into a table called "people" with a column called "person" of type "xml". When looking at the XmlSource --- it seem that it suited to shredding the source xml, into multiple records --- not quite what I'm looking for.
Any suggestions?
<people>
<person>
<name>
<first>Fred</first>
<last>Flintstone</last>
</name>
<address>
<line1>123 Bedrock Way</line>
<city>Drumheller</city>
</address>
</person>
<person>
<!-- more of the same -->
</person>
</people>
I didn't think that SSIS 2005 supported the XML datatype at all. I suppose it "supports" it as DT_NTEXT.
In any case, you can't use the XML Source for this purpose. You would have to write your own. That's not actually as hard as it sounds. Base it on the examples in Books Online. The processing would consist of moving to the first child node, then calling XmlReader.ReadSubTree to return a new XmlReader over just the next <person/> element. Then use your favorite XML API to read the entire <person/>, convert the resulting XML to a string, and pass it along down the pipeline. Repeat for all <person/> nodes.
Could you perhaps change your xml output so that the content of person is seen as a string? Use escape chars for the <>.
You could use a script task to parse it as well, I'd imagine.