TSQL - remove illegal character from malformed XML string - sql

Inspired by convert-string-to-xml-illegal-characters
I wonder if there is a way in pure T-SQL to convert a malformed XML string to a well-formed version.
I have an NVARCHAR value like:
DECLARE @string NVARCHAR(MAX) =
N'<root>
<stuff attrib="Ooop,bad character<">
<test>Here I get &, and "<" or ">>>>" </test>
<test2>And even more <<<>><><<<><> </test2>
</stuff>
</root>';
SELECT CONVERT(XML, @string);
Of course this will fail because & should be replaced by &amp;, which is easy.
But how to replace < and > when they are in element text or attribute without knowing structure in advance?

There is no magic method for turning an arbitrary string into valid XML. You have to build your XML string in a way that guarantees it is syntactically correct. Even your simple method of replacing every & with &amp; does not work in all cases. Consider this XML string, in which the ampersand is already escaped:
<root>
<stuff>
<test>Here I get &amp;</test>
</stuff>
</root>
The simple replacement will result in:
<root>
<stuff>
<test>Here I get &amp;amp;</test>
</stuff>
</root>
Unless you want to write a lot of code to parse strings into XML, you should do one of the following:
Use the XML methods to build your XML.
Use other standard methods such as the FOR XML clause in SELECT statements.
As you build the string, make sure every variable part (tags, attributes, or data) conforms to the XML rules for what that part represents. For example: wrap data variables in <![CDATA[ ]]>, or replace invalid characters in variable tags and attributes.
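As a minimal sketch of the FOR XML option, the query below rebuilds the document from the question out of plain values instead of a hand-concatenated string. The variable names @attrib and @text are made up for illustration. Because SQL Server serializes the values itself, the special characters come out entity-escaped and the result is well-formed.

```sql
-- Hypothetical variables holding the "dirty" values from the question.
DECLARE @attrib NVARCHAR(100) = N'Ooop,bad character<';
DECLARE @text   NVARCHAR(100) = N'Here I get &, and "<" or ">>>>" ';

-- Column aliases act as paths: "stuff/@attrib" becomes an attribute of <stuff>,
-- "stuff/test" becomes a child element. FOR XML escapes &, < and > in the values.
SELECT
    @attrib AS [stuff/@attrib],
    @text   AS [stuff/test]
FOR XML PATH(''), ROOT('root'), TYPE;
```

This only helps when you control how the XML is produced. It does not repair a string that is already malformed.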

Related

tsql for xml string concat - How it works

I have the following code which works well:
STUFF( (
select
char(13)+'Item '+i.item+' : '--+char(13) +i.item_descr
from #itemlines i
where i.customer=main.customer
FOR XML PATH(''), TYPE
).value('.','varchar(max)')
,1,1, '')
What is the .value() thing? Something like a...select method? What does it do? Any reference links will be appreciated too!
With the TYPE directive, FOR XML returns a value of the XML data type; .value(...,...) pulls a value out of that XML and converts it to the data type you specify. In your case, everything in the root node ('.') is converted to varchar(max).
For some blogs/links, look at Aaron Bertrand's post or Adam Machanic's. Also watch out for STRING_AGG, a new function in SQL Server 2017.
You use FOR XML PATH to convert your table into XML. XML data is represented by the xml data type in SQL Server, and that data type exposes several methods. One of them is the value method, which takes two arguments: an XQuery expression and a SQL data type. The method lets you convert the data in the XML into some other type (varchar in your case).
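The difference .value() makes is easiest to see with data that contains a character XML has to escape. The sketch below uses a made-up table variable @items; the STRING_AGG query assumes SQL Server 2017 or later.

```sql
-- Hypothetical sample data; one value contains an ampersand.
DECLARE @items TABLE (item VARCHAR(20));
INSERT INTO @items (item) VALUES ('A&B'), ('C');

-- Without TYPE and .value(): the result is the serialized XML text,
-- so the ampersand stays entity-escaped (A&amp;B).
SELECT STUFF((SELECT ',' + item FROM @items FOR XML PATH('')), 1, 1, '') AS Escaped;

-- With TYPE and .value(): the XML is read back as a plain string,
-- so the ampersand is returned as the original character (A&B).
SELECT STUFF((SELECT ',' + item FROM @items FOR XML PATH(''), TYPE)
             .value('.', 'varchar(max)'), 1, 1, '') AS Unescaped;

-- SQL Server 2017+: STRING_AGG concatenates directly, with no XML round trip.
SELECT STRING_AGG(item, ',') AS Aggregated FROM @items;
```

In all three queries the order of the concatenated items is not guaranteed unless you add an explicit ordering.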

Why are carriage returns removed when modifying XML attributes in SQL Server?

In SQL Server 2014, I try to add an XML element with an attribute (that contains a carriage return) using the 'modify' method on the XML datatype.
The carriage return gets removed - why is that?
Example:
declare @xmldata xml
select
@xmldata = '<root><child myattr="carriage returns
are not a problem"></child></root>'
set
@xmldata.modify('insert <child>modifying text with carriage returns works
ok</child> after (//child)[1]')
set
@xmldata.modify('insert <child myattr="but not
attribute values... why is that?"></child> after (//child)[2]')
select @xmldata
Result:
<root>
<child myattr="carriage returns
are not a problem" />
<child>modifying text with carriage returns works
ok</child>
<child myattr="but not attribute values... why is that?" />
</root>
White space characters can be normalized by parsers.
cf http://www.w3.org/TR/1998/REC-xml-19980210#AVNormalize
While your XML is valid, how exactly white space is rendered is implementation-dependent. As you can see, the CR/LF was replaced with a single space.
Please note:
In general, XML treats content and structural/meta data differently.
Attribute values are considered structure, and data between tags is considered content.
XML was never designed with the expectation that attributes would be displayed on end-user devices. I would suggest you put this end-user content in its own element instead.
Section 3.3.3, Attribute-Value Normalization
Before the value of an attribute is passed to the application or checked for validity, the XML processor MUST normalize the attribute value by applying the algorithm below, or by using some other method such that the value passed to the application is the same as that produced by the algorithm.
1. All line breaks MUST have been normalized on input to #xA as described in 2.11 End-of-Line Handling, so the rest of this algorithm operates on text normalized in this way.
2. Begin with a normalized value consisting of the empty string.
3. For each character, entity reference, or character reference in the unnormalized attribute value, beginning with the first and continuing to the last, do the following:
   - For a character reference, append the referenced character to the normalized value.
   - For an entity reference, recursively apply step 3 of this algorithm to the replacement text of the entity.
   - For a white space character (#x20, #xD, #xA, #x9), append a space character (#x20) to the normalized value.
   - For another character, append the character to the normalized value.
If the attribute type is not CDATA, then the XML processor MUST further process the normalized attribute value by discarding any leading and trailing space (#x20) characters, and by replacing sequences of space (#x20) characters by a single space (#x20) character.
The XML specification demands that your CR/LF in an attribute is converted to a single space.
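To make the white space rule concrete, here is a rough sketch that applies it to a plain string with REPLACE. It only imitates the end-of-line handling and the white space step of the quoted algorithm. It is not how SQL Server implements normalization internally, and the variable name @raw is made up for illustration.

```sql
-- Hypothetical attribute value containing a CR/LF pair.
DECLARE @raw NVARCHAR(100) =
    N'but not' + NCHAR(13) + NCHAR(10) + N'attribute values... why is that?';

SELECT REPLACE(REPLACE(REPLACE(REPLACE(
           @raw,
           NCHAR(13) + NCHAR(10), NCHAR(10)),  -- end-of-line handling: CR LF becomes LF
           NCHAR(13), NCHAR(10)),              -- end-of-line handling: lone CR becomes LF
           NCHAR(10), N' '),                   -- step 3: LF becomes a space
           NCHAR(9),  N' ')                    -- step 3: tab becomes a space
       AS NormalizedValue;
-- The CR/LF pair ends up as one space: "but not attribute values... why is that?"
```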

SQL Server 2008R2 and creating XML document

First post on forum since I am really stuck on this one.
The following query correctly assigns a valid XML document to the @xTempXML variable (of type xml). Note: The length of the document (converted to varchar(max)) is 711.
select @xTempXML = (
select
PrescriberFirstName as "row/prescriber/name/first",
PrescriberLastName as "row/prescriber/name/last",
PrescriberAddress1 as "row/prescriber/address/line1",
PrescriberAddress2 as "row/prescriber/address/line2",
PrescriberCity as "row/prescriber/address/city",
PrescriberState as "row/prescriber/address/state",
PrescriberZipCode as "row/prescriber/address/zipcode",
PatientFirstName as "row/patient/name/first",
PatientLastName as "row/patient/name/last",
PatientMiddleName as "row/patient/name/middle",
PatientAddress1 as "row/patient/address/line1",
PatientAddress2 as "row/patient/address/line2",
PatientCity as "row/patient/address/city",
PatientState as "row/patient/address/state",
PatientZipCode as "row/patient/address/zipcode",
PatientFileID as "row/patient/fileid",
PatientSSN as "row/patient/ssn",
PatientDOB as "row/patient/dob",
DrugDescription as "row/medicationprescribed/description",
DrugStrength as "row/medicationprescribed/strength",
DrugDEASchedule as "row/medicationprescribed/deaschedule",
DrugQty as "row/medicationprescribed/qty",
DrugDirections as "row/medicationprescribed/directions",
DrugFormCode as "row/medicationprescribed/form",
DrugDateWritten as "row/medicationprescribed/writtendate",
DrugEffectiveDate as "row/medicationprescribed/effectivedate",
DrugRefillQty as "row/medicationprescribed/refill/qty",
DrugRefillQtyQualifier as "row/medicationprescribed/refill/qualifier",
DrugNote as "row/medicationprescribed/note",
PharmacyStoreName as "row/pharmacy/storename",
PharmacyIdentification as "row/pharmacy/identification",
PharmacyAddress1 as "row/pharmacy/address/line1",
PharmacyAddress2 as "row/pharmacy/address/line2",
PharmacyCity as "row/pharmacy/address/city",
pharmacyState as "row/pharmacy/address/state",
pharmacyZipCode as "row/pharmacy/address/zipcode"
from
Rxarchive
where ArchiveUUID=@ArchiveRefUUID
and CreatedDT between @RptParamStartDT and @RptParamStopDT
and CHARINDEX(',' + PrescriberFID + ',', ',' + @RptParamFID + ',') > 0
FOR XML PATH(''), ROOT('result'), TYPE
)
declare @sXMLVersion varchar(max) = '<?xml version="1.0" encoding="utf-8"?>'
select len(@sXMLVersion + convert(varchar(max),@xTempXML))
Note: The length of the concatenated strings = 749, which is correct.
set @xFinalXML = convert(xml,(@sXMLVersion + CAST(@xTempXML as varchar(max))))
select LEN(convert(varchar(max),@xFinalXML))
Note: The length of this variable is back to 711!
select @xFinalXML
The variable is still a valid XML document, just no version info
What am I doing wrong?
Any and all help greatly appreciated!
You missed a step in your testing. Try this:
SELECT CONVERT(XML, '<?xml version="1.0" encoding="utf-8"?>')
It will return an empty cell.
Based on what you are doing (i.e. converting to VARCHAR in the end), there is no reason to start with the XML datatype. You might as well remove the , TYPE from the FOR XML clause and then just concatenate @sXMLVersion + @xTempXML, with @xTempXML declared as a string type instead of xml.
The reason this is happening is noted here: Limitations of the xml Data Type
The XML declaration PI, for example, <?xml version='1.0'?>, is not preserved when storing XML data in an xml data type instance. This is by design. The XML declaration (<?xml ... ?>) and its attributes (version/encoding/stand-alone) are lost after data is converted to type xml. The XML declaration is treated as a directive to the XML parser. The XML data is stored internally as ucs-2. All other PIs in the XML instance are preserved.
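A small repro of that behavior, as a sketch with made-up variable names (@decl, @doc, @asXml): the declaration is present in the string, disappears once the value passes through the xml type, and can be added back only on the string side.

```sql
-- Hypothetical stand-ins for the declaration and the generated document.
DECLARE @decl VARCHAR(MAX) = '<?xml version="1.0" encoding="utf-8"?>';
DECLARE @doc  VARCHAR(MAX) = '<result><row>1</row></result>';

-- Round trip through the xml type: the declaration is parsed and then dropped.
DECLARE @asXml XML = CONVERT(XML, @decl + @doc);

SELECT LEN(@decl + @doc)                    AS LengthAsString,   -- includes the declaration
       LEN(CONVERT(VARCHAR(MAX), @asXml))   AS LengthAfterXml;   -- declaration is gone

-- Keep the final document as a string if the declaration must survive.
SELECT @decl + CONVERT(VARCHAR(MAX), @asXml) AS FinalDocument;
```

The design choice is simply to treat the declaration as part of the output text, not as part of the stored XML value.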
How to properly handle extracting data from an XML field / variable is noted here: XML Best Practices (under "Text Encoding")
SQL Server 2005 stores XML data in Unicode (UTF-16). XML data retrieved from the server comes out in UTF-16 encoding. If you want a different encoding, you have to perform the required conversion on the retrieved data. Sometimes, the XML data may be in a different encoding. If it is, you have to use care during data loading. For example:
If your text XML is in Unicode (UCS-2, UTF-16), you can assign it to an XML column, variable, or parameter without any problems.
If the encoding is not Unicode and is implicit, because of the source code page, the string code page in the database should be the same as or compatible with the code points that you want to load. If required, use COLLATE. If no such server code page exists, you have to add an explicit XML declaration with the correct encoding.
To use an explicit encoding, use either the varbinary() type, which has no interaction with code pages, or use a string type of the appropriate code page. Then, assign the data to an XML column, variable, or parameter.
Example: Explicitly Specifying an Encoding
Assume that you have an XML document, vcdoc, stored as varchar(max) that does not have an explicit XML declaration. The following statement adds an XML declaration with the encoding "iso8859-1", concatenates the XML document, casts the result to varbinary(max) so that the byte representation is preserved, and then finally casts it to XML. This enables the XML processor to parse the data according to the specified encoding "iso8859-1" and generate the corresponding UTF-16 representation for string values.
SELECT CAST(
CAST (('<?xml version="1.0" encoding="iso8859-1"?>'+ vcdoc) AS VARBINARY (MAX))
AS XML)
The following S.O. questions are related:
SQL Server 2008 - Add XML Declaration to XML Output
How to add xml encoding <?xml version="1.0" encoding="UTF-8"?> to xml Output in SQL Server

Extract and decode B64 encoded data from XML using Xquery

I have a lot of XML documents with PDF data embedded in them, encoded as Base64.
<Document>
<component>
<nonXMLBody>
<text mediaType="application/pdf" representation="B64">JVBERi0xLjMNJf////
</text>
</nonXMLBody>
</component>
</Document>
I have two problems:
1) figuring out the right syntax to get all of the data out. I keep getting truncated versions. I've tried varchar(max) and varbinary.
SELECT x.value('(component/nonXMLBody/text/text())[1]','varchar(max)')
FROM @XML.nodes('/Document') as Addr (x)
2) How to decode the B64 data.
I found a post that seems like it is close to what I need but I'm still stuck.
Base64 encoding in SQL Server 2005 T-SQL
Thanks in advance for any help
As far as the XQuery part is concerned, you can (should?) cast the Base64 inside the document to an xs:base64Binary with
xs:base64Binary((component/nonXMLBody/text)[1])
assuming the document is bound to the context item, as your query suggests. Without the cast, the contents of the tag will be untyped, or a string at best.
Once you have this xs:base64Binary atomic item, you still need to map it to a SQL Server type. I am not familiar with that binding, but judging by the documentation, varbinary(max) could do the trick. Something like:
SELECT x.value('xs:base64Binary((component/nonXMLBody/text)[1])','varbinary(max)')
FROM @XML.nodes('/Document') as Addr (x)
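For the decoding step on its own, a commonly used T-SQL idiom is to run xs:base64Binary over a string variable through an empty XML instance. The sketch below assumes the Base64 text has already been extracted into a variable, here called @b64 (a made-up name), and uses a tiny sample value instead of real PDF data.

```sql
-- Hypothetical input: the Base64 text of the word "Hello".
DECLARE @b64 VARCHAR(MAX) = 'SGVsbG8=';

-- sql:variable() exposes the T-SQL variable to XQuery; xs:base64Binary decodes it,
-- and .value() maps the result to varbinary(max).
DECLARE @bytes VARBINARY(MAX) =
    CAST(N'' AS XML).value('xs:base64Binary(sql:variable("@b64"))', 'VARBINARY(MAX)');

SELECT @bytes                        AS DecodedBytes,   -- 0x48656C6C6F
       CONVERT(VARCHAR(MAX), @bytes) AS DecodedText,    -- Hello
       DATALENGTH(@bytes)            AS ByteCount;      -- 5
```

Checking DATALENGTH is one way to tell whether the value itself is short or only its display in the client tool is cut off.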

How to get SQL query to not escape HTML data returned in query

I have the following SQL query....
select AanID as '@name', '<![CDATA[' + Answer + ']]>' from AuditAnswers for XML PATH('str'), ROOT('root')
which works wonderfully but the column 'Answer' can sometimes have HTML markup in it. The query automatically escapes this HTML from the 'Answer' column in the generated XML. I don't want that. I will be wrapping this resulting column in CDATA so the escaping is not necessary.
I want the result to be this...
<str name="2"><![CDATA[<DIV><DIV Style="width:55%;float:left;">Indsfgsdfg]]></str>
instead of this...
<str name="2">&lt;![CDATA[&lt;DIV&gt;&lt;DIV Style="width:55%;float:left;"&gt;In</str>
Is there a function or other mechanism to do this?
Selecting anything "FOR XML" escapes any pre-existing XML so that it will not break the consistency of the XmlDocument. The first example line you gave is considered to be improperly formed XML, and will not be able to be loaded by an XmlDocument object, as well as most parsers. I would consider restructuring what you're trying to do so that you can have a more efficient solution.
You can use for xml explicit and the cdata directive:
select
1 as tag,
null as parent,
AanID as [str!1!name],
Answer as [str!1!!cdata]
from AuditAnswers
for xml explicit
You can specify that the output be treated as CDATA when using EXPLICIT mode XML queries. See:
Using EXPLICIT Mode
and
Example: Specifying the CDATA Directive
What would be the benefit of having <![CDATA[ <div></div> ]]> over having &lt;div&gt;&lt;/div&gt; in your database output? To me, it looks like you would have a properly escaped HTML fragment in your XML output in both cases, and reading it back with a decent XML parser should give you the unescaped original version in both cases.
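That point can be checked directly in T-SQL. In the sketch below (variable names @cdata and @escaped are made up), the same HTML fragment is stored once in a CDATA section and once entity-escaped, and .value() returns the identical string for both.

```sql
-- The same fragment, written two ways.
DECLARE @cdata   XML = N'<str><![CDATA[<div></div>]]></str>';
DECLARE @escaped XML = N'<str>&lt;div&gt;&lt;/div&gt;</str>';

-- Both columns return the original markup: <div></div>
SELECT @cdata.value('(/str)[1]',   'nvarchar(100)') AS FromCdata,
       @escaped.value('(/str)[1]', 'nvarchar(100)') AS FromEscaped;
```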