Escape XML special characters upon convert - sql

I have a working CSV splitter for my needs.
You can just grab it and run it as is:
declare @t table(data varchar(max))
insert into @t select 'a,b,c,d'
insert into @t select 'e,,,h'
;with cte(xm) as
(
select convert(xml,'<f><e>' + replace(data,',', '</e><e>') + '</e></f>') as xm
from @t
)
select
xm.value('/f[1]/e[1]','varchar(32)'),
xm.value('/f[1]/e[2]','varchar(32)'),
xm.value('/f[1]/e[3]','varchar(32)'),
xm.value('/f[1]/e[4]','varchar(32)')
from cte
The only issue is that if I introduce an XML-sensitive character in the data, like &:
insert into @t select 'i,j,&,k'
It fails with error: character 24, illegal character
One solution is to replace the & character with &amp; on the fly, like this:
select convert(xml,'<f><e>' + replace(replace(data,'&','&amp;'),',', '</e><e>') + '</e></f>') as xm
But there are several dozen special XML characters that I need to escape upon convert, and I can't really nest dozens of replace(replace(replace(... calls in there. That's what I did, and it is messy.
How can the above code be modified to escape XML-sensitive characters and still produce the same result?
Thanks!

You have already got your answer from Martin Smith. But I think it is worth placing an answer here for followers. I want to provide some explanation, and furthermore, the rextester link might not be reachable in the future...
If you think of a string in a table like this ...
DECLARE @mockup TABLE(SomeXMLstring VARCHAR(100));
INSERT INTO @mockup VALUES('This is a string with forbidden characters like "<", ">" or "&"');
-- ... you can easily add XML-tags:
SELECT '<root>' + SomeXMLstring + '</root>'
FROM @mockup ;
--The result would look like XML
<root>This is a string with forbidden characters like "<", ">" or "&"</root>
--But it is not! You can test this, the CAST( AS XML) will fail:
SELECT CAST('<root>This is a string with forbidden characters like "<", ">" or "&"</root>' AS XML);
--Sometimes people try to do their own replaces and start to replace <, > and & with the corresponding entities &lt;, &gt; and &amp;. But this will need a lot of replacements in order to be safe.
--But XML is doing all this for us implicitly
SELECT SomeXMLstring
FROM @mockup
FOR XML PATH('')
--This is the result
<SomeXMLstring>This is a string with forbidden characters like "&lt;", "&gt;" or "&amp;"</SomeXMLstring>
--And the funny thing is: We can easily create a nameless element with AS [*]:
SELECT SomeXMLstring AS [*]
FROM @mockup
FOR XML PATH('')
--The result is the same, but without the tags:
This is a string with forbidden characters like "&lt;", "&gt;" or "&amp;"
--Although this looks like XML in SSMS, it will be implicitly cast to NVARCHAR(MAX) when used as a string.
--You can use this for implicit escaping of a string wherever you feel the need to build XML with string concatenation:
SELECT CAST('<root>' + (SELECT SomeXMLstring AS [*] FOR XML PATH('')) + '</root>' AS XML)
FROM @mockup ;
To finally answer your question
This line must use the trick:
select convert(xml,'<f><e>' + replace((SELECT data AS [*] FOR XML PATH('')),',', '</e><e>') + '</e></f>') as xm
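Putting the pieces together, the complete splitter might look like the sketch below. It assumes the table variable from the question is declared as @t, uses a multi-row VALUES list (SQL Server 2008 or later), and the column aliases c1 to c4 are made up for illustration.

```sql
-- Sketch only: @t and the aliases c1..c4 are illustrative names.
DECLARE @t TABLE (data VARCHAR(MAX));
INSERT INTO @t VALUES ('a,b,c,d'), ('e,,,h'), ('i,j,&,k');

WITH cte(xm) AS
(
    -- The inner SELECT ... FOR XML PATH('') escapes &, < and > first;
    -- only then are the commas turned into element boundaries.
    SELECT CONVERT(XML,
               '<f><e>'
               + REPLACE((SELECT data AS [*] FOR XML PATH('')), ',', '</e><e>')
               + '</e></f>')
    FROM @t
)
SELECT xm.value('/f[1]/e[1]', 'varchar(32)') AS c1,
       xm.value('/f[1]/e[2]', 'varchar(32)') AS c2,
       xm.value('/f[1]/e[3]', 'varchar(32)') AS c3,
       xm.value('/f[1]/e[4]', 'varchar(32)') AS c4
FROM cte;
```

The row 'i,j,&,k' no longer breaks the conversion, and .value() hands the ampersand back unescaped in the third column.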

Related

Remove characters before '-' and after '-' keeping the middle of the string that contain more one character '-'

I got the following entry in my database (varchar data type):
5-359-258756-54
2-456-58994-85
4-458 -478698-42
5-876-5878-26
I want to remove the first number plus the '-' after it, the last two digits plus the '-' before them, and any space that appears before a '-' in the middle (' -').
The final result must be:
359-258756
456-58994
458-478698
876-5878
I mainly tried charindex and patindex with replace ' ','' based on forum suggestions, but they did not give the expected result. The closest I could get was 458-478698-42 (removing the first number plus the '-' character and the space).
How can I solve it?
If your string format is consistent, then you can use parsename().
Example
Declare @YourTable table (SomeCol varchar(50))
Insert Into @YourTable values
('5-359-258756-54')
,('2-456-58994-85')
,('4-458 -478698-42')
,('5-876-5878-26')
Select *
,NewVal = replace(parsename(replace(SomeCol,'-','.'),3)
+'-'
+parsename(replace(SomeCol,'-','.'),2)
,' ','')
From @YourTable
Results
SomeCol             NewVal
5-359-258756-54     359-258756
2-456-58994-85      456-58994
4-458 -478698-42    458-478698
5-876-5878-26       876-5878

Edit string column in SQL - remove sections between separators

I have a string column in my table that contains 'Character-separated' data such as this:
"Value|Data|4|Z|11/06/2012"
This data is fed into a 'parser' and deserialised into a particular object. (The details of this aren't relevant and can't be changed)
The structure of my object has changed and now I would like to get rid of some of the 'sections' of data
So I want the previous value to turn into this
"Value|Data|11/06/2012"
I was hoping I might be able to get some help on how I would go about doing this in T-SQL.
The data always has the same number of sections, 'n' and I will want to remove the same sections for all rows , 'n-x and 'n-y'
So far I know I need an update statement to update my column value.
I've found various ways of splitting a string but I'm struggling to apply it to my scenario.
In C# I would do
string RemoveSections(string value)
{
    string[] bits = value.Split('|');
    List<string> wantedBits = new List<string>();
    for (var i = 0; i < bits.Length; i++)
    {
        if (i == 2 || i == 3) // position of sections I no longer want
        {
            continue;
        }
        wantedBits.Add(bits[i]);
    }
    return string.Join("|", wantedBits);
}
But how I would do this in SQL I'm not sure where to start. Any help here would be appreciated
Thanks
Ps. I need to run this SQL on SQL Server 2012
Edit: It looks like parsing to xml in some manner could be a popular answer here, however I can't guarantee my string won't have characters such as '<' or '&'
Using NGrams8K you can easily write a nasty fast customized splitter. The logic here is based on DelimitedSplit8K. This will likely outperform even the C# code you posted.
DECLARE @string VARCHAR(8000) = '"Value|Data|4|Z|11/06/2012"',
@delim CHAR(1) = '|';
SELECT newString =
(
SELECT SUBSTRING(
@string, split.pos+1,
ISNULL(NULLIF(CHARINDEX(@delim,@string,split.pos+1),0),8000)-split.pos)
FROM
(
SELECT ROW_NUMBER() OVER (ORDER BY d.Pos), d.Pos
FROM
(
SELECT 0 UNION ALL
SELECT ng.position
FROM samd.ngrams8k(@string,1) AS ng
WHERE ng.token = @delim
) AS d(Pos)
) AS split(ItemNumber,Pos)
WHERE split.ItemNumber IN (1,2,5)
ORDER BY split.ItemNumber
FOR XML PATH('')
);
Returns:
newString
----------------------------
"Value|Data|11/06/2012"
Not the most elegant way, but works:
SELECT SUBSTRING(@str,1, CHARINDEX('|',@str,CHARINDEX('|',@str,1)+1)-1)
+ SUBSTRING(@str, CHARINDEX('|',@str,CHARINDEX('|',@str,CHARINDEX('|',@str,CHARINDEX('|',@str,1)+1)+1)+1), LEN(@str))
----------------------
Value|Data|11/06/2012
You might try some XQuery:
DECLARE @s VARCHAR(100)='Value|Data|4|Z|11/06/2012';
SELECT CAST('<x>' + REPLACE(@s,'|','</x><x>') + '</x>' AS XML)
.value('concat(/x[1],"|",/x[2],"|",/x[5])','nvarchar(max)');
In short: The value is transformed to XML by some string replacements. Then we use the XQuery concat to bind the first, the second and the fifth element together again.
This version is a bit less efficient but safe with forbidden characters:
SELECT CAST('<x>' + REPLACE((SELECT @s AS [*] FOR XML PATH('')),'|','</x><x>') + '</x>' AS XML)
.value('concat(/x[1],"|",/x[2],"|",/x[5])','nvarchar(max)')
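Since the question asks for an UPDATE, the escape-safe variant could be applied to a column roughly as sketched below. The table variable @data and its columns Id and Payload are hypothetical stand-ins for the real table, and the second sample row is invented to show that '&' and '<' survive the round trip.

```sql
-- Sketch only: @data, Id and Payload are hypothetical names.
DECLARE @data TABLE (Id INT IDENTITY, Payload VARCHAR(200));
INSERT INTO @data (Payload)
VALUES ('Value|Data|4|Z|11/06/2012'),
       ('A&B|<tag>|1|2|01/01/2013');

-- Escape first, split on the pipe, then keep sections 1, 2 and 5.
UPDATE d
SET Payload = CAST('<x>'
                   + REPLACE((SELECT d.Payload AS [*] FOR XML PATH('')), '|', '</x><x>')
                   + '</x>' AS XML)
              .value('concat(/x[1],"|",/x[2],"|",/x[5])', 'varchar(200)')
FROM @data AS d;

SELECT Id, Payload FROM @data;
```

The same statement should work against a permanent table by replacing @data with the table name; test it on a copy first, since the update overwrites the original values.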
Just to add a non-xml option for fun:
Edit and Caveat - In case anyone tries this for a different solution and doesn't read the comments...
HABO rightly noted that this is easily broken if any of the columns have a period (".") in them. PARSENAME is dependent on a 4 part naming structure and will return NULL if that is exceeded. This solution will also break if any values ever contain another pipe ("|") or another delimited column is added - the substring in my answer is specifically there as a workaround for the dependency on the 4 part naming. If you are trying to use this solution on, say, a variable with 7 delimited columns, it would need to be reworked or scrapped in favor of one of the other answers here.
DECLARE
@a VARCHAR(100)= 'Value|Data|4|Z|11/06/2012'
SELECT
PARSENAME(REPLACE(SUBSTRING(@a,0,LEN(@a)-CHARINDEX('|',REVERSE(@a))+1),'|','.'),4)+'|'+
PARSENAME(REPLACE(SUBSTRING(@a,0,LEN(@a)-CHARINDEX('|',REVERSE(@a))+1),'|','.'),3)+'|'+
SUBSTRING(@a,LEN(@a)-CHARINDEX('|',REVERSE(@a))+2,LEN(@a))
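The four-part limit mentioned in the caveat is easy to see in isolation. A minimal check, using made-up literals:

```sql
-- PARSENAME understands at most four dot-separated parts.
SELECT PARSENAME('a.b.c.d',   4) AS FourParts,  -- returns 'a'
       PARSENAME('a.b.c.d.e', 4) AS FiveParts;  -- returns NULL: too many parts
```

This is why the answer cuts off the last section with SUBSTRING before handing the rest to PARSENAME.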
Here is a quick way to do it.
CREATE FUNCTION [dbo].StringSplitXML
(
@String VARCHAR(MAX), @Separator CHAR(1)
)
RETURNS @RESULT TABLE(id int identity(1,1),Value VARCHAR(MAX))
AS
BEGIN
DECLARE @XML XML
SET @XML = CAST(
('<i>' + REPLACE(@String, @Separator, '</i><i>') + '</i>')
AS XML)
INSERT INTO @RESULT
SELECT t.i.value('.', 'VARCHAR(MAX)')
FROM @XML.nodes('i') AS t(i)
WHERE t.i.value('.', 'VARCHAR(MAX)') <> ''
RETURN
END
GO
SELECT * FROM dbo.StringSplitXML( 'Value|Data|4|Z|11/06/2012','|')
WHERE id not in (3,4)
Note that using a UDF will slow things down, so this solution should be considered only if you have a reasonably small data set to work with.

Using CHAR(13) in a FOR XML SELECT

I'm trying to use CHAR(13) to force a new line, but the problem is, I'm doing this within a FOR XML Select statement:
SELECT
STUFF((SELECT CHAR(13) + Comment
FROM
myTable
FOR XML PATH ('')) , 1, 1, '')
The problem with this, is that I don't get an actual new line. Instead, I get:
#x0D;
So the data literally looks like this:
#x0D;First Line of Data#x0D;Second Line of Data#x0D;Etc
So I tried to just replace #x0D; with CHAR(13) outside of the FOR XML:
REPLACE(SELECT
STUFF((SELECT CHAR(13) + Comment
FROM
myTable
FOR XML PATH ('')) , 1, 1, '')), '#x0D;', CHAR(13))
This gets me close. It DOES add in the line breaks, but it also includes an & at the end of each line, and the start of each line after the first:
First Line of Data&
&Second Line of Data&
&Etc
Your approach is not real XML:
Try this with "output to text":
DECLARE @tbl TABLE(TestText VARCHAR(100));
INSERT INTO @tbl VALUES('line 1'),('line 2'),('line 3');
SELECT STUFF
(
(
SELECT CHAR(10) + tbl.TestText
FROM @tbl AS tbl
FOR XML PATH('')
),1,1,''
)
With CHAR(13)
#x0D;line 1&#x0D;line 2&#x0D;line 3
See that your STUFF just took away the ampersand?
With CHAR(10)
line 1
line 2
line 3
But what you really need is:
SELECT STUFF
(
(
SELECT CHAR(10) + tbl.TestText --you might use 13 and 10 here
FROM @tbl AS tbl
FOR XML PATH(''),TYPE
).value('.','nvarchar(max)'),1,1,''
)
The ,TYPE will return real XML and with .value() you read this properly.
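To see the difference side by side, the following sketch runs the same concatenation with and without ,TYPE plus .value(). The sample rows and the column aliases StillEscaped and Decoded are invented for illustration.

```sql
-- Sketch only: sample data and aliases are illustrative.
DECLARE @tbl TABLE (TestText VARCHAR(100));
INSERT INTO @tbl VALUES ('Tom & Jerry'), ('a < b');

SELECT
    -- Without TYPE the result is a string that still carries the entities (&amp;, &lt;).
    STUFF((SELECT ', ' + TestText
           FROM @tbl
           FOR XML PATH('')), 1, 2, '') AS StillEscaped,
    -- With TYPE the result is real XML, and .value() decodes the entities on the way out.
    STUFF((SELECT ', ' + TestText
           FROM @tbl
           FOR XML PATH(''), TYPE).value('.', 'nvarchar(max)'), 1, 2, '') AS Decoded;
```

The row order inside the concatenated string is not guaranteed without an ORDER BY in the subquery.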
Some background
You have a misconception about "So the data literally looks like this":
It does not "look like this"; it is escaped to fit the rules of XML. And it will be decoded implicitly when you read it correctly.
And you have a misconception of line breaks:
In (almost) ancient times you needed a CR = Carriage Return, 13 or x0D to move back the printing sledge and additionally you needed a LF = Line Feed, 10 or x0A to turn the platen to move the paper. Hence the widely used need to have a line break coded with two characters (13/10 or 0D/0A).
Today the ASCII 10 (0A) is often seen alone...
But back to your actual problem: Why do you bother about the look of your data? Within XML some string might look ugly, but - if you read this properly - the decoding back to the original look is done implicitly...
Your residues are nothing more than part of the encoding, which starts with an ampersand and ends with a semicolon: &lt; or &#x0D;. Your attempt to replace this is just one character too short. But anyway: You should not do this!
Just try:
SELECT CAST('<x>Hello</x>' AS XML).value('/x[1]','nvarchar(max)')
Thanks everyone for your help.
The ultimate goal here was to present the data in Excel as part of a report. I'm sure there is a more elegant way to do this, but I at least got the results I wanted by doing this:
REPLACE (
REPLACE(
REPLACE(
(SELECT Comment FROM CallNotes WHERE ForeignId = a.ForeignId FOR XML PATH (''))
, '<Comment>', '')
, '</Comment>', CHAR(13) + CHAR(10))
, '&#x0D;', '') AS Comments
The select statement all by itself returns XML as we would expect:
<comment>This is a comment</comment><comment>This is another comment</comment>
The inner most REPLACE just gets rid of the opening tag:
<comment>
The middle REPLACE removes the closing tag:
</comment>
and replaces it with CHAR(13) + CHAR(10). And the outermost REPLACE gets rid of this:
&#x0D;
(I still don't understand where that's coming from.)
So, when the results are sent to Excel, it looks like this inside the cell:
This is a comment.
This is another comment.
Which is exactly what I want. Again, I'm sure there is a better solution. But this at least is working for now.
I think this is cleaner. Basically start with line feeds (or some other special character) then replace them with carriage returns plus line feeds if you want.
Select REPLACE(STUFF((SELECT CHAR(10) + Comment
FROM myTable FOR XML PATH ('')) , 1, 1, ''),
CHAR(10), CHAR(13)+CHAR(10))
I suppose since you need to group on the FK you can use something like this... just replace #TempT with your table...
Select Pri.ForeignKey,
Replace(Left(Pri.Notes,Len(Pri.Notes)-1),',',CHAR(13)) As Notes
From
(
Select distinct T2.ForeignKey,
(
Select T1.Note + ',' AS [text()]
From #TempT T1
Where T1.ForeignKey = T2.ForeignKey
ORDER BY T1.ForeignKey
For XML PATH ('')
) Notes
From #TempT T2
) Pri
Also in the OP that you listed in the comments, you have a duplicate PrimaryKey. I found that odd. Just a heads up.
If you use below query and results to text option you will see line breaks. Line breaks can't be shown using the results to grid functionality.
SELECT
STUFF((SELECT CHAR(10) + Comment
FROM
myTable
FOR XML PATH ('')) , 1, 1, '')
I suggest Comment + Char(13) + Char(10)
The "Carriage Return" and "Line Feed" should be at the end of the line.
I believe this could help:
REPLACE(STUFF ((SELECT CHAR(13)+CHAR(10) + Field1 + Field2
FROM
[table]
WHERE
field3= 'condition1'
FOR XML PATH ('')), 1, 0, '') , '&#x0D;' , '')

Remove ASCII Extended Characters 128 onwards (SQL)

Is there a simple way to remove extended ASCII characters in a varchar(max). I want to remove all ASCII characters from 128 onwards. eg - ù,ç,Ä
I have tried this solution and its not working, I think its because they are still valid ASCII characters?
How do I remove extended ASCII characters from a string in T-SQL?
Thanks
The linked solution uses a loop, which is something you should avoid if possible.
My solution is completely inlineable; it's easy to create a UDF (or maybe even better: an inline TVF) from this.
The idea: Create a set of running numbers (here it's limited by the count of objects in sys.objects, but there are tons of examples of how to create a numbers tally on the fly). In the second CTE the strings are split into single characters. The final select comes back with the cleaned string.
DECLARE @tbl TABLE(ID INT IDENTITY, EvilString NVARCHAR(100));
INSERT INTO @tbl(EvilString) VALUES('ËËËËeeeeËËËË'),('ËaËËbËeeeeËËËcË');
WITH RunningNumbers AS
(
SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS Nmbr
FROM sys.objects
)
,SingleChars AS
(
SELECT tbl.ID,rn.Nmbr,SUBSTRING(tbl.EvilString,rn.Nmbr,1) AS Chr
FROM @tbl AS tbl
CROSS APPLY (SELECT TOP(LEN(tbl.EvilString)) Nmbr FROM RunningNumbers) AS rn
)
SELECT ID,EvilString
,(
SELECT '' + Chr
FROM SingleChars AS sc
WHERE sc.ID=tbl.ID AND ASCII(Chr)<128
ORDER BY sc.Nmbr
FOR XML PATH('')
) AS GoodString
FROM @tbl As tbl
The result
1 ËËËËeeeeËËËË eeee
2 ËaËËbËeeeeËËËcË abeeeec
Here is another answer from me where this approach is used to replace all special characters with secure characters to get plain latin

Use ampersand in CAST in SQL

The following code snippet on SQL server 2005 fails on the ampersand '&':
select cast('<name>Spolsky & Atwood</name>' as xml)
Does anyone know a workaround?
Longer explanation, I need to update some data in an XML column, and I'm using a search & replace type hack by casting the XML value to a varchar, doing the replace and updating the XML column with this cast.
select cast('<name>Spolsky &amp; Atwood</name>' as xml)
A literal ampersand inside an XML tag is not allowed by the XML standard, and such a document will fail to parse by any XML parser.
An XMLSerializer() will output the ampersand HTML-encoded.
The following code:
using System.Xml.Serialization;
namespace xml
{
public class MyData
{
public string name = "Spolsky & Atwood";
}
class Program
{
static void Main(string[] args)
{
new XmlSerializer(typeof(MyData)).Serialize(System.Console.Out, new MyData());
}
}
}
will output the following:
<?xml version="1.0" encoding="utf-8"?>
<MyData
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<name>Spolsky &amp; Atwood</name>
</MyData>
, with an &amp; instead of &.
It's not valid XML. Use &amp;:
select cast('<name>Spolsky &amp; Atwood</name>' as xml)
You'd need to XML escape the text, too.
So let's backtrack and assume you're building that string as:
SELECT '<name>' + MyColumn + '</name>' FROM MyTable
you'd want to do something more like:
SELECT '<name>' + REPLACE( MyColumn, '&', '&amp;' ) + '</name>' FROM MyTable
Of course, you probably should cater for the other entities thus:
SELECT '<name>' + REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( MyColumn, '&', '&amp;' ), '''', '&apos;' ), '"', '&quot;' ), '<', '&lt;' ), '>', '&gt;' ) + '</name>' FROM MyTable
When working with XML in SQL you're a lot safer using built-in functions instead of converting it manually.
The following code will build a proper SQL XML variable that looks like your desired output based on a raw string:
DECLARE @ExampleString nvarchar(40)
, @ExampleXml xml
SELECT @ExampleString = N'Spolsky & Atwood'
SELECT @ExampleXml =
(
SELECT 'Spolsky & Atwood' AS 'name'
FOR XML PATH (''), TYPE
)
SELECT @ExampleString , @ExampleXml
As John and Quassnoi state, & on its own is not valid. This is because the ampersand character is the start of a character entity, used to specify characters that cannot be represented literally. There are two forms of entity: one specifies the character by name (e.g., &amp; or &quot;), and one specifies the character by its code (I believe it's the code position within the Unicode character set, but I'm not sure; e.g., &#34; should represent a double quote).
Thus, to include a literal & in an HTML document, you must specify its entity: &amp;. Other common ones you may encounter are &lt; for <, &gt; for >, and &quot; for ".
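As a small round-trip check, the sketch below lets SQL Server do the escaping via FOR XML PATH('') and then reads the value back out of the cast XML. The variable names @raw and @escaped are made up for illustration; DECLARE and SET are kept separate so it should also run on SQL Server 2005.

```sql
-- Sketch only: @raw and @escaped are illustrative names.
DECLARE @raw NVARCHAR(100), @escaped NVARCHAR(MAX);
SET @raw = N'Spolsky & Atwood';

-- FOR XML PATH('') with a nameless column returns the escaped text: Spolsky &amp; Atwood
SET @escaped = (SELECT @raw AS [*] FOR XML PATH(''));

SELECT @escaped AS Escaped,
       -- The cast now succeeds because the ampersand is written as an entity.
       CAST(N'<name>' + @escaped + N'</name>' AS XML) AS AsXml,
       -- .value() decodes the entity again: Spolsky & Atwood
       CAST(N'<name>' + @escaped + N'</name>' AS XML)
           .value('(/name)[1]', 'nvarchar(100)') AS ReadBack;
```

The entity only exists in the serialized form; once the XML is parsed and read with .value(), the original character comes back.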