SQL Server xml string parsing in varchar field - sql

I have a varchar column in a table that is used to store xml data. Yeah I know there is an xml data type that I should be using, but I think this was set up before the xml data type was available so a varchar is what I have to use for now. :)
The data stored looks similar to the following:
<xml filename="100100_456_484351864768.zip"
event_dt="10/5/2009 11:42:52 AM">
<info user="TestUser" />
</xml>
I need to parse the filename to get the digits between the two underscores which in this case would be "456". The first part of the file name "shouldn't" change in length, but the middle number will. I need a solution that would work if the first part does change in length (you know it will change because "shouldn't change" always seems to mean it will change).
For now, I'm using XQuery to pull out the filename because I figured this is probably better than straight string manipulation. I cast the string to xml to do this, but I'm not an XQuery expert so of course I'm running into issues. I found a function for XQuery (substring-before), but was unable to get it to work (I'm not even sure that function will work with SQL Server). There might be an XQuery function to do this easily, but if there is I am unaware of it.
So, I get the filename from the table with a query similar to the following:
select CAST(parms as xml).query('data(/xml/@filename)') as p
from Table1
From this I'd assume that I'd be able to CAST this back to a string then do some instring or charindex function to figure out where the underscores are so that I can encapsulate all of that in a substring function to pick out the part I need. Without going too far into this I am pretty sure that I can eventually get it done this way, but I know that there has to be an easier way. This way would make a huge unreadable field in the SQL Statement which even if I moved it to a function would still be confusing to try to figure out what is going on.
I'm sure there is an easier way than this, since it seems to be simple string manipulation. Perhaps someone can point me in the right direction. Thanks

You can use XQuery for this - just change your statement to:
SELECT
CAST(parms as xml).value('(/xml/@filename)[1]', 'varchar(260)') as p
FROM
dbo.Table1
That gives you a VARCHAR(260) long enough to hold any valid file name and path - now you have a string and can work on it with SUBSTRING etc.
Marc

The straightforward way to do this is with SUBSTRING and CHARINDEX. Assuming (wise or not) that the first part of the filename doesn't change length, but that you still want to use XQuery to locate the filename, here's a short repro that does what you want:
declare @t table (
parms varchar(max)
);
insert into @t values ('<xml filename="100100_456_484351864768.zip" event_dt="10/5/2009 11:42:52 AM"><info user="TestUser" /></xml>');
with T(fName) as (
select cast(cast(parms as xml).query('data(/xml/@filename)') as varchar(100)) as p
from @t
)
select
substring(fName,8,charindex('_',fName,8)-8) as myNum
from T;
There are sneaky solutions that use other string functions like REPLACE and PARSENAME or REVERSE, but none is likely to be more efficient or readable. One possibility to consider is writing a CLR routine that brings regular expression handling into SQL.
By the way, if your xml is always this simple, there's no particular reason I can see to use XQuery at all. Here are two queries that will extract the number you want. The second is safer if you don't have control over extra white space in your xml string or over the possibility that the first part of the file name will change length:
select
substring(parms,23,charindex('_',parms,23)-23) as myNum
from @t;
select
substring(parms,charindex('_',parms)+1,charindex('_',parms,charindex('_',parms)+1)-charindex('_',parms)-1) as myNum
from @t;
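For comparison, here is a minimal sketch of the same "text between the first two underscores" logic outside the database, written in Python. The function name middle_number is hypothetical, and the sketch assumes the stored string is well-formed XML with a filename attribute on the root element, as in the sample above.

```python
import xml.etree.ElementTree as ET


def middle_number(parms: str) -> str:
    """Return the text between the first and second underscore of the filename attribute."""
    # Parse the stored string and read the attribute instead of counting characters.
    filename = ET.fromstring(parms).get("filename")
    first = filename.index("_")
    second = filename.index("_", first + 1)
    return filename[first + 1:second]


sample = ('<xml filename="100100_456_484351864768.zip" '
          'event_dt="10/5/2009 11:42:52 AM"><info user="TestUser" /></xml>')
print(middle_number(sample))
```

Because the positions come from searching for the underscores, the result does not depend on the length of the first part of the filename, which is the same property the second query relies on.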

Unfortunately, SQL Server is not a conformant XQuery implementation; rather, it implements a fairly limited subset of a draft version of the XQuery spec. Not only does it lack fn:substring-before, it also lacks fn:index-of (to do it yourself using fn:substring) and fn:string-to-codepoints. So, as far as I can tell, you're stuck with SQL here.

Related

Splitting variable content in SQL

I have a variable in a stored procedure that contains a string of characters like
[Tag]MESSAGE[/Tag]
I need a way to get the MESSAGE part from within the tags.
Any help would be much appreciated
Note: I have tested it on Oracle RDBMS
A more reliable approach is to use REGEXP_REPLACE.
REGEXP_REPLACE(value, pattern)
Example
SELECT REGEXP_REPLACE(
'<Tag>Message</Tag>',
'\s*</?\w+((\s+\w+(\s*=\s*(".*?"|''.*?''|[^''">\s]+))?)+\s*|\s*)/?>\s*') FROM DUAL;
Just replace "<" with "[" if your tags are different
What you need is this:
SELECT SUBSTRING(ColumnName, CHARINDEX('html_tag',ColumnName)+LEN('html_tag'), CHARINDEX('html_close_tag',ColumnName)-CHARINDEX('html_tag',ColumnName)-LEN('html_tag')) FROM TableName
You'll need to replace html_tag and html_close_tag with the opening and closing tags you want to get rid of. Note that the third argument of SUBSTRING is a length, so it has to be the distance from the end of the opening tag to the start of the closing tag.
If the column contains only a single tag, a simple call to the substring function should be enough. Otherwise there will always be some point where a regular expression does not suffice and you fall into a trap (see this legendary StackOverflow answer).
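The substring approach boils down to: start just past the opening tag and take everything up to the closing tag. A minimal Python sketch of that index arithmetic follows; the helper name between is hypothetical, and it assumes both tags are present exactly as given.

```python
def between(text: str, open_tag: str, close_tag: str) -> str:
    """Return the text between the first open_tag and the next close_tag."""
    # Start just past the opening tag.
    start = text.index(open_tag) + len(open_tag)
    # Look for the closing tag only after the opening one.
    end = text.index(close_tag, start)
    return text[start:end]


print(between("[Tag]MESSAGE[/Tag]", "[Tag]", "[/Tag]"))
```

str.index raises ValueError when a tag is missing, so a caller that expects untagged values would need to handle that case separately.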

SQL query that prevents Excel from converting long integer to scientific notation

So it's been a long time since I've done anything fancy with SQL, so I'm going to do my best to explain. Please be nice, I'm trying my best here.
Basically, I'm pulling information from a database in Snowflake and putting it into a new XML file, and that data is input exactly as-written into a form email.
One of the values is an ID number that's 14 characters long (example: 12345678912345), which is stored in the database as an integer (or so I'm told), but Excel keeps automatically converting it into scientific notation. Since it's an ID number, it needs to look like an ID number, not scientific notation.
Right now, my query just selects & inputs the regular ol' value, and then we manually change it in the Excel sheet. Like literally just SELECT ID_Number from TheThing
One thing I thought might work is:
SELECT CAST(ID_Number as bigint) as ID_Number
... But it doesn't work. Most other solutions I've found don't seem to address my specific scenario of unwanted integer-to-string conversion & I'm distraught.
I'm just an intern and this might have a very obvious answer, but my fellow interns have given up on it and I need to find the answer for my own sanity. It's been a minute since I did anything fancy with SQL so please be nice to me and sorry if this is a dumb question.
In Snowflake, BIGINT and INT(EGER) are the same thing; what you want is VARCHAR. As Ross mentioned in his comment, this is likely just a formatting issue within Excel. In Excel any value can be cast as a string by including a single quote ' at the beginning of the value, or by using the Text-to-Column feature.
If you wanted to try to format it out of Snowflake as a string, casting it might not do the trick unless you include some kind of additional string character.
To get this type of formatting out of Snowflake, you can try:
SELECT '\'' || CAST(ID_Number AS VARCHAR) as ID_Number FROM TheThing;

SQL Remove Substring From Query Results

I have a query that is returning data from a database. In a single field there is a rather long text comment with a segment, which is clearly defined with marking tags like !markerstart! and !markerend!. I would like to have a query return with the string segment between the two markers removed (and the markers removed too).
I would normally do this client-side after I get the data back; however, the problem is that the query is an INSERT query that gets its data from a SELECT statement. I don't want the text segment to be stored in the archival/reporting table (working with an OLTP application here), so I need to find a way to get the SELECT statement to return exactly what is to be inserted, which, in this case, means getting the SELECT statement to strip out the unwanted phrase instead of doing it in post-processing client-side.
My only thought is to use some convoluted combination of SUBSTRING, CHARINDEX, and CONCAT, but I'm hoping there is a better way, but, based on this, I don't see how. Anyone have ideas?
Sample:
This is a long string of text in some field in a database that has a segment that needs to be removed. !markerstart! This is the segment that is to be removed. It's length is unknown and variable. !markerend! The part of this field that appears after the marker should remain.
Result:
This is a long string of text in some field in a database that has a segment that needs to be removed. The part of this field that appears after the marker should remain.
SOLUTION USING STUFF:
I really don't like how verbose this is, but I can put it in a function if I really need to. It isn't ideal, but it is easier and faster than a CLR routine.
SELECT STUFF(CAST(Description AS varchar(MAX)), CHARINDEX('!markerstart!', Description), CHARINDEX('!markerend!', Description) + 11 - CHARINDEX('!markerstart!', Description), '') AS Description
FROM MyTable
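The arithmetic in that STUFF call (delete from the start of !markerstart! through the end of !markerend!, where 11 is the length of '!markerend!') can be sketched in Python as follows. The function name remove_segment is hypothetical, and returning the text unchanged when a marker is missing is a choice made for this sketch, not something taken from the query above.

```python
def remove_segment(text: str,
                   start_marker: str = "!markerstart!",
                   end_marker: str = "!markerend!") -> str:
    """Remove everything from start_marker through end_marker, markers included."""
    start = text.find(start_marker)
    if start == -1:
        return text  # no start marker: nothing to remove
    end = text.find(end_marker, start + len(start_marker))
    if end == -1:
        return text  # unterminated segment: leave the text alone
    # Keep what precedes the start marker and what follows the end marker.
    return text[:start] + text[end + len(end_marker):]


print(remove_segment("Keep this. !markerstart! Drop this. !markerend! Keep this too."))
```

As with the STUFF version, the spaces on both sides of the removed segment survive, so the result contains a double space at the join.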
You may want to consider implementing a CLR user-defined function that returns the parsed data.
The following link demonstrates how to use a CLR UDF RegEx function for pattern matching and data extraction.
http://msdn.microsoft.com/en-us/magazine/cc163473.aspx
Regards,
You can use Stuff function or Replace function and replace your unwanted symbols with ''.
STUFF('EXP',START_POS,'NUMBER_OF_CHARS','REPLACE_EXP')

SQL injection if brackets and semicolons are filtered

I have a statement like this:
SELECT * FROM TABLE WHERE COLUMN = 123456
123456 is provided by the user so it is vulnerable to SQLi but if I strip all semicolons and brackets, is it possible for the hacker to run any other statements (like DROP,UPDATE,INSERT etc) except SELECT?
I am already using prepared statements, but I am curious whether, if the input is stripped of the line terminator and brackets, the hacker can still modify the DB in any way.
Use sql parameters. Attempting to "sanitize" input is an extremely bad idea. Try googling some complex sql injection snippets, you won't believe how creative black hat hackers are.
In general it's very difficult to be 100% certain that you are safe from this type of attack by trying to strip out specific characters - there are just too many ways to get around your code (by using character encodings etc.)
A better option is to pass parameters to a stored procedure, like this:
CREATE PROCEDURE usp_MyStoredProcedure
@MyParam int
AS
BEGIN
SELECT * FROM TABLE WHERE COLUMN = @MyParam
END
GO
That way SQL will treat the value passed in as a parameter, and nothing else, no matter what it contains. And in this case it would only accept a value of type int anyway.
If you don't want, or can't, use a stored procedure, then I'd suggest changing your code so that the input parameter can only contain a pre-defined list of characters - in this case numeric characters. That way you can be certain that the value is safe to use.
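To illustrate why binding a parameter is safer than filtering characters, here is a minimal sketch using Python's built-in sqlite3 module with an in-memory database. The table items, its columns, and the helper find_item are made up for this example; the point is only that the bound value is compared as data and never parsed as SQL.

```python
import sqlite3

# Hypothetical table, created in memory purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER, name TEXT)")
conn.execute("INSERT INTO items VALUES (123456, 'widget')")


def find_item(conn, user_input):
    # The ? placeholder binds user_input as a value; its content is not executed.
    return conn.execute(
        "SELECT name FROM items WHERE id = ?", (user_input,)
    ).fetchall()


print(find_item(conn, 123456))
print(find_item(conn, "123456 OR 1=1"))
print(find_item(conn, "1; DROP TABLE items"))
```

The injection-style strings are simply compared against the id column as text, so they match nothing and the table is left untouched.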

How to use the original Guid in the SQL statement instead of the little endian one?

I know we can
INSERT INTO "Table1" VALUES(X'57A00F3015310D4081AD4ADEF3EBDB5E');
But this little endian format is difficult to compare to the original Guid
300FA057-3115-400D-81AD-4ADEF3EBDB5E
How to use the original Guid in the SQL statement instead of the little endian one?
If you want to easily compare to the original without converting then store it as text. It'll take more storage space and will be slower to read/write/compare, but it'll be more human readable.
I'm having similar frustrations and am experimenting with a querying tool to do conversion for me.
For now I get by with something like the below.
select quote(SomeGuid) from MyTable where name = 'Some Name'
Which returns
X'12A0E85D8175514DA792EC3D9A8EFCF7'
Formatting the original guid and comparing to the above:
5DE8A01275814D51A792EC3D9A8EFCF7 -- original no dashes, uppercase
12A0E85D8175514DA792EC3D9A8EFCF7 -- quote(Guid)
I can get away with querying a partial Guid for filtering purposes.
Note % on the right side too - value is quoted
select * from MyTable where quote(SomeGuid) like '%A792EC3D9A8EFCF7%'
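The relationship between the two forms is a fixed byte reordering: the first three groups of the Guid are stored byte-reversed and the last two groups are stored as written. A small sketch using Python's standard uuid module (its bytes_le attribute uses this layout) can convert in either direction; the function names here are hypothetical.

```python
import uuid


def guid_to_blob_literal(guid_text: str) -> str:
    """Format a Guid string as the X'...' blob literal in little-endian byte order."""
    return "X'" + uuid.UUID(guid_text).bytes_le.hex().upper() + "'"


def blob_hex_to_guid(blob_hex: str) -> str:
    """Turn the hex digits from quote(SomeGuid) back into the original Guid string."""
    return str(uuid.UUID(bytes_le=bytes.fromhex(blob_hex))).upper()


print(guid_to_blob_literal("300FA057-3115-400D-81AD-4ADEF3EBDB5E"))
print(blob_hex_to_guid("12A0E85D8175514DA792EC3D9A8EFCF7"))
```

A helper like this could generate the literal to paste into a query, so the original Guid stays the value you read and compare.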
Try this:
INSERT INTO [Table1] ([UID]) VALUES ('{57A00F30-1531-0D40-81AD-4ADEF3EBDB5E}');
I always do it this way and haven't run into any problems.