Extract a substring from a text field - sql

New to T-SQL and SQL generally, so please pardon me if this is really basic:
I am working with a new-to-me-database that has ignored some best practices. Relevant to this discussion, some data is stored in a generalized note field, including loyalty numbers. The good news is that the loyalty numbers are at least stored consistently within the note.
So, a simplified example from the note table might be:
I have verified that every Loyalty Number is stored consistently ("Loyalty Number ####"), but obviously this is not ideal. I want to extract the Loyalty Number for every primary key that has them, then create a new field that stores the Loyalty Number.
What I'm having trouble with is the following: How do I run a query that will give me each primary key then, if there is a loyalty number return it, if not leave it null or say something like no result found. E.g., turn the above into something like.
It's trivially easy to construct something like "select primary_key, note from note_table where note like '%Loyalty Number%'", but that doesn't do the job of clipping down to just the loyalty number (and leaving out extraneous text). The uniformity of the data means I could probably do this in Excel, but I'm wondering if it's possible in T-SQL. Thanks in advance for your help.

Give something like this a try using case with substring and charindex:
select id,
case when note like '%Loyalty Number [0-9][0-9][0-9][0-9]%'
then 'Loyalty Number ' +
substring(note,
charindex('Loyalty Number', note) + Len('Loyalty Number ') + 1, 4)
end as Note
from note
SQL Fiddle Demo
The CASE expression checks whether a loyalty number exists in the note. SUBSTRING then extracts it, using CHARINDEX to find the starting position. Because LEN ignores the trailing space in 'Loyalty Number ', the + 1 skips over that space to the first digit. This hard-codes a length of 4 characters for the loyalty number, which, given your comments, should work. If the number of characters varies, you'll need to modify this slightly.
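If the loyalty number is not always four digits, one possible modification is sketched below. It assumes the table and column names from the question (note_table, id, note), that the number consists only of digits, and that it directly follows the text 'Loyalty Number '. The aliases n, t and tail are made up for illustration.

```sql
-- Sketch: extract a loyalty number of variable length (digits only).
SELECT n.id,
       CASE WHEN n.note LIKE '%Loyalty Number [0-9]%'
            -- Appending 'x' guarantees PATINDEX finds a non-digit,
            -- so the length passed to LEFT is never negative.
            THEN LEFT(t.tail, PATINDEX('%[^0-9]%', t.tail + 'x') - 1)
       END AS LoyaltyNumber
FROM note_table AS n
CROSS APPLY (
    -- Everything after the label; 15 is the length of 'Loyalty Number '
    -- including its trailing space.
    SELECT SUBSTRING(n.note, CHARINDEX('Loyalty Number ', n.note) + 15, 8000) AS tail
) AS t;
```

Rows without a loyalty number fall through the CASE and return NULL.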

Building on @segeddes's answer, here's the rest of the code, which will update your new LoyaltyNumber column.
Working SQL Fiddle: http://sqlfiddle.com/#!3/36e46/8
UPDATE note_table
SET LoyaltyNumber =
CASE
WHEN note LIKE '%Loyalty Number [0-9][0-9][0-9][0-9]%'
THEN SUBSTRING(note, CHARINDEX('Loyalty Number', note)
+ LEN('Loyalty Number ') + 1, 4)
ELSE 'Regular Customer'
END
FROM note_table
Table Definition and CRUD
CREATE TABLE note_table (
id int identity(1,1),
Note VarChar(500),
LoyaltyNumber varchar(20)
)
Insert Into note_table(Note) Values
('Customer Since 2012. Loyalty Number 4747'),
('Loyalty Number 2209'),
('Loyalty Number 2234.Customer Since 2009'),
('Pending Order');

Related

Query to ignore rows which have non hex values within field

Initial situation
I have a relatively large table (about 700,000 records) where an nvarchar field "MediaID" mostly contains media IDs in proper hexadecimal notation (as it should).
Within my "sequential" query (each query depends on the output of the query before it; this is all pure T-SQL), I have to convert these hexadecimal values into decimal bigint values so that subsequent queries can do further calculations and filtering on them.
--> So far, no problem. The "sequential" query works fine.
Problem
Unfortunately, some of these media IDs contain non-hex characters, most probably because of typing errors by the people who entered them or import errors from the previous business system.
Because of these non-hex characters, the whole query fails (of course), since the conversion raises an error.
For my current purpose, such rows must be skipped/ignored, as they are clearly wrong and cannot be used (no media / data carriers in use with the current business system can have IDs with non-hex characters).
Manual editing of the data is not an option, as there are too many errors and it is not clear what the data should be replaced with.
Challenge
To create a query which only returns records which have valid hex values within the media ID field.
(Unfortunately, my SQL skills are not enough to create the above query. Your help is highly appreciated.)
The relevant section of the larger query looks like this (xxxx is where your help comes in :-))
select
pureMediaID
, mediaID
, CUSTOMERID
,CONTRACT_CUSTOMERID
from
(
select concat('0x', Replace(Ltrim(Replace(mediaID, '0', ' ')), ' ', '0')) AS pureMediaID
--, CUSTOMERID
, *
from M_T_CONTRACT_CUSTOMERS
where mediaID is not null
and mediaID like '0%'
and xxxxxxxxxxxxxxxxxxxxxxxxxxxxx
) as inner1
EDIT: As per request I have added here some good and some bad data:
Good:
4335463357
4335459809
1426427996
4335463509
4335515039
4335465134
4427370396
4335415661
4427369036
4335419089
004BB03433
004e7cf9c6
00BD23133
00EE13D8C1
00CCB5522C
00C46522C
00dbbe3433
Bad:
4564589+
AB6B8BFC.8
7B498DFCnm
DB218DFChb
d<tgfh8CFC
CB9E8AFCzj
B458DFCjhl
rytzju8DFC
BFCtdsjshj
DB9888FCgf
9BC08CFCyx
EB198DFCzj
4B628CFChj
7B2B8DFCgg
After I upgraded the compatibility level of the SQL instance to SQL 2016 (it was below 2012 before), I could use try_convert with the same syntax as the original convert function, as donPablo pointed out. With that, the query runs all the way through, and every MediaID that is not a correct hex value is converted to a null value. Really, really nice.
Exactly what I needed.
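For readers who want to see the shape of such a conversion, here is a minimal sketch. It is not the exact query used above: it assumes IDs of at most 16 hex digits (longer ones would be truncated by RIGHT), pads them to 8 bytes, and uses a few sample values from the question. The aliases s and MediaIdAsBigint are made up for illustration.

```sql
-- Sketch: convert hex strings to bigint, yielding NULL for invalid input.
-- TRY_CONVERT needs compatibility level 110 (SQL Server 2012) or higher.
SELECT s.MediaID,
       CONVERT(bigint,
               TRY_CONVERT(varbinary(8),
                           -- Style 1 expects a '0x' prefix and an even number
                           -- of hex digits, so left-pad to 16 digits first.
                           '0x' + RIGHT(REPLICATE('0', 16) + s.MediaID, 16),
                           1)) AS MediaIdAsBigint
FROM (VALUES ('004BB03433'), ('00BD23133'), ('4564589+'), ('AB6B8BFC.8')) AS s(MediaID);
```

Rows whose result is NULL can then be filtered out in the next query of the sequence.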
Unfortunately, the solution of ALICE... didn't work out for me, as it (strangely) also returned records that had the "+" character in them.
Edit: The added comment of Alice..., where you create a calculated field like this:
CASE WHEN "KEY" LIKE '%[^0-9A-F]%' THEN 0 ELSE 1 end as xyz
and then filter in the next query like this:
where xyz = 1
also works with SQL instances with a compatibility level below SQL 2012.
Great addition for people who still have to work with older SQL instances.
An option (although not ideal in terms of performance) is to check the characters in MediaID with a CASE expression and a LIKE pattern.
Hexadecimal values cannot contain characters other than A-F and the digits 0-9.
CASE WHEN MediaID LIKE '%[0-9A-F]%' THEN 1 ELSE 0 END
I would recommend writing a function that first checks whether MediaID is hexadecimal, and then running the conversion query.
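Note that a pattern such as '%[0-9A-F]%' matches any value containing at least one hex digit, which would explain why a value like 4564589+ still gets through. A hedged sketch of the negated form, which rejects any value containing a character outside the hex set, follows. The sample values are taken from the question, the alias s is made up, and a-f is listed explicitly in case the column uses a case-sensitive collation.

```sql
-- Sketch: keep only values made up entirely of hex digits.
SELECT s.MediaID
FROM (VALUES ('004BB03433'), ('004e7cf9c6'), ('4564589+'),
             ('AB6B8BFC.8'), ('7B498DFCnm')) AS s(MediaID)
-- [^...] matches one character NOT in the set, so NOT LIKE passes
-- only strings in which no such character occurs.
WHERE s.MediaID NOT LIKE '%[^0-9A-Fa-f]%';
```

With these five samples, only the first two values should be returned.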

Can 2 character length variables cause SQL injection vulnerability?

I am taking a text input from the user, then converting it into 2 character length strings (2-Grams)
For example
RX480 becomes
"rx","x4","48","80"
Now, if I query the server directly like below, could they somehow perform SQL injection?
select *
from myTable
where myVariable in ('rx', 'x4', '48', '80')
SQL injection is not a matter of length of anything.
It happens when someone adds code to your existing query. They do this by sending in the malicious extra code as a form submission (or something). When your SQL code executes, it doesn't realize that there is more than one thing to do. It just executes what it's told.
You could start with a simple query like:
select *
from thisTable
where something=$something
So you could end up with a query that looks like:
select *
from thisTable
where something=; DROP TABLE employees;
This is an odd example. But it does more or less show why it's dangerous. The first query will fail, but who cares? The second one will actually work. And if you have a table named "employees", well, you don't anymore.
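One common defence is to pass the value as a parameter instead of concatenating it into the SQL text. Below is a minimal T-SQL sketch using sp_executesql. It assumes the table thisTable and the column something from the example above exist; the variable name @something is chosen here for illustration.

```sql
-- The hostile input is bound as data, so it is compared against the column
-- rather than executed as a second statement.
DECLARE @something nvarchar(100) = N'; DROP TABLE employees;';

EXEC sp_executesql
     N'SELECT * FROM thisTable WHERE something = @something',
     N'@something nvarchar(100)',
     @something = @something;
```

Client libraries generally offer the same idea through parameterized commands or prepared statements.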
Two characters in this case are sufficient to cause an error in the query and possibly reveal some information about it. For example, try using the string ')480 and watch how your application behaves.
Although not much of an answer, this really doesn't fit in a comment.
Your code scans a table checking to see if a column value matches any pair of consecutive characters from a user supplied string. Expressed in another way:
declare @SearchString as VarChar(10) = 'Voot';
select Buffer, case
when DataLength( Buffer ) != 2 then 0 -- NB: Len() right trims.
when PatIndex( '%' + Buffer + '%', @SearchString ) != 0 then 1
else 0 end as Match
from ( values
( 'vo' ), ( 'go' ), ( 'n ' ), ( 'po' ), ( 'et' ), ( 'ry' ),
( 'oo' ) ) as Samples( Buffer );
In this case you could simply pass the value of @SearchString as a parameter and avoid the issue of the IN clause.
Alternatively, the character pairs could be passed as a table-valued parameter and used with IN: where Buffer in ( select CharacterPair from @CharacterPairs ).
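The table-valued parameter variant could look roughly like the sketch below. The type name dbo.CharacterPairList and the procedure name dbo.FindByCharacterPairs are hypothetical; myTable and myVariable are taken from the question. GO is the batch separator used by tools such as SSMS and sqlcmd.

```sql
-- Hypothetical table type holding the 2-grams.
CREATE TYPE dbo.CharacterPairList AS TABLE (CharacterPair char(2) NOT NULL);
GO

-- Hypothetical procedure; table-valued parameters must be declared READONLY.
CREATE PROCEDURE dbo.FindByCharacterPairs
    @CharacterPairs dbo.CharacterPairList READONLY
AS
BEGIN
    SELECT t.*
    FROM myTable AS t
    WHERE t.myVariable IN (SELECT CharacterPair FROM @CharacterPairs);
END;
GO

-- Usage: the pairs travel as data, never as SQL text.
DECLARE @pairs dbo.CharacterPairList;
INSERT INTO @pairs (CharacterPair) VALUES ('rx'), ('x4'), ('48'), ('80');
EXEC dbo.FindByCharacterPairs @CharacterPairs = @pairs;
```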
As far as SQL injection goes, limiting the text to character pairs does preclude adding complete statements. It does, as others have noted, allow for corrupting the query and causing it to fail. That, in my mind, constitutes a problem.
I'm still trying to imagine a use-case for this rather odd pattern matching. It won't match a column value longer (or shorter) than two characters against a search string.
There definitely should be a canonical answer to all these innumerable "if I have [some special kind of data treatment], will my query still be vulnerable?" questions.
First of all, you should ask yourself: why are you looking to buy yourself such an indulgence? What is the reason? Why do you want to add an exception to your data processing? Why separate your data into the sheep and the goats, telling yourself "this data is 'safe', so I won't process it properly, and that data is unsafe, so I'll have to do something"?
The only reason such a question could even appear is your application architecture. Or, rather, the lack of architecture. Because only in spaghetti code, where user input is added directly to the query, can such a question ever occur. Otherwise, your database layer should be able to process any kind of data, being totally ignorant of its nature, origin or alleged "safety".

Query dynamic data from SQL table

I'm using SQL Server 2012.
I want to query data from a specific SQL column that meets certain criteria. This column contains free form text entered by a user. The user can enter whatever he/she wants, but always includes a URL which may be entered anywhere within the free form text.
Each URL is similar and contains consistent elements, such as the domain, but also references a unique "article ID" number within the URL. Think of these numbers as referencing knowledge base articles.
The article ID is a different number depending on the article used and new articles are regularly created.
I need a query identifying all of these article ID numbers within the URLs. The only means I've developed so far is to use SUBSTRING to count characters until reaching the article ID number. This is unreliable since users don't always include the URL at the beginning. It would be better if I could tell SUBSTRING to count from the beginning of the URL regardless of where it resides within the text.
For example, it could begin counting whenever it finds 'HTTP://' or a common keyword each URL contains. Another option would be if I could extract the URL into its own table. I've yet to figure out how to execute either of these ideas inside SQL.
The following is what I have so far.
select
scl.number,
ol.accountnum,
scl.opendate as CallOpenDate,
sce.opendate as NoteEntryDate,
sce.notes,
substring(sce.notes, 102, 4) as ArticleID,
sclcc.pmsoft,
ol.territorydesc
from (select * from supportcallevent as sce
where sce.opendate > '2014-04-01 00:00:00.000') as sce
inner join supportcalllist as scl on scl.SupportCallID=sce.supportcallid
inner join organizationlist as ol on ol.partyid=scl.partyid
inner join supportcalllist_custcare as sclcc on sclcc.supportcallid=scl.supportcallid
where sce.notes like '%http://askus.how%'
order by ol.territorydesc, scl.number;
You can use the CHARINDEX function to find the URL in the string and start the substring from there.
This example will get the next 4 digits after the url:
DECLARE @str VARCHAR(100)
DECLARE @find VARCHAR(100)
SET @str = 'waawhbu aoffawh http://askus.how/1111 auwhauowd'
SET @find = 'http://askus.how/'
-- 17 is the length of @find; 4 is the assumed length of the article ID
SELECT SUBSTRING(@str,CHARINDEX(@find, @str)+17,4)
SQLFiddle
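If the article IDs are not always four digits long, a variation along these lines might help. It is only a sketch: it assumes the ID consists of digits directly after the URL prefix and that the prefix is present in the text (the question's WHERE clause already filters for that). The variable @tail is introduced here for illustration.

```sql
DECLARE @str  varchar(100) = 'waawhbu aoffawh http://askus.how/1111 auwhauowd';
DECLARE @find varchar(100) = 'http://askus.how/';

-- Text that follows the URL prefix; LEN(@find) replaces the hard-coded 17.
DECLARE @tail varchar(100) =
    SUBSTRING(@str, CHARINDEX(@find, @str) + LEN(@find), LEN(@str));

-- Keep the leading digits only; appending 'x' guarantees that PATINDEX
-- finds a non-digit even when the ID is the last thing in the text.
SELECT LEFT(@tail, PATINDEX('%[^0-9]%', @tail + 'x') - 1) AS ArticleID;
```

For the sample string this yields 1111, the same value the fixed-length version returns.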

SQL - Conditionally joining two columns in same table into one

I am working with a table that contains two versions of stored information. To simplify it, one column contains the old description of a file run while another column contains the updated standard for displaying ran files. It gets more complicated in that the older column can have multiple standards within itself. The table:
Old Column            New Column
Desc: LGX/101/rpt     null
null                  Home
Print: LGX/234/rpt    null
null                  Print
null                  Page
I need to combine the two columns into one, but I also need to delete the "Print: " and "Desc: " string from the beginning of the old column values. Any suggestions? Let me know if/when I'm forgetting something you need to know!
(I am writing in Cache SQL, but I'd just like a general approach to my problem, I can figure out the specifics past that.)
EDIT: the condition is that if substr(oldcol,1,6) = 'desc: ' then substr(oldcol,7)
else if substr(oldcol,1,7) = 'print: ' then substr(oldcol,8) etc., so as to take out the "desc: " and the "print: " and sanitize the data somewhat.
EDIT2: I want to make the table look like this:
Col
LGX/101/rpt
Home
LGX/234/rpt
Print
Page
It's difficult to understand what you are looking for exactly. Does the above represent before/after, or both columns that need combining/merging?
My guess is that COALESCE might be able to help you. It takes a bunch of parameters and returns the first non-NULL one.
It looks like you're wanting to grab values from new if old is NULL and old if new is null. To do that you can use a case statement in your SQL. I know CASE statements are supported by MySQL, I'm not sure if they'll help you here.
SELECT (CASE WHEN old_col IS NULL THEN new_col ELSE old_col END) as val FROM table_name
This will grab new_col if old_col is NULL, otherwise it will grab old_col.
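The same merge can be written with COALESCE, and the prefixes can be stripped in the same expression. The sketch below uses T-SQL syntax and the sample rows from the question, so function names may need adjusting for Cache SQL. The column names old_col and new_col follow the query above, and the alias t is made up.

```sql
SELECT CASE
           -- 'Desc: ' is 6 characters long, so the value starts at position 7.
           WHEN old_col LIKE 'Desc: %'  THEN SUBSTRING(old_col, 7, 8000)
           -- 'Print: ' is 7 characters long, so the value starts at position 8.
           WHEN old_col LIKE 'Print: %' THEN SUBSTRING(old_col, 8, 8000)
           -- Otherwise take whichever column is not NULL.
           ELSE COALESCE(old_col, new_col)
       END AS val
FROM (VALUES ('Desc: LGX/101/rpt', NULL),
             (NULL, 'Home'),
             ('Print: LGX/234/rpt', NULL),
             (NULL, 'Print'),
             (NULL, 'Page')) AS t(old_col, new_col);
```

This should produce the five values listed in the question's desired output: LGX/101/rpt, Home, LGX/234/rpt, Print and Page.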
You can remove the "Print:" and "Desc:" prefixes by using a combination of the CHARINDEX and SUBSTRING functions:
SELECT CASE WHEN CHARINDEX(':',COALESCE(OldCol,NewCol)) > 0 THEN
LTRIM(SUBSTRING(COALESCE(OldCol,NewCol),CHARINDEX(':',COALESCE(OldCol,NewCol))+1,8000))
ELSE
COALESCE(OldCol,NewCol)
END AS Newcolvalue
FROM [SchemaName].[TableName]
CHARINDEX gives the position of the character/string you are searching for.
So you get the position of ":" in the combined value (the COALESCE part) and pass that position, plus 1, to SUBSTRING as the starting point, which returns the part after the ":". LTRIM then removes the space that follows the colon. Now you have a string without "Desc:" and "Print:".
Hope this helps.

Google Style Search Suggestions with Levenshtein Edit Distance

OK guys, I'm working on search suggestions using jQuery-UI AutoComplete with results from a SQL Server 2008 DB, using the AdventureWorks DB Products table for testing. I want to search across 2 fields in this example: ProductNumber and Name.
I asked 2 questions earlier relating to this (here and here),
and I've come up with this so far:
CREATE procedure [dbo].[procProductAutoComplete]
(
@searchString nvarchar(100)
)
as
begin
declare @param nvarchar(100);
set @param = LOWER(@searchString);
WITH Results(result)
AS
(
select TOP 10 Name as 'result'
from Production.Product
where LOWER(Name) like '%' + @param + '%' or (0 <= dbo.lvn(@param, LOWER (Name), 6))
union
select TOP 10 ProductNumber as 'result'
from Production.Product
where LOWER(ProductNumber) like '%' + @param + '%' or (0 <= dbo.lvn(@param, LOWER(ProductNumber), 6))
)
SELECT TOP 20 * from Results
end;
My problem now is the ordering of the results. I am getting the correct results, but they are just ordered by the Name or product number and are not relevant to the input string.
For example, I can search for a product number starting with "BZ-" and the top returned results are ProductNums starting with "A", although I do get more relevant results elsewhere in the list.
Any ideas for sorting the results in terms of relevance to the search string?
EDIT:
In regards to the T-SQL implementation of the Levenshtein distance found here (linked to in previous question):
I am wondering what would be the best way to determine the MAX value to send to the function (6 in my example above).
Would it be best to choose an arbitrary value based on what "seems" to work well for my given data set? Or would it be best to adjust it dynamically based on the length of the input string?
My initial thoughts were that the value should be inversely proportional to the length of the searchString, so as the search string grows and becomes more specific, the tolerance decreases. Thoughts?
The Full Text Search feature seems to be the way to go when using SQL Server.
The relevance is the result of dbo.lvn(). It returns the number of operations needed to transform one string into the other. So the answer is simple:
ORDER BY dbo.lvn(@param, LOWER (Name), 6)
But this won't work in combination with LIKE, as LIKE does not return any relevance value. And using LIKE is not a good idea at all: if someone types "tooth" to buy "toothpaste", they would get "bluetooth" as a suggestion.
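Put together, ordering by distance might look like the sketch below. It assumes, based on the 0 <= dbo.lvn(...) test in the question, that dbo.lvn returns a negative value when the distance exceeds the maximum; that behaviour is not confirmed here. The distance column and the sample search string are added for illustration. Note also that TOP without ORDER BY, as used inside the original CTE, returns an arbitrary set of rows.

```sql
DECLARE @param nvarchar(100) = LOWER(N'BZ-');

WITH Results(result, distance) AS
(
    SELECT Name, dbo.lvn(@param, LOWER(Name), 6)
    FROM Production.Product
    UNION
    SELECT ProductNumber, dbo.lvn(@param, LOWER(ProductNumber), 6)
    FROM Production.Product
)
SELECT TOP 20 result
FROM Results
WHERE distance >= 0        -- assumed: negative means "beyond the maximum"
ORDER BY distance, result; -- closest matches first
```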
To make devlim faster read here:
https://stackoverflow.com/a/14261807/318765