Regular expressions inside SQL Server - sql

I have stored values in my database that look like 5XXXXXX, where X can be any digit. In other words, I need to match incoming SQL query strings like 5349878.
Does anyone have an idea how to do it?
I have different cases like XXXX7XX for example, so it has to be generic. I don't care about representing the pattern in a different way inside the SQL Server.
I'm working with c# in .NET.

You can write queries like this in SQL Server:
--each [0-9] matches a single digit, this would match 5xx
SELECT * FROM YourTable WHERE SomeField LIKE '5[0-9][0-9]'

stored value in DB is: 5XXXXXX [where x can be any digit]
You don't mention data types - if numeric, you'll likely have to use CAST/CONVERT to change the data type to [n]varchar.
Use:
WHERE CHARINDEX(column, '5') = 1
AND CHARINDEX(column, '.') = 0 --to stop decimals if needed
AND ISNUMERIC(column) = 1
References:
CHARINDEX
ISNUMERIC
i have also different cases like XXXX7XX for example, so it has to be generic.
Use:
WHERE PATINDEX('%7%', column) = 5
AND CHARINDEX(column, '.') = 0 --to stop decimals if needed
AND ISNUMERIC(column) = 1
References:
PATINDEX
Regex Support
SQL Server 2000+ supports regex, but the catch is you have to create the UDF function in CLR before you have the ability. There are numerous articles providing example code if you google them. Once you have that in place, you can use:
5\d{6} for your first example
\d{4}7\d{2} for your second example
For more info on regular expressions, I highly recommend this website.

Try this
select * from mytable
where p1 not like '%[^0-9]%' and substring(p1,1,1)='5'
Of course, you'll need to adjust the substring value, but the rest should work...

In order to match a digit, you can use [0-9].
So you could use 5[0-9][0-9][0-9][0-9][0-9][0-9] and [0-9][0-9][0-9][0-9]7[0-9][0-9][0-9]. I do this a lot for zip codes.

SQL Wildcards are enough for this purpose. Follow this link: http://www.w3schools.com/SQL/sql_wildcards.asp
you need to use a query like this:
select * from mytable where msisdn like '%7%'
or
select * from mytable where msisdn like '56655%'

Related

SQL function to transform number with a certain pattern

I need for a SQL query to transform an int with a value between 1 to 300000 to a number which has this pattern : always 8 number.
For example:
1 becomes 00000001,
123 becomes 00000123,
123456 becomes 00123456.
I have no idea how to do that... How can I do it?
In Standard SQL, you can use this trick:
select substring(cast( (num + 100000000) as varchar(255)) from 2)
Few databases actually support this syntax. Any given database can do what you want, but the method depends on the database you are using.
For MS SQL Server
You could use FORMAT function, like this:
SELECT FORMAT(123,'00000000')
https://database.guide/how-to-format-numbers-in-sql-server/#:~:text=Starting%20from%20SQL%20Server%202012,the%20output%20should%20be%20formatted.
Read at the link Leading Zeroes
For MySql/Oracle
You could use LPAD, like this:
SELECT LPAD('123',8,'0')
https://database.guide/how-to-add-leading-zeros-to-a-number-in-mysql/

SQL full text search behavior on numeric values

I have a table with about 200 million records. One of the columns is defined as varchar(100) and it's included in a full text index. Most of the values are numeric. Only few are not numeric.
The problem is that it's not working well. For example if a row contains the value '123456789' and i look for '567', it's not returning this row. It will only return rows where the value is exactly '567'.
What am I doing wrong?
sql server 2012.
Thanks.
Full text search doesn't support leading wildcards
In my setup, these return the same
SELECT *
FROM [dbo].[somelogtable]
where CONTAINS (logmessage, N'28400')
SELECT *
FROM [dbo].[somelogtable]
where CONTAINS (logmessage, N'"2840*"')
This gives zero rows
SELECT *
FROM [dbo].[somelogtable]
where CONTAINS (logmessage, N'"*840*"')
You'll have to use LIKE or some fancy trigram approach
The problem is probably that you are using a wrong tool since Full-text queries perform linguistic searches and it seems like you want to use simple "like" condition.
If you want to get a solution to your needs then you can post DDL+DML+'desired result'
You can do this:
....your_query.... LIKE '567%' ;
This will return all the rows that have a number 567 in the beginning, end or in between somewhere.
99% You're missing % after and before the string you search in the LIKE clause.
es:
SELECT * FROM t WHERE att LIKE '66'
is the same as as using WHERE att = '66'
if you write:
SELECT * FROM t WHERE att LIKE '%66%'
will return you all the lines containing 2 'sixes' one after other

How to substring records with variable length

I have a table which has a column with doc locations, such as AA/BB/CC/EE
I am trying to get only one of these parts, lets say just the CC part (which has variable length). Until now I've tried as follows:
SELECT RIGHT(doclocation,CHARINDEX('/',REVERSE(doclocation),0)-1)
FROM Table
WHERE doclocation LIKE '%CC %'
But I'm not getting the expected result
Use PARSENAME function like this,
DECLARE #s VARCHAR(100) = 'AA/BB/CC/EE'
SELECT PARSENAME(replace(#s, '/', '.'), 2)
This is painful to do in SQL Server. One method is a series of string operations. I find this simplest using outer apply (unless I need subqueries for a different reason):
select *
from t outer apply
(select stuff(t.doclocation, 1, patindex('%/%/%', t.doclocation), '') as doclocation2) t2 outer apply
(select left(tt.doclocation2), charindex('/', tt.doclocation2) as cc
) t3;
The PARSENAME function is used to get the specified part of an object name, and should not used for this purpose, as it will only parse strings with max 4 objects (see SQL Server PARSENAME documentation at MSDN)
SQL Server 2016 has a new function STRING_SPLIT, but if you don't use SQL Server 2016 you have to fallback on the solutions described here: How do I split a string so I can access item x?
The question is not clear I guess. Can you please specify which value you need? If you need the values after CC, then you can do the CHARINDEX on "CC". Also the query does not seem correct as the string you provided is "AA/BB/CC/EE" which does not have a space between it, but in the query you are searching for space WHERE doclocation LIKE '%CC %'
SELECT SUBSTRING(doclocation,CHARINDEX('CC',doclocation)+2,LEN(doclocation))
FROM Table
WHERE doclocation LIKE '%CC %'

SQL query - LEFT 1 = char, RIGHT 3-5 = numbers in Name

I need to filter out junk data in SQL (SQL Server 2008) table. I need to identify these records, and pull them out.
Char[0] = A..Z, a..z
Char[1] = 0..9
Char[2] = 0..9
Char[3] = 0..9
Char[4] = 0..9
{No blanks allowed}
Basically, a clean record will look like this:
T1234, U2468, K123, P50054 (4 record examples)
Junk data looks like this:
T12.., .T12, MARK, TP1, SP2, BFGL, BFPL (7 record examples)
Can someone please assist with a SQL query to do a LEFT and RIGHT method and extract those characters, and do a LIKE IN or something?
A function would be great though!
The following should work in a few different systems:
SELECT *
FROM TheTable
WHERE Data LIKE '[A-Za-z][0-9][0-9][0-9][0-9]%'
AND Data NOT LIKE '% %'
This approach will indeed match P2343, P23423JUNK, and other similar text but requires that the format is A0000*.
Now, if the OP implies a format of 1st position is a character and all succeeding positions are numeric, as in A0+, then use the following (in SQL Server and a good deal of other database systems):
SELECT *
FROM TheTable
WHERE SUBSTRING(Data, 1, 1) LIKE '[A-Za-z]'
AND SUBSTRING(Data, 2, LEN(Data) - 1) NOT LIKE '%[^0-9]%'
AND LEN(Data) >= 5
To incorporate this into a SQL Server 2008 function, since this appears to be what you'd like most, you can write:
CREATE FUNCTION ufn_IsProperFormat(#data VARCHAR(50))
RETURNS BIT
AS
BEGIN
RETURN
CASE
WHEN SUBSTRING(#Data, 1, 1) LIKE '[A-Za-z]'
AND SUBSTRING(#Data, 2, LEN(#Data) - 1) NOT LIKE '%[^0-9]%'
AND LEN(#Data) >= 5 THEN 1
ELSE 0
END
END
...and call into it like so:
SELECT *
FROM TheTable
WHERE dbo.ufn_IsProperFormat(Data) = 1
...this query needs to change for Oracle queries because Oracle doesn't appear to support bracket notation in LIKE clauses:
SELECT *
FROM TheTable
WHERE REGEXP_LIKE(Data, '^[A-za-z]\d{4,}$')
This is the expansion gbn is doing in his answer, but these versions allow for varying string lengths without the OR conditions.
EDIT: Updated to support examples in SQL Server and Oracle for ensuring the format A0+, so that A1324, A2342388, and P2342 match but A2342JUNK and A234 do not.
The Oracle REGEXP_LIKE code was borrowed from Mark's post but updated to support 4 or more numeric digits.
Added a custom SQL Server 2008 approach which implements these techniques.
Depends on your database. Many have regex functions (note examples not tested so check)
e.g. Oracle
SELECT x
FROM table
WHERE REGEXP_LIKE(x, '^[A-za-z][:digit:]{4}$')
Sybase uses LIKE
Given that you're allowing between 3 and 6 digits for the number in your examples then it's probably better to use the ISNUMERIC() function on the 2nd character onwards:
SELECT *
FROM TheTable
-- start with a letter
WHERE Data LIKE '[A-Za-z]%'
-- everything from 2nd character onwards is a number
AND ISNUMERIC( SUBSTRING( Data, 2, 50 ) ) = 1
-- number doesn't have a decimal place
AND Data NOT LIKE '%.%'
For more information look at the ISNUMERIC function on MSDN.
Also note that:
I've limited the 2nd part with the number to 50 characters maximum, change this to suit your needs.
Strictly speaking you should check for currency symbols etc, as ISNUMERIC allows them, as well as +/- and some others
A better option might be to create a function that checks that each character after the first is between 0 and 9 (or 1 and 0 if you're using ASCII codes).
You can't use Regular Expressions in SQL Server, so you have to use OR. Correcting David Andres' answer...
WHERE
(
Data LIKE '[A-Za-z][0-9][0-9][0-9]'
OR
Data LIKE '[A-Za-z][0-9][0-9][0-9][0-9]'
OR
Data LIKE '[A-Za-z][0-9][0-9][0-9][0-9][0-9]'
)
David's answer allows "D1234junk" through
You also only need "[A-Z]" if you don't have case sensitivity

SQL produced by Entity Framework for string matching

Given this linq query against an EF data context:
var customers = data.Customers.Where(c => c.EmailDomain.StartsWith(term))
You’d expect it to produce SQL like this, right?
SELECT {cols} FROM Customers WHERE EmailDomain LIKE #term+’%’
Well, actually, it does something like this:
SELECT {cols} FROM Customer WHERE ((CAST(CHARINDEX(#term, EmailDomain) AS int)) = 1)
Do you know why?
Also, replacing the Where selector to:
c => c.EmailDomain.Substring(0, term.Length) == term
it runs 10 times faster but still produces some pretty yucky SQL.
NOTE: Linq to SQL correctly translates StartsWith into Like {term}%, and nHibernate has a dedicated LikeExpression.
I don't know about MS SQL server but on SQL server compact LIKE 'foo%' is thousands time faster than CHARINDEX, if you have INDEX on seach column. And now I'm sitting and pulling my hair out how to force it use LIKE.
http://social.msdn.microsoft.com/Forums/en-US/adodotnetentityframework/thread/1b835b94-7259-4284-a2a6-3d5ebda76e4b
The reason is that CharIndex is a lot faster and cleaner for SQL to perform than LIKE. The reason is, that you can have some crazy "LIKE" clauses. Example:
SELECT * FROM Customer WHERE EmailDomain LIKE 'abc%de%sss%'
But, the "CHARINDEX" function (which is basically "IndexOf") ONLY handles finding the first instance of a set of characters... no wildcards are allowed.
So, there's your answer :)
EDIT: I just wanted to add that I encourage people to use CHARINDEX in their SQL queries for things that they didn't need "LIKE" for. It is important to note though that in SQL Server 2000... a "Text" field can use the LIKE method, but not CHARINDEX.
Performance seems to be about equal between LIKE and CHARINDEX, so that should not be the reason. See here or here for some discussion. Also the CAST is very weird because CHARINDEX returns an int.
charindex returns the location of the first term within the second term.
sql starts with 1 as the first location (0 = not found)
http://msdn.microsoft.com/en-us/library/ms186323.aspx
i don't know why it uses that syntax but that's how it works
I agree that it is no faster, I was retrieving tens of thousands of rows from our database with the letter i the name. I did find however that you need to use > rather than = ... so use
{cols} FROM Customer WHERE ((CAST(CHARINDEX(#term, EmailDomain) AS int)) > 0)
rather than
{cols} FROM Customer WHERE ((CAST(CHARINDEX(#term, EmailDomain) AS int)) = 1)
Here are my two tests ....
select * from members where surname like '%i%' --12 seconds
select * from sc4_persons where ((CAST(CHARINDEX('i', surname) AS int)) > 0) --12 seconds
select * from sc4_persons where ((CAST(CHARINDEX('i', surname) AS int)) = 1) --too few results