Advanced word searching in SQL

I need to write a query in SQL Server that selects rows containing two words with (at least / at most / exactly) a specified number of words between them.
I wrote this code for an exact number of words in between:
SELECT simpledtext
FROM booktexts
WHERE simpledtext LIKE '%hello [^ ] [^ ] search%'
and this code for a minimum number of words in between:
SELECT simpledtext
FROM booktexts
WHERE simpledtext LIKE '%hello [^ ] [^ ] % search%'
But I don't know how to write the T-SQL for a maximum number of words in between.
My other question: is it possible to implement these kinds of queries with Full-Text Search in SQL Server 2012?

Your like string would only match single character words. If this is what you need, you could put something together like this:
declare @str1 varchar(1024) = 'and hello w w w search how are you',
@str2 varchar(1024) = 'and hello w w search how are you',
@likeStr varchar(512),
@pos int,
@maxMatch int;
set @maxMatch = 2;
set @pos = 0;
set @likeStr = '%hello';
-- append one single-character "word" placeholder per loop iteration
while (@pos < @maxMatch)
begin
set @likeStr += ' [^ ]';
set @pos += 1;
end
set @likeStr += ' search%';
select @likeStr, (case when @str1 like @likeStr then 1 else 0 end), (case when @str2 like @likeStr then 1 else 0 end)
If this isn't what you need, and you know how many characters the words are going to be, you could use [a-zA-Z] in the like string in the loop.
However, I expect this also will not be what you're after. My suggestion would then be to abandon like strings, and move on to the more sophisticated regular expressions.
Unfortunately you can't load System.dll directly into SQL Server 2008 (I think this also applies to SQL Server 2012), so you would need to create a custom .NET assembly and load this into your database. You should use the IsDeterministic annotation in your .NET code, and load the custom assembly into SQL Server with permission_set = safe. This should ensure you get parallelism for your function, and that you can use it in places like computed columns.
SQL Server is very good at running .NET code, i.e. it can be very performant. Writing what you need in regular expressions should be very easy.
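To illustrate the regular-expression idea outside SQL Server, here is a minimal Python sketch of the three cases (at least / at most / exactly N words in between). The function name words_between and its parameters are made up for this example; a CLR function would have to express the same pattern with .NET's Regex class, which is not shown here.

```python
import re


def words_between(first, second, min_words=0, max_words=None):
    """Build a regex matching `first`, then min_words..max_words
    whitespace-separated words, then `second`.
    max_words=None means "no upper limit"."""
    upper = "" if max_words is None else str(max_words)
    # Each in-between word is: some whitespace followed by non-whitespace.
    gap = r"(?:\s+\S+){%d,%s}" % (min_words, upper)
    return re.compile(
        r"\b" + re.escape(first) + gap + r"\s+" + re.escape(second) + r"\b"
    )


at_most_two = words_between("hello", "search", 0, 2)
exactly_two = words_between("hello", "search", 2, 2)
at_least_two = words_between("hello", "search", 2, None)

three_words = "and hello w w w search how are you"
two_words = "and hello w w search how are you"

print(bool(at_most_two.search(three_words)))   # three words in between
print(bool(at_most_two.search(two_words)))     # two words in between
print(bool(at_least_two.search(three_words)))
```

Unlike the LIKE pattern, the in-between words here can be of any length, because \S+ matches one or more non-space characters.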
As for Full Text Search, contains() is basically a Full Text predicate, and you would have to enable this in SQL Server to use it. near() is used inside contains() predicates. I think this is bulky for what you want to do, both in terms of supported functionality (it does inflections of words for fuzzy matching), and what you need to enable to use it (runs an extra windows service).

Related

How to update text using "regular expressions" in SQL Server?

In a column in a SQL Server database table, the value has a format of X=****;Y=****;Z=5****, where the asterisks represent strings of any lengths and of any values. What I need to do is to change that 5 to a 4 and keep the rest of the string unchanged.
Is there a way to use something like regular expressions to achieve what I want to do? If not using regular expressions, can it be done at all?
MS SQL sadly doesn't have any built-in regex support (although it can be added via CLR), but if the format is fixed so that the part you want to change is Z=5 to Z=4, then using REPLACE should work:
REPLACE(your_string,'Z=5','Z=4')
For example:
declare @t table (str varchar(max))
insert @t values
('X=****;Y=****;Z=5****'),
('X=****;Y=**df**;Z=3**sdf**'),
('X=11**;Y=**sdfdf**;Z=5**')
update @t
set str = replace(str,'Z=5','Z=4')
-- or a slightly more ANSI compliant and portable way
-- (the WHERE clause leaves rows that do not contain 'Z=5' untouched)
update @t
set str = SUBSTRING(str,0, CHARINDEX('Z=5', str)) + 'Z=4' + SUBSTRING(str, CHARINDEX('Z=5', str)+3,LEN(str))
where CHARINDEX('Z=5', str) > 0
select * from @t
str:
X=****;Y=****;Z=4****
X=****;Y=**df**;Z=3**sdf**
X=11**;Y=**sdfdf**;Z=4**
We need more information. Under what circumstances should 5 be replaced by 4? If it's just where it occurs as the first character after the Z=, then you could simply do...
set Col = Replace(Col,'Z=5','Z=4')
Or do you just want to replace 5 with 4 anywhere in the column value? In that case you'd just do...
set Col = Replace(Col,'5','4')
Or possibly you mean that 5's should be replaced by 4's anywhere within the value after Z= which would be a lot harder.
update Table set Field = replace(Field, ';Z=5', ';Z=4')
And let's hope that your asterisked data doesn't contain semicolons and equality signs...
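For comparison, this is roughly what the regex version would look like if a regex engine were available (for example through CLR). It is a minimal Python sketch, not SQL Server code, and the helper name bump_z is made up for this example. It only rewrites Z=5 when it starts a field, i.e. at the start of the string or right after a semicolon, so it still assumes the asterisked data contains no ";Z=5" of its own.

```python
import re


def bump_z(value):
    """Change a leading 5 in the Z field to 4, leaving the rest untouched.

    (^|;) anchors the match to the start of a field, so a 'Z=5' that
    appears in the middle of another field's data is not rewritten.
    """
    return re.sub(r"(^|;)Z=5", r"\g<1>Z=4", value)


for row in ("X=****;Y=****;Z=5****",
            "X=****;Y=**df**;Z=3**sdf**",
            "X=11**;Y=**sdfdf**;Z=5**"):
    print(bump_z(row))
```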

Like with Regular Expression not giving right result in sql server

declare @test varchar(50)
set @test='sad#fd'
if @test LIKE '%[a-zA-Z0-9 ./,()?''+-]%'
print 'yes'
else
print 'no'
The code above prints yes, but it should print no because '#' is not in the set of characters I allow in the pattern. Is there anything wrong?
I want to handle this in my stored procedure, where the string must be alphanumeric with only a specified list of special characters allowed. What should I do?
The result is "yes" because the string contains a letter (for example s) that matches the pattern; with % on both sides, LIKE only needs one matching character.
To make this clearer, try running the code below:
declare @test varchar(1000)
set @test='####'
if @test LIKE '%[a-zA-Z0-9 ./,()?''+-]%'
print 'yes'
else
print 'no'
SQL Server doesn't really have native regular expressions [1], but what you're trying to achieve can still be done with LIKE by introducing a double negative:
declare @test varchar(50)
set @test='sad#fd'
if @test NOT LIKE '%[^a-zA-Z0-9 ./,()?''+-]%'
print 'yes'
else
print 'no'
% matches any number of characters. ^ inverts a character range. So, now we're asking: is the string any number of characters, then a character not in the set a-zA-Z0-9 ./,()?''+-, then any number of characters? Or, to put it another way, does this string contain any characters outside of the given set of characters? If it does not (NOT LIKE), the string is valid.
[1] You can access a fully featured regex engine from the .NET framework by using the CLR integration. It's one of the usual samples given when talking about CLR integration. But it's not really needed here.
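The same double negative can be checked outside SQL Server. Here is a minimal Python sketch of the logic (the name is_allowed is made up for this example): the string is accepted when no character outside the allowed set can be found.

```python
import re

# One character that is NOT in the allowed set.
# Same set as the LIKE pattern: letters, digits, space and . / , ( ) ? ' + -
DISALLOWED = re.compile(r"[^a-zA-Z0-9 ./,()?'+\-]")


def is_allowed(text):
    """True when every character of text is in the allowed set."""
    return DISALLOWED.search(text) is None


print("yes" if is_allowed("sad#fd") else "no")
print("yes" if is_allowed("sad fd") else "no")
```

Note that an empty string passes this check, just as it passes the NOT LIKE test.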

How to write SQL query with many % wildcard characters

I have a column in a SQL Server table as:
companystring = {"CompanyId":0,"CompanyType":1,"CompanyName":"Test 215","TradingName":"Test 215","RegistrationNumber":"Test 215","Email":"test215@tradeslot.com","Website":"Test 215","DateStarted":"2012","CompanyValidationErrors":[],"CompanyCode":null}
I want to query the column to search for
companystring like '%CompanyName":"%test 2%","%'
I want to know if I'm querying correctly, because for some search strings it does not yield the proper result. Could anyone please help me with this?
% is a special character that means a wildcard. If you want to find the actual character inside a string, you need to escape it.
DECLARE @d TABLE(id INT, s VARCHAR(32));
INSERT @d VALUES(1,'foo%bar'),(2,'fooblat');
SELECT id, s FROM @d WHERE s LIKE 'foo[%]%'; -- returns only 1
SELECT id, s FROM @d WHERE s LIKE 'foo%'; -- returns both 1 and 2
Depending on your platform, you might be able to use some combination of regular expressions and/or lambda expressions which are built into its main libraries. For example, .NET has LINQ, which is a powerful tool that abstracts querying and can help with searches like this.
It looks like you have JSON data stored in a column called "companystring". If you want to search within the JSON data from SQL things get very tricky.
I would suggest you look at doing some extra processing at insert/update to expose the properties of the JSON you want to search on.
If you search in the way you describe, you would actually need to use Regular Expressions or something else to make it reliable.
In your example you say you want to search for:
companystring like '%CompanyName":"%test 2%","%'
I understand this as searching inside the JSON for the string "test 2" somewhere inside the "CompanyName" property. Unfortunately this would also return results where "test 2" was found in any other property after "CompanyName", such as the following:
-- formatted for readability
companystring = '{
"CompanyId":0,
"CompanyType":1,
"CompanyName":"Test Something 215",
"TradingName":"Test 215",
"RegistrationNumber":"Test 215",
"Email":"test215@tradeslot.com",
"Website":"Test 215",
"DateStarted":"2012",
"CompanyValidationErrors":[],
"CompanyCode":null}'
Even though "test 2" isn't in the CompanyName, it is in the text following it (TradingName), which is also followed by the string "," so it would meet your search criteria.
Another option would be to create a view that exposes the value of CompanyName using a column defined as follows:
LEFT(
SUBSTRING(companystring, CHARINDEX('"CompanyName":"', companystring) + LEN('"CompanyName":"'), LEN(companystring)),
CHARINDEX('"', SUBSTRING(companystring, CHARINDEX('"CompanyName":"', companystring) + LEN('"CompanyName":"'), LEN(companystring))) - 1
) AS CompanyName
Then you could query that view using WHERE CompanyName LIKE '%test 2%' and it would work, although performance could be an issue.
The logic of the above is to get everything after "CompanyName":":
SUBSTRING(companystring, CHARINDEX('"CompanyName":"', companystring) + LEN('"CompanyName":"'), LEN(companystring))
Up to but not including the first " in the sub-string (which is why it is used twice).
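If the search can be done in application code instead, parsing the JSON avoids the false positive described above. This is a minimal Python sketch under that assumption; the function name company_name_contains is made up for this example, and the comparison is made case-insensitive on purpose to mimic a typical case-insensitive SQL Server collation.

```python
import json


def company_name_contains(companystring, needle):
    """Return True when needle occurs inside the CompanyName property only."""
    data = json.loads(companystring)
    # A missing or null CompanyName is treated as an empty string.
    name = data.get("CompanyName") or ""
    return needle.lower() in name.lower()


sample = (
    '{"CompanyId":0,"CompanyType":1,'
    '"CompanyName":"Test Something 215",'
    '"TradingName":"Test 215"}'
)

# "test 2" only appears in TradingName here, so this is not a match.
print(company_name_contains(sample, "test 2"))
```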

Access SQL Query - Selecting middle section of many different types of strings

How do I choose the middle section of a string when the desired section is surrounded by the same character "/", and the section does not always start at the same index?
e.g. "NWST/330/23/WT6" to "330"
and "NTW/1010/43/TY7" to "1010"
and "TYQT/99/WYT3" to "99"
I have tried combinations of SQL functions including CharIndex, Len, Left, Right, Mid, InStr and InStrRev. Please help!
You could use the VBA Split() function. You give it a string and tell it what to use as the delimiter; it returns an array of substrings. In your case, it seems you want the second substring, and since the array numbering is zero-based:
? Split("NWST/330/23/WT6", "/")(1)
330
You can't use that function directly in an Access query, but you can create a custom function which uses it.
Public Function CustomSplit(ByVal pInput As String) As String
CustomSplit = Split(pInput, "/")(1)
End Function
Then, from the Immediate Window:
? CustomSplit("NWST/330/23/WT6")
330
So you could use CustomSplit() in a query you run from inside an Access session. However, if you're using some other method (classic ASP, .NET, etc.) to query an Access database, user-defined functions are not available, so you would need to use a different approach.
So if your text is in a field named raw_text, the query could be this:
SELECT
raw_text,
CustomSplit(raw_text) AS middle_section
FROM YourTableNameHere;
If you prefer a query without a custom function, you can use some of the functions you mentioned in your question.
SELECT
raw_text,
Mid(Left(raw_text, InStr(InStr(1, raw_text, "/") + 1,
raw_text, "/") - 1), InStr(1, raw_text, "/") + 1)
AS middle_section
FROM YourTableNameHere;
Either of those queries produces this as the output:
raw_text middle_section
NWST/330/23/WT6 330
NTW/1010/43/TY7 1010
TYQT/99/WYT3 99
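The same split-and-take-the-second-piece logic, written as a minimal Python sketch for anyone who post-processes the rows outside Access. The function name middle_section is made up for this example; returning None for strings with fewer than two delimiters is a choice made here, since the VBA version would raise an error in that case.

```python
def middle_section(raw_text, delimiter="/"):
    """Return the text between the first and second delimiter,
    or None when the string has fewer than two delimiters."""
    parts = raw_text.split(delimiter)
    # Index 1 is the second piece, just like Split(...)(1) in VBA.
    return parts[1] if len(parts) > 2 else None


for text in ("NWST/330/23/WT6", "NTW/1010/43/TY7", "TYQT/99/WYT3"):
    print(text, middle_section(text))
```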
The below should do it.
declare @input varchar(255);
declare @input2 varchar(255)
set @input = 'NTW/1010/43/TY7'
-- everything after the first '/'
set @input2 = substring(@input,charindex('/', @input) + 1, len(@input))
-- everything before the next '/'
select substring(@input2, 0, charindex('/', @input2))
Hardly the most elegant, but functional.
I don't have Access installed, but this should work, or at least get you close (the third argument of MID is a length, so it is the distance between the two slashes minus one):
=MID(<str>,INSTR(1,<str>,"/")+1,INSTR(INSTR(1,<str>,"/")+1,<str>,"/")-INSTR(1,<str>,"/")-1)
Here is how to do it in Excel (which I tested)
=MID(A4,SEARCH("/",A4,1)+1,SEARCH("/",A4,SEARCH("/",A4)+1)-SEARCH("/",A4)-1)
and SQL
declare @theStr VARCHAR(200)
set @theStr = 'NWST/330/23/WT6'
-- start after the first '/', length = position of second '/' - position of first '/' - 1
select SUBSTRING(@theStr,charindex('/',@theStr)+1,
CHARINDEX('/',@theStr,charindex('/',@theStr)+1)-charindex('/',@theStr)-1)

SQL query - LEFT 1 = char, RIGHT 3-5 = numbers in Name

I need to filter out junk data in a SQL Server 2008 table. I need to identify these records, and pull them out.
Char[0] = A..Z, a..z
Char[1] = 0..9
Char[2] = 0..9
Char[3] = 0..9
Char[4] = 0..9
{No blanks allowed}
Basically, a clean record will look like this:
T1234, U2468, K123, P50054 (4 record examples)
Junk data looks like this:
T12.., .T12, MARK, TP1, SP2, BFGL, BFPL (7 record examples)
Can someone please assist with a SQL query to do a LEFT and RIGHT method and extract those characters, and do a LIKE IN or something?
A function would be great though!
The following should work in a few different systems:
SELECT *
FROM TheTable
WHERE Data LIKE '[A-Za-z][0-9][0-9][0-9][0-9]%'
AND Data NOT LIKE '% %'
This approach will indeed match P2343, P23423JUNK, and other similar text but requires that the format is A0000*.
Now, if the OP implies a format of 1st position is a character and all succeeding positions are numeric, as in A0+, then use the following (in SQL Server and a good deal of other database systems):
SELECT *
FROM TheTable
WHERE SUBSTRING(Data, 1, 1) LIKE '[A-Za-z]'
AND SUBSTRING(Data, 2, LEN(Data) - 1) NOT LIKE '%[^0-9]%'
AND LEN(Data) >= 5
To incorporate this into a SQL Server 2008 function, since this appears to be what you'd like most, you can write:
CREATE FUNCTION ufn_IsProperFormat(@data VARCHAR(50))
RETURNS BIT
AS
BEGIN
RETURN
CASE
WHEN SUBSTRING(@data, 1, 1) LIKE '[A-Za-z]'
AND SUBSTRING(@data, 2, LEN(@data) - 1) NOT LIKE '%[^0-9]%'
AND LEN(@data) >= 5 THEN 1
ELSE 0
END
END
...and call into it like so:
SELECT *
FROM TheTable
WHERE dbo.ufn_IsProperFormat(Data) = 1
...this query needs to change for Oracle queries because Oracle doesn't appear to support bracket notation in LIKE clauses:
SELECT *
FROM TheTable
WHERE REGEXP_LIKE(Data, '^[A-Za-z]\d{4,}$')
This is the expansion gbn is doing in his answer, but these versions allow for varying string lengths without the OR conditions.
EDIT: Updated to support examples in SQL Server and Oracle for ensuring the format A0+, so that A1324, A2342388, and P2342 match but A2342JUNK and A234 do not.
The Oracle REGEXP_LIKE code was borrowed from Mark's post but updated to support 4 or more numeric digits.
Added a custom SQL Server 2008 approach which implements these techniques.
Depends on your database. Many have regex functions (note examples not tested so check)
e.g. Oracle
SELECT x
FROM table
WHERE REGEXP_LIKE(x, '^[A-Za-z][[:digit:]]{4}$')
Sybase uses LIKE
Given that you're allowing between 3 and 6 digits for the number in your examples then it's probably better to use the ISNUMERIC() function on the 2nd character onwards:
SELECT *
FROM TheTable
-- start with a letter
WHERE Data LIKE '[A-Za-z]%'
-- everything from 2nd character onwards is a number
AND ISNUMERIC( SUBSTRING( Data, 2, 50 ) ) = 1
-- number doesn't have a decimal place
AND Data NOT LIKE '%.%'
For more information look at the ISNUMERIC function on MSDN.
Also note that:
I've limited the 2nd part with the number to 50 characters maximum, change this to suit your needs.
Strictly speaking you should check for currency symbols etc, as ISNUMERIC allows them, as well as +/- and some others
A better option might be to create a function that checks that each character after the first is between 0 and 9 (or between 48 and 57 if you're comparing ASCII codes).
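The per-character check suggested above can be sketched like this in Python (the function name is_letter_then_digits is made up for this example; a T-SQL function would loop over SUBSTRING in the same way). It rejects the signs and currency symbols that a permissive numeric test would let through.

```python
def is_letter_then_digits(value):
    """True when value is one ASCII letter followed by one or more
    characters that are each between '0' and '9'."""
    if len(value) < 2:
        return False
    head, tail = value[0], value[1:]
    if not (head.isascii() and head.isalpha()):
        return False
    # Compare each remaining character against the '0'..'9' range
    # (ASCII codes 48 to 57).
    return all("0" <= ch <= "9" for ch in tail)


for candidate in ("T1234", "T12..", "T+12", "T$12"):
    print(candidate, is_letter_then_digits(candidate))
```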
You can't use Regular Expressions in SQL Server, so you have to use OR. Correcting David Andres' answer...
WHERE
(
Data LIKE '[A-Za-z][0-9][0-9][0-9]'
OR
Data LIKE '[A-Za-z][0-9][0-9][0-9][0-9]'
OR
Data LIKE '[A-Za-z][0-9][0-9][0-9][0-9][0-9]'
)
David's answer allows "D1234junk" through
You also only need "[A-Z]" if you don't have case sensitivity
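For reference, where a regex engine is available the three OR'd LIKE patterns above collapse into a single bounded repetition. This is a minimal Python sketch checked against the sample records from the question; the 3 to 5 digit range is taken from the three LIKE patterns, and the name is_clean is made up for this example.

```python
import re

# One letter followed by 3 to 5 digits, nothing else.
# [0-9] is used instead of \d so that only ASCII digits are accepted.
CLEAN = re.compile(r"[A-Za-z][0-9]{3,5}")


def is_clean(value):
    """True when the whole value is a letter followed by 3 to 5 digits."""
    return CLEAN.fullmatch(value) is not None


clean = ["T1234", "U2468", "K123", "P50054"]
junk = ["T12..", ".T12", "MARK", "TP1", "SP2", "BFGL", "BFPL", "D1234junk"]

print([v for v in clean + junk if is_clean(v)])
```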