SQL query to match keywords? - sql

I have a table with an nvarchar(max) column containing text extracted from Word documents. How can I create a select query to which I'll pass a list of keywords as a parameter, and which returns the rows ordered by the number of matches?
Maybe it is possible with full text search?

Yes, it's possible with full text search, and that is likely the best answer. For a straight T-SQL solution, you could use a split function and a join, e.g. assuming a table of numbers called dbo.Numbers (you may need to decide on a different upper limit):
SET NOCOUNT ON;
DECLARE @UpperLimit INT;
SET @UpperLimit = 200000;

WITH n AS
(
    SELECT rn = ROW_NUMBER() OVER
        (ORDER BY s1.[object_id])
    FROM sys.objects AS s1
    CROSS JOIN sys.objects AS s2
    CROSS JOIN sys.objects AS s3
)
SELECT [Number] = rn - 1
INTO dbo.Numbers
FROM n
WHERE rn <= @UpperLimit + 1;

CREATE UNIQUE CLUSTERED INDEX n ON dbo.Numbers([Number]);
And a splitting function that uses that table of numbers:
CREATE FUNCTION dbo.SplitStrings
(
    @List NVARCHAR(MAX)
)
RETURNS TABLE
AS
    RETURN
    (
        SELECT DISTINCT
            [Value] = LTRIM(RTRIM(
                SUBSTRING(@List, [Number],
                CHARINDEX(N',', @List + N',', [Number]) - [Number])))
        FROM
            dbo.Numbers
        WHERE
            Number <= LEN(@List)
            AND SUBSTRING(N',' + @List, [Number], 1) = N','
    );
GO
Then you can simply say:
SELECT [key], NvarcharColumn /*, other cols */
FROM dbo.[table] AS outerT
WHERE EXISTS
(
    SELECT 1
    FROM dbo.[table] AS t
    INNER JOIN dbo.SplitStrings(N'list,of,words') AS s
        ON t.NvarcharColumn LIKE '%' + s.[Value] + '%'
    WHERE t.[key] = outerT.[key]
);
As a procedure:
CREATE PROCEDURE dbo.Search
    @List NVARCHAR(MAX)
AS
BEGIN
    SET NOCOUNT ON;

    SELECT [key], NvarcharColumn /*, other cols */
    FROM dbo.[table] AS outerT
    WHERE EXISTS
    (
        SELECT 1
        FROM dbo.[table] AS t
        INNER JOIN dbo.SplitStrings(@List) AS s
            ON t.NvarcharColumn LIKE '%' + s.[Value] + '%'
        WHERE t.[key] = outerT.[key]
    );
END
GO
Then you can just pass in @List (e.g. EXEC dbo.Search @List = N'foo,bar,splunge') from C#.
This won't be super fast, but I'm sure it will be quicker than pulling all the data out into C# and looping over it manually with nested loops.
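If you need the rows actually ordered by the number of keyword matches, as asked, you can count the hits per row rather than just testing for existence. A minimal sketch built on the same SplitStrings function (table and column names are placeholders, as above):
SELECT [key], NvarcharColumn /*, other cols */,
       MatchCount = (SELECT COUNT(*)
                     FROM dbo.SplitStrings(N'list,of,words') AS s
                     WHERE NvarcharColumn LIKE '%' + s.[Value] + '%')
FROM dbo.[table]
WHERE EXISTS (SELECT 1
              FROM dbo.SplitStrings(N'list,of,words') AS s
              WHERE NvarcharColumn LIKE '%' + s.[Value] + '%')
ORDER BY MatchCount DESC;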

how to ... return the rows ordered by the number of [full-text] matches
I have not used it myself, but I believe SQL Server 2008 supports weighting the CONTAINSTABLE matches, which might be of help to you:
http://msdn.microsoft.com/en-us/library/ms189760.aspx
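For reference, a hedged sketch of what that might look like using CONTAINSTABLE and its RANK column (this assumes a full-text index already exists, and it borrows the docs/summary/title names from the pseudo code further down):
SELECT d.title, ft.[RANK]
FROM docs AS d
INNER JOIN CONTAINSTABLE(docs, summary,
        N'ISABOUT("word1" WEIGHT(0.7), "word2" WEIGHT(0.3))') AS ft
    ON d.id = ft.[KEY]
ORDER BY ft.[RANK] DESC;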
If you don't have an engine that returns results weighted by the number of hits ...
You could write a UDF that takes two inputs and returns an integer: the first input is the big text value, and the second is the list of words you're looking for as a comma-delimited string. The function returns an integer representing either the number of distinct looked-for words that were found at least once in the text, or the total number of times the looked-for words were found. The implementation (how to weight) is up to you. For example, you might arrange the looked-for words from most important to least important and give a hit on an important word more weight than a hit on a less important one.
You could then use your full text search engine to find all records that contain at least one of the words (you'd OR them), and you'd run this result set through your UDF scalar function:
pseudo code
select title, weightfunction(summary, 'word1,word2,word3....wordN')
from docs
where summary contains ( word1 or word2 or word3 ... or wordN)
order by weightfunction(summary, 'word1,word2,word3....wordN') desc
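A rough T-SQL sketch of such a weight function (purely illustrative: it reuses a comma-splitting helper like the SplitStrings function shown earlier, weights every word equally, and counts how many distinct looked-for words appear at least once):
CREATE FUNCTION dbo.WeightFunction
(
    @Text nvarchar(max),
    @Words nvarchar(max)    -- comma-delimited list of looked-for words
)
RETURNS int
AS
BEGIN
    -- number of distinct looked-for words that occur at least once in @Text
    RETURN
    (
        SELECT COUNT(*)
        FROM dbo.SplitStrings(@Words) AS s
        WHERE @Text LIKE '%' + s.[Value] + '%'
    );
END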


SQL Query to replace characters by getting record list from other table

I wrote the query below to take an id as input, get the DocumentID from the Attachment table, and then use that id to get the document name from the Document table. Once I get the document name I remove anything but the characters a-z and numbers. The query below works fine if only one Document id is returned for the Entity id; how can I make it work if one entity id returns more than one Document ID? I also need to return all of those new names as well.
ALTER PROCEDURE [dbo].[NormalizeDocumentFileName1]
    -- Add the parameters for the stored procedure here
    @id nvarchar(16),
    @temp varchar(50) OUTPUT
AS
BEGIN
    Select @temp = Document.TheName from Document where id = (Select DocumentId from Attachment where EntityId = @id)
    Declare @KeepValues as varchar(50)
    Set @KeepValues = '%[^a-z0-9-_.]%'
    While PatIndex(@KeepValues, @temp) > 0
        Set @temp = Stuff(@temp, PatIndex(@KeepValues, @temp), 1, '')
END
Personally, I would go with a very different approach with this. I'm going to make use of Alan Burstein's NGrams8K.
You want to avoid the WHILE loop, as it'll perform awfully, and go with a set-based approach. I'm going to use a function instead:
CREATE FUNCTION NormalizeDocumentFileName (@FileName varchar(50))
RETURNS TABLE
AS RETURN
    WITH Tokens AS (
        SELECT *
        FROM dbo.NGrams8k(@FileName, 1) -- if you didn't create the function on the dbo schema, you'll need to change it
        WHERE token NOT LIKE '%[^a-z0-9-_.]%')
    SELECT CONVERT(varchar(50), (SELECT Token + ''
                                 FROM Tokens
                                 ORDER BY Position
                                 FOR XML PATH(''))) AS NormalFileName;
GO
Then you can do something as simple as:
SELECT D.YourColumn, NDFN.NormalFileName
FROM Document D
CROSS APPLY NormalizeDocumentFileName(D.TheName) NDFN;
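To cover the original requirement of an entity having more than one DocumentID, the same inline function can be applied per attached document instead of assigning a single OUTPUT parameter. A hedged sketch, assuming the Attachment/Document relationship from the question and @id as the entity id parameter:
SELECT d.id, NDFN.NormalFileName
FROM Attachment a
INNER JOIN Document d ON d.id = a.DocumentId
CROSS APPLY NormalizeDocumentFileName(d.TheName) NDFN
WHERE a.EntityId = @id;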
Another set-based function for this type of thing is PatExclude8K. It does the same thing as what Larnu put together and is reusable. You would have to use the link to get the T-SQL code to create the function. The function works like this:
DECLARE @string varchar(50) = '$$$123___!!!555.ABC???';
SELECT * FROM dbo.patexclude8k(@string, '[^A-Za-z0-9-_.]');
Returns:
NewString
------------
123___555.ABC
Note that what Larnu put together will return entity references for XML characters such as "&", ">", etc., but it will perform better than PatExclude. If you don't expect to deal with special XML characters you can use a slightly modified version that will perform about the same - here it is:
CREATE FUNCTION dbo.PatExclude8K_NXP
(
    @String VARCHAR(8000),
    @Pattern VARCHAR(50)
)
/*******************************************************************************
Purpose:
Given a string (@String) and a pattern (@Pattern) of characters to remove,
remove the patterned characters from the string.
Usage:
--===== Basic Syntax Example
SELECT NewString
FROM dbo.PatExclude8K_NXP(@String,@Pattern);
--===== Remove all but Alpha characters
SELECT NewString
FROM dbo.SomeTable st
CROSS APPLY dbo.PatExclude8K(st.SomeString,'%[^A-Za-z]%');
--===== Remove all but Numeric digits
SELECT NewString
FROM dbo.SomeTable st
CROSS APPLY dbo.PatExclude8K(st.SomeString,'%[^0-9]%');
Programmer Notes:
1. @Pattern is case sensitive (the function can easily be modified to change that)
2. There is no need to include the "%" before and/or after your pattern since we
   are evaluating each character individually
Revision History:
Rev 00 - 20180508 Initial Development - Alan Burstein
*******************************************************************************/
RETURNS TABLE WITH SCHEMABINDING AS
RETURN
WITH
E1(N) AS (SELECT N FROM (VALUES (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL)) AS X(N)),
itally(N) AS
(
    SELECT TOP(CONVERT(INT,LEN(@String),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM E1 T1 CROSS JOIN E1 T2 CROSS JOIN E1 T3 CROSS JOIN E1 T4
)
SELECT NewString =
(
    SELECT SUBSTRING(@String,N,1)
    FROM iTally
    WHERE 0 = PATINDEX(@Pattern,SUBSTRING(@String COLLATE Latin1_General_BIN,N,1))
    FOR XML PATH('')
);
Lastly, both NGrams8K and PatExclude perform quite a bit better when the optimizer chooses a parallel execution plan. To force a parallel plan you can use make_parallel by Adam Machanic. Using Larnu's solution as an example, you would force a parallel plan like so:
SELECT D.YourColumn, NDFN.NormalFileName
FROM Document D
CROSS APPLY NormalizeDocumentFileName(D.TheName) NDFN
CROSS APPLY dbo.make_parallel();
Hmmm. You can dispense with the while loop and use a recursive CTE:
declare @KeepValues as varchar(50) = '%[^a-z0-9-_.]%';  -- same pattern as in the original procedure

with cte as (
      select d.TheName, 0 as lev, d.TheName as orig_TheName
      from Document d
      where d.id = (Select DocumentId from Attachment where EntityId = @id)
      union all
      select Stuff(cte.thename, PatIndex(@KeepValues, cte.thename), 1, ''), lev + 1, cte.orig_TheName
      from cte
      where PatIndex(@KeepValues, cte.thename) > 0
     )
select theName
from (select theName, lev, max(lev) over (partition by orig_thename) as max_lev
      from cte
     ) x
where lev = max_lev;

Dynamic SP returning values in reverse order

I am using MS SQL and created one Dynamic stored procedure:
ALTER Procedure [dbo].[sp_MTracking]
(
    @OList varchar(MAX)
)
As
BEGIN TRY
    SET NOCOUNT ON
    DECLARE @SQL varchar(600)
    SET @SQL = 'select os.X,os.Y from Table1 as os join Table2 as s on os.sID=s.sID where s.SCode IN ('+ @OList +')'
    exec (@SQL)
END TRY
BEGIN CATCH
    Execute sp_DB_ErrorInfo
    Select -1 Result
END CATCH
GO
It is working properly, but I am getting x,y values in reverse order.
For example if I am passing 'scode1,scode2' as parameter, I am getting x,y values for scode1 in 2nd row and x,y values for scode2 as first row.
How can I fix this issue?
Thanks
This is a bit long for a comment.
SQL tables and results sets represent unordered sets. There is no ordering, unless you explicitly use an ORDER BY clause.
Your query does not have an ORDER BY. Hence, you have no reason to expect the results in any particular order. In addition, the ordering may be different on different runs of the query. If you want the results in a particular order, add ORDER BY.
Probably the easiest way is to use charindex():
order by charindex(',' + s.SCode + ',', ',' + @OList + ',')
This is a bit more cumbersome in dynamic sql:
SET @SQL = '
select os.X, os.Y
from Table1 os join
     Table2 s
     on os.sID = s.sID
where s.SCode IN (' + @OList + ')
order by charindex('','' + s.SCode + '','', '',' + @OList + ','')
';
Well, there are a couple of things here.
The first thing is what Gordon wrote - to ensure the order of the result set you must use the order by clause.
Second, as Devart demonstrated in his answer, you don't need dynamic sql for this kind of procedure.
Third, if you want your results ordered by the order of the parameters in the list, you should use a slightly different approach than the one Devart wrote.
Therefore, here are my 2 cents:
If you can change the stored procedure to accept a table valued parameter instead of VARCHAR(max), that would be your best option IMHO.
If not, you must use a split function to create a table from that varchar and then use that table in your select.
Note that you will have to choose a split function that returns a table with two columns - one for the value and one for its position in the original string.
Whatever the case may be, the rest of the sql should be something like this:
SELECT os.X, os.Y
FROM Table1 os
INNER JOIN Table2 s ON os.[sID] = s.[sID]
INNER JOIN @TVP t ON s.SCode = t.Value
ORDER BY t.Sort
That's assuming @TVP to be a table containing a Value column of the same data type as SCode in Table2, and a Sort column (an int, naturally).
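A hedged sketch of that TVP variant (the table type name SCodeList and its exact shape are hypothetical; the caller supplies the Sort values in the order it wants the results back):
CREATE TYPE dbo.SCodeList AS TABLE (Sort int NOT NULL, Value varchar(50) NOT NULL);
GO
CREATE PROCEDURE dbo.MTracking_TVP
    @OList dbo.SCodeList READONLY
AS
BEGIN
    SET NOCOUNT ON;
    SELECT os.X, os.Y
    FROM Table1 os
    INNER JOIN Table2 s ON os.[sID] = s.[sID]
    INNER JOIN @OList t ON s.SCode = t.Value
    ORDER BY t.Sort;
END
GO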
Without dynamic sql -
ALTER PROCEDURE [dbo].[sp_MTracking]
(
    @OList VARCHAR(MAX)
)
AS BEGIN
    SET NOCOUNT ON
    DECLARE @t TABLE (val VARCHAR(50) PRIMARY KEY WITH(IGNORE_DUP_KEY=ON))
    INSERT INTO @t
    SELECT item = t.c.value('.', 'VARCHAR(50)')
    FROM (
        SELECT txml = CAST('<r>' + REPLACE(@OList, ',', '</r><r>') + '</r>' AS XML)
    ) r
    CROSS APPLY txml.nodes('/r') t(c)
    SELECT os.X, os.Y
    FROM Table1 os
    JOIN Table2 s ON os.[sID] = s.[sID]
    WHERE s.SCode IN (SELECT * FROM @t)
    --OPTION(RECOMPILE)
END
GO

Finding strings with duplicate letters inside

Can somebody help me with this little task? What I need is a stored procedure that can find duplicate letters (in a row) in a string from a table "a" and after that make a new table "b" with just the id of the string that has a duplicate letter.
Something like this:
Table A
ID Name
1 Matt
2 Daave
3 Toom
4 Mike
5 Eddie
And from that table I can see that Daave, Toom, Eddie have duplicate letters in a row and I would like to make a new table and list their ID's only. Something like:
Table B
ID
2
3
5
Only 2, 3, 5, because those are the IDs of the names that contain duplicate letters in a row.
I hope this is understandable and would be very grateful for any help.
In your answer with the stored procedure you have two mistakes: one is a missing space between the column name and the LIKE clause, the second is missing single quotes around the search parameter.
First I create a user-defined scalar function which returns 1 if the string contains duplicate letters in a row:
EDITED
CREATE FUNCTION FindDuplicateLetters
(
    @String NVARCHAR(50)
)
RETURNS BIT
AS
BEGIN
    DECLARE @Result BIT = 0
    DECLARE @Counter INT = 1
    WHILE (@Counter <= LEN(@String) - 1)
    BEGIN
        IF(ASCII((SELECT SUBSTRING(@String, @Counter, 1))) = ASCII((SELECT SUBSTRING(@String, @Counter + 1, 1))))
        BEGIN
            SET @Result = 1
            BREAK
        END
        SET @Counter = @Counter + 1
    END
    RETURN @Result
END
GO
After the function is created, just call it from a simple SELECT query like the following:
SELECT
*
FROM
(SELECT
*,
dbo.FindDuplicateLetters(ColumnName) AS Duplicates
FROM TableName) AS a
WHERE a.Duplicates = 1
With this combination, you will get just the rows that have duplicate letters.
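Then, to get the question's "Table B" holding only the matching IDs, you can filter on the function and materialize the result (table and column names as in the question; SELECT ... INTO assumes TableB does not exist yet):
SELECT a.ID
INTO TableB
FROM TableA AS a
WHERE dbo.FindDuplicateLetters(a.Name) = 1;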
In any version of SQL, you can do this with a brute force approach:
select *
from t
where t.name like '%aa%' or
t.name like '%bb%' or
. . .
t.name like '%zz%'
If you have a case sensitive collation, then use:
where lower(t.name) like '%aa%' or
. . .
Here's one way.
First create a table of numbers
CREATE TABLE dbo.Numbers
(
number INT PRIMARY KEY
);
INSERT INTO dbo.Numbers
SELECT number
FROM master..spt_values
WHERE type = 'P'
AND number > 0;
Then with that in place you can use
SELECT *
FROM TableA
WHERE EXISTS (SELECT *
FROM dbo.Numbers
WHERE number < LEN(Name)
AND SUBSTRING(Name, number, 1) = SUBSTRING(Name, number + 1, 1))
Though this is an old post, it's worth posting a solution that will be faster than a brute force approach or one that uses a scalar UDF (which generally drags down performance). Using NGrams8K this is rather simple.
--sample data
declare @table table (id int identity primary key, [name] varchar(20));
insert @table([name]) values ('Mattaa'),('Daave'),('Toom'),('Mike'),('Eddie');

-- solution #1
select id
from @table
cross apply dbo.NGrams8k([name],1)
where charindex(replicate(token,2), [name]) > 0
group by id;

-- solution #2 (SQL 2012+ solution using LAG)
select id
from
(
    select id, token, prevToken = lag(token,1) over (partition by id order by position)
    from @table
    cross apply dbo.NGrams8k([name],1)
) prep
where token = prevToken
group by id; -- optional, if you want to remove possible duplicates
Another brute force way, using a regular-expression match operator (not T-SQL; this works in engines such as PostgreSQL that support regex):
select *
from t
where t.name ~ '(.)\1';

Generating an n-gram table with an SQL query

I'm trying to implement a fuzzy search with JavaScript client side, to search a largish db (300 items roughly) of records contained in an SQL database. My constraint is that it is not possible to perform a live query on the database- I must generate "indexes" as flat files during a nightly batch job. And so, starting with a db that looks like this:
ID. NAME
1. The Rain Man
2. The Electric Slide
3. Transformers
I need to create within a single query something like this:
Trigram ID
the 1
the 2
he_ 1
he_ 2
e_r 1
_ra 1
rai 1
ain 1
in_ 1
n_m 1
_ma 1
man 1
e_e 2
_el 2
ele 2
lec 2
Etc., etc., typos notwithstanding. The rules here are that 'n' is the length of the strings in the first column, that only a-z and _ are valid characters (any other character is normalized to lower case or mapped to _), and that a GROUP BY n-gram clause may be applied to the table. Thus, I would hope to get a table that would allow me to quickly look up a particular n-gram and get a list of all the IDs of rows which contain that sequence. I'm not a clever enough SQL cookie to figure this problem out. Can you?
I created a T-SQL nGrams function that works quite nicely; note the comment section for examples of how to use it.
CREATE FUNCTION dbo.nGrams8K
(
    @string VARCHAR(8000),
    @n TINYINT,
    @pad BIT
)
/*
Created by: Alan Burstein
Created on: 3/10/2014
Updated on: 5/20/2014 changed the logic to use an "inline tally table"
            9/10/2014 Added some more code examples in the comment section
            9/30/2014 Added more code examples
            10/27/2014 Small bug fix regarding padding
Use: Outputs a stream of tokens based on an input string.
Works just like mdq.nGrams; see http://msdn.microsoft.com/en-us/library/ff487027(v=sql.105).aspx.
n-gram defined:
In the fields of computational linguistics and probability,
an n-gram is a contiguous sequence of n items from a given
sequence of text or speech. The items can be phonemes, syllables,
letters, words or base pairs according to the application.
To better understand N-Grams see: http://en.wikipedia.org/wiki/N-gram
*/
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
WITH
E1(n) AS (SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) t(n)),
E2(n) AS (SELECT 1 FROM E1 a CROSS JOIN E1 b),
iTally(n) AS
(
    SELECT TOP (LEN(@string)+@n) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM E2 a CROSS JOIN E2 b
),
NewString(NewString) AS
(
    SELECT REPLICATE(CASE @pad WHEN 0 THEN '' ELSE ' ' END,@n-1)+@string+
           REPLICATE(CASE @pad WHEN 0 THEN '' ELSE ' ' END,@n-1)
)
SELECT TOP ((@n)+LEN(@string))
    n AS [sequence],
    SUBSTRING(NewString,n,@n) AS token
FROM iTally
CROSS APPLY NewString
WHERE n < ((@n)+LEN(@string));
/*
------------------------------------------------------------
-- (1) Basic Use
-------------------------------------------------------------
;-- (A)basic "string to table":
SELECT [sequence], token
FROM dbo.nGrams8K('abcdefg',1,1);
-- (b) create "bi-grams" (pad bit off)
SELECT [sequence], token
FROM dbo.nGrams8K('abcdefg',2,0);
-- (c) create "tri-grams" (pad bit on)
SELECT [sequence], token
FROM dbo.nGrams8K('abcdefg',3,1);
-- (d) filter for only "tri-grams"
SELECT [sequence], token
FROM dbo.nGrams8K('abcdefg',3,1)
WHERE len(ltrim(token)) = 3;
-- note the query plan for each. The power is coming from an index
-- also note how many rows are produced: len(#string+(#n-1))
-- lastly, you can trim as needed when padding=1
------------------------------------------------------------
-- (2) With a variable
------------------------------------------------------------
-- note, in this example I am getting only the stuff that has three letters
DECLARE #string varchar(20) = 'abcdefg',
#tokenLen tinyint = 3;
SELECT [sequence], token
FROM dbo.nGrams8K('abcdefg',3,1)
WHERE len(ltrim(token)) = 3;
GO
------------------------------------------------------------
-- (3) An on-the-fly alphabet (this will come in handy in a moment)
------------------------------------------------------------
DECLARE #alphabet VARCHAR(26)='ABCDEFGHIJKLMNOPQRSTUVWXYZ';
SELECT [sequence], token
FROM dbo.nGrams8K(#alphabet,1,0);
GO
------------------------------------------------------------
-- (4) Character Count
------------------------------------------------------------
DECLARE #string VARCHAR(100)='The quick green fox jumps over the lazy dog and the lazy dog just laid there.',
#alphabet VARCHAR(26)='ABCDEFGHIJKLMNOPQRSTUVWXYZ';
SELECT a.token, COUNT(b.token) ttl
FROM dbo.nGrams8K(#alphabet,1,0) a
LEFT JOIN dbo.nGrams8K(#string,1,0) b ON a.token=b.token
GROUP BY a.token
ORDER BY a.token;
GO
------------------------------------------------------------
-- (5) Locate the start position of a search pattern
------------------------------------------------------------
;-- (A) note these queries:
DECLARE #string varchar(100)='THE QUICK Green FOX JUMPED OVER THE LAZY DOGS BACK';
-- (i)
SELECT * FROM dbo.nGrams8K(#string,1,0) a;
-- (ii) note this query:
SELECT * FROM dbo.nGrams8K(#string,1,0) a WHERE [token]=' ';
-- (B) and now the word count (#string included for presentation)
SELECT #string AS string,
count(*)+1 AS words
FROM dbo.nGrams8K(#string,1,0) a
WHERE [token]=' '
GO
------------------------------------------------------------
-- (6) search for the number of occurances of a word
------------------------------------------------------------
DECLARE #string VARCHAR(100)='The quick green fox jumps over the lazy dog and the lazy dog just laid there.',
#alphabet VARCHAR(26)='ABCDEFGHIJKLMNOPQRSTUVWXYZ',
#searchString VARCHAR(100)='The';
-- (5a) by location
SELECT sequence-(LEN(#searchstring)) AS location,
token AS searchString
FROM dbo.nGrams8K(#string,LEN(#searchstring+' ')+1,0) b
WHERE token=#searchString;
-- (2b) get total
SELECT #string AS string,
#searchString AS searchString,
COUNT(*) AS ttl
FROM dbo.nGrams8K(#string,LEN(#searchstring+' ')+1,0) b
WHERE token=#searchString;
------------------------------------------------------------
-- (7) Special SubstringBefore and SubstringAfter
------------------------------------------------------------
-- (7a) SubstringBeforeSSI (note: SSI = substringIndex)
ALTER FUNCTION dbo.SubstringBeforeSSI
(
#string varchar(1000),
#substring varchar(100),
#substring_index tinyint
)
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
WITH get_pos AS
(
SELECT rn = row_number() over (order by sequence), substring_index = sequence
FROM dbo.nGrams8K(#string,len(#substring),1)
WHERE token=#substring
)
SELECT newstring = substring(#string,1,substring_index-len(#substring))
FROM get_pos
WHERE rn=#substring_index;
GO
DECLARE #string varchar(1000)='10.0.1600.22',
#searchPattern varchar(100)='.',
#substring_index tinyint = 3;
SELECT * FROM dbo.SubstringBeforeSSI(#string,#searchPattern,#substring_index);
GO
-- (7b) SubstringBeforeSSI (note: SSI = substringIndex)
ALTER FUNCTION dbo.SubstringAfterSSI
(
#string varchar(1000),
#substring varchar(100),
#substring_index tinyint
)
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
WITH get_pos AS
(
SELECT rn = row_number() over (order by sequence), substring_index = sequence
FROM dbo.nGrams8K(#string,len(#substring),1)
WHERE token=#substring
)
SELECT newstring = substring(#string,substring_index+1,8000)
FROM get_pos
WHERE rn=#substring_index;
GO
DECLARE #string varchar(1000)='<notes id="1">blah, blah, blah</notes><notes id="2">More Notes</notes>',
#searchPattern varchar(100)='</notes>',
#substring_index tinyint = 1;
SELECT #string, *
FROM dbo.SubstringAfterSSI(#string,#searchPattern,#substring_index);
------------------------------------------------------------
-- (8) Strip non-numeric characters from a string
------------------------------------------------------------
-- (8a) create the function
ALTER FUNCTION StripNonNumeric_itvf(#OriginalText VARCHAR(8000))
RETURNS TABLE
--WITH SCHEMABINDING
AS
return
WITH ngrams AS
(
SELECT n = [sequence], c = token
FROM dbo.nGrams8K(#OriginalText,1,1)
),
clean_txt(CleanedText) AS
(
SELECT c+''
FROM ngrams
WHERE ascii(substring(#OriginalText,n,1)) BETWEEN 48 AND 57
FOR XML PATH('')
)
SELECT CleanedText
FROM clean_txt;
GO
-- (8b) use against a value or variable
SELECT CleanedText
FROM dbo.StripNonNumeric_itvf('value123');
-- (8c) use against a table
-- test harness:
IF OBJECT_ID('tempdb..#strings') IS NOT NULL DROP TABLE #strings;
WITH strings AS
(
SELECT TOP (100000) string = newid()
FROM sys.all_columns a CROSS JOIN sys.all_columns b
)
SELECT *
INTO #strings
FROM strings;
GO
-- query (returns 100K rows every 3 seconds on my pc):
SELECT CleanedText
FROM #strings
CROSS APPLY dbo.StripNonNumeric_itvf(string);
------------------------------------------------------------
-- (9) A couple complex String Algorithms
------------------------------------------------------------
-- (9a) hamming distance between two strings:
DECLARE #string1 varchar(8000) = 'xxxxyyyzzz',
#string2 varchar(8000) = 'xxxxyyzzzz';
SELECT string1 = #string1,
string2 = #string2,
hamming_distance = count(*)
FROM dbo.nGrams8K(#string1,1,0) s1
CROSS APPLY dbo.nGrams8K(#string2,1,0) s2
WHERE s1.sequence = s2.sequence
AND s1.token <> s2.token
GO
-- (9b) inner join between 2 strings
--(can be used to speed up other string metrics such as the longest common subsequence)
DECLARE #string1 varchar(100)='xxxx123yyyy456zzzz',
#string2 varchar(100)='xx789yy000zz';
WITH
s1(string1) AS
(
SELECT [token]+''
FROM dbo.nGrams8K(#string1,1,0)
WHERE charindex([token],#string2)<>0
ORDER BY [sequence]
FOR XML PATH('')
),
s2(string2) AS
(
SELECT [token]+''
FROM dbo.nGrams8K(#string2,1,0)
WHERE charindex([token],#string1)<>0
ORDER BY [sequence]
FOR XML PATH('')
)
SELECT string1, string2
FROM s1
CROSS APPLY s2;
------------------------------------------------------------
-- (10) Advanced Substring Metrics
------------------------------------------------------------
-- (10a) Identify common substrings and their location
DECLARE #string1 varchar(100) = 'xxx yyy zzz',
#string2 varchar(100) = 'xx yyy zz';
-- (i) review the two strings
SELECT str1 = #string1,
str2 = #string2;
-- (ii) the results
WITH
iTally AS
(
SELECT n
FROM dbo.tally t
WHERE n<= len(#string1)
),
distinct_tokens AS
(
SELECT ng1 = ng1.token, ng2 = ng2.token --= ltrim(ng1.token), ng2 = ltrim(ng2.token)
FROM itally
CROSS APPLY dbo.nGrams8K(#string1,n,1) ng1
CROSS APPLY dbo.nGrams8K(#string2,n,1) ng2
WHERE ng1.token=ng2.token
)
SELECT ss_txt = ng1,
ss_len = len(ng1),
str1_loc = charindex(ng1,#string1),
str2_loc = charindex(ng2,#string2)
FROM distinct_tokens
WHERE ng1<>'' AND charindex(ng1,#string1)+charindex(ng2,#string2)<>0
GROUP BY ng1, ng2
ORDER BY charindex(ng1,#string1), charindex(ng2,#string2), len(ng1);
-- (10b) Longest common substring function
-- (i) function
IF EXISTS
( SELECT * FROM INFORMATION_SCHEMA.ROUTINES
WHERE ROUTINE_SCHEMA='dbo' AND ROUTINE_NAME = 'lcss')
DROP FUNCTION dbo.lcss;
GO
CREATE FUNCTION dbo.lcss(#string1 varchar(100), #string2 varchar(100))
RETURNS TABLE
AS
RETURN
SELECT TOP (1) with ties token
FROM dbo.tally
CROSS APPLY dbo.nGrams8K(#string1,n,1)
WHERE n <= len(#string1)
AND charindex(token, #string2) > 0
ORDER BY len(token) DESC;
GO
-- (ii) example of use
DECLARE #string1 varchar(100) = '000xxxyyyzzz',
#string2 varchar(100) = '999xxyyyzaa';
SELECT string1 = #string1,
string2 = #string2,
token
FROM dbo.lcss(#string1, #string2);
*/
GO
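As a rough sketch of how the asker's trigram table could then be built with it (the source and target table names here are hypothetical, and the normalization is simplified to lower-casing and mapping spaces to underscores rather than handling every special character):
SELECT ng.token AS Trigram, f.ID
INTO dbo.TrigramIndex    -- hypothetical target table
FROM dbo.Films AS f      -- hypothetical source table with ID and NAME columns
CROSS APPLY dbo.nGrams8K(REPLACE(LOWER(f.NAME), ' ', '_'), 3, 0) AS ng
WHERE ng.token NOT LIKE '%[^a-z_]%'
  AND LEN(ng.token) = 3;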
You'd have to repeat this statement:
insert into trigram_table ( Trigram, ID )
select substr( translate( lower( Name ), ' ', '_' ), :X, :N ),
ID
from db_table
for all :X from 1 to Len(Name) + 1 - :N
You will also have to extend the translate function for all the other special characters you'd want to convert to an underscore. Right now it's just translating a blank into an underscore.
For performance you could do the translate and lower functions on the Trigram column in a last pass on the trigram_table so you're not doing those functions for each :X.

Define variable to use with IN operator (T-SQL)

I have a Transact-SQL query that uses the IN operator. Something like this:
select * from myTable where myColumn in (1,2,3,4)
Is there a way to define a variable to hold the entire list "(1,2,3,4)"? How should I define it?
declare @myList {data type}
set @myList = (1,2,3,4)
select * from myTable where myColumn in @myList
DECLARE #MyList TABLE (Value INT)
INSERT INTO #MyList VALUES (1)
INSERT INTO #MyList VALUES (2)
INSERT INTO #MyList VALUES (3)
INSERT INTO #MyList VALUES (4)
SELECT *
FROM MyTable
WHERE MyColumn IN (SELECT Value FROM #MyList)
DECLARE @mylist TABLE (Id int)
INSERT INTO @mylist
SELECT id FROM (VALUES (1),(2),(3),(4),(5)) AS tbl(id)

SELECT * FROM Mytable WHERE theColumn IN (select id from @mylist)
There are two ways to tackle dynamic csv lists for TSQL queries:
1) Using an inner select
SELECT * FROM myTable WHERE myColumn in (SELECT id FROM myIdTable WHERE id > 10)
2) Using dynamically concatenated TSQL
DECLARE @sql nvarchar(max)
declare @list varchar(256)
select @list = '1,2,3'
SELECT @sql = N'SELECT * FROM myTable WHERE myColumn in (' + @list + ')'
exec sp_executesql @sql
3) A possible third option is table variables. If you have SQL Server 2005 you can use a table variable. If you're on SQL Server 2008 you can even pass whole table variables in as a parameter to stored procedures and use them in a join or as a subselect in the IN clause.
DECLARE @list TABLE (Id INT)

INSERT INTO @list(Id)
SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4

SELECT
    *
FROM
    myTable
    JOIN @list l ON myTable.myColumn = l.Id

SELECT
    *
FROM
    myTable
WHERE
    myColumn IN (SELECT Id FROM @list)
Use a function like this:
CREATE function [dbo].[list_to_table] (@list varchar(4000))
returns @tab table (item varchar(100))
begin
    if CHARINDEX(',',@list) = 0 or CHARINDEX(',',@list) is null
    begin
        insert into @tab (item) values (@list);
        return;
    end

    declare @c_pos int;
    declare @n_pos int;
    declare @l_pos int;

    set @c_pos = 0;
    set @n_pos = CHARINDEX(',',@list,@c_pos);

    while @n_pos > 0
    begin
        insert into @tab (item) values (SUBSTRING(@list,@c_pos+1,@n_pos - @c_pos-1));
        set @c_pos = @n_pos;
        set @l_pos = @n_pos;
        set @n_pos = CHARINDEX(',',@list,@c_pos+1);
    end;

    insert into @tab (item) values (SUBSTRING(@list,@l_pos+1,4000));
    return;
end;
Instead of using IN, you make an inner join with the table returned by the function:
select * from table_1 where id in ('a','b','c')
becomes
select * from table_1 a inner join [dbo].[list_to_table] ('a,b,c') b on (a.id = b.item)
In an unindexed 1M record table the second version took about half the time...
I know this is old now, but for SQL Server 2016 and later you can use STRING_SPLIT:
DECLARE @InList varchar(255) = 'This;Is;My;List';

WITH InList (Item) AS (
    SELECT value FROM STRING_SPLIT(@InList, ';')
)
SELECT *
FROM [Table]
WHERE [Item] IN (SELECT Item FROM InList)
Starting with SQL Server 2016 you can use STRING_SPLIT and do this:
declare @myList nvarchar(MAX)
set @myList = '1,2,3,4'
select * from myTable where myColumn in (select value from STRING_SPLIT(@myList,','))
DECLARE @myList TABLE (Id BIGINT) INSERT INTO @myList(Id) VALUES (1),(2),(3),(4);

select * from myTable where myColumn in (select Id from @myList)
Please note that for long lists or production systems it's not recommended to use this approach, as it may be much slower than a simple IN operator like someColumnName in (1,2,3,4) (tested using an 8000+ item list).
A slight improvement on @LukeH's answer (there is no need to repeat the "INSERT INTO") and on @realPT's answer (no need to have the SELECT):
DECLARE #MyList TABLE (Value INT)
INSERT INTO #MyList VALUES (1),(2),(3),(4)
SELECT * FROM MyTable
WHERE MyColumn IN (SELECT Value FROM #MyList)
No, there is no such type. But there are some choices:
Dynamically generated queries (sp_executesql)
Temporary tables
Table-type variables (closest thing that there is to a list)
Create an XML string and then convert it to a table with the XML functions (really awkward and roundabout, unless you have an XML to start with)
None of these are really elegant, but that's the best there is.
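For completeness, a hedged sketch of the XML route from the last option (awkward, as noted, and shown only for illustration):
DECLARE @myList xml = N'<i>1</i><i>2</i><i>3</i><i>4</i>';

SELECT *
FROM myTable
WHERE myColumn IN (SELECT i.n.value('.', 'int')
                   FROM @myList.nodes('/i') AS i(n));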
If you want to do this without using a second table, you can do a LIKE comparison with a CAST:
DECLARE @myList varchar(15)
SET @myList = ',1,2,3,4,'

SELECT *
FROM myTable
WHERE @myList LIKE '%,' + CAST(myColumn AS varchar(15)) + ',%'
If the field you're comparing is already a string then you won't need to CAST.
Surrounding both the column match and each unique value in commas will ensure an exact match. Otherwise, a value of 1 would be found in a list containing ',4,2,15,'
As no one mentioned it before, starting from Sql Server 2016 you can also use json arrays and OPENJSON (Transact-SQL):
declare @filter nvarchar(max) = '[1,2]'

select *
from dbo.Test as t
where
    exists (select * from openjson(@filter) as tt where tt.[value] = t.id)
You can test it in a sql fiddle demo.
You can also cover more complicated cases with json easier - see Search list of values and range in SQL using WHERE IN clause with SQL variable?
This one uses PATINDEX to match ids from a table to a non-digit delimited integer list.
-- Given a string @myList containing character delimited integers
-- (supports any non digit delimiter)
DECLARE @myList VARCHAR(MAX) = '1,2,3,4,42'

SELECT * FROM [MyTable]
WHERE
    -- When the Id is at the leftmost position
    -- (nothing to its left and anything to its right after a non digit char)
    PATINDEX(CAST([Id] AS VARCHAR)+'[^0-9]%', @myList) > 0
    OR
    -- When the Id is at the rightmost position
    -- (anything to its left before a non digit char and nothing to its right)
    PATINDEX('%[^0-9]'+CAST([Id] AS VARCHAR), @myList) > 0
    OR
    -- When the Id is between two delimiters
    -- (anything to its left and right after two non digit chars)
    PATINDEX('%[^0-9]'+CAST([Id] AS VARCHAR)+'[^0-9]%', @myList) > 0
    OR
    -- When the Id is equal to the list
    -- (if there is only one Id in the list)
    CAST([Id] AS VARCHAR) = @myList
Notes:
when casting as varchar and not specifying byte size in parentheses the default length is 30
% (wildcard) will match any string of zero or more characters
^ (wildcard) not to match
[^0-9] will match any non digit character
PATINDEX is a T-SQL function that returns the starting position of the first occurrence of a pattern in a string
DECLARE @StatusList varchar(MAX);
SET @StatusList = '1,2,3,4';

DECLARE @Status SYS_INTEGERS;

INSERT INTO @Status
SELECT Value
FROM dbo.SYS_SPLITTOINTEGERS_FN(@StatusList, ',');

SELECT Value From @Status;
Most of these seem to focus on separating-out each INT into its own parenthetical, for example:
(1),(2),(3), and so on...
That isn't always convenient. Especially since, many times, you already start with a comma-separated list, for example:
(1,2,3,...) and so on...
In these situations, you may care to do something more like this:
DECLARE @ListOfIds TABLE (DocumentId INT);

INSERT INTO @ListOfIds
SELECT Id FROM [dbo].[Document] WHERE Id IN (206,235,255,257,267,365)

SELECT * FROM @ListOfIds
I like this method because, more often than not, I am trying to work with IDs that should already exist in a table.
My experience with a commonly proposed technique offered here,
SELECT * FROM Mytable WHERE myColumn IN (select id from @mylist)
is that it induces a major performance degradation if the primary data table (Mytable) includes a very large number of records. Presumably, that is because the IN operator’s list-subquery is re-executed for every record in the data table.
I’m not seeing any offered solution here that provides the same functional result by avoiding the IN operator entirely. The general problem isn’t a need for a parameterized IN operation, it’s a need for a parameterized inclusion constraint. My favored technique for that is to implement it using an (inner) join:
DECLARE @myList varchar(50) /* BEWARE: if too small, no error, just missing data! */
SET @myList = '1,2,3,4'

SELECT *
FROM myTable
JOIN STRING_SPLIT(@myList,',') MyList_Tbl
    ON myColumn = MyList_Tbl.Value
It is so much faster because the generation of the constraint-list table (MyList_Tbl) is executed only once for the entire query execution. Typically, for large data sets, this technique executes at least five times faster than the functionally equivalent parameterized IN operator solutions, like those offered here.
I think you'll have to declare a string and then execute that SQL string.
Have a look at sp_executeSQL
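In other words, something along these lines (only a sketch; concatenating the list like this assumes the list is trusted and not raw user input):
DECLARE @myList nvarchar(100) = N'1,2,3,4';
DECLARE @sql nvarchar(max) = N'SELECT * FROM myTable WHERE myColumn IN (' + @myList + N');';

EXEC sys.sp_executesql @sql;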