Join using a LIKE clause is taking too long - sql

Please see the TSQL below:
create table #IDs (id varchar(100))
insert into #IDs values ('123')
insert into #IDs values ('456')
insert into #IDs values ('789')
insert into #IDs values ('1010')
create table #Notes (Note varchar(500))
insert into #Notes values ('Here is a note for 123')
insert into #Notes values ('A note for 789 here')
insert into #Notes values ('456 has a note here')
I want to find all the IDs that are referenced in the #Notes table. This works:
select #IDs.id from #IDs inner join #Notes on #Notes.note like '%' + #IDs.id + '%'
However, there are hundreds of thousands of records in both tables and the query does not complete. I was thinking about FreeText searching, but I don't think it can be applied here. A cursor takes too long to run as well (I think it will take over one month). Is there anything else I can try? I am using SQL Server 2019.

The size of the input is only one aspect of the solution.
By splitting the text into tokens you indeed increase the number of records, but at the same time you enable an equality join, which can be implemented as a Hash Join.
You should get the query results in a few minutes tops, basically the time it takes your system to do a full scan of both tables, plus some processing time.
No need for temp tables.
No need for indexes.
Select id
from #IDS
where id in (select w.value
             from #Notes as n
             cross apply string_split(n.Note, ' ') as w
            )
Fiddle
Per the OP's request, here is code that handles a more complicated scenario, where an id can contain various characters (as defined by @token_char) and the separators are potentially all other characters:
declare @token_char varchar(100) = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'
;
with cte_notes as
(
    select Note
          ,replace(translate(Note, @token_char, space(len(@token_char))), ' ', '') as non_token_char
    from #Notes
)
select id
from #IDS
where id in
      (
       select w.value
       from cte_notes as n
       cross apply string_split(translate(n.Note, n.non_token_char, space(len(n.non_token_char))), ' ') as w
       where w.value != ''
      )
The Fiddle data sample was altered accordingly, to reflect the change

If you are going to do this search often you may want to explore using a wonderful (if underused) feature of SQL Server called 'Full Text Search.' To quote Microsoft:
A LIKE query against millions of rows of text data can take minutes to return; whereas a full-text query can take only seconds or less against the same data, depending on the number of rows that are returned.
I have seen searches go from minutes to seconds using this feature.
You would need to create a Full Text Search catalog and then create full-text indexes on the tables you want to search. It's not hard, and it will only take you a few minutes to learn how to do this.
This is a good starting point:
https://learn.microsoft.com/en-us/sql/relational-databases/search/get-started-with-full-text-search?view=sql-server-ver15
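For illustration, here is a minimal sketch of the setup. It assumes the notes live in a permanent table dbo.Notes with a unique key named PK_Notes (both names are placeholders; full-text indexes cannot be created on temp tables):
-- One-time setup: a full-text catalog plus a full-text index on the Note column
CREATE FULLTEXT CATALOG NotesCatalog AS DEFAULT;

CREATE FULLTEXT INDEX ON dbo.Notes (Note)
    KEY INDEX PK_Notes              -- must be a unique, single-column, non-nullable index
    WITH CHANGE_TRACKING AUTO;

-- A full-text lookup for one value; CONTAINS takes a literal or variable, not a column,
-- so searching for a whole list of ids still means one search term per id
SELECT n.Note
FROM dbo.Notes AS n
WHERE CONTAINS(n.Note, '"123"');
Keep in mind that full-text search matches on word boundaries, so an id buried in the middle of a longer token will not be found the way LIKE '%...%' would find it.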

I would apply a CTE with string_split to filter out all non-numeric tokens and then join the #IDs table with the result of the CTE on the id column. The query was tested on a sample of 1 million rows.
With CTE As (
    Select T.value As id
    From #Notes Cross Apply String_Split(Note, ' ') As T
    Where Try_Convert(Int, T.value) Is Not Null
)
Select I.id
From #IDs As I Inner Join CTE On (I.id = CTE.id)
If you just want to extract a numeric value from a string, in this case the join is excessive.
Select T.value As id, #Notes.Note
From #Notes Cross Apply String_Split(Note,' ') As T
Where Try_Convert(Int, T.value) Is Not Null And T.value Like '%[0-9]%'
id    Note
123   Here is a note for 123
789   A note for 789 here
456   456 has a note here
No matter what, under the given circumstances, I would use a join to filter out those numbers that are not represented in the #IDs table.
With CTE As (
    Select distinct(id) As id
    From #IDs
)
Select T.value As id, #Notes.Note
From #Notes Cross Apply String_Split(Note, ' ') As T
    Inner Join CTE On (T.value = CTE.id)
Where Try_Convert(Int, T.value) Is Not Null
    And T.value Like '%[0-9]%'
If the string contains brackets or parenthesis instead of spaces like this:
"456(this is an id number) has a note here" or "456[01/01/2022]"
as a last resort (since it degrades performance) you can use TRANSLATE to replace those brackets with spaces as follows:
With CTE As (
    Select distinct(id) As id
    From #IDs
)
Select T.value As id, #Notes.Note
From #Notes Cross Apply String_Split(TRANSLATE(Note, '[]()', '    '), ' ') As T  -- four spaces, one per bracket character
    Inner Join CTE On (T.value = CTE.id)
Where Try_Convert(Int, T.value) Is Not Null
    And T.value Like '%[0-9]%'
db<>fiddle

Related

How to search for all except certain strings using like (or different solution welcomed)?

For now, I need to filter out rows with certain text strings.
E.g. for a string given in this format: 'taxes,cars' - I need to filter out all rows that include either "taxes" or "cars" in the description of the row.
I have come up with this:
SELECT
     TransactionId
    ,t.DocumentID
    ,t.DocumentDescription
FROM [Transaction] t
INNER JOIN (SELECT CONCAT('%',[Value],'%') AS [Value]
            FROM STRING_SPLIT(N'taxes,cars',',')
           ) w
    ON t.[DocumentDescription] NOT LIKE w.[Value]
This does not work at all, since it matches both of the split strings and filters out a row only when both of the strings are included in the description of the row.
Any ideas how to make it work?
I think you want NOT EXISTS:
WITH w as (
    SELECT value as word
    FROM STRING_SPLIT(N'taxes,cars', ',')
)
SELECT t.*
FROM [Transaction] t
WHERE NOT EXISTS (SELECT 1
                  FROM w
                  WHERE t.DocumentDescription LIKE CONCAT('%', w.word, '%')
                 );
Note that because of the use of LIKE this query has to scan the entire table. You might want to rethink your data model, perhaps using a full text index or breaking the description into words if you have large tables and performance is an issue.
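A sketch of the "breaking the description into words" idea, using hypothetical names (DocumentWord is purely illustrative, not a table from the question): populate a word table once, and the filter becomes an indexable equality check instead of a LIKE scan. Note that this only matches whole space-separated words, unlike LIKE '%...%'.
-- One row per (document, word); populate once and maintain on insert/update
CREATE TABLE dbo.DocumentWord (
    DocumentID int NOT NULL,
    Word       nvarchar(400) NOT NULL,   -- assumes individual words fit in 400 characters
    PRIMARY KEY (DocumentID, Word)
);

INSERT INTO dbo.DocumentWord (DocumentID, Word)
SELECT DISTINCT t.DocumentID, s.value
FROM [Transaction] t
CROSS APPLY STRING_SPLIT(t.DocumentDescription, ' ') s
WHERE s.value <> '';

-- The anti-join is now an equality lookup against the word table
SELECT t.*
FROM [Transaction] t
WHERE NOT EXISTS (SELECT 1
                  FROM dbo.DocumentWord dw
                  JOIN STRING_SPLIT(N'taxes,cars', ',') w
                      ON dw.Word = w.value
                  WHERE dw.DocumentID = t.DocumentID);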
Since you said you were open to other ideas... What you are looking to do can be done without a splitter function (e.g. STRING_SPLIT in your example). If you wanted to let your filter expression ('taxes,cars') come in as a parameter, then you could use STRING_SPLIT. Note the sample data and both examples below:
DECLARE @Transaction TABLE
(
    TransactionId INT IDENTITY,
    DocumentDescription VARCHAR(1000)
);
INSERT @Transaction (DocumentDescription)
VALUES ('Blah, blah... cars...'), ('Yada, yada... taxes'), ('Blah blah...');

-- Without a splitter function (e.g. STRING_SPLIT)
SELECT t.TransactionId, t.DocumentDescription
FROM @Transaction AS t
WHERE NOT EXISTS
(
    SELECT 1
    FROM @Transaction
    CROSS JOIN (VALUES('taxes'),('cars')) AS srch(Item)
    WHERE CHARINDEX(srch.Item, t.DocumentDescription) > 0
);

-- Using STRING_SPLIT
SELECT t.*
FROM @Transaction AS t
WHERE NOT EXISTS
(
    SELECT 1
    FROM STRING_SPLIT(N'taxes,cars', ',') AS w
    WHERE CHARINDEX(w.[value], t.DocumentDescription) > 0
);
This got me the result I wanted!
SELECT TransactionId, DocumentID, DocumentDescription FROM [Transaction]
EXCEPT
SELECT t.TransactionId, t.DocumentID, t.DocumentDescription
FROM [Transaction] t
INNER JOIN (SELECT CONCAT('%',[Value],'%') AS [Value]
            FROM STRING_SPLIT(N'cars,taxes',',')) w
    ON t.DocumentDescription LIKE w.[Value]

Generating Lines based on a value from a column in another table

I have the following table:
EventID=00002,DocumentID=0005,EventDesc=ItemsReceived
I have the quantity in another table
DocumentID=0005,Qty=20
I want to generate a result of 20 lines (depending on the quantity) with an auto generated column which will have a sequence of:
ITEM_TAG_001,
ITEM_TAG_002,
ITEM_TAG_003,
ITEM_TAG_004,
..
ITEM_TAG_020
Here's your SQL query:
with cte as (
    select 1 as ctr, t2.Qty, t1.EventID, t1.DocumentId, t1.EventDesc
    from tableA t1
    inner join tableB t2 on t2.DocumentId = t1.DocumentId
    union all
    select ctr + 1, Qty, EventID, DocumentId, EventDesc
    from cte
    where ctr < Qty   -- stop once Qty rows have been generated
)
select *, concat('ITEM_TAG_', right('000' + cast(ctr AS varchar(3)), 3))
from cte
option (maxrecursion 0);
Best is to introduce a numbers table, very handy in many places...
Something along these lines:
Create some test data:
DECLARE @MockNumbers TABLE(Number BIGINT);
DECLARE @YourTable1 TABLE(DocumentID INT, ItemTag VARCHAR(100), SomeText VARCHAR(100));
DECLARE @YourTable2 TABLE(DocumentID INT, Qty INT);

INSERT INTO @MockNumbers SELECT TOP 100 ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) FROM master..spt_values;
INSERT INTO @YourTable1 VALUES(1,'FirstItem','qty 5'),(2,'SecondItem','qty 7');
INSERT INTO @YourTable2 VALUES(1,5), (2,7);

--The query
SELECT CONCAT(t1.ItemTag,'_',REPLACE(STR(A.Number,3),' ','0'))
FROM @YourTable1 t1
INNER JOIN @YourTable2 t2 ON t1.DocumentID=t2.DocumentID
CROSS APPLY(SELECT Number FROM @MockNumbers WHERE Number BETWEEN 1 AND t2.Qty) A;
The result
FirstItem_001
FirstItem_002
[...]
FirstItem_005
SecondItem_001
SecondItem_002
[...]
SecondItem_007
The idea in short:
We use an INNER JOIN to get the quantity joined to the item.
Then we use APPLY, which acts row by row, to bind as many rows to the set as we need.
The first item will return with 5 lines, the second with 7. The trick with STR() and REPLACE() is one way to create a padded number. You might use FORMAT() (v2012+), but it is rather slow...
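For comparison, a sketch of the FORMAT() variant of the same expression (identical output, just usually slower):
-- '000' is a custom numeric format string that pads to three digits
SELECT CONCAT(t1.ItemTag,'_',FORMAT(A.Number,'000'))
FROM @YourTable1 t1
INNER JOIN @YourTable2 t2 ON t1.DocumentID=t2.DocumentID
CROSS APPLY(SELECT Number FROM @MockNumbers WHERE Number BETWEEN 1 AND t2.Qty) A;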
The table @MockNumbers is a declared table variable containing a list of numbers from 1 to 100. This answer provides an example of how to create a physical numbers and date table. Any database should have such a table...
If you don't want to create a numbers table, you can search for a tally table or a tally on the fly. There are many answers showing approaches for creating a list of running numbers.
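As a sketch, a tally "on the fly" can be built by cross-joining a small row set and numbering the result; nothing here depends on any particular table:
-- 10 x 10 x 10 = 1000 numbered rows, no physical table needed
WITH Ten(n) AS
(
    SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) AS v(x)
),
Tally(Number) AS
(
    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM Ten a CROSS JOIN Ten b CROSS JOIN Ten c
)
SELECT Number
FROM Tally
WHERE Number <= 20;   -- take as many rows as you need, e.g. the Qty from the other table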

String_agg for SQL Server before 2017

Can anyone help me make this query work for SQL Server 2014?
This works on PostgreSQL and probably on SQL Server 2017. On Oracle it is listagg instead of string_agg.
Here is the SQL:
select string_agg(t.id, ',') AS id
from Table t
I checked on the site that some XML option should be used, but I could not understand it.
In SQL Server pre-2017, you can do:
select stuff( (select ',' + cast(t.id as varchar(max))
               from Table t
               for xml path ('')
              ), 1, 1, ''
            );
The only purpose of stuff() is to remove the initial comma. The work is being done by for xml path.
Note that for some characters, the values will be escaped when using FOR XML PATH, for example:
SELECT STUFF((SELECT ',' + V.String
              FROM (VALUES('7 > 5'),('Salt & pepper'),('2
lines'))V(String)
              FOR XML PATH('')),1,1,'');
This returns the string below:
7 &gt; 5,Salt &amp; pepper,2&#x0D;
lines
This is unlikely to be what you want. You can get around this using TYPE and then getting the value of the XML:
SELECT STUFF((SELECT ',' + V.String
              FROM (VALUES('7 > 5'),('Salt & pepper'),('2
lines'))V(String)
              FOR XML PATH(''),TYPE).value('(./text())[1]','varchar(MAX)'),1,1,'');
This returns the string below:
7 > 5,Salt & pepper,2
lines
This would replicate the behaviour of the following:
SELECT STRING_AGG(V.String,',')
FROM (VALUES('7 > 5'),('Salt & pepper'),('2
lines'))V(String);
Of course, there might be times where you want to group the data, which the above doesn't demonstrate. To achieve this you would need to use a correlated subquery. Take the following sample data:
CREATE TABLE dbo.MyTable (ID int IDENTITY(1,1),
GroupID int,
SomeCharacter char(1));
INSERT INTO dbo.MyTable (GroupID, SomeCharacter)
VALUES (1,'A'), (1,'B'), (1,'D'),
(2,'C'), (2,NULL), (2,'Z');
From this, I wanted the below results:
GroupID    Characters
1          A,B,D
2          C,Z
To achieve this you would need to do something like this:
SELECT MT.GroupID,
       STUFF((SELECT ',' + sq.SomeCharacter
              FROM dbo.MyTable sq
              WHERE sq.GroupID = MT.GroupID --This is your correlated join and should be on the same columns as your GROUP BY
              --You "JOIN" on the columns that would have been in the PARTITION BY
              FOR XML PATH(''),TYPE).value('(./text())[1]','varchar(MAX)'),1,1,'') AS Characters
FROM dbo.MyTable MT
GROUP BY MT.GroupID; --I use GROUP BY rather than DISTINCT as we are technically aggregating here
So, if you were grouping on 2 columns, then you would have 2 clauses your sub query's WHERE: WHERE MT.SomeColumn = sq.SomeColumn AND MT.AnotherColumn = sq.AnotherColumn, and your outer GROUP BY would be GROUP BY MT.SomeColumn, MT.AnotherColumn.
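Put together, a sketch of the two-column version against a hypothetical dbo.YourTable with columns SomeColumn, AnotherColumn and SomeCharacter (the names are placeholders for whatever you actually group on):
SELECT MT.SomeColumn, MT.AnotherColumn,
       STUFF((SELECT ',' + sq.SomeCharacter
              FROM dbo.YourTable sq
              WHERE sq.SomeColumn = MT.SomeColumn
                AND sq.AnotherColumn = MT.AnotherColumn
              FOR XML PATH(''),TYPE).value('(./text())[1]','varchar(MAX)'),1,1,'') AS Characters
FROM dbo.YourTable MT
GROUP BY MT.SomeColumn, MT.AnotherColumn;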
Finally, let's add an ORDER BY into this, which you also define in the subquery. Let's, for example, assume you wanted to sort the data by the value of the ID descending in the string aggregation:
SELECT MT.GroupID,
       STUFF((SELECT ',' + sq.SomeCharacter
              FROM dbo.MyTable sq
              WHERE sq.GroupID = MT.GroupID
              ORDER BY sq.ID DESC --This is identical to the ORDER BY you would have in your OVER clause
              FOR XML PATH(''),TYPE).value('(./text())[1]','varchar(MAX)'),1,1,'') AS Characters
FROM dbo.MyTable MT
GROUP BY MT.GroupID;
This would produce the following results:
GroupID    Characters
1          D,B,A
2          Z,C
Unsurprisingly, this will never be as efficient as a STRING_AGG, due to having to reference the table multiple times (if you need to perform multiple aggregations, then you need multiple sub queries), but a well indexed table will greatly help the RDBMS. If performance really is a problem, because you're doing multiple string aggregations in a single query, then I would either suggest you need to reconsider whether you need the aggregation, or that it's about time you considered upgrading.
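For reference, on SQL Server 2017+ the last query collapses to a single STRING_AGG, with WITHIN GROUP providing the ordering:
SELECT MT.GroupID,
       STRING_AGG(MT.SomeCharacter,',') WITHIN GROUP (ORDER BY MT.ID DESC) AS Characters
FROM dbo.MyTable MT
GROUP BY MT.GroupID;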

Return a XML field when using GROUP BY clause in MS SQL Management Studio?

I have the following table structure (partially excluded for clarity of question):
The table sometimes receives two lowFareRQ and lowFareRS entries that are considered to be only one booking under BookingNumber. The booking is then processed into a ticket, where each booking number always has the same TicketRQ and TicketRS if the user proceeded with the booking. TicketRS contains the 3rd party reference number.
I now want to display all the active bookings to the user in order to allow the user to cancel a booking if he wanted to.
So I would naturally want to retrieve each booking number with active status as well as the TicketRS xml data in order to get the 3rd party reference number.
Here is the SQL query I started with:
SELECT TOP 100
[BookingNumber]
,[Status]
,[TicketRS]
FROM [VTResDB].[dbo].[LowFareRS]
GROUP BY [BookingNumber],[Status],[TicketRS]
ORDER BY [recID] desc
Now, with MS SQL Management Studio you have to add the field [TicketRS] to the GROUP BY if you want to have it in the SELECT field list... but you cannot have an XML field in the GROUP BY list:
The XML data type cannot be compared or sorted, except when using the IS NULL operator.
I know that if I change the table structure this problem can be solved without any issue but I want to avoid changing the table structure because I am just completing the software and do not want to rewrite existing code.
Is there a way to return an XML field when using a GROUP BY clause in MS SQL Management Studio?
Uhm, this seems dirty... If your XMLs are identical within the group, you might try something like this:
DECLARE @tbl TABLE(ID INT IDENTITY, Col1 VARCHAR(100), SomeValue INT, SomeXML XML);
INSERT INTO @tbl(col1, SomeValue, SomeXML) VALUES
 ('testA',1,'<root><a>testA</a></root>')
,('testA',2,'<root><a>testA</a></root>')
,('testB',3,'<root><a>testB</a></root>')
,('testB',4,'<root><a>testB</a></root>');

WITH GroupedSource AS
(
    SELECT SUM(SomeValue) AS SumColumn
          ,CAST(SomeXml AS NVARCHAR(MAX)) AS XmlColumn
    FROM @tbl AS tbl
    GROUP BY Col1, CAST(SomeXml AS NVARCHAR(MAX))
)
SELECT SumColumn
      ,CAST(XmlColumn AS XML) AS ReCasted
FROM GroupedSource
Another approach was this
WITH GroupedSource AS
(
    SELECT SUM(SomeValue) AS SumColumn
          ,MIN(ID) AS FirstID
    FROM @tbl AS tbl
    GROUP BY Col1
)
SELECT SumColumn
      ,(SELECT SomeXML FROM @tbl WHERE ID=FirstID) AS ReTaken
FROM GroupedSource
Cast it to nvarchar(max) and back
with t(xc,val) as (
select xc=cast(N'<x><y>txt</y></x>' as xml), val = 5
union all
select xc=cast(N'<x><y>txt</y></x>' as xml), val = 6
)
select xc = cast(xc as XML), val
from (
select xc = cast(xc as nvarchar(max)), val = sum(val)
from t
group by cast(xc as nvarchar(max))
) tt
;

Using the distinct function in SQL

I have a SQL query I am running. What I was wanting to know is: is there a way of selecting the rows in a table where the value in one of those columns is distinct? When I use the distinct function, it returns all of the distinct rows, so...
select distinct teacher from class etc.
This works fine, but I am selecting multiple columns, so...
select distinct teacher, student etc.
but I don't want to retrieve the distinct rows; I want the rows where the teacher is distinct. So this query would probably return the same teacher's name multiple times because the student value is different, but what I would like is to return rows where the teachers are distinct, even if it means returning the teacher and only one student name (because I don't need all the students).
I hope what I am trying to ask is clear, but is it possible to use the distinct function on a single column even when selecting multiple columns, or is there any other solution to this problem? Thanks.
The above is just an example. I don't know if using 'distinct' is the solution to my problem; I am not really using teacher etc., that was just an example to get the idea across. I am selecting multiple columns (about 10) from different tables, and I have a query to get the tabled result I want. Now I want to query that table to find the unique values in one particular column. So, using the teacher example again, say I have written a query and I have all the teachers and all the pupils they teach. Now I want to go through each row in this table and email the teacher a message. But I don't want to email the teacher numerous times, just the once, so I want to return all the columns from the table I have, where only the teacher value is distinct.
Col A  Col B  Col C  Col D
a      b      c      d
a      c      d      b
b      a      a      c
b      c      c      c
A query I have produces the above table. Now I want only those rows where Col A values are unique. How would I go about it?
You have misunderstood the DISTINCT keyword. It is not a function and it does not modify a column. You cannot SELECT a, DISTINCT(b), c, DISTINCT(d) FROM SomeTable. DISTINCT is a modifier for the query itself, i.e. you don't select a distinct column, you make a SELECT DISTINCT query.
In other words: DISTINCT tells the server to go through the whole result set and remove all duplicate rows after the query has been performed.
If you need a column to contain every value only once, you need to GROUP BY that column. Once you do that, the server needs to decide which student to select for each teacher if there are multiple, so you need to provide a so-called aggregate function like COUNT(). Example:
SELECT teacher, COUNT(student) AS amountStudents
FROM ...
GROUP BY teacher;
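If, as described in the question, you just want any one student per teacher rather than a count, another aggregate such as MIN() works the same way (teacher, student and class are the example names from the question):
SELECT teacher, MIN(student) AS anyStudent
FROM class
GROUP BY teacher;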
One option is to use a GROUP BY on Col A. Example:
SELECT * FROM table_name
GROUP BY Col A
That should return you:
a b c d
b a a c
Based on the limited details you provided in your question (you should explain how/why your data is in different tables, what DB server you are using, etc) you can approach this from 2 different directions.
1. Reduce the number of columns in your query to only return the "teacher" and "email" columns, but using the existing WHERE criteria. The problem with your current attempt is that both DISTINCT and GROUP BY don't understand that you only want 1 row for each value of the column that you are trying to be distinct about. From what I understand, MySQL has support for what you are doing using GROUP BY, but MSSQL does not support result columns not included in the GROUP BY statement. If you don't need the "student" columns, don't put them in your result set.
2. Convert your existing query to use column-based sub-queries so that you only return a single result for non-grouped data.
Example:
SELECT t1.a
, (SELECT TOP 1 b FROM Table1 t2 WHERE t1.a = t2.a) AS b
, (SELECT TOP 1 c FROM Table1 t2 WHERE t1.a = t2.a) AS c
, (SELECT TOP 1 d FROM Table1 t2 WHERE t1.a = t2.a) AS d
FROM dbo.Table1 t1
WHERE (your criteria here)
GROUP BY t1.a
This query will not be fast if you have a lot of data, but it will return a single row per teacher with a somewhat random value for the remaining columns. You can also add an ORDER BY to each sub-query to further tweak the values returned for the additional columns.
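For example, ordering each sub-query makes the extra columns deterministic; a sketch against the same hypothetical Table1:
SELECT t1.a
     , (SELECT TOP 1 b FROM Table1 t2 WHERE t1.a = t2.a ORDER BY t2.b) AS b
     , (SELECT TOP 1 c FROM Table1 t2 WHERE t1.a = t2.a ORDER BY t2.c) AS c
FROM dbo.Table1 t1
WHERE (your criteria here)
GROUP BY t1.a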
I'm not sure if I am understanding this right but couldn't you do
SELECT * FROM class WHERE teacher IN (SELECT DISTINCT teacher FROM class)
This would return all of the data in each row where the teacher is distinct
distinct requires a unique result-set row. This means that whatever values you select from your table will need to be distinct together as a row from any other row in the result-set.
Using distinct can return the same value more than once from a given field as long as the other corresponding fields in the row are distinct as well.
As soulmerge and Shiraz have mentioned you'll need to use a GROUP BY and subselect. This worked for me.
DECLARE @table TABLE (
    [Teacher] [NVarchar](256) NOT NULL,
    [Student] [NVarchar](256) NOT NULL
)

INSERT INTO @table VALUES ('Teacher 1', 'Student 1')
INSERT INTO @table VALUES ('Teacher 1', 'Student 2')
INSERT INTO @table VALUES ('Teacher 2', 'Student 3')
INSERT INTO @table VALUES ('Teacher 2', 'Student 4')

SELECT
    T.[Teacher],
    (
        SELECT TOP 1 T2.[Student]
        FROM @table AS T2
        WHERE T2.[Teacher] = T.[Teacher]
    ) AS [Student]
FROM @table AS T
GROUP BY T.[Teacher]
Results
Teacher 1, Student 1
Teacher 2, Student 3
You need to do it with a sub select where you take TOP 1 of student where the teacher is the same.
You may try "GROUP BY teacher" to return what you need.
What is the question your query is trying to answer?
Do you need to know which classes have only one teacher?
select class_name, count(teacher)
from class group by class_name having count(teacher)=1
Or are you looking for teachers with only one student?
select teacher, count(student)
from class group by teacher having count(student)=1
Or is it something else? The question you've posed assumes that using DISTINCT is the correct approach to the query you're trying to construct. It seems likely this is not the case. Could you describe the question you're trying to answer with DISTINCT?
You will need to say how your data is stored in-memory for us to say how you can query it.
But you could do a separate query to just get the distinct teachers.
select distinct teacher from class
I am struggling to understand exactly what you wish to do.. but you can do something like this:
SELECT DISTINCT ColA FROM Table WHERE ...
If you only select a single column, the distinct will only apply to that column.
If you could clarify a little more, I could try to help a bit more.
You could use GROUP BY to separate the return values based on a single column value.
All you have to do is select just the column you want (the first one) and do a SELECT DISTINCT:
Select Distinct column1 -- where your criteria...
The following might help you get to your solution. The other poster did point to this but his syntax for group by was incorrect.
Get all teachers that teach any classes.
Select teacher_table.teacher_id, count(*)
from teacher_table inner join classes_table
    on teacher_table.teacher_id = classes_table.teacher_id
group by teacher_table.teacher_id
No one seems to understand what you want. I will take another guess.
Select * from tbl
Where ColA in (Select ColA from tbl Group by ColA Having Count(ColA) = 1)
This will return all data from rows where ColA is unique, i.e. there isn't another row with the same ColA value. Of course, that means zero rows from the sample data you provided.
select cola,colb,colc
from yourtable
where cola in
(
select cola from yourtable where your criteria group by cola having count(*) = 1
)
declare @temp as table (colA nchar, colB nchar, colC nchar, colD nchar, rownum int)

insert @temp (colA, colB, colC, colD, rownum)
select Test.ColA, Test.ColB, Test.ColC, Test.ColD, ROW_NUMBER() over (order by ColA) as rownum
from Test

select t1.ColA, ColB, ColC, ColD
from @temp as t1
join (
    select ColA, MIN(rownum) [min]
    from @temp
    group by ColA
) as t2 on t1.ColA = t2.ColA and t1.rownum = t2.[min]
This will return a single row for each value of the colA.
CREATE FUNCTION dbo.DistinctList
(
    @List VARCHAR(MAX),
    @Delim CHAR
)
RETURNS VARCHAR(MAX)
AS
BEGIN
    DECLARE @ParsedList TABLE
    (
        Item VARCHAR(MAX)
    )
    DECLARE @list1 VARCHAR(MAX), @Pos INT, @rList VARCHAR(MAX)
    SET @list = LTRIM(RTRIM(@list)) + @Delim
    SET @pos = CHARINDEX(@delim, @list, 1)
    WHILE @pos > 0
    BEGIN
        SET @list1 = LTRIM(RTRIM(LEFT(@list, @pos - 1)))
        IF @list1 <> ''
            INSERT INTO @ParsedList VALUES (CAST(@list1 AS VARCHAR(MAX)))
        SET @list = SUBSTRING(@list, @pos + 1, LEN(@list))
        SET @pos = CHARINDEX(@delim, @list, 1)
    END
    SELECT @rlist = COALESCE(@rlist + ',', '') + Item
    FROM (SELECT DISTINCT Item FROM @ParsedList) t
    RETURN @rlist
END
GO