SQL searching two columns for best results - sql

I would like to run an SQL search that returns the best matches first. I tried a few things, but they didn't work well. I have two columns named subject and content.
For example, suppose we search for the text "search this keywords" in the subject and content columns. First I search for the whole phrase "search this keywords", then for the individual words "search", "this" and "keywords".
I would like subject matches to appear at the top, and I would like the rows that best match "search this keywords" to rank highest. My query sometimes works well and sometimes doesn't.
How should I write this query?
Thanks.

I think you're saying that you want to perform multiple SQL queries against your database and then combine the results and set a "weighting" to a subject match over a content match.
select messageid, textstring, max(weight) from (
-- exact subject match
select messageid, substr(subject,1,100) textstring, 100 weight
from mytable
where subject='search this keywords'
union
-- partial subject match
select messageid, substr(subject,1,100), 90 weight
from mytable
where subject like '%search this keywords%'
union
select messageid, substr(subject,1,100), 80 weight
from mytable
where subject like '%search%'
union
select messageid, substr(subject,1,100), 80 weight
from mytable
where subject like '%this%'
union
select messageid, substr(subject,1,100), 80 weight
from mytable
where subject like '%keywords%'
union
-- partial content match
select messageid, substr(content,1,100), 70 weight
from mytable
where content like '%search this keywords%'
union
select messageid, substr(content,1,100), 60 weight
from mytable
where content like '%search%'
union
select messageid, substr(content,1,100), 60 weight
from mytable
where content like '%this%'
union
select messageid, substr(content,1,100), 60 weight
from mytable
where content like '%keywords%'
) t
group by
messageid, textstring
order by max(weight) desc
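A minimal sketch of the same weighting idea, assuming Python with the standard-library sqlite3 module rather than the asker's (unstated) database. The sample rows are hypothetical, and ranked_search is an illustrative helper, not part of the answer above. It builds one branch per tier and keeps the highest weight per message:

```python
import sqlite3

# Hypothetical sample data; table and column names follow the query above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable (messageid INTEGER PRIMARY KEY, subject TEXT, content TEXT)"
)
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?)",
    [
        (1, "search this keywords", "unrelated body"),
        (2, "how to search", "nothing else here"),
        (3, "misc", "please search this keywords for me"),
        (4, "misc", "only keywords appear"),
        (5, "misc", "no match at all"),
    ],
)


def ranked_search(conn, phrase):
    """Return (messageid, best_weight) pairs, best matches first."""
    branches, params = [], []

    def add(weight, column, op, value):
        # column and op come from the fixed tiers below; only values are bound.
        branches.append(
            f"SELECT messageid, {weight} AS weight FROM mytable WHERE {column} {op} ?"
        )
        params.append(value)

    words = phrase.split()
    add(100, "subject", "=", phrase)              # exact subject match
    add(90, "subject", "LIKE", f"%{phrase}%")     # subject contains the phrase
    for word in words:
        add(80, "subject", "LIKE", f"%{word}%")   # subject contains one word
    add(70, "content", "LIKE", f"%{phrase}%")     # content contains the phrase
    for word in words:
        add(60, "content", "LIKE", f"%{word}%")   # content contains one word

    sql = (
        "SELECT messageid, MAX(weight) AS best FROM ("
        + " UNION ALL ".join(branches)
        + ") GROUP BY messageid ORDER BY best DESC, messageid"
    )
    return conn.execute(sql, params).fetchall()


results = ranked_search(conn, "search this keywords")
```

Grouping by messageid alone (rather than by messageid and the text snippet) gives one row per message, which is usually what a ranked result list needs.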

Try this
select * from (
Select sch, rank,
case when sch like '%search this keywords%' then 0
when sch like '%search%' then 1
when sch like '%this%' then 2
when sch like '%keywords%' then 3 end ord
from
(
select subject as sch, 1 as rank from mytable
union all
select content, 2 as rank from mytable
) as x
) as y
where ord is not null
order by rank, ord

Implementing a simple 'Full Text Search' like tables would be a way.
CREATE TABLE YourTable (id int, subject varchar(256), content varchar(8000))
CREATE TABLE Keywords (key_id int, keyw varchar(50), relevanceModifier float)
CREATE TABLE SubjectsKeywords (key_fk int, yourTable_fk int, quantity int)
CREATE TABLE ContentKeywords (key_fk int, yourTable_fk int, quantity int)
When you insert in YourTable, fire a trigger to:
Split subject and content columns by spaces, commas, etc into words.
Optionally, skip "stop words" like "the", "they", "to", etc. (This is stop-word removal; stemming, which reduces words to their root form, is a separate step.)
Each word should be inserted in tables SubjectsKeywords, ContentKeywords and Keywords.
Optionally, set relevanceModifier. A very simple criterion would be to use the string length.
Optionally, count each occurrence and track it in the quantity fields.
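The splitting and counting steps above could be sketched as follows. This is Python rather than trigger code, the stop-word list is an assumption made up for illustration, and extract_keywords is a hypothetical helper name:

```python
import re
from collections import Counter

# Assumed stop-word list, for illustration only.
STOP_WORDS = {"the", "they", "to", "a", "an", "and", "of"}


def extract_keywords(text):
    """Split text into lowercase words, drop stop words, count occurrences."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(word for word in words if word not in STOP_WORDS)


counts = extract_keywords("Search the keywords, then search again")
```

Each (word, count) pair would become one row in SubjectsKeywords or ContentKeywords, with the count stored in quantity.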
Then your query would be like this:
select max(t.relevance), yourtable.id, MAX([subject]), MAX(content)
from
(
/* subject contains the whole phrase */
select 100 as relevance, id
from YourTable
where [subject] like '%search this keywords%'
UNION
/* subject keyword match */
select 70 as relevance, yt.id
from YourTable as yt
join SubjectsKeywords on yt.id = yourTable_fk
join Keywords as k on k.key_id = key_fk
where keyw in ('search', 'this', 'keywords')
UNION
/* content contains the whole phrase */
select 40 as relevance, id
from YourTable
where content like '%search this keywords%'
UNION
/* content keyword match */
select 10 as relevance, yt.id
from YourTable as yt
join ContentKeywords on yt.id = yourTable_fk
join Keywords as k on k.key_id = key_fk
where keyw in ('search', 'this', 'keywords')
) as T
join yourtable on t.id = yourtable.id
group by yourtable.id
order by max(relevance) desc
, yourtable.id ASC /*So that the result will always be in the same order*/
Notes:
A trigger is one way to do it if you have little control over your application or if changing it would be a maintenance nightmare.
Later you could improve it by adding Soundex, so that you can match even misspelled keywords.
The relevanceModifier and quantity fields can be used to compute more relevant results.
Since it may be fast enough, it could also serve as an autocomplete feature for your application, in which case you'd want to limit the results to, say, 256 at most.
I hope this gives you an idea so you can decide what suits you best.

Related

Finding the id's which include multiple criteria in long format

Suppose I have a table like this,
id tagId
1 1
1 2
1 5
2 1
2 5
3 2
3 4
3 5
3 8
I want to select id's where tagId includes both 2 and 5. For this fake data set, It should return 1 and 3.
I tried,
select id from [dbo].[mytable] where tagId IN(2,5)
But this returns ids that have either 2 or 5, not only those that have both. I also did not want to keep my table in wide format since tagId is dynamic and could grow to any number of columns. I also considered filtering with two separate queries and (somehow) finding the intersection. However, since in real life I may search for more than two tagId values, that sounds inefficient to me.
I am sure that this is something faced before when tag searching. What do you suggest? Changing table format?
One option is to count the number of distinct tagIds (from the ones you're looking for) each id has:
SELECT id
FROM [dbo].[mytable]
WHERE tagId IN (2,5)
GROUP BY id
HAVING COUNT(DISTINCT tagId) = 2
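As a quick check of this approach against the sample data from the question, here is a sketch using Python's built-in sqlite3 module (an assumption; the question targets SQL Server). The helper name ids_with_all_tags is hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER, tagId INTEGER)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?)",
    [(1, 1), (1, 2), (1, 5), (2, 1), (2, 5), (3, 2), (3, 4), (3, 5), (3, 8)],
)


def ids_with_all_tags(conn, tags):
    """Return ids that carry every tag in `tags`."""
    wanted = sorted(set(tags))  # duplicates would break the count comparison
    placeholders = ", ".join("?" for _ in wanted)
    sql = f"""
        SELECT id
        FROM mytable
        WHERE tagId IN ({placeholders})
        GROUP BY id
        HAVING COUNT(DISTINCT tagId) = ?
        ORDER BY id
    """
    return [row[0] for row in conn.execute(sql, [*wanted, len(wanted)])]


both = ids_with_all_tags(conn, [2, 5])
```

The count on the right-hand side of HAVING is derived from the tag list, so the same query works for any number of tags.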
This is actually a Relational Division With Remainder question.
First, you have to place your input into proper table format. I suggest you use a Table Valued Parameter if executing from client code. You can also use a temp table or table variable.
CREATE TABLE #ids (tagId int PRIMARY KEY);
INSERT #ids VALUES (2), (5);
There are a number of different solutions to this type of question.
Classic double-negative EXISTS
SELECT DISTINCT
mt.Id
FROM mytable mt
WHERE NOT EXISTS (SELECT 1
FROM #ids i
WHERE NOT EXISTS (SELECT 1
FROM mytable mt2
WHERE mt2.id = mt.id
AND mt2.tagId = i.tagId)
);
This is usually not efficient, though.
Comparing to the total number of IDs to match
SELECT mt.id
FROM mytable mt
JOIN #ids i ON i.tagId = mt.tagId
GROUP BY mt.id
HAVING COUNT(*) = (SELECT COUNT(*) FROM #ids);
This is much more efficient. You can also do this using a window function, it may be more or less efficient, YMMV.
SELECT mt.Id
FROM mytable mt
JOIN (
SELECT *,
total = COUNT(*) OVER ()
FROM #ids i
) i ON i.tagId = mt.tagId
GROUP BY mt.id
HAVING COUNT(*) = MIN(i.total);
Another solution involves cross-joining everything and checking how many matches there are using conditional aggregation
SELECT mt.id
FROM (
SELECT
mt.id,
mt.tagId,
matches = SUM(CASE WHEN i.tagId = mt.tagId THEN 1 END),
total = COUNT(*)
FROM mytable mt
CROSS JOIN #ids i
GROUP BY
mt.id,
mt.tagId
) mt
GROUP BY mt.id
HAVING SUM(matches) = MIN(total)
AND MIN(matches) >= 0;
There are other solutions also, see High Performance Relational Division in SQL Server

Display SUM of 2 listed/GROUP BY values using WHERE condition

I want to add the values of two columns displayed and display as 1 column name.
This is the output I'm getting,
ID Total
Apple 10
RawApple 10
Mango 10
RawMango 10
I want the output as
ID Total
Apples 20
Mangoes 20
If the issue is removing the first three characters -- if they are "Raw" -- then you can do:
select (case when id like 'Raw%' then stuff(id, 1, 3, '') else id end) as id,
sum(total)
from t
group by (case when id like 'Raw%' then stuff(id, 1, 3, '') else id end);
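A small sketch of the same grouping, run with Python's sqlite3 module on the question's sample rows. SQLite has no STUFF, so substr(id, 4) is used here as an assumed equivalent for dropping the first three characters:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id TEXT, total INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [("Apple", 10), ("RawApple", 10), ("Mango", 10), ("RawMango", 10)],
)

# Strip a leading 'Raw' (3 characters) before grouping, then sum per fruit.
sql = """
    SELECT CASE WHEN id LIKE 'Raw%' THEN substr(id, 4) ELSE id END AS fruit,
           SUM(total) AS total
    FROM t
    GROUP BY fruit
    ORDER BY fruit
"""
totals = conn.execute(sql).fetchall()
```

Note that this yields the singular names (Apple, Mango); producing the plural forms from the desired output would still need a lookup of some kind.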
If you want to replace specific values with other values, I would suggest an in-query lookup table:
select coalesce(v.new_id, t.id) as id, sum(total)
from t left join
(values ('RawApple', 'Apple'),
('RawMango', 'Mango')
) v(id, new_id)
on t.id = v.id
group by coalesce(v.new_id, t.id);
If we can assume that the name of the fruit is after the prefix, and the prefix ends with a hyphen (-), then we can use STUFF to remove the prefix and then aggregate:
WITH VTE AS(
SELECT *
FROM (VALUES('Apple',10),
('Raw-Apple',10),
('Mango',10),
('Raw-Mango',10))V(ID,Total))
SELECT S.ID,
SUM(V.Total) AS Total
FROM VTE V
CROSS APPLY(VALUES(STUFF(V.ID,1,CHARINDEX('-',V.ID),'')))S(ID)
GROUP BY S.ID;
Note that I don't change the names of the fruits to the plural, because the plural form depends on the fruit. You'll need a dictionary table that stores the plural of each fruit and then JOIN to it. So a table that looks like this:
CREATE TABLE dbo.FruitPlural (Fruit varchar(20), Plural varchar(20));
INSERT INTO dbo.FruitPlural
VALUES ('Apple','Apples'),
('Mango','Mangoes'),
('Strawberry','Strawberries'),
...;
Note, this answer was invalidated due to the OP moving the goal posts due to the sample data not being representative of their actual data, however, I am leaving here as it may help future users.

How to search for all except certain strings using like (or different solution welcomed)?

For now, I need to filter out rows with certain text strings.
E.g. for string that is given in this format: 'taxes, car' - I need to filter out all rows that include either "taxes" or "cars" in the description of the row.
I have come up with this:
SELECT
TransactionId
,t.DocumentID
,t.DocumentDescription
FROM [Transaction] t
INNER JOIN (SELECT CONCAT('%',[Value], '%') AS [Value]
FROM STRING_SPLIT(N'taxes,cars',',')
) w
ON t.[DocumentDescription] NOT LIKE w.[Value]
This does not work at all: each row is joined against both of the split strings, so a row is filtered out only when both strings appear in its description.
Any ideas how to make it work?
I think you want NOT EXISTS:
WITH w as (
SELECT value as word
FROM STRING_SPLIT(N'taxes,cars', ',')
)
SELECT t.*
FROM [Transaction] t
WHERE NOT EXISTS (SELECT 1
FROM w
WHERE t.DocumentDescription LIKE CONCAT('%', w.word, '%')
);
Note that because of the use of LIKE this query has to scan the entire table. You might want to rethink your data model, perhaps using a full text index or breaking the description into words if you have large tables and performance is an issue.
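To see the NOT EXISTS logic in action, here is a sketch using Python's sqlite3 module (an assumption; the thread is about SQL Server, and SQLite has no STRING_SPLIT, so the split happens in Python). The table name tx, the helper rows_without_words, and the three sample descriptions are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# 'tx' is a hypothetical stand-in for the [Transaction] table.
conn.execute(
    "CREATE TABLE tx (TransactionId INTEGER PRIMARY KEY, DocumentDescription TEXT)"
)
conn.executemany(
    "INSERT INTO tx (DocumentDescription) VALUES (?)",
    [("Blah, blah... cars...",), ("Yada, yada... taxes",), ("Blah blah...",)],
)


def rows_without_words(conn, csv_words):
    """Return rows whose description contains none of the comma-separated words."""
    words = [w.strip() for w in csv_words.split(",") if w.strip()]
    conn.execute("CREATE TEMP TABLE IF NOT EXISTS w (word TEXT)")
    conn.execute("DELETE FROM w")
    conn.executemany("INSERT INTO w VALUES (?)", [(word,) for word in words])
    sql = """
        SELECT t.TransactionId, t.DocumentDescription
        FROM tx AS t
        WHERE NOT EXISTS (
            SELECT 1
            FROM w
            WHERE t.DocumentDescription LIKE '%' || w.word || '%'
        )
        ORDER BY t.TransactionId
    """
    return conn.execute(sql).fetchall()


kept = rows_without_words(conn, "taxes,cars")
```

A row is dropped as soon as any one word matches, which is the behavior the join with NOT LIKE failed to give.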
Since you said you were open to other ideas... What you are looking to do can be done without a splitter function (e.g. STRING_SPLIT in your example). If you wanted to let your filter expression ('taxes,cars') come in as a parameter, then you could use STRING_SPLIT. Note the sample data and both examples below:
CREATE TABLE #Transaction
(
TransactionId INT IDENTITY,
DocumentDescription VARCHAR(1000)
);
INSERT #Transaction (DocumentDescription) VALUES('Blah, blah... cars...'), ('Yada, yada... taxes'),('Blah blah...');
-- Without a Splitter Function (e.g. STRING_SPLIT)
SELECT t.TransactionId, t.DocumentDescription
FROM #Transaction AS t
WHERE NOT EXISTS
(
SELECT 1
FROM (VALUES('taxes'),('cars')) AS srch(Item)
WHERE CHARINDEX(srch.Item,t.DocumentDescription) > 0
);
-- Using Split String
SELECT t.*
FROM #Transaction AS t
WHERE NOT EXISTS
(
SELECT 1
FROM STRING_SPLIT(N'taxes,cars', ',') AS w
WHERE CHARINDEX(w.[value],DocumentDescription) > 0
);
This got me the result I wanted!
SELECT TransactionId, DocumentID, DocumentDescription FROM [Transaction]
EXCEPT
SELECT t.TransactionId, t.DocumentID, t.DocumentDescription FROM [Transaction] t
INNER JOIN (SELECT CONCAT('%',[Value], '%') AS [Value] FROM
STRING_SPLIT(N'cars,taxes',',')) w
ON t.DocumentDescription LIKE w.[Value]

How to not display an item in select query?

I feel a little stupid asking this because I feel like this is very easy, but for some reason I'm not able to update a query to not select a specific item based on two criteria.
Let's say I have data like this:
ID Name Variant Count1
110 Bob Type1 0
110 Bob Type2 1
120 John Type1 1
So as you can see we have two BOB rows with same ID but different variant (type1 and type2). I want to be able to only see one of the Bob's.
Desired result:
110 Bob Type2
120 John Type1
So what I've been doing is something like
Select ID, Name, Variant, sum(count1) from tbl1
where (id not in (110) and Variant <> 'type1')
Group by Id,name,variant
Please don't use COUNT as a criteria, because in my example it just so happens that Count=0 for the row that I don't want to see. It can vary.
I have many rows where I can have multiple instances of the same id with a variety of different VARIANTS. I'm looking to exclude certain instances of ID based on Variant value
UPDATE:
It has nothing to do with the latest variant; it has to do with a specific variant. So I'm basically looking for a clause that uses the ID and VARIANT together to remove that particular row.
Aggregating (grouping) the data as you're doing is one way to do it, although the where condition is overkill. If all you want is the unique combinations of ID and Name, another approach is simply to use the distinct keyword.
select distinct Id, Name
from tbl1
If you always want to see data from a specific Variant then just include that condition in your where clause and you don't need to worry about using distinct or aggregates.
select *
from tbl1
where Variant = 'Type1'
If you always want to see the record associated with the latest Variant, then you can use a window function to do so.
select a.Id, a.Name, a.Variant
from
(
select *, row_number() over (partition by Id order by Variant desc) as RowRank
from tbl1
) a
where RowRank = 1
;
If there is not a predictable pattern for exclusion then you will have to maintain an exclusion list. It's not ideal but if you want to maintain this in the SQL itself then you could have a query like the one below.
select *
from tbl1
-- Define rows to exclude
where not (Id = 110 and Variant = 'Type1') -- Your example
and not (Id = 110 and Variant = 'Type3') -- Theoretical example
;
A better solution would be to create an exclusion reference table that holds all exclusions. Then you could simply anti-join to that table (a left join that keeps only the unmatched rows) to retrieve your desired results.
Have you considered using an exclusion table where you can place the ID and Variant combinations that you want to exclude? ( I just used temp tables for this example, you can always use user tables so your exclusion table will always be available)
Here is an example of what I mean based on your example:
if object_id('tempdb..#temp') is not null
drop table #temp
create table #temp (
ID int,
Name varchar(20),
Variant varchar(20),
Count1 int
)
if object_id('tempdb..#tempExclude') is not null
drop table #tempExclude
create table #tempExclude (
ID int,
Variant varchar(20)
)
insert into #temp values
(110,'Bob','Type1',0),
(110,'Bob','Type2',1),
(120,'John','Type1',1),
(120,'John','Type2',1),
(120,'John','Type2',1),
(120,'John','Type2',1),
(120,'John','Type3',1)
insert into #tempExclude values (110,'Type1')
select
t.ID,
t.Name
,t.Variant
,sum(t.Count1) as TotalCount
from
#temp t
left join
#tempExclude te
on t.ID = te.ID
and t.Variant = te.Variant
where
te.id is null
group by
t.ID,
t.Name
,t.Variant
I think the logic you want is something like:
Select ID, Name, Variant, sum(count1)
from tbl1
where not (id = 110 and variant = 'type1')
Group by Id, name, variant;
For the second condition, just keep adding:
where not (id = 110 and variant = 'type1') and
not (id = 314 and variant = 'popsicle')
You can also express this using a list of exclusions:
select t.ID, t.Name, t.Variant, sum(t.count1)
from tbl1 t left join
(values (110, 'type1'),
(314, 'popsicle')
) v(id, excluded_variant)
on t.id = v.id and
t.variant = v.excluded_variant
where v.id is null -- doesn't match an exclusion criterion
group by t.ID, t.Name, t.Variant;

Select rows from SQL where column doesn't match something in a string array?

Let's say I have a table, Product, with a column called ProductName, with values like:
Lawnmower
Weedwacker
Backhoe
Gas Can
Batmobile
Now, I have a list, in Notepad, of products that should be excluded from the result set, i.e.:
Lawnmower
Weedwacker
Batmobile
In my real-life problem, there are tens of thousands of records and thousands of exclusions. In SQL Server Management Studio, how can I construct a query similar to the following pseudocode that will return just Backhoe and Gas Can as results?
declare @excludedProductNames varchar(MAX) =
'Lawnmower
Weedwacker
Batmobile'
SELECT ProductName FROM Product
WHERE ProductName isn't in the list of @excludedProductNames
This is just a one-time report, so I don't care about performance at all.
First thing is getting those words into SSMS - you can construct a derived table using UNION ALL:
SELECT 'Lawnmower' AS word
UNION ALL
SELECT 'Weedwacker'
UNION ALL
SELECT 'Batmobile'
This will return a table with a single column, named "word":
word
--------
Lawnmower
Weedwacker
Batmobile
Caveat
You'll need to escape any single quotes in your data. For example, O'Brian needs to be changed to O''Brian: just double the single quote to escape it.
Now, to the real query...
Using NOT IN
Some databases limit the number of items in an IN list (somewhere in the thousands, IIRC), so NOT EXISTS or LEFT JOIN/IS NULL might be better alternatives.
SELECT p.*
FROM PRODUCT p
WHERE p.productname NOT IN (SELECT 'Lawnmower' AS word
UNION ALL
SELECT 'Weedwacker'
UNION ALL
SELECT 'Batmobile'
...)
Using NOT EXISTS
SELECT p.*
FROM PRODUCT p
WHERE NOT EXISTS (SELECT NULL
FROM (SELECT 'Lawnmower' AS word
UNION ALL
SELECT 'Weedwacker'
UNION ALL
SELECT 'Batmobile'
...) x
WHERE x.word = p.productname)
Using LEFT JOIN/IS NULL
SELECT p.*
FROM PRODUCT p
LEFT JOIN (SELECT 'Lawnmower' AS word
UNION ALL
SELECT 'Weedwacker'
UNION ALL
SELECT 'Batmobile'
...) x ON x.word = p.productname
WHERE x.word IS NULL
Which is The Most Efficient/Fastest?
If the columns compared are not nullable, NOT IN or NOT EXISTS are the best choice.
I think you're best off using some text editor tricks to accomplish this. Replace newlines with ', ' for example, and you can easily build a select * from product where ProductName not in ('...', '...') query.
Create a temp table, load all your exclusions there and select all rows that do not exist in the temp table.
-- create temp table #exclusions
select ProductName into #exclusions
from Product
where 1 = 2
-- run a bunch of inserts
insert into #exclusions (ProductName) values ('Lawnmower')
-- as many as needed...
-- run your select
select * from Product
where ProductName not in (select ProductName from #exclusions)
drop table #exclusions
As an alternative to running a ton of inserts, use bcp to upload a csv file containing the ProductNames into a non temp table.
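For a one-off report, the load-then-filter idea can also be scripted. The sketch below assumes Python's sqlite3 module in place of SQL Server, and products_not_excluded is a hypothetical helper. It pastes the Notepad list in as a string, loads it into a temp table with bound parameters (so names containing single quotes need no escaping), and keeps the products that are not in it:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Product (ProductName TEXT)")
conn.executemany(
    "INSERT INTO Product VALUES (?)",
    [(name,) for name in ["Lawnmower", "Weedwacker", "Backhoe", "Gas Can", "Batmobile"]],
)

# The exclusion list as it might be pasted from a text file, one name per line.
notepad_text = """Lawnmower
Weedwacker
Batmobile
"""


def products_not_excluded(conn, text):
    """Load one exclusion per line into a temp table and return the remaining products."""
    names = [line.strip() for line in text.splitlines() if line.strip()]
    conn.execute("CREATE TEMP TABLE IF NOT EXISTS exclusions (ProductName TEXT)")
    conn.execute("DELETE FROM exclusions")
    conn.executemany("INSERT INTO exclusions VALUES (?)", [(n,) for n in names])
    sql = """
        SELECT p.ProductName
        FROM Product AS p
        WHERE NOT EXISTS (
            SELECT 1 FROM exclusions AS e WHERE e.ProductName = p.ProductName
        )
        ORDER BY p.ProductName
    """
    return [row[0] for row in conn.execute(sql)]


remaining = products_not_excluded(conn, notepad_text)
```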