Is there a way in SQL Server to sort by the number of matched words in a contains function on a full text index - sql

I have a table in a database that has a description of an item. I want to be able to have the user type a search term and return the rows that had at least one match, sorted by the number of matches they had, descending.
I don't know if this is possible; I haven't been able to find an answer by googling, so I'm asking here.
Basically, if the user enters "truck blue with gold two tone", this query will be generated:
SELECT * FROM MyItemsTable
WHERE contains(Description, 'truck or blue or with or gold or two or tone')
and I want the results returned sorted by the number of words that matched.
Any advice would be greatly appreciated. This table will become very large in time, so efficiency is also on my mind.

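This seems to have worked very well, thanks very much to Gordon Linoff. CONTAINSTABLE returns a RANK column that can drive the ordering (it is a relevance score, not a literal count of matched words):
SELECT * FROM MyItemsTable m
INNER JOIN
CONTAINSTABLE(MyItemsTable, Description, 'truck or blue or with or gold or two or tone') AS l ON m.MyItemsTable=l.[KEY]
ORDER BY l.[RANK] DESC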
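To make the goal concrete, here is a minimal Python sketch of ranking rows by how many distinct search words they contain. It only illustrates the ordering the question asks for; it is not how SQL Server computes the RANK column, and the sample rows and the helper name rank_by_matches are made up for illustration.

```python
import re

def rank_by_matches(rows, search):
    """Return (match_count, description) pairs, best match first.

    Hypothetical helper: counts how many distinct search words occur as whole
    words in each description and drops rows with no match at all.
    """
    words = set(search.lower().split())
    ranked = []
    for description in rows:
        tokens = set(re.findall(r"\w+", description.lower()))
        count = len(words & tokens)
        if count > 0:
            ranked.append((count, description))
    # Highest number of matched words first; ties keep their input order.
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return ranked

# Made-up sample descriptions.
rows = [
    "blue truck",
    "red sedan",
    "truck blue with gold two tone",
    "gold watch",
]
result = rank_by_matches(rows, "truck blue with gold two tone")
```

Counting distinct words rather than occurrences means a description that repeats "truck" ten times does not outrank one that matches several different words.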

Reference
If you have a record like "truck blue with gold two tone", you can use the query below.
SELECT * FROM
MyItemsTable as t
JOIN CONTAINSTABLE(MyItemsTable , Description,'"truck"') fulltextSearch
ON
t.[Id] = fulltextSearch.[KEY]
This will also return that record.

Related

Creating a view that contains all records from one table, that match the comma separated field content in another table

I have two tables au_postcodes and groups.
Table groups contains a field called PostCodeFootPrint
that contains the postcode set making up the footprint.
Table au_postcodes contains a field called poa_code that
contains a single postcode.
The records in groups.PostCodeFootPrint look like:
PostCodeFootPrint
2529,2530,2533,2534,2535,2536,2537,2538,2539,2540,2541,2575,2576,2577,2580
2640
3844
2063, 2064, 2065, 2066, 2067, 2068, 2069, 2070, 2071, 2072, 2073, 2074, 2075, 2076, 2077, 2079, 2080, 2081, 2082, 2083, 2119, 2120, 2126, 2158, 2159
2848, 2849, 2850, 2852
Some records have only one postcode, some have multiple separated by a "," or ", " (comma and space).
The records in au_postcodes.poa_code look like:
poa_code
2090
2092
2093
829
830
836
2080
2081
Single postcode (always).
The objective is to:
Get all records from au_postcodes where the poa_code appears in groups.PostCodeFootPrint, and put them into a view.
I tried:
SELECT
au_postcodes.poa_code,
groups."NameOfGroup"
FROM
groups,
au_postcodes
WHERE
groups."PostcodeFootprint" LIKE '%au_postcodes.poa_code%'
But no luck.
You can use regex for this. Take a look at this fiddle:
https://dbfiddle.uk/?rdbms=postgres_14&fiddle=739592ef262231722d783670b46bd7fa
There I build a regex from the poa_code wrapped in word-boundary markers (\y), to avoid partial matches, and compare it to the PostCodeFootPrint:
select p.poa_code, g.PostCodeFootPrint
from groups g
join au_postcode p
on g.PostCodeFootPrint ~ concat('\y', p.poa_code, '\y')
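As a rough illustration of why the word boundaries matter, here is a small Python sketch. Python spells the word boundary \b where Postgres uses \y; the helper name footprint_contains and the sample footprint are assumptions made for this example.

```python
import re

def footprint_contains(footprint, poa_code):
    """True if poa_code appears as a whole token in the comma-separated footprint."""
    # re.escape guards against regex metacharacters in the code itself.
    pattern = r"\b" + re.escape(poa_code) + r"\b"
    return re.search(pattern, footprint) is not None

footprint = "2063, 2064, 2080, 2081"
whole = footprint_contains(footprint, "2080")    # the exact postcode is listed
partial = footprint_contains(footprint, "208")   # only a prefix of 2080 and 2081
naive = "208" in footprint                       # plain substring test, no boundaries
```

The plain substring test accepts "208" because it is part of "2080", which is the kind of partial match the boundaries are there to reject.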
Depending on your data, this may be performant enough. I also believe that in Postgres you have access to the array data type, so it might be better to store the postcode lists as arrays.
https://dbfiddle.uk/?rdbms=postgres_14&fiddle=ae24683952cb2b0f3832113375fbb55b
Here I stored the post code lists as arrays, then used ANY to join with.
select p.poa_code, g.PostCodeFootPrint
from groups g
join au_postcode p
on p.poa_code = any(g.PostCodeFootPrint);
In these two fiddles I use explain to show the cost of the queries, and while the array solution is more expensive, I imagine it might be easier to maintain.
https://dbfiddle.uk/?rdbms=postgres_14&fiddle=7f16676825e10625b90eb62e8018d78e
https://dbfiddle.uk/?rdbms=postgres_14&fiddle=e96e0fc463f46a7c467421b47683f42f
I changed the underlying data type to integer in this fiddle, expecting it to reduce the cost, but it didn't, which seems strange to me.
https://dbfiddle.uk/?rdbms=postgres_14&fiddle=521d6a7d0eb4c45471263214186e537e
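It is possible to reduce the query cost with the # operator from the intarray extension, which returns the index of the first matching array element, or 0 if there is none (see the last query here: https://dbfiddle.uk/?rdbms=postgres_14&fiddle=edc9b07e9b22ee72f856e9234dbec4ba):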
select p.poa_code, g.PostCodeFootPrint
from groups g
join au_postcode p
on (g.PostCodeFootPrint # p.poa_code) > 0;
but it is still more expensive than the regex. However, I think you can probably rearrange the way the tables are set up and radically change performance. See the first and second queries in the fiddle, where I take each post code in the footprint and insert it as a row in a table, along with an identifier for the group it was in:
select p.poa_code, g.which
from groups2 g
join au_postcode p
on g.footprint = p.poa_code;
The explain plan for this indicates that query cost drops significantly (from 60752.50 to 517.20, or two orders of magnitude) and the execution times go from 0.487 to 0.070. So it might be worth looking into changing the table structure.
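The restructuring described above can be sketched with Python's built-in sqlite3 module. This is only an illustration under assumptions: SQLite stands in for Postgres, the group names and sample postcodes are made up, and the index name groups2_footprint_idx is hypothetical. The table and column names groups2, which, and footprint follow the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE au_postcode (poa_code TEXT PRIMARY KEY);
    CREATE TABLE groups2 (which TEXT, footprint TEXT);
    CREATE INDEX groups2_footprint_idx ON groups2 (footprint);
""")
conn.executemany("INSERT INTO au_postcode VALUES (?)",
                 [("2080",), ("2081",), ("2090",), ("829",)])

# Comma-separated footprints, as stored in the original groups table.
footprints = {"group_a": "2063, 2064, 2080, 2081", "group_b": "2640"}

# One row per (group, postcode): split on commas and strip stray spaces.
pairs = [(name, code.strip())
         for name, footprint in footprints.items()
         for code in footprint.split(",")]
conn.executemany("INSERT INTO groups2 VALUES (?, ?)", pairs)

# The join is now a plain equality, which an ordinary index can serve.
matches = conn.execute("""
    SELECT p.poa_code, g.which
    FROM groups2 g
    JOIN au_postcode p ON g.footprint = p.poa_code
    ORDER BY p.poa_code
""").fetchall()
```

With one postcode per row, the join condition no longer has to parse a string for every pair of rows, which is the likely reason the plan cost drops so sharply.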
Since the values of PostCodeFootPrint are separated by a common character, you can easily create an array out of it. From there, use unnest to convert the array elements to records, and then join them with au_postcode:
SELECT * FROM au_postcode au
JOIN (SELECT trim(unnest(string_to_array(PostCodeFootPrint,',')))
FROM groups) fp (PostCodeFootPrint) ON fp.PostCodeFootPrint = au.poa_code;

SQL Specific LIKE ANY String Search

So I'm in Teradata trying to pull any products that have more than 1 color-related name, as seen in the code snippet here:
SELECT
pt.product_number,
COUNT (CASE WHEN ot.option_name like any ('%green%', '%red%', '%blue%') THEN 1 ELSE NULL END) as differentColorCount
FROM product_table pt
JOIN option_table ot on ot.product_num = pt.product_num
HAVING differentColorCount > 1
GROUP BY 1
This is running fine, but the problem that I'm realizing is that a product might have a hundred different "Red" options for instance. (Red-1, Red-2, Red-3, etc). But I only want a count of when two of the different color strings are present for a single product.
So instead of LIKE ANY what I really need is LIKE ANY TWO. If both Red AND Green are present, count 1. If both Blue AND Purple are present, count 1.
I realize I could do a really long list where I do dozens of LIKE ALLs in every possible combination, but that doesn't seem like it will scale well if I need to check for, say 100 different colors instead of 6?
If anyone has any insight on this I would be incredibly grateful. Thanks in advance for any help you can offer! :)
You can utilize a regular expression to extract the color and then apply a distinct count:
Count (DISTINCT RegExp_Substr(option_name, '(green|red|blue)')) AS differentColorCount
This is similar to your like any ('%green%', '%red%', '%blue%'), but returns the actual matching color instead of TRUE/FALSE.
The '(green|red|blue)' search pattern defines three alternative search strings, and RegExp_Substr returns the first match.
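The same idea, extract the matching color and then count distinct values per product, can be sketched in Python. The sample rows and the helper name distinct_color_counts are made up, and case-insensitive matching is an assumption of this sketch rather than a statement about Teradata's defaults.

```python
import re
from collections import defaultdict

# Assumption: colors are matched case-insensitively.
COLOR_PATTERN = re.compile(r"green|red|blue", re.IGNORECASE)

def distinct_color_counts(options):
    """Map product number -> number of distinct colors found in its option names."""
    colors = defaultdict(set)
    for product, option_name in options:
        match = COLOR_PATTERN.search(option_name)
        if match:
            # Store the color itself, so Red-1 and Red-2 collapse into "red".
            colors[product].add(match.group(0).lower())
    return {product: len(found) for product, found in colors.items()}

# Made-up (product_number, option_name) rows.
options = [
    (1, "Red-1"), (1, "Red-2"), (1, "Red-3"),
    (2, "Red-1"), (2, "Green-1"),
    (3, "Plain"),
]
counts = distinct_color_counts(options)
multi_color = sorted(p for p, n in counts.items() if n > 1)
```

Product 1 has three red options but only one distinct color, so only product 2 passes the "more than one color" filter.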

Custom SQL sort by

Use:
The user searches for a partial postcode such as 'RG20' which should then be displayed in a specific order. The query uses the MATCH AGAINST method in boolean mode where an example of the postcode in the database would be 'RG20 7TT' so it is able to find it.
At the same time, it also matches against a list of other postcodes which are in its radius (which is a separate query).
I can't seem to find a way to order by a partial match, e.g.:
ORDER BY FIELD(postcode, 'RG20', 'RG14', 'RG18','RG17','RG28','OX12','OX11')
DESC, city DESC
Because it's not specifically looking for RG20 7TT, I don't think it can make a partial match.
I have tried SUBSTR (postcode, -4) and looked into left and right, but I haven't had any success using 'by field' and could not find another route...
Sorry this is a bit long winded, but I'm in a bit of a bind.
A UK postcode splits into 2 parts, the last section always being 3 characters and within my database there is a space between the two if that helps at all.
Although there is a DESC after the postcodes, I do need them to display in THAT particular order (RG20, RG14, then RG18, etc.). I'm unsure whether specifying descending will remove the ordering or not.
Order By Case
When postcode Like 'RG20%' Then 1
When postcode Like 'RG14%' Then 2
When postcode Like 'RG18%' Then 3
When postcode Like 'RG17%' Then 4
When postcode Like 'RG28%' Then 5
When postcode Like 'OX12%' Then 6
When postcode Like 'OX11%' Then 7
Else 99
End Asc
, City Desc
You're on the right track, trimming the field down to its first four characters:
ORDER BY FIELD(LEFT(postcode, 4), 'RG20', 'RG14', ...),
-- or SUBSTRING(postcode FROM 1 FOR 4)
-- or SUBSTR(postcode, 1, 4)
Here you don't want DESC.
(If your result set contains postcodes whose prefixes do not appear in your FIELD() ordering list, you'll have a bit more work to do: FIELD() returns 0 when there is no match, so those records will otherwise appear before any explicitly ordered records you specify, that is, before 'RG20' in the example above.)
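The ordering logic, including pushing unlisted prefixes to the end rather than the front, can be sketched in Python. The sample postcodes are made up, and the fallback rank of 99 mirrors the Else 99 branch of the CASE answer above.

```python
PREFIX_ORDER = ["RG20", "RG14", "RG18", "RG17", "RG28", "OX12", "OX11"]
RANK = {prefix: position for position, prefix in enumerate(PREFIX_ORDER, start=1)}

def sort_key(postcode):
    # Unlisted prefixes get rank 99, like the ELSE branch of a CASE expression,
    # so they sort after every explicitly ordered prefix.
    return RANK.get(postcode[:4], 99)

# Made-up sample postcodes.
postcodes = ["OX11 0AA", "SW1A 1AA", "RG14 2BB", "RG20 7TT", "RG18 3CC"]
ordered = sorted(postcodes, key=sort_key)
```

Sorting ascending on the rank gives RG20 first; adding DESC to the equivalent SQL would reverse the whole custom order.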
If you want a completely custom sorting scheme, then I only see one way to do it...
Create a table to hold the values upon which to sort, and include a "sequence" or "sort_order" field. You can then join to this table and sort by the sequence field.
One note on the sequence field. It makes sense to create it as an int as... well, sequences are often ints :)
If there is any possibility of changing the sort order, you may want to consider making it alphanumeric... It is a lot easier to insert "5A" between "5" and "6" than it is to insert a number into a sequence of integers.
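A minimal sketch of the lookup-table approach, using Python's built-in sqlite3 module so it can be run as-is. SQLite stands in for the poster's database, and the table names addresses and postcode_sort, the column sort_order, and the sample rows are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE addresses (postcode TEXT, city TEXT);
    CREATE TABLE postcode_sort (prefix TEXT PRIMARY KEY, sort_order INTEGER);
""")
# Made-up sample rows.
conn.executemany("INSERT INTO addresses VALUES (?, ?)", [
    ("RG14 2BB", "Newbury"),
    ("OX11 0AA", "Didcot"),
    ("RG20 7TT", "Newbury"),
])
conn.executemany("INSERT INTO postcode_sort VALUES (?, ?)",
                 [("RG20", 1), ("RG14", 2), ("OX11", 7)])

# Join each address to its prefix's position, then sort by that position.
rows = conn.execute("""
    SELECT a.postcode
    FROM addresses a
    JOIN postcode_sort s ON s.prefix = substr(a.postcode, 1, 4)
    ORDER BY s.sort_order, a.city DESC
""").fetchall()
sorted_postcodes = [postcode for (postcode,) in rows]
```

Changing the display order then means updating rows in postcode_sort rather than editing the query. Note that the inner join drops addresses whose prefix is not in the lookup table; a LEFT JOIN would keep them.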
Another method I use is utilising the charindex function:
order by charindex(substr(postcode,1,4),'RG20RG14RG18...',1)
I think that's the syntax anyway, I'm just doing this in SAS at the moment so I've had to adapt from memory!
But essentially the sooner you hit your desired part of the string, the higher the rank.
If you're trying to rank on a large variety of postcodes then a case statement gets pretty hefty.

How to optimize group by in table with huge number of records

I have a Person table with a huge number of records (about 16 million), and I need to find all persons with the same last name, first letter of first name, and birth year. In other words, I want to show presumed duplicate persons in the UI so that users can analyze them and decide whether they are the same person or not.
Here is the query I wrote:
SELECT *
FROM Person INNER JOIN
(
SELECT SUBSTRING(firstName, 1, 1) firstNameF,lastName,YEAR(birthDate) birthYear
FROM Person
GROUP BY SUBSTRING(firstName, 1,1),lastName,YEAR(birthDate)
HAVING count(*) > 1
) as dupPersons
ON SUBSTRING(Person.firstName,1,1) = dupPersons.firstNameF and Person.lastName = dupPersons.lastName and YEAR(Person.birthDate) = dupPersons.birthYear
order by Person.lastName,Person.firstName
But as I am not an SQL expert, I want to know: is this a good way to do it? Is there a more optimized way?
EDIT
Note that I can cut the data, which could contribute to optimization;
for example, if I cut the data by 2, it could return two persons:
Johan Smith |
Jane Smith | have same last name and first-name initial
Jack Smith |
Mark Tween | have same last name and first-name initial
Mac Tween |
If the performance using a GROUP BY is not adequate, you could try using an INNER JOIN (a self-join):
SELECT *
FROM Person p1
INNER JOIN Person p2 ON p2.PersonID > p1.PersonID
WHERE SUBSTRING(p2.Firstname, 1, 1) = SUBSTRING(p1.Firstname, 1, 1)
AND p2.LastName = p1.LastName
AND YEAR(p2.BirthDate) = YEAR(p1.BirthDate)
ORDER BY
p1.LastName, p1.FirstName
Well, if you're not an expert, the query you wrote says to me that you're at least pretty competent. When we look at whether a query is "optimized", there are two immediate parts to that:
1. Whether the query on its own has something notably wrong with it: a bad join, keyword misuse, exploding result set size, superstitions about NOT IN, etc.
2. The context that the query operates within: DB specifics, task specifics, etc.
Your query passes #1, no problem. I would have written it differently - aliased the Person table, used LEFT(P.FirstName, 1) instead of SUBSTRING, and used a CTE (WITH-clause) instead of a subquery. But these aren't optimization issues. Maybe I'd use WITH(READUNCOMMITTED) if the results weren't sensitive to dirty reads. Out of any further context, your query doesn't look like a bomb waiting to go off.
As for #2 - You should probably switch to specifics. Like "I have to run this every week. It takes 17 minutes. How can I get it down to under a minute?" Then people will ask you what your plan looks like, what indexes you have, etc.
Things I'd want to know:
How long does it already take to run?
What's your runtime window? (User & app tolerance for query time.)
Is this run once a day? Week? Month? Quarter?
Do you have the permission to create tables, change current tables, or alter indexes?
Maybe based on having run it, what's the ratio of duplicates you're expecting to find? 5%? 90%?
How stable is the matching criteria requirement?
Example scenario: if this were a run-on-command feature that will be in my app indefinitely and gets run weekly, with 10% or fewer records expected to be duplicates, with the ability to change the DB how I'd like, with firm (not fluctuating) duplicate matching criteria, and I want to cut it from 90s to 5s, I'd create a dedicated BirthYear column (possibly a persisted computed column off of BirthDate) and an index on LastName ASC, BirthYear ASC, FirstName ASC. If too many of those stipulations change, I might go in a different direction entirely.
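The example scenario can be sketched with Python's built-in sqlite3 module. This is an illustration under assumptions, not a SQL Server recipe: SQLite stands in for the real database, birth_year is a plain column filled in on insert rather than a persisted computed column, and the index name person_dup_idx and the sample rows are made up. Whether an optimizer actually uses such an index depends on the engine and the data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (
        person_id  INTEGER PRIMARY KEY,
        first_name TEXT,
        last_name  TEXT,
        birth_date TEXT,      -- ISO date, e.g. '1980-05-17'
        birth_year INTEGER    -- stored copy of the year, filled in on insert
    );
    CREATE INDEX person_dup_idx ON person (last_name, birth_year, first_name);
""")
# Made-up sample rows.
people = [
    ("Johan", "Smith", "1980-01-02"),
    ("Jane", "Smith", "1980-07-09"),
    ("Mark", "Tween", "1975-03-03"),
    ("Mac", "Tween", "1976-04-04"),
]
conn.executemany(
    "INSERT INTO person (first_name, last_name, birth_date, birth_year) "
    "VALUES (?, ?, ?, CAST(substr(?, 1, 4) AS INTEGER))",
    [(first, last, born, born) for first, last, born in people],
)

# Same shape as the original query, but grouping on the stored year.
duplicates = conn.execute("""
    SELECT p.first_name, p.last_name
    FROM person p
    JOIN (
        SELECT substr(first_name, 1, 1) AS initial, last_name, birth_year
        FROM person
        GROUP BY substr(first_name, 1, 1), last_name, birth_year
        HAVING COUNT(*) > 1
    ) d ON d.initial = substr(p.first_name, 1, 1)
       AND d.last_name = p.last_name
       AND d.birth_year = p.birth_year
    ORDER BY p.last_name, p.first_name
""").fetchall()
```

The two Tweens share an initial and a last name but not a birth year, so only the two Smiths are reported.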
You can try something like this and see the difference on the execution plans, or benchmark the results on performance:
;WITH DupPersons AS
(
SELECT *, COUNT(1) OVER(PARTITION BY SUBSTRING(firstName, 1, 1), lastName, YEAR(birthDate)) Quant
FROM Person
)
SELECT *
FROM DupPersons
WHERE Quant > 1
Of course, it would also help to know your table definition and the indexes you created. It may help to add computed columns for the year of birthDate and for the first letter of firstName, and to create indexes on them.

Group keywords by site

I am finding a lot of useful help here today, and I really appreciate it. This should be the last one for the day:
I have a list of the top 10 keywords per site, sorted by visits, by date. The records need to be sorted as follows (excuse the formatting):
                             2010-05      2010-04
site1.com  keyword1          apples       wine
           keyword1 visits   100          12
           keyword2          oranges      water
           keyword2 visits   99           10
site2.com  keyword1          blueberry    cornbread
           keyword1 visits   90           100
           keyword2          squares      biscuits
           keyword2 visits   80           99
Basically what I need to accomplish involves grouping, but I can't seem to figure it out. Am I heading down the right path, or is there another way to achieve this, or is it just impossible?
Edit:
The dataset is something like this (csv):
site_name,date,keyword,visits
site1.com,2010-04,apples,100
site1.com,2010-04,oranges,99
site1.com,2010-05,wine,12
site1.com,2010-05,water,10
site2.com,2010-04,cornbread,100
site2.com,2010-04,biscuits,99
site2.com,2010-05,blueberry,90
site2.com,2010-05,squares,80
Across the X-axis, we need to have the 'date' value
Across the Y-axis, we need to have the 'site_name' as the primary value, but grouped within that we need to have the 'keyword' followed by the respective 'visits'.
Ok, I think you are going down the right track. It's a little tricky getting the groups right, but this should be able to be solved with grouping.
What it looks like you need is a matrix (the table where you can have dynamic rows and columns) and put the dates in a group across the top. Then group the rows by site name and then (I think) by keyword.
If grouping by keyword doesn't work, try grouping by the row number instead (within the scope of the site name group)? If this doesn't work, try getting your database to produce an extra column with rank in it first. Then you can definitely group by that. What I mean is:
site_name,date,keyword,visits,rank
site1.com,2010-04,apples,100,1
site1.com,2010-04,oranges,99,2
site1.com,2010-05,wine,12,1
site1.com,2010-05,water,10,2
site2.com,2010-04,cornbread,100,1
site2.com,2010-04,biscuits,99,2
site2.com,2010-05,blueberry,90,1
site2.com,2010-05,squares,80,2
You should then be able to add two rows in that group to put the keyword and visits in. If you can't, you might have to resort to fancy rectangle work - in the detail cell, put a rectangle, then two textboxes, with the keyword in the top one and the number of visits in the bottom one.
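If the database cannot easily produce the rank column, it can be computed before the data reaches the report. Here is a minimal Python sketch, assuming rank means position by visits (descending) within each site and date; the helper name add_rank is made up, and only the site1.com rows from the sample data are used.

```python
import csv
import io
from collections import defaultdict

DATA = """site_name,date,keyword,visits
site1.com,2010-04,apples,100
site1.com,2010-04,oranges,99
site1.com,2010-05,wine,12
site1.com,2010-05,water,10
"""

def add_rank(rows):
    """Add a 1-based rank by visits (descending) within each site and date."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["site_name"], row["date"])].append(row)
    ranked = []
    for key in sorted(groups):
        ordered = sorted(groups[key], key=lambda r: int(r["visits"]), reverse=True)
        for rank, row in enumerate(ordered, start=1):
            ranked.append({**row, "rank": rank})
    return ranked

ranked_rows = add_rank(csv.DictReader(io.StringIO(DATA)))
summary = [(r["date"], r["keyword"], r["rank"]) for r in ranked_rows]
```

Grouping the report rows on rank instead of keyword lines up "keyword 1" for every month in the same row, even though the actual keyword differs from month to month.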
Create a row grouping on "site_name", then a child row grouping on "keyword".
You don't need to use a Matrix, as you know how many columns you will have, so you can just do it in a table.
So the grouping would be something like
=Fields!site_name.Value
with the same value appearing in the text box,
then for the next grouping down
=Fields!keyword.Value
ditto for the text box.
You can just use SUM to figure out how many visits, =SUM(Fields!visits.Value),
in the group total.