SQL query to find rows with the most matching keywords

I'm really bad at SQL, and I would like to know what SQL I can run to solve the problem below, which I suspect to be an NP-complete problem. I'm OK with the query taking a long time to run over large datasets, as this will be done as a background task. A standard SQL statement is preferred, but if a stored procedure is required then so be it. The SQL is required to run on Postgres 9.3.
Problem: Given a set of articles that contain a set of keywords, find the top n articles for each article that contains the most number of matching keywords.
A trimmed down version of the article table looks like this:
CREATE TABLE article (
  id character varying(36) NOT NULL, -- primary key of article
  keywords character varying,        -- comma-separated set of keywords
  CONSTRAINT pk_article PRIMARY KEY (id)
);
-- Test Data
INSERT INTO article(id, keywords) VALUES(0, 'red,green,blue');
INSERT INTO article(id, keywords) VALUES(1, 'red,green,yellow');
INSERT INTO article(id, keywords) VALUES(2, 'purple,orange,blue');
INSERT INTO article(id, keywords) VALUES(3, 'lime,violet,ruby,teal');
INSERT INTO article(id, keywords) VALUES(4, 'red,green,blue,yellow');
INSERT INTO article(id, keywords) VALUES(5, 'yellow,brown,black');
INSERT INTO article(id, keywords) VALUES(6, 'black,white,blue');
A SELECT * FROM article; query would then return:
Table: article
------------------------
id keywords
------------------------
0 red,green,blue
1 red,green,yellow
2 purple,orange,blue
3 lime,violet,ruby,teal
4 red,green,blue,yellow
5 yellow,brown,black
6 black,white,blue
Assuming I want to find, for each article, the top 3 articles that contain the most matching keywords, the output should be this:
------------------------
id related
------------------------
0 4,1,6
1 4,0,5
2 0,4,6
3 null
4 0,1,6
5 1,6
6 5,0,4

As @a_horse commented: this would be simpler with a normalized design (besides making other tasks simpler/cleaner), but it is still not trivial.
Also, a PK column of data type character varying(36) is highly suspicious (and inefficient) and should most probably be an integer type or at least a uuid instead.
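For illustration, migrating the PK to an integer might look like this (a hedged sketch; it assumes, as in the test data, that every existing id is numeric, and that no foreign keys reference the column yet):
ALTER TABLE article
  ALTER COLUMN id TYPE integer USING id::integer;  -- rebuilds the PK index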
Here is one possible solution based on your design as is:
WITH cte AS (
   SELECT id, string_to_array(a.keywords, ',') AS keys
   FROM   article a
   )
SELECT id, string_agg(b_id, ',') AS best_matches
FROM  (
   SELECT a.id, b.id AS b_id
        , row_number() OVER (PARTITION BY a.id ORDER BY ct.ct DESC, b.id) AS rn
   FROM   cte a
   LEFT   JOIN cte b ON a.id <> b.id AND a.keys && b.keys
   LEFT   JOIN LATERAL (
      SELECT count(*) AS ct
      FROM  (
         SELECT * FROM unnest(a.keys)
         INTERSECT ALL
         SELECT * FROM unnest(b.keys)
         ) i
      ) ct ON TRUE
   ORDER  BY a.id, ct.ct DESC, b.id  -- b.id as tiebreaker
   ) sub
WHERE  rn < 4
GROUP  BY 1;
sqlfiddle (using an integer id instead).
The CTE cte converts the string into an array. You could even base a functional GIN index on the same expression ...
If multiple rows tie for the top 3 picks, you need to define a tiebreaker. In my example, rows with smaller id come first.
Detailed explanation in this recent related answer:
Query and order by number of matches in JSON array
The comparison there is between a JSON array and an SQL array, but it's basically the same problem and boils down to the same solution(s). It also compares a couple of similar alternatives.
To make this fast, you should at least have a GIN index on the array column (instead of the comma-separated string) and the query wouldn't need the CTE step. A completely normalized design has other advantages, but won't necessarily be faster than an array with GIN index.
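For illustration, such a functional GIN index over the existing comma-separated column might look like this (a sketch; the index name is made up, and a query must repeat the exact indexed expression for the index to be usable):
CREATE INDEX article_keywords_gin_idx ON article
USING gin (string_to_array(keywords, ','));

-- The overlap test then has to use the indexed expression, e.g.:
-- ... WHERE string_to_array(a.keywords, ',') && string_to_array(b.keywords, ',')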

You can store lists in comma-separated strings. No problem, as long as this is just a string for you and you are not interested in its separate values. As soon as you are interested in the separate values, as in your example, store them separately.
This said, correct your database design and only then think about the query.
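For reference, the query below assumes a normalized keywords table roughly like this (a sketch; the table and column names are inferred from the query, and the id::int cast assumes numeric ids as in the test data):
CREATE TABLE keywords (
  id      int  NOT NULL,  -- article id
  keyword text NOT NULL,
  PRIMARY KEY (id, keyword)
);

-- One row per (article, keyword) pair:
INSERT INTO keywords (id, keyword)
SELECT id::int, unnest(string_to_array(keywords, ','))
FROM   article;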
The following query selects all ID pairs first and counts common keywords. It then ranks the pairs by giving the other ID with the most keywords in common rank #1, etc. Then you keep only the three best matching IDs. STRING_AGG lists the best matching IDs in a string ordered by the number of keywords in common.
select
   this_article as id,
   string_agg(other_article, ',' order by rn) as related
from
(
   select
      this_article,
      other_article,
      row_number() over (partition by this_article order by cnt_common desc) as rn
   from
   (
      select
         this.id as this_article,
         other.id as other_article,
         count(other.id) as cnt_common
      from keywords this
      left join keywords other on other.keyword = this.keyword and other.id <> this.id
      group by this.id, other.id
   ) pairs
) ranked
where rn <= 3
group by this_article
order by this_article;
Here is the SQL fiddle: http://sqlfiddle.com/#!15/1d20c/9.

Related

Find if a string is in or not in a database

I have a list of IDs
'ACE', 'ACD', 'IDs', 'IN','CD'
I also have a table with a structure similar to the following:
ID value
ACE 2
CED 3
ACD 4
IN 4
IN 4
I want a SQL query that returns a list of IDs that exist in the database and a list of IDs that do not exist in the database.
The return should be:
1. ACE, ACD, IN (exist)
2. IDs, CD (not exist)
My code is like this:
select
   ID,
   value
from db
where ID in ('ACE', 'ACD', 'IDs', 'IN', 'CD')
However, the result is 1) super slow with all kinds of IDs and 2) returns multiple rows with the same ID. Is there any way, using PostgreSQL, to 1) return unique IDs and 2) make it run faster?
Assuming no duplicates in the table or in the input, this query should do it:
SELECT t.id IS NOT NULL AS id_exists
     , array_agg(ids.id)
FROM   unnest(ARRAY['ACE','ACD','IDs','IN','CD']) ids(id)
LEFT   JOIN tbl t USING (id)
GROUP  BY 1;
Else, please define how to deal with duplicates on either side.
If the LEFT JOIN finds a matching row, the expression t.id IS NOT NULL is true. Else it's false. GROUP BY 1 groups by this expression (1st in the SELECT list), array_agg() forms arrays for each of the two groups.
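Regarding duplicates: the sample data actually contains 'IN' twice. A hedged variant that tolerates duplicate ids in the table by de-duplicating before the join (which also answers the "unique ID" request):
SELECT t.id IS NOT NULL AS id_exists
     , array_agg(ids.id)
FROM   unnest(ARRAY['ACE','ACD','IDs','IN','CD']) ids(id)
LEFT   JOIN (SELECT DISTINCT id FROM tbl) t USING (id)  -- de-dup before joining
GROUP  BY 1;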
Related:
Select rows which are not present in other table
Hmmm . . . Is this sufficient:
select ids.id,
       exists (select 1 from tbl t where t.id = ids.id) as id_exists
from unnest(array['ACE', 'ACD', 'IDs', 'IN', 'CD']) ids(id);

SQL query to find a row with a specific number of associations

Using Postgres I have a schema that has conversations and conversationUsers. Each conversation has many conversationUsers. I want to be able to find the conversation that has the exactly specified number of conversationUsers. In other words, provided an array of userIds (say, [1, 4, 6]) I want to be able to find the conversation that contains only those users, and no more.
So far I've tried this:
SELECT c."conversationId"
FROM "conversationUsers" c
WHERE c."userId" IN (1, 4)
GROUP BY c."conversationId"
HAVING COUNT(c."userId") = 2;
Unfortunately, this also seems to return conversations which include these 2 users among others. (For example, it returns a result if the conversation also includes "userId" 5).
This is a case of relational-division - with the added special requirement that the same conversation shall have no additional users.
Assuming the PK of table "conversationUsers" is on ("userId", "conversationId"): it enforces uniqueness of the combinations and NOT NULL, and it implicitly provides the essential index for performance. The columns of the multicolumn PK are in this order deliberately. Ideally, you have another index on ("conversationId", "userId"). See:
Is a composite index also good for queries on the first field?
For the basic query, there is the "brute force" approach: count the number of matching users for all conversations of all given users, then filter the ones matching all given users. This is OK for small tables and/or only short input arrays and/or few conversations per user, but it doesn't scale well:
SELECT "conversationId"
FROM "conversationUsers" c
WHERE "userId" = ANY ('{1,4,6}'::int[])
GROUP BY 1
HAVING count(*) = array_length('{1,4,6}'::int[], 1)
AND NOT EXISTS (
SELECT FROM "conversationUsers"
WHERE "conversationId" = c."conversationId"
AND "userId" <> ALL('{1,4,6}'::int[])
);
Eliminating conversations with additional users with a NOT EXISTS anti-semi-join. More:
How do I (or can I) SELECT DISTINCT on multiple columns?
Alternative techniques:
Select rows which are not present in other table
There are various other, (much) faster relational-division query techniques. But the fastest ones are not well suited for a dynamic number of user IDs.
How to filter SQL results in a has-many-through relation
For a fast query that can also deal with a dynamic number of user IDs, consider a recursive CTE:
WITH RECURSIVE rcte AS (
   SELECT "conversationId", 1 AS idx
   FROM   "conversationUsers"
   WHERE  "userId" = ('{1,4,6}'::int[])[1]

   UNION ALL
   SELECT c."conversationId", r.idx + 1
   FROM   rcte r
   JOIN   "conversationUsers" c USING ("conversationId")
   WHERE  c."userId" = ('{1,4,6}'::int[])[idx + 1]
   )
SELECT "conversationId"
FROM   rcte r
WHERE  idx = array_length(('{1,4,6}'::int[]), 1)
AND    NOT EXISTS (
   SELECT FROM "conversationUsers"
   WHERE  "conversationId" = r."conversationId"
   AND    "userId" <> ALL ('{1,4,6}'::int[])
   );
For ease of use wrap this in a function or prepared statement. Like:
PREPARE conversations(int[]) AS
WITH RECURSIVE rcte AS (
   SELECT "conversationId", 1 AS idx
   FROM   "conversationUsers"
   WHERE  "userId" = $1[1]

   UNION ALL
   SELECT c."conversationId", r.idx + 1
   FROM   rcte r
   JOIN   "conversationUsers" c USING ("conversationId")
   WHERE  c."userId" = $1[idx + 1]
   )
SELECT "conversationId"
FROM   rcte r
WHERE  idx = array_length($1, 1)
AND    NOT EXISTS (
   SELECT FROM "conversationUsers"
   WHERE  "conversationId" = r."conversationId"
   AND    "userId" <> ALL ($1)
   );
Call:
EXECUTE conversations('{1,4,6}');
db<>fiddle here (also demonstrating a function)
There is still room for improvement: to eliminate as many rows as possible early, put the users with the fewest conversations first in your input array. For top performance you can generate a non-dynamic, non-recursive query dynamically (using one of the fast techniques from the first link) and execute that in turn. You could even wrap it in a single plpgsql function with dynamic SQL ...
More explanation:
Using same column multiple times in WHERE clause
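For illustration, reordering the input array by user frequency might look like this (a sketch; it assumes every input user has at least one row, since users with no conversations drop out of the join, but then no conversation can match anyway):
SELECT array_agg(u ORDER BY ct) AS reordered_input  -- rarest user first
FROM  (
   SELECT u, count(*) AS ct
   FROM   unnest('{1,4,6}'::int[]) u
   JOIN   "conversationUsers" c ON c."userId" = u
   GROUP  BY u
   ) sub;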
Alternative: MV for sparsely written table
If the table "conversationUsers" is mostly read-only (old conversations are unlikely to change) you might use a MATERIALIZED VIEW with pre-aggregated users in sorted arrays and create a plain btree index on that array column.
CREATE MATERIALIZED VIEW mv_conversation_users AS
SELECT "conversationId", array_agg("userId") AS users  -- sorted array
FROM  (
   SELECT "conversationId", "userId"
   FROM   "conversationUsers"
   ORDER  BY 1, 2
   ) sub
GROUP  BY 1
ORDER  BY 1;
CREATE INDEX ON mv_conversation_users (users) INCLUDE ("conversationId");
The demonstrated covering index requires Postgres 11. See:
https://dba.stackexchange.com/a/207938/3684
About sorting rows in a subquery:
How to apply ORDER BY and LIMIT in combination with an aggregate function?
In older versions use a plain multicolumn index on (users, "conversationId"). With very long arrays, a hash index might make sense in Postgres 10 or later.
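In those older versions, that plain multicolumn index might look like this (a sketch; the index name is made up):
CREATE INDEX mv_conversation_users_users_idx
ON mv_conversation_users (users, "conversationId");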
Then the much faster query would simply be:
SELECT "conversationId"
FROM mv_conversation_users c
WHERE users = '{1,4,6}'::int[]; -- sorted array!
db<>fiddle here
You have to weigh added costs to storage, writes and maintenance against benefits to read performance.
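Also keep in mind that a materialized view does not refresh itself; you have to rebuild it when the underlying data changes, e.g.:
REFRESH MATERIALIZED VIEW mv_conversation_users;

(REFRESH ... CONCURRENTLY, to avoid locking out readers, would additionally require a unique index on the MV.)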
Aside: consider legal identifiers without double quotes. conversation_id instead of "conversationId" etc.:
Are PostgreSQL column names case-sensitive?
You can modify your query like this and it should work:
SELECT c."conversationId"
FROM   "conversationUsers" c
WHERE  c."conversationId" IN (
   SELECT c1."conversationId"
   FROM   "conversationUsers" c1
   WHERE  c1."userId" IN (1, 4)
   GROUP  BY c1."conversationId"
   HAVING COUNT(DISTINCT c1."userId") = 2  -- both given users take part
   )
GROUP  BY c."conversationId"
HAVING COUNT(DISTINCT c."userId") = 2;     -- and nobody else does
This might be easier to follow: you want the conversation ID, so group by it, then add a HAVING clause requiring the count of matching user IDs to equal the count of all distinct users within the group. This will work, but will take longer to process because there is no pre-qualifier.
select
      cu.ConversationId
   from
      conversationUsers cu
   group by
      cu.ConversationID
   having
          sum( case when cu.userId IN (1, 4) then 1 else 0 end ) = count( distinct cu.UserID )
      and count( distinct cu.UserID ) = 2  -- otherwise a conversation with only ONE of the two users slips through
To simplify the list even more, use a pre-query of conversations that at least one of the people is in. If they are not in a conversation to begin with, why bother considering it at all?
select
      cu.ConversationId
   from
      ( select cu2.ConversationID
           from conversationUsers cu2
           where cu2.userID = 4 ) preQual
      JOIN conversationUsers cu
         ON preQual.ConversationId = cu.ConversationId
   group by
      cu.ConversationID
   having
          sum( case when cu.userId IN (1, 4) then 1 else 0 end ) = count( distinct cu.UserID )
      and count( distinct cu.UserID ) = 2

Search for a word in the column string and list those words

I have two tables: table 1 with columns W_ID and word, and table 2 with columns N_ID and note. I have to list every N_ID where a word from table 1's word column is contained in the note column (the easy part), and also list those words in another column without duplicating the N_ID. That means using STUFF to concatenate all the words found in the note column for that particular N_ID. I tried using
a FULL TEXT INDEX with CONTAINS, but it only allows searching for one word at a time. Any suggestions on how I can use a while loop to achieve this?
If there is a maximum number of words you want displayed for N_ID, you can pivot this. You could have them in a single column by concatenating them, but I would recommend against that. Here is a pivot that supports up to 4 words per N_ID. You can adjust it as needed. You can view the SQL Fiddle for this here.
SELECT
   n_id,
   [1] AS word_1,
   [2] AS word_2,
   [3] AS word_3,
   [4] AS word_4
FROM (
   SELECT
      n_id,
      word,
      ROW_NUMBER() OVER (PARTITION BY n_id ORDER BY word) AS rn
   FROM tbl2
   JOIN tbl1
      ON tbl2.note LIKE '%' + tbl1.word + '[ ,.?!]%'
) AS source_table
PIVOT (
   MAX(word)
   FOR rn IN ([1],[2],[3],[4])
) AS pivot_table
*Updated the join to require a space or punctuation mark after the word, so the end of a word is detected and partial-word matches are avoided.
You can join your tables together based on a positive result from the CHARINDEX function.
In SQL Server 2017 you can run:
SELECT n_id, string_agg(word, ', ') AS words
FROM words
INNER JOIN notes ON 0 < charindex(words.word, notes.note)
GROUP BY n_id;
Prior to SQL Server 2017 there is no STRING_AGG, so you'll need to use STUFF, which is trickier:
select
   n_id,
   stuff((
      SELECT ', ' + word
      FROM words
      where 0 < charindex(words.word, notes.note)
      FOR XML PATH('')
   ), 1, 2, '') as words
from notes;
I used the following schema:
CREATE TABLE words
( W_ID int identity primary key
, word varchar(100)
);

CREATE TABLE notes
( N_ID int identity primary key
, note varchar(1000)
);

INSERT INTO words (word) VALUES
   ('No'), ('Nope'), ('Nah');

INSERT INTO notes (note) VALUES
   ('I am not going to do this. Nah!!!')
 , ('It is OK.');
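For what it's worth, with this sample data and a typical case-insensitive collation, the STRING_AGG query above should return something like:
-- n_id | words
-- -----+---------
--    1 | No, Nah
--
-- 'No' also matches inside 'not' (CHARINDEX is a plain substring search),
-- and 'It is OK.' has no matches, so the INNER JOIN drops that note entirely.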

How can I stop joins from adding rows in my match query?

I'm having difficulty translating what I want into functional programming, since I think imperatively. Basically, I have a table of forms, and a table of expectations. In the Expectation view, I want it to look through the forms table and tell me if each one found a match. However, when I try to use joins to accomplish this, the joins are adding rows to the Expectation table when two or more forms match. I do not want this.
In an imperative fashion, I want the equivalent of this:
ForEach (row in Expectation table)
{
if (any form in the Form table matches the criteria)
{
MatchID = form.ID;
SignDate = form.SignDate;
...
}
}
What I have in SQL is this:
SELECT
   e.*, match.MatchID, match.MatchSignDate, ...
FROM
   POFDExpectation e
   LEFT OUTER JOIN (
      SELECT MIN(ID) as MatchID, MIN(SignDate) as MatchSignDate,
             COUNT(*) as MatchCount, ...
      FROM Form f
      GROUP BY (matching criteria columns)
   ) match
      ON (match.[match criteria] = e.[match criteria])
This works, but very slowly, and every time there are TWO matches, a row is added to the Expectation results. Mathematically I understand that a join is a cross product and this is expected, but I'm unsure how to avoid it. A subquery, perhaps?
I'm not able to give too many further details about the implementation, but I'll be happy to try any suggestion and respond with the results. I have 880 Expectation rows, and 942 results being returned. If I only allow results that match one form, I get 831 results. Neither are desirable, so if yours gets me to exactly 880, yours is the accepted answer.
Edit: I am using SQL Server 2008 R2, though a generic solution would be best.
Sample code:
--DROP VIEW ExpectationView; DROP TABLE Forms; DROP TABLE Expectations;
--Create Tables and View
CREATE TABLE Forms (ID int IDENTITY(1,1) PRIMARY KEY, ReportYear int, Name varchar(100), Complete bit, SignDate datetime)
GO
CREATE TABLE Expectations (ID int IDENTITY(1,1) PRIMARY KEY, ReportYear int, Name varchar(100))
GO
CREATE VIEW ExpectationView AS
SELECT e.*, filed.MatchID, filed.SignDate,
       ISNULL(filed.FiledCount, 0) as FiledCount,
       ISNULL(name.NameCount, 0) as NameCount
FROM Expectations e
LEFT OUTER JOIN
   (SELECT MIN(ID) as MatchID, ReportYear, Name, Complete, MIN(SignDate) as SignDate, COUNT(*) as FiledCount
    FROM Forms f
    GROUP BY ReportYear, Name, Complete) filed
   ON filed.ReportYear = e.ReportYear AND filed.Name like '%'+e.Name+'%' AND filed.Complete = 1
LEFT OUTER JOIN
   (SELECT MIN(ID) as MatchID, ReportYear, Name, COUNT(*) as NameCount
    FROM Forms f
    GROUP BY ReportYear, Name) name
   ON name.ReportYear = e.ReportYear AND name.Name like '%'+e.Name+'%'
GO
--Insert Text Data
INSERT INTO Forms (ReportYear, Name, Complete, SignDate)
SELECT 2011, 'Bob Smith', 1, '2012-03-01' UNION ALL
SELECT 2011, 'Bob Jones', 1, '2012-10-04' UNION ALL
SELECT 2011, 'Bob', 1, '2012-07-20'
GO
INSERT INTO Expectations (ReportYear, Name)
SELECT 2011, 'Bob'
GO
SELECT * FROM ExpectationView --Should only return 1 result, returns 9
The 'filed' join shows that they have completed a form; 'name' shows that they may have started one but not finished it. My view has four different 'match criteria', each a little more strict, and counts each: 'Name Only Matches', 'Loose Matches', 'Matches' (default), and 'Tight Matches' (used if there is more than one default match).
This is how I do it when I want to keep to a JOIN-type query format:
SELECT
   e.*,
   match.MatchID,
   match.MatchSignDate,
   ...
FROM POFDExpectation e
OUTER APPLY (
   SELECT TOP 1
      MIN(ID) as MatchID,
      MIN(SignDate) as MatchSignDate,
      COUNT(*) as MatchCount,
      ...
   FROM Form f
   WHERE f.[match criteria] = e.[match criteria]
   GROUP BY (matching criteria columns)
   -- Add ORDER BY here to control which row is TOP 1
) match
It usually performs better as well.
Semantically, {CROSS|OUTER} APPLY (table-expression) specifies a table-expression that is called once for each row in the preceding table expressions of the FROM clause and then joined to them. Pragmatically, however, the compiler treats it almost identically to a JOIN.
The practical difference is that unlike a JOIN table-expression, the APPLY table-expression is dynamically re-evaluated for each row. So instead of an ON clause, it relies on its own logic and WHERE clauses to limit/match its rows to the preceding table-expressions. This also allows it to make reference to the column-values of the preceding table-expressions, inside its own internal subquery expression. (This is not possible in a JOIN)
The reason that we want this here, instead of a JOIN, is that we need a TOP 1 in the sub-query to limit its returned rows, however, that means that we need to move the ON clause conditions to the internal WHERE clause so that it will get applied before the TOP 1 is evaluated. And that means that we need an APPLY here, instead of the more usual JOIN.
@RBarryYoung answered the question as I asked it, but there was a second question that I didn't make very clear. What I really wanted was a combination of his answer and this question, so for the record here's what I used:
SELECT
   e.*,
   ...
   match.MatchID,
   match.MatchSignDate,
   match.MatchCount
FROM
   POFDExpectation e
OUTER APPLY (
   SELECT TOP 1
      ID as MatchID,
      ReportYear,
      ...
      SignDate as MatchSignDate,
      COUNT(*) OVER () as MatchCount
   FROM
      Form f
   WHERE
      f.[match criteria] = e.[match criteria]
   -- Add ORDER BY here to control which row is TOP 1
) match

MySQL doesn't support the limit clause inside a subselect, how can I do that?

I have the following table on MySQL 5.1.30:
CREATE TABLE article (
article_id int(10) unsigned NOT NULL AUTO_INCREMENT,
category_id int(10) unsigned NOT NULL,
title varchar(100) NOT NULL,
PRIMARY KEY (article_id)
);
With this information:
1, 1, 'foo'
2, 1, 'bar'
3, 1, 'baz'
4, 1, 'quox'
5, 2, 'quonom'
6, 2, 'qox'
I need to obtain the first three articles in each category for all categories that have articles. Something like this:
1, 1, 'foo'
2, 1, 'bar'
3, 1, 'baz'
5, 2, 'quonom'
6, 2, 'qox'
Of course a union would work:
(select * from article where category_id = 1 limit 3)
union
(select * from article where category_id = 2 limit 3)
But there is an unknown number of categories in the database. Also, the order should be specified by an is_sticky and a published_date column that I left out of the examples to simplify.
Is it possible to build a query that retrieves this information?
UPDATE: I've tried the following, which would seem to work except that MySQL doesn't support the LIMIT clause inside a subselect. Do you know of a way to simulate LIMIT there?
select *
from article a
where a.article_id in (select f.article_id
                       from article f
                       where f.category_id = a.category_id
                       order by f.is_sticky, f.published_at
                       limit 3)
Thanks
SELECT ... LIMIT isn't supported in subqueries, I'm afraid, so it's time to break out the self-join magic:
SELECT article.*
FROM article
JOIN (
SELECT a0.category_id AS id, MIN(a2.article_id) AS lim
FROM article AS a0
LEFT JOIN article AS a1 ON a1.category_id=a0.category_id AND a1.article_id>a0.article_id
LEFT JOIN article AS a2 ON a2.category_id=a1.category_id AND a2.article_id>a1.article_id
GROUP BY id
) AS cat ON cat.id=article.category_id
WHERE article.article_id<=cat.lim OR cat.lim IS NULL
ORDER BY article_id;
The bit in the middle is working out the ID of the third-lowest-ID article for each category by trying to join three copies of the same table in ascending ID order. If there are fewer than three articles for a category, the left joins will ensure the limit is NULL, so the outer WHERE needs to pick up that case as well.
If your “top 3” requirement might change to “top n” at some point, this begins to get unwieldy. In that case you might want to reconsider the idea of querying the list of distinct categories first then unioning the per-category queries.
ETA: Ordering on two columns: eek, new requirements! :-)
It depends what you mean: if you're only trying to order the final results you can bang it on the end no problem. But if you need to use this ordering to select which three articles are to be picked things are a lot harder.
We are using a self-join with ‘<’ to reproduce the effect ‘ORDER BY article_id’ would have. Unfortunately, whilst you can do ‘ORDER BY a, b’, you can't do ‘(a, b)<(c, d)’... neither can you do ‘MIN(a, b)’. Plus, you'd actually be ordering by three columns, issticky, published and article_id, because you need to ensure that each ordering value is unique, to avoid getting four or more rows returned.
Whilst you could make up your own orderable value by some crude integer or string combination of columns:
LEFT JOIN article AS a1
ON a1.category_id=a0.category_id
AND HEX(a1.issticky)+HEX(a1.published_at)+HEX(a1.article_id)>HEX(a0.issticky)+HEX(a0.published_at)+HEX(a0.article_id)
this is getting unfeasibly ugly, and the calculations will scupper any chance of using the indices to make the query efficient. At which point you are better off simply doing the separate per-category LIMITed queries.
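If you do end up needing a general top-n per category on this MySQL version (which predates window functions), one commonly used sketch emulates ROW_NUMBER() with user variables; note it relies on the usual, but not formally guaranteed, evaluation order of the variable assignments:
SELECT article_id, category_id, title
FROM (
   SELECT a.*,
          @rn := IF(@cat = a.category_id, @rn + 1, 1) AS rn,  -- restart counter per category
          @cat := a.category_id
   FROM article a
   CROSS JOIN (SELECT @rn := 0, @cat := NULL) vars             -- initialize the variables
   ORDER BY a.category_id, a.article_id
) ranked
WHERE rn <= 3;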
You probably should add another table containing the category_id and a description of the categories. Then you can query that table for a list of category IDs, and use a subquery or additional queries to get the articles with proper sorting and limiting. I don't have time to write this out fully now, but someone else probably will (or I'll do it in the unlikely event that no one else has responded by the time I get back).
Here's something I'm not proud of (in MS SQL - not sure if it'll work in MySQL)
select a2.article_id, a2.category_id, a2.title
from
   (select distinct category_id
    from article) as a1
   inner join article a2 on a2.category_id = a1.category_id
where a2.article_id <= (
   select top 1 a4.article_id
   from (
      select top 3 a3.article_id
      from article a3
      where a3.category_id = a1.category_id
      order by a3.article_id asc
   ) a4
   order by a4.article_id desc)
It'll depend on MySQL supporting subqueries in this manner. Basically it works out the third-largest article_id for each category and joins all articles less than or equal to that per category.
SELECT TOP n * should work the same as SELECT * LIMIT n, I hope...
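For the record, the rough equivalence looks like this (T-SQL on top, MySQL below):
-- T-SQL
SELECT TOP 3 * FROM article ORDER BY article_id;
-- MySQL
SELECT * FROM article ORDER BY article_id LIMIT 3;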