Getting SQL records sharing same keywords - sql

I have table for article keywords:
id INT
keyword VARCHAR
And I have an article id, let's say 13. This article has 4 keywords in this table.
I'm trying to get other articles where they share 2 or more keywords.
I can get a list of articles having same keywords with my original article with this query:
SELECT id FROM table WHERE keyword IN (SELECT keyword FROM table WHERE id=13)
But this only gives me a list of all articles sharing at least one keyword... But I need articles sharing 2 or more keywords, preferably ordered descending by the most occurrences...
How do I achieve this?

DECLARE @original_id int = 13
SELECT
    k1.id,
    COUNT(*) c
FROM keywords k1
INNER JOIN (
    SELECT keyword
    FROM keywords
    WHERE id = @original_id
) k2 ON (k1.keyword = k2.keyword)
WHERE k1.id <> @original_id -- exclude the original article itself
GROUP BY k1.id
HAVING COUNT(*) > 1
ORDER BY c DESC, k1.id
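To try the self-join idea without a SQL Server instance, here is a minimal sketch using Python's built-in sqlite3 module. The sample rows, the function name related_articles and the min_shared parameter are assumptions made up for illustration, and it assumes each (id, keyword) pair is stored only once.

```python
import sqlite3

# Hypothetical sample data: (article id, keyword) pairs, not from the question.
rows = [
    (13, "sql"), (13, "join"), (13, "count"), (13, "group"),
    (20, "sql"), (20, "join"), (20, "count"),
    (21, "sql"), (21, "group"),
    (22, "sql"),
    (23, "python"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE keywords (id INTEGER, keyword TEXT)")
conn.executemany("INSERT INTO keywords VALUES (?, ?)", rows)


def related_articles(conn, original_id, min_shared=2):
    """Return (article_id, shared_count) for articles that share at least
    min_shared keywords with original_id, most shared keywords first."""
    query = """
        SELECT k1.id, COUNT(*) AS c
        FROM keywords k1
        INNER JOIN keywords k2 ON k1.keyword = k2.keyword
        WHERE k2.id = ?      -- keywords of the original article
          AND k1.id <> ?     -- leave out the original article itself
        GROUP BY k1.id
        HAVING COUNT(*) >= ?
        ORDER BY c DESC, k1.id
    """
    return conn.execute(query, (original_id, original_id, min_shared)).fetchall()


result = related_articles(conn, 13)
print(result)  # [(20, 3), (21, 2)]
```

Article 22 shares only one keyword and article 23 shares none, so both are filtered out by the HAVING clause.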

SELECT id
, Count(*) As number_of_keywords
FROM articles
INNER
JOIN keywords
ON keywords.keyword = articles.keyword
GROUP
BY id
HAVING Count(*) >= 2

Related

Select columns based on count of many-to-many association

I have a Postgres database with 3 tables that looks a little something like this:
table categories: id, type
table games: id
table game_category: id, game_id, category_id
I want to select all games which have more than x categories where type is something
I have gotten this far:
SELECT * FROM games WHERE id IN (
SELECT game_id FROM game_category GROUP BY game_id HAVING COUNT(*) >= 5
)
This works to select all games with more than 5 categories, but doesn't narrow down the categories by their type. How could I expand on this to add the additional check for the type?
You have to join your categories table with the subquery. Then you can add a WHERE clause for the type. Replace '?' with your actual type, of course.
SELECT * FROM games WHERE id IN (
SELECT game_id FROM game_category
INNER JOIN categories ON (categories.id=game_category.category_id)
WHERE categories.type='?'
GROUP BY game_id HAVING COUNT(*) >= 5
)
Considering query response time, you can avoid the in clause. Mitchel's answer would work if written as follows:
SELECT game_id
FROM game_category gc
inner join categories c on c.id = gc.category_id
WHERE type = 'X'
GROUP BY game_id
HAVING COUNT(game_id) >= 5
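Note that I used COUNT(game_id) instead of COUNT(*). Both return the same result as long as game_id is not NULL, so treat this as a style choice rather than a guaranteed optimization; in PostgreSQL COUNT(*) is generally not slower.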
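For anyone who wants to check the type filter on a small data set, the following is a rough sketch with Python's sqlite3 module rather than Postgres. The sample rows, the type values 'genre' and 'platform', and the helper name games_with_min_categories are all invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE categories (id INTEGER PRIMARY KEY, type TEXT);
    CREATE TABLE games (id INTEGER PRIMARY KEY);
    CREATE TABLE game_category (
        id INTEGER PRIMARY KEY,
        game_id INTEGER,
        category_id INTEGER
    );
    -- Hypothetical data: three 'genre' categories and one 'platform' category.
    INSERT INTO categories (id, type) VALUES
        (1, 'genre'), (2, 'genre'), (3, 'genre'), (4, 'platform');
    INSERT INTO games (id) VALUES (1), (2);
    INSERT INTO game_category (game_id, category_id) VALUES
        (1, 1), (1, 2), (1, 3), (1, 4),
        (2, 1), (2, 4);
""")


def games_with_min_categories(conn, category_type, min_count):
    """Return ids of games linked to at least min_count categories
    of the given type."""
    query = """
        SELECT g.id
        FROM games g
        JOIN game_category gc ON gc.game_id = g.id
        JOIN categories c ON c.id = gc.category_id
        WHERE c.type = ?          -- filter rows before grouping
        GROUP BY g.id
        HAVING COUNT(*) >= ?      -- then count what is left per game
        ORDER BY g.id
    """
    return [row[0] for row in conn.execute(query, (category_type, min_count))]


print(games_with_min_categories(conn, "genre", 3))  # [1]
print(games_with_min_categories(conn, "genre", 1))  # [1, 2]
```

The point of the sketch is the order of operations: the WHERE clause removes categories of other types first, so the HAVING count only sees matching rows.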

Query for master records that have matching detail records

Currently I'm having the following table structure.
Master table Documents:
ID  Filename
--  -------------
1   document1.pdf
2   document2.pdf
3   document3.pdf
Detail table Keywords:
ID  DocumentID  Keyword
--  ----------  --------
1   1           KeywordA
2   1           KeywordB
3   1           KeywordC
4   2           KeywordB
5   3           KeywordA
6   3           KeywordD
Code to create this:
CREATE TABLE Documents (
ID int IDENTITY(1,1) PRIMARY KEY,
Filename nvarchar(255) NOT NULL
);
CREATE TABLE Keywords (
ID int IDENTITY(1,1) PRIMARY KEY,
DocumentID int NOT NULL,
Keyword nvarchar(255) NOT NULL
);
INSERT INTO Documents(Filename) VALUES
('document1.pdf'), ('document2.pdf'), ('document3.pdf');
INSERT INTO Keywords(DocumentID, Keyword) VALUES
(1, 'KeywordA'),
(1, 'KeywordB'),
(1, 'KeywordC'),
(2, 'KeywordB'),
(3, 'KeywordA'),
(3, 'KeywordD');
SQL Fiddle for this.
Finding with one keyword
I'm looking for a way to get all documents matching a certain keyword.
This could be e.g. written with the following T-SQL query:
SELECT Documents.*
FROM Documents
WHERE Documents.ID IN
(
SELECT Keywords.DocumentID
FROM Keywords
WHERE Keywords.Keyword = 'KeywordA'
)
This works successfully.
Finding with multiple keywords
What I'm currently stuck with is finding all documents that match multiple keywords, combined with logical AND.
E.g. find a document that has three detail records with keywords A, B and C.
I think the following might work, but I don't know whether it is performant or elegant at all:
SELECT Documents.*
FROM Documents
WHERE Documents.ID IN
(
SELECT Keywords.DocumentID
FROM Keywords
WHERE
Keywords.Keyword = 'KeywordA' OR
Keywords.Keyword = 'KeywordB'
GROUP BY Keywords.DocumentID HAVING COUNT(*) = 2
)
SQL Fiddle for that.
My question
How to write a (performant) SQL query to find all documents that have multiple keywords associated.
If it is easier, a solution with a constant number of keywords (e.g. 3) would be sufficient.
I hope the following query can help you
SELECT D.ID
FROM Documents D
JOIN Keywords K ON K.DocumentID = D.ID
WHERE K.Keyword IN ('KeywordA', 'KeywordB', 'KeywordC')
GROUP BY D.ID
HAVING COUNT(DISTINCT K.Keyword) = 3
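Here is a small sketch of the same COUNT(DISTINCT ...) approach, run against the sample data from the question with Python's sqlite3 module instead of SQL Server. The function name documents_with_all is made up for the example, and the IN list is built from placeholders so the number of keywords does not have to be fixed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Documents (
        ID INTEGER PRIMARY KEY AUTOINCREMENT,
        Filename TEXT NOT NULL
    );
    CREATE TABLE Keywords (
        ID INTEGER PRIMARY KEY AUTOINCREMENT,
        DocumentID INTEGER NOT NULL,
        Keyword TEXT NOT NULL
    );
    INSERT INTO Documents (Filename) VALUES
        ('document1.pdf'), ('document2.pdf'), ('document3.pdf');
    INSERT INTO Keywords (DocumentID, Keyword) VALUES
        (1, 'KeywordA'), (1, 'KeywordB'), (1, 'KeywordC'),
        (2, 'KeywordB'),
        (3, 'KeywordA'), (3, 'KeywordD');
""")


def documents_with_all(conn, keywords):
    """Return (ID, Filename) of documents that have every given keyword."""
    wanted = sorted(set(keywords))      # ignore repeated keywords
    if not wanted:
        return []                       # assumption: empty input matches nothing
    placeholders = ", ".join("?" for _ in wanted)
    query = f"""
        SELECT D.ID, D.Filename
        FROM Documents D
        JOIN Keywords K ON K.DocumentID = D.ID
        WHERE K.Keyword IN ({placeholders})
        GROUP BY D.ID, D.Filename
        HAVING COUNT(DISTINCT K.Keyword) = ?
        ORDER BY D.ID
    """
    return conn.execute(query, (*wanted, len(wanted))).fetchall()


print(documents_with_all(conn, ["KeywordA", "KeywordB", "KeywordC"]))
# [(1, 'document1.pdf')]
print(documents_with_all(conn, ["KeywordA"]))
# [(1, 'document1.pdf'), (3, 'document3.pdf')]
```

COUNT(DISTINCT K.Keyword) rather than COUNT(*) keeps the result correct even if a document has the same keyword stored twice.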
The technique you are describing is called Relational Division With Remainder; in other words, find all groups which contain a particular set of rows.
Your current query is one of the standard ways of doing this, but there are others.
If you had the keywords in a table variable or TVP, ...
DECLARE @keywords AS TABLE (Keyword varchar(50));
INSERT @keywords VALUES
('KeywordA'), ('KeywordB'), ('KeywordC');
... you could make it much neater with the following:
SELECT d.*
FROM Documents d
WHERE d.ID IN
(
    SELECT k.DocumentID
    FROM Keywords k
    JOIN @keywords kt ON kt.Keyword = k.Keyword
    GROUP BY k.DocumentID
    HAVING COUNT(*) = (SELECT COUNT(*) FROM @keywords)
);
Another option:
SELECT d.*
FROM Documents d
WHERE EXISTS (SELECT 1
    FROM @keywords kt
    LEFT JOIN Keywords k ON kt.Keyword = k.Keyword
        AND k.DocumentID = d.ID
    HAVING COUNT(*) = COUNT(k.Keyword) -- there are no missing matches
);
And another, slightly confusing one:
SELECT d.*
FROM Documents d
WHERE NOT EXISTS (SELECT 1
    FROM @keywords kt
    WHERE NOT EXISTS (SELECT 1
        FROM Keywords k
        WHERE k.Keyword = kt.Keyword
          AND k.DocumentID = d.ID
    )
);
-- For each document, there are no keywords for which there is no match
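The double NOT EXISTS form is the hardest of the three to read, so here is a rough sketch of it in Python's sqlite3 module. Since SQLite has no table variables, an ordinary table named wanted stands in for the keyword list; that table and the function name documents_matching are assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Documents (ID INTEGER PRIMARY KEY, Filename TEXT NOT NULL);
    CREATE TABLE Keywords (
        ID INTEGER PRIMARY KEY,
        DocumentID INTEGER NOT NULL,
        Keyword TEXT NOT NULL
    );
    -- Stand-in for the table variable holding the keywords to search for.
    CREATE TABLE wanted (Keyword TEXT PRIMARY KEY);
    INSERT INTO Documents (ID, Filename) VALUES
        (1, 'document1.pdf'), (2, 'document2.pdf'), (3, 'document3.pdf');
    INSERT INTO Keywords (DocumentID, Keyword) VALUES
        (1, 'KeywordA'), (1, 'KeywordB'), (1, 'KeywordC'),
        (2, 'KeywordB'),
        (3, 'KeywordA'), (3, 'KeywordD');
""")


def documents_matching(conn, keywords):
    """Return ids of documents for which no wanted keyword is missing."""
    conn.execute("DELETE FROM wanted")
    conn.executemany(
        "INSERT OR IGNORE INTO wanted (Keyword) VALUES (?)",
        [(k,) for k in keywords],
    )
    query = """
        SELECT d.ID
        FROM Documents d
        WHERE NOT EXISTS (
            SELECT 1
            FROM wanted w
            WHERE NOT EXISTS (
                SELECT 1
                FROM Keywords k
                WHERE k.Keyword = w.Keyword
                  AND k.DocumentID = d.ID
            )
        )
        ORDER BY d.ID
    """
    return [row[0] for row in conn.execute(query)]


print(documents_matching(conn, ["KeywordA", "KeywordB", "KeywordC"]))  # [1]
print(documents_matching(conn, ["KeywordA"]))                          # [1, 3]
print(documents_matching(conn, []))                                    # [1, 2, 3]
```

One difference from the counting versions shows up in the last call: with an empty keyword list there is no wanted keyword that could be missing, so every document qualifies.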

INNER JOIN of pageviews, contacts and companies - duplicated entries

In short: 3 table inner join duplicates records
I have data in BigQuery in 3 tables:
Pageviews with columns:
timestamp
user_id
title
path
Contacts with columns:
website_user_id
email
company_id
Companies with columns:
id
name
I want to display all recorded pageviews and, if user and/or company is known, display this data next to pageview.
First, I join contact and pageviews data (SQL is generated by Metabase business intelligence tool):
SELECT
`analytics.pageviews`.`timestamp` AS `timestamp`,
`analytics.pageviews`.`title` AS `title`,
`analytics.pageviews`.`path` AS `path`,
`Contacts`.`email` AS `email`
FROM `analytics.pageviews`
INNER JOIN `analytics.contacts` `Contacts` ON `analytics.pageviews`.`user_id` = `Contacts`.`website_user_id`
ORDER BY `timestamp` DESC
It works as expected and I can see pageviews attributed to known contacts.
Next, I'd like to show only pageviews of contacts whose company is known, together with the company name:
SELECT
`analytics.pageviews`.`timestamp` AS `timestamp`,
`analytics.pageviews`.`title` AS `title`,
`analytics.pageviews`.`path` AS `path`,
`Contacts`.`email` AS `email`,
`Companies`.`name` AS `name`
FROM `analytics.pageviews`
INNER JOIN `analytics.contacts` `Contacts` ON `analytics.pageviews`.`user_id` = `Contacts`.`website_user_id`
INNER JOIN `analytics.companies` `Companies` ON `Contacts`.`company_id` = `Companies`.`id`
ORDER BY `timestamp` DESC
With this query I would expect to see only pageviews where associated contact AND company are known (just another column for company name). The problem is, I get duplicate rows for every pageview (sometimes 5, sometimes 20 identical rows).
I want to avoid selecting DISTINCT timestamps because it can lead to excluding valid pageviews from different users but with identical timestamp.
How to approach this?
Your description sounds like you have duplicates in companies. This is easy to test for:
select c.id, count(*)
from `analytics.companies` c
group by c.id
having count(*) >= 2;
You can get the details using window functions:
select c.*
from (select c.*, count(*) over (partition by c.id) as cnt
from `analytics.companies` c
) c
where cnt >= 2
order by cnt desc, id;
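To see how duplicated company ids multiply the joined rows, here is a small reproduction in Python's sqlite3 module rather than BigQuery. The sample rows are invented, and the DISTINCT subquery at the end is only one possible workaround, assumed here for illustration; cleaning up the companies table itself is usually the better fix.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pageviews (timestamp TEXT, user_id INTEGER, title TEXT, path TEXT);
    CREATE TABLE contacts (website_user_id INTEGER, email TEXT, company_id INTEGER);
    CREATE TABLE companies (id INTEGER, name TEXT);
    -- Hypothetical data: two pageviews, one contact, one company stored 3 times.
    INSERT INTO pageviews VALUES
        ('2020-01-01 10:00:00', 1, 'Home', '/'),
        ('2020-01-01 10:05:00', 1, 'Pricing', '/pricing');
    INSERT INTO contacts VALUES (1, 'a@example.com', 7);
    INSERT INTO companies VALUES (7, 'Acme'), (7, 'Acme'), (7, 'Acme');
""")

# The three-table inner join from the question: every pageview is repeated
# once per matching companies row.
joined = conn.execute("""
    SELECT p.timestamp, p.path, ct.email, co.name
    FROM pageviews p
    JOIN contacts ct ON p.user_id = ct.website_user_id
    JOIN companies co ON ct.company_id = co.id
""").fetchall()

# The diagnostic query from the answer: company ids that occur more than once.
dupes = conn.execute("""
    SELECT id, COUNT(*)
    FROM companies
    GROUP BY id
    HAVING COUNT(*) >= 2
""").fetchall()

# Possible workaround (an assumption, not part of the answer): join against
# a de-duplicated view of companies.
deduped = conn.execute("""
    SELECT p.timestamp, p.path, ct.email, co.name
    FROM pageviews p
    JOIN contacts ct ON p.user_id = ct.website_user_id
    JOIN (SELECT DISTINCT id, name FROM companies) co ON ct.company_id = co.id
""").fetchall()

print(len(joined), dupes, len(deduped))  # 6 [(7, 3)] 2
```

Two pageviews times three identical company rows gives the six joined rows, which matches the "sometimes 5, sometimes 20 identical rows" symptom in the question.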

How to select the highest value after a count() | Sql Oracle

This is my query:
SELECT f.name, COUNT(*) as num_books
from author f
JOIN book b on b.tittle = f.book
Group by f.name
Which gives me this table:
NAME NUM_BOOKS
-------------------------------------------------- ----------
Dyremann 2
Nam mann 1
Thomas 1
Asgeir 1
Tullemann 5
Plantemann 1
Beste forfatter 1
Fagmann 5
Lars 1
Hans 1
Svein Arne 1
How could I easily alter the query to display only the author with the highest number of released books? (Keep in mind that I'm rather new to SQL.)
Oracle, and as far as I know - only Oracle, allows you to nest two aggregate functions.
SELECT max (f.name) keep (dense_rank last order by count (*)) as name
from author f
JOIN book b on b.tittle = f.book
Group by f.name
In order to get ALL top authors:
select name
from (SELECT f.name,rank () over (order by count(*) desc) as rnk
from author f
JOIN book b on b.tittle = f.book
Group by f.name
)
where rnk = 1
Since Oracle 12c:
SELECT f.name
from author f
JOIN book b on b.tittle = f.book
Group by f.name
order by count (*) desc
fetch first row only /* or: fetch first row with ties, in order to get all top authors */
The best way to do this is:
SELECT f.name, COUNT(*) as num_books
from author f
JOIN book b on b.tittle = f.book
Group by f.name
Order by num_books DESC
FETCH FIRST ROW ONLY
This will order the results from biggest to smallest and return the first result.
1) Oracle-specific (using ROWNUM; for Postgres/MySQL use LIMIT):
select * from
(SELECT f.name, COUNT(*) as num_books
from author f
JOIN book b on b.tittle = f.book
Group by f.name order by num_books desc )
where ROWNUM = 1
2) General query for all databases (the derived table needs an alias outside Oracle):
select f.name, count(*) as max_num_books from author f
JOIN book b on b.tittle = f.book
Group by f.name
having count(*) =
    (select max(num_books)
     from
        (SELECT f.name, COUNT(*) as num_books
         from author f
         JOIN book b on b.tittle = f.book
         Group by f.name) t
    );
I am not sure why you need a join in the first place. It appears that the author table has a column book - why is it not enough to count(book) from that table, grouping by name? This arrangement is very strange - the author table should only have author properties, the author name should be in the title table, but you do join on author.book = book.title which seems to suggest that you do, in fact, have that strange arrangement (and therefore you don't need a join). Also, having a table and a column (in another table) share the same name, book, is a practice best to be avoided.
The most elementary solution (not the most efficient though), in this case, is
select name, count(book) as max_num_books
from author
group by name
having count(book) = (select max(count(book)) from author group by name);
The subquery groups by name, and then it selects the max over all group counts. The outer query selects the names that have a book count equal to this maximum. The subquery returns a single row in a single column - a single value. Such a query is called a "scalar" subquery and can be used wherever a single value is needed, such as the HAVING clause of the outer query. (It's in the HAVING clause and not a WHERE clause, since it refers to group properties - count(book) - and not to individual row properties).
The more efficient solution is as Dudu showed:
select name, ct as max_num_books
from ( select name, count(*) as ct, rank() over (order by count(*) desc) rnk
from author
group by name
)
where rnk = 1;
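The scalar-subquery idea can be tried outside Oracle as well. The sketch below uses Python's sqlite3 module; because SQLite does not allow nested aggregates such as max(count(book)), the per-author counts go into a derived table first. The sample rows and the alias per_author are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE author (name TEXT, book TEXT);
    -- Hypothetical data: two authors tie for the most books.
    INSERT INTO author VALUES
        ('Tullemann', 'T1'), ('Tullemann', 'T2'),
        ('Fagmann', 'F1'), ('Fagmann', 'F2'),
        ('Lars', 'L1');
""")

query = """
    SELECT name, COUNT(book) AS max_num_books
    FROM author
    GROUP BY name
    HAVING COUNT(book) = (
        -- scalar subquery: the largest per-author count
        SELECT MAX(ct)
        FROM (SELECT COUNT(book) AS ct FROM author GROUP BY name) per_author
    )
    ORDER BY name
"""
top_authors = conn.execute(query).fetchall()
print(top_authors)  # [('Fagmann', 2), ('Tullemann', 2)]
```

Unlike a "first row only" query, this returns every author tied for the top count.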

Get latest record from second table left joined to first table

I have a table named candidates with only an id field, and I left join a profiles table to it. The profiles table has 2 fields, namely candidate_id and name.
e.g. Table candidates:
id
----
1
2
and Table profiles:
candidate_id name
----------------------------
1 Foobar
1 Foobar2
2 Foobar3
I want the latest name of each candidate in a single query. My attempt is given below:
SELECT C.id, P.name
FROM candidates C
LEFT JOIN profiles P ON P.candidate_id = C.id
GROUP BY C.id
ORDER BY P.name;
But this query returns:
1 Foobar
2 Foobar3
...Instead of:
1 Foobar2
2 Foobar3
The problem is that your PROFILES table doesn't provide a reliable means of figuring out what the latest name value is. There are two options for the PROFILES table:
Add a datetime column, e.g. created_date
Define an auto_increment column
The first option is the best - it's explicit, meaning the use of the column is absolutely obvious, and handles backdated entries better.
ALTER TABLE PROFILES ADD COLUMN created_date DATETIME
If you want the value to default to the current date & time when inserting a record if no value is provided, tack the following on to the end:
DEFAULT CURRENT_TIMESTAMP
With that in place, you'd use the following to get your desired result:
SELECT c.id,
p.name
FROM CANDIDATES c
LEFT JOIN PROFILES p ON p.candidate_id = c.id
JOIN (SELECT x.candidate_id,
MAX(x.created_date) AS max_date
FROM PROFILES x
GROUP BY x.candidate_id) y ON y.candidate_id = p.candidate_id
AND y.max_date = p.created_date
GROUP BY c.id
ORDER BY p.name
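As a quick check of the MAX(created_date) approach, here is a sketch in Python's sqlite3 module instead of MySQL. The dates and the extra candidate 3 without any profile are invented for the example, and both joins are written as LEFT JOIN so that candidates without a profile still show up, which is a small variation on the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE candidates (id INTEGER PRIMARY KEY);
    CREATE TABLE profiles (candidate_id INTEGER, name TEXT, created_date TEXT);
    INSERT INTO candidates (id) VALUES (1), (2), (3);
    -- Hypothetical dates; candidate 3 has no profile at all.
    INSERT INTO profiles VALUES
        (1, 'Foobar',  '2020-01-01 00:00:00'),
        (1, 'Foobar2', '2020-02-01 00:00:00'),
        (2, 'Foobar3', '2020-01-15 00:00:00');
""")

query = """
    SELECT c.id, p.name
    FROM candidates c
    LEFT JOIN (
        -- latest created_date per candidate
        SELECT candidate_id, MAX(created_date) AS max_date
        FROM profiles
        GROUP BY candidate_id
    ) y ON y.candidate_id = c.id
    LEFT JOIN profiles p
        ON p.candidate_id = y.candidate_id
       AND p.created_date = y.max_date
    ORDER BY c.id
"""
latest = conn.execute(query).fetchall()
print(latest)  # [(1, 'Foobar2'), (2, 'Foobar3'), (3, None)]
```

If two profiles of one candidate share the same created_date, this returns both rows, so a tie-breaker column would be needed in that case.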
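Use a correlated subquery. Ordering by name descending reproduces the expected output for the sample data; if profiles has a created_date column, order by that instead:
SELECT C.id,
       (SELECT P.name FROM profiles P WHERE P.candidate_id = C.id ORDER BY P.name DESC LIMIT 1) AS name
FROM candidates C;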