SQL Sort table by number of items in common - sql

I have 3 tables, user, artist and a join table.
For a particular user, I'd like to order the rest of the user table by the number of artists they have in common with that user in the join table, or potentially just get the n other users who have the most in common with them.
For example in the table:
userID | artistID
-------|----------
1 | 1
1 | 2
2 | 1
2 | 2
3 | 1
I'd want to get that the ordering for user1 would be (2,3) because user2 shares both artist1 and artist2 with user1, whereas user3 only shares artist1.
Is there a way to get this from SQL?
Thanks

Assuming that you always know the user ID you want to check against, you can also do the following:
SELECT user, count(*) as in_common
FROM user_artist
WHERE
user<>1 AND
artist IN (SELECT artist FROM user_artist WHERE user=1)
GROUP BY user
ORDER BY in_common DESC;
This avoids a join, which might perform better on a large table. Your example is on SQL Fiddle here.
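To check the IN-subquery approach against the sample data from the question, here is a minimal sketch that runs it in an in-memory SQLite database through Python's sqlite3 module. SQLite is only a stand-in for whatever database you use, the function name rank_by_common_artists is hypothetical, and the extra "user" tie-breaker in ORDER BY is an addition to make the output deterministic. It also assumes each (user, artist) pair appears at most once.

```python
import sqlite3

def rank_by_common_artists(rows, target_user):
    """Return (user, in_common) pairs for every other user who shares
    at least one artist with target_user, most shared artists first."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE user_artist (user INTEGER, artist INTEGER)")
    conn.executemany("INSERT INTO user_artist VALUES (?, ?)", rows)
    result = conn.execute(
        """
        SELECT user, COUNT(*) AS in_common
        FROM user_artist
        WHERE user <> ?
          AND artist IN (SELECT artist FROM user_artist WHERE user = ?)
        GROUP BY user
        ORDER BY in_common DESC, user
        """,
        (target_user, target_user),
    ).fetchall()
    conn.close()
    return result

# Sample rows from the question: (userID, artistID)
sample = [(1, 1), (1, 2), (2, 1), (2, 2), (3, 1)]
print(rank_by_common_artists(sample, 1))
```

Users who share nothing with the target user do not appear at all with this query, because their rows are filtered out by the IN condition.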

You can do this with a self-join and aggregation:
select ua.userID, count(ua1.artistID) as numInCommonWithUser1
from userartists ua left join
userartists ua1
on ua.artistID = ua1.artistID and ua1.userID = 1
group by ua.userID
order by numInCommonWithUser1 desc;
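The question also asks for just the n most similar users. A possible sketch of the self-join variant, again using an in-memory SQLite database as a stand-in: it adds a WHERE clause to leave out the target user and a LIMIT for the top n. Both additions, the function name top_n_similar, and user 4 in the sample data are assumptions for illustration, not part of the original answer.

```python
import sqlite3

def top_n_similar(rows, target_user, n):
    """Return up to n (userID, numInCommon) pairs, excluding target_user.
    Users with no shared artist are kept with a count of 0 (LEFT JOIN)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE userartists (userID INTEGER, artistID INTEGER)")
    conn.executemany("INSERT INTO userartists VALUES (?, ?)", rows)
    result = conn.execute(
        """
        SELECT ua.userID, COUNT(ua1.artistID) AS numInCommon
        FROM userartists ua
        LEFT JOIN userartists ua1
          ON ua.artistID = ua1.artistID AND ua1.userID = ?
        WHERE ua.userID <> ?
        GROUP BY ua.userID
        ORDER BY numInCommon DESC, ua.userID
        LIMIT ?
        """,
        (target_user, target_user, n),
    ).fetchall()
    conn.close()
    return result

# Question's sample plus a hypothetical user 4 with no overlap
sample = [(1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (4, 3)]
print(top_n_similar(sample, 1, 2))
print(top_n_similar(sample, 1, 3))
```

Because COUNT(ua1.artistID) ignores the NULLs produced by the LEFT JOIN, a user with nothing in common gets 0 rather than being dropped.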

If you know the user ID you are going to check, this query will meet your requirement and should also perform well.
SELECT ua1.user, count(*) as all_Common
FROM user_artist ua1
WHERE
(
Select count(*)
From user_artist ua2
Where ua2.user=1
AND ua2.artist=ua1.artist
)>0
AND ua1.user <> 1
GROUP BY ua1.user
ORDER BY all_Common DESC;
Let me know if any question!

Related

Postgres, groupBy and count for table and relations at the same time

I have a table called 'users' that has the following structure:
id (PK) | campaign_id | createdAt
--------|-------------|-------------------------
1 | 123 | 2022-07-14T10:30:01.967Z
2 | 1234 | 2022-07-14T10:30:01.967Z
3 | 123 | 2022-07-14T10:30:01.967Z
4 | 123 | 2022-07-14T10:30:01.967Z
At the same time I have a table that tracks clicks per user:
id (PK) | user_id (FK) | createdAt
--------|--------------|-------------------------
1 | 1 | 2022-07-14T10:30:01.967Z
2 | 2 | 2022-07-14T10:30:01.967Z
3 | 2 | 2022-07-14T10:30:01.967Z
4 | 2 | 2022-07-14T10:30:01.967Z
Both of these tables hold up to millions of records. I need the most efficient query to group the data per campaign_id.
The result I am looking for would look like this:
campaign_id | total_users | total_clicks
------------|-------------|-------------
123 | 3 | 1
1234 | 1 | 3
Unfortunately, I have no idea how to achieve this while keeping performance in mind. Most importantly, I need to use WHERE or HAVING to limit the query to a certain time range by createdAt.
Note, PostgreSQL is not my forte, nor is SQL, but I learned something by spending some time on your question. Have a go with an INNER JOIN of two separate SELECT statements:
SELECT * FROM
(
SELECT campaign_id, COUNT (t1."id(PK)") total_users FROM t1 GROUP BY campaign_id
) tbl1
INNER JOIN
(
SELECT campaign_id, COUNT (t2."user_id(FK)") total_clicks FROM t2 INNER JOIN t1 ON t1."id(PK)" = t2."user_id(FK)" GROUP BY campaign_id
) tbl2
USING(campaign_id)
See an online fiddle. I believe this is now also ready for a WHERE clause in both SELECT statements to filter by "createdAt". I'm pretty sure someone else will come up with something better.
Good luck.
Hope this will help you.
select u.campaign_id,
count(distinct u.id) users_count,
count(c.user_id) clicks_count
from
users u left join clicks c on u.id=c.user_id
group by 1;
See the query output here.
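Neither answer shows the createdAt filter the question asks for. One way it might look, sketched against the question's sample data in an in-memory SQLite database (a stand-in for PostgreSQL): the range is applied to users in WHERE and to clicks inside the LEFT JOIN condition, so users without clicks in the range are still counted. Applying the same range to both tables, the half-open bounds, and the function name campaign_totals are all assumptions.

```python
import sqlite3

TS = "2022-07-14T10:30:01.967Z"
USERS = [(1, 123, TS), (2, 1234, TS), (3, 123, TS), (4, 123, TS)]
CLICKS = [(1, 1, TS), (2, 2, TS), (3, 2, TS), (4, 2, TS)]

def campaign_totals(start, end):
    """Return (campaign_id, total_users, total_clicks) for rows whose
    createdAt falls in the half-open range [start, end)."""
    conn = sqlite3.connect(":memory:")
    conn.execute('CREATE TABLE users (id INTEGER PRIMARY KEY, campaign_id INTEGER, "createdAt" TEXT)')
    conn.execute('CREATE TABLE clicks (id INTEGER PRIMARY KEY, user_id INTEGER, "createdAt" TEXT)')
    conn.executemany("INSERT INTO users VALUES (?, ?, ?)", USERS)
    conn.executemany("INSERT INTO clicks VALUES (?, ?, ?)", CLICKS)
    rows = conn.execute(
        """
        SELECT u.campaign_id,
               COUNT(DISTINCT u.id) AS total_users,
               COUNT(c.user_id)     AS total_clicks
        FROM users u
        LEFT JOIN clicks c
          ON c.user_id = u.id
         AND c."createdAt" >= ? AND c."createdAt" < ?
        WHERE u."createdAt" >= ? AND u."createdAt" < ?
        GROUP BY u.campaign_id
        ORDER BY u.campaign_id
        """,
        (start, end, start, end),
    ).fetchall()
    conn.close()
    return rows

print(campaign_totals("2022-07-14T00:00:00.000Z", "2022-07-15T00:00:00.000Z"))
```

Putting the click filter in WHERE instead of ON would turn the LEFT JOIN into an effective inner join and drop users who have no clicks in the range. On tables with millions of rows, indexes on the createdAt columns and on clicks.user_id would likely matter more than the exact query shape, but that needs measuring on the real data.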

Aggregate and count after left join

I am aggregating columns of a table to find the count of unique values. For example, aggregating the status shows that out of 5 alerts there are 2 in open status and 3 that are closed. The simplified table looks like this:
create table alerts (
id,
status,
owner_id
);
The query below uses grouping sets to aggregate multiple columns at once. This approach works well.
with aggs as (
select status
from alerts
where alerts.owner_id = 'x'
)
select status, count(*)
from aggs
group by grouping sets(
(),
(status)
);
The output at its simplest could look like this:
status | count
--------+-------
| 1
new | 1
However, now I need to aggregate additional columns from another table. This table (shown below) can have zero or more rows associated to the first table (alerts:users 1:N).
create table users (
id,
alert_id,
name
);
I have tried updating the query to use a left join but this approach incorrectly inflates the counts of the alert columns.
with aggs as (
select alerts.status, users.name
from alerts
left join users on alerts.id = users.alert_id
where alerts.owner_id = 'x'
-- and additional filtering by columns in the users table
)
select status, name, count(*)
from aggs
group by grouping sets(
(),
(status),
(name)
);
Below is an example of the incorrect results. Since there are 3 rows in the users table, the count for the status column is now 3 but should be 1.
status | name | count
--------+-------------------------+-------
| | 3
| user1 | 1
| user2 | 1
| user3 | 1
new | | 3
How can I perform this aggregation to include the columns from the table with a many-to-one relationship without inflating the counts? In the future I will likely need to aggregate more columns from other tables with a many-to-one relationship and need a solution that will still work with several left joins. All help is much appreciated.
edit: link to db-fiddle https://www.db-fiddle.com/f/buGD2DuJiqf9LGF9rw5EgT/2
Do you just want to count the number of alerts? If so, use count(distinct):
count(distinct alert_id)
Of course, you need this in aggs, so the select would include:
alerts.id as alert_id
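The effect of count(distinct alert_id) can be seen in a small sketch on the question's scenario (one alert with three users). It uses an in-memory SQLite database, which does not support GROUPING SETS, so a plain GROUP BY status stands in for the grouping-sets query; the column types and the function name status_counts are assumptions.

```python
import sqlite3

def status_counts(alerts, users, owner_id):
    """Return (status, joined_rows, alert_count) per status, showing the
    inflated COUNT(*) next to COUNT(DISTINCT alert_id)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE alerts (id INTEGER, status TEXT, owner_id TEXT)")
    conn.execute("CREATE TABLE users (id INTEGER, alert_id INTEGER, name TEXT)")
    conn.executemany("INSERT INTO alerts VALUES (?, ?, ?)", alerts)
    conn.executemany("INSERT INTO users VALUES (?, ?, ?)", users)
    rows = conn.execute(
        """
        WITH aggs AS (
            SELECT alerts.id AS alert_id, alerts.status, users.name
            FROM alerts
            LEFT JOIN users ON alerts.id = users.alert_id
            WHERE alerts.owner_id = ?
        )
        SELECT status,
               COUNT(*)                 AS joined_rows,
               COUNT(DISTINCT alert_id) AS alert_count
        FROM aggs
        GROUP BY status
        ORDER BY status
        """,
        (owner_id,),
    ).fetchall()
    conn.close()
    return rows

alerts = [(1, "new", "x")]
users = [(1, 1, "user1"), (2, 1, "user2"), (3, 1, "user3")]
print(status_counts(alerts, users, "x"))
```

The join still produces three rows for the single alert, but counting distinct alert IDs collapses them back to one, and this keeps working when further one-to-many joins multiply the rows again.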

`INTERSECT` does not return anything from two tables, separately values are returned fine

I'm not sure what I am doing wrong here since I didn't touch SQL queries for several years plus MSSQL query language is a bit strange to me but after 30 minutes of googling I still cannot find the answer.
Problem
I have two queries that work perfectly fine:
SELECT COUNT(*) AS 'NumberOfAccounts' FROM Accounts
SELECT COUNT(*) AS 'NumberOfUsers' FROM Users
I need to get this information in one go in my API response since I don't want to execute two statements. How can I combine them into one query so it will return table as follows:
+------------------+---------------+
| NumberOfAccounts | NumberOfUsers |
+------------------+---------------+
| 10 | 16 |
+------------------+---------------+
What I have tried
UNION
SELECT COUNT(*) AS 'NumberOfAccounts' FROM Accounts UNION SELECT COUNT(*) AS 'NumberOfUsers' FROM Users
This is giving me the result of both tables, however it all pushes it into NumberOfAccounts and the result is invalid for me to parse.
+------------------+
| NumberOfAccounts |
+------------------+
| 10 |
| 16 |
+------------------+
INTERSECT
SELECT COUNT(*) AS 'NumberOfAccounts' FROM Accounts INTERSECT SELECT COUNT(*) AS 'NumberOfUsers' FROM Users
This just gives me empty result with only NumberOfAccounts column in it.
You can just put these as subqueries in a select:
SELECT (SELECT COUNT(*) FROM Accounts) as NumberOfAccounts,
(SELECT COUNT(*) FROM Users) as NumberOfUsers
In SQL Server, no FROM clause is needed.
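A quick way to confirm the shape of the result is to run the same statement against an in-memory SQLite database, which also accepts a SELECT without FROM. This is only a sketch with SQLite standing in for SQL Server; the single id column and the function name count_both are assumptions.

```python
import sqlite3

def count_both(n_accounts, n_users):
    """Create the two tables with the given row counts and return
    (column_names, row) from the scalar-subquery SELECT."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Accounts (id INTEGER)")
    conn.execute("CREATE TABLE Users (id INTEGER)")
    conn.executemany("INSERT INTO Accounts VALUES (?)", [(i,) for i in range(n_accounts)])
    conn.executemany("INSERT INTO Users VALUES (?)", [(i,) for i in range(n_users)])
    cur = conn.execute(
        """
        SELECT (SELECT COUNT(*) FROM Accounts) AS NumberOfAccounts,
               (SELECT COUNT(*) FROM Users)    AS NumberOfUsers
        """
    )
    names = [col[0] for col in cur.description]
    row = cur.fetchone()
    conn.close()
    return names, row

print(count_both(10, 16))
```

The result is a single row with two named columns, which is the layout the question asks for.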
UNION is the wrong usage here. Union will "merge" rows of identical tables (or identical selects) and not columns.
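The behavior described in the question follows from how the set operators work: both stack or compare rows of a single column, and the column name comes from the first SELECT. INTERSECT keeps only rows present in both inputs, so it is empty whenever the two counts differ. A small sketch in an in-memory SQLite database (a stand-in for SQL Server; the function name combine_counts is hypothetical):

```python
import sqlite3

def combine_counts(n_accounts, n_users, operator):
    """Run the two COUNT(*) queries combined with UNION or INTERSECT
    and return the resulting rows, sorted for a stable order."""
    if operator not in ("UNION", "INTERSECT"):
        raise ValueError("operator must be UNION or INTERSECT")
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Accounts (id INTEGER)")
    conn.execute("CREATE TABLE Users (id INTEGER)")
    conn.executemany("INSERT INTO Accounts VALUES (?)", [(i,) for i in range(n_accounts)])
    conn.executemany("INSERT INTO Users VALUES (?)", [(i,) for i in range(n_users)])
    rows = conn.execute(
        "SELECT COUNT(*) AS NumberOfAccounts FROM Accounts "
        + operator
        + " SELECT COUNT(*) AS NumberOfUsers FROM Users"
    ).fetchall()
    conn.close()
    return sorted(rows)

print(combine_counts(10, 16, "UNION"))      # two rows in one column
print(combine_counts(10, 16, "INTERSECT"))  # no rows: 10 and 16 differ
print(combine_counts(5, 5, "INTERSECT"))    # one row only when counts match
```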
One solution might be:
SELECT AccountCount, UserCount FROM
(SELECT COUNT(*) AS AccountCount, 1 AS Id FROM Accounts) AS a
JOIN
(SELECT COUNT(*) AS UserCount, 1 as Id FROM Users) AS u ON (a.Id = u.Id)
Be aware of the artificial surrogate key 1 you need to insert to join both sub-selects together.
For completeness' sake, with UNION ALL you'd do:
SELECT 'NumberOfAccounts' AS what, COUNT(*) AS howmany FROM accounts
UNION ALL
SELECT 'NumberOfUsers' AS what, COUNT(*) AS howmany FROM users;
which results in
+------------------+---------+
| what | howmany |
+------------------+---------+
| NumberOfAccounts | 10 |
| NumberOfUsers | 16 |
+------------------+---------+
And another variation:
WITH cte AS
(
SELECT COUNT(*) AS cntAccounts, 0 AS cntUsers FROM accounts
UNION ALL
SELECT 0 AS cntAccounts, COUNT(*) AS cntUsers FROM users
)
SELECT
SUM(cntAccounts) AS NumberOfAccounts
,SUM(cntUsers ) AS NumberOfUsers
FROM cte
If you want (need) better performance you can get the row counts from the following query which uses sys.dm_db_partition_stats to get the row counts:
SELECT (
SELECT SUM (row_count)
FROM sys.dm_db_partition_stats
WHERE object_id=OBJECT_ID('Accounts')
AND (index_id=0 or index_id=1)) NumberOfAccounts,
(
SELECT SUM (row_count)
FROM sys.dm_db_partition_stats
WHERE object_id=OBJECT_ID('Users')
AND (index_id=0 or index_id=1)) NumberOfUsers

How can you get a histogram of counts from a join table without using a subquery?

I have a lot of tables that look like this: (id, user_id, object_id). I am often interested in the question "how many users have one object? how many have two? etc." and would like to see the distribution.
The obvious answer to this looks like:
select x.ucount, count(*)
from (select count(*) as ucount from objects_users group by user_id) as x
group by x.ucount
order by x.ucount;
This produces results like:
ucount | count
-------|-------
1 | 15
2 | 17
3 | 23
4 | 104
5 | 76
7 | 12
Using a subquery here feels inelegant to me and I'd like to figure out how to produce the same result without. Further, if the question you're trying to ask is slightly more complicated it gets messy passing more information out of the subquery. For example, if you want the data further grouped by the user's creation date:
select
x.ucount,
(select cdate from users where id = x.user_id) as cdate,
count(*)
from (
select user_id, count(*) as ucount
from objects_users group by user_id
) as x
group by cdate, x.ucount
order by cdate, x.ucount;
Is there some way to avoid the explosion of subqueries? I suppose in the end my objection is aesthetic, but it makes the queries hard to read and hard to write.
I think a subquery is exactly the appropriate way to do this, regardless of your RDBMS. Why would it be inelegant?
For the second query, just join the users table like this:
SELECT
x.ucount,
u.cdate,
COUNT(*)
FROM (
SELECT
user_id,
COUNT(*) AS ucount
FROM objects_users
GROUP BY user_id
) AS x
LEFT JOIN users AS u
ON x.user_id = u.id
GROUP BY u.cdate, x.ucount
ORDER BY u.cdate, x.ucount
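If the objection is mainly readability, one option is to name the per-user aggregate with a common table expression. That is a stylistic variant of the same subquery, not a way to avoid it. Here is a sketch in an in-memory SQLite database with made-up sample rows; the CTE name per_user, the column alias num_users, and the function name histogram_by_cdate are assumptions.

```python
import sqlite3

def histogram_by_cdate(users, objects_users):
    """Return (ucount, cdate, num_users): how many users created on
    each date own exactly ucount objects."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, cdate TEXT)")
    conn.execute(
        "CREATE TABLE objects_users (id INTEGER PRIMARY KEY, user_id INTEGER, object_id INTEGER)"
    )
    conn.executemany("INSERT INTO users VALUES (?, ?)", users)
    conn.executemany(
        "INSERT INTO objects_users (user_id, object_id) VALUES (?, ?)", objects_users
    )
    rows = conn.execute(
        """
        WITH per_user AS (
            SELECT user_id, COUNT(*) AS ucount
            FROM objects_users
            GROUP BY user_id
        )
        SELECT p.ucount, u.cdate, COUNT(*) AS num_users
        FROM per_user p
        LEFT JOIN users u ON u.id = p.user_id
        GROUP BY u.cdate, p.ucount
        ORDER BY u.cdate, p.ucount
        """
    ).fetchall()
    conn.close()
    return rows

# Hypothetical sample data: (id, cdate) and (user_id, object_id)
users = [(1, "2020-01-01"), (2, "2020-01-01"), (3, "2020-01-02"), (4, "2020-01-01")]
objects_users = [(1, 10), (2, 10), (2, 11), (3, 10), (3, 12), (4, 11), (4, 12)]
print(histogram_by_cdate(users, objects_users))
```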

How can I create a MySQL JOIN query for only selecting the rows in one table where a certain number of references to that row exist in another table?

I have two tables in my database, called ratings and movies.
Ratings:
| id | movie_id | rating |
Movies:
| id | title |
A typical movie record might be like this:
| 4 | Cloverfield (2008) |
and there may be several rating records for Cloverfield, like this:
| 21 | 4 | 3 | (rating number 21, on movie number 4, giving it a rating of 3)
| 22 | 4 | 2 | (rating number 22, on movie number 4, giving it a rating of 2)
| 23 | 4 | 5 | (rating number 23, on movie number 4, giving it a rating of 5)
The question:
How do I create a JOIN query for only selecting the rows in the movie table that have more than x number of ratings in the ratings table? For example, in the above example if Cloverfield only had one rating in the ratings table and x was 2, it would not be selected.
Thanks for any help or advice!
Use the HAVING clause. Something along these lines:
SELECT movies.id, movies.title, COUNT(ratings.id) AS num_ratings
FROM movies
LEFT JOIN ratings ON ratings.movie_id=movies.id
GROUP BY movies.id
HAVING num_ratings > 5;
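To try the HAVING approach on the Cloverfield rows from the question, here is a minimal sketch in an in-memory SQLite database (a stand-in for MySQL). It repeats the aggregate in HAVING and groups by both selected columns to stay portable across databases; the second movie, its rating, and the function name movies_with_more_than are made up for illustration.

```python
import sqlite3

def movies_with_more_than(movies, ratings, x):
    """Return (id, title, num_ratings) for movies with more than x ratings."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE movies (id INTEGER PRIMARY KEY, title TEXT)")
    conn.execute("CREATE TABLE ratings (id INTEGER PRIMARY KEY, movie_id INTEGER, rating INTEGER)")
    conn.executemany("INSERT INTO movies VALUES (?, ?)", movies)
    conn.executemany("INSERT INTO ratings VALUES (?, ?, ?)", ratings)
    rows = conn.execute(
        """
        SELECT movies.id, movies.title, COUNT(ratings.id) AS num_ratings
        FROM movies
        LEFT JOIN ratings ON ratings.movie_id = movies.id
        GROUP BY movies.id, movies.title
        HAVING COUNT(ratings.id) > ?
        ORDER BY movies.id
        """,
        (x,),
    ).fetchall()
    conn.close()
    return rows

movies = [(4, "Cloverfield (2008)"), (5, "Hypothetical Movie")]
ratings = [(21, 4, 3), (22, 4, 2), (23, 4, 5), (24, 5, 4)]
print(movies_with_more_than(movies, ratings, 2))
```

WHERE cannot be used for this filter because it is evaluated before grouping; HAVING is applied after the counts exist.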
The JOIN method is somewhat stilted and confusing because that's not exactly what it was intended to do. The most direct (and in my opinion, easily human-parseable) method uses EXISTS:
SELECT whatever
FROM movies m
WHERE EXISTS( SELECT COUNT(*)
FROM ratings
WHERE movie_id = m.id
HAVING COUNT(*) > xxxxxxxx )
Read it out loud: SELECT something FROM movies WHERE there EXIST rows in ratings where the movie_id matches and there are > xxxxxx rows.
You'll probably want to use MySQL's HAVING clause
http://www.severnsolutions.co.uk/twblog/archive/2004/10/03/havingmysql
SELECT * FROM movies
INNER JOIN
(SELECT movie_id, COUNT(*) as num_ratings from ratings GROUP BY movie_id) as movie_counts
ON movies.id = movie_counts.movie_id
WHERE num_ratings > 3;
That will only get you the movies with more than 3 ratings; actually getting the ratings along with them will take another join. The advantage of a subquery over HAVING is that you can aggregate the ratings at the same time, such as (SELECT movie_id, COUNT(*), AVG(rating) as average_movie_rating ...)
Edit: Oops, you can aggregate with the HAVING method too. :)
The above solutions are okay for the scenario you mentioned. My suggestion may be overkill for what you have in mind, but may be handy for other situations:
Subquery only those rows from the ratings table having more than the number you need (again using the GROUP BY ... HAVING clause):
select movie_id from ratings group by movie_id having count(*) > x
Join that subquery with the movies table:
select movies.id
from movies join
(select movie_id from ratings group by movie_id having count(*) > x)
as MoviesWRatings on movies.id = MoviesWRatings.movie_id
When you're doing more stuff to the subquery, this might be helpful.
(Not sure if the syntax is right for MySQL, please fix if necessary.)