Using SQL Aggregate Functions With Multiple Joins - sql

I am attempting to use multiple aggregate functions across multiple tables in a single SQL query (using Postgres).
My tables are structured similarly to the following:
CREATE TABLE "user" (user_id INT PRIMARY KEY, user_date_created TIMESTAMP NOT NULL);
CREATE TABLE item_sold (item_sold_id INT PRIMARY KEY, sold_user_id INT NOT NULL);
CREATE TABLE item_bought (item_bought_id INT PRIMARY KEY, bought_user_id INT NOT NULL);
I want to count the number of items bought and sold for each user. The solution I thought up does not work:
SELECT user_id, COUNT(item_sold_id), COUNT(item_bought_id)
FROM "user"
LEFT JOIN item_sold ON sold_user_id=user_id
LEFT JOIN item_bought ON bought_user_id=user_id
WHERE user_date_created > '2014-01-01'
GROUP BY user_id;
That seems to perform all the combinations of (item_sold_id, item_bought_id), e.g. if there are 4 sold and 2 bought, both COUNT()s are 8.
How can I properly query the table to obtain both counts?

The easy fix to your query is to use distinct:
SELECT user_id, COUNT(distinct item_sold_id), COUNT(distinct item_bought_id)
FROM "user"
LEFT JOIN item_sold ON sold_user_id=user_id
LEFT JOIN item_bought ON bought_user_id=user_id
WHERE user_date_created > '2014-01-01'
GROUP BY user_id;
However, the query is doing unnecessary work. If someone has 100 items bought and 200 items sold, then the join produces 20,000 intermediate rows. That is a lot.
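To make the row multiplication concrete, here is a minimal sketch that reproduces the 4 sold / 2 bought case from the question. It assumes SQLite (through Python's sqlite3 module) instead of Postgres, stores the timestamp as text, and uses made-up sample rows; the join behaviour it shows is the same.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE "user" (user_id INT PRIMARY KEY, user_date_created TEXT NOT NULL);
CREATE TABLE item_sold (item_sold_id INT PRIMARY KEY, sold_user_id INT NOT NULL);
CREATE TABLE item_bought (item_bought_id INT PRIMARY KEY, bought_user_id INT NOT NULL);
INSERT INTO "user" VALUES (1, '2014-06-01 00:00:00');
INSERT INTO item_sold VALUES (1, 1), (2, 1), (3, 1), (4, 1);  -- 4 sold
INSERT INTO item_bought VALUES (1, 1), (2, 1);                -- 2 bought
""")

# Both queries share the same double LEFT JOIN, which yields 4 * 2 = 8 rows.
joins = """
FROM "user"
LEFT JOIN item_sold ON sold_user_id = user_id
LEFT JOIN item_bought ON bought_user_id = user_id
GROUP BY user_id
"""
plain = conn.execute(
    "SELECT user_id, COUNT(item_sold_id), COUNT(item_bought_id)" + joins
).fetchall()
distinct = conn.execute(
    "SELECT user_id, COUNT(DISTINCT item_sold_id), COUNT(DISTINCT item_bought_id)" + joins
).fetchall()

print(plain)     # [(1, 8, 8)]: every sold row is paired with every bought row
print(distinct)  # [(1, 4, 2)]: DISTINCT collapses the duplicates again
```

DISTINCT repairs the counts, but the 8 intermediate rows are still produced, which is the unnecessary work described above.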
The solution is to pre-aggregate the results or use a correlated subquery in the select. In this case, I prefer the correlated subquery solution (assuming the right indexes are available):
SELECT u.user_id,
(select count(*) from item_sold s where u.user_id = s.sold_user_id),
(select count(*) from item_bought b where u.user_id = b.bought_user_id)
FROM "user" u
WHERE u.user_date_created > '2014-01-01';
The right indexes are item_sold(sold_user_id) and item_bought(bought_user_id). I prefer this over pre-aggregation because of the filtering on the user table. This only does the calculations for users created this year -- that is harder to do with pre-aggregation.
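A small sketch of the correlated-subquery version with the two indexes in place, again assuming SQLite via Python's sqlite3 rather than Postgres. The index names and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE "user" (user_id INT PRIMARY KEY, user_date_created TEXT NOT NULL);
CREATE TABLE item_sold (item_sold_id INT PRIMARY KEY, sold_user_id INT NOT NULL);
CREATE TABLE item_bought (item_bought_id INT PRIMARY KEY, bought_user_id INT NOT NULL);
-- Hypothetical index names; the indexed columns are the ones named above.
CREATE INDEX item_sold_user_idx ON item_sold (sold_user_id);
CREATE INDEX item_bought_user_idx ON item_bought (bought_user_id);
INSERT INTO "user" VALUES
    (1, '2014-06-01 00:00:00'),
    (2, '2014-07-01 00:00:00'),   -- no items at all
    (3, '2013-05-01 00:00:00');   -- filtered out by the date condition
INSERT INTO item_sold VALUES (1, 1), (2, 1), (3, 1), (4, 1), (5, 3);
INSERT INTO item_bought VALUES (1, 1), (2, 1), (3, 3);
""")

# Each subquery runs once per surviving user row, so no rows are multiplied.
rows = conn.execute("""
SELECT u.user_id,
       (SELECT COUNT(*) FROM item_sold s WHERE s.sold_user_id = u.user_id) AS sold,
       (SELECT COUNT(*) FROM item_bought b WHERE b.bought_user_id = u.user_id) AS bought
FROM "user" u
WHERE u.user_date_created > '2014-01-01'
ORDER BY u.user_id
""").fetchall()

print(rows)  # [(1, 4, 2), (2, 0, 0)]
```

Users without any items come back with 0 rather than NULL, because COUNT(*) over an empty set is 0.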

With a lateral join it is possible to pre-aggregate only the filtered users:
select user_id, total_item_sold, total_item_bought
from
"user" u
left join lateral (
select sold_user_id, count(*) as total_item_sold
from item_sold
where sold_user_id = u.user_id
group by sold_user_id
) item_sold on user_id = sold_user_id
left join lateral (
select bought_user_id, count(*) as total_item_bought
from item_bought
where bought_user_id = u.user_id
group by bought_user_id
) item_bought on user_id = bought_user_id
where u.user_date_created >= '2014-01-01'
Notice that you need >= in the filter; otherwise it is possible to miss the exact first moment of the year. Although that timestamp is unlikely with naturally entered data, it is common with an automated job.
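The boundary effect can be checked with a short sketch. It assumes SQLite via Python's sqlite3, with timestamps stored as fixed-format ISO-8601 text so that string comparison matches chronological order; SQLite has no LATERAL, so only the date filter is shown, and the two sample users are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE "user" (user_id INT PRIMARY KEY, user_date_created TEXT NOT NULL);
INSERT INTO "user" VALUES
    (1, '2014-01-01 00:00:00'),  -- created at the exact first moment of the year
    (2, '2014-03-01 12:00:00');
""")

sql = """
SELECT user_id FROM "user"
WHERE user_date_created {op} '2014-01-01 00:00:00'
ORDER BY user_id
"""
strict = [row[0] for row in conn.execute(sql.format(op=">"))]
inclusive = [row[0] for row in conn.execute(sql.format(op=">="))]

print(strict)     # [2]: user 1 is missed
print(inclusive)  # [1, 2]
```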

Another way to solve this problem is to use two nested selects.
select user_id,
(select count(*) from item_sold where sold_user_id = user_id),
(select count(*) from item_bought where bought_user_id = user_id)
from "user"
where user_date_created > '2014-01-01'

Related

right way to alias count * in a subquery

I have the query below:
select t.comment_count, count(*) as frequency
from
(select u.id, count(c.user_id) as comment_count
from users u
left join comments c
on u.id = c.user_id
and c.created_at between '2020-01-01' and '2020-01-31'
group by 1) t
group by 1
order by 1
When I try to write the count(*) as count(t.*), it gives an error. Can I not qualify it with the alias t of the subquery? I'm not sure what I am missing.
Thank you
COUNT(*) counts all rows returned by a query (within each GROUP BY group), so it makes no sense to qualify it with one of the involved tables. Consider counting the rows produced by a join, for example. If you need a count of the rows of a specific table t, you can use count(distinct t.<unique column>).
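The difference between counting rows and counting a column shows up as soon as a left join is involved. The sketch below assumes SQLite via Python's sqlite3, a reduced users/comments schema, and invented sample rows; the created_at filter from the question is left out to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY);
CREATE TABLE comments (id INTEGER PRIMARY KEY, user_id INTEGER NOT NULL);
INSERT INTO users VALUES (1), (2), (3), (4);
INSERT INTO comments (user_id) VALUES (1), (1), (2);  -- users 3 and 4 have none
""")

# COUNT(*) counts joined rows (including the NULL-extended ones);
# COUNT(c.user_id) counts only rows where a comment actually matched.
per_user = conn.execute("""
SELECT u.id, COUNT(*), COUNT(c.user_id)
FROM users u
LEFT JOIN comments c ON c.user_id = u.id
GROUP BY u.id
ORDER BY u.id
""").fetchall()

# The outer COUNT(*) counts rows of the derived table t, one per user.
histogram = conn.execute("""
SELECT t.comment_count, COUNT(*) AS frequency
FROM (SELECT u.id, COUNT(c.user_id) AS comment_count
      FROM users u
      LEFT JOIN comments c ON c.user_id = u.id
      GROUP BY u.id) t
GROUP BY t.comment_count
ORDER BY t.comment_count
""").fetchall()

print(per_user)   # [(1, 2, 2), (2, 1, 1), (3, 1, 0), (4, 1, 0)]
print(histogram)  # [(0, 2), (1, 1), (2, 1)]
```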

How to pull the count of occurences from 2 SQL tables

I am using Python on an SQLite3 DB I created. I have the DB created and am currently just using the command line to try to get the SQL statement correct.
I have 2 tables.
Table 1 - users
user_id, name, message_count
Table 2 - messages
id, date, message, user_id
When I set up table two, I added this statement in the creation of my messages table, but I have no clue what, if anything, it does:
FOREIGN KEY (user_id) REFERENCES users (user_id)
What I am trying to do is return a list containing the name and message count during 2020. I have used this statement to get the TOTAL number of posts in 2020, and it works:
SELECT COUNT(*) FROM messages WHERE substr(date,1,4)='2020';
But I am struggling with figuring out if I should Join the tables, or if there is a way to pull just the info I need. The statement I want would look something like this:
SELECT name, COUNT(*) FROM users JOIN messages ON messages.user_id = users.user_id WHERE substr(date,1,4)='2020';
One option uses a correlated subquery:
select u.*,
(
select count(*)
from messages m
where m.user_id = u.user_id and m.date >= '2020-01-01' and m.date < '2021-01-01'
) as cnt_messages
from users u
This query would take advantage of an index on messages(user_id, date).
You could also join and aggregate. If you want to allow users that have no messages, a left join is appropriate:
select u.name, count(m.user_id) as cnt_messages
from users u
left join messages m
on m.user_id = u.user_id and m.date >= '2020-01-01' and m.date < '2021-01-01'
group by u.user_id, u.name
Note that it is more efficient to filter the date column against literal dates than to apply a function to it (which precludes the use of an index on that column).
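One way to check the index claim is to ask SQLite for its query plan. This is a sketch assuming Python's sqlite3 and a hypothetical index name idx_msg on messages(user_id, date); the exact wording of the plan text differs between SQLite versions, so the code only looks for the date constraint in it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE messages (id INTEGER PRIMARY KEY, date TEXT, message TEXT, user_id INTEGER);
CREATE INDEX idx_msg ON messages (user_id, date);  -- hypothetical index name
""")

def plan(sql):
    """Return the detail column of EXPLAIN QUERY PLAN as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, (1,))
    return " | ".join(row[3] for row in rows)

range_plan = plan(
    "SELECT COUNT(*) FROM messages "
    "WHERE user_id = ? AND date >= '2020-01-01' AND date < '2021-01-01'"
)
substr_plan = plan(
    "SELECT COUNT(*) FROM messages "
    "WHERE user_id = ? AND substr(date, 1, 4) = '2020'"
)

# With literal bounds the date range appears as an index constraint;
# with substr() only user_id can be used to seek into the index.
print(range_plan)
print(substr_plan)
```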
You are missing a GROUP BY clause to group by user:
SELECT u.user_id, u.name, COUNT(*) AS counter
FROM users u JOIN messages m
ON m.user_id = u.user_id
WHERE substr(m.date,1,4)='2020'
GROUP BY u.user_id, u.name
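The inner-join and left-join variants differ in which users they return. A minimal sketch, assuming SQLite via Python's sqlite3 and two invented users, one of whom has no messages in 2020:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE messages (id INTEGER PRIMARY KEY, date TEXT, message TEXT, user_id INTEGER,
                       FOREIGN KEY (user_id) REFERENCES users (user_id));
INSERT INTO users VALUES (1, 'ann'), (2, 'bob');
INSERT INTO messages (date, message, user_id) VALUES
    ('2020-02-01', 'a', 1), ('2020-11-30', 'b', 1), ('2019-05-05', 'c', 1),
    ('2019-12-31', 'd', 2);  -- bob has no messages in 2020
""")

# Inner join with the filter in WHERE: users without 2020 messages disappear.
inner = conn.execute("""
SELECT u.name, COUNT(*) AS counter
FROM users u JOIN messages m ON m.user_id = u.user_id
WHERE m.date >= '2020-01-01' AND m.date < '2021-01-01'
GROUP BY u.user_id, u.name
ORDER BY u.user_id
""").fetchall()

# Left join with the filter in ON: every user is kept, with 0 when nothing matches.
left = conn.execute("""
SELECT u.name, COUNT(m.user_id) AS cnt_messages
FROM users u
LEFT JOIN messages m
  ON m.user_id = u.user_id AND m.date >= '2020-01-01' AND m.date < '2021-01-01'
GROUP BY u.user_id, u.name
ORDER BY u.user_id
""").fetchall()

print(inner)  # [('ann', 2)]
print(left)   # [('ann', 2), ('bob', 0)]
```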

CTE Optimization

I have a query involving CTEs. I just want to know whether the following query can be optimized in any way and, if it can, what the rationale behind the optimized version is.
Here is the query:
WITH A AS
(
SELECT
user_ID
FROM user
WHERE user_Date IS not NULL
),
B AS
(
SELECT
P.user_ID,
P.Payment_Type,
SUM(P.Payment_Amount) AS Total_Payment
FROM Payment P
JOIN A ON A.user_ID = P.user_ID
)
SELECT
user_ID,
Total_Payment_Amount
FROM B
WHERE Payment_Type = 'CR';
Your query should be using GROUP BY, as it seems you want to take a sum aggregate for each user_ID. From a performance point of view, you are introducing many subqueries, which aren't really necessary. We can write your query using a single join between the Payment and user tables.
SELECT
P.user_ID,
SUM(P.Payment_Amount) AS Total_Payment
FROM Payment P
INNER JOIN user A
ON A.user_ID = P.user_ID
WHERE
A.user_Date IS NOT NULL AND P.Payment_Type = 'CR'
GROUP BY
P.user_ID;
For Oracle, SQL Server, and Postgres, the query optimizers should have no problems finding the optimal query plan for this. You can find out what the database will do with EXPLAIN, usually.
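As a sanity check that the rewrite returns the same rows, the sketch below runs the single-join query next to a CTE version with the missing GROUP BY added. It assumes SQLite via Python's sqlite3, a quoted "user" table, and made-up payment rows; it says nothing about which plan any particular server would pick.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE "user" (user_ID INTEGER PRIMARY KEY, user_Date TEXT);
CREATE TABLE Payment (user_ID INTEGER, Payment_Type TEXT, Payment_Amount INTEGER);
INSERT INTO "user" VALUES (1, '2020-01-01'), (2, NULL), (3, '2020-02-02');
INSERT INTO Payment VALUES
    (1, 'CR', 10), (1, 'CR', 5), (1, 'DB', 7),
    (2, 'CR', 100),   -- user 2 has no user_Date, so it is excluded
    (3, 'DB', 3);     -- user 3 has no 'CR' payments
""")

single_join = conn.execute("""
SELECT P.user_ID, SUM(P.Payment_Amount) AS Total_Payment
FROM Payment P
INNER JOIN "user" A ON A.user_ID = P.user_ID
WHERE A.user_Date IS NOT NULL AND P.Payment_Type = 'CR'
GROUP BY P.user_ID
ORDER BY P.user_ID
""").fetchall()

with_ctes = conn.execute("""
WITH A AS (SELECT user_ID FROM "user" WHERE user_Date IS NOT NULL),
     B AS (SELECT P.user_ID, P.Payment_Type, SUM(P.Payment_Amount) AS Total_Payment
           FROM Payment P JOIN A ON A.user_ID = P.user_ID
           GROUP BY P.user_ID, P.Payment_Type)
SELECT user_ID, Total_Payment FROM B WHERE Payment_Type = 'CR'
ORDER BY user_ID
""").fetchall()

print(single_join)  # [(1, 15)]
print(with_ctes)    # [(1, 15)]
```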
Try the following query:
select p.user_ID,
SUM(Payment_Amount) as Total_Payment_Amount
from Payment P join user u
on P.user_ID=u.user_ID
where Payment_Type = 'CR'
and u.user_Date IS not NULL
group by p.user_ID;
SQL Server 2014

SQL: Two queries in a single set of results?

I'm using Postgres 9.6. I have three tables, like this:
Table public.user
id integer
name character varying
email character varying
Table public.project
id integer
user_id integer
Table public.sale
id integer
user_id integer
user_id is a foreign key in both the project and sale tables.
Is there a way I can get a list back of all user IDs with the number of projects and number of sales attached to them, as a single query?
So I'd like final data that looks like this:
user_id,num_projects,num_stories
121,28,1
122,43,6
123,67,2
I know how to do just the number of projects:
SELECT "user".id, COUNT(*) AS num_visualisations
FROM "user"
JOIN project ON project.user_id="user".id
GROUP BY "user".id
ORDER BY "user".id DESC
But I don't know how also to get the number of sales too, in a single query.
Use subqueries for the aggregation and a left join:
select u.*, p.num_projects, s.num_sales
from "user" u left join
(select p.user_id, count(*) as num_projects
from project p
group by p.user_id
) p
on p.user_id = u.id left join
(select s.user_id, count(*) as num_sales
from sale s
group by s.user_id
) s
on s.user_id = u.id;
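A quick sketch of this pre-aggregation pattern, assuming SQLite via Python's sqlite3 and invented rows. One detail worth knowing: a user with no projects or no sales gets NULL from the left join, so the second query wraps the counts in COALESCE to report 0 instead (an addition for illustration, not part of the query above).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE "user" (id INTEGER PRIMARY KEY, name TEXT, email TEXT);
CREATE TABLE project (id INTEGER PRIMARY KEY, user_id INTEGER);
CREATE TABLE sale (id INTEGER PRIMARY KEY, user_id INTEGER);
INSERT INTO "user" (id) VALUES (121), (122), (123);
INSERT INTO project (user_id) VALUES (121), (121), (122);
INSERT INTO sale (user_id) VALUES (121), (123), (123), (123);
""")

# Each table is aggregated on its own before joining, so rows never multiply.
template = """
SELECT u.id, {projects}, {sales}
FROM "user" u
LEFT JOIN (SELECT user_id, COUNT(*) AS num_projects
           FROM project GROUP BY user_id) p ON p.user_id = u.id
LEFT JOIN (SELECT user_id, COUNT(*) AS num_sales
           FROM sale GROUP BY user_id) s ON s.user_id = u.id
ORDER BY u.id
"""
raw = conn.execute(
    template.format(projects="p.num_projects", sales="s.num_sales")
).fetchall()
filled = conn.execute(
    template.format(projects="COALESCE(p.num_projects, 0)",
                    sales="COALESCE(s.num_sales, 0)")
).fetchall()

print(raw)     # [(121, 2, 1), (122, 1, None), (123, None, 3)]
print(filled)  # [(121, 2, 1), (122, 1, 0), (123, 0, 3)]
```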

Help in a Join query

SELECT game_ratingstblx245v.game_id,avg( game_ratingstblx245v.rating )
as avg_rating,
count(DISTINCT game_ratingstblx245v.userid)
as count,
game_data.name,
game_data.id ,
avg(game_ratings.critic_rating),count(DISTINCT game_ratings.critic)
as cr_count
FROM game_data
LEFT JOIN game_ratingstblx245v ON game_ratingstblx245v.game_id = game_data.id
LEFT JOIN game_ratings ON game_ratings.game_id = game_data.id
WHERE game_data.release_date < NOW()
GROUP BY game_ratingstblx245v.game_id
ORDER BY game_data.release_date DESC,
game_data.name
I am currently using this query to extract values from 3 tables:
game_data - id(foreign key), name, release_date \games info
game_ratings - game_id(foreign key),critic , rating \critic rating
game_ratingstblx245v - game_id(foreign key), rating, userid \user rating
What I want this query to do is select all ids from game_data, ordered by release_date descending, and then compute the average rating from game_ratings and game_ratingstblx245v for each id (if a game has not been rated, the fields from the latter two tables should be NULL). The problem is that the result is not what I expect: some games that have not been rated show up while others do not. Can you check my query and tell me where I went wrong? Thanks.
You shouldn't use the game_ratingstblx245v.game_id column in your GROUP BY, since it could be NULL when there are no ratings for a given game id. Use game_data.id instead.
Here's how I would write the query:
SELECT g.id, g.name,
AVG( x.rating ) AS avg_user_rating,
COUNT( DISTINCT x.userid ) AS user_count,
AVG( r.critic_rating ) AS avg_critic_rating,
COUNT( DISTINCT r.critic ) AS critic_count
FROM game_data g
LEFT JOIN game_ratingstblx245v x ON (x.game_id = g.id)
LEFT JOIN game_ratings r ON (r.game_id = g.id)
WHERE g.release_date < NOW()
GROUP BY g.id
ORDER BY g.release_date DESC, g.name;
Note that although this query produces a Cartesian product between x and r, it doesn't affect the calculation of the average ratings. Just be aware in the future that if you were doing SUM() or COUNT(), the calculations could be exaggerated by an unintended Cartesian product.
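The claim about averages versus sums and counts can be checked on a tiny data set. The sketch assumes SQLite via Python's sqlite3 and uses shortened, hypothetical table names (user_ratings, critic_ratings) in place of the ones in the question, with one game that has 2 user ratings and 3 critic ratings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE game_data (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE user_ratings (game_id INTEGER, userid INTEGER, rating INTEGER);
CREATE TABLE critic_ratings (game_id INTEGER, critic TEXT, critic_rating INTEGER);
INSERT INTO game_data VALUES (1, 'example game');
INSERT INTO user_ratings VALUES (1, 10, 2), (1, 11, 4);
INSERT INTO critic_ratings VALUES (1, 'a', 6), (1, 'b', 8), (1, 'c', 10);
""")

# The two LEFT JOINs produce 2 * 3 = 6 rows for the game.
result = conn.execute("""
SELECT AVG(x.rating), AVG(r.critic_rating),
       COUNT(x.rating), SUM(x.rating), COUNT(DISTINCT x.userid)
FROM game_data g
LEFT JOIN user_ratings x ON x.game_id = g.id
LEFT JOIN critic_ratings r ON r.game_id = g.id
GROUP BY g.id
""").fetchone()

print(result)  # (3.0, 8.0, 6, 18, 2)
```

Every user rating is repeated the same number of times, so the averages stay at 3.0 and 8.0, while COUNT reports 6 instead of 2 and SUM reports 18 instead of 6; COUNT(DISTINCT ...) still gives 2.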