LEFT JOIN discarding left rows in results? - google-bigquery

Simplifying my issue, let's say I have two tables:
"Users" storing user_id and event_date from users who access each day.
"Purchases" storing user_id, event_date and product_id from users who make purchases each day.
I need to get from all users, their respective product purchases, or null value for product_id if a user didn't make a purchase. For that purpose I made this query:
with all_users as (
select user_id from `my_project.my_dataset.Users`
where event_date = "2019-12-01"
)
select user_id,product_id
from all_users
left join `my_project.my_dataset.Purchases`
using(user_id)
where event_date = "2019-12-01"
But this query returns only the user_ids who made purchases; in other words, there are rows in the LEFT from_item (all_users) that are being omitted from the result.
Is this working as expected? I read that LEFT JOIN always retains all rows of the left from_item.
EDIT 1:
Adding some screenshots:
This is the full query detailed before, but with real names (table "Users" is "user_metrics_daily" and table "Purchases" is "virtual_currency_daily"). As you can see, I added the count(distinct user_pseudo_id)OVER() to count how many distinct users are in the result.
On the other hand, this is a query to get the number of users I expect to have in the result (8935 users, with null values in product_id for users who don't purchase). But actually I got 2724 distinct users (the number of users who made purchases).
EDIT 2: I found a solution to my desired result, but still I don't understand what's wrong with my first query.

Your query runs as written: with using(user_id), the user_id column is not ambiguous. The problem is the WHERE clause. all_users only exposes user_id, so event_date there can only refer to my_project.my_dataset.Purchases. For users without a purchase that column is NULL after the LEFT JOIN, the comparison is not true, and those rows are filtered out, which effectively turns the LEFT JOIN into an INNER JOIN.
Move the date condition on Purchases into the ON clause, and say explicitly which table each projected column comes from. In your case, user_id from all_users and product_id from my_project.my_dataset.Purchases.
with all_users as (
select user_id from `my_project.my_dataset.Users`
where event_date = "2019-12-01"
)
select
a.user_id,
p.product_id
from all_users as a
left join `my_project.my_dataset.Purchases` as p
on a.user_id = p.user_id
and p.event_date = "2019-12-01"
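To see how a filter on the right-hand table behaves in WHERE versus ON, here is a minimal sketch. It assumes SQLite through Python's sqlite3 module instead of BigQuery, and the two tiny tables and their rows are made up for illustration; the LEFT JOIN semantics shown are standard SQL.

```python
import sqlite3

# Hypothetical miniature versions of the Users and Purchases tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Users (user_id INTEGER, event_date TEXT);
    CREATE TABLE Purchases (user_id INTEGER, event_date TEXT, product_id TEXT);
    INSERT INTO Users VALUES (1, '2019-12-01'), (2, '2019-12-01');
    INSERT INTO Purchases VALUES (1, '2019-12-01', 'sword');
""")

# Filter on the right-hand table in WHERE: for user 2, p.event_date is NULL
# after the join, so the comparison fails and the row is dropped.
where_rows = conn.execute("""
    SELECT u.user_id, p.product_id
    FROM Users AS u
    LEFT JOIN Purchases AS p ON u.user_id = p.user_id
    WHERE u.event_date = '2019-12-01'
      AND p.event_date = '2019-12-01'
    ORDER BY u.user_id
""").fetchall()

# Same filter moved into ON: it only restricts which purchases can match,
# so every user from the left side is kept.
on_rows = conn.execute("""
    SELECT u.user_id, p.product_id
    FROM Users AS u
    LEFT JOIN Purchases AS p
      ON u.user_id = p.user_id
     AND p.event_date = '2019-12-01'
    WHERE u.event_date = '2019-12-01'
    ORDER BY u.user_id
""").fetchall()

print(where_rows)  # [(1, 'sword')]
print(on_rows)     # [(1, 'sword'), (2, None)]
```

The second result has the NULL product_id row for the user without purchases, which is the shape the question asks for.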

Related

Trying to count the number of occurrences that 3 columns from 2 tables have on my organizations table? I need the occurrences joined in one table

-- 2. In one table, show how many private topics, admins, and standard users each organization has.
SELECT organizations.name, COUNT(topics.privacy) AS private_topic, COUNT(users.type) AS user_admin, COUNT(users.type) AS user_standard
FROM organizations
LEFT JOIN topics
ON organizations.id=topics.org_id
AND topics.privacy='private'
LEFT JOIN users
ON users.org_id=organizations.id
AND users.type='admin'
LEFT JOIN users
ON users.org_id=organizations.id
AND users.type='standard'
GROUP BY organizations.name
;
org_id is the foreign key that relates both the users table and the topics table to organizations. It keeps giving me the wrong result: it counts only either the admins or the standard users and puts that number in every row of each column. Any help is really appreciated, as I have been stuck on this for a while now!
So, I am getting an error when I do as you said: the users table cannot be specified more than once. I updated the code the way you said to write it, but still nothing. They don't give me any sample data either, but I ran some queries and saw, for example, how many private topics there are (this is in the privacy column of the topics table). When I don't get this error, the joins seem to overwrite themselves: each row has the same value in all columns, taken from the last join.
It appears to me that topics and users have no relationship. You're just trying to get the result together in a single query. There are other and possibly better ways to accomplish that but I think this will fix what you've got already (assuming you have id columns for each table.)
SELECT
organizations.name,
COUNT(DISTINCT topics.id) AS private_topic,
COUNT(DISTINCT users.id) FILTER (WHERE users.type = 'admin') AS user_admin,
COUNT(DISTINCT users.id) FILTER (WHERE users.type = 'standard') AS user_standard
FROM organizations
LEFT JOIN topics
ON organizations.id = topics.org_id AND topics.privacy = 'private'
LEFT JOIN users
ON users.org_id = organizations.id
GROUP BY organizations.name;
I propose this as a more straightforward way:
SELECT
min(o.name) as "name",
(
select count(*) from topics t
where t.org_id = o.id AND t.privacy = 'private'
) as private_topics,
(
select count(*) from users u
where u.org_id = o.id and u.type = 'admin'
) AS user_admin,
(
select count(*) from users u
where u.org_id = o.id and u.type = 'standard'
) AS user_standard
FROM organizations o
GROUP BY o.id;
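Why the plain COUNT over two joins goes wrong can be shown with a small sketch. It assumes SQLite through Python's sqlite3 module and invented sample rows; the table and column names follow the question. Because SQLite versions differ in FILTER support, the sketch swaps in COUNT(DISTINCT CASE ...) for the FILTER clause, which counts the same thing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE organizations (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE topics (id INTEGER PRIMARY KEY, org_id INTEGER, privacy TEXT);
    CREATE TABLE users (id INTEGER PRIMARY KEY, org_id INTEGER, type TEXT);
    -- Made-up sample data: Acme has 2 private topics, 1 admin, 3 standard users.
    INSERT INTO organizations VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO topics VALUES (1, 1, 'private'), (2, 1, 'private'), (3, 1, 'public');
    INSERT INTO users VALUES (1, 1, 'admin'), (2, 1, 'standard'),
                             (3, 1, 'standard'), (4, 1, 'standard'),
                             (5, 2, 'admin');
""")

# Joining two unrelated child tables multiplies rows per organization
# (2 topics x 4 users = 8 rows for Acme), so plain COUNT over-counts.
naive = conn.execute("""
    SELECT o.name, COUNT(t.id), COUNT(u.id)
    FROM organizations AS o
    LEFT JOIN topics AS t ON t.org_id = o.id AND t.privacy = 'private'
    LEFT JOIN users AS u ON u.org_id = o.id
    GROUP BY o.id, o.name
    ORDER BY o.id
""").fetchall()

# COUNT(DISTINCT ...) collapses the duplicates introduced by the fan-out.
fixed = conn.execute("""
    SELECT o.name,
           COUNT(DISTINCT t.id) AS private_topic,
           COUNT(DISTINCT CASE WHEN u.type = 'admin' THEN u.id END) AS user_admin,
           COUNT(DISTINCT CASE WHEN u.type = 'standard' THEN u.id END) AS user_standard
    FROM organizations AS o
    LEFT JOIN topics AS t ON t.org_id = o.id AND t.privacy = 'private'
    LEFT JOIN users AS u ON u.org_id = o.id
    GROUP BY o.id, o.name
    ORDER BY o.id
""").fetchall()

print(naive)  # [('Acme', 8, 8), ('Globex', 0, 1)]
print(fixed)  # [('Acme', 2, 1, 3), ('Globex', 0, 1, 0)]
```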

Count number of posts of each user - SQL

I need to get the number of posts each user has created.
This is the structure of both tables (users, microposts).
Microposts
id
user_id
content
created_at
Users
id
name
email
admin
SELECT users.*, count( microposts.user_id )
FROM microposts LEFT JOIN users ON users.id=microposts.user_id
GROUP BY microposts.user_id
This gets me only the users that have posts. I need to get all users, even if they have 0 posts
You have the join in the wrong order.
In a LEFT JOIN you ensure you keep all the records in the table written first (to the left).
So, join in the other order (users first/left), and then group by the user table's id, and not the microposts table's user_id...
SELECT users.*, count( microposts.user_id )
FROM users LEFT JOIN microposts ON users.id=microposts.user_id
GROUP BY users.id
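The effect of the join order can be checked with a small sketch, assuming SQLite through Python's sqlite3 module and two made-up users, only one of whom has posts. The tables are trimmed to the columns that matter here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE microposts (id INTEGER PRIMARY KEY, user_id INTEGER, content TEXT);
    -- Hypothetical data: ann has two posts, bob has none.
    INSERT INTO users VALUES (1, 'ann'), (2, 'bob');
    INSERT INTO microposts VALUES (1, 1, 'hi'), (2, 1, 'yo');
""")

# microposts on the left: only users that appear in microposts survive.
posts_first = conn.execute("""
    SELECT microposts.user_id, COUNT(microposts.user_id)
    FROM microposts LEFT JOIN users ON users.id = microposts.user_id
    GROUP BY microposts.user_id
    ORDER BY microposts.user_id
""").fetchall()

# users on the left: every user is kept, and COUNT of the NULL-padded
# microposts.user_id yields 0 for users without posts.
users_first = conn.execute("""
    SELECT users.id, users.name, COUNT(microposts.user_id)
    FROM users LEFT JOIN microposts ON users.id = microposts.user_id
    GROUP BY users.id, users.name
    ORDER BY users.id
""").fetchall()

print(posts_first)  # [(1, 2)]
print(users_first)  # [(1, 'ann', 2), (2, 'bob', 0)]
```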

How to get the number of users grouped by the number of comments they've made?

I'd like to get the count of users grouped by the number of comments they've made.
[User]: ID
[Comment]: ID, UserID
So if user A has made 1 comment, user B has made 1 comment and user C has made 2 comments, then the output would be:
0 comments => 0 users
1 comment => 2 users (A+B)
2 comments => 1 user (C)
How would you query this?
It will depend on your specific database structure, but let's say you have a users table and a comments table:
users table:
id: serial
name: text
comments table:
id: serial
user_id: integer (foreign key to the users table)
comment: text
You can count the number of comments each user has made with this query:
SELECT users.id, users.name, count(comments.id) as comment_cnt
FROM users LEFT JOIN
comments ON users.id = comments.user_id
GROUP BY users.id, users.name
You can then use the results of this query in a nested query to count the number of users for each number of comments:
SELECT comment_cnt, count(*) FROM
(SELECT users.id, users.name, count(comments.id) as comment_cnt
FROM users LEFT JOIN
comments ON users.id = comments.user_id
GROUP BY users.id, users.name) AS comment_cnts
GROUP BY comment_cnt;
I don't know of any elegant way to fill the gaps where there are zero users for a given number of comments, but a temporary table and another level of nesting works:
CREATE TABLE wholenumbers (num integer);
INSERT INTO wholenumbers VALUES (0), (1), (2), (3), (4), (5), (6);
SELECT num as comment_cnt, COALESCE(user_cnt,0) as user_cnt
FROM wholenumbers
LEFT JOIN (SELECT comment_cnt, count(*) AS user_cnt
FROM ( SELECT users.id, users.name, count(comments.id) AS comment_cnt
FROM users LEFT JOIN comments ON users.id = comments.user_id
GROUP BY users.id, users.name) AS comment_cnts
GROUP BY comment_cnt) AS user_cnts ON wholenumbers.num = user_cnts.comment_cnt
ORDER BY num;
Building on the table layout @ClaytonC provided:
WITH cte AS (
SELECT msg_ct, count(*) AS users
FROM (
SELECT count(*) AS msg_ct
FROM comments
GROUP BY user_id
) sub
GROUP BY 1
)
SELECT msg_ct, COALESCE(users, 0) AS users
FROM generate_series(0, (SELECT max(msg_ct) FROM cte)) msg_ct
LEFT JOIN cte USING (msg_ct)
ORDER BY 1;
Major points
First, count comments per user (msg_ct). As long as referential integrity is enforced by a foreign key, you do not need to join to the users table at all to aggregate comments per user. Just count rows in comments.
Next, count users per message count (users).
I am doing this in a CTE, because I use the derived table twice in the final query.
First for generate_series() to generate all counts from min to max dynamically, including gaps.
Then for the table to LEFT JOIN to and get the final result.
The count starts with 0 (after my update). If you want to have it start with the smallest actual msg_ct, consider the first draft of my answer in the edit history.
Closely related answer explaining the basics:
Select all integers that are not already in table in postgres
Count users without comments
As @ClaytonC commented, the above answer does not include users without comments.
To fix this (if you actually need it), either LEFT JOIN to users right at the start after all:
WITH cte AS (
SELECT msg_ct, count(*) AS users
FROM (
SELECT count(c.user_id) AS msg_ct
FROM users u
LEFT JOIN comments c ON c.user_id = u.id
GROUP BY u.id
) sub
GROUP BY 1
)
SELECT ...
Or, since the join is just for finding users without comments, we might get away cheaper: Count all users and subtract users with comments (which we processed anyway):
WITH cte AS (
SELECT msg_ct, count(*)::int AS users
FROM (
SELECT count(*)::int AS msg_ct
FROM comments
GROUP BY user_id
) sub
GROUP BY 1
)
, agg AS (
SELECT max(msg_ct) AS max_ct -- maximum for generate_series
,((SELECT count(*) FROM users) - sum(users))::int AS users
-- quiet rest with 0 comments
FROM cte
)
SELECT 0 AS msg_ct, users FROM agg -- users with 0 comments
UNION ALL
SELECT msg_ct, COALESCE(users, 0)
FROM (SELECT generate_series(1, max_ct) AS msg_ct FROM agg) g
LEFT JOIN cte USING (msg_ct)
ORDER BY 1;
The query gets a bit more complex, but it might be faster for big tables. Not sure. Test with EXPLAIN ANALYZE (I would be grateful for a comment with the results.)
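For readers without PostgreSQL at hand, the same histogram idea can be sketched in SQLite through Python's sqlite3 module. This is an adaptation, not the query above: generate_series() is replaced by a recursive CTE (named series here, a hypothetical name), the result column is called user_ct, and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE comments (id INTEGER PRIMARY KEY, user_id INTEGER, comment TEXT);
    -- Made-up data: A and B have 1 comment each, C has 3, D has none.
    INSERT INTO users VALUES (1, 'A'), (2, 'B'), (3, 'C'), (4, 'D');
    INSERT INTO comments (user_id, comment) VALUES
        (1, 'x'), (2, 'x'), (3, 'x'), (3, 'y'), (3, 'z');
""")

histogram = conn.execute("""
    WITH RECURSIVE
    per_user AS (              -- comments per user, 0 for users without any
        SELECT u.id, COUNT(c.id) AS msg_ct
        FROM users AS u
        LEFT JOIN comments AS c ON c.user_id = u.id
        GROUP BY u.id
    ),
    hist AS (                  -- users per comment count
        SELECT msg_ct, COUNT(*) AS user_ct
        FROM per_user
        GROUP BY msg_ct
    ),
    series(msg_ct) AS (        -- stand-in for generate_series(0, max)
        SELECT 0
        UNION ALL
        SELECT msg_ct + 1 FROM series
        WHERE msg_ct < (SELECT MAX(msg_ct) FROM hist)
    )
    SELECT s.msg_ct, COALESCE(h.user_ct, 0)
    FROM series AS s
    LEFT JOIN hist AS h ON h.msg_ct = s.msg_ct
    ORDER BY s.msg_ct
""").fetchall()

print(histogram)  # [(0, 1), (1, 2), (2, 0), (3, 1)]
```

The row (2, 0) is the gap that the LEFT JOIN from the number series fills in; without the series it would simply be missing from the output.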

SQL to gather data from one table while counting records in another

I have a users table and a songs table, I want to select all the users in the users table while counting how many songs they have in the songs table. I have this SQL but it doesn't work, can someone spot what i'm doing wrong?
SELECT jos_mfs_users.*, COUNT(jos_mfs_songs.id) as song_count
FROM jos_mfs_users
INNER JOIN jos_mfs_songs
ON jos_mfs_songs.artist=jos_mfs_users.id
Help is much appreciated. Thanks!
The inner join won't work, because it joins every matching row in the songs table with the users table.
SELECT jos_mfs_users.*,
(SELECT COUNT(jos_mfs_songs.id)
FROM jos_mfs_songs
WHERE jos_mfs_songs.artist=jos_mfs_users.id) as song_count
FROM jos_mfs_users
WHERE (SELECT COUNT(jos_mfs_songs.id)
FROM jos_mfs_songs
WHERE jos_mfs_songs.artist=jos_mfs_users.id) > 10
There's a GROUP BY clause missing, e.g.
SELECT jos_mfs_users.id, COUNT(jos_mfs_songs.id) as song_count
FROM jos_mfs_users
INNER JOIN jos_mfs_songs
ON jos_mfs_songs.artist=jos_mfs_users.id
GROUP BY jos_mfs_users.id
If you want to add more columns from jos_mfs_users in the select list, you should add them in the GROUP BY clause as well.
Changes:
Don't do SELECT *...specify your fields. I included ID and NAME, you can add more as needed but put them in the GROUP BY as well
Changed to a LEFT JOIN - INNER JOIN won't list any users that have no songs
Added the GROUP BY so it gives a valid count and is valid syntax
SELECT u.id, u.name, COUNT(s.id) AS song_count
FROM jos_mfs_users AS u
LEFT JOIN jos_mfs_songs AS s
ON s.artist = u.id
GROUP BY u.id, u.name
Try
SELECT
*,
(SELECT COUNT(*) FROM jos_mfs_songs as songs WHERE songs.artist=users.id) as song_count
FROM
jos_mfs_users as users
This seems like a many to many relationship. By that I mean it looks like there can be several records in the users table for each user, one of each song they have.
I would have three tables.
Users, which has one record for each user
Songs, which has one record for each song
USER_SONGS, which has one record for each user/song combination
Now, you can do a count of the songs each user has by doing a query on the intermediate table. You can also find out how many users have a particular song.
This will tell you how many songs each user has
select id, count(*) from USER_SONGS
GROUP BY id;
This will tell you how many users each song has
select artist, count(*) from USER_SONGS
GROUP BY artist;
I'm sure you will need to tweak this for your needs, but it may give you the type of results you are looking for.
You can also join either of these queries to the other two tables to find the user name, and/or artist name.
HTH
Harv Sather
ps I am not sure if you are looking for song counts or artist counts.
You need a GROUP BY clause to use aggregate functions (like COUNT(), for example)
So, assuming that jos_mfs_users.id is a primary key, something like this will work:
SELECT jos_mfs_users.*, COUNT( jos_mfs_users.id ) as song_count
FROM jos_mfs_users
INNER JOIN jos_mfs_songs
ON jos_mfs_songs.artist = jos_mfs_users.id
GROUP BY jos_mfs_users.id
Notice that
since you are grouping by user id, you will get one result per distinct user id in the results
the thing you need to COUNT() is the number of rows that are being grouped (in this case the number of results per user)
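If the INNER JOIN is swapped for a LEFT JOIN so that users without songs are kept, which expression is counted starts to matter. A minimal sketch, assuming SQLite through Python's sqlite3 module, tables trimmed to the relevant columns, and made-up rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE jos_mfs_users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE jos_mfs_songs (id INTEGER PRIMARY KEY, artist INTEGER);
    -- Hypothetical data: ann has two songs, bob has none.
    INSERT INTO jos_mfs_users VALUES (1, 'ann'), (2, 'bob');
    INSERT INTO jos_mfs_songs VALUES (10, 1), (11, 1);
""")

# COUNT(*) counts joined rows, including the single NULL-padded row that
# the LEFT JOIN produces for bob; COUNT(s.id) skips NULLs and gives 0.
rows = conn.execute("""
    SELECT u.id, u.name,
           COUNT(*)    AS joined_rows,
           COUNT(s.id) AS song_count
    FROM jos_mfs_users AS u
    LEFT JOIN jos_mfs_songs AS s ON s.artist = u.id
    GROUP BY u.id, u.name
    ORDER BY u.id
""").fetchall()

print(rows)  # [(1, 'ann', 2, 2), (2, 'bob', 1, 0)]
```

Counting a column from the songs table is therefore the safer choice once the join becomes a LEFT JOIN.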

Problem With DISTINCT!

Here is my query:
SELECT
DISTINCT `c`.`user_id`,
`c`.`created_at`,
`c`.`body`,
(SELECT COUNT(*) FROM profiles_comments c2 WHERE c2.user_id = c.user_id AND c2.profile_id = 1) AS `comments_count`,
`u`.`username`,
`u`.`avatar_path`
FROM `profiles_comments` AS `c` INNER JOIN `users` AS `u` ON u.id = c.user_id
WHERE (c.profile_id = 1) ORDER BY `u`.`id` DESC;
It works. The problem though is with the DISTINCT word. As I understand it, it should select only one row per c.user_id.
But what I get is even 4-5 rows with the same c.user_id column. Where is the problem?
Actually, DISTINCT does not limit itself to one column. Basically, when you say:
SELECT DISTINCT a, b
What you're saying is, "give me the distinct combinations of a and b", just like a multi-column UNIQUE index.
DISTINCT ensures that each combination of ALL the values in your select clause is unique, not just user_id. If you want to limit the results to individual user_ids, you should group by user_id.
Perhaps what you want is:
SELECT
`c`.`user_id`,
`u`.`username`,
`u`.`avatar_path`,
(SELECT COUNT(*) FROM profiles_comments c2 WHERE c2.user_id = c.user_id AND c2.profile_id = 1) AS `comments_count`
FROM `profiles_comments` AS `c` INNER JOIN `users` AS `u` ON u.id = c.user_id
WHERE (c.profile_id = 1)
GROUP BY `c`.`user_id`,
`u`.`username`,
`u`.`avatar_path`
ORDER BY `u`.`id` DESC;
DISTINCT works at a row level, not just a column level
If you want the DISTINCT of only one column, then you will have to group by that column and aggregate the rest of the columns returned (MIN, MAX, SUM, AVG, etc.)
SELECT Name, MIN(ID)
FROM MyTable
GROUP BY Name
Distinct will try to return only unique rows, it will not return only 1 row per user id in your example.
http://dev.mysql.com/doc/refman/5.0/en/distinct-optimization.html
You misunderstand. The DISTINCT modifier applies to the entire row — it states that no two identical ROWS will be returned in the result set.
Looking at your SQL, what value of the several available do you expect to see returned in the created_at column (for instance)? It would be impossible to predict the results of the query as written.
Also, you're using profiles_comments twice in your SELECT. It appears that you're trying to obtain a count of how many times each user has commented. If so, what you want to do is use an AGGREGATE query, grouped on user_id and including only those columns that uniquely identify a user along with a COUNT of the comments:
SELECT user_id, COUNT(*) FROM profiles_comments WHERE profile_id = 1 GROUP BY user_id
You can add the join to users to get the user name if you want but, logically, your result set cannot include other columns from profiles_comments and still produce only a single row per user_id unless those columns are also aggregated in some way:
SELECT user_id, MIN(created_at) AS Earliest, MAX(created_at) AS Latest, COUNT(*) FROM profiles_comments WHERE profile_id = 1 GROUP BY user_id
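The difference between DISTINCT and GROUP BY discussed in these answers can be reproduced with a small sketch. It assumes SQLite through Python's sqlite3 module, a cut-down profiles_comments table, and invented rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE profiles_comments (
        user_id INTEGER, profile_id INTEGER, body TEXT, created_at TEXT
    );
    -- Made-up data: user 1 commented twice on profile 1, user 2 once.
    INSERT INTO profiles_comments VALUES
        (1, 1, 'a', '2020-01-01'),
        (1, 1, 'b', '2020-01-02'),
        (2, 1, 'c', '2020-01-03');
""")

# DISTINCT compares whole result rows, so user 1 still appears twice
# because the two rows differ in body.
distinct_rows = conn.execute("""
    SELECT DISTINCT user_id, body
    FROM profiles_comments
    WHERE profile_id = 1
    ORDER BY user_id, body
""").fetchall()

# GROUP BY user_id yields one row per user; every other column must be
# aggregated to fit into that single row.
grouped_rows = conn.execute("""
    SELECT user_id, MIN(created_at), MAX(created_at), COUNT(*)
    FROM profiles_comments
    WHERE profile_id = 1
    GROUP BY user_id
    ORDER BY user_id
""").fetchall()

print(distinct_rows)  # [(1, 'a'), (1, 'b'), (2, 'c')]
print(grouped_rows)   # [(1, '2020-01-01', '2020-01-02', 2), (2, '2020-01-03', '2020-01-03', 1)]
```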