BigQuery: Group query results in arrays

I have a table that lists friends from a particular user:
user_id | friend_name
1 | JOEL
1 | JACK
2 | MARIA
I want the rows grouped by user_id, so that each row has an array with all of that user's friends.
How can I write a query that does this transformation?
UPDATE:
select user_id, array_agg(friend_name) as friends
from your_table
group by user_id
Works fine.
However I forgot a small detail, the table has another column.
user_id | friend_name | friend_age
1 | JOEL | 21
1 | JACK | 30
2 | MARIA | 25
My solution was to add another array_agg:
select user_id, array_agg(friend_name) as friends, array_agg(friend_age) as ages
from your_table
group by user_id
I believe it works; the only problem is when the age is NULL. In that case I need to add a CASE WHEN clause:
select user_id, array_agg(friend_name) as friends,
array_agg(CASE WHEN friend_age IS NULL THEN 0 ELSE friend_age END) as ages
from your_table
group by user_id

select user_id, array_agg(friend_name) as friends
from your_table
group by user_id
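To make the target shape concrete, here is a minimal in-memory sketch in Python. It is not BigQuery code; it only mimics what grouping into arrays with a NULL fallback produces. The function name group_friends and the NULL age for JACK are made up for illustration.

```python
from collections import defaultdict

# Sample rows shaped like the question's table: (user_id, friend_name, friend_age).
# None stands in for SQL NULL; JACK's missing age is an assumption for the demo.
rows = [
    (1, "JOEL", 21),
    (1, "JACK", None),
    (2, "MARIA", 25),
]


def group_friends(rows, default_age=0):
    """Collect each user's friends into one list of (name, age) pairs."""
    grouped = defaultdict(list)
    for user_id, friend_name, friend_age in rows:
        # Same idea as CASE WHEN friend_age IS NULL THEN 0 ELSE friend_age END.
        age = default_age if friend_age is None else friend_age
        grouped[user_id].append((friend_name, age))
    return dict(grouped)


friends_by_user = group_friends(rows)
print(friends_by_user)
# {1: [('JOEL', 21), ('JACK', 0)], 2: [('MARIA', 25)]}
```

Keeping the name and age together in one pair per friend avoids relying on two separate arrays staying in the same order.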

Related

SUM CASE when DISTINCT?

Joining two tables and grouping, we're trying to get the sum of a user's value but only include a user's value once if that user is represented in a grouping multiple times.
Some sample tables:
user table:
| id | net_worth |
------------------
| 1 | 100 |
| 2 | 1000 |
visit table:
| id | location | user_id |
-----------------------------
| 1 | mcdonalds | 1 |
| 2 | mcdonalds | 1 |
| 3 | mcdonalds | 2 |
| 4 | subway | 1 |
We want to find the total net worth of users visiting each location. User 1 visited McDonalds twice, but we don't want to double count their net worth. Ideally we can use a SUM but only add in the net worth value if that user hasn't already been counted for at that location. Something like this:
-- NOTE: Hypothetical query
SELECT
location,
SUM(CASE WHEN DISTINCT user.id then user.net_worth ELSE 0 END) as total_net_worth
FROM visit
JOIN user on user.id = visit.user_id
GROUP BY 1;
The ideal output being:
| location | total_net_worth |
-------------------------------
| mcdonalds | 1100 |
| subway | 100 |
This particular database is Redshift/PostgreSQL, but it would be interesting if there is a generic SQL solution. Is something like the above possible?
You don't want to consider duplicate entries in the visits table. So, select distinct rows from the table instead.
SELECT
v.location,
SUM(u.net_worth) as total_net_worth
FROM (SELECT DISTINCT location, user_id FROM visit) v
JOIN "user" u ON u.id = v.user_id
GROUP BY v.location
ORDER BY v.location;
You can use a window function to get the unique users, then join that to the user table:
select v.location, sum(u.net_worth)
from "user" u
join (
select location, user_id,
row_number() over (partition by location, user_id order by id) as rn
from visit
) v on v.user_id = u.id and v.rn = 1
group by v.location;
The above is standard ANSI SQL. In Postgres this can also be expressed using distinct on ():
select v.location, sum(u.net_worth)
from "user" u
join (
select distinct on (user_id, location) *
from visit
order by user_id, location, id
) v on v.user_id = u.id
group by v.location;
You can join the user table with the distinct combinations of location and user_id, as in the generic SQL below.
SELECT v.location, SUM(u.net_worth)
FROM (SELECT location, user_id FROM visit GROUP BY location, user_id) v
JOIN "user" u ON u.id = v.user_id
GROUP BY v.location;
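If you want to try the de-duplicating subquery without a Redshift or PostgreSQL instance, here is a small sketch that runs it against the question's sample data in an in-memory SQLite database through Python's sqlite3 module. SQLite is only a stand-in here; the table and column names are taken from the question.

```python
import sqlite3

# In-memory database loaded with the sample rows from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE "user" (id INTEGER PRIMARY KEY, net_worth INTEGER);
    CREATE TABLE visit (id INTEGER PRIMARY KEY, location TEXT, user_id INTEGER);
    INSERT INTO "user" VALUES (1, 100), (2, 1000);
    INSERT INTO visit VALUES
        (1, 'mcdonalds', 1), (2, 'mcdonalds', 1),
        (3, 'mcdonalds', 2), (4, 'subway', 1);
""")

# De-duplicate (location, user_id) pairs first, then sum each user's net worth once.
totals = conn.execute("""
    SELECT v.location, SUM(u.net_worth) AS total_net_worth
    FROM (SELECT DISTINCT location, user_id FROM visit) v
    JOIN "user" u ON u.id = v.user_id
    GROUP BY v.location
    ORDER BY v.location
""").fetchall()

print(totals)
# [('mcdonalds', 1100), ('subway', 100)]
```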

Identifying duplicate records in SQL along with primary key

I have a business case where I need to do a lookup in our SQL "Users" table to find email addresses that are duplicated. I was able to do that with the query below:
SELECT
user_email, COUNT(*) as DuplicateEmails
FROM
Users
GROUP BY
user_email
HAVING
COUNT(*) > 1
ORDER BY
DuplicateEmails DESC
I get an output like this:
user_email DuplicateEmails
--------------------------------
abc@gmail.com 2
xyz@yahoo.com 3
Now I am asked to list each duplicate record in a row of its own and display some additional properties such as first name, last name and UserID. All this information is stored in the same "Users" table. I am having difficulty doing so. Can anyone help me or point me in the right direction?
My output needs to look like this:
user_email DuplicateEmails FirstName LastName UserID
------------------------------------------------------------------------------
abc@gmail.com 2 Tim Lentil timLentil
abc@gmail.com 2 John Doe johnDoe12
xyz@yahoo.com 3 brian boss brianTheBoss
xyz@yahoo.com 3 Thomas Hood tHood
xyz@yahoo.com 3 Mark Brown MBrown12
There are several ways you could do this. Here is one using a cte.
with FoundDuplicates as
(
SELECT
user_email, COUNT(*) as DuplicateEmails
FROM
Users
GROUP BY
user_email
HAVING
COUNT(*) > 1
)
select fd.user_email
, fd.DuplicateEmails
, u.FirstName
, u.LastName
, u.UserID
from Users u
join FoundDuplicates fd on fd.user_email = u.user_email
ORDER BY fd.DuplicateEmails DESC
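As a quick check of the CTE approach, the sketch below runs it in an in-memory SQLite database via Python's sqlite3 module. SQLite is just a convenient stand-in for your server, and the third row (soloUser, with a unique address) is a made-up example added to show that non-duplicates are filtered out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Users (UserID TEXT, FirstName TEXT, LastName TEXT, user_email TEXT);
    INSERT INTO Users VALUES
        ('timLentil', 'Tim', 'Lentil', 'abc@gmail.com'),
        ('johnDoe12', 'John', 'Doe', 'abc@gmail.com'),
        ('soloUser', 'Sam', 'Solo', 'only@example.com');  -- hypothetical non-duplicate
""")

# Find duplicated emails once, then join back to Users to list every matching row.
duplicates = conn.execute("""
    WITH FoundDuplicates AS (
        SELECT user_email, COUNT(*) AS DuplicateEmails
        FROM Users
        GROUP BY user_email
        HAVING COUNT(*) > 1
    )
    SELECT fd.user_email, fd.DuplicateEmails, u.UserID
    FROM Users u
    JOIN FoundDuplicates fd ON fd.user_email = u.user_email
    ORDER BY u.UserID
""").fetchall()

print(duplicates)
# [('abc@gmail.com', 2, 'johnDoe12'), ('abc@gmail.com', 2, 'timLentil')]
```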
Use COUNT() OVER (PARTITION BY ...). For example, you can solve it like this:
DECLARE @T TABLE
(
UserID VARCHAR(20),
FirstName NVARCHAR(45),
LastName NVARCHAR(45),
UserMail VARCHAR(45)
);
INSERT INTO @T (UserMail, FirstName, LastName, UserID) VALUES
('abc@gmail.com', 'Tim', 'Lentil', 'timLentil'),
('abc@gmail.com', 'John', 'Doe', 'johnDoe12'),
('xyz@yahoo.com', 'brian', 'boss', 'brianTheBoss'),
('xyz@yahoo.com', 'Thomas', 'Hood', 'tHood'),
('xyz@yahoo.com', 'Mark', 'Brown', 'MBrown12');
SELECT *, COUNT(1) OVER (PARTITION BY UserMail) MailCount
FROM @T;
Results:
+--------------+-----------+----------+---------------+-----------+
| UserID | FirstName | LastName | UserMail | MailCount |
+--------------+-----------+----------+---------------+-----------+
| timLentil | Tim | Lentil | abc@gmail.com | 2 |
| johnDoe12 | John | Doe | abc@gmail.com | 2 |
| brianTheBoss | brian | boss | xyz@yahoo.com | 3 |
| tHood | Thomas | Hood | xyz@yahoo.com | 3 |
| MBrown12 | Mark | Brown | xyz@yahoo.com | 3 |
+--------------+-----------+----------+---------------+-----------+
Use a window function like this:
SELECT u.*
FROM (SELECT u.*, COUNT(*) OVER (PARTITION BY user_email) as numDuplicateEmails
FROM Users u
) u
WHERE numDuplicateEmails > 1
ORDER BY numDuplicateEmails DESC;
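I think this will also work. Note that ROW_NUMBER() > 1 returns only the second and later rows of each duplicated email, not the first one.
WITH cte AS (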
SELECT
*
,DuplicateEmails = ROW_NUMBER() OVER (Partition BY user_email ORder by user_email)
FROM Users
)
Select * from CTE
where DuplicateEmails > 1

SQL Query find users with only one product type

I solemnly swear I did my best to find an existing question, but maybe I'm not sure how to phrase it correctly.
I would like to return records for users that have quota for only one product type.
| user_id | product |
| 1 | A |
| 1 | B |
| 1 | C |
| 2 | B |
| 3 | B |
| 3 | C |
| 3 | D |
In the example above I'd like a query that only returns users who carry quota for only one product type - doesn't really matter which product at this point.
I tried using select user_id, product from table group by 1,2 having count(user) < 2, but this does not work. Neither does select user_id, product from table group by 1,2 having count(*) < 2.
Any help is appreciated.
Your having clause is good; the issue's with your group by. Try this:
select user_id
, count(distinct product) NumberOfProducts
from table
group by user_id
having count(distinct product) = 1
Or you could do this; which is closer to your original:
select user_id
from table
group by user_id
having count(*) < 2
Whether the group by clause accepts ordinal arguments (as the order by clause does) depends on the database. Where it does (PostgreSQL and MySQL, for example), group by 1,2 groups by both user_id and product, so each group is a single user/product pair and the having condition cannot tell the users apart. Where it does not, 1 is not a reference to the first column at all, so the query is not grouping by user either.
Instead, you should group by the user_id:
SELECT user_id
FROM mytable
GROUP BY user_id
HAVING COUNT(*) = 1
If you want the product, then do:
select user_id, max(product) as product
from table
group by user_id
having min(product) = max(product);
The having clause could also be:
having count(distinct product) = 1
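Here is a small runnable check of the count(distinct product) = 1 variant against the question's sample data, using an in-memory SQLite database from Python. The table name quota is an assumption, since the question only calls it "table".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE quota (user_id INTEGER, product TEXT);  -- hypothetical table name
    INSERT INTO quota VALUES
        (1, 'A'), (1, 'B'), (1, 'C'),
        (2, 'B'),
        (3, 'B'), (3, 'C'), (3, 'D');
""")

# Group by user only; keep users whose rows all carry the same single product.
single_product_users = conn.execute("""
    SELECT user_id, MAX(product) AS product
    FROM quota
    GROUP BY user_id
    HAVING COUNT(DISTINCT product) = 1
""").fetchall()

print(single_product_users)
# [(2, 'B')]
```

Unlike count(*) = 1, the distinct count still works if a user has the same product listed on several rows.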

PostgreSQL - MAX value for every user

I have a table
User | Phone | Value
Peter | 0 | 1
Peter | 456 | 2
Peter | 456 | 3
Paul | 456 | 7
Paul | 789 | 10
I want to select the MAX value for every user that is also lower than a threshold.
For threshold 8, I want the result to be
Peter | 456 | 3
Paul | 456 | 7
I have tried the GROUP BY with HAVING, but I am getting
column "phone" must appear in the GROUP BY clause or be used in an aggregate function
Similar query logic works in MySQL, but I am not quite sure how to use GROUP BY in PostgreSQL. I don't want to GROUP BY phone.
After getting results from juergen d's solution, I came up with this, which gives me the same results faster:
SELECT DISTINCT ON ("user") "user", phone, value
FROM table
WHERE value < 8
ORDER BY "user", value DESC;
select t1.*
from your_table t1
join
(
select "user", max(value) as max_value
from your_table
where value < 8
group by "user"
) t2 on t1."user" = t2."user" and t1.value = t2.max_value
Alternatively, you could use a ranking function:
select * from
(
select *, RANK() OVER (partition by "user" ORDER BY t.value desc) as value_rank from test_table as t
where t.value < 8
) as t1
where value_rank = 1
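To see the join-on-max approach produce the expected rows, here is a sketch that runs it on the question's sample data in an in-memory SQLite database from Python. SQLite stands in for PostgreSQL here, and the table name readings is made up for the example. The "user" column is quoted throughout, as it would need to be in PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE readings ("user" TEXT, phone INTEGER, value INTEGER);  -- hypothetical name
    INSERT INTO readings VALUES
        ('Peter', 0, 1), ('Peter', 456, 2), ('Peter', 456, 3),
        ('Paul', 456, 7), ('Paul', 789, 10);
""")

# Inner query: highest value below the threshold per user.
# Outer join: fetch the full row (including phone) that carries that value.
best_rows = conn.execute("""
    SELECT t1."user", t1.phone, t1.value
    FROM readings t1
    JOIN (
        SELECT "user", MAX(value) AS max_value
        FROM readings
        WHERE value < 8
        GROUP BY "user"
    ) t2 ON t1."user" = t2."user" AND t1.value = t2.max_value
    ORDER BY t1."user"
""").fetchall()

print(best_rows)
# [('Paul', 456, 7), ('Peter', 456, 3)]
```

If two rows for the same user tie on the maximum value, this form returns both of them, whereas DISTINCT ON keeps only one.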

SQL group by with a count

I have a table (simplified below)
|company|name |age|
| 1 | a | 3 |
| 1 | a | 3 |
| 1 | a | 2 |
| 2 | b | 8 |
| 3 | c | 1 |
| 3 | c | 1 |
For various reasons the age column should be the same for each company. I have another process that updates this table, and sometimes it puts an incorrect age in. For company 1 the age should always be 3.
I want to find out which companies have a mismatch of age.
I've done this
select company, name, age from table group by company, name, age
but I don't know how to get the rows where the age is different. This table is a lot wider and has loads of columns, so I cannot really eyeball it.
Can anyone help?
Thanks
You should not be including age in the group by clause.
SELECT company
FROM tableName
GROUP BY company, name
HAVING COUNT(DISTINCT age) <> 1
If you want to find the row(s) whose age differs from the most frequent age of each company/name group:
WITH CTE AS
(
select company, name, age,
maxAge=(select top 1 age
from dbo.table1 t2
group by company,name, age
having( t1.company=t2.company and t1.name=t2.name)
order by count(*) desc)
from dbo.table1 t1
)
select * from cte
where age <> maxAge
If you want to update the incorrect ages with the correct ones, you just need to replace the final SELECT with an UPDATE:
WITH CTE AS
(
select company, name, age,
maxAge=(select top 1 age
from dbo.table1 t2
group by company,name, age
having( t1.company=t2.company and t1.name=t2.name)
order by count(*) desc)
from dbo.table1 t1
)
UPDATE cte SET AGE = maxAge
WHERE age <> maxAge
Since you mentioned "how to get the rows where the age is different" and not just the companies:
Add a unique row id (a primary key) if there isn't already one. Let's call it id.
Then, do
select id from table
where company in
(select company from table
group by company
having count(distinct age)>1)
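A minimal runnable version of this, using the question's sample data in an in-memory SQLite database from Python. The table name company_ages and the added id column are assumptions for the sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- hypothetical table name; id is the added primary key
    CREATE TABLE company_ages (id INTEGER PRIMARY KEY, company INTEGER, name TEXT, age INTEGER);
    INSERT INTO company_ages VALUES
        (1, 1, 'a', 3), (2, 1, 'a', 3), (3, 1, 'a', 2),
        (4, 2, 'b', 8),
        (5, 3, 'c', 1), (6, 3, 'c', 1);
""")

# Companies with more than one distinct age, then every row belonging to them.
mismatch_ids = conn.execute("""
    SELECT id
    FROM company_ages
    WHERE company IN (
        SELECT company
        FROM company_ages
        GROUP BY company
        HAVING COUNT(DISTINCT age) > 1
    )
    ORDER BY id
""").fetchall()

print(mismatch_ids)
# [(1,), (2,), (3,)]
```

This returns all rows of an inconsistent company, including the ones with the correct age, so you still need to decide which age is the right one.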