I have been given a challenge that is a bit out of my scope, so I'm just going to jump right in.
I have a sample dataset in BigQuery you can find here for testing purposes: https://bigquery.cloud.google.com/table/robotic-charmer-726:bl_test_data.complex_problem
I need to figure out the SQL to query my table and aggregate it using the following rules (I'll start with just one email address, and add the other in at the end):
As a general note up front, everything is to be made lowercase such that Ben=ben when aggregating.
Email is the broadest aggregation, and is aggregated by the lowercase version.
The amounts for all of those lowercase emails are summed, as is pictured below in blue.
First and last names are considered next, and they are selected based on the sum amount of the lowercase of the first AND last name.
Note, first or last names are NOT considered separately. See below where Ben has a sum amount of 160 and Kathleen only has a sum amount of 150, but Kathleen is still selected because her full name has a sum amount higher than any other full name.
Next the lowercase full address of the SELECTED NAME is chosen based on the highest sum amount.
Similar to the names, the full address considers all columns together.
Now I'll add in another email address, and we'll do the same thing.
Each lowercase email address is considered separately. I'm now realizing that I should have made that more clear with my pictures, but I don't want to do it all again... too much work. So I hope I have made it clear enough.
I hope you find this to be a very fun challenge!
There are probably cleaner ways of doing this, but this will give you the answer you need:
select email, first_name, last_name, address, city, state, zip, total_amount amount
from (
select d.email email, d.first_name first_name, d.last_name last_name, d.amount amount, d.total_amount total_amount, e.address address, e.city city, e.state state, e.zip zip, row_number() over (partition by e.email order by e.amount desc) ord
from (
select a.email email, a.first_name first_name, a.last_name last_name, b.amount amount, c.amount total_amount
from (
SELECT
lower(email) email, lower(first_name) first_name, lower(last_name) last_name, lower(concat(first_name, last_name)) as name_group, lower(address) address, lower(city) city, lower(state) state, lower(concat(address,city,state)) as location_group, zip, sum(amount) amount
FROM [robotic-charmer-726:bl_test_data.complex_problem]
group by 1,2,3,4,5,6,7,8,9
) a
inner join (
select email, first_name, last_name, name_group, amount
from (
select email, first_name, last_name, name_group, amount, row_number() over (partition by email order by amount desc) as ord
from (
select lower(email) email, lower(first_name) first_name, lower(last_name) last_name, lower(concat(first_name,last_name)) as name_group, sum(amount) amount
from [robotic-charmer-726:bl_test_data.complex_problem]
group by 1, 2, 3, 4
)
)
where ord = 1
) b
on a.email = b.email and a.name_group = b.name_group
inner join (
select lower(email) email, sum(amount) amount
from [robotic-charmer-726:bl_test_data.complex_problem]
group by 1
) c
on a.email = c.email
group by 1,2,3,4,5
) d
inner join (
select lower(email) email, lower(first_name) first_name, lower(last_name) last_name, lower(address) address, lower(city) city, lower(state) state, zip,lower(concat(lower(address),lower(city), lower(state), zip)) as location_group, sum(amount) amount
from [robotic-charmer-726:bl_test_data.complex_problem]
group by 1,2,3,4,5,6,7,8
) e
on d.email = e.email and d.first_name = e.first_name and d.last_name = e.last_name
)
where ord = 1
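For anyone who wants to check the three-step rule on a small scale, here is a minimal sketch that runs the same logic against an in-memory SQLite database from Python instead of BigQuery. The sample rows are made up for illustration (they are not the asker's data), the CTE names are my own, and it assumes an SQLite build with window functions (3.25 or later).

```python
import sqlite3

# Hypothetical sample rows (not the asker's data):
# (email, first_name, last_name, address, city, state, zip, amount)
ROWS = [
    ("Ben@x.com", "Ben", "Smith", "1 Main St", "Austin", "TX", "73301", 100),
    ("ben@x.com", "Kathleen", "Jones", "2 Oak Ave", "Dallas", "TX", "75001", 90),
    ("ben@x.com", "kathleen", "jones", "3 Elm Rd", "Dallas", "TX", "75002", 60),
    ("ben@x.com", "Ben", "Lee", "4 Pine Ln", "Waco", "TX", "76701", 60),
    ("amy@y.com", "Amy", "Wu", "9 Bay St", "Reno", "NV", "89501", 40),
]

QUERY = """
WITH base AS (  -- lowercase everything once, up front
    SELECT lower(email) AS email, lower(first_name) AS first_name,
           lower(last_name) AS last_name, lower(address) AS address,
           lower(city) AS city, lower(state) AS state, zip, amount
    FROM complex_problem
),
name_sums AS (  -- sum per (email, full name)
    SELECT email, first_name, last_name, sum(amount) AS amount
    FROM base
    GROUP BY email, first_name, last_name
),
top_name AS (  -- best full name per email
    SELECT email, first_name, last_name
    FROM (SELECT *, row_number() OVER (PARTITION BY email ORDER BY amount DESC) AS ord
          FROM name_sums)
    WHERE ord = 1
),
address_sums AS (  -- sum per (email, full name, full address)
    SELECT email, first_name, last_name, address, city, state, zip,
           sum(amount) AS amount
    FROM base
    GROUP BY email, first_name, last_name, address, city, state, zip
),
top_address AS (  -- best full address among the selected name's rows only
    SELECT email, address, city, state, zip
    FROM (
        SELECT s.email AS email, s.address AS address, s.city AS city,
               s.state AS state, s.zip AS zip,
               row_number() OVER (PARTITION BY s.email ORDER BY s.amount DESC) AS ord
        FROM address_sums s
        JOIN top_name n
          ON n.email = s.email
         AND n.first_name = s.first_name
         AND n.last_name = s.last_name
    )
    WHERE ord = 1
),
email_total AS (  -- total per email across all names
    SELECT email, sum(amount) AS total_amount
    FROM base
    GROUP BY email
)
SELECT n.email, n.first_name, n.last_name,
       a.address, a.city, a.state, a.zip, t.total_amount
FROM top_name n
JOIN top_address a ON a.email = n.email
JOIN email_total t ON t.email = n.email
ORDER BY n.email
"""


def aggregate(rows):
    """Load rows into an in-memory table and return one row per lowercase email."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE complex_problem (email TEXT, first_name TEXT, last_name TEXT, "
        "address TEXT, city TEXT, state TEXT, zip TEXT, amount INTEGER)"
    )
    con.executemany("INSERT INTO complex_problem VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows)
    result = con.execute(QUERY).fetchall()
    con.close()
    return result


result = aggregate(ROWS)
for row in result:
    print(row)
```

In this made-up data the first name "ben" sums to 160 and "kathleen" to 150, yet "kathleen jones" is selected because her full name (150) beats "ben smith" (100) and "ben lee" (60), which is the situation the question describes. Ties are broken arbitrarily by row_number.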
Related
I have the following table.
Fights (fight_year, fight_round, winner, fid, city, league)
I am trying to query the following:
For each year that appears in the Fights table, find the city that held the most fights. For example, if in year 1992, Jersey held more fights than any other city did, you should print out (1992, Jersey)
Here's what I have so far but I keep getting the following error. I am not sure how I should construct my group by functions.
ERROR: column, 'ans.fight_round' must appear in the GROUP BY clause or be used in an aggregate function. Line 3 from (select *
select fight_year, city, max(*)
from (select *
from (select *
from fights as ans
group by (fight_year)) as l2
group by (ans.city)) as l1;
In Postgres, I would recommend aggregation and distinct on:
select distinct on (fight_year) fight_year, city, count(*) cnt
from fights
group by fight_year, city
order by fight_year, count(*) desc
This counts how many fights each city had each year, and retains the city with the most fights per year.
If you want to allow ties, then use window functions:
select fight_year, city, cnt
from (
select fight_year, city, count(*) cnt,
rank() over(partition by fight_year order by count(*) desc) rn
from fights
group by fight_year, city
) f
where rn = 1
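To see how the tie-friendly version behaves, here is a small sketch that runs the same idea in SQLite from Python rather than Postgres. The rows are hypothetical, the table only has the two columns the query needs, and the count is computed in an inner subquery before rank() is applied, which is my own restructuring. It assumes SQLite 3.25 or later for window functions.

```python
import sqlite3

# Hypothetical rows: (fight_year, city). 1993 is a deliberate tie.
FIGHTS = [
    (1992, "Jersey"), (1992, "Jersey"), (1992, "Jersey"), (1992, "Boston"),
    (1993, "Boston"), (1993, "Boston"), (1993, "Reno"), (1993, "Reno"),
]


def busiest_cities(rows):
    """Return (fight_year, city, cnt) for every city tied for most fights in a year."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE fights (fight_year INTEGER, city TEXT)")
    con.executemany("INSERT INTO fights VALUES (?, ?)", rows)
    result = con.execute("""
        SELECT fight_year, city, cnt
        FROM (
            SELECT fight_year, city, cnt,
                   rank() OVER (PARTITION BY fight_year ORDER BY cnt DESC) AS rn
            FROM (
                SELECT fight_year, city, count(*) AS cnt
                FROM fights
                GROUP BY fight_year, city
            )
        )
        WHERE rn = 1
        ORDER BY fight_year, city
    """).fetchall()
    con.close()
    return result


busiest = busiest_cities(FIGHTS)
print(busiest)
```

With rank(), both tied cities come back for 1993; swapping in row_number() would keep only one of them, chosen arbitrarily.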
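A window function, as used by @GMB, is the easiest way, but you can try this alternative as well. It keeps a (year, city) pair only when its fight count is at least as large as every other city's count for that year:
select fight_year, city
from fights f
group by fight_year, city
having count(*) >= all (
select count(*)
from fights f2
where f2.fight_year = f.fight_year
group by f2.city
)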
I have a table as
create table mock_sales
(
first_name character varying,
last_name character varying,
amount integer
);
insert into mock_sales(first_name, last_name, amount)
values('ted','mosby', 100),
('lily', 'aldrin', 400),
('ted', 'mosby', 350),
('barney', 'Stinson',180)
Output Desired
Person with max sum amount
ted mosby // As ted mosby sum = 450 (100 + 350), which is largest
I tried
Select first_name, last_name from mock_sales group by first_name, last_name where amount in (
select max(amount) from t
(select sum(amount) as amount from mock_sales as t group by first_name, last_name)
or
select t.first_name, t.last_name from mock_sales where max(amount) == t.amount and t.amount in (
Select first-name, last_name, sum(amount) as amount from mock_sales as t group by first_name, last_name)
But they both gave syntax errors. Any help will be appreciated.
Having trouble joining the result of 2 queries.
You can just group records having the same first/last name, order the results and keep the first row only:
select first_name, last_name, sum(amount) total_amount
from mock_sales
group by first_name, last_name
order by total_amount desc
limit 1
If you want to allow ties, then it is a bit different. In Postgres, you can use window functions:
select *
from (
select first_name, last_name, sum(amount) total_amount,
rank() over(order by sum(amount) desc) rn
from mock_sales
group by first_name, last_name
) t
where rn = 1
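As a quick check, here is a sketch that loads the question's four rows into an in-memory SQLite database from Python and runs both variants. SQLite stands in for Postgres here, the sum is computed in an inner subquery before rank() is applied (my own restructuring), and window functions need SQLite 3.25 or later.

```python
import sqlite3

# Rows copied from the question's INSERT statement.
SALES = [
    ("ted", "mosby", 100),
    ("lily", "aldrin", 400),
    ("ted", "mosby", 350),
    ("barney", "Stinson", 180),
]


def top_sellers(rows):
    """Return (single_winner, all_tied_winners) by summed amount per full name."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE mock_sales (first_name TEXT, last_name TEXT, amount INTEGER)")
    con.executemany("INSERT INTO mock_sales VALUES (?, ?, ?)", rows)

    # Variant 1: order by the summed amount and keep the first row only.
    single = con.execute("""
        SELECT first_name, last_name, sum(amount) AS total_amount
        FROM mock_sales
        GROUP BY first_name, last_name
        ORDER BY total_amount DESC
        LIMIT 1
    """).fetchone()

    # Variant 2: rank the sums so that ties are all returned.
    tied = con.execute("""
        SELECT first_name, last_name, total_amount
        FROM (
            SELECT first_name, last_name, total_amount,
                   rank() OVER (ORDER BY total_amount DESC) AS rn
            FROM (
                SELECT first_name, last_name, sum(amount) AS total_amount
                FROM mock_sales
                GROUP BY first_name, last_name
            )
        )
        WHERE rn = 1
    """).fetchall()
    con.close()
    return single, tied


single, tied = top_sellers(SALES)
print(single)
print(tied)
```

Both variants agree on this data because there is no tie: ted mosby totals 450, ahead of lily aldrin at 400.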
Since you want the sum rather than the max value, try this instead:
SELECT first_name,last_name, SUM(amount) AS amount FROM mock_sales
GROUP BY first_name,last_name ORDER BY amount DESC LIMIT 1;
Here's the initial table's structure :
yearquarter,user_id,gender,generation,country,group_id
2019-03,zfuzhfuzh,M,Y,FR,Group_1
2019-04,zfuzhfuzh,M,Y,FR,Group_1
2020-04,zfuzhfuzh,M,Y,FR,Group_1
2019-03,ggezegz,F,Y,FR,Group_2
2019-04,ggezegz,F,Y,FR,Group_2
2020-04,ggezegz,F,X,FR,Group_2
....
I want to be able to know the cumulative amount of user_id quarter after quarter grouped by gender, generation and country. Expected result: for a given combination of gender,generation,country I need the cumulated number of users quarter after quarter.
I started with this :
SELECT yearquarter,gender,generation,country,array_agg(distinct user_id IGNORE NULLS) as users FROM my table
WHERE group_id= "mygroup"
GROUP BY 1,2,3,4
But I don't know how to go from this to the result I'm looking for...
You can use aggregation to count the number of users per gender, generation, country and period, and then take a window sum over the periods:
select
gender,
generation,
country,
yearquarter,
sum(count(distinct user_id)) over(partition by gender, generation, country order by yearquarter) cnt
from mytable
where group_id = 'mygroup'
group by gender, generation, country, yearquarter
order by gender, generation, country, yearquarter
I am not sure whether BigQuery supports distinct in window functions. If it doesn't, then we can use a subquery:
select
gender,
generation,
country,
yearquarter,
sum(count(*)) over(partition by gender, generation, country order by yearquarter) cnt
from (
select distinct gender, generation, country, yearquarter, user_id
from mytable
where group_id = 'mygroup'
) t
group by gender, generation, country, yearquarter
order by gender, generation, country, yearquarter
If you want each user to be counted only once, for their first appearance period:
select
gender,
generation,
country,
yearquarter,
sum(count(*)) over(partition by gender, generation, country order by yearquarter) cnt
from (
select gender, generation, country, user_id, min(yearquarter) yearquarter
from mytable
where group_id = 'mygroup'
group by gender, generation, country, user_id
) t
group by gender, generation, country, yearquarter
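Here is a small sketch of the "count each user once, at first appearance" idea, run in SQLite from Python rather than BigQuery. The first six rows are the question's sample and the last row is a hypothetical extra user added so the running total visibly grows. The group_id filter is left out because the sample uses Group_1 and Group_2, and the per-quarter count is done in its own subquery before the window sum, which is my own restructuring. It assumes SQLite 3.25 or later.

```python
import sqlite3

# (yearquarter, user_id, gender, generation, country, group_id)
USERS = [
    ("2019-03", "zfuzhfuzh", "M", "Y", "FR", "Group_1"),
    ("2019-04", "zfuzhfuzh", "M", "Y", "FR", "Group_1"),
    ("2020-04", "zfuzhfuzh", "M", "Y", "FR", "Group_1"),
    ("2019-03", "ggezegz", "F", "Y", "FR", "Group_2"),
    ("2019-04", "ggezegz", "F", "Y", "FR", "Group_2"),
    ("2020-04", "ggezegz", "F", "X", "FR", "Group_2"),
    ("2019-04", "hypothetical_user", "M", "Y", "FR", "Group_1"),  # not in the question
]


def cumulative_users(rows):
    """Running count of users per (gender, generation, country), by first quarter seen."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE mytable (yearquarter TEXT, user_id TEXT, gender TEXT, "
        "generation TEXT, country TEXT, group_id TEXT)"
    )
    con.executemany("INSERT INTO mytable VALUES (?, ?, ?, ?, ?, ?)", rows)
    result = con.execute("""
        SELECT gender, generation, country, yearquarter,
               sum(new_users) OVER (
                   PARTITION BY gender, generation, country
                   ORDER BY yearquarter
               ) AS cnt
        FROM (
            -- how many users first appear in each quarter
            SELECT gender, generation, country, yearquarter, count(*) AS new_users
            FROM (
                -- first quarter in which each user appears within a combination
                SELECT gender, generation, country, user_id,
                       min(yearquarter) AS yearquarter
                FROM mytable
                GROUP BY gender, generation, country, user_id
            )
            GROUP BY gender, generation, country, yearquarter
        )
        ORDER BY gender, generation, country, yearquarter
    """).fetchall()
    con.close()
    return result


cumulative = cumulative_users(USERS)
for row in cumulative:
    print(row)
```

Note that a quarter in which no new user appears produces no row for that combination, and that ggezegz is counted under both (F, Y, FR) and (F, X, FR) because the generation value changes in the sample.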
Below is for BigQuery Standard SQL - built purely on top of your initial query with ARRAY_AGG replaced with STRING_AGG
#standardSQL
SELECT yearquarter, gender, generation, country,
(SELECT COUNT(DISTINCT id) FROM UNNEST(SPLIT(cumulative_users)) AS id) AS cumulative_number_of_users
FROM (
SELECT *,
STRING_AGG(users) OVER(PARTITION BY gender, generation, country ORDER BY yearquarter) AS cumulative_users
FROM (
SELECT
yearquarter, gender, generation, country,
STRING_AGG(DISTINCT user_id) AS users
FROM `project.dataset.table`
WHERE group_id= "mygroup"
GROUP BY yearquarter, gender, generation, country
)
)
-- ORDER BY yearquarter, gender, generation, country
I have a working query:
SELECT
COUNT(*), ACCOUNT_ID
FROM
CDS_PLAYER
GROUP BY
ACCOUNT_ID
HAVING
COUNT(*) > 1
Output
No column name Account_ID
----------------------------
'2' '12345'
I'm trying to add names to these accounts (all from the same table) but with no luck. The only query that gets me close is:
SELECT
LASTNAME, FIRSTNAME, COUNT(ACCOUNT_ID) AS NUMBER
FROM
(SELECT
COUNT(*), ACCOUNT_ID
FROM
CDS_PLAYER
GROUP BY
ACCOUNT_ID
HAVING
COUNT(*) > 1) AS T1
GROUP BY
LASTNAME, FIRSTNAME, PLAYER_ID
But I get an error:
No column was specified for column 1 of 'T1'
Like I said VERY NEW AT THIS. My boss of 4 months wanted me to learn and so I'm self taught (books and google). Any help at all to get me where I need to be would be appreciated!
(I'm using Windows Server 2003 and SQL Server 2000)
The error message can be resolved as below
SELECT LASTNAME, FIRSTNAME, COUNT(ACCOUNT_ID) AS NUMBER
FROM
(SELECT COUNT(*) AS Total, ACCOUNT_ID FROM CDS_PLAYER GROUP BY ACCOUNT_ID HAVING
COUNT(*) > 1) AS T1
GROUP BY LASTNAME, FIRSTNAME, PLAYER_ID
That is, add AS Total after the COUNT(*).
Does this do what you want?
SELECT COUNT(*), ACCOUNT_ID, LASTNAME, FIRSTNAME, PLAYER_ID
FROM CDS_PLAYER
GROUP BY ACCOUNT_ID, LASTNAME, FIRSTNAME, PLAYER_ID
HAVING COUNT(*) > 1;
You should also update your version of SQL Server. It is like 15 years out of date and hasn't been supported in many years. You can download a free version of SQL Server Express from Microsoft.
You want to select LASTNAME and FIRSTNAME, but you haven't selected them in your subselect. You can only access fields that are in its result set.
Solution: add them to the subselect's SELECT list and GROUP BY clause.
For example:
SELECT
LASTNAME, FIRSTNAME, COUNT(ACCOUNT_ID) AS NUMBER
FROM
(SELECT COUNT(*) AS Total, LASTNAME, FIRSTNAME, ACCOUNT_ID
FROM CDS_PLAYER
GROUP BY ACCOUNT_ID, LASTNAME, FIRSTNAME
HAVING COUNT(*) > 1) AS T1
GROUP BY
LASTNAME, FIRSTNAME
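If the goal is to list the names attached to each account that appears more than once, one other option (not taken from the answers above) is to join the duplicate-account subquery back to the table. Below is a minimal sketch using SQLite from Python instead of SQL Server 2000. The rows are hypothetical and the table is assumed to have just the four columns named in the question.

```python
import sqlite3

# Hypothetical rows: (PLAYER_ID, ACCOUNT_ID, LASTNAME, FIRSTNAME)
PLAYERS = [
    (1, "12345", "Doe", "Jane"),
    (2, "12345", "Doe", "John"),
    (3, "67890", "Roe", "Rick"),
]


def players_on_shared_accounts(rows):
    """Return (ACCOUNT_ID, LASTNAME, FIRSTNAME, Total) for accounts used more than once."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE CDS_PLAYER (PLAYER_ID INTEGER, ACCOUNT_ID TEXT, "
        "LASTNAME TEXT, FIRSTNAME TEXT)"
    )
    con.executemany("INSERT INTO CDS_PLAYER VALUES (?, ?, ?, ?)", rows)
    result = con.execute("""
        SELECT p.ACCOUNT_ID, p.LASTNAME, p.FIRSTNAME, d.Total
        FROM CDS_PLAYER p
        JOIN (
            -- accounts that occur more than once; the count needs an alias
            SELECT ACCOUNT_ID, COUNT(*) AS Total
            FROM CDS_PLAYER
            GROUP BY ACCOUNT_ID
            HAVING COUNT(*) > 1
        ) d ON d.ACCOUNT_ID = p.ACCOUNT_ID
        ORDER BY p.ACCOUNT_ID, p.LASTNAME, p.FIRSTNAME
    """).fetchall()
    con.close()
    return result


shared = players_on_shared_accounts(PLAYERS)
for row in shared:
    print(row)
```

The derived table keeps the original working query intact and the outer query only adds the name columns, so no name has to go into the GROUP BY.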
I have this query :
select first_name, last_name, MAX(date)
from person p inner join
address a on
a.person_id = p.id
group by first_name, last_name
with person(sid, last_name, first_name), address(data, cp, sid, city)
My question is: how can I write a query that selects first_name, last_name, MAX(date), city, cp
without adding city and cp to the GROUP BY?
I mean I want all 5 columns, but with the data grouped only by first_name, last_name and date.
Many Thanks
This is not possible. Say you have three John Smiths in your database, each of them having one or two addresses. When you group by name, which city do you want to get? The city of which John Smith, and of which of his addresses? As there is no implicit answer to this question, there is no way to write a select statement without explicitly saying which city is to be selected.
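One way to say explicitly which city is wanted is "the city and cp of each person's most recent address". The sketch below shows that choice using SQLite from Python. The column names (person.id, address.person_id, address.date) follow the query in the question rather than its table description, and the rows are hypothetical.

```python
import sqlite3

# Hypothetical rows. Two different people share the name John Smith.
PEOPLE = [(1, "John", "Smith"), (2, "John", "Smith"), (3, "Ann", "Lee")]
ADDRESSES = [  # (person_id, date, city, cp)
    (1, "2019-01-01", "Paris", "75001"),
    (1, "2020-06-01", "Lyon", "69001"),
    (2, "2018-03-01", "Nice", "06000"),
    (3, "2021-02-01", "Lille", "59000"),
]


def latest_addresses(people, addresses):
    """Return each person's name with the date, city and cp of their latest address."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE person (id INTEGER, first_name TEXT, last_name TEXT)")
    con.execute("CREATE TABLE address (person_id INTEGER, date TEXT, city TEXT, cp TEXT)")
    con.executemany("INSERT INTO person VALUES (?, ?, ?)", people)
    con.executemany("INSERT INTO address VALUES (?, ?, ?, ?)", addresses)
    result = con.execute("""
        SELECT p.id, p.first_name, p.last_name, a.date, a.city, a.cp
        FROM person p
        JOIN address a ON a.person_id = p.id
        -- keep only the address row carrying this person's maximum date
        WHERE a.date = (SELECT max(a2.date)
                        FROM address a2
                        WHERE a2.person_id = p.id)
        ORDER BY p.id
    """).fetchall()
    con.close()
    return result


latest = latest_addresses(PEOPLE, ADDRESSES)
for row in latest:
    print(row)
```

The rule is applied per person id rather than per name, so the two John Smiths stay separate. If one person has two addresses with the same latest date, both rows are returned.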