Merge array with previous value (under defined conditions) - sql

Here's the initial table's structure :
yearquarter,user_id,gender,generation,country,group_id
2019-03,zfuzhfuzh,M,Y,FR,Group_1
2019-04,zfuzhfuzh,M,Y,FR,Group_1
2020-04,zfuzhfuzh,M,Y,FR,Group_1
2019-03,ggezegz,F,Y,FR,Group_2
2019-04,ggezegz,F,Y,FR,Group_2
2020-04,ggezegz,F,X,FR,Group_2
....
I want to be able to know the cumulative amount of user_id quarter after quarter grouped by gender, generation and country. Expected result: for a given combination of gender,generation,country I need the cumulated number of users quarter after quarter.
I started with this :
SELECT yearquarter,gender,generation,country,array_agg(distinct user_id IGNORE NULLS) as users FROM mytable
WHERE group_id= "mygroup"
GROUP BY 1,2,3,4
But I don't know how to go from this to the result I'm looking for...

You can use aggregation to count the number of users per gender, generation, country and period, and then take a window sum over the periods:
select
gender,
generation,
country,
yearquarter,
sum(count(distinct user_id)) over(partition by gender, generation, country order by yearquarter) cnt
from mytable
where group_id = 'mygroup'
group by gender, generation, country, yearquarter
order by gender, generation, country, yearquarter
I am unsure whether BigQuery supports distinct in this context. If it doesn't, we can use a subquery:
select
gender,
generation,
country,
yearquarter,
sum(count(*)) over(partition by gender, generation, country order by yearquarter) cnt
from (
select distinct gender, generation, country, yearquarter, user_id
from mytable
where group_id = 'mygroup'
) t
group by gender, generation, country, yearquarter
order by gender, generation, country, yearquarter
If you want each user to be counted only once, for their first appearance period:
select
gender,
generation,
country,
yearquarter,
sum(count(*)) over(partition by gender, generation, country order by yearquarter) cnt
from (
select gender, generation, country, user_id, min(yearquarter) yearquarter
from mytable
where group_id = 'mygroup'
group by gender, generation, country, user_id
) t
group by gender, generation, country, yearquarter
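The difference between the per-quarter sum and the first-appearance variant can be checked on a tiny data set. This is a minimal sketch that runs the two queries in SQLite through Python's `sqlite3` module rather than in BigQuery. The user `other_user` is a made-up extra row, and the aggregate is moved into a subquery so the window function only sees plain columns.

```python
import sqlite3

# Sample rows from the question plus one hypothetical extra user ("other_user").
rows = [
    ("2019-03", "zfuzhfuzh", "M", "Y", "FR", "Group_1"),
    ("2019-04", "zfuzhfuzh", "M", "Y", "FR", "Group_1"),
    ("2020-04", "zfuzhfuzh", "M", "Y", "FR", "Group_1"),
    ("2019-04", "other_user", "M", "Y", "FR", "Group_1"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable (yearquarter TEXT, user_id TEXT, gender TEXT,"
    " generation TEXT, country TEXT, group_id TEXT)"
)
conn.executemany("INSERT INTO mytable VALUES (?, ?, ?, ?, ?, ?)", rows)

# Variant 1: distinct users per quarter, then a running sum.
# A user active in several quarters is counted once per quarter.
per_quarter = conn.execute("""
    SELECT yearquarter,
           SUM(users) OVER (PARTITION BY gender, generation, country
                            ORDER BY yearquarter) AS cnt
    FROM (
        SELECT gender, generation, country, yearquarter,
               COUNT(DISTINCT user_id) AS users
        FROM mytable
        WHERE group_id = 'Group_1'
        GROUP BY gender, generation, country, yearquarter
    ) AS t
    ORDER BY gender, generation, country, yearquarter
""").fetchall()

# Variant 2: each user counted only in the quarter of first appearance.
first_seen = conn.execute("""
    SELECT yearquarter,
           SUM(new_users) OVER (PARTITION BY gender, generation, country
                                ORDER BY yearquarter) AS cnt
    FROM (
        SELECT gender, generation, country, yearquarter, COUNT(*) AS new_users
        FROM (
            SELECT gender, generation, country, user_id,
                   MIN(yearquarter) AS yearquarter
            FROM mytable
            WHERE group_id = 'Group_1'
            GROUP BY gender, generation, country, user_id
        ) AS first_quarter
        GROUP BY gender, generation, country, yearquarter
    ) AS t
    ORDER BY gender, generation, country, yearquarter
""").fetchall()

print(per_quarter)  # running sum of per-quarter distinct counts
print(first_seen)   # running count of users by first appearance
```

Note that the first-appearance variant returns no row for a quarter in which no new user shows up (here 2020-04). If every quarter must be listed, the result would need to be joined back to a list of quarters.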

Below is for BigQuery Standard SQL - built purely on top of your initial query with ARRAY_AGG replaced with STRING_AGG
#standardSQL
SELECT yearquarter, gender, generation, country,
(SELECT COUNT(DISTINCT id) FROM UNNEST(SPLIT(cumulative_users)) AS id) AS cumulative_number_of_users
FROM (
SELECT *,
STRING_AGG(users) OVER(PARTITION BY gender, generation, country ORDER BY yearquarter) AS cumulative_users
FROM (
SELECT
yearquarter, gender, generation, country,
STRING_AGG(DISTINCT user_id) AS users
FROM `project.dataset.table`
WHERE group_id= "mygroup"
GROUP BY yearquarter, gender, generation, country
)
)
-- ORDER BY yearquarter, gender, generation, country

Related

Is this SQL query possible? I am trying to get the least frequent names in this table

This is a toy public table in google BigQuery:
The table contains the names given to people in the US at birth and the frequency of those names for each state and year from 1910 to 2020
Columns: name, year, state, number, gender
I am trying to get the LEAST popular names (names with lowest 'number' column) each year.
I am not sure this is possible with this schema.
Depending on how you want to handle a tie, you want rank or row_number.
select
*
from (
select
*,
row_number() over (partition by year order by name_frequency) as rn
from (
select
year,
name,
sum(number) as name_frequency
from `bigquery-public-data.usa_names.usa_1910_2013`
group by
year,
name
) sub1
) sub2
where rn = 1
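To illustrate the remark about ties, here is a small sketch that compares `row_number` and `rank` on the same aggregated data. It uses SQLite through Python's `sqlite3` module instead of BigQuery, and the table `names` with its four rows is invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical miniature of the public table: one row per name, year and state.
conn.execute("CREATE TABLE names (year INTEGER, name TEXT, state TEXT, number INTEGER)")
conn.executemany(
    "INSERT INTO names VALUES (?, ?, ?, ?)",
    [
        (2000, "Ann", "NY", 3),
        (2000, "Ann", "CA", 2),  # Ann totals 5 across states
        (2000, "Bob", "NY", 5),  # Bob also totals 5, so the two tie
        (2000, "Cy", "NY", 9),
    ],
)

ranked = conn.execute("""
    SELECT name, name_frequency,
           ROW_NUMBER() OVER (PARTITION BY year ORDER BY name_frequency) AS rn,
           RANK()       OVER (PARTITION BY year ORDER BY name_frequency) AS rk
    FROM (
        SELECT year, name, SUM(number) AS name_frequency
        FROM names
        GROUP BY year, name
    ) AS sub1
""").fetchall()

# row_number keeps exactly one of the tied names (which one is arbitrary).
by_row_number = [name for name, _, rn, _ in ranked if rn == 1]
# rank keeps every name tied for the lowest frequency.
by_rank = sorted(name for name, _, _, rk in ranked if rk == 1)

print(by_row_number)
print(by_rank)
```

So `where rn = 1` with `row_number` yields one name per year, while the same filter on `rank` yields all names sharing the lowest total.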
Consider below approach
select * from (
select year, gender, name, sum(number) number
from `bigquery-public-data.usa_names.usa_1910_2013`
group by year, gender, name
qualify 1 = row_number() over(partition by year, gender order by number)
)
pivot (any_value(name) name, any_value(number) number for gender in ('M', 'F'))
# order by year desc
with output (just top 9 shown)
In reality, many names share the same lowest frequency. To get all of them, use the approach below
select * from (
select year, gender, name, sum(number) number
from `bigquery-public-data.usa_names.usa_1910_2013`
group by year, gender, name
qualify 1 = dense_rank() over(partition by year, gender order by number)
)
pivot (string_agg(name) name, any_value(number) number for gender in ('M', 'F'))
with output
If you were looking for the most frequent names instead, you would use the query below
select * from (
select year, gender, name, sum(number) number
from `bigquery-public-data.usa_names.usa_1910_2013`
group by year, gender, name
qualify 1 = dense_rank() over(partition by year, gender order by number desc)
)
pivot (string_agg(name) name, any_value(number) number for gender in ('M', 'F'))
with just one most frequent name per year/gender

Column must appear in group by or aggregate function in nested query

I have the following table.
Fights (fight_year, fight_round, winner, fid, city, league)
I am trying to query the following:
For each year that appears in the Fights table, find the city that held the most fights. For example, if in year 1992, Jersey held more fights than any other city did, you should print out (1992, Jersey)
Here's what I have so far but I keep getting the following error. I am not sure how I should construct my group by functions.
ERROR: column, 'ans.fight_round' must appear in the GROUP BY clause or be used in an aggregate function. Line 3 from (select *
select fight_year, city, max(*)
from (select *
from (select *
from fights as ans
group by (fight_year)) as l2
group by (ans.city)) as l1;
In Postgres, I would recommend aggregation and distinct on:
select distinct on (fight_year) fight_year, city, count(*) cnt
from fights
group by fight_year, city
order by fight_year, count(*) desc
This counts how many fights each city had each year, and retains the city with the most fights per year.
If you want to allow ties, then use window functions:
select fight_year, city, cnt
from (
select fight_year, city, count(*) cnt,
rank() over(partition by fight_year order by count(*) desc) rn
from fights
group by fight_year, city
) f
where rn = 1
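The tie-friendly variant can be tried on a few rows. The sketch below runs it in SQLite through Python's `sqlite3` module, since `distinct on` is specific to Postgres. The table is reduced to three columns, the rows are invented, and the count is computed in an inner query before ranking.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified, hypothetical version of the Fights table.
conn.execute("CREATE TABLE fights (fid INTEGER, fight_year INTEGER, city TEXT)")
conn.executemany(
    "INSERT INTO fights VALUES (?, ?, ?)",
    [
        (1, 1992, "Jersey"),
        (2, 1992, "Jersey"),
        (3, 1992, "Boston"),
        (4, 1993, "Boston"),
        (5, 1993, "Jersey"),  # 1993 is a tie: one fight in each city
    ],
)

top_cities = conn.execute("""
    SELECT fight_year, city, cnt
    FROM (
        SELECT fight_year, city, cnt,
               RANK() OVER (PARTITION BY fight_year ORDER BY cnt DESC) AS rn
        FROM (
            -- Step 1: number of fights per year and city.
            SELECT fight_year, city, COUNT(*) AS cnt
            FROM fights
            GROUP BY fight_year, city
        ) AS per_city
    ) AS ranked
    WHERE rn = 1   -- Step 2: keep every city tied for first place.
    ORDER BY fight_year, city
""").fetchall()

print(top_cities)
```

With `rank`, both cities are returned for 1993. Replacing `RANK()` with `ROW_NUMBER()` would keep a single, arbitrarily chosen city per year.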
Although the window-function approach shown by @GMB is the easiest way, you can try this alternative as well:
select city, fight_year
from fights
group by city, fight_year
having count(*) = sum(case when fid is not null then 1 end)

How to eliminate duplicate records based on only few columns in table and keep one and indicate that there were duplicate records corresponding to it?

The resultset I have is like shown below:
And expected output is like shown below:
Any idea how can we achieve this with SQL in Oracle?
You can use window functions:
select city, name, salary,
(case when cnt > 1 then 'Multiple' else 'Single' end) as Indicator
from (select t.*,
count(*) over (partition by city, name) as cnt,
row_number() over (partition by city, name order by salary) as seqnum
from t
) t
where seqnum = 1;
EDIT:
Actually, if you want the minimum salary:
select city, name, min(salary),
(case when count(*) = 1 then 'Single' else 'Multiple' end) as indicator
from t
group by city, name;
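As a quick check of the aggregation variant, the sketch below runs it in SQLite through Python's `sqlite3` module rather than Oracle. The table `t` and its three rows are assumptions, because the question's sample data was not preserved.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical rows standing in for the question's result set.
conn.execute("CREATE TABLE t (city TEXT, name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        ("Paris", "Ann", 100),
        ("Paris", "Ann", 200),  # duplicate on (city, name)
        ("Rome", "Bob", 150),
    ],
)

deduplicated = conn.execute("""
    SELECT city, name, MIN(salary) AS salary,
           CASE WHEN COUNT(*) = 1 THEN 'Single' ELSE 'Multiple' END AS indicator
    FROM t
    GROUP BY city, name
    ORDER BY city, name
""").fetchall()

print(deduplicated)
```

Each (city, name) pair appears once, with the lowest salary and a flag telling whether several source rows were collapsed into it.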
Try this
DELETE FROM tabename WHERE rowid NOT IN
(SELECT MIN(rowid)
FROM tabename
GROUP BY city, name, salary);

Displaying value in only last row of a partition

I am generating a transcript for students in a database. I am using window functions to provide running semester and cumulative GPAs. My query is as follows:
SELECT FIRSTNAME, COURSENAME, SCORE, MAX_SCORE, SEMESTER, SGPA, AVG(ROUND(SGPA,2)) OVER (PARTITION BY FIRSTNAME ORDER BY SEMESTER) AS CGPA
FROM(
SELECT FIRSTNAME, COURSENAME, SCORE, MAX_SCORE, SEMESTER, ROUND((SEM_TOTAL_SCORE * 4/SEM_TOTAL_MAX_SCORE ),2) AS SGPA FROM(
SELECT FIRSTNAME, COURSENAME, SCORE, MAX_SCORE, SEMESTER,
SUM(SCORE)
OVER (PARTITION BY FIRSTNAME ORDER BY SEMESTER) AS SEM_TOTAL_SCORE,
SUM(MAX_SCORE)
OVER (PARTITION BY FIRSTNAME ORDER BY SEMESTER) AS SEM_TOTAL_MAX_SCORE
FROM STUDENT_GRADE_COURSE
)
);
I get the following results:
The results are correct, but I do not want to display the SGPA and CGPA on every row. I want them displayed only on the last row of the partition, which in this case is the semester. So the GPAs should appear on the last row of semester 1 and of semester 2, and those columns should be empty on all other rows.
How can I do that?

SQL Server query multiple aggregate columns

I need to write a query in SQL Server to get data like this.
Essentially it is group by dept, race, gender and then
SUM(employees_of_race_by_gender),Sum(employees_Of_Dept).
I could get the data for the first four columns, but getting the sum of employees in that dept is proving difficult.
Could you please help me write the query?
All these details are in the same table, Emp. Its columns are Emp_Number, Race_Name, Gender, Dept
Your "num_of_emp_in_race" is actually by Gender too
SELECT DISTINCT
Dept,
Race_name,
Gender,
COUNT(*) OVER (PARTITION BY Dept, Race_name, Gender) AS num_of_emp_in_race,
COUNT(*) OVER (PARTITION BY Dept) AS num_of_emp_dept
FROM
MyTable
You should probably have this
COUNT(*) OVER (PARTITION BY Dept, Race_name) AS PerDeptRace,
COUNT(*) OVER (PARTITION BY Dept, Gender) AS PerDeptGender,
COUNT(*) OVER (PARTITION BY Dept, Race_name, Gender) AS PerDeptRaceGender,
COUNT(*) OVER (PARTITION BY Dept) AS PerDept
Edit: the DISTINCT appears to be applied before the COUNT (which would be odd based on this), so try this instead
SELECT DISTINCT
*
FROM
(
SELECT
Dept,
Race_name,
Gender,
COUNT(*) OVER (PARTITION BY Dept, Race_name, Gender) AS num_of_emp_in_race,
COUNT(*) OVER (PARTITION BY Dept) AS num_of_emp_dept
FROM
MyTable
) foo
Since the two sums you're looking for are based on a different aggregation, you need to calculate them separately and join the result. In such cases I first build the selects to show me the different results, making it easy to catch errors early:
SELECT Dept, Gender, race_name, COUNT(*) as num_of_emp_in_race
FROM Emp
GROUP BY Dept, Gender, race_name;
SELECT Dept, COUNT(*) as num_of_emp_in_dept
FROM Emp
GROUP BY Dept;
Afterwards, joining those two is pretty straightforward:
SELECT *
FROM ( first statement here ) as by_race
JOIN ( second statement here ) as by_dept ON (by_race.Dept = by_dept.Dept)
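The window-function answer and the join answer should give the same rows. The sketch below compares them in SQLite through Python's `sqlite3` module instead of SQL Server, using four invented employees in the Emp table described in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Emp (Emp_Number INTEGER, Race_Name TEXT, Gender TEXT, Dept TEXT)")
# Hypothetical employees, for illustration only.
conn.executemany(
    "INSERT INTO Emp VALUES (?, ?, ?, ?)",
    [
        (1, "A", "F", "Sales"),
        (2, "A", "F", "Sales"),
        (3, "B", "M", "Sales"),
        (4, "A", "M", "IT"),
    ],
)

# Approach 1: two window counts at different granularities, then DISTINCT.
with_windows = conn.execute("""
    SELECT DISTINCT Dept, Race_Name, Gender,
           COUNT(*) OVER (PARTITION BY Dept, Race_Name, Gender) AS num_of_emp_in_race,
           COUNT(*) OVER (PARTITION BY Dept) AS num_of_emp_dept
    FROM Emp
    ORDER BY Dept, Race_Name, Gender
""").fetchall()

# Approach 2: aggregate at each level separately and join on Dept.
with_join = conn.execute("""
    SELECT by_race.Dept, by_race.Race_Name, by_race.Gender,
           by_race.num_of_emp_in_race, by_dept.num_of_emp_dept
    FROM (
        SELECT Dept, Race_Name, Gender, COUNT(*) AS num_of_emp_in_race
        FROM Emp
        GROUP BY Dept, Race_Name, Gender
    ) AS by_race
    JOIN (
        SELECT Dept, COUNT(*) AS num_of_emp_dept
        FROM Emp
        GROUP BY Dept
    ) AS by_dept ON by_race.Dept = by_dept.Dept
    ORDER BY by_race.Dept, by_race.Race_Name, by_race.Gender
""").fetchall()

print(with_windows)
print(with_join)
```

The window version reads the table once and needs no join, while the join version keeps each aggregation easy to inspect on its own.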