Writing a query using advanced GROUP BY - SQL

I have a single-table database consisting of the following fields:
ID, Seniority (years), outcome, and some other less important fields.
Table row example:
ID:36 Seniority(years):1.79 outcome:9627
I need to write a query (SQL Server), in relatively simple code, that returns the average outcome grouped by the Seniority field in leaps of five years (0-5 years, 6-10, etc.), with the condition that the average is shown only if the group has more than 3 rows.
Result row example:
range:0-5 average:xxxx
Thank you very much

Use a CASE expression to create the different seniority groups. Try this:
select case when Seniority between 0 and 5 then '0-5'
            when Seniority between 6 and 10 then '6-10'
            ..
       end,
       avg(outcome)
from yourtable
group by case when Seniority between 0 and 5 then '0-5'
              when Seniority between 6 and 10 then '6-10'
              ..
         end
having count(1) >= 3
Since Seniority has decimal places, if you want 5.4 to count toward the 0-5 group and 5.6 toward the 6-10 group, use round(Seniority, 0) instead of Seniority in the CASE expression.
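To see the rounding behavior in isolation, here is a quick check you can run as-is:
-- round(5.4, 0) yields 5, so 5.4 lands in the 0-5 group;
-- round(5.6, 0) yields 6, so 5.6 lands in the 6-10 group.
select round(5.4, 0) as rounds_down, round(5.6, 0) as rounds_up;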

P.S. Note that 0-5 spans 6 integer values while 6-10 spans only 5, so those buckets are unequal.
select 'range:'
       + cast(isnull(nullif(floor((abs(seniority - 1)) / 5) * 5 + 1, 1), 0) as varchar)
       + '-'
       + cast((floor((abs(seniority - 1)) / 5) + 1) * 5 as varchar) as seniority_group,
       avg(outcome)
from t
group by floor((abs(seniority - 1)) / 5)
having count(*) >= 3;

This would be something like:
select floor(seniority / 5), avg(outcome)
from t
group by floor(seniority / 5)
having count(*) >= 3;
Note: this breaks the seniority into equal-sized groups (0-4, 5-9, and so on), which seems more reasonable than having unequal groups.
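If you also want a readable range label instead of a bare bucket number, here is a hedged sketch building the label from the bucket (same hypothetical table t and columns as above):
-- The bucket's lower bound is floor(seniority / 5) * 5; the upper bound is 4 above it.
select cast(cast(floor(seniority / 5) as int) * 5 as varchar(10)) + '-'
       + cast(cast(floor(seniority / 5) as int) * 5 + 4 as varchar(10)) as seniority_range,
       avg(outcome) as average
from t
group by floor(seniority / 5)
having count(*) >= 3;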

You can follow Gordon's answer (though you would need to edit it a little), but I would do this with an additional table holding all the possible intervals. You can then add an appropriate index to speed it up.
create table intervals
(
    id int identity(1, 1),
    [start] int,
    [end] int -- brackets needed: "end" is a reserved word
);

insert into intervals ([start], [end]) values
(0, 5),
(6, 10)
...
select i.id, avg(t.outcome) as outcome
from intervals i
join tablename t on t.seniority between i.[start] and i.[end]
group by i.id
having count(*) >= 3
If creating new tables is not an option you can always use a CTE:
;with intervals as (
    select * from
    (values
        (1, 0, 5),
        (2, 6, 10)
        --...
    ) t(id, [start], [end]) -- an id column is included because the query below groups by it
)
select i.id, avg(t.outcome) as outcome
from intervals i
join tablename t on t.seniority between i.[start] and i.[end]
group by i.id
having count(*) >= 3

Related

Check if 20% of a tournament's players have played 8 or more rounds in the previous year - Query Optimization - BigQuery

So I have a table "tournaments" with the following attributes/columns:
player_name
tournament_name
round_1, round_2, ......, round_10
round_1_date, round_2_date, ......, round_10_date
Other columns
I want to check whether 20% of the players of a particular tournament have played 8 or more rounds in the previous year. If a player has played a round, there will be some score in that round; otherwise it is null.
E.g.: if a player has played rounds 1 and 2, then the round_1 and round_2 columns will be populated with some score, along with round_1_date and round_2_date. All the other round columns will be null.
(The starting date can be round_1_date.)
I have made the following query, which gives accurate results, but I believe there is a better/more optimized approach that takes less time, as this query will run multiple times in a SQL loop over a large dataset. The query returns true or false.
SELECT
  CAST((total_players_meeting_threshold_criteria / COUNT(t4.player_name) * 100 >= 20) AS string) AS result
FROM (
  SELECT
    COUNT(*) AS total_players_meeting_threshold_criteria
  FROM (
    SELECT
      player_name,
      COUNT(round_1) + COUNT(round_2) + COUNT(round_3) + COUNT(round_4) + COUNT(round_5)
        + COUNT(round_6) + COUNT(round_7) + COUNT(round_8) + COUNT(round_9) + COUNT(round_10)
        AS total_rounds_played
    FROM `Golf_DB_Women_Dataset.tournaments` t2
    WHERE player_name IN (
        SELECT DISTINCT player_name
        FROM `tournaments` t1
        WHERE tournament_name = 'Tournament of Champions')
      AND t2.round_1_date BETWEEN
          DATE_SUB((SELECT round_1_date
                    FROM `tournaments`
                    WHERE tournament_name = 'Tournament of Champions'
                    LIMIT 1), INTERVAL 1 YEAR)
          AND DATE_SUB((SELECT round_1_date
                        FROM `tournaments`
                        WHERE tournament_name = 'Tournament of Champions'
                        LIMIT 1), INTERVAL 1 DAY)
    GROUP BY player_name
    HAVING total_rounds_played >= 8
    ORDER BY player_name)
) AS threshold_result,
`tournaments` t4
WHERE t4.tournament_name = 'Tournament of Champions'
GROUP BY total_players_meeting_threshold_criteria
Any help will be highly appreciated.
Thank you
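A hedged sketch of one possible simplification (untested; it assumes the schema described above, takes MIN(round_1_date) of the named tournament as the reference date where the original relies on LIMIT 1, and counts distinct players in the denominator rather than raw rows): compute the reference date once in a CTE and fold the percentage check into a single COUNTIF pass.
WITH ref AS (
  -- Reference date computed once, instead of two repeated scalar subqueries
  SELECT MIN(round_1_date) AS start_date
  FROM `tournaments`
  WHERE tournament_name = 'Tournament of Champions'
),
champs AS (
  SELECT DISTINCT player_name
  FROM `tournaments`
  WHERE tournament_name = 'Tournament of Champions'
),
rounds AS (
  SELECT
    t.player_name,
    -- COUNT skips nulls, so this totals the rounds actually played
    COUNT(t.round_1) + COUNT(t.round_2) + COUNT(t.round_3) + COUNT(t.round_4) + COUNT(t.round_5)
      + COUNT(t.round_6) + COUNT(t.round_7) + COUNT(t.round_8) + COUNT(t.round_9) + COUNT(t.round_10)
      AS total_rounds_played
  FROM `tournaments` t
  JOIN champs USING (player_name)
  CROSS JOIN ref
  WHERE t.round_1_date BETWEEN DATE_SUB(ref.start_date, INTERVAL 1 YEAR)
                           AND DATE_SUB(ref.start_date, INTERVAL 1 DAY)
  GROUP BY t.player_name
)
SELECT CAST(COUNTIF(r.total_rounds_played >= 8) / COUNT(*) >= 0.2 AS string) AS result
FROM champs c
LEFT JOIN rounds r USING (player_name);
Grouping once per player and dividing at the end avoids rescanning the table for the reference date on every branch.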

Adding missing date rows to a BigQuery Table

I have a table where one of the columns is an integer that represents the row's time. The problem is that the table isn't full; there are missing timestamps.
I would like to fill in the missing values such that every 10 seconds there is a row. I want the rest of the columns to be nulls (later I'll forward-fill those nulls).
10 secs is basically 10,000 in this column's units.
If this were Python, the range would be:
range(
    min(table[column]),
    max(table[column]),
    10000
)
If your values are strictly spaced 10 seconds apart, and just some multiples of 10-second intervals are missing, you can use this approach to fill the holes in your data:
WITH minsmax AS (
  SELECT
    MIN(time) AS minval,
    MAX(time) AS maxval
  FROM `dataset.table`
)
SELECT
  IF(d.time <= i.time, d.time, i.time) AS time,
  MAX(IF(d.time <= i.time, d.value, NULL)) AS value
FROM (
  SELECT time FROM minsmax m, UNNEST(GENERATE_ARRAY(m.minval, m.maxval + 100, 100)) AS time
) AS i
LEFT JOIN `dataset.table` d ON 1 = 1
WHERE ABS(d.time - i.time) >= 100
GROUP BY 1
ORDER BY 1
Hope this helps.
You can use arrays. For numbers, you can do:
select n
from unnest(generate_array(1, 1000, 1)) n;
There are the similar functions generate_timestamp_array() and generate_date_array() if you really need those types.
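For instance, a 10-second timestamp grid can be generated like this (a sketch; the bounds are placeholders for your own min and max):
-- GENERATE_TIMESTAMP_ARRAY produces one element per INTERVAL step
SELECT ts
FROM UNNEST(GENERATE_TIMESTAMP_ARRAY(
    TIMESTAMP '2018-01-01 00:00:00',
    TIMESTAMP '2018-01-01 00:01:00',
    INTERVAL 10 SECOND)) AS ts;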
I ended up using the following query through the Python API:
"""
SELECT
i.time,
Sensor_Reading,
Sensor_Name
FROM (
SELECT time FROM UNNEST(GENERATE_ARRAY({min_time}, {max_time}+{sampling_period}+1, {sampling_period})) AS time
) AS i
LEFT JOIN
`{input_table}` AS input
ON
i.time =input.Time
ORDER BY i.time
""".format(sampling_period=sampling_period, min_time=min_time,
max_time=max_time,
input_table=input_table)
Thanks to both answers

Sum rows with same identification-number and sort by custom order

I have the following table structure for the table "products":
id  amount  number
1   10      M6545
2   32      M6424
3   32      M6545
4   49      M6412
... ...     ...
I want to select the sum of the amounts of all rows with the same number; the rows with the same number should be summed up into one value. That means:
M6545 -> sum 42
M6424 -> sum 32
M6412 -> sum 49
My query looks like the following and still does not work:
SELECT SUM(amount) as SumAm FROM products WHERE number IN ('M6412', 'M6545')
I want to find a way to select only the sums, ordered by the numbers in the "IN" clause. That means the result table should look like:
SumAm
49
42
The sums should not be ordered in any other way; the order must match the order of the numbers in the IN clause.
Use group by number:
SELECT number, SUM(amount) as SumAm FROM products
--WHERE number IN ('M6412', 'M6545') I think you don't need the where clause
group by number
But if you want sums just for 'M6412' and 'M6545', then you do need the where clause you showed, as in your second output sample.
Use group by and aggregation
SELECT SUM(amount) as SumAm FROM products
WHERE number IN ('M6412', 'M6545')
group by number
You can't order the results directly by the order of the IN clause.
What you can do is something like this:
SELECT SUM(amount) as SumAm
FROM products
WHERE number IN ('M6412', 'M6545')
GROUP BY number -- You must group by to get a row for each number
ORDER BY CASE number
WHEN 'M6412' THEN 1
WHEN 'M6545' THEN 2
END
Of course, the more items you have in your IN clause, the more cumbersome this query gets. Therefore another solution might be more practical: joining to a table variable instead of using IN.
DECLARE @Numbers AS TABLE -- table variables are declared with @, not #
(
    sort int identity(1,1), -- this will hold the order of the inserted values
    number varchar(10) PRIMARY KEY -- enforce unique values
);

INSERT INTO @Numbers (number) VALUES
('M6412'),
('M6545')

SELECT SUM(amount) as SumAm
FROM products As p
JOIN @Numbers As n ON p.Number = n.Number
-- number and sort have a 1 - 1 relationship,
-- so it's safe to group by sort instead of by number
GROUP BY n.sort
ORDER BY n.sort
Your requirement is nonsense... this is not how IN is designed to work. Having said that, the following will give you the result in the desired order:
SELECT SUM(amount)
FROM (VALUES
    ('M6412', 1),
    ('M6545', 2)
) AS va(number, sortorder)
INNER JOIN products ON va.number = products.number
GROUP BY va.number, va.sortorder
ORDER BY va.sortorder
This is somewhat better than writing a CASE expression, where you would need to add a WHEN branch for each number.

SQL Server iterating through time series data

I am using SQL Server and am wondering if it is possible to iterate through time series data until a specific condition is met and, based on that, label my data in another table.
For example, let's say I have a table like this:
Id  Date        Some_kind_of_event
--  ----------  ------------------
1   2018-01-01  dsdf...
1   2018-01-06  sdfs...
1   2018-01-29  fsdfs...
2   2018-05-10  sdfs...
2   2018-05-11  fgdf...
2   2018-05-12  asda...
3   2018-02-15  sgsd...
3   2018-02-16  rgw...
3   2018-02-17  sgs...
3   2018-02-28  sgs...
What I want is to calculate, for each key, the difference between each two adjacent events and find out whether any two adjacent events are more than 10 days apart. If so, I want to stop iterating for that specific key and put the label 'inactive' in my other table, otherwise 'active'. After we finish with one key, we start with the next.
So, for example, id = 1 would get the label 'inactive' because there exist two adjacent dates that are more than 10 days apart. The final result would be:
Id  Label
--  --------
1   inactive
2   active
3   inactive
Any ideas how to do that? Is it possible to do it with SQL?
When working with a DBMS you need to get away from the idea of thinking iteratively. Instead you need to try and think in sets. "Instead of thinking about what you want to do to a row, think about what you want to do to a column."
If I understand correctly, is this what you're after?
CREATE TABLE SomeEvent (ID int, EventDate date, EventName varchar(10));
INSERT INTO SomeEvent
VALUES (1,'20180101','dsdf...'),
(1,'20180106','sdfs...'),
(1,'20180129','fsdfs..'),
(2,'20180510','sdfs...'),
(2,'20180511','fgdf...'),
(2,'20180512','asda...'),
(3,'20180215','sgsd...'),
(3,'20180216','rgw....'),
(3,'20180217','sgs....'),
(3,'20180228','sgs....');
GO
WITH Gaps AS(
SELECT *,
DATEDIFF(DAY,LAG(EventDate) OVER (PARTITION BY ID ORDER BY EventDate),EventDate) AS EventGap
FROM SomeEvent)
SELECT ID,
CASE WHEN MAX(EventGap) > 10 THEN 'inactive' ELSE 'active' END AS Label
FROM Gaps
GROUP BY ID
ORDER BY ID;
GO
DROP TABLE SomeEvent;
GO
This assumes you are using SQL Server 2012+, as it uses the LAG function, and SQL Server 2008 has less than 12 months of any kind of support.
Try this. Note: replace #MyTable with your actual table.
WITH Diffs AS (
    SELECT
        Id
        -- partition by Id so the last event of one Id is not compared with the first event of the next
        ,DATEDIFF(DAY, [Date], LEAD([Date]) OVER (PARTITION BY Id ORDER BY [Date])) Diff
    FROM #MyTable)
SELECT
    Id
    ,CASE WHEN MAX(Diff) > 10 THEN 'Inactive' ELSE 'Active' END
FROM Diffs
GROUP BY Id
Just to share another approach (without a CTE).
SELECT
    ID
    , CASE WHEN SUM(TotalDays) = (MAX(CNT) - 1) THEN 'Active' ELSE 'Inactive' END Label
FROM (
    SELECT
        ID
        , EventDate
        , CASE WHEN DATEDIFF(DAY, EventDate, LEAD(EventDate) OVER(PARTITION BY ID ORDER BY EventDate)) <= 10 THEN 1 ELSE 0 END TotalDays
        , COUNT(ID) OVER(PARTITION BY ID) CNT
    FROM EventsTable
) D
GROUP BY ID
The method counts how many records each ID has, and derives TotalDays from the date difference (in days) between the current and the next date: if the difference is 10 days or less, it yields 1, else 0.
Then it compares: if the total equals the number of records for that ID minus one, it prints Active, else Inactive.

Use google bigquery to build histogram graph

How can I write a query that makes histogram graph rendering easier?
For example, we have 100 million people with ages, and we want to draw a histogram with buckets for ages 0-10, 11-20, 21-30, etc. What would the query look like?
Has anyone done this? Did you try connecting the query result to a Google spreadsheet to draw the histogram?
You could also use the quantiles aggregation operator to get a quick look at the distribution of ages.
SELECT
quantiles(age, 10)
FROM mytable
Each row of this query's result corresponds to the age at that point in the sorted list of ages. The first result is the age 1/10th of the way through the sorted list, the second is the age 2/10ths of the way through, and so on.
See the 2019 update, with #standardSQL --Fh
The subquery idea works, as does using "CASE WHEN" and then doing a group by:
SELECT COUNT(field1), bucket
FROM (
SELECT field1, CASE WHEN age >= 0 AND age < 10 THEN 1
WHEN age >= 10 AND age < 20 THEN 2
WHEN age >= 20 AND age < 30 THEN 3
...
ELSE -1 END as bucket
FROM table1)
GROUP BY bucket
Alternatively, if the buckets are regular, you could just divide and cast to an integer:
SELECT COUNT(field1), bucket
FROM (
SELECT field1, INTEGER(age / 10) as bucket FROM table1)
GROUP BY bucket
With #standardSQL and an auxiliary stats query, we can define the range the histogram should cover.
Here it is for the time to fly between SFO and JFK, with 10 buckets:
WITH data AS (
  SELECT *, ActualElapsedTime datapoint
  FROM `fh-bigquery.flights.ontime_201903`
  WHERE FlightDate_year = "2018-01-01"
    AND Origin = 'SFO' AND Dest = 'JFK'
), stats AS (
  SELECT min + step * i AS min, min + step * (i + 1) AS max
  FROM (
    SELECT max - min AS diff, min, max, (max - min) / 10 AS step, GENERATE_ARRAY(0, 10, 1) i
    FROM (
      SELECT MIN(datapoint) min, MAX(datapoint) max
      FROM data
    )
  ), UNNEST(i) i
)
SELECT COUNT(*) count, (min + max) / 2 avg
FROM data
JOIN stats
  ON data.datapoint >= stats.min AND data.datapoint < stats.max
GROUP BY avg
ORDER BY avg
If you need round numbers, see: https://stackoverflow.com/a/60159876/132438
Using a cross join to get your min and max values (not that expensive on a single tuple), you can get a normalized bucket list for any given bucket count:
select
  min(data.VAL) as min,
  max(data.VAL) as max,
  count(data.VAL) as num,
  integer((data.VAL - value.min) / (value.max - value.min) * 8) as group
from [table] data
CROSS JOIN (SELECT MAX(VAL) as max, MIN(VAL) as min FROM [table]) value
GROUP BY group
ORDER BY group
In this example we're getting 8 buckets (pretty self-explanatory), plus one for null VAL.
Write a subquery like this:
(SELECT '1' AS agegroup, count(*) FROM people WHERE AGE >= 0 AND AGE <= 10)
Then you can do something like this (note the ranges are made non-overlapping so boundary ages are not counted twice):
SELECT * FROM
(SELECT '1' AS agegroup, count(*) FROM people WHERE AGE >= 0 AND AGE <= 10),
(SELECT '2' AS agegroup, count(*) FROM people WHERE AGE > 10 AND AGE <= 20),
(SELECT '3' AS agegroup, count(*) FROM people WHERE AGE > 20 AND AGE <= 120)
Result will be like:
Row  agegroup  count
1    1         somenumber
2    2         somenumber
3    3         another number
I hope this helps you. Of course, for the age group label you can write anything you like, e.g. '0 to 10'.
There is now the APPROX_QUANTILES aggregation function in standard SQL.
SELECT
APPROX_QUANTILES(column, number_of_bins)
...
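A concrete usage sketch (the table and column names are placeholders): asking for deciles returns a single array with number_of_bins + 1 elements, from the minimum to the maximum.
-- Returns one row holding an 11-element array:
-- the minimum, the nine internal decile boundaries, and the maximum.
SELECT APPROX_QUANTILES(age, 10) AS age_deciles
FROM `project.dataset.people`;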
I found gamar's approach quite intriguing and expanded on it a little, using scripting instead of the cross join. Notably, this approach also lets you change group sizes consistently, as here with group sizes that increase exponentially.
declare stats default (
  select as struct min(new_confirmed) as min, max(new_confirmed) as max
  from `bigquery-public-data.covid19_open_data.covid19_open_data`
  where new_confirmed > 0 and date = date "2022-03-07"
);
declare group_amount default 10; -- change group amount here

SELECT
  CAST(FLOOR(
    (ln(new_confirmed - stats.min + 1) / ln(stats.max - stats.min + 1)) * (group_amount - 1))
    AS INT64) AS group_flag,
  concat('[', min(new_confirmed), ',', max(new_confirmed), ']') as group_value_range,
  count(1) as quantity
FROM `bigquery-public-data.covid19_open_data.covid19_open_data`
WHERE new_confirmed > 0 AND date = date "2022-03-07"
GROUP BY group_flag
ORDER BY group_flag ASC
The basic approach is to label each value with its group_flag and then group by it. The flag is calculated by scaling the value down to a number between 0 and 1 and then scaling it back up to the range 0 to group_amount - 1.
Taking the log of the corrected value and of the range before dividing produces the desired bias in the group sizes; adding 1 makes sure we never take the log of 0.
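To make the flag formula concrete, here is a small worked check (the numbers are illustrative, not taken from the dataset):
-- Illustrative only: suppose min = 1, max = 1000, group_amount = 10,
-- and a value new_confirmed = 50. Then:
--   flag = floor( ln(50 - 1 + 1) / ln(1000 - 1 + 1) * (10 - 1) )
--        = floor( ln(50) / ln(1000) * 9 ) = floor(5.097) = 5
SELECT CAST(FLOOR(LN(50) / LN(1000) * 9) AS INT64) AS group_flag; -- returns 5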
You're looking for a single vector of information. I would normally query it like this:
select
count(*) as num,
integer( age / 10 ) as age_group
from mytable
group by age_group
A big CASE statement would be needed for arbitrary groups; it would be simple but much longer. My example is fine as long as every bucket spans the same number of years.
Take a look at this custom SQL function. It works like this:
to_bin(10, [0, 100, 500]) => '... - 100'
to_bin(1000, [0, 100, 500, 0]) => '500 - ...'
to_bin(1000, [0, 100, 500]) => NULL
Read more here:
https://github.com/AdamovichAleksey/BigQueryTips/blob/main/sql/functions/to_bins.sql
Any ideas and commits are welcome.