Average of Moving Averages using multiple partitions - google-bigquery

I would like to create an average of individual moving averages per team. Each player's moving average is their own and does not depend on which team they were on that day (see the example below).
I have a good understanding of how to compute a moving average for an individual player, but not how to combine several that occur in different rows.
One idea I had was to first merge each team's rows into a single row, but that does not seem ideal. Can I partition over two columns to accomplish this?
Bonus question: is it possible to weight each player's individual moving average by stat B (Example 2)?
For example:
Team A average = AVG(MA_statA_player1, MA_statA_player2, MA_statA_player3)
Example 2:
Team A average = AVG(MA_player1*stat_b, MA_player2*stat_b, MA_player3*stat_b)
I have data like below:
Team  ID        date       stat A  stat B
1     player1   5-31-2022  2.5     0.1
1     player2   5-31-2022  2.9     0.5
1     player3   5-31-2022  5       0.3
2     player10  5-31-2022  6       0.75
2     player12  5-31-2022  2.5     0.2
3     player10  6-01-2022  2.5     0.12
3     player2   6-01-2022  2.5     0.85
Example expected data: each row is a team and date with the team's moving average. The individual moving averages do not need to be in the output; they are shown here to illustrate how the team average is generated.
No weight: Average_team = (ma_playerX + ma_playerY)/2
Team  date       ma_playerX  ma_playerY  Average_team
1     5-31-2022  3.2         2.5         2.85
2     5-31-2022  5.6         2.9         4.25
3     6-01-2022  2.5         5           2.25
This is what I use for an individual player's moving average:
AVG(stat_A) OVER (PARTITION BY player ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) + 2 AS avg7games
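One way to do this without partitioning over two columns: compute each player's moving average in a subquery partitioned by player only, then take a plain GROUP BY average of those values per team and date. Below is a minimal sketch (untested), assuming a hypothetical table `project.dataset.games` with the columns shown above (with date stored as a DATE); the last column sketches the stat_B-weighted variant from the bonus question:

WITH player_ma AS (
  SELECT
    Team,
    ID,
    date,
    stat_B,
    AVG(stat_A) OVER (
      PARTITION BY ID            -- each player's own history, regardless of team
      ORDER BY date
      ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS ma_stat_a
  FROM `project.dataset.games`
)
SELECT
  Team,
  date,
  AVG(ma_stat_a) AS average_team,                                    -- unweighted
  SUM(ma_stat_a * stat_B) / NULLIF(SUM(stat_B), 0) AS weighted_team  -- weighted by stat_B
FROM player_ma
GROUP BY Team, date
ORDER BY date, Team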

Related

How to calculate monthly counts per season using a dataframe in pandas

I need to calculate the monthly average count per season for the dataset below:
season  month  daily counts
1       2      280
1       3      360
2       1      290
3       2      750
3       4      360
I tried the code below, but the counts are daily within each month, so I couldn't get the average monthly counts:
import pandas as pd

dataseason = pd.read_csv(path, usecols=['season', 'mnth', 'cnt'])
# groups consecutive runs of the same season, not (season, month) pairs
dataseason['col5'] = dataseason.groupby(dataseason['season'].ne(dataseason['season'].shift()).cumsum())['cnt'].transform('mean')
print(dataseason.drop_duplicates(subset='col5'))
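For comparison, the intended per-season average is a plain GROUP BY in SQL; a sketch (untested), assuming a hypothetical table season_counts with the same columns:

SELECT
  season,
  AVG(cnt) AS avg_monthly_count  -- mean of the monthly counts within each season
FROM season_counts
GROUP BY season
ORDER BY season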

Combining aggregate and analytics functions in BigQuery to reduce table size

I am facing an issue that is more about query design than SQL specifics. Essentially, I am trying to reduce a dataset's size by carrying out the transformations described below.
I start with an irregularly sampled time series of voltage measurements:
Seconds  Volts
0.1      2899
0.15     2999
0.17     2990
0.6      3001
0.98     2978
1.2      3000
1.22     3003
3.7      2888
4.1      2900
4.11     3012
4.7      3000
4.8      3000
I bin the data into buckets, where data points that are close to one another fall into the same bucket. In this example, I bin the data into 1-second buckets simply by dividing the Seconds column by 1. I also add an ordering number to each group. I use the query below:
WITH qry1 AS (
  SELECT
    Seconds,
    Volts,
    DIV(CAST(Seconds AS NUMERIC), 1) AS BinNo,
    RANK() OVER (PARTITION BY DIV(CAST(Seconds AS NUMERIC), 1) ORDER BY Seconds) AS BinRank
  FROM
    project.rawdata
)
SELECT * FROM qry1

This gives:
Seconds  Volts  BinNo  BinRank
0.1      2899   0      1
0.15     2999   0      2
0.17     2990   0      3
0.6      3001   0      4
0.98     2978   0      5
1.2      3000   1      1
1.22     3003   1      2
3.7      2888   3      1
4.1      2900   4      1
4.11     3012   4      2
4.7      3000   4      3
4.8      3000   4      4
Now comes the part I am struggling with. I am attempting to get the following output from a query acting on the above table. Keeping the time order is important, as I need to plot these values on a line-style chart. For each group:
Get the first row ('first' meaning earliest Seconds value)
Get the max and min of the Volts field, and associate these with the earliest (could be the latest too, I guess) Seconds value
Get the last row ('last' meaning latest Seconds value)
The conditions for this query are:
If there is only one row in the group, simply assign the Volts value for that row as both the max and the min and only use the single Seconds value for that group
If there are only two rows in the group, simply assign the Volts values for both the max and min to the corresponding first and last Seconds values, respectively.
(Now for the part I am struggling with) If there are three rows or more per group, extract the first and last rows as above, but then also get the max and min over all rows in the group and assign these to the max and min values for an intermediate row between the first and last row. The output would be as below. As mentioned, this step could be associated with any position between the first and last Seconds values, and here I have assigned it to the first Seconds value per group.
Seconds  Volts_min  Volts_max  OrderingCol
0.1      2899       2899       1
0.1      2899       3001       2
0.98     2978       2978       3
1.2      3000       3000       1
1.22     3003       3003       2
3.7      2888       2888       1
4.1      2900       2900       1
4.1      2900       3012       2
4.8      3000       3000       3
This will then allow me to plot these values using a custom charting library which we have without overloading the memory. I can extract the first and last rows per group by using analytics functions and then doing a join, but cannot get the intermediate values. The Ordering Column's goal is to enable me to sort the table before pulling the data to the dashboard. I am attempting to do this in BigQuery as a first preference.
Thanks :)
Below should do it
select Seconds, values.*,
  row_number() over(partition by bin_num order by Seconds) as OrderingCol
from (
  select *,
    -- show the first row, the last row, and (for bins of 3+ rows) a second row carrying the bin's min/max
    case
      when row_num = 1 or row_num = rows_count then true
      when rows_count > 2 and row_num = 2 then true
    end toShow,
    case
      when row_num = 1 then struct(first_row.Volts as Volts_min, first_row.Volts as Volts_max)
      when row_num = rows_count then struct(last_row.Volts as Volts_min, last_row.Volts as Volts_max)
      else struct(min_val as Volts_min, max_val as Volts_max)
    end values
  from (
    select *,
      div(cast(Seconds as numeric), 1) as bin_num,
      row_number() over win_all as row_num,
      count(1) over win_all as rows_count,
      min(Volts) over win_all as min_val,
      max(Volts) over win_all as max_val,
      first_value(t) over win_with_order as first_row,
      last_value(t) over win_with_order as last_row
    from `project.dataset.table` t
    window
      win_all as (partition by div(cast(Seconds as numeric), 1)),
      win_with_order as (partition by div(cast(Seconds as numeric), 1) order by Seconds)
  )
)
where toShow
# order by Seconds
Applied to the sample data in your question, this produces the expected output above.

Displaying records with a column value of 0 when there is no record or data

I have tables like these:
select id, category from table1
select date, id, hrs from table2
By joining them I get a table like the one below; let's call it mixTable:
Date Category Hrs
01/23/2017 One 3.5
01/30/2017 Two 2.3
01/20/2017 Three 0.6
01/18/2017 Four 4.3
02/13/2017 One 6.2
02/15/2017 Two 4
02/20/2017 Four 2.2
03/16/2017 One 1
03/25/2017 Two 4.3
03/20/2017 Three 3.6
03/18/2017 Four 2.5
04/26/2017 One 2.5
04/30/2017 Two 3.3
04/22/2017 Three 2.1
I am looking for output like this; let's call it mixTable2:
Date Category Hrs
Jan-17 One 3.5
Jan-17 Two 2.3
Jan-17 Three 0.6
Jan-17 Four 4.3
Feb-17 One 6.2
Feb-17 Two 4
Feb-17 Three 0
Feb-17 Four 2.2
Mar-17 One 1
Mar-17 Two 4.3
Mar-17 Three 3.6
Mar-17 Four 2.5
Apr-17 One 2.5
Apr-17 Two 3.3
Apr-17 Three 2.1
Apr-17 Four 0
As you can see, both tables have Date, Category, and Hrs. In the output table I want to show a missing month's value as 0 even when it is not recorded in the source table; for example, Category Three has no Feb-17 record and Category Four has no Apr-17 record.
I'm trying to figure out how to show rows in a table that do not have corresponding values in another table.
This is what I would do in TSQL.
SELECT
  CASE WHEN ISNULL(h.hrs, 0) = 0 THEN 'NOT RECORDED'
       ELSE w.category END AS [Category],
  w.wonum, w.completed_date, ISNULL(h.hrs, 0) AS [hrs]
FROM workorder w
LEFT JOIN workhrs h (NOLOCK) ON w.wonum = h.wonum
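The same gap-filling idea works in standard SQL (BigQuery included): cross join the distinct months with the distinct categories, left join the facts back, and coalesce missing hours to 0. A sketch (untested), assuming the joined table is mixTable(Date, Category, Hrs):

WITH months AS (
  SELECT DISTINCT DATE_TRUNC(Date, MONTH) AS month FROM mixTable
),
categories AS (
  SELECT DISTINCT Category FROM mixTable
)
SELECT
  m.month,
  c.Category,
  COALESCE(SUM(t.Hrs), 0) AS Hrs  -- 0 when nothing is recorded for that month/category
FROM months m
CROSS JOIN categories c
LEFT JOIN mixTable t
  ON DATE_TRUNC(t.Date, MONTH) = m.month AND t.Category = c.Category
GROUP BY m.month, c.Category
ORDER BY m.month, c.Category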
If you're doing this in Power BI, I would do the following:
Create a calculated column on mixTable for the month:
Month = EOMONTH(mixTable[Date],0)
Cross join the Month and Category columns into a new table.
mixTable2 = CROSSJOIN(VALUES(mixTable[Month]),VALUES(mixTable[Category]))
Create a calculated column on this new table for the sum of the hours:
Hours = SUMX(
FILTER(MixTable,
MixTable[Category] = MixTable2[Category] &&
MixTable[Month] = MixTable2[Month]),
MixTable[Hrs])
If you want zeros instead of blanks, wrap the above in IF(ISBLANK(...),0,...).

Tableau: How to get moving average with respect to day of week in last 4 weeks?

E.g., if I have data like below:
        S  M  T  W  T  F  S
Week 1  2  5  6  7  5  5  3
Week 2  4  5  7  2  4  3  2
Week 3  4  5  2  1  2  7  8
If today is Monday, my average will be (5+5+5)/3, which is 5. Tomorrow it will be (6+7+2)/3, which is 5 again, and the day after it will be (7+2+1)/3, which is 3.33.
How to get this in Tableau?
First, you can use "Weekday" as a column or row (by right-clicking on the date).
Then you can simply add a "Moving Average" table calculation with "Week of [Date]" as the computing dimension.
(Screenshots omitted: table calculation settings and the result.)
Data source used: Tableau Sample Superstore.
You can do the following:
Columns: Week(Order Date)
Rows: Weekday(Order Date)
Put Sales on Text.
Right-click Sales > Quick Table Calculation > Moving Average
Right-click Sales > Edit quick table calculation
Set the following:
Moving along: "Table across"
Previous values: 4
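For reference, the same weekday-over-weeks moving average can be written directly in SQL, in the spirit of the BigQuery question above. A sketch (untested), assuming a hypothetical table daily_sales(order_date, amount):

SELECT
  order_date,
  AVG(amount) OVER (
    PARTITION BY EXTRACT(DAYOFWEEK FROM order_date)  -- compare like weekdays only
    ORDER BY order_date
    ROWS BETWEEN 3 PRECEDING AND CURRENT ROW         -- this weekday plus the 3 previous weeks
  ) AS weekday_moving_avg
FROM daily_sales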

Perform conditional calculations using data distributed over multiple records

I am trying to calculate win rates for players using different 'nations'. The raw data I get is on a per-player, per-game basis, so in a 1v1 game I will get two entries in the database: one records the 'win' for one team, and the other records the 'loss' for the opposing team. The issue is that there are multiple different 'nations', and I want to calculate the nation-vs-nation win rate, as opposed to the overall generalized win rate per team.
Looking at the example below, I want to be able to calculate the rate at which dogs beat cats, cats beat mice, and mice beat dogs.
Here is a simplified toy model of the data I'm working with
date match sessionid team nation result
1/1/2016 1 143138354 0 cats loss
1/1/2016 1 143146203 1 dogs win
1/1/2016 2 143134711 0 mice win
1/1/2016 2 143165199 1 dogs loss
1/1/2016 3 143183402 0 cats win
1/1/2016 3 143127251 1 mice loss
1/1/2016 4 143192433 0 cats win
1/1/2016 4 143129777 1 dogs loss
1/1/2016 5 143197254 0 mice win
1/1/2016 5 143147178 1 dogs loss
1/1/2016 6 143220297 0 cats loss
1/1/2016 6 143168454 1 mice win
1/1/2016 7 143169544 0 cats win
1/1/2016 7 143188824 0 cats win
1/1/2016 7 143178786 1 mice loss
1/1/2016 7 143212127 1 dogs loss
I've considered something like
SELECT
match,
CASE WHEN nation='cats' AND result='loss' AND nation='dogs' AND result='win' THEN 'dogs_over_cats' END as result
FROM
table
GROUP BY
match
But of course that doesn't work, because nation can't be 'cats' and 'dogs' at the same time.
What I want is something like this
date, match, winning_nation, losing_nation
or alternatively
date, match, result
where result would be a string indicating who beat who ('dogs_over_cats') or something.
I have no idea how to do this. It seems like it should be pretty simple, but I can't figure it out. How do I get a CASE statement to consider field values over multiple records at the same time? Is that possible? Do I just have to use lag/lead functions?
Thanks
Brad
You can transform it like this:
select A1.date, A1.match, A1.nation as winning_nation, A2.nation as losing_nation
from tableA A1
inner join tableA A2
  on A1.match = A2.match
where A1.result = 'win'
  and A2.result = 'loss'
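From those winner/loser pairs, the nation-vs-nation win rate is one more aggregation away. A sketch (untested), reusing the hypothetical tableA; note that a match with several rows per side (like match 7 above) yields one pair per winner/loser combination:

WITH pairs AS (
  SELECT A1.match, A1.nation AS winner, A2.nation AS loser
  FROM tableA A1
  INNER JOIN tableA A2 ON A1.match = A2.match
  WHERE A1.result = 'win' AND A2.result = 'loss'
),
counts AS (
  SELECT winner, loser, COUNT(*) AS wins
  FROM pairs
  GROUP BY winner, loser
)
SELECT
  a.winner AS nation,
  a.loser AS opponent,
  a.wins / (a.wins + COALESCE(b.wins, 0)) AS win_rate  -- share of meetings won
FROM counts a
LEFT JOIN counts b
  ON a.winner = b.loser AND a.loser = b.winner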