How to calculate monthly counts per season using a dataframe in pandas - pandas

Need to calculate the monthly average count per season for the dataset given below:
season  months  daily counts
1       2       280
1       3       360
2       1       290
3       2       750
3       4       360
I tried the code below, but the counts are daily for each month, so I couldn't get the average monthly counts:
dataseason = pd.read_csv(path, usecols=['season', 'mnth', 'cnt'])
# mean of the daily counts within each consecutive run of the same season
dataseason['col5'] = dataseason.groupby(
    dataseason['season'].ne(dataseason['season'].shift()).cumsum()
)['cnt'].transform('mean')
print(dataseason.drop_duplicates(subset='col5'))
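
One way to get there — a sketch, assuming the columns are named season, mnth, and cnt as in the snippet above — is to first sum the daily counts into monthly totals, then average those totals within each season:

import pandas as pd

dataseason = pd.read_csv(path, usecols=['season', 'mnth', 'cnt'])

# total count for each (season, month) pair
monthly = dataseason.groupby(['season', 'mnth'])['cnt'].sum()

# average of the monthly totals within each season
season_avg = monthly.groupby(level='season').mean().reset_index(name='avg_monthly_cnt')
print(season_avg)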

Related

Ensuring years and months are running as part of data cleaning

I have 2 datasets:
rainfall per month (mm) from 1982-01 to 2022-08
no. of rainy days per month per year from 1982-01 to 2022-08.
month no_of_rainy_days
0 1982-01 10
1 1982-02 5
2 1982-03 11
3 1982-04 14
4 1982-05 10
month total_rainfall
0 1982-01 107.1
1 1982-02 27.8
2 1982-03 160.8
3 1982-04 157.0
4 1982-05 102.2
Qn 1: As part of ensuring data integrity, how do I ensure that the dates run consecutively? i.e. 1982-01 is followed by 1982-02, not a skip to 1982-03?
I am unsure how to perform the check and have searched online. Is it common practice to assume that the years and months run consecutively?
First, separate the year from the month.
df.rename(columns={"month": "ym"}, inplace=True)
df[["year", "month"]] = df["ym"].astype(str).str.split("-", expand=True)
Then you can group the dataframe by year and count the number of observations (i.e., rows) per year.
observations_per_year = df["year"]\
.groupby(df["year"])\
.agg("count")\
.reset_index(name="observations")
observations_per_year[observations_per_year["observations"] < 12]
Any years with fewer than 12 observations will be displayed like so:
year observations
0 1982 11
4 1986 11
5 1987 11
6 1988 10
11 1993 11
Given the lack of detail and no sample data provided, I made some assumptions about your data:
Each data set will not have more than one row for any month of the year (i.e., a maximum of 12 rows/observations per year).
Each dataframe contains a single observation per row, as shown in your examples (so you would do this for each dataframe prior to merging them). As such, counting rows per year-month is an accurate means of counting the number of observations for any given month.
The sorted order of the data is irrelevant (you can later sort by year-month if needed).
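
Alternatively — a sketch assuming the ym column parses as YYYY-MM, as in your examples — you can compare the months present against a complete monthly range and list the gaps directly:

import pandas as pd

months = pd.PeriodIndex(df["ym"], freq="M")
expected = pd.period_range(months.min(), months.max(), freq="M")

# any month in the full range that is absent from the data
missing = expected.difference(months)
print(missing)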

Average of Moving Averages using multiple partitions

I would like to create an average of the individual moving averages per team. Each player's moving average is their own and does not depend on what team they were on that day (see example).
I have a good understanding of how to compute a moving average for an individual player, but not how to combine several that occur in different rows.
One idea I had was to first merge every row of each team into one row, but that does not seem ideal. Can I partition over two columns to accomplish this?
Bonus question: is there a way to weight the players differently in their individual moving averages depending on stat B (Example 2)?
For example:
Team A average = AVG(MA_statA_player1, MA_statA_player2, & MA_statA_player3)
Example 2:
Team A average = AVG(MA_player1*stat_b, MA_player2*stat_b, & MA_player3*stat_b)
I have data like below:
Team  ID        date       stat A  stat B
1     player1   5-31-2022  2.5     0.1
1     player2   5-31-2022  2.9     0.5
1     player3   5-31-2022  5       0.3
2     player10  5-31-2022  6       0.75
2     player12  5-31-2022  2.5     0.2
3     player10  6-01-2022  2.5     0.12
3     player2   6-01-2022  2.5     0.85
Example of expected data: each row is a team with a date and the team's moving average. The individual moving averages do not need to be there; they are shown to illustrate how the team average is generated.
No weight: Average_team = (ma_playerX + ma_playerY)/2
Team  date       ma_playerX  ma_playerY  Average_team
1     5-31-2022  3.2         2.5         2.85
2     5-31-2022  5.6         2.9         4.25
3     6-01-2022  2.5         5           3.75
The moving average for an individual player can be computed with something like:
AVG(stat_A) OVER (PARTITION BY player ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS avg7games
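
For comparison, here is a rough pandas sketch of the full computation — per-player moving averages first, then a team average of those — assuming a dataframe df with the question's columns renamed without spaces (Team, ID, date, statA, statB):

import pandas as pd

df['date'] = pd.to_datetime(df['date'])
df = df.sort_values('date')

# each player's 7-game moving average of stat A, independent of team
df['ma_statA'] = (df.groupby('ID')['statA']
                    .transform(lambda s: s.rolling(7, min_periods=1).mean()))

# unweighted: team average of the players' individual moving averages
team_avg = (df.groupby(['Team', 'date'])['ma_statA']
              .mean()
              .reset_index(name='Average_team'))

# bonus: weight each player's moving average by their stat B
df['ma_weighted'] = df['ma_statA'] * df['statB']
team_weighted = (df.groupby(['Team', 'date'])['ma_weighted']
                   .mean()
                   .reset_index(name='Average_team_weighted'))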

Is it possible to set a dynamic window frame bound in SQL OVER(ROW BETWEEN ...)-Clause?

Consider the following table, describing a patient's medication plan. For example, the first row describes that the patient with patient_id = 1 is treated from timestamp 0 to 4. At time = 0, the patient has not yet received any medication (kum_amount_start = 0). At time = 4, the patient has received a cumulated amount of 100 units of a certain drug. It can be assumed that the drug is given at a constant rate; for the first row, this means the drug is given at a rate of 25 units/h.
patient_id  starttime [h]  endtime [h]  kum_amount_start  kum_amount_end
1           0              4            0                 100
1           4              5            100               300
1           5              15           300               550
1           15             18           550               700
2           0              3            0                 150
2           3              6            150               350
2           6              10           350               700
2           10             15           700               1100
2           15             19           1100              1500
I want to add the two columns "kum_amount_start_last_6hr" and "kum_amount_end_last_6hr" that describe the amount given within the last 6 hours before the respective timestamps (starttime, endtime).
I've been stuck on this problem for a while now.
I tried to tackle it with something like this:
SUM(kum_amount) OVER (PARTITION BY patient_id ROWS BETWEEN "dynamic window size" AND CURRENT ROW)
but I'm not sure whether this is the right approach.
I would be very happy if you could help me out here, thanks!
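
One workable angle, since the cumulative amount is piecewise linear in time: evaluate that curve at t and at t - 6 and take the difference. A pandas/NumPy sketch under that assumption, using the column names from the table above and assuming each patient's rows are sorted by starttime with contiguous intervals (df is the assumed dataframe name):

import numpy as np
import pandas as pd

def add_last_6hr(group):
    group = group.copy()
    # breakpoints of the piecewise-linear cumulative-amount curve
    xs = np.r_[group['starttime'].iloc[0], group['endtime'].to_numpy()]
    ys = np.r_[group['kum_amount_start'].iloc[0], group['kum_amount_end'].to_numpy()]
    for col, tcol in [('kum_amount_start_last_6hr', 'starttime'),
                      ('kum_amount_end_last_6hr', 'endtime')]:
        t = group[tcol].to_numpy()
        # amount in (t - 6, t] = kum(t) - kum(t - 6); np.interp clamps t - 6
        # below the first breakpoint to the initial amount of 0
        group[col] = np.interp(t, xs, ys) - np.interp(t - 6, xs, ys)
    return group

result = df.groupby('patient_id', group_keys=False).apply(add_last_6hr)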

Using Pandas groupby methods, find largest values in each group

Using Pandas groupby, I have data on how much activity certain users have, on average, on any given day of the week. Grouped by user and weekday, I compute the max and mean for several users over the last 30 days.
Now I want to find, for every user, which day of the week corresponds to their daily max activity, and what is the average magnitude of that activity.
What is the method in pandas to perform such a task?
The original data looks something like this:
userID countActivity weekday
0 3 25 5
1 3 58 6
2 3 778 0
3 3 78208 1
4 3 6672 2
The object that has these groups is created from the following:
aggregations = {
    'countActivity': {
        'maxDaily': 'max',
        'meanDaily': 'mean'
    }
}
dailyAggs = df.groupby(['userID', 'weekday']).agg(aggregations)
The groupby object looks something like this:
countActivity
maxDaily meanDaily
userID weekday
3 0 84066 18275.6
1 78208 20698.5
2 172579 64930.75
3 89535 25443
4 6152 2809
Pandas groupby's filter method seems to be needed here, but I'm stumped on how to proceed.
I'd first do a groupby on 'userID', and then write an apply function to do the rest. The apply function will take a 'userID' group, perform another groupby on 'weekday' to do your aggregations, and then only return the row that contains the maximum value for maxDaily, which can be found with argmax.
def get_max_daily(grp):
    # aggregate this user's activity per weekday
    aggregations = {'countActivity': {'maxDaily': 'max', 'meanDaily': 'mean'}}
    grp = grp.groupby('weekday').agg(aggregations).reset_index()
    # keep only the weekday with the largest maxDaily
    return grp.loc[grp[('countActivity', 'maxDaily')].argmax()]

result = df.groupby('userID').apply(get_max_daily)
I've added a row to your sample data to make sure the daily aggregations were working correctly, since your sample data only contains one entry per weekday:
userID countActivity weekday
0 3 25 5
1 3 58 6
2 3 778 0
3 3 78208 1
4 3 6672 2
5 3 78210 1
The resulting output:
weekday countActivity
meanDaily maxDaily
userID
3 1 78209 78210
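
Note that nested renaming dicts in agg were deprecated in pandas 0.20 and removed in 1.0, so on a current pandas the same idea would be written with named aggregation and idxmax — a sketch:

def get_max_daily(grp):
    # named aggregation replaces the nested renaming dict
    daily = (grp.groupby('weekday')['countActivity']
                .agg(maxDaily='max', meanDaily='mean')
                .reset_index())
    # row with the largest maxDaily for this user
    return daily.loc[daily['maxDaily'].idxmax()]

result = df.groupby('userID').apply(get_max_daily)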

PowerPivot formula for row-wise weighted average

I have a table in PowerPivot which contains the logged data of a traffic control camera mounted on a road. The table holds the velocity and the number of vehicles that passed the camera during a specific time window (e.g. 14:10 - 15:25). How can I get the average velocity of cars for a specific hour and list it in a separate table with 24 rows (hours 0-23), where the second column of each row is the weighted average velocity for that hour? A sample of my stat_table data is given below:
count vel hour
----- --- ----
133 96.00237 15
117 91.45705 21
81 81.90521 6
2 84.29946 21
4 77.7841 18
1 140.8766 17
2 56.14951 14
6 71.72839 13
4 64.14309 9
1 60.949 17
1 77.00728 21
133 100.3956 6
109 100.8567 15
54 86.6369 9
1 83.96901 17
10 114.6556 21
6 85.39127 18
1 76.77993 15
3 113.3561 2
3 94.48055 2
In a separate PowerPivot table I have 24 rows and 2 columns, but when I enter my formula, every row gets updated with the same number. My formula is:
=SUMX(FILTER(stat_table, stat_table[hour]=[hour]), stat_table[count] * stat_table[vel]) / SUMX(FILTER(stat_table, stat_table[hour]=[hour]), stat_table[count])
Create a new calculated column named "WeightedVelocity" as follows:
WeightedVelocity = [count] * [vel]
Create a measure "WeightedAverage" as follows:
WeightedAverage = SUM(stat_table[WeightedVelocity]) / SUM(stat_table[count])
Use the "WeightedAverage" measure in the VALUES area of the pivot table and the "hour" column in ROWS to get the desired result.
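
For a quick sanity check of the numbers outside PowerPivot, the same weighted average can be computed in pandas (a sketch, assuming stat_table is loaded as a dataframe with the count, vel, and hour columns above):

import pandas as pd

# weighted average velocity per hour: sum(count * vel) / sum(count)
weighted_avg = stat_table.groupby('hour').apply(
    lambda g: (g['count'] * g['vel']).sum() / g['count'].sum())
print(weighted_avg)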