Calculating mean of a specific column by specific rows - pandas

I have a dataframe that looks like the one in the pictures.
Now I want to add a new column showing the average power for each day (the data is sampled every 5 minutes), but separately for day and night, as given by the day_or_night column (day = 0, night = 1). I've gotten this far:
train['avg_by_day'][train['day_or_night']==1] = train['power'][train['day_or_night']==1].mean()
train['avg_by_day'][train['day_or_night']==0] = train['power'][train['day_or_night']==0].mean()
but this just adds the average of all the power values that correspond to day (or, similarly, night), which isn't what I'm after: a separate average for each individual day and night.
I need something like: train['avg_by_day'] == train.power.mean() when day == 1 and day_or_night == 1, and so on for each day.

So you want to group the dataframe by day and day_or_night and create a new column with mean power values for each group:
train['avg_by_day'] = train.groupby(['day','day_or_night'])['power']\
.transform('mean')
Maybe you should also include year and month in the grouping columns because otherwise it's going to group the 1st day of every month together, same for the 2nd day and so on.
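For illustration, a minimal runnable sketch of that grouping (the year/month/day column names are assumptions based on the screenshots):
import pandas as pd

# Toy frame standing in for the screenshots: two calendar days, each
# with day (0) and night (1) readings.
train = pd.DataFrame({
    'year': [2021] * 8,
    'month': [1] * 8,
    'day': [1, 1, 1, 1, 2, 2, 2, 2],
    'day_or_night': [0, 0, 1, 1, 0, 0, 1, 1],
    'power': [10, 20, 30, 40, 50, 60, 70, 80],
})

# One mean per (year, month, day, day_or_night) group, broadcast back
# to every row of that group.
train['avg_by_day'] = (
    train.groupby(['year', 'month', 'day', 'day_or_night'])['power']
         .transform('mean')
)
print(train)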

Related

Moving Average Template in Power Query

I am new to Power Query and I'm trying to create a query template that applies a 3-week moving average to the volume and sales for the next 3 weeks, grouped by weekday for each product.
My input table looks like this.
Note: I didn't include all the data but there is data for each weekday (Mon, Tue, Wed, Thu, Fri, Sat, Sun) and there are more products.
Below are the steps that I've been doing:
1. Sort the data first by call type and then by weekday.
2. Then group the rows and create an index for each product-weekday with the following code:
#"Grouped Rows" = Table.Group(#"Sorted Rows", {"product", "weekday"}, {{"Index", each Table.AddIndexColumn(_, "Index", 1, 1), type table}}),
#"Expanded Index" = Table.ExpandTableColumn(#"Grouped Rows", "Index", {"date", "volume", "sales", "Index"}, {"Index.date", "Index.volume", "Index.sales", "Index.Index"})
3. Then I apply the 3-week moving average to the volume:
step1 = Table.AddColumn(#"Expanded Index", "Average_vol", each if [Index.Index] > 2 then List.Average(List.Range(#"Expanded Index"[Index.volume], [Index.Index] - 3, 3)) else null, type number)
So now I have this:
[screenshot of the result table; the highlighted cells are the moving average expected for the next week]
It's working, but now I need to include the numbers in yellow for the next week.
I'm stuck on that step and I need to create the template.
Do you know how I can apply the moving average for the next 2 weeks, including the calculated numbers for the next week (the yellow numbers)?
Output
So what I'm trying to do is create the moving average for the next week, then use this number as input to calculate the moving average for the 2nd week, and do the same to calculate the 3rd week.
You can just add three custom columns with = Date.AddDays([date],7), = Date.AddDays([date],14) and = Date.AddDays([date],21).
Then merge the data on top of itself three times, using [product] and [date] matching against [product] and either [date1], [date2] or [date3]. Expand the merged data to pull out the sales and volume from those 3 weeks. After the expands, each row will have that week's own number, along with the numbers from the next 3 weeks. Average them.
Note this is not going to take the moving average from the first week and somehow use that in the following weeks' calculations.
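For illustration only, here is the same merge-the-table-onto-itself pattern sketched in pandas rather than Power Query (the data and column names are made up):
import pandas as pd

# Toy data: one product with weekly rows.
df = pd.DataFrame({
    'product': ['A'] * 6,
    'date': pd.date_range('2021-01-04', periods=6, freq='7D'),
    'volume': [10, 12, 11, 13, 12, 14],
})

# Three lookup columns pointing at the next three weeks.
for i, days in enumerate([7, 14, 21], start=1):
    df[f'date{i}'] = df['date'] + pd.Timedelta(days=days)

# Merge the table onto itself so each row also carries the volume of
# the rows dated 1, 2 and 3 weeks later.
out = df
for i in [1, 2, 3]:
    lookup = df[['product', 'date', 'volume']].rename(
        columns={'date': f'date{i}', 'volume': f'volume_wk{i}'})
    out = out.merge(lookup, on=['product', f'date{i}'], how='left')

# Average the current week with the next three.
out['avg_vol'] = out[
    ['volume', 'volume_wk1', 'volume_wk2', 'volume_wk3']].mean(axis=1)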

Sampling from a DataFrame on a daily basis

In my data frame, I have data for 3 months, organized per day (for every day I have a different number of samples; for example, on 1st January I have 20K rows of samples and on 2nd January there are 15K samples).
What I need is to take the mean number of samples per day and apply it to the whole data frame.
For example, if the mean value is 8K, I want to get 8K random rows from the 1st January data, 8K rows randomly from 2nd January, and so on.
As far as I know, rand() will give random values from the whole data frame, but I need to apply it per day, since my data frame is on a daily basis and the date is mentioned in a column of the data frame.
Thanks
You can use GroupBy.sample after computing the mean number of records per day:
# Suppose 'date' is the name of your column
sample = df.groupby('date').sample(n=int(df['date'].value_counts().mean()))
# Or
g = df.groupby('date')
sample = g.sample(n=int(g.size().mean()))
Update
Is there any solution for the dates whose count is lower than the mean? I get this error for those dates: Cannot take a larger sample than population when 'replace=False'
import numpy as np

# Sample with replacement so small dates don't raise, then drop the
# duplicate rows that replacement can produce (each original row is
# kept at most once, so short dates simply return all their rows).
n = np.floor(df['date'].value_counts().mean()).astype(int)
sample = (df.groupby('date').sample(n, replace=True)
            .loc[lambda x: ~x.index.duplicated()])

Changing the year of a slice/series obtained from a pandas dataframe

I have a large dataset, which is the imported schedule (spanning multiple years) for my team. I cleaned the data (made it long instead of wide); however, I encountered a problem.
First an explanation of the data:
'year' and 'period' are obtained from the split sheetname. Both strings.
'week' the week of the year, obtained from the roster. Float.
'date' converted from string, for which I wrote a function, as the dates were in Dutch and needed to be normalized; no year was defined, so the year from the first column is used. After processing: datetime format.
'shift' the type of shift it belongs to. S1 > early, S2 > late, S3 > night.
Each row is assigned to one of my employees; those names are erased for privacy reasons.
I've written a class with several methods that apply rules our government enforces on schedules.
Now my problem:
As you can see: entries 1137 and 1138 should belong to the year 2022. But how do I change this easily? I tried:
for week, date in prepocessed_data_merged[['week', 'date']].values:
    # There are always more than 52 weeks in a year.
    # If the month of the date in week 52 is 1 (Jan), then something is wrong.
    if (week == 52) & (date.month == 1):
        prepocessed_data_merged.loc[(prepocessed_data_merged['week'] == week)
                                    & (prepocessed_data_merged['date'] == date), 'date'] = ???
But as you might expect this returns a series since there are three shifts on a day, so three entries of a date that need their year changed. So, how does one change the year of a selected series/slice, simultaneously changing it in the dataframe?
I know I can use: dt.replace(year=current_year+1) but how do I enforce this replace on this selected series in the preprocessed_data DF? Thanks in advance!
Have you tried:
cond = prepocessed_data_merged['week'].eq(52) & prepocessed_data_merged['date'].dt.month.eq(1)
prepocessed_data_merged.loc[cond, 'date'] += pd.DateOffset(years=1)
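As a quick check, a toy example (made-up data) showing that the offset only touches the matching rows:
import pandas as pd

df = pd.DataFrame({
    'week': [52, 52, 1],
    'date': pd.to_datetime(['2021-01-01', '2021-01-01', '2021-01-03']),
})

# Week-52 rows whose date already rolled over into January belong to
# the next year.
cond = df['week'].eq(52) & df['date'].dt.month.eq(1)
df.loc[cond, 'date'] += pd.DateOffset(years=1)
print(df)  # rows 0 and 1 become 2022-01-01; row 2 keeps 2021-01-03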

Trying to create a well count to compare to BOE using the on production date and comparing it to Capital spends and total BOE

I have data that includes the below columns:
Date
Total Capital
Total BOED
On Production Date
UWI
I'm trying to create a well count based on the unique UWI for each On Production Date and graph it against the Total BOED/Total Capital with Date as the x-axis.
I've tried unique count by UWI but it then populates ALL rows of that UWI with the same well count total, so when it is summed the numbers are multiplied by the row count.
Plot the x-axis as Date and the y-axis as Total BOED and Well Count.
Add a calculated column to create a row id using the rowid() function. Then, in the calculation you already have, the one that populates all rows of the UWI with the same well count, add the following logic...
if([rowid] = Min([rowid]) OVER ([UWI]), UniqueCount([UWI]) OVER ([On Production Date]), null)
This will make it so that the count only populates once.
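For illustration only, the core trick (write the value only on the first row of each UWI so later sums count each well once) sketched in pandas rather than a Spotfire expression, with toy data:
import pandas as pd

df = pd.DataFrame({
    'UWI': ['W1', 'W1', 'W2', 'W2', 'W3'],
    'On Production Date': pd.to_datetime(['2021-06-01'] * 5),
})

# Populate a value only on the first row of each UWI (the pandas
# analogue of the [rowid] = Min([rowid]) OVER ([UWI]) test), so a
# later sum counts each well exactly once instead of once per row.
df['well_count'] = (~df.duplicated('UWI')).astype(int)
print(df.groupby('On Production Date')['well_count'].sum())  # 3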

Creating a DAX pattern that counts days between a date field and a month value on a chart's x-axis

I am struggling with a DAX pattern to allow me to plot an average duration value on a chart.
Here is the problem: My dataset has a field called dtOpened which is a date value describing when something started, and I want to be able to calculate the duration in days since that date.
I then want to be able to create an average duration since that date over a time period.
It is very easy to do when thinking about the value as it is now, but I want to be able to show a chart that describes what that average value would have been over various time periods on the x-axis (month/quarter/year).
The problem that I am facing is that if I create a calculated column to find the current age (NOW() - [dtOpened]), then it always uses the NOW() function - which is no use for historic time spans. Maybe I need a Measure for this, rather than a calculated column, but I cannot work out how to do it.
I have thought about using LASTDATE (rather than NOW) to work out what the last date would be in the filter context of any single month/quarter/year, but if the current month is only half way through, then it would probably need to consider today's date as the value from which to subtract the dtOpened value.
I would appreciate any help or pointers that you can give me!
It looks like you have a table (let's call it Cases) storing your cases with one record per case with fields like the following:
casename, dtOpened, OpenClosedFlag
You should create a date table with one record per day spanning your date range. The date table will have a month-ending date field identifying the last day of the month (and the same for quarter & year). But this will be a disconnected date table: don't create a relationship between the Date on the date table and your case open date.
Then use the iterative AVERAGEX function to average the date differences.
Average Duration (days) :=
CALCULATE (
    AVERAGEX ( Cases, MAX ( DateTable[Month Ending] ) - Cases[dtOpened] ),
    FILTER ( Cases, Cases[OpenClosedFlag] = "Open" ),
    FILTER ( Cases, Cases[dtOpened] <= MAX ( DateTable[Month Ending] ) )
)
Once you plot the measure against your Month you should see the average values represented correctly. You can do something similar for quarter & year.
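For intuition only, the same calculation sketched in pandas with toy data (the DAX measure does this per month via the filter context):
import pandas as pd

# Toy cases table mirroring the fields above.
cases = pd.DataFrame({
    'casename': ['a', 'b', 'c'],
    'dtOpened': pd.to_datetime(['2021-01-10', '2021-02-05', '2021-03-20']),
    'OpenClosedFlag': ['Open', 'Open', 'Open'],
})

# One row per month-ending date, like the disconnected date table.
month_ends = pd.to_datetime(['2021-01-31', '2021-02-28', '2021-03-31'])

# For each month end: the average age in days of the open cases that
# had already been opened by that date.
rows = []
for me in month_ends:
    mask = (cases['OpenClosedFlag'] == 'Open') & (cases['dtOpened'] <= me)
    age = (me - cases.loc[mask, 'dtOpened']).dt.days.mean()
    rows.append({'month_end': me, 'avg_duration_days': age})
print(pd.DataFrame(rows))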
You're a genius, Rory; Thanks.
In my example, I had a dtClosed field rather than an Opened/Closed flag, so there was one extra piece of filtering to do to test if the Case was closed at that point in time. So my measure ended up looking like this:
Average Duration :=
CALCULATE (
    AVERAGEX ( CasesOnly, MAX ( DT[LastDateM] ) - CasesOnly[Owner Opened dtOnly] ),
    FILTER (
        CasesOnly,
        OR (
            ISBLANK ( CasesOnly[Owner Resolution dtOnly] ),
            CasesOnly[Owner Resolution dtOnly] > MAX ( DT[LastDateM] )
        )
    ),
    FILTER ( CasesOnly, CasesOnly[Owner Opened dtOnly] <= MAX ( DT[LastDateM] ) )
)
And to get the chart, I plotted the DT[Date] field on the x-axis.
Thanks very much again.