Sampling from a DataFrame on a daily basis - pandas

In my data frame I have data for 3 months, on a daily basis, with a different number of samples for each day (for example, on 1st January I have 20K rows of samples and on 2nd January there are 15K).
What I need is to take the mean number of rows per day and apply it to the whole data frame.
For example, if the mean value is 8K, I want to get 8K random rows from the 1st January data, 8K random rows from 2nd January, and so on.
As far as I know, sample() will pick random rows from the whole data frame, but I need to apply it per day, since my data frame is on a daily basis and the date is stored in a column of the data frame.
Thanks

You can use GroupBy.sample after computing the mean number of records per day:
# Suppose 'date' is the name of your date column
sample = df.groupby('date').sample(n=int(df['date'].value_counts().mean()))
# Or, equivalently
g = df.groupby('date')
sample = g.sample(n=int(g.size().mean()))
Update
Is there any solution for dates whose row count is lower than the mean? I get this error for those dates: Cannot take a larger sample than population when 'replace=False'
import numpy as np

n = np.floor(df['date'].value_counts().mean()).astype(int)
# Sample with replacement so that days with fewer than n rows don't raise,
# then drop the duplicate rows that replacement introduces (this assumes
# the original index is unique)
sample = (df.groupby('date').sample(n, replace=True)
            .loc[lambda x: ~x.index.duplicated()])
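As a quick self-contained illustration of the replace=True / de-duplication trick (synthetic data; the column names and values here are my assumptions, not from the original question):

import numpy as np
import pandas as pd

# Three days with 5, 3 and 1 rows; the mean count is (5 + 3 + 1) / 3 = 3
df = pd.DataFrame({'date': ['d1'] * 5 + ['d2'] * 3 + ['d3'],
                   'x': range(9)})
n = np.floor(df['date'].value_counts().mean()).astype(int)  # n = 3
sample = (df.groupby('date').sample(n, replace=True)
            .loc[lambda x: ~x.index.duplicated()])
# 'd3' keeps its single row instead of raising an error. One caveat:
# because the sampling is with replacement, even a day with n or more
# rows can come back with fewer than n distinct rows.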

Related

Calculate mean for pandas dataframe based on date

My dataset is as follows (two columns: DATE and RATE).
I want to get the mean of RATE for each day (as you can see from the dataset, there are multiple rate values for the same day). I have about 1,000 rows, so I am trying to find an easy way to calculate the mean for each day and then save the results to a data frame.
You have to group by date, then aggregate:
https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.aggregate.html
In your case
df.groupby('DATE').agg({'RATE': ['mean']})
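Note that passing a dict of lists gives the result a MultiIndex column, ('RATE', 'mean'). If you prefer a flat column, a named aggregation (my variation, not part of the answer above) does the same thing:
df.groupby('DATE').agg(mean_rate=('RATE', 'mean'))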
You can group by the date and take the mean:
new_df = df.groupby('DATE').mean()
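If you want DATE back as an ordinary column instead of the index, a small variation (again my addition, not from the answer above):
new_df = df.groupby('DATE', as_index=False)['RATE'].mean()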

How to select a range of consecutive dates of a dataframe with many users in pandas

I have a dataframe with 19M rows for different customers (~10K customers) and their daily consumption over different date ranges. I have resampled this data into weekly consumption, and the resulting dataframe has 2M rows. I want to know the ranges of consecutive dates for each customer and select those with the maximum range. Any ideas? Thank you!
It would be great if you could post some example code, so the replies will be more specific.
You probably want to do something like earliest = df.groupby('Customer_ID').min()['Consumption_date'] to get the earliest consumption date per customer, latest = df.groupby('Customer_ID').max()['Consumption_date'] for the latest one, and then take the difference time_span = latest - earliest to get the time span per customer.
Knowing the specific df and variable names would be great.
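Since the question is specifically about consecutive dates, here is a minimal sketch of the usual gaps-and-islands approach, assuming (hypothetical names, as above) a Customer_ID column and a weekly Consumption_date datetime column:

import pandas as pd

df = df.sort_values(['Customer_ID', 'Consumption_date'])
# A new run ("island") starts whenever the gap to the previous row of the
# same customer is not exactly one week
new_run = (df.groupby('Customer_ID')['Consumption_date'].diff()
             != pd.Timedelta(weeks=1))
df['run_id'] = new_run.cumsum()
runs = (df.groupby(['Customer_ID', 'run_id'])['Consumption_date']
          .agg(['min', 'max', 'size']))
# Longest consecutive range per customer
longest = runs.loc[runs.groupby('Customer_ID')['size'].idxmax()]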

Calculating mean of a specific column by specific rows

I have a dataframe that looks like the one in the pictures.
Now I want to add a new column that shows the average power for each day (the data is sampled every 5 minutes), computed separately for day and night (day = 0, night = 1 in the day_or_night column). I've gotten this far:
train['avg_by_day'][train['day_or_night']==1] = train['power'][train['day_or_night']==1].mean()
train['avg_by_day'][train['day_or_night']==0] = train['power'][train['day_or_night']==0].mean()
but this just adds the average of all the power values corresponding to day (or, likewise, to night) across the whole dataset, which isn't what I'm after: I want a separate average for each individual day and each individual night.
I need something like train['avg_by_day'] = train.power.mean() when day == 1 and day_or_night == 1, and this for each day.
So you want to group the dataframe by day and day_or_night and create a new column with mean power values for each group:
train['avg_by_day'] = (train.groupby(['day', 'day_or_night'])['power']
                            .transform('mean'))
Maybe you should also include the year and month in the grouping columns, because otherwise the 1st day of every month will be grouped together, and likewise for the 2nd day, and so on.
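For example, a sketch assuming the timestamp lives in a datetime column called date (an assumption; the question doesn't name it): grouping by the full calendar date sidesteps that problem entirely:

train['avg_by_day'] = (train.groupby([train['date'].dt.date, 'day_or_night'])['power']
                            .transform('mean'))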

Calculating the rolling exponential weighted moving average for each share price over time

This question is similar to my previous one: Shifting elements of column based on index given condition on another column
I have a dataframe (df) with 2 columns and 1 index.
The index is a datetime index in the format 2001-01-30, etc.; it is ordered by date, the dates are monthly, and there are thousands of rows with identical dates. Column A is the company name (which corresponds to the date); column B holds the share prices for the companies in column A on the date in the index.
There are multiple companies in column A for each date, and the companies vary over time (so the data is not fully predictable).
I want to create a column C which has the 3-day rolling exponentially weighted average of the price for a particular company in column A, using the current date and the 2 dates before.
I have tried a few methods but have failed. Thanks.
Try:
# Note: ewm's first positional argument is com, so ewm(3) means com=3;
# for the 3-period span the question asks for, pass span=3 explicitly
df.groupby('ColumnA', as_index=False).apply(lambda g: g.ColumnB.ewm(span=3).mean())
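A self-contained sketch of the idea with made-up data (the names ColumnA/ColumnB come from the answer above and are assumptions about the real frame), using transform so the result aligns back to the original rows as the new column C:

import pandas as pd

df = pd.DataFrame(
    {'ColumnA': ['AAPL', 'MSFT', 'AAPL', 'MSFT', 'AAPL'],
     'ColumnB': [10.0, 20.0, 11.0, 21.0, 12.0]},
    index=pd.to_datetime(['2001-01-30', '2001-01-30',
                          '2001-02-27', '2001-02-27', '2001-03-30']))
df['ColumnC'] = (df.groupby('ColumnA')['ColumnB']
                   .transform(lambda s: s.ewm(span=3).mean()))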

SQL GROUPING SETS averages with multiple many-to-many dimensions

I have a table of data with the following:
User,Platform,Dt,Activity_Flag,Total_Purchases
1,iOS,05/05/2016,1,1
1,Android,05/05/2016,1,2
2,iOS,05/05/2016,1,0
2,Android,05/05/2016,1,2
3,iOS,05/05/2016,1,1
3,Android,06/05/2016,1,3
1,iOS,06/05/2016,1,2
4,Android,06/05/2016,1,2
1,Android,06/05/2016,1,0
3,iOS,07/05/2016,1,2
2,iOS,08/05/2016,1,0
I want to do a GROUPING SETS (Platform,Dt,(Platform,Dt),()) aggregation to be able to find for each combination of Platform and Dt the following:
Total Purchases
Total Unique Users
Average Purchases per User per Day
The first two are simple as these can be achieved via a sum(Total_Purchases) and count(distinct user) respectively.
The problem I have is with the last metric. The result set should look like this but I don't know how to get the last column to be calculated correctly:
Platform,Dt,Total_Purchases,Total_Unique_Users,Average_Purchases_Per_User_Per_Day
Android,05/05/2016,4,2,2.0
iOS,05/05/2016,2,3,0.7
Android,06/05/2016,5,3,1.7
iOS,06/05/2016,2,1,2.0
iOS,07/05/2016,2,1,2.0
iOS,08/05/2016,0,1,0.0
,05/05/2016,6,3,2.0
,06/05/2016,7,3,2.3
,07/05/2016,2,1,2.0
,08/05/2016,0,1,0.0
Android,,9,4,1.8
iOS,,6,3,1.2
,,15,4,1.6
For the first ten rows, the average purchases per user per day is a simple division of the first two metric columns, since each of those rows covers a single date only. But the final 3 rows show that this division does not produce the desired result: for them, the metric has to be computed as the average of the per-day figures, taking each day in the group in turn.
If this isn't clear please let me know and I'll be happy to explain better. This is my first post on this site!
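One way to sketch this (Postgres-style SQL; the table name activity and the column name UserId are placeholders, and genuine NULLs in Platform or Dt would need GROUPING() to disambiguate rollup rows): compute the per-day ratio at the right granularity first, then average it across days for the rollup rows.

WITH daily AS (
    -- per-day ratio, once per (Platform, Dt) and once per Dt across platforms
    SELECT Platform, Dt,
           SUM(Total_Purchases) * 1.0 / COUNT(DISTINCT UserId) AS ratio
    FROM activity
    GROUP BY GROUPING SETS ((Platform, Dt), (Dt))
),
totals AS (
    SELECT Platform, Dt,
           SUM(Total_Purchases)   AS total_purchases,
           COUNT(DISTINCT UserId) AS total_unique_users
    FROM activity
    GROUP BY GROUPING SETS (Platform, Dt, (Platform, Dt), ())
)
SELECT t.Platform, t.Dt, t.total_purchases, t.total_unique_users,
       AVG(d.ratio) AS avg_purchases_per_user_per_day
FROM totals t
JOIN daily d
  ON  t.Platform IS NOT DISTINCT FROM d.Platform  -- same platform level
  AND (t.Dt IS NULL OR t.Dt = d.Dt)               -- all days for rollups
GROUP BY t.Platform, t.Dt, t.total_purchases, t.total_unique_users;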