How to calculate the rolling sum on custom time columns? - pandas

The rolling function in Pandas can only calculate rolling statistics according to row counts or date/time columns. But I want to have a discrete time column for calculating rolling sum, something like this:
key time value
A 1 10
A 2 20
A 4 30
A 7 10
B 1 15
B 2 30
B 3 15
I want to first group by key, then calculate the rolling sum on value for the nearest 3 time:
key time value output
A 1 10 10
A 2 20 30(10+20)
A 4 30 60(10+20+30)
A 7 10 40(30+10)
B 1 15 15
B 2 30 45
B 3 15 60
I tried this:
grouped = input.groupby("key", as_index=False)
for name, group in grouped:
group = group.sort_values("time")
time = list(group["time"])
value = list(group["value"])
#calcRollingStat is a custom function that outputs a list of corresponding results
out = calcRollingStat(time, value, mode="avg")
group["output"] = out #out is a list
But then I don't know how to convert grouped back to DataFrame. Pandas tells me that there is no reset_index attribute in grouped.
Is my code the best method to do this? How would you tackle this problem?
Thank you!

I believe you can use GroupBy.apply with custom function:
def f(group):
group = group.sort_values("time")
time = list(group["time"])
value = list(group["value"])
#calcRollingStat is a custom function that outputs a list of corresponding results
group["output"] = calcRollingStat(time, value, mode="avg")
return group
df = input.groupby("key", as_index=False).apply(f)

Related

Changing column name and it's values at the same time

Pandas help!
I have a specific column like this,
Mpg
0 18
1 17
2 19
3 21
4 16
5 15
Mpg is mile per gallon,
Now I need to replace that 'MPG' column to 'litre per 100 km' and change those values to litre per 100 km' at the same time. Any help? Thanks beforehand.
-Tom
I changed the name of the column but doing both simultaneously,i could not.
Use pop to return and delete the column at the same time and rdiv to perform the conversion (1 mpg = 1/235.15 liter/100km):
df['litre per 100 km'] = df.pop('Mpg').rdiv(235.15)
If you want to insert the column in the same position:
df.insert(df.columns.get_loc('Mpg'), 'litre per 100 km',
df.pop('Mpg').rdiv(235.15))
Output:
litre per 100 km
0 13.063889
1 13.832353
2 12.376316
3 11.197619
4 14.696875
5 15.676667
An alternative to pop would be to store the result in another dataframe. This way you can perform the two steps at the same time. In my code below, I first reproduce your dataframe, then store the constant for conversion and perform it on all entries using the apply method.
df = pd.DataFrame({'Mpg':[18,17,19,21,16,15]})
cc = 235.214583 # constant for conversion from mpg to L/100km
df2 = pd.DataFrame()
df2['litre per 100 km'] = df['Mpg'].apply(lambda x: cc/x)
print(df2)
The output of this code is:
litre per 100 km
0 13.067477
1 13.836152
2 12.379715
3 11.200694
4 14.700911
5 15.680972
as expected.

Pandas Cumulative sum over 1 indice but not the other 3

I have a dataframe with 4 variables DIVISION, QTR, MODEL_SCORE, MONTH with the sum of variable X aggregated by those 4.
I would like to effective partition the data by DIVISION,QTR, and MODEL SCORE and keep a running total order the MONTH FIELD order smallest to largest. The idea being it would reset if it got to a new permutation of the other 3 columns
df = df.groupby(['DIVISION','MODEL','QTR','MONTHS'])['X'].sum()
I'm trying
df['cumsum'] = df.groupby(level=3)['X'].cumsum()
having tried all numbers I can think in the level argument. It seems be able to work any way other than what I want.
EDIT: I know the below isn't formatted ideally, but basically as long as the only variable changing was MONTH the cumulative sum would continue but any other variable would cause it to reset.
DIVSION QTR MODEL MONTHS X CUMSUM
A 1 1 1 10 10
A 1 1 2 20 30
A 1 2 1 5 5
I'm sorry for all the trouble I believe the answer was way simpler than I was making it to be.
After
df = df.groupby(['DIVISION','MODEL','QTR','MONTHS'])['X'].sum()
I was supposed to reset the index I did not want a multi-index and this appears to have worked.
df = df.reset_index()
df['cumsum'] = df.groupby(['DIVISION','MODEL','QTR'])['X'].cumsum()

Is there a way to use cumsum with a threshold to create bins?

Is there a way to use numpy to add numbers in a series up to a threshold, then restart the counter. The intention is to form groupby based on the categories created.
amount price
0 27 22.372505
1 17 126.562276
2 33 101.061767
3 78 152.076373
4 15 103.482099
5 96 41.662766
6 108 98.460743
7 143 126.125865
8 82 87.749286
9 70 56.065133
The only solutions I found iterate with .loc which is slow. I tried building a solution based on this answer https://stackoverflow.com/a/56904899:
sumvals = np.frompyfunc(lambda a,b: a+b if a <= 100 else b,2,1)
df['cumvals'] = sumvals.accumulate(df['amount'], dtype=np.object)
The use-case is to find the average price of every 75 sold amounts of the thing.
Solution #1 Interpreting the following one way will get my solution below: "The use-case is to find the average price of every 75 sold amounts of the thing." If you are trying to do this calculation the "hard way" instead of pd.cut, then here is a solution that will work well but the speed / memory will depend on the cumsum() of the amount column, which you can find out if you do df['amount'].cumsum(). The output will take about 1 second per every 10 million of the cumsum, as that is how many rows is created with np.repeat. Again, this solution is not horrible if you have less than ~10 million in cumsum (1 second) or even 100 million in cumsum (~10 seconds):
i = 75
df = np.repeat(df['price'], df['amount']).to_frame().reset_index(drop=True)
g = df.index // i
df = df.groupby(g)['price'].mean()
df.index = (df.index * i).astype(str) + '-' + (df.index * i +75).astype(str)
df
Out[1]:
0-75 78.513748
75-150 150.715984
150-225 61.387540
225-300 67.411182
300-375 98.829611
375-450 126.125865
450-525 122.032363
525-600 87.326831
600-675 56.065133
Name: price, dtype: float64
Solution #2 (I believe this is wrong but keeping just in case)
I do not believe you are tying to do it this way, which was my initial solution, but I will keep it here in case, as you haven't included expected output. You can create a new series with cumsum and then use pd.cut and pass bins=np.arange(0, df['Group'].max(), 75) to create groups of cumulative 75. Then, groupby the groups of cumulative 75 and take the mean. Finally, use pd.IntervalIndex to clean up the format and change to a sting:
df['Group'] = df['amount'].cumsum()
s = pd.cut(df['Group'], bins=np.arange(0, df['Group'].max(), 75))
df = df.groupby(s)['price'].mean().reset_index()
df['Group'] = pd.IntervalIndex(df['Group']).left.astype(str) + '-' + pd.IntervalIndex(df['Group']).right.astype(str)
df
Out[1]:
Group price
0 0-75 74.467390
1 75-150 101.061767
2 150-225 127.779236
3 225-300 41.662766
4 300-375 98.460743
5 375-450 NaN
6 450-525 126.125865
7 525-600 87.749286

Groupby two columns in pandas, and perform operations over totals for each group

The code below:
df = pd.read_csv('./filename.csv', header='infer').dropna()
df.groupby(['category_code','event_type']).event_type.count().head(20)
Returns the following table:
How can I obtain, for all the sub groups under event_type that have both "purchase" and "view", the ratio between the total of "purchase" and the total of "view"?
In this specific case, for instance, I need a function that returns:
1/57
1/232
3/249
Eventually, I will need to plot such result.
I have been trying for a day, without success. I am still new to pandas, and I searched across every possible forum without finding anything useful.
Next time please consider adding a sample of your data as text instead of as an image. It helps us testing..
Anyway, in your case you can combine different dataframe methods, such as groupby, as you have already done, and pivot_table. I used this data just as an example:
category_code event_type
0 A purchase
1 A view
2 B view
3 B view
4 C view
5 D purchase
6 D view
7 D view
You can create a new column from your groupby
df['event_count'] = df.groupby(['category_code', 'event_type'])\
.event_type.transform('count')
Then create a pivot_table
my_table = df.pivot_table(values='event_count',
index='category_code',
columns='event_type',
fill_value=0)
Then, finally, you can calculate the purchase_ratio directly:
my_table['purchase_ratio'] = my_table['purchase'] / my_table['view']
Which results in the following DataFrame:
event_type purchase view purchase_ratio
category_code
A 1 1 1.0
B 0 2 0.0
C 0 1 0.0
D 1 2 0.5

Pandas series group by calculate percentage

I have a data frame. I have grouped a column status by date using
y = news_dataframe.groupby(by=[news_dataframe['date'].dt.date,news_dataframe['status']])['status'].count()
and my output is --
date status count
2019-05-29 selected 24
rejected auto 243
waiting 109
no action 1363
2019-05-30 selected 28
rejected auto 188
waiting 132
no action 1249
repeat 3
2019-05-31 selected 13
rejected auto 8
waiting 23
no action 137
repeat 2
source 1
Name: reasonForReject, dtype: int64
Now I want to calculate the percentage of each status group by date. How can I achieve this using pandas dataframe?
Compute two different groupbys and divide one by the other:
y_numerator = news_dataframe.groupby(by=[news_dataframe['date'].dt.date,news_dataframe['status']])['status'].count()
y_denominator = news_dataframe.groupby(by=news_dataframe['date'].dt.date)['status'].count()
y=y_numerator/y_denominator
I guess that's the shortest:
news_dataframe['date'] = news_dataframe['date'].dt.date
news_dataframe.groupby(['date','status'])['status'].count()/news_dataframe.groupby(['date'])['status'].count()
try this:
# just fill the consecutive rows with this
df=df.ffill()
df.df1.columns=['date','status','count']
# getting the total value of count with date and status
df1=df.groupby(['date']).sum().reset_index()
#renaming it to total as it is the sum
df1.columns=['date','status','total']
# now join the tables to find the total and actual value together
df2=df.merge(df1,on=['date'])
#calculate the percentage
df2['percentage']=(df2.count/df2.total)*100
If you need one liner its:
df['percentage']=(df.ffill()['count]/df.ffill().groupby(['date']).sum().reset_index().rename(columns={'count': 'total'}).merge(df,on=['date'])['total'])*100