window function for moving average - pandas

I am trying to replicate SQL's window function in pandas.
SELECT avg(totalprice) OVER (
PARTITION BY custkey
ORDER BY orderdate
RANGE BETWEEN interval '1' month PRECEDING AND CURRENT ROW)
FROM orders
I have this dataframe:
from io import StringIO
import pandas as pd

myst = """cust_1,2020-10-10,100
cust_2,2020-10-10,15
cust_1,2020-10-15,200
cust_1,2020-10-16,240
cust_2,2020-12-20,25
cust_1,2020-12-25,140
cust_2,2021-01-01,5
"""
u_cols = ['customer_id', 'date', 'price']
df = pd.read_csv(StringIO(myst), sep=',', names=u_cols)
df = df.sort_values(list(df.columns))
After calculating the moving average restricted to the last month, it will look like this:
from io import StringIO
import pandas as pd

myst = """cust_1,2020-10-10,100,100
cust_2,2020-10-10,15,15
cust_1,2020-10-15,200,150
cust_1,2020-10-16,240,180
cust_2,2020-12-20,25,25
cust_1,2020-12-25,140,140
cust_2,2021-01-01,5,15
"""
u_cols = ['customer_id', 'date', 'price', 'my_average']
my_df = pd.read_csv(StringIO(myst), sep=',', names=u_cols)
my_df = my_df.sort_values(list(my_df.columns))
As shown in this image:
https://trino.io/assets/blog/window-features/running-average-range.svg
I tried to write a function like this...
import numpy as np

def mylogic(myro):
    mylist = list()
    mydate = myro['date'][0]
    for i in range(len(myro)):
        if myro['date'][i] > mydate:
            mylist.append(myro['price'][i])
            mydate = myro['date'][i]
    return np.mean(mylist)
But that returned a KeyError.

You can use the rolling function over the last 30 days:
df['date'] = pd.to_datetime(df['date'])
df['my_average'] = (df.groupby('customer_id')
                      .apply(lambda d: d.rolling('30D', on='date')['price'].mean())
                      .reset_index(level=0, drop=True)
                      .astype(int)
                   )
output:
customer_id date price my_average
0 cust_1 2020-10-10 100 100
2 cust_1 2020-10-15 200 150
3 cust_1 2020-10-16 240 180
5 cust_1 2020-12-25 140 140
1 cust_2 2020-10-10 15 15
4 cust_2 2020-12-20 25 25
6 cust_2 2021-01-01 5 15
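About the KeyError in the attempted function: if mylogic is applied via groupby(...).apply, each group keeps its original index labels, so myro['date'][0] is a label lookup and fails for any group whose first row is not labeled 0; positional access would be myro['date'].iloc[0]. Also note that '30D' only approximates SQL's interval '1' month. On recent pandas versions you can compute the same result without the lambda by calling rolling directly on the groupby (a sketch equivalent to the answer above):

df['my_average'] = (df.groupby('customer_id')
                      .rolling('30D', on='date')['price'].mean()
                      .reset_index(level=0, drop=True)  # drop the customer_id level
                      .astype(int)
                   )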

Related

Count how many non-zero entries there are in each month of a dataframe column

I have a dataframe, df, with a DatetimeIndex and a single column.
I need to count how many non-zero entries I have in each month. For example, in January I would have 2 entries, in February 1 entry and in March 2 entries. I have more months in the dataframe, but I guess that explains the problem.
I tried using pandas groupby:
df.groupby(df.index.month).count()
But that just gives me the total number of days in each month, and I didn't see any other parameter in count() that I could use here.
Any ideas?
Try index.to_period()
For example:
In [1]: import pandas as pd
        import numpy as np
        x_df = pd.DataFrame(
            {'values': np.random.randint(low=0, high=2, size=(120,))},
            index=pd.date_range("2022-01-01", periods=120, freq="D")
        )
In [2]: x_df
Out[2]:
values
2022-01-01 0
2022-01-02 0
2022-01-03 1
2022-01-04 0
2022-01-05 0
...
2022-04-26 1
2022-04-27 0
2022-04-28 0
2022-04-29 1
2022-04-30 1
[120 rows x 1 columns]
In [3]: x_df[x_df['values'] != 0].groupby(lambda x: x.to_period("M")).count()
Out[3]:
values
2022-01 17
2022-02 15
2022-03 16
2022-04 17
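Since the values here are 0/1, an equivalent one-liner (a sketch using the same x_df) is to group a boolean mask by period and sum it:

x_df['values'].ne(0).groupby(x_df.index.to_period("M")).sum()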
You can try this:
import numpy as np
# replace zeros with NaN, drop those rows, then count the rest per month
dfx['col1'] = dfx['col1'].replace(0, np.nan)
dfx = dfx.dropna()
dfx = dfx.resample('1M').count()
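Applied to the x_df from the first answer (dfx and col1 stand in for your own frame and column), this amounts to:

counts = (x_df.replace(0, np.nan)
              .dropna()
              .resample('1M')
              .count())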

Pandas: drop out of sequence row

My Pandas df:
import pandas as pd
import io
data = """date value
"2015-09-01" 71.925000
"2015-09-06" 71.625000
"2015-09-11" 71.333333
"2015-09-12" 64.571429
"2015-09-21" 72.285714
"""
df = pd.read_table(io.StringIO(data), delim_whitespace=True)
df.date = pd.to_datetime(df.date)
Given a user input date (01-09-2015),
I would like to keep only those dates where the difference between the date and the input date is a multiple of 5 days.
Expected output:
input = 01-09-2015
df:
date value
0 2015-09-01 71.925000
1 2015-09-06 71.625000
2 2015-09-11 71.333333
3 2015-09-21 72.285714
My approach so far:
I take the delta between input_date and date in pandas and save this delta in a separate column.
If delta % 5 == 0, I keep the row, else I drop it. Is this the best that can be done?
Use boolean indexing to filter by a mask: convert the input value to a datetime, then convert the timedeltas to days with Series.dt.days:
input1 = '01-09-2015'
df = df[df.date.sub(pd.to_datetime(input1)).dt.days % 5 == 0]
print(df)
date value
0 2015-09-01 71.925000
1 2015-09-06 71.625000
2 2015-09-11 71.333333
4 2015-09-21 72.285714
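One caveat: pd.to_datetime('01-09-2015') parses month-first by default, giving January 9, 2015. If the input is meant day-first (1 September 2015), pass an explicit format or dayfirst=True, for example:

# parse the input explicitly as day-month-year
start = pd.to_datetime(input1, format='%d-%m-%Y')
df = df[df.date.sub(start).dt.days % 5 == 0]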

Pandas groupby calculate difference

import pandas as pd
data = [['2017-09-30','A',123],['2017-12-31','A',23],['2017-09-30','B',74892],['2017-12-31','B',52222],['2018-09-30','A',37599],['2018-12-31','A',66226]]
df = pd.DataFrame.from_records(data,columns=["Date", "Company", "Revenue YTD"])
df['Date'] = pd.to_datetime(df['Date'])
df = df.groupby(['Company',df['Date'].dt.year]).diff()
print(df)
Date Revenue YTD
0 NaT NaN
1 92 days -100.0
2 NaT NaN
3 92 days -22670.0
4 NaT NaN
5 92 days 28627.0
I would like to calculate the company's revenue difference between September and December. I have tried grouping by company and year, but the result is not what I am expecting.
Expected result:
Date Company Revenue YTD
0 2017 A -100
1 2017 B -22670
2 2018 A 28627
IIUC, this should work:
(df.assign(Date=df['Date'].dt.year,
           Revenue_Diff=df.groupby(['Company', df['Date'].dt.year])['Revenue YTD'].diff())
   .drop('Revenue YTD', axis=1)
   .dropna()
)
Output:
Date Company Revenue_Diff
1 2017 A -100.0
3 2017 B -22670.0
5 2018 A 28627.0
Try this:
Set it up:
import pandas as pd
import numpy as np
data = [['2017-09-30','A',123],['2017-12-31','A',23],['2017-09-30','B',74892],['2017-12-31','B',52222],['2018-09-30','A',37599],['2018-12-31','A',66226]]
df = pd.DataFrame.from_records(data,columns=["Date", "Company", "Revenue YTD"])
df['Date'] = pd.to_datetime(df['Date'])
Update with np.diff():
my_func = lambda x: np.diff(x)
df = (df.groupby([df.Date.dt.year, df.Company])
        .agg({'Revenue YTD': my_func}))
print(df)
Revenue YTD
Date Company
2017 A -100
B -22670
2018 A 28627
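Note that np.diff returns an array per group; if you want plain scalars instead, one alternative (a sketch using the df from the set-up step, assuming rows within each group are in date order) is to aggregate with last-minus-first directly:

rev_diff = (df.groupby([df.Date.dt.year, df.Company])['Revenue YTD']
              .agg(lambda x: x.iloc[-1] - x.iloc[0]))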
Hope this helps.

alternatives to pivot very large table pandas

I have a dataframe of 25M rows x 3 cols in this format:
import pandas as pd
import numpy as np
d={'ID':['A1','A1','A2','A2','A2'], 'date':['Jan1','Jan7','Jan4','Jan5','Jan12'],'value':[10,12,3,5,2]}
df=pd.DataFrame(data=d)
df
ID date value
0 A1 Jan1 10
1 A1 Jan7 12
2 A2 Jan4 3
3 A2 Jan5 5
4 A2 Jan12 2
...
And I want to pivot it using:
df['date'] = pd.to_datetime(df['date'], format='%b%d')
df2 = (df.pivot(index='date', columns='ID', values='value')
         .asfreq('D')
         .interpolate()
         .bfill()
      )
df2.index = df2.index.strftime('%b%d')
This works for 500k rows:
df3 = (df.iloc[:500000, :].pivot(index='date', columns='ID', values='value')
         .resample('M').mean()
         .interpolate()
         .bfill()
         .reset_index()
      )
But when I use my full data set (>1M rows), it fails with:
ValueError: Unstacked DataFrame is too big, causing int32 overflow
Are there any suggestions on how I can get this to run to completion?
A further computation is performed on the wide table:
N=19/df2.iloc[0]
df2.mul(N.tolist(),axis=1).sum(1)
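Since the overflow comes from unstacking all 25M rows at once, one possible direction (a sketch only; the batch count of 10 is an arbitrary assumption) is to pivot the IDs in batches and concatenate the wide pieces:

import numpy as np
import pandas as pd

ids = df['ID'].unique()
pieces = []
for batch in np.array_split(ids, 10):  # arbitrary number of batches
    pieces.append(df[df['ID'].isin(batch)]
                    .pivot(index='date', columns='ID', values='value'))
# concat aligns the pieces on the union of dates; then fill as before
df2 = (pd.concat(pieces, axis=1)
         .asfreq('D')
         .interpolate()
         .bfill())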

Slicing in group by function

How do I group a data frame based on the first part of a column after splitting its data on a colon?
In this example I need to split the last column (time) and group by hour.
from io import StringIO
import pandas as pd

myst = """india, 905034 , 19:44
USA, 905094 , 19:33
Russia, 905154 , 21:56
"""
u_cols = ['country', 'index', 'current_tm']
df = pd.read_csv(StringIO(myst), sep=',', names=u_cols)
This query does not return the expected results:
df[df['index'] > 900000].groupby([df.current_tm]).size()
current_tm
21:56 1
19:33 1
19:44 1
dtype: int64
It should be:
21 1
19 2
The time is in hh:mm format, but pandas treats it as a string.
Is there any utility that will convert a SQL query to its pandas equivalent? (Something like querymongo.com, which helps MongoDB users.)
You can add the hour to your dataframe as follows and then use it for grouping:
df['hour'] = df.current_tm.str.strip().apply(lambda x: x.split(':')[0] if isinstance(x, str)
                                             else None)
>>> df[df['index'] > 900000].groupby('hour').size()
hour
19 2
21 1
dtype: int64
Create a new column:
df['hour'] = [current_time.split(':')[0] for current_time in df['current_tm']]
Then apply your method:
df[df['index'] > 900000].groupby([df['hour']]).size()
hour
19 2
21 1
dtype: int64
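Along the same lines, a sketch that parses the time properly instead of splitting strings: convert current_tm with an explicit format and take the hour:

df['hour'] = pd.to_datetime(df['current_tm'].str.strip(), format='%H:%M').dt.hour
df[df['index'] > 900000].groupby('hour').size()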