pandas rolling cumsum over the trailing n elements

Using pandas, what is the easiest way to calculate a rolling cumsum over the previous n elements, for instance to calculate trailing three-day sales:
df = pandas.Series(numpy.random.randint(0,10,10), index=pandas.date_range('2020-01', periods=10))
df
2020-01-01 8
2020-01-02 4
2020-01-03 1
2020-01-04 0
2020-01-05 5
2020-01-06 8
2020-01-07 3
2020-01-08 8
2020-01-09 9
2020-01-10 0
Freq: D, dtype: int64
Desired output:
2020-01-01 8
2020-01-02 12
2020-01-03 13
2020-01-04 5
2020-01-05 6
2020-01-06 13
2020-01-07 16
2020-01-08 19
2020-01-09 20
2020-01-10 17
Freq: D, dtype: int64

You need rolling.sum:
df.rolling(3, min_periods=1).sum()
Out:
2020-01-01 8.0
2020-01-02 12.0
2020-01-03 13.0
2020-01-04 5.0
2020-01-05 6.0
2020-01-06 13.0
2020-01-07 16.0
2020-01-08 19.0
2020-01-09 20.0
2020-01-10 17.0
dtype: float64
min_periods=1 ensures that the first two elements are calculated too; with a window size of 3, they would otherwise be NaN.
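Since the index here is a gap-free daily DatetimeIndex, an offset-based window is an equivalent alternative worth knowing about; it also behaves sensibly when dates are missing (a sketch, not part of the original answer):

```python
import numpy as np
import pandas as pd

s = pd.Series(np.random.randint(0, 10, 10),
              index=pd.date_range('2020-01', periods=10))

# Fixed-count window: sum of the current and previous two rows
counted = s.rolling(3, min_periods=1).sum()

# Offset-based window: sum over a trailing 3-day span; for a gap-free
# daily index this matches the count-based version exactly
timed = s.rolling('3D').sum()

assert counted.equals(timed)
```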

Related

How to Coalesce datetime values from 3 columns into a single column in a pandas dataframe?

I have a dataframe with 3 date columns in datetime format:
CLIENT_ID  DATE_BEGIN  DATE_START  DATE_REGISTERED
1          2020-01-01  2020-01-01  2020-01-01
2          2020-01-02  2020-02-01  2020-01-01
3          NaN         2020-05-01  2020-04-01
4          2020-01-01  2020-01-01  NaN
How do I create (coalesce) a new column with the earliest datetime for each row resulting in an ACTUAL_START_DATE
CLIENT_ID  DATE_BEGIN  DATE_START  DATE_REGISTERED  ACTUAL_START_DATE
1          2020-01-01  2020-01-01  2020-01-01       2020-01-01
2          2020-01-02  2020-02-01  2020-01-01       2020-01-01
3          NaN         2020-05-01  2020-04-01       2020-04-01
4          2020-01-01  2020-01-02  NaN              2020-01-01
some sort of variation with bfill?
You are right, a mix of bfill and ffill along the columns axis should do it:
df.assign(ACTUAL_START_DATE=df.filter(like='DATE')
                              .bfill(axis=1)
                              .ffill(axis=1)
                              .min(axis=1)
          )
CLIENT_ID DATE_BEGIN DATE_START DATE_REGISTERED ACTUAL_START_DATE
0 1 2020-01-01 2020-01-01 2020-01-01 2020-01-01
1 2 2020-01-02 2020-02-01 2020-01-01 2020-01-01
2 3 NaN 2020-05-01 2020-04-01 2020-04-01
3 4 2020-01-01 2020-01-01 NaN 2020-01-01
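Worth noting (an observation, not part of the original answer): DataFrame.min(axis=1) already skips NaN/NaT by default, so the bfill/ffill steps can be dropped once the columns are actual datetimes:

```python
import pandas as pd

# Sample data reconstructed from the question
df = pd.DataFrame({
    'CLIENT_ID': [1, 2, 3, 4],
    'DATE_BEGIN': ['2020-01-01', '2020-01-02', None, '2020-01-01'],
    'DATE_START': ['2020-01-01', '2020-02-01', '2020-05-01', '2020-01-01'],
    'DATE_REGISTERED': ['2020-01-01', '2020-01-01', '2020-04-01', None],
})

# Select the DATE_* columns and parse them to datetimes (None becomes NaT)
date_cols = df.filter(like='DATE').apply(pd.to_datetime)

# min(axis=1) ignores NaT by default (skipna=True), which is exactly coalesce
df['ACTUAL_START_DATE'] = date_cols.min(axis=1)
```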

pandas how to populate missing rows

I have a dataset like:
Dept, Date, Number
dept1, 2020-01-01, 12
dept1, 2020-01-03, 34
dept2, 2020-01-03, 56
dept3, 2020-01-03, 78
dept2, 2020-01-04, 11
dept3, 2020-01-04, 12
...
E.g., I want to fill zero for the missing dept2 and dept3 rows on date 2020-01-01:
Dept, Date, Number
dept1, 2020-01-01, 12
dept2, 2020-01-01, 0 <--need to be added
dept3, 2020-01-01, 0 <--need to be added
dept1, 2020-01-03, 34
dept2, 2020-01-03, 56
dept3, 2020-01-03, 78
dept1, 2020-01-04, 0 <--need to be added
dept2, 2020-01-04, 11
dept3, 2020-01-04, 12
In other words, for unique dept, I need them to be shown on every unique date.
Is there a way to achieve this? Thanks!
You could use the complete function from pyjanitor to abstract the process; simply pass the columns that you wish to expand:
In [598]: df.complete('Dept', 'Date').fillna(0)
Out[598]:
Dept Date Number
0 dept1 2020-01-01 12.0
1 dept1 2020-01-03 34.0
2 dept1 2020-01-04 0.0
3 dept2 2020-01-01 0.0
4 dept2 2020-01-03 56.0
5 dept2 2020-01-04 11.0
6 dept3 2020-01-01 0.0
7 dept3 2020-01-03 78.0
8 dept3 2020-01-04 12.0
You could also stick solely to Pandas and use the reindex method; complete covers cases where the index is not unique, or there are nulls; it is an abstraction/convenience wrapper:
(df
 .set_index(['Dept', 'Date'])
 .pipe(lambda df: df.reindex(pd.MultiIndex.from_product(df.index.levels),
                             fill_value=0))
 .reset_index()
)
Dept Date Number
0 dept1 2020-01-01 12
1 dept1 2020-01-03 34
2 dept1 2020-01-04 0
3 dept2 2020-01-01 0
4 dept2 2020-01-03 56
5 dept2 2020-01-04 11
6 dept3 2020-01-01 0
7 dept3 2020-01-03 78
8 dept3 2020-01-04 12
Let us do pivot then stack (keyword arguments are used here because pivot takes keyword-only arguments in pandas ≥ 2.0):
out = (df.pivot(index='Dept', columns='Date', values='Number')
         .fillna(0)
         .stack()
         .reset_index(name='Number'))
Dept Date Number
0 dept1 2020-01-01 12.0
1 dept1 2020-01-03 34.0
2 dept1 2020-01-04 0.0
3 dept2 2020-01-01 0.0
4 dept2 2020-01-03 56.0
5 dept2 2020-01-04 11.0
6 dept3 2020-01-01 0.0
7 dept3 2020-01-03 78.0
8 dept3 2020-01-04 12.0
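The same full grid can also be built explicitly and merged back onto the data; a sketch assuming the column names from the question:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'Dept': ['dept1', 'dept1', 'dept2', 'dept3', 'dept2', 'dept3'],
    'Date': ['2020-01-01', '2020-01-03', '2020-01-03', '2020-01-03',
             '2020-01-04', '2020-01-04'],
    'Number': [12, 34, 56, 78, 11, 12],
})

# Build the full Dept x Date grid, left-merge the observed rows onto it,
# and fill the missing combinations with 0
grid = pd.MultiIndex.from_product(
    [sorted(df['Dept'].unique()), sorted(df['Date'].unique())],
    names=['Dept', 'Date']).to_frame(index=False)
out = grid.merge(df, how='left').fillna({'Number': 0})
```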

7 days hourly mean with pandas

I need some help calculating a 7-day mean for every hour.
The timeseries has an hourly resolution, and I need the 7-day mean for each hour of the day, e.g. for 13:00:
date, x
2020-07-01 13:00 , 4
2020-07-01 14:00 , 3
.
.
.
2020-07-02 13:00 , 3
2020-07-02 14:00 , 7
.
.
.
I tried it with pandas and a rolling mean, but a plain rolling window covers the last 7 days of hourly values rather than the same hour across the last 7 days.
Thanks for any hints!
Add a new hour column, group by it, and then take a rolling mean with a window of 7 within each group. Because each group contains one observation per day for a given hour, a 7-element window spans 7 days, which is consistent with the intent of the question.
df['hour'] = df.index.hour
df = df.groupby(df.hour)['x'].rolling(7).mean().reset_index()
df.head(35)
hour level_1 x
0 0 2020-07-01 00:00:00 NaN
1 0 2020-07-02 00:00:00 NaN
2 0 2020-07-03 00:00:00 NaN
3 0 2020-07-04 00:00:00 NaN
4 0 2020-07-05 00:00:00 NaN
5 0 2020-07-06 00:00:00 NaN
6 0 2020-07-07 00:00:00 48.142857
7 0 2020-07-08 00:00:00 50.285714
8 0 2020-07-09 00:00:00 60.000000
9 0 2020-07-10 00:00:00 63.142857
10 1 2020-07-01 01:00:00 NaN
11 1 2020-07-02 01:00:00 NaN
12 1 2020-07-03 01:00:00 NaN
13 1 2020-07-04 01:00:00 NaN
14 1 2020-07-05 01:00:00 NaN
15 1 2020-07-06 01:00:00 NaN
16 1 2020-07-07 01:00:00 52.571429
17 1 2020-07-08 01:00:00 48.428571
18 1 2020-07-09 01:00:00 38.000000
19 2 2020-07-01 02:00:00 NaN
20 2 2020-07-02 02:00:00 NaN
21 2 2020-07-03 02:00:00 NaN
22 2 2020-07-04 02:00:00 NaN
23 2 2020-07-05 02:00:00 NaN
24 2 2020-07-06 02:00:00 NaN
25 2 2020-07-07 02:00:00 46.571429
26 2 2020-07-08 02:00:00 47.714286
27 2 2020-07-09 02:00:00 42.714286
28 3 2020-07-01 03:00:00 NaN
29 3 2020-07-02 03:00:00 NaN
30 3 2020-07-03 03:00:00 NaN
31 3 2020-07-04 03:00:00 NaN
32 3 2020-07-05 03:00:00 NaN
33 3 2020-07-06 03:00:00 NaN
34 3 2020-07-07 03:00:00 72.571429
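If you would rather keep the original hourly index instead of the grouped layout above, groupby().transform returns a result aligned with the input (a sketch with synthetic data; the column name x_7d is an assumption):

```python
import numpy as np
import pandas as pd

# Ten days of hourly synthetic data
idx = pd.date_range('2020-07-01', periods=24 * 10, freq='h')
df = pd.DataFrame({'x': np.arange(len(idx))}, index=idx)

# Rolling 7-sample mean within each hour-of-day group,
# written back aligned to the original hourly index
df['x_7d'] = (df.groupby(df.index.hour)['x']
                .transform(lambda s: s.rolling(7).mean()))
```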

Add a column value with the other date time column at minutes level in pandas

I have a data frame as shown below
ID ideal_appt_time service_time
1 2020-01-06 09:00:00 22
2 2020-01-06 09:30:00 15
1 2020-01-08 14:00:00 42
2 2020-01-12 01:30:00 5
I would like to add service_time, in minutes, to ideal_appt_time and create a new column called finish.
Expected Output:
ID ideal_appt_time service_time finish
1 2020-01-06 09:00:00 22 2020-01-06 09:22:00
2 2020-01-06 09:30:00 15 2020-01-06 09:45:00
1 2020-01-08 14:00:00 42 2020-01-08 14:42:00
2 2020-01-12 01:30:00 35 2020-01-12 02:05:00
Use to_timedelta to convert the column to timedeltas in minutes, then add them to the datetimes:
df['ideal_appt_time'] = pd.to_datetime(df['ideal_appt_time'])
df['finish'] = df['ideal_appt_time'] + pd.to_timedelta(df['service_time'], unit='min')
print (df)
ID ideal_appt_time service_time finish
0 1 2020-01-06 09:00:00 22 2020-01-06 09:22:00
1 2 2020-01-06 09:30:00 15 2020-01-06 09:45:00
2 1 2020-01-08 14:00:00 42 2020-01-08 14:42:00
3 2 2020-01-12 01:30:00 5 2020-01-12 01:35:00
Data
df=pd.DataFrame({'ideal_appt_time':['2020-01-06 09:00:00','2020-01-06 09:30:00','2020-01-08 14:00:00','2020-01-12 01:30:00'],'service_time':[22,15,42,35]})
Another way, using astype (note that casting integers with astype('timedelta64[m]') was deprecated and no longer works in pandas ≥ 2.0; prefer pd.to_timedelta there):
df['finish'] = pd.to_datetime(df['ideal_appt_time']).add(df['service_time'].astype('timedelta64[m]'))
df
ideal_appt_time service_time finish
0 2020-01-06 09:00:00 22 2020-01-06 09:22:00
1 2020-01-06 09:30:00 15 2020-01-06 09:45:00
2 2020-01-08 14:00:00 42 2020-01-08 14:42:00
3 2020-01-12 01:30:00 35 2020-01-12 02:05:00
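A self-contained version of the to_timedelta approach, using the sample data from the second answer:

```python
import pandas as pd

df = pd.DataFrame({
    'ideal_appt_time': ['2020-01-06 09:00:00', '2020-01-06 09:30:00',
                        '2020-01-08 14:00:00', '2020-01-12 01:30:00'],
    'service_time': [22, 15, 42, 35],
})

# Convert the minutes to timedeltas and add them to the parsed datetimes
df['finish'] = (pd.to_datetime(df['ideal_appt_time'])
                + pd.to_timedelta(df['service_time'], unit='min'))
```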

generate a random number between 2 and 40 with mean 20 as a column in pandas

I have a data frame as shown below
session slot_num appt_time
s1 1 2020-01-06 09:00:00
s1 2 2020-01-06 09:20:00
s1 3 2020-01-06 09:40:00
s1 3 2020-01-06 09:40:00
s1 4 2020-01-06 10:00:00
s1 4 2020-01-06 10:00:00
s2 1 2020-01-06 08:20:00
s2 2 2020-01-06 08:40:00
s2 2 2020-01-06 08:40:00
s2 3 2020-01-06 09:00:00
s2 4 2020-01-06 09:20:00
s2 5 2020-01-06 09:40:00
s2 5 2020-01-06 09:40:00
s2 6 2020-01-06 10:00:00
s3 1 2020-01-09 13:00:00
s3 1 2020-01-09 13:00:00
s3 2 2020-01-09 13:20:00
s3 3 2020-01-09 13:40:00
In the above I would like to add a column called service_time.
service_time should contain random integers between 2 and 40 with mean 20 for each session.
I would prefer the random numbers to follow a normal distribution with mean 20, standard deviation 10, minimum 2 and maximum 40.
Expected output:
session slot_num appt_time service_time
s1 1 2020-01-06 09:00:00 30
s1 2 2020-01-06 09:20:00 10
s1 3 2020-01-06 09:40:00 15
s1 3 2020-01-06 09:40:00 35
s1 4 2020-01-06 10:00:00 20
s1 4 2020-01-06 10:00:00 10
s2 1 2020-01-06 08:20:00 15
s2 2 2020-01-06 08:40:00 20
s2 2 2020-01-06 08:40:00 25
s2 3 2020-01-06 09:00:00 30
s2 4 2020-01-06 09:20:00 20
s2 5 2020-01-06 09:40:00 8
s2 5 2020-01-06 09:40:00 40
s2 6 2020-01-06 10:00:00 2
s3 1 2020-01-09 13:00:00 4
s3 1 2020-01-09 13:00:00 32
s3 2 2020-01-09 13:20:00 26
s3 3 2020-01-09 13:40:00 18
Note: this is just one random combination that satisfies the minimum, maximum and mean criteria mentioned above.
One possible solution with a custom function (np.random.randint excludes the upper bound, so b + 1 is used to make the stated maximum of 40 reachable):
# https://stackoverflow.com/a/39435600/2901002
def gen_avg(n, expected_avg=20, a=2, b=40):
    # resample until the sample mean is exactly the expected average
    while True:
        l = np.random.randint(a, b + 1, size=n)
        if np.mean(l) == expected_avg:
            return l
df['service_time'] = df.groupby('session')['session'].transform(lambda x: gen_avg(len(x)))
print (df)
session slot_num appt_time service_time
0 s1 1 2020-01-06 09:00:00 31
1 s1 2 2020-01-06 09:20:00 9
2 s1 3 2020-01-06 09:40:00 23
3 s1 3 2020-01-06 09:40:00 37
4 s1 4 2020-01-06 10:00:00 6
5 s1 4 2020-01-06 10:00:00 14
6 s2 1 2020-01-06 08:20:00 33
7 s2 2 2020-01-06 08:40:00 29
8 s2 2 2020-01-06 08:40:00 18
9 s2 3 2020-01-06 09:00:00 32
10 s2 4 2020-01-06 09:20:00 9
11 s2 5 2020-01-06 09:40:00 26
12 s2 5 2020-01-06 09:40:00 10
13 s2 6 2020-01-06 10:00:00 3
14 s3 1 2020-01-09 13:00:00 19
15 s3 1 2020-01-09 13:00:00 22
16 s3 2 2020-01-09 13:20:00 5
17 s3 3 2020-01-09 13:40:00 34
Here's a solution with NumPy's new Generator infrastructure. See the documentation for a discussion of the differences between this and the older RandomState infrastructure.
import numpy as np
from numpy.random import default_rng
# assuming df is the name of your dataframe
n = len(df)
# set up random number generator
rng = default_rng()
# sample more than enough values
vals = rng.normal(loc=20., scale=10., size=2*n)
# filter values according to cut-off conditions
vals = vals[2 <= vals]
vals = vals[vals <= 40]
# add n random values to dataframe
df['service_time'] = vals[:n]
The normal distribution has an unbounded range, so if you're bounding between 2 and 40 the distribution isn't normal. An alternative which is bounded, and avoids acceptance/rejection schemes, is to use the triangular distribution (see Wikipedia for details). Since the mean of a triangular distribution is (left + mode + right) / 3, with left = 2 and right = 40 you would set mode = 18 to get the desired mean of 20.
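A sketch of the triangular alternative, using the Generator API from the previous answer (the parameter names left, mode, and right are NumPy's own):

```python
import numpy as np
from numpy.random import default_rng

rng = default_rng(42)
n = 18  # number of rows in the example dataframe

# left=2, right=40; mode=18 so that the mean (left + mode + right) / 3
# comes out to the desired 20
service_time = rng.triangular(left=2, mode=18, right=40, size=n)

# every draw is guaranteed to lie in [2, 40], with no rejection loop needed
```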