How to populate missing rows in pandas

I have a dataset like:
Dept, Date, Number
dept1, 2020-01-01, 12
dept1, 2020-01-03, 34
dept2, 2020-01-03, 56
dept3, 2020-01-03, 78
dept2, 2020-01-04, 11
dept3, 2020-01-04, 12
...
E.g., I want to fill in zeros for the missing dept2 and dept3 rows on 2020-01-01 (and likewise dept1 on 2020-01-04):
Dept, Date, Number
dept1, 2020-01-01, 12
dept2, 2020-01-01, 0 <--need to be added
dept3, 2020-01-01, 0 <--need to be added
dept1, 2020-01-03, 34
dept2, 2020-01-03, 56
dept3, 2020-01-03, 78
dept1, 2020-01-04, 0 <--need to be added
dept2, 2020-01-04, 11
dept3, 2020-01-04, 12
In other words, every unique Dept should appear on every unique Date. Is there a way to achieve this? Thanks!

You could use the complete function from pyjanitor to abstract the process; simply pass the columns that you wish to expand:
In [598]: df.complete('Dept', 'Date').fillna(0)
Out[598]:
Dept Date Number
0 dept1 2020-01-01 12.0
1 dept1 2020-01-03 34.0
2 dept1 2020-01-04 0.0
3 dept2 2020-01-01 0.0
4 dept2 2020-01-03 56.0
5 dept2 2020-01-04 11.0
6 dept3 2020-01-01 0.0
7 dept3 2020-01-03 78.0
8 dept3 2020-01-04 12.0
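For reference, here is a minimal, self-contained sketch of the setup the complete call above assumes; the janitor import and the construction of the sample frame are my additions, not part of the original answer.
import pandas as pd
import janitor  # registers the .complete accessor on DataFrames

df = pd.DataFrame({
    'Dept':   ['dept1', 'dept1', 'dept2', 'dept3', 'dept2', 'dept3'],
    'Date':   ['2020-01-01', '2020-01-03', '2020-01-03',
               '2020-01-03', '2020-01-04', '2020-01-04'],
    'Number': [12, 34, 56, 78, 11, 12],
})

# Expand to all Dept/Date combinations, then replace the resulting NaNs with 0.
out = df.complete('Dept', 'Date').fillna(0)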
You could also stick solely to pandas and use the reindex method; complete is just an abstraction/convenience wrapper that additionally covers cases where the index is not unique or there are nulls:
(df
 .set_index(['Dept', 'Date'])
 .pipe(lambda df: df.reindex(pd.MultiIndex.from_product(df.index.levels),
                             fill_value=0))
 .reset_index()
)
Dept Date Number
0 dept1 2020-01-01 12
1 dept1 2020-01-03 34
2 dept1 2020-01-04 0
3 dept2 2020-01-01 0
4 dept2 2020-01-03 56
5 dept2 2020-01-04 11
6 dept3 2020-01-01 0
7 dept3 2020-01-03 78
8 dept3 2020-01-04 12

Let us do pivot then stack:
out = df.pivot(*df.columns).fillna(0).stack().reset_index(name='Number')
Dept Date Number
0 dept1 2020-01-01 12.0
1 dept1 2020-01-03 34.0
2 dept1 2020-01-04 0.0
3 dept2 2020-01-01 0.0
4 dept2 2020-01-03 56.0
5 dept2 2020-01-04 11.0
6 dept3 2020-01-01 0.0
7 dept3 2020-01-03 78.0
8 dept3 2020-01-04 12.0
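As a side note, df.pivot(*df.columns) simply unpacks the three column names as the index, columns, and values arguments. A sketch of the explicit equivalent (my addition, not the answerer's code; on newer pandas versions pivot's arguments are keyword-only, so this spelled-out form is the safer one):
out = (df.pivot(index='Dept', columns='Date', values='Number')
         .fillna(0)
         .stack()
         .reset_index(name='Number'))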

Related

How to apply df.groupby for each user summing up purchases between two dynamic dates?

I have a df which looks like this:
user_id | date1 | date2 | purchase
1 | 2020-01-01 | 2021-01-01 | 100
1 | 2021-02-01 | 2021-05-01 | 29
2 | 2019-01-01 | 2021-01-01 | 11..
I want a dataframe which returns, for every user, the sum of purchase amounts between date1 and date2. Those dates are likely always different for each user. How could I achieve this most efficiently?
df.groupby('user_id').purchase.sum() #But how do I say that only between date1 and date2?
IIUC first repeat months and then aggregate:
df['date1'] = pd.to_datetime(df['date1']).dt.to_period('m')
df['date2'] = pd.to_datetime(df['date2']).dt.to_period('m')
diff = df['date2'].astype('int').sub(df['date1'].astype('int')) + 1
df = df.loc[df.index.repeat(diff)]
df['date'] = df.groupby(level=0).cumcount().add(df['date1']).dt.to_timestamp()
print (df)
user_id date1 date2 purchase date
0 1 2020-01 2021-01 100 2020-01-01
0 1 2020-01 2021-01 100 2020-02-01
0 1 2020-01 2021-01 100 2020-03-01
0 1 2020-01 2021-01 100 2020-04-01
0 1 2020-01 2021-01 100 2020-05-01
0 1 2020-01 2021-01 100 2020-06-01
0 1 2020-01 2021-01 100 2020-07-01
0 1 2020-01 2021-01 100 2020-08-01
0 1 2020-01 2021-01 100 2020-09-01
0 1 2020-01 2021-01 100 2020-10-01
0 1 2020-01 2021-01 100 2020-11-01
0 1 2020-01 2021-01 100 2020-12-01
0 1 2020-01 2021-01 100 2021-01-01
1 1 2021-02 2021-05 29 2021-02-01
1 1 2021-02 2021-05 29 2021-03-01
1 1 2021-02 2021-05 29 2021-04-01
1 1 2021-02 2021-05 29 2021-05-01
2 2 2019-01 2021-01 11 2019-01-01
2 2 2019-01 2021-01 11 2019-02-01
2 2 2019-01 2021-01 11 2019-03-01
2 2 2019-01 2021-01 11 2019-04-01
2 2 2019-01 2021-01 11 2019-05-01
2 2 2019-01 2021-01 11 2019-06-01
2 2 2019-01 2021-01 11 2019-07-01
2 2 2019-01 2021-01 11 2019-08-01
2 2 2019-01 2021-01 11 2019-09-01
2 2 2019-01 2021-01 11 2019-10-01
2 2 2019-01 2021-01 11 2019-11-01
2 2 2019-01 2021-01 11 2019-12-01
2 2 2019-01 2021-01 11 2020-01-01
2 2 2019-01 2021-01 11 2020-02-01
2 2 2019-01 2021-01 11 2020-03-01
2 2 2019-01 2021-01 11 2020-04-01
2 2 2019-01 2021-01 11 2020-05-01
2 2 2019-01 2021-01 11 2020-06-01
2 2 2019-01 2021-01 11 2020-07-01
2 2 2019-01 2021-01 11 2020-08-01
2 2 2019-01 2021-01 11 2020-09-01
2 2 2019-01 2021-01 11 2020-10-01
2 2 2019-01 2021-01 11 2020-11-01
2 2 2019-01 2021-01 11 2020-12-01
2 2 2019-01 2021-01 11 2021-01-01
df = df.groupby(['user_id','date'], as_index=False).purchase.sum()
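A more compact sketch of the same repeat-by-month idea, using pd.period_range and explode instead of index.repeat (my variation under the same assumptions about the data, not the answerer's exact code):
import pandas as pd

df = pd.DataFrame({
    'user_id': [1, 1, 2],
    'date1': ['2020-01-01', '2021-02-01', '2019-01-01'],
    'date2': ['2021-01-01', '2021-05-01', '2021-01-01'],
    'purchase': [100, 29, 11],
})

# One list of monthly periods per row, covering date1..date2 inclusive.
df['date'] = [list(pd.period_range(start, end, freq='M'))
              for start, end in zip(df['date1'], df['date2'])]

out = df.explode('date')
out['date'] = pd.PeriodIndex(out['date']).to_timestamp()
out = out.groupby(['user_id', 'date'], as_index=False)['purchase'].sum()
print(out.head())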

How to Coalesce datetime values from 3 columns into a single column in a pandas dataframe?

I have a dataframe with 3 date columns in datetime format:
CLIENT_ID  DATE_BEGIN  DATE_START  DATE_REGISTERED
1          2020-01-01  2020-01-01  2020-01-01
2          2020-01-02  2020-02-01  2020-01-01
3          NaN         2020-05-01  2020-04-01
4          2020-01-01  2020-01-01  NaN
How do I create (coalesce) a new column with the earliest datetime for each row, resulting in an ACTUAL_START_DATE?
CLIENT_ID  DATE_BEGIN  DATE_START  DATE_REGISTERED  ACTUAL_START_DATE
1          2020-01-01  2020-01-01  2020-01-01       2020-01-01
2          2020-01-02  2020-02-01  2020-01-01       2020-01-01
3          NaN         2020-05-01  2020-04-01       2020-04-01
4          2020-01-01  2020-01-02  NaN              2020-01-01
Some sort of variation with bfill?
You are right, a mix of bfill and ffill along the columns axis (followed by a row-wise min) should do it:
df.assign(ACTUAL_START_DATE=df.filter(like='DATE')
                              .bfill(axis=1)
                              .ffill(axis=1)
                              .min(axis=1))
CLIENT_ID DATE_BEGIN DATE_START DATE_REGISTERED ACTUAL_START_DATE
0 1 2020-01-01 2020-01-01 2020-01-01 2020-01-01
1 2 2020-01-02 2020-02-01 2020-01-01 2020-01-01
2 3 NaN 2020-05-01 2020-04-01 2020-04-01
3 4 2020-01-01 2020-01-01 NaN 2020-01-01
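Since the goal is the earliest date per row, another minimal sketch (my alternative, not the answerer's code) is to convert the date columns and take a row-wise min directly; min skips missing values by default:
import pandas as pd

date_cols = ['DATE_BEGIN', 'DATE_START', 'DATE_REGISTERED']
df[date_cols] = df[date_cols].apply(pd.to_datetime)   # ensure datetime dtype
df['ACTUAL_START_DATE'] = df[date_cols].min(axis=1)   # NaT values are skipped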

7 days hourly mean with pandas

I need some help calculating a 7-day mean for every hour.
The time series has an hourly resolution and I need the 7-day mean for each hour of the day, e.g. for 13 o'clock:
date, x
2020-07-01 13:00 , 4
2020-07-01 14:00 , 3
.
.
.
2020-07-02 13:00 , 3
2020-07-02 14:00 , 7
.
.
.
I tried it with pandas and a rolling mean, but a plain rolling window includes all of the last 7 days' values, not just the same hour of each day.
Thanks for any hints!
Add a new hour column, group by it, and then take a rolling mean over 7 observations within each hour group, so the average for each hour of the day is calculated over 7 days. This is consistent with the intent of the question.
df['hour'] = df.index.hour
df = df.groupby(df.hour)['x'].rolling(7).mean().reset_index()
df.head(35)
hour level_1 x
0 0 2020-07-01 00:00:00 NaN
1 0 2020-07-02 00:00:00 NaN
2 0 2020-07-03 00:00:00 NaN
3 0 2020-07-04 00:00:00 NaN
4 0 2020-07-05 00:00:00 NaN
5 0 2020-07-06 00:00:00 NaN
6 0 2020-07-07 00:00:00 48.142857
7 0 2020-07-08 00:00:00 50.285714
8 0 2020-07-09 00:00:00 60.000000
9 0 2020-07-10 00:00:00 63.142857
10 1 2020-07-01 01:00:00 NaN
11 1 2020-07-02 01:00:00 NaN
12 1 2020-07-03 01:00:00 NaN
13 1 2020-07-04 01:00:00 NaN
14 1 2020-07-05 01:00:00 NaN
15 1 2020-07-06 01:00:00 NaN
16 1 2020-07-07 01:00:00 52.571429
17 1 2020-07-08 01:00:00 48.428571
18 1 2020-07-09 01:00:00 38.000000
19 2 2020-07-01 02:00:00 NaN
20 2 2020-07-02 02:00:00 NaN
21 2 2020-07-03 02:00:00 NaN
22 2 2020-07-04 02:00:00 NaN
23 2 2020-07-05 02:00:00 NaN
24 2 2020-07-06 02:00:00 NaN
25 2 2020-07-07 02:00:00 46.571429
26 2 2020-07-08 02:00:00 47.714286
27 2 2020-07-09 02:00:00 42.714286
28 3 2020-07-01 03:00:00 NaN
29 3 2020-07-02 03:00:00 NaN
30 3 2020-07-03 03:00:00 NaN
31 3 2020-07-04 03:00:00 NaN
32 3 2020-07-05 03:00:00 NaN
33 3 2020-07-06 03:00:00 NaN
34 3 2020-07-07 03:00:00 72.571429
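For a runnable end-to-end version, a minimal sketch with made-up hourly data (the values are random, so the numbers will differ from the output above):
import numpy as np
import pandas as pd

idx = pd.date_range('2020-07-01', periods=10 * 24, freq='h')    # 10 days of hourly data
df = pd.DataFrame({'x': np.random.randint(0, 100, len(idx))}, index=idx)

df['hour'] = df.index.hour
out = df.groupby('hour')['x'].rolling(7).mean().reset_index()   # 7-day mean per hour of day
print(out.head(10))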

LAG / OVER / PARTITION / ORDER BY using conditions - SQL Server 2017

I have a table that looks like this:
Date AccountID Amount
2018-01-01 123 12
2018-01-06 123 150
2018-02-14 123 11
2018-05-06 123 16
2018-05-16 123 200
2018-06-01 123 18
2018-06-15 123 17
2018-06-18 123 110
2018-06-30 123 23
2018-07-01 123 45
2018-07-12 123 116
2018-07-18 123 60
This table has multiple dates and IDs, along with multiple Amounts. For each individual row, I want to grab the last Date where Amount was over a specific value for that specific AccountID. I have been trying to use LAG( Date, 1 ) in combination with several variations of CASE and OVER ( PARTITION BY AccountID ORDER BY Date ) statements, but I've had no luck. Ultimately, this is what I would like my SELECT statement to return.
Date AccountID Amount LastOverHundred
2018-01-01 123 12 NULL
2018-01-06 123 150 2018-01-06
2018-02-14 123 11 2018-01-06
2018-05-06 123 16 2018-01-06
2018-05-16 123 200 2018-05-16
2018-06-01 123 18 2018-05-16
2018-06-15 123 17 2018-05-16
2018-06-18 123 110 2018-06-18
2018-06-30 123 23 2018-06-18
2018-07-01 123 45 2018-06-18
2018-07-12 123 116 2018-07-12
2018-07-18 123 60 2018-07-12
Any help with this would be greatly appreciated.
Use a cumulative conditional max(): the running window max over rows ordered by date yields, for each row, the most recent date so far where amount exceeded 100.
select t.*,
max(case when amount > 100 then date end) over (partition by accountid order by date) as lastoverhundred
from t;

pandas rolling cumsum over the trailing n elements

Using pandas, what is the easiest way to calculate a rolling cumsum over the previous n elements, for instance to calculate the trailing three days' sales:
import numpy
import pandas
df = pandas.Series(numpy.random.randint(0, 10, 10), index=pandas.date_range('2020-01', periods=10))
df
2020-01-01 8
2020-01-02 4
2020-01-03 1
2020-01-04 0
2020-01-05 5
2020-01-06 8
2020-01-07 3
2020-01-08 8
2020-01-09 9
2020-01-10 0
Freq: D, dtype: int64
Desired output:
2020-01-01 8
2020-01-02 12
2020-01-03 13
2020-01-04 5
2020-01-05 6
2020-01-06 13
2020-01-07 16
2020-01-08 19
2020-01-09 20
2020-01-10 17
Freq: D, dtype: int64
You need rolling.sum:
df.rolling(3, min_periods=1).sum()
Out:
2020-01-01 8.0
2020-01-02 12.0
2020-01-03 13.0
2020-01-04 5.0
2020-01-05 6.0
2020-01-06 13.0
2020-01-07 16.0
2020-01-08 19.0
2020-01-09 20.0
2020-01-10 17.0
dtype: float64
With a window size of 3, the first two elements would be NaN by default; min_periods=1 ensures they are calculated too.
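If you want the integer dtype shown in the desired output, a small sketch (my addition): cast back after the rolling sum, which is safe here because min_periods=1 leaves no NaN.
import numpy as np
import pandas as pd

s = pd.Series(np.random.randint(0, 10, 10), index=pd.date_range('2020-01', periods=10))
trailing = s.rolling(3, min_periods=1).sum().astype(int)   # trailing 3-day sum as int
print(trailing)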