Groupby sum by Inspector_ID and Day in pandas

I have a dataframe as shown below, giving the distance travelled by two inspectors on different days.
Inspector_ID  Timestamp            Distance
1             2018-07-24 7:31:00   0
1             2018-07-24 7:31:01   2.3
1             2018-07-24 7:33:01   2.8
1             2018-07-25 7:32:04   0
1             2018-07-25 7:33:06   3.6
2             2018-07-20 8:35:08   0
2             2018-07-20 8:36:10   5.6
2             2018-07-20 8:37:09   2.1
2             2018-07-27 8:38:00   0
2             2018-07-27 8:39:00   3
2             2018-07-27 8:40:00   2.6
From the above, I would like to prepare the dataframe below.
Expected output:
Inspector_ID  Day_of_the_month  Total_distance  No_of_entries_in_that_day  week_of_the_month
1             24                5.1             3                          4
1             25                3.6             2                          4
2             20                7.7             3                          3
2             27                5.6             3                          4

import pandas as pd
from datetime import datetime as dt
df = {"Inspector_ID": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
      "Timestamp": ["2018-07-24 7:31:00", "2018-07-24 7:31:01", "2018-07-24 7:33:01",
                    "2018-07-25 7:32:04", "2018-07-25 7:33:06", "2018-07-20 8:35:08",
                    "2018-07-20 8:36:10", "2018-07-20 8:37:09", "2018-07-27 8:38:00",
                    "2018-07-27 8:39:00", "2018-07-27 8:40:00"],
      "Distance": [0, 2.3, 2.8, 0, 3.6, 0, 5.6, 2.1, 0, 3, 2.6]}
df["Day"] = []
df["week_of_month"] = []
for entry in df["Timestamp"]:
    df["Day"].append(dt.strptime(entry, '%Y-%m-%d %H:%M:%S').day)
    df["week_of_month"].append((dt.strptime(entry, '%Y-%m-%d %H:%M:%S').day - 1) // 7 + 1)
df = pd.DataFrame(df)
# named aggregation; the dict-renaming form of .agg was deprecated and later removed
result = df.groupby(['Inspector_ID', 'Day', 'week_of_month'])['Distance'].agg(
    totdist='sum', NoEntries='count')
print(result)
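
For comparison, here is a more idiomatic sketch of the same computation that skips the manual strptime loop: parse the timestamps once with pd.to_datetime and derive both helper columns from the .dt accessor (same data and column names as above):

import pandas as pd

df = pd.DataFrame({
    "Inspector_ID": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
    "Timestamp": ["2018-07-24 7:31:00", "2018-07-24 7:31:01", "2018-07-24 7:33:01",
                  "2018-07-25 7:32:04", "2018-07-25 7:33:06", "2018-07-20 8:35:08",
                  "2018-07-20 8:36:10", "2018-07-20 8:37:09", "2018-07-27 8:38:00",
                  "2018-07-27 8:39:00", "2018-07-27 8:40:00"],
    "Distance": [0, 2.3, 2.8, 0, 3.6, 0, 5.6, 2.1, 0, 3, 2.6]})

ts = pd.to_datetime(df["Timestamp"])            # vectorised parse, no loop
df["Day"] = ts.dt.day
df["week_of_month"] = (ts.dt.day - 1) // 7 + 1  # same week rule as above

result = (df.groupby(["Inspector_ID", "Day", "week_of_month"])["Distance"]
            .agg(totdist="sum", NoEntries="count")
            .reset_index())
print(result)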

Related

pandas lag multi-index irregular time series data by number of months

I have the following pandas dataframe
df = pd.DataFrame(data = {
'item': ['red','red','red','blue','blue'],
'dt': pd.to_datetime(['2018-01-31', '2018-02-28', '2018-03-31', '2018-01-31', '2018-03-31']),
's': [3.2, 4.8, 5.1, 5.3, 5.8],
'r': [1,2,3,4,5],
't': [7,8,9,10,11],
})
which looks like
item dt s r t
0 red 2018-01-31 3.2 1 7
1 red 2018-02-28 4.8 2 8
2 red 2018-03-31 5.1 3 9
3 blue 2018-01-31 5.3 4 10
4 blue 2018-03-31 5.8 5 11
Note that the time points are irregular: "blue" is missing February data. All dates are valid end-of-month dates.
I'd like to add a column which is the "s value from two months ago", ideally something like
df['s_lag2m'] = df.set_index(['item','dt'])['s'].shift(2, 'M')
and I would get
item dt s r t s_lag2m
0 red 2018-01-31 3.2 1 7 NaN
1 red 2018-02-28 4.8 2 8 NaN
2 red 2018-03-31 5.1 3 9 3.2
3 blue 2018-01-31 5.3 4 10 NaN
4 blue 2018-03-31 5.8 5 11 5.3
But that doesn't work; it throws NotImplementedError: Not supported for type MultiIndex.
How can I do this?
We can set_index on dt only, shift within each item group, then reindex back onto the original (item, dt) pairs:
df['New'] = (df.set_index('dt')
               .groupby('item')['s']
               .shift(2, 'M')
               .reindex(pd.MultiIndex.from_frame(df[['item', 'dt']]))
               .values)
df
item dt s r t New
0 red 2018-01-31 3.2 1 7 NaN
1 red 2018-02-28 4.8 2 8 NaN
2 red 2018-03-31 5.1 3 9 3.2
3 blue 2018-01-31 5.3 4 10 NaN
4 blue 2018-03-31 5.8 5 11 5.3
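
An equivalent merge-based sketch, leaning on the question's guarantee that every dt is a valid end-of-month date: shift each row's date forward by two month-ends, then left-join the shifted values back onto the original (item, dt) pairs (r and t omitted for brevity):

import pandas as pd
from pandas.tseries.offsets import MonthEnd

df = pd.DataFrame({
    'item': ['red', 'red', 'red', 'blue', 'blue'],
    'dt': pd.to_datetime(['2018-01-31', '2018-02-28', '2018-03-31',
                          '2018-01-31', '2018-03-31']),
    's': [3.2, 4.8, 5.1, 5.3, 5.8]})

lag = df.rename(columns={'s': 's_lag2m'})
lag['dt'] = lag['dt'] + MonthEnd(2)   # each value becomes the lag for two month-ends later
df = df.merge(lag, on=['item', 'dt'], how='left')
print(df)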

Add zero values for missing MultiIndex entries in a pandas dataframe

I have a pandas (v0.23.4) dataframe with a MultiIndex ('date', 'class').
Col_values
date class
2019-04-30 0 324
1 6874
2 44
3 5
4 15
2019-05-31 0 393
1 6534
2 64
3 1
4 22
2019-06-30 0 325
1 5899
2 48
4 7
For '2019-06-30', class 3 is missing because there is no data.
What I want is to add class 3 to the MultiIndex automatically, with zero values in the Col_values column.
Use DataFrame.unstack with fill_value=0, followed by DataFrame.stack:
df = df.unstack(fill_value=0).stack()
print (df)
Col_values
date class
2019-04-30 0 324
1 6874
2 44
3 5
4 15
2019-05-31 0 393
1 6534
2 64
3 1
4 22
2019-06-30 0 325
1 5899
2 48
3 0
4 7
Another solution is to use DataFrame.reindex with MultiIndex.from_product:
mux = pd.MultiIndex.from_product(df.index.levels, names=df.index.names)
df = df.reindex(mux, fill_value=0)
print (df)
Col_values
date class
2019-04-30 0 324
1 6874
2 44
3 5
4 15
2019-05-31 0 393
1 6534
2 64
3 1
4 22
2019-06-30 0 325
1 5899
2 48
3 0
4 7
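
For reference, a minimal reconstruction of the input frame to test either snippet against (values copied from the question):

import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [('2019-04-30', c) for c in range(5)]
    + [('2019-05-31', c) for c in range(5)]
    + [('2019-06-30', c) for c in (0, 1, 2, 4)],   # class 3 absent in June
    names=['date', 'class'])
df = pd.DataFrame({'Col_values': [324, 6874, 44, 5, 15,
                                  393, 6534, 64, 1, 22,
                                  325, 5899, 48, 7]}, index=idx)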

How to unstack the first row index of a MultiIndex in pandas

In pandas I have a dataframe, produced with unstack(), that looks like this:
mean median std
0 1 2 3 0 1 2 3 0 1 2 3
-------------------------------------------------------------------------------
2019-08-31 2 3 6 4 3 3 2 3 0.3 0.4 2 3
Before the unstack(), the frame is:
mean median std
--------------------------------------------
2019-08-31 0 2 3 0.3
1 3 3 0.4
2 6 2 2
3 4 3 3
2019-09-01 0
Which unstack() command can I use to remove the first row index and make the frame look like:
0 1 2 3 0 1 2 3 0 1 2 3
-------------------------------------------------------------------------------
2019-08-31 2 3 6 4 3 3 2 3 0.3 0.4 2 3
Start from your original instruction:
df = df.unstack()
This unstacks level=-1, i.e. the last level of the row MultiIndex, and moves it into the column index. Then run:
df.columns = df.columns.droplevel()
which drops level=0, i.e. the top level of the column MultiIndex (just mean / median / std).
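
Putting the two steps together on a small reconstruction of the frame shown above (values copied from the question; the truncated 2019-09-01 row is left out):

import pandas as pd

idx = pd.MultiIndex.from_product([['2019-08-31'], [0, 1, 2, 3]])
df = pd.DataFrame({'mean': [2, 3, 6, 4],
                   'median': [3, 3, 2, 3],
                   'std': [0.3, 0.4, 2, 3]}, index=idx)

df = df.unstack()                      # moves the inner row level into the columns
df.columns = df.columns.droplevel(0)   # drops the mean / median / std level
print(df)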

Fill consecutive NaNs with cumsum, to increment by one on each consecutive NaN

Given a dataframe with many missing values in a certain interval, my desired output dataframe should have every run of consecutive NaNs filled by counting up from the preceding valid value, adding 1 for each NaN.
Given:
shop_id calendar_date quantity
0 2018-12-12 1
1 2018-12-13 NaN
2 2018-12-14 NaN
3 2018-12-15 NaN
4 2018-12-16 1
5 2018-12-17 NaN
Desired output:
shop_id calendar_date quantity
0 2018-12-12 1
1 2018-12-13 2
2 2018-12-14 3
3 2018-12-15 4
4 2018-12-16 1
5 2018-12-17 2
Use:
g = (~df.quantity.isnull()).cumsum()
df['quantity'] = df.fillna(1).groupby(g).quantity.cumsum()
shop_id calendar_date quantity
0 0 2018-12-12 1.0
1 1 2018-12-13 2.0
2 2 2018-12-14 3.0
3 3 2018-12-15 4.0
4 4 2018-12-16 1.0
5 5 2018-12-17 2.0
Details
Use ~df.quantity.isnull() to flag where quantity has valid values, and take the cumsum of the boolean Series to label each run:
g = (~df.quantity.isnull()).cumsum()
0 1
1 1
2 1
3 1
4 2
5 2
Use fillna(1) so that when you group by g and take the cumsum, the values in each run count up from the starting value:
df.fillna(1).groupby(g).quantity.cumsum()
0 1.0
1 2.0
2 3.0
3 4.0
4 1.0
5 2.0
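
For completeness, a minimal frame to run those two lines on end to end (same values as the question):

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'shop_id': [0, 1, 2, 3, 4, 5],
    'calendar_date': ['2018-12-12', '2018-12-13', '2018-12-14',
                      '2018-12-15', '2018-12-16', '2018-12-17'],
    'quantity': [1, np.nan, np.nan, np.nan, 1, np.nan]})

g = (~df.quantity.isnull()).cumsum()
df['quantity'] = df.fillna(1).groupby(g).quantity.cumsum()
print(df)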
Another approach:
data
   shop_id calendar_date  quantity
0        0    2018-12-12       1.0
1        1    2018-12-13       NaN
2        2    2018-12-14       NaN
3        3    2018-12-15       NaN
4        4    2018-12-16       1.0
5        5    2018-12-17       NaN
6        6    2018-12-18       NaN
7        7    2018-12-17       NaN
Using np.where:
import numpy as np

# positions of the valid (non-NaN) quantities
where = np.where(data['quantity'] >= 1)
r = []
for i in range(len(where[0])):
    try:
        r.extend(np.arange(1, where[0][i + 1] - where[0][i] + 1))
    except IndexError:
        # last valid value: count up to the end of the frame
        r.extend(np.arange(1, len(data) - where[0][i] + 1))
data['quantity'] = r
print(data)
shop_id calendar_date quantity
0 0 2018-12-12 1
1 1 2018-12-13 2
2 2 2018-12-14 3
3 3 2018-12-15 4
4 4 2018-12-16 1
5 5 2018-12-17 2
6 6 2018-12-18 3
7 7 2018-12-17 4
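
The same counting can also be done without the explicit loop, reusing the run-labelling trick from the first answer with groupby.cumcount (a sketch; note that, like the loop above, it always restarts at 1 on each valid value rather than at the value itself):

g = data['quantity'].notnull().cumsum()
data['quantity'] = data.groupby(g).cumcount() + 1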

Pandas: Obtaining the maximum of a column based on other column values

I have a pandas dataframe that looks like this:
ID date num
1 2018-03-28 3
1 2018-03-29 1
1 2018-03-30 4
1 2018-04-04 1
2 2018-04-03 1
2 2018-04-04 6
2 2018-04-10 3
2 2018-04-11 4
Created by the following code:
import pandas as pd
df = pd.DataFrame({'ID': [1, 1, 1, 1, 2, 2, 2, 2], 'date': ['2018-03-28',
'2018-03-29', '2018-03-30', '2018-04-04', '2018-04-03', '2018-04-04',
'2018-04-10', '2018-04-11'], 'num': [3,1,4,1,1,6,3,4]})
What I would like is to create a new column called 'maxnum' that is filled with the maximum value of num per ID for the date that is on that row and all earlier dates. This column would look like this:
ID date maxnum num
1 2018-03-28 3 3
1 2018-03-29 3 1
1 2018-03-30 4 4
1 2018-04-04 4 1
2 2018-04-03 1 1
2 2018-04-04 6 6
2 2018-04-10 6 3
2 2018-04-11 6 4
Does anyone know how I can program this column correctly and efficiently?
Thanks in advance!
Using cummax (assuming your dataframe is already ordered by date; if not, run the commented lines first):
#df.date=pd.to_datetime(df.date)
#df=df.sort_values('date')
df.groupby('ID').num.cummax()
Out[258]:
0 3
1 3
2 4
3 4
4 1
5 6
6 6
7 6
Name: num, dtype: int64
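To attach this as the maxnum column the question asks for:

df['maxnum'] = df.groupby('ID').num.cummax()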