pandas lag multi-index irregular time series data by number of months

I have the following pandas DataFrame:
df = pd.DataFrame(data = {
'item': ['red','red','red','blue','blue'],
'dt': pd.to_datetime(['2018-01-31', '2018-02-28', '2018-03-31', '2018-01-31', '2018-03-31']),
's': [3.2, 4.8, 5.1, 5.3, 5.8],
'r': [1,2,3,4,5],
't': [7,8,9,10,11],
})
which looks like
item dt s r t
0 red 2018-01-31 3.2 1 7
1 red 2018-02-28 4.8 2 8
2 red 2018-03-31 5.1 3 9
3 blue 2018-01-31 5.3 4 10
4 blue 2018-03-31 5.8 5 11
Note that the time points are irregular: "blue" is missing February data. All dates are valid end-of-month dates.
I'd like to add a column which is the "s value from two months ago", ideally something like
df['s_lag2m'] = df.set_index(['item','dt'])['s'].shift(2, 'M')
and I would get
item dt s r t s_lag2m
0 red 2018-01-31 3.2 1 7 NaN
1 red 2018-02-28 4.8 2 8 NaN
2 red 2018-03-31 5.1 3 9 3.2
3 blue 2018-01-31 5.3 4 10 NaN
4 blue 2018-03-31 5.8 5 11 5.3
But that doesn't work; it throws NotImplementedError: Not supported for type MultiIndex.
How can I do this?

We can reindex after set_index with only dt:
df['New'] = (df.set_index('dt')
               .groupby('item')['s']
               .shift(2, freq='M')
               .reindex(pd.MultiIndex.from_frame(df[['item', 'dt']]))
               .values)
df
item dt s r t New
0 red 2018-01-31 3.2 1 7 NaN
1 red 2018-02-28 4.8 2 8 NaN
2 red 2018-03-31 5.1 3 9 3.2
3 blue 2018-01-31 5.3 4 10 NaN
4 blue 2018-03-31 5.8 5 11 5.3
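An alternative sketch that avoids the reindex step, assuming (as stated above) that every dt is a valid end-of-month date: shift each row's date forward by two month-ends with pd.offsets.MonthEnd and merge the shifted copy back onto the original rows.

```python
import pandas as pd

df = pd.DataFrame({
    'item': ['red', 'red', 'red', 'blue', 'blue'],
    'dt': pd.to_datetime(['2018-01-31', '2018-02-28', '2018-03-31',
                          '2018-01-31', '2018-03-31']),
    's': [3.2, 4.8, 5.1, 5.3, 5.8],
})

# Move each observation's date two month-ends forward, then align it
# back onto the original rows; rows with no observation two months
# earlier simply get NaN from the left merge.
lagged = df[['item', 'dt', 's']].copy()
lagged['dt'] = lagged['dt'] + pd.offsets.MonthEnd(2)
out = df.merge(lagged.rename(columns={'s': 's_lag2m'}),
               on=['item', 'dt'], how='left')
```

Because MonthEnd is an anchored offset, adding MonthEnd(2) to a date that is already a month end moves it exactly two month-ends forward (2018-01-31 becomes 2018-03-31), which is what "two months ago" means for this data.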

Related

Classify a value under certain conditions in pandas dataframe

I have this dataframe:
value limit_1 limit_2 limit_3 limit_4
10 2 3 7 10
11 5 6 11 13
2 0.3 0.9 2.01 2.99
I want to add another column called class that classifies the value column this way:
if value <= limit1.value then 1
if value > limit1.value and <= limit2.value then 2
if value > limit2.value and <= limit3.value then 3
if value > limit3.value then 4
to get this result:
value limit_1 limit_2 limit_3 limit_4 CLASS
10 2 3 7 10 4
11 5 6 11 13 3
2 0.3 0.9 2.01 2.99 3
I know I could get these ifs to work, but my dataframe has about 2 million rows and I need the fastest way to perform this classification.
I tried pd.cut, but the result was not what I expected.
Thanks
We can use the rank method over the column axis (axis=1):
df["CLASS"] = df.rank(axis=1, method="first").iloc[:, 0].astype(int)
value limit_1 limit_2 limit_3 limit_4 CLASS
0 10 2.0 3.0 7.00 10.00 4
1 11 5.0 6.0 11.00 13.00 3
2 2 0.3 0.9 2.01 2.99 3
We can use np.select (conditions are evaluated in order and the first match wins; using <= and inclusive="right" keeps the boundaries consistent with the rules above):
import numpy as np
conditions = [df["value"] <= df["limit_1"],
              df["value"].between(df["limit_1"], df["limit_2"], inclusive="right"),
              df["value"].between(df["limit_2"], df["limit_3"], inclusive="right"),
              df["value"] > df["limit_3"]]
df["CLASS"] = np.select(conditions, [1, 2, 3, 4])
>>> df
value limit_1 limit_2 limit_3 limit_4 CLASS
0 10 2.0 3.0 7.00 10.00 4
1 11 5.0 6.0 11.00 13.00 3
2 2 0.3 0.9 2.01 2.99 3
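Another vectorized option, sketched under the assumption that the limit columns are already sorted within each row: the class is simply one plus the number of limits the value strictly exceeds.

```python
import pandas as pd

df = pd.DataFrame({
    "value":   [10, 11, 2],
    "limit_1": [2, 5, 0.3],
    "limit_2": [3, 6, 0.9],
    "limit_3": [7, 11, 2.01],
    "limit_4": [10, 13, 2.99],
})

# value <= limit_1 exceeds 0 limits -> class 1, and so on up to
# value > limit_3 exceeding all 3 -> class 4.  Broadcasting the
# (n, 1) value column against the (n, 3) limit matrix does all
# rows at once, which scales well to millions of rows.
limits = df[["limit_1", "limit_2", "limit_3"]].to_numpy()
df["CLASS"] = (df[["value"]].to_numpy() > limits).sum(axis=1) + 1
```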

Groupby sum by Inspector_ID, and Day in pandas

I have a dataframe as shown below, which is the distance travelled by two inspectors on different days.
Inspector_ID Timestamp Distance
1 2018-07-24 7:31:00 0
1 2018-07-24 7:31:01 2.3
1 2018-07-24 7:33:01 2.8
1 2018-07-25 7:32:04 0
1 2018-07-25 7:33:06 3.6
2 2018-07-20 8:35:08 0
2 2018-07-20 8:36:10 5.6
2 2018-07-20 8:37:09 2.1
2 2018-07-27 8:38:00 0
2 2018-07-27 8:39:00 3
2 2018-07-27 8:40:00 2.6
From the above, I would like to prepare below dataframe
Expected output
Inspector_ID Day_of_the_month Total_distance No_of_entries_in_that_day week_of_the_month
1 24 5.1 3 4
1 25 3.6 2 4
2 20 7.7 3 3
2 27 5.6 3 4
import pandas as pd
from datetime import datetime as dt

df = {"Inspector_ID": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
      "Timestamp": ["2018-07-24 7:31:00", "2018-07-24 7:31:01", "2018-07-24 7:33:01",
                    "2018-07-25 7:32:04", "2018-07-25 7:33:06", "2018-07-20 8:35:08",
                    "2018-07-20 8:36:10", "2018-07-20 8:37:09", "2018-07-27 8:38:00",
                    "2018-07-27 8:39:00", "2018-07-27 8:40:00"],
      "Distance": [0, 2.3, 2.8, 0, 3.6, 0, 5.6, 2.1, 0, 3, 2.6]}
df["Day"] = []
df["week_of_month"] = []
for entry in df["Timestamp"]:
    df["Day"].append(dt.strptime(entry, '%Y-%m-%d %H:%M:%S').day)
    df["week_of_month"].append((dt.strptime(entry, '%Y-%m-%d %H:%M:%S').day - 1) // 7 + 1)
df = pd.DataFrame(df)
result = df.groupby(['Inspector_ID', 'Day', 'week_of_month'])['Distance'].agg(totdist="sum", NoEntries="count")
print(result)
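The same aggregation can be sketched more idiomatically with the .dt accessor instead of a manual strptime loop, using named aggregation for the two statistics (the output column names Total_distance and No_of_entries are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "Inspector_ID": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
    "Timestamp": ["2018-07-24 7:31:00", "2018-07-24 7:31:01", "2018-07-24 7:33:01",
                  "2018-07-25 7:32:04", "2018-07-25 7:33:06", "2018-07-20 8:35:08",
                  "2018-07-20 8:36:10", "2018-07-20 8:37:09", "2018-07-27 8:38:00",
                  "2018-07-27 8:39:00", "2018-07-27 8:40:00"],
    "Distance": [0, 2.3, 2.8, 0, 3.6, 0, 5.6, 2.1, 0, 3, 2.6],
})

# Vectorized date parts: day of month and a simple week-of-month rule
# ((day - 1) // 7 + 1, same formula as in the loop above).
ts = pd.to_datetime(df["Timestamp"])
out = (df.assign(Day=ts.dt.day,
                 week_of_month=(ts.dt.day - 1) // 7 + 1)
         .groupby(["Inspector_ID", "Day", "week_of_month"], as_index=False)["Distance"]
         .agg(Total_distance="sum", No_of_entries="count"))
```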

Pandas: Grouped DataFrame - divide values of a column by the value of a certain row within that column for each group

I have a dataframe with groups.
To normalize the values for each group I'd like to divide all values of each group by the value of a certain element within that group.
df = pd.DataFrame([['a','2018-02-03',42],
['a','2018-02-04',22],
['a','2018-02-05',10],
['a','2018-02-06',32],
['b','2018-02-03',10],
['b','2018-02-04',8],
['b','2018-02-05',2],
['b','2018-02-06',12],
['c','2018-02-03',20],
['c','2018-02-04',30],
['c','2018-02-05',5],
['c','2018-02-06',15]])
df.columns = ['product','day','value']
I want to normalize column 'value' for each 'product' by 'value' of 'day' == '2018-02-05'
Expected Result:
product day value
0 a 2018-02-03 4.2
1 a 2018-02-04 2.2
2 a 2018-02-05 1.0
3 a 2018-02-06 3.2
4 b 2018-02-03 5.0
5 b 2018-02-04 4.0
6 b 2018-02-05 1.0
7 b 2018-02-06 6.0
8 c 2018-02-03 4.0
9 c 2018-02-04 6.0
10 c 2018-02-05 1.0
11 c 2018-02-06 3.0
I tried df.groupby('product').transform().
To access the first value .transform('first') is possible.
But I cannot find a way to access a certain value.
Annotation:
Maybe this one can be solved without using .groupby() ?
You can do it like this:
df = pd.DataFrame([['a','2018-02-03',42],
['a','2018-02-04',22],
['a','2018-02-05',10],
['a','2018-02-06',32],
['b','2018-02-03',10],
['b','2018-02-04',8],
['b','2018-02-05',2],
['b','2018-02-06',12],
['c','2018-02-03',20],
['c','2018-02-04',30],
['c','2018-02-05',5],
['c','2018-02-06',15]])
df.columns = ['product','day','value']
date = '2018-02-05'
# Set the index to ['product', 'day']
df.set_index(['product', 'day'], inplace=True)
# Helper Series - Values of date at index 'day'
s = df.xs(date, level=1)
# Divide df by helper Series and reset index
df = df.div(s, level=0).reset_index()
print(df)
product day value
0 a 2018-02-03 4.2
1 a 2018-02-04 2.2
2 a 2018-02-05 1.0
3 a 2018-02-06 3.2
4 b 2018-02-03 5.0
5 b 2018-02-04 4.0
6 b 2018-02-05 1.0
7 b 2018-02-06 6.0
8 c 2018-02-03 4.0
9 c 2018-02-04 6.0
10 c 2018-02-05 1.0
11 c 2018-02-06 3.0
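As the annotation suggests, this can also be sketched without groupby: take the rows at the reference day as a per-product baseline Series and divide product-wise via map.

```python
import pandas as pd

df = pd.DataFrame([['a', '2018-02-03', 42], ['a', '2018-02-04', 22],
                   ['a', '2018-02-05', 10], ['a', '2018-02-06', 32],
                   ['b', '2018-02-03', 10], ['b', '2018-02-04', 8],
                   ['b', '2018-02-05', 2],  ['b', '2018-02-06', 12],
                   ['c', '2018-02-03', 20], ['c', '2018-02-04', 30],
                   ['c', '2018-02-05', 5],  ['c', '2018-02-06', 15]],
                  columns=['product', 'day', 'value'])

# Baseline: value of each product on the reference day,
# indexed by product so it can be mapped onto every row.
base = df.loc[df['day'] == '2018-02-05'].set_index('product')['value']
df['value'] = df['value'] / df['product'].map(base)
```

This assumes each product has exactly one row on the reference day; with duplicates, the set_index call would keep ambiguous labels and map would fail.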

pandas: fillna whole df with groupby

I have the following df (with many more number columns). I now want to forward-fill all the columns in the dataframe, but grouped by id.
id date number number2
1 2001 4 11
1 2002 4 45
1 2003 NaN 13
2 2001 7 NaN
2 2002 8 2
The result should look like this:
id date number number2
1 2001 4 11
1 2002 4 45
1 2003 4 13
2 2001 7 NaN
2 2002 8 2
I tried the following command:
df= df.groupby("id").fillna(method="ffill", limit=2)
However, this raises a KeyError "isin". Filling just one column with the following command works just fine, but how can I efficiently forward fill the whole df grouped by id?
df["number"]= df.groupby("id")["number"].fillna(method="ffill", limit=2)
You can use:
df = df.groupby("id").apply(lambda x: x.ffill(limit=2))
print (df)
id date number number2
0 1 2001 4.0 11.0
1 1 2002 4.0 45.0
2 1 2003 4.0 13.0
3 2 2001 7.0 NaN
4 2 2002 8.0 2.0
This also works for me:
df.groupby("id").fillna(method="ffill", limit=2)
so I think you need to upgrade pandas.
ffill can be used directly:
df.groupby('id').ffill(limit=2)
Out[423]:
id date number number2
0 1 2001 4.0 11.0
1 1 2002 4.0 45.0
2 1 2003 4.0 13.0
3 2 2001 7.0 NaN
4 2 2002 8.0 2.0
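For current pandas, where fillna(method=...) is deprecated, the same group-wise forward fill can be sketched with groupby.ffill. Note that the result excludes the grouping column, so the filled columns are assigned back:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2],
                   'date': [2001, 2002, 2003, 2001, 2002],
                   'number': [4, 4, np.nan, 7, 8],
                   'number2': [11, 45, 13, np.nan, 2]})

# Forward-fill within each id group, at most 2 rows forward;
# the id column itself is dropped from the result.
filled = df.groupby('id').ffill(limit=2)
df[filled.columns] = filled
```

Because the fill is per group, the NaN in id 2's first row stays NaN: there is no earlier value inside that group to carry forward.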

pandas multiindex selection with ranges

I have a pandas frame like
y m A B
1990 1 3.4 5
2 4 4.9
...
1990 12 4.0 4.5
...
2000 1 2.3 8.1
2 3.7 5.0
...
2000 12 2.4 9.1
I would like to select 2-12 from the second index (m) and years 1991-2000. I don't seem to get the multindex slicing correct. E.g. I tried
idx = pd.IndexSlice
dfa = df.loc[idx[1:,1:],:]
but that does not seem to slice on the first index. Any suggestions on an elegant solution?
Cheers, Mike
Without sample code to reproduce your df it is difficult to guess, but if your df is similar to:
import io
import pandas as pd
df = pd.read_csv(io.StringIO("""y m A B
1990 1 3.4 5
1990 2 4 4.9
1990 12 4.0 4.5
2000 1 2.3 8.1
2000 2 3.7 5.0
2000 12 2.4 9.1"""), sep=r'\s+')
df
y m A B
0 1990 1 3.4 5.0
1 1990 2 4.0 4.9
2 1990 12 4.0 4.5
3 2000 1 2.3 8.1
4 2000 2 3.7 5.0
5 2000 12 2.4 9.1
Then this code will extract what you need (note that Python ranges exclude the upper bound, so months 2-12 need range(2, 13) and years 1991-2000 need range(1991, 2001)):
print(df.loc[df['y'].isin(range(1991, 2001)) & df['m'].isin(range(2, 13))])
      y   m    A    B
4  2000   2  3.7  5.0
5  2000  12  2.4  9.1
If however your df is indexed by y and m, then this does the same:
df.set_index(['y', 'm'], inplace=True)
years = df.index.get_level_values(0).isin(range(1991, 2001))
months = df.index.get_level_values(1).isin(range(2, 13))
df.loc[years & months]
           A    B
y    m
2000 2   3.7  5.0
     12  2.4  9.1
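For completeness, the pd.IndexSlice attempt from the question does work once the frame is indexed by (y, m) and sorted: label-based slicing needs a sorted MultiIndex and includes both endpoints, so 1991:2000 and 2:12 express the ranges directly. A sketch:

```python
import io
import pandas as pd

csv = """y m A B
1990 1 3.4 5
1990 2 4 4.9
1990 12 4.0 4.5
2000 1 2.3 8.1
2000 2 3.7 5.0
2000 12 2.4 9.1"""
# Label slicing on a MultiIndex requires a lexsorted index,
# hence the sort_index() after set_index().
df = (pd.read_csv(io.StringIO(csv), sep=r'\s+')
        .set_index(['y', 'm'])
        .sort_index())

idx = pd.IndexSlice
# Unlike range(), .loc label slices are inclusive on both ends.
out = df.loc[idx[1991:2000, 2:12], :]
```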