pandas multiindex selection with ranges - pandas

I have a pandas DataFrame like
y m A B
1990 1 3.4 5
2 4 4.9
...
1990 12 4.0 4.5
...
2000 1 2.3 8.1
2 3.7 5.0
...
2000 12 2.4 9.1
I would like to select months 2-12 from the second index level (m) and years 1991-2000. I can't seem to get the MultiIndex slicing right. E.g. I tried
idx = pd.IndexSlice
dfa = df.loc[idx[1:,1:],:]
but that does not seem to slice on the first index. Any suggestions on an elegant solution?
Cheers, Mike

Without sample code to reproduce your df it is difficult to guess, but if your df is similar to:
import pandas as pd
from io import StringIO  # pd.io.common.StringIO is private/removed; use the stdlib

df = pd.read_csv(StringIO("""y m A B
1990 1 3.4 5
1990 2 4 4.9
1990 12 4.0 4.5
2000 1 2.3 8.1
2000 2 3.7 5.0
2000 12 2.4 9.1"""), sep=r'\s+')
df
y m A B
0 1990 1 3.4 5.0
1 1990 2 4.0 4.9
2 1990 12 4.0 4.5
3 2000 1 2.3 8.1
4 2000 2 3.7 5.0
5 2000 12 2.4 9.1
Then this code will extract what you need:
print(df.loc[df['y'].isin(range(1991, 2001)) & df['m'].isin(range(2, 13))])
      y   m    A    B
4  2000   2  3.7  5.0
5  2000  12  2.4  9.1
If however your df is indexed by y and m, then this will do the same:
df.set_index(['y', 'm'], inplace=True)
years = df.index.get_level_values(0).isin(range(1991, 2001))
months = df.index.get_level_values(1).isin(range(2, 13))
df.loc[years & months]
           A    B
y    m
2000 2   3.7  5.0
     12  2.4  9.1

Related

How to trim until the last row meets a condition within groups in Pandas

Problem Description:
Suppose I have the following dataframe
df = pd.DataFrame({"date": [1,2,3,3,4,1,2,3,3,4,1,1,1,4,4,4,1,1,1,2,2,3,3,3,4,4],
"variable": ["A", "A", "A","A","A","A", "A", "A","A","A", "B", "B", "B","B","B","B" ,"C", "C", "C","C", "D","D","D","D","D","D"],
"no": [1, 2.2, 3.5, 1.5, 1.5,1, 2.2, 3.5, 1.5, 1.5, 1.2, 1.3, 1.1, 2, 3,1, 2.2, 3.5, 1.5, 1.5, 1.2, 1.3, 1.1, 2, 3,9],
"value": [0.469112, -0.282863, -1.509059, -1.135632, 1.212112,0.469112, -0.282863, -1.509059, -1.135632, 1.212112, -0.173215,
0.119209, -1.044236, -0.861849, -0.234,0.469112, -0.282863, -1.509059, -1.135632, 1.212112, -0.173215,
0.119209, -1.044236, -0.861849, 0.332,0.87]})
where the df would be
print(df)
date variable no value
0 1 A 1.0 0.469112
1 1 A 1.0 0.469112
2 3 A 1.5 -1.135632
3 4 A 1.5 1.212112
4 3 A 1.5 -1.135632
5 4 A 1.5 1.212112
6 2 A 2.2 -0.282863
7 2 A 2.2 -0.282863
8 3 A 3.5 -1.509059
9 3 A 3.5 -1.509059
10 4 B 1.0 0.469112
11 1 B 1.1 -1.044236
12 1 B 1.2 -0.173215
13 1 B 1.3 0.119209
14 4 B 2.0 -0.861849
15 4 B 3.0 -0.234000
16 1 C 1.5 -1.135632
17 2 C 1.5 1.212112
18 1 C 2.2 -0.282863
19 1 C 3.5 -1.509059
20 3 D 1.1 -1.044236
21 2 D 1.2 -0.173215
22 3 D 1.3 0.119209
23 3 D 2.0 -0.861849
24 4 D 3.0 0.332000
25 4 D 9.0 0.870000
And then I wanna:
Sort based on columns variable and no,
Trim each group (grouping by the single column variable) so that everything after the last row whose value in column value is greater than 0 is dropped; in other words, keep each group only up to and including the last row that meets the condition.
I have tried groupby-apply
df.groupby('variable', as_index=False).apply(
    lambda x: x.iloc[: x.where(x['value'] > 0).last_valid_index() + 1])
but the result is incorrect:
date variable no value
0 0 1 A 1.0 0.469112
1 1 A 1.0 0.469112
2 3 A 1.5 -1.135632
3 4 A 1.5 1.212112
4 3 A 1.5 -1.135632
5 4 A 1.5 1.212112
1 10 4 B 1.0 0.469112
11 1 B 1.1 -1.044236
12 1 B 1.2 -0.173215
13 1 B 1.3 0.119209
14 4 B 2.0 -0.861849
15 4 B 3.0 -0.234000
2 16 1 C 1.5 -1.135632
17 2 C 1.5 1.212112
18 1 C 2.2 -0.282863
19 1 C 3.5 -1.509059
3 20 3 D 1.1 -1.044236
21 2 D 1.2 -0.173215
22 3 D 1.3 0.119209
23 3 D 2.0 -0.861849
24 4 D 3.0 0.332000
25 4 D 9.0 0.870000
as you can see, the last rows of groups B and C have values that are not greater than 0.
A solution together with an explanation of why mine does not work would be highly appreciated.
Plus: since the real dataframe is much larger than the example here, I would rather avoid reversing it.
You can do it this way (your attempt fails because last_valid_index returns a label from the original index, while iloc expects integer positions, so the slice bound points at the wrong row within each group):
import numpy as np

df = df.sort_values(['variable', 'no'])
(df.groupby('variable')
   .apply(
       lambda x: x.iloc[:np.where(x.value.gt(0), range(len(x)), 0).max() + 1]
   ))
Output
date variable no value
variable
A 0 1 A 1.0 0.469112
5 1 A 1.0 0.469112
3 3 A 1.5 -1.135632
4 4 A 1.5 1.212112
8 3 A 1.5 -1.135632
9 4 A 1.5 1.212112
B 15 4 B 1.0 0.469112
12 1 B 1.1 -1.044236
10 1 B 1.2 -0.173215
11 1 B 1.3 0.119209
C 18 1 C 1.5 -1.135632
19 2 C 1.5 1.212112
D 22 3 D 1.1 -1.044236
20 2 D 1.2 -0.173215
21 3 D 1.3 0.119209
23 3 D 2.0 -0.861849
24 4 D 3.0 0.332000
25 4 D 9.0 0.870000
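Since the question asks to avoid reversing the large frame, here is a vectorized sketch without apply, using per-group positions (one behavioral difference, an assumption of this sketch: a group with no positive value is dropped entirely, whereas the apply version keeps its first row); shown on a small hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame({
    'variable': ['A', 'A', 'A', 'B', 'B', 'B'],
    'no':       [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
    'value':    [0.5, -1.0, 0.2, 0.3, -0.4, -0.6],
})

df = df.sort_values(['variable', 'no'])
pos = df.groupby('variable').cumcount()                # integer position within each group
last = (pos.where(df['value'].gt(0))                   # positions of rows with value > 0
           .groupby(df['variable']).transform('max'))  # last such position, per group
result = df[pos <= last]                               # trim everything after it
```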

pandas lag multi-index irregular time series data by number of months

I have the following pandas dataframe
df = pd.DataFrame(data = {
'item': ['red','red','red','blue','blue'],
'dt': pd.to_datetime(['2018-01-31', '2018-02-28', '2018-03-31', '2018-01-31', '2018-03-31']),
's': [3.2, 4.8, 5.1, 5.3, 5.8],
'r': [1,2,3,4,5],
't': [7,8,9,10,11],
})
which looks like
item dt s r t
0 red 2018-01-31 3.2 1 7
1 red 2018-02-28 4.8 2 8
2 red 2018-03-31 5.1 3 9
3 blue 2018-01-31 5.3 4 10
4 blue 2018-03-31 5.8 5 11
Note that the time points are irregular: "blue" is missing February data. All dates are valid end-of-month dates.
I'd like to add a column which is the "s value from two months ago", ideally something like
df['s_lag2m'] = df.set_index(['item','dt'])['s'].shift(2, 'M')
and I would get
item dt s r t s_lag2m
0 red 2018-01-31 3.2 1 7 NaN
1 red 2018-02-28 4.8 2 8 NaN
2 red 2018-03-31 5.1 3 9 3.2
3 blue 2018-01-31 5.3 4 10 NaN
4 blue 2018-03-31 5.8 5 11 5.3
But that doesn't work; it throws NotImplementedError: Not supported for type MultiIndex.
How can I do this?
We can reindex after set_index with only dt:
df['New'] = df.set_index('dt').groupby('item')['s'].shift(2, 'M')\
              .reindex(pd.MultiIndex.from_frame(df[['item', 'dt']])).values
df
item dt s r t New
0 red 2018-01-31 3.2 1 7 NaN
1 red 2018-02-28 4.8 2 8 NaN
2 red 2018-03-31 5.1 3 9 3.2
3 blue 2018-01-31 5.3 4 10 NaN
4 blue 2018-03-31 5.8 5 11 5.3
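An alternative sketch that avoids shifting a DatetimeIndex entirely: since all dates are valid end-of-month dates, push each observation's date forward two month-ends and left-merge it back on (item, dt):

```python
import pandas as pd

df = pd.DataFrame({
    'item': ['red', 'red', 'red', 'blue', 'blue'],
    'dt': pd.to_datetime(['2018-01-31', '2018-02-28', '2018-03-31',
                          '2018-01-31', '2018-03-31']),
    's': [3.2, 4.8, 5.1, 5.3, 5.8],
})

# each s becomes visible two month-ends after its own date
shifted = df[['item', 'dt', 's']].copy()
shifted['dt'] = shifted['dt'] + pd.offsets.MonthEnd(2)
out = df.merge(shifted.rename(columns={'s': 's_lag2m'}),
               on=['item', 'dt'], how='left')
```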

Is it possible to do pandas groupby transform rolling mean?

Is it possible for pandas to do something like:
df.groupby("A").transform(pd.rolling_mean,10)
You can do this without the transform or apply:
df = pd.DataFrame({'grp':['A']*5+['B']*5,'data':[1,2,3,4,5,2,4,6,8,10]})
df.groupby('grp')['data'].rolling(2, min_periods=1).mean()
Output:
grp
A 0 1.0
1 1.5
2 2.5
3 3.5
4 4.5
B 5 2.0
6 3.0
7 5.0
8 7.0
9 9.0
Name: data, dtype: float64
Update per comment:
df = pd.DataFrame({'grp':['A']*5+['B']*5,'data':[1,2,3,4,5,2,4,6,8,10]},
index=[*'ABCDEFGHIJ'])
df['avg_2'] = df.groupby('grp')['data'].rolling(2, min_periods=1).mean()\
.reset_index(level=0, drop=True)
Output:
grp data avg_2
A A 1 1.0
B A 2 1.5
C A 3 2.5
D A 4 3.5
E A 5 4.5
F B 2 2.0
G B 4 3.0
H B 6 5.0
I B 8 7.0
J B 10 9.0
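To answer the question as asked: transform works too; the long-removed pd.rolling_mean just needs to be replaced with the .rolling accessor inside a lambda. Because transform returns a result aligned to the caller's index, no reset_index is needed:

```python
import pandas as pd

df = pd.DataFrame({'grp': ['A']*5 + ['B']*5,
                   'data': [1, 2, 3, 4, 5, 2, 4, 6, 8, 10]})

# transform keeps the original index, so this assigns back directly
df['avg_2'] = df.groupby('grp')['data'].transform(
    lambda s: s.rolling(2, min_periods=1).mean())
```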

pandas: fillna whole df with groupby

I have the following df with a lot more number columns. I now want to make a forward filling for all the columns in the dataframe but grouped by id.
id date number number2
1 2001 4 11
1 2002 4 45
1 2003 NaN 13
2 2001 7 NaN
2 2002 8 2
The result should look like this:
id date number number2
1 2001 4 11
1 2002 4 45
1 2003 4 13
2 2001 7 NaN
2 2002 8 2
I tried the following command:
df= df.groupby("id").fillna(method="ffill", limit=2)
However, this raises KeyError: "isin". Filling just one column with the following command works just fine, but how can I efficiently forward fill the whole df grouped by id?
df["number"]= df.groupby("id")["number"].fillna(method="ffill", limit=2)
You can use:
df = df.groupby("id").apply(lambda x: x.ffill(limit=2))
print (df)
id date number number2
0 1 2001 4.0 11.0
1 1 2002 4.0 45.0
2 1 2003 4.0 13.0
3 2 2001 7.0 NaN
4 2 2002 8.0 2.0
The following also works for me:
df.groupby("id").fillna(method="ffill", limit=2)
so I think you need to upgrade pandas.
ffill can be used directly:
df.groupby('id').ffill(limit=2)
Out[423]:
id date number number2
0 1 2001 4.0 11.0
1 1 2002 4.0 45.0
2 1 2003 4.0 13.0
3 2 2001 7.0 NaN
4 2 2002 8.0 2.0
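With current pandas, where fillna(method=...) is deprecated, a sketch that forward fills only the value columns (so id and date are never touched) looks like this:

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 1, 2, 2],
    'date': [2001, 2002, 2003, 2001, 2002],
    'number': [4, 4, None, 7, 8],
    'number2': [11, 45, 13, None, 2],
})

cols = ['number', 'number2']
# forward fill within each id, at most 2 consecutive NaNs per gap
df[cols] = df.groupby('id')[cols].ffill(limit=2)
```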

Unifying columns in the same Pandas dataframe to one column

Hi I would like to unify columns in the same dataframe to one column such as:
col1 col2
1 1.4 1.5
2 2.3 2.6
3 3.6 6.7
to
col1&2
1 1.4
1 1.5
2 2.3
2 2.6
3 3.6
3 6.7
Thanks for your help
Use stack, then remove the second index level with reset_index, and finally create a one-column DataFrame with to_frame:
df = df.stack().reset_index(level=1, drop=True).to_frame('col1&2')
print (df)
col1&2
1 1.4
1 1.5
2 2.3
2 2.6
3 3.6
3 6.7
Or:
import numpy as np
df = pd.DataFrame({'col1&2': df.values.ravel()}, index=np.repeat(df.index, 2))
print (df)
col1&2
1 1.4
1 1.5
2 2.3
2 2.6
3 3.6
3 6.7
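A melt-based sketch gives the same result; note that melt stacks column by column, so a stable sort on the original index is needed to restore the interleaved order:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1.4, 2.3, 3.6],
                   'col2': [1.5, 2.6, 6.7]}, index=[1, 2, 3])

long = (df.reset_index()
          .melt(id_vars='index', value_name='col1&2')  # one long value column
          .sort_values('index', kind='mergesort')      # stable: keeps col1 before col2
          .set_index('index')[['col1&2']])
```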