Get data from a MultiIndexed dataframe given a list of index values - pandas

I want to get information from a column called 'index' in a pandas dataframe with a MultiIndex, selecting only the rows whose first index level appears in a given list. Please see the following example.
     index        ID
0 0    1.0  17226815
1 0    2.0  17226807
2 0    3.0  17226816
3 0    4.0  17226808
4 0    5.0  17231739
5 0    6.0  17231739
6 0    1.0  17226815
  1    2.0  17226807
7 0    1.0  17226815
  1    3.0  17226816
filtered_list = [3, 5, 7]
With the following line I can get the filtered data:
print(df.loc[df.index.isin(filtered_list, level=0)]['index'])
out:
3 0    4.0
5 0    6.0
7 0    1.0
  1    3.0
What I want is a list of the 'index' values for each filtered index, shown as additional information next to that index, as follows:
0 3 4
1 5 6
2 7 (1, 3)
How can I get this list?
Thank you in advance.

If I understand correctly,
df.loc[filtered_list,'index'].groupby(level=0).apply(tuple).reset_index()
Output:
0 index
0 3 (4.0,)
1 5 (6.0,)
2 7 (1.0, 3.0)
Going further:
df.loc[filtered_list,'index']\
.groupby(level=0)\
.apply(lambda x: tuple(x)[0] if len(x.index)==1 else tuple(x))\
.reset_index()
Output:
0 index
0 3 4
1 5 6
2 7 (1.0, 3.0)
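For reference, here is a self-contained sketch of the same idea, rebuilding the question's frame by hand. The construction via pd.MultiIndex.from_tuples and the names `sel` and `result` are assumptions for illustration, and it uses agg(list) instead of apply(tuple), so each first-level key should map to a plain list of its 'index' values:

```python
import pandas as pd

# Rebuild the example frame from the question (two index levels, unnamed).
idx = pd.MultiIndex.from_tuples(
    [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (5, 0),
     (6, 0), (6, 1), (7, 0), (7, 1)]
)
df = pd.DataFrame(
    {
        "index": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 1.0, 2.0, 1.0, 3.0],
        "ID": [17226815, 17226807, 17226816, 17226808, 17231739,
               17231739, 17226815, 17226807, 17226815, 17226816],
    },
    index=idx,
)

filtered_list = [3, 5, 7]

# Keep only rows whose first index level is in filtered_list.
sel = df.loc[df.index.isin(filtered_list, level=0), "index"]

# Collect the 'index' values of each first-level key into a list.
result = sel.groupby(level=0).agg(list)
print(result)
```

Calling result.to_dict() afterwards gives a plain mapping from each filtered key to its list of values, if that is easier to work with than a Series.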

Related

pandas function to find index of the first future instance where column is less than each row's value

I'm new to Stack Overflow, and I just have a question about solving a problem in pandas. I am looking to create a function that returns the index of the first future instance where a column is less than each row's value for that column.
For example, consider the dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Val': [1, 2, 3, 4, 0, 1, -1, -2, -3]}, index = np.arange(0,9))
df
Index  Val
0        1
1        2
2        3
3        4
4        0
5        1
6       -1
7       -2
8       -3
I am looking for the output:
Index  F(Val)
0      4
1      4
2      4
3      4
4      6
5      6
6      7
7      8
8      NaN
Or the series/array equivalent of F(Val).
I've been able to solve this quite easily using for loops, but this is extremely slow on the large dataset I am working with and is not a very elegant or optimal solution. My hope is that the solution is an efficient pandas function that employs vectorization.
Also, as a bonus question (if anyone can assist), how might the maximum value between each row's index and the F(Val) index be computed using vectorization? The output should look like:
Index  G(Val)
0      4
1      4
2      4
3      4
4      1
5      1
6      -1
7      -2
8      NaN
Thanks!
You can use:
grp = df['Val'].lt(df['Val'].shift()).shift(fill_value=0).cumsum()
df['F(Val)'] = df.groupby(grp).transform(lambda x: x.index[-1]).shift(-1)
print(df)
# Output
Val F(Val)
0 1 4.0
1 2 4.0
2 3 4.0
3 4 4.0
4 0 6.0
5 1 6.0
6 -1 7.0
7 -2 8.0
8 -3 NaN
Using numpy broadcasting and the lower triangle:
a = df['Val'].to_numpy()
m = np.tril(a[:,None]<=a, k=-1)
df['F(Val)'] = np.where(m.any(0), m.argmax(0), np.nan)
Same logic with expanding:
df['F(Val)'] = (df.loc[::-1, 'Val'].expanding()
                  .apply(lambda x: s.idxmax()
                         if (s := (x.iloc[-2::-1] <= x.iloc[-1])).any()
                         else np.nan)
               )
Output (with a difference from the provided one):
Val F(Val)
0 1 5.0 # here the next is 5
1 2 4.0
2 3 4.0
3 4 4.0
4 2 5.0
5 -2 7.0
6 -1 7.0
7 -2 8.0
8 -3 NaN
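As a hedged sketch, the broadcasting idea can also be run on the question's own data with a strict `<` comparison, which should reproduce the F(Val) column the question asks for. The names `first`, `found` and `g` are illustrative only. Note that the boolean matrix needs O(n²) memory, so this is only practical for moderately sized frames, and that the bonus G(Val) part below uses a plain Python loop rather than a vectorized method:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Val': [1, 2, 3, 4, 0, 1, -1, -2, -3]})
a = df['Val'].to_numpy()

# m[i, j] is True when row i comes after row j and a[i] < a[j].
m = np.tril(a[:, None] < a, k=-1)

found = m.any(0)     # does row j have any later, smaller value?
first = m.argmax(0)  # position of the first such row (0 when none exists)
df['F(Val)'] = np.where(found, first, np.nan)

# Bonus: maximum of Val between each row and its F(Val) row (not vectorized).
g = [a[i:j + 1].max() if ok else np.nan
     for i, (j, ok) in enumerate(zip(first, found))]
df['G(Val)'] = g
print(df)
```

Because Val at the F(Val) row is by construction smaller than Val at the starting row, including or excluding that end row in the slice does not change the maximum.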

Pandas how to group rows by a dictionary of {row : group}

I have a dataframe with n rows:
1 2 3
3 4 1
5 3 2
9 8 2
7 2 6
0 0 0
4 4 4
8 4 1
...
and a dictionary in which each key is a row index and each value is the group that row belongs to:
d = {0 : 0 , 1: 0, 2 : 0, 3 : 1, 4 : 1, 5: 2, 6: 2}
I want to group by the keys and then apply mean on the groups.
So I will get:
3 3 2 #This is the mean of rows 0,1,2 from the original df, as d[0]=d[1]=d[2]=0
8 5 4
2 2 2
8 4 1
What is the best way to do so?
Simply pass the dictionary to groupby. Each index value is replaced by the dictionary value for the matching key:
df.groupby(d).mean()
output:
a b c
0.0 3.0 3.0 2.0
1.0 8.0 5.0 4.0
2.0 2.0 2.0 2.0
If you also want to get the missing keys, use dropna=False in groupby. Those keys will be listed in the 'NaN' group:
df.groupby(d, dropna=False).mean()
output:
a b c
0.0 3.0 3.0 2.0
1.0 8.0 5.0 4.0
2.0 2.0 2.0 2.0
NaN 8.0 4.0 1.0
And for a range index instead of the dictionary keys:
df.groupby(d, dropna=False, as_index=False).mean()
output:
a b c
0 3.0 3.0 2.0
1 8.0 5.0 4.0
2 2.0 2.0 2.0
3 8.0 4.0 1.0
used input:
a b c
0 1 2 3
1 3 4 1
2 5 3 2
3 9 8 2
4 7 2 6
5 0 0 0
6 4 4 4
7 8 4 1
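A self-contained sketch of the above, using the input shown. The second form, which maps the index through the dictionary explicitly with Index.map, is an assumption about what the dictionary grouper amounts to here, added only to make the mechanism visible; the names `out` and `out_explicit` are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'a': [1, 3, 5, 9, 7, 0, 4, 8],
    'b': [2, 4, 3, 8, 2, 0, 4, 4],
    'c': [3, 1, 2, 2, 6, 0, 4, 1],
})
d = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 2, 6: 2}

# Group rows by the dictionary value of their index label, then average.
# Row 7 has no key in d, so it is dropped by default.
out = df.groupby(d).mean()

# Explicit version: map the index through d first, then group on the result.
out_explicit = df.groupby(df.index.map(d)).mean()

print(out)
```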

Insert a list into a row in pandas?

The data may look like this
0 1 2
0 0 0 0
1 1 5 6
2 2 7 4
list=[1.2,1.3,1.4]
How can I insert it into the dataframe?
0 1 2
0 1.2 1.3 1.4
1 1 5 6
2 2 7 4
So far I have only managed to do it with a loop.
Is there a function that does this?
Use .loc to select the first row, then insert your list:
mylist = [1.2,1.3,1.4]
df.loc[0] = mylist
Output
0 1 2
0 1.2 1.3 1.4
1 1.0 5.0 6.0
2 2.0 7.0 4.0
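A runnable sketch of the same assignment follows. It assumes the frame starts with integer columns, so it casts to float first; recent pandas versions may warn when an assignment silently changes a column's dtype, and the explicit astype(float) is only a precaution against that. The last line, which appends a row by assigning to a label that does not exist yet, is an extra illustration and not part of the answer above:

```python
import pandas as pd

df = pd.DataFrame([[0, 0, 0], [1, 5, 6], [2, 7, 4]])
mylist = [1.2, 1.3, 1.4]

# Cast to float up front so the float values fit the column dtype.
df = df.astype(float)

# Overwrite the row labelled 0 with the list (one value per column).
df.loc[0] = mylist

# Assigning to a new label adds a row at the end instead.
df.loc[len(df)] = [9.0, 9.0, 9.0]
print(df)
```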

pandas column operation on certain row in succession

I have a pandas dataframe like this:
second block
0 1 a
1 2 b
2 3 c
3 4 a
4 5 c
This is sequential data, and I would like to add a new column holding the time difference between the current occurrence of a block and the next time that block repeats.
second block freq
0 1 a 3 //(4-1)
1 2 b 0 //(not repeating)
2 3 c 2 //(5-3)
3 4 a 0 //(not repeating)
4 5 c 0 //(not repeating)
I have tried getting the unique list of blocks and then looping over them as below.
for i in unique_block:
    df['freq'] = df['timestamp'].shift(-1) - df['timestamp']
This is not working: I do not know how to get 0 for row indexes 1, 3 and 4, and since the dataframe is very big, a loop is not efficient.
Thanks.
Use groupby + diff(periods=-1). Multiply by -1 to get your difference convention and fillna with 0.
df['freq'] = (df.groupby('block').diff(-1)*-1).fillna(0)
second block freq
0 1 a 3.0
1 2 b 0.0
2 3 c 2.0
3 4 a 0.0
4 5 c 0.0
You can use shift and transform in your groupby:
df['freq'] = df.groupby('block').second.transform(lambda x: x.shift(-1) - x).fillna(0)
>>> df
second block freq
0 1 a 3.0
1 2 b 0.0
2 3 c 2.0
3 4 a 0.0
4 5 c 0.0
Using apply with diff and shift:
df.groupby('block').second.apply(lambda x : x.diff().shift(-1)).fillna(0)
Out[242]:
0 3.0
1 0.0
2 2.0
3 0.0
4 0.0
Name: second, dtype: float64
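For completeness, a self-contained sketch of the same computation without a lambda, assuming the frame from the question. It takes the next 'second' value within each block via the groupby's shift(-1) and subtracts the current value:

```python
import pandas as pd

df = pd.DataFrame({'second': [1, 2, 3, 4, 5],
                   'block': ['a', 'b', 'c', 'a', 'c']})

# Next 'second' value within the same block (NaN when the block never repeats).
next_second = df.groupby('block')['second'].shift(-1)

# Time until the block repeats; 0 where there is no later occurrence.
df['freq'] = (next_second - df['second']).fillna(0)
print(df)
```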

Create a new column using a shift within groupby values

I want to create a new column that is the result of a shift function applied to grouped values.
df = pd.DataFrame({'X': [0,1,0,1,0,1,0,1], 'Y':[2,4,3,1,2,3,4,5]})
df
X Y
0 0 2
1 1 4
2 0 3
3 1 1
4 0 2
5 1 3
6 0 4
7 1 5
def func(x):
    x['Z'] = test['Y']-test['Y'].shift(1)
    return x
df_new = df.groupby('X').apply(func)
X Y Z
0 0 2 NaN
1 1 4 2.0
2 0 3 -1.0
3 1 1 -2.0
4 0 2 1.0
5 1 3 1.0
6 0 4 1.0
7 1 5 1.0
As you can see from the output, the values are shifted sequentially without accounting for the groupby.
I have seen a similar question, but I could not figure out why it does not work as expected.
Python Pandas: how to add a totally new column to a data frame inside of a groupby/transform operation
The values are shifted without accounting for the groups because your func uses test (presumably some other object, likely another name for what you call df) directly instead of simply the group x.
def func(x):
    x['Z'] = x['Y']-x['Y'].shift(1)
    return x
gives me
In [8]: df_new
Out[8]:
X Y Z
0 0 2 NaN
1 1 4 NaN
2 0 3 1.0
3 1 1 -3.0
4 0 2 -1.0
5 1 3 2.0
6 0 4 2.0
7 1 5 2.0
but note that in this particular case you don't need to write a custom function, you can just call diff on the groupby object directly. (Of course other functions you might want to work with may be more complicated).
In [13]: df_new["Z2"] = df.groupby("X")["Y"].diff()
In [14]: df_new
Out[14]:
X Y Z Z2
0 0 2 NaN NaN
1 1 4 NaN NaN
2 0 3 1.0 1.0
3 1 1 -3.0 -3.0
4 0 2 -1.0 -1.0
5 1 3 2.0 2.0
6 0 4 2.0 2.0
7 1 5 2.0 2.0
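A self-contained sketch of the group-aware difference, using the question's data. The column names Z and Z_shift are illustrative; the second column spells out the same computation with an explicit per-group shift, which may be a useful pattern when the operation is not a plain difference:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'X': [0, 1, 0, 1, 0, 1, 0, 1],
                   'Y': [2, 4, 3, 1, 2, 3, 4, 5]})

# Difference to the previous row within the same X group.
df['Z'] = df.groupby('X')['Y'].diff()

# Same result, written as "current value minus previous value in the group".
df['Z_shift'] = df['Y'] - df.groupby('X')['Y'].shift(1)
print(df)
```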