Pandas function to find the index of the first future instance where a column is less than each row's value

I'm new to Stack Overflow, and I just have a question about solving a problem in pandas. I am looking to create a function that returns the index of the first future instance where a column is less than each row's value for that column.
For example, consider the dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Val': [1, 2, 3, 4, 0, 1, -1, -2, -3]}, index = np.arange(0,9))
df
Index  Val
0       1
1       2
2       3
3       4
4       0
5       1
6      -1
7      -2
8      -3
I am looking for the output:
Index  F(Val)
0      4
1      4
2      4
3      4
4      6
5      6
6      7
7      8
8      NaN
Or the series/array equivalent of F(Val).
I've been able to solve this quite easily using for loops, but obviously this is extremely slow on the large dataset I am working with, and it is not a very elegant or optimal solution. My hope is that the solution is an efficient pandas function that employs vectorization.
Also, as a bonus question (if anyone can assist): how might the maximum of Val between each row's index and its F(Val) index be computed using vectorization? The output should look like:
Index  G(Val)
0       4
1       4
2       4
3       4
4       1
5       1
6      -1
7      -2
8      NaN
Thanks!

You can use:
# new group after each drop in Val; the drop row ends its group
grp = df['Val'].lt(df['Val'].shift()).shift(fill_value=0).cumsum()
# each group's last index, shifted up one row
df['F(Val)'] = df.groupby(grp).transform(lambda x: x.index[-1]).shift(-1)
print(df)
# Output
Val F(Val)
0 1 4.0
1 2 4.0
2 3 4.0
3 4 4.0
4 0 6.0
5 1 6.0
6 -1 7.0
7 -2 8.0
8 -3 NaN

Using numpy broadcasting and the lower triangle:
a = df['Val'].to_numpy()
# m[i, j] is True where i > j and a[i] <= a[j] (strictly lower triangle)
m = np.tril(a[:, None] <= a, k=-1)
# first True row per column j = first later index with a value <= a[j]
df['F(Val)'] = np.where(m.any(0), m.argmax(0), np.nan)
Same logic with expanding:
df['F(Val)'] = (
    df.loc[::-1, 'Val']
      .expanding()
      .apply(lambda x: s.idxmax()
             # s.any() guards rows with no future smaller-or-equal value
             if len(s := (x.iloc[-2::-1] <= x.iloc[-1])) and s.any()
             else np.nan)
)
Output (computed on a slightly different Val column, hence the difference from the expected output):
Val F(Val)
0 1 5.0 # here the next is 5
1 2 4.0
2 3 4.0
3 4 4.0
4 2 5.0
5 -2 7.0
6 -1 7.0
7 -2 8.0
8 -3 NaN
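
For the bonus question, a minimal sketch in the same broadcasting style (O(n^2) memory; it assumes the default 0..n-1 RangeIndex of the example and an F(Val) column holding the question's expected values, e.g. from the first answer):

a = df['Val'].to_numpy()
f = df['F(Val)'].to_numpy()
idx = np.arange(len(a))

# mask[i, j] is True where i <= j <= F(Val)[i]
valid = ~np.isnan(f)
mask = (idx >= idx[:, None]) & (idx <= np.where(valid, f, -1)[:, None])

# G(Val)[i] = max of Val over the window [i, F(Val)[i]]
df['G(Val)'] = np.where(valid, np.where(mask, a, -np.inf).max(axis=1), np.nan)

This reproduces the expected G(Val) column (4, 4, 4, 4, 1, 1, -1, -2, NaN) on the example input.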

Related

Pandas: Get rolling mean with an add operation in between

My Pandas df is like:
ID  delta  price
 1   -2      4
 2    2      5
 3   -3      3
 4    0.8
 5    0.9
 6   -2.3
 7    2.8
 8    1
 9    1
10    1
11    1
12    1
Pandas already has a robust built-in mean calculation method. I need to use it slightly differently.
In my df, the price at row 4 would be the sum of (a) the rolling mean of the price in rows 1, 2, 3 and (b) the delta at row 4.
Once this is computed, I would move to row 5: (a) the rolling mean of the price in rows 2, 3, 4 plus (b) the delta at row 5 gives the price at row 5, and so on.
I can iterate over rows to get this, but my actual dataframe is quite big and iterating over rows would slow things down. Is there a better way to achieve this?
I do not think pandas has a method that can use the previously calculated value in the next calculation.
n = 3
for x in df.index[df.price.isna()]:
    df.loc[x, 'price'] = (df.loc[x-n:x, 'price'].sum() + df.loc[x, 'delta']) / 4
df
Out[150]:
ID delta price
0 1 -2.0 4.000000
1 2 2.0 5.000000
2 3 -3.0 3.000000
3 4 0.8 3.200000
4 5 0.9 3.025000
5 6 -2.3 1.731250
6 7 2.8 2.689062
7 8 1.0 2.111328
8 9 1.0 1.882910
9 10 1.0 1.920825
10 11 1.0 1.728766
11 12 1.0 1.633125
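
A side note: the question's text describes the new price as the rolling mean of the previous three prices plus the current delta, which differs from the (sum + delta)/4 above. A minimal sketch of that literal reading (the loop is still needed, because each price depends on the one just computed; the column values are copied from the question):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID': range(1, 13),
    'delta': [-2, 2, -3, 0.8, 0.9, -2.3, 2.8, 1, 1, 1, 1, 1],
    'price': [4, 5, 3] + [np.nan] * 9,
})

n = 3
for x in df.index[df['price'].isna()]:
    # mean of the previous n prices plus the current row's delta
    df.loc[x, 'price'] = df.loc[x-n:x-1, 'price'].mean() + df.loc[x, 'delta']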

How to append to a data frame from multiple loops

I have some code which reads files from CSV and takes a price difference, but to make it simpler I made the reproducible example seen below. I want to append each result to the end of a specific column name. For example, the first loop will go through size 1 and minute 1, so it should append to column name 1;1 for file2, file3, and file4. So the output should be:
1;1 1;2 1;3 2;1 2;2 2;3
0 0 0 (the 2;* columns are the same as the 1;* columns)
0 0 0
2 2 2
2 2 2
4 4 4
4 4 4
5 5 5
0 0 0
0 0 0
0 0 0
2 2 2
2 2 2
4 4 4
4 4 4
6 6 6
6 6 6
0 0 0
0 0 0
0 0 0
2 2 2
2 2 2
4 4 4
4 4 4
6 6 6
7 7 7
I am using a loop to set prefixed data frame columns, because in my original code the number of minutes, sizes, and files is inputted by the user.
import numpy as np
import pandas as pd
file =[1,2,3,4,5,6,6,2]
file2=[1,2,3,4,5,6,7,8]
file3=[1,2,3,4,5,6,7,9]
file4=[1,2,1,2,1,2,1,2]
size=[1,2]
minutes=[1,2,3]
list1=[file,file2,file3]
data=pd.DataFrame(file)
data2=pd.DataFrame(file2)
data3=pd.DataFrame(file3)
list1=(data,data2,data3)
datas=pd.DataFrame(file4)
col_names = [str(sizer)+';'+str(number) for sizer in size for number in minutes]
datanew=pd.DataFrame(columns=col_names)
for sizes in size:
    for minute in minutes:
        for files in list1:
            pricediff = files - data
            datanew[str(sizes)+';'+str(minute)] = datanew[str(sizes)+';'+str(minute)].append(pricediff, ignore_index=True)
print(datanew)
Edit: When trying this line: datanew=datanew.append({str(sizes)+';'+str(minute): df['pricediff']}, ignore_index=True) it appends the data, but the result isn't "clean".
The result from my original data, gives me this:
111;5.0,1111;5.0
"0 4.5
1 0.5
2 8
3 8
4 8
...
704 3.5
705 0.5
706 11.5
707 0.5
708 9.0
Name: pricediff, Length: 709, dtype: object",
"price 0.0
0 0.0
Name: pricediff, dtype: float64",
"0 6.5
1 6.5
2 3.5
3 13.0
Name: pricediff, Length: 4, dtype: float64",
What you are looking for is:
datanew=datanew.append({str(sizes)+';'+str(minute): pricediff}, ignore_index=True)
This happens because you cannot change the length of a single column of a DataFrame without modifying the length of the whole frame.
Now consider the below as an example:
import pandas as pd
df=pd.DataFrame({"a": list("xyzpqr"), "b": [1,3,5,4,2,7], "c": list("pqrtuv")})
print(df)
#this will fail:
#df["c"]=df["c"].append("abc", ignore_index=True)
#print(df)
#what you can do instead:
df=df.append({"c": "abc"}, ignore_index=True)
print(df)
#you can even create new column that way:
df=df.append({"x": "abc"}, ignore_index=True)
Edit
In order to append a pd.Series, do literally the same:
abc=pd.Series([-1,-2,-3], name="c")
df=df.append({"c": abc}, ignore_index=True)
print(df)
abc=pd.Series([-1,-2,-3], name="x")
df=df.append({"x": abc}, ignore_index=True)

Get data from a multi-indexed dataframe given a list of index values

I want to get information from a column called 'index' in a pandas dataframe with a MultiIndex, where the first-level index values to keep are given as a list. Please see the following example.
index ID
0 0 1.0 17226815
1 0 2.0 17226807
2 0 3.0 17226816
3 0 4.0 17226808
4 0 5.0 17231739
5 0 6.0 17231739
6 0 1.0 17226815
1 2.0 17226807
7 0 1.0 17226815
1 3.0 17226816
filtered_list = [3, 5, 7]
With the following line I can get the filtered data:
print(df.loc[df.index.isin(filtered_list, level=0)]['index'])
out:
3 0 4.0
5 0 6.0
7 0 1.0
1 3.0
What I want to get is a list consisting of the 'index' values, shown as additional information next to the filtered index, like this:
0 3 4
1 5 6
2 7 (1, 3)
How can I get this list?
Thank you in advance.
If I understand correctly,
df.loc[filtered_list,'index'].groupby(level=0).apply(tuple).reset_index()
Output:
0 index
0 3 (4.0,)
1 5 (6.0,)
2 7 (1.0, 3.0)
Going further:
df.loc[filtered_list, 'index']\
  .groupby(level=0)\
  .apply(lambda x: tuple(x)[0] if len(x.index) == 1 else tuple(x))\
  .reset_index()
Output:
0 index
0 3 4
1 5 6
2 7 (1.0, 3.0)
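
If plain lists are preferred to tuples, agg(list) is a compact alternative (a sketch assuming the same df and filtered_list):

out = (df.loc[filtered_list, 'index']
         .groupby(level=0)
         .agg(list)
         .reset_index())
print(out)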

Create a new column using a shift within a groupby

I want to create a new column which is the result of a shift function applied to grouped values.
df = pd.DataFrame({'X': [0,1,0,1,0,1,0,1], 'Y':[2,4,3,1,2,3,4,5]})
df
X Y
0 0 2
1 1 4
2 0 3
3 1 1
4 0 2
5 1 3
6 0 4
7 1 5
def func(x):
    x['Z'] = test['Y'] - test['Y'].shift(1)
    return x
df_new = df.groupby('X').apply(func)
X Y Z
0 0 2 NaN
1 1 4 2.0
2 0 3 -1.0
3 1 1 -2.0
4 0 2 1.0
5 1 3 1.0
6 0 4 1.0
7 1 5 1.0
As you can see from the output, the values are shifted sequentially without accounting for the groupby.
I have seen a similar question, but I could not figure out why it does not work as expected.
Python Pandas: how to add a totally new column to a data frame inside of a groupby/transform operation
The values are shifted without accounting for the groups because your func uses test (presumably some other object, likely another name for what you call df) directly instead of simply the group x.
def func(x):
    x['Z'] = x['Y'] - x['Y'].shift(1)
    return x
gives me
In [8]: df_new
Out[8]:
X Y Z
0 0 2 NaN
1 1 4 NaN
2 0 3 1.0
3 1 1 -3.0
4 0 2 -1.0
5 1 3 2.0
6 0 4 2.0
7 1 5 2.0
but note that in this particular case you don't need to write a custom function; you can just call diff on the groupby object directly. (Of course, other functions you might want to work with may be more complicated.)
In [13]: df_new["Z2"] = df.groupby("X")["Y"].diff()
In [14]: df_new
Out[14]:
X Y Z Z2
0 0 2 NaN NaN
1 1 4 NaN NaN
2 0 3 1.0 1.0
3 1 1 -3.0 -3.0
4 0 2 -1.0 -1.0
5 1 3 2.0 2.0
6 0 4 2.0 2.0
7 1 5 2.0 2.0
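
Equivalently, the grouped shift can be used directly, avoiding apply altogether. A minimal sketch on the same df (the Z3 column name is just for illustration):

# same result as the groupby diff: subtract the within-group previous Y
df_new['Z3'] = df['Y'] - df.groupby('X')['Y'].shift(1)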

Python Pandas groupby-apply strange behavior

Can anyone help me understand why there is different behavior between the two calls to apply below? Thank you.
In [34]: df
Out[34]:
A B C
0 1 0 0
1 1 7 4
2 2 9 8
3 2 2 4
4 2 2 1
5 3 3 3
6 3 3 2
7 3 5 7
In [35]: g = df.groupby('A')
In [36]: g.apply(max)
Out[36]:
A B C
A
1 1 7 4
2 2 9 8
3 3 5 7
In [37]: g.apply(lambda x: max(x))
Out[37]:
A
1 C
2 C
3 C
dtype: object
Short answer - you probably just want
df.groupby('A').max()
Longer answer - max is a generic Python function that finds the max of any iterable. Because iterating a DataFrame goes over its column labels, calling the built-in max just finds the "largest" column name, which is what happens in your second case.
In the first case, pandas has intercept logic which turns things like g.apply(sum) into g.sum().
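
To see why the built-in max lands on a column label, note that iterating a DataFrame yields its column names. A small self-contained sketch:

import pandas as pd

df = pd.DataFrame({'A': [1, 1], 'B': [7, 0], 'C': [4, 0]})
print(list(df))  # ['A', 'B', 'C'] -- iteration is over column labels
print(max(df))   # 'C' -- the built-in max compares labels as strings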