Understanding the method apply() in Pandas series and dataframe - pandas

I am trying to understand how the method apply() can be used with series and dataframes.
As shown below, when the np.max() function is used with the apply() method with the dataframe it is returning the max value for each column. But when used with the series, it is just returning the series. My expectation was that it would return the max value of the series. That is, the result would be similar to series.max(). Why is apply() performing differently on series and on dataframes?
import pandas as pd
import numpy as np
my_df = pd.DataFrame(np.random.randint(10, size=(4,3)), columns = list('ABC'))
my_df
Output:
A B C
0 2 4 7
1 9 6 6
2 4 4 8
3 8 8 1
df_max = my_df.apply(np.max)
df_max
Output:
A 9
B 8
C 8
dtype: int32
se_max = my_df['A'].apply(np.max)
se_max
Output:
0 2
1 9
2 4
3 8
Name: A, dtype: int32

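By default, apply works along the first axis of the object. For a DataFrame, that means the function is called once per column, and each call receives the whole column as a Series, so np.max reduces each column to a single value. A Series has only one axis, and Series.apply calls the function once per element, so np.max receives a single scalar each time and returns it unchanged. To get the maximum of a Series, call series.max() directly.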
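The difference can be reproduced with a small fixed dataframe. This is a minimal sketch using the values from the question's output rather than random data; the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# Same values as the question's printed dataframe, fixed so the result is reproducible.
df = pd.DataFrame({"A": [2, 9, 4, 8], "B": [4, 6, 4, 8], "C": [7, 6, 8, 1]})

# DataFrame.apply passes each column (a Series) to the function: one result per column.
col_max = df.apply(np.max)

# Series.apply passes each scalar to the function: the max of one number is that number.
elementwise = df["A"].apply(np.max)

# To reduce a Series to a single value, call the reduction itself.
series_max = df["A"].max()
agg_max = df["A"].agg("max")

print(col_max)
print(elementwise)
print(series_max, agg_max)
```

Here elementwise holds the same values as df["A"], while series_max and agg_max are both the single maximum.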

Related

Trying to convert column to be row indexes, set_index error

data_new.set_index('Usual Mode of Transport to Work')
jupyter notebook
I am trying to turn a column into the row index, but it shows up as NaN. How do I resolve this? Thanks. I'm a beginner in Python.
Let's start with a toy dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,5,size=(5, 4)), columns=list('ABCD'))
print(df)
A B C D
0 3 1 2 1
1 2 2 3 4
2 2 4 4 1
3 1 0 3 2
4 1 2 4 0
Now, let's set column A as the index
df.set_index('A')
B C D
A
3 1 2 1
2 2 3 4
2 4 4 1
1 0 3 2
1 2 4 0
This sets the index correctly but doesn't save the newly indexed dataframe back to the original variable, df. So when you check the value of df, you still find the original dataframe.
To save the new index, you can do one of the following:
df = df.set_index('A')
or
df.set_index('A', inplace=True)
As for the NaN values, I believe they are related to using a Jupyter notebook. Because Jupyter lets you run cells in any order, execution does not necessarily follow the top-to-bottom order of a traditional script, which can get confusing. You can use the "Variable View" in Jupyter to cross-check that you are passing the value you intend to. I hope this helps you figure out the NaN issue.
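One way to check which state df is in, regardless of the order in which cells were run, is to inspect its columns and index name. A minimal sketch with a small made-up dataframe:

```python
import pandas as pd

df = pd.DataFrame({"A": [3, 2, 2], "B": [1, 2, 4]})

# set_index returns a new DataFrame; df itself is left untouched.
indexed = df.set_index("A")

print(list(df.columns))       # 'A' is still an ordinary column of df
print(indexed.index.name)     # 'A' is the index of the returned object
print(list(indexed.columns))  # and no longer one of its columns
```

If 'A' still appears in df.columns, the result of set_index was never assigned back to df.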

pandas split-apply-combine creates undesired MultiIndex

I am using the split-apply-combine pattern in pandas to group my df by a custom aggregation function.
But this returns an undesired DataFrame in which the grouped column exists twice: in a MultiIndex and in the columns.
The following is a simplified example of my problem.
Say, I have this df
df = pd.DataFrame([[1,2],[3,4],[1,5]], columns=['A','B'])
A B
0 1 2
1 3 4
2 1 5
I want to group by column A and keep only those rows where B has an even value. Thus the desired df is this:
B
A
1 2
3 4
The custom function my_combine_func should do the filtering. But applying it after a groupby leads to a MultiIndex with the former index in the second level, so column A exists twice.
def my_combine_func(group):
    return group[group['B'] % 2 == 0]
df.groupby(['A']).apply(my_combine_func)
A B
A
1 0 1 2
3 1 3 4
How can I apply a custom group function and get the desired df?
It's easier to use apply here so you get a boolean array back:
df[df.groupby('A')['B'].apply(lambda x: x % 2 == 0)]
A B
0 1 2
1 3 4
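Depending on the pandas version, groupby(...).apply may return a result that carries the group key as an extra index level, in which case the mask no longer lines up with df. A sketch that does not rely on that behaviour uses transform, which returns a result aligned to the original index. Because this particular condition does not depend on the group, a plain boolean mask gives the same rows; the final set_index call is an assumption about how to reach the layout shown in the question.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4], [1, 5]], columns=["A", "B"])

# transform keeps the original index, so the result can be used directly as a mask.
mask = df.groupby("A")["B"].transform(lambda b: b % 2 == 0)
even_rows = df[mask]

# The condition is the same for every group, so a plain mask selects the same rows.
plain = df[df["B"] % 2 == 0]

# Layout requested in the question: B indexed by A.
result = even_rows.set_index("A")
print(result)
```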

Multi-indexed series into DataFrame and reformat

I have a correlation matrix of stock returns in a Pandas DataFrame and I want to extract the top/bottom 10 correlated pairs from the matrix.
Sample DataFrame:
import pandas as pd
import numpy as np
data = np.random.randint(5,30,size=500)
df = pd.DataFrame(data.reshape((50,10)))
corr = df.corr()
This is my function to get the top/bottom 10 correlated pairs by 1) first returning a multi-indexed series (high) for highest correlated pairs, and then 2) unstacking back into a DataFrame (high_df):
def get_rankings(corr_matrix):
    # the matrix is symmetric, so extract the upper triangle without the diagonal (k=1)
    # use the builtin bool: the np.bool alias is not available in every NumPy release
    ranked_corr = (corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
                   .stack()
                   .sort_values(ascending=False))
    high = ranked_corr[:10]
    high_df = high.unstack().fillna("")
    return high_df
get_rankings(corr)
My current DF output looks something like this:
6 4 5 7 8 3 9
3 0.359 0.198
1 0.275
4 0.257
2 0.176 0.154
0 0.153 0.164
5 0.156
But I want it to look like this, in either 2 or 3 columns:
ID1 ID2 Corr
0 9 0.304471
2 8 0.271009
2 3 0.147702
7 9 0.146176
0 7 0.144549
7 8 0.111888
4 6 0.098619
1 7 0.092338
1 4 0.09091
3 6 0.079688
It needs to be in a DataFrame so I can pass it to a grid widget, which only accepts DataFrames. Can anyone help me reshape the unstacked DF?
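One possible approach is to skip unstack entirely: name the two levels of the multi-indexed series and move them into ordinary columns with reset_index. The sketch below is an assumption about what the asker wants, not an accepted answer; top_pairs and the column names ID1, ID2 and Corr are illustrative, and a seeded generator replaces the question's unseeded random data.

```python
import numpy as np
import pandas as pd

def top_pairs(corr_matrix, n=10):
    """Return the n most correlated pairs as a flat DataFrame (hypothetical helper)."""
    # Keep only the strict upper triangle so each pair appears exactly once.
    upper = np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1)
    ranked = (
        corr_matrix.where(upper)
        .stack()
        .dropna()  # has no effect if stack() already dropped the masked entries
        .sort_values(ascending=False)
    )
    # Name the two index levels, then turn them into regular columns.
    return ranked.head(n).rename_axis(["ID1", "ID2"]).reset_index(name="Corr")

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(5, 30, size=(50, 10)))
top = top_pairs(df.corr())
print(top)
```

The result is a regular DataFrame with one row per pair, so it can be passed to anything that expects a DataFrame. Replacing head(n) with tail(n) would give the lowest-ranked pairs instead.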

Pandas Series Chaining: Filter on boolean value

How can I filter a pandas series based on boolean values?
Currently I have:
s.apply(lambda x: myfunc(x, myparam)).where(lambda x: x).dropna()
What I want is to keep only the entries where myfunc returns True. myfunc is a complex function that uses third-party code and operates only on individual elements.
How can I make this more understandable?
The sample code below shows how it works:
import pandas as pd
data = pd.Series([1,12,15,3,5,3,6,9,10,5])
print(data)
# filter data based on a condition: keep only values that are multiples of 3
filter_cond = data.apply(lambda x:x%3==0)
print(filter_cond)
filter_data = data[filter_cond]
print(filter_data)
This code filters the series down to the values that are multiples of 3. It builds a boolean filter condition with apply and then uses that condition to index the series. You can verify this against the output below.
The sample series data:
0 1
1 12
2 15
3 3
4 5
5 3
6 6
7 9
8 10
9 5
dtype: int64
The conditional filter output:
0 False
1 True
2 True
3 True
4 False
5 True
6 True
7 True
8 False
9 False
dtype: bool
The final required filter data:
1 12
2 15
3 3
5 3
6 6
7 9
dtype: int64
I hope this helps you understand how to apply conditional filters to series data.
Use boolean indexing:
mask = s.apply(lambda x: myfunc(x, myparam))
print (s[mask])
If the index values of the mask differ from those of s, filter by a 1-D array instead:
# pandas 0.24+
print (s[mask.to_numpy()])
# older pandas versions
print (s[mask.values])
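To illustrate the mismatched-index case, here is a small sketch in which the mask is built from a re-indexed copy of the series, so its labels no longer match. is_large is a hypothetical stand-in for myfunc.

```python
import pandas as pd

def is_large(x, threshold):
    # Hypothetical stand-in for the element-wise myfunc from the question.
    return x > threshold

s = pd.Series([1, 2, 3], index=["a", "b", "c"])

# The mask comes from a copy with a fresh 0..n-1 index, so its labels differ from s.
mask = s.reset_index(drop=True).apply(lambda x: is_large(x, 1))

# Converting the mask to a NumPy array makes the filtering purely positional.
filtered = s[mask.to_numpy()]
print(filtered)
```

The positional filter only makes sense when the mask has the same length and order as s.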
EDIT:
s = pd.Series([1,2,3])
def myfunc(x, n):
    return x > n
myparam = 1
a = s[s.apply(lambda x: myfunc(x, myparam))]
print (a)
1 2
2 3
dtype: int64
A solution with a callable is also possible, but a bit overcomplicated in my opinion:
a = s.loc[lambda s: s.apply(lambda x: myfunc(x, myparam))]
print (a)
1 2
2 3
dtype: int64

apply() function to generate new value in a new column

I am new to python 3 and pandas. I tried to add a new column into a dataframe where the value is the difference between two existing columns.
My current code is:
import pandas as pd
import io
from io import StringIO
x="""a,b,c
1,2,3
4,5,6
7,8,9"""
with StringIO(x) as df:
    new=pd.read_csv(df)
print (new)
y=new.copy()
y.loc[:,"d"]=0
# My lambda function is completely wrong, but I don't know how to make it right.
y["d"]=y["d"].apply(lambda x:y["a"]-y["b"], axis=1)
Desired output is
a b c d
1 2 3 -1
4 5 6 -1
7 8 9 -1
Does anyone have any idea how I can make my code work?
Thanks for your help.
You need to call apply on the DataFrame y itself, not on the column y["d"]. DataFrame.apply with axis=1 processes the data row by row:
y["d"]= y.apply(lambda x:x["a"]-x["b"], axis=1)
For easier debugging, you can write a custom function:
def f(x):
    print (x)
    a = x["a"]-x["b"]
    return a
y["d"]= y.apply(f, axis=1)
a 1
b 2
c 3
Name: 0, dtype: int64
a 4
b 5
c 6
Name: 1, dtype: int64
a 7
b 8
c 9
Name: 2, dtype: int64
A better solution, if you only need to subtract columns:
y["d"] = y["a"] - y["b"]
print (y)
a b c d
0 1 2 3 -1
1 4 5 6 -1
2 7 8 9 -1