pandas apply() with and without lambda

What is the rule/process when a function is called with pandas apply() through a lambda vs. directly? Examples below. Without the lambda, the entire Series (df[column name]) is apparently passed to the "test" function, which throws an error when it tries a boolean operation on a Series.
If the same function is called via a lambda, it works: apply iterates over the rows, each row is passed as "x", and x[column name] returns the single value for that column in the current row.
It's as if the lambda were removing a dimension. Does anyone have an explanation, or a pointer to the specific doc on this? Thanks.
Example 1 with lambda, works OK
print("probPredDF columns:", probPredDF.columns)
def test(x, y):
    if x == y:
        r = 'equal'
    else:
        r = 'not equal'
    return r
probPredDF.apply(lambda x: test(x['yTest'], x['yPred']), axis=1).head()
Example 1 output
probPredDF columns: Index([0, 1, 'yPred', 'yTest'], dtype='object')
Out[215]:
0    equal
1    equal
2    equal
3    equal
4    equal
dtype: object
Example 2 without lambda, throws boolean operation on series error
print("probPredDF columns:", probPredDF.columns)
def test(x, y):
    if x == y:
        r = 'equal'
    else:
        r = 'not equal'
    return r
probPredDF.apply(test(probPredDF['yTest'], probPredDF['yPred']), axis=1).head()
Example 2 output
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

There is nothing magic about a lambda: it is just an anonymous function defined inline. apply expects a callable that takes one argument (each row, when axis=1); a named function works just as well, as long as it also takes one parameter. In Example 2 you are not passing a function to apply at all: test(probPredDF['yTest'], probPredDF['yPred']) is evaluated immediately, with two whole Series as arguments, so x == y produces a boolean Series whose truth value is ambiguous, which is the error you see. You need to do something like...
Define it as:
def wrapper(x):
    return test(x['yTest'], x['yPred'])
Use it as:
probPredDF.apply(wrapper, axis=1)
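Incidentally, for a plain element-wise comparison of two columns, apply isn't needed at all; a vectorized np.where does the same thing. A minimal sketch, using a small made-up frame standing in for probPredDF:

```python
import numpy as np
import pandas as pd

# small made-up frame standing in for probPredDF
df = pd.DataFrame({'yTest': [0, 1, 1], 'yPred': [0, 1, 0]})

# element-wise comparison of the two columns, no apply needed
result = np.where(df['yTest'] == df['yPred'], 'equal', 'not equal')
print(result.tolist())  # ['equal', 'equal', 'not equal']
```

This avoids the Python-level loop that apply performs, which matters on large frames.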

Related

Pandas - Setting column value, based on a function that runs on another column

I have been all over the place trying to get this to work (new to data science). It's obviously because I don't fully understand how pandas data structures work.
I have this code:
def getSearchedValue(identifier):
    full_str = anedf["Diskret data"].astype(str)
    value = ""
    if full_str.str.find(identifier) <= -1:
        start_index = full_str.str.find(identifier)+len(identifier)+1
        end_index = full_str[start_index:].find("|")+start_index
        value = full_str[start_index:end_index].astype(str)
    return value

for col in anedf.columns:
    if col.count("#") > 0:
        anedf[col] = getSearchedValue(col)
What I'm trying to do is iterate over my columns. I have around 260 in my dataframe. If a column name contains the character #, the code should try to fill its values based on what's in my "Diskret data" column.
Data in the "Diskret data" column is completely messed up, but in the form CCC#111~VALUE|DDD#222~VALUE|... until there are no more identifier+value pairs. Not all identifiers are present in each row, and they come in no specific order.
The function works if I run it with hard-coded strings in a regular Python script. But with the dataframe I get various errors like:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Input In [119], in <cell line: 12>()
     12 for col in anedf.columns:
     13     if col.count("#") > 0:
---> 14         anedf[col] = getSearchedValue(col)

Input In [119], in getSearchedValue(identifier)
      4 full_str = anedf["Diskret data"].astype(str)
      5 value=""
----> 6 if full_str.str.find(identifier) <= -1:
      7     start_index = full_str.str.find(identifier)+len(identifier)+1
      8     end_index = full_str[start_index:].find("|")+start_index
I guess this is because it evaluates against all rows (a Series), which obviously yields several True/False values at once. But how can I make the evaluation and assignment work row by row, like this:
Diskret data              CCC#111                                                           JJSDJ#1234
CCC#111~1|BBB#2323~2234   1 (copied from "Diskret data")                                    0
JJSDJ#1234~Heart attack   0 (or skipped, since the row has no value for this identifier)    Heart attack
The plan is to drop the "Diskret data" when the assignment is done, so I have the data in a more structured way.
--- Update---
By request:
I have included a picture of how I visualize the problem, And what I seemingly can't make it do.
Problem visualisation
With regex you could do something like:
import pandas as pd

def map_(list_) -> pd.Series:
    if list_:
        idx, values = zip(*list_)
        return pd.Series(values, idx)
    else:
        return pd.Series(dtype=object)

series = pd.Series(
    ['CCC#111~1|BBB#2323~2234', 'JJSDJ#1234~Heart attack']
)
reg_series = series.str.findall(r'([^~|]+)~([^~|]+)')
reg_series.apply(map_)
Breaking this down:
Create a new series by running a map on each row that turns your long string into a list of tuples.
reg_series = series.str.findall(r'([^~|]+)~([^~|]+)')
reg_series
# output:
# 0 [(CCC#111, 1), (BBB#2323, 2234)]
# 1 [(JJSDJ#1234, Heart attack)]
Then we create a map_ function. This function takes each row of reg_series and splits it into two sequences: the "keys" and the "values". We then build a Series with the keys as the index and the values as the values.
Edit: We added an if/else statement that checks whether the list is non-empty. If it is empty, we return an empty Series of dtype object.
def map_(list_) -> pd.Series:
    if list_:
        idx, values = zip(*list_)
        return pd.Series(values, idx)
    else:
        return pd.Series(dtype=object)
...
print(idx, values) # first row
# output:
# ('CCC#111', 'BBB#2323') (1, 2234)
Finally we run apply on the series to create a dataframe that takes the outputs from map_ for each row and zips them together in columnar format.
reg_series.apply(map_)
# output:
#   CCC#111 BBB#2323    JJSDJ#1234
# 0       1     2234           NaN
# 1     NaN      NaN  Heart attack
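Putting the pieces together, and joining the extracted columns back so the raw "Diskret data" column can be dropped (which was the asker's stated plan), a minimal runnable sketch using the same two sample rows:

```python
import pandas as pd

def map_(list_):
    # turn [(key, value), ...] into a Series indexed by the keys
    if list_:
        idx, values = zip(*list_)
        return pd.Series(values, idx)
    return pd.Series(dtype=object)

df = pd.DataFrame({'Diskret data': ['CCC#111~1|BBB#2323~2234',
                                    'JJSDJ#1234~Heart attack']})
extracted = df['Diskret data'].str.findall(r'([^~|]+)~([^~|]+)').apply(map_)

# attach the new identifier columns and drop the raw column
result = df.join(extracted).drop(columns='Diskret data')
print(result)
```

Identifiers missing from a row come out as NaN, matching the desired output above.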

Create new column in dataframe by passing existing pandas column values as argument to API call

I have created a function below, get_lyrics, to which I want to pass the Song_Title and Singer_Name column values from an existing dataframe, creating a new column in that dataframe.
My code below that attempts to create the column df['Lyrics'] gives me the error below, and I have no idea why:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Testing the function with get_lyrics(test_song_name, test_song_author) works; it returns a very long string.
import lyricsgenius as lg
import pandas as pd

genius = lg.Genius(access_token=token)
test_song_name = "My Heart Will Go On"
test_song_author = "Celine Dion"

def get_lyrics(Song_Title, Singer_Name):
    song = genius.search_song(Song_Title, Singer_Name)
    return song.lyrics

get_lyrics(test_song_name, test_song_author)

df['Lyrics'] = df.apply(
    get_lyrics(
        df["Song_Title"], df["Singer_Name"]
    )
)
To apply a function row-wise, use apply() with axis=1.
df['Lyrics'] = df.apply(lambda row: get_lyrics(row["Song_Title"], row["Singer_Name"]), axis=1)
Or, inlining the search call in the lambda:
df['Lyrics'] = df.apply(lambda row: genius.search_song(row["Song_Title"], row["Singer_Name"]).lyrics, axis=1)
If you don't want a lambda, you can do
def get_lyrics(row):
    song = genius.search_song(row["Song_Title"], row["Singer_Name"])
    return song.lyrics

df['Lyrics'] = df.apply(get_lyrics, axis=1)
I found this page here: https://www.codeforests.com/2020/07/18/pass-multiple-columns-to-lambda/
It has two working solutions. The first is identical to the one posted by @Ynjxsjmh:
df["Lyrics"] = df.apply(lambda x:
    get_lyrics(x["Song_Title"], x["Singer_Name"]), axis=1
)
This latter one first selects a subset of the dataframe columns and then unpacks them with *x to feed into get_lyrics.
df["Lyrics"] = df[["Song_Title", "Singer_Name"]].apply(lambda x:
    get_lyrics(*x),
    axis=1)
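Since calling the Genius API isn't reproducible offline, here is a sketch of the same unpack-the-columns pattern with a hypothetical stand-in for get_lyrics (make_label and the sample frame are made up for illustration):

```python
import pandas as pd

def make_label(title, singer):
    # stand-in for get_lyrics: just combines the two column values
    return f"{singer} - {title}"

df = pd.DataFrame({"Song_Title": ["My Heart Will Go On"],
                   "Singer_Name": ["Celine Dion"]})

# select the two columns, then unpack each row's values with *x
df["Label"] = df[["Song_Title", "Singer_Name"]].apply(lambda x: make_label(*x), axis=1)
print(df["Label"].iloc[0])  # Celine Dion - My Heart Will Go On
```

Note that *x unpacks the row values in column order, so the subset's column order must match the function's parameter order.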

Pandas function error: The truth value of a Series is ambiguous

I have a function to compare two dates in a dataframe and return a value after a basic calculation:
def SDate(x, y):
    s = 1
    a = min(x, y)
    b = max(x, y)
    if a != x: s = -1
    r = b - a
    if b - a > 182:
        r = 365 - b + a
    return r * s
I have tried the following:
df['Result'] = SDate(df['GL Date'].dt.dayofyear, df['Special Date'].dt.dayofyear)
but I get:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I am not sure exactly what you are trying to achieve, but it looks like you are passing whole Series as the function's inputs while expecting row-wise outputs.
It's also good practice to include a sample of the data you are working with and what you want the output to look like, as well as a more detailed explanation of what you want to achieve.
That being said, from what you have described, you should use the apply method as a row-wise operation to get your output.
So if you wish to apply this function:
def SDate(x, y):
    s = 1
    a = min(x, y)
    b = max(x, y)
    if a != x: s = -1
    r = b - a
    if b - a > 182:
        r = 365 - b + a
    return r * s
You should do this:
df['Result'] = df.apply(lambda x: SDate(x['GL Date'].dayofyear, x['Special Date'].dayofyear), axis=1)
Note that inside apply with axis=1, x['GL Date'] is a single Timestamp, so use .dayofyear directly rather than the .dt accessor (which only exists on a Series).
You are passing a pandas Series as both x and y. The comparisons that min performs internally then yield boolean Series, whose truth value is ambiguous, which is what raises the error. As I don't know what you are trying to do there, I can't fix it in code.
Hope it helps.
The problem is with the min/max calls: called on two Series, they trigger element-wise comparisons whose truth value is ambiguous. Consider using this:
a = min(x.min(), y.min())
b = max(x.max(), y.max())
However, you then compare a Series with a number (if a != x:), which will fail too. What's the purpose of your function?
You can try the axis parameter of df.apply:
def SDate(row):
    s = 1
    day1 = row['GL Date'].dayofyear
    day2 = row['Special Date'].dayofyear
    a = min(day1, day2)
    b = max(day1, day2)
    if a != day1:
        s = -1
    r = b - a
    if b - a > 182:
        r = 365 - b + a
    return r * s

df['Result'] = df.apply(SDate, axis=1)
When you pass df['GL Date'].dt.dayofyear and df['Special Date'].dt.dayofyear directly to the function, you're actually passing whole Series. The error occurs because you can't use a Series in a boolean context: comparing two Series with < or > yields many True/False values at once, so the result is ambiguous. Hence the error you got.
When you use apply with axis=1, it applies the function row-wise, so each call sees scalar values.
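A self-contained sketch of the row-wise version (the sample dates are made up; SDate is the asker's function applied to scalar day-of-year values):

```python
import pandas as pd

def SDate(x, y):
    s = 1
    a, b = min(x, y), max(x, y)
    if a != x:
        s = -1
    r = b - a
    if b - a > 182:          # wrap around the year boundary
        r = 365 - b + a
    return r * s

df = pd.DataFrame({
    'GL Date':      pd.to_datetime(['2022-01-10', '2022-12-01']),
    'Special Date': pd.to_datetime(['2022-03-01', '2022-01-15']),
})

# each row's Timestamps are scalars, so .dayofyear (no .dt) works here
df['Result'] = df.apply(
    lambda row: SDate(row['GL Date'].dayofyear, row['Special Date'].dayofyear),
    axis=1)
print(df['Result'].tolist())  # [50, -45]
```

The second row demonstrates both the sign flip (GL Date is later) and the 182-day wrap-around branch.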

Shift happens: using pandas shift to combine rows

I am trying to add a column to my dataframe in pandas where each entry represents the difference between another column's values across two adjacent rows (if certain conditions are met). Following this answer ("get previous row's value and calculate new column pandas python"), I'm using shift to find the delta between the duration_seconds entries in the two rows (next minus current) and then return that delta as the derived entry, provided both rows are from the same user_id, the next row's action is not 'login', and the delta is not negative. Here's the code:
def duration(row):
    candidate_duration = row['duration_seconds'].shift(-1) - row['duration_seconds']
    if row['user_id'] == row['user_id'].shift(-1) and row['action'].shift(-1) != 'login' and candidate_duration >= 0:
        return candidate_duration
    else:
        return np.nan
Then I test the function using
analytic_events.apply(lambda row: duration(row), axis=1)
But that throws an error:
AttributeError: ("'int' object has no attribute 'shift'", 'occurred at index 9464384')
I wondered if this was akin to the error fixed here and so I tried passing in the whole data frame thus:
duration(analytic_events)
but that throws the error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
What should I be doing to achieve this combination; how should I be using shift?
Without seeing your data: you could simplify this with conditional column creation using np.where:
cond1 = analytic_events['user_id'] == analytic_events['user_id'].shift(-1)
cond2 = analytic_events['action'].shift(-1) != 'login'
cond3 = analytic_events['duration_seconds'].shift(-1) - analytic_events['duration_seconds'] >= 0

analytic_events['candidate_duration'] = np.where(
    cond1 & cond2 & cond3,
    analytic_events['duration_seconds'].shift(-1) - analytic_events['duration_seconds'],
    np.nan)
Explanation
np.where works as follows: np.where(condition, value_if_true, value_if_false)
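A runnable sketch of this answer on a tiny made-up frame standing in for analytic_events:

```python
import numpy as np
import pandas as pd

# hypothetical sample standing in for analytic_events
df = pd.DataFrame({
    'user_id': [1, 1, 2],
    'action': ['view', 'click', 'login'],
    'duration_seconds': [10, 25, 5],
})

# same user, next action is not 'login', and delta is non-negative
cond1 = df['user_id'] == df['user_id'].shift(-1)
cond2 = df['action'].shift(-1) != 'login'
cond3 = df['duration_seconds'].shift(-1) - df['duration_seconds'] >= 0

df['candidate_duration'] = np.where(cond1 & cond2 & cond3,
                                    df['duration_seconds'].shift(-1) - df['duration_seconds'],
                                    np.nan)
print(df['candidate_duration'].tolist())  # [15.0, nan, nan]
```

Only the first row satisfies all three conditions; the second row's next row belongs to a different user, and the last row has no successor, so shift(-1) yields NaN there and every condition comes out False.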

Pandas isin boolean operator giving error

I am running into an error while using the 'isin' Boolean operator:
def rowcheck(row):
    return row['CUST_NAME'].isin(['John','Alan'])
My dataframe has column CUST_NAME. So I use:
df['CUSTNAME_CHK'] = df.apply(lambda row: rowcheck(row), axis=1)
I get:
'str' object has no attribute 'isin'
What did I do wrong?
You are doing it inside a function passed to apply, so row['CUST_NAME'] holds the value of a specific cell, and that value is a string. Strings have no isin method; that method belongs to pd.Series, not to str.
If you really want to use apply, use np.isin in this case (note: the pd.np alias has been removed from pandas, so import numpy directly):
import numpy as np

def rowcheck(row):
    return np.isin(row['CUST_NAME'], ['John', 'Alan'])
As @juanpa.arrivilaga noticed, isin won't be efficient in this case, so it's advised to use the in operator directly:
return row['CUST_NAME'] in ['John', 'Alan']
Note that you probably don't need apply at all. You can just use pd.Series.isin directly. For example,
df = pd.DataFrame({'col1': ['abc', 'dfe']})
  col1
0  abc
1  dfe
Such that you can do
df.col1.isin(['abc', 'xyz'])
0     True
1    False
Name: col1, dtype: bool
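For completeness, a tiny runnable sketch applying the vectorized version to the original CUSTNAME_CHK column (sample names made up):

```python
import pandas as pd

df = pd.DataFrame({'CUST_NAME': ['John', 'Mary', 'Alan']})

# vectorized membership test, no apply needed
df['CUSTNAME_CHK'] = df['CUST_NAME'].isin(['John', 'Alan'])
print(df['CUSTNAME_CHK'].tolist())  # [True, False, True]
```

Series.isin runs in a single vectorized pass, so it is both simpler and faster than calling apply row by row.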