Pandas function error: The truth value of a Series is ambiguous - pandas

I have a function that compares two dates in a dataframe and returns a value after a basic calculation:
def SDate(x, y):
    s = 1
    a = min(x, y)
    b = max(x, y)
    if a != x: s = -1
    r = b - a
    if b - a > 182:
        r = 365 - b + a
    return r * s
I tried calling it like this:
df['Result'] = SDate(df['GL Date'].dt.dayofyear, df['Special Date'].dt.dayofyear)
but I get this error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

I am not sure exactly what you are trying to achieve, but it looks like you are passing whole Series into a function that is written for scalar values and expecting row-wise results. That does not work, because the if statements inside the function need a single True or False, not one per row.
It's also good practice to include a sample of the data you are using and the output you expect, as well as a more detailed explanation of what you want to achieve.
That being said, from what you have described, you should use the apply method as a row-wise operation to get your output.
So if you wish to apply this function:
def SDate(x, y):
    s = 1
    a = min(x, y)
    b = max(x, y)
    if a != x: s = -1
    r = b - a
    if b - a > 182:
        r = 365 - b + a
    return r * s
You should do this:
# Inside a row-wise apply each cell is a single Timestamp, so use .dayofyear directly (no .dt accessor)
df['Result'] = df.apply(lambda x: SDate(x['GL Date'].dayofyear, x['Special Date'].dayofyear), axis=1)

You are passing a pandas Series as both x and y, and the built-in min cannot handle such objects. As I don't know what you are trying to do there, I can't fix that in the code.
Hope it works.

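The problem is the built-in min/max functions: they compare their arguments with <, and for two Series that comparison yields a Series of booleans rather than a single True or False. If you want the overall minimum and maximum, consider using this: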
a = min(x.min(), y.min())
b = max(x.max(), y.max())
However, then you compare Series with a number: if a != x: – it will fail too. What's the purpose of your function?
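As a small illustration of the point above, here is a sketch with made-up values (the Series x and y are assumptions, not the asker's data). It shows that the built-in min raises the error on two Series, and that NumPy's element-wise minimum and maximum may be closer to what was intended:

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for the two day-of-year Series
x = pd.Series([1, 5, 3])
y = pd.Series([4, 2, 6])

# The built-in min compares the Series with <, then needs one True/False
try:
    min(x, y)
except ValueError as exc:
    error_message = str(exc)

# Element-wise alternatives that return one value per row
elementwise_min = np.minimum(x, y)
elementwise_max = np.maximum(x, y)
```

Whether an element-wise result or a single overall value is wanted depends on the purpose of the function, which the question does not state.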

You can try axis parameter of df.apply:
def SDate(row):
    s = 1
    # Use the day of the year, as in the question (the 182/365 logic assumes days, not years)
    day1 = row['GL Date'].dayofyear
    day2 = row['Special Date'].dayofyear
    a = min(day1, day2)
    b = max(day1, day2)
    if a != day1:
        s = -1
    r = b - a
    if b - a > 182:
        r = 365 - b + a
    return r * s
df['Result'] = df.apply(SDate, axis=1)
When you pass df['GL Date'].dt.dayofyear and df['Special Date'].dt.dayofyear to the function, you are actually passing in two whole Series. Comparing Series with < or > (which min and max do internally) returns a Series of booleans rather than a single True or False, so the result is ambiguous and pandas raises the error you got.
When you use apply with axis=1, the function is applied row-wise, so each call sees scalar values.
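If row-wise apply turns out to be slow, the same calculation can probably be written without a Python-level loop. The following is only a sketch: the name sdate_vectorized and the sample dates are assumptions made for illustration, and the logic is a direct translation of the SDate function from the question.

```python
import numpy as np
import pandas as pd

def sdate_vectorized(x, y):
    """Element-wise version of SDate for two Series of day-of-year values."""
    diff = (y - x).abs()                           # b - a in the original
    sign = np.where(x <= y, 1, -1)                 # -1 when x is not the minimum
    wrapped = diff.where(diff <= 182, 365 - diff)  # wrap around the year end
    return wrapped * sign

# Made-up sample data
df = pd.DataFrame({
    "GL Date": pd.to_datetime(["2021-01-10", "2021-12-20", "2021-06-01"]),
    "Special Date": pd.to_datetime(["2021-01-20", "2021-01-05", "2021-03-01"]),
})

df["Result"] = sdate_vectorized(
    df["GL Date"].dt.dayofyear, df["Special Date"].dt.dayofyear
)
```

Here the whole columns are passed in, as in the original attempt, but every step is an element-wise operation, so no single True/False is ever required.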

Related

How to calculate rolling.agg('max') utilising a dataframe column as input to my function

I'm working with a kline dataframe. I'm adding a Swing_High and Swing_Low column to my df.
I've run into an error where, during low-volatility periods, my Close == Swing_Low price. This gives me an inf error in another function where I compute close / Swing_Low.
To fix this I need to calculate the max/min value based on whether Close == Swing_Low or not. The default is a rolling period of 10, but if the above is true the rolling period should increase to 15.
Below is how I calculated Swing_High and Swing_Low up to the point where I hit the inf error.
import pandas as pd
df = pd.read_csv('Data/bybit_BTCUSD_15m.csv')
df["Date"] = df["Date"].astype('datetime64[ns]')
# Calculate the swing high and low for a given length
df['Swing_High'] = df['High'].rolling(10).agg('max')
df['Swing_Low'] = df['Low'].rolling(10).agg('min')
I tried the below function but it gives me a ValueError: The truth value of a Series is ambiguous
def swing_high(close, high, period1, period2):
    a = high.rolling(period1).agg('max')
    b = high.rolling(period2).agg('max')
    if a != close:
        return a
    else:
        return b
df['Swing_High'] = swing_high(df['Close'], df['High'], 10, 15)
How do I fix this or is there a better way to achieve my desired outcome?
A simple solution for what you're trying to achieve is the pandas where() function. Here is the basic syntax:
df['col'] = (value_if_true).where(condition, value_if_false)
where() keeps the original value wherever the condition is True and takes the replacement wherever it is False, so:
df['Swing_High_10'] = df['High'].rolling(10).agg('max')
df['Swing_High_15'] = df['High'].rolling(15).agg('max')
df['Swing_High'] = df['Swing_High_10'].where(df['Swing_High_10'] != df['Close'], df['Swing_High_15'])
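To see how where() chooses between the two rolling windows, here is a minimal sketch on made-up prices. The windows of 2 and 3 are assumptions chosen only so that a six-row sample is enough; the question uses 10 and 15.

```python
import pandas as pd

# Made-up prices, not real kline data
df = pd.DataFrame({
    "High":  [6, 2, 3, 5, 4, 4],
    "Close": [5, 2, 3, 4, 4, 3],
})

short_high = df["High"].rolling(2).max()  # stands in for the 10-period window
long_high = df["High"].rolling(3).max()   # stands in for the 15-period window

# Keep the short-window value unless it equals Close; then use the long window
df["Swing_High"] = short_high.where(short_high != df["Close"], long_high)
```

In the third row the short-window high equals Close (3), so the long-window value (6) is used instead. The first row stays NaN because the short window is not yet full.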

Pandas - Setting column value, based on a function that runs on another column

I have been all over the place trying to get this to work (new to data science). It's obviously because I don't fully understand how pandas data structures work.
I have this code:
def getSearchedValue(identifier):
    full_str = anedf["Diskret data"].astype(str)
    value=""
    if full_str.str.find(identifier) <= -1:
        start_index = full_str.str.find(identifier)+len(identifier)+1
        end_index = full_str[start_index:].find("|")+start_index
        value = full_str[start_index:end_index].astype(str)
    return value
for col in anedf.columns:
    if col.count("#") > 0:
        anedf[col] = getSearchedValue(col)
What I'm trying to do is iterate over my columns. I have around 260 in my dataframe. If a column name contains the character #, the column should be filled with values based on what's in my "Diskret data" column.
Data in the "Diskret data" column is completely messed up, but it has the form CCC#111~VALUE|DDD#222~VALUE| and so on, until there are no more identifiers and values. Not every identifier is present in each row, and they come in no specific order.
The function works if I run it with hard-coded strings in a regular Python script. But with the dataframe I get various errors like:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Input In [119], in <cell line: 12>()
12 for col in anedf.columns:
13 if col.count("#") > 0:
---> 14 anedf[col] = getSearchedValue(col)
Input In [119], in getSearchedValue(identifier)
4 full_str = anedf["Diskret data"].astype(str)
5 value=""
----> 6 if full_str.str.find(identifier) <= -1:
7 start_index = full_str.str.find(identifier)+len(identifier)+1
8 end_index = full_str[start_index:].find("|")+start_index
I guess this is because the condition is evaluated against all rows at once (a Series), which gives a mix of True and False values. But how can I make the evaluation and assignment work row by row, like this:
Diskret data               CCC#111    JJSDJ#1234
CCC#111~1|BBB#2323~2234    1 [a]      0
JJSDJ#1234~Heart attack    0 [b]      Heart attack
[a] copied from "Diskret data"
[b] or skipped, since the row does not contain a value for the identifier
The plan is to drop the "Diskret data" when the assignment is done, so I have the data in a more structured way.
--- Update---
By request:
I have included a picture of how I visualize the problem, And what I seemingly can't make it do.
[Image: Problem visualisation]
With regex you could do something like:
import pandas as pd

def map_(list_) -> pd.Series:
    if list_:
        idx, values = zip(*list_)
        return pd.Series(values, idx)
    else:
        return pd.Series(dtype=object)

series = pd.Series(
    ['CCC#111~1|BBB#2323~2234', 'JJSDJ#1234~Heart attack']
)
reg_series = series.str.findall(r'([^~|]+)~([^~|]+)')
reg_series.apply(map_)
Breaking this down:
Create a new series by running a map on each row that turns your long string into a list of tuples.
reg_series = series.str.findall(r'([^~|]+)~([^~|]+)')
reg_series
# output:
# 0 [(CCC#111, 1), (BBB#2323, 2234)]
# 1 [(JJSDJ#1234, Heart attack)]
Then we create a map_ function. This function takes each row of reg_series and unzips it into two tuples: the first with only the "keys" and the other with only the "values". We then build a Series from these, with the keys as the index and the values as the data.
Edit: We added an if/else statement that checks whether the list is non-empty. If it is empty, we return an empty Series of dtype object.
def map_(list_) -> pd.Series:
    if list_:
        idx, values = zip(*list_)
        return pd.Series(values, idx)
    else:
        return pd.Series(dtype=object)
...
print(idx, values) # first row
# output:
# ('CCC#111', 'BBB#2323') ('1', '2234')
Finally we run apply on the series to create a dataframe that takes the outputs from map_ for each row and zips them together in columnar format.
reg_series.apply(map_)
# output:
#   CCC#111 BBB#2323    JJSDJ#1234
# 0       1     2234           NaN
# 1     NaN      NaN  Heart attack
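To get from the extracted pairs back to the dataframe in the question, one possible continuation is sketched below. The two-row anedf is a made-up stand-in for the real data, and building one dict per row is a variation on the map_ approach rather than the answer's exact code.

```python
import pandas as pd

# Made-up stand-in for the asker's dataframe
anedf = pd.DataFrame({
    "Diskret data": ["CCC#111~1|BBB#2323~2234", "JJSDJ#1234~Heart attack"],
})

# One list of (identifier, value) tuples per row
pairs = anedf["Diskret data"].astype(str).str.findall(r"([^~|]+)~([^~|]+)")

# One dict per row; identifiers missing from a row become NaN
extracted = pd.DataFrame([dict(p) for p in pairs], index=anedf.index)

# Drop the raw column and attach the structured columns
result = anedf.drop(columns="Diskret data").join(extracted)
```

If the real dataframe already contains empty columns named after the identifiers, those would need to be dropped or overwritten first, since join does not accept overlapping column names without a suffix.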

Create new column in dataframe by passing existing pandas column values as argument to API call

I have created a function below, get_lyrics, to which I want to pass the Song_Title and Singer_Name column values from an existing dataframe in order to create a new column in that dataframe.
My code below, which attempts to create the column df['Lyrics'], gives me the error below, and I have no idea why:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
My testing of the function with get_lyrics(test_song_name, test_song_author) works, it returns a very long string.
import lyricsgenius as lg
import pandas as pd
genius = lg.Genius(access_token=token)
test_song_name = "My Heart Will Go On"
test_song_author = "Celine Dion"
def get_lyrics(Song_Title, Singer_Name):
    song = genius.search_song(Song_Title, Singer_Name)
    return song.lyrics
get_lyrics(test_song_name, test_song_author)
df['Lyrics'] = df.apply(
    get_lyrics(
        df["Song_Title"], df["Singer_Name"]
    )
)
To apply a function to each row, you can use apply() with axis=1.
df['Lyrics'] = df.apply(lambda row: get_lyrics(row["Song_Title"], row["Singer_Name"]), axis=1)
Or call the API directly inside the lambda, in one line:
df['Lyrics'] = df.apply(lambda row: genius.search_song(row["Song_Title"], row["Singer_Name"]).lyrics, axis=1)
If you don't want a lambda, you can define get_lyrics to take the row itself:
def get_lyrics(row):
    song = genius.search_song(row["Song_Title"], row["Singer_Name"])
    return song.lyrics
df['Lyrics'] = df.apply(get_lyrics, axis=1)
I found this page: https://www.codeforests.com/2020/07/18/pass-multiple-columns-to-lambda/
It has two working solutions. The first is identical to the one posted by @Ynjxsjmh:
df["Lyrics"] = df.apply(
    lambda x: get_lyrics(x["Song_Title"], x["Singer_Name"]), axis=1
)
This latter one first selects a subset of the dataframe columns and then unpacks them with *x to feed into get_lyrics.
df["Lyrics"] = df[["Song_Title", "Singer_Name"]].apply(
    lambda x: get_lyrics(*x), axis=1
)
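Another option, not mentioned in the answers above, is to skip apply and walk the two columns in parallel with zip. The sketch below replaces get_lyrics with a hypothetical offline stand-in so that it runs without an API token; the second song in the sample data is also made up.

```python
import pandas as pd

def get_lyrics(song_title, singer_name):
    # Hypothetical stand-in for the real API call, so the example runs offline
    return f"lyrics of {song_title} by {singer_name}"

df = pd.DataFrame({
    "Song_Title": ["My Heart Will Go On", "Hello"],
    "Singer_Name": ["Celine Dion", "Adele"],
})

# Each call receives two plain strings, never a whole Series
df["Lyrics"] = [
    get_lyrics(title, singer)
    for title, singer in zip(df["Song_Title"], df["Singer_Name"])
]
```

With the real API it may be worth checking the search result before reading .lyrics, in case a song is not found.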

I have a dataframe and I want to find the standard deviation for some specific cells

I'm trying to use pandas to find the standard deviation of the entries in some specific cells.
I have tried using NumPy's std like so:
numpy.std(df[columnName][j:i])
I have also tried using this:
df.std(axis=0)[columnName][j:i]
Just pseudocode, because my actual code is more confusing than necessary for this question:
df = loadIris()
for feat in df.columns:
    i = 0
    j = 0
    flower = df['flower'][i]
    while i < df.index.max():
        if df['flower'][i] == flower:
            i += 1
        else:
            j = i
            stand = df.std(axis=0)[feat][j:i]
            flower = df['flower'][i]
I ended up just appending all of the values to a list and then calculating the standard deviation using statistics.stdev which you can get by importing statistics.
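For comparison, pandas can probably do this without a manual loop. The sketch below uses a tiny made-up table in place of the iris data; the column names flower and sepal_length are assumptions. It shows the standard deviation of a slice of rows and the standard deviation per flower via groupby:

```python
import pandas as pd

# Tiny made-up stand-in for the iris data
df = pd.DataFrame({
    "flower": ["setosa"] * 3 + ["virginica"] * 3,
    "sepal_length": [5.0, 5.2, 4.8, 6.0, 6.4, 6.2],
})

# Standard deviation of specific cells: slice the rows first, then call std()
slice_std = df["sepal_length"].iloc[0:3].std()

# Standard deviation of the column within each flower group
per_flower_std = df.groupby("flower")["sepal_length"].std()
```

Note that pandas' std() and statistics.stdev both compute the sample standard deviation (ddof=1), while numpy.std defaults to the population version (ddof=0), so the numbers can differ slightly between the approaches.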

pandas apply() with and without lambda

What is the rule/process when a function is called with pandas apply() through a lambda vs. not? Examples below. Without the lambda, apparently the entire Series (df[column name]) is passed to the "test" function, which throws an error when it tries to do a boolean operation on a Series.
If the same function is called via a lambda, it works: apply iterates over the rows, each row is passed as "x", and x[column name] returns a single value for that column in the current row.
It's like lambda is removing a dimension. Anyone have an explanation or point to the specific doc on this? Thanks.
Example 1 with lambda, works OK
print("probPredDF columns:", probPredDF.columns)
def test(x, y):
    if x == y:
        r = 'equal'
    else:
        r = 'not equal'
    return r
probPredDF.apply( lambda x: test( x['yTest'], x[ 'yPred']), axis=1 ).head()
Example 1 output
probPredDF columns: Index([0, 1, 'yPred', 'yTest'], dtype='object')
Out[215]:
0 equal
1 equal
2 equal
3 equal
4 equal
dtype: object
Example 2 without lambda, throws boolean operation on series error
print("probPredDF columns:", probPredDF.columns)
def test(x, y):
    if x == y:
        r = 'equal'
    else:
        r = 'not equal'
    return r
probPredDF.apply( test( probPredDF['yTest'], probPredDF[ 'yPred']), axis=1 ).head()
Example 2 output
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
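There is nothing magic about a lambda. A lambda is just an anonymous function defined inline. In Example 2, test(probPredDF['yTest'], probPredDF['yPred']) is called immediately with the two whole Series, before apply ever runs, and that call is what raises the error. apply expects a function to call, and with axis=1 it calls that function once per row with the row as its single argument. You can use a named function where a lambda is expected, but it also needs to take that one parameter. You need to do something like...
Define it as: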
def wrapper(x):
    return test(x['yTest'], x['yPred'])
Use it as:
probPredDF.apply(wrapper, axis=1)