Python new columns resulting from if statement - pandas

I have a dataframe
abcd = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 0]]),columns=['a', 'b', 'c'])
I want to create a new column "d" in this dataframe where: if column c is 0, the value is column a + column b; if column c is between 1 and 3, the value is column a; otherwise the value is 10.
My code:
if (abcd.c == 0):
    abcd.d = abcd.a + abcd.b
elif abcd.c in range (0,4):
    abcd.d = abcd.a
else:
    abcd.d = 10
Result: ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Use numpy.select with Series.isin to test membership:
abcd['d'] = np.select([abcd.c == 0, abcd.c.isin(range(0, 4))],
                      [abcd.a + abcd.b, abcd.a],
                      default=10)
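As a quick check of this approach on the sample frame from the question, here is a minimal self-contained sketch. The error in the question arises because `if` needs a single True/False, while `abcd.c == 0` yields one boolean per row. The conditions are evaluated in order, so rows with c equal to 0 are claimed by the first condition before the isin test is reached.

```python
import numpy as np
import pandas as pd

abcd = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 0]]),
                    columns=['a', 'b', 'c'])

# `abcd.c == 0` is a boolean Series (one value per row), not a single bool,
# which is why a plain `if` cannot decide which branch to take.
conditions = [abcd.c == 0, abcd.c.isin(range(0, 4))]
choices = [abcd.a + abcd.b, abcd.a]

# np.select picks, row by row, the choice of the first condition that is True.
abcd['d'] = np.select(conditions, choices, default=10)
print(abcd['d'].tolist())  # [1, 10, 15]
```

Assigning with `abcd['d'] = ...` rather than `abcd.d = ...` matters here: attribute assignment does not create a new column.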

Related

apply a function to all elements in a dataframe by considering all values of the element's row

I have this dataframe:
name,A,B,C,D,E,F
x,1,2,3,0,5,6
y,5,5,6,0,4,2
z,2,3,3,0,1,1
2012-01-01,106.20,48.80,41.60,1015.04,211.13,643.55
2012-02-01,8.40,-9999.,4.80,15.36,0.37,0.02
2012-03-01,5.20,7.00,12.20,42.70,2.60,0.33
2012-04-01,45.60,29.80,48.20,718.18,-9999.,373.28
2012-05-01,-9999.,21.20,18.30,193.98,17.75,10.34
2012-06-01,122.40,95.30,103.00,4907.95,2527.59,37253.17
2012-07-01,-9999.,98.50,83.70,4122.23,1725.15,21355.74
2012-08-01,-9999.,113.00,94.80,5356.20,2538.84,40836.42
2012-09-01,-9999.,97.80,96.90,4738.41,2295.76,32667.42
2012-10-01,50.20,52.60,47.90,1259.77,301.71,1141.42
2012-11-01,76.40,-9999.,118.00,5858.70,3456.63,60814.94
2012-12-01,73.80,41.90,31.10,651.55,101.32,198.23
As you can see, it represents records of some data for the stations named [A,B,C,D,E,F] at different times. Each station has a position in space with coordinates (x,y,z).
I read it as:
dfrGEO = pd.read_csv(f_name,
                     parse_dates = True,
                     index_col = 0,
                     nrows = 3,
                     infer_datetime_format = True,
                     cache_dates=True).replace(-9999.0, np.nan)
dfrDATA = pd.read_csv(f_name,
                      parse_dates = True,
                      index_col = 0,
                      header = 0,
                      skiprows = range(1,4),
                      infer_datetime_format = True,
                      cache_dates=True).replace(-9999.0, np.nan)
Let's say that I want to apply a function to every element of the dataframe dfrDATA.
The first idea would be to set up a double loop with iloc, but this would throw away the advantages of pandas and, I suppose, hurt performance.
Therefore, I came up with this:
def func_each_column(x,dfr):
    """
    here apply again for each row
    """
    res = 1
    return res
res = dfrDATA.apply(func_each_column,args=(dfrDATA))
However, I get this error:
The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
In addition, I would like to know if there is a better way to do what I want.
Thanks
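One likely cause, offered as an assumption rather than a confirmed diagnosis: `(dfrDATA)` is just the dataframe in parentheses, not a tuple, so pandas may end up testing the truth value of the dataframe itself when it inspects `args`. A one-element tuple needs a trailing comma. The sketch below uses a small stand-in frame with made-up values instead of the CSV file.

```python
import numpy as np
import pandas as pd

# Small stand-in for dfrDATA (hypothetical values, not the CSV above).
dfrDATA = pd.DataFrame({'A': [106.2, 8.4, np.nan],
                        'B': [48.8, np.nan, 7.0]})

def func_each_column(x, dfr):
    # x is one column (a Series); dfr is the extra argument passed via args.
    return 1

# Note the trailing comma: (dfrDATA,) is a tuple, (dfrDATA) is not.
res = dfrDATA.apply(func_each_column, args=(dfrDATA,))
print(res.tolist())  # [1, 1]
```

Whether there is a better way depends on what the function computes. If it can be written as arithmetic on whole columns or rows, that usually avoids apply altogether.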

Calculate Average True Range directly with Dataframe

I wonder if there is a simple and direct way to calculate ATR from a DataFrame object. I am stuck on the max() part. This is what I am trying to do:
df['atr']=max( (df['High']-df['Low']), (df['High']-df['Close'].shift()).abs(), (df['Low']-df['Close'].shift()).abs() )
The above code gives this error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I understand that using max() in this context is not appropriate for a dataframe object, but if it worked it would be rather elegant and simple. I just wonder whether there are built-in functions on the dataframe object to achieve this.
Following your approach:
np.max( ((df['High']-df['Low']).values, np.abs(df['High']-df['Close'].shift()), np.abs(df['Low']-df['Close'].shift())) , axis=0)
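For context, Python's built-in max compares its arguments with `>`, which yields a boolean Series, and then needs a single True/False from it, hence the error. A pandas-only alternative, sketched here with hypothetical prices, is to line the three candidates up as columns and take the row-wise maximum. This gives the true range per row; the average still has to be taken over it.

```python
import pandas as pd

# Hypothetical price data for illustration.
df = pd.DataFrame({'High': [10.0, 12.0, 11.0],
                   'Low': [8.0, 11.0, 10.0],
                   'Close': [9.0, 11.5, 10.5]})

prev_close = df['Close'].shift()
candidates = pd.concat([df['High'] - df['Low'],
                        (df['High'] - prev_close).abs(),
                        (df['Low'] - prev_close).abs()], axis=1)

# Row-wise maximum; NaN from the shifted first row is skipped by default.
df['TR'] = candidates.max(axis=1)
print(df['TR'].tolist())  # [2.0, 3.0, 1.5]
```

Unlike the NumPy version, this does not propagate the NaN that shift() introduces in the first row, so the first true range falls back to High minus Low.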
A function could look like this (it avoids the pandas copy warning):
def ATR(data: pd.DataFrame, window=14, use_nan=True) -> pd.Series:
    # Work on a deep copy so the caller's frame is left untouched
    df_ = data.copy(deep=True)
    df_.loc[:, 'H_L'] = df_['High'] - df_['Low']
    df_.loc[:, 'H_Cp'] = abs(df_['High'] - df_['Close'].shift(1))
    df_.loc[:, 'L_Cp'] = abs(df_['Low'] - df_['Close'].shift(1))
    # True range: row-wise maximum of the three candidates
    df_.loc[:, 'TR'] = df_[["H_L", "H_Cp", "L_Cp"]].max(axis=1)
    df_.loc[:, 'ATR'] = df_['TR'].rolling(window).mean()
    # From row `window` on, smooth recursively from the previous ATR value
    for i in range(window, len(df_)):
        df_.iloc[i, df_.columns.get_loc('ATR')] = (((df_.iloc[i - 1, df_.columns.get_loc('ATR')]) * (window - 1)) + df_.iloc[
            i, df_.columns.get_loc('TR')]) / window
    # Optionally blank out the first `window` rows
    if use_nan:
        df_.iloc[:window, df_.columns.get_loc('ATR')] = np.nan
    return df_['ATR']

iterating pandas rows using .apply()

I wanted to iterate through a pandas dataframe, but for some reason it does not work with the .apply() method.
train = pd.read_csv('../kaggletrain')
pclass = train['Pclass']
# pclass holds values that are either 1, 2 or 3,
# so I want to return True if the cell is 1 and False otherwise
def abc(pclass):
    if pclass == 1:
        return True
    else:
        return False
ABCDEFG = train.apply(abc, axis=1)
This gives ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Thank you for your help
Use a boolean mask instead of .apply(); this keeps only the rows where Pclass is 1:
ABCDEFG = train[train['Pclass']==1]
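If the goal is a True/False value per row rather than a filtered frame, the comparison on its own already returns that. With axis=1, apply hands each whole row to abc, so `pclass == 1` compares every cell of the row and the `if` cannot reduce that to one answer. A sketch with a hypothetical stand-in for the Kaggle file:

```python
import pandas as pd

# Hypothetical stand-in for the Kaggle training data.
train = pd.DataFrame({'Pclass': [1, 3, 2, 1], 'Age': [22, 38, 26, 35]})

# Element-wise comparison: one boolean per row, no loop needed.
is_first_class = train['Pclass'] == 1

# Equivalent with apply, calling the function on each cell of one column.
def abc(pclass):
    return pclass == 1

same_result = train['Pclass'].apply(abc)
print(is_first_class.tolist())  # [True, False, False, True]
```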

how to compare the values of two columns using condition, and assign a value when that condition is met

I want to compare the home_score and away_score column values and assign, in a new column, homeloss if home_score < away_score, homewin if home_score > away_score, and draw if home_score = away_score.
era1800_1900 = era(eras,1800,1900)
era1800_1900["result"] = era1800_1900[(era1800_1900["home_score"] < era1800_1900["away_score"] == "Lose")]
I expect a new column result in my dataframe with the values homeloss, homewin and draw based on the scores, but I get this error when I run the code above:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-78-58ef8c4a0715> in <module>
1 era1800_1900 = era(eras,1800,1900)
----> 2 era1800_1900["result"] = era1800_1900[(era1800_1900["home_score"] < era1800_1900["away_score"] == "Lose")]
~\Anaconda3 new\lib\site-packages\pandas\core\generic.py in __nonzero__(self)
1574 raise ValueError("The truth value of a {0} is ambiguous. "
1575 "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
-> 1576 .format(self.__class__.__name__))
1577
1578 __bool__ = __nonzero__
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Try the following approach:
era['result'] = None
era.loc[era[era['A'] < era['B']].index.values,'result'] = 'homelose'
era.loc[era[era['A'] > era['B']].index.values,'result'] = 'homewin'
era.loc[era[era['A'] == era['B']].index.values,'result'] = 'homedraw'
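The error in the question comes from the chained comparison: Python reads `a < b == "Lose"` as `(a < b) and (b == "Lose")`, and `and` needs a single truth value from each Series. As an alternative sketch (not the approach above), the three outcomes can be assigned in one step with numpy.select. The frame name `matches` and its scores are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical scores for illustration.
matches = pd.DataFrame({'home_score': [0, 2, 1], 'away_score': [1, 0, 1]})

conditions = [matches['home_score'] < matches['away_score'],
              matches['home_score'] > matches['away_score']]
choices = ['homeloss', 'homewin']

# Rows matching neither condition have equal scores, so they fall to the default.
matches['result'] = np.select(conditions, choices, default='draw')
print(matches['result'].tolist())  # ['homeloss', 'homewin', 'draw']
```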
If you are comfortable with functions, look at this example

Apply functions to multiple columns with pandas

I have 2 functions like this one:
def wind_index(result):
    if result > 10:
        return 1
    elif (result > 0) & (result <= 5):
        return 1.5
    elif (result > 5) & (result <= 10):
        return 2
def get_thermal_index(temp, hum):
    return wind_index(temp - 0.4*(temp-10)*((1-hum)/100))
When I'm trying to apply this function like this:
df['tci'] = get_thermal_index(df['tempC'], df['humidity'])
I got this error: ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
What else can I do to get a new column for my DataFrame using those functions??
You can use Series.apply:
def get_thermal_index(temp, hum):
    return (temp - 0.4*(temp-10)*((1-hum)/100)).apply(wind_index)
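Series.apply calls wind_index once per value, which is why the `if` statements work there. A vectorized variant is also possible; the sketch below uses numpy.select with the same thresholds. The helper name `wind_index_vec` and the readings are hypothetical, and values at or below 0, which match no branch in the original function, are returned as NaN here.

```python
import numpy as np
import pandas as pd

# Hypothetical readings for illustration.
df = pd.DataFrame({'tempC': [20.0, 5.0, 8.0, -5.0],
                   'humidity': [50.0, 50.0, 50.0, 50.0]})

def wind_index_vec(result):
    # Same thresholds as wind_index; values <= 0 match no branch and become NaN.
    conditions = [result > 10,
                  (result > 0) & (result <= 5),
                  (result > 5) & (result <= 10)]
    return np.select(conditions, [1.0, 1.5, 2.0], default=np.nan)

raw = df['tempC'] - 0.4 * (df['tempC'] - 10) * ((1 - df['humidity']) / 100)
df['tci'] = wind_index_vec(raw)
print(df['tci'].tolist())  # [1.0, 1.5, 2.0, nan]
```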