Function with for loop and logical operators - dataframe

I want to create a function in Python that normalizes the values of several variables under a specific condition.
As an example, take the following df; mine has 24 columns in total (23 int and 1 obj):
Column A  Column B  Column C
2         4         A
3         3         B
0         0.4       A
5         7         B
3         2         A
6         0         B
Let's say that I want to create a new df with the values of Column A and Column B after dividing by a factor X or Y, depending on whether Column C is A or B; i.e., if Column C is A the factor is X, and if Column C is B the factor is Y.
I have created different versions of a function:
def normalized_new(columns):
    for col in df.columns:
        if df.loc[df['Column C'] == 'A']:
            col = df[col] / X
        elif df.loc[df['Column C'] == 'B']:
            col = df[col] / Y
        else:
            pass
    return columns

normalized_new(df)
and here is the other version I tried:
def new_norm(prog):
    if df.loc[(df['Column C'] == 'A')]:
        prog = 1 / X
    elif df.loc[(df['Column C'] == 'B')]:
        prog = 1 / Y
    else:
        print("this function doesn't work well")
    return prog

for col in df.columns:
    df[col] = new_norm(df)
For both functions I always get the same ValueError:
The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Could you help me understand what is going on here? Is there any other way to create a df with the desired output?
Thank you so much in advance!

Try to use np.where + .div:
import numpy as np

X = 10
Y = -10
df[["Column A", "Column B"]] = df[["Column A", "Column B"]].div(
    np.where(df["Column C"].eq("A"), X, Y), axis=0
)
print(df)
Prints:
   Column A  Column B Column C
0       0.2      0.40        A
1      -0.3     -0.30        B
2       0.0      0.04        A
3      -0.5     -0.70        B
4       0.3      0.20        A
5      -0.6     -0.00        B
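A variation on the same idea (a sketch, not part of the answer above): if Column C could ever hold more than two categories, map each category to its divisor and let .div align on the index. This assumes the same df, X and Y as above:
# Map each category in Column C to its divisor; unmapped values become NaN.
factors = df["Column C"].map({"A": X, "B": Y})
df[["Column A", "Column B"]] = df[["Column A", "Column B"]].div(factors, axis=0)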

Consider using apply to call a custom function that sets the new column based on the whole row's data. This makes the logic easier to read.
For example:
X = 10
Y = 5

def new_norm(row):
    # Put your if/elif logic here, for example:
    if row['Column C'] == 'A':
        return row['Column A'] / X  # don't forget to return a value for the new column
    ...

df['newcol'] = df.apply(new_norm, axis=1)  # call the function for each row and add column 'newcol'
Using a function also makes it easier to handle edge cases (for example, an empty Column C, or a value other than A or B); a fuller, runnable sketch follows below.
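For completeness, here is a runnable sketch of the apply approach that normalizes both columns and handles unexpected values in Column C. It assumes the example df from the question and placeholder factors X and Y:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Column A': [2, 3, 0, 5, 3, 6],
                   'Column B': [4, 3, 0.4, 7, 2, 0],
                   'Column C': ['A', 'B', 'A', 'B', 'A', 'B']})
X, Y = 10, -10  # placeholder factors

def new_norm(row):
    # Dividing by NaN marks rows with an unexpected Column C rather than guessing.
    factor = X if row['Column C'] == 'A' else Y if row['Column C'] == 'B' else np.nan
    return pd.Series({'Column A': row['Column A'] / factor,
                      'Column B': row['Column B'] / factor})

df[['Column A', 'Column B']] = df.apply(new_norm, axis=1)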

Related

Classifying pandas columns according to range limits

I have a dataframe with several numeric columns whose ranges go either from 1 to 5 or from 1 to 10.
I want to create two lists of these columns' names this way:
names_1to5 = list of all columns in df with numbers ranging from 1 to 5
names_1to10 = list of all columns in df with numbers from 1 to 10
Example:
IP track batch size type
1 2 3 5 A
9 1 2 8 B
10 5 5 10 C
From the dataframe above:
names_1to5 = ['track', 'batch']
names_1to10 = ['IP', 'size']
I want to use a function that takes a dataframe and performs the above transformation only on columns with numbers within those ranges.
I know that if the column's max() is 5 then it's 1to5; the same idea applies when max() is 10.
What I already did:
def test(df):
    list_1to5 = []
    list_1to10 = []
    for col in df:
        if df[col].max() == 5:
            list_1to5.append(col)
        else:
            list_1to10.append(col)
    return list_1to5, list_1to10
I tried the above, but it returns the following error msg:
'>=' not supported between instances of 'float' and 'str'
The type of the columns is 'object'; maybe this is the reason. If so, how can I fix the function without having to cast these columns to float? There are several, sometimes hundreds, of these columns, and if I run df['column'].max() I get 10 or 5.
What's the best way to create this function?
Use:
import pandas as pd

string = """alpha IP track batch size
A 1 2 3 5
B 9 1 2 8
C 10 5 5 10"""

temp = [x.split() for x in string.split('\n')]
cols = temp[0]
data = temp[1:]

def test(df):
    list_1to5 = []
    list_1to10 = []
    for col in df.columns:
        if df[col].dtype != 'O':  # skip object (string) columns
            if df[col].max() == 5:
                list_1to5.append(col)
            else:
                list_1to10.append(col)
    return list_1to5, list_1to10

df = pd.DataFrame(data, columns=cols, dtype=float)
Output:
(['track', 'batch'], ['IP', 'size'])
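If the numeric columns really arrive as dtype object (numbers stored as strings), a hedged alternative is to coerce them with pd.to_numeric instead of skipping them entirely; classify_ranges below is a hypothetical helper, not part of the answer above:
import pandas as pd

def classify_ranges(df):
    names_1to5, names_1to10 = [], []
    for col in df.columns:
        # Coerce object columns that actually hold numbers; true strings become NaN.
        numeric = pd.to_numeric(df[col], errors='coerce')
        if numeric.isna().all():
            continue  # genuinely non-numeric column, e.g. 'type'
        (names_1to5 if numeric.max() == 5 else names_1to10).append(col)
    return names_1to5, names_1to10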

Comparing string values from sequential rows in pandas series

I am trying to count common string values in sequential rows of a pandas series, using a user-defined function, and to write the output into a new column. I figured out the individual steps, but when I put them together I get a wrong result. Could you please tell me the best way to do this? I am a very beginner Pythonista!
My pandas df is:
df = pd.DataFrame({"Code": ['d7e', '8e0d', 'ft1', '176', 'trk', 'tr71']})
My string comparison loop is:
x = 'd7e'
y = '8e0d'
s = 0
for i in y:
    b = str(i)
    if b not in x:
        s += 0
    else:
        s += 1
print(s)
The right result for these particular strings is 2.
Note: when I wrap this in def func(x, y):, something happens to the s counter and it doesn't produce the right result. I think I need to reset it to 0 every time the loop runs.
Then, I use df.shift to specify the positions of y and x in the series:
x = df["Code"]
y = df["Code"].shift(periods=-1, axis=0)
And finally, I use the df.apply() method to run the function:
df["R1SB"] = df.apply(func, axis=0)
and I get None values in my new column "R1SB"
My correct output would be:
"Code" "R1SB"
0 d7e None
1 8e0d 2
2 ft1 0
3 176 1
4 trk 0
5 tr71 2
Thank you for your help!
TRY:
import numpy as np
import pandas as pd

df['R1SB'] = df.assign(temp=df.Code.shift(1)).apply(
    lambda x: np.nan
    if pd.isna(x['temp'])
    else sum(i in str(x['temp']) for i in str(x['Code'])),
    axis=1,
)
OUTPUT:
   Code  R1SB
0   d7e   NaN
1  8e0d   2.0
2   ft1   0.0
3   176   1.0
4   trk   0.0
5  tr71   2.0
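As for why the hand-rolled func misbehaved: the counter s has to be initialized inside the function so it resets for every pair of rows. A minimal sketch with a named function (common_chars is a hypothetical name) that reproduces the same output:
import numpy as np
import pandas as pd

df = pd.DataFrame({"Code": ['d7e', '8e0d', 'ft1', '176', 'trk', 'tr71']})

def common_chars(current, previous):
    # The count starts at 0 on every call, so each row pair is independent.
    if pd.isna(previous):
        return np.nan
    return sum(ch in previous for ch in current)

df["R1SB"] = [common_chars(c, p) for c, p in zip(df["Code"], df["Code"].shift(1))]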

Pandas truth value of a series is ambiguous

I have a pandas dataframe. I want to check the value in a particular column and create a flag column based on whether it is null or not null.
df_have:
A  B
1
2  X
df_want:
A  B  B_Available
1     N
2  X  Y
I did:
def chkAvail(row):
    return (pd.isnull(row['B']) == False)

if (df_have.apply(lambda row: chkAvail(row), axis=1)):
    df_want['B_Available'] = 'Y'
I got:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
What did I do wrong?
You can use
df['B_available'] = df.B.notnull().map({False: 'N', True: 'Y'})
if the blank values are NaN or None. If they are whitespace, do
df['B_available'] = (df.B != ' ').map({False: 'N', True: 'Y'})
Using if on a Series is not a good idea, because there may be many True and False values in the Series. E.g., what would if pd.Series([True, False, True, True]) mean? It makes no sense ;)
You can also use np.select:
# In case blank values are NaN
df['B_Available'] = np.select([df.B.isnull()], ['N'], 'Y')
# In case blank values are empty strings:
df['B_Available'] = np.select([df.B == ''], ['N'], 'Y')
>>> df
   A    B B_Available
0  1  NaN           N
1  2    X           Y
By using np.where:
df['B_Available']=np.where(df.B.eq(''),'N','Y')
df
Out[86]:
   A  B B_Available
0  1              N
1  2  X           Y
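To see why the original if raised the error: a boolean Series holds one truth value per row, so Python cannot collapse it to a single True/False on its own; you have to pick a reduction explicitly. A quick sketch:
import pandas as pd

mask = pd.Series([True, False, True, True])
# if mask: ...      # would raise: truth value of a Series is ambiguous
print(mask.any())   # True  -- at least one element is True
print(mask.all())   # False -- not every element is True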

Passing an Index or column value as a **KWARG in groupby aggregate function

I am trying to run an aggregate function on a pandas groupby, where I pass one of the columns as a kwarg or arg. I can do this when passing a constant, but cannot figure out how to pass a column value.
For example:
import pandas as pd
import datetime
import numpy as np

def sum_corr(vector, cor):
    a = vector.tolist()
    radicand = sum([a[i] * a[j] * (1 if i == j else cor)
                    for i in range(len(a)) for j in range(len(a))])
    return np.sqrt(radicand)

my_table = pd.DataFrame({'Date': 4 * pd.bdate_range(datetime.datetime(2017, 1, 1), periods=4).tolist(),
                         'Name': [i for i in 'abcd' for j in range(4)],
                         'corr': [i for i in [0, 1, .5, .8] for j in range(4)],
                         'vals': [1, 2, 3, 4] * 4})
I can call this with a constant, no problem:
print(my_table.groupby(['Name','corr'],as_index=False).agg(sum_corr,**{'cor':0}))
  Name  corr      vals
0    a   0.0  5.477226
1    b   1.0  5.477226
2    c   0.5  5.477226
3    d   0.8  5.477226
I would like to call this passing in the 'corr' column, something like:
print(my_table.groupby(['Name','corr'],as_index=False).agg(sum_corr,**{'cor':my_table['corr']}))
  Name  corr       vals
0    a   0.0   5.477226
1    b   1.0  10.000000
2    c   0.5   8.062258
3    d   0.8   9.273618
Thanks in advance!
The problem here is not passing a column; the problem is that sum_corr() returns an array when you pass a column, when it should return an aggregated (scalar) value to be usable in agg() on a groupby object.
For example, if you change the last line in sum_corr() from
return np.sqrt(radicand)
to
return np.sum(np.sqrt(radicand))
then your function returns a scalar, and
print(my_table.groupby(['Name','corr'],as_index=False).agg(sum_corr,**{'cor':my_table['corr']}))
produces no error:
  Name  corr        vals
0    a   0.0  131.252407
1    b   1.0  131.252407
2    c   0.5  131.252407
3    d   0.8  131.252407
This example might not be what you want to achieve, but it illustrates that you can pass a column as a kwarg in groupby.agg(). A sketch that reproduces the desired per-group output follows below.
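If the goal is the per-group output shown in the question, one option (a sketch, not the only way) is to note that corr is a grouping key and therefore constant within each group, so groupby().apply can read it from the group key itself; sum_corr_group is a hypothetical helper:
import numpy as np

def sum_corr_group(g):
    a = g['vals'].tolist()
    cor = g.name[1]  # the (Name, corr) group key; corr is constant per group
    radicand = sum(a[i] * a[j] * (1 if i == j else cor)
                   for i in range(len(a)) for j in range(len(a)))
    return np.sqrt(radicand)

result = my_table.groupby(['Name', 'corr']).apply(sum_corr_group).reset_index(name='vals')
This reproduces the desired values (5.477226, 10, 8.062258, 9.273618).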

Case insensitive pandas.concat

How would I perform a case insensitive pandas.concat?
df1 = pd.DataFrame({"a":[1,2,3]},index=["a","b","c"])
df2 = pd.DataFrame({"b":[1,2,3]},index=["a","b","c"])
df1a = pd.DataFrame({"A":[1,2,3]},index=["A","B","C"])
pd.concat([df1, df2],axis=1)
   a  b
a  1  1
b  2  2
c  3  3
but this does not work:
pd.concat([df1, df1a],axis=1)
     a    A
A  NaN    1
B  NaN    2
C  NaN    3
a    1  NaN
b    2  NaN
c    3  NaN
Is there an easy way to do this?
I have the same question for concat on a Series.
This works for a DataFrame:
pd.DataFrame([11,21,31],index=pd.MultiIndex.from_tuples([("A",x) for x in ["a","B","c"]])).rename(str.lower)
but this does not work for a Series:
pd.Series([11,21,31],index=pd.MultiIndex.from_tuples([("A",x) for x in ["a","B","c"]])).rename(str.lower)
TypeError: descriptor 'lower' requires a 'str' object but received a 'tuple'
For renaming, DataFrames use:
def rename_axis(self, mapper, axis=1):
    index = self.axes[axis]
    if isinstance(index, MultiIndex):
        new_axis = MultiIndex.from_tuples([tuple(mapper(y) for y in x) for x in index],
                                          names=index.names)
    else:
        new_axis = Index([mapper(x) for x in index], name=index.name)
whereas when renaming a Series:
result.index = Index([mapper_f(x) for x in self.index], name=self.index.name)
So my updated question is: how do I perform the rename / case-insensitive concat with a Series?
You can do this via rename:
pd.concat([df1, df1a.rename(index=str.lower)], axis=1)
EDIT:
If you want to do this with a MultiIndexed Series you'll need to set it manually, for now. There's a bug report over at the pandas GitHub repo waiting to be fixed (thanks @ViktorKerkez).
s.index = pd.MultiIndex.from_tuples(s.index.map(lambda x: tuple(map(str.lower, x))))
You can replace str.lower with whatever function you want to use to rename your index.
Note that you cannot use reindex in general here, because reindex tries to look up the renamed labels in the existing index and thus returns NaN values, unless the rename leaves the original index unchanged.
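For a plain (non-Multi) index, the same rename trick also works on a Series in current pandas versions, where rename with a callable maps the index labels; a minimal sketch with two hypothetical Series:
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s1a = pd.Series([1, 2, 3], index=["A", "B", "C"])

# Lower-case the index of s1a before concatenating so the labels align.
print(pd.concat([s1, s1a.rename(index=str.lower)], axis=1))
For a MultiIndexed Series, use the from_tuples workaround shown above.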