I have a pandas dataframe. I want to check the value in a particular column and create a flag column based on if it is null/not null.
df_have:
A B
1
2 X
df_want
A B B_Available
1 N
2 X Y
I did:
def chkAvail(row):
    return (pd.isnull(row['B']) == False)
if (df_have.apply (lambda row: chkAvail(row),axis=1)):
    df_want['B_Available']='Y'
I got:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
What did I do wrong?
You can use
df['B_available'] = df.B.notnull().map({False: 'N', True:'Y'})
if the blank values are NaN or None. If they are whitespace, use
df['B_available'] = (df.B != ' ').map({False: 'N', True:'Y'})
Writing if series is not a good idea, because a Series can hold many True and False values at once. E.g. what should if pd.Series([True, False, True, True]) mean? It makes no sense ;)
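To make the ambiguity concrete, here is a minimal sketch (the variable names are only for illustration) showing that the bare if raises, while an explicit reduction with .any() or .all() gives a single boolean:

```python
import pandas as pd

s = pd.Series([True, False, True, True])

# An explicit reduction turns the Series into one boolean.
any_true = bool(s.any())  # is at least one element True?
all_true = bool(s.all())  # is every element True?

# The bare truth test is what triggers the error from the question.
try:
    if s:
        pass
    raised = False
except ValueError:
    raised = True
```

Which reduction is right depends on the question being asked; for a per-row flag column neither is needed, because the boolean Series itself is the result.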
You can also use np.select:
# In case blank values are NaN
df['B_Available'] = np.select([df.B.isnull()], ['N'], 'Y')
# In case blank values are empty strings:
df['B_Available'] = np.select([df.B == ''], ['N'], 'Y')
>>> df
A B B_Available
0 1 NaN N
1 2 X Y
By using np.where
df['B_Available']=np.where(df.B.eq(''),'N','Y')
df
Out[86]:
A B B_Available
0 1 N
1 2 X Y
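If it is not known in advance whether the blanks are NaN, empty strings, or whitespace, the checks above can be combined. This is only a sketch under that assumption; the sample frame and the is_blank name are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical data mixing NaN, a real value, an empty string and whitespace.
df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [np.nan, "X", "", "  "]})

# A cell counts as blank if it is missing, or if it is empty after stripping.
is_blank = df["B"].isna() | df["B"].astype(str).str.strip().eq("")

df["B_Available"] = np.where(is_blank, "N", "Y")
```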
I want to create a function in Python that normalizes the values of several variables under a specific condition.
As an example, take the following df; mine has 24 columns in total (23 int and 1 object):
Column A  Column B  Column C
2         4         A
3         3         B
0         0.4       A
5         7         B
3         2         A
6         0         B
Let's say that I want to create a new df with the values of Column A and Column B divided by a factor X or Y, depending on whether Column C is A or B: if Column C is A the factor is X, and if Column C is B the factor is Y.
I have created different versions of a function:
def normalized_new (columns):
    for col in df.columns:
        if df.loc[df['Column C'] =='A']:
            col=df[col]/X
        elif df.loc[df['Column C'] =='B']:
            col=df[col]/Y
        else: pass
    return columns
normalized_new (df)
and the other I tried:
def new_norm (prog):
    if df.loc[(df['Column C']=='A')]:
        prog = 1/X
    elif df.loc[(df['Column C']=='B')]:
        prog = 1/Y
    else: print('this function doesnt work well')
    return (prog)
for col in df.columns:
    df[col]=new_norm(df)
For both functions I always get the same ValueError:
The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Could you help me understand what is going on here? Is there another way to create a df with the desired output?
Thank you so much in advance!
Try to use np.where + .div:
X = 10
Y = -10
df[["Column A", "Column B"]] = df[["Column A", "Column B"]].div(
    np.where(df["Column C"].eq("A"), X, Y), axis=0
)
print(df)
Prints:
Column A Column B Column C
0 0.2 0.40 A
1 -0.3 -0.30 B
2 0.0 0.04 A
3 -0.5 -0.70 B
4 0.3 0.20 A
5 -0.6 -0.00 B
Would you consider using apply with a custom function that computes the new column from the whole row? This makes the logic easier to read.
For example:
X=10
Y=5
def new_norm(row):
    #put your if/elif logic here, for example:
    if row['Column C'] == 'A':
        return row['Column A']/X #don't forget to return value for new column
    ....
df['newcol'] = df.apply(new_norm, axis=1) #call function for each row and add column 'newcol'
A function also lets you handle edge cases (for example an empty Column C, or a value other than A or B).
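As a rough, self-contained sketch of that row-wise idea: the factors, the sample rows, and the choice to return NaN for unexpected categories are all assumptions made for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical factors and sample data, chosen only for illustration.
X = 10
Y = 5
df = pd.DataFrame({
    "Column A": [2, 3, 0],
    "Column B": [4, 3, 0.4],
    "Column C": ["A", "B", "Z"],  # "Z" stands in for an unexpected value
})

def new_norm(row):
    """Divide Column A by X or Y depending on Column C; NaN otherwise."""
    if row["Column C"] == "A":
        return row["Column A"] / X
    if row["Column C"] == "B":
        return row["Column A"] / Y
    return np.nan  # assumption: unknown categories become missing values

# axis=1 passes one row at a time, so each comparison yields a plain bool.
df["newcol"] = df.apply(new_norm, axis=1)
```

Because each row is handled separately, the if statements compare single values and the ambiguous truth value error does not occur. The trade-off is speed: on large frames a vectorized approach is usually faster.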
I want to populate 10 columns with the numbers 1-16 depending on the values in 2 other columns. I can start by providing the column header or create new columns (does not matter to me).
I tried to create a function that iterates over the numbers 1-10 and then assigns a value to the z variable depending on the values of b and y.
Then I want to apply this function to each row in my dataframe.
import pandas as pd
import numpy as np
data = pd.read_csv('Nuc.csv')
def write_Pcolumns(df):
    """populates a column in the given dataframe, df, based on the values in two other columns in the same dataframe"""
    #create string of numbers for each nucleotide position
    positions = ('1','2','3','4','5','6','7','8','9','10')
    a = "Po "
    x = "O.Po "
    #for each position create a variable for the nucleotide in the sequence (Po) and opposite to the sequence(o. Po)
    for each in positions:
        b = a + each
        y = x + each
        z = 'P' + each
        #assign a value to z based on the nucleotide identities in the sequence and opposite position
        if df[b] == 'A' and df[y]=='A':
            df[z]==1
        elif df[b] == 'A' and df[y]=='C':
            df[z]==2
        elif df[b] == 'A' and df[y]=='G':
            df[z]==3
        elif df[b] == 'A' and df[y]=='T':
            df[z]==4
        ...
        elif df[b] == 'T' and df[y]=='G':
            df[z]==15
        else:
            df[z]==16
    return(df)
data.apply(write_Pcolumns(data), axis=1)
I get the following error message:
The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
This happens because df[index]=='value' returns a Series of booleans (one per row), not a single boolean, so it cannot be used directly as the condition of an if statement.
See also the question "Pandas error when using if-else to create new column: The truth value of a Series is ambiguous".
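One way to avoid the if/elif chain entirely is a lookup table applied to whole columns. The sketch below assumes the pairs are numbered in A, C, G, T order (which matches the cases visible in the question: A/A is 1, A/T is 4, T/G is 15) and that anything unrecognised falls back to 16, like the else branch. The sample data and the pair_code name are made up:

```python
import itertools
import pandas as pd

# Assumption: codes run 1..16 over (sequence base, opposite base) in ACGT order.
bases = "ACGT"
pair_code = {
    first + second: code
    for code, (first, second) in enumerate(itertools.product(bases, repeat=2), start=1)
}

# Hypothetical stand-in for Nuc.csv with only two positions.
data = pd.DataFrame({
    "Po 1": ["A", "A", "T"],
    "O.Po 1": ["A", "C", "G"],
    "Po 2": ["T", "C", "N"],
    "O.Po 2": ["T", "A", "A"],
})

for n in ("1", "2"):
    # Concatenate the two letters per row, then look the pair up in the table.
    pairs = data["Po " + n] + data["O.Po " + n]
    # Unknown pairs map to NaN; send them to 16 as the else branch does.
    data["P" + n] = pairs.map(pair_code).fillna(16).astype(int)
```

For the real file the loop would run over all ten positions instead of two.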
Consider a pandas dataframe that has values such as 'a - b'. I would like to check for the occurrence of '-' anywhere across all values of the dataframe without looping through individual columns. Clearly a check such as the following won't work:
if '-' in df.values
Any suggestions on how to check for this? Thanks.
I'd use stack() + .str.contains() in this case:
In [10]: df
Out[10]:
a b c
0 1 a - b w
1 2 c z
2 3 d 2 - 3
In [11]: df.stack().str.contains('-').any()
Out[11]: True
In [12]: df.stack().str.contains('-')
Out[12]:
0 a NaN
b True
c False
1 a NaN
b False
c False
2 a NaN
b False
c True
dtype: object
You can use replace to swap a regex match with something else, then check for equality:
df.replace('.*-.*', True, regex=True).eq(True)
One way is to flatten the values and use a list comprehension.
df = pd.DataFrame([['val1','a-b', 'val3'],['val4','3', 'val5']],columns=['col1','col2', 'col3'])
print(df)
Output:
col1 col2 col3
0 val1 a-b val3
1 val4 3 val5
Now, to search for -:
find_value = [val for val in df.values.flatten() if '-' in val]
print(find_value)
Output:
['a-b']
Using NumPy: np.core.defchararray.find(a, s) returns, for each element of a, the lowest index at which the substring s appears;
if it is not present, -1 is returned. (The same function is available under the public name np.char.find.)
(np.core.defchararray.find(df.values.astype(str),'-') > -1).any()
returns True if '-' is present anywhere in df.
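A small self-contained version of the NumPy approach, written with the public np.char.find name and the example frame from the stack() answer. The dash_columns line is an extra added for illustration, to show which columns contain a match:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 3],
    "b": ["a - b", "c", "d"],
    "c": ["w", "z", "2 - 3"],
})

# Convert every cell to str so that numeric columns can be searched too.
cells = df.to_numpy().astype(str)

# find() gives the position of '-' in each cell, or -1 when it is absent.
hits = np.char.find(cells, "-") >= 0

has_dash = bool(hits.any())                        # anywhere in the frame?
dash_columns = df.columns[hits.any(axis=0)].tolist()  # which columns?
```

Note that after the str conversion a negative number such as -1 would also count as containing '-'.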
The Problem:
I have a list of pandas.Series, where the series all have dates as index, but it is not guaranteed that they all have the same index.
The values are guaranteed to be bools (no NaNs possible).
The result I want to get is one pandas.Series whose index is the union of all indices found in the list of series. The value for each index should be the logical and of the values from all series that contain that index.
Example:
A = pd.Series(index=[datetime(2015, 5, 1, 20),
                     datetime(2015, 5, 1, 20, 15),
                     datetime(2015, 5, 1, 20, 30)],
              data=[False, True, True])
B = pd.Series(index=[datetime(2015, 5, 1, 20),
                     datetime(2015, 5, 1, 20, 30),
                     datetime(2015, 5, 1, 20, 45)],
              data=[True, True, True])
series = [A, B]
A common index is datetime(2015, 5, 1, 20); the result at this index should be False and True, i.e. False.
An uncommon index is datetime(2015, 5, 1, 20, 45); it is only found in series B. The expected result is the value of B at this index, i.e. True.
The desired result in total looks like this:
result = pd.Series(index=[datetime(2015, 5, 1, 20),
                          datetime(2015, 5, 1, 20, 15),
                          datetime(2015, 5, 1, 20, 30),
                          datetime(2015, 5, 1, 20, 45)],
                   data=[False, True, True, True])
My Approach:
I came up with a good start (I think) but cannot find the correct operation. It currently looks like this:
result = None
for next in series:
    if result is None:
        result = next
    else:
        result = result.reindex(index=result.index | next.index)
        # the next line sadly returns: ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
        result.loc[next.index] = result.loc[next.index] and next.loc[next.index]
I should have dug a little further before asking. I found a solution which works for me and looks like the pandas way of doing it, but I am happy to stand corrected if an even more convenient way is presented!
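result = None
for next in series:
    if result is None:
        result = next
    else:
        # .union() is the explicit set union; newer pandas no longer treats | on an Index as a set operation
        index = result.index.union(next.index)
        result = result.reindex(index, fill_value=True) & next.reindex(index, fill_value=True)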
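The same loop can be written as a fold over the list with functools.reduce. This is just a sketch of the idea above, not a different algorithm; the helper name and_on_union is made up. Filling missing positions with True works because True is the neutral element of a logical and:

```python
from datetime import datetime
from functools import reduce

import pandas as pd

A = pd.Series([False, True, True],
              index=[datetime(2015, 5, 1, 20),
                     datetime(2015, 5, 1, 20, 15),
                     datetime(2015, 5, 1, 20, 30)])
B = pd.Series([True, True, True],
              index=[datetime(2015, 5, 1, 20),
                     datetime(2015, 5, 1, 20, 30),
                     datetime(2015, 5, 1, 20, 45)])

def and_on_union(left, right):
    """Logical and of two boolean Series over the union of their indices."""
    index = left.index.union(right.index)
    # Positions missing from one side are filled with True, so they
    # do not change the outcome of the and.
    return left.reindex(index, fill_value=True) & right.reindex(index, fill_value=True)

result = reduce(and_on_union, [A, B])
```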
If I understand what you want, I'd concat the 2 series column-wise and then call a function row-wise that drops the NaN values and returns the logical and of the 2 columns or the lone column value:
In [231]:
df = pd.concat([A,B], axis=1)
def func(x):
    l = x.dropna()
    if len(l) > 1:
        return l[0]&l[1]
    return l.values[0]
df['result'] = df.apply(func, axis=1)
df
Out[231]:
0 1 result
2015-05-01 20:00:00 False True False
2015-05-01 20:15:00 True NaN True
2015-05-01 20:30:00 True True True
2015-05-01 20:45:00 NaN True True
How would I perform a case insensitive pandas.concat?
df1 = pd.DataFrame({"a":[1,2,3]},index=["a","b","c"])
df2 = pd.DataFrame({"b":[1,2,3]},index=["a","b","c"])
df1a = pd.DataFrame({"A":[1,2,3]},index=["A","B","C"])
pd.concat([df1, df2],axis=1)
a b
a 1 1
b 2 2
c 3 3
but this does not work:
pd.concat([df1, df1a],axis=1)
a A
A NaN 1
B NaN 2
C NaN 3
a 1 NaN
b 2 NaN
c 3 NaN
Is there an easy way to do this?
I have the same question for concat on a Series.
This works for a DataFrame:
pd.DataFrame([11,21,31],index=pd.MultiIndex.from_tuples([("A",x) for x in ["a","B","c"]])).rename(str.lower)
but this does not work for a Series:
pd.Series([11,21,31],index=pd.MultiIndex.from_tuples([("A",x) for x in ["a","B","c"]])).rename(str.lower)
TypeError: descriptor 'lower' requires a 'str' object but received a 'tuple'
For renaming, DataFrames use:
def rename_axis(self, mapper, axis=1):
    index = self.axes[axis]
    if isinstance(index, MultiIndex):
        new_axis = MultiIndex.from_tuples([tuple(mapper(y) for y in x) for x in index], names=index.names)
    else:
        new_axis = Index([mapper(x) for x in index], name=index.name)
whereas when renaming Series:
result.index = Index([mapper_f(x) for x in self.index], name=self.index.name)
So my updated question is: how do I perform the rename / case-insensitive concat with a Series?
You can do this via rename:
pd.concat([df1, df1a.rename(index=str.lower)], axis=1)
EDIT:
If you want to do this with a MultiIndexed Series you'll need to set the index manually, for now. There's a bug report on the pandas GitHub repo waiting to be fixed (thanks #ViktorKerkez).
s.index = pd.MultiIndex.from_tuples(s.index.map(lambda x: tuple(map(str.lower, x))))
You can replace str.lower with whatever function you want to use to rename your index.
Note that you cannot use reindex in general here, because it tries to find values with the renamed index and thus it will return nan values, unless your rename results in no changes to the original index.
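A quick illustration of the difference, using a made-up single-level Series: rename relabels the existing rows, while reindex looks the new labels up in the old index and finds nothing.

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=["A", "B", "C"])

# rename with a callable maps each label and keeps the values in place.
renamed = s.rename(str.lower)

# reindex searches for 'a', 'b', 'c' in the old index; none exist,
# so every value comes back as NaN.
reindexed = s.reindex(s.index.map(str.lower))
```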
For the MultiIndexed Series objects, if this is not a bug, you can do:
s.index = pd.MultiIndex.from_tuples(
    s.index.map(lambda x: tuple(map(str.lower, x)))
)