Create a new column with IF-THEN in a grouped pandas df

I'm applying a simple function to a grouped pandas df. Below is what I'm trying. Even if I modify the function to carry out just one step, I keep getting the same error. Any direction will be super helpful.
def udf_pd(df_group):
    if (df_group['A'] - df_group['B']) > 1:
        df_group['D'] = 'Condition-1'
    elif df_group.A == df_group.C:
        df_group['D'] = 'Condition-2'
    else:
        df_group['D'] = 'Condition-3'
    return df_group

final_df = df.groupby(['id1','id2']).apply(udf_pd)
final_df = final_df.reset_index()
ValueError: The truth value of a Series is ambiguous. Use a.empty,
a.bool(), a.item(), a.any() or a.all().

Note that in groupby.apply the function is applied to the whole group.
On the other hand, each if condition must boil down to a single value
(not to any Series of True/False values).
So each comparison of 2 columns in this function must be supplemented with
e.g. all() or any(), like in the example below:
def udf_pd(df_group):
    if (df_group.A - df_group.B > 1).all():
        df_group['D'] = 'Condition-1'
    elif (df_group.A == df_group.C).all():
        df_group['D'] = 'Condition-2'
    else:
        df_group['D'] = 'Condition-3'
    return df_group
Of course, the function can return the whole group, e.g. "extended"
with a new column; in that case the single value assigned to the new column
is broadcast, so each row in the current group receives this value.
I created a test DataFrame:
id1 id2 A B C
0 1 1 5 3 0
1 1 1 7 5 4
2 1 2 3 4 3
3 1 2 4 5 4
4 2 1 2 4 3
5 2 1 4 5 4
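For reference, this test frame can be rebuilt directly from the values shown above (a minimal sketch):

import pandas as pd

df = pd.DataFrame({
    'id1': [1, 1, 1, 1, 2, 2],
    'id2': [1, 1, 2, 2, 1, 1],
    'A':   [5, 7, 3, 4, 2, 4],
    'B':   [3, 5, 4, 5, 4, 5],
    'C':   [0, 4, 3, 4, 3, 4],
})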
In this example:
In the first group (id1 == 1, id2 == 1), in all rows, A - B > 1,
so Condition-1 is True.
In the second group (id1 == 1, id2 == 2), the above condition is
not met, but in all rows, A == C, so Condition-2 is True.
In the last group (id1 == 2, id2 == 1), neither of the above
conditions is met, so Condition-3 is True.
Hence the result of df.groupby(['id1','id2']).apply(udf_pd) is:
id1 id2 A B C D
0 1 1 5 3 0 Condition-1
1 1 1 7 5 4 Condition-1
2 1 2 3 4 3 Condition-2
3 1 2 4 5 4 Condition-2
4 2 1 2 4 3 Condition-3
5 2 1 4 5 4 Condition-3

I've encountered this error before, and my understanding is that pandas isn't sure which value it's supposed to run the conditional against. You're probably going to want to use .any() or .all(). Consider these examples:
>>> a = pd.Series([0,0,3])
>>> b = pd.Series([1,1,1])
>>> a - b
0 -1
1 -1
2 2
dtype: int64
>>> (a - b) >= 1
0 False
1 False
2 True
dtype: bool
You can see that the truthiness of (a - b) >= 1 is ambiguous: the first two elements in the vector are False while the last is True.
Using .any() or .all() will evaluate the entire series.
>>> ((a - b) >= 1).any()
True
>>> ((a - b) >= 1).all()
False
.any() checks whether any of the elements in the series are True, while .all() checks whether all of the elements are True, which in this example they are not.
You can also check out this post for more information: Pandas Boolean .any() .all()

Related

Filter rows from subsets of a Pandas DataFrame efficiently

I have a DataFrame consisting of medical data where the columns are ["Patient_ID", "Code", "Date"], where "Code" just represents some medical interaction that patient "Patient_ID" had on "Date". Any patient will generally have more than one row, since they have more than one interaction. I want to apply two types of filtering to this data.
Remove any patients who have fewer than some min_len interactions.
To each patient, apply a half-overlapping, sliding window of length T days. Within each window keep only the first of any duplicate codes, and then shuffle the codes within the window.
So I need to modify subsets of the overall dataframe, but the modification involves changing the size of the subset. I have both of these implemented as part of a larger pipeline; however, they are a significant bottleneck in terms of time. I'm wondering if there's a more efficient way to achieve the same thing, as I really just threw together what worked and I'm not too familiar with the efficiency of pandas operations. Here is how I have them currently:
import datetime as dt

import numpy as np
import pandas as pd
from tqdm import tqdm

def Filter_by_length(df, min_len = 1):
    print("Filtering short sequences...")
    df = df.sort_values(axis = 0, by = ['ID', 'DATE']).copy(deep = True)
    new_df = []
    for sub_df in tqdm((df[df.ID == sub] for sub in df.ID.unique()), total = len(df.ID.unique()), miniters = 1):
        if len(sub_df) >= min_len:
            new_df.append(sub_df.copy(deep = True))
    if len(new_df) != 0:
        df = pd.concat(new_df, sort = False)
    else:
        df = pd.DataFrame({})
    print("Done")
    return df

def shuffle_col(df, col):
    df[col] = np.random.permutation(df[col])
    return df

def Filter_by_redundancy(df, T, min_len = 1):
    print("Filtering redundant concepts and short sequences...")
    df = df.sort_values(axis = 0, by = ['ID', 'DATE']).copy(deep = True)
    new_df = []
    for sub_df in tqdm((df[df.ID == sub] for sub in df.ID.unique()), total = len(df.ID.unique()), miniters = 1):
        start_date = sub_df.DATE.min()
        end_date = sub_df.DATE.max()
        next_date = start_date + dt.timedelta(days = T)
        while start_date <= end_date:
            sub_df = pd.concat([sub_df[sub_df.DATE < start_date],
                                shuffle_col(sub_df[(sub_df.DATE <= next_date) & (sub_df.DATE >= start_date)]
                                            .drop_duplicates(subset = ['CODE']), "CODE"),
                                sub_df[sub_df.DATE > next_date]], sort = False)
            start_date += dt.timedelta(days = int(T/2))
            next_date += dt.timedelta(days = int(T/2))
        if len(sub_df) >= min_len:
            new_df.append(sub_df.copy(deep = True))
    if len(new_df) != 0:
        df = pd.concat(new_df, sort = False)
    else:
        df = pd.DataFrame({})
    print("Done")
    return df
As you can see, in the second case I am actually applying both filters, because it is important to have the option to apply both together or either one on its own, but I am interested in any performance improvement that can be made to either one or both.
For the first part, instead of counting in your group-by like that, I would use this approach:
>>> d = pd.DataFrame({'id': [1, 2, 3, 4, 5], 'q': [np.random.randint(1, 15, size=np.random.randint(1, 5)) for _ in range(5)]}).explode('q')
id q
0 1 1
0 1 9
1 2 9
1 2 10
1 2 4
2 3 3
2 3 6
2 3 2
2 3 10
3 4 11
3 4 5
4 5 5
4 5 6
4 5 3
4 5 2
>>> sizes = d.groupby('id').size()
>>> d[d['id'].isin(sizes[sizes >= 3].index)] # index is list of IDs meeting criteria
id q
1 2 9
1 2 10
1 2 4
2 3 3
2 3 6
2 3 2
2 3 10
4 5 5
4 5 6
4 5 3
4 5 2
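The same length threshold can also be expressed as a one-liner with groupby().transform('size'), which skips the intermediate sizes Series (a sketch using the same toy frame and a threshold of 3):

# keep only rows whose id occurs at least 3 times
d[d.groupby('id')['id'].transform('size') >= 3]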
I'm not sure why you want to shuffle your codes within some window. To avoid an X-Y problem, what are you in fact trying to do there?

Why does .loc not always match column names?

I noticed this today and wanted to ask because I am a little confused about this.
Let's say we have two DataFrames:
df = pd.DataFrame(np.random.randint(0,9,size=(5,3)),columns = list('ABC'))
A B C
0 3 1 6
1 2 4 0
2 8 8 0
3 8 6 7
4 4 5 0
df2 = pd.DataFrame(np.random.randint(0,9,size=(5,3)),columns = list('CBA'))
C B A
0 3 5 5
1 7 4 6
2 0 7 7
3 6 6 5
4 4 0 6
If we wanted to conditionally assign new values to the first df, we could do this:
df.loc[df['A'].gt(3)] = df2
I would expect the columns to be aligned, and if there were missing columns, for the values in the first df to be populated with NaN. However, when the above code is run, it replaces the data without taking the column names into account (it does take the index into account, however).
A B C
0 3 1 6
1 2 4 0
2 0 7 7
3 6 6 5
4 4 0 6
On index 2, instead of [7,7,0] we have [0,7,7].
However, if we pass the names of the columns into the loc statement, without changing the order of the columns in df2, it aligns with the columns.
df.loc[df['A'].gt(3),['A','B','C']] = df2
A B C
0 3 1 6
1 2 4 0
2 7 7 0
3 5 6 6
4 6 0 4
Why does this happen?
Interestingly, loc performs a number of optimizations to improve performance; one of those optimizations is checking the type of the index passed in.
Both Row and Column Indexes Included
When passing both a row index and a column index, the __setitem__ function:
def __setitem__(self, key, value):
    if isinstance(key, tuple):
        key = tuple(com.apply_if_callable(x, self.obj) for x in key)
    else:
        key = com.apply_if_callable(key, self.obj)
    indexer = self._get_setitem_indexer(key)
    self._has_valid_setitem_indexer(key)

    iloc = self if self.name == "iloc" else self.obj.iloc
    iloc._setitem_with_indexer(indexer, value, self.name)
interprets the key as a tuple.
key:
(0 False
1 False
2 True
3 True
4 True
Name: A, dtype: bool,
['A', 'B', 'C'])
This is then passed to _get_setitem_indexer to convert to a positional indexer from label-based:
indexer = self._get_setitem_indexer(key)
def _get_setitem_indexer(self, key):
    """
    Convert a potentially-label-based key into a positional indexer.
    """
    if self.name == "loc":
        self._ensure_listlike_indexer(key)

    if self.axis is not None:
        return self._convert_tuple(key, is_setter=True)

    ax = self.obj._get_axis(0)

    if isinstance(ax, ABCMultiIndex) and self.name != "iloc":
        with suppress(TypeError, KeyError, InvalidIndexError):
            # TypeError e.g. passed a bool
            return ax.get_loc(key)

    if isinstance(key, tuple):
        with suppress(IndexingError):
            return self._convert_tuple(key, is_setter=True)

    if isinstance(key, range):
        return list(key)

    try:
        return self._convert_to_indexer(key, axis=0, is_setter=True)
    except TypeError as e:
        # invalid indexer type vs 'other' indexing errors
        if "cannot do" in str(e):
            raise
        elif "unhashable type" in str(e):
            raise
        raise IndexingError(key) from e
This generates a tuple indexer (both rows and columns are converted):
if isinstance(key, tuple):
    with suppress(IndexingError):
        return self._convert_tuple(key, is_setter=True)
returns
(array([2, 3, 4], dtype=int64), array([0, 1, 2], dtype=int64))
Only Row Index Included
However, when only a row index is passed to loc, the indexer is not a tuple and, as such, only a single dimension is converted from label to positional; the boolean Series key falls through to:

try:
    return self._convert_to_indexer(key, axis=0, is_setter=True)
returns
[2 3 4]
For this reason, no alignment happens among the columns when only a row indexer is passed to loc, as no column labels are parsed that could be aligned.
That is why an empty slice is often used:
df.loc[df['A'].gt(3), :] = df2
This is sufficient to align the columns appropriately.
import numpy as np
import pandas as pd
np.random.seed(5)
df = pd.DataFrame(np.random.randint(0, 9, size=(5, 3)), columns=list('ABC'))
df2 = pd.DataFrame(np.random.randint(0, 9, size=(5, 3)), columns=list('CBA'))
print(df)
print(df2)
df.loc[df['A'].gt(3), :] = df2
print(df)
Example:
df:
A B C
0 3 6 6
1 0 8 4
2 7 0 0
3 7 1 5
4 7 0 1
df2:
C B A
0 4 6 2
1 1 2 7
2 0 5 0
3 0 4 4
4 3 2 4
df.loc[df['A'].gt(3), :] = df2:
A B C
0 3 6 6
1 0 8 4
2 0 5 0
3 4 4 0 # Aligned as expected
4 4 2 3

pandas split-apply-combine creates undesired MultiIndex

I am using the split-apply-combine pattern in pandas to group my df and apply a custom function to each group.
But this returns an undesired DataFrame in which the grouped column exists twice: in a MultiIndex and in the columns.
The following is a simplified example of my problem.
Say, I have this df
df = pd.DataFrame([[1,2],[3,4],[1,5]], columns=['A','B'])
A B
0 1 2
1 3 4
2 1 5
I want to group by column A and keep only those rows where B has an even value. Thus the desired df is this:
B
A
1 2
3 4
The custom function my_combine_func should do the filtering. But applying it after a groupby leads to a MultiIndex with the former index in the second level, and thus column A exists twice.
def my_combine_func(group):
    return group[group['B'] % 2 == 0]
df.groupby(['A']).apply(my_combine_func)
A B
A
1 0 1 2
3 1 3 4
How can I apply a custom group function and get the desired df?
It's easier to use apply here so you get a boolean array back:
df[df.groupby('A')['B'].apply(lambda x: x % 2 == 0)]
A B
0 1 2
1 3 4
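If you specifically want the shape shown in the question (A as the index, only column B kept), one possible sketch is to filter first and set the index afterwards; this sidesteps the groupby (and hence the MultiIndex) entirely and assumes, as in the example, that at most one even B survives per value of A:

out = df[df['B'] % 2 == 0].set_index('A')
print(out)
#    B
# A
# 1  2
# 3  4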

Pandas truth value of series ambiguous

I am trying to set one column in a dataframe in pandas based on whether another column value is in a list.
I try:
df['IND']=pd.Series(np.where(df['VALUE'] == 1 or df['VALUE'] == 4, 1,0))
But I get: Truth value of a Series is ambiguous.
What is the best way to achieve the functionality:
If VALUE is in (1,4), then IND=1, else IND=0
You need to assign the else value first and then modify it with a mask using isin:
df['IND'] = 0
df.loc[df['VALUE'].isin([1,4]), 'IND'] = 1
For multiple conditions, you can do as follows:
mask1 = df['VALUE'].isin([1,4])
mask2 = df['SUBVALUE'].isin([10,40])
df['IND'] = 0
df.loc[mask1 & mask2, 'IND'] = 1
Consider the example below:
df = pd.DataFrame({
    'VALUE': [1,1,2,2,3,3,4,4]
})
Output:
VALUE
0 1
1 1
2 2
3 2
4 3
5 3
6 4
7 4
Then,
df['IND'] = 0
df.loc[df['VALUE'].isin([1,4]), 'IND'] = 1
Output:
VALUE IND
0 1 1
1 1 1
2 2 0
3 2 0
4 3 0
5 3 0
6 4 1
7 4 1
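As an aside, the np.where approach from the question also works once the Python or is replaced by isin, because the whole condition then stays vectorized (a minimal sketch using the same frame):

import numpy as np

df['IND'] = np.where(df['VALUE'].isin([1, 4]), 1, 0)
# equivalently, the boolean mask can be cast to integers directly
df['IND'] = df['VALUE'].isin([1, 4]).astype(int)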

Pandas Series Chaining: Filter on boolean value

How can I filter a pandas series based on boolean values?
Currently I have:
s.apply(lambda x: myfunc(x, myparam)).where(lambda x: x).dropna()
What I want is to only keep entries where myfunc returns True. myfunc is a complex function using 3rd-party code and operates only on individual elements.
How can I make this more understandable?
You can understand it with the sample code given below:
import pandas as pd
data = pd.Series([1,12,15,3,5,3,6,9,10,5])
print(data)
# filter data based on a condition keep only rows which are multiple of 3
filter_cond = data.apply(lambda x:x%3==0)
print(filter_cond)
filter_data = data[filter_cond]
print(filter_data)
This code filters the series, keeping only the values that are multiples of 3. To do that, we just define the filter condition and apply it to the series. You can verify it with the output generated below.
The sample series data:
0 1
1 12
2 15
3 3
4 5
5 3
6 6
7 9
8 10
9 5
dtype: int64
The conditional filter output:
0 False
1 True
2 True
3 True
4 False
5 True
6 True
7 True
8 False
9 False
dtype: bool
The final required filter data:
1 12
2 15
3 3
5 3
6 6
7 9
dtype: int64
Hopefully this helps you understand how to apply conditional filters to series data.
Use boolean indexing:
mask = s.apply(lambda x: myfunc(x, myparam))
print (s[mask])
If the index values of the mask have also changed, filter by a 1d array instead:
#pandas 0.24+
print (s[mask.to_numpy()])
#pandas below
print (s[mask.values])
EDIT:
s = pd.Series([1,2,3])

def myfunc(x, n):
    return x > n

myparam = 1
a = s[s.apply(lambda x: myfunc(x, myparam))]
print (a)
1 2
2 3
dtype: int64
A solution with a callable is possible, but a bit overcomplicated in my opinion:
a = s.loc[lambda s: s.apply(lambda x: myfunc(x, myparam))]
print (a)
1 2
2 3
dtype: int64