dataframe column-wise comparison to another series - pandas

It seems DataFrame.le doesn't operate in a column-wise fashion.
df = DataFrame(randn(8, 12))
series = Series(rand(8))
df.le(series)
I would expect each column in df to be compared to series (12 column-vs-series comparisons in total, i.e. 12 columns * 8 rows element comparisons). But it appears each element in df is compared against every element in series, which would involve 12 (columns) * 8 (rows) * 8 (elements in series) comparisons. How can I achieve a column-by-column comparison?
Second question: once I am done with the column-wise comparison, I want to count, for each row, how many True values there are. I am currently doing astype(int32) to turn the bools into ints and then sum; does this sound reasonable?
Let me give an example for the first question to show what I mean (using a smaller example, since showing 8x12 is unwieldy):
>>> from pandas import *
>>> from numpy.random import *
>>> df = DataFrame(randn(2,5))
>>> t = DataFrame(randn(2,1))
>>> df
          0         1         2         3         4
0 -0.090283  1.656517 -0.183132  0.904454  0.157861
1  1.667520 -1.242351  0.379831  0.672118 -0.290858
>>> t
          0
0  1.291535
1  0.151702
>>> df.le(t)
       0      1      2      3      4
0   True  False  False  False  False
1  False  False  False  False  False
What I expect for df's column 1 is:
1
False
True
Because 1.656517 <= 1.291535 is False and -1.242351 <= 0.151702 is True; that is a column-wise comparison. However, the printout is False, False.

I'm not sure I understand the first part of your question, but as to the second part, you can count the Trues in a boolean DataFrame using sum:
In [11]: df.le(s).sum(axis=0)
Out[11]:
0     4
1     3
2     7
3     3
4     6
5     6
6     7
7     6
8     0
9     0
10    0
11    0
dtype: int64
Essentially le is testing for each column:
In [21]: df[0] < s
Out[21]:
0 False
1 True
2 False
3 False
4 True
5 True
6 True
7 True
dtype: bool
Which for each index is testing:
In [22]: df[0].loc[0] < s.loc[0]
Out[22]: False
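
For the column-by-column comparison the asker describes, le accepts an axis argument: axis=0 aligns the series with the DataFrame's index instead of its columns. A minimal sketch, which also folds in the row-wise True count from the second question (booleans sum directly as 0/1, so no astype(int32) step is needed):

import pandas as pd
from numpy.random import randn, rand

df = pd.DataFrame(randn(8, 12))
s = pd.Series(rand(8))

# axis=0 aligns s with df's index, so each of the 12 columns is
# compared element-wise against the 8 values of s
col_wise = df.le(s, axis=0)

# per-row count of True values
row_counts = col_wise.sum(axis=1)
print(row_counts)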

Related

Multiple conditions on pandas dataframe

I have a list of conditions to be run on the dataset to filter a huge amount of data.
df = A Huge_dataframe.
eg.
Index  D1  D2  D3     D5     D6
0       8   5   0  False   True
1      45  35   0   True  False
2      35  10   1  False   True
3      40   5   2   True  False
4      12  10   5  False  False
5      18  15  13  False   True
6      25  15   5   True  False
7      35  10  11  False   True
8      95  50   0  False  False
I have to filter the above df based on the given orders:
orders = [[A, B],[D, ~E, B], [~C, ~A], [~C, A]...]
#(where A, B, C , D, E are the conditions)
eg.
A = df['D1'].le(50)
B = df['D2'].ge(5)
C = df['D3'].ne(0)
D = df['D1'].ne(False)
E = df['D1'].ne(True)
# In the real scenario, I have 64 such conditions to be run on 5 million records.
I have to run all these conditions to get the resultant output. What is the easiest way to achieve this: running them in order using a for loop, map, or .apply?
df = df.loc[A & B]
df = df.loc[D & ~E & B]
df = df.loc[~C & ~A]
df = df.loc[~C & A]
Resultant df would be my expected output.
Here I am more interested in knowing how you would use a loop, map, or .apply to run multiple conditions that are stored in a list, not in the resultant output itself.
such as:
for i in orders:
    df = df[all(i)]  # I am not able to implement this logic for each order
You are looking for a bitwise AND of all the elements inside orders, in which case:
df = df[np.concatenate(orders).all(0)]
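
A self-contained sketch of this answer, using a subset of the question's data and condition names (the full 64-condition setup is omitted):

import numpy as np
import pandas as pd

df = pd.DataFrame({'D1': [8, 45, 35, 95], 'D2': [5, 35, 10, 50], 'D3': [0, 0, 1, 0]})

A = df['D1'].le(50)
B = df['D2'].ge(5)
C = df['D3'].ne(0)

orders = [[A, B], [~C, A]]

# np.concatenate stacks every condition into one (n_conditions, n_rows)
# array; .all(0) reduces over the conditions, keeping only the rows that
# pass all of them, which matches chaining the df.loc filters
filtered = df[np.concatenate(orders).all(0)]
print(filtered)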

Pandas Dataframe Checking Consecutive Values in a column

Have a Pandas Dataframe like below.
EventOccurrence  Month
1                    4
1                    5
1                    6
1                    9
1                   10
1                   12
I need to add an identifier column to the above pandas DataFrame such that whenever Month is consecutive three times, a value of True is filled in, else False. I explored a few options like shift and window without luck. Any pointer is appreciated.
EventOccurrence  Month  Flag
1                    4     F
1                    5     F
1                    6     T
1                    9     F
1                   10     F
1                   12     F
Thank You.
You can check whether the diff between rows is one, and the diff shifted by 1 is one as well:
df['Flag'] = (df.Month.diff() == 1) & (df.Month.diff().shift() == 1)
   EventOccurrence  Month   Flag
0                1      4  False
1                1      5  False
2                1      6   True
3                1      9  False
4                1     10  False
5                1     12  False
Note that this will also return True if the months are consecutive more than 3 times, but that behaviour wasn't specified in the question, so I'll assume it's OK.
If it needs to only flag the third one, and not for example the fourth consecutive instance, you could add a condition:
df['Flag'] = (df.Month.diff() == 1) & (df.Month.diff().shift() == 1) & (df.Month.diff().shift(2) != 1)
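
Put together as a runnable sketch, with the sample data from the question:

import pandas as pd

df = pd.DataFrame({'EventOccurrence': [1, 1, 1, 1, 1, 1],
                   'Month': [4, 5, 6, 9, 10, 12]})

diff = df.Month.diff()
# flag only the third consecutive month, not the fourth and beyond
df['Flag'] = (diff == 1) & (diff.shift() == 1) & (diff.shift(2) != 1)
print(df)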

Create a new column with IF-THEN in grouped pandas df

I'm applying a simple function to a grouped pandas df. Below is what I'm trying. Even if I modify the function to carry out just one step, I keep getting the same error. Any direction will be super helpful.
def udf_pd(df_group):
    if (df_group['A'] - df_group['B']) > 1:
        df_group['D'] = 'Condition-1'
    elif df_group.A == df_group.C:
        df_group['D'] = 'Condition-2'
    else:
        df_group['D'] = 'Condition-3'
    return df_group

final_df = df.groupby(['id1','id2']).apply(udf_pd)
final_df = final_df.reset_index()
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Note that in groupby.apply the function is applied to the whole group. On the other hand, each if condition must boil down to a single value (not to a Series of True/False values). So each comparison of two columns in this function must be supplemented with e.g. all() or any(), as in the example below:
def udf_pd(df_group):
    if (df_group.A - df_group.B > 1).all():
        df_group['D'] = 'Condition-1'
    elif (df_group.A == df_group.C).all():
        df_group['D'] = 'Condition-2'
    else:
        df_group['D'] = 'Condition-3'
    return df_group
Of course, the function can return the whole group, e.g. "extended" by a new column, and in such a case the single value of the new column is broadcast, so each row in the current group receives this value.
I created a test DataFrame:
   id1  id2  A  B  C
0    1    1  5  3  0
1    1    1  7  5  4
2    1    2  3  4  3
3    1    2  4  5  4
4    2    1  2  4  3
5    2    1  4  5  4
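
For reproducibility, a sketch that builds the frame above:

import pandas as pd

df = pd.DataFrame({'id1': [1, 1, 1, 1, 2, 2],
                   'id2': [1, 1, 2, 2, 1, 1],
                   'A': [5, 7, 3, 4, 2, 4],
                   'B': [3, 5, 4, 5, 4, 5],
                   'C': [0, 4, 3, 4, 3, 4]})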
In this example:
In the first group (id1 == 1, id2 == 1), in all rows A - B > 1, so Condition-1 is True.
In the second group (id1 == 1, id2 == 2), the above condition is not met, but in all rows A == C, so Condition-2 is True.
In the last group (id1 == 2, id2 == 1), neither of the above conditions is met, so Condition-3 is True.
Hence the result of df.groupby(['id1','id2']).apply(udf_pd) is:
   id1  id2  A  B  C            D
0    1    1  5  3  0  Condition-1
1    1    1  7  5  4  Condition-1
2    1    2  3  4  3  Condition-2
3    1    2  4  5  4  Condition-2
4    2    1  2  4  3  Condition-3
5    2    1  4  5  4  Condition-3
I've encountered this error before, and my understanding is that pandas isn't sure which value it's supposed to run the conditional against. You're probably going to want to use .any() or .all(). Consider these examples:
>>> a = pd.Series([0,0,3])
>>> b = pd.Series([1,1,1])
>>> a - b
0 -1
1 -1
2 2
dtype: int64
>>> (a - b) >= 1
0 False
1 False
2 True
dtype: bool
You can see that the truthiness of (a - b) >= 1 is ambiguous: the first two elements in the vector are False while the last is True.
Using .any() or .all() will evaluate the entire series.
>>> ((a - b) >= 1).any()
True
>>> ((a - b) >= 1).all()
False
.any() checks whether any of the elements in the series are True, while .all() checks whether all of the elements are True, which in this example they're not.
You can also check out this post for more information: Pandas Boolean .any() .all()
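
To see the error in isolation: a multi-element boolean Series used directly in an if statement raises it, a minimal sketch of what happens inside the grouped function above:

import pandas as pd

s = pd.Series([True, False])
try:
    if s:  # ambiguous: which of the two booleans should decide the branch?
        pass
except ValueError as err:
    print(err)  # The truth value of a Series is ambiguous. Use a.empty, ...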

Pandas Series Chaining: Filter on boolean value

How can I filter a pandas series based on boolean values?
Currently I have:
s.apply(lambda x: myfunc(x, myparam)).where(lambda x: x).dropna()
What I want is to keep only the entries where myfunc returns True. myfunc is a complex function using 3rd-party code and operates only on individual elements.
How can I make this more understandable?
You can understand it with the sample code given below:
import pandas as pd
data = pd.Series([1,12,15,3,5,3,6,9,10,5])
print(data)
# filter data based on a condition keep only rows which are multiple of 3
filter_cond = data.apply(lambda x:x%3==0)
print(filter_cond)
filter_data = data[filter_cond]
print(filter_data)
This code filters the series to keep the values that are multiples of 3. To do that, we just build the filter condition and apply it to the series data. You can verify it with the generated output below.
The sample series data:
0 1
1 12
2 15
3 3
4 5
5 3
6 6
7 9
8 10
9 5
dtype: int64
The conditional filter output:
0 False
1 True
2 True
3 True
4 False
5 True
6 True
7 True
8 False
9 False
dtype: bool
The final required filter data:
1 12
2 15
3 3
5 3
6 6
7 9
dtype: int64
Hopefully this helps you understand how to apply conditional filters to series data.
Use boolean indexing:
mask = s.apply(lambda x: myfunc(x, myparam))
print (s[mask])
If the index values in mask no longer align with s, filter with a 1d array instead:
# pandas 0.24+
print (s[mask.to_numpy()])
# pandas below 0.24
print (s[mask.values])
EDIT:
s = pd.Series([1,2,3])
def myfunc(x, n):
return x > n
myparam = 1
a = s[s.apply(lambda x: myfunc(x, myparam))]
print (a)
1 2
2 3
dtype: int64
A solution with a callable is possible, but a bit overcomplicated in my opinion:
a = s.loc[lambda s: s.apply(lambda x: myfunc(x, myparam))]
print (a)
1 2
2 3
dtype: int64

Pandas .replace() int by boolean [duplicate]

Some column in dataframe df, df.column, is stored as datatype int64.
The values are all 1s or 0s.
Is there a way to replace these values with boolean values?
df['column_name'] = df['column_name'].astype('bool')
For example:
import pandas as pd
import numpy as np
# randint(0, 2) draws 0 or 1; np.random.random_integers was removed in modern NumPy
df = pd.DataFrame(np.random.randint(0, 2, size=5), columns=['foo'])
print(df)
# foo
# 0 0
# 1 1
# 2 0
# 3 1
# 4 1
df['foo'] = df['foo'].astype('bool')
print(df)
yields
     foo
0  False
1   True
2  False
3   True
4   True
Given a list of column_names, you could convert multiple columns to bool dtype using:
df[column_names] = df[column_names].astype(bool)
If you don't have a list of column names, but wish to convert, say, all numeric columns, then you could use
column_names = df.select_dtypes(include=[np.number]).columns
df[column_names] = df[column_names].astype(bool)
Reference: Stack Overflow answers by unutbu (Jan 9 at 13:25) and BrenBarn (Sep 18, 2017).
I had numerical columns like age and ID which I did not want to convert to Boolean. So after identifying the numerical columns as unutbu showed us, I filtered out the columns whose maximum was more than 1.
# code as per unutbu
column_names = df.select_dtypes(include=[np.number]).columns
# get the max of each numerical column (using the awesome np.number :))
# and store the result in a temporary variable m
m = df[column_names].max().reset_index(name='max')
# a filter like BrenBarn showed in another post: keep the rows where
# max == 1, stored in a temporary variable n
n = m.loc[m['max'] == 1, 'max']
# the indexes of n point back into column_names; p holds the names of
# the 0/1 columns from the original dataframe 'df'
p = column_names[n.index]
# convert only those columns; using column_names directly instead of p
# would turn all numerical columns into Booleans
df[p] = df[p].astype(bool)
There are various ways to achieve that; below one will see several options:
Using pandas.Series.map
Using pandas.Series.astype
Using pandas.Series.replace
Using pandas.Series.apply
Using numpy.where
As OP didn't specify the dataframe, in this answer I will be using the following dataframe:
import pandas as pd
df = pd.DataFrame({'col1': [1, 0, 0, 1, 0], 'col2': [0, 0, 1, 0, 1], 'col3': [1, 1, 1, 0, 1], 'col4': [0, 0, 0, 0, 1]})
[Out]:
   col1  col2  col3  col4
0     1     0     1     0
1     0     0     1     0
2     0     1     1     0
3     1     0     0     0
4     0     1     1     1
We will consider that one wants to change to boolean only the values in col1. If one wants to transform the whole dataframe, see one of the notes below.
In the section Time Comparison one will measure the times of execution of each option.
Option 1
Using pandas.Series.map as follows
df['col1'] = df['col1'].map({1: True, 0: False})
[Out]:
    col1  col2  col3  col4
0   True     0     1     0
1  False     0     1     0
2  False     1     1     0
3   True     0     0     0
4  False     1     1     1
Option 2
Using pandas.Series.astype as follows
df['col1'] = df['col1'].astype(bool)
[Out]:
    col1  col2  col3  col4
0   True     0     1     0
1  False     0     1     0
2  False     1     1     0
3   True     0     0     0
4  False     1     1     1
Option 3
Using pandas.Series.replace, with one of the following options
# Option 3.1
df['col1'] = df['col1'].replace({1: True, 0: False})
# or
# Option 3.2
df['col1'] = df['col1'].replace([1, 0], [True, False])
[Out]:
    col1  col2  col3  col4
0   True     0     1     0
1  False     0     1     0
2  False     1     1     0
3   True     0     0     0
4  False     1     1     1
Option 4
Using pandas.Series.apply and a custom lambda function as follows
df['col1'] = df['col1'].apply(lambda x: True if x == 1 else False)
[Out]:
    col1  col2  col3  col4
0   True     0     1     0
1  False     0     1     0
2  False     1     1     0
3   True     0     0     0
4  False     1     1     1
Option 5
Using numpy.where as follows
import numpy as np
df['col1'] = np.where(df['col1'] == 1, True, False)
[Out]:
    col1  col2  col3  col4
0   True     0     1     0
1  False     0     1     0
2  False     1     1     0
3   True     0     0     0
4  False     1     1     1
Time Comparison
For this specific case one has used time.perf_counter() to measure the time of execution.
       method                    time
0    Option 1  0.00000120000913739204
1    Option 2  0.00000220000219997019
2  Option 3.1  0.00000179999915417284
3  Option 3.2  0.00000200000067707151
4    Option 4  0.00000400000135414302
5    Option 5  0.00000210000143852085
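
The exact timing harness wasn't shown; a sketch of how a single measurement might look with time.perf_counter(), assuming one timed run per option:

import time
import pandas as pd

df = pd.DataFrame({'col1': [1, 0, 0, 1, 0]})

start = time.perf_counter()
df['col1'].map({1: True, 0: False})  # Option 1
print('Option 1:', time.perf_counter() - start)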
Notes:
There are strong opinions on using .apply(), so one might want to read this.
There are additional ways to measure the time of execution. For additional ways, read this: How do I get time of a Python program's execution?
To convert the whole dataframe, one can do, for example, the following
df = df.astype(bool)
[Out]:
    col1   col2   col3   col4
0   True  False   True  False
1  False  False   True  False
2  False   True   True  False
3   True  False  False  False
4  False   True   True   True