Select subset of rows of a dataframe using multiple conditions

I would like to select a subset of a dataframe that satisfies multiple conditions on multiple columns. I know I could do this sequentially -- first selecting the subset that matches the first condition, then the portion of those that match the second, etc. -- but it seems like it should be possible in a single step. The following seems like it should work, but doesn't. Apparently it does work like this in other languages' implementations of DataFrame. Any thoughts?
using DataFrames
df = DataFrame()
df[:A] = [1, 3, 4, 7, 9]
df[:B] = ["a", "c", "c", "D", "c"]
df[(df[:A] .< 5) && (df[:B] .== "c"), :]
type: non-boolean (DataArray{Bool,1}) used in boolean context
while loading In[18], in expression starting on line 5

This is a Julia thing, not so much a DataFrame thing: you want & instead of &&. For example:
julia> [true, true] && [false, true]
ERROR: TypeError: non-boolean (Array{Bool,1}) used in boolean context
julia> [true, true] & [false, true]
2-element Array{Bool,1}:
false
true
julia> df[(df[:A].<5)&(df[:B].=="c"),:]
2x2 DataFrames.DataFrame
| Row | A | B   |
|-----|---|-----|
| 1   | 3 | "c" |
| 2   | 4 | "c" |
FWIW, this works the same way in pandas in Python:
>>> df[(df.A < 5) & (df.B == "c")]
A B
1 3 c
2 4 c
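A related gotcha in both languages: the elementwise & binds more tightly than the comparison operators, so the parentheses around each condition are required. A minimal pandas sketch (my own illustration, not from the original answers) of what goes wrong without them:

import pandas as pd

df = pd.DataFrame({"A": [1, 3, 4, 7, 9], "B": ["a", "c", "c", "D", "c"]})

# Without parentheses this parses as df.A < (5 & (df.B == "c")), which
# silently compares A against 0/1 and returns the wrong (here empty) result:
df[df.A < 5 & (df.B == "c")]

# With parentheses each comparison is evaluated first, then combined:
df[(df.A < 5) & (df.B == "c")]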

I hit the same error as https://stackoverflow.com/users/5526072/jwimberley when updating from Julia 0.5 to 0.6 (now using DataFrames v0.10.1).
Update: I made the following change to fix it:
r[(r[:l] .== l) & (r[:w] .== w), :]    # julia 0.5
r[.&(r[:l] .== l, r[:w] .== w), :]     # julia 0.6
but this gets very slow with long chains of conditions (time taken ∝ 2^number of conditions), so maybe Query is the better way now:
# r is a dataframe
using Query
q1 = @from i in r begin
    @where i.l == l && i.w == w && i.nl == nl && i.lt == lt &&
           i.vz == vz && i.vw == vw && i.vδ == vδ &&
           i.ζx == ζx && i.ζy == ζy && i.ζδx == ζδx
    @select {absu=i.absu, i.dBU}
    @collect DataFrame
end
for example. This is fast, and it's covered in the DataFrames documentation.

Related

Comparison of values in DataFrames with different sizes

I have a DataFrame in which I want to compare the speed of certain IDs at different conditions.
Boundary conditions:
IDs do not have to be represented in every condition,
an ID is not represented in every condition with the same frequency.
My goal is to assign whether the speed remained
larger (speed > speed in Cond_A + 10%),
smaller (speed < speed in Cond_A - 10%), or
the same (Cond_A - 10% < speed < Cond_A + 10%)
depending on the condition.
The data
import numpy as np
import pandas as pd

data1 = {
    'ID': [1, 1, 1, 2, 3, 3, 4, 5],
    'Condition': ['Cond_A', 'Cond_A', 'Cond_A', 'Cond_A', 'Cond_A', 'Cond_A', 'Cond_A', 'Cond_A'],
    'Speed': [1.2, 1.05, 1.2, 1.3, 1.0, 0.85, 1.1, 0.85],
}
df1 = pd.DataFrame(data1)

data2 = {
    'ID': [1, 2, 3, 4, 5, 6],
    'Condition': ['Cond_B', 'Cond_B', 'Cond_B', 'Cond_B', 'Cond_B', 'Cond_B'],
    'Speed': [0.8, 0.55, 0.7, 1.15, 1.2, 1.4],
}
df2 = pd.DataFrame(data2)

data3 = {
    'ID': [1, 2, 3, 4, 6],
    'Condition': ['Cond_C', 'Cond_C', 'Cond_C', 'Cond_C', 'Cond_C'],
    'Speed': [1.8, 0.99, 1.7, 131, 0.2],
}
df3 = pd.DataFrame(data3)

lst_of_dfs = [df1, df2, df3]
# creating a DataFrame object
data = pd.concat(lst_of_dfs)
My goal is to achieve a result like this:
Condition ID Speed Category
0 Cond_A 1 1.150 NaN
1 Cond_A 2 1.300 NaN
2 Cond_A 3 0.925 NaN
3 Cond_A 4 1.100 NaN
4 Cond_A 5 0.850 NaN
5 Cond_B 1 0.800 slower
6 Cond_B 2 0.550 slower
7 Cond_B 3 0.700 slower
8 Cond_B 4 1.150 equal
...
My attempt:
Calculate average of speed for each ID per condition
data = data.groupby(["Condition", "ID"]).mean()["Speed"].reset_index()
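(For reference, a sketch of what this step produces with the sample data above; this check is mine, not part of the original question:)

# After the groupby, `data` is a 16-row frame of per-(Condition, ID) mean speeds:
#   Condition  ID  Speed
#   Cond_A      1  1.150
#   Cond_A      3  0.925
#   Cond_B      1  0.800
#   ...
# This 16 is where "length of index (16)" in the error further down comes from.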
Definition of thresholds, assuming I want thresholds of up to 10 percent around the Cond_A values:
threshold_upper = data.loc[(data.Condition == 'CondA')]['Speed'] + (data.loc[(data.Condition == 'CondA')]['Speed']*10/100)
threshold_lower = data.loc[(data.Condition == 'CondA')]['Speed'] - (data.loc[(data.Condition == 'CondA')]['Speed']*10/100)
Mapping strings 'faster', 'equal', 'slower' based on condition using numpy select.
conditions = [
    # is the Speed of each ID in CondB faster than the Speed in CondA + 10%?
    (data.loc[(data.Condition == 'CondB')]['Speed'] > threshold_upper),
    # is the Speed of each ID in CondC faster than the Speed in CondA + 10%?
    (data.loc[(data.Condition == 'CondC')]['Speed'] > threshold_upper),
    # is the Speed of each ID in CondB within CondA +- 10%?
    ((data.loc[(data.Condition == 'CondB')]['Speed'] < threshold_upper) & (data.loc[(data.Condition == 'CondB')]['Speed'] > threshold_lower)),
    # is the Speed of each ID in CondC within CondA +- 10%?
    ((data.loc[(data.Condition == 'CondC')]['Speed'] < threshold_upper) & (data.loc[(data.Condition == 'CondC')]['Speed'] > threshold_lower)),
    # is the Speed of each ID in CondB slower than the Speed in CondA - 10%?
    (data.loc[(data.Condition == 'CondB')]['Speed'] < threshold_upper),
    # is the Speed of each ID in CondC slower than the Speed in CondA - 10%?
    (data.loc[(data.Condition == 'CondC')]['Speed'] < threshold_upper),
]
values = [
    'faster',
    'faster',
    'equal',
    'equal',
    'slower',
    'slower',
]
data['Category'] = np.select(conditions, values)
Produces this error: <ValueError: Length of values (0) does not match length of index (16)>
My data frames unfortunately have different lengths (since not all IDs performed trials in every condition). I appreciate any hint. Many thanks in advance.
# Dataframe created
data
ID Condition Speed
0 1 Cond_A 1.20
1 1 Cond_A 1.05
2 1 Cond_A 1.20
# Reset the index
data = data.reset_index(drop=True)
# Creating a group number based on ID
data['group'] = data.groupby(['ID']).ngroup()
# Functions that return the lower and upper speed limits (Cond_A mean +- 10%)
def lowlimit(x):
    return x[x['Condition'] == 'Cond_A'].Speed.mean() * 0.9
def upperlimit(x):
    return x[x['Condition'] == 'Cond_A'].Speed.mean() * 1.1
# Calculate the upperlimit and lowerlimit for the groups
df = pd.DataFrame()
df['ul'] = data.groupby('group').apply(lambda x: upperlimit(x))
df['ll'] = data.groupby('group').apply(lambda x: lowlimit(x))
# resetting the index
# so that we can merge on the 'group' column
df = df.reset_index()
# Merging the data and df dataframe
data_new = pd.merge(data,df,on='group',how='left')
data_new
ID Condition Speed group ul ll
0 1 Cond_A 1.20 0 1.2650 1.0350
1 1 Cond_A 1.05 0 1.2650 1.0350
2 1 Cond_A 1.20 0 1.2650 1.0350
3 2 Cond_A 1.30 1 1.4300 1.1700
Now we have to apply the conditions:
data_new.loc[(data_new['Speed'] >= data_new['ul']) & (data_new['Condition'] != 'Cond_A'),'Category'] = 'larger'
data_new.loc[(data_new['Speed'] <= data_new['ll']) & (data_new['Condition'] != 'Cond_A'),'Category'] = 'smaller'
data_new.loc[(data_new['Speed'] < data_new['ul']) & (data_new['Speed'] > data_new['ll']) & (data_new['Condition'] != 'Cond_A'),'Category'] = 'Same'
That gives each non-Cond_A row its Category. You can drop the helper columns now, if you want:
data_new = data_new.drop(columns=['group','ul','ll'])
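For comparison, a more compact variant (my own sketch, not part of the original answer): map each row's ID to its Cond_A mean speed, then categorize with np.select.

import numpy as np

# Per-ID baseline: mean Cond_A speed
cond_a_mean = data.loc[data.Condition == 'Cond_A'].groupby('ID')['Speed'].mean()
base = data['ID'].map(cond_a_mean)

# Only categorize non-baseline rows that actually have a Cond_A baseline
# (e.g. ID 6 never ran Cond_A, so it stays uncategorized)
m = (data['Condition'] != 'Cond_A') & base.notna()

data['Category'] = np.select(
    [m & (data['Speed'] > base * 1.1),   # above baseline + 10%
     m & (data['Speed'] < base * 0.9),   # below baseline - 10%
     m],                                 # otherwise within +-10%
    ['larger', 'smaller', 'same'],
    default=None,
)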

Build a decision Column by ANDing multiple columns in pandas

I have a pandas data frame which is shown below:
>>> x = [[1,2,3,4,5],[1,2,4,4,3],[2,4,5,6,7]]
>>> columns = ['a','b','c','d','e']
>>> df = pd.DataFrame(data = x, columns = columns)
>>> df
a b c d e
0 1 2 3 4 5
1 1 2 4 4 3
2 2 4 5 6 7
I have an array of objects (conditions) as shown below:
[
    {
        'header': 'a',
        'condition': '==',
        'values': [1]
    },
    {
        'header': 'b',
        'condition': '==',
        'values': [2]
    },
    ...
]
and an assignHeader, which is:
assignHeader = 'decision'
Now I want to build up all the conditions from the conditions array by looping through it, for example something like this:
pConditions = []
for eachCondition in conditions:
    header = eachCondition['header']
    values = eachCondition['values']
    if eachCondition['condition'] == "==":
        pConditions.append(df[header].isin(values))
    else:
        pConditions.append(~df[header].isin(values))

df[assignHeader] = and(pConditions)  # pseudocode: AND all the masks together
I was thinking of using the all operator in pandas but am unable to crack the right syntax. The conditions list can grow large and dynamic, so I want to keep this generic approach and check for equality. Does anyone know a way to do so?
Final Output:
conditions = [df['a'] == 1, df['b'] == 2]
>>> df['decision'] = (df['a']==1) & (df['b']==2)
>>> df
a b c d e decision
0 1 2 3 4 5 True
1 1 2 4 4 3 True
2 2 4 5 6 7 False
Here the conditions array is variable, and I want a function which takes df, newheadername and conditions as input and returns the output shown above, where newheadername = 'decision'.
I was able to solve the problem using the code shown below. I am not sure whether this is a fast way of getting things done, but I would love to know your inputs in case you have anything specific to point out.
def andMerging(conditions, mergeHeader, df):
    if len(conditions) != 0:
        df[mergeHeader] = pd.concat(conditions, axis=1).all(axis=1)
    return df
where conditions is a list of boolean pd.Series, built as shown below:
def prepareForConditionMerging(conditionsArray, df):
    conditions = []
    for prop in conditionsArray:
        condition = prop['condition']
        values = prop['values']
        header = prop['header']
        if type(values) == str:
            values = [values]
        if condition == "==":
            conditions.append(df[header].isin(values))
        else:
            conditions.append(~df[header].isin(values))
        # More conditions such as greater-than or less-than can be added here.
    return conditions
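For what it's worth, a quick end-to-end usage sketch, plus an equivalent way to AND the masks with functools.reduce instead of pd.concat (my own variant, not from the answer above):

from functools import reduce
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2], 'b': [2, 2, 4], 'c': [3, 4, 5]})
conds = prepareForConditionMerging(
    [{'header': 'a', 'condition': '==', 'values': [1]},
     {'header': 'b', 'condition': '==', 'values': [2]}], df)

# Option 1: the concat/all approach from above
df = andMerging(conds, 'decision', df)

# Option 2: fold the boolean Series together with elementwise &
df['decision2'] = reduce(lambda acc, m: acc & m, conds)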

Keep rows of dataframe if multiple conditions met

Given the following dataframe:
df = pd.DataFrame({'A': ["EQ", "CB", "CB", "FF", "EQ", "EQ", "CB", "CB"],
                   'B': ["ANT", "ANT", "DQ", "DQ", "BQ", "VGQ", "GHB", "VGQ"]})
How can I keep the rows of column B whose value exists for both EQ and CB in column A? For example, I would want to keep ANT because it exists for both EQ and CB, while DQ would be deleted. So the expected output for the df would be:
out = pd.DataFrame({'A': ["EQ", "CB", "EQ", "CB"],
                    'B': ["ANT", "ANT", "VGQ", "VGQ"]})
Thanks!
Let us try filter
s=df.groupby('B').filter(lambda x : pd.Series(['EQ','CB']).isin(x['A']).all())
Out[7]:
A B
0 EQ ANT
1 CB ANT
5 EQ VGQ
7 CB VGQ
Then
s=s[s.A.isin(['EQ','CB'])]
Here is a solution not using groupby() if you want code that may be easier to think about:
equities = df.B[df.A == 'EQ']
bonds = df.B[df.A == 'CB']
both = equities[equities.isin(bonds)]
That gives you:
0 ANT
5 VGQ
Which makes the last part easy:
df[df.B.isin(both)]
Out:
A B
0 EQ ANT
1 CB ANT
5 EQ VGQ
7 CB VGQ
This is 3x faster on small data sets than groupby().filter().
Another way uses transform and slicing
m = df.groupby('B').A.transform(lambda x: (x.nunique() >= 2)
                                          & (x.isin(['EQ', 'CB']).sum() >= 2))
df_final = df[m]
Out[623]:
A B
0 EQ ANT
1 CB ANT
5 EQ VGQ
7 CB VGQ
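Another option (my own sketch, not from the answers above): compute the set of B values seen with 'EQ' and the set seen with 'CB', then keep rows whose B is in both sets.

# B values that occur together with 'EQ' and with 'CB'
eq = set(df.loc[df.A == 'EQ', 'B'])
cb = set(df.loc[df.A == 'CB', 'B'])

# keep only rows whose B appears under both
out = df[df.B.isin(eq & cb)]

If rows with other A values should also be dropped, add df.A.isin(['EQ', 'CB']) to the mask, as the first answer does.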

My .loc with multiple conditions keeps running...help me land the plane

When I try to run the code below, it just keeps running. Is it something obvious?
df.loc[(df['Target_Group'] == 'x') & (df['Period'].dt.year == df['Year_Performed'].dt.year), ['Target_P']] = \
    df.loc[(df['Target_Group'] == 'x') & (df['Period'].dt.year == df['Year_Performed'].dt.year), ['y']]
I think you need to assign the condition to a variable and then reuse it:
m = (df['Target_Group'] == 'x') & (df['Period'].dt.year == df['Year_Performed'].dt.year)
df.loc[m, 'Target_P'] = df.loc[m, 'y']
To improve performance it is possible to use numpy.where:
df['Target_P'] = np.where(m, df['y'], df['Target_P'])
pandas is index-sensitive, so you do not need to repeat the condition for the assignment:
cond=(df['Target_Group'] == 'x') & (df['Period'].dt.year == df['Year_Performed'].dt.year)
df.loc[cond, 'Target_P'] = df.y
More info, with an example:
df = pd.DataFrame({'cond': [1, 2], 'v1': [-110, -11], 'v2': [9999, 999999]})
df.loc[df.cond == 1, 'v1'] = df.v2
df
Out[200]:
cond v1 v2
0 1 9999 9999
1 2 -11 999999
If the index contains duplicates:
df.loc[cond, 'Target_P'] = df.loc[cond,'y'].values
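A self-contained sketch of the mask-reuse pattern (the column names come from the question; the sample data here is made up for illustration):

import pandas as pd

df = pd.DataFrame({
    'Target_Group': ['x', 'x', 'z'],
    'Period': pd.to_datetime(['2020-01-01', '2021-01-01', '2020-01-01']),
    'Year_Performed': pd.to_datetime(['2020-06-01', '2020-06-01', '2020-06-01']),
    'y': [1.0, 2.0, 3.0],
    'Target_P': [0.0, 0.0, 0.0],
})

# build the mask once, reuse it on both sides of the assignment
m = (df['Target_Group'] == 'x') & (df['Period'].dt.year == df['Year_Performed'].dt.year)
df.loc[m, 'Target_P'] = df.loc[m, 'y']   # only the first row matches and is updated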

Fill pandas fields with tuples as elements by slicing

Sorry if this question has been asked before, but I did not find it here or elsewhere:
I want to fill some of the fields of a column with tuples. Currently I would have to resort to:
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3, 4]})
df['b'] = ''
df['b'] = df['b'].astype(object)
mytuple = ('x', 'y')
for l in df[df.a % 2 == 0].index:
    df.set_value(l, 'b', mytuple)  # note: set_value is deprecated; modern pandas uses df.at[l, 'b'] = mytuple
with df being (which is what I want)
a b
0 1
1 2 (x, y)
2 3
3 4 (x, y)
This does not look very elegant to me and probably not very efficient. Instead of the loop, I would prefer something like
df.loc[df.a % 2 == 0, 'b'] = np.array([mytuple] * sum(df.a % 2 == 0), dtype=tuple)
which (of course) does not work. How can I improve my above method by using slicing?
In [57]: df.loc[df.a % 2 == 0, 'b'] = pd.Series([mytuple] * len(df.loc[df.a % 2 == 0])).values
In [58]: df
Out[58]:
a b
0 1
1 2 (x, y)
2 3
3 4 (x, y)
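The slicing attempt fails because numpy expands the list of tuples into a 2-D array before assignment; wrapping it in a pd.Series (as above) keeps the tuples intact. Another simple alternative (my own sketch, not from the answer above): build the whole column at once with a list comprehension.

# assign a tuple to even rows, '' elsewhere, in a single pass
df['b'] = [mytuple if a % 2 == 0 else '' for a in df['a']]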