My .loc with multiple conditions keeps running... help me land the plane - pandas

When I try to run the code below, it just keeps running. Is it something obvious?
df.loc[(df['Target_Group'] == 'x') & (df['Period'].dt.year == df['Year_Performed'].dt.year), ['Target_P']] = df.loc[(df['Target_Group'] == 'x') & (df['Period'].dt.year == df['Year_Performed'].dt.year), ['y']]

I think you need to assign the condition to a variable and then reuse it:
m = (df['Target_Group'] == 'x') & (df['Period'].dt.year == df['Year_Performed'].dt.year)
df.loc[m, 'Target_P'] = df.loc[m, 'y']
To improve performance, it is possible to use numpy.where:
df['Target_P'] = np.where(m, df['y'], df['Target_P'])
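For reference, here is a small self-contained sketch of the pattern above; the column values are made up purely to mirror the question's schema:
import numpy as np
import pandas as pd

# Hypothetical data shaped like the question's columns
df = pd.DataFrame({
    'Target_Group': ['x', 'x', 'z'],
    'Period': pd.to_datetime(['2020-01-01', '2021-01-01', '2020-01-01']),
    'Year_Performed': pd.to_datetime(['2020-06-01', '2020-06-01', '2020-06-01']),
    'y': [10, 20, 30],
    'Target_P': [0, 0, 0],
})
# Build the boolean mask once and reuse it
m = (df['Target_Group'] == 'x') & (df['Period'].dt.year == df['Year_Performed'].dt.year)
df.loc[m, 'Target_P'] = df.loc[m, 'y']                 # .loc assignment
df['Target_P'] = np.where(m, df['y'], df['Target_P'])  # or the np.where variant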

pandas is index-sensitive, so you do not need to repeat the condition in the assignment:
cond=(df['Target_Group'] == 'x') & (df['Period'].dt.year == df['Year_Performed'].dt.year)
df.loc[cond, 'Target_P'] = df.y
More info, with an example:
df = pd.DataFrame({'cond': [1, 2], 'v1': [-110, -11], 'v2': [9999, 999999]})
df.loc[df.cond == 1, 'v1'] = df.v2
df
Out[200]:
   cond    v1      v2
0     1  9999    9999
1     2   -11  999999
If the index contains duplicate values, assign the underlying array instead:
df.loc[cond, 'Target_P'] = df.loc[cond,'y'].values
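To illustrate why .values matters when the index has duplicates, here is a small sketch with made-up data (the duplicated labels are the point):
import pandas as pd

# Two rows share the index label 0, so label-based alignment of the
# right-hand side would be ambiguous
df = pd.DataFrame({'cond': [1, 1, 2], 'y': [10, 20, 30], 'Target_P': [0, 0, 0]},
                  index=[0, 0, 1])
cond = df['cond'] == 1
# Assigning the raw numpy array sidesteps index alignment entirely
df.loc[cond, 'Target_P'] = df.loc[cond, 'y'].values
print(df)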

Related

Comparison of values in DataFrames with different sizes

I have a DataFrame in which I want to compare the speed of certain IDs at different conditions.
Boundary conditions:
IDs do not have to be represented in every condition,
an ID is not necessarily represented in every condition with the same frequency.
My goal is to assign whether the speed remained
larger (speed > speed in Cond_A + 10%),
smaller (speed < speed in Cond_A - 10%), or
the same (speed in Cond_A - 10% < speed < speed in Cond_A + 10%)
depending on the condition.
The data
import numpy as np
import pandas as pd
data1 = {
    'ID': [1, 1, 1, 2, 3, 3, 4, 5],
    'Condition': ['Cond_A', 'Cond_A', 'Cond_A', 'Cond_A', 'Cond_A', 'Cond_A', 'Cond_A', 'Cond_A'],
    'Speed': [1.2, 1.05, 1.2, 1.3, 1.0, 0.85, 1.1, 0.85],
}
df1 = pd.DataFrame(data1)
data2 = {
    'ID': [1, 2, 3, 4, 5, 6],
    'Condition': ['Cond_B', 'Cond_B', 'Cond_B', 'Cond_B', 'Cond_B', 'Cond_B'],
    'Speed': [0.8, 0.55, 0.7, 1.15, 1.2, 1.4],
}
df2 = pd.DataFrame(data2)
data3 = {
    'ID': [1, 2, 3, 4, 6],
    'Condition': ['Cond_C', 'Cond_C', 'Cond_C', 'Cond_C', 'Cond_C'],
    'Speed': [1.8, 0.99, 1.7, 131, 0.2],
}
df3 = pd.DataFrame(data3)
lst_of_dfs = [df1, df2, df3]
# creating a Dataframe object
data = pd.concat(lst_of_dfs)
My goal is to achieve a result like this:
  Condition  ID  Speed Category
0    Cond_A   1  1.150      NaN
1    Cond_A   2  1.300      NaN
2    Cond_A   3  0.925      NaN
3    Cond_A   4  1.100      NaN
4    Cond_A   5  0.850      NaN
5    Cond_B   1  0.800   faster
6    Cond_B   2  0.550   slower
7    Cond_B   3  0.700   slower
8    Cond_B   4  1.150    equal
...
My attempt:
Calculate average of speed for each ID per condition
data = data.groupby(["Condition", "ID"]).mean()["Speed"].reset_index()
Definition of thresholds, assuming I want thresholds of up to 10 percent around the Cond_A values:
threshold_upper = data.loc[(data.Condition == 'CondA')]['Speed'] + (data.loc[(data.Condition == 'CondA')]['Speed']*10/100)
threshold_lower = data.loc[(data.Condition == 'CondA')]['Speed'] - (data.loc[(data.Condition == 'CondA')]['Speed']*10/100)
Mapping strings 'faster', 'equal', 'slower' based on condition using numpy select.
conditions = [
    (data.loc[(data.Condition == 'CondB')]['Speed'] > threshold_upper),  # check whether Speed of each ID in CondB is faster than Speed in CondA+10%
    (data.loc[(data.Condition == 'CondC')]['Speed'] > threshold_upper),  # check whether Speed of each ID in CondC is faster than Speed in CondA+10%
    ((data.loc[(data.Condition == 'CondB')]['Speed'] < threshold_upper) & (data.loc[(data.Condition == 'CondB')]['Speed'] > threshold_lower)),  # check whether Speed of each ID in CondB is slower than Speed in CondA+10% AND faster than Speed in CondA-10%
    ((data.loc[(data.Condition == 'CondC')]['Speed'] < threshold_upper) & (data.loc[(data.Condition == 'CondC')]['Speed'] > threshold_lower)),  # check whether Speed of each ID in CondC is slower than Speed in CondA+10% AND faster than Speed in CondA-10%
    (data.loc[(data.Condition == 'CondB')]['Speed'] < threshold_upper),  # check whether Speed of each ID in CondB is slower than Speed in CondA-10%
    (data.loc[(data.Condition == 'CondC')]['Speed'] < threshold_upper),  # check whether Speed of each ID in CondC is faster than Speed in CondA-10%
]
values = [
    'faster',
    'faster',
    'equal',
    'equal',
    'slower',
    'slower',
]
data['Category'] = np.select(conditions, values)
Produces this error: <ValueError: Length of values (0) does not match length of index (16)>
My data frames unfortunately have different lengths (since not all IDs performed trials in every condition). I appreciate any hint. Many thanks in advance.
# Dataframe created
data
   ID Condition  Speed
0   1    Cond_A   1.20
1   1    Cond_A   1.05
2   1    Cond_A   1.20
# Reset the index
data = data.reset_index(drop=True)
# Creating group numbers based on ID
data['group'] = data.groupby(['ID']).ngroup()
# Creating functions which return the upper and lower limits of speed
def lowlimit(x):
    return x[x['Condition']=='Cond_A'].Speed.mean() * 0.9
def upperlimit(x):
    return x[x['Condition']=='Cond_A'].Speed.mean() * 1.1
# Calculate the upperlimit and lowerlimit for the groups
df = pd.DataFrame()
df['ul'] = data.groupby('group').apply(lambda x: upperlimit(x))
df['ll'] = data.groupby('group').apply(lambda x: lowlimit(x))
# resetting the index
# so that we can merge on the 'group' column
df = df.reset_index()
# Merging the data and df dataframe
data_new = pd.merge(data,df,on='group',how='left')
data_new
   ID Condition  Speed  group      ul      ll
0   1    Cond_A   1.20      0  1.2650  1.0350
1   1    Cond_A   1.05      0  1.2650  1.0350
2   1    Cond_A   1.20      0  1.2650  1.0350
3   2    Cond_A   1.30      1  1.4300  1.1700
Now we have to apply the conditions
data_new.loc[(data_new['Speed'] >= data_new['ul']) & (data_new['Condition'] != 'Cond_A'),'Category'] = 'larger'
data_new.loc[(data_new['Speed'] <= data_new['ll']) & (data_new['Condition'] != 'Cond_A'),'Category'] = 'smaller'
data_new.loc[(data_new['Speed'] < data_new['ul']) & (data_new['Speed'] > data_new['ll']) & (data_new['Condition'] != 'Cond_A'),'Category'] = 'Same'
Here is the output; the Category column is now filled for the non-Cond_A rows.
You can drop the helper columns now, if you want:
data_new = data_new.drop(columns=['group','ul','ll'])
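For what it's worth, the same result can also be reached without the helper merge by broadcasting the per-ID Cond_A mean with groupby().transform. This is only a sketch under the same assumptions as above (data is the concatenated frame after reset_index):
# Mean Cond_A speed per ID, broadcast back onto every row of that ID
base = (data['Speed'].where(data['Condition'] == 'Cond_A')
        .groupby(data['ID']).transform('mean'))
not_a = data['Condition'] != 'Cond_A'
data.loc[not_a & (data['Speed'] >= base * 1.1), 'Category'] = 'larger'
data.loc[not_a & (data['Speed'] <= base * 0.9), 'Category'] = 'smaller'
data.loc[not_a & (data['Speed'] < base * 1.1) & (data['Speed'] > base * 0.9), 'Category'] = 'Same'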

Pandas .loc[] method is too slow, how can I speed it up

I have a dataframe with 40 million rows, and I want to change some columns by
age = data[data['device_name'] == 12]['age'].apply(lambda x : x if x != -1 else max_age)
data.loc[data['device_name'] == 12,'age'] = age
but this method is too slow. How can I speed it up?
Thanks for any replies!
You might want to change the first part to:
age = data[data['device_name'] == 12]['age']
age[age == -1] = max_age
data.loc[data['device_name'] == 12,'age'] = age
You could use the following, which to me is more concise (it could also gain you a little speed):
cond = data['device_name'] == 12
age = data.loc[cond, 'age']
data.loc[cond,'age'] = age.where(age != -1, max_age)
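If it is still slow, the whole operation can be done with a single boolean mask and a scalar assignment, so no intermediate age Series is built at all (a sketch; max_age is assumed to be defined as in the question):
import numpy as np  # only needed for the np.where variant below

cond = (data['device_name'] == 12) & (data['age'] == -1)
# Scalar assignment touches only the matching rows; no intermediate Series needed
data.loc[cond, 'age'] = max_age
# Equivalent single-pass alternative over the whole column:
# data['age'] = np.where(cond, max_age, data['age'])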

Fill pandas fields with tuples as elements by slicing

Sorry if this question has been asked before, but I did not find it here or anywhere else:
I want to fill some of the fields of a column with tuples. Currently I would have to resort to:
import pandas as pd
df = pd.DataFrame({'a': [1,2,3,4]})
df['b'] = ''
df['b'] = df['b'].astype(object)
mytuple = ('x','y')
for l in df[df.a % 2 == 0].index:
    df.set_value(l, 'b', mytuple)
with df being (which is what I want)
   a       b
0  1
1  2  (x, y)
2  3
3  4  (x, y)
This does not look very elegant to me and is probably not very efficient. Instead of the loop, I would prefer something like
df.loc[df.a % 2 == 0, 'b'] = np.array([mytuple] * sum(df.a % 2 == 0), dtype=tuple)
which (of course) does not work. How can I improve my above method by using slicing?
In [57]: df.loc[df.a % 2 == 0, 'b'] = pd.Series([mytuple] * len(df.loc[df.a % 2 == 0])).values
In [58]: df
Out[58]:
   a       b
0  1
1  2  (x, y)
2  3
3  4  (x, y)
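A note for newer pandas versions: set_value has been removed, but the same per-cell pattern works with .at, and the vectorized assignment still works as shown above. A minimal sketch on the question's toy frame:
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4]})
df['b'] = ''
df['b'] = df['b'].astype(object)   # object dtype so a cell can hold a tuple
mytuple = ('x', 'y')
mask = df.a % 2 == 0
# Per-cell assignment with .at (drop-in replacement for the removed set_value)
for i in df.index[mask]:
    df.at[i, 'b'] = mytuple
# Or, vectorized, as in the answer above:
# df.loc[mask, 'b'] = pd.Series([mytuple] * mask.sum(), index=df.index[mask])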

How to declare constraints with variable as array index in Z3Py?

Suppose x, y, z are int variables and A is a matrix; I want to express a constraint like:
z == A[x][y]
However this leads to an error:
TypeError: object cannot be interpreted as an index
What would be the correct way to do this?
=======================
A specific example:
I want to select 2 items with the best combination score,
where the score is given by the value of each item and a bonus on the selection pair.
For example,
for 3 items: a, b, c with related value [1,2,1], and the bonus on pairs (a,b) = 2, (a,c)=5, (b,c) = 3, the best selection is (a,c), because it has the highest score: 1 + 1 + 5 = 7.
My question is how to represent the constraint of selection bonus.
Suppose CHOICE[0] and CHOICE[1] are the selection variables and B is the bonus variable.
The ideal constraint should be:
B = bonus[CHOICE[0]][CHOICE[1]]
but it results in TypeError: object cannot be interpreted as an index
I know another way is to use a nested for-loop to instantiate the CHOICE variables first and then represent B, but this is really inefficient for large amounts of data.
Could any expert suggest a better solution, please?
If someone wants to play a toy example, here's the code:
from z3 import *
items = [0,1,2]
value = [1,2,1]
bonus = [[1, 2, 5],
         [2, 1, 3],
         [5, 3, 1]]
choices = [0,1]
# selection score
SCORE = [ Int('SCORE_%s' % i) for i in choices ]
# bonus
B = Int('B')
# final score
metric = Int('metric')
# selection variable
CHOICE = [ Int('CHOICE_%s' % i) for i in choices ]
# variable domain
domain_choice = [ And(0 <= CHOICE[i], CHOICE[i] < len(items)) for i in choices ]
# selection implication
constraint_sel = []
for c in choices:
    for i in items:
        constraint_sel += [Implies(CHOICE[c] == i, SCORE[c] == value[i])]
# choice not the same
constraint_neq = [CHOICE[0] != CHOICE[1]]
# bonus constraint. uncomment it to see the issue
# constraint_b = [B == bonus[val(CHOICE[0])][val(CHOICE[1])]]
# metric definition
constraint_sumscore = [metric == sum([SCORE[i] for i in choices ]) + B]
constraints = constraint_sumscore + constraint_sel + domain_choice + constraint_neq + constraint_b
opt = Optimize()
opt.add(constraints)
opt.maximize(metric)
s = []
if opt.check() == sat:
    m = opt.model()
    print [ m.evaluate(CHOICE[i]) for i in choices ]
    print m.evaluate(metric)
else:
    print "failed to solve"
Turns out the best way to deal with this problem is to actually not use arrays at all, but simply create integer variables. With this method, the 317x317 item problem originally posted actually gets solved in about 40 seconds on my relatively old computer:
[ 0.01s] Data loaded
[ 2.06s] Variables defined
[37.90s] Constraints added
[38.95s] Solved:
c0 = 19
c1 = 99
maxVal = 27
Note that the actual "solution" is found in about a second! But adding all the required constraints takes the bulk of the 40 seconds spent. Here's the encoding:
from z3 import *
import json
import sys
import time
start = time.time()
def tprint(s):
    global start
    now = time.time()
    etime = now - start
    print "[%ss] %s" % ('{0:5.2f}'.format(etime), s)
# load data
with open('data.json') as data_file:
    dic = json.load(data_file)
tprint("Data loaded")
items = dic['items']
valueVals = dic['value']
bonusVals = dic['bonusVals']
vals = [[Int("val_%d_%d" % (i, j)) for j in items if j > i] for i in items]
tprint("Variables defined")
opt = Optimize()
for i in items:
    for j in items:
        if j > i:
            opt.add(vals[i][j-i-1] == valueVals[i] + valueVals[j] + bonusVals[i][j])
c0, c1 = Ints('c0 c1')
maxVal = Int('maxVal')
opt.add(Or([Or([And(c0 == i, c1 == j, maxVal == vals[i][j-i-1]) for j in items if j > i]) for i in items]))
tprint("Constraints added")
opt.maximize(maxVal)
r = opt.check ()
if r == unsat or r == unknown:
    raise Z3Exception("Failed")
tprint("Solved:")
m = opt.model()
print " c0 = %s" % m[c0]
print " c1 = %s" % m[c1]
print " maxVal = %s" % m[maxVal]
I think this is as fast as it'll get with Z3 for this problem. Of course, if you want to maximize multiple metrics, then you can probably structure the code so that you can reuse most of the constraints, thus amortizing the cost of constructing the model just once, and incrementally optimizing afterwards for optimal performance.
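A rough sketch of that incremental pattern, assuming a reasonably recent z3 where Optimize supports push/pop (the constraints and metrics here are placeholders, not the problem above):
from z3 import *

opt = Optimize()
x, y = Ints('x y')
opt.add(x >= 0, y >= 0, x + y <= 10)   # shared constraints, built once
for metric in (x + 2 * y, 3 * x - y):  # hypothetical objectives
    opt.push()                          # keep the shared part
    opt.maximize(metric)
    if opt.check() == sat:
        print(opt.model())
    opt.pop()                           # drop only this objective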

Select subset of rows of dataframe using multiple conditions

I would like to select a subset of a dataframe that satisfies multiple conditions on multiple columns. I know I could do this sequentially -- first selecting the subset that matches the first condition, then the portion of those that match the second, and so on -- but it seems like it should be possible in a single step. The following seems like it should work, but doesn't. Apparently it does work like this in other languages' implementations of DataFrame. Any thoughts?
using DataFrames
df = DataFrame()
df[:A]=[ 1, 3, 4, 7, 9]
df[:B]=[ "a", "c", "c", "D", "c"]
df[(df[:A].<5)&&(df[:B].=="c"),:]
type: non-boolean (DataArray{Bool,1}) used in boolean context
while loading In[18], in expression starting on line 5
This is a Julia thing, not so much a DataFrame thing: you want & instead of &&. For example:
julia> [true, true] && [false, true]
ERROR: TypeError: non-boolean (Array{Bool,1}) used in boolean context
julia> [true, true] & [false, true]
2-element Array{Bool,1}:
false
true
julia> df[(df[:A].<5)&(df[:B].=="c"),:]
2x2 DataFrames.DataFrame
| Row | A | B |
|-----|---|-----|
| 1 | 3 | "c" |
| 2 | 4 | "c" |
FWIW, this works the same way in pandas in Python:
>>> df[(df.A < 5) & (df.B == "c")]
A B
1 3 c
2 4 c
I have the same problem now as https://stackoverflow.com/users/5526072/jwimberley, which occurred when I updated from Julia 0.5 to 0.6; I am now using DataFrames v0.10.1.
Update: I made the following change to fix it:
r[(r[:l] .== l) & (r[:w] .== w), :] # julia 0.5
r[.&(r[:l] .== l, r[:w] .== w), :] # julia 0.6
but this gets very slow with long chains (time taken is roughly proportional to 2^(number of chained conditions)),
so maybe Query is the better way now:
# r is a dataframe
using Query
q1 = @from i in r begin
    @where i.l == l && i.w == w && i.nl == nl && i.lt == lt &&
           i.vz == vz && i.vw == vw && i.vδ == vδ &&
           i.ζx == ζx && i.ζy == ζy && i.ζδx == ζδx
    @select {absu=i.absu, i.dBU}
    @collect DataFrame
end
for example. This is fast. It's in the DataFrames documentation.