Pandas .loc[] method is too slow, how can I speed it up

Pandas .loc[] method is too slow, how can I speed it up - pandas

I have a dataframe with 40 million rows,and I want to change some colums by
age = data[data['device_name'] == 12]['age'].apply(lambda x : x if x != -1 else max_age)
data.loc[data['device_name'] == 12,'age'] = age
but this method is too slow, how can I speed it up.
Thanks for all reply!

you might wanna change the first part to :
age = data[data['device_name'] == 12]['age']
age[age == -1] = max_age
data.loc[data['device_name'] == 12,'age'] = age
you could use, to me more concise(this could gain you a little speed)
cond = data['device_name'] == 12
age = data.loc[cond, age]
data.loc[cond,'age'] = age.where(age != -1, max_age)

Related

How to do OOP and MVC properly

This afternoon I wrote this but it feels dirty to me. Since I'm learning OOP and MVC, I would like to know if there is a proper way to change nested object's attributes.
For example what is the cleanest way to modify an attribute from an object contained in an object, etc ?
The whole code is here https://github.com/Meezouney/Chess_tournament
If you have any advise, I'll take it :)
Thanks
def enter_match_results(self, nth_round, nth_match):
n_round = self.tournament.get_list_of_rounds()[nth_round]
game = n_round.get_list_of_match()[nth_match]
players = game.get_players()
for i, player in enumerate(players):
if i == 0:
valid_input = 0
while not valid_input:
score = self.view.ask_results(player)
if score == "0" or score == "0,5" or score == "0.5" or score == "1":
if ',' in score:
score = score.replace(',', '.')
valid_input = 1
self.tournament.get_list_of_rounds()[nth_round].get_list_of_match()[nth_match].set_score_player_1(float(score))
else:
self.view.display_score_error()
elif i == 1:
valid_input = 0
while not valid_input:
score = self.view.ask_results(player)
if score == "0" or score == "0,5" or score == "0.5" or score == "1":
if ',' in score:
score = score.replace(',', '.')
valid_input = 1
self.tournament.get_list_of_rounds()[nth_round].get_list_of_match()[nth_match].set_score_player_2(float(score))
else:
self.view.display_score_error()
I am trying to figure out the philosophy of OOP by practicing.

Why is setting the expiry time ineffective in Redis

Now i use redispipeline to set key to redis ,and set key timeout.After running codes and timeout,and the keys are still in redis?It seemed doesn't work.Is there sth wrong with my code.
X[1] is dict,such as dict["a"]=b dict["c"]=d
redis_pool = redis.ConnectionPool(host = redis_ip, port = redis_port, decode_responses = False)
redis_connection = redis.StrictRedis(connection_pool = redis_pool)
redis_pipeline = redis_connection.pipeline(transaction = False)
sync_count = 0
for x in iterator:
key = suffix + x[0]
#value = str(x['tagid'])+"\t"+x['tag_name']
value = x[1]
redis_pipeline.hmset(key, value)
redis_pipeline.expire(key,60)
sync_count += 1
if (sync_count % 100 == 0):
result = redis_pipeline.execute()
print (result)
time.sleep(0.001)
break
if sync_count == 10000:
break
redis_pipeline.execute()

My .loc with multiple conditions keeps running...help me land the plane

When I try to run the code below, it just keeps running. Is it something obvious?
df.loc[(df['Target_Group'] == 'x') & (df['Period'].dt.year == df['Year_Performed'].dt.year), ['Target_P']] = df.loc[(df['Target_Group'] == 'x') & (df['Period'].dt.year == df['Year_Performed'].dt.year), ['y']]

I think you need assign condition to variable and the reuse:
m = (df['Target_Group'] == 'x') & (df['Period'].dt.year == df['Year_Performed'].dt.year)
df.loc[m, 'Target_P'] = df.loc[m, 'y']
For improve performance is possible use numpy.where:
df['Target_P'] = np.where(m, df['y'], df['Target_P'])

pandas is index sensitive , so you do not need repeat the condition for assignment
cond=(df['Target_Group'] == 'x') & (df['Period'].dt.year == df['Year_Performed'].dt.year)
df.loc[cond, 'Target_P'] = df.y
More info, example
df=pd.DataFrame({'cond':[1,2],'v1':[-110,-11],'v2':[9999,999999]})
df.loc[df.cond==1,'v1']=df.v2
df
Out[200]:
cond v1 v2
0 1 9999 9999
1 2 -11 999999
If index contain duplicate
df.loc[cond, 'Target_P'] = df.loc[cond,'y'].values

How to declare constraints with variable as array index in Z3Py?

Suppose x,y,z are int variables and A is a matrix, I want to express a constraint like:
z == A[x][y]
However this leads to an error:
TypeError: object cannot be interpreted as an index
What would be the correct way to do this?
=======================
A specific example:
I want to select 2 items with the best combination score,
where the score is given by the value of each item and a bonus on the selection pair.
For example,
for 3 items: a, b, c with related value [1,2,1], and the bonus on pairs (a,b) = 2, (a,c)=5, (b,c) = 3, the best selection is (a,c), because it has the highest score: 1 + 1 + 5 = 7.
My question is how to represent the constraint of selection bonus.
Suppose CHOICE[0] and CHOICE[1] are the selection variables and B is the bonus variable.
The ideal constraint should be:
B = bonus[CHOICE[0]][CHOICE[1]]
but it results in TypeError: object cannot be interpreted as an index
I know another way is to use a nested for to instantiate first the CHOICE, then represent B, but this is really inefficient for large quantity of data.
Could any expert suggest me a better solution please?
If someone wants to play a toy example, here's the code:
from z3 import *
items = [0,1,2]
value = [1,2,1]
bonus = [[1,2,5],
[2,1,3],
[5,3,1]]
choices = [0,1]
# selection score
SCORE = [ Int('SCORE_%s' % i) for i in choices ]
# bonus
B = Int('B')
# final score
metric = Int('metric')
# selection variable
CHOICE = [ Int('CHOICE_%s' % i) for i in choices ]
# variable domain
domain_choice = [ And(0 <= CHOICE[i], CHOICE[i] < len(items)) for i in choices ]
# selection implication
constraint_sel = []
for c in choices:
for i in items:
constraint_sel += [Implies(CHOICE[c] == i, SCORE[c] == value[i])]
# choice not the same
constraint_neq = [CHOICE[0] != CHOICE[1]]
# bonus constraint. uncomment it to see the issue
# constraint_b = [B == bonus[val(CHOICE[0])][val(CHOICE[1])]]
# metric definition
constraint_sumscore = [metric == sum([SCORE[i] for i in choices ]) + B]
constraints = constraint_sumscore + constraint_sel + domain_choice + constraint_neq + constraint_b
opt = Optimize()
opt.add(constraints)
opt.maximize(metric)
s = []
if opt.check() == sat:
m = opt.model()
print [ m.evaluate(CHOICE[i]) for i in choices ]
print m.evaluate(metric)
else:
print "failed to solve"

Turns out the best way to deal with this problem is to actually not use arrays at all, but simply create integer variables. With this method, the 317x317 item problem originally posted actually gets solved in about 40 seconds on my relatively old computer:
[ 0.01s] Data loaded
[ 2.06s] Variables defined
[37.90s] Constraints added
[38.95s] Solved:
c0 = 19
c1 = 99
maxVal = 27
Note that the actual "solution" is found in about a second! But adding all the required constraints takes the bulk of the 40 seconds spent. Here's the encoding:
from z3 import *
import sys
import json
import sys
import time
start = time.time()
def tprint(s):
global start
now = time.time()
etime = now - start
print "[%ss] %s" % ('{0:5.2f}'.format(etime), s)
# load data
with open('data.json') as data_file:
dic = json.load(data_file)
tprint("Data loaded")
items = dic['items']
valueVals = dic['value']
bonusVals = dic['bonusVals']
vals = [[Int("val_%d_%d" % (i, j)) for j in items if j > i] for i in items]
tprint("Variables defined")
opt = Optimize()
for i in items:
for j in items:
if j > i:
opt.add(vals[i][j-i-1] == valueVals[i] + valueVals[j] + bonusVals[i][j])
c0, c1 = Ints('c0 c1')
maxVal = Int('maxVal')
opt.add(Or([Or([And(c0 == i, c1 == j, maxVal == vals[i][j-i-1]) for j in items if j > i]) for i in items]))
tprint("Constraints added")
opt.maximize(maxVal)
r = opt.check ()
if r == unsat or r == unknown:
raise Z3Exception("Failed")
tprint("Solved:")
m = opt.model()
print " c0 = %s" % m[c0]
print " c1 = %s" % m[c1]
print " maxVal = %s" % m[maxVal]
I think this is as fast as it'll get with Z3 for this problem. Of course, if you want to maximize multiple metrics, then you can probably structure the code so that you can reuse most of the constraints, thus amortizing the cost of constructing the model just once, and incrementally optimizing afterwards for optimal performance.

Help in optimizing a for loop in matlab

I have a 1 by N double array consisting of 1 and 0. I would like to map all the 1 to symbol '-3' and '3' and all the 0 to symbol '-1' and '1' equally. Below is my code. As my array is approx 1 by 8 million, it is taking a very long time. How to speed things up?
[row,ll] = size(Data);
sym_zero = -1;
sym_one = -3;
for loop = 1 : row
if Data(loop,1) == 0
Data2(loop,1) = sym_zero;
if sym_zero == -1
sym_zero = 1;
else
sym_zero = -1;
end
else
Data2(loop,1) = sym_one;
if sym_one == -3
sym_zero = 3;
else
sym_zero = -3;
end
end
end

Here's a very important MATLAB optimization tip.
Preallocate!
Your code is much faster with a simple preallocation. Just add
Data2 = zeros(size(Data));
for loop = 1: row
...
before your for loop.
On my computer your code with preallocation terminated in 0.322s, and your original code is still running. I removed my original solution since yours is pretty fast with this optimization :).
Also since we're talking about MATLAB, it's faster to work on column vectors.

Hope you can follow this and I hope that I have understood your code correctly:
nOnes = sum(Data);
nZeroes = size(Data,2) - nOnes;
Data2(find(Data)) = repmat([-3 3],1,nOnes/2)
Data2(find(Data==0)) = repmat([-1 1],1,nZeroes/2)
I'll leave it to you to deal with the odd 1s and 0s.

So, disregarding negative signs, the equation for the output item Data2[loop,1] = Data[loop,1]*2 + 1. So why not first do that using a simple multiply-- that should be fast since it can be vectorized. Then create an array of half the original array length of 1s, half the original array length of -1s, call randperm on that. Then multiply by that. Everything's vectorized and should be much faster.

[row,ll] = size(Data);
sym_zero = -1;
sym_one = -3;
for loop = 1 : row
if ( Data(loop,1) ) // is 1
Data2(loop,1) = sym_one;
sym_one = sym_one * -1; // flip the sign
else
Data2(loop,1) = sym_zero;
sym_zero = sym_zero * -1; // flip the sign
end
end

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Pandas .loc[] method is too slow, how can I speed it up - pandas

I have a dataframe with 40 million rows,and I want to change some colums by age = data[data['device_name'] == 12]['age'].apply(lambda x : x if x != -1 else max_age) data.loc[data['device_name'] == 12,'age'] = age but this method is too slow, how can I speed it up. Thanks for all reply!

Related

How to do OOP and MVC properly

Why is setting the expiry time ineffective in Redis

My .loc with multiple conditions keeps running...help me land the plane

How to declare constraints with variable as array index in Z3Py?

Help in optimizing a for loop in matlab

Categories

Resources