Avoid the use of for loops - optimization

I'm working with R and I have code like this:
for (i in 1:10)
  for (j in 1:100)
    if (data[i] == paths[j,1])
      cluster[i,4] <- paths[j,2]
where:
data is a vector with 100 rows and 1 column
paths is a matrix with 100 rows and 5 columns
cluster is a matrix with 100 rows and 5 columns
My question is: how could I avoid the use of "for" loops to iterate through the matrix? I don't know whether apply functions (lapply, tapply...) are useful in this case.
This becomes a problem when, for example, j goes up to 10000, because the execution time is very long.
Thank you

The inner loop could be vectorized:
cluster[i,4] <- paths[max(which(data[i]==paths[,1])),2]
but check Musa's comment; I think you intended something else.
The second (outer) loop could be vectorized as well by replicating vectors, but if i is only 100 your speed-up won't be large, and it will need more RAM.
[edit]
As I understand your comment, can't you just use logical indexing?
indx <- data==paths[, 1]
cluster[indx, 4] <- paths[indx, 2]

I think that both loops can be vectorized using the following:
cluster[na.omit(match(paths[1:100, 1], data[1:10])), 4] <- paths[!is.na(match(paths[1:100, 1], data[1:10])), 2]

Related

vectorize join condition in pandas

This code works correctly as expected, but it takes a lot of time for large dataframes.
for i in excel_df['name_of_college_school']:
    for y in mysql_df['college_name']:
        if SequenceMatcher(None, i.lower(), y.lower()).ratio() > 0.8:
            excel_df.loc[excel_df['name_of_college_school'] == i, 'dupmark4'] = y
I guess I cannot use a function in the join condition to compare values like this.
How do I vectorize this?
Update:
Is it possible to update with the highest score? This loop will overwrite the earlier match and it is possible that the earlier match was more relevant than current one.
What you are looking for is fuzzy merging.
a = excel_df.to_numpy()
b = mysql_df.to_numpy()
for i in a:
    for j in b:
        # compare the two college-name columns row by row
        if SequenceMatcher(None, i[college_index_a].lower(),
                           j[college_index_b].lower()).ratio() > 0.8:
            i[dupmark_index] = j[college_index_b]
Never use loc in a loop; it has a huge overhead. And by the way, get the indices of the respective columns (the numerical ones), e.g.:
df.columns.get_loc("college_name")
You could avoid one of the loops using apply; instead of MxN .loc operations, it'll now be M .loc operations.
for y in mysql_df['college_name']:
    match = excel_df['name_of_college_school'].apply(
        lambda x: SequenceMatcher(None, x.lower(), y.lower()).ratio() > 0.8)
    excel_df.loc[match, 'dupmark4'] = y
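Regarding the update about keeping the highest score: a possible variation (a sketch only, not from the original answer; the helper series best is my own addition) is to track the best ratio seen so far and only overwrite a match when the new ratio beats it:
from difflib import SequenceMatcher
import pandas as pd

best = pd.Series(0.0, index=excel_df.index)   # best ratio seen so far per row
for y in mysql_df['college_name']:
    scores = excel_df['name_of_college_school'].apply(
        lambda x: SequenceMatcher(None, x.lower(), y.lower()).ratio())
    better = (scores > 0.8) & (scores > best)
    excel_df.loc[better, 'dupmark4'] = y
    best[better] = scores[better]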

what is the most efficient way to create a categorical variable based on another continuous variable in python pandas?

I have a continuous variable A (say, earnings) in my dataframe. I want to make a categorical variable B off that. Specifically, I'd like to define the second variable as going up in increments of 500 until a certain limit. For instance,
B = 1  if A < 500
    2  if A >= 500 & A < 1000
    3  if A >= 1000 & A < 1500
    ...
    11 if A > 5000
What is the most efficient way to do this in Pandas? In STATA in which I mostly program, I would either use replace and if (tedious) or loop if I have many categories. I want to break out of STATA thinking when using Pandas but sometimes my imagination is limited.
Thanks in advance
If the intervals are regular and the values are positive as they seem to be in the example, you can get the integer part of the values divided by the length of the interval. Something like
df['category'] = (df.A / step_size).astype(int)
Note that if there are negative values you can run into problems, e.g. anything between -500 and 500 comes out as 0. But you can get around this by adding some base value before dividing. You can effectively define your categories as the multiples of the step size from some base value, which happens to be zero above.
Something like
df['category'] = ((df.A + base) / step_size).astype(int)
Here's another approach, for intervals which aren't regularly spaced:
lims = np.arange(500, 5500, 500)
df['category'] = 0
for lim in lims:
    df.category += df.A > lim
This method is good when you have a relatively small number of limits but slows down for many, obviously.
Here's some benchmarking for the various methods:
a = np.random.rand(100000) * 6000
%timeit pd.cut(a, 11)
100 loops, best of 3: 6.47 ms per loop
%timeit (a / 500).astype(int)
1000 loops, best of 3: 1.12 ms per loop
%%timeit
x = 0
for lim in lims:
    x += a > lim
100 loops, best of 3: 3.84 ms per loop
I put pd.cut in there as well, as per John E's suggestion. As he pointed out, this yields categorical variables rather than integers, which have different uses. There are pros and cons to both approaches, and the best method will depend on the scenario.
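For completeness, here is one way (a sketch only, not from the original answer; the exact bin edges are an assumption based on the question) to reproduce the 1-11 coding from the question with pd.cut and explicit edges:
import numpy as np
import pandas as pd

bins = np.append(np.arange(0, 5001, 500), np.inf)   # edges 0, 500, ..., 5000, inf
df['category'] = pd.cut(df.A, bins=bins, labels=list(range(1, 12)), right=False)
With right=False the intervals are [0, 500), [500, 1000), and so on, matching the A >= 500 style cutoffs in the question.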

find ranges to create Uniform histogram

I need to find ranges in order to create a uniform histogram, e.g. ages split into 4 ranges:
data_set = [18,21,22,24,27,27,28,29,30,32,33,33,42,42,45,46]
is there a function that gives me the ranges so the histogram is uniform?
in this case
ranges = [(18,24), (27,29), (30,33), (42,46)]
This example is easy; I'd like to know whether there is an algorithm that deals with more complex data sets as well.
thanks
You are looking for the quantiles that split up your data equally. This combined with cut should work. So, suppose you want n groups.
set.seed(1)
x <- rnorm(1000) # Generate some toy data
n <- 10
uniform <- cut(x, c(-Inf, quantile(x, prob = (1:(n-1))/n), Inf)) # Determine the groups
plot(uniform)
Edit: now corrected to yield the correct cuts in the ends.
Edit2: I don't quite understand the downvote. But this also works in your example:
data_set = c(18,21,22,24,27,27,28,29,30,32,33,33,42,42,45,46)
n <- 4
groups <- cut(data_set, breaks = c(-Inf, quantile(data_set, prob = 1:(n-1)/n), Inf))
levels(groups)
Some minor renaming is necessary. For slightly better level names, you could also put in min(x) and max(x) instead of -Inf and Inf.
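For comparison only (not part of the original answer), a rough sketch of the same quantile idea in Python/numpy:
import numpy as np

data_set = np.array([18, 21, 22, 24, 27, 27, 28, 29, 30, 32, 33, 33, 42, 42, 45, 46])
n = 4
edges = np.quantile(data_set, np.arange(1, n) / n)   # inner cut points at the quantiles
groups = np.digitize(data_set, edges)                # group index 0..n-1 for each value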

Fast way of multiplying two 1-D arrays

I have the following data:
A = [a0 a1 a2 a3 a4 a5 .... a24]
B = [b0 b1 b2 b3 b4 b5 .... b24]
which I then want to multiply as follows:
C = A * B' = [a0b0 a1b1 a2b2 ... a24b24]
This clearly involves 25 multiplies.
However, in my scenario, only 5 new values are shifted into A per "loop iteration" (and 5 old values are shifted out of A). Is there any fast way to exploit the fact that data is shifting through A rather than being completely new? Ideally I want to minimize the number of multiplication operations (at a cost of perhaps more additions/subtractions/accumulations). I initially thought a systolic array might help, but it doesn't (I think!?)
Update 1: Note B is fixed for long periods, but can be reprogrammed.
Update 2: the shifting of A is like the following: a[24] <= a[19], a[23] <= a[18] ... a[1] <= new01, a[0] <= new00, and so forth each clock cycle.
Many thanks!
Is there any fast way to exploit the fact that data is shifting through A rather than being completely new?
Even though all you're doing is the shifting and adding new elements to A, the products in C will, in general, all be different since one of the operands will generally change after each iteration. If you have additional information about the way the elements of A or B are structured, you could potentially use that structure to reduce the number of multiplications. Barring any such structural considerations, you will have to compute all 25 products each loop.
Ideally I want to minimize the number of multiplication operations (at a cost of perhaps more additions/subtractions/accumulations).
In theory, you can reduce the number of multiplications to 0 by shifting and adding the array elements to simulate multiplication. In practice, this will be slower than a hardware multiplication so you're better off just using any available hardware-based multiplication unless there's some additional, relevant constraint you haven't mentioned.
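For illustration only, here is a minimal sketch of such a shift-and-add multiply for non-negative integers (my own example, not from the answer; as noted above, it will normally lose to a hardware multiplier):
def mul_shift_add(a, b):
    # Multiply non-negative integers a and b using only shifts and adds.
    result = 0
    while b:
        if b & 1:        # lowest bit of b is set: add the current shifted a
            result += a
        a <<= 1          # double a
        b >>= 1          # drop the lowest bit of b
    return result
For example, mul_shift_add(13, 7) returns 91.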
On the very first 5 data sets you could save up to 50 multiplications, but after that it's a flat road of multiplications, since for every set after the first 5 you need to multiply with the new set of data.
I'll assume all the arrays are initialized to zero.
I don't think those 50 saved multiplications are of much use considering the amount of multiplication on the whole, but I will still give you a hint on how to save them; maybe you can find an extension to it?
1st data set arrived: multiply the first data set in a with each of the data sets in b, save all in a, copy only a[0] to a[4] to c. 25 multiplications here.
2nd data set arrived: multiply only a[0] to a[4] (holding the new data) with b[0] to b[4] respectively, save in a[0] to a[4], copy a[0->9] to c. 5 multiplications here.
3rd data set arrived: multiply a[0] to a[9] with b[0] to b[9] this time and copy the corresponding a[0->14] to c. 10 multiplications here.
4th data set: multiply a[0] to a[14] with the corresponding b, copy the corresponding a[0->19] to c. 15 multiplications here.
5th data set: multiply a[0] to a[19] with the corresponding b, copy the corresponding a[0->24] to c. 20 multiplications here.
Total saved multiplications: 50.
6th data set: usual multiplications, 25 each, because for each entry in the array a there is a new data value available, so the multiplication is unavoidable.
Could you add another array D to flag the changed/unchanged values in A? Each time, you check this array to decide whether new multiplications are needed or not.
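A minimal sketch of that flag idea in Python (purely illustrative, the names are my own; note that with the shift described in the question every position of A changes each cycle, so the saving is limited):
import numpy as np

def update_products(A, B, C, changed):
    # Recompute C[k] = A[k] * B[k] only where A[k] is flagged as changed.
    idx = np.flatnonzero(changed)
    C[idx] = A[idx] * B[idx]
    changed[:] = False   # clear the flags until A is updated again
    return C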

Algorithm - find the minimal subtraction between sum of two arrays

I am job hunting now and doing many algorithm exercises. Here is my problem:
Given two arrays a and b of the same length, the task is to make |sum(a) - sum(b)| minimal by swapping elements between a and b.
Here is my thought:
Assume we swap a[i] and b[j], and set Delt = sum(a) - sum(b), x = a[i] - b[j].
Then Delt2 = sum(a) - a[i] + b[j] - (sum(b) - b[j] + a[i]) = Delt - 2*x.
The improvement is |Delt| - |Delt2|, which has the same sign as |Delt|^2 - |Delt2|^2 = 4*x*(Delt - x), so a swap helps exactly when x*(Delt - x) > 0.
Based on the thought above I got the following code:
Delt = sum(a) - sum(b);
done = false;
while (!done)
{
    done = true;
    for i = [0, n)
    {
        for j = [0, n)
        {
            x = a[i] - b[j];
            change = x * (Delt - x);
            if (change > 0)
            {
                swap(a[i], b[j]);
                Delt = Delt - 2*x;
                done = false;
            }
        }
    }
}
However, does anybody have a much better solution? If you do, please tell me; I would be very grateful!
This problem is basically the optimization version of the Partition Problem, with the extra constraint that the two parts must have an equal number of elements. I'll prove that adding this constraint doesn't make the problem easier.
NP-Hardness proof:
Assume there is an algorithm A that solves this problem in polynomial time; then we can solve the Partition Problem in polynomial time:
Partition(S):
    for i in range(|S|):
        S += {0}
        result <- A(S\2, S\2)      // split S arbitrarily into 2 equal-size parts
        if result is a partition:  // simple to check, since Partition is in NP
            return true
    return false                   // no partition
Correctness:
If there is a partition, denote it (S1, S2) [assume S2 has more elements]. On iteration |S2|-|S1| [i.e., after adding |S2|-|S1| zeros], the input to A will contain enough zeros that we can return two equal-length arrays, S2 and S1+{0,0,...,0}, which form a partition of S, so the algorithm will yield true.
If the algorithm yields true at some iteration k, we had two arrays S1 and S2 with the same number of elements and equal sums. By removing the k added zeros from the arrays, we get a partition of the original S, so S has a partition.
Polynomial:
Assume A takes P(n) time; then the algorithm we produced takes n*P(n) time, which is also polynomial.
Conclusion:
If this problem were solvable in polynomial time, so would the Partition Problem be, and thus P = NP. Based on this, the problem is NP-hard.
Because this problem is NP-hard, you will probably need an exponential algorithm for an exact solution. One such approach is simple backtracking [I leave it as an exercise to the reader to implement a backtracking solution].
EDIT: As mentioned by @jpalecek: by simply creating the reduction S -> S + (0, 0, ..., 0) [k zeros], one can prove NP-hardness directly by reduction. Polynomiality is trivial, and correctness is very similar to the correctness proof for Partition above [if there is a partition, adding 'balancing' zeros is possible; the other direction is simply trimming those zeros].
Just a comment. Through all this swapping you can basically arrange the contents of both arrays as you like. So it is unimportant in which array the values are at start.
I can't do it in my head, but I'm pretty sure there is a constructive heuristic: sort the values first and then deal them out according to some rule, something along the lines of dealing each value into whichever array currently has the smaller sum.
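A quick sketch of that kind of greedy dealing (my own illustration, not the commenter's code; greedy_split is a hypothetical helper that keeps the equal-length constraint by never overfilling a side):
def greedy_split(values):
    # Deal the values, largest first, into two halves of equal size
    # (assuming an even number of values), preferring the half with
    # the smaller running sum.
    half = len(values) // 2
    a, b = [], []
    for v in sorted(values, reverse=True):
        if len(a) < half and (sum(a) <= sum(b) or len(b) >= half):
            a.append(v)
        else:
            b.append(v)
    return a, b
For example, greedy_split([8, 7, 6, 5, 4, 3, 2, 1]) gives [8, 5, 4, 1] and [7, 6, 3, 2], both summing to 18. This is only a heuristic; as the answer above shows, an exact solution is NP-hard in general.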