vectorize join condition in pandas

This code works correctly, as expected, but it takes a lot of time for large dataframes.
from difflib import SequenceMatcher

for i in excel_df['name_of_college_school']:
    for y in mysql_df['college_name']:
        if SequenceMatcher(None, i.lower(), y.lower()).ratio() > 0.8:
            excel_df.loc[excel_df['name_of_college_school'] == i, 'dupmark4'] = y
I guess I cannot use a function in a join clause to compare values like this.
How do I vectorize this?
Update:
Is it possible to update with the highest score? This loop will overwrite an earlier match, and it is possible that the earlier match was more relevant than the current one.

What you are looking for is fuzzy merging.
a = excel_df.to_numpy()  # as_matrix() is deprecated; to_numpy() is the modern equivalent
b = mysql_df.to_numpy()

for i in a:
    for j in b:
        if SequenceMatcher(None,
                i[college_index_a].lower(), j[college_index_b].lower()).ratio() > 0.8:
            i[dupmark_index] = j[college_index_b]  # store the matched name (y in your code)
Never use loc in a loop; it has a huge overhead. Also, get the numerical index of each column up front. Use this:
df.columns.get_loc("college_name")
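For instance, to define the indices used in the loop above (a sketch, assuming the column names from the question and that dupmark4 already exists in excel_df):
college_index_a = excel_df.columns.get_loc("name_of_college_school")
college_index_b = mysql_df.columns.get_loc("college_name")
dupmark_index = excel_df.columns.get_loc("dupmark4")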

You could avoid one of the loops using apply; instead of MxN .loc operations, it'll now be M operations.
for y in mysql_df['college_name']:
    match = excel_df['name_of_college_school'].apply(lambda x: SequenceMatcher(
        None, x.lower(), y.lower()).ratio() > 0.8)
    excel_df.loc[match, 'dupmark4'] = y
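Regarding the update: to keep the best-scoring match instead of the last one above the threshold, you can score every candidate per row and take the argmax. A minimal sketch along the same lines (assuming the same column names; not tuned for speed):
def best_match(name):
    # score every candidate, keep the best one only if it clears the threshold
    scores = mysql_df['college_name'].apply(
        lambda y: SequenceMatcher(None, name.lower(), y.lower()).ratio())
    return mysql_df['college_name'].loc[scores.idxmax()] if scores.max() > 0.8 else None

excel_df['dupmark4'] = excel_df['name_of_college_school'].apply(best_match)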

Related

Finding the index of max value of columns in numpy array but removing the previous max

I have an array with N rows and M columns.
I would like to run through all the columns, finding the index of the row that contains the max value of each column. However, each row may be selected only once.
For instance, let's consider a matrix
1 1
2 2
The output should be [1, 0]: row 1 (value 2) holds the max value of column 0; then we move to column 1, where row 1 is out of consideration, so row 0 holds the highest remaining cell.
Indeed, this can be solved easily with a nested for loop, something like:
nb_rows, nb_columns = A.shape
removed_rows = []
for i in range(nb_columns):
    index_max = -1
    value_max = -np.inf  # start below everything so a removed row can't win
    for j in range(nb_rows):
        if j in removed_rows:
            continue
        if value_max < A[j, i]:
            index_max = j
            value_max = A[j, i]
    removed_rows.append(index_max)
However, it seems slow for a huge matrix. Is there a way to do it faster (with numpy)?
Many thanks
This might not be very fast, as it still loops through the columns (which I think is unavoidable given the constraint), but it should be faster than your solution because it finds each maximum's index with argmax:
out = []
mm = A.min() - 1
for j in range(A.shape[1]):
    idx = np.argmax(A[:, j])
    # replace the entire row with mm
    # so the next `argmax` will ignore this row
    A[idx] = mm
    out.append(idx)
The above takes about 640 us on a 100 x 100 array, and 18 ms on a 1k x 1k array. Your code refuses to finish on a 1k x 1k array within reasonable time on my system.
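Wrapped as a function and run on the 2x2 example from the question (a quick sketch; note the loop overwrites rows of A, so pass a copy if you still need the original):
import numpy as np

def colwise_argmax_no_repeat(A):
    out = []
    mm = A.min() - 1
    for j in range(A.shape[1]):
        idx = int(np.argmax(A[:, j]))
        A[idx] = mm  # knock this row out for later columns
        out.append(idx)
    return out

A = np.array([[1, 1],
              [2, 2]])
print(colwise_argmax_no_repeat(A.copy()))  # [1, 0]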

Need pandas optimized code with 1 million stock data

Currently my code is
self.df['sma'] = self.df['Close'].rolling(window=30).mean()
self.df['cma'] = self.df.apply(lambda x: self.get_cma(x), axis=1)
def get_cma(self, candle):
    if np.isnan(candle['sma']):
        return np.nan
    secma = (candle['sma'] - self.previous_cma if self.previous_cma is not None else 0) ** 2
    ka = 1 - (candle['var'] / secma) if candle['var'] < secma else 0
    cma = ((ka * candle['sma']) + ((1 - ka) * self.previous_cma)) if self.previous_cma is not None else candle[self.src]
    self.previous_cma = cma
    return cma
Can the above be optimized to make it faster?
As you may already know, the secret to performance with Pandas is to do this in vectorized form. This means no apply. Here are the first few steps you need to take to speed up your code, by extracting parts of your get_cma() function to their vectorized equivalents.
if np.isnan(candle['sma']):
    return np.nan
This early exit is not needed inside get_cma(); we can do this instead:
self.df['cma'] = np.nan
valid = self.df['sma'].notnull()
# this comment is a placeholder for step 2
self.df.loc[valid, 'cma'] = self.df[valid].apply(self.get_cma, axis=1)
This not only vectorizes the first two lines of get_cma(), it means get_cma() is now only called on not-null rows, rather than every row. Depending on your data that alone may provide a noticeable speedup.
If that's not enough, we need a bigger hammer. The fundamental problem is that each iteration of get_cma() depends on the previous, so it is not easy to vectorize. So let's use Numba to JIT compile the code. First we need to get rid of apply by using a good old for loop over the individual columns, which is equivalent (and will still be slow). Note this is a free (global) function, not a member function, and it takes NumPy arrays instead of Pandas types, because those are what Numba understands:
def get_cma(sma, var, src):
    cma = np.empty_like(sma)
    # take care of the initial value first, to avoid unnecessary branches later
    cma[0] = src[0]
    # now do all remaining rows; cma[ii-1] is previous_cma and is never None
    for ii in range(1, len(sma)):
        secma = (sma[ii] - cma[ii-1]) ** 2
        ka = 1 - (var[ii] / secma) if var[ii] < secma else 0
        cma[ii] = (ka * sma[ii]) + ((1 - ka) * cma[ii-1])
    return cma
Call it like this, passing the required columns as NumPy arrays:
valid_rows = self.df[valid]
self.df.loc[valid, 'cma'] = get_cma(
    valid_rows['sma'].to_numpy(),
    valid_rows['var'].to_numpy(),
    valid_rows[self.src].to_numpy())
Finally, after confirming the code works, decorate get_cma() to compile it with Numba automatically like this:
import numba

@numba.njit
def get_cma(sma, var, src):
    ...
That's it. Please let us know how much faster this runs on your real data. I expect it will be plenty fast enough.

Matplotlib: Plot a function with multiple definitions

How do I plot e.g. a function f(x) = x for 0 < x < 1 and f(x) = 1 for x >= 1?
Thanks in advance!
EDIT:
Okay, I have thought for a while and found a solution for the given function, but I'd really like to find a more generic solution. Maybe something like f = f1 + f2 + ... + fn, where fi is the function on domain i, and then plot f altogether.
f = 0.5*(1*(1-np.sign(1-x))+x*(1-np.sign(x-1)))
Matplotlib doesn't care where your data comes from: you can either make lists from two different functions and combine them, or call a function with a conditional in it. The most mathematically appealing choice is probably
def f(x):
    return 1 if x >= 1 else x if x > 0 else ...
Of course, if you care only about x > 0, your function can be computed as just min(x, 1).
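For the generic piecewise case from the edit, NumPy's piecewise takes a list of conditions and a matching list of functions (or constants); a minimal sketch for this f, assuming you evaluate on an array rather than point by point:
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0.01, 3, 300)
# f(x) = x on 0 < x < 1, f(x) = 1 for x >= 1
y = np.piecewise(x, [x < 1, x >= 1], [lambda t: t, 1.0])

plt.plot(x, y)
plt.show()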

torch logical indexing of tensor

I'm looking for an elegant way to select a subset of a torch tensor which satisfies some constraints.
For example, say I have:
A = torch.rand(10,2)-1
and S is a 10x1 tensor,
sel = torch.ge(S,5) -- this is a ByteTensor
I would like to be able to do logical indexing, as follows:
A1 = A[sel]
But that doesn't work.
So there's the index function which accepts a LongTensor but I could not find a simple way to convert S to a LongTensor, except the following:
sel = torch.nonzero(sel)
which returns a K x 2 tensor (K being the number of values of S >= 5). So then I have to convert it to a 1 dimensional array, which finally allows me to index A:
A:index(1,torch.squeeze(sel:select(2,1)))
This is very cumbersome; in e.g. Matlab all I'd have to do is
A(S>=5,:)
Can anyone suggest a better way?
One possible alternative is:
sel = S:ge(5):expandAs(A) -- now you can use this mask with the [] operator
A1 = A[sel]:unfold(1, 2, 2) -- unfold to get back a 2D tensor
Example:
> A = torch.rand(3,2)-1
-0.0047 -0.7976
-0.2653 -0.4582
-0.9713 -0.9660
[torch.DoubleTensor of size 3x2]
> S = torch.Tensor{{6}, {1}, {5}}
6
1
5
[torch.DoubleTensor of size 3x1]
> sel = S:ge(5):expandAs(A)
1 1
0 0
1 1
[torch.ByteTensor of size 3x2]
> A[sel]
-0.0047
-0.7976
-0.9713
-0.9660
[torch.DoubleTensor of size 4]
> A[sel]:unfold(1, 2, 2)
-0.0047 -0.7976
-0.9713 -0.9660
[torch.DoubleTensor of size 2x2]
There are two simpler alternatives:
Use maskedSelect:
result=A:maskedSelect(your_byte_tensor)
Use a simple element-wise multiplication, for example
result = torch.cmul(A, S:gt(0):typeAs(A):expandAs(A))
The second one is very useful if you need to keep the shape of the original matrix (i.e. A), for example to select neurons in a layer during backprop. However, since it puts zeros in the resulting matrix wherever the condition dictated by the ByteTensor doesn't apply, you can't use it to compute a product (or median, etc.). The first one returns only the elements that satisfy the condition, so it is what I'd use to compute products, medians, or anything else where I don't want zeros.
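For what it's worth, in modern PyTorch (the Python successor to Lua Torch) the Matlab-style selection works directly with a boolean mask; a short sketch, not Lua:
import torch

A = torch.rand(10, 2) - 1
S = torch.rand(10, 1) * 10
A1 = A[(S >= 5).squeeze(1)]  # selects whole rows, like A(S>=5,:) in Matlab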

Avoid the use of for loops

I'm working with R and I have a code like this:
for (i in 1:10)
  for (j in 1:100)
    if (data[i] == paths[j, 1])
      cluster[i, 4] <- paths[j, 2]
where:
data is a vector with 100 rows and 1 column
paths is a matrix with 100 rows and 5 columns
cluster is a matrix with 100 rows and 5 columns
My question is: how could I avoid the use of "for" loops to iterate through the matrix? I don't know whether apply functions (lapply, tapply...) are useful in this case.
This is a problem when j=10000 for example because the execution time is very long.
Thank you
The inner loop could be vectorized:
cluster[i, 4] <- paths[max(which(data[i] == paths[, 1])), 2]
but check Musa's comment; I think you intended something else.
The second (outer) loop could be vectorized as well, by replicating vectors, but:
if i is only 100, your speed-up won't be large
it will need more RAM
[edit]
As I understand your comment, can you just use logical indexing?
indx <- data==paths[, 1]
cluster[indx, 4] <- paths[indx, 2]
I think that both loops can be vectorized using the following:
m <- match(paths[1:100, 1], data[1:10])
cluster[na.omit(m), 4] <- paths[!is.na(m), 2]