Finding the index of max value of columns in numpy array but removing the previous max - numpy

I have an array with N rows and M columns.
I would like to run through all the columns, finding the index of the row that contains the max value of each column. However, each row should be selected only once.
For instance, let's consider a matrix
1 1
2 2
The output should be [1, 0]: row 1 (value 2) holds the max value of column 0; then we move to column 1, where row 1 is out of consideration, so row 0 holds the highest remaining cell.
Indeed, this can be solved easily with a nested for loop, something like:
removed_rows = []
for i in range(nb_columns):
    index_max = None
    value_max = None
    for j in range(nb_rows):
        if j in removed_rows:
            continue
        # take the first available row, then any row with a larger value
        if index_max is None or value_max < A[j, i]:
            index_max = j
            value_max = A[j, i]
    removed_rows.append(index_max)
However, it seems slow for a huge matrix. Is there any method we can do it faster (with numpy?)?
Many thanks

This might not be very fast, as it still loops through the columns, which I think is unavoidable due to the constraint, but it should be faster than your solution because it finds the maximum's index with argmax:
out = []
mm = A.min() - 1
# note: this overwrites rows of A; work on A.copy() to keep the original
for j in range(A.shape[1]):
    idx = np.argmax(A[:, j])
    # replace the entire row with mm
    # so next `argmax` will ignore this row
    A[idx] = mm
    out.append(idx)
The above takes about 640 us on 100 x 100 arrays, and 18ms on 1k x 1k arrays. Your code refuses to run on 1k x 1k array within reasonable time on my system.
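If overwriting A is not acceptable, a variant that tracks the still-available rows in a boolean mask might look like the sketch below. The function name `greedy_column_argmax` and the mask `available` are hypothetical, and the sketch assumes there are at least as many rows as columns (otherwise no row is left to pick).

```python
import numpy as np

def greedy_column_argmax(A):
    """For each column in order, pick the row with the max value among unused rows."""
    A = np.asarray(A)
    available = np.ones(A.shape[0], dtype=bool)  # rows not selected yet
    out = []
    for j in range(A.shape[1]):
        rows = np.flatnonzero(available)         # indices of rows still available
        idx = rows[np.argmax(A[rows, j])]        # best remaining row for column j
        available[idx] = False                   # this row is now used up
        out.append(int(idx))
    return out

print(greedy_column_argmax([[1, 1], [2, 2]]))    # the example from the question
```

This leaves A untouched and does not depend on a sentinel value such as A.min() - 1.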

Related

Numpy deleting all rows with condition

I have to edit a CSV file.
I can already import it and transform it into a 2D-Array
Now, my job is to delete all rows where, 0.0005 < array[i, 0]%0.0025 < 0.9995.
(Basically, in the first column, are steps with a 0.0025 interval, and I need to delete all rows, where a step is accidentally bigger than it should)
I already tried the following:
length = len(data)
for i in range data:
    if 0.0005 < data[i,0]%0.0025 < 0.9995:
        np.delete(data, i, 0)
but it didn't work. Can anybody help me?
I see some issues with your approach. First, you should not iterate over an array while deleting elements from it. Second, np.delete returns a new array rather than working in place, so your call has no effect unless you assign the result. Also, there is a small syntax error in your range call (it should be range(length)).
Perhaps use a boolean mask built from both conditions instead (recent NumPy versions accept a boolean mask as the second argument of np.delete):
data = np.delete(data, (data[:,0]%0.0025>0.0005)&(data[:,0]%0.0025<0.9995), 0)
We can verify a similar problem with an example: remove all rows where the first element x satisfies 1<x<4 (removed the mod as it doesn't matter for the example):
data = np.array([[i, 2, 3] for i in range(1, 6)])
# data is now [[1,2,3],[2,2,3],...,[5,2,3]]
data = np.delete(data, (data[:, 0] > 1) & (data[:, 0] < 4), 0)
# data is now [[1,2,3],[4,2,3],[5,2,3]]
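The same filtering can also be written with plain boolean indexing, keeping the rows you want instead of deleting the others. The sketch below is only one possible reading of the question: it assumes "off-grid" means the first column is further than a tolerance from the nearest multiple of the step, which sidesteps floating-point surprises with `%`. The function name `drop_off_grid_rows` and its parameters are hypothetical.

```python
import numpy as np

def drop_off_grid_rows(data, step=0.0025, tol=0.0005):
    """Keep only rows whose first column lies within tol of a multiple of step."""
    x = data[:, 0]
    # distance from each value to its nearest multiple of step
    distance = np.abs(x / step - np.round(x / step)) * step
    keep = distance <= tol
    return data[keep]          # boolean indexing returns a new, filtered array

data = np.array([[0.0000, 1.0],
                 [0.0025, 2.0],
                 [0.0037, 3.0],   # off the 0.0025 grid, should be dropped
                 [0.0050, 4.0]])
cleaned = drop_off_grid_rows(data)
print(cleaned)
```

As with np.delete, the original array is not modified; the result has to be assigned.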

Running time of nested while loops

Function f(n)
    s = 0
    i = 1
    while i < 7n^(1/2) do
        j = i
        while j > 5 do
            s = s + i - j
            j = j - 2
        end
        i = 5i
    end
    return s
end f
I am trying to find the big-Theta running time of the code above. I have been looking all over the place for an example to help me, but everything uses for loops or only one while loop. How would you go about this problem with nested while loops?
Let's break this down into two key points:
i starts from 1 and is multiplied by 5 on each pass, until it is greater than or equal to 7 sqrt(n). This is an exponential increase, so the number of steps is logarithmic. Thus we can change the code to the following equivalent:
m = ceil(log(5, 7n^(1/2)))
k = 0
while k < m do
    j = 5^k
    // ... inner loop ...
    k = k + 1
end
For each iteration of the outer loop, j starts from i = 5^k and decreases in steps of 2 until it is less than or equal to 5. Note that in the first execution of the outer loop i = 1, and in the second i = 5, so the inner loop body is not executed until the third iteration. Since 5^k is always odd, j stays odd: the body last runs with j = 7 and the loop exits with j = 5, so the body runs (5^k - 5)/2 times for k >= 1 (you can check this with pen and paper).
Combining the above steps, we arrive at:
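A sketch of the combined count, assuming m denotes the number of outer passes (the smallest integer with 5^m >= 7 sqrt(n)) and counting one unit per inner-loop iteration:

```latex
T(n) = \sum_{k=1}^{m-1} \frac{5^k - 5}{2}
     = \frac{1}{2}\left(\frac{5^m - 5}{4} - 5(m-1)\right)
     = \Theta(5^m) = \Theta(\sqrt{n}),
\qquad\text{since } 7\sqrt{n} \le 5^m < 35\sqrt{n}.
```

The outer loop itself contributes only Theta(log n) passes, so under these assumptions the total running time is Theta(sqrt(n)). For n = 16 this gives m = 3 and T = 10.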
The outer loop runs while i < 7 * sqrt(n). Exponent 1/2 is the same as sqrt() of a number.
i is multiplied by 5 on each pass (i = 5i), so the outer loop runs m = ceil(log5(7 * sqrt(n))) times, not 7 * sqrt(n) times.
The inner loop body runs on only m - 2 of those passes, since the first two values of i are 1 and 5, which do not pass the j > 5 comparison.
Take an example where n = 16:
i = 1, n = 16;
while( i < 7 * 4; i *= 5 )
    //Do something
First value of i = 1. This is outer pass 1. The inner loop runs 0 times.
Second value of i = 5. This is outer pass 2. The inner loop runs 0 times.
Third value of i = 25. This is outer pass 3. The inner loop runs 10 times.
Fourth value of i = 125. It stops, since 125 >= 28.
The outer loop does O(log n) passes, while the inner loop does (i - 5)/2 iterations on a pass with i > 5. Because i grows by a factor of 5 each time, the last pass dominates, and its i is below 7 * sqrt(n), which gives Theta(sqrt(n)) overall.
IMO, it is complex.
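One way to sanity-check a hand count is to translate the pseudocode directly and count the passes. The sketch below does that; the function name `count_iterations` is hypothetical, and the accumulation into s is left out because only the loop counts matter here.

```python
import math

def count_iterations(n):
    """Return (outer passes, total inner iterations) of f(n) from the question."""
    limit = 7 * math.sqrt(n)
    outer = inner = 0
    i = 1
    while i < limit:
        outer += 1
        j = i
        while j > 5:
            inner += 1      # one execution of the inner loop body
            j -= 2
        i *= 5
    return outer, inner

print(count_iterations(16))  # the n = 16 example: 3 outer passes, 10 inner iterations
```

Running it for growing n and comparing the inner count with sqrt(n) is a quick empirical check of the growth rate.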

Combining many sort ranks into one master sort rank

Say I have some sorted result from a SQL query that looks like:
x y z
0 0 0
0 0 1
0 0 2
0 1 0
0 1 1
0 2 0
0 2 1
Where x, y and z are sort ranks. These sort ranks are always greater than 0, and smaller than 500mil.
Is there a way to combine the values from x, y and z into one "master" sort rank? Sorting the dataset using this "master" sort rank should result in the same ordering.
I'm thinking I can do something with bit shifting but I am not sure...
Assuming that every value in each of the three columns is between 1 and 500 million, you could use the following formula to generate a unique rank:
z + (500 x 10^6)*y + (500 x 10^6)*(500 x 10^6)*x
To generate this rank you could use the following query:
SELECT
    x, y, z,
    z + (500 * 1000000)*y + (500 * 1000000)*(500 * 1000000)*x AS master_rank
FROM yourTable;
The reason this works can be seen by examining, say, the z and y columns. Two z values always differ by less than 500 million, while increasing y by 1 changes the y term by exactly 500 million, so a difference in y always outweighs any difference in z. This logic applies to the whole formula. This approach is similar to using a bit mask, on a larger scale.
Note that I assume that your version of SQL can tolerate numbers this large: the rank can reach roughly 1.25 x 10^26, which is beyond a signed 64-bit integer (about 9.22 x 10^18), so a DECIMAL/NUMERIC type would be needed. If it can't, then you might want to consider another approach here, possibly just ordering by the three columns as @Gordon mentioned in his answer. Besides this, having 1 bil x 1 bil records would make for a very large table and would have other problems.
Do you mean something like this?
order by x * 10000 + y * 100 + z
(You would adjust the numbers for the width you need.)
I'm not sure why you would want to do that instead of:
order by x, y, z
If you do combine into a single value, be careful about integer overflow.
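To see both points at once (the combined key preserves the order, and overflow is a real risk), here is a small Python sketch. The names `BASE` and `master_rank` are hypothetical, and it assumes every rank satisfies 0 <= value < BASE. Python integers are arbitrary precision, so the overflow is only detected here, not triggered.

```python
BASE = 500_000_000  # assumed exclusive upper bound for x, y and z

def master_rank(x, y, z, base=BASE):
    """Pack three ranks into one integer, like digits of a base-`base` number."""
    return (x * base + y) * base + z

rows = [(0, 2, 1), (0, 0, 2), (0, 1, 0), (0, 0, 0), (0, 1, 1), (0, 2, 0), (0, 0, 1)]

by_columns = sorted(rows)                                  # ORDER BY x, y, z
by_rank = sorted(rows, key=lambda r: master_rank(*r))      # ORDER BY master_rank
print(by_columns == by_rank)

# Largest possible key versus the signed 64-bit limit
largest = master_rank(BASE - 1, BASE - 1, BASE - 1)
fits_in_int64 = largest < 2**63
print(fits_in_int64)
```

With three columns of this width the key does not fit in 64 bits, which is why plain `order by x, y, z` is usually the simpler choice.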

vectorize join condition in pandas

This code is working correctly as expected. But it takes a lot of time for large dataframes.
for i in excel_df['name_of_college_school'] :
    for y in mysql_df['college_name'] :
        if SequenceMatcher(None, i.lower(), y.lower() ).ratio() > 0.8:
            excel_df.loc[excel_df['name_of_college_school'] == i, 'dupmark4'] = y
I guess, I can not use a function on join clause to compare values like this.
How do I vectorize this?
Update:
Is it possible to update with the highest score? This loop will overwrite the earlier match and it is possible that the earlier match was more relevant than current one.
What you are looking for is fuzzy merging.
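a = excel_df.to_numpy()  # as_matrix() was removed in pandas 1.0
b = mysql_df.to_numpy()
for i in a:
    for j in b:
        if SequenceMatcher(None,
                i[college_index_a].lower(), j[college_index_b].lower()).ratio() > 0.8:
            i[dupmark_index] = j[college_index_b]
# a may be a copy, so write the column back, e.g. excel_df['dupmark4'] = a[:, dupmark_index]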
Avoid calling loc inside a loop; it has a large per-call overhead. And btw, you need the numerical index of each of the respective columns. Use this:
mysql_df.columns.get_loc("college_name")
You could avoid one of the loops by using apply, so that instead of MxN .loc operations there are only M of them.
for y in mysql_df['college_name']:
    match = excel_df['name_of_college_school'].apply(lambda x: SequenceMatcher(
        None, x.lower(), y.lower()).ratio() > 0.8)
    excel_df.loc[match, 'dupmark4'] = y
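For the update in the question (keeping the highest-scoring match rather than the last one above the threshold), one option is to compute the best candidate per name first and assign once. The sketch below uses only difflib and plain lists; the function name `best_match` and the sample names are hypothetical, and it is still a quadratic comparison, not a vectorized one.

```python
from difflib import SequenceMatcher

def best_match(name, candidates, threshold=0.8):
    """Return the candidate most similar to name, or None if none exceeds threshold."""
    best, best_score = None, threshold
    for cand in candidates:
        score = SequenceMatcher(None, name.lower(), cand.lower()).ratio()
        if score > best_score:       # keep only a strictly better match
            best, best_score = cand, score
    return best

colleges = ["Howard University", "Harvard University"]
print(best_match("Harvard Universty", colleges))
print(best_match("xyz", colleges))
```

Assuming the column names from the question, this could then be applied once per row, for example with `excel_df['name_of_college_school'].apply(lambda x: best_match(x, list(mysql_df['college_name'])))`, so an earlier, more relevant match is never overwritten by a weaker one.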

How to choose a range for a loop based upon the answers of a previous loop?

I'm sorry the title is so confusingly worded, but it's hard to condense this problem down to a few words.
I'm trying to find the minimum value of a specific equation. At first I'm looping through the equation, which for our purposes here can be something like y = .245x^3-.67x^2+5x+12. I want to design a loop where the "steps" through the loop get smaller and smaller.
For example, the first time it loops through, it uses a step of 1. I will get about 30 values. What I need help with is how to use the three smallest values I receive from this first loop.
Here's an example of the values I might get from the first loop: (I should note this isn't supposed to be actual code at all. It's just a brief description of what's happening)
loop from x = 1 to 8 with step 1
results:
x = 1 -> y = 30
x = 2 -> y = 28
x = 3 -> y = 25
x = 4 -> y = 21
x = 5 -> y = 18
x = 6 -> y = 22
x = 7 -> y = 27
x = 8 -> y = 33
I want something that can detect the lowest three values and create a loop. From these results, the values of x that give the smallest three results for y are x = 4, 5, and 6.
So my "guess" at this point would be x = 5. To get a better "guess" I'd like a loop that now does:
loop from x = 4 to x = 6 with step .5
I could keep this pattern going until I get an absurdly accurate guess for the minimum value of x.
Does anybody know of a way I can do this? I know the values I'm going to get are going to be able to be modeled by a parabola opening up, so this format will definitely work. I was thinking that the values could be put into a column. It wouldn't be hard to make something that returns the smallest value for y in that column, and the corresponding x-value.
If I'm being too vague, just let me know, and I can answer any questions you might have.
nice question. Here's at least a start for what I think you should do for this:
Sub findMin()
    Dim lowest As Integer
    Dim middle As Integer
    Dim highest As Integer
    ' 999 acts as a sentinel: retVal(999) is larger than any value seen in the loop
    lowest = 999
    middle = 999
    highest = 999
    Dim i As Integer
    i = 1
    Do While i < 9
        If (retVal(i) < retVal(lowest)) Then
            highest = middle
            middle = lowest
            lowest = i
        Else
            If (retVal(i) < retVal(middle)) Then
                highest = middle
                middle = i
            Else
                If (retVal(i) < retVal(highest)) Then
                    highest = i
                End If
            End If
        End If
        i = i + 1
    Loop
End Sub
Function retVal(num As Integer) As Double
    ' Sqr is the square root in VBA, so use ^ for the powers
    retVal = 0.245 * num ^ 3 - 0.67 * num ^ 2 + 5 * num + 12
End Function
What I've done here is set three Integers as your three Min values: lowest, middle, and highest. You loop through the values you're plugging into the formula (here, the retVal function) and comparing the return value of retVal (hence the name) to the values of retVal(lowest), retVal(middle), and retVal(highest), replacing them as necessary. I'm just beginning with VBA so what I've done likely isn't very elegant, but it does at least identify the Integers that result in the lowest values of the function. You may have to play around with the values of lowest, middle, and highest a bit to make it work. I know this isn't EXACTLY what you're looking for, but it's something along the lines of what I think you should do.
There is no trivial way to approach this unless the problem domain is narrowed.
The example polynomial given in fact has no minimum, which is readily determined by observing y'>0 (hence, y is always increasing WRT x).
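To spell out that observation for the example polynomial, the derivative is a quadratic with a negative discriminant and a positive leading coefficient:

```latex
y' = 0.735x^2 - 1.34x + 5,
\qquad
\Delta = (-1.34)^2 - 4(0.735)(5) = 1.7956 - 14.7 = -12.9044 < 0
\;\Longrightarrow\; y' > 0 \text{ for all } x.
```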
Given the wide interpretation of
"[an] equation, which for our purposes here can be something like y = .245x^3-.67x^2+5x+12"
many conditions need to be checked, even assuming the domain is limited to polynomials.
The polynomial order is significant, and the order determines what conditions are necessary to check for how many solutions are possible, or whether any solution is possible at all.
Without taking this complexity into account, an iterative approach could yield an incorrect solution due to underflow error, or an unfortunate choice of iteration steps or bounds.
I'm not trying to be hard here, I think your idea is neat. In practice it is more complicated than you think.
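For the narrow case the question describes (values shaped like a parabola opening upward on the search interval), the shrinking-step loop can be sketched as follows. This is a minimal illustration, not a general minimizer: the function name `refine_minimum` and the test function are hypothetical, and since the example cubic has no minimum, a made-up parabola with its minimum at x = 5.3 is used instead.

```python
def refine_minimum(f, lo, hi, step=1.0, tol=1e-6):
    """Grid-search f on [lo, hi], then repeat around the best point with half the step."""
    best = lo
    while step > tol:
        n = int(round((hi - lo) / step))
        xs = [lo + k * step for k in range(n + 1)]   # current grid of x values
        best = min(xs, key=f)                        # x with the smallest y
        # new range: one step on either side of the best x, kept inside the old range
        lo, hi = max(lo, best - step), min(hi, best + step)
        step /= 2
    return best

def parabola(x):
    # hypothetical test function with a single minimum at x = 5.3
    return (x - 5.3) ** 2 + 18

x_min = refine_minimum(parabola, 1, 8)
print(round(x_min, 4))
```

Keeping the neighbours on both sides of the best point plays the role of the "three smallest values" in the question; for functions with several dips or none, the caveats above still apply.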