Numpy deleting all rows with condition

I have to edit a CSV file.
I can already import it and transform it into a 2D array.
Now my job is to delete all rows where 0.0005 < array[i, 0] % 0.0025 < 0.9995.
(Basically, the first column contains steps at a 0.0025 interval, and I need to delete all rows where a step is accidentally bigger than it should be.)
I already tried the following:
length = len(data)
for i in range data:
    if 0.0005 < data[i,0]%0.0025 < 0.9995:
        np.delete(data, i, 0)
but it didn't work. Can anybody help me?

I see some issues with your approach. First, you should not iterate over an array while deleting elements from it mid-iteration. Second, np.delete returns a new array rather than operating in place, so your call to it does nothing. There is also a small syntax error in your range definition: range data should be range(length).
Perhaps use an index combining both conditions, remembering to assign the result:
data = np.delete(data, (data[:, 0] % 0.0025 > 0.0005) & (data[:, 0] % 0.0025 < 0.9995), 0)
We can verify this on a similar problem: remove all rows whose first element x satisfies 1 < x < 4 (I removed the mod since it doesn't matter for the example):
data = np.array([[i, 2, 3] for i in range(1, 6)])
>>> [[1,2,3],[2,2,3],...,[5,2,3]]
data = np.delete(data, (data[:, 0] > 1) & (data[:, 0] < 4), 0)
>>> [[1,2,3],[4,2,3],[5,2,3]]
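For what it's worth, the same filtering can be written with plain boolean indexing, which is the more common numpy idiom (a small sketch reusing the mask from above):

import numpy as np

# keep the rows where the condition does NOT hold
mask = (data[:, 0] % 0.0025 > 0.0005) & (data[:, 0] % 0.0025 < 0.9995)
data = data[~mask]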

Finding the index of max value of columns in numpy array but removing the previous max

I have an array with N rows and M columns.
I would like to run through all the columns, finding the index of the row that contains the max value of each column. However, each row may be selected only once.
For instance, consider the matrix
1 1
2 2
The output should be [1, 0]: row 1 (value 2) holds the max of column 0; then for column 1, row 1 is out of consideration, so row 0 has the highest remaining cell.
Indeed, this can be solved easily with a nested for loop, something like:
removed_rows = []
for i in range(nb_columns):
    index_max = 0
    value_max = A[0, i]
    for j in range(nb_rows):
        if j in removed_rows:
            continue
        if value_max < A[j, i]:
            index_max = j
            value_max = A[j, i]
    removed_rows.append(index_max)
However, it seems slow for a huge matrix. Is there any way to do it faster (with numpy)?
Many thanks
This might not be very fast, as it still loops over the columns, which I think is unavoidable given the constraint, but it should be faster than your solution because it finds each column's maximum with argmax:
out = []
mm = A.min() - 1
for j in range(A.shape[1]):
    idx = np.argmax(A[:, j])
    # replace the entire row with mm
    # so the next `argmax` will ignore this row
    A[idx] = mm
    out.append(idx)
The above takes about 640 µs on a 100 x 100 array and 18 ms on a 1k x 1k array. Your code refuses to finish on a 1k x 1k array in reasonable time on my system.
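For a quick sanity check, here is the same loop run on the 2 x 2 matrix from the question (a self-contained demo; note that it mutates A, so work on a copy if you still need the original):

import numpy as np

A = np.array([[1., 1.],
              [2., 2.]])
out = []
mm = A.min() - 1
for j in range(A.shape[1]):
    idx = np.argmax(A[:, j])
    A[idx] = mm  # knock this row out of future argmax calls
    out.append(int(idx))
print(out)  # [1, 0]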

Check a row for ascension in Numpy, but ignoring elements = 0

I have the code snippet below that searches each row/column in an array to see if all values are either ascending or descending. Ideally, this code would ignore zeros. For example, a row with (5, 0, 3, 1) would come up True for descending. The code below still looks at the zeros. If the masked technique is a dead end, maybe I could create a copy without zeros? I'm very new to Numpy so I would appreciate specific directions. Thanks!
np.ma.masked_equal(grid, 0)
for row in grid:
    if np.all(np.diff(row) <= 0) or np.all(np.diff(row) >= 0):
        monoScore += .5
for col in np.transpose(grid):
    if np.all(np.diff(col) <= 0) or np.all(np.diff(col) >= 0):
        monoScore += .5
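Note that np.ma.masked_equal returns a new masked array rather than modifying grid in place, so the loops above still see the zeros. One way to actually ignore them (a sketch of mine, not from the original thread) is to drop the zeros from each row or column before diffing:

import numpy as np

def is_monotonic_ignoring_zeros(vec):
    # drop the zeros, then test for non-increasing or non-decreasing order
    nz = vec[vec != 0]
    return np.all(np.diff(nz) <= 0) or np.all(np.diff(nz) >= 0)

print(is_monotonic_ignoring_zeros(np.array([5, 0, 3, 1])))  # True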

Optimizing specific numbers to reach value

I'm trying to make a program that, when given specific values (let's say 1, 4 and 10), will work out how many of each value are needed to reach a certain amount, say 19.
It will always try to use as many high values as possible, so in this case, the result should be 10*1, 4*2, 1*1.
I tried thinking about it, but couldn't come up with an algorithm that works...
Any help or hints would be welcome!
Here is a Python solution that tries all the choices until one is found. If you pass the values in descending order, the first solution found will be the one that uses as many high values as possible:
def solve(left, idx, nums, used):
    if left == 0:
        return True
    for i in range(idx, len(nums)):
        # try the largest possible count of nums[i] first, then back off
        j = left // nums[i]
        while j > 0:
            used.append((nums[i], j))
            if solve(left - j * nums[i], i + 1, nums, used):
                return True
            used.pop()
            j -= 1
    return False

solution = []
solve(19, 0, [10, 4, 1], solution)
print(solution)  # will print [(10, 1), (4, 2), (1, 1)]
If anyone needs a simple algorithm, one way I found was:
sort the values in descending order
keep track of how many of each value are kept
for each value, do:
    if the sum is equal to the target, stop
    if it isn't the first value, remove one of the previous values
    while the total sum of values is smaller than the objective:
        add the current value once
(As juviant mentioned, this won't work if it skips larger numbers and only uses smaller ones! I'll try to improve it and post a new version when I get it to work.)
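For concreteness, here is a literal sketch of that procedure (my own illustrative code, which inherits the flaw mentioned above):

def greedy(target, values):
    values = sorted(values, reverse=True)
    kept = []  # values added so far
    for k, v in enumerate(values):
        if sum(kept) == target:
            break
        if k > 0 and kept:
            kept.pop()  # remove one of the previous values
        while sum(kept) < target:
            kept.append(v)  # add the current value once
    return kept

print(greedy(19, [10, 4, 1]))  # [10, 4, 4, 1]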

vectorize join condition in pandas

This code works correctly, as expected, but it takes a lot of time for large dataframes.
from difflib import SequenceMatcher

for i in excel_df['name_of_college_school']:
    for y in mysql_df['college_name']:
        if SequenceMatcher(None, i.lower(), y.lower()).ratio() > 0.8:
            excel_df.loc[excel_df['name_of_college_school'] == i, 'dupmark4'] = y
I guess I cannot use a function in a join clause to compare values like this.
How do I vectorize this?
Update:
Is it possible to keep the match with the highest score? This loop overwrites an earlier match, and it is possible that the earlier match was more relevant than the current one.
What you are looking for is fuzzy merging.
# note: as_matrix() was removed in newer pandas; use .to_numpy() instead
a = excel_df.as_matrix()
b = mysql_df.as_matrix()
for i in a:
    for j in b:
        if SequenceMatcher(None, i[college_index_a].lower(),
                           j[college_index_b].lower()).ratio() > 0.8:
            i[dupmark_index] = j[college_index_b]
Never use loc in a loop; it has a huge overhead. And by the way, get the numerical index of the respective columns with:
df.columns.get_loc("college name")
You could avoid one of the loops using apply: instead of M x N .loc operations, it'll now be M operations.
for y in mysql_df['college_name']:
    match = excel_df['name_of_college_school'].apply(
        lambda x: SequenceMatcher(None, x.lower(), y.lower()).ratio() > 0.8)
    excel_df.loc[match, 'dupmark4'] = y
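Regarding the update: one possible way to keep the best-scoring match instead of the last one (a sketch of my own, not from the original answers) is to track the highest ratio per row:

from difflib import SequenceMatcher

def best_match(name, candidates, threshold=0.8):
    # return the candidate with the highest ratio above the threshold, else None
    best, best_score = None, threshold
    for cand in candidates:
        score = SequenceMatcher(None, name.lower(), cand.lower()).ratio()
        if score > best_score:
            best, best_score = cand, score
    return best

excel_df['dupmark4'] = excel_df['name_of_college_school'].apply(
    lambda x: best_match(x, mysql_df['college_name']))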

torch logical indexing of tensor

I'm looking for an elegant way to select a subset of a torch tensor which satisfies some constraints.
For example, say I have:
A = torch.rand(10,2)-1
and S is a 10x1 tensor,
sel = torch.ge(S,5) -- this is a ByteTensor
I would like to be able to do logical indexing, as follows:
A1 = A[sel]
But that doesn't work.
So there's the index function, which accepts a LongTensor, but I could not find a simple way to convert S to a LongTensor, except the following:
sel = torch.nonzero(sel)
which returns a K x 2 tensor (K being the number of values of S >= 5). Then I have to convert it to a 1-dimensional array, which finally allows me to index A:
A:index(1,torch.squeeze(sel:select(2,1)))
This is very cumbersome; in Matlab, for example, all I'd have to do is
A(S>=5,:)
Can anyone suggest a better way?
One possible alternative is:
sel = S:ge(5):expandAs(A) -- now you can use this mask with the [] operator
A1 = A[sel]:unfold(1, 2, 2) -- unfold to get back a 2D tensor
Example:
> A = torch.rand(3,2)-1
-0.0047 -0.7976
-0.2653 -0.4582
-0.9713 -0.9660
[torch.DoubleTensor of size 3x2]
> S = torch.Tensor{{6}, {1}, {5}}
6
1
5
[torch.DoubleTensor of size 3x1]
> sel = S:ge(5):expandAs(A)
1 1
0 0
1 1
[torch.ByteTensor of size 3x2]
> A[sel]
-0.0047
-0.7976
-0.9713
-0.9660
[torch.DoubleTensor of size 4]
> A[sel]:unfold(1, 2, 2)
-0.0047 -0.7976
-0.9713 -0.9660
[torch.DoubleTensor of size 2x2]
There are two simpler alternatives:
Use maskedSelect:
result = A:maskedSelect(your_byte_tensor)
Use a simple element-wise multiplication, for example:
result = torch.cmul(A, S:gt(0))
The second one is very useful if you need to keep the shape of the original matrix (i.e. A), for example to select neurons in a layer during backprop. However, since it puts zeros in the resulting matrix wherever the condition dictated by the ByteTensor doesn't hold, you can't use it to compute a product (or median, etc.). The first one only returns the elements that satisfy the condition, so that is what I'd use to compute products, medians, or anything else where I don't want zeros.
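As an aside, if you are on PyTorch rather than Lua Torch (an assumption about your setup), boolean mask indexing works directly, much like the Matlab one-liner:

import torch

A = torch.rand(10, 2) - 1
S = torch.rand(10, 1) * 10
# select the rows of A where the corresponding entry of S is >= 5
A1 = A[(S >= 5).squeeze(1)]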