Numpy remove rows with same column values

How do I remove rows from an ndarray that share the same value in the nth column?
For example,
a = np.array([[1, 3, 4],
              [1, 3, 4],
              [1, 3, 5]])
I want the rows to be unique by the third column.
I want to have just the [1, 3, 5] row left.
numpy.unique does not do it. It checks for uniqueness across every column; I can't specify the
column by which to check uniqueness.
How can I do this efficiently for a thousand or more rows?
Thank you.

You could try a combination of bincount, nonzero and in1d:
import numpy as np
a = np.array([[1, 3, 4],
              [1, 3, 4],
              [1, 3, 5]])
# A tuple holding an array of the values that occur exactly once in the third column
unique_in_column = (np.bincount(a[:, 2]) == 1).nonzero()
# Boolean mask selecting the rows whose third-column value is one of those values
unique_index = np.in1d(a[:, 2], unique_in_column[0])
unique_a = a[unique_index]
This should do the trick. However, I'm not sure how this method scales with 1000+ rows.
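Since np.bincount only accepts non-negative integers, a variant built on np.unique may be more general. The following is a minimal sketch, not part of the answer above; the helper name rows_with_unique_value is made up for illustration. It keeps only the rows whose value in the chosen column occurs exactly once:

```python
import numpy as np

def rows_with_unique_value(a, col):
    """Keep rows whose value in column `col` appears exactly once (hypothetical helper)."""
    # counts[k] is how often the k-th distinct value occurs;
    # inverse maps every row back to the index of its distinct value.
    _, inverse, counts = np.unique(a[:, col], return_inverse=True, return_counts=True)
    return a[counts[inverse] == 1]

a = np.array([[1, 3, 4],
              [1, 3, 4],
              [1, 3, 5]])
unique_a = rows_with_unique_value(a, 2)
print(unique_a)  # [[1 3 5]]
```

np.unique sorts the column internally, so the cost is roughly that of one sort, which should be comfortable for a few thousand rows.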

I had done this finally:
repeatdict = {}
todel = []
for i, row in enumerate(kplist):
    if repeatdict.get(row[2], 0):
        # Third-column value already seen: mark this row for deletion
        todel.append(i)
    else:
        repeatdict[row[2]] = 1
kplist = np.delete(kplist, todel, axis=0)
Basically, I iterate over the rows and store the values of the third column in the repeatdict dict. If a later row has a value that is already in repeatdict, that row is marked for deletion by storing its index in the todel list.
Then we get rid of the unwanted rows by calling np.delete with the list of all row indices we want to delete.
Also, I'm not marking my own answer as accepted, because I know there's probably a better way to do this with just numpy magic.
I'll wait.
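For what it's worth, the loop above can probably be replaced by np.unique with return_index=True, which reports the index of the first occurrence of each distinct value. This is a minimal sketch, assuming the goal is the same as the loop's (keep the first row for every distinct third-column value); the function name is hypothetical and kplist is a stand-in array:

```python
import numpy as np

def keep_first_per_value(a, col):
    """Keep the first row for each distinct value in column `col` (hypothetical helper)."""
    # return_index gives the position of the first occurrence of each distinct value
    _, first_idx = np.unique(a[:, col], return_index=True)
    # Sort the positions so the surviving rows keep their original order
    return a[np.sort(first_idx)]

kplist = np.array([[1, 3, 4],
                   [1, 3, 4],
                   [1, 3, 5]])
deduped = keep_first_per_value(kplist, 2)
print(deduped)
# [[1 3 4]
#  [1 3 5]]
```

Note that this keeps one row per distinct value, exactly as the loop does, rather than dropping every row whose value is repeated.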

Related

Sorting an array based on one column, then based on a second column

I would like to sort an array based on one column and then, for all rows whose values in that column are equal, sort them based on a second column. For example, suppose that I have the array:
a = np.array([[0,1,1],[0,3,1],[1,7,2],[0,2,1]])
I can sort it by column 0 using:
sorted_array = a[np.argsort(a[:, 0])]
However, I want rows that have equal values in column 0 to be sorted by column 1, so my result would look like:
desired_result = np.array([[0,1,1],[0,2,1],[0,3,1],[1,7,2]])
What is the best way to achieve that? Thanks.

You can sort the rows as tuples, then convert back to a numpy array:
out = np.array(sorted(map(tuple,a)))
Output:
array([[0, 1, 1],
       [0, 2, 1],
       [0, 3, 1],
       [1, 7, 2]])

First sort the array by the secondary column, then sort by the primary column, making sure to use a stable sorting method for the second sort:
sorted_array = a[np.argsort(a[:, 1])]
sorted_array = sorted_array[np.argsort(sorted_array[:, 0], kind='stable')]
Or you can use lexsort, which treats the last key passed as the primary sort key:
sorted_array = a[np.lexsort((a[:, 1], a[:, 0])), :]
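As a quick check, the two approaches above can be run side by side on the array from the question. This is only an illustrative sketch; the variable names by_lexsort and two_pass are not from the answer:

```python
import numpy as np

a = np.array([[0, 1, 1], [0, 3, 1], [1, 7, 2], [0, 2, 1]])

# lexsort: keys are listed from least to most significant,
# so column 0 (passed last) is the primary key.
by_lexsort = a[np.lexsort((a[:, 1], a[:, 0]))]

# Two-pass variant: secondary column first, then a stable sort on the primary column.
two_pass = a[np.argsort(a[:, 1], kind='stable')]
two_pass = two_pass[np.argsort(two_pass[:, 0], kind='stable')]

print(by_lexsort)
# [[0 1 1]
#  [0 2 1]
#  [0 3 1]
#  [1 7 2]]
print(np.array_equal(by_lexsort, two_pass))  # True
```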

Numpy, how to retrieve sub-array of array (specific indices)?

I have an array:
>>> arr1 = np.array([[1,2,3], [4,5,6], [7,8,9]])
>>> arr1
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
I want to retrieve a list (or 1d-array) of elements of this array by giving a list of their indices, like so:
indices = [[0,0], [0,2], [2,0]]
print(arr1[indices])
# result
[1,6,7]
But it does not work. I have been looking for a solution for a while, but I only found ways to select per row and/or per column (not per specific indices).
Does anyone have an idea?
Cheers
Aymeric

First make indices an array instead of a nested list:
indices = np.array([[0,0], [0,2], [2,0]])
Then, index the first dimension of arr1 using the first values of indices, likewise the second:
arr1[indices[:,0], indices[:,1]]
It gives array([1, 3, 7]) (which is correct, your [1, 6, 7] example output is probably a typo).
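Put together as a runnable sketch (the names rows, cols, picked and same are only for illustration), the indexing above looks like this. Passing a tuple of index arrays is an equivalent spelling:

```python
import numpy as np

arr1 = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
indices = np.array([[0, 0], [0, 2], [2, 0]])

# Split the (row, column) pairs into one array of rows and one of columns
rows, cols = indices.T
picked = arr1[rows, cols]

# Equivalent: a tuple with one index array per dimension
same = arr1[tuple(indices.T)]

print(picked)  # [1 3 7]
```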

How do you create a "count" matrix from a series?

I have a Pandas series of lists of categorical variables. For example:
df = pd.Series([["A", "A", "B"], ["A", "C"]])
Note that in my case the series is pretty long (50K elements) and the number of possible distinct elements in the list is also big (20K elements).
I would like to obtain a matrix that has one column for each distinct feature, with its count as the value. For the previous example, this means:
[[2, 1, 0], [1, 0, 1]]
This is the same output as the one obtained with one-hot encoding, except that it contains the count instead of just 1.
What is the best way to achieve this?

Let's try explode:
df.explode().groupby(level=0).value_counts().unstack(fill_value=0)
Output:
   A  B  C
0  2  1  0
1  1  0  1
To get a list of lists (as a 2-D array), chain the above with .values:
array([[2, 1, 0],
       [1, 0, 1]])
Note that you will end up with a 50K x 20K array.
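If the intermediate groupby turns out to be slow at that size, the same counts can probably be accumulated directly with pd.factorize and np.add.at. This is a minimal sketch, not the answer's method; it assumes the series has the default 0..n-1 index, and the names s, row, codes, labels and matrix are illustrative:

```python
import numpy as np
import pandas as pd

s = pd.Series([["A", "A", "B"], ["A", "C"]])

# One entry per list element; the index repeats the original row number
exploded = s.explode()
row = exploded.index.to_numpy()

# Integer code per element and the sorted distinct labels (one per output column)
codes, labels = pd.factorize(exploded, sort=True)

# Empty lists explode to NaN, which factorize marks with code -1; skip those
valid = codes >= 0

matrix = np.zeros((len(s), len(labels)), dtype=np.int64)
# Unbuffered add, so repeated (row, column) pairs are each counted
np.add.at(matrix, (row[valid], codes[valid]), 1)

print(list(labels))  # ['A', 'B', 'C']
print(matrix)
# [[2 1 0]
#  [1 0 1]]
```

The result is still dense, so at 50K x 20K the memory footprint remains the main concern; the (row, code) pairs would be the natural input for a sparse matrix instead.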

Get indices for values of one array in another array

I have two 1D-arrays containing the same set of values, but in a different (random) order. I want to find the list of indices, which reorders one array according to the other one. For example, my 2 arrays are:
ref = numpy.array([5,3,1,2,3,4])
new = numpy.array([3,2,4,5,3,1])
and I want the list order for which new[order] == ref.
My current idea is:
def find(val):
return numpy.argmin(numpy.absolute(ref-val))
order = sorted(range(new.size), key=lambda x:find(new[x]))
However, this only works as long as no values are repeated. In my example 3 appears twice, and I get new[order] = [5 3 3 1 2 4]. The second 3 is placed directly after the first one, because my function find() does not track which 3 I am currently looking for.
So I could add something to deal with this, but I have a feeling there might be a better solution out there. Maybe in some library (NumPy or SciPy)?
Edit about the duplicate: The linked solution assumes that the arrays are ordered or, for the "unordered" solution, returns duplicate indices. I need each index to appear only once in order. Which of the duplicates comes first, however, is not important (nor can it be determined from the data provided).
What I get with sort_idx = A.argsort(); order = sort_idx[np.searchsorted(A,B,sorter = sort_idx)] is: [3, 0, 5, 1, 0, 2]. But what I am looking for is [3, 0, 5, 1, 4, 2].

Given ref, new which are shuffled versions of each other, we can get the unique indices that map ref to new using the sorted version of both arrays and the invertibility of np.argsort.
Start with:
i = np.argsort(ref)
j = np.argsort(new)
Now ref[i] and new[j] both give the sorted version of the arrays, which is the same for both. You can invert the first sort by doing:
k = np.argsort(i)
Now ref is just new[j][k], or new[j[k]]. Since all the operations are shuffles using unique indices, the final index j[k] is unique as well. j[k] can be computed in one step with
order = np.argsort(new)[np.argsort(np.argsort(ref))]
From your original example:
>>> ref = np.array([5, 3, 1, 2, 3, 4])
>>> new = np.array([3, 2, 4, 5, 3, 1])
>>> order = np.argsort(new)[np.argsort(np.argsort(ref))]
>>> order
array([3, 0, 5, 1, 4, 2])
>>> new[order] # Should give ref
array([5, 3, 1, 2, 3, 4])
This is probably not any faster than the more general solutions to the similar question on SO, but it does guarantee unique indices as you requested. A further optimization would be to replace np.argsort(i) with something like the argsort_unique function in this answer. I would go one step further and just compute the inverse of the sort:
def inverse_argsort(a):
    fwd = np.argsort(a)
    # inv[fwd[n]] = n, so inv undoes the permutation fwd
    inv = np.empty_like(fwd)
    inv[fwd] = np.arange(fwd.size)
    return inv
order = np.argsort(new)[inverse_argsort(ref)]
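A small sanity check of the two claims (new[order] reproduces ref, and every index appears exactly once) might look like the sketch below. The random test data and the use of kind='stable' are my own assumptions, not part of the answer:

```python
import numpy as np

def inverse_argsort(a):
    # Same idea as above: build the inverse permutation without a second sort
    fwd = np.argsort(a, kind='stable')
    inv = np.empty_like(fwd)
    inv[fwd] = np.arange(fwd.size)
    return inv

rng = np.random.default_rng(0)
ref = rng.integers(0, 5, size=20)   # many repeated values on purpose
new = rng.permutation(ref)          # shuffled copy of ref

order = np.argsort(new, kind='stable')[inverse_argsort(ref)]

reproduces_ref = np.array_equal(new[order], ref)
is_permutation = np.array_equal(np.sort(order), np.arange(ref.size))
print(reproduces_ref, is_permutation)  # True True
```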

Just add new (different) elements to the array in Ruby (Rails)?

I want to create an array in Rails that contains every value of two columns, but each value just once. So, for example, column "A" contains {1,5,7,1,7} and column "B" contains {3,2,3,1,4}.
When I just wanted an array with all elements of "A", I would write:
Model.uniq.pluck(:A)
And I would get {1,5,7}.
Is there an option in Rails to do the same thing with two columns, so that I get every value contained in either column exactly once? (Here it would be {1,5,7,3,2,4})
Thanks for help!

Yup, pass multiple column names to pluck:
Model.pluck(:A, :B)
#=> [[1, 3], [5, 2], [7, 3], [1, 1], [7, 4]]
But of course you want the values together and uniqued so:
Model.pluck(:A, :B).flatten.uniq
#=> [1, 3, 5, 2, 7, 4]
Doing Model.uniq.pluck(:A, :B).flatten won’t work since it will just get distinct rows (i.e. combinations of A & B), so you’d still have to uniq again after flattening.

records = []
Model.all.map {|e| records << [e.A, e.B] }
uniq_records = records.flatten.uniq
Hope this would help you.
Thanks
