I was expecting to just say something like
ma.zeros(my_shape, mask=my_mask, hard_mask=True)
(where the mask is the correct shape) but ma.zeros (or ma.ones or ma.empty) rather surprisingly doesn't recognise the mask argument. The simplest I've come up with is
ma.array(np.zeros(my_shape), mask=my_mask, hard_mask=True)
which seems to involve unnecessary copying of lots of zeros. Is there a better way?
Make a masked array:
In [162]: x = np.arange(5); mask=np.array([1,0,0,1,0],bool)
In [163]: M = np.ma.MaskedArray(x,mask)
In [164]: M
Out[164]:
masked_array(data=[--, 1, 2, --, 4],
mask=[ True, False, False, True, False],
fill_value=999999)
Modify x, and see the result in M:
In [165]: x[-1] = 10
In [166]: M
Out[166]:
masked_array(data=[--, 1, 2, --, 10],
mask=[ True, False, False, True, False],
fill_value=999999)
In [167]: M.data
Out[167]: array([ 0, 1, 2, 3, 10])
In [169]: M.data.base
Out[169]: array([ 0, 1, 2, 3, 10])
M.data is a view of the array used to create it. No unnecessary copies.
I haven't used functions like np.ma.zeros, but
In [177]: np.ma.zeros
Out[177]: <numpy.ma.core._convert2ma at 0x1d84a052af0>
_convert2ma is a Python class that takes a function name and returns a new callable. It does not add mask-specific parameters. Study it yourself if necessary.
np.ma.MaskedArray, the class that actually subclasses ndarray, takes a copy parameter:
copy : bool, optional
Whether to copy the input data (True), or to use a reference instead.
Default is False.
and the first line of its __new__ is
_data = np.array(data, dtype=dtype, copy=copy,
                 order=order, subok=True, ndmin=ndmin)
I haven't quite sorted out whether M._data is just a reference to the source data, or a view. In either case, it isn't a copy, unless you say so.
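As a quick check (my own sketch, not from the original answer), np.shares_memory confirms that constructing a MaskedArray from an existing array does not copy its buffer:

import numpy as np

my_shape = (3, 4)
my_mask = np.zeros(my_shape, dtype=bool)
my_mask[0, 0] = True

base = np.zeros(my_shape)                                  # allocated once
M = np.ma.MaskedArray(base, mask=my_mask, hard_mask=True)  # copy=False is the default

print(np.shares_memory(base, M.data))  # True: the buffer is shared, not copied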
I haven't worked a lot with masked arrays, but my impression is that, while they can be convenient, they shouldn't be used where you are concerned about performance. There's a lot of extra work required to maintain both the mask and the data. The extra time involved in copying the data array, if any, will be minor.
I am trying to filter a numpy array of arrays with given conditions, for example
input = np.array([[1,2,3],[4,5,6],[4,5,6],[0,9,19]])
output where column [0] >= 4, column [1] >= 5, column [2] >= 6
expected result = np.array([[4,5,6],[4,5,6]])
What would be the best way to achieve this with performance in mind?
Extended question: how can I retrieve the corresponding index of each output element in the input array?
You can do:
a = np.array([[1,2,3],[4,5,6],[4,5,6],[0,9,19]])
a[(a[:,0] >=4) & (a[:,1] >= 5) & (a[:,2] >=6)]
Here you create boolean masks for the conditions on each element in each row of the data, combine them with a logical and, and finally use the resulting mask to select the matching data rows.
To find the indices of the data rows matching the conditions, you can use numpy's where() function:
idx = np.where((a[:,0] >= 4) & (a[:,1] >= 5) & (a[:,2] >= 6))[0]
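For the sample input above, both expressions agree (a quick check, my own addition):

import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6], [4, 5, 6], [0, 9, 19]])
mask = (a[:, 0] >= 4) & (a[:, 1] >= 5) & (a[:, 2] >= 6)   # [False, True, True, False]

print(a[mask])            # [[4 5 6]
                          #  [4 5 6]]
print(np.where(mask)[0])  # [1 2] -- row indices of the matching rows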
As per your request, a numba version
import numpy as np
import numba as nb
import sys
import timeit
target = np.random.randint(low=-100000, high=100000, size=(int(sys.argv[2]), 3), dtype=np.int64)
comp = np.array([4, 5, 6])

@nb.njit((nb.int64[:, :], nb.int64[::1]), parallel=True)
def cmp(a, b):
    c = np.empty((a.shape[0],), dtype=a.dtype)
    for i in nb.prange(a.shape[0]):
        c[i] = a[i][0] > b[0] and a[i][1] > b[1] and a[i][2] > b[2]
    return c

def cmp_normal(a, b):
    # return np.all(a > b, axis=1)
    return (a[:, 0] >= b[0]) & (a[:, 1] >= b[1]) & (a[:, 2] >= b[2])
print(timeit.timeit(lambda: eval(sys.argv[1])(target, comp), number=10))
The first output time is for sequential numba, the second one for parallel numba. Parallel numba gives about a 5x speed-up over sequential:
(base) xxx@xxx:~$ python test.py cmp 1000000
6.40756068899982
(base) xxx@xxx:~$ python test.py cmp 1000000
1.3425709140001345
Now vanilla numpy:
(base) xxx@xxx:~$ python test.py cmp_normal 1000000
4.04174472700015
Numba parallel is the fastest. But if you try to return a[c] instead, numba slows down, so it depends on what you write.
In [223]: arr =np.array([[1,2,3],[4,5,6],[4,5,6],[0,9,19]])
In [224]: arr
Out[224]:
array([[ 1, 2, 3],
[ 4, 5, 6],
[ 4, 5, 6],
[ 0, 9, 19]])
Since you are testing values, one for each column, you can do a simple numpy == test (the (3,) test broadcasts with the (4,3) arr)
In [225]: arr==[4,5,6]
Out[225]:
array([[False, False, False],
[ True, True, True],
[ True, True, True],
[False, False, False]])
and where a whole row is true:
In [226]: (arr==[4,5,6]).all(axis=1)
Out[226]: array([False, True, True, False])
This can be applied as a boolean mask to select those rows from arr:
In [227]: arr[_]
Out[227]:
array([[4, 5, 6],
[4, 5, 6]])
and the numeric indices:
In [228]: np.nonzero(__)
Out[228]: (array([1, 2]),)
Assume a numpy array (actually Pandas) of the form:
value    included
0.123    False
0.127    True
0.140    True
0.111    False
0.159    True
0.321    True
0.444    True
0.323    True
0.432    False
I'd like to split the array such that False elements are excluded and successive runs of True elements are split into their own array. So for the above case, we'd end up with:
[[0.127, True,
  0.140, True],
 [0.159, True,
  0.321, True,
  0.444, True,
  0.323, True]]
I can certainly do this by pushing individual elements onto lists, but surely there must be a more numpy-ish way to do this.
You can create group identifiers by inverting the mask with ~ and taking Series.cumsum, keep only the True rows by boolean indexing, and then create a list of DataFrames with DataFrame.groupby:
dfs = [v for k, v in df.groupby((~df['included']).cumsum()[df['included']])]
print (dfs)
[ value included
1 0.127 True
2 0.140 True, value included
4 0.159 True
5 0.321 True
6 0.444 True
7 0.323 True]
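To see what the grouper produces before the groupby, here is a quick sketch (my own addition) using a df built from the question's data:

import pandas as pd

df = pd.DataFrame({'value': [0.123, 0.127, 0.140, 0.111, 0.159, 0.321, 0.444, 0.323, 0.432],
                   'included': [False, True, True, False, True, True, True, True, False]})

grouper = (~df['included']).cumsum()[df['included']]
print(grouper)
# 1    1
# 2    1
# 4    2
# 5    2
# 6    2
# 7    2
# Name: included, dtype: int64 (dtype may vary by platform)

Each run of True rows gets its own integer label, because the cumulative sum only increases on a False row.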
It is also possible to convert the DataFrames to arrays with DataFrame.to_numpy:
dfs = [v.to_numpy() for k, v in df.groupby((~df['included']).cumsum()[df['included']])]
print (dfs)
[array([[0.127, True],
[0.14, True]], dtype=object), array([[0.159, True],
[0.321, True],
[0.444, True],
[0.32299999999999995, True]], dtype=object)]
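If you'd rather stay in plain numpy, here is a minimal sketch (my own variant, not part of the answer above) that splits at the points where the mask flips and keeps only the True runs:

import numpy as np

values = np.array([0.123, 0.127, 0.140, 0.111, 0.159, 0.321, 0.444, 0.323, 0.432])
included = np.array([False, True, True, False, True, True, True, True, False])

# indices where the mask flips define the run boundaries
edges = np.flatnonzero(np.diff(included.astype(int))) + 1
runs = np.split(values, edges)  # alternating False/True runs
true_runs = [r for r, keep in zip(runs, np.split(included, edges)) if keep[0]]
print(true_runs)  # [array([0.127, 0.14]), array([0.159, 0.321, 0.444, 0.323])]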
Does anyone know how I can get the index positions of duplicate items in a Python list?
I have tried doing this, and it keeps giving me only the index of the first occurrence of the item in the list.
List = ['A', 'B', 'A', 'C', 'E']
I want it to give me:
index 0: A
index 2: A
You want to pass in the optional second parameter to index, the location where you want index to start looking. After you find each match, reset this parameter to the location just after the match that was found.
def list_duplicates_of(seq, item):
    start_at = -1
    locs = []
    while True:
        try:
            loc = seq.index(item, start_at + 1)
        except ValueError:
            break
        else:
            locs.append(loc)
            start_at = loc
    return locs
source = "ABABDBAAEDSBQEWBAFLSAFB"
print(list_duplicates_of(source, 'B'))
Prints:
[1, 3, 5, 11, 15, 22]
You can find all the duplicates at once in a single pass through source, by using a defaultdict to keep a list of all seen locations for any item, and returning those items that were seen more than once.
from collections import defaultdict
def list_duplicates(seq):
    tally = defaultdict(list)
    for i, item in enumerate(seq):
        tally[item].append(i)
    return ((key, locs) for key, locs in tally.items()
            if len(locs) > 1)

for dup in sorted(list_duplicates(source)):
    print(dup)
Prints:
('A', [0, 2, 6, 7, 16, 20])
('B', [1, 3, 5, 11, 15, 22])
('D', [4, 9])
('E', [8, 13])
('F', [17, 21])
('S', [10, 19])
If you want to do repeated testing for various keys against the same source, you can use functools.partial to create a new function variable, using a "partially complete" argument list, that is, specifying the seq, but omitting the item to search for:
from functools import partial
dups_in_source = partial(list_duplicates_of, source)
for c in "ABDEFS":
print(c, dups_in_source(c))
Prints:
A [0, 2, 6, 7, 16, 20]
B [1, 3, 5, 11, 15, 22]
D [4, 9]
E [8, 13]
F [17, 21]
S [10, 19]
>>> def indices(lst, item):
...     return [i for i, x in enumerate(lst) if x == item]
...
>>> indices(List, "A")
[0, 2]
To get all duplicates, you can use the below method, but it is not very efficient. If efficiency is important you should consider Ignacio's solution instead.
>>> dict((x, indices(List, x)) for x in set(List) if List.count(x) > 1)
{'A': [0, 2]}
As for solving it using the index method of list instead, that method takes a second optional argument indicating where to start, so you could just repeatedly call it with the previous index plus 1.
>>> List.index("A")
0
>>> List.index("A", 1)
2
I benchmarked all of the solutions suggested here and also added another solution to this problem (described at the end of the answer).
Benchmarks
First, the benchmarks. I initialize a list of n random ints in the range [1, n/2] and then call timeit on all the algorithms.
The solutions of @Paul McGuire and @Ignacio Vazquez-Abrams run about twice as fast as the rest on a list of 100 ints:
Testing algorithm on the list of 100 items using 10000 loops
Algorithm: dupl_eat
Timing: 1.46247477189
####################
Algorithm: dupl_utdemir
Timing: 2.93324529055
####################
Algorithm: dupl_lthaulow
Timing: 3.89198786645
####################
Algorithm: dupl_pmcguire
Timing: 0.583058259784
####################
Algorithm: dupl_ivazques_abrams
Timing: 0.645062989076
####################
Algorithm: dupl_rbespal
Timing: 1.06523873786
####################
If you change the number of items to 1000, the difference becomes much bigger (BTW, I'll be happy if someone could explain why; presumably the solutions that rescan the list for every element are O(n²), while the single-pass dict-based ones are O(n)):
Testing algorithm on the list of 1000 items using 1000 loops
Algorithm: dupl_eat
Timing: 5.46171654555
####################
Algorithm: dupl_utdemir
Timing: 25.5582547323
####################
Algorithm: dupl_lthaulow
Timing: 39.284285326
####################
Algorithm: dupl_pmcguire
Timing: 0.56558489513
####################
Algorithm: dupl_ivazques_abrams
Timing: 0.615980005148
####################
Algorithm: dupl_rbespal
Timing: 1.21610942322
####################
On the bigger lists, the solution of @Paul McGuire continues to be the most efficient, and my algorithm begins to have problems.
Testing algorithm on the list of 1000000 items using 1 loops
Algorithm: dupl_pmcguire
Timing: 1.5019953958
####################
Algorithm: dupl_ivazques_abrams
Timing: 1.70856155898
####################
Algorithm: dupl_rbespal
Timing: 3.95820421595
####################
The full code of the benchmark is here
Another algorithm
Here is my solution to the same problem:
def dupl_rbespal(c):
    alreadyAdded = False
    dupl_c = dict()
    # sort the incoming list, but save the indexes of the sorted items
    sorted_ind_c = sorted(range(len(c)), key=lambda x: c[x])
    for i in range(len(c) - 1):  # loop over indexes of sorted items
        # if two consecutive indexes point to the same value, add it to the duplicates
        if c[sorted_ind_c[i]] == c[sorted_ind_c[i + 1]]:
            if not alreadyAdded:
                dupl_c[c[sorted_ind_c[i]]] = [sorted_ind_c[i], sorted_ind_c[i + 1]]
                alreadyAdded = True
            else:
                dupl_c[c[sorted_ind_c[i]]].append(sorted_ind_c[i + 1])
        else:
            alreadyAdded = False
    return dupl_c
Although it's not the best, it allowed me to generate the slightly different structure I needed for my problem (I needed something like a linked list of indexes of the same value).
import collections

dups = collections.defaultdict(list)
for i, e in enumerate(L):
    dups[e].append(i)
for k, v in sorted(dups.iteritems()):
    if len(v) >= 2:
        print '%s: %r' % (k, v)
And extrapolate from there.
I think I found a simple solution after a lot of irritation:

if elem in string_list:
    counter = 0
    elem_pos = []
    for i in string_list:
        if i == elem:
            elem_pos.append(counter)
        counter = counter + 1
    print(elem_pos)
This prints a list giving you the indexes of a specific element ("elem")
Using new "Counter" class in collections module, based on lazyr's answer:
>>> import collections
>>> def duplicates(n):  # n = "123123123"
...     counter = collections.Counter(n)  # {'1': 3, '3': 3, '2': 3}
...     dups = [i for i in counter if counter[i] != 1]  # ['1', '3', '2']
...     result = {}
...     for item in dups:
...         result[item] = [i for i, j in enumerate(n) if j == item]
...     return result
...
>>> duplicates("123123123")
{'1': [0, 3, 6], '3': [2, 5, 8], '2': [1, 4, 7]}
from collections import Counter, defaultdict

def duplicates(lst):
    cnt = Counter(lst)
    return [key for key in cnt.keys() if cnt[key] > 1]

def duplicates_indices(lst):
    dup, ind = duplicates(lst), defaultdict(list)
    for i, v in enumerate(lst):
        if v in dup: ind[v].append(i)
    return ind

lst = ['a', 'b', 'a', 'c', 'b', 'a', 'e']
print duplicates(lst)          # ['a', 'b']
print duplicates_indices(lst)  # ..., {'a': [0, 2, 5], 'b': [1, 4]})
A slightly more orthogonal (and thus more useful) implementation would be:
from collections import Counter, defaultdict

def duplicates(lst):
    cnt = Counter(lst)
    return [key for key in cnt.keys() if cnt[key] > 1]

def indices(lst, items=None):
    items, ind = set(lst) if items is None else items, defaultdict(list)
    for i, v in enumerate(lst):
        if v in items: ind[v].append(i)
    return ind

lst = ['a', 'b', 'a', 'c', 'b', 'a', 'e']
print indices(lst, duplicates(lst))  # ..., {'a': [0, 2, 5], 'b': [1, 4]})
Wow, everyone's answers are so long. I simply used a pandas DataFrame, masking, and the duplicated function (keep=False marks all duplicates as True, not just the first or last):
import pandas as pd
import numpy as np
np.random.seed(42) # make results reproducible
int_df = pd.DataFrame({'int_list': np.random.randint(1, 20, size=10)})
dupes = int_df['int_list'].duplicated(keep=False)
print(int_df['int_list'][dupes].index)
This should return Int64Index([0, 2, 3, 4, 6, 7, 9], dtype='int64').
def index(arr, num):
    for i, x in enumerate(arr):
        if x == num:
            print(x, i)

# index(List, 'A')
In a single line with pandas 1.2.2 and numpy:
import numpy as np
import pandas as pd
idx = np.where(pd.DataFrame(List).duplicated(keep=False))
The argument keep=False will mark every duplicate as True and np.where() will return an array with the indices where the element in the array was True.
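Applied to the List from the question, for example:

import numpy as np
import pandas as pd

List = ['A', 'B', 'A', 'C', 'E']
idx = np.where(pd.DataFrame(List).duplicated(keep=False))
print(idx)  # (array([0, 2]),) -- both positions of 'A'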
string_list = ['A', 'B', 'C', 'B', 'D', 'B']
pos_list = []
for i in range(len(string_list)):
    if string_list[i] == 'B':
        pos_list.append(i)
print(pos_list)
def find_duplicate(list_):
    duplicate_list = []
    for k in range(len(list_)):
        if list_[k] in duplicate_list:
            continue
        for j in range(len(list_)):
            if k == j:
                continue
            if list_[k] == list_[j]:
                duplicate_list.append(list_[j])
                print("duplicate " + str(list_.index(list_[j])) + " " + str(list_.index(list_[k])))
Here is one that works for multiple duplicates and you don't need to specify any values:
List = ['A', 'B', 'A', 'C', 'E', 'B']  # duplicates: two 'A's, two 'B's
ix_list = []
for i in range(len(List)):
    try:
        dup_ix = List[(i+1):].index(List[i]) + (i + 1)  # duplicate onwards + (i + 1)
        ix_list.extend([i, dup_ix])  # if no error was raised, add i as well
    except ValueError:
        pass
ix_list.sort()
print(ix_list)
[0, 1, 2, 5]
def dup_list(my_list, value):
    '''
    dup_list(list, value)

    This function finds the indices of values in a list, including duplicated values.

    list: the list you are working on
    value: the item of the list you want to find the index of

    NB: if a value is duplicated, its indices are stored in a list.
    If there is only one occurrence of the value, the index is stored as an integer.
    Therefore use isinstance to know how to handle the returned value.
    '''
    value_list = []
    index_list = []
    index_of_duped = []
    if my_list.count(value) == 1:
        return my_list.index(value)
    elif my_list.count(value) < 1:
        return 'Your argument is not in the list'
    else:
        for item in my_list:
            value_list.append(item)
            length = len(value_list)
            index = length - 1
            index_list.append(index)
            if item == value:
                index_of_duped.append(max(index_list))
        return index_of_duped

# function call, e.g. dup_list(my_list, 'john')
If you want to get the indices of all the duplicate elements of different kinds, you can try this solution:
# note: below list has more than one kind of duplicates
List = ['A', 'B', 'A', 'C', 'E', 'E', 'A', 'B', 'A', 'A', 'C']
d1 = {item:List.count(item) for item in List} # item and their counts
elems = list(filter(lambda x: d1[x] > 1, d1)) # get duplicate elements
d2 = dict(zip(range(0, len(List)), List)) # each item and their indices
# item and their list of duplicate indices
res = {item: list(filter(lambda x: d2[x] == item, d2)) for item in elems}
Now, if you print(res) you'll get to see this:
{'A': [0, 2, 6, 8, 9], 'B': [1, 7], 'C': [3, 10], 'E': [4, 5]}
def duplicates(lst, dup):
    a = [lst.index(dup)]
    for i in lst:
        try:
            a.append(lst.index(dup, a[-1] + 1))
        except ValueError:
            for i in a:
                print(f'index {i}: ' + dup)
            break

duplicates(['A', 'B', 'A', 'C', 'E'], 'A')
Output:
index 0: A
index 2: A
This is a good question and there are a lot of ways to do it. The code below is one of them:
letters = ["a", "b", "c", "d", "e", "a", "a", "b"]
lettersIndexes = [i for i in range(len(letters))] # i created a list that contains the indexes of my previous list
counter = 0
for item in letters:
if item == "a":
print(item, lettersIndexes[counter])
counter += 1 # for each item it increases the counter which means the index
An other way to get the indexes but this time stored in a list
letters = ["a", "b", "c", "d", "e", "a", "a", "b"]
lettersIndexes = [i for i in range(len(letters)) if letters[i] == "a" ]
print(lettersIndexes) # as you can see we get a list of the indexes that we want.
Using a dictionary approach based on the setdefault instance method:

List = ['A', 'B', 'A', 'C', 'B', 'E', 'B']

# keep track of all indices of every term
duplicates = {}
for i, key in enumerate(List):
    duplicates.setdefault(key, []).append(i)

# print only those terms with more than one index
template = 'index {}: {}'
for k, v in duplicates.items():
    if len(v) > 1:
        print(template.format(k, str(v).strip('][')))
Remark: Counter, defaultdict and the other container classes from collections are subclasses of dict and hence share the setdefault method as well.
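For instance (a trivial check, my own addition):

from collections import Counter, defaultdict

c = Counter('aab')
c.setdefault('z', 0)             # works: Counter inherits dict.setdefault
d = defaultdict(list)
d.setdefault('k', []).append(1)  # works here too
print(c['z'], d['k'])            # 0 [1]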
I'll mention the more obvious way of dealing with duplicates in lists. In terms of complexity, dictionaries are the way to go because each lookup is O(1). You can be more clever if you're only interested in duplicates...
my_list = [1, 1, 2, 3, 4, 5, 5]
my_dict = {}
for (ind, elem) in enumerate(my_list):
    if elem in my_dict:
        my_dict[elem].append(ind)
    else:
        my_dict.update({elem: [ind]})

for key, value in my_dict.iteritems():
    if len(value) > 1:
        print "key(%s) has indices (%s)" % (key, value)
which prints the following:
key(1) has indices ([0, 1])
key(5) has indices ([5, 6])
a = [2, 3, 4, 5, 6, 2, 3, 2, 4, 2]
search = 2
pos = 0
positions = []
while search in a:
    pos += a.index(search)
    positions.append(pos)
    a = a[a.index(search) + 1:]
    pos += 1
print "search found at:", positions
I just make it simple:
i = [1, 2, 1, 3]
k = 0
for ii in i:
    if ii == 1:
        print("index of 1 =", k)
    k = k + 1
output:
index of 1 = 0
index of 1 = 2
I'm working on the Udacity Deep Learning class and I'm working on the first assignment, problem 5 where you try to count the number of duplicates in, say, your test set and training set. (Or validation and training, etc.)
I've looked at other people's answers, but I'm not satisfied with them for various reasons. For example, I tried out someone's hash-based solution, but I felt the results it returned were not likely to be correct.
So the main idea is that you have an array of images that are formatted as arrays, i.e. you're trying to compare two 3-dimensional arrays along index 0. One array is the training dataset, which is 200,000 rows, each row containing a 2-D array that holds the values of an image. The other is the test set, which is 10,000 rows, each row containing a 2-D array of an image. The goal is to find all rows in the test set that match (for now, an exact match is fine) a row in the training set. Since each 'row' is itself an image (a 2-D array), to make this fast I must be able to do the comparison of both sets as an element-wise compare of each row.
I worked up my own fairly simple solution like this:
# Find duplicates
# Loop through the validation/test set and find entries that are identical
# matrices to something in the training data
def find_duplicates(compare_set, compare_labels, training_set, training_labels):
    dup_count = 0
    duplicates = []
    for i in range(len(compare_set)):
        if i > 100: continue
        if i % 100 == 0:
            print("i: ", i)
        for j in range(len(training_set)):
            if compare_labels[i] == training_labels[j]:
                if np.array_equal(compare_set[i], training_set[j]):
                    duplicates.append((i, j))
                    dup_count += 1
    return dup_count, duplicates

#print(len(valid_dataset))
print(len(train_dataset))

valid_dup_count, duplicates = find_duplicates(valid_dataset, valid_labels, train_dataset, train_labels)
print(valid_dup_count)
print(duplicates)

#test_dups = find_duplicates(test_dataset, train_dataset)
#print(test_dups)
The reason it just "continues" after 100 is because that alone takes a very long time. If I were to try to compare all 10,000 rows of the validation set to the training set, it would take forever.
I like my solution in principle because it allows me to not only count the duplicates, but get a list back of which matches existed. (Something missing on every other solution I've looked at.) This allows me to manually test that I'm getting the right solution.
What I really need is a much faster (i.e. built into Numpy) solution to compare matrices of matrices like this. I've played with 'isin' and 'where' but haven't figured out how to use those to get the results I'm after. Can someone point me in the right direction for a faster solution?
You should be able to compare a single image from compare_set against all the images in training_set with a single line of code using np.all(). You can provide multiple axes as a tuple in the axis argument to check array equality over the rows and columns, going through each of the images. Then np.where() can give you the indices you want.
For example:
n_train = 50
n_validation = 10
h, w = 28, 28
training_set = np.random.rand(n_train, h, w)
validation_set = np.random.rand(n_validation, h, w)
# create some duplicates
training_set[5] = training_set[10]
validation_set[2] = training_set[10]
validation_set[8] = training_set[10]
duplicates = []
for i, img in enumerate(validation_set):
    training_dups = np.where(np.all(training_set == img, axis=(1, 2)))[0]
    for j in training_dups:
        duplicates.append((i, j))
print(duplicates)
[(2, 5), (2, 10), (8, 5), (8, 10)]
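As a further sketch (my own addition, not part of the answer), the Python loop itself can be removed with broadcasting, at the cost of materialising an (n_validation, n_train) boolean matrix:

# eq[i, j] is True when validation_set[i] exactly equals training_set[j]
eq = np.all(validation_set[:, None, :, :] == training_set[None, :, :, :], axis=(2, 3))
i_idx, j_idx = np.nonzero(eq)
print(list(zip(i_idx, j_idx)))  # same (i, j) pairs as the loop above

For the full 10,000 x 200,000 problem this comparison matrix would be far too large to build in one go, so the per-image loop above is the more practical middle ground.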
Many numpy functions, np.all() included, let you specify the axes to operate on. For example, let's say you had the two arrays
>>> A = np.array([[1, 2], [3, 4]])
>>> B = np.array([[1, 2], [5, 6]])
>>> A
array([[1, 2],
[3, 4]])
>>> B
array([[1, 2],
[5, 6]])
Now, A and B have the same first row, but a different second row. If we check equality for them
>>> A == B
array([[ True, True],
[False, False]], dtype=bool)
We get an array the same shape as A and B. But what if I want the indices of the rows which are equal? Well in this case what we can do is say 'only return True if all the values in the row (i.e. the value in each column) are True'. So we can use np.all() after the equality check, and provide it the axis corresponding to the columns.
>>> np.all(A == B, axis=1)
array([ True, False], dtype=bool)
So this result is letting us know that the first row is equal in both arrays, and the second row is not all equal. We can then get the row indices with np.where()
>>> np.where(np.all(A == B, axis=1))
(array([0]),)
So here we see row 0, i.e. A[0] and B[0] are equal.
Now in the solution I proposed, you have a 3D array instead of these 2D arrays. We don't care whether a single row is equal; we care whether all the rows and columns are equal. So breaking it down as above, let's create an array of two random 5x5 images. I'll grab one of those images and check it for equality against the whole array:
>>> imgs = np.random.rand(2, 5, 5)
>>> img = imgs[1]
>>> imgs == img
array([[[False, False, False, False, False],
[False, False, False, False, False],
[False, False, False, False, False],
[False, False, False, False, False],
[False, False, False, False, False]],
[[ True, True, True, True, True],
[ True, True, True, True, True],
[ True, True, True, True, True],
[ True, True, True, True, True],
[ True, True, True, True, True]]], dtype=bool)
So it's obvious that the second image is the matching one, but I want to reduce all those True values to a single True value; I only want the index corresponding to images where every value is equal.
If we use axis=1
>>> np.all(imgs == img, axis=1)
array([[False, False, False, False, False],
[ True, True, True, True, True]], dtype=bool)
Then we get True for each row if all the columns in each row are equivalent. And really we want to reduce this further by checking equality along all the rows as well. So we can take this result, feed it into np.all() and check along the rows of the resulting array:
>>> np.all(np.all(imgs == img, axis=1), axis=1)
array([False, True], dtype=bool)
And this gives us a boolean of which image inside imgs is equal to img, and we can simply get the result with np.where(). But you don't actually need to call np.all() twice like this; instead you can provide it multiple axes in a tuple to just reduce along both the rows and columns in one step:
>>> np.all(imgs == img, axis=(1, 2))
array([False, True], dtype=bool)
And that's what the solution above does. Hope that clears it up!