Udacity Deep Learning: Assignment 1, Part 5 - numpy

I'm taking the Udacity Deep Learning class and am on problem 5 of the first assignment, where you count the number of duplicates between, say, your test set and training set (or validation and training, etc.).
I've looked at other people's answers, but I'm not satisfied with them for various reasons. For example, I tried out someone's hash-based solution, but the results it returned did not seem likely to be correct.
The main idea is that you have an array of images, each of which is itself an array. That is, you're comparing two 3-dimensional arrays along axis 0. One array is the training dataset, which has 200000 rows, each containing a 2-D array of the image's values. The other is the test set, which has 10000 rows, each containing a 2-D array of an image. The goal is to find all rows in the test set that match (for now, an exact match is fine) a row in the training set. Since each 'row' is itself an image (a 2-D array), to make this fast I need to compare each row of one set against the other set as an element-wise comparison.
I worked up my own fairly simple solution like this:
# Find duplicates
# Loop through validation/test set and find ones that are identical matrices to something in the training data
def find_duplicates(compare_set, compare_labels, training_set, training_labels):
    dup_count = 0
    duplicates = []
    for i in range(len(compare_set)):
        if i > 100: continue
        if i % 100 == 0:
            print("i: ", i)
        for j in range(len(training_set)):
            if compare_labels[i] == training_labels[j]:
                if np.array_equal(compare_set[i], training_set[j]):
                    duplicates.append((i,j))
                    dup_count += 1
    return dup_count, duplicates
#print(len(valid_dataset))
print(len(train_dataset))
valid_dup_count, duplicates = find_duplicates(valid_dataset, valid_labels, train_dataset, train_labels)
print(valid_dup_count)
print(duplicates)
#test_dups = find_duplicates(test_dataset, train_dataset)
#print(test_dups)
It just "continues" after 100 because even that alone takes a very long time. Comparing all 10,000 rows of the validation set against the training set would take far too long to be practical.
I like my solution in principle because it allows me to not only count the duplicates, but get a list back of which matches existed. (Something missing on every other solution I've looked at.) This allows me to manually test that I'm getting the right solution.
What I really need is a much faster (i.e. built into Numpy) solution to compare matrices of matrices like this. I've played with 'isin' and 'where' but haven't figured out how to use those to get the results I'm after. Can someone point me in the right direction for a faster solution?

You should be able to compare a single image from compare_set throughout all the images in training_set with a single line of code using np.all(). You can provide multiple axes as a tuple in the axis argument to check array equality over rows and columns, going through each of the images. Then np.where() can give you the indices you want.
For example:
import numpy as np
n_train = 50
n_validation = 10
h, w = 28, 28
training_set = np.random.rand(n_train, h, w)
validation_set = np.random.rand(n_validation, h, w)
# create some duplicates
training_set[5] = training_set[10]
validation_set[2] = training_set[10]
validation_set[8] = training_set[10]
duplicates = []
for i, img in enumerate(validation_set):
    # indices of every training image that equals img in all rows and columns
    training_dups = np.where(np.all(training_set == img, axis=(1, 2)))[0]
    for j in training_dups:
        duplicates.append((i, j))
print(duplicates)
[(2, 5), (2, 10), (8, 5), (8, 10)]
Many numpy functions, np.all() included, let you specify the axes to operate on. For example, let's say you had the two arrays
>>> A = np.array([[1, 2], [3, 4]])
>>> B = np.array([[1, 2], [5, 6]])
>>> A
array([[1, 2],
       [3, 4]])
>>> B
array([[1, 2],
       [5, 6]])
Now, A and B have the same first row, but a different second row. If we check equality for them
>>> A == B
array([[ True,  True],
       [False, False]], dtype=bool)
We get an array the same shape as A and B. But what if I want the indices of the rows which are equal? Well in this case what we can do is say 'only return True if all the values in the row (i.e. the value in each column) are True'. So we can use np.all() after the equality check, and provide it the axis corresponding to the columns.
>>> np.all(A == B, axis=1)
array([ True, False], dtype=bool)
So this result is letting us know that the first row is equal in both arrays, and the second row is not all equal. We can then get the row indices with np.where()
>>> np.where(np.all(A == B, axis=1))
(array([0]),)
So here we see row 0, i.e. A[0] and B[0] are equal.
Now in the solution I proposed, you have a 3D array instead of these 2D arrays. We don't care if a single row is equal, we care if all the rows and columns are equal. So breaking it down as above, let's create two random 5x5 images. I'll grab one of those images and check for equality among the array of two images:
>>> imgs = np.random.rand(2, 5, 5)
>>> img = imgs[1]
>>> imgs == img
array([[[False, False, False, False, False],
        [False, False, False, False, False],
        [False, False, False, False, False],
        [False, False, False, False, False],
        [False, False, False, False, False]],

       [[ True,  True,  True,  True,  True],
        [ True,  True,  True,  True,  True],
        [ True,  True,  True,  True,  True],
        [ True,  True,  True,  True,  True],
        [ True,  True,  True,  True,  True]]], dtype=bool)
So it is obvious that the second image is the match, but I want to reduce all those True values to one True value; I only want the index corresponding to images where every value is equal.
If we use axis=1
>>> np.all(imgs == img, axis=1)
array([[False, False, False, False, False],
       [ True,  True,  True,  True,  True]], dtype=bool)
Here axis=1 collapses the row axis of each image, so for every image we get one value per column, which is True only if every entry in that column matches. We really want to reduce this further across the remaining axis as well. So we can take this result, feed it into np.all() again and reduce along axis 1 of the resulting 2-D array:
>>> np.all(np.all(imgs == img, axis=1), axis=1)
array([False, True], dtype=bool)
And this gives us a boolean array indicating which image inside imgs is equal to img, and we can simply get the indices with np.where(). But you don't actually need to call np.all() twice like this; instead you can provide it multiple axes in a tuple to reduce along both the rows and columns in one step:
>>> np.all(imgs == img, axis=(1, 2))
array([False, True], dtype=bool)
And that's what the solution above does. Hope that clears it up!
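If even one vectorized comparison per validation image is too slow for 200000 training images, a different technique is to index the training images by their raw bytes in a dictionary and look each validation image up in it. The following is only a minimal sketch, not part of the approach above. The function name find_duplicates_by_bytes is hypothetical, and it assumes both sets share the same dtype and image shape. Byte equality also differs slightly from ==: for example, 0.0 and -0.0 compare equal with == but have different bytes.

```python
import numpy as np
from collections import defaultdict

def find_duplicates_by_bytes(compare_set, training_set):
    """Return (i, j) pairs where compare_set[i] and training_set[j] have identical bytes.

    Hypothetical helper: assumes both arrays share dtype and image shape.
    """
    # Map the raw bytes of each training image to the list of its indices.
    index = defaultdict(list)
    for j, img in enumerate(training_set):
        index[img.tobytes()].append(j)

    # Look up each image of the comparison set in that mapping.
    pairs = []
    for i, img in enumerate(compare_set):
        for j in index.get(img.tobytes(), []):
            pairs.append((i, j))
    return pairs

# Same toy setup as in the example above, with a seeded generator.
rng = np.random.default_rng(0)
training_set = rng.random((50, 28, 28))
validation_set = rng.random((10, 28, 28))
training_set[5] = training_set[10]
validation_set[2] = training_set[10]
validation_set[8] = training_set[10]

pairs = find_duplicates_by_bytes(validation_set, training_set)
print(pairs)
```

Like the loop above, this returns the matching index pairs rather than only a count, so individual matches can still be checked by hand with np.array_equal.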

Related

What is the best way to initialise a NumPy masked array with an existing mask?

I was expecting to just say something like
ma.zeros(my_shape, mask=my_mask, hard_mask=True)
(where the mask is the correct shape) but ma.zeros (or ma.ones or ma.empty) rather surprisingly doesn't recognise the mask argument. The simplest I've come up with is
ma.array(np.zeros(my_shape), mask=my_mask, hard_mask=True)
which seems to involve unnecessary copying of lots of zeros. Is there a better way?
Make a masked array:
In [162]: x = np.arange(5); mask=np.array([1,0,0,1,0],bool)
In [163]: M = np.ma.MaskedArray(x,mask)
In [164]: M
Out[164]:
masked_array(data=[--, 1, 2, --, 4],
             mask=[ True, False, False,  True, False],
       fill_value=999999)
Modify x, and see the result in M:
In [165]: x[-1] = 10
In [166]: M
Out[166]:
masked_array(data=[--, 1, 2, --, 10],
             mask=[ True, False, False,  True, False],
       fill_value=999999)
In [167]: M.data
Out[167]: array([ 0, 1, 2, 3, 10])
In [169]: M.data.base
Out[169]: array([ 0, 1, 2, 3, 10])
The M.data is a view of the array used in creating it. No unnecessary copies.
I haven't used functions like np.ma.zeros, but
In [177]: np.ma.zeros
Out[177]: <numpy.ma.core._convert2ma at 0x1d84a052af0>
_convert2ma is a Python class that takes a funcname and returns a new callable. It does not add mask-specific parameters. Study that yourself if necessary.
np.ma.MaskedArray, the class that actually subclasses ndarray, takes a copy parameter
copy : bool, optional
    Whether to copy the input data (True), or to use a reference instead.
    Default is False.
and the first line of its __new__ is
_data = np.array(data, dtype=dtype, copy=copy,
                 order=order, subok=True, ndmin=ndmin)
I haven't quite sorted out whether M._data is just a reference to the source data, or a view. In either case, it isn't a copy, unless you say so.
I haven't worked a lot with masked arrays, but my impression is that, while they can be convenient, they shouldn't be used where you are concerned about performance. There's a lot of extra work required to maintain both the mask and the data. The extra time involved in copying the data array, if any, will be minor.
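To check the no-copy claim for the zeros case from the question, a small sketch like the following can help. The shape and mask are made-up stand-ins for my_shape and my_mask. It uses np.shares_memory to test whether the masked array's data still refers to the buffer that was passed in, and then shows the effect of hard_mask=True on assignment.

```python
import numpy as np

# Hypothetical shape and mask, standing in for my_shape and my_mask.
my_shape = (3, 4)
my_mask = np.zeros(my_shape, dtype=bool)
my_mask[0, 1] = True

data = np.zeros(my_shape)
M = np.ma.MaskedArray(data, mask=my_mask, hard_mask=True)

# True if the masked array's data uses the buffer we passed in (no copy).
shares_buffer = np.shares_memory(M.data, data)
print(shares_buffer)

# With a hard mask, assigning to a masked element does not unmask it.
M[0, 1] = 5.0
still_masked = bool(M.mask[0, 1])
print(still_masked)
```

So the only full-size allocation here is the np.zeros call itself; wrapping it in a masked array does not add a second copy of the data.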

Create masked array from list containing ma.masked

If I have a (possibly multidimensional) Python list where each element is one of True, False, or ma.masked, what's the idiomatic way of turning this into a masked numpy array of bool?
Example:
>>> print(somefunc([[True, ma.masked], [False, True]]))
[[True --]
 [False True]]
A masked array has two attributes, data and mask:
In [342]: arr = np.ma.masked_array([[True, False],[False,True]])
In [343]: arr
Out[343]:
masked_array(
  data=[[ True, False],
        [False,  True]],
  mask=False,
  fill_value=True)
That starts without anything masked. Then as you suggest, assigning np.ma.masked to an element masks the slot:
In [344]: arr[0,1]=np.ma.masked
In [345]: arr
Out[345]:
masked_array(
  data=[[True, --],
        [False, True]],
  mask=[[False,  True],
        [False, False]],
  fill_value=True)
Here the arr.mask has been changed from scalar False (applying to the whole array) to a boolean array of False, and then the selected item has been changed to True.
arr.data hasn't changed:
In [346]: arr.data[0,1]
Out[346]: False
Looks like this change to arr.mask occurs in MaskedArray.__setitem__ at:
if value is masked:
    # The mask wasn't set: create a full version.
    if _mask is nomask:
        _mask = self._mask = make_mask_none(self.shape, _dtype)
    # Now, set the mask to its value.
    if _dtype.names is not None:
        _mask[indx] = tuple([True] * len(_dtype.names))
    else:
        _mask[indx] = True
    return
It checks whether the assigned value is the special constant np.ma.masked; if so, it creates the full mask when needed and sets the selected element of the mask to True.
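Building on the data/mask split described above, one possible way to write the somefunc from the question is to walk the nested list once, collecting a data list and a mask list, and then build the masked array from both. This is only a sketch: the name masked_bool_array is hypothetical, it assumes the nesting consists of plain Python lists, and it stores False as the placeholder data value under masked entries.

```python
import numpy as np

def masked_bool_array(nested):
    """Build a boolean masked array from nested lists of True/False/np.ma.masked.

    Hypothetical helper: assumes nesting is done with plain Python lists.
    """
    def split(item):
        # Recurse into lists, returning parallel (data, mask) structures.
        if isinstance(item, list):
            parts = [split(x) for x in item]
            return [p[0] for p in parts], [p[1] for p in parts]
        # A masked entry gets placeholder data False and mask True.
        if item is np.ma.masked:
            return False, True
        return bool(item), False

    data, mask = split(nested)
    return np.ma.MaskedArray(np.array(data, dtype=bool),
                             mask=np.array(mask, dtype=bool))

arr = masked_bool_array([[True, np.ma.masked], [False, True]])
print(arr)
```

The identity check `item is np.ma.masked` mirrors the `value is masked` test in the __setitem__ excerpt above.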

Splitting a numpy array / pandas dataframe by boolean delimiters

Assume a numpy array (actually Pandas) of the form:
[value, included,
0.123, False,
0.127, True,
0.140, True,
0.111, False,
0.159, True,
0.321, True,
0.444, True,
0.323, True,
0.432, False]
I'd like to split the array such that False elements are excluded and successive runs of True elements are split into their own array. So for the above case, we'd end up with:
[[0.127, True,
0.140, True],
[0.159, True,
0.321, True,
0.444, True,
0.323, True]]
I can certainly do this by pushing individual elements onto lists, but surely there must be a more numpy-ish way to do this.
You can build group labels by inverting the mask with ~ and taking Series.cumsum, keep only the True rows with boolean indexing, and then create a list of DataFrames with DataFrame.groupby:
dfs = [v for k, v in df.groupby((~df['included']).cumsum()[df['included']])]
print (dfs)
[ value included
1 0.127 True
2 0.140 True, value included
4 0.159 True
5 0.321 True
6 0.444 True
7 0.323 True]
It is also possible to convert the DataFrames to arrays with DataFrame.to_numpy:
dfs = [v.to_numpy() for k, v in df.groupby((~df['included']).cumsum()[df['included']])]
print (dfs)
[array([[0.127, True],
[0.14, True]], dtype=object), array([[0.159, True],
[0.321, True],
[0.444, True],
[0.32299999999999995, True]], dtype=object)]
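For comparison, a NumPy-only version of the same split might look like the sketch below. It is not the groupby approach above but a different technique: pad the boolean column with False on both sides, find the positions where it changes, and read those positions as alternating run starts and stops. The array names values and included are assumptions standing in for the two columns.

```python
import numpy as np

# The two columns from the question, as separate arrays (assumed names).
values = np.array([0.123, 0.127, 0.140, 0.111, 0.159, 0.321, 0.444, 0.323, 0.432])
included = np.array([False, True, True, False, True, True, True, True, False])

# Pad with False so runs touching either end still have a start and a stop.
padded = np.concatenate(([False], included, [False]))

# Positions where the flag flips; they alternate start, stop, start, stop, ...
edges = np.flatnonzero(padded[1:] != padded[:-1])
starts, stops = edges[::2], edges[1::2]

# One array per run of consecutive True entries (stop index is exclusive).
runs = [values[s:e] for s, e in zip(starts, stops)]
print(runs)
```

Each element of runs is a view into values, so no data is copied until the pieces are modified or converted.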

Tensorflow: How to randomly select elements according to condition without np.where?

I have 3 TensorFlow tensors (a, b, valid_entries), which share their first two dimensions [T, N, ?]. One of them, valid_entries, has shape [T,N,1] and holds boolean values. I want to randomly sample T*M 2-tuples of indices (M < N) such that valid_entries[t,m] == 1 for all of these indices.
In other words, for each time step, I want to randomly select M valid entries from a and b.
I presume that in NumPy this task would be solved roughly as follows (let's skip the first dimension T for simplicity):
import numpy as np
M = 3
N = 5
valid_entries = np.array([[0],[1],[0],[1],[0]])
# a and b are arrays whose first dimension has length N
valid_indices = np.where(valid_entries[:, 0] == 1)[0]
valid_indices = np.random.choice(valid_indices, min(len(valid_indices), M), replace=False)
a_new = a[valid_indices]
b_new = b[valid_indices]
valid_new = valid_entries[valid_indices]
However, all this needs to happen in Tensorflow.
Thanks a ton in advance for any help!
Here is a function that does that:
import tensorflow as tf
def sample_indices(valid, m, seed=None):
    valid = tf.convert_to_tensor(valid)
    n = tf.size(valid)
    # Flatten boolean tensor
    valid_flat = tf.reshape(valid, [n])
    # Get flat indices where the tensor is true
    valid_idx = tf.boolean_mask(tf.range(n), valid_flat)
    # Shuffled valid indices
    valid_idx_shuffled = tf.random.shuffle(valid_idx, seed=seed)
    # Pick sample from shuffled indices
    valid_idx_sample = valid_idx_shuffled[:m]
    # Unravel indices
    return tf.transpose(tf.unravel_index(valid_idx_sample, tf.shape(valid)))
with tf.Graph().as_default(), tf.Session() as sess:
    valid = [[ True,  True, False,  True],
             [False,  True,  True, False],
             [False,  True, False, False]]
    m = 4
    print(sess.run(sample_indices(valid, m, seed=0)))
    # [[1 1]
    #  [1 2]
    #  [0 1]
    #  [2 1]]
This sample_indices is generic for any shape of boolean tensor. If in your case valid_entries has shape (T, N, 1) then you will get a tensor with shape (M, 3) as output, although you can ignore the last column since it is always going to be zero (or you can pass tf.squeeze(valid_entries, axis=2) instead).
Note: The last tf.transpose is just to have as output a tensor with shape (sample_size, num_dimensions) instead of the other way around. However, if m is rather big and you don't mind the order of the dimensions, you may skip it to save a bit of time and memory, since (unlike its NumPy counterpart) tf.transpose produces a whole new tensor.
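To sanity-check the logic outside a TensorFlow session, the same steps (flatten, keep the valid flat indices, shuffle, take the first m, unravel) can be mirrored in NumPy. This is only a reference sketch for testing, not a replacement for the TensorFlow function. The name sample_indices_np is hypothetical, and because NumPy uses a different random generator, a given seed will not reproduce the TensorFlow sample.

```python
import numpy as np

def sample_indices_np(valid, m, seed=None):
    """NumPy mirror of the sampling steps above (hypothetical helper)."""
    valid = np.asarray(valid, dtype=bool)
    rng = np.random.default_rng(seed)
    # Flat indices of the True entries.
    flat_idx = np.flatnonzero(valid)
    # Shuffle them and keep at most m.
    sample = rng.permutation(flat_idx)[:m]
    # Convert flat indices back to one row of coordinates per sample.
    return np.stack(np.unravel_index(sample, valid.shape), axis=1)

valid = np.array([[True, True, False, True],
                  [False, True, True, False],
                  [False, True, False, False]])
idx = sample_indices_np(valid, 4, seed=0)
print(idx)
```

The result has shape (sample_size, num_dimensions), matching the layout the transposed TensorFlow output uses.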

How to fast lookup by list of indices?

Given a list of candidate indices
candidates = np.array([ 3, 4, 5 ])
You can look them up in your dataframe df via
df.loc[ candidates ]
However, if a candidate is missing from df.index, this will throw an exception.
What is the fastest way to obtain both of the following?
the slice of df for all candidates that are in the index
a boolean array indicating which candidates are in the index
In particular, if df.index.is_monotonic == True, this fact should be used to speed things up.
As a benchmark, df.size is about 6.8M.
%time df.index.isin([45,66,77,99,87,65,234,668,798])
Wall time: 15 ms
array([False, False, False, ..., False, False, False], dtype=bool)
My suggestion
%%time
item_list = [45,66,77,99,87,65,234,668,798]
result_list = []
for item in item_list:
    if item in user_shop_df.index.values:
        result_list.append(True)
    else:
        result_list.append(False)
Wall time: 10 ms
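Since the question asks to exploit a sorted index, one option worth trying is a binary search with np.searchsorted, which costs roughly log(n) per candidate instead of scanning the whole index. The sketch below is untimed and works under stated assumptions: the index values are sorted in increasing order and unique, and index_values is a small made-up array standing in for df.index.values.

```python
import numpy as np

# Stand-in for df.index.values; assumed sorted in increasing order and unique.
index_values = np.array([1, 3, 4, 7, 9])
candidates = np.array([3, 4, 5])

# Binary search: position where each candidate would be inserted.
pos = np.searchsorted(index_values, candidates)

# Clip so candidates beyond the last index value do not index out of range.
pos_clipped = np.minimum(pos, len(index_values) - 1)

# A candidate is present only if the value at its position equals it.
found = index_values[pos_clipped] == candidates

# Integer row positions of the candidates that exist in the index.
rows = pos[found]
print(found, rows)
```

Here found is the boolean array over the candidates, and with a real DataFrame the matching slice would then be df.iloc[rows].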