Tensorflow: how to make sure all samples in each batch are with the same label? - tensorflow

I wonder whether there are some ways to apply constraints on the batches to generate in Tensorflow. For example, let's say we are training a CNN on a huge dataset to do image classification. Is it possible to force Tensorflow to generate batches where all samples are with the same class? Like, one batch of images all tagged with "Apple", the other one where samples all tagged with "Orange".
The reason I ask this question is I want to do some experiments to see how different levels of shuffling influence the final trained models. It's common practice to do sample-level shuffling for CNN training, and everybody is doing it. I just want to check it myself, thus obtaining a more vivid and first-hand knowledge about it.
Thanks!

Dataset.filter() can be used:
labels = np.random.randint(0, 10, (10000))
data = np.random.uniform(size=(10000, 5))
ds = tf.data.Dataset.from_tensor_slices((data, labels))
ds = ds.filter(lambda data, labels: tf.equal(labels, 1)) #comment this line out for unfiltered case
ds = ds.batch(5)
iterator = ds.make_one_shot_iterator()
vals = iterator.get_next()
with tf.Session() as sess:
for _ in range(5):
py_data, py_labels = sess.run(vals)
print(py_labels)
with ds.filter():
> [1 1 1 1 1]
[1 1 1 1 1]
[1 1 1 1 1]
[1 1 1 1 1]
[1 1 1 1 1]
without ds.filter():
> [8 0 7 6 3]
[2 4 7 6 1]
[1 8 5 5 5]
[7 1 7 4 0]
[7 1 8 0 0]
Edit. The following code shows how to use a feedable iterator to perform batch label selection on the fly. See "Creating an iterator"
labels = ['Apple'] * 100 + ['Orange'] * 100
data = list(range(200))
random.shuffle(labels)
batch_size = 4
ds_apple = tf.data.Dataset.from_tensor_slices((data, labels)).filter(
lambda data, label: tf.equal(label, 'Apple')).batch(batch_size)
ds_orange = tf.data.Dataset.from_tensor_slices((data, labels)).filter(
lambda data, label: tf.equal(label, 'Orange')).batch(batch_size)
handle = tf.placeholder(tf.string, [])
iterator = tf.data.Iterator.from_string_handle(
handle, ds_apple.output_types, ds_apple.output_shapes)
batch = iterator.get_next()
apple_iterator = ds_apple.make_one_shot_iterator()
orange_iterator = ds_orange.make_one_shot_iterator()
with tf.Session() as sess:
apple_handle = sess.run(apple_iterator.string_handle())
orange_handle = sess.run(orange_iterator.string_handle())
# loop and switch back and forth between apples and oranges
for _ in range(3):
feed_dict = {handle: apple_handle}
print(sess.run(batch, feed_dict=feed_dict))
feed_dict = {handle: orange_handle}
print(sess.run(batch, feed_dict=feed_dict))
Typical output for this is as follows. Note that the data values increase monotonically across Apple and Orange batches showing that the iterators are not resetting.
> (array([2, 3, 6, 7], dtype=int32), array([b'Apple', b'Apple', b'Apple', b'Apple'], dtype=object))
(array([0, 1, 4, 5], dtype=int32), array([b'Orange', b'Orange', b'Orange', b'Orange'], dtype=object))
(array([ 9, 13, 15, 19], dtype=int32), array([b'Apple', b'Apple', b'Apple', b'Apple'], dtype=object))
(array([ 8, 10, 11, 12], dtype=int32), array([b'Orange', b'Orange', b'Orange', b'Orange'], dtype=object))
(array([21, 22, 23, 25], dtype=int32), array([b'Apple', b'Apple', b'Apple', b'Apple'], dtype=object))
(array([14, 16, 17, 18], dtype=int32), array([b'Orange', b'Orange', b'Orange', b'Orange'], dtype=object))

Related

pytorch tensor indices is confusing [duplicate]

I am trying to access a pytorch tensor by a matrix of indices and I recently found this bit of code that I cannot find the reason why it is not working.
The code below is split into two parts. The first half proves to work, whilst the second trips an error. I fail to see the reason why. Could someone shed some light on this?
import torch
import numpy as np
a = torch.rand(32, 16)
m, n = a.shape
xx, yy = np.meshgrid(np.arange(m), np.arange(m))
result = a[xx] # WORKS for a torch.tensor of size M >= 32. It doesn't work otherwise.
a = torch.rand(16, 16)
m, n = a.shape
xx, yy = np.meshgrid(np.arange(m), np.arange(m))
result = a[xx] # IndexError: too many indices for tensor of dimension 2
and if I change a = np.random.rand(16, 16) it does work as well.
To whoever comes looking for an answer: it looks like its a bug in pyTorch.
Indexing using numpy arrays is not well defined, and it works only if tensors are indexed using tensors. So, in my example code, this works flawlessly:
a = torch.rand(M, N)
m, n = a.shape
xx, yy = torch.meshgrid(torch.arange(m), torch.arange(m), indexing='xy')
result = a[xx] # WORKS
I made a gist to check it, and it's available here
First, let me give you a quick insight into the idea of indexing a tensor with a numpy array and another tensor.
Example: this is our target tensor to be indexed
numpy_indices = torch.tensor([[0, 1, 2, 7],
[0, 1, 2, 3]]) # numpy array
tensor_indices = torch.tensor([[0, 1, 2, 7],
[0, 1, 2, 3]]) # 2D tensor
t = torch.tensor([[1, 2, 3, 4], # targeted tensor
[5, 6, 7, 8],
[9, 10, 11, 12],
[13, 14, 15, 16],
[17, 18, 19, 20],
[21, 22, 23, 24],
[25, 26, 27, 28],
[29, 30, 31, 32]])
numpy_result = t[numpy_indices]
tensor_result = t[tensor_indices]
Indexing using a 2D numpy array: the index is read like pairs (x,y) tensor[row,column] e.g. t[0,0], t[1,1], t[2,2], and t[7,3].
print(numpy_result) # tensor([ 1, 6, 11, 32])
Indexing using a 2D tensor: walks through the index tensor in a row-wise manner and each value is an index of a row in the targeted tensor.
e.g. [ [t[0],t[1],t[2],[7]] , [[0],[1],[2],[3]] ] see the example below, the new shape of tensor_result after indexing is (tensor_indices.shape[0],tensor_indices.shape[1],t.shape[1])=(2,4,4).
print(tensor_result) # tensor([[[ 1, 2, 3, 4],
# [ 5, 6, 7, 8],
# [ 9, 10, 11, 12],
# [29, 30, 31, 32]],
# [[ 1, 2, 3, 4],
# [ 5, 6, 7, 8],
# [ 9, 10, 11, 12],
# [ 13, 14, 15, 16]]])
If you try to add a third row in numpy_indices, you will get the same error you have because the index will be represented by 3D e.g., (0,0,0)...(7,3,3).
indices = np.array([[0, 1, 2, 7],
[0, 1, 2, 3],
[0, 1, 2, 3]])
print(numpy_result) # IndexError: too many indices for tensor of dimension 2
However, this is not the case with indexing by tensor and the shape will be bigger (3,4,4).
Finally, as you see the outputs of the two types of indexing are completely different. To solve your problem, you can use
xx = torch.tensor(xx).long() # convert a numpy array to a tensor
What happens in the case of advanced indexing (rows of numpy_indices > 3 ) as your situation is still ambiguous and unsolved and you can check 1 , 2, 3.

Indices in Numpy and MATLAB

I have a piece of code in Matlab that I want to convert into Python/numpy.
I have a matrix ind which has the dimensions (32768, 24). I have another matrix X which has the dimensions (98304, 6). When I perform the operation
result = X(ind)
the shape of the matrix is (32768, 24).
but in numpy when I perform the same shape
result = X[ind]
I get the shape of the result matrix as (32768, 24, 6).
I would greatly appreciate it if someone can help me with why I can these two different results and how can I fix them. I would want to get the shape (32768, 24) for the result matrix in numpy as well
In Octave, if I define:
>> X=diag([1,2,3,4])
X =
Diagonal Matrix
1 0 0 0
0 2 0 0
0 0 3 0
0 0 0 4
>> idx = [6 7;10 11]
idx =
6 7
10 11
then the indexing selects a block:
>> X(idx)
ans =
2 0
0 3
The numpy equivalent is
In [312]: X=np.diag([1,2,3,4])
In [313]: X
Out[313]:
array([[1, 0, 0, 0],
[0, 2, 0, 0],
[0, 0, 3, 0],
[0, 0, 0, 4]])
In [314]: idx = np.array([[5,6],[9,10]]) # shifted for 0 base indexing
In [315]: np.unravel_index(idx,(4,4)) # raveled to unraveled conversion
Out[315]:
(array([[1, 1],
[2, 2]]),
array([[1, 2],
[1, 2]]))
In [316]: X[_] # this indexes with a tuple of arrays
Out[316]:
array([[2, 0],
[0, 3]])
another way:
In [318]: X.flat[idx]
Out[318]:
array([[2, 0],
[0, 3]])

Find duplicated sequences in numpy.array or pandas column

For example, I have got an array like this:
([ 1, 5, 7, 9, 4, 6, 3, 3, 7, 9, 4, 0, 3, 3, 7, 8, 1, 5 ])
I need to find all duplicated sequences , not values, but sequences of at least two values one by one.
The result should be like this:
of length 2: [1, 5] with indexes (0, 16);
of length 3: [3, 3, 7] with indexes (6, 12); [7, 9, 4] with indexes (2, 8)
The long sequences should be excluded, if they are not duplicated. ([5, 5, 5, 5]) should NOT be taken as [5, 5] on indexes (0, 1, 2)! It's not a duplicate sequence, it's one long sequence.
I can do it with pandas.apply function, but it calculates too slow, swifter did not help me.
And in real life I need to find all of them, with length from 10 up to 100 values one by one on database with 1500 columns with 700 000 values each. So i really do need a vectorized decision.
Is there a vectorized decision for finding all at once? Or at least for finding only 10-values sequences? Or only 4-values sequences? Anything, that will be fully vectorized?
One possible implementation (although not fully vectorized) that finds all sequences of size n that appear more than once is the following:
import numpy as np
def repeated_sequences(arr, n):
Na = arr.size
r_seq = np.arange(n)
n_seqs = arr[np.arange(Na - n + 1)[:, None] + r_seq]
unique_seqs = np.unique(n_seqs, axis=0)
comp = n_seqs == unique_seqs[:, None]
M = np.all(comp, axis=-1)
if M.any():
matches = np.array(
[np.convolve(M[i], np.ones((n), dtype=int)) for i in range(M.shape[0])]
)
repeated_inds = np.count_nonzero(matches, axis=-1) > n
repeated_matches = matches[repeated_inds]
idxs = np.argwhere(repeated_matches > 0)[::n]
grouped_idxs = np.split(
idxs[:, 1], np.unique(idxs[:, 0], return_index=True)[1][1:]
)
else:
return [], []
return unique_seqs[repeated_inds], grouped_idxs
In theory, you could replace
matches = np.array(
[np.convolve(M[i], np.ones((n), dtype=int)) for i in range(M.shape[0])]
)
with
matches = scipy.signal.convolve(
M, np.ones((1, n), dtype=int), mode="full"
).astype(int)
which would make the whole thing "fully vectorized", but my tests showed that this was 3 to 4 times slower than the for-loop. So I'd stick with that. Or simply,
matches = np.apply_along_axis(np.convolve, -1, M, np.ones((n), dtype=int))
which does not have any significant speed-up, since it's basically a hidden loop (see this).
This is based off #Divakar's answer here that dealt with a very similar problem, in which the sequence to look for was provided. I simply made it so that it could follow this procedure for all possible sequences of size n, which are found inside the function with n_seqs = arr[np.arange(Na - n + 1)[:, None] + r_seq]; unique_seqs = np.unique(n_seqs, axis=0).
For example,
>>> a = np.array([1, 5, 7, 9, 4, 6, 3, 3, 7, 9, 4, 0, 3, 3, 7, 8, 1, 5])
>>> repeated_seqs, inds = repeated_sequences(a, n)
>>> for i, seq in enumerate(repeated_seqs[:10]):
...: print(f"{seq} with indexes {inds[i]}")
...:
[3 3 7] with indexes [ 6 12]
[7 9 4] with indexes [2 8]
Disclaimer
The long sequences should be excluded, if they are not duplicated. ([5, 5, 5, 5]) should NOT be taken as [5, 5] on indexes (0, 1, 2)! It's not a duplicate sequence, it's one long sequence.
This is not directly taken into account and the sequence [5, 5] would appear more than once according to this algorithm. You could do something like this, based off #Paul's answer here, but it involves a loop:
import numpy as np
repeated_matches = np.array([[0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]])
idxs = np.argwhere(repeated_matches > 0)
grouped_idxs = np.split(
idxs[:, 1], np.unique(idxs[:, 0], return_index=True)[1][1:]
)
>>> print(grouped_idxs)
[array([ 6, 7, 8, 12, 13, 14], dtype=int64),
array([ 7, 8, 9, 10], dtype=int64)]
# If there are consecutive numbers in grouped_idxs, that means that there is a long
# sequence that should be excluded. So, you'd have to check for consecutive numbers
filtered_idxs = []
for idx in grouped_idxs:
if not all((idx[1:] - idx[:-1]) == 1):
filtered_idxs.append(idx)
>>> print(filtered_idxs)
[array([ 6, 7, 8, 12, 13, 14], dtype=int64)]
Some tests:
>>> n = 3
>>> a = np.array([1, 5, 7, 9, 4, 6, 3, 3, 7, 9, 4, 0, 3, 3, 7, 8, 1, 5])
>>> %timeit repeated_sequences(a, n)
414 µs ± 5.88 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> n = 4
>>> a = np.random.randint(0, 10, (10000,))
>>> %timeit repeated_sequences(a, n)
3.88 s ± 54 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> result, _ = repeated_sequences(a, n)
>>> result.shape
(2637, 4)
This is not the most efficient implementation by far, but it works as a 2D approach. Plus, if there aren't any repeated sequences, it returns empty lists.
EDIT: Full implementation
I vectorized the routine I added in the Disclaimer section as a possible solution to the long sequence problem and ended up with the following:
import numpy as np
# Taken from:
# https://stackoverflow.com/questions/53051560/stacking-numpy-arrays-of-different-length-using-padding
def stack_padding(it):
def resize(row, size):
new = np.array(row)
new.resize(size)
return new
row_length = max(it, key=len).__len__()
mat = np.array([resize(row, row_length) for row in it])
return mat
def repeated_sequences(arr, n):
Na = arr.size
r_seq = np.arange(n)
n_seqs = arr[np.arange(Na - n + 1)[:, None] + r_seq]
unique_seqs = np.unique(n_seqs, axis=0)
comp = n_seqs == unique_seqs[:, None]
M = np.all(comp, axis=-1)
repeated_seqs = []
idxs_repeated_seqs = []
if M.any():
matches = np.apply_along_axis(np.convolve, -1, M, np.ones((n), dtype=int))
repeated_inds = np.count_nonzero(matches, axis=-1) > n
if repeated_inds.any():
repeated_matches = matches[repeated_inds]
idxs = np.argwhere(repeated_matches > 0)
grouped_idxs = np.split(
idxs[:, 1], np.unique(idxs[:, 0], return_index=True)[1][1:]
)
# Additional routine
# Pad this uneven array with zeros so that we can use it normally
grouped_idxs = np.array(grouped_idxs, dtype=object)
padded_idxs = stack_padding(grouped_idxs)
# Find the indices where there are padded zeros
pad_positions = padded_idxs == 0
# Perform the "consecutive-numbers check" (this will take one
# item off the original array, so we have to correct for its shape).
idxs_to_remove= np.pad(
(padded_idxs[:, 1:] - padded_idxs[:, :-1]) == 1,
[(0, 0), (0, 1)],
constant_values=True,
)
pad_positions = np.argwhere(pad_positions)
i = pad_positions[:, 0]
j = pad_positions[:, 1] - 1 # Shift by one (shape correction)
idxs_to_remove[i, j] = True # Masking, since we don't want pad indices
# Obtain a final mask (boolean opposite of indices to remove)
final_mask = ~idxs_to_remove.all(axis=-1)
grouped_idxs = grouped_idxs[final_mask] # Filter the long sequences
repeated_seqs = unique_seqs[repeated_inds][final_mask]
# In order to get the correct indices, we must first limit the
# search to a shape (on axis=1) of the closest multiple of n.
# This will avoid taking more indices than we should to show where
# each repeated sequence begins
to = padded_idxs.shape[1] & (-n)
# Build the final list of indices (that goes from 0 - to with
# a step of n
idxs_repeated_seqs = [
grouped_idxs[i][:to:n] for i in range(grouped_idxs.shape[0])
]
return repeated_seqs, idxs_repeated_seqs
For example,
n = 2
examples = [
# First example is your original example array.
np.array([1, 5, 7, 9, 4, 6, 3, 3, 7, 9, 4, 0, 3, 3, 7, 8, 1, 5]),
# Second example has a long sequence of 5's, and since there aren't
# any [5, 5] anywhere else, it's not taken into account and therefore
# should not come out.
np.array([1, 5, 5, 5, 5, 6, 3, 3, 7, 9, 4, 0, 3, 3, 7, 8, 1, 5]),
# Third example has the same long sequence but since there is a [5, 5]
# later, then it should take it into account and this sequence should
# be found.
np.array([1, 5, 5, 5, 5, 6, 5, 5, 7, 9, 4, 0, 3, 3, 7, 8, 1, 5]),
# Fourth example has a [5, 5] first and later it has a long sequence of
# 5's which are uneven and the previous implementation got confused with
# the indices to show as the starting indices. In this case, it should be
# 1, 13 and 15 for [5, 5].
np.array([1, 5, 5, 9, 4, 6, 3, 3, 7, 9, 4, 0, 3, 5, 5, 5, 5, 5]),
]
for a in examples:
print(f"\nExample: {a}")
repeated_seqs, inds = repeated_sequences(a, n)
for i, seq in enumerate(repeated_seqs):
print(f"\t{seq} with indexes {inds[i]}")
Output (as expected):
Example: [1 5 7 9 4 6 3 3 7 9 4 0 3 3 7 8 1 5]
[1 5] with indexes [0 16]
[3 3] with indexes [6 12]
[3 7] with indexes [7 13]
[7 9] with indexes [2 8]
[9 4] with indexes [3 9]
Example: [1 5 5 5 5 6 3 3 7 9 4 0 3 3 7 8 1 5]
[1 5] with indexes [0 16]
[3 3] with indexes [6 12]
[3 7] with indexes [7 13]
Example: [1 5 5 5 5 6 5 5 7 9 4 0 3 3 7 8 1 5]
[1 5] with indexes [ 0 16]
[5 5] with indexes [1 3 6]
Example: [1 5 5 9 4 6 3 3 7 9 4 0 3 5 5 5 5 5]
[5 5] with indexes [ 1 13 15]
[9 4] with indexes [3 9]
You can test it out yourself with more examples and more cases. Keep in mind this is what I understood from your disclaimer. If you want to count the long sequences as one, even if multiple sequences are in there (for example, [5, 5] appears twice in [5, 5, 5, 5]), this won't work for you and you'd have to come up with something else.

How to make a simple Vandermonde matrix with numpy?

My question is how to make a vandermonde matrix. This is the definition:
In linear algebra, a Vandermonde matrix, named after Alexandre-Théophile Vandermonde, is a matrix with the terms of a geometric progression in each row, i.e., an m × n matrix
I would like to make a 4*4 version of this.
So farI have defined values but only for one row as follows
a=2
n=4
for a in range(n):
for i in range(n):
v.append(a**i)
v = np.array(v)
print(v)
I dont know how to scale this. Please help!
Given a starting column a of length m you can create a Vandermonde matrix v with n columns a**0 to a**(n-1)like so:
import numpy as np
m = 4
n = 4
a = range(1, m+1)
v = np.array([a]*n).T**range(n)
print(v)
#[[ 1 1 1 1]
# [ 1 2 4 8]
# [ 1 3 9 27]
# [ 1 4 16 64]]
As proposed by michael szczesny you could use numpy.vander.
But this will not be according to the definition on Wikipedia.
x = np.array([1, 2, 3, 5])
N = 4
np.vander(x, N)
#array([[ 1, 1, 1, 1],
# [ 8, 4, 2, 1],
# [ 27, 9, 3, 1],
# [125, 25, 5, 1]])
So, you'd have to use numpy.fliplr aswell:
x = np.array([1, 2, 3, 5])
N = 4
np.fliplr(np.vander(x, N))
#array([[ 1, 1, 1, 1],
# [ 1, 2, 4, 8],
# [ 1, 3, 9, 27],
# [ 1, 5, 25, 125]])
This could also be achieved without numpy using nested list comprehensions:
x = [1, 2, 3, 5]
N = 4
[[xi**i for i in range(N)] for xi in x]
# [[1, 1, 1, 1],
# [1, 2, 4, 8],
# [1, 3, 9, 27],
# [1, 5, 25, 125]]
# Vandermonde Matrix
def Vandermonde_Matrix(D, k):
'''
D = {(x_i,y_i): 0<=i<=n}
----------------
k degree
'''
n = len(D)
V = np.zeros(shape=(n, k))
for i in range(n):
V[i] = np.power(np.array(D[i][0]), np.arange(k))
return V

Does 'tf.assign' return its argument?

The Tensorflow documentation says that tf.assign(ref, ...) returns ref, but it appears instead (not surprisingly) to return a Tensor (attached to the assign op):
import tensorflow as tf
sess = tf.InteractiveSession()
Q = tf.Variable(tf.constant(range(1, 12)))
sess.run(tf.global_variables_initializer())
qop = tf.assign(Q, tf.zeros(Q.shape, tf.int32))#.eval()
print(Q.eval())
print(qop.eval())
print(Q.eval())
produces
[ 1 2 3 4 5 6 7 8 9 10 11]
[0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0]
demonstrating that the argument Q and what's returned qop behave differently (and that Q is unchanged until qop is executed).
Is the return value of tf.assign described correctly in the documentation?
Take a look at the documentation of Tensorflow about operations. tf.assign returns an Operation, which represents a graph node that performs computations on tensors. You use the operations to compose a graph of computations. Those computations actually occur at a later time, when you call eval on any of the operations of the graph.
In your example qop is the definition of an operation that assigns zeros to variable Q. The graph of your example would look something like Q --> qep. For pedagogical purposes, let's change the order of your code to something like this:
Q = tf.Variable(tf.constant(range(1, 12)))
Q.eval() # Error: Variable has not been initalized.
sess.run(tf.global_variables_initializer())
Q.eval() # Output: array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], dtype=int32)
qop = tf.assign(Q, tf.zeros(Q.shape, tf.int32))
Q.eval() # Output: array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], dtype=int32)
qop.eval()
Q.eval() # Output array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)
The first time you evaluate Q you get an error, because the variable represented by Q does not contain anything in it yet. But after you run sess.run(tf.global_variables_initializer()) the error goes away. That is because that line of code runs an operation that initializes all the global variables of the current graph. When you run Q.eval() after defining the qop operation, Q still has the same values, because operation qop was defined, but not yet executed. Once you execute qop (qop.eval) is that the value of the variable represnted by Q changes.