Padding and Masking a batch dataset - numpy

When representing multiple strings of natural language, the number of characters in each string may differ. The result can then be placed in a tf.RaggedTensor, where the length of the innermost dimension varies depending on the number of characters in each string:
rtensor = tf.ragged.constant([
    [1, 2],
    [3, 4, 5],
    [6]
])
rtensor
#<tf.RaggedTensor [[1, 2], [3, 4, 5], [6]]>
Applying the to_tensor method then converts the RaggedTensor into a regular tf.Tensor, padding it in the process:
batch_size=3
max_length=8
tensor = rtensor.to_tensor(default_value=0, shape=(batch_size, max_length))
#<tf.Tensor: shape=(3, 8), dtype=int32, numpy=
#array([[1, 2, 0, 0, 0, 0, 0, 0],
#       [3, 4, 5, 0, 0, 0, 0, 0],
#       [6, 0, 0, 0, 0, 0, 0, 0]], dtype=int32)>
Now, is there a way to also generate a companion mask tensor that shows which entries are original data and which are padding? For the example above it would be:
<tf.Tensor: shape=(3, 8), dtype=int32, numpy=
array([[1, 1, 0, 0, 0, 0, 0, 0],
       [1, 1, 1, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 0, 0]], dtype=int32)>

As thusv89 suggests, you can simply check for non-zero values. It can be as simple as converting to boolean and back.
import tensorflow as tf
rtensor = tf.ragged.constant([[1, 2],
                              [3, 4, 5],
                              [6]])
batch_size = 3
max_length = 8
tensor = rtensor.to_tensor(default_value=0, shape=(batch_size, max_length))
mask = tf.dtypes.cast(tf.dtypes.cast(tensor, tf.bool), tensor.dtype)
print(mask.numpy())
# [[1 1 0 0 0 0 0 0]
# [1 1 1 0 0 0 0 0]
# [1 0 0 0 0 0 0 0]]
The only possible drawback is that your original data might itself contain 0 values. If you know your data is always non-negative, you can use some other value, for example -1, as the default value when converting to a tensor:
tensor = rtensor.to_tensor(default_value=-1, shape=(batch_size, max_length))
mask = tf.dtypes.cast(tensor >= 0, tensor.dtype)
But if you want your mask to work for whatever values you have, you can also just use tf.ones_like with the ragged tensor:
rtensor_ones = tf.ones_like(rtensor)
mask = rtensor_ones.to_tensor(default_value=0, shape=(batch_size, max_length))
This way mask will always be one exactly where rtensor has a value.
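A quick check (my addition, not part of the original answer): if the data itself contains a 0, the value-based mask drops it, while the tf.ones_like mask keeps it.
import tensorflow as tf
rtensor = tf.ragged.constant([[0, 2],
                              [3, 4, 5],
                              [6]])
batch_size = 3
max_length = 8
tensor = rtensor.to_tensor(default_value=0, shape=(batch_size, max_length))
# Value-based mask: the legitimate 0 in the first row is treated as padding
value_mask = tf.dtypes.cast(tf.dtypes.cast(tensor, tf.bool), tensor.dtype)
print(value_mask.numpy()[0])  # [0 1 0 0 0 0 0 0]
# ones_like mask: built from the ragged structure, so the 0 is kept as data
ones_mask = tf.ones_like(rtensor).to_tensor(default_value=0, shape=(batch_size, max_length))
print(ones_mask.numpy()[0])   # [1 1 0 0 0 0 0 0]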

Related

How to slice a scipy sparse matrix and keep the original indexing?

Let's say I have the following array:
import numpy as np
import scipy.sparse as sp_sparse
a = np.array([[1, 2, 3], [0, 1, 2], [1, 3, 4], [4, 5, 6]])
a = sp_sparse.csr_matrix(a)
and I want to get a submatrix of the sparse array that consists of the first and last rows.
>>> sub_matrix = a[[0, 3], :]
>>> print(sub_matrix)
(0, 0) 1
(0, 1) 2
(0, 2) 3
(1, 0) 4
(1, 1) 5
(1, 2) 6
But I want to keep the original indexing for the selected rows, so for my example, it would be something like:
(0, 0) 1
(0, 1) 2
(0, 2) 3
(3, 0) 4
(3, 1) 5
(3, 2) 6
I know I could do this by setting all the other rows of the dense array to zero and then computing the sparse array again, but I want to know if there is a better way to achieve this.
Any help would be appreciated!
Depending on the indexing, it might be easier to construct the extractor/indexing matrix with the COO style of inputs:
In [129]: from scipy import sparse
In [130]: M = sparse.csr_matrix(np.arange(16).reshape(4,4))
In [131]: M
Out[131]:
<4x4 sparse matrix of type '<class 'numpy.int64'>'
with 15 stored elements in Compressed Sparse Row format>
In [132]: M.A
Out[132]:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])
A square extractor matrix with the desired "diagonal" values:
In [133]: extractor = sparse.csr_matrix(([1,1],([0,3],[0,3])))
In [134]: extractor
Out[134]:
<4x4 sparse matrix of type '<class 'numpy.int64'>'
with 2 stored elements in Compressed Sparse Row format>
Matrix multiplication in one direction selects columns:
In [135]: M@extractor
Out[135]:
<4x4 sparse matrix of type '<class 'numpy.int64'>'
with 7 stored elements in Compressed Sparse Row format>
In [136]: _.A
Out[136]:
array([[ 0,  0,  0,  3],
       [ 4,  0,  0,  7],
       [ 8,  0,  0, 11],
       [12,  0,  0, 15]])
and in the other, rows:
In [137]: extractor@M
Out[137]:
<4x4 sparse matrix of type '<class 'numpy.int64'>'
with 7 stored elements in Compressed Sparse Row format>
In [138]: _.A
Out[138]:
array([[ 0,  1,  2,  3],
       [ 0,  0,  0,  0],
       [ 0,  0,  0,  0],
       [12, 13, 14, 15]])
In [139]: extractor.A
Out[139]:
array([[1, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 1]])
M[[0,3],:] does the same thing, but with:
In [140]: extractor = sparse.csr_matrix(([1,1],([0,1],[0,3])))
In [142]: (extractor@M).A
Out[142]:
array([[ 0,  1,  2,  3],
       [12, 13, 14, 15]])
Row and column sums are also performed with matrix multiplication:
In [149]: M@np.ones(4,int)
Out[149]: array([ 6, 22, 38, 54])
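Column sums go the other way; one option (my addition, not from the original transcript) is to multiply the transpose by the ones vector:
M.T @ np.ones(4, int)
# array([24, 28, 32, 36])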
import numpy as np
import scipy.sparse as sp_sparse
a = np.array([[1, 2, 3], [0, 1, 2], [1, 3, 4], [4, 5, 6]])
a = sp_sparse.csr_matrix(a)
It's probably easiest to just use a selection matrix and then multiply.
idx = np.isin(np.arange(a.shape[0]), [0, 3]).astype(int)
b = sp_sparse.diags(idx, format='csr') @ a
The disadvantage is that this will result in an array of floats instead of integers, but that's easy enough to fix.
>>> b.astype(int).A
array([[1, 2, 3],
       [0, 0, 0],
       [0, 0, 0],
       [4, 5, 6]])
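Printing the result as a sparse matrix shows that the original row indices are preserved, which is what the question asked for (a quick check based on the setup above; the exact print layout may differ slightly):
print(b.astype(int))
#   (0, 0)    1
#   (0, 1)    2
#   (0, 2)    3
#   (3, 0)    4
#   (3, 1)    5
#   (3, 2)    6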

convert CSR format to dense/COO format in tensorflow

The tf.sparse_to_dense() function in TensorFlow only supports the ((data, (row_ind, col_ind)), [shape=(M, N)]) format. How can I convert a standard CSR tensor (((data, indices, indptr), [shape=(M, N)])) to a dense representation in TensorFlow?
For example, given data, indices and indptr, the function should return the dense tensor.
e.g., inputs:
indices = [1 3 3 0 1 2 2 3]
indptr = [0 2 3 6 8]
data = [2 4 1 3 2 1 1 5]
expected output:
[[0, 2, 0, 4],
[0, 0, 0, 1],
[3, 2, 1, 0],
[0, 0, 1, 5]]
According to the SciPy documentation, we can convert it back as follows:
the column indices for row i are stored in indices[indptr[i]:indptr[i+1]] and their
corresponding values are stored in data[indptr[i]:indptr[i+1]].
If the shape parameter is not supplied, the matrix dimensions are
inferred from the index arrays.
It is relatively easy to convert from the CSR format to COO by expanding the indptr argument to get the row indices. Here is an example using a subtraction, tf.repeat and tf.range. The shape of the final sparse tensor is inferred from the maximum indices in the rows/columns respectively (but can also be provided explicitly).
def csr_to_sparse(data, indices, indptr, dense_shape=None):
    rep = tf.math.subtract(indptr[1:], indptr[:-1])
    row_indices = tf.repeat(tf.range(tf.size(rep)), rep)
    sparse_indices = tf.cast(tf.stack((row_indices, indices), axis=-1), tf.int64)
    if dense_shape is None:
        max_row = tf.math.reduce_max(row_indices)
        max_col = tf.math.reduce_max(indices)
        dense_shape = (max_row + 1, max_col + 1)
    return tf.SparseTensor(indices=sparse_indices, values=data, dense_shape=dense_shape)
With your example:
>>> indices = [1, 3, 3, 0, 1, 2, 2, 3]
>>> indptr = [0, 2, 3, 6, 8,]
>>> data = [2, 4, 1, 3, 2, 1, 1, 5]
>>> tf.sparse.to_dense(csr_to_sparse(data, indices, indptr))
<tf.Tensor: shape=(4, 4), dtype=int32, numpy=
array([[0, 2, 0, 4],
       [0, 0, 0, 1],
       [3, 2, 1, 0],
       [0, 0, 1, 5]], dtype=int32)>

Trying to fill a zeros numpy array of size (6,6) with an array of size (2,2)

I have an array S:
S = array([[980, 100],
           [  3,   5]])
I need to resize it or embed it in a zeros array of size (6,6). My desired output is:
out = array([[980, 100, 0, 0, 0, 0],
             [  3,   5, 0, 0, 0, 0],
             [  0,   0, 0, 0, 0, 0],
             [  0,   0, 0, 0, 0, 0],
             [  0,   0, 0, 0, 0, 0],
             [  0,   0, 0, 0, 0, 0]], dtype=int32)
Can anyone help?
I figured it out.
Create a zeros matrix of the desired size, then fill its top-left corner with your array:
import numpy as np
# zeros matrix of the desired size
zeros = np.zeros((6, 6))
# your array
array = np.array([[1, 2, 5, 6], [3, 4, 4, 3], [5, 6, 2, 8]])
# get its shape
lenx, leny = array.shape
# fill the zeros matrix with your array
zeros[:lenx, :leny] = array
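For the (2,2) array S from the question, the same idea gives the desired (6,6) output (a quick illustration; the int32 dtype is my assumption, chosen to match the expected output):
import numpy as np
S = np.array([[980, 100],
              [  3,   5]])
out = np.zeros((6, 6), dtype=np.int32)  # int32 to match the desired output
out[:S.shape[0], :S.shape[1]] = S
print(out)
# [[980 100   0   0   0   0]
#  [  3   5   0   0   0   0]
#  [  0   0   0   0   0   0]
#  [  0   0   0   0   0   0]
#  [  0   0   0   0   0   0]
#  [  0   0   0   0   0   0]]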

Numpy dot product of a matrix and an array is a matrix

When I updated to the most recent version of numpy, a lot of my code broke because now every time I call np.dot() on a matrix and an array, it returns a 1xn matrix rather than simply an array.
This causes an error when I try to multiply the new vector/array by a matrix.
Example:
A = np.matrix( [ [4, 1, 0, 0], [1, 5, 1, 0], [0, 1, 6, 1], [1, 0, 1, 4] ] )
x = np.array([0, 0, 0, 0])
print(x)
x1 = np.dot(A, x)
print(x1)
x2 = np.dot(A, x1)
print(x2)
output:
[0 0 0 0]
[[0 0 0 0]]
Traceback (most recent call last):
File "review.py", line 13, in <module>
x2 = np.dot(A, x1)
ValueError: shapes (4,4) and (1,4) not aligned: 4 (dim 1) != 1 (dim 0)
I would expect that either dot of a matrix and vector would return a vector, or dot of a matrix and 1xn matrix would work as expected.
Using the transpose of x doesn't fix this, nor does using A @ x, or A.dot(x) or any variation of np.matmul(A, x)
Your arrays:
In [24]: A = np.matrix( [ [4, 1, 0, 0], [1, 5, 1, 0], [0, 1, 6, 1], [1, 0, 1, 4] ] )
...: x = np.array([0, 0, 0, 0])
In [25]: A.shape
Out[25]: (4, 4)
In [26]: x.shape
Out[26]: (4,)
The dot:
In [27]: np.dot(A,x)
Out[27]: matrix([[0, 0, 0, 0]]) # (1,4) shape
Let's try the same, but with a ndarray version of A:
In [30]: A.A
Out[30]:
array([[4, 1, 0, 0],
       [1, 5, 1, 0],
       [0, 1, 6, 1],
       [1, 0, 1, 4]])
In [31]: np.dot(A.A, x)
Out[31]: array([0, 0, 0, 0])
The result is (4,) shape. That makes sense: (4,4) dot (4,) => (4,)
np.dot(A,x) is doing the same calculation, but returning a np.matrix. That by definition is a 2d array, so the (4,) is expanded to (1,4).
I don't have an older version to test this on, and am not aware of any changes.
If x is a (4,1) matrix, then the result is (4,4) dot (4,1) => (4,1):
In [33]: np.matrix(x)
Out[33]: matrix([[0, 0, 0, 0]])
In [34]: np.dot(A, np.matrix(x).T)
Out[34]:
matrix([[0],
        [0],
        [0],
        [0]])
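If the goal is simply to get a plain 1-D vector back, one option (a sketch, my addition, not from the original answer) is to avoid np.matrix entirely, or to flatten the matrix result:
import numpy as np
A = np.matrix([[4, 1, 0, 0], [1, 5, 1, 0], [0, 1, 6, 1], [1, 0, 1, 4]])
x = np.array([0, 0, 0, 0])
# Work with a plain ndarray so dot keeps returning (4,) vectors
x1 = np.asarray(A) @ x        # shape (4,)
x2 = np.asarray(A) @ x1       # shape (4,), no alignment error
# Or flatten the np.matrix result back to a 1-D ndarray with .A1
x1 = np.dot(A, x).A1          # shape (4,)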

What is the order output of conv2d in tensorflow?

I wanted to get the values of an output tensor in TensorFlow.
The kernel shape of the first layer was K[row, col, in_channel, out_channel].
The input image shape is P[batch, row, col, channel].
I tried to get the first four kernel values; they were K[0, 0, 0, 0], K[0, 1, 0, 0], K[1, 0, 0, 0], K[1, 1, 0, 0].
The input values I got were P[0, 0, 0, 0], P[0, 0, 1, 0], P[0, 1, 0, 0], P[0, 1, 1, 0].
The Python code is: F = tf.nn.conv2d(P, K, strides=[1, 1, 1, 1], padding='SAME')
The console showed that the output value F[0, 0, 0, 0] is not K[0, 0, 0, 0] * P[0, 0, 0, 0] + K[0, 1, 0, 0] * P[0, 0, 1, 0] + K[1, 0, 0, 0] * P[0, 1, 0, 0] + K[1, 1, 0, 0] * P[0, 1, 1, 0].
What is the order of these output feature maps? I had 40 conv kernels, and the first output was not computed by the first conv kernel.
There's something wrong with your input values.
Remember that conv2d wants an input tensor of shape [batch, in_height, in_width, in_channels] and a filter / kernel tensor of shape [filter_height, filter_width, in_channels, out_channels].
In fact, after reshaping your data the results are the ones expected (please note that conv2d computes the correlation and not the convolution).
import tensorflow as tf
K = tf.get_variable("K", shape=(4,4), initializer=tf.constant_initializer([
[0, 0, 0, 0],
[0, 1, 0, 0],
[1, 0, 0, 0],
[1, 1, 0, 0]
]))
K = tf.reshape(K, (4,4,1,1))
P = tf.get_variable("P", shape=(4,4), initializer=tf.constant_initializer([
[0, 0, 0, 0],
[0, 0, 1, 0],
[0, 1, 0, 0],
[0, 1, 1, 0]
]))
P = tf.reshape(P, (1,4,4,1))
F = tf.nn.conv2d(P, K, strides=[1, 1, 1, 1], padding='VALID')
init = tf.global_variables_initializer()
with tf.Session() as sess:
sess.run(init)
print(sess.run(F))
In this example, I'm computing the correlation between the input P (a batch with 1 element of depth 1) and the filter K (a 4x4 filter, with input depth 1 and output depth 1).
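As a rough cross-check (my addition, not from the original answer), the single VALID output value for a 4x4 input and a 4x4 filter is just the element-wise product of the two arrays summed up, i.e. the correlation:
import numpy as np
K = np.array([[0, 0, 0, 0],
              [0, 1, 0, 0],
              [1, 0, 0, 0],
              [1, 1, 0, 0]])
P = np.array([[0, 0, 0, 0],
              [0, 0, 1, 0],
              [0, 1, 0, 0],
              [0, 1, 1, 0]])
# padding='VALID' with a filter the same size as the input gives one output value
print(np.sum(K * P))  # 1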