np.array returns different dimensions of the array for the same data - numpy

I need to convert a list of ndarrays into an ndarray of ndarrays. In the first case I split the original array into 5 pieces with np.array_split, which gives a list of ndarrays; I then pass this list to np.array() and get an ndarray with shape (5,). In the second case I do the same with other data and get an ndarray with shape (5,200,3072). The only difference between the two data sets is their shape: (121, 3072) in the first case and (1000, 3072) in the second.
Here shape will be (5,)
train_folds_X = []
train_folds_X = np.array_split(binary_train_X,5,axis = 0)
np.array(train_folds_X).shape
but here shape will be (5,200,3072)
train_folds_X = []
train_folds_X = np.array_split(train_X,5,axis = 0)
np.array(train_folds_X).shape
binary_train_X has shape (121,3072) and train_X has shape (1000,3072). Otherwise it is the same data: digits from Street View House Numbers (http://ufldl.stanford.edu/housenumbers/), except that binary_train_X contains only 0 and 9. Before np.array is applied, train_folds_X has the same length (5) in both cases. I don't understand why this is happening.

The reason is that in the first case np.array_split returns a list of arrays of unequal size, because 121 rows cannot be divided evenly into 5 pieces. Look at this:
y = np.empty([121,3072])
[np.array_split(y,5,axis = 0)[i].shape for i in range(5)]
which returns the first array a bit bigger than the others:
[(25, 3072), (24, 3072), (24, 3072), (24, 3072), (24, 3072)]
and thus the resulting shape of:
np.array(np.array_split(y,5,axis = 0)).shape
is (5,). On NumPy 1.19 through 1.23 you should also get a deprecation warning:
Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
From NumPy 1.24 onward the same call raises a ValueError instead, unless dtype=object is passed.
On the other hand the second case behaves as expected:
z = np.empty([1000,3072])
[np.array_split(z,5,axis = 0)[i].shape for i in range(5)]
returns:
[(200, 3072), (200, 3072), (200, 3072), (200, 3072), (200, 3072)]
and thus the resulting array can be constructed as you expect:
np.array(np.array_split(z,5,axis = 0)).shape
returns (5, 200, 3072)
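To get a predictable result regardless of the NumPy version, one option is to build the container explicitly. The following is a minimal sketch, not part of the original answer: the variable names are illustrative and np.zeros stands in for the real data. It shows two possibilities: an object array that holds the uneven folds as they are, and a regular 3-D array obtained by dropping the leftover row so that the data divides evenly.

```python
import numpy as np

data = np.zeros((121, 3072))  # stand-in for binary_train_X (assumption)
n_folds = 5

# array_split tolerates uneven division; the first fold gets the extra row.
folds = np.array_split(data, n_folds, axis=0)
fold_shapes = [f.shape for f in folds]

# Option 1: keep the uneven folds in a 1-D object array.
ragged = np.empty(n_folds, dtype=object)
for i, fold in enumerate(folds):
    ragged[i] = fold

# Option 2: drop the leftover rows so every fold has the same size,
# then stack the folds into one regular 3-D array.
usable = (data.shape[0] // n_folds) * n_folds
even = np.stack(np.split(data[:usable], n_folds, axis=0))

print(fold_shapes)
print(ragged.shape, even.shape)
```

Option 2 discards one sample here, so it only fits when losing a few rows is acceptable.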

Related

different dimension numpy array broadcasting issue with '+=' operator

I'm new to numpy, and I have an interesting observation on the broadcasting. When I'm adding a 3x5 array directly to a 3x1 array, and update the original 3x1 array with the result, there is no broadcasting issue.
import numpy as np
total = np.random.uniform(-1,1, size=(3))[:,np.newaxis]
print(f'init = \n {total}')
for i in range(3):
    total = total + np.ones(shape=(3,5))
    print(f'total_{i} = \n {total}')
However, if I use the '+=' operator to increment the 3x1 array by the 3x5 array, there is a broadcasting error. Which rule of numpy broadcasting did I violate in the latter case?
total = np.random.uniform(-1,1, size=(3))[:,np.newaxis]
print(f'init = \n {total}')
for i in range(3):
    total += np.ones(shape=(3,5))
    print(f'total_{i} = \n {total}')
Thank you!
hawkoli1987
According to the docstring of numpy's add function:
def add(x1, x2, *args, **kwargs): # real signature unknown; NOTE: unreliably restored from __doc__
    """
    add(x1, x2, /, out=None, *, where=True, casting='same_kind', order='K', dtype=None, subok=True[, signature, extobj])
    Add arguments element-wise.
    Parameters
    ----------
    x1, x2 : array_like
        The arrays to be added.
        If ``x1.shape != x2.shape``, they must be broadcastable to a common
        shape (which becomes the shape of the output).
    out : ndarray, None, or tuple of ndarray and None, optional
        A location into which the result is stored. If provided, it must have
        a shape that the inputs broadcast to. If not provided or None,
        a freshly-allocated array is returned. A tuple (possible only as a
        keyword argument) must have length equal to the number of outputs.
    """
When out is not given, add returns a freshly allocated array. When out is given, the result is written into it, and out must already have the shape that the inputs broadcast to.
In Python, a = a + b and a += b are not exactly the same: + calls __add__, while += calls __iadd__.
a = np.array([1, 2])
b = np.array([3, 4])
first_id = id(a)
a = a + b
second_id = id(a)
print(first_id == second_id)  # False: a + b created a new array
a = np.array([1, 2])
b = np.array([3, 4])
first_id = id(a)
a += b
second_id = id(a)
print(first_id == second_id)  # True: += modified the existing array
+= does not create a new object; it writes the result into the existing array.
For ndarrays, a += b behaves like np.add(a, b, out=a), so the broadcast result of a and b must have exactly the shape of a. A (3,1) array and a (3,5) array broadcast to (3,5), which does not fit into the (3,1) array on the left, hence the error. total = total + ... has no such restriction, because it allocates a new (3,5) array and rebinds the name total to it.
For example,
total = np.random.uniform(-1,1, size=(3))[:,np.newaxis]
print(id(total))
for i in range(3):
    total += np.ones(shape=(3,1))
    print(id(total))
id(total) is the same every time, because += writes into the existing (3,1) array and the (3,1) operand broadcasts to that same shape.
The actual error message is:
In [29]: arr = np.zeros((1,3))
In [30]: arr += np.ones((2,3))
Traceback (most recent call last):
Input In [30] in <cell line: 1>
arr += np.ones((2,3))
ValueError: non-broadcastable output operand with shape (1,3) doesn't match the broadcast shape (2,3)
I read that as saying that arr on the left is "non-broadcastable", whereas arr + np.ones((2,3)) is the result of broadcasting. The wording may be awkward; it is probably produced in some deep compiled function where it makes more sense.
We get a variant on this when we try to assign an array to a slice of an array:
In [31]: temp = arr + np.ones((2,3))
In [32]: temp.shape
Out[32]: (2, 3)
In [33]: arr[:] = temp
Traceback (most recent call last):
Input In [33] in <cell line: 1>
arr[:] = temp
ValueError: could not broadcast input array from shape (2,3) into shape (1,3)
This is clearer, saying that the RHS (2,3) cannot be put into the LHS (1,3) slot.
Or trying to put the (2,3) into one "row" of arr:
In [35]: arr[0] = temp
Traceback (most recent call last):
Input In [35] in <cell line: 1>
arr[0] = temp
ValueError: could not broadcast input array from shape (2,3) into shape (3,)
arr[0] = arr works because it tries to put a (1,3) into a (3,) shape - that's a workable broadcasting combination.
arr[0] = arr.T tries to put a (3,1) into a (3,), and fails.
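The difference between the two forms can be reproduced in a few lines. This is a small sketch with illustrative variable names: the out-of-place sum succeeds, the in-place one raises, and one possible workaround is to expand the left-hand array to the broadcast shape before accumulating.

```python
import numpy as np

total = np.zeros((3, 1))
ones = np.ones((3, 5))

# Out of place: a new (3, 5) array is allocated for the result.
out_of_place = total + ones

# In place: the (3, 5) result cannot be stored in the (3, 1) array.
try:
    total += ones
    in_place_error = None
except ValueError as exc:
    in_place_error = str(exc)

# Possible workaround: materialise the broadcast shape once, then accumulate.
expanded = np.broadcast_to(total, (3, 5)).copy()
expanded += ones

print(out_of_place.shape)
print(in_place_error)
print(expanded.shape)
```

The .copy() matters because np.broadcast_to returns a read-only view that cannot be modified in place.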

Is there a numpy function like np.fill(), but for arrays as fill value?

I'm trying to build an array of some given shape in which all elements are given by another array. Is there a function in numpy which does that efficiently, similar to np.full(), or any other elegant way, without simply employing for loops?
Example: Let's say I want an array with shape (dim1,dim2) filled with a given, constant scalar value. Numpy has np.full() for this:
my_array = np.full((dim1,dim2),value)
I'm looking for an analogous way of doing this, but I want the array to be filled with another array of shape (filldim1,filldim2). A brute-force way would be this:
my_array = np.array([])
for i in range(dim1):
    for j in range(dim2):
        my_array = np.append(my_array,fill_array)
my_array = my_array.reshape((dim1,dim2,filldim1,filldim2))
EDIT
I was being stupid, np.full() does take arrays as fill value if the shape is modified accordingly:
my_array = np.full((dim1,dim2,filldim1,filldim2),fill_array)
Thanks for pointing that out, @Arne!
You can use np.tile:
>>> shape = (2, 3)
>>> fill_shape = (4, 5)
>>> fill_arr = np.random.randn(*fill_shape)
>>> arr = np.tile(fill_arr, [*shape, 1, 1])
>>> arr.shape
(2, 3, 4, 5)
>>> np.all(arr[0, 0] == fill_arr)
True
Edit: better answer, as suggested by @Arne, directly using np.full:
>>> arr = np.full([*shape, *fill_shape], fill_arr)
>>> arr.shape
(2, 3, 4, 5)
>>> np.all(arr[0, 0] == fill_arr)
True
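For comparison, here is a short sketch that puts both approaches side by side and adds a third option, np.broadcast_to, which is not mentioned above. It returns a read-only view instead of a copy, which can save memory when the result is never written to. The concrete shapes and values are illustrative.

```python
import numpy as np

shape = (2, 3)
fill_arr = np.arange(20.0).reshape(4, 5)
target_shape = shape + fill_arr.shape  # (2, 3, 4, 5)

# np.full broadcasts the fill value against the requested shape.
full = np.full(target_shape, fill_arr)

# np.tile repeats fill_arr along two new leading axes.
tiled = np.tile(fill_arr, shape + (1, 1))

# np.broadcast_to gives a read-only view without copying the data.
view = np.broadcast_to(fill_arr, target_shape)

print(full.shape, tiled.shape, view.shape)
print(view.flags.writeable)
```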

Padding Labels for Tensorflow CTC Loss?

I would like to pad my labels so that they would be of equal length to be passed into the ctc_loss function. Apparently, -1 is not allowed. If I were to apply padding, should the padding value be part of the labels for ctc?
Update
I have this code that converts dense labels into sparse ones to be passed to the ctc_loss function which I think is related to the problem.
import tensorflow as tf
def dense_to_sparse(dense_tensor, out_type):
    # Indices of all entries that differ from 0
    indices = tf.where(tf.not_equal(dense_tensor, tf.constant(0, dense_tensor.dtype)))
    values = tf.gather_nd(dense_tensor, indices)
    shape = tf.shape(dense_tensor, out_type=out_type)
    return tf.SparseTensor(indices, values, shape)
Actually, -1 values are allowed to be present in the y_true argument of the ctc_batch_cost with one limitation - they should not appear within the actual label "content" which is specified by label_length (here i-th label "content" would start from the index 0 and end at the index label_length[i]).
So it is perfectly fine to pad labels with -1 so that they would be of equal length, as you intended. The only thing you should take care about is to correctly calculate and pass corresponding label_length values.
Here is the sample code which is a modified version of the test_ctc unit test from keras:
import numpy as np
from tensorflow.keras import backend as K
number_of_categories = 4
number_of_timesteps = 5
labels = np.asarray([[0, 1, 2, 1, 0], [0, 1, 1, 0, -1]])
label_lens = np.expand_dims(np.asarray([5, 4]), 1)
# dimensions are batch x time x categories
inputs = np.zeros((2, number_of_timesteps, number_of_categories), dtype=np.float32)
input_lens = np.expand_dims(np.asarray([5, 5]), 1)
k_labels = K.variable(labels, dtype="int32")
k_inputs = K.variable(inputs, dtype="float32")
k_input_lens = K.variable(input_lens, dtype="int32")
k_label_lens = K.variable(label_lens, dtype="int32")
res = K.eval(K.ctc_batch_cost(k_labels, k_inputs, k_input_lens, k_label_lens))
It runs perfectly fine even with -1 as the last element of the (second) labels sequence because corresponding label_lens item (second) specified that its length is 4.
If we change it to be 5 or if we change some other label value to be -1 then we have the All labels must be nonnegative integers exception that you've mentioned. But this just means that our label_lens is invalid.
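Since everything depends on label_length matching the real content, it may help to derive the padded labels and their lengths from the same source. This is a NumPy-only sketch with illustrative names; it does not call Keras and only prepares the arrays in the layout used by the example above.

```python
import numpy as np

# Variable-length label sequences (illustrative data).
sequences = [[0, 1, 2, 1, 0], [0, 1, 1, 0]]

# True lengths, taken before any padding is applied.
label_lens = np.array([len(seq) for seq in sequences], dtype=np.int32)
max_len = int(label_lens.max())

# Start from an all -1 matrix and copy each sequence into its row.
labels = np.full((len(sequences), max_len), -1, dtype=np.int32)
for row, seq in enumerate(sequences):
    labels[row, :len(seq)] = seq

# Column vector of shape (batch, 1), as in the example above.
label_lens_col = label_lens[:, np.newaxis]

print(labels)
print(label_lens_col)
```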
Here's how I do it. I have a dense tensor labels that includes padding with -1, so that all targets in a batch have the same length. Then I use
labels_sparse = dense_to_sparse(labels, sparse_val=-1)
where
def dense_to_sparse(dense_tensor, sparse_val=0):
    """Inverse of tf.sparse_to_dense.
    Parameters:
        dense_tensor: The dense tensor. Duh.
        sparse_val: The value to "ignore": Occurrences of this value in the
                    dense tensor will not be represented in the sparse tensor.
                    NOTE: When/if later restoring this to a dense tensor, you
                    will probably want to choose this as the default value.
    Returns:
        SparseTensor equivalent to the dense input.
    """
    with tf.name_scope("dense_to_sparse"):
        sparse_inds = tf.where(tf.not_equal(dense_tensor, sparse_val),
                               name="sparse_inds")
        sparse_vals = tf.gather_nd(dense_tensor, sparse_inds,
                                   name="sparse_vals")
        dense_shape = tf.shape(dense_tensor, name="dense_shape",
                               out_type=tf.int64)
        return tf.SparseTensor(sparse_inds, sparse_vals, dense_shape)
This creates a sparse tensor of the labels, which is what you need to put into the ctc loss. That is, you call tf.nn.ctc_loss(labels=labels_sparse, ...) The padding (i.e. all values equal to -1 in the dense tensor) is simply not represented in this sparse tensor.
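To see what ends up in the sparse tensor, the same three components can be computed with plain NumPy. This is only an illustration of the idea: dense_to_sparse_parts is a hypothetical helper, not a TensorFlow function, and it returns the indices, values and dense shape as separate arrays.

```python
import numpy as np

def dense_to_sparse_parts(dense, sparse_val=-1):
    """Return (indices, values, dense_shape) for entries != sparse_val."""
    dense = np.asarray(dense)
    keep = dense != sparse_val
    indices = np.argwhere(keep)  # one (row, col) pair per kept entry
    values = dense[keep]         # kept entries in the same row-major order
    dense_shape = np.array(dense.shape, dtype=np.int64)
    return indices, values, dense_shape

labels = np.array([[0, 1, 2, 1, 0], [0, 1, 1, 0, -1]])
indices, values, dense_shape = dense_to_sparse_parts(labels, sparse_val=-1)

print(indices.shape, values, dense_shape)
```

Any real label equal to sparse_val is dropped as well, so with the default sparse_val=0 of the function above a genuine label 0 would disappear. Padding with a value that never occurs as a label, such as -1, avoids that.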

Cython Typing List of Strings

I'm trying to use cython to improve the performance of a loop, but I'm running into some issues declaring the types of the inputs.
How do I include a field in my typed struct which is a string that can be either 'front' or 'back'?
I have a np.recarray that looks like the following (note the length of the recarray is unknown at compile time)
import numpy as np
weights = np.recarray(4, dtype=[('a', np.int64), ('b', np.str_, 5), ('c', np.float64)])
weights[0] = (0, "front", 0.5)
weights[1] = (0, "back", 0.5)
weights[2] = (1, "front", 1.0)
weights[3] = (1, "back", 0.0)
as well as inputs of a list of strings and a pandas.Timestamp
import pandas as pd
ts = pd.Timestamp("2015-01-01")
contracts = ["CLX16", "CLZ16"]
I am trying to cythonize the following loop
def ploop(weights, contracts, timestamp):
    cwts = []
    for gen_num, position, weighting in weights:
        if weighting != 0:
            if position == "front":
                cntrct_idx = gen_num
            elif position == "back":
                cntrct_idx = gen_num + 1
            else:
                raise ValueError("transition.columns must contain "
                                 "'front' or 'back'")
            cwts.append((gen_num, contracts[cntrct_idx], weighting, timestamp))
    return cwts
My attempt involved typing the weights input as a struct in cython,
in a file struct_test.pyx as follows
import numpy as np
cimport numpy as np
cdef packed struct tstruct:
    np.int64_t gen_num
    char[5] position
    np.float64_t weighting
def cloop(tstruct[:] weights_array, contracts, timestamp):
    cdef tstruct weights
    cdef int i
    cdef int cntrct_idx
    cwts = []
    for k in xrange(len(weights_array)):
        w = weights_array[k]
        if w.weighting != 0:
            if w.position == "front":
                cntrct_idx = w.gen_num
            elif w.position == "back":
                cntrct_idx = w.gen_num + 1
            else:
                raise ValueError("transition.columns must contain "
                                 "'front' or 'back'")
            cwts.append((w.gen_num, contracts[cntrct_idx], w.weighting,
                         timestamp))
    return cwts
But I am receiving runtime errors, which I believe are related to the char[5] position.
import pyximport
pyximport.install()
import struct_test
struct_test.cloop(weights, contracts, ts)
ValueError: Does not understand character buffer dtype format string ('w')
In addition I am a bit unclear how I would go about typing contracts as well as timestamp.
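One plausible cause of the error, offered as an assumption and not a confirmed diagnosis: np.str_ fields hold 4-byte unicode code points, so a 5-character field occupies 20 bytes and is exposed through the buffer protocol with a format code ('w') that a char[5] member does not match. A byte-string field ('S5') occupies exactly 5 bytes. The sketch below only compares the two NumPy layouts; it does not compile or run any Cython code.

```python
import numpy as np

# Layout used in the question: 'b' is a 5-character unicode field.
unicode_dtype = np.dtype([('a', np.int64), ('b', 'U5'), ('c', np.float64)])

# Alternative layout: 'b' is a 5-byte string field.
bytes_dtype = np.dtype([('a', np.int64), ('b', 'S5'), ('c', np.float64)])

print(unicode_dtype['b'].itemsize)  # bytes used by the unicode field
print(bytes_dtype['b'].itemsize)    # bytes used by the byte-string field
print(bytes_dtype.itemsize)         # total size of one packed record

# Filling the byte-string variant uses bytes literals.
weights_bytes = np.zeros(2, dtype=bytes_dtype)
weights_bytes[0] = (0, b"front", 0.5)
weights_bytes[1] = (0, b"back", 0.5)
print(weights_bytes['b'])
```

With the byte-string layout, one record is 8 + 5 + 8 = 21 bytes, which is the size a packed struct with an int64, a char[5] and a float64 would have.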
Your ploop (without the timestamp variable) produces:
In [226]: ploop(weights, contracts)
Out[226]: [(0, 'CLX16', 0.5), (0, 'CLZ16', 0.5), (1, 'CLZ16', 1.0)]
Equivalent function without a loop:
def ploopless(weights, contracts):
    arr_contracts = np.array(contracts) # to allow array indexing
    wgts1 = weights[weights['c']!=0]
    mask = wgts1['b']=='front'
    wgts1['b'][mask] = arr_contracts[wgts1['a'][mask]]
    mask = wgts1['b']=='back'
    wgts1['b'][mask] = arr_contracts[wgts1['a'][mask]+1]
    return wgts1.tolist()
In [250]: ploopless(weights, contracts)
Out[250]: [(0, 'CLX16', 0.5), (0, 'CLZ16', 0.5), (1, 'CLZ16', 1.0)]
I'm taking advantage of the fact that the returned list of tuples has the same (int, str, float) layout as the input weights array. So I'm just making a copy of weights and replacing selected values of the b field.
Note that I use the field selection index before the mask one. The boolean mask produces a copy, so we have to be careful about indexing order: wgts1['b'][mask] = ... writes into wgts1, whereas wgts1[mask]['b'] = ... would write into a temporary copy.
I'm guessing that loop-less array version will be competitive in time with the cloop (on realistic arrays). The string and list operations in cloop probably limit its speedup.
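A variant of the loop-less idea computes the contract index directly instead of overwriting the b field. That sidesteps the 5-character width of b, which would truncate longer contract names. This is a sketch under the same assumptions as ploopless (no timestamp, same sample data); contract_weights is an illustrative name.

```python
import numpy as np

weights = np.recarray(4, dtype=[('a', np.int64), ('b', np.str_, 5), ('c', np.float64)])
weights[0] = (0, "front", 0.5)
weights[1] = (0, "back", 0.5)
weights[2] = (1, "front", 1.0)
weights[3] = (1, "back", 0.0)
contracts = ["CLX16", "CLZ16"]

def contract_weights(weights, contracts):
    # Keep only rows with a non-zero weighting.
    w = weights[weights['c'] != 0]
    position = w['b']
    if not np.isin(position, ['front', 'back']).all():
        raise ValueError("transition.columns must contain 'front' or 'back'")
    # 'front' maps to gen_num, 'back' maps to gen_num + 1.
    idx = w['a'] + (position == 'back')
    names = np.asarray(contracts)[idx]
    return list(zip(w['a'].tolist(), names.tolist(), w['c'].tolist()))

result = contract_weights(weights, contracts)
print(result)
```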

Dynamic Axes with a custom RNN

I’m running into a number of issues relating to dynamic axes. I am trying to implement a convolutional RNN similar to the LSTM() function, but one that handles sequential image input and outputs an image.
I’m able to build the network and pass dummy data through it to produce output, but when I try to compute the error with an input_variable label I consistently see the following error:
RuntimeError: Node '__v2libuid__Input471__v2libname__img_label' (InputValue operation): DataFor: FrameRange's dynamic axis is inconsistent with matrix: {numTimeSteps:1, numParallelSequences:2, sequences:[{seqId:0, s:0, begin:0, end:1}, {seqId:1, s:1, begin:0, end:1}]} vs. {numTimeSteps:2, numParallelSequences:1, sequences:[{seqId:0, s:0, begin:0, end:2}]}
If I understand this error message correctly, it claims that the value I passed in as the label has inconsistent axes to what is expected with 2 time steps and 1 parallel sequence, when what is desired is 1 time-step and 2 sequences. This makes sense to me, but I’m not sure how the data I’m passing in is not conforming to this. Here are (roughly) the variable declarations and eval statements:
…
img_input = input_variable(shape=img_shape, dtype=np.float32, name="img_input")
convlstm = Recurrence(conv_lstm_cell, initial_state=initial_state)(img_input)
out = select_last(convlstm)
img_label = input_variable(shape=img_shape, dynamic_axes=out.dynamic_axes, dtype=np.float32, name="img_label")
error = squared_error(out, img_label)
…
dummy_input = np.ones(shape=(2, 3, 3, 32, 32)) # (batch, seq_len, channels, height, width)
dummy_label = np.ones(shape=(2, 3, 32, 32)) # (batch, channels, height, width)
out = error.eval({img_input:dummy_input, img_label:dummy_label})
I believe part of the issue is with the dynamic_axes set when creating the img_label input_variable. I’ve also tried setting it to [Axis.default_batch_axis()] and not setting it at all; in those cases either squared_error complains about inconsistent axes between out and img_label, or I see the same error as above.
The only issue I see with the above setup is that your dummy label should have an explicit dynamic axis so it should be declared as
dummy_label = np.ones(shape=(2, 1, 3, 32, 32))
Assuming your convlstm works similarly to an lstm, the following works without issues for me and evaluates the loss for two input/output pairs.
x = C.input_variable((3,32,32))
cx = convlstm(x)
lx = C.sequence.last(cx)
y = C.input_variable(lx.shape, dynamic_axes=lx.dynamic_axes)
loss = C.squared_error(y, lx)
x0 = np.arange(2*3*3*32*32,dtype=np.float32).reshape(2,3,3,32,32)
y0 = np.arange(2*1*3*32*32,dtype=np.float32).reshape(2,1,3,32,32)
loss.eval({x:x0, y:y0})
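If the labels already exist as a (batch, channels, height, width) array, the length-1 sequence axis can be added without copying data. The sketch below uses NumPy only and illustrative names; it shows the reshaping step and does not call CNTK.

```python
import numpy as np

# Labels as declared in the question: (batch, channels, height, width).
dummy_label = np.ones(shape=(2, 3, 32, 32), dtype=np.float32)

# Insert a sequence axis of length 1 after the batch axis.
label_with_seq_axis = dummy_label[:, np.newaxis]

# np.expand_dims expresses the same operation.
same_result = np.expand_dims(dummy_label, axis=1)

print(label_with_seq_axis.shape)
```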