Update submatrix in Tensorflow - indexing

Quite simply, what I want to do is the following
A = np.ones((3,3)) #arbitrary matrix
B = np.ones((2,2)) #arbitrary matrix
A[1:,1:] = A[1:,1:] + B
except in Tensorflow (where the matrices can be arbitrarily complicated tensor expressions). Neither A nor B is a Tensorflow Variable, but just a run-of-the-mill tensor.
What I have gathered so far: tensors are immutable, so I cannot assign to a submatrix. tf.scatter_nd is the current option for sub-assignment, but does not appear to support sub-matrices, only slices.
Methods that should work, but are perhaps not ideal:
I could pad B with zeros, but I'm sure this leads to instantiation of
an unnecessarily large B - can it be made sparse, maybe?
I could use the padding idea, but write it as a low-rank decomposition, e.g. in Numpy: A+U.dot(B).U.T where U is a stacked zero and identity matrix. I'm not sure this is actually advantageous.
I could split A into submatrices, and stack them back together. Might be the most efficient, but sounds like the code would be convoluted.
Ideally, I want to do this operation N times for progressively smaller matrices, resulting in one large final result, but this is tangential.
I'll use one of the hacks for now, but I'm hoping someone can tell me what the idiomatic version is!

Related

Tensorflow: Add small number before division for numerical stability

In order to prevent divisions by zero in TensorFlow, I want to add a tiny number to my dividend. A quick search did not yield any results. In particular, I am interested in using the scientific notation, e.g.
a = b/(c+1e-05)
How can this be achieved?
Assuming a, b and c are tensors. The formula you have written will work as expected. 1e-5 will be broadcasted and added on the tensor c. Tensorflow automatically typecasts the 1e-5 to tf.constant(1e-5).
Tensorflow however has some limitations with non-scalar broadcasts. Take a look at my other answer.

Multiple-input multiple-output CNN with custom loss function

I have a set of 2D input arrays m x n namely A,B,C and I have to predict two 2D output arrays namely d,e for which I do have the expected values. You can think of the inputs/outputs as grey images if you like.
Because of the spatial information is relevant (these are actually 2D physical domains) I want to use a Convolutional Neural Network to predict d and e. My design (not tested yet) looks as follows:
Because I have multiple inputs, I guess I should use multiple columns (or branches) to find different features for each of the inputs (they look fairly different). Each of these columns follows a encoding-decoding architecture used in segmentation (see SegNet): Conv2D block involves a convolution+batch normalisation+ReLU layer. Deconv2D involves a deconvolution+batch normalisation+ReLU.
Then, I can merge the output of each column by either concatenating, averaging or taking the maximum for example. To obtain the original m x n shape for each of the outputs I have seen I could do this with a 1 x 1 kernel convolution.
I want to predict the two outputs from that single layer. Is that okay from the network structure point of view? Finally my loss function depends on the outputs themselves compared to the target plus another relation I want to impose.
A would like to have some expert opinion on this since this is my first design of a CNN and I am not sure if I it makes sense as it is now and/or if there are better approaches (or network architectures) to this problem.
I posted this originally in datascience but I did not get much feedback. I am now posting it here since there is a bigger community on these topics plus I would be very grateful to receive implementation tips beside network architectural ones. Thanks.
I think your design makes sense in general:
since A, B, and C are fairly different, you make each input a transform sub-network, and then fuse them together, which is your intermediate representation.
from the intermediate representation, you apply additional CNN to decode D and E, respectively.
Several things:
A, B, and C looking different does not necessarily mean you can't stack them together as a 3-channel input. The decision should be made upon the fact that whether the values in A, B, and C mean differently or not. For example, if A is a gray scale image, B is a depth map, C is a also a gray image captured by a different camera. Then A and B are better processed in your suggested way, but A and C can be concatenated as one single input before feeding it to your network.
D and E are two outputs of the network and will be trained in the multi-task manner. Of course, they should share some latent feature, and one should split at this feature to apply a down-stream non-shared weight branch for each output. However, where to split is usually tricky.
It is really a broad question, asking for answers relying mostly on opinions. Here are my two cents though, which you might find interesting as it does not go along the previous answers here and on datascience.
First, I wouldn't go with separate columns for each input. AFAIK, when different inputs are processed by different columns, it is almost always the case that the network is some sort of Siemese network and the columns share the same weights; or at least the columns all need to produce a similar code. It is not your case here, so I would simply not bother.
Second, you are blessed with a problem with a dense output and no need to learn a code. This should direct you straight to U-nets, which outperforms any bottleneck-designed network without much effort. U-nets were introduced for dense segmentation but they shine at any dense-output problem really.
In short, just stack your inputs together and use a U-net.

Why is tf.transpose so important in a RNN?

I've been reading the docs to learn TensorFlow and have been struggling on when to use the following functions and their purpose.
tf.split()
tf.reshape()
tf.transpose()
My guess so far is that:
tf.split() is used because inputs must be a sequence.
tf.reshape() is used to make the shapes compatible (Incorrect shapes tends to be a common problem / mistake for me). I used numpy for this before. I'll probably stick to tf.reshape() now. I am not sure if there is a difference between the two.
tf.transpose() swaps the rows and columns from my understanding. If I don't use tf.transpose() my loss doesn't go down. If the parameter values are incorrect the loss doesn't go down. So the purpose of me using tf.transpose() is so that my loss goes down and my predictions become more accurate.
This bothers me tremendously because I'm using tf.transpose() because I have to and have no understanding why it's such an important factor. I'm assuming if it's not used correctly the inputs and labels can be in the wrong position. Making it impossible for the model to learn. If this is true how can I go about using tf.transpose() so that I am not so reliant on figuring out the parameter values via trial and error?
Question
Why do I need tf.transpose()?
What is the purpose of tf.transpose()?
Answer
Why do I need tf.transpose()? I can't imagine why you would need it unless you coded your solution from the beginning to require it. For example, suppose I have 120 student records with 50 stats per student and I want to use that to try and make a linear association with their chance of taking 3 classes. I'd state it like so
c = r x m
r = records, a matrix with a shape if [120x50]
m = the induction matrix. it has a shape of [50x3]
c = the chance of all students taking one of three courses, a matrix with a shape of [120x3]
Now if instead of making m [50x3], we goofed and made m [3x50], then we'd have to transpose it before multiplication.
What is the purpose of tf.transpose()?
Sometimes you just need to swap rows and columns, like above. Wikipedia has a fantastic page on it. The transpose function has some excellent properties for matrix math function, like associativeness and associativeness with the inverse function.
Summary
I don't think I've ever used tf.transpose in any CNN I've written.

Differences between X.ravel() and X.reshape(s0*s1*s2) when number of axes known

Seeing this answer I am wondering if the creation of a flattened view of X are essentially the same, as long as I know that the number of axes in X is 3:
A = X.ravel()
s0, s1, s2 = X.shape
B = X.reshape(s0*s1*s2)
C = X.reshape(-1) # thanks to #hpaulj below
I'm not asking if A and B and C are the same.
I'm wondering if the particular use of ravel and reshape in this situation are essentially the same, or if there are significant differences, advantages, or disadvantages to one or the other, provided that you know the number of axes of X ahead of time.
The second method takes a few microseconds, but that does not seem to be size dependent.
Look at their __array_interface__ and do some timings. The only difference that I can see is that ravel is faster.
.flatten() has a more significant difference - it returns a copy.
A.reshape(-1)
is a simpler way to use reshape.
You could study the respective docs, and see if there is something else. I haven't explored what happens when you specify order.
I would use ravel if I just want it to be 1d. I use .reshape most often to change a 1d (e.g. arange()) to nd.
e.g.
np.arange(10).reshape(2,5).ravel()
Or choose the one that makes your code most readable.
reshape and ravel are defined in numpy C code:
In https://github.com/numpy/numpy/blob/0703f55f4db7a87c5a9e02d5165309994b9b13fd/numpy/core/src/multiarray/shape.c
PyArray_Ravel(PyArrayObject *arr, NPY_ORDER order) requires nearly 100 lines of C code. And it punts to PyArray_Flatten if the order changes.
In the same file, reshape punts to newshape. That in turn returns a view is the shape doesn't actually change, tries _attempt_nocopy_reshape, and as last resort returns a PyArray_NewCopy.
Both make use of PyArray_Newshape and PyArray_NewFromDescr - depending on how shapes and order mix and match.
So identifying where reshape (to 1d) and ravel are different would require careful study.
Another way to do this ravel is to make a new array, with a new shape, but the same data buffer:
np.ndarray((24,),buffer=A.data)
It times the same as reshape. Its __array_interface__ is the same. I don't recommend using this method, but it may clarify what is going on with these reshape/ravel functions. They all make a new array, with new shape, but with share data (if possible). Timing differences are the result of different sequences of function calls - in Python and C - not in different handling of the data.

Numpy/Scipy pinv and pinv2 behave differently

I am working with bidimensional arrays on Numpy for Extreme Learning Machines. One of my arrays, H, is random, and I want to compute its pseudoinverse.
If I use scipy.linalg.pinv2 everything runs smoothly. However, if I use scipy.linalg.pinv, sometimes (30-40% of the times) problems arise.
The reason why I am using pinv2 is because I read (here: http://vene.ro/blog/inverses-pseudoinverses-numerical-issues-speed-symmetry.html ) that pinv2 performs better on "tall" and on "wide" arrays.
The problem is that, if H has a column j of all 1, pinv(H) has huge coefficients at row j.
This is in turn a problem because, in such cases, np.dot(pinv(H), Y) contains some nan values (Y is an array of small integers).
Now, I am not into linear algebra and numeric computation enough to understand if this is a bug or some precision related property of the two functions. I would like you to answer this question so that, if it's the case, I can file a bug report (honestly, at the moment I would not even know what to write).
I saved the arrays with np.savetxt(fn, a, '%.2e', ';'): please, see https://dl.dropboxusercontent.com/u/48242012/example.tar.gz to find them.
Any help is appreciated. In the provided file, you can see in pinv(H).csv that rows 14, 33, 55, 56 and 99 have huge values, while in pinv2(H) the same rows have more decent values.
Your help is appreciated.
In short, the two functions implement two different ways to calculate the pseudoinverse matrix:
scipy.linalg.pinv uses least squares, which may be quite compute intensive and take up a lot of memory.
https://docs.scipy.org/doc/scipy/reference/generated/scipy.linalg.pinv.html#scipy.linalg.pinv
scipy.linalg.pinv2 uses SVD (singular value decomposition), which should run with a smaller memory footprint in most cases.
https://docs.scipy.org/doc/scipy/reference/generated/scipy.linalg.pinv2.html#scipy.linalg.pinv2
numpy.linalg.pinv also implements this method.
As these are two different evaluation methods, the resulting matrices will not be the same. Each method has its own advantages and disadvantages, and it is not always easy to determine which one should be used without deeply understanding the data and what the pseudoinverse will be used for. I'd simply suggest some trial-and-error and use the one which gives you the best results for your classifier.
Note that in some cases these functions cannot converge to a solution, and will then raise a scipy.stats.LinAlgError. In that case you may try to use the second pinv implementation, which will greatly reduce the amount of errors you receive.
Starting from scipy 1.7.0 , pinv2 is deprecated and also replaced by a SVD solution.
DeprecationWarning: scipy.linalg.pinv2 is deprecated since SciPy 1.7.0, use scipy.linalg.pinv instead
That means, numpy.pinv, scipy.pinv and scipy.pinv2 now compute all equivalent solutions. They are also equally fast in their computation, with scipy being slightly faster.
import numpy as np
import scipy
arr = np.random.rand(1000, 2000)
res1 = np.linalg.pinv(arr)
res2 = scipy.linalg.pinv(arr)
res3 = scipy.linalg.pinv2(arr)
np.testing.assert_array_almost_equal(res1, res2, decimal=10)
np.testing.assert_array_almost_equal(res1, res3, decimal=10)