Implications of using np.lib.stride_tricks.as_strided - numpy

I have some questions about how np.lib.stride_tricks.as_strided differs from just using np.ndarray, and whether as_strided can be intercepted by subclasses of np.ndarray.
Given an array:
a = np.arange(10, dtype=np.int8)
Is there any difference (effectively) between using as_strided and creating the ndarray with the appropriate strides directly?
np.lib.stride_tricks.as_strided(a, (9,2), (1,1))
np.ndarray((9,2), np.int8, buffer=a.data, strides=(1,1))
as_strided just seems to be more implicit about what it is doing.
Can either function be intercepted by a subclass of np.ndarray? If a in the example above were a custom subclass of ndarray, would some function get called before the strided view is returned? I know that setting subok for as_strided will preserve any subclass, but is this actually doing anything other than casting the array?
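For what it's worth, a quick check (just a sketch based on the example above) suggests the two calls produce the same overlapping view into a's buffer:

import numpy as np

a = np.arange(10, dtype=np.int8)

s1 = np.lib.stride_tricks.as_strided(a, (9, 2), (1, 1))
s2 = np.ndarray((9, 2), np.int8, buffer=a.data, strides=(1, 1))

print(np.array_equal(s1, s2))  # True
a[1] = 100                     # both views see the change, so both share a's buffer
print(s1[0, 1], s2[0, 1])      # 100 100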

Pandas interpolation method definitions

In the pandas documentation, a number of methods are provided as arguments to pandas.DataFrame.interpolate, including
‘nearest’, ‘zero’, ‘slinear’, ‘quadratic’, ‘cubic’, ‘spline’, ‘barycentric’, ‘polynomial’: Passed to scipy.interpolate.interp1d. These methods use the numerical values of the index. Both ‘polynomial’ and ‘spline’ require that you also specify an order (int), e.g. df.interpolate(method='polynomial', order=5).
‘krogh’, ‘piecewise_polynomial’, ‘spline’, ‘pchip’, ‘akima’, ‘cubicspline’: Wrappers around the SciPy interpolation methods of similar names. See Notes
However, the scipy documentation indicates the following options:
kind str or int, optional
Specifies the kind of interpolation as a string or as an integer specifying the order of the spline interpolator to use. The string has to be one of ‘linear’, ‘nearest’, ‘nearest-up’, ‘zero’, ‘slinear’, ‘quadratic’, ‘cubic’, ‘previous’, or ‘next’. ‘zero’, ‘slinear’, ‘quadratic’ and ‘cubic’ refer to a spline interpolation of zeroth, first, second or third order; ‘previous’ and ‘next’ simply return the previous or next value of the point; ‘nearest-up’ and ‘nearest’ differ when interpolating half-integers (e.g. 0.5, 1.5) in that ‘nearest-up’ rounds up and ‘nearest’ rounds down. Default is ‘linear’.
The documentation seems wrong since scipy.interpolate.interp1d does not accept barycentric or polynomial as valid methods. I suppose that barycentric refers to scipy.interpolate.barycentric_interpolate, but what does polynomial refer to? I thought it might be equivalent to the piecewise_polynomial option, but the two give different results.
Also, method=cubicspline and method=spline, order=3 give different results. What's the difference here?
The pandas interpolate method is an amalgamation of interpolation methods coming from different places in the numpy and scipy libraries.
Currently all of the code is located in pandas/core/missing.py.
At a high level it splits the interpolation methods into those that are handled by np.interp and others that are handled by various routines in the scipy library.
# interpolation methods that dispatch to np.interp
NP_METHODS = ["linear", "time", "index", "values"]

# interpolation methods that dispatch to _interpolate_scipy_wrapper
SP_METHODS = ["nearest", "zero", "slinear", "quadratic", "cubic",
              "barycentric", "krogh", "spline", "polynomial",
              "from_derivatives", "piecewise_polynomial", "pchip",
              "akima", "cubicspline"]
Then, because the scipy methods are spread across different functions, there are a ton of other wrappers within missing.py that indicate which scipy routine is used. Most of the methods are passed off to scipy.interpolate.interp1d; however, for a few others there's a dict or other wrapper methods pointing to those specific scipy methods.
from scipy import interpolate

alt_methods = {
    "barycentric": interpolate.barycentric_interpolate,
    "krogh": interpolate.krogh_interpolate,
    "from_derivatives": _from_derivatives,
    "piecewise_polynomial": _from_derivatives,
}
where the doc string of _from_derivatives within missing.py indicates:
def _from_derivatives(xi, yi, x, order=None, der=0, extrapolate=False):
    """
    Convenience function for interpolate.BPoly.from_derivatives.
    ...
    """
So, TL;DR: depending upon the method you specify, you wind up directly using one of the following:
numpy.interp
scipy.interpolate.interp1d
scipy.interpolate.barycentric_interpolate
scipy.interpolate.krogh_interpolate
scipy.interpolate.BPoly.from_derivatives
scipy.interpolate.Akima1DInterpolator
scipy.interpolate.UnivariateSpline
scipy.interpolate.CubicSpline
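As a rough illustration of why the question's two calls disagree (a sketch; the exact output depends on your pandas and scipy versions), method='cubicspline' ends up in scipy.interpolate.CubicSpline, while method='spline', order=3 goes through scipy.interpolate.UnivariateSpline, which is a smoothing spline rather than an exact interpolant:

import numpy as np
import pandas as pd

# illustrative series with a gap; values chosen only for demonstration
s = pd.Series([0.0, 1.0, np.nan, 9.0, 16.0])

print(s.interpolate(method="cubicspline"))      # scipy.interpolate.CubicSpline
print(s.interpolate(method="spline", order=3))  # scipy.interpolate.UnivariateSpline
print(s.interpolate(method="barycentric"))      # scipy.interpolate.barycentric_interpolate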

Kernel's hyper-parameters; initialization and setting bounds

I think many other people like me might be interested in how they can use GPflow for their specific problems. The key is how customizable GPflow is, and a good example would be very helpful.
In my case, I read and tried lots of comments in raised issues without any real success. Setting kernel model parameters is not straightforward (creating them with default values, and then changing them via the delete-object approach). The transform method is vague.
It would be really helpful if you could add an example showing how one can initialize and set bounds of an anisotropic kernel (length-scale values and bounds, variances, ...), and especially how to add observation error (as an array-like alpha parameter).
If you just want to set a value, then you can do
model = gpflow.models.GPR(np.zeros((1, 1)),
                          np.zeros((1, 1)),
                          gpflow.kernels.RBF(1, lengthscales=0.2))
Alternatively
model = gpflow.models.GPR(np.zeros((1, 1)),
                          np.zeros((1, 1)),
                          gpflow.kernels.RBF(1))
model.kern.lengthscales = 0.2
If you want to change the transform, you either need to subclass the kernel, or you can also do
with gpflow.defer_build():
    model = gpflow.models.GPR(np.zeros((1, 1)),
                              np.zeros((1, 1)),
                              gpflow.kernels.RBF(1))
    transform = gpflow.transforms.Logistic(0.1, 1.)
    model.kern.lengthscales = gpflow.params.Parameter(0.3, transform=transform)

model.compile()
You need the defer_build to stop the graph being compiled before you've changed the transform. With the approach above, compilation of the TensorFlow graph is delayed (until the explicit model.compile()), so the graph is built with the intended bounding transform.
Using an array parameter for likelihood variance is outside the scope of gpflow. For what it's worth (and because it has been asked about before), that particular model is especially problematic as it is not clear how test points are defined.
Setting kernel parameters can be done using the .assign() function, or through direct assignment. See the notebook https://github.com/GPflow/GPflow/blob/develop/doc/source/notebooks/understanding/tf_graphs_and_sessions.ipynb. You do not need to delete a parameter to assign a new value to it.
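For example, a minimal sketch (reusing the model built above and assuming the GPflow 1.x API used in this thread):

model.kern.lengthscales.assign(0.5)  # via .assign()
model.kern.variance = 1.2            # via direct assignment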
If you want to have per-datapoint noise, you will need to implement your own custom likelihood, which you can do by taking Gaussian likelihood in likelihoods.py as an example.
If by "bounds" you mean limiting the optimisation range for a parameter, you can use the Logistic transform. If you want to pass in a custom transformation for a parameter, you can pass a constructed Parameter object into constructors with a custom transform. Alternatively you can assign a newly created Parameter with a new transform to the model.
Here is more information on how to access and change GPflow parameters: viewing, getting and setting parameters documentation.
An extra note on user1018464's answer about replacing the transform on an existing parameter: changing a transform is a bit tricky; you can't change the transform once the model has been compiled in TensorFlow.
E.g.
likelihood = gpflow.likelihoods.Gaussian()
likelihood.variance.transform = gpflow.transforms.Logistic(1., 10.)
----
GPflowError: Parameter "Gaussian/variance" has already been compiled.
Instead you have to reset GPflow object:
likelihood = gpflow.likelihoods.Gaussian() # All tensors compiled
likelihood.clear()
likelihood.variance.transform = gpflow.transforms.Logistic(2, 5)
likelihood.variance = 2.5
likelihood.compile()

What exactly qualifies as a 'Tensor' in TensorFlow?

I am new to TensorFlow and just went through the eager execution tutorial and came across the tf.decode_csv function. Not knowing about it, I read the documentation. https://www.tensorflow.org/api_docs/python/tf/decode_csv
I don't really understand it.
The documentation says 'records: A Tensor of type string.'
So, my question is: What qualifies as a 'Tensor'?
I tried the following code:
dec_res = tf.decode_csv('0.1,0.2,0.3', [[0.0], [0.0], [0.0]])
print(dec_res, type(dec_res))
l = [[1,2,3],[4,5,6],[7,8,9]]
r = tf.reshape(l, [9,-1])
print(l, type(l))
print(r, type(r))
So the list dec_res contains tf.Tensor objects. That seems reasonable to me. But is an ordinary string also a 'Tensor' according to the documentation?
Then I tried something else with the tf.reshape function. In the documentation https://www.tensorflow.org/api_docs/python/tf/reshape it says that 'tensor: A Tensor.' So, l is supposed to be a tensor. But it is not of type tf.Tensor; it is simply a Python list. This is confusing.
Then the documentation says
Returns:
A Tensor. Has the same type as tensor.
But the type of l is list where the type of r is tensorflow.python.framework.ops.Tensor. So the types are not the same.
Then I thought that TensorFlow is very generous with things being a tensor. So I tried:
class car(object):
    def __init__(self, color):
        self.color = color

red_car = car('red')
#test_reshape = tf.reshape(red_car, [1, -1])
print(red_car.color)  # to check that red_car exists
Now, the commented-out line results in an error.
So, can anyone help me to find out, what qualifies as a 'Tensor'?
P.S.: I tried to read the source code of tf.reshape as given in the documentation
Defined in tensorflow/python/ops/gen_array_ops.py.
But this file does not exist in the GitHub repo. Does anyone know how to read it?
https://www.tensorflow.org/programmers_guide/tensors
TensorFlow, as the name indicates, is a framework to define and run
computations involving tensors. A tensor is a generalization of
vectors and matrices to potentially higher dimensions. Internally,
TensorFlow represents tensors as n-dimensional arrays of base
datatypes.
What you are observing comes from the fact that TensorFlow operations (like tf.reshape) can be built from various Python types using the function tf.convert_to_tensor:
https://www.tensorflow.org/api_docs/python/tf/convert_to_tensor
All standard Python op constructors apply this function to each of
their Tensor-valued inputs, which allows those ops to accept numpy
arrays, Python lists, and scalars in addition to Tensor objects.
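A quick sketch of that (the printed repr depends on your TensorFlow version and on whether eager execution is enabled):

import numpy as np
import tensorflow as tf

# each of these becomes a tf.Tensor via tf.convert_to_tensor, which is what
# ops such as tf.reshape call on their inputs
print(tf.convert_to_tensor([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))  # Python list
print(tf.convert_to_tensor(np.eye(3)))                          # NumPy array
print(tf.convert_to_tensor(3.14))                               # Python scalar
print(tf.convert_to_tensor('0.1,0.2,0.3'))                      # Python string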

Combine sparse and masked array in numpy

Is there a convenient way to use masked arrays over sparse matrices? It seems that the mask does not work when creating a masked array from a scipy sparse matrix...
A typical application would be an adjacency matrix whose values could be {0, 1, ?}: {0, 1} representing links in the network, and the unknown/unseen value {?} to predict.
I'm not surprised that trying to give a sparse matrix to np.ma.masked_array does not work. The few numpy functions that work with sparse matrices are ones that delegate the task to the sparse code.
It might be possible to construct a coo format matrix whose data attribute is a masked array, but I doubt that carries far. Code that isn't mask-aware will generally ignore the mask.
A masked array is an ndarray subclass that maintains two attributes, the data and mask, both of which are arrays. Many masked methods work by filling the masked values with a suitable value (0 for sums, 1 for products), and performing regular array calculations.
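For instance, a small sketch of that fill-and-compute behaviour:

import numpy as np

M = np.ma.masked_array([1., 2., 3., 4.], mask=[False, True, False, False])
print(M.sum())   # 8.0  -- the masked entry is filled with 0 for the sum
print(M.prod())  # 12.0 -- the masked entry is filled with 1 for the product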
A sparse matrix is not an ndarray subclass. One format is actually a dictionary subclass. Most store their data in 3 arrays, the 2 coordinates and the data. Interactions with non-sparse arrays often involve todense() to turn the action into a regular numpy one.
There's no interoperability by design. If something does work it's probably because of some coincidental delegation of method.
For example
In [85]: A=sparse.coo_matrix(np.eye(3))
In [86]: M=np.ma.masked_array(np.eye(3))
In [87]: A+M
Out[87]:
masked_array(data =
 [[ 2.  0.  0.]
 [ 0.  2.  0.]
 [ 0.  0.  2.]],
             mask =
 False,
       fill_value = 1e+20)
In [88]: M+A
NotImplementedError: adding a nonzero scalar to a sparse matrix is not supported
I would have expected M+A to work, since I read it as adding a sparse matrix to a masked array; but sometimes x+y is actually implemented as y.__add__(x). A+np.eye(3) works in both orders.
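For comparison, a small sketch showing a plain (non-masked) dense array interoperating with the sparse matrix in either order:

import numpy as np
from scipy import sparse

A = sparse.coo_matrix(np.eye(3))
print(A + np.eye(3))  # works
print(np.eye(3) + A)  # also works; both return a dense matrix with 2s on the diagonal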

Clarification on tf.Tensor.set_shape()

I have an image that is 478 x 717 x 3 = 1028178 pixels, with a rank of 1. I verified it by calling tf.shape and tf.rank.
When I call image.set_shape([478, 717, 3]), it throws the following error.
"Shapes %s and %s must have the same rank" % (self, other))
ValueError: Shapes (?,) and (478, 717, 3) must have the same rank
I tested again by first setting the shape to (1028178,), but the error still exists.
ValueError: Shapes (1028178,) and (478, 717, 3) must have the same rank
Well, that does make sense because one is of rank 1 and the other is of rank 3. However, why is it necessary to throw an error when the total number of pixels still matches?
I could of course use tf.reshape and it works, but I think that's not optimal.
As stated in the TensorFlow FAQ:
What is the difference between x.set_shape() and x = tf.reshape(x)?
The tf.Tensor.set_shape() method updates the static shape of a Tensor
object, and it is typically used to provide additional shape
information when this cannot be inferred directly. It does not change
the dynamic shape of the tensor.
The tf.reshape() operation creates a new tensor with a different dynamic shape.
Creating a new tensor involves memory allocation and that could potentially be more costly when more training examples are involved. Is this by design, or am I missing something here?
As far as I know (and I wrote that code), there isn't a bug in Tensor.set_shape(). I think the misunderstanding stems from the confusing name of that method.
To elaborate on the FAQ entry you quoted, Tensor.set_shape() is a pure-Python function that improves the shape information for a given tf.Tensor object. By "improves", I mean "makes more specific".
Therefore, when you have a Tensor object t with shape (?,), that is a one-dimensional tensor of unknown length. You can call t.set_shape((1028178,)), and then t will have shape (1028178,) when you call t.get_shape(). This doesn't affect the underlying storage, or indeed anything on the backend: it merely means that subsequent shape inference using t can rely on the assertion that it is a vector of length 1028178.
If t has shape (?,), a call to t.set_shape((478, 717, 3)) will fail, because TensorFlow already knows that t is a vector, so it cannot have shape (478, 717, 3). If you want to make a new Tensor with that shape from the contents of t, you can use reshaped_t = tf.reshape(t, (478, 717, 3)). This creates a new tf.Tensor object in Python; the actual implementation of tf.reshape() does this using a shallow copy of the tensor buffer, so it is inexpensive in practice.
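A minimal sketch of the difference (assuming the TF 1.x graph-mode API this question uses, with the shapes from the question):

import tensorflow as tf

t = tf.placeholder(tf.float32, shape=(None,))  # static shape (?,): a vector of unknown length

t.set_shape((1028178,))                # OK: refines the static shape; no new op, no data movement
# t.set_shape((478, 717, 3))           # would raise ValueError: the ranks don't match

img = tf.reshape(t, (478, 717, 3))     # new Tensor object with a different shape
print(t.get_shape(), img.get_shape())  # (1028178,) (478, 717, 3)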
One analogy is that Tensor.set_shape() is like a run-time cast in an object-oriented language like Java. For example, if you have a pointer to an Object but know that, in fact, it is a String, you might do the cast (String) obj in order to pass obj to a method that expects a String argument. However, if you have a String s and try to cast it to a java.util.Vector, the compiler will give you an error, because these two types are unrelated.