Motivation behind numpy's datetime64 type? - numpy

I noticed recently that numpy includes a datetime64 data type beginning in numpy 1.7:
http://www.compsci.wm.edu/SciClone/documentation/software/math/NumPy/html1.7/reference/arrays.datetime.html
I am wondering what is the motivation behind including this as a separate type within the numpy package rather than using the builtin datetime.datetime provided by Python?
Some of the reasons I am interested in understanding this better include:
I want to know when it is appropriate to use datetime.datetime vs when to use numpy.datetime64
Since numpy includes no date type analogous to datetime.date, should I use numpy.datetime64 for dates when I need to interact with numpy.datetime64 objects? Or should I intermingle datetime.date and numpy.datetime64 in my code?

The reason is the same as why there is an np.int and an np.float. These numpy types are stored by value in an array rather than by boxed reference, as generic Python objects are. The latter takes far more memory, incurs allocation overhead, and is much less cache friendly to traverse.
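A quick way to see the difference is to compare a datetime64 array with an object array holding datetime.datetime instances (a minimal sketch; exact byte counts depend on your platform and Python build):

import numpy as np
from datetime import datetime, timedelta

# 100,000 timestamps, one second apart
start = datetime(2020, 1, 1)
py_times = [start + timedelta(seconds=i) for i in range(100_000)]

boxed = np.array(py_times, dtype=object)            # stores only pointers; each datetime lives elsewhere on the heap
packed = np.array(py_times, dtype='datetime64[s]')  # stores the 8-byte value itself, contiguously

print(boxed.dtype, boxed.itemsize)    # object, 8 (just the pointer on a 64-bit build)
print(packed.dtype, packed.itemsize)  # datetime64[s], 8 (the value itself)

# vectorised datetime arithmetic operates directly on the packed values
print(packed[-1] - packed[0])         # 99999 seconds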

I avoid intermingling datetime64 and Python's built-in datetime objects. The reason is that code written to work with a datetime.datetime will not work with a numpy.datetime64 scalar. For example, none of the methods or properties of a datetime.datetime are available on a numpy.datetime64 object.
To avoid the intermingling, what I tend to do is use Python's datetime.datetime or datetime.date when I am dealing with scalars, and datetime64 when I am dealing with numpy arrays. This means that when I am extracting or iterating over single values from a numpy datetime64 array, I convert them into datetime objects first before letting them propagate into other parts of the codebase.
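The conversion at that boundary might look like this (a minimal sketch; note that .item() on a nanosecond-resolution value returns an integer, so cast to a coarser unit first):

import numpy as np
from datetime import datetime, date

stamps = np.array(['2024-01-01T12:30', '2024-01-02T08:15'], dtype='datetime64[s]')
days = np.array(['2024-01-01', '2024-01-02'], dtype='datetime64[D]')

for ts in stamps:
    py_dt = ts.item()          # datetime.datetime for second/milli/microsecond units
    assert isinstance(py_dt, datetime)

for d in days:
    py_d = d.item()            # datetime.date for day units
    assert isinstance(py_d, date)

# nanosecond-resolution values (e.g. coming from pandas) need a cast first
ns_val = np.datetime64('2024-01-01T12:30:00.123456789', 'ns')
py_dt = ns_val.astype('datetime64[us]').item()   # truncates to microseconds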
Also, you can read about the different units of datetime64, which allow you to use a datetime64 like a datetime.date or a datetime.datetime, here:
http://docs.scipy.org/doc/numpy-dev/reference/arrays.datetime.html#arrays-dtypes-dateunits

Related

Should I care about data types in pandas dataframe? If so, how?

I know that columns of pandas dataframe can be of different types. In practice, does it make any difference which type I use? If so, what is that difference and what are some best practices regarding pandas dtypes?
(For example in R one should use factor type to encode categorical vars and it impacts the way that plots are generated etc. - but I haven't heard of anything similar in pandas.)

Pandas interpolation method definitions

In the pandas documentation, a number of methods are provided as arguments to pandas.DataFrame.interpolate including
‘nearest’, ‘zero’, ‘slinear’, ‘quadratic’, ‘cubic’, ‘spline’, ‘barycentric’, ‘polynomial’: Passed to scipy.interpolate.interp1d. These methods use the numerical values of the index. Both ‘polynomial’ and ‘spline’ require that you also specify an order (int), e.g. df.interpolate(method='polynomial', order=5).
‘krogh’, ‘piecewise_polynomial’, ‘spline’, ‘pchip’, ‘akima’, ‘cubicspline’: Wrappers around the SciPy interpolation methods of similar names. See Notes
However, the scipy documentation indicates the following options:
kind str or int, optional
Specifies the kind of interpolation as a string or as an integer specifying the order of the spline interpolator to use. The string has to be one of ‘linear’, ‘nearest’, ‘nearest-up’, ‘zero’, ‘slinear’, ‘quadratic’, ‘cubic’, ‘previous’, or ‘next’. ‘zero’, ‘slinear’, ‘quadratic’ and ‘cubic’ refer to a spline interpolation of zeroth, first, second or third order; ‘previous’ and ‘next’ simply return the previous or next value of the point; ‘nearest-up’ and ‘nearest’ differ when interpolating half-integers (e.g. 0.5, 1.5) in that ‘nearest-up’ rounds up and ‘nearest’ rounds down. Default is ‘linear’.
The documentation seems wrong since scipy.interpolate.interp1d does not accept barycentric or polynomial as valid methods. I suppose that barycentric refers to scipy.interpolate.barycentric_interpolate, but what does polynomial refer to? I thought it might be equivalent to the piecewise_polynomial option, but the two give different results.
Also, method=cubicspline and method=spline, order=3 give different results. What's the difference here?
The pandas interpolate method is an amalgamation of interpolation methods coming from different places in the numpy and scipy libraries.
Currently all of the code is located in pandas/core/missing.py.
At a high level it splits the interpolation methods into those that are handled by np.interp and those that are handled by various functions throughout the scipy library.
# interpolation methods that dispatch to np.interp
NP_METHODS = ["linear", "time", "index", "values"]

# interpolation methods that dispatch to _interpolate_scipy_wrapper
SP_METHODS = ["nearest", "zero", "slinear", "quadratic", "cubic",
              "barycentric", "krogh", "spline", "polynomial",
              "from_derivatives", "piecewise_polynomial", "pchip",
              "akima", "cubicspline"]
Then, because the scipy methods are spread across different scipy functions, there are a number of other wrappers within missing.py that select the specific scipy routine. Most of the methods are passed off to scipy.interpolate.interp1d; however, for a few others there is a dict or other wrapper function pointing to the specific scipy method.
from scipy import interpolate

alt_methods = {
    "barycentric": interpolate.barycentric_interpolate,
    "krogh": interpolate.krogh_interpolate,
    "from_derivatives": _from_derivatives,
    "piecewise_polynomial": _from_derivatives,
}
where the doc string of _from_derivatives within missing.py indicates:
def _from_derivatives(xi, yi, x, order=None, der=0, extrapolate=False):
    """
    Convenience function for interpolate.BPoly.from_derivatives.
    ...
    """
So, TL;DR: depending upon the method you specify, you wind up directly using one of the following:
numpy.interp
scipy.interpolate.interp1d
scipy.interpolate.barycentric_interpolate
scipy.interpolate.krogh_interpolate
scipy.interpolate.BPoly.from_derivatives
scipy.interpolate.Akima1DInterpolator
scipy.interpolate.UnivariateSpline
scipy.interpolate.CubicSpline
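As a quick illustration of the dispatch (a minimal sketch; it assumes a recent pandas and that scipy is installed, since every non-default method goes through the scipy wrappers):

import numpy as np
import pandas as pd

s = pd.Series([0.0, np.nan, np.nan, 3.0, np.nan, 5.0])

print(s.interpolate(method='linear'))        # in NP_METHODS: handled by np.interp
print(s.interpolate(method='quadratic'))     # in SP_METHODS: passed to scipy.interpolate.interp1d
print(s.interpolate(method='barycentric'))   # in SP_METHODS: scipy.interpolate.barycentric_interpolate
print(s.interpolate(method='cubicspline'))   # in SP_METHODS: scipy.interpolate.CubicSpline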

Best way to convert TensorProto to TensorFlow tensor

As far as I can tell, there are at least two different ways to recover a Tensor from a TensorProto in Tensorflow 2.3. Say, for the sake of example, that we have
tensor = tf.range(10)
tproto = tf.make_tensor_proto(tensor)
Then:
You can use tf.make_ndarray like so
tf.constant(tf.make_ndarray(tproto))
Or you can use tf.io.parse_tensor like so
tf.io.parse_tensor(tproto.SerializeToString(), out_type=tf.int32)
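Put together, the two routes look like this (a runnable sketch, assuming TensorFlow 2.x):

import tensorflow as tf

tensor = tf.range(10)                    # dtype int32
tproto = tf.make_tensor_proto(tensor)

# Option 1: go through an intermediate numpy array
t1 = tf.constant(tf.make_ndarray(tproto))

# Option 2: serialize the proto and parse it back, specifying the dtype by hand
t2 = tf.io.parse_tensor(tproto.SerializeToString(), out_type=tf.int32)

print(t1.dtype, t2.dtype)   # both int32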
I feel both of these are a bit artificial, since in the former you end up with an intermediate numpy array, and in the latter you have to serialize the TensorProto to a string and parse it back. Additionally, parse_tensor won't automatically recover the correct data type from the TensorProto. So:
Is there a function to do the conversion in a single step? I'd like to see something like tf.from_tensor_proto doing the conversion all at once, optimizing for speed and memory allocation (or, if tf.constant(tf.make_ndarray(tproto)) is the best you can do, just wrapping this up).
Otherwise, which of the two options above should be preferred (in terms of efficiency, memory usage, etc.)?

What is the equivalent of numpy.allclose for structured numpy arrays?

Running numpy.allclose(a, b) throws TypeError: invalid type promotion on structured arrays. What would be the correct way of checking whether the contents of two structured arrays are almost equal?
np.allclose does an np.isclose followed by all(). isclose tests abs(x-y) against tolerances, with accommodations for np.nan and np.inf. So it is designed primarily to work with floats, and by extension ints.
The arrays have to work with np.isfinite(a), as well as with a - b and np.abs. In short, a.astype(float) should work on your arrays.
None of this works with the compound dtype of a structured array. You could, though, iterate over the fields of the array and compare those with isclose (or allclose). But you will have to ensure that the two arrays have matching dtypes, and use some other test on fields that don't work with isclose (e.g. string fields).
So in the simple case
all([np.allclose(a[name], b[name]) for name in a.dtype.names])
should work.
If the fields of the arrays are all the same numeric dtype, you could view the arrays as 2d arrays, and do allclose on those. But usually structured arrays are used when the fields are a mix of string, int and float. And in the most general case, there are compound dtypes within dtypes, requiring some sort of recursive testing.
The numpy.lib.recfunctions module (import numpy.lib.recfunctions as rf) has functions that help with complex structured array operations.
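A field-wise comparison along those lines might look like this (a minimal sketch; it assumes both arrays share the same dtype and falls back to exact equality for non-numeric fields, without handling nested compound dtypes):

import numpy as np

def structured_allclose(a, b, rtol=1e-05, atol=1e-08):
    """Compare two structured arrays field by field."""
    assert a.dtype == b.dtype
    for name in a.dtype.names:
        fa, fb = a[name], b[name]
        if np.issubdtype(fa.dtype, np.number):
            # numeric fields: tolerance-based comparison
            if not np.allclose(fa, fb, rtol=rtol, atol=atol):
                return False
        else:
            # string (or other non-numeric) fields: require exact equality
            if not np.array_equal(fa, fb):
                return False
    return True

a = np.array([('x', 1.0), ('y', 2.0)], dtype=[('label', 'U1'), ('val', 'f8')])
b = np.array([('x', 1.0 + 1e-9), ('y', 2.0)], dtype=a.dtype)
print(structured_allclose(a, b))   # True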
Assuming b is a scalar, you can just iterate over the fields of a:
all(np.allclose(a[field], b) for field in a.dtype.names)

Understanding Numpy internals for profiling purposes

Profiling a piece of numpy code shows that I'm spending most of the time within these two functions
numpy/matrixlib/defmatrix.py.__getitem__:301
numpy/matrixlib/defmatrix.py.__array_finalize__:279
Here's the Numpy source:
https://github.com/numpy/numpy/blob/master/numpy/matrixlib/defmatrix.py#L301
https://github.com/numpy/numpy/blob/master/numpy/matrixlib/defmatrix.py#L279
Question #1:
__getitem__ seems to be called every time I'm using something like my_array[arg] and it's getting more expensive if arg is not an integer but a slice. Is there any way to speed up calls to array slices?
E.g. in
for i in range(idx): res[i] = my_array[i:i+10].mean()
Question #2:
When exactly does __array_finalize__ get called and how can I speed up by reducing the number of calls to this function?
Thanks!
You could avoid using matrices so much and just use 2D numpy arrays. I typically only use matrices for a short time to take advantage of the syntax for multiplication (but with the addition of the .dot method on arrays, I find I do that less and less as well).
But, to your questions:
1) There really is no shortcut to __getitem__ unless defmatrix overrides __getslice__, which it could do but doesn't yet. There are the .item and .itemset methods, which are optimized for integer getting and setting (and return Python objects rather than NumPy's array scalars).
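For instance, a small sketch of the difference (note that .itemset has been deprecated in recent NumPy releases):

import numpy as np

a = np.arange(6.0).reshape(2, 3)

x = a[1, 2]        # numpy array scalar (np.float64), goes through __getitem__
y = a.item(1, 2)   # plain Python float, optimized single-element access

a.itemset((0, 0), 42.0)   # optimized single-element assignment
print(type(x), type(y), a[0, 0])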
2) __array_finalize__ is called whenever an array object (or a subclass) is created. It is called from the C function that every array creation gets funneled through. https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/ctors.c#L1003
In the case of subclasses defined purely in Python, it calls back into the Python interpreter from C, which has overhead. If the matrix class were a builtin type (a Cython-based cdef class, for example), then the call could avoid the Python interpreter overhead.
Question 1:
Since array slices can sometimes require a copy of the underlying data structure (holding the pointers to the data in memory), they can be quite expensive. If you're really bottlenecked by this in the example above, you can compute the means by iterating over the elements from i to i+10 yourself instead of slicing. For some operations this won't give any performance improvement, but avoiding the creation of new data structures will generally speed up the process (see the sketch below for one way to take the per-slice allocation out of the loop entirely).
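For the specific rolling-mean loop in the question, one alternative worth trying (a sketch, not from the original answer) is to replace the per-iteration slices with a single cumulative-sum pass:

import numpy as np

my_array = np.random.rand(10_000)
window = 10
idx = len(my_array) - window + 1

# original approach: one new slice (and one new array object) per iteration
res_loop = np.empty(idx)
for i in range(idx):
    res_loop[i] = my_array[i:i + window].mean()

# cumulative-sum approach: a single pass, no per-iteration array creation
csum = np.concatenate(([0.0], np.cumsum(my_array)))
res_cumsum = (csum[window:] - csum[:-window]) / window

print(np.allclose(res_loop, res_cumsum))   # True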
Another note: if you're not using native types inside numpy, you will pay a very large performance penalty when manipulating a numpy array. Say your array has dtype=float64 but your native machine float size is float32 -- this will cost numpy a lot of extra computation, and overall performance will drop. Sometimes this is fine and you can just take the hit to preserve a data type. Other times it's arbitrary what type the float or int is stored as internally. In those cases, try dtype=float instead of dtype=float64; numpy should default to your native type. I've had 3x+ speedups on numpy-intensive algorithms by making this change.
Question 2:
__array_finalize__ "is called whenever the system internally allocates a new array from obj, where obj is a subclass (subtype) of the (big)ndarray" according to SciPy. Thus this is a result described in the first question. When you slice and make a new array, you have to finalize that array by either making structural copies or wrapping the original structure. This operation takes time. Avoiding slices will save on this operation, though for multidimensional data it may be impossible to completely avoid calls to __array_finalize__.