What is the equivalent of numpy.allclose for structured numpy arrays? - numpy

Running numpy.allclose(a, b) throws TypeError: invalid type promotion on structured arrays. What would be the correct way of checking whether the contents of two structured arrays are almost equal?

np.allclose does an np.isclose followed by all(). isclose tests abs(x-y) against tolerances, with accommodations for np.nan and np.inf. So it is designed primarily to work with floats, and by extension ints.
The arrays have to work with np.isfinite(a), as well as a-b and np.abs. In short, a.astype(float) should work for your arrays.
None of this works with the compound dtype of a structured array. You could, though, iterate over the fields of the array and compare those with isclose (or allclose). But you will have to ensure that the two arrays have matching dtypes, and use some other test on fields that don't work with isclose (e.g. string fields).
So in the simple case
all([np.allclose(a[name], b[name]) for name in a.dtype.names])
should work.
If the fields of the arrays are all the same numeric dtype, you could view the arrays as 2d arrays, and do allclose on those. But usually structured arrays are used when the fields are a mix of string, int and float. And in the most general case, there are compound dtypes within dtypes, requiring some sort of recursive testing.
import numpy.lib.recfunctions as rf
has functions to help with complex structured array operations.
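A minimal sketch of the field-wise approach, dispatching on each field's dtype (the dtype and sample values here are illustrative):

```python
import numpy as np

def structured_allclose(a, b):
    """Compare two structured arrays field by field.

    Numeric fields go through np.allclose; other fields
    (e.g. strings) fall back to exact equality.
    """
    if a.dtype.names != b.dtype.names:
        return False
    for name in a.dtype.names:
        fa, fb = a[name], b[name]
        if np.issubdtype(fa.dtype, np.number):
            if not np.allclose(fa, fb):
                return False
        else:
            if not np.array_equal(fa, fb):
                return False
    return True

dt = [('id', 'U5'), ('x', 'f8'), ('n', 'i4')]
a = np.array([('a', 1.0, 3), ('b', 2.0, 4)], dtype=dt)
b = np.array([('a', 1.0 + 1e-9, 3), ('b', 2.0, 4)], dtype=dt)
print(structured_allclose(a, b))  # True
```

Note this handles only one level of nesting; fields that are themselves compound dtypes would need a recursive version of the same idea.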

Assuming b is a scalar, you can just iterate over the fields of a:
all(np.allclose(a[field], b) for field in a.dtype.names)

Related

Typed lists vs ND-arrays in Numba

Could someone please clarify what the benefit of using a Numba typed list over an ND array is? Also, how do the two compare in terms of speed, and in what context would using the typed list be recommended?
Typed lists are useful when you need to append a sequence of elements but you do not know the total number of elements and cannot even find a reasonable bound. Such a data structure is significantly more expensive than a 1D array (both in memory space and computation time).
1D arrays cannot be resized efficiently: a new array needs to be created and a copy must be performed. However, indexing into a 1D array is very cheap. Numpy also provides many functions that can natively operate on arrays (lists are implicitly converted to arrays when passed to a Numpy function, and this conversion is expensive). Note that if the number of items can be bounded to a reasonable size (i.e. not much higher than the actual number of elements), you can create a big array, add the elements, and finally work on a sub-view of the array.
ND arrays cannot be directly compared with lists. Note that lists of lists are similar to jagged arrays (they can contain lists of different sizes), while an ND array is like a (fixed-size) N x ... x M table. Lists of lists are very inefficient and often not needed.
As a result, use ND arrays when you can and when you do not need to resize them often (or append/remove elements). Otherwise, use typed lists.
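The preallocate-then-slice pattern mentioned above can be sketched like this (the bound of 100 and the incoming values are illustrative):

```python
import numpy as np

# Preallocate using an upper bound on the number of elements.
buf = np.empty(100, dtype=np.float64)
count = 0
for v in (1.5, 2.5, 4.0):   # elements arriving one by one
    buf[count] = v
    count += 1

result = buf[:count]        # cheap view into the buffer, no copy
print(result.sum())         # 8.0
```

This keeps the cheap indexing and vectorized operations of arrays while still allowing incremental appends, at the cost of some over-allocated memory.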

Dealing with both categorical and numerical variables in a Multiple Linear Regression Python

So I have already performed a multiple linear regression in Python using LinearRegression from sklearn.
My independent variables were all numerical (and so was my dependent one).
But now I'd like to perform a multiple linear regression combining numerical and non-numerical independent variables.
Therefore I have several questions:
If I use dummy variables or One-Hot for the non-numerical ones, will I then be able to perform the LinearRegression from sklearn?
If yes, do I have to change some parameters?
If not, how should I perform the Linear Regression?
One thing that bothers me is that dummy/one-hot methods don't deal with ordinal variables, right? (Because they shouldn't be encoded the same way, in my opinion.)
Problem is: even if I want to encode nominal and ordinal variables differently,
it seems impossible for Python to tell the difference between the two?
This stuff might be easy for you, but right now, as you can tell, I'm a little confused, so I could really use your help!
Thanks in advance,
Alex
If I use dummy variables or One-Hot for the non-numerical ones, will I then be able to perform the LinearRegression from sklearn?
In fact the model has to be fed exclusively with numerical data, thus you must use OneHot vectors for the categorical data in your input features. For that you can take a look at Scikit-Learn's LabelEncoder and OneHotEncoder.
One thing that bothers me is that dummy/one-hot methods don't deal with ordinal variables, right? (Because they shouldn't be encoded the same way, in my opinion.)
Yes, as you mention, one-hot methods don't deal with ordinal variables. One way to work with ordinal features is to create a scale map and map those features to that scale. An ordinal encoder is a very useful tool for these cases: you can feed it a mapping dictionary according to a predefined scale, as mentioned. Otherwise, it randomly assigns integers to the different categories, as it has no knowledge to infer any order. From the documentation:
Ordinal encoding uses a single column of integers to represent the classes. An optional mapping dict can be passed in, in this case we use the knowledge that there is some true order to the classes themselves. Otherwise, the classes are assumed to have no true order and integers are selected at random.
Hope this helps.
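A hand-rolled sketch of both encodings, without any particular encoder library (the size scale and color categories are made up):

```python
import numpy as np

# Ordinal feature: encode with an explicit scale map,
# preserving the known order small < medium < large.
size_scale = {'small': 0, 'medium': 1, 'large': 2}
sizes = ['small', 'large', 'medium']
size_encoded = np.array([size_scale[s] for s in sizes])

# Nominal feature: one-hot encode by hand, one column per category.
colors = ['red', 'blue', 'red']
categories = sorted(set(colors))               # ['blue', 'red']
one_hot = np.array([[c == cat for cat in categories]
                    for c in colors], dtype=int)

print(size_encoded)   # [0 2 1]
print(one_hot)
# [[0 1]
#  [1 0]
#  [0 1]]
```

Either encoded block can then be stacked with the numerical features and fed to sklearn's LinearRegression with no parameter changes.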

Difference between numpy.logical_and and &

I'm trying to take the logical and of two or more numpy arrays. I know numpy has the function logical_and(), but I find the simple operator & returns the same results and is potentially easier to use.
For example, consider three numpy arrays a, b, and c. Is
np.logical_and(a, np.logical_and(b,c))
equivalent to
a & b & c?
If they are (more or less) equivalent, what's the advantage of using logical_and()?
@user1121588 answered most of this in a comment, but to answer fully...
"Bitwise and" (&) behaves much the same as logical_and on boolean arrays, but it doesn't convey the intent as well as using logical_and, and raises the possibility of getting misleading answers in non-trivial cases (packed or sparse arrays, maybe).
To use logical_and on multiple arrays, do:
np.logical_and.reduce([a, b, c])
where the argument is a list of as many arrays as you wish to logical_and together. They should all be the same shape.
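For example, on boolean arrays the two spellings agree, while on plain integers they can diverge (the arrays here are made up):

```python
import numpy as np

a = np.array([True, True, False, True])
b = np.array([True, False, False, True])
c = np.array([True, True, True, True])

# reduce applies logical_and pairwise across the whole list
combined = np.logical_and.reduce([a, b, c])
print(combined)                             # [ True False False  True]
print(np.array_equal(combined, a & b & c))  # True

# On non-boolean inputs the two are NOT equivalent:
# bitwise 2 & 5 is 0, but logically both are truthy.
print(np.logical_and(2, 5), 2 & 5)          # True 0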
I have been googling for official confirmation that I can use & instead of logical_and on NumPy bool arrays, and found one in the NumPy v1.15 Manual:
If you know you have boolean arguments, you can get away with using
NumPy’s bitwise operators, but be careful with parentheses, like this:
z = (x > 1) & (x < 2). The absence of NumPy operator forms of
logical_and and logical_or is an unfortunate consequence of Python’s
design.
So one can also use ~ for logical_not and | for logical_or.

Working with columns of NumPy matrices

I've been unable to figure out how to access, add, multiply, replace, etc. single columns of a NumPy matrix. I can do this via looping over individual elements of the column, but I'd like to treat the column as a unit, something that I can do with rows.
When I've tried to search I'm usually directed to answers handling NumPy arrays, but this is not the same thing.
Can you provide code that's giving trouble? The operations on columns that you list are among the most basic operations that are supported and optimized in NumPy. Consider looking over the tutorial on NumPy for MATLAB users, which has many examples of accessing rows or columns, performing vectorized operations on them, and modifying them with copies or in-place.
NumPy for MATLAB Users
Just to clarify, suppose you have a 2-dimensional NumPy ndarray or matrix called a. Then a[:, 0] would access the first column, just the same as a[0] or a[0, :] would access the first row. Any operations that work for rows should work for columns as well, with some caveats for broadcasting rules and certain mathematical operations that depend upon array alignment. You can also use the numpy.transpose(a) function (which is also exposed as a.T) to transpose a, making the columns become rows.
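A short sketch of the column operations the question asks about (the array contents are illustrative):

```python
import numpy as np

a = np.arange(12).reshape(3, 4)   # a 3x4 array

col = a[:, 1]           # access the second column (a view, not a copy)
a[:, 1] = a[:, 1] * 10  # multiply a column in place
a[:, 2] += 1            # add to a column
a[:, 3] = 0             # replace a column
print(a)
# [[ 0 10  3  0]
#  [ 4 50  7  0]
#  [ 8 90 11  0]]
```

Because col is a view, it reflects the in-place multiplication afterwards, which is exactly what lets you treat a column as a unit.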

What does it mean to flatten an iterator?

I would like to know what it means to flatten something, e.g. to flatten an iterator of iterators. Can you tell me? Are there any C/Java/Python idioms for it?
In this context, to flatten means to remove nesting. For instance, an array of arrays (an array where each element is an array) of integers is nested; if we flatten it we get an array of integers which contains the same values in the same order, but next to each other in a single array, rather than split into several arrays: [[1 2] [3 4]] -> [1 2 3 4]. Same difference with iterators, other collections, and deeper nesting (array of array of sets of iterators of strings).
As for idioms, there aren't really many -- it's not a common task, and often simple. Note that in the case of regular arrays (all nested arrays are of the same size), nested[i][j] is equivalent to nested[i * INNER_ARRAY_SIZE + j]. This is sometimes used to avoid nesting, especially in languages which treat arrays as reference types and thus require many separately-allocated arrays if you nest them. In Python, you can flatten iterables with itertools.chain(*iterable_of_iterables).
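In Python, for instance, both itertools spellings give the same result (stdlib only; the nested list is made up):

```python
from itertools import chain

nested = [[1, 2], [3, 4], [5]]

flat = list(chain(*nested))                 # unpack the outer list into chain
flat2 = list(chain.from_iterable(nested))   # lazier: outer iterable not unpacked
print(flat)            # [1, 2, 3, 4, 5]
print(flat == flat2)   # True
```

chain.from_iterable is preferable when the outer iterable is itself lazy or very large, since chain(*...) must consume it up front.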
Flattening means to remove nesting of sequence types. Python provides itertools.chain(*iterables) for this purpose (http://docs.python.org/library/itertools.html#itertools.chain).