NumPy indexing ambiguity in 3D arrays [duplicate] - numpy

I have the following minimal example:
a = np.zeros((5,5,5))
a[1,1,:] = [1,1,1,1,1]
print(a[1,:,range(4)])
I would expect as output an array with 5 rows and 4 columns, where we have ones on the second row. Instead it is an array with 4 rows and 5 columns with ones on the second column. What is happening here, and what can I do to get the output I expected?

This is an example of mixed basic and advanced indexing, as discussed in https://docs.scipy.org/doc/numpy/reference/arrays.indexing.html#combining-advanced-and-basic-indexing
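Here the scalar 1 and range(4) both count as advanced indexes, and they are separated by a slice. In that situation NumPy puts the dimension produced by the advanced indexes first and appends the slice dimension at the end, which gives (4, 5) instead of (5, 4).
With one scalar index this is a marginal case for the ambiguity described there. It has been discussed in previous SO questions and in one or more bug reports/issues: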
Numpy sub-array assignment with advanced, mixed indexing
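In this case you can replace the range with a slice and get the expected order. If you need a list of indices rather than a slice, index in two steps (a[1] first, then the list), as in the last line: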
In [215]: a[1,:,range(4)].shape
Out[215]: (4, 5) # slice dimension last
In [216]: a[1,:,:4].shape
Out[216]: (5, 4)
In [219]: a[1][:,[0,1,3]].shape
Out[219]: (5, 3)
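A minimal sketch of the three indexing forms side by side, using the same array as the question. The variable names are illustrative only. It suggests that the mixed result is simply the transpose of the slice-only result:

```python
import numpy as np

a = np.zeros((5, 5, 5))
a[1, 1, :] = 1

# Scalar and list separated by a slice: the advanced dimension comes first.
mixed = a[1, :, list(range(4))]
# Slice instead of a list: pure basic indexing, axes keep their order.
sliced = a[1, :, :4]
# Two-step indexing: pick the plane first, then apply the list.
two_step = a[1][:, list(range(4))]

print(mixed.shape)     # (4, 5)
print(sliced.shape)    # (5, 4)
print(two_step.shape)  # (5, 4)
print(np.array_equal(mixed.T, sliced))  # True
```

If the list of indices cannot be written as a slice, either the two-step form or a final transpose gives the layout the question expected.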

Related

Trouble understanding how the indices of a series are determined

I have a huge data frame from which I am reading a single column, and I need to choose 100 unique values from this column. I think what I did resulted in 100 unique values, but I'm confused about the indexing of the resulting series. I looked at the indices of the data frame, and the values there did not match the values at the same indices of the series. I would like them to match, that is, I want the indices of the resulting series to be the same as the indices of the data frame from which I am reading the column. Would someone be able to explain to me how the resulting indices were determined here?
The indices of the sample do not correspond to the indices that exist in the DataFrame. This is due to the following fact:
When doing CSsq.unique() you are in fact getting back a np.ndarray (check the docs here). An array does not have any indices. But, you are passing this to the pd.Series constructor and as a result, a new Series is created, which in fact has indexing (starting from 0 up to n-1, where n is the size of the Series). This, of course, has nothing to do with the DataFrame indices, because you have firstly isolated the unique values.
See the example below for a hypothetical Series called s:
s
0 100
1 100
2 100
3 200
4 250
5 300
6 300
Let's isolate the unique occurrences:
s.unique()
# [100, 200, 250, 300]
And now let's feed this to the pd.Series constructor:
pd.Series(s.unique())
0 100
1 200
2 250
3 300
As you can see this Series was generated from an array and its indices have nothing to do with the initial indices!
Now, if you take a random sample out of this Series, you'll get values with indices that correspond to this new Series object!
If you'd like to get a sample with indices that are derived from the DataFrame try something like this:
CSsq.drop_duplicates().sample(100)
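As a rough illustration of the difference, the sketch below rebuilds the hypothetical Series s with non-default index labels (10 to 16, an assumption made here so the two results are easy to tell apart) and compares both approaches:

```python
import pandas as pd

# Hypothetical data: same values as above, but with labels 10..16.
s = pd.Series([100, 100, 100, 200, 250, 300, 300],
              index=[10, 11, 12, 13, 14, 15, 16])

# unique() returns a plain array, so the new Series gets a fresh 0..n-1 index.
rebuilt = pd.Series(s.unique())
# drop_duplicates() stays a Series and keeps the label of each first occurrence.
kept = s.drop_duplicates()

print(list(rebuilt.index))  # [0, 1, 2, 3]
print(list(kept.index))     # [10, 13, 14, 15]

# Sampling from `kept` therefore yields labels that exist in the original.
sample = kept.sample(2, random_state=0)
print(sample.index.isin(s.index).all())  # True
```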

Selecting two sets of columns from a dataFrame with all rows

I have a dataFrame with 28 columns (features) and 600 rows (instances). I want to select all rows, but only columns from 0-12 and 16-27. Meaning that I don't want to select columns 12-15.
I wrote the following code, but it doesn't work and throws a syntax error at : in 0:12 and 16:. Can someone help me understand why?
X = df.iloc[:,[0:12,16:]]
I know there are other ways for selecting these rows, but I am curious to learn why this one does not work, and how I should write it to work (if there is a way).
For now, I have written it as:
X = df.iloc[:,0:12]
X = X + df.iloc[:,16:]
Which seems to return an incorrect result, because I have already treated the NaN values of df, but when I use this code, X includes lots of NaNs!
Thanks for your feedback in advance.
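You can use np.r_ to concatenate the slices. Inside np.r_ an open-ended slice such as 16: does not mean "through the last column", so spell out the stop explicitly:
x = df.iloc[:, np.r_[0:12, 16:df.shape[1]]]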
iloc has these allowed inputs (from the docs):
An integer, e.g. 5.
A list or array of integers, e.g. [4, 3, 0].
A slice object with ints, e.g. 1:7.
A boolean array.
A callable function with one argument (the calling Series or DataFrame) and that returns valid output for indexing (one of the above). This is useful in method chains, when you don’t have a reference to the calling object, but would like to base your selection on some value.
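What you're passing to iloc in X = df.iloc[:,[0:12,16:]] is not a list of integers or a slice of ints. The start:stop notation is only valid directly inside the square brackets of an indexing expression, so writing it inside a list literal is a syntax error. You need to expand those ranges into an array of integers, and a convenient way to do that is the numpy.r_ object, which translates slice notation into a concatenated array: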
X = df.iloc[:, np.r_[0:13, 16:28]]
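A small sketch of this on a toy frame (the 3 x 28 DataFrame below is made up for illustration). It also shows why the attempt with + produced NaNs: + is element-wise addition aligned on column labels, not concatenation, so pd.concat is the closer equivalent of what was intended:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: 3 rows, 28 columns.
df = pd.DataFrame(np.arange(3 * 28).reshape(3, 28))

# np.r_ expands the slice notation into one flat integer array.
cols = np.r_[0:13, 16:df.shape[1]]
X = df.iloc[:, cols]
print(X.shape)  # (3, 25)

# Same selection by concatenating two column blocks side by side.
Y = pd.concat([df.iloc[:, 0:13], df.iloc[:, 16:]], axis=1)
print(X.equals(Y))  # True

# Adding two frames with disjoint columns aligns on labels and yields NaN.
Z = df.iloc[:, 0:13] + df.iloc[:, 16:]
print(Z.isna().all().all())  # True
```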

Vectors vs ndarrays in pandas/numpy

I know that a 4D vector should have shape (4, 1): it lives in 4D space, yet its ndim is 2. For an ndarray to be 4-dimensional, its shape should be something like (2, 3, 4, 5).
So does the concept of dimension differ between vectors and matrices (or arrays)? I'm trying to understand this from a mathematical perspective and how it carries over to pandas programming.
The dimensionality of a mathematical object is usually determined by the number of independent parameters in that particular object. For example, a 4-D vector is mathematically 4 dimensional because it contains 4 independent elements (unless some relation between them has been specified). Such a vector, if represented as a column vector in numpy, would have shape (4, 1) because it has 4 rows and 1 column. The transpose of this vector, a row vector, has shape (1, 4) because it has 1 row and 4 columns. A plain 1-D array of 4 elements has shape (4,); it has a single axis and is neither a row nor a column.
Note however, that the column vector and row vector are dimensionally equivalent mathematically. Both have 4 dimensions.
For a 3 x 3 matrix, the most general mathematical dimension is 9, because it has 9 independent elements in general. The shape of a corresponding numpy array would be (3, 3). If you're looking for the total number of elements in any numpy array, ndarray.size is the way to go.
ndarray.ndim, however, yields the number of axes in a numpy array. That is, the number of directions along which values are placed (sloppy terminology!). So for the 3 x 3 matrix, ndim yields 2. For an array of shape (3, 7, 2, 1), ndim would yield 4. But, as we already discussed, the mathematical dimension would generally be 3 x 7 x 2 x 1 = 42 (So this is a matrix in 42-dimensional space! But the numpy array has just 4 dimensions). Thereby, as you might've already noticed, ndarray.size is just the product of the numbers in ndarray.shape.
Note that these are not just concepts of programming. We are used to saying "2-D matrices" in mathematics, but that is not to be confused with the space in which the matrices reside.
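The distinction between ndim, shape and size can be checked directly. A brief sketch (the array names are illustrative):

```python
import numpy as np

col = np.zeros((4, 1))        # column vector: 2 axes, 4 elements
row = col.T                   # row vector: 2 axes, 4 elements
flat = np.zeros(4)            # 1-D array: 1 axis, 4 elements
big = np.zeros((3, 7, 2, 1))  # 4 axes, 42 elements

for name, arr in [("col", col), ("row", row), ("flat", flat), ("big", big)]:
    print(name, arr.shape, arr.ndim, arr.size)
# col (4, 1) 2 4
# row (1, 4) 2 4
# flat (4,) 1 4
# big (3, 7, 2, 1) 4 42

# size is the product of the entries in shape.
print(big.size == np.prod(big.shape))  # True
```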

initialize pandas SparseArray

Is it possible to initialize a pandas SparseArray by providing only the dense entries? I could not figure this out from the documentation: http://pandas.pydata.org/pandas-docs/stable/sparse.html .
For example, say I want a length 1000 SparseArray with a one at index 9 and zeros everywhere else, how would I go about creating it? This is one way:
a = [0] * 1000
a[9] = 1
sparse_a = pd.SparseArray(data=a, fill_value=0)
But, in the above, we have to create the dense array before the sparse one. Is there a way to specify only the indices and the dense entries to create the SparseArray directly?
A length 10 SparseArray with a one as its ninth element (zero-based index 8) and zeros everywhere else:
pd.SparseArray(1, index= range(1), kind='block',
sparse_index= BlockIndex(10, [8], [1]),
fill_value=0)
Notes:
1. index could be any list as long as its length is equal to the number of non-sparse entries of the array (the smaller part of the data), in this case the number of 1s in the sparse array.
2. BlockIndex(10, [8], [1]) is the object pointing to the positions of the non-sparse part of the data. The first argument is the TOTAL length of the array (sparse + non-sparse), the second argument is a list of starting positions of the non-sparse blocks, and the third argument is a list of how long each non-sparse block lasts. Notice that the length of the list mentioned in point 1 is the sum of all elements of the list in the third argument of this BlockIndex.
A more general example: to make a length 20 SparseArray where the 2nd, 3rd, 6th, 7th and 8th elements are 1 and the rest are 0:
pd.SparseArray(1, index= range(5), kind='block',
sparse_index= BlockIndex(20, [1,5], [2,3]),
fill_value=0)
or
pd.SparseArray(1, index= [None, 3, 2, 7, np.inf], kind='block',
sparse_index= BlockIndex(20, [1,5], [2,3]),
fill_value=0)
Sadly, I don't know of a good way to pass an array of non-sparse data as the first argument to SparseArray. That doesn't mean it can't be done; this is only a disclaimer. I think that as long as you specify index=..., pandas will require a scalar for the first argument (the data).
Tested on Windows 7, pandas version 0.20.2 installed by Anaconda.
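To make the block layout concrete, here is a plain NumPy sketch that expands the same (length, starts, lengths) triple into a dense array. The helper dense_from_blocks is hypothetical and is not part of pandas; it only illustrates how the three BlockIndex arguments described above map to positions:

```python
import numpy as np

def dense_from_blocks(length, starts, lengths, value=1, fill_value=0):
    """Expand block starts/lengths into a dense array (illustration only)."""
    out = np.full(length, fill_value)
    for start, n in zip(starts, lengths):
        out[start:start + n] = value  # each block covers n consecutive slots
    return out

# Same triple as BlockIndex(20, [1, 5], [2, 3]) in the example above.
dense = dense_from_blocks(20, [1, 5], [2, 3])
print(np.flatnonzero(dense))  # [1 2 5 6 7]
print(int(dense.sum()))       # 5, the sum of the block lengths
```

Zero-based positions 1, 2, 5, 6 and 7 correspond to the 2nd, 3rd, 6th, 7th and 8th elements.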

How to find last occurrence of maximum value in a numpy.ndarray

I have a numpy.ndarray in which the maximum value will mostly occur more than once.
EDIT: This is subtly different from numpy.argmax: how to get the index corresponding to the *last* occurrence, in case of multiple occurrences of the maximum values because the author says
Or, even better, is it possible to get a list of indices of all the occurrences of the maximum value in the array?
whereas in my case getting such a list may prove very expensive
Is it possible to find the index of the last occurrence of the maximum value by using something like numpy.argmax? I want to find only the index of the last occurrence, not an array of all occurrences (since several hundreds may be there)
For example, this will return the index of the first occurrence, i.e. 2:
import numpy as np
a=np.array([0,0,4,4,4,4,2,2,2,2])
print(np.argmax(a))
However I want it to output 5.
numpy.argmax only returns the index of the first occurrence. You could apply argmax to a reversed view of the array:
import numpy as np
a = np.array([0,0,4,4,4,4,2,2,2,2])
b = a[::-1]
i = len(b) - np.argmax(b) - 1
i # 5
a[i:] # array([4, 2, 2, 2, 2])
Note numpy doesn't copy the array but instead creates a view of the original with a stride that accesses it in reverse order.
id(a) == id(b.base) # True
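If your array holds integers and has fewer than 1e15 elements, you can also add a small noise term that increases linearly with position, so that later occurrences of the maximum become slightly larger. Note that this relies on float64 being able to resolve increments of 1e-15 next to the array's values, so it only works when the values are small in magnitude; the reversed-view approach above is more robust.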
>>> import numpy as np
>>> a = np.array([0,0,4,4,4,4,2,2,2,2])
>>> noise = np.arange(len(a)) * 1e-15
>>> print(np.argmax(a + noise))
5
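For reuse, the reversed-view idea can be wrapped in a small helper. This is a sketch for 1-D input; the name last_argmax is made up here. The second half shows a case where the noise trick stops working because the values are too large for float64 to register an increment of 1e-15:

```python
import numpy as np

def last_argmax(a):
    """Index of the last occurrence of the maximum in a 1-D array."""
    a = np.asarray(a)
    # argmax on the reversed view finds the last maximum; map it back.
    return a.size - 1 - int(np.argmax(a[::-1]))

a = np.array([0, 0, 4, 4, 4, 4, 2, 2, 2, 2])
print(last_argmax(a))  # 5

# With larger values the added noise is rounded away in float64.
big = np.array([1000, 1000])
noise = np.arange(len(big)) * 1e-15
print(int(np.argmax(big + noise)))  # 0 (first occurrence, noise was lost)
print(last_argmax(big))             # 1
```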