Split Xy matrix into X and y - numpy

If I have a matrix Xy that I want to split into a matrix X and an array y, I usually do this
X, y = Xy[:, :-1], Xy[:, -1]
Is there a better way to do this using scikit-learn or numpy? I feel like it's a very common operation.

You can use NumPy built-in np.split -
X, y = np.split(Xy,[-1],axis=1) # Or simply : np.split(Xy,[-1],1)
Sample run -
In [93]: Xy
Out[93]:
array([[6, 2, 0, 5, 2],
[6, 3, 7, 0, 0],
[3, 2, 3, 1, 3],
[1, 3, 7, 1, 7]])
In [94]: X, y = np.split(Xy,[-1],axis=1)
In [95]: X
Out[95]:
array([[6, 2, 0, 5],
[6, 3, 7, 0],
[3, 2, 3, 1],
[1, 3, 7, 1]])
In [96]: y
Out[96]:
array([[2],
[0],
[3],
[7]])
Note that np.split would produce y as 2D. To have a 1D slice, we need to use np.squeeze(y) there.
Also, these slices would be views into original array, so no additional memory required there -
In [104]: np.may_share_memory(Xy, X)
Out[104]: True
In [105]: np.may_share_memory(Xy, y)
Out[105]: True
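As a minimal sketch of the two notes above, the following reuses the sample array, splits it with np.split, and squeezes the 2D column down to 1D. The name y2d is only illustrative, and np.shares_memory is used here as a stricter check than np.may_share_memory.

```python
import numpy as np

Xy = np.array([[6, 2, 0, 5, 2],
               [6, 3, 7, 0, 0],
               [3, 2, 3, 1, 3],
               [1, 3, 7, 1, 7]])

# Split off the last column; both pieces keep two dimensions.
X, y2d = np.split(Xy, [-1], axis=1)

# Drop the length-1 axis to get a 1D target vector (still a view of Xy).
y = np.squeeze(y2d, axis=1)

print(X.shape, y2d.shape, y.shape)
print(np.shares_memory(Xy, y))
```

Passing axis=1 to np.squeeze makes the intent explicit and raises an error if that axis is not of length 1.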

np.split uses np.array_split. That in turn does:
sub_arys = []
sary = _nx.swapaxes(ary, axis, 0)
for i in range(Nsections):
    st = div_points[i]
    end = div_points[i + 1]
    sub_arys.append(_nx.swapaxes(sary[st:end], axis, 0))
swapaxes is needed with axis=1; or without the swapping:
sub_arys = []
for ...:
    sub_arys.append(ary[:, st:end])
return sub_arys
i.e. the same as:
In [388]: ary=np.arange(12).reshape(3,4)
In [389]: [ary[:,0:3], ary[:,3:4]]
Out[389]:
[array([[ 0, 1, 2],
[ 4, 5, 6],
[ 8, 9, 10]]),
array([[ 3],
[ 7],
[11]])]
split like this keeps the original number of dimensions.
Wrapping your code in a function gives something that will be as fast, if not faster:
def xysplit(ary):
    return ary[:,:-1], ary[:,-1]
X, y = xysplit(ary)
produces:
array([[ 0, 1, 2],
[ 4, 5, 6],
[ 8, 9, 10]]),
array([ 3, 7, 11])
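The only difference between the two results is the shape of y. A small sketch (the names y_1d and y_2d are illustrative) shows that indexing with an integer drops the axis, while a slice keeps it and matches what np.split returns:

```python
import numpy as np

ary = np.arange(12).reshape(3, 4)

X = ary[:, :-1]
y_1d = ary[:, -1]    # integer index drops the column axis
y_2d = ary[:, -1:]   # slice keeps the column axis, like np.split

print(X.shape, y_1d.shape, y_2d.shape)
```

Which form is preferable probably depends on the consumer: many sklearn estimators expect a 1D y, while stacking the pieces back together is easier with the 2D form.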
When I commented that this seems to be more common in sklearn contexts I had in mind questions like
Python ValueError: non-broadcastable output operand with shape (124,1) doesn't match the broadcast shape (124,13)
X = df_wine.iloc[:, 1:].values
y = df_wine.iloc[:, 0].values
....
X_train, X_test, y_train, y_test = train_test_split(X, y, ...
X and y are 2d and 1d arrays, pulled in this case from the columns of a pandas dataframe. train_test_split is used to split X and y into training and testing groups. If there is a special X,y splitter, it would be in the sklearn package, not numpy.
Python - NumPy array_split adds a dimension
train_inputs = train[:,: -1]
train_outputs = train[:, -1]

Related

Changing values of array in 2nd dimension

I have this array:
x = numpy.array([[[1, 2, 3]],
[[4, 5, 6]],
[[7,8,9]]])
I want to replace the elements 3,6 and 9 with some other numbers.
I tried to split the array to
y=x[:,:,:2]
and then add the array new at the end of array y with
new = numpy.array([[[10]],
[[11]],
[[12]]])
final_arr= numpy.insert(y,2,new, axis=2)
But this inserts the whole new array into each row.
You need to add it to the third dimension, so just create an array with the corresponding shape. You can do this easily with numpy.newaxis, as shown below:
import numpy as np
x = np.array(
[
[[1, 2, 3]],
[[4, 5, 6]],
[[7,8,9]]
])
x[:, :, -1] = np.array([10, 11, 12])[:, np.newaxis]
x
Output
array([[[ 1, 2, 10]],
[[ 4, 5, 11]],
[[ 7, 8, 12]]])
Cheers!
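A possible alternative, sketched here on the assumption that new already has shape (3, 1, 1) as in the question, is to assign into a slice that keeps the last axis, so no reshaping is needed:

```python
import numpy as np

x = np.array([[[1, 2, 3]],
              [[4, 5, 6]],
              [[7, 8, 9]]])
new = np.array([[[10]],
                [[11]],
                [[12]]])

# x[:, :, -1:] has shape (3, 1, 1), the same as new, so the shapes match directly.
x[:, :, -1:] = new
print(x)
```

This modifies x in place rather than building a new array with numpy.insert.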

Numpy: How to select row entries in a 2d array by column vector

How can I retrieve a column vector from a 2d array given an indicator column vector?
Suppose I have
X = np.array([[1, 4, 6],
[8, 2, 9],
[0, 3, 7],
[6, 5, 1]])
and
S = np.array([0, 2, 1, 2])
Is there an elegant way to get from X and S the result array([1, 9, 3, 1]), which is equivalent to
np.array([x[s] for x, s in zip(X, S)])
You can achieve this using np.take_along_axis:
>>> np.take_along_axis(X, S[..., None], axis=1)
array([[1],
[9],
[3],
[1]])
You need to make sure both array arguments are of the same shape (or broadcasting can be applied), hence the S[..., None] broadcasting.
Of course you can flatten the returned value with a [:, 0] slice.
>>> np.take_along_axis(X, S[..., None], axis=1)[:, 0]
array([1, 9, 3, 1])
Alternatively, you can index with an arange of row numbers:
>>> X[np.arange(len(S)), S]
array([1, 9, 3, 1])
I believe this is also equivalent to np.diag(X[:, S]) but with unnecessary copying...
For 2d arrays
# Mention row numbers as one list and S which is column number as other
X[[0, 1, 2, 3], S]
# more general (note: np.indices adds a leading axis, so the result has shape (1, 4))
X[np.indices(S.shape), S]
See the NumPy "Indexing basics" documentation.
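To compare the approaches from this thread side by side, here is a short sketch using the X and S from the question. The variable names (expected, via_take, and so on) are only illustrative.

```python
import numpy as np

X = np.array([[1, 4, 6],
              [8, 2, 9],
              [0, 3, 7],
              [6, 5, 1]])
S = np.array([0, 2, 1, 2])

# Reference result from the list comprehension in the question.
expected = np.array([x[s] for x, s in zip(X, S)])

# take_along_axis needs S as a column, then [:, 0] flattens the result.
via_take = np.take_along_axis(X, S[:, None], axis=1)[:, 0]

# Pair each row number with its column number.
via_arange = X[np.arange(len(S)), S]

# Builds a full (4, 4) intermediate before taking the diagonal.
via_diag = np.diag(X[:, S])

print(expected, via_take, via_arange, via_diag)
```

The arange form is probably the most common idiom; the diag form gives the same values but allocates a len(S) by len(S) intermediate.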

convert CSR format to dense/COO format in tensorflow

The tf.sparse_to_dense() function in TensorFlow only supports the ((data, (row_ind, col_ind)), [shape=(M, N)]) format. How can I convert a standard CSR tensor (((data, indices, indptr), [shape=(M, N)])) to a dense representation in TensorFlow?
For example, given data, indices and indptr, the function should return a dense tensor.
e.g., inputs:
indices = [1 3 3 0 1 2 2 3]
indptr = [0 2 3 6 8]
data = [2 4 1 3 2 1 1 5]
expected output:
[[0, 2, 0, 4],
[0, 0, 0, 1],
[3, 2, 1, 0],
[0, 0, 1, 5]]
According to Scipy documentation, we can convert it back by the following:
the column indices for row i are stored in indices[indptr[i]:indptr[i+1]] and their
corresponding values are stored in data[indptr[i]:indptr[i+1]].
If the shape parameter is not supplied, the matrix dimensions are
inferred from the index arrays.
It is relatively easy to convert from the CSR format to COO by expanding the indptr argument to get the row indices. Here is an example using a subtraction, tf.repeat and tf.range. The shape of the final sparse tensor is inferred from the maximum row and column indices (but can also be provided explicitly).
def csr_to_sparse(data, indices, indptr, dense_shape=None):
    rep = tf.math.subtract(indptr[1:], indptr[:-1])
    row_indices = tf.repeat(tf.range(tf.size(rep)), rep)
    sparse_indices = tf.cast(tf.stack((row_indices, indices), axis=-1), tf.int64)
    if dense_shape is None:
        max_row = tf.math.reduce_max(row_indices)
        max_col = tf.math.reduce_max(indices)
        dense_shape = (max_row + 1, max_col + 1)
    return tf.SparseTensor(indices=sparse_indices, values=data, dense_shape=dense_shape)
With your example:
>>> indices = [1, 3, 3, 0, 1, 2, 2, 3]
>>> indptr = [0, 2, 3, 6, 8,]
>>> data = [2, 4, 1, 3, 2, 1, 1, 5]
>>> tf.sparse.to_dense(csr_to_sparse(data, indices, indptr))
<tf.Tensor: shape=(4, 4), dtype=int32, numpy=
array([[0, 2, 0, 4],
[0, 0, 0, 1],
[3, 2, 1, 0],
[0, 0, 1, 5]], dtype=int32)>
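The indptr expansion can also be illustrated without TensorFlow. The following is a NumPy analogue of the same idea, not the TensorFlow code itself; the names counts, rows and dense are illustrative, and the shape is inferred from the largest indices just as in the function above.

```python
import numpy as np

indices = np.array([1, 3, 3, 0, 1, 2, 2, 3])
indptr = np.array([0, 2, 3, 6, 8])
data = np.array([2, 4, 1, 3, 2, 1, 1, 5])

# Number of stored entries in each row.
counts = np.diff(indptr)

# Repeat each row number once per stored entry to get COO row indices.
rows = np.repeat(np.arange(len(counts)), counts)

# Scatter the values into a dense array whose shape is inferred from the indices.
dense = np.zeros((rows.max() + 1, indices.max() + 1), dtype=data.dtype)
dense[rows, indices] = data
print(dense)
```

Inferring the shape this way silently drops trailing all-zero rows or columns, so passing the shape explicitly is safer when it is known.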

Usage of tf.gather_nd

Suppose that you have a 3-tensor
data = np.reshape(np.arange(12), [2, 2, 3])
x = tf.constant(data)
Thinking of this as 2x2 matrices indexed by the last index, I would like to get the first column from the first matrix, the second column from the second matrix and the second column from the third matrix.
How can I use tf.gather_nd to do this?
You first need to generate the indices you want.
import tensorflow as tf
import numpy as np
indices = [[i,min(j,1),j] for j in range(3) for i in range(2)] # According to your description
# [[0, 0, 0], [1, 0, 0], [0, 1, 1], [1, 1, 1], [0, 1, 2], [1, 1, 2]]
a = tf.constant(np.arange(12).reshape(2,2,3))
res = tf.gather_nd(a, indices)
sess = tf.InteractiveSession()
a.eval()
# array([[[ 0, 1, 2],
# [ 3, 4, 5]],
# [[ 6, 7, 8],
# [ 9, 10, 11]]])
res.eval()
# array([ 0, 6, 4, 10, 5, 11])
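For comparison, the same lookup can be sketched in plain NumPy. This is an analogue of what tf.gather_nd does with a list of full index triples, not TensorFlow code: each row of indices is treated as one (i, j, k) coordinate.

```python
import numpy as np

a = np.arange(12).reshape(2, 2, 3)
indices = np.array([[i, min(j, 1), j] for j in range(3) for i in range(2)])

# Transposing gives one index array per axis, i.e. a[first, second, third].
res = a[tuple(indices.T)]
print(res)
```

Seeing the indices as per-axis arrays can make it easier to check that each triple points at the intended element before moving the code to TensorFlow.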
I found the following tutorial online explaining how to deal with this kind of problem: https://geekyisawesome.blogspot.com/2018/05/fancy-indexing-in-tensorflow-getting.html
Suppose we have a 4x3 matrix
M = tf.constant(np.arange(12).reshape(4,3))
Now let's say that you wanted the third element of the first row, the second element of the second row, the first element of the third row, and the second element of the fourth row. As explained in the tutorial, this could be accomplished like:
idx = tf.constant([2,1,0,1], tf.int32)
x = tf.gather_nd(M, tf.stack([tf.range(M.shape[0]), idx], axis=1))
But what if M has an unknown number of rows (and idx is a tensor of integers of the appropriate size)? Then tf.range(M.shape[0]) will raise an error. How can I get around that?

Numpy dot product of a matrix and an array is a matrix

When I updated to the most recent version of numpy, a lot of my code broke because now every time I call np.dot() on a matrix and an array, it returns a 1xn matrix rather than simply an array.
This causes me an error when I try to multiply the new vector/array by a matrix
example
A = np.matrix( [ [4, 1, 0, 0], [1, 5, 1, 0], [0, 1, 6, 1], [1, 0, 1, 4] ] )
x = np.array([0, 0, 0, 0])
print(x)
x1 = np.dot(A, x)
print(x1)
x2 = np.dot(A, x1)
print(x2)
output:
[0 0 0 0]
[[0 0 0 0]]
Traceback (most recent call last):
File "review.py", line 13, in <module>
x2 = np.dot(A, x1)
ValueError: shapes (4,4) and (1,4) not aligned: 4 (dim 1) != 1 (dim 0)
I would expect that either dot of a matrix and vector would return a vector, or dot of a matrix and 1xn matrix would work as expected.
Using the transpose of x doesn't fix this, nor does using A # x, or A.dot(x) or any variation of np.matmul(A, x)
Your arrays:
In [24]: A = np.matrix( [ [4, 1, 0, 0], [1, 5, 1, 0], [0, 1, 6, 1], [1, 0, 1, 4] ] )
...: x = np.array([0, 0, 0, 0])
In [25]: A.shape
Out[25]: (4, 4)
In [26]: x.shape
Out[26]: (4,)
The dot:
In [27]: np.dot(A,x)
Out[27]: matrix([[0, 0, 0, 0]]) # (1,4) shape
Let's try the same, but with a ndarray version of A:
In [30]: A.A
Out[30]:
array([[4, 1, 0, 0],
[1, 5, 1, 0],
[0, 1, 6, 1],
[1, 0, 1, 4]])
In [31]: np.dot(A.A, x)
Out[31]: array([0, 0, 0, 0])
The result is (4,) shape. That makes sense: (4,4) dot (4,) => (4,)
np.dot(A,x) is doing the same calculation, but returning a np.matrix. That by definition is a 2d array, so the (4,) is expanded to (1,4).
I don't have an older version to test this on, and am not aware of any changes.
If x is a (4,1) matrix, then the result is (4,4) dot (4,1) => (4,1):
In [33]: np.matrix(x)
Out[33]: matrix([[0, 0, 0, 0]])
In [34]: np.dot(A, np.matrix(x).T)
Out[34]:
matrix([[0],
[0],
[0],
[0]])
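One way to keep the products chainable, sketched here under the assumption that plain ndarrays are acceptable in the rest of the code, is to convert the matrix once with np.asarray and stay with 1D vectors. The non-zero x is chosen only so the output is more informative than all zeros.

```python
import numpy as np

A = np.matrix([[4, 1, 0, 0], [1, 5, 1, 0], [0, 1, 6, 1], [1, 0, 1, 4]])
x = np.array([1, 0, 0, 0])  # illustrative non-zero vector

A_arr = np.asarray(A)       # plain ndarray holding the same data as the matrix
x1 = A_arr @ x              # (4,4) @ (4,) gives shape (4,)
x2 = A_arr @ x1             # the result is 1D again, so the product can be chained
print(x1.shape, x2.shape)
print(x2)
```

Because every intermediate result stays 1D, the shape mismatch from the traceback does not arise.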