numpy - add two arrays to get matrix

How do I get a matrix of all pairwise sums of the elements of two arrays?
With the input
x1 = np.array([0, 1])
x2 = np.array([1,2,3])
I want the output to look like this:
[[1, 2, 3], [2, 3, 4]]

You can use NumPy's newaxis attribute:
x1[:, np.newaxis] + x2
which is an alias for None:
In [2]: np.newaxis is None
Out[2]: True
Thus:
x1[:, None] + x2
also works.
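For completeness, a minimal runnable sketch of the same idea; np.add.outer is just an equivalent spelling of the broadcasted sum:
import numpy as np

x1 = np.array([0, 1])
x2 = np.array([1, 2, 3])

print(x1[:, np.newaxis] + x2)   # [[1 2 3]
                                #  [2 3 4]]
print(np.add.outer(x1, x2))     # same result via the ufunc's outer method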

You can use a list comprehension, as in this example:
x1 = np.array([0, 1])
x2 = np.array([1,2,3])
final = [[j+k for j in x2] for k in x1]
# Or, maybe:
# final = np.array([[j+k for j in x2] for k in x1])
# >>> array([[1, 2, 3], [2, 3, 4]])
print(final)
Output:
[[1, 2, 3], [2, 3, 4]]

Related

Element-wise Euclidean distances between 2 numpy arrays

I have two Numpy arrays, each having n rows:
a = [[X1a, Y1a], [X2a, Y2a], .. , [Xna, Yna]]
b = [[X1b, Y1b], [X2b, Y2b], .. , [Xnb, Ynb]]
How can I get a new table with the Euclidean distance of each corresponding row?
c = [dis(1a, 1b), dis(2a, 2b), .. , dis(na, nb)]
or maybe
c = [[dis(1a, 1b)], [dis(2a, 2b)], .. , [dis(na, nb)]]
import math

c = []
for i in range(a.shape[0]):
    c.append(math.sqrt((a[i][0] - b[i][0])**2 + (a[i][1] - b[i][1])**2))
This will work.
a.shape[0] will give the value of n
For inputs
a = np.array([[1, 2], [3, 4], [5, 6]])
b = np.array([[2, 1], [1, 2], [3, 4]])
You will get
c = [1.4142135623730951, 2.8284271247461903, 2.8284271247461903]
I find vectorizing this is more Pythonic and faster:
a = np.array(a)
b = np.array(b)
np.sqrt(np.sum((a-b)**2,axis=1))
There are plenty of examples of using SciPy's cdist or pdist, or just NumPy's einsum, to calculate distances. They scale to multiple dimensions as well.
from scipy.spatial.distance import cdist
a = np.array([[1., 2], [3, 4], [5, 6]])
b = np.array([[2, 1], [1, 2], [3, 4]])
cdist(a, b)
Out[14]:
array([[ 1.414,  0.000,  2.828],
       [ 3.162,  2.828,  0.000],
       [ 5.831,  5.657,  2.828]])
or
a = np.array([[1., 2], [3, 4], [5, 6]])
b = np.array([[2, 1], [1, 2], [3, 4]])
b = b[:, np.newaxis]
diff = a - b
np.sqrt(np.einsum('ijk,ijk->ij', diff, diff))
array([[ 1.414,  3.162,  5.831],
       [ 0.000,  2.828,  5.657],
       [ 2.828,  0.000,  2.828]])
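Both snippets above return the full pairwise distance matrix. If you only want the row-by-row distances asked about in the question, one option (a small sketch, though computing the whole matrix is wasteful for large inputs) is to take the diagonal:
np.diag(cdist(a, b))   # array([1.414..., 2.828..., 2.828...])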
The Euclidean distance is also known as the 2-norm. numpy.linalg.norm will calculate this efficiently across your vectors:
import numpy.linalg as la
a = np.array([[1, 2], [3, 4], [5, 6]])
b = np.array([[2, 1], [1, 2], [3, 4]])
c = la.norm(a - b, axis=1)
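As a quick sanity check (a minimal sketch with the sample inputs above), all three approaches agree:
import numpy as np
import numpy.linalg as la

a = np.array([[1, 2], [3, 4], [5, 6]])
b = np.array([[2, 1], [1, 2], [3, 4]])

loop = [np.sqrt((p[0] - q[0])**2 + (p[1] - q[1])**2) for p, q in zip(a, b)]
vectorized = np.sqrt(np.sum((a - b)**2, axis=1))
norm = la.norm(a - b, axis=1)

print(np.allclose(loop, vectorized) and np.allclose(vectorized, norm))  # True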

tensorflow count number of 0 at index 2 in a flatten-batch

Sorry for the inaccurate title. Here is a detailed description of the problem: assume a tensor of shape (?, 2), e.g., a tensor T = [[0,1], [0,2], [0,0], [1, 4], [1, 3], [2,0], [2,0], [2,0], [2,0]]. How do I count, across the distinct values of T[:, 0], how many of them have a zero in the second column? For the example above, because there are rows [0,0] and [2,0], the answer is 2.
More examples:
[[0,1], [0,2], [0,1], [1, 4], [1, 3], [2,0], [2,0], [2,0],[2,0]] (Answer: 1, because of [2,0])
[[0,1], [0,2], [0,1], [1, 4], [1, 3], [2,0], [2,0], [2,0],[2,0],[3,0]] (Answer: 2, because of [2, 0] and [3,0])
If I understand what you are looking for, the question is how many unique "[X, 0]" pairs you have in the data. If so, this should do it:
import tensorflow as tf

x = tf.placeholder(shape=(None, 2), dtype=tf.int32)
indices = tf.where(tf.equal(x[:, 1], tf.constant(0, dtype=tf.int32)))
unique_values, _ = tf.unique(tf.squeeze(tf.gather(x[:, 0], indices)))
no_unique_values = tf.shape(unique_values, out_type=tf.int32)

data = [ .... ]
with tf.Session() as sess:
    no_unique = sess.run(fetches=[no_unique_values], feed_dict={x: data})
Here is a solution I came up with myself:
def get_unique(ts):
    # Select the rows whose second column is 0
    ts_part = ts[:, 1]
    where = tf.where(tf.equal(0, ts_part))
    gather_nd = tf.gather_nd(ts, where)
    # The second column of the gathered rows is 0, so this sum is just the first column
    gather_plus = gather_nd[:, 0] + gather_nd[:, 1]
    # Count the distinct first-column values among those rows
    unique_values, _ = tf.unique(gather_plus)
    return tf.shape(unique_values)[0]
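If you are on TensorFlow 2.x, a minimal eager-mode sketch of the same idea (assuming the interpretation above, i.e. counting distinct first-column values whose second column is 0; the function name is just illustrative):
import tensorflow as tf

def count_firsts_with_zero(t):
    # Keep the first-column values of rows whose second column is 0,
    # then count how many distinct values remain.
    firsts = tf.boolean_mask(t[:, 0], tf.equal(t[:, 1], 0))
    unique_values, _ = tf.unique(firsts)
    return tf.shape(unique_values)[0]

t = tf.constant([[0, 1], [0, 2], [0, 0], [1, 4], [1, 3],
                 [2, 0], [2, 0], [2, 0], [2, 0]])
print(count_firsts_with_zero(t).numpy())  # 2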

Broadcasting between two same-rank tensors in tensorflow

I have two tensors x and s with shapes:
> x.shape
TensorShape([Dimension(None), Dimension(3), Dimension(5), Dimension(5)])
> s.shape
TensorShape([Dimension(None), Dimension(12), Dimension(5), Dimension(5)])
I want to broadcast the dot product between x and s through the dimension 1 as follows:
> x_s.shape
TensorShape([Dimension(None), Dimension(4), Dimension(5), Dimension(5)])
where
x_s[i, 0, k, l] = sum([x[i, j, k, l] * s[i, j, k, l] for j in range (3)])
x_s[i, 1, k, l] = sum([x[i, j-3, k, l] * s[i, j, k, l] for j in range (3, 6)])
x_s[i, 2, k, l] = sum([x[i, j-6, k, l] * s[i, j, k, l] for j in range (6, 9)])
x_s[i, 3, k, l] = sum([x[i, j-9, k, l] * s[i, j, k, l] for j in range (9, 12)])
I have this implementation:
s_t = tf.transpose(s, [0, 2, 3, 1]) # [None, 5, 5, 12]
x_t = tf.transpose(x, [0, 2, 3, 1]) # [None, 5, 5, 3]
x_t = tf.tile(x_t, [1, 1, 1, 4]) # [None, 5, 5, 12]
x_s = x_t * s_t # [None, 5, 5, 12]
x_s = tf.reshape(x_s, [tf.shape(x_s)[0], 5, 5, 4, 3]) # [None, 5, 5, 4, 3]
x_s = tf.reduce_sum(x_s, axis=-1) # [None, 5, 5, 4]
x_s = tf.transpose(x_s, [0, 3, 1, 2]) # [None, 4, 5, 5]
I understand this is not memory-efficient because of the tile. Also, the reshape, transpose, element-wise multiply, and reduce_sum operations can hurt performance for larger tensors. Is there any cleaner alternative?
Do you have any evidence that reshapes are expensive? The following uses a reshape and dimension broadcasting:
x_s = tf.reduce_sum(tf.reshape(s, (-1, 4, 3, 5, 5)) *
                    tf.expand_dims(x, axis=1), axis=2)
Just a suggestion, and maybe not faster than yours: first split s with tf.split into four tensors, then use tf.tensordot to get the final result, like this:
splits = tf.split(s, [3] * 4, axis=1)
splits = map(lambda split: tf.tensordot(split, x, axes=[[1], [1]]), splits)
x_s = tf.stack(splits, axis=1)
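Another option, a minimal sketch assuming the shapes stated in the question ([None, 3, 5, 5] and [None, 12, 5, 5]), is to reshape s once and let tf.einsum do the grouped contraction, avoiding the tile entirely:
s_r = tf.reshape(s, (-1, 4, 3, 5, 5))        # group the 12 channels into 4 blocks of 3
x_s = tf.einsum('igjkl,ijkl->igkl', s_r, x)  # sum over j, keep the group axis g -> [None, 4, 5, 5]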

Split Xy matrix into X and y

If I have a matrix Xy that I want to split into a matrix X and an array y, I usually do this
X, y = Xy[:, :-1], Xy[:, -1]
Is there a better way to do this using scikit-learn or numpy? I feel like it's a very common operation.
You can use NumPy's built-in np.split:
X, y = np.split(Xy, [-1], axis=1)  # Or simply: np.split(Xy, [-1], 1)
Sample run -
In [93]: Xy
Out[93]:
array([[6, 2, 0, 5, 2],
       [6, 3, 7, 0, 0],
       [3, 2, 3, 1, 3],
       [1, 3, 7, 1, 7]])
In [94]: X, y = np.split(Xy, [-1], axis=1)
In [95]: X
Out[95]:
array([[6, 2, 0, 5],
       [6, 3, 7, 0],
       [3, 2, 3, 1],
       [1, 3, 7, 1]])
In [96]: y
Out[96]:
array([[2],
       [0],
       [3],
       [7]])
Note that np.split produces y as a 2D column. To get a 1D slice, use np.squeeze(y). Also, these slices are views into the original array, so no additional memory is required:
In [104]: np.may_share_memory(Xy, X)
Out[104]: True
In [105]: np.may_share_memory(Xy, y)
Out[105]: True
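If you want y as a 1-D array while keeping the single-call split, a small follow-up sketch:
X, y = np.split(Xy, [-1], axis=1)
y = np.squeeze(y, axis=1)   # shape (n,) instead of (n, 1)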
np.split uses np.array_split. That in turn does:
sub_arys = []
sary = _nx.swapaxes(ary, axis, 0)
for i in range(Nsections):
    st = div_points[i]
    end = div_points[i + 1]
    sub_arys.append(_nx.swapaxes(sary[st:end], axis, 0))
swapaxes is needed with axis=1; or without the swapping:
sub_arys = []
for ...:
    sub_arys.append(ary[:, st:end])
return sub_arys
i.e. the same as:
In [388]: ary = np.arange(12).reshape(3, 4)
In [389]: [ary[:, 0:3], ary[:, 3:4]]
Out[389]:
[array([[ 0,  1,  2],
        [ 4,  5,  6],
        [ 8,  9, 10]]),
 array([[ 3],
        [ 7],
        [11]])]
A split like this keeps the original number of dimensions.
Wrapping your code in a function gives something that will be as fast, if not faster:
def xysplit(ary):
    return ary[:, :-1], ary[:, -1]

X, y = xysplit(ary)
produces:
array([[ 0,  1,  2],
       [ 4,  5,  6],
       [ 8,  9, 10]]),
array([ 3,  7, 11])
When I commented that this seems to be more common in sklearn contexts I had in mind questions like
Python ValueError: non-broadcastable output operand with shape (124,1) doesn't match the broadcast shape (124,13)
X = df_wine.iloc[:, 1:].values
y = df_wine.iloc[:, 0].values
....
X_train, X_test, y_train, y_test = train_test_split(X, y, ...
X and y are 2D and 1D arrays, pulled in this case from the columns of a pandas DataFrame. train_test_split is used to split X and y into training and testing groups. If there were a special X, y splitter, it would be in the sklearn package, not numpy.
Python - NumPy array_split adds a dimension
train_inputs = train[:,: -1]
train_outputs = train[:, -1]

Merge duplicate indices in a sparse tensor

Let's say I have a sparse tensor with duplicate indices; where they are duplicated, I want to merge the values (i.e. sum them up).
What is the best way to do this?
example:
indicies = [[1, 1], [1, 2], [1, 2], [1, 3]]
values = [1, 2, 3, 4]
object = tf.SparseTensor(indicies, values, dense_shape=[10, 10])
result = tf.MAGIC(object)
result should be a sparse tensor with the following values (or concrete!):
indicies = [[1, 1], [1, 2], [1, 3]]
values = [1, 5, 4]
The only thing I have thought of is to string-concatenate the indices together to create an index hash, apply it as a third dimension, and then reduce-sum over that third dimension.
indicies = [[1, 1, 11], [1, 2, 12], [1, 2, 12], [1, 3, 13]]
sparse_result = tf.sparse_reduce_sum(sparseTensor, reduction_axes=2, keep_dims=True)
But that feels very, very ugly.
Here is a solution using tf.segment_sum. The idea is to linearize the indices into a 1-D space, get the unique indices with tf.unique, run tf.segment_sum, and convert the indices back to N-D space.
indices = tf.constant([[1, 1], [1, 2], [1, 2], [1, 3]])
values = tf.constant([1, 2, 3, 4])
# Linearize the indices. If the dimensions of original array are
# [N_{k}, N_{k-1}, ... N_0], then simply matrix multiply the indices
# by [..., N_1 * N_0, N_0, 1]^T. For example, if the sparse tensor
# has dimensions [10, 6, 4, 5], then multiply by [120, 20, 5, 1]^T
# In your case, the dimensions are [10, 10], so multiply by [10, 1]^T
linearized = tf.matmul(indices, [[10], [1]])
# Get the unique indices, and their positions in the array
y, idx = tf.unique(tf.squeeze(linearized))
# Use the positions of the unique values as the segment ids to
# get the unique values
values = tf.segment_sum(values, idx)
# Go back to N-D indices
y = tf.expand_dims(y, 1)
indices = tf.concat([y//10, y%10], axis=1)
tf.InteractiveSession()
print(indices.eval())
print(values.eval())
Maybe you can try:
indicies = [[1, 1], [1, 2], [1, 2], [1, 3]]
values = [1, 2, 3, 4]
object = tf.SparseTensor(indicies, values, dense_shape=[10, 10])
tf.sparse.to_dense(object, validate_indices=False)
Using unsorted_segment_sum could be simpler:
def deduplicate(tensor):
    if not isinstance(tensor, tf.IndexedSlices):
        return tensor
    unique_indices, new_index_positions = tf.unique(tensor.indices)
    summed_values = tf.unsorted_segment_sum(tensor.values, new_index_positions,
                                            tf.shape(unique_indices)[0])
    return tf.IndexedSlices(indices=unique_indices, values=summed_values,
                            dense_shape=tensor.dense_shape)
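Note that this variant operates on tf.IndexedSlices (1-D indices, as produced e.g. by gradients of tf.gather), not directly on a 2-D-indexed SparseTensor. A small usage sketch, assuming a TF version where tf.unsorted_segment_sum is available as used above (in TF 2.x it lives at tf.math.unsorted_segment_sum):
slices = tf.IndexedSlices(values=tf.constant([[1.], [2.], [3.]]),
                          indices=tf.constant([0, 2, 2]),
                          dense_shape=tf.constant([5, 1]))
merged = deduplicate(slices)
# merged.indices -> [0, 2]; merged.values -> [[1.], [5.]]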
Another solution is to use tf.scatter_nd which will create a dense tensor and accumulate values on duplicate indices. This behavior is clearly stated in the documentation:
If indices contains duplicates, the duplicate values are accumulated (summed).
We can then convert the result back to a sparse representation.
Here is a code sample for TensorFlow 2.x in eager mode:
import tensorflow as tf
indices = [[1, 1], [1, 2], [1, 2], [1, 3]]
values = [1, 2, 3, 4]
merged_dense = tf.scatter_nd(indices, values, shape=(10, 10))
merged_sparse = tf.sparse.from_dense(merged_dense)
print(merged_sparse)
Output
SparseTensor(
indices=tf.Tensor(
[[1 1]
[1 2]
[1 3]],
shape=(3, 2),
dtype=int64),
values=tf.Tensor([1 5 4], shape=(3,), dtype=int32),
dense_shape=tf.Tensor([10 10], shape=(2,), dtype=int64))
Following the tf.segment_sum solution above, here is another example. For a dense shape of [12, 5], the lines to change in that code are:
linearized = tf.matmul(indices, [[5], [1]])
indices = tf.concat([y//5, y%5], axis=1)