How I print specific values from a multidimensional array with numpy? - numpy

I have a multidimensional np.array like: [[2, 55, 62], [3, 56,63], [4, 57, 64], ...].
I'm pretending to print only the values greater than 2 at the firt column, returnig a print like: [[3, 56,63], [4, 57, 64], ...]
How can I get it?

All you need to do is to select just the values you want to print.
Short answer:
import numpy as np
a = np.array([[1,2,3],[3,2,1]])
print(a[a>2])
What's going on?
Well, first, a>2 return a boolean mask telling if condition is met for each position of the array. This is a numpy array with exactly the same shape than a, but with dtype=bool.
Then, this mask is used to select only values where the mask's value is True, which are also those hat meet your condition.
Finally, you just print them.
Step by step, you can write as follows:
import numpy as np
a = np.array([[1,2,3],[3,2,1]])
print(a.shape) # output is (2, 3)
mask = a > 2
print(mask.shape) # output is (2, 3)
print(mask.dtype) # output is book
print(mask) # here you can see True only for those positions where condition is met
print(a[mask])

Related

Assign numpy matrix to pandas columns

I have dataframe with 48870 rows and calculated embeddings with shape (48870, 768)
I wanna assign this embeddings to padnas column
When i try
test['original_text_embeddings'] = embeddings
I have an error: Wrong number of items passed 768, placement implies 1
I know if a make something like df.loc['original_text_embeddings'] = embeddings[0] will work but i need to automate this process
A dataframe/column needs a 1d list/array:
In [84]: x = np.arange(12).reshape(3,4)
In [85]: pd.Series(x)
...
ValueError: Data must be 1-dimensional
Splitting the array into a list (of arrays):
In [86]: pd.Series(list(x))
Out[86]:
0 [0, 1, 2, 3]
1 [4, 5, 6, 7]
2 [8, 9, 10, 11]
dtype: object
In [87]: _.to_numpy()
Out[87]:
array([array([0, 1, 2, 3]), array([4, 5, 6, 7]), array([ 8, 9, 10, 11])],
dtype=object)
Your embeddings have 768 columns, which would translate to equally 768 columns in a data frame. You are trying to assign all columns from the embeddings to just one column in the data frame, which is not possible.
What you could do is generating a new data frame from the embeddings and concatenate the test df with the embedding df
embedding_df = pd.DataFrame(embeddings)
test = pd.concat([test, embedding_df], axis=1)
Have a look at the documentation for handling indexes and concatenating on different axis:
https://pandas.pydata.org/docs/reference/api/pandas.concat.html

How to return one NumPy array per partition in Dask?

I need to compute many NumPy arrays (that can be up to 4-dimensional), one for each partition of a Dask dataframe, and then add them as arrays. However, I'm struggling to make map_partitions return an array for each partition instead of a single array for all of them.
import dask.dataframe as dd
import numpy as np, pandas as pd
df = pd.DataFrame(range(15), columns=['x'])
ddf = dd.from_pandas(df, npartitions=3)
def func(partition):
# Here I also tried returning the array in a list and in a tuple
return np.array([[1, 2], [3, 4]])
# Here I tried all the options available for 'meta'
results = ddf.map_partitions(func).compute()
Then results is:
array([[1, 2],
[3, 4],
[1, 2],
[3, 4],
[1, 2],
[3, 4]])
And if, instead, I do results.sum().compute() I get 30.
What I'd like to get is:
[np.array([[1, 2],[3, 4]]), np.array([[1, 2],[3, 4]]), np.array([[1, 2],[3, 4]])]
So that if I compute the sum, I get:
array([[ 3, 6],
[ 9, 12]])
How can you achieve this result with Dask?
I managed to make it work like this, but I don't know if this is the best way:
from dask import delayed
results = []
for partition in ddf.partitions:
result = delayed(func)(partition)
results.append(result)
delayed(sum)(results).compute()
The result of the computation is:
array([[ 3, 6],
[ 9, 12]])
You are right, a dask-array is usually to be viewed as a single logical array, which just happens to be made of pieces. Single you are not using the logical layer, you could have done your work with delayed alone. On the other hand, it seems like the end result you want really is a sum over all the data, so maybe even simpler would be an appropriate reshape and sum(axis=)?
ddf.map_partitions(func).compute_chunk_sizes().reshape(
-1, 2, 2).sum(axis=0).compute()
(compute_chunk_sizes is needed because although your original pandas dataframe had a known size, Dask did not evaluate your function yet to know what sizes it gave back)
However, given your setup, the following would work and be more similar to your original attempt, see .to_delayed()
list_of_delayed = ddf.map_partitions(func).to_delayed().tolist()
tuple_of_np_lists = dask.compute(*list_of_delayed)
(tolist forces evaluating the contained delayed objects)

Find maximum numbers from n-Numpy array

Is there a way to find out the highest individual numbers give a list of Numpy arrays?
e.g.
import numpy as np
a = np.array([10, 2])
b = np.array([3, 4])
c = np.array([5, 6])
The output will be a Numpy array with the following form:
np.array([10, 6])
You can stack all arrays together and get the max afterwards:
np.stack((a, b, c)).max(0)
# array([10, 6])

Slicing a tensor by an index tensor in Tensorflow

I have two following tensors (note that they are both Tensorflow tensors which means they are still virtually symbolic at the time I construct the following slicing op before I launch a tf.Session()):
params: has shape (64,784, 256)
indices: has shape (64, 784)
and I want to construct an op that returns the following tensor:
output: has shape (64,784) where
output[i,j] = params_tensor[i,j, indices[i,j] ]
What is the most efficient way in Tensorflow to do so?
ps: I tried with tf.gather but couldn't make use of it to perform the operation I described above.
Many thanks.
-Bests
You can get exactly what you want using tf.gather_nd. The final expression is:
tf.gather_nd(params, tf.stack([tf.tile(tf.expand_dims(tf.range(tf.shape(indices)[0]), 1), [1, tf.shape(indices)[1]]), tf.transpose(tf.tile(tf.expand_dims(tf.range(tf.shape(indices)[1]), 1), [1, tf.shape(indices)[0]])), indices], 2))
This expression has the following explanation:
tf.gather_nd does what you expected and uses the indices to gather the output from the params
tf.stack combines three separate tensors, the last of which is the indices. The first two tensors specify the ordering of the first two dimensions (axis 0 and axis 1 of params/indices)
For the example provided, this ordering is simply 0, 1, 2, ..., 63 for axis 0, and 0, 1, 2, ... 783 for axis 1. These sequences are obtained with tf.range(tf.shape(indices)[0]) and tf.range(tf.shape(indices)[1]), respectively.
For the example provided, indices has shape (64, 784). The other two tensors from the last point above need to have this same shape in order to be combined with tf.stack
First, an additional dimension/axis is added to each of the two sequences using tf.expand_dims.
The use of tf.tile and tf.transpose can be shown by example: Assume the first two axes of params and index have shape (5,3). We want the first tensor to be:
[[0, 0, 0], [1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4]]
We want the second tensor to be:
[[0, 1, 2], [0, 1, 2], [0, 1, 2], [0, 1, 2], [0, 1, 2]]
These two tensors almost function like specifying the coordinates in a grid for the associated indices.
The final part of tf.stack combines the three tensors on a new third axis, so that the result has the same 3 axes as params.
Keep in mind if you have more or less axes than in the question, you need to modify the number of coordinate-specifying tensors in tf.stack accordingly.
What you want is like a custom reduction function. If you are keeping something like index of maximum value at indices then I would suggest using tf.reduce_max:
max_params = tf.reduce_max(params_tensor, reduction_indices=[2])
Otherwise, here is one way to get what you want (Tensor objects are not assignable so we create a 2d list of tensors and pack it using tf.pack):
import tensorflow as tf
import numpy as np
with tf.Graph().as_default():
params_tensor = tf.pack(np.random.randint(1,256, [5,5,10]).astype(np.int32))
indices = tf.pack(np.random.randint(1,10,[5,5]).astype(np.int32))
output = [ [None for j in range(params_tensor.get_shape()[1])] for i in range(params_tensor.get_shape()[0])]
for i in range(params_tensor.get_shape()[0]):
for j in range(params_tensor.get_shape()[1]):
output[i][j] = params_tensor[i,j,indices[i,j]]
output = tf.pack(output)
with tf.Session() as sess:
params_tensor,indices,output = sess.run([params_tensor,indices,output])
print params_tensor
print indices
print output
I know I'm late, but I recently had to do something similar, and was able to to do it using Ragged Tensors:
output = tf.gather(params, tf.RaggedTensor.from_tensor(indices), batch_dims=-1, axis=-1)
Hope it helps

How do I swap tensor's axes in TensorFlow?

I have a tensor of shape (30, 116, 10), and I want to swap the first two dimensions, so that I have a tensor of shape (116, 30, 10)
I saw that numpy as such a function implemented (np.swapaxes) and I searched for something similar in tensorflow but I found nothing.
Do you have any idea?
tf.transpose provides the same functionality as np.swapaxes, although in a more generalized form. In your case, you can do tf.transpose(orig_tensor, [1, 0, 2]) which would be equivalent to np.swapaxes(orig_np_array, 0, 1).
It is possible to use tf.einsum to swap axes if the number of input dimensions is unknown. For example:
tf.einsum("ij...->ji...", input) will swap the first two dimensions of input;
tf.einsum("...ij->...ji", input) will swap the last two dimensions;
tf.einsum("aij...->aji...", input) will swap the second and the third
dimension;
tf.einsum("ijk...->kij...", input) will permute the first three dimensions;
and so on.
You can transpose just the last two axes with tf.linalg.matrix_transpose, or more generally, you can swap any number of trailing axes by working out what the leading indices are dynamically, and using relative indices for the axes you want to transpose
x = tf.ones([5, 3, 7, 11])
trailing_axes = [-1, -2]
leading = tf.range(tf.rank(x) - len(trailing_axes)) # [0, 1]
trailing = trailing_axes + tf.rank(x) # [3, 2]
new_order = tf.concat([leading, trailing], axis=0) # [0, 1, 3, 2]
res = tf.transpose(x, new_order)
res.shape # [5, 3, 11, 7]