Combining two `numpy` arrays by using one as index for columns

If I have two numpy arrays
arr1
Out [7]: array([1, 0, 1, ..., 1, 0, 0])
and
arr2
Out [6]:
array([[0.10420547, 0.8957946 ],
       [0.6609819 , 0.3390181 ],
       [0.16680466, 0.8331954 ],
       ...,
       [0.27138624, 0.7286138 ],
       [0.6883444 , 0.31165552],
       [0.70164204, 0.298358  ]], dtype=float32)
what is the quickest way to build a new array arr3 such that, for each row of arr2, arr1 picks out the column I want? I would like to return something like:
arr3
array([0.8957946, 0.6609819, 0.8331954, ... ])
I would do it by filling a new empty array and iterating but I can't think of a quicker way right now.
EDIT:
OK, one way I found is the following, but it's probably not optimal (?):
arr3 = np.array([arr2[i][arr1[i]] for i in range(len(arr2))])
returns
arr3
Out [23]:
array([0.8957946 , 0.6609819 , 0.8331954 , ..., 0.7286138 , 0.6883444 ,
       0.70164204], dtype=float32)

You can do it like this:
np.take_along_axis(arr2, arr1[:, None], 1).squeeze()
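An equivalent approach uses plain integer (fancy) indexing, pairing each row number with the column chosen by arr1. A minimal sketch with small stand-in arrays shaped like the question's:

import numpy as np

arr1 = np.array([1, 0, 1])
arr2 = np.array([[0.10420547, 0.8957946 ],
                 [0.6609819 , 0.3390181 ],
                 [0.16680466, 0.8331954 ]], dtype=np.float32)

# For row i, take the element in column arr1[i]
arr3 = arr2[np.arange(len(arr2)), arr1]
print(arr3)  # [0.8957946 0.6609819 0.8331954]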

Related

Compare numpy arrays of different shapes

I have two numpy arrays of shapes (4,4) and (9,4)
matrix1 = array([[ 72.        ,  72.        ,  72.        ,  72.        ],
                 [ 72.00396729,  72.00396729,  72.00396729,  72.00396729],
                 [596.29998779, 596.29998779, 596.29998779, 596.29998779],
                 [708.83398438, 708.83398438, 708.83398438, 708.83398438]])
matrix2 = array([[ 72.02400208,  77.68997192, 115.6057663 , 105.64997101],
                 [120.98195648,  77.68997192, 247.19802856, 105.64997101],
                 [252.6330719 ,  77.68997192, 337.25634766, 105.64997101],
                 [342.63256836,  77.68997192, 365.60125732, 105.64997101],
                 [ 72.02400208, 113.53997803, 189.65515137, 149.53997803],
                 [196.87202454, 113.53997803, 308.13119507, 149.53997803],
                 [315.3480835 , 113.53997803, 405.77023315, 149.53997803],
                 [412.86999512, 113.53997803, 482.0453186 , 149.53997803],
                 [ 72.02400208, 155.81002808, 108.98254395, 183.77003479]])
I need to compare all the rows of matrix2 with every row of matrix1. How can this be done without looping over the rows of matrix1?
If it is about element-wise comparison of the rows, then check this example:
import numpy as np

# Generate sample arrays
a = np.random.randint(0, 5, size=(4, 3))
b = np.random.randint(-1, 6, size=(5, 3))
# Compare
a == b[:, None]
The last line does the comparison for you. The output array will have shape (num_of_b_rows, num_of_a_rows, common_num_of_cols): in this case, (5, 4, 3).
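If what you actually need is whole-row equality (one boolean per row pair rather than per element), reduce the broadcast comparison over the last axis. A small sketch building on the same idea:

import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])
b = np.array([[4, 5, 6],
              [7, 8, 9],
              [1, 2, 3]])

# Shape (num_of_b_rows, num_of_a_rows): True where a full row of b equals a row of a
row_matches = (a == b[:, None]).all(axis=2)
print(row_matches)
# [[False  True]
#  [False False]
#  [ True False]]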

Unexpected behavior when trying to normalize a column in numpy.array (version 1.17.4)

So, I was trying to normalize (i.e. divide each value by the column's max, so the max becomes 1) a specific column within a numpy array.
I hoped this piece of code would do the trick:
bar = np.arange(12).reshape(6,2)
bar
array([[ 0,  1],
       [ 2,  3],
       [ 4,  5],
       [ 6,  7],
       [ 8,  9],
       [10, 11]])
bar[:,1] = bar[:,1] / bar[:,1].max()
bar
array([[ 0,  0],
       [ 2,  0],
       [ 4,  0],
       [ 6,  0],
       [ 8,  0],
       [10,  1]])
It works as expected if the type of each value is float:
foo = np.array([[1.1, 2.2],
                [3.3, 4.4],
                [5.5, 6.6]])
foo[:,1] = foo[:,1] / foo[:,1].max()
foo
array([[1.1       , 0.33333333],
       [3.3       , 0.66666667],
       [5.5       , 1.        ]])
I guess what I'm asking is: where is the default 'int' I'm missing here?
(I'm taking this as a 'learning opportunity')
If you simply execute:
out = bar[:,1] / bar[:,1].max()
print(out)
>>> [0.09090909 0.27272727 0.45454545 0.63636364 0.81818182 1. ]
It works just fine, since out is a newly created float array made to store these float values. But np.arange(12) gives you an int array by default. bar[:,1] = bar[:,1] / bar[:,1].max() tries to store the float values inside the integer array, so they are truncated back to integers and you get [0 0 0 0 0 1].
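You can see this casting-on-assignment in isolation with a tiny sketch (hypothetical data, not from the question):

import numpy as np

a = np.zeros(3, dtype=int)
a[0] = 0.9   # a float assigned into an int array
print(a)     # [0 0 0] -- the 0.9 was truncated to 0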
To set the array as a float by default:
bar = np.arange(12, dtype='float').reshape(6,2)
Alternatively, you can also use:
bar = np.arange(12).reshape(6,2).astype('float')
It isn't uncommon for us to need to change the data type of the array throughout the program, as you may not always need the dtype you define originally. So .astype() is actually pretty handy in all kinds of scenarios.
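Putting it together, a minimal sketch of the fixed normalization (the question's data, just with a float dtype up front):

import numpy as np

bar = np.arange(12, dtype='float').reshape(6, 2)
bar[:, 1] = bar[:, 1] / bar[:, 1].max()  # floats stored without truncation
print(bar[:, 1])
# [0.09090909 0.27272727 0.45454545 0.63636364 0.81818182 1.        ]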
From the np.arange documentation:
dtype : dtype
The type of the output array. If dtype is not given, infer the data type from the other input arguments.
Since you passed int values, it infers that the values in the array are int, so they won't change to float on assignment. You can do this if you want a float array from the start:
bar = np.arange(12.0).reshape(6,2)

Get column-wise maximums from a NumPy array

I have a 2D array, say
x = np.random.rand(10, 3)
array([[ 0.51158246,  0.51214272,  0.1107923 ],
       [ 0.5210391 ,  0.85308284,  0.63227215],
       [ 0.57239625,  0.06276943,  0.1069803 ],
       [ 0.71627613,  0.66454443,  0.56771438],
       [ 0.24595493,  0.01007568,  0.84959605],
       [ 0.99158904,  0.25034553,  0.00144037],
       [ 0.43292656,  0.9247424 ,  0.5123086 ],
       [ 0.07224077,  0.57230282,  0.88522979],
       [ 0.55665913,  0.20119776,  0.58865823],
       [ 0.55129624,  0.26226446,  0.63070611]])
Then I find the indexes of maximum elements along the columns:
indexes = np.argmax(x, axis=0)
array([5, 6, 7])
So far so good.
But how do I actually get those elements? That is, how do I get some_operation(x, indexes) == [0.99158904, 0.9247424, 0.88522979]?
Note that I need both the indexes and the associated values.
The best I could come up with was x[indexes, range(x.shape[1])], but it looks kinda complicated and inefficient. Is there a more idiomatic way?
You can use np.amax to find the max values along an axis.
Using your example (x is the original array in your post):
In[1]: np.argmax(x, axis=0)
Out[1]:
array([5, 6, 7], dtype=int64)
In[2]: np.amax(x, axis=0)
Out[2]:
array([ 0.99158904, 0.9247424 , 0.88522979])
See the np.amax documentation for details.
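Note that np.amax alone gives only the values. If you need the indexes and the values together (as the question asks), the idiomatic pairing is argmax plus integer indexing, or np.take_along_axis on NumPy >= 1.15. A short sketch:

import numpy as np

x = np.random.rand(10, 3)
indexes = np.argmax(x, axis=0)

# Pair each column's argmax row with its column number
values = x[indexes, np.arange(x.shape[1])]

# Equivalent via take_along_axis
values2 = np.take_along_axis(x, indexes[None, :], axis=0).squeeze()

assert np.array_equal(values, x.max(axis=0))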

one-hot encoding and existing data

I have a numpy array (N,M) where some of the columns should be one-hot encoded. Please help me one-hot encode them using numpy and/or tensorflow.
Example:
[
[ 0.993, 0, 0.88 ]
[ 0.234, 1, 1.00 ]
[ 0.235, 2, 1.01 ]
.....
]
The 2nd column here (with values 0, 1 and 2) should be one-hot encoded; I know that there are only 3 distinct values (0, 1, 2).
The resulting array should look like:
[
[ 0.993, 0.88, 0, 0, 0 ]
[ 0.234, 1.00, 0, 1, 0 ]
[ 0.235, 1.01, 1, 0, 0 ]
.....
]
That way I would be able to feed this array into tensorflow.
Please notice that the 2nd column was removed and its one-hot version was appended to the end of each sub-array.
Any help would be highly appreciated.
Thanks in advance.
Update:
Here is what I have right now:
Well, not exactly...
1. I have more than 3 columns in the array, but I still want to encode only the 2nd.
2. The array is structured, i.e. its shape is (N,).
Here is what I have:
def one_hot(value, max_value):
    value = int(value)
    a = np.zeros(max_value, 'uint8')
    if value != 0:
        a[value] = 1
    return a

# data is a structured array with shape (N,)
# it has strings, ints, floats inside...
# it was read with np.genfromtxt(dtype=None)
unique_values = dict()
unique_values['categorical1'] = 1
unique_values['categorical2'] = 2
for row in data:
    row[col] = unique_values[row[col]]

codes = np.zeros((data.shape[0], len(unique_values)))
idx = 0
for row in data:
    codes[idx] = one_hot(row[col], len(unique_values))  # could be optimised by not creating a new array every time
    idx += 1

data = np.c_[data[:, [range(0, col), range(col + 1, 32)]], codes[data[:, col].astype(int)]]
Also trying to concatenate via:
print data.shape # shape (5000,)
print codes.shape # shape (5000,3)
data = np.concatenate((data, codes), axis=1)
Here's one approach -
In [384]: a  # input array
Out[384]:
array([[ 0.993,  0.   ,  0.88 ],
       [ 0.234,  1.   ,  1.   ],
       [ 0.235,  2.   ,  1.01 ]])
In [385]: codes = np.array([[0,0,0],[0,1,0],[1,0,0]]) # define codes here
In [387]: codes
Out[387]:
array([[0, 0, 0],   # encoding for 0
       [0, 1, 0],   # encoding for 1
       [1, 0, 0]])  # encoding for 2
# Slice out the second column and append one-hot encoded array
In [386]: np.c_[a[:, [0, 2]], codes[a[:, 1].astype(int)]]
Out[386]:
array([[ 0.993,  0.88 ,  0.   ,  0.   ,  0.   ],
       [ 0.234,  1.   ,  0.   ,  1.   ,  0.   ],
       [ 0.235,  1.01 ,  1.   ,  0.   ,  0.   ]])
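If you want a conventional one-hot table (a 1 in position k for value k) instead of hand-written codes, np.eye can generate it. A minimal sketch, assuming the categories are exactly the integers 0..n-1:

import numpy as np

a = np.array([[0.993, 0., 0.88],
              [0.234, 1., 1.00],
              [0.235, 2., 1.01]])

n_categories = 3
codes = np.eye(n_categories, dtype=int)   # row k is the one-hot vector for value k
out = np.c_[a[:, [0, 2]], codes[a[:, 1].astype(int)]]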

How to efficiently prepare matrices (2-d array) for multiple arguments?

If you want to evaluate a 1-d array-valued function for multiple arguments efficiently, i.e. without a for-loop, you can do this:
x = array([1, 2, 3])

def gen_1d_arr(x):
    arr = array([2 + x, 2 - x])
    return arr

gen_1d_arr(x).T
and you get:
array([[ 3,  1],
       [ 4,  0],
       [ 5, -1]])
Okay, but how do you do this for 2-d array like below:
def gen_2d_arr(x):
    arr = array([[2 + x, 2 - x],
                 [2 * x, 2 / x]])
    return arr
and obtain this?:
array([[[ 3.        ,  1.        ],
        [ 2.        ,  2.        ]],

       [[ 4.        ,  0.        ],
        [ 4.        ,  1.        ]],

       [[ 5.        , -1.        ],
        [ 6.        ,  0.66666667]]])
Also, is this generally possible for n-d arrays?
Look at what you get with your function
In [274]: arr = np.array([[2 + x, 2 - x],
     ...:                 [2 * x, 2 / x]])
In [275]: arr
Out[275]:
array([[[ 3.        ,  4.        ,  5.        ],
        [ 1.        ,  0.        , -1.        ]],

       [[ 2.        ,  4.        ,  6.        ],
        [ 2.        ,  1.        ,  0.66666667]]])
In [276]: arr.shape
Out[276]: (2, 2, 3)
The 3 comes from x. The middle 2 comes from the [2 + x, 2 - x] pairs, and the first 2 from the outer list.
Looks like what you want is a (3,2,2) array. One option is to apply a transpose or axis swap to arr.
arr.transpose([2,0,1])
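A quick check that the transpose produces the desired layout (a small sketch reusing the question's x):

import numpy as np

x = np.array([1, 2, 3])
arr = np.array([[2 + x, 2 - x],
                [2 * x, 2 / x]])   # shape (2, 2, 3)

out = arr.transpose([2, 0, 1])     # shape (3, 2, 2)
# out[0] is [[3., 1.], [2., 2.]], the 2x2 result for x = 1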
The basic operation of np.array([arr1, arr2]) is to construct a new array with a new dimension in front, i.e. with shape (2,) + arr1.shape.
There are other operations that combine arrays. np.concatenate and its variants hstack, vstack, dstack, and column_stack join arrays; .reshape(), [None, ...], the atleast_* functions, etc. add dimensions. Look at the code of the stack functions to get some ideas on how to combine arrays using these tools.
On the question of efficiency, my time tests show that concatenate operations are generally faster than np.array. Often np.array converts its inputs to lists and reparses the values. This gives it more power in coercing arrays to specific dtypes, but at the expense of time. But I'd only worry about this with large arrays where construction time is significant.
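For instance, the same (3, 2, 2) result can be assembled with stacking functions rather than np.array (a sketch under the same assumptions as above; np.stack is a thin wrapper over concatenate):

import numpy as np

x = np.array([1, 2, 3])

# Stack along trailing axes so x's length-3 axis ends up in front
out = np.stack([np.stack([2 + x, 2 - x], axis=-1),
                np.stack([2 * x, 2 / x], axis=-1)], axis=-2)
# out has shape (3, 2, 2), matching arr.transpose([2, 0, 1])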