I have data where each row in the batch is a fixed-size vector of "words", and each column draws from its own, mutually exclusive vocabulary. Ex:
batch:
[1, 2, 3]
[4, 5, 6]
[4, 7, 8]
global dictionary:
{
0: {1, 4}
1: {2, 5, 7}
2: {3, 6, 8}
}
In the above example, for cols 0, 1, and 2...we have vocabs of [1, 4], [2, 5, 7], and [3, 6, 8] which are mutually exclusive. I also have a giant dictionary of all vocab words for all the columns.
How do I use that dictionary of dict(column_idx) -> {vocab set} to build a one hot encoder in tensorflow?
For the above example I would want an output of:
[1, 0, 1, 0, 0, 1, 0, 0]
[0, 1, 0, 1, 0, 0, 1, 0]
[0, 1, 0, 0, 1, 0, 0, 1]
with the one hot mappings being:
[1, 4, 2, 5, 7, 3, 6, 8]
The tricky part is that each column needs to be encoded differently. If you break the problem down to encoding one column at a time then it is mostly about deconstructing, encoding and recombining the batched input. Here is a working example in numpy:
import numpy as np
batch = np.array([
[1, 2, 3],
[4, 5, 6],
[4, 7, 8]])
vocab = {
0: {1, 4},
1: {2, 5, 7},
2: {3, 6, 8}
}
# Construct a one hot encoding matrix for each vocabulary
# Here we are using identity matrix as the getter
vocab_eye = {k: np.eye(len(v)) for k, v in vocab.items()}
# Construct a converter to indices, a map from value to index if you will.
# Sets are unordered, so sort each vocabulary to get a deterministic mapping.
vocab_map = {k: np.vectorize(sorted(v).index) for k, v in vocab.items()}
# Deconstruct, encode and merge each column
encoded_cols = [vocab_eye[i][vocab_map[i](col[:, 0])]
                for i, col in enumerate(np.split(batch, batch.shape[1], axis=1))]
encoded_batch = np.concatenate(encoded_cols, axis=1)
# Outputs
# array([[1., 0., 1., 0., 0., 1., 0., 0.],
#        [0., 1., 0., 1., 0., 0., 1., 0.],
#        [0., 1., 0., 0., 1., 0., 0., 1.]])
You can perhaps optimise the encoding by feeding in data that is already split. Instead of batching everything and then splitting, give three separate inputs to the data pipeline in TensorFlow. You then have three encoding paths, one per feature set, whose outputs are merged at the end.
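As a rough illustration of the "separate inputs, then merge" idea, here is a minimal NumPy sketch. The helper name encode_columns is hypothetical, and it assumes that every value in a column is present in that column's vocabulary and that the one-hot positions follow the sorted order of each vocabulary. The same structure should carry over to a TensorFlow pipeline, but that is not shown here.

```python
import numpy as np

def encode_columns(columns, vocab):
    """One-hot encode each column against its own vocabulary, then merge.

    Assumption: every value in a column occurs in that column's vocabulary.
    """
    encoded = []
    for idx, col in enumerate(columns):
        # Sort the set so the value -> position mapping is deterministic
        words = np.array(sorted(vocab[idx]))
        # Position of each value inside the sorted vocabulary
        positions = np.searchsorted(words, col)
        # Rows of the identity matrix are the one-hot vectors
        encoded.append(np.eye(len(words))[positions])
    return np.concatenate(encoded, axis=1)

vocab = {0: {1, 4}, 1: {2, 5, 7}, 2: {3, 6, 8}}
# Three separate inputs, one per feature column
columns = [np.array([1, 4, 4]), np.array([2, 5, 7]), np.array([3, 6, 8])]
encoded_batch = encode_columns(columns, vocab)
print(encoded_batch)
```

With sorted vocabularies the merged layout corresponds to the mapping [1, 4, 2, 5, 7, 3, 6, 8] from the question.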
I have a numpy array
a = np.array([[1,0,0,1,0],
[0,1,0,0,0],
[0,0,1,0,1]])
I would like to replace every positive element of this array with its row index + 1, so the final result would be:
a = np.array([[1,0,0,1,0],
[0,2,0,0,0],
[0,0,3,0,3]])
Can I do this with a simple NumPy command (without looping)?
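Use numpy.arange to build a column of row numbers and broadcast it against a mask of the nonzero entries:
(a != 0) * np.reshape(np.arange(a.shape[0])+1, (-1, 1))
Output:
array([[1, 0, 0, 1, 0],
       [0, 2, 0, 0, 0],
       [0, 0, 3, 0, 3]])
This works on any array, but note that it marks every nonzero element, negative ones included (use a > 0 instead of a != 0 to keep only the positive ones):
a2 = np.array([[1,0,0,-1,0],
               [0,20,0,0,0],
               [0,0,-300,0,30]])
(a2 != 0) * np.reshape(np.arange(a2.shape[0])+1, (-1, 1))
Output:
array([[1, 0, 0, 1, 0],
       [0, 2, 0, 0, 0],
       [0, 0, 3, 0, 3]])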
Not sure if this is the proper numpy way, but you could use enumerate and multiply the sub-arrays by their indices:
>>> np.array([x * i for i, x in enumerate(a, start=1)])
array([[1, 0, 0, 1, 0],
[0, 2, 0, 0, 0],
[0, 0, 3, 0, 3]])
Note that this only works properly if "every positive element" is actually 1, as in your example, and if there are no negative elements. Alternatively, you can use a > 0 to first get an array with True (i.e. 1) in every place where a is > 0 and False (i.e. 0) otherwise.
>>> a = np.array([[ 1, 0, 0, 2, 0],
...               [ 0, 3, 0, 0,-8],
...               [-3, 0, 4, 0, 5]])
>>> np.array([x * i for i, x in enumerate(a > 0, start=1)])
array([[1, 0, 0, 1, 0],
[0, 2, 0, 0, 0],
[0, 0, 3, 0, 3]])
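For completeness, the same "positive elements only" result can be written without a Python-level loop. This is a small sketch using np.where with a broadcast column of row numbers; the name row_numbers is just illustrative.

```python
import numpy as np

a = np.array([[ 1, 0, 0, 2,  0],
              [ 0, 3, 0, 0, -8],
              [-3, 0, 4, 0,  5]])

# Column vector [[1], [2], [3]] that broadcasts across each row
row_numbers = np.arange(1, a.shape[0] + 1)[:, None]

# Keep the row number where the element is strictly positive, else 0
result = np.where(a > 0, row_numbers, 0)
print(result)
```

Using a > 0 rather than a != 0 leaves negative entries at 0, which matches the wording of the question.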
Given the following ndarray t -
In [26]: t.shape
Out[26]: (3, 3, 2)
In [27]: t
Out[27]:
array([[[ 0, 1],
[ 2, 3],
[ 4, 5]],
[[ 6, 7],
[ 8, 9],
[10, 11]],
[[12, 13],
[14, 15],
[16, 17]]])
the piecewise linear interpolant for the points t[:, 0, 0] can be evaluated at [0, 0.66666667, 1.33333333, 2.] as follows using numpy.interp -
In [38]: x = np.linspace(0, t.shape[0]-1, 4)
In [39]: x
Out[39]: array([0. , 0.66666667, 1.33333333, 2. ])
In [30]: xp = np.arange(t.shape[0])
In [31]: xp
Out[31]: array([0, 1, 2])
In [32]: fp = t[:,0,0]
In [33]: fp
Out[33]: array([ 0, 6, 12])
In [40]: np.interp(x, xp, fp)
Out[40]: array([ 0., 4., 8., 12.])
How can all the interpolants be efficiently calculated and returned together for all values of fp -
array([[[ 0, 1],
[ 2, 3],
[ 4, 5]],
[[ 4, 5],
[ 6, 7],
[ 8, 9]],
[[ 8, 9],
[10, 11],
[12, 13]],
[[12, 13],
[14, 15],
[16, 17]]])
As the interpolation is 1D with changing y values, it must be run for each 1D slice of t. It's probably faster to loop explicitly, but neater to loop using np.apply_along_axis:
import numpy as np
t = np.arange( 18 ).reshape(3,3,2)
x = np.linspace( 0, t.shape[0]-1, 4)
xp = np.arange(t.shape[0])
def interfunc( arr ):
    """ Function interpolates a 1d array. """
    return np.interp( x, xp, arr )
np.apply_along_axis( interfunc, 0, t ) # apply function along axis 0
""" Result
array([[[ 0., 1.],
[ 2., 3.],
[ 4., 5.]],
[[ 4., 5.],
[ 6., 7.],
[ 8., 9.]],
[[ 8., 9.],
[10., 11.],
[12., 13.]],
[[12., 13.],
[14., 15.],
[16., 17.]]]) """
With explicit loops:
result = np.zeros((4,3,2))
for c in range(t.shape[1]):
    for p in range(t.shape[2]):
        result[:,c,p] = np.interp( x, xp, t[:,c,p])
On my machine the second option runs in half the time.
Edit to use np.nditer
As the result and the parameter have different shapes I seem to have to create two np.nditer objects one for the parameter and one for the result. This is my first attempt to use nditer for anything so it could be over complicated.
def test( t ):
    ts = t.shape
    # One output slice per value in x (4 here), so size axis 0 by len(x)
    result = np.zeros((len(x),ts[1],ts[2]))
    param = np.nditer( [t], ['external_loop'], ['readonly'], order = 'F')
    with np.nditer( [result], ['external_loop'], ['writeonly'], order = 'F') as res:
        for p, r in zip( param, res ):
            r[:] = interfunc(p)
    return result
It's slightly slower than the explicit loops and less easy to follow than either of the other solutions.
As requested by @Tis Chris, here is a solution using np.nditer with the multi_index flag. I still prefer the explicit nested for loops above because they are about 10% faster.
In [29]: t = np.arange( 18 ).reshape(3,3,2)
In [30]: ax0old = np.arange(t.shape[0])
In [31]: ax0new = np.linspace(0, t.shape[0]-1, 4)
In [32]: tnew = np.zeros((len(ax0new), t.shape[1], t.shape[2]))
In [33]: it = np.nditer(t[0], flags=['multi_index'])
In [34]: for _ in it:
    ...:     tnew[:, it.multi_index[0], it.multi_index[1]] = np.interp(ax0new, ax0old, t[:, it.multi_index[0], it.multi_index[1]])
    ...:
In [35]: tnew
Out[35]:
array([[[ 0., 1.],
[ 2., 3.],
[ 4., 5.]],
[[ 4., 5.],
[ 6., 7.],
[ 8., 9.]],
[[ 8., 9.],
[10., 11.],
[12., 13.]],
[[12., 13.],
[14., 15.],
[16., 17.]]])
You could try scipy.interpolate.interp1d:
from scipy.interpolate import interp1d
import numpy as np
t = np.array([[[ 0, 1],
[ 2, 3],
[ 4, 5]],
[[ 6, 7],
[ 8, 9],
[10, 11]],
[[12, 13],
[14, 15],
[16, 17]]])
# for the first slice
f = interp1d(np.arange(t.shape[0]), t[..., 0], axis=0)
# returns a function which you call with values within range np.arange(t.shape[0])
# data used for interpolation
t[..., 0]
>>> array([[ 0, 2, 4],
[ 6, 8, 10],
[12, 14, 16]])
f(1)
>>> array([ 6., 8., 10.])
f(1.5)
>>> array([ 9., 11., 13.])
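Because the sample points xp in the interpolation question are simply 0, 1, ..., n-1, the whole array can also be interpolated in one vectorized step, without looping over slices. The following is a sketch only: interp_axis0 is a hypothetical helper, and it assumes that t has at least two entries along axis 0 and that every value in x lies within [0, n-1].

```python
import numpy as np

def interp_axis0(t, x):
    """Linearly interpolate t along axis 0 at fractional positions x.

    Assumes the known points sit at 0, 1, ..., n-1 and 0 <= x <= n-1.
    """
    x = np.asarray(x, dtype=float)
    # Index of the lower neighbour, capped so that lo + 1 stays in range
    lo = np.clip(np.floor(x).astype(int), 0, t.shape[0] - 2)
    # Fractional distance from the lower neighbour, shaped to broadcast over t
    frac = (x - lo).reshape((-1,) + (1,) * (t.ndim - 1))
    return t[lo] * (1 - frac) + t[lo + 1] * frac

t = np.arange(18).reshape(3, 3, 2)
x = np.linspace(0, t.shape[0] - 1, 4)
result = interp_axis0(t, x)
print(result.shape)
```

The output can be checked against np.apply_along_axis with np.interp for the same t and x. Whether it is faster than the explicit loops is not measured here.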
I have a list of label names which I enumerated to create a dictionary:
my_list = [b'airplane',
b'automobile',
b'bird',
b'cat',
b'deer',
b'dog',
b'frog',
b'horse',
b'ship',
b'truck']
label_dict =dict(enumerate(my_list))
{0: b'airplane',
1: b'automobile',
2: b'bird',
3: b'cat',
4: b'deer',
5: b'dog',
6: b'frog',
7: b'horse',
8: b'ship',
9: b'truck'}
Now I'm trying to cleanly map/apply the dict values to my target, which is in one-hot-encoded form.
y_test[0]
array([ 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.])
y_test[0].map(label_dict) should return:
'cat'
I was playing around with
(lambda key,value: value for y_test[0] == 1)
but couldn't come up with anything concrete.
Thank you.
Since we are working with a one-hot encoded array, argmax can be used to get the index of the single 1 in each row. Thus, using the list as input -
[my_list[i] for i in y_test.argmax(1)]
Or with np.take to have array output -
np.take(my_list,y_test.argmax(1))
To work with the dict, assuming sequential keys 0, 1, ..., we could use (the list() calls are needed on Python 3, where dict views are not sequences) -
np.take(list(label_dict.values()), y_test.argmax(1))
If the keys are not necessarily sequential but are sorted -
np.take(list(label_dict.values()), np.searchsorted(list(label_dict.keys()), y_test.argmax(1)))
Sample run -
In [79]: my_list
Out[79]:
['airplane',
'automobile',
'bird',
'cat',
'deer',
'dog',
'frog',
'horse',
'ship',
'truck']
In [80]: y_test
Out[80]:
array([[ 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
[ 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.]])
In [81]: [my_list[i] for i in y_test.argmax(1)]
Out[81]: ['cat', 'automobile', 'ship']
In [82]: np.take(my_list,y_test.argmax(1))
Out[82]:
array(['cat', 'automobile', 'ship'],
dtype='|S10')
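The sample run above appears to come from Python 2 (note the '|S10' dtype). As a hedged Python 3 sketch of the same argmax idea, going through label_dict directly and decoding the byte strings from the question into str:

```python
import numpy as np

my_list = [b'airplane', b'automobile', b'bird', b'cat', b'deer',
           b'dog', b'frog', b'horse', b'ship', b'truck']
label_dict = dict(enumerate(my_list))

y_test = np.array([[0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
                   [0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
                   [0., 0., 0., 0., 0., 0., 0., 0., 1., 0.]])

# argmax(axis=1) gives the position of the single 1 in each row;
# int() turns the NumPy integer into a plain dict key
labels = [label_dict[int(i)].decode() for i in y_test.argmax(axis=1)]
print(labels)
```

The .decode() call is only needed because the labels in the question are bytes; with plain strings it can be dropped.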
We can use a dot product to reverse one-hot encoding, if it really is ONE-hot.
Let's start by factorizing your list
import pandas as pd
f, u = pd.factorize(my_list)
Now, if you have an array that you'd like to turn back into a string
a = np.array([0, 0, 0, 1, 0, 0, 0, 0, 0, 0])
Then use dot
a.dot(u)
'cat'
Now assume
y_test = np.array([
[0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
])
Then
y_test.dot(u)
array(['cat', 'automobile', 'ship'], dtype=object)
If it isn't one-hot but instead multi-hot, you could join with commas
y_test = np.array([
[0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0, 0, 0, 1],
[0, 0, 1, 0, 0, 0, 0, 0, 1, 0]
])
[', '.join(u[y.astype(bool)]) for y in y_test]
['cat', 'automobile, truck', 'bird, ship']
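The multi-hot case can also be handled without pandas. A minimal sketch, assuming the labels are plain strings in a list called labels (an illustrative name) and using np.flatnonzero to find the active positions in each row:

```python
import numpy as np

labels = ['airplane', 'automobile', 'bird', 'cat', 'deer',
          'dog', 'frog', 'horse', 'ship', 'truck']

y_test = np.array([
    [0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 0, 1],
    [0, 0, 1, 0, 0, 0, 0, 0, 1, 0]
])

# For every row, collect the labels at the positions that are switched on
decoded = [[labels[i] for i in np.flatnonzero(row)] for row in y_test]
joined = [', '.join(names) for names in decoded]
print(joined)
```

Keeping decoded as a list of lists avoids having to split the comma-joined strings again later.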
I am trying to use tf.gather_nd to convert
R = tf.eye(3, batch_shape=[4])
to :
array([[[1., 0., 0.],
[0., 0., 1.],
[0., 1., 0.]],
[[0., 0., 1.],
[0., 1., 0.],
[1., 0., 0.]],
[[0., 1., 0.],
[0., 0., 1.],
[1., 0., 0.]],
[[1., 0., 0.],
[0., 0., 1.],
[0., 1., 0.]]], dtype=float32)
With the index:
ind = array([[0, 2, 1],
[2, 1, 0],
[1, 2, 0],
[0, 2, 1]], dtype=int32)
I found that if I convert the index matrix to something like:
ind_c = np.array([[[0, 0], [0, 2], [0, 1]],
[[1, 2], [1, 1], [1, 0]],
[[2, 1], [2, 2], [2, 0]],
[[3, 0], [3, 2], [3, 1]]])
then gather_nd will do the job. So my questions are:
Is there a better way than converting the index ind to ind_c?
If this is the only way, how can I convert ind to ind_c with TensorFlow? (I have done this manually for now.)
Thanks
You can try the following:
ind = tf.constant([[0, 2, 1],[2, 1, 0],[1, 2, 0],[0, 2, 1]], dtype=tf.int32)
# Creates the row indices matrix
row = tf.tile(tf.expand_dims(tf.range(tf.shape(ind)[0]), 1), [1, tf.shape(ind)[1]])
# Concat to the ind to form the index matrix
ind_c = tf.concat([tf.expand_dims(row,-1), tf.expand_dims(ind, -1)], axis=2)
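To sanity-check the index construction above without running TensorFlow, here is a NumPy analogue. It is a sketch rather than the TensorFlow code itself: np.broadcast_to and np.stack stand in for tf.tile and tf.concat, and integer array indexing stands in for tf.gather_nd. In TensorFlow 2, tf.gather(R, ind, batch_dims=1) should, to my knowledge, give the same result without building ind_c at all, but that is not verified here.

```python
import numpy as np

ind = np.array([[0, 2, 1],
                [2, 1, 0],
                [1, 2, 0],
                [0, 2, 1]])

# Batch of four 3x3 identity matrices, like tf.eye(3, batch_shape=[4])
R = np.tile(np.eye(3), (4, 1, 1))

# Row (batch) index repeated across each row of ind
rows = np.broadcast_to(np.arange(ind.shape[0])[:, None], ind.shape)

# Pair every batch index with the matching entry of ind: shape (4, 3, 2)
ind_c = np.stack([rows, ind], axis=-1)

# Equivalent of gather_nd: pick R[batch, row] for every (batch, row) pair
gathered = R[ind_c[..., 0], ind_c[..., 1]]
print(gathered)
```

Here ind_c matches the manually written index array from the question, and gathered reproduces the permuted identity matrices.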