argsort -- breaking ties based on previous winner - numpy

Summary
I am applying tf.argsort() to a 3D matrix. When there are ties, I need the previous winner to win.
Example
Array ArgSort
[5, 10, 15, 20] --> [3, 2, 1, 0]
[5, 10, 15, 20] --> [3, 2, 1, 0]
[5, 5, 15, 20] --> [3, 2, 1, 0]
[4, 4, 12, 15] --> [3, 2, 1, 0]
In row 2 (counting rows from zero), the second '5' should win because that position ranked higher in row 1.
When there is a tie, I want to be able to look at the prior rows and break the tie in favor of the position that won previously.
Notes
Also, I need to be able to do this in parallel on the GPU.
I might be able to implement it with Thrust zip iterators instead, but TensorFlow or NumPy seemed the better option since I'm working with 3D matrices of various sizes, and because of the built-in argsort.
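One possible direction (not from the original question; a sequential NumPy sketch for the 2D case, so it does not yet address the GPU-parallel requirement) is to argsort each row with np.lexsort, feeding the previous row's ranking in as the tie-breaking key:

import numpy as np

def argsort_sticky_ties(a, descending=True):
    # Row-by-row argsort of a 2D array where exact ties are broken in
    # favour of the column that ranked higher in the previous row.
    a = np.asarray(a, dtype=float)
    n_rows, n_cols = a.shape
    order = np.empty((n_rows, n_cols), dtype=int)
    prev_rank = np.arange(n_cols)   # rank of each column in the previous row (0 = winner)
    for i in range(n_rows):
        key = -a[i] if descending else a[i]
        # np.lexsort sorts by the last key first, so the values are the primary
        # key and the previous ranks only decide between exact ties.
        order[i] = np.lexsort((prev_rank, key))
        prev_rank[order[i]] = np.arange(n_cols)
    return order

a = np.array([[5, 10, 15, 20],
              [5, 10, 15, 20],
              [5,  5, 15, 20],
              [4,  4, 12, 15]])
print(argsort_sticky_ties(a))   # every row comes out as [3, 2, 1, 0]

Vectorizing this loop across the rows of a 3D batch (or expressing it as a scan on the GPU) is the part that still needs solving.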

Related

Finding values from different rows in pandas

I have a DataFrame containing the data and another DataFrame containing a single row of indices.
data = {'col_1': [4, 5, 6, 7], 'col_2': [3, 4, 9, 8],'col_3': [5, 5, 6, 9],'col_4': [8, 7, 6, 5]}
df = pd.DataFrame(data)
ind = {'ind_1': [2], 'ind_2': [1],'ind_3': [3],'ind_4': [2]}
ind = pd.DataFrame(ind)
Both have the same number of columns. I want to extract the values of df corresponding to the index stored in ind so that I get a single row at the end.
For this data it should be: [6, 4, 9, 6]. I tried df.loc[ind.loc[0]] but that of course gives me four different rows, not one.
The other idea I have is to zip columns and rows and iterate over them. But I feel there should be a simpler way.
You can drop down to the NumPy domain and index there:
In [14]: df.to_numpy()[ind, np.arange(len(df.columns))]
Out[14]: array([[6, 4, 9, 6]], dtype=int64)
This pairs up 2, 1, 3, 2 from ind with the column indices 0, 1, 2, 3 (i.e., 0 to the number of columns minus 1), so we get the values at [2, 0], [1, 1] and so on.
There's also df.lookup, but it's being deprecated, so...
In [19]: df.lookup(ind.iloc[0], df.columns)
~\Anaconda3\Scripts\ipython:1: FutureWarning: The 'lookup' method is deprecated and will be removed in a future version. You can use DataFrame.melt and DataFrame.loc as a substitute.
Out[19]: array([6, 4, 9, 6], dtype=int64)
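If a flat 1D result is preferred over the (1, 4) array returned by the first approach, one small variation (just a sketch reusing the question's data) is to ravel the index frame before indexing:

import numpy as np
import pandas as pd

data = {'col_1': [4, 5, 6, 7], 'col_2': [3, 4, 9, 8],
        'col_3': [5, 5, 6, 9], 'col_4': [8, 7, 6, 5]}
df = pd.DataFrame(data)
ind = pd.DataFrame({'ind_1': [2], 'ind_2': [1], 'ind_3': [3], 'ind_4': [2]})

rows = ind.to_numpy().ravel()      # array([2, 1, 3, 2])
cols = np.arange(df.shape[1])      # array([0, 1, 2, 3])
print(df.to_numpy()[rows, cols])   # [6 4 9 6]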

What is the difference between np.array([val1, val2]) and np.array([[val1, val2]])?

What is the difference between np.array([1, 2]) and np.array([[1, 2]])?
Which one of them is a matrix?
I also do not understand the shape output for the arrays above. The former returns (2,) and the latter returns (1, 2).
np.array([1, 2]) builds an array from a list, giving you a 1D array with shape (2,), since it contains a single list of two elements.
When using the double [[ you are actually passing a list of lists, which gets you a multidimensional array, or matrix, with shape (1, 2).
With the latter you are able to build more complex matrices like:
np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
rendering a 3x3 matrix:
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
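A quick way to see the difference is to compare shape and ndim directly (an illustrative snippet, not part of the original answer):

import numpy as np

a = np.array([1, 2])      # 1D array
b = np.array([[1, 2]])    # 2D array with a single row
print(a.shape, a.ndim)    # (2,) 1
print(b.shape, b.ndim)    # (1, 2) 2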

Determine number of preceding equal elements

Using NumPy, given a sorted 1D array, how do I efficiently obtain a 1D array of equal size where the value at each position is the number of preceding equal elements? I have very large arrays, and processing each element in Python code one way or another is not acceptable.
Example:
input = [0, 0, 4, 4, 4, 5, 5, 5, 5, 6]
output = [0, 1, 0, 1, 2, 0, 1, 2, 3, 0]
import numpy as np
A = np.array([0, 0, 4, 4, 4, 5, 5, 5, 5, 6])
uni, counts = np.unique(A, return_counts=True)
out = np.concatenate([np.arange(n) for n in counts])
print(out)
Not certain about the efficiency (there is probably a better way to form the out array than concatenating), but this is a very straightforward way to get the result you are looking for. It counts the unique elements, then applies np.arange to each count to get the ascending sequences, and concatenates these arrays together.
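If concatenating many small arange arrays does become a bottleneck, one fully vectorized alternative (a sketch, not part of the original answer) subtracts from each position the index at which its run of equal values starts:

import numpy as np

A = np.array([0, 0, 4, 4, 4, 5, 5, 5, 5, 6])
starts = np.flatnonzero(np.r_[True, A[1:] != A[:-1]])      # index where each run of equal values begins
run_lengths = np.diff(np.r_[starts, len(A)])               # length of each run
out = np.arange(len(A)) - np.repeat(starts, run_lengths)   # position within its own run
print(out)   # [0 1 0 1 2 0 1 2 3 0]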

How to design the label for tensorflow's ctc loss layer

I just started using the CTC loss layer in TensorFlow (r1.0) and got a little confused by the "labels" input.
In tensorflow's API document, it says
labels: An int32 SparseTensor. labels.indices[i, :] == [b, t] means labels.values[i] stores the id for (batch b, time t). labels.values[i] must take on values in [0, num_labels)
Do [b, t] and values[i] mean there is a label values[i] at time t of sequence b in the batch?
It says the value must be in [0, num_labels), but for a sparse tensor almost everything is 0 except for a few specified places, so I don't really know what the sparse tensor for CTC should look like.
And for example, if I have a short video of a hand gesture and it has the label "1", should I label the output of all timesteps as "1", or label only the last timestep as "1" and treat the others as "blank"?
Thanks!
To address your questions:
1. The notation in the documentation here seems a bit misleading, as the output label index t need not be the same as the input time slice; it's simply the index into the output sequence. A different letter could be used, because the input and output sequences are not explicitly aligned. Otherwise, your reading seems correct. I give an example below.
2. Zero is a valid class in your sequence output labels. The so-called blank label in TensorFlow's CTC implementation is the last (largest) class, which should probably not be in your ground truth labels anyhow. So if you were writing a binary sequence classifier, you'd have three classes: 0 (say "off"), 1 ("on") and 2 (the "blank" output of CTC).
3. CTC loss is for labeling sequence input with sequence output. If you only have a single class label for the whole input sequence, you're probably better off using a softmax cross-entropy loss on the output of the last time step of the RNN cell.
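For that single-label case, a minimal sketch of the alternative (with hypothetical placeholder shapes, 128 hidden units and 2 classes, standing in for a real model) could look like:

import tensorflow as tf

num_classes = 2                                               # hypothetical number of gesture classes
rnn_outputs = tf.placeholder(tf.float32, [None, None, 128])   # [batch, time, hidden]
seq_labels = tf.placeholder(tf.int32, [None])                 # one class label per sequence

last_step = rnn_outputs[:, -1, :]                   # output of the last time step
logits = tf.layers.dense(last_step, num_classes)    # project to per-class scores
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=seq_labels, logits=logits))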
If you do end up using CTC loss, you can see how I've constructed the training sequence through a reader here: How to generate/read sparse sequence labels for CTC loss within Tensorflow?.
As an example, after I batch two examples that have label sequences [44, 45, 26, 45, 46, 44, 30, 44] and [5, 8, 17, 4, 18, 19, 14, 17, 12], respectively, I get the following result from evaluating the (batched) SparseTensor:
SparseTensorValue(indices=array([[0, 0],
       [0, 1],
       [0, 2],
       [0, 3],
       [0, 4],
       [0, 5],
       [0, 6],
       [0, 7],
       [1, 0],
       [1, 1],
       [1, 2],
       [1, 3],
       [1, 4],
       [1, 5],
       [1, 6],
       [1, 7],
       [1, 8]]), values=array([44, 45, 26, 45, 46, 44, 30, 44,  5,  8, 17,  4, 18, 19, 14, 17, 12], dtype=int32), dense_shape=array([2, 9]))
Notice how the rows of the indices in the sparse tensor value correspond to the batch number and the columns correspond to the sequence index for that particular label. The values themselves are the sequence label classes. The rank is 2 and the size of the last dimension (nine in this case) is the length of the longest sequence.
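As a rough illustration (a hypothetical helper, not the reader-based pipeline linked above), a batch of variable-length label sequences can be packed into that layout like this:

import numpy as np
import tensorflow as tf

def labels_to_sparse(label_seqs):
    # Indices are [batch, sequence-index] pairs; values are the flat label ids.
    indices = [[b, t] for b, seq in enumerate(label_seqs) for t, _ in enumerate(seq)]
    values = [v for seq in label_seqs for v in seq]
    dense_shape = [len(label_seqs), max(len(seq) for seq in label_seqs)]
    return tf.SparseTensor(indices=np.asarray(indices, dtype=np.int64),
                           values=np.asarray(values, dtype=np.int32),
                           dense_shape=np.asarray(dense_shape, dtype=np.int64))

sparse_labels = labels_to_sparse([[44, 45, 26, 45, 46, 44, 30, 44],
                                  [5, 8, 17, 4, 18, 19, 14, 17, 12]])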

Numpy Indexing Behavior

I am having a lot of trouble understanding numpy indexing for multidimensional arrays. In this example that I am working with, let's say I have a 2D array, A, which is 100x10. Then I have another array, B, which is a 100x1 1D array of values between 0-9 (indices into A). In MATLAB, I would use A(sub2ind(size(A), (1:size(A,1))', B)) to return, for each row of A, the value at the index stored in the corresponding row of B.
So, as a test case, let's say I have this:
A = np.random.rand(100,10)
B = np.int32(np.floor(np.random.rand(100)*10))
If I print their shapes, I get:
print A.shape returns (100L, 10L)
print B.shape returns (100L,)
When I try to index into A using B naively (incorrectly)
Test1 = A[:,B]
print Test1.shape returns (100L, 100L)
but if I do
Test2 = A[range(A.shape[0]),B]
print Test2.shape returns (100L,)
which is what I want. I'm having trouble understanding the distinction being made here. In my mind, A[:,5] and A[range(A.shape[0]),5] should return the same thing, but that isn't the case here. How is : different from using range(sizeArray), which just creates the indices 0 through sizeArray - 1, to use as indices?
Let's look at a simple array:
In [654]: X=np.arange(12).reshape(3,4)
In [655]: X
Out[655]:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
With the slice we can pick 3 columns of X, in any order (and even repeated). In other words, take all the rows, but selected columns.
In [656]: X[:,[3,2,1]]
Out[656]:
array([[ 3,  2,  1],
       [ 7,  6,  5],
       [11, 10,  9]])
If instead I use a list (or array) of 3 row indices, it pairs them up with the column indices, effectively picking 3 values, X[0,3], X[1,2], X[2,1]:
In [657]: X[[0,1,2],[3,2,1]]
Out[657]: array([3, 6, 9])
If instead I give it a column vector to index the rows, I get the same thing as with the slice:
In [659]: X[[[0],[1],[2]],[3,2,1]]
Out[659]:
array([[ 3,  2,  1],
       [ 7,  6,  5],
       [11, 10,  9]])
This amounts to picking 9 individual values, as generated by broadcasting:
In [663]: np.broadcast_arrays(np.arange(3)[:,None],np.array([3,2,1]))
Out[663]:
[array([[0, 0, 0],
        [1, 1, 1],
        [2, 2, 2]]),
 array([[3, 2, 1],
        [3, 2, 1],
        [3, 2, 1]])]
NumPy indexing can be confusing, but a good starting point is this page: http://docs.scipy.org/doc/numpy/reference/arrays.indexing.html
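Applying the same distinction back to the original 100x10 example (a quick sketch):

import numpy as np

A = np.random.rand(100, 10)
B = np.random.randint(0, 10, size=100)

# Pairing the row indices 0..99 with B picks one value per row:
print(A[np.arange(A.shape[0]), B].shape)   # (100,)

# A slice keeps all rows for each of the 100 column indices in B:
print(A[:, B].shape)                       # (100, 100)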