PyTorch indexing by argmax

Dear community, I have a challenge regarding tensor indexing in PyTorch. The problem is simple: given a tensor, create an index tensor that indexes its maximum values per column.
import torch as T

x = T.tensor([[0, 3, 0, 5, 9, 8, 2, 0],
              [0, 4, 9, 6, 7, 9, 1, 0]])
Given this tensor, I would like to build a boolean mask for indexing its maximum values per column. To be specific, I do not need its maximum values, torch.max(x, dim=0), nor its indices, torch.argmax(x, dim=0), but a boolean mask for indexing another tensor based on this tensor's max values. My ideal output would be:
# Input tensor
x
tensor([[0, 3, 0, 5, 9, 8, 2, 0],
        [0, 4, 9, 6, 7, 9, 1, 0]])
# Ideal output: boolean mask tensor
idx
tensor([[1, 0, 0, 0, 1, 0, 1, 1],
        [0, 1, 1, 1, 0, 1, 0, 0]])
I know that values_max = x[idx] and values_max = x.max(dim=0).values are equivalent, but I am not looking for values_max, only for idx.
I have built a solution around it, but it just seems too complex and I am sure torch has an optimized way to do this. I tried to use torch.index_select with the output of x.argmax(dim=0) but failed, so I built a custom solution that seems too cumbersome to me, and I am asking for help to do this in a vectorized / tensorial / torch way.

You can perform this operation by first extracting the index of the maximum value column-wise of your tensor with torch.argmax, setting keepdim to True:
>>> x.argmax(0, keepdim=True)
tensor([[0, 1, 1, 1, 0, 1, 0, 0]])
Then you can use torch.scatter to place 1s in a zero tensor at the designated indices:
>>> torch.zeros_like(x).scatter(0, x.argmax(0,True), value=1)
tensor([[1, 0, 0, 0, 1, 0, 1, 1],
[0, 1, 1, 1, 0, 1, 0, 0]])
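Alternatively, a minimal sketch that builds a boolean mask by comparing x against its column-wise maxima. Note this is not the scatter approach above and it behaves differently on ties: columns 0 and 7 are tied at 0, so both rows get marked, whereas argmax keeps only the first occurrence.
>>> x == x.max(dim=0, keepdim=True).values
tensor([[ True, False, False, False,  True, False,  True,  True],
        [ True,  True,  True,  True, False,  True, False,  True]])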

Related

Standard implementation of vectorize_sequences

In François Chollet's Deep Learning with Python, this function appears:
import numpy as np

def vectorize_sequences(sequences, dimension=10000):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.
    return results
I understand what this function does. This function is asked about in this question and in this question as well, and is also mentioned here, here, here, here, here & here. Despite being so widespread, this vectorization is, according to Chollet's book, done "manually for maximum clarity." I am interested in whether there is a standard, not "manual", way of doing it.
Is there a standard Keras / Tensorflow / Scikit-learn / Pandas / Numpy implementation of a function which behaves very similarly to the function above?
Solution with MultiLabelBinarizer
Assuming sequences is an array of integers with maximum possible value up to dimension-1, we can use MultiLabelBinarizer from sklearn.preprocessing to replicate the behaviour of the function vectorize_sequences:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer(classes=range(dimension))
mlb.fit_transform(sequences)
Solution with Numpy broadcasting
Assuming sequences is an array of integers with maximum possible value up to dimension-1:
(np.array(sequences)[:, :, None] == range(dimension)).any(1).view('i1')
Worked out example
>>> sequences
[[4, 1, 0],
 [4, 0, 3],
 [3, 4, 2]]
>>> dimension = 10
>>> mlb = MultiLabelBinarizer(classes=range(dimension))
>>> mlb.fit_transform(sequences)
array([[1, 1, 0, 0, 1, 0, 0, 0, 0, 0],
       [1, 0, 0, 1, 1, 0, 0, 0, 0, 0],
       [0, 0, 1, 1, 1, 0, 0, 0, 0, 0]])
>>> (np.array(sequences)[:, :, None] == range(dimension)).any(1).view('i1')
array([[1, 1, 0, 0, 1, 0, 0, 0, 0, 0],
       [1, 0, 0, 1, 1, 0, 0, 0, 0, 0],
       [0, 0, 1, 1, 1, 0, 0, 0, 0, 0]], dtype=int8)
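For reference, Chollet's loop can also be collapsed into a single fancy-indexing assignment in plain numpy; a sketch (vectorize_sequences_np is my name, not a library function):
import numpy as np

def vectorize_sequences_np(sequences, dimension=10000):
    # Build a row index and a column index for every entry of every
    # sequence, then set all positions in one vectorized assignment.
    results = np.zeros((len(sequences), dimension))
    rows = np.repeat(np.arange(len(sequences)), [len(s) for s in sequences])
    cols = np.concatenate(sequences)
    results[rows, cols] = 1.
    return results

Unlike the broadcasting one-liner, this also handles sequences of unequal lengths.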

how to use keras and pandas to duplicate similar arrays

I have an array from my teacher; he gave me an array like the one below.
The array contains 0, 1 and None:
[[1, 1, 0, 0, None, 0, 1], [1, 0, 0, 0, None, 0, 1], [1, 1, None, 0, 1, 0, None], [1, 1, 1, 0, None, 0, 0], [1, 1, 0, None, 0, 0, 1]]
He asked me to duplicate the array ten times, but each column must keep a similar percentage distribution, say each column deviates by no more than 8% from the original array.
How should I achieve this goal?
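One way to read the requirement, as a rough sketch under my own assumptions (the resampling strategy and tolerance check are not from the thread): bootstrap rows with replacement and reject samples whose per-column fraction of ones drifts beyond the 8% tolerance.
import numpy as np

rng = np.random.default_rng(0)

# The array from the question; None marks missing entries (object dtype).
orig = np.array([[1, 1, 0, 0, None, 0, 1],
                 [1, 0, 0, 0, None, 0, 1],
                 [1, 1, None, 0, 1, 0, None],
                 [1, 1, 1, 0, None, 0, 0],
                 [1, 1, 0, None, 0, 0, 1]], dtype=object)

def col_means(a):
    # Fraction of ones per column, ignoring None entries.
    return np.array([np.mean([v for v in col if v is not None]) for col in a.T])

target = col_means(orig)
for _ in range(1000):  # rejection sampling: retry until within tolerance
    sample = orig[rng.integers(0, len(orig), size=10 * len(orig))]
    if np.all(np.abs(col_means(sample) - target) <= 0.08):
        break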

tensorflow expand counts into ranges

We have a Tensor of unknown length N, containing some int32 values.
How can we generate another Tensor that will contain N ranges concatenated together, each one between 0 and the corresponding int32 value from the original tensor?
For example, if we have [4, 4, 5, 3, 1], the output Tensor should look like [0 1 2 3 0 1 2 3 0 1 2 3 4 0 1 2 0].
Thank you for any advice.
You can make this work with a tensor as input by using a tf.RaggedTensor, which can contain dimensions of non-uniform length.
# Or any other N length tensor
tf_counts = tf.convert_to_tensor([4, 4, 5, 3, 1])
tf.print(tf_counts)
# [4 4 5 3 1]
# Create a ragged tensor, each row is a sequence of length tf_counts[i]
tf_ragged = tf.ragged.range(tf_counts)
tf.print(tf_ragged)
# <tf.RaggedTensor [[0, 1, 2, 3], [0, 1, 2, 3], [0, 1, 2, 3, 4], [0, 1, 2], [0]]>
# Read values
tf.print(tf_ragged.flat_values, summarize=-1)
# [0 1 2 3 0 1 2 3 0 1 2 3 4 0 1 2 0]
For this 2-dimensional case, the ragged tensor tf_ragged is a "matrix" of rows with varying length:
[[0, 1, 2, 3],
 [0, 1, 2, 3],
 [0, 1, 2, 3, 4],
 [0, 1, 2],
 [0]]
Check tf.ragged.range for more options on how to create the sequences in each row: starts for inclusive lower limits, limits for exclusive upper limits, deltas for increments. Each may vary per sequence.
Also note that the dtype of the tf_counts tensor propagates to the final values.
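For instance, a small sketch of those per-sequence parameters (the values are made up for illustration):
# starts, limits and deltas can each vary per sequence:
tf.print(tf.ragged.range(starts=[1, 0, 2], limits=[4, 3, 8], deltas=[1, 1, 2]))
# <tf.RaggedTensor [[1, 2, 3], [0, 1, 2], [2, 4, 6]]>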
If you want to have everything as a tensorflow object, then use tf.range() along with tf.concat().
In [88]: vals = [4, 4, 5, 3, 1]
In [89]: tf_range = [tf.range(0, limit=item, dtype=tf.int32) for item in vals]
# concat all `tf_range` objects into a single tensor
In [90]: concatenated_tensor = tf.concat(tf_range, 0)
In [91]: concatenated_tensor.eval()
Out[91]: array([0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 4, 0, 1, 2, 0], dtype=int32)
There are other approaches to do this as well. Here, I assume that you want a constant tensor, but you can construct any tensor once you have the full range list.
First, we construct the full range list using a list comprehension, make a flat list out of it, and then construct a tensor.
In [78]: from itertools import chain
In [79]: vals = [4, 4, 5, 3, 1]
In [80]: range_list = list(chain(*[range(item) for item in vals]))
In [81]: range_list
Out[81]: [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 4, 0, 1, 2, 0]
In [82]: const_tensor = tf.constant(range_list, dtype=tf.int32)
In [83]: const_tensor.eval()
Out[83]: array([0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 4, 0, 1, 2, 0], dtype=int32)
On the other hand, we can also use tf.range(), but it returns an array when you evaluate it. So you'd have to construct lists from the arrays, make a flat list out of them, and finally construct the tensor, as in the following example.
list_of_arr = [tf.range(0, limit=item, dtype=tf.int32).eval() for item in vals]
range_list = list(chain(*[arr.tolist() for arr in list_of_arr]))
# output
[0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 4, 0, 1, 2, 0]
const_tensor = tf.constant(range_list, dtype=tf.int32)
const_tensor.eval()
#output tensor as numpy array
array([0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 4, 0, 1, 2, 0], dtype=int32)
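Note that the In [..] snippets above rely on a TF1 interactive session for .eval(); under TF2's eager execution the same result can be read off directly with .numpy() (a sketch, not part of the original answers):
import tensorflow as tf

vals = [4, 4, 5, 3, 1]
# Build each range eagerly and concatenate into one flat tensor.
concatenated = tf.concat([tf.range(v) for v in vals], axis=0)
print(concatenated.numpy())
# [0 1 2 3 0 1 2 3 0 1 2 3 4 0 1 2 0]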

scipy: Adding a sparse vector to a specific row of a sparse matrix

In Python, what is the best way to add a CSR vector to a specific row of a CSR matrix? I found one workaround here, but I am wondering if there is a better/more efficient way to do this. I would appreciate any help.
Given an NxM CSR matrix A and a 1xM CSR matrix B, and a row index i, the goal is to add B to the i-th row of A efficiently.
The obvious indexed addition does work. It gives an efficiency warning, but that doesn't mean it is the slowest way, just that you shouldn't count on doing this repeatedly. It suggests working with the lil format, but conversion to that and back probably takes more time than performing the addition on the csr matrix.
In [1049]: B.A
Out[1049]:
array([[0, 9, 0, 0, 1, 0],
       [2, 0, 5, 0, 0, 9],
       [0, 2, 0, 0, 0, 0],
       [2, 0, 0, 0, 0, 0],
       [0, 9, 5, 3, 0, 7],
       [1, 0, 0, 8, 9, 0]], dtype=int32)
In [1051]: B[1,:] += np.array([1,0,1,0,0,0])
/usr/local/lib/python3.5/dist-packages/scipy/sparse/compressed.py:730: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
SparseEfficiencyWarning)
In [1052]: B
Out[1052]:
<6x6 sparse matrix of type '<class 'numpy.int32'>'
with 17 stored elements in Compressed Sparse Row format>
In [1053]: B.A
Out[1053]:
array([[0, 9, 0, 0, 1, 0],
       [3, 0, 6, 0, 0, 9],
       [0, 2, 0, 0, 0, 0],
       [2, 0, 0, 0, 0, 0],
       [0, 9, 5, 3, 0, 7],
       [1, 0, 0, 8, 9, 0]])
As your linked question shows, it is possible to act directly on the attributes of the sparse matrix. That code shows why there's an efficiency warning: in the general case it has to rebuild the matrix attributes.
lil is more efficient for row replacement because it just has to change a sublist in the matrix .data and .rows attributes. A change in one row doesn't change the attributes of any of the others.
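If you do need repeated row updates, the conversion round-trip that the warning suggests looks roughly like this (a sketch; B and the added vector as in the session above):
Bl = B.tolil()                  # convert once
Bl[1, :] = Bl[1, :].toarray() + np.array([1, 0, 1, 0, 0, 0])  # no sparsity warning
B = Bl.tocsr()                  # convert back when done editing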
That said, IF your addition has the same sparsity as the original row, it is possible to change specific elements of the .data attribute without reworking .indices or .indptr. Drawing on the linked code,
A.data[idx_start_row:idx_end_row]
is the slice of A.data that will be changed. You of course need the corresponding slice from the 'vector'.
Starting with B as shown in In [1049]:
In [1085]: B.indptr
Out[1085]: array([ 0, 2, 5, 6, 7, 11, 14], dtype=int32)
In [1086]: B.data
Out[1086]: array([9, 1, 2, 5, 9, 2, 2, 9, 5, 3, 7, 1, 8, 9], dtype=int32)
In [1087]: B.indptr[[1,2]] # row 1
Out[1087]: array([2, 5], dtype=int32)
In [1088]: B.data[2:5]
Out[1088]: array([2, 5, 9], dtype=int32)
In [1089]: B.indices[2:5] # row 1 column indices
Out[1089]: array([0, 2, 5], dtype=int32)
In [1090]: B.data[2:5] += np.array([1,2,3])
In [1091]: B.A
Out[1091]:
array([[ 0,  9,  0,  0,  1,  0],
       [ 3,  0,  7,  0,  0, 12],
       [ 0,  2,  0,  0,  0,  0],
       [ 2,  0,  0,  0,  0,  0],
       [ 0,  9,  5,  3,  0,  7],
       [ 1,  0,  0,  8,  9,  0]], dtype=int32)
Notice where the changed values, [3,7,12], are in the lil format:
In [1092]: B.tolil().data
Out[1092]: array([[9, 1], [3, 7, 12], [2], [2], [9, 5, 3, 7], [1, 8, 9]], dtype=object)
csr / csc matrices are efficient for most operations, including addition (O(nnz)). However, small changes that affect the sparsity structure, such as your example, or even switching a single position from 0 to 1, are not, because they require an O(nnz) reorganisation of the representation. Values and indices are packed; inserting one means everything after it has to move.
If you do just a single such operation, my guess would be that you can't easily beat scipy's implementation. However, if you are adding multiple rows, for example, it may be worthwhile to first make a sparse matrix of them and then add that in one go.
Creating a csr matrix by hand from rows, say, is not that difficult. For example, if your rows are dense and in order:
# rows: dense 2-D array of the rows to insert; true_row_numbers: the
# destination row index of each; N: the row count of the target matrix.
row_numbers, indices = np.where(rows)
data = rows[row_numbers, indices]
indptr = np.searchsorted(np.r_[true_row_numbers[row_numbers], N], np.arange(N + 1))
If you have a collection of sparse rows and their row numbers:
data = np.r_[tuple(r.data for r in rows)]
indices = np.r_[tuple(r.indices for r in rows)]
jumps = np.add.accumulate([0] + [r.nnz for r in rows])  # nnz per row, cumulated
indptr = np.repeat(jumps, np.diff(np.r_[-1, true_row_numbers, N]))
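Either way, once data, indices and indptr are built, the update matrix can be assembled from the raw CSR attributes and added in a single O(nnz) operation (a sketch; M, the column count, is assumed from context):
from scipy import sparse

# Build the update matrix directly from its CSR attributes, then add once.
C = sparse.csr_matrix((data, indices, indptr), shape=(N, M))
A = A + C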

Convert string to integer pandas dataframe index

I have a pandas dataframe with a multiindex. Unfortunately one of the indices gives years as strings, e.g. '2010', '2011'. How do I convert these to integers?
More concretely
MultiIndex(levels=[[u'2010', u'2011'], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]],
           labels=[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, ...]],
           names=[u'Year', u'Month'])
df_cbs_prelim_total.index.set_levels(df_cbs_prelim_total.index.get_level_values(0).astype('int'))
seems to do it, but not in place. Is there a proper way of changing them?
Cheers,
Mike
It will probably be cleaner to do this before you assign it as the index (as @EdChum points out), but when you already have it as the index, you can indeed use set_levels to alter one of the levels of your multi-index. This is a bit cleaner than your code (you can use index.levels[..]):
In [165]: idx = pd.MultiIndex.from_product([[1,2,3], ['2011','2012','2013']])
In [166]: idx
Out[166]:
MultiIndex(levels=[[1, 2, 3], [u'2011', u'2012', u'2013']],
           labels=[[0, 0, 0, 1, 1, 1, 2, 2, 2], [0, 1, 2, 0, 1, 2, 0, 1, 2]])
In [167]: idx.levels[1]
Out[167]: Index([u'2011', u'2012', u'2013'], dtype='object')
In [168]: idx = idx.set_levels(idx.levels[1].astype(int), level=1)
In [169]: idx
Out[169]:
MultiIndex(levels=[[1, 2, 3], [2011, 2012, 2013]],
           labels=[[0, 0, 0, 1, 1, 1, 2, 2, 2], [0, 1, 2, 0, 1, 2, 0, 1, 2]])
You have to reassign it to save the changes (as is done above; in your case this would be df_cbs_prelim_total.index = df_cbs_prelim_total.index.set_levels(...)).
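A side note for newer pandas versions (my addition, not from the original answer): the labels attribute shown in these reprs has since been renamed to codes, and the inplace keyword of set_levels has been removed, so reassignment as above is the way to go:
# Same operation in modern pandas; only the repr differs (codes vs labels).
idx = idx.set_levels(idx.levels[1].astype(int), level=1)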