Related
I am trying to access a pytorch tensor by a matrix of indices and I recently found this bit of code that I cannot find the reason why it is not working.
The code below is split into two parts. The first half proves to work, whilst the second trips an error. I fail to see the reason why. Could someone shed some light on this?
import torch
import numpy as np
a = torch.rand(32, 16)
m, n = a.shape
xx, yy = np.meshgrid(np.arange(m), np.arange(m))
result = a[xx] # WORKS for a torch.tensor of size M >= 32. It doesn't work otherwise.
a = torch.rand(16, 16)
m, n = a.shape
xx, yy = np.meshgrid(np.arange(m), np.arange(m))
result = a[xx] # IndexError: too many indices for tensor of dimension 2
and if I change a = np.random.rand(16, 16) it does work as well.
To whoever comes looking for an answer: it looks like its a bug in pyTorch.
Indexing using numpy arrays is not well defined, and it works only if tensors are indexed using tensors. So, in my example code, this works flawlessly:
a = torch.rand(M, N)
m, n = a.shape
xx, yy = torch.meshgrid(torch.arange(m), torch.arange(m), indexing='xy')
result = a[xx] # WORKS
I made a gist to check it, and it's available here
First, let me give you a quick insight into the idea of indexing a tensor with a numpy array and another tensor.
Example: this is our target tensor to be indexed
numpy_indices = torch.tensor([[0, 1, 2, 7],
[0, 1, 2, 3]]) # numpy array
tensor_indices = torch.tensor([[0, 1, 2, 7],
[0, 1, 2, 3]]) # 2D tensor
t = torch.tensor([[1, 2, 3, 4], # targeted tensor
[5, 6, 7, 8],
[9, 10, 11, 12],
[13, 14, 15, 16],
[17, 18, 19, 20],
[21, 22, 23, 24],
[25, 26, 27, 28],
[29, 30, 31, 32]])
numpy_result = t[numpy_indices]
tensor_result = t[tensor_indices]
Indexing using a 2D numpy array: the index is read like pairs (x,y) tensor[row,column] e.g. t[0,0], t[1,1], t[2,2], and t[7,3].
print(numpy_result) # tensor([ 1, 6, 11, 32])
Indexing using a 2D tensor: walks through the index tensor in a row-wise manner and each value is an index of a row in the targeted tensor.
e.g. [ [t[0],t[1],t[2],[7]] , [[0],[1],[2],[3]] ] see the example below, the new shape of tensor_result after indexing is (tensor_indices.shape[0],tensor_indices.shape[1],t.shape[1])=(2,4,4).
print(tensor_result) # tensor([[[ 1, 2, 3, 4],
# [ 5, 6, 7, 8],
# [ 9, 10, 11, 12],
# [29, 30, 31, 32]],
# [[ 1, 2, 3, 4],
# [ 5, 6, 7, 8],
# [ 9, 10, 11, 12],
# [ 13, 14, 15, 16]]])
If you try to add a third row in numpy_indices, you will get the same error you have because the index will be represented by 3D e.g., (0,0,0)...(7,3,3).
indices = np.array([[0, 1, 2, 7],
[0, 1, 2, 3],
[0, 1, 2, 3]])
print(numpy_result) # IndexError: too many indices for tensor of dimension 2
However, this is not the case with indexing by tensor and the shape will be bigger (3,4,4).
Finally, as you see the outputs of the two types of indexing are completely different. To solve your problem, you can use
xx = torch.tensor(xx).long() # convert a numpy array to a tensor
What happens in the case of advanced indexing (rows of numpy_indices > 3 ) as your situation is still ambiguous and unsolved and you can check 1 , 2, 3.
I am finding outliers from a column and storing them in a list. Now i want to delete all the values which
are present in my list from the column.
How can achieve this ?
This is my function for finding outliers
outlier=[]
def detect_outliers(data):
threshold=3
m = np.mean(data)
st = np.std(data)
for i in data:
#calculating z-score value
z_score=(i-m)/st
#if the z_score value is greater than threshold value than its a outlier
if np.abs(z_score)>threshold:
outlier.append(i)
return outlier
This is my column in data frame
df_train_11.AMT_INCOME_TOTAL
import numpy as np, pandas as pd
df = pd.DataFrame(np.random.rand(10,5))
outlier_list=[]
def detect_outliers(data):
threshold=0.5
for i in data:
#calculating z-score value
z_score=(df.loc[:,i]- np.mean(df.loc[:,i])) /np.std(df.loc[:,i])
outliers = np.abs(z_score)>threshold
outlier_list.append(df.index[outliers].tolist())
return outlier_list
outlier_list = detect_outliers(df)
[[1, 2, 4, 5, 6, 7, 9],
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[0, 1, 2, 4, 8],
[0, 1, 3, 4, 6, 8],
[0, 1, 3, 5, 6, 8, 9]]
This way, you get the outliers of each column. outlier_list[0] gives you [1, 2, 4, 5, 6, 7, 9] which means that the rows 1,2,etc are outliers for column 0.
EDIT
Shorter answer:
df = pd.DataFrame(np.random.randn(10, 3), columns=list('ABC'))
df[((df.B - df.B.mean()) / df.B.std()).abs() < 3]
This willfilter the DataFrame where only ONE column (e.g. 'B') is within three standard deviations.
I have the following DataFrame of dummy data
data = { 'user_id': np.random.randint(1000000, 10000000, size=(10)), 'week': np.random.randint(1, 10, size=(10)) }
df = pd.DataFrame(data = data)
I would like to add a new column whose values are arrays of length week (with those arrays containing random values). None of these work
df.loc[:,'inputs'] = np.random.randint(0, 28, size=(10))
(gives one integer per DataFrame cell, not an array of them)
df.loc[:,'inputs'] = np.random.randint(0, 28, size=(df['week']))
ValueError: Length of values does not match length of index
df.loc[:,'inputs'] = np.random.randint(0, 28, size=(10, df['week']))
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
These are obviously all wrong, but I cannot see how to make a new column whose entries are each arrays where the length of those arrays depends on another column's value in the same row.
Use list comprehension for arrays by week numbers:
df['inputs'] = [np.random.randint(0, 28, size=x) for x in df['week']]
print (df)
user_id week inputs
0 9168288 4 [15, 5, 10, 9]
1 2765768 7 [21, 26, 6, 6, 22, 21, 4]
2 2948278 6 [6, 14, 4, 2, 3, 20]
3 9302275 1 [23]
4 5737115 5 [1, 20, 9, 19, 18]
5 5214343 9 [16, 25, 1, 10, 2, 23, 1, 16, 18]
6 9332184 7 [8, 27, 14, 8, 14, 11, 5]
7 1569483 5 [6, 19, 3, 10, 16]
8 2931319 2 [0, 15]
9 2126334 2 [20, 22]
Lets say I have a Python Numpy array a.
a = numpy.array([1,2,3,4,5,6,7,8,9,10,11])
I want to create a matrix of sub sequences from this array of length 5 with stride 3. The results matrix hence will look as follows:
numpy.array([[1,2,3,4,5],[4,5,6,7,8],[7,8,9,10,11]])
One possible way of implementing this would be using a for-loop.
result_matrix = np.zeros((3, 5))
for i in range(0, len(a), 3):
result_matrix[i] = a[i:i+5]
Is there a cleaner way to implement this in Numpy?
Approach #1 : Using broadcasting -
def broadcasting_app(a, L, S ): # Window len = L, Stride len/stepsize = S
nrows = ((a.size-L)//S)+1
return a[S*np.arange(nrows)[:,None] + np.arange(L)]
Approach #2 : Using more efficient NumPy strides -
def strided_app(a, L, S ): # Window len = L, Stride len/stepsize = S
nrows = ((a.size-L)//S)+1
n = a.strides[0]
return np.lib.stride_tricks.as_strided(a, shape=(nrows,L), strides=(S*n,n))
Sample run -
In [143]: a
Out[143]: array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
In [144]: broadcasting_app(a, L = 5, S = 3)
Out[144]:
array([[ 1, 2, 3, 4, 5],
[ 4, 5, 6, 7, 8],
[ 7, 8, 9, 10, 11]])
In [145]: strided_app(a, L = 5, S = 3)
Out[145]:
array([[ 1, 2, 3, 4, 5],
[ 4, 5, 6, 7, 8],
[ 7, 8, 9, 10, 11]])
Starting in Numpy 1.20, we can make use of the new sliding_window_view to slide/roll over windows of elements.
And coupled with a stepping [::3], it simply becomes:
from numpy.lib.stride_tricks import sliding_window_view
# values = np.array([1,2,3,4,5,6,7,8,9,10,11])
sliding_window_view(values, window_shape = 5)[::3]
# array([[ 1, 2, 3, 4, 5],
# [ 4, 5, 6, 7, 8],
# [ 7, 8, 9, 10, 11]])
where the intermediate result of the sliding is:
sliding_window_view(values, window_shape = 5)
# array([[ 1, 2, 3, 4, 5],
# [ 2, 3, 4, 5, 6],
# [ 3, 4, 5, 6, 7],
# [ 4, 5, 6, 7, 8],
# [ 5, 6, 7, 8, 9],
# [ 6, 7, 8, 9, 10],
# [ 7, 8, 9, 10, 11]])
Modified version of #Divakar's code with checking to ensure that memory is contiguous and that the returned array cannot be modified. (Variable names changed for my DSP application).
def frame(a, framelen, frameadv):
"""frame - Frame a 1D array
a - 1D array
framelen - Samples per frame
frameadv - Samples between starts of consecutive frames
Set to framelen for non-overlaping consecutive frames
Modified from Divakar's 10/17/16 11:20 solution:
https://stackoverflow.com/questions/40084931/taking-subarrays-from-numpy-array-with-given-stride-stepsize
CAVEATS:
Assumes array is contiguous
Output is not writable as there are multiple views on the same memory
"""
if not isinstance(a, np.ndarray) or \
not (a.flags['C_CONTIGUOUS'] or a.flags['F_CONTIGUOUS']):
raise ValueError("Input array a must be a contiguous numpy array")
# Output
nrows = ((a.size-framelen)//frameadv)+1
oshape = (nrows, framelen)
# Size of each element in a
n = a.strides[0]
# Indexing in the new object will advance by frameadv * element size
ostrides = (frameadv*n, n)
return np.lib.stride_tricks.as_strided(a, shape=oshape,
strides=ostrides, writeable=False)
in [http://deeplearning.net/tutorial/lenet.html#lenet] it says:
This will generate a matrix of shape (batch_size, nkerns[1] * 4 * 4),
# or (500, 50 * 4 * 4) = (500, 800) with the default values.
layer2_input = layer1.output.flatten(2)
when I use flatten function on a numpy 3d array I get a 1D array. but here it says I get a matrix. How does flatten(2) work in theano?
A similar example on numpy produces 1D array:
a= array([[[ 1, 2, 3],
[ 4, 5, 6],
[ 7, 8, 9]],
[[10, 11, 12],
[13, 14, 15],
[16, 17, 18]],
[[19, 20, 21],
[22, 23, 24],
[25, 26, 27]]])
a.flatten(2)=array([ 1, 10, 19, 4, 13, 22, 7, 16, 25, 2, 11, 20, 5, 14, 23, 8, 17,
26, 3, 12, 21, 6, 15, 24, 9, 18, 27])
numpy doesn't support flattening only some dimensions but Theano does.
So if a is a numpy array, a.flatten(2) doesn't make any sense. It runs without error but only because the 2 is passed as the order parameter which seems to cause numpy to stick with the default order of C.
Theano's flatten does support axis specification. The documentation explains how it works.
Parameters:
x (any TensorVariable (or compatible)) – variable to be flattened
outdim (int) – the number of dimensions in the returned variable
Return type:
variable with same dtype as x and outdim dimensions
Returns:
variable with the same shape as x in the leading outdim-1 dimensions,
but with all remaining dimensions of x collapsed into the last dimension.
For example, if we flatten a tensor of shape (2, 3, 4, 5) with
flatten(x, outdim=2), then we’ll have the same (2-1=1) leading
dimensions (2,), and the remaining dimensions are collapsed. So the
output in this example would have shape (2, 60).
A simple Theano demonstration:
import numpy
import theano
import theano.tensor as tt
def compile():
x = tt.tensor3()
return theano.function([x], x.flatten(2))
def main():
a = numpy.arange(2 * 3 * 4).reshape((2, 3, 4))
f = compile()
print a.shape, f(a).shape
main()
prints
(2L, 3L, 4L) (2L, 12L)