How to compare a 2D array against a 1D array column-wise?

I have two numpy arrays. One of them is 2D while the other is 1D.
>>> a = np.arange(0,20).reshape(2,10)
>>> a
array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]])
>>> b = np.full( a.shape[1], 10 )
>>> b
array([10, 10, 10, 10, 10, 10, 10, 10, 10, 10])
I want to compare them column-wise:
If a column of a contains an element identical to the corresponding element of b, then store the row number(s) of a where it matches.
Otherwise, find the element(s) of that column closest to the element of b and store their row number(s).
In my example, the output from the comparison should be:
[ 1, [0,1], [0,1], [0,1], [0,1], [0,1], [0,1], [0,1], [0,1], [0,1] ]
How do I do this in NumPy?
I was thinking of using np.where( a==b, run a function to get row(s) if same, run another function to get row(s) of diff )? Is this the way?
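A minimal sketch of one way to approach this, assuming "closest" means the smallest absolute difference, so that an exact match is simply the case where that difference is zero. The helper name closest_rows_per_column is hypothetical. Under this reading the sample arrays give row 1 for columns 0 to 4, both rows for column 5 (a tie), and row 0 for columns 6 to 9, so the distance rule would need adjusting if a different notion of closeness is intended.

```python
import numpy as np

def closest_rows_per_column(a, b):
    """For each column j, return the rows of a whose value is nearest to b[j]."""
    diff = np.abs(a - b)        # b broadcasts across the rows of a
    best = diff.min(axis=0)     # smallest distance found in each column
    # Keep every row that ties for the smallest distance in its column.
    return [np.flatnonzero(diff[:, j] == best[j]).tolist()
            for j in range(a.shape[1])]

a = np.arange(0, 20).reshape(2, 10)
b = np.full(a.shape[1], 10)
rows = closest_rows_per_column(a, b)
print(rows)
```

The result is a plain Python list because the number of matching rows can differ from column to column.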

Related

pytorch tensor indices is confusing [duplicate]

I am trying to index a PyTorch tensor with a matrix of indices, and I recently came across this bit of code whose failure I cannot explain.
The code below is split into two parts. The first half works, whilst the second raises an error, and I fail to see why. Could someone shed some light on this?
import torch
import numpy as np
a = torch.rand(32, 16)
m, n = a.shape
xx, yy = np.meshgrid(np.arange(m), np.arange(m))
result = a[xx] # WORKS for a torch.tensor of size M >= 32. It doesn't work otherwise.
a = torch.rand(16, 16)
m, n = a.shape
xx, yy = np.meshgrid(np.arange(m), np.arange(m))
result = a[xx] # IndexError: too many indices for tensor of dimension 2
If I change it to a = np.random.rand(16, 16), it works as well.

To whoever comes looking for an answer: it looks like it's a bug in PyTorch.
Indexing with NumPy arrays is not well defined; it works reliably only if tensors are indexed with tensors. So, in my example code, this works flawlessly:
a = torch.rand(M, N)
m, n = a.shape
xx, yy = torch.meshgrid(torch.arange(m), torch.arange(m), indexing='xy')
result = a[xx] # WORKS
I made a gist to check it, and it's available here

First, let me give you a quick insight into how indexing a tensor with a numpy array differs from indexing it with another tensor.
Example: these are the two index objects and the target tensor to be indexed
numpy_indices = np.array([[0, 1, 2, 7],
                          [0, 1, 2, 3]]) # numpy array
tensor_indices = torch.tensor([[0, 1, 2, 7],
                               [0, 1, 2, 3]]) # 2D tensor
t = torch.tensor([[1, 2, 3, 4], # targeted tensor
                  [5, 6, 7, 8],
                  [9, 10, 11, 12],
                  [13, 14, 15, 16],
                  [17, 18, 19, 20],
                  [21, 22, 23, 24],
                  [25, 26, 27, 28],
                  [29, 30, 31, 32]])
numpy_result = t[numpy_indices]
tensor_result = t[tensor_indices]
Indexing using a 2D numpy array: the two rows of the index are read as (row, column) coordinate pairs, i.e. t[0,0], t[1,1], t[2,2], and t[7,3].
print(numpy_result) # tensor([ 1, 6, 11, 32])
Indexing using a 2D tensor: the index tensor is walked row by row, and each value selects a whole row of the targeted tensor,
e.g. [[t[0], t[1], t[2], t[7]], [t[0], t[1], t[2], t[3]]]. As the example below shows, the shape of tensor_result after indexing is (tensor_indices.shape[0], tensor_indices.shape[1], t.shape[1]) = (2, 4, 4).
print(tensor_result) # tensor([[[ 1,  2,  3,  4],
                     #          [ 5,  6,  7,  8],
                     #          [ 9, 10, 11, 12],
                     #          [29, 30, 31, 32]],
                     #         [[ 1,  2,  3,  4],
                     #          [ 5,  6,  7,  8],
                     #          [ 9, 10, 11, 12],
                     #          [13, 14, 15, 16]]])
If you add a third row to numpy_indices, you will get the same error you have, because each index is then read as a 3D coordinate, e.g. (0,0,0)...(7,3,3), while t has only two dimensions.
numpy_indices = np.array([[0, 1, 2, 7],
                          [0, 1, 2, 3],
                          [0, 1, 2, 3]])
numpy_result = t[numpy_indices] # IndexError: too many indices for tensor of dimension 2
However, this is not the case when indexing with a tensor; the result simply has a bigger shape, (3,4,4).
Finally, as you can see, the outputs of the two types of indexing are completely different. To solve your problem, you can use
xx = torch.tensor(xx).long() # convert a numpy array to a tensor
What happens with advanced indexing when numpy_indices has more than 3 rows, as in your situation, is still ambiguous and unresolved; you can check 1, 2, 3.
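A small sketch of the conversion suggested above, applied to the 16 x 16 case from the question. It uses torch.from_numpy rather than torch.tensor, which is a choice made here for illustration; either should yield an integer index tensor.

```python
import numpy as np
import torch

a = torch.rand(16, 16)
m, n = a.shape
xx, yy = np.meshgrid(np.arange(m), np.arange(m))

# Convert the NumPy index grid to an integer tensor before indexing.
xx_t = torch.from_numpy(xx).long()

# Each value in xx_t selects a whole row of a, so the result gains one axis.
result = a[xx_t]
print(result.shape)
```

With a tensor index, result[i, j] is the row a[xx[i, j]], so the output has the shape of the index followed by the row length.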

How to delete rows from column which have matching values in the list Pandas

I am finding outliers in a column and storing them in a list. Now I want to delete from the column all the values that are present in my list.
How can I achieve this?
This is my function for finding outliers:
outlier=[]
def detect_outliers(data):
    threshold=3
    m = np.mean(data)
    st = np.std(data)
    for i in data:
        #calculating z-score value
        z_score=(i-m)/st
        #if the absolute z_score is greater than the threshold, the value is an outlier
        if np.abs(z_score)>threshold:
            outlier.append(i)
    return outlier
This is the column in my data frame:
df_train_11.AMT_INCOME_TOTAL

import numpy as np, pandas as pd
df = pd.DataFrame(np.random.rand(10,5))
def detect_outliers(data):
    threshold=0.5
    outlier_list=[]
    for i in data:
        #calculating the z-score of every value in column i
        z_score=(data.loc[:,i]- np.mean(data.loc[:,i])) /np.std(data.loc[:,i])
        outliers = np.abs(z_score)>threshold
        #collecting the row labels flagged as outliers for this column
        outlier_list.append(data.index[outliers].tolist())
    return outlier_list
outlier_list = detect_outliers(df)
[[1, 2, 4, 5, 6, 7, 9],
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[0, 1, 2, 4, 8],
[0, 1, 3, 4, 6, 8],
[0, 1, 3, 5, 6, 8, 9]]
This way, you get the outlier rows of each column (the exact lists vary from run to run because the data is random). Here outlier_list[0] is [1, 2, 4, 5, 6, 7, 9], which means that rows 1, 2, etc. are outliers for column 0.
EDIT
Shorter answer:
df = pd.DataFrame(np.random.randn(10, 3), columns=list('ABC'))
df[((df.B - df.B.mean()) / df.B.std()).abs() < 3]
This filters the DataFrame down to the rows where ONE column (e.g. 'B') is within three standard deviations of its mean.
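If the outlier values are already collected in a list, as in the question, a minimal sketch of dropping the matching rows could use Series.isin with a negated mask. The sample values and the outliers list below are made up for illustration; only the column name AMT_INCOME_TOTAL comes from the question.

```python
import pandas as pd

# Hypothetical data: one value is far away from the rest.
df = pd.DataFrame({"AMT_INCOME_TOTAL": [100.0, 110.0, 95.0, 105.0, 10_000.0]})

# Assumed to be the list returned by an outlier detector.
outliers = [10_000.0]

# Keep only the rows whose value is NOT in the outlier list.
cleaned = df[~df["AMT_INCOME_TOTAL"].isin(outliers)]
print(cleaned)
```

This returns a new DataFrame and leaves df untouched; assign the result back to the original name if the rows should be removed for good.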

Add new pandas DataFrame column whose values are an array of random numbers with length gleaned from another column

I have the following DataFrame of dummy data
data = { 'user_id': np.random.randint(1000000, 10000000, size=(10)), 'week': np.random.randint(1, 10, size=(10)) }
df = pd.DataFrame(data = data)
I would like to add a new column whose values are arrays of length week (with those arrays containing random values). None of these work
df.loc[:,'inputs'] = np.random.randint(0, 28, size=(10))
(gives one integer per DataFrame cell, not an array of them)
df.loc[:,'inputs'] = np.random.randint(0, 28, size=(df['week']))
ValueError: Length of values does not match length of index
df.loc[:,'inputs'] = np.random.randint(0, 28, size=(10, df['week']))
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
These are obviously all wrong, but I cannot see how to make a new column whose entries are each arrays where the length of those arrays depends on another column's value in the same row.

Use a list comprehension to build one array per row, with its length taken from week:
df['inputs'] = [np.random.randint(0, 28, size=x) for x in df['week']]
print (df)
   user_id  week                             inputs
0  9168288     4                     [15, 5, 10, 9]
1  2765768     7          [21, 26, 6, 6, 22, 21, 4]
2  2948278     6               [6, 14, 4, 2, 3, 20]
3  9302275     1                               [23]
4  5737115     5                 [1, 20, 9, 19, 18]
5  5214343     9  [16, 25, 1, 10, 2, 23, 1, 16, 18]
6  9332184     7          [8, 27, 14, 8, 14, 11, 5]
7  1569483     5                 [6, 19, 3, 10, 16]
8  2931319     2                            [0, 15]
9  2126334     2                           [20, 22]
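As a quick sanity check, the same idea can be sketched with a seeded generator and the array lengths compared against week. The use of np.random.default_rng and the seed value are assumptions made here for reproducibility, not part of the answer above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, only for repeatable output

df = pd.DataFrame({
    "user_id": rng.integers(1_000_000, 10_000_000, size=10),
    "week": rng.integers(1, 10, size=10),
})

# One random array per row, sized by that row's week value.
df["inputs"] = [rng.integers(0, 28, size=w) for w in df["week"]]

# Length of each stored array, to compare against the week column.
lengths = df["inputs"].map(len)
print((lengths == df["week"]).all())
```

The column ends up with object dtype, since each cell holds an array of a different length.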

Efficiently construct numpy matrix from offset ranges of 1D array [duplicate]

Let's say I have a Python NumPy array a.
a = numpy.array([1,2,3,4,5,6,7,8,9,10,11])
I want to create a matrix of subsequences of length 5 from this array, with stride 3. The resulting matrix will hence look as follows:
numpy.array([[1,2,3,4,5],[4,5,6,7,8],[7,8,9,10,11]])
One possible way of implementing this would be using a for-loop.
result_matrix = np.zeros((3, 5))
for row, start in enumerate(range(0, len(a) - 5 + 1, 3)):
    result_matrix[row] = a[start:start+5]
Is there a cleaner way to implement this in Numpy?

Approach #1 : Using broadcasting -
def broadcasting_app(a, L, S ): # Window len = L, Stride len/stepsize = S
    nrows = ((a.size-L)//S)+1
    return a[S*np.arange(nrows)[:,None] + np.arange(L)]
Approach #2 : Using more efficient NumPy strides -
def strided_app(a, L, S ): # Window len = L, Stride len/stepsize = S
    nrows = ((a.size-L)//S)+1
    n = a.strides[0]
    return np.lib.stride_tricks.as_strided(a, shape=(nrows,L), strides=(S*n,n))
Sample run -
In [143]: a
Out[143]: array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])
In [144]: broadcasting_app(a, L = 5, S = 3)
Out[144]:
array([[ 1,  2,  3,  4,  5],
       [ 4,  5,  6,  7,  8],
       [ 7,  8,  9, 10, 11]])
In [145]: strided_app(a, L = 5, S = 3)
Out[145]:
array([[ 1,  2,  3,  4,  5],
       [ 4,  5,  6,  7,  8],
       [ 7,  8,  9, 10, 11]])

Starting in Numpy 1.20, we can make use of the new sliding_window_view to slide/roll over windows of elements.
And coupled with a stepping [::3], it simply becomes:
from numpy.lib.stride_tricks import sliding_window_view
# values = np.array([1,2,3,4,5,6,7,8,9,10,11])
sliding_window_view(values, window_shape = 5)[::3]
# array([[ 1, 2, 3, 4, 5],
# [ 4, 5, 6, 7, 8],
# [ 7, 8, 9, 10, 11]])
where the intermediate result of the sliding is:
sliding_window_view(values, window_shape = 5)
# array([[ 1, 2, 3, 4, 5],
# [ 2, 3, 4, 5, 6],
# [ 3, 4, 5, 6, 7],
# [ 4, 5, 6, 7, 8],
# [ 5, 6, 7, 8, 9],
# [ 6, 7, 8, 9, 10],
# [ 7, 8, 9, 10, 11]])
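To see how the sliding_window_view route relates to the index-based one, here is a short sketch that builds the same windows both ways and compares them. The helper name windows_by_indexing is hypothetical; it follows the broadcasting idea of adding window start offsets to positions within a window.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view  # NumPy >= 1.20

def windows_by_indexing(a, L, S):
    """Windows of length L taken every S elements, built with an index grid."""
    nrows = (a.size - L) // S + 1
    starts = S * np.arange(nrows)[:, None]   # column of window start offsets
    return a[starts + np.arange(L)]          # fancy indexing returns a copy

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])

via_view = sliding_window_view(values, window_shape=5)[::3]
via_index = windows_by_indexing(values, L=5, S=3)

print(np.array_equal(via_view, via_index))
print(np.shares_memory(via_view, values), np.shares_memory(via_index, values))
```

The view-based result shares memory with values and is read-only by default, whereas the indexed result is an independent copy that can be modified freely.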

Modified version of @Divakar's code, with a check that the memory is contiguous and with the returned array made read-only. (Variable names changed for my DSP application.)
def frame(a, framelen, frameadv):
    """frame - Frame a 1D array
    a - 1D array
    framelen - Samples per frame
    frameadv - Samples between starts of consecutive frames
        Set to framelen for non-overlapping consecutive frames
    Modified from Divakar's 10/17/16 11:20 solution:
    https://stackoverflow.com/questions/40084931/taking-subarrays-from-numpy-array-with-given-stride-stepsize
    CAVEATS:
        Assumes array is contiguous
        Output is not writable as there are multiple views on the same memory
    """
    if not isinstance(a, np.ndarray) or \
       not (a.flags['C_CONTIGUOUS'] or a.flags['F_CONTIGUOUS']):
        raise ValueError("Input array a must be a contiguous numpy array")
    # Output
    nrows = ((a.size-framelen)//frameadv)+1
    oshape = (nrows, framelen)
    # Size of each element in a
    n = a.strides[0]
    # Indexing in the new object will advance by frameadv * element size
    ostrides = (frameadv*n, n)
    return np.lib.stride_tricks.as_strided(a, shape=oshape,
                                           strides=ostrides, writeable=False)

Clarification about flatten function in Theano

In [http://deeplearning.net/tutorial/lenet.html#lenet] it says:
# This will generate a matrix of shape (batch_size, nkerns[1] * 4 * 4),
# or (500, 50 * 4 * 4) = (500, 800) with the default values.
layer2_input = layer1.output.flatten(2)
When I use the flatten function on a numpy 3D array I get a 1D array, but here it says I get a matrix. How does flatten(2) work in Theano?
A similar example in numpy produces a 1D array:
a= array([[[ 1,  2,  3],
           [ 4,  5,  6],
           [ 7,  8,  9]],
          [[10, 11, 12],
           [13, 14, 15],
           [16, 17, 18]],
          [[19, 20, 21],
           [22, 23, 24],
           [25, 26, 27]]])
a.flatten(2)=array([ 1, 10, 19,  4, 13, 22,  7, 16, 25,  2, 11, 20,  5, 14, 23,  8, 17,
                    26,  3, 12, 21,  6, 15, 24,  9, 18, 27])

NumPy doesn't support flattening only some dimensions, but Theano does.
So if a is a numpy array, a.flatten(2) doesn't make any sense. It runs without error, but only because the 2 is passed as the order parameter; the result is still a 1D array rather than a matrix (in the question's output the elements come out in column-major order).
Theano's flatten does support specifying the number of output dimensions. The documentation explains how it works.
Parameters:
    x (any TensorVariable (or compatible)) – variable to be flattened
    outdim (int) – the number of dimensions in the returned variable
Return type:
    variable with same dtype as x and outdim dimensions
Returns:
    variable with the same shape as x in the leading outdim-1 dimensions,
    but with all remaining dimensions of x collapsed into the last dimension.
For example, if we flatten a tensor of shape (2, 3, 4, 5) with
flatten(x, outdim=2), then we’ll have the same (2-1=1) leading
dimensions (2,), and the remaining dimensions are collapsed. So the
output in this example would have shape (2, 60).
A simple Theano demonstration:
import numpy
import theano
import theano.tensor as tt
def compile():
    x = tt.tensor3()
    return theano.function([x], x.flatten(2))
def main():
    a = numpy.arange(2 * 3 * 4).reshape((2, 3, 4))
    f = compile()
    print a.shape, f(a).shape
main()
prints
(2L, 3L, 4L) (2L, 12L)
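For comparison, a rough NumPy counterpart of the outdim behaviour quoted above can be sketched with reshape. The function name flatten_outdim is hypothetical, and the sketch assumes the default row-major element order.

```python
import numpy as np

def flatten_outdim(a, outdim):
    """Keep the leading outdim-1 axes of a and collapse the rest into one."""
    lead = a.shape[:outdim - 1]
    # -1 lets NumPy infer the size of the collapsed trailing axis.
    return a.reshape(lead + (-1,))

a = np.arange(2 * 3 * 4).reshape((2, 3, 4))
print(a.shape, flatten_outdim(a, 2).shape)
```

With outdim=1 the leading part is empty, so the call reduces to an ordinary full flatten.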
