Most efficient method to loop through a dataframe and return a filtered array of values based on multiple criteria - pandas

I have a dataset of events; each event contains several elements, with positional data for every element recorded at various points in time. The full dataset is very large and covers many such events.
For each element at each point in time, I want to find the closest other element. As a first step, I planned to return an array of the positional data of all other elements at a specific point in time and store it in the same row of the original dataframe (to perform further calculations on later).
I have made two attempts at coding this, both included below. Both take too long on such a large dataset. Any suggestions for making this more efficient would be greatly appreciated.
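Approach 1:
import pandas as pd
import numpy as np
def func1(db, val, frame):
    # Select the rows matching this val and frameId, keeping only the name and position columns
    return (db.loc[(db['val'].isin([val])) & (db['frameId'].isin([frame])), ['displayName', 'x', 'y']]
            .reset_index(drop=True).values.tolist())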
d = pd.DataFrame({'displayName': ['Bob', 'Jane', 'Alice',
'Bob', 'Jane', 'Alice'],
'x': [90, 88, 86, 94, 91, 92],
'y': [24, 13, 18, 20, 15, 16],
'val': [201801, 201801, 201801, 201801, 201801, 201801],
'frameId': [1, 1, 1, 2, 2, 2]})
res = d.apply(lambda row: func1(d, row['val'], row['frameId']), axis=1)
Approach 2:
def func2(db, val, frame):
    return [l[[0, 1, 2]] for l in db if l[3] == val if l[4] == frame]
res = d.apply(lambda row: func2(np.array(d), row['val'], row['frameId']), axis=1)
The result (res) will thus be an array that looks like this:
[[['Bob', 90, 24], ['Jane', 88, 13], ['Alice', 86, 18]],
[['Bob', 90, 24], ['Jane', 88, 13], ['Alice', 86, 18]],
[['Bob', 90, 24], ['Jane', 88, 13], ['Alice', 86, 18]],
[['Bob', 94, 20], ['Jane', 91, 15], ['Alice', 92, 16]],
[['Bob', 94, 20], ['Jane', 91, 15], ['Alice', 92, 16]],
[['Bob', 94, 20], ['Jane', 91, 15], ['Alice', 92, 16]]]
However, on the large dataset this is very slow with both methods, so any way to reduce the running time would be welcome.

If the order along the first dimension of the 3D array does not matter, you can use the code below. If it does matter, you will have to create a series that groups by displayName (or by index) and takes the cumcount, sort by that, and then drop it. Let me know.
import pandas as pd
import numpy as np
d = pd.DataFrame({'displayName': ['Bob', 'Jane', 'Alice',
'Bob', 'Jane', 'Alice'],
'x': [90, 88, 86, 94, 91, 92],
'y': [24, 13, 18, 20, 15, 16],
'val': [201801, 201801, 201801, 201801, 201801, 201801],
'frameId': [1, 1, 1, 2, 2, 2]})
n = d['frameId'].max() + 1
x = d['displayName'].nunique()
pd.concat([d.iloc[:,0:3]]*n).to_numpy().reshape(d.shape[0],x,x)
Out[1]:
array([[['Bob', 90, 24],
['Jane', 88, 13],
['Alice', 86, 18]],
[['Bob', 94, 20],
['Jane', 91, 15],
['Alice', 92, 16]],
[['Bob', 90, 24],
['Jane', 88, 13],
['Alice', 86, 18]],
[['Bob', 94, 20],
['Jane', 91, 15],
['Alice', 92, 16]],
[['Bob', 90, 24],
['Jane', 88, 13],
['Alice', 86, 18]],
[['Bob', 94, 20],
['Jane', 91, 15],
['Alice', 92, 16]]], dtype=object)
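If the original row order does matter, another option is to note that every row in the same (val, frameId) group receives the same list, so each group's list can be built once and then looked up per row instead of filtering the whole dataframe for every row. The following is only a minimal sketch of that idea; the names keys, per_frame and res are illustrative, and it has not been benchmarked on the large dataset.

```python
import pandas as pd

d = pd.DataFrame({'displayName': ['Bob', 'Jane', 'Alice', 'Bob', 'Jane', 'Alice'],
                  'x': [90, 88, 86, 94, 91, 92],
                  'y': [24, 13, 18, 20, 15, 16],
                  'val': [201801] * 6,
                  'frameId': [1, 1, 1, 2, 2, 2]})

keys = ['val', 'frameId']

# One pass over the groups: (val, frameId) -> list of [displayName, x, y] rows
per_frame = {key: g[['displayName', 'x', 'y']].values.tolist()
             for key, g in d.groupby(keys)}

# One dictionary lookup per row instead of one full-dataframe filter per row
res = pd.Series([per_frame[k] for k in zip(d['val'], d['frameId'])],
                index=d.index)
```

Rows that belong to the same group share one list object here, so copy the lists before modifying them in place.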

Related

reshape numpy array, do a transformation and do the inverse reshape

Here is my problem:
I'm trying to perform an operation on a numpy array after reshaping it.
After this operation, I want to reshape the array again to recover the original shape with the same indexing.
So I want to find the appropriate "inverse reshape" such that inverse_reshape(reshape(a)) == a
length = 10
a = np.arange(length**2).reshape((length,length))
#a.shape = (10,10)
b = (a.reshape((length//2, 2, -1, 2))
.swapaxes(1, 2)
.reshape(-1, 2, 2))
#b.shape = (25,2,2)
b = my_function(b)
#b.shape = (25,2,2) still the same shape
# b --> a ?
I know that the numpy reshape function doesn't copy the array, but the swapaxes one does.
How can I get the appropriate reshaping?
Simply reverse the order of the a=>b conversion.
The original conversion went through these shapes:
In [53]: a.reshape((length//2, 2, -1, 2)).shape
Out[53]: (5, 2, 5, 2)
In [54]: a.reshape((length//2, 2, -1, 2)).swapaxes(1,2).shape
Out[54]: (5, 5, 2, 2)
In [55]: b.shape
Out[55]: (25, 2, 2)
So we need to get b back to the 4d shape, swap the axes back, and reshape to original a shape:
In [56]: b.reshape(5,5,2,2).swapaxes(1,2).reshape(10,10)
Out[56]:
array([[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
[20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
[30, 31, 32, 33, 34, 35, 36, 37, 38, 39],
[40, 41, 42, 43, 44, 45, 46, 47, 48, 49],
[50, 51, 52, 53, 54, 55, 56, 57, 58, 59],
[60, 61, 62, 63, 64, 65, 66, 67, 68, 69],
[70, 71, 72, 73, 74, 75, 76, 77, 78, 79],
[80, 81, 82, 83, 84, 85, 86, 87, 88, 89],
[90, 91, 92, 93, 94, 95, 96, 97, 98, 99]])
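The round trip can be checked with a small self-contained sketch. The helper names to_blocks and from_blocks are hypothetical, and my_function is left out (any function that keeps the (25, 2, 2) shape could be applied between the two calls).

```python
import numpy as np

def to_blocks(arr):
    """Split a 2D array into 2x2 blocks, shape (n_blocks, 2, 2)."""
    h, w = arr.shape
    return arr.reshape(h // 2, 2, w // 2, 2).swapaxes(1, 2).reshape(-1, 2, 2)

def from_blocks(blocks, shape):
    """Undo to_blocks: back to 4D, swap the axes back, reshape to 2D."""
    h, w = shape
    return blocks.reshape(h // 2, w // 2, 2, 2).swapaxes(1, 2).reshape(h, w)

length = 10
a = np.arange(length**2).reshape(length, length)

b = to_blocks(a)                    # shape (25, 2, 2)
restored = from_blocks(b, a.shape)  # shape (10, 10)
```

Here b[0] is the top-left 2x2 block of a, and restored compares equal to a element by element.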

Apply function to specific selected columns in pandas data frame

I have the following dataframe:
# List of Tuples
matrix = [([22, 23], [34, 35, 65], [23, 29, 31]),
([33, 34], [31, 44], [11, 16, 18]),
([44, 56, 76], [16, 34, 76], [21, 34]),
([55, 34], [32, 35, 38], [22, 24, 26]),
([66, 65, 67], [33, 38, 39], [27, 32, 34]),
([77, 39, 45], [35, 36, 38], [11, 21, 34])]
# Create a DataFrame object
df = pd.DataFrame(matrix, columns=list('xyz'), index=list('abcdef'))
I'm able to apply my custom function, which returns the first and last items of a list, to all columns like below:
def fl(x):
    return [x[0], x[len(x)-1]]
df.apply(lambda x : [fl(i) for i in x])
But I want to apply the function only to the selected columns x and z.
I have tried the following, referring to this link:
df.apply(lambda x: fl(x) if x in ['x', 'y'] else x)
and like this:
df[['x', 'y']].apply(fl)
How can I get the output with the function applied only to the x and z columns, leaving the y column unchanged?
Use DataFrame.applymap for elementwise processing; the last value can also be selected with [-1] indexing:
def fl(x):
    return [x[0], x[-1]]
df[['x', 'z']] = df[['x', 'z']].applymap(fl)
print (df)
          x             y         z
a  [22, 23]  [34, 35, 65]  [23, 31]
b  [33, 34]      [31, 44]  [11, 18]
c  [44, 76]  [16, 34, 76]  [21, 34]
d  [55, 34]  [32, 35, 38]  [22, 26]
e  [66, 67]  [33, 38, 39]  [27, 34]
f  [77, 45]  [35, 36, 38]  [11, 34]
Or, for a solution with DataFrame.apply, use zip, mapping the resulting tuples to lists and selecting the first and last values with str indexing:
def fl(x):
    return list(map(list, zip(x.str[0], x.str[-1])))
df[['x', 'z']] = df[['x', 'z']].apply(fl)
print (df)
          x             y         z
a  [22, 23]  [34, 35, 65]  [23, 31]
b  [33, 34]      [31, 44]  [11, 18]
c  [44, 76]  [16, 34, 76]  [21, 34]
d  [55, 34]  [32, 35, 38]  [22, 26]
e  [66, 67]  [33, 38, 39]  [27, 34]
f  [77, 45]  [35, 36, 38]  [11, 34]
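As a side note, DataFrame.applymap is deprecated in recent pandas releases (2.1 and later) in favour of DataFrame.map. A version-independent alternative, sketched below under the assumption that only a few columns need processing, is to loop over the selected columns and use Series.map on each one. The variable name cols is only illustrative.

```python
import pandas as pd

matrix = [([22, 23], [34, 35, 65], [23, 29, 31]),
          ([33, 34], [31, 44], [11, 16, 18]),
          ([44, 56, 76], [16, 34, 76], [21, 34]),
          ([55, 34], [32, 35, 38], [22, 24, 26]),
          ([66, 65, 67], [33, 38, 39], [27, 32, 34]),
          ([77, 39, 45], [35, 36, 38], [11, 21, 34])]
df = pd.DataFrame(matrix, columns=list('xyz'), index=list('abcdef'))

def fl(x):
    # First and last item of one list
    return [x[0], x[-1]]

cols = ['x', 'z']            # columns to transform; y is left untouched
for col in cols:
    df[col] = df[col].map(fl)
```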
I found the mistake I was making.
Thanks for the reply.
I changed the function as below:
def fl(x):
    new = []
    for i in x:
        new.append([i[0], i[-1]])
    return new
Then I applied the function like this:
df.apply(lambda x : fl(x) if x.name in ['x', 'z'] else x)
With that I get the expected output.

2D Numpy Array Boolean Slicing [duplicate]

I've got a strange situation.
I have a 2D Numpy array, x:
x = np.random.random_integers(0,5,(20,8))
And I have two indexers: one with indices for the rows, and one with indices for the columns. In order to index x, I have to do the following:
row_indices = [4,2,18,16,7,19,4]
col_indices = [1,2]
x_rows = x[row_indices,:]
x_indexed = x_rows[:,col_indices]
Instead of just:
x_new = x[row_indices,col_indices]
(which fails with: error, cannot broadcast (20,) with (2,))
I'd like to be able to do the indexing in one line using broadcasting, since that would keep the code clean and readable. Also, I don't know all that much about Python under the hood, but as I understand it, it should be faster to do it in one line (and I'll be working with pretty big arrays).
Test Case:
x = np.random.random_integers(0,5,(20,8))
row_indices = [4,2,18,16,7,19,4]
col_indices = [1,2]
x_rows = x[row_indices,:]
x_indexed = x_rows[:,col_indices]
x_doesnt_work = x[row_indices,col_indices]
Selections or assignments with np.ix_ using indexing or boolean arrays/masks
1. With indexing-arrays
A. Selection
We can use np.ix_ to get a tuple of indexing arrays that are broadcastable against each other, so that together they form every combination of the given row and column indices. When that tuple is used to index into the input array, it selects the corresponding higher-dimensional block. Hence, to make a selection based on two 1D indexing arrays, it would be -
x_indexed = x[np.ix_(row_indices,col_indices)]
B. Assignment
We can use the same notation for assigning a scalar or a broadcastable array into those indexed positions. Hence, the following works for assignments -
x[np.ix_(row_indices,col_indices)] = value  # value: a scalar or a broadcastable array
2. With masks
We can also use boolean arrays/masks with np.ix_, similar to how indexing arrays are used. This can be used again to select a block off the input array and also for assignments into it.
A. Selection
Thus, with row_mask and col_mask boolean arrays as the masks for row and column selections respectively, we can use the following for selections -
x[np.ix_(row_mask,col_mask)]
B. Assignment
And the following works for assignments -
x[np.ix_(row_mask,col_mask)] = value  # value: a scalar or a broadcastable array
Sample Runs
1. Using np.ix_ with indexing-arrays
Input array and indexing arrays -
In [221]: x
Out[221]:
array([[17, 39, 88, 14, 73, 58, 17, 78],
[88, 92, 46, 67, 44, 81, 17, 67],
[31, 70, 47, 90, 52, 15, 24, 22],
[19, 59, 98, 19, 52, 95, 88, 65],
[85, 76, 56, 72, 43, 79, 53, 37],
[74, 46, 95, 27, 81, 97, 93, 69],
[49, 46, 12, 83, 15, 63, 20, 79]])
In [222]: row_indices
Out[222]: [4, 2, 5, 4, 1]
In [223]: col_indices
Out[223]: [1, 2]
Tuple of indexing arrays with np.ix_ -
In [224]: np.ix_(row_indices,col_indices) # Broadcasting of indices
Out[224]:
(array([[4],
[2],
[5],
[4],
[1]]), array([[1, 2]]))
Make selections -
In [225]: x[np.ix_(row_indices,col_indices)]
Out[225]:
array([[76, 56],
[70, 47],
[46, 95],
[76, 56],
[92, 46]])
As suggested by the OP, this is in effect the same as performing old-school broadcasting with a 2D version of row_indices: its elements are placed along axis=0, which creates a singleton dimension at axis=1 and thus allows broadcasting against col_indices. So an alternative solution would be -
In [227]: x[np.asarray(row_indices)[:,None],col_indices]
Out[227]:
array([[76, 56],
[70, 47],
[46, 95],
[76, 56],
[92, 46]])
As discussed earlier, assignments use the same indexing expression on the left-hand side.
Row, col indexing arrays -
In [36]: row_indices = [1, 4]
In [37]: col_indices = [1, 3]
Make assignments with scalar -
In [38]: x[np.ix_(row_indices,col_indices)] = -1
In [39]: x
Out[39]:
array([[17, 39, 88, 14, 73, 58, 17, 78],
[88, -1, 46, -1, 44, 81, 17, 67],
[31, 70, 47, 90, 52, 15, 24, 22],
[19, 59, 98, 19, 52, 95, 88, 65],
[85, -1, 56, -1, 43, 79, 53, 37],
[74, 46, 95, 27, 81, 97, 93, 69],
[49, 46, 12, 83, 15, 63, 20, 79]])
Make assignments with a 2D block (broadcastable array) -
In [40]: rand_arr = -np.arange(4).reshape(2,2)
In [41]: x[np.ix_(row_indices,col_indices)] = rand_arr
In [42]: x
Out[42]:
array([[17, 39, 88, 14, 73, 58, 17, 78],
[88, 0, 46, -1, 44, 81, 17, 67],
[31, 70, 47, 90, 52, 15, 24, 22],
[19, 59, 98, 19, 52, 95, 88, 65],
[85, -2, 56, -3, 43, 79, 53, 37],
[74, 46, 95, 27, 81, 97, 93, 69],
[49, 46, 12, 83, 15, 63, 20, 79]])
2. Using np.ix_ with masks
Input array -
In [19]: x
Out[19]:
array([[17, 39, 88, 14, 73, 58, 17, 78],
[88, 92, 46, 67, 44, 81, 17, 67],
[31, 70, 47, 90, 52, 15, 24, 22],
[19, 59, 98, 19, 52, 95, 88, 65],
[85, 76, 56, 72, 43, 79, 53, 37],
[74, 46, 95, 27, 81, 97, 93, 69],
[49, 46, 12, 83, 15, 63, 20, 79]])
Input row, col masks -
In [20]: row_mask = np.array([0,1,1,0,0,1,0],dtype=bool)
In [21]: col_mask = np.array([1,0,1,0,1,1,0,0],dtype=bool)
Make selections -
In [22]: x[np.ix_(row_mask,col_mask)]
Out[22]:
array([[88, 46, 44, 81],
[31, 47, 52, 15],
[74, 95, 81, 97]])
Make assignments with scalar -
In [23]: x[np.ix_(row_mask,col_mask)] = -1
In [24]: x
Out[24]:
array([[17, 39, 88, 14, 73, 58, 17, 78],
[-1, 92, -1, 67, -1, -1, 17, 67],
[-1, 70, -1, 90, -1, -1, 24, 22],
[19, 59, 98, 19, 52, 95, 88, 65],
[85, 76, 56, 72, 43, 79, 53, 37],
[-1, 46, -1, 27, -1, -1, 93, 69],
[49, 46, 12, 83, 15, 63, 20, 79]])
Make assignments with a 2D block (broadcastable array) -
In [25]: rand_arr = -np.arange(12).reshape(3,4)
In [26]: x[np.ix_(row_mask,col_mask)] = rand_arr
In [27]: x
Out[27]:
array([[ 17, 39, 88, 14, 73, 58, 17, 78],
[ 0, 92, -1, 67, -2, -3, 17, 67],
[ -4, 70, -5, 90, -6, -7, 24, 22],
[ 19, 59, 98, 19, 52, 95, 88, 65],
[ 85, 76, 56, 72, 43, 79, 53, 37],
[ -8, 46, -9, 27, -10, -11, 93, 69],
[ 49, 46, 12, 83, 15, 63, 20, 79]])
What about:
x[row_indices][:,col_indices]
For example,
x = np.random.random_integers(0,5,(5,5))
## array([[4, 3, 2, 5, 0],
## [0, 3, 1, 4, 2],
## [4, 2, 0, 0, 3],
## [4, 5, 5, 5, 0],
## [1, 1, 5, 0, 2]])
row_indices = [4,2]
col_indices = [1,2]
x[row_indices][:,col_indices]
## array([[1, 5],
## [2, 0]])
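One caveat with the chained form: it is fine for selection, but it cannot be used for assignment, because x[row_indices] is a copy and the write never reaches x. A small sketch comparing it with np.ix_ follows; the variable names and the seeded random generator are only there for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(0, 6, size=(5, 5))   # values 0..5, like the example above
row_indices = [4, 2]
col_indices = [1, 2]

# Selection: both forms return the same 2x2 block
chained = x[row_indices][:, col_indices]
with_ix = x[np.ix_(row_indices, col_indices)]
same_selection = np.array_equal(chained, with_ix)

# Assignment through the chained form writes into a temporary copy
before = x.copy()
x[row_indices][:, col_indices] = -1
x_unchanged = np.array_equal(x, before)

# Assignment through np.ix_ modifies x itself
x[np.ix_(row_indices, col_indices)] = -1
```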
import numpy as np
x = np.random.random_integers(0,5,(4,4))
x
array([[5, 3, 3, 2],
[4, 3, 0, 0],
[1, 4, 5, 3],
[0, 4, 3, 4]])
# This indexes the elements 1,1 and 2,2 and 3,3
indexes = (np.array([1,2,3]),np.array([1,2,3]))
x[indexes]
# returns array([3, 5, 4])
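Notice that numpy has very different rules depending on what kind of indexes you use. To index several individual elements, pass a tuple of np.ndarray, one array per axis (see the indexing manual); the arrays are paired up element by element, so their shapes have to match or be broadcastable.
So converting your lists to np.ndarray is only part of the fix: to select a row-by-column block, the index arrays also have to be shaped so that they broadcast against each other.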
I think you are trying to do one of the following (equivalent) operations:
x_does_work = x[row_indices,:][:,col_indices]
x_does_work = x[:,col_indices][row_indices,:]
This will actually create a subset of x with only the selected rows, then select the columns from that, or vice versa in the second case. The first case can be thought of as
x_does_work = (x[row_indices,:])[:,col_indices]
Your first try would work if you convert row_indices to an array and write it with np.newaxis:
x_new = x[np.asarray(row_indices)[:, np.newaxis],col_indices]
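To see why the np.newaxis form works, it can help to look at the shapes involved. The sketch below uses a deterministic array instead of random data, and the names rows_2d, index_shape and x_two_step are only illustrative.

```python
import numpy as np

x = np.arange(20 * 8).reshape(20, 8)
row_indices = [4, 2, 18, 16, 7, 19, 4]
col_indices = [1, 2]

# A plain list has no [:, np.newaxis]; convert it to an array first
rows_2d = np.asarray(row_indices)[:, np.newaxis]        # shape (7, 1)

# (7, 1) broadcasts against (2,) to give a (7, 2) index grid
index_shape = np.broadcast(rows_2d, np.asarray(col_indices)).shape

x_new = x[rows_2d, col_indices]                  # one-step indexing
x_two_step = x[row_indices, :][:, col_indices]   # rows first, then columns
```

Both results have shape (7, 2) and hold the same values.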

TypeError: Expected binary or unicode string

I want to use tf.data to feed in my image data. I have read all the images in a folder into an np.array, and then used that np.array to create a tf.data.Dataset object. However, I got a TypeError. My code is shown below.
import os
from scipy.misc import imread
import numpy as np
import glob
import tensorflow as tf
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
image = []
img_dir = 'data/ILSVRC2012_test/*'
images = np.array([np.array(imread(data)) for data in glob.glob(img_dir)])
image_data = tf.data.Dataset.from_tensor_slices(images)
The error message is shown in the following block.
TypeError: Expected binary or unicode string, got array([[[184, 210, 225],
[184, 210, 225],
[184, 210, 225],
...,
[160, 185, 205],
[159, 184, 204],
[159, 184, 204]],
[[183, 209, 224],
[184, 210, 225],
[184, 210, 225],
...,
[159, 186, 205],
[159, 186, 205],
[159, 186, 205]],
[[184, 210, 225],
[184, 210, 225],
[185, 211, 226],
...,
[160, 187, 206],
[160, 187, 206],
[160, 187, 206]],
...,
[[ 65, 65, 15],
[ 71, 71, 17],
[ 75, 76, 19],
...,
[ 83, 83, 19],
[ 82, 87, 21],
[ 85, 85, 21]],
[[ 70, 70, 18],
[ 74, 75, 18],
[ 74, 78, 19],
...,
[ 77, 81, 20],
[ 78, 87, 24],
[ 77, 81, 20]],
[[ 71, 71, 17],
[ 73, 74, 17],
[ 77, 78, 20],
...,
[ 85, 86, 20],
[ 85, 85, 21],
[ 75, 74, 20]]], dtype=uint8)
Any help would be appreciated!

Shuffle groups of rows of a 2D array - NumPy

Suppose I have a (50, 5) array. Is there a way for me to shuffle it on the basis of groupings of rows/sequences of datapoints, i.e. instead of shuffling every row, shuffle chunks of say, 5 rows?
Thanks
Approach #1 : Here's an approach that reshapes into a 3D array based on the group size, indexes the blocks along the first axis with shuffled indices obtained from np.random.permutation, and finally reshapes back to 2D -
N = 5 # Blocks of N rows
M,n = a.shape[0]//N, a.shape[1]
out = a.reshape(M,-1,n)[np.random.permutation(M)].reshape(-1,n)
Sample run -
In [141]: a
Out[141]:
array([[89, 26, 12],
[97, 60, 96],
[94, 38, 54],
[41, 63, 29],
[88, 62, 48],
[95, 66, 32],
[28, 58, 80],
[26, 35, 89],
[72, 91, 38],
[26, 70, 93]])
In [142]: N = 2 # Blocks of N rows
In [143]: M,n = a.shape[0]//N, a.shape[1]
In [144]: a.reshape(M,-1,n)[np.random.permutation(M)].reshape(-1,n)
Out[144]:
array([[94, 38, 54],
[41, 63, 29],
[28, 58, 80],
[26, 35, 89],
[89, 26, 12],
[97, 60, 96],
[72, 91, 38],
[26, 70, 93],
[88, 62, 48],
[95, 66, 32]])
Approach #2 : One can also simply use np.random.shuffle for an in-place change (this relies on a.reshape returning a view of a, which is the case for a C-contiguous array) -
np.random.shuffle(a.reshape(M,-1,n))
Sample run -
In [156]: a
Out[156]:
array([[15, 12, 14],
[55, 39, 35],
[73, 78, 36],
[54, 52, 32],
[83, 34, 91],
[42, 11, 98],
[27, 65, 47],
[78, 75, 82],
[33, 52, 93],
[87, 51, 80]])
In [157]: N = 2 # Blocks of N rows
In [158]: M,n = a.shape[0]//N, a.shape[1]
In [159]: np.random.shuffle(a.reshape(M,-1,n))
In [160]: a
Out[160]:
array([[15, 12, 14],
[55, 39, 35],
[27, 65, 47],
[78, 75, 82],
[73, 78, 36],
[54, 52, 32],
[33, 52, 93],
[87, 51, 80],
[83, 34, 91],
[42, 11, 98]])
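Approach #1 can also be wrapped in a small function. The sketch below is one possible packaging rather than part of the approaches above: the name shuffle_blocks is hypothetical, it uses a seeded np.random.default_rng generator in place of the global np.random state, and it assumes the number of rows is a multiple of the block size.

```python
import numpy as np

def shuffle_blocks(a, block_size, rng):
    """Return a copy of 2D array a with blocks of block_size rows shuffled.

    Also returns the block order used, so the shuffle can be inspected.
    """
    n_rows, n_cols = a.shape
    if n_rows % block_size != 0:
        raise ValueError("number of rows must be a multiple of block_size")
    n_blocks = n_rows // block_size
    order = rng.permutation(n_blocks)            # shuffled block indices
    blocks = a.reshape(n_blocks, block_size, n_cols)
    return blocks[order].reshape(-1, n_cols), order

rng = np.random.default_rng(42)
a = np.arange(50 * 5).reshape(50, 5)
out, order = shuffle_blocks(a, 5, rng)
```

Rows inside each block keep their original order; only the position of whole blocks changes, and a itself is left untouched.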