How to sum pandas df rows where each cell contains a list? - pandas

I'm trying to sum the rows of my df as follows.
Let's say I have the df below (every cell in a row contains a vector/list of the same size).
In the real problem I have a large number of columns, and the number can vary, but I do have a list containing the names of those columns.
df = pd.DataFrame([
[[1,2,3],[1,2,3],[1,2,3]],
[[1,1,1],[1,1,1],[1,1,1]],
[[2,2,2],[2,2,2],[2,2,2]]
], columns=['a','b','c'])
I'm trying to create a new column that contains the element-wise sum of all the vectors in each row, the way np.array addition would, so that I get the following vectors as a result:
[3,6,9]
[3,3,3]
[6,6,6]
and not the concatenated lists that .sum(axis=1) produces:
[1,2,3,1,2,3,1,2,3]
[1,1,1,1,1,1,1,1,1]
[2,2,2,2,2,2,2,2,2]
Can anyone think of an idea, thanks in advance :)

If all the lists have the same length, build a NumPy array from them and sum along axis 1, which also improves performance:
df['Sum'] = np.array(df.to_numpy().tolist()).sum(axis=1).tolist()
print (df)
a b c Sum
0 [1, 2, 3] [1, 2, 3] [1, 2, 3] [3, 6, 9]
1 [1, 1, 1] [1, 1, 1] [1, 1, 1] [3, 3, 3]
2 [2, 2, 2] [2, 2, 2] [2, 2, 2] [6, 6, 6]
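Since the question mentions a variable set of columns whose names are kept in a list, here is a minimal sketch of the same idea restricted to those columns. The name `cols` is a placeholder for that list, and the sketch assumes every cell holds a list of the same length.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([
    [[1, 2, 3], [1, 2, 3], [1, 2, 3]],
    [[1, 1, 1], [1, 1, 1], [1, 1, 1]],
    [[2, 2, 2], [2, 2, 2], [2, 2, 2]],
], columns=['a', 'b', 'c'])

# Placeholder for the list of column names mentioned in the question.
cols = ['a', 'b', 'c']

# Shape is (n_rows, n_cols, vector_length) when all lists have equal length.
stacked = np.array(df[cols].to_numpy().tolist())

# Summing over axis 1 adds the vectors of each row element-wise.
df['Sum'] = stacked.sum(axis=1).tolist()
print(df)
```

If the lists differ in length, `np.array` cannot build a regular 3-D array and this approach does not apply.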

Another way using pd.Series.explode:
df['sum'] = df.apply(pd.Series.explode).sum(axis=1).groupby(level=0).agg(list)
Output:
a b c sum
0 [1, 2, 3] [1, 2, 3] [1, 2, 3] [3.0, 6.0, 9.0]
1 [1, 1, 1] [1, 1, 1] [1, 1, 1] [3.0, 3.0, 3.0]
2 [2, 2, 2] [2, 2, 2] [2, 2, 2] [6.0, 6.0, 6.0]

Related

pandas: most elegant way to pivot table on pattern in name of columns

Given the following DataFrame:
pd.DataFrame({
'x': [0, 1],
'y': [0, 1],
'a_idx': [0, 1],
'a_val': [2, 3],
'b_idx': [4, 5],
'b_val': [6, 7],
})
What is the cleanest way to pivot the DataFrame based on the prefix of the idx and val columns if you have an indeterminate amount of unique prefixes (a, b, ... n), so as to obtain the following DataFrame?
pd.DataFrame({
'x': [0, 1, 0, 1],
'y': [0, 1, 0, 1],
'key': ['a','a','b','b'],
'idx': [0, 1, 4, 5],
'val': [2, 3, 6, 7]
})
I am not very knowledgeable in pandas, so my easiest solution was to go earlier in the data generation process and generate a subset of the result DataFrame for each prefix in SQL, and then concat the result sets into a final DataFrame. I'm curious however if there is a simple way to do this using the API of pandas.DataFrame. Is there such a thing?
Let's try wide_to_long with extras:
(pd.wide_to_long(df,stubnames=['a','b'],
i=['x','y'],
j='key',
sep='_',
suffix='\\w+'
)
.unstack('key').stack(level=0).reset_index()
)
Or manually with melt:
out = df.melt(['x', 'y'])
out = (out.join(out['variable'].str.split('_', expand=True))
.rename(columns={0: 'key'})
.pivot_table(index=['x', 'y', 'key'], columns=[1], values='value')
.reset_index()
)
Output:
key x y level_2 idx val
0 0 0 a 0 2
1 0 0 b 4 6
2 1 1 a 1 3
3 1 1 b 5 7
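For reference, a self-contained sketch of the melt route that ends with exactly the column layout asked for in the question. The intermediate column name `field` is an arbitrary choice, and the sketch assumes every wide column is named `<prefix>_<idx|val>` with a single underscore and that pandas is recent enough (1.1 or later) for `pivot` to accept a list as `index`.

```python
import pandas as pd

df = pd.DataFrame({
    'x': [0, 1],
    'y': [0, 1],
    'a_idx': [0, 1],
    'a_val': [2, 3],
    'b_idx': [4, 5],
    'b_val': [6, 7],
})

# Long format: one row per (x, y, original column name).
long = df.melt(id_vars=['x', 'y'])

# Split "a_idx" into the prefix ("a") and the field ("idx").
long[['key', 'field']] = long['variable'].str.split('_', expand=True)

out = (long.pivot(index=['x', 'y', 'key'], columns='field', values='value')
           .reset_index()
           .rename_axis(columns=None)      # drop the leftover "field" axis label
           .sort_values(['key', 'x'])
           .reset_index(drop=True))
print(out)
```

Because the prefixes are read from the column names, nothing has to be listed by hand when more prefixes appear.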

Calculate statistics of one numpy array based on the values in a second numpy array

Let's say I have a 2-d numpy array
a = np.array([[1, 1, 2, 2],
[1, 1, 2, 2],
[3, 3, 4, 4],
[3, 3, 4, 4]])
and a 3-d numpy array like
b = np.array([[[1, 2, 8, 8],
[3, 4, 8, 8],
[8, 7, 0, 1],
[6, 5, 3, 2]],
[[1, 1, 1, 3],
[1, 1, 4, 2],
[0, 3, 2, 1],
[3, 2, 3, 9]]])
I want to calculate the statistics (mean, median, majority, sum, count,...) of b according to the "IDs" in a.
Example: sum should result in another array (or a list if that is easier), that gives the sum of the values in b. There are 4 unique "IDs" in a: 1,2,3,4, and 2 'layers' in b. For the 1's in a that is a sum of 10 (layer 0) and 4 (layer 1). For the 2's it's 32 (layer 0) and 10 (layer 1), and so on...
Expected result for sum:
sums = [[1, 10, 4],
[2, 32, 10],
[3, 26, 8],
[4, 6, 15]]
Expected result for mean:
avgs = [[1, 2.5, 1.0 ],
[2, 8.0, 2.5 ],
[3, 6.5, 2.0 ],
[4, 1.5, 3.75]]
My guess is that there is a handy function in numpy that does this already, but I am not sure what to search for exactly. Any pointers on how to do it, or what to search for, are much appreciated.
Update:
I came up with this for-loop, which is fine for very small arrays. However, my arrays are much larger than 4 by 4 and a faster implementation is needed.
result = []
ids = np.unique(a)
for id in ids:
    line = [id]
    for band in range(0, b.shape[0]):
        cell = b[band][np.where(a == id)]
        line.append(cell.mean())
        # line.append(cell.min())
        # line.append(cell.max())
        # line.append(cell.std())
        line.append(cell.sum())
        line.append(np.median(cell))
    result.append(line)
You can try the code below
cal_sums = [[b[j, :, :][np.argwhere(a==i)[:,0],np.argwhere(a==i)[:,1]].sum()
for i in np.unique(a)] for j in range(2)]
cal_mean = [[b[j, :, :][np.argwhere(a==i)[:,0],np.argwhere(a==i)[:,1]].mean()
for i in np.unique(a)] for j in range(2)]
sums = np.zeros((np.unique(a).size, b.shape[0]+1))
means = np.zeros((np.unique(a).size, b.shape[0]+1))
sums[:, 0] , sums[:,1:] = np.unique(a), np.asarray(cal_sums).T
means[:, 0] , means[:,1:] = np.unique(a), np.asarray(cal_mean).T
print(sums)
[[ 1. 10. 4.]
[ 2. 32. 10.]
[ 3. 26. 8.]
[ 4. 6. 15.]]
print(means)
[[1. 2.5 1. ]
[2. 8. 2.5 ]
[3. 6.5 2. ]
[4. 1.5 3.75]]
I tested it on a fairly large array and it is fast:
n = 1000
a = np.random.randint(1, 5, size=(n, n))
b = np.random.randint(1, 10, size=(2, n, n))
speed:
377 ms ± 3.04 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
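As an alternative that avoids looping over the IDs, sums and means per ID can be sketched with `np.bincount`, which is a different technique from the answer above. It relabels the IDs as 0..k-1 with `np.unique(..., return_inverse=True)` and accumulates each layer of `b` as weights. This covers sum, count, and mean only; median or majority would need another approach.

```python
import numpy as np

a = np.array([[1, 1, 2, 2],
              [1, 1, 2, 2],
              [3, 3, 4, 4],
              [3, 3, 4, 4]])
b = np.array([[[1, 2, 8, 8],
               [3, 4, 8, 8],
               [8, 7, 0, 1],
               [6, 5, 3, 2]],
              [[1, 1, 1, 3],
               [1, 1, 4, 2],
               [0, 3, 2, 1],
               [3, 2, 3, 9]]])

# Relabel the IDs as 0..k-1; ravel() keeps this working whether the
# inverse comes back flat or shaped like `a`.
ids, inv = np.unique(a, return_inverse=True)
inv = inv.ravel()

# Number of cells per ID, and per-layer sums of b grouped by ID.
counts = np.bincount(inv)
layer_sums = np.stack(
    [np.bincount(inv, weights=layer.ravel()) for layer in b], axis=1
)
layer_means = layer_sums / counts[:, None]

# Prepend the ID column to match the layout asked for in the question.
sums = np.column_stack([ids, layer_sums])
means = np.column_stack([ids, layer_means])
print(sums)
print(means)
```

The only Python-level loop left runs over the layers of `b`, not over the IDs or the cells.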

numpy: get indices where condition holds per row

I have an array such as the following:
In [70]: x
Out[70]:
array([[0, 1, 2],
[3, 4, 5]])
I am trying to get the indices per row where a condition holds, for example, x > 1.
Expected output is like ([2], [0, 1, 2])
I have tried numpy.where, numpy.nonzero, but they give strange results.
One approach -
r,c = np.where(x>1)
out = np.split(c, np.flatnonzero(r[1:] > r[:-1])+1)
Sample run -
In [140]: x
Out[140]:
array([[0, 2, 0, 1, 1],
[2, 2, 1, 2, 0],
[0, 2, 1, 1, 0],
[1, 0, 0, 2, 2]])
In [141]: r,c = np.where(x>1)
In [142]: np.split(c, np.flatnonzero(r[1:] > r[:-1])+1)
Out[142]: [array([1]), array([0, 1, 3]), array([1]), array([3, 4])]
Alternatively, we could use np.unique on the final step, like so -
np.split(c, np.unique(r, return_index=1)[1][1:])
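One thing to keep in mind with the split-based approaches is that a row with no matching element produces no entry at all, so the result can be shorter than the number of rows. If one entry per row is needed, a plain list comprehension over the boolean mask is a simple (if less vectorized) sketch:

```python
import numpy as np

x = np.array([[0, 1, 2],
              [3, 4, 5]])

# One index array per row, in row order.
per_row = [np.flatnonzero(row) for row in x > 1]
print(per_row)

# A row without any match yields an empty array instead of being skipped.
x2 = np.array([[0, 0],
               [5, 0]])
per_row2 = [np.flatnonzero(row) for row in x2 > 1]
print(per_row2)
```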

Reduce a dimension of numpy array by selecting

I have a 3d array
A = np.random.random((4,4,3))
and a index matrix
B = np.int_(np.random.random((4,4))*3)
How do I get a 2D array from A based on index matrix B?
In general, how to get a N-1 dimensional array from a ND array and a N-1 dimensional index array?
Let's take an example:
>>> A = np.random.randint(0,10,(3,3,2))
>>> A
array([[[0, 1],
[8, 2],
[6, 4]],
[[1, 0],
[6, 9],
[7, 7]],
[[1, 2],
[2, 2],
[9, 7]]])
Use fancy indexing to pick individual elements. Note that all the index arrays must have the same shape (or be broadcastable to a common shape), and that shape is the shape of the result.
>>> ind = np.arange(2)
>>> A[ind,ind,ind]
array([0, 9]) #Index (0,0,0) and (1,1,1)
>>> ind = np.arange(2).reshape(2,1)
>>> A[ind,ind,ind]
array([[0],
[9]])
So for your example we need to supply the grid for the first two dimensions:
>>> A = np.random.random((4,4,3))
>>> B = np.int_(np.random.random((4,4))*3)
>>> A
array([[[ 0.95158697, 0.37643036, 0.29175815],
[ 0.84093397, 0.53453123, 0.64183715],
[ 0.31189496, 0.06281937, 0.10008886],
[ 0.79784114, 0.26428462, 0.87899921]],
[[ 0.04498205, 0.63823379, 0.48130828],
[ 0.93302194, 0.91964805, 0.05975115],
[ 0.55686047, 0.02692168, 0.31065731],
[ 0.92822499, 0.74771321, 0.03055592]],
[[ 0.24849139, 0.42819062, 0.14640117],
[ 0.92420031, 0.87483486, 0.51313695],
[ 0.68414428, 0.86867423, 0.96176415],
[ 0.98072548, 0.16939697, 0.19117458]],
[[ 0.71009607, 0.23057644, 0.80725518],
[ 0.01932983, 0.36680718, 0.46692839],
[ 0.51729835, 0.16073775, 0.77768313],
[ 0.8591955 , 0.81561797, 0.90633695]]])
>>> B
array([[1, 2, 0, 0],
[1, 2, 0, 1],
[2, 1, 1, 1],
[1, 2, 1, 2]])
>>> x,y = np.meshgrid(np.arange(A.shape[0]),np.arange(A.shape[1]))
>>> x
array([[0, 1, 2, 3],
[0, 1, 2, 3],
[0, 1, 2, 3],
[0, 1, 2, 3]])
>>> y
array([[0, 0, 0, 0],
[1, 1, 1, 1],
[2, 2, 2, 2],
[3, 3, 3, 3]])
>>> A[x,y,B]
array([[ 0.37643036, 0.48130828, 0.24849139, 0.71009607],
[ 0.53453123, 0.05975115, 0.92420031, 0.36680718],
[ 0.10008886, 0.02692168, 0.86867423, 0.16073775],
[ 0.26428462, 0.03055592, 0.16939697, 0.90633695]])
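A caveat about the grid: `np.meshgrid` defaults to `indexing='xy'`, so `x` varies along the second axis and `A[x, y, B]` actually returns `A[j, i, B[i, j]]`. In the session above, for instance, the entry in row 0, column 1 is `A[1, 0, 2]`. If the intended result is `A[i, j, B[i, j]]`, passing `indexing='ij'` is one way to get it. The sketch below uses a non-square array, where the default would fail to broadcast at all; the random data is only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((4, 5, 3))
B = rng.integers(0, 3, size=(4, 5))

# With indexing='ij', i[p, q] == p and j[p, q] == q.
i, j = np.meshgrid(np.arange(A.shape[0]), np.arange(A.shape[1]), indexing='ij')

# picked[p, q] == A[p, q, B[p, q]]
picked = A[i, j, B]
print(picked.shape)
```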
If you prefer to use mesh as suggested by Daniel, you may also use
A[tuple(np.ogrid[:A.shape[0], :A.shape[1]]) + (B,)]
to work with sparse indices. In the general case you could use
A[tuple(np.ogrid[tuple(slice(0, end) for end in A.shape[:-1])]) + (B,)]
Note that this may also be used when you'd like to index by B an axis different from the last one (see for example this answer about inserting an element into a list).
Otherwise you can do it using broadcasting:
A[np.arange(A.shape[0])[:, np.newaxis], np.arange(A.shape[1])[np.newaxis, :], B]
This may be generalized too but it's a bit more complicated.
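For the general N-dimensional case, `np.take_along_axis` (available since NumPy 1.15) may be the shortest route. This is a sketch assuming the axis to be reduced is the last one and that `B` has the shape `A.shape[:-1]`; the random 4-D data is only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((2, 3, 4, 5))
B = rng.integers(0, A.shape[-1], size=A.shape[:-1])

# B needs a trailing length-1 axis so it has the same number of
# dimensions as A; the selected axis is squeezed away afterwards.
picked = np.take_along_axis(A, B[..., np.newaxis], axis=-1).squeeze(axis=-1)
print(picked.shape)
```

To select along a different axis, insert the new axis of `B` at that position and pass the same `axis` to both calls.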

referencing rows in a matrix using index from another matrix

You have an original sparse matrix X:
>>print type(X)
>>print X.todense()
<class 'scipy.sparse.csr.csr_matrix'>
[[1,4,3]
[3,4,1]
[2,1,1]
[3,6,3]]
You have a second sparse matrix Z, which is derived from some rows of X (say the values are doubled so we can see the difference between the two matrices). In pseudo-code:
>>Z = X[[0,2,3]]
>>print Z.todense()
[[1,4,3]
[2,1,1]
[3,6,3]]
>>Z = Z*2
>>print Z.todense()
[[2, 8, 6]
[4, 2, 2]
[6, 12,6]]
What's the best way of retrieving the rows in Z using the ORIGINAL indices from X. So for instance, in pseudo-code:
>>print Z[[0,3]]
[[2,8,6] #0 from Z, and what would be row **0** from X)
[6,12,6]] #2 from Z, but what would be row **3** from X)
That is, how can you retrieve rows from Z using indices that refer to the rows' original positions in the original matrix X? To do this, you can't modify X in any way (you can't add an index column to the matrix X), but there are no other limits.
If you have the original indices in an array i, and the values in i are in increasing order (as in your example), you can use numpy.searchsorted(i, [0, 3]) to find the indices in Z that correspond to indices [0, 3] in the original X. Here's a demonstration in an IPython session:
In [39]: X = csr_matrix([[1,4,3],[3,4,1],[2,1,1],[3,6,3]])
In [40]: X.todense()
Out[40]:
matrix([[1, 4, 3],
[3, 4, 1],
[2, 1, 1],
[3, 6, 3]])
In [41]: i = array([0, 2, 3])
In [42]: Z = 2 * X[i]
In [43]: Z.todense()
Out[43]:
matrix([[ 2, 8, 6],
[ 4, 2, 2],
[ 6, 12, 6]])
In [44]: Zsub = Z[searchsorted(i, [0, 3])]
In [45]: Zsub.todense()
Out[45]:
matrix([[ 2, 8, 6],
[ 6, 12, 6]])
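The session above relies on names such as `array` and `searchsorted` being pre-imported. A self-contained sketch of the same lookup, with explicit imports and a check that every requested index was actually kept in Z, could look like this (`wanted` and `pos` are names chosen for illustration):

```python
import numpy as np
from scipy.sparse import csr_matrix

X = csr_matrix([[1, 4, 3],
                [3, 4, 1],
                [2, 1, 1],
                [3, 6, 3]])

# Rows of X that were kept in Z; must be sorted for searchsorted to work.
i = np.array([0, 2, 3])
Z = 2 * X[i]

# Original row numbers (relative to X) that we want to look up in Z.
wanted = np.array([0, 3])

# searchsorted only returns insertion points, so verify membership first.
if not np.isin(wanted, i).all():
    raise KeyError("some requested rows of X are not present in Z")

pos = np.searchsorted(i, wanted)   # positions of those rows inside Z
Zsub = Z[pos]
print(Zsub.toarray())
```

If `i` is not sorted, a dictionary mapping each original index to its position in Z would serve the same purpose.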