Reading values within pandas.groupby - pandas

I have a dataframe like below
    name item
0   Jack    A
1  Sarah    B
2   Ross    A
3   Sean    C
4   Jack    C
5   Ross    B
What I would like to do is produce a dictionary that maps each person to an indicator vector over the items (A, B, C) they are related to.
{Jack: [1, 0, 1], Sarah: [0, 1, 0], Ross:[1, 1, 0], Sean:[0, 0, 1]}
I feel like this should be done fairly easily using pandas.groupby
I have tried looping through the dataframe, but I have >1E7 entries, and looping does not look very efficient.

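Use crosstab with to_dict. Spell out the orient as 'list'; the abbreviated 'l' is no longer accepted by recent pandas versions.
pd.crosstab(df['item'], df['name']).to_dict('list')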
{'Jack': [1, 0, 1], 'Ross': [1, 1, 0], 'Sarah': [0, 1, 0], 'Sean': [0, 0, 1]}
Another interesting option is using str.get_dummies. The level argument of sum and max has been removed in recent pandas versions, so group on the index level instead:
# if you need counts
df.set_index('item')['name'].str.get_dummies().groupby(level=0).sum().to_dict('list')
# if you want to record boolean indicators
df.set_index('item')['name'].str.get_dummies().groupby(level=0).max().to_dict('list')
# {'Jack': [1, 0, 1], 'Ross': [1, 1, 0], 'Sarah': [0, 1, 0], 'Sean': [0, 0, 1]}
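For reference, here is a self-contained sketch of the crosstab approach on the sample data from the question. The helper name people_to_items is made up for illustration, and the column labels are assumed to be exactly 'name' and 'item'.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "name": ["Jack", "Sarah", "Ross", "Sean", "Jack", "Ross"],
    "item": ["A", "B", "A", "C", "C", "B"],
})

def people_to_items(frame):
    # Rows are items, columns are names; each cell counts co-occurrences.
    table = pd.crosstab(frame["item"], frame["name"])
    # orient="list" maps each column label (a name) to its column values,
    # ordered by the sorted item labels (A, B, C).
    return table.to_dict("list")

result = people_to_items(df)
print(result)
```

Because crosstab counts occurrences, a person who appears twice with the same item would get a 2 rather than a 1; clip the table to 1 first if only indicators are wanted.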

Related

Set all elements left to index to one, right of index to zero for list of indices

Say I have a list of Indices:
np.array([1, 3, 2, 4])
How do I create the following matrix, where in each row all elements up to and including the index are ones and all elements to the right of it are zeros?
[[1, 1, 0, 0, 0, 0],
[1, 1, 1, 1, 0, 0],
[1, 1, 1, 0, 0, 0],
[1, 1, 1, 1, 1, 0]]
1*(np.arange( 6 ) <= arr[:,None])
# array([[1, 1, 0, 0, 0, 0],
# [1, 1, 1, 1, 0, 0],
# [1, 1, 1, 0, 0, 0],
# [1, 1, 1, 1, 1, 0]])
np.arange(6) has shape (6,) and arr[:, None] has shape (4, 1), so the comparison broadcasts to a (4, 6) boolean array: one row per index, one column per position. Multiplying by 1 converts the booleans to integers.
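A minimal sketch that makes the shapes explicit, assuming a fixed width of 6 columns as in the question. The variable names cols, mask and result are only illustrative.

```python
import numpy as np

arr = np.array([1, 3, 2, 4])

cols = np.arange(6)            # shape (6,): column positions 0..5
mask = cols <= arr[:, None]    # (6,) against (4, 1) broadcasts to (4, 6)
result = mask.astype(int)      # same effect as multiplying by 1

print(mask.shape)
print(result)
```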

how to get row indices where row slice contains a single value (0)

With the numpy array
arr = np.array([[1, 1, 0, 0, 0, 1], [1, 1, 0, 0, 1, 1], [1, 1, 0, 0, 0, 1]])
I would like to get the indices of all rows where the row slice 2:5 contains all zeros.
In the above example, it should return rows 0 and 2.
I tried:
zero_indices = np.where(not np.any(arr[:,2:5]))
but it doesn't seem to work.
I'm trying to do this over a large array with several million rows.
Try this
np.nonzero((~arr[:,2:5].astype(bool)).all(1))[0]
Out[133]: array([0, 2], dtype=int32)
Or
np.nonzero((arr[:,2:5] == 0).all(1))[0]
Out[139]: array([0, 2], dtype=int32)
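As a sketch of why the attempt in the question fails: np.any without an axis reduces the whole slice to a single boolean, so there is nothing left to index per row. Passing axis=1 keeps one result per row. The names window, whole, per_row and zero_rows are only for illustration.

```python
import numpy as np

arr = np.array([[1, 1, 0, 0, 0, 1],
                [1, 1, 0, 0, 1, 1],
                [1, 1, 0, 0, 0, 1]])

window = arr[:, 2:5]

# Without an axis, the reduction runs over every element of the slice.
whole = np.any(window)

# With axis=1, there is one boolean per row.
per_row = window.any(axis=1)

# Rows whose slice contains no nonzero value.
zero_rows = np.flatnonzero(~per_row)
print(zero_rows)
```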

Numpy Vectorization: add row above to current row on ndarray

I would like to add the values in the above row to the row below using vectorization. For example, if I had the ndarray,
[[0, 0, 0, 0],
[1, 1, 1, 1],
[2, 2, 2, 2],
[3, 3, 3, 3]]
Then after one iteration through this method, it would result in
[[0, 0, 0, 0],
[1, 1, 1, 1],
[3, 3, 3, 3],
[5, 5, 5, 5]]
One can simply do this with a for loop:
import numpy as np
def addAboveRow(arr):
    cpy = arr.copy()
    r, c = arr.shape
    for i in range(1, r):
        for j in range(c):
            cpy[i][j] += arr[i - 1][j]
    return cpy
ndarr = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]).reshape(4, 4)
print(addAboveRow(ndarr))
I'm not sure how to approach this using vectorization though. I think slicers should be used? Also, I'm not really sure how to deal with the issue of the top border, because nothing should be added onto the first row. Any help would be appreciated. Thanks!
Note: I am really new to vectorization so an explanation would be great!
You can use indexing directly:
b = np.zeros_like(a)
b[0] = a[0]
b[1:] = a[1:] + a[:-1]
>>> b
array([[0, 0, 0, 0],
[1, 1, 1, 1],
[3, 3, 3, 3],
[5, 5, 5, 5]])
An alternative:
b = a.copy()
b[1:] += a[:-1]
Or:
b = a.copy()
np.add(b[1:], a[:-1], out=b[1:])
You could also try the following, which modifies arr in place:
np.put(arr, np.arange(arr.shape[1], arr.size), arr[1:]+arr[:-1])
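Since the question asked for an explanation, here is a short sketch of the slicing idea. a[1:] is every row except the first and a[:-1] is every row except the last; both have shape (r - 1, c), so adding them pairs row i with row i - 1. The first row is never touched, which takes care of the top border. The function name add_above_row is only illustrative.

```python
import numpy as np

def add_above_row(a):
    # Work on a copy so the sums always use the original rows.
    out = a.copy()
    # Rows 1..r-1 each receive the original row directly above them.
    out[1:] += a[:-1]
    return out

a = np.repeat(np.arange(4), 4).reshape(4, 4)
result = add_above_row(a)
print(result)
```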

numpy: get indices where condition holds per row

I have an array such as the following:
In [70]: x
Out[70]:
array([[0, 1, 2],
[3, 4, 5]])
I am trying to get the indices per row where a condition holds, for example, x > 1.
Expected output is like ([2], [0, 1, 2])
I have tried numpy.where, numpy.nonzero, but they give strange results.
One approach -
r,c = np.where(x>1)
out = np.split(c, np.flatnonzero(r[1:] > r[:-1])+1)
Sample run -
In [140]: x
Out[140]:
array([[0, 2, 0, 1, 1],
[2, 2, 1, 2, 0],
[0, 2, 1, 1, 0],
[1, 0, 0, 2, 2]])
In [141]: r,c = np.where(x>1)
In [142]: np.split(c, np.flatnonzero(r[1:] > r[:-1])+1)
Out[142]: [array([1]), array([0, 1, 3]), array([1]), array([3, 4])]
Alternatively, we could use np.unique on the final step, like so -
np.split(c, np.unique(r, return_index=True)[1][1:])
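One caveat: both variants above split only where the row index changes, so a row with no matches produces no entry at all. A possible sketch that returns one (possibly empty) array per row, relying on the fact that np.nonzero returns indices in row-major order; the helper name indices_per_row is hypothetical.

```python
import numpy as np

def indices_per_row(mask):
    # Row and column indices of the True entries, sorted by row.
    r, c = np.nonzero(mask)
    # For each row k >= 1, find where its entries start within r.
    boundaries = np.searchsorted(r, np.arange(1, mask.shape[0]))
    return np.split(c, boundaries)

x = np.array([[0, 1, 2],
              [0, 0, 0],
              [3, 4, 5]])
out = indices_per_row(x > 1)
print(out)
```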

scipy: Adding a sparse vector to a specific row of a sparse matrix

In Python, what is the best way to add a CSR vector to a specific row of a CSR matrix? I found one workaround here, but I am wondering if there is a better or more efficient way to do this. I would appreciate any help.
Given an NxM CSR matrix A and a 1xM CSR matrix B, and a row index i, the goal is to add B to the i-th row of A efficiently.
The obvious indexed addition does work. It gives an efficiency warning, but that doesn't mean it is the slowest way, just that you shouldn't count on doing this repeatedly. The warning suggests working with the lil format, but converting to that and back probably takes more time than performing the addition on the csr matrix.
In [1049]: B.A
Out[1049]:
array([[0, 9, 0, 0, 1, 0],
[2, 0, 5, 0, 0, 9],
[0, 2, 0, 0, 0, 0],
[2, 0, 0, 0, 0, 0],
[0, 9, 5, 3, 0, 7],
[1, 0, 0, 8, 9, 0]], dtype=int32)
In [1051]: B[1,:] += np.array([1,0,1,0,0,0])
/usr/local/lib/python3.5/dist-packages/scipy/sparse/compressed.py:730: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
SparseEfficiencyWarning)
In [1052]: B
Out[1052]:
<6x6 sparse matrix of type '<class 'numpy.int32'>'
with 17 stored elements in Compressed Sparse Row format>
In [1053]: B.A
Out[1053]:
array([[0, 9, 0, 0, 1, 0],
[3, 0, 6, 0, 0, 9],
[0, 2, 0, 0, 0, 0],
[2, 0, 0, 0, 0, 0],
[0, 9, 5, 3, 0, 7],
[1, 0, 0, 8, 9, 0]])
As your linked question shows, it is possible to act directly on the attributes of the sparse matrix. The code there shows why there's an efficiency warning: in the general case it has to rebuild the matrix attributes.
lil is more efficient for row replacement because it just has to change a sublist in the matrix .data and .rows attributes. A change in one row doesn't change the attributes of any of the others.
That said, if your addition has the same sparsity as the original row, it is possible to change specific elements of the data attribute without reworking .indices or .indptr. Drawing on the linked code,
A.data[idx_start_row:idx_end_row]
is the slice of A.data that will be changed. You of course also need the corresponding slice from the 'vector'.
Starting with the In [1049] B
In [1085]: B.indptr
Out[1085]: array([ 0, 2, 5, 6, 7, 11, 14], dtype=int32)
In [1086]: B.data
Out[1086]: array([9, 1, 2, 5, 9, 2, 2, 9, 5, 3, 7, 1, 8, 9], dtype=int32)
In [1087]: B.indptr[[1,2]] # row 1
Out[1087]: array([2, 5], dtype=int32)
In [1088]: B.data[2:5]
Out[1088]: array([2, 5, 9], dtype=int32)
In [1089]: B.indices[2:5] # row 1 column indices
Out[1089]: array([0, 2, 5], dtype=int32)
In [1090]: B.data[2:5] += np.array([1,2,3])
In [1091]: B.A
Out[1091]:
array([[ 0, 9, 0, 0, 1, 0],
[ 3, 0, 7, 0, 0, 12],
[ 0, 2, 0, 0, 0, 0],
[ 2, 0, 0, 0, 0, 0],
[ 0, 9, 5, 3, 0, 7],
[ 1, 0, 0, 8, 9, 0]], dtype=int32)
Notice where the changed values, [3,7,12], are in the lil format:
In [1092]: B.tolil().data
Out[1092]: array([[9, 1], [3, 7, 12], [2], [2], [9, 5, 3, 7], [1, 8, 9]], dtype=object)
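The same-sparsity update from the session above can be wrapped in a small helper. This is only a sketch: the function name add_same_sparsity is made up, and it assumes values holds exactly one number per stored entry of row i, in the order of that row's column indices.

```python
import numpy as np
from scipy import sparse

def add_same_sparsity(A, i, values):
    """Add values to the stored entries of row i of a CSR matrix, in place."""
    start, end = A.indptr[i], A.indptr[i + 1]
    if len(values) != end - start:
        raise ValueError("values must match the number of stored entries in row i")
    # Only .data changes; .indices and .indptr are left untouched.
    A.data[start:end] += values

dense = np.array([[0, 9, 0, 0, 1, 0],
                  [2, 0, 5, 0, 0, 9],
                  [0, 2, 0, 0, 0, 0],
                  [2, 0, 0, 0, 0, 0],
                  [0, 9, 5, 3, 0, 7],
                  [1, 0, 0, 8, 9, 0]])
B = sparse.csr_matrix(dense)
add_same_sparsity(B, 1, np.array([1, 2, 3]))
print(B.toarray())
```

No SparseEfficiencyWarning is raised here because the sparsity structure never changes.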
csr / csc matrices are efficient for most operations, including addition (O(nnz)). However, small changes that affect the sparsity structure, such as your example or even switching a single position from 0 to 1, are not, because they require an O(nnz) reorganisation of the representation. Values and indices are stored packed, so inserting one entry means every entry after it has to move.
If you do just a single such operation, my guess would be that you can't easily beat scipy's implementation. However, if you are adding multiple rows, for example, it may be worthwhile to first build a sparse matrix from them and then add that in one go.
Creating a csr matrix by hand from rows, say, is not that difficult. For example if your rows are dense and in order:
row_numbers, indices = np.where(rows)
data = rows[row_numbers, indices]
indptr = np.searchsorted(np.r_[true_row_numbers[row_numbers], N], np.arange(N+1))
If you have a collection of sparse rows and their row numbers:
data = np.r_[tuple([r.data for r in rows])]
indices = np.r_[tuple([r.indices for r in rows])]
# r.nnz is the number of stored entries in each row (len() is not defined for sparse matrices)
jumps = np.add.accumulate([0] + [r.nnz for r in rows])
indptr = np.repeat(jumps, np.diff(np.r_[-1, true_row_numbers, N]))
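To make the dense-rows recipe above concrete, here is an end-to-end sketch. The sizes N and M, the identity matrix used for A, and the sample rows are all assumptions chosen for illustration; true_row_numbers is assumed to be sorted, as the recipe requires.

```python
import numpy as np
from scipy import sparse

N, M = 6, 6
A = sparse.csr_matrix(np.eye(N, M, dtype=int))

# Two dense rows to be added to rows 1 and 4 of A.
rows = np.array([[1, 0, 1, 0, 0, 0],
                 [0, 0, 0, 2, 0, 0]])
true_row_numbers = np.array([1, 4])

# Positions and values of the nonzero entries, in row-major order.
row_numbers, indices = np.where(rows)
data = rows[row_numbers, indices]

# indptr[k] is the number of stored entries in rows before row k.
indptr = np.searchsorted(np.r_[true_row_numbers[row_numbers], N],
                         np.arange(N + 1))

update = sparse.csr_matrix((data, indices, indptr), shape=(N, M))

# A single sparse addition instead of one indexed update per row.
result = A + update
print(result.toarray())
```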