Vectorized gather operation in numpy

Given this (sample) data
target_slots = np.array([1, 3, 1, 0, 8, 5, 8, 1, 1, 2])
dummy_elements = np.arange(10*D).reshape(10, D)
is there any way to express the following operation as a vectorized numpy expression?
gathered_results = np.zeros((num_slots, D))
for i, target in enumerate(target_slots):
    gathered_results[target] += dummy_elements[i]
This operation looks like a bincount, but instead of counting we sum rows of another array.
(It is implied that np.max(target_slots) < num_slots, np.min(target_slots) >= 0 and target_slots.shape[0] == dummy_elements.shape[0].)
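Neither D nor num_slots is given a concrete value in the snippets here; a minimal setup consistent with the stated constraints might be (the values below are purely illustrative assumptions):
import numpy as np

D = 4            # number of columns, chosen arbitrarily for illustration
num_slots = 10   # must be larger than target_slots.max()
target_slots = np.array([1, 3, 1, 0, 8, 5, 8, 1, 1, 2])
dummy_elements = np.arange(10 * D).reshape(10, D)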

Approach #1
You are performing interval-based summing: selecting rows of dummy_elements and adding them into specific rows of the output array. So, one obvious choice for a vectorized solution would be np.add.reduceat, like so -
sidx = target_slots.argsort()
out = np.zeros((num_slots, D))
unq, shift_idx = np.unique(target_slots[sidx],return_index=True)
out[unq] = np.add.reduceat(dummy_elements[sidx],shift_idx)
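Assuming the illustrative setup above, the result can be checked against the loop baseline from the question:
# Loop baseline from the question, reused below as the reference result
expected = np.zeros((num_slots, D))
for i, target in enumerate(target_slots):
    expected[target] += dummy_elements[i]

assert np.allclose(out, expected)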
Approach #2
Alternatively, we can use np.bincount to perform these ID-based summations. One way would be with a loop that iterates over the columns of dummy_elements, which I think would be beneficial when the number of such columns is comparatively small. The implementation would look like this -
out = np.zeros((num_slots, D))
for i in range(D):
    out[:, i] = np.bincount(target_slots, dummy_elements[:, i], minlength=num_slots)
Approach #3
A vectorized version of the same would be like this -
ids = (target_slots[:, None] + np.arange(D) * num_slots).ravel('F')
out = np.bincount(ids, dummy_elements.ravel('F'),
                  minlength=num_slots * D).reshape(D, -1).T
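Both bincount-based variants can be verified against the same baseline computed in the earlier snippet:
# `out` here is the result of Approach #2 or Approach #3
assert out.shape == (num_slots, D)
assert np.allclose(out, expected)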

Related

What is the most efficient method for calculating per-row historical values in a large pandas dataframe?

Say I have two pandas dataframes (df_a & df_b), where each row represents a toy and features about that toy. Some pretend features:
Was_Sold (Y/N)
Color
Size_Group
Shape
Date_Made
Say df_a is relatively small (10s of thousands of rows) and df_b is relatively large (>1 million rows).
Then for every row in df_a, I want to:
Find all the toys from df_b with the same type as the one from df_a (e.g. the same color group)
The df_b toys must also be made before the given df_a toy
Then find the ratio of those sold (So count sold / count all matched)
What is the most efficient means to make those per-row calculations above?
The best I've come up with so far is something like the below.
(Note code might have an error or two as I'm rough typing from a different use case)
cols = ['Color', 'Size_Group', 'Shape']
# Run this calculation for multiple features
for col in cols:
    print(col + ' - Started')
    # Empty list to build up the calculation in
    ratio_list = []
    # Start the iteration
    for row in df_a.itertuples(index=False):
        # Relevant values from df_a
        relevant_val = getattr(row, col)
        created_date = row.Date_Made
        # df to keep the overall prior toy matches
        prior_toys = df_b[(df_b.Date_Made < created_date) & (df_b[col] == relevant_val)]
        prior_count = len(prior_toys)
        # Now find the ones that were sold
        prior_sold_count = len(prior_toys[prior_toys.Was_Sold == "Y"])
        # Now make the calculation and append to the list
        if prior_count == 0:
            ratio = 0
        else:
            ratio = prior_sold_count / prior_count
        ratio_list.append(ratio)
    # Store the calculation in the original df_a
    df_a[col + '_Prior_Sold_Ratio'] = ratio_list
    print(col + ' - Finished')
Using .itertuples() is useful, but this is still pretty slow. Is there a more efficient method or something I'm missing?
EDIT
Added the below script, which emulates data for the above scenario:
import numpy as np
import pandas as pd

colors = ['red', 'green', 'yellow', 'blue']
sizes = ['small', 'medium', 'large']
shapes = ['round', 'square', 'triangle', 'rectangle']
sold = ['Y', 'N']
size_df_a = 200
size_df_b = 2000
date_start = pd.to_datetime('2015-01-01')
date_end = pd.to_datetime('2021-01-01')

def random_dates(start, end, n=10):
    start_u = start.value // 10**9
    end_u = end.value // 10**9
    return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')

df_a = pd.DataFrame(
    {
        'Color': np.random.choice(colors, size_df_a),
        'Size_Group': np.random.choice(sizes, size_df_a),
        'Shape': np.random.choice(shapes, size_df_a),
        'Was_Sold': np.random.choice(sold, size_df_a),
        'Date_Made': random_dates(date_start, date_end, n=size_df_a)
    }
)

df_b = pd.DataFrame(
    {
        'Color': np.random.choice(colors, size_df_b),
        'Size_Group': np.random.choice(sizes, size_df_b),
        'Shape': np.random.choice(shapes, size_df_b),
        'Was_Sold': np.random.choice(sold, size_df_b),
        'Date_Made': random_dates(date_start, date_end, n=size_df_b)
    }
)
First of all, I think your computation would be much more efficient using a relational database and an SQL query. Indeed, the filters can be done by indexing columns, performing a database join, doing some advanced filtering and counting the result. An optimized relational database can generate an efficient algorithm from a simple SQL query (hash-based row grouping, binary search, fast intersection of sets, etc.). Pandas is sadly not very good at performing advanced requests like this efficiently. It is also very slow to iterate over a pandas dataframe, although I am not sure this can be alleviated in this case using only pandas. Fortunately, you can use some NumPy and Python tricks to (partially) implement what fast relational database engines would do.
Additionally, pure-Python object types are slow, especially (unicode) strings. Thus, converting column types to efficient ones in the first place can save a lot of time (and memory). For example, there is no need for the Was_Sold column to contain "Y"/"N" string objects: a boolean can be used instead. So let us convert it:
df_b.Was_Sold = df_b.Was_Sold == "Y"
Finally, the current algorithm has a bad complexity: O(Na * Nb), where Na is the number of rows in df_a and Nb is the number of rows in df_b. This is not easy to improve, though, because of the non-trivial conditions. A first step is to group df_b by the col column ahead of time so as to avoid an expensive full scan of df_b (previously done with df_b[col] == relevant_val). Then, each precomputed group can be sorted by date so that a fast binary search can be performed later. Finally, you can use NumPy to count boolean values efficiently (using np.sum).
Note that doing prior_toys['Was_Sold'] is a bit faster than prior_toys.Was_Sold.
Here is the resulting code:
cols = ['Color', 'Size_Group', 'Shape']
# Run this calculation for multiple features
for col in cols:
    print(col + ' - Started')
    # Empty list to build up the calculation in
    ratio_list = []
    # Split df_b by col and sort each (indexed) group by date
    colGroups = {grId: grDf.sort_values('Date_Made') for grId, grDf in df_b.groupby(col)}
    # Start the iteration
    for row in df_a.itertuples(index=False):
        # Relevant values from df_a
        relevant_val = getattr(row, col)
        created_date = row.Date_Made
        # df to keep the overall prior toy matches
        curColGroup = colGroups[relevant_val]
        prior_count = np.searchsorted(curColGroup['Date_Made'], created_date)
        prior_toys = curColGroup[:prior_count]
        # Now find the ones that were sold
        prior_sold_count = prior_toys['Was_Sold'].values.sum()
        # Now make the calculation and append to the list
        if prior_count == 0:
            ratio = 0
        else:
            ratio = prior_sold_count / prior_count
        ratio_list.append(ratio)
    # Store the calculation in the original df_a
    df_a[col + '_Prior_Sold_Ratio'] = ratio_list
    print(col + ' - Finished')
This is 5.5 times faster on my machine.
The iteration of the pandas dataframe is a major source of slowdown. Indeed, prior_toys['Was_Sold'] takes half the computation time because of the huge overhead of pandas internal function calls repeated Na times... Using Numba may help to reduce the cost of the slow iteration. Note that the complexity can be improved to O(Na log Nb) by doing more precomputation on colGroups ahead of time, which should completely remove the overhead of computing prior_sold_count. The resulting program should be about 10 times faster than the original one.
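One way to realize that last idea is to precompute, within each sorted group, a running count of sold toys, so each df_a row needs only one binary search plus an array lookup. A rough sketch (assuming the boolean Was_Sold conversion above and the emulated df_a/df_b; not benchmarked):
import numpy as np

cols = ['Color', 'Size_Group', 'Shape']
for col in cols:
    # Precompute, per group value: sorted dates and a running count of sold toys
    groups = {}
    for grId, grDf in df_b.groupby(col):
        grDf = grDf.sort_values('Date_Made')
        dates = grDf['Date_Made'].to_numpy()
        sold_cumsum = np.cumsum(grDf['Was_Sold'].to_numpy())  # Was_Sold is boolean here
        groups[grId] = (dates, sold_cumsum)

    ratios = np.zeros(len(df_a))
    vals = df_a[col].to_numpy()
    made = df_a['Date_Made'].to_numpy()
    for i in range(len(df_a)):
        if vals[i] not in groups:        # no prior toys of this kind at all
            continue
        dates, sold_cumsum = groups[vals[i]]
        prior_count = np.searchsorted(dates, made[i])  # toys made strictly before
        if prior_count:
            ratios[i] = sold_cumsum[prior_count - 1] / prior_count
    df_a[col + '_Prior_Sold_Ratio'] = ratios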

How to find common members of matrices in Numpy

I have a 2D matrix A and a vector B. I want to find all row indices of elements in A that are also contained in B.
A = np.array([[1,9,5], [8,4,9], [4,9,3], [6,7,5]], dtype=int)
B = np.array([2, 4, 8, 10, 12, 18], dtype=int)
My current solution only compares A to one element of B at a time, but that is horribly slow:
res = np.array([], dtype=int)
for i in range(B.shape[0]):
    cres, _ = (B[i] == A).nonzero()
    res = np.append(res, cres)
res = np.unique(res)
The following Matlab statement would solve my issue:
find(any(reshape(any(reshape(A, prod(size(A)), 1) == B, 2),size(A, 1),size(A, 2)), 2))
However, comparing a row and a column vector in NumPy does not create a Boolean intersection matrix as it does in Matlab.
Is there a proper way to do this in Numpy?
We can use np.isin masking.
To get all the row numbers, it would be -
np.where(np.isin(A,B).T)[1]
If you need them split based on each element's occurrence -
[np.flatnonzero(i) for i in np.isin(A,B).T if i.any()]
The posted MATLAB code seems to be doing broadcasting. So, an equivalent one would be -
np.where(B[:,None,None]==A)[1]
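For the sample A and B above, both expressions pick out rows 1 and 2 (row 1 appears twice because it contains both 8 and 4); np.unique collapses the result to one index per row:
>>> np.where(np.isin(A, B).T)[1]
array([1, 2, 1])
>>> np.where(B[:, None, None] == A)[1]
array([1, 2, 1])
>>> np.unique(np.where(np.isin(A, B).T)[1])
array([1, 2])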

Calculating values that depend on the previous row

It seems that cumsum, cumprod and the other cumulative operations cannot express this calculation, so at present it apparently can only be done with a row-by-row loop.
The data has about 10 million rows and needs this cross-row calculation; my computer cannot run the loop at all. What is the solution? Thank you.
The calculations needed are as follows:
for i in range(1, 10000000):
    df.iloc[i, 3] = df.iloc[i-1, 3] * df.iloc[i, 1] + df.iloc[i, 2]
There probably is no pythonic way to do it without looping it in C/Java style.
Added: Thus, just do a loop, or hack around it using global variables as follows:
import pandas as pd

prev_result = 0

def my_func(x):
    global prev_result
    prev_result = x.a * prev_result + x.b
    return prev_result

df = pd.DataFrame({"a": [1, 2, 3], "b": [1, 2, 3]})
df["c"] = df.apply(my_func, axis=1)
# df["c"] is now [1, 4, 15]
# 0 x 1 + 1 = 1; 1 x 2 + 2 = 4; 4 x 3 + 3 = 15
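Since each value only depends on the previous result, a compiled loop removes the per-row Python overhead entirely. A rough sketch using Numba (an extra dependency, not part of the answer above; it assumes the two inputs live in columns a and b as in the toy example):
import numba
import numpy as np
import pandas as pd

@numba.njit
def recurrence(a, b):
    # out[i] = out[i-1] * a[i] + b[i], with out[-1] taken as 0
    out = np.empty(a.size)
    prev = 0.0
    for i in range(a.size):
        prev = prev * a[i] + b[i]
        out[i] = prev
    return out

df = pd.DataFrame({"a": [1, 2, 3], "b": [1, 2, 3]})
df["c"] = recurrence(df["a"].to_numpy(np.float64), df["b"].to_numpy(np.float64))
# df["c"] is again [1.0, 4.0, 15.0]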
Edit: The following is not cumulative and hence does not answer the question.
That being said, @pythonic833's solution:
df.shift(-1).iloc[:,3]*df.iloc[:,1]+df.iloc[:,2]
is quite a decent one.
If I were you, I'd just assign df["temp_column"] as df["third_column"].shift(-1)
df["temp_column"] = df["third_column"].shift(-1)
df["third_column"] = df["temp_column"] * df["first_column"] + df["second_column"]
My proposed solution is a bit easier to read at the cost of memory for a column.

Get indices for values of one array in another array

I have two 1D-arrays containing the same set of values, but in a different (random) order. I want to find the list of indices, which reorders one array according to the other one. For example, my 2 arrays are:
ref = numpy.array([5,3,1,2,3,4])
new = numpy.array([3,2,4,5,3,1])
and I want the list order for which new[order] == ref.
My current idea is:
def find(val):
    return numpy.argmin(numpy.absolute(ref - val))

order = sorted(range(new.size), key=lambda x: find(new[x]))
However, this only works as long as no values are repeated. In my example 3 appears twice, and I get new[order] = [5 3 3 1 2 4]. The second 3 is placed directly after the first one, because my function find() does not track which 3 I am currently looking for.
So I could add something to deal with this, but I have a feeling there might be a better solution out there. Maybe in some library (NumPy or SciPy)?
Edit about the duplicate: This linked solution assumes that the arrays are ordered, or for the "unordered" solution, returns duplicate indices. I need each index to appear only once in order. Which one comes first however, is not important (neither possible based on the data provided).
What I get with sort_idx = A.argsort(); order = sort_idx[np.searchsorted(A,B,sorter = sort_idx)] is: [3, 0, 5, 1, 0, 2]. But what I am looking for is [3, 0, 5, 1, 4, 2].
Given ref, new which are shuffled versions of each other, we can get the unique indices that map ref to new using the sorted version of both arrays and the invertibility of np.argsort.
Start with:
i = np.argsort(ref)
j = np.argsort(new)
Now ref[i] and new[j] both give the sorted version of the arrays, which is the same for both. You can invert the first sort by doing:
k = np.argsort(i)
Now ref is just new[j][k], or new[j[k]]. Since all the operations are shuffles using unique indices, the final index j[k] is unique as well. j[k] can be computed in one step with
order = np.argsort(new)[np.argsort(np.argsort(ref))]
From your original example:
>>> ref = np.array([5, 3, 1, 2, 3, 4])
>>> new = np.array([3, 2, 4, 5, 3, 1])
>>> order = np.argsort(new)[np.argsort(np.argsort(ref))]
>>> order
array([3, 0, 5, 1, 4, 2])
>>> new[order] # Should give ref
array([5, 3, 1, 2, 3, 4])
This is probably not any faster than the more general solutions to the similar question on SO, but it does guarantee unique indices as you requested. A further optimization would be to replace np.argsort(i) with something like the argsort_unique function in this answer. I would go one step further and just compute the inverse of the sort:
def inverse_argsort(a):
    fwd = np.argsort(a)
    inv = np.empty_like(fwd)
    inv[fwd] = np.arange(fwd.size)
    return inv
order = np.argsort(new)[inverse_argsort(ref)]
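Continuing the example from above, the inverse-based version produces the same unique ordering:
>>> order = np.argsort(new)[inverse_argsort(ref)]
>>> order
array([3, 0, 5, 1, 4, 2])
>>> np.array_equal(new[order], ref)
True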

Numpy index of the maximum with reduction - numpy.argmax.reduceat

I have a flat array a:
a = numpy.array([0, 1, 1, 2, 3, 1, 2])
And an array b of indices marking the start of each "chunk":
b = numpy.array([0, 4])
I know I can find the maximum in each "chunk" using a reduction:
m = numpy.maximum.reduceat(a,b)
# array([2, 3], dtype=int32)
But... Is there a way to find the index of the maximum within a chunk (like numpy.argmax), with vectorized operations (no lists, loops)?
Borrowing the idea from this post.
Steps involved :
Offset all elements of each group by a limit-offset (a number larger than any element), then sort globally; the offsets keep each group in its own position while sorting the elements within each group.
In the sorted order, the last element of each group is that group's max. Its position in the sort index gives the argmax, after subtracting the group start offsets.
Thus, a vectorized implementation would be -
def numpy_argmax_reduceat(a, b):
    n = a.max() + 1  # limit-offset
    grp_count = np.append(b[1:] - b[:-1], a.size - b[-1])
    shift = n * np.repeat(np.arange(grp_count.size), grp_count)
    sortidx = (a + shift).argsort()
    grp_shifted_argmax = np.append(b[1:], a.size) - 1
    return sortidx[grp_shifted_argmax] - b
As a minor and possibly faster tweak, we could alternatively create shift with cumsum, giving a variation of the earlier approach, like so -
def numpy_argmax_reduceat_v2(a, b):
    n = a.max() + 1  # limit-offset
    id_arr = np.zeros(a.size, dtype=int)
    id_arr[b[1:]] = 1
    shift = n * id_arr.cumsum()
    sortidx = (a + shift).argsort()
    grp_shifted_argmax = np.append(b[1:], a.size) - 1
    return sortidx[grp_shifted_argmax] - b
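For the sample arrays from the question (with numpy imported as np), both functions return the position of each chunk's maximum relative to that chunk's start:
>>> a = np.array([0, 1, 1, 2, 3, 1, 2])
>>> b = np.array([0, 4])
>>> numpy_argmax_reduceat(a, b)
array([3, 0])
>>> numpy_argmax_reduceat_v2(a, b)
array([3, 0])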