numpy, sums of subsets with no iterations [duplicate]

I have a massive data array (500k rows) that looks like:
id value score
1 20 20
1 10 30
1 15 0
2 12 4
2 3 8
2 56 9
3 6 18
...
As you can see, there is a non-unique ID column to the left, and various scores in the 3rd column.
I'm looking to quickly add up all of the scores, grouped by IDs. In SQL this would look like SELECT sum(score) FROM table GROUP BY id
With NumPy I've tried iterating through each ID, truncating the table by each ID, and then summing the score up for that table.
table_trunc = table[(table == id).any(1)]
score = sum(table_trunc[:,2])
Unfortunately I'm finding the first command to be dog-slow. Is there any more efficient way to do this?

You can use np.bincount():
import numpy as np

ids = [1, 1, 1, 2, 2, 2, 3]
data = [20, 30, 0, 4, 8, 9, 18]
print(np.bincount(ids, weights=data))
The output is [ 0. 50. 21. 18.], which means the sum for id==0 is 0, for id==1 it is 50, for id==2 it is 21, and for id==3 it is 18.
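Note that np.bincount expects non-negative integer ids and builds one bin per integer up to the maximum id. If your ids are large, sparse or not small integers, a common workaround (a minimal sketch, not part of the original answer) is to remap them with np.unique first:
import numpy as np

ids = np.array([10, 10, 10, 42, 42, 42, 7])        # arbitrary integer labels
data = np.array([20, 30, 0, 4, 8, 9, 18])
uniq, inv = np.unique(ids, return_inverse=True)    # inv maps each id to 0..len(uniq)-1
sums = np.bincount(inv, weights=data)              # sums[i] is the total for uniq[i]
# totals per unique id: 7 -> 18, 10 -> 50, 42 -> 21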

I noticed the numpy tag, but in case you don't mind using pandas (or if you read in these data with that module), this task becomes a one-liner:
import pandas as pd
df = pd.DataFrame({'id': [1,1,1,2,2,2,3], 'score': [20,30,0,4,8,9,18]})
So your dataframe would look like this:
id score
0 1 20
1 1 30
2 1 0
3 2 4
4 2 8
5 2 9
6 3 18
Now you can use the functions groupby() and sum():
df.groupby(['id'], sort=False).sum()
which gives you the desired output:
score
id
1 50
2 21
3 18
By default the group keys are sorted, so I pass the flag sort=False, which might improve speed for huge dataframes.
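If the data currently live in the NumPy array from the question (called table here purely for illustration, with columns id, value, score), a minimal sketch of the same idea would be:
import numpy as np
import pandas as pd

# hypothetical array mirroring the question's layout: id, value, score
table = np.array([[1, 20, 20], [1, 10, 30], [1, 15, 0],
                  [2, 12, 4], [2, 3, 8], [2, 56, 9], [3, 6, 18]])
df = pd.DataFrame(table, columns=['id', 'value', 'score'])
print(df.groupby('id', sort=False)['score'].sum())
# id
# 1    50
# 2    21
# 3    18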

You can try using boolean operations:
import numpy as np

ids = np.array([1, 1, 1, 2, 2, 2, 3])
data = np.array([20, 30, 0, 4, 8, 9, 18])
[((ids == i) * data).sum() for i in np.unique(ids)]
This may be a bit more efficient than using np.any, but it will clearly have trouble if you have a very large number of unique ids together with a large overall data table.

If you're looking only for the sum, you probably want to go with bincount. If you also need other grouping operations like product, mean, std, etc., have a look at https://github.com/ml31415/numpy-groupies . It provides some of the fastest python/numpy grouping operations around; see the speed comparison there.
Your sum operation there would look like:
from numpy_groupies import aggregate
res = aggregate(id, score)
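For the other reductions mentioned above, the same call takes a func argument; a minimal sketch, assuming the ids/data arrays from the earlier answers:
import numpy as np
import numpy_groupies as npg

ids = np.array([1, 1, 1, 2, 2, 2, 3])
data = np.array([20, 30, 0, 4, 8, 9, 18])
print(npg.aggregate(ids, data, func='sum'))     # [ 0 50 21 18] (id 0 has no rows)
print(npg.aggregate(ids, data, func='mean'))    # per-id means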

The numpy_indexed package has vectorized functionality to perform this operation efficiently, in addition to many related operations of this kind:
import numpy_indexed as npi
npi.group_by(id).sum(score)
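A runnable sketch with the sample arrays used in the other answers; as far as I understand the API, group_by(...).sum(...) returns the unique keys together with the per-group sums:
import numpy as np
import numpy_indexed as npi

ids = np.array([1, 1, 1, 2, 2, 2, 3])
score = np.array([20, 30, 0, 4, 8, 9, 18])
unique_ids, sums = npi.group_by(ids).sum(score)
print(unique_ids)   # [1 2 3]
print(sums)         # [50 21 18]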

You can use a for loop and numba:
import numpy as np
from numba import njit

@njit
def wbcnt(b, w, k):
    bins = np.arange(k)
    bins = bins * 0
    for i in range(len(b)):
        bins[b[i]] += w[i]
    return bins
Using @HYRY's variables:
ids = [1, 1, 1, 2, 2, 2, 3]
data = [20, 30, 0, 4, 8, 9, 18]
Then:
wbcnt(ids, data, 4)
array([ 0, 50, 21, 18])
Timing
%timeit wbcnt(ids, data, 4)
%timeit np.bincount(ids, weights=data)
1000000 loops, best of 3: 1.99 µs per loop
100000 loops, best of 3: 2.57 µs per loop

Maybe using itertools.groupby, you can group on the ID and then iterate over the grouped data.
(The data must be sorted according to the group-by function, in this case the ID.)
>>> import itertools
>>> data = [(1, 20, 20), (1, 10, 30), (1, 15, 0), (2, 12, 4), (2, 3, 0)]
>>> groups = itertools.groupby(data, lambda x: x[0])
>>> for i in groups:
...     for y in i:
...         if isinstance(y, int):
...             print(y)
...         else:
...             for p in y:
...                 print('-', p)
Output:
1
- (1, 20, 20)
- (1, 10, 30)
- (1, 15, 0)
2
- (2, 12, 4)
- (2, 3, 0)
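Since the question ultimately wants the per-id sum of the score column, here is a minimal sketch building on the same idea (assuming the tuples are (id, value, score) and already sorted by id):
import itertools

data = [(1, 20, 20), (1, 10, 30), (1, 15, 0), (2, 12, 4), (2, 3, 0)]
sums = {key: sum(row[2] for row in group)
        for key, group in itertools.groupby(data, key=lambda x: x[0])}
print(sums)   # {1: 50, 2: 4}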

Related

Limit the output variables of CP-SAT solver

The following data is given as df:
id  class  country  weights
a   1      US       20
b   2      US       5
a   2      CH       5
a   1      CH       10
b   1      CH       5
c   1      US       10
b   2      GER      15
a   2      CH       5
c   1      US       15
a   1      US       10
The goal is to create an alternative allocation for the weights column while keeping the distribution over the unique values in id, class and country. For example, 5 of the 10 rows in column id are "a", and "a" accounts for 50% of the total weight; an alternative weights solution should keep that 50% share for "a", and likewise the share of every other unique value in the first three columns.
For this I created the following code to get a dict with the distribution:
constraint_columns = ["id", "class", "country"]
constraints = {}
for column in constraint_columns:
    constraints[column] = dict(zip(df.groupby([column]).sum().reset_index()[column],
                                   df.groupby([column]).sum().reset_index()["weights"]))
The result looks as follows:
{'id': {'a': 50, 'b': 25, 'c': 25},
'class': {1: 70, 2: 30},
'country': {'CH': 25, 'GER': 15, 'US': 60}}
I then initialize the model, create the variables for the model to solve (the weights), and create the constraints by looping through my constraints dictionary and mapping them to the variables:
from ortools.sat.python import cp_model

model = cp_model.CpModel()
solver = cp_model.CpSolver()

count = 0
dict_weights = {}
for weight in range(len(df)):
    dict_weights[count] = model.NewIntVar(0, 100, f"weight_{count}")
    count += 1

weights_full = []
for weight in dict_weights:
    weights_full.append(dict_weights[weight])
I allow a 5% range within which the distribution may differ:
for constraint in constraints:
    for key in constraints[constraint]:
        keys = df.loc[df[constraint] == key].index
        model.Add(sum(list(map(dict_weights.get, keys))) >=
                  int(constraints[constraint][key] - constraints[constraint][key] * 0.05))
        model.Add(sum(list(map(dict_weights.get, keys))) <=
                  int(constraints[constraint][key] + constraints[constraint][key] * 0.05))
I solve the model and everything works fine:
solver.parameters.cp_model_presolve = False # type: ignore
solver.parameters.max_time_in_seconds = 0.01 # type: ignore
solution_collector = VarArraySolutionCollector(weights_full)
solver.SolveWithSolutionCallback(model, solution_collector)
solution_collector.solution_list
Solution:
[0, 0, 0, 0, 8, 0, 15, 15, 23, 35]
As a next step I want to tell the model that the result should consist of a specific number of weights, for example 3 - that would mean 5 of the weight values should be 0 and only 3 are used to find a solution that fits the distribution. Right now it does not matter whether a feasible solution exists or not.
Any ideas how to solve this?
The solver is integral; 0.05 will be silently rounded to 0 by Python.
I do not understand your problem. My gut reaction is to create one bool var per weight value and per item, and convert all constraints to weighted sums of these bool variables.
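For the "use exactly 3 non-zero weights" part specifically, one common CP-SAT pattern (a minimal sketch under the assumptions of the question's code, with a hypothetical is_used indicator per weight) is to attach a Boolean to each weight variable and constrain how many of them may be active:
# Sketch: force exactly 3 of the weight variables to be non-zero.
# Assumes model and dict_weights exist as in the question's code above.
is_used = {}
for i, w in dict_weights.items():
    is_used[i] = model.NewBoolVar(f"is_used_{i}")
    model.Add(w == 0).OnlyEnforceIf(is_used[i].Not())   # not used -> weight forced to 0
    model.Add(w >= 1).OnlyEnforceIf(is_used[i])         # used -> weight at least 1
model.Add(sum(is_used.values()) == 3)                   # exactly 3 weights are used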

Python: print the cumulative sum of x along axes 0 and 1

Create an array x of shape (5, 6) containing 30 random integers between -30 and 30.
Print the cumulative sum of x along axis 0.
Print the cumulative sum of x along axis 1.
The expected output is 9 and -32.
I tried the code below:
import numpy as np
np.random.seed(100)
l1= np.random.randint(-30,30, size=(5,6))
x= np.array(l1)
print(x.sum(axis=0))
print(x.sum(axis=1))
Can you please help me understand what is wrong with this?
The results of your expressions are:
x.sum(axis=0) == array([ -9, -58, -38, 40, 16, 9])
x.sum(axis=1) == array([-68, 47, 1, 12, -32])
As you wrote the expected results are 9 and -32, maybe you want
to compute sums of the last column and last row?
To get just these results, compute:
x[:, -1].sum() (yields 9)
x[-1, :].sum() (yields -32)
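Since the exercise literally asks for cumulative sums, note that NumPy also has np.cumsum; the last row/column of the cumulative sum along an axis equals the plain sum along that axis. A minimal sketch using the same seed:
import numpy as np

np.random.seed(100)
x = np.random.randint(-30, 30, size=(5, 6))
print(np.cumsum(x, axis=0))        # running totals down each column
print(np.cumsum(x, axis=1))        # running totals across each row
print(np.cumsum(x, axis=0)[-1])    # last row equals x.sum(axis=0)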

Does Pandas have a resample method without dependency on a datetime index?

I have a series that I want to apply an external function to in subsets/chunks of three. Although the actual external function is more complex, for the sake of an example, let's just assume my external function takes an ndarray of integers and returns the sum of all values. So for example:
series = pd.Series([1,1,1,1,1,1,1,1,1])
# Some pandas magic similar to:
result = series.resample(3).apply(myFunction)
# where 3 just represents every 3 values and
# result == pd.Series([3,3,3])
I looked at combining Series.resample and Series.apply as hinted at by the pseudo code above, but it appears resample depends on a datetime index. Any ideas on how I can effectively downsample by applying an external function like this without a datetime index? Or do you just recommend creating a temporary datetime index to do this and then reverting to the original index?
pandas.DataFrame.groupby would do the trick here. What you need is a repeated index to specify the subsets/chunks.
Create chunks
import numpy as np

n = 3
repeat_idx = np.repeat(np.arange(0, len(series), n), n)[:len(series)]
print(repeat_idx)
array([0, 0, 0, 3, 3, 3, 6, 6, 6])
Groupby
def myFunction(l):
    output = 0
    for item in l:
        output += item
    return output

series = pd.Series([1, 1, 1, 1, 1, 1, 1, 1, 1])
result = series.groupby(repeat_idx).apply(myFunction)
print(result)
0    3
3    3
6    3
The solution will also work when the chunk size does not evenly divide the length of the series:
n = 4
repeat_idx = np.repeat(np.arange(0,len(series), n), n)[:len(series)]
print(repeat_idx)
array([0, 0, 0, 0, 4, 4, 4, 4, 8])
result = series.groupby(repeat_idx).apply(myFunction)
print(result)
0 4
4 4
8 1
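An equivalent and arguably simpler way to build the chunk keys, offered as a sketch rather than as part of the original answer, is integer division of the positional index:
import numpy as np
import pandas as pd

series = pd.Series([1, 1, 1, 1, 1, 1, 1, 1, 1])
n = 3
chunk_keys = np.arange(len(series)) // n      # [0 0 0 1 1 1 2 2 2]
result = series.groupby(chunk_keys).sum()     # or .apply(myFunction)
print(result)
# 0    3
# 1    3
# 2    3
# dtype: int64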

Pandas | How to effectively filter a column

I'm looking for a way to quickly and effectively filter through a dataframe column and remove values that don't meet a condition.
Say, I have a column with the numbers 4, 5 and 10. I want to filter the column and replace any numbers above 7 with 0. How would I go about this?
You're talking about two separate things - filtering and value replacement. They both have uses and end up being similar in nature but for filtering I'll point to this great answer.
Let's say our data frame is called df and looks like
A B
1 4 10
2 4 2
3 10 1
4 5 9
5 10 3
Column A fits your statement of a column only having values 4, 5, 10. If you wanted to replace numbers above 7 with 0, this would do it:
df["A"] = [0 if x > 7 else x for x in df["A"]]
If you read through the right-hand side it cleanly explains what it is doing. It helps to include parentheses to separate out the "what to do" with the "what you're doing it over":
df["A"] = [(0 if x > 7 else x) for x in df["A"]]
If you want to do a manipulation over multiple columns, then utilizing zip allows you to do it easily. For example, if you want the sum of columns A and B then:
df["sum"] = [x[0] + x[1] for x in zip(df["A"], df["B"])]
Take care when you overwrite data - this removes information. It's a good practice to have the transformed data in other columns so you can trace back when something inevitably goes wonky.
There are many options. One possibility for if/then logic is np.where:
import pandas as pd
import numpy as np

df = pd.DataFrame({'x': [1, 200, 4, 5, 6, 11],
                   'y': [4, 5, 10, 24, 4, 3]})
df['y'] = np.where(df['y'] > 7, 0, df['y'])
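Another common, vectorized spelling of the same replacement, shown here as a sketch rather than as part of either answer, uses boolean indexing with .loc:
import pandas as pd

df = pd.DataFrame({'A': [4, 4, 10, 5, 10]})
df.loc[df['A'] > 7, 'A'] = 0      # set every value above 7 to 0 in place
print(df['A'].tolist())           # [4, 4, 0, 5, 0]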

Sum of data entry with the given index in pandas dataframe

I am trying to get the sum of every possible combination of the given data in a pandas dataframe. To do this I use itertools.combinations to get all of the possible combinations, then, using a loop, I sum each of them.
Is there any way to do this without using the loop?
Please check the following script that I created to show what I want.
import pandas as pd
import itertools as it

A = pd.Series([50, 20, 75], index=list(range(1, 4)))
df = pd.DataFrame({'A': A})

listNew = []
for i in range(1, len(df.A) + 1):
    Temp = it.combinations(df.index.values, i)
    for data in Temp:
        listNew.append(data)
print(listNew)
for data in listNew:
    print(df.A[list(data)].sum())
The output of this script is:
[(1,), (2,), (3,), (1, 2), (1, 3), (2, 3), (1, 2, 3)]
50
20
75
70
125
95
145
Thank you in advance.
IIUC, using reindex:
# convert your list of tuples to a data frame and use stack to flatten it
s = pd.DataFrame([(1,), (2,), (3,), (1, 2), (1, 3), (2, 3), (1, 2, 3)]).stack().to_frame('index')
# then we reindex based on that order using df.A
s['Value'] = df.reindex(s['index']).A.values
# you could use groupby here, but since the index is already in place, I recommend sum with level
s = s.Value.sum(level=0)
s
s
Out[796]:
0 50
1 20
2 75
3 70
4 125
5 95
6 145
Name: Value, dtype: int64