The following data is given as df:

id  class  country  weights
a   1      US       20
b   2      US       5
a   2      CH       5
a   1      CH       10
b   1      CH       5
c   1      US       10
b   2      GER      15
a   2      CH       5
c   1      US       15
a   1      US       10
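For reproducibility, the frame can be rebuilt roughly like this (a sketch, not from the original post; column names as above):

import pandas as pd

df = pd.DataFrame({
    "id":      ["a", "b", "a", "a", "b", "c", "b", "a", "c", "a"],
    "class":   [1, 2, 2, 1, 1, 1, 2, 2, 1, 1],
    "country": ["US", "US", "CH", "CH", "CH", "US", "GER", "CH", "US", "US"],
    "weights": [20, 5, 5, 10, 5, 10, 15, 5, 15, 10],
})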
The goal is to create an alternative allocation for the weights column while keeping the distribution of weight across the unique values in id, class and country. For example: 5 of the 10 rows (50%) in column id are "a", and those rows carry 50 of the 100 total weight units. An alternative set of weights should keep this 50% share for "a", and likewise the share of every other unique value in the first three columns.
For this I created the following code to get a dict with the distribution:
constraint_columns = ["id", "class", "country"]
constraints = {}
for column in constraint_columns:
    constraints[column] = dict(zip(df.groupby([column]).sum().reset_index()[column],
                                   df.groupby([column]).sum().reset_index()["weights"]))
The result looks as follows:
{'id': {'a': 50, 'b': 25, 'c': 25},
'class': {1: 70, 2: 30},
'country': {'CH': 25, 'GER': 15, 'US': 60}}
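As an aside, a shorter way to build the same dict (a sketch assuming the column is really named weights) is a groupby on the single column:

constraints = {
    column: df.groupby(column)["weights"].sum().to_dict()
    for column in constraint_columns
}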
I then initialise the model, create the variables the model should solve for (the weights), and build the constraints by looping through my constraints dict and mapping each entry to the corresponding variables:
from ortools.sat.python import cp_model

model = cp_model.CpModel()
solver = cp_model.CpSolver()

# one integer weight variable per row of df
count = 0
dict_weights = {}
for weight in range(len(df)):
    dict_weights[count] = model.NewIntVar(0, 100, f"weight_{count}")
    count += 1

weights_full = []
for weight in dict_weights:
    weights_full.append(dict_weights[weight])
I give each group sum a 5% band in which it may deviate from the target distribution:
for constraint in constraints:
    for key in constraints[constraint]:
        keys = df.loc[df[constraint] == key].index
        target = constraints[constraint][key]
        group_sum = sum(map(dict_weights.get, keys))
        model.Add(group_sum >= int(target - target * 0.05))
        model.Add(group_sum <= int(target + target * 0.05))
I solve the model and everything works fine:
solver.parameters.cp_model_presolve = False # type: ignore
solver.parameters.max_time_in_seconds = 0.01 # type: ignore
solution_collector = VarArraySolutionCollector(weights_full)
solver.SolveWithSolutionCallback(model, solution_collector)
solution_collector.solution_list
Solution:
[0, 0, 0, 0, 8, 0, 15, 15, 23, 35]
As a next step I want to tell the model that the result should consist of a specific number of non-zero weights. For example, 3: that would mean 7 of the 10 weight values should be 0 and only 3 are used to find a solution that fits the distribution. Right now it does not matter whether a feasible solution exists or not.
Any ideas how to solve this?
The solver is integral; 0.05 will be silently rounded to 0 by Python.
I do not understand your problem. My gut reaction is to create one bool var per weight value and per item, and convert all constraints to weighted sums of these bool variables.
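One way to express "exactly 3 non-zero weights", sketched on top of the variables above along the lines of the last comment (the indicator names are mine, not from the post): pair each weight with a Boolean indicator and constrain how many indicators are true.

n_used = 3  # desired number of non-zero weights

is_used = {}
for i, w in dict_weights.items():
    b = model.NewBoolVar(f"is_used_{i}")
    model.Add(w == 0).OnlyEnforceIf(b.Not())  # weight forced to 0 when unused
    model.Add(w >= 1).OnlyEnforceIf(b)        # weight at least 1 when used
    is_used[i] = b

model.Add(sum(is_used.values()) == n_used)

The reified constraints link each weight to its indicator, so the final equality fixes how many weights may be non-zero.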
I have a dataframe1 of 1802 rows and 29 columns (in code as df) - each row is a person and each column is a number representing their answer to 29 different questions.
I have another dataframe2 of 29 different coefficients (in code as seg_1).
Each column needs to be multiplied by the corresponding coefficient and this needs to be repeated for each participant.
For example: 1802 iterations of q1 * coeff1, 1802 iterations of q2 * coeff2, etc.
So I should end up with 1802 * 29 = 52,258 values, but the result doesn't have this length and the values aren't what I expect. I think the loop is multiplying q1-q29 by coeff1 and then repeating this for coeff2, which is not what I need.
questions = range(0, 28)
co = range(0, 28)
segment_1 = []
for a in questions:
    for b in co:
        answer = df.iloc[:, a] * seg_1[b]
        segment_1.append([answer])
Proper encoding of the coefficients as a Pandas frame makes this a one-liner
df_person['Answer'] = (df_person * df_coeffs.values).sum(1)
and circumvents slow for-loops. In addition, you don't need to remember the number of rows of the given table (1802), and the code works without changes even if your data grows larger.
For a minimum viable example, see:
import pandas as pd

# answer frame
df_person = pd.DataFrame({'Question_1': [10, 20, 15], 'Question_2': [4, 4, 2], 'Question_3': [2, -2, 1]})
# coefficient frame
seg_1 = [2, 4, -1]
N = len(df_person)
df_coeffs = pd.DataFrame({'C_1': [seg_1[0]] * N, 'C_2': [seg_1[1]] * N, 'C_3': [seg_1[2]] * N})
# elementwise multiplication & row-wise summation
df_person['Answer'] = (df_person * df_coeffs.values).sum(1)
giving the coefficient table df_coeffs and the answer table df_person with the new Answer column.
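If the coefficients stay a plain list, the intermediate df_coeffs frame can be skipped entirely; a sketch of the same computation via column-wise broadcasting:

import pandas as pd

df_person = pd.DataFrame({'Question_1': [10, 20, 15], 'Question_2': [4, 4, 2], 'Question_3': [2, -2, 1]})
seg_1 = [2, 4, -1]

# multiply each column by its coefficient, then sum across the columns
df_person['Answer'] = df_person.mul(seg_1, axis=1).sum(axis=1)
print(df_person['Answer'].tolist())  # [34, 58, 37]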
What am I trying to do?
I'm trying to generate a total_score column based on two other columns.
The level_count column has a minimum value of 1 and a maximum of 3.
The range_bins column ranges from 'low' to 'very_high'.
In order to sum them I created a temporary range_score column with values from 1 to 4. Is there a better way than creating this temp column?
How do I normalise the scores so that level_count and range_bins have the same weighting even though the range of each column differs (3 values vs 4)?
Data
data = {'level_count': {0: 2, 1: 2, 2: 3, 3: 1},
        'range_bins': {0: 'high', 1: 'medium', 2: 'low', 3: 'very_high'}}
df = pd.DataFrame(data)
df["range_score"] = df.range_bins.replace({"low": 1, "medium": 2,"high":3,"very_high":4})
df["total_score"] = df[["level_count","range_score"]].sum(axis=1)
drop temp column and show output:
df.drop(columns= "range_score")
level_count range_bins total_score
0 2 high 5
1 2 medium 4
2 3 low 4
3 1 very_high 5
Desired output
Rows 2 and 3 have equal importance and the total_score should reflect this. I may also need to add other similar columns with maybe only two categories in a similar way.
To achieve equal weighting of rows 2 and 3, you could create a function that takes a row of the dataframe and encodes the entire scoring logic, and then apply it with df.apply. Since I cannot tell the complete logic from your description, I can only provide half the solution; you will have to fill in the rest.
def total_score(row):
    if row.level_count == 1 and row.range_bins == 'very_high':
        return 4
    elif row.level_count == 2 and row.range_bins == 'very_high':
        return ...  # fill in
    elif row.level_count == 3 and row.range_bins == 'very_high':
        return ...  # fill in
    # ... remaining combinations

df['total_score'] = df.apply(total_score, axis=1)
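Regarding the normalisation part of the question, which the function above does not address: one possible approach (a sketch, my own suggestion rather than part of the answer) is to map range_bins to a numeric score and min-max scale both columns to [0, 1] before summing, so a 3-level column and a 4-level column count equally:

import pandas as pd

data = {'level_count': {0: 2, 1: 2, 2: 3, 3: 1},
        'range_bins': {0: 'high', 1: 'medium', 2: 'low', 3: 'very_high'}}
df = pd.DataFrame(data)

range_score = df['range_bins'].map({'low': 1, 'medium': 2, 'high': 3, 'very_high': 4})

def minmax(s, lo, hi):
    # scale a score with known bounds onto [0, 1]
    return (s - lo) / (hi - lo)

df['total_score'] = minmax(df['level_count'], 1, 3) + minmax(range_score, 1, 4)
# rows 2 and 3 both end up with total_score == 1.0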
I have a series that I want to apply an external function to in subsets/chunks of three. Although the actual external function is more complex, for the sake of an example, let's just assume my external function takes an ndarray of integers and returns the sum of all values. So for example:
series = pd.Series([1,1,1,1,1,1,1,1,1])
# Some pandas magic similar to:
result = series.resample(3).apply(myFunction)
# where 3 just represents every 3 values and
# result == pd.Series([3,3,3])
I looked at combining Series.resample and Series.apply as hinted at by the pseudo code above, but it appears resample depends on a datetime index. Any ideas on how I can effectively downsample by applying an external function like this without a datetime index? Or do you just recommend creating a temporary datetime index, doing this, and then reverting to the original index?
pandas.DataFrame.groupby would do the trick here. What you need is a repeated index to specify the subsets/chunks.
Create chunks
import numpy as np
import pandas as pd

n = 3
repeat_idx = np.repeat(np.arange(0, len(series), n), n)[:len(series)]
print(repeat_idx)
array([0, 0, 0, 3, 3, 3, 6, 6, 6])
Groupby
def myFunction(l):
    output = 0
    for item in l:
        output += item
    return output

series = pd.Series([1, 1, 1, 1, 1, 1, 1, 1, 1])
result = series.groupby(repeat_idx).apply(myFunction)
print(result)
0 3
3 3
6 3
The solution also works for a chunk size that does not evenly divide the length of the series:
n = 4
repeat_idx = np.repeat(np.arange(0,len(series), n), n)[:len(series)]
print(repeat_idx)
array([0, 0, 0, 0, 4, 4, 4, 4, 8])
result = series.groupby(repeat_idx).apply(myFunction)
print(result)
0 4
4 4
8 1
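A slightly shorter way to build the chunk labels (a sketch of the same idea, using integer division instead of np.repeat):

import numpy as np
import pandas as pd

series = pd.Series([1, 1, 1, 1, 1, 1, 1, 1, 1])
n = 3

# label each position with its chunk number: 0,0,0,1,1,1,2,2,2
result = series.groupby(np.arange(len(series)) // n).sum()
print(result.tolist())  # [3, 3, 3]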
I have a strange problem with calculating the weighted mean of a pandas dataframe. I want to do the following steps:
(1) calculate the weighted mean of all the data
(2) calculate the weighted mean of each group of data
The issue is that when I do step 2, the mean of the group means (weighted by the number of members in each group) is not the same as the weighted mean of all the data from step 1, although mathematically it should be. I even thought maybe the issue is the dtype, so I set everything to float64, but the problem still exists. Below I provide a simple example that illustrates the problem:
My dataframe has a data, a weight and group columns:
import numpy as np
import pandas as pd

data = np.array([
    0.20651903, 0.52607571, 0.60558061, 0.97468593, 0.10253621, 0.23869854,
    0.82134792, 0.47035085, 0.19131938, 0.92288234
])
weights = np.array([
    4.06071562, 8.82792146, 1.14019687, 2.7500913, 0.70261312, 6.27280216,
    1.27908358, 7.80508994, 0.69771745, 4.15550846
])
groups = np.array([1, 1, 2, 2, 2, 2, 3, 3, 4, 4])
df = pd.DataFrame({"data": data, "weights": weights, "groups": groups})
print(df)
>>> print(df)
data weights groups
0 0.206519 4.060716 1
1 0.526076 8.827921 1
2 0.605581 1.140197 2
3 0.974686 2.750091 2
4 0.102536 0.702613 2
5 0.238699 6.272802 2
6 0.821348 1.279084 3
7 0.470351 7.805090 3
8 0.191319 0.697717 4
9 0.922882 4.155508 4
# Define a weighted mean function to apply to each group
def my_fun(x, y):
    tmp = np.average(x, weights=y)
    return tmp
# Mean of the population
total_mean = np.average(np.array(df["data"], dtype="float64"),
                        weights=np.array(df["weights"], dtype="float64"))

# Weighted mean of each group
group_means = df.groupby("groups").apply(lambda d: my_fun(d["data"], d["weights"]))

# Number of members of each group
counts = np.array([2, 4, 2, 2], dtype="float64")

# Total mean calculated from the group means, weighted by the counts of each group
total_mean_from_group_means = np.average(np.array(group_means, dtype="float64"),
                                         weights=counts)
print(total_mean)
0.5070955626929458
print(total_mean_from_group_means)
0.5344436242465216
As you can see, the total mean calculated from the group means is not equal to the total mean. What am I doing wrong here?
EDIT: Fixed a typo in the code.
You compute a weighted mean within each group, so when you compute the total mean from the weighted means, the correct weight for each group is the sum of the weights within the group (and not the size of the group).
In [47]: wsums = df.groupby("groups").apply(lambda d: d["weights"].sum())
In [48]: total_mean_from_group_means = np.average(group_means, weights=wsums)
In [49]: total_mean_from_group_means
Out[49]: 0.5070955626929458
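For a compact check on the frame from the question (a sketch; the column names are the ones defined above), compute each group's weighted mean together with its weight sum and combine them:

group_stats = df.groupby("groups").apply(
    lambda d: pd.Series({
        "mean": np.average(d["data"], weights=d["weights"]),
        "wsum": d["weights"].sum(),
    })
)

total = np.average(group_stats["mean"], weights=group_stats["wsum"])
print(total)  # matches np.average(df["data"], weights=df["weights"])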
I have a massive data array (500k rows) that looks like:
id value score
1 20 20
1 10 30
1 15 0
2 12 4
2 3 8
2 56 9
3 6 18
...
As you can see, there is a non-unique ID column to the left, and various scores in the 3rd column.
I'm looking to quickly add up all of the scores, grouped by IDs. In SQL this would look like SELECT sum(score) FROM table GROUP BY id
With NumPy I've tried iterating through each ID, truncating the table by each ID, and then summing the score up for that table.
table_trunc = table[(table == id).any(1)]
score = sum(table_trunc[:,2])
Unfortunately I'm finding the first command to be dog-slow. Is there any more efficient way to do this?
You can use np.bincount():
import numpy as np

ids = [1, 1, 1, 2, 2, 2, 3]
data = [20, 30, 0, 4, 8, 9, 18]
print(np.bincount(ids, weights=data))
The output is [ 0. 50. 21. 18.], which means the sum for id==0 is 0, the sum for id==1 is 50, and so on.
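np.bincount expects small non-negative integer ids; if the ids are large or not integers at all, one possible workaround (a sketch, not part of the answer above) is to remap them to dense labels with np.unique first:

import numpy as np

ids = np.array(["x", "x", "y", "z", "z"])
data = np.array([20, 30, 0, 4, 8])

# map each id to a dense integer label, then bincount on the labels
labels, inverse = np.unique(ids, return_inverse=True)
sums = np.bincount(inverse, weights=data)
print(dict(zip(labels, sums)))  # sums per unique id: x -> 50, y -> 0, z -> 12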
I noticed the numpy tag, but in case you don't mind using pandas (or if you read in these data using this module), this task becomes a one-liner:
import pandas as pd
df = pd.DataFrame({'id': [1,1,1,2,2,2,3], 'score': [20,30,0,4,8,9,18]})
So your dataframe would look like this:
id score
0 1 20
1 1 30
2 1 0
3 2 4
4 2 8
5 2 9
6 3 18
Now you can use the functions groupby() and sum():
df.groupby(['id'], sort=False).sum()
which gives you the desired output:
score
id
1 50
2 21
3 18
By default, the groups would be sorted; therefore I use the flag sort=False, which might improve speed for huge dataframes.
You can try using boolean operations:
import numpy as np

ids = np.array([1, 1, 1, 2, 2, 2, 3])
data = np.array([20, 30, 0, 4, 8, 9, 18])
[((ids == i) * data).sum() for i in np.unique(ids)]
This may be a bit more efficient than using np.any, but it will clearly have trouble if you have a very large number of unique ids together with a large overall data table.
If you're looking only for sum you probably want to go with bincount. If you also need other grouping operations like product, mean, std etc. have a look at https://github.com/ml31415/numpy-groupies . It's the fastest python/numpy grouping operations around, see the speed comparison there.
Your sum operation there would look like:
res = aggregate(id, score)
The numpy_indexed package has vectorized functionality to perform this operation efficiently, in addition to many related operations of this kind:
import numpy_indexed as npi
npi.group_by(id).sum(score)
You can use a for loop and numba:

import numpy as np
from numba import njit

@njit
def wbcnt(b, w, k):
    bins = np.arange(k)
    bins = bins * 0
    for i in range(len(b)):
        bins[b[i]] += w[i]
    return bins
Using @HYRY's variables:
ids = [1, 1, 1, 2, 2, 2, 3]
data = [20, 30, 0, 4, 8, 9, 18]
Then:
wbcnt(ids, data, 4)
array([ 0, 50, 21, 18])
Timing
%timeit wbcnt(ids, data, 4)
%timeit np.bincount(ids, weights=data)
1000000 loops, best of 3: 1.99 µs per loop
100000 loops, best of 3: 2.57 µs per loop
Maybe using itertools.groupby, you can group on the ID and then iterate over the grouped data.
(The data must be sorted according to the grouping key, in this case the ID.)
>>> import itertools
>>> data = [(1, 20, 20), (1, 10, 30), (1, 15, 0), (2, 12, 4), (2, 3, 0)]
>>> groups = itertools.groupby(data, lambda x: x[0])
>>> for i in groups:
...     for y in i:
...         if isinstance(y, int):
...             print(y)
...         else:
...             for p in y:
...                 print('-', p)
Output:
1
- (1, 20, 20)
- (1, 10, 30)
- (1, 15, 0)
2
- (2, 12, 4)
- (2, 3, 0)
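To get the grouped sums themselves rather than just printing the groups, the same pattern can be condensed (a sketch using the rows from the question, where the score is the third element of each tuple):

import itertools

data = [(1, 20, 20), (1, 10, 30), (1, 15, 0),
        (2, 12, 4), (2, 3, 8), (2, 56, 9),
        (3, 6, 18)]

# data must already be sorted by id for itertools.groupby to group correctly
sums = {key: sum(row[2] for row in group)
        for key, group in itertools.groupby(data, key=lambda x: x[0])}
print(sums)  # {1: 50, 2: 21, 3: 18}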