Extract a summary of data using groupby and optimize inspector utilisation - pandas and other optimisation packages in Python

I have accident record data across several places, as shown below:
Inspector_ID Place Date
0 1 A 1-09-2019
1 2 A 1-09-2019
2 1 A 1-09-2019
3 1 B 1-09-2019
4 3 A 1-09-2019
5 3 A 1-09-2019
6 1 A 2-09-2019
7 3 A 2-09-2019
8 2 B 2-09-2019
9 3 A 3-09-2019
10 1 C 3-09-2019
11 1 D 3-09-2019
12 1 A 3-09-2019
13 1 E 3-09-2019
14 1 A 3-09-2019
15 1 A 3-09-2019
16 3 A 4-09-2019
17 3 B 5-09-2019
18 4 B 5-09-2019
19 3 A 5-09-2019
20 3 C 5-09-2019
21 3 A 5-09-2019
22 3 D 5-09-2019
23 3 C 5-09-2019
From the above data, I want to optimize inspector utilisation. To that end, I tried the code below to build the objective function of the optimisation:
c = df.groupby('Place').Inspector_ID.agg(
    Total_Number_of_accidents='count',
    Number_unique_Inspector='nunique',
    Unique_Inspector='unique').reset_index().sort_values(['Total_Number_of_accidents'], ascending=False)
Below is the output of the above code:
Place Total_Number_of_accidents Number_unique_Inspector Unique_Inspector
0 A 14 3 [1, 2, 3]
1 B 4 4 [1, 2, 3, 4]
2 C 3 2 [1, 3]
3 D 2 2 [1, 3]
4 E 1 1 [1]
And then:
f = df.groupby('Inspector_ID').Place.agg(
    Total_Number_of_accidents='count',
    Number_unique_Place='nunique',
    Unique_Place='unique').reset_index().sort_values(['Total_Number_of_accidents'], ascending=False)
Output:
Inspector_ID Total_Number_of_accidents Number_unique_Place Unique_Place
2 3 11 4 [A, B, C, D]
0 1 10 5 [A, B, C, D, E]
1 2 2 2 [A, B]
3 4 1 1 [B]
From the above we have 4 inspectors, 5 places and 24 accidents. I want to optimize the allocation of inspectors based on this data.
Condition 1 - There should be at least 1 inspector in each place.
Condition 2 - Every inspector should be assigned at least one place.
Condition 3 - Identify the places that are over-resourced relative to their number of accidents (for example, place B has only 4 accidents but four inspectors, so some inspectors from place B could be reassigned to place A). The follow-up questions are then: which inspectors, and how many?
Is it possible to do this in Python? If so, with which algorithm, and how?

This is an https://en.wikipedia.org/wiki/Assignment_problem. It can be reduced to a max-flow problem, with a few iterations to balance the flow (using a graph package like NetworkX).
How to create the di-graph: let the vertex s be the source of the flow (of accidents), let S be the set of all places that have accidents, and let X_s be the set of all edges (s, x) where x is in S. Likewise, t is the sink, with the analogous sets T (the inspectors) and X_t.
Set the capacity of each edge in X_s from the column Total_Number_of_accidents. For each edge in X_t, set the capacity to the maximum number of accidents its inspector should process (we will come back to this). Then add edges (x, y) from S to T and set their capacity to a high number (e.g. 1e6); call this set X_c. The flow on these edges tells us how much load inspector y takes from place x.
Now solve the max-flow problem. Whenever some edge in X_t carries too much flow, decrease its capacity (to reduce the load on that inspector); whenever some edge in X_c carries very little flow, remove it (to reduce the complexity of the work organization). After a few iterations you should have the desired solution.
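For illustration, here is a minimal sketch of this construction with NetworkX, assuming the per-place accident counts from the summary above and a guessed per-inspector capacity (both numbers are illustrative, not prescribed by the question):

import networkx as nx

# Per-place accident counts (column Total_Number_of_accidents above).
accidents = {'A': 14, 'B': 4, 'C': 3, 'D': 2, 'E': 1}
inspectors = [1, 2, 3, 4]
inspector_capacity = 8  # assumed maximum load per inspector; adjust between iterations

G = nx.DiGraph()
for place, n in accidents.items():
    G.add_edge('s', place, capacity=n)  # X_s: source -> place
    for insp in inspectors:
        G.add_edge(place, insp, capacity=10**6)  # X_c: place -> inspector
for insp in inspectors:
    G.add_edge(insp, 't', capacity=inspector_capacity)  # X_t: inspector -> sink

flow_value, flow = nx.maximum_flow(G, 's', 't')
# flow[place][insp] is the load inspector insp takes from place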
You could code some sophisticated algorithm, but if this is a real-life problem you will want to avoid situations like assigning one inspector to all places to process 0.38234 accidents at each place...
There should probably also be some constraints on how many accidents an inspector can process in a given time, but you didn't mention any.


Python: obtaining the first observation according to its date [duplicate]

I have a DataFrame with columns A, B, and C. For each value of A, I would like to select the row with the minimum value in column B.
That is, from this:
df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
'B': [4, 5, 2, 7, 4, 6],
'C': [3, 4, 10, 2, 4, 6]})
A B C
0 1 4 3
1 1 5 4
2 1 2 10
3 2 7 2
4 2 4 4
5 2 6 6
I would like to get:
A B C
0 1 2 10
1 2 4 4
For the moment I am grouping by column A, then creating a value that indicates which rows I will keep:
a = df.groupby('A').min()
a['A'] = a.index
to_keep = [str(x[0]) + str(x[1]) for x in a[['A', 'B']].values]
df['id'] = df['A'].astype(str) + df['B'].astype(str)
df[df['id'].isin(to_keep)]
I am sure that there is a much more straightforward way to do this.
I have seen many answers here that use MultiIndex, which I would prefer to avoid.
Thank you for your help.
I feel like you're overthinking this. Just use groupby and idxmin:
df.loc[df.groupby('A').B.idxmin()]
A B C
2 1 2 10
4 2 4 4
df.loc[df.groupby('A').B.idxmin()].reset_index(drop=True)
A B C
0 1 2 10
1 2 4 4
I had a similar situation, but with a more complex column heading (e.g. "B val"), in which case this is needed:
df.loc[df.groupby('A')['B val'].idxmin()]
The accepted answer (suggesting idxmin) cannot be used with the pipe pattern. A pipe-friendly alternative is to first sort the values and then use groupby with DataFrame.head:
df.sort_values('B').groupby('A').apply(pd.DataFrame.head, n=1)
This is possible because by default groupby preserves the order of rows within each group, which is stable and documented behaviour (see pandas.DataFrame.groupby).
This approach has additional benefits:
it can easily be expanded to select the n rows with the smallest values in a specific column (see the sketch after the final snippet below)
it can break ties by providing another column (as a list) to .sort_values(), e.g.:
df.sort_values(['final_score', 'midterm_score']).groupby('year').apply(pd.DataFrame.head, n=1)
As with other answers, to exactly match the result desired in the question, .reset_index(drop=True) is needed, making the final snippet:
df.sort_values('B').groupby('A').apply(pd.DataFrame.head, n=1).reset_index(drop=True)
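For example, selecting the two rows with the smallest B per group is just a matter of changing n:
df.sort_values('B').groupby('A').apply(pd.DataFrame.head, n=2).reset_index(drop=True)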
Here is an answer that is a little more wordy, but a lot more efficient.
This is the example dataset:
data = pd.DataFrame({'A': [1,1,1,2,2,2], 'B':[4,5,2,7,4,6], 'C':[3,4,10,2,4,6]})
data
Out:
A B C
0 1 4 3
1 1 5 4
2 1 2 10
3 2 7 2
4 2 4 4
5 2 6 6
First, we get the minimum values as a Series from a groupby operation:
min_value = data.groupby('A').B.min()
min_value
Out:
A
1 2
2 4
Name: B, dtype: int64
Then we merge this Series result into the original DataFrame:
data = data.merge(min_value, on='A',suffixes=('', '_min'))
data
Out:
A B C B_min
0 1 4 3 2
1 1 5 4 2
2 1 2 10 2
3 2 7 2 4
4 2 4 4 4
5 2 6 6 4
Finally, we keep only the rows where B is equal to B_min, and drop B_min since we don't need it anymore.
data = data[data.B==data.B_min].drop('B_min', axis=1)
data
Out:
A B C
2 1 2 10
4 2 4 4
I have tested it on very large datasets and this was the only way I could make it work in a reasonable time.
You can use sort_values and drop_duplicates:
df.sort_values('B').drop_duplicates('A')
Output:
A B C
2 1 2 10
4 2 4 4
The solution, as written before, is:
df.loc[df.groupby('A')['B'].idxmin()]
But you may then get an error:
"Passing list-likes to .loc or [] with any missing labels is no longer supported.
The following labels were missing: Float64Index([nan], dtype='float64').
See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike"
In my case there were NaN values in column B, so idxmin returned NaN for those groups. Adding dropna() made it work:
df.loc[df.groupby('A')['B'].idxmin().dropna()]
You can also use boolean indexing to keep the rows where column B equals its group minimum:
out = df[df['B'] == df.groupby('A')['B'].transform('min')]
print(out)
A B C
2 1 2 10
4 2 4 4

consecutive days constraint in linear programming

For a work shift optimization problem, I've defined a binary variable in PuLP as follows:
pulp.LpVariable.dicts('VAR', (range(D), range(N), range(T)), 0, 1, 'Binary')
where
D = # days in each schedule we create (=28, or 4 weeks)
N = # of workers
T = types of work shift (=6)
For the 5th and 6th types of work shift (with indices 4 and 5), I need to add a constraint that any worker who works these shifts must do so for seven consecutive days, and not just any seven days but the seven days starting from Monday (i.e. a full week). I've tried defining the constraint as follows, but I get an infeasible solution when I add this constraint and try to solve the problem (it worked before without it).
I know this constraint (along with the others from before) should theoretically be feasible because we manually schedule work shifts with the same set of constraints. Is there anything wrong with the way I've coded the constraint?
## looping over each worker
for j in range(N):
    ## looping over every Monday in the 28 days
    for i in range(0, D, 7):
        c = None
        ## accessing only the 5th and 6th work shift types
        for k in range(4, T):
            c += var[i][j][k]+var[i+1][j][k]+var[i+2][j][k]+var[i+3][j][k]+var[i+4][j][k]+var[i+5][j][k]+var[i+6][j][k]
        problem += c == 7
If I understand correctly, then your constraint requires every worker to work the restricted shift types (indices 4 and 5) in every week. This is because of c == 7, i.e. 7 of the binaries in c must be set to 1. This does not allow any worker to work only shifts 0 through 3, right?
You need to change the constraint so that c == 7 is only enforced if the worker works any shift in that range. A very simple way to do that would be something like
v = list()
for k in range(4, T):
    v.extend([var[i][j][k], var[i+1][j][k], var[i+2][j][k], var[i+3][j][k], var[i+4][j][k], var[i+5][j][k], var[i+6][j][k]])
c = sum(v)
problem += c <= 7  # we can pick at most 7 variables from v
for x in v:
    problem += 7 * x <= c  # if any variable in v is picked, then we must pick 7 of them
This is by no means the best way to model that (indicator variables would be much better), but it should give you an idea what to do.
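For reference, a minimal sketch of the indicator-variable idea, reusing c, i and j from the snippet above (the variable name IND is made up):
# y is 1 iff worker j works the restricted shifts in the week starting at day i
y = pulp.LpVariable(f'IND_{j}_{i}', cat='Binary')
problem += c == 7 * y  # c is either 0 (no restricted shifts) or exactly a full week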
Just to offer an alternative approach, assuming (as I read it) that for any given week a worker can either work some combination of the shifts in [0:3] across the seven days, or one of the shifts in [4:5] every day: we can define a new binary variable Y[w][n][t] which is 1 if in week w worker n does restricted shift t, and 0 otherwise. We can then relate this variable to the existing variable X by adding constraints so that the values X can take depend on the values of Y.
# Define the sets of shifts
non_restricted_shifts = [0, 1, 2, 3]
restricted_shifts = [4, 5]
# Define a binary variable Y, 1 if for week w worker n works restricted shift t
Y = LpVariable.dicts('Y', (range(round(D/7)), range(N), restricted_shifts), cat=LpBinary)
# If sum(Y[week][n][:]) == 1, the total number of non-restricted shifts for that week and n must be 0
for week in range(round(D/7)):
    for n in range(N):
        prob += lpSum(X[d][n][t] for d in range(week*7, week*7 + 7) for t in non_restricted_shifts) <= 1000*(1 - lpSum(Y[week][n][t] for t in restricted_shifts))
# If worker n works restricted shift t on all 7 days of week w, then Y[week][n][t] == 1, otherwise it is 0
for week in range(round(D/7)):
    for n in range(N):
        for t in restricted_shifts:
            prob += lpSum(X[d][n][t] for d in range(week*7, week*7 + 7)) <= 7*(Y[week][n][t])
            prob += lpSum(X[d][n][t] for d in range(week*7, week*7 + 7)) >= Y[week][n][t]*7
Some example output (D=28, N=5, T=6):
/ M T W T F S S / M T W T F S S / M T W T F S S / M T W T F S S
WORKER 0
Shifts: / 2 3 1 3 3 2 2 / 1 0 2 3 2 2 0 / 3 1 2 2 3 1 1 / 2 3 0 3 3 0 3
WORKER 1
Shifts: / 3 1 2 3 1 1 2 / 3 3 2 3 3 3 3 / 4 4 4 4 4 4 4 / 1 3 2 2 3 2 1
WORKER 2
Shifts: / 1 2 3 1 3 1 1 / 3 3 2 2 3 2 3 / 3 2 3 0 3 1 0 / 4 4 4 4 4 4 4
WORKER 3
Shifts: / 2 2 3 2 1 2 3 / 5 5 5 5 5 5 5 / 3 1 3 1 0 3 1 / 2 2 2 2 3 0 3
WORKER 4
Shifts: / 5 5 5 5 5 5 5 / 3 3 1 0 2 3 3 / 0 3 3 3 3 0 2 / 3 3 3 2 3 2 3

Compute element overlap based on another column, pandas

If I have a dataframe of the form:
tag element_id
1 12
1 13
1 15
2 12
2 13
2 19
3 12
3 15
3 22
how can I compute the overlap of the tags in terms of element_id? The result, I guess, should be an overlap matrix of the form:
1 2 3
1 X 2 2
2 2 X 1
3 2 1 X
where I put X on the diagonal, since the overlap of a tag with itself is not relevant, and where the numbers in the matrix represent the total number of element_ids that the two tags share.
My attempts:
You could try a for loop like:
element_list = []
for item in df.itertuples():
    element_list += [item.element_id]
    element_tag = item.tag
    # then intersect the element_list row by row.
    # This is extremely costly for large datasets
The second thing I was thinking about was to use df.groupby('tag') and try to somehow intersect on element_id, but it is not clear to me how I can do that with grouped data.
merge + crosstab
# Find element overlap, remove same tag matches
res = df.merge(df, on='element_id').query('tag_x != tag_y')
pd.crosstab(res.tag_x, res.tag_y)
Output:
tag_y 1 2 3
tag_x
1 0 2 2
2 2 0 1
3 2 1 0
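The groupby idea from the question works as well: collect each tag's element_ids into a set and count the pairwise intersections. A small sketch (note that, unlike the crosstab above, the diagonal here holds each tag's own element count):
sets = df.groupby('tag')['element_id'].apply(set)
overlap = pd.DataFrame({a: [len(s & sets[b]) for b in sets.index] for a, s in sets.items()},
                       index=sets.index)
print(overlap)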

Finding relationship between variables

There are two sets:
A: 1 2 3
B: 1 2 3 4 5 6 7 8 9 10
Points in A serve multiple points in B. For example:
A 1: B 1 2 4
A 2: B 3 5 6
A 3: B 7 8 9 10
Given historical data of points in both A and B, how can I determine which point in A is serving which points in set B?
Encode the A and B columns as vectors and fit a classification model. Then, after fitting, you can make predictions for various encodings of A ((1, 0, 0), for example) and get a probability vector over B ((0.25, 0.5, 0.1, ..., 0.15), for example). In this example, value 1 of A serves values (1, 2, 3, 10) of B with the probabilities above. Depending on the task, you can select some threshold.
Depending on the data, you need to select an encoding method (dummy vs. one-hot), a model, and to think about sampling, metrics and so on.
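A minimal sketch of this idea with scikit-learn, using made-up historical pairs (a, b) that mean "point a in A served point b in B" (the data and the 0.1 threshold are purely illustrative):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

# Hypothetical history of (a, b) service events.
history = [(1, 1), (1, 2), (1, 4), (2, 3), (2, 5), (2, 6),
           (3, 7), (3, 8), (3, 9), (3, 10)]
A = np.array([[a] for a, b in history])
b = np.array([b for a, b in history])

enc = OneHotEncoder()  # encodes a=1 as (1, 0, 0), a=2 as (0, 1, 0), etc.
X = enc.fit_transform(A)

clf = LogisticRegression(max_iter=1000).fit(X, b)
probs = clf.predict_proba(enc.transform([[1]]))[0]  # probabilities over all values of B
served = clf.classes_[probs > 0.1]  # keep the B values above a task-dependent threshold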

Assigning one column to another column between pandas DataFrames (like vector to vector assignment)

I have a super strange problem which I spent the last hour trying to solve, but with no success. It is even stranger since I can't replicate it on a small scale.
I have a large DataFrame (150,000 entries). I took out a subset of it and did some manipulation. The subset was saved as a different variable, x.
x is smaller than the df, but its index lies in the same range as the df's. I'm now trying to assign x back to the DataFrame, replacing the values in the same column:
rep_Callers['true_vpID'] = x.true_vpID
This inserts all the different values of x in the right places in df, but instead of keeping the df.true_vpID values that are not in x, it fills them with NaNs. So I tried a different approach:
df.ix[x.index,'true_vpID'] = x.true_vpID
But instead of filling the x values into the right places in df, df.true_vpID gets filled with the first value of x, and only that one! I changed the first value of x several times to make sure this is indeed what is happening, and it is. I tried to replicate it on a small scale, but it didn't work:
import numpy as np
import pandas as pd
from random import random

df = pd.DataFrame({'a': np.ones(5), 'b': range(5)})
a b
0 1 0
1 1 1
2 1 2
3 1 3
4 1 4
z = pd.Series([random() for i in range(5)], index=range(5))
0 0.812561
1 0.862109
2 0.031268
3 0.575634
4 0.760752
df.ix[z.index[[1, 3]], 'b'] = z[[1, 3]]
a b
0 1 0.000000
1 1 0.862109
2 1 2.000000
3 1 0.575634
4 1 4.000000
I have really tried it all and need some new suggestions...
Try using df.update(updated_df_or_series): it aligns on the index and only overwrites the positions present in the object you pass in, leaving everything else untouched. (Note that .ix has since been deprecated and removed from pandas; use .loc or .iloc instead.)
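For example, a small sketch with made-up values mirroring the question's setup:
import pandas as pd

df = pd.DataFrame({'true_vpID': [10, 20, 30, 40]})
x = pd.DataFrame({'true_vpID': [99, 77]}, index=[1, 3])  # smaller subset, same index range
df.update(x)  # in place: overwrites rows 1 and 3, keeps rows 0 and 2
print(df)
#    true_vpID
# 0       10.0
# 1       99.0
# 2       30.0
# 3       77.0
# (note that update may upcast integer columns to float)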
Also, using a simple example, you can modify a DataFrame by doing an index query and modifying the resulting object. Note that this relies on the query returning a view rather than a copy, which pandas does not guarantee (modern versions raise a SettingWithCopyWarning here), so df.update is the safer route.
df_1
a b
0 1 0
1 1 1
2 1 2
3 1 3
4 1 4
df_2 = df_1.ix[3:5]
df_2.b = df_2.b + 2
df_2
a b
3 1 5
4 1 6
df_1
a b
0 1 0
1 1 1
2 1 2
3 1 5
4 1 6