I am trying to maximize the occurrences of a sequence of shifts for 5 workers over a year. Say I have 3 shifts (1, 2, 3) and a break (B), and I get as input a sequence of shifts, e.g. '111222333BBBBBB'.
What I need to do with it is maximize the number of appearances of that sequence in my workers' schedule.
I defined my workers' schedule as shifts[(w, d, s)] == 1, meaning worker w works shift s on day d.
What I tried to do: create Booleans (w_d) meaning that for worker w the sequence starts on day d, then maximize the number of true Booleans.
The problem: this takes too long and doesn't stop running even after a day, even though I set the number of cores to 8. If anyone has a better idea of how to do this, please let me know!
The code:
worker = 5
days = 365
required_sequence_bools = []
required_sequence = "111222333BBBBBB"
for w in range(worker):
    required_sequence_bools.append([])
    for d in range(1, days - len(required_sequence)):
        required_sequence_bools[w].append(model.NewBoolVar(f"{w}_{d}"))
for w in range(worker):
    for d in range(0, days - len(required_sequence) - 1):
        day = d + 1
        for letter in required_sequence:
            if letter == '1':
                model.Add(shifts[(w, day, 0)] == 1).OnlyEnforceIf(required_sequence_bools[w][d])
            elif letter == '2':
                model.Add(shifts[(w, day, 1)] == 1).OnlyEnforceIf(required_sequence_bools[w][d])
            elif letter == '3':
                model.Add(shifts[(w, day, 2)] == 1).OnlyEnforceIf(required_sequence_bools[w][d])
            elif letter == 'B':
                model.Add(shifts[(w, day, 3)] == 1).OnlyEnforceIf(required_sequence_bools[w][d])
            day += 1
model.Maximize(sum(required_sequence_bools[w][d] for w in range(worker) for d in range(0, days - len(required_sequence) - 1)))
Example: required_sequence = "112233BB", 4 workers:
|         | day1 | day2 | day3 | day4 | day5 | day6 | day7 | day8 | day9 | day10 | day11 | day12 | day13 | day14 |
|---------|------|------|------|------|------|------|------|------|------|-------|-------|-------|-------|-------|
| worker1 | 1    | 1    | 2    | 2    | 3    | 3    | B    | B    |      |       |       |       |       |       |
| worker2 | 1    | 1    | 2    | 2    | 3    | 3    | B    | B    |      |       |       |       |       |       |
| worker3 | 1    | 1    | 2    | 2    | 3    | 3    | B    | B    |      |       |       |       |       |       |
| worker4 | 1    | 1    | 2    | 2    | 3    | 3    | B    | B    |      |       |       |       |       |       |
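For reference, here is a sketch of a more compact formulation, assuming shifts[(w, d, s)] are BoolVars and reusing the names defined above. Since the objective only pushes the start Booleans up, a single reified AddBoolAnd per start day can replace the per-letter implications, and a time limit makes the solver stop with the best solution found instead of running indefinitely:
from ortools.sat.python import cp_model

# letter -> shift index, as used in the code above
SHIFT_INDEX = {'1': 0, '2': 1, '3': 2, 'B': 3}

for w in range(worker):
    for d in range(0, days - len(required_sequence) - 1):
        day = d + 1
        # one reified conjunction per start day instead of one implication per letter
        model.AddBoolAnd(
            [shifts[(w, day + i, SHIFT_INDEX[letter])]
             for i, letter in enumerate(required_sequence)]
        ).OnlyEnforceIf(required_sequence_bools[w][d])

solver = cp_model.CpSolver()
solver.parameters.num_search_workers = 8     # use the 8 cores
solver.parameters.max_time_in_seconds = 600  # stop and keep the best solution found so far
status = solver.Solve(model)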
For a work shift optimization problem, I've defined a binary variable in PuLP as follows:
pulp.LpVariable.dicts('VAR', (range(D), range(N), range(T)), 0, 1, 'Binary')
where
D = # days in each schedule we create (=28, or 4 weeks)
N = # of workers
T = types of work shift (=6)
For the 5th and 6th types of work shift (with indices 4 and 5), I need to add a constraint that any worker who works these shifts must do so for seven consecutive days... and not any seven days, but the seven days starting from Monday (i.e. a full week). I've tried defining the constraint as follows, but I'm getting an infeasible solution when I add this constraint and try to solve the problem (it worked before without it).
I know this constraint (along with the others from before) should theoretically be feasible because we manually schedule work shifts with the same set of constraints. Is there anything wrong with the way I've coded the constraint?
## looping over each worker
for j in range(N):
    ## looping over every Monday in the 28 days
    for i in range(0, D, 7):
        c = 0
        ## accessing only the 5th and 6th work shift types
        for k in range(4, T):
            c += var[i][j][k] + var[i+1][j][k] + var[i+2][j][k] + var[i+3][j][k] + var[i+4][j][k] + var[i+5][j][k] + var[i+6][j][k]
        problem += c == 7
If I understand correctly, your constraint requires each worker to work the shifts with index 4 and 5 in every single week. This is because of c == 7, i.e. 7 of the binaries in c must be set to 1. This does not allow any worker to work in shifts 0 through 3, right?
You need to change the constraint so that c == 7 is only enforced if the worker works any shift in that range. A very simple way to do that would be something like:
v = list()
for k in range(4, T):
    v.extend([var[i][j][k], var[i+1][j][k], var[i+2][j][k], var[i+3][j][k], var[i+4][j][k], var[i+5][j][k], var[i+6][j][k]])
c = sum(v)
problem += c <= 7  # we can pick at most 7 variables from v
for x in v:
    problem += 7 * x <= c  # if any variable in v is picked, then we must pick 7 of them
This is by no means the best way to model that (indicator variables would be much better), but it should give you an idea what to do.
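For illustration, here is a minimal sketch of that indicator-variable version, assuming the model from the question (binary var[d][j][k] and problem) plus an existing one-shift-per-worker-per-day constraint, so that c can never exceed 7:
import pulp

for j in range(N):
    for i in range(0, D, 7):  # every Monday
        # hypothetical indicator: y == 1 iff worker j spends week i on shifts 4/5
        y = pulp.LpVariable(f"restricted_{j}_{i}", cat="Binary")
        c = pulp.lpSum(var[i + d][j][k] for d in range(7) for k in range(4, T))
        # c == 7*y forces c to be exactly 0 (y = 0) or exactly 7 (y = 1)
        problem += c == 7 * y
As with the sum-based version above, this still allows a worker to mix shifts 4 and 5 within the week; add one indicator per shift type (as in the next answer) if that must be forbidden.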
Just to offer an alternative approach, assuming (as I read it) that in any given week a worker can either work some combination of the shifts in [0:3] across the seven days, or one of the shifts in [4:5] every day. We can do this by defining a new binary variable Y[w][n][t] which is 1 if in week w worker n does restricted shift t, and 0 otherwise. Then we can relate this variable to our existing variable X by adding constraints, so that the values X can take depend on the values of Y.
# Define the sets of shifts
non_restricted_shifts = [0, 1, 2, 3]
restricted_shifts = [4, 5]

# Define a binary variable Y, 1 if for week w worker n works restricted shift t
Y = LpVariable.dicts('Y', (range(round(D/7)), range(N), restricted_shifts), cat=LpBinary)

# If sum(Y[week][n][:]) == 1, the total number of non-restricted shifts for that week and n must be 0
for week in range(round(D/7)):
    for n in range(N):
        prob += lpSum(X[d][n][t] for d in range(week*7, week*7 + 7) for t in non_restricted_shifts) <= 1000*(1 - lpSum(Y[week][n][t] for t in restricted_shifts))

# If worker n works restricted shift t on all 7 days of week w, then Y[week][n][t] == 1; otherwise it is 0
for week in range(round(D/7)):
    for n in range(N):
        for t in restricted_shifts:
            prob += lpSum(X[d][n][t] for d in range(week*7, week*7 + 7)) <= 7*Y[week][n][t]
            prob += lpSum(X[d][n][t] for d in range(week*7, week*7 + 7)) >= 7*Y[week][n][t]
Some example output (D=28, N=5, T=6):
/ M T W T F S S / M T W T F S S / M T W T F S S / M T W T F S S
WORKER 0
Shifts: / 2 3 1 3 3 2 2 / 1 0 2 3 2 2 0 / 3 1 2 2 3 1 1 / 2 3 0 3 3 0 3
WORKER 1
Shifts: / 3 1 2 3 1 1 2 / 3 3 2 3 3 3 3 / 4 4 4 4 4 4 4 / 1 3 2 2 3 2 1
WORKER 2
Shifts: / 1 2 3 1 3 1 1 / 3 3 2 2 3 2 3 / 3 2 3 0 3 1 0 / 4 4 4 4 4 4 4
WORKER 3
Shifts: / 2 2 3 2 1 2 3 / 5 5 5 5 5 5 5 / 3 1 3 1 0 3 1 / 2 2 2 2 3 0 3
WORKER 4
Shifts: / 5 5 5 5 5 5 5 / 3 3 1 0 2 3 3 / 0 3 3 3 3 0 2 / 3 3 3 2 3 2 3
I have a large data frame. Sample below
| year | sentences | company |
|------|-------------------|---------|
| 2020 | [list of strings] | A |
| 2019 | [list of strings] | A |
| 2018 | [list of strings] | A |
| ... | .... | ... |
| 2020 | [list of strings] | Z |
| 2019 | [list of strings] | Z |
| 2018 | [list of strings] | Z |
I want to compare the sentences column by company and by year, so as to get a year-on-year change.
Example: for company A, I would like to apply an operator such as sentence similarity or some distance metric for the [list of strings]2020 and [list of strings]2019, then [list of strings]2019 and [list of strings]2018.
Similarly for company B, C, ... Z.
How can this be achieved?
EDIT
The length of [list of strings] is variable, so some simple quantifying operators could be:
Difference in number of elements --> length([list of strings]2020) - length([list of strings]2019)
Count of common elements --> length(set([list of strings]2020) ∩ set([list of strings]2019))
The comparisons should be:
| years | Y-o-Y change (Some function) | company |
|-----------|------------------------------|---------|
| 2020-2019 | 15 | A |
| 2019-2018 | 3 | A |
| 2018-2017 | 55 | A |
| ... | .... | ... |
| 2020-2019 | 33 | Z |
| 2019-2018 | 32 | Z |
| 2018-2017 | 27 | Z |
TL;DR: see the full code at the bottom.
You have to break your task down into simpler subtasks. Basically, you want to apply one or several calculations to successive rows of your dataframe, grouped by company. This means you will have to use groupby and apply.
Let's start with generating an example dataframe. Here I used lowercase letters as words for the "sentences" column.
import numpy as np
import pandas as pd
import string
df = pd.DataFrame({'date': np.tile(range(2020, 2010, -1), 3),
                   'sentences': [np.random.choice(list(string.ascii_lowercase), size=np.random.randint(10)) for i in range(30)],
                   'company': np.repeat(list('ABC'), 10)})
df
output:
date sentences company
0 2020 [z] A
1 2019 [s, f, g, a, d, a, h, o, c] A
2 2018 [b] A
…
26 2014 [q] C
27 2013 [i, w] C
28 2012 [o, p, i, d, f, w, k, d] C
29 2011 [l, f, h, p] C
Concatenate the "sentences" column of the next row (previous year):
pd.concat([df, df.shift(-1).add_suffix('_pre')], axis=1)
output:
date sentences company date_pre sentences_pre company_pre
0 2020 [z] A 2019.0 [s, f, g, a, d, a, h, o, c] A
1 2019 [s, f, g, a, d, a, h, o, c] A 2018.0 [b] A
2 2018 [b] A 2017.0 [x, n, r, a, s, d] A
3 2017 [x, n, r, a, s, d] A 2016.0 [u, n, g, u, k, s, v, s, o] A
4 2016 [u, n, g, u, k, s, v, s, o] A 2015.0 [v, g, d, i, b, z, y, k] A
5 2015 [v, g, d, i, b, z, y, k] A 2014.0 [q, o, p] A
6 2014 [q, o, p] A 2013.0 [j, s, s] A
7 2013 [j, s, s] A 2012.0 [g, u, l, g, n] A
8 2012 [g, u, l, g, n] A 2011.0 [v, p, y, a, s] A
9 2011 [v, p, y, a, s] A 2020.0 [a, h, c, w] B
…
Define a function to compute a number of distance metrics (here the two defined in the question). TypeError is caught to handle the case where there is no row to compare with (one occurrence per group).
def compare_lists(s):
    l1 = s['sentences_pre']
    l2 = s['sentences']
    try:
        return pd.Series({'years': '%d–%d' % (s['date'], s['date_pre']),
                          'yoy_diff_len': len(l2) - len(l1),
                          'yoy_nb_common': len(set(l1).intersection(set(l2))),
                          'company': s['company'],
                          })
    except TypeError:
        return
This works on a sub-dataframe that was filtered to match only one company:
df2 = df.query('company == "A"')
pd.concat([df2, df2.shift(-1).add_suffix('_pre')], axis=1).dropna().apply(compare_lists, axis=1)
output:
years yoy_diff_len yoy_nb_common company
0 2020–2019 -4 0 A
1 2019–2018 6 1 A
2 2018–2017 1 0 A
3 2017–2016 1 0 A
4 2016–2015 -7 0 A
5 2015–2014 4 0 A
6 2014–2013 1 0 A
7 2013–2012 -1 0 A
8 2012–2011 -5 1 A
Now you can make a function to construct each dataframe per group and apply the computation:
def group_compare(df):
    df2 = pd.concat([df, df.shift(-1).add_suffix('_pre')], axis=1).dropna()
    return df2.apply(compare_lists, axis=1)
and apply this function on each group:
df.groupby('company').apply(group_compare)
Full code:
import numpy as np
import pandas as pd
import string

df = pd.DataFrame({'date': np.tile(range(2020, 2010, -1), 3),
                   'sentences': [np.random.choice(list(string.ascii_lowercase), size=np.random.randint(10)) for i in range(30)],
                   'company': np.repeat(list('ABC'), 10)})

def compare_lists(s):
    l1 = s['sentences_pre']
    l2 = s['sentences']
    try:
        return pd.Series({'years': '%d–%d' % (s['date'], s['date_pre']),
                          'yoy_diff_len': len(l2) - len(l1),
                          'yoy_nb_common': len(set(l1).intersection(set(l2))),
                          'company': s['company'],
                          })
    except TypeError:
        return

def group_compare(df):
    df2 = pd.concat([df, df.shift(-1).add_suffix('_pre')], axis=1).dropna()
    return df2.apply(compare_lists, axis=1)

## uncomment below to remove "company" index
df.groupby('company').apply(group_compare) #.reset_index(level=0, drop=True)
output:
years yoy_diff_len yoy_nb_common company
company
A 0 2020–2019 -8 0 A
1 2019–2018 8 0 A
2 2018–2017 -5 0 A
3 2017–2016 -3 2 A
4 2016–2015 1 3 A
5 2015–2014 5 0 A
6 2014–2013 0 0 A
7 2013–2012 -2 0 A
8 2012–2011 0 0 A
B 10 2020–2019 3 0 B
11 2019–2018 -6 1 B
12 2018–2017 3 0 B
13 2017–2016 -5 1 B
14 2016–2015 2 2 B
15 2015–2014 4 1 B
16 2014–2013 3 0 B
17 2013–2012 -8 0 B
18 2012–2011 1 1 B
C 20 2020–2019 8 1 C
21 2019–2018 -7 0 C
22 2018–2017 0 1 C
23 2017–2016 7 0 C
24 2016–2015 -3 0 C
25 2015–2014 3 0 C
26 2014–2013 -1 0 C
27 2013–2012 -6 2 C
28 2012–2011 4 2 C
I want to generate a matrix using pandas for the dataframe df below, with the following logic:
Group by Id.
Low: Mid, Top: End
For day 1: count if the Id's levels include both Mid and End and day == 1
For day 2: count if the Id's levels include both Mid and End and day == 2
…
Low: Mid, Top: New
For day 1: count if the Id's levels include both Mid and New and day == 1
For day 2: count if the Id's levels include both Mid and New and day == 2
…
df = pd.DataFrame({'Id':[111,111,222,333,333,444,555,555,555,666,666],'Level':['End','Mid','End','End','Mid','New','End','New','Mid','New','Mid'],'day' : ['',3,'','',2,3,'',3,4,'',2]})
| Id  | Level | day |
|-----|-------|-----|
| 111 | End   |     |
| 111 | Mid   | 3   |
| 222 | End   |     |
| 333 | End   |     |
| 333 | Mid   | 2   |
| 444 | New   | 3   |
| 555 | End   |     |
| 555 | New   | 3   |
| 555 | Mid   | 4   |
| 666 | New   |     |
| 666 | Mid   | 2   |
The matrix would look like this:
| Low | Top | day1 | day2 | day3 | day4 |
|-----|-----|------|------|------|------|
| Mid | End | 0    | 1    | 1    | 0    |
| Mid | New | 0    | 1    | 0    | 1    |
| New | End | 0    | 0    | 1    | 0    |
| New | Mid | 0    | 0    | 0    | 1    |
Thank you!
Starting from your dataframe:
import itertools

# all the combinations of Levels, in a canonical (sorted) order
level_combos = list(itertools.combinations(sorted(df['Level'].unique()), 2))
# create the output and fill it with zeros (columns are days 1 to 4)
df_output = pd.DataFrame(0, index=level_combos, columns=range(1, 5))
It is probably not very efficient, but it should work:
for _, g in df.groupby('Id'):  # group by Id
    # combinations of levels for this Id, sorted to match the index
    level_combos_this_id = list(itertools.combinations(sorted(g['Level'].unique()), 2))
    # days present for this Id ('' is coerced to NaN and dropped)
    days_this_id = pd.to_numeric(g['day'], errors='coerce').dropna().astype(int).values
    # set those days to 1 for every level pair of this Id
    df_output.loc[level_combos_this_id, days_this_id] = 1
Finally, rename the columns to get to the desired output:
df_output.columns = ['day' + str(c) for c in df_output.columns]
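For what it's worth, here is a more compact sketch of the same presence logic with pd.crosstab, assuming (as in the example) that days run from 1 to 4:
import itertools
import pandas as pd

# collect one (Low, Top, day) record per level pair and day of each Id
records = []
for _, g in df.groupby('Id'):
    pairs = list(itertools.combinations(sorted(g['Level'].unique()), 2))
    days = pd.to_numeric(g['day'], errors='coerce').dropna().astype(int)
    records += [(low, top, day) for low, top in pairs for day in days]

out = pd.DataFrame(records, columns=['Low', 'Top', 'day'])
matrix = (pd.crosstab([out['Low'], out['Top']], out['day'])
            .clip(upper=1)                                # presence, not counts
            .reindex(columns=range(1, 5), fill_value=0))  # include empty days
matrix.columns = ['day' + str(c) for c in matrix.columns]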
I have a dataframe, and I want to go
FROM:
dow yield
0 F 2
1 F 3
2 M 4
3 M 6
4 TH 7
TO:
dow ysum
0 F 5
1 M 10
2 TH 7
but I got this:
     yield
dow
F        5
M       10
TH       7
This is how I did it:
import pandas as pd

d1 = ['F', 'F', 'M', 'M', 'TH']
d2 = [2, 3, 4, 6, 7]
d = {'dow': d1, 'yield': d2}
df = pd.DataFrame(data=d, index=None)
df1 = df.groupby('dow').sum()
How could I get the result with dow as a column instead of the index?
The first column is the index, so you can add the parameter as_index=False:
df1 = df.groupby('dow', as_index=False).sum()
print (df1)
dow yield
0 F 5
1 M 10
2 TH 7
Or reset_index:
df1 = df.groupby('dow').sum().reset_index()
print (df1)
dow yield
0 F 5
1 M 10
2 TH 7
Let's say I have the following dataframe:
elements = [1, 1, 1, 1, 1, 2, 3, 4, 5]
df = pd.DataFrame({'elements': elements})
print(df)
   elements
0         1
1         1
2         1
3         1
4         1
5         2
6         3
7         4
8         5
I have a list [1, 1, 2, 3] and I want a subset of the dataframe including those 4 elements, for example:
elements
0 1
1 1
5 2
6 3
I have been able to deal with it by building a dict counting the item occurrences in the list and building a new dataframe by appending subparts of the initial one.
Would you know of some dataframe methods to help me find a more elegant solution?
After @jezrael's comment: I must add that I need to keep track of the initial index (in df).
We can see df (the first dataframe) as a repository of resources, and I need to track which rows/indices are attributed.
The use case is: among the elements in df, give me two 1s, one 2 and one 3. I would persist the fact that I have rows 0 and 1 as 1, row 5 as 2 and row 6 as 3.
If and only if your Series and list are sorted (otherwise, see below), then you can do:
L = [1, 1, 2, 3]
df[df.elements.apply(lambda x: x == L.pop(0) if x in L else False)]
elements
0 1
1 1
5 2
6 3
list.pop(i) returns and removes the value in the list at index i. Because both the elements and L are sorted, popping the first element (i == 0) of the subset list L always matches the current element of elements whenever a match exists.
So at each iteration of the lambda on elements, L will become:
| element | L            | Output |
|---------|--------------|--------|
| 1       | [1, 1, 2, 3] | True   |
| 1       | [1, 2, 3]    | True   |
| 1       | [2, 3]       | False  |
| 1       | [2, 3]       | False  |
| 1       | [2, 3]       | False  |
| 2       | [2, 3]       | True   |
| 3       | [3]          | True   |
| 4       | []           | False  |
| 5       | []           | False  |
As you can see, your list is empty at the end, so if it's a problem, you can copy it beforehand. Or, you actually have that information in the new dataframe you just created!
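For example, a quick sketch of the copy-beforehand option, reusing the sorted case above:
L = [1, 1, 2, 3]
Lc = list(L)  # work on a copy so the original list survives
subset = df[df.elements.apply(lambda x: x == Lc.pop(0) if x in Lc else False)]
print(L)  # [1, 1, 2, 3] -- unchanged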
If df.elements is not sorted, create a sorted copy and apply the same lambda function to it; its boolean output is then used, through index alignment, to select rows of the original dataframe (indexes whose values are True are kept):
df
elements
0 5
1 4
2 3
3 1
4 2
5 1
6 1
7 1
8 1
L = [1, 1, 2, 3]
cp = df.elements.copy()
cp.sort_values(inplace=True)
tmp = df.loc[cp.apply(lambda x: x == L.pop(0) if x in L else False)]
print(tmp)
elements
2 3
3 1
4 2
5 1
HTH
Extraction is also possible with merge, using helper columns created by GroupBy.cumcount:
L = [1,1,2,3]
df1 = pd.DataFrame({'elements':L})
df['g'] = df.groupby('elements')['elements'].cumcount()
df1['g'] = df1.groupby('elements')['elements'].cumcount()
print (df)
elements g
0 1 0
1 1 1
2 1 2
3 1 3
4 1 4
5 2 0
6 3 0
7 4 0
8 5 0
print (df1)
elements g
0 1 0
1 1 1
2 2 0
3 3 0
print (pd.merge(df,df1, on=['elements', 'g']))
elements g
0 1 0
1 1 1
2 2 0
3 3 0
print (pd.merge(df.reset_index(),df1, on=['elements', 'g'])
.drop('g', axis=1)
.set_index('index')
.rename_axis(None))
elements
0 1
1 1
5 2
6 3