Quantify total time saved by prioritizing tasks based on the failure probability of each task - optimization

I am trying to solve a problem where I prioritize the tasks in a job based on the failure rate of each task. For example:
Task   P(failure)   Time taken (sec)
A      0.7          10
B      0.1          15
C      0.5          3
D      0.3          5
This is a sequence of tasks, and if even one task fails, the entire job fails. So I want to prioritize my tasks to save the maximum amount of time. To do that, I run the tasks in decreasing order of failure probability, so my current order is A, C, D and then B. The problem with my approach, I feel, is that I am not considering the time factor. Is there a better way to prioritize my tasks that also takes the time taken into consideration?

Well, your intuition is correct. The things you'd like to do first are those with either a high failure probability (so you don't waste a bunch of time only to fail later) or a short duration (so if they do fail, you haven't wasted much time on them).
Here is a brute-force solution that looks at all possible sequences. It is fine for a handful of tasks, but it enumerates all n! orderings, so it will not scale much past a dozen tasks. There are probably more elegant solutions, and maybe even a simple math model; see the note after the output below.
Anyhow. Assumptions: all failures are independent of each other, a failure occurs (or is recognized) at the end of a task, and we condition on a failure occurring. We know from probability theory that if the tasks are independent, P{success} does not depend on the order of the tasks, so every sequence has the same overall likelihood of failure; only the lost time differs, depending on the sequence and where in the sequence the failure occurs.
The code below calculates E[x | failure], where x is the wasted time. It is just a sum-product over the possible failure points: for each prefix of a sequence, the elapsed time through that prefix times the probability of failing exactly there, normalized by the probability that a failure occurs at all: E[x | failure] = sum_i t(1..i) * p_i * prod_{j<i}(1 - p_j) / (1 - P{success}).
# scheduling with failures
from itertools import permutations as perm
from math import prod
from tabulate import tabulate

tasks = {'A': (0.7, 10),  # (P{fail}, time)
         'B': (0.1, 15),
         'C': (0.5, 3),
         'D': (0.3, 5)}

# let's start with finding P{success}
p_s = prod(1 - tasks[k][0] for k in tasks)
success_time = sum(tasks[k][1] for k in tasks)
print(f'the probability the sequence completes successfully is: {p_s:0.3f}')
print(f'when successful, the time is: {success_time}')

min_E_x = success_time  # upper bound on min_E_x: the minimum expected value of x
best = None
sequences, vals = [], []
for seq in perm(tasks, len(tasks)):  # all permutations
    E_x = 0  # the expected value of x for this sequence, where x = time wasted
    for i in range(len(seq)):
        p = tasks[seq[i]][0]  # P{fail on task i}
        # P{all earlier tasks pass}; prod of an empty range is 1, so i == 0 works too
        p *= prod(1 - tasks[seq[j]][0] for j in range(i))
        # elapsed time through task i for this sequence
        time = sum(tasks[seq[j]][1] for j in range(i + 1))
        # normalize the probability (we know a failure has occurred)
        p = p / (1 - p_s)
        E_x += p * time  # E[x] = sum of all p*x
        # print(seq[0:i+1], time, p, E_x)
    sequences.append(seq)
    vals.append(E_x)
    if E_x < min_E_x:
        best = seq
        min_E_x = E_x

print(f'\nThe best selection with minimal wasted time given a failure is: {best} with E[wasted time]: {min_E_x:0.3f}\n')
print(tabulate(zip(sequences, vals), floatfmt=".2f", headers=['Sequence', 'E[x]|failure']))
Yields
the probability the sequence completes successfully is: 0.095
when successful, the time is: 33
The best selection with minimal wasted time given a failure is: ('C', 'A', 'D', 'B') with E[wasted time]: 7.959
Sequence E[x]|failure
-------------------- --------------
('A', 'B', 'C', 'D') 14.21
('A', 'B', 'D', 'C') 14.69
('A', 'C', 'B', 'D') 11.82
('A', 'C', 'D', 'B') 11.16
('A', 'D', 'B', 'C') 13.36
('A', 'D', 'C', 'B') 11.69
('B', 'A', 'C', 'D') 24.70
('B', 'A', 'D', 'C') 25.18
('B', 'C', 'A', 'D') 21.82
('B', 'C', 'D', 'A') 22.07
('B', 'D', 'A', 'C') 25.67
('B', 'D', 'C', 'A') 23.66
('C', 'A', 'B', 'D') 8.62
('C', 'A', 'D', 'B') 7.96
('C', 'B', 'A', 'D') 13.87
('C', 'B', 'D', 'A') 14.12
('C', 'D', 'A', 'B') 8.23
('C', 'D', 'B', 'A') 11.91
('D', 'A', 'B', 'C') 13.91
('D', 'A', 'C', 'B') 12.24
('D', 'B', 'A', 'C') 21.26
('D', 'B', 'C', 'A') 19.24
('D', 'C', 'A', 'B') 10.00
('D', 'C', 'B', 'A') 13.67
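As for a more elegant solution: there does seem to be a simple rule for this model. Minimizing E[wasted time | failure] is equivalent to minimizing the expected total running time, since P{success} and the success time do not depend on the order. A standard exchange argument (swap two adjacent tasks and compare the expected time spent on the pair) then says task i should come before task j exactly when time_i/p_i < time_j/p_j, so sorting by time/P{fail} ascending should be optimal. A minimal sketch, worth double-checking against the brute force:

# greedy rule: run tasks in ascending order of time / P{fail}
# (a task with P{fail} == 0 would divide by zero; such tasks can simply go last)
tasks = {'A': (0.7, 10), 'B': (0.1, 15), 'C': (0.5, 3), 'D': (0.3, 5)}
order = sorted(tasks, key=lambda k: tasks[k][1] / tasks[k][0])
print(order)  # ['C', 'A', 'D', 'B'] -- matches the brute-force winner above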

Related

How do I set the max values for a plotly px.sunburst graph?

I am trying to show how much a student has completed from a set of challenges with a plotly sunburst graph. I want the maximum value for each category to be shown, but with only the challenges they've done filled in; I was thinking of having the ones they did not do greyed out. I have the max values for each of the challenges in the dataframe challenge_count_df and the student's work in student_df:
import pandas as pd
import plotly.express as px

challenge_count_df = pd.DataFrame({'Challenge': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'],
                                   'Value': ["5", "5", "10", "15", "5", "10", "5", "10", "15", "10"],
                                   'Category': ['linux', 'primer', 'windows', 'linux', 'linux', 'primer', 'windows', 'linux', 'linux', 'primer']})
student_df = pd.DataFrame({'Challenge': ['B', 'C', 'E', 'F', 'G', 'H', 'I'],
                           'Value': ["5", "10", "5", "10", "5", "10", "15"],
                           'Category': ['primer', 'windows', 'linux', 'primer', 'windows', 'linux', 'linux']})
As you can see, the student_df has some of the challenges missing. That's because they didn't answer them.
I know how to create a sunburst like this:
fig = px.sunburst(challenge_count_df, path=['Category', 'Challenge'], values='Value')
Is there a way to overlap that with this?
fig = px.sunburst(student_df, path=['Category', 'Challenge'], values='Value')
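One way to get the greyed-out effect is to plot the full challenge_count_df and colour each challenge by whether the student completed it, rather than overlapping two sunbursts. A sketch only; the Done flag and the colour map are my own invention, not a tested plotly recipe:

import pandas as pd
import plotly.express as px

merged = challenge_count_df.copy()
merged['Value'] = merged['Value'].astype(int)  # values must be numeric to sum
# flag which challenges appear in the student's work
merged['Done'] = merged['Challenge'].isin(student_df['Challenge']).map(
    {True: 'done', False: 'missing'})

fig = px.sunburst(merged, path=['Category', 'Challenge'], values='Value',
                  color='Done',
                  color_discrete_map={'done': 'steelblue', 'missing': 'lightgrey'})
fig.show()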

Pyspark equivalent for groupby and aggregation

I have a PySpark dataframe and I am trying to perform a groupby and aggregation on it.
I am performing the following operations in Pandas and it works fine:
new_df = new_df.groupBy('A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'H', 'K', 'L', 'Cost1','Cost2','Cost3','Cost4','Cost5')
new_df = new_df.agg({'Cost1':sum, 'Cost2':sum, 'Cost3':sum,'Cost4':sum, 'Cost5':sum})
But I am unable to perform the same operation in PySpark using the syntax below:
new_df = new_df.groupBy('A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'H', 'K', 'L', 'Cost1','Cost2','Cost3','Cost4','Cost5').agg(F.sum(ost1','Cost2','Cost3','Cost4','Cost5'))
Error:
AttributeError: 'GroupedData' object has no attribute 'groupBy'
You have a typo here: (ost1', — you forgot the 'C. But the error relates to another problem in your code: you probably call groupBy() twice, like groupBy("A").groupBy("B"). You cannot do that; after groupBy() you should call one of the aggregation functions on the GroupedData object. I think you need code like this:
new_df = df.groupBy("A", "B").sum("Cost1", "Cost2")
new_df.show()
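If the goal is to sum all five cost columns while grouping by the other columns, a sketch (assuming F is pyspark.sql.functions; note the cost columns themselves should not also appear in the groupBy):

from pyspark.sql import functions as F

new_df = (
    df.groupBy('A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L')
      .agg(F.sum('Cost1').alias('Cost1'),
           F.sum('Cost2').alias('Cost2'),
           F.sum('Cost3').alias('Cost3'),
           F.sum('Cost4').alias('Cost4'),
           F.sum('Cost5').alias('Cost5'))
)
new_df.show()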

Pyspark conditional function evaluation based on another column

I have a sample data set like below
sample_data = [('A', 'Chetna', 5, 'date_add(date_format(current_date(), \'yyyy-MM-dd\'), 7)'),
('B', 'Tanmay', 6, '`date_add(date_format(current_date(), \'yyyy-MM-dd\'), 1)`'),
('C', 'CC', 2, '`date_add(date_format(current_date(), \'yyyy-MM-dd\'), 3)`'),
('D', 'TC', 9, '`date_add(date_format(current_date(), \'yyyy-MM-dd\'), 5)`')]
df = spark.createDataFrame(sample_data, ['id', 'name', 'days', 'applyMe'])
from pyspark.sql.functions import lit
df = df.withColumn("salary", lit('days * 60'))
I am trying to evaluate the function provided in the applyMe column, and the salary expression.
So far I have tried doing it with expr and eval, but no luck.
Could someone please point me in the right direction to achieve the desired output?
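expr() only accepts a literal SQL string, not a column, so the expression cannot be evaluated row by row directly. One workaround, sketched below under the assumption that the set of distinct strings in applyMe is small, is to collect them and build a chained when(); the applyDate column name is my own:

from pyspark.sql import functions as F

# collect the distinct expression strings (expr() wants literals)
exprs = [r['applyMe'] for r in df.select('applyMe').distinct().collect()]

applied = None
for e in exprs:
    cond = F.col('applyMe') == e
    val = F.expr(e.strip('`'))  # drop the stray backticks before parsing
    applied = F.when(cond, val) if applied is None else applied.when(cond, val)

df = df.withColumn('applyDate', applied)
# evaluate the salary expression instead of storing it as a literal string
df = df.withColumn('salary', F.expr('days * 60'))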

Pandas Dataframe Create Seaborn Horizontal Barplot with categorical data

I'm currently working with a data frame like this:
What I want is to show the total number of rows where the VICTORY column value is S, grouped by AGE_GROUP and split by GENDER, something like in the following horizontal barplot:
Until now I could obtain the following chart:
Following these steps:
victory_df = main_df[main_df["VICTORY"] == "S"]
victory_count = victory_df["AGE_GROUP"].value_counts()
sns.set(style="darkgrid")
sns.barplot(victory_count.index, victory_count.values, alpha=0.9)
What strategy should I use to split the value_counts by gender and include it in the chart?
It would obviously help to give raw data and not an image, so I came up with my own data. I'm not sure I understood your question, but my attempt is below.
Data
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame.from_dict({
    'VICTORY': ['S', 'S', 'N', 'N', 'N', 'S', 'N', 'S', 'N', 'S', 'N', 'S', 'S'],
    'AGE': [5., 88., 12., 19., 30., 43., 77., 50., 78., 34., 45., 9., 67.],
    'AGE_GROUP': ['0-13', '65+', '0-13', '18-35', '18-35', '36-64', '65+', '36-64', '65+', '18-35', '36-64', '0-13', '65+'],
    'GENDER': ['M', 'M', 'F', 'M', 'F', 'F', 'M', 'F', 'F', 'F', 'M', 'M', 'F']})
Plotting: I group by AGE_GROUP, value-count GENDER, unstack, and plot a stacked horizontal bar plot. Seaborn is built on matplotlib, and when a plot is not straightforward in seaborn (like the stacked horizontal bar), I fall back to matplotlib. Hope you don't take offence.
(df[df['VICTORY'] == 'S']
   .groupby('AGE_GROUP')['GENDER']
   .apply(lambda x: x.value_counts())
   .unstack()
   .plot(kind='barh', stacked=True))
plt.xlabel('Count')
plt.title('xxxx')
Output: a stacked horizontal bar chart of S counts per AGE_GROUP, split by GENDER.
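If a dodged (side-by-side) layout is acceptable instead of stacked, seaborn can produce it directly; a sketch:

import seaborn as sns

# one bar per GENDER within each AGE_GROUP, counting 'S' rows
sns.countplot(data=df[df['VICTORY'] == 'S'], y='AGE_GROUP', hue='GENDER')
plt.xlabel('Count')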

Plotting a multi-index dataframe with Altair

I have a dataframe which looks like:
import pandas as pd

data = {'ColA': {('A', 'A-1'): 0, ('A', 'A-2'): 1, ('A', 'A-3'): 1,
                 ('B', 'B-1'): 2, ('B', 'B-2'): 2, ('B', 'B-3'): 0,
                 ('C', 'C-1'): 1, ('C', 'C-2'): 2, ('C', 'C-3'): 2, ('C', 'C-4'): 3},
        'ColB': {('A', 'A-1'): 3, ('A', 'A-2'): 1, ('A', 'A-3'): 1,
                 ('B', 'B-1'): 0, ('B', 'B-2'): 2, ('B', 'B-3'): 2,
                 ('C', 'C-1'): 2, ('C', 'C-2'): 0, ('C', 'C-3'): 3, ('C', 'C-4'): 1}}
df = pd.DataFrame(data)
The values for every column are either 0, 1, 2, or 3. These values could just as easily be 'U', 'Q', 'R', or 'Z' ... i.e. there is nothing inherently numeric about them.
I would like to use Altair to produce two sets of charts.
First Set of Charts:
I would like to get one bar chart per column.
The labels for the X-axis should be based on the unique values in the columns. The Y-axis should be the count of the unique values in the column.
Second Set of Charts:
Similar to the first set, I would like to get one bar chart per row.
The labels for the X-axis should be based on the unique values in the row. The Y-axis should be the count of the unique values in the row.
This should be easy, but I am not sure how to do it.
All of Altair's APIs are column-based, and ignore indices unless you explicitly include them (see Including Index Data in Altair's documentation).
For the first set of charts (one bar chart per column) you can do this:
import altair as alt

alt.Chart(df.reset_index()).mark_bar().encode(
    alt.X(alt.repeat(), type='nominal'),
    y='count()'
).repeat(['ColA', 'ColB'])
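As an aside, the same per-column charts can be built without repeat by melting the frame into long form and faceting; a sketch (the 'column' and 'value' names are my own):

melted = df.reset_index(drop=True).melt(var_name='column', value_name='value')
alt.Chart(melted).mark_bar().encode(
    x='value:N',
    y='count()'
).facet(column='column:N')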
For the second set of charts (one bar chart per row) you can do something like this:
df_transposed = df.reset_index(0, drop=True).T
alt.Chart(df_transposed).mark_bar().encode(
    alt.X(alt.repeat(), type='nominal'),
    y='count()'
).repeat(list(df_transposed.columns), columns=5)
This is a bit of a strange visualization, though, so I suspect I'm misunderstanding what you're after: your data has ten rows, so one chart per row means ten charts.