Plotting a multi-index dataframe with Altair - pandas

I have a dataframe which looks like:
import pandas as pd

data = {'ColA': {('A', 'A-1'): 0,
                 ('A', 'A-2'): 1,
                 ('A', 'A-3'): 1,
                 ('B', 'B-1'): 2,
                 ('B', 'B-2'): 2,
                 ('B', 'B-3'): 0,
                 ('C', 'C-1'): 1,
                 ('C', 'C-2'): 2,
                 ('C', 'C-3'): 2,
                 ('C', 'C-4'): 3},
        'ColB': {('A', 'A-1'): 3,
                 ('A', 'A-2'): 1,
                 ('A', 'A-3'): 1,
                 ('B', 'B-1'): 0,
                 ('B', 'B-2'): 2,
                 ('B', 'B-3'): 2,
                 ('C', 'C-1'): 2,
                 ('C', 'C-2'): 0,
                 ('C', 'C-3'): 3,
                 ('C', 'C-4'): 1}}
df = pd.DataFrame(data)
The values for every column are either 0, 1, 2, or 3. These values could just as easily be 'U', 'Q', 'R', or 'Z' ... i.e. there is nothing inherently numeric about them.
I would like to use Altair to create two sets of charts from this dataframe.
**First Set of Charts**
I would like to get one bar chart per column.
The labels for the X-axis should be based on the unique values in the columns. The Y-axis should be the count of the unique values in the column.
**Second Set of Charts**
Similar to the first set, I would like to get one bar chart per row.
The labels for the X-axis should be based on the unique values in the row. The Y-axis should be the count of the unique values in the row.
This should be easy, but I am not sure how to do it.

All of Altair's APIs are column-based, and ignore indices unless you explicitly include them (see Including Index Data in Altair's documentation).
For the first set of charts (one bar chart per column) you can do this:
import altair as alt

alt.Chart(df.reset_index()).mark_bar().encode(
    alt.X(alt.repeat(), type='nominal'),
    y='count()'
).repeat(['ColA', 'ColB'])
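If you prefer faceting to repeating, a roughly equivalent sketch for the per-column charts (not part of the original answer; it assumes reshaping to long form with melt) would be:

df_long = df.reset_index(drop=True).melt()   # columns: 'variable', 'value'

alt.Chart(df_long).mark_bar().encode(
    alt.X('value:N'),
    y='count()',
    column='variable:N'
)

Faceting on the melted 'variable' column produces one small bar chart per original column within a single chart object.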
For the second set of charts (one bar chart per row) you can do something like this:
df_transposed = df.reset_index(0, drop=True).T

alt.Chart(df_transposed).mark_bar().encode(
    alt.X(alt.repeat(), type='nominal'),
    y='count()'
).repeat(list(df_transposed.columns), columns=5)
Though this is a bit of a strange visualization, so I suspect I'm misunderstanding what you're after... your data has ten rows, so one chart per row is ten charts.

Related

Quantify total time saved by prioritizing tasks based on the failure rate probability of each task

I am trying to solve a problem where I prioritize the tasks in a job based on the failure rate of each task. For example:
Task  p(Failure)  TimeTaken(sec)
A     0.7         10
B     0.1         15
C     0.5         3
D     0.3         5
This is a sequence of tasks, and if even one task fails, the entire job fails. So I want to prioritize my tasks to save the maximum amount of time. To do that, I am running the tasks in order of failure probability, so my current order of performing the tasks is A, C, D and then B. I feel the problem with my approach is that I am not considering the time factor. Is there a better way to prioritize my tasks that also takes the time taken into consideration?
Well, your intuition is correct. The things you'd like to do first are things that either have a really high failure probability (so you don't waste a bunch of time and fail later) or are short in duration (so if you do fail, you haven't wasted time on a long task).
Here is a brute-force solution that looks at all possible sequences. It scales OK. There are probably more elegant solutions and maybe even some math model that doesn't come to mind too quickly.
Anyhow. Assumptions: all failures are independent of each other, a failure occurs (or is recognized) at the end of the task, and the expectation is conditioned on a failure occurring somewhere in the sequence. We know from probability theory that if the tasks are independent, then P{success} does not depend on the sequence of the tasks, so all sequences have the same overall likelihood of failure as well. Only the lost time differs, depending on the sequence and where in the sequence the failure occurs.
The code below calculates the expected value of x given that a failure occurs, where x is the wasted time; it is just the sum-product of the elapsed times of the possible partial sequences and their probabilities of occurring.
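In symbols, for a given ordering with failure probabilities p_i and durations t_i, the quantity the loop below computes is

$$E[x \mid \text{failure}] \;=\; \frac{1}{1 - P_{\text{success}}} \sum_{i=1}^{n} \Bigl(\prod_{j=1}^{i-1} (1 - p_j)\Bigr)\, p_i \sum_{j=1}^{i} t_j$$

i.e. the probability that the first failure happens at position i, times the time elapsed through task i, normalized by the probability that some failure occurs.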
# scheduling with failures
from itertools import permutations as perm
from math import prod

from tabulate import tabulate

tasks = {'A': (0.7, 10),  # (P{fail}, time)
         'B': (0.1, 15),
         'C': (0.5, 3),
         'D': (0.3, 5)}

# let's start with finding P{success}
p_s = prod((1 - tasks[k][0]) for k in tasks)
success_time = sum(tasks[k][1] for k in tasks)
print(f'the probability the sequence completes successfully is: {p_s:0.3f}')
print(f'when successful, the time is: {success_time}')

min_E_x = success_time  # upper bound on min_E_x: the minimum expected value of x
best = None
sequences, vals = [], []
for seq in perm(tasks, len(tasks)):  # all permutations
    E_x = 0  # the expected value of x for this sequence, where x = time wasted
    for i in range(len(seq)):
        p = tasks[seq[i]][0]  # p{fail on last}
        earlier = prod((1 - tasks[seq[j]][0]) for j in range(i))  # p{all earlier tasks pass}
        if earlier:
            p *= earlier
        # get elapsed time for this sequence
        time = sum(tasks[seq[j]][1] for j in range(i + 1))
        # normalize the probability (we know a failure has occurred)
        p = p / (1 - p_s)
        E_x += p * time  # E[x] = sum of all p*x
        # print(seq[0:i+1], time, p, E_x)
    sequences.append(seq)
    vals.append(E_x)
    if E_x < min_E_x:
        best = seq
        min_E_x = E_x

print(f'\nThe best selection with minimal wasted time given a failure is: {best} with E[wasted time]: {min_E_x:0.3f}\n')
print(tabulate(zip(sequences, vals), floatfmt=".2f", headers=['Sequence', 'E[x]|failure']))
Yields
the probability the sequence completes successfully is: 0.095
when successful, the time is: 33
The best selection with minimal wasted time given a failure is: ('C', 'A', 'D', 'B') with E[wasted time]: 7.959
Sequence E[x]|failure
-------------------- --------------
('A', 'B', 'C', 'D') 14.21
('A', 'B', 'D', 'C') 14.69
('A', 'C', 'B', 'D') 11.82
('A', 'C', 'D', 'B') 11.16
('A', 'D', 'B', 'C') 13.36
('A', 'D', 'C', 'B') 11.69
('B', 'A', 'C', 'D') 24.70
('B', 'A', 'D', 'C') 25.18
('B', 'C', 'A', 'D') 21.82
('B', 'C', 'D', 'A') 22.07
('B', 'D', 'A', 'C') 25.67
('B', 'D', 'C', 'A') 23.66
('C', 'A', 'B', 'D') 8.62
('C', 'A', 'D', 'B') 7.96
('C', 'B', 'A', 'D') 13.87
('C', 'B', 'D', 'A') 14.12
('C', 'D', 'A', 'B') 8.23
('C', 'D', 'B', 'A') 11.91
('D', 'A', 'B', 'C') 13.91
('D', 'A', 'C', 'B') 12.24
('D', 'B', 'A', 'C') 21.26
('D', 'B', 'C', 'A') 19.24
('D', 'C', 'A', 'B') 10.00
('D', 'C', 'B', 'A') 13.67
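As for a "math model that doesn't come to mind too quickly": a standard pairwise-exchange argument suggests sorting the tasks by decreasing p(Failure)/TimeTaken, which reproduces the brute-force optimum for this example. A minimal sketch (not part of the answer above, so treat it as an unverified shortcut):

# sort tasks by failure probability per unit time, highest first
tasks = {'A': (0.7, 10), 'B': (0.1, 15), 'C': (0.5, 3), 'D': (0.3, 5)}
order = sorted(tasks, key=lambda k: tasks[k][0] / tasks[k][1], reverse=True)
print(order)  # ['C', 'A', 'D', 'B'] -- matches the brute-force best sequence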

Julia - Generate 2-matching odd Set

In Julia, given a Set{Tuple{Int, Int}} named S of length greater than 3, for instance:
julia> S = Set{Tuple{Int,Int}}([(1, 4), (2, 5), (2, 6), (3, 6)])
Set{Tuple{Int64,Int64}} with 4 elements:
(2, 5)
(3, 6)
(2, 6)
(1, 4)
I want to return a subset T of S of odd length (3, 5, 7, ...), at least 3, such that all first values of the tuples are unique. For instance, I can't have both (2, 5) and (2, 6) because the first value, 2, would not be unique. The same applies to second values, meaning that I can't have both (2, 6) and (3, 6).
If it is not possible, returning an empty Set of Tuple is fine.
Finally for the above minimal example the code should return:
julia> T = Set{Tuple{Int,Int}}([(1, 4), (2, 5), (3, 6)])
Set{Tuple{Int64,Int64}} with 3 elements:
(2, 5)
(3, 6)
(1, 4)
I am truly open to any other type of structure if you think it is better than Set{Tuple{Int, Int}} :)
I know how I can do it with integer programming. However, I will run this many times on large instances, and I would like to know if there is a better way, because I strongly suspect it can be done in polynomial time, perhaps in Julia with a clever map or other efficient functions!
What you need is a way to filter the possible combinations of members of a set, so create a filtering function. If the odd-length requirement (3, 5, 7, ...) you mentioned applies here, you might need to add that to the filter logic below:
using Combinatorics
allunique(a) = length(a) == length(unique(a))
slice(tuples, position) = [t[position] for t in tuples]
uniqueslice(tuples, position) = allunique(slice(tuples, position))
is_with_all_positions_unique(tuples) = all(n -> uniqueslice(tuples, n), 1:length(first(tuples)))
Now you can find combinations. With big sets these will explode in number, so make sure to exit when you have enough. You could use Lazy.jl here, or just a function:
function tcombinations(tuples, len, needed)
    printed = 0
    for combo in combinations(collect(tuples), len)
        if is_with_all_positions_unique(combo)
            printed += 1
            println(combo)
            printed >= needed && break
        end
    end
end
tcombinations(tuples, 3, 4)
[(2, 5), (4, 8), (3, 6)]
[(2, 5), (4, 8), (1, 4)]
[(2, 5), (4, 8), (5, 6)]
[(2, 5), (3, 6), (1, 4)]

Pyspark conditional function evaluation based on another column

I have a sample data set like below
sample_data = [('A', 'Chetna', 5, 'date_add(date_format(current_date(), \'yyyy-MM-dd\'), 7)'),
               ('B', 'Tanmay', 6, '`date_add(date_format(current_date(), \'yyyy-MM-dd\'), 1)`'),
               ('C', 'CC', 2, '`date_add(date_format(current_date(), \'yyyy-MM-dd\'), 3)`'),
               ('D', 'TC', 9, '`date_add(date_format(current_date(), \'yyyy-MM-dd\'), 5)`')]
df = spark.createDataFrame(sample_data, ['id', 'name', 'days', 'applyMe'])

from pyspark.sql.functions import lit
df = df.withColumn("salary", lit('days * 60'))
I am trying to evaluate the expression provided in the applyMe column, as well as the salary expression.
So far I have tried doing it with expr and eval, but with no luck.
Could someone please point me in the right direction to achieve the desired output?
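As a hedged sketch (not from the original thread): lit('days * 60') just stores the literal string, whereas pyspark.sql.functions.expr parses and evaluates a SQL expression string, so the fixed salary formula can be computed like this:

from pyspark.sql.functions import expr

# evaluate the SQL expression instead of storing it as a literal string
df = df.withColumn("salary", expr("days * 60"))

Evaluating a different expression per row (the strings stored in applyMe) is harder, because expr takes a single Python string rather than a column; one possible workaround is to collect the distinct expression strings and combine them with when/otherwise.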

Finding every points in a sphere in a 3d coordinates

I'm trying to solve this problem:
Given a radius and a point, find every point in the 3D coordinate system that lies within the sphere of that radius centered at the given point, and store them in a list.
You could do this with numpy, as below.
Note that the code here will give you coordinates relative to a sphere centered at a point you choose, with a radius you choose. You need to make sure that your input dimension 'dim' below is set so that the sphere is fully contained within that volume. It also only works for non-negative indices; if your point has any negative coordinates, use their absolute values, and then flip the signs of that axis's coordinates in the output yourself.
import numpy as np
dim = 15
# get 3 arrays representing indices along each axis
xx, yy, zz = np.ogrid[:dim, :dim, :dim]
# set your center point and the radius you want
center = [7, 7, 7]
radius = 3
# create 3d array with values that are the distance from the
# center squared
d2 = (xx-center[0])**2 + (yy-center[1])**2 + (zz-center[2])**2
# create a logical true/false array based on whether the values in d2
# above are less than radius squared
#
# so this is what you want - all the values within "radius" of the center
# are now set to True
mask = d2 <= radius**2
# calculate distance squared and compare to radius squared to avoid having to use
# slow sqrt()
# now you want to get the indices from the mask array where the value of the
# array is True. numpy.nonzero does that, and gives you 3 numpy 1d arrays of
# indices along each axis
s, t, u = np.nonzero(mask)
# finally, to get what you want, which is all those indices in a list, zip them together:
coords = list(zip(s, t, u))
print(coords)
>>>
[(2, 5, 6),
(3, 4, 5),
(3, 4, 6),
(3, 4, 7),
(3, 5, 5),
(3, 5, 6),
(3, 5, 7),
(3, 6, 5),
(3, 6, 6),
(3, 6, 7),
(4, 3, 6),
(4, 4, 5),
(4, 4, 6),
(4, 4, 7),
(4, 5, 4),
(4, 5, 5),
(4, 5, 6),
(4, 5, 7),
(4, 5, 8),
(4, 6, 5),
(4, 6, 6),
(4, 6, 7),
(4, 7, 6),
(5, 4, 5),
(5, 4, 6),
(5, 4, 7),
(5, 5, 5),
(5, 5, 6),
(5, 5, 7),
(5, 6, 5),
(5, 6, 6),
(5, 6, 7),
(6, 5, 6)]
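A pure-Python sketch of the same idea (not from the answer above): it scans only the bounding cube around the center, so no global grid size 'dim' is needed and negative center coordinates work directly.

def sphere_points(center, radius):
    # return all integer (x, y, z) points within `radius` of `center`
    cx, cy, cz = center
    points = []
    for x in range(cx - radius, cx + radius + 1):
        for y in range(cy - radius, cy + radius + 1):
            for z in range(cz - radius, cz + radius + 1):
                if (x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2 <= radius ** 2:
                    points.append((x, y, z))
    return points

print(sphere_points((7, 7, 7), 3))  # should match the list above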

Get row and column index of value in Pandas df

Currently I'm trying to automate scheduling.
I'll get the requirements as a .csv file.
However, the number of days changes by month, and personnel also change occasionally, which means the number of columns and rows is not fixed.
So, I want to put the value '*' as a marker meaning the end of a table. Unfortunately, I can't find a function or method that takes a value as a parameter and returns a (list of) index (the column name and row label, or index numbers).
Is there any way that I can find the index (or a list of indexes) of a certain value, like a coordinate?
For example, when the data frame is like below:
  column_1  column_2
1      'a'       'b'
2      'c'       'd'
how can I get 'column_2' and '2' by the value, 'd'? It's something similar to the opposite of .loc or .iloc.
Interesting question. I also used a list comprehension, but with np.where. Still I'd be surprised if there isn't a less clunky way.
df = pd.DataFrame({'column_1':['a','c'], 'column_2':['b','d']}, index=[1,2])
[(i, np.where(df[i] == 'd')[0].tolist()) for i in list(df) if len(np.where(df[i] == 'd')[0]) > 0]
> [('column_2', [1])]
Note that it returns the numeric (0-based) index, not the custom (1-based) index you have. If you have a fixed offset you could just add a +1 or whatever to the output.
If I understand what you are looking for, which is the (index value, column location) for a value in a dataframe, you can use a list comprehension in a loop. It probably won't be the fastest if your dataframe is large.
# assume this dataframe
df = pd.DataFrame({'col':['abc', 'def','wert','abc'], 'col2':['asdf', 'abc', 'sdfg', 'def']})
# list comprehension
[(df[col][df[col].eq('abc')].index[i], df.columns.get_loc(col)) for col in df.columns for i in range(len(df[col][df[col].eq('abc')].index))]
# [(0, 0), (3, 0), (1, 1)]
Change df.columns.get_loc(col) to col if you want the column name rather than its location:
[(df[col][df[col].eq('abc')].index[i], col) for col in df.columns for i in range(len(df[col][df[col].eq('abc')].index))]
# [(0, 'col'), (3, 'col'), (1, 'col2')]
I might be misunderstanding something, but np.where should get the job done.
df_tmp = pd.DataFrame({'column_1':['a','c'], 'column_2':['b','d']}, index=[1,2])
solution = np.where(df_tmp == 'd')
solution should contain row and column index.
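To turn those positional indices into the row label and column name, a small follow-up sketch (reusing the same df_tmp) could be:

rows, cols = np.where(df_tmp == 'd')
print([(df_tmp.index[r], df_tmp.columns[c]) for r, c in zip(rows, cols)])
# [(2, 'column_2')]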
Hope this helps!
To search for a single value:
df = pd.DataFrame({'column_1':['a','c'], 'column_2':['b','d']}, index=[1,2])
df[df == 'd'].stack().index.tolist()
[Out]:
[(2, 'column_2')]
To search for a list of values:
df = pd.DataFrame({'column_1':['a','c'], 'column_2':['b','d']}, index=[1,2])
df[df.isin(['a', 'd'])].stack().index.tolist()
[Out]:
[(1, 'column_1'), (2, 'column_2')]
It also works when the value occurs in multiple places:
df = pd.DataFrame({'column_1':['test','test'], 'column_2':['test','test']}, index=[1,2])
df[df == 'test'].stack().index.tolist()
[Out]:
[(1, 'column_1'), (1, 'column_2'), (2, 'column_1'), (2, 'column_2')]
Explanation
Select cells where the condition matches:
df[df.isin(['a', 'b', 'd'])]
[Out]:
  column_1 column_2
1        a        b
2      NaN        d
stack() reshapes the columns to index:
df[df.isin(['a', 'b', 'd'])].stack()
[Out]:
1  column_1    a
   column_2    b
2  column_2    d
The stacked result is now a Series with a MultiIndex:
df[df.isin(['a', 'b', 'd'])].stack().index
[Out]:
MultiIndex([(1, 'column_1'),
            (1, 'column_2'),
            (2, 'column_2')],
           )
Convert this multi-index to list:
df[df.isin(['a', 'b', 'd'])].stack().index.tolist()
[Out]:
[(1, 'column_1'), (1, 'column_2'), (2, 'column_2')]
Note
If a list of values is searched, the returned result does not preserve the order of the input values:
df[df.isin(['d', 'b', 'a'])].stack().index.tolist()
[Out]:
[(1, 'column_1'), (1, 'column_2'), (2, 'column_2')]
Had a similar need and this worked perfectly
# deals with case sensitivity concern
df = raw_df.applymap(lambda s: s.upper() if isinstance(s, str) else s)
# get the row index
value_row_location = df.isin(['VALUE']).any(axis=1).tolist().index(True)
# get the column index
value_column_location = df.isin(['VALUE']).any(axis=0).tolist().index(True)
# Do whatever you want to do e.g Replace the value above that cell
df.iloc[value_row_location - 1, value_column_location] = 'VALUE COLUMN'
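For example, applied to the end-of-table marker from the question (raw_df here is a made-up illustration):

# hypothetical sample data with a '*' marker at the end of the table
raw_df = pd.DataFrame({'c1': ['a', 'b', '*'], 'c2': ['x', 'y', 'z']})

df = raw_df.applymap(lambda s: s.upper() if isinstance(s, str) else s)
row_loc = df.isin(['*']).any(axis=1).tolist().index(True)
col_loc = df.isin(['*']).any(axis=0).tolist().index(True)
print(row_loc, col_loc)  # 2 0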