Create the hierarchy rank of all employees given supervisor ID using Networkx - pandas

I have a dataframe in the following format
I want to generate a new column which will tell me the rank of the employee in the company hierarchy. For example, the rank of the CEO is 0. Someone who reports directly to the CEO has a rank of 1. Someone one level below that has a rank of 2, and so on.
I have the supervisor ID for each employee. One employee can have only one supervisor. I have a feeling that I might be able to do this using Networkx, but I can't quite figure out how.

I figured it out myself. It's actually quite trivial.
I need to construct a graph, where the CEO is one of the nodes like every other employee.
import networkx as nx
# Create a graph using the networkx built-in function
G = nx.from_pandas_edgelist(df, 'employeeId', 'supervisorId')
# Calculate each employee's distance from the CEO. Specify target = the CEO's employeeId
dikt_rank_of_employees = { x[0]: x[1] for x in nx.single_target_shortest_path_length(
G, target='c511b73c4ad30dde1b6d2d57ab2d4ddc') }
# Use this dictionary to create a column in the dataframe
df['rank_of_employee'] = df['employeeId'].map(dikt_rank_of_employees)
The result looks like this
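As a self-contained illustration of the same steps on a made-up org chart (the IDs below are hypothetical stand-ins for the real hashes):
import networkx as nx
import pandas as pd

# Hypothetical toy org chart; the CEO has no supervisor
df = pd.DataFrame({'employeeId':   ['ceo', 'e1', 'e2', 'e3'],
                   'supervisorId': [None,  'ceo', 'ceo', 'e1']})

G = nx.from_pandas_edgelist(df.dropna(subset=['supervisorId']), 'employeeId', 'supervisorId')
dikt_rank_of_employees = dict(nx.single_target_shortest_path_length(G, target='ceo'))
df['rank_of_employee'] = df['employeeId'].map(dikt_rank_of_employees)
# ceo -> 0, e1 and e2 -> 1, e3 -> 2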


Comparing compressed distributions per cohort

How can I easily compare the distributions of multiple cohorts?
Usually, https://seaborn.pydata.org/generated/seaborn.distplot.html would be a great tool to visually compare distributions. However, due to the size of my dataset, I needed to compress it and only keep the counts.
It was created as:
SELECT age, gender, compress_distributionUDF(collect_list(struct(target_y_n, count, distribution_value))) GROUP BY age, gender
where compress_distributionUDF simply takes a list of tuples and returns the counts per group.
This leaves me with a list of
Row(distribution_value=60.0, count=314251, target_y_n=0)
nested inside a pandas.Series, one per cohort.
Basically, it is similar to:
pd.DataFrame({'foo':[1,2], 'bar':['first', 'second'], 'baz':[{'target_y_n': 0, 'value': 0.5, 'count':1000},{'target_y_n': 1, 'value': 1, 'count':10000}]})
and I wonder how to compare distributions:
within a cohort 0 vs. 1 of target_y_n
over multiple cohorts
in a way which is visually still understandable and not only a mess.
edit
For a single cohort, "Plotting pre aggregated data in python" could be the answer, but how can multiple cohorts be compared (not just in a loop), as that leads to too many plots to compare?
I am still quite confused but we can start from this and see where it goes. From your example, I am focusing on baz as it is not clear to me what foo and bar are (I assume cohorts).
So let's focus on baz and plot the different distributions according to target_y_n.
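If the records are still nested in a single column, as in the baz example from your question, a minimal flattening sketch (assuming exactly those keys and column names) could be:
import pandas as pd

raw = pd.DataFrame({'foo': [1, 2],
                    'bar': ['first', 'second'],
                    'baz': [{'target_y_n': 0, 'value': 0.5, 'count': 1000},
                            {'target_y_n': 1, 'value': 1, 'count': 10000}]})

# Expand the nested dicts into flat columns next to the cohort keys
df = pd.concat([raw[['foo', 'bar']], pd.json_normalize(raw['baz'].tolist())], axis=1)
The plots below assume df is flat like this.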
# Bar and box views of the pre-aggregated counts, split by target_y_n
sns.catplot(x='value', y='count', data=df, kind='bar', hue='target_y_n', dodge=False, ci=None)
sns.catplot(x='value', y='count', data=df, kind='box', hue='target_y_n', dodge=False)
# Plain matplotlib bars, one series per target_y_n
plt.bar(df[df['target_y_n'] == 0]['value'], df[df['target_y_n'] == 0]['count'], width=1)
plt.bar(df[df['target_y_n'] == 1]['value'], df[df['target_y_n'] == 1]['count'], width=1)
plt.legend(['Target=0', 'Target=1'])
# Overlaid seaborn bar plot
sns.barplot(x='value', y='count', data=df, hue='target_y_n', dodge=False, ci=None)
Finally try to have a look at the FacetGrid class to extend your comparison (see here).
g = sns.FacetGrid(df, col='target_y_n', hue='target_y_n')
g = g.map(sns.barplot, 'value', 'count', ci=None)
In your case you would have something like:
g = sns.FacetGrid(df, col='target_y_n', row='cohort', hue='target_y_n')
g = g.map(sns.barplot, 'value', 'count', ci=None)
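Note that 'cohort' is not an explicit column in your example; a hypothetical way to build one from the grouping keys (assuming age and gender columns as in your SQL) would be:
df['cohort'] = df['age'].astype(str) + '_' + df['gender']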
And a qqplot option:
from scipy import stats

def qqplot(x, y, **kwargs):
    _, xr = stats.probplot(x, fit=False)
    _, yr = stats.probplot(y, fit=False)
    plt.scatter(xr, yr, **kwargs)

g = sns.FacetGrid(df, col='cohort', hue='target_y_n')
g = g.map(qqplot, 'value', 'count')

How to check the highest score among specific columns and compute the average in pandas?

Help with homework problem: "Let us define the "data science experience" of a given person as the person's largest score among Regression, Classification, and Clustering. Compute the average data science experience among all MSIS students."
Beginner to coding. I am trying to figure out how to check amongst columns and compare those columns to each other for the largest value. And then take the average of those found values.
I greatly appreciate your help in advance!
Picture of the sample data set: https://i.stack.imgur.com/9OSjz.png
Provided Code:
import pandas as pd
df = pd.read_csv("cleaned_survey.csv", index_col=0)
df.drop(['ProgSkills','Languages','Expert'],axis=1,inplace=True)
Sample Data:
What I have tried so far:
df[data_science_experience]=df[["Regression","Classification","Clustering"]].values.max()
df['z']=df[['Regression','Classification','Clustering']].apply(np.max,axis=1)
df[data_science_experience]=df[["Regression","Classification","Clustering"]].apply(np.max,axis=1)
If you want to get the highest score of column 'hw1', you can get it with:
df['hw1'].max()
df['hw1'] gives you a Series of all the values in that column and .max() returns the maximum; for the average use mean:
df['hw1'].mean()
If you want to find the maximum of multiple columns, you can use:
# Collect the maximum of each column, then summarise that list
maximum_list = list()
for col in df.columns:
    maximum_list.append(df[col].max())
overall_max = max(maximum_list)
overall_avg = sum(maximum_list) / len(maximum_list)
hope this helps.
First, you want to get only the rows with MSIS in the Program column. That can be done in the following way:
df[df['Program'] == 'MSIS']
Next, you want to get only the Regression, Classification and Clustering columns. The previous query filtered only rows; we can add to that, like this:
df.loc[df['Program'] == 'MSIS', ['Regression', 'Classification', 'Clustering']]
Now, for each row remaining, we want to take the maximum. That can be done by appending .max(axis=1) to the previous line (axis=1 because we want the maximum of each row, not each column).
At this point, we should have a Series where each entry is the highest score of the three categories for one student. Now, all that's left to do is take the mean, which can be done with .mean(). The full code should therefore look like this:
df.loc[df['Program'] == 'MSIS', ['Regression', 'Classification', 'Clustering']].max(axis=1).mean()
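As a quick sanity check on made-up data (the column names follow the question; the scores themselves are invented):
import pandas as pd

toy = pd.DataFrame({'Program': ['MSIS', 'MSIS', 'MBA'],
                    'Regression': [3, 1, 5],
                    'Classification': [4, 2, 2],
                    'Clustering': [2, 5, 1]})

# Max of the three scores per MSIS student, then the mean of those maxima
print(toy.loc[toy['Program'] == 'MSIS', ['Regression', 'Classification', 'Clustering']].max(axis=1).mean())  # (4 + 5) / 2 = 4.5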

Pandas manipulation: matching data from other columns to one column, applied uniquely to all rows

I have a model that predicts 10 words for a particular course in order of likelihood, and I'd like to keep the first 5 of those words that also appear in the course's description.
This is the format of the data:
course_name course_title course_description predicted_word_10 predicted_word_9 predicted_word_8 predicted_word_7 predicted_word_6 predicted_word_5 predicted_word_4 predicted_word_3 predicted_word_2 predicted_word_1
Xmath 32 Precalculus Polynomial and rational functions, exponential... directed scholars approach build african different visual cultures placed global
Xphilos 2 Morality Introduction to ethical and political philosop... make presentation weekly european ways general range questions liberal speakers
My idea is for each row to start iterating from predicted_word_1 until I get the first 5 that are in the description. I'd like to save those words in the order they appear into additional columns description_word_1 ... description_word_5. (If there are <5 predicted words in the description I plan to return NAN in the corresponding columns).
To clarify with an example: if the course_description of a course is 'Polynomial and rational functions, exponential and logarithmic functions, trigonometry and trigonometric functions. Complex numbers, fundamental theorem of algebra, mathematical induction, binomial theorem, series, and sequences. ' and its first few predicted words are irrelevantword1, induction, exponential, logarithmic, irrelevantword2, polynomial, algebra...
I would want to return induction, exponential, logarithmic, polynomial, algebra for that in that order and do the same for the rest of the courses.
My attempt was to define an apply function that will take in a row and iterate from the first predicted word until it finds the first 5 that are in the description, but the part I am unable to figure out is how to create these additional columns that have the correct words for each course. This code will currently only keep the words for one course for all the rows.
def find_top_description_words(row):
    print(row['course_title'])
    description_words_index = 1
    for i in range(num_words_per_course):
        description = row.loc['course_description']
        word_i = row.loc['predicted_word_' + str(i + 1)]
        if (word_i in description) & (description_words_index <= 5):
            print(description_words_index)
            row['description_word_' + str(description_words_index)] = word_i
            description_words_index += 1

df.apply(find_top_description_words, axis=1)
The end goal of this data manipulation is to keep the top 10 predicted words from the model and the top 5 predicted words in the description so the dataframe would look like:
course_name course_title course_description top_description_word_1 ... top_description_word_5 predicted_word_1 ... predicted_word_10
Any pointers would be appreciated. Thank you!
If I understand correctly:
Create a new DataFrame with just the predicted words:
pred_words_lists = df.apply(lambda x: list(x[3:].dropna())[::-1], axis = 1)
Please note that each row now holds a list of predicted words, in a convenient order: the first non-empty predicted word comes first, the second comes second, and so on.
Now let's create a new DataFrame:
pred_words_df = pd.DataFrame(pred_words_lists.tolist())
pred_words_df.columns = df.columns[:2:-1]
And the final DataFrame:
final_df = df[['course_name', 'course_title', 'course_description']].join(pred_words_df.iloc[:,0:11])
Hope this works.
EDIT
def common_elements(xx, yy):
    # Keep the predicted words (yy) that occur in the description (xx), ordered by where they appear in the description
    temp = pd.Series(range(0, len(xx)), index=xx)
    return list(temp.reindex(yy).sort_values()[0:10].dropna().index)

pred_words_lists = df.apply(lambda x: common_elements(x[2].replace(',', '').split(), list(x[3:].dropna())), axis=1)
Does it satisfy your requirements?
Adapted solution (OP):
def get_sorted_descriptions_words(course_description, predicted_words, k):
    description_words = course_description.replace(',', '').split()
    predicted_words_list = list(predicted_words)
    predicted_words = pd.Series(range(0, len(predicted_words_list)), index=predicted_words_list)
    predicted_words = predicted_words[~predicted_words.index.duplicated()]
    ordered_description = predicted_words.reindex(description_words).dropna().sort_values()
    ordered_description_list = pd.Series(ordered_description.index).unique()[:k]
    return ordered_description_list

df.apply(lambda x: get_sorted_descriptions_words(x['course_description'], x.filter(regex=r'predicted_word_.*'), k), axis=1)
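To get from there to the description_word_1 ... description_word_5 columns asked about in the question, one possible sketch (assuming k = 5 and the adapted function above) is:
k = 5
top_words = df.apply(lambda x: get_sorted_descriptions_words(x['course_description'], x.filter(regex=r'predicted_word_.*'), k), axis=1)

# Turn the variable-length word arrays into k fixed columns, padding with NaN where fewer than k words matched
top_words_df = pd.DataFrame([list(words) for words in top_words], index=df.index).reindex(columns=range(k))
top_words_df.columns = ['description_word_' + str(i + 1) for i in range(k)]
df = df.join(top_words_df)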

r tidygraph find first parent in hierarchy that meets criteria

Find the first approver within the reporting line who is a minimum of 2 grades higher, e.g. grade 0 is two grades higher than grade 2.
Employee Graph
I'm trying to create a new node attribute Approver which should populate for Employees H,E,F,G which should all identify the approver as Employee A.
I've been trying to figure this out all day but don't really get how to do a one-hop traversal, let alone several. My understanding is that a breadth-first search (BFS) will start from the leaf nodes and then work its way up the tree. I'm getting really stuck on how to access different parts of the tree.
Any help much appreciated.
library(tidygraph)
library(visNetwork)
# sample graph
g <- tidygraph::create_tree(8, 2) %>%
  activate(nodes) %>%
  mutate(Employee = LETTERS[1:8],
         Grade = c(0, 1, 1, 1, 2, 2, 2, 2),
         label = paste("Emp", Employee, "Grade", Grade))
visIgraph(g,layout="layout_as_tree", flip.y=F,idToLabel = F)
g %>% activate(nodes) %>%
  mutate(Approver = map_bfs_back(node_is_root(), .f = function(node, path, ...) {
    # If starting node grade - node grade >= 2 then Approver Employee ID
  }))
Took a while to figure this out. Not sure if anyone else is interested in this sort of query, but hopefully it's helpful to someone.
g %>% activate(nodes) %>%
  mutate(P = map_bfs_chr(node_is_root(), .f = function(node, path, ...) {
    .N()$Employee[tail(path$node[.N()$Grade[node] - .N()$Grade[path$node] >= 2], 1)[1]]
  }))

How to decide group assignments in Dirichlet process clustering

In Dirichlet process clustering, the Dirichlet process can be represented in any of the following ways:
Chinese Restaurant Process
Stick-Breaking Process
Pólya Urn Model
For instance, if we consider Chinese Restaurant Process the process is as follows:
Initially the restaurant is empty
The first person to enter (Alice) sits down at a table (selects a
group).
The second person to enter (Bob) sits down at a table.
Which table does he sit at?
He sits down at a new table with probability α/(1+α)
He sits at the existing table with Alice (i.e. he joins the existing group) with probability 1/(1+α)
The (n+1)-st person sits down at a new table with probability α/(n+α), and at table k with probability n_k/(n+α), where n_k is the number of people currently sitting at table k.
The question is:
Initially, the first person will join, say G1 (i.e. group 1),
Now the second person will join
new group = G2 with probability α/(1+α) = P(N)
existing group = G1 with probability 1/(1+α) = P(E)
Now if I calculate the probabilities for the new entry, I'll have values for both, i.e. P(N) and P(E). Then,
How do I decide which group, G1 or G2, the new entry joins?
Is it decided on the basis of the values of the two probabilities?
As in:
if P(N) > P(E), then the new entry will join G2
if P(E) > P(N), then the new entry will join G1
Based on the CRP representation,
customer 1 sits at table 1
customer i sits at an already occupied table k with probability p_k and at a new table with probability p_new, where p_k = n_k/(i−1+α) and p_new = α/(i−1+α), with n_k the number of customers already sitting at table k.
Note that the sum of these probabilities is equal to 1. To find the table assignment, all you have to do is draw a random number and select the relevant table.
For example, for customer i, assume you have the probability vector (0.2, 0.4, 0.3, 0.1),
which means the probability of sitting at table 1 is 0.2, table 2 is 0.4, table 3 is 0.3, and a new table is 0.1. By constructing the cumulative probability vector and drawing a random number, you can sample the table. Let's say the random number is 0.81; since the cumulative vector is (0.2, 0.6, 0.9, 1.0), your customer sits at table 3.
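A minimal sketch of that sampling step in Python (the probabilities are the assumed example values above; in practice the random number would be drawn rather than fixed):
import numpy as np

# Example probabilities for tables 1-3 and a new table (assumed values)
probs = np.array([0.2, 0.4, 0.3, 0.1])
cum_probs = np.cumsum(probs)             # [0.2, 0.6, 0.9, 1.0]

u = 0.81                                 # drawn random number; use np.random.default_rng().random() in practice
table = np.searchsorted(cum_probs, u)    # index of the first cumulative value >= u
print(table + 1)                         # prints 3, i.e. the customer sits at table 3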