How to decide group assignments in Dirichlet process clustering

As in Dirichlet clustering, the Dirichlet process can be represented in several equivalent ways:
Chinese Restaurant Process
Stick-Breaking Process
Pólya Urn Model
For instance, if we consider the Chinese Restaurant Process, the process is as follows:
Initially the restaurant is empty
The first person to enter (Alice) sits down at a table (selects a group).
The second person to enter (Bob) sits down at a table. Which table does he sit at?
He sits at a new table with probability α/(1+α).
He sits at the existing table with Alice (meaning he joins the existing group) with probability 1/(1+α).
The (n+1)-st person sits down at a new table with probability α/(n+α), and at table k with probability n_k/(n+α), where n_k is the number of people currently sitting at table k.
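The seating rule above can be sketched as a short simulation; the function name and α value are illustrative, not from the question:

```python
import random

def crp(n, alpha, seed=0):
    """Simulate table assignments for n customers under a Chinese
    Restaurant Process with concentration parameter alpha."""
    rng = random.Random(seed)
    counts = []          # counts[k] = number of people at table k
    assignments = []
    for i in range(n):   # customer i arrives with i people already seated
        # New table with prob alpha/(i+alpha); table k with counts[k]/(i+alpha)
        r = rng.uniform(0, i + alpha)
        if r < alpha or not counts:
            counts.append(1)                 # open a new table
            assignments.append(len(counts) - 1)
        else:
            r -= alpha                       # r is now uniform over [0, i)
            k = 0
            while r >= counts[k]:            # walk the tables; counts sum to i
                r -= counts[k]
                k += 1
            counts[k] += 1
            assignments.append(k)
    return assignments

print(crp(10, alpha=1.0))
```

Larger α makes new tables more likely, so you end up with more groups on average.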
The question is:
Initially, the first person joins, say, G1 (i.e. group 1).
Now the second person joins:
a new group, G2, with probability α/(1+α) = P(N)
the existing group, G1, with probability 1/(1+α) = P(E)
Now, for each new entry, I can calculate both values, P(N) and P(E). Then:
How do I decide which group, G1 or G2, the new entry will join?
Is it decided on the basis of which probability is larger?
As in:
If P(N) > P(E), then the _new entry_ will join G2
If P(E) > P(N), then the _new entry_ will join G1

Based on the CRP representation,
customer 1 sits at table 1
customer i sits at a pre-occupied table k with probability p_k = n_k/(i−1+α), and at a new table with probability p_new = α/(i−1+α), where n_k is the number of customers already seated at table k.
Note that these probabilities sum to 1. To find the table assignment, you do not pick the table with the highest probability; you draw a random number and sample the table from this discrete distribution.
For example, for customer i, assume you have the probability vector [0.2, 0.4, 0.3, 0.1], which means the probability of sitting at table 1 is 0.2, table 2 is 0.4, table 3 is 0.3, and a new table is 0.1. By constructing the cumulative probability vector [0.2, 0.6, 0.9, 1.0] and drawing a uniform random number, you can sample the table. Say the random number is 0.81: it falls between 0.6 and 0.9, therefore your customer sits at table 3.
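The cumulative-vector sampling step can be written in a few lines; the function name is illustrative:

```python
import bisect
import itertools

def sample_table(probs, u):
    """Return the index drawn from the discrete distribution `probs`,
    given a uniform random draw u in [0, 1)."""
    cumulative = list(itertools.accumulate(probs))  # e.g. [0.2, 0.6, 0.9, 1.0]
    return bisect.bisect_right(cumulative, u)

# The example from the answer: 0.81 lands in the third interval
# (0-based index 2), i.e. table 3 in 1-based numbering.
print(sample_table([0.2, 0.4, 0.3, 0.1], 0.81) + 1)  # → 3
```

In practice you would pass `u = random.random()`; the fixed 0.81 just reproduces the worked example.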

Related

Karp Reduction Between Subset Sum Problem And Vaccine Problem

I am supposed to show that the following problem is NP-complete via a Karp reduction involving the Subset Sum Problem. The problem is to distribute vaccine doses among different age groups as follows:
Given: D vaccine doses and n age groups, with a1, …, an as input, where age group k consists of ak individuals; d1, …, dn as input, where each individual in age group k receives dk doses; at least tk percent of each age group must be fully vaccinated; and at most S doses may be left over.
I am supposed to prove this problem is NP-complete. One of the steps is making a Karp reduction between this problem and the Subset Sum problem. I have tried to do this reduction in various ways but have not been successful. Any ideas? Pseudo-code would be ideal.
Note: The Subset Sum problem receives the following input: a set of positive integers and a target K. The goal is to find a subset of the integers that sums to K.
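For reference, this is the Subset Sum problem being reduced from, as a minimal brute-force sketch (exponential time, for illustration only; it is not the reduction itself):

```python
from itertools import combinations

def subset_sum(numbers, K):
    """Brute-force Subset Sum: return a subset of `numbers` summing
    exactly to K, or None if no such subset exists."""
    for r in range(len(numbers) + 1):
        for combo in combinations(numbers, r):
            if sum(combo) == K:
                return list(combo)
    return None

print(subset_sum([3, 34, 4, 12, 5, 2], 9))  # → [4, 5]
```

A Karp reduction would instead transform a Subset Sum instance into a vaccine-distribution instance in polynomial time, without solving either.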

Create the hierarchy rank of all employees given supervisor ID using Networkx

I have a dataframe in the following format
I want to generate a new column which will tell me the rank of the employee in the company hierarchy. For example, the rank of the CEO is 0. Someone who reports directly to the CEO has a rank of 1. Someone one level below that has a rank of 2, and so on.
I have the supervisor ID for each employee. One employee can have only one supervisor. I have a feeling that I might be able to do this using Networkx, but I can't quite figure out how.
I figured it out myself. It's actually quite trivial.
I need to construct a graph, where the CEO is one of the nodes like every other employee.
import networkx as nx
# Create a graph from the employee/supervisor edge list
G = nx.from_pandas_edgelist(df, 'employeeId', 'supervisorId')
# Calculate each employee's distance from the CEO. Specify target = CEO's ID.
dikt_rank_of_employees = dict(nx.single_target_shortest_path_length(
    G, target='c511b73c4ad30dde1b6d2d57ab2d4ddc'))
# Use this dictionary to create a column in the dataframe
df['rank_of_employee'] = df['employeeId'].map(dikt_rank_of_employees)
The result looks like this
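The same ranks can be computed without NetworkX, as a plain BFS over the supervisor relation; this is just a cross-check sketch, and the toy hierarchy is hypothetical:

```python
from collections import deque

def employee_ranks(supervisor_of, ceo):
    """Compute each employee's rank (distance from the CEO) with a BFS.
    `supervisor_of` maps employee -> supervisor; the CEO has no entry."""
    # Build the reverse adjacency: supervisor -> direct reports
    reports = {}
    for emp, sup in supervisor_of.items():
        reports.setdefault(sup, []).append(emp)
    ranks = {ceo: 0}
    queue = deque([ceo])
    while queue:
        person = queue.popleft()
        for report in reports.get(person, []):
            if report not in ranks:
                ranks[report] = ranks[person] + 1
                queue.append(report)
    return ranks

# Hypothetical toy hierarchy: 'ceo' supervises 'a'; 'a' supervises 'b'.
print(employee_ranks({'a': 'ceo', 'b': 'a'}, 'ceo'))
```

This mirrors what `single_target_shortest_path_length` does on an unweighted graph.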

Distribute numbers as close to possible

This seems to be a two-step problem I'm trying to solve.
Let's say we have N records, and we are trying to distribute them as evenly as possible into K groups.
The second problem: each of the K groups can only accept at most M records.
For example, if we have 5 records and 3 groups, then we would distribute 2 into Group K1, 2 into Group K2 and 1 record into Group K3. However, if Group K1 only accepts at most 1 record, then the arrangement would need to be 1 into Group K1, 2 into Group K2, and 2 into Group K3.
I'm not necessarily after the solution, but what algorithm might I need to solve this? Apparently for the distribution I need to use a greedy algorithm? But the second step seems to be a bit more complicated.
Edit:
The example I'm looking at is:
Number of records: 23
Groups: 10
Max records for each group
G1 = 4
G2 = 1
G3 = 0
G4 = 5
G5 = 0
G6 = 0
G7 = 2
G8 = 4
G9 = 2
G10 = 2
If N=12 and K=3, then in the normal situation you just split it as V = 12/3 = 4 for each group. But since you have the M limitation, and for example K3 can only accept 1, the distribution could end up as 6-5-1, which is not evenly distributed.
So I guess you need to sort the groups by their M limitation; for the example above the group order becomes K3-K1-K2.
Then, if the distributed value V is bigger than the accepted amount M for a group, you take the remainder and distribute it again over the remaining groups (K3 = 1, so the leftover 4 − 1 = 3 must be redistributed over K1 and K2).
The implementation might be complicated; I hope you can find a simpler solution for this.
From what I understood, you need to first separate out all groups which allow only a fixed number of values, and then equally distribute records among the remaining groups. Let's take an example: say we have 15 records which need to be distributed among 5 groups (G1, G2, G3, G4 and G5). Also assume that G2 and G4 allow at most 2 and 4 records respectively. The algorithm goes like this:
Get the average (ceiling integer) of records based on the number of groups (in this example ceil(15/5) = 3).
Add up the max allowed records of all groups whose limit is smaller than our average (here only G2, whose limit of 2 is less than the average, so the sum is 2).
Now subtract the number from step 2 from the total records, and subtract the number of groups involved in step 2 from the total groups (remaining records: 13, remaining groups: 4).
Get the new average (ceiling integer) using the remaining records and groups (new average: ceil(13/4) = 4).
Allot the new average (4) to each of the remaining groups except the last (i.e. to 3 groups).
Get the leftover (13 − 3×4 = 1) and allot it to the last group.
Now what we finally will have here:
G1(No limit): 4
G2(Limit 2): 2
G3(No limit): 4
G4(Limit 4): 4
G5(No limit): 1
Let me know if you think that this algo might fail for some scenarios.
Formula to get the ceiling integer average:
ceil(a / b) = floor((#total_records + #total_groups − 1) / #total_groups)
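A compact greedy variant of the idea above (process groups in ascending order of capacity, giving each group min(cap, ceil(remaining / groups_left))) can be sketched as follows; the function name and the dict layout are my own:

```python
import math

def distribute(n_records, caps):
    """Greedy capped even split: visit groups from smallest capacity to
    largest, giving each min(capacity, ceil(remaining / groups_left)).
    `caps` maps group name -> max records (use math.inf for no limit).
    Returns the allocation and any records that could not be placed."""
    order = sorted(caps, key=lambda g: caps[g])
    remaining = n_records
    result = {}
    for i, group in enumerate(order):
        groups_left = len(order) - i
        share = math.ceil(remaining / groups_left)  # even share of what's left
        take = min(caps[group], share, remaining)
        result[group] = take
        remaining -= take
    return result, remaining  # remaining > 0 means total capacity < n_records

# The 23-record / 10-group example from the question:
caps = {'G1': 4, 'G2': 1, 'G3': 0, 'G4': 5, 'G5': 0,
        'G6': 0, 'G7': 2, 'G8': 4, 'G9': 2, 'G10': 2}
alloc, left = distribute(23, caps)
print(alloc, left)
```

Note that in this example the capacities only sum to 20, so 3 of the 23 records cannot be placed at all, whatever the algorithm.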

Example on Transportation dilemma

A lumber company ships pine flooring from its three mills, A1, A2 and A3, to three building suppliers, B1, B2 and B3. The table below shows the demand, availabilities and unit costs of transportation. Starting with the north-west corner solution and using the stepping-stone method, determine the transportation pattern that minimises the total cost.
The distribution matrix from the north-west corner method gives the following matrix:
{ [25,0,0] , [5,30,5] , [0,0,31] }
Then I compute the improvement indices for the unused cells and check for optimality. It is not an optimal solution: cell (3,1) has index −1.
I cannot apply the stepping-stone method on this distribution matrix because the second row has three consecutive basic cells. What is the optimal solution?
The optimal distribution matrix is { [0,0,25] , [0,30,10] , [30,0,1] }.
The optimal cost = 25*(2) + 30*(2) + 10*(3) + 30*(3) + 1*(3) = 233.
The answer is obtained after three iterations.
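A quick arithmetic check of the quoted optimal cost. The full cost table is not shown in the question, so the (quantity, unit cost) pairs below are taken only from the cost expression above:

```python
# (shipped quantity, unit cost) for the five basic cells of the optimal matrix,
# as quoted in the answer's cost expression.
shipments = [(25, 2), (30, 2), (10, 3), (30, 3), (1, 3)]
total = sum(qty * cost for qty, cost in shipments)
print(total)  # → 233
```

The shipped quantities also sum to 96, matching the total of the north-west corner solution, as they must.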

Repetition while copying data to SQL table from multiple sheets

I have to copy data from multiple Excel sheets to a single SQL table.
Excel inputs:
Sheet1's columns: fname (a, b), lname (c, d) — 2 rows.
Sheet2's columns: city (boston, austin), state (ma, tx) — 2 rows.
My output (tMSSqlOutput) has 4 rows instead of 2:
a c boston ma / a c austin tx / b d boston ma / b d austin tx.
Desired output: a c boston ma / b d austin tx (2 rows only).
How do I manage this?
As per the comments, you don't have a natural key to join the two data sets. Instead you could generate a sequence for each data set that increments identically for both and equates to the row number in each data set.
First of all, this should set alarm bells ringing about the state of your data and how you can be sure that row n in one data set definitely corresponds to row n in another data set. It smacks of something being badly normalised out without proper keys being added and it can be very dangerous to assume that the resulting data set from this is going to be accurate.
If you absolutely must do this, however, then you should assign a Numeric.sequence to each of your data sets. You can do this in a tMap that precedes your joining tMap:
Notice the "s1" parameter to the Numeric.sequence. If you reuse this elsewhere it will increment that same sequence rather than starting from 1, so typically you want to choose a unique name for each sequence in your job (although there are obviously occasions where incrementing a previously defined sequence is what you want).
Once you have defined a unique sequence with the same starting numbers (the second parameter) and the same increment numbers (the third parameter) then you should be able to create a join on these instances:
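The effect of the sequence join can be illustrated in plain Python: number the rows of each sheet and match row n to row n. The sheet contents below are the rows from the question; `enumerate` plays the role the Talend `Numeric.sequence` routine plays in the tMap:

```python
# Row-by-row contents of the two sheets, as described in the question.
sheet1 = [('a', 'c'), ('b', 'd')]              # fname, lname
sheet2 = [('boston', 'ma'), ('austin', 'tx')]  # city, state

# enumerate() generates the same 0, 1, 2, ... key for both data sets,
# so joining on that key pairs row n of sheet1 with row n of sheet2.
joined = [(*row1, *row2)
          for (i, row1), (j, row2) in zip(enumerate(sheet1), enumerate(sheet2))]
print(joined)  # [('a', 'c', 'boston', 'ma'), ('b', 'd', 'austin', 'tx')]
```

As the answer warns, this is only safe if row order is guaranteed to correspond across the two sheets.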