Design Problem - oop

Recently I was faced with this interview question (K-Means Clustering solution). The design I came up with did not meet the expectations of the interviewer (to put simply I didnt get the job because I lost to another candidate on this design problem). I am wondering how many different / efficient / simply solutions can the SO community come up with (by doing this I am hoping to hone my skills):
To implement a simple algorithm to cluster people according to their weight and height. The
data set includes a list of people with their weights and heights like so:
Person Weight Height
(kg) (inches)
Person 1 70 70
Person 2 75 80
Person 3 120 85
You can plot the data as a 2 dimensional data. Weight being one dimension and height being
the other dimension. Weight can range from a minimum of 50kg to 150kg. Height can range
from a minimum of 38inches to 90inches
Algorithm:
The algorithm (called K-means clustering) will cluster data into K groups goes as such:
Start with K clusters. Each cluster is defined by its center point which will start of as
random weight and random height. Pick random numbers from within the
corresponding ranges defined above.
For each person
Calculate distance to center of each cluster using formula
distance = sqrt(pow((wperson−wcenter), 2) + (pow(hperson−hcenter),2))
where wperson = weight of person,
hperson = height of person
wcenter = weight of cluster center point,
hcenter = height of cluster center point
Assign the person to the cluster with the shortest distance to center point of cluster
After end of step 2, you will end up with K clusters each assigned with a set of people
For each cluster, set the weight and height of the center point to the average of the
people in the cluster
wcenter = (sum of weight of each person in cluster)/(number of people in cluster)
hcenter = (sum of height of each person in cluster)/number of people in cluster)
Repeat steps 2 to 5 for 1000 iterations, then print out following information for each
cluster.
weight and height of center of cluster.
list of people in cluster.
I am not looking for a implementation/solution but for a high level design. can you list the interfaces / classes etc.
I dont want to give my solution now, but will post it later in the day?

This is my attempt at the design. I only show the static diagram since the algorithm is pretty much laid out already. I would have a plan to have a visitor for the representation of the clusters, could allow different types of output (xml, strings, csv..etc). Maybe the visitor is overkill, if it was then I'd just have something like a ToString method that could be overridden.
Note: the Cluster creates a CenterClusterItem on the SetCenter and FindNewCenter methods. The CenterClusterItem is not a PersonClusterItem, it just holds the same amount of AClusterValues as a PersonClusterItem would (since the average isn't really a person).
Also, I forgot to make a method on the KCluster to begin the process, but that's implied.
Class Diagram http://img11.imageshack.us/img11/499/kcluster.png

Well, I would first tackle all the constants/magic numbers that reduce the reusability of the algorithm:
instead of a fixed number of iterations, use a stopping criterion (e.g., if clusters don't change too much, terminate)
don't restrict yourself to 2-dim data, use vectors
let the user define the number of clusters to be found
Then, you could hide some specifics behind interfaces, e.g. the distance might be calculated differently (for example, it might at some point have to cope with values other than double).
On the other hand, if you really have this simple problem, some of these generalizations might well be overkill - but that's what I would discuss with someone telling me to implement this algorithm.

You can create the following classes:
Person to store data about persons and centers. Properties: id, weight and height. Method: calculateDistance
Cluster to store one center and a list of persons: Properties: center and list of Person. Method: calculateCenter.
KCluster to hold your algorithm and store a list of clusters: Property: list of Cluster. Methods: generateClusters.

I'm not sure what your question actually is, the steps you point out effectively define the algorithm you're talking about.
A better idea may be to include exactly what you did then people can give you some hints / tips as to where you might have gone wrong or what they would have done differently.

That sounds like a really good way to do it. K-means will usually converge quickly (though not necessarily to the global optimum), so my one suggestion would be to run the algorithm until no more changes occur, rather than a fixed number of 1000 iterations. You could then repeat the entire process a few times with different random starting points.
One weakness of k-means is that it does require you to specify (i.e. guess) an appropriate value for k up-front. I think you would get points for asking the interviewer what an appropriate value for k would be, or, if there is no way to know, describing some goodness-of-fit measure and then calculating that measure for different values of k to find a "just low enough" value.

Related

Traveling salesman ampl

I am working on a Traveling salesman problem and can't figure how to solve it. The problem contains ten workers, ten workplaces where they should be driven to and one car driving them one by one. There is a cost of $1.5 per km. Also, all the nodes (both workers and workplaces) are positioned in a 10*10 matrix and the distance between each block in the matrix is 1 km.
The problem should be solved using AMPL.
I have already calculated the distances between each coordinate in excel and have copy pasted the matrix to the dat.file in AMPL.
This is my mod.file so far (without the constrains):
param D > 0;
param D > 0;
set A = 1..W cross 1..D;
var x{A}; # 1 if the route goes from person p to work d,
# 0 otherwise
param cost;
param distance;
minimize Total_Cost:
sum {(w,d) in A} cost * x[w,d];
OK, so your route looks like: start-worker 1-job 1-worker 2-job 2-worker 3-job-3-...-job 10-end (give or take start & end points, depending on how you formulate the problem.
That being the case, the "worker n-job n" parts of your route are predetermined. You don't need to include "worker n-job n" costs in the route optimisation, because there's no choice about those parts of the route (though you do need to remember them for calculating total cost, of course).
So what you have here is really a basic TSP with 10 "destinations" (each representing a single worker and their assigned job) but with an asymmetric cost matrix (because cost of travel from job i to worker j isn't the same as cost of travel from job j to worker i).
If you already have an implementation for the basic TSP, it should be easy to adapt. If not, then you need to write one and make that small change for an asymmetric cost matrix. I've seen two different approaches to this in AMPL.
2-D decision matrix with subtour elimination
Decision variable x{1..10,1..10} is defined as: x[i,j] = 1 if the route goes from job i to job j, and 0 otherwise. Constraints require that every row and column of this matrix has exactly one 1.
The challenging part with this approach is preventing subtours (i.e. the "route" produced is actually two or more separate cycles instead of one large cycle). It sounds like your current attempt is at this stage.
One solution to the problem of subtours is an iterative approach:
Write an implementation that includes all requirements except for subtour prevention.
Solve with this implementation.
Check the resulting solution for subtours.
If no subtours are found, return the solution and end.
If you do find subtours, add a constraint which prevents that particular subtour. (Identify the arcs involved in the subtour, and set a constraint which implies they can't all be selected.) Then go to #2.
For a small exercise you may be able to do the subtour elimination by hand. For a larger exercise, or if your lecturer doesn't like that approach, you can create a .run that automates it. See Bob Fourer's post of 31/7/2013 in this thread for an example of implementation.
3-D decision matrix with time dimension
Under this approach, you set up a decision variable x{1..10,1..10,1..10} where x[i,j,t] = 1 if the route goes from job i to worker j at time t, and 0 otherwise. Your constraints then require that the route goes to and from each job/worker combination exactly once, that if it goes to worker i at time t then it must go from job i at time t+1 (excepting first/last issues), that it's doing exactly one thing at time t, and that the endpoint at time 10 matches the startpoint at time 1 (assuming you want a circuit).
This prevents subtours, because it forces a route that starts at some point at time 1, returns to that point at time 10, and doesn't visit any other point more than once - meaning that it has to go through all of them exactly once.

Data Science: Using Inferential Statistics to label train dataset

Lack of High Schools in remote areas is a problem for students in developing country. Students in some locations are better than that in other. So, I have to find those locations. Now, the main problem is defining "BETTER". I have made some rules that will define the profile of a location.
Right now, I am concerned with the good students.
So, what I have done is-
1. Used some inferential statistics to and made some rules to come up with the conclusion that Location A,B,C,etc are the most potential locations where you can put the high schools because according to my rules these locations contain quality students.
I did all of the things above to label the data because I required to define "BETTER" and label the data so that I can now use machine learning algorithm to learn the factors which makes a location a potential location so that if I give a data point from test data to the model, it will instantly tell if the location is better or not.
Overview of the method:
For each location, I have these 4 information:
total_students_staying_for_high_school_education(A),
total_students_leaving_for_high_school_education_in_another_place(B),
mean_grade_point_of_students_of_type_B,
ratio (calculated as B/A),
For the location whose ratio > 1
I applied the chi-squared significance test to come up with a statistic which would tell me if students are leaving that place in significant amount than staying. I used ANOVA and then Tukey test to compare means_grade points and then find combinations of pairs of locations whose means vary and whose is greater than the others.
I then wrote a python program with a custom comparator that first compares if mean_grade of those points vary and returns the one with greater mean. If the means don't vary, the comparator return the location with the one whose chi-squared value is greater.
This is how, the whole process comes up with few suggestions of location and I call those location "BETTER".
What I am concerned about is-
1. How do I verify if my rules are valid? Or do I even need to verify it?
2. Most importantly, is mingling statistics with machine learning as described above an appropriate approach?Is there any major leakage in the method?Can anyone suggest a more general method?

Similarity matching algorithm

I have products with different details in different attributes and I need to develop an algorithm to find the most similar to the one I'm trying to find.
For example, if a product has:
Weight: 100lb
Color: Black, Brown, White
Height: 10in
Conditions: new
Others can have different colors, weight, etc. Then I need to do a search where the most similar return first. For example, if everything matches but the color is only Black and White but not Brown, it's a better match than another product that is only Black but not White or Brown.
I'm open to suggestions as the project is just starting.
One approach, for example, I could do is restrict each attribute (weight, color, size) a limited set of option, so I can build a binary representation. So I have something like this for each product:
Colors Weight Height Condition
00011011000 10110110 10001100 01
Then if I do an XOR between the product's binary representation and my search, I can calculate the number of set bits to see how similar they are (all zeros would mean exact match).
The problem with this approach is that I cannot index that on a database, so I would need to read all the products to make the comparison.
Any suggestions on how I can approach this? Ideally I would like to have something I can index on a database so it's fast to query.
Further question: also if I could use different weights for each attribute, it would be awesome.
You basically need to come up with a distance metric to define the distance between two objects. Calculate the distance from the object in question to each other object, then you can either sort by minimum distance or just select the best.
Without some highly specialized algorithm based on the full data set, the best you can do is a linear time distance comparison with every other item.
You can estimate the nearest by keeping sorted lists of certain fields such as Height and Weight and cap the distance at a threshold (like in broad phase collision detection), then limit full distance comparisons to only those items that meet the thresholds.
What you want to do is a perfect use case for elasticsearch and other similar search oriented databases. I don't think you need to hack with bitmasks/etc.
You would typically maintain your primary data in your existing database (sql/cassandra/mongo/etc..anything works), and copy things that need searching to elasticsearch.
What are you talking about very similar to BK-trees. BK-tree constructs search tree with some metric associated with keys of this tree. Most common use of this tree is string corrections with Levenshtein or Damerau-Levenshtein distances. This is not static data structure, so it supports future insertions of elements.
When you search exact element (or insert element), you need to look through nodes of this tree and go to links with weight equal to distance between key of this node and your element. if you want to find similar objects, you need to go to several nodes simultaneously that supports your wishes of constrains of distances. (Maybe it can be even done with A* to fast finding one most similar object).
Simple example of BK-tree (from the second link)
BOOK
/ \
/(1) \(4)
/ \
BOOKS CAKE
/ / \
/(2) /(1) \(2)
/ | |
BOO CAPE CART
Your metric should be Hamming distance (count of differences between bit representations of two objects).
BUT! is it good to compare two integers as count of different bits in their representation? With Hamming distance HD(10000, 00000) == HD(10000, 10001). I.e. difference between numbers 16 and 0, and 16 and 17 is equal. Is it really what you need?
BK-tree with details:
https://hamberg.no/erlend/posts/2012-01-17-BK-trees.html
https://nullwords.wordpress.com/2013/03/13/the-bk-tree-a-data-structure-for-spell-checking/

How to group nearby latitude and longitude locations stored in SQL

Im trying to analyse data from cycle accidents in the UK to find statistical black spots. Here is the example of the data from another website. http://www.cycleinjury.co.uk/map
I am currently using SQLite to ~100k store lat / lon locations. I want to group nearby locations together. This task is called cluster analysis.
I would like simplify the dataset by ignoring isolated incidents and instead only showing the origin of clusters where more than one accident have taken place in a small area.
There are 3 problems I need to overcome.
Performance - How do I ensure finding nearby points is quick. Should I use SQLite's implementation of an R-Tree for example?
Chains - How do I avoid picking up chains of nearby points?
Density - How to take cycle population density into account? There is a far greater population density of cyclist in london then say Bristol, therefore there appears to be a greater number of backstops in London.
I would like to avoid 'chain' scenarios like this:
Instead I would like to find clusters:
London screenshot (I hand drew some clusters)...
Bristol screenshot - Much lower density - the same program ran over this area might not find any blackspots if relative density was not taken into account.
Any pointers would be great!
Well, your problem description reads exactly like the DBSCAN clustering algorithm (Wikipedia). It avoids chain effects in the sense that it requires them to be at least minPts objects.
As for the differences in densities across, that is what OPTICS (Wikipedia) is supposed do solve. You may need to use a different way of extracting clusters though.
Well, ok, maybe not 100% - you maybe want to have single hotspots, not areas that are "density connected". When thinking of an OPTICS plot, I figure you are only interested in small but deep valleys, not in large valleys. You could probably use the OPTICS plot an scan for local minima of "at least 10 accidents".
Update: Thanks for the pointer to the data set. It's really interesting. So I did not filter it down to cyclists, but right now I'm using all 1.2 million records with coordinates. I've fed them into ELKI for analysis, because it's really fast, and it actually can use the geodetic distance (i.e. on latitude and longitude) instead of Euclidean distance, to avoid bias. I've enabled the R*-tree index with STR bulk loading, because that is supposed to help to get the runtime down a lot. I'm running OPTICS with Xi=.1, epsilon=1 (km) and minPts=100 (looking for large clusters only). Runtime was around 11 Minutes, not too bad. The OPTICS plot of course would be 1.2 million pixels wide, so it's not really good for full visualization anymore. Given the huge threshold, it identified 18 clusters with 100-200 instances each. I'll try to visualize these clusters next. But definitely try a lower minPts for your experiments.
So here are the major clusters found:
51.690713 -0.045545 a crossing on A10 north of London just past M25
51.477804 -0.404462 "Waggoners Roundabout"
51.690713 -0.045545 "Halton Cross Roundabout" or the crossing south of it
51.436707 -0.499702 Fork of A30 and A308 Staines By-Pass
53.556186 -2.489059 M61 exit to A58, North-West of Manchester
55.170139 -1.532917 A189, North Seaton Roundabout
55.067229 -1.577334 A189 and A19, just south of this, a four lane roundabout.
51.570594 -0.096159 Manour House, Picadilly Line
53.477601 -1.152863 M18 and A1(M)
53.091369 -0.789684 A1, A17 and A46, a complex construct with roundabouts on both sides of A1.
52.949281 -0.97896 A52 and A46
50.659544 -1.15251 Isle of Wight, Sandown.
...
Note, these are just random points taken from the clusters. It may be sensible to compute e.g. cluster center and radius instead, but I didn't do that. I just wanted to get a glimpse of that data set, and it looks interesting.
Here are some screenshots, with minPts=50, epsilon=0.1, xi=0.02:
Notice that with OPTICS, clusters can be hierarchical. Here is a detail:
First, your example is quite misleading. You have two different sets of data, and you don't control the data. If it appears in a chain, then you will get a chain out.
This problem is not exactly suitable for a database. You'll have to write code or find a package that implements this algorithm on your platform.
There are many different clustering algorithms. One, k-means, is an iterative algorithm where you look for a fixed number of clusters. k-means requires a few complete scans of the data, and voila, you have your clusters. Indexes are not particularly helpful.
Another, which is usually appropriate on slightly smaller data sets, is hierarchical clustering -- you put the two closest things together, and then build the clusters. An index might be helpful here.
I recommend though that you peruse a site such as kdnuggets in order to see what software -- free and otherwise -- is available.

Building ranking with genetic algorithm,

Question after BIG edition :
I need to built a ranking using genetic algorithm, I have data like this :
P(a>b)=0.9
P(b>c)=0.7
P(c>d)=0.8
P(b>d)=0.3
now, lets interpret a,b,c,d as names of football teams, and P(x>y) is probability that x wins with y. We want to build ranking of teams, we lack some observations P(a>d),P(a>c) are missing due to lack of matches between a vs d and a vs c.
Goal is to find ordering of team names, which the best describes current situation in that four team league.
If we have only 4 teams than solution is straightforward, first we compute probabilities for all 4!=24 orderings of four teams, while ignoring missing values we have :
P(abcd)=P(a>b)P(b>c)P(c>d)P(b>d)
P(abdc)=P(a>b)P(b>c)(1-P(c>d))P(b>d)
...
P(dcba)=(1-P(a>b))(1-P(b>c))(1-P(c>d))(1-P(b>d))
and we choose the ranking with highest probability. I don't want to use any other fitness function.
My question :
As numbers of permutations of n elements is n! calculation of probabilities for all
orderings is impossible for large n (my n is about 40). I want to use genetic algorithm for that problem.
Mutation operator is simple switching of places of two (or more) elements of ranking.
But how to make crossover of two orderings ?
Could P(abcd) be interpreted as cost function of path 'abcd' in assymetric TSP problem but cost of travelling from x to y is different than cost of travelling from y to x, P(x>y)=1-P(y<x) ? There are so many crossover operators for TSP problem, but I think I have to design my own crossover operator, because my problem is slightly different from TSP. Do you have any ideas for solution or frame for conceptual analysis ?
The easiest way, on conceptual and implementation level, is to use crossover operator which make exchange of suborderings between two solutions :
CrossOver(ABcD,AcDB) = AcBD
for random subset of elements (in this case 'a,b,d' in capital letters) we copy and paste first subordering - sequence of elements 'a,b,d' to second ordering.
Edition : asymetric TSP could be turned into symmetric TSP, but with forbidden suborderings, which make GA approach unsuitable.
It's definitely an interesting problem, and it seems most of the answers and comments have focused on the semantic aspects of the problem (i.e., the meaning of the fitness function, etc.).
I'll chip in some information about the syntactic elements -- how do you do crossover and/or mutation in ways that make sense. Obviously, as you noted with the parallel to the TSP, you have a permutation problem. So if you want to use a GA, the natural representation of candidate solutions is simply an ordered list of your points, careful to avoid repitition -- that is, a permutation.
TSP is one such permutation problem, and there are a number of crossover operators (e.g., Edge Assembly Crossover) that you can take from TSP algorithms and use directly. However, I think you'll have problems with that approach. Basically, the problem is this: in TSP, the important quality of solutions is adjacency. That is, abcd has the same fitness as cdab, because it's the same tour, just starting and ending at a different city. In your example, absolute position is much more important that this notion of relative position. abcd means in a sense that a is the best point -- it's important that it came first in the list.
The key thing you have to do to get an effective crossover operator is to account for what the properties are in the parents that make them good, and try to extract and combine exactly those properties. Nick Radcliffe called this "respectful recombination" (note that paper is quite old, and the theory is now understood a bit differently, but the principle is sound). Taking a TSP-designed operator and applying it to your problem will end up producing offspring that try to conserve irrelevant information from the parents.
You ideally need an operator that attempts to preserve absolute position in the string. The best one I know of offhand is known as Cycle Crossover (CX). I'm missing a good reference off the top of my head, but I can point you to some code where I implemented it as part of my graduate work. The basic idea of CX is fairly complicated to describe, and much easier to see in action. Take the following two points:
abcdefgh
cfhgedba
Pick a starting point in parent 1 at random. For simplicity, I'll just start at position 0 with the "a".
Now drop straight down into parent 2, and observe the value there (in this case, "c").
Now search for "c" in parent 1. We find it at position 2.
Now drop straight down again, and observe the "h" in parent 2, position 2.
Again, search for this "h" in parent 1, found at position 7.
Drop straight down and observe the "a" in parent 2.
At this point note that if we search for "a" in parent one, we reach a position where we've already been. Continuing past that will just cycle. In fact, we call the sequence of positions we visited (0, 2, 7) a "cycle". Note that we can simply exchange the values at these positions between the parents as a group and both parents will retain the permutation property, because we have the same three values at each position in the cycle for both parents, just in different orders.
Make the swap of the positions included in the cycle.
Note that this is only one cycle. You then repeat this process starting from a new (unvisited) position each time until all positions have been included in a cycle. After the one iteration described in the above steps, you get the following strings (where an "X" denotes a position in the cycle where the values were swapped between the parents.
cbhdefga
afcgedbh
X X X
Just keep finding and swapping cycles until you're done.
The code I linked from my github account is going to be tightly bound to my own metaheuristics framework, but I think it's a reasonably easy task to pull the basic algorithm out from the code and adapt it for your own system.
Note that you can potentially gain quite a lot from doing something more customized to your particular domain. I think something like CX will make a better black box algorithm than something based on a TSP operator, but black boxes are usually a last resort. Other people's suggestions might lead you to a better overall algorithm.
I've worked on a somewhat similar ranking problem and followed a technique similar to what I describe below. Does this work for you:
Assume the unknown value of an object diverges from your estimate via some distribution, say, the normal distribution. Interpret your ranking statements such as a > b, 0.9 as the statement "The value a lies at the 90% percentile of the distribution centered on b".
For every statement:
def realArrival = calculate a's location on a distribution centered on b
def arrivalGap = | realArrival - expectedArrival |
def fitness = Σ arrivalGap
Fitness function is MIN(fitness)
FWIW, my problem was actually a bin-packing problem, where the equivalent of your "rank" statements were user-provided rankings (1, 2, 3, etc.). So not quite TSP, but NP-Hard. OTOH, bin-packing has a pseudo-polynomial solution proportional to accepted error, which is what I eventually used. I'm not quite sure that would work with your probabilistic ranking statements.
What an interesting problem! If I understand it, what you're really asking is:
"Given a weighted, directed graph, with each edge-weight in the graph representing the probability that the arc is drawn in the correct direction, return the complete sequence of nodes with maximum probability of being a topological sort of the graph."
So if your graph has N edges, there are 2^N graphs of varying likelihood, with some orderings appearing in more than one graph.
I don't know if this will help (very brief Google searches did not enlighten me, but maybe you'll have more success with more perseverance) but my thoughts are that looking for "topological sort" in conjunction with any of "probabilistic", "random", "noise," or "error" (because the edge weights can be considered as a reliability factor) might be helpful.
I strongly question your assertion, in your example, that P(a>c) is not needed, though. You know your application space best, but it seems to me that specifying P(a>c) = 0.99 will give a different fitness for f(abc) than specifying P(a>c) = 0.01.
You might want to throw in "Bayesian" as well, since you might be able to start to infer values for (in your example) P(a>c) given your conditions and hypothetical solutions. The problem is, "topological sort" and "bayesian" is going to give you a whole bunch of hits related to markov chains and markov decision problems, which may or may not be helpful.