Multithreaded grouping algorithm - vb.net

I have a collection of circles, each of which may or may not intersect one or more other circles in the collection. I want to group these circles such that each "group" contains all circles such that every member of the group intersects at least one other member of the group, and such that no member of any group intersects any member of any other group. I have come up with the following VB.NET/pseudocode algorithm to solve this problem on a single thread:
Dim groups As New List(Of List(Of Circle))
For Each circleToClassify In allCircles
Dim added As Boolean
For Each group In groups
For Each circle In group
If circleToClassify.Intersects(circle) Then
group.Add(circleToClassify)
added = True
Exit For
End If
Next
If added Then
Exit For
End If
Next
If Not added Then
Dim newGroup As New List(Of Circle)
newGroup.Add(circleToClassify)
groups.Add(newGroup)
End If
Next
Return groups
Or in English
Take each item from the collection of circles
Check if it intersects with any member of any existing group (Bear in mind a "group" may only contain a single circle)
If the circle does intersect in the aforementioned manner add it to the appropriate group
Otherwise create a new group with this circle as its only member
Go to step 1.
What I want to be able to do is perform this task using an arbitrary number of threads. However, I haven't got very far at all as all solutions I've come up with so far will just end up executing serially due to locking.
Can anyone provide any tips on what I want to be thinking about to achieve this multithreading?

TLDR
The best multithreaded solutions avoid sharing or perform read-only sharing. (And hence don't need locks.)
Consider partitioning your work so that threads don't share result data, and then merging each thread's results.
Note that when you strip away the detail of detecting whether groups of circles intersect, you are really dealing with a connected components graph theory problem. There's plenty of useful material on this subject online. And in fact you may find it much easier and sufficiently fast to simply apply a breadth first search algorithm to find connected components.
Detail
When doing multi-threaded development, first prize is to implement the threads in such a way as to minimise the number of locks. In the most trivial case: if they don't share any data, they don't need locks at all. However, if you can guarantee that the shared data won't be modified while the threads are running: then you don't need locks in this case either.
In your question, there's no need for your input list of circles to be modified. The problem you have is that you're building up a shared list of circle groups. Basically you're sharing your result space and need locks to ensure the integrity of the results.
One technique in this situation is to "partition and merge". As a trivial example consider finding the maximum of a large list of numbers. The naive (and ideal single-threaded solution) is to:
keep a single "current maximum" found;
compare each element to this value;
and update the "current maximum" if it's higher.
The problem for multithreading occurs in updating of the shared result. One solution is to:
partition the list for each of p threads;
find the maximum within each partition;
once all threads finish their work, the final result is trivially obtained by finding the maximum of the p partitioned maximums.
The trade-off against a single-threaded solution involves weighing up the ease with which the workload can be partitioned and the per-thread results merged versus the often much simpler single-threaded approach.
Applying partition and merge to circle clusters
As a side note: Observe that your question is essentially a graph theory question such that: Each circle is a node; where if any 2 circles intersect, there's an undirected edge between them; and you're trying to determine the connected components of the graph.
Obviously this provides an area that you can research for more ideas/information. But more importantly it makes easier to analyse the problem with simple boolean assessment of whether 2 circles intersect.
Also note the potential performance improvements by first pre-processing your circles into a suitable graph structure.
Assume you have 8 circles (A-H) where 1's in the table below indicate the 2 circles intersect.
ABCDEFGH
A11000110
B11000000
C00100000
D00010101
E00001110
F10011100
G10001010
H00010001
One partitioning idea involves determining what's connected by only considering a subset of circles and all their immediate connections.
ABCDEFGH
A11000110 p1 [AB]
B11000000
---------
C00100000 p2 [CD]
D00010101
---------
E00001110 p3 [EF]
F10011100
---------
G10001010 p4 [GH]
H00010001
NB Even though threads are sharing data (e.g. 2 threads may consider the intersection between circles A and F concurrently), the share is read-only and doesn't require a lock.
Assume 4 partitions (and 4 threads) of [AB][CD][EF][GH]. Connected components per partition would be broken down as follows:
[AB]: ABFG
[CD]: C DFH
[EF]: ADEFG
[GH]: AEG DH
You now have a list of potentially overlapping connected components. Merging involves iterating the list to find overlaps. If found, take union of the 2 sets is a new connected component. This will finally produce: ABFGDHE and C
Some optimisation techniques to consider:
The bottom left of the matrix mirrors the top-right. So you should be able to avoid duplicating processing of the inverse connections.
The merging of partitions can itself be partitioned and merged.
In fact in the extreme case you could start out partitioning a single circle per partition.
Connected(A) = ABFG
Connected(B) = B
Connected(AB) = ABFG
Connected(C) = C
Connected(D) = DFH
Connected(CD) = C,DFH
Connected(ABCD) = ABFGDH,C
Connected(E) = EFG
Connected(F) = F
Connected(EF) = EFG
Connected(G) = G
Connected(H) = H
Connected(GH) = G,H
Connected(EFGH) = EFG,H
Connected(ABCDEFGH) = ABFGDHE,C
Very NB You need to ensure appropriate selection of data structures and algorithms or suffer extremely poor performance. E.g. A naive intersection implementation might require O(n^2) operations to determine if two intermediate connected components intersect and totally destroy your goal that lead to all this additional complexity.

One approach is to divide the image into blocks, run the algorithm for each block independently, on different threads (i.e. considering only the circles whose center is in that block), and afterwards join the groups from different blocks that have intersecting circles.
Another approach is to formulate the problem using a graph, where the nodes represent circles, and an edge exists between two nodes if the corresponding circles are intersecting. We need to find the connected components of this graph. This disregards the geometric aspects of the problem, however, there are general algorithms which may be useful (e.g. you could consider the last slides from this link).

Related

Can proc sql embedded in sas macros dynamically merge to data-sets, simulating residential treatment placement decisions for trouble youth?

Good afternoon and happy Friday, folks
I’m trying to automate a placement simulation of youth into residential treatment where they will have the highest likelihood of success. Success is operationalized as “not recidivating” within 3 years of entering treatment. Equations predicting recidivism have been generated for each location, and the equations have been applied to each individual in the scenario (based on youth characteristics like risk, age, etc., LOS). Each youth has predicted success rates for every location, which throws in a wrench: youth are not qualified for all of the treatment facilities for which they have predicted success rates. Indeed, treatment locations have differing, yet overlapping qualifications.
Let’s take a made-up example. Johnny (ID # 5, below) is a 15-year-old boy with drug charges. He could have “predicted success rates” of 91% for location A, 88% for location B, 50% for location C, and 75% for location D. Johnny is most likely to be successful (i.e., not recidivate within three years of entering treatment) if he is treated at location A; unfortunately, location A only accepts youth who are 17 years old or older; therefore, Johnny would not qualify for treatment here. Alternatively, for Johnny, location B is the next best location. Let us assume that Johnny is qualified for location B, but that all of location-B beds are filled; so, we must now look to location D, as it is now Johnny’s “best available” option at 75%.
The score so far: We are matching youth to available beds in location for which they qualify and might enjoy the greatest likelihood of success. Unfortunately, each location only has a certain number of available beds, and the number of available beds different across locations. The qualifications of entry into treatment facilities differ, yet overlap (e.g., 12-17 year-olds vs 14-20 year-olds).
In order to simulate what placement decisions might look like based on success rates, I went through the scenario describe above for over 400 youth, by hand, in excel. It took me about a week. I’d like to use PROC SQL imbedded in a SAS MACRO to automate these placement scenarios with the ultimate goals of a) obtain the ability to bootstrap iterations in order to examine effect sizes across distributions, b) save time, and c) prevent further brain damage from banging my head again desk and wall in frustration whilst doing this by hand. Whilst never having had the necessity—nay—the privilege of using SQL in my typical roll as a researcher, I believe that this time has now come to pass and I’m excited about it! Honestly. I believe it has the capacity I’m looking for. Unfortunately, it is beating the devil out of me!
Here’s what I’ve got cookin’ so far: I want to create and automate the placement simulation with the clever use of merging/joining/switching/or something like that.
I have two datasets (tables). The first dataset contains all of the youth information (one row per youth; several columns with demographics, location ranks, which correspond to the predicted success rates). The order of rows in the youth dataset (was/will be randomly generated (to simulate the randomness with which youth enter the system and are subsequently place into treatment). Note that I will be “cleaning” the youth dataset prior to merging such that rank-column cells will only be populated for programs for which a respective youth qualifies. This should take the “does the youth even qualify for the program” problem out of the equation.
However, it still leaves the issue of availability left to be contended with in the scenario.
The second dataset containing the treatment facility beds, with each row corresponding to an available bed in one of the treatment location; two columns contain bed numbers and location names. Each bed (row) has only one location cell populated, but locations will populate several cells.
Thus, in descending order, I want to merge each youth row with the available bed that represents his/her best chance of success, and so the merge/join/switch/thing should take place
on youth.Rank1= distinct TF.Location,
and if youth.Rank1≠ TF.location then
merge on youth.Rank2= TF.location,
if youth.Rank2≠ TF.location then merge at
youth.Rank3 = TF.location, etc.
Put plainly: “Merge on rank1 unless rank1 location is no longer available, then merge on rank2, unless rank2 location is no longer available, and on down the line, etc., etc., until all option are exhausted and foster care (i.e., alternative services). Is the only option.
I’ve had no success getting this to work. I haven’t even been successful getting the union function to work. About the only successful thing I’ve done in SQL so far is create a view of a single dataset. It’s pretty sad. I’ve been following this guidance, but I get hung up around the “where” command:
proc sql; /Calls the SQL procedure*/;
create table x as /*Tells SAS to create a table called x*/
select /*Specifies the column(s) to be selected*/
from /*Specificies the tables(s) (data sets) to be queried*/
where /*Subjests the data based on a condition*/
group by /*Classifies the data into groups based on the specified
column(s)*/
order by /*Sorts the resulting rows observations) by the specified
column(s)*/
; quit; /*Ends the proc sql procedure*/
Frankly, I’m stuck and I could use some advice. This greenhorn in me is in way over his head.
I appreciate any help or guidance anyone might lend.
Cheers!
P
The process you describe (and to be honest I skiped to the end so I might of missed something) does not lend itself to SQL because each step could affect the results of the next one. However, you want to get the most best results for the most kids. (I think a lot of that text was to convince us how important it is to help out). You don't actually give us anything we can really use to help since you don't give any details of your data model, your data, or expected results. There really is no way to answer this question. But I don't care -- I'm going to go forward with some suggestions because it is a friday and I've never done a stream of consciousness answer to a stream of consciousness question before. I will suggest you don't formulate your solution just in sql, but instead use a higher level program and engage is a process like the one described below -- because this a DB questions I've noted the locations where the DB might be involved.
Generate a list kids (this can be in a table -- called NEEDY-KID)
Have a list of locations to assign (this can also be a table LOCATION)
Run your matching for best fit from KID to location -- at this point don't worry about assign more than one kid to a location -- there can be duplicates (put this in table called KID2LOC using a query)
Check KID2LOC for locations assigned twice -- use some method to remove the duplicate ones so each loc is only assigned once. (remove from the KID2LOC using a query)
Prune the LOCATION list to remove assigned locations (once again -- a query)
If kids exist without a location go to 3 with new pruned location list.
Done.

Recursive Hierarchy Ranking

I have no idea if I wrote that correctly. I want to start learning higher end data mining techniques and I'm currently using SQL server and Access 2016.
I have a system that tracks ID cards. Each ID is tagged to one particular level of a security hierarchy, which has many branches.
For example
Root
-Maintenance
- Management
- Supervisory
- Manager
- Executive
- Vendors
- Secure
- Per Diem
- Inside Trades
There are many other departments like Maintenance, some simple, some with much more convoluted, hierarchies.
Each ID card is tagged to a level so in the Maintenance example, - Per Diem:Vendors:Maintenance:Root. Others may be just tagged to Vendors, Some to general Maintenance itself (No one has root, thank god).
So lets say I have 20 ID Cards selected, these are available personnel I can task to a job but since they have different area's of security I want to find a commonalities they can all work on together as a 20 person group or whatever other groupings I can make.
So the intended output would be
CommonMatch = - Per Diem
CardID = 1
CardID = 3
CommonMatch = Vendors
CardID = 1
CardID = 3
CardID = 20
So in the example above, while I could have 2 people working on -Per Diem work, because that is their lowest common security similarity, there is also card holder #20 who has rights to the predecessor group (Vendors), that 1 and 3 share, so I could have three of them work at that level.
I'm not looking for anyone to do the work for me (Although examples always welcome), more to point me in the right direction on what I should be studying, what I'm trying to do is called, etc. I know CTE's are a way to go but that seems like only a tool in a much bigger process that needs to be done.
Thank you all in advance
Well, it is not so much a graph-theory or data-mining problem but rather a data-structure problem and one that has almost solved itself.
The objective is to be able to partition the set of card IDs into disjoint subsets given a security clearance level.
So, the main idea here would be to layout the hierarchy tree and then assign each card ID to the path implied by its security level clearance. For this purpose, each node of the hierarchy tree now becomes a container of card IDs (e.g. each node of the hierarchy tree holds a) its own name (as unique identification) b) pointers to other nodes c) a list of card IDs assigned to its "name".)
Then, retrieving the set of cards with clearance UP TO a specific security level is simply a case of traversing the tree from that specific level downwards until the tree's leafs, all along collecting the card IDs from the node containers as they are encountered.
Suppose that we have access tree:
A
+-B
+-C
D
+-E
And card ID assignments:
B:[1,2,3]
C:[4,8]
E:[10,12]
At the moment, B,C,E only make sense as tags, there is no structural information associated with them. We therefore need to first "build" the tree. The following example uses Networkx but the same thing can be achieved with a multitude of ways:
import networkx
G = networkx.DiGraph() #Establish a directed graph
G.add_edge("A","B")
G.add_edge("A","C")
G.add_edge("A","D")
G.add_edge("D","E")
Now, assign the card IDs to the node containers (in Networkx, nodes can be any valid Python object so I am going to go with a very simple list)
G.node["B"]=[1,2,3]
G.node["C"]=[4,8]
G.node["E"]=[10,12]
So, now, to get everybody working under "A" (the root of the tree), you can traverse the tree from that level downwards either via Depth First Search (DFS) or Breadth First Search (BFS) and collect the card IDs from the containers. I am going to use DFS here, purely because Networkx has a function that returns the visited nodes depending on visiting order, directly.
#dfs_preorder_nodes returns a generator, this is an efficient way of iterating very large collections in Python but I am casting it to a "list" here, so that we get the actual list of nodes back.
vis_nodes = list(networkx.dfs_preorder_nodes(G,"A")); #Start from node "A" and DFS downwards
cardIDs = []
#I could do the following with a one-line reduce but it might be clearer this way
for aNodeID in vis_nodes:
if G.node[aNodeID]:
cardIDs.extend(G.node[aNodeID])
In the end of the above iteration, cardIDs will contain all card IDs from branch "A" downwards in one convenient list.
Of course, this example is ultra simple, but since we are talking about trees, the tree can be as large as you like and you are still traversing it in the same way requiring only a single point of entry (the top level branch).
Finally, just as a note, the fact that you are using Access as your backend is not necessarily an impediment but relational databases do not handle graph type data with great ease. You might get away easily for something like a simple tree (like what you have here for example), but the hassle of supporting this probably justifies undertaking this process outside of the database (e.g, use the database just for retrieving the data and carry out the graph type data processing in a different environment. Doing a DFS on SQL is the sort of hassle I am referring to above.)
Hope this helps.

Building ranking with genetic algorithm,

Question after BIG edition :
I need to built a ranking using genetic algorithm, I have data like this :
P(a>b)=0.9
P(b>c)=0.7
P(c>d)=0.8
P(b>d)=0.3
now, lets interpret a,b,c,d as names of football teams, and P(x>y) is probability that x wins with y. We want to build ranking of teams, we lack some observations P(a>d),P(a>c) are missing due to lack of matches between a vs d and a vs c.
Goal is to find ordering of team names, which the best describes current situation in that four team league.
If we have only 4 teams than solution is straightforward, first we compute probabilities for all 4!=24 orderings of four teams, while ignoring missing values we have :
P(abcd)=P(a>b)P(b>c)P(c>d)P(b>d)
P(abdc)=P(a>b)P(b>c)(1-P(c>d))P(b>d)
...
P(dcba)=(1-P(a>b))(1-P(b>c))(1-P(c>d))(1-P(b>d))
and we choose the ranking with highest probability. I don't want to use any other fitness function.
My question :
As numbers of permutations of n elements is n! calculation of probabilities for all
orderings is impossible for large n (my n is about 40). I want to use genetic algorithm for that problem.
Mutation operator is simple switching of places of two (or more) elements of ranking.
But how to make crossover of two orderings ?
Could P(abcd) be interpreted as cost function of path 'abcd' in assymetric TSP problem but cost of travelling from x to y is different than cost of travelling from y to x, P(x>y)=1-P(y<x) ? There are so many crossover operators for TSP problem, but I think I have to design my own crossover operator, because my problem is slightly different from TSP. Do you have any ideas for solution or frame for conceptual analysis ?
The easiest way, on conceptual and implementation level, is to use crossover operator which make exchange of suborderings between two solutions :
CrossOver(ABcD,AcDB) = AcBD
for random subset of elements (in this case 'a,b,d' in capital letters) we copy and paste first subordering - sequence of elements 'a,b,d' to second ordering.
Edition : asymetric TSP could be turned into symmetric TSP, but with forbidden suborderings, which make GA approach unsuitable.
It's definitely an interesting problem, and it seems most of the answers and comments have focused on the semantic aspects of the problem (i.e., the meaning of the fitness function, etc.).
I'll chip in some information about the syntactic elements -- how do you do crossover and/or mutation in ways that make sense. Obviously, as you noted with the parallel to the TSP, you have a permutation problem. So if you want to use a GA, the natural representation of candidate solutions is simply an ordered list of your points, careful to avoid repitition -- that is, a permutation.
TSP is one such permutation problem, and there are a number of crossover operators (e.g., Edge Assembly Crossover) that you can take from TSP algorithms and use directly. However, I think you'll have problems with that approach. Basically, the problem is this: in TSP, the important quality of solutions is adjacency. That is, abcd has the same fitness as cdab, because it's the same tour, just starting and ending at a different city. In your example, absolute position is much more important that this notion of relative position. abcd means in a sense that a is the best point -- it's important that it came first in the list.
The key thing you have to do to get an effective crossover operator is to account for what the properties are in the parents that make them good, and try to extract and combine exactly those properties. Nick Radcliffe called this "respectful recombination" (note that paper is quite old, and the theory is now understood a bit differently, but the principle is sound). Taking a TSP-designed operator and applying it to your problem will end up producing offspring that try to conserve irrelevant information from the parents.
You ideally need an operator that attempts to preserve absolute position in the string. The best one I know of offhand is known as Cycle Crossover (CX). I'm missing a good reference off the top of my head, but I can point you to some code where I implemented it as part of my graduate work. The basic idea of CX is fairly complicated to describe, and much easier to see in action. Take the following two points:
abcdefgh
cfhgedba
Pick a starting point in parent 1 at random. For simplicity, I'll just start at position 0 with the "a".
Now drop straight down into parent 2, and observe the value there (in this case, "c").
Now search for "c" in parent 1. We find it at position 2.
Now drop straight down again, and observe the "h" in parent 2, position 2.
Again, search for this "h" in parent 1, found at position 7.
Drop straight down and observe the "a" in parent 2.
At this point note that if we search for "a" in parent one, we reach a position where we've already been. Continuing past that will just cycle. In fact, we call the sequence of positions we visited (0, 2, 7) a "cycle". Note that we can simply exchange the values at these positions between the parents as a group and both parents will retain the permutation property, because we have the same three values at each position in the cycle for both parents, just in different orders.
Make the swap of the positions included in the cycle.
Note that this is only one cycle. You then repeat this process starting from a new (unvisited) position each time until all positions have been included in a cycle. After the one iteration described in the above steps, you get the following strings (where an "X" denotes a position in the cycle where the values were swapped between the parents.
cbhdefga
afcgedbh
X X X
Just keep finding and swapping cycles until you're done.
The code I linked from my github account is going to be tightly bound to my own metaheuristics framework, but I think it's a reasonably easy task to pull the basic algorithm out from the code and adapt it for your own system.
Note that you can potentially gain quite a lot from doing something more customized to your particular domain. I think something like CX will make a better black box algorithm than something based on a TSP operator, but black boxes are usually a last resort. Other people's suggestions might lead you to a better overall algorithm.
I've worked on a somewhat similar ranking problem and followed a technique similar to what I describe below. Does this work for you:
Assume the unknown value of an object diverges from your estimate via some distribution, say, the normal distribution. Interpret your ranking statements such as a > b, 0.9 as the statement "The value a lies at the 90% percentile of the distribution centered on b".
For every statement:
def realArrival = calculate a's location on a distribution centered on b
def arrivalGap = | realArrival - expectedArrival |
def fitness = Σ arrivalGap
Fitness function is MIN(fitness)
FWIW, my problem was actually a bin-packing problem, where the equivalent of your "rank" statements were user-provided rankings (1, 2, 3, etc.). So not quite TSP, but NP-Hard. OTOH, bin-packing has a pseudo-polynomial solution proportional to accepted error, which is what I eventually used. I'm not quite sure that would work with your probabilistic ranking statements.
What an interesting problem! If I understand it, what you're really asking is:
"Given a weighted, directed graph, with each edge-weight in the graph representing the probability that the arc is drawn in the correct direction, return the complete sequence of nodes with maximum probability of being a topological sort of the graph."
So if your graph has N edges, there are 2^N graphs of varying likelihood, with some orderings appearing in more than one graph.
I don't know if this will help (very brief Google searches did not enlighten me, but maybe you'll have more success with more perseverance) but my thoughts are that looking for "topological sort" in conjunction with any of "probabilistic", "random", "noise," or "error" (because the edge weights can be considered as a reliability factor) might be helpful.
I strongly question your assertion, in your example, that P(a>c) is not needed, though. You know your application space best, but it seems to me that specifying P(a>c) = 0.99 will give a different fitness for f(abc) than specifying P(a>c) = 0.01.
You might want to throw in "Bayesian" as well, since you might be able to start to infer values for (in your example) P(a>c) given your conditions and hypothetical solutions. The problem is, "topological sort" and "bayesian" is going to give you a whole bunch of hits related to markov chains and markov decision problems, which may or may not be helpful.

SQL Server 2008+ : Best method for detecting if two polygons overlap?

We have an application that has a database full of polygons (currently stored as points) that a .net app pulls out and checks if they overlap.
I occurred to me that it would be much nicer to convert these point arrays to polygon / polyline objects within the database and use sql to get a bool of weather they overlap or not.
I have seen different methods suggested to do this but non of the examples given were quite in-line with my needs.
I would be very happy to receive input from those kind enough to offer their experience.
Additional:
In response to questions: It is indeed 2D. and yes any crossover of the two is considered true. The polygons have n points and can be concave. The polygons will be saved as 1 per row (after data conversion task) as polygons (i.e. the polygon type .. it might be called something else spatial / geom my memory is not on my side right now)
You can use .STIntersection with .STAsText() to test for overlapping polygons. (I really hate the terminology Microsoft has used (or whoever set the standard terms). "Touching," in my mind, should be a test for whether or not two geometry/geography shapes overlap at all, not just share a border.)
Anyway....
If #RadiusGeom is a geometry representing a radius from a point, the following will return a list of any two polygons where an intersection (a geometry that represents the area where two geometries overlap) is not empty.
SELECT CT.ID AS CTID, CT.[Geom] AS CensusTractGeom
FROM CensusTracts CT
WHERE CT.[Geom].STIntersection(#RadiusGeom).STAsText() <> 'GEOMETRYCOLLECTION EMPTY'
If your geometry field is spatially indexed, this runs pretty quickly. I ran this on 66,000 US CT records in about 3 seconds. There may be a better way, but since no one else had an answer, this was my attempt at an answer for you. Hope it helps!
Calculate and store the bounding rectangle of each polygon in a set of new fields within the row which is associated with that polygon. (I assume you have one; if not, create one.) When your dotnet app has a polygon and is looking for overlapping polygons, it can fetch from the database only those polygons whose bounding rectangles overlap, using a relatively simple SQL SELECT statement. Those polygons should be relatively few, so this will be efficient. Then, your dotnet app can perform the finer polygon overlap calculations in order to determine which ones of those really overlap.
Okay, I got another idea, so I am posting it as a different answer. I think my previous answer with the bounding polygons probably has some merit on its own, even if it was to reduce the number of polygons fetched from the database by a small percentage, but this one is probably better.
MSSQL supports integration with the CLR since version 2005. This means that you can define your own data type in an assembly, register the assembly with MSSQL, and from that moment on MSSQL will be accepting your user-defined data type as a valid type for a column, and it will be invoking your assembly to perform operations with your user-defined data type.
An example article for this technique on the CodeProject: Creating User-Defined Data Types in SQL Server 2005
I have never used this mechanism, so I do not know details about it, but I presume that you should be able to either define a new operation on your data type, or perhaps overload some existing operation like "less-than", so that you can check if one polygon intersects another. This is likely to speed things up a lot.

Design Problem

Recently I was faced with this interview question (K-Means Clustering solution). The design I came up with did not meet the expectations of the interviewer (to put simply I didnt get the job because I lost to another candidate on this design problem). I am wondering how many different / efficient / simply solutions can the SO community come up with (by doing this I am hoping to hone my skills):
To implement a simple algorithm to cluster people according to their weight and height. The
data set includes a list of people with their weights and heights like so:
Person Weight Height
(kg) (inches)
Person 1 70 70
Person 2 75 80
Person 3 120 85
You can plot the data as a 2 dimensional data. Weight being one dimension and height being
the other dimension. Weight can range from a minimum of 50kg to 150kg. Height can range
from a minimum of 38inches to 90inches
Algorithm:
The algorithm (called K-means clustering) will cluster data into K groups goes as such:
Start with K clusters. Each cluster is defined by its center point which will start of as
random weight and random height. Pick random numbers from within the
corresponding ranges defined above.
For each person
Calculate distance to center of each cluster using formula
distance = sqrt(pow((wperson−wcenter), 2) + (pow(hperson−hcenter),2))
where wperson = weight of person,
hperson = height of person
wcenter = weight of cluster center point,
hcenter = height of cluster center point
Assign the person to the cluster with the shortest distance to center point of cluster
After end of step 2, you will end up with K clusters each assigned with a set of people
For each cluster, set the weight and height of the center point to the average of the
people in the cluster
wcenter = (sum of weight of each person in cluster)/(number of people in cluster)
hcenter = (sum of height of each person in cluster)/number of people in cluster)
Repeat steps 2 to 5 for 1000 iterations, then print out following information for each
cluster.
weight and height of center of cluster.
list of people in cluster.
I am not looking for a implementation/solution but for a high level design. can you list the interfaces / classes etc.
I dont want to give my solution now, but will post it later in the day?
This is my attempt at the design. I only show the static diagram since the algorithm is pretty much laid out already. I would have a plan to have a visitor for the representation of the clusters, could allow different types of output (xml, strings, csv..etc). Maybe the visitor is overkill, if it was then I'd just have something like a ToString method that could be overridden.
Note: the Cluster creates a CenterClusterItem on the SetCenter and FindNewCenter methods. The CenterClusterItem is not a PersonClusterItem, it just holds the same amount of AClusterValues as a PersonClusterItem would (since the average isn't really a person).
Also, I forgot to make a method on the KCluster to begin the process, but that's implied.
Class Diagram http://img11.imageshack.us/img11/499/kcluster.png
Well, I would first tackle all the constants/magic numbers that reduce the reusability of the algorithm:
instead of a fixed number of iterations, use a stopping criterion (e.g., if clusters don't change too much, terminate)
don't restrict yourself to 2-dim data, use vectors
let the user define the number of clusters to be found
Then, you could hide some specifics behind interfaces, e.g. the distance might be calculated differently (for example, it might at some point have to cope with values other than double).
On the other hand, if you really have this simple problem, some of these generalizations might well be overkill - but that's what I would discuss with someone telling me to implement this algorithm.
You can create the following classes:
Person to store data about persons and centers. Properties: id, weight and height. Method: calculateDistance
Cluster to store one center and a list of persons: Properties: center and list of Person. Method: calculateCenter.
KCluster to hold your algorithm and store a list of clusters: Property: list of Cluster. Methods: generateClusters.
I'm not sure what your question actually is, the steps you point out effectively define the algorithm you're talking about.
A better idea may be to include exactly what you did then people can give you some hints / tips as to where you might have gone wrong or what they would have done differently.
That sounds like a really good way to do it. K-means will usually converge quickly (though not necessarily to the global optimum), so my one suggestion would be to run the algorithm until no more changes occur, rather than a fixed number of 1000 iterations. You could then repeat the entire process a few times with different random starting points.
One weakness of k-means is that it does require you to specify (i.e. guess) an appropriate value for k up-front. I think you would get points for asking the interviewer what an appropriate value for k would be, or, if there is no way to know, describing some goodness-of-fit measure and then calculating that measure for different values of k to find a "just low enough" value.