Efficient algorithm for dendrogram cutoff - hierarchical-clustering

I have implemented an algorithm for hierarchical clustering and a simple method for drawing the dendrogram in C#.
Now I want to add a dendrogram cutoff method and another one for coloring dendrogram branches.
What would be an efficient algorithm to do that?
The cutoff method should return a list of dendrogram nodes beneath which each subtree represents a single cluster.
My data structure is a simple binary tree represented by a root node.
The node structure is as follows:
class DendrogramNode
{
    public string Id { get; set; }
    public DendrogramNode LeftNode { get; set; }
    public DendrogramNode RightNode { get; set; }
    public double Height { get; set; }
}
The CutOff method should have the following signature:
List<DendrogramNode> CutOff(int numberOfClusters)
What I did so far:
My first attempt was to create a list of all DendrogramNodes, sort them by height in descending order, and then take the first numberOfClusters entries from the sorted list. This fails because the list may contain parent nodes whose children are also on the list; in such a situation the parent nodes should be removed.
My second attempt was to create a list of all linkages and store them in linkage order. This way I could take the last numberOfClusters linkages and use them to create the cutoff list. This works fine, but I don't like storing this information, as it is hard to maintain (especially for iterative clustering).
It seems like a simple problem, but somehow I am stuck on it. Can you help me find an efficient solution?
I guess solution 1 was OK up to a point, but it needs an extra step that removes parent nodes when all their children are also on the list, and it should be somehow iterative/recursive, as removing a node creates space to add another.

A good solution is to use a priority queue (items are nodes; the priority is node height):
private List<DendrogramNode> GetCutOffNodes(int numberOfClusters)
{
    numberOfClusters = System.Math.Min(this.NumberOfInstances, System.Math.Max(numberOfClusters, 0));
    PriorityQueue<DendrogramNode, DendrogramNodeDescendingComparer> queue
        = new PriorityQueue<DendrogramNode, DendrogramNodeDescendingComparer>();
    queue.Enqueue(this.Root);
    int clusters2Find = numberOfClusters - 1;
    DendrogramNode node = null;
    while (queue.Count > 0 && clusters2Find > 0)
    {
        // Split the highest remaining node: replace it with its two children.
        // Each split increases the number of clusters in the queue by one.
        node = queue.Dequeue();
        if (node.LeftNode != null)
            queue.Enqueue(node.LeftNode);
        if (node.RightNode != null)
            queue.Enqueue(node.RightNode);
        clusters2Find--;
    }
    // Whatever remains in the queue is the cutoff set.
    List<DendrogramNode> result = new List<DendrogramNode>(numberOfClusters);
    while (queue.Count > 0)
        result.Add(queue.Dequeue());
    return result;
}
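The comparer type referenced above isn't shown; here is a minimal sketch, assuming the custom PriorityQueue orders elements with an IComparer<DendrogramNode> and dequeues the smallest element first, so reversing the comparison makes the highest node come out first:

// Sketch only: the original PriorityQueue implementation is not shown in the post.
class DendrogramNodeDescendingComparer : IComparer<DendrogramNode>
{
    // Orders nodes so that the node with the greatest Height is dequeued first.
    public int Compare(DendrogramNode x, DendrogramNode y)
    {
        return y.Height.CompareTo(x.Height);
    }
}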


gtk#: Tree View object always passed as call-by-reference

It seems as if arguments in a method call that processes a Tree View are always passed as call-by-reference.
I have a visible GTK "Tree View" control on a top-level window. The data was written by the respective model.
Now I want to remove some of the columns (based on options set by the user) and pass the manipulated Tree View to an export function.
In order to remove the columns only from the output, not from the GUI itself, I thought of copying the visible Tree View control into a temporary one, manipulating the temporary one, and calling the export functionality on it.
My problem is: even though I pass my original, visible Tree View by value (as I understand it), the original gets manipulated and the columns are removed from the visual Tree View.
It seems as if the arguments in the method call are always passed by reference.
Code:
"treeview1" is the visual Gtk.Tree View...
I call my Export-function:
...
TreeView treeviewExport = SetExportViewAccordingToCheckboxes(treeview1);
ExportFile(treeviewExport);
...
In the method SetExportViewAccordingToCheckboxes() I just pass the global treeview1 by value, manipulate it internally and return the manipulated Tree View:
protected static TreeView SetExportViewAccordingToCheckboxes(TreeView tvSource)
{
    TreeView tvRet = tvSource;
    if (cbName == false)
        tvRet.RemoveColumn( ... );
    ...
    return tvRet;
}
But even though I have removed the columns from the internal Tree View "tvRet", my visual control "treeview1" lacks all the columns which were removed from "tvRet"; it looks like "treeview1" was passed as call-by-reference.
Question: why is that?
Note: I also tried with the keyword "in" which made no difference:
protected static TreeView SetExportViewAccordingToCheckboxes(in TreeView p_tvSource)
The problem comes here:
In the method SetExportViewAccordingToCheckboxes() I just pass the global treeview1 by value, manipulate it internally and return the manipulated Tree View:
protected static TreeView SetExportViewAccordingToCheckboxes(TreeView tvSource)
{
    TreeView tvRet = tvSource;
    if (cbName == false)
        tvRet.RemoveColumn( ... );
    ...
    return tvRet;
}
First some background. In C# terminology, value types are those that directly contain a value, while reference types are those that reference the data, instead of holding it directly.
So, int x = 5 means that you are creating the value 5 of type int and storing it directly in x, while TreeView tree = new TreeView() means that you are creating a reference, tree, of type TreeView, which points to an object of that type.
All of this means that you cannot pass an object by value, even if you want to. At best you are passing the reference by value, and the callee still sees (and can modify) the very same object.
So, the next step is to copy the data and modify the copied object instead of the original one. This is theoretically sound, but the line TreeView tvRet = tvSource; unfortunately does not achieve that. You are creating a new reference, yes, but that reference points to the same object the original reference points to.
Now, say that we are managing objects of class Point instead of TreeView, with properties x and y.
class Point {
    public int X { get; set; }
    public int Y { get; set; }
}
You can create a point easily:
Point p1 = new Point { X = 5, Y = 7 };
But this does not copy it:
Point p2 = p1;
This would do:
Point p2 = new Point { X = p1.X, Y = p1.Y };
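After that copy, changes to p2 no longer affect p1:

p2.X = 10;
Console.WriteLine(p1.X);   // still 5: p1 and p2 are now distinct objects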
Now, the original problem was that you wanted to pass a few columns to an Export() function. In that case, you only need to pass an array of the filtered columns to the exporting function, instead of a copy of the TreeView.
void PrepareExporting()
{
    var columns = new List<TreeViewColumn>();
    foreach (TreeViewColumn col in this.treeView.Columns) {
        if ( this.Filter( col ) ) {
            columns.Add( col );
        }
    }
    this.Export( columns.ToArray() );
}

void Export(TreeViewColumn[] columns)
{
    // ...
}
I think that would be easier, since there is no need to try to pass the object by value (impossible), nor to copy the tree view.
Hope this helps.

Returning composite class with Neo4jClient

The Neo4jClient cypher wiki (https://github.com/Readify/Neo4jClient/wiki/cypher) contains an example of using lambda expressions to return multiple projections...
var query = client
    .Cypher
    .Start(new { root = client.RootNode })
    .Match("root-[:HAS_BOOK]->book-[:PUBLISHED_BY]->publisher")
    .Return((book, publisher) => new {
        Book = book.As<Book>(),
        Publisher = publisher.As<Publisher>(),
    });
So the query will return details of both book nodes and publisher nodes. But I want to do something slightly different: I want to combine the contents of a single node type with a property of the matched path. Let's say I have Person nodes with a property 'name', and a class defined like so...
public class descendant
{
    public string name { get; set; }
    public int depth { get; set; }
}
A cypher query like this will return what I want, which is all descendants of a given node with the depth of the relationship...
match p=(n:Person)<-[*]-(child:Person)
where n.name='George'
return distinct child.name as name, length(p) as depth
If I try a Neo4jClient query like this...
var query =
    _graphClient.Cypher
    .Match("p=(n:Person)<-[*]-(child:Person)")
    .Where("n.name='George'")
    .Return<descendant>("child.name, length(p)");
I get an error that the syntax is obsolete, but I can't figure out how I should project the cypher results onto my C# POCO. Any ideas anyone?
The query should look like this:
var query =
    _graphClient.Cypher
    .Match("p=(n:Person)<-[*]-(child:Person)")
    .Where((Person n) => n.name == "George")
    .Return((n, p) => new descendant
    {
        name = n.As<Person>().name,
        depth = p.Length()
    });
The Return statement should have the 2 parameters you care about (in this case n and p) and project them via the lambda syntax (=>) to create a new descendant instance.
The main point this differs from the example, is that the example creates a new anonymous type, whereas you want to create a concrete type.
We then use the property initializer (code inside the { } braces) to set the name and depth, using the As<> and Length extension methods to get the values you want.
As a side note, I've also changed the Where clause to use parameters; you should always do this if you can, as it makes your queries both faster and safer.
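For completeness, the lambdas above assume a Person POCO along these lines (hypothetical; the class is not shown in the question):

// Assumed shape of the Person node class; the property name matches
// the lowercase 'name' property used in the question's cypher.
public class Person
{
    public string name { get; set; }
}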

LightSwitch - bulk-loading all requests into one using a domain service

I need to group some data from a SQL Server database, and since LightSwitch doesn't support that out of the box, I use a Domain Service according to Eric Erhardt's guide.
However, my table contains several foreign keys, and of course I want the correct related data to be shown in the table (just doing it as in the guide only makes the key values show). I solved this by adding a Relationship to my newly created Entity like this:
And my Domain Service class looks like this:
public class AzureDbTestReportData : DomainService
{
    private CountryLawDataDataObjectContext context;

    public CountryLawDataDataObjectContext Context
    {
        get
        {
            if (this.context == null)
            {
                EntityConnectionStringBuilder builder = new EntityConnectionStringBuilder();
                builder.Metadata =
                    "res://*/CountryLawDataData.csdl|res://*/CountryLawDataData.ssdl|res://*/CountryLawDataData.msl";
                builder.Provider = "System.Data.SqlClient";
                builder.ProviderConnectionString =
                    WebConfigurationManager.ConnectionStrings["CountryLawDataData"].ConnectionString;
                this.context = new CountryLawDataDataObjectContext(builder.ConnectionString);
            }
            return this.context;
        }
    }

    /// <summary>
    /// Override the Count method in order for paging to work correctly
    /// </summary>
    protected override int Count<T>(IQueryable<T> query)
    {
        return query.Count();
    }

    [Query(IsDefault = true)]
    public IQueryable<RuleEntryTest> GetRuleEntryTest()
    {
        return this.Context.RuleEntries
            .Select(g =>
                new RuleEntryTest()
                {
                    Id = g.Id,
                    Country = g.Country,
                    BaseField = g.BaseField
                });
    }
}

public class RuleEntryTest
{
    [Key]
    public int Id { get; set; }
    public string Country { get; set; }
    public int BaseField { get; set; }
}
It works and all that; both the Country name and the BaseField load with autocomplete boxes as they should, but it takes a VERY long time. With two columns it takes 5-10 seconds to load one page, and I have 10 more columns I haven't implemented yet.
The reason it takes so long is that each piece of related data (each Country and BaseField) requires its own request. Loading a page looks like this in Fiddler:
This isn't acceptable at all; there should be a way of combining all those calls into one, just as happens when loading the same table without going through the Domain Service.
So.. that was a lot of explaining; my question is: is there any way I can make all the related data load at once, or improve the performance in some other way? It should not take 10+ seconds to load a screen.
Thanks for any help or input!
My RIA Service queries are extremely fast, compared to not using them, even when I'm doing aggregation. It might be the fact that you're using "virtual relationships" (which you can tell by the dotted lines between the tables), that you've created using your RuleEntryTest entity.
Why is your original RuleEntry entity not related to both Country & BaseUnit in LightSwitch BEFORE you start creating your RIA entity?
I haven't used Fiddler to see what's happening, but I'd try creating "real" relationships, instead of "virtual" ones, & see if that helps your RIA entity's performance.
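If the relationships alone don't help, another option is to flatten the related display values into the RIA projection itself, so they travel in the same request as the row data. This is only a sketch: "CountryNavigation" stands in for whatever navigation property your EF model actually exposes, so adjust the names to your model.

[Query(IsDefault = true)]
public IQueryable<RuleEntryTest> GetRuleEntryTest()
{
    // Hypothetical: projecting the display value here means the screen no
    // longer needs one extra lookup request per related Country row.
    return this.Context.RuleEntries
        .Select(g => new RuleEntryTest()
        {
            Id = g.Id,
            Country = g.CountryNavigation.Name,  // assumed navigation property
            BaseField = g.BaseField
        });
}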

How to filter Azure logs, or WCF Data Services filters for Dummies

I am looking at my Azure logs in the WADLogsTable and would like to filter the results, but I'm clueless as to how to do so. There is a textbox that says:
"Enter a WCF Data Services filter to limit the entities returned"
What is the syntax of a "WCF Data Services filter"? The following gives me an InvalidValueType error saying "The value specified is invalid.":
Timestamp gt '2011-04-20T00:00'
Am I even close? Is there a handy syntax reference somewhere?
This query should be in the format:
Timestamp gt datetime'2011-04-20T00:00:00'
Remembering to put that datetime in there is the important bit.
This trips me up every time, so I use the OData overview for reference.
Adding to knightffhor's response, you can certainly write a query which filters by Timestamp, but this is not the recommended approach, as querying on the Timestamp attribute will lead to a full table scan. Instead, query this table on the PartitionKey attribute. I'm copying my response from another thread here (Can I capture Performance Counters for an Azure Web/Worker Role remotely...?):
"One of the key thing here is to understand how to effectively query this table (and other diagnostics table). One of the things we would want from the diagnostics table is to fetch the data for a certain period of time. Our natural instinct would be to query this table on Timestamp attribute. However that's a BAD DESIGN choice because you know in an Azure table the data is indexed on PartitionKey and RowKey. Querying on any other attribute will result in full table scan which will create a problem when your table contains a lot of data.The good thing about these logs table is that PartitionKey value in a way represents the date/time when the data point was collected. Basically PartitionKey is created by using higher order bits of DateTime.Ticks (in UTC). So if you were to fetch the data for a certain date/time range, first you would need to calculate the Ticks for your range (in UTC) and then prepend a "0" in front of it and use those values in your query.
If you're querying using REST API, you would use syntax like:
PartitionKey ge '0<from date/time ticks in UTC>' and PartitionKey le '0<to date/time ticks in UTC>'."
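For example, a small C# helper (a sketch; the method name is illustrative) that builds such a filter string from a UTC date/time range:

// Sketch: builds a REST-style filter over PartitionKey from a UTC time range.
// The leading "0" matches the partition key format described above
// ("0" prepended to DateTime.Ticks in UTC).
static string BuildPartitionKeyFilter(DateTime fromUtc, DateTime toUtc)
{
    string from = "0" + fromUtc.Ticks;
    string to = "0" + toUtc.Ticks;
    return string.Format("PartitionKey ge '{0}' and PartitionKey le '{1}'", from, to);
}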
I've written a blog post about how to write WCF queries against table storage which you may find useful: http://blog.cerebrata.com/specifying-filter-criteria-when-querying-azure-table-storage-using-rest-api/
Also if you're looking for a 3rd party tool for viewing and managing diagnostics data, may I suggest that you take a look at our product Azure Diagnostics Manager: /Products/AzureDiagnosticsManager. This tool is built specifically for surfacing and managing Windows Azure Diagnostics data.
The answer I accepted helped me immensely in directly querying the table through Visual Studio. Eventually, however, I needed a more robust solution. I used the tips I gained here to develop some classes in C# that let me use LINQ to query the tables. In case it is useful to others viewing this question, here is roughly how I now query my Azure logs.
Create a class that inherits from Microsoft.WindowsAzure.StorageClient.TableServiceEntity to represent all the data in the "WADLogsTable" table:
public class AzureDiagnosticEntry : TableServiceEntity
{
    public long EventTickCount { get; set; }
    public string DeploymentId { get; set; }
    public string Role { get; set; }
    public string RoleInstance { get; set; }
    public int EventId { get; set; }
    public int Level { get; set; }
    public int Pid { get; set; }
    public int Tid { get; set; }
    public string Message { get; set; }

    public DateTime EventDateTime
    {
        get
        {
            return new DateTime(EventTickCount, DateTimeKind.Utc);
        }
    }
}
Create a class that inherits from Microsoft.WindowsAzure.StorageClient.TableServiceContext and references the newly defined data object class:
public class AzureDiagnosticContext : TableServiceContext
{
    public AzureDiagnosticContext(string baseAddress, StorageCredentials credentials)
        : base(baseAddress, credentials)
    {
        this.ResolveType = s => typeof(AzureDiagnosticEntry);
    }

    public AzureDiagnosticContext(CloudStorageAccount storage)
        : this(storage.TableEndpoint.ToString(), storage.Credentials) { }

    // Helper property to get an IQueryable. Hard-code "WADLogsTable" for this class.
    public IQueryable<AzureDiagnosticEntry> Logs
    {
        get
        {
            return CreateQuery<AzureDiagnosticEntry>("WADLogsTable");
        }
    }
}
I have a helper method that creates a CloudStorageAccount from configuration settings:
public CloudStorageAccount GetStorageAccount()
{
    CloudStorageAccount.SetConfigurationSettingPublisher(
        (name, setter) => setter(RoleEnvironment.GetConfigurationSettingValue(name)));
    string configKey = "Microsoft.WindowsAzure.Plugins.Diagnostics.ConnectionString";
    return CloudStorageAccount.FromConfigurationSetting(configKey);
}
I create an AzureDiagnosticContext from the CloudStorageAccount and use that to query my logs:
public IEnumerable<AzureDiagnosticEntry> GetAzureLog(DateTime start, DateTime end)
{
    CloudStorageAccount storage = GetStorageAccount();
    AzureDiagnosticContext context = new AzureDiagnosticContext(storage);
    // Partition keys are "0" + DateTime.Ticks (UTC), so start/end should be UTC times.
    string startTicks = "0" + start.Ticks;
    string endTicks = "0" + end.Ticks;
    IQueryable<AzureDiagnosticEntry> query = context.Logs.Where(
        e => e.PartitionKey.CompareTo(startTicks) > 0 &&
             e.PartitionKey.CompareTo(endTicks) < 0);
    CloudTableQuery<AzureDiagnosticEntry> tableQuery = query.AsTableServiceQuery();
    IEnumerable<AzureDiagnosticEntry> results = tableQuery.Execute();
    return results;
}
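Usage is then a one-liner; e.g., to fetch the last hour of entries (times in UTC, matching the partition key format):

IEnumerable<AzureDiagnosticEntry> entries =
    GetAzureLog(DateTime.UtcNow.AddHours(-1), DateTime.UtcNow);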
This method takes advantage of the performance tip in Gaurav's answer to filter on PartitionKey rather than Timestamp.
If you wanted to filter the results by more than just date, you could filter the returned IEnumerable, but you'd probably get better performance by filtering the IQueryable. You could add a filter parameter to your method and apply it within the IQueryable's Where(). E.g.,
public IEnumerable<AzureDiagnosticEntry> GetAzureLog(
    DateTime start, DateTime end, Expression<Func<AzureDiagnosticEntry, bool>> filter)
{
    ...
    IQueryable<AzureDiagnosticEntry> query = context.Logs.Where(
            e => e.PartitionKey.CompareTo(startTicks) > 0 &&
                 e.PartitionKey.CompareTo(endTicks) < 0)
        .Where(filter); // an Expression<> so it can be translated into the table query
    ...
}
In the end, I actually further abstracted most of these classes into base classes in order to reuse the functionality for querying other tables, such as the one storing the Windows Event Log.

Does Fluent NHibernate lazy-load an IList from Criteria?

I want to make a kind of "news feed" in my company's application.
In this scenario, user actions generate "Activity" records of different kinds, which other users will see in their "news feed".
However, an "Activity" is not relevant to all users, and to determine the relation we have a complex piece of code.
Here is my Activity class
public class Activity : IActivity
{
    public virtual int Id { get; set; }
    public virtual ActivityType Type { get; set; }
    public virtual User User { get; set; }

    public virtual bool IsVisibleToUser(User userLook)
    {
        // Complex business calculation etc.
        return true;
    }
}
I want to get the latest 10 items that are visible to the user. But since the Activity table will be quite huge and performance is an issue, I want to follow best practice here.
What I am about to do is get the last 25 Activities and check whether that fills the list to show to the user. For example, if only 5 Activities are visible to the user, I will get another 25 Activities, and so on.
IList<Activity> resultList = session.CreateCriteria(typeof(Activity))
    .SetMaxResults(25)
    .AddOrder(Order.Desc("Id"))
    .List<Activity>();
What I want to know is: if I get the whole list ordered by Id and check one by one whether each item is visible to the user, would NHibernate load only the objects I actually use?
IList<Activity> resultList = session.CreateCriteria(typeof(Activity))
    .AddOrder(Order.Desc("Id"))
    .List<Activity>();

int count = 0;
foreach (Activity act in resultList)
{
    if (act.IsVisibleToUser(CurrentUser))
    {
        count++;
        // Do something with act
        if (count == 10)
            break;
    }
}
EDIT:
Here is the ActivityMap for the Activity model.
public class ActivityMap : ClassMap<Activity>
{
    public ActivityMap()
    {
        Id(x => x.Id);
        Map(x => x.Type).CustomType(typeof(Int32));
        References(x => x.User).Nullable();
    }
}
If your question is about what the generated SQL would look like, my guess would be:

SELECT
    this_.Id as Id0_0_,
    this_.ActivityType as ActivityType0_0_,
    -- Other fields
FROM dbo.Activity this_
WHERE
    -- condition
ORDER BY
    -- condition
Since you have mentioned that the Activity count is huge, you can make use of ICriteria's SetFirstResult and SetMaxResults.
SetFirstResult(int) indicates the index of the first item you wish to obtain, and SetMaxResults(int) indicates the number of rows you wish to get, 25 in your case.
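A sketch of how the batched approach could look with SetFirstResult/SetMaxResults (assuming a session and a currentUser are available in scope):

// Sketch: pages through Activities 25 at a time, newest first, until 10
// visible ones are collected or the table is exhausted.
var visible = new List<Activity>();
int first = 0;
const int pageSize = 25;
while (visible.Count < 10)
{
    IList<Activity> batch = session.CreateCriteria(typeof(Activity))
        .AddOrder(Order.Desc("Id"))
        .SetFirstResult(first)
        .SetMaxResults(pageSize)
        .List<Activity>();
    if (batch.Count == 0)
        break; // no more rows
    foreach (Activity act in batch)
    {
        if (act.IsVisibleToUser(currentUser))
        {
            visible.Add(act);
            if (visible.Count == 10)
                break;
        }
    }
    first += pageSize;
}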
Calling List() would load all the matching records into memory at once.
[UPDATE] If you need the records to be returned one by one, make use of Enumerable() -
If you expect your query to return a very large number of objects, but you don't expect to use them all, you might get better performance from the Enumerable() methods, which return a System.Collections.IEnumerable. The iterator will load objects on demand, using the identifiers returned by an initial SQL query (n+1 selects total).
Source - Link
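A sketch of that approach; note that Enumerable() lives on IQuery (e.g. an HQL query), not on ICriteria:

// Sketch: streams Activities on demand instead of materializing the whole list.
IEnumerable<Activity> acts = session
    .CreateQuery("from Activity a order by a.Id desc")
    .Enumerable<Activity>();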
No, the List() method pulls everything into memory at once.