Implementations of GRGPF Clustering algorithm - hierarchical-clustering

I am looking for source that implements the Clustering algorithm found in Ganti et.al. "Clustering Large Datasets in Arbitrary Metric Spaces." IN particular, I have a large data problem that I seek to cluster (so it is a one pass cluster problem) and I do not have binary operators on the space (so finding "average" elements between elements is not an option).
I am agnostic to language (although I would prefer a simple I/O mechanism).
Any thoughts?

Related

Redis Hyperloglog limitations

I am trying to solve a problem in a hacky way using Redis Hyperloglog but what I am trying to understand is the limitations and assumptions by Hyperloglog on the data or the distribution.
The count-min and bloom filter have their own set of limitations but google isn't being helpful in providing much info on applications and limitations of Hyperloglog.
I am using Redis Hyperloglog and as Antirez describes there are no practical limits to the cardinality of the sets we can count. But from a theory perspective, does Hyperloglog make any assumptions/constraints about the data or the distribution?
The HyperLogLog algorithm assumes that a strong universal hash function is used. Redis uses MurmurHash64A which should be good enough from a practical point of view. Redis HyperLogLog implementation uses 6 bits per registers which allows to represent any bit run-lengths within 64bit hash values. Hence, the only limitation I see is the 64bit hash value itself. If the cardinality is in the order of 2^64, there will be many hash collisions that finally would lead to large estimation errors. However, cardinalities of this order of magnitude never occur in practice.

Titan vertex centric indices vs Neo4j labels

I was trying to make a comparison between these two technologies when approaching this and I was wondering if any of you already have some experience dealing with any or both of them?
I am mainly interested in performance numbers when dealing with similar use cases.
The difference between the two concepts is the difference between global and local indexing.
As I understand it, Neo4j vertex labels allow you to break up your index space by "categories" of vertices. In this way, a O(log(|V|)) lookup is now an O(log(|V|/c)), where c is the number of categories/labels you have over your vertex set and (the equation) assumes an equal number of vertices in each category. As such, vertex label aid in global index calls as this is a function of V.
Next, Titan's vertex-centric indices sort and index the incident edges of a vertex. The cost to find a particular edge by its label/properties incident to a vertex is O(log(inc(v))), where inc(v) is the size of the incident edge set to vertex v. As such, vertex-centric indices are local indices as this is a function of v.
As I understand it, Neo4j does not support vertex-centric indices. You see this concept currently in Titan, OrientDB, and TinkerGraph (…and RDF stores sort in this manner as well -- via spog pairings). Next, all known graph databases support global indices though, (I believe only Neo4j and OrientDB), support a vertex set partition via the concept of a label.
Again, assuming my assumptions are correct about the use of vertex labels in Neo4j, we are talking about two different use cases — global vs. local indexing. From the perspective of the supernode problem, global indices do not quell the issue of traversing through a large vertex, while this is the sole purpose of the local vertex-centric indices.
You can read about the supernode problem and vertex-centric indices here:
http://thinkaurelius.com/2012/10/25/a-solution-to-the-supernode-problem/
Agreeing with everything Marko said, one could take it further and argue that in the graph database world local indexes can (and even should) substitute global ones. In my opinion, the single greatest advantage of a graph data model is that it lets you encode your data model into the graph topology, gaining qualitative advantages in terms of flexibility, ease of evolution and performance. With this in mind, I'd argue that labels in Neo4j actually detract from all this; reifying a label into a node with adjacent edges pointing to the source having that label is much more in line with the "schema is the graph" philosophy.
Of course, if your engine lacks local indexes we are back at the supernode problem. But if you do have them (something which I'd say should be a requirement for something to be called a graph database), you can easily transform your label into a node L, and create relationships pointing to that node for those vertices which you want labeled with L
v -[L]-> L
meaning that v has label L. Now if you want this in Titan to behave like a Neo4j label, just make the -[L]-> relation to be "manyToOne" (see Titan cardinality constraints) and create a vertex-centric index. This pattern lets you get everything that you could with labels and much more; you can
effectively use this as a namespace for properties relating to that label
sort your elements inside one label
nest labels easily without losing performance (just use a composite key)
separate the declaration of a label L with how elements labeled with it are accessed
Labels may afford some design patterns that improve performance by de-densifying the graph. For example: they eliminate the need for type nodes, which can often get quite dense. Labels can optionally be associated with a unique index. Here, the ability to index a property isn't new, but the ability to constrain it uniquely is. If you were previously doing work in your application, you may experience some performance gains by letting the database handle this. (It's certainly much more convenient to do so.) Finally, if you don't assign a unique index to a label, it will still be indexed, in order to help performance for certain kinds of queries (e.g. "give me all of the nodes having label ")
All that said, while labels may help with performance in certain cases, they were introduced more with ease-of-use in mind. We're just getting started with Neo4j 2.1, which specifically addresses dense node performance (something I know you've been waiting for), along with other performance & scalability improvements... including removing (for all practical purposes eliminating) the upper size limits.
Philip

Clustering: Cluster validation

I want to use some clustering method for large social network dataset. The problem is how to evaluate the clustering method. yes, I can use some external ,internal and relative cluster validation methods. I used Normalized mutual information(NMI) as external validation method for cluster validation based on synthetic data. I produced some synthetic dataset by producing 5 clusters with equal number of nodes and some strongly connected links inside each cluster and weak links between clusters to check the clustering method, Then I analysed the spectral clustering and modularity based community detection methods on this synthetic datasets. I use the clustering with the best NMI for my real world dataset and check the error(cost function) of my algorithm and the result was good. Is my testing method for my cost function is good? or I should also validate clusters of my real word clusters again?
Thanks.
Try more than one measure.
There are a dozen cluster validation measures, and it's hard to predict which one is most appropriate for a problem. The differences between them are not really understood yet, so it's best if you consult more than one.
Also note that if you don't use a normalized measure, the baseline may be really high. So the measures are mostly useful to say "result A is more similar to result B than result C", but should not be taken as an absolute measure of quality. They are a relative measure of similarity.

Fast in-memory inverted index

I am looking for a fast in-memory implementation of a generic inverted index. All I need is to store features with weights for a couple million entities and use the inverted index to compute similarities between entities using various distance functions.
All other attributes of entities I can store in some fast key-value store.
I hoped I could use Lucene just as an inverted index, but cannot see how I can associate with a document my own custom feature vector with precomputed weights. Any recommendations would be much appreciated!
Thank you.
I have been doing some similar work and have discovered that redis' zset is pretty much what I need (though I am not actually using it right now; I have rolled my own solution based on memory mapped files).
Basically a zset is a sorted set of key-value pairs.
So you can have a sorted set per feature where each
feature->[ { docid, score }, {docid, score} ..]
i.e.
zadd feature score docid
redis then has some nice operators to merge, extract ranges etc. See zunionstore, zrange (http://redis.io/commands/zunionstore).
Very fast (supposedly) and all in memory etc ... (though redis is not an embedded db).
Have you looked at Terrier? I'm not quite sure it has in-memory indexes, but it is far more extensible regarding indexing and scoring than Lucene.
Lucene lets you store pretty much any data associated with a document. It also has a feature called "payloads" that allow you to store arbitrary data in the index associated with a term in a document. So I think what you want is to store your "features" as terms in the index, and the weights as payloads, and you should be able to make Lucene do what you want. It does have an in-memory index implementation.
If the pairs of entities you want to compare are already given in advance, and you are interested in the pair-wise scores, I don't think Lucene will give you any advantage. Just lookup the vectors in some key-value store and compute the similarity. Consider using a sparse vector representation for space and time efficiency.
If only one entity is given in advance, and you are more interested in a ranking like scenario, Lucene may be worth a try.
The right place to look at would be
org.apache.lucene.search.Similarity
you should be able to adapt it to your needs and set your version as default with
setDefault(Similarity similarity)
I would be careful with expectations for speed gains (w.r.t. iterating through all) however, as they largely depend on the sparsity (of the query) and the scoring function you choose to implement. Also note that Lucene uses a two-stage retrieval scheme, first boolean ("all of the AND terms contained? any of the OR terms?") then scoring what passes. While for tf.idf you lose nothing on the way for other scoring functions you might.
For more general approaches for efficient approximate nearest neighbor search it might be worthwhile to look into LSH:
http://en.wikipedia.org/wiki/Locality-sensitive_hashing

Difference between Gene Expression Programming and Cartesian Genetic Programming

Something pretty annoying in evolutionary computing is that mildly different and overlapping concepts tend to pick dramatically different names. My latest confusion because of this is that gene-expression-programming seems very similar to cartesian-genetic-programming.
(how) Are these fundamentally different concepts?
I've read that indirect encoding of GP instructions is an effective technique ( both GEP and CGP do that ). Has there been reached some sort of consensus that indirect encoding has outdated classic tree bases GP?
Well, it seems that there is some difference between gene expression programming (GEP) and cartesian genetic programming (CGP or what I view as classic genetic programming), but the difference might be more hyped up than it really ought to be. Please note that I have never used GEP, so all of my comments are based on my experience with CGP.
In CGP there is no distinction between genotype and a phenotype, in other words- if you're looking at the "genes" of a CGP you're also looking at their expression. There is no encoding here, i.e. the expression tree is the gene itself.
In GEP the genotype is expressed into a phenotype, so if you're looking at the genes you will not readily know what the expression is going to look like. The "inventor" of GP, Cândida Ferreira, has written a really good paper and there are some other resources which try to give a shorter overview of the whole concept.
Ferriera says that the benefits are "obvious," but I really don't see anything that would necessarily make GEP better than CGP. Apparently GEP is multigenic, which means that multiple genes are involved in the expression of a trait (i.e. an expression tree). In any case, the fitness is calculated on the expressed tree, so it doesn't seem like GEP is doing anything to increase the fitness. What the author claims is that GEP increases the speed at which the fitness is reached (i.e. in fewer generations), but frankly speaking you can see dramatic performance shifts from a CGP just by having a different selection algorithm, a different tournament structure, splitting the population into tribes, migrating specimens between tribes, including diversity into the fitness, etc.
Selection:
random
roulette wheel
top-n
take half
etc.
Tournament Frequency:
once per epoch
once per every data instance
once per generation.
Tournament Structure:
Take 3, kill 1 and replace it with the child of the other two.
Sort all individuals in the tournament by fitness, kill the lower half and replace it with the offspring of the upper half (where lower is worse fitness and upper is better fitness).
Randomly pick individuals from the tournament to mate and kill the excess individuals.
Tribes
A population can be split into tribes that evolve independently of each-other:
Migration- periodically, individual(s) from a tribe would be moved to another tribe
The tribes are logically separated so that they're like their own separate populations running in separate environments.
Diversity Fitness
Incorporate diversity into the fitness, where you count how many individuals have the same fitness value (thus are likely to have the same phenotype) and you penalize their fitness by a proportionate value: the more individuals with the same fitness value, the more penalty for those individuals. This way specimens with unique phenotypes will be encouraged, therefore there will be much less stagnation of the population.
Those are just some of the things that can greatly affect the performance of a CGP, and when I say greatly I mean that it's in the same order or greater than Ferriera's performance. So if Ferriera didn't tinker with those ideas too much, then she could have seen much slower performance of the CGPs... especially if she didn't do anything to combat stagnation. So I would be careful when reading performance statistics on GEP, because sometimes people fail to account for all of the "optimizations" available out there.
There seems to be some confusion in these answers that must be clarified. Cartesian GP is different from classic GP (aka tree-based GP), and GEP. Even though they share many concepts and take inspiration from the same biological mechanisms, the representation of the individuals (the solutions) varies.
In CGPthe representation (mapping between genotype and phenotype) is indirect, in other words, not all of the genes in a CGP genome will be expressed in the phenome (a concept also found in GEP and many others). The genotypes can be coded in a grid or array of nodes, and the resulting program graph is the expression of active nodes only.
In GEP the representation is also indirect, and similarly not all genes will be expressed in the phenotype. The representation in this case is much different from treeGP or CGP, but the genotypes are also expressed into a program tree. In my opinion GEP is a more elegant representation, easier to implement, but also suffers from some defects like: you have to find the appropriate tail and head size which is problem specific, the mnltigenic version is a bit of a forced glue between expression trees, and finally it has too much bloat.
Independently of which representation may be better than the other in some specific problem domain, they are general purpose, can be applied to any domain as long as you can encode it.
In general, GEP is simpler from GP. Let's say you allow the following nodes in your program: constants, variables, +, -, *, /, if, ...
For each of such nodes with GP you must create the following operations:
- randomize
- mutate
- crossover
- and probably other genetic operators as well
In GEP for each of such nodes only one operation is needed to be implemented: deserialize, which takes array of numbers (like double in C or Java), and returns the node. It resembles object deserialization in languages like Java or Python (the difference is that deserialization in programming languages uses byte arrays, where here we have arrays of numbers). Even this 'deserialize' operation doesn't have to be implemented by the programmer: it can be implemented by a generic algorithm, just like it's done in Java or Python deserialization.
This simplicity from one point of view may make searching of best solution less successful, but from other side: requires less work from programmer and simpler algorithms may execute faster (easier to optimize, more code and data fits in CPU cache, and so on). So I would say that GEP is slightly better, but of course the definite answer depends on problem, and for many problems the opposite may be true.