R-Tree and Quadtree Comparison - indexing

I want to compare the R-Tree and the Quadtree for geospatial data. While there is literature out there, I struggle to find documents that cover a really basic comparison, so I decided to ask this question.
In my opinion, the R-Tree has the advantage of being balanced and the tree has no empty leaves.
As a disadvantage, basic operations like insert or delete could result in restructuring the whole index.
The Quadtree is the opposite: it is not balanced and has empty leaves, but it does not need to be restructured.
So as a conclusion I would say that the R-Tree needs less memory and is faster for searching because of its minimal height.
The Quadtree is better when there are many update operations, but the resulting tree can be unbalanced.
Are these points correct in your opinion?
Are there any good documents out there that cover this topic?
Goodbye, Andre

Here's a paper which has a pretty nice comparison of Quadtrees and R-Trees:
Quadtree and R-tree Indexes in Oracle Spatial: A Comparison using GIS Data
Some differences:
Quadtrees require fine-tuning by choosing an appropriate tiling level in order to optimize performance. No specific tuning is required for R-Trees.
A Quadtree can be implemented on top of an existing B-tree; an R-Tree cannot.
Quadtree indexes are created faster than R-tree indexes.
R-trees are much faster than Quadtrees for nearest-neighbour queries.
R-trees are substantially faster than Quadtrees for window queries, like "inside", "contains", "covers", etc.

"restructuring the whole index". No. Restructuring the R-tree is restricted to a single path, not the "whole" index.
It works similarly to the B-tree, actually.
Consider implementing both and doing some benchmarks yourself to really know how they perform; don't rely on theory alone.
On uniformly distributed data with a high change frequency, quadtrees will usually work better. On disk, the R-tree has clear advantages.
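To make the structural difference concrete, here is a minimal point-quadtree insert sketch (my own illustrative code; the bucket capacity of 4 and the class layout are arbitrary choices). A node covers a fixed square and subdivides into four children when its bucket overflows, so the tree's depth follows the data distribution and is never rebalanced, unlike the R-tree's B-tree-style splits along a single path:

// Minimal point quadtree: leaves hold up to CAPACITY points, then subdivide.
class QuadTree {
    private static final int CAPACITY = 4;             // points per leaf before splitting
    private final double cx, cy, half;                  // square region: centre and half-width
    private final java.util.List<double[]> points = new java.util.ArrayList<>();
    private QuadTree nw, ne, sw, se;                     // children stay null until a split

    QuadTree(double cx, double cy, double half) { this.cx = cx; this.cy = cy; this.half = half; }

    boolean insert(double x, double y) {
        if (Math.abs(x - cx) > half || Math.abs(y - cy) > half) return false;  // outside this region
        if (nw == null && points.size() < CAPACITY) { points.add(new double[]{x, y}); return true; }
        if (nw == null) split();
        return nw.insert(x, y) || ne.insert(x, y) || sw.insert(x, y) || se.insert(x, y);
    }

    private void split() {                               // create four equal quadrants and push stored points down
        double h = half / 2;
        nw = new QuadTree(cx - h, cy + h, h); ne = new QuadTree(cx + h, cy + h, h);
        sw = new QuadTree(cx - h, cy - h, h); se = new QuadTree(cx + h, cy - h, h);
        for (double[] p : points) insert(p[0], p[1]);
        points.clear();
    }
}

Deletion is just removing a point from its leaf; no rebalancing happens, which is why heavy updates are cheap but skewed data can make the tree deep.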

Finding the nearest location to a test point

I have 2000+ sets of geographical coordinates (lat, long). Given one coordinate, I want to find the closest one from that set. My approach was to measure the distance, but hundreds of requests per second can be a little rough on the server doing all that math.
What is the best-optimized solution for this?
The problem you’re describing here is called a nearest neighbor search and there are lots of good data structures that support fast nearest neighbor lookups. The k-d tree is a particularly simple and fast choice and there are many good libraries out there that you can use. You can also look into alternatives like vantage-point trees or quadtrees if you’d like.
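If you'd rather roll your own than pull in a library, a 2-d k-d tree is short. Here is a hedged sketch (my own illustrative code, not any particular library); note it uses plain Euclidean distance in degrees, which is usually good enough for "closest of a few thousand points" but ignores the Earth's curvature, so swap in a haversine distance if you need geographic accuracy:

import java.util.Arrays;
import java.util.Comparator;

// Minimal 2-d k-d tree for nearest-neighbour lookups over (lat, lon) points.
public class KdTree {
    private static final class Node {
        final double[] point;            // {lat, lon}
        Node left, right;
        Node(double[] p) { point = p; }
    }

    private Node root;
    private double[] best;               // search state; not thread-safe, fine for a sketch
    private double bestDist;

    // Bulk-build by median split so the tree stays balanced.
    public KdTree(double[][] points) {
        root = build(points, 0, points.length, 0);
    }

    private Node build(double[][] pts, int from, int to, int depth) {
        if (from >= to) return null;
        final int axis = depth % 2;
        Arrays.sort(pts, from, to, Comparator.comparingDouble((double[] p) -> p[axis]));
        int mid = (from + to) / 2;
        Node n = new Node(pts[mid]);
        n.left = build(pts, from, mid, depth + 1);
        n.right = build(pts, mid + 1, to, depth + 1);
        return n;
    }

    // Returns the stored point closest to the query (squared Euclidean distance).
    public double[] nearest(double[] query) {
        best = null;
        bestDist = Double.POSITIVE_INFINITY;
        search(root, query, 0);
        return best;
    }

    private void search(Node n, double[] q, int depth) {
        if (n == null) return;
        double d = sq(n.point[0] - q[0]) + sq(n.point[1] - q[1]);
        if (d < bestDist) { bestDist = d; best = n.point; }
        int axis = depth % 2;
        double diff = q[axis] - n.point[axis];
        search(diff < 0 ? n.left : n.right, q, depth + 1);     // near side first
        if (diff * diff < bestDist)                             // prune the far side if possible
            search(diff < 0 ? n.right : n.left, q, depth + 1);
    }

    private static double sq(double x) { return x * x; }

    public static void main(String[] args) {
        double[][] pts = { {52.52, 13.40}, {48.14, 11.58}, {50.11, 8.68} };
        System.out.println(Arrays.toString(new KdTree(pts).nearest(new double[]{49.0, 9.0})));
    }
}

For only 2000 points this answers queries in a handful of comparisons, so the per-request math stops being a bottleneck.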
Hope this helps!

What are the advantages of knowing that a corpus of text follows Zipf's law?

I have the frequency count of all the words from a file (that I am using to analyze and index data in Elasticsearch), and the frequency of words follows Zipf's law. How can I use this knowledge to improve my search over it? Rather, how can I use it to get anything done to my benefit?
I think this is a very interesting question, and I'm sad that it's gone without answer or comment for so long. The Zipfian distribution is a phenomenon that occurs not only in language, but far beyond that.
Zipf and Pareto
The Zipfian distribution, or Zipf's law, is a rank-frequency distribution, of words in this case. But perhaps more importantly, the related Pareto distribution implies that approximately 20% of words (the cause) account for roughly 80% of word occurrences (the outcome) in any given body, or bodies, of text. Lucene, the brain behind Elasticsearch, accounts for this in multiple ways, often going beyond what Zipf's law alone would suggest. It's common that your results will contain a Zipfian distribution.
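As a quick illustration (a self-contained sketch, not Elasticsearch code), you can check how Zipfian your own corpus is by ranking words by frequency; under Zipf's law, rank times frequency stays roughly constant:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Quick Zipf check: rank words by frequency and print rank * frequency.
public class ZipfCheck {
    public static void main(String[] args) {
        String text = "the cat sat on the mat and the dog saw the cat";   // replace with your corpus
        Map<String, Long> freq = new HashMap<>();
        for (String w : text.toLowerCase().split("\\W+"))
            freq.merge(w, 1L, Long::sum);

        List<Map.Entry<String, Long>> ranked = new ArrayList<>(freq.entrySet());
        ranked.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));  // most frequent first

        for (int rank = 1; rank <= Math.min(20, ranked.size()); rank++) {
            Map.Entry<String, Long> e = ranked.get(rank - 1);
            // Under Zipf's law the last column should be roughly flat across ranks.
            System.out.printf("%2d  %-10s freq=%d  rank*freq=%d%n",
                    rank, e.getKey(), e.getValue(), rank * e.getValue());
        }
    }
}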
Word frequency, least is best (usually)
One of the problems here is that in most bodies of text the most common words actually bear the least context, usually being articles or words with very limited context. The top 3 most common words in English are "the", "of", and "to". Elasticsearch actually comes with a list of stop words, which optimizes indexing by ignoring such words.
Elasticsearch stop words:
a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no,
not, of, on, or, such, that, the, their, then, there, these, they,
this, to, was, will, with
It's actually a common occurrence that the words that appear least frequently bear the most context, so you're likely going to look for the least frequent words when doing text search.
80:20 phenomenon
The thing is, Elasticsearch and Lucene are both built with these things in mind and are well optimised for them. A simple LRU eviction policy for caching indices actually works very well, as 80% of your searches will likely use 20% of your actual indices, making cache pollution both infrequent and low-impact thanks to a predictable workload. So if you allocate a cache size larger than 20% of your total index size you should be fine. In the event that the index is not in cache it will be read off the disk (usually via mmap), and you can optimize performance by having a drive with fast random reads (like an SSD).
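For reference, an LRU cache of the kind described here is almost trivial to express in plain Java via LinkedHashMap's access order (a generic sketch, not Elasticsearch's actual cache implementation):

import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU cache: evicts the least-recently-accessed entry once capacity is exceeded.
class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    LruCache(int maxEntries) {
        super(16, 0.75f, true);          // true = order entries by access, not insertion
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;      // drop the coldest entry when over capacity
    }
}

With an 80:20 access pattern, sizing maxEntries at roughly 20-25% of what you would need to hold everything is the practical takeaway of the argument above.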
More Reading
There is an interesting article on this. It's likely that the total word rank in your data set looks very similar to the word rank of most other data sets. So optimizing performance as well as relevance comes down to those few words which are likely to occur least often but are likely to be searched for the most. This may be the jargon specific to the demographic or profession your application is targeting.
Conclusion
These optimizations, however, could be premature. As stated, Lucene and Elasticsearch both do their part to increase the effectiveness and efficiency of search with these principles in mind, and a simple LRU cache works very well here; LRU is both common (already part of ES) and relatively simple. Cases where further tuning might be worthwhile are usually those where you have a lot of jargon or domain-specific language, or perhaps multilingual content. For something like a news site you'll likely want a broader solution, as you cover a huge spectrum of topics which include many different words and subjects. These are usually things to consider when you're configuring Elasticsearch, but tinkering with the analyzer can be complicated and hard to do effectively, especially if you have a large range of subjects with different terminologies to index; however, this is likely to have the largest effect on increasing search relevance.

Fast in-memory inverted index

I am looking for a fast in-memory implementation of a generic inverted index. All I need is to store features with weights for a couple million entities and use the inverted index to compute similarities between entities using various distance functions.
All other attributes of entities I can store in some fast key-value store.
I hoped I could use Lucene just as an inverted index, but I cannot see how to associate my own custom feature vector with precomputed weights with a document. Any recommendations would be much appreciated!
Thank you.
I have been doing some similar work and have discovered that redis' zset is pretty much what I need (though I am not actually using it right now; I have rolled my own solution based on memory mapped files).
Basically, a zset is a sorted set of member-score pairs.
So you can have a sorted set per feature, where each feature maps to its postings:
feature -> [ { docid, score }, { docid, score }, ... ]
i.e.
zadd feature score docid
redis then has some nice operators to merge, extract ranges, etc. See zunionstore and zrange (http://redis.io/commands/zunionstore).
Very fast (supposedly) and all in memory etc ... (though redis is not an embedded db).
Have you looked at Terrier? I'm not quite sure it has in-memory indexes, but it is far more extensible regarding indexing and scoring than Lucene.
Lucene lets you store pretty much any data associated with a document. It also has a feature called "payloads" that allows you to store arbitrary data in the index associated with a term in a document. So I think what you want is to store your "features" as terms in the index and the weights as payloads, and you should be able to make Lucene do what you want. It does have an in-memory index implementation.
If the pairs of entities you want to compare are already given in advance, and you are interested in the pair-wise scores, I don't think Lucene will give you any advantage. Just lookup the vectors in some key-value store and compute the similarity. Consider using a sparse vector representation for space and time efficiency.
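If you do go the roll-your-own route for the pairwise case, the core of it is small: keep each entity's features as a sparse map and, for scoring, iterate over the smaller map only. A hedged sketch of cosine similarity over sparse vectors (my own illustrative code; the feature names are made up):

import java.util.HashMap;
import java.util.Map;

// Sparse feature vectors with cosine similarity; iterate over the smaller map only.
public class SparseVectors {

    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        if (a.size() > b.size()) { Map<String, Double> t = a; a = b; b = t; }  // make a the smaller one
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) dot += e.getValue() * w;
        }
        return dot == 0.0 ? 0.0 : dot / (norm(a) * norm(b));
    }

    static double norm(Map<String, Double> v) {
        double s = 0.0;
        for (double w : v.values()) s += w * w;
        return Math.sqrt(s);
    }

    public static void main(String[] args) {
        Map<String, Double> e1 = new HashMap<>(), e2 = new HashMap<>();
        e1.put("colour:blue", 0.8); e1.put("size:large", 0.3);
        e2.put("colour:blue", 0.5); e2.put("speed:fast", 0.9);
        System.out.println(cosine(e1, e2));   // overlap only on colour:blue
    }
}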
If only one entity is given in advance, and you are more interested in a ranking like scenario, Lucene may be worth a try.
The right place to look would be
org.apache.lucene.search.Similarity
You should be able to adapt it to your needs and set your version as the default with
setDefault(Similarity similarity)
I would be careful with expectations of speed gains (compared with iterating through everything), however, as they largely depend on the sparsity (of the query) and the scoring function you choose to implement. Also note that Lucene uses a two-stage retrieval scheme: first boolean ("are all of the AND terms contained? any of the OR terms?"), then scoring what passes. While for tf.idf you lose nothing along the way, for other scoring functions you might.
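As a rough sketch of what "adapt it to your needs" could look like, written against the older Lucene 3.x API that setDefault(Similarity) implies (the Similarity classes were moved and reshaped considerably in later Lucene versions, so treat this purely as an illustration, not as the current API):

import org.apache.lucene.search.DefaultSimilarity;

// Sketch only, against the old Lucene 3.x Similarity API implied by setDefault(Similarity).
public class RawWeightSimilarity extends DefaultSimilarity {

    @Override
    public float tf(float freq) {
        return freq;              // use the stored term frequency (your precomputed weight) as-is
    }

    @Override
    public float idf(int docFreq, int numDocs) {
        return 1.0f;              // neutralise IDF so only your own weights drive the score
    }
}

// at startup:  Similarity.setDefault(new RawWeightSimilarity());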
For more general approaches for efficient approximate nearest neighbor search it might be worthwhile to look into LSH:
http://en.wikipedia.org/wiki/Locality-sensitive_hashing
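For the curious, the random-hyperplane flavour of LSH for cosine similarity fits in a few lines (a toy sketch of the idea, not a production implementation): each entity gets a short bit signature, and entities whose signatures land in the same bucket become candidate neighbours.

import java.util.Random;

// Toy random-hyperplane LSH: similar vectors get similar bit signatures.
public class CosineLsh {
    private final double[][] hyperplanes;   // one random hyperplane per signature bit

    public CosineLsh(int bits, int dims, long seed) {
        Random rnd = new Random(seed);
        hyperplanes = new double[bits][dims];
        for (double[] h : hyperplanes)
            for (int d = 0; d < h.length; d++) h[d] = rnd.nextGaussian();
    }

    // Each bit records which side of a random hyperplane the vector falls on.
    public int signature(double[] v) {
        int sig = 0;
        for (int b = 0; b < hyperplanes.length; b++) {
            double dot = 0.0;
            for (int d = 0; d < v.length; d++) dot += hyperplanes[b][d] * v[d];
            if (dot >= 0) sig |= 1 << b;
        }
        return sig;
    }

    public static void main(String[] args) {
        CosineLsh lsh = new CosineLsh(16, 3, 42);
        System.out.println(Integer.toBinaryString(lsh.signature(new double[]{1, 0.9, 0.1})));
        System.out.println(Integer.toBinaryString(lsh.signature(new double[]{1, 1.0, 0.0})));  // similar vectors agree on most bits
    }
}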

How does Lucene/Solr achieve high performance in multi-field / faceted search?

Context
This is a question mainly about Lucene (or possibly Solr) internals. The main topic is faceted search, in which search can happen along multiple independent dimensions (facets) of objects (for example size, speed, price of a car).
When implemented with a relational database, multi-column indices are not useful for a large number of facets, since facets can be queried in any order: any specific ordered multi-column index has a low chance of being used, and creating indices for all possible orderings is infeasible.
Solr is advertised to cope well with faceted search, which, if I understand correctly, has to be connected with Lucene (supposedly) performing well on multi-field queries (where the fields of a document correspond to the facets of an object).
Question
The inverted index of Lucene can be stored in a relational database, and naturally taking the intersections of the matching documents can also be trivially achieved with an RDBMS using single-field indices.
Therefore, Lucene supposedly has some advanced technique for multi-field queries other than just taking the intersection of matching documents based on the inverted index.
So the question is, what is this technique/trick? More broadly: Why can Lucene/Solr achieve better faceted search performance theoretically than RDBMS could (if so)?
Note: My first guess would be that Lucene would use some space-partitioning method for partitioning a vector space built from the document fields as dimensions, but as I understand it, Lucene is not purely vector-space based.
Faceting
There are two answers for faceting, because there are two types of faceting. I'm not certain that either of these are faster than an RDBMS.
Enum faceting. The results of a query are a bit vector where the ith bit is 1 if the ith document was a match. The facet is also a bit vector, so the intersection is just a bitwise AND (see the BitSet sketch below). I don't think this is a novel approach, and most RDBMSs probably support it.
Field Cache. This is just a normal (non-inverted) index. The SQL-style query that is run here is like:
select facet, count(*) from field_cache
where docId in query_results
group by facet
Again, I don't think this is anything that a normal RDBMS couldn't do. The index is a skip list, with the docId as the key.
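To make the enum-faceting point above concrete, the bit-vector intersection really is a few lines with java.util.BitSet (an illustration of the idea, not Lucene's actual code):

import java.util.BitSet;

// Enum-style facet counting: one bit vector per facet value, intersected with the query results.
public class EnumFacetSketch {
    public static void main(String[] args) {
        BitSet queryMatches = new BitSet();       // bit i set => document i matched the query
        queryMatches.set(1); queryMatches.set(3); queryMatches.set(7);

        BitSet facetRed = new BitSet();           // bit i set => document i has facet value "red"
        facetRed.set(3); facetRed.set(4); facetRed.set(7);

        BitSet hits = (BitSet) queryMatches.clone();
        hits.and(facetRed);                       // bitwise AND = documents matching query AND facet
        System.out.println("count(red) = " + hits.cardinality());   // 2
    }
}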
Multi-term search
This is where Lucene shines. Why Lucene's approach is so good is too long to post here, but I can recommend this post on Lucene Performance, or the papers linked therein.
An explanatory post can be found at: http://yonik.wordpress.com/2008/11/25/solr-faceted-search-performance-improvements/
The new method works by un-inverting the indexed field to be faceted, allowing quick lookup of the terms in the field for any given document. It’s actually a hybrid approach – to save memory and increase speed, terms that appear in many documents (over 5%) are not un-inverted, instead the traditional set intersection logic is used to get the counts.
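In sketch form, the un-inverted counting described in that quote reduces to one array lookup per matching document (illustrative code only; the hybrid path for very common terms is omitted):

import java.util.BitSet;

// Facet counting over an un-inverted field: docId -> ordinal of its facet value.
public class UninvertedFacetSketch {

    static int[] countFacets(BitSet queryMatches, int[] docToOrd, int numFacetValues) {
        int[] counts = new int[numFacetValues];
        // One array lookup per matching document; no per-document index seek needed.
        for (int doc = queryMatches.nextSetBit(0); doc >= 0; doc = queryMatches.nextSetBit(doc + 1))
            counts[docToOrd[doc]]++;
        return counts;
    }

    public static void main(String[] args) {
        int[] docToOrd = {0, 2, 1, 2, 0};         // facet value ordinal for docs 0..4
        BitSet matches = new BitSet();
        matches.set(1); matches.set(3); matches.set(4);
        System.out.println(java.util.Arrays.toString(countFacets(matches, docToOrd, 3)));   // [1, 0, 2]
    }
}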

Difference between Gene Expression Programming and Cartesian Genetic Programming

Something pretty annoying in evolutionary computing is that mildly different and overlapping concepts tend to pick dramatically different names. My latest confusion because of this is that gene-expression-programming seems very similar to cartesian-genetic-programming.
(how) Are these fundamentally different concepts?
I've read that indirect encoding of GP instructions is an effective technique (both GEP and CGP do that). Has some sort of consensus been reached that indirect encoding has made classic tree-based GP obsolete?
Well, it seems that there is some difference between gene expression programming (GEP) and cartesian genetic programming (CGP or what I view as classic genetic programming), but the difference might be more hyped up than it really ought to be. Please note that I have never used GEP, so all of my comments are based on my experience with CGP.
In CGP there is no distinction between a genotype and a phenotype; in other words, if you're looking at the "genes" of a CGP you're also looking at their expression. There is no encoding here, i.e. the expression tree is the gene itself.
In GEP the genotype is expressed into a phenotype, so if you're looking at the genes you will not readily know what the expression is going to look like. The "inventor" of GEP, Cândida Ferreira, has written a really good paper, and there are some other resources which try to give a shorter overview of the whole concept.
Ferreira says that the benefits are "obvious," but I really don't see anything that would necessarily make GEP better than CGP. Apparently GEP is multigenic, which means that multiple genes are involved in the expression of a trait (i.e. an expression tree). In any case, the fitness is calculated on the expressed tree, so it doesn't seem like GEP is doing anything to increase the fitness. What the author claims is that GEP reaches a given fitness faster (i.e. in fewer generations), but frankly speaking you can see dramatic performance shifts in CGP just by having a different selection algorithm, a different tournament structure, splitting the population into tribes, migrating specimens between tribes, including diversity in the fitness, etc.
Selection:
random
roulette wheel
top-n
take half
etc.
Tournament Frequency:
once per epoch
once per every data instance
once per generation.
Tournament Structure:
Take 3, kill 1 and replace it with the child of the other two.
Sort all individuals in the tournament by fitness, kill the lower half and replace it with the offspring of the upper half (where lower is worse fitness and upper is better fitness).
Randomly pick individuals from the tournament to mate and kill the excess individuals.
Tribes
A population can be split into tribes that evolve independently of each other:
Migration: periodically, individual(s) from one tribe are moved to another tribe.
The tribes are logically separated so that they're like their own separate populations running in separate environments.
Diversity Fitness
Incorporate diversity into the fitness, where you count how many individuals have the same fitness value (thus are likely to have the same phenotype) and you penalize their fitness by a proportionate value: the more individuals with the same fitness value, the more penalty for those individuals. This way specimens with unique phenotypes will be encouraged, therefore there will be much less stagnation of the population.
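A hedged sketch of one way that penalty could be computed (my own formulation; the exact penalty scheme and the assumption that higher fitness is better are choices, not something prescribed here):

import java.util.HashMap;
import java.util.Map;

// Penalize individuals that share a fitness value: the more duplicates, the larger the penalty.
public class DiversityPenalty {

    static double[] penalize(double[] fitness, double penaltyPerDuplicate) {
        Map<Double, Integer> occurrences = new HashMap<>();
        for (double f : fitness) occurrences.merge(f, 1, Integer::sum);

        double[] adjusted = new double[fitness.length];
        for (int i = 0; i < fitness.length; i++) {
            int same = occurrences.get(fitness[i]);
            // Individuals with a unique fitness keep it; duplicates are pushed down proportionally.
            adjusted[i] = fitness[i] - penaltyPerDuplicate * (same - 1);
        }
        return adjusted;
    }

    public static void main(String[] args) {
        double[] fitness = {0.90, 0.90, 0.90, 0.75, 0.60};   // assuming higher fitness is better
        System.out.println(java.util.Arrays.toString(penalize(fitness, 0.05)));
    }
}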
Those are just some of the things that can greatly affect the performance of a CGP, and when I say greatly I mean that the effect is of the same order or greater than Ferreira's reported gains. So if Ferreira didn't tinker with those ideas too much, then she could have seen much slower performance from the CGPs... especially if she didn't do anything to combat stagnation. So I would be careful when reading performance statistics on GEP, because sometimes people fail to account for all of the "optimizations" available out there.
There seems to be some confusion in these answers that must be clarified. Cartesian GP is different from classic GP (aka tree-based GP), and GEP. Even though they share many concepts and take inspiration from the same biological mechanisms, the representation of the individuals (the solutions) varies.
In CGP the representation (the mapping between genotype and phenotype) is indirect; in other words, not all of the genes in a CGP genome will be expressed in the phenome (a concept also found in GEP and many others). The genotypes can be coded in a grid or array of nodes, and the resulting program graph is the expression of the active nodes only.
In GEP the representation is also indirect, and similarly not all genes will be expressed in the phenotype. The representation in this case is much different from tree GP or CGP, but the genotypes are also expressed into a program tree. In my opinion GEP is a more elegant representation and easier to implement, but it also suffers from some defects: you have to find the appropriate tail and head sizes, which is problem specific; the multigenic version is a bit of a forced glue between expression trees; and finally it has too much bloat.
Independently of which representation may be better than the other in some specific problem domain, they are general purpose and can be applied to any domain as long as you can encode it.
In general, GEP is simpler than GP. Let's say you allow the following nodes in your program: constants, variables, +, -, *, /, if, ...
For each of these nodes, with GP you must implement the following operations:
- randomize
- mutate
- crossover
- and probably other genetic operators as well
In GEP, for each of these nodes only one operation needs to be implemented: deserialize, which takes an array of numbers (like double in C or Java) and returns the node. It resembles object deserialization in languages like Java or Python (the difference is that deserialization in programming languages uses byte arrays, whereas here we have arrays of numbers). Even this 'deserialize' operation doesn't have to be implemented by the programmer: it can be generated by a generic algorithm, just like it's done in Java or Python deserialization.
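A hedged sketch of what such a deserialize step might look like (a toy encoding of my own, not Ferreira's actual scheme): the first number picks the node type, and any further numbers it needs are consumed from the same array.

// Toy "deserialize": decode an expression node from an array of doubles.
public class GeneDecoder {

    interface Node { double eval(double[] vars); }

    // pos[0] is the read position; it is advanced as numbers are consumed.
    static Node decode(double[] genes, int[] pos, int numVars) {
        int op = (int) Math.floor(Math.abs(genes[pos[0]++])) % 4;   // 0:+  1:*  2:constant  3:variable
        switch (op) {
            case 0: { Node l = decode(genes, pos, numVars), r = decode(genes, pos, numVars);
                      return v -> l.eval(v) + r.eval(v); }
            case 1: { Node l = decode(genes, pos, numVars), r = decode(genes, pos, numVars);
                      return v -> l.eval(v) * r.eval(v); }
            case 2: { double c = genes[pos[0]++]; return v -> c; }
            default: { int i = (int) Math.abs(genes[pos[0]++]) % numVars; return v -> v[i]; }
        }
    }

    public static void main(String[] args) {
        // Genome {0, 2, 3.5, 3, 0} decodes to "+" ( constant 3.5 , variable x0 ).
        double[] genome = {0, 2, 3.5, 3, 0};
        Node program = decode(genome, new int[]{0}, 1);
        System.out.println(program.eval(new double[]{2.0}));   // 3.5 + x0 = 5.5
    }
}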
From one point of view this simplicity may make the search for the best solution less successful, but on the other hand it requires less work from the programmer, and simpler algorithms may execute faster (they're easier to optimize, and more code and data fit in the CPU cache, and so on). So I would say that GEP is slightly better, but of course the definitive answer depends on the problem, and for many problems the opposite may be true.