How does Lucene use Similarity during indexing time? I understand the role of similarity while reading the index. So, searcher.setSimilarity() makes sense in scoring. What is the use of IndexWriterConfig.setSimilarity()?
How does Lucene use Similarity during indexing time?
The short answer is: Lucene captures some statistics at indexing time which can then be used to support scoring at query time. I expect it is simply a matter of efficiency that these are captured as part of the indexing process, rather than being repeatedly re-computed on the fly, when running queries.
There is a section in the Similarity javadoc which describes this at a high level:
At indexing time, the indexer calls computeNorm(FieldInvertState), allowing the Similarity implementation to set a per-document value for the field that will be later accessible via LeafReader.getNormValues(String).
The javadoc goes on to describe further details - for example:
Many formulas require the use of average document length, which can be computed via a combination of CollectionStatistics.sumTotalTermFreq() and CollectionStatistics.docCount().
So, for example, the segment info file within a Lucene index records the number of documents in each segment.
There are other statistics which can be captured in an index to support scoring calculations at query time. You can see a summary of these stats in the Index Structure Overview documentation - with links to further details.
What is the use of IndexWriterConfig.setSimilarity()?
This is a related question which follows on from the above points.
By default, Lucene uses the BM25Similarity formula.
That is one of a few different scoring models that you may choose to use (or you can define your own). The setSimilarity() method is how you can choose a different similarity (scoring model) from the default one. This means different statistics may need to be captured (and then used in different ways) to support the chosen scoring model.
It would not make sense to use one scoring model at indexing time, and a different one at query time.
(Just to note: I have never set the similarity scoring model myself - I have always used the default model.)
Related
I have a very simple object as keys in my cache and I want to be able to iterate on the key/value pairs where a string matches a field in my keys.
Here is how the field is declared in the class
#AffinityKeyMapped #QueryTextField String crawlQueueID;
I run many queries and expect a small amount of documents to match. The queries take a relatively large amount of time, which is surprising given that there are maybe only 100K pairs locally in the cache. My queries are local, I want to hit only the K/V stored in the local node.
According to the profiler I am using, 80% of the CPU is spent here
GridLuceneIndex.java:285 org.apache.lucene.search.IndexSearcher.search(Query, int)
Knowing Lucene's performance, I am really surprised. Any suggestions?
BTW I want to sort the results based on a numerical field in the value object. Can this be done via annotations?
I could have one cache per value of the field I am querying against but given that there are potentially hundreds of thousands or even millions of different values, that would probably be too many caches for Ignite to handle.
EDIT
Looking at the code that handles the Lucene indexing and querying, the index gets reloaded for every query. Given that I do hundreds of them in a row, we probably don't benefit from any caching or optimisation of the index structure in Lucene.
Additionally, there is a range query running as a filter to check for the TTL. FilterQueries are faster but on a fresh indexreader, there would not be much caching either. Of course, if no TTL is needed for a given table, this should not be required.
Judging by the documentation about the indexing with SQL indexing:
Ignite automatically creates indexes for each primary key and affinity
key field.
the indexing is done on the key alone. In my case, the value I want to use for sorting is in the value object so that would not work.
I've been reading the documentation and it provides some examples for drools and constraint streams, but it doesn't explicitly say whether you can or cannot use Constraint Configuration with an EasyScoreCalculator.
As the ConstrationConfiguration is a field in the PlanningSolution class, it's available in the EasyScoreCalculator's calculateScore(Solution_ solution) method, which computes the score of the entire solution for every move.
Let me just note that the EasyScoreCalculator does not scale for bigger data sets - exactly because it computes the score of the entire solution for every move.
Going over some interview info about data structures etc.
So, as I understand, arrays are O(1) for indexing, which I believe means finding the specific element contained at space x in the array. Just want to confirm this as I am second guessing myself.
Also, hash maps are O(1) for indexing, searching, insertion and deletion. Does that not kind of make any data structure question pointless, since a hash map will always be the best solution?
Thanks
Well indexing is not only about arrays,
according to this - indexing is creating tables (indexes) that point to the location of folders, files and records. Depending on the purpose, indexing identifies the location of resources based on file names, key data fields in a database record, text within a file or unique attributes in a graphics or video file.
For your second question hash maps are not absolute or best data structures for various reasons, mainly:
Collisions
Hash function calculation time
Extra memory used
Also there's lots of Data Structure questions where hashmaps are not superior:
Data structure for finding k-th minimum element and supporting updates (Hashmap would be like bruteforce because it does not keep elements sorted, so we need something like Balanced binary search tree)
Data structure for finding if word is in dictionary (Sure hashmap works but Trie is so much faster & less memory)
Data structure for finding minimum element in any range of an array with updates (Once again hashmap is just too slow for this, we need something like segment tree)
...
When creating a document in my Lucene index (v7.2), I add a uid field to it which contains a unique id/key (string):
doc.add(new StringField("uid",uid,Field.Store.YES))
To retrieve that document later on, I create a TermQuery for the given unique id and search for it with an IndexSearcher:
searcher.search(new TermQuery(new Term("uid",uid)),1)
Being a Lucene "novice", I would like to know the following:
How should I improve this approach to get the best lookup performance?
Would it, for example, make a difference if I store the unique id as
a byte array instead of as a string? Or are there some special codecs or filters that can be used?
What is the time complexity of looking up a document by its unique id? Since the index contains at least one unique term for each document, the lookup times will increase linearly with the number of documents (O(n)), right?
Theory
There is a blog post about Lucene term index and lookup performance. It clearly reveals all the details of complexity of looking up a document by id. This post is quite old, but nothing was changed since then.
Here is some highlights related to your question:
Lucene is a search engine where the minimum element of retrieval is a text term, so this means: binary, number and string fields are represented as strings in the BlockTree terms dictionary.
In general, the complexity of lookup depends on the term length: Lucene uses an in-memory prefix-trie index structure to perform a term lookup. Due to restrictions of real-world hardware and software implementations (in order to avoid superfluous disk reads and memory overflow for extremely large tries), Lucene uses a BlockTree structure. This means it stores prefix-trie in small chunks on disk and loads only one chunk at time. This is why it's so important to generate keys in an easy-to-read order. So let's arrange the factors according to the degree of their influence:
term's length - more chunks to load
term's pattern - to avoid superfluous reads
terms count - to reduce chunks count
Algorithms and Complexity
Let term be a single string and let term dictionary be a large set of terms. If we have a term dictionary, and we need to know whether a single term is inside the dictionary, the trie (and minimal deterministic acyclic finite state automaton (DAFSA) as a subclass) is the data structure that can help us. On your question: “Why use tries if a hash lookup can do the same?”, here are a few reasons:
The tries can find strings in O(L) time (where L represents the length of a single term). This is a bit faster compared to hash table in the worst case (hash table requires linear scan in case of hash collisions and sophisticated hashing algorithm like MurmurHash3), or similar to a hash table in perfect case.
The hash tables can only find terms of a dictionary that exactly match with the single term that we are looking for; whereas the trie allows us to find terms that have a single different character, a prefix in common, a character missing, etc.
The trie can provide an alphabetical ordering of the entries by key, so we can enumerate all terms in alphabetical order.
The trie (and especially DAFSA) provides a very compact representation of terms with deduplication.
Here is an example of DAFSA for 3 terms: bath, bat and batch:
In case of key lookup, notice that lowering a single level in the automata (or trie) is done in constant time, and every time that the algorithm lowers a single level in the automata (trie), a single character is cut from the term, so we can conclude that finding a term in a automata (trie) can be done in O(L) time.
Is there a known math formula that I can use to estimate the size of a new Lucene index? I know how many fields I want to have indexed, and the size of each field. And, I know how many items will be indexed. So, once these are processed by Lucene, how does it translate into bytes?
Here is the lucene index format documentation.
The major file is the compound index (.cfs file).
If you have term statistics, you can probably get an estimate for the .cfs file size,
Note that this varies greatly based on the Analyzer you use, and on the field types you define.
The index stores each "token" or text field etc., only once...so the size is dependent on the nature of the material being indexed. Add to that whatever is being stored as well. One good approach might be to take a sample and index it, and use that to extrapolate out for the complete source collection. However, the ratio of index size to source size decreases over time as well, as the words are already there in the index, so you might want to make the sample a decent percentage of the original.
I think it has to also do with the frequency of each term (i.e. an index of 10,000 copies of the sames terms should be much smaller than an index of 10,000 wholly unique terms).
Also, there's probably a small dependency on whether you're using Term Vectors or not, and certainly whether you're storing fields or not. Can you provide more details? Can you analyze the term frequency of your source data?