How does Lucene work

I would like to find out how Lucene search works so fast. I can't find any useful docs on the web. If you have anything (short of the Lucene source code) to read, let me know.
A text search query using MySQL 5 full-text search with an index takes about 18 minutes in my case. A Lucene search for the same query takes less than a second.

Lucene is an inverted full-text index. This means that it takes all the documents, splits them into words, and then builds an index for each word. Since index lookups are exact, unordered string matches, they can be extremely fast. Hypothetically, an SQL unordered index on a varchar field could be just as fast, and in fact I think you'll find that the big databases can do a simple string-equality query very quickly in that case.
Lucene does not have to optimize for transaction processing. When you add a document, it need not ensure that queries see it instantly. And it need not optimize for updates to existing documents.
However, at the end of the day, if you really want to know, you need to read the source. Both things you reference are open source, after all.

Lucene creates a big index. For each word, the index stores the word id, the number of documents in which the word is present, and the positions of the word in those documents. So when you issue a single-word query it just looks the word up in the index (essentially O(1) time complexity). The results are then ranked using different algorithms. For a multi-word query, it just takes the intersection of the sets of files in which the words are present.
Thus Lucene is very very fast.
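A minimal sketch of that last step, assuming each term's posting list is kept sorted by document id (the class and method names here are made up for illustration):
import java.util.ArrayList;
import java.util.List;

class PostingsIntersection {
    // Intersect two sorted posting lists (ascending doc ids) in one linear pass.
    static List<Integer> intersect(List<Integer> a, List<Integer> b) {
        List<Integer> result = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int docA = a.get(i), docB = b.get(j);
            if (docA == docB) { result.add(docA); i++; j++; }   // document contains both terms
            else if (docA < docB) i++;                          // advance the list with the smaller doc id
            else j++;
        }
        return result;
    }
}
Because both lists are sorted, the intersection costs one pass over the postings rather than a scan over the documents themselves.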
For more info, read this paper by Google's founders: http://infolab.stanford.edu/~backrub/google.html

In a word: indexing.
Lucene creates an index of your document that allows it to search much more quickly.
It's the same as the difference between a list, an O(N) data structure, and a hash table, an O(1) data structure. The list has to walk through the entire collection to find what you want. The hash table has an index that lets it figure out exactly where the desired item is and simply fetch it.
Update:
I'm not certain what you mean by "Lucene index searches are a lot faster than mysql index searches."
My guess is that you're using MySQL "WHERE document LIKE '%phrase%'" to search for a document. If that's true, then MySQL has to scan every row of the table, which is O(N).
Lucene gets to parse the document into tokens, group them into n-grams at your direction, and build an index for each one of those. Finding a word in an indexed Lucene document is then an O(1) lookup.
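To make that concrete, here is a minimal Lucene sketch (written against a recent Lucene release; the field name and contents are made up). The expensive work, tokenizing and building the inverted index, happens once at index time; the search itself is just a lookup in that index.
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class LuceneQuickSearch {
    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory();   // in-memory index, just for the demo

        // Index time: tokenize the document and add it to the inverted index.
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            doc.add(new TextField("body", "Random Access Memory is the main memory.", Field.Store.YES));
            writer.addDocument(doc);
        }

        // Search time: look the term up in the index instead of scanning documents.
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new QueryParser("body", new StandardAnalyzer()).parse("memory");
            TopDocs hits = searcher.search(query, 10);
            System.out.println("matches: " + hits.totalHits);
        }
    }
}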

Lucene works with term frequency and inverse document frequency. It creates an index mapping each word to the documents it appears in and its frequency count, which is nothing but an inverted index over the documents.
Example :
File 1 : Random Access Memory is the main memory.
File 2 : Hard disks are secondary memory.
Lucene creates an inverted index for them, something like
File 1 :
Term : Random
Frequency : 1
Position : 0
Term : Memory
Frequency : 2
Position : 2
Position : 6
So it is able to search and retrieve the matching content quickly. When there are too many matches for the search query, it orders the results by weight. Consider the search query "Main Memory": it searches for each word individually, and the result would be something like,
Main
File 1 : Frequency - 1
Memory
File 1 : Frequency - 2
File 2 : Frequency - 1
The result would be File 1 followed by File 2. To avoid being carried away by the weights of very common words like 'and', 'or', and 'the', it also considers the inverse document frequency (i.e. it decreases the weight of words that appear in most of the documents in the set).
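A tiny sketch of that weighting idea, using a smoothed idf similar in spirit to Lucene's classic formula (this is a simplification for illustration, not Lucene's exact scoring):
class TfIdfSketch {
    // Common terms (high document frequency) get a low idf,
    // so they contribute little to the final weight.
    static double weight(int termFreqInDoc, int docsContainingTerm, int totalDocs) {
        double idf = 1 + Math.log((double) totalDocs / (docsContainingTerm + 1));
        return termFreqInDoc * idf;
    }

    public static void main(String[] args) {
        int totalDocs = 2;
        // Using the two files above: "main" appears only in File 1, "memory" in both.
        System.out.println("File 1, main:   " + weight(1, 1, totalDocs));
        System.out.println("File 1, memory: " + weight(2, 2, totalDocs));
        System.out.println("File 2, memory: " + weight(1, 2, totalDocs));
    }
}
Summing per file gives roughly 2.2 for File 1 versus 0.6 for File 2, which matches the ordering above.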

Related

SOLR and Ratio of Matching Words

We are using SOLR version 4.3, and it appears that SOLR values the percentage of matching terms more than the number of matching terms.
For example, we do a search for Dog, and a document containing just the word dog and three other words is returned. We have another article, with hundreds of words, that has the word dog in it 27 times.
I would expect the second article to be returned first. However, the one with one matching word out of four is returned first. I was hoping to find out what in SOLR controls this so I can make the appropriate modifications. I have looked at the SOLR documentation and have seen coord mentioned, but it seems to indicate that the article with 27 references should come first. Any help would be appreciated.
For 4.x Solr still used regular TF/IDF as its scoring formula, and you can see the Lucene implementation detailed in the documentation for TFIDFSimilarity.
For your question, the two factors that affect the score are:
The length of the field, as given in norm():
norm(t,d) encapsulates a few (indexing time) boost and length factors:
Field boost - set by calling field.setBoost() before adding the field to a document.
lengthNorm - computed when the document is added to the index in accordance with the number of tokens of this field in the document, so that shorter fields contribute more to the score. LengthNorm is computed by the Similarity class in effect at indexing.
.. while the number of matching terms (not their frequency) is given by coord():
coord(q,d) is a score factor based on how many of the query terms are found in the specified document. Typically, a document that contains more of the query's terms will receive a higher score than another document with fewer query terms. This is a search time factor computed in coord(q,d) by the Similarity in effect at search time.
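Putting those pieces together, the classic Lucene/Solr practical scoring function documented in TFIDFSimilarity is roughly:
score(q,d) = coord(q,d) * queryNorm(q) * sum over terms t in q of [ tf(t in d) * idf(t)^2 * t.getBoost() * norm(t,d) ]
so a very short field (large norm(t,d)) can easily outweigh a higher tf(t in d) in a long document, which is the behaviour you are seeing.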
There are a few settings in your schema that can affect how Solr scores the documents in your example:
omitNorms
If true, omits the norms associated with this field (this disables length normalization for the field, and saves some memory)
.. this will remove the norm() part of the score.
omitTermFreqAndPositions
If true, omits term frequency, positions, and payloads from postings for this field.
.. and this will remove the boost from multiple occurrences of the same term. Be aware that this will remove positions as well, making phrase queries impossible.
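For reference, the equivalent knobs at the Lucene level look roughly like this in recent Lucene versions (a sketch only; in Solr you set omitNorms / omitTermFreqAndPositions on the field definition in schema.xml instead):
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexOptions;

class FieldConfigSketch {
    static FieldType bodyFieldType() {
        FieldType type = new FieldType(TextField.TYPE_STORED);
        type.setOmitNorms(true);                  // equivalent of omitNorms: drops length normalization
        type.setIndexOptions(IndexOptions.DOCS);  // equivalent of omitTermFreqAndPositions: docs only, no freqs/positions
        type.freeze();
        return type;
    }
}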
But you should also consider upgrading Solr, as the BM25 similarity that's the default from 6.x usually performs better. I can't remember if a version is available for 4.3.

Elasticsearch - higher scoring if higher frequency of term

I have 2 documents, and am searching for the keyword "Twitter". Suppose both documents are blog posts with a "tags" field.
Document A has ONLY 1 term in the "tags" field, and it's "Twitter".
Document B has 100 terms in the "tags" field, but 3 of them are "Twitter".
Elasticsearch gives the higher score to Document A even though Document B has a higher term frequency; Document B's score is "diluted" because it has more terms. How do I give Document B a higher score, since it has a higher frequency of the search term?
I know ElasticSearch/Lucene performs some normalization based on the number of terms in the document. How can I disable this normalization, so that Document B gets a higher score above?
As the other answer says, it would be interesting to see whether you get the same result on a single shard. I think you would, and that it comes down to the norms for the tags field, which are taken into account when computing the score with the default tf/idf similarity.
In fact, Lucene does take into account the term frequency, in other words the number of times the term appears within the field (1 or 3 in your case), and the inverse document frequency, in other words how frequent the term is across the index, in order to compare it with the other terms in the query (in your case that doesn't make any difference, since you are searching for a single term).
But there's another factor called norms, which rewards shorter fields and takes into account any index-time boosting, which can be per field (in the mapping) or even per document. You can verify that norms are the reason for your result by enabling the explain option in your search request and looking at the explain output.
I guess the fact that the first document contains only that tag makes it more important than the other one, which contains that tag multiple times but a lot of other tags as well. If you don't like this behaviour you can just disable norms in your mapping for the tags field. Norms are enabled by default if the field is "index":"analyzed" (the default). You can either switch to "index":"not_analyzed" if you don't want your tags field to be analyzed (which usually makes sense, but depends on your data and domain) or add the "omit_norms": true option to the mapping for your tags field.
Are the documents found on different shards? From the Elasticsearch documentation:
"When a query is executed on a specific shard, it does not take into account term frequencies and other search engine information from the other shards. If we want to support accurate ranking, we would need to first execute the query against all shards and gather the relevant term frequencies, and then, based on it, execute the query."
The solution is to specify the search type. Use dfs_query_and_fetch search type to execute an initial scatter phase which goes and computes the distributed term frequencies for more accurate scoring.
You can read more here.

How to build a simple inverted index?

I want to build a simple indexing function for a search engine without any API such as Lucene. In the inverted index, I just need to record basic information about each word, e.g. docID, position, and frequency.
Now, I have several questions:
What kind of data structure is often used for building an inverted index? A multidimensional list?
After building the index, how do I write it to files? What format should the file use? Like a table? Like drawing an index table on paper?
You can see a very simple implementation of inverted index and search in TinySearchEngine.
For your first question, if you want to build a simple (in-memory) inverted index, the straightforward data structure is a hash map like this:
val invertedIndex = new collection.mutable.HashMap[String, List[Posting]]
or a Java-esque:
HashMap<String, List<Posting>> invertedIndex = new HashMap<String, List<Posting>>();
The hash maps each term/word/token to a list of Postings. A Posting is just an object that represents an occurrence of a word inside a document:
case class Posting(docId:Int, var termFrequency:Int)
Indexing a new document is just a matter of tokenizing it (splitting it into tokens/words) and, for each token, inserting a new Posting into the correct list of the hash map. Of course, if a Posting already exists for that term in that specific docId, you increase the termFrequency instead. There are other ways of doing this. For in-memory inverted indexes this is OK, but for on-disk indexes you'd probably want to insert each Posting once with the correct termFrequency instead of updating it every time.
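If you prefer plain Java, a minimal in-memory version of the same idea might look like this (the class names are just for illustration, and it assumes documents are indexed one at a time):
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class Posting {
    final int docId;
    int termFrequency;
    Posting(int docId) { this.docId = docId; this.termFrequency = 1; }
}

class InMemoryIndex {
    // term -> postings list, one Posting per (term, document) pair
    private final Map<String, List<Posting>> index = new HashMap<>();

    void addDocument(int docId, String text) {
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) continue;
            List<Posting> postings = index.computeIfAbsent(token, t -> new ArrayList<>());
            Posting last = postings.isEmpty() ? null : postings.get(postings.size() - 1);
            if (last != null && last.docId == docId) {
                last.termFrequency++;                // same term, same doc: bump the frequency
            } else {
                postings.add(new Posting(docId));    // first occurrence of the term in this doc
            }
        }
    }

    List<Posting> postings(String term) {
        return index.getOrDefault(term.toLowerCase(), Collections.emptyList());
    }
}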
Regarding your second question, there are normally two cases:
(1) you have an (almost) immutable index. You index all your data once, and if you have new data you can just reindex. There is no need for real-time indexing or for indexing many times an hour, for example.
(2) new documents arrive all the time, and you need to search the newly arrived documents as soon as possible.
For case (1), you can have at least 2 files:
1 - The inverted index file. It lists, for each term, all postings (docId/termFrequency pairs). It is represented here in plain text, but normally stored as binary data.
Term1<docId1,termFreq><docId2,termFreq><docId3,termFreq><docId4,termFreq><docId5,termFreq><docId6,termFreq><docId7,termFreq>
Term2<docId3,termFreq><docId5,termFreq><docId9,termFreq><docId10,termFreq><docId11,termFreq>
Term3<docId1,termFreq><docId3,termFreq><docId10,termFreq>
Term4<docId5,termFreq><docId7,termFreq><docId10,termFreq><docId12,termFreq>
...
TermN<docId5,termFreq><docId7,termFreq>
2 - The offset file. It stores, for each term, the offset at which its inverted list can be found in the inverted index file. Here I'm representing the offset in characters, but you'll normally store binary data, so the offset will be in bytes. This file can be loaded into memory at startup time. When you need to look up a term's inverted list, you look up its offset and read the inverted list from the file.
Term1 -> 0
Term2 -> 126
Term3 -> 222
....
Along with these 2 files you can (and generally will) have file(s) to store each term's IDF and each document's norm.
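As a rough illustration of how the offset file is used at query time, assuming (purely for this sketch) that each term's postings were written as a length-prefixed binary block:
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Map;

class PostingsReader {
    // 'offsets' is the term -> byte-offset map loaded from the offset file at startup;
    // 'invertedIndexFile' is the inverted index file described above.
    static byte[] readPostingsBlock(Map<String, Long> offsets, RandomAccessFile invertedIndexFile, String term)
            throws IOException {
        Long offset = offsets.get(term);
        if (offset == null) return new byte[0];     // term is not in the index
        invertedIndexFile.seek(offset);             // jump straight to the term's inverted list
        int length = invertedIndexFile.readInt();   // assumed length prefix, in bytes
        byte[] block = new byte[length];
        invertedIndexFile.readFully(block);         // decode <docId,termFreq> pairs from this block
        return block;
    }
}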
For case (2), I'll try to briefly explain how Lucene (and consequently Solr and Elasticsearch) does it.
The file format can be the same as explained above. The main difference is that when you index new documents, systems like Lucene do not rebuild the index from scratch; they just create a new index containing only the new documents. So every time you have to index something, you do it in a new, separate index.
To perform a query over this "split" index you can run the query against each individual index (in parallel) and merge the results together before returning them to the user.
Lucene calls these "little" indexes segments.
The obvious concern here is that you'll accumulate a lot of little segments very quickly. To avoid this, you need a policy for merging segments into larger ones. For example, if you have more than N segments you could decide to merge together all segments smaller than 10 KB.
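Conceptually, searching such a split index is just "query every segment, then merge", something like this sketch (the Segment interface and ScoredDoc type are made up for illustration; in Lucene the per-segment readers play this role):
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class ScoredDoc {
    final int docId;
    final double score;
    ScoredDoc(int docId, double score) { this.docId = docId; this.score = score; }
}

// Hypothetical per-segment searcher.
interface Segment {
    List<ScoredDoc> search(String query, int topN);
}

class SegmentedSearch {
    // Query every segment, then merge the partial results by score.
    static List<ScoredDoc> search(List<Segment> segments, String query, int topN) {
        List<ScoredDoc> merged = new ArrayList<>();
        for (Segment segment : segments) {                 // could run in parallel
            merged.addAll(segment.search(query, topN));
        }
        merged.sort(Comparator.comparingDouble((ScoredDoc d) -> d.score).reversed());
        return merged.subList(0, Math.min(topN, merged.size()));
    }
}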

lucene index match

I am trying to use Lucene for doing an undup or dedup match. Essentially I have a file with records which I want to group based on certain fields (fuzzy search) and get back a result with a match key that tells me which records within that file matched each other.
Is this possible?
This can be done (if I understand you correctly). You would index the terms/records that will be searched on in one pass. In a second pass, you would search for each term and log the results.
While pre-processing each record you can generate a hash that aggregates those fields and store it (as NOT_ANALYZED); that way you just have to search on one field with a known size. Take a look at MessageDigest. This is what I normally do for duplicate detection of file content (since the content might be too big for a single query).
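A small sketch of that pre-processing step, concatenating the match fields and hashing them with MessageDigest (the field names are made up); the resulting hex string is what you would index as the NOT_ANALYZED match field:
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

class RecordHash {
    // Collapse the fields used for matching into one fixed-size key.
    static String matchKey(String name, String address, String phone) throws Exception {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        byte[] hash = digest.digest((name + "|" + address + "|" + phone)
                .toLowerCase().getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : hash) hex.append(String.format("%02x", b & 0xff));
        return hex.toString();
    }
}
Note that this only catches exact duplicates of the hashed fields; for fuzzy grouping you would normalize the fields first, or fall back to field-by-field queries.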
If what you are looking for is creating a more complex query, try using a CachingWrapperFilter; that way subsequent calls to your deduplication algorithm will be much faster.

Lucene - Scoring and payload

We have an application where every term position in a document is associated with an "engine score".
A term query should then be scored according to the sum of "engine scores" of the term in a document, rather than on the term frequency.
For example, term frequency of 5 with an average engine score of 100 should be equivalent to term frequency of 1 with engine score 500.
I understood that if I keep the engine score per position in the payload, I will be able to use scorePayload in combination with a summing PayloadFunction to get the sum of engine scores of a term in a document, and so will be able to achieve my goal.
There are two issues with this solution:
Even the simplest term query would have to scan the positions file in order to get the payloads, which could be a performance issue.
We would prefer to index the sum of engine scores in advance, per document, in addition to the term frequency. This is some sort of payload at the document level. Does Lucene support that, or is there any other solution for this issue?
The "engine score" of a phrase occurrence is defined as the multiplication of engine scores of the terms that compose the phrase.
So in scorePayload I need the payloads of all the terms in the phrase in order to be able to appropriately score the phrase occurrence.
As far as I understand, the current interface of scorePayload does not provide this information.
Is there another way this can be achieved in Lucene ?
One workaround for a document-level payload is to create one extra Lucene document per document of yours that just contains the engine score for your whole document, in a specially-named field (different from all other Lucene field names). You can then combine / extract that document during your searches. Not much of a workaround, but there it is.
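A rough sketch of that workaround as I read it (the field names and the choice of docvalues for the score are my own assumptions; how you fold the stored score back into the query score at search time is a separate step):
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;

class EngineScoreIndexer {
    // Index a companion document, keyed by the same external id as the real document,
    // that carries only the precomputed, document-level sum of engine scores.
    static void addScoreDocument(IndexWriter writer, String externalId, long engineScoreSum) throws Exception {
        Document scoreDoc = new Document();
        scoreDoc.add(new StringField("engine_score_id", externalId, Field.Store.YES));  // link back to the real document
        scoreDoc.add(new NumericDocValuesField("engine_score_sum", engineScoreSum));    // the document-level "payload"
        writer.addDocument(scoreDoc);
    }
}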