Lucene: Query at least

I'm trying to find out whether there's a way in Lucene to search for all documents in which there is at least one word that does not match a particular word.
E.g. I want to find all documents where there is at least one word besides "test". i.e. "test" may or may not be present but there should be at least one word other than "test". Is there a way to do this in Lucene?
thanks,
Purushotham

Lucene could do this, but this wouldn't be a good idea.
The performance of query execution is bound by two factors:
the time to intersect the query with the term dictionary,
the time to retrieve the docs for every matching term.
Performant queries are the ones which can be quickly intersected with the term dictionary, and match only a few terms so that the second step doesn't take too long. For example, in order to prohibit too complex boolean queries, Lucene limits the number of clauses to 1024 by default.
With a TermQuery, intersecting the term dictionary requires (by default) O(log(n)) operations (where n is the size of the term dictionary) in memory and then one random access on disk plus the streaming of at most 16 terms. Another example is this blog entry from Lucene committer Mike McCandless which describes how FuzzyQuery performance improved when a brute-force implementation of the first step was replaced by something more clever.
However, the query you are describing would require examining every single term of the term dictionary, and the only documents it could dismiss are those that contain nothing but the word "test"!
You should give more details about your use-case so that people can think about a more efficient solution to your problem.

If you need a query with a single negative condition, then use a BooleanQuery with a MatchAllDocsQuery and a TermQuery with Occur.MUST_NOT. There is no way to additionally enforce the existential constraint ("must contain at least one term that is not excluded"); you'll have to check that separately, once you retrieve Lucene's results. Depending on the ratio of favorable results to all the results returned from Lucene, this kind of solution can range from perfectly fine to a performance disaster.
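For illustration, a minimal sketch of that query in a recent Lucene version might look like this (the field name "body" is an assumption):

// Classes from org.apache.lucene.index and org.apache.lucene.search.
// Match every document except those containing the prohibited term; the
// "at least one other word" condition still has to be checked on the hits.
Query excluded = new TermQuery(new Term("body", "test"));
BooleanQuery.Builder builder = new BooleanQuery.Builder();
builder.add(new MatchAllDocsQuery(), BooleanClause.Occur.MUST);
builder.add(excluded, BooleanClause.Occur.MUST_NOT);
Query query = builder.build();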

Related

Is it possible to use a full text index to find closest-match strings? What does Statistical Semantics do in Full Text Indexing?

I am looking at SQL Server 2016 full text indexes, and they are great for searching strings that contain multiple words.
When I create the full text index, it shows Statistical Semantics as a checkbox. What does statistical semantics do?
Moreover, I want to support "did you mean" queries.
For example, let's say I have a record "house" and the user types "hause".
Can I use a full text index to efficiently find the closest match to "hause" and show the user "did you mean house"? Thank you.
I have tried SOUNDEX, but the results it generates are terrible: it returns many unrelated words.
And since there are so many records in my database and I need very fast results, I need something SQL Server natively supports.
Any ideas? Any way to achieve such a thing using indexes?
I know there are multiple algorithms, but they are not efficient enough for me to use online, for example calculating the edit distance against each record. They could be used for offline projects, but I need this efficiency in an online dictionary that will receive thousands of requests constantly.
I already have a plan in mind: storing not-found results in the database, calculating the closest matches offline, and using them as a cache. However, I wonder whether any online/live solution exists? Consider that there will be over 100m nvarchar records.
The short answer is no: Full Text Search cannot search for words that are similar but different.
Full Text Search uses stemmers and thesaurus files:
The stemmer generates inflectional forms of a particular word based on the rules of that language (for example, "running", "ran", and "runner" are various forms of the word "run").
A Full-Text Search thesaurus defines a set of synonyms for a specific language.
Both stemmers and thesaurus files are configurable, and you could easily have FT match house for a search on hause, but only if you added hause as a synonym for house. This is obviously a non-solution, as it requires you to add every possible typo as a synonym...
Semantic search is a different topic, it allows you to search for documents that are semantically close to a given example.
What you want is to find records that have a short Levenshtein distance from a given word (aka 'fuzzy' search). I don't know of any technique for creating an index that can answer a Levenshtein search. If you're willing to scan the entire table for each term, T-SQL and CLR implementations of Levenshtein exist.
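For reference, the distance itself is a short dynamic program; a minimal Java sketch (purely illustrative, not the T-SQL or CLR implementations mentioned above) looks like this:

// Standard two-row dynamic-programming Levenshtein distance.
static int levenshtein(String a, String b) {
    int[] prev = new int[b.length() + 1];
    int[] curr = new int[b.length() + 1];
    for (int j = 0; j <= b.length(); j++) prev[j] = j;      // distance from the empty prefix of a
    for (int i = 1; i <= a.length(); i++) {
        curr[0] = i;                                        // distance to the empty prefix of b
        for (int j = 1; j <= b.length(); j++) {
            int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
            curr[j] = Math.min(Math.min(curr[j - 1] + 1,    // insertion
                                        prev[j] + 1),       // deletion
                               prev[j - 1] + cost);         // substitution or match
        }
        int[] tmp = prev; prev = curr; curr = tmp;
    }
    return prev[b.length()];
}

For example, levenshtein("hause", "house") is 1, so "house" would rank as the closest match; the cost is that every candidate record has to be scanned.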

Lucene, Alternatives to MoreLikeThis?

I am building a recommender system for restaurants. Each restaurant is represented in the form of a document with the following features (fields): Cuisine, Facilities, Types.
Now, I read about the MoreLikeThis query. It finds similar documents based on term frequencies. So it ignores, for example, two documents with the following Cuisine:
"steakhouse australian gluten-free"
because the Lucene index does not consider these important terms, since they only occur once.
Is there any other query that ignores term frequencies and just finds similar documents based on the largest number of matching keywords?
You could create a query with the entire contents of the document, by running it through a QueryParser, something like:
// Build a query from the document's own field text, escaping any query syntax characters
QueryParser myQueryParser = new QueryParser(myFieldName, new StandardAnalyzer());
Query query = myQueryParser.parse(QueryParserBase.escape(myDoc.get(myFieldName)));
Potential issues would be overlong queries causing poor performance (that is why MoreLikeThis attempts to select the best terms to query instead of searching for all of them), or Too Many Clauses exceptions.
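If the goal is strictly "rank by the number of matching keywords, ignoring term frequencies", another possible sketch (assuming a recent Lucene version; the field name and keywords below come from the question) is to wrap each term in a ConstantScoreQuery so that every matching clause contributes the same amount to the score:

// The score becomes (roughly) the count of matching keywords, not their frequencies.
BooleanQuery.Builder builder = new BooleanQuery.Builder();
for (String keyword : new String[] {"steakhouse", "australian", "gluten-free"}) {
    builder.add(new ConstantScoreQuery(new TermQuery(new Term("Cuisine", keyword))),
                BooleanClause.Occur.SHOULD);
}
Query byKeywordOverlap = builder.build();

Note that TermQuery is not analyzed, so the keywords would have to match the indexed tokens exactly.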

Searching efficiently with keywords

I'm working with a big table (millions of rows) in a PostgreSQL database. Each row has a name column, and I would like to perform a search on that column.
For instance, if I'm searching for the movie Django Unchained, I would like the query to return the movie whether I search for Django or for Unchained (or Dj or Uncha), just like the IMDb search engine.
I've looked up full text search, but I believe it is more intended for long text; my name column will never be more than 4-5 words.
I've thought about having a keywords table with a many-to-many relationship, but I'm not sure that's the best way to do it.
What would be the most efficient way to query my database?
My guess is that for what you want to do, full text search is the best solution. (Documented here.)
It does allow you to search for complete words. It also allows you to search for prefixes of words (such as "Dja"). Plus, you can add synonyms as necessary. It doesn't allow wildcards at the beginning of a word, so "Jango" would need to be handled with a synonym.
If this doesn't meet your needs and you need the capabilities of LIKE, I would suggest the following. Put the title into a separate table that basically has two columns: an id and the title. The goal is to make the scanning of the table as fast as possible, which in turn means getting the titles to fit in the smallest space possible.
There is an alternative solution, which is n-gram searching. I'm not sure whether Postgres supports it natively, but here is an interesting article on the subject that includes Postgres code for implementing it.
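To illustrate the idea only (a conceptual Java sketch, not the Postgres implementation from that article), n-gram search indexes overlapping character sequences, so a partial query like "Uncha" shares most of its trigrams with "Unchained":

import java.util.HashSet;
import java.util.Set;

// Split a (lower-cased, padded) string into overlapping trigrams.
static Set<String> trigrams(String s) {
    Set<String> grams = new HashSet<>();
    String padded = "  " + s.toLowerCase() + " ";   // padding gives prefixes their own grams
    for (int i = 0; i + 3 <= padded.length(); i++) {
        grams.add(padded.substring(i, i + 3));
    }
    return grams;
}

The candidate names sharing the most trigrams with the query are the best matches, and the trigrams themselves can be indexed.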
The standard way to search for a sub-string anywhere in a larger string is using the LIKE operator:
SELECT *
FROM mytable
WHERE name LIKE '%Unchai%';
However, if you have millions of rows it will be slow, because a pattern with a leading wildcard cannot make effective use of an index.
You might want to dabble with multiple strategies, such as first retrieving records where the value for name starts with the search string (which can benefit from an index on the name column - LIKE 'Unchai%';) and then adding middle-of-the-string hits after a second non-indexed pass. Humans tend to be significantly slower than computers on interpreting strings, so the user may not suffer.
This question is very much related to autocomplete in forms. You will find several threads on that.
Basically, you will need a special kind of index, a space-partitioning tree. PostgreSQL supports such index structures through SP-GiST. You will find a bunch of useful material if you google for that.

Understanding range indexes in Marklogic

I found the following in ML documentation:
a range index lets the server map values to fragments, and fragments to values... The former capability is used to support "range predicates"... The latter is used to support fast order by operations...
Can anyone please explain this to me? Some sort of diagram depicting how this mapping is maintained would be very helpful.
Yes, do read Jason's excellent paper for all the detail of the inner workings of MarkLogic.
A simple summary of range indexes is this: A range index is a sorted term list. Term lists are an inverted index of values stored in documents. For word indexes, for example, a list of terms (a term list) is created that contains all the words in all the documents. Each term in the list is a word, say "humdinger", and an associated set of fragment IDs where that word occurs. When you do a word search for "humdinger", ML checks the term lists to find out which fragments that word occurs in. Easy. A more complex search is simply the set intersections of all the matched terms from all applicable term lists.
Most "regular" indexes are not sorted, they're organized as hashes to make matching terms efficient. They produce a set of results, but not ordered (relevance ordering is applied after). A range index on the other hand, is a term list that's sorted by the values of its terms. A range index therefore represents the range of unique values that occur in all instances of an element or attribute in the database.
Because range index term lists are ordered, when you get matches in a search you not only know which fragments they occur in, you also know the sorted order of the possible values for that field. MarkLogic's XQuery is optimized to recognize when you've supplied an "order by" clause that refers to an element or attribute which is range indexed. This lets it sort not by comparing the matched documents, but by iterating down the sorted term list and fetching matched documents in that order. This makes it much faster because the documents themselves need not be touched to determine their sort order.
But wait, there's more. If you're paginating through search results, taking only a slice of the matching results, then fast sorting by a range indexed field helps you there as well. If you're careful not to access any other part of the document (other than the range index element) before applying the page-window selection predicate, then the documents outside that window will never need to be fetched. The combination of pre-sorted selection and fast skip ahead is really the only way you can efficiently step through large, sorted result sets.
Range indexes have one more useful feature. You can access their values as lexicons, enumerating the unique values that occur in a given element or attribute across your entire database, without ever actually looking inside any documents. This comes in handy for things like auto-suggest and getting counts for facets.
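As a purely conceptual sketch in ordinary Java collections (not MarkLogic's actual on-disk structures), the difference between a hash-organized term list and a range index is roughly this:

import java.util.*;

// A term list maps each value to the set of fragment IDs containing it.
Map<String, Set<Long>> wordIndex = new HashMap<>();            // fast lookup, no ordering
NavigableMap<String, Set<Long>> rangeIndex = new TreeMap<>();  // sorted by value

// Point lookups work either way:
Set<Long> matches = rangeIndex.getOrDefault("humdinger", Collections.emptySet());

// Only the sorted structure answers range predicates and "order by" without
// touching the documents themselves:
for (Set<Long> fragments : rangeIndex.subMap("a", "m").values()) {
    // fragment IDs for values between "a" and "m", visited in value order
}

// ...and it doubles as a lexicon of the unique values:
Set<String> lexicon = rangeIndex.keySet();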
I hope that clarifies what range indexes are.
Take a look at Jason Hunter's writeup in Inside MarkLogic Server. There's a whole section on range indexes.

Determining search results quality in Lucene

I have been searching for information about score normalization in Lucene for a few days (now I know this can't be done), using the mailing list, wiki, blog posts, etc. I'm going to describe my problem because I'm not sure that score normalization is what our project needs.
Background:
In our project, we are using Solr on top of Lucene with custom RequestHandlers and SearchComponents. For a given query, we need to detect when a query got poor results to trigger different actions.
Assumptions:
Immutable index (once indexed, it is not updated) and the same query typology (dismax query parser with the same field boosting, without boost functions or boost queries).
Problem:
We know that score normalization is not implementable. But is there any way to determine (using the TF/IDF and field boost assumptions) when the quality of the search result matches is poor?
Example: we've got one index with science papers and another with medcare centres' info. When a user queries the first index and gets poor results (inferring this from the score?), we want to query the second index and merge the results using some threshold (a score threshold?).
Thanks in advance
You're right that normalization of scores across different queries doesn't make sense, because nearly all similarity measures are based on term frequency, which is of course local to a query.
However, I think that it is viable to compare the scores in this very special case that you are describing, if only you would override the default similarity to use IDF calculated jointly for both indexes. For instance, you could achieve it easily by keeping all the documents in one index and adding an extra (and hidden to the users) 'type' field. Then, you could compare the absolute values returned by these queries.
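At the Lucene level, a minimal sketch of that single-index idea (the field names, writer, and userQuery below are placeholders) would index both corpora together and filter by type at query time, using a clause that does not affect the score so that the absolute scores stay comparable:

// Indexing: every document carries a hidden "type" field.
Document doc = new Document();
doc.add(new StringField("type", "paper", Field.Store.NO));
doc.add(new TextField("body", paperText, Field.Store.YES));
writer.addDocument(doc);

// Querying: restrict to one corpus with a FILTER clause (no score contribution).
Query restricted = new BooleanQuery.Builder()
    .add(userQuery, BooleanClause.Occur.MUST)
    .add(new TermQuery(new Term("type", "paper")), BooleanClause.Occur.FILTER)
    .build();

In Solr the same restriction would normally be expressed as a filter query (fq), which likewise does not influence the score.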
Generally, it could be possible to detect low-quality results by looking at certain features, for example a very small number of results or an odd distribution of scores, but I don't think that actually solves your problem. It looks more similar to the issue of merging isolated search results, which is discussed, for example, in this paper.