Lucene's best way to do "starts-with" queries

I want to be able to do the following types of queries:
The data to index consists of (let's say) music videos where only the title is interesting.
I simply want to index these and then create queries for them such that, whatever word or words the user puts in the query, the documents containing those words, in that order, at the beginning of the title are returned first, followed (in no particular order) by documents containing at least one of the searched words in any position of the title. All of this should be case insensitive.
Example:
For documents:
Video1Title = Sea is blue
Video2Title = Wild sea
Video3Title = Wild sea Whatever
Video4Title = Seaside Whatever
If I search "sea" I want to get
"Video1Title = Sea is blue"
first, followed by all the other documents that contain "sea" in the title, but not at the beginning.
If I search "Wild sea" I want to get
Video2Title = Wild sea
Video3Title = Wild sea Whatever
first, followed by all the other documents that have "Wild" or "Sea" in their title but don't have "Wild sea" as a title prefix.
If I search "Seasi" I don't want to get anything (I don't care about keyword tokenization and prefix queries).
Now, AFAIK, there's no actual way to tell Lucene "find me documents where word1, word2, etc. are in positions 1, 2, 3, etc."
There are "workarounds" to simulate that behaviour:
Index the field twice. In field1 you have the words tokenized (using perhaps StandardAnalyzer) and in field2 you have them all clumped up into one element (using KeywordAnalyzer). Then you search with something like:
+(field1:word1 word2 word3) (field2:"word1 word2 word3*")
effectively telling Lucene "documents must contain word1 or word2 or word3 in the title, and those that also match 'title starts with word1 word2 word3' are better (get a higher score)".
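For illustration, the two-field idea might be wired up roughly like this in plain Lucene (a sketch using a Lucene 8.x-style API; the field names, the in-memory directory and the boost factor are made up for the example and are not part of the original post):

import java.util.Map;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.search.*;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class StartsWithSketch {
    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory();

        // field1 ("title") is tokenized normally; field2 ("titleExact") keeps the whole
        // (manually lowercased) title as a single token via KeywordAnalyzer.
        Analyzer analyzer = new PerFieldAnalyzerWrapper(
                new StandardAnalyzer(), Map.of("titleExact", new KeywordAnalyzer()));

        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            for (String title : new String[] {"Sea is blue", "Wild sea", "Wild sea Whatever", "Seaside Whatever"}) {
                Document doc = new Document();
                doc.add(new TextField("title", title, Field.Store.YES));
                doc.add(new TextField("titleExact", title.toLowerCase(), Field.Store.NO));
                writer.addDocument(doc);
            }
        }

        // Query for "wild sea": every hit must contain at least one of the words
        // (lowercased, to match what StandardAnalyzer indexed); titles that start with
        // the whole phrase also match the boosted prefix clause and score higher.
        Query anyWord = new BooleanQuery.Builder()
                .add(new TermQuery(new Term("title", "wild")), BooleanClause.Occur.SHOULD)
                .add(new TermQuery(new Term("title", "sea")), BooleanClause.Occur.SHOULD)
                .build();
        Query query = new BooleanQuery.Builder()
                .add(anyWord, BooleanClause.Occur.MUST)
                .add(new BoostQuery(new PrefixQuery(new Term("titleExact", "wild sea")), 5.0f),
                     BooleanClause.Occur.SHOULD)
                .build();

        IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
        for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
            System.out.println(searcher.doc(hit.doc).get("title") + "  score=" + hit.score);
        }
    }
}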
Add a "lucene_start_token" to the beginning of the field when indexing them such that
Video2Title = Wild sea is indexed as "title:lucene_start_token Wild sea" and so on for the rest
Then do a query such that:
+(title:sea) (title:"lucene_start_token sea")
and having Lucene return all documents that contain my search word(s) in the title, while giving a better score to those that matched "lucene_start_token + search words".
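A corresponding sketch of this second workaround, reusing the writer/searcher setup from the snippet above (the marker string is the one from the post; the boost factor is arbitrary):

// Index time: prepend the marker so it always occupies the first position of the field.
doc.add(new TextField("title", "lucene_start_token " + title, Field.Store.YES));

// Query time for "sea": every hit must contain the word; titles that start with it
// also match the phrase anchored at the marker and therefore score higher.
Query q = new BooleanQuery.Builder()
        .add(new TermQuery(new Term("title", "sea")), BooleanClause.Occur.MUST)
        .add(new BoostQuery(new PhraseQuery("title", "lucene_start_token", "sea"), 5.0f),
             BooleanClause.Occur.SHOULD)
        .build();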
My question is then: are there indeed better ways to do this (maybe using PhraseQuery and term positions)? If not, which of the above is better performance-wise?

You can use Lucene payloads for that. They let you give a custom boost to every term of the field value.
So, when you index your titles you can start with a boost factor of 3 (for example):
title: wild|3.0 creatures|2.5 blue|2.0 sea|1.5
title: sea|3.0 creatures|2.5
Indexing this way, you boost the terms nearest to the start of the title.
The main problem with this approach is that you have to tokenize the titles yourself and add all this boost information "manually", because the analyzer needs the text structured that way (term1|1.1 term2|3.0 term3).
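To make that concrete, here is roughly what the indexing side could look like using Lucene's DelimitedPayloadTokenFilter (a sketch only; the query-side classes for consuming payloads differ between Lucene versions):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.payloads.DelimitedPayloadTokenFilter;
import org.apache.lucene.analysis.payloads.FloatEncoder;

// Analyzer that understands "term|boost" input, e.g. "wild|3.0 sea|2.5 whatever|2.0",
// and stores the float after '|' as the term's payload.
Analyzer payloadAnalyzer = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new WhitespaceTokenizer();
        TokenStream sink = new LowerCaseFilter(source);
        sink = new DelimitedPayloadTokenFilter(sink, '|', new FloatEncoder());
        return new TokenStreamComponents(source, sink);
    }
};
// At query time a payload-aware query (e.g. PayloadScoreQuery around a SpanTermQuery,
// together with a payload-reading function/decoder) folds the stored boost into the score.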

What you could do is index the title and each token separately, e.g. the text "wild deep blue endless sea" would be indexed like:
title: wild deep blue endless sea
t1: wild
t2: deep
t3: blue
t4: endless
t5: sea
Then if someone queries "wild deep", the query would be rewritten into
title:"wild deep" OR (t1:wild AND t2:deep)
This way you will always find all matching documents (as long as they match on title), but documents that also match the t1..tN tokens will score higher.
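A sketch of that layout in Java (imports as in the sketch under the first workaround above; using StringField and lowercasing for the per-position fields is my assumption, so that they match exactly and case-insensitively):

// Index: the whole title as a normal tokenized field, plus one exact field per word position.
Document doc = new Document();
String title = "wild deep blue endless sea";
doc.add(new TextField("title", title, Field.Store.YES));
String[] words = title.toLowerCase().split("\\s+");
for (int i = 0; i < words.length; i++) {
    doc.add(new StringField("t" + (i + 1), words[i], Field.Store.NO));
}

// Query "wild deep": title:"wild deep" OR (t1:wild AND t2:deep). Documents that also
// match the positional clause pick up its extra score and rank first.
Query positional = new BooleanQuery.Builder()
        .add(new TermQuery(new Term("t1", "wild")), BooleanClause.Occur.MUST)
        .add(new TermQuery(new Term("t2", "deep")), BooleanClause.Occur.MUST)
        .build();
Query q = new BooleanQuery.Builder()
        .add(new PhraseQuery("title", "wild", "deep"), BooleanClause.Occur.SHOULD)
        .add(positional, BooleanClause.Occur.SHOULD)
        .build();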

Related

Full text search returning too many irrelevant results and causing poor performance

I'm using the full text search feature from Postgres and for the most part it works fine.
I have a column in my database table called documentFts that is basically the tsvector version of the body field (a text column), and it's indexed with a GIN index.
Here's my query:
select
  count(*) OVER() AS full_count,
  id,
  url,
  (("urlScore" / 100) + ts_rank("documentFts", websearch_to_tsquery($4, $1))) as "finalScore",
  ts_headline('english_unaccent', title, websearch_to_tsquery($4, $1)) as title,
  ts_headline('english_unaccent', body, websearch_to_tsquery($4, $1)) as body,
  "possibleEncoding",
  "responseYear"
from "Entries"
where
  "language" = $3 and
  "documentFts" @@ websearch_to_tsquery($4, $1)
order by (("urlScore" / 100) + ts_rank("documentFts", websearch_to_tsquery($4, $1))) desc
limit 20 offset $2;
The dictionary is english_unaccent because I created one based on english that uses the unaccent extension:
CREATE TEXT SEARCH CONFIGURATION english_unaccent (
COPY = english
);
ALTER TEXT SEARCH CONFIGURATION english_unaccent
ALTER MAPPING FOR hword, hword_part, word WITH unaccent,
english_stem;
I did the same for other languages.
And then I did this to my Entries table:
ALTER TABLE "Entries"
ADD COLUMN "documentFts" tsvector;
UPDATE
"Entries"
SET
"documentFts" = (setweight(to_tsvector('english_unaccent', coalesce(title)), 'A') || setweight(to_tsvector('english_unaccent', coalesce(body)), 'C'))
WHERE
"language" = 'english';
I have a column in my table with the language of the entry, hence the "language" = 'english'.
So, the problem I'm having is that for words like animal, anime or animation, they all go into the vector as anim, which means that if I search for any of those words I get results with all of those variations.
That returns a HUGE dataset, which makes the query quite slow compared to searches that return fewer items. Also, if I search for Anime, my first results contain Animal and Animated, and the first result that actually has the word Anime is the 12th one.
Shouldn't animation be transformed to animat in the vector, and animal just stay animal, since its other variations are animals or animalia?
I've been searching for a solution to this without much luck. Is there any way I can improve this? I'm happy to install extensions, reindex the column, or whatever it takes.
There are so many little details to this. The best solution depends on the exact situation and exact requirements.
Two simple options:
Simple tweak 1
If you want to sort rows where title or body have a word starting with 'Anime' (exactly) in it, matched case-insensitively, add an ORDER BY expression like:
ORDER BY unaccent(concat_ws(' ', title, body)) !~* ('\m' || f_regexp_escape($4))
, (("urlScore" / 100) + ts_rank("documentFts", websearch_to_tsquery($4, $1))) DESC
Where the auxiliary function f_regexp_escape() escapes special regexp characters and is defined here:
Escape function for regular expression or LIKE patterns
That expression is rather expensive, but since it's only applied to filtered results, the effect is limited.
You may have to fine-tune, as other search terms present other difficulties. Think of 'body' / 'bodies' stemming to 'bodi' ...
Simple tweak 2
To remove English stemming completely, base yours on the 'simple' TEXT SEARCH CONFIGURATION:
CREATE TEXT SEARCH CONFIGURATION simple_unaccent (
COPY = simple
);
Etc.
Then the actual language of the text is irrelevant. The index gets substantially bigger, and the search is done on literal spellings. You can now widen the search with prefix matching like:
WHERE "documentFts" ## to_tsquery('simple_unaccent', ($1 || ':*')
Again, you'll have to fine-tune. The simple example only works for single-word patterns. And I doubt you want to get rid of stemming altogether. Probably too radical.
See:
Get partial match from GIN indexed TSVECTOR column
Proper solution: Synonym dictionary
You need access to the installation drive of the Postgres server for this, so it's typically not possible with most hosted services.
To overrule some of the stemmer's decisions, add your own set of synonym(rule)s. Create a mapping file $SHAREDIR/tsearch_data/my_synonyms.syn. That's /usr/share/postgresql/13/tsearch_data/my_synonyms.syn in my Linux installation.
Let it contain (case insensitive by default):
anime anime
Then:
CREATE TEXT SEARCH DICTIONARY my_synonym (
TEMPLATE = synonym,
SYNONYMS = my_synonyms
);
There is a chapter with instructions in the manual. One quote:
A synonym dictionary can be used to overcome linguistic problems, for example, to prevent an English stemmer dictionary from reducing the word “Paris” to “pari”. It is enough to have a Paris paris line in the synonym dictionary and put it before the english_stem dictionary.
Then:
CREATE TEXT SEARCH CONFIGURATION my_english_unaccent (
COPY = english
);
ALTER TEXT SEARCH CONFIGURATION my_english_unaccent
ALTER MAPPING FOR hword, hword_part, word
WITH unaccent, my_synonym, english_stem; -- added my_synonym!
You have to update your column "documentFts" with my_english_unaccent. While you're at it, use a proper lower-case column name like document_fts, and consider a GENERATED column. See:
Computed / calculated / virtual / derived columns in PostgreSQL
Are PostgreSQL column names case-sensitive?
Now, searching for Anime (or ánime, for that matter) won't find animal any more. And searching for animal won't find Anime.

Amazon CloudSearch returns false results

I have a DB of articles, and I would like to search for all the articles that:
1. contain the word 'RIO' in either the title or the excerpt
2. contain the word 'BRAZIL' in the parent_post_content
3. fall within a certain time range
The (structured) query I searched with was:
(and (phrase field=parent_post_content 'BRAZIL') (range field=post_date ['2016-02-16T08:13:26Z','2016-09-16T08:13:26Z'}) (or (phrase field=title 'RIO') (phrase field=excerpt 'RIO')))
but for some reason I get results that contain 'RIO' in the title but do not contain 'BRAZIL' in the parent_post_content.
This is especially weird because I tried to condition only on the title (and not the excerpt) with this query:
(and (phrase field=parent_post_content 'BRAZIL') (range field=post_date ['2016-02-16T08:13:26Z','2016-09-16T08:13:26Z'}) (phrase field=name 'RIO'))
and the results seem OK.
I'm fairly new to CloudSearch, so I very likely have syntax errors, but I can't seem to find them. Help?
You're using the phrase operator but not actually searching for a phrase; it would be best to use the term operator (or no operator) instead. I can't see why it should matter, but using an operator outside of its intended purpose can invite unintended consequences.
Here is how I'd re-write your queries:
Using term (mainly just used if you want to boost fields):
(and (term field=parent_post_content 'BRAZIL') (range field=post_date ['2016-02-16T08:13:26Z','2016-09-16T08:13:26Z'}) (or (term field=title 'RIO') (term field=excerpt 'RIO')))
Without an operator (I find this simplest):
(and parent_post_content:'BRAZIL' (range field=post_date ['2016-02-16T08:13:26Z','2016-09-16T08:13:26Z'}) (or title:'RIO' excerpt:'RIO'))
If that fails, can you post the complete query? I'd like to check that, for example, you're using the structured query parser since you mentioned you're new to CloudSearch.
Here are some relevant docs from Amazon:
Compound queries for more on the various operators
Searching text for specifics on the phrase operator
Apparently the problem was not with the query, but with the displayed content. I foolishly trusted that the content displayed on the CloudSearch site was complete, and so concluded that it did not contain 'BRAZIL'. But alas, it is not the full content, and when I checked the full content, 'BRAZIL' was there.
Sorry for the foolery.

Using wildcard and required operator in an Elasticsearch search

We have various rows inside our Elasticsearch index that contain the text
"... 2% milk ...".
User enters a query like "2% milk" into a search field and we transform it internally to a query
title:(+milk* +2%*)
because all terms should be required and we are possibly also interested in rows that contain "2% milkfat".
The query above returns zero hits. Changing the query to
title:(+milk* +2%)
returns reasonable results. So why does the '*' operator in the first query not work?
Unless you set a mapping, the "%" sign will get removed in the tokenization process. Basically "2% milk" will get turned into the tokens 2 and milk.
When you search for "2%*" it looks for tokens like: 2%, 2%a, 2%b, etc... and not match any indexed tokens, giving no hits.
When you search for "2%", it will go through the same tokenization process as at index-time (you can specify this, but the default tokenization is the same) and you will be looking for documents matching the token 2, which will give you a hit.
You can read more about the analysis/tokenization process in the Elasticsearch documentation, and you can set up the analysis you want by defining a custom mapping.
Good luck!
Prefix and Wildcard queries do not appear to apply the Analyzer to their content. To provide a few examples:
title:(+milk* +2%) --> +title:milk* +title:2
title:(+milk* +2%*) --> +title:milk* +title:2%*
title:(+milk* +2%3) --> +title:milk* +(title:2 title:3)
title:(+milk* +2%3*) --> +title:milk* +title:2%3*
+title:super\\-milk --> +title:super title:milk
+title:super\\-milk* --> +title:super-milk*
It does make some sense to prevent tokenization of wildcard queries, since wildcard phrase queries are not allowed. If tokenization were allowed, it would raise the question, especially with embedded wildcards, of just how many terms a wildcard can span.
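One way to see those rewrites for yourself is to feed the query strings through Lucene's classic QueryParser (a small illustration of the same analysis behaviour, not from the original answer):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.QueryParser;

public class WildcardParseDemo {
    public static void main(String[] args) throws Exception {
        QueryParser parser = new QueryParser("title", new StandardAnalyzer());
        // The plain term is analyzed ("2%" becomes "2"); the wildcard term is left alone.
        System.out.println(parser.parse("+milk* +2%"));   // +title:milk* +title:2
        System.out.println(parser.parse("+milk* +2%*"));  // +title:milk* +title:2%*
    }
}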

How to do a startsWith and then Contains Search using Lucene.NET 3.0?

What is the best way to search and index in Lucene.NET 3.0 so that the results come out ordered in the following way:
Results that start with the full query text (as a single word) e.g. "Bar Acme"
Results that start with the search term as a word fragment e.g. "Bart Simpson"
Results that contain the query text as a full word e.g. "National Bar Association"
Results that contain the query text as a fragment e.g. "United Bartenders Inc"
Example: Searching for Bar
Ordered Results:
Bar Acme
Bar Lunar
Bart Simpson
National Bar Association
International Bartenders Association
Lucene doesn't generally support searching/scoring based on position within a field. It would be possible to support it if you prefix every field with some known field-start delimiter, or something like that. I don't really think it makes sense through the lens of full-text search, where position within the text field isn't relevant (i.e. if I were searching for Bar in a document, I would likely be rather annoyed if "Bart Simpson" were returned before "National Bar Association").
Apart from that, though, a simple prefix search handles everything else. So if you simply add your start-of-word token, you can search for the modified term with a higher-boosted prefix query than the original, and then you should have precisely what you describe.
It can also be achieved with LINQ. Run the Lucene search with a hit count of Int32.MaxValue, loop over the ScoreDocs, and store the results in a collection Searchresults.
sample code:
Searchresults = (from scoreDoc in results.ScoreDocs
                 select new SearchResults { suggestion = searcher.Doc(scoreDoc.Doc).Get("suggestion") })
                .OrderBy(x => x.suggestion).ToList();
SearchresultsStartswith = Searchresults
    .Where(x => x.suggestion.ToLower().StartsWith(searchStringLinq.ToLower()))
    .Take(10).ToList();
if (SearchresultsStartswith.Count > 0)
    return SearchresultsStartswith.ToList();
else
    return Searchresults.Take(10).ToList();

Need to extract information from free text, information like location, course etc

I need to write a text parser for the education domain which can extract out the information like institute, location, course etc from the free text.
Currently I am doing it with Lucene; the steps are as follows:
Index all the data related to institute, courses and location.
Make shingles of the free text and search each shingle in the location, course and institute index directories, then try to work out which part of the text represents a location, course, etc.
With this approach I am missing a lot of cases, e.g. B.tech can be written as btech, b-tech or b.tech.
I want to know whether there is anything available that can handle all these kinds of things. I have heard about LingPipe and GATE but don't know how efficient they are.
You definitely need GATE. GATE has two main, most frequently used features (among thousands of others): rules and dictionaries. Dictionaries (gazetteers in GATE's terms) allow you to put all possible cases like "B.tech", "btech" and so on into a single text file and let GATE find and mark them all. Rules (more precisely, JAPE rules) allow you to define patterns in text. For example, here's a pattern to catch MIT's postal address ("77 Massachusetts Ave., Building XX, Cambridge MA 02139"):
{Token.kind == number}(SP){Token.orth == uppercase}(SP){Lookup.majorType == avenue}(COMMA)(SP)
{Token.string == "Building"}(SP){Token.kind == number}(COMMA)(SP)
{Lookup.majorType == city}(SP){Lookup.majorType == USState}(SP){Token.kind == number}
where (SP) and (COMMA) are macros (just to make the text shorter), {Something} is an annotation, {Token.kind == number} is an annotation "Token" with feature "kind" equal to "number" (i.e. just a number in the text), and {Lookup} is an annotation that captures values from a dictionary (BTW, GATE already has dictionaries for such things as US cities). This is quite a simple example, but you should see how easily you can cover even very complicated cases.
I haven't used Lucene, but in your case I would leave the different forms of the same keyword as they are and just keep a link table (or similar) that records the relation between these different forms.
You may need to write a regular expression to cover each possible form of your vocabulary (see the small sketch below).
Be careful about your choice of analyzer / tokenizer, because words like B.tech can easily be split into two different words (i.e. B and tech).
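If you do go the regular-expression route for the degree names, here is a tiny illustration of collapsing the variants to one canonical spelling before indexing or lookup (the class, pattern and canonical form are all made up for the example):

import java.util.regex.Pattern;

public class DegreeNormalizer {
    // Matches "B.tech", "btech", "b-tech", "B Tech", etc., case-insensitively.
    private static final Pattern BTECH = Pattern.compile("(?i)\\bb[\\s.\\-]?tech\\b");

    public static String normalize(String text) {
        return BTECH.matcher(text).replaceAll("btech");
    }

    public static void main(String[] args) {
        System.out.println(normalize("Completed B.Tech at IIT"));  // Completed btech at IIT
    }
}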
You may want to check out UIMA. Like LingPipe and GATE, this framework features text annotation, which is what you are trying to do. Here is a tutorial which will help you write an annotator for UIMA:
http://uima.apache.org/d/uimaj-2.3.1/tutorials_and_users_guides.html#ugr.tug.aae.developing_annotator_code
UIMA has add-ons, in particular one for Lucene integration.
You can try http://code.google.com/p/graph-expression/
An example of address parsing rules:
GraphRegExp.Matcher Token = match("Token");
GraphRegExp.Matcher Country = GraphUtils.regexp("^USA$", Token);
GraphRegExp.Matcher Number = GraphUtils.regexp("^\\d+$", Token);
GraphRegExp.Matcher StateLike = GraphUtils.regexp("^([A-Z]{2})$", Token);
GraphRegExp.Matcher Postoffice = seq(match("BoxPrefix"), Number);
GraphRegExp.Matcher Postcode = mark("Postcode",
    seq(GraphUtils.regexp("^\\d{5}$", Token), opt(GraphUtils.regexp("^\\d{4}$", Token))));
//mark(String, Matcher) -- means creating chunk over sub matcher
GraphRegExp.Matcher streetAddress = mark("StreetAddress", seq(Number, times(Token, 2, 5).reluctant()));
//without new lines
streetAddress = regexpNot("\n", streetAddress);
GraphRegExp.Matcher City = mark("City", GraphUtils.regexp("^[A-Z]\\w+$", Token));
Chunker chunker = Chunkers.pipeline(
Chunkers.regexp("Token", "\\w+"),
Chunkers.regexp("BoxPrefix", "\\b(POB|PO BOX)\\b"),
new GraphExpChunker("Address",
seq(
opt(streetAddress),
opt(Postoffice),
City,
StateLike,
Postcode,
Country
)
).setDebugString(true)
);
B.tech can be written as btech, b-tech or b.tech
Lucene will let you do fuzzy searches based on the Levenshtein Distance. A query for roam~ (note the ~) will find terms like foam and roams.
That might allow you to match the different cases.
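For example, the programmatic form of that query (the field name is invented for the example):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.FuzzyQuery;

// Equivalent of the query string roam~ : matches indexed terms within 2 edits of "roam",
// such as "foam" or "roams".
FuzzyQuery query = new FuzzyQuery(new Term("title", "roam"), 2);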