How to prioritize complete sentences instead of single search terms? - amazon-cloudsearch

In my CloudSearch instance I want to give a greater priority to complete phrases instead of just word counts.
For example, when I search for "foo bar", I would like documents that have "foo" and "bar" next to each other to score higher than documents that have the two terms scattered throughout the text. Of course, any other document containing either word should still be retrieved, just not scored as highly.
Any ideas on how such a query could be written?

$searchQuery = array(
    'query' => "(or ".$searchParam." (phrase boost=10 ".$searchParam."))",
    'queryParser' => 'structured',
    'queryOptions' => json_encode(array('defaultOperator' => 'or'))
);
This worked to some extent.
The boost value is totally arbitrary.
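
As a rough illustration of the same idea, here is a minimal Python sketch that builds a structured query string: each individual term as an alternative, plus the whole phrase with a boost. The helper names (quote_term, phrase_boosted_query) are hypothetical, the default boost of 10 is as arbitrary as in the PHP above, and the quoting rule (single quotes, with backslashes and single quotes escaped) is my reading of the structured syntax, so check it against the CloudSearch documentation.

```python
def quote_term(value: str) -> str:
    # Assumption: structured-parser strings are single-quoted, with
    # backslashes and single quotes escaped by a backslash.
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def phrase_boosted_query(text: str, boost: int = 10) -> str:
    terms = text.split()
    if not terms:
        raise ValueError("search text must contain at least one term")
    if len(terms) == 1:
        # A single term has no phrase to boost.
        return quote_term(terms[0])
    parts = [quote_term(term) for term in terms]
    # The exact phrase is added as one more alternative, with a boost.
    parts.append(f"(phrase boost={boost} {quote_term(' '.join(terms))})")
    return "(or " + " ".join(parts) + ")"


print(phrase_boosted_query("foo bar"))
```

The resulting string would be sent as the query with the structured parser selected. Documents matching only one term still match the "or", while those containing the exact phrase also match the boosted clause.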


Escaping special characters & Encoding unsafe and reserved characters Lucene query syntax Azure Search

I have the words "C&&K", "So`am`I" , "Ant||Man", "A*B==AB", "Ant+Man" in an Azure Search index.
According to the documentation on escaping special characters, + - && || ! ( ) { } [ ] ^ " ~ * ? : \ / must be prefixed with a backslash (\), and unsafe and reserved characters must be URL-encoded.
for "C&&K" my search url => /indexes/{index-name}/docs?api-version=2017-11-11&search=C%5C%26%5C%26K~&queryType=full
for "So`am`I" my search url => /indexes/{index-name}/docs?api-version=2017-11-11&search=So%5C%60am%5C%60I~&queryType=full
for "Ant||Man" my search url => /indexes/{index-name}/docs?api-version=2017-11-11&search=A%5C*B%3D%3DAB~&queryType=full
for "A*B==AB" my search url => /indexes/{index-name}/docs?api-version=2017-11-11&search=A%5C*B%3D%3DAB~&queryType=full
for "Ant+Man" my search url => /indexes/{index-name}/docs?api-version=2017-11-11&search=Ant%5C%2BMan~&queryType=full
For all of them I get no search results, only "value": []
for "C&&K" I have also tried
url => /indexes/{index-name}/docs?api-version=2017-11-11&search=C%5C%26%26K~&queryType=full
url => /indexes/{index-name}/docs?api-version=2017-11-11&search=C%26%5C%26K~&queryType=full
for "So`am`I" I have also tried
url => /indexes/{index-name}/docs?api-version=2017-11-11&search=So%60am%60I~&queryType=full
It does not work. What am I doing wrong here?
With standard analysis, all of these would be indexed as multiple terms. Fuzzy queries, however, are not analyzed, so it will attempt to find it as a single term. That is, when you index "Ant||Man", after analysis, you end up with the terms "ant" and "man" in the index. When you search for Ant||Man, it will analyze it in much the same way as at index time, but when searching for Ant||Man~, the query won't be analyzed, and since no terms like that exist in the index, you won't get any matches. Similarly, for "A*B==AB" you get the terms "b" and "ab" ("a" is a stop word with default analysis).
So, try the queries without the ~.
In addition to femtoRgon's response, you may want to consider using a custom analyzer that does not index these as multiple terms if you would always like them to be searchable as they are. There is documentation on custom analyzers here, and you can use the Analyze API to test to make sure a given analyzer works as you expect.
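
For reference, the escaping and URL-encoding steps described in the question can be sketched in Python as below, leaving off the trailing ~ as the first answer suggests. The function names and the "my-index" placeholder are hypothetical; the set of special characters is the one listed in the question, escaped character by character.

```python
from urllib.parse import quote

# Special characters of the full Lucene syntax, as listed in the question.
LUCENE_SPECIAL = set('+-&|!(){}[]^"~*?:\\/')


def escape_lucene(term: str) -> str:
    # Prefix every special character with a backslash.
    return "".join("\\" + ch if ch in LUCENE_SPECIAL else ch for ch in term)


def search_url(index_name: str, term: str, api_version: str = "2017-11-11") -> str:
    # safe="" makes quote() percent-encode every reserved character,
    # including the backslashes added by escape_lucene.
    encoded = quote(escape_lucene(term), safe="")
    return (f"/indexes/{index_name}/docs?api-version={api_version}"
            f"&search={encoded}&queryType=full")


print(search_url("my-index", "C&&K"))
```

Note that quote() also percent-encodes characters such as *, which the hand-built URLs in the question left literal; both forms decode to the same query text.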

Accented or special characters in RavenDB index

I have several fields in my collection that contain accented characters, and the languages from which the words come are quite varied: Czech, German, Spanish, Finnish, Hungarian, etc.
I have noticed that when searching for, e.g. "Andalucía" (note the accented i), the query comes up empty - however, searching for "Andaluc*" returns what I am looking for.
I have found this in the RavenDB docs, and wanted to ask if changing the field indexing method from default to exact would solve my problem.
Thanks !
EDIT: RavenDB appears to be dropping letters after AND including the accented character in the search. In the cmd window, I can see the query (which I enter from RavenDB Studio as NAME_1:Andalucía) coming out as (...)/ByName?term=Andaluc&field=NAME_1&max(...)
When I navigate to the terms of the index, I can see "andalucía" (lowercase !!). The index definition is simply a "select new { NAME_1 = area.NAME_1 }". Forgot to mention I'm still on RavenDB 3.5.
Index definition:
Map = areas => from area in areas
               select new
               {
                   NAME_0 = area.NAME_0,
                   NAME_1 = area.NAME_1
               };
Indexes.Add(x => x.NAME_1, FieldIndexing.Analyzed);
//Analyzers.Add(x => x.NAME_1, typeof(StandardAnalyzer).FullName);
The commented-out line doesn't work because the type StandardAnalyzer doesn't resolve in my VS2017 project. I'm currently looking into how to get either the dll or the correct using statement.
Querying for Andalucía in quotation marks results in the "correct query" being sent to Raven: (...)/ByName?term=Andalucía&field=NAME_1&max=5(...) - but produces no results.
FURTHER EDIT: Found the Lucene dll, included it in the project, and used the StandardAnalyzer as the analyzer - same result (no results found).
On RavenDB 4, this appears fixed. meh
You need to verify that both 'Full-Text-Search' and 'Suggestions' options are 'turned on' in the index.
You need to specify the field you want the suggestions for.
Add this in your index definition:
Suggestion(x => x.NAME_1);
You must not have the following line of code in your index definition on the properties where you perform search operations:
Indexes.Add(x => x.PropertyXYZ, FieldIndexing.No);
By default, if you have not changed the indexing setting for the field, your query should work.

Can I clear the stopword list in lucene.net in order for exact matches to work better?

When dealing with exact matches I'm given a real world query like this:
"not in education, employment, or training"
Converted to a Lucene query with stopwords removed gives:
+Content:"? ? education employment ? training"
Here's a more contrived example:
"there is no such thing"
Converted to a Lucene query with stopwords removed gives:
+Content:"? ? ? ? thing"
My goal is to have searches like these match only the exact match as the user entered it.
Could one solution be to clear the stopwords list? Would this have adverse effects? If so, what? (my google-fu failed)
This all depends on the analyzer you are using. The StandardAnalyzer uses stop words and strips them out; in fact, the StopAnalyzer is where the StandardAnalyzer gets its stop words from.
Use the WhitespaceAnalyzer or create your own by inheriting from one that most closely suits your needs and modify it to be what you want.
Alternatively, if you like the StandardAnalyzer you can new one up with a custom stop word list:
// This is the default stop word list, in case you want to use or filter it
var defaultStopWords = StopAnalyzer.ENGLISH_STOP_WORDS_SET;
// Create a new StandardAnalyzer with custom stop words
var sa = new StandardAnalyzer(
    Version.LUCENE_29, // depends on your version
    new HashSet<string> // pass in your own stop word list
    {
        "hello",
        "world"
    });
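
To see where the ? placeholders in the question come from, and what an empty stop word list changes, here is a small Python illustration. It is not Lucene code: phrase_positions is a hypothetical helper that only mimics the effect of stop word removal on a phrase query, and the word list is my recollection of Lucene's default English set, so verify it against your version.

```python
import re

# Assumption: this mirrors Lucene's default English stop word set.
ENGLISH_STOP_WORDS = frozenset({
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such",
    "that", "the", "their", "then", "there", "these", "they", "this",
    "to", "was", "will", "with",
})


def phrase_positions(text: str, stop_words=ENGLISH_STOP_WORDS) -> str:
    # Lowercase and split on non-word characters, then show each removed
    # stop word as "?", i.e. a position that any token may fill.
    tokens = re.findall(r"\w+", text.lower())
    return " ".join("?" if token in stop_words else token for token in tokens)


print(phrase_positions("not in education, employment, or training"))
print(phrase_positions("there is no such thing", stop_words=frozenset()))
```

With the default list, "there is no such thing" keeps only "thing" at a fixed position, so any five-word sequence ending in "thing" can match. With an empty list every word must match, at the cost of a larger index, and the index has to be rebuilt with the same analyzer that is used at query time.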

How to do a startsWith and then Contains Search using Lucene.NET 3.0?

What is the best way to search and Index in Lucene.NET 3.0 so that the results come out ordered in the following way:
Results that start with the full query text (as a single word) e.g. "Bar Acme"
Results that start with the search term as a word fragment e.g. "Bart Simpson"
Results that contain the query text as a full word e.g. "National Bar Association"
Results that contain the query text as a fragment e.g. "United Bartenders Inc"
Example: Searching for Bar
Ordered Results:
Bar Acme
Bar Lunar
Bart Simpson
National Bar Association
International Bartenders Association
Lucene doesn't generally support searching or scoring based on position within a field. It would be possible if you prefixed every field with some known field-start delimiter, or something similar. I don't really think it makes sense through the lens of a full-text search, where position within the text field isn't relevant (i.e. if I were searching for Bar in a document, I would likely be rather annoyed if "Bart Simpson" were returned before "National Bar Association").
Apart from that though, a simple prefix search handles everything else. So if you simply add your start of word token, you can search for the modified term with a higher boost prefix query than the original, and then you should have precisely what you describe.
This can be achieved with LINQ. Run the Lucene search with a hit count of Int32.MaxValue, then loop over ScoreDocs and store the results in a collection, Searchresults.
Sample code:
// Collect every hit's "suggestion" field, sorted alphabetically.
Searchresults = (from scoreDoc in results.ScoreDocs
                 select new SearchResults { suggestion = searcher.Doc(scoreDoc.Doc).Get("suggestion") })
                .OrderBy(x => x.suggestion)
                .ToList();
// Keep the suggestions that start with the search string (case-insensitive).
SearchresultsStartswith = Searchresults
    .Where(x => x.suggestion.ToLower().StartsWith(searchStringLinq.ToLower()))
    .Take(10)
    .ToList();
if (SearchresultsStartswith.Count > 0)
    return SearchresultsStartswith;
else
    return Searchresults.Take(10).ToList();
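
The LINQ approach above only separates "starts with" from everything else. As a sketch of the full four-tier ordering the question asks for, the same post-processing idea could look like this in Python. It sorts results after retrieval rather than using Lucene scoring, and the names match_tier and rank_results are hypothetical.

```python
def match_tier(name: str, query: str) -> int:
    lowered = name.lower()
    q = query.lower()
    words = lowered.split()
    if words and words[0] == q:
        return 0  # starts with the query as a whole word
    if lowered.startswith(q):
        return 1  # starts with the query as a word fragment
    if q in words:
        return 2  # contains the query as a whole word
    if q in lowered:
        return 3  # contains the query as a fragment
    return 4      # no match at all


def rank_results(names, query):
    # Order by tier first, then alphabetically within each tier.
    return sorted(names, key=lambda name: (match_tier(name, query), name.lower()))


results = [
    "National Bar Association",
    "Bart Simpson",
    "International Bartenders Association",
    "Bar Lunar",
    "Bar Acme",
]
for name in rank_results(results, "Bar"):
    print(name)
```

Like the LINQ version, this needs the whole result set in memory, so it suits suggestion-style lookups better than large full-text result sets.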

ActiveRecord search model using LIKE only returning exact matches

In my rails app I am trying to search the Users model based on certain conditions.
In particular, I have a location field which is a string and I want to search this field based on whether it contains the search string. For example, if I search for users with location 'oxford' I want it to also return users with a variation on that, like 'oxford, england'.
Having searched the web for the answer to this it seems that I should be using the LIKE keyword in the activerecord search, but for me this is only returning exact matches.
Here is a snippet of my code from the search method
conditions_array = []
conditions_array << [ 'lower(location) LIKE ?', options[:location].downcase ] if !options[:location].empty?
conditions = build_search_conditions(conditions_array)
results = User.where(conditions)
Am I doing something wrong? Or is using LIKE not the right approach to achieving my objective?
You need to wrap the search string in % wildcards, e.g. LIKE '%oxford%'.
% matches any number of characters, including zero. Without it, LIKE only matches the exact string.
conditions_array << [ 'lower(location) LIKE ?', "%#{options[:location].downcase}%" ] if !options[:location].empty?
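
The difference between the two patterns is easy to check outside Rails. The following Python sketch uses an in-memory SQLite database as a stand-in; the users table and the find_by_location helper are made up for the demonstration.

```python
import sqlite3


def find_by_location(conn, search, contains=True):
    # With contains=True the term is wrapped in % wildcards, so it may
    # appear anywhere in the column; otherwise LIKE acts as an exact match.
    pattern = f"%{search.lower()}%" if contains else search.lower()
    rows = conn.execute(
        "SELECT location FROM users WHERE lower(location) LIKE ? ORDER BY location",
        (pattern,),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (location TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?)",
    [("Oxford",), ("Oxford, England",), ("Cambridge",)],
)

print(find_by_location(conn, "oxford", contains=False))
print(find_by_location(conn, "oxford", contains=True))
```

Without the wildcards only the row that equals the search string comes back; with them, "Oxford, England" is returned as well.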