Accented or special characters in RavenDB index - indexing

I have several fields in my collection that contain accented characters, and the languages from which the words come are quite varied: Czech, German, Spanish, Finnish, Hungarian, etc.
I have noticed that when searching for, e.g. "Andalucía" (note the accented i), the query comes up empty - however, searching for "Andaluc*" returns what I am looking for.
I have found this in the RavenDB docs, and wanted to ask if changing the field indexing method from default to exact would solve my problem.
Thanks !
EDIT: RavenDB appears to be dropping letters after AND including the accented character in the search. In the cmd window, I can see the query (which I enter from RavenDB Studio as NAME_1:Andalucía) coming out as (...)/ByName?term=Andaluc&field=NAME_1&max(...)
When I navigate to the terms of the index, I can see "andalucía" (lowercase !!). The index definition is simply a "select new { NAME_1 = area.NAME_1 }". Forgot to mention I'm still on RavenDB 3.5.
Index definition:
Map = areas => from area in areas
               select new
               {
                   NAME_0 = area.NAME_0,
                   NAME_1 = area.NAME_1
               };
Indexes.Add(x => x.NAME_1, FieldIndexing.Analyzed);
//Analyzers.Add(x => x.NAME_1, typeof(StandardAnalyzer).FullName);
The commented-out line doesn't work because the type StandardAnalyzer doesn't resolve in my VS2017 project. I'm currently looking into how to get either the dll or the correct using statement.
Querying for Andalucía in quotation marks results in the "correct query" being sent to Raven: (...)/ByName?term=Andalucía&field=NAME_1&max=5(...) - but produces no results.
FURTHER EDIT: Found the Lucene dll, included it in the project, used the StandardAnalyzer as the analyzer - same result (no results found).
On RavenDB 4, this appears fixed. meh

You need to verify that both 'Full-Text-Search' and 'Suggestions' options are 'turned on' in the index.

You need to specify the field you want the suggestions for.
Add this in your index definition:
Suggestion(x => x.NAME_1);

You must not have the following line of code in your index definition on the properties where you perform search operations:
Indexes.Add(x => x.PropertyXYZ, FieldIndexing.No);
By default, if you have not changed the indexing option for the field, your query works.
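To rule out an encoding problem in the request itself, it can help to compare the logged term=Andaluc against what a correctly percent-encoded term looks like. The sketch below uses only Python's standard library (not the RavenDB client) and assumes the query string is encoded as UTF-8. Under that assumption the full term would arrive as Andaluc%C3%ADa, so a request that ends at Andaluc has been cut at the first non-ASCII character.

```python
from urllib.parse import quote, unquote

term = "Andalucía"

# Percent-encode the term as UTF-8 (assumption: the server expects UTF-8 query strings).
encoded = quote(term, safe="")
print(encoded)

# Decoding the encoded form gives back the original term, accent included.
decoded = unquote(encoded)
print(decoded == term)

# Lowercasing keeps the accent, which is consistent with the "andalucía" term seen in the index.
lowered = term.lower()
print(lowered)
```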

Related

Escaping special characters & Encoding unsafe and reserved characters Lucene query syntax Azure Search

I have words "C&&K", "So`am`I" , "Ant||Man", "A*B==AB", "Ant+Man" in index of azure search.
According to the docs on escaping special characters (+ - && || ! ( ) { } [ ] ^ " ~ * ? : \ /), I need to prefix them with a backslash (\), and unsafe and reserved characters need to be URL-encoded.
for "C&&K" my search url => /indexes/{index-name}/docs?api-version=2017-11-11&search=C%5C%26%5C%26K~&queryType=full
for "So`am`I" my search url => /indexes/{index-name}/docs?api-version=2017-11-11&search=So%5C%60am%5C%60I~&queryType=full
for "Ant||Man" my search url => /indexes/{index-name}/docs?api-version=2017-11-11&search=Ant%5C%7C%5C%7CMan~&queryType=full
for "A*B==AB" my search url => /indexes/{index-name}/docs?api-version=2017-11-11&search=A%5C*B%3D%3DAB~&queryType=full
for "Ant+Man" my search url => /indexes/{index-name}/docs?api-version=2017-11-11&search=Ant%5C%2BMan~&queryType=full
For all of them I get no search results; the response contains "value": []
for "C&&K" I have also tried
url => /indexes/{index-name}/docs?api-version=2017-11-11&search=C%5C%26%26K~&queryType=full
url => /indexes/{index-name}/docs?api-version=2017-11-11&search=C%26%5C%26K~&queryType=full
for "So`am`I" I have also tried
url => /indexes/{index-name}/docs?api-version=2017-11-11&search=So%60am%60I~&queryType=full
It does not work. What am I doing wrong here?
With standard analysis, all of these would be indexed as multiple terms. Fuzzy queries, however, are not analyzed, so the search will attempt to find the text as a single term. That is, when you index "Ant||Man", after analysis you end up with the terms "ant" and "man" in the index. When you search for Ant||Man, the query is analyzed in much the same way as at index time, but when you search for Ant||Man~, the query isn't analyzed, and since no such term exists in the index, you won't get any matches. Similarly, for "A*B==AB" you get the terms "b" and "ab" ("a" is a stop word with default analysis).
So, try the queries without the ~.
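A rough illustration of the point above: the sketch below imitates what a standard-style analyzer does at index time (lowercase, split on anything that is not a letter or digit, drop stop words). It is a deliberately simplified stand-in written in plain Python, not the analyzer Azure Search actually runs, and the stop word set is a small assumed subset.

```python
import re

# Assumption: a small subset of English stop words, enough for this illustration.
STOP_WORDS = {"a", "an", "and", "the", "of", "or", "in", "is"}

def rough_analyze(text):
    """Lowercase, split on non-alphanumeric characters, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

for value in ["C&&K", "Ant||Man", "A*B==AB", "Ant+Man"]:
    print(value, "->", rough_analyze(value))
```

None of the resulting term lists contains the original string as a single term, which is why an unanalyzed fuzzy query such as Ant||Man~ has nothing to match against.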
In addition to femtoRgon's response, you may want to consider using a custom analyzer that does not index these as multiple terms if you would always like them to be searchable as they are. There is documentation on custom analyzers here, and you can use the Analyze API to test to make sure a given analyzer works as you expect.
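Separately from the choice of analyzer, the escape-then-encode step described in the question can be sketched as follows. This uses only Python's standard library and sends no request to Azure Search; the function names are hypothetical, and the set of special characters is the one listed in the question.

```python
from urllib.parse import quote

# Characters the question lists as special in the full Lucene query syntax.
SPECIAL = set('+-&|!(){}[]^"~*?:\\/')

def escape_lucene(term):
    """Prefix every special character with a backslash."""
    return "".join("\\" + ch if ch in SPECIAL else ch for ch in term)

def encode_search_param(term):
    """Escape for the Lucene syntax first, then percent-encode for use in a URL."""
    return quote(escape_lucene(term), safe="")

for value in ["C&&K", "Ant||Man", "Ant+Man"]:
    print(value, "->", encode_search_param(value))
```

The order matters: escaping happens on the raw text, and URL encoding is applied last so that the backslashes themselves are encoded as %5C.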

Can I clear the stopword list in lucene.net in order for exact matches to work better?

When dealing with exact matches I'm given a real world query like this:
"not in education, employment, or training"
Converted to a Lucene query with stopwords removed gives:
+Content:"? ? education employment ? training"
Here's a more contrived example:
"there is no such thing"
Converted to a Lucene query with stopwords removed gives:
+Content:"? ? ? ? thing"
My goal is to have searches like these match only the exact match as the user entered it.
Could one solution be to clear the stop word list? Would this have adverse effects? If so, what? (My google-fu failed.)
This all depends on the analyzer you are using. The StandardAnalyzer uses stop words and strips them out; in fact, the StopAnalyzer is where the StandardAnalyzer gets its stop words from.
Use the WhitespaceAnalyzer or create your own by inheriting from one that most closely suits your needs and modify it to be what you want.
Alternatively, if you like the StandardAnalyzer you can new one up with a custom stop word list:
// This is the default stop word list, in case you want to use or filter it
var defaultStopWords = StopAnalyzer.ENGLISH_STOP_WORDS_SET;
// Create a new StandardAnalyzer with custom stop words
var sa = new StandardAnalyzer(
    Version.LUCENE_29,   // depends on your version
    new HashSet<string>  // pass in your own stop word list
    {
        "hello",
        "world"
    });
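To see why the stop word list matters for phrase matching, the following plain-Python sketch reproduces the "?" placeholders from the question. It is not Lucene code: the tokenizer is a simplification, the function name is hypothetical, and the stop word set is an assumption intended to mirror the classic default English list, so check it against your Lucene.NET version.

```python
import re

# Assumption: intended to mirror the classic default English stop word list.
DEFAULT_STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
}

def phrase_after_stop_removal(text, stop_words=DEFAULT_STOP_WORDS):
    """Replace each stop word with '?' to show the position gaps left in the phrase."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join("?" if t in stop_words else t for t in tokens)

print(phrase_after_stop_removal("not in education, employment, or training"))
print(phrase_after_stop_removal("there is no such thing"))
# With an empty stop word list every word is kept, so the phrase stays exact.
print(phrase_after_stop_removal("there is no such thing", stop_words=set()))
```

The trade-off of an empty list is a larger index and very frequent terms taking part in scoring, so it is worth measuring on real data before committing to it.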

Amazon CloudSearch returns false results

I have a DB of articles, and I would like to search for all the articles that:
1. contain the word 'RIO' in either the title or the excerpt
2. contain the word 'BRAZIL' in the parent_post_content
3. and in a certain time range
The (structured) query I searched with was:
(and (phrase field=parent_post_content 'BRAZIL') (range field=post_date ['2016-02-16T08:13:26Z','2016-09-16T08:13:26Z'}) (or (phrase field=title 'RIO') (phrase field=excerpt 'RIO')))
but for some reason I get results that contain 'RIO' in the title but do not contain 'BRAZIL' in the parent_post_content.
This is especially weird because I tried to condition only on the title (and not the excerpt) with this query:
(and (phrase field=parent_post_content 'BRAZIL') (range field=post_date ['2016-02-16T08:13:26Z','2016-09-16T08:13:26Z'}) (phrase field=name 'RIO'))
and the results seem OK.
I'm fairly new to CloudSearch, so I very likely have syntax errors, but I can't seem to find them. Help?
You're using the phrase operator but not actually searching for a phrase; it would be best to use the term operator (or no operator) instead. I can't see why it should matter but using something outside of how it was intended to be used can invite unintended consequences.
Here is how I'd re-write your queries:
Using term (mainly just used if you want to boost fields):
(and (term field=parent_post_content 'BRAZIL') (range field=post_date ['2016-02-16T08:13:26Z','2016-09-16T08:13:26Z'}) (or (term field=title 'RIO') (term field=excerpt 'RIO')))
Without an operator (I find this simplest):
(and parent_post_content:'BRAZIL' (range field=post_date ['2016-02-16T08:13:26Z','2016-09-16T08:13:26Z'}) (or title:'RIO' excerpt:'RIO'))
If that fails, can you post the complete query? I'd like to check that, for example, you're using the structured query parser since you mentioned you're new to CloudSearch.
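For comparison, a complete request for the structured parser would carry both the query and q.parser=structured. The sketch below only builds the path and query string with Python's standard library; the helper name is hypothetical, the search domain host is left out, and the 2013-01-01 API version is an assumption about the domain in use.

```python
from urllib.parse import urlencode, parse_qs

def build_search_path(structured_query):
    """Return the path and query string for a structured CloudSearch search."""
    params = {"q": structured_query, "q.parser": "structured"}
    return "/2013-01-01/search?" + urlencode(params)

query = (
    "(and (term field=parent_post_content 'BRAZIL') "
    "(range field=post_date ['2016-02-16T08:13:26Z','2016-09-16T08:13:26Z'}) "
    "(or (term field=title 'RIO') (term field=excerpt 'RIO')))"
)

path = build_search_path(query)
print(path)
```

If q.parser is missing, the default simple parser is used and the compound expression will not be interpreted as intended.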
Here are some relevant docs from Amazon:
Compound queries for more on the various operators
Searching text for specifics on the phrase operator
Apparently the problem was not with the query but with the displayed content. I foolishly trusted that the content displayed on the CloudSearch site was complete, and so concluded that it did not contain Brazil. But alas, it was not the full content, and when I checked the full content, Brazil was there.
Sorry for the foolery.

Find typo with Lucene

I would like to use Lucene to index/search text. The text can contain mistyped words, names, etc. What is the simplest way of getting Lucene to find a document containing
"this is Licene"
when user searches for
"Lucene"?
This is only for a demo app, so we need the simplest solution.
Lucene's fuzzy queries are based on Levenshtein edit distance.
Use a fuzzy query in the QueryParser, with syntax like:
Lucene~0.5
Or create a FuzzyQuery, passing in the maximum number of edits, something like:
Query query = new FuzzyQuery(new Term("field", "lucene"), 1);
Note: FuzzyQuery, in Lucene 4.x, does not support greater edit distances than 2.
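To make the edit-distance idea concrete, here is a minimal Levenshtein implementation in plain Python. It is only an illustration of the metric; Lucene's FuzzyQuery does not compute distances this way internally, and the function name is hypothetical.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions, or substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

print(levenshtein("licene", "lucene"))
```

"licene" and "lucene" differ by one substitution, so a maximum of 1 edit is enough for the example in the question.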
Another option you could try is using the Lucene SpellChecker:
http://lucene.apache.org/core/6_4_0/suggest/org/apache/lucene/search/spell/SpellChecker.html
It works out of the box and is very easy to use:
SpellChecker spellchecker = new SpellChecker(spellIndexDirectory);
// To index a field of a user index:
spellchecker.indexDictionary(new LuceneDictionary(my_lucene_reader, a_field));
// To index a file containing words:
spellchecker.indexDictionary(new PlainTextDictionary(new File("myfile.txt")));
String[] suggestions = spellchecker.suggestSimilar("misspelt", 5);
By default it uses LevensteinDistance, but you can supply your own edit-distance implementation.

How to do a startsWith and then Contains Search using Lucene.NET 3.0?

What is the best way to search and Index in Lucene.NET 3.0 so that the results come out ordered in the following way:
Results that start with the full query text (as a single word) e.g. "Bar Acme"
Results that start with the search term as a word fragment e.g. "Bart Simpson"
Results that contain the query text as a full word e.g. "National Bar Association"
Results that contain the query text as a fragment e.g. "United Bartenders Inc"
Example: Searching for Bar
Ordered Results:
Bar Acme
Bar Lunar
Bart Simpson
National Bar Association
International Bartenders Association
Lucene doesn't generally support searching or scoring based on position within a field. It would be possible if you prefixed every field value with some known field-start delimiter. I don't really think it makes sense, though, through the lens of full-text search, where position within the text field isn't relevant (i.e. if I were searching for Bar in a document, I would likely be rather annoyed if "Bart Simpson" were returned before "National Bar Association").
Apart from that though, a simple prefix search handles everything else. So if you simply add your start of word token, you can search for the modified term with a higher boost prefix query than the original, and then you should have precisely what you describe.
It can be achieved with LINQ. Run the Lucene search with a hit count of Int32.MaxValue, loop over the resulting ScoreDocs, and store them in a collection, Searchresults.
Sample code:
// Collect every hit, sorted alphabetically by the stored "suggestion" field.
Searchresults = (from scoreDoc in results.ScoreDocs
                 select new SearchResults { suggestion = searcher.Doc(scoreDoc.Doc).Get("suggestion") })
                .OrderBy(x => x.suggestion).ToList();
// Prefer hits that start with the search string (case-insensitive).
SearchresultsStartswith = Searchresults
    .Where(x => x.suggestion.ToLower().StartsWith(searchStringLinq.ToLower()))
    .Take(10).ToList();
if (SearchresultsStartswith.Count > 0)
    return SearchresultsStartswith;
else
    return Searchresults.Take(10).ToList();
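The post-sorting idea can be extended to all four tiers from the question. The sketch below is plain Python over a list of already-retrieved names, not Lucene.NET code; the function names are hypothetical, and ties within a tier are broken alphabetically as an assumption.

```python
def tier(name, query):
    """Rank a result: 0 starts with the query as a whole word, 1 starts with it as a
    fragment, 2 contains it as a whole word, 3 contains it as a fragment, 4 no match."""
    q = query.lower()
    lowered = name.lower()
    words = lowered.split()
    if words and words[0] == q:
        return 0
    if lowered.startswith(q):
        return 1
    if q in words:
        return 2
    if q in lowered:
        return 3
    return 4

def order_results(names, query):
    """Sort by tier first, then alphabetically within each tier."""
    return sorted(names, key=lambda n: (tier(n, query), n.lower()))

names = [
    "National Bar Association",
    "Bart Simpson",
    "International Bartenders Association",
    "Bar Lunar",
    "Bar Acme",
]
for name in order_results(names, "Bar"):
    print(name)
```

Like the LINQ version, this sorts in memory after the search, so it only suits result sets small enough to load in full.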