Amazon CloudSearch returns incorrect results

I have a DB of articles, and I would like to search for all the articles that:
1. contain the word 'RIO' in either the title or the excerpt
2. contain the word 'BRAZIL' in the parent_post_content
3. fall within a certain time range
The (structured) query I searched with was:
(and (phrase field=parent_post_content 'BRAZIL') (range field=post_date ['2016-02-16T08:13:26Z','2016-09-16T08:13:26Z'}) (or (phrase field=title 'RIO') (phrase field=excerpt 'RIO')))
but for some reason I get results that contain 'RIO' in the title but do not contain 'BRAZIL' in the parent_post_content.
This is especially weird because I tried to condition only on the title (and not the excerpt) with this query:
(and (phrase field=parent_post_content 'BRAZIL') (range field=post_date ['2016-02-16T08:13:26Z','2016-09-16T08:13:26Z'}) (phrase field=name 'RIO'))
and the results seem OK.
I'm fairly new to CloudSearch, so I very likely have syntax errors, but I can't seem to find them. Help?

You're using the phrase operator but not actually searching for a phrase; it would be better to use the term operator (or no operator at all) instead. I can't see why it should matter here, but using an operator outside its intended purpose can invite unintended consequences.
Here is how I'd rewrite your queries:
Using term (mainly useful if you want to boost fields):
(and (term field=parent_post_content 'BRAZIL') (range field=post_date ['2016-02-16T08:13:26Z','2016-09-16T08:13:26Z'}) (or (term field=title 'RIO') (term field=excerpt 'RIO')))
Without an operator (I find this simplest):
(and parent_post_content:'BRAZIL' (range field=post_date ['2016-02-16T08:13:26Z','2016-09-16T08:13:26Z'}) (or title:'RIO' excerpt:'RIO'))
If that fails, can you post the complete query? I'd like to check, for example, that you're using the structured query parser, since you mentioned you're new to CloudSearch.
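For what it's worth, here is roughly how the rewritten query could be submitted with the structured parser from Python via boto3 (a minimal sketch; the endpoint URL is a placeholder for your domain's search endpoint):

import boto3

# Placeholder: point this at your own CloudSearch domain's search endpoint.
client = boto3.client(
    "cloudsearchdomain",
    region_name="us-east-1",
    endpoint_url="https://search-YOUR-DOMAIN.us-east-1.cloudsearch.amazonaws.com",
)

query = (
    "(and parent_post_content:'BRAZIL' "
    "(range field=post_date ['2016-02-16T08:13:26Z','2016-09-16T08:13:26Z'}) "
    "(or title:'RIO' excerpt:'RIO'))"
)

# queryParser="structured" is the important part; the simple parser would
# interpret this query text very differently.
response = client.search(query=query, queryParser="structured", size=10)
for hit in response["hits"]["hit"]:
    print(hit["id"])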
Here are some relevant docs from Amazon:
- Compound queries, for more on the various operators
- Searching text, for specifics on the phrase operator

Apparently the problem was not with the query but with the displayed content. I foolishly trusted that the content shown on the CloudSearch site was complete, and so concluded that it did not contain 'BRAZIL'. Alas, it is not the full content; when I checked the full content, 'BRAZIL' was there.
Sorry for the foolery.

Related

Accented or special characters in RavenDB index

I have several fields in my collection that contain accented characters, and the languages from which the words come are quite varied: Czech, German, Spanish, Finnish, Hungarian, etc.
I have noticed that when searching for, e.g. "Andalucía" (note the accented i), the query comes up empty - however, searching for "Andaluc*" returns what I am looking for.
I have found this in the RavenDB docs, and wanted to ask if changing the field indexing method from default to exact would solve my problem.
Thanks!
EDIT: RavenDB appears to be dropping the accented character and everything after it from the search term. In the cmd window, I can see the query (which I enter from RavenDB Studio as NAME_1:Andalucía) coming out as (...)/ByName?term=Andaluc&field=NAME_1&max(...)
When I navigate to the terms of the index, I can see "andalucía" (lowercase!). The index definition is simply a "select new { NAME_1 = area.NAME_1 }". I forgot to mention I'm still on RavenDB 3.5.
Index definition:
Map = areas => from area in areas
               select new
               {
                   NAME_0 = area.NAME_0,
                   NAME_1 = area.NAME_1
               };
Indexes.Add(x => x.NAME_1, FieldIndexing.Analyzed);
// Analyzers.Add(x => x.NAME_1, typeof(StandardAnalyzer).FullName);
The commented-out line doesn't work because the type StandardAnalyzer doesn't resolve in my VS2017 project. I'm currently looking into how to get either the DLL or the correct using statement.
Querying for Andalucía in quotation marks results in the "correct query" being sent to Raven: (...)/ByName?term=Andalucía&field=NAME_1&max=5(...) - but produces no results.
FURTHER EDIT: I found the Lucene DLL, included it in the project, and used the StandardAnalyzer as the analyzer - same result (no results found).
On RavenDB 4, this appears to be fixed. Meh.
You need to verify that both the 'Full-Text-Search' and 'Suggestions' options are turned on in the index.
You need to specify the field you want the suggestions for.
Add this in your index definition:
Suggestion(x => x.NAME_1);
You must not have the following line of code in your index definition on the properties where you perform search operations:
Indexes.Add(x => x.PropertyXYZ, FieldIndexing.No);
By default, if you didn't change the indexing, your query should work.

SQL wildcards via Ruby

I am trying to use a wildcard or regular expression to give some leeway with user input in retrieving information from a database in a simple library catalog program, written in Ruby.
The code in question (which currently works if there is an exact match):
puts "Enter the title of the book"
title = gets.chomp
book = $db.execute("SELECT * FROM books WHERE title LIKE ?", title).first
puts %Q{Title:#{book['title']}
Author:#{book['auth_first']} #{book['auth_last']}
Country:#{book['country']}}
I am using SQLite 3. In the SQLite terminal I can enter:
SELECT * FROM books WHERE title LIKE 'Moby%'
or
SELECT * FROM books WHERE title LIKE "Moby%"
and get (assuming there's a proper entry):
Title: Moby-Dick
Author: Herman Melville
Country: USA
I can't figure out any corresponding way of doing this in my Ruby program.
Is it not possible to use the SQL % wildcard character in this context? If so, do I need to use a Ruby regular expression here? What is a good way of handling this?
(Even putting the ? in single quotes ('?') will cause it to no longer work in the program.)
Any help is greatly appreciated.
(Note: I am essentially just trying to modify the sample code from chapter 9 of Beginning Ruby (Peter Cooper).)
The pattern you give to SQL's LIKE is just a string with optional pattern characters. That means that you can build the pattern in Ruby:
$db.execute("SELECT * FROM books WHERE title LIKE ?", "%#{title}%")
or do the string work in SQL:
$db.execute("SELECT * FROM books WHERE title LIKE '%' || ? || '%'", title)
Note that the case sensitivity of LIKE is database-dependent, but SQLite's is case-insensitive, so you don't have to worry about that until you try to switch databases. Different databases deal with this in different ways: some have a case-insensitive LIKE, some have a separate case-insensitive ILIKE, and some make you normalize the case yourself.
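If you want a quick standalone sanity check of the bound-parameter approach, here is the same idea in Python's sqlite3 module (a rough sketch with a made-up table; the Ruby sqlite3 gem behaves the same way):

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE books (title TEXT, auth_first TEXT, auth_last TEXT, country TEXT)")
db.execute("INSERT INTO books VALUES ('Moby-Dick', 'Herman', 'Melville', 'USA')")

title = "moby"  # simulated user input
# The wildcards live in the bound value, not in the SQL text, so the placeholder stays a bare ?.
row = db.execute("SELECT * FROM books WHERE title LIKE ?", ("%" + title + "%",)).fetchone()
print(row)  # ('Moby-Dick', 'Herman', 'Melville', 'USA') - SQLite's LIKE is case-insensitive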

Using wildcard and required operator in an Elasticsearch search

We have various rows inside our Elasticsearch index that contain the text
"... 2% milk ...".
A user enters a query like "2% milk" into a search field, and we transform it internally into the query
title:(+milk* +2%*)
because all terms should be required and we are possibly also interested in rows that contain "2% milkfat".
The query above returns zero hits. Changing the query to
title:(+milk* +2%)
returns reasonable results. So why does the '*' operator in the first query not work?
Unless you set a mapping, the "%" sign will get removed in the tokenization process. Basically "2% milk" will get turned into the tokens 2 and milk.
When you search for "2%*", it looks for tokens like 2%, 2%a, 2%b, etc., and will not match any indexed tokens, giving no hits.
When you search for "2%", it goes through the same tokenization process as at index time (you can specify this, but the default is the same tokenization), so you end up looking for documents matching the token 2, which gives you a hit.
You can read more about the analysis/tokenization process here, and you can set up the analysis you want by defining a custom mapping.
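For illustration, here is one way such a mapping could look, a rough sketch using Python and the requests library; the index and field names are placeholders, and the exact mapping body depends on your Elasticsearch version (older versions use "string" instead of "text" and need a type name):

import requests

# Whitespace analyzer: "2% milk" is tokenized as ["2%", "milk"], so the "%" survives.
mapping = {
    "mappings": {
        "properties": {
            "title": {"type": "text", "analyzer": "whitespace"}
        }
    }
}
requests.put("http://localhost:9200/products", json=mapping)

# With "2%" indexed as a token, the wildcard clause now has something to expand to.
query = {"query": {"query_string": {"query": "title:(+milk* +2%*)"}}}
resp = requests.post("http://localhost:9200/products/_search", json=query)
print(resp.json()["hits"]["hits"])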
Good luck!
Prefix and Wildcard queries do not appear to apply the Analyzer to their content. To provide a few examples:
title:(+milk* +2%) --> +title:milk* +title:2
title:(+milk* +2%*) --> +title:milk* +title:2%*
title:(+milk* +2%3) --> +title:milk* +(title:2 title:3)
title:(+milk* +2%3*) --> +title:milk* +title:2%3*
+title:super\\-milk --> +title:super title:milk
+title:super\\-milk* --> +title:super-milk*
It does make some sense to prevent tokenization of wildcard queries, since wildcard phrase queries are not allowed. If tokenization were allowed, it would raise the question, especially with embedded wildcards, of just how many terms a single wildcard can span.

Apache Solr - MoreLikeThis score

I have a small index with ~1000 documents and only two fields:
- id (string)
- content (text_general)
I noticed that when I do an MLT search by id for similar content, the original document (whose id is the searched id) has a score of 5.241327.
There is a 1:1 duplicate document, and for the duplicated content it returns a score of 1.5258181. Why? Why is it not 5.241327 when it is a 100% duplicate?
Another question: can I in any way get similar documents by content, by passing some text in the query?
Example:
/mlt/?q=content:Some encoded long text&mlt.fl=content
I am trying to check whether similar content has already been uploaded, and the check must be performed when new content is uploaded.
It might be worth trying some different parameters. I also use MLT on only one field; these are the parameters I use:
'mlt.boost': 'true',
'mlt.fl': 'my_field_name',
'mlt.maxqt': 1000,
'mlt.mindf': '0',
'mlt.mintf': '0',
'qt': 'mlt',
'rows': '10'
See http://wiki.apache.org/solr/MoreLikeThis for an explanation of the parameters. I think mindf might be important with a small index, and I see the default mintf (minimum term frequency) is 2; I assume an ID is only one term, so it is probably being ignored!
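For illustration, here is roughly how those parameters could be sent to the MLT handler, a rough sketch using Python and the requests library; the core name, field name, and document id are placeholders, and the handler path depends on how /mlt is registered in solrconfig.xml:

import requests

params = {
    "q": "id:some_document_id",   # placeholder id of the source document
    "mlt.fl": "content",
    "mlt.boost": "true",
    "mlt.maxqt": 1000,
    "mlt.mindf": 0,
    "mlt.mintf": 0,               # lowered so single-occurrence terms still qualify
    "fl": "id,score",
    "rows": 10,
    "wt": "json",
}
resp = requests.get("http://localhost:8983/solr/mycore/mlt", params=params)
for doc in resp.json()["response"]["docs"]:
    print(doc["id"], doc["score"])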
First, how does Solr More-Like-This work?
A regular Solr query is conducted (e.g. "?q=content:Some encoded long text&...").
For each document returned by the above query, More-Like-This conducts a more-like-this query...
So the first result set, "response", is just like any Solr query result set.
The More-Like-This section appears below it and starts with something like this (JSON format):
"moreLikeThis":{
"57375":{"numFound":18155,"start":0,"docs":["
For an explanation of the More Like This algorithm, please read:
http://blog.brattland.no/node/18
and: http://cephas.net/blog/2008/03/30/how-morelikethis-works-in-lucene/
If you haven't solved the problem yet, please let me know and I will guide you through it.

Endeca UrlENEQuery Java API search

I'm currently trying to create an Endeca query using the Java API for a URLENEQuery. The current query is:
collection()/record[CONTACT_ID = "xxxxx" and SALES_OFFICE = "yyyy"]
I need it to be:
collection()/record[(CONTACT_ID = "xxxxx" or CONTACT_ID = "zzzzz") and
SALES_OFFICE = "yyyy"]
Currently this is being done with an ERecSearchList with CONTACT_ID and the string I'm trying to match in an ERecSearch object, but I'm having difficulty figuring out how to get the UrlENEQuery to generate the or in the correct fashion as I have above. Does anyone know how I can do this?
One of us is confused on multiple levels. Let me try to explain why I am confused:
If CONTACT_ID and SALES_OFFICE are different dimensions, where CONTACT_ID is a multi-or dimension, then you don't need to use EQL (the XPath-like language) to do anything. Just select the appropriate dimension values and your navigation state will reflect the query you are trying to build with XPath, i.e. CONTACT_IDs ORed together and then ANDed with SALES_OFFICE.
If you do have to use EQL, then the only way to modify it (provided that you have to modify it from the returned results) is via string manipulation.
ERecSearchList gives you the ability to use "Search Within" functionality, which works completely differently from EQL filtering, though you can achieve similar results with tricks like searching only a specified field (which would be separate from the generic search interface). I am still not sure what the connection is between ERecSearchList and the EQL expression above.
Having expressed my confusion, I think what you need to do is use string manipulation to dynamically build the EQL expression and add it to the query, roughly as sketched below.
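For example, the string building could look something like this (a language-agnostic sketch, shown in Python only for brevity; the ids and office value are placeholders, and you would pass the resulting expression to your query in whatever way you currently pass the EQL):

contact_ids = ["xxxxx", "zzzzz"]
sales_office = "yyyy"

# Join the CONTACT_ID clauses with "or", then AND the result with SALES_OFFICE.
contact_clause = " or ".join('CONTACT_ID = "%s"' % cid for cid in contact_ids)
eql = 'collection()/record[(%s) and SALES_OFFICE = "%s"]' % (contact_clause, sales_office)

print(eql)
# collection()/record[(CONTACT_ID = "xxxxx" or CONTACT_ID = "zzzzz") and SALES_OFFICE = "yyyy"]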
A code example of what you are doing would be extremely helpful as well.