Search multiple fields with "and" operator (but use fields' own analyzers) - lucene

ElasticSearch Version: 0.90.2
Here's the problem: I want to find documents in the index such that:
they match all query tokens across multiple fields
each field's own analyzer is used
So if there are 4 documents:
{ "_id" : 1, "name" : "Joe Doe", "mark" : "1", "message" : "Message First" }
{ "_id" : 2, "name" : "Ann", "mark" : "3", "message" : "Yesterday Joe Doe got 1 for the message First"}
{ "_id" : 3, "name" : "Joe Doe", "mark" : "2", "message" : "Message Second" }
{ "_id" : 4, "name" : "Dan Spencer", "mark" : "2", "message" : "Message Third" }
And the query is "Joe First 1" it should find ids 1 and 2. I.e., it should find documents which contain all the tokens from search query, no matter in which fields they are (maybe all tokens are in one field, or maybe each token is in its own field).
One solution would be to use elasticsearch "_all" field functionality: that way it will merge all the fields I need (name, mark, message) into one and I'll be able to query it with something like
"match": {
"_all": {
"query": "Joe First 1",
"operator": "and"
}
}
But this way I can specify analyzer for the "_all" field only. And I need "name" and "message" fields to have different set of tokenizers/token filters (let's say name will have phonetic analyzer and message will have some stemming token filter).
Is there a way to do this?

Thanks to the guys at the elasticsearch group, here's the solution... pretty simple, needless to say :)
All I needed to do was use the query_string query http://www.elasticsearch.org/guide/reference/query-dsl/query-string-query/ with default_operator = AND, and it does the trick:
{
"query": {
"query_string": {
"fields": [
"name",
"mark",
"message"
],
"query": "Joe First 1",
"default_operator": "AND"
}
}
}
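To make the matching rule concrete, here is a toy Python model of "every query term must match in at least one field, each field analyzed its own way". It is only a sketch of the behaviour described above, not how Elasticsearch implements it; `simple_analyzer`, `FIELD_ANALYZERS`, `matches` and `search` are hypothetical names, and all three fields share one trivial analyzer for brevity.

```python
# Toy model of "AND across terms, OR across fields".
# Hypothetical stand-in for the query above, not Elasticsearch code.

def simple_analyzer(text):
    """Lowercase and split on whitespace."""
    return text.lower().split()

# Each field has its own analyzer; here they happen to be identical.
FIELD_ANALYZERS = {
    "name": simple_analyzer,
    "mark": simple_analyzer,
    "message": simple_analyzer,
}

DOCS = {
    1: {"name": "Joe Doe", "mark": "1", "message": "Message First"},
    2: {"name": "Ann", "mark": "3",
        "message": "Yesterday Joe Doe got 1 for the message First"},
    3: {"name": "Joe Doe", "mark": "2", "message": "Message Second"},
    4: {"name": "Dan Spencer", "mark": "2", "message": "Message Third"},
}

def matches(doc, query):
    """True if every query term is found in at least one field."""
    for raw_term in query.split():
        found = False
        for field, analyze in FIELD_ANALYZERS.items():
            # Analyze the query term with the analyzer of the field being checked.
            terms = analyze(raw_term)
            field_tokens = analyze(doc[field])
            if all(t in field_tokens for t in terms):
                found = True
                break
        if not found:
            return False
    return True

def search(query):
    """Return the sorted ids of all matching documents."""
    return sorted(doc_id for doc_id, doc in DOCS.items() if matches(doc, query))

print(search("Joe First 1"))
```

Swapping a different function into `FIELD_ANALYZERS` for one field (say, a stemmer for `message`) changes how that field matches without touching the others, which is the property the `_all` approach lacks.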

I think using a multi match query makes sense here. Something like:
"multi_match": {
"query": "Joe First 1",
"operator": "and",
"fields": [ "name", "message", "mark"]
}

As you say, you can set the analyzer (or search_analyzer/index_analyzer) to be used on the _all field. It seems to me that should indeed be your first step to achieve the query results you're looking for.
From http://jontai.me/blog/2012/10/lucene-scoring-and-elasticsearch-_all-field/, we have this tasty quote:
... the _all field copies the text from the other fields and analyzes
them again; it doesn’t copy the pre-analyzed tokens. You can set a
separate analyzer for the _all field.
Which I interpret to mean that you should set an analyzer for _all as well as the individual field analyzers. The _all field doesn't reuse the tokens produced by the individual fields; it takes the original field contents and analyzes them with its own analyzer.

Related

How to setup a field mapping for ElasticSearch that allows both exact and full text searching?

Here is my problem:
I have a field called product_id that is in a format similar to:
A+B-12321412
If I used the standard text analyzer it splits it into tokens like so:
/_analyze/?analyzer=standard&pretty=true" -d '
A+B-1232412
'
{
"tokens" : [ {
"token" : "a",
"start_offset" : 1,
"end_offset" : 2,
"type" : "<ALPHANUM>",
"position" : 1
}, {
"token" : "b",
"start_offset" : 3,
"end_offset" : 4,
"type" : "<ALPHANUM>",
"position" : 2
}, {
"token" : "1232412",
"start_offset" : 5,
"end_offset" : 12,
"type" : "<NUM>",
"position" : 3
} ]
}
Ideally, I would like to sometimes search for an exact product id and other times search by a substring or just part of the product id.
My understanding of mappings and analyzers is that I can only specify one analyzer per field.
Is there a way to store a field as both analyzed and exact match?
Yes, you can use the fields parameter. In your case:
"product_id": {
"type": "string",
"fields": {
"raw": { "type": "string", "index": "not_analyzed" }
}
}
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/_multi_fields.html
This allows you to index the same data twice, using two different definitions. In this case it will be indexed both with the default analyzer and as not_analyzed, and the not_analyzed version will only match exact values. This is also useful for sorting returned results:
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/multi-fields.html
However, you will need to spend some time thinking about how you want to search. In particular, given part numbers with a mix of alpha, numeric and punctuation or special characters you may need to get creative to tune your queries and matches.
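To see why both versions of the field are useful, here is a small Python sketch of what ends up indexed for each. `standard_like` is only a rough approximation of the standard analyzer (lowercase, split on anything that is not a letter or digit), and all function names are made up for illustration.

```python
import re

def standard_like(text):
    """Rough approximation of the standard analyzer for this kind of input."""
    return re.findall(r"[a-z0-9]+", text.lower())

def not_analyzed(text):
    """A not_analyzed field keeps the whole value as a single term."""
    return [text]

def index_product_id(value):
    """Index the same value twice, as the multi-field mapping above does."""
    return {
        "product_id": standard_like(value),
        "product_id.raw": not_analyzed(value),
    }

def term_match(indexed, field, term):
    """Exact term lookup against one of the indexed versions."""
    return term in indexed[field]

indexed = index_product_id("A+B-1232412")
print(indexed)
```

An exact lookup only succeeds against `product_id.raw`, while a lookup for the numeric part only succeeds against the analyzed `product_id`, so which field you query depends on the kind of search you are doing.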

How to get list of statements for a given Wikidata ID?

The only thing I managed to do is this link:
https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q568&format=jsonfm
But this produces lots of useless data. What I need is to get all the statements for the given item, but I can't see any of the statements in the query above.
For this item, that would be something like:
{ "instance of" : "chemical element",
"element symbol" : "Li",
"atomic number" : 3,
"oxidation state" : 1,
"subclass of" : ["chemical element", "alkali metal"]
// etc...
}
Is there an API for this or must I scrape the web page?
The information you want is in the response to your query, except it's hard to decode. For example, this:
"P246": [
{
"id": "q568$E47B8CE7-C91D-484A-9DA4-6153F132997D",
"mainsnak": {
"snaktype": "value",
"property": "P246",
"datatype": "string",
"datavalue": {
"value": "Li",
"type": "string"
}
},
"type": "statement",
"rank": "normal",
"references": …
}
]
means that the “element symbol” (property P246) is “Li”. So, you will need to read all the properties from your query and then find out the name for each of the properties you found.
To get just the statements, you could also use action=wbgetclaims, but it's in the same format as above.
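As a rough illustration of the decoding step, the Python sketch below flattens a claims object of the shape shown above into "property id -> list of values". It works on an already parsed response and makes no network calls; `simplify_claims` is a hypothetical helper, and turning ids like P246 into names like "element symbol" would still need a separate lookup that is not shown here.

```python
def simplify_claims(claims):
    """Map each property id to the list of its main values.

    `claims` is assumed to look like the snippet above:
    property id -> list of statements, each with a "mainsnak".
    """
    result = {}
    for prop, statements in claims.items():
        values = []
        for statement in statements:
            snak = statement.get("mainsnak", {})
            # Snaks whose snaktype is not "value" carry no datavalue, so skip them.
            if snak.get("snaktype") != "value":
                continue
            values.append(snak["datavalue"]["value"])
        result[prop] = values
    return result

# The statement from the answer above, without its references.
sample = {
    "P246": [
        {
            "id": "q568$E47B8CE7-C91D-484A-9DA4-6153F132997D",
            "mainsnak": {
                "snaktype": "value",
                "property": "P246",
                "datatype": "string",
                "datavalue": {"value": "Li", "type": "string"},
            },
            "type": "statement",
            "rank": "normal",
        }
    ]
}

print(simplify_claims(sample))
```

For string properties the value is the string itself. Other datatypes may return a nested object instead, so expect to post-process those.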

elasticsearch: how to index terms which are stopwords only?

I had much success building my own little search with elasticsearch in the background. But there is one thing I couldn't find in the documentation.
I'm indexing the names of musicians and bands. There is one band called "The The" and due to the stop words list this band is never indexed.
I know I can ignore the stop words list completely but this is not what I want since the results searching for other bands like "the who" would explode.
So, is it possible to keep "The The" in the index without disabling stop words entirely?
You can use the synonym filter to convert The The into a single token eg thethe which won't be removed by the stopwords filter.
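Why the order of the filters matters can be sketched in a few lines of Python. This only models the idea (synonym replacement first, stopword removal second); the stopword list is a tiny illustrative one and the function names are made up, so don't read it as the actual filter implementation.

```python
# Small illustrative stopword list, not the real default list.
STOPWORDS = {"the", "a", "an", "of"}

def tokenize(text):
    """Lowercase and split on whitespace."""
    return text.lower().split()

def apply_synonym(tokens, phrase=("the", "the"), replacement="thethe"):
    """Replace each left-to-right occurrence of `phrase` with one token."""
    out = []
    i = 0
    n = len(phrase)
    while i < len(tokens):
        if tuple(tokens[i:i + n]) == phrase:
            out.append(replacement)
            i += n
        else:
            out.append(tokens[i])
            i += 1
    return out

def remove_stopwords(tokens):
    """Drop every token that is in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]

def analyze(text):
    """Synonym filter first, then the stop filter."""
    return remove_stopwords(apply_synonym(tokenize(text)))

print(analyze("The The The Who"))
```

If the stop filter ran first, both "the" tokens would already be gone by the time the synonym filter looked for "the the", and the band would still end up with nothing indexed.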
First, configure the analyzer:
curl -XPUT 'http://127.0.0.1:9200/test/?pretty=1' -d '
{
"settings" : {
"analysis" : {
"filter" : {
"syn" : {
"synonyms" : [
"the the => thethe"
],
"type" : "synonym"
}
},
"analyzer" : {
"syn" : {
"filter" : [
"lowercase",
"syn",
"stop"
],
"type" : "custom",
"tokenizer" : "standard"
}
}
}
}
}
'
Then test it with the string "The The The Who".
curl -XGET 'http://127.0.0.1:9200/test/_analyze?pretty=1&text=The+The+The+Who&analyzer=syn'
{
"tokens" : [
{
"end_offset" : 7,
"position" : 1,
"start_offset" : 0,
"type" : "SYNONYM",
"token" : "thethe"
},
{
"end_offset" : 15,
"position" : 3,
"start_offset" : 12,
"type" : "<ALPHANUM>",
"token" : "who"
}
]
}
"The The" has been tokenized as "thethe", and "The Who" as "who" because the preceding "the" was removed by the stopwords filter.
To stop or not to stop
Which brings us back to whether we should include stopwords or not? You said:
I know I can ignore the stop words list completely
but this is not what I want since the results searching
for other bands like "the who" would explode.
What do you mean by that? Explode how? Index size? Performance?
Stopwords were originally introduced to improve search engine performance by removing common words which are likely to have little effect on the relevance of a query. However, we've come a long way since then. Our servers are capable of much more than they were back in the 80s.
Indexing stopwords won't have a huge impact on index size. For instance, to index the word the means adding a single term to the index. You already have thousands of terms - indexing the stopwords as well won't make much difference to size or to performance.
Actually, the bigger problem is that the is very common and thus will have a low impact on relevance, so a search for "The The concert Madrid" will prefer Madrid over the other terms.
This can be mitigated by using a shingle filter, which would result in these tokens:
['the the','the concert','concert madrid']
While the may be common, the the isn't and so will rank higher.
You wouldn't query the shingled field by itself, but you could combine a query against a field tokenized by the standard analyzer (without stopwords) with a query against the shingled field.
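The shingling step itself is easy to picture. Here is a minimal Python sketch that produces the two-word shingles listed above; `shingles` is a hypothetical helper, and it emits only the word pairs, with no single-word tokens.

```python
def shingles(tokens, size=2):
    """Return every run of `size` consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]

tokens = "the the concert madrid".split()
print(shingles(tokens))
```

Because "the the" is indexed as one term in the shingled field, its rarity is judged as a pair rather than as the very common single word "the".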
We can use a multi-field to analyze the text field in two different ways:
curl -XPUT 'http://127.0.0.1:9200/test/?pretty=1' -d '
{
"mappings" : {
"test" : {
"properties" : {
"text" : {
"fields" : {
"shingle" : {
"type" : "string",
"analyzer" : "shingle"
},
"text" : {
"type" : "string",
"analyzer" : "no_stop"
}
},
"type" : "multi_field"
}
}
}
},
"settings" : {
"analysis" : {
"analyzer" : {
"no_stop" : {
"stopwords" : "",
"type" : "standard"
},
"shingle" : {
"filter" : [
"standard",
"lowercase",
"shingle"
],
"type" : "custom",
"tokenizer" : "standard"
}
}
}
}
}
'
Then use a multi_match query to query both versions of the field, giving the shingled version more "boost"/relevance. In this example the text.shingle^2 means that we want to boost that field by 2:
curl -XGET 'http://127.0.0.1:9200/test/test/_search?pretty=1' -d '
{
"query" : {
"multi_match" : {
"fields" : [
"text",
"text.shingle^2"
],
"query" : "the the concert madrid"
}
}
}
'

elasticsearch splits by space in facets

I am trying to do a simple facet request over a field containing more than one word (Eg: 'Name1 Name2', sometimes with dots and commas inside) but what I get is...
"terms" : [{
"term" : "Name1",
"count" : 15
},
{
"term" : "Name2",
"count" : 15
}]
so my field value is split on spaces before the facet is computed...
Query example:
curl -XGET http://my_server:9200/idx_occurrence/Occurrence/_search?pretty=true -d '{
"query": {
"query_string": {
"fields": [
"dataset"
],
"query": "2",
"default_operator": "AND"
}
},
"facets": {
"test": {
"terms": {
"field": [
"speciesName"
],
"size": 50000
}
}
}
}'
Your field shouldn't be analyzed, or at least not tokenized. You need to update your mapping and then reindex if you want to index the field without tokenizing it.
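To illustrate what is going on, here is a toy Python model of a terms facet: it counts documents per indexed term, so the result depends entirely on how the field was turned into terms. `terms_facet` is a made-up helper and plain whitespace splitting stands in for whatever analyzer the field really uses.

```python
from collections import Counter

def terms_facet(values, analyze):
    """Count how many documents contain each indexed term."""
    counts = Counter()
    for value in values:
        # set() so that a term is counted once per document.
        counts.update(set(analyze(value)))
    return counts

values = ["Name1 Name2"] * 15

# Tokenized field: every word becomes its own term.
tokenized = terms_facet(values, str.split)

# Untokenized field: the whole value is a single term.
untokenized = terms_facet(values, lambda v: [v])

print(tokenized)
print(untokenized)
```

The tokenized version reproduces the two separate counts from the question, while the untokenized one yields a single entry for the full name.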
First of all, javanna provided a very good answer from a practical perspective. However, for the sake of completeness, I want to mention that in some cases there is a way to do it without reindexing the data.
If the speciesName field is stored and your queries produce a relatively small number of results, you can use script_field to retrieve stored field values:
curl -XGET http://my_server:9200/idx_occurrence/Occurrence/_search?pretty=true -d '{
"query": {
"query_string": {
"fields": ["dataset"],
"query": "2",
"default_operator": "AND"
}
},
"facets": {
"test": {
"terms": {
"script_field": "_fields['\''speciesName'\''].value",
"size": 50000
}
}
}
}
'
As a result of this query, elasticsearch will retrieve the speciesName field for every record in your result set and it will construct facets from these values. Needless to say, if your result set contains millions of records, performance of this query might be sluggish.
Similarly, if the field is not stored, but record source is stored, you can use script_field facet to retrieve field values from the source:
......
"script_field": "_source['\''speciesName'\'']",
......
Again, source for each record in the result list will be retrieved and parsed, so you might need some patience to run this query on a large set of records.

How to prevent Facet Terms from tokenizing

I am using Facet Terms to get all the unique values and their count for a field. And I am getting wrong results.
term: web
Count: 1191979
term: misc
Count: 1191979
term: passwd
Count: 1191979
term: etc
Count: 1191979
While the actual result should be:
term: WEB-MISC /etc/passwd
Count: 1191979
Here is my sample query:
{
"facets": {
"terms1": {
"terms": {
"field": "message"
}
}
}
}
If reindexing is an option, it would be best to change the mapping and mark this field as not_analyzed:
"your_field" : { "type": "string", "index" : "not_analyzed" }
You can use multi field type if keeping an analyzed version of the field is desired:
"your_field" : {
"type" : "multi_field",
"fields" : {
"your_field" : {"type" : "string", "index" : "analyzed"},
"untouched" : {"type" : "string", "index" : "not_analyzed"}
}
}
This way, you can continue using your_field in the queries, while running facet searches using your_field.untouched.
Alternatively, if this field is stored, you can use a script field facet instead:
"facets" : {
"term" : {
"terms" : {
"script_field" : "_fields.your_field.value"
}
}
}
As the last resort, if this field is not stored, but record source is stored in the index, you can try this:
"facets" : {
"term" : {
"terms" : {
"script_field" : "_source.your_field"
}
}
}
The first solution is the most efficient. The last solution is the least efficient and may take a lot of time on a large index.
I ran into the same issue today while running a terms aggregation on a recent version of Elasticsearch. After some googling, I worked out how the indexing works (it is quite simple).
Queries can find only terms that actually exist in the inverted index
When you index the following string
"WEB-MISC /etc/passwd"
it will be passed to an analyzer. The analyzer might tokenize it into
"WEB", "MISC", "etc" and "passwd"
along with their position details. These tokens might then be filtered to lowercase, such as
"web", "misc", "etc" and "passwd"
So, after indexing, a search query can only see the four terms above, not the complete string "WEB-MISC /etc/passwd". For your requirement, these are the options you can use:
1. Change the default analyzer used by Elasticsearch.
2. If analysis is not needed, turn it off by setting 'not_analyzed' on the fields you need.
3. To make already indexed data searchable this way, re-indexing is the only option.
I have talked about multiple approaches here.
One is to use not_analyzed to preserve the string as it is. But since that has the drawback of being case sensitive, a better approach is to use the keyword tokenizer plus the lowercase filter.
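A rough sketch of that last idea, written as Python. The settings dict follows the same custom analyzer structure used in the other answers on this page, but the analyzer name `lowercase_keyword` is made up, and `keyword_lowercase` only mimics what such an analyzer would emit.

```python
# Hypothetical analyzer name; "keyword" and "lowercase" are the tokenizer
# and filter mentioned above.
SETTINGS = {
    "analysis": {
        "analyzer": {
            "lowercase_keyword": {
                "type": "custom",
                "tokenizer": "keyword",
                "filter": ["lowercase"],
            }
        }
    }
}

def keyword_lowercase(text):
    """Mimic keyword tokenizer + lowercase filter: one token, lowercased."""
    return [text.lower()]

def not_analyzed(text):
    """not_analyzed keeps the value exactly as given, case included."""
    return [text]

print(keyword_lowercase("WEB-MISC /etc/passwd"))
```

The whole value still comes out as a single facet term, but values that differ only in case now collapse into the same term, which not_analyzed would keep apart.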