Elasticsearch - How to normalize score when combining regular query and function_score? - lucene

Ideally, what I am trying to achieve is to assign weights to queries such that query1 constitutes 30% of the final score and query2 constitutes the other 70%, so that to achieve the maximum score a document has to have the highest possible score on both query1 and query2. My study of the documentation did not yield any hints as to how to achieve this, so let's try to solve a simpler problem.
Consider a query of the following form:
{
"query": {
"bool": {
"should": [
{
"function_score": {
"query": {"match_all": {}},
"script_score": {
"script": "<some_script>"
}
}
},
{
"match": {
"message": "this is a test"
}
}
]
}
}
}
The script can return an arbitrary number (for example, something like 12392002).
How do I make sure that the result from the script will not dominate the overall score?
Is there any way to normalize it? For example, instead of the raw script score, return its ratio to max_script_score (the score achieved by the highest-scoring document)?

I have recently been working on a similar problem. I couldn't find any formal documentation about this issue, but when I investigated the results with the explain API, it seemed that "queryNorm" is not applied to the score coming directly from the "functions" field. This means that you cannot directly normalize the script value.
However, I think I found a slightly tricky workaround. If you combine the functions field with a query as you do (the match_all query) and give a boost to that query, normalization is applied to that query. Multiplying the two scores, one from the normalized query and one from the script, then gives a normalized total. To make this clearer, the query would look like:
{
"query": {
"bool": {
"should": [
{
"function_score": {
"query": {"match_all": {"boost":1}},
"functions": [ {
"script_score": {
"script": "<some_script>"
}}],
"score_mode": "sum",
"boost_mode": "multiply"
}
},
{
"match": {
"message": "this is a test"
}
}
]
}
}
}
This answer is not a proper solution to your problem, but I think you can play with this query to obtain the required result. My suggestion is to use the explain API, try to understand what it returns, examine the parameters affecting the final score, and play with the script and boost values to get an optimized solution.
By the way, a "rescore query" may help a lot in obtaining that 30%-70% ratio in the final score:
Official documentation
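To make the rescore suggestion concrete, here is a minimal sketch that builds such a request body as a Python dict. The `query_weight` and `rescore_query_weight` parameters belong to the rescore API; the window size, the 0.3/0.7 values, and the helper name `build_rescore_body` are assumptions chosen for illustration. With the default score mode the two weighted scores are added, so the 30/70 split only holds if both raw scores are already on comparable scales.

```python
import json


def build_rescore_body(text, script, query_weight=0.3,
                       rescore_weight=0.7, window_size=50):
    """Build a search body that rescores the top hits with a script score.

    Hypothetical helper: the weights and window size are illustrative.
    """
    return {
        # First pass: the regular text query.
        "query": {"match": {"message": text}},
        # Second pass: only the top `window_size` hits per shard are rescored.
        "rescore": {
            "window_size": window_size,
            "query": {
                "rescore_query": {
                    "function_score": {
                        "query": {"match_all": {}},
                        "script_score": {"script": script},
                    }
                },
                # Multiplier for the original query score.
                "query_weight": query_weight,
                # Multiplier for the rescore query score.
                "rescore_query_weight": rescore_weight,
            },
        },
    }


body = build_rescore_body("this is a test", "<some_script>")
print(json.dumps(body, indent=2))
```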

As far as I could find, there is no way to get a normalized score out of Elasticsearch. You will have to work around it by making two queries. The first is a pilot query (preferably with size 1, but with all other attributes the same) that fetches the max_score. Then you can run your actual query and use function_score to normalize the score: pass the max_score from the pilot query in params to function_score and use it to normalize every score. Refer: This article snippet
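The arithmetic behind the two-step approach can be sketched as follows. This is only an illustration of the normalization and the 30/70 weighting, not a client for Elasticsearch: the pilot response below is a hypothetical dict shaped like a search response (only `hits.max_score` is used), and the sample scores are made up.

```python
def normalize(score, max_score):
    """Scale a raw score into [0, 1] relative to the best score seen."""
    if max_score <= 0:
        return 0.0
    return score / max_score


def combined_score(query_score, script_score, max_query, max_script,
                   w_query=0.3, w_script=0.7):
    """Weighted sum of the two normalized scores (weights are assumptions)."""
    return (w_query * normalize(query_score, max_query)
            + w_script * normalize(script_score, max_script))


# Hypothetical pilot response: same query, size 1, only max_score is read.
pilot_response = {"hits": {"max_score": 12392002.0, "hits": []}}
max_script = pilot_response["hits"]["max_score"]

# A document that reaches half of the maximum on both parts.
half = combined_score(2.0, 6196001.0, max_query=4.0, max_script=max_script)
# A document that reaches the maximum on both parts.
best = combined_score(4.0, 12392002.0, max_query=4.0, max_script=max_script)
print(half, best)
```

Only a document that tops both queries reaches the maximum combined score, which is the behaviour asked for in the question.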

Related

What is the best way to create a subset of my data in Elasticsearch?

I have an index in elasticsearch containing apache log data. Here is what I want to do:
Identify all visitors (by ip number) that accessed a certain file (e.g. /signup.php).
Do a search/query/aggregation on my data, but limit the documents that are examined to those containing an ip number found in step 1.
In the sql world, I would just create a temporary table and insert all the matching IP numbers from step one. Next I would query my main table and limit the result set by joining in my temporary table on IP number.
I understand joins are not possible in elasticsearch. The elasticsearch documentation suggests a few ways to handle situations like this:
Application side joins
This does not seem practical, because the list of IP numbers may be very large and it seems inefficient to send the results to the client and then pass it back to elasticsearch in one huge terms filter.
Denormalizing the data
This would involve iterating over the matching IP numbers and updating every document in the index for any given IP number with something like "in_group": true, so I can use that in my query later on. This also seems very impractical and inefficient, especially since the source query (step 1) is dynamic.
Nested Object and/or parent-Child relationship
I'm not sure if dynamically creating new documents with nested objects is practical in this case. It seems to me that I would end up copying huge parts of my data.
I'm new to elasticsearch and noSQL in general, so perhaps I'm just looking at the problem the wrong way and I shouldn't be trying to emulate a JOIN in the first place.
But this seems like such a common case for segmenting a dataset, it makes me wonder if I am overlooking some other obvious way of doing this?
Any help would be appreciated!
If I understood your question correctly, you are trying to get a subset of your documents based on a certain condition and then use that subset to query/search/aggregate further.
If so, why would you want to store it in a separate view, as you would in SQL? A major strength of Elasticsearch is its ability to cache filters, which greatly reduces query time. Using this feature, every query/search/aggregation you need to perform would include a term filter specifying the condition from step 1. Whatever other operations you want to do can then be done in the same query on the already reduced dataset.
If you have other use cases, you might consider changing how the documents are stored (the mapping) for easier and faster retrieval.
This is a current workaround that I use:
Run this bash script to save the IP list from the first query to a temp index, then use a terms-query filter (in Kibana) to query using the IP list from step 1.
#!/usr/bin/env bash
es_host='https://************'
elk_user='************'
cred=($(pass ELK/************ | tr "\n" " ")) ##password
index_name='iis-************'
index_hostname='"************"'
temp_index_path='temp1/_doc/1'
results_limit=1000
timestamp_gte='"2018-03-20T13:00:00"' #UTC
timestamp_lte='"now"' #UTC
resp_data="$(curl -X POST "$es_host/$index_name/_search" -u "$elk_user:${cred[0]}" -H 'Content-Type: application/json; charset=utf-8' -d @- << EOF
{
"query": {
"bool": {
"must": [{
"match": {
"index_hostname": {
"query": $index_hostname
}
}
},
{
"regexp": {
"iis.access.url":{
"value": ".*((jpg)|(jpeg)|(png))"
}
}
}],
"must_not": {
"match": {
"iis.access.agent": {
"query": "Amazon+CloudFront"
}
}
},
"filter": {
"range": {
"@timestamp": {
"gte": $timestamp_gte,
"lte": $timestamp_lte
}
}
}
}
},
"aggs" : {
"whatever" : {
"terms" : { "field" : "iis.access.remote_ip", "size":$results_limit }
}
},
"size" : 0
}
EOF
)"
ip_list="$(echo "$resp_data" | jq '.aggregations.whatever.buckets[].key' | tr "\n" ",\ " | head -c -1)"
resp_data2="$(curl -X PUT "$es_host/$temp_index_path" -u "$elk_user:${cred[0]}" -H 'Content-Type: application/json; charset=utf-8' -d @- << EOF
{
"ips" : [$ip_list]
}
EOF
)"
echo "$resp_data2"
Query DSL - "terms-query" filter:
{
"query": {
"terms": {
"iis.access.remote_ip": {
"id": "1",
"index": "temp1",
"path": "ips",
"type": "_doc"
}
}
}
}
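The jq step in the script above (pulling the bucket keys out of the aggregation response and wrapping them in an "ips" document) can also be sketched in Python. The response dict here is a hypothetical sample using documentation IP addresses, and the helper names are made up for illustration; the aggregation name "whatever" matches the script.

```python
def bucket_keys(response, agg_name="whatever"):
    """Return the key of every bucket in a terms aggregation response."""
    buckets = (response.get("aggregations", {})
                       .get(agg_name, {})
                       .get("buckets", []))
    return [bucket["key"] for bucket in buckets]


def build_lookup_doc(ips):
    """Document stored in the temp index and read back by the terms lookup."""
    return {"ips": ips}


# Hypothetical sample of the first query's response (size 0, aggs only).
sample_response = {
    "hits": {"hits": []},
    "aggregations": {
        "whatever": {
            "buckets": [
                {"key": "192.0.2.10", "doc_count": 42},
                {"key": "192.0.2.11", "doc_count": 7},
            ]
        }
    },
}

lookup_doc = build_lookup_doc(bucket_keys(sample_response))
print(lookup_doc)
```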

Elastic Search: Ordering based on custom logic

I am passing the pattern "Master Servant" to the Elasticsearch search API.
It returns all the documents that contain at least one of the words (Master OR Servant).
It shows the results in descending order of score.
However, I want to change that ordering to my own custom logic, i.e., if a document contains both words (Master AND Servant), show that document first.
Can this be achieved?
Use the bool query.
From 'Elasticsearch: The Definitive Guide':
The bool query takes a more-matches-is-better approach, so the score from each match clause will be added together to provide the final _score for each document. Documents that match both clauses will score higher than documents that match just one clause.
EDIT, based on a comment:
To clarify, I believe you want something like this:
{
"query": {
"bool": {
"should": [
{ "match": { "field": "Master" }},
{ "match": { "field": "Servant" }}
]
}
}
}
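As a toy illustration of the "more matches is better" behaviour quoted above, the following sketch simply adds up the scores of the clauses that matched. It is not Lucene's scoring, and the per-clause scores are invented for the example.

```python
def bool_should_score(clause_scores):
    """Sum the scores of the should clauses that matched (None = no match)."""
    return sum(score for score in clause_scores if score is not None)


# Hypothetical per-clause scores: [score for "Master", score for "Servant"].
docs = {
    "both": [1.2, 0.9],
    "master_only": [1.2, None],
    "servant_only": [None, 0.9],
}

ranked = sorted(docs, key=lambda name: bool_should_score(docs[name]),
                reverse=True)
print(ranked)
```

In this toy model the document matching both words always comes first. With real relevance scores, a single very strong match can in principle outscore two weak ones, so it is worth checking the results with the explain API.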

How should I index this schema in Elasticsearch

I am a bit lost on how to index these documents in Elasticsearch.
Document 1
{
text: ['chicken']
}
Document 2
{
text: ['chicken'], [['broth', 'stock']]
}
I need to be able to query these using either 'chicken flavored stock' or 'chicken flavored broth' and it should return both documents with the same score, since all of their terms have been matched in the input query. It also shouldn't return doc 2 with only 'chicken' as query.
Basically, I want to know that all the terms in 'text' field have been found somewhere in the query, and the internal array (ie: 'broth' and 'stock' acts like an OR clause).
Is this even possible?
Update:
I did find a (cumbersome) way of doing it. I save the document by combining their fields into phrases (ex: ['chicken broth', 'chicken stock'] for doc 2). Then I search using every combination of the input as a phrase (ex: ['chicken', 'chicken flavored', 'chicken flavored broth', 'chicken broth', ...].)
This solution does give me the results I want, but I can't help but feel this is a common case that could be handled much more elegantly. It feels like the ngrams are along the path to my answer, but I can't quite work it out.
When you index documents without adding a custom mapping, Elasticsearch uses the standard analyzer by default.
You could remove the arrays from the text fields and index your documents as:
Document 1
{
"text": "chicken"
}
Document 2
{
"text": "chicken broth stock"
}
The standard analyzer will create the following tokens in the Lucene index:
Document 1
"chicken"
Document 2
"chicken", "broth", "stock"
Your documents are matching the search terms as follows:
chicken: the term 'chicken' matches in both documents. Because the text field is shorter in Document 1, it scores higher than Document 2.
chicken flavored: the term 'chicken' matches in both documents, but there is no match for the term 'flavored'. Again, as the text field is shorter in Document 1, it scores higher than Document 2.
chicken flavored broth: the term 'chicken' matches in both documents, and the term 'broth' also matches in Document 2. There is no match for the term 'flavored' in either document. Document 2 scores higher than Document 1, as it matches two of the terms in the query.
I don't really see a use case for ngrams as the above does what you want.
So here is something that you can try. Percolator can solve your problem but you will have to change the way you are indexing your documents.
So instead of indexing doc1 the way you are doing, index it like so:
PUT /test-index/.percolator/1
{
"query": {
"term": {
"text": {
"value": "chicken"
}
}
}
}
And, index doc2 like so:
PUT /test-index/.percolator/2
{
"query": {
"bool": {
"must": [
{
"term": {
"text": {
"value": "chicken"
}
}
},
{
"bool": {
"should": [
{
"term": {
"text": {
"value": "broth"
}
}
},
{
"term": {
"text": {
"value": "stock"
}
}
}
]
}
}
]
}
}
}
Now, instead of querying your documents the way you were earlier, percolate them:
GET /test-index/all_terms_search/_percolate
{
"doc": {
"text": "chicken flavored stock"
}
}
This will return both of your documents. It also gives you the flexibility to control what and how much you want to match. When you index your documents' reverse queries in the percolator, you provide an ID for each query; keyed by that ID, you can keep the text in a much simpler form for you to consume, either in a separate Elasticsearch index or perhaps in some other datastore that can fetch the matching documents really fast.
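The idea behind the reverse (percolator) matching can be sketched in plain Python. This is only a toy model of the two stored queries above, not the percolator itself: each stored query is a list of groups, every group must be satisfied (AND), and any term inside a group satisfies it (OR). The whitespace tokenizer and the names are assumptions for illustration.

```python
def tokenize(text):
    """Very rough stand-in for an analyzer: lowercase and split on spaces."""
    return set(text.lower().split())


# Stored "queries", keyed by the same IDs used when indexing above.
stored_queries = {
    "1": [{"chicken"}],                      # chicken
    "2": [{"chicken"}, {"broth", "stock"}],  # chicken AND (broth OR stock)
}


def percolate(text):
    """Return the IDs of all stored queries that the input text satisfies."""
    tokens = tokenize(text)
    return sorted(
        query_id
        for query_id, groups in stored_queries.items()
        if all(tokens & group for group in groups)
    )


print(percolate("chicken flavored stock"))
print(percolate("chicken flavored broth"))
print(percolate("chicken"))
```

Both inputs from the question match both stored queries, while "chicken" alone matches only the first one, which is the behaviour the question asks for.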

Query match without score in elasticsearch

I would like to simply match the value of a field, and I don't care about the score (it will always return one match). I don't want Elasticsearch to compute a score for me, which may result in worse performance... or am I wrong, and should I not care?
Simple query like this:
GET /testing/test/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"name": {
"query": "My name here",
"operator": "and"
}
}
}
]
}
}
}
I expect one result with no score (and I don't want to use filtered).
You could override the default similarity with a custom one that just returns a constant score for all matches. See the Elasticsearch documentation on how to set the Similarity module.
However, for a query just involving a simple exact match on a term or phrase, the performance impact is unlikely to be significant. Profiling might help determine if this is really worth pursuing.

How to prevent Facet Terms from tokenizing

I am using Facet Terms to get all the unique values and their count for a field. And I am getting wrong results.
term: web
Count: 1191979
term: misc
Count: 1191979
term: passwd
Count: 1191979
term: etc
Count: 1191979
While the actual result should be:
term: WEB-MISC /etc/passwd
Count: 1191979
Here is my sample query:
{
"facets": {
"terms1": {
"terms": {
"field": "message"
}
}
}
}
If reindexing is an option, it would be best to change the mapping and mark this field as not_analyzed:
"your_field" : { "type": "string", "index" : "not_analyzed" }
You can use multi field type if keeping an analyzed version of the field is desired:
"your_field" : {
"type" : "multi_field",
"fields" : {
"your_field" : {"type" : "string", "index" : "analyzed"},
"untouched" : {"type" : "string", "index" : "not_analyzed"}
}
}
This way, you can continue using your_field in the queries, while running facet searches using your_field.untouched.
Alternatively, if this field is stored, you can use a script field facet instead:
"facets" : {
"term" : {
"terms" : {
"script_field" : "_fields.your_field.value"
}
}
}
As the last resort, if this field is not stored, but record source is stored in the index, you can try this:
"facets" : {
"term" : {
"terms" : {
"script_field" : "_source.your_field"
}
}
}
The first solution is the most efficient. The last solution is the least efficient and may take a lot of time on a large index.
I ran into this same issue today while doing a terms aggregation in a recent version of Elasticsearch. After some googling and partial understanding, I found out how the indexing works (it is actually very simple).
Queries can find only terms that actually exist in the inverted index
When you index the following string
"WEB-MISC /etc/passwd"
it will be passed to an analyzer. The analyzer might tokenize it into
"WEB", "MISC", "etc" and "passwd"
along with their position details. These tokens might then be lowercased by a filter, giving
"web", "misc", "etc" and "passwd"
So, after indexing, a search query can see only the four terms above, not the complete string "WEB-MISC /etc/passwd". For your requirement, these are the options you can use:
1. Change the default analyzer used by Elasticsearch.
2. If analysis is not needed, just turn off the analyzer by setting 'not_analyzed' for the fields you need.
3. To make already indexed data searchable this way, re-indexing is the only option.
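The effect described above can be reproduced with a small Python sketch. The `analyze` function is only a rough approximation of what an analyzer does (split on non-alphanumeric characters, then lowercase), and `facet_counts` is a made-up helper that counts documents per term, the way a terms facet would.

```python
import re
from collections import Counter


def analyze(text):
    """Rough analyzer stand-in: split on non-alphanumerics, then lowercase."""
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", text)]


def facet_counts(values, analyzed=True):
    """Count documents per indexed term, as a terms facet would."""
    counts = Counter()
    for value in values:
        # Analyzed fields index each token; not_analyzed fields index the
        # whole string as a single term.
        terms = set(analyze(value)) if analyzed else {value}
        counts.update(terms)
    return counts


messages = ["WEB-MISC /etc/passwd"] * 3

print(analyze("WEB-MISC /etc/passwd"))
print(facet_counts(messages, analyzed=True))
print(facet_counts(messages, analyzed=False))
```

With the analyzed field every token gets the full document count, which is the "wrong" result from the question; with the unanalyzed field there is a single term carrying that count.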
I have briefly explained this problem and proposed two solutions here.
I have talked about multiple approaches here.
One is to use not_analyzed to preserve the string as it is. But since that has the drawback of being case-sensitive, a better approach would be to use the keyword tokenizer plus a lowercase filter.