What's the denominator for Elasticsearch scores?

I have a search with multiple criteria.
Each criterion (grouped under a should clause) has a different weighted score.
Elasticsearch returns a list of results, each with a score, which seems arbitrary to me because I can't find a denominator for it.
My question is: how can I represent each score as a ratio?
Dividing each score by max_score won't work, since it would show the best match as a 100% match with the search criteria.

The _score calculation depends on the combination of queries used. For instance, a simple query like:
{ "match": { "title": "search" }}
would use Lucene's TFIDFSimilarity, combining:
- Term frequency (TF): how many times does the term search appear in the title field of this document? The more often, the higher the score.
- Inverse document frequency (IDF): how often does the term search appear in the title field across all documents in the index? The more often, the lower the score.
- Field norm: how long is the title field? The longer the field, the lower the score. (Shorter fields like title are considered more important than longer fields like body.)
- A query normalization factor (which can be ignored).
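As a rough sketch of how these factors combine (simplified from Lucene's classic TFIDFSimilarity defaults; the function name and constants here are illustrative assumptions, not the exact engine code):

```python
import math

def tfidf_score(term_freq, doc_freq, num_docs, field_length):
    # tf: more occurrences of the term in this field -> higher score
    tf = math.sqrt(term_freq)
    # idf: the rarer the term across the index, the higher the score
    idf = 1.0 + math.log(num_docs / (doc_freq + 1))
    # field norm: the longer the field, the lower the score
    norm = 1.0 / math.sqrt(field_length)
    return tf * idf * norm

# A term appearing twice in a 5-term title, found in 10 of 1000 docs:
score = tfidf_score(term_freq=2, doc_freq=10, num_docs=1000, field_length=5)
```

Each factor moves the score in the direction described above, which is why raw scores have no fixed upper bound.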
On the other hand, a bool query like this:
"bool": {
  "should": [
    { "match": { "title": "foo" }},
    { "match": { "title": "bar" }},
    { "match": { "title": "baz" }}
  ]
}
would calculate the _score for each matching clause, add the scores together, then divide by the total number of clauses (with the query normalization factor applied once again).
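That combination logic can be sketched like this (illustrative only; real Lucene also applies coordination and normalization details internally):

```python
def bool_should_score(matching_scores, num_clauses):
    # Sum the _score of each matching clause, then divide by the
    # total number of clauses in the bool query.
    return sum(matching_scores) / num_clauses

# Two of three should-clauses matched, scoring 0.9 and 0.6:
combined = bool_should_score([0.9, 0.6], num_clauses=3)  # 0.5
```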
So it depends entirely on what queries you are using.
You can get a detailed explanation of how the _score was calculated by adding the explain parameter to your query:
curl localhost:9200/_search?explain -d '
{
  "query": ....
}'
My question is - how can I represent each score as a ratio?
Without understanding what you want your query to do it is impossible to answer this. Depending on your use case, you could use the function_score query to implement your own scoring algorithm.
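If a relative measure is enough, one client-side option (an assumption about what "ratio" means here, not an Elasticsearch feature) is to normalize the returned scores yourself, e.g. against the sum of the scores in the result page:

```python
def score_ratios(scores):
    # Express each hit's _score as a fraction of the page's total score.
    total = sum(scores)
    if total == 0:
        return [0.0 for _ in scores]
    return [s / total for s in scores]

ratios = score_ratios([4.0, 3.0, 1.0])  # [0.5, 0.375, 0.125]
```

Note this is still relative to the result set, not an absolute "percent match"; Lucene scores have no fixed upper bound.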

Related

How to query and filter efficiently on Fauna DB?

For example, let's assume we have a collection with hundreds of thousands of client documents with 3 fields: name, monthly_salary, and age.
How can I search for documents where monthly_salary is higher than 2000 and age is higher than 30?
In SQL this would be straightforward, but with Fauna I'm struggling to understand the best approach, because index terms only work with an exact match. I see in the docs that I can use the Filter function, but I would need to fetch all documents in advance, which seems counterintuitive and not performant.
Below is an example of how I can achieve it, but I'm not sure if it's the best approach, especially if the collection contains a lot of records.
Map(
  Filter(
    Paginate(Documents(Collection('clients'))),
    Lambda(
      'client',
      And(
        GT(Select(['data', 'monthly_salary'], Get(Var('client'))), 2000),
        GT(Select(['data', 'age'], Get(Var('client'))), 30)
      )
    )
  ),
  Lambda(
    'filteredClients',
    Get(Var('filteredClients'))
  )
)
Is this correct, or am I missing some fundamental concepts about Fauna and FQL?
Can anyone help?
Thanks in advance.
Efficient searching is performed using Indexes. You can check out the docs for search with Indexes, and there is a "cookbook" for some different search examples.
There are two ways to use Indexes to search, and which one you use depends on if you are searching for equality (exact match) or inequality (greater than or less than, for example).
Searching for equality
If you need an exact match, then use Index terms. This is well covered in the docs, and it is also not what your original question is about, so I am not going to dwell much here. But here is a simple example.
Given user documents with this shape
{
  ref: Ref(Collection("User"), "1234"),
  ts: 16934907826026,
  data: {
    name: "John Doe",
    email: "jdoe@example.com",
    age: 50,
    monthly_salary: 3000
  }
}
and an index defined like the following
CreateIndex({
  name: "users_by_email",
  source: Collection("User"),
  terms: [ { field: ["data", "email"] } ],
  unique: true // user emails are unique
})
You can search for exact matches with... the Match function!
Get(
  Match(Index("users_by_email"), "jdoe@example.com")
)
Searching for inequality
Searching for inequalities is more interesting and also more complicated. It requires using Index values and the Range function.
Keeping with the document above, we can create a new index
CreateIndex({
  name: "users__sorted_by_monthly_salary",
  source: Collection("User"),
  values: [
    { field: ["data", "monthly_salary"] },
    { field: ["ref"] }
  ]
})
Note that I've not defined any terms in the above Index. The important thing for inequalities is again the values. We've also included the ref as a value, since we will need that later.
Now we can use Range to get all users with a salary in a given range. This query will get all users with a salary of 2000 and above.
Paginate(
  Range(
    Match(Index("users__sorted_by_monthly_salary")),
    [2000],
    []
  )
)
Combining Indexes
For "OR" operations, use the Union function.
For "AND" operations, use the Intersection function.
Functions like Match and Range return Sets. A really important part of this is to make sure that when you "combine" Sets with functions like Intersection, the shape of the data is the same.
Using Sets with the same shape is not difficult for Indexes with no values: they default to the same single ref value.
Paginate(
  Intersection(
    Match(Index("user_by_age"), 50),             // type is Set<Ref>
    Match(Index("user_by_monthly_salary"), 3000) // type is Set<Ref>
  )
)
When the Sets have different shapes they need to be modified or else the Intersection will never return results
Paginate(
  Intersection(
    Range(
      Match(Index("users__sorted_by_age")),
      [30],
      []
    ), // type is Set<[age, Ref]>
    Range(
      Match(Index("users__sorted_by_monthly_salary")),
      [2000],
      []
    ) // type is Set<[salary, Ref]>
  )
)
{
  data: [] // Intersection is empty
}
So how do we change the shape of the Set so they can be intersected? We can use the Join function, along with the Singleton function.
Join will run an operation over all entries in the Set. We will use that to return only a ref.
Join(
  Range(Match(Index("users__sorted_by_age")), [30], []),
  Lambda(["age", "ref"], Singleton(Var("ref")))
)
Altogether then:
Paginate(
  Intersection(
    Join(
      Range(Match(Index("users__sorted_by_age")), [30], []),
      Lambda(["age", "ref"], Singleton(Var("ref")))
    ),
    Join(
      Range(Match(Index("users__sorted_by_monthly_salary")), [2000], []),
      Lambda(["salary", "ref"], Singleton(Var("ref")))
    )
  )
)
Tips for combining indexes
You can use additional logic to combine different indexes when different terms are provided, or search for missing fields using bindings. There's lots of cool stuff you can do.
Do check out the cookbook and the Fauna forums as well for ideas.
BUT WHY!!!
It's a good question!
Consider this: since Fauna is served as a serverless API, you get charged for each individual read and write on your documents and indexes, as well as the compute time to execute your query. SQL can be much easier, but it is a much higher-level language. Behind SQL sits a query planner making assumptions about how to get your data. If it cannot do so efficiently, it may fall back to scanning your entire table, or otherwise perform an operation much more expensive than you might have expected.
With Fauna, YOU are the query planner. That means it is much more complicated to get started, but it also means you have fine control over the performance of your database, and thus your cost.
We are working on improving the experience of defining schemas and the indexes you need, but at the moment you do have to define these queries at a low level.

Cosmosdb index precision for fixed value strings

I want to index over a field in a collection whose values can be only 4 characters long
{ "field": "abcd" }
Can I use an index precision of 4 like below to save on RU's without having any side effects?
"indexes": [
  {
    "kind": "Range",
    "dataType": "String",
    "precision": 4
  }
]
For Range indexes, the index term length will never exceed the actual string length. So, if all of your strings are 4 characters long, then this will have no impact (neither positive nor negative). You're better off, however, setting the precision to -1, so that we don't have to change your index policy in the future in case the length of the strings changes.
Based on this official statement, the choice of index precision might affect the performance of string range queries. However, there is no specific statement about the effect, as there is for hash-type indexes. So I suggest you run actual tests against simulated data; that way you may find the truth.
BTW, if you want to perform ORDER BY queries against your string properties, you must specify a precision of -1 for the corresponding paths.
Here are more documents about saving RUs for your reference:
1. https://lbadri.wordpress.com/2018/04/07/is-azure-cosmos-db-really-expensive/
2. https://medium.com/@thomasweiss_io/how-i-learned-to-stop-worrying-and-love-cosmos-dbs-request-units-92c68c62c938

How to get the original entity value from Wit.Ai?

I was wondering if there is a way to also return the original value of an entity from Wit.Ai.
For instance, my entity "states" correctly maps the misspelled and lower case word massachusets to Massachusetts. But it does not return the original word. So, I cannot easily tag the incorrect word.
{
  "msg_id": "a6ac0938-d92c-45f4-8c41-4ca990d83415",
  "_text": "What is the temperature in massachusets?",
  "entities": {
    "states": [
      {
        "confidence": 0.7956227869227184,
        "type": "value",
        "value": "Massachusetts"
      }
    ]
  }
}
I would really appreciate it if you know how I can accomplish that with Wit.Ai.
Thanks
Keep the search strategy of "states" as both free-text and keywords. This way you can extract the original word from the message. Declaring it as a keyword matches it with the closest one and returns that keyword, whereas free-text returns the original word from the message.
You may have to train Wit each time to do so, by highlighting "Massachusetts" as a resolved value. This will make Wit understand that you do not agree with the autocorrection.

Cloudant - Lucene range search using numbers stored as text

I have a number of documents in Cloudant that have an ID field of type string. The ID can be a simple string, like "aaa" or "bbb", or a number stored as text, e.g. "111", "222", etc. I need to be able to full-text search using the above field, but I have encountered some problems.
Assuming that I have two documents, having ID="aaa" and ID="111", then searching with query:
ID:aaa
ID:"aaa"
ID:[aaa TO zzz]
ID:["aaa" TO "zzz"]
all return the first document, as expected, while
ID:111
returns nothing, but
ID:"111"
returns the second document, so at least there is a way to retrieve it.
Unfortunately, when searching for range:
ID:[111 TO 999]
ID:["111" TO "999"]
I get no results, and I have no idea what to do to get around this problem. Is there any special syntax for such case?
UPDATE:
Index function:
function(doc){
  if(!doc.ID) return;
  index("ID", doc.ID, { index: 'not_analyzed_no_norms', store: true });
}
Changing index to analyzed doesn't help. Analyzer itself is keyword, but changing to standard doesn't help either.
UPDATE 2
Just to add some more context, because I think I missed one key point. The field I'm indexing will be searched using ranges, and both min and max values can be provided by the user. So it is possible that one of them will be a number stored as a string, while the other will be standard non-numeric text. For example: search all documents where ID >= "11" and ID <= "foo".
Assuming that the database contains documents with ID "1", "5", "alpha", "beta", "gamma", this query should return "5", "alpha", and "beta". Please note that "5" should indeed be returned, because the string "5" is greater than the string "11".
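That string-ordering point is easy to check in any language; for example, Python's plain lexicographic string comparison orders these IDs the same way:

```python
ids = ["1", "5", "alpha", "beta", "gamma"]
lo, hi = "11", "foo"
# Lexicographic comparison: digits sort before letters, and
# "5" > "11" because "5" > "1" at the first character.
matches = [i for i in ids if lo <= i <= hi]  # ['5', 'alpha', 'beta']
```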
Our team just came up with a workaround. We managed to get proper results by appending some arbitrary character, e.g. 'a', to the upper range value, and by introducing an additional search term to exclude documents having an ID between the upper range value and the upper range value + 'a'.
When searching for a range
ID:[X TO Y]
actual query would be
(ID:[X TO Ya] AND -ID:{Y TO Ya])
For example, to find documents having an ID between 23 and 758, we execute:
(ID:[23 TO 758a] AND -ID:{758 TO 758a])
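A small client-side helper (hypothetical; the function name and padding character are our own choices) that builds the rewritten query string:

```python
def range_query(lo, hi, pad="a"):
    # Widen the upper bound by appending an arbitrary character, then
    # exclude everything strictly between the real upper bound and the
    # widened one.
    return "(ID:[{lo} TO {hi}{p}] AND -ID:{{{hi} TO {hi}{p}])".format(
        lo=lo, hi=hi, p=pad)

query = range_query("23", "758")  # (ID:[23 TO 758a] AND -ID:{758 TO 758a])
```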
First of all, I would suggest using the keyword analyzer, so you can control the tokenization during both indexing and search.
"analyzer": "keyword",
"index": "function(doc){\n  if(!doc.ID) return;\n  index(\"ID\", doc.ID, {store: true});\n}"
To retrieve you document with _id "111", use the following range query:
curl -X GET "http://.../facetrangetest/_design/ddoc/_search/f?q=ID:\[111%20TO%20A\]"
If you use the query q=ID:\[111%20TO%20999\], Cloudant search, seeing numbers on both sides of the range, will interpret it as a NumericRangeQuery; and since your ID of "111" is a string, it will not be part of the results returned. Including a string in the query, [111%20TO%20A], makes Cloudant interpret it as a range query on strings.
You can get both docs returned like this:
q=ID:["111" TO "CCC"]
Here's a working live example:
https://rajsingh.cloudant.com/facetrangetest/_design/ddoc/_search/f?q=ID:[%22111%22%20TO%20%22CCC%22]
I found something quirky. It seems that range queries on strings only work if at least one of the range values is a string. Querying on ID:["111" TO "555"] doesn't return anything either, so maybe this is resolving to a numeric query somehow? Could be a bug.
This could also be achieved using regular expressions in queries. Something like this:
curl -X POST "https://.../facetrangetest/_design/ddoc/_search/f" -d '{"q":"ID:/<23-758>/"}' | jq .
This regular expression matches all documents with an ID field from 23 to 758. Slashes (/ /) are used to enclose the regular expression; the numeric interval is enclosed in <>.

Elasticsearch: match every position only once

In my Elasticsearch index I have documents that have multiple tokens at the same position.
I want to get a document back when I match at least one token at every position.
The order of the tokens is not important.
How can I accomplish that? I use Elasticsearch 0.90.5.
Example:
I index a document like this.
{
  "field": "red car"
}
I use a synonym token filter that adds synonyms at the same positions as the original token.
So now in the field, there are 2 positions:
Position 1: "red"
Position 2: "car", "automobile"
My solution for now:
To be able to ensure that all positions match, I index the maximum position as well.
{
  "field": "red car",
  "max_position": 2
}
I have a custom similarity that extends DefaultSimilarity and returns 1 for tf(), idf() and lengthNorm(). The resulting score is the number of matching terms in the field.
Query:
{
  "custom_score": {
    "query": {
      "match": {
        "field": "a car is an automobile"
      }
    },
    "script": "_score*100/doc[\"max_position\"]+_score"
  },
  "min_score": "100"
}
Problem with my solution:
The above search should not match the document, because there is no token "red" in the query string. But it matches, because Elasticsearch counts the matches for car and automobile as two matches and that gives a score of 2 which leads to a script score of 102, which satisfies the "min_score".
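The arithmetic behind that false positive can be checked directly (a quick sketch of the script's formula, not Elasticsearch code):

```python
def script_score(score, max_position):
    # Mirrors the custom_score script: _score*100/doc["max_position"] + _score
    return score * 100 / max_position + score

# "car" and "automobile" both match, so _score is 2 even though
# position 1 ("red") was never matched:
value = script_score(2, 2)  # 102, which clears the min_score of 100
```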
If you needed to guarantee 100% matches against the query terms you could use minimum_should_match. This is the more common case.
Unfortunately, in your case, you wish to provide 100% matches of the indexed terms. To do this, you'll have to drop down to the Lucene level and write a custom (java - here's boilerplate you can fork) Similarity class, because you need access to low-level index information that is not exposed to the Query DSL:
Per document/field scanned in the query scorer:
Number of analyzed terms matched (overlap is the Lucene terminology; it is used in the coord() method of the DefaultSimilarity class)
Number of total analyzed terms in the field: Look at this thread for a couple different ways to get this information: How to count the number of terms for each document in lucene index?
Then your custom similarity (you can probably even extend DefaultSimilarity) will need to detect queries where terms matched < total terms and multiply their score by zero.
Since query and index-time analysis have already happened at this level of scoring, the total number of indexed terms will already be expanded to include synonyms, as should the query terms, avoiding the false-positive "a car is an automobile" issue above.