Elasticsearch: match every position only once - lucene

In my Elasticsearch index I have documents that have multiple tokens at the same position.
I want to get a document back when I match at least one token at every position.
The order of the tokens is not important.
How can I accomplish that? I use Elasticsearch 0.90.5.
Example:
I index a document like this.
{
  "field": "red car"
}
I use a synonym token filter that adds synonyms at the same positions as the original token.
So now in the field, there are 2 positions:
Position 1: "red"
Position 2: "car", "automobile"
My solution for now:
To be able to ensure that all positions match, I index the maximum position as well.
{
  "field": "red car",
  "max_position": 2
}
I have a custom similarity that extends DefaultSimilarity and returns 1 for tf(), idf() and lengthNorm(). The resulting score is then the number of matching terms in the field.
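For reference, a minimal sketch of such a similarity, assuming the Lucene 4.x API that Elasticsearch 0.90.x ships with (the class name is mine):

import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.search.similarities.DefaultSimilarity;

// Flattens tf, idf and the length norm to 1, so the final score reduces to
// the number of matching terms in the field (coord()/queryNorm() may also
// need flattening, depending on the query).
public class MatchCountSimilarity extends DefaultSimilarity {
    @Override
    public float tf(float freq) { return 1f; }

    @Override
    public float idf(long docFreq, long numDocs) { return 1f; }

    @Override
    public float lengthNorm(FieldInvertState state) { return 1f; }
}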
Query:
{
  "query": {
    "custom_score": {
      "query": {
        "match": {
          "field": "a car is an automobile"
        }
      },
      "script": "_score*100/doc[\"max_position\"]+_score"
    }
  },
  "min_score": 100
}
Problem with my solution:
The above search should not match the document, because there is no token "red" in the query string. But it matches anyway: Elasticsearch counts the matches for "car" and "automobile" as two matches, which gives a score of 2 and a script score of 102, satisfying "min_score".

If you needed to guarantee 100% matches against the query terms, you could use minimum_should_match. This is the more common case.
Unfortunately, in your case, you want to guarantee 100% matches of the indexed terms. To do this, you'll have to drop down to the Lucene level and write a custom Similarity class in Java, because you need access to low-level index information that is not exposed to the Query DSL. Per document/field scanned in the query scorer, you need:
Number of analyzed terms matched (overlap is the Lucene terminology; it is used in the coord() method of the DefaultSimilarity class)
Number of total analyzed terms in the field: look at this thread for a couple of different ways to get this information: How to count the number of terms for each document in lucene index?
Then your custom similarity (you can probably even extend DefaultSimilarity) will need to detect queries where terms matched < total terms and multiply their score by zero.
Since query and index-time analysis have already happened at this level of scoring, the total number of indexed terms will already be expanded to include synonyms, as should the query terms, avoiding the false-positive "a car is an automobile" issue above.
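To make that concrete, here is a rough sketch, not a drop-in solution (the class name and the term-count lookup are mine): coord() only sees the number of matching query clauses (overlap) and the number of query clauses (maxOverlap), so the per-document indexed term count has to come from elsewhere, e.g. term vectors or an indexed count field as in the question.

import org.apache.lucene.search.similarities.DefaultSimilarity;

public class ExactCoverageSimilarity extends DefaultSimilarity {
    // Total analyzed terms in the field for the current document,
    // obtained elsewhere (term vectors, a stored count field, ...).
    private final int totalTermsInField;

    public ExactCoverageSimilarity(int totalTermsInField) {
        this.totalTermsInField = totalTermsInField;
    }

    @Override
    public float coord(int overlap, int maxOverlap) {
        // Multiply the score by zero unless every indexed term matched.
        return overlap < totalTermsInField ? 0f : 1f;
    }
}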

Related

How to query and filter efficiently on Fauna DB?

For example, let's assume we have a collection with hundreds of thousands of client documents with three fields: name, monthly_salary, and age.
How can I search for documents whose monthly_salary is higher than 2000 and whose age is higher than 30?
In SQL this would be straightforward, but with Fauna I'm struggling to understand the best approach, because index terms only work with exact matches. I see in the docs that I can use the Filter function, but I would need to fetch all documents in advance, which looks counterintuitive and not performant.
Below is an example of how I can achieve it, but I'm not sure it's the best approach, especially if the collection contains a lot of records.
Map(
  Filter(
    Paginate(Documents(Collection('clients'))),
    Lambda(
      'client',
      And(
        GT(Select(['data', 'monthly_salary'], Get(Var('client'))), 2000),
        GT(Select(['data', 'age'], Get(Var('client'))), 30)
      )
    )
  ),
  Lambda(
    'filteredClient',
    Get(Var('filteredClient'))
  )
)
Is this correct, or am I missing some fundamental concepts about Fauna and FQL?
Can anyone help?
Thanks in advance
Efficient searching is performed using Indexes. You can check out the docs for search with Indexes, and there is a "cookbook" for some different search examples.
There are two ways to use Indexes to search, and which one you use depends on if you are searching for equality (exact match) or inequality (greater than or less than, for example).
Searching for equality
If you need an exact match, then use Index terms. This is most explicit in the docs, and it is also not what your original question is about, so I am not going to dwell much here. But here is a simple example
given user documents with this shape
{
  ref: Ref(Collection("User"), "1234"),
  ts: 16934907826026,
  data: {
    name: "John Doe",
    email: "jdoe@example.com",
    age: 50,
    monthly_salary: 3000
  }
}
and an index defined like the following
CreateIndex({
  name: "users_by_email",
  source: Collection("User"),
  terms: [ { field: ["data", "email"] } ],
  unique: true // user emails are unique
})
You can search for exact matches with... the Match function!
Get(
  Match(Index("users_by_email"), "jdoe@example.com")
)
Searching for inequality
Searching for inequalities is more interesting and also complicated. It requires using Index values and the Range function.
Keeping with the document above, we can create a new index
CreateIndex({
  name: "users__sorted_by_monthly_salary",
  source: Collection("User"),
  values: [
    { field: ["data", "monthly_salary"] },
    { field: ["ref"] }
  ]
})
Note that I've not defined any terms in the above Index. For inequalities, what matters is the values. We've also included the ref as a value, since we will need it later.
Now we can use Range to get all users with salary in a given range. This query gets all users with a salary of 2000 or above.
Paginate(
  Range(
    Match(Index("users__sorted_by_monthly_salary")),
    [2000],
    []
  )
)
Combining Indexes
For "OR" operations, use the Union function.
For "AND" operations, use the Intersection function.
Functions like Match and Range return Sets. A really important part of this is to make sure that when you "combine" Sets with functions like Intersection, the shape of the data is the same.
That is easy for Indexes with no values, since they default to the same single ref value.
Paginate(
  Intersection(
    Match(Index("user_by_age"), 50),             // type is Set<Ref>
    Match(Index("user_by_monthly_salary"), 3000) // type is Set<Ref>
  )
)
When the Sets have different shapes, they need to be modified, or else the Intersection will never return results:
Paginate(
  Intersection(
    Range(
      Match(Index("users__sorted_by_age")),
      [30],
      []
    ), // type is Set<[age, Ref]>
    Range(
      Match(Index("users__sorted_by_monthly_salary")),
      [2000],
      []
    ) // type is Set<[salary, Ref]>
  )
)

{
  data: [] // Intersection is empty
}
So how do we change the shape of the Set so they can be intersected? We can use the Join function, along with the Singleton function.
Join will run an operation over all entries in the Set. We will use that to return only a ref.
Join(
  Range(Match(Index("users__sorted_by_age")), [30], []),
  Lambda(["age", "ref"], Singleton(Var("ref")))
)
Altogether then:
Paginate(
  Intersection(
    Join(
      Range(Match(Index("users__sorted_by_age")), [30], []),
      Lambda(["age", "ref"], Singleton(Var("ref")))
    ),
    Join(
      Range(Match(Index("users__sorted_by_monthly_salary")), [2000], []),
      Lambda(["salary", "ref"], Singleton(Var("ref")))
    )
  )
)
Tips for combining indexes
You can use additional logic to combine different indexes when different terms are provided, or search for missing fields using bindings. Lots of cool stuff you can do.
Do check out the cookbook and the Fauna forums as well for ideas.
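As a sketch of the "missing fields via bindings" idea (the index name, binding name, and ContainsPath usage are my assumptions, so check them against the current FQL docs): a binding computes a value per document at index time, and that binding can then be used as a term.

CreateIndex({
  name: "users_by_salary_presence",
  source: {
    collection: Collection("User"),
    fields: {
      has_salary: Query(
        Lambda("doc", ContainsPath(["data", "monthly_salary"], Var("doc")))
      )
    }
  },
  terms: [ { binding: "has_salary" } ]
})

// then, to find users missing the field:
Paginate(Match(Index("users_by_salary_presence"), false))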
BUT WHY!!!
It's a good question!
Consider this: Since Fauna is served as a serverless API, you get charged for each individual read and write on your documents and indexes as well as the compute time to execute your query. SQL can be much easier, but it is a much higher level language. Behind SQL sits a query planner making assumptions about how to get you your data. If it cannot do it efficiently it may default to scanning your entire table of data or otherwise performing an operation much more expensive than you might have expected.
With Fauna, YOU are the query planner. That means it is much more complicated to get started, but it also means you have fine control over the performance of your database and thus your cost.
We are working on improving the experience of defining schemas and the indexes you need, but at the moment you do have to define these queries at a low level.

Search with term vector positions in Lucene

Is it possible to search for document similarity based on term-vector positions in Lucene?
For example there are three documents with content as follows
1: Hi how are you
2: Hi how you are
3: Hi how are you
Now if doc 1 is searched in Lucene, it should return doc 3 with a higher score and doc 2 with a lower score, because doc 2 has the words "you" and "are" at different positions.
In short, Lucene should return exactly matching documents based on term positions.
I think what you need is a PhraseQuery: a Lucene Query type that takes the precise positions of your tokens into account and lets you define a slop, a tolerance for permutations of those tokens.
In other words, the more the token positions differ from the source, the lower the score.
You can use it like this:
QueryBuilder analyzedBuilder = new QueryBuilder(new MyAnalyzer()); // MyAnalyzer is your own analyzer
Query query = analyzedBuilder.createPhraseQuery("fieldToSearchOn", textQuery);
createPhraseQuery also accepts a third parameter, the slop I alluded to, if you want to tweak it.
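For example, reusing the builder from above with an arbitrary slop of 2:

// Tokens may be up to two position moves apart; the closer the match is
// to the exact phrase, the higher it scores.
Query sloppyQuery = analyzedBuilder.createPhraseQuery("fieldToSearchOn", textQuery, 2);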
Regards,

What's the denominator for ElasticSearch scores?

I have a search which has multiple criterion.
Each criterion (grouped by should) has a different weighted score.
ElasticSearch returns a list of results; each with a score - which seems an arbitrary score to me. This is because I can't find a denominator for that score.
My question is - how can I represent each score as a ratio?
Dividing each score by max_score would not work since it'll show the best match as a 100% match with the search criteria.
The _score calculation depends on the combination of queries used. For instance, a simple query like:
{ "match": { "title": "search" }}
would use Lucene's TFIDFSimilarity, combining:
term frequency (TF): how many times does the term search appear in the title field of this document? The more often, the higher the score
inverse document frequency (IDF): how many times does the term search appear in the title field of all documents in the index? The more often, the lower the score
field norm: how long is the title field? The longer the field, the lower the score. (Shorter fields like title are considered to be more important than longer fields like body.)
A query normalization factor (which can be ignored)
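For reference, these factors combine into what the Elasticsearch docs call Lucene's practical scoring function:
score(q,d) = queryNorm(q) · coord(q,d) · Σ over terms t in q of ( tf(t,d) · idf(t)² · t.getBoost() · norm(t,d) )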
On the other hand, a bool query like this:
"bool": {
"should": [
{ "match": { "title": "foo" }},
{ "match": { "title": "bar" }},
{ "match": { "title": "baz" }}
]
}
would calculate the _score for each matching clause, add the scores together, then divide by the total number of clauses (and once again apply the query normalization factor).
So it depends entirely on what queries you are using.
You can get a detailed explanation of how the _score was calculated by adding the explain parameter to your query:
curl localhost:9200/_search?explain -d '
{
"query": ....
}'
My question is - how can I represent each score as a ratio?
Without understanding what you want your query to do it is impossible to answer this. Depending on your use case, you could use the function_score query to implement your own scoring algorithm.
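As a sketch of that last option (the denominator field is hypothetical; you would have to index a suitable value yourself), a function_score query with a script_score can divide the raw _score by a per-document value:

{
  "query": {
    "function_score": {
      "query": { "match": { "title": "search" } },
      "script_score": {
        "script": "_score / doc['score_denominator'].value"
      }
    }
  }
}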

Using Lucene, when indexing I boost certain documents, but their boost at search time is still 1

I am trying to boost certain documents, but they don't get boosted. Please tell me what I am missing. Thanks!
In my index code I have:
if (myCondition)
{
    myDocument.SetBoost(1.1f);
}
myIndexWriter.AddDocument(myDocument);
then in my search code I retrieve the collection of documents from the ScoreDocs object into myDocuments collection and:
foreach (Lucene.Net.Documents.Document doc in myDocuments)
{
    float tempboost = doc.GetBoost();
}
I place a conditional breakpoint inside the foreach that breaks if tempboost is not 1, and the breakpoint is never hit.
What did I miss?
Many thanks!
From the Lucene javadoc (Java version, but the same behavior applies to Lucene.NET):
public float getBoost()
Returns, at indexing time, the boost factor as set by setBoost(float). Note that once a document is indexed this value is no longer available from the index. At search time, for retrieved documents, this method always returns 1. This however does not mean that the boost value set at indexing time was ignored - it was just combined with other indexing time factors and stored elsewhere, for better indexing and search performance.
Note: for those of you who get NaN when retrieving the score, use the following line:
searcher.SetDefaultFieldSortScoring(true,true);

How to define a boost factor to each term in each document during indexing?

I want to insert another scoring factor into Lucene's similarity equation. The problem is that I can't just override the Similarity class, as it is unaware of the document and terms for which it is computing scores.
For example, in a document with the text below:
The cat is in the top of the tree, and he is going to stay there.
I have an algorithm of my own that assigns each of the terms in this document a score reflecting how important that term is to the document as a whole. A possible score for each word is:
cat: 0.789212
tree: 0.633423
top: 0.412315
stay: 0.123912
there: 0.0999842
going: 0.00988412
...
The score for each word differs from document to document. For example, in another document "cat" could have a score of 0.0023912.
I want to add this score to Lucene's scoring, but I'm kind of lost on how to do that.
Any tips?
Use Lucene's Payload feature:
From: http://www.lucidimagination.com/blog/2009/08/05/getting-started-with-payloads/
1. Add a Payload to one or more Tokens during indexing.
2. Override the Similarity class to handle scoring payloads.
3. Use a Payload-aware Query during your search.
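A minimal sketch of steps 2 and 3, assuming the Lucene 4.x payload APIs and that tokens were indexed through DelimitedPayloadTokenFilter (so "cat" was fed in as "cat|0.789212" and carries its score as a 4-byte float payload); the class name is mine:

import org.apache.lucene.analysis.payloads.PayloadHelper;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.payloads.AveragePayloadFunction;
import org.apache.lucene.search.payloads.PayloadTermQuery;
import org.apache.lucene.search.similarities.DefaultSimilarity;
import org.apache.lucene.util.BytesRef;

public class TermScoreSimilarity extends DefaultSimilarity {
    @Override
    public float scorePayload(int doc, int start, int end, BytesRef payload) {
        if (payload == null) {
            return 1f; // no payload: stay neutral
        }
        // Decode the float written at index time by DelimitedPayloadTokenFilter.
        return PayloadHelper.decodeFloat(payload.bytes, payload.offset);
    }
}

// Step 3: a payload-aware query makes Lucene actually call scorePayload():
// PayloadTermQuery query = new PayloadTermQuery(
//     new Term("body", "cat"), new AveragePayloadFunction());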