Solr: How can I get all documents ordered by score with a list of keywords? - lucene

I have a Solr 3.1 database containing Emails with two fields:
datetime
text
For the query I have two parameters:
date of today
keyword array("important thing", "important too", "not so important, but more than average")
Is it possible to create a query to
get ALL documents of this day AND
sort them by relevancy, so that the email that contains the most of my keywords (important things) scores best?
The part with the date is not very complicated:
fq=datetime:[YY-MM-DDT00:00:00.000Z TO YY-MM-DDT23:59:59.999Z]
I know that you can boost the keywords this way:
q=text:"first keyword"^5 OR text:"second one"^2 OR text:"minus scoring"^0.5 OR text:"*"
But how do I use the keywords only to sort this list and get ALL entries, instead of doing a real query and getting only a few entries back?
Thanks for help!

You need to specify your terms in the main query and then change your date query to be a filter query on these results by adding the following.
fq=datetime:[YY-MM-DDT00:00:00.000Z TO YY-MM-DDT23:59:59.999Z]
So you should have something like this:
q=<terms go here>&fq=datetime:[YY-MM-DDT00:00:00.000Z TO YY-MM-DDT23:59:59.999Z]
Edit: A little more about filter queries (as suggested by rfreak).
From Solr Wiki - FilterQuery Guidance - "Now, what is a filter query? It is simply a part of a query that is factored out for special treatment. This is achieved in Solr by specifying it using the fq (filter query) parameter instead of the q (main query) parameter. The same result could be achieved leaving that query part in the main query. The difference will be in query efficiency. That's because the result of a filter query is cached and then used to filter a primary query result using set intersection."
These should already be sorted by relevancy score; that is the default behavior of Solr. You can see the score by adding that field to the field list.
fl=*,score
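Putting the pieces above together, here is a minimal Python sketch of how the request parameters could be assembled. The helper name `build_solr_params` and the `rows` value are assumptions for illustration, not part of the original answer. The sketch adds a `*:*` clause (which matches every document) so that mails without any keyword are still returned, and it raises `rows` because Solr returns only 10 results by default.

```python
from urllib.parse import urlencode


def build_solr_params(day, boosted_keywords, rows=1000):
    """Build Solr request parameters for one day (hypothetical helper).

    day: date string such as "2011-06-01"
    boosted_keywords: list of (phrase, boost) pairs
    """
    # Each keyword becomes an optional, boosted phrase clause on the text field.
    clauses = [f'text:"{kw}"^{boost}' for kw, boost in boosted_keywords]
    # *:* matches all documents, so the keywords only influence the ordering.
    clauses.append("*:*")
    return {
        "q": " OR ".join(clauses),
        # The date restriction goes into the cached filter query.
        "fq": f"datetime:[{day}T00:00:00.000Z TO {day}T23:59:59.999Z]",
        "fl": "*,score",
        # Assumption: 1000 is large enough to cover one day's mails.
        "rows": rows,
    }


params = build_solr_params(
    "2011-06-01",
    [("important thing", 5), ("important too", 2)],
)
query_string = urlencode(params)  # ready to append to .../select?
```

Keywords that themselves contain double quotes would need escaping before being placed in a phrase clause; the sketch does not handle that.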
If you use the Full Interface for Make A Query on the Admin Interface of your Solr installation at http://<yourserver:port#>/<instancename>/admin/form.jsp you will see where you can specify the filter query, fields, and other options. You can check out the Solr Wiki for more details on the options and how they are used.
I hope that this helps you.

You could do a first query for:
fq=datetime:[YY-MM-DDT00:00:00.000Z TO YY-MM-DDT23:59:59.999Z]
which gives all documents that match the range. Then, use CachingWrapperFilter for the second query to find the documents in the DocSet from the first query which have at least one keyword. They will be relevance-ranked per tf-idf. You may want to use ConstantScoreQuery for the first query to get the list of matching docids in the fastest possible way.

Sorting by relevance is the default behavior in Solr/Lucene.
If the results are unsatisfactory, try putting the keywords in quotes.
//Edit: Following the answer from Paige Cook, use something like this:
q="important thing"&fq=datetime:[YY-MM-DDT00:00:00.000Z TO YY-MM-DDT23:59:59.999Z]
//2nd update. Thinking about this answer again: quotes are not a good idea, because in this case you will only receive "important thing" mails, but no "important too" mails.
The point is which keywords you are using. Searching for -- important thing -- gives the highest scores to "important thing" mails, but Lucene does not know how to score "important too" or "not so important, but more than average" in relation to your keywords.
Another idea would be to search only for "important". But the field values "important thing" and "important too" get nearly the same score, because in both cases the search term "important" makes up 50% of the field value.
So you probably have to change your keywords. It could work after changing "important too" into "also an important mail", to get the best ratio of the search word "important" to the field value, so that the shortest mail description scores highest.

Related

Filter on Count Aggregation

I have been looking for a solution on the internet for quite a while, and I'm still not sure whether this is possible in Kibana or not.
Suppose I apply filter on term and it gives me count of the respective terms but I want the results to show only those terms where count equals a specific value.
Being more specific,
I want to find out which tills are the busiest (highest number of transactions). Currently, when I apply a filter on term and count, it shows me all the tills with their respective transaction counts. What I want is to show only those tills where the count is equal to, let's say, 10.
In other words a similar functionality like HAVING clause in relational dbms.
I have found a lot of workarounds for the same use case, but I'm looking for a proper solution.

I hope I understand what you're asking. I think you can search the field in question with the proper parameters (this assumes the documents carry a numeric count field). For example, for the field 'field_name' with 10 or more hits, try the following Lucene query:
field_name:(*) AND count:[10 TO *]
For the exact result of field_name with count=10, query:
field_name:(*) AND count:10
Let me know if this was what you were looking for!
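For reference, the HAVING-style behavior the question describes can be sketched in plain Python. This is only an illustration of the semantics, applied client-side to already fetched records; it is not a Kibana feature, and the record layout (a `till` key per transaction) and the function name are assumptions.

```python
from collections import Counter


def tills_with_count(transactions, wanted):
    """Return the tills whose transaction count equals `wanted`.

    Equivalent in spirit to:
      SELECT till FROM transactions GROUP BY till HAVING COUNT(*) = wanted
    """
    # Group by till and count the transactions per till.
    counts = Counter(t["till"] for t in transactions)
    # Keep only the groups whose count matches, like a HAVING clause.
    return sorted(till for till, n in counts.items() if n == wanted)


# Hypothetical sample data: till A has 3 transactions, till B has 2.
sample = [{"till": "A"}, {"till": "B"}, {"till": "A"}, {"till": "A"}, {"till": "B"}]
busy = tills_with_count(sample, 3)
```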

Why manipulate filtered column will affect index efficiency?

I'm reading "T-SQL Fundamentals" by Itzik Ben-Gan.
The author briefly mentioned that we shouldn't manipulate the filtered column if we want to use index efficiently. But he didn't really go into detail as to why this is the case.
So could someone please kindly explain the reason behind it?

"The author briefly mentioned that we shouldn't manipulate the filtered column if we want to use index efficiently"
What the author mentions is called sargability.
Assume this statement:
select * from t1 where name='abc'
Assume you have an index on the filtered column above; then the query is sargable.
But the one below is not:
select * from t1 where len(name)=3
When SQL Server is presented with the above query, the only way it can filter the data is to scan the table and then apply the predicate to each row.
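The difference can be observed in a query plan. The following is a small sketch that uses SQLite (via Python's standard `sqlite3` module) as a stand-in for SQL Server, so `length()` replaces `len()` and the plan wording differs from SQL Server's; the extra `city` column and the index name are assumptions for illustration.

```python
import sqlite3


def plan(conn, sql):
    """Return the plan detail strings SQLite reports for `sql`."""
    # The fourth column of EXPLAIN QUERY PLAN holds the readable description.
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.execute("CREATE INDEX ix_t1_name ON t1 (name)")

# Column compared directly with a constant: the index can be searched.
sargable = plan(conn, "SELECT * FROM t1 WHERE name = 'abc'")

# Function applied to the column: every row has to be examined.
non_sargable = plan(conn, "SELECT * FROM t1 WHERE length(name) = 3")

print(sargable)      # a SEARCH step that names ix_t1_name
print(non_sargable)  # a SCAN step over the table
```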
Think of an index as being like a telephone directory (hopefully that's still a familiar enough concept) where everyone is listed by their surnames followed by their addresses.
This index is useful if you want to locate someone's phone number and you know their surname (and maybe their address).
But what if you want to locate everyone who (to steal TheGameiswar's example) has a 3 letter surname - is the index useful to you? It may be slightly more useful than having to go and visit every house in town [1], but it's not nearly so efficient as being able to just jump to the appropriate surnames. You have to search the entire book.
Similarly, if you want to locate everyone who lives on a particular street, the index isn't so useful [2] - you have to search through the entire book to make sure you've found everyone. The same goes for locating everyone whose surname ends with "Son", etc.
[1] This is the analogy for when a database may choose to perform an index scan to satisfy a query simply because the index is smaller and so is easier than a full table scan.
[2] This is the analogy for a query that isn't attempting to filter on the left-most column in the index.

The WHERE clause in a SQL query uses predicates to filter the rows. A predicate is an expression that determines whether a condition applied to a database object is true or false. Example: "Salary > 5000".
Relational models use predicates as a core element in filtering data. These predicates should be written in a certain form, known as "search arguments", in order for the query optimizer to use the indexes effectively on the attributes used in the WHERE clause.
A predicate in the form "column - operator - value" or "value - operator - column" is considered an appropriate search argument. Example: Salary = 1000 or Salary > 5000. As you can see, the column name should appear ALONE on one side of the expression, and the constant or calculated value should be on the other side, to form a valid search argument. The moment a built-in function like MAX, MIN, DATEADD or DATEDIFF is applied to the column name, the expression is no longer treated as a search argument and the query optimizer generally won't be able to use the indexes on that column efficiently.
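The "column alone on one side" rule also covers arithmetic on the column, and such a predicate can often be rewritten by moving the calculation to the constant side. Below is a hedged sketch, again with SQLite standing in for SQL Server; the `emp` table, its columns, and the index name are made up for the example.

```python
import sqlite3


def plan(conn, sql):
    """Return the plan detail strings SQLite reports for `sql`."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (id INTEGER PRIMARY KEY, name TEXT, salary INTEGER)")
conn.execute("CREATE INDEX ix_emp_salary ON emp (salary)")

# Arithmetic on the column: not a search argument, so the rows are scanned.
manipulated = plan(conn, "SELECT * FROM emp WHERE salary * 2 = 2000")

# Same condition with the calculation moved to the constant side.
rewritten = plan(conn, "SELECT * FROM emp WHERE salary = 2000 / 2")

print(manipulated)  # a SCAN step
print(rewritten)    # a SEARCH step that names ix_emp_salary
```

Both queries return the same rows here; only the second form lets the engine jump straight to the matching index entries.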
I hope this is clear.

Is it possible to order lucene documents by matching term?

I'm using Lucene 4.10.3 with Java 1.7
I'm wondering whether it's possible to order query results by the matching term?
Simply put, if my documents contain a text field;
The query is
text:a*
I want documents with ab, then ac, then ad etc.
The real case is more complex, however. What I'm actually trying to accomplish is to "stuff" a relational DB into my Lucene index (probably not the best idea?).
An appropriate example would be:
I have documents representing books in a library. Every book has a title and also a list of people who have borrowed the book, with the date of borrowing.
When a user searches for a book with a title containing "JAVA", I want to give priority to books that were borrowed by this user. (This could be accomplished by adding a TextField "borrowers", adding a SHOULD clause on it and ordering by score.)
Also, if there are several books with "JAVA" that this user has borrowed before, I want to show the most recently borrowed ones first. So I thought of creating a TextField "borrowers" that will look like
borrowers : "user1__20150505 user2__20150506" etc.
I will add a BooleanClause borrowers: user1* and order by matching term.
Any other solution ideas are welcome.

I understand your real problem is more complex, but maybe this is helpful anyway.
You could first search for tokens in the index that match your query, then for each matching token execute a query using that token specifically.
See https://lucene.apache.org/core/6_0_1/core/org/apache/lucene/index/TermsEnum.html for that. Just seek to the prefix and iterate until the prefix stops matching.
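The "seek to the prefix and iterate until the prefix stops matching" idea can be sketched without Lucene. The following plain-Python emulation treats a sorted list as the term dictionary and uses `bisect` in place of a seek; it does not call the Lucene API, and the sample terms (beyond those in the question) and the function name are made up for illustration.

```python
from bisect import bisect_left


def terms_with_prefix(sorted_terms, prefix):
    """Yield all terms that start with `prefix`, in term order.

    `sorted_terms` plays the role of the index's sorted term dictionary.
    """
    # Jump to the first term that is >= the prefix (the "seek").
    i = bisect_left(sorted_terms, prefix)
    # Walk forward until a term no longer carries the prefix.
    while i < len(sorted_terms) and sorted_terms[i].startswith(prefix):
        yield sorted_terms[i]
        i += 1


terms = sorted([
    "user1__20150505",
    "user2__20150506",
    "user1__20150612",   # hypothetical extra borrowing by user1
    "user10__20150101",  # hypothetical other user sharing the "user1" prefix
])

# Including the "__" separator keeps user10 out of user1's matches.
matches = list(terms_with_prefix(terms, "user1__"))

# The date suffix sorts chronologically, so reversing gives newest first.
newest_first = matches[::-1]
```

Each matching term could then be used for its own query, in the order given by `newest_first`.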
In general, it is sometimes easier to just issue two queries: for example, one within the corpus of books the user has borrowed before and another within the whole corpus.
These approaches may not work, but in that case you could implement a custom Scorer that somehow maps the ordering to a number.
See http://opensourceconnections.com/blog/2014/03/12/using-customscorequery-for-custom-solrlucene-scoring/

Querying Solr indexes in order, stopping when you get a match?

I have a setup in which I have two indexes in solr: product_code and title. product_code uses a StrField and title uses a TextField with DoubleMetaphone.
I have a single search box for users to type in either a product code or free text for a title search. I'm currently using dismax and doing qf=product_code title. The problem I have is that very often a product code (e.g. LC12345) might match a word in the title once the DoubleMetaphone has been applied.
So what I want to do is construct a query in such a way that it first applies the query term to the product_code index and only if there are no matches then apply the query term to the title index. Is there a way of doing this without having to do two separate queries to Solr? This is for an AJAX 'live search' so I want to keep latency to a minimum so don't want to have to do two separate queries to Solr.
-Matt

The answer is: no.
If I understand you right, you need something like: q=X; if the answer is 0, q=Y. There is no such function in Solr. Even if there were, Solr would have to query the index twice, which would be the same as using two queries.
What I would suggest, to improve your application with only one query (if that's really necessary), is to use query boosts. If you set something like
qf=product_code^5 title^1
in your solrconfig, the product_code results will come out on top and the title results somewhere near the bottom. If there is no product_code match, you just get the title results.
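The same boosts can also be sent with each request instead of being configured as defaults. A minimal sketch of the parameters for the live search follows, assuming the dismax parser is selected with `defType`; the function name is hypothetical, and the boost values 5 and 1 are simply the ones from the suggestion above.

```python
from urllib.parse import urlencode


def live_search_params(user_input):
    """Build dismax parameters that favor product_code matches."""
    return {
        "defType": "dismax",
        "q": user_input,
        # Field boosts: a product_code hit weighs five times a title hit.
        "qf": "product_code^5 title^1",
    }


query_string = urlencode(live_search_params("LC12345"))
```

This keeps the search to a single round trip; title matches are still returned, only ranked lower.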
I hope that helps

Lucene exact ordering

I've had a long-term issue with not quite understanding how to implement a decent Lucene sort or ranking. Say I have a list of cities and their populations. If someone searches "new" or "london", I want the list of prefix matches ordered by population, and I have that working with a prefix search and a reversed sort on a population field, e.g. New Mexico, New York; or London, Londonderry.
However I also always want the exact matching name to be at the top. So in the case of "London" the list should show "London, London, Londonderry" where the first London is in the UK and the second London is in Connecticut, even if Londonderry has a higher population than London CT.
Does anyone have a single query solution?

dlamblin, let me see if I get this correctly: you want to make a prefix-based query, then sort the results by population, and maybe combine the sort order with a preference for exact matches.
I suggest you separate the search from the sort and use a CustomSorter for the sorting:
Here's a blog entry describing a custom sorter.
The classic Lucene book describes this well.
The API for SortComparator says: "There is a distinct Comparable for each unique term in the field - if some documents have the same term in the field, the cache array will have entries which reference the same Comparable."
You can apply a FieldSortedHitQueue to the SortComparator, which has a Comparator field for which the API says: "Stores a comparator corresponding to each field being sorted by."
Thus the terms can be sorted accordingly.

My current solution is to create an exact searcher and a prefix searcher, both sorted by reverse population, and then copy out all my hits starting from the exact hits, moving to the prefix hits. It makes paging my results slightly more annoying than I think it should be.
Also I used a hash to eliminate duplicates but later changed the prefix searcher into a boolean query of a prefix search (MUST) with an exact search (MUST NOT), to have Lucene remove the duplicates. Though this seemed even more wasteful.
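The desired ordering (exact matches first, then everything by descending population) can be expressed as a single sort key once the search term is available to the sorting code. Here is a plain-Python sketch of that ordering, applied to hits after retrieval rather than inside Lucene; the tuple layout and the population figures are invented for illustration.

```python
def rank_cities(cities, term):
    """Return prefix matches for `term`, exact name matches first.

    cities: list of (name, region, population) tuples
    """
    term = term.lower()
    # Prefix search: keep only names that start with the search term.
    hits = [c for c in cities if c[0].lower().startswith(term)]
    # False sorts before True, so exact matches lead; ties and the
    # remaining hits are ordered by descending population.
    return sorted(hits, key=lambda c: (c[0].lower() != term, -c[2]))


# Made-up populations, chosen so Londonderry outranks London CT on size.
cities = [
    ("Londonderry", "UK", 90000),
    ("London", "CT", 25000),
    ("London", "UK", 8000000),
    ("New York", "US", 8000000),
]
ranked = rank_cities(cities, "london")
```

Because the whole ordering lives in one key, paging can be done over the single sorted list instead of stitching two result sets together.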
Edit: Moved to a comment (since the feature now exists): Yuval F Thank you for your blog post ... How would the sort comparator know that the name field "london" exactly matches the search term "london" if it cannot access the search term?