SOLR - Boost function (bf) to increase score of documents whose date is closest to NOW - lucene

I have a solr instance containing documents which have a 'startTime' field ranging from last month to a year from now. I'd like to add a boost query/function to boost the scores of documents whose startTime field is close to the current time.
So far I have seen a lot of examples that use rord to boost documents that are newer, but I have never seen an example of something like this.
Can anyone tell me how to do it please?
Thanks

If you're on Solr 1.4+, then you have access to the "ms" function in function queries, and the standard, textbook approach to boosting by recency is:
recip(ms(NOW,startTime),3.16e-11,1,1)
ms gives the number of milliseconds between its two arguments, and recip(x,m,a,b) computes a/(m*x+b). Since 3.16e-11 is roughly 1 divided by the number of milliseconds in a year, the expression as a whole boosts scores by 1 for docs dated now, by 1/2 for docs dated 1 year ago, by 1/3 for docs dated 2 years ago, etc. (See http://wiki.apache.org/solr/FunctionQuery#Date_Boosting, as Sean Timm pointed out.)
In your case you have docs dated in the future. For those, ms(NOW,startTime) is negative, so the function above misbehaves: the boost climbs above 1 as startTime moves into the future and turns negative beyond about a year out. You probably want to throw in an absolute value, like this:
recip(abs(ms(NOW,startTime)),3.16e-11,1,1)
abs(ms(NOW,startTime)) will give the # of milliseconds between startTime and now, guaranteed to be nonnegative.
That would be a good starting place. If you want, you can then tweak the 3.16e-11 if it's too aggressive or not aggressive enough.
Tangentially, the ms function will only work on fields based on the TrieDate class, not the classic Date and LegacyDate classes. If your schema.xml was based on the example one for Solr 1.4, then your date field is probably already in the correct format.
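To get a feel for how the recip expression behaves before tuning the 3.16e-11 constant, the arithmetic can be reproduced outside Solr. The following is only a sketch in Python, not Solr code: recency_boost and the fixed "now" value are hypothetical names chosen for illustration, and the formula a/(m*x+b) is the definition of recip used above.

```python
from datetime import datetime, timedelta, timezone


def recip(x, m, a, b):
    """Same arithmetic as Solr's recip(x, m, a, b): a / (m * x + b)."""
    return a / (m * x + b)


def recency_boost(start_time, now, m=3.16e-11):
    """Hypothetical helper mirroring recip(abs(ms(NOW,startTime)),m,1,1)."""
    # Absolute distance from "now" in milliseconds, so past and future
    # dates at the same distance receive the same boost.
    delta_ms = abs((now - start_time).total_seconds()) * 1000.0
    return recip(delta_ms, m, 1.0, 1.0)


if __name__ == "__main__":
    now = datetime(2024, 1, 1, tzinfo=timezone.utc)  # fixed stand-in for NOW
    for days in (-365, -30, 0, 30, 365):
        start = now + timedelta(days=days)
        print(f"{days:+5d} days from now -> boost {recency_boost(start, now):.3f}")
```

Passing a larger m makes the boost fall off faster with distance from now, and a smaller m makes it flatter.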

You can do date math in Solr 1.4.
http://wiki.apache.org/solr/FunctionQuery#Date_Boosting

Related

Using Optaplanner for long trip planning of a fleet of vehicles in a Vehicle Routing Problem (VRP)

I am applying the OptaPlanner VRP example with time windows, and I get feasible solutions whenever I define time windows within a 24-hour range (00:00 to 23:59). But I also need to:
Manage long trips, where I know that the travel time from the depot to the first visit, or between visits, will be more than 24 hours. Currently I do not get feasible solutions for these, because the time windows (TW) are in 24-hour format. When the scoring rule "arrivalAfterDueTime" is applied, the "arrivalTime" is always later than the "dueTime", because the "dueTime" lies in the range 00:00 to 23:59 and the "arrivalTime" falls on the next day.
I have thought that I should take each TW of each Customer and add more TW to it, one for each day that is planned.
Example, if I am planning a trip for 3 days, then I would have 3 time windows in each Customer. Something like this: if Customer 1 is available from [08:00-10:00], then say it will also be available from [32:00-34:00] and [56:00-58:00] which are the equivalent of the same TW for the following days.
I also store the times as long values, converted to milliseconds.
I don't know if this is the right way. My question is mostly about ideas for approaching this constraint; if you have had a similar problem, any suggestion would be much appreciated.
Sorry for the wording, I am a Spanish speaker. Thank you.
Without having checked the example, handling multiple days shouldn't be complicated. It all depends on how you model your time variable.
For example, you could:
model the time stamps as a long value denoting seconds since epoch. This is how most of the examples are modeled, if I remember correctly. Note that this is not very human-readable, but it is the fastest to compute with
use a time data type, e.g. LocalTime. This is a human-readable time format, but it only works within a 24-hour range and will be slower than using a primitive data type
use a date-time data type, e.g. LocalDateTime. This is also human-readable and works in any time range, but it will also be slower than using a primitive data type.
I would strongly encourage you not to simply map the current day or current hour to a zero value and start counting from there. In your example you denote the times as [32:00-34:00], which makes it look as if you are using midnight of the current day as hour 0 and counting from there. While you can do this, it will hurt the debugging and maintainability of your code. That is just my general advice; you don't have to follow it.
What I would advise is to have your own domain models and map them to Optaplanner models where you use a long value for any time stamp that is denoted as seconds since epoch.
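The epoch-based modelling can be sketched as follows. This is an illustration in Python rather than OptaPlanner's Java, and none of the names (epoch_seconds, daily_windows, lateness) come from OptaPlanner; they are hypothetical helpers showing how a daily window repeated over several days becomes plain long values that compare correctly across midnight. It assumes UTC and that a vehicle arriving before a window opens may wait.

```python
from datetime import date, datetime, time, timedelta, timezone


def epoch_seconds(day, clock):
    """Combine a calendar day and a time of day into seconds since epoch (UTC)."""
    return int(datetime.combine(day, clock, tzinfo=timezone.utc).timestamp())


def daily_windows(first_day, ready, due, days):
    """Repeat one daily [ready, due] window over `days` consecutive days."""
    windows = []
    for i in range(days):
        day = first_day + timedelta(days=i)
        windows.append((epoch_seconds(day, ready), epoch_seconds(day, due)))
    return windows


def lateness(arrival, windows):
    """Seconds by which `arrival` misses the best window (0 if one can still be met)."""
    return min(max(0, arrival - due) for _ready, due in windows)


if __name__ == "__main__":
    start = date(2024, 1, 1)
    windows = daily_windows(start, time(8, 0), time(10, 0), days=3)
    # Leave the depot at 06:00 on day 1 and drive for 30 hours.
    arrival = epoch_seconds(start, time(6, 0)) + 30 * 3600
    print("lateness in seconds:", lateness(arrival, windows))
```

Because every value is an absolute time stamp, an arrival on day 2 is compared against the windows of day 2 and later, instead of always appearing later than a due time expressed in a 00:00 to 23:59 range.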

pandas.date_range -- freq="WOM-3FRI", how to understand that offset alias?

I've been trying to learn pandas in a lab class. One part of our lab manual goes over generating time-based indices with the date_range function. The class's lab manual says
The freq parameter accepts a variety of string representations, referred to as offset aliases. See Table 1.3 for a sampling of some of the options. For a complete list of the options, see http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases.
I checked through the 'offset-alias' and 'anchored offsets' sections of the online documentation. Most of the entries in table 1.3 can be understood from those two sections.
However, the last entry of the table is "WOM-3FRI". The table says this corresponds to a frequency of every 3rd Friday of the month. I have no idea how to deduce that from the online documentation. It looks like "WOM" is being used as the alias and "3FRI" is being used as an anchor. But "WOM" is not listed as an alias in the online documentation, so I'm struggling to make sense of what's happening here.
One hypothesis I have is that this is some sort of operation.
The online documentation and my lab book have a couple of examples where prepending a number to an alias multiplies the length of the period by that number. So, '2' operates in a way so that '2M' creates a frequency of every 2 months. Similarly, '5' operates in a way so that '5Y' creates a frequency of every 5 years. Does 'O' somehow operate in a way that the offset alias 'XOY' gives the xth sub-period of period Y? For example, would "MOY-5" give the 5th month of the year? Would "DOY-7FRI" give the 7th Friday of the year?
Another hypothesis I have is that "WOM" is a new-fangled alias, and "3FRI" is an anchor for it. However, the documentation online does not list "WOM". I checked, and it was the pandas 0.23.4 documentation. My lab machine is running version 0.23.4, and it can handle "WOM-3FRI" just fine. Have they just not updated the documentation yet?
Could anyone clear up the method/theory behind creating "WOM-3FRI"?
Lab manual with Table 1.3: http://www.acme.byu.edu/wp-content/uploads/2018/10/Pandas4.pdf
I did a little more digging. It looks like "WOM" is just an undocumented offset alias. Source: https://github.com/pandas-dev/pandas/issues/2289#issuecomment-269616457
See the pandas DateOffsets documentation:
WeekOfMonth - 'WOM' - the x-th day of the y-th week of each month
And see the example in this exercise:
Create a DateTimeIndex consisting of the third Thursday in each month for the years 2015 and 2016.
pd.date_range('2015-01-01', '2016-12-31', freq='WOM-3THU')
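To make the rule behind 'WOM-3FRI' concrete, here is a sketch in plain Python (standard library only, no pandas) of what "the third Friday of each month" means. The helper names nth_weekday_of_month and third_fridays are hypothetical. Assuming the table's description of the alias is right, pd.date_range('2015-01-01', '2015-03-31', freq='WOM-3FRI') should produce the same dates.

```python
from datetime import date, timedelta

FRIDAY = 4  # date.weekday() counts Monday as 0, so Friday is 4


def nth_weekday_of_month(year, month, weekday, n):
    """Return the n-th occurrence (1-based) of `weekday` in the given month."""
    first = date(year, month, 1)
    # Days from the 1st of the month to the first matching weekday.
    days_to_first_match = (weekday - first.weekday()) % 7
    return first + timedelta(days=days_to_first_match + 7 * (n - 1))


def third_fridays(year, months):
    """Third Friday of each listed month: the dates 'WOM-3FRI' is described as selecting."""
    return [nth_weekday_of_month(year, month, FRIDAY, 3) for month in months]


if __name__ == "__main__":
    for day in third_fridays(2015, range(1, 4)):
        print(day.isoformat(), day.strftime("%A"))
```

Read this way, the "3" is not a multiplier like the "2" in '2M'; it selects which week of the month the anchor weekday is taken from.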

Subtract documents score based on their age in Apache SOLR

I'm trying to implement a kind of negative boost in Apache SOLR such that documents created within the last 2 years keep their score unchanged.
Documents created before NOW-2YEARS, however, should be penalized based on their difference in number of days. For example, if the difference is 60 days, then a penalty based on those days should be applied to the score.
Is there any solution to this problem using Apache SOLR?
Edit:
So far I have used the boost function of dismax with the following function:
if(gt(ms(created_at,NOW-2YEARS),0),0,sub(query($q),mul(div(sub(ms(NOW),ms(NOW-2YEARS)),8.64e+7),0.2)))
The problem with this method is that if, say, the score of a document created two years ago is 20 and the difference in days is 60, then the subtraction gives 20-60=-40, which is not correct.
Also, I'm always getting "query($q)":1.0, which doesn't make sense.
Thanks

Can Bing News search return the news last year?

I am looking into Bing News Search, and my test results (with a free trial API key) only contain articles from the last month. However, I would like to get articles from the last 1 or 2 years.
For example:
https://api.cognitive.microsoft.com/bing/v7.0/news/search?count=100&q=AMD+AND+product&since=1451606400&sortBy=Date
The documentation only mentions a 'freshness' filter with day/week/month options, but no year.
Can I do that with Bing search? If I can, how can I do it?
There is no direct way. One alternative is to use trending topics with the "sortBy" option and then jump to the last set of results using the "offset" parameter.
Another, less preferred, alternative is to parse the datePublished field in the JSON results. It may also help to use the "offset" parameter with a bigger "count" value to reach slightly lower-ranked results, which may be older. More details here: https://learn.microsoft.com/en-us/rest/api/cognitiveservices/bing-news-api-v7-reference.

Querying Apache Solr based on score values

I am working on an image retrieval task. I have a dataset of Wikipedia images with their textual descriptions in XML files (1 XML file per image). I have indexed those XML files in Solr. Now, when retrieving them, I want to apply a threshold on the score values, so that docs with a lower score do not appear in the results (because they are not of much importance). For example, I want to retrieve all documents having a similarity score greater than or equal to 2.0. I have already tried range queries like score:[2.0 TO *] but can't get it working. Does anyone have any idea how I can do that?
What's the motivation for wanting to do this? The reason I ask, is score is a relative thing determined by Lucene based on your index statistics. It is only meaningful for comparing the results of a specific query with a specific instance of the index. In other words, it isn't useful to filter on b/c there is no way of knowing what a good cutoff value would be.
http://lucene.472066.n3.nabble.com/score-filter-td493438.html
Also, take a look here - http://wiki.apache.org/lucene-java/ScoresAsPercentages
So, in general it's a bad idea to cut off at some value, because you'll never know which threshold value is best. For a good query it could be score=2, for a bad query score=0.5, etc.
These two links should explain why you DON'T want to do it.
P.S. If you still want to do it, take a look here - https://stackoverflow.com/a/15765203/2663985
P.P.S. I recommend fixing your search queries so that they search with higher precision (http://en.wikipedia.org/wiki/Precision_and_recall)