Best way to search for date availability in Lucene

I have a scenario in which I have an object with an availability property associated with it. I have encoded the dates in a month as a 32-bit value, with a 1 bit for available and a 0 bit for not available. Now I want to search for objects that are available on a range of dates. What is the best way to do this with Lucene?

Maybe a better way to store that would be as:
available_on=20111028
available_on=20111029
where the date is encoded as an integer, with one field for each date that is available. Then you can use a NumericRangeQuery to search the availability range.
Failing that, I guess you could write a filter to step through each value used for your bitfield and pick out the ones with one of the requisite bits set.
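A minimal sketch of that per-date-field idea, in Java Lucene 3.x syntax (the field names, the example dates, and the helper methods are illustrative, not taken from the question):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.NumericRangeQuery;
import org.apache.lucene.search.Query;

public class AvailabilityExample {

    // Index one "available_on" value per date the object is available (e.g. 20111028).
    public static void indexObject(IndexWriter writer, String id, int[] availableDates)
            throws Exception {
        Document doc = new Document();
        doc.add(new Field("id", id, Field.Store.YES, Field.Index.NOT_ANALYZED));
        for (int date : availableDates) {
            doc.add(new NumericField("available_on", Field.Store.YES, true).setIntValue(date));
        }
        writer.addDocument(doc);
    }

    // Objects available on at least one day in [from, to]: a single range query.
    public static Query availableOnAnyDay(int from, int to) {
        return NumericRangeQuery.newIntRange("available_on", from, to, true, true);
    }

    // Objects available on every day in [from, to]: one MUST clause per day.
    // The naive date++ only works within one month; iterate real calendar days in production.
    public static Query availableOnEveryDay(int from, int to) {
        BooleanQuery query = new BooleanQuery();
        for (int date = from; date <= to; date++) {
            query.add(NumericRangeQuery.newIntRange("available_on", date, date, true, true),
                      BooleanClause.Occur.MUST);
        }
        return query;
    }
}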

Related

How to make a query that matches values within a specified range of non-standard types?

For standard types I find it pretty straightforward:
NumericRangeQuery.NewIntRange(item.Name, item.MinValue, item.MaxValue, true, true)
It works great with the most common numeric types.
But what I would like to do is make a range query with data types such as Date and decimal.
How can I achieve this?
For dates, store them as ints, so 2016 July 23 = 20160723.
If you want precision down to the hour, minute, or second, just add those digits to the right. You may need to switch to long (Int64) for the longer versions.
If you want finer granularity, store Ticks.
After that, just use the appropriate NumericRangeQuery.
In Lucene.net 3.0.3 the best floating-point accuracy is with Double.
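A rough sketch of the date-as-int idea (Java Lucene 3.x spelling; Lucene.net uses NewIntRange/NewLongRange as in the snippet above). The field names are illustrative, and the decimal-as-scaled-long trick is a common convention rather than something stated in the answer:

import java.math.BigDecimal;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericField;
import org.apache.lucene.search.NumericRangeQuery;
import org.apache.lucene.search.Query;

public class RangeOverNonStandardTypes {

    // Dates: encode yyyyMMdd as an int, so an ordinary int range query sorts correctly.
    public static void addDateField(Document doc, int yyyymmdd) {
        doc.add(new NumericField("available_on", Field.Store.YES, true).setIntValue(yyyymmdd));
    }

    public static Query dateRange(int fromYyyymmdd, int toYyyymmdd) {
        return NumericRangeQuery.newIntRange("available_on", fromYyyymmdd, toYyyymmdd, true, true);
    }

    // Decimals: scale to a fixed number of places and store as a long (e.g. 12.3456 -> 123456).
    // Assumes at most 4 decimal places; adjust the scale to your data.
    public static void addPriceField(Document doc, BigDecimal price) {
        long scaled = price.movePointRight(4).longValueExact();
        doc.add(new NumericField("price", Field.Store.YES, true).setLongValue(scaled));
    }

    public static Query priceRange(BigDecimal from, BigDecimal to) {
        return NumericRangeQuery.newLongRange("price",
                from.movePointRight(4).longValueExact(),
                to.movePointRight(4).longValueExact(), true, true);
    }
}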

Suggestions/Opinions for implementing a fast and efficient way to search a list of items in a very large dataset

Please comment and critique the approach.
Scenario: I have a large dataset (200 million entries) in a flat file. The data is of the form: a 10-digit phone number followed by 5-6 binary fields.
Every week I will be getting a delta file which will only contain changes to the data.
Problem: Given a list of items, I need to figure out whether each item (which will be a 10-digit number) is present in the dataset.
The approach I have planned:
1. Parse the dataset and put it in a DB (to be done at the start of the week), like MySQL or Postgres. The reason I want an RDBMS in the first step is that I want to keep the full time-series data.
2. Generate some kind of key-value store out of this database with the latest valid data, one that supports checking whether each item is present in the dataset (thinking of some kind of NoSQL db here, like Redis, optimised for search; it should have persistence and be distributed). This data structure will be read-only.
3. Query this key-value store to find out whether each item is present (if possible, matching a list of values all at once instead of one item at a time). I want this to be blazing fast, and will be using this functionality as the back-end to a REST API.
Side note: My language of preference is Python.
A few considerations for the fast lookup:
If you want to check a set of numbers at a time, you could use Redis's SINTER, which performs set intersection.
You might benefit from using a grid structure, distributing number ranges over some hash function such as the first digit of the phone number (there are probably better ones; you have to experiment). With a good hash this would reduce the size per node to roughly 20 million entries when using 10 nodes.
If you expect duplicate requests, which is quite likely, you could cache the last n requested phone numbers in a smaller set and query that one first.
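A small sketch of the SINTER idea using the Jedis client (Java shown for consistency with the Lucene examples, even though the asker prefers Python; the key names and helper method are made up for illustration):

import java.util.Set;
import redis.clients.jedis.Jedis;

public class BatchLookup {

    // datasetKey is a Redis SET holding all phone numbers currently in the dataset.
    // Returns the subset of the queried numbers that are present.
    public static Set<String> presentNumbers(Jedis jedis, String datasetKey, String... queried) {
        String tmpKey = "lookup:tmp";          // temporary set holding the numbers to check
        jedis.del(tmpKey);
        jedis.sadd(tmpKey, queried);
        Set<String> present = jedis.sinter(datasetKey, tmpKey);  // SINTER = set intersection
        jedis.del(tmpKey);
        return present;
    }
}

If you shard by something like the first digit, the same idea applies per node: route each queried number to the node that owns its range and intersect there.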

Is it possible to affect a Lucene rank based on a numeric value?

I have content with various numeric values, and a higher value indicates (theoretically) more valuable content, which I want to rank higher.
For instance:
Average rating (0 - 5)
Number of comments (0 - whatever)
Number of inbound link references from other pages (0 - whatever)
Some arbitrary number I apply to indicate how important I feel the content is (1 - whatever)
These can be indexed by Lucene as a numeric value, but how can I tell Lucene to use this value in its ranking algorithm?
You can set this value using Field.SetBoost while indexing.
Depending on how exactly you want to proceed, you can set the boost while indexing as suggested by @L.B, or, if you want to make it dynamic (i.e. applied at search time rather than indexing time), you can use ValueSourceQuery and CustomScoreQuery.
You can see example in the question I asked some time ago:
Lucene custom scoring for numeric fields (the example was tested with Lucene 3.0).
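To make both options concrete, here is a loose sketch in Java Lucene 3.x. The field names ("title", "rating") and the boost formula are invented for the example; FieldScoreQuery is used as a concrete ValueSourceQuery:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.function.CustomScoreQuery;
import org.apache.lucene.search.function.FieldScoreQuery;

public class NumericRankingExample {

    // Option 1: index-time boost, fixed once the document is written.
    public static Document buildDocument(String title, float rating) {
        Document doc = new Document();
        Field titleField = new Field("title", title, Field.Store.YES, Field.Index.ANALYZED);
        titleField.setBoost(1.0f + rating);    // higher rating => higher boost for this field
        doc.add(titleField);
        // Store the rating as a plain un-analyzed value so FieldCache can read it at search time.
        doc.add(new Field("rating", Float.toString(rating), Field.Store.NO, Field.Index.NOT_ANALYZED));
        return doc;
    }

    // Option 2: search-time scoring, multiplies the text score by the stored numeric value.
    public static Query rankedByRating(Query textQuery) {
        FieldScoreQuery ratingValue = new FieldScoreQuery("rating", FieldScoreQuery.Type.FLOAT);
        return new CustomScoreQuery(textQuery, ratingValue);  // default: textScore * rating
    }

    public static Query example() {
        return rankedByRating(new TermQuery(new Term("title", "lucene")));
    }
}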

How do I get Average field length and Document length in Lucene?

I am trying to implement the BM25F scoring system on Lucene. I need to make a few minor changes for my needs to the original implementation given here, but I got lost at the part where he gets the average field length and the document length. Could someone guide me as to how or where I can get them?
You can get the field length from the TermVector instances associated with documents' fields, but that will increase your index size. This is probably the way to go unless you cannot afford a larger index. Of course you will still need to calculate the average yourself and store it elsewhere (or perhaps in a special document with a well-known external id that you just update when the statistics change).
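For the term-vector route, a sketch along these lines might work in Java Lucene 3.x (it assumes the field was indexed with Field.TermVector.YES; the method name is illustrative):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;

public class FieldLengthFromVectors {

    // Field length of one document's field = sum of its term-vector frequencies.
    public static int fieldLength(IndexReader reader, int docId, String field) throws Exception {
        TermFreqVector vector = reader.getTermFreqVector(docId, field);
        if (vector == null) {
            return 0;   // term vectors were not stored for this field/document
        }
        int length = 0;
        for (int freq : vector.getTermFrequencies()) {
            length += freq;
        }
        return length;
    }
}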
If you can store the data outside of the index, one thing you can do is count the tokens as documents are tokenized, and store the counts for averaging. If your document collection is static, just dump the values for each field into a file and process them after indexing. If the index only gets updated with additions, you can store the number of documents and the average length per field, and recompute the average. If documents are going to be removed and you need an accurate count, you will need to re-parse the document being removed to know how many terms each field contained, or get the length from the TermVector if you are using that.
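And for the count-it-yourself route, a rough sketch of counting tokens with the analyzer and keeping a running per-field average outside the index (the class and field names are made up):

import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;

public class FieldLengthStats {

    private final Map<String, long[]> perField = new HashMap<String, long[]>(); // {totalTokens, docCount}

    // Count the tokens the analyzer produces for one field of one document.
    public static int countTokens(Analyzer analyzer, String field, String text) throws Exception {
        TokenStream stream = analyzer.tokenStream(field, new StringReader(text));
        int count = 0;
        stream.reset();
        while (stream.incrementToken()) {
            count++;
        }
        stream.end();
        stream.close();
        return count;
    }

    // Call once per field per document while indexing.
    public void record(String field, int tokenCount) {
        long[] stats = perField.get(field);
        if (stats == null) {
            stats = new long[2];
            perField.put(field, stats);
        }
        stats[0] += tokenCount;
        stats[1] += 1;
    }

    // Average field length so far; this is the avgFieldLength that BM25F needs.
    public double averageLength(String field) {
        long[] stats = perField.get(field);
        return stats == null || stats[1] == 0 ? 0.0 : (double) stats[0] / stats[1];
    }
}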

List of Best Practice MySQL Data Types

Is there a list of best-practice MySQL data types for common applications? For example, the list would contain the best data type and size for id, IP address, email, subject, summary, description content, URL, date (timestamp and human-readable), geo points, media height, media width, media duration, etc.
Thank you!!!
I don't know of any, so let's start one!
Numeric ID/auto_increment primary keys: use an unsigned integer. Do not use 0 as a value, and keep in mind the maximum value of the various sizes, i.e. don't use INT if you don't need 4 billion values when the 16 million offered by MEDIUMINT will suffice.
Dates: unless you specifically need dates/times that are outside the supported range of MySQL's DATE and TIME types, use them! If you instead use Unix timestamps, you have to convert them to use the built-in date and time functions. If your app needs Unix timestamps, you can always convert the standard date and time data types on the way out using UNIX_TIMESTAMP().
IP addresses: use INET_ATON() and INET_NTOA(), since this easily compacts an IP address into 4 bytes and gives you the ability to do range searches that utilize indexes (see the sketch below).
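To make the 4-byte point concrete, here is roughly what INET_ATON() computes, sketched in Java (illustrative only; in MySQL you would just call the function and store the result in an INT UNSIGNED column; the sketch assumes a full dotted quad, whereas INET_ATON() also accepts short forms):

public class Inet {

    // Pack a dotted-quad IPv4 address into an unsigned 32-bit value, as INET_ATON() does.
    // "192.168.0.1" -> 3232235521; fits a MySQL INT UNSIGNED column.
    public static long ipToUnsignedInt(String ip) {
        long value = 0;
        for (String octet : ip.split("\\.")) {
            value = (value << 8) | Integer.parseInt(octet);
        }
        return value;
    }
}

Because a block like 192.168.0.0/24 then becomes the contiguous range 3232235520-3232235775, a query of the form WHERE ip BETWEEN 3232235520 AND 3232235775 can use an ordinary index.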
Integer display width: You likely define your integers something like INT(4) but have been baffled by the fact that (4) has no real effect on the stored numbers. In other words, you can store numbers like 999999 just fine. The reason is that, for integers, (4) is the display width, and it only has an effect when used with the ZEROFILL modifier. Further, this is for display purposes only, so you could define a column as INT(4) ZEROFILL and still store 99999. If you stored 999, the MySQL REPL (console) would output 0999 when you select this column.
In other words, if you don't need the ZEROFILL stuff, you can leave off the display width.
Money: Use the Decimal data type. Based on real-world production scenarios I recommend (19,8).
EDIT: My original recommendation was (19,4); however, I've recently run into a production issue where the client reported that they absolutely needed a scale of 8; 4 wasn't enough and was causing incorrect tax calculations. I now recommend (19,8) based on that real-world scenario. I would love to hear stories needing a more granular scale.