Lucene SimpleFacetedSearch Facet count exceeded 2048 - lucene

I've stumbled into an issue using Lucene.net in one of my project where i'm using the SimpleFacetedSearch feature to have faceted search.
I get an exception thrown
Facet count exceeded 2048
I've a 3 columns which I'm faceting as soon as a add another facet I get the exception.
If I remove all the other facets the new facet works.
Drilling down into the source of SimpleFacetedSearch I can see inside the constructor of SimpleFacetedSearch it's checking of the number of facets don't exceed MAX_FACETS which is a constant set to 2048.
foreach (string field in groupByFields)
{
...
num *= fieldValuesBitSets1.FieldValueBitSetPair.Count;
if (num > SimpleFacetedSearch.MAX_FACETS)
throw new Exception("Facet count exceeded " + (object) SimpleFacetedSearch.MAX_FACETS);
fieldValuesBitSets.Add(fieldValuesBitSets1);
...
}
However as it's public I am able to set it like so.
SimpleFacetedSearch.MAX_FACETS = int.MaxValue;
Does anyone know why it is set to 2048 and if there are issues changing it? I was unable to find any documentation on it.

No there should't be any issue in changing it. But remember that using Bitsets(as done by SimpleFacetedSearch internally) is more performant when the search results are big but facet counts don't exceed some number. (Say 1000 facets 10M hits)
If you have much more facets but search results are not big you can iterate on the results(in a collector) and create facets. This way you may get a better performance. (say 100K facets 1000 hits)
So, 2048 may be an optimized number where exceeding it may result in performance loss.

The problem that MAX_FACETS is there to avoid is one of memory usage and performance.
Internally SimpleFS uses bitmaps to record which documents each facet value is used in. There is a bit for each document and each value has a separate bitmap. So if you have a lot of values the amount of memory needed grows quickly especially if you also have a lot of documents. memory = values * documents /8 bytes.
My company has indexes with millions of documents and 10's of thousands of values which would require many GB's of memory.
I've created another implementation which I've called SparseFacetedSearcher. This records the doc IDs for each value. So you only pay for hits not a bits per doc. If you have exactly one value in each document (like a product category) then the break even point is if you have more than 32 values (more than 32 product categories).
In our case the memory usage has dropped to a few 100MB.
Feel free to have a look at https://github.com/Artesian/SparseFacetedSearch

Related

Elasticsearch much slower when returning all results

I have about 20,000 documents stored in elastic search, at about 200kb each.
I have a search which has 733 hits total, I'm running that takes about 50ms to complete when returning 10 results.
If I set the size to 1000 so that it returns all results, the search takes 3-5 seconds to return.
Normally I would see that this is because it has to continue searching until it finds all of them, which takes extra time. However when returning 10 results only, the search still says 733 hits in total, so it already knows which documents are to be returned!
Note that I am not returning the _source field here, all I want it the list of _ids back, so I can't imagine that it would have to read any more data from the disk, as all the _ids are surely stored in the indices anyway.
Am I missing something in the way this works?
(My _ids are guids that we use internally).
EDIT: Since posting I've re-indexed with two changes to the mapping:
Set _source to false, so now the actual documents aren't stored.
Changed the index for the field that I was searching on to be not_analyzed.
This solves the problem, now I'm getting all 733 _ids back in ~50ms. Not sure which change solved it though. I'll take one of them back out and re-index.
It will take that Time. Because it need to fetch all data from ES and calculate score for your query.
Try
1)set fields to not analyzed which you Don search in.
2)change the store type of ES from simplfs to mmaps.. ( mention "index.store.type:mmaps" in elasticsearch.yml..)
3)configure less shard as much possible.. Shard more must be equal to move on nodes you gonna use..

How do I get Average field length and Document length in Lucene?

I am trying to implement BM25f scoring system on Lucene. I need to make a few minor changes to the original implementation given here for my needs, I got lost at the part where he gets Average Field Length and document length... Could someone guide me as to how or where I get it from?
You can get field length from TermVector instances associated with documents' fields, but that will increase your index size. This is probably the way to go unless you cannot afford a larger index. Of course you will still need to calculate the average yourself, and store it elsewhere (or perhaps in a special document with a well-known external id that you just update when the statistics change).
If you can store the data outside of the index, one thing you can do is count the tokens when documents are tokenized, and store the counts for averaging. If your document collection is static, just dump the values for each field into a file & process after indexing. If the index needs to get updated with additions only, you can store the number of documents and the average length per field, and recompute the average. If documents are going to be removed, and you need an accurate count, you will need to re-parse the document being removed to know how many terms each field contained, or get the length from the TermVector if you are using that.

What is the maximum value size you can store in redis?

Does anyone know what the maximum value size you can store in redis? I want to use redis as a message queue with celery to store some small documents that need to be processed by a worker on another server, and I want to make sure the documents aren't going to be too big.
I found one page with a reference to 1GB, but when I followed the link on the page for where they got that answer the link wasn't valid anymore. Here is the link:
http://news.ycombinator.com/item?id=1182005
All string values are limited to 512 MiB. This is the size limit you probably care most about.
EDIT: Because keys in Redis are strings, the maximum key size is 512 MiB. The maximum number of keys is 2^32 - 1 = 4,294,967,295.
Values, on the other hand, can vary in size depending on their type. For aggregate data types (i.e. hash, list, set, and sorted set), the maximum value size is 512 MiB for each element, although the data structure itself can have up to 2^32 - 1 elements.
https://redis.io/topics/data-types
https://redis.io/topics/faq#what-is-the-maximum-number-of-keys-a-single-redis-instance-can-hold-and-what-is-the-max-number-of-elements-in-a-hash-list-set-sorted-set
http://groups.google.com/group/redis-db/browse_thread/thread/1c7e33fbc98734b3?fwc=2
Article about Redis Memory Usage can help you to roughly determine how much memory your database would take.
It's in the order of the amount of RAM you have, at least, so unless you plan on puting multi-gigabyte objects in there I wouldn't worry. I've had sets that were hundreds of megabytes big without a problem, but I don't know the exact limits.
A String value can accommodate the size of max 512MB. But according to this link, the size can be increased.

Which data structure should I use for storing hash values?

I have a hash table that I want to store to disk. The list looks like this:
<16-byte key > <1-byte result>
a7b4903def8764941bac7485d97e4f76 04
b859de04f2f2ff76496879bda875aecf 03
etc...
There are 1-5 million entries. Currently I'm just storing them in one file, 17-bytes per entry times the number of entries. That file is tens of megabytes. My goal is to store them in a way that optimizes first for space on the disk and then for lookup time. Insertion time is unimportant.
What is the best way to do this? I'd like the file to be as small as possible. Multiple files would be okay, too. Patricia trie? Radix trie?
Whatever good suggestions I get, I'll be implementing and testing. I'll post the results here for all to see.
You could just sort entries by key and do a binary search.
Fixed size keys and data entries means you can very quickly jump from row to row, and storing only the key and data means you're not wasting any space on meta data.
I don't think you'll do any better on disk space, and lookup times are O(log(n)). Insertion times are crazy long, but you said that didn't matter.
If you're really willing to tolerate long access times, do sort the table but then chunk it into blocks of some size and compress them. Store the offset* and start/end keys of each block in a section of the file at the start. Using this scheme, you can find the block containing the key you need in linear time and then perform a binary search within the decompressed block. Choose the block sized based on how much of the file you're willing to loading into memory at once.
Using an off the shelf compression scheme (like GZIP) you can tune the compression ratio as needed; larger files will presumably have quicker lookup times.
I have doubts that the space savings will be all that great, as your structure seems to be mostly hashes. If they are actually hashes, they're random and won't compress terribly well. Sorting will help increase the compression ratio, but not by a ton.
*Use the header to lookup the offset of a block to decompress and use.
5 million records it's about 81MB - acceptable to work with array in memory.
As you described problem - it's more unique keys than hash values.
Try to use hash table for accessing values (look at this link).
If there is my misunderstand and this is real hash - try to build second hash level above this.
Hash table can be successfuly organized on disk too (e.g. as separate file).
Addition
Solution with good search performance and little overhead is:
Define hash function, which produces integer values from keys.
Sort records in file according to values, produced by this function
Store file offsets where each hash value starts
To locate value:
4.1. compute it's hash with function
4.2. lookup for offset in file
4.3. read records from file starting from this position until key found or offset of next key not reached or End-Of-File.
There are some additional things which must be pointed out:
Hash function must be fast to be effective
Hash function must produce linear distributed values or near that
Table of hash value offsets can be placed in separated file
Table of hash value offsets can be produced dynamically with sequential read of whole sorted file at start of application and stored in memory
at step 4.3. records must be readed by blocks, not one-by-one, to be effective. Ideally reads all values with computed hash to memory at once.
You can find some examples of hash functions here.
Would the simple approach work and store them in a sqlite database? I don't suppose it'll get any smaller but you should get very good lookup performance, and it's very easy to implement.
First of all - multiple files are not OK if you want to optimize for disk space, because of cluster size - when you create file with size ~100 bytes, disk spaces decreases per cluster size - 2kB for example.
Secondly - in your case i would store all table in single binary file, ordered simply ASC by bytes values in keys. It will give you file with length exactly equals to entriesNumber*17, which is minimal if you do not want to use archiving, and secondly, you can use very quick search with time ~log2(entriesNumber), when you search for key dividing file into two parts and comparing key on their border with needed key. If "border key" is bigger, you take first part of file, if bigger - then second part. And again divide taken part into two parts, etc.
So you will need about log2(entriesNumber) read operations to search single key.
Your key is 128 bits, but if you have max 10^7 entries, it only takes 24 bits to index it.
You could make a hash table, or
Use Bentley-style unrolled binary search (at most 24 comparisons), as in
Here's the unrolled loop (with 32-bit ints).
int key[4];
int a[1<<24][4];
#define COMPARE(key, i) (key[0]>=a[i][0] && key[1]>=a[i][1] && key[2]>=a[i][2] && key[3]>=a[i][3])
i = 0;
if (COMPARE(key, (i+(1<<23))) >= 0) i += (1<<23);
if (COMPARE(key, (i+(1<<22))) >= 0) i += (1<<22);
if (COMPARE(key, (i+(1<<21))) >= 0) i += (1<<21);
...
if (COMPARE(key, (i+(1<<3))) >= 0) i += (1<<3);
if (COMPARE(key, (i+(1<<2))) >= 0) i += (1<<2);
if (COMPARE(key, (i+(1<<1))) >= 0) i += (1<<3);
As always with file design, the more you know (and tell us) about the distribution of data the better. On the assumption that your key values are evenly distributed across the set of all 16-byte keys -- which should be true if you are storing a hash table -- I suggest a combination of what others have already suggested:
binary data such as this belongs in a binary file; don't let the fact that the easy representation of your hashes and values are as strings of hexadecimal digits fool you into thinking that this is string data;
file size is such that the whole shebang can be kept in memory on any modern PC or server and a lot of other devices too;
the leading 4 bytes of your keys divide the set of possible keys into 16^4 (= 65536) subsets; if your keys are evenly distributed and you have 5x10^6 entries, that's about 76 entries per subset; so create a file with space for, say, 100 entries per subset; then:
at offset 0 start writing all the entries with leading 4 bytes 0x0000; pad to the total of 100 entries (1700 bytes I think) with 0s;
at offset 1700 start writing all the entries with leading 4 bytes 0x0001, pad,
repeat until you've written all the data.
Now your lookup becomes a calculation to figure out the offset into the file followed by a scan of up to 100 entries to find the one you want. If this isn't fast enough then use 16^5 subsets, allowing about 6 entries per subset (6x16^5 = 6291456). I guess that this will be faster than binary search -- but it is only a guess.
Insertion is a bit of a problem, it's up to you with your knowledge of your data to decide whether new entries (a) necessitate the re-sorting of a subset or (b) can simply be added at the end of the list of entries at that index (which means scanning the entire subset on every lookup).
If space is very important you can, of course, drop the leading 4 bytes from your entries, since they are computed by the calculation for the offset into the file.
What I'm describing, not terribly well, is a hash table.

Collect all hits for a search in Lucene / Optimization

Summary: I collect the doc ids of all hits for a given search by using a custom Collector (it populates a BitSet with the ids). The searching and getting doc ids are quite fast according to my needs but when it comes to actually fetching the documents from disk, things get very slow. Is there a way to optimize Lucene for faster document collection?
Details: I'm working on a processed corpus of Wikipedia and I keep each sentence as a separate document. When I search for "computer", I get all sentences containing the term computer. Currently, searching the corpus and getting all document ids work in sub-second but fetching the first 1000 documents takes around 20 seconds. Fetching all documents takes proportionally more time (i.e. another 20 sec for each 1000-document batch).
Subsequent searches and document fetching takes much less time (though, I don't know who's doing the caching, OS or Lucene?) but I'll be searching for many diverse terms and I don't want to rely on caching, the performance on the very first search is crucial for me.
I'm looking for suggestions/tricks that will improve the document-fetching performance (if it's possible at all). Thanks in advance!
Addendum:
I use Lucene 3.0.0 but I use Jython to drive Lucene classes. Which means, I call the get_doc method of the following Jython class for every doc id I retrieved during the search:
class DocumentFetcher():
def __init__(self, index_name):
self._directory = FSDirectory.open(java.io.File(index_name))
self._index_reader = IndexReader.open(self._directory, True)
def get_doc(self, doc_id):
return self._index_reader.document(doc_id)
I have 50M documents in my index.
You, probably, are storing lot of information in the document. Reduce the stored fields to as much as you can.
Secondly, while retrieving fields, select only those fields which you need. You can use following method of IndexReader to specify only few of the stored fields.
public abstract Document document(int n, FieldSelector fieldSelector)
This way you don't load up fields which are not used.
You can utilize following code sample.
FieldSelector idFieldSelector =
new SetBasedFieldSelector(Collections.singleton("idFieldName"), Collections.emptySet());
for (int i: resultDocIDs) {
String id = reader.document(i, idFieldSelector).get("idFieldName");
}
Scaling Lucene and Solr discusses many ways to improve Lucene performance.
As you are working on Lucene search within Wikipedia, you may be interested in Rainman's Lucene Search of Wikipedia. He mostly discusses algorithms and less performance, but this may still be relevant.