I have implemented Azure Search in my application and I am facing a challenge: I have a facet that should have around 50 sub-elements, but it is returning only 10.
I am looking for a way to configure the maximum number of sub-elements returned.
For each faceted field in the navigation tree, there is a default limit of 10 values. This default makes sense for navigation structures because it keeps the values list to a manageable size. You can override the default by assigning a value to count.
For example: &facet=City,count:50
In a facet query you can set count to a higher or lower value; for example, count:50 returns the top 50 facet values by document count.
However, when document counts are high, there is a performance penalty, so use this option judiciously.
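If you are calling the service through the Python SDK rather than building the REST query string yourself, the same count override can be passed in the facets parameter. A minimal sketch, assuming a placeholder endpoint, API key, and index with a facetable City field:

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

# Placeholders: substitute your own service endpoint, index name, and API key.
client = SearchClient(
    endpoint="https://<your-service>.search.windows.net",
    index_name="<your-index>",
    credential=AzureKeyCredential("<api-key>"),
)

# Ask for up to 50 values of the City facet instead of the default 10.
results = client.search(search_text="*", facets=["City,count:50"])
for field, values in results.get_facets().items():
    print(field, values)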
For more details, you could refer to this article.
I have a web/mobile application that should display an infinite scroll view (the continuation of the list of items is loaded periodically and dynamically) of items, where each item has a weight. The bigger an item's weight is compared to the weights of the other items, the higher the chance/probability should be that the item is loaded and displayed in the list. The items should be loaded randomly; only the chances for the items to appear in the list should differ.
I am searching for an efficient algorithm / solution or at least hints that would help me achieve that.
Some points worth mentioning:
the weight has these bounds: 0 <= w < infinity.
the weight is not a static value, it can change over time based on some item properties.
every item with a weight higher than 0 should have a chance to be displayed to the user even if the weight is significantly lower than the weight of other items.
when a user scrolls and performs multiple requests to the API, they should not see duplicate items, or at least the chance of that should be low.
I use a SQL database (PostgreSQL) for storing the items, so the solution should be efficient for this type of database (it doesn't have to be a purely SQL solution).
Hope I didn't miss anything important. Let me know if I did.
The following are some ideas to implement the solution:
The database table should have a column where each entry is a number generated as follows:
log(R) / W,
where—
W is the record's weight greater than 0 (itself its own column), and
R is a per-record uniform random number in (0, 1)
(see also Arratia, R., "On the amount of dependence in the prime factorization of a uniform random integer", 2002). Then take the records with the highest values of that column as the need arises.
However, note that SQL has no standard way to generate random numbers; DBMSs that implement SQL have their own ways to do so (such as RANDOM() for PostgreSQL), but how they work depends on the DBMS (for example, compare MySQL's RAND() with T-SQL's NEWID()).
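A minimal sketch of that idea against PostgreSQL from Python, assuming a hypothetical items table with id, weight, and a precomputed sort_key column (all names and the connection string are placeholders):

import psycopg2

conn = psycopg2.connect("dbname=app")  # placeholder connection string
with conn, conn.cursor() as cur:
    # (Re)compute the keys: ln(1 - random()) is <= 0, and dividing by a larger
    # weight pushes the key closer to 0, i.e. higher in the ordering.
    # (Using 1 - random() avoids ln(0), since random() can return exactly 0.)
    cur.execute("UPDATE items SET sort_key = ln(1 - random()) / weight WHERE weight > 0")
    # Take the next page of records with the highest keys.
    cur.execute("SELECT id FROM items ORDER BY sort_key DESC LIMIT %s OFFSET %s", (20, 0))
    page = [row[0] for row in cur.fetchall()]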
Peter O had a good idea, but it had some issues. I would expand on it a bit so that the shuffle can be a little better and more user-specific, at a higher database space cost:
1. Use a single column, but store multiple fields in it. I recommend the Postgres JSONB type (which stores the data as JSON that can be indexed and queried). Use several fields, each holding its own log(R) / W value; I would say roughly log(U) + log(P) fields, where U is the number of users and P is the number of items, with a minimum of probably 5. Add an index over all the fields within the JSONB. Add more fields as the number of users/items gets high enough. (A rough sketch of this layout follows the list.)
2. Have a background process that regularly rotates the numbers in #1. This can cause duplication, but if you only rotate a small subset of the items at a time (such as O(sqrt(P)) of them), the odds of the user noticing are low, especially if you actually query for data backwards and forwards and stitch/dedup the data together before displaying the next row(s). Careful use of manual pagination adjustments helps a lot here if it becomes an issue.
3. Before displaying items, randomly pick one of the index fields and sort the data on that. This means you have a 1 in log(P) + log(U) chance of displaying the same data to the user. Ideally the user would pick a random subset of those index fields (to avoid seeing the same order twice) and use that as the order, but I can't think of a way to make that work and be practical. Though a random shuffle of the index fields and sorting by that might be practical if the randomized weights are normalized, so that the sort order matters.
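A rough sketch of what that JSONB layout and the query-time pick could look like, again with hypothetical table and column names (items, weight, sort_keys) and 5 key slots:

import json
import math
import random
import psycopg2

NUM_KEYS = 5  # "a minimum of probably 5" key slots, as suggested above
conn = psycopg2.connect("dbname=app")  # placeholder connection string

# Background job: store several independent log(R)/W keys per row in one JSONB column.
with conn, conn.cursor() as cur:
    cur.execute("SELECT id, weight FROM items WHERE weight > 0")
    for item_id, weight in cur.fetchall():
        keys = {f"k{i}": math.log(1 - random.random()) / weight for i in range(NUM_KEYS)}
        cur.execute("UPDATE items SET sort_keys = %s WHERE id = %s",
                    (json.dumps(keys), item_id))

# Query time: pick one of the key slots at random and order by it.
slot = f"k{random.randrange(NUM_KEYS)}"
with conn, conn.cursor() as cur:
    cur.execute("SELECT id FROM items ORDER BY (sort_keys ->> %s)::float DESC LIMIT 20",
                (slot,))
    page = [row[0] for row in cur.fetchall()]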
I want to optimize the space of my BigQuery and Google Storage tables. Is there a way to easily find out the cumulative space that each field in a table takes? This is not straightforward in my case, since I have a complicated hierarchy with many repeated records.
You can do this in the Web UI by simply typing (and not running) the query below, changing <column_name> to the field you are interested in,
SELECT <column_name>
FROM YourTable
and looking at the validation message, which shows the respective size.
Important: you do not need to run the query. Just check the validation message for bytesProcessed; this is the size of the respective column.
Validation is free and invokes a so-called dry run.
If you need to do such column profiling for many tables, or for a table with many columns, you can code it in your preferred language: use the Tables.get API to get the table schema, then loop through all the fields, build the respective SELECT statement for each column, dry-run it, and read totalBytesProcessed, which, as you already know, is the size of the respective column.
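A minimal sketch of that loop with the Python client library, where project.dataset.table is a placeholder table ID:

from google.cloud import bigquery

client = bigquery.Client()
table_id = "project.dataset.table"  # placeholder
table = client.get_table(table_id)  # fetches the schema, like Tables.get

for field in table.schema:
    # Dry run: nothing is executed or billed, only the processed bytes are estimated.
    config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(f"SELECT `{field.name}` FROM `{table_id}`", job_config=config)
    print(field.name, job.total_bytes_processed)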
I don't think this is exposed in any of the metadata.
However, you may be able to easily get good approximations based on your needs. The number of rows is provided, so for some of the data types, you can directly calculate the size:
https://cloud.google.com/bigquery/pricing
For types such as string, you could get the average length by querying, e.g., the first 1,000 values, and use this for your storage calculations.
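For example, a rough estimate for a string field could be computed like this and multiplied by the row count; the table ID and field name are placeholders:

from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("project.dataset.table")  # placeholder table ID

# Sample the first 1,000 values of a hypothetical string field and average their length.
avg_sql = """
SELECT AVG(LENGTH(my_string_field)) AS avg_len
FROM (SELECT my_string_field FROM `project.dataset.table` LIMIT 1000)
"""
avg_len = list(client.query(avg_sql).result())[0].avg_len
# Rough estimate only: LENGTH() counts characters, which equals bytes for ASCII text.
estimated_bytes = avg_len * table.num_rows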
As per our business requirements, I need to index the full story body (of a news story, for example), but in the Solr query result I need to return only a preview text (say, the first 400 characters) to bind to the target news listing page.
As far as I know, a field in the schema file can have stored=true or stored=false. The only way I can see right now is to set it to true, take the full story body in the result, and then excerpt the preview text manually, but this does not seem practical because (1) it will occupy GBs of disk space to store the full body and (2) the JSON response becomes very heavy (the query result can return 40K-50K stories).
I also know about limiting the number of records, but for some reasons we need the complete result at once.
Any help with achieving this requirement efficiently?
In order to display just 400 characters in the news overview, you can simply use the Solr highlighting feature and specify the number of snippets and their size. For instance, the standard highlighter has these parameters:
hl.snippets: Specifies the maximum number of highlighted snippets to generate per field. It is possible for any number of snippets from zero to this value to be generated. This parameter accepts per-field overrides.
hl.fragsize: Specifies the size, in characters, of fragments to consider for highlighting. 0 indicates that no fragmenting should be considered and the whole field value should be used. This parameter accepts per-field overrides.
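A minimal sketch of such a request from Python, assuming a local Solr core named news with the full text in a stored body field; the URL, core, and field names are placeholders:

import requests

params = {
    "q": "title:election",      # your actual query
    "fl": "id,title",           # do not return the full body field itself
    "hl": "true",
    "hl.fl": "body",            # build snippets from the body field
    "hl.snippets": 1,           # at most one snippet per document
    "hl.fragsize": 400,         # fragments of roughly 400 characters
    "wt": "json",
}
resp = requests.get("http://localhost:8983/solr/news/select", params=params)
data = resp.json()

# Snippets come back under "highlighting", keyed by document id.
previews = {doc_id: fragments.get("body", [""])[0]
            for doc_id, fragments in data["highlighting"].items()}

Note that the highlighter reads the stored field value to build the snippets, so the body field still needs stored=true; the suggestion below covers indexing everything while storing only part of the text.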
If you want to index everything but store only part of the text then you can follow the solution advised here in Solr Community.
I have a number of large sorted sets (5m-25m) in Redis and I want to get the first element that appears in a combination of those sets.
e.g. I have 20 sets and want to take sets 1, 5, 7, and 12 and get only the first element of the intersection of those sets.
It would seem that a ZINTERSTORE followed by a "ZRANGE foo 0 0" would be doing a lot more work than I require, as it would calculate all the intersections and then return the first one. Is there an alternative solution that does not need to calculate all the intersections?
There is no direct, native alternative, although I'd suggest this:
Create a hash whose members are your elements. Upon each addition to one of your sorted sets, increment the relevant member (using HINCRBY). Of course, only increment after you have checked that the element does not already exist in the sorted set you are adding it to.
That way, you can quickly know which elements appear in 4 sets.
UPDATE: Now that I think about it again, it might be too expensive to query your hash to find items with a value of 4 (O(n)). Another option would be to create another sorted set whose members are your elements and whose scores get incremented (as I described before, but using ZINCRBY); then you can quickly pull all elements with a score of 4 (using ZRANGEBYSCORE).
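A small sketch of that second option with redis-py, where item_counts is a hypothetical counter sorted set and the member sets are the ones you want to track:

import redis

r = redis.Redis()

def add_item(set_key, member, score):
    # Only bump the counter if the member is not already in this sorted set.
    if r.zscore(set_key, member) is None:
        r.zadd(set_key, {member: score})
        r.zincrby("item_counts", 1, member)

# Elements that appear in all four tracked sets have a counter value of 4.
in_all_four = r.zrangebyscore("item_counts", 4, 4)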
I've stumbled into an issue using Lucene.net in one of my projects where I'm using the SimpleFacetedSearch feature to have faceted search.
I get an exception thrown
Facet count exceeded 2048
I have 3 columns which I'm faceting; as soon as I add another facet I get the exception.
If I remove all the other facets the new facet works.
Drilling down into the source of SimpleFacetedSearch, I can see that inside its constructor it checks that the number of facets doesn't exceed MAX_FACETS, a constant set to 2048.
foreach (string field in groupByFields)
{
    ...
    num *= fieldValuesBitSets1.FieldValueBitSetPair.Count;
    if (num > SimpleFacetedSearch.MAX_FACETS)
        throw new Exception("Facet count exceeded " + (object) SimpleFacetedSearch.MAX_FACETS);
    fieldValuesBitSets.Add(fieldValuesBitSets1);
    ...
}
However, as it's public, I am able to set it like so:
SimpleFacetedSearch.MAX_FACETS = int.MaxValue;
Does anyone know why it is set to 2048 and if there are issues changing it? I was unable to find any documentation on it.
No, there shouldn't be any issue in changing it. But remember that using bitsets (as SimpleFacetedSearch does internally) is more performant when the search results are big but the facet count doesn't exceed some number (say, 1,000 facets and 10M hits).
If you have many more facets but the search results are not big, you can iterate over the results (in a collector) and build the facets yourself. This way you may get better performance (say, 100K facets and 1,000 hits).
So, 2048 may be an optimized number, and exceeding it may result in performance loss.
The problem that MAX_FACETS is there to avoid is one of memory usage and performance.
Internally SimpleFacetedSearch uses bitmaps to record which documents each facet value is used in. There is a bit for each document, and each value has a separate bitmap, so if you have a lot of values the amount of memory needed grows quickly, especially if you also have a lot of documents: memory = values * documents / 8 bytes.
My company has indexes with millions of documents and tens of thousands of values, which would require many GBs of memory.
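As a rough illustration of that formula (the concrete numbers here are made up, not taken from the post):

documents = 5_000_000          # e.g. 5M documents in the index
facet_values = 20_000          # e.g. 20K distinct facet values
bitmap_bytes = facet_values * documents / 8   # one bit per (value, document) pair
print(f"{bitmap_bytes / 1024**3:.1f} GiB")    # ~11.6 GiB just for the bitmaps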
I've created another implementation which I've called SparseFacetedSearcher. It records the doc IDs for each value, so you only pay per hit rather than a bit per document. If you have exactly one value in each document (like a product category), then the break-even point is when you have more than 32 values (more than 32 product categories).
In our case the memory usage dropped to a few hundred MB.
Feel free to have a look at https://github.com/Artesian/SparseFacetedSearch