How to generate the optimal index combination for openGauss index recommendation under multiple objectives? - indexing

According to the openGauss documentation, the index recommendation feature of the AI module supports recommending an optimal index combination within a given index-space limit. However, the index recommendation code seems to use only hill climbing. Hill climbing is a greedy algorithm: at each step it picks the candidate with the largest immediate benefit, so it can converge to a local optimum. Under the two constraints of index benefit and index space, can this algorithm fail to find the optimal combination? How do we compute the optimal solution in this case?

Monte Carlo Tree Search can deal effectively with problems that have a huge search space: it balances exploration and exploitation and finds good solutions.
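As a rough illustration, here is a minimal UCT-style Monte Carlo Tree Search sketch for choosing an index combination under a space budget. The candidate names, sizes, benefit numbers and the budget are made-up placeholders; in openGauss they would come from the advisor's what-if cost estimates, and the benefit of a combination is not really additive because indexes interact through the optimizer's plan choices.

    # UCT-style Monte Carlo Tree Search over index combinations under a
    # space budget.  All numbers below are invented for the sketch.
    import math
    import random

    # (name, size_mb, estimated_benefit) -- in practice these come from
    # what-if cost estimation, not hard-coded values.
    CANDIDATES = [
        ("idx_orders_customer", 120, 35.0),
        ("idx_orders_date",      80, 22.0),
        ("idx_lineitem_part",   300, 50.0),
        ("idx_customer_region",  60, 12.0),
        ("idx_part_brand",       90, 18.0),
    ]
    SPACE_BUDGET_MB = 400

    def benefit(chosen):
        # Additive benefit is a simplification; real benefits interact.
        return sum(CANDIDATES[i][2] for i in chosen)

    def feasible_additions(chosen):
        used = sum(CANDIDATES[i][1] for i in chosen)
        last = max(chosen) if chosen else -1
        # Only allow ids larger than the largest chosen one, so every
        # combination appears exactly once in the tree.
        return [i for i in range(last + 1, len(CANDIDATES))
                if used + CANDIDATES[i][1] <= SPACE_BUDGET_MB]

    class Node:
        def __init__(self, chosen, parent=None):
            self.chosen = frozenset(chosen)
            self.parent = parent
            self.children = {}                      # candidate id -> Node
            self.untried = feasible_additions(self.chosen)
            self.visits = 0
            self.total_reward = 0.0

        def uct_child(self, c=1.4):
            # Balance exploitation (average reward) and exploration.
            return max(self.children.values(),
                       key=lambda n: n.total_reward / n.visits
                       + c * math.sqrt(math.log(self.visits) / n.visits))

    def rollout(chosen):
        # Random playout: keep adding random feasible indexes until none fit.
        chosen = set(chosen)
        while True:
            options = feasible_additions(chosen)
            if not options:
                return chosen
            chosen.add(random.choice(options))

    def mcts(iterations=5000):
        root = Node(frozenset())
        best, best_reward = frozenset(), 0.0
        for _ in range(iterations):
            node = root
            # 1. Selection: walk down fully expanded nodes via UCT.
            while not node.untried and node.children:
                node = node.uct_child()
            # 2. Expansion: add one untried index as a new child.
            if node.untried:
                i = node.untried.pop(random.randrange(len(node.untried)))
                child = Node(node.chosen | {i}, parent=node)
                node.children[i] = child
                node = child
            # 3. Simulation: complete the combination randomly and score it.
            final = rollout(node.chosen)
            reward = benefit(final)
            if reward > best_reward:
                best, best_reward = final, reward
            # 4. Backpropagation.
            while node is not None:
                node.visits += 1
                node.total_reward += reward
                node = node.parent
        return best, best_reward

    if __name__ == "__main__":
        combo, reward = mcts()
        print(sorted(CANDIDATES[i][0] for i in combo), reward)

For a handful of candidates like this, plain exhaustive enumeration (the problem is essentially a 0/1 knapsack) would of course find the exact optimum; MCTS pays off when the candidate set is too large to enumerate.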

Related

Why do we set the similarity function at index time in Lucene?

How does Lucene use Similarity during indexing time? I understand the role of similarity while reading the index. So, searcher.setSimilarity() makes sense in scoring. What is the use of IndexWriterConfig.setSimilarity()?
How does Lucene use Similarity during indexing time?
The short answer is: Lucene captures some statistics at indexing time which can then be used to support scoring at query time. I expect it is simply a matter of efficiency that these are captured as part of the indexing process, rather than being repeatedly re-computed on the fly, when running queries.
There is a section in the Similarity javadoc which describes this at a high level:
At indexing time, the indexer calls computeNorm(FieldInvertState), allowing the Similarity implementation to set a per-document value for the field that will be later accessible via LeafReader.getNormValues(String).
The javadoc goes on to describe further details - for example:
Many formulas require the use of average document length, which can be computed via a combination of CollectionStatistics.sumTotalTermFreq() and CollectionStatistics.docCount().
So, for example, the segment info file within a Lucene index records the number of documents in each segment.
There are other statistics which can be captured in an index to support scoring calculations at query time. You can see a summary of these stats in the Index Structure Overview documentation - with links to further details.
What is the use of IndexWriterConfig.setSimilarity()?
This is a related question which follows on from the above points.
By default, Lucene uses the BM25Similarity formula.
That is one of a few different scoring models that you may choose to use (or you can define your own). The setSimilarity() method is how you can choose a different similarity (scoring model) from the default one. This means different statistics may need to be captured (and then used in different ways) to support the chosen scoring model.
It would not make sense to use one scoring model at indexing time, and a different one at query time.
(Just to note: I have never set the similarity scoring model myself - I have always used the default model.)
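For what it's worth, here is a minimal sketch of the point above, written with PyLucene so it can be shown as runnable Python (the Java API uses the same method calls; Lucene 8+ is assumed for ByteBuffersDirectory). ClassicSimilarity is used only to make the switch away from the BM25 default visible; what matters is that the writer and the searcher are given the same Similarity.

    # Use the same Similarity at indexing time and at query time (PyLucene sketch).
    import lucene
    lucene.initVM()

    from org.apache.lucene.analysis.standard import StandardAnalyzer
    from org.apache.lucene.document import Document, Field, TextField
    from org.apache.lucene.index import DirectoryReader, IndexWriter, IndexWriterConfig
    from org.apache.lucene.queryparser.classic import QueryParser
    from org.apache.lucene.search import IndexSearcher
    from org.apache.lucene.search.similarities import ClassicSimilarity
    from org.apache.lucene.store import ByteBuffersDirectory

    similarity = ClassicSimilarity()   # classic TF-IDF instead of the BM25 default
    directory = ByteBuffersDirectory()
    analyzer = StandardAnalyzer()

    # Indexing: the writer's Similarity decides how per-document norms are encoded.
    config = IndexWriterConfig(analyzer)
    config.setSimilarity(similarity)
    writer = IndexWriter(directory, config)
    doc = Document()
    doc.add(TextField("body", "lucene similarity example", Field.Store.YES))
    writer.addDocument(doc)
    writer.close()

    # Searching: use the same Similarity so scoring reads those norms consistently.
    reader = DirectoryReader.open(directory)
    searcher = IndexSearcher(reader)
    searcher.setSimilarity(similarity)
    query = QueryParser("body", analyzer).parse("similarity")
    for hit in searcher.search(query, 10).scoreDocs:
        print(hit.doc, hit.score)
    reader.close()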

Plone - ZODB catalog query sort_on multiple indexes?

I have a ZODB catalog query with a start and end date. I want to sort the result on end_date first and then start_date second.
Sorting on either end_date or start_date works fine.
I tried with a tuple (start_date,end_date), but with no luck.
Is there a way to achieve this, or does one have to employ some custom logic afterwards?
The generalized answer is to post-hoc sort your entire result set of catalog brains; use zope.sequencesort (available via PyPI, but already shipped with Plone) or similar.
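As a minimal sketch of that post-hoc sort (the query and the end_date/start_date metadata column names are assumptions to be adjusted to your schema), a plain Python composite key already gives the two-level ordering:

    # Re-sort catalog brains in memory: end_date first, start_date second.
    results = catalog(portal_type="Event")       # stand-in for your real query

    def sort_key(brain):
        # Tuples compare element by element, giving the secondary sort for free.
        return (brain.end_date, brain.start_date)

    sorted_brains = sorted(results, key=sort_key)

If one of the keys should be descending, reverse or negate it accordingly, or use zope.sequencesort as mentioned above.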
The more complex answer is a rabbit-hole of optimizations that you should only go down if you know you need to and know what you are doing:
Make sure when you do sort the brains that your user gets a sticky session to the same instance, at least for cache-affinity to get the same catalog indexes and brains (metadata);
You might want to cache across requests (thread-global) a unique session id, and a sequence of catalog RID (integer) values for your entire sorted request, should you expect the user to come back and need in subsequent batches. Of course, RIDs need to be re-constituted into ZCatalog's lazy-sequences of brains, and this requires some know-how (or reading the source).
Finally, for large result sets (many thousands), I would suggest that it is reasonable to make application-specific compromises that approximate correctness by post-hoc sorting of the current batch through to the end of the n batches after it, where n is inversely proportional to len(site.portal_catalog.uniqueValuesFor(indexnamehere)). For a large result set, the correctness of an approximated secondary sort is high when the secondary values vary a lot, and low when they vary little (many items with the same secondary value, such that the count is much larger than the batch size, can make this frustrating).
Do not optimize as such unless you are dealing with particularly large result sets.
It should go without saying: if you do optimize, you need to verify that you are actually getting a superior result (profile and benchmark). If you cannot justify investing the time to do this, you cannot justify optimizing.

Elasticsearch multi-index performance

I'm thinking about moving from one index to day-based indexes (multi-index) in our Elasticsearch cluster, which has a huge number of records.
The actual question is: how will this affect the performance of indexing, searching, and mapping in the ES cluster?
Does it take more time to search through one huge index than through hundreds of big indices?
It will take less time to search through one large index, rather than hundreds of smaller ones.
Breaking an index up in this fashion could help performance if you will primarily be searching only one of the broken-out indexes. In your case, if you will most often need to search for records added on a particular day, then this might help you, performance-wise. If you will mostly be searching across the entire range of indexes, you would generally be better off searching the single monolithic index.
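To make the trade-off concrete, here is a small sketch with the elasticsearch-py client (a recent 7.x/8.x client is assumed; the records-YYYY.MM.DD naming scheme, the field name and the host are placeholders for the example):

    # Daily indices vs. a wildcard across all of them.
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")
    match = {"match": {"message": "error"}}

    # Cheap case: the application knows it only needs one day, so the query
    # touches a single small index.
    one_day = es.search(index="records-2023.05.01", query=match)

    # Expensive case: the same query fanned out over every daily index is
    # roughly the monolithic-index cost plus extra per-shard overhead.
    all_days = es.search(index="records-*", query=match)

    print(one_day["hits"]["total"], all_days["hits"]["total"])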
Finally, we have implemented ES multi-indexing in our company. For our application we chose a monthly-index strategy, so we create a new index every month.
Of course, as @femtoRgon advised, searching through all the smaller indices takes a little longer, but the application as a whole has become faster because of its logic.
So, my advice to everybody who wants to move from one index to multiple indices: research your application's needs and select appropriate slices of the whole index (if it is really needed).
As an example, I can share some results from researching our application that helped us decide to use monthly indices:
90-95% of our queries touch only the last 3 months
we have about 4 big groups of queries: today, last week, last month, and last 3 months (of course, we could create weekly or daily indices, but they would be too small, since we don't have enough documents for that)
we can explain to customers why they need to wait when they run an "unusual" query across the whole period (all indices).

Lucene: estimate index size and search time

I am looking for a way to estimate indexing time, index size, and search time with the Lucene library.
I have some numbers for 500 files and I would like to estimate the values for 5000 documents.
I searched the web and did not find any good way to estimate these numbers.
The answer depends hugely on what you put into the index. Obviously, if you store full field content, then you can expect at least linear growth, with the factor within an order of magnitude from 1. If you only index the terms, you will need much less space, but at the same time the estimate will get much more difficult. The number of unique index terms is a very important factor, for example. This will probably start levelling off at some number that depends highly on the details of your content. All in all, in such a case measurement is probably your only reliable method.
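If you do go the measurement route, the extrapolation itself is trivial; the caveat is which parts actually scale linearly. A rough sketch, where all numbers are placeholders to be replaced by your own 500-file measurements:

    # Naive linear extrapolation from a measured 500-file sample to 5000 files.
    sample_docs = 500
    sample_index_mb = 120.0        # measured on-disk index size (placeholder)
    sample_index_seconds = 40.0    # measured indexing wall-clock time (placeholder)

    scale = 5000 / sample_docs

    # Stored field content grows roughly linearly with the number of documents,
    # while the term dictionary usually grows more slowly as vocabulary repeats,
    # so treat these as first guesses rather than promises.
    print("estimated index size: ~%.0f MB" % (sample_index_mb * scale))
    print("estimated indexing time: ~%.0f s" % (sample_index_seconds * scale))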

How do I estimate the size of a Lucene index?

Is there a known math formula that I can use to estimate the size of a new Lucene index? I know how many fields I want to have indexed, and the size of each field. And, I know how many items will be indexed. So, once these are processed by Lucene, how does it translate into bytes?
Here is the Lucene index format documentation.
The major file is the compound index (.cfs file).
If you have term statistics, you can probably get an estimate for the .cfs file size.
Note that this varies greatly based on the Analyzer you use, and on the field types you define.
The index stores each "token" or text field, etc., only once, so the size depends on the nature of the material being indexed. Add to that whatever is being stored as well. One good approach might be to take a sample and index it, and use that to extrapolate out for the complete source collection. However, the ratio of index size to source size decreases over time, as the words are already in the index, so you might want to make the sample a decent percentage of the original.
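A small sketch of that sample-and-extrapolate approach (directory paths are placeholders): measure the sample index's on-disk size against the sample source and scale the ratio up, remembering that the ratio tends to shrink as the vocabulary saturates.

    import os

    def dir_size_bytes(path):
        # Total size of every file under a directory (index or source corpus).
        total = 0
        for root, _dirs, files in os.walk(path):
            for name in files:
                total += os.path.getsize(os.path.join(root, name))
        return total

    # Placeholder paths: a representative sample of the source and its index.
    source_bytes = dir_size_bytes("sample_docs")
    index_bytes = dir_size_bytes("sample_index")

    ratio = index_bytes / source_bytes
    print("index/source ratio for the sample: %.2f" % ratio)
    # ratio * full_source_size is then an upper-ish bound for the full index,
    # since the ratio usually falls as repeated terms stop adding new entries.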
I think it also has to do with the frequency of each term (i.e. an index of 10,000 copies of the same terms should be much smaller than an index of 10,000 wholly unique terms).
Also, there's probably a small dependency on whether you're using Term Vectors or not, and certainly whether you're storing fields or not. Can you provide more details? Can you analyze the term frequency of your source data?