lucene estimate index size, search time - lucene

I search a way to estimate indexing time, index size, search time with lucene library.
I have some number for 500 files and i would like to estimate value for 5000 document.
I search on the web and i don't found any good way to estimate theses number.

The answer depends hugely on what you put into the index. Obviously, if you store full field content, then you can expect at least linear growth, with the factor within an order of magnitude from 1. If you only index the terms, you will need much less space, but at the same time the estimate will get much more difficult. The number of unique index terms is a very important factor, for example. This will probably start levelling off at some number that depends highly on the details of your content. All in all, in such a case measurement is probably your only reliable method.

Related

How to generate the optimal index combination for the index recommendation of opengauss under multiple targets?

The index recommendation of the AI module in the opengauss document supports the introduction of the optimal index combination within the limit of the index space. However, the index recommendation code only seems to use the hill-climbing method. The hill-climbing method is a greedy algorithm. Each time, it only selects the one with the largest current profit and converges and local solutions. However, under the constraints of the two goals of index return and space combination, is the algorithm unable to find the optimal solution? How do we calculate the optimal solution in this case?
Monte Carlo Tree Search, which can effectively solve some problems with huge exploration space, can balance exploration and utilization, and find effective solutions.

Index versus Sequential search performance?

Say I have a database that holds information about books and their dates of publishing. (two attributes, bookName and publicationDate).
Say that the attribute publicationDate has a Hash Index.
If I wanted to display every book that was published in 2010 I would enter this query : select bookName from Books where publicationDate=2010.
In my lecture, it is explained that if there is a big volume of data and that the publication dates are very diverse, the more optimized way is to use the Hash index in order to keep only the books published in 2010.
However, if the vast majority of the books that are in the database were published in 2010 it is better to search the database sequentially in terms of performance.
I really don't understand why? What are the situations where using an index is more optimized and why?
It is surprising that you are learning about hash indexes without understanding this concept. Hash indexing is a pretty advanced database concept; most databases don't even support them.
Although the example is quite misleading. 2010 is not a DATE; it is a YEAR. This is important because a hash index only works on equality comparisons. So the natural way to get a year of data from dates:
where publicationDate >= date '2010-01-01' and
publicationDate < date '2011-01-01'
could not use a hash index because the comparisons are not equality comparisons.
Indexes can be used for several purposes:
To quickly determine which rows match filtering conditions so fewer data pages need to be read.
To identify rows with common key values for aggregations.
To match rows between tables for joins.
To support unique constraints (via unique indexes).
And for b-tree indexes, to support order by.
This is the first purpose, which is to reduce the number of data pages being read. Reading a data page is non-trivial work, because it needs to be fetched from disk. A sequential scan reads all data pages, regardless of whether or not they are needed.
If only one row matches the index conditions, then only one page needs to be read. That is a big win on performance. However, if every page has a row that matches the condition, then you are reading all the pages anyway. The index seems less useful.
And using an index is not free. The index itself needs to be loaded into memory. The keys need to be hashed and processed during the lookup operation. All of this overhead is unnecessary if you just scan the pages (although there is other overhead for the key comparisons for filtering).
Using an index has a performance cost. If the percentage of matches is a small fraction of the whole table, this cost is more than made up for by not having to scan the whole table. But if there's a large percentage of matches, it's faster to simply read the table.
There is the cost of reading the index. A small, frequently used index might be in memory, but a large or infrequently used one might be on disk. That means slow disk access to search the index and get the matching row numbers. If the query matches a small number of rows this overhead is a win over searching the whole table. If the query matches a large number of rows, this overhead is a waste; you're going to have to read the whole table anyway.
Then there is an IO cost. With disks it's much, much faster to read and write sequentially than randomly. We're talking 10 to 100 times faster.
A spinning disk has a physical part, the head, it must move around to read different parts of the disk. The time it takes to move is known as "seek time". When you skip around between rows in a table, possibly out of order, this is random access and induces seek time. In contrast, reading the whole table is likely to be one long continuous read; the head does not have to jump around, there is no seek time.
SSDs are much, much faster, there's no physical parts to move, but they're still much faster for sequential access than random.
In addition, random access has more overhead between the operating system and the disk; it requires more instructions.
So if the database decides a query is going to match most of the rows of a table, it can decide that it's faster to read them sequentially and weed out the non-matches, than to look up rows via the index and using slower random access.
Consider a bank of post office boxes, each numbered in a big grid. It's pretty fast to look up each box by number, but it's much faster to start at a box and open them in sequence. And we have an index of who owns which box and where they live.
You need to get the mail for South Northport. You look up in the index which boxes belong to someone from South Northport, see there's only a few of them, and grab the mail individually. That's an indexed query and random access. It's fast because there's only a few mailboxes to check.
Now I ask you to get the mail for everyone but South Northport. You could use the index in reverse: get the list of boxes for South Northport, subtract those from the list of every box, and then individually get the mail for each box. But this would be slow, random access. Instead, since you're going to have to open nearly every box anyway, it is faster to check every box in sequence and see if it's mail for South Northport.
More formally, the indexed vs table scan performance is something like this.
# Indexed query
C[index] + (C[random] * M)
# Full table scan
(C[sequential] + C[match]) * N
Where C are various constant costs (or near enough constant), M is the number of matching rows, and N is the number of rows in the table.
We know C[sequential] is 10 to 100 times faster than C[random]. Because disk access is so much slower than CPU or memory operations, C[match] (the cost of checking if a row matches) will be relatively small compared to C[sequential]. More formally...
C[random] >> C[sequential] >> C[match]
Using that we can assume that C[sequential] + C[match] is C[sequential].
# Indexed query
C[index] + (C[random] * M)
# Full table scan
C[sequential] * N
When M << N the indexed query wins. As M approaches N, the full table scan wins.
Note that the cost of using the index isn't really constant. C[index] is things like loading the index, looking up a key, and reading the row IDs. This can be quite variable depending on the size of the index, type of index, and whether it is on disk (cold) or in memory (hot). This is why the first few queries are often rather slow when you've first started a database server.
In the real world it's more complicated than that. In reality rows are broken up into data pages and databases have many tricks to optimize queries and disk access. But, generally, if you're matching most of the rows a full table scan will beat an indexed lookup.
Hash indexes are of limited use these days. It is a simple key/value pair and can only be used for equality checks. Most databases use a B-Tree as their standard index. They're a little more costly, but can handle a broader range of operations including equality, ranges, comparisons, and prefix searches such as like 'foo%'.
The Postgres Index Types documentation is pretty good high level run-down of the various advantages and disadvantages of types of indexes.

How does geohash index work in Lucene

In lucene spatial 4 I'm wondering how the geohash index works behind the scenes. I understand the concept of the geohash which basically takes 2 points (lat, lon) and creates a single "string" hash.
Is the index just a "string" index (r-tree or quad-tree) or something along these lines (such as just indexing a last name).....or is there something special with it.
For pre-fixed type searches do all of the n-grams of the hash get indexed such as if a geohash is
drgt2abc does this get indexed as d, dr, drg, drgt, etc..
Is there a default number of n-grams that we might want indexed?
With this type of indexing will search queries with 100 thousand records verse 100 million records have similar query performance for spatial queries. (Such as box/polygon, or distance) or can I expect a general/typical slow degradation of the index as lots of records added.
Thanks
The best online explanation is my video: Lucene / Solr 4 Spatial deep dive
Is the index just a "string" index (r-tree or quad-tree) or something
along these lines (such as just indexing a last name).....or is there
something special with it.
Lucene, fundamentally, has just one index used for text, numbers, and now spatial. You could say it's a string index. It's a sorted list of bytes/strings. From a higher level view, using spatial in this way is the family of "Tries" AKA "PrefixTrees" in computer science.
For pre-fixed type searches do all of the n-grams of the hash get
indexed such as if a geohash is
drgt2abc does this get indexed as d, dr, drg, drgt, etc..
Yes.
Is there a default number of n-grams that we might want indexed?
You tell it conveniently in terms of the precision requirements you have and it'll lookup how long it needs to be. Or you can tell it by length.
With this type of indexing will search queries with 100 thousand
records verse 100 million records have similar query performance for
spatial queries. (Such as box/polygon, or distance) or can I expect a
general/typical slow degradation of the index as lots of records
added.
Indeed, this type of indexing (and more specifically the clever recursive search tree algorithm that uses it) means that you'll have scalable search performance. 100m is a ton of documents for one filter to match so it's of course going to be slower than one that matches only 100k docs, but it's definitely sub-linear. And by next year it'll be even faster, due to work happening this summer on a new PrefixTree encoding plus a spatial benchmark in progress which will allow me to make further tuning optimizations I have planned.

How to optimize large index on solr

our index is rising relatively fast, by adding 2000-3000 documents a day.
We are running an optimize every night.
The point is, that Solr needs double disc space while optimizing. Actually the index has an size of 44GB, which works on an 100GB partition - for the next few months.
The point is, that 50% of the disk space are unused for 90% of the day and only needed during optimize.
Next thing: we have to add more space on that partition periodical - which is always a painful discussion with the guys from the storage department (because we have more than one index...).
So the question is: is there a way to optimize an index without blocking additional 100% of the index size on disk?
I know, that multi-cores an distributed search is an option - but this is only an "fall back" solution, because for that we need to change the application basically.
Thank you!
There is continous merging going on under the hood in Lucene. Read up on the Merge Factor which can be set in the solrconfig.xml. If you tweak this setting you probably wont have to optimize at all.
You can try partial optimize by passing maxSegment parameter.
This will reduce the index to that specified number.
I suggest you do in batches (e.g if there are 50 segments first reduce to 30 then to 15 and so on).
Here's the url:
host:port/solr/CORE_NAME/update?optimize=true&maxSegments=(Enter the number of segments you want to reduce to. Ignore the parentheses)&waitFlush=false

How do I estimate the size of a Lucene index?

Is there a known math formula that I can use to estimate the size of a new Lucene index? I know how many fields I want to have indexed, and the size of each field. And, I know how many items will be indexed. So, once these are processed by Lucene, how does it translate into bytes?
Here is the lucene index format documentation.
The major file is the compound index (.cfs file).
If you have term statistics, you can probably get an estimate for the .cfs file size,
Note that this varies greatly based on the Analyzer you use, and on the field types you define.
The index stores each "token" or text field etc., only once...so the size is dependent on the nature of the material being indexed. Add to that whatever is being stored as well. One good approach might be to take a sample and index it, and use that to extrapolate out for the complete source collection. However, the ratio of index size to source size decreases over time as well, as the words are already there in the index, so you might want to make the sample a decent percentage of the original.
I think it has to also do with the frequency of each term (i.e. an index of 10,000 copies of the sames terms should be much smaller than an index of 10,000 wholly unique terms).
Also, there's probably a small dependency on whether you're using Term Vectors or not, and certainly whether you're storing fields or not. Can you provide more details? Can you analyze the term frequency of your source data?