Solr-8.9.0 index directory size reduced - indexing

Environment: solr-8.9.0, jdk-11.0.12
I used the following command to calculate the index size in Solr:
du -sh solr-8.9.0/server/solr/core_name/data/index
But when I executed the above command again after some time (5 hours), the index size had shrunk to 64% of the previously calculated size. What could be the possible reason for this?
Is this the right way to calculate the index size, or is there another method?

It depends on what you mean by "index size". The command you used gives you the size of the index on disk, not the number of documents it contains.
The size on disk can shrink thanks to the segment merging done by Solr: your index is basically a bunch of files (segments) that keep deleted documents alongside current documents until a merge occurs. When the merge occurs, the deleted documents are actually removed from disk, freeing some space.
This is quite a simplification of what's really happening; you can read more about it in the Solr documentation.
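If you want to see how much of that on-disk size is deleted-but-not-yet-merged data, you can ask the core itself rather than the file system. Below is a minimal sketch, assuming Solr runs on localhost:8983 and the core is named core_name (both are assumptions, adjust them), that calls the Luke request handler with the JDK 11 HTTP client and prints the raw JSON; the index section of that response reports numDocs, maxDoc, deletedDocs and segmentCount.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class IndexStats {
    public static void main(String[] args) throws Exception {
        // Assumed host/port/core name - adjust to your installation.
        String url = "http://localhost:8983/solr/core_name/admin/luke?numTerms=0&wt=json";

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());

        // The "index" section of the JSON contains numDocs, maxDoc, deletedDocs
        // and segmentCount; a large deletedDocs value is space that will be
        // reclaimed (and the directory will shrink) at the next merge.
        System.out.println(response.body());
    }
}
du -sh is still a fair way to measure what the index costs you on disk; just keep in mind that the number is a moving target between merges.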

Related

Unused index in PostgreSQL

I'm learning about indexing in PostgreSQL. I started trying to create my own index and analyzing how it affects execution time. I created some tables (the column definitions are not reproduced here) and filled them with data. After that I created my custom index:
create index events_organizer_id_index on events(organizer_ID);
and executed this command (events table contains 148 rows):
explain analyse select * from events where events.organizer_ID = 4;
I was surprised that the search was executed without my index, and I got this result (a screenshot, not reproduced here).
As far as I know, if my index had been used in the search, the plan would contain text like "Index Scan on events".
So, can someone explain, or point me to references, on how to use indexes effectively and where I should use them to see the difference?
From "Rows removed by filter: 125" I see there are too few rows in the events table. Just add couple of thousands rows and give it another go
from the docs
Use real data for experimentation. Using test data for setting up indexes will tell you what indexes you need for the test data, but that is all.
It is especially fatal to use very small test data sets. While selecting 1000 out of 100000 rows could be a candidate for an index, selecting 1 out of 100 rows will hardly be, because the 100 rows probably fit within a single disk page, and there is no plan that can beat sequentially fetching 1 disk page.
In most cases, when the database uses an index, it only gets the address where the row is located: the data block ID and the offset within the block, since there may be many rows in one block of 4 or 8 KB.
So the database first searches the index for the block address, then it looks up the block on disk, reads it, and extracts the row you need.
When there are too few rows, they fit into one or a couple of data blocks, which makes it easier and quicker for the database to read the whole table without using the index at all.
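To see the planner switch to the index, you can pad the table with a few hundred thousand synthetic rows, refresh the statistics, and re-run the EXPLAIN. A rough JDBC sketch, assuming a simplified events table in which organizer_ID is the only column that needs a value and a hypothetical local testdb connection (adjust the URL, credentials and insert to your schema):
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ExplainDemo {
    public static void main(String[] args) throws Exception {
        // Assumed connection details - adjust to your database.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/testdb", "postgres", "postgres");
             Statement st = conn.createStatement()) {

            // Pad the table so the planner has a reason to prefer the index.
            st.executeUpdate(
                "insert into events(organizer_ID) " +
                "select (random() * 1000)::int from generate_series(1, 200000)");
            // Refresh the planner statistics.
            st.execute("analyze events");

            // Re-run the original query and print the plan line by line.
            try (ResultSet rs = st.executeQuery(
                    "explain analyse select * from events where events.organizer_ID = 4")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
        }
    }
}
With a couple of hundred thousand rows and fresh statistics, the plan should show an Index Scan (or Bitmap Index Scan) using events_organizer_id_index instead of a sequential scan.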
Look at it the following way:
The database decides which way is faster to find your tuple (=record) with organizer_id 4. There are two ways:
a) Read the index and then skip to the block which contains the data.
b) Read the heap and find the record there.
The information in your screenshot shows 126 records (125 skipped + your record) with a length ("width") of 62 bytes. Including overhead, these data fit into two database blocks of 8 KB. Since a rotating disk or SSD reads a series of blocks anyway - they always read more blocks into the buffer - it's one read operation for these two blocks.
So the database decides that it is pointless to first read the index to find the correct record (or, in our case, the two blocks) and then read the data from the heap using the information from the index. That would be two read operations. Even with storage technology newer than rotating disks, this takes more time than just scanning the two blocks. That's why the database doesn't use the index.
Indexes on such small tables aren't useful for searching. Nevertheless, unique indexes are still worthwhile to prevent duplicate entries.
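If you want to check the "fits into a block or two" reasoning on your own table, you can ask the catalog how many 8 KB pages the heap actually occupies. A small sketch, reusing the same hypothetical JDBC connection details as above; relpages/reltuples are the planner's estimates (refreshed by ANALYZE) and pg_relation_size gives the exact heap size in bytes:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class TableSize {
    public static void main(String[] args) throws Exception {
        // Assumed connection details - adjust to your database.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/testdb", "postgres", "postgres");
             Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(
                 "select relpages, reltuples, pg_relation_size('events') as bytes " +
                 "from pg_class where relname = 'events'")) {
            while (rs.next()) {
                // For ~126 rows of 62 bytes plus per-row overhead this should
                // report one or two pages - exactly the case where a
                // sequential scan wins.
                System.out.printf("pages=%d tuples=%.0f bytes=%d%n",
                        rs.getInt("relpages"), rs.getDouble("reltuples"), rs.getLong("bytes"));
            }
        }
    }
}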

Put and Delete with CouchDB + Lucene

I'm running CouchDB (1.2.1) + Lucene on Linux (https://github.com/rnewson/couchdb-lucene/), and I have a few questions.
I index everything - one index for all documents. I've got around 20,000,000 documents.
How fast are puts/deletes applied to the index? I have about 10-50 puts/deletes per second.
Is there a rule, like after 10,000 updates you have to optimize the index?
Are changes to documents immediately visible in the index? If not, is there a delay or a temporary table for these updates/deletes?
Thanks in advance - Brandon
Use a profiler to measure the put/delete performance. That's the only way you'll get reasonably accurate numbers.
Optimization depends on how quickly the index is changing -- again, you will need to experiment and profile.
Changes are immediately visible in the index, but not to already-open IndexReaders.
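For the first question, a quick way to get numbers before reaching for a full profiler is to time a burst of writes yourself. The sketch below, assuming CouchDB on localhost:5984 and a hypothetical database named mydb, measures raw document PUT throughput on the CouchDB side with the JDK HTTP client; the Lucene indexing cost in couchdb-lucene comes on top of this.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.UUID;

public class PutBenchmark {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        int docs = 1000;

        long start = System.nanoTime();
        for (int i = 0; i < docs; i++) {
            // Create a new document with a random id and a tiny JSON body.
            String id = UUID.randomUUID().toString();
            HttpRequest put = HttpRequest.newBuilder(
                        URI.create("http://localhost:5984/mydb/" + id))
                    .header("Content-Type", "application/json")
                    .PUT(HttpRequest.BodyPublishers.ofString(
                        "{\"type\":\"bench\",\"n\":" + i + "}"))
                    .build();
            HttpResponse<String> resp =
                    client.send(put, HttpResponse.BodyHandlers.ofString());
            if (resp.statusCode() >= 300) {
                throw new IllegalStateException("PUT failed: " + resp.body());
            }
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.printf("%d puts in %d ms (%.1f puts/s)%n",
                docs, elapsedMs, docs * 1000.0 / elapsedMs);
    }
}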

How to optimize large index on solr

Our index is growing relatively fast - we add 2000-3000 documents a day.
We are running an optimize every night.
The problem is that Solr needs double the disk space while optimizing. Right now the index has a size of 44 GB, which works on a 100 GB partition - for the next few months.
The thing is that 50% of the disk space is unused for 90% of the day and only needed during the optimize.
Next thing: we have to add more space to that partition periodically - which is always a painful discussion with the guys from the storage department (because we have more than one index...).
So the question is: is there a way to optimize an index without reserving an additional 100% of the index size on disk?
I know that multiple cores and distributed search are an option - but this is only a "fallback" solution, because for that we would basically have to change the application.
Thank you!
There is continuous merging going on under the hood in Lucene. Read up on the merge factor, which can be set in solrconfig.xml. If you tweak this setting you probably won't have to optimize at all.
You can try a partial optimize by passing the maxSegments parameter.
This will reduce the index to the specified number of segments.
I suggest you do it in batches (e.g. if there are 50 segments, first reduce to 30, then to 15, and so on).
Here's the URL:
host:port/solr/CORE_NAME/update?optimize=true&maxSegments=<target number of segments>&waitFlush=false
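If you want to script that batched reduction rather than pasting URLs by hand, here is a small sketch using the JDK HTTP client. The host, port and core name are assumptions, and the 30/15 targets just follow the batching suggestion above; pick steps that match your actual segment count.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class PartialOptimize {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // Walk the segment count down in steps, as suggested above.
        int[] targets = {30, 15};

        for (int maxSegments : targets) {
            String url = "http://localhost:8983/solr/CORE_NAME/update"
                    + "?optimize=true&maxSegments=" + maxSegments + "&waitFlush=false";
            HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println("maxSegments=" + maxSegments + " -> HTTP " + response.statusCode());
        }
    }
}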

How do I remove logically deleted documents from a Solr index?

I am implementing Solr for free-text search in a project where the records available to be searched will need to be added and deleted on a large scale every day.
Because of the scale I need to make sure that the size of the index is appropriate.
On my test installation of Solr, I index a set of 10 documents. Then I make a change in one of the documents and want to replace the document with the same ID in the index. This works correctly and behaves as expected when I search.
I am using this code to update the document:
getSolrServer().deleteById(document.getIndexId());
getSolrServer().add(document.getSolrInputDocument());
getSolrServer().commit();
What I noticed though is that when I look at the stats page for the Solr server that the figures are not what I expect.
After the initial index, numDocs and maxDocs both equal 10 as expected. When I update the document however, numDocs is still equal to 10 (expected) but maxDocs equals 11 (unexpected).
When reading the documentation I see that
maxDoc may be larger as the maxDoc count includes logically deleted documents that have not yet been removed from the index.
So the question is, how do I remove logically deleted documents from the index?
If these documents still exist in the index do I run the risk of performance penalties when this is run with a very large volume of documents?
Thanks :)
You have to optimize your index.
Note that an optimize is expensive; you probably should not do it more than once a day.
Here is some more info on optimize:
http://www.lucidimagination.com/search/document/CDRG_ch06_6.3.1.3
http://wiki.apache.org/solr/SolrPerformanceFactors#Optimization_Considerations
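For completeness, in code this is a single call. With the SolrServer handle from your snippet it is simply getSolrServer().optimize() after the commit; a standalone sketch with the newer SolrJ client API (the core URL is an assumption) looks like this:
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class ExpungeDeletes {
    public static void main(String[] args) throws Exception {
        // Assumed core URL - adjust to your installation.
        try (SolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/my_core").build()) {
            // Merges segments and drops the logically deleted documents,
            // so maxDocs falls back in line with numDocs.
            client.optimize();
        }
    }
}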

How do I estimate the size of a Lucene index?

Is there a known math formula that I can use to estimate the size of a new Lucene index? I know how many fields I want to have indexed, and the size of each field. And, I know how many items will be indexed. So, once these are processed by Lucene, how does it translate into bytes?
Here is the Lucene index format documentation.
The major file is the compound index (.cfs file).
If you have term statistics, you can probably get an estimate of the .cfs file size.
Note that this varies greatly based on the Analyzer you use, and on the field types you define.
The index stores each "token" or text field, etc., only once, so the size depends on the nature of the material being indexed. Add to that whatever is being stored as well. One good approach might be to take a sample, index it, and use that to extrapolate out to the complete source collection (see the sketch below). However, the ratio of index size to source size decreases as the index grows, because the words are already in the index, so you might want to make the sample a decent percentage of the original.
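A rough sketch of that sample-and-extrapolate approach, assuming a modern Lucene version, a single hypothetical text field called body and the StandardAnalyzer; swap in your real analyzer, fields and stored values, since, as noted, those change the result considerably.
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class IndexSizeEstimate {
    public static void main(String[] args) throws Exception {
        Path indexDir = Files.createTempDirectory("sample-index");
        long totalItems = 1_000_000;          // planned size of the full collection
        List<String> sample = loadSample();   // ideally a decent slice of the real data

        // Index the sample exactly the way the real index will be built.
        try (FSDirectory dir = FSDirectory.open(indexDir);
             IndexWriter writer = new IndexWriter(dir,
                     new IndexWriterConfig(new StandardAnalyzer()))) {
            for (String text : sample) {
                Document doc = new Document();
                doc.add(new TextField("body", text, Field.Store.YES));
                writer.addDocument(doc);
            }
            writer.commit();
        }

        // Measure the resulting directory and extrapolate linearly.
        long sampleBytes = Files.walk(indexDir)
                .filter(Files::isRegularFile)
                .mapToLong(p -> p.toFile().length())
                .sum();
        double estimate = (double) sampleBytes / sample.size() * totalItems;
        System.out.printf("sample: %d docs, %d bytes -> rough estimate: %.0f bytes%n",
                sample.size(), sampleBytes, estimate);
    }

    private static List<String> loadSample() {
        // Placeholder: return a representative sample of your real documents.
        return List.of("some representative text", "more representative text");
    }
}
Because later documents contribute fewer new terms than early ones, treat the linear extrapolation as a rough upper bound and make the sample as large as you can afford.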
I think it also has to do with the frequency of each term (i.e. an index of 10,000 copies of the same terms should be much smaller than an index of 10,000 wholly unique terms).
Also, there's probably a small dependency on whether you're using Term Vectors or not, and certainly whether you're storing fields or not. Can you provide more details? Can you analyze the term frequency of your source data?