How much storage should a LucidWorks search engine index occupy?

I'm trying to use LucidWorks (http://www.lucidimagination.com/products/lucidworks-search-platform) as a search engine for my organization's intranet.
I want it to index various document types (Office formats, PDFs, web pages) from various data sources (web & wiki, file system, Subversion repositories).
So far I have tried indexing several sites, directories, and repositories (about 500K documents with a total size of about 50GB), and the resulting index is 155GB.
Is this reasonable? Should the index occupy more storage than the data itself? What would be a reasonable rule of thumb for the data-size to index-size ratio?

There is no single reasonable index size; it depends entirely on the data you have.
Ideally the index should be smaller than the source data, but there is no fixed rule of thumb.
For any given data set, the index size depends on how you are indexing the data, and many factors affect it.
Most of the space in the index is consumed by stored fields.
If you are indexing the content of documents and all of that content is also stored, the index size will certainly grow huge.
Fine-tuning the attributes of the indexed fields also helps save space.
You may want to revisit which fields actually need to be indexed and which need to be stored.
Also check whether you are using lots of copyFields that duplicate data, or are otherwise maintaining repetitive data.
Running an optimize might help as well.
More info: http://wiki.apache.org/solr/SolrPerformanceFactors
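To make the stored-vs-indexed distinction concrete, here is a minimal Lucene (Java) sketch rather than anything LucidWorks-specific; it assumes a recent Lucene API, and the field names and index path are made up for the example:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StoredField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;

    import java.nio.file.Paths;

    public class StoredVsIndexed {
        public static void main(String[] args) throws Exception {
            try (IndexWriter writer = new IndexWriter(
                    FSDirectory.open(Paths.get("/tmp/demo-index")),
                    new IndexWriterConfig(new StandardAnalyzer()))) {

                Document doc = new Document();
                // Indexed for full-text search but NOT stored: only inverted-index
                // entries are kept, so this adds comparatively little space.
                doc.add(new TextField("body", "full text of the crawled document ...", Field.Store.NO));
                // Stored so it can be returned with search results; stored fields are
                // what usually dominate index size when whole documents are kept.
                doc.add(new StoredField("title", "Example document title"));
                writer.addDocument(doc);
            }
        }
    }

In Solr terms the same lever is the stored="true"/"false" attribute on each field in the schema; turning off storage for large content fields is usually the single biggest space saving.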

Related

Does the RavenDB compression bundle provide benefits with many small documents?

I am trying to better understand how RavenDB uses disk space.
My application has many small documents (approximately 140 bytes each). Presently, there are around 81,000 documents which would give a total data size of around 11MB. However, the size of the database is just over 70MB.
Is most of the actual space being used by indexes?
I had read somewhere else that there may be a minimum overhead of around 600 bytes per document. This would consume around 49MB, which is more in the ballpark of the actual use I am seeing.
Would using the compression bundle provide much benefit in this scenario (many small documents), or is it targeted towards helping reduce the size of databases with very large documents?
I have done some further testing on my own and determined, in answer to my own question, that:
Indexes are not the main consumer of disk space in my scenario. In this case, indexes represent < 25% of the disk space used.
Adding the compression bundle for a database with a large number of small documents does not really reduce the total amount of disk space used. This is likely due to some minimum data overhead that each document requires. Compression would benefit documents that are very large.
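As a rough sanity check on that overhead theory, here is the arithmetic in a small Java sketch; the figures are the ones from the question, and the ~600 bytes is the estimated per-document overhead mentioned there, not a documented RavenDB constant:

    public class RavenSizeEstimate {
        public static void main(String[] args) {
            long docs = 81_000;
            long payloadBytes = 140;   // approximate data per document, from the question
            long overheadBytes = 600;  // assumed per-document overhead, also from the question

            System.out.printf("raw data:       ~%.1f MB%n", docs * payloadBytes / 1e6);                    // ~11.3 MB
            System.out.printf("overhead alone: ~%.1f MB%n", docs * overheadBytes / 1e6);                   // ~48.6 MB
            System.out.printf("data+overhead:  ~%.1f MB%n", docs * (payloadBytes + overheadBytes) / 1e6);  // ~59.9 MB
        }
    }

Those estimates land in the same ballpark as the ~70MB database size observed, which is consistent with per-document overhead, rather than indexes, dominating.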
Is most of the actual space being used by indexes?
Yes, that's likely. Remember that Raven creates indexes for the different queries you make. You can fire up Raven Studio to see which indexes it has created for you.
Would using the compression bundle provide much benefit in this scenario (many small documents), or is it targeted towards helping reduce the size of databases with very large documents?
Probably wouldn't benefit your scenario of small documents. The compression bundle works on individual documents, not on indexes. But it might be worth trying to see what results you get.
Bigger question: since hard drive space is cheap and only getting cheaper, and 70MB is a speck on the map, why are you concerned about hard drive space? Databases often trade disk space for speed (e.g. multiple indexes, like Raven), and this is usually a good trade-off for most apps.

RavenDB: Storage Size Problems

I'm doing some testing with RavenDB to store data coming from an iPhone application. The application is going to send up a string of 5 GPS coordinates with a GUID for the key. I'm seeing in RavenDB that each document is around 664-668 bytes. That's huge for 10 decimals and a GUID. Can someone help me understand what I'm doing wrong? I noticed the size was extraordinarily large when a million records came to over a gig on disk. By my calculations it should be much smaller. Purely based on the data sizes, shouldn't each document be around 100 bytes? And given that the document database has the object schema built in, let's say double that to 200 bytes. Given that calculation, the database should be about two hundred megs with 1 million records, but it's ten times larger. Can someone help me see where I've gone wrong with the math here?
(Got a friend to check my math and I was off by a bit - numbers updated)
As a general principle, NoSQL databases aren't optimized for disk space. That is more of a traditional RDBMS requirement. Often with NoSQL, you will choose to store the data in duplicate or triplicate for various reasons.
Specifically with RavenDB, each document is in JSON format, so you have some overhead there. However, it is actually persisted on disk in BSON format, saving you some bytes. This implementation detail is hidden from the client. Also, every document has two streams: the main document content and the associated metadata. This is very powerful, but it does take up additional disk space. Both the document and the metadata are kept in BSON format in the ESENT-backed document store.
Then you need to consider how you will access the data. Any static indexes you create, and any dynamic indexes you ask Raven to create for you via its LINQ API, will have the data copied into the index store. This is a separate store implemented with Lucene.net using its own index file format. You need to take this into consideration when estimating disk space requirements. (BTW, you would also have this concern with indexes in an RDBMS solution.)
If you are super concerned about optimizing every byte of disk space, perhaps NoSQL solutions aren't for you. Just about every product on the market has these types of overhead. But keep in mind that disk space is cheap today. Relational databases optimized for disk space because storage was very expensive when they were invented. The world has changed, and NoSQL solutions embrace that.
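To see where the bytes go, here is a rough back-of-the-envelope sketch in Java; the JSON shape and metadata keys are illustrative guesses at what a document like this might look like, not RavenDB's exact on-disk layout:

    public class DocumentOverheadSketch {
        public static void main(String[] args) {
            // Raw payload: 10 doubles (5 GPS points) plus a 16-byte GUID.
            int rawBytes = 10 * 8 + 16;  // 96 bytes

            // Illustrative JSON form of the same data: field names, punctuation,
            // and text-encoded numbers all count against document size.
            String json = "{\"Coordinates\":["
                    + "{\"Lat\":47.620531,\"Lng\":-122.349293},"
                    + "{\"Lat\":47.620611,\"Lng\":-122.349512},"
                    + "{\"Lat\":47.620702,\"Lng\":-122.349731},"
                    + "{\"Lat\":47.620799,\"Lng\":-122.349950},"
                    + "{\"Lat\":47.620874,\"Lng\":-122.350169}]}";

            // Illustrative metadata stream (entity name, type, document id).
            String metadata = "{\"Raven-Entity-Name\":\"Tracks\","
                    + "\"Raven-Clr-Type\":\"MyApp.Track, MyApp\","
                    + "\"@id\":\"tracks/8f14e45f-ceea-467f-9c6c-b24f867d2b11\"}";

            System.out.println("raw payload:         " + rawBytes + " bytes");
            System.out.println("JSON document:       " + json.length() + " bytes");
            System.out.println("metadata:            " + metadata.length() + " bytes");
            System.out.println("document + metadata: " + (json.length() + metadata.length()) + " bytes");
        }
    }

Repeated field names, text-encoded numbers, the string document id, and the metadata stream add up quickly, which is why the observed per-document size ends up several times the raw payload.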

lucene file index

I have to index log records captured from enterprise networks. In the current implementation, every protocol has its own index per day, laid out as year/month/day/Lucene index. I want to know: if I use only one single Lucene index and update that single index every day, how does this affect search time? Is the increase considerable? In the current situation, when I search I query exactly the index for that day.
Current: smtp/year/month/day/luceneindex
If I do smtp/luceneindex, all the indexes would be in a single index. Let me know the pros and cons.
That depends on a whole range of factors.
What do you mean by a single Lucene file?
Lucene stores an index using multiple types of files and segments, so there is more than one file anyway.
What log data are you indexing, and how?
What do you use for querying across Lucene indexes: Solr, Elasticsearch, or something custom?
Are you running a single-instance, single-machine configuration?
Can you run multiple processes on separate hosts, using some for search tasks and others for index updates?
What are your typical search queries like? Optimise for those cases.
Have a look at http://elasticsearch.org/ or http://lucene.apache.org/solr/ for distributed search options.
Lucene has options to run in memory, like RAMDirectory, which you may like to investigate.
Is the size of the one-day file going to be problematic for administration?
Are the file sizes going to be so large, relative to disk and bandwidth constraints, that copying or moving them introduces issues?
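If you keep the per-day layout, you can still search across several days without merging anything on disk by opening the daily indexes together. A minimal sketch with recent Lucene (Java) APIs and made-up paths:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.MultiReader;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;

    import java.nio.file.Paths;

    public class DailyIndexSearch {
        public static void main(String[] args) throws Exception {
            // One reader per day; searching a single day stays as cheap as it is now.
            IndexReader day1 = DirectoryReader.open(FSDirectory.open(Paths.get("smtp/2012/06/01/luceneindex")));
            IndexReader day2 = DirectoryReader.open(FSDirectory.open(Paths.get("smtp/2012/06/02/luceneindex")));

            // MultiReader presents the daily indexes as one logical index, so a
            // cross-day query does not require merging them on disk. Closing the
            // MultiReader also closes the daily readers.
            try (MultiReader all = new MultiReader(day1, day2)) {
                IndexSearcher searcher = new IndexSearcher(all);
                TopDocs hits = searcher.search(
                        new QueryParser("message", new StandardAnalyzer()).parse("login failure"),
                        10);
                System.out.println("total hits: " + hits.totalHits);
            }
        }
    }

A single big index avoids that bookkeeping, but per-day queries then depend on a date filter, and deleting old days generally becomes more expensive than simply dropping a day's directory.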

When should we store images in the database?

I have a productList table with 4 columns, and now I have to store an image for each row, so I have two options for this:
Store the image in the database.
Save the images in a folder and store only the path in the table.
So my question is: which one is better in this situation, and why?
Microsoft Research published quite an extensive paper on the subject, called To Blob Or Not To Blob.
Their synopsis is:
Application designers often face the question of whether to store large objects in a filesystem or in a database. Often this decision is made for application design simplicity. Sometimes, performance measurements are also used. This paper looks at the question of fragmentation – one of the operational issues that can affect the performance and/or manageability of the system as deployed long term. As expected from the common wisdom, objects smaller than 256K are best stored in a database while objects larger than 1M are best stored in the filesystem. Between 256K and 1M, the read:write ratio and rate of object overwrite or replacement are important factors. We used the notion of “storage age” or number of object overwrites as way of normalizing wall clock time. Storage age allows our results or similar such results to be applied across a number of read:write ratios and object replacement rates.
It depends.
You can store images in the database if you know they won't grow in size or number very often. This has an advantage when you are deploying your system or migrating to new servers: you don't have to worry about copying the images separately.
If the number of rows grows very frequently and the images get bulkier, then it is better to store them on the file system and keep only a path in the database for later retrieval. This will also keep you on your toes when migrating servers, since you have to take care of copying the images from the file path separately.
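As a rough illustration of the second option (the table, column names, JDBC URL, and paths here are hypothetical), the write path might look like this in Java:

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.util.UUID;

    public class ImagePathStorage {
        public static void main(String[] args) throws Exception {
            byte[] imageBytes = Files.readAllBytes(Paths.get("upload/product-photo.jpg"));

            // 1. Write the image to a folder, using a generated name to avoid collisions.
            Path target = Paths.get("images", UUID.randomUUID() + ".jpg");
            Files.createDirectories(target.getParent());
            Files.write(target, imageBytes);

            // 2. Store only the relative path in the productList table.
            try (Connection conn = DriverManager.getConnection("jdbc:h2:./demo");
                 PreparedStatement ps = conn.prepareStatement(
                         "UPDATE productList SET imagePath = ? WHERE productId = ?")) {
                ps.setString(1, target.toString());
                ps.setInt(2, 42);
                ps.executeUpdate();
            }
        }
    }

Storing the image bytes in the database instead would simply replace step 1 with a BLOB column and setBytes on the prepared statement; the trade-offs are the ones described above and in the Microsoft paper.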

Store fields in database or in Lucene index file

My domain object has 20 properties (columns, attributes, whatever you call them) and simple relationships. I want to index 5 properties for full-text search and 3 for sorting. There might be 100,000 records.
To keep my application simple, I want to store the fields in a Lucene index file to avoid introducing a database. Will there be a performance problem?
Depending on how you access stored fields, they may all be loaded into memory (basically, if you use a FieldCache, everything will be cached into memory after the first use). And if you have a gig of stored data taking up memory, that's a gig less to use for your actual index.
Depending on how much memory you have, this may be a performance enhancement, or a performance detriment.
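If you do go the Lucene-only route, here is a sketch of how the fields might be added with a recent Lucene (Java) API; the field names are invented for the example. Sort fields can use doc values, which keeps sorting out of the FieldCache mentioned above, and display-only properties can be stored without being indexed at all.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.SortedDocValuesField;
    import org.apache.lucene.document.StoredField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.BytesRef;

    import java.nio.file.Paths;

    public class LuceneAsTheStore {
        public static void main(String[] args) throws Exception {
            try (IndexWriter writer = new IndexWriter(
                    FSDirectory.open(Paths.get("/tmp/domain-index")),
                    new IndexWriterConfig(new StandardAnalyzer()))) {

                Document doc = new Document();
                // Full-text searchable field, also stored so it can be shown in results.
                doc.add(new TextField("title", "An example record", Field.Store.YES));
                // Sort field: doc values are designed for sorting and faceting.
                doc.add(new SortedDocValuesField("lastName", new BytesRef("smith")));
                // Display-only property: stored but not indexed.
                doc.add(new StoredField("phone", "+1-555-0100"));
                writer.addDocument(doc);
            }
            // Sorting at query time would then use:
            // new Sort(new SortField("lastName", SortField.Type.STRING))
        }
    }

Whether this is comfortable at 100,000 records depends mainly on how much of the stored data you pull into memory at query time, as the answer above notes; the stored fields themselves sit on disk until you load them.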