I have to index log records captured from enterprise networks. In the current implementation, every protocol has its index files laid out as year/month/day/luceneindex. I want to know: if I use only one single Lucene index per protocol and update that index every day, how does this affect search time? Is the increase considerable? In the current situation, when I search, I query exactly the index for that day.
Current: smtp/year/month/day/luceneindex
Proposed: smtp/luceneindex, with all the data in a single index. Let me know the pros and cons.
That depends on a whole range of factors.
What do you mean by a single Lucene file?
Lucene stores an index using multiple types of files and segments, so there is more than one file anyway.
What log data are you indexing, and how?
What do you use for querying across Lucene indexes: Solr, Elasticsearch, something custom?
Are you running a single-instance, single-machine configuration?
Can you run multiple processes on separate hosts, using some for search tasks and others for index updates?
What are your typical search queries like? Optimise for those cases (there is a sketch at the end of these questions showing how per-day indexes can still be combined for broader queries).
Have a look at http://elasticsearch.org/ or http://lucene.apache.org/solr/ for distributed search options.
Lucene has options to run in memory, like RAMDirectory, which you may like to investigate.
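For example, loading an existing on-disk index into a RAMDirectory looks roughly like this (just a sketch, assuming a Lucene 5.x/6.x-style API; the path is made up):

    import java.nio.file.Paths;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.store.IOContext;
    import org.apache.lucene.store.RAMDirectory;

    public class InMemoryIndex {
        public static void main(String[] args) throws Exception {
            FSDirectory onDisk = FSDirectory.open(Paths.get("/data/smtp/index")); // hypothetical path
            // Copy the on-disk index into RAM so all reads are served from memory.
            RAMDirectory inMemory = new RAMDirectory(onDisk, IOContext.READONCE);
            IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(inMemory));
            // ... run queries against searcher ...
        }
    }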
Is the size of the one-day file going to be problematic for administration?
Are the file sizes going to be so large, relative to disk and bandwidth constraints, that copying or moving them introduces issues?
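On the per-day layout itself: keeping per-day indexes does not prevent occasional broader searches, because several day indexes can be combined on the fly with a MultiReader. A rough sketch (paths are made up, Lucene 5.x/6.x-style API):

    import java.nio.file.Paths;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.MultiReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.FSDirectory;

    public class CrossDaySearch {
        public static void main(String[] args) throws Exception {
            // One reader per daily index; a single-day query opens just one of these.
            IndexReader day1 = DirectoryReader.open(FSDirectory.open(Paths.get("smtp/2012/06/01/luceneindex")));
            IndexReader day2 = DirectoryReader.open(FSDirectory.open(Paths.get("smtp/2012/06/02/luceneindex")));

            // A query spanning several days searches a composite view of the daily indexes.
            IndexSearcher searcher = new IndexSearcher(new MultiReader(day1, day2));
            // ... run queries; closing the MultiReader also closes the sub-readers ...
        }
    }

Any of those per-day directories could also be loaded into a RAMDirectory, as suggested above, if it fits in memory.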
Is it possible to tell Lucene to write its segments sequentially and at a fixed size? That way we would avoid merges, which are heavy for large segments. Lucene has LogMergePolicy classes with similar functionality, giving the ability to set a maximum segment size by document count or file size, but that is just a limit for merges.
You could use the NRTCachingDirectory to do the small segment merges in memory and only write them out to disk once they reach ~256MiB or so.
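Roughly along these lines (just a sketch; the 256 MB / 512 MB figures echo the numbers above and are not recommendations, and the path is made up):

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.store.NRTCachingDirectory;

    public class CachedMerges {
        public static void main(String[] args) throws Exception {
            FSDirectory onDisk = FSDirectory.open(Paths.get("/path/to/index")); // hypothetical path
            // Small flushed segments and small merges stay in RAM; anything bigger than
            // ~256 MB (or once the cache holds more than ~512 MB) is written to disk.
            NRTCachingDirectory cached = new NRTCachingDirectory(onDisk, 256.0, 512.0);
            IndexWriter writer = new IndexWriter(cached, new IndexWriterConfig(new StandardAnalyzer()));
            // ... add documents ...
            writer.close();
        }
    }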
But fundamentally the merges are necessary, since data structures like the FST are write-once and can only be modified by creating a new one.
Maybe this can be combined with NoMergePolicy on the filesystem directory, so that it performs no further merges. But that will give pretty bad query performance.
Maybe do the merges manually and merge them all at once (by setting TieredMergePolicy.setMaxMergeAtOnceExplicit()).
But merging is just a cost of doing business; it is probably better to get used to it and tune the MergePolicy to your workload.
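For example, tuning the merge policy might look something like this (a sketch; the numbers are arbitrary, and setMaxMergeAtOnceExplicit only affects explicit/forced merges):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.TieredMergePolicy;

    public class MergeTuning {
        public static void main(String[] args) {
            TieredMergePolicy mp = new TieredMergePolicy();
            mp.setMaxMergedSegmentMB(5 * 1024);   // cap merged segment size (~5 GB)
            mp.setSegmentsPerTier(10);            // tolerate more segments before merging
            mp.setMaxMergeAtOnceExplicit(30);     // only used by explicit (forced) merges
            IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer()).setMergePolicy(mp);
            // pass iwc to the IndexWriter that builds the index
        }
    }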
I am learning from "Lucene in Action". It says that in order to search the contents of files you need to index them. I am not very clear on indexing files.
How much file space does indexing 1 GB of documents (like doc,xls,pdb) take?
How long will it take to index these files?
Do we need to update the index every day?
Q> How much file space does indexing 1 GB of documents (like doc, xls, pdb) take?
A> Your question is too vague. Documents and spreadsheets can vary from virtually nothing to tens or even hundreds of megabytes. It also depends on the analyzer you are going to use and many other factors (e.g. fields only indexed or indexed and stored, etc.). You can use this spreadsheet for rough estimation, plus add some extra space for merges.
Q> How long will it take to index these files?
A> Again, it depends on how much content is there. Generally speaking, indexing is fast. On the given link, it went as fast as 95.8 GB/hour, but I assume conversion from doc/xls will add some cost (which is irrelevant to Lucene, by the way).
Q> Do we need to update the index every day?
A> It is up to you. If you don't update the index, you will keep getting the same search results. There is no magic way for new or updated content to get into the index without updating it.
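If you do decide to update daily, the usual pattern is to reopen an IndexWriter on the existing index and add or replace documents; a minimal sketch (the path and field names are made up):

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.store.FSDirectory;

    public class DailyUpdate {
        public static void main(String[] args) throws Exception {
            IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
            iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND); // keep the existing index
            try (IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get("/path/to/index")), iwc)) {
                Document doc = new Document();
                doc.add(new StringField("path", "reports/q1.pdf", Field.Store.YES)); // hypothetical fields
                doc.add(new TextField("contents", "extracted text of the file", Field.Store.NO));
                // Replaces any previously indexed document with the same "path".
                writer.updateDocument(new Term("path", "reports/q1.pdf"), doc);
            }
        }
    }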
I am trying to build some real-time aggregates on Lucene as part of an experiment. Documents have their values stored in the index. This works very nicely for up to 10K documents.
For larger numbers of documents, this gets rather slow. I assume not much effort has been invested in retrieving documents in bulk, as that kind of defeats the purpose of a search engine.
However, it would be cool to be able to do this. So, basically, my question is: what could I do to get documents out of Lucene faster? Or are there smarter approaches?
I already retrieve only the fields I need.
[edit]
The index is quite large (>50 GB) and does not fit in memory. The number of fields differs; I have several types of documents. Aggregation will mostly take place on a fixed document type, but there is no way to tell beforehand which one.
Have you put the index in memory? If the entire index fits in memory, that is a huge speedup.
Once you get the hits (which come back super quickly even for 10k records), I would open up multiple threads/readers to access them.
Another thing I have done is store only some properties in Lucene (i.e. don't store 50 attributes from a class). You can sometimes get things faster just by fetching a list of IDs and pulling the rest of the content from a service or database.
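To make those two points concrete, a rough sketch (Lucene 5.x/6.x-style API; the path, the "amount" field, and the assumption that it holds a number are all made up for illustration):

    import java.nio.file.Paths;
    import java.util.Collections;
    import java.util.Set;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.MatchAllDocsQuery;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;

    public class BulkFetch {
        public static void main(String[] args) throws Exception {
            IndexSearcher searcher = new IndexSearcher(
                    DirectoryReader.open(FSDirectory.open(Paths.get("/path/to/index")))); // hypothetical path
            TopDocs hits = searcher.search(new MatchAllDocsQuery(), 10000);

            Set<String> wanted = Collections.singleton("amount"); // load only the field being aggregated
            long total = 0;
            for (ScoreDoc sd : hits.scoreDocs) {
                Document doc = searcher.doc(sd.doc, wanted);      // skips all other stored fields
                total += Long.parseLong(doc.get("amount"));       // assumes the value is stored as a number
            }
            System.out.println("sum = " + total);
            // The loop can be split across several threads, each handling its own slice of
            // hits.scoreDocs; IndexSearcher is safe to share between threads.
        }
    }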
I'm trying to use LucidWorks (http://www.lucidimagination.com/products/lucidworks-search-platform) as a search engine for my organization's intranet.
I want it to index various document-types (Office formats, PDFs, web pages) from various data sources (web & wiki, file system, Subversion repositories).
So far I have tried indexing several sites, directories & repositories (about 500K documents, with a total size of about 50GB) - and the size of the index is 155GB.
Is this reasonable? Should the index occupy more storage than the data itself? What would be a reasonable thumb-rule for data-size to index-size ratio?
There is no single reasonable index size; it basically depends on the data you have.
Ideally it should be less than the data itself, but there is no rule of thumb.
The ratio of index size to data size depends on how you are indexing the data.
Many factors determine and affect your index size.
Most of the space in the index is consumed by stored fields.
If you are indexing the data from documents and all the content is stored as well, the index will surely grow huge.
Fine-tuning the attributes of the indexed fields also helps save space.
You may want to revisit which fields need to be indexed and which need to be stored.
Also, are you using lots of copyFields to duplicate data, or maintaining repetitive data?
Optimization might help as well.
More info at http://wiki.apache.org/solr/SolrPerformanceFactors
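To illustrate the indexed-versus-stored point in Lucene terms (Solr's stored="true"/"false" on a field maps onto this), a small sketch with made-up field names:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;

    public class StoredVsIndexed {
        public static void main(String[] args) {
            Document doc = new Document();
            // Indexed only: searchable, and the raw text is NOT kept in the index.
            doc.add(new TextField("body", "full extracted text of the document ...", Field.Store.NO));
            // Indexed and stored: searchable, and the original value is kept verbatim in the
            // index, which is what makes the index grow past the size of the source data.
            doc.add(new StringField("title", "Quarterly report", Field.Store.YES));
        }
    }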
I am working on speeding up my Solr indexing. I just want to know how many threads (if any) Solr uses for indexing by default. Is there a way to increase or decrease that number?
When you index a document, several steps are performed:
the document is analyzed,
data is put in the RAM buffer,
when the RAM buffer is full, data is flushed to a new segment on disk,
if there are more than ${mergeFactor} segments, segments are merged.
The first two steps will be run in as many threads as you have clients sending data to Solr, so if you want Solr to run three threads for these steps, all you need is to send data to Solr from three threads.
You can configure the number of threads to use for the fourth step if you use a ConcurrentMergeScheduler (http://lucene.apache.org/java/3_0_1/api/core/org/apache/lucene/index/ConcurrentMergeScheduler.html). However, there is no way to configure the maximum number of threads from the Solr configuration files, so what you need to do is write a custom class which calls setMaxThreadCount in its constructor.
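Such a class is tiny; a sketch (the class name is made up, and it would then be referenced as the merge scheduler in solrconfig.xml):

    import org.apache.lucene.index.ConcurrentMergeScheduler;

    // Referenced from solrconfig.xml as the merge scheduler class so that Solr
    // picks it up instead of the default scheduler.
    public class BoundedMergeScheduler extends ConcurrentMergeScheduler {
        public BoundedMergeScheduler() {
            setMaxThreadCount(2); // limit the number of concurrent merge threads
        }
    }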
My experience is that the main ways to improve indexing speed with Solr are:
buying faster hardware (especially I/O),
sending data to Solr from several threads (as many threads as cores is a good start),
using the Javabin format,
using faster analyzers.
Although StreamingUpdateSolrServer looks interesting for improving indexing performance, it doesn't support the Javabin format. Since Javabin parsing is much faster than XML parsing, I got better performance by sending bulk updates (800 in my case, but with rather small documents) using CommonsHttpSolrServer and the Javabin format.
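For reference, the CommonsHttpSolrServer + Javabin setup is only a couple of lines; a sketch with made-up field names and batch contents:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.impl.BinaryRequestWriter;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class JavabinBulkIndexer {
        public static void main(String[] args) throws Exception {
            CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
            server.setRequestWriter(new BinaryRequestWriter()); // send updates as Javabin instead of XML

            List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
            for (int i = 0; i < 800; i++) {                     // the batch size mentioned above
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", Integer.toString(i));        // hypothetical fields
                doc.addField("text", "document body " + i);
                batch.add(doc);
            }
            server.add(batch);   // one bulk request; run this from several threads for more throughput
            server.commit();
        }
    }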
You can read http://wiki.apache.org/lucene-java/ImproveIndexingSpeed for further information.
This article describes an approach to scaling indexing with SolrCloud, Hadoop and Behemoth. This is for Solr 4.0 which hadn't been released at the time this question was originally posted.
You can store the content in external storage, such as files.
For whichever fields contain a huge amount of content, set stored="false" on the corresponding field in the schema and store that field's content in an external file, using some efficient file system hierarchy.
This reduced indexing time by 40 to 45% for me. But when searching, search time increases somewhat: it took about 25% more time than a normal search.
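In Lucene terms the idea looks roughly like this (a sketch with a made-up external file layout; in Solr, the indexing half corresponds to stored="false" in the schema):

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;

    public class ExternalContentStore {
        // Indexing side: make the big field searchable but do not store it in the index.
        static Document buildDoc(String id, String content) {
            Document doc = new Document();
            doc.add(new StringField("id", id, Field.Store.YES));
            doc.add(new TextField("content", content, Field.Store.NO)); // searchable, not stored
            return doc;
        }

        // Retrieval side: after a hit, read the full content back from the external file,
        // which is the extra step that makes retrieval somewhat slower.
        static String loadContent(String id) throws Exception {
            Path external = Paths.get("/content-store", id.substring(0, 2), id + ".txt"); // made-up layout
            return new String(Files.readAllBytes(external), "UTF-8");
        }
    }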