Suggestions around Lucene 4.4 (Log Search)

I am new to Lucene and trying to use it for searching log files/entries generated by SystemA.
Architecture
Each log entry (i.e. an XML) is received in an INPUT directory. SystemA sends log entries to an MQ queue, which is polled by a small utility that picks up each message and creates a file in the INPUT directory.
WriteIndex.java (i.e. IndexWriter/Lucene) keeps checking whether a new file has arrived in the INPUT directory. If yes, it takes the file, puts it in the index and moves the file to an OUTPUT directory. As part of indexing, I am putting the filename, path, timestamp and contents in the index.
"Note: I am creating index on Content as well putting whole Content as StringField."
SearchIndex.java (i.e. SearcherManager/Lucene/refreshIfChanged) is created. As part of creation I also started a new thread that checks every minute whether the index has changed or not. I acquire an IndexSearcher for every request. It's working fine.
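For reference, a minimal sketch of that reader-side pattern against the Lucene 4.x API (the directory path, refresh interval and field names here are illustrative, not taken from my actual code):

    import java.io.File;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.SearcherManager;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;

    public class SearchIndexSketch {
        private final SearcherManager manager;

        public SearchIndexSketch(File indexDir) throws Exception {
            // null => default SearcherFactory
            manager = new SearcherManager(FSDirectory.open(indexDir), null);

            // Background thread: pick up index changes roughly once a minute.
            Thread refresher = new Thread(new Runnable() {
                public void run() {
                    while (!Thread.currentThread().isInterrupted()) {
                        try {
                            manager.maybeRefresh();   // no-op if nothing changed
                            Thread.sleep(60 * 1000);
                        } catch (Exception e) {
                            return;
                        }
                    }
                }
            });
            refresher.setDaemon(true);
            refresher.start();
        }

        public void search(Query query) throws Exception {
            IndexSearcher searcher = manager.acquire();   // one searcher per request
            try {
                TopDocs hits = searcher.search(query, 50);
                for (ScoreDoc sd : hits.scoreDocs) {
                    // stored fields must be read before the searcher is released
                    System.out.println(searcher.doc(sd.doc).get("filename"));
                }
            } finally {
                manager.release(searcher);
            }
        }
    }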
Everything so far has worked very well, but I am not sure what will happen in production. I have tested with a few hundred files, but in production I will be getting around 500K log entries a day, which means 500K small files, each containing an XML entry. WriteIndex.java will have to run non-stop to update the index whenever a new file is received.
I have the following questions:
Has anyone done similar work? Any issues/best practices I should follow?
Do you see any problem with the index files generated for such a large number of XML files? Each XML file would be 2KB max. Remember I am indexing the content as well as storing the content in the index, so that I can retrieve it whenever I find a match while searching.
I would be exposing SearchIndex.java as a servlet to allow admins to search log entries from a web page. Any issues you see with that?
Please let me know if anyone needs anything specific.
Thanks,
Rohit Goyal

The architecture looks fine.
A few things:
Consider using TextField instead of StringField for the content. A TextField is tokenized, so users will be able to search on individual tokens; a StringField is not tokenized, so for a document to match, the full field value has to match the query (see the sketch below).
Lucene performance will not be a problem. Check out the Lucene benchmark graphs: Lucene can index millions of Wikipedia documents in minutes on a single machine, and searching is fast too.
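For illustration, a minimal writer-side sketch against the Lucene 4.4 API, with the content tokenized for search (TextField) and also stored for retrieval. Field names follow the question; the analyzer and paths are just placeholders:

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.LongField;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class WriteIndexSketch {

        public static IndexWriter openWriter(File indexDir) throws Exception {
            IndexWriterConfig cfg = new IndexWriterConfig(
                    Version.LUCENE_44, new StandardAnalyzer(Version.LUCENE_44));
            return new IndexWriter(FSDirectory.open(indexDir), cfg);
        }

        public static void addLogEntry(IndexWriter writer, File xmlFile, String xmlContent)
                throws Exception {
            Document doc = new Document();
            // StringField: not tokenized, good for exact values like file names/paths
            doc.add(new StringField("filename", xmlFile.getName(), Field.Store.YES));
            doc.add(new StringField("path", xmlFile.getAbsolutePath(), Field.Store.YES));
            doc.add(new LongField("timestamp", xmlFile.lastModified(), Field.Store.YES));
            // TextField: tokenized so users can search on individual terms;
            // Store.YES keeps the original text so it can be shown in the results page
            doc.add(new TextField("content", xmlContent, Field.Store.YES));
            writer.addDocument(doc);
        }
    }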

Related

Using Lucene Highlighter infrastructure to mark up arbitrary text

I am using Lucene 3.5 in a client-server architecture as follows: the client issues a query to the server. The server returns a list of terms used in the query, and a list of hits, including snippets generated by the application of a Highlighter to the retrieved documents. The user can then request that the full document be displayed. This document comes from another service that is part of the system I am building.
When the requested document is displayed, I would like to highlight the same terms that were used to retrieve it. I can write some other code to do this without involving the Lucene infrastructure, but since I already have code to generate the snippets, I was hoping to be able to re-use it. (DRY and all that.)
So my question is how best to do this: When the need to mark up a document with search results occurs, the client has the set of terms that were used to retrieve the document and the id of the document that was retrieved. It also knows which fields in the document can be marked up with query terms.
Some possible strategies:
Create a query filter that selects only the needed document and then re-run the query only on that document.
Somehow (how?) construct a Scorer that doesn't depend on a Query but that can be seeded with the terms I already have.
Skip the Lucene infrastructure entirely.
What else?
I believe you could index your documents with term vectors, which will tell you the position of each term in the original document, making highlighting trivial. Or simply reuse the contrib Highlighter.
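A rough sketch of that idea against the Lucene 3.5 API (the field name, markup tags and fragment settings are made up; the Highlighter classes live in the contrib highlighter module):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.highlight.Highlighter;
    import org.apache.lucene.search.highlight.QueryScorer;
    import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
    import org.apache.lucene.search.highlight.TokenSources;

    public class HighlightSketch {

        // Index the field with term vectors (positions + offsets) so the
        // highlighter does not need to re-analyze the full text later.
        public static Field contentField(String text) {
            return new Field("content", text, Field.Store.YES,
                    Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS);
        }

        // Reuse the snippet code for the full-document view: wrap the original
        // query (or one rebuilt from the saved terms) in a QueryScorer and mark
        // up the whole stored text instead of short fragments.
        public static String highlightFullDoc(IndexReader reader, int docId,
                                              Query query, Analyzer analyzer) throws Exception {
            Document doc = reader.document(docId);
            String text = doc.get("content");
            Highlighter highlighter = new Highlighter(
                    new SimpleHTMLFormatter("<b>", "</b>"),
                    new QueryScorer(query, "content"));
            highlighter.setMaxDocCharsToAnalyze(Integer.MAX_VALUE); // whole document
            TokenStream tokens =
                    TokenSources.getAnyTokenStream(reader, docId, "content", analyzer);
            return highlighter.getBestFragments(tokens, text, 1, "...");
        }
    }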

Sitecore System Lucene Index for custom queries

I have been using Sitecore query and FAST query for some sections of the website. But with growing content these queries have gotten slow, and I'd like to implement Lucene querying for content to speed things up.
I am wondering if I can just use the System index instead of having to setup a separate index. Does Sitecore by default index all content in the content editor? Is this a good approach or should I just create my own index?
(I'm going to assume you're using Sitecore 6.4-6.6.)
As with everything .. it depends .. Sitecore keeps an index of all the Sitecore items in its system index, and you are welcome to use that. Sometimes, though, you may want a more specialised or restricted list of items: for example, items based on a certain template, or you may need a checkbox field indexed (the system index by default only indexes text fields).
Setting up your own search index is pretty easy.. It does require some fiddling with the web.config though (and I'd recommend adding it as a config include file).
Create a new <index> node with its own id that defines the name of the collection and the folder it will go into. (You can check it's working by looking for the directory in the /data/indexes directory of your installation.)
.. next you can tell the crawler which database to look at (most likely master if you want unpublished content to be indexed, or web for published content) and where to start the search from (in this example I am indexing only the news section). You can tag, boost and specify whether to IndexAllFields (otherwise it will only index fields it understands as text: rich-text / multi-line text / text etc.).
.. Finally, you can tell the indexer which template types to include or exclude.
The indexer works by subscribing to item events within Sitecore .. so every time an item is changed, moved or deleted, the index will be updated automatically. Obviously, if you are indexing the web db the items will need to have been published.
More in-depth info on the query syntax & indexing can be found here on SDN.
The search syntax and API is much improved in 6.4/6.5 but if you want to add extra kick then my colleague Alex Shyba's Advanced Database Crawler is worth checking out too.
Hope this helps :D
You will want to implement your own index. For the same reason that you are seeing things slow down when there is a lot of content, indexes slow down when there is a lot of content in them as well.
I prefer targeted indexes meant specifically to drive the functionality I need, containing only the data that is required. This allows for smaller and more efficient index usage in your components.
Additionally, you probably want to look into the AdvancedDatabaseCrawler put together by Alex Shyba. There are a few blogs out there with some great posts on implementing this lucene indexing module.
A separate index is always a wise decision, you can keep it light. In big environments the system index can grow up to gigabytes.
You can exclude the content from the index, as you will only be using it for performing lookups, not showing content from the index.
Finally: the system index is for the master database, you'll be querying the web database, possibly on a content delivery server.

How to find a document's visitor count?

I need to count the number of visitors for a particular document.
I can do it by adding a field and increasing its value.
But the problem is the following:
I have 10 replica copies in different locations, replicated on a schedule. Replication conflicts happen because the visitor count edits the same document in different locations.
I would use an external solution for this. Just search for "visitor count" in your favorite search engine and choose a third party tool. You can then display the count on the page if that is important.
If you need to store the value in the database for some reason, perhaps you could store it as a new doc type that gets added each time (and cleaned up later) to avoid the replication issues.
Otherwise if storing it isn't required consider Google Analytics too.
I faced this problem too. I cannot say that it has an easy solution. Document locking is the only workaround I found, and even then an accurate visitor count is not really possible.
It is possible, but not by updating the document. Instead have an AJAX call to an agent or form with parameters on the URL identifying the document being read. This call writes a document into a tracking DB with one or two views and then determines from those views how many reads you have had. The number of reads is the return value of the AJAX form.
This can be written in LS, Java or #Formulas. I would try to do it 100% in #Formulas to make it as efficient as possible.
You can also add logic to exclude reads from the same user or same source IP address.
The tracking database then replicates using the same schedule as the other database.
Daily or Hourly agents can run to create summary documents and delete the detail documents so that you do not exceed the limits for #DBLookup.
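For illustration, here is a rough, hypothetical sketch of that tracking agent written as a Domino Java agent (I would still lean toward @Formulas for efficiency, as noted above). The form name, view name and URL parameter are all invented:

    import java.io.PrintWriter;
    import lotus.domino.*;

    public class JavaAgent extends AgentBase {
        public void NotesMain() {
            try {
                Session session = getSession();
                AgentContext ctx = session.getAgentContext();

                // The AJAX call passes the UNID of the document being read
                // on the query string, e.g. ...?OpenAgent&unid=ABC123
                Document request = ctx.getDocumentContext();
                String query = request.getItemValueString("Query_String");
                String unid = query.substring(query.indexOf("unid=") + 5);

                // Write one small tracking document per read into the tracking DB
                Database trackingDb = ctx.getCurrentDatabase();
                Document track = trackingDb.createDocument();
                track.replaceItemValue("Form", "ReadEvent");
                track.replaceItemValue("DocUNID", unid);
                track.replaceItemValue("Reader", session.getEffectiveUserName());
                track.save(true, false);

                // Count existing reads for this document via a view keyed on DocUNID
                View byDoc = trackingDb.getView("ReadsByDoc");
                int count = byDoc.getAllEntriesByKey(unid, true).getCount();

                // Return the count as the response to the AJAX call
                PrintWriter out = getAgentOutput();
                out.println("Content-type: text/plain");
                out.println();
                out.println(count);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }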
If you do not need very nearly real time counts (and that is the best you can get with replicated system like this) you could use the web logs that domino generates by finding the reads in the logs and building the counts in a document per server.
/Newbs
Back in the 90s, we had a client that needed to know that each person had read a document without them clicking to sign or anything.
The initial solution was to add each name to a text field on a separate tracking document. This ran into problems real fast when it went over the 32K field limit. Then, one of my colleagues realized you could just have it create a document for each user to record that they'd read it.
Heck, you could have one database used to track all reads for all users of all documents, since one user can only open one document at a time -- each time they open a new document, either add that value to a field or create a field named after the document they've read on their own "reader tracker" document.
Or you could make that a mail-in database, so no worries about replication. Each time they open a document for which you want to track reads, it creates a tiny document that has only their name and what document they read, which gets mailed into the "read counter database". If you don't care who read it, you have an agent that runs on a schedule that updates the count and deletes the mailed-in documents.
There really are a lot of ways to skin this cat.

How to maintain lucene indexes in azure cloud-app

I just started playing with the Azure Library for Lucene.NET (http://code.msdn.microsoft.com/AzureDirectory). Until now, I was using my own custom code for writing Lucene indexes on the Azure blob. So, I was copying the blob to local storage of the Azure web/worker role and reading/writing docs to the index. I was using my own custom locking mechanism to make sure we don't have clashes between reads and writes to the blob. I am hoping the Azure Library will take care of these issues for me.
However, while trying out the test app, I tweaked the code to use the compound-file option, and that created a new file every time I wrote to the index. Now, my question is, if I have to maintain the index - i.e. keep a snapshot of the index file and use it if the main index gets corrupted - how do I go about doing this? Should I keep a backup of all the .cfs files that are created, or is handling only the latest one fine? Are there API calls to clean up the blob to keep only the latest file after each write to the index?
Thanks
Kapil
After I answered this, we ended up changing our search infrastructure and used Windows Azure Drive. We had a worker role which would mount a VHD from blob storage and host the Lucene.NET index on it. The code checked to make sure the VHD was mounted first and that the index directory existed. If the worker role fell over, the VHD would automatically dismount after 60 seconds, and a second worker role could pick it up.
We have since changed our infrastructure again and moved to Amazon with a Solr instance for search, but the VHD option worked well during development. It could have worked well in test and production too, but requirements meant we needed to move to EC2.
I am using AzureDirectory for full-text indexing on Azure, and I am getting some odd results too... but hopefully this answer will be of some use to you.
Firstly, the compound-file option: from what I am reading and figuring out, the compound file is a single large file with all the index data inside. The alternative to this is having lots of smaller files (configured using the SetMaxMergeDocs(int) function of IndexWriter) written to storage. The problem with this is that once you get to lots of files (I foolishly set this to about 5000), it takes an age to download the indexes (on the Azure server it takes about a minute; on my dev box... well, it's been running for 20 minutes now and still not finished...).
As for backing up indexes, I have not come up against this yet, but given we have about 5 million records currently, and that will grow, I am wondering about this too. If you are using a single compound file, maybe downloading the files to a worker role, zipping them and uploading them with today's date would work... If you have a smaller set of documents, you might get away with re-indexing the data if something goes wrong... but again, it depends on the number...
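For what it's worth, here is roughly how the compound-file and max-merge-docs knobs look in code. This sketch uses the Java Lucene 3.x API (Lucene.NET mirrors it with PascalCase names); on these versions the settings live on the merge policy rather than directly on IndexWriter, and the numbers are arbitrary:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.LogByteSizeMergePolicy;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.util.Version;

    public class WriterSettingsSketch {
        public static IndexWriter open(Directory dir) throws Exception {
            LogByteSizeMergePolicy mergePolicy = new LogByteSizeMergePolicy();
            // true  => segments are packed into single .cfs files (fewer, larger blobs)
            // false => many per-segment files, which can be slow to sync from blob storage
            mergePolicy.setUseCompoundFile(true);
            // cap how many documents end up in one merged segment
            mergePolicy.setMaxMergeDocs(100000);

            IndexWriterConfig cfg = new IndexWriterConfig(
                    Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36));
            cfg.setMergePolicy(mergePolicy);
            return new IndexWriter(dir, cfg);
        }
    }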

Is it possible to re-generate Lucene index in background?

Sometimes there is a need to re-generate a Lucene index, e.g. when something changes in the Compass mapping or in the way boosts are applied, or if something got corrupted for whatever reason.
In my case, generation of the index takes about 5 to 6 hours, and clearing the index beforehand means the data is incomplete for that interval, i.e. a search during this time returns incomplete results.
Is there any standard way to have lucene generate the index in the background? E.g. write index to a temporary directory and (when indexing is finished without exceptions etc) replace the existing index with the new one?
Of course, one could implement this "manually", but does one have to? Sounds like a common use case to me.
Best regards + Thanks for your opinion,
Peter :)
I had a similar experience; there were certain parameters to the Analyzer which would get changed from time to time; obviously if that was the case, the entire index needs to get rebuilt. (I won't go into the details, suffice to say I had the same requirement!)
I did what you suggested in your question. There were three directories, "old", "current" and "new". Queries from the live site went against "current" always. The index recreation process was:
Recursive delete on the "old" and "new" directories
Create the new index into the "new" directory (in my case takes about 6 hrs)
Rename "current" to "old"; and "new" to "current"
Recursive delete the "old" directory
An analysis of what happens when the process crashes - if it crashes in the 1st step, the next time it will just carry on. If it crashes in the 2nd step then the "new" directory will get deleted next run. The 3rd step is very fast - renaming a directory is fast and atomic. Crashing in the 4th step doesn't matter, it'll just get cleaned up next run.
The careful observer will note that in step 3, the system could crash between renaming the current directory away and moving the new directory in. This is unlikely to happen as directory rename is so fast. The system has been in production for a few years and this has never happened (yet?).
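Purely as an illustration, the swap logic in steps 1, 3 and 4 can be as small as the following (hypothetical paths, plain java.nio.file; whether the renames are truly atomic depends on the filesystem):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.Comparator;
    import java.util.stream.Stream;

    public class IndexSwapSketch {
        static final Path OLD = Paths.get("indexes/old");
        static final Path CURRENT = Paths.get("indexes/current");
        static final Path NEW = Paths.get("indexes/new");

        public static void rebuild() throws IOException {
            deleteRecursively(OLD);           // step 1: clean up leftovers
            deleteRecursively(NEW);
            // step 2: build the complete index into NEW (the multi-hour part)
            // buildIndexInto(NEW);           // hypothetical: your existing indexing job
            Files.move(CURRENT, OLD);         // step 3: two fast renames
            Files.move(NEW, CURRENT);
            deleteRecursively(OLD);           // step 4
        }

        static void deleteRecursively(Path dir) throws IOException {
            if (!Files.exists(dir)) return;
            try (Stream<Path> walk = Files.walk(dir)) {
                walk.sorted(Comparator.reverseOrder())
                    .forEach(p -> p.toFile().delete());
            }
        }
    }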
I think the usual way to do this is to use Solr's replication functionality. In your case though, the master and slave would be on the same machine, just pointed at different directories.
We have a similar problem. Our data is indexed in Lucene, but the original source is a DB and a content repo.
So if an index goes out of sync (or data type changes, etc.), we simply iterate over all existing entries in the index and re-generate the data so each document gets updated. It is not really a complex thing to do.
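A rough sketch of that loop against the Lucene 4.x API (the "id" field and loadFromSource are assumptions about how the primary data is keyed):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.MultiFields;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.util.Bits;

    public class ReindexInPlaceSketch {
        public static void refreshAll(Directory dir, IndexWriter writer) throws Exception {
            DirectoryReader reader = DirectoryReader.open(dir);
            try {
                Bits liveDocs = MultiFields.getLiveDocs(reader); // null if no deletions
                for (int i = 0; i < reader.maxDoc(); i++) {
                    if (liveDocs != null && !liveDocs.get(i)) continue; // skip deleted docs
                    String id = reader.document(i).get("id");
                    Document fresh = loadFromSource(id);   // hypothetical: rebuild from DB / content repo
                    writer.updateDocument(new Term("id", id), fresh);
                }
                writer.commit();
            } finally {
                reader.close();
            }
        }

        static Document loadFromSource(String id) {
            // placeholder for fetching the record and rebuilding its Lucene Document
            throw new UnsupportedOperationException("not implemented in this sketch");
        }
    }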