Is it possible to re-generate a Lucene index in the background?

Sometimes there is a need to re-generate a Lucene index, e.g. when something changes in the Compass mapping or in the way boosts are applied, or if something got corrupted for whatever reason.
In my case, generating the index takes about 5 to 6 hours, and clearing the index beforehand means the data is incomplete for that interval, i.e. a search during this time returns incomplete results.
Is there any standard way to have Lucene generate the index in the background? E.g. write the index to a temporary directory and, when indexing has finished without exceptions etc., replace the existing index with the new one?
Of course, one could implement this "manually", but does one have to? Sounds like a common use case to me.
Best regards + Thanks for your opinion,
Peter :)

I had a similar experience; there were certain parameters to the Analyzer which would get changed from time to time, and obviously when that happened, the entire index needed to be rebuilt. (I won't go into the details, suffice to say I had the same requirement!)
I did what you suggested in your question. There were three directories, "old", "current" and "new". Queries from the live site went against "current" always. The index recreation process was:
1) Recursive delete on the "old" and "new" directories
2) Create the new index into the "new" directory (in my case takes about 6 hrs)
3) Rename "current" to "old"; and "new" to "current"
4) Recursive delete the "old" directory
An analysis of what happens when the process crashes - if it crashes in the 1st step, the next time it will just carry on. If it crashes in the 2nd step then the "new" directory will get deleted next run. The 3rd step is very fast - renaming a directory is fast and atomic. Crashing in the 4th step doesn't matter, it'll just get cleaned up next run.
The careful observer will note that in step 3, the system could crash between renaming the current directory away and moving the new directory in. This is unlikely to happen as directory rename is so fast. The system has been in production for a few years and this has never happened (yet?).
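For illustration, a minimal sketch of that swap in Java (java.nio.file); the directory names match the steps above, and buildIndexInto() is a hypothetical stand-in for whatever actually writes the new index:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;

public class IndexSwap {
    public static void rebuild(Path base) throws IOException {
        Path oldDir = base.resolve("old");
        Path current = base.resolve("current");
        Path newDir = base.resolve("new");

        deleteRecursively(oldDir);      // step 1: clear leftovers from any earlier run
        deleteRecursively(newDir);
        buildIndexInto(newDir);         // step 2: the long part; queries still hit "current"
        Files.move(current, oldDir);    // step 3: two fast directory renames
        Files.move(newDir, current);
        deleteRecursively(oldDir);      // step 4: drop the previous index
    }

    static void deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) return;
        Files.walk(dir)
             .sorted(Comparator.reverseOrder())   // delete children before parents
             .forEach(p -> p.toFile().delete());
    }

    static void buildIndexInto(Path dir) {
        // hypothetical: open an IndexWriter on this directory and add all documents
    }
}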

I think the usual way to do this is to use Solr's replication functionality. In your case, though, the master and slave would be on the same machine, just pointed at different directories.

We have a similar problem. Our data is indexed in Lucene, but the original source is a DB and a content repo.
So if an index goes out of sync (or a data type changes, etc.), we simply iterate over all existing entries in the index and re-generate the data so that each document gets updated. It is not really a complex thing to do.
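A rough sketch of that loop against the Lucene Java API (names follow Lucene 5+ style classes; the "id" field and the loadFromSource() helper that re-reads a record from the DB/content repo are assumptions):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;

void reindexInPlace(Directory indexDir, Analyzer analyzer) throws Exception {
    try (DirectoryReader reader = DirectoryReader.open(indexDir);
         IndexWriter writer = new IndexWriter(indexDir, new IndexWriterConfig(analyzer))) {
        for (int i = 0; i < reader.maxDoc(); i++) {
            Document stale = reader.document(i);    // real code should also skip deleted docs
            String id = stale.get("id");
            Document fresh = loadFromSource(id);    // hypothetical: rebuild from DB / content repo
            writer.updateDocument(new Term("id", id), fresh);  // delete-and-re-add keyed by id
        }
        writer.commit();
    }
}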

Related

What's the best approach to create a filestore

This is an open-ended question. I have a noob understanding of databases but am willing to learn whatever is required, though I believe my problem could be solved without learning a lot.
So, here goes the question:
I have a large number of files getting generated in my projects (depending on the builds), and I need to archive them and also reproduce them by buildNumber if users request it. I don't expect many such requests, maybe 1-2 a day.
For example: 16 GB of data per build every week. Most of the files in weekly builds are duplicates, and I don't want to archive them again and again; I prefer to store them only once. One caveat: a file's relative location can change even though its content hasn't.
My approach is as follows: create a hash of each file, store the file once as a fileHash -> actual-file key-value pair, and record this information in some kind of manifest file for each build. That way I should be able to recreate the builds with the correct files/paths etc.
Can two different files ever have the same hash? Can some database help do this efficiently? I am currently thinking of dumping all the files in one folder.
Thanks
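For what it's worth, a minimal sketch of the hash-keyed store described above, assuming Java 17 (for HexFormat), SHA-256 as the hash (an accidental SHA-256 collision between two different files is not a practical concern), a flat blob directory, and a plain-text per-build manifest; all of those choices are illustrative, not prescriptive:

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.security.MessageDigest;
import java.util.HexFormat;

public class FileStore {
    // Store each unique file once under blobDir/<sha256> and append
    // "<relativePath> <sha256>" to the build's manifest so the build can be recreated.
    public static void archive(Path file, String relativePath,
                               Path blobDir, Path manifest) throws Exception {
        byte[] content = Files.readAllBytes(file);
        String hash = HexFormat.of().formatHex(
                MessageDigest.getInstance("SHA-256").digest(content));
        Path blob = blobDir.resolve(hash);
        if (!Files.exists(blob)) {              // duplicates across builds are stored only once
            Files.createDirectories(blobDir);
            Files.write(blob, content);
        }
        Files.writeString(manifest, relativePath + " " + hash + System.lineSeparator(),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }
}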

How to programmatically pause after Sitecore item creation until index is updated

A coworker asked this question, and I wasn't immediately finding a solution, so I'm posting it here. He is programmatically inserting a Sitecore item in the master DB, and then subsequently has to insert another item that has a dependency on the first item being present in the index. Originally that second item insert was failing every time or two, but he has since added a manual pause in his code to give the index time to catch up, and it now fails only about every tenth time. Better, but not perfect.
He is looking for a Sitecore way to check whether the index has been updated before he proceeds with inserting the dependent item.
I did find this blog post by Alex Shyba (http://sitecoreblog.alexshyba.com/2011/04/search-index-troubleshooting.html), which looks like it might have some applicability, but my coworker is strictly working in the master DB (no publishing involved), and we already have the first several steps in Alex's article implemented in our solution (I didn't go through the whole thing).
If you are dependent on an index add, in the end the only way to ensure the item is in the index is to take the action following the asynchronous index update. And in Sitecore 6, the only way to do that which I am aware of is the database:propertychanged event. Alex Shyba describes this event in another article, with regard to HTML cache clearing.
Your challenge will likely be knowing, in the event handler, which item was inserted and what to do with it. You'll need some sort of global data structure to communicate this state information, since the index update runs as an async job.
Other options (which may be easier) would be to remove the dependency on the index update (use Sitecore query or fast query), or poll the index until the item is there (which is a bit ugly).
Why not just add the item to the index yourself? That way the UI will be blocked until it's done.
You could do it by hooking into the item:saved event. I'm thinking the event handler would be based on the code from the database crawler.
Have you thought about queuing the second task as a "timed task", with some wrapper to check the dependency and requeue if necessary? See http://www.sitecore.net/Community/Technical-Blogs/John-West-Sitecore-Blog/Posts/2010/11/All-About-Sitecore-Scheduling-Agents-and-Tasks.aspx.

Loading Razor views from a database - VirtualPathProvider and CacheDependency confusion

I'm confused as to how CacheDependency works in VirtualPathProvider.GetCacheDependency().
Every example I've seen creates a cache dependency based on some physical file on disk, while I'm returning records from a database. Right now, I'm overriding GetFileHash and just returning the last date/time the relevant record was modified as the hash string. This works well, and I'm not sure using a CacheDependency item would affect the performance as I'd still have to go check the database every time the view is requested to see if it's been updated, but I'm still curious how to use CacheDependency.
Has anyone used this when returning views from a database?
Update
I'm using this now (http://razorengine.codeplex.com/), which works VERY well.
The point of CacheDependency is to provide you with an event that will be called when the cache becomes invalid (because the file on disk changed). Check out SqlCacheDependency, which does the same thing with SQL Server entries.

How can I speed up batch processing job in Coldfusion?

Every once in a while I am fed a large data file that my client uploads and that needs to be processed through CFML. The problem is that if I put the processing on a CF page, it runs into a timeout issue after 120 seconds. I was able to move the processing code to a CFC, where it seems not to have the timeout issue. However, sometime during the processing it causes ColdFusion to crash and it has to be restarted. There are a number of database queries (5 or more, a mixture of updates and selects) required for each of the 8,000+ lines of the file, as well as other logic provided by me in the form of CFML.
My question is: what would be the best way to go through this file? One caveat: I am not able to move the file to the database server and process it entirely with the DB. However, would it be more efficient to pass each line to a stored procedure that took care of everything? It would still be a lot of calls to the database, but nothing compared to what I have now. Also, what would be the best way to provide feedback to the user about how much of the file has been processed?
Edit:
I'm running CF 6.1
I just did a similar thing and use CF often for data parsing.
1) Maintain a file upload table (Parent table). For every file you upload you should be able to keep a list of each file and what status it is in (uploaded, processed, unprocessed)
2) Temp table to store all the rows of the data file (child table). Import the entire data file into this temporary table; attempting to do it all in memory will inevitably lead to some errors. Each row in this table will link to a file upload table entry above.
3) Maintain a processing status - For each row of the datafile you bring in, set a "process/unprocessed" tag. This way if it breaks, you can start from where you left off. As you run through each line, set it to be "processed".
4) Transaction - use cftransaction if possible to commit all of it at once, or at least one line at a time (with your 5 queries). That way if something goes boom, you don't have one row of data that is half computed/processed/updated/tested.
5) Once you're done processing, set the file name entry in the table in step 1 to be "processed"
By using the approach above, if something fails you can set it to start where it left off, or at least have a clearer path for where to start investigating, or, worst case, know what to clean up in your data. You will have a clear way of displaying to the user the status of the current upload processing, where it's at, and where it left off if there was an error.
If you have any questions, let me know.
Other thoughts:
You can increase timeouts, give the VM more memory, or move to 64-bit, but all of those will only increase the capacity of your system so much. It's a good idea to set these per call and do it in conjunction with the above.
Java has some neat file processing libraries that are available as CFCs. If you run into a lot of issues with speed, you can use one of those to read the file into a variable and then into the database.
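For illustration, the kind of Java those wrappers call under the hood is a buffered, line-by-line read, which keeps only one line in memory at a time (a sketch; processLine() is a hypothetical stand-in for the per-line work):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class LineReader {
    // Read the upload one line at a time instead of pulling the whole file into memory.
    public static void processFile(Path dataFile) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(dataFile)) {
            String line;
            while ((line = reader.readLine()) != null) {
                processLine(line);   // hypothetical: run the per-line queries/logic here
            }
        }
    }

    static void processLine(String line) {
        // placeholder for the real per-line processing
    }
}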
If you are playing with XML, do not use ColdFusion's XML parsing. It works well for smaller files but has fits when things get bigger. There are several CFCs out there (check RIAForge, etc.) that wrap some excellent Java libraries for parsing XML data. You can then create a cfquery manually with this data if need be.
It's hard to tell without more info, but from what you have said I'll throw out three ideas.
The first thing is that with so many database operations, it's possible that you are generating too much debugging output. Make sure that under Debug Output Settings in the Administrator the following settings are turned off:
Enable Robust Exception Information
Enable AJAX Debug Log Window
Request Debugging Output
The second thing I would do is look at those DB queries and make sure they are optimized. Make sure selects are happening with indices, etc.
The third thing I would suspect is that the file hanging out in memory is probably suboptimal.
I would try looping through the file using file looping:
<cfloop file="#VARIABLES.filePath#" index="VARIABLES.line">
<!--- Code to go here --->
</cfloop>
Have you tried an event gateway? I believe those threads are not subject to the same timeout settings as page request threads.
SQL Server Integration Services (SSIS) is the recommended tool for complex ETL (Extract, Transform, and Load) work, which is what this sounds like. (It can be configured to access files on other servers.) The question might be, can you work up an interface between Cold Fusion and SSIS?
If you can, upgrade to CF8 and take advantage of cfloop file="", which would give you greater speed, and the file would not be put in memory (which is probably the cause of the crashing).
Depending on the situation you are encountering you could also use cfthread to speed up processing.
Currently, an event gateway is the only way to get around the timeout limits of an HTTP request cycle. CF does not have a way to process CF pages offline; that is, there is no command-line invocation (one of my biggest gripes about CF - very little offline processing).
Your best bet is to use an Event Gateway or rewrite your parsing logic in straight Java.
I had to do the same thing. Ben Nadel has written a bunch of great articles on using Java file I/O to read and write files more quickly, etc.
It really helped improve the performance of our CSV importing application.

CouchDB View, Map, Index, and Sequence

I think I read somewhere that when a view is requested, the "map" is only run across documents that have been added since the last time it was requested? How is this determined? I thought I saw something about a sequence number. Is this something that you can get at? It's not part of the UUID trailing on the _rev field, is it?
Any way to force a 'recalc' of the entire View (across all records)?
The section about View Indexes in the Technical Overview is a great guide to this.
The view builder uses the database sequence ID to determine if the view group is fully up-to-date with the database. If not, the view engine examines all database documents (in packed sequential order) changed since the last refresh. Documents are read in the order they occur in the disk file, reducing the frequency and cost of disk head seeks.
As documents are examined, their previous row values are removed from the view indexes, if they exist. If the document is selected by a view function, the function results are inserted into the view as a new row.
CouchDB first checks to see if anything has changed in the entire database using a sequence id (that gets updated whenever there's a change to any document in the database). If something has changed it goes looking for those documents and runs the map function on them.
There really shouldn't be any need to rebuild/regenerate your views, since they will incrementally refresh as you modify your documents (note that the view won't update until you use it, though). With that said, one way (and I'm sure there's a better way) would be to remove the design document describing the view and insert it again, seeing as a design document is (almost) no different from a normal document.
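A rough sketch of that delete-and-reinsert idea over CouchDB's HTTP API, using Java 11's HttpClient (the database URL, the _design/myviews name, and the revision handling are illustrative; in practice you would GET the design doc first to obtain its current _rev and strip it before re-inserting):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ViewRebuild {
    // Delete the design doc, then PUT it back so the view group is rebuilt on next use.
    public static void forceRebuild(String dbUrl, String currentRev, String designDocJson)
            throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        client.send(HttpRequest.newBuilder(
                        URI.create(dbUrl + "/_design/myviews?rev=" + currentRev))
                .DELETE().build(), HttpResponse.BodyHandlers.ofString());
        client.send(HttpRequest.newBuilder(URI.create(dbUrl + "/_design/myviews"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(designDocJson))
                .build(), HttpResponse.BodyHandlers.ofString());
    }
}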