Solr configuration on Heroku - ruby-on-rails-3

I am using WebSolr Cobalt on Heroku.
Search works if I query either the first letter or the full word, but not partial fragments of a word.
Any help?

To enable partial-word searching, you must edit your local schema.xml file (usually under solr/config) to add either:
NGramFilterFactory
EdgeNGramFilterFactory
Here's what mine looks like - sample schema.xml
EdgeNGram
I went with the EdgeN option. It doesn't allow searching in the middle of words, but it does allow partial-word search starting from the beginning of the word. This cuts way down on false positives / matches you don't want, performs better, and is usually not missed by users. I also like minGramSize=2, which requires users to enter a minimum of 2 characters; some folks set this to 3.
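For reference, here is a trimmed sketch of the kind of fieldType this involves; the field-type name, tokenizer, and gram sizes below are illustrative rather than copied from the sample schema.xml linked above:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- EdgeNGram indexes prefixes of each token, e.g. "se", "sea", "sear", ... -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Swapping EdgeNGramFilterFactory for NGramFilterFactory in the index analyzer would give full substring matching, at the cost of a much larger index and more unwanted matches.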
Once your local setup is working, you must edit the schema.xml used by WebSolr, otherwise you will get the default behavior, which requires the full word to be entered even if you have full-text searching configured for your models.
To edit the websolr schema.xml
Go to the Heroku online dashboard for your app
Go to the resources tab, then click on the Websolr add-on
Click the default link under Indexes
Click on the Advanced Configuration link
Paste in your schema.xml from your local setup, including the config for your NGram filter of choice (mentioned above). Save.
Copy the link in the "Configure your Heroku application" box, then paste it into a terminal to set WEBSOLR_URL in your Heroku config (see the example after these steps).
Click the Index Status link to get nifty stats and see if you are running fast or slow.
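For reference, that command looks something like the following; the index URL here is just a placeholder, the real one comes from the WebSolr dashboard:

heroku config:set WEBSOLR_URL=http://index.websolr.com/solr/your-index-id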
Reindex everything
heroku run rake sunspot:reindex[5000]
Don't use heroku run rake sunspot:solr:reindex - it is deprecated, accepts no parameters and is WAY slower
The default batch size is 50 and most people suggest 1000, but I've seen significantly faster results (about 1000 rows per second as opposed to around 500) by bumping it up to 5000+.
Take it to the next level
5 ways to speed up indexing

Related

How to fix empty semantic media wiki query results after restore?

After restoring a Semantic MediaWiki installation from backup, the SMW engine no longer returns any query results. I have (re)inserted all regular pages, all form pages, and all property pages into the new MW instance, so all content is there, but query results remain empty. It seems as if the internal data structures maintained by SMW are not populated. How can this be fixed? Are there any specific scripts that need to be run manually?
Indeed, the internal SMW cache is not filled after a restore. The solution is simple: go to extensions/SemanticMediaWiki/maintenance and run the script rebuildData.php, which will reparse every single wiki page and fill the SMW database accordingly.
Be aware that for this to work your wiki needs to be configured properly. By default, SMW will not process additional namespaces! You need to enable this manually in LocalSettings.php for every namespace you add yourself, and the setting only takes effect if it appears after the line that enables SMW.
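For reference, the rebuild and the namespace setting look roughly like this; NS_MY_NAMESPACE is a placeholder for your own namespace constant, and the exact script options vary by SMW version:

cd extensions/SemanticMediaWiki/maintenance
php rebuildData.php

# in LocalSettings.php, after the line that enables SMW:
$smwgNamespacesWithSemanticLinks[NS_MY_NAMESPACE] = true;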

Sitecore syncMaster index strategy not working

I have a rendering component that runs a search using the Lucene index to populate itself.
We have two indexes defined: Master and Web. The Experience Editor uses the Master index, and the actual site uses the Web index.
We've configured the Web index strategy as onPublishEndAsync, and we've configured the Master index strategy as syncMaster, the idea being that CMS users can add/edit Sitecore items that power this component, and see them straight away in the experience editor.
However, it seems that the master index is not being updated as we change data in Sitecore. The experience editor only shows the data once I've manually run an index rebuild.
<strategies hint="list:AddStrategy">
<strategy ref="contentSearch/indexConfigurations/indexUpdateStrategies/syncMaster" />
</strategies>
Why doesn't the index update itself upon data changes?
UPDATE
So I've compared the files suggested to a clean install and they are the same.
I should add, I'm not using the standard sitecore_master_index. We have multiple sites running off the same instance of Sitecore, so we have added a config include for websitename_master_index. I have compared the config within its <index> node against sitecore_master_index in Sitecore.ContentSearch.Lucene.Index.Master.config, and the only differences are the crawler's <root> element, which points to the particular site's content node, plus some custom fields we've added. I assume those fields aren't causing the problem, since we can manually rebuild the index fine?
One other interesting thing I found when looking at the showconfig.aspx was this:
<agent type="Sitecore.ContentSearch.Tasks.Optimize" method="Run" interval="12:00:00" patch:source="Sitecore.ContentSearch.config">
<indexes hint="list">
<index>sitecore_master_index</index>
</indexes>
</agent>
I'm not sure if this has any significance, but there was no matching entry for our custom websitename_master_index.
UPDATE
I've also added debug level logging to the crawler
In the crawling.log I only see the following:
14416 08:55:10 INFO [Index=website_master_index] Initializing SitecoreItemCrawler. DB:master / Root:/sitecore/Content/Website/Home
14416 08:55:10 INFO [Index=website_master_index] Initializing SynchronousStrategy.
Upon editing and saving items, there is no further mention of the index in the log, and this is actually true of the standard sitecore_master_index as well, whose config we haven't altered.
To guarantee that Lucene index files are not modified concurrently, Lucene uses a lock-file concept: whatever process is about to write has to create the file first.
If one already exists, the writer waits for it to be removed.
If a writer process is terminated, the lock file never gets removed, so the index never gets updated.
The solution was to clean the folder manually.
To make a better diagnosis, a memory snapshot of the process would be needed to see what is happening inside (what each thread is doing).
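Sitecore's provider here is Lucene.NET, but the lock concept is the same as in Java Lucene. Purely as an illustration (the index path is a placeholder), a stale lock can be detected and cleared like this:

import java.io.File;

import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class ClearStaleLock {
    public static void main(String[] args) throws Exception {
        // Placeholder path to the index folder on disk
        Directory dir = FSDirectory.open(new File("/data/indexes/website_master_index"));
        // A write.lock left behind by a killed writer process blocks all further updates
        if (IndexWriter.isLocked(dir)) {
            // Only safe when you are certain no writer is actually running
            IndexWriter.unlock(dir);
        }
    }
}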

Suggestion around Lucene 4.4 (Log Search)

I am new to Lucene and trying to use it for searching log files/entries generated by SystemA.
Architecture
Each log entry (an XML document) arrives in an INPUT directory. SystemA sends log entries to an MQ queue, which is polled by a small utility that picks up each message and creates a file in the INPUT directory.
WriteIndex.java (IndexWriter/Lucene) keeps checking whether a new file has arrived in the INPUT directory. If so, it indexes the file and moves it to the OUTPUT directory. As part of indexing, I am putting the filename, path, timestamp, and contents into the index.
"Note: I am indexing the content and also storing the whole content as a StringField."
SearchIndex.java (i.e. SearcherManager/Lucene/refreshIfChanged) is created. As part of creation I also started a thread that checks every minute whether the index has changed. I acquire an IndexSearcher for every request. It's working fine.
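For what it's worth, here is a minimal sketch of that SearcherManager refresh/acquire pattern against the Lucene 4.x API; the index path and field name are made up for illustration:

import java.io.File;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.SearcherFactory;
import org.apache.lucene.search.SearcherManager;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class SearchSketch {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(new File("/data/log-index"));
        SearcherManager manager = new SearcherManager(dir, new SearcherFactory());

        // The background thread calls this periodically (e.g. every minute)
        // to pick up whatever the writer has committed since the last refresh.
        manager.maybeRefresh();

        // Per search request: acquire a searcher, use it, always release it.
        IndexSearcher searcher = manager.acquire();
        try {
            TopDocs hits = searcher.search(new TermQuery(new Term("contents", "error")), 10);
            System.out.println("hits: " + hits.totalHits);
        } finally {
            manager.release(searcher);
        }
        manager.close();
    }
}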
Everything so far has worked fine, but I am not sure what will happen in production. I have tested it with a few hundred files, but in production I will be getting around 500K log entries a day, which means 500K small files, each containing an XML document. WriteIndex.java will have to run non-stop to update the index whenever a new file is received.
I have the following questions:
Has anyone done similar work? Are there any issues or best practices I should follow?
Do you see any problem with the index files generated for such a large number of XML files? Each XML file would be 2 KB max. Remember that I am indexing the content as well as storing it as a string in the index, so that I can retrieve it whenever I find a match while searching.
I would be exposing SearchIndex.java as a servlet so that admins can search log entries from a web page. Do you see any issues with that?
Please let me know if anyone need anything specific.
Thanks,
Rohit Goyal
Architecture looks fine.
A few things:
Consider using TextField instead of StringField. TextField is tokenized, so users can search on individual tokens; StringField is not tokenized, so a document only matches when the full text matches. For example:
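A rough illustration with the Lucene 4.x document API (the field names are just examples):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;

public class LogDoc {
    static Document build(String fileName, String xmlAsString) {
        Document doc = new Document();
        // Tokenized and indexed: a query for a single word inside the XML will match
        doc.add(new TextField("contents", xmlAsString, Field.Store.YES));
        // Not tokenized: only an exact match on the whole value finds the document
        doc.add(new StringField("filename", fileName, Field.Store.YES));
        return doc;
    }
}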
Performance is not a problem for Lucene. Check out the Lucene performance benchmarks: Lucene can index millions of Wikipedia documents in minutes, and searching is fast too.

Orchard - Search & Indexing issue

I have a project built with Orchard CMS. All functionality is implemented through modules. The Search module was also working until a few days ago, but it suddenly stopped working, "without any reason".
The issue is that I cannot rebuild/update the indexes. When I run indexing, it only indexes the default list of fields (id, title, body, format, type, author, created, published, modified, culture); my custom fields are not indexed.
I tried everything but without any success. I tried:
- Deleting Indexing/Search folder with all files
- Reinstalling Search/Indexing/Lucene modules
- Rebuilding and rebuilding indexes....
- Clearing solution and rebuilding...
I didn't extend any of Orchard modules, they are the same as when I downloaded them.
Any advice on this one...?
P.S. Yes, I already checked the custom fields that need to be indexed. :)
Thanks,
If you think the index is corrupted, delete App_data\Sites\Default\Search.settings.xml and App_data\Sites\Default\Indexes, then restart the app pool. You should then be able to rebuild the index.
Apparently you already did this, but for others who may not have, you also need to check the fields you want indexed under Settings/Search. This will include the fields in search.
But for the fields to be included in search, they need to be indexed first. To do this, go to Content/Content Types and edit the content type the fields are on. Check "Index this content type for search". Then, for each field you want indexed, open the field's settings and check "Include in the index".
You'll need to run the "Recipe" to create the "Search" index.
It appears that Search + Lucene + Indexing works with Text Fields but not Numeric Fields.
When the search feature is enabled, the Settings screen in the dashboard displays the fields that will be queried from the index (listed on the Search screen).

Sitecore System Lucene Index for custom queries

I have been using Sitecore query and FAST query for some sections of the website. But with growing content these queries have gotten slow, and I'd like to implement Lucene querying for content to speed things up.
I am wondering if I can just use the System index instead of having to setup a separate index. Does Sitecore by default index all content in the content editor? Is this a good approach or should I just create my own index?
(I'm going to assume you're using Sitecore 6.4-6.6.)
As with everything, it depends. Sitecore keeps an index of all the Sitecore items in its system index, and you are welcome to use that. Sometimes, though, you may want a more specialised or restricted set of items, for example only items based on a certain template, or you may need a checkbox field indexed (the system index by default only indexes text fields).
Setting up your own search index is pretty easy. It does require some fiddling with the web.config, though (and I'd recommend adding it as a config include file).
Create a new <index> node with its own id that defines the name of the collection and the folder it will go into. (You can check it's working by looking for the directory under /data/indexes in your installation.)
Next, you can tell the crawler which database to look at (most likely master if you want unpublished content indexed, or web for published content) and where to start the crawl from (in this example I am indexing only the news section). You can also tag, boost, and decide whether to IndexAllFields (otherwise it will only index fields it understands as text: rich text, multi-line text, text, etc.).
Finally, you can tell the indexer which template types to include or exclude.
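A trimmed sketch of what such a config section might look like on Sitecore 6.x; the id, folder, root path, and tags are purely illustrative, and the patch-file wrapper is omitted:

<search>
  <configuration>
    <indexes>
      <index id="news" type="Sitecore.Search.Index, Sitecore.Kernel">
        <param desc="name">$(id)</param>
        <param desc="folder">news</param>
        <Analyzer ref="search/analyzer" />
        <locations hint="list:AddCrawler">
          <master type="Sitecore.Search.Crawlers.DatabaseCrawler, Sitecore.Kernel">
            <Database>master</Database>
            <Root>/sitecore/content/Home/News</Root>
            <Tags>news</Tags>
            <Boost>1.0</Boost>
            <IndexAllFields>true</IndexAllFields>
          </master>
        </locations>
      </index>
    </indexes>
  </configuration>
</search>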
The indexer works by subscribing to item events within Sitecore, so every time an item is changed, moved, or deleted, the index is updated automatically. Obviously, if you are indexing the web database, the items will need to have been published.
More in-depth info on the query syntax & indexing can be found here on SDN.
The search syntax and API is much improved in 6.4/6.5 but if you want to add extra kick then my colleague Alex Shyba's Advanced Database Crawler is worth checking out too.
Hope this helps :D
You will want to implement your own index. For the same reason that you are seeing things slow down when there is a lot of content, indexes also slow down when they contain a lot of content.
I prefer targeted indexes built specifically to drive the functionality I need, containing only the data that is required. This allows for smaller and more efficient index usage by your components.
Additionally, you probably want to look into the AdvancedDatabaseCrawler put together by Alex Shyba. There are a few blogs out there with some great posts on implementing this Lucene indexing module.
A separate index is always a wise decision; you can keep it light. In big environments the system index can grow to gigabytes.
You can exclude the content from the index, as you will only be using it for performing lookups, not showing content from the index.
Finally: the system index is for the master database, while you'll be querying the web database, possibly on a content delivery server.