How can I use Solr to do real-time search? - lucene

Currently we use deltaImport to update data from the database into the index,
but some of our information needs real-time or near-real-time search.
How should I solve this with Solr?

To get near-real-time search, I would update the data in small batches and also update the index in small batches every minute (an index update takes only a few seconds, depending on the size of the new data); a sketch of this is below.
Don't forget to optimize the index regularly.
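A minimal SolrJ sketch of this batched approach, assuming the SolrJ 3.x HttpSolrServer client and illustrative field names (id, text) that are not from the original question:

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexer {
    public static void main(String[] args) throws Exception {
        SolrServer server = new HttpSolrServer("http://localhost:8983/solr");

        // collect the rows changed since the last run into a small batch
        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        for (int i = 0; i < 500; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-" + i);       // illustrative field names
            doc.addField("text", "payload " + i);
            batch.add(doc);
        }

        server.add(batch);   // send the whole batch in one request
        server.commit();     // one commit per batch, never per document
    }
}

Run something like this from a scheduler once a minute; each run only pays for the new documents, and commits stay cheap because the batches are small.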

This post could be useful for you: Solr and Near Real-Time Search

You should take a look at Solr 3.3 with RankingAlgorithm 1.2. It supports NRT and can update 10,000 docs/sec, and you can search concurrently during the updates. You can get more information here:
http://solr-ra.tgels.org/wiki/en/Near_Real_Time_Search_ver_3.x

Related

How to create a facet in Sitecore Content Search (Lucene) based on Real Time Data?

With Sitecore Content Search configuration, is it possible to support the addition of a field that is populated with a value at search time, not index time? The population would be from an in-memory data structure, for performance.
Essentially, the values need to be updated and accessed without re-indexing. Examples of such a real-time field would be Facebook likes, in-stock status, or real-time pricing. This data would then be used for faceting, such as items within a range of Facebook likes, in-stock versus out-of-stock, or real-time price facets.
The Content Search API does the searching on an IIndexable, so I would look into that - you'd probably have to implement this interface yourself.
More info here:
http://www.sitecore.net/learn/blogs/technical-blogs/sitecore-7-development-team/posts/2013/04/sitecore-7-search-operations-explained.aspx
If you need to search on data that is not in the index, I would question whether Sitecore search is the best option here. If the data needs to be searched in real time, then maybe a database would suffice.
If the data set is large and you need real-time access, then a NoSQL database such as MongoDB might be the right choice. Hope this has given you some ideas and you reach a solution.
You can leverage the Sitecore dynamic index. The idea is to query your "large" index from within your in-memory index, which you use dynamically. The implementation is relatively easy.
More info: http://www.sitecore.net/en-gb/learn/blogs/technical-blogs/sitecore-7-development-team/posts/2013/04/sitecore-7-dynamic-indexes.aspx

Redis full text search : reverse indexing or sunspot?

I have 3.5 million records (read-only) currently stored in a MySQL DB that I want to pull out to Redis for performance reasons. So far, I've managed to store things like this in Redis:
1 {"type":"Country","slug":"albania","name_fr":"Albanie","name_en":"Albania"}
2 {"type":"Country","slug":"armenia","name_fr":"Arménie","name_en":"Armenia"}
...
The key I use here is the legacy MySQL id, so with some Ruby glue I can break as few things as possible in the existing app (and this is a serious concern here).
Now the problem is when I need to perform a search on the keyword "Armenia" inside the value part. It seems there are only two ways out:
Either I multiply the Redis indexes:
id => JSON values (as shown above)
slug => id (reverse indexing based on the slug, which could do the basic search trick)
finally, another huge index specifically for autocomplete, as shown in this post: http://oldblog.antirez.com/post/autocomplete-with-redis.html
Or I use Sunspot or some other full-text search engine (unfortunately, I currently use ThinkingSphinx, which is too tightly tied to MySQL :-()
So, what would you do? Do you think the MySQL-to-Redis move of a single table is even a good idea? I'm afraid of the memory footprint those gigantic Redis key/values could take on a 16 GB RAM server.
Any feedback on a similar Redis usage ?
Before I give a real answer, I want to mention that I don't see a good reason for you to be using Redis here. Based on the kinds of use cases you describe, something like Elasticsearch would be more appropriate for you.
That said, if you just want to be able to search a few different fields within your JSON, you've got two options:
An auxiliary index that maps field_key -> list_of_ids (in your case, "Armenia" -> 1); a sketch of this option follows below.
Lua on top of Redis, with JSON encoding and decoding, to get at what you want. This is far more flexible and space efficient, but will be slower as your table grows.
Again, I don't think either is really appropriate, because Redis doesn't sound like a good choice for you, but if you must, those should work.
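A minimal sketch of the auxiliary-index option, assuming the Jedis client; the key names record:<id> and idx:name:<keyword> are illustrative, not from the original post:

import java.util.Set;

import redis.clients.jedis.Jedis;

public class ReverseIndexExample {
    public static void main(String[] args) {
        Jedis jedis = new Jedis("localhost", 6379);

        // the main record, keyed by the legacy MySQL id
        jedis.set("record:2",
            "{\"type\":\"Country\",\"slug\":\"armenia\",\"name_fr\":\"Arménie\",\"name_en\":\"Armenia\"}");

        // the auxiliary reverse index: keyword -> set of matching ids
        jedis.sadd("idx:name:armenia", "2");

        // lookup: resolve the keyword to ids, then fetch each record
        Set<String> ids = jedis.smembers("idx:name:armenia");
        for (String id : ids) {
            System.out.println(jedis.get("record:" + id));
        }

        jedis.close();
    }
}

Note that every searchable keyword needs its own set, which is exactly the index duplication the question worries about.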
Here's my take on Redis.
Basically, I think of it as an in-memory cache that can be configured to evict the least recently used data (LRU) and keep only what was touched recently. That is the role I made it play in my use case, and the reasoning may help you think through yours.
I'm currently using Redis to cache results for a search engine based on some complex (slow) queries, backed by data in another DB (similar to your case). So Redis serves as cache storage for answering queries: each query is served either from Redis or, on a cache miss, from the DB. Note that Redis is not replacing the DB, but merely extending it with a cache; a sketch of this pattern is below.
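A minimal sketch of that cache-aside pattern, again assuming Jedis; the key scheme, the one-hour TTL, and the runSlowDbQuery helper are all illustrative:

import redis.clients.jedis.Jedis;

public class QueryCache {
    private final Jedis jedis = new Jedis("localhost", 6379);

    public String search(String query) {
        String cacheKey = "cache:query:" + query;   // illustrative key scheme

        // serve from Redis when the result is already cached
        String cached = jedis.get(cacheKey);
        if (cached != null) {
            return cached;
        }

        // cache miss: run the slow query against the backing DB,
        // then store the result with a TTL so stale entries expire
        String result = runSlowDbQuery(query);
        jedis.setex(cacheKey, 3600, result);
        return result;
    }

    private String runSlowDbQuery(String query) {
        return "...";   // hypothetical stand-in for the real DB call
    }
}

Setting maxmemory together with maxmemory-policy allkeys-lru in redis.conf makes Redis evict the least-recently-used keys first once memory fills up, which is the LRU behaviour described above.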
This fit my specific use case, because the addition of Redis was supposed to assist future scalability. The idea is that repeated access of recent data (in my case, if a user does a repeated query) can be served by Redis, and take some load off of the DB.
Basically, my Redis schema ended up looking somewhat like the duplicated index you outlined above. I used sets and sorted sets to create "batches" of Redis keys, each of which pointed to specific query results stored under a particular Redis key. And in the DB, I still had the complete data set and an index.
If your data set fits in RAM, you could do the "table dump" into Redis and get rid of MySQL entirely. I could see this working, as long as you plan for persistent Redis storage and for the possible growth of your data, if this "table" will grow in the future.
So depending on your actual use case, how you see Redis fitting into your stack, and the load your DB serves, don't rule out having to do both of the options you outlined above (which is what happened in my case).
Hope this helps!
Redis now provides full-text search with RediSearch.
RediSearch implements a search engine on top of Redis. It also enables more advanced features, like exact phrase matching, auto-suggestions, and numeric filtering for text queries, that are not possible or efficient with traditional Redis search approaches.

Solr Index slow after a while

I use SolrJ to send data to my Solr server.
When I start my program, it indexes at a rate of about 1000 docs per second (I commit every 250,000 docs).
I have noticed that once my index fills up with about 5 million docs, it starts crawling - not just at commit time, but at add time too.
My Solr server and indexing program run on the same machine.
Here are some of the relevant portions from my solrconfig:
<useCompoundFile>false</useCompoundFile>
<ramBufferSizeMB>1024</ramBufferSizeMB>
<mergeFactor>150</mergeFactor>
Any suggestions about how to fix this?
That mergeFactor seems really, really (really) high - the Solr default is 10. With a value of 150, segments pile up and every search and merge has to touch many segments, which matches the slowdown you see as the index grows.
Do you really want that?
If you aren't using compound files, that many segments can also easily lead to a ulimit problem with open file descriptors (if you are on Linux).

Simultaneous queries in Solr

Hi,
I am deploying a Solr server containing more than 30M docs. Currently, I am testing search performance, and the results depend heavily on the number of simultaneous queries I execute:
1 simultaneous query: 2516 ms
2 simultaneous queries: 4250, 4469 ms
3 simultaneous queries: 5781, 6219, 6219 ms
4 simultaneous queries: 6484, 7203, 7719, 7781 ms
...
The Jetty thread pool is configured with the defaults:
<New class="org.mortbay.thread.BoundedThreadPool">
  <Set name="minThreads">10</Set>
  <Set name="lowThreads">50</Set>
  <Set name="maxThreads">10000</Set>
</New>
I would like to know if there is any setting I can tune to reduce the impact of simultaneous requests on response times.
solrconfig is also left at the defaults, but without caches (to measure worst cases) and with mergeFactor=5 (searching will be requested more often than updating).
Thanks in advance
Why are you trying to do this with caching turned off? What exactly are you trying to measure?
You have effectively forced Solr (Lucene) to perform every search from disk. What you are actually measuring is the concurrency of Java itself, combined with your OS and disk throughput. This has nothing to do with Jetty or Solr.
Caches are your friend, and you really should be using them in any sort of production capacity. In my opinion, you should be measuring your throughput under load while varying the cache sizes, to see the tradeoff between cache size and throughput; a sketch of such a load test is below.
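A minimal sketch of such a load test, assuming the SolrJ HttpSolrServer client and an illustrative query string; vary the cache settings in solrconfig.xml and the concurrency between runs, then compare the timings:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class ConcurrentQueryBench {
    public static void main(String[] args) throws Exception {
        final SolrServer server = new HttpSolrServer("http://localhost:8983/solr");
        int concurrency = 4;   // vary this to reproduce the numbers above

        ExecutorService pool = Executors.newFixedThreadPool(concurrency);
        List<Future<Long>> timings = new ArrayList<Future<Long>>();
        for (int i = 0; i < concurrency; i++) {
            timings.add(pool.submit(new Callable<Long>() {
                public Long call() throws Exception {
                    long start = System.currentTimeMillis();
                    server.query(new SolrQuery("text:something"));  // illustrative query
                    return System.currentTimeMillis() - start;
                }
            }));
        }
        for (Future<Long> t : timings) {
            System.out.println(t.get() + " ms");
        }
        pool.shutdown();
    }
}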
Please check out this IBM tutorial for Solr.
It was a great help to me.
Hope you will find your answer. :-)

Best way to keep index real time?

I have a Solr/Lucene index of approximately 700 GB. The documents I need to index arrive in real time; roughly 1000 docs are submitted every 30 minutes and need to be indexed. Since it is a requirement that new documents be searchable as soon as possible, a script runs every 30 minutes to index the documents that are not yet indexed, but this process slows down searching.
Is this the best way to index the latest documents, or is there some better way?
First, remember that Solr is not a real-time search engine (yet); there is still work to be done there.
You can use a master/slave setup, where indexing is done on the master and searching on the slave. That way, indexing does not affect search performance. After each commit on the master, force the slave to fetch the latest index from the master (a sketch of this is below); while the new index is being replicated, the slave keeps serving queries with the previous index.
Also check your cache-warming settings; remember that these can slow down searches if they are too aggressive. Also check the queries launched on the newSearcher event.
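A minimal sketch of forcing that fetch from the slave side, assuming the standard Solr replication request handler is enabled on both cores and an illustrative host name:

import java.net.HttpURLConnection;
import java.net.URL;

public class ForceReplication {
    public static void main(String[] args) throws Exception {
        // the replication handler's fetchindex command tells the slave
        // to pull the latest committed index from its configured master
        URL url = new URL("http://slave-host:8983/solr/replication?command=fetchindex");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        System.out.println("HTTP " + conn.getResponseCode());
        conn.disconnect();
    }
}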
You can do this with Lucene easily. Split the index into multiple parts (or, to be precise, create "smaller" parts while building the index). Create a searcher for each part and keep a reference to each. You can then create a MultiSearcher on top of these individual parts.
Now, only one of the parts will receive new documents. At regular intervals, add documents to that part, commit, and re-open its searcher.
After that last part is updated, create a new MultiSearcher again, reusing the previously opened searchers.
Thus, at any point, you re-open only one small searcher, and that is quite fast; a sketch is below.
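A minimal sketch of this scheme, assuming the Lucene 3.x API that was current at the time (MultiSearcher was removed in later versions) and illustrative index paths:

import java.io.File;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.Searchable;
import org.apache.lucene.store.FSDirectory;

public class PartitionedSearch {
    public static void main(String[] args) throws Exception {
        // the large, static part is opened once and never re-opened
        IndexSearcher big = new IndexSearcher(
                IndexReader.open(FSDirectory.open(new File("/index/big"))));

        // the small "delta" part is the only one receiving new documents
        IndexReader deltaReader = IndexReader.open(FSDirectory.open(new File("/index/delta")));
        IndexSearcher delta = new IndexSearcher(deltaReader);

        MultiSearcher searcher = new MultiSearcher(new Searchable[] { big, delta });

        // ... after each batch of adds and a commit on the delta index:
        IndexReader newDelta = deltaReader.reopen();
        if (newDelta != deltaReader) {
            delta = new IndexSearcher(newDelta);   // re-open only the small part
            searcher = new MultiSearcher(new Searchable[] { big, delta });
        }
    }
}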
Check out http://code.google.com/p/zoie/, a wrapper around Lucene that makes it real time - code donated by LinkedIn.
^^ I do this with plain Lucene (not Solr), and it works really nicely; however, I'm not sure whether there is a Solr way to do it at the moment. Twitter recently went with Lucene for search and gets effectively real-time searching by writing to its index on every update. The index resides completely in memory, so updating and reading it costs almost nothing and happens instantly, and a Lucene index can always be read while it is being written to, as long as there is only one writer at a time. A sketch of this pattern is below.
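A minimal sketch of that in-memory, single-writer pattern, assuming the Lucene 3.x API of the time (RAMDirectory plus the near-real-time reader opened from the writer); the field name is illustrative:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class InMemoryNrt {
    public static void main(String[] args) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir,
                new IndexWriterConfig(Version.LUCENE_33, new StandardAnalyzer(Version.LUCENE_33)));

        // the single long-lived writer: add documents as updates arrive
        Document doc = new Document();
        doc.add(new Field("body", "some new text", Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);

        // a near-real-time reader sees the adds without a full commit,
        // while the writer stays open for further updates
        IndexReader reader = IndexReader.open(writer, true);
        IndexSearcher searcher = new IndexSearcher(reader);
        System.out.println("docs visible: " + reader.numDocs());
    }
}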
Check out this wiki page