I am indexing more than 1,500,000 items from MySQL with Apache Solr 5.4.1. Every day when I open the Solr Admin page, I find more than 5,000 deleted documents that should be optimized away; I click Optimize and everything is fine again.
Is there a simple URL I can put in the crontab to automate the optimization of the indexes in Apache Solr 5.4.1?
Thank you.
Example from UpdateXMLMessages:
This example will cause the index to be optimized down to at most 10
segments, but won't wait around until it's done (waitFlush=false):
curl 'http://localhost:8983/solr/core/update?optimize=true&maxSegments=10&waitFlush=false'
.. but in general, you don't have to optimize very often. It might not be worth the time spent doing the actual optimize and the extra disk activity. If you're re-indexing the index completely each time as well, indexing to a fresh collection and then swapping the collections afterwards is also a possible solution.
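If you still want to schedule it, a minimal crontab sketch (using the same host, port and core name as in the example above, and running once a day at 03:00; adjust both to your setup):

# crontab -e
0 3 * * * curl -s 'http://localhost:8983/solr/core/update?optimize=true&maxSegments=10&waitFlush=false' > /dev/null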
Related
I use the Sphinx search engine to index about 22M records read from Oracle via ODBC. The indexing speed is not bad, but after indexing and sorting are complete, the indexer hangs for several minutes. I also tried ranged queries, which helped a little, but the problem is still there.
I want to know what is going on behind the scenes during this time and how I can reduce it.
To see what's going on behind the scenes, run indexer with --print-queries. If that doesn't help, review the database log and the queries running at the moment the indexer seems to hang.
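For example (the config path and index name here are placeholders for your own):

indexer --config /etc/sphinx/sphinx.conf --print-queries my_index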
I am using the Apache ManifoldCF open source project to index documents from Google Drive into my Solr. I have often seen it be quite inconsistent in indexing the data, and it takes time for even a small number of documents to show up in Solr. Do you really think it is a good option for indexing Google Drive?
It is currently a bit on the slow side, due to response time and throttling constraints from Google Drive itself, though that limit can probably be relieved if you buy additional bandwidth from Google. With the current setup, if you are looking to index a large set of documents in Google Drive, it may not be as quick as you expect.
ManifoldCF is good for crawling through a file system. You can go for Apache Nutch if you are interested in web crawling.
Yes, ManifoldCF does take a lot of time to reflect a small number of documents, and it has very little documentation. However, you can join the mailing list, where you can ask questions of the lead developer, Karl. He is very helpful and usually answers within a few hours.
P.S.: I have worked with ManifoldCF on a project for a span of 10 months.
For testing purposes I want to run a small fetch-generate-parse-index cycle. I have about a thousand records in my database, and when I run the indexing it starts processing many, many records, which takes hours.
Do you have any tips on how to test this use case? Is there a way to limit the number of pages generated in one batch?
I would appreciate any general advice on testing Nutch 2.2.1 with HBase and Solr 4.
It shouldn't take ages for a thousand records in the database. Have you enabled multithreading?
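As a sketch of one small cycle using the Nutch 2.x command-line jobs (the seed directory, -topN value, thread count and Solr URL are all placeholders; -topN is what caps how many URLs the generate step puts into the batch):

bin/nutch inject urls/
bin/nutch generate -topN 100
bin/nutch fetch -all -threads 10
bin/nutch parse -all
bin/nutch updatedb
bin/nutch solrindex http://localhost:8983/solr/ -all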
My Rails application keeps hitting the disk I/O rate threshold set by my VPS provider, Linode. It's set at 3000 (I upped it from 2000), and every hour or so I get a notification that it has reached 4000-5000+.
What methods can I use to minimize the disk I/O rate? I mostly use Sphinx (the Thinking Sphinx plugin) and latitude/longitude distance search.
What methods should I avoid?
I'm using Rails 2.3.11 and MySQL.
Thanks.
Did you check whether your server is swapping itself to death? What does "top" say?
Your Linode may have limited RAM, and it could very well be swapping like crazy to keep things running.
If you see red in the I/O graph, that is swapping activity. You either need to upgrade your Linode to more RAM or limit the number and size of the processes you are running. You should also add approximately 2x the RAM size as swap space (a swap partition).
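A quick way to check from the shell (standard Linux tools, nothing Linode-specific):

free -m       # how much RAM and swap are in use
swapon -s     # which swap devices are active and how full they are
vmstat 5 3    # si/so columns show pages swapped in/out per second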
Your question is too vague to answer concisely, but this is generally a sign of one of a few things:
Your data set is too large because of historical data that you could prune. Delete what is no longer relevant.
Your tables are not indexed properly and you are hitting a lot of table scans. Check each of your slow queries with EXPLAIN (see the example after this list).
Your data structure is not optimized for the way you are using it, and you are doing too many joins. Some tactical de-normalization would help here. Make sure all your JOIN queries are strictly necessary.
You are retrieving more data than is required to service the request. It is, sadly, all too common that people load enormous TEXT or BLOB columns from a user table when displaying only a list of user names. Load only what you need.
You're being hit by some kind of automated scraper or spider robot that's systematically downloading your entire site, page by page. You may want to alter your robots.txt if this is an issue, or start blocking troublesome IPs.
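For the indexing point above, EXPLAIN is the quickest check. For example, from the shell (the table and column names are just placeholders for one of your own slow queries):

mysql -u youruser -p yourdb -e "EXPLAIN SELECT id, name FROM places WHERE latitude BETWEEN 10.0 AND 11.0 AND longitude BETWEEN 20.0 AND 21.0\G"

If the output shows type: ALL and a large rows estimate, the query is doing a full table scan, and an index on the filtered columns is the usual fix.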
Is it going high and staying high for a long time, or is it just spiking temporarily?
There aren't going to be specific methods to avoid (other than not writing to disk).
You could try using a production profiler such as New Relic to get more insight into your performance. A profiler will only highlight the actions that are taking a long time, though; it's when you examine the specific algorithm you're using in such an action that you might discover what's inefficient about it.
I know there have been some semi-similar questions, but in this case I am building an index that is offline until the build is complete. I am building two cores from scratch: one with about 300k records containing a lot of citation information and large blocks of full text (this is the document index), and another with about 6.6 million records, also with full text (this is the page index).
Given that this index is being built offline, the only real performance issue is build speed. No one should be querying this data.
Auto-commit would apparently fire if I stopped adding items for 50 seconds, which I never do: I am adding ten at a time, every couple of seconds.
So, should I commit more often? It feels like the longer this runs, the slower it gets, at least in my test case of 6k documents to index.
With no one searching this index, how often would you suggest I commit?
I should say I am using Solr 3.1 and SolrNet.
Although it's the commits that are taking time for you, you might want to consider tweaking things other than the commit frequency.
Is it the indexing core that also does the searching, or is the index replicated somewhere else after indexing concludes? If the latter is the case, then turning off caches might have a very noticeable impact on performance (Solr rebuilds its caches every time you commit).
You could also look into using the autoCommit or commitWithin features of Solr.
commitWithin is done as part of the document add command. I believe this is supported by SolrNet - please see the "Using the commitWithin attribute" thread for more details.
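At the raw update-handler level, commitWithin is an attribute on the <add> message; a sketch with curl (assuming a single-core setup at /solr/update and a hypothetical document; 10000 is milliseconds):

curl 'http://localhost:8983/solr/update' -H 'Content-Type: text/xml' --data-binary '<add commitWithin="10000"><doc><field name="id">doc-1</field></doc></add>'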
autoCommit is a Solr configuration value added to the update handler section.