I use the Sphinx search engine to index about 22M records read from Oracle via ODBC. Indexing speed is not bad, but after indexing and sorting complete, the indexer hangs for several minutes. I also tried ranged queries; that helped a little, but the problem is still there.
I want to know what is going on behind the scenes during this time and how I can reduce it.
To see what's going on behind the scenes, run indexer with --print-queries. If that doesn't help, review the database log and the queries running at the moment the indexer appears to hang.
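For example, a minimal invocation (the config path and index name here are only placeholders for your own):
indexer --config /etc/sphinx/sphinx.conf --print-queries my_index
Comparing the timestamp of the last printed query with the moment the hang begins should tell you whether the time is being spent in the database or inside the indexer itself.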
Related
I have a CDP environment running Hive. For some reason some queries run quite quickly while others take more than 5 minutes, even a plain select current_timestamp or similar.
I see that my cluster usage is quite low, so I don't understand why this is happening.
How can I make full use of my cluster? I read some posts on the Cloudera website, but they are not helping much; after all the tuning, everything stays the same.
Something to note is that I see the following message in the Hive logs:
"Get Query Coordinator (AM) 350"
After that I see that the time to execute the query itself was quite low.
I am using Tez. Any idea what I can look at?
Besides taking care of overall tuning (https://community.cloudera.com/t5/Community-Articles/Demystify-Apache-Tez-Memory-Tuning-Step-by-Step/ta-p/245279),
please check my answer to this same issue here: Enable hive parallel processing.
That post explains what you need to do to enable parallel processing.
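As a rough sketch of that approach (the JDBC URL is an assumption; hive.exec.parallel and hive.exec.parallel.thread.number are standard Hive settings, and the values shown are only illustrative):
beeline -u jdbc:hive2://localhost:10000 -e "set hive.exec.parallel=true; set hive.exec.parallel.thread.number=8; select current_timestamp;"
With parallel execution enabled, independent stages of a query's plan can run at the same time instead of one after another.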
I am indexing more than 1,500,000 items from MySQL with Apache Solr 5.4.1. When I open the Solr Admin page every day, I find there are more than 5,000 deleted items that should be optimized away, so I click Optimize and everything is fine again.
Is there a simple URL I can put in the crontab to automate optimization of the indexes in Apache Solr 5.4.1?
Thank you.
Example from UpdateXMLMessages:
This example will cause the index to be optimized down to at most 10 segments, but won't wait around until it's done (waitFlush=false):
curl 'http://localhost:8983/solr/core/update?optimize=true&maxSegments=10&waitFlush=false'
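If you still want to automate it, a hedged crontab sketch (the core name, host, and the 03:00 schedule are placeholders) would be:
0 3 * * * curl -s 'http://localhost:8983/solr/core/update?optimize=true&maxSegments=10&waitFlush=false' > /dev/null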
.. but in general, you don't have to optimize very often. It might not be worth the time spent on the actual optimize and the extra disk activity. If you're re-indexing the whole index each time anyway, indexing to a fresh collection and then swapping the collections afterwards is also a possible solution.
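If you are running standalone cores rather than SolrCloud, that swap can be done with the CoreAdmin SWAP action; a sketch, with hypothetical core names live and rebuild:
curl 'http://localhost:8983/solr/admin/cores?action=SWAP&core=live&other=rebuild'
You index into the rebuild core and then swap it with the live one once indexing has finished.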
I have a search engine application that constantly parses feeds and indexes the results in ES (version 1.5.2).
I have an average of 3.5 million documents indexed.
The deleted-documents percentage is sometimes around 40%, and I am getting some request timeouts while indexing (bulk).
Which optimize policy should I follow?
Should I stop indexing once or several times a day to optimize the index, reduce the percentage of deleted documents, and merge the segments?
Does the optimization process affect queries?
I would like to know the best solution for this use case.
I am using a custom _id; I know it has performance issues, but sadly changing it is not an option.
Thanks in advance
If some of your bulk index requests are timing out, that is an indication that you need to lower your indexing rate. Elasticsearch gurus advise against using the optimize API: segment merges happen in the background and take care of getting rid of deleted documents automatically. Also, never use the optimize API when you have a high indexing rate; that will only cause more indexing requests to time out. And yes, optimize can also negatively affect search performance, as it is a very resource-intensive operation.
In a nutshell, just reduce your indexing rate. That should solve most of the problems you have mentioned here: requests will stop timing out, and the deleted-document percentage may also come down.
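If you want to keep an eye on how the background merges are doing, the cat APIs already report deleted-document counts in ES 1.x; a sketch (host and index name are assumptions):
curl 'http://localhost:9200/_cat/indices/my_index?v'
curl 'http://localhost:9200/_cat/segments/my_index?v'
The first shows docs.count versus docs.deleted per index, the second the same detail per segment.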
My Rails application keeps hitting the disk I/O rate threshold set by my VPS at Linode. It's set at 3000 (I raised it from 2000), and every hour or so I get a notification that it has reached 4000-5000+.
What methods can I use to minimize the disk I/O rate? I mostly use Sphinx (the Thinking Sphinx plugin) and latitude/longitude distance search.
What methods should I avoid?
I'm using Rails 2.3.11 and MySQL.
Thanks.
Did you check whether your server is swapping itself to death? What does top say?
Your Linode may have limited RAM, and it could very well be swapping like crazy to keep things running.
If you see red in the IO graph, that is swapping activity! You need to upgrade your Linode to more RAM, or limit the number and size of the processes that are running. You should also add approximately 2x the RAM size as swap space (a swap partition).
http://tinypic.com/view.php?pic=2s0b8t2&s=7
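A quick way to confirm whether swapping is the problem, using standard Linux tools rather than anything Linode-specific:
free -m
vmstat 1 5
ps aux --sort=-%mem | head
free shows how much swap is in use, the si/so columns of vmstat show pages swapped in and out per second, and the ps line lists the processes using the most RAM.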
Your question is too vague to answer precisely, but this is generally a sign of one of a few things:
Your data set is too large because of historical data that you could prune. Delete what is no longer relevant.
Your tables are not indexed properly and you are hitting a lot of table scans. Check each of your slow queries with EXPLAIN (see the sketch after this list).
Your data structure is not optimized for the way you are using it, and you are doing too many joins. Some tactical de-normalization would help here. Make sure all your JOIN queries are strictly necessary.
You are retrieving more data than is required to service the request. It is, sadly, all too common that people load enormous TEXT or BLOB columns from a user table when displaying only a list of user names. Load only what you need.
You're being hit by some kind of automated scraper or spider robot that's systematically downloading your entire site, page by page. You may want to alter your robots.txt if this is an issue, or start blocking troublesome IPs.
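For the indexing point above, a minimal check from the shell; the table and columns are only a hypothetical stand-in for one of your own slow queries:
mysql -u myuser -p mydb -e "EXPLAIN SELECT * FROM places WHERE latitude BETWEEN 10.0 AND 10.5 AND longitude BETWEEN 20.0 AND 20.5\G"
A type of ALL in the output means the query is doing a full table scan and probably needs an index.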
Is it going high and staying high for a long time, or is it just spiking temporarily?
There aren't going to be specific methods to avoid (other than not writing to disk).
You could try using a profiler in production, like New Relic, to get more insight into your performance. A profiler will highlight the actions that are taking a long time, and when you examine the specific algorithm used in each of those actions, you might discover what is inefficient about them.
I know there have been some semi-similar questions, but in this case I am building an index that is offline until the build is complete. I am building two cores from scratch: one has about 300k records with a lot of citation information and large blocks of full text (this is the document index), and the other has about 6.6 million records with full text (this is the page index).
Given that this index is being built offline, the only real performance issue is build speed. No one should be querying this data.
Auto-commit would apparently fire if I stopped adding items for 50 seconds, which I never do: I am adding ten at a time, and they are added every couple of seconds.
So, should I commit more often? I feel like the longer this runs, the slower it gets, at least in my test case of 6k documents to index.
With no one searching this index, how often would you suggest I commit?
I should say I am using Solr 3.1 and SolrNet.
Although it's the commits that are taking time for you, you might want to consider tweaks other than commit frequency.
Is the indexing core also the one that does searching, or is it replicated somewhere else after indexing concludes? If the latter is the case, then turning off caches might have a very noticeable impact on performance (Solr rebuilds caches every time you commit).
You could also look into using the autoCommit or commitWithin features of Solr.
commitWithin is specified as part of the document add command. I believe this is supported with SolrNet; please see the Using the commitWithin attribute thread for more details.
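For reference, a raw-XML sketch of the same add-with-commitWithin against the update handler (the single-core URL, document id, and 10-second window are placeholders):
curl 'http://localhost:8983/solr/update' -H 'Content-Type: text/xml' --data-binary '<add commitWithin="10000"><doc><field name="id">doc-1</field></doc></add>'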
autoCommit is a Solr configuration value added to the update handler section.