Ayende, my mails are not being delivered to your mailing list, so I'll ask here; maybe someone else has a solution to my problem.
I'm testing RavenDB again and again :) and I think I've found a little bug. On your documentation page I read:
Raven/MaxNumberOfParallelIndexTasks: The maximum number of indexing tasks allowed to run in parallel. Default: the number of processors in the current machine.
But despite that, RavenDB seems to use only one core for indexing tasks, and indexing a large dataset takes too long on a single core. I tried overriding that configuration and assigned 3 to MaxNumberOfParallelIndexTasks, but it still uses only a single core.
Take a look at this screenshot: http://dl.dropbox.com/u/3055964/Untitled.gif
CPU utilization is at only 25%, and I have a quad-core processor. I didn't modify the affinity mask.
Am I doing something wrong, or have I just hit a bug?
Thanks
Davita,
I fixed the mailing list issue.
The problem you are seeing is most likely because only one index has work to do. The work for a single index is always done on a single CPU; we spread indexing work across multiple CPUs at the index boundary.
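For reference, here is a minimal sketch of how that setting is usually raised, assuming the server's appSettings mechanism (e.g. Raven.Server.exe.config or web.config); the value 3 simply mirrors the question, and the setting only helps once several indexes have pending work:

```xml
<appSettings>
  <!-- allow up to 3 indexes to be processed in parallel;
       each individual index still runs on a single CPU -->
  <add key="Raven/MaxNumberOfParallelIndexTasks" value="3" />
</appSettings>
```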
I have a large MILP that I build with cvxpy and want to solve with GUROBI. When I use the solve() function of cvxpy, it takes a really, really long time to set up and does not start solving for hours. While doing that, only one core of my cluster is used, and it is used at 100%. I would like to use multiple cores to build the model so that building it does not take so long. Running grbprobe also shows that Gurobi knows about the other cores, and it does use multiple cores for solving the problem.
I have tried running with different flags, i.e. turning presolve off and on, or setting the number of Threads to be used (this did not seem to have any effect, even on the solving).
I have also reduced the number of constraints in the problem, and it started solving much faster, which means this is definitely not a problem with the model itself.
The problem in its normal state has about 2200 constraints; I reduced it to 150 and it took only a couple of seconds until it started to search for a solution.
The problem is that I don't see anything, since it takes so long to get to the "set username parameters" output, and I don't get any information on what the computer does in the meantime.
Is there a way to tell GUROBI or CVXPY that it can use more CPUs for the build-up?
Is there another way to solve this problem?
Sorry. The first part of the solve (cvxpy model generation, setup, presolving, scaling, solving the root relaxation, preprocessing) is almost completely serial. The parallel part only starts when the solver really begins working on the branch-and-bound tree. For many problems the parallel part is by far the most expensive, but not for all.
This is not only the case for Gurobi. Other high-end solvers have the same behavior.
There are options to do less presolving and preprocessing. That may get you to the B&B phase earlier. However, it is usually better not to touch these options.
Running things with verbose=True may give you more information. If you have more detailed questions, you may want to share the log.
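As an illustration only (not from the original thread), here is a minimal sketch of passing verbose=True and Gurobi parameters through cvxpy; the toy model and the specific parameter values are assumptions, not recommendations:

```python
import cvxpy as cp

# Toy MILP stand-in for the real model (placeholder only).
x = cp.Variable(10, integer=True)
prob = cp.Problem(cp.Minimize(cp.sum(x)), [x >= 0, cp.sum(x) >= 5])

# verbose=True prints the full Gurobi log so you can see where the time goes.
# Extra keyword arguments are forwarded to Gurobi as solver parameters
# (e.g. Threads, Presolve); they affect the solve, not cvxpy's model build-up.
prob.solve(solver=cp.GUROBI, verbose=True, Threads=4, Presolve=1)
print(prob.status, prob.value)
```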
To all,
OptaPlanner version: 7.48
For a while now, I have no longer been able to resume solving.
The process is:
thread 1: solver.solve();
thread 2: solver.terminateEarly();
thread 2: solver.solve(solver.getBestSolution());
The shorter the time between solve() and terminateEarly(), the less likely the resume is to work correctly.
When it does not work, the symptoms are: after the Construction Heuristic finishes, only a few new best solutions are found, and then the solver stops finding new best solutions forever, even though it is still calculating at a significant CPU rate.
The problem is similar when solver.getBestSolution() is serialized and reloaded later.
Any suggestion?
Thanks.
Regards.
JLL
Based on the contents of the question, the title is wrong: OptaPlanner resumes just fine, it just cannot find any better solutions. There are two reasons why that could be the case:
There are no more better solutions to be found. The bigger your data set becomes, the less likely this is.
There are better solutions available, but OptaPlanner cannot get to them, as it is stuck in a local optimum. This is a common problem.
Escaping local optima is usually accomplished by a combination of the following:
Eliminating score traps from your constraints.
Increasing variety in move selection. See the available generic moves, or consider implementing a custom move for any intricacies of your particular problem. (A config sketch follows this list.)
Iterative local search. We do not (yet) support that out of the box, but the general idea is that at a certain point, you ruin a part of your solution (perhaps by uninitializing it) and then recreate it (randomly or otherwise).
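For the move-variety point above, here is a hedged sketch of what a richer <localSearch> section of the solver config XML could look like; it assumes a typical model where the generic change/swap/pillar selectors apply, and the Late Acceptance size of 400 is an arbitrary illustrative value, not a recommendation for your problem:

```xml
<localSearch>
  <unionMoveSelector>
    <changeMoveSelector/>
    <swapMoveSelector/>
    <!-- pillar moves add extra variety when several entities share the same value -->
    <pillarChangeMoveSelector/>
    <pillarSwapMoveSelector/>
  </unionMoveSelector>
  <acceptor>
    <!-- Late Acceptance is one of the acceptors designed to escape local optima -->
    <lateAcceptanceSize>400</lateAcceptanceSize>
  </acceptor>
  <forager>
    <acceptedCountLimit>1</acceptedCountLimit>
  </forager>
</localSearch>
```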
Finally, I wholeheartedly recommend upgrading to OptaPlanner 8. The upgrade is easy, and the 7.x stream has been in maintenance mode for a very long time now.
I'm beginning a new project using CakePHP. I like the "auto-magic" features, and I think it's a good fit for the project. I'm wondering about the potential to scale CakePHP to several million IP hits a day, and hundreds of thousands of database writes and reads a day. Also about 50,000 to 500,000 users, often with 3,000 using the site concurrently. I'm making use of heavy stored procedures to offset this, and I'm accessing several servers, including a load balancer.
I'm wondering about the computational cost of some of the auto-magic and how well Cake copes with session requests making many DB hits. Has anyone had success with Cake running on a single server-array setup with this level of traffic? I'm not using the cloud or a distributed database (yet). I'm really worried about potential bottlenecks in this framework. I'm interested in advice from anyone who has worked with Cake in production. I've researched, but I would love a second opinion. Thank you for your time.
This is not a problem, but optimization is up to you.
There are different cache methods you can implement: memcache, redis, full-page caching... All of that is already supported by Cake. What you cache, and where, is up to you.
For searching, you could try Elasticsearch to speed things up.
There are before-dispatch filters to bypass controller instantiation (you might want to do that in special cases; check the asset filter, for example).
Use nginx instead of Apache.
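(As an illustration only, a minimal sketch of an nginx + PHP-FPM server block for a Cake app; the host, root path, socket path, and rewrite are assumptions based on a typical app/webroot layout, not settings from this thread:)

```nginx
server {
    listen 80;
    server_name example.com;                     # hypothetical host
    root /var/www/app/webroot;                   # assumed CakePHP webroot
    index index.php;

    location / {
        # send everything that is not a real file to Cake's front controller
        try_files $uri $uri/ /index.php?$args;
    }

    location ~ \.php$ {
        include fastcgi_params;
        fastcgi_pass unix:/var/run/php-fpm.sock; # assumed PHP-FPM socket
        fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
    }
}
```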
Also, I would not start by over-optimizing and over-thinking this before any code is written. Start well and think about caching, but when you come across bottlenecks, analyse and fix them. Otherwise you'll waste a lot of time on over-optimization before you have even written anything that works.
Cake itself is very fast. Just to prove the bullshit factor of the fancy benchmarks some frameworks publish, we did one using a dispatcher filter to "optimize" it and even beat Yii, which seems pretty eager to show how fast it is. But benchmarks are pointless, especially in a huge project where so many human-made failures can be introduced.
My Rails application keeps hitting the disk I/O rate threshold set by my VPS at Linode. It's set at 3000 (I upped it from 2000), and every hour or so I get a notification that it reaches 4000-5000+.
What are the methods that I can use to minimize the disk IO rate? I mostly use Sphinx (Thinking Sphinx plugin) and Latitude and Longitude distance search.
What are the methods to avoid?
I'm using Rails 2.3.11 and MySQL.
Thanks.
Did you check whether your server is swapping itself to death? What does "top" say?
Your Linode may have limited RAM, and it could very well be swapping like crazy to keep things running.
If you see red in the IO graph, that is swapping activity! You need to upgrade your Linode to more RAM, or limit the number and size of the processes that are running. You should also add approximately 2x the RAM size as swap space (a swap partition).
http://tinypic.com/view.php?pic=2s0b8t2&s=7
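(For what it's worth, a few standard commands for checking whether the box is swapping; nothing here is specific to this thread:)

```sh
free -m        # how much RAM and swap are currently in use
swapon -s      # which swap devices exist and how full they are
vmstat 1 5     # watch the si/so columns: non-zero values mean active swapping
```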
Your question is too vague to answer concisely, but this is generally a sign of one of a few things:
Your data set is too large because of historical data that you could prune. Delete what is no longer relevant.
Your tables are not indexed properly and you are hitting a lot of table scans. Check each of your slow queries with EXPLAIN (a sketch follows this list).
Your data structure is not optimized for the way you are using it, and you are doing too many joins. Some tactical de-normalization would help here. Make sure all your JOIN queries are strictly necessary.
You are retrieving more data than is required to service the request. It is, sadly, all too common that people load enormous TEXT or BLOB columns from a user table when displaying only a list of user names. Load only what you need.
You're being hit by some kind of automated scraper or spider robot that's systematically downloading your entire site, page by page. You may want to alter your robots.txt if this is an issue, or start blocking troublesome IPs.
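(For the indexing point above, a minimal sketch; the table and column names are hypothetical:)

```sql
-- See how MySQL executes a slow query; "type: ALL" means a full table scan.
EXPLAIN SELECT id, name FROM users WHERE email = 'someone@example.com';

-- If the filtered column has no index, add one.
CREATE INDEX idx_users_email ON users (email);
```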
Is it going high and staying high for a long time, or is it just spiking temporarily?
There aren't going to be specific methods to avoid (other than not writing to disk).
You could try using a profiler in production, like New Relic, to get more insight into your performance. A profiler will highlight the actions that are taking a long time, and when you examine the specific algorithm you're using in such an action, you might discover what's inefficient about it.
I know there have been some semi-similar questions, but in this case I am building an index that is offline until the build is complete. I am building two cores from scratch: one has about 300k records with a lot of citation information and large blocks of full text (this is the document index), and the other has about 6.6 million records, also with full text (this is the page index).
Given that this index is being built offline, the only real performance issue is the speed of building. No one should be querying this data.
The auto-commit would apparently fire if I stopped adding items for 50 seconds, which I don't do: I am adding ten at a time, every couple of seconds.
So, should I commit more often? I feel like the longer this runs, the slower it gets, at least in my test case of 6k documents to index.
With no one searching this index, how often would you suggest I commit?
I should say I am using Solr 3.1 and SolrNet.
Although it's the commits that are taking time for you, you might want to consider tweaking things other than the commit frequency.
Is it the indexing core that also does the searching, or is it replicated somewhere else after indexing concludes? If the latter is the case, then turning off caches might have a very noticeable impact on performance (Solr rebuilds caches every time you commit).
You could also look into using the autoCommit or commitWithin features of Solr.
commitWithin is specified as part of the document add command. I believe this is supported by SolrNet; please see the "Using the commitWithin attribute" thread for more details.
autoCommit is a Solr configuration value added to the update handler section.
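(For reference, a hedged sketch of what that part of solrconfig.xml might look like in Solr 3.1; the thresholds are illustrative values, not recommendations:)

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- commit automatically after 10,000 added docs or 60 seconds, whichever comes first -->
  <autoCommit>
    <maxDocs>10000</maxDocs>
    <maxTime>60000</maxTime>
  </autoCommit>
</updateHandler>
```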