CNTK: C# sample for parallel request processing and Clone() method performance

In the C# sample for parallel request processing
https://github.com/Microsoft/CNTK/blob/zhouwang/cseval-example-clone/Examples/Evaluation/CNTKLibraryCSEvalCPUOnlyExamples/CNTKLibraryCSEvalExamples.cs
how expensive is it to call rootFunc.Clone(ParameterCloningMethod.Share)?
We are currently calling it once per request our service receives, but it would be good to know whether that is something we should avoid.
Thanks in advance

Clone() with parameter sharing is not expensive in terms of memory overhead, since all cloned instances share the model's parameters, e.g. the weights. Regarding latency, it should also be quite low, as it only creates some in-memory network description data structures. Of course, if you use a pool to manage all cloned instances, you will get better performance.
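As a rough illustration, a pooled setup could look like the sketch below. The pool class, its size, and the borrow-on-empty fallback are illustrative, not part of the linked sample; only the Clone(ParameterCloningMethod.Share) call is taken from it.

```csharp
using System.Collections.Concurrent;
using CNTK;

// A minimal sketch of a clone pool: clone once up front, then borrow/return
// per request instead of calling Clone() for every request.
public class EvalInstancePool
{
    private readonly ConcurrentBag<Function> pool = new ConcurrentBag<Function>();
    private readonly Function rootFunc;

    public EvalInstancePool(Function rootFunc, int size)
    {
        this.rootFunc = rootFunc;
        for (int i = 0; i < size; i++)
        {
            // ParameterCloningMethod.Share: all clones reference the same
            // parameter tensors (weights); only the in-memory network
            // description is duplicated per clone.
            pool.Add(rootFunc.Clone(ParameterCloningMethod.Share));
        }
    }

    // Borrow a clone for one request; clone on demand if the pool
    // happens to be empty.
    public Function Borrow() =>
        pool.TryTake(out Function f) ? f : rootFunc.Clone(ParameterCloningMethod.Share);

    public void Return(Function f) => pool.Add(f);
}
```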

Related

Parallel DataReaders on the same topic

I have a question that I'm curious about, but can't find an exact answer to. Suppose we have a DataWriter and a DataReader, and as a topic we have an airplane's flight runtime information, such as coordinates. And suppose the DataReader has to run a heavy process on this information. When there are more than 100 flights at the same time, parallel operation of multiple DataReaders seems like a suitable solution. But I think multiplexing DataReaders won't make sense, because they will all process the same message (topic) in parallel in the same way. An alternative solution is to make the DataReader multithreaded, but then there will still be only one DataReader, and we will have a constraint.
What kind of approach can be used in DDS to create multiple DataReaders and process data in parallel to distribute the workload?
OpenSpliceDDS implements an extension to the DDS specification to do exactly this, namely its DDS_ShareQoSPolicy on a subscriber/reader, which allows sharing of entities by multiple processes or threads. When that policy is enabled, OpenSplice will try to look up an existing entity that matches the name supplied in the ShareQoSPolicy. A new entity will only be created if a shared entity registered under the specified name doesn't exist yet.
Shared Readers can be useful for implementing algorithms like the worker pattern, where a single shared reader can contain samples representing different tasks that may be processed in parallel by separate processes. In this algorithm each process consumes the (samples related to the) task it is going to perform (i.e. it takes the samples representing that task), thus preventing other processes from consuming, and therefore performing, the same task.
NOTE: Entities can only be shared between co-located processes, with OpenSplice running in federated mode, where shared memory is exploited to share the data between the readers of the set of federated processes (so this doesn't work across machine boundaries).
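DDS specifics aside, the take-based worker pattern described above is the same idea as several workers draining one shared queue: taking an item removes it, so no two workers repeat the same task. A rough C# analogy (not OpenSplice API; the flight strings and worker count are placeholders):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

var tasks = new BlockingCollection<string>();

// Each worker takes (removes) a sample from the shared collection, so no two
// workers ever process the same one, analogous to take() on a shared reader.
var workers = new Task[4];
for (int i = 0; i < workers.Length; i++)
    workers[i] = Task.Run(() =>
    {
        foreach (var flight in tasks.GetConsumingEnumerable())
            Console.WriteLine($"processing {flight}"); // stand-in for the heavy work
    });

for (int f = 0; f < 100; f++)
    tasks.Add($"flight-{f}"); // stand-in for the publishing side
tasks.CompleteAdding();
Task.WaitAll(workers);
```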

Solution Cloning Performance Tips

We are currently trying to improve the performance of a planning problem we've implemented in OptaPlanner. Our model has ~45,000 chained variables and after profiling the application it seems like the main bottleneck is around the cloning. Approximately 90% of the CPU run-time is consumed by the FieldAccessingSolutionCloner method calls.
We've already tried to make our object model more lightweight by reducing the number of Maps and Sets within the PlanningEntities and changing fields to primitives where possible, but from your own OptaPlanner experience, do you have any advice on how to speed up cloning performance?
Have you tried writing a custom cloner? See docs.
The default one needs to rely on reflection, so it's slower.
Also, the structure of your domain model influences how much you need to clone (regardless if you go custom or not):
If you delete your Solution and Planning Entities classes, do your other domain classes still compile?
If yes, then the clone is minimal. If no, it's not.

Performance implications of changing NSManagedObject instances that I never intend to save

I have a CoreData-based application that retrieves data about past events from an SQLite persistence store. Once I have the past events my application does some statistical analysis to predict future events based on the data it has about past events. Once my application has made a prediction about future events I want to run another algorithm that does some evaluation of that prediction. I'm expecting to do a lot of these evaluations, so performance optimization for each evaluation is likely to be critical.
Now, all of the classes I need to represent my future event predictions exist in my data model, and I have NSManagedObject subclasses for most of the important entities. The easiest way for me to implement my algorithms is to "fill in" the results for future events based on the prediction, and then run my evaluation using NSManagedObject instances for both the past events and the predictions for future events. However, I typically don't want to save these future event predictions in my persistent store: Once I have performed my evaluation of the prediction I want to throw away the predictions and just keep the evaluation results. I can do this pretty easily, I think, by just sending the rollback: message to my managed object context once my evaluation is complete.
That will all work fine, and from a coding perspective it seems like it will be quite easy to implement. However, I am wondering if I should expect performance concerns making such heavy use of managed objects when I have no intention of ever saving the changes I'm making. Given that performance is likely to be a factor, does using NSManagedObject instances for this make sense? Surely all the things it's doing to keep track of changes and support things like undo and complex entity relationships come with some amount of overhead. Should I be concerned about this overhead?
I could of course create non-NSManagedObject classes that implement an optimized version of my model classes for use when making predictions and evaluating them. That would involve a lot of additional work, including the work necessary to copy data back and forth between the NSManagedObject instances for past events and the optimized class instances for future events: I'd rather not create that code if it is not needed.
Surely all the things it's doing to keep track of changes and support things like undo and complex entity relationships come with some amount of overhead.
Core Data doesn't have the overhead people expect, owing to its optimizations. In general, using managed objects in memory is as fast as or faster than any custom objects and management code you would write yourself.
Should I be concerned about this overhead?
Can't really say without implementation details but most likely not. You can hand tweak Core Data for specific circumstances to get better performance.
The best approach is always to start with the simplest solution and move to a more complex one only when testing reveals that the simple solution does not perform well.
Premature optimization is the root of all evil.

Improve MPI program

I wrote an MPI program that seems to run OK, but I wonder about performance. The master thread needs to call MPI_Send 10 or more times, and the worker receives data 10 or more times and sends it on. I wonder whether this incurs a performance penalty, whether I could transfer everything in a single struct, and what other techniques I could benefit from.
Another general question: once an MPI program more or less works, what are the best optimization techniques?
It's usually the case that sending 1 large message is faster than sending 10 small messages. The time cost of sending a message is well modelled by a latency (how long it would take to send an empty message, which is non-zero because of the overhead of function calls, network latency, etc.) plus a bandwidth term (how much longer it takes to send each extra byte once the communication has started). By bundling messages up into one message, you only incur the latency cost once, and this is often a win (although it's always possible to come up with cases where it isn't). The best way to know for any particular code is simply to try. Note that MPI datatypes give you very powerful ways to describe the layout of your data in memory, so that you can take it almost directly from memory to the network without having to do an intermediate copy into some buffer (the so-called "marshalling" of the data).
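As a worked comparison under that model, write the cost of one n-byte message with latency term α and per-byte cost β (both illustrative):

```latex
T(n) = \alpha + \beta n
% Ten separate n-byte messages versus one bundled 10n-byte message:
T_{\text{ten sends}} = 10\alpha + 10\beta n
\qquad
T_{\text{one send}} = \alpha + 10\beta n
% Bundling moves the same bytes but pays the startup latency once,
% saving 9\alpha.
```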
As to more general optimization questions about MPI -- without knowing more, all we can do is give you advice which is so general as to not be very useful. Minimize the amount of communications which need to be done; wherever possible, use built-in MPI tools (collectives, etc) rather than implementing your own.
One way to fully understand the performance of your MPI application is to run it within the SimGrid platform simulator. The tooling and models provided are sufficient to get realistic timing predictions for mid-sized applications (a few tens of thousands of lines of C or Fortran), and it can be combined with suitable visualization tools that help you fully understand what is going on in your application and the actual performance trade-offs you have to consider.
For a demo, please refer to this screencast: https://www.youtube.com/watch?v=NOxFOR_t3xI

Does reading a large file consume CPU?

Suppose I want to perform the following operations on my 2-core machine:
Read a very large file
Compute
Does the file reading operation need to consume 1 core? Previously I just created 2 threads, one to read the file and one to compute. Should I create an additional thread to do computation?
Thanks.
Edit
Thanks guys, yes, we should always consider whether the file I/O blocks the computing. Now let's assume the file I/O never blocks the computing; you can think of the computation as not depending on the file's data, since we just read the file in for future processing. Given that we have 2 cores, need to read in a file, and need to compute, is the best solution to create 3 threads, 1 for file reading and 2 for computing, since, as most of you have already pointed out, file reading consumes very little CPU?
It depends on how your hardware is configured. Normally, reading is not CPU-intensive, thanks to DMA. It may be very expensive though, if it initiates swap-out of other applications. But there is more to it.
Don't read a huge file at once if you can
If your file is really big and you don't need to read the whole of it at once, you should use mmap or sequential processing. Try to consume it in chunks if possible.
For example, to sum all the values in a huge file, you don't need to load the file into memory. You can process it in small chunks, accumulating the sum. Memory is an expensive resource in most situations.
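A minimal C# sketch of that running sum (the file name and one-value-per-line layout are assumptions):

```csharp
using System;
using System.IO;

// File.ReadLines streams the file lazily, so memory use stays constant
// no matter how large the file is.
long sum = 0;
foreach (var line in File.ReadLines("huge-numbers.txt"))
    sum += long.Parse(line);
Console.WriteLine(sum);
```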
Reading is sequential
Does the file reading operation need to consume 1 core?
Yes, I think most low-level read operations are implemented sequentially (consume 1 core).
You can avoid blocking on the read operation if you use asynchronous I/O, but it is just a variation of the same "read by small chunks" technique. You can launch several small asynchronous read operations at once, but you always have to check whether an operation has finished before you use the result.
See also this Stack Overflow answer to a related question.
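A minimal C# sketch of that "read by small chunks" idea with asynchronous I/O (the file name and chunk size are placeholders):

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

// The await yields the thread while the OS fills the buffer; each read's
// result is checked before the data is used, as described above.
using var fs = new FileStream("huge.bin", FileMode.Open, FileAccess.Read,
                              FileShare.Read, bufferSize: 64 * 1024,
                              useAsync: true);
var buffer = new byte[64 * 1024];
int n;
while ((n = await fs.ReadAsync(buffer, 0, buffer.Length)) > 0)
{
    Console.WriteLine($"got a chunk of {n} bytes"); // consume buffer[0..n) here
}
```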
Reading and computing in parallel
Previously I just created 2 threads, one to read the file and one to compute. Should I create an additional thread to do computation?
It depends. If you need all the data before computations can start, then there is no reason to start the computation in parallel; it will effectively have to wait until reading is done.
If you can start computing even with partial data, likely you don't need to read the whole file at once. And it is usually much better not to do so with huge files.
What is your bottleneck — computation or IO?
Finally, you should know whether your task is computation-bound or input-output-bound. If it is limited by the performance of the input-output subsystem, there is little benefit in parallelizing the computation. If the computation is very CPU-intensive and reading time is negligible, you can benefit from parallelizing the computation. Input-output is usually the bottleneck unless you are doing some number-crunching.
This is a good candidate for parallelization, because you have two types of operations here - disk I/O (for reading the file), and CPU load (for your computations). So the first step would be to write your application such that the file I/O wasn't blocking the computation. You could do this by reading a little bit at a time from the file and handing it off to the compute thread.
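A minimal C# sketch of that structure, with the chunk size, queue bound, file name, and compute step all placeholders: one task reads, and a bounded queue hands chunks to a compute task, so I/O and CPU work overlap.

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

// The bounded queue makes the reader block if it gets too far ahead of the
// computation, keeping memory use in check.
var chunks = new BlockingCollection<byte[]>(boundedCapacity: 8);

var reader = Task.Run(() =>
{
    using var fs = File.OpenRead("huge.bin");
    var buffer = new byte[1 << 20]; // 1 MiB per chunk
    int n;
    while ((n = fs.Read(buffer, 0, buffer.Length)) > 0)
        chunks.Add(buffer[..n]);    // copy out just the filled slice
    chunks.CompleteAdding();
});

var compute = Task.Run(() =>
{
    long total = 0;
    foreach (var chunk in chunks.GetConsumingEnumerable())
        total += chunk.Sum(b => (long)b); // stand-in for the real computation
    Console.WriteLine(total);
});

Task.WaitAll(reader, compute);
```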
But now you're saying you have two cores that you want to utilize. Your second thought about parallelizing the CPU-intensive part is correct, because we can only parallelize compute tasks if we have more than one processor to use. But, it might be the case that the blocking part of your application is still the file I/O - that depends on a lot of factors, and the only way to tell what level of parallelization is appropriate is to benchmark.
The obligatory SO caveat: multithreading is hard and error-prone, and it's better to have correct code than fast code if you can pick only one. But I don't advocate against threads, as you may find others on this site do.
I would think this depends on the computation you are performing. If you are doing very heavy computations, then I would suggest threading the application. Reading a file demands very little from your CPU, and because of this, the overhead created by threading might slow the application down.
Another thing to consider is whether you need to load the entire file before you can compute. If so, there is no point in threading at all, as you will have to complete one action before you can perform the other.