Parallel Processing in MATLAB with more than 12 cores - optimization

I created a function to compute the correct number of clusters k for a dataset using the Gap Statistic algorithm. At one point this algorithm requires computing the dispersion (i.e., the sum of the distances between every point and its centroid) for, let's say, 100 different datasets (called "test" or "reference" datasets). Since these operations are independent, I want to parallelize them across all the cores. I have the MathWorks Parallel Computing Toolbox, but I am not sure how to use it (problem 1; I can probably work this out from past threads). My real problem, however, is another one: this toolbox seems to allow the use of only 12 cores (problem 2). My machine has 64 cores and I need to use all of them. Do you know how to parallelize a process across more than 12 cores?
For reference, this is the bit of code that should run in parallel:
% This loop is repeated n_tests times, where n_tests is the
% number of reference datasets we want to use
for id_test = 2:n_tests+1
    test_data = generate_test_data(data);
    % Calculate the dispersion(s) for the generated dataset
    dispersions(id_test, 1:max_k) = zeros(1, max_k);
    % We calculate the dispersion for the id_test-th reference dataset
    for id_k = 1:max_k
        dispersions(id_test, id_k) = calculate_dispersion(test_data, id_k);
    end
end

Please note that in R2014a the limit on the number of local workers was removed. See the release notes.

The number of local workers available with Parallel Computing Toolbox is license dependent. When introduced, the limit was 4; this changed to 8 in R2009a and to 12 in R2011b.
If you want to use 16 workers, you will need a 16-node MDCS licence, and you'll also need to set up some sort of scheduler to manage them. There are detailed instructions about how to do this here: http://www.mathworks.de/support/product/DM/installation/ver_current/. Once you've done that, yes, you'll be able to do "matlabpool open 16".
EDIT: As of MATLAB R2014a there is no longer a limit on the number of local workers for the Parallel Computing Toolbox. That is, if you are using an up-to-date version of MATLAB, you will not encounter the problem described by the OP.
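Once enough local workers are available, the loop from the question can be handed to parfor. A minimal sketch, assuming R2014a or later (parpool instead of matlabpool) and that the local cluster profile's NumWorkers has been raised to 64; generate_test_data and calculate_dispersion are the OP's own functions:

parpool('local', 64);                      % request 64 local workers

dispersions = zeros(n_tests + 1, max_k);   % preallocate outside the loop
parfor id_test = 2:n_tests + 1
    test_data = generate_test_data(data);
    row = zeros(1, max_k);                 % build the row locally ...
    for id_k = 1:max_k
        row(id_k) = calculate_dispersion(test_data, id_k);
    end
    dispersions(id_test, :) = row;         % ... so this stays a sliced assignment
end

delete(gcp('nocreate'));                   % shut the pool down when finished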

The fact that MATLAB places this restriction on its Parallel Computing Toolbox often makes it not worth the money and effort.
One way around it is to combine the MATLAB Compiler with virtual machines, using either VMware or VirtualBox:
1. Compile the code required to run your tests (see the mcc sketch below).
2. Load your compiled code with the MCR (MATLAB Compiler Runtime) onto a VM template.
3. Create multiple copies of the VM template and let each copy run the required calculations for some of the datasets.
4. Gather the results from all the VMs.
This method is time consuming and only worth it if it saves more time than porting the code would, and if the code is already highly optimised.
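For the compile step, a one-line sketch (the entry-point file name run_dispersions.m is hypothetical):

mcc -m run_dispersions.m

This produces a standalone executable that only needs the MCR on the target VMs.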

I had the same problem on a 32-core machine with 6 datasets. I overcame it by creating a shell script that started MATLAB six times, one instance per dataset. I could do this because the computations weren't dependent on each other. From what I understand, you could use a similar approach: start around 6 instances, each handling around 16 of your datasets. It depends on how much RAM you have and how much each instance consumes.
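A rough sketch of such a launcher script (the wrapper function run_chunk and the chunk size are hypothetical; adjust to however you split the datasets):

#!/bin/bash
# Launch one MATLAB instance per chunk of datasets, in the background.
for chunk_id in $(seq 0 5); do
    matlab -nodisplay -nosplash -r "run_chunk(${chunk_id}); exit" \
        > "log_${chunk_id}.txt" 2>&1 &
done
wait   # block until all six instances have finished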

Related

OpenCL (AMD GCN) global memory access pattern for vectorized data: strided vs. contiguous

I'm trying to improve the performance of an OpenCL kernel and want to clarify how memory transactions work and which memory access pattern is really better (and why).
The kernel is fed with vectors of 8 integers which are defined as an array, int v[8]; that means the entire vector must be loaded into GPRs before doing any computation. So I believe the bottleneck of this code is the initial data load.
First, I consider some theory basics.
The target HW is a Radeon RX 480/580, which has a 256-bit GDDR5 memory bus on which a burst read/write transaction has 8-word granularity; hence one memory transaction reads 2048 bits, or 256 bytes. That, I believe, is what CL_DEVICE_MEM_BASE_ADDR_ALIGN refers to:
Alignment (bits) of base address: 2048.
Thus, my first question: what is the physical meaning of the 128-byte cache line? Does it keep the portion of data fetched by a single burst read but not actually requested? What happens with the rest if we request, say, 32 or 64 bytes, so that the leftover exceeds the cache line size? (I suppose it will simply be discarded; but then which part: head, tail...?)
Now back to my kernel. I think that the cache does not play a significant role in my case because one burst reads 64 integers, so one memory transaction can theoretically feed 8 work items at once, there is no extra data to read, and memory access is always coalesced.
But still, I can place my data with two different access patterns:
1) contiguous
a[i] = v[get_global_id(0) * get_global_size(0) + i];
(which is actually performed as)
*(int8*)a = *(int8*)v;
2) interleaved
a[i] = v[get_global_id(0) + i * get_global_size(0)];
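For concreteness, here are the two patterns as a self-contained kernel sketch (the kernel name is made up, and the contiguous index gid * 8 + i assumes each work item's 8 ints are packed back to back, which may differ slightly from the exact expression above):

__kernel void load_patterns(__global const int *v, __global int *out)
{
    const int gid = get_global_id(0);
    int a[8];

    // 1) contiguous: work item gid reads 8 consecutive ints
    for (int i = 0; i < 8; ++i)
        a[i] = v[gid * 8 + i];

    // 2) interleaved: consecutive work items read neighbouring ints,
    //    each work item strides by get_global_size(0)
    // for (int i = 0; i < 8; ++i)
    //     a[i] = v[gid + i * get_global_size(0)];

    int s = 0;
    for (int i = 0; i < 8; ++i)
        s += a[i];      // placeholder computation so the loads are not optimised away
    out[gid] = s;
}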
I expect that in my case the contiguous pattern would be faster because, as said above, one memory transaction can completely feed 8 work items with data. However, I do not know how the scheduler in the compute unit physically works: does it need all the data to be ready for all SIMD lanes, or is just the first portion for 4 parallel SIMD elements enough? Nevertheless, I suppose it is smart enough to fully supply at least one CU with data first, since CUs can execute command flows independently.
In the second case, by contrast, we need to perform 8 * global_size / 64 transactions to get a complete vector.
So, my second question: is my assumption right?
Now, the practice.
Actually, I split the entire task into two kernels because one part has less register pressure than the other and can therefore employ more work items. So first I played with the pattern in which the data is stored in the transition between the kernels (using vload8/vstore8 or casting to int8 gives the same result), and the result was somewhat strange: the kernel that reads the data contiguously runs about 10% faster (both in CodeXL and by OS time measurement), but the kernel that stores the data contiguously performs surprisingly slower. The overall time for the two kernels is then roughly the same. To my mind both should behave at least the same way, either slower or faster, so these inverse results seemed unexplainable.
And my third question is: can anyone explain such a result? Or am I maybe doing something wrong? (Or completely wrong?)
Well, this does not really answer all my questions, but some information found in the vastness of the internet puts things together in a clearer way, at least for me (unlike the above-mentioned AMD Optimization Guide, which seems unclear and sometimes confusing):
«the hardware performs some coalescing, but it's complicated...
memory accesses in a warp do not necessarily have to be contiguous, but it does matter how many 32 byte global memory segments (and 128 byte l1 cache segments) they fall into. the memory controller can load 1, 2 or 4 of those 32 byte segments in a single transaction, but that's read through the cache in 128 byte cache lines.
thus, if every lane in a warp loads a random word in a 128 byte range, then there is no penalty; it's 1 transaction and the reading is at full efficiency. but, if every lane in a warp loads 4 bytes with a stride of 128 bytes, then this is very bad: 4096 bytes are loaded but only 128 are used, resulting in ~3% efficiency.»
So, in my case it does not really matter how the data is read/stored as long as it is always contiguous, but the order in which the parts of the vectors are loaded may affect the subsequent command-flow (re)scheduling by the compiler.
I can also imagine that the newer GCN architecture can do cached/coalesced writes, which is why my results differ from those suggested by that Optimization Guide.
Have a look at chapter 2.1 of the AMD OpenCL Optimization Guide. It focuses mostly on older-generation cards, but the GCN architecture has not changed completely, so it should still apply to your device (Polaris).
In general, AMD cards have multiple memory controllers across which memory requests are distributed every clock cycle. If you, for example, access your values in column-major instead of row-major order, your performance will be worse because the requests are sent to the same memory controller. (By column-major I mean that a column of your matrix is accessed together by all the work items executed in the current clock cycle; this is what you refer to as coalesced vs. interleaved.) If you access one row of elements (i.e. coalesced, with all work items in a clock cycle accessing values within the same row), those requests should be distributed to different memory controllers rather than the same one.
Regarding alignment and cache line sizes, I'm wondering whether this really helps improve performance. If I were in your situation I would try to see whether I can optimize the algorithm itself, or whether I access the values often enough that it would make sense to copy them to local memory. But then again, it is hard to tell without any knowledge of what your kernels execute.
Best Regards,
Michael

out of memory sql execution

I have the following script:
SELECT
DEPT.F03 AS F03, DEPT.F238 AS F238, SDP.F04 AS F04, SDP.F1022 AS F1022,
CAT.F17 AS F17, CAT.F1023 AS F1023, CAT.F1946 AS F1946
FROM
DEPT_TAB DEPT
LEFT OUTER JOIN
SDP_TAB SDP ON SDP.F03 = DEPT.F03,
CAT_TAB CAT
ORDER BY
DEPT.F03
The tables are huge; when I execute the script in SQL Server directly it takes around 4 minutes, but when I run it from the third-party program (SMS LOC, based on Delphi) it gives me the error
<msg> out of memory</msg> <sql> the code </sql>
Is there any way I can lighten the script so that it executes? Or has anyone had the same problem and solved it somehow?
I remember once having had to resort to the ROBUST PLAN query hint on a query where the query optimizer somewhat lost track and tried to work it out in a way that the hardware couldn't handle.
=> http://technet.microsoft.com/en-us/library/ms181714.aspx
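Applied to a query like the one above, the hint goes in an OPTION clause at the very end of the statement:

SELECT
    DEPT.F03 AS F03, DEPT.F238 AS F238, SDP.F04 AS F04, SDP.F1022 AS F1022,
    CAT.F17 AS F17, CAT.F1023 AS F1023, CAT.F1946 AS F1946
FROM
    DEPT_TAB DEPT
LEFT OUTER JOIN
    SDP_TAB SDP ON SDP.F03 = DEPT.F03,
    CAT_TAB CAT
ORDER BY
    DEPT.F03
OPTION (ROBUST PLAN);  -- plan for the maximum potential row size, possibly at the cost of speed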
But I'm not sure I understand why it would work for one 'technology' and not another.
Then again, the error message might not be from SQL but rather from the 3rd-party program that gathers the output and does so in a 'less than ideal' way.
Consider adding paging to the user edit screen and the underlying data call. The point being that you don't need to see all the rows at one time, but they are available to the user upon request.
This will alleviate much of your performance problem.
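For example, with SQL Server 2012 or later, the paging can be pushed into the query itself with OFFSET/FETCH (the @PageNumber and @PageSize variables below are placeholders; DECLARE them or substitute literals):

SELECT
    DEPT.F03 AS F03, DEPT.F238 AS F238, SDP.F04 AS F04, SDP.F1022 AS F1022,
    CAT.F17 AS F17, CAT.F1023 AS F1023, CAT.F1946 AS F1946
FROM
    DEPT_TAB DEPT
LEFT OUTER JOIN
    SDP_TAB SDP ON SDP.F03 = DEPT.F03,
    CAT_TAB CAT
ORDER BY
    DEPT.F03
OFFSET @PageNumber * @PageSize ROWS   -- skip the pages already shown
FETCH NEXT @PageSize ROWS ONLY;       -- return a single page of rows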
I had a project where I had to add over 7 million individual lines of T-SQL code via batch (I couldn't figure out how to programmatically leverage the new SEQUENCE command). The problem was that there was a limited amount of memory available on my VM (I was already allocated the maximum amount of memory for this VM). Because of the large number of lines of T-SQL code, I first had to test how many lines the server could take before it crashed. For whatever reason, SQL Server (2012) doesn't release the memory it uses for large batch jobs such as mine (we're talking around 12 GB of memory), so I had to reboot the server every million or so lines. This is what you may have to do if resources are limited for your project.

DTLB miss number counting discrepancy

I am running Linux on a 32 nm Intel Westmere processor. I have a concern about seemingly conflicting DTLB miss numbers from the performance counters. I ran two experiments with a random memory access test program (single-threaded) as follows:
Experiment (1): I counted the DTLB misses using the following performance counter:
DTLB_MISSES.WALK_COMPLETED (Event 49H, Umask 02H)
Experiment (2): I counted the DTLB misses by summing up the two counter values below:
MEM_LOAD_RETIRED.DTLB_MISS (Event CBH, Umask 80H)
MEM_STORE_RETIRED.DTLB_MISS (Event 0CH, Umask 01H)
I expected the outputs of these experiments to be similar; however, I found that the number reported in experiment (1) is almost twice that of experiment (2). I am at a loss as to why this is the case.
Can somebody help shed some light on this apparent discrepancy?
That is expected since the first event counts the number of misses to all TLB levels caused by all possible reasons (load, store, pre-fetch), including memory accesses performed speculatively, while the other two events count only retired (that is, non-speculative) load and store operations, and only those among them that didn’t cause any fault.
Please refer to Chapter 19.6 of Volume 3 of Intel® 64 and IA-32 Architectures Software Developer’s Manual.
Thanks,
Stas

How to control the exact number of tests to run with caliper

I am trying to understand the proper way to control the number of runs: is it the trial or the rep? It is confusing: I run the benchmark with --trial 1 and receive the output:
0% Scenario{vm=java, trial=0, benchmark=SendPublisher} 1002183670.00 ns; σ=315184.24 ns # 3 trials
It looks like 3 trials were run. What are those trials? What are reps? I can control the reps value with the options --debug & --debug-reps, but what is the value when running without debug? I need to know exactly how many times my tested method was called.
Between Caliper 0.5 and 1.0 a bit of the terminology has changed, but this should explain it for both. Keep in mind that things were a little murky in 0.5, so most of the changes made for 1.0 were to make things clearer and more precise.
Caliper 0.5
A single invocation of Caliper is a run. Each run has some number of trials, which is just another iteration of all of the work performed in a run. Within each trial, Caliper executes some number of scenarios. A scenario is the combination of VM, benchmark, etc. The runtime of a scenario is measured by timing the execution of some number of reps, which is the number passed to your benchmark method at runtime. Multiple reps are, of course, necessary because it would be impossible to get precise measurements for a single invocation in a microbenchmark.
Caliper 1.0
Caliper 1.0 follows a pretty similar model. A single invocation of Caliper is still a run. Each run consists of some number of trials, but a trial is more precisely defined as an invocation of a scenario measured with an instrument.
A scenario is roughly defined as what you're measuring (host, VM, benchmark, parameters) and the instrument is what code performs the measurement and how it was configured. The idea being that if a perfectly repeatable benchmark were a function of the form f(x)=y, Caliper would be defined as instrument(scenario)=measurements.
Within the execution of the runtime instrument (it's similar for others), there is still the same notion of reps, which is the number of iterations passed to the benchmark code. You can't control the rep value directly since each instrument will perform its own calculation to determine what it should be.
At runtime, Caliper plans its execution by calculating some number of experiments, which is the combination of instrument, benchmark, VM and parameters. Each experiment is run --trials number of times and reported as an individual trial with its own ID.
How to use the reps parameter
Traditionally, the best way to use the reps parameter is to include a loop in your benchmark code that looks more or less like:
for (int i = 0; i < reps; i++) {…}
This is the most direct way to ensure that the number of reps scales linearly with the reported runtime. That is a necessary property because Caliper is attempting to infer the cost of a single, uniform operation based on the aggregate cost of many. If runtime isn't linear with the number of reps, the results will be invalid. This also implies that reps should not be passed directly to the benchmarked code.
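A minimal sketch of what such a benchmark can look like (Caliper 1.0 style; the class name and the doSend operation are hypothetical stand-ins for the code under test):

import com.google.caliper.Benchmark;

public class SendPublisherBenchmark {

    @Benchmark
    int sendMessages(int reps) {
        int dummy = 0;
        // The amount of work scales linearly with reps, as Caliper requires.
        for (int i = 0; i < reps; i++) {
            dummy += doSend();
        }
        // Return an accumulated value so the JIT cannot eliminate the loop.
        return dummy;
    }

    // Hypothetical stand-in for the operation being measured.
    private int doSend() {
        return 1;
    }
}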

optimizing a function to find global and local peaks with R

I have 6 parameters for which I know the maximum and minimum values. I have a complex function that takes the 6 parameters and returns a 7th value (say Y). I say complex because Y is not directly related to the 6 parameters; there are many embedded functions in between.
I would like to find the combination of the 6 parameters which returns the highest Y value. I first tried to calculate Y for every combination by constructing a hypercube, but I do not have enough memory on my computer. So I am looking for something like a Markov chain which moves through the delimited parameter space and is able to get past local peaks.
Also, when I give one combination of the 6 parameters, I would like to know the highest local Y value. I tried to write code with an iterative chain like a Markov chain, but I am not sure how to proceed when the chain reaches an edge of the parameter space. Obviously, some algorithms should already exist for this.
Question: does anybody know the best functions in R to do these two things? I read that optim() could be appropriate to find the global peak, but I am not sure that it can deal with complex functions (I prefer asking before engaging in a long (for me) process of code writing). And for the local peaks? optim() should not be able to do this.
Thanks in advance for any leads.
Julien from France
Take a look at the Optimization and Mathematical Programming Task View on CRAN. I've personally found the differential evolution algorithm to be very fast and robust. It's implemented in the DEoptim package. The rgenoud package is another good candidate.
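A minimal DEoptim sketch (f, lower and upper are placeholders for your own function of the 6 parameters and their known bounds; note that DEoptim minimizes, so you minimize -Y to maximize Y):

library(DEoptim)

lower <- c(0, 0, 0, 0, 0, 0)   # replace with the known minima of the 6 parameters
upper <- c(1, 1, 1, 1, 1, 1)   # replace with the known maxima

neg_Y <- function(p) -f(p[1], p[2], p[3], p[4], p[5], p[6])  # DEoptim minimizes

res <- DEoptim(neg_Y, lower, upper,
               control = DEoptim.control(itermax = 200, trace = FALSE))

best_params <- res$optim$bestmem    # parameter combination giving the highest Y
best_Y      <- -res$optim$bestval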
I like to use the Metropolis-Hastings algorithm. Since you are limiting each parameter to a range, the simple thing to do is let your proposal distribution simply be uniform over the range. That way, you won't run off the edges. It won't be fast, but if you let it run long enough, it will do a good job of sampling your space. The samples will congregate at each peak, and will spread out around them in a way that reflects the local curvature.
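A rough sketch of that idea (here f takes the length-6 parameter vector and returns Y; treating Y as a log-density for the acceptance rule is an assumption on my part):

mh_search <- function(f, lower, upper, n_iter = 20000) {
  cur    <- runif(length(lower), lower, upper)    # random starting point in the box
  y_cur  <- f(cur)
  best   <- cur
  y_best <- y_cur
  for (i in seq_len(n_iter)) {
    prop   <- runif(length(lower), lower, upper)  # uniform proposal never leaves the box
    y_prop <- f(prop)
    if (log(runif(1)) < y_prop - y_cur) {         # Metropolis-Hastings acceptance, log scale
      cur   <- prop
      y_cur <- y_prop
    }
    if (y_cur > y_best) {                         # remember the highest Y visited
      best   <- cur
      y_best <- y_cur
    }
  }
  list(par = best, value = y_best)
}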