I am new to gem5, and I am running the SPEC CPU2006 benchmarks on it. What I want is to analyse the distribution of different bit patterns in memory accesses for different benchmarks, in terms of the ratio of 1s to 0s, or of two-bit states such as '11', '00', etc. What should I do?
I regularly see discussions here about CPU usage, and questions about reducing 'high usage', covering everything from JavaScript functions to compiled C executables.
I notice that people are almost always referring to the percentage of CPU being consumed, which naturally varies hugely according to where the code is running, e.g. "When I run this I get 80% CPU usage, so I need to optimise my code".
While it's clear that 'high CPU usage' in looping code is often a good indicator that something is wrong and the code needs to sleep a little or be refactored, I am very surprised not to be able to find a common unit of processing measurement that is used to describe intense CPU usage, rather than a percentage of the author's own machine's CPU, for example.
We can easily measure the memory/disk usage of an algorithm on a given platform, but is there any easily attainable, consistent figure for an amount of processing that could be used to compare usage?
Are FLOPS still used in the modern world, for instance?
This is a very generic question. What is the best way to study the basic CPU models in gem5, so that I can build my own CPU models using them? Do I need to understand the base models fully? I mean, do I need to go through the code line by line to understand the functionality of those CPU models in gem5?
If your goal is only to change the timing of the different pipeline stages, you can do that in your configuration script, as the CPU models in gem5 expose options for it. You can change instruction latencies, the number of functional units, the number of cycles between fetch/decode/execute/...
You could take a look at https://github.com/gem5/gem5/tree/master/configs/common/cores/arm, where the authors of these files set options to change the structure of a CPU core. The core still uses the detailed gem5 out-of-order CPU model; only its parameters (sizes of structures, latencies between structures, ...) are modified.
Using this as an example, you could change what you want without having to fully understand the code of the detailed CPU model.
I have two SSE registers and I want to replace the high half of one with the low half of the other. As usual, I want the fastest way.
I guess it is doable by shifting one of the registers right by 8 bytes and then using palignr to concatenate.
Is there any single-instruction solution?
You can use punpcklqdq to combine the low halves of two registers into hi:lo in a single register. This is identical to what the movlhps FP instruction does, and also unpcklpd, but operates in the integer domain on CPUs that care about FP vs. integer shuffles for bypass delays.
Bonus reading: combining different parts of two registers
palignr would only be good for combining hi:xxx with xxx:lo, to produce lo:hi (i.e. reversed). You can use an FP shuffle (the register-register form of movsd) to get hi:lo (by moving the low half of xxx:lo to replace the low garbage in hi:xxx). Without that, you'd want to use punpckhqdq to bring the high half of one register to the low half, then use punpcklqdq to combine the low halves of two registers.
On most CPUs other than Intel Nehalem, floating-point shuffles on integer data are generally fine (little or no extra latency when used between vector-integer ALU instructions). On Nehalem, you might get two cycles of extra latency into and out of a floating point shuffle (for a total of 4 cycles latency), but that's only a big problem for throughput if it's part of a loop-carried dependency chain. See Agner Fog's guides for more info.
Agner's Optimizing Assembly guide also has a whole section of tables of SSE/AVX instructions that are useful for various kinds of data movement within or between registers. See the sse tag wiki for a link, download the PDF, read section 13.7 "Permuting data" on page 130.
To use FP shuffles with intrinsics, you have to clutter up your code with _mm_castsi128_ps and _mm_castps_si128, which are reinterpret-casts that emit no instructions.
I need to accelerate many computations I am now doing with PyLab, and I thought of using CUDA. The overall unit of computation (A) consists of several thousand entirely independent smaller computations (B). Each of these involves, at its initial stage, doing 40-41 independent, even smaller, computations (C). So parallel programming should really help. With PyLab, the overall computation (A) takes 20 minutes and each (B) takes a few tenths of a second.
As a beginner in this realm, my question is at which level I should parallelize the computation: at (C) or at (B).
I should clarify that stage (C) consists of taking a bunch of data (thousands of floats), which is shared between all the (C) processes, and performing various tasks, among which one of the most time-consuming is linear regression, which is itself parallelizable! The output of each procedure (C) is a single float. Each computation (B) basically consists of running procedure (C) many times and then doing a linear regression on the data that comes out. Its output, again, is a single float.
I'm not familiar with CUDA programming so I am basically asking what would be the wisest strategy to start with.
An important consideration when deciding how (and whether) to convert your project to CUDA is what kind of memory access patterns your code requires. The GPU runs threads in groups of 32, called warps, and to get the best performance the threads in a warp should access memory in certain basic patterns, which are described in the CUDA Programming Guide (included with CUDA). In general, the more random the access patterns, the more likely the kernel is to become memory-bound; in that case, the compute power of the GPU cannot be fully utilized.
The other main case where the compute power of the GPU cannot be fully utilized is when conditional logic and loops cause the threads in a warp to take different code paths, as the GPU has to run all the threads in the warp through each code path.
If you find that these points may cause issues for your code, you should also do some research to see if there are known alternative ways to implement your code to run better on the GPU (this is often the case).
If you consider your question about the level at which to parallelize the computation in light of the above, it may become clear which choice to make.
I have a quad-core computer, and I use the Parallel Computing Toolbox.
I have set different values for the number of workers in the parallel computing settings, for example 2, 4, 8, and so on.
However, no matter what I set, the average CPU usage by MATLAB is exactly 25% of the total, and none of the cores runs at 100% (all are around 10%-30%). I am using MATLAB to run an optimization problem, so I really want my quad-core computer to use all its power for the computation. Please help.
Setting the number of workers (up to 4 on a quad-core machine) is not enough. You also need to use a construct like parfor to tell MATLAB which part of the calculation should be distributed among the workers.
I am curious what kind of optimization you're running. Normally, optimization problems are very difficult to parallelize, since the result of each iteration depends on the previous one. However, if you want to, say, fit multiple models to the data, or if you have multiple data sets to fit, then you can easily run these in parallel rather than sequentially.
Note that having many cores may not be sufficient in terms of other resources: if performing the optimization on one worker uses k GB of RAM, performing it on n workers requires at least n*k GB of RAM.