Processes in Operating Systems

While reading a source about processes and threads in operating systems, I came across this sentence, and it sounded odd to me:
When a program is executed and handled by the processor, it converts into a process. A process needs to use the data and code segment in the memory.
I think the first sentence is naturally true. However, I cannot understand why a process should need to use only the data and code segments.
#include <stdio.h>
#include <stdlib.h>   /* for malloc/free */

int x = 10;   /* initialized global: data segment */
int y;        /* uninitialized global: BSS segment */

int main(void){
    int *array = malloc(sizeof(int) * 4);   /* dynamic allocation: heap */
    printf("x and y are %d %d\n", x, y);
    free(array);
    return 0;
}
I think that when this code is executed, the resulting process uses the BSS, data, heap and code segments. In my opinion, a process can make use of any segment of memory.
If my thinking is wrong, can anyone explain why?

A process has to store in memory:
Code.
Heap.
Stack.
Data.
BSS.
Except for really trivial ones, a program will use all of these segments. Take a look at Wikipedia's explanation of what the segments contain.
I think the author of that sentence didn't want to go into details, and refers to the stack/heap/data/BSS collectively as the data of your program, not the actual data segment.

This statement is not correct.
When a program is executed and handled by the processor, it converts into a process. A process needs to use the data and code segment in the memory.
A process has to exist before a program can be executed. On many non-Unix systems a single process runs multiple programs.
I think that when this code is executed, the generated process use bss, data, heap and code segment. In my opinion, a process can benefit from any segment of the memory.
The LINKER defines program segments. The loader follows the linker's instructions to create the address space.
"bss, data, heap, and code" is a bad way to envision the address space.
There is:
Executable data
Read-only data
Read/write data that can be:
  initialized
  uninitialized
Heap and stack are just read/write data. The operating system cannot even tell what data is stack and what is heap. It's all just memory.
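For illustration, here is a small C program (a sketch; the exact placement is linker- and OS-specific) that prints where differently declared objects end up in the address space:
#include <stdio.h>
#include <stdlib.h>

int initialized_global = 42;   /* typically placed in initialized (data) storage */
int uninitialized_global;      /* typically placed in zero-initialized (BSS) storage */

int main(void)
{
    int on_the_stack = 0;
    int *on_the_heap = malloc(sizeof *on_the_heap);

    printf("code  (main)                 : %p\n", (void *)main);   /* common, if not strictly portable, cast */
    printf("data  (initialized_global)   : %p\n", (void *)&initialized_global);
    printf("bss   (uninitialized_global) : %p\n", (void *)&uninitialized_global);
    printf("heap  (malloc'd int)         : %p\n", (void *)on_the_heap);
    printf("stack (local variable)       : %p\n", (void *)&on_the_stack);

    free(on_the_heap);
    return 0;
}
On a typical Linux/ELF system you will see code lowest, then data and BSS, then the heap, with the stack much higher - but that is a loader convention; to the operating system it is all just memory with different protections.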

Related

A general-purpose warp-level std::copy-like function - what should it account for?

A C++ standard library implements std::copy (ignoring all sorts of wrappers, concept checks, etc.) with the following simple loop:
for (; __first != __last; ++__result, ++__first)
    *__result = *__first;
Now, suppose I want a general-purpose std::copy-like function for warps (not blocks; not grids) to use for collaboratively copying data from one place to another. Let's even assume for simplicity that the function takes pointers rather than an arbitrary iterator.
Of course, writing general-purpose code in CUDA is often a useless pursuit - since we might be sacrificing a lot of the benefit of using a GPU in the first place in favor of generality - so I'll allow myself some boolean/enum template parameters to possibly select between frequently-occurring cases, avoiding runtime checks. So the signature might be, say:
template <typename T, bool SomeOption, my_enum_t AnotherOption>
T* copy(
    T*       __restrict__  destination,
    const T* __restrict__  source,
    size_t                 length
);
but for each of these cases I'm aiming for optimal performance (or optimal expected performance given that we don't know what other warps are doing).
Which factors should I take into consideration when writing such a function? Or in other words: Which cases should I distinguish between in implementing this function?
Notes:
This should target Compute Capability 3.0 or better (i.e. Kepler or newer micro-architectures)
I don't want to make a Runtime API memcpy() call. At least, I don't think I do.
Factors I believe should be taken into consideration:
Coalescing memory writes - ensuring that consecutive lanes in a warp write to consecutive memory locations (no gaps).
Type size vs Memory transaction size I - if sizeof(T) is 1 or 2, and we have each lane write a single element, the entire warp would write less than 128B, wasting some of the memory transaction. Instead, we should have each thread place 2 or 4 input elements in a register, and write that.
Type size vs Memory transaction size II - For type sizes such that lcm(4, sizeof(T)) > 4, it's not quite clear what to do. How well does the compiler/the GPU handle writes when each lane writes more than 4 bytes? I wonder.
Slack due to the reading of multiple elements at a time - If each thread wishes to read 2 or 4 elements for each write, and write 4-byte integers - we might have 1 or 2 elements at the beginning and the end of the input which must be handled separately.
Slack due to input address mis-alignment - The input is read in 32B transactions (under reasonable assumptions); we thus have to handle the first elements, up to the next multiple of 32B, and the last elements (after the last such multiple) differently.
Slack due to output address mis-alignment - The output is written in transactions of up to 128B (or is it just 32B?); we thus have to handle the first elements, up to the next multiple of this number, and the last elements (after the last such multiple) differently. (A scalar prologue/body/epilogue sketch of this idea appears after the lists below.)
Whether or not T is trivially-copy-constructible. But let's assume that it is.
But it could be that I'm missing some considerations, or that some of the above are redundant.
Factors I've been wondering about:
The block size (i.e. how many other warps are there)
The compute capability (given that it's at least 3)
Whether the source/target is in shared memory / constant memory
Choice of caching mode
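For concreteness, here is a rough scalar sketch (plain C, not a warp implementation; the function name is mine) of the prologue/body/epilogue structure the slack items above imply. A warp-level version would distribute the aligned middle portion across lanes:
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch: copy bytes one at a time until the destination is 4-byte aligned,
   move the aligned middle in 32-bit chunks, then finish the tail. */
void copy_with_slack(void *dst, const void *src, size_t n)
{
    uint8_t *d = dst;
    const uint8_t *s = src;

    /* prologue: peel bytes until the destination address is 4-byte aligned */
    while (n > 0 && ((uintptr_t)d & 3u) != 0) {
        *d++ = *s++;
        n--;
    }
    /* body: 4-byte chunks; memcpy keeps the possibly unaligned source read well-defined */
    while (n >= 4) {
        uint32_t w;
        memcpy(&w, s, 4);
        memcpy(d, &w, 4);
        d += 4; s += 4; n -= 4;
    }
    /* epilogue: remaining tail bytes */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
}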

Neon VLD consuming more cycles than what is expected?

I have a simple piece of asm code that loads 12 NEON quad registers, and I have interleaved pairwise-add instructions with the load instructions (to exploit the dual-issue capability). I have verified the code here:
http://pulsar.webshaker.net/ccc/sample-d3a7fe78
As one can see, the code should take around 13 cycles. But when I run the code on the board, the load instructions seem to take more than one cycle per load. I verified and found that VPADAL takes 1 cycle as stated, but VLD1 takes more than one cycle. Why is that?
I have taken care of the following:
The address is 16-byte aligned.
I have provided the alignment hint in the instruction: vld1.64 {d0, d1}, [r0,:128]!
I have tried the preload instruction pld [r0, #192] at various places, but that seems to add to the cycle count instead of actually reducing the latency.
Can someone tell me what I am doing wrong, and why this latency occurs?
Other Details:
With reference to Cortex-A8
arm-2009q1 cross-compiler toolchain
coding in assembly
Your code is executing much slower than expected because, as it's currently written, it's causing the perfect storm of pipeline stalls. On any modern CPU with a pipelined architecture, instructions can execute in one cycle under ideal conditions. The ideal conditions are that the instruction is not waiting for memory and doesn't have any register dependencies. The way you've written the code, you're not allowing for the delay in reading from memory, and you're making the next instruction dependent on the result of the read. This causes the worst possible performance.
Also, I'm not sure why you're accumulating the pairwise adds into multiple registers. Try something like this:
veor.u16 q12,q12,q12 @ clear accumulated sum
top_of_loop:
vld1.u16 {q0,q1},[r0,:128]!
vld1.u16 {q2,q3},[r0,:128]!
vpadal.u16 q12,q0
vpadal.u16 q12,q1
vpadal.u16 q12,q2
vpadal.u16 q12,q3
vld1.u16 {q0,q1},[r0,:128]!
vld1.u16 {q2,q3},[r0,:128]!
vpadal.u16 q12,q0
vpadal.u16 q12,q1
vpadal.u16 q12,q2
vpadal.u16 q12,q3
subs r1,r1,#8
bne top_of_loop
Experiment with different numbers of load instructions before executing the adds. The point is that you need to allow time for the read to occur before you can use the target register.
Note: Using Q4-Q7 is risky because they're non-volatile registers. On Android you will get random garbage appearing in these (especially Q4).
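If you'd rather not hand-schedule the asm at all, roughly the same structure can be written with NEON intrinsics from arm_neon.h and the compiler left to do the register allocation and scheduling (a sketch; the helper name is mine and it assumes the element count is a multiple of 32). This also sidesteps the Q4-Q7 issue, since the compiler saves any callee-saved registers it uses:
#include <arm_neon.h>
#include <stdint.h>
#include <stddef.h>

/* Sum 'count' uint16_t values (count assumed to be a multiple of 32). */
uint64_t sum_u16(const uint16_t *src, size_t count)
{
    uint32x4_t acc = vdupq_n_u32(0);    /* accumulator, like q12 above */
    size_t i;
    for (i = 0; i < count; i += 32) {
        /* four 128-bit loads first, then the dependent accumulates */
        uint16x8_t v0 = vld1q_u16(src + i);
        uint16x8_t v1 = vld1q_u16(src + i + 8);
        uint16x8_t v2 = vld1q_u16(src + i + 16);
        uint16x8_t v3 = vld1q_u16(src + i + 24);
        acc = vpadalq_u16(acc, v0);
        acc = vpadalq_u16(acc, v1);
        acc = vpadalq_u16(acc, v2);
        acc = vpadalq_u16(acc, v3);
    }
    /* reduce the four 32-bit lanes to a single scalar */
    uint64x2_t pair = vpaddlq_u32(acc);
    return vgetq_lane_u64(pair, 0) + vgetq_lane_u64(pair, 1);
}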

SPU Instruction Pointer

I've been searching through the entire manuals and I can't find a single mention of the Instruction Pointer. I need this for a SPU program that I'm writing. Maybe it has a different name? Can anyone tell me how I can access the address of the instruction that is to be executed? Thanks in advance for your help.
UPDATE: Apparently it's called the Program Counter, but how can I access it from within my SPU program?
If you just want to get the instruction pointer, you can do it in assembly:
brsl r<n>, .+4
This loads the address of the next instruction into register r<n>.
It seems you can get the address of the next instruction through a spe_context_run call:
int spe_context_run(spe_context_ptr_t spe, unsigned int *entry,
                    unsigned int runflags, void *argp, void *envp,
                    spe_stop_info_t *stopinfo)
entry
Input: The entry point, that is, the initial value of the SPU
instruction pointer, at which the SPE program should start executing.
If the value of entry is SPE_DEFAULT_ENTRY, the entry point for the
SPU main program is obtained from the loaded SPE image. This is
usually the local store address of the initialization function crt0
(see Cell Broadband Engine Programming Handbook, Objects, Executables,
and SPE Loading).
Output: The SPU instruction pointer at the moment the SPU stopped
execution, that is, the local store address of the next instruction
that would have been executed.
This parameter can be used, for example, to allow the SPE program to
"pause" and request some action from the PPE thread, for example,
performing an I/O operation. After this PPE-side action has been
completed, you can continue the SPE program by calling spe_context_run
again without changing entry.
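Putting that together, a minimal PPE-side sketch with libspe2 might look like this (untested; run_and_report is my name, and my_spe_program stands in for your embedded SPE image handle):
#include <libspe2.h>
#include <stdio.h>

extern spe_program_handle_t my_spe_program;   /* placeholder for your SPE image */

int run_and_report(void)
{
    spe_context_ptr_t ctx = spe_context_create(0, NULL);
    if (ctx == NULL || spe_program_load(ctx, &my_spe_program) != 0)
        return -1;

    unsigned int entry = SPE_DEFAULT_ENTRY;   /* start at the image's entry point */
    spe_stop_info_t stop_info;
    spe_context_run(ctx, &entry, 0, NULL, NULL, &stop_info);

    /* 'entry' now holds the local store address of the next instruction that
       would have been executed - effectively the SPU program counter. */
    printf("SPU stopped at LS address 0x%x\n", entry);

    spe_context_destroy(ctx);
    return 0;
}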

Concurrent access to a single FFTSetup data structure in GCD

Is it okay to create a single FFTSetup data structure and use it to perform multiple FFT computations concurrently? Would something like the following work?
FFTSetup fftSetup = vDSP_create_fftsetup(
    16,         // vDSP_Length __vDSP_log2n,
    kFFTRadix2  // FFTRadix __vDSP_radix
);
NSAssert(fftSetup != NULL, @"vDSP_create_fftsetup() failed to allocate storage");
for (int i = 0; i < 100; i++)
{
    dispatch_async(dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_HIGH, 0),
    ^{
        vDSP_fft_zrip(
            fftSetup,              // FFTSetup __vDSP_setup,
            &(splitComplex[i]),    // DSPSplitComplex *__vDSP_ioData,
            1,                     // vDSP_Stride __vDSP_stride,
            16,                    // vDSP_Length __vDSP_log2n,
            kFFTDirection_Forward  // FFTDirection __vDSP_direction
        );
    });
}
I suppose that the answer depends on the following considerations:
1) Does vDSP_fft_zrip() only access the data within fftSetup (or the data pointed to by it) in a "read-only" fashion? Or are there perhaps some temporary buffers (scratch space) within fftSetup that are written to by vDSP_fft_zrip() in performing its FFT computations?
2) If data like that in fftSetup is being accessed in a "read-only" fashion, is it okay for multiple processes/threads/tasks/blocks to access it simultaneously? (I am thinking of the case where it is possible for more than one process to open the same file for reading, though not necessarily for writing or appending. Is this analogy appropriate?)
On a related note, just how much memory is being taken up by the FFTSetup data structure? Is there any way to find out? (It is an opaque data type.)
You may create one FFT setup and use it repeatedly and concurrently. This is the intended use. (I am the author of the current implementations of vDSP_fft_zrip and other FFT implementations in vDSP.)
In Using Fourier Transforms we're told that the FFTSetup contains the FFT weight array, which is a series of complex exponentials. The vDSP_create_fftsetup documentation says:
Once prepared, the setup structure can be used repeatedly by FFT
functions (which read the data in the structure and do not alter it)
for any (power of two) length up to that specified when you created
the structure.
so
conceptually, vDSP_fft_zrip should not need to modify the weight array, and so it would appear to be one of the FFT functions that do not alter the FFTSetup (I haven't seen any that do, apart from create/destroy). However, there are no guarantees about what the actual implementation does - it could do anything.
if vDSP_fft_zrip truly accesses its FFTSetup in a read-only fashion, then it's fine to do that from multiple threads.
As for memory usage, the FFT weight array is e^{i*k*2*M_PI/N} for k = [0..N-1], which is N complex float values, so that would be 2*N*sizeof(float) bytes.
But those complex exponentials are very symmetric so who knows, under the hood the implementation could require less memory. Or more!
In your case, N = 2^16, so it wouldn't be strange to see up to 256k being used.
Where does that leave you? I think it seems reasonable that the FFTSetup be accessible from multiple threads, but it appears to be undocumented. You could be lucky. Or unlucky and unpleasantly surprised now or in a future version of the framework.
So... do you feel lucky?
I wouldn't attempt any explicit concurrency with the vDSP functions, or any other function in the Accelerate framework (of which vDSP is a part) for that matter. Why? Because Accelerate is already designed to take advantage of multiple cores, as well as specific nuances of a given processor implementation, on your behalf - see http://developer.apple.com/library/mac/#DOCUMENTATION/Darwin/Reference/ManPages/man7/vecLib.7.html. You may end up essentially re-parallelizing already parallel computations that are internal to the implementation (if not now, then possibly in a later version).
The best approach to the Accelerate framework is generally to assume that it's more clever than you are and just use it in the simplest way possible, then do your performance measurements. If those measurements reflect a level of performance that is somehow insufficient for your needs, then try your own optimizations (and/or file a bug report against the Accelerate framework at http://bugreport.apple.com since the authors of that framework are always interested in knowing where or if their efforts somehow fell short of developer requirements).
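For example, "the simplest way possible" here would be one setup and a plain serial loop, measured, before attempting any explicit concurrency (a sketch; the function name is mine, and splitComplex and count are assumed to come from your existing code):
#include <Accelerate/Accelerate.h>
#include <stdio.h>
#include <time.h>

void run_ffts_serially(DSPSplitComplex *splitComplex, int count)
{
    const vDSP_Length log2n = 16;
    FFTSetup setup = vDSP_create_fftsetup(log2n, kFFTRadix2);
    if (setup == NULL)
        return;

    clock_t start = clock();
    for (int i = 0; i < count; i++)
        vDSP_fft_zrip(setup, &splitComplex[i], 1, log2n, kFFTDirection_Forward);
    double seconds = (double)(clock() - start) / CLOCKS_PER_SEC;

    printf("%d FFTs in %.3f s\n", count, seconds);
    vDSP_destroy_fftsetup(setup);
}
If that number meets your requirements, you're done; if not, it is your baseline for any concurrency experiments (or for a bug report).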

Assembly code for optimized bitshifting of a vector

I'm trying to write a routine that will logically bit-shift all elements of a vector n positions to the right, in the most efficient way possible, for the following vector types: BYTE->BYTE, WORD->WORD, DWORD->DWORD and WORD->BYTE (assuming that only 8 bits are present in the result). I would like to have three routines for each type, depending on the processor (SSE2 supported, only MMX supported, only the standard instruction set supported). Therefore I need 12 functions in total.
I have already found out by myself how to back up and restore the registers that I need, how to make a loop, how to copy data into regular registers or MMX registers, and how to shift by 1 position logically.
Because I'm not familiar with assembly language, that's about it.
Which registers should I use for each instruction set?
How can the availability of the large vector (an image) in the L1 cache be optimized?
How do I find the next element of the vector (a pointer kind of thing)? I know I can do a mov by address, and I assume I have to increment the address by 1, 2 or 4 depending on my data type?
Although I have all the ideas, writing the code is a bit difficult at this point.
Thank you.
Arnaud.
Edit:
Here is what I'm trying to do with MMX for a shift by 1 on a DWORD:
__asm("push mm"); // backup register
__asm("push cx"); // backup register
__asm("mov %cx, length"); // initialize loop
__asm("loopstart_shift1:"); // start label
__asm("movd %xmm0, r/m32"); // get 32 bits data
__asm("psrlq %xmm0, 1"); // right shift 32 bits data logically (stuffs 0 on the left) by 1
__asm("mov r/m32,%xmm0"); // set 32 bits data
__asm("dec %cx"); // decrement index
__asm("cmp %cx,0");
__asm("jnz loopstart_shift1");
__asm("pop cx"); // restore register
__asm("pop mm"); // restore register
__asm("emms"); // leave MMX state
I strongly suggest you pause and take a look at using intrinsics with C or C++ instead of trying to write raw asm - that way the C/C++ compiler will take care of all the register allocation, instruction scheduling and general housekeeping tasks, and you can just focus on the important parts, e.g. instead of using psrlq, see _m_psrlq in mmintrin.h. (Better yet, look at using 128-bit SSE intrinsics.)
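For illustration, the DWORD->DWORD case might look something like this with SSE2 intrinsics (a sketch, not tuned; the function name is mine, and it assumes unaligned loads/stores are acceptable):
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Logically shift every 32-bit element of src right by n bits into dst. */
void shift_right_u32(const uint32_t *src, uint32_t *dst, size_t count, int n)
{
    __m128i cnt = _mm_cvtsi32_si128(n);   /* shift count in the low lane */
    size_t i = 0;
    for (; i + 4 <= count; i += 4) {
        __m128i v = _mm_loadu_si128((const __m128i *)(src + i));  /* load 4 DWORDs */
        v = _mm_srl_epi32(v, cnt);                                /* logical shift right */
        _mm_storeu_si128((__m128i *)(dst + i), v);                /* store 4 DWORDs */
    }
    for (; i < count; i++)                /* scalar tail for leftover elements */
        dst[i] = src[i] >> n;
}
The compiler takes care of the loop, the address arithmetic (stepping 4 elements per iteration) and the register allocation, which covers most of your three questions for the SSE2 variants; the MMX and plain-integer variants follow the same pattern with the _m_* intrinsics or ordinary C shifts.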
Sounds like you'd benefit from either using or looking into BitMagic's source. It's entirely intrinsics-based too, which makes it far more portable (though from the looks of it you're using GCC, so you might have to map the MSVC intrinsics to their GCC equivalents).