Can I implement Branch target buffer in two stage pipelined RISC architecture?

Can I implement Branch target buffer in two stage pipelined RISC architecture? - branch

I am trying to implement the BTB in low-level microcontroller such as PIC16. I don't know is it feasible or not. So wanted your suggestion.
Thanks.

The basic BTB is fairly simple, and is the equivalent of
BTBEntry be = BTB[curAddr & BTBBitMask];
nextFetch = be.addr;
which implemented as electronics takes the lower curAddr bits and feed them into a BTB memory and get the next address out.
And when the branch is resolved the result is written back into the BTB.
The lookup can be done in parallel with memory fetch and must be faster as additional steps must be done.
struct BTBEntry {
int addr;
int curAddr; // upper address bits.
}
To not just jump random around in the program due to the addr stored not corresponding to the curAddr, we also need to check if the address we are looking up is for the correct branch.
if ((curAddr & ~BTBBitMask) == be.curAddr)
nextFetch = be.addr; // found in the BTB
else
nextFetch = curAddr + instrutionSize; // not found, take next instruction
So it can be done, if the BTB is small enough and the total time used is less than an instruction fetch. But the effect might not be so large as you might want.

Related

Vullkan compute shader caches and barriers

I'm trying to understand how the entire L1/L2 flushing works. Suppose I have a compute shader like this one
layout(std430, set = 0, binding = 2) buffer Particles{
Particle particles[];
};
layout(std430, set = 0, binding = 4) buffer Constraints{
Constraint constraints[];
};
void main(){
const uint gID = gl_GlobalInvocationID.x;
for (int pass=0;pass<GAUSS_SEIDEL_PASSES;pass++){
// first query the constraint, which contains particle_id_1 and particle_id_1
const Constraint c = constraints[gID*GAUSS_SEIDEL_PASSES+pass];
// read newest positions
vec3 position1 = particles[c.particle_id_1].position;
vec3 position2 = particles[c.particle_id_2].position;
// modify position1 and position2
position1 += something;
position2 -= something;
// update positions
particles[c.particle_id_1].position = position1;
particles[c.particle_id_2].position = position2;
// in the next iteration, different constraints may use the updated positions
}
}
From what I understand, initially all data resides in L2. When I read particles[c.particle_id_1].position I copy some of the data from L2 to L1 (or directly to a register).
Then in position1 += something I modify L1 (or the register). Finally in particles[c.particle_id_2].position = position1, I flush the data from L1 (or a register) back to L2, right? So if I then have a second compute shader that I want to run afterward this one, and that second shader will read positions of particles, I do not need to synchronize Particles. It would be enough to just put an execution barrier, without memory barrier
void vkCmdPipelineBarrier(
VkCommandBuffer commandBuffer,
VkPipelineStageFlags srcStageMask, // here I put VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT
VkPipelineStageFlags dstStageMask, // here I put VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT
VkDependencyFlags dependencyFlags, // here nothing
uint32_t memoryBarrierCount, // here 0
const VkMemoryBarrier* pMemoryBarriers, // nullptr
uint32_t bufferMemoryBarrierCount, // 0
const VkBufferMemoryBarrier* pBufferMemoryBarriers, // nullptr
uint32_t imageMemoryBarrierCount, // 0
const VkImageMemoryBarrier* pImageMemoryBarriers); // nullptr

Vulkan's memory model does not care about "caches" as caches. Its model is built on the notion of availability and visibility. A value produced by GPU command/stage A is "available" to GPU command/stage B if the command/stage A has an execution dependency with command/stage B. A value produced by GPU command/stage A is "visible" to GPU command/stage B if command/stage A has a memory dependency with command/stage B with regard to the particular memory in question and the access modes that A wrote it and B will access it.
If a value is not both available and visible to a command/stage, then attempting to access it yields undefined behavior.
The implementation of availability and visibility will involve clearing caches and the like. But as far as the Vulkan memory model is concerned, this is an implementation detail it doesn't care about. Nor should you: understand the Vulkan memory model and write code that works within it.
Your pipeline barrier creates an execution dependency, but not a memory dependency. Therefore, values written by CS processes before the barrier are available to CS processes afterwards, but not visible to them. You need to have a memory dependency to establish visibility.
However, if you want a GPU level understanding... it all depends on the GPU. Does the GPU have a cache hierarchy, an L1/L2 split? Maybe some do, maybe not.
It's kind of irrelevant anyway, because merely writing a value to an address in memory is not equivalent to a "flush" of the appropriate caches around that memory. Even using the coherent qualifier would only cause a flush for compute shader operations executing within that same dispatch call. It would not be guaranteed to affect later dispatch calls.

Implementation-dependent. For all we know, a device might have no cache at all, or in future it might be some quantum magic bs.
Shader assignment operation does not imply anything about anything. There's no "L1" or "L2" mentioned anywhere in the Vulkan specification. It is a concept that does not exist.
Completely divorce ourselves from the cache stuff, and all mental bagage that comes with it.
What is important here is that when you read something, then that thing needs to be "visible to" the reading agent (irrespective of what kind of device you use, and what obscure memory architecture it might have). If it is not "visible to", then you might be reading garbage.
When you write something, this does not happen automatically. The writes are not "visible to" anyone.
First you put your writes into src* part of a memory dependency (e.g. via a pipeline barrier). That will make your writes "available from".
Then you put your reader into dst* that will take all referenced writes that are "available from" and make them "visible to" the second synchronization scope.
If you really want to shoehorn this into a cache system concept, don't think of it as levels of cache. Think of it as separate caches. That something is already in some cache does not mean it is in the particular cache the consumer needs.

(STM32) Erasing flash and writing to flash gives HAL_FLASH_ERROR_PGP error (using HAL)

Trying to write to flash to store some configuration. I am using an STM32F446ze where I want to use the last 16kb sector as storage.
I specified VOLTAGE_RANGE_3 when I erased my sector. VOLTAGE_RANGE_3 is mapped to:
#define FLASH_VOLTAGE_RANGE_3 0x00000002U /*!< Device operating range: 2.7V to 3.6V */
I am getting an error when writing to flash when I use FLASH_TYPEPROGRAM_WORD. The error is HAL_FLASH_ERROR_PGP. Reading the reference manual I read that this has to do with using wrong parallelism/voltage levels.
From the reference manual I can read
Furthermore, in the reference manual I can read:
Programming errors
It is not allowed to program data to the Flash
memory that would cross the 128-bit row boundary. In such a case, the
write operation is not performed and a program alignment error flag
(PGAERR) is set in the FLASH_SR register. The write access type (byte,
half-word, word or double word) must correspond to the type of
parallelism chosen (x8, x16, x32 or x64). If not, the write operation
is not performed and a program parallelism error flag (PGPERR) is set
in the FLASH_SR register
So I thought:
I erased the sector in voltage range 3
That gives me 2.7 to 3.6v specification
That gives me x32 parallelism size
I should be able to write WORDs to flash.
But, this line give me an error (after unlocking the flash)
uint32_t sizeOfStorageType = ....; // Some uint I want to write to flash as test
HAL_StatusTypeDef flashStatus = HAL_FLASH_Program(TYPEPROGRAM_WORD, address++, (uint64_t) sizeOfStorageType);
auto err= HAL_FLASH_GetError(); // err == 4 == HAL_FLASH_ERROR_PGP: FLASH Programming Parallelism error flag
while (flashStatus != HAL_OK)
{
}
But when I start to write bytes instead, it goes fine.
uint8_t *arr = (uint8_t*) &sizeOfStorageType;
HAL_StatusTypeDef flashStatus;
for (uint8_t i=0; i<4; i++)
{
flashStatus = HAL_FLASH_Program(TYPEPROGRAM_BYTE, address++, (uint64_t) *(arr+i));
while (flashStatus != HAL_OK)
{
}
}
My questions:
Am I understanding it correctly that after erasing a sector, I can only write one TYPEPROGRAM? Thus, after erasing I can only write bytes, OR, half-words, OR, words, OR double words?
What am I missing / doing wrong in above context. Why can I only write bytes, while I erased with VOLTAGE_RANGE_3?

This looks like an data alignment error, but not the one related with 128-bit flash memory rows which is mentioned in the reference manual. That one is probably related with double word writes only, and is irrelevant in your case.
If you want to program 4 bytes at a time, your address needs to be word aligned, meaning that it needs to be divisible by 4. Also, address is not a uint32_t* (pointer), it's a raw uint32_t so address++ increments it by 1, not 4. As far as I know, Cortex M4 core converts unaligned accesses on the bus into multiple smaller size aligned accesses automatically, but this violates the flash parallelism rule.
BTW, it's perfectly valid to perform a mixture of byte, half-word and word writes as long as they are properly aligned. Also, unlike the flash hardware of F0, F1 and F3 series, you can try to overwrite a previously written location without causing an error. 0->1 bit changes are just ignored.

Processes in Operating Systems

When I read a source about the processes and threads in the operating system, I faced this sentence and it sounded weird to me:
When a program is executed and handled by the processor, it converts into a process. A process needs to use the data and code segment in the memory.
I think the first sentence is true naturally. However, I cannot understand why the process needs to use solely data and code segment?
#include <stdio.h>
x = 10;
y;
int main(void){
int *array = (int*)malloc(sizeof(int) * 4);
printf("x and y are %d %d", x, y);
return 0;
}
I think that when this code is executed, the generated process use bss, data, heap and code segment. In my opinion, a process can benefit from any segment of the memory.
If my thoughts are wrong, can anyone explain the reason ?

A process has to store in memory:
Code.
Heap.
Stack.
Data.
BSS.
Except for really trivial ones, a program will use all these segments. Take a look at wikipedia's explanation of what the segments contain.
I think in the sentence the author didn't want to go into details and refers to Stack/Heap/Data/BSS as the data of your program, not the actual data segment.

This statement is not correct.
When a program is executed and handled by the processor, it converts into a process. A process needs to use the data and code segment in the memory.
A process has to exist before a program can be executed. On many non-eunuch's systems a single process runs multiple program.s
I think that when this code is executed, the generated process use bss, data, heap and code segment. In my opinion, a process can benefit from any segment of the memory.
The LINKER deine program segments. The loader follows the instructions of the linker to create the address space.
"bss, data, heap, and code" is a bad way to envision the address space.
There is:
Executable data
Read only data
Read/write data that can be
initialized
uninitialized
Heap and stack are just read/write data. The operating system cannot even tell what data is stack and what is heap. It's all just memory.

Assembly code for optimized bitshifting of a vector

i'm trying to write a routine that will logically bitshift by n positions to the right all elements of a vector in the most efficient way possible for the following vector types: BYTE->BYTE, WORD->WORD, DWORD->DWORD and WORD->BYTE (assuming that only 8 bits are present in the result). I would like to have three routines for each type depending on the type of processor (SSE2 supported, only MMX suppported, only standard instruction se supported). Therefore i need 12 functions in total.
I have already found by myself how to backup and restore the registers that i need, how to make a loop, how to copy data into regular registers or MMX registers and how to shift by 1 position logically.
Because i'm not familiar with assembly language that's about it.
Which registers should i use for each instruction set?
How will the availability of the large vector (an image) in L1 cache be optimized?
How do i find the next element of the vector (a pointer kind of thing), i know i can make a mov by address and i assume i have to increment the address by 1, 2 or 4 depending on my type of data?
Although i have all the ideas, writing the code is a bit difficult at this point.
Thank you.
Arnaud.
Edit:
Here is what i'm trying to do for MMX for a shift by 1 on a DWORD:
__asm("push mm"); // backup register
__asm("push cx"); // backup register
__asm("mov %cx, length"); // initialize loop
__asm("loopstart_shift1:"); // start label
__asm("movd %xmm0, r/m32"); // get 32 bits data
__asm("psrlq %xmm0, 1"); // right shift 32 bits data logically (stuffs 0 on the left) by 1
__asm("mov r/m32,%xmm0"); // set 32 bits data
__asm("dec %cx"); // decrement index
__asm("cmp %cx,0");
__asm("jnz loopstart_shift1");
__asm("pop cx"); // restore register
__asm("pop mm"); // restore register
__asm("emms"); // leave MMX state

I strongly suggest you pause and take a look at using intrinsics with C or C++ instead of trying to write raw asm - that way the C/C++ compiler will take care of all the register allocation, instruction scheduling and general housekeeping tasks and you can just focus on the important parts, e.g. instead of using psrlq see _m_psrlq in mmintrin.h. (Better yet, look at using 128 bit SSE intrinsics.)

Sounds like you'd benefit from either using or looking into BitMagic's source. its entirely intrinsics based too, which makes its far more portable (though from the looks of it your using GCC, so it might have to get an MSVC to GCC intrinics mapping).

Keeping time using timer interrupts an embedded microcontroller

This question is about programming small microcontrollers without an OS. In particular, I'm interested in PICs at the moment, but the question is general.
I've seen several times the following pattern for keeping time:
Timer interrupt code (say the timer fires every second):
...
if (sec_counter > 0)
sec_counter--;
...
Mainline code (non-interrupt):
sec_counter = 500; // 500 seconds
while (sec_counter)
{
// .. do stuff
}
The mainline code may repeat, set the counter to various values (not just seconds) and so on.
It seems to me there's a race condition here when the assignment to sec_counter in the mainline code isn't atomic. For example, in PIC18 the assignment is translated to 4 ASM statements (loading each byte at the time and selecting the right byte from the memory bank before that). If the interrupt code comes in the middle of this, the final value may be corrupted.
Curiously, if the value assigned is less than 256, the assignment is atomic, so there's no problem.
Am I right about this problem?
What patterns do you use to implement such behavior correctly? I see several options:
Disable interrupts before each assignment to sec_counter and enable after - this isn't pretty
Don't use an interrupt, but a separate timer which is started and then polled. This is clean, but uses up a whole timer (in the previous case the 1-sec firing timer can be used for other purposes as well).
Any other ideas?

The PIC architecture is as atomic as it gets. It ensures that all read-modify-write operations to a memory file are 'atomic'. Although it takes 4-clocks to perform the entire read-modify-write, all 4-clocks are consumed in a single instruction and the next instruction uses the next 4-clock cycle. It is the way that the pipeline works. In 8-clocks, two instructions are in the pipeline.
If the value is larger than 8-bit, it becomes an issue as the PIC is an 8-bit machine and larger operands are handled in multiple instructions. That will introduce atomic issues.

You definitely need to disable the interrupt before setting the counter. Ugly as it may be, it is necessary. It is a good practice to ALWAYS disable the interrupt before configuring hardware registers or software variables affecting the ISR method. If you are writing in C, you should consider all operations as non-atomic. If you find that you have to look at the generated assembly too many times, then it may be better to abandon C and program in assembly. In my experience, this is rarely the case.
Regarding the issue discussed, this is what I suggest:
ISR:
if (countDownFlag)
{
sec_counter--;
}
and setting the counter:
// make sure the countdown isn't running
sec_counter = 500;
countDownFlag = true;
...
// Countdown finished
countDownFlag = false;
You need an extra variable and is better to wrap everything in a function:
void startCountDown(int startValue)
{
sec_counter = 500;
countDownFlag = true;
}
This way you abstract the starting method (and hide ugliness if needed). For example you can easily change it to start a hardware timer without affecting the callers of the method.

Write the value then check that it is the value required would seem to be the simplest alternative.
do {
sec_counter = value;
} while (sec_counter != value);
BTW you should make the variable volatile if using C.
If you need to read the value then you can read it twice.
do {
value = sec_counter;
} while (value != sec_counter);

Because accesses to the sec_counter variable are not atomic, there's really no way to avoid disabling interrupts before accessing this variable in your mainline code and restoring interrupt state after the access if you want deterministic behavior. This would probably be a better choice than dedicating a HW timer for this task (unless you have a surplus of timers, in which case you might as well use one).

If you download Microchip's free TCP/IP Stack there are routines in there that use a timer overflow to keep track of elapsed time. Specifically "tick.c" and "tick.h". Just copy those files over to your project.
Inside those files you can see how they do it.

It's not so curious about the less than 256 moves being atomic - moving an 8 bit value is one opcode so that's as atomic as you get.
The best solution on such a microcontroller as the PIC is to disable interrupts before you change the timer value. You can even check the value of the interrupt flag when you change the variable in the main loop and handle it if you want. Make it a function that changes the value of the variable and you could even call it from the ISR as well.

Well, what does the comparison assembly code look like?
Taken to account that it counts downwards, and that it's just a zero compare, it should be safe if it first checks the MSB, then the LSB. There could be corruption, but it doesn't really matter if it comes in the middle between 0x100 and 0xff and the corrupted compare value is 0x1ff.

The way you are using your timer now, it won't count whole seconds anyway, because you might change it in the middle of a cycle.
So, if you don't care about it. The best way, in my opinion, would be to read the value, and then just compare the difference. It takes a couple of OPs more, but has no multi-threading problems.(Since the timer has priority)
If you are more strict about the time value, I would automatically disable the timer once it counts down to 0, and clear the internal counter of the timer and activate once you need it.

Move the code portion that would be on the main() to a proper function, and have it conditionally called by the ISR.
Also, to avoid any sort of delaying or missing ticks, choose this timer ISR to be a high-prio interrupt (the PIC18 has two levels).

One approach is to have an interrupt keep a byte variable, and have something else which gets called at least once every 256 times the counter is hit; do something like:
// ub==unsigned char; ui==unsigned int; ul==unsigned long
ub now_ctr; // This one is hit by the interrupt
ub prev_ctr;
ul big_ctr;
void poll_counter(void)
{
ub delta_ctr;
delta_ctr = (ub)(now_ctr-prev_ctr);
big_ctr += delta_ctr;
prev_ctr += delta_ctr;
}
A slight variation, if you don't mind forcing the interrupt's counter to stay in sync with the LSB of your big counter:
ul big_ctr;
void poll_counter(void)
{
big_ctr += (ub)(now_ctr - big_ctr);
}

No one addressed the issue of reading multibyte hardware registers (for example a timer.
The timer could roll over and increment its second byte while you're reading it.
Say it's 0x0001ffff and you read it. You might get 0x0010ffff, or 0x00010000.
The 16 bit peripheral register is volatile to your code.
For any volatile "variables", I use the double read technique.
do {
t = timer;
} while (t != timer);

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas