How is an annulled branch different from a regular branch? - optimization

For SPARC Assembly particularly, how are annulled branches different from regular branches?
I always thought that annulling branch instructions is required when I need to fill the nop delay slot for branch instructions. However, I don't think I'm correct on this part, because you can fill the nop without annulling the branch.

The annulled branch instruction causes the instruction in the delay slot -- the instruction after the branch -- to be ignored if the branch is not taken.
Why would this be important? Because normally, the instruction after the branch is executed, even if the branch is taken. This is because there are two program counters, PC and NPC. PC, which indicates the instruction being executed, is updated to NPC, which is PC + 4, at the same time as NPC is being updated to the target of the branch instruction. So because of the timing of these events, the next instruction has to be loaded. Rather than just throw that cycle away, it's more profitable to use that cycle if we can. We would then just make that instruction part of the loop.
loop: someOp
      someOtherOp
      branch loop ;
      delayslotOp ; will actually be executed, before someOp, after branch
If we can't use the instruction slot after the branch, then we stick a nop in there, and do nothing on that cycle.
So why then have different instructions with annulled and non-annulled branch options? To give us the choice of what happens on exit from the loop. If we've made the delay slot part of the loop activity, we might not want that op executed upon leaving from the loop. Therefore, we'd add ",a" to the end of the branch instruction.
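The PC/NPC timing described above can be made concrete with a small simulation. This is only a minimal Python sketch, not SPARC code: it models PC, NPC, and a single conditional branch with an optional annul bit. The instruction tuples and the `run` helper are made up for illustration.

```python
def run(annul, iterations=2):
    """Simulate a toy loop with SPARC-style PC/NPC and one delay slot.

    Returns the names of the plain ops in the order they execute.
    """
    # Hypothetical toy program; list indices stand in for instruction addresses.
    program = [
        ("op", "someOp"),        # 0: loop body
        ("dec",),                # 1: decrement the loop counter
        ("branch", 0, annul),    # 2: branch to 0 while counter > 0
        ("op", "delayslotOp"),   # 3: delay slot of the branch
        ("op", "afterLoop"),     # 4: first instruction after the loop
    ]
    counter = iterations
    executed = []
    pc, npc = 0, 1
    annul_next = False
    while pc < len(program):
        insn = program[pc]
        next_npc = npc + 1           # default: NPC advances sequentially
        if annul_next:
            annul_next = False       # annulled delay instruction has no effect
        elif insn[0] == "op":
            executed.append(insn[1])
        elif insn[0] == "dec":
            counter -= 1
        elif insn[0] == "branch":
            _, target, annul_bit = insn
            if counter > 0:
                next_npc = target    # taken: only NPC changes, so PC still visits the delay slot
            else:
                annul_next = annul_bit  # not taken: annul the delay slot if the bit is set
        pc, npc = npc, next_npc      # PC <- NPC, NPC <- its new value, at the same time
    return executed


print(run(annul=False))
print(run(annul=True))
```

Without the annul bit, delayslotOp runs once more on the way out of the loop; with it, the final delayslotOp is skipped while the one inside the loop still runs.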
This page has some nice examples.

According to the SPARC Architecture Manual (v9):
3.2.3 Control Transfer
[...]
Most of the control-transfer instructions are delayed; that is, the instruction immediately following a
control-transfer instruction in logical sequence is dispatched before
the control transfer to the target address is completed.
[...]
The instruction following a delayed control-transfer instruction is called
a delay instruction. A bit in a delayed control-transfer instruction
(the annul bit) can cause the delay instruction to be annulled (that
is, to have no effect) if the branch is not taken.
6.3.4 Control-Transfer Instructions (CTIs)
[...]
Programming Note: The annul bit increases the likelihood that a
compiler can find a useful instruction to fill the delay slot after a
branch, thereby reducing the number of instructions executed by a
program. For example, the annul bit can be used to move an instruction
from within a loop to fill the delay slot of the branch that closes
the loop. Likewise, the annul bit can be used to move an instruction
from either the “else” or “then” branch of an “if-then-else” program
block to the delay slot of the branch that selects between them. Since
a full set of conditions are provided, a compiler can arrange the code
(possibly reversing the sense of the condition) so that an instruction
from either the “else” branch or the “then” branch can be moved to the
delay slot.
The following code shows two branches. In the first one the delay instruction is always executed; in the second one the delay instruction is annulled if the branch is not taken:
cmp %i3, %i0
ble %icc, -0x5c
ld [%l0 - 0x4], %i5 ; executed whether the branch is taken or not
...
cmp %l1, 0x80
bl,a %icc, +0x40
ld [%fp + 0x7c7], %g2 ; annulled if the branch is not taken, executed otherwise

Related

How to break the gem5 executable in GDB at the nth instruction?

Using --debug-flags ExecAll tracing, I found that there is a bug at the Nth instruction, which happens at the Nth line of the log.
Is there an easy way to break specifically at that instruction to debug it in GDB and view gem5's internal state?
The simplest approach is to use --debug-break as shown at: schedBreak(<tick>) gdb debugging function not working
That makes gem5 raise a signal at a given simulation time (tick), which GDB stops at by default. You can determine what simulation time corresponds to your instruction by looking at an --debug-flags ExecAll trace beforehand.
You will want to break on the tick much more often than on the Nth instruction, in particular since gem5 simulates the instruction pipeline, and therefore there can be multiple instructions in flight at the same time.
Alternatively, if your point of interest in GDB can see the ExecutionContext object, which is often called xc, you can just add a conditional breakpoint like:
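Looking up the tick by hand gets tedious for long traces. Below is a minimal Python sketch that pulls the tick out of the Nth trace line. It assumes that each trace line starts with the tick followed by a colon; the helper name `tick_of_nth_trace_line` is made up for illustration, so check the pattern against your own log before relying on it.

```python
import re

# Assumption: a trace line looks like "<tick>: <rest of line>".
_TICK_RE = re.compile(r"^\s*(\d+):")


def tick_of_nth_trace_line(lines, n):
    """Return the tick of the n-th (1-based) line that starts with '<tick>:'."""
    seen = 0
    for line in lines:
        match = _TICK_RE.match(line)
        if match is None:
            continue  # skip output that is not a trace line, e.g. warnings
        seen += 1
        if seen == n:
            return int(match.group(1))
    raise ValueError(f"trace has fewer than {n} tick-prefixed lines")
```

The returned value is the tick you would then hand to --debug-break.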
b MyClass::myFunction if xc->numInsts.data()->value() == <n> - 2
The -2 is needed because this index is zero based, and because the tick increments after instruction execution.
You can also find the tick time rather than instruction count with:
p xc->cpu->tick
or from the other commonly available ThreadContext object with:
p tc->baseCpu->tick
You generally want to do this from the ::tick() function of your CPU model of interest.
For AtomicSimpleCPU::tick() you could also break just before the second instruction with:
b AtomicSimpleCPU::tick if (*threadInfo[curThread]).numInst == 1
Or to break at a given tick, say 1000 (500 is the one before it):
b AtomicSimpleCPU::tick if tick == 500
Two other important break locations are at the main event loop when an event is executed:
b EventQueue::serviceOne() if head->when() == 1000
and the event scheduling target point:
b EventQueue::schedule if when == <target-time>
b EventQueue::reschedule if when == <target-time>
or for the time of schedule itself:
b EventQueue::schedule if _curTick == 1000
b EventQueue::reschedule if _curTick == 1000
Together with reverse debugging and:
--debug-flags Event
these event breakpoints will actually allow you to understand what gem5 is doing.
Note however that conditional breakpoints significantly slow down simulation unfortunately... arghh.
Another useful technique to have in mind is that you can do a run that stops shortly after the point of interest with:
-m <tick>
and then reverse debug back to the exact point of interest, possibly conditionally, since now you will be close to the point of interest, so the performance loss will not be a huge problem. You can then just continue going back to the root cause.
Tested in gem5 9f247403e558977738b5911a45e5776afff87b1a.

What branch misprediction does the Branch Target Buffer detect?

I am currently looking at the various parts of the CPU pipeline which can detect branch mispredictions. I have found these are:
1. Branch Target Buffer (BPU CLEAR)
2. Branch Address Calculator (BA CLEAR)
3. Jump Execution Unit (not sure of the signal name here??)
I know what 2 and 3 detect, but I do not understand what misprediction is detected within the BTB. The BAC detects where the BTB has erroneously predicted a branch for a non-branch instruction, where the BTB has failed to detect a branch, or the BTB has mispredicted the target address for an x86 RET instruction. The execution unit evaluates the branch and determines if it was correct.
What type of misprediction is detected at the Branch Target Buffer? What exactly is detected as a misprediction here?
The only clue I could find was this inside Vol 3 of the Intel Developer Manuals (the two BPU CLEAR event counters at the bottom):
BPU predicted a taken branch after incorrectly assuming that it was
not taken.
This seems to imply the prediction is not done "synchronously", but rather "asynchronously", hence the "after incorrectly assuming"??
UPDATE:
Ross, this is the CPU branch circuitry, from the original Intel Patent (how's that for "reading"?):
I don't see "Branch Prediction Unit" anywhere? Would it be reasonable that somebody having read this paper would assume that "BPU" is a lazy way of grouping the BTB Circuit, BTB Cache, BAC and RSB together??
So my question still stands, which component raises the BPU CLEAR signal?
This is a good question! I think the confusion that it's causing is due to Intel's strange naming schemes which often overload terms standard in academia. I will try to both answer your question and also clear up the confusion I see in the comments.
First of all, I agree that in standard computer science terminology a branch target buffer isn't synonymous with branch predictor. However, in Intel terminology the Branch Target Buffer (BTB) [in capitals] is something specific and contains both a predictor and a Branch Target Buffer Cache (BTBC), which is just a table of branch instructions and their targets on a taken outcome. This BTBC is what most people understand as a branch target buffer [lower case]. So what is the Branch Address Calculator (BAC) and why do we need it if we have a BTB?
So, you understand that modern processors are split into pipelines with multiple stages. Whether this is a simple pipelined processor or an out-of-order superscalar processor, the first stages are typically fetch then decode. In the fetch stage all we have is the address of the current instruction contained in the program counter (PC). We use the PC to load bytes from memory and send them to the decode stage. In most cases we increment the PC in order to load the subsequent instruction(s), but in other cases we process a control flow instruction which can modify the contents of the PC completely.
The purpose of the BTB is to guess if the address in the PC points to a branch instruction, and if so, what should the next address in the PC be? That's fine, we can use a predictor for conditional branches and the BTBC for the next address. If the prediction was right, that's great! If the prediction was wrong, what then? If the BTB is the only unit we have then we would have to wait until the branch reaches the issue/execute stage of the pipeline. We would have to flush the pipeline and start again. But not every situation needs to be resolved so late. This is where the Branch Address Calculator (BAC) comes in.
The BTB is used in the fetch stage of the pipeline but the BAC resides in the decode stage. Once the instruction we fetched is decoded, we actually have a lot more information which can be useful. The first new piece of information we know is: "is the instruction I fetched actually a branch?" In the fetch stage we have no idea and the BTB can only guess, but in the decode stage we know it for sure. It is possible that the BTB predicts a branch when in fact the instruction is not a branch; in this case the BAC will halt the fetch unit, fix the BTB, and reinitiate fetching correctly.
What about branches like unconditional relative and call? These can be validated at the decode stage. The BAC will check the BTB, see if there are entries in the BTBC and set the predictor to always predict taken. For conditional branches, the BAC cannot confirm if they are taken/not-taken yet, but it can at least validate the predicted address and correct the BTB in the event of a bad address prediction. Sometimes the BTB won't identify/predict a branch at all. The BAC needs to correct this and give the BTB new information about this instruction. Since the BAC doesn't have a conditional predictor of its own, it uses a simple mechanism (backwards branches taken, forward branches not taken).
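The fallback rule mentioned above (backwards branches taken, forward branches not taken) can be sketched in a few lines of Python. This is only an illustration of the rule as stated, not a model of the actual hardware; the function name and the use of `None` for "the BTB made no prediction" are assumptions made for the example.

```python
def predict_taken(btb_prediction, displacement):
    """Use the BTB's prediction if it made one, else fall back to the static rule.

    btb_prediction: True/False if the BTB predicted, None if it did not.
    displacement:   signed branch displacement (negative means backward).
    """
    if btb_prediction is not None:
        return btb_prediction
    # Static rule: backward branches (typically loops) are predicted taken,
    # forward branches are predicted not taken.
    return displacement < 0
```

For example, a loop-closing branch with displacement -0x5c and no BTB entry would be predicted taken, while a forward skip of +0x40 would be predicted not taken.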
Somebody will need to confirm my understanding of these hardware counters, but I believe they mean the following:
BACLEAR.CLEAR is incremented when the BTB in fetch does a bad job and the BAC in decode can fix it.
BPU_CLEARS.EARLY is incremented when fetch decides (incorrectly) to load the next instruction before the BTB predicts that it should actually load from the taken path instead. This is because the BTB requires multiple cycles and fetch uses that time to speculatively load a consecutive block of instructions. This can be due to Intel using two BTBs, one quick and the other slower but more accurate. It takes more cycles to get a better prediction.
This explains why the penalty of detecting a misprediction in the BTB is 2/3 cycles whereas the penalty of detecting a misprediction in the BAC is 8 cycles.
The fact that the BPU_CLEARS.EARLY description says this event occurs when the BPU predicts, correcting an assumption, implies there are 3 stages in the front end: assume, predict and calculate. My current guess is that an early clear flushes the pipeline stages that come before the stage at which a prediction from the L1 BTB is even known, when the prediction is 'taken' as opposed to not taken.
The BTB set contains 4 ways for a maximum of 4 branches per 16 bytes (where all the ways in the set get tagged with the same tag indicating that particular 16 byte aligned block). Each way has an offset indicating the 4 LSBs of the address therefore the byte position within 16 bytes. Each entry also has a speculative bit, valid bit, pLRU bits, a speculative local BHR, a real local BHR, and each way shares the set BPT (PHT) as a second level of prediction. This may be alloyed with a GHR / speculative GHR.
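To make the set/tag/offset layout easier to picture, here is a minimal Python sketch of such a lookup. It is a toy under stated assumptions, not the real structure: the 7 set-index bits (128 sets x 4 ways = 512 entries), the naive replacement policy, and all class and function names are made up for illustration. The history bits (BHR, PHT) and the speculative bit are left out entirely.

```python
from dataclasses import dataclass

OFFSET_BITS = 4   # byte position within the 16-byte block, as in the text
SET_BITS = 7      # assumption: 128 sets x 4 ways = 512 entries
WAYS = 4


@dataclass
class Way:
    valid: bool = False
    tag: int = 0
    offset: int = 0
    target: int = 0
    predict_taken: bool = False


def split_address(addr):
    """Split an address into (tag, set index, offset within the 16-byte block)."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    set_index = (addr >> OFFSET_BITS) & ((1 << SET_BITS) - 1)
    tag = addr >> (OFFSET_BITS + SET_BITS)
    return tag, set_index, offset


class ToyBTB:
    def __init__(self):
        self.sets = [[Way() for _ in range(WAYS)] for _ in range(1 << SET_BITS)]

    def insert(self, branch_addr, target, predict_taken):
        tag, index, offset = split_address(branch_addr)
        ways = self.sets[index]
        # Naive replacement: first invalid way, else way 0 (real hardware uses pLRU).
        victim = next((w for w in ways if not w.valid), ways[0])
        victim.valid, victim.tag, victim.offset = True, tag, offset
        victim.target, victim.predict_taken = target, predict_taken

    def lookup(self, fetch_addr):
        """Target of the first predicted-taken branch at or after fetch_addr
        within its 16-byte block, or None if there is no such entry."""
        tag, index, offset = split_address(fetch_addr)
        hits = [w for w in self.sets[index]
                if w.valid and w.tag == tag and w.offset >= offset and w.predict_taken]
        if not hits:
            return None
        return min(hits, key=lambda w: w.offset).target
```

All ways in a set that belong to the same 16-byte block carry the same tag, so one tag comparison per way finds every known branch in the block, and the offset decides which one comes first in fetch order.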
I think the BPU provides a 64B prediction block to the uop cache and instruction cache (it used to be 32B, and it was 16B on P6). For the legacy route it needs to provide a 64 (or 32/16) byte prediction block which is a set of 64 bit masks which mark predicted branch instructions, prediction directions and which byte is a branch target. This information will be furnished by the L1 BTB while the fetch for the 64 byte line is underway such that 16 byte aligned (IFU has always been 16B) blocks that are read out of it with no used bits at all will not be fetched by the instruction predecoder (unused is the default because on architectures where the prediction block is smaller than the line size, the BPU may only provide bitmasks for 16 or 32B of the line). The BPU therefore provides the predictions made mask, the used/unused mask (marking bytes after first taken branch in the first prediction block and before the branch target in the second prediction block as unused and the rest used), the prediction directions mask; and the ILD provides the branch instructions mask. The first used byte in the prediction block is implicitly a branch target or the start of the instruction flow after a resteer or switch from the uop cache (DSP) to the legacy pipeline (MITE). The used bytes within the prediction block make up a prediction window.
Here is a P6 pipeline. Using this as an example, an early clear is in the 3rd cycle (13), when a prediction is made (and the target and entry type is read, and therefore unconditional branch targets are now known as well as conditional and their predictions). The first predicted taken branch target in the set for the 16 byte block is used. At this point, the 2 pipe stages before it have already been filled with fetches and beginning of lookups from the next sequential 16 byte blocks, which means that they need to be flushed if there is any taken prediction (otherwise it doesn't need to be as the next sequential 16 byte block is already beginning to be looked up in the pipestage before it), leaving a 2 cycle gap or bubble. The cache lookup occurs at the same time as the BTB lookup, so both the BTB and cache parallel 2 pipestages will have to be flushed, whereas the 3rd stage doesn't need to be flushed from the cache or the BTB because the IP is on a confirmed path and is the IP being currently looked up for the next one. In fact, in this P6 design, there is only a one cycle bubble for this early clear, because the new IP can be sent to the first stage to decode a set again on the high edge of clock while those other stages are being flushed.
This pipelining is more beneficial than waiting for the prediction before beginning a lookup on the next IP. This would mean a lookup every other cycle. This would give a throughput of 16 bytes of predictions every 2 cycles, so 8B/c. In the P6 pipelined scenario, the throughput is 16 bytes per cycle on a correct assumption and 8B/c on an incorrect assumption. Obviously faster. If we assume 2/3 of assumptions are correct for 1 in 9 instructions being a taken branch for 4 instructions per block, this gives a throughput of 16B per ((1*0.666)+(2*0.333)) = 1.332 cycles instead of 16B per 2 cycles.
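The arithmetic can be checked with a few lines of Python. This is just the calculation from the paragraph above written out with exact fractions; the function name and the `bubble_cycles` parameter are made up for illustration, and the 2/3 hit rate is the assumption already stated in the text.

```python
from fractions import Fraction


def expected_cycles_per_block(p_correct, bubble_cycles=1):
    """Average cycles per 16-byte block: 1 cycle when the sequential
    assumption holds, 1 + bubble_cycles when a taken prediction forces a clear."""
    return p_correct * 1 + (1 - p_correct) * (1 + bubble_cycles)


cycles = expected_cycles_per_block(Fraction(2, 3))  # 4/3 cycles per block
pipelined = Fraction(16) / cycles                   # bytes per cycle, pipelined lookup
unpipelined = Fraction(16, 2)                       # bytes per cycle, lookup every other cycle

print(float(cycles), float(pipelined), float(unpipelined))
```

With exact fractions the average is 4/3 cycles per block (the 1.332 above comes from rounding 2/3 and 1/3 down), so the pipelined design delivers about 12B/c against 8B/c.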
If this is true, every taken branch will cause an early clear. This is however not the case when I use the event on my KBL. Hopefully the event is actually wrong because it is supposed to not be supported on my KBL, but does show a random low number, so hopefully it is counting something else. This also does not appear to be supported by the following https://gist.github.com/mattgodbolt/4e2cbb1c9aa97e0c5478 https://github.com/mattgodbolt/agner/blob/master/tests/branch.py. Given the 900k instructions and 100k early clears, I do not see how you can have an odd number of early clears if you use my definition of early clears and then look at his code. If we assume that the window is 32B for that CPU, then if you use an alignment of 4 on each branch instruction in that macro you get a clear every 8 instructions, because 8 will fit into the 32B aligned window.
I am not sure why Haswell and Ivy Bridge have such values for early and late clears but these events (0xe8) disappear starting with SnB, which happens to coincide with when the BTB was unified into a single structure. It also looks like the late clears counter is now counting the early clears event because it is the same number as the early clears on the Arrandale CPU, and the early clears event is now counting nothing. I'm also not sure why Nehalem has a 2 cycle bubble for early clears as the design of the L1 Nehalem BTB doesn't seem to change much from the P6 BTB, both 512 entries with 4 ways per set. It is probably because it has been broken down into more stages due to the higher clock speeds and hence also the longer L1i cache latency.
The late clear (BPU_CLEARS.LATE) appears to happen at the ILD. In the diagram above, the cache lookup takes only 2 cycles. In more recent processors, it apparently takes 4 cycles. This allows another L2 BTB to be inserted and a lookup in it to take place. 'MRU bypass' and 'MRU conflicts' could just mean that there was a miss in the MRU BTB or it could also mean that the prediction in the L2 is different to the one in L1 in the event that it uses a different prediction algorithm and history file. On my KBL, which does not support either event, I always get 0 for ILD_STALL.MRU but not BPU_CLEARS.LATE. The 3 cycle bubble comes from the BPU at stage 5 (which is also an ILD stage) resteering the pipeline at the low edge of stage 1 and flushing stages 2, 3 and 4 (which falls in line with cited L1i latencies of 4 cycles, as the L1i lookup occurs across stages 1–4 for a hit+ITLB hit). As soon as the prediction is made, the BTBs update the entries' speculative local BHR bits with the prediction that was made.
A BACLEAR happens for instance when the IQ compares the predictions-made mask with the branch instruction mask produced by the predecoder, and then for certain instruction types like a relative jump, it will check the sign bit to perform a static branch prediction. I'd imagine the static prediction happens as soon as it enters the IQ from the predecoder, such that instructions that immediately go to the decoder contain the static prediction. The branch now being predicted taken will result in a BACLEAR_FORCE_IQ when the branch instruction is decoded, because there won't be a target to verify when the BAC calculates the absolute address of the relative conditional branch instruction, which it needs to verify when it is predicted taken.
The BAC at the decoders also makes sure the relative branches and direct branches have the correct branch target prediction after calculating the absolute address from the instruction itself and comparing it with the predicted target; if they differ, a BACLEAR is issued. For relative jumps, static prediction in the BAC uses the sign bit of the jump displacement to statically predict taken / not taken if a prediction has not been made but also overrides all return predictions as taken if the BTB does not support return entry types (it doesn't on P6 and makes no prediction, instead the BAC uses the BPU's RSB mechanism and it is the first point in the pipeline that a return instruction is acknowledged) and overrides all register indirect branch predictions as taken on P6 (because there is no IBTB) as it uses the statistic that more branches are taken than not. The BAC calculates and inserts the absolute target from the relative target into the uop and inserts the IP delta into the uop and inserts the fall through IP (NLIP) into the BPU's BIT, which may be tagged to the uop, or more likely the BIT entries work on a corresponding circular queue which will stall if there aren't enough BIT entries, and the indirect target prediction or known target is inserted into the uop 64 bit immediate field. These fields in the uop are used by the allocator for allocation into the RS/ROB later on. The BAC also informs the BTB of any spurious predictions (non branch instructions) that need their entries deallocating from the BTB. At the decoders, branch instructions are detected early in the logic (when prefixes are decoded and the instruction is examined to see if it can be decoded by the decoder) and the BAC is accessed in parallel with the rest. The BAC inserting the known or otherwise predicted target into the uop is known as converting an auop into a duop. The prediction is encoded into the uop opcode.
The BAC likely instructs the BTB to speculatively update its BTB for the detected branch instruction's IP. If the target is now known and no prediction was made for it (meaning it wasn't in the cache) -- it is still speculative as although the branch target is known for certain, it still could be on a speculative path, so is marked with a speculative bit -- this will now immediately provide early steers especially for unconditional branches now entering the pipeline but also for conditional ones, with a blank history so they will predict not taken next time, rather than having to wait until retire.
The IQ above contains a bitmask field for branch prediction directions (BTBP) and branch predictions made / no prediction made (BTBH) (to distinguish which of the 0s in the BTBP are not taken as opposed to no prediction made) for each of the 8 instruction bytes in an IQ line as well as the target of a branch instruction, meaning there can only be one branch per IQ line and it ends the line. This diagram does not show the branch instruction mask produced by the predecoder that shows what instructions actually are branches such that the IQ knows what not-made predictions it needs to make a prediction for (and what ones are not branch instructions at all).
The IQ is a contiguous block of instruction bytes and the ILD populates 8-bit bitmasks which identify the first opcode byte (OpM) and instruction end byte (EBM) of each macroinstruction as it wraps round bytes into the IQ. It probably also provides bits indicating whether it is a complex instruction or a simple instruction (as suggested by the 'predecode bits' on many AMD patents). The gaps between these markers are implicitly prefix bytes for the following instruction. I'm thinking the IQ is designed such that the uops it issues in the IDQ/ROB will rarely outrun the IQ such that the head pointer in the IQ starts overwriting macroinstructions still tagged in the IDQ waiting to be allocated, and when it does, there is a stall, so the IDQ tags refer back to the IQ, which the allocator accesses. I think the ROB uses this uop tag as well. The IQ on SnB, if 16 bytes * 40 entries, contains 40 macroops in the worst case, 320 in the average case, 640 in the best case. The number of uops these produce will be much greater, so it will rarely outrun, and when it does, I guess it stalls decode until more instructions retire. The tail pointer contains the recently allocated tag by the ILD, the head pointer contains the next macroinstruction waiting to retire, and the read pointer is the current tag to be consumed by the decoders (which moves towards the tail pointer). Although, this becomes difficult now that some if not the majority of the uops in the path come from the uop cache since SnB. The IQ may be allowed to outrun the back end in the event that uops are not tagged with the IQ entries (and the fields in the IQ are instead inserted into uops directly), and this will be detected and the pipeline will just be resteered from the beginning.
When the allocator allocates a physical destination (Pdst) for a branch micro-op into the ROB, the allocator provides the Pdst entry number to the BPU. The BPU inserts this into the correct BIT entry assigned by the BAC (which is probably at the head of a circular queue of active BIT entries that are yet to be allocated a Pdst). The allocator also extracts fields from the uop and allocates the data into the RS.
The RS contains a field that indicates whether an instruction is a MSROM uop or a regular uop, which the allocator populates. The allocator also inserts the confirmed absolute target or the predicted absolute target into the immediate data and as a source, renames the flags register (or just a flag bit) and in the case of an indirect branch, there is also the renamed register that contains the address as another source. The Pdst in the PRF scheme would be the ROB entry, which as a Pdst would be the retirement macro-RIP or micro-IP register. The JEU writes the target or fallthrough to that register (it may not need to if the prediction is correct).
When the reservation station dispatches a branch micro-op to a jump execution unit located in the integer execution unit, the reservation station informs the BTB of the Pdst entry for the corresponding branch micro-op. In response, the BTB accesses the corresponding entry for the branch instruction in the BIT and the fall through IP (NLIP) is read out, decremented by the IP delta in the RS, and decoded to point to the set that the branch entry will be updated/allocated.
The outcome from the renamed flag register source Pdst to determine whether the branch is taken / not taken is then compared with the prediction in the opcode in the scheduler, and additionally, if the branch is indirect, the predicted target in the BIT is compared with the address in the source Pdst (that was calculated and became available in the RS before it was scheduled and dispatched) and it is now known whether a correct prediction was made or not and whether the target is correct or not.
The JEU propagates an exception code to the ROB and flushes the pipeline (JEClear -- which flushes the whole pipeline before the allocate stage, as well as stalls the allocator) and redirects the next IP logic at the start of the pipeline using the fallthrough (in BIT) / target IP appropriately (as well as microsequencer if it is a microbranch misprediction; the RIP directed to the start of the pipeline will be the same one throughout the MSROM procedure). Speculative entries are deallocated and true BHRs are copied into the speculative BHRs. In the event there is a BOB in the PRF scheme, the BOB takes snapshots of the RAT state for every branch instruction and when there is a misprediction. The JEU rolls back the RAT state to that snapshot and the allocator can proceed immediately (which is particularly useful for microbranch misprediction as it is closer to the allocator therefore the bubble will not be as well hidden by the pipeline), rather than stalling the allocator and having to wait until retire for the retirement RAT state to be known and using that to restore the RAT and then clear the ROB (ROClear, which unstalls the allocator). With a BOB, the allocator can start issuing new instructions while the stale uops continue to execute, and when the branch is ready to retire, the ROClear only clears the uops between the retired misprediction and the new uops. If it is an MSROM uop, because it may have completed, the start of the pipeline still needs to be redirected to the MSROM uop again, but this time it will start at the redirected microip (this is the case with inline instructions, and it may be able to replay it out of the IQ). If a misprediction happens in an MSROM exception then it doesn't need to resteer the pipeline, just redirects it directly, because it has taken over the IDQ issue until the end of the procedure -- the issue may have already ended for inline issues.
The ROClear in response to the branch exception vector in the ROB actually happens on the second retirement stage RET2 (which is really the 3rd of 3 stages of typical retirement pipeline) when the uops are retired. The macroinstruction only retires and exceptions only trigger and the macroinstruction RIP only updates (with new target or increase by IP delta in the ROB) when the EOM uop (end of macroinstruction) marker retires, even if a non EOM instruction writes to it, it is not written to the RRF immediately unlike other registers -- anyway, the branch uop is likely going to be the final uop in typical branch macroinstruction handled by the decoders. If this is a microbranch in an MSROM procedure, it will not update the BTB; it updates the uIP when it retires, and the RIP is not updated until the end of the procedure.
If a generic non-mispredict exception occurs (i.e. one that requires a handler) during a MSROM macroop execution, once it has been handled, the microip that caused the exception is restored by the handler to the uIP register (in the event that it is passed to the handler when it is called), as well as the current RIP of the macroinstruction which triggered the exception, and when the exception handling ends, instruction fetch is resumed at this RIP+uIP: the macroinstruction is refetched and reattempted in the MSROM, which starts at the uIP signalled to it. The RRF write (or retirement RAT update on the PRF scheme) for previous uops in a complex non-MSROM macroinstruction may occur on the cycle before the EOM uop retires, which means that a restart can happen at a certain uop within a regular complex macroop and not just a MSROM macroop, and in this case, the instruction flow is restarted at the BPU at the RIP, and the complex decoder is configured with valid / invalid bits on the PLA cuop outputs. The uIP for this regular complex instruction that is used to configure the complex decoder valid bits is a value between 0-3, which I think the ROB sets to 0 at each EOM and increments for each microop retired, so that the non-MSROM complex instructions can be addressed, and for MSROM complex instructions, the MSROM routine contains a uop that tells the ROB the uIP of that instruction. The architectural RIP register however, which is updated by the IP delta only when the EOM uop retires, is still pointing to the current macroop (because the EOM uop failed to retire), which only happens for exceptions but not hardware interrupts, which can't interrupt MSROM procedures or complex instruction mid retirement (software interrupts are similar and trigger at the EOM -- the trap MSROM handler performs a macrojump to the RIP of the software trap handler once it has finished).
The BTB read and tag comparison happens in RET1 while the branch unit writes back the results, and then in the next cycle, perhaps also during RET1 (or maybe this is done in RET2), the tags in the set are compared and then, if there is a hit, a new history BHR is calculated; if there is a miss, an entry needs to be allocated at that IP with that target. Only once the uop retires in order (in RET2) can the result be placed into the real history and the branch prediction algorithm is utilised to update the pattern table where an update is required. If the branch requires allocation, the replacement policy is utilised to decide the means for allocating the branch. If there is a hit, the target will already be correct for all direct and relative branches, so it doesn't have to be compared, in the event of no IBTB. The speculative bit is now removed from the entry if present. Finally, in the next cycle, the branch is written in the BTB cache by the BTB write control logic. The first part of the BTB lookup may be able to go ahead throughout RET1 and then may stall the BTB write pipeline until RET2 when the stage waiting to write to the BTB's ROB entry retires, otherwise, the lookup could be decoupled and the first part completes and writes data to, for instance, the BIT, and at RET2 the corresponding entry to the one retiring is just written back to the BTB (which would mean decoding the set again, comparing tags again and then writing the entry, so maybe not).
If P6 had a uop cache, the pipeline would be something like:
1H: select IP
1L: BTB set decode + cache set decode (physical/virtual index) + ITLB lookup + uop cache set decode
2H: cache read + BTB read + uop cache read
2L: cache tag compare + BTB tag compare + uop cache tag compare; if uop cache hit, stall until uop cache can issue, then clock gate legacy decode pipeline
3H: predict, if taken, flush 3H,2L,2H,1L
3L: if taken, begin a 1L with the new IP to decode the new set, and continue to 4L with the current 16-byte block in which the branch instruction resides
As for the uop cache, because it is past the stage of the BAC, there is never going to be a bogus branch, an incorrect prediction for an unconditional branch, or an incorrect target for a non-indirect branch. The uop cache will use the used/unused mask from the BPU to emit uops for instructions that begin at those bytes, and will use the prediction direction mask to change the macrobranch uops to predicted not taken / predicted taken macrobranch uops (T/NT predictions are inserted into the uop itself). If a branch is predicted taken, the uop cache stops emitting uops for that 64B aligned block (previously 32B, and before that 16B) and waits for the next window right behind it in the pipeline. The uop cache is going to know which uops are branches and probably statically predicts not taken for all branches without a prediction, or might have a more advanced static prediction. Indirect target predictions from the IBTB are inserted into the uop immediate field, and the uop cache will then wait for the next BPU prediction block if this branch is also predicted taken. I would imagine the uop cache creates BIT entries and updates predictions in the BTBs, to ensure that uop cache and MITE (legacy decode) uops update the history in correct sequential order.

Understanding CUDA serialization and reconvergence point

EDIT: I realized that I, unfortunately, overlooked a semicolon at the end of the while statement in the first example code and misinterpreted it myself. So there is in fact an empty loop for threads with threadIdx.x != s, a convergence point after that loop, and a thread waiting at this point for all the others without incrementing the s variable. I am leaving the original (uncorrected) question below for anyone interested in it. Be aware that there is a semicolon missing at the end of the second line in the first example, and thus s++ is not part of the loop body.
--
We were studying serialization in our CUDA lesson and our teacher told us that a code like this:
__shared__ int s = 0;
while (s != threadIdx.x)
s++; // serialized code
would end up with a HW deadlock because the nvcc compiler puts a reconvergence point between the while (s != threadIdx.x) and s++ statements. If I understand it correctly, this means that once the reconvergence point is reached by a thread, this thread stops execution and waits for the other threads until they reach the point too. In this example, however, this never happens, because thread #0 enters the body of the while loop, reaches the reconvergence point without incrementing the s variable and other threads get stuck in an endless loop.
A working solution should be the following:
__shared__ int s = 0;
while (s < blockDim.x)
if (threadIdx.x == s)
s++; // serialized code
Here, all threads within a block enter the body of the loop, all evaluate the condition and only thread #0 increments the s variable in the first iteration (and loop goes on).
My question is, why does the second example work if the first hangs? To be more specific, the if statement is just another point of divergence and, in terms of assembly, should be compiled into the same conditional jump instruction as the condition in the loop. So why isn't there a reconvergence point before s++ in the second example, and is it in fact placed immediately after the statement?
In other sources I have only found that divergent code is computed independently for every branch - e.g. in an if/else statement, first the if branch is computed with all else-branch threads masked within the same warp, and then the other threads compute the else branch while the first ones wait. There's a reconvergence point after the if/else statement. Why then does the first example freeze, instead of having the loop split into two branches (a true branch for one thread and a waiting false branch for all the others in a warp)?
Thank you.
It does not make sense to put the reconvergence point between while (s != threadIdx.x) and s++; because that would disrupt the program flow: the reconvergence point for a piece of code should be known at compile time to be reachable by all threads. The picture below shows the flowchart of your first piece of code and the possible and impossible points of reconvergence.
Regarding this answer about recording the convergence point via the SSY instruction, I created the simple kernel below, resembling your first piece of code
__global__ void kernel_1() {
__shared__ int s;
if(threadIdx.x==0)
s = 0;
__syncthreads();
while (s == threadIdx.x)
s++; // serialized code
}
and compiled it for CC=3.5 with -O3. I then used the cuobjdump binary tool on the output to inspect the CUDA assembly. The result is:
I'm not an expert in reading CUDA assembly, but I can see the while loop condition checks at lines 0038 and 00a0. At line 00a8, it branches to 0x80 if the while loop condition is satisfied and executes the code block again. The reconvergence point is set up at line 0058, which designates line 0xb8 as the reconvergence point; this is after the loop condition check, near the exit.
Overall, it is not clear what you're trying to achieve with this piece of code. Also, in the second piece of code, the reconvergence point should again be after the while loop code block (I don't mean between while and if).
The reason why it "hangs" is neither a HW deadlock nor branching, at least not directly. You produce an endless loop for one or multiple threads (as already suspected).
In your example, there isn't really a convergence point. Since you do not use any synchronization, there aren't any threads that actually wait. What happens here with the while-loop is pretty much a busy-wait.
A kernel only finishes if all threads return. Since you have one (or multiple) endless loops (by accident maybe even none - this is unlikely however) the kernel will never finish.
You declared a shared variable s. This variable is known to all threads within a block.
With your while-statement you basically say (to each thread): increment s until it reaches the value of your (local) thread id. Since all threads are incrementing s in parallel, you introduce race conditions.
Example:
Thread 5 is looping and checking for s to become 5
s is 4
Two threads increment s, it becomes 6
At the same time thread 5 only reached the end of its loop.
Now it reaches the next loop iteration and checks for s and it's not 5.
Thread 5 will never be able to finish since you check via == and the value of s already exceeded the value of the thread id.
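The overshoot described in this example can be reproduced with a small sequential model. This is only a sketch under stated assumptions: the function name `simulate` is hypothetical, every unfinished thread is assumed to check the condition on the same value of s, and the increments are assumed to be applied one after another without lost updates. A real GPU makes no such guarantees, which is exactly why the outcome is a race.

```python
def simulate(num_threads, max_rounds=100):
    """Toy model of `while (s != tid) s++;` run by several threads at once."""
    s = 0
    finished = set()
    for _ in range(max_rounds):
        # Phase 1: every unfinished thread tests the loop condition on the same s.
        finished |= {t for t in range(num_threads) if t not in finished and s == t}
        still_looping = [t for t in range(num_threads) if t not in finished]
        if not still_looping:
            break
        # Phase 2: each looping thread performs s++ (assumed not to lose updates).
        s += len(still_looping)
    return s, sorted(finished)


print(simulate(4)[1])  # [0, 3]: threads 1 and 2 never see their own id in s
print(simulate(2)[1])  # [0, 1]: with two threads the loop happens to terminate
```

With four threads, s jumps from 0 to 3 in the first round, so threads 1 and 2 can never satisfy the equality test, while the two-thread case terminates only by accident.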
Also your solution is quite confusing, because each thread executes the serialized code consecutively (which probably was the intention after all - even though that actually is strange):
Thread 0 will execute the serialized code
After that, thread 1 will execute the serialized code
and so on
Most examples show a program where each thread works on some code, then all threads are synchronized and only a single thread executes some more code (maybe it needed the results of all threads).
So, your second example "works" because no thread is stuck in an endless loop; however, I can't think of a reason why anyone would use such code, since it is confusing and, well, not parallel at all.
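The one-thread-at-a-time behaviour of the second example can be illustrated with a short model. It is a sketch only: the function name `serialized_order` is hypothetical, and all threads of the block are assumed to evaluate the if condition on the same snapshot of s in each loop iteration.

```python
def serialized_order(block_dim):
    """Toy model of `while (s < blockDim.x) if (threadIdx.x == s) s++;`."""
    s = 0
    order = []
    while s < block_dim:              # every thread sees the same loop condition
        snapshot = s                  # all threads test the if on the same value
        for tid in range(block_dim):
            if tid == snapshot:       # exactly one thread matches per iteration
                order.append(tid)     # stands in for the serialized code
                s += 1
    return order


print(serialized_order(4))  # [0, 1, 2, 3]
```

Because the loop condition does not depend on the thread id, every thread leaves the loop in the same iteration, and s only ever advances by one per iteration.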

<<module name>> not a task or void function in verilog

I am trying to create a module for a carry select adder in Verilog. Everything works fine except the following portion, which causes a compilation error.
module csa(a,b,s,cout);
input[15:0] a,b;
output [15:0] s;
output cout;
wire zero_c1, zero_c2,zero_c3,zero_c4,zero_c5;
wire one_c1, one_c2,one_c3,one_c4,one_c5;
wire temp_c1,temp_c2,temp_c3,temp_c4,temp_c5;
wire [15:0] s_zero, s_one;
initial
begin
fork
fa(s[0], temp_c1,a[0],b[0],0);
fa_one(s_zero[1],s_one[1],zero_c1,one_c1,a[1],b[1]);
fa_two(s_zero[3:2],s_one[3:2],zero_c2,one_c2,a[3:2],b[3:2]);
fa_three(s_zero[6:4],s_one[6:4],zero_c3,one_c3,a[6:4],b[6:4]);
fa_four(s_zero[10:7],s_one[10:7],zero_c4,one_c4,a[10:7],b[10:7]);
fa_five(s_zero[15:11],s_one[15:11],zero_c5,one_c5,a[15:11],b[15:11]);
join
end
When I try to compile that it says -
the module "fa", "fa_one" are not a task or void function
I deleted the "initial" statement and now it says -
Syntax error near "fork", expecting "endmodule"
I just want to run the code between join and fork in parallel. I have also confirmed that the module fa, fa_one works fine.
I would appreciate it if anyone could help me by pointing out what I am doing wrong here. Thanks.
Verilog modules are not run or executed but instantiated; they represent physical blocks of hardware.
Everything is in parallel unless you have made an effort to time-share pieces of hardware. For example, you might write an ALU core which exists only once, but use a program ROM to tell it which instruction to process every clock cycle.
Inside your modules you can have combinatorial code and sequential code.
Combinatorial logic will simulate in 0 time but will actually take some time for values to propagate through when placed on real devices.
If this propagation delay is not thought about and very large blocks of logic are created, you will struggle to close timing at synthesis, because the settling time through the logic is greater than the clock period available between the registers on either side of the combinatorial logic.
Sequential logic implies that the results are held in flip-flops, which only update on clock edges. This means chains of sequential logic can take many clock cycles for data to propagate.
When pipelining a processor you break individual sections up with flip-flops, giving each section a full clock cycle for combinatorial propagation, at the expense of taking several clock cycles to calculate a single result.
To correct your example you would just have:
module csa(
input [15:0] a,
input [15:0] b,
output [15:0] s,
output cout
);
wire zero_c1, zero_c2,zero_c3,zero_c4,zero_c5;
wire one_c1, one_c2,one_c3,one_c4,one_c5;
wire temp_c1,temp_c2,temp_c3,temp_c4,temp_c5;
wire [15:0] s_zero, s_one;
fa ufa(s[0], temp_c1,a[0],b[0],0);
fa_one ufa_one(s_zero[1],s_one[1],zero_c1,one_c1,a[1],b[1]);
fa_two ufa_two(s_zero[3:2],s_one[3:2],zero_c2,one_c2,a[3:2],b[3:2]);
fa_three ufa_three(s_zero[6:4],s_one[6:4],zero_c3,one_c3,a[6:4],b[6:4]);
fa_four ufa_four(s_zero[10:7],s_one[10:7],zero_c4,one_c4,a[10:7],b[10:7]);
fa_five ufa_five(s_zero[15:11],s_one[15:11],zero_c5,one_c5,a[15:11],b[15:11]);
endmodule
NB: it is module_name #(parameters) instance_name ( ports );
fork is used to run procedural statements within a module in parallel. Separate module instances always run in parallel.
Child modules are instantiated directly within their parent module, not within an initial, begin, or fork which are used for procedural statements. So you can remove the initial, begin, fork, join, and end, and add an endmodule at the end.
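To see what the instantiated blocks compute together, here is a behavioural sketch of a carry-select adder. The block boundaries follow the port slices in the question (bit 0, bit 1, 3:2, 6:4, 10:7, 15:11); the selection step is an assumption, since the question does not show the multiplexers or how cout is driven, and the names `BLOCKS` and `carry_select_add` are hypothetical.

```python
# (lowest bit, width) of each carry-select block, taken from the port slices.
BLOCKS = [(1, 1), (2, 2), (4, 3), (7, 4), (11, 5)]


def carry_select_add(a, b):
    """Behavioural model of a 16-bit carry-select adder; returns (sum, carry out)."""
    # Bit 0 is a plain full adder with carry-in 0.
    a0, b0 = a & 1, b & 1
    total = a0 ^ b0
    carry = a0 & b0
    for low, width in BLOCKS:
        mask = (1 << width) - 1
        x = (a >> low) & mask
        y = (b >> low) & mask
        sum_zero = x + y        # block result assuming carry-in 0
        sum_one = x + y + 1     # block result assuming carry-in 1
        # Assumed multiplexer: the real carry picks one precomputed result.
        chosen = sum_one if carry else sum_zero
        total |= (chosen & mask) << low
        carry = chosen >> width
    return total & 0xFFFF, carry


print(carry_select_add(0xFFFF, 0x0001))  # (0, 1)
```

In hardware both candidate sums of every block are computed at the same time by the separate module instances; only the final selection waits for the carry from the previous block.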

Neon VLD consuming more cycles than what is expected?

I have some simple asm code which loads 12 NEON quad registers, with pairwise add instructions interleaved with the load instructions (to exploit the dual-issue capability). I have verified the code here:
http://pulsar.webshaker.net/ccc/sample-d3a7fe78
As one can see, the code takes around 13 cycles. But when I run the code on the board, the load instructions seem to take more than one cycle per load. I verified this and found that VPADAL takes 1 cycle as stated, but VLD1 takes more than one cycle. Why is that?
I have taken care of the following:
The address is 16 byte aligned.
I have provided the alignment hint in the instruction: vld1.64 {d0, d1}, [r0,:128]!
I tried the preload instruction pld [r0, #192] in places, but that seems to add cycles instead of actually reducing the latency.
Can someone tell me what I am doing wrong, and why there is this latency?
Other Details:
With reference to cortex-a8
arm-2009q1 cross compiler tool chain
coding in assembly
Your code is executing much slower than expected because, as it's currently written, it's causing the perfect storm of pipeline stalls. On any modern CPU with a pipelined architecture, instructions can execute in one cycle under ideal conditions. The ideal conditions are that the instruction is not waiting for memory and doesn't have any register dependencies. The way you've written the code, you're not allowing for the delay in reading from memory, and you're making the next instruction dependent on the result of the read. This causes the worst possible performance. Also, I'm not sure why you're accumulating the pairwise adds into multiple registers. Try something like this:
veor.u16 q12,q12,q12 @ clear accumulated sum
top_of_loop:
vld1.u16 {q0,q1},[r0,:128]!
vld1.u16 {q2,q3},[r0,:128]!
vpadal.u16 q12,q0
vpadal.u16 q12,q1
vpadal.u16 q12,q2
vpadal.u16 q12,q3
vld1.u16 {q0,q1},[r0,:128]!
vld1.u16 {q2,q3},[r0,:128]!
vpadal.u16 q12,q0
vpadal.u16 q12,q1
vpadal.u16 q12,q2
vpadal.u16 q12,q3
subs r1,r1,#8
bne top_of_loop
Experiment with different numbers of load instructions before executing the adds. The point is that you need to allow time for the read to occur before you can use the target register.
Note: Using Q4-Q7 is risky because they're non-volatile registers. On Android you will get random garbage appearing in these (especially Q4).
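As a cross-check of what the accumulation in the loop above computes, here is a small reference model of vpadal.u16 on Q registers. It is a sketch that models only the arithmetic (no timing, no memory alignment); the helper names `vpadal_u16` and `sum_u16` are hypothetical.

```python
def vpadal_u16(acc, src):
    """Model of vpadal.u16 Qd, Qm: add adjacent u16 pairs into four u32 lanes."""
    assert len(acc) == 4 and len(src) == 8
    return [(acc[i] + src[2 * i] + src[2 * i + 1]) & 0xFFFFFFFF for i in range(4)]


def sum_u16(data):
    """Sum a buffer of u16 values, eight lanes (one Q register) at a time."""
    assert len(data) % 8 == 0
    acc = [0, 0, 0, 0]                       # plays the role of q12 after veor
    for offset in range(0, len(data), 8):
        acc = vpadal_u16(acc, data[offset:offset + 8])
    return sum(acc)                          # final horizontal add of the lanes


print(vpadal_u16([0, 0, 0, 0], [1, 2, 3, 4, 5, 6, 7, 8]))  # [3, 7, 11, 15]
```

Comparing the output of the assembly loop against such a model on a test buffer makes it safer to experiment with different orderings of loads and adds.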