What is the Difference between Aging and Healing in AUTOSAR DEM?


Event Aging
The process of aging resets status bit 3 – ConfirmedDTC when a sufficient amount of time
has elapsed so that the cause for the error entry is assumedly not relevant anymore. This
is often used as a trigger to also clear stored snapshots or extended data from the event
memory.
But I don't get the healing process. I couldn't find anything about it.

Aging counter
The Dem module provides the ability to remove a specific event from the event memory if its fault conditions are not fulfilled for a certain period of time (operation cycles). This process is called "aging" or "unlearning". Using this feature requires maintaining an additional NVRAM block.
Healing counter
Available both in positive direction, counting up from 0 (healing not started), latching at 255;
and in reverse counting down from the healing threshold (healing not started) to 0. The
counter is incremented resp. decremented as soon as the healing conditions are fulfilled (at
the end of a ‘passed’ tested operation cycle without failed result), irrespective of the status
of the ‘ConfirmedDTC’ or ‘WarningIndicatorRequested’ status bit.
The up-counting data element corresponds to ‘Cycles Tested Since Last Failed’.
Both data elements are also calculated for events without indicator.

I found the following diagram in the AUTOSAR documentation; now it's clear.

According to the AUTOSAR Dem SWS document:
Healing of diagnostic events
The Dem module provides the ability to activate and deactivate indicators per event
stored in the event memory. The process of deactivation is defined as healing of a
diagnostic event.
Aging of diagnostic events
The Dem module provides the ability to remove a specific event from the event memory if its fault conditions are not fulfilled for a certain period of time (operation cycles). This process is called "aging" or "unlearning".
A few points to note (from my point of view):
1 - Each of them has a separate counter and a separate threshold. When the counter value reaches the configured threshold, the corresponding action is taken.
2 - Normally, healing comes before aging.
3 - Aging resets the ConfirmedDTC bit in the status byte of the DTC. Healing just means we have had an operation cycle in which the event was tested and its TestFailedThisOperationCycle bit was never set.
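The three points above can be sketched as a small Python model. This is only an illustration, not the Dem implementation: the threshold values, the `EventEntry` fields and the `end_of_operation_cycle` function are hypothetical names, and the model assumes that both counters only advance in tested cycles without a failed result and that aging starts only after healing has completed.

```python
from dataclasses import dataclass

# Hypothetical thresholds; real values come from the Dem configuration.
HEALING_THRESHOLD = 3  # passed cycles until the indicator is switched off
AGING_THRESHOLD = 5    # further passed cycles until the entry is removed


@dataclass
class EventEntry:
    """Simplified view of one stored event (assumed fields)."""
    confirmed_dtc: bool = True
    warning_indicator: bool = True
    stored: bool = True
    healing_counter: int = 0
    aging_counter: int = 0


def end_of_operation_cycle(entry, tested, failed):
    """Update the counters at the end of one operation cycle."""
    if failed:
        # A failed result restarts both processes.
        entry.healing_counter = 0
        entry.aging_counter = 0
        entry.warning_indicator = True
        return
    if not tested:
        return  # assumption: untested cycles count for neither process
    if entry.warning_indicator:
        # Healing: deactivate the indicator after enough passed cycles.
        entry.healing_counter += 1
        if entry.healing_counter >= HEALING_THRESHOLD:
            entry.warning_indicator = False
    elif entry.stored:
        # Aging: reset ConfirmedDTC and drop the entry from event memory.
        entry.aging_counter += 1
        if entry.aging_counter >= AGING_THRESHOLD:
            entry.confirmed_dtc = False
            entry.stored = False
```

With these assumed thresholds, three passed cycles switch the indicator off while ConfirmedDTC stays set, and five more passed cycles clear ConfirmedDTC and remove the entry.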

Related

DEM Fault detection counter vs aging counter

According to the Dem specification (AR 4.4.0, page 227, section 8.3.3.12),
Dem_GetFaultDetectionCounter gets the fault detection counter of an event, but I cannot find any clear explanation of how fault detection works.
My question is:
What is the fault detection counter, and how does it differ from the aging counter?

The debouncing of DemEvents is specified in AUTOSAR Dem chapter 7.7.3 "Debouncing of diagnostic events".
The fault detection counter is used for this debouncing.
-128 means PASSED, +127 FAILED, 0 means, the monitor did not report anything (neither PASSED nor FAILED) yet. The sint8 range -128 .. 0 .. +127 is according to UDS spec ISO14229-1. The Dem can be configured with different thresholds with sint16, but they are linearly scaled down to sint8 for FDC when reading out.
A monitor can report (using Dem_SetEventStatus(DemEventId, DemEventStatus)):
DEM_EVENT_STATUS_FAILED - set FDC to thresholdMax
DEM_EVENT_STATUS_PREFAILED - increment FDC up to thresholdMax
DEM_EVENT_STATUS_PREPASSED - decrement FDC down to thresholdMin
DEM_EVENT_STATUS_PASSED - set FDC to thresholdMin
Depending on the configured debounce algorithm, counter-based or timer-based debouncing is used.
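As a rough illustration of counter-based debouncing, here is a minimal Python sketch of the four reports listed above. The class name, the step sizes and the threshold values are assumptions made for the example, and the linear scaling to the sint8 FDC range is one plausible way to do it rather than the exact formula from the specification.

```python
FAILED, PREFAILED, PREPASSED, PASSED = "FAILED", "PREFAILED", "PREPASSED", "PASSED"


class CounterDebounce:
    """Toy counter-based debouncer (assumes threshold_min < 0 < threshold_max)."""

    def __init__(self, threshold_min=-100, threshold_max=100, step_up=10, step_down=10):
        self.threshold_min = threshold_min
        self.threshold_max = threshold_max
        self.step_up = step_up
        self.step_down = step_down
        self.counter = 0  # 0: nothing reported yet

    def report(self, status):
        """Apply one monitor report and return the resulting FDC."""
        if status == FAILED:
            self.counter = self.threshold_max
        elif status == PASSED:
            self.counter = self.threshold_min
        elif status == PREFAILED:
            self.counter = min(self.counter + self.step_up, self.threshold_max)
        elif status == PREPASSED:
            self.counter = max(self.counter - self.step_down, self.threshold_min)
        else:
            raise ValueError("unknown status: %r" % (status,))
        return self.fdc()

    def fdc(self):
        """Scale the internal counter linearly to the sint8 range -128..+127."""
        if self.counter > 0:
            return self.counter * 127 // self.threshold_max
        if self.counter < 0:
            return -(abs(self.counter) * 128 // abs(self.threshold_min))
        return 0
```

In this sketch the event counts as FAILED once `fdc()` returns +127 and as PASSED once it returns -128; any value in between means debouncing is still in progress.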
AUTOSAR Dem chapter 7.7.8 "Aging of diagnostic events" specifies the aging and the aging counter.
The aging counter exists so that a DemEvent can eventually be removed from the event memory. If a DemEvent has been reported before, the aging counter is counted up towards the configured maximum on each new operation cycle without a failed report. When the maximum is reached, the DemEvent and its FreezeFrame/SnapshotData and ExtendedDataRecords can be removed.

Does the process get activated or suspended?

I am still having a little trouble understanding the sensitivity list and how it activates a process.
Most textbooks say that a process is activated every time an event occurs on a signal in the sensitivity list.
process(in)
begin
x <= in;
end process;
Now looking at this example, "in" is an input declared in the entity. If "in" starts at 0 and changes to 1, the process activates and x takes the value of "in". Now suppose that after "in" changed from 0 to 1 it stays at a constant 1. Does this mean the process will not get activated? Will x still output '1'? I want to say that it won't get activated again and will only activate once "in" changes from 1 back to 0. Can someone please confirm?

Within the sensitivity list (I assume this is a hardware description language; VHDL has this exact syntax and format), whenever there is any change on a signal listed there (L -> H, 0 -> 1, 1 -> 0, ...), the process is activated and executes to completion, after which it suspends until the next event. When the process suspends, the values assigned to signals/outputs (depending on how you interpret them) are held on a driver, which updates the given signals after some propagation delay.
So regarding your second statement, yes. If it changes 0 -> 1, the process activates; if it changes 1 -> 0, the process activates; and if in remains 1, the process is not activated, so x keeps its value.
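To make the "activated only on an event" behaviour concrete, here is a small Python model of it. This is a toy sketch, not a VHDL simulator: the `Signal` class is made up for the example, the input is called `din` because `in` is a keyword in Python (it is also a reserved word in VHDL), and the delay before a signal update takes effect is ignored.

```python
class Signal:
    """Toy signal: processes listed in `sensitive` run only when the value changes."""

    def __init__(self, value):
        self.value = value
        self.sensitive = []  # processes that have this signal in their sensitivity list

    def assign(self, new_value):
        if new_value == self.value:
            return  # same value: no event, so no process is activated
        self.value = new_value
        for process in self.sensitive:
            process()  # runs to completion, then suspends again


din = Signal(0)
x = Signal(0)
activations = []  # records the input value seen at each activation


def process():
    activations.append(din.value)
    x.assign(din.value)


din.sensitive.append(process)

din.assign(1)  # 0 -> 1: event, process runs
din.assign(1)  # stays 1: no event, process stays suspended, x keeps its value
din.assign(0)  # 1 -> 0: event, process runs again
```

After these three assignments the process has run exactly twice, once for each change of `din`.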

What branch misprediction does the Branch Target Buffer detect?

I am currently looking at the various parts of the CPU pipeline which can detect branch mispredictions. I have found these are:
1. Branch Target Buffer (BPU CLEAR)
2. Branch Address Calculator (BA CLEAR)
3. Jump Execution Unit (not sure of the signal name here??)
I know what 2 and 3 detect, but I do not understand what misprediction is detected within the BTB. The BAC detects where the BTB has erroneously predicted a branch for a non-branch instruction, where the BTB has failed to detect a branch, or the BTB has mispredicted the target address for a x86 RET instruction. The execution unit evaluates the branch and determines if it was correct.
What type of misprediction is detected at the Branch Target Buffer? What exactly is detected as a misprediction here?
The only clue I could find was this inside Vol 3 of the Intel Developer Manuals (the two BPU CLEAR event counters at the bottom):
BPU predicted a taken branch after incorrectly assuming that it was
not taken.
This seems to imply the prediction is not done "synchronously", but rather "asynchronously", hence the "after incorrectly assuming"??
UPDATE:
Ross, this is the CPU branch circuitry, from the original Intel patent (how's that for "reading"?):
I don't see "Branch Prediction Unit" anywhere? Would it be reasonable that somebody having read this paper would assume that "BPU" is a lazy way of grouping the BTB Circuit, BTB Cache, BAC and RSB together??
So my question still stands, which component raises the BPU CLEAR signal?

This is a good question! I think the confusion that it's causing is due to Intel's strange naming schemes which often overload terms standard in academia. I will try to both answer your question and also clear up the confusion I see in the comments.
First of all. I agree that in standard computer science terminology a branch target buffer isn't synonymous with branch predictor. However in Intel terminology the Branch Target Buffer (BTB) [in capitals] is something specific and contains both a predictor and a Branch Target Buffer Cache (BTBC) which is just a table of branch instructions and their targets on a taken outcome. This BTBC is what most people understand as a branch target buffer [lower case]. So what is the Branch Address Calculator (BAC) and why do we need it if we have a BTB?
So, you understand that modern processors are split into pipelines with multiple stages. Whether this is a simple pipelined processor or an out-of-order superscalar processor, the first stages are typically fetch then decode. In the fetch stage all we have is the address of the current instruction contained in the program counter (PC). We use the PC to load bytes from memory and send them to the decode stage. In most cases we increment the PC in order to load the subsequent instruction(s) but in other cases we process a control flow instruction which can modify the contents of the PC completely.
The purpose of the BTB is to guess if the address in the PC points to a branch instruction, and if so, what should the next address in the PC be? That's fine, we can use a predictor for conditional branches and the BTBC for the next address. If the prediction was right, that's great! If the prediction was wrong, what then? If the BTB is the only unit we have then we would have to wait until the branch reaches the issue/execute stage of the pipeline. We would have to flush the pipeline and start again. But not every situation needs to be resolved so late. This is where the Branch Address Calculator (BAC) comes in.
The BTB is used in the fetch stage of the pipeline but the BAC resides in the decode stage. Once the instruction we fetched is decoded, we actually have a lot more information which can be useful. The first new piece of information we know is: "is the instruction I fetched actually a branch?" In the fetch stage we have no idea and the BTB can only guess, but in the decode stage we know it for sure. It is possible that the BTB predicts a branch when in fact the instruction is not a branch; in this case the BAC will halt the fetch unit, fix the BTB, and reinitiate fetching correctly.
What about branches like unconditional relative and call? These can be validated at the decode stage. The BAC will check the BTB, see if there are entries in the BTBC and set the predictor to always predict taken. For conditional branches, the BAC cannot confirm if they are taken/not-taken yet, but it can at least validate the predicted address and correct the BTB in the event of a bad address prediction. Sometimes the BTB won't identify/predict a branch at all. The BAC needs to correct this and give the BTB new information about this instruction. Since the BAC doesn't have a conditional predictor of its own, it uses a simple mechanism (backwards branches taken, forward branches not taken).
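As a toy model of the decode-stage check described above for a relative conditional branch, consider the following Python sketch. The function name and return values are invented for illustration, and the real BAC handles more cases; the only facts relied on are that an x86 relative branch target is the address of the next instruction plus the displacement, and the backwards-taken/forwards-not-taken rule mentioned above.

```python
def bac_check(ip, length, displacement, btb_target):
    """Validate a BTB prediction for a relative conditional branch at decode.

    btb_target is the target predicted by the BTB, or None if the BTB made
    no prediction for this instruction. Returns (action, taken, next_ip).
    """
    fall_through = ip + length
    actual_target = fall_through + displacement
    if btb_target is None:
        # No prediction from the BTB: fall back to the static rule,
        # backwards branches taken, forward branches not taken.
        taken = displacement < 0
        return ("static", taken, actual_target if taken else fall_through)
    if btb_target != actual_target:
        # Predicted taken to the wrong address: resteer fetch and fix the BTB.
        return ("baclear", True, actual_target)
    return ("ok", True, actual_target)
```

For example, a 2-byte backward branch at 0x1000 with displacement -0x10 and no BTB entry would be statically predicted taken to 0xFF2, while a BTB target that disagrees with the computed address triggers the resteer path.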
Somebody will need to confirm my understanding of these hardware counters, but I believe they mean the following:
BACLEAR.CLEAR is incremented when the BTB in fetch does a bad job and the BAC in decode can fix it.
BPU_CLEARS.EARLY is incremented when fetch decides (incorrectly) to load the next instruction before the BTB predicts that it should actually load from the taken path instead. This is because the BTB requires multiple cycles and fetch uses that time to speculatively load a consecutive block of instructions. This can be due to Intel using two BTBs, one quick and the other slower but more accurate. It takes more cycles to get a better prediction.
This explains why the penalty for detecting a misprediction in the BTB is 2/3 cycles, whereas the penalty for detecting a misprediction in the BAC is 8 cycles.

The fact that the BPU_CLEARS.EARLY description says this event occurs when the BPU predicts, correcting an assumption, implies there are 3 stages in the front end: assume, predict and calculate. My current guess is that an early clear is flushing the stages of the pipeline that are before the stage that a prediction from the L1 BTB is even known, when the prediction is 'taken' as opposed to not taken.
The BTB set contains 4 ways for a maximum of 4 branches per 16 bytes (where all the ways in the set get tagged with the same tag indicating that particular 16 byte aligned block). Each way has an offset indicating the 4 LSBs of the address therefore the byte position within 16 bytes. Each entry also has a speculative bit, valid bit, pLRU bits, a speculative local BHR, a real local BHR, and each way shares the set BPT (PHT) as a second level of prediction. This may be alloyed with a GHR / speculative GHR.
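To illustrate how an instruction address could map onto such a set, here is a small Python sketch. It assumes a 512-entry, 4-way BTB (so 128 sets) and a simple split where the bits above the 4-bit offset select the set and the remaining upper bits form the tag; the actual index and tag functions are not stated here, so treat this layout as hypothetical.

```python
WAYS = 4
ENTRIES = 512            # assumed L1 BTB size
SETS = ENTRIES // WAYS   # 128 sets
BLOCK_BYTES = 16         # each set covers one 16-byte aligned block


def btb_fields(ip):
    """Split an instruction address into (tag, set_index, offset)."""
    offset = ip % BLOCK_BYTES   # 4 LSBs: byte position within the block
    block = ip // BLOCK_BYTES   # number of the 16-byte aligned block
    set_index = block % SETS    # assumed: low block bits select the set
    tag = block // SETS         # assumed: remaining bits are the tag
    return tag, set_index, offset
```

Under this layout, up to four branches in the same 16-byte block land in the same set with the same tag and are told apart only by their offsets.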
I think the BPU provides a 64B prediction block to the uop cache and instruction cache (it used to be 32B, and it was 16B on P6). For the legacy route it needs to provide a 64 (or 32/16) byte prediction block which is a set of 64 bit masks which mark predicted branch instructions, prediction directions and which byte is a branch target. This information will be furnished by the L1 BTB while the fetch for the 64 byte line is underway such that 16 byte aligned (IFU has always been 16B) blocks that are read out of it with no used bits at all will not be fetched by the instruction predecoder (unused is the default because on architectures where the prediction block is smaller than the line size, the BPU may only provide bitmasks for 16 or 32B of the line). The BPU therefore provides the predictions made mask, the used/unused mask (marking bytes after first taken branch in the first prediction block and before the branch target in the second prediction block as unused and the rest used), the prediction directions mask; and the ILD provides the branch instructions mask. The first used byte in the prediction block is implicitly a branch target or the start of the instruction flow after a resteer or switch from the uop cache (DSP) to the legacy pipeline (MITE). The used bytes within the prediction block make up a prediction window.
Here is a P6 pipeline. Using this as an example, an early clear is in the 3rd cycle (13), when a prediction is made (and the target and entry type is read, and therefore unconditional branch targets are now known as well as conditional and their predictions). The first predicted taken branch target in the set for the 16 byte block is used. At this point, the 2 pipe stages before it have already been filled with fetches and beginning of lookups from the next sequential 16 byte blocks, which means that they need to be flushed if there is any taken prediction (otherwise it doesn't need to be as the next sequential 16 byte block is already beginning to be looked up in the pipestage before it), leaving a 2 cycle gap or bubble. The cache lookup occurs at the same time as the BTB lookup, so both the BTB and cache parallel 2 pipestages will have to be flushed, whereas the 3rd stage doesn't need to be flushed from the cache or the BTB because the IP is on a confirmed path and is the IP being currently looked up for the next one. In fact, in this P6 design, there is only a one cycle bubble for this early clear, because the new IP can be sent to the first stage to decode a set again on the high edge of clock while those other stages are being flushed.
This pipelining is more beneficial than waiting for the prediction before beginning a lookup on the next IP, which would mean a lookup every other cycle and a throughput of 16 bytes of predictions every 2 cycles, i.e. 8B/c. In the P6 pipelined scenario, the throughput is 16 bytes per cycle on a correct assumption and 8B/c on an incorrect assumption, which is obviously faster. If we assume 2/3 of assumptions are correct (for 1 in 9 instructions being a taken branch and 4 instructions per block), this gives a throughput of 16B per ((1*0.666)+(2*0.333)) = 1.332 cycles instead of 16B per 2 cycles.
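The arithmetic in that estimate can be checked with a few lines of Python. The function names are just for this example; the model assumes a block costs 1 cycle when the sequential assumption is correct and 2 cycles when it is not.

```python
BLOCK_BYTES = 16


def cycles_per_block(p_correct, cycles_correct=1, cycles_incorrect=2):
    """Expected cycles per 16-byte block for a given rate of correct assumptions."""
    return p_correct * cycles_correct + (1 - p_correct) * cycles_incorrect


def bytes_per_cycle(p_correct):
    """Expected prediction throughput in bytes per cycle."""
    return BLOCK_BYTES / cycles_per_block(p_correct)


pipelined = bytes_per_cycle(2 / 3)  # 2/3 of sequential assumptions are correct
unpipelined = BLOCK_BYTES / 2       # one lookup every other cycle
```

With exact fractions the expected cost is 4/3 cycles per block (the 1.332 above comes from rounding), which works out to 12 bytes per cycle against 8 bytes per cycle without pipelining.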
If this is true, every taken branch will cause an early clear. This is however not the case when I use the event on my KBL. Hopefully the event is actually wrong because it is supposed to not be supported on my KBL, but does show a random low number, so hopefully it is counting something else. This also does not appear to be supported by the following https://gist.github.com/mattgodbolt/4e2cbb1c9aa97e0c5478 https://github.com/mattgodbolt/agner/blob/master/tests/branch.py. Given the 900k instructions and 100k early clears, I do not see how you can have an odd number of early clears if you use my definition of early clears and then look at his code. If we assume that the window is 32B for that CPU, then if you use an alignment of 4 on each branch instruction in that macro you get a clear every 8 instructions, because 8 will fit into the 32B aligned window.
I am not sure why Haswell and Ivy Bridge have such values for early and late clears but these events (0xe8) disappear starting with SnB, which happens to coincide with when the BTB was unified into a single structure. It also looks like the late clears counter is now counting the early clears event because it is the same number as the early clears on the Arrandale CPU, and the early clears event is now counting nothing. I'm also not sure why Nehalem has a 2 cycle bubble for early clears as the design of the L1 Nehalem BTB doesn't seem to change much from the P6 BTB, both 512 entries with 4 ways per set. It is probably because it has been broken down into more stages due to the higher clock speeds and hence also the longer L1i cache latency.
The late clear (BPU_CLEARS.LATE) appears to happen at the ILD. In the diagram above, the cache lookup takes only 2 cycles. In more recent processors, it apparently takes 4 cycles. This allows another L2 BTB to be inserted and a lookup in it to take place. 'MRU bypass' and 'MRU conflicts' could just mean that there was a miss in the MRU BTB or it could also mean that the prediction in the L2 is different to the one in L1 in the event that it uses a different prediction algorithm and history file. On my KBL, which does not support either event, I always get 0 for ILD_STALL.MRU but not BPU_CLEARS.LATE. The 3 cycle bubble comes from the BPU at stage 5 (which is also an ILD stage) resteering the pipeline at the low edge of stage 1 and flushing stages 2, 3 and 4 (which falls in line with cited L1i latencies of 4 cycles, as the L1i lookup occurs across stages 1–4 for a hit+ITLB hit). As soon as the prediction is made, the BTBs update the entries' speculative local BHR bits with the prediction that was made.
A BACLEAR happens for instance when the IQ compares the predictions-made mask with the branch instruction mask produced by the predecoder, and then for certain instruction types like a relative jump, it will check the sign bit to perform a static branch prediction. I'd imagine the static prediction happens as soon as it enters the IQ from the predecoder, such that instructions that immediately go to the decoder contain the static prediction. The branch now being predicted taken will result in a BACLEAR_FORCE_IQ when the branch instruction is decoded, because there won't be a target to verify when the BAC calculates the absolute address of the relative conditional branch instruction, which it needs to verify when it is predicted taken.
The BAC at the decoders also makes sure the relative branches and direct branches have the correct branch target prediction after calculating the absolute address from the instruction itself and comparing it with the prediction; if it does not match, a BACLEAR is issued. For relative jumps, static prediction in the BAC uses the sign bit of the jump displacement to statically predict taken / not taken if a prediction has not been made but also overrides all return predictions as taken if the BTB does not support return entry types (it doesn't on P6 and makes no prediction, instead the BAC uses the BPU's RSB mechanism and it is the first point in the pipeline that a return instruction is acknowledged) and overrides all register indirect branch predictions as taken on P6 (because there is no IBTB) as it uses the statistic that more branches are taken than not. The BAC calculates and inserts the absolute target from the relative target into the uop and inserts the IP delta into the uop and inserts the fall through IP (NLIP) into the BPU's BIT, which may be tagged to the uop, or more likely the BIT entries work on a corresponding circular queue which will stall if there aren't enough BIT entries, and the indirect target prediction or known target is inserted into the uop 64 bit immediate field. These fields in the uop are used by the allocator for allocation into the RS/ROB later on. The BAC also informs the BTB of any spurious predictions (non branch instructions) that need their entries deallocating from the BTB. At the decoders, branch instructions are detected early in the logic (when prefixes are decoded and the instruction is examined to see if it can be decoded by the decoder) and the BAC is accessed in parallel with the rest. The BAC inserting the known or otherwise predicted target into the uop is known as converting an auop into a duop. The prediction is encoded into the uop opcode.
The BAC likely instructs the BTB to speculatively update its entry for the detected branch instruction's IP. This applies if the target is now known and no prediction was made for it (meaning it wasn't in the cache). The entry is still speculative: although the branch target is known for certain, it could still be on a speculative path, so it is marked with a speculative bit. This will now immediately provide early steers, especially for unconditional branches now entering the pipeline but also for conditional ones (with a blank history, so they will be predicted not taken next time), rather than having to wait until retire.
The IQ above contains a bitmask field for branch prediction directions (BTBP) and branch predictions made / no prediction made (BTBH) (to distinguish which of the 0s in the BTBP are not taken as opposed to no prediction made) for each of the 8 instruction bytes in an IQ line as well as the target of a branch instruction, meaning there can only be one branch per IQ line and it ends the line. This diagram does not show the branch instruction mask produced by the predecoder that shows what instructions actually are branches such that the IQ knows what not-made predictions it needs to make a prediction for (and what ones are not branch instructions at all).
The IQ is a contiguous block of instruction bytes and the ILD populates 8-bit bitmasks which identify the first opcode byte (OpM) and instruction end byte (EBM) of each macroinstruction as it wraps round bytes into the IQ. It probably also provides bits indicating whether it is a complex instruction or a simple instruction (as suggested by the 'predecode bits' on many AMD patents). The gaps between these markers are implicitly prefix bytes for the following instruction. I'm thinking the IQ is designed such that the uops it issues in the IDQ/ROB will rarely outrun the IQ such that the head pointer in the IQ starts overwriting macroinstructions still tagged in the IDQ waiting to be allocated, and when it does, there is a stall, so the IDQ tags refer back to the IQ, which the allocator accesses. I think the ROB uses this uop tag as well. If the IQ on SnB is 16 bytes * 40 entries, it contains 40 macroops in the worst case, 320 in the average case, 640 in the best case. The number of uops these produce will be much greater, so it will rarely outrun, and when it does, I guess it stalls decode until more instructions retire. The tail pointer contains the recently allocated tag by the ILD, the head pointer contains the next macroinstruction waiting to retire, and the read pointer is the current tag to be consumed by the decoders (which moves towards the tail pointer). Although, this becomes difficult now that some if not the majority of the uops in the path come from the uop cache since SnB. The IQ may be allowed to outrun the back end in the event that uops are not tagged with the IQ entries (and the fields in the IQ are instead inserted into uops directly), and this will be detected and the pipeline will just be resteered from the beginning.
When the allocator allocates a physical destination (Pdst) for a branch micro-op into the ROB, the allocator provides the Pdst entry number to the BPU. The BPU inserts this into the correct BIT entry assigned by the BAC (which is probably at the head of a circular queue of active BIT entries that are yet to be allocated a Pdst). The allocator also extracts fields from the uop and allocates the data into the RS.
The RS contains a field that indicates whether an instruction is a MSROM uop or a regular uop, which the allocator populates. The allocator also inserts the confirmed absolute target or the predicted absolute target into the immediate data and as a source, renames the flags register (or just a flag bit) and in the case of an indirect branch, there is also the renamed register that contains the address as another source. The Pdst in the PRF scheme would be the ROB entry, which as a Pdst would be the retirement macro-RIP or micro-IP register. The JEU writes the target or fallthrough to that register (it may not need to if the prediction is correct).
When the reservation station dispatches a branch micro-op to a jump execution unit located in the integer execution unit, the reservation station informs the BTB of the Pdst entry for the corresponding branch micro-op. In response, the BTB accesses the corresponding entry for the branch instruction in the BIT and the fall through IP (NLIP) is read out, decremented by the IP delta in the RS, and decoded to point to the set that the branch entry will be updated/allocated.
The outcome from the renamed flag register source Pdst to determine whether the branch is taken / not taken is then compared with the prediction in the opcode in the scheduler, and additionally, if the branch is indirect, the predicted target in the BIT is compared with the address in the source Pdst (that was calculated and became available in the RS before it was scheduled and dispatched) and it is now known whether a correct prediction was made or not and whether the target is correct or not.
The JEU propagates an exception code to the ROB and flushes the pipeline (JEClear -- which flushes the whole pipeline before the allocate stage, as well as stalls the allocator) and redirects the next IP logic at the start of the pipeline using the fallthrough (in BIT) / target IP appropriately (as well as microsequencer if it is a microbranch misprediction; the RIP directed to the start of the pipeline will be the same one throughout the MSROM procedure). Speculative entries are deallocated and true BHRs are copied into the speculative BHRs. In the event there is a BOB in the PRF scheme, the BOB takes snapshots of the RAT state for every branch instruction and when there is a misprediction. The JEU rolls back the RAT state to that snapshot and the allocator can proceed immediately (which is particularly useful for microbranch misprediction as it is closer to the allocator therefore the bubble will not be as well hidden by the pipeline), rather than stalling the allocator and having to wait until retire for the retirement RAT state to be known and using that to restore the RAT and then clear the ROB (ROClear, which unstalls the allocator). With a BOB, the allocator can start issuing new instructions while the stale uops continue to execute, and when the branch is ready to retire, the ROClear only clears the uops between the retired misprediction and the new uops. If it is an MSROM uop, because it may have completed, the start of the pipeline still needs to be redirected to the MSROM uop again, but this time it will start at the redirected microip (this is the case with inline instructions, and it may be able to replay it out of the IQ). If a misprediction happens in an MSROM exception then it doesn't need to resteer the pipeline, just redirects it directly, because it has taken over the IDQ issue until the end of the procedure -- the issue may have already ended for inline issues.
The ROClear in response to the branch exception vector in the ROB actually happens on the second retirement stage RET2 (which is really the 3rd of 3 stages of typical retirement pipeline) when the uops are retired. The macroinstruction only retires and exceptions only trigger and the macroinstruction RIP only updates (with new target or increase by IP delta in the ROB) when the EOM uop (end of macroinstruction) marker retires, even if a non EOM instruction writes to it, it is not written to the RRF immediately unlike other registers -- anyway, the branch uop is likely going to be the final uop in typical branch macroinstruction handled by the decoders. If this is a microbranch in an MSROM procedure, it will not update the BTB; it updates the uIP when it retires, and the RIP is not updated until the end of the procedure.
If a generic non-mispredict exception occurs (i.e. one that requires a handler) during a MSROM macroop execution, once it has been handled, the microip that caused the exception is restored by the handler to the uIP register (in the event that it is passed to the handler when it is called), as well as the current RIP of the macroinstruction which triggered the exception, and when the exception handling ends, instruction fetch is resumed at this RIP+uIP: the macroinstruction is refetched and reattempted in the MSROM, which starts at the uIP signalled to it. The RRF write (or retirement RAT update on the PRF scheme) for previous uops in a complex non-MSROM macroinstruction may occur on the cycle before the EOM uop retires, which means that a restart can happen at a certain uop within a regular complex macroop and not just a MSROM macroop, and in this case, the instruction flow is restarted at the BPU at the RIP, and the complex decoder is configured with valid / invalid bits on the PLA cuop outputs. The uIP for this regular complex instruction that is used to configure the complex decoder valid bits is a value between 0-3, which I think the ROB sets to 0 at each EOM and increments for each microop retired, so that the non-MSROM complex instructions can be addressed, and for MSROM complex instructions, the MSROM routine contains a uop that tells the ROB the uIP of that instruction. The architectural RIP register, however, which is updated by the IP delta only when the EOM uop retires, is still pointing to the current macroop (because the EOM uop failed to retire), which only happens for exceptions but not hardware interrupts, which can't interrupt MSROM procedures or complex instructions mid retirement (software interrupts are similar and trigger at the EOM -- the trap MSROM handler performs a macrojump to the RIP of the software trap handler once it has finished).
The BTB read and tag comparison happens in RET1 while the branch unit writes back the results, and then in the next cycle, perhaps also during RET1 (or maybe this is done in RET2), the tags in the set are compared and then, if there is a hit, a new history BHR is calculated; if there is a miss, an entry needs to be allocated at that IP with that target. Only once the uop retires in order (in RET2) can the result be placed into the real history and the branch prediction algorithm is utilised to update the pattern table where an update is required. If the branch requires allocation, the replacement policy is utilised to decide the means for allocating the branch. If there is a hit, the target will already be correct for all direct and relative branches, so it doesn't have to be compared, in the event of no IBTB. The speculative bit is now removed from the entry if present. Finally, in the next cycle, the branch is written in the BTB cache by the BTB write control logic. The first part of the BTB lookup may be able to go ahead throughout RET1 and then may stall the BTB write pipeline until RET2 when the stage waiting to write to the BTB's ROB entry retires, otherwise, the lookup could be decoupled and the first part completes and writes data to, for instance, the BIT, and at RET2 the corresponding entry to the one retiring is just written back to the BTB (which would mean decoding the set again, comparing tags again and then writing the entry, so maybe not).
If P6 had a uop cache, the pipeline would be something like:
1H: select IP
1L: BTB set decode + cache set decode (physical/virtual index) + ITLB lookup + uop cache set decode
2H: cache read + BTB read + uop cache read
2L: cache tag compare + BTB tag compare + uop cache tag compare; if uop cache hit, stall until uop cache can issue, then clock gate legacy decode pipeline
3H: predict, if taken, flush 3H,2L,2H,1L
3L: if taken, begin a 1L with new IP to decode new set and continue with current 16 byte block in which the branch instruction resides to 4L
As for the uop cache, because it is past the stage of the BAC, there is never going to be a bogus branch or an incorrect prediction for an unconditional branch or an incorrect target for a non-indirect branch. The uop cache will use the used/unused mask from the BPU to emit uops for instructions that begin at those bytes, and will use the prediction direction mask to change the macrobranch uops to a predicted not taken / predicted taken macrobranch uop (T/NT predictions are inserted into the uop itself). If it is predicted taken then it stops emitting uops for that 64B aligned block (again used to be 32B, previously 16B) and waits for the next window right behind it in the pipeline. The uop cache is going to know what uops are branches and probably statically predicts not taken for all branches without a prediction, or might have a more advanced static prediction. Indirect target predictions from the IBTB are inserted into the uop immediate field and then it will wait for the next BPU prediction block if this branch is also predicted taken. I would imagine the uop cache creates BIT entries and updates predictions in the BTBs, to ensure that uop cache and MITE (legacy decode) uops update the history in correct sequential order.

measuring time between two rising edges in beaglebone

I am reading a sensor output as a square wave (0-5 volt) on an oscilloscope. Now I want to measure the frequency of one period with a Beaglebone, so I need to measure the time between two rising edges. However, I don't have any experience working with the Beaglebone. Can you give some advice or sample code for measuring the time between rising edges?

How deterministic do you need this to be? If you can tolerate some inaccuracy, you can probably do it on the main Linux OS; if you want to be fancy pants, this seems like a potential use case for the BBB's PRUs (which I unfortunately haven't used, so take this with substantial amounts of salt). I would expect you'd be able to write PRU code that just sits in an infinite outer loop and, inside that loop, loops until it sees the pin show 0, then loops until the pin shows 1 (this is the first rising edge), then starts counting until either the pin shows 0 again (this would then be the falling edge) or, with another loop, until the next rising edge. Either way, you could take the counter value and convert it directly into time: the PRU is stated as having a fixed execution time for each instruction and runs at 200 MHz (5 ns/instruction). Assuming your loop is something like
#starting with pin low
inner loop 1:
registerX = loadPin
increment counter
jump if zero registerX to inner loop 1
# pin is now high
inner loop 2:
registerX = loadPin
increment counter
jump if one registerX to inner loop 2
# pin is now low again
That should take 3 instructions per counter increment, so you can get the time as 3 * counter * 5 ns.
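To make the arithmetic concrete, here is a small C sketch that turns a loop count into elapsed time and a frequency. It assumes a 200 MHz PRU clock (5 ns per cycle), one cycle per instruction, and three instructions per loop iteration as in the pseudocode above. The macro and function names are hypothetical, and the result should be checked against a known reference signal.

```c
#include <stdint.h>

#define PRU_CLOCK_HZ        200000000u  /* assumed 200 MHz PRU clock */
#define INSTR_PER_ITERATION 3u          /* load pin, increment, conditional jump */

/* Elapsed time in nanoseconds for a given loop count. */
uint64_t pru_count_to_ns(uint32_t counter)
{
    uint64_t cycles = (uint64_t)counter * INSTR_PER_ITERATION;
    /* cycles / clock_hz seconds, expressed in ns; 5 ns per cycle at 200 MHz */
    return cycles * 1000000000ull / PRU_CLOCK_HZ;
}

/* Frequency in Hz of a signal whose full period lasted period_ns (0 if unknown). */
double period_ns_to_hz(uint64_t period_ns)
{
    return period_ns ? 1e9 / (double)period_ns : 0.0;
}
```

For example, a count of 1000 iterations corresponds to 3000 cycles, i.e. 15 microseconds.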

As suggested by Foon in his answer, the PRUs are a good fit for this task (although depending on your requirements it may be fine to use the ARM processor and standard GPIO). Please note that (as far as I know) both the regular GPIOs and the PRU inputs are based on 3.3V logic, and connecting a 5V signal might fry your board! You will need an additional component or circuit to convert from 5V to 3.3V.
I've written a basic example that measures timing between rising edges on the header pin P8.15 for my own purpose of measuring an engine's rpm. If you decide to use it, you should check the timing results against a known reference. It's about right but I haven't checked it carefully at all. It is implemented using PRU assembly and uses the pypruss python module to simplify interfacing.

Simple Debounce Routine

Do you have a simple debounce routine handy to deal with a single switch input?
This is a simple bare metal system without any OS.
I would like to avoid a looping construct with a specific count, as the processor speed might fluctuate.

I think you could learn a lot about this here: http://www.ganssle.com/debouncing.pdf
Your best bet is always to do this in hardware if possible, but there are some thoughts on software in there as well.
Simple example code from TFA:
#define CHECK_MSEC 5 // Read hardware every 5 msec
#define PRESS_MSEC 10 // Stable time before registering pressed
#define RELEASE_MSEC 100 // Stable time before registering released

// This function reads the key state from the hardware.
extern bool_t RawKeyPressed();

// This holds the debounced state of the key.
bool_t DebouncedKeyPress = false;

// Service routine called every CHECK_MSEC to
// debounce both edges
void DebounceSwitch1(bool_t *Key_changed, bool_t *Key_pressed)
{
    static uint8_t Count = RELEASE_MSEC / CHECK_MSEC;
    bool_t RawState;

    *Key_changed = false;
    *Key_pressed = DebouncedKeyPress;
    RawState = RawKeyPressed();
    if (RawState == DebouncedKeyPress) {
        // Set the timer which allows a change from current state.
        if (DebouncedKeyPress) Count = RELEASE_MSEC / CHECK_MSEC;
        else Count = PRESS_MSEC / CHECK_MSEC;
    } else {
        // Key has changed - wait for new state to become stable.
        if (--Count == 0) {
            // Timer expired - accept the change.
            DebouncedKeyPress = RawState;
            *Key_changed = true;
            *Key_pressed = DebouncedKeyPress;
            // And reset the timer.
            if (DebouncedKeyPress) Count = RELEASE_MSEC / CHECK_MSEC;
            else Count = PRESS_MSEC / CHECK_MSEC;
        }
    }
}

Simplest solutions are often the best, and I've found that simply reading the switch state only every N milliseconds (between 10 and 50, depending on the switches) has always worked for me.
I've stripped out broken and complex debounce routines and replaced them with a simple slow poll, and the results have always been good enough that way.
To implement it, you'll need a simple periodic timer interrupt on your system (assuming no RTOS support), but if you're used to programming it at the bare metal, that shouldn't be difficult to arrange.
Note that this simple approach adds a delay to detection of the change in state. If a switch takes T ms to reach a new steady state, and it's polled every X ms, then the worst case delay for detecting the press is T+X ms. Your polling interval X must be larger than the worst-case bounce time T.
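A minimal sketch of the slow-poll idea in C, assuming a periodic timer interrupt calls it once every X ms and passes in the current pin level. The type and function names are hypothetical; the only "debouncing" is the choice of polling interval.

```c
#include <stdbool.h>

/* Last sampled state of one switch. */
typedef struct {
    bool state;
} slow_poll_t;

/* Call once per polling interval with the raw pin level.
 * Returns true when the sampled state differs from the previous sample. */
bool slow_poll_update(slow_poll_t *sw, bool raw_level)
{
    bool changed = (raw_level != sw->state);
    sw->state = raw_level;
    return changed;
}
```

Because at most one sample can land inside a bounce period when X is larger than T, a single wrong reading can only delay the detected change by one interval, not produce a spurious extra press.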

There's no single simple solution that works for all types of buttons. No matter what someone here tells you to use, you'll have to try it with your hardware, and see how well it works. And look at the signals on a scope, to make sure you really know what's going on. Rich B's link to the pdf looks like a good place to start.

I have used a majority vote method to debounce an input. I set up a simple three state shift register type of data structure, and shift each sample and take the best two out of three as the "correct" value. This is obviously a function of either your interrupt handler, or a poller, depending on what method is used to actually read the hardware.
But, the best advice is to ask your friendly hardware designer to "latch" the value and allow you to clear this value when you get to it.
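The two-out-of-three vote can be sketched in C roughly as follows. This is an illustration under the assumption that one sample is taken per call (from an interrupt handler or a poller); the names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* The low three bits hold the last three samples, newest in bit 0. */
typedef struct {
    uint8_t history;
} vote3_t;

/* Shift in one new sample and return the majority of the last three. */
bool vote3_update(vote3_t *v, bool sample)
{
    v->history = (uint8_t)(((v->history << 1) | (sample ? 1u : 0u)) & 0x07u);

    uint8_t ones = (uint8_t)((v->history & 1u)
                 + ((v->history >> 1) & 1u)
                 + ((v->history >> 2) & 1u));
    return ones >= 2u;  /* best two out of three */
}
```

Starting from an all-zero history, the output goes high on the second consecutive high sample and goes low again on the second consecutive low sample.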

To debounce, you want to ignore any switch up that lasts under a certain threshold. You can set a hardware timer on switch up, or use a flag set via periodic interrupt.

If you can get away with it, the best solution in hardware is to have the switch have two distinct states with no state between. That is, use a SPDT switch, with each pole feeding either the R or S lines of a flip/flop. Wired that way, the output of the flip/flop should be debounced.

The algorithm from ganssle.com could have a bug in it. I have the impression the following line
static uint8_t Count = RELEASE_MSEC / CHECK_MSEC;
should read
static uint8_t Count = PRESS_MSEC / CHECK_MSEC;
in order to debounce correctly the initial press.

At the hardware level the basic debouncing routine has to take into account the following segments of a physical key's (or switch's) behavior:
Key sitting quietly->finger touches key and begins pushing down->key reaches bottom of travel and finger holds it there->finger begins releasing key and spring pushes key back up->finger releases key and key vibrates a bit until it quiesces
All of these stages involve 2 pieces of metal scraping and rubbing and bumping against each other, jiggling the voltage up and down from 0 to maximum over periods of milliseconds, so there is electrical noise every step of the way:
(1) Noise while the key is not being touched, caused by environmental issues like humidity, vibration, temperature changes, etc. causing voltage changes in the key contacts
(2) Noise caused as the key is being pressed down
(3) Noise as the key is being held down
(4) Noise as the key is being released
(5) Noise as the key vibrates after being released
Here's the algorithm by which we basically guess that the key is being pressed by a person:
read the state of the key, which can be "might be pressed", "definitely is pressed", "definitely is not pressed", "might not be pressed" (we're never really sure)
loop while key "might be" pressed (if dealing with hardware, this is a voltage sample greater than some threshold value), until it is "definitely not" pressed (lower than the threshold voltage)
(this is initialization, waiting for noise to quiesce, definition of "might be" and "definitely not" is dependent on specific application)
loop while key is "definitely not" pressed, until key "might be" pressed
when key "might be" pressed, begin looping and sampling the state of the key, and keep track of how long the key "might be" pressed
- if the key goes back to "might not be" or "definitely is not" pressed state before a certain amount of time, restart the procedure
- at a certain time (number of milliseconds) that you have chosen (usually through experimenting with different values) you decide that the sample value is no longer caused by noise, but is very likely caused by the key actually being held down by a human finger and you return the value "pressed"
while (keyvalue == maybepressed) {
    // loop - wait for transition to notpressed
    sample keyvalue here;
    // maybe require it to be "notpressed" a number of times before you
    // assume it's really notpressed
}
while (keyvalue == notpressed) {
    // loop - wait for transition to maybepressed
    sample keyvalue;
    // again, maybe require a "maybepressed" value a number of times
    // before you transition
}
presstime = 0;
while (keyvalue == maybepressed) {
    sample keyvalue;
    presstime += 1;
    if (presstime > required_presstime) return pressed_affirmative;
}
// key dropped back to notpressed too early: treat as noise
return pressed_negative;

What I usually do is have three or so variables the width of the input register. Every poll, usually from an interrupt, shift the values up one to make way for the new sample. Then I have a debounced variable formed by setting the logical-and of the samples, and clearing the inverse logical-or. i.e. (untested, from memory)
input3 = input2;
input2 = input1;
input1 = (*PORTA);
debounced |= input1 & input2 & input3;
debounced &= (input1 | input2 | input3);
Here's an example:
debounced has xxxx (where 'x' is "whatever")
input1 = 0110,
input2 = 1100,
input3 = 0100
With the information above,
We need to switch only bit 2 to 1, and bit 0 to 0. The rest are still "bouncing".
debounced |= (0100); //set only bit 2
debounced &= (1110); //clear only bit 0
The result is that now debounced = x1x0
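A self-contained C version of the same idea, written as a sketch: the raw port value is passed in as a parameter rather than read from `*PORTA`, and the struct and function names are hypothetical. Each bit of the port is debounced independently.

```c
#include <stdint.h>

typedef struct {
    uint8_t input1;     /* newest sample */
    uint8_t input2;
    uint8_t input3;     /* oldest sample */
    uint8_t debounced;  /* debounced state, one bit per input pin */
} port_debounce_t;

/* Call once per poll with the raw port value; returns the debounced value. */
uint8_t port_debounce_update(port_debounce_t *d, uint8_t raw_port)
{
    /* Shift the sample history to make room for the new reading. */
    d->input3 = d->input2;
    d->input2 = d->input1;
    d->input1 = raw_port;

    /* Set bits that were 1 in all three samples. */
    d->debounced |= (uint8_t)(d->input1 & d->input2 & d->input3);
    /* Clear bits that were 0 in all three samples. */
    d->debounced &= (uint8_t)(d->input1 | d->input2 | d->input3);
    return d->debounced;
}
```

Bits that disagree across the three samples are left unchanged, which matches the worked example above.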

use integration and you'll be a happy camper. Works well for all switches.
just increment a counter when read as high and decrement it when read as low and when the integrator reaches a limit (upper or lower) call the state (high or low).
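A rough C sketch of such an integrator, assuming it is called at a fixed sampling rate. The limit of 5 samples and all names are assumptions chosen for illustration and would need tuning for a real switch.

```c
#include <stdbool.h>
#include <stdint.h>

#define INTEGRATOR_MAX 5u  /* assumed upper limit, in samples */

typedef struct {
    uint8_t level;  /* integrator value, clamped to 0..INTEGRATOR_MAX */
    bool    state;  /* debounced output */
} integrator_t;

/* Feed one raw sample; returns the debounced state. */
bool integrator_update(integrator_t *it, bool raw_high)
{
    if (raw_high) {
        if (it->level < INTEGRATOR_MAX) it->level++;
    } else {
        if (it->level > 0u) it->level--;
    }

    /* The output only changes when the integrator hits a limit. */
    if (it->level == 0u) {
        it->state = false;
    } else if (it->level == INTEGRATOR_MAX) {
        it->state = true;
    }
    return it->state;
}
```

Between the two limits the previous output is held, so isolated glitches move the counter without flipping the state.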

The whole concept is described well by Jack Ganssle. His solution, posted as an answer to the original question, is very good, but I find part of it unclear as to how it works.
There are three main ways to deal with switch bouncing:
- using polling
- using interrupts
- a combination of interrupts and polling.
As I deal mostly with embedded systems that are low-power or tend to be low-power, the answer from Keith to integrate is very reasonable to me.
If you work with an SPST push-button type switch with one mechanically stable position, then I would prefer a solution that uses a combination of interrupt and polling.
Like this: use a GPIO input interrupt to detect the first edge (falling or rising, the opposite direction of the un-actuated switch state). In the GPIO input ISR, set a flag to record the detection.
Use another interrupt for measuring time (i.e. a general purpose timer or SysTick) to count milliseconds.
On every SysTick increment (1 ms):
IF buttonFlag is true then call a function to poll the state of the push button (polling).
Do this for N consecutive SysTick increments, then clear the flag.
When you poll the button state, use whatever logic you wish to decide the button state: M consecutive readings the same, average more than Z, a count of readings in a given state, the last X readings the same, etc.
I think this approach should benefit from responsiveness on interrupt and lower power usage as there will be no button polling after N SysTick increments. There are no complicated interrupt modifications between various interrupts so the program code should be fairly simple and readable.
Take into consideration things like: do you need to "release" button, do you need to detect long press and do you need action on button release. I don't like button action on button release, but some solutions work that way.
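The interrupt-plus-polling scheme described above could be sketched in C as follows. This is only an outline: the hardware access is left out, the two functions are meant to be called from the GPIO edge ISR and the 1 ms SysTick handler respectively, and the window length N, the stability count M and all names are assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

#define POLL_WINDOW_TICKS 20u  /* N: ticks to keep polling after an edge (assumed) */
#define STABLE_READINGS    5u  /* M: identical consecutive readings required (assumed) */

typedef struct {
    volatile bool edge_flag;  /* set by the GPIO ISR, cleared after N ticks */
    uint8_t ticks_left;
    uint8_t same_count;
    bool    last_raw;
    bool    pressed;          /* debounced button state */
} button_t;

/* Call from the GPIO edge interrupt. */
void button_on_edge(button_t *b)
{
    if (!b->edge_flag) {
        b->edge_flag  = true;
        b->ticks_left = POLL_WINDOW_TICKS;
        b->same_count = 0;
    }
}

/* Call from the 1 ms tick interrupt with the current pin reading. */
void button_on_tick(button_t *b, bool raw_pressed)
{
    if (!b->edge_flag) return;  /* idle: no polling, saves power */

    if (raw_pressed == b->last_raw) {
        if (b->same_count < STABLE_READINGS) b->same_count++;
    } else {
        b->last_raw   = raw_pressed;
        b->same_count = 1;
    }
    if (b->same_count >= STABLE_READINGS) b->pressed = raw_pressed;

    if (--b->ticks_left == 0u) b->edge_flag = false;  /* window over */
}
```

The decision rule here is "M consecutive identical readings"; any of the other rules mentioned above could be substituted inside `button_on_tick` without changing the structure.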
