What kind of states and transitions does Spin's "depth reached" consider? - verification

For verifications (with ispin) that use never claims, I get outputs with depth reached larger than the number of states and the number of transitions, e.g.:
Full statespace search for:
never claim + (REQ5)
assertion violations + (if within scope of claim)
cycle checks - (disabled by -DSAFETY)
invalid end states - (disabled by never claim)
State-vector 60 byte, depth reached 87, errors: 1
41 states, stored
10 states, matched
51 transitions (= stored+matched)
9 atomic steps
hash conflicts: 0 (resolved)
I find that a bit unintuitive. Is there a precise description of the semantics of "depth reached" somewhere (more thorough than pan's output format description)? Maybe the meaning of
longest depth-first search path contained 87 transitions
does not refer to the 51 transitions, but to the transitions of the system automata composed with the never claim?

Yes, you are (kind of) right when you say it refers to the transitions of the system automaton composed with the never claim. At the same time it is the length of the path in the system being verified, because one step of the system composed with the never claim is exactly one step of the system. Of course, depending on the never claim, one might need to explore more or fewer transitions than the system has. The paths are not even necessarily loop-free (depending on the claim), and not minimal either (unless a special option is set).
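As a conceptual sketch of what is being counted (this is not pan's actual code): the depth is the length of the current depth-first search path in the product of the system and the claim, which can exceed the number of distinct stored states or transitions.

#include <algorithm>
#include <cstddef>
#include <unordered_set>
#include <vector>

struct ProductState { int sys; int claim; };   // system state x never-claim state

// Placeholder: in pan, one product step is one executable system step
// paired with one never-claim step.
std::vector<ProductState> successors(const ProductState&) { return {}; }

std::size_t depth_reached = 0;

void dfs(const ProductState& s, std::size_t depth,
         std::unordered_set<long long>& stored)
{
    depth_reached = std::max(depth_reached, depth);   // the "depth reached" in the report
    long long key = (static_cast<long long>(s.sys) << 32) |
                    static_cast<unsigned>(s.claim);
    if (!stored.insert(key).second)
        return;                                       // matched state: path ends here
    for (const ProductState& next : successors(s))
        dfs(next, depth + 1, stored);                 // one product transition per level
}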


DEFLATE: how to handle "no distance codes" case?

I mostly get RFC 1951, but I'm not too clear on how to manage the case where (when using dynamic Huffman tables) no distance codes are needed or present. For example, let's take the input:
abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890987654321ZYXWVUTSR
where no backreference is possible since there are no repetitions of length >= 3.
According to RFC 1951, at least one distance code must be present regardless, otherwise it wouldn't be possible to encode HDIST - 1. I understand, according to the reference, that such code should be of zero bits to signal "no distance codes".
One distance code of zero bits means that there are no distance codes
used at all (the data is all literals).
In infgen symbols, I'd expect to see a dist 0 0.
Analyzing what gzip does with infgen, however, I see that TWO distance codes are emitted (each 1 bit long) for the above input (even though none is actually used then):
! infgen 2.4 output
!
gzip
!
last
dynamic
litlen 48 6
litlen 49 6
litlen 50 6
...cut...
litlen 121 6
litlen 122 6
litlen 256 6
dist 0 1
dist 1 1
literal 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890987654321Z
literal 'YXWVUTSR
end
!
crc
length
So what's the correct behavior in these cases?
If there are no matches in the deflate block, there will be no lengths from the length/literal code, and so the decoder will never look for a distance code. In that case, what would make the most sense is to provide no information at all about a distance code.
However the format does not permit that, since the 5-bit HDIST value in the header is interpreted as 1 to 32 distance codes, for which lengths must be provided in the header. You must provide at least one distance code length in the header, even though it will never be used.
There are several valid things you can do in that case. RFC 1951 notes you can provide a single distance code (HDIST == 0, meaning one length), with length zero, which would be just one zero in the list of lengths.
It is also permitted to provide a single code of length one, or you could do as zlib is doing, which is to provide two codes of length one. You can actually put any valid distance code description you like there, and it will still be accepted.
As to why zlib's deflate is choosing to define two codes there, I can only guess that Jean-loup was being conservative, writing something he knew that even an over-simplified inflator would have to accept. Both gzip and zopfli do the same thing. They all do the same thing when there is only one distance code used. They could emit just the single one-bit distance code, per the RFC, but they emit two single-bit distance codes, one of which is never used.
Really the right thing to do would be to write a single zero length as noted in the RFC, which would take the fewest bits in the header. I will consider updating zlib to do that, to eke out a few more bits of compression.
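For what it's worth, here is a hedged sketch of how an encoder could pick the distance code lengths for a block with no matches. The names are illustrative, not zlib's internals.

#include <cstddef>
#include <vector>

// Returns the list of distance code lengths to describe in the dynamic
// header; HDIST is then (number of lengths - 1).
std::vector<unsigned> distance_code_lengths(std::size_t match_count)
{
    if (match_count == 0) {
        // No matches in the block: the header must still describe at least
        // one distance code (HDIST encodes 1..32), so a single zero-length
        // code -- permitted by RFC 1951 -- is the cheapest valid choice.
        return {0};
    }
    // Otherwise, build real lengths from the distance frequencies
    // (omitted here; this sketch only covers the no-match case).
    return {};
}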

A general-purpose warp-level std::copy-like function - what should it account for?

A C++ standard library implements std::copy (ignoring all sorts of wrappers, concept checks etc.) with the following simple loop:
for (; __first != __last; ++__result, ++__first)
*__result = *__first;
Now, suppose I want a general-purpose std::copy-like function for warps (not blocks; not grids) to use for collaboratively copying data from one place to another. Let's even assume for simplicity that the function takes pointers rather than an arbitrary iterator.
Of course, writing general-purpose code in CUDA is often a useless pursuit - since we might be sacrificing a lot of the benefit of using a GPU in the first place in favor of generality - so I'll allow myself some boolean/enum template parameters to possibly select between frequently-occurring cases, avoiding runtime checks. So the signature might be, say:
template <typename T, bool SomeOption, my_enum_t AnotherOption>
T* copy(
T* __restrict__ destination,
const T* __restrict__ source,
size_t length
);
but for each of these cases I'm aiming for optimal performance (or optimal expected performance given that we don't know what other warps are doing).
Which factors should I take into consideration when writing such a function? Or in other words: Which cases should I distinguish between in implementing this function?
Notes:
This should target Compute Capabilities 3.0 or better (i.e. Kepler or newer micro-architectures)
I don't want to make a Runtime API memcpy() call. At least, I don't think I do.
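For reference, here's the naive per-lane baseline I'd be improving on (just a sketch; lane_id() is a helper I'm assuming here, not a CUDA API, and it assumes a 1D block whose size is a multiple of 32 with all lanes of the warp calling together):

#include <cstddef>

__device__ inline unsigned lane_id() { return threadIdx.x % 32; }

template <typename T>
__device__ T* warp_copy_naive(T* __restrict__ destination,
                              const T* __restrict__ source,
                              size_t length)
{
    // Consecutive lanes touch consecutive elements, so accesses coalesce
    // when the pointers are suitably aligned and sizeof(T) is 4 or 8.
    for (size_t i = lane_id(); i < length; i += 32)
        destination[i] = source[i];
    return destination + length;
}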
Factors I believe should be taken into consideration:
Coalescing memory writes - ensuring that consecutive lanes in a warp write to consecutive memory locations (no gaps).
Type size vs Memory transaction size I - if sizeof(T) is 1 or 2, and we have each lane write a single element, the entire warp would write less than 128B, wasting some of the memory transaction. Instead, we should have each thread place 2 or 4 input elements in a register, and write that (see the packing sketch after this list).
Type size vs Memory transaction size II - For type sizes such that lcm(4, sizeof(T)) > 4, it's not quite clear what to do. How well does the compiler/the GPU handle writes when each lane writes more than 4 bytes? I wonder.
Slack due to the reading of multiple elements at a time - If each thread wishes to read 2 or 4 elements for each write, and write 4-byte integers - we might have 1 or 2 elements at the beginning and the end of the input which must be handled separately.
Slack due to input address mis-alignment - The input is read in 32B transactions (under reasonable assumptions); we thus have to handle the first elements up to the multiple of 32B, and the last elements (after the last such multiple), differently.
Slack due to output address mis-alignment - The output is written in transactions of up to 128B (or is it just 32B?); we thus have to handle the first elements up to the multiple of this number, and the last elements (after the last such multiple), differently.
Whether or not T is trivially-copy-constructible. But let's assume that it is.
But it could be that I'm missing some considerations, or that some of the above are redundant.
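A rough sketch of the packing idea from the first "Type size vs Memory transaction size" point above, for 1-byte elements (alignment and tail handling deliberately omitted; just to illustrate, not an actual implementation):

#include <cstddef>

__device__ void warp_copy_bytes_packed(unsigned char* __restrict__ dst,
                                       const unsigned char* __restrict__ src,
                                       size_t length)
{
    // Assumes dst and src are 4-byte aligned and length is a multiple of 4;
    // the slack cases listed above would be handled separately.
    const unsigned int* src32 = reinterpret_cast<const unsigned int*>(src);
    unsigned int* dst32 = reinterpret_cast<unsigned int*>(dst);
    size_t words = length / 4;
    for (size_t i = threadIdx.x % 32; i < words; i += 32)
        dst32[i] = src32[i];   // each lane moves 4 bytes -> 128B per warp iteration
}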
Factors I've been wondering about:
The block size (i.e. how many other warps are there)
The compute capability (given that it's at least 3)
Whether the source/target is in shared memory / constant memory
Choice of caching mode

What branch misprediction does the Branch Target Buffer detect?

I am currently looking at the various parts of the CPU pipeline which can detect branch mispredictions. I have found these are:
Branch Target Buffer (BPU CLEAR)
Branch Address Calculator (BA CLEAR)
Jump Execution Unit (not sure of the signal name here??)
I know what 2 and 3 detect, but I do not understand what misprediction is detected within the BTB. The BAC detects where the BTB has erroneously predicted a branch for a non-branch instruction, where the BTB has failed to detect a branch, or where the BTB has mispredicted the target address for an x86 RET instruction. The execution unit evaluates the branch and determines if it was correct.
What type of misprediction is detected at the Branch Target Buffer? What exactly is detected as a misprediction here?
The only clue I could find was this inside Vol 3 of the Intel Developer Manuals (the two BPU CLEAR event counters at the bottom):
BPU predicted a taken branch after incorrectly assuming that it was
not taken.
This seems to imply the prediction is not done "synchronously", but rather "asynchronously", hence the "after incorrectly assuming"??
UPDATE:
Ross, this is the CPU branch circuitry, from the original Intel patent (how's that for "reading"?):
I don't see "Branch Prediction Unit" anywhere? Would it be reasonable that somebody having read this paper would assume that "BPU" is a lazy way of grouping the BTB Circuit, BTB Cache, BAC and RSB together??
So my question still stands, which component raises the BPU CLEAR signal?
This is a good question! I think the confusion that it's causing is due to Intel's strange naming schemes which often overload terms standard in academia. I will try to both answer your question and also clear up the confusion I see in the comments.
First of all, I agree that in standard computer science terminology a branch target buffer isn't synonymous with branch predictor. However, in Intel terminology the Branch Target Buffer (BTB) [in capitals] is something specific and contains both a predictor and a Branch Target Buffer Cache (BTBC), which is just a table of branch instructions and their targets on a taken outcome. This BTBC is what most people understand as a branch target buffer [lower case]. So what is the Branch Address Calculator (BAC) and why do we need it if we have a BTB?
So, you understand that modern processors are split into pipelines with multiple stages. Whether this is a simple pipelined processor or an out-of-order superscalar processor, the first stages are typically fetch then decode. In the fetch stage all we have is the address of the current instruction contained in the program counter (PC). We use the PC to load bytes from memory and send them to the decode stage. In most cases we increment the PC in order to load the subsequent instruction(s), but in other cases we process a control flow instruction which can modify the contents of the PC completely.
The purpose of the BTB is to guess if the address in the PC points to a branch instruction, and if so, what should the next address in the PC be? That's fine, we can use a predictor for conditional branches and the BTBC for the next address. If the prediction was right, that's great! If the prediction was wrong, what then? If the BTB is the only unit we have then we would have to wait until the branch reaches the issue/execute stage of the pipeline. We would have to flush the pipeline and start again. But not every situation needs to be resolved so late. This is where the Branch Address Calculator (BAC) comes in.
The BTB is used in the fetch stage of the pipeline but the BAC resides in the decode stage. Once the instruction we fetched is decoded, we actually have a lot more information which can be useful. The first new piece of information we know is: "is the instruction I fetched actually a branch?" In the fetch stage we have no idea and the BTB can only guess, but in the decode stage we know it for sure. It is possible that the BTB predicts a branch when in fact the instruction is not a branch; in this case the BAC will halt the fetch unit, fix the BTB, and reinitiate fetching correctly.
What about branches like unconditional relative and call? These can be validated at the decode stage. The BAC will check the BTB, see if there are entries in the BTBC and set the predictor to always predict taken. For conditional branches, the BAC cannot confirm if they are taken/not-taken yet, but it can at least validate the predicted address and correct the BTB in the event of a bad address prediction. Sometimes the BTB won't identify/predict a branch at all. The BAC needs to correct this and give the BTB new information about this instruction. Since the BAC doesn't have a conditional predictor of its own, it uses a simple mechanism (backwards branches taken, forward branches not taken).
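As a toy illustration of that static fallback (not Intel's actual logic), the decision is just the sign of the relative displacement:

// Toy illustration of the static rule above ("backwards taken, forwards
// not taken"): a negative relative displacement means a backward branch.
static bool static_predict_taken(int displacement) {
    return displacement < 0;
}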
Somebody will need to confirm my understanding of these hardware counters, but I believe they mean the following:
BACLEAR.CLEAR is incremented when the BTB in fetch does a bad
job and the BAC in decode can fix it.
BPU_CLEARS.EARLY is
incremented when fetch decides (incorrectly) to load the next
instruction before the BTB predicts that it should actually load from
the taken path instead. This is because the BTB requires multiple cycles and fetch uses that time to speculatively load a consecutive block of instructions. This can be due to Intel using two BTBs, one quick and the other slower but more accurate. It takes more cycles to get a better prediction.
This explains why the penalty of detecting a misprediction in the BTB is 2/3 cycles whereas detecting a misprediction in the BAC is 8 cycles.
The fact that the BPU_CLEARS.EARLY description shows this event occurs when the BPU predicts, correcting an assumption, implies there are 3 stages in the front end: assume, predict and calculate. My current guess is that an early clear is flushing the stages of the pipeline that are before the stage at which a prediction from the L1 BTB is even known, when the prediction is 'taken' as opposed to not taken.
The BTB set contains 4 ways for a maximum of 4 branches per 16 bytes (where all the ways in the set get tagged with the same tag indicating that particular 16 byte aligned block). Each way has an offset indicating the 4 LSBs of the address therefore the byte position within 16 bytes. Each entry also has a speculative bit, valid bit, pLRU bits, a speculative local BHR, a real local BHR, and each way shares the set BPT (PHT) as a second level of prediction. This may be alloyed with a GHR / speculative GHR.
I think the BPU provides a 64B prediction block to the uop cache and instruction cache (it used to be 32B, and it was 16B on P6). For the legacy route it needs to provide a 64 (or 32/16) byte prediction block which is a set of 64 bit masks which mark predicted branch instructions, prediction directions and which byte is a branch target. This information will be furnished by the L1 BTB while the fetch for the 64 byte line is underway such that 16 byte aligned (IFU has always been 16B) blocks that are read out of it with no used bits at all will not be fetched by the instruction predecoder (unused is the default because on architectures where the prediction block is smaller than the line size, the BPU may only provide bitmasks for 16 or 32B of the line). The BPU therefore provides the predictions made mask, the used/unused mask (marking bytes after first taken branch in the first prediction block and before the branch target in the second prediction block as unused and the rest used), the prediction directions mask; and the ILD provides the branch instructions mask. The first used byte in the prediction block is implicitly a branch target or the start of the instruction flow after a resteer or switch from the uop cache (DSP) to the legacy pipeline (MITE). The used bytes within the prediction block make up a prediction window.
Here is a P6 pipeline. Using this as an example, an early clear is in the 3rd cycle (13), when a prediction is made (and the target and entry type is read, and therefore unconditional branch targets are now known as well as conditional and their predictions). The first predicted taken branch target in the set for the 16 byte block is used. At this point, the 2 pipe stages before it have already been filled with fetches and beginning of lookups from the next sequential 16 byte blocks, which means that they need to be flushed if there is any taken prediction (otherwise it doesn't need to be as the next sequential 16 byte block is already beginning to be looked up in the pipestage before it), leaving a 2 cycle gap or bubble. The cache lookup occurs at the same time as the BTB lookup, so both the BTB and cache parallel 2 pipestages will have to be flushed, whereas the 3rd stage doesn't need to be flushed from the cache or the BTB because the IP is on a confirmed path and is the IP being currently looked up for the next one. In fact, in this P6 design, there is only a one cycle bubble for this early clear, because the new IP can be sent to the first stage to decode a set again on the high edge of clock while those other stages are being flushed.
This pipelining is more beneficial than waiting for the prediction before beginning a lookup on the next IP. That would mean a lookup every other cycle, giving a throughput of 16 bytes of predictions every 2 cycles, so 8B/c. In the P6 pipelined scenario, the throughput is 16 bytes per cycle on a correct assumption and 8B/c on an incorrect assumption. Obviously faster. If we assume 2/3 of assumptions are correct (for 1 in 9 instructions being a taken branch, with 4 instructions per block), this gives a throughput of 16B per (1*0.666 + 2*0.333) = 1.33 cycles instead of 16B per 2 cycles.
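A quick sanity check of that arithmetic (the numbers are the assumptions stated above, not measurements):

#include <cstdio>

int main() {
    const double p_correct   = 2.0 / 3;  // assumed fraction of correct assumptions
    const double cycles_hit  = 1.0;      // cycles per 16B block when the assumption holds
    const double cycles_miss = 2.0;      // cycles per 16B block after an early clear
    const double avg_cycles  = p_correct * cycles_hit + (1 - p_correct) * cycles_miss;
    std::printf("%.3f cycles per 16B block, %.2f B/cycle\n",
                avg_cycles, 16.0 / avg_cycles);        // ~1.33 cycles, ~12 B/c
    return 0;
}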
If this is true, every taken branch will cause an early clear. This is however not the case when I use the event on my KBL. Hopefully the event is actually wrong because it is supposed to not be supported on my KBL, but does show a random low number, so hopefully it is counting something else. This also does not appear to be supported by the following https://gist.github.com/mattgodbolt/4e2cbb1c9aa97e0c5478 https://github.com/mattgodbolt/agner/blob/master/tests/branch.py. Given the 900k instructions and 100k early clears, I do not see how you can have an odd number of early clears if you use my definition of early clears and then look at his code. If we assume that the window is 32B for that CPU, then if you use an alignment of 4 on each branch instruction in that macro you get a clear every 8 instructions, because 8 will fit into the 32B aligned window.
I am not sure why Haswell and Ivy Bridge have such values for early and late clears but these events (0xe8) disappear starting with SnB, which happens to coincide with when the BTB was unified into a single structure. It also looks like the late clears counter is now counting the early clears event because it is the same number as the early clears on the Arrandale CPU, and the early clears event is now counting nothing. I'm also not sure why Nehalem has a 2 cycle bubble for early clears as the design of the L1 Nehalem BTB doesn't seem to change much from the P6 BTB, both 512 entries with 4 ways per set. It is probably because it has been broken down into more stages due to the higher clock speeds and hence also the longer L1i cache latency.
The late clear (BPU_CLEARS.LATE) appears to happen at the ILD. In the diagram above, the cache lookup takes only 2 cycles. In more recent processors, it apparently takes 4 cycles. This allows another L2 BTB to be inserted and a lookup in it to take place. 'MRU bypass' and 'MRU conflicts' could just mean that there was a miss in the MRU BTB or it could also mean that the prediction in the L2 is different to the one in L1 in the event that it uses a different prediction algorithm and history file. On my KBL, which does not support either event, I always get 0 for ILD_STALL.MRU but not BPU_CLEARS.LATE. The 3 cycle bubble comes from the BPU at stage 5 (which is also an ILD stage) resteering the pipeline at the low edge of stage 1 and flushing stages 2, 3 and 4 (which falls in line with cited L1i latencies of 4 cycles, as the L1i lookup occurs across stages 1–4 for a hit+ITLB hit). As soon as the prediction is made, the BTBs update the entries' speculative local BHR bits with the prediction that was made.
A BACLEAR happens for instance when the IQ compares the predictions-made mask with the branch instruction mask produced by the predecoder, and then for certain instruction types like a relative jump, it will check the sign bit to perform a static branch prediction. I'd imagine the static prediction happens as soon as it enters the IQ from the predecoder, such that instructions that immediately go to the decoder contain the static prediction. The branch now being predicted taken will result in a BACLEAR_FORCE_IQ when the branch instruction is decoded, because there won't be a target to verify when the BAC calculates the absolute address of the relative conditional branch instruction, which it needs to verify when it is predicted taken.
The BAC at the decoders also makes sure that relative branches and direct branches have the correct branch target prediction, after calculating the absolute address from the instruction itself and comparing it with the predicted one; if they do not match, a BACLEAR is issued. For relative jumps, static prediction in the BAC uses the sign bit of the jump displacement to statically predict taken / not taken if a prediction has not been made, but it also overrides all return predictions as taken if the BTB does not support return entry types (it doesn't on P6 and makes no prediction; instead the BAC uses the BPU's RSB mechanism, and it is the first point in the pipeline that a return instruction is acknowledged) and overrides all register indirect branch predictions as taken on P6 (because there is no IBTB), using the statistic that more branches are taken than not. The BAC calculates and inserts the absolute target from the relative target into the uop, inserts the IP delta into the uop, and inserts the fall-through IP (NLIP) into the BPU's BIT, which may be tagged to the uop, or more likely the BIT entries work as a corresponding circular queue which will stall if there aren't enough BIT entries; the indirect target prediction or known target is inserted into the uop's 64-bit immediate field. These fields in the uop are used by the allocator for allocation into the RS/ROB later on. The BAC also informs the BTB of any spurious predictions (non-branch instructions) that need their entries deallocated from the BTB. At the decoders, branch instructions are detected early in the logic (when prefixes are decoded and the instruction is examined to see if it can be decoded by the decoder) and the BAC is accessed in parallel with the rest. The BAC inserting the known or otherwise predicted target into the uop is known as converting an auop into a duop. The prediction is encoded into the uop opcode.
The BAC likely instructs the BTB to speculatively update its entry for the detected branch instruction's IP. If the target is now known and no prediction was made for it (meaning it wasn't in the cache) -- it is still speculative, as although the branch target is known for certain, it could still be on a speculative path, so it is marked with a speculative bit -- this will now immediately provide early steers, especially for unconditional branches now entering the pipeline, but also for conditional ones (with a blank history, so they will predict not taken next time), rather than having to wait until retire.
The IQ above contains a bitmask field for branch prediction directions (BTBP) and branch predictions made / no prediction made (BTBH) (to distinguish which of the 0s in the BTBP are not taken as opposed to no prediction made) for each of the 8 instruction bytes in an IQ line as well as the target of a branch instruction, meaning there can only be one branch per IQ line and it ends the line. This diagram does not show the branch instruction mask produced by the predecoder that shows what instructions actually are branches such that the IQ knows what not-made predictions it needs to make a prediction for (and what ones are not branch instructions at all).
The IQ is a contiguous block of instruction bytes and the ILD populates 8-bit bitmasks which identify the first opcode byte (OpM) and instruction end byte (EBM) of each macroinstruction as it wraps round bytes into the IQ. It probably also provides bits indicating whether it is a complex instruction or a simple instruction (as suggested by the 'predecode bits' on many AMD patents). The gaps between these markers are implicitly prefix bytes for the following instruction. I'm thinking the IQ is designed such that the uops it issues in the IDQ/ROB will rarely outrun the IQ such that the head pointer in the IQ starts overwriting macroinstructions still tagged in the IDQ waiting to be allocated, and when it does, there is a stall, so the IDQ tags refer back to the IQ, which the allocator accesses. I think the ROB uses this uop tag as well. The IQ on SnB if 16 bytes * 40 entries contains 40 macroops in the worst case, 320 in the average case, 640 in the best case. The number of uops these produce will be much greater, so it will rarely outrun, and when it does, I guess it stalls decode until more instructions retire. The tail pointer contains the recently allocated tag by the ILD, the head pointer contains the next macroinstruction instruction waiting to retire, and the read pointer is the current tag to be consumed by the decoders (which moves towards the tail pointer). Although, this becomes difficult now that some if not the majority of the uops in the path come from the uop cache since SnB. The IQ may be allowed to outrun the back end in the event that uops are not tagged with the IQ entries (and the fields in the IQ are instead inserted into uops directly), and this will be detected and the pipeline will just be resteered from the beginning.
When the allocator allocates a physical destination (Pdst) for a branch micro-op into the ROB, the allocator provides the Pdst entry number to the BPU. The BPU inserts this into the correct BIT entry assigned by the BAC (which is probably is at the head of a circular queue of active BIT entries that are yet to be allocated a Pdst). The allocator also extracts fields from the uop and allocates the data into the RS.
The RS contains a field that indicates whether an instruction is a MSROM uop or a regular uop, which the allocator populates. The allocator also inserts the confirmed absolute target or the predicted absolute target into the immediate data and as a source, renames the flags register (or just a flag bit) and in the case of an indirect branch, there is also the renamed register that contains the address as another source. The Pdst in the PRF scheme would be the ROB entry, which as a Pdst would be the retirement macro-RIP or micro-IP register. The JEU writes the target or fallthrough to that register (it may not need to if the prediction is correct).
When the reservation station dispatches a branch micro-op to a jump execution unit located in the integer execution unit, the reservation station informs the BTB of the Pdst entry for the corresponding branch micro-op. In response, the BTB accesses the corresponding entry for the branch instruction in the BIT and the fall through IP (NLIP) is read out, decremented by the IP delta in the RS, and decoded to point to the set that the branch entry will be updated/allocated.
The outcome from the renamed flag register source Pdst, which determines whether the branch is taken / not taken, is then compared with the prediction in the opcode in the scheduler, and additionally, if the branch is indirect, the predicted target in the BIT is compared with the address in the source Pdst (that was calculated and became available in the RS before it was scheduled and dispatched), and it is now known whether a correct prediction was made or not and whether the target is correct or not.
The JEU propagates an exception code to the ROB and flushes the pipeline (JEClear -- which flushes the whole pipeline before the allocate stage, as well as stalls the allocator) and redirects the next IP logic at the start of the pipeline using the fallthrough (in BIT) / target IP appropriately (as well as the microsequencer if it is a microbranch misprediction; the RIP directed to the start of the pipeline will be the same one throughout the MSROM procedure). Speculative entries are deallocated and true BHRs are copied into the speculative BHRs. In the event there is a BOB in the PRF scheme, the BOB takes snapshots of the RAT state for every branch instruction, and when there is a misprediction, the JEU rolls back the RAT state to that snapshot and the allocator can proceed immediately (which is particularly useful for a microbranch misprediction, as it is closer to the allocator and therefore the bubble will not be as well hidden by the pipeline), rather than stalling the allocator and having to wait until retire for the retirement RAT state to be known, using that to restore the RAT and then clearing the ROB (ROClear, which unstalls the allocator). With a BOB, the allocator can start issuing new instructions while the stale uops continue to execute, and when the branch is ready to retire, the ROClear only clears the uops between the retired misprediction and the new uops. If it is an MSROM uop, because it may have completed, the start of the pipeline still needs to be redirected to the MSROM uop again, but this time it will start at the redirected microip (this is the case with inline instructions, and it may be able to replay it out of the IQ). If a misprediction happens in an MSROM exception then it doesn't need to resteer the pipeline, it just redirects it directly, because it has taken over the IDQ issue until the end of the procedure -- the issue may have already ended for inline issues.
The ROClear in response to the branch exception vector in the ROB actually happens on the second retirement stage RET2 (which is really the 3rd of 3 stages of the typical retirement pipeline) when the uops are retired. The macroinstruction only retires, exceptions only trigger, and the macroinstruction RIP only updates (with the new target or an increase by the IP delta in the ROB) when the EOM uop (end of macroinstruction) marker retires; even if a non-EOM instruction writes to it, it is not written to the RRF immediately, unlike other registers -- anyway, the branch uop is likely going to be the final uop in a typical branch macroinstruction handled by the decoders. If this is a microbranch in an MSROM procedure, it will not update the BTB; it updates the uIP when it retires, and the RIP is not updated until the end of the procedure.
If a generic non-mispredict exception occurs (i.e. one that requires a handler) during an MSROM macroop execution, once it has been handled, the microip that caused the exception is restored by the handler to the uIP register (in the event that it is passed to the handler when it is called), as well as the current RIP of the macroinstruction which triggered the exception, and when the exception handling ends, instruction fetch is resumed at this RIP+uIP: the macroinstruction is refetched and reattempted in the MSROM, which starts at the uIP signalled to it. The RRF write (or retirement RAT update on the PRF scheme) for previous uops in a complex non-MSROM macroinstruction may occur on the cycle before the EOM uop retires, which means that a restart can happen at a certain uop within a regular complex macroop and not just an MSROM macroop, and in this case, the instruction flow is restarted at the BPU at the RIP, and the complex decoder is configured with valid / invalid bits on the PLA cuop outputs. The uIP for this regular complex instruction that is used to configure the complex decoder valid bits is a value between 0-3, which I think the ROB sets to 0 at each EOM and increments for each microop retired, so that the non-MSROM complex instructions can be addressed, and for MSROM complex instructions, the MSROM routine contains a uop that tells the ROB the uIP of that instruction. The architectural RIP register, however, which is updated by the IP delta only when the EOM uop retires, is still pointing to the current macroop (because the EOM uop failed to retire), which only happens for exceptions but not hardware interrupts, which can't interrupt MSROM procedures or a complex instruction mid-retirement (software interrupts are similar and trigger at the EOM -- the trap MSROM handler performs a macrojump to the RIP of the software trap handler once it has finished).
The BTB read and tag comparison happens in RET1 while the branch unit writes back the results, and then in the next cycle, perhaps also during RET1 (or maybe this is done in RET2), the tags in the set are compared and then, if there is a hit, a new history BHR is calculated; if there is a miss, an entry needs to be allocated at that IP with that target. Only once the uop retires in order (in RET2) can the result be placed into the real history, and the branch prediction algorithm is utilised to update the pattern table where an update is required. If the branch requires allocation, the replacement policy is utilised to decide the means for allocating the branch. If there is a hit, the target will already be correct for all direct and relative branches, so it doesn't have to be compared, in the event of no IBTB. The speculative bit is now removed from the entry if present. Finally, in the next cycle, the branch is written in the BTB cache by the BTB write control logic. The first part of the BTB lookup may be able to go ahead throughout RET1 and then may stall the BTB write pipeline until RET2, when the stage waiting to write to the BTB's ROB entry retires; otherwise, the lookup could be decoupled and the first part completes and writes data to, for instance, the BIT, and at RET2 the corresponding entry to the one retiring is just written back to the BTB (which would mean decoding the set again, comparing tags again and then writing the entry, so maybe not).
If P6 had a uop cache, the pipeline would be something like:
1H: select IP
1L: BTB set decode + cache set decode (physical/virtual index) + ITLB lookup + uop cache set decode
2H: cache read + BTB read + uop cache read
2L: cache tag compare + BTB tag compare + uop cache tag compare; if uop cache hit, stall until uop cache can issue, then clock gate legacy decode pipeline
3H: predict, if taken, flush 3H,2L,2H,1L
3L if taken, begin a 1L with new IP to decode new set and continue with current 16 byte block for which the branch instruction resides to 4L
As for the uop cache, because it is past the stage of the BAC, there is never going to be a bogus branch or an incorrect prediction for an unconditional branch or an incorrect target for a non-indirect branch. The uop cache will use the used/unused mask from the BPU to emit uops for instructions that begin at those bytes, and will use the prediction direction mask to change the macrobranch uops to a predicted not taken / predicted taken macrobranch uop (T/NT predictions are inserted into the uop itself). If it is predicted taken then it stops emitting uops for that 64B aligned block (again, it used to be 32B, previously 16B) and waits for the next window right behind it in the pipeline. The uop cache is going to know what uops are branches and probably statically predicts not taken for all non-predictions, or might have a more advanced static prediction. Indirect target predictions from the IBTB are inserted into the uop immediate field and then it will wait for the next BPU prediction block if this branch is also predicted taken. I would imagine the uop cache creates BIT entries and updates predictions in the BTBs, to ensure that uop cache and MITE (legacy decode) uops update the history in correct sequential order.

Collision Attacks, Message Digests and a Possible solution

I've been doing some preliminary research in the area of message digests, specifically collision attacks on cryptographic hash functions such as MD5 and SHA-1, e.g. the PostScript example and the X.509 certificate duplicate.
From what I can tell, in the case of the PostScript attack, specific data was generated and embedded within the header of the PostScript file (which is ignored during rendering), bringing the internal state of the MD5 computation to a state such that the modified wording of the document would lead to a final MD value equivalent to that of the original PostScript file.
The X.509 attack took a similar approach, whereby data was injected within the comment/whitespace sections of the certificate.
Ok so here is my question, and I can't seem to find anyone asking this question:
Why isn't the length of ONLY the data being consumed added as a final block to the MD calculation?
In the case of X.509 - why are the whitespace and comments being taken into account as part of the MD?
Wouldn't a simple process such as one of the following be enough to resolve the proposed collision attacks:
MD(M + |M|) = xyz
MD(M + |M| + |M| * magicseed_0 +...+ |M| * magicseed_n) = xyz
where :
M : is the message
|M| : size of the message
MD : is the message digest function (eg: md5, sha, whirlpool etc)
xyz : is the pairing of the actual message digest value for the message M and |M|. <M,|M|>
magicseed_{i}: Is a set of random values generated with seed based on the internal-state prior to the size being added.
This technique should work, as to date all such collision attacks rely on adding more data to the original message.
In short, generating a collision message such that:
It not only generates the same MD
But is also comprehensible/parsable/compliant
and is also the same size as the original message,
is immensely difficult, if not near impossible. Has this approach ever been discussed? Any links to papers etc. would be nice.
Further Question: What is the lower bound for collisions of messages of common length for a hash function H chosen randomly from U, where U is the set of universal hash functions ?
Is it 1/N (where N is 2^(|M|)) or is it greater? If it is greater, that implies there is more than 1 message of length |M| that will map to the same MD value for a given H.
If that is the case, how practical is it to find these other messages? Brute force would be O(2^N); is there a method with time complexity less than brute force?
Can't speak for the rest of the questions, but the first one is fairly simple - adding length data to the input of the md5, at any stage of the hashing process (1st block, Nth block, final block), just changes the output hash. You couldn't retrieve that length from the output hash string afterwards. It's also not inconceivable that a collision could be produced from another string with the exact same length in the first place, so saying "the original string was 17 bytes" is meaningless, because the colliding string could also be 17 bytes.
e.g.
md5("abce(17bytes)fghi") = md5("abdefghi<long sequence of text to produce collision>")
is still possible.
In the case of X.509 certificates specifically, the "comments" are not comments in the programming language sense: they are simply additional attributes with an OID that indicates they are to be interpreted as comments. The signature on a certificate is defined to be over the DER representation of the entire tbsCertificate ('to be signed' certificate) structure which includes all the additional attributes.
Hash function design is pretty deep theory, though, and might be better served on the Theoretical CS Stack Exchange.
As @Marc points out, though, as long as more bits can be modified than the output of the hash function contains, then by the pigeonhole principle a collision must exist for some pair of inputs. Because cryptographic hash functions are in general designed to behave pseudo-randomly over their inputs, collisions will tend toward being uniformly distributed over possible inputs.
EDIT: Incorporating the message length into the final block of the hash function would be equivalent to appending the length of everything that has gone before to the input message, so there's no real need to modify the hash function to do this itself; rather, specify it as part of the usage in a given context. I can see where this would make some types of collision attacks harder to pull off, since if you change the message length there's a changed field "downstream" of the area modified by the attack. However, this wouldn't necessarily impede the X.509 intermediate CA forgery attack since the length of the tbsCertificate is not modified.
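For illustration, a sketch of that usage-level construction, MD(M || |M|), using OpenSSL's one-shot MD5() helper (deprecated in OpenSSL 3.x, but fine for demonstration; error handling and the endianness of the length encoding are glossed over):

#include <openssl/md5.h>
#include <cstdint>
#include <string>

// Hash the message with its own length appended, i.e. MD(M || |M|).
std::string md_with_length(const std::string& m) {
    std::string input = m;
    std::uint64_t len = m.size();
    input.append(reinterpret_cast<const char*>(&len), sizeof(len)); // M || |M|
    unsigned char digest[MD5_DIGEST_LENGTH];
    MD5(reinterpret_cast<const unsigned char*>(input.data()), input.size(), digest);
    return std::string(reinterpret_cast<const char*>(digest), MD5_DIGEST_LENGTH);
}

Note this is purely a wrapper around the existing hash; as discussed above, it does not by itself prevent equal-length collisions.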

State based testing (state charts) & transition sequences

I am really stuck with some state based testing concepts...
I am trying to calculate some checking sequences that will cover all transitions from each state, and I have the answers but I don't understand them:
[state chart image: http://www.gam3r.co.uk/1m.jpg]
Now the answers I have are:
[answer image: http://www.gam3r.co.uk/2m.jpg]
I don't understand it at all. For example, say we want to check transition a/x from s1, wouldn't we do ab only? As we are already in s1, we do a/x to test the transition to s2, then b to check we are in the previous right state (s1)? I can't understand why it is aba or even bb for s1...
Can anyone talk me through it?
Thanks
There are 2 events available in each of 4 states, giving 8 transitions, which the author has decided to test in 8 separate test sequences. Each sequence (except the S1 sequences - apparently the machine's initial state is S1) needs to drive the machine to the target state and then perform either event a or event b.
The sequences he has chosen are sufficient, in that each transition is covered. However, they are not unique, and - as you have observed - not minimal.
A more obvious choice would be:
a b
ab aa
aaa aab
ba bb
I don't understand the author's purpose in adding superfluous transitions at the end of each sequence. The system is a Mealy machine - the behaviour of the machine is uniquely determined by the current state and event. There is no memory of the path leading to the current state; therefore the author's extra transitions give no additional coverage and serve only to confuse.
You are also correct that you could cover the all transitions with a shorter set of paths through the graph. However, I would be disinclined to do that. Clarity is more important than optimization for test code.
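For illustration, a minimal harness for this kind of transition coverage might look like the following. The transition and output tables here are made up (I can't see your chart), apart from the s1 --a/x--> s2 transition you mention; the point is simply that each test sequence starts from the initial state S1, drives the machine to the transition under test, and checks the output of each event.

#include <cassert>
#include <string>

enum State { S1, S2, S3, S4 };

struct Mealy {
    State s = S1;
    // Apply one event ('a' or 'b') and return the output symbol.
    char step(char event) {
        // Hypothetical tables; replace with the chart's actual ones.
        static const State next[4][2] = {{S2, S1}, {S3, S1}, {S4, S2}, {S1, S3}};
        static const char  out [4][2] = {{'x', 'y'}, {'x', 'y'}, {'y', 'x'}, {'y', 'x'}};
        int e = (event == 'a') ? 0 : 1;
        char o = out[s][e];
        s = next[s][e];
        return o;
    }
};

// Run one test sequence from the initial state and return the observed outputs.
std::string run(const std::string& events) {
    Mealy m;
    std::string outputs;
    for (char e : events) outputs += m.step(e);
    return outputs;
}

int main() {
    // e.g. the sequence "a" exercises the known S1 --a/x--> S2 transition.
    assert(run("a") == "x");
    return 0;
}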