What commands can be sent on the message port of UHD USRP Sink block? - gnuradio

I have been experimenting with message passing in the Signal Source block in GNU Radio Companion. I can see from its source code that we can pass messages to change the frequency, amplitude, offset and phase of the source. For example, the following PMT message sent from a Message Strobe can change the amplitude of the signal to 0.5.
pmt.dict_add(pmt.make_dict(), pmt.intern("ampl"), pmt.from_double(0.5))
But when I viewed the code of the UHD USRP Sink, I couldn't get a clear idea of what commands can be sent to this block or which parameters can be changed. I have read in some places in the documentation that the frequency, gain, LO offset, timestamp, center frequency and other transceiver-related settings of the USRP Sink can be manipulated through command messages.
What commands can be sent to the USRP Sink block from a Message Strobe (in PMT format), and which parameters (and their keys) can be modified?

This is officially documented:
https://www.gnuradio.org/doc/doxygen/page_uhd.html#uhd_command_syntax
Command name (value type): Description
chan (int): Specifies a channel. If this is not given, either all channels are chosen, or channel 0, depending on the action. A value of -1 forces 'all channels', where possible.
gain (double): Sets the Tx or Rx gain (in dB). Defaults to all channels.
power_dbm (double): Sets the Tx or Rx power reference level (in dBm). Defaults to all channels. Works for certain devices only, and only if calibration data is available.
freq (double): Sets the Tx or Rx frequency. Defaults to all channels. If specified without lo_offset, it will set the LO offset to zero.
lo_offset (double): Sets an LO offset. Defaults to all channels. Note this does not affect the effective center frequency.
tune (tune_request): Like freq, but sets a full tune request (i.e. center frequency and DSP offset). Defaults to all channels.
mtune (tune_request_t): Like tune, but supports a full manual tune request as uhd::tune_request_t. Defaults to all channels.
lo_freq (double): For fully manual tuning: set the LO frequency (RF frequency). Conflicts with freq, lo_offset, and tune.
dsp_freq (double): For fully manual tuning: set the DSP frequency (CORDIC frequency). Conflicts with freq, lo_offset, and tune.
direction (string): Used for timed transceiver tuning to ensure tuning order is maintained. Values other than 'TX' or 'RX' will be ignored.
rate (double): See usrp_block::set_samp_rate(). Always affects all channels.
bandwidth (double): See usrp_block::set_bandwidth(). Defaults to all channels.
time (timestamp): Sets a command time. See usrp_block::set_command_time(). A value of PMT_NIL will clear the command time.
mboard (int): Specify mboard index, where applicable.
antenna (string): See usrp_block::set_antenna(). Defaults to all channels.
gpio (gpio): PMT dictionary including bank, attr, value, mask for GPIO. See notes.
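As a minimal sketch of how such a command looks in Python, here is a PMT dictionary that retunes the sink to 915 MHz and sets 20 dB gain on channel 0 (the frequency, gain and channel values are placeholders); send it to the block's "command" message port, e.g. from a Message Strobe:
import pmt

# Build a command dictionary using the documented keys from the table above.
cmd = pmt.make_dict()
cmd = pmt.dict_add(cmd, pmt.intern("freq"), pmt.from_double(915e6))   # Tx/Rx frequency in Hz
cmd = pmt.dict_add(cmd, pmt.intern("gain"), pmt.from_double(20.0))    # gain in dB
cmd = pmt.dict_add(cmd, pmt.intern("chan"), pmt.from_long(0))         # apply to channel 0 only
# In GRC, paste the dict expression into the Message Strobe's "Message PMT" field
# and connect the strobe's output to the USRP Sink's "command" port.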

Related

Demodulating GFSK

I'm trying to demodulate a GFSK signal coming from an nRF24L01+ transceiver chip (hooked up to my Arduino). I've followed this guide so far:
https://www.bitcraze.io/2015/06/sniffing-crazyflies-radio-with-hackrf-blue/#comment-38046
...and managed to manually demodulate a packet (the address and the message I sent, 'martijn', are clearly recoverable):
https://drive.google.com/open?id=0B9CJ42CGPiF2TWoyelRmWldZcU0
However, now I want to receive packets and decode them as they come in. Someone already made a decoder for this job, but somehow it fails to find my nRF24 packets:
https://wiki.bitcraze.io/misc:hacks:hackrf
My Arduino code for sending the packets is as follows:
#include <SPI.h>
#include <nRF24L01.h>
#include <RF24.h>
#include <RF24_config.h>

RF24 radio(9, 10);
const uint64_t pipe = 0xe7e7e7e7e7;
char package[] = "martijn";

void setup() {
  Serial.begin(9600);
  radio.begin();
  radio.setDataRate(RF24_1MBPS);
  radio.setChannel(95);
  radio.openWritingPipe(pipe);
  radio.enableDynamicPayloads();
  radio.setAutoAck(true);
  radio.powerUp();
}

void loop() {
  radio.write(&package, strlen(package));
  delay(1);
}
Basically I just want to use GNU Radio Companion to obtain the nRF24 packets, and send their binary data into a file. I'm fine with writing my own decoder. However, I have no clue on how to get this binary data from the incoming signals.
(The comments at the bitcraze site are also mine)
I'd be very happy if someone could help me (or even point me in the right direction). Thanks in advance!
After the Quadrature Demod you have to use a clock recovery block. The M&M Clock Recovery block of GNU Radio should do the job. This block will dramatically increase the performance of the decoding.
However, you have to take care of some parameters that this block requires. The most important is 'omega', which roughly speaking corresponds to the number of samples per symbol. For example, if your GFSK baud rate is 9600 and your incoming sample rate from the hardware is 96000, each symbol corresponds to 10 samples. Omega can be any float number. Note, however, that clock recovery does not work well for large omega values, so try to keep omega at 8.0 or below. To do that, either adjust the hardware sampling rate appropriately or do some resampling.
After the clock recovery, just use a 'Binary Slicer' block. This will convert the floats to bits (0s and 1s). Using the 'Pack K Bits' block you can convert the bit stream into a byte stream, which can easily be saved to a file with a 'File Sink'.
Here is a good step-by-step tutorial for an FSK receiver. GFSK only adds a Gaussian filter, so the procedure is much the same for both.
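Putting those blocks together in a small Python top block might look roughly like this. It is only a minimal sketch, assuming GNU Radio 3.7-style block names, a complex baseband capture file as the source, and placeholder values for the sample rate, FM deviation and clock-recovery parameters:
import math
from gnuradio import gr, blocks, analog, digital

class nrf24_rx(gr.top_block):
    def __init__(self):
        gr.top_block.__init__(self, "nRF24 GFSK receiver sketch")
        samp_rate = 4e6                    # hardware sample rate (placeholder)
        baud_rate = 1e6                    # RF24_1MBPS
        deviation = 170e3                  # assumed FSK deviation (placeholder)
        sps = samp_rate / baud_rate        # 'omega': samples per symbol (4 here, <= 8 as advised)

        # Replace with your osmocom/HackRF source block; a recorded IQ file is used here.
        src = blocks.file_source(gr.sizeof_gr_complex, "capture.iq", False)
        demod = analog.quadrature_demod_cf(samp_rate / (2 * math.pi * deviation))
        clock = digital.clock_recovery_mm_ff(sps, 0.25 * 0.175 * 0.175, 0.5, 0.175, 0.005)
        slicer = digital.binary_slicer_fb()        # floats -> bits (0/1)
        pack = blocks.pack_k_bits_bb(8)            # 8 bits -> 1 byte
        sink = blocks.file_sink(gr.sizeof_char, "bits.bin")

        self.connect(src, demod, clock, slicer, pack, sink)

if __name__ == '__main__':
    nrf24_rx().run()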

Reading incorrect temperature value from thermistor using EVK1100

I want to get the temperature from a thermistor, so I made a voltage divider (3.3 V to a 10k resistor, and a 10k thermistor to ground). I read the ADC between the 10k resistor and the thermistor.
The B coefficient is 3977, the nominal temperature is 25 °C, and I use the simple B-parameter equation. I'm not sure where I'm making a mistake; I read the room temperature as 10.5 °C when it was supposed to be around 24 °C. The following is the part of the program that I used for the temperature sensor (developed in AVR Studio):
#define TEMPERATURENOMINAL 25
#define TERMISTORNOMINAL 10000
#define BCOEFFICIENT 3977
#define SERIESRESISTOR 10000
{
    float ke1, tempa, xin;
    ke1 = adc_get_value(peak_adc2, peak2);          // raw 10-bit ADC reading
    xin = (1023 / ke1) - 1;                         // Rseries/Rthermistor
    xin = SERIESRESISTOR / xin;                     // thermistor resistance
    tempa = xin / TERMISTORNOMINAL;                 // R/Ro
    tempa = log(tempa);                             // ln(R/Ro)
    tempa /= BCOEFFICIENT;                          // (1/B) * ln(R/Ro)
    tempa += 1.0 / (TEMPERATURENOMINAL + 273.15);   // + 1/To
    tempa = 1.0 / tempa;                            // invert -> T in kelvin
    tempa -= 273.15;                                // convert to Celsius
    dip204_set_cursor_position(1, 3);
    //sprintf(ui, "Temp is %.2f deg", Ref);
    sprintf(ui, "Temp is %.2f deg", tempa);
    dip204_write_string(ui);
}
I checked the voltage with a multimeter between the thermistor and the 10k resistor, and on the EVK1100 using the following line:
ke1 = adc_get_value(peak_adc2,peak2)*3.3/1024;
I get the same voltage in both.
I'm not sure where I'm making a mistake; I hope someone can guide me in the right direction.
Your code looks correct to me, and I suspect a hardware problem may be the culprit.
It seems likely you have inadvertently connected two 10K-ohm pull-up resistors between the ADC input and the +3.3V reference: perhaps one is already populated on the EVK1100 board, and you have added another one externally connected to your thermistor. This would be equivalent to putting both 10K-ohm resistors in parallel with each other, which would be equivalent to a 5K-ohm resistor in series with the thermistor. At 25°C, the thermistor resistance Rt would read 10K ohms, which would produce a voltage of:
+3.3V * (Rt / (Rt + 5K))
= 2.20V
instead of the correct +1.65V. This number is very close to the result you are seeing (+2.17V @ 24°C).
You can verify this hypothesis by looking at the schematic and/or PCB for the EVK1100 to see if a 10K-ohm pull-up resistor is connected from the ADC input to +3.3V. If this is the problem, remove one of the two resistors and you should see correct behavior.
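For reference, here is a quick check of that hypothesis in Python (the 3.3 V supply, 10k thermistor at 25 °C, 10k external series resistor and the suspected 10k on-board pull-up are the values assumed above):
# Voltage at the ADC node of the divider for one vs. two 10k pull-ups.
VREF = 3.3
R_THERM_25C = 10_000.0   # thermistor at 25 degC
R_EXTERNAL = 10_000.0    # external series resistor
R_ONBOARD = 10_000.0     # suspected on-board pull-up

r_series_parallel = (R_EXTERNAL * R_ONBOARD) / (R_EXTERNAL + R_ONBOARD)       # 5k effective
v_one_pullup = VREF * R_THERM_25C / (R_THERM_25C + R_EXTERNAL)                # 1.65 V (expected)
v_two_pullups = VREF * R_THERM_25C / (R_THERM_25C + r_series_parallel)        # 2.20 V (observed ~2.17 V)
print(v_one_pullup, v_two_pullups)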

What branch misprediction does the Branch Target Buffer detect?

I am currently looking at the various parts of the CPU pipeline which can detect branch mispredictions. I have found these are:
Branch Target Buffer (BPU CLEAR)
Branch Address Calculator (BA CLEAR)
Jump Execution Unit (not sure of the signal name here??)
I know what 2 and 3 detect, but I do not understand what misprediction is detected within the BTB. The BAC detects cases where the BTB has erroneously predicted a branch for a non-branch instruction, where the BTB has failed to detect a branch, or where the BTB has mispredicted the target address for an x86 RET instruction. The execution unit evaluates the branch and determines if it was correct.
What type of misprediction is detected at the Branch Target Buffer? What exactly is detected as a misprediction here?
The only clue I could find was this inside Vol 3 of the Intel Developer Manuals (the two BPU CLEAR event counters at the bottom):
BPU predicted a taken branch after incorrectly assuming that it was not taken.
This seems to imply the prediction is not done "synchronously", but rather "asynchronously", hence the "after incorrectly assuming"??
UPDATE:
Ross, this is the CPU branch circuitry, from the original Intel patent (how's that for "reading"?):
I don't see "Branch Prediction Unit" anywhere. Would it be reasonable for somebody who has read this paper to assume that "BPU" is a lazy way of grouping the BTB circuit, BTB cache, BAC and RSB together?
So my question still stands, which component raises the BPU CLEAR signal?
This is a good question! I think the confusion that it's causing is due to Intel's strange naming schemes which often overload terms standard in academia. I will try to both answer your question and also clear up the confusion I see in the comments.
First of all. I agree that in standard computer science terminology a branch target buffer isn't synonymous with branch predictor. However in Intel terminology the Branch Target Buffer (BTB) [in capitals] is something specific and contains both a predictor and a Branch Target Buffer Cache (BTBC) which is just a table of branch instructions and their targets on a taken outcome. This BTBC is what most people understand as a branch target buffer [lower case]. So what is the Branch Address Calculator (BAC) and why do we need it if we have a BTB?
So, you understand that modern processors are split into pipelines with multiple stages. Whether this is a simple pipelined processor or an out-of-order superscalar processor, the first stages are typically fetch then decode. In the fetch stage all we have is the address of the current instruction contained in the program counter (PC). We use the PC to load bytes from memory and send them to the decode stage. In most cases we increment the PC in order to load the subsequent instruction(s) but in other cases we process a control flow instruction which can modify the contents of the PC completely.
The purpose of the BTB is to guess if the address in the PC points to a branch instruction, and if so, what should the next address in the PC be? That's fine, we can use a predictor for conditional branches and the BTBC for the next address. If the prediction was right, that's great! If the prediction was wrong, what then? If the BTB is the only unit we have then we would have to wait until the branch reaches the issue/execute stage of the pipeline. We would have to flush the pipeline and start again. But not every situation needs to be resolved so late. This is where the Branch Address Calculator (BAC) comes in.
The BTB is used in the fetch stage of the pipeline but the BAC resides in the decode stage. Once the instruction we fetched is decoded, we actually have a lot more information which can be useful. The first new piece of information we know is: "is the instruction I fetched actually a branch?" In the fetch stage we have no idea and the BTB can only guess, but in the decode stage we know it for sure. It is possible that the BTB predicts a branch when in fact the instruction is not a branch; in this case the BAC will halt the fetch unit, fix the BTB, and reinitiate fetching correctly.
What about branches like unconditional relative branches and calls? These can be validated at the decode stage. The BAC will check the BTB, see if there are entries in the BTBC and set the predictor to always predict taken. For conditional branches, the BAC cannot confirm if they are taken/not-taken yet, but it can at least validate the predicted address and correct the BTB in the event of a bad address prediction. Sometimes the BTB won't identify/predict a branch at all. The BAC needs to correct this and give the BTB new information about this instruction. Since the BAC doesn't have a conditional predictor of its own, it uses a simple mechanism (backwards branches taken, forward branches not taken).
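Just to make that fallback concrete, here is a tiny sketch of the heuristic named above (not Intel's implementation, only the rule: backward branches predicted taken, forward branches predicted not taken), assuming we know the branch's own address and its decoded target:
def bac_static_prediction(branch_addr: int, target_addr: int) -> bool:
    # Backward branches (typically loop bottoms) are predicted taken;
    # forward branches are predicted not taken.
    return target_addr < branch_addr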
Somebody will need to confirm my understanding of these hardware counters, but I believe they mean the following:
BACLEAR.CLEAR is incremented when the BTB in fetch does a bad job and the BAC in decode can fix it.
BPU_CLEARS.EARLY is incremented when fetch decides (incorrectly) to load the next instruction before the BTB predicts that it should actually load from the taken path instead. This is because the BTB requires multiple cycles and fetch uses that time to speculatively load a consecutive block of instructions. This can be due to Intel using two BTBs, one quick and the other slower but more accurate. It takes more cycles to get a better prediction.
This explains why the penalty of detecting a misprediction in the BTB is 2/3 cycles, whereas the penalty of detecting a misprediction in the BAC is 8 cycles.
The fact that the BPU_CLEARS.EARLY description shows this event occurring when the BPU predicts, correcting an assumption, implies there are 3 stages in the front end: assume, predict and calculate. My current guess is that an early clear is flushing the stages of the pipeline before the stage at which a prediction from the L1 BTB is even known, when the prediction is 'taken' as opposed to not taken.
The BTB set contains 4 ways for a maximum of 4 branches per 16 bytes (where all the ways in the set get tagged with the same tag indicating that particular 16 byte aligned block). Each way has an offset indicating the 4 LSBs of the address therefore the byte position within 16 bytes. Each entry also has a speculative bit, valid bit, pLRU bits, a speculative local BHR, a real local BHR, and each way shares the set BPT (PHT) as a second level of prediction. This may be alloyed with a GHR / speculative GHR.
I think the BPU provides a 64B prediction block to the uop cache and instruction cache (it used to be 32B, and it was 16B on P6). For the legacy route it needs to provide a 64 (or 32/16) byte prediction block which is a set of 64 bit masks which mark predicted branch instructions, prediction directions and which byte is a branch target. This information will be furnished by the L1 BTB while the fetch for the 64 byte line is underway such that 16 byte aligned (IFU has always been 16B) blocks that are read out of it with no used bits at all will not be fetched by the instruction predecoder (unused is the default because on architectures where the prediction block is smaller than the line size, the BPU may only provide bitmasks for 16 or 32B of the line). The BPU therefore provides the predictions made mask, the used/unused mask (marking bytes after first taken branch in the first prediction block and before the branch target in the second prediction block as unused and the rest used), the prediction directions mask; and the ILD provides the branch instructions mask. The first used byte in the prediction block is implicitly a branch target or the start of the instruction flow after a resteer or switch from the uop cache (DSP) to the legacy pipeline (MITE). The used bytes within the prediction block make up a prediction window.
Here is a P6 pipeline. Using this as an example, an early clear is in the 3rd cycle (13), when a prediction is made (and the target and entry type is read, and therefore unconditional branch targets are now known as well as conditional and their predictions). The first predicted taken branch target in the set for the 16 byte block is used. At this point, the 2 pipe stages before it have already been filled with fetches and beginning of lookups from the next sequential 16 byte blocks, which means that they need to be flushed if there is any taken prediction (otherwise it doesn't need to be as the next sequential 16 byte block is already beginning to be looked up in the pipestage before it), leaving a 2 cycle gap or bubble. The cache lookup occurs at the same time as the BTB lookup, so both the BTB and cache parallel 2 pipestages will have to be flushed, whereas the 3rd stage doesn't need to be flushed from the cache or the BTB because the IP is on a confirmed path and is the IP being currently looked up for the next one. In fact, in this P6 design, there is only a one cycle bubble for this early clear, because the new IP can be sent to the first stage to decode a set again on the high edge of clock while those other stages are being flushed.
This pipelining is more beneficial than waiting for the prediction before beginning a lookup on the next IP. That would mean a lookup every other cycle, giving a throughput of 16 bytes of predictions every 2 cycles, so 8B/c. In the P6 pipelined scenario, the throughput is 16 bytes per cycle on a correct assumption and 8B/c on an incorrect assumption. Obviously faster. If we assume 2/3 of assumptions are correct, for 1 in 9 instructions being a taken branch and 4 instructions per block, this gives a throughput of 16B per (1*0.666 + 2*0.333) = 1.332 cycles instead of 16B per 2 cycles.
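For what it's worth, that arithmetic can be checked in a few lines (the 2/3 correct-assumption rate and the 1-cycle vs. 2-cycle costs are the assumptions stated above):
p_correct = 2 / 3
cycles_per_16B_block = p_correct * 1 + (1 - p_correct) * 2    # ~1.33 cycles per block
throughput_pipelined = 16 / cycles_per_16B_block               # ~12 B/cycle
throughput_unpipelined = 16 / 2                                # 8 B/cycle
print(cycles_per_16B_block, throughput_pipelined, throughput_unpipelined)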
If this is true, every taken branch will cause an early clear. This is however not the case when I use the event on my KBL. Hopefully the event is actually wrong because it is supposed to not be supported on my KBL, but does show a random low number, so hopefully it is counting something else. This also does not appear to be supported by the following https://gist.github.com/mattgodbolt/4e2cbb1c9aa97e0c5478 https://github.com/mattgodbolt/agner/blob/master/tests/branch.py. Given the 900k instructions and 100k early clears, I do not see how you can have an odd number of early clears if you use my definition of early clears and then look at his code. If we assume that the window is 32B for that CPU, then if you use an alignment of 4 on each branch instruction in that macro you get a clear every 8 instructions, because 8 will fit into the 32B aligned window.
I am not sure why Haswell and Ivy Bridge have such values for early and late clears but these events (0xe8) disappear starting with SnB, which happens to coincide with when the BTB was unified into a single structure. It also looks like the late clears counter is now counting the early clears event because it is the same number as the early clears on the Arrandale CPU, and the early clears event is now counting nothing. I'm also not sure why Nehalem has a 2 cycle bubble for early clears as the design of the L1 Nehalem BTB doesn't seem to change much from the P6 BTB, both 512 entries with 4 ways per set. It is probably because it has been broken down into more stages due to the higher clock speeds and hence also the longer L1i cache latency.
The late clear (BPU_CLEARS.LATE) appears to happen at the ILD. In the diagram above, the cache lookup takes only 2 cycles. In more recent processors, it apparently takes 4 cycles. This allows another L2 BTB to be inserted and a lookup in it to take place. 'MRU bypass' and 'MRU conflicts' could just mean that there was a miss in the MRU BTB or it could also mean that the prediction in the L2 is different to the one in L1 in the event that it uses a different prediction algorithm and history file. On my KBL, which does not support either event, I always get 0 for ILD_STALL.MRU but not BPU_CLEARS.LATE. The 3 cycle bubble comes from the BPU at stage 5 (which is also an ILD stage) resteering the pipeline at the low edge of stage 1 and flushing stages 2, 3 and 4 (which falls in line with cited L1i latencies of 4 cycles, as the L1i lookup occurs across stages 1–4 for a hit+ITLB hit). As soon as the prediction is made, the BTBs update the entries' speculative local BHR bits with the prediction that was made.
A BACLEAR happens for instance when the IQ compares the predictions-made mask with the branch instruction mask produced by the predecoder, and then for certain instruction types like a relative jump, it will check the sign bit to perform a static branch prediction. I'd imagine the static prediction happens as soon as it enters the IQ from the predecoder, such that instructions that immediately go to the decoder contain the static prediction. The branch now being predicted taken will result in a BACLEAR_FORCE_IQ when the branch instruction is decoded, because there won't be a target to verify when the BAC calculates the absolute address of the relative conditional branch instruction, which it needs to verify when it is predicted taken.
The BAC at the decoders also makes sure the relative branches and direct branches have the correct branch target prediction after calculating the absolute address from the instruction itself and comparing it with the predicted target; if they do not match, a BACLEAR is issued. For relative jumps, static prediction in the BAC uses the sign bit of the jump displacement to statically predict taken / not taken if a prediction has not been made but also overrides all return predictions as taken if the BTB does not support return entry types (it doesn't on P6 and makes no prediction, instead the BAC uses the BPU's RSB mechanism and it is the first point in the pipeline that a return instruction is acknowledged) and overrides all register indirect branch predictions as taken on P6 (because there is no IBTB) as it uses the statistic that more branches are taken than not. The BAC calculates and inserts the absolute target from the relative target into the uop and inserts the IP delta into the uop and inserts the fall through IP (NLIP) into the BPU's BIT, which may be tagged to the uop, or more likely the BIT entries work on a corresponding circular queue which will stall if there aren't enough BIT entries, and the indirect target prediction or known target is inserted into the uop 64 bit immediate field. These fields in the uop are used by the allocator for allocation into the RS/ROB later on. The BAC also informs the BTB of any spurious predictions (non branch instructions) that need their entries deallocating from the BTB. At the decoders, branch instructions are detected early in the logic (when prefixes are decoded and the instruction is examined to see if it can be decoded by the decoder) and the BAC is accessed in parallel with the rest. The BAC inserting the known or otherwise predicted target into the uop is known as converting an auop into a duop. The prediction is encoded into the uop opcode.
The BAC likely instructs the BTB to speculatively update its entry for the detected branch instruction's IP. If the target is now known and no prediction was made for it (meaning it wasn't in the cache) -- it is still speculative as although the branch target is known for certain, it still could be on a speculative path, so is marked with a speculative bit -- this will now immediately provide early steers, especially for unconditional branches now entering the pipeline but also for conditional branches, which start with a blank history and so will be predicted not taken next time, rather than having to wait until retire.
The IQ above contains a bitmask field for branch prediction directions (BTBP) and branch predictions made / no prediction made (BTBH) (to distinguish which of the 0s in the BTBP are not taken as opposed to no prediction made) for each of the 8 instruction bytes in an IQ line as well as the target of a branch instruction, meaning there can only be one branch per IQ line and it ends the line. This diagram does not show the branch instruction mask produced by the predecoder that shows what instructions actually are branches such that the IQ knows what not-made predictions it needs to make a prediction for (and what ones are not branch instructions at all).
The IQ is a contiguous block of instruction bytes and the ILD populates 8-bit bitmasks which identify the first opcode byte (OpM) and instruction end byte (EBM) of each macroinstruction as it wraps round bytes into the IQ. It probably also provides bits indicating whether it is a complex instruction or a simple instruction (as suggested by the 'predecode bits' on many AMD patents). The gaps between these markers are implicitly prefix bytes for the following instruction. I'm thinking the IQ is designed such that the uops it issues in the IDQ/ROB will rarely outrun the IQ such that the head pointer in the IQ starts overwriting macroinstructions still tagged in the IDQ waiting to be allocated, and when it does, there is a stall, so the IDQ tags refer back to the IQ, which the allocator accesses. I think the ROB uses this uop tag as well. The IQ on SnB if 16 bytes * 40 entries contains 40 macroops in the worst case, 320 in the average case, 640 in the best case. The number of uops these produce will be much greater, so it will rarely outrun, and when it does, I guess it stalls decode until more instructions retire. The tail pointer contains the recently allocated tag by the ILD, the head pointer contains the next macroinstruction instruction waiting to retire, and the read pointer is the current tag to be consumed by the decoders (which moves towards the tail pointer). Although, this becomes difficult now that some if not the majority of the uops in the path come from the uop cache since SnB. The IQ may be allowed to outrun the back end in the event that uops are not tagged with the IQ entries (and the fields in the IQ are instead inserted into uops directly), and this will be detected and the pipeline will just be resteered from the beginning.
When the allocator allocates a physical destination (Pdst) for a branch micro-op into the ROB, the allocator provides the Pdst entry number to the BPU. The BPU inserts this into the correct BIT entry assigned by the BAC (which is probably at the head of a circular queue of active BIT entries that are yet to be allocated a Pdst). The allocator also extracts fields from the uop and allocates the data into the RS.
The RS contains a field that indicates whether an instruction is a MSROM uop or a regular uop, which the allocator populates. The allocator also inserts the confirmed absolute target or the predicted absolute target into the immediate data and as a source, renames the flags register (or just a flag bit) and in the case of an indirect branch, there is also the renamed register that contains the address as another source. The Pdst in the PRF scheme would be the ROB entry, which as a Pdst would be the retirement macro-RIP or micro-IP register. The JEU writes the target or fallthrough to that register (it may not need to if the prediction is correct).
When the reservation station dispatches a branch micro-op to a jump execution unit located in the integer execution unit, the reservation station informs the BTB of the Pdst entry for the corresponding branch micro-op. In response, the BTB accesses the corresponding entry for the branch instruction in the BIT and the fall through IP (NLIP) is read out, decremented by the IP delta in the RS, and decoded to point to the set in which the branch entry will be updated/allocated.
The outcome from the renamed flag register source Pdst to determine whether the branch is taken / not taken is then compared with the prediction in the opcode in the scheduler, and additionally, if the branch is indirect, the predicted target in the BIT is compared with the address in the source Pdst (that was calculated and became available in the RS before it was scheduled and dispatched), and it is now known whether a correct prediction was made or not and whether the target is correct or not.
The JEU propagates an exception code to the ROB and flushes the pipeline (JEClear -- which flushes the whole pipeline before the allocate stage, as well as stalls the allocator) and redirects the next IP logic at the start of the pipeline using the fallthrough (in BIT) / target IP appropriately (as well as the microsequencer if it is a microbranch misprediction; the RIP directed to the start of the pipeline will be the same one throughout the MSROM procedure). Speculative entries are deallocated and true BHRs are copied into the speculative BHRs. In the event there is a BOB in the PRF scheme, the BOB takes snapshots of the RAT state for every branch instruction, and when there is a misprediction, the JEU rolls back the RAT state to that snapshot and the allocator can proceed immediately (which is particularly useful for microbranch misprediction as it is closer to the allocator, therefore the bubble will not be as well hidden by the pipeline), rather than stalling the allocator and having to wait until retire for the retirement RAT state to be known and using that to restore the RAT and then clear the ROB (ROClear, which unstalls the allocator). With a BOB, the allocator can start issuing new instructions while the stale uops continue to execute, and when the branch is ready to retire, the ROClear only clears the uops between the retired misprediction and the new uops. If it is an MSROM uop, because it may have completed, the start of the pipeline still needs to be redirected to the MSROM uop again, but this time it will start at the redirected microip (this is the case with inline instructions, and it may be able to replay it out of the IQ). If a misprediction happens in an MSROM exception then it doesn't need to resteer the pipeline, just redirects it directly, because it has taken over the IDQ issue until the end of the procedure -- the issue may have already ended for inline issues.
The ROClear in response to the branch exception vector in the ROB actually happens on the second retirement stage RET2 (which is really the 3rd of the 3 stages of the typical retirement pipeline) when the uops are retired. The macroinstruction only retires, exceptions only trigger and the macroinstruction RIP only updates (with the new target or an increase by the IP delta in the ROB) when the EOM uop (end of macroinstruction) marker retires; even if a non-EOM instruction writes to it, it is not written to the RRF immediately, unlike other registers -- anyway, the branch uop is likely going to be the final uop in a typical branch macroinstruction handled by the decoders. If this is a microbranch in an MSROM procedure, it will not update the BTB; it updates the uIP when it retires, and the RIP is not updated until the end of the procedure.
If a generic non-mispredict exception occurs (i.e. one that requires a handler) during a MSROM macroop execution, once it has been handled, the microip that caused the exception is restored by the handler to the uIP register (in the event that it is passed to the handler when it is called), as well as the current RIP of the macroinstruction which triggered the exception, and when the exception handling ends, instruction fetch is resumed at this RIP+uIP: the macroinstruction is refetched and reattempted in the MSROM, which starts at the uIP signalled to it. The RRF write (or retirement RAT update on the PRF scheme) for previous uops in a complex non-MSROM macroinstruction may occur on the cycle before the EOM uop retires, which means that a restart can happen at a certain uop within a regular complex macroop and not just a MSROM macroop, and in this case, the instruction flow is restarted at the BPU at the RIP, and the complex decoder is configured with valid / invalid bits on the PLA cuop outputs. The uIP for this regular complex instruction that is used to configure the complex decoder valid bits is a value between 0-3, which I think the ROB sets to 0 at each EOM and increments for each microop retired, so that the non-MSROM complex instructions can be addressed, and for MSROM complex instructions, the MSROM routine contains a uop that tells the ROB the uIP of that instruction. The architectural RIP register, however, which is updated by the IP delta only when the EOM uop retires, is still pointing to the current macroop (because the EOM uop failed to retire), which only happens for exceptions but not hardware interrupts, which can't interrupt MSROM procedures or complex instructions mid-retirement (software interrupts are similar and trigger at the EOM -- the trap MSROM handler performs a macrojump to the RIP of the software trap handler once it has finished).
The BTB read and tag comparison happens in RET1 while the branch unit writes back the results, and then in the next cycle, perhaps also during RET1 (or maybe this is done in RET2), the tags in the set are compared and then, if there is a hit, a new history BHR is calculated; if there is a miss, an entry needs to be allocated at that IP with that target. Only once the uop retires in order (in RET2) can the result be placed into the real history, and the branch prediction algorithm is utilised to update the pattern table where an update is required. If the branch requires allocation, the replacement policy is utilised to decide the means for allocating the branch. If there is a hit, the target will already be correct for all direct and relative branches, so it doesn't have to be compared, in the event of no IBTB. The speculative bit is now removed from the entry if present. Finally, in the next cycle, the branch is written in the BTB cache by the BTB write control logic. The first part of the BTB lookup may be able to go ahead throughout RET1 and then may stall the BTB write pipeline until RET2, when the stage waiting to write to the BTB's ROB entry retires; otherwise, the lookup could be decoupled and the first part completes and writes data to, for instance, the BIT, and at RET2 the entry corresponding to the one retiring is just written back to the BTB (which would mean decoding the set again, comparing tags again and then writing the entry, so maybe not).
If P6 had a uop cache, the pipeline would be something like:
1H: select IP
1L: BTB set decode + cache set decode (physical/virtual index) + ITLB lookup + uop cache set decode
2H: cache read + BTB read + uop cache read
2L: cache tag compare + BTB tag compare + uop cache tag compare; if uop cache hit, stall until uop cache can issue, then clock gate legacy decode pipeline
3H: predict; if taken, flush 3H, 2L, 2H, 1L
3L: if taken, begin a 1L with the new IP to decode a new set and continue with the current 16 byte block in which the branch instruction resides to 4L
As for the uop cache, because it is past the stage of the BAC, there is never going to be a bogus branch or an incorrect prediction for an unconditional branch or an incorrect target for a non-indirect branch. The uop cache will use the used/unused mask from the BPU to emit uops for instructions that begin at those bytes, and will use the prediction direction mask to change the macrobranch uops to a predicted not taken / predicted taken macrobranch uop (T/NT predictions are inserted into the uop itself). If it is predicted taken then it stops emitting uops for that 64B aligned block (again used to be 32B, previously 16B) and waits for the next window right behind it in the pipeline. The uop cache is going to know what uops are branches and probably statically predicts not taken for all non-predictions, or might have a more advanced static prediction. Indirect target predictions from the IBTB are inserted into the uop immediate field and then it will wait for the next BPU prediction block if this branch is also predicted taken. I would imagine the uop cache creates BIT entries and updates predictions in the BTBs, to ensure that uop cache and MITE (legacy decode) uops update the history in correct sequential order.

measuring time between two rising edges in beaglebone

I am reading a sensor output as a square wave (0-5 V) on an oscilloscope. Now I want to measure the frequency of one period with a BeagleBone, so I should measure the time between two rising edges. However, I don't have any experience working with the BeagleBone. Can you give some advice or sample code about measuring the time between rising edges?
How deterministic do you need this to be? If you can tolerate some inaccuracy, you can probably do it on the main Linux OS; if you want to be fancy, this seems like a potential use case for the BBB's PRUs (which I unfortunately haven't used, so take this with substantial amounts of salt). I would expect you'd be able to write PRU code that sits in an infinite outer loop, and inside that loop first loops until it sees the pin show 0, then loops until the pin shows 1 (this is the first rising edge), then counts until either the pin shows 0 again (this would then be the falling edge) or, with another loop, until the next rising edge. Either way, you can take the counter value and convert it directly into time (the PRU is specified as having a fixed execution time per instruction and runs at 200 MHz, i.e. 5 ns per instruction). Assuming your loop is something like
#starting with pin low
inner loop 1:
registerX = loadPin
increment counter
jump if zero registerX to inner loop 1
# pin is now high
inner loop 2:
registerX = loadPin
increment counter
jump if one registerX to inner loop 2
# pin is now low again
That should take 3 instructions per counter increment, so you can get the time as 3 * counter * 5 ns.
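Converting the counter into a period and frequency is then just arithmetic; here is a small Python helper assuming the 3-instruction loop body and the 200 MHz (5 ns/instruction) PRU clock stated above:
PRU_CLOCK_HZ = 200_000_000   # PRU core clock
INSTR_PER_COUNT = 3          # instructions executed per counter increment in the loop above

def counter_to_period_s(counter: int) -> float:
    return counter * INSTR_PER_COUNT / PRU_CLOCK_HZ

def counter_to_frequency_hz(counter: int) -> float:
    return 1.0 / counter_to_period_s(counter)

# Example: a count of about 66_667 between rising edges corresponds to roughly a 1 kHz signal.
print(counter_to_frequency_hz(66_667))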
As suggested by Foon in his answer, the PRUs are a good fit for this task (although depending on your requirements it may be fine to use the ARM processor and standard GPIO). Please note that (as far as I know) both the regular GPIOs and the PRU inputs are based on 3.3V logic, and connecting a 5V signal might fry your board! You will need an additional component or circuit to convert from 5V to 3.3V.
I've written a basic example that measures timing between rising edges on the header pin P8.15 for my own purpose of measuring an engine's rpm. If you decide to use it, you should check the timing results against a known reference. It's about right but I haven't checked it carefully at all. It is implemented using PRU assembly and uses the pypruss python module to simplify interfacing.

OFDM transceiver with rayleigh channel using Standard PDP in matlab

I have built an OFDM transceiver with a Rayleigh channel using standard PDPs like EPA, EVA and ETU. The problem is I am getting a very high BER even for BPSK, i.e. 50-60% or more bits in error. A scatter plot confirms it. My OFDM transceiver blocks include:
Random Data -- Modulation (BPSK, QPSK, QAM) -- Serial2Parallel -- IFFT -- Cyclic Prefix >>> Rayleigh Channel >>> Remove CP -- FFT -- Par2Ser -- DeMod -- Sink Data
I have used the built-in MATLAB function to create the Rayleigh channel, passing the standard PDP as a parameter:
channelObj = rayleighchan(tSampling,fDoppler,tau_in_sec,pdb_in_dB);
channelObj.ResetBeforeFiltering=0; % channel remains static before filtering
Filtering for n OFDM symbols and calculating the CIR:
for symb=1:OFDMSymb
ofdm_td_rx_signal(:,symb) = filter(channelObj, ofdm_td_TXdata(:,symb));
channel_cir(tapIndices,symb)= (channelObj.PathGains).';
end
channel_cfr = fft(channel_cir,nCarrier); % freq. response from CIR
Similarly at the receiver, after the FFT block, I just tried to use this CFR by dividing the received symbols by the CFR:
fft_RXdata=fft_data./channel_cfr;
What I am getting is a very high BER and scattered constellation symbols. The rest of the transceiver blocks are simple and all verified as bug-free... Do let me know how to improve it.
How could I improve the BER?
Is there any need for an equalizer? Would a matched filter help? Thanks in advance.
NOTE: ONLY THE RAYLEIGH CHANNEL IS USED; AWGN NOISE IS NOT ADDED AT ALL.
One possible solution that has helped me is the use of block-based pilot (reference dummy data) transmission with the OFDM symbols. Least-squares channel estimation is performed at the RX using the received pilot data, which inherently captures the channel behavior.
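A minimal sketch of that idea (block-type pilots, LS estimate, one-tap equalizer) is below, written in Python/NumPy rather than MATLAB; the pilot values, FFT size and the stand-in channel response are placeholders, and in the real system the pilot OFDM symbol goes through the same Rayleigh channel as the data:
import numpy as np

n_carriers = 64
pilot_tx = np.ones(n_carriers)                        # known BPSK pilot symbol (all +1)

# Placeholder frequency response standing in for the fading channel's CFR.
h_true = np.fft.fft(np.array([0.8, 0.4 + 0.3j, 0.2]), n_carriers)

pilot_rx = h_true * pilot_tx                          # received pilot (no noise, as in the question)
data_tx = np.random.choice([-1.0, 1.0], n_carriers)   # BPSK data symbol
data_rx = h_true * data_tx

h_ls = pilot_rx / pilot_tx                            # least-squares estimate: H = Y_pilot / X_pilot
data_eq = data_rx / h_ls                              # one-tap zero-forcing equalization per subcarrier

bit_errors = np.sum(np.sign(data_eq.real) != data_tx)
print("bit errors after equalization:", bit_errors)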