NEON simulator output (pipeline information, stalls, execution cycles) is not clear - embedded

I have trouble understanding the output of the NEON cycle simulator. The output is cryptic and there is no proper documentation explaining it.
For example:
In the above figure, the first column's information is not clearly explained.
What does 'lc' mean? Sometimes the syntax given below doesn't match the data format in the table.
The code and data are at http://pulsar.webshaker.net/ccc/sample-55d49530 . I found some help in "Some doubts in optimizing the neon code", but it is not completely clear.

It is not 'lc'; it is '1C', i.e. one-C, meaning that the instruction takes one cycle.


Why am I getting only zeros out of the VCO block in GNU Radio?

In GNU Radio, I am trying to use the frequency of one signal to generate another signal of a different frequency. Here is the flow diagram that I am using:
I generate a 50 kHz signal with a Signal Source block and feed it into a Log Power FFT block. I use the Argmax block to find the FFT bin with the most power and multiply that with a constant. I want to use this result as the input to the Complex VCO block to generate another signal with a different frequency. All vectors have a length of 4096.
However, looking at the output of the complex QT Gui Time Sink block, the output of the vco is always zero. This is strange to me because using a float QT Gui Time Sink to look at the output of the multiply block (which is also going to the input of the vco block), the result is 50,000 as expected. Why am I only getting zero out of the vco?
Also, my sample rate is set to 1M. I am assuming because of the vector length of 4096 that the sample rate out of the Argmax block will be 1M/4096 = 244. Is this correct?
I am running GNU Radio Companion on Windows 10.
The proposed solution is not a solution. Please don't abuse the signal probe, which is really just that: a probe for slow, debugging, or purely visual purposes. Every time I use it myself, I see how architecturally bad it is, and I personally think the project should remove it from the block library altogether.
Now, instead of just saying "the probe is bad, do something else", let's analyse where your flow graph falls short:
Your frequency estimation depends on the argmax of a block that was meant for pure visualization. No, the output rate is not (sampling rate / FFT length); the output rate is roughly the frame rate (but not exactly - that block is terrible and mixes "sample time" with "wall-clock time"). Don't do that. If you need something like that, use the FFT block followed by a Complex to Magnitude Squared block. You don't even want the logarithm - you're just looking for a maximum.
Instead of looking for the maximum absolute value in an FFT, which is an inherently quantizing frequency estimator, use something that actually gives you an oscillation. There are multiple ways you can do that with a PLL!
Your VCO probably does exactly what it's programmed to do; you're just using an inadequate sensitivity!
The sampling rate you assume in your time sinks is totally off, which is probably why you have the impression of a constant output – it just changes so slowly that you won't notice it.
So instead, I propose one of the following:
Use the PLL Freq Det block. Feed its output into the VCO. Don't scale with a constant; simply apply the proper sensitivity. Sensitivity is the factor between the input amplitude and the phase advance per output sample in radians (see the sketch below this list).
Use the PLL Carrier recovery. Use a resampler, or some other mathematical method, to generate the new frequency. You haven't told us how that other frequency relates to the input frequency, so I can't give you concrete advice.
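To make option 1 concrete, here is a minimal Python sketch of such a flow graph (the sample rate, loop bandwidth, PLL limits, and the 50 kHz test tone are illustrative assumptions, not values taken from your flow graph):

    from gnuradio import gr, analog, blocks

    class pll_to_vco(gr.top_block):
        def __init__(self, samp_rate=1e6, f_in=50e3):
            gr.top_block.__init__(self, "PLL Freq Det driving a VCO")

            # Test tone standing in for the signal whose frequency is tracked.
            src = analog.sig_source_c(samp_rate, analog.GR_COS_WAVE, f_in, 1.0)

            # PLL Freq Det outputs the tracked frequency in radians per sample;
            # the min/max limits below are generous bounds in the same unit.
            pll = analog.pll_freqdet_cf(0.06, 0.5, -0.5)

            # The VCO advances its phase by input * sensitivity / samp_rate per
            # sample, so sensitivity = samp_rate reproduces the detector's
            # radians-per-sample estimate one-to-one as an output oscillation.
            vco = blocks.vco_c(samp_rate, samp_rate, 1.0)

            # Run for one second's worth of samples, then stop.
            head = blocks.head(gr.sizeof_gr_complex, int(samp_rate))
            self.connect(src, pll, vco, head,
                         blocks.null_sink(gr.sizeof_gr_complex))

    if __name__ == '__main__':
        pll_to_vco().run()

With the sensitivity chosen this way, no multiply-by-a-constant stage is needed between the detector and the VCO.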
Also note that this very much suggests a case of "I'm trying to recreate an analog approach digitally"; that might be a good approach, but in many cases it's not.
If I might be so brazen: describe why you need to generate that other frequency, and for which purpose, in a post on https://dsp.stackexchange.com or on the GNU Radio mailing list, discuss-gnuradio@gnu.org. This is only barely a programming problem; it is really a signal processing problem. And there are a lot of people out there eager to help you find an appropriate solution that actually tackles your problem!
It looks like a better solution was to probe the output of the multiplier using a Probe Signal block along with a Function Probe block to create a new variable. This variable could then be used as the frequency value in a separate Signal Source block that generates the new signal. This flow graph seems to satisfy the original intended purpose: new flow diagram

QPSK works in simulation but not with SDR

I'm going to start off by saying that I'm very new to SDR and GNU Radio. This may be a dumb question, but I have been googling and testing things for about two months now trying to get this to work without success. Any help or pointers would be appreciated!
I'm attempting to use GNU Radio 3.8 to transfer a file using differential QPSK. I've tried to follow the tutorials on the wiki as well as several similar academic papers I found on the internet (which also seem to be based on the wiki tutorial). None of them worked on their own, but by combining what actually works from each one, I managed to create a flowgraph, sans hardware, that does indeed send and receive the data from a file. Here's the Flowgraph and here is a screenshot of the results. The results show the four constellation points, and the data from the file source matches up perfectly with the data that has gone through the entire transmit-and-receive chain. In the simulation I have a Throttle block and a Channel Model block where the LimeSDR Source and LimeSDR Sink blocks would be. So far so good (at least as far as I can tell).
When I actually start transmitting this signal with the SDR, the received data no longer matches what is transmitted. Here's the flowgraph I've been using for the transmission. I added a protocol formatter and some FEC blocks that I could have removed for this illustration, but the point is that, simply comparing the bits going into the modulator with the bits being recovered, the two do not match up. The constellation looks good (as far as I can tell) but the bits are all wrong. Here's a screenshot showing the bits being transmitted. You'll notice in the screenshot of the transmitted signal that it has a repeating series of three flat-top "1"s surrounded on both sides by a period of "0"s (at 1.5 ms and 3.5 ms). This is a screenshot of the received bits. At 1 ms and 3 ms you can see that it has significantly more transitions between 1 and 0 than it should.
So at this point I'm stumped. The simulation works but the real-world test does not. I've messed around with the RRC filter properties a significant amount. I have no clue if the values I have chosen are correct, as I have not found a tutorial or explanation of how to choose them. I just looked at some of the example flowgraphs, made some guesses as to how those values were derived, and applied those guesses to my use case. It worked well in the simulation, so I thought it would be fine in the real-world test. I've tried a variety of samples-per-symbol values, but my goal is a 4800-bit-per-second transfer speed, and changing the samples per symbol didn't help anyway. What should I change in order to get this to work?
Bonus question: the Constellation Object block offers QPSK and DQPSK, and the Constellation Modulator block has a differential checkbox. What is the best-practice combination of settings to get differential QPSK modulation?
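For reference, those GRC settings map onto the gr-digital Python API roughly like this (a sketch assuming GNU Radio 3.8; the samples-per-symbol and excess-bandwidth values are placeholders, not a recommendation):

    from gnuradio import digital

    # Constellation objects (what the GRC "Constellation Object" block creates).
    qpsk = digital.constellation_qpsk()
    dqpsk = digital.constellation_dqpsk()

    # The Constellation Modulator block wraps digital.generic_mod; its
    # "Differential Encoding" option is the `differential` argument here.
    mod = digital.generic_mod(
        constellation=qpsk,    # or dqpsk
        differential=True,     # the modulator's differential checkbox
        samples_per_symbol=4,  # placeholder value
        excess_bw=0.35)        # RRC roll-off, placeholder value

Whichever combination you settle on, the receive chain has to mirror it: the same constellation object and the matching differential setting on the demodulation side.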

How to make Mean curvature flow skeleton deterministic after every run?

When I run mean curvature flow skeletonization from CGAL, the resulting skeleton is slightly different every time. Is there a way to make the skeleton identical on each run (with fixed skeleton parameters, of course)? Can someone point me to the piece of code where this non-determinism arises? I would like to fix the issue.
I saw someone post a similar question earlier, but the answer did not say where exactly in CGAL the problem originates. Here's a link to that discussion.

Synopsys: Repeated compiles produce different results. How to automate iterated compile?

I'm new to using Design Compiler. In the past, I've done mostly FPGA work. Right now, I'm using Synopsys to determine the minimum area necessary to represent some circuits (using the Nangate 45 nm library). I'm not doing P&R right now; I'm just trying to determine transistor area.
My only optimization constraint is to minimize area. I've noticed that if I tell DC to compile more than one time in a row, it produces different (and usually smaller) results each time.
I've looked and looked and failed to see if this is mentioned in a manual or anywhere in any discussion. Is it meant to work this way?
This suggests that optimization is stopping earlier than it could, so it's not REALLY minimizing area. Any idea why?
Is there a way I can tell it to increase the effort and/or tell it to automatically iterate compiles so that it will converge on the smallest design?
I'm guessing that DC is expecting to meet timing constraints, but I've given it a purely combinatorial block and no timing constraint. Did they never consider the usage scenario when all you want to do is work out the minimum gate area for a combinatorial circuit?
On a pure combinatorial circuit you can use a set_max_delay constraint and DC will attempt to meet that.
For reduced area you can use -map_effort high or -map_effort ultra to get it to work harder.
DC is a funny beast, and the algorithms it uses change as processes advance and make certain activities more or less useful. A lot of pre-layout optimization is less useful since the whole situation can change once the gates are actually placed and routed.
I filed a support ticket with Synopsys. I was using a 2010 version of design compiler. Apparently, area optimization has been improved since then, and the 2014 version will minimize area in one compiler pass.

Machine code alignment

I am trying to understand the principles of machine code alignment. I have an assembler implementation which can generate machine code at run time. I use 16-byte alignment on every branch destination, but it looks like this is not the optimal choice, since I've noticed that the same code sometimes runs faster if I remove the alignment. I think it has something to do with the cache line width: some instructions straddle a cache line boundary and the CPU stalls because of that. So a few alignment bytes inserted in one place can push later instructions across a cache line boundary...
I was hoping to implement an automatic alignment procedure which can process the code as a whole and insert padding according to the CPU's specification (cache line width, 32/64-bit mode, and so on)...
Can someone give some hints about this procedure? As an example, the target CPU could be a 64-bit Intel Core i7.
Thank you.
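(For concreteness, here is a minimal sketch of the padding arithmetic such a procedure needs; the 16-byte boundary and the single-byte 0x90 NOP filler are illustrative assumptions, and a real assembler would normally emit multi-byte NOPs instead.)

    def align_padding(offset, boundary=16):
        """Bytes of padding needed to move `offset` up to the next
        `boundary`-byte boundary (0 if it is already aligned)."""
        return (-offset) % boundary

    def pad_code(code, boundary=16, nop=b"\x90"):
        """Append single-byte NOPs (x86 0x90) so that the next emitted
        instruction starts on a `boundary`-byte boundary."""
        return code + nop * align_padding(len(code), boundary)

    # A 23-byte block needs 9 bytes of padding to reach offset 32.
    assert align_padding(23) == 9
    assert len(pad_code(b"\x90" * 23)) == 32

Note that padding like this only makes sense where the preceding instruction does not fall through (e.g. after an unconditional jump or a return); otherwise the inserted NOPs are executed.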
I'm not qualified to answer your question because this is such a vast and complicated topic. There are probably many more mechanisms in play here, other than cache line size.
However, I would like to point you to Agner Fog's site and the optimization manuals for compiler makers that you can find there. They contain a plethora of information on these kinds of subjects: cache lines, branch prediction, and data/code alignment.
Paragraph (16-byte) alignment is usually the best. However, it can force some "local" JMP instructions to no longer be local (due to code-size bloat). It may also result in less code fitting in the cache. I would only align major segments of code; I would not align every tiny subroutine/JMP section.
Not an expert, however... Branches to places that are not already in the instruction cache should benefit from alignment the most, because you'll read a whole cache line of instructions to fill the pipeline. Given that, forward branches will benefit on the first run of a function. Backward branches ("for" and "while" loops, for example) will probably not benefit, because the branch target and the following instructions have already been read into the cache. Do follow the links in Martin's answer.
As mentioned previously, this is a very complex area, and Agner Fog's site seems like a good place to visit. As to the complexities: I ran across Torbjörn Granlund's article "Improved Division by Invariant Integers", and in the code he uses to illustrate his new algorithm, the first instruction at (I guess) the main label is nop - no operation. According to the commentary, it improves performance significantly. Go figure.