ROL / ROR on variable using inline assembly only in Objective-C [duplicate] - objective-c

This question already has answers here:
ROL / ROR on variable using inline assembly in Objective-C
(2 answers)
Closed 9 years ago.
A few days ago, I asked the question below. Because I was in need of a quick answer, I added:
The code does not need to use inline assembly. However, I haven't found a way to do this using Objective-C / C++ / C instructions.
Today, I would like to learn something. So I ask the question again, looking for an answer using inline assembly.
I would like to perform ROR and ROL operations on variables in an Objective-C program. However, I can't manage it – I am not an assembly expert.
Here is what I have done so far:
uint8_t v1 = ....;
uint8_t v2 = ....; // v2 is either 1, 2, 3, 4 or 5
asm("ROR v1, v2");
the error I get is:
Unknown use of instruction mnemonic with unknown size suffix
How can I fix this?

A rotate is just two shifts: some bits go left, the others go right. Once you see this, rotating is easy without assembly. Some compilers recognise the pattern and compile it to the rotate instructions. See Wikipedia for the code.
Update: Xcode 4.6.2 (others not tested) on x86-64 compiles the double shift + or to a rotate for 32- and 64-bit operands; for 8- and 16-bit operands the double shift + or is kept. Why? Maybe the compiler understands something about the performance of these instructions, or maybe they just didn't optimise it. In general, if you can avoid assembler, do so; the compiler invariably knows best! To inline the operations, declare the functions static inline, or use macros defined in the same way as the standard macro MAX (a macro has the advantage of adapting to the type of its operands).
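For the 8-bit case in the question, the two-shift pattern could look like the sketch below. The names rotl8 and rotr8 are made up for illustration, and the masking is an addition so that a shift of 0 or of more than 7 stays well defined.

```c
#include <stdint.h>

/* Hypothetical helper: rotate an 8-bit value left.
   The shift count is reduced modulo 8, so any count is accepted. */
static inline uint8_t rotl8(uint8_t value, unsigned shift)
{
    shift &= 7;
    /* value is promoted to int, so the left shift cannot overflow;
       the cast back to uint8_t drops the bits above bit 7. */
    return (uint8_t)((value << shift) | (value >> ((8 - shift) & 7)));
}

/* Hypothetical helper: rotate an 8-bit value right. */
static inline uint8_t rotr8(uint8_t value, unsigned shift)
{
    shift &= 7;
    return (uint8_t)((value >> shift) | (value << ((8 - shift) & 7)));
}
```

With v2 in the range 1..5 as in the question, rotr8(v1, v2) would stand in for the ROR that the asm statement was attempting.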
Addendum after OP comment
Here is the x86-64 assembler as an example; for full details of how to use the asm construct start here.
First the non-assembler version:
static inline uint32 rotl32_i64(uint32 value, unsigned shift)
{
   // assume shift is in range 0..31 or subtraction would be wrong
   // however we know the compiler will spot the pattern and replace
   // the expression with a single roll and there will be no subtraction
   // so if the compiler changes this may break without:
   // shift &= 0x1f;
   return (value << shift) | (value >> (32 - shift));
}

void test_rotl32(uint32 value, unsigned shift)
{
   uint32 shifted = rotl32_i64(value, shift);
   NSLog(@"%8x <<< %u -> %8x", value & 0xFFFFFFFF, shift, shifted & 0xFFFFFFFF);
}
If you look at the assembler output for profiling (so the optimiser kicks in) in Xcode (Product > Generate Output > Assembly File, then select Profiling in the pop-up menu at the bottom of the window) you will see that rotl32_i64 is inlined into test_rotl32 and compiles down to a rotate (roll) instruction.
Now producing the assembler directly yourself is a bit more involved than for the ARM code FrankH showed. This is because to take a variable shift value a specific register, cl, must be used, so we need to give the compiler enough information to do that. Here goes:
static inline uint32 rotl32_i64_asm(uint32 value, unsigned shift)
{
   // i64 - shift must be in register cl so create a register local assigned to cl
   // no need to mask as i64 will do that
   register uint8 cl asm ( "cl" ) = shift;
   uint32 shifted;
   // emit the rotate left long
   // %n values are replaced by args:
   // 0: "=r" (shifted) - any register (r), result(=), store in var (shifted)
   // 1: "0" (value) - *same* register as %0 (0), load from var (value)
   // 2: "r" (cl) - any register (r), load from var (cl - which is the cl register so this one is used)
   __asm__ ("roll %2,%0" : "=r" (shifted) : "0" (value), "r" (cl));
   return shifted;
}
Change test_rotl32 to call rotl32_i64_asm and check the assembly output again - it should be the same, i.e. the compiler did as well as we did.
Further note that if the commented out masking line in rotl32_i64 is included it essentially becomes rotl32 - the compiler will do the right thing for any architecture all for the cost of a single and instruction in the i64 version.
So asm is there if you need it, using it can be somewhat involved, and the compiler will invariably do as well or better by itself...
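As a variation on the asm version above, GCC and Clang also accept the "c" constraint, which asks for the rcx/ecx register directly instead of going through a register local. The following is only a sketch under that assumption (AT&T syntax, GCC-style extended asm); the names rotl32_c and rotl32_asm are made up, and on non-x86 targets the asm path is compiled out in favour of the masked C version.

```c
#include <stdint.h>

/* Portable form with the masking line enabled, so shift == 0 is safe. */
static inline uint32_t rotl32_c(uint32_t value, unsigned shift)
{
    shift &= 31;
    return (value << shift) | (value >> ((32 - shift) & 31));
}

static inline uint32_t rotl32_asm(uint32_t value, unsigned shift)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    /* "+r": value is both input and output, in any general register.
       "c" : shift is loaded into ecx, whose low byte is cl.
       "cc": the rotate modifies the flags. */
    __asm__("roll %%cl, %0" : "+r"(value) : "c"(shift) : "cc");
    return value;
#else
    /* Not x86: fall back to the C version. */
    return rotl32_c(value, shift);
#endif
}
```

Both functions should agree for every shift count, since the hardware also masks the count in cl to five bits for a 32-bit rotate.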

The 32bit rotate in ARM would be:
__asm__("MOV %0, %1, ROR %2\n" : "=r"(out) : "r"(in), "M"(N));
where N is required to be a compile-time constant.
But the output of the barrel shifter, whether used on a register or an immediate operand, is always a full-register-width; you can shift a constant 8-bit quantity to any position within a 32bit word, or - as here - shift/rotate the value in a 32bit register any which way.
But you cannot rotate 16bit or 8bit values within a register using a single ARM instruction. None such exists.
That's why the compiler, on ARM targets, when you use the "normal" (portable [Objective-]C/C++) code (in << xx) | (in >> (w - xx)), will emit one assembler instruction for a 32bit rotate, but at least two (a normal shift followed by a shifted or) for 8/16bit ones.


RISC-V inline assembly using memory not behaving correctly

This system call code is not working at all. The compiler is optimizing things out and generally behaving strangely:
template <typename... Args>
inline void print(Args&&... args)
{
    char buffer[1024];
    auto res = strf::to(buffer) (std::forward<Args> (args)...);
    const size_t size = res.ptr - buffer;
    register const char* a0 asm("a0") = buffer;
    register size_t a1 asm("a1") = size;
    register long syscall_id asm("a7") = ECALL_WRITE;
    register long a0_out asm("a0");
    asm volatile ("ecall" : "=r"(a0_out)
        : "m"(*(const char(*)[size]) a0), "r"(a1), "r"(syscall_id) : "memory");
}
This is a custom system call that takes a buffer and a length as arguments.
If I write this using global assembly it works as expected, but the generated code has generally been extraordinarily good when I write the wrappers inline.
A function that calls the print function with a constant string produces invalid machine code:
0000000000120f54 <start>:
120f54: fa1ff06f j 120ef4 <public_donothing-0x5c>
120ef4: 747367b7 lui a5,0x74736
120ef8: c0010113 addi sp,sp,-1024
120efc: 55478793 addi a5,a5,1364 # 74736554 <add_work+0x74615310>
120f00: 00f12023 sw a5,0(sp)
120f04: 00a00793 li a5,10
120f08: 00f10223 sb a5,4(sp)
120f0c: 000102a3 sb zero,5(sp)
120f10: 00500593 li a1,5
120f14: 06600893 li a7,102
120f18: 00000073 ecall
120f1c: 40010113 addi sp,sp,1024
120f20: 00008067 ret
It's not loading a0 with the buffer at sp.
What am I doing wrong?
It's not loading a0 with the buffer at sp.
Because you didn't ask for a pointer as an "r" input in a register. The one and only guaranteed/supported behaviour of T foo asm("a0") is to make an "r" constraint (including +r or =r) pick that register.
But you used "m" to let it pick an addressing mode for that buffer, not necessarily 0(a0), so it probably picked an SP-relative mode. If you add asm comments inside the template like "ecall # 0 = %0 1 = %1 2 = %2" you can look at the compiler's asm output and see what it picked. (With clang, use -no-integrated-as so asm comments in the template come through in the -S output.)
Wrapping a system call does need the pointer in a specific register, i.e. using "r" or "+r":
asm volatile ("ecall # 0=%0 1=%1 2=%2 3=%3 4=%4"
    : "=r"(a0_out)
    : "r"(a0), "r"(a1), "r"(syscall_id), "m"(*(const char(*)[size]) a0)
    : // "memory" unneeded; the "m" input tells the compiler which memory is read
);
That "m" input can be used instead of the "memory" clobber, not instead of an "r" pointer input. (For write specifically, because it only reads that one area of pointed-to memory and has no other side-effects on memory user-space can see, only on kernel write buffers and file-descriptor positions which aren't C objects this program can access directly. For a read call, you'd need the memory to be an output operand.)
With optimization disabled, compilers do typically pick another register as the base for the "m" input (e.g. 0(a5) for GCC), but with optimization enabled GCC picks 0(a0) so it doesn't cost extra instructions. Clang still picks 0(a2), wasting an instruction to set up that pointer, even though the "=r"(a0_out) is not early-clobber. (Godbolt, with a very cut-down version of the function that doesn't call strf::to, whatever that is, just copies a byte into the buffer.)
Interestingly, with optimization enabled for my cut-down stand-alone version of the function without fixing the bug, GCC and clang do happen to put a pointer to buffer into a0, picking 0(a0) as the template expansion for that operand (see the Godbolt link above). This seems to be a missed optimization vs. using 16(sp); I don't see why they'd need the buffer address in a register at all.
But without optimization, GCC picks ecall # 0 = a0 1 = 0(a5) 2 = a1. (In my simplified version of the function, it sets a5 with mv a5,a0, so it did actually have the address in a0 as well. So it's a good thing you had more code in your function to make it not happen to work by accident, so you could find the bug in your code.)
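Putting the points above together, a plain C wrapper might look like the sketch below. It is not the asker's custom ECALL_WRITE call: as an assumption it targets the standard Linux RISC-V write system call (number 64, with fd in a0, buffer in a1, length in a2), and the name sys_write is made up. On any other host the asm is compiled out and the libc write() is used, so the sketch still builds.

```c
#include <stddef.h>

#if !(defined(__riscv) && defined(__linux__))
#include <unistd.h>   /* fallback path only */
#endif

/* Hypothetical wrapper: write len bytes from buf to fd. */
static long sys_write(int fd, const char *buf, size_t len)
{
#if defined(__riscv) && defined(__linux__)
    /* Register locals only guarantee their register for "r" operands,
       so every argument is passed as "r" (or "+r"). */
    register long a0 __asm__("a0") = fd;
    register const char *a1 __asm__("a1") = buf;
    register size_t a2 __asm__("a2") = len;
    register long a7 __asm__("a7") = 64;   /* assumed: Linux __NR_write */

    __asm__ volatile("ecall"
        : "+r"(a0)                          /* fd in, return value out */
        : "r"(a1), "r"(a2), "r"(a7),
          /* dummy "m" input: tells the compiler the buffer is read,
             so no "memory" clobber is needed; it is not the pointer */
          "m"(*(const char (*)[len]) buf));
    return a0;
#else
    return (long)write(fd, buf, len);
#endif
}
```

The design point is the same as in the answer: the pointer travels in an "r" operand, and the "m" operand exists only to describe which memory the call reads.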

Error: No operator "=" matches these operands in "Servo_Project.cpp", Line: 15, Col: 22

So I tried using code from another post around here to see if I could use it. It was code meant to use a potentiometer to move a servo motor, but when I attempted to compile it, it gave the error above saying No operator "=" matches these operands in "Servo_Project.cpp". How do I go about fixing this error?
Just in case, I'll say this: the board I was trying to compile the code for was a NUCLEO-L476RG; the post I mentioned used a Nucleo L496ZG board and a Tower Pro Micro Servo 9G.
#include "mbed.h"
#include "Servo.h"
Servo myservo(D6);
AnalogOut MyPot(A0);
int main() {
float PotReading;
PotReading =;
while(1) {
for(int i=0; i<100; i++) {
myservo = (i/100);
This line:
myservo = (i/100);
is wrong in a couple of ways. First, i/100 will always be zero, because integer division truncates in C++. Second, there's no = operator that allows an integer value to be assigned to a Servo object. You need to invoke some kind of Servo method instead, likely write().
The argument (SOMETHING) should be the position or speed of the servo you're trying to get working. See the Servo class reference for an explanation. Your code tries to use fractions from 0-1 and that isn't going to work: the Servo wants a position/speed between 0 and 180.
You should look in the Servo.h header to see what member functions and operators are implemented.
Assuming what you are using is this, it does have:
Servo& operator= (float percent);
Although note that the parameter is float and you are passing an int (the parameter is also in the range 0.0 to 1.0 - so not "percent" as its name suggests - so be wary, both the documentation and the naming are poor). You should have:
myservo = i/100.0f;
However, even though i / 100 would produce zero for all i in the loop, that does not explain the error, since an implicit conversion should be possible, even if clearly undesirable. You should look in the actual header you are using to see if the operator= is declared; possibly you have the wrong file, a different version, or just an entirely different implementation that happens to use the same name.
I also notice that if you look in the header, there is no documentation mark-up for this function and the Servo& operator= (Servo& rhs); member is not documented at all - hence the confusing automatically generated "Shorthand for the write and read functions." on the Servo doc page when the function shown is only one of those things. It is possible it has been removed from your version.
Given that the documentation is incomplete and that the operator= looks like an afterthought, the simplest solution is to use the read() / write() members directly in any case. Or implement your own Servo class; it appears to be only a thin wrapper/facade of the PwmOut class anyway. Since that is actually part of mbed rather than user-contributed code of unknown quality, you may be on firmer ground.
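The integer-division point raised in both answers can be checked in isolation, without any mbed code. This is a small sketch in plain C (the same rules apply in C++); the function names are invented for illustration.

```c
/* Both operands are int, so the division truncates toward zero
   before the result is ever converted to float. */
static float fraction_truncated(int i)
{
    return (float)(i / 100);
}

/* One float operand forces floating-point division. */
static float fraction_exact(int i)
{
    return i / 100.0f;
}
```

For every i in the loop's range 0..99 the first function returns 0, while the second returns the intended fraction between 0.0 and 1.0.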

Measuring Program Execution Time with Cycle Counters

I am confused by this particular line:
result = (double) hi * (1 << 30) * 4 + lo;
of the following code:
void access_counter(unsigned *hi, unsigned *lo)
{
    // Set *hi and *lo to the high and low order bits of the cycle
    // counter.
    asm("rdtscp; movl %%edx,%0; movl %%eax,%1" // Read cycle counter
        : "=r" (*hi), "=r" (*lo) // and move results to
        : /* No input */ // the two outputs
        : "%edx", "%eax");
}

double get_counter()
{
    // Return the number of cycles since the last call to start_counter.
    unsigned ncyc_hi, ncyc_lo;
    unsigned hi, lo, borrow;
    double result;
    /* Get cycle counter */
    access_counter(&ncyc_hi, &ncyc_lo);
    lo = ncyc_lo - cyc_lo;
    borrow = lo > ncyc_lo;
    hi = ncyc_hi - cyc_hi - borrow;
    result = (double) hi * (1 << 30) * 4 + lo;
    if (result < 0) {
        fprintf(stderr, "Error: counter returns neg value: %.0f\n", result);
    }
    return result;
}
What I cannot understand is why hi is multiplied by 2^30 and then by 4, and why lo is then added to it. Could someone please explain what is happening in this line of code? I do know what hi and lo contain.
The short answer:
That line turns a 64bit integer that is stored as 2 32bit values into a floating point number.
Why doesn't the code just use a 64bit integer? Well, gcc has supported 64bit numbers for a long time, but presumably this code predates that. In that case, the only way to support numbers that big is to put them into a floating point number.
The long answer:
First, you need to understand how rdtscp works. When this assembler instruction is invoked, it does 2 things:
1) Sets ecx to IA32_TSC_AUX MSR. In my experience, this generally just means ecx gets set to zero.
2) Sets edx:eax to the current value of the processor’s time-stamp counter. This means that the lower 32bits of the counter go into eax, and the upper 32bits are in edx.
With that in mind, let's look at the code. When called from get_counter, access_counter is going to put edx in 'ncyc_hi' and eax in 'ncyc_lo.' Then get_counter is going to do:
lo = ncyc_lo - cyc_lo;
borrow = lo > ncyc_lo;
hi = ncyc_hi - cyc_hi - borrow;
What does this do?
Since the time is stored in 2 different 32bit numbers, if we want to find out how much time has elapsed, we need to do a bit of work to find the difference between the old time and the new. When it is done, the result is stored (again, using 2 32bit numbers) in hi / lo.
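The three lines above amount to a 64-bit subtraction carried out on 32-bit halves. A stand-alone sketch of the same arithmetic, with a made-up function name and fixed-width types substituted for unsigned, might be:

```c
#include <stdint.h>

/* Hypothetical helper: (new_hi:new_lo) - (old_hi:old_lo), result in *hi:*lo. */
static void sub64_halves(uint32_t new_hi, uint32_t new_lo,
                         uint32_t old_hi, uint32_t old_lo,
                         uint32_t *hi, uint32_t *lo)
{
    uint32_t borrow;

    /* Unsigned subtraction wraps modulo 2^32 when new_lo < old_lo. */
    *lo = new_lo - old_lo;
    /* A wrap shows up as a result larger than the value we started from. */
    borrow = *lo > new_lo;
    /* Carry the borrow into the high half, as in long subtraction. */
    *hi = new_hi - old_hi - borrow;
}
```

For example, subtracting 0x00000000FFFFFFFF from 0x0000000100000000 wraps the low half around to 1, sets borrow to 1, and leaves the high half at 0, giving a difference of 1.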
Which finally brings us to your question.
result = (double) hi * (1 << 30) * 4 + lo;
If we could use 64bit integers, converting 2 32bit values to a single 64bit value would look like this:
unsigned long long result = hi; // put hi into the 64bit number.
result <<= 32; // shift the 32 bits to the upper part of the number
result |= lo; // add in the lower 32bits.
If you aren't used to bit shifting, maybe looking at it like this will help. If lo = 1 and hi = 2, then expressed as hex numbers:
result = hi; 0x0000000000000002
result <<= 32; 0x0000000200000000
result |= lo; 0x0000000200000001
But if we assume the compiler doesn't support 64bit integers, that won't work. While floating point numbers can hold values that big, they don't support shifting. So we need to figure out a way to shift 'hi' left by 32bits, without using left shift.
Ok then, shifting left by 1 is really the same as multiplying by 2. Shifting left by 2 is the same as multiplying by 4. Shifting left by [omitted...] Shifting left by 32 is the same as multiplying by 4,294,967,296.
By an amazing coincidence, 4,294,967,296 == (1 << 30) * 4.
So why write it in that complicated fashion? Well, 4,294,967,296 is a pretty big number. In fact, it's too big to fit in a 32bit integer. Which means if we put it in our source code, a compiler that doesn't support 64bit integers may have trouble figuring out how to process it. Written like this, the compiler can generate whatever floating point instructions it might need to work on that really big number.
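To see that the floating point expression and the shift produce the same number, the two forms can be put side by side. This is only an illustrative sketch; the function names are invented and uint32_t stands in for the original unsigned.

```c
#include <stdint.h>

/* The expression from the question: hi * 2^30 * 4 + lo, computed in double.
   (double) hi converts first, so the whole product is floating point. */
static double combine_as_double(uint32_t hi, uint32_t lo)
{
    return (double) hi * (1 << 30) * 4 + lo;
}

/* The same value built with a 64-bit integer shift. */
static uint64_t combine_as_u64(uint32_t hi, uint32_t lo)
{
    return ((uint64_t) hi << 32) | lo;
}
```

One caveat: a double carries 53 bits of mantissa, so the two only agree exactly while the cycle count stays below 2^53; beyond that the double version rounds the low bits.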
Why the current code is wrong:
It looks like variations of this code have been wandering around the internet for a long time. Originally (I assume) access_counter was written using rdtsc instead of rdtscp. I'm not going to try to describe the difference between the two (google them), other than to point out that rdtsc does not set ecx, and rdtscp does. Whoever changed rdtsc to rdtscp apparently didn't know that, and failed to adjust the inline assembler stuff to reflect it. While your code might work fine despite this, it might do something weird instead. To fix it, you could do:
asm("rdtscp; movl %%edx,%0; movl %%eax,%1" // Read cycle counter
: "=r" (*hi), "=r" (*lo) // and move results to
: /* No input */ // the two outputs
: "%edx", "%eax", "%ecx");
While this will work, it isn't optimal. Registers are a valuable and scarce resource on i386. This tiny fragment uses 5 of them. With a slight modification:
asm("rdtscp" // Read cycle counter
: "=d" (*hi), "=a" (*lo)
: /* No input */
: "%ecx");
Now we have 2 fewer assembly statements, and we only use 3 registers.
But even that isn't the best we can do. In the (presumably long) time since this code was written, gcc has added both support for 64bit integers and a function to read the tsc, so you don't need to use asm at all:
unsigned int a;
unsigned long long result;
result = __builtin_ia32_rdtscp(&a);
'a' is the (useless?) value that was being returned in ecx. The function call requires it, but we can just ignore the returned value.
So, instead of doing something like this (which I assume your existing code does):
unsigned cyc_hi, cyc_lo;
access_counter(&cyc_hi, &cyc_lo);
// do something
double elapsed_time = get_counter(); // Find the difference between cyc_hi, cyc_lo and the current time
We can do:
unsigned int a;
unsigned long long before, after;
before = __builtin_ia32_rdtscp(&a);
// do something
after = __builtin_ia32_rdtscp(&a);
unsigned long long elapsed_time = after - before;
This is shorter, doesn't use hard-to-understand assembler, is easier to read and maintain, and produces the best possible code.
But it does require a relatively recent version of gcc.

PIC C18: Converting double to string

I am using PIC18F2550. Programming it with C18 language.
I need a function that converts double to string like below:
void dtoa( char *szString, // Output string
           double dbDouble, // Input number
           unsigned char ucFPlaces) // Number of digits in the resulting fractional part
{
    // ??????????????
}
To be called like this in the main program:
void main (void)
{
    // ...
    double dbNumber = 123.45678;
    char szText[9];
    dtoa(szText, dbNumber, 3); // szText becomes "123.456" or rounded to "123.457"
    // ...
}
So write one!
5mins, a bit of graph paper and a coffee is all it should take.
In fact it's a good interview question
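For what it's worth, one way to "write one" is sketched below in standard C. It is a minimal version under stated assumptions, not a drop-in for C18: the value scaled by 10^ucFPlaces must fit in an unsigned long, at most 9 fractional digits are supported, NaN and infinity are not handled, and a tiny negative input can print as "-0.00". The helper utoa_dec is a made-up name.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: write n in decimal at out, return the digit count. */
static size_t utoa_dec(unsigned long n, char *out)
{
    char tmp[20];
    size_t len = 0, i;

    do {                               /* digits come out least significant first */
        tmp[len++] = (char)('0' + n % 10);
        n /= 10;
    } while (n != 0);
    for (i = 0; i < len; i++)          /* reverse into the output */
        out[i] = tmp[len - 1 - i];
    return len;
}

void dtoa(char *szString, double dbDouble, unsigned char ucFPlaces)
{
    unsigned long scale = 1, scaled, whole, frac, div;
    unsigned char i;

    assert(szString != NULL && ucFPlaces <= 9);

    if (dbDouble < 0) {                /* emit the sign, then work on |x| */
        *szString++ = '-';
        dbDouble = -dbDouble;
    }
    for (i = 0; i < ucFPlaces; i++)
        scale *= 10;

    /* Scale to an integer and round to nearest (assumes it fits). */
    scaled = (unsigned long)(dbDouble * scale + 0.5);
    whole = scaled / scale;
    frac = scaled % scale;

    szString += utoa_dec(whole, szString);
    if (ucFPlaces > 0) {
        *szString++ = '.';
        /* Print frac with leading zeros, one digit per place. */
        for (div = scale / 10; div != 0; div /= 10)
            *szString++ = (char)('0' + (frac / div) % 10);
    }
    *szString = '\0';
}
```

Called as in the question, dtoa(szText, 123.45678, 3) fills szText with the rounded form "123.457", which fits the 9-byte buffer.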
Tiny printf might work for you:
Generally, the Newlib C library (BSD license, from RedHat, part of Cygwin as well as used in many "bare-metal" embedded-systems compilers) is a good place to start for useful sources for things that would be in the standard C library.
The Newlib dtoa.c sources are in the src/newlib/libc/stdlib subdirectory of the source tree:
Online source browser:
Direct link to the current version of the dtoa.c file:
The file is going to be a little odd, in that Newlib uses some odd macros for the function declarations, but should be straightforward to adapt -- and, being BSD-licensed, you can pretty much do whatever you want with it if you keep the copyright notice on it.

Assembly code for optimized bitshifting of a vector

I'm trying to write a routine that will logically bitshift by n positions to the right all elements of a vector in the most efficient way possible for the following vector types: BYTE->BYTE, WORD->WORD, DWORD->DWORD and WORD->BYTE (assuming that only 8 bits are present in the result). I would like to have three routines for each type depending on the type of processor (SSE2 supported, only MMX supported, only the standard instruction set supported). Therefore I need 12 functions in total.
I have already worked out how to back up and restore the registers that I need, how to make a loop, how to copy data into regular registers or MMX registers and how to shift by 1 position logically.
Because I'm not familiar with assembly language, that's about it.
Which registers should I use for each instruction set?
How will the availability of the large vector (an image) in L1 cache be optimized?
How do I find the next element of the vector (a pointer kind of thing)? I know I can make a mov by address and I assume I have to increment the address by 1, 2 or 4 depending on my type of data?
Although I have all the ideas, writing the code is a bit difficult at this point.
Thank you.
Here is what I'm trying to do for MMX for a shift by 1 on a DWORD:
__asm("push mm"); // backup register
__asm("push cx"); // backup register
__asm("mov %cx, length"); // initialize loop
__asm("loopstart_shift1:"); // start label
__asm("movd %xmm0, r/m32"); // get 32 bits data
__asm("psrlq %xmm0, 1"); // right shift 32 bits data logically (stuffs 0 on the left) by 1
__asm("mov r/m32,%xmm0"); // set 32 bits data
__asm("dec %cx"); // decrement index
__asm("cmp %cx,0");
__asm("jnz loopstart_shift1");
__asm("pop cx"); // restore register
__asm("pop mm"); // restore register
__asm("emms"); // leave MMX state
I strongly suggest you pause and take a look at using intrinsics with C or C++ instead of trying to write raw asm - that way the C/C++ compiler will take care of all the register allocation, instruction scheduling and general housekeeping tasks and you can just focus on the important parts, e.g. instead of using psrlq see _m_psrlq in mmintrin.h. (Better yet, look at using 128 bit SSE intrinsics.)
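As a rough illustration of the intrinsics route, here is a sketch of the DWORD->DWORD case only. The function name is made up; the SSE2 path uses _mm_srl_epi32 from emmintrin.h and is compiled only when the compiler defines __SSE2__, with a plain C loop covering the tail and any non-SSE2 target. Unaligned loads and stores are used so no alignment is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical routine: logical right shift of every 32-bit element in place. */
void shift_right_u32(uint32_t *data, size_t count, unsigned shift)
{
    size_t i = 0;

    assert(data != NULL && shift < 32);   /* C shifts by >= 32 are undefined */
#if defined(__SSE2__)
    {
        /* The shift count lives in the low 64 bits of an XMM register. */
        const __m128i n = _mm_cvtsi32_si128((int)shift);
        for (; i + 4 <= count; i += 4) {  /* four DWORDs per iteration */
            __m128i v = _mm_loadu_si128((const __m128i *)(data + i));
            v = _mm_srl_epi32(v, n);      /* psrld: zero-filling shift */
            _mm_storeu_si128((__m128i *)(data + i), v);
        }
    }
#endif
    for (; i < count; i++)                /* scalar tail / fallback */
        data[i] >>= shift;
}
```

The compiler picks the registers and does the pointer arithmetic (data + i advances by 4 bytes per element), which covers most of the housekeeping asked about in the question. The other element widths would follow the same shape with the corresponding intrinsics.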
Sounds like you'd benefit from either using or looking into BitMagic's source. It's entirely intrinsics based too, which makes it far more portable (though from the looks of it you're using GCC, so it might need an MSVC to GCC intrinsics mapping).