How to optimize loads and stores?

I'm trying to have a bunch of operations executed on different targets such as ARM, Bfin, etc., but every time I write simple code in C and compile it, each operation ends up with two loads and one store, which is unnecessary:
ldr r2, [fp, #-24]
ldr r3, [fp, #-28]
add r3, r2, r3
str r3, [fp, #-20]
ldr r2, [fp, #-36]
ldr r3, [fp, #-40]
add r3, r2, r3
str r3, [fp, #-32]
ldr r2, [fp, #-44]
ldr r3, [fp, #-48]
add r3, r2, r3
str r3, [fp, #-20]
ldr r3, [fp, #-16]
add r3, r3, #1
str r3, [fp, #-16]
When I turn on any optimization option, even -O1, the compiler simply calculates the result at compile time and passes the constant along:
subl $24, %esp
movl $4, 4(%esp)
movl $.LC0, (%esp)
Is there any way I can keep the operations without fetching the same variables over and over again? I've tried gcc -fgcse-lm and -fgcse-sm, but that didn't work.

It depends on the operation. GCC can't figure out high-level optimizations for
int a(int b, int c)
{
    b -= c;
    c -= b;
    b -= c;
    c -= b;
    b -= c;
    c -= b;
    return c;
}
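For reference, expanding the six updates algebraically collapses the whole function into one expression (the coefficients turn out to be Fibonacci numbers); this hand-derived equivalent is the kind of rewrite such a high-level optimization would have to discover:
int a_folded(int b, int c)
{
    return 13 * c - 8 * b; /* algebraically equal to the sequence above */
}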

If you want to do benchmarking while avoiding GCC's constant folding and dead-code elimination, you need to use non-constant inputs and make sure the result goes somewhere.
For instance, instead of using
int main(int argc, char** argv) {
    int a = 1;
    int b = 2;
    start_clock();
    int c = a + b;
    int d = c + a;
    int e = d + b;
    stop_clock();
    output_time_needed();
    return 0;
}
You should use something like
int main(int argc, char** argv) {
    int a = argc;
    int b = argc + 1;
    start_clock();
    int c = a + b;
    int d = c + a;
    int e = d + b;
    stop_clock();
    output_time_needed();
    return e;
}
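If returning the result is inconvenient, another common trick (GCC/Clang-specific; a sketch, not from the original answer) is an empty inline-asm statement that forces the compiler to treat a value as used, so the additions feeding it cannot be eliminated:
start_clock();
int c = a + b;
int d = c + a;
int e = d + b;
asm volatile("" : : "r"(e)); /* opaque "use" of e: keeps the computation alive */
stop_clock();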

Related

Raspberry Pi Pico cmake

I have written code for the Pi Pico; basically the program is about one LED and two buttons, where one button turns the LED on and the other turns it off. I am pretty new to Raspberry, so I don't know much. I am using a virtual machine for cmake and make, but unfortunately I can't turn my code into a uf2, because I have not defined my link_gpio_get function in the sdlink.c file (which I don't know how to do), so cmake is failing due to an undefined reference...
.EQU LED_PIN1, 0
.EQU BUT_PIN1, 1
.EQU BUT_PIN2, 2
.EQU GPIO_IN, 0
.EQU GPIO_OUT, 1

.thumb_func
.global main
main:
    MOV R0, #LED_PIN1
    BL gpio_init
    MOV R0, #LED_PIN1
    MOV R1, #GPIO_OUT
    BL link_gpio_set_dir    # Initialize PIN1
    MOV R0, #BUT_PIN1
    BL gpio_init
    MOV R0, #BUT_PIN1
    MOV R1, #GPIO_IN
    BL link_gpio_set_dir
    MOV R0, #BUT_PIN2
    BL gpio_init
    MOV R0, #BUT_PIN2
    MOV R1, #GPIO_IN
    BL link_gpio_set_dir
wait_on:
    MOV R0, #BUT_PIN1       # Wait for turn on button
    BL link_gpio_get
    CMP R0, #1
    BEQ turn_on
    B wait_on
turn_on:
    MOV R0, #LED_PIN1
    MOV R1, #1
    BL link_gpio_put        # Turn on led
    B wait_off
turn_off:
    MOV R0, #LED_PIN1
    MOV R1, #0
    BL link_gpio_put        # Turn off led
    B wait_on
wait_off:
    MOV R0, #BUT_PIN2       # Wait for off
    BL link_gpio_get
    CMP R0, #1
    BEQ turn_off
    B wait_off
Here is my sdlink.c file
/* C wrapper functions for the RP2040 SDK
 * inline functions gpio_set_dir and gpio_put.
 */
#include "hardware/gpio.h"

void link_gpio_set_dir(int pin, int dir)
{
    gpio_set_dir(pin, dir);
}

void link_gpio_put(int pin, int value)
{
    gpio_put(pin, value);
}
I've been working on outputting a uf2 using cmake on Windows 10, and after watching a YouTube video, reviewing the Hackster guide, and making my own edits, I was able to get it working.
I'm not sure what OS you are using but hopefully these links and my edit can help guide you to identify the issue with your project.
https://www.youtube.com/watch?v=mUF9xjDtFfY
https://www.hackster.io/lawrence-wiznet-io/how-to-setup-raspberry-pi-pico-c-c-sdk-in-window10-f2b816
The following is the edit that allowed me to build and get output; I hope it helps!
After you've cloned the pico-examples project, navigate to the pico-examples directory. I opened pico_sdk_import.cmake in a text editor and changed line 6 from if (DEFINED ENV{PICO_SDK_PATH} AND (NOT PICO_SDK_PATH)) to if (DEFINED ENV{PICO_SDK_PATH}).
If you can provide a link to where you obtained the code you posted, maybe I can help further figure out what sdlink.c should contain.
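As a guess at the missing piece until then: your assembly calls link_gpio_get, but sdlink.c only wraps gpio_set_dir and gpio_put, so a third wrapper around the SDK's gpio_get should resolve the undefined reference (untested sketch):
int link_gpio_get(int pin)
{
    return gpio_get(pin); /* read the current level of the pin */
}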

SRAM usage optimization in ARM devices

The relationship between the variable size and the data bus size was confusing for me so I decided to get to the bottom of it by examining the assembly code.
I compiled the source code below in the STM32CubeIDE Version 1.2.0.
#define BUFFER_SIZE ((uint8_t)0x20)
uint8_t aTxBuffer[BUFFER_SIZE];
int i;

for (i = 0; i < BUFFER_SIZE; i++) {
    aTxBuffer[i] = 0xFF; /* TxBuffer init */
}
Looking at the assembly code confirmed my suspicion. Unless I misunderstood it grossly, this code will allocate an array with a total size of BUFFER_SIZE * DATA_BUS_SIZE (which is 32 bits on Cortex-M), but we will use only the least significant byte at each memory address.
for(i=0; i<BUFFER_SIZE; i++)
//reset i to 0
800051c: 4b09 ldr r3, [pc, #36] ; (8000544 <main+0x3c>)
800051e: 2200 movs r2, #0
8000520: 601a str r2, [r3, #0]
8000522: e009 b.n 8000538 <main+0x30>
{
//store 0xFF in each member of TxBuffer
aTxBuffer[i]=0xFF; /* TxBuffer init */
8000524: 4b07 ldr r3, [pc, #28] ; (8000544 <main+0x3c>)
8000526: 681b ldr r3, [r3, #0]
8000528: 4a07 ldr r2, [pc, #28] ; (8000548 <main+0x40>)
800052a: 21ff movs r1, #255 ; 0xff
800052c: 54d1 strb r1, [r2, r3]
for(i=0; i<BUFFER_SIZE; i++)
//increment i
800052e: 4b05 ldr r3, [pc, #20] ; (8000544 <main+0x3c>)
8000530: 681b ldr r3, [r3, #0]
8000532: 3301 adds r3, #1
8000534: 4a03 ldr r2, [pc, #12] ; (8000544 <main+0x3c>)
8000536: 6013 str r3, [r2, #0]
//compare if i is less than 31. then jump to 8000524
8000538: 4b02 ldr r3, [pc, #8] ; (8000544 <main+0x3c>)
800053a: 681b ldr r3, [r3, #0]
800053c: 2b1f cmp r3, #31
800053e: d9f1 bls.n 8000524 <main+0x1c>
//pointer to i in SRAM
8000544: 2000002c .word 0x2000002c
//pointer to TxBuffer in SRAM
8000548: 20000064 .word 0x20000064
As SRAM is at a premium in embedded devices, I believe there must be some clever ways to optimize its usage. One naive solution I can think of is to allocate the buffer as uint32_t and do bit shifting to access the higher bytes, but this seems costly from a speed perspective. What is the recommended practice here?
Bus size does not matter in this case. Memory usage will be the same.
Some Cortex cores do not allow unaligned access. What is an unaligned access? Unaligned memory accesses occur when you try to access (as a single operation) N bytes of data starting from an address that is not evenly divisible by N (i.e. addr % N != 0). In our case N can be 1, 2 or 4.
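For example (a sketch; on a core that disallows it, this faults, and the cast is formally undefined behavior in C anyway):
uint8_t buf[8];
uint32_t v = *(uint32_t *)&buf[1]; /* N = 4; if buf happens to be word-aligned, this address % 4 == 1 */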
Your example should be analyzed with optimizations turned on.
#define BUFFER_SIZE ((uint8_t)0x20)
uint8_t aTxBuffer[BUFFER_SIZE];

void init(uint8_t x)
{
    for (int i = 0; i < BUFFER_SIZE; i++)
    {
        aTxBuffer[i] = x;
    }
}
The STM32F0, which does not allow unaligned access, will have to store the data byte by byte:
init:
    ldr r3, .L5
    movs r2, r3
    adds r2, r2, #32
.L2:
    strb r0, [r3]
    adds r3, r3, #1
    cmp r3, r2
    bne .L2
    bx lr
.L5:
    .word aTxBuffer
but the STM32F4 will store full 32-bit words (4 bytes at a time), which is faster (fewer operations):
init:
    movs r3, #0
    bfi r3, r0, #0, #8
    bfi r3, r0, #8, #8
    ldr r2, .L3
    bfi r3, r0, #16, #8
    bfi r3, r0, #24, #8
    str r3, [r2]       # unaligned
    str r3, [r2, #4]   # unaligned
    str r3, [r2, #8]   # unaligned
    str r3, [r2, #12]  # unaligned
    str r3, [r2, #16]  # unaligned
    str r3, [r2, #20]  # unaligned
    str r3, [r2, #24]  # unaligned
    str r3, [r2, #28]  # unaligned
    bx lr
.L3:
    .word aTxBuffer
The SRAM consumption is exactly the same in both cases. The given code does not use more than BUFFER_SIZE*8 bits for aTxBuffer.
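If the goal is simply a fast fill, a further option (assuming the C library is available on the target) is to let memset choose the widest stores the core permits:
#include <string.h>

void init(uint8_t x)
{
    /* still exactly BUFFER_SIZE bytes of SRAM; the library routine
       uses word stores where alignment allows */
    memset(aTxBuffer, x, sizeof aTxBuffer);
}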
Note the following line in your assembly
800052c: 54d1 strb r1, [r2, r3]
Note the b suffix to the instruction here, indicating 'byte'.
In effect, the instruction translates to 'store 1 byte of value 0xFF (stored in r1) at aTxBuffer (stored in r2) + i (stored in r3)'.
So, while the assembly doesn't indicate the end of the buffer, it certainly accesses all bytes in the aTxBuffer array without any waste.
It's possible that your minimal example doesn't capture the problem you face in your actual code, but I find it unlikely that a compiler would waste bytes like that, especially one targeting embedded devices.
In case you do find that to be the case, you can simply allocate a uint32_t array of the same total size in bytes (rounded up) and cast the address of its first element to a uint8_t pointer. You can then access individual bytes through that pointer as normal.
Note that such programming should be avoided and is only shown as an example. Specifically, it makes it difficult for compilers to analyze pointer aliasing, which hinders some optimizations. It also puts a burden on the user: careful memory management is required to avoid mistakes (for example, a heap-allocated buffer must be freed through only one of the pointers, to avoid a double free).
Example:
#define BUFFSIZE 0x20
// number of elements in int32 will be BUFFSIZE / 4
#define BUFFSIZE_IN_INT_32 (BUFFSIZE >> 2)
// allocate the buffer
uint32_t uint32_array[BUFFSIZE_IN_INT_32];
// point to 1-byte-sized elements
uint8_t * aTxBuffer = (uint8_t *)(uint32_array);
// use aTxBuffer as you like
Note here that I assume BUFFSIZE is divisible by 4. If it is not, increase BUFFSIZE_IN_INT_32 by 1.
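As an aside, a union would give the same two views of the storage without the pointer cast, avoiding the aliasing concern above (a sketch; type-punning through unions is well-defined in C, implementation-defined in C++):
typedef union {
    uint32_t words[BUFFSIZE_IN_INT_32];
    uint8_t  bytes[BUFFSIZE];
} tx_buffer_t;

tx_buffer_t buf; /* buf.bytes and buf.words alias the same storage */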

Is it better for performance to use (multiple) conditional ternary operators rather than an if statement in GLSL?

I remember years ago I was told it was better in a GLSL shader to do
a = condition ? statementX : statementY;
over
if(condition) a = statementX;
else a = statementY;
because in the latter case, for every fragment that didn't satisfy the condition, execution would halt while statementX was executed for the fragments that did satisfy it; those fragments would then wait while statementY was executed on the others. In the former case, statementX and statementY would be executed in parallel for the corresponding fragments. (I guess it's a bit more complicated with workgroups etc., but that's the gist of it, I think.) In fact, even for multiple statements I used to see this:
a0 = condition ? statementX0 : statementY0;
a1 = condition ? statementX1 : statementY1;
a2 = condition ? statementX2 : statementY2;
instead of
if (condition) {
    a0 = statementX0;
    a1 = statementX1;
    a2 = statementX2;
} else {
    a0 = statementY0;
    a1 = statementY1;
    a2 = statementY2;
}
Is this still the case? or have architectures or compilers improved? Is this a premature optimization not worth pursuing? Or still very relevant?
(and is it the same for different kinds of shaders? fragment, vertex, compute etc).
In both cases you would normally get a branch, and almost surely both will lead to the same assembly:
__global__ void simpleTest(int *in, int a, int b, int *out)
{
    int value = *in;
    int p = (value != 0) ? __sinf(a) : __cosf(b);
    *out = p;
}

__global__ void simpleTest2(int *in, int a, int b, int *out)
{
    int value = *in;
    int p;
    if (value != 0)
    {
        p = __sinf(a);
    }
    else
    {
        p = __cosf(b);
    }
    *out = p;
}
Here's how SASS looks for both:
MOV R1, c[0x0][0x44]
MOV R2, c[0x0][0x140]
MOV R3, c[0x0][0x144]
LD.E R2, [R2]
MOV R5, c[0x0][0x154]
ISETP.EQ.AND P0, PT, R2, RZ, PT
#!P0 I2F.F32.S32 R0, c[0x0] [0x148]
#P0 I2F.F32.S32 R4, c[0x0] [0x14c]
#!P0 RRO.SINCOS R0, R0
#P0 RRO.SINCOS R4, R4
#!P0 MUFU.SIN R0, R0
#P0 MUFU.COS R0, R4
MOV R4, c[0x0][0x150]
F2I.S32.F32.TRUNC R0, R0
ST.E [R4], R0
EXIT
BRA 0x98
The #!P0 and #P0 you see are predicates. Each thread has its own predicate bit, set according to the result of the comparison. As the processing unit goes through the code, the predicate bit decides whether each instruction is executed (or, perhaps more precisely, whether its result is committed).
Let's look at a case in which you get no branch in either version.
__global__ void simpleTest(int *in, int a, int b, int *out)
{
    int value = *in;
    int p = (value != 0) ? a : b;
    *out = p;
}

__global__ void simpleTest2(int *in, int a, int b, int *out)
{
    int value = *in;
    int p;
    if (value != 0)
    {
        p = a;
    }
    else
    {
        p = b;
    }
    *out = p;
}
And here's how SASS looks for both:
MOV R1, c[0x0][0x44]
MOV R2, c[0x0][0x140] ; load in pointer into R2
MOV R3, c[0x0][0x144]
LD.E R2, [R2] ; deref pointer
MOV R6, c[0x0][0x14c] ; load a. b is stored at c[0x0][0x148]
MOV R4, c[0x0][0x150] ; load out pointer into R4
MOV R5, c[0x0][0x154]
ICMP.EQ R0, R6, c[0x0][0x148], R2 ; Check R2 if zero and select source based on result. Result is put into R0.
ST.E [R4], R0
EXIT
BRA 0x60
There's no branch here. You can think of the result as a linear interpolation between a and b:
int cond = (*in != 0);
*out = cond * a + (1 - cond) * b;
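The same selection can also be written branchlessly in plain C with a mask, which is essentially what the ICMP instruction does (an illustrative sketch, not code from the kernels above; compilers routinely generate this form on their own):
int select_branchless(int cond, int a, int b)
{
    int mask = -(cond != 0);         /* all ones if cond is true, else zero */
    return (a & mask) | (b & ~mask); /* picks a when cond, b otherwise */
}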

Efficiently dividing an unsigned value by a power of two, rounding up - in CUDA

I was just reading:
Efficiently dividing unsigned value by a power of two, rounding up
and I was wondering what was the fastest way to do this in CUDA. Of course by "fast" I mean in terms of throughput (that question also addressed the case of subsequent calls depending on each other).
For the lg() function mentioned in that question (base-2 logarithm of divisor), suppose we have:
template <typename T> __device__ int find_first_set(T x);
template <> __device__ int find_first_set<uint32_t>(uint32_t x) { return __ffs(x); }
template <> __device__ int find_first_set<uint64_t>(uint64_t x) { return __ffsll(x); }
template <typename T> __device__ int lg(T x) { return find_first_set(x) - 1; }
Edit: Since I've been made aware that there's no find-first-set instruction in PTX, nor in the instruction sets of NVIDIA GPUs to date, let's replace that lg() with the following:
template <typename T> __device__ int population_count(T x);
template <> __device__ int population_count<uint32_t>(uint32_t x) { return __popc(x); }
template <> __device__ int population_count<uint64_t>(uint64_t x) { return __popcll(x); }
template <typename T>
__device__ int lg_for_power_of_2(T x) { return population_count(x - 1); }
// e.g. x = 8: x - 1 = 0b111, whose population count is 3 = log2(8)
and we now need to implement
template <typename T> T div_by_power_of_2_rounding_up(T p, T q);
... for T = uint32_t and T = uint64_t. (p is the dividend, q is the divisor).
Notes:
As in the original question, we may not assume that p <= std::numeric_limits<T>::max() - q or that p > 0 - that would collapse the various interesting alternatives :-)
0 is not a power of 2, so we may assume q != 0.
I realize solutions might differ for 32-bit and 64-bit; I'm more interested in the former but also in the latter.
Let's focus on Maxwell and Pascal chips.
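To make the first note concrete: the usual (p + q - 1) / q form wraps around for large dividends, which is why it is off the table (hypothetical naive() shown for illustration):
uint32_t naive(uint32_t p, uint32_t q) { return (p + q - 1) / q; }
// naive(0xFFFFFFFFu, 2): p + q - 1 wraps to 0, yielding 0 instead of 0x80000000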
With funnel shifting available, a possible 32-bit strategy is to do (essentially) a 33-bit shift, preserving the carry of the addition so that the addition can be done before the shift, such as this (not tested):
unsigned sum = dividend + mask;  // may wrap; the lost carry is the 33rd bit
unsigned result = __funnelshift_r(sum, sum < mask, log_2_of_divisor);  // low word = sum, high word = carry
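For clarity, the same 33-bit computation can be modeled in plain C with 64-bit arithmetic (a sketch for exposition only, not the fast path; the function name is made up):
uint32_t div_pow2_round_up_model(uint32_t dividend, uint32_t mask, int log_2_of_divisor)
{
    uint64_t sum = (uint64_t)dividend + mask;  /* 33 significant bits, carry preserved */
    return (uint32_t)(sum >> log_2_of_divisor);
}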
Edit by @einpoklum:
Tested using @RobertCrovella's program; it seems to work fine. The test kernel PTX for SM_61 is:
.reg .pred %p<2>;
.reg .b32 %r<12>;
ld.param.u32 %r5, [_Z4testjj_param_0];
ld.param.u32 %r6, [_Z4testjj_param_1];
neg.s32 %r7, %r6;
and.b32 %r8, %r6, %r7;
clz.b32 %r9, %r8;
mov.u32 %r10, 31;
sub.s32 %r4, %r10, %r9;
add.s32 %r11, %r6, -1;
add.s32 %r2, %r11, %r5;
setp.lt.u32 %p1, %r2, %r11;
selp.u32 %r3, 1, 0, %p1;
// inline asm
shf.r.wrap.b32 %r1, %r2, %r3, %r4;
// inline asm
st.global.u32 [r], %r1;
ret;
and the SASS is:
/*0008*/ MOV R1, c[0x0][0x20]; /* 0x4c98078000870001 */
/*0010*/ MOV R0, c[0x0][0x144]; /* 0x4c98078005170000 */
/*0018*/ IADD R2, RZ, -c[0x0][0x144]; /* 0x4c1100000517ff02 */
/* 0x001c4c00fe4007f1 */
/*0028*/ IADD32I R0, R0, -0x1; /* 0x1c0ffffffff70000 */
/*0030*/ LOP.AND R2, R2, c[0x0][0x144]; /* 0x4c47000005170202 */
/*0038*/ FLO.U32 R2, R2; /* 0x5c30000000270002 */
/* 0x003fd800fe2007e6 */
/*0048*/ IADD R5, R0, c[0x0][0x140]; /* 0x4c10000005070005 */
/*0050*/ ISETP.LT.U32.AND P0, PT, R5, R0, PT; /* 0x5b62038000070507 */
/*0058*/ IADD32I R0, -R2, 0x1f; /* 0x1d00000001f70200 */
/* 0x001fc400fe2007f6 */
/*0068*/ IADD32I R0, -R0, 0x1f; /* 0x1d00000001f70000 */
/*0070*/ SEL R6, RZ, 0x1, !P0; /* 0x38a004000017ff06 */
/*0078*/ MOV32I R2, 0x0; /* 0x010000000007f002 */
/* 0x0003c400fe4007e4 */
/*0088*/ MOV32I R3, 0x0; /* 0x010000000007f003 */
/*0090*/ SHF.R.W R0, R5, R0, R6; /* 0x5cfc030000070500 */
/*0098*/ STG.E [R2], R0; /* 0xeedc200000070200 */
/* 0x001f8000ffe007ff */
/*00a8*/ EXIT; /* 0xe30000000007000f */
/*00b0*/ BRA 0xb0; /* 0xe2400fffff87000f */
/*00b8*/ NOP; /* 0x50b0000000070f00 */
Here's an adaptation of a well-performing answer for the CPU:
template <typename T>
__device__ T div_by_power_of_2_rounding_up(T dividend, T divisor)
{
    auto log_2_of_divisor = lg(divisor);
    auto mask = divisor - 1;
    auto correction_for_rounding_up = ((dividend & mask) + mask) >> log_2_of_divisor;
    return (dividend >> log_2_of_divisor) + correction_for_rounding_up;
}
I wonder whether one can do much better.
The SASS code (using @RobertCrovella's test kernel) for SM_61 is:
code for sm_61
Function : test(unsigned int, unsigned int)
.headerflags #"EF_CUDA_SM61 EF_CUDA_PTX_SM(EF_CUDA_SM61)"
/* 0x001fd400fe2007f6 */
/*0008*/ MOV R1, c[0x0][0x20]; /* 0x4c98078000870001 */
/*0010*/ IADD R0, RZ, -c[0x0][0x144]; /* 0x4c1100000517ff00 */
/*0018*/ MOV R2, c[0x0][0x144]; /* 0x4c98078005170002 */
/* 0x003fc40007a007f2 */
/*0028*/ LOP.AND R0, R0, c[0x0][0x144]; /* 0x4c47000005170000 */
/*0030*/ FLO.U32 R3, R0; /* 0x5c30000000070003 */
/*0038*/ IADD32I R0, R2, -0x1; /* 0x1c0ffffffff70200 */
/* 0x001fc400fcc017f5 */
/*0048*/ IADD32I R3, -R3, 0x1f; /* 0x1d00000001f70303 */
/*0050*/ LOP.AND R2, R0, c[0x0][0x140]; /* 0x4c47000005070002 */
/*0058*/ IADD R2, R0, R2; /* 0x5c10000000270002 */
/* 0x001fd000fe2007f1 */
/*0068*/ IADD32I R0, -R3, 0x1f; /* 0x1d00000001f70300 */
/*0070*/ MOV R3, c[0x0][0x140]; /* 0x4c98078005070003 */
/*0078*/ MOV32I R6, 0x0; /* 0x010000000007f006 */
/* 0x001fc400fc2407f1 */
/*0088*/ SHR.U32 R4, R2, R0.reuse; /* 0x5c28000000070204 */
/*0090*/ SHR.U32 R5, R3, R0; /* 0x5c28000000070305 */
/*0098*/ MOV R2, R6; /* 0x5c98078000670002 */
/* 0x0003c400fe4007f4 */
/*00a8*/ MOV32I R3, 0x0; /* 0x010000000007f003 */
/*00b0*/ IADD R0, R4, R5; /* 0x5c10000000570400 */
/*00b8*/ STG.E [R2], R0; /* 0xeedc200000070200 */
/* 0x001f8000ffe007ff */
/*00c8*/ EXIT; /* 0xe30000000007000f */
/*00d0*/ BRA 0xd0; /* 0xe2400fffff87000f */
/*00d8*/ NOP; /* 0x50b0000000070f00 */
/* 0x001f8000fc0007e0 */
/*00e8*/ NOP; /* 0x50b0000000070f00 */
/*00f0*/ NOP; /* 0x50b0000000070f00 */
/*00f8*/ NOP; /* 0x50b0000000070f00 */
with FLO being the "find leading 1" instruction (thanks @tera). Anyway, those are a lot of instructions, even if you ignore the loads from (what looks like) constant memory... the CPU function inspiring this one compiles into just:
tzcnt rax, rsi
lea rcx, [rdi - 1]
shrx rax, rcx, rax
add rax, 1
test rdi, rdi
cmove rax, rdi
(with clang 3.9.0).
Riffing off of the kewl answer by @tera:
template <typename T> __device__ T pdqru(T p, T q)
{
    return bool(p) * (((p - 1) >> lg(q)) + 1);
}
11 instructions (no branches, no predication) to get the result in R0:
Fatbin elf code:
================
arch = sm_61
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit
code for sm_61
Function : _Z4testjj
.headerflags #"EF_CUDA_SM61 EF_CUDA_PTX_SM(EF_CUDA_SM61)"
/* 0x001fc800fec007f6 */
/*0008*/ MOV R1, c[0x0][0x20]; /* 0x4c98078000870001 */
/*0010*/ IADD R0, RZ, -c[0x0][0x144]; /* 0x4c1100000517ff00 */
/*0018*/ LOP.AND R0, R0, c[0x0][0x144]; /* 0x4c47000005170000 */
/* 0x003fc400ffa00711 */
/*0028*/ FLO.U32 R0, R0; /* 0x5c30000000070000 */
/*0030*/ MOV R5, c[0x0][0x140]; /* 0x4c98078005070005 */
/*0038*/ IADD32I R2, -R0, 0x1f; /* 0x1d00000001f70002 */
/* 0x001fd800fcc007f5 */
/*0048*/ IADD32I R0, R5, -0x1; /* 0x1c0ffffffff70500 */
/*0050*/ IADD32I R2, -R2, 0x1f; /* 0x1d00000001f70202 */
/*0058*/ SHR.U32 R0, R0, R2; /* 0x5c28000000270000 */
/* 0x001fd000fe2007f1 */
/*0068*/ IADD32I R0, R0, 0x1; /* 0x1c00000000170000 */
/*0070*/ MOV32I R2, 0x0; /* 0x010000000007f002 */
/*0078*/ MOV32I R3, 0x0; /* 0x010000000007f003 */
/* 0x001ffc001e2007f2 */
/*0088*/ ICMP.NE R0, R0, RZ, R5; /* 0x5b4b02800ff70000 */
/*0090*/ STG.E [R2], R0; /* 0xeedc200000070200 */
/*0098*/ EXIT; /* 0xe30000000007000f */
/* 0x001f8000fc0007ff */
/*00a8*/ BRA 0xa0; /* 0xe2400fffff07000f */
/*00b0*/ NOP; /* 0x50b0000000070f00 */
/*00b8*/ NOP; /* 0x50b0000000070f00 */
..........................
After studying the above SASS code, it seemed evident that these two instructions:
/*0038*/ IADD32I R2, -R0, 0x1f; /* 0x1d00000001f70002 */
/* 0x001fd800fcc007f5 */
...
/*0050*/ IADD32I R2, -R2, 0x1f; /* 0x1d00000001f70202 */
shouldn't really be necessary. I don't have a precise explanation, but my assumption is that because the FLO.U32 SASS instruction does not have precisely the same semantics as the __ffs() intrinsic, the compiler has an idiom for that intrinsic which wraps the basic FLO instruction doing the work. It wasn't obvious how to work around this at the C++ source-code level, but I was able to use the bfind PTX instruction to reduce the instruction count further, to 7 by my count (to get the answer into a register):
$ cat t107.cu
#include <cstdio>
#include <cstdint>
__device__ unsigned r = 0;
static __device__ __inline__ uint32_t __my_bfind(uint32_t val){
    uint32_t ret;
    asm volatile("bfind.u32 %0, %1;" : "=r"(ret): "r"(val));
    return ret;
}
template <typename T> __device__ T pdqru(T p, T q)
{
    return bool(p) * (((p - 1) >> (__my_bfind(q))) + 1);
}
__global__ void test(unsigned p, unsigned q){
#ifdef USE_DISPLAY
    unsigned q2 = 16;
    unsigned z = 0;
    unsigned l = 1U<<31;
    printf("result %u/%u = %u\n", p, q, pdqru(p, q));
    printf("result %u/%u = %u\n", p, q2, pdqru(p, q2));
    printf("result %u/%u = %u\n", p, z, pdqru(p, z));
    printf("result %u/%u = %u\n", z, q, pdqru(z, q));
    printf("result %u/%u = %u\n", l, q, pdqru(l, q));
    printf("result %u/%u = %u\n", q, l, pdqru(q, l));
    printf("result %u/%u = %u\n", l, l, pdqru(l, l));
    printf("result %u/%u = %u\n", q, q, pdqru(q, q));
#else
    r = pdqru(p, q);
#endif
}
int main(){
    unsigned h_r;
    test<<<1,1>>>(32767, 32);
    cudaMemcpyFromSymbol(&h_r, r, sizeof(unsigned));
    printf("result = %u\n", h_r);
}
$ nvcc -arch=sm_61 -o t107 t107.cu -std=c++11
$ cuobjdump -sass t107
Fatbin elf code:
================
arch = sm_61
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit
code for sm_61
Fatbin elf code:
================
arch = sm_61
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit
code for sm_61
Function : _Z4testjj
.headerflags #"EF_CUDA_SM61 EF_CUDA_PTX_SM(EF_CUDA_SM61)"
/* 0x001c4400fe0007f6 */
/*0008*/ MOV R1, c[0x0][0x20]; /* 0x4c98078000870001 */
/*0010*/ { MOV32I R3, 0x0; /* 0x010000000007f003 */
/*0018*/ FLO.U32 R2, c[0x0][0x144]; } /* 0x4c30000005170002 */
/* 0x003fd800fec007f6 */
/*0028*/ MOV R5, c[0x0][0x140]; /* 0x4c98078005070005 */
/*0030*/ IADD32I R0, R5, -0x1; /* 0x1c0ffffffff70500 */
/*0038*/ SHR.U32 R0, R0, R2; /* 0x5c28000000270000 */
/* 0x001fc800fca007f1 */
/*0048*/ IADD32I R0, R0, 0x1; /* 0x1c00000000170000 */
/*0050*/ MOV32I R2, 0x0; /* 0x010000000007f002 */
/*0058*/ ICMP.NE R0, R0, RZ, R5; /* 0x5b4b02800ff70000 */
/* 0x001ffc00ffe000f1 */
/*0068*/ STG.E [R2], R0; /* 0xeedc200000070200 */
/*0070*/ EXIT; /* 0xe30000000007000f */
/*0078*/ BRA 0x78; /* 0xe2400fffff87000f */
..........................
Fatbin ptx code:
================
arch = sm_61
code version = [5,0]
producer = cuda
host = linux
compile_size = 64bit
compressed
$ nvcc -arch=sm_61 -o t107 t107.cu -std=c++11 -DUSE_DISPLAY
$ cuda-memcheck ./t107
========= CUDA-MEMCHECK
result 32767/32 = 1024
result 32767/16 = 2048
result 32767/0 = 1
result 0/32 = 0
result 2147483648/32 = 67108864
result 32/2147483648 = 1
result 2147483648/2147483648 = 1
result 32/32 = 1
result = 0
========= ERROR SUMMARY: 0 errors
$
I've only demonstrated the 32-bit example, above.
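A 64-bit variant would presumably follow the same pattern. Per the PTX ISA, bfind.u64 reads a 64-bit source but still writes a 32-bit destination, so an untested sketch might look like:
static __device__ __inline__ uint32_t __my_bfind64(uint64_t val){
    uint32_t ret;
    asm volatile("bfind.u64 %0, %1;" : "=r"(ret) : "l"(val));
    return ret;
}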
I think I could make the case that there are really only 6 instructions doing the "work" in the above kernel SASS, and that the remainder of the instructions are kernel "overhead" and/or the instructions needed to store the register result into global memory. It seems evident that the compiler is generating just these instructions as a result of the function:
/*0018*/ FLO.U32 R2, c[0x0][0x144]; // find the highest set bit in q, i.e. lg(q)
/* */
/*0028*/ MOV R5, c[0x0][0x140]; // load p
/*0030*/ IADD32I R0, R5, -0x1; // subtract 1 from p
/*0038*/ SHR.U32 R0, R0, R2; // shift (p-1) right by lg(q) bits
/* */
/*0048*/ IADD32I R0, R0, 0x1; // add 1 to result
/*0050*/ ... /* */
/*0058*/ ICMP.NE R0, R0, RZ, R5; // account for p=0 case
However this would be inconsistent with the way I've counted other cases (they should all probably be reduced by 1).
template <typename T> __device__ T div_by_power_of_2_rounding_up(T p, T q)
{
    return p == 0 ? 0 : ((p - 1) >> lg(q)) + 1;
}
One instruction shorter than Robert's previous answer (but see his comeback) if my count is correct, or the same instruction count as the funnel shift. Has a branch though - not sure if that makes a difference (other than a benefit if the entire warp gets zero p inputs):
Fatbin elf code:
================
arch = sm_61
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit
code for sm_61
Function : _Z4testjj
.headerflags #"EF_CUDA_SM61 EF_CUDA_PTX_SM(EF_CUDA_SM61)"
/* 0x001fc000fda007f6 */
/*0008*/ MOV R1, c[0x0][0x20]; /* 0x4c98078000870001 */
/*0010*/ ISETP.EQ.AND P0, PT, RZ, c[0x0][0x140], PT; /* 0x4b6503800507ff07 */
/*0018*/ { MOV R0, RZ; /* 0x5c9807800ff70000 */
/*0028*/ #P0 BRA 0x90; } /* 0x001fc800fec007fd */
/* 0xe24000000600000f */
/*0030*/ IADD R0, RZ, -c[0x0][0x144]; /* 0x4c1100000517ff00 */
/*0038*/ LOP.AND R0, R0, c[0x0][0x144]; /* 0x4c47000005170000 */
/* 0x003fc400ffa00711 */
/*0048*/ FLO.U32 R0, R0; /* 0x5c30000000070000 */
/*0050*/ MOV R3, c[0x0][0x140]; /* 0x4c98078005070003 */
/*0058*/ IADD32I R2, -R0, 0x1f; /* 0x1d00000001f70002 */
/* 0x001fd800fcc007f5 */
/*0068*/ IADD32I R0, R3, -0x1; /* 0x1c0ffffffff70300 */
/*0070*/ IADD32I R2, -R2, 0x1f; /* 0x1d00000001f70202 */
/*0078*/ SHR.U32 R0, R0, R2; /* 0x5c28000000270000 */
/* 0x001fc800fe2007f6 */
/*0088*/ IADD32I R0, R0, 0x1; /* 0x1c00000000170000 */
/*0090*/ MOV32I R2, 0x0; /* 0x010000000007f002 */
/*0098*/ MOV32I R3, 0x0; /* 0x010000000007f003 */
/* 0x001ffc00ffe000f1 */
/*00a8*/ STG.E [R2], R0; /* 0xeedc200000070200 */
/*00b0*/ EXIT; /* 0xe30000000007000f */
/*00b8*/ BRA 0xb8; /* 0xe2400fffff87000f */
..........................
I believe it should still be possible to shave an instruction or two from the funnel shift by writing it in PTX (Morning update: as Robert has proven in the meantime), but I really need to go to bed.
Update 2: Doing that (using Harold's funnel shift and writing the function in PTX)
__device__ uint32_t div_by_power_of_2_rounding_up(uint32_t p, uint32_t q)
{
    uint32_t ret;
    asm volatile("{\n\t"
                 ".reg.u32 shift, mask, lo, hi;\n\t"
                 "bfind.u32 shift, %2;\n\t"              // shift = lg(q)
                 "sub.u32 mask, %2, 1;\n\t"              // mask = q - 1
                 "add.cc.u32 lo, %1, mask;\n\t"          // lo = p + mask, sets carry
                 "addc.u32 hi, 0, 0;\n\t"                // hi = the carry (33rd bit)
                 "shf.r.wrap.b32 %0, lo, hi, shift;\n\t" // 33-bit funnel shift right
                 "}"
                 : "=r"(ret) : "r"(p), "r"(q));
    return ret;
}
just gets us to the same instruction count as Robert has already achieved with his simpler C code:
Fatbin elf code:
================
arch = sm_61
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit
code for sm_61
Function : _Z4testjj
.headerflags #"EF_CUDA_SM61 EF_CUDA_PTX_SM(EF_CUDA_SM61)"
/* 0x001fc000fec007f6 */
/*0008*/ MOV R1, c[0x0][0x20]; /* 0x4c98078000870001 */
/*0010*/ MOV R0, c[0x0][0x144]; /* 0x4c98078005170000 */
/*0018*/ { IADD32I R2, R0, -0x1; /* 0x1c0ffffffff70002 */
/*0028*/ FLO.U32 R0, c[0x0][0x144]; } /* 0x001fc400fec00716 */
/* 0x4c30000005170000 */
/*0030*/ IADD R5.CC, R2, c[0x0][0x140]; /* 0x4c10800005070205 */
/*0038*/ IADD.X R6, RZ, RZ; /* 0x5c1008000ff7ff06 */
/* 0x003fc800fc8007f1 */
/*0048*/ MOV32I R2, 0x0; /* 0x010000000007f002 */
/*0050*/ MOV32I R3, 0x0; /* 0x010000000007f003 */
/*0058*/ SHF.R.W R0, R5, R0, R6; /* 0x5cfc030000070500 */
/* 0x001ffc00ffe000f1 */
/*0068*/ STG.E [R2], R0; /* 0xeedc200000070200 */
/*0070*/ EXIT; /* 0xe30000000007000f */
/*0078*/ BRA 0x78; /* 0xe2400fffff87000f */
..........................
One possible straightforward approach:
$ cat t105.cu
#include <cstdio>
__device__ unsigned r = 0;
template <typename T>
__device__ T pdqru(T p, T q){
    T p1 = p + (q - 1);
    if (sizeof(T) == 8)
        q = __ffsll(q);
    else
        q = __ffs(q);
    return (p1 < p) ? ((p >> (q - 1)) + 1) : (p1 >> (q - 1));
}
__global__ void test(unsigned p, unsigned q){
#ifdef USE_DISPLAY
    unsigned q2 = 16;
    unsigned z = 0;
    unsigned l = 1U<<31;
    printf("result %u/%u = %u\n", p, q, pdqru(p, q));
    printf("result %u/%u = %u\n", p, q2, pdqru(p, q2));
    printf("result %u/%u = %u\n", p, z, pdqru(p, z));
    printf("result %u/%u = %u\n", z, q, pdqru(z, q));
    printf("result %u/%u = %u\n", l, q, pdqru(l, q));
    printf("result %u/%u = %u\n", q, l, pdqru(q, l));
    printf("result %u/%u = %u\n", l, l, pdqru(l, l));
    printf("result %u/%u = %u\n", q, q, pdqru(q, q));
#else
    r = pdqru(p, q);
#endif
}
int main(){
    unsigned h_r;
    test<<<1,1>>>(32767, 32);
    cudaMemcpyFromSymbol(&h_r, r, sizeof(unsigned));
    printf("result = %u\n", h_r);
}
$ nvcc -arch=sm_61 -o t105 t105.cu
$ cuobjdump -sass ./t105
Fatbin elf code:
================
arch = sm_61
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit
code for sm_61
Fatbin elf code:
================
arch = sm_61
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit
code for sm_61
Function : _Z4testjj
.headerflags #"EF_CUDA_SM61 EF_CUDA_PTX_SM(EF_CUDA_SM61)"
/* 0x001fc800fec007f6 */
/*0008*/ MOV R1, c[0x0][0x20]; /* 0x4c98078000870001 */
/*0010*/ IADD R0, RZ, -c[0x0][0x144]; /* 0x4c1100000517ff00 */
/*0018*/ LOP.AND R0, R0, c[0x0][0x144]; /* 0x4c47000005170000 */
/* 0x005fd401fe20003d */
/*0028*/ FLO.U32 R2, R0; /* 0x5c30000000070002 */
/*0030*/ MOV R0, c[0x0][0x144]; /* 0x4c98078005170000 */
/*0038*/ IADD32I R3, -R2, 0x1f; /* 0x1d00000001f70203 */
/* 0x001fd000fc2007f1 */
/*0048*/ IADD32I R0, R0, -0x1; /* 0x1c0ffffffff70000 */
/*0050*/ MOV R2, c[0x0][0x140]; /* 0x4c98078005070002 */
/*0058*/ IADD32I R4, -R3, 0x1f; /* 0x1d00000001f70304 */
/* 0x001fd800fe2007f6 */
/*0068*/ IADD R5, R0, c[0x0][0x140]; /* 0x4c10000005070005 */
/*0070*/ ISETP.LT.U32.AND P0, PT, R5, R0, PT; /* 0x5b62038000070507 */
/*0078*/ SHR.U32 R0, R2, R4; /* 0x5c28000000470200 */
/* 0x001fd000fc2007f1 */
/*0088*/ IADD32I R0, R0, 0x1; /* 0x1c00000000170000 */
/*0090*/ MOV32I R2, 0x0; /* 0x010000000007f002 */
/*0098*/ MOV32I R3, 0x0; /* 0x010000000007f003 */
/* 0x001ffc001e2007f2 */
/*00a8*/ #!P0 SHR.U32 R0, R5, R4; /* 0x5c28000000480500 */
/*00b0*/ STG.E [R2], R0; /* 0xeedc200000070200 */
/*00b8*/ EXIT; /* 0xe30000000007000f */
/* 0x001f8000fc0007ff */
/*00c8*/ BRA 0xc0; /* 0xe2400fffff07000f */
/*00d0*/ NOP; /* 0x50b0000000070f00 */
/*00d8*/ NOP; /* 0x50b0000000070f00 */
/* 0x001f8000fc0007e0 */
/*00e8*/ NOP; /* 0x50b0000000070f00 */
/*00f0*/ NOP; /* 0x50b0000000070f00 */
/*00f8*/ NOP; /* 0x50b0000000070f00 */
..........................
Fatbin ptx code:
================
arch = sm_61
code version = [5,0]
producer = cuda
host = linux
compile_size = 64bit
compressed
$ nvcc -arch=sm_61 -o t105 t105.cu -DUSE_DISPLAY
$ cuda-memcheck ./t105
========= CUDA-MEMCHECK
result 32767/32 = 1024
result 32767/16 = 2048
result 32767/0 = 2048
result 0/32 = 0
result 2147483648/32 = 67108864
result 32/2147483648 = 1
result 2147483648/2147483648 = 1
result 32/32 = 1
result = 0
========= ERROR SUMMARY: 0 errors
$
Approximately 14 SASS instructions for the 32-bit case, to get the answer into R0. It produces spurious results for the divide-by-zero case.
The equivalent assembly for this answer's approach looks like this:
$ cat t106.cu
#include <cstdio>
#include <cstdint>
__device__ unsigned r = 0;
template <typename T> __device__ int find_first_set(T x);
template <> __device__ int find_first_set<uint32_t>(uint32_t x) { return __ffs(x); }
template <> __device__ int find_first_set<uint64_t>(uint64_t x) { return __ffsll(x); }
template <typename T> __device__ T lg(T x) { return find_first_set(x) - 1; }
template <typename T>
__device__ T pdqru(T dividend, T divisor)
{
    auto log_2_of_divisor = lg(divisor);
    auto mask = divisor - 1;
    auto correction_for_rounding_up = ((dividend & mask) + mask) >> log_2_of_divisor;
    return (dividend >> log_2_of_divisor) + correction_for_rounding_up;
}
__global__ void test(unsigned p, unsigned q){
#ifdef USE_DISPLAY
    unsigned q2 = 16;
    unsigned z = 0;
    unsigned l = 1U<<31;
    printf("result %u/%u = %u\n", p, q, pdqru(p, q));
    printf("result %u/%u = %u\n", p, q2, pdqru(p, q2));
    printf("result %u/%u = %u\n", p, z, pdqru(p, z));
    printf("result %u/%u = %u\n", z, q, pdqru(z, q));
    printf("result %u/%u = %u\n", l, q, pdqru(l, q));
    printf("result %u/%u = %u\n", q, l, pdqru(q, l));
    printf("result %u/%u = %u\n", l, l, pdqru(l, l));
    printf("result %u/%u = %u\n", q, q, pdqru(q, q));
#else
    r = pdqru(p, q);
#endif
}
int main(){
    unsigned h_r;
    test<<<1,1>>>(32767, 32);
    cudaMemcpyFromSymbol(&h_r, r, sizeof(unsigned));
    printf("result = %u\n", h_r);
}
$ nvcc -std=c++11 -arch=sm_61 -o t106 t106.cu
$ cuobjdump -sass t106
Fatbin elf code:
================
arch = sm_61
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit
code for sm_61
Fatbin elf code:
================
arch = sm_61
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit
code for sm_61
Function : _Z4testjj
.headerflags #"EF_CUDA_SM61 EF_CUDA_PTX_SM(EF_CUDA_SM61)"
/* 0x001fd400fe2007f6 */
/*0008*/ MOV R1, c[0x0][0x20]; /* 0x4c98078000870001 */
/*0010*/ IADD R0, RZ, -c[0x0][0x144]; /* 0x4c1100000517ff00 */
/*0018*/ MOV R2, c[0x0][0x144]; /* 0x4c98078005170002 */
/* 0x003fc40007a007f2 */
/*0028*/ LOP.AND R0, R0, c[0x0][0x144]; /* 0x4c47000005170000 */
/*0030*/ FLO.U32 R3, R0; /* 0x5c30000000070003 */
/*0038*/ IADD32I R0, R2, -0x1; /* 0x1c0ffffffff70200 */
/* 0x001fc400fcc017f5 */
/*0048*/ IADD32I R3, -R3, 0x1f; /* 0x1d00000001f70303 */
/*0050*/ LOP.AND R2, R0, c[0x0][0x140]; /* 0x4c47000005070002 */
/*0058*/ IADD R2, R0, R2; /* 0x5c10000000270002 */
/* 0x001fd000fe2007f1 */
/*0068*/ IADD32I R0, -R3, 0x1f; /* 0x1d00000001f70300 */
/*0070*/ MOV R3, c[0x0][0x140]; /* 0x4c98078005070003 */
/*0078*/ MOV32I R6, 0x0; /* 0x010000000007f006 */
/* 0x001fc400fc2407f1 */
/*0088*/ SHR.U32 R4, R2, R0.reuse; /* 0x5c28000000070204 */
/*0090*/ SHR.U32 R5, R3, R0; /* 0x5c28000000070305 */
/*0098*/ MOV R2, R6; /* 0x5c98078000670002 */
/* 0x0003c400fe4007f4 */
/*00a8*/ MOV32I R3, 0x0; /* 0x010000000007f003 */
/*00b0*/ IADD R0, R4, R5; /* 0x5c10000000570400 */
/*00b8*/ STG.E [R2], R0; /* 0xeedc200000070200 */
/* 0x001f8000ffe007ff */
/*00c8*/ EXIT; /* 0xe30000000007000f */
/*00d0*/ BRA 0xd0; /* 0xe2400fffff87000f */
/*00d8*/ NOP; /* 0x50b0000000070f00 */
/* 0x001f8000fc0007e0 */
/*00e8*/ NOP; /* 0x50b0000000070f00 */
/*00f0*/ NOP; /* 0x50b0000000070f00 */
/*00f8*/ NOP; /* 0x50b0000000070f00 */
..........................
Fatbin ptx code:
================
arch = sm_61
code version = [5,0]
producer = cuda
host = linux
compile_size = 64bit
compressed
$
which appears to be 1 instruction longer, by my count.
Here is an alternative solution via population count. I tried the 32-bit variant only, testing it exhaustively against the reference implementation. Since the divisor q is a power of 2, we can trivially determine the shift count s with the help of the population count operation. The remainder t of the truncating division can be computed with a simple mask m derived directly from the divisor q.
// For p in [0,0xffffffff], q = (1 << s) with s in [0,31], compute ceil(p/q)
__device__ uint32_t reference (uint32_t p, uint32_t q)
{
    uint32_t r = p / q;
    if ((q * r) < p) r++;
    return r;
}

// For p in [0,0xffffffff], q = (1 << s) with s in [0,31], compute ceil(p/q)
__device__ uint32_t solution (uint32_t p, uint32_t q)
{
    uint32_t r, s, t, m;
    m = q - 1;       // mask covering the remainder bits
    s = __popc (m);  // shift count s = log2(q)
    r = p >> s;      // truncating division p / q
    t = p & m;       // remainder of the truncating division
    if (t > 0) r++;  // round up if there was a remainder
    return r;
}
Whether solution() is faster than the previously posted codes will likely depend on the specific GPU architecture. Using CUDA 8.0, it compiles to the following sequence of PTX instructions:
add.s32 %r3, %r2, -1;
popc.b32 %r4, %r3;
shr.u32 %r5, %r1, %r4;
and.b32 %r6, %r3, %r1;
setp.ne.s32 %p1, %r6, 0;
selp.u32 %r7, 1, 0, %p1;
add.s32 %r8, %r5, %r7;
For sm_5x, this translates into machine code pretty much 1:1, except that the two instructions SETP and SELP get contracted into a single ICMP, because the comparison is with 0.

Missing symbols when linking a static library into a shared library

I have a problem with missing symbols when linking static libraries and .o files into a shared library. I have checked the symbol table of the static library, and the functions I need are listed there normally, like this:
...
00000000 g F .text 000000b0 av_int2dbl
...
000000b0 g F .text 00000060 av_int2flt
But when I generate the shared library, av_int2dbl, av_int2flt, and some other functions are missing (they are all listed in the static symbol table normally). I used a crude method to work around this: I made a dummy function in a .o file and referenced the missing functions from it. The DYNAMIC SYMBOL TABLE of the shared library then gained some of the previously missing functions, but the strange thing is that av_int2dbl and av_int2flt are still missing.
Could anybody tell me what the rule is for removing symbols when generating a shared library?
If ld removes all unreferenced symbols, why do functions defined in the .o files (functions that are not referenced from anywhere else) still exist in the shared library? And why, when av_int2dbl and av_int2flt are invoked explicitly in the dummy function, does the disassembly still lack these two functions?
Below is the dummy function defined in the .o file:
int my_dummy_funcs(void)
{
    av_rdft_init(0x01, 0x1);
    av_rdft_calc(NULL, NULL);
    av_rdft_end(NULL);
    av_int2dbl(1);
    av_int2flt(1);
    av_resample(NULL, NULL, NULL, NULL, 0, 0, 0);
    av_resample_close(NULL);
    av_resample_init(0, 0, 0, 0, 0, 1.0);
    return 0;
}
Disassembling the dummy function gives the following:
0008951c <my_dummy_funcs>:
8951c: e3a00001 mov r0, #1
89520: e92d40d0 push {r4, r6, r7, lr}
89524: e1a01000 mov r1, r0
89528: e3a04000 mov r4, #0
8952c: e24dd010 sub sp, sp, #16
89530: eb03a21e bl 171db0 <av_rdft_init>
89534: e1a01004 mov r1, r4
89538: e1a00004 mov r0, r4
8953c: e3a06000 mov r6, #0
89540: eb03a22f bl 171e04 <av_rdft_calc>
89544: e1a00004 mov r0, r4
89548: eb03a231 bl 171e14 <av_rdft_end>
8954c: e1a01004 mov r1, r4
89550: e1a02004 mov r2, r4
89554: e1a03004 mov r3, r4
89558: e1a00004 mov r0, r4
8955c: e58d4000 str r4, [sp]
89560: e58d4004 str r4, [sp, #4]
89564: e3a07000 mov r7, #0
89568: e58d4008 str r4, [sp, #8]
8956c: eb0e4a62 bl 41befc <av_resample>
89570: e1a00004 mov r0, r4
89574: e3437ff0 movt r7, #16368 ; 0x3ff0
89578: eb0e4a4a bl 41bea8 <av_resample_close>
8957c: e1a00004 mov r0, r4
89580: e1a01004 mov r1, r4
89584: e1a02004 mov r2, r4
89588: e1a03004 mov r3, r4
8958c: e58d4000 str r4, [sp]
89590: e1cd60f8 strd r6, [sp, #8]
89594: eb0e495f bl 41bb18 <av_resample_init>
89598: e1a00004 mov r0, r4
8959c: e28dd010 add sp, sp, #16
895a0: e8bd80d0 pop {r4, r6, r7, pc}
but when I generate the shared library, av_int2dbl and av_int2flt and some other functions are missing
The most likely reason: they are marked as HIDDEN in the regular symbol table, and that tells the linker to not export them in the dynamic symbol table.
You can verify this hypothesis by running
readelf -s libfoo.a | grep av_int2dbl
(and learn to use readelf instead of objdump on ELF platforms).
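For reference, a symbol usually ends up HIDDEN either because the library was built with -fvisibility=hidden or because its declaration carries an explicit visibility attribute. A hypothetical illustration (not the actual FFmpeg declaration):
/* hypothetical example of how a declaration becomes HIDDEN */
__attribute__((visibility("hidden")))
double av_int2dbl(int64_t v); /* resolves inside the .so, but is not exported */
If readelf shows HIDDEN in the Vis column, the fix is to rebuild the static library with default visibility (or export the symbols explicitly, e.g. via a version script).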