What is the meaning of the ES, Lk, Inf and Al column headers in the output of readelf -S? - elf

In the output of readelf -S, I'd like to know what the column headers ES, Lk, Inf and Al mean.
For example:
Section Headers:
  [Nr] Name        Type      Addr     Off    Size   ES Flg Lk Inf Al
  [ 0]             NULL      00000000 000000 000000 00      0   0  0
  [ 1] .text       PROGBITS  00000000 000034 00000d 00  AX  0   0  4
  [ 2] .rel.text   REL       00000000 000394 000008 08     10   1  4
  [ 3] .data       PROGBITS  00000000 000044 000000 00  WA  0   0  4
  [...]

I'd like to know what the column headers ES, Lk, Inf and Al mean.
Look in /usr/include/elf.h for the definition of Elf32_Shdr. You'll see something like this:
typedef struct
{
  Elf32_Word sh_name;      /* Section name (string tbl index) */
  Elf32_Word sh_type;      /* Section type */
  Elf32_Word sh_flags;     /* Section flags */
  Elf32_Addr sh_addr;      /* Section virtual addr at execution */
  Elf32_Off  sh_offset;    /* Section file offset */
  Elf32_Word sh_size;      /* Section size in bytes */
  Elf32_Word sh_link;      /* Link to another section */
  Elf32_Word sh_info;      /* Additional section information */
  Elf32_Word sh_addralign; /* Section alignment */
  Elf32_Word sh_entsize;   /* Entry size if section holds table */
} Elf32_Shdr;
So, a reasonable guess would be: ES == sh_entsize, Lk == sh_link, Inf == sh_info and Al == sh_addralign.
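To make the mapping concrete, here is a minimal Python sketch that unpacks one Elf32_Shdr and picks out the four fields behind those columns. It assumes a little-endian 32-bit ELF; the helper names parse_shdr and readelf_columns are made up for illustration, and the header bytes are built by hand to mirror the .rel.text row above rather than read from a real file.

```python
import struct

# Elf32_Shdr is ten 32-bit words; "<" assumes a little-endian ELF file.
SHDR_FORMAT = "<10I"
FIELDS = ("sh_name", "sh_type", "sh_flags", "sh_addr", "sh_offset",
          "sh_size", "sh_link", "sh_info", "sh_addralign", "sh_entsize")

def parse_shdr(raw):
    """Unpack one 40-byte Elf32_Shdr into a dict keyed by field name."""
    return dict(zip(FIELDS, struct.unpack(SHDR_FORMAT, raw)))

def readelf_columns(shdr):
    """Pick out the fields assumed to sit behind ES, Lk, Inf and Al."""
    return {"ES": shdr["sh_entsize"], "Lk": shdr["sh_link"],
            "Inf": shdr["sh_info"], "Al": shdr["sh_addralign"]}

# Hand-built header mirroring the .rel.text row (sh_name 0 is a placeholder,
# 9 is SHT_REL, offset 0x394, size 8, link 10, info 1, align 4, entsize 8).
raw = struct.pack(SHDR_FORMAT, 0, 9, 0, 0, 0x394, 8, 10, 1, 4, 8)
print(readelf_columns(parse_shdr(raw)))
```

For a real file you would seek to e_shoff from the ELF header and read e_shnum entries of e_shentsize bytes each; that part is left out here.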

Related

OpenCL bad get_global_id output

I am trying to implement matrix multiplication, but get_global_id returns incorrect values.
This is the host code (n, m, TILE_SIZE = 4):
int dimention = 2;
size_t global_item_size[] = {n, m};
size_t local_item_size[] = {TILE_SIZE, TILE_SIZE};
ret = clEnqueueNDRangeKernel(command_queue, kernel, dimention, NULL, global_item_size, local_item_size, 0, NULL, &perf_event);
And part of the kernel:
kernel void mul_tile(uint n, uint m, uint k, global const float *a, global const float *b, global float *c) {
size_t i = get_global_id(0);
size_t j = get_global_id(1);
printf("aa %i %i\n", i, j);
}
This code prints this:
aa 0 0
aa 1 0
aa 2 0
aa 3 0
aa 0 0
aa 1 0
aa 2 0
aa 3 0
aa 0 0
aa 1 0
aa 2 0
aa 3 0
aa 0 0
aa 1 0
aa 2 0
aa 3 0
After some time I realized that get_global_id(0) returns correct index when I call it the first time and zero when I call it the second time:
kernel void mul_tile(uint n, uint m, uint k, global const float *a, global const float *b, global float *c) {
size_t i = get_global_id(0);
size_t j = get_global_id(0);
printf("aa %i %i\n", i, j);
}
So, this kernel prints the same thing.
In some cases get_global_id(2) returns the second-dimension indices. But when I merely rename variables it starts printing zeroes.
This looks like a driver bug. I use a GeForce GT 745M, Ubuntu 20.04 and the recommended driver (nvidia-driver-440).

Is performance better to use (multiple) conditional ternary operators than an if statement in GLSL

I remember years ago I was told it was better in a GLSL shader to do
a = condition ? statementX : statementY;
over
if(condition) a = statementX;
else a = statementY;
because in the latter case, fragments that didn't satisfy the condition would stall while statementX was executed for the fragments that did, and those fragments would then wait while statementY was executed for the rest; in the former case, statementX and statementY would all be evaluated in parallel for the corresponding fragments. (I guess it's a bit more complicated with workgroups etc., but I think that's the gist of it.) In fact, even for multiple statements I used to see this:
a0 = condition ? statementX0 : statementY0;
a1 = condition ? statementX1 : statementY1;
a2 = condition ? statementX2 : statementY2;
instead of
if(condition) {
a0 = statementX0;
a1 = statementX1;
a2 = statementX2;
} else {
a0 = statementY0;
a1 = statementY1;
a2 = statementY2;
}
Is this still the case, or have architectures or compilers improved? Is this a premature optimization not worth pursuing, or is it still very relevant?
(And is it the same for different kinds of shaders: fragment, vertex, compute etc.?)
In both cases you would normally have a branch and almost surely both will lead to the same assembly.
__global__ void simpleTest(int *in, int a, int b, int *out)
{
    int value = *in;
    int p = (value != 0) ? __sinf(a) : __cosf(b);
    *out = p;
}

__global__ void simpleTest2(int *in, int a, int b, int *out)
{
    int value = *in;
    int p;
    if (value != 0)
    {
        p = __sinf(a);
    }
    else
    {
        p = __cosf(b);
    }
    *out = p;
}
Here's how SASS looks for both:
MOV R1, c[0x0][0x44]
MOV R2, c[0x0][0x140]
MOV R3, c[0x0][0x144]
LD.E R2, [R2]
MOV R5, c[0x0][0x154]
ISETP.EQ.AND P0, PT, R2, RZ, PT
@!P0 I2F.F32.S32 R0, c[0x0][0x148]
@P0 I2F.F32.S32 R4, c[0x0][0x14c]
@!P0 RRO.SINCOS R0, R0
@P0 RRO.SINCOS R4, R4
@!P0 MUFU.SIN R0, R0
@P0 MUFU.COS R0, R4
MOV R4, c[0x0][0x150]
F2I.S32.F32.TRUNC R0, R0
ST.E [R4], R0
EXIT
BRA 0x98
The @!P0 and @P0 prefixes you see are predicates. Each thread has its own predicate bit, set from the result of the comparison. As the processing unit steps through the code, that bit decides whether each predicated instruction is executed for the thread (which could also mean: whether its result is committed).
Now let's look at a case where neither form produces a branch.
__global__ void simpleTest(int *in, int a, int b, int *out)
{
    int value = *in;
    int p = (value != 0) ? a : b;
    *out = p;
}

__global__ void simpleTest2(int *in, int a, int b, int *out)
{
    int value = *in;
    int p;
    if (value != 0)
    {
        p = a;
    }
    else
    {
        p = b;
    }
    *out = p;
}
And here's how SASS looks for both:
MOV R1, c[0x0][0x44]
MOV R2, c[0x0][0x140] ; load in pointer into R2
MOV R3, c[0x0][0x144]
LD.E R2, [R2] ; deref pointer
MOV R6, c[0x0][0x14c] ; load b. a is stored at c[0x0][0x148]
MOV R4, c[0x0][0x150] ; load out pointer into R4
MOV R5, c[0x0][0x154]
ICMP.EQ R0, R6, c[0x0][0x148], R2 ; Check R2 if zero and select source based on result. Result is put into R0.
ST.E [R4], R0
EXIT
BRA 0x60
There's no branch here. You can think of the result as a linear interpolation of a and b:
int cond = (*in != 0);
*out = cond * a + (1 - cond) * b;
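As a quick sanity check of that select-as-arithmetic idea, the following Python sketch compares a ternary select with the arithmetic form, where cond is 1 when the value is nonzero and 0 otherwise. The function names are hypothetical, and this only models the semantics; it says nothing about how the GPU actually executes the instruction.

```python
def select_ternary(value, a, b):
    """Reference behaviour: (value != 0) ? a : b."""
    return a if value != 0 else b

def select_arith(value, a, b):
    """Same selection written without a conditional expression."""
    cond = int(value != 0)           # 1 if value is nonzero, else 0
    return cond * a + (1 - cond) * b

# Both forms agree for zero, positive and negative inputs.
for value in (0, 1, -3, 42):
    assert select_ternary(value, 7, 9) == select_arith(value, 7, 9)
```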

GNURadio PSK bit recovery

I have followed the wonderful GNURadio Guided Tutorial PSK Demodulation:
https://wiki.gnuradio.org/index.php/Guided_Tutorial_PSK_Demodulation
I've created a very simple DBPSK modulator.
I feed in a walking-ones pattern: the first byte I feed in is 0x01, the next bytes are 0x02, 0x04, 0x08 and so on. This is the output of hd:
00000000  00 00 ac 0e d0 f0 20 40  81 02 04 08 10 00 20 40  |...... @...... @|
00000010  81 02 04 08 10 00 20 40  81 02 04 08 10 00 20 40  |...... @...... @|
*
00015000
The first few bytes are garbage, but then you can see the pattern. Looking at the second line you see:
0x81, 0x02, 0x04, 0x08, 0x10, 0x00, 0x20, 0x40, 0x81
The walking ones pattern is there, but after 0x10 the PSK demodulator receives a 0x00, and a few bytes later it receives a 0x81. It almost seems like the timing recovery is off.
Has anyone else seen something like this?
OK, I figured it out. Below is my DBPSK modulation.
If you let this run, the BER will continue to drop. Some things to keep in mind: the PSK Mod block takes 8-bit values (or perhaps a short or int as well), grabs the bits and modulates them, and the PSK Demod block then does the same in reverse. If you save the output to a file, you will not get exactly the same bytes out; you will need to shift the bits to align them. I added the Vector Insert block to generate a preamble of sorts.
Then I wrote some Python to find my preamble:
import numpy as np

def findPreamble(preamble, x):
    # Return (found, data after the preamble) for the first byte-aligned match.
    n = len(preamble)
    for i in range(0, len(x) - n + 1):
        # Compare element-wise; a sum of differences can cancel to zero by accident.
        if np.array_equal(x[i:i + n], preamble):
            print("Found a preamble at {0}".format(i))
            return True, x[i + n:]
    return False, x

def shiftBits(x):
    # Shift the whole stream left by one bit, pulling in the MSB of the next byte.
    for i in range(0, len(x) - 1):
        a = x[i] << 1
        if x[i + 1] & 0x80:
            a = a | 1
        x[i] = a & 0xFF
    return x

with open('test.bits', 'rb') as f:
    x = f.read()
preamble = [0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08]
x = np.frombuffer(x, dtype='uint8')
x = x.astype('int')
print(x)
found = False
# Eight one-bit shifts cover every possible bit alignment, so stop after that.
for shift in range(8):
    x = shiftBits(x)
    print(x)
    found, y = findPreamble(preamble, x)
    if found:
        break
if found:
    y = y.astype('uint8')
    with open('test.bits', 'wb') as f:
        f.write(y)
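An alternative to shifting the byte array one bit at a time is to treat the capture as one long bit string and search for the preamble at every bit offset. The sketch below is not the method used above, just one possible way to do the same alignment in plain Python; find_preamble_bits and realign are hypothetical names, and MSB-first bit order is assumed.

```python
def find_preamble_bits(data, preamble):
    """Return the bit offset of the first match of preamble in data, or -1."""
    nbits = len(data) * 8
    pbits = len(preamble) * 8
    stream = int.from_bytes(data, "big")
    target = int.from_bytes(preamble, "big")
    mask = (1 << pbits) - 1
    for offset in range(nbits - pbits + 1):
        # Slide a window of pbits bits over the stream, MSB first.
        window = (stream >> (nbits - pbits - offset)) & mask
        if window == target:
            return offset
    return -1

def realign(data, bit_offset):
    """Drop bit_offset leading bits and repack the rest into whole bytes."""
    nbits = len(data) * 8 - bit_offset
    stream = int.from_bytes(data, "big") & ((1 << nbits) - 1)
    nbytes = nbits // 8
    stream >>= nbits - nbytes * 8    # discard a trailing partial byte
    return stream.to_bytes(nbytes, "big")

# Synthetic capture: one garbage byte, the preamble, a payload, delayed by 3 bits.
preamble = bytes([0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08])
payload = bytes([0x01, 0x02, 0x04])
clean = b"\xac" + preamble + payload
shifted = (int.from_bytes(clean, "big") << 5).to_bytes(len(clean) + 1, "big")

offset = find_preamble_bits(shifted, preamble)
recovered = realign(shifted, offset + len(preamble) * 8)
```

For large captures the big-integer approach copies a lot of data per shift, so a NumPy-based search over unpacked bits may be the better fit there.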

Ti C6x DSP intrinsics for optimising C code

I want to use C66x intrinsics to optimise my code.
Below is some C code that I want to optimise using DSP intrinsics.
I am new to DSP intrinsics, so I don't know which intrinsics to use for the logic below.
uint8_t const src[40] = {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40};
uint32_t width = 8;
uint32_t axay1_6 = 112345;
uint32_t axay2_6 = 123456;
uint32_t axay3_6 = 134567;
uint32_t axay4_6 = 145678;
C code:
uint8_t const *cLine = src;
uint8_t const *nLine = cLine + width;
uint32_t res = 0;
const uint32_t a1 = (*cLine++) * axay1_6;
const uint32_t a3 = (*nLine++) * axay3_6;
res = a1 + a3;
const uint32_t a2 = (*cLine) * axay2_6;
const uint32_t a4 = (*nLine) * axay4_6;
res += a2 + a4;
C66x intrinsics:
const uint8_t *Ix00, *Ix01, *Iy00,*Iy01;
uint32_t in1,in2;
uint64_t l1, l2;
__x128_t axay1_6 = _dup32_128(axay1_6); //112345 112345 112345 112345
__x128_t axay2_6 = _dup32_128(axay2_6); //123456 123456 123456 123456
__x128_t axay3_6 = _dup32_128(axay3_6); //134567 134567 134567 134567
__x128_t axay4_6 = _dup32_128(axay4_6); //145678 145678 145678 145678
Ix00 = src ;
Ix01 = Ix00 + 1 ;
Iy00 = src + width;
Iy01 = Iy00 + 1;
int64_t I_00 = _mem8_const(Ix00); //00 01 02 03 04 05 06 07
int64_t I_01 = _mem8_const(Ix01); //01 02 03 04 05 06 07 08
int64_t I_10 = _mem8_const(Iy00); //10 11 12 13 14 15 16 17
int64_t I_11 = _mem8_const(Iy01); //11 12 13 14 15 16 17 18
in1 = _loll(I_00); //00 01 02 03
l1 = _unpkbu4(in1); //00 01 02 03 (16x4)
in2 = _hill(I_00); //04 05 06 07
l2 = _unpkbu4(in2); //04 05 06 07 (16x4)
Here I want an __x128_t register holding four 32-bit values with the data "00 01 02 03",
so that I can multiply one __x128_t by another and get an __x128_t result. At present I am planning to use _qmpy32.
I am new to C66x DSP intrinsics.
Can you tell me which intrinsic is suitable for getting an __x128_t register with 32x4 values holding 00 01 02 03?
(That is, how do I convert 16-bit values to 32-bit values using DSP intrinsics?)
Use the _unpkhu2 instruction to expand the 16x4 to 32x4.
__x128_t src1_128, src2_128;
src1_128 = _llto128(_unpkhu2(_hill(l1)), _unpkhu2(_loll(l1)));
src2_128 = _llto128(_unpkhu2(_hill(l2)), _unpkhu2(_loll(l2)));
Be careful: Little-endian/Big-endian settings can make these sorts of things come out in a way you didn't expect.
Also, I wouldn't recommend naming a variable l1. In some fonts, lower-case L and the number 1 are indistinguishable.

How can I convert hex string to binary?

My problem is getting a 64-bit key from the user. For this I need to read 16 characters as a string of hexadecimal characters (0123456789ABCDEF). I got the string from the user and can reach each character with the code below, but I don't know how to convert a character to 4-bit binary.
.data
insert_into:
.word 8
Ask_Input:
.asciiz "Please Enter a Key which size is 16, and use hex characters : "
key_array:
.space 64
.text
.globl main
main:
la $a0, Ask_Input
li $v0, 4
syscall
la $a0, insert_into
la $a1, 64
li $v0, 8
syscall
la $t0, insert_into
li $t2, 0
li $t3, 0
loop_convert:
lb $t1, ($t0)
addi $t0, $t0, 1
beq $t1, 10, end_convert
# Now the character is in $t1, but
# I don't know how to convert it to 4-bit binary and store it
b loop_convert
end_convert:
li $v0, 10 # exit
syscall
I don't think masking with 0x15, as @Joelmob suggests, is the right solution, because
'A' = 0x41 → 0x41 & 0x15 = 1
'B' = 0x42 → 0x42 & 0x15 = 0
'C' = 0x43 → 0x43 & 0x15 = 1
'D' = 0x44 → 0x44 & 0x15 = 4
'E' = 0x45 → 0x45 & 0x15 = 5
'F' = 0x46 → 0x46 & 0x15 = 4
which doesn't produce any relevant binary values.
The easiest way is to subtract the range's lower limit from the character value. I'll give the idea in C; you can easily convert it to MIPS asm.
if ('0' <= ch && ch <= '9')
{
return ch - '0';
}
else if ('A' <= ch && ch <= 'F')
{
return ch - 'A' + 10;
}
else if ('a' <= ch && ch <= 'f')
{
return ch - 'a' + 10;
}
Another way to implement:
if ('0' <= ch && ch <= '9')
{
return ch & 0x0f;
}
else if (('A' <= ch && ch <= 'F') || ('a' <= ch && ch <= 'f'))
{
return (ch & 0x0f) + 9;
}
However, this can be further optimized to a single comparison per range using the technique described in the following questions:
Fastest way to determine if an integer is between two integers (inclusive) with known sets of values
Check if number is in range on 8051
Now the checks can be rewritten as below
if ((unsigned char)(ch - '0') <= ('9'-'0'))
if ((unsigned char)(ch - 'A') <= ('F'-'A'))
if ((unsigned char)(ch - 'a') <= ('f'-'a'))
Any modern compiler can do this kind of optimization; here is an example output:
hex2(unsigned char):
andi $4,$4,0x00ff # ch, ch
addiu $2,$4,-48 # tmp203, ch,
sltu $2,$2,10 # tmp204, tmp203,
bne $2,$0,$L13
nop
andi $2,$4,0xdf # tmp206, ch,
addiu $2,$2,-65 # tmp210, tmp206,
sltu $2,$2,6 # tmp211, tmp210,
beq $2,$0,$L12 #, tmp211,,
andi $4,$4,0xf # tmp212, ch,
j $31
addiu $2,$4,9 # D.2099, tmp212,
$L12:
j $31
li $2,255 # 0xff # D.2099,
$L13:
j $31
andi $2,$4,0xf # D.2099, ch,
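To tie this back to the original goal of building a 64-bit key from 16 hex characters, here is a small Python sketch of the same range-subtraction idea. The names hex_nibble and parse_key are made up for illustration, and the "& 0xFF" only mimics the unsigned char wraparound of the C checks; it is a model of the logic, not a translation of the MIPS code.

```python
def hex_nibble(ch):
    """Map one hex character to its 4-bit value, or raise ValueError."""
    c = ord(ch)
    if c > 0x7F:
        raise ValueError("not a hex digit: %r" % ch)
    # (c - lo) & 0xFF wraps negative differences to large values, so one
    # comparison per range is enough, as in the unsigned char version.
    if ((c - ord('0')) & 0xFF) <= ord('9') - ord('0'):
        return c - ord('0')
    if ((c - ord('A')) & 0xFF) <= ord('F') - ord('A'):
        return c - ord('A') + 10
    if ((c - ord('a')) & 0xFF) <= ord('f') - ord('a'):
        return c - ord('a') + 10
    raise ValueError("not a hex digit: %r" % ch)

def parse_key(text):
    """Pack 16 hex characters into one 64-bit integer, first character highest."""
    if len(text) != 16:
        raise ValueError("expected exactly 16 hex characters")
    key = 0
    for ch in text:
        key = (key << 4) | hex_nibble(ch)
    return key

print(hex(parse_key("0123456789ABCDEF")))
```

In MIPS the same accumulation would be a shift left by 4 followed by an OR per character, with a 64-bit key split across two 32-bit registers.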
Have a look at an ASCII table: you will see that the codes for the digits 0 to 9 are 0x30 to 0x39, and the codes for the capital letters A-Z are between 0x41 and 0x5A. Determine whether the character is a digit or a letter. If it is a digit you are almost done; if it is a letter, mask with 0x15 to get the four bits.
If you want to include lowercase letters, follow the same procedure with masking and check whether the character is between 0x61 and 0x7A.