I want to port our algorithm to the Hexagon DSP using HVX intrinsics, but I am unable to understand how to use them. One more question: I have used the 64-bit vector intrinsics, but when I profile the code, the plain C version takes fewer cycles than the intrinsics version. I am using the Hexagon timer APIs to measure cycles.
This is the code:
C code:
Cycles consumed: 5452
for (i = 0; i <= 128; i++) {
    value[i] = (hs_int32)(((hs_int32)(hs_int16)32767) * ((hs_int32)(hs_int16)(window[i] >> 15)))
             + (hs_int32)((((hs_int32)(hs_int16)32767) * ((hs_int32)(hs_int16)(window[i] & 0x00007fff))) >> 15);
}
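When comparing the two versions, it can help to have a scalar reference model of the expression above, so that any intrinsics version can be checked for identical results before its cycle count is trusted. The following is a minimal Python sketch, not the original code: the helper names `to_int16`, `to_int32` and `q15_scale` are hypothetical, and it assumes that `window[i]` is a signed 32-bit value, that `>>` on it is an arithmetic shift, and that the final sum wraps to 32 bits.

```python
def to_int16(v):
    """Truncate to 16 bits and reinterpret as signed, like a C (hs_int16) cast."""
    v &= 0xFFFF
    return v - 0x10000 if v & 0x8000 else v


def to_int32(v):
    """Truncate to 32 bits and reinterpret as signed, like a C (hs_int32) cast."""
    v &= 0xFFFFFFFF
    return v - 0x100000000 if v & 0x80000000 else v


def q15_scale(w, coeff=32767):
    """Mirror the C expression: coeff * hi + ((coeff * lo) >> 15)."""
    hi = to_int16(w >> 15)        # upper part of the 32-bit input
    lo = to_int16(w & 0x7FFF)     # lower 15 bits of the input
    return to_int32(to_int32(coeff * hi) + to_int32((coeff * lo) >> 15))


# One reference value per element, to compare against the DSP output.
reference = [q15_scale(w) for w in (0, 0x7FFF, 1 << 15, -32768)]
```

For example, `q15_scale(1 << 15)` gives 32767, because the high part is 1 and the low part is 0.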
Hexagon intrinsics:
Cycles consumed: 8766
for(i=0,j=0;i<=128/2;i++,j++)
{
Word64 and_op=Q6_P_and_PP(R_E_VECTOR_1[i],dummy);
shift_1[i+j]=Q6_R_asr_RI(shift_1[i+j],15);
shift_1[i+1+j]=Q6_R_asr_RI(shift_1[i+1+j],15);
Word64 first_op=Q6_P_vmpyweh_PP_sat(leak2_64,R_E_VECTOR_1[i]);
out[i]=Q6_P_vmpyweh_PP_sat(leak2_64,and_op);
shift_2[i+j]=Q6_R_asr_RI(shift_2[i+j],15);
shift_2[i+1+j]=Q6_R_asr_RI(shift_2[i+1+j],15);
out[i]=Q6_P_vaddw_PP(first_op,out[i]);
}
The C code takes fewer cycles than the Hexagon intrinsics version. Can anyone help me with this problem?
@Brain cain,
This is the disassembly of the intrinsics version:
r1:0 = memd(r30+#-48)
000000000000c400: r2 = memw(r30+#-52)
000000000000c404: r3 = memw(r30+#-20)
000000000000c408: r5:4 = memd(r2+r3<<#3)
000000000000c40c: r1:0 = vmpyweh(r1:0,r5:4):sat
000000000000c410: memd(r30+#-192) = r1:0
194 out[i]=Q6_P_vmpyweh_PP_sat(leak2_64,and_op);
000000000000c414: r1:0 = memd(r30+#-48)
000000000000c418: r5:4 = memd(r30+#-184)
000000000000c41c: r1:0 = vmpyweh(r1:0,r5:4):sat
000000000000c420: r2 = memw(r30+#-84)
000000000000c424: r3 = memw(r30+#-20)
000000000000c428: memd(r2+r3<<#3) = r1:0
195 shift_2[i+j]=Q6_R_asr_RI(shift_2[i+j],15);
000000000000c42c: r2 = memw(r30+#-148)
000000000000c430: r3 = memw(r30+#-20)
000000000000c434: r6 = memw(r30+#-24)
000000000000c438: r3 = add(r3,r6)
000000000000c43c: r6 = memw(r2+r3<<#2)
000000000000c440: r6 = asr(r6,#15)
000000000000c444: memw(r2+r3<<#2) = r6
196 shift_2[i+1+j]=Q6_R_asr_RI(shift_2[i+1+j],15);
000000000000c448: r2 = memw(r30+#-148)
000000000000c44c: r3 = memw(r30+#-20)
000000000000c450: r6 = memw(r30+#-24)
000000000000c454: r3 = add(r3,r6)
mt_cv_mec_power_spectrum_fixed_hexagon:
000000000000c458: r2 = addasl(r2,r3,#2)
000000000000c45c: r3 = memw(r2+#4)
000000000000c460: r3 = asr(r3,#15)
000000000000c464: memw(r2+#4) = r3
197 out[i]=Q6_P_vaddw_PP(first_op,out[i]);
000000000000c468: r1:0 = memd(r30+#-192)
000000000000c46c: r2 = memw(r30+#-84)
000000000000c470: r3 = memw(r30+#-20)
000000000000c474: r5:4 = memd(r2+r3<<#3)
000000000000c478: r1:0 = vaddw(r1:0,r5:4)
000000000000c47c: memd(r2+r3<<#3) = r1:0
}
I am new to DSP programming and am having a lot of trouble understanding the Hexagon DSP. Any help would be much appreciated.
I'm attempting to run the following continuous-time simulation, but I do not understand why there is an indexing error.
clear all
global Bsm Bsf Jm Jg1 Jf Gc GR Ke Kf Kg Kr Ks Kt L lf mf R g Va Tf
%Model Parameters
Bsm=0.01; Bsf=1.5; Jm=0.002; Jg1=0.001; Jf=0.0204; Gc=9.95; GR=1.7; R=4; g=9.81;
Ke=0.35; Kf=0.5; Kg=2.0; Kr=0.9; Ks=0.9; Kt=0.35; L=0.1; lf=0.35; mf=0.5; Va=1;
% Define parameters for the simulation
stepsize = 0.1;
comminterval = 0.1;
EndTime = 20;
i = 0;
% Initial conditions
u = 0;
x = [0,0]';
xdot = [0,0]';
for time = 0:stepsize:EndTime
if rem(time,comminterval)==0
i = i+1;
tout(i) = time;
xout(i,:) = x;
xdout(i,:) = xdot;
end
xdot = derivitive(x,u);
x = eulerint(xdot, stepsize, x);
end
figure(1)
clf % clear figure
plot(tout,xout(:,1),'bo-')
xlabel('time [s]')
ylabel('states')
hold on
grid on
plot(tout,xout(:,2),'ro-')
hold off
Index exceeds the number of array elements (2).
Error in derivitive (line 5)
xdot(1,1) = -R/L*x(1)-Ke/L*x(3)+Va/L;
The function 'derivitive' that I'm calling, which holds my dynamic equations, is as follows:
function xdot = derivitive(x,u);
global Bsm Bsf Jm Jg1 Jf Gc GR Ke Kf Kg Kr Ks Kt L lf mf R g Va Tf
xdot(1,1) = -(R/L)*x(1)-(Ke/L)*x(3)+(Va/L);
xdot(2,1) = x(3);
xdot(3,1) = -Bsm/Jm*x(3)+Bsm/Jm*x(5)+Kt/Jm*x(1);
xdot(4,1) = x(5);
xdot(5,1) = Bsm/Jg1*x(3)-Bsm/Jg1*x(5);
xdot(6,1) = x(7);
xdot(7,1) = -Bsf/Jf*x(7)-mf*lf*g/(2*Jf)*x(4)-mf*lf*g/(2*Jf)*x(6)+Tf/Jf;
I do not understand why there is an indexing error; any help is appreciated.
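The error message can be reproduced outside MATLAB. The sketch below is a Python illustration, not the original code: it mirrors the seven equations of `derivitive` with 0-based indices and shows that the state vector has to hold as many elements as the equations read, whereas the script initialises `x` with only two. `Tf` is never assigned in the script, so `Tf = 0.0` is an assumption, and `euler_step` is a hypothetical stand-in for `eulerint`, whose source is not shown. The step size of 0.001 is chosen only for the illustration.

```python
# Parameter values copied from the question; Tf = 0.0 is an assumption.
R, L, Ke, Kt, Va = 4.0, 0.1, 0.35, 0.35, 1.0
Bsm, Bsf, Jm, Jg1, Jf = 0.01, 1.5, 0.002, 0.001, 0.0204
mf, lf, g, Tf = 0.5, 0.35, 9.81, 0.0

N_STATES = 7  # the equations read x(1)..x(7) in MATLAB terms


def derivative(x):
    """Same equations as 'derivitive', with x[0] standing for MATLAB's x(1)."""
    return [
        -(R / L) * x[0] - (Ke / L) * x[2] + Va / L,
        x[2],
        -Bsm / Jm * x[2] + Bsm / Jm * x[4] + Kt / Jm * x[0],
        x[4],
        Bsm / Jg1 * x[2] - Bsm / Jg1 * x[4],
        x[6],
        -Bsf / Jf * x[6] - mf * lf * g / (2 * Jf) * x[3]
        - mf * lf * g / (2 * Jf) * x[5] + Tf / Jf,
    ]


def euler_step(x, xdot, h):
    """Hypothetical forward-Euler update: x + h * xdot, element by element."""
    return [xi + h * di for xi, di in zip(x, xdot)]


# A two-element state, as in the script, fails on the first equation.
try:
    derivative([0.0, 0.0])
    short_state_ok = True
except IndexError:
    short_state_ok = False

# A seven-element state matches what the equations read.
x = [0.0] * N_STATES
xdot = derivative(x)
x = euler_step(x, xdot, 0.001)
```

With seven initial values, `xdot` also has seven elements, so the `x` and `xdot` rows stored for plotting need seven columns each.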
I've been trying to write a program that renders 4D lines. The function that does this receives lines that have already been rotated, clips them at the planes z = p and w = p if needed, and then draws them to the screen.
I think I am doing at least most of this properly, but I am unsure. I don't have much experience viewing the fourth dimension, so I cannot tell what is a visual bug and what is simply how the scene should be rendered.
The function first loads a line into two variables, each is one of the two endpoints of the line. If both points are beyond clippl (the clipping plane variable) for z = clippl and w = clippl, it then applies perspective transformation to them, and subsequently renders a line on the screen correspondingly.
If certain logic is met for the points, the function goes through a process of clipping them, and then continues the same as it would outside the clipping planes.
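For comparison, the clipping step described above can be written as one small routine applied once per plane. This is a minimal Python sketch of plain linear-interpolation clipping, not the code from the question: `clip_to_plane` and `clip_zw` are hypothetical names, points are `[x, y, z, w]` lists, and a point counts as visible only when its coordinate is strictly greater than the plane value, matching the `> clippl` tests in the question.

```python
def clip_to_plane(a, b, axis, p):
    """Clip segment a-b to the half-space coord[axis] > p.

    Returns (a, b) with any endpoint behind the plane moved onto it,
    or None when the whole segment lies behind the plane.
    """
    a_in, b_in = a[axis] > p, b[axis] > p
    if a_in and b_in:
        return a, b
    if not a_in and not b_in:
        return None
    # Exactly one endpoint is behind the plane, so a[axis] != b[axis].
    t = (p - a[axis]) / (b[axis] - a[axis])
    hit = [ai + t * (bi - ai) for ai, bi in zip(a, b)]
    hit[axis] = p  # pin the clipped coordinate to avoid rounding drift
    return (a, hit) if a_in else (hit, b)


def clip_zw(a, b, p):
    """Clip against z = p first (index 2), then against w = p (index 3)."""
    seg = clip_to_plane(a, b, 2, p)
    if seg is None:
        return None
    return clip_to_plane(seg[0], seg[1], 3, p)


# Endpoint b lies behind z = 1, so it is moved onto the plane.
clipped = clip_zw([0, 0, 2, 2], [0, 0, 0, 2], 1)
```

Because every coordinate is interpolated with the same parameter `t`, the same routine serves both planes, and the endpoints do not have to be swapped between the two clips. The clipped endpoint ends up exactly on the plane, so a later perspective divide needs `p` to be greater than zero.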
The location of the camera is held in the variables Ox, Oy, Oz, Ow at the beginning of the full program.
I can't tell if I've done this properly, can anyone tell me if this works right as a 4D perspective projection from a first person camera?
EDIT: I've added points to the rendering list at the corners of the cube I'm rendering. This seems to show that there is in fact some problem with the line clipping: I am fairly certain that the points are rendering properly, yet there is not always a line showing up at each of them. Could the problem have to do with the w = p clip?
Here's the function, the program uses p5.js:
function drawPLines(P){
var lA,lB;
for(var i=0;i<P.length;i++){
lA = [P[i][0],P[i][1],P[i][2],P[i][3]];
lB = [P[i][4],P[i][5],P[i][6],P[i][7]];
//X: ( x*VS+(width*0.5)+(ox*VS) )
//Y: ( y*VS+(height*0.5)+(oy*VS) )
//x: (XV[0]*P[i][0])+(YV[0]*P[i][1])+(ZV[0]*P[i][2])+(WV[0]*P[i][3])
//y: (XV[1]*P[i][0])+(YV[1]*P[i][1])+(ZV[1]*P[i][2])+(WV[1]*P[i][3])
var x0,y0,x1,y1;
//x0 = (XV[0]*lA[0])+(YV[0]*lA[1])+(ZV[0]*lA[2])+(WV[0]*lA[3]);
//y0 = (XV[1]*lA[0])+(YV[1]*lA[1])+(ZV[1]*lA[2])+(WV[1]*lA[3]);
//new rendering pipeline
//old rendering pipeline
if(lA[2]>clippl&&lB[2]>clippl&&lA[3]>clippl&&lB[3]>clippl){
x0 = XV[0]*lA[0];
y0 = YV[1]*lA[1];
x0 = (x0/lA[3])/(lA[2]/lA[3]);
y0 = (y0/lA[3])/(lA[2]/lA[3]);
//console.log(y);
x0 = ( x0*VS+(width*0.5)+(ox*VS) );
y0 = ( y0*VS+(height*0.5)+(oy*VS) );
//x1 = (XV[0]*lB[0])+(YV[0]*lB[1])+(ZV[0]*lB[2])+(WV[0]*lB[3]);
//y1 = (XV[1]*lB[0])+(YV[1]*lB[1])+(ZV[1]*lB[2])+(WV[1]*lB[3]);
x1 = XV[0]*lB[0];
y1 = YV[1]*lB[1];
x1 = (x1/lB[3])/(lB[2]/lB[3]);
y1 = (y1/lB[3])/(lB[2]/lB[3]);
//console.log(y);
x1 = ( x1*VS+(width*0.5)+(ox*VS) );
y1 = ( y1*VS+(height*0.5)+(oy*VS) );
stroke([P[i][8],P[i][9],P[i][10],P[i][11]]);
line(x0,y0,x1,y1);
}else if((lA[2]>clippl||lA[3]>clippl||lB[2]>clippl||lB[3]>clippl)){
var V = 0;
var zV = 0;
var wV = 0;
//var oV = 0;
if(lA[2]>clippl&&lA[3]>clippl){
V++;
}else if(lA[2]>clippl&&lA[3]<=clippl){
zV++;
}else if(lA[2]<=clippl&&lA[3]>clippl){
wV++;
}/*else{
oV++;
}*/
if(lB[2]>clippl&&lB[3]>clippl){
V++;
}else if(lB[2]>clippl&&lB[3]<=clippl){
zV++;
}else if(lB[2]<=clippl&&lB[3]>clippl){
wV++;
}/*else{
oV++;
}*/
if((V==1)||(wV==1&&(V==1||zV==1))||(zV==1&&(V==1||wV==1))){
var lin = lB;
var out = lA;
if(lA[2]<=clippl){
out = lB;
lin = lA;
}
if(lin[2]<=clippl){
lin = [((((lA[0]-lB[0])*clippl)-((lA[0]-lB[0])*lB[2]))/(lA[2]-lB[2]))+lB[0],((((lA[1]-lB[1])*clippl)-((lA[1]-lB[1])*lB[2]))/(lA[2]-lB[2]))+lB[1],clippl,((((lA[3]-lB[3])*clippl)-((lA[3]-lB[3])*lB[2]))/(lA[2]-lB[2]))+lB[3]];
}
if((lA[2]-lB[2])!==0){
lA = lin;
lB = out;
}
lin = lA;
out = lB;
if(lB[3]<=clippl){
out = lA;
lin = lB;
}
if(lin[3]<=clippl){
lin = [((((lA[0]-lB[0])*clippl)-((lA[0]-lB[0])*lB[3]))/(lA[3]-lB[3]))+lB[0],((((lA[1]-lB[1])*clippl)-((lA[1]-lB[1])*lB[3]))/(lA[3]-lB[3]))+lB[1],((((lA[2]-lB[2])*clippl)-((lA[2]-lB[2])*lB[3]))/(lA[3]-lB[3]))+lB[2],clippl];
//alert(lin);
//alert(out);
}
if((lA[3]-lB[3])!==0){
lA = lin;
lB = out;
}
if(lA[2]>clippl||lB[2]>clippl||lA[3]>clippl||lB[3]>clippl){
x0 = XV[0]*lA[0];
y0 = YV[1]*lA[1];
x0 = (x0/lA[3])/(lA[2]/lA[3]);
y0 = (y0/lA[3])/(lA[2]/lA[3]);
//console.log(y);
x0 = ( x0*VS+(width*0.5)+(ox*VS) );
y0 = ( y0*VS+(height*0.5)+(oy*VS) );
//x1 = (XV[0]*lB[0])+(YV[0]*lB[1])+(ZV[0]*lB[2])+(WV[0]*lB[3]);
//y1 = (XV[1]*lB[0])+(YV[1]*lB[1])+(ZV[1]*lB[2])+(WV[1]*lB[3]);
x1 = XV[0]*lB[0];
y1 = YV[1]*lB[1];
x1 = (x1/lB[3])/(lB[2]/lB[3]);
y1 = (y1/lB[3])/(lB[2]/lB[3]);
//console.log(y);
x1 = ( x1*VS+(width*0.5)+(ox*VS) );
y1 = ( y1*VS+(height*0.5)+(oy*VS) );
stroke([P[i][8],P[i][9],P[i][10],P[i][11]]);
line(x0,y0,x1,y1);
}
}
}
}
}
You can see the full program at https://editor.p5js.org/hpestock/sketches/Yfagz4Bz3
I tried to obtain maximum likelihood estimates (MLEs) for the Vasicek model using the following functions.
I keep running into the following error and cannot find a way to solve it. Please help me. Thanks!
Error in if (!all(lower[isfixed] <= fixed[isfixed] & fixed[isfixed] <= :
missing value where TRUE/FALSE needed
Here is the background:
Likelihood function
likelihood.Vasicek<-function (theta, kappa, sigma, rt){
n <- NROW(rt)
y <- rt[2:n,] # Take rates other than r0
dt <- 1/12 # Simulated data is monthly
mu <- rt[1:(n-1),]* exp(-kappa*dt) + theta* (1- exp(-kappa*dt)) #Take prior rates for mu calculation
sd <- sqrt((sigma^2)*(1-exp(-2*kappa*dt))/(2*kappa))
pdf_yt <- dnorm(y, mu, sd, log = FALSE)
- sum(log(pdf_yt))
}
Simulating scenarios
IRModeling.Vasicek = function(r0, theta, kappa, sigma, T, N){
M <- T*12 # monthly time step
t <- 1/12 # time interval is monthly
rt = matrix(0, M+1, N) # N sets of scenarios with M months of time steps
rt[1,] <- r0 # set the initial value for each of the N scenarios
for (i in 1:N){
for (j in 1:M){
rt[j+1,i] = rt[j,i] + kappa*(theta - rt[j,i])*t + sigma*rnorm(1,mean=0,sd=1)*sqrt(t)
}
}
rt # Return the values
}
MLE
r0 = 0.03
theta = 0.03
kappa = 0.3
sigma = 0.03
T = 5 # years
N = 500
rt = IRModeling.Vasicek (r0, theta, kappa, sigma, T, N)
theta.est <- 0.04
kappa.est <- 0.5
sigma.est <- 0.02
parameters.est <- c(theta.est, kappa.est, sigma.est)
library(stats4)
bound.lower <- parameters.est*0.1
bound.upper <- parameters.est*2
est.mle<-mle(likelihood.Vasicek, start= list(theta = theta.est, kappa = kappa.est, sigma = sigma.est),
method="L-BFGS-B", lower=bound.lower, upper= bound.upper, fixed = list(rt = rt))
summary(est.mle)
Error
Error in if (!all(lower[isfixed] <= fixed[isfixed] & fixed[isfixed] <= :
missing value where TRUE/FALSE needed
I remember years ago I was told it was better in a GLSL shader to do
a = condition ? statementX : statementY;
over
if(condition) a = statementX;
else a = statementY;
because in the latter case, for every fragment which didn't satisfy the condition, execution would halt while statementX was executed for fragments which did satisfy the condition; and then execution on those fragments would wait until statementY is executed on the other fragments; while in the former case all statementX and statementY would be executed in parallel for corresponding fragments. (I guess it's a bit more complicated with Workgroups etc but that's the gist of it I think). In fact even for multiple statements I used to see this:
a0 = condition ? statementX0 : statementY0;
a1 = condition ? statementX1 : statementY1;
a2 = condition ? statementX2 : statementY2;
instead of
if(condition) {
a0 = statementX0;
a1 = statementX1;
a2 = statementX2;
} else {
a0 = statementY0;
a1 = statementY1;
a2 = statementY2;
}
Is this still the case, or have architectures or compilers improved? Is this a premature optimization not worth pursuing, or is it still very relevant?
(And is it the same for different kinds of shaders: fragment, vertex, compute, etc.?)
In both cases you would normally have a branch and almost surely both will lead to the same assembly.
__global__ void simpleTest(int *in, int a, int b, int *out)
{
    int value = *in;
    int p = (value != 0) ? __sinf(a) : __cosf(b);
    *out = p;
}

__global__ void simpleTest2(int *in, int a, int b, int *out)
{
    int value = *in;
    int p;
    if (value != 0)
    {
        p = __sinf(a);
    }
    else
    {
        p = __cosf(b);
    }
    *out = p;
}
Here's how SASS looks for both:
MOV R1, c[0x0][0x44]
MOV R2, c[0x0][0x140]
MOV R3, c[0x0][0x144]
LD.E R2, [R2]
MOV R5, c[0x0][0x154]
ISETP.EQ.AND P0, PT, R2, RZ, PT
@!P0 I2F.F32.S32 R0, c[0x0][0x148]
@P0 I2F.F32.S32 R4, c[0x0][0x14c]
@!P0 RRO.SINCOS R0, R0
@P0 RRO.SINCOS R4, R4
@!P0 MUFU.SIN R0, R0
@P0 MUFU.COS R0, R4
MOV R4, c[0x0][0x150]
F2I.S32.F32.TRUNC R0, R0
ST.E [R4], R0
EXIT
BRA 0x98
The @!P0 and @P0 prefixes you see are predicates. Each thread has its own predicate bit, set from the result of the comparison. As the processing unit steps through the code, that bit decides whether each predicated instruction is executed for that thread (which could also mean whether its result is committed).
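To make the idea of predication concrete, here is a toy Python emulation of a few threads stepping through a predicated sin/cos sequence like the one above. It is only a sketch under simplifying assumptions: `run_predicated` is a hypothetical name, each list index stands for one thread, and a predicated-off instruction is modelled simply as one that leaves the thread's register untouched. It says nothing about how real hardware schedules the work.

```python
import math


def run_predicated(values, a, b):
    """Emulate threads sharing one instruction stream with per-thread predicates."""
    # P0 is set where value == 0, mirroring the ISETP.EQ comparison.
    p0 = [v == 0 for v in values]
    r0 = [None] * len(values)

    # "@!P0 sin": every thread sees the instruction, only threads with P0 clear commit.
    for lane, pred in enumerate(p0):
        if not pred:
            r0[lane] = math.sin(a)

    # "@P0 cos": every thread sees the instruction, only threads with P0 set commit.
    for lane, pred in enumerate(p0):
        if pred:
            r0[lane] = math.cos(b)

    # Float-to-int conversion with truncation, as in F2I ... TRUNC.
    return [int(r) for r in r0]


result = run_predicated([1, 0, 5, 0], a=2, b=0)
```

Here `result` is `[0, 1, 0, 1]`: threads whose input is nonzero take the truncated `sin(2)`, and the others take `cos(0)`. No thread jumps anywhere; all of them walk the same instruction list.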
Now let's look at a case in which there is no branch at all, whichever of the two forms you write.
__global__ void simpleTest(int *in, int a, int b, int *out)
{
    int value = *in;
    int p = (value != 0) ? a : b;
    *out = p;
}

__global__ void simpleTest2(int *in, int a, int b, int *out)
{
    int value = *in;
    int p;
    if (value != 0)
    {
        p = a;
    }
    else
    {
        p = b;
    }
    *out = p;
}
And here's how SASS looks for both:
MOV R1, c[0x0][0x44]
MOV R2, c[0x0][0x140] ; load in pointer into R2
MOV R3, c[0x0][0x144]
LD.E R2, [R2] ; deref pointer
MOV R6, c[0x0][0x14c] ; load b. a is stored at c[0x0][0x148]
MOV R4, c[0x0][0x150] ; load out pointer into R4
MOV R5, c[0x0][0x154]
ICMP.EQ R0, R6, c[0x0][0x148], R2 ; Check R2 if zero and select source based on result. Result is put into R0.
ST.E [R4], R0
EXIT
BRA 0x60
There's no branch here. You can think of the result as a linear interpolation of a and b:
int cond = (*in != 0);
*out = cond * a + (1 - cond) * b;
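As a quick check of the arithmetic, the ternary `(value != 0) ? a : b` from the kernels can be compared against a weighted sum in which the condition is a 0/1 integer. This is a minimal Python sketch with hypothetical function names; it illustrates the identity only and is not a claim about what the compiler emits.

```python
def select_branch(value, a, b):
    """Reference: the ternary from the kernel, (value != 0) ? a : b."""
    return a if value != 0 else b


def select_arith(value, a, b):
    """Same result without a conditional: the condition weights a, its complement weights b."""
    cond = int(value != 0)  # 1 when value is nonzero, otherwise 0
    return cond * a + (1 - cond) * b


# The two forms agree for zero and nonzero inputs.
checks = [select_arith(v, 7, -3) == select_branch(v, 7, -3) for v in (0, 1, -5, 42)]
```

With `value` equal to 0 the weighted sum reduces to `b`, and with any nonzero `value` it reduces to `a`.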
I'm trying to draw some sprites where the alpha channel of the image is taken into account.
What is the correct set of values for the following structures to support alpha channel of textures in the fragment shader?
vk::PipelineColorBlendAttachmentState colorBlendAttachment;
colorBlendAttachment.colorWriteMask = vk::ColorComponentFlagBits::eR | vk::ColorComponentFlagBits::eG | vk::ColorComponentFlagBits::eB | vk::ColorComponentFlagBits::eA;
colorBlendAttachment.blendEnable = VK_TRUE;
colorBlendAttachment.srcColorBlendFactor = vk::BlendFactor::eOne;
colorBlendAttachment.dstColorBlendFactor = vk::BlendFactor::eZero;
colorBlendAttachment.colorBlendOp = vk::BlendOp::eAdd;
colorBlendAttachment.srcAlphaBlendFactor = vk::BlendFactor::eOne;
colorBlendAttachment.dstAlphaBlendFactor = vk::BlendFactor::eZero;
colorBlendAttachment.alphaBlendOp = vk::BlendOp::eSubtract;
vk::PipelineColorBlendStateCreateInfo colorBlending;
colorBlending.logicOpEnable = VK_FALSE;
colorBlending.logicOp = vk::LogicOp::eCopy;
colorBlending.attachmentCount = 1;
colorBlending.pAttachments = &colorBlendAttachment;
colorBlending.blendConstants[0] = 0.0f;
colorBlending.blendConstants[1] = 0.0f;
colorBlending.blendConstants[2] = 0.0f;
colorBlending.blendConstants[3] = 0.0f;
Per Ekzusy's answer, here are 2 ways:
Using the 'discard' keyword in the fragment shader.
// Read data from some texture.
vec4 color = texture(...);
// This makes the alpha channel (w component) act as a boolean.
if (color.w < 1) { discard; }
For my original question, these values will do:
vk::PipelineColorBlendAttachmentState colorBlendAttachment;
colorBlendAttachment.colorWriteMask =
vk::ColorComponentFlagBits::eR | vk::ColorComponentFlagBits::eG |
vk::ColorComponentFlagBits::eB | vk::ColorComponentFlagBits::eA;
colorBlendAttachment.blendEnable = VK_TRUE;
colorBlendAttachment.srcColorBlendFactor = vk::BlendFactor::eSrcAlpha;
colorBlendAttachment.dstColorBlendFactor = vk::BlendFactor::eOneMinusSrcAlpha;
colorBlendAttachment.colorBlendOp = vk::BlendOp::eAdd;
colorBlendAttachment.srcAlphaBlendFactor = vk::BlendFactor::eSrcAlpha;
colorBlendAttachment.dstAlphaBlendFactor = vk::BlendFactor::eOneMinusSrcAlpha;
colorBlendAttachment.alphaBlendOp = vk::BlendOp::eSubtract;
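To see what these settings compute for one pixel, the blend equations can be evaluated by hand. The sketch below is a Python illustration, not Vulkan code: `blend` and `clamp01` are hypothetical helpers, colors are `(r, g, b, a)` tuples in the range 0 to 1, and the final clamp assumes an unsigned-normalized color attachment. The add operation combines `srcFactor * src + dstFactor * dst`, and the subtract operation combines `srcFactor * src - dstFactor * dst`.

```python
def clamp01(v):
    """Clamp to [0, 1], as assumed for an unsigned-normalized attachment."""
    return max(0.0, min(1.0, v))


def blend(src, dst):
    """Apply the factors and ops from the snippet above to one pixel."""
    sa = src[3]
    # Color: eSrcAlpha * src + eOneMinusSrcAlpha * dst, combined with eAdd.
    rgb = [s * sa + d * (1.0 - sa) for s, d in zip(src[:3], dst[:3])]
    # Alpha: eSrcAlpha * src.a - eOneMinusSrcAlpha * dst.a, combined with eSubtract.
    alpha = src[3] * sa - dst[3] * (1.0 - sa)
    return tuple(clamp01(c) for c in rgb + [alpha])


# A half-transparent red sprite pixel over an opaque blue background.
half_red_over_blue = blend((1.0, 0.0, 0.0, 0.5), (0.0, 0.0, 1.0, 1.0))
```

In this example the color channels come out as an even mix of red and blue, `(0.5, 0.0, 0.5)`, while the stored alpha is clamped to 0 because of the subtract operation. That only matters if something later reads the alpha of the render target.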