How to write multiplication with the Little Man Computer?

I know that I need to do repeated addition, but I'm having trouble with the loops; I just do not understand them at all.
Here is a program that multiplies a number by 2, without a loop.
INP
STA num1
LDA num1
ADD num1
STA num1
OUT
HLT
num1 DAT
I know that I need to add a loop, but I'm just lost. Where do I put the loop? How do I construct a loop with the LMC's branch commands?
The end result of my project is a program that will multiply two numbers together, depending on what the user inputs. For example, if 4 and 5 were input, the program would carry out the equation 4 + 4 + 4 + 4 + 4 = 20. I do not know how to construct a loop to carry this out, and I've been staring blankly at the instruction set for days.

In words:
Read input into R0 and R1.
Set RESULT to 0
While R1 > 0 {
Subtract 1 from R1
Add R0 to RESULT
}
Output RESULT
In LMC assembler:
INP
STA R0
INP
STA R1
LOOP LDA R1
BRZ END
SUB ONE
STA R1
LDA RES
ADD R0
STA RES
BRA LOOP
END LDA RES
OUT
HLT
// Temporary storage
R1 DAT
R0 DAT
RES DAT
// Constants
ONE DAT 1
You can see it running here: multiplication on LMC emulator
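For readers more used to high-level languages, the loop structure above can be sketched in Python. This is only an illustration, not LMC code: the function name `multiply` is made up here, and the comments note which LMC instructions each line is assumed to correspond to.

```python
def multiply(r0, r1):
    """Multiply two non-negative integers by repeated addition."""
    res = 0                  # RES DAT (starts at 0)
    while r1 != 0:           # LOOP LDA R1 / BRZ END: leave the loop when R1 is zero
        r1 = r1 - 1          # SUB ONE / STA R1
        res = res + r0       # LDA RES / ADD R0 / STA RES
        # BRA LOOP: jump back to the test at the top of the loop
    return res               # END LDA RES / OUT
```

The key point is that BRZ plays the role of the loop test and BRA plays the role of the jump back to that test.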

I have made a program like this before. If memory serves, it should go as follows, with the same RAM addresses. This flowchart should also help explain it: http://creately.com/diagram/example/i5z9v65u1/Multiplication%20LMC
INP
STA X
INP
STA Y
LOOP LDA Y
BRZ END
LDA ANSWER
ADD X
STA ANSWER
LDA Y
SUB ONE
STA Y
BRA LOOP
END LDA ANSWER
OUT
HLT
ONE DAT 1
ANSWER DAT 0
X DAT 0
Y DAT 0
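To follow the program step by step without an emulator, a tiny interpreter can help. The sketch below is an assumption-laden toy, not a real LMC: `run_lmc` and `PROGRAM` are names invented here, it works on mnemonics rather than numeric opcodes, it supports only the instructions used above, and it ignores the 0-999 value limit of a real LMC.

```python
PROGRAM = """
INP
STA X
INP
STA Y
LOOP LDA Y
BRZ END
LDA ANSWER
ADD X
STA ANSWER
LDA Y
SUB ONE
STA Y
BRA LOOP
END LDA ANSWER
OUT
HLT
ONE DAT 1
ANSWER DAT 0
X DAT 0
Y DAT 0
"""

OPCODES = {"INP", "OUT", "LDA", "STA", "ADD", "SUB", "BRA", "BRZ", "HLT", "DAT"}

def run_lmc(source, inputs):
    """Run a small subset of LMC assembly and return the list of outputs."""
    program, labels = [], {}
    for line in source.strip().splitlines():
        parts = line.split()
        if parts[0] not in OPCODES:      # a leading non-mnemonic token is a label
            labels[parts[0]] = len(program)
            parts = parts[1:]
        program.append((parts[0], parts[1] if len(parts) > 1 else None))
    # One data cell per mailbox; DAT cells start at their given value (default 0).
    memory = [int(arg or 0) if op == "DAT" else 0 for op, arg in program]
    pending = list(inputs)
    outputs, acc, pc = [], 0, 0
    while True:
        op, arg = program[pc]
        pc += 1
        addr = labels.get(arg)           # mailbox number the operand label refers to
        if op == "INP":
            acc = pending.pop(0)
        elif op == "OUT":
            outputs.append(acc)
        elif op == "LDA":
            acc = memory[addr]
        elif op == "STA":
            memory[addr] = acc
        elif op == "ADD":
            acc += memory[addr]
        elif op == "SUB":
            acc -= memory[addr]
        elif op == "BRA":
            pc = addr                    # unconditional jump
        elif op == "BRZ":
            if acc == 0:                 # jump only when the accumulator is zero
                pc = addr
        else:                            # HLT, or execution ran into a DAT cell
            break
    return outputs
```

Calling `run_lmc(PROGRAM, [4, 5])` feeds 4 and 5 to the two INP instructions and collects whatever OUT emits.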

Is it possible to optimize these fortran loops?

Here is my problem: I have a Fortran code with several nested loops. First, I would like to know whether it is possible to optimize them (by rearranging them) in order to reduce the run time. Second, I wonder whether I could use OpenMP to speed them up.
I have seen a lot of posts about nested do loops in fortran and how to optimize them but I didn't find one example that is suited to mine. I have also searched about OpenMP for nested do loops in fortran but I'm level 0 in OpenMP and it's difficult for me to know how to use it in my case.
Here are two very similar examples of loops that I have, first:
do p=1,N
do q=1,N
do ab=1,nVV
cd = 0
do c=nO+1,N
do d=c+1,N
cd = cd + 1
A(p,q,ab) = A(p,q,ab) + (B(p,q,c,d) - B(p,q,d,c))*C(cd,ab)
end do
end do
kl = 0
do k=1,nO
do l=k+1,nO
kl = kl + 1
A(p,q,ab) = A(p,q,ab) + (B(p,q,k,l) - B(p,q,l,k))*D(kl,ab)
end do
end do
end do
do ij=1,nOO
cd = 0
do c=nO+1,N
do d=c+1,N
cd = cd + 1
E(p,q,ij) = E(p,q,ij) + (B(p,q,c,d) - B(p,q,d,c))*F(cd,ij)
end do
end do
kl = 0
do k=1,nO
do l=k+1,nO
kl = kl + 1
E(p,q,ij) = E(p,q,ij) + (B(p,q,k,l) - B(p,q,l,k))*G(kl,ij)
end do
end do
end do
end do
end do
and the other one is:
do p=1,N
do q=1,N
do ab=1,nVV
cd = 0
do c=nO+1,N
do d=nO+1,N
cd = cd + 1
A(p,q,ab) = A(p,q,ab) + B(p,q,c,d)*C(cd,ab)
end do
end do
kl = 0
do k=1,nO
do l=1,nO
kl = kl + 1
A(p,q,ab) = A(p,q,ab) + B(p,q,k,l)*D(kl,ab)
end do
end do
end do
do ij=1,nOO
cd = 0
do c=nO+1,N
do d=nO+1,N
cd = cd + 1
E(p,q,ij) = E(p,q,ij) + B(p,q,c,d)*F(cd,ij)
end do
end do
kl = 0
do k=1,nO
do l=1,nO
kl = kl + 1
E(p,q,ij) = E(p,q,ij) + B(p,q,k,l)*G(kl,ij)
end do
end do
end do
end do
end do
The very small difference between the two examples is mainly in the indices of the loops. I don't know if you need more info about the different integers in the loops but you have in general: nO < nOO < N < nVV. So I don't know if it's possible to optimize these loops and/or possibly put them in a way that will facilitate the use of OpenMP (I don't know yet if I will use OpenMP, it will depend on how much I can gain by optimizing the loops without it).
I already tried to rearrange the loops in different ways without any success (no time gain) and I also tried a little bit of OpenMP but I don't know much about it, so again no success.
From the initial comments, it appears that at least in some cases you may be using more memory than the available RAM, which means you may be hitting the swap file, with all the bad consequences for performance. To fix this, you have to either install more RAM if possible, or deeply reorganize your code so that it does not store the full B array (by far the largest one) at once (again, if possible).
Now, let's assume that you have enough RAM. As I wrote in the comments, the access pattern on the B array is far from optimal, as the inner loops correspond to the last indices of B, which can result in many cache misses (all the more given the size of B). Changing the loop order, if possible, is the way to go.
Just looking at your first example, I am focusing on the computation of the array A (the computation of the array E looks completely independent of A, so it can be processed separately):
!! test it at first without OpenMP
!!$OMP PARALLEL DO PRIVATE(cd,c,d,kl,k,l)
do ab=1,nVV
cd = 0
do c=nO+1,N
do d=c+1,N
cd = cd + 1
A(:,:,ab) = A(:,:,ab) + (B(:,:,c,d) - B(:,:,d,c))*C(cd,ab)
end do
end do
kl = 0
do k=1,nO
do l=k+1,nO
kl = kl + 1
A(:,:,ab) = A(:,:,ab) + (B(:,:,k,l) - B(:,:,l,k))*D(kl,ab)
end do
end do
end do
!!$OMP END PARALLEL DO
What I did:
moved the loops on p and q from outer to inner positions (it's not always as easy as it is here)
replaced them with array syntax (no performance gain to expect, just code that is easier to read)
Now the inner loops (abstracted by the array syntax) tackle contiguous elements in memory, which is much better for performance. The code is even ready for OpenMP multithreading on the (now) outer loop.
EDIT/Hint
Fortran stores arrays in "column-major order", that is, incrementing the first index accesses contiguous elements in memory. In C, arrays are stored in "row-major order", that is, incrementing the last index accesses contiguous elements in memory. So a general rule is to have the inner loops on the first indices (and the opposite in C).
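The rule can be checked by computing linear memory offsets by hand. The following Python sketch is only an illustration; the helper names `column_major_offset` and `row_major_offset` are made up here, and the 4x4x4x4 shape is an arbitrary stand-in for B.

```python
def column_major_offset(index, shape):
    """0-based linear offset of a 1-based multi-index in Fortran (column-major) order."""
    offset, stride = 0, 1
    for i, n in zip(index, shape):
        offset += (i - 1) * stride   # the first index has stride 1
        stride *= n
    return offset

def row_major_offset(index, shape):
    """0-based linear offset of a 0-based multi-index in C (row-major) order."""
    offset = 0
    for i, n in zip(index, shape):
        offset = offset * n + i      # the last index ends up with stride 1
    return offset

shape = (4, 4, 4, 4)
# Stepping the first index of a Fortran array moves by one element ...
first_step = column_major_offset((2, 1, 1, 1), shape) - column_major_offset((1, 1, 1, 1), shape)
# ... while stepping the last index jumps over 4*4*4 elements.
last_step = column_major_offset((1, 1, 1, 2), shape) - column_major_offset((1, 1, 1, 1), shape)
```

With loops over the last two indices of B innermost, as in the original code, every step therefore jumps far through memory.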
It would be helpful if you could describe the operations you'd like to perform using tensor notation and the Einstein summation rule. I have the feeling the code could be written much more succinctly using something like np.einsum in NumPy.
For the second block of loop nests (the ones where you iterate across a square sub-section of B as opposed to a triangle) you could try to introduce some sub-programs or primitives from which the full solution is built.
Working from the bottom up, you start with a simple sum of two matrices.
!
! a_ij := a_ij + beta * b_ij
!
pure subroutine apb(A,B,beta)
real(dp), intent(inout) :: A(:,:)
real(dp), intent(in) :: B(:,:)
real(dp), intent(in) :: beta
A = A + beta*B
end subroutine
(For the first code block in the original post, you would substitute this primitive with one that only updates the upper/lower triangle of the matrix.)
One step higher is a tensor contraction
!
! a_ij := a_ij + b_ijkl c_kl
!
pure subroutine reduce_b(A,B,C)
real(dp), intent(inout) :: A(:,:)
real(dp), intent(in) :: B(:,:,:,:)
real(dp), intent(in) :: C(:,:)
integer :: k, l
do l = 1, size(B,4)
do k = 1, size(B,3)
call apb( A, B(:,:,k,l), C(k,l) )
end do
end do
end subroutine
Note that the dimensions of C must match the last two dimensions of B. (In the original loop nest above, the storage order of C is swapped, i.e. c_lk instead of c_kl.)
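The remark about the swapped storage order can be checked with a little index bookkeeping. The Python sketch below is an illustration under the assumption of a square n x n block; the function names are hypothetical. It compares the running index cd of the original loop nest (c outer, d inner) with the position that flat element occupies after a column-major remapping to an n x n array.

```python
def original_running_index(n):
    """Map (c, d) -> cd as produced by the original loop nest (c outer, d inner), 1-based."""
    mapping, cd = {}, 0
    for c in range(1, n + 1):
        for d in range(1, n + 1):
            cd += 1
            mapping[(c, d)] = cd
    return mapping

def remapped_position(cd, n):
    """1-based (k, l) of flat element cd after column-major remapping to an n x n array."""
    k = (cd - 1) % n + 1     # first index varies fastest in column-major order
    l = (cd - 1) // n + 1
    return k, l

n = 3
mapping = original_running_index(n)
# The element the original code pairs with B(:,:,c,d) sits at position (d, c) after remapping.
swapped = all(remapped_position(mapping[(c, d)], n) == (d, c)
              for c in range(1, n + 1) for d in range(1, n + 1))
```

So when the remapped array is used with matching indices, as in reduce_b, the coefficients have to be stored in the transposed order to reproduce the original result.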
Working our way upward, we have the contractions with two different sub-blocks of B, moreover A, C, and D have an additional outer dimension:
!
! A_n := A_n + B1_cd C_cdn + B2_kl D_kln
!
! The elements of A_n are a_ijn
! The elements of B1_cd are B1_ijcd
! The elements of B2_kl are B2_ijkl
!
subroutine abcd(A,B1,C,B2,D)
real(dp), intent(inout), contiguous :: A(:,:,:)
real(dp), intent(in) :: B1(:,:,:,:)
real(dp), intent(in) :: B2(:,:,:,:)
real(dp), intent(in), contiguous, target :: C(:,:), D(:,:)
real(dp), pointer :: p_C(:,:,:) => null()
real(dp), pointer :: p_D(:,:,:) => null()
integer :: k
integer :: nc, nd
nc = size(B1,3)*size(B1,4)
nd = size(B2,3)*size(B2,4)
if (nc /= size(C,1)) then
error stop "FATAL ERROR: Dimension mismatch between B1 and C"
end if
if (nd /= size(D,1)) then
error stop "FATAL ERROR: Dimension mismatch between B2 and D"
end if
! Pointer remapping of arrays C and D to rank-3
p_C(1:size(B1,3),1:size(B1,4),1:size(C,2)) => C
p_D(1:size(B2,3),1:size(B2,4),1:size(D,2)) => D
!$omp parallel do default(private) shared(A,B1,p_C,B2,p_D)
do k = 1, size(A,3)
call reduce_b( A(:,:,k), B1, p_C(:,:,k))
call reduce_b( A(:,:,k), B2, p_D(:,:,k))
end do
!$omp end parallel do
end subroutine
Finally, we reach the main level where we select the subblocks of B
program doit
use transform, only: abcd, dp
implicit none
! n0 [2,10]
!
integer, parameter :: n0 = 6
integer, parameter :: n00 = n0*n0
integer :: N, nVV
real(dp), allocatable :: A(:,:,:), B(:,:,:,:), C(:,:), D(:,:)
! N [100,200]
!
read(*,*) N
nVV = (N - n0)**2
allocate(A(N,N,nVV))
allocate(B(N,N,N,N))
allocate(C(nVV,nVV))
allocate(D(n00,nVV))
print *, "Memory occupied (MB): ", &
real(sizeof(A) + sizeof(B) + sizeof(C) + sizeof(D),dp) / 1024._dp**2
A = 0
call random_number(B)
call random_number(C)
call random_number(D)
call abcd(A=A, &
B1=B(:,:,n0+1:N,n0+1:N), &
B2=B(:,:,1:n0,1:n0), &
C=C, &
D=D)
deallocate(A,B,C,D)
end program
Similar to the answer by PierU, parallelization is on the outermost loop. On my PC, for N = 50, this re-engineered routine is about 8 times faster when executed serially. With OpenMP on 4 threads the factor is 20. For N = 100, I got tired of waiting for the original code; the re-engineered version on 4 threads took about 3 minutes.
The full code I used for testing, configurable via environment variables (ORIG=<0|1> N=100 ./abcd), is available here: https://gist.github.com/ivan-pi/385b3ae241e517381eb5cf84f119423d
With more fine-tuning it should be possible to bring the numbers down even further. Even better performance could be sought with a specialized library like cuTENSOR (also used under the hood of Fortran intrinsics, as explained in Bringing Tensor Cores to Standard Fortran) or a tool like the Tensor Contraction Engine.
One last thing I found odd was that large parts of B are unused. The subsections B(:,:,1:n0,n0+1:N) and B(:,:,n0+1:N,1:n0) appear to be wasted space.

Repeating numbers with modulo -1 to 1 using positive and negative numbers

Repeating numbers with modulo
I know I can "wrap" / loop numbers back onto themselves like 2,3,1,2,3,1,... by using modulo.
Example code below.
a=[1:8]'
b=mod(a,3)+1
But how can I use modulo to "wrap" numbers back onto themselves from -1 to 1 (-1,-.5,0,.5,1)? Some test numbers would be a=[1.1,-2.3,.3,-.5]; they should loop around so that the values end up between -1 and 1.
I guess a visual example would be bending an x,y plane from -1 to 1 into a torus (how it loops back onto itself).
I was thinking of how a sin wave goes 0,1,0,-1 and back again but I wasn't sure how I could implement it.
PS: I'm using Octave 4.2.2
This can be accomplished by offsetting the value before taking the modulo, then reversing the offset after.
For example, if the target range is [a,b) (the half-open interval such that b is not part of the interval), then one can do:
y = mod( x - a, b - a ) + a;
For example:
a = -1;
b = 1;
x = -10:0.01:10;
y = mod( x - a, b - a ) + a;
plot(x,y)
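The same offset trick, applied to the test numbers from the question, can be sketched in Python (used here only for illustration; the function name `wrap` is made up). Python's `%` with a positive divisor returns a non-negative result, which matches the behaviour of Octave's mod that the formula relies on.

```python
def wrap(x, a, b):
    """Wrap x into the half-open interval [a, b)."""
    return (x - a) % (b - a) + a

# Test numbers from the question, wrapped into [-1, 1)
wrapped = [wrap(x, -1, 1) for x in [1.1, -2.3, 0.3, -0.5]]
```

Because the interval is half-open, the upper bound itself wraps to the lower bound: `wrap(1, -1, 1)` gives -1.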

Do both have the same meaning?

In Verilog code
case ({Q[0], Q_1})
2'b0_1 :begin
A<=sum[7]; Q<=sum; Q_1<=Q;
end
2'b1_0 : begin
A<=difference[7]; Q<=difference; Q_1<=Q;
end
default: begin
A<=A[7]; Q<=A; Q_1<=Q;
end
endcase
Is the above code the same as the code below?
case ({Q[0], Q_1})
2'b0_1 : {A, Q, Q_1} <= {sum[7], sum, Q};
2'b1_0 : {A, Q, Q_1} <= {difference[7], difference, Q};
default: {A, Q, Q_1} <= {A[7], A, Q};
endcase
If yes, then why am I getting a different result?
Edit:-A, Q, sum and difference are all 8-bit values and Q_1 is a 1-bit value.
No, these are not the same. The concatenation operator ({ ... }) allows you to create vectors from several different signals, allowing you to both use these vectors and assign to these vectors, resulting in the assignment of the component signals with the appropriate bits from the result. From your previous question (Please Explain these verilog code?), I see that A, Q, sum and difference are all 8-bit values and Q_1 is a 1-bit value. Let's examine the first assignment (noting that the others work the same way):
{A, Q, Q_1} <= {sum[7], sum, Q};
If we look at the right-hand side, we can see that the result of the concatenation is a 17-bit vector, as sum[7] is 1 bit (the MSb of sum), sum is 8 bits, and Q is 8 bits (1 + 8 + 8 = 17). Let's say sum = 8'b10100101 and Q = 8'b00110110; what would {sum[7], sum, Q} look like? Well, it's the concatenation of the values from sum and Q, so it would be 17'b1_10100101_00110110, the first bit coming from sum[7], the next 8 bits from sum and the final 8 bits from Q.
Now we have to assign this 17-bit value to the left-hand side. On the left, we have {A, Q, Q_1}, which is also 17 bits (A is 8 bits, Q is 8 bits and Q_1 is 1 bit). However, we have to assign the bits from the 17-bit value we got above to the proper signals that make up this new 17-bit vector. That means the 8 most significant bits go into A, the next 8 bits go into Q and the least significant bit goes into Q_1. So, if we take our value from above (17'b1_10100101_00110110) and split it up this way (17'b11010010_10011011_0), we see A = 8'b11010010, Q = 8'b10011011 and Q_1 = 1'b0. Thus, this is not the same as assigning A = sum[7], Q = sum and Q_1 = Q (this would result in A = 8'b00000001, Q = 8'b10100101, Q_1 = 1'b0, with many bits of Q being lost and A having 7 extra bits).
However, this doesn't mean we can't split up the left-hand side concatenation; it would just look like this:
A <= {sum[7], sum[7:1]};
Q <= {sum[0], Q[7:1]};
Q_1 <= Q[0];
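The bit shuffling in this worked example can be double-checked outside a simulator. The Python sketch below is only an illustration of the arithmetic, not Verilog semantics in general; the helpers `concat` and `split` are made-up names that pack and unpack fields MSB-first.

```python
def concat(*fields):
    """Concatenate (value, width) fields MSB-first, like Verilog's {a, b, c}."""
    result = 0
    for value, width in fields:
        result = (result << width) | (value & ((1 << width) - 1))
    return result

def split(value, widths):
    """Split a packed value into fields of the given widths, MSB-first."""
    remaining = sum(widths)
    out = []
    for width in widths:
        remaining -= width
        out.append((value >> remaining) & ((1 << width) - 1))
    return out

sum_ = 0b10100101   # example value of sum from the text
q = 0b00110110      # example value of Q from the text

# Right-hand side {sum[7], sum, Q}: 1 + 8 + 8 = 17 bits
packed = concat((sum_ >> 7, 1), (sum_, 8), (q, 8))
# Left-hand side {A, Q, Q_1}: 8 + 8 + 1 = 17 bits
a_new, q_new, q1_new = split(packed, [8, 8, 1])

# The split-up form of the same assignment
a_split = concat((sum_ >> 7, 1), (sum_ >> 1, 7))   # {sum[7], sum[7:1]}
q_split = concat((sum_ & 1, 1), (q >> 1, 7))       # {sum[0], Q[7:1]}
q1_split = q & 1                                   # Q[0]
```

Both routes give the same A, Q and Q_1, which differ from what the three separate per-signal assignments in the question would produce.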
Yes, they are the same. For example, try this small piece of code and check; the output is the same:
module test;
wire A,B,C;
reg p,q,r;
initial
begin
p=1; q=1; r=0;
end
assign {A,B,C} = {p,q,r};
initial #1 $display("%b %b %b",A,B,C);
endmodule
In general if you want to understand concatenation operator, you can refer here
Edit: I have assumed that A and p, B and q, and C and r are of the same length.

VB challenge/ help MONTE CARLO INTEGRATION

I'm trying to create a Monte Carlo simulation that can be used to derive estimates for integration problems (summing up the area under a curve). I have no idea what to do now and I am stuck.
"to solve this problem we generate a number (say n) of random number pairs for x and y between 0 and 1, for each pair we see if the point (x,y) falls above or below the line. We count the number of times this happens (say c). The area under the curve is computed as c/n"
I'm really confused. Please help, thank you.
Function MonteCarlo()
Dim a As Integer
Dim b As Integer
Dim x As Double
Dim func As Double
Dim total As Double
Dim result As Double
Dim j As Integer
Dim N As Integer
Console.WriteLine("Enter a")
a = Console.ReadLine()
Console.WriteLine("Enter b")
b = Console.ReadLine()
Console.WriteLine("Enter n")
N = Console.ReadLine()
For j = 1 To N
'Generate a new number between a and b
x = a + (b - a) * Rnd()
'Evaluate function at new number
func = (x ^ 2) + (2 * x) + 1
'Add to previous value
total = total + func
Next j
result = (total / N) * (b - a)
Console.WriteLine(result)
Console.ReadLine()
Return result
End Function
You are using the rejection method for MC area under the curve.
Do this:
1. Divide the range of x into, say, 100 equally-spaced, non-overlapping bins.
2. For your function y = f(x) = (x ^ 2) + (2 * x) + 1, generate e.g. 10,000 values of y for 10,000 values of x = a + (b - a) * Rnd().
3. Count the number of y-values in each bin, and divide by 10,000 to get a "bin probability." --> p(x).
4. Next, the proper way to randomly simulate your function is to use the rejection method, which goes as follows:
4a. Draw a random x-value using x = a + (b - a) * Rnd()
4b. Draw a random uniform U(0,1). If U(0,1) is less than p(x) add a count to the bin.
4c. Continue steps 4a-4b 10,000 times.
You will now be able to simulate your y=f(x) function using the rejection method.
Overall, you need to master these approaches before you do what you want since it sounds like you have little experience in bin counts, simulation, etc. Area under the curve is always one using this approach, so just be creative for integrating using MC.
Look at some good textbooks on MC integration.
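For comparison, the two estimators that appear in this thread can be sketched side by side in Python (used here only for illustration). `hit_or_miss` follows the counting procedure quoted in the question, and `mean_value` follows what the question's VB code computes. All names are made up, and the bound `y_max` is an assumption: the quoted recipe draws y between 0 and 1, which only works when the curve fits inside the unit square, so this sketch scales by a bounding box instead.

```python
import random

def f(x):
    """The integrand from the question."""
    return x ** 2 + 2 * x + 1

def hit_or_miss(func, a, b, y_max, n, rng):
    """Estimate the integral of func over [a, b], assuming 0 <= func(x) <= y_max there."""
    hits = 0
    for _ in range(n):
        x = a + (b - a) * rng.random()   # uniform x in [a, b)
        y = y_max * rng.random()         # uniform y in [0, y_max)
        if y <= func(x):                 # the point falls under the curve
            hits += 1
    # Fraction of hits times the area of the bounding box
    return (hits / n) * (b - a) * y_max

def mean_value(func, a, b, n, rng):
    """Estimate the integral as (b - a) times the average of func at random points."""
    total = 0.0
    for _ in range(n):
        total += func(a + (b - a) * rng.random())
    return (total / n) * (b - a)
```

On [0, 1] the exact integral of this function is 7/3, and the function never exceeds 4 there, so `hit_or_miss(f, 0, 1, 4, 100000, random.Random(1))` and `mean_value(f, 0, 1, 100000, random.Random(1))` should both land close to that value.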

Fortran: efficient matrix-vector multiplication

I have a piece of code which is a significant bottleneck:
do s = 1,ns
msum = 0.d0
do k = 1,ns
msum = msum + tm(k,s)*f(:,:,k)
end do
m(:,:,s) = msum
end do
This is a simple matrix-vector product m=tm*f (where f is length k) for every x,y.
I thought about using a BLAS routine, but I am not sure whether any of them allows multiplying along a specific dimension (k). Do any of you have any good advice?
Unfortunately, you do not mention the actual shape of f, i.e. the number of x and y values. Since you mention that this piece of code is a bottleneck, you can and should drop msum, accumulate directly into m(:,:,s), and skip the first step of your loop, e.g.
do s = 1,ns
m(:,:,s) = tm(1,s)*f(:,:,1)
do k = 2, ns
m(:,:,s) = m(:,:,s) + tm(k,s)*f(:,:,k)
end do
end do
Secondly, a more general approach
There are ns summations of nK 2D matrices f(:,:,1:nK) by means of scalar factors that are stored in tm(:,1:ns). The goal is to store these sums in m(:,:,1:ns). Why not sum up element-wise with respect to x and y to exploit contiguous memory sections in the result? You already mentioned that you can redesign such that k is the first dimension in f, i.e. f(k,:,:).
Considering only the desired outcome, you ought to have ns 2D matrices m(:,:,1:ns) that are independent of each other (the outer loop remains as it is). Let's drop this dimension for a moment. The problem then becomes:
m(:,:) = \sum_{k=1}^{ns} tm_k * f_k(:,:)
We should thus sum over k, e.g. have f(k,:,:) to determine m(:,:) as follows (note that I am adding the outer loop for s again):
nK = size(f, 1) ! the "k"s
nX = size(f, 2) ! the "x"s
nY = size(f, 3) ! the "y"s
m = 0.d0
do s = 1, ns
do ii = 1, nY
! m(:,ii,s) := f(:,:,ii)^T * tm(:,s) + m(:,ii,s); f(:,:,ii) is nK x nX, leading dimension nK
call DGEMV('T', nK, nX, &
1.d0, f(:,:,ii), nK, tm(:,s), 1, &
1.d0, m(:,ii,s), 1)
end do !ii
end do !s
See the documentation of DGEMV for more details on its usage.
Of course, the above advice of excluding the first step of the loop, to spare the initialization with zeros, may be applied as well.
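To see that a per-slice transposed matrix-vector product reproduces the loop from the question, here is a small pure-Python check. It is a sketch under assumptions, not a BLAS call: f is stored k-first as f[k][x][y], all sizes are tiny, and `gemv_t` is a made-up helper standing in for what DGEMV computes with TRANS = 'T', alpha = 1 and a zero-initialized result.

```python
import random

def direct_sum(tm, f, ns, nx, ny):
    """m[x][y][s] = sum over k of tm[k][s] * f[k][x][y], as in the question's loop."""
    return [[[sum(tm[k][s] * f[k][x][y] for k in range(ns)) for s in range(ns)]
             for y in range(ny)] for x in range(nx)]

def gemv_t(a, vec):
    """Return A^T * vec for a matrix stored as a[k][x] (rows indexed by k)."""
    nk, nx = len(a), len(a[0])
    return [sum(a[k][x] * vec[k] for k in range(nk)) for x in range(nx)]

def per_slice(tm, f, ns, nx, ny):
    """Same result, built from one transposed matrix-vector product per (y, s) pair."""
    m = [[[0.0] * ns for _ in range(ny)] for _ in range(nx)]
    for s in range(ns):
        column = [tm[k][s] for k in range(ns)]                       # tm(:,s)
        for y in range(ny):
            slab = [[f[k][x][y] for x in range(nx)] for k in range(ns)]   # f(:,:,y)
            result = gemv_t(slab, column)
            for x in range(nx):
                m[x][y][s] = result[x]
    return m

rng = random.Random(0)
ns, nx, ny = 3, 4, 5
tm = [[rng.random() for _ in range(ns)] for _ in range(ns)]
f = [[[rng.random() for _ in range(ny)] for _ in range(nx)] for _ in range(ns)]
reference = direct_sum(tm, f, ns, nx, ny)
candidate = per_slice(tm, f, ns, nx, ny)
```

Comparing `reference` and `candidate` element by element is a cheap way to validate the index order before moving to the BLAS version.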