How can I use iCE40 4K block RAM in 512x8 read mode with IceStorm?


I am trying to figure out how to use the block RAM on my iCE40HX-8K Breakout Board. I would like to access it in a 512x8 configuration, which as far as I can tell from the documentation is supported by project IceStorm, but I haven't been able to get it to work like I expected.
If I understand correctly, initializing an SB_RAM40_4K primitive with the READ_MODE parameter set to 1 should set the block up in 512x8 read mode, which uses a 9 bit read address, and reads 8 bits of data at each address.
Here is the simplest example I could think of. It sets up an SB_RAM40_4K with some pre-initialized memory and reads straight to the pins of the on-board LEDs.
hx8kboard.pcf
set_io leds[0] B5
set_io leds[1] B4
set_io leds[2] A2
set_io leds[3] A1
set_io leds[4] C5
set_io leds[5] C4
set_io leds[6] B3
set_io leds[7] C3
set_io clk J3
top.v
module top (
output [7:0] leds,
input clk
);
//reg [8:0] raddr = 8'd0;
reg [8:0] raddr = 8'd1;
SB_RAM40_4K #(
.INIT_0(256'h00000000000000000000000000000000000000000000000000000000_44_33_22_11),
.WRITE_MODE(1),
.READ_MODE(1)
) ram40_4k_512x8 (
.RDATA(leds),
.RADDR(raddr),
.RCLK(clk),
.RCLKE(1'b1),
.RE(1'b1),
.WADDR(8'b0),
.WCLK(1'b0),
.WCLKE(1'b0),
.WDATA(8'b0),
.WE(1'b0)
);
endmodule
LED output when raddr == 0
\|/ \|/
O O O O O O O O
LED output when raddr == 1
\|/ \|/ \|/ \|/
O O O O O O O O
I would think that address 1 in 512x8 mode would be the second 8 bits from RAM, which is 8'h22 or 8'b00100010. Instead I get 8'h33 or 8'b00110011. After a little experimentation, this seems to be the lower 8 bits of a 16-bit read.
I'm not sure where I went wrong. Any help understanding what's going on here would be appreciated. Thanks!

This question is not really about Yosys or Project IceStorm. The format used for the SB_RAM40_4K INIT_* parameters is the same for the IceStorm flow and the Lattice iCEcube2 flow. However, Lattice has done a very, very bad job at documenting this format. Otherwise I'd just point you to the right Lattice document. :)
You are interested in the 512x8 mode. First you need to know that in 512x8 mode only the even bits of .RDATA() and .WDATA() are used (not the 8 LSB bits, as your code suggests!).
The data in .INIT_* is stored as 16 16-bit words per parameter. The lowest 16-bit word in .INIT_0() contains the 8-bit word at addr 0 in its even bits and the 8-bit word at addr 256 in its odd bits.
The next 16-bit word in .INIT_0() contains words 1 and 257. The lowest 16 bits in .INIT_1() contain words 16 and 272, and so forth.
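As a rough illustration of the layout described above, here is a small Python sketch that packs 512 bytes into the sixteen INIT_* values. The helper names (spread_even, pack_init_params) are made up for this example, and the layout is taken only from the description in this answer; it has not been checked against hardware, so treat it as a starting point rather than a reference.

```python
def spread_even(byte):
    """Place bit i of an 8-bit value at bit position 2*i of a 16-bit word."""
    word = 0
    for i in range(8):
        word |= ((byte >> i) & 1) << (2 * i)
    return word


def pack_init_params(data):
    """Pack 512 bytes into 16 Verilog literals for INIT_0 .. INIT_F.

    Assumed layout (from the answer text): 16-bit word w holds byte w in
    its even bits and byte w + 256 in its odd bits; INIT_k holds words
    16*k .. 16*k + 15, lowest word in the least significant bits.
    """
    if len(data) != 512:
        raise ValueError("expected exactly 512 bytes")
    params = []
    for k in range(16):
        value = 0
        for j in range(16):
            w = 16 * k + j
            word = spread_even(data[w]) | (spread_even(data[w + 256]) << 1)
            value |= word << (16 * j)
        params.append(f"256'h{value:064x}")
    return params


if __name__ == "__main__":
    data = [0] * 512
    data[0] = 0xFF  # same initialization as mem[0] = 255
    for k, literal in enumerate(pack_init_params(data)):
        print(f".INIT_{k:X}({literal}),")
```

With only byte 0 set to 0xFF, the lowest 16-bit word of INIT_0 has all of its even bits set and nothing else.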
The easiest way to investigate this kind of stuff is probably to either read the SB_RAM40_4K simulation model in /usr/local/share/yosys/ice40/cells_sim.v, or simply let Yosys infer the memory and observe what yosys does. For example the following design:
module test(input clk, wen, input [8:0] addr, input [7:0] wdata, output reg [7:0] rdata);
reg [7:0] mem [0:511];
initial mem[0] = 255;
always @(posedge clk) begin
if (wen) mem[addr] <= wdata;
rdata <= mem[addr];
end
endmodule
Will produce the following output when run through yosys -p 'synth_ice40; write_verilog' test.v:
(* top = 1 *)
(* src = "test.v:1" *)
module test(clk, wen, addr, wdata, rdata);
(* src = "/usr/local/bin/../share/yosys/ice40/brams_map.v:255" *)
(* unused_bits = "0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15" *)
wire [15:0] _0_;
(* src = "test.v:1" *)
input [8:0] addr;
(* src = "test.v:1" *)
input clk;
(* src = "test.v:1" *)
output [7:0] rdata;
(* src = "test.v:1" *)
input [7:0] wdata;
(* src = "test.v:1" *)
input wen;
(* src = "/usr/local/bin/../share/yosys/ice40/brams_map.v:277|/usr/local/bin/../share/yosys/ice40/brams_map.v:35" *)
SB_RAM40_4K #(
.INIT_0(256'bxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx1x1x1x1x1x1x1x1),
.INIT_1(256'hxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx),
.INIT_2(256'hxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx),
.INIT_3(256'hxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx),
.INIT_4(256'hxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx),
.INIT_5(256'hxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx),
.INIT_6(256'hxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx),
.INIT_7(256'hxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx),
.INIT_8(256'hxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx),
.INIT_9(256'hxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx),
.INIT_A(256'hxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx),
.INIT_B(256'hxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx),
.INIT_C(256'hxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx),
.INIT_D(256'hxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx),
.INIT_E(256'hxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx),
.INIT_F(256'hxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx),
.READ_MODE(32'sd1),
.WRITE_MODE(32'sd1)
) \mem.0.0.0 (
.MASK(16'hxxxx),
.RADDR({ 2'h0, addr }),
.RCLK(clk),
.RCLKE(1'h1),
.RDATA({ _0_[15], rdata[7], _0_[13], rdata[6], _0_[11], rdata[5], _0_[9], rdata[4], _0_[7], rdata[3], _0_[5], rdata[2], _0_[3], rdata[1], _0_[1], rdata[0] }),
.RE(1'h1),
.WADDR({ 2'h0, addr }),
.WCLK(clk),
.WCLKE(wen),
.WDATA({ 1'hx, wdata[7], 1'hx, wdata[6], 1'hx, wdata[5], 1'hx, wdata[4], 1'hx, wdata[3], 1'hx, wdata[2], 1'hx, wdata[1], 1'hx, wdata[0] }),
.WE(1'h1)
);
endmodule
(Scroll all the way to the right to see the initialization pattern generated for the mem[0] = 255 initialization.)


Transpose 8x8 64-bits matrix

Targeting AVX2, what is the fastest way to transpose an 8x8 matrix containing 64-bit integers (or doubles)?
I searched through this site and found several ways of doing an 8x8 transpose, but mostly for 32-bit floats. So I'm mainly asking because I'm not sure whether the principles that made those algorithms fast readily translate to 64-bit elements, and second, AVX2 apparently only has 16 registers, so just loading all the values would take up all the registers.
One way of doing it would be to call 2x2 _MM_TRANSPOSE4_PD, but I was wondering whether this is optimal:
#define _MM_TRANSPOSE4_PD(row0,row1,row2,row3) \
{ \
__m256d tmp3, tmp2, tmp1, tmp0; \
\
tmp0 = _mm256_shuffle_pd((row0),(row1), 0x0); \
tmp2 = _mm256_shuffle_pd((row0),(row1), 0xF); \
tmp1 = _mm256_shuffle_pd((row2),(row3), 0x0); \
tmp3 = _mm256_shuffle_pd((row2),(row3), 0xF); \
\
(row0) = _mm256_permute2f128_pd(tmp0, tmp1, 0x20); \
(row1) = _mm256_permute2f128_pd(tmp2, tmp3, 0x20); \
(row2) = _mm256_permute2f128_pd(tmp0, tmp1, 0x31); \
(row3) = _mm256_permute2f128_pd(tmp2, tmp3, 0x31); \
}
Still assuming AVX2, is transposing double[8][8] and int64_t[8][8] largely the same, in principle?
PS: And just being curious, having AVX512 would change things substantially, correct?

After some thought and discussion in the comments, I think this is the most efficient version, at least when the source and destination data are in RAM. It does not require AVX2; AVX1 is enough.
The main idea: modern CPUs can do twice as many load micro-ops as stores, and on many CPUs loading data into the upper half of a vector with vinsertf128 has the same cost as a regular 16-byte load. Compared to your macro, this version no longer needs the relatively expensive (3 cycles of latency on most CPUs) vperm2f128 shuffles.
struct Matrix4x4
{
__m256d r0, r1, r2, r3;
};
inline void loadTransposed( Matrix4x4& mat, const double* rsi, size_t stride = 8 )
{
// Load top half of the matrix into low half of 4 registers
__m256d t0 = _mm256_castpd128_pd256( _mm_loadu_pd( rsi ) ); // 00, 01
__m256d t1 = _mm256_castpd128_pd256( _mm_loadu_pd( rsi + 2 ) ); // 02, 03
rsi += stride;
__m256d t2 = _mm256_castpd128_pd256( _mm_loadu_pd( rsi ) ); // 10, 11
__m256d t3 = _mm256_castpd128_pd256( _mm_loadu_pd( rsi + 2 ) ); // 12, 13
rsi += stride;
// Load bottom half of the matrix into high half of these registers
t0 = _mm256_insertf128_pd( t0, _mm_loadu_pd( rsi ), 1 ); // 00, 01, 20, 21
t1 = _mm256_insertf128_pd( t1, _mm_loadu_pd( rsi + 2 ), 1 );// 02, 03, 22, 23
rsi += stride;
t2 = _mm256_insertf128_pd( t2, _mm_loadu_pd( rsi ), 1 ); // 10, 11, 30, 31
t3 = _mm256_insertf128_pd( t3, _mm_loadu_pd( rsi + 2 ), 1 );// 12, 13, 32, 33
// Transpose 2x2 blocks in registers.
// Due to the tricky way we loaded stuff, that's enough to transpose the complete 4x4 matrix.
mat.r0 = _mm256_unpacklo_pd( t0, t2 ); // 00, 10, 20, 30
mat.r1 = _mm256_unpackhi_pd( t0, t2 ); // 01, 11, 21, 31
mat.r2 = _mm256_unpacklo_pd( t1, t3 ); // 02, 12, 22, 32
mat.r3 = _mm256_unpackhi_pd( t1, t3 ); // 03, 13, 23, 33
}
inline void store( const Matrix4x4& mat, double* rdi, size_t stride = 8 )
{
_mm256_storeu_pd( rdi, mat.r0 );
_mm256_storeu_pd( rdi + stride, mat.r1 );
_mm256_storeu_pd( rdi + stride * 2, mat.r2 );
_mm256_storeu_pd( rdi + stride * 3, mat.r3 );
}
// Transpose 8x8 matrix of double values
void transpose8x8( double* rdi, const double* rsi )
{
Matrix4x4 block;
// Top-left corner
loadTransposed( block, rsi );
store( block, rdi );
#if 1
// Using another instance of the block to support in-place transpose
Matrix4x4 block2;
loadTransposed( block, rsi + 4 ); // top right block
loadTransposed( block2, rsi + 8 * 4 ); // bottom left block
store( block2, rdi + 4 );
store( block, rdi + 8 * 4 );
#else
// Flip the #if if you can guarantee ( rsi != rdi )
// Performance is about the same, but this version uses 4 fewer vector registers,
// slightly more efficient when some registers need to be backed up / restored.
assert( rsi != rdi );
loadTransposed( block, rsi + 4 );
store( block, rdi + 8 * 4 );
loadTransposed( block, rsi + 8 * 4 );
store( block, rdi + 4 );
#endif
// Bottom-right corner
loadTransposed( block, rsi + 8 * 4 + 4 );
store( block, rdi + 8 * 4 + 4 );
}
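To make the block structure of the routine above easier to follow, here is a scalar Python model of the same idea. It is only a sketch: the function names mirror the C++ ones, the matrix is a flat list of 64 values in row-major order, and nothing here says anything about performance.

```python
def load_transposed(m, off, stride=8):
    """Read the 4x4 block starting at flat index off and return it transposed."""
    return [[m[off + r * stride + c] for r in range(4)] for c in range(4)]


def store(block, m, off, stride=8):
    """Write a 4x4 block (list of 4 rows) into m starting at flat index off."""
    for r in range(4):
        start = off + r * stride
        m[start:start + 4] = block[r]


def transpose8x8(dst, src):
    # Top-left corner: transposed in place within its own quadrant
    store(load_transposed(src, 0), dst, 0)
    # Off-diagonal quadrants swap places; both are loaded before either is
    # stored, so the routine still works when dst and src are the same list
    top_right = load_transposed(src, 4)
    bottom_left = load_transposed(src, 8 * 4)
    store(bottom_left, dst, 4)
    store(top_right, dst, 8 * 4)
    # Bottom-right corner
    store(load_transposed(src, 8 * 4 + 4), dst, 8 * 4 + 4)


if __name__ == "__main__":
    m = list(range(64))
    transpose8x8(m, m)
    for r in range(8):
        print(m[r * 8:(r + 1) * 8])
```

The point of the model is the quadrant bookkeeping: each 4x4 block is transposed on its own, and the two off-diagonal blocks additionally trade places.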
For completeness, here's a version that uses code very similar to your macro. It does half as many loads, the same number of stores, and more shuffles. I have not benchmarked it, but I would expect it to be slightly slower.
struct Matrix4x4
{
__m256d r0, r1, r2, r3;
};
inline void load( Matrix4x4& mat, const double* rsi, size_t stride = 8 )
{
mat.r0 = _mm256_loadu_pd( rsi );
mat.r1 = _mm256_loadu_pd( rsi + stride );
mat.r2 = _mm256_loadu_pd( rsi + stride * 2 );
mat.r3 = _mm256_loadu_pd( rsi + stride * 3 );
}
inline void store( const Matrix4x4& mat, double* rdi, size_t stride = 8 )
{
_mm256_storeu_pd( rdi, mat.r0 );
_mm256_storeu_pd( rdi + stride, mat.r1 );
_mm256_storeu_pd( rdi + stride * 2, mat.r2 );
_mm256_storeu_pd( rdi + stride * 3, mat.r3 );
}
inline void transpose( Matrix4x4& m4 )
{
// These unpack instructions transpose lanes within 2x2 blocks of the matrix
const __m256d t0 = _mm256_unpacklo_pd( m4.r0, m4.r1 );
const __m256d t1 = _mm256_unpacklo_pd( m4.r2, m4.r3 );
const __m256d t2 = _mm256_unpackhi_pd( m4.r0, m4.r1 );
const __m256d t3 = _mm256_unpackhi_pd( m4.r2, m4.r3 );
// Produce the transposed matrix by combining these blocks
m4.r0 = _mm256_permute2f128_pd( t0, t1, 0x20 );
m4.r1 = _mm256_permute2f128_pd( t2, t3, 0x20 );
m4.r2 = _mm256_permute2f128_pd( t0, t1, 0x31 );
m4.r3 = _mm256_permute2f128_pd( t2, t3, 0x31 );
}
// Transpose 8x8 matrix with double values
void transpose8x8( double* rdi, const double* rsi )
{
Matrix4x4 block;
// Top-left corner
load( block, rsi );
transpose( block );
store( block, rdi );
// Using another instance of the block to support in-place transpose, with very small overhead
Matrix4x4 block2;
load( block, rsi + 4 ); // top right block
load( block2, rsi + 8 * 4 ); // bottom left block
transpose( block2 );
store( block2, rdi + 4 );
transpose( block );
store( block, rdi + 8 * 4 );
// Bottom-right corner
load( block, rsi + 8 * 4 + 4 );
transpose( block );
store( block, rdi + 8 * 4 + 4 );
}

For small matrices where more than 1 row can fit in a single SIMD vector, AVX-512 has very nice 2-input lane-crossing shuffles with 32-bit or 64-bit granularity, with a vector control. (Unlike _mm512_unpacklo_pd which is basically 4 separate 128-bit shuffles.)
A 4x4 double matrix is "only" 128 bytes, two ZMM __m512d vectors, so you only need two vpermt2pd (_mm512_permutex2var_pd) to produce both output vectors: one shuffle per output vector, with both loads and stores being full width. You do need control vector constants, though.
Using 512-bit vector instructions has some downsides (clock speed and execution port throughput), but if your program can spend a lot of time in code that uses 512-bit vectors, there's probably a significant throughput gain from throwing around more data with each instruction, and having more powerful shuffles.
With 256-bit vectors, vpermt2pd ymm would probably not be useful for a 4x4, because for each __m256d output row, each of the 4 elements you want comes from a different input row. So one 2-input shuffle can't produce the output you want.
I think lane-crossing shuffles with less than 128-bit granularity aren't useful unless your matrix is small enough to fit multiple rows in one SIMD vector. See How to transpose a 16x16 matrix using SIMD instructions? for some algorithmic complexity reasoning about 32-bit elements - an 8x8 xpose of 32-bit elements with AVX1 is about the same as an 8x8 of 64-bit elements with AVX-512, where each SIMD vector holds exactly one whole row.
So there is no need for vector constants, just immediate shuffles of 128-bit chunks and unpacklo/hi.
Transposing an 8x8 with 512-bit vectors (8 doubles) would have the same problem: each output row of 8 doubles needs 1 double from each of 8 input vectors. So ultimately I think you want a similar strategy to Soonts' AVX answer, starting with _mm512_insertf64x4(v, load, 1) as the first step to get the first half of 2 input rows into one vector.
(If you care about KNL / Xeon Phi, @ZBoson's other answer on How to transpose a 16x16 matrix using SIMD instructions? shows some interesting ideas using merge-masking with 1-input shuffles like vpermpd or vpermq, instead of 2-input shuffles like vunpcklpd or vpermt2pd.)
Using wider vectors means fewer loads and stores, and maybe even fewer total shuffles because each one combines more data. But you also have more shuffling work to do, to get all 8 elements of a row into one vector, instead of just loading and storing to different places in chunks half the size of a row. It's not obvious which is better; I'll update this answer if I get around to actually writing the code.
Note that Ice Lake (first consumer CPU with AVX-512) can do 2 loads and 2 stores per clock. It has better shuffle throughput than Skylake-X for some shuffles, but not for any that are useful for this or Soonts' answer. (All of vperm2f128, vunpcklpd and vpermt2pd only run on port 5, for the ymm and zmm versions. https://uops.info/. vinsertf64x4 zmm, mem, 1 is 2 uops for the front-end, and needs a load port and a uop for p0/p5. (Not p1 because it's a 512-bit uop, and see also SIMD instructions lowering CPU frequency).)

Largest set of different byte values unique when clearing bits

I am creating a data format, which will be stored in a DS2431 1-wire EEPROM. One page will be using EPROM emulation mode (where data once written can only be modified by clearing bits). In this page I want to store a byte with an ID, which cannot be changed to another valid value (due to only allowing clearing bits).
I am considering using the set of values that have a popcount of 4 (there are 70 different values). Clearing any bits means popcount is no longer 4, so this satisfies the desired property.
But could a set of byte values be found with more than 70 different values, that satisfy the property?

No. For an 8-bit value, using four bits is optimal.
If you have your 70 values with four bits set and decide to add a value with five bits set as valid, you have to give up the five four-bit values that can be created from it by clearing one bit. Similarly, if you want a valid value with three bits set, you have to give up the five four-bit values from which it can be created.
If you could increase the number of bits, then you can increase the ratio of possible values to bits used.
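The argument above can be checked mechanically for the 8-bit case. The following Python sketch (the helper names are made up for illustration) confirms that no popcount-4 byte can be turned into another one by clearing bits, and that admitting one five-bit value would conflict with exactly five of them. It does not prove that 70 is the maximum; it only checks the two claims just made.

```python
def popcount(v):
    """Number of 1 bits in v."""
    return bin(v).count("1")


def can_become(a, b):
    """True if b can be produced from a by clearing at least one bit."""
    return a != b and (a & b) == b


# All byte values with exactly four bits set
valid = [v for v in range(256) if popcount(v) == 4]

# Pairs where one valid ID could be rewritten into another valid ID
conflicts = [(a, b) for a in valid for b in valid if can_become(a, b)]

# Four-bit values that a single five-bit value could be cleared down to
five_bit = 0b00011111
lost = [v for v in valid if can_become(five_bit, v)]

if __name__ == "__main__":
    print(len(valid), "valid IDs,", len(conflicts), "conflicts")
    print("adding", bin(five_bit), "would invalidate", [bin(v) for v in lost])
```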

Since there are only 256 possible values and 9 possible population counts (0 to 8), it is a trivial task to count the values for every population count:
#include <stdio.h>
#include <stdint.h>
int popcount( uint8_t byte )
{
int count = 0 ;
for( uint8_t b = 0x01; b != 0; b <<= 1 )
{
count = count + (((byte & b) != 0) ? 1 : 0) ;
}
return count ;
}
int main()
{
int valuecount[9] = {0} ; /* population counts 0..8 */
for( int i = 0; i < 256; i++ )
{
valuecount[popcount(i)]++ ;
}
printf( "popcount\tvalues\n") ;
for( int p = 0; p < 9; p++ )
{
printf( " %d\t\t %d\n", p, valuecount[p] ) ;
}
return 0;
}
Result:
popcount values
0 1
1 8
2 28
3 56
4 70
5 56
6 28
7 8
8 1
The optimum population count for any even word length n is always n / 2 (for odd n, rounding either down or up gives the same number of values). For 16 bits the number of values with 8 1-bits is 12870.
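The counts in the table are the binomial coefficients C(8, k), so the same result can be had without enumerating values. A minimal Python sketch, with best_popcounts being a name made up for this example:

```python
from math import comb


def best_popcounts(n):
    """Return (largest count, population counts achieving it) for n-bit words."""
    counts = [comb(n, k) for k in range(n + 1)]
    best = max(counts)
    return best, [k for k, c in enumerate(counts) if c == best]


if __name__ == "__main__":
    for n in (7, 8, 16):
        print(n, best_popcounts(n))
```

For odd word lengths two population counts tie, which is why 7 bits gives both 3 and 4.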

verilog component value passing

I am just beginning to learn verilog. I wrote two programs for that purpose. Here is my first program:
module test1(i,o);
input i;
output o;
wire[0:63] i;
wire[0:63] o;
assign o = i * 2.0;
endmodule
module test1_tb();
reg[0:63] inp;
wire[0:63] outp;
initial
begin
assign inp = 2.0;
$monitor("input=%f, output=%f",inp,outp);
end
test1 t1(inp,outp);
endmodule
This gives me the following result when I run it in ModelSim:
# input=2.000000, output=4.000000
Then I edited the above program as follows:
module test1(i1,i2,h1,w1,w2,b1);
input i1;
input i2;
input w1;
input w2;
input b1;
output h1;
wire[0:63] i1;
wire[0:63] i2;
wire[0:63] h1;
wire[0:63] w1;
wire[0:63] w2;
wire[0:63] b1;
assign h1 = 1/(1+ 2.718281828459**((-1.0)*(i1 * w1 + i2 * w2 + b1)));
endmodule
module test1_tb();
reg[0:63] i1;
reg[0:63] i2;
reg[0:63] w1;
reg[0:63] w2;
reg[0:63] b1;
wire[0:63] h1;
initial
begin
assign i1 = 0.05;
assign i2 = 0.10;
assign w1 = 0.15;
assign w2 = 0.20;
assign b1 = 0.35;
$monitor("i1=%f, i2=%f, w1=%f, w2=%f, b1=%f, h1=%f",i1,i2,w1,w2,b1,h1);
end
test1 n1(i1,i2,h1,w1,w2,b1);
endmodule
For this program I get the output:
# i1=0.000000, i2=0.000000, w1=0.000000, w2=0.000000, b1=0.000000, h1=1.000000
It seems the module does not get the initial values in the second program. But all I did was add a few more input lines to the first program and change the calculation.
Right now I don't know what's the error with this. Please help.

The type reg is not designed to implicitly handle floating point math. As such, real constants assigned to the reg variables are rounded to the nearest integer (see IEEE1800-2012 Section 5.7.2, SystemVerilog LRM; sorry, I don't have IEEE1364, the Verilog LRM, to find the reference in there).
If you simply want to do some floating point math, you can use the real type, which is the same as a double. Otherwise, if you want to do floating point math in hardware, you'll need to deal with the complexities of it yourself (or borrow from an IP core). Also note that Verilog is not a scripting language but a Hardware Description Language, so unless you are trying to design hardware, you are much better off using Python, Ruby or Perl for general-purpose stuff.
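The rounding described above is enough to reproduce the numbers printed by the second testbench. The Python sketch below models it under two assumptions that are mine, not the simulator's documentation: that real-to-integer conversion rounds to nearest with ties away from zero, and that the right-hand side of the assign is evaluated as a real before being converted for the 64-bit wire. The helper name to_vector is hypothetical.

```python
import math


def to_vector(x):
    """Model of a real assigned to a reg/wire: round to nearest, ties away from zero (assumed)."""
    return int(math.copysign(math.floor(abs(x) + 0.5), x))


# Testbench constants after being stored in 64-bit regs
i1, i2, w1, w2, b1 = (to_vector(v) for v in (0.05, 0.10, 0.15, 0.20, 0.35))

# The expression contains real constants, so it is modelled as a real here
h1_real = 1 / (1 + 2.718281828459 ** (-1.0 * (i1 * w1 + i2 * w2 + b1)))

# The real result is rounded again when it drives the wire h1
h1 = to_vector(h1_real)

if __name__ == "__main__":
    print(i1, i2, w1, w2, b1, h1_real, h1)
```

All five inputs round to 0, the sigmoid of 0 is 0.5, and 0.5 rounds to 1, which is consistent with the h1=1.000000 in the simulation log. The first program appeared to work only because 2.0 survives the rounding unchanged.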

How do I wire modules?

I have written all the code, including the modules, but I can't figure out how to wire the modules to the main program.
The ALU should be:
A (4bits) and B (4bits) as inputs, sel (3bits)
1st Module When sel = 000 => Add/ sel= 001 => Sub (A+B or A-B)
2nd Module When sel = 010 => Shift right by 1
3rd Module when sel = 011 => Multiply (A*B)
4th Module When sel = 100 => A XNOR B
5th Module When sel = 101 => Compare A==B
I also made a 6th module with a Mux6to1.
It has to be on a gate level. Can't use (+/-). This is the code I've been writing but when I simulate, I just get the result: ZZZZZZZZ. Please, any suggestions would be appreciated, thanks.
For Add/Sub 1 bit:
module Full_Adder_1bit(a,b,cin,sel,sum,cout);
input a, b, cin;
input [2:0] sel;
output sum, cout;
reg sum, cout;
reg a_in;
always @ (a or b or cin or sel)
begin
a_in = a^sel[0];
sum = a^b^cin;
cout = (a_in&b)|(a_in&cin)|(b&cin);
end
endmodule
For 4 bit ADD/SUB:
module Full_Adder_4bits (a, b, cin, sel, sum, cout);
input [3:0] a, b;
input [2:0] sel;
input cin;
output [3:0] sum;
output cout;
wire c1,c2,c3;
Full_Adder_1bit FA0(a[0],b[0],cin,sel,sum[0],c1);
Full_Adder_1bit FA1(a[1],b[1],c1,sel,sum[1],c2);
Full_Adder_1bit FA2(a[2],b[2],c2,sel,sum[2],c3);
Full_Adder_1bit FA3(a[3],b[3],c3,sel,sum[3],cout);
endmodule
For the Shifter:
module Shifter(dataIn, shiftOut);
output [3:0] shiftOut;
input [3:0] dataIn;
assign shiftOut = dataIn >> 1;
endmodule
For the XNOR:
module XNOR(a,b,rxnor);
input [3:0] a,b;
output [3:0] rxnor;
reg rxnor;
always @ (a or b)
begin
rxnor= a~^b; //XNOR
end
endmodule
For Multiplier:
module mul4 (i0,i1,prod);
input [3:0] i0, i1;
output [7:0] prod;
assign prod = i0*i1;
endmodule
For Compare:
module Compare(B,A,R);
input [3:0] A,B;
output [3:0] R;
reg R;
always @(A,B)
begin
if (A==B)
R = 4'b0001;
else if (A==B)
R = 4'b0000;
else
R = 4'b1111;
end
endmodule
For the mux (it is actually 5 to 1 even though the name says 6 to 2):
module MUX6to2(i0,i1,i2,i3,i4,sel,out);
input [4:0] i0,i1,i2,i4;
input [7:0] i3;
input [2:0] sel;
output [7:0] out;
reg [7:0] out;
always @ (i0 or i1 or i2 or i3 or i4 or sel)
begin
case (sel)
3'b000: out = i0;
3'b001: out = i0;
3'b010: out = i1;
3'b011: out = i2;
3'b100: out = i3;
3'b100: out = i4;
default: out = 8'bx;
endcase
end
endmodule
For the ALU:
module ALU(a,b,cin,sel,r,cout);
input [3:0] a, b;
input [2:0] sel;
input cin;
output [7:0] r;
output cout;
wire [3:0] add_out, shift_out, xnor_out, compare_out;
wire [7:0] mul_out;
wire cout;
MUX6to2 output_mux (Full_Adder_4bits, Shifter, XNOR, mul4, Compare, sel[2:0], r);
Full_Adder_4bits output_adder (a,b,cin,sel [2:0],add_out,cout);
Shifter output_shifter (a,shift_out);
XNOR output_XNOR (a,b,xnor_out);
mul4 output_mul4 (a,b,mul_out);
Compare output_Compare (a,b,compare_out);
endmodule

The output "r" is Hi-Z because you haven't connected anything to it, so the default value of a wire propagates to the output.
As for your design, a decoder is required, and for the arithmetic operations the operand sizes have to be taken care of.
The required bit widths are:
addition: 4 bit + 4 bit = 4 bit + carry,
subtraction: 4 bit - 4 bit = 4 bit + borrow,
shifting: decide which operand is shifted and work out the required bit width,
multiplication: 4 bit * 4 bit = 8 bit,
xnor: 4 bit xnor 4 bit = 4 bit,
compare: it is up to you whether equality is represented with 1 bit or a wider value.
In your Mux logic you have
3'b100: out = i3;
3'b100: out = i4;
Both case items here are the same, so the synthesizer only considers the first statement and ignores the second. In general, a case statement should not be coded this way, and moreover this mux itself is not necessary.
Coming to your top module ALU, the syntax is wrong; this type of connection is not permitted in Verilog HDL:
MUX6to2 output_mux (Full_Adder_4bits, Shifter, XNOR, mul4, Compare, sel[2:0], r);
When integrating all the modules, you have to be clear about what you are going to connect and about the widths, with the hardware in mind.
Considering your design:
The adder produces 4 bits + carry, but the output "r" is 8 bits wide, so the other bits should be tied to a constant value, or else they will default to X (for a reg) or Hi-Z (for a wire) in the output. The same applies to the other operations.
I have made some modifications to the design, adding a decoder, and found it to be fully functional:
A (4bits) and B (4bits) as inputs, sel (3bits)
When sel = 000 => Add/ sel= 001 => Sub (A+B or A-B)
When sel = 010 => Shift right by 1
when sel = 011 => Multiply (A*B)
When sel = 100 => A XNOR B
When sel = 101 => Compare A==B
Give the code a try: http://www.edaplayground.com/x/DaF
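When checking a simulation of such a design, it can help to have a software model of the expected 8-bit result for each sel value. The Python sketch below is one possible reading of the operation list above, not the code behind the link. The function name alu is made up, and two choices are assumptions: the carry of the addition is placed in bit 4 of r, and the subtraction wraps to 4 bits with the borrow dropped.

```python
def alu(a, b, sel):
    """Expected 8-bit result r for 4-bit inputs a and b and a 3-bit sel."""
    a &= 0xF
    b &= 0xF
    if sel == 0b000:
        return a + b            # assumption: carry ends up in bit 4 of r
    if sel == 0b001:
        return (a - b) & 0xF    # assumption: 4-bit wrap-around, borrow dropped
    if sel == 0b010:
        return a >> 1           # shift A right by 1
    if sel == 0b011:
        return a * b            # full 8-bit product
    if sel == 0b100:
        return ~(a ^ b) & 0xF   # bitwise XNOR on 4 bits, upper bits zero
    if sel == 0b101:
        return 1 if a == b else 0
    return None                 # sel values 110 and 111 are not defined


if __name__ == "__main__":
    for sel in range(6):
        print(f"sel={sel:03b} r={alu(0b1100, 0b1010, sel):08b}")
```

Every result fits in 8 bits with the unused upper bits at zero, which is the constant padding the answer asks for.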

In the module ALU, check the port list that is mapped to the multiplexer module, as emman said.
MUX6to2 output_mux (Full_Adder_4bits, Shifter, XNOR, mul4, Compare, sel[2:0], r);
In this module, the declared output r is 8 bits wide, but add_out, shift_out, xnor_out and compare_out are only 4 bits wide. So when one of these outputs is selected, the output 'r' shows 4 X's.
module ALU(a,b,cin,sel,r,cout);
input [3:0] a, b;
input [2:0] sel;
input cin;
output [7:0] r;
output cout;
wire [3:0] add_out, shift_out, xnor_out, compare_out;
wire [7:0] mul_out;
wire cout;
// MUX6to2 output_mux (Full_Adder_4bits, Shifter, XNOR, mul4, Compare, sel[2:0], r);
MUX6to2 output_mux (add_out, shift_out, xnor_out, mul_out, compare_out, sel[2:0], r);
Full_Adder_4bits output_adder (a,b,cin,sel [2:0],add_out,cout);
Shifter output_shifter (a,shift_out);
XNOR output_XNOR (a,b,xnor_out);
mul4 output_mul4 (a,b,mul_out);
Compare output_Compare (a,b,compare_out);
endmodule

Does your code compile properly? I can see a problem in this line:
MUX6to2 output_mux (Full_Adder_4bits, Shifter, XNOR, mul4, Compare, sel[2:0], r);
Full_Adder_4bits, Shifter, etc. are names of your modules, not valid signal names. What you meant is probably this:
MUX6to2 output_mux (add_out, shift_out, xnor_out, mul_out, compare_out, sel, r);

How do I set output flags for ALU in "Nand to Tetris" course?

Although I tagged this homework, it is actually for a course which I am doing on my own for free. Anyway, the course is called "From Nand to Tetris" and I'm hoping someone here has seen or taken the course so I can get some help. I am at the stage where I am building the ALU with the supplied hdl language. My problem is that I can't get my chip to compile properly. I am getting errors when I try to set the output flags for the ALU. I believe the problem is that I can't subscript any intermediate variable, since when I just try setting the flags to true or false based on some random variable (say an input flag), I do not get the errors. I know the problem is not with the chips I am trying to use since I am using all builtin chips.
Here is my ALU chip so far:
/**
* The ALU. Computes a pre-defined set of functions out = f(x,y)
* where x and y are two 16-bit inputs. The function f is selected
* by a set of 6 control bits denoted zx, nx, zy, ny, f, no.
* The ALU operation can be described using the following pseudocode:
* if zx=1 set x = 0 // 16-bit zero constant
* if nx=1 set x = !x // Bit-wise negation
* if zy=1 set y = 0 // 16-bit zero constant
* if ny=1 set y = !y // Bit-wise negation
* if f=1 set out = x + y // Integer 2's complement addition
* else set out = x & y // Bit-wise And
* if no=1 set out = !out // Bit-wise negation
*
* In addition to computing out, the ALU computes two 1-bit outputs:
* if out=0 set zr = 1 else zr = 0 // 16-bit equality comparison
* if out<0 set ng = 1 else ng = 0 // 2's complement comparison
*/
CHIP ALU {
IN // 16-bit inputs:
x[16], y[16],
// Control bits:
zx, // Zero the x input
nx, // Negate the x input
zy, // Zero the y input
ny, // Negate the y input
f, // Function code: 1 for add, 0 for and
no; // Negate the out output
OUT // 16-bit output
out[16],
// ALU output flags
zr, // 1 if out=0, 0 otherwise
ng; // 1 if out<0, 0 otherwise
PARTS:
// Zero the x input
Mux16( a=x, b=false, sel=zx, out=x2 );
// Zero the y input
Mux16( a=y, b=false, sel=zy, out=y2 );
// Negate the x input
Not16( in=x, out=notx );
Mux16( a=x, b=notx, sel=nx, out=x3 );
// Negate the y input
Not16( in=y, out=noty );
Mux16( a=y, b=noty, sel=ny, out=y3 );
// Perform f
Add16( a=x3, b=y3, out=addout );
And16( a=x3, b=y3, out=andout );
Mux16( a=andout, b=addout, sel=f, out=preout );
// Negate the output
Not16( in=preout, out=notpreout );
Mux16( a=preout, b=notpreout, sel=no, out=out );
// zr flag
Or8way( in=out[0..7], out=zr1 ); // PROBLEM SHOWS UP HERE
Or8way( in=out[8..15], out=zr2 );
Or( a=zr1, b=zr2, out=zr );
// ng flag
Not( in=out[15], out=ng );
}
So the problem shows up when I am trying to send a subscripted version of 'out' to the Or8Way chip. I've tried using a different variable than 'out', but with the same problem. Then I read that you are not able to subscript intermediate variables. I thought maybe if I sent the intermediate variable to some other chip, and that chip subscripted it, it would solve the problem, but it has the same error. Unfortunately I just can't think of a way to set the zr and ng flags without subscripting some intermediate variable, so I'm really stuck!
Just so you know, if I replace the problematic lines with the following, it will compile (but not give the right results since I'm just using some random input):
// zr flag
Not( in=zx, out=zr );
// ng flag
Not( in=zx, out=ng );
Anyone have any ideas?
Edit: Here is the appendix of the book for the course which specifies how the hdl works. Specifically look at section 5 which talks about buses and says: "An internal pin (like v above) may not be subscripted".
Edit: Here is the exact error I get: "Line 68, Can't connect gate's output pin to part". The error message is sort of confusing though, since that does not seem to be the actual problem. If I just replace "Or8way( in=out[0..7], out=zr1 );" with "Or8way( in=false, out=zr1 );" it will not generate this error, which is what led me to look in the appendix and find that the out variable, since it was derived as intermediate, could not be subscripted.

For anyone else interested, the solution the emulator supports is to use multiple outputs.
Something like:
Mux16( a=preout, b=notpreout, sel=no, out=out,out=preout2,out[15]=ng);

This is how I did the ALU:
CHIP ALU {
IN // 16-bit inputs:
x[16], y[16],
// Control bits:
zx, // Zero the x input
nx, // Negate the x input
zy, // Zero the y input
ny, // Negate the y input
f, // Function code: 1 for add, 0 for and
no; // Negate the out output
OUT // 16-bit output
out[16],
// ALU output flags
zr, // 1 if out=0, 0 otherwise
ng; // 1 if out<0, 0 otherwise
PARTS:
Mux16(a=x, b=false, sel=zx, out=M16x);
Not16(in=M16x, out=Nx);
Mux16(a=M16x, b=Nx, sel=nx, out=M16M16x);
Mux16(a=y, b=false, sel=zy, out=M16y);
Not16(in=M16y, out=Ny);
Mux16(a=M16y, b=Ny, sel=ny, out=M16M16y);
And16(a=M16M16x, b=M16M16y, out=And16);
Add16(a=M16M16x, b=M16M16y, out=Add16);
Mux16(a=And16, b=Add16, sel=f, out=F16);
Not16(in=F16, out=NF16);
Mux16(a=F16, b=NF16, sel=no, out=out, out[15]=ng, out[0..7]=zout1, out[8..15]=zout2);
Or8Way(in=zout1, out=zr1);
Or8Way(in=zout2, out=zr2);
Or(a=zr1, b=zr2, out=zr3);
Not(in=zr3, out=zr);
}

The solution as Pax suggested was to use an intermediate variable as input to another chip, such as Or16Way. Here is the code after I fixed the problem and debugged:
CHIP ALU {
IN // 16-bit inputs:
x[16], y[16],
// Control bits:
zx, // Zero the x input
nx, // Negate the x input
zy, // Zero the y input
ny, // Negate the y input
f, // Function code: 1 for add, 0 for and
no; // Negate the out output
OUT // 16-bit output
out[16],
// ALU output flags
zr, // 1 if out=0, 0 otherwise
ng; // 1 if out<0, 0 otherwise
PARTS:
// Zero the x input
Mux16( a=x, b=false, sel=zx, out=x2 );
// Zero the y input
Mux16( a=y, b=false, sel=zy, out=y2 );
// Negate the x input
Not16( in=x2, out=notx );
Mux16( a=x2, b=notx, sel=nx, out=x3 );
// Negate the y input
Not16( in=y2, out=noty );
Mux16( a=y2, b=noty, sel=ny, out=y3 );
// Perform f
Add16( a=x3, b=y3, out=addout );
And16( a=x3, b=y3, out=andout );
Mux16( a=andout, b=addout, sel=f, out=preout );
// Negate the output
Not16( in=preout, out=notpreout );
Mux16( a=preout, b=notpreout, sel=no, out=preout2 );
// zr flag
Or16Way( in=preout2, out=notzr );
Not( in=notzr, out=zr );
// ng flag
And16( a=preout2, b=true, out[15]=ng );
// Get final output
And16( a=preout2, b=preout2, out=out );
}
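The pseudocode in the header comment of the ALU can also be written as a short software model, which is handy for checking the zr and ng flags of any of the HDL versions by hand. This is a sketch in Python rather than HDL; the function name hack_alu and the use of unsigned 16-bit integers for the buses are choices made for this example.

```python
MASK = 0xFFFF  # 16-bit bus


def hack_alu(x, y, zx, nx, zy, ny, f, no):
    """Return (out, zr, ng) following the ALU pseudocode, with out as an unsigned 16-bit value."""
    if zx:
        x = 0
    if nx:
        x = ~x & MASK
    if zy:
        y = 0
    if ny:
        y = ~y & MASK
    # f selects two's complement addition (overflow discarded) or bitwise And
    out = (x + y) & MASK if f else x & y
    if no:
        out = ~out & MASK
    zr = 1 if out == 0 else 0
    ng = (out >> 15) & 1  # sign bit of the two's complement result
    return out, zr, ng


if __name__ == "__main__":
    # Control bits 101010 compute the constant 0, 111111 compute 1, 111010 compute -1
    print(hack_alu(0x1234, 0x5678, 1, 0, 1, 0, 1, 0))
    print(hack_alu(0x1234, 0x5678, 1, 1, 1, 1, 1, 1))
    print(hack_alu(0x1234, 0x5678, 1, 1, 1, 0, 1, 0))
```

Note that ng is simply bit 15 of out and zr is the negated Or of all 16 bits, which is what the Or16Way/Not and out[15] connections in the HDL implement.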

Have you tried:
// zr flag
Or8way(
in[0]=out[ 0], in[1]=out[ 1], in[2]=out[ 2], in[3]=out[ 3],
in[4]=out[ 4], in[5]=out[ 5], in[6]=out[ 6], in[7]=out[ 7],
out=zr1);
Or8way(
in[0]=out[ 8], in[1]=out[ 9], in[2]=out[10], in[3]=out[11],
in[4]=out[12], in[5]=out[13], in[6]=out[14], in[7]=out[15],
out=zr2);
Or( a=zr1, b=zr2, out=zr );
I don't know if this will work but it seems to make sense from looking at this document here.
I'd also think twice about using out as a variable name since it's confusing trying to figure out the difference between that and the keyword out (as in "out=...").
Following your edit, if you cannot subscript intermediate values, then it appears you will have to implement a separate "chip" such as IsZero16 which will take a 16-bit value as input (your intermediate out) and return one bit indicating its zero-ness that you can load into zr. Or you could make an IsZero8 chip, but you'd then have to call it in two stages as you're currently doing with Or8Way.
This seems like a valid solution since you can subscript the input values to a chip.
And, just looking at the error, this may be a different problem to the one you suggest. The phrase "Can't connect gate's output pin to part" would mean to me that you're unable to connect signals from the output parameter back into the chip's processing area. That makes sense from an electrical point of view.
You may find you have to store the output into a temporary variable and use that to set both zr and out (since once the signals have been "sent" to the chip's output pins, they may no longer be available).
Can we try:
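CHIP SetFlags16 {
IN inpval[16];
OUT zflag,nflag;
PARTS:
Or8Way(in=inpval[0..7],out=zr0);
Or8Way(in=inpval[8..15],out=zr1);
Or(a=zr0,b=zr1,out=nonzero);
// zflag is 1 only when no bit of inpval is set
Not(in=nonzero,out=zflag);
// nflag is the sign bit (bit 15) of the two's complement value
Or(a=inpval[15],b=false,out=nflag);
}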
and then, in your ALU chip, use this at the end:
// Negate the output
Not16( in=preout, out=notpreout );
Mux16( a=preout, b=notpreout, sel=no, out=tempout );
// flags
SetFlags16(inpval=tempout,zflag=zr,nflag=ng);
// Transfer tempout to out (may be a better way).
Or16(a=tempout,b=tempout,out=out);

Here's one that also uses a new chip, but it feels cleaner:
/**
* Negator16 - negates the input 16-bit value if the selection flag is lit
*/
CHIP Negator16 {
IN sel,in[16];
OUT out[16];
PARTS:
Not16(in=in, out=negateIn);
Mux16(a=in, b=negateIn, sel=sel, out=out);
}
CHIP ALU {
// IN and OUT go here...
PARTS:
//Zero x and y if needed
Mux16(a=x, b[0..15]=false, sel=zx, out=x1);
Mux16(a=y, b[0..15]=false, sel=zy, out=y1);
//Create x1 and y1 negations if needed
Negator16(in=x1, sel=nx, out=x2);
Negator16(in=y1, sel=ny, out=y2);
//Create x&y and x+y
And16(a=x2, b=y2, out=andXY);
Add16(a=x2, b=y2, out=addXY);
//Choose between And/Add according to selection
Mux16(a=andXY, b=addXY, sel=f, out=res);
// negate if needed and also set negative flag
Negator16(in=res, sel=no, out=res1, out=out, out[15]=ng);
// set zero flag (or all bits and negate)
Or16Way(in=res1, out=nzr);
Not(in=nzr, out=zr);
}
