Icarus Verilog crash while compiling dynamic memory module

This is my first post on StackOverflow.
I'm a Verilog newbie, though I have significant experience with Python, C, and C++. I am using Icarus Verilog version 10.1.1 on Windows 10, and am trying to write a dynamic memory allocator. For some reason, when this command is run:
iverilog dynmem.v dynmem_test.v -o dynmem_test.out
the following output is produced:
Assertion failed!
Program: c:\iverilog\lib\ivl\ivl.exe
File: ../verilog-10.1.1/pform.cc, Line 333
Expression: lexical_scope
This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.
What is the problem with my code, and should I submit a bug report for this?
dynmem.v:
`define WORD_SIZE 32
`ifndef DYNMEM_SIZE // size of dynamic memory in words
`define DYNMEM_SIZE 16384*8/WORD_SIZE // 16 KB
`endif
`define __DYNMEM_BIT_SIZE WORD_SIZE*DYNMEM_SIZE-1 // size of dynamic memory in bits
reg [__DYNMEM_BIT_SIZE:0] dynmem; // dynamic memory
reg [DYNMEM_SIZE:0] allocated; // bitvector telling which words are allocated
reg mutex = 0;
module dynreg(address,
ioreg,
read_en,
write_en,
realloc_en);
output reg [WORD_SIZE-1:0] address;
output reg [WORD_SIZE-1:0] ioreg;
output reg read_en; // rising edge: put word stored at location address into ioreg
output reg write_en; // rising edge: put word stored in ioreg into location address
output reg realloc_en; // rising edge: if ioreg=0, free memory, otherwise reallocate memory into buffer of size ioreg.
task malloc; // allocate dynamic memory
output reg [WORD_SIZE-1:0] size;
output reg [WORD_SIZE-1:0] start;
unsigned integer size_int = size; // convert size to integer
reg flag1 = 1;
while (mutex) begin end // wait on mutex
mutex = 1; // acquire mutex
// loop through possible starting locations
for (index=size_int-1; (index < DYNMEM_SIZE) && flag1; index=index+1) begin
// check if unallocated
reg flag2 = 1;
for (offset=0; (offset < size_int) && flag2; offset=offset+1)
if (allocated[index-offset])
flag2 = 0;
if (flag2) begin // if memory block is free
start = index;
flag1 = 0; // exit loop
end
end
// mark as allocated
for (i=0; i<size; i=i+1)
allocated[start-offset] = 1;
mutex = 0; // release mutex
endtask
task freealloc;
output reg [WORD_SIZE-1:0] size;
output reg [WORD_SIZE-1:0] start;
while (mutex) begin end // wait on mutex
mutex = 1; // acquire mutex
// deallocate locations
for (index=start; index > 0; index=index-1)
allocated[index] = 0;
mutex = 0; // release mutex
endtask
// internal registers
unsigned integer start; // start address
unsigned integer size; // block size
unsigned integer address_int; // address register converted to int
initial begin
// allocate memory
size = ioreg;
malloc(size, start);
end
always @(posedge(read_en)) begin
// read memory into ioreg
address_int = address;
ioreg[WORD_SIZE-1:0] = dynmem[8*(start+address_int)-1 -:WORD_SIZE-1];
end
always @(posedge(write_en)) begin
// write memory from ioreg
address_int = address;
dynmem[8*(start+address_int)-1 -:WORD_SIZE-1] = ioreg[WORD_SIZE-1:0];
end
always @(posedge(realloc_en)) begin
unsigned integer ioreg_int = ioreg; // convert ioreg to integer
reg [WORD_SIZE-1:0] new_start; // new start address
if (ioreg_int != 0) begin // if memory is to be reallocated, not freed
malloc(ioreg, new_start); // allocated new memory
// copy memory
for (i=0; i<size; i=i+1)
dynmem[8*(new_start+i)-1 -:WORD_SIZE-1] = dynmem[8*(start+i)-1 -:WORD_SIZE-1];
end
freealloc(size, start); // free previous memory
// update registers
size = ioreg_int;
start = new_start;
end
endmodule
dynmem_test.v:
module testmodule();
$monitor ("%g ioreg1=%b ioreg2=%b",
$time, ioreg1, ioreg2);
reg [WORD_SIZE-1:0] address1, address2;
reg [WORD_SIZE-1:0] ioreg1=5, ioreg2=10;
reg read_en1, read_en2;
reg write_en1, write_en2;
reg realloc_en1, realloc_en2;
#1 dynreg dr1(address1, ioreg1, read_en1, write_en1, realloc_en1);
#1 dynreg dr2(address2, ioreg2, read_en2, write_en2, realloc_en2);
address1 = 0;
ioreg1 = 23;
#1 write_en1 = 1;
write_en1 = 0;
address1 = 2;
ioreg1 = 53;
#1 write_en1 = 1;
write_en1 = 0;
address1 = 0;
#1 read_en1 = 1;
read_en1 = 0;
address1 = 2;
#1 read_en1 = 1;
read_en1 = 0;
#1 $finish;
endmodule
UPDATE: C:\iverilog\lib\verilog-10.1.1 doesn't exist, and, in fact, I searched in C:\iverilog for pform.cc and found no results. Strange.

#1 dynreg dr1(address1, ioreg1, read_en1, write_en1, realloc_en1);
Using a delay (#1) on an instance declaration is probably confusing Icarus as much as it's confusing me. (What exactly is supposed to get delayed? Does the instance not exist for one simulation step?)
Remove those delays, and put all of the code in your testbench following those two instance declarations into an initial block.
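For example, the testbench could be restructured roughly like this (an untested sketch using your signal names, with the widths hard-coded to 32 because the `WORD_SIZE macro isn't visible here; it only addresses the delay/initial-block issue, not the fact that every port of dynreg is declared as an output):
module testmodule();
reg [31:0] address1, address2;
reg [31:0] ioreg1 = 5, ioreg2 = 10;
reg read_en1, write_en1, realloc_en1;
reg read_en2, write_en2, realloc_en2;
// instances are declared once, with no delay
dynreg dr1(address1, ioreg1, read_en1, write_en1, realloc_en1);
dynreg dr2(address2, ioreg2, read_en2, write_en2, realloc_en2);
initial begin
$monitor("%g ioreg1=%b ioreg2=%b", $time, ioreg1, ioreg2);
// all procedural stimulus lives inside this initial block
read_en1 = 0; write_en1 = 0; realloc_en1 = 0;
address1 = 0; ioreg1 = 23;
#1 write_en1 = 1;
#1 write_en1 = 0;
address1 = 2; ioreg1 = 53;
#1 write_en1 = 1;
#1 write_en1 = 0;
address1 = 0;
#1 read_en1 = 1;
#1 read_en1 = 0;
#1 $finish;
end
endmodule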
For what it's worth, dynreg is probably not synthesizable as written. It has no clock input, and it contains several loops which cannot be unrolled in hardware.
UPDATE: C:\iverilog\lib\verilog-10.1.1 doesn't exist, and, in fact, I searched in C:\iverilog for pform.cc and found no results. Strange.
This path is probably referring to the location of the code on the developer's computer where your copy of Icarus was compiled. Unless you plan on trying to fix the bug that caused this crash yourself, you can safely ignore this.

Related

Thread Sanitizer - false positive?

Thread 1 has code like this...
pthread_mutex_lock(&mtx);
...
buffers[cnt] = malloc(...); // cnt is global, volatile int
buffers[cnt]->field1 = ...;
buffers[cnt]->field2 = ...;
...
buffers[cnt]->fieldN = ...;
_mm_sfence(); // let's flush things for good measure
cnt++;
pthread_mutex_unlock(&mtx);
And thread 2:
// No sync
for(int i=0; i<cnt; i++) {
SomeType *p = buffers[i];
// use p->fieldX
}
In a Thread Sanitizer-built binary, I get a warning about the read of buffers[i] against the write (the result of the malloc).
It is true that the code in thread 2 doesn't use the mutex - but I am thinking that cnt will only be incremented when the new "slot" in buffers is all set; so I would expect that the for loop in thread 2 would only ever manage to read pointers that point to valid data. cnt is an integer; in 64-bit x86 you can't have a "half-updated" cnt; it either has the old value, or the new one, by which time that index points to valid data.
Simply put, I don't think there's a race - yet the thread sanitizer reports one.
Am I wrong?

Creating threads with pthread_create() doesn't work on my Linux

I have this piece of C/C++ code:
void * myThreadFun(void *vargp)
{
int start = atoi((char*)vargp) % nFracK;
printf("Thread start = %d, dQ = %d\n", start, dQ);
pthread_mutex_lock(&nItermutex);
nIter++;
pthread_mutex_unlock(&nItermutex);
}
void Opt() {
pthread_t thread[200];
char start[100];
for(int i = 0; i < 10; i++) {
sprintf(start, "%d", i);
int ret = pthread_create (&thread[i], NULL, myThreadFun, (void*) start);
printf("ret = %d on thread %d\n", ret, i);
}
for(int i = 0; i < 10; i++)
pthread_join(thread[i], NULL);
}
But it should create 10 threads. I don't understand why, instead, it creates n < 10 threads.
The ret value is always 0 (for 10 times).
But it should create 10 threads. I don't understand why, instead, it creates n < 10 threads. The ret value is always 0 (for 10 times).
Your program contains at least one data race, therefore its behavior is undefined.
The provided source is also incomplete, so it's impossible to be sure that I can test the same thing you are testing. Nevertheless, I performed the minimum augmentation needed for g++ to compile it without warnings, and tested that:
#include <cstdlib>
#include <cstdio>
#include <pthread.h>
pthread_mutex_t nItermutex = PTHREAD_MUTEX_INITIALIZER;
const int nFracK = 100;
const int dQ = 4;
int nIter = 0;
void * myThreadFun(void *vargp)
{
int start = atoi((char*)vargp) % nFracK;
printf("Thread start = %d, dQ = %d\n", start, dQ);
pthread_mutex_lock(&nItermutex);
nIter++;
pthread_mutex_unlock(&nItermutex);
return NULL;
}
void Opt() {
pthread_t thread[200];
char start[100];
for(int i = 0; i < 10; i++) {
sprintf(start, "%d", i);
int ret = pthread_create (&thread[i], NULL, myThreadFun, (void*) start);
printf("ret = %d on thread %d\n", ret, i);
}
for(int i = 0; i < 10; i++)
pthread_join(thread[i], NULL);
}
int main(void) {
Opt();
return 0;
}
The fact that its behavior is undefined notwithstanding, when I run this program on my Linux machine, it invariably prints exactly ten "Thread start" lines, albeit not all with distinct numbers. The most plausible conclusion is that the program indeed does start ten (additional) threads, which is consistent with the fact that the output also seems to indicate that each call to pthread_create() indicates success by returning 0. I therefore reject your assertion that fewer than ten threads are actually started.
Presumably, the followup question would be why the program does not print the expected output, and here we return to the data race and accompanying undefined behavior. The main thread writes a text representation of iteration variable i into the local array start of function Opt, and passes a pointer to that same array to each call to pthread_create(). When it then cycles back to do it again, there is a race between the newly created thread trying to read back the data and the main thread overwriting the array's contents with new data. I suppose that your idea was to avoid passing &i, but this is neither better nor fundamentally different.
You have several options for avoiding a data race in such a situation, prominent among them being:
initialize each thread indirectly from a different object, for example:
int start[10];
for(int i = 0; i < 10; i++) {
start[i] = i;
int ret = pthread_create(&thread[i], NULL, myThreadFun, &start[i]);
}
Note there that each thread is passed a pointer to a different array element, which the main thread does not subsequently modify.
initialize each thread directly from the value passed to it. This is not always a viable alternative, but it is possible in this case:
for(int i = 0; i < 10; i++) {
int ret = pthread_create(&thread[i], NULL, myThreadFun,
reinterpret_cast<void *>(static_cast<std::intptr_t>(i)));
}
accompanied by corresponding code in the thread function:
int start = reinterpret_cast<std::intptr_t>(vargp) % nFracK;
This is a fairly common idiom, though more often used when writing in the native language of pthreads, C, where it's less verbose.
Use a mutex, semaphore, or other synchronization object to prevent the main thread from modifying the array before the child has read it. (Left as an exercise.)
Any of those options can be used to write a program that produces the expected output, with each thread responsible for printing one line. That supposes, of course, that the expected output does not require the threads' lines to appear in the same relative order in which the threads were started. If you want that, then only the option of synchronizing the parent and child threads will achieve it.

Debug data/NEON performance hazards in ARM NEON code

Originally the problem appeared when I tried to optimize an algorithm for ARM NEON and some minor part of it was taking 80% of the time according to the profiler. To see what could be done to improve it, I created an array of function pointers to different versions of my optimized function and then ran them in a loop to see in the profiler which one performs better:
typedef unsigned(*CalcMaxFunc)(const uint16_t a[8][4], const uint16_t b[4][4]);
CalcMaxFunc CalcMaxFuncs[] =
{
CalcMaxFunc_NEON_0,
CalcMaxFunc_NEON_1,
CalcMaxFunc_NEON_2,
CalcMaxFunc_NEON_3,
CalcMaxFunc_C_0
};
int N = sizeof(CalcMaxFuncs) / sizeof(CalcMaxFuncs[0]);
for (int i = 0; i < 10 * N; ++i)
{
auto f = CalcMaxFuncs[i % N];
unsigned retI = f(a, b);
// just random code to ensure that cpu waits for the results
// and compiler doesn't optimize it away
if (retI > 1000000)
break;
ret |= retI;
}
I got surprising results: the performance of a function depended entirely on its position within the CalcMaxFuncs array. For example, when I moved CalcMaxFunc_NEON_3 to be first, it was 3-4 times slower, and according to the profiler it stalled at the last part of the function, where it tries to move data from a NEON register to an ARM register.
So, what makes it stall sometimes and not other times? By the way, I profile on an iPhone 6 in Xcode, if that matters.
When I intentionally introduced NEON pipeline stalls by mixing in some floating-point division between the calls to these functions in the loop, the unreliable behavior went away: now all of them perform the same regardless of the order in which they are called. So why did I have that problem in the first place, and what can I do to eliminate it in actual code?
Update:
I tried to create a simple test function and then optimize it in stages to see how I could avoid NEON-to-ARM stalls.
Here's the test runner function:
void NeonStallTest()
{
int findMinErr(uint8_t* var1, uint8_t* var2, int size);
srand(0);
uint8_t var1[1280];
uint8_t var2[1280];
for (int i = 0; i < sizeof(var1); ++i)
{
var1[i] = rand();
var2[i] = rand();
}
#if 0 // early exit?
for (int i = 0; i < 16; ++i)
var1[i] = var2[i];
#endif
int ret = 0;
for (int i=0; i<10000000; ++i)
ret += findMinErr(var1, var2, sizeof(var1));
exit(ret);
}
And findMinErr is this:
int findMinErr(uint8_t* var1, uint8_t* var2, int size)
{
int ret = 0;
int ret_err = INT_MAX;
for (int i = 0; i < size / 16; ++i, var1 += 16, var2 += 16)
{
int err = 0;
for (int j = 0; j < 16; ++j)
{
int x = var1[j] - var2[j];
err += x * x;
}
if (ret_err > err)
{
ret_err = err;
ret = i;
}
}
return ret;
}
Basically it computes the sum of squared differences between each uint8_t[16] block and returns the index of the block pair that has the lowest squared difference. I then rewrote it with NEON intrinsics (no particular attempt was made to make it fast, as that's not the point):
int findMinErr_NEON(uint8_t* var1, uint8_t* var2, int size)
{
int ret = 0;
int ret_err = INT_MAX;
for (int i = 0; i < size / 16; ++i, var1 += 16, var2 += 16)
{
int err;
uint8x8_t var1_0 = vld1_u8(var1 + 0);
uint8x8_t var1_1 = vld1_u8(var1 + 8);
uint8x8_t var2_0 = vld1_u8(var2 + 0);
uint8x8_t var2_1 = vld1_u8(var2 + 8);
int16x8_t s0 = vreinterpretq_s16_u16(vsubl_u8(var1_0, var2_0));
int16x8_t s1 = vreinterpretq_s16_u16(vsubl_u8(var1_1, var2_1));
uint16x8_t u0 = vreinterpretq_u16_s16(vmulq_s16(s0, s0));
uint16x8_t u1 = vreinterpretq_u16_s16(vmulq_s16(s1, s1));
#ifdef __aarch64__1
err = vaddlvq_u16(u0) + vaddlvq_u16(u1);
#else
uint32x4_t err0 = vpaddlq_u16(u0);
uint32x4_t err1 = vpaddlq_u16(u1);
err0 = vaddq_u32(err0, err1);
uint32x2_t err00 = vpadd_u32(vget_low_u32(err0), vget_high_u32(err0));
err00 = vpadd_u32(err00, err00);
err = vget_lane_u32(err00, 0);
#endif
if (ret_err > err)
{
ret_err = err;
ret = i;
#if 0 // enable early exit?
if (ret_err == 0)
break;
#endif
}
}
return ret;
}
Now, if (ret_err > err) is clearly a data hazard. I then manually "unrolled" the loop by two and modified the code to use err0 and err1, checking them only after performing the next round of computation. According to the profiler this gave some improvement: in the simple NEON loop roughly 30% of the entire function was spent in the two lines vget_lane_u32 followed by if (ret_err > err); after unrolling by two, these operations took about 25% (i.e. roughly a 10% overall speedup). Also, checking the armv7 version, there are only 8 instructions between where err0 is set (vmov.32 r6, d16[0]) and where it is accessed (cmp r12, r6).
Note that in the code the early exit is ifdef'ed out; enabling it makes the function even slower. If I unrolled by four, used four errN variables, and deferred the check by two rounds, I still saw vget_lane_u32 taking too much time in the profiler. When I checked the generated asm, it appears the compiler defeats all of these attempts because it reuses some of the errN registers, which effectively makes the CPU access the results of vget_lane_u32 much earlier than I want (my aim is to delay the access by 10-20 instructions). Only when I unrolled by four and marked all four errN as volatile did vget_lane_u32 disappear from the profiler entirely; however, the if (ret_err > errN) checks then became very slow, as these values probably ended up as regular stack variables, and overall the 4 checks in the 4x manual unroll started to take 40%. It looks like with proper hand-written asm it would be possible to make this work: have an early loop exit while avoiding NEON-to-ARM stalls and keeping some ARM logic in the loop. However, the extra maintenance required for ARM asm makes that kind of code roughly 10x harder to maintain in a large project (which otherwise has no ARM asm).
Update:
Here's a sample stall when moving data from a NEON register to an ARM register. To implement the early exit I need to move from NEON to ARM once per loop iteration. This move alone takes more than 50% of the entire function according to the sampling profiler that comes with Xcode. I tried adding lots of no-ops before and/or after the mov, but nothing seems to affect the results in the profiler. I tried using vorr d0,d0,d0 as a no-op: no difference. What's the reason for the stall, or is the profiler simply showing wrong results?

Verilog I/O reading a character

I seem to have issues any time I try anything with I/O in Verilog. ModelSim either throws "function not supported" for certain functions or does nothing at all. I simply need to read a file character by character and send each bit through the port. Can anyone assist?
module readFile(clk,reset,dEnable,dataOut,done);
parameter size = 4;
//to Comply with S-block rules which is a 4x4 array will multiply by
// size so row is the number of size bits wide
parameter bits = 8*size;
input clk,reset,dEnable;
output dataOut,done;
wire [1:0] dEnable;
reg dataOut,done;
reg [7:0] addr;
integer file;
reg [31:0] c;
reg eof;
always @(posedge clk)
begin
if(file == 0 && dEnable == 2'b10)begin
file = $fopen("test.kyle");
end
end
always @(posedge clk) begin
if(addr>=32 || done==1'b1)begin
c <= $fgetc(file);
// c <= $getc();
eof <= $feof(file);
addr <= 0;
end
end
always @(posedge clk)
begin
if(dEnable == 2'b10)begin
if($feof(file))
done <= 1'b1;
else
addr <= addr+1;
end
end
//done this way because blocking statements should not really be used
always @(addr)
begin:Access_Data
if(reset == 1'b0) begin
dataOut <= 1'bx;
file <= 0;
end
else if(addr<32)
dataOut <= c[31-addr];
end
endmodule
I would suggest reading the entire file at one time into an array, and then iterating over the array to output the values.
Here is a snippet of how to read bytes from a file into a SystemVerilog queue. If you need to stick to plain old Verilog you can do the same thing with a regular array.
reg [8:0] c;
byte q[$];
int i;
int file;
// Read file a char at a time
file = $fopen("filename", "r");
c = $fgetc(file);
while (c != 'h1ff) begin
q.push_back(c);
$display("Got char [%0d] 0x%0h", i++, c);
c = $fgetc(file);
end
Note that c is defined as a 9-bit reg. The reason for this is that $fgetc will return -1 when it reaches the end of the file. In order to differentiate between EOF and a valid 0xFF you need this extra bit.
I'm not familiar with $feof and don't see it in the Verilog 2001 spec, so that may be something specific to Modelsim. Or it could be the source of the "function not supported."
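If you need to stay in plain Verilog rather than SystemVerilog, a rough equivalent of the queue-based version is to read into a fixed-size reg array with the same 9-bit EOF check. This is an untested sketch; the array depth is arbitrary and the file name is taken from the question:
integer file, nbytes;
reg [8:0] c;
reg [7:0] mem [0:1023]; // make the depth at least as large as the biggest expected file
initial begin
file = $fopen("test.kyle", "r");
nbytes = 0;
c = $fgetc(file);
while (c != 9'h1ff) begin // 9'h1ff means $fgetc returned EOF (-1)
mem[nbytes] = c[7:0];
nbytes = nbytes + 1;
c = $fgetc(file);
end
$fclose(file);
end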

Why is XST optimizing away my registers and how do I stop it?

I have a simple Verilog program that increments a 32-bit counter, converts the number to an ASCII string using $sformat, and then pushes the string to the host machine one byte at a time using an FTDI FT245RL.
Unfortunately Xilinx XST keeps optimizing away the string register vector. I've tried mucking around with various initialization and access routines with no success. I can't seem to turn off optimization, and all of the examples I find online differ very little from my initialization routines. What am I doing wrong?
module counter(CK12, TXE_, WR, RD_, LED, USBD);
input CK12;
input TXE_;
output WR;
output RD_;
output [7:0] LED;
inout [7:0] USBD;
reg [31:0] count = 0;
reg [7:0] k;
reg wrf = 0;
reg rd = 1;
reg [7:0] lbyte = 8'b00000000;
reg td = 1;
parameter MEM_SIZE = 88;
parameter STR_SIZE = 11;
reg [MEM_SIZE - 1:0] str;
reg [7:0] strpos = 8'b00000000;
initial
begin
for (k = 0; k < MEM_SIZE; k = k + 1)
begin
str[k] = 0;
end
end
always @(posedge CK12)
begin
if (TXE_ == 0 && wrf == 1)
begin
count = count + 1;
wrf = 0;
end
else if (wrf == 0) // If we've already lowered the strobe, latch the data
begin
if(td)
begin
$sformat(str, "%0000000000d\n", count);
strpos = 0;
td = 0;
end
str = str << 8;
wrf = 1;
strpos = strpos + 1;
if(strpos == STR_SIZE)
td = 1;
end
end
assign RD_ = rd;
assign WR = wrf;
assign USBD = str[87:80];
assign LED = count[31:24];
endmodule
Loading device for application Rf_Device from file '3s100e.nph' in environment /opt/Xilinx/10.1/ISE.
WARNING:Xst:1293 - FF/Latch str_0 has a constant value of 0 in block . This FF/Latch will be trimmed during the optimization process.
WARNING:Xst:1896 - Due to other FF/Latch trimming, FF/Latch str_1 has a constant value of 0 in block . This FF/Latch will be trimmed during the optimization process.
WARNING:Xst:1896 - Due to other FF/Latch trimming, FF/Latch str_2 has a constant value of 0 in block . This FF/Latch will be trimmed during the optimization process.
The $sformat task is unlikely to be synthesisable - consider what hardware the compiler would need to produce to implement this function! This means your 'str' register never gets updated, so the compiler thinks it can optimize it away. Consider a BCD counter, and maybe a lookup table to convert the BCD codes to ASCII codes.
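For example, each decimal digit can be held in a 4-bit BCD counter and turned into ASCII with a constant offset, which synthesizes to a small adder instead of a string formatter. A minimal sketch for one digit (tick is a hypothetical count-enable pulse; chain digit_carry into the next digit for a multi-digit counter, and see the reset note below for initialization):
reg [3:0] bcd_digit;
wire digit_carry = (bcd_digit == 4'd9);
always @(posedge CK12)
if (tick)
bcd_digit <= digit_carry ? 4'd0 : bcd_digit + 4'd1;
wire [7:0] ascii_digit = 8'h30 + {4'b0000, bcd_digit}; // ASCII '0' is 8'h30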
AFAIK 'initial' blocks are not synthesisable. To initialize flops, use a reset signal. Memories need a 'for' loop like you have, but which triggers only after reset.
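A minimal sketch of that, assuming a reset input named rst is added to the port list:
always @(posedge CK12)
if (rst)
str <= {MEM_SIZE{1'b0}}; // takes the place of the initial-block for loop
else begin
// existing counter / shift / strobe logic goes here
end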