<<module name>> not a task or void function in verilog - syntax-error

I am trying to create a module for a carry select adder in Verilog. Everything works fine except the following portion, which causes a compilation error.
module csa(a,b,s,cout);
input[15:0] a,b;
output [15:0] s;
output cout;
wire zero_c1, zero_c2,zero_c3,zero_c4,zero_c5;
wire one_c1, one_c2,one_c3,one_c4,one_c5;
wire temp_c1,temp_c2,temp_c3,temp_c4,temp_c5;
wire [15:0] s_zero, s_one;
initial
begin
fork
fa(s[0], temp_c1,a[0],b[0],0);
fa_one(s_zero[1],s_one[1],zero_c1,one_c1,a[1],b[1]);
fa_two(s_zero[3:2],s_one[3:2],zero_c2,one_c2,a[3:2],b[3:2]);
fa_three(s_zero[6:4],s_one[6:4],zero_c3,one_c3,a[6:4],b[6:4]);
fa_four(s_zero[10:7],s_one[10:7],zero_c4,one_c4,a[10:7],b[10:7]);
fa_five(s_zero[15:11],s_one[15:11],zero_c5,one_c5,a[15:11],b[15:11]);
join
end
When I try to compile that it says -
the module "fa", "fa_one" are not a task or void function
I deleted the "initial" statement and now it says -
Syntax error near "fork", expecting "endmodule"
I just want to run the code between fork and join in parallel. I have also confirmed that the modules fa and fa_one work fine.
Would appreciate if anyone can help me pointing out what I am doing wrong here. Thanks.

Verilog modules are not run or executed but instantiated, they represent physical blocks of hardware.
Everything is in parallel unless you have made an effort to time-share pieces of hardware. For example, you might write an ALU core which exists only once, but use a program ROM to tell it which instruction to process every clock cycle.
Inside your modules you can have combinatorial code and sequential code.
Combinatorial logic will simulate in 0 time but will actually take some time for values to propagate through when placed on real devices.
If this propagation delay is not taken into account and very large blocks of logic are created, you will struggle to close timing during synthesis, because the settling time through the logic exceeds the clock period between the flip-flops on either side of it.
Sequential logic implies that the results are held in flip-flops, which only update on clock edges. This means chains of sequential logic can take many clock cycles for data to propagate.
When pipelining a processor you break the individual sections up with flip-flops, giving each section a full clock cycle for combinatorial propagation, at the expense of taking several clock cycles to calculate a single result.
To correct your example you would just have:
module csa(
input [15:0] a,
input [15:0] b,
output [15:0] s,
output cout
);
wire zero_c1, zero_c2,zero_c3,zero_c4,zero_c5;
wire one_c1, one_c2,one_c3,one_c4,one_c5;
wire temp_c1,temp_c2,temp_c3,temp_c4,temp_c5;
wire [15:0] s_zero, s_one;
fa ufa(s[0], temp_c1,a[0],b[0],1'b0); // 1'b0: sized constant for the 1-bit carry-in
fa_one ufa_one(s_zero[1],s_one[1],zero_c1,one_c1,a[1],b[1]);
fa_two ufa_two(s_zero[3:2],s_one[3:2],zero_c2,one_c2,a[3:2],b[3:2]);
fa_three ufa_three(s_zero[6:4],s_one[6:4],zero_c3,one_c3,a[6:4],b[6:4]);
fa_four ufa_four(s_zero[10:7],s_one[10:7],zero_c4,one_c4,a[10:7],b[10:7]);
fa_five ufa_five(s_zero[15:11],s_one[15:11],zero_c5,one_c5,a[15:11],b[15:11]);
endmodule
NB: it is module_name #(parameters) instance_name ( ports );

fork is used to run procedural statements within a module in parallel. Separate module instances always run in parallel.
Child modules are instantiated directly within their parent module, not within an initial, begin, or fork, which are used for procedural statements. So you can remove the initial, begin, fork, join, and end, give each instantiation an instance name, and add an endmodule at the end.

Recommended order of input and output ports in Verilog module declaration

Being new to Verilog, I noticed that a lot of code orders the ports
in module declarations with inputs first:
module do_something(
input wire clk_in,
input wire a_in,
input wire b_in,
output reg val_out);
....
endmodule
(Almost the same way I'm used to it when programming in C/C++: inputs first, then outputs).
However I've also seen examples with an opposite order of parameters (output first, inputs last).
I hope this isn't too dumb a question:
Is there any recommendation/best practice to prefer one over the other?
So far I'd simply stick to "inputs first" but I wanted to ask before forming a bad habit.
Usually I do clocks and resets first. Followed by IO grouped by function if it's a large module that has more than one 'thing' going on at once. Within a group I usually order inputs first and then outputs, but the other way is also fine.
Ultimately it's a matter of style so the most important thing is to be consistent. Pick a style and stick with it.
Verilog built-in primitives have their output first followed by their inputs. When connecting module ports, you should be connecting by port names, not positional order, so the order does not really matter.
As others have said, it's convention to put clocks and resets first, then other generic inputs, then outputs.
However, as a personal convenience, I like to put the clocks and resets at the bottom, because these are signals that you are unlikely to modify. This means you don't have to deal with trailing commas or missing commas as you refactor your code by adding/removing ports. For instance:
module example_module(input wire clk,
input wire [7:0] a,
input wire b,
output wire [7:0] x,
output wire y
);
If you needed to add another output port after output wire y, you would have to make sure you added a comma after the y. Similarly, if you removed the output wire y, you would have to make sure you deleted the comma after the x.
This is a small thing but is an accumulated inconvenience when you're hooking things up and rearranging signals and moving ports around. Whereas if you used the following, you could add or remove ports and never have to mess with the trailing commas.
module example_module(input wire [7:0] a,
input wire b,
output wire [7:0] x,
output wire y,
input wire clk
);
And it's the same thing for when you instantiate the module.
example_module u__example_module (
.a (a),
.b (b),
.x (x),
.y (y),
.clk(clk)
);
It's a subjective answer, but hey, it's a subjective question.

Multiple behaviours for single entity

I wrote a VHDL Testbench which contains the following :
Lots of signal declarations
UUT instantiations / port maps
A huge amount of one-line concurrent assignments
Various small processes
One main (big) process which actually stimulates the UUT.
Everything is fine except the fact that I want to have two distinct types of stimulation (let's say a simple stimulus and a more complex one) so what I did is I created two testbenches which have everything in common except the main big process.
But I don't find it really convenient since I always need to update both when, for example, I make a change to the UUT port map. Not cool.
I don't really want to merge my two main processes because it will look like hell, and I can't have the two processes declared concurrently in the same architecture (I might end up with a very long file and I don't like that they can theoretically access the same signals).
So I would really like to keep a "distinct files" approach but only for that specific process. Is there a way out of this or am I doomed?
This seems like an example where using multiple architectures of the same entity would help. You have a file along the lines of:
entity TestBench is
end TestBench;
architecture SimpleTest of TestBench is
-- You might have a component declaration for the UUT here
begin
-- Test bench code here
end SimpleTest;
You can easily add another architecture. You can have architectures in separate files. You can also use direct entity instantiation to avoid the component declaration for the UUT (halving the work required if the UUT changes):
architecture AnotherTest of TestBench is
begin
-- Test bench code here
UUT : entity work.MyDesign (Behavioral)
port map (
-- Port map as usual
);
end AnotherTest ;
This doesn't save having duplicate code, but at least it removes one of the port map lists.
Another point, if you have a lot of signals in your UUT port map, is that this can be easier if you make more of the signals into vectors. For example, you might have lots of serial outputs of the same type going to different chips on the board. I have seen lots of people name these like SPI_CS_SENSORS, SPI_CS_CPU, SPI_CS_FRONT_PANEL, etc. I find it makes the VHDL a lot more manageable if these are combined into SPI_CS (2 downto 0), with the mapping of which signal goes to which device specified by the circuit diagram. I suppose this is just preference, but maybe this sort of approach could help if you have really huge port lists.
Using a TestControl entity
A more sophisticated approach would involve using a test control entity to implement all your stimulus. At the simplest level, this would have as ports all of the signals from the UUT you are interested in. A more sophisticated test bench would have a test control entity with interfaces that can control bus functional models that contain the actual pin wiggling required to exercise your design. You can have one file declaring this entity, say TestControl_Entity.vhd:
library ieee;
use ieee.std_logic_1164.all;
entity TestControl is
port (
clk : out std_logic;
UUTInput : out std_logic;
UUTOutput : in std_logic
);
end TestControl;
Then you have one or more architecture files, for example TestControl_SimpleTest.vhd:
architecture SimpleTest of TestControl is
begin
-- Stimulus for simple test
end SimpleTest;
Your top level test bench would then look something like:
entity TestBench is
end TestBench;
architecture Behavioral of TestBench is
signal clk : std_logic;
signal a : std_logic;
signal b : std_logic;
begin
-- Common processes like clock generation could go here
UUT : entity work.MyDesign (Behavioral)
port map (
clk => clk,
a => a,
b => b
);
TestControl_inst : entity work.TestControl (SimpleTest)
port map (
clk => clk,
UUTInput => a,
UUTOutput => b
);
end Behavioral;
You can now change the test by changing the architecture selected for TestControl.
Using configurations
If you have a lot of different tests, you can use configurations to make it easier to select them. To do this, you first need to make the test control entity instantiation use a component declaration as opposed to direct instantiation. Then, at the end of each test control architecture file, create the configuration:
use work.all;
configuration Config_SimpleTest of TestBench is
for Behavioral
for TestControl_inst : TestControl
use entity work.TestControl (SimpleTest); -- architecture name, not the file name
end for;
end for;
end Config_SimpleTest;
Now when you want to simulate, you simulate a configuration, so instead of a command like sim TestBench, you would run something like sim work.Config_SimpleTest. This makes it easier to manage test benches with a large number of different tests, because you don't have to edit any files in order to run them.
A generic can be added to the test bench entity, to control if simple or
complex testing is done, like:
entity tb is
generic(
test : positive := 1); -- 1: Simple, 2: Complex
end entity;
library ieee;
use ieee.std_logic_1164.all;
architecture syn of tb is
-- Superset of declarations for simple and complex testing
begin
simple_g : if test = 1 generate
process is -- Simple test process
begin
-- ... Simple testing
wait;
end process;
end generate;
complex_g : if test = 2 generate
process is -- Complex test process
begin
-- ... Complex testing
wait;
end process;
end generate;
end architecture;
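The drawback is that declarations in the architecture's declarative part can't
be made conditional, so they must be a superset of the signals and other
controls for both simple and complex testing. Items used by only one variant
can instead be declared inside that variant's process or generate statement.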
The simulator can control the generic value through options, for example -G
for generic control in ModelSim simulator. It is thereby possible to compile
once, and select simple or complex testing at runtime.

vhdl how to use an entity within a process

I'm having difficulties to understand how I could utilize a sequential logic entity in the process of another. This process is a state-machine which on each clock signal either reads values from the input, or performs calculations. These calculation take many iterations to complete. However, each iteration is supposed to utilize a sub-entity, which is defined using the same principles as the above one (two-state state-machine, clock-based iterations), to obtain some results needed in the same iteration.
As I see it, I have two options:
implementing the subentity in a separate process within the main entity and finding a way to halt the main process and sync it with the subentity execution - this would mean using the clock signal of the main entity
implementing the subentity within the process of the main entity (basically something like a function call) and finding a way to halt the main process until subentity execution completes - this seems to me hardly doable using the main clock signal
Neither of them seems very appealing, and both look rather complex, so I'm asking for some experienced insight and clarification. I really hope that there is a more conventional way that I'm missing.
"Entity" is an unfortunate choice of word here, as it suggests a VHDL Entity which may or may not be what you want.
You are thinking along roughly the right lines however, but it is a little unclear what you mean by "appealing"; so your goals are unclear and that makes it difficult to help.
To take your two approaches separately :
(1) Separate processes are a valid approach to dividing up tasks. They will naturally operate in parallel. In a synchronous design (best practice, safest and simplest - not universal but you need a compelling reason to do anything else) they will normally both be clocked by the same system clock.
When you need to synchronise them, you can, using extra "handshaking" signals. Typically your main SM would start the subsystem, wait until the subsystem acknowledged, wait again until the subsystem was done, and use the result.
main_sm : process(clk)
begin
if rising_edge(clk) then
case state is
...
when start_op =>
subsystem_start <= '1';
if subsystem_busy = '1' then
state <= wait_subsystem;
end if;
when wait_subsystem =>
subsystem_start <= '0';
if subsystem_busy = '0' then
state <= use_result;
end if;
when use_result => -- carry on processing
...
end case;
end if;
end process main_sm;
It should be clear how to write the subsystem to match...
This is most useful where the subsystem processing takes a large, variable or unknown time to complete - perhaps sending characters to a UART, or a serial divider. With care, it can also allow several top level processes to access the subsystem to save hardware (obviously the subsystem handshaking logic only responds to one process at a time!)
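As a rough illustration of the subsystem side of this handshake, here is a minimal sketch. It is not the original author's code: the entity name subsystem, the internal signals busy and count, and the fixed 16-cycle duration are all assumptions standing in for the real multi-cycle calculation. Only the port names subsystem_start and subsystem_busy come from the state machine above.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical subsystem matching the main_sm handshake (illustrative only)
entity subsystem is
    port (
        clk             : in  std_logic;
        subsystem_start : in  std_logic;
        subsystem_busy  : out std_logic
    );
end entity subsystem;

architecture rtl of subsystem is
    signal busy  : std_logic := '0';              -- assumed idle after power-up
    signal count : natural range 0 to 15 := 0;    -- stand-in for the real work
begin
    subsystem_busy <= busy;

    process(clk)
    begin
        if rising_edge(clk) then
            if busy = '0' then
                -- Idle: accept a start request and acknowledge by raising busy
                if subsystem_start = '1' then
                    count <= 15;
                    busy  <= '1';
                end if;
            elsif count = 0 then
                -- Work finished: drop busy so the main state machine moves on
                busy <= '0';
            else
                -- One iteration of the real calculation would go here
                count <= count - 1;
            end if;
        end if;
    end process;
end architecture rtl;
```

Note that the subsystem ignores subsystem_start while it is busy, and that the main state machine has already dropped subsystem_start by the time busy falls. A subsystem that could finish in only one or two cycles would need extra care to avoid being retriggered by a start signal that is still high.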
(2) If the sub-entity is to be implemented in the process, it should be written as a subprogram, i.e. as you speculate, a procedure or function. If it is declared local to the process it has access to that process's environment; otherwise you can pass it parameters. This is simplest when the subprogram can complete within the current clock cycle; often you can structure the code so that it can.
Try the following in your synthesis tool:
main_sm : process(clk)
procedure wait_here (level : std_logic; nextstate : state_type) is
begin
subsystem_start <= level;
if subsystem_busy = level then
state <= nextstate;
end if;
end wait_here;
begin
...
when start_op =>
wait_here('1', wait_subsystem);
when wait_subsystem =>
wait_here('0', use_result);
This rewrite of the handshaking above ought to work and in some synth tools it will, but others may not provide good synthesis support for subprograms.
You can use subprograms spanning multiple clock cycles in processes in simulation; the trick is to eliminate the sensitivity list and use
wait until rising_edge(clk);
instead. This is also potentially synthesisable, and can be used e.g. in a loop in a procedure. However some synthesis tools reject it, and Xilinx XST for one is actually getting worse, rather than better, in support for it.
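For simulation, a procedure that spans several clock cycles might look like the following sketch. The entity name wait_demo, the procedure wait_cycles, the signal pulse, and the 10 ns clock period are hypothetical and chosen only for illustration.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity wait_demo is
end entity wait_demo;

architecture sim of wait_demo is
    signal clk   : std_logic := '0';
    signal pulse : std_logic := '0';
begin
    -- Free-running clock with a 10 ns period for the demonstration
    clk <= not clk after 5 ns;

    -- No sensitivity list, so wait statements are allowed in this process
    -- and in the procedures it calls
    stimulus : process
        -- Blocks the calling process for n rising edges of clk
        procedure wait_cycles (n : natural) is
        begin
            for i in 1 to n loop
                wait until rising_edge(clk);
            end loop;
        end procedure wait_cycles;
    begin
        wait_cycles(3);
        pulse <= '1';
        wait_cycles(2);
        pulse <= '0';
        wait;  -- suspend this process forever
    end process stimulus;
end architecture sim;
```

Because the clock here never stops, the simulation has to be run for a bounded time. Whether a given synthesis tool accepts this style is a separate question, as noted above.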

How to manage large VHDL testbenches

One problem I've seen again and again on different VHDL projects is that the top-level testbenches are always large and difficult to keep organized. There is basically a main test process where EVERY test signal is controlled or validated from, which becomes HUGE over time. I know that you can make testbenches for the lower-level components, but this question mainly applies to top-level input/output tests.
I'd like to have some kind of hierarchy structure to keep things organized. I've tried implementing VHDL procedures, but the compiler was very unhappy because it thought I was trying to assign signals from different sections of code...
Is there anything available in VHDL to achieve the behavior of C's inline functions or #define preprocessor replacement macros? If not, what can you suggest? It would make me happy to be able to have my top-level test bench look like this:
testClockSignals();
testDigitialIO();
testDACSignals();
...
Having the implementation of these functions in a separate file would be icing on the cake. Haha...I'd just like to write and simulate the test benches in C.
It is a VHDL requirement that you either write the procedures in the process (as @MortenZdk suggests) or pass all the IO to them.
My preference is to put my procedures only in packages, so I use the pass all IO approach. To simplify what is passed, I use records. If you reduce it to one record, it will be inout and require resolution functions on the elements of the record.
For more ideas on this approach, go to http://www.synthworks.com/papers/ and see the papers titled:
"Accelerating Verification Through Pre-Use ..." (near the bottom) and
"VHDL Testbench Techniques that Leapfrog SystemVerilog" (at the top)
Another key aspect is to use a separate process for each independent interface. This way stimulus can be generated concurrently for different interfaces. This is also illustrated in the papers.
Separating test bench code into manageable procedures is possible, but maybe the
compiler complained because a procedure tried to access signals that were not
in scope? If a procedure is to control a signal that is not in scope, then
the signal can be given as an argument to the procedure, as shown for the
procReset example below.
A test bench structure, with multiple levels for easier maintenance, is shown
below:
--==========================================================
-- Reusable procedures
-- Reset generation
procedure procReset(signal rst : out std_logic; ...) is
...
--==========================================================
-- Main test control procedure in test bench
process is
------------------------------------------------------------
-- General control and status
-- Reset device under test and related test bench modules
procedure genReset is
begin
procReset(rst, 100 ns); -- procReset declared elsewhere
-- Other code as required for complete reset
end procedure;
------------------------------------------------------------
-- Test cases
procedure testClockSignals is
begin
genReset; -- Apply reset to avoid test case interdependency
-- Test code here, and call genErr if mismatch detected
end procedure;
procedure testDigitialIO is
begin
genReset; -- Apply reset to avoid test case interdependency
-- Test code here, and call genErr if mismatch detected
end procedure;
procedure testDACSignals is
begin
genReset; -- Apply reset to avoid test case interdependency
-- Test code here, and call genErr if mismatch detected
end procedure;
begin
------------------------------------------------------------
-- Run test cases
testClockSignals;
testDigitialIO;
testDACSignals;
-- End of simulation
std.env.stop(0);
wait;
end process;
There are several levels in the structure:
Run test cases: Where the procedures for each test case are
called. It is thereby possible to comment out one or
more of the test cases during development and debugging.
Test cases: The test case code itself, which is written as
separate and independent procedures. Interdependence between
runs of the different test cases is avoided by reset (using
the genReset procedure) of the device under test and related test
bench support modules.
General control and status: Reusable test bench specific
procedures, for example reset of the device under test and test
bench support modules.
Reusable procedures: Do not control or use test bench
signals directly, but only through procedure arguments. These
procedures may be located in packages (other files) for reuse
in other test benches.
The test bench file may still run to quite a number of lines, since all the test
case code still has to be in the same file with the above approach if this
test bench code needs direct access to test bench signals in order to control or
check the signal values. If signals can be passed to test case
procedures through arguments, as done for the procReset call, then it is
possible to move the test case code to another package.
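A reusable procedure such as procReset could live in a package along these lines. This is only a sketch: the package name tb_util_pkg, the parameter name duration, and the active-high reset polarity are assumptions, since the original elides the rest of the parameter list. The second argument is taken to be a time because the call above passes 100 ns.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical package holding reusable test bench procedures
package tb_util_pkg is
    procedure procReset(signal rst : out std_logic;
                        constant duration : in time);
end package tb_util_pkg;

package body tb_util_pkg is
    -- Drives rst through its argument only, so it needs no access to
    -- test bench signals by name (active-high reset assumed)
    procedure procReset(signal rst : out std_logic;
                        constant duration : in time) is
    begin
        rst <= '1';
        wait for duration;
        rst <= '0';
    end procedure procReset;
end package body tb_util_pkg;
```

The test bench would then add "use work.tb_util_pkg.all;" before its architecture and call procReset(rst, 100 ns); from a process without a sensitivity list, since the procedure contains a wait statement.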
If you have lower level testbenches for each block, then you can make use of them at the top level.
By making the key lower level test elements entities in their own right, you can compose them into higher level test items which are often just a small shim to convert the pin-level data into the test-level data you were originally using.
For example, in an image processing FPGA, you would have some image-sourcing and data-checking code to check out the algorithmic parts. These could be used as is, or with some wrapping to provide the data to the top-level FPGA pins, and then decode the pin outputs back to the format that the original checking code requires.
The register setup code that was no doubt tested at the lower level can be wrapped in some more code which wiggles the FPGA pins appropriately and interprets the pin-wiggling results.
The combination of the two sets of code allows you to check the end-to-end function of the image processing pipeline and the register configuration of that pipeline.

Neon VLD consuming more cycles than what is expected?

I have some simple asm code which loads 12 NEON quad registers and interleaves pairwise add instructions with the load instructions (to exploit the dual-issue capability). I have verified the code here:
http://pulsar.webshaker.net/ccc/sample-d3a7fe78
As one can see, the code takes around 13 cycles. But when I load the code on the board, the load instructions seem to take more than one cycle per load. I verified and found that VPADAL takes 1 cycle as stated, but VLD1 takes more than one cycle. Why is that?
I have taken care of the following:
The address is 16 byte aligned.
Have provided the alignment hint in the instruction vld1.64 {d0, d1}, [r0,:128]!
Tried preload instruction pld [r0, #192], at places but that seems to add to the cycles instead of actually reducing the latency.
Can someone tell me what am I doing wrong, why this latency?
Other Details:
With reference to cortex-a8
arm-2009q1 cross compiler tool chain
coding in assembly
Your code is executing much slower than expected because as it's currently written, it's causing the perfect storm of pipeline stalls. On any modern CPU with a pipelined architecture, instructions can execute in one cycle under ideal conditions. The ideal conditions are that the instruction is not waiting for memory and doesn't have any register dependencies. The way you've written the code, you're not allowing for the delay in reading from memory and making the next instruction dependent on the results of the read. This is causing the worst possible performance. Also, I'm not sure why you're accumulating the pairwise adds into multiple registers. Try something like this:
veor.u16 q12,q12,q12 @ clear accumulated sum
top_of_loop:
vld1.u16 {q0,q1},[r0,:128]!
vld1.u16 {q2,q3},[r0,:128]!
vpadal.u16 q12,q0
vpadal.u16 q12,q1
vpadal.u16 q12,q2
vpadal.u16 q12,q3
vld1.u16 {q0,q1},[r0,:128]!
vld1.u16 {q2,q3},[r0,:128]!
vpadal.u16 q12,q0
vpadal.u16 q12,q1
vpadal.u16 q12,q2
vpadal.u16 q12,q3
subs r1,r1,#8
bne top_of_loop
Experiment with different numbers of load instructions before executing the adds. The point is that you need to allow time for the read to occur before you can use the target register.
Note: Using Q4-Q7 is risky because they're non-volatile registers. On Android you will get random garbage appearing in these (especially Q4).