Chisel, Generate Blocks and Large Intermediate/Output Files - hdl

Is there a construct in Chisel to generate Verilog generate blocks, instead of unrolling Scala for-loops into very large (>100k line) output Verilog and FIRRTL files?
For example, I have the following code, which constructs a 2D lattice of MatrixElement modules and connects their inputs and outputs.
private val mat_elems = Seq.tabulate(rows, cols) { (i, j) =>
  Module(new MatrixElement(n = i, m = j))
}
for (i <- 0 until rows; j <- 0 until cols) {
  // Wavefront propagation
  if (i == 0 && j != 0) {
    // First row
    mat_elems(i)(j).io.in <> (false.B, false.B, mat_elems(i)(j - 1).io.out)
  } else if (i != 0 && j == 0) {
    // First col
    mat_elems(i)(j).io.in <> (false.B, mat_elems(i - 1)(j).io.out, false.B)
  } else if (i >= 1 && j >= 1) {
    // Internal matrix
    mat_elems(i)(j).io.in <> (mat_elems(i - 1)(j - 1).io.out, mat_elems(i - 1)(j).io.out,
                              mat_elems(i)(j - 1).io.out)
  }
}
I am looking to compile this code for values of rows and cols >= 256, so this matrix gets very large.
If I were writing this as a Verilog module, I would make use of generate blocks. However, in Chisel, since I am using Scala loops, the entire lattice/matrix gets unrolled in the FIRRTL/Verilog output, often producing >100k lines full of _T* wires for 512x512 lattices. This causes a whole bunch of JVM out-of-memory errors during Chisel compilation and makes VCS simulation of the output files VERY slow (just parsing the files takes forever).
Is there any way around this? Maybe a way to get Chisel to emit Verilog generate blocks?

There is no support in Chisel or FIRRTL for compressing this. Such a feature would probably be pretty useful, but we have no plan or timeline for it. You can always use a BlackBox and write the Verilog yourself if you find the compilation time saved to be worth it.
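If you go the BlackBox route, a minimal sketch might look like the following. It assumes the repetitive lattice lives in a hand-written Verilog file with generate blocks that is kept as a project resource; the module name, ports, parameters, and resource path here are illustrative placeholders, not the asker's actual MatrixElement interface.

import chisel3._
import chisel3.experimental.IntParam
import chisel3.util.HasBlackBoxResource

class MatrixLattice(rows: Int, cols: Int)
    extends BlackBox(Map("ROWS" -> IntParam(rows), "COLS" -> IntParam(cols)))
    with HasBlackBoxResource {
  val io = IO(new Bundle {
    val clock = Input(Clock())
    val in    = Input(UInt(cols.W))   // placeholder boundary inputs
    val out   = Output(UInt(cols.W))  // placeholder boundary outputs
  })
  // src/main/resources/MatrixLattice.v holds the hand-written generate blocks.
  addResource("/MatrixLattice.v")
}

Chisel then emits only the instantiation, so the FIRRTL/Verilog output stays small; the trade-off is that the lattice internals are no longer generated or checked by Chisel.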

Related

How to avoid usize going negative?

I'm translating a chunk (2000 lines) of proprietary C code into Rust. In C, it is common to run a pointer, array index, etc. down, for as long as it is non-negative. In Rust, simplified to the bone, it would look something like:
while i >= 0 && more_conditions {
    more_work;
    i -= 1;
}
Of course, when i is usize, you get an underflow from the subtraction. I have learned to work around this by using for loops with .rev(), offsetting my indexes by one, or using a different type and casting with as usize, etc.
Usually it works, and usually I can make it legible, but the code I'm modifying is chock-full of indexes running towards each other and eventually tested with i_low > i_high.
Something like this (in Rust):
loop {
    while condition1(i_low) { i_low += 1; }
    while condition2(i_high) { i_high -= 1; }
    if i_low > i_high { return something; }
    do_something_else;
}
Every now and then this panics, as i_high runs past 0.
I have been inserting a lot of i_high >= 0 && in the code, and it becomes a lot less readable.
How do experienced Rust programmers avoid usize variables going to -1?
for loops? for i in (0..size).rev()
casting? i as usize, after checking for i < 0
offsetting your variable by one, and using i-1 when safe?
extra conditionals?
catching exceptions?
Or do you just eventually learn to write programs around these situations?
Clarification: The C code is not broken - it has supposedly been in production for ten years, structuring video segments on multiple servers 24/7. It just does not follow Rust conventions - it often returns -1 as an index, it recurses with -1 for the low index of an array to process, and indexes go negative all the time. All of these are handled before problems occur - ugly, but functional. Something like:
incident_segment = detect_incident(array, start, end);
attach(array, incident_segment);
store(array, start, incident_segment - 1);
process(array, incident_segment + 1, end);
In the above code, every single one of the three resulting calls may be getting a segment index that's -1 (attach, store) or out of bounds (process). It's handled, but after the call.
My Rust code appears to be working as well. As a matter of fact, in order to deal with the negative usize, I added additional logic that pruned a number of recursions, so it runs about as fast as the C code (apparently faster, but that's also because I distributed the output on multiple drives).
The issue is that the client does not want a full rewrite, and wants the 'native' programmers to be able to check the two programs against each other. Based on the answers so far, I'm thinking that using i64 and casting/shadowing as needed may be the best way to produce code that's easy to read for the 'natives'. Which I personally do not have to like...
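For illustration, that i64-with-casting style could look roughly like the sketch below; the function and its stopping condition are made up for the example, not taken from the original C code.

// Hypothetical example: the index is kept as i64 so it can legitimately
// reach -1, and is shadowed to usize only at the point of slice access.
fn last_index_below(array: &[u8], limit: u8) -> i64 {
    let mut i: i64 = array.len() as i64 - 1;
    while i >= 0 && array[i as usize] >= limit {
        i -= 1;
    }
    i // -1 means "ran past the start", as the C code would return
}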
If you want to do it idiomatically:
for j in (0..=i).rev() {
    if conditions {
        break;
    }
    // use j as your new i here
}
Note the use of ..=i in the iterator: it means the iteration actually includes i, i.e. [0, 1, 2, ..., i-1, i]; otherwise, you end up with [0, 1, 2, ..., i-2, i-1].
Otherwise, here is the code:
while (i as isize - 1) != -2 && more_conditions {
    more_work;
    i -= 1;
}
I'd probably start by using saturating_sub (and _add for parallel structure):
while condition1(i_low) { i_low = i_low.saturating_add(1); }
while condition2(i_high) { i_high = i_high.saturating_sub(1); }
You need to be careful to ensure that your logic handles the value saturating at zero. You could also use more C-like semantics with wrapping_sub.
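In the same spirit, checked_sub (a standard usize method) makes the "would go below zero" case explicit instead of clamping it. A small illustrative sketch with a made-up condition:

// Walk i_high downwards while the condition holds, stopping cleanly when
// the next decrement would take it below zero.
fn walk_down(mut i_high: usize, condition: impl Fn(usize) -> bool) -> usize {
    while condition(i_high) {
        match i_high.checked_sub(1) {
            Some(next) => i_high = next,
            None => break, // the C version would have gone to -1 here
        }
    }
    i_high
}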
Truthfully, there's no one-size-fits-all solution. Many times, complicated logic becomes simpler if you abstract it a bit, or turn it slightly sideways. You haven't provided any concrete examples, so we cannot give any useful advice. I solve way too many problems with iterators, so that's often my first solution.
catching exceptions
Absolutely not. That's exceedingly inefficient and non-idiomatic.

branch prediction and optimized code

I have the following two code blocks; the purpose of both blocks is the same.
I had to implement the 2nd block to avoid the inverted logic and to increase readability.
BTW, in the production code the condition is very complex.
The question is: I know branching is bad, but how much of a penalty do I have to pay?
Just as extra info, please also consider that the probability of the else branch is very high.
X = Get_XValue()
if (X != 5)
{
    K = X+3;
    .
    .
}

X = Get_XValue()
if (X == 5)
{
    /*do nothing*/
}
else
{
    K = X+3;
    .
    .
}
This all comes down to your compiler. A good optimizing compiler will detect that the then-clause in the second example is empty and reverse the test. Thus it will generate the same code for both cases, so there should be no penalty at all.
As a side note, I can add that this was the case for all three compilers I tried (clang, gcc, and iccarm).
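If you want to verify this for your own compiler, a self-contained pair of functions like the sketch below can be compiled with, e.g., gcc -O2 -S and the generated assembly diffed. Get_XValue is only declared, and K is returned so it isn't optimized away; both are stand-ins for the asker's real code.

int Get_XValue(void);

int form_without_else(void)
{
    int K = 0;
    int X = Get_XValue();
    if (X != 5)
    {
        K = X + 3;
    }
    return K;
}

int form_with_empty_then(void)
{
    int K = 0;
    int X = Get_XValue();
    if (X == 5)
    {
        /* do nothing */
    }
    else
    {
        K = X + 3;
    }
    return K;
}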

Kotlin stdlib operations vs for loops

I wrote the following code:
val src = (0 until 1000000).toList()
val dest = ArrayList<Double>(src.size / 2 + 1)
for (i in src)
{
    if (i % 2 == 0) dest.add(Math.sqrt(i.toDouble()))
}
IntelliJ (in my case Android Studio) is asking me if I want to replace the for loop with operations from the stdlib. This results in the following code:
val src = (0 until 1000000).toList()
val dest = ArrayList<Double>(src.size / 2 + 1)
src.filter { it % 2 == 0 }
   .mapTo(dest) { Math.sqrt(it.toDouble()) }
Now I must say, I like the changed code. I find it easier to write than for loops when I come up with similar situations. However, upon reading what the filter function does, I realized that this is a lot slower than the for loop. filter creates a new list containing only the elements from src that match the predicate. So there is one more list created and one more loop in the stdlib version of the code. Of course, for small lists it might not matter, but in general this does not sound like a good alternative. Especially if one chains more methods like this, you can get a lot of additional loops that could be avoided by writing a for loop.
My question is: what is considered good practice in Kotlin? Should I stick to for loops, or am I missing something and it does not work as I think it works?
If you are concerned about performance, what you need is Sequence. For example, your code above becomes:
val src = (0 until 1000000).toList()
val dest = ArrayList<Double>(src.size / 2 + 1)
src.asSequence()
   .filter { it % 2 == 0 }
   .mapTo(dest) { Math.sqrt(it.toDouble()) }
In the above code, filter returns another Sequence, which represents an intermediate step. Nothing is really created yet - no object or array creation (except a new Sequence wrapper). Only when mapTo, a terminal operator, is called is the resulting collection created.
If you have learned Java 8 streams, you may find the above explanation somewhat familiar. Sequence is roughly the Kotlin equivalent of Java 8 Stream. They share a similar purpose and performance characteristics. The main difference is that Sequence isn't designed to work with ForkJoinPool, and is thus a lot easier to implement.
When there are multiple steps involved or the collection may be large, it's better to use a Sequence than plain .filter {...}.mapTo{...}. I also suggest you use the Sequence form instead of your imperative form because it's easier to understand; the imperative form can become complex, and thus hard to follow, when there are five or more steps in the data processing. If there is just one step, you don't need a Sequence, because it just creates garbage and gives you nothing useful.
You're missing something. :-)
In this particular case, you can use an IntProgression:
val progression = 0 until 1_000_000 step 2
You can then create your desired list of square roots in various ways:
// may make the list larger than necessary
// its internal array is copied each time the list grows beyond its capacity
// code is very straightforward
progression.map { Math.sqrt(it.toDouble()) }

// will make the list the exact size needed
// no copies are made
// code is more complicated
progression.mapTo(ArrayList(progression.last / 2 + 1)) { Math.sqrt(it.toDouble()) }

// will make the list the exact size needed
// a single intermediate list is made
// code is minimal and makes sense
progression.toList().map { Math.sqrt(it.toDouble()) }
My advice would be to choose whichever coding style you prefer. Kotlin is both an object-oriented and a functional language, meaning both of your propositions are correct.
Usually, functional constructs favor readability over performance; however, in some cases, procedural code will also be more readable. You should try to stick with one style as much as possible, but don't be afraid to switch some code if you feel it's better suited to your constraints, whether readability, performance, or both.
The converted code does not need the manual creation of the destination list, and can be simplified to:
val src = (0 until 1000000).toList()
val dest = src.filter { it % 2 == 0 }
              .map { Math.sqrt(it.toDouble()) }
And as mentioned in the excellent answer by @glee8e, you can use a sequence to do lazy evaluation. The simplified code using a sequence:
val src = (0 until 1000000).toList()
val dest = src.asSequence() // change to lazy
              .filter { it % 2 == 0 }
              .map { Math.sqrt(it.toDouble()) }
              .toList() // create the final list
Note that the toList() at the end changes the sequence back into a final list, which is the one copy made during processing. You can omit that step to remain a sequence.
It is important to highlight the comments by @hotkey saying that you should not always assume that another iteration or a copy of a list causes worse performance than lazy evaluation. @hotkey says:
Sometimes several loops, even if they copy the whole collection, show good performance because of good locality of reference. See: Kotlin's Iterable and Sequence look exactly same. Why are two types required?
And excerpted from that link:
... in most cases it has good locality of reference thus taking advantage of CPU cache, prediction, prefetching etc. so that even multiple copying of a collection still works good enough and performs better in simple cases with small collections.
@glee8e says that there are similarities between Kotlin sequences and Java 8 streams; for detailed comparisons see: What Java 8 Stream.collect equivalents are available in the standard Kotlin library?

SMO algorithm running into infinite loop?

I'm interested in building an SVM multi-class classifier, so I am currently implementing Sequential Minimal Optimization (SMO).
My implementation is based on the pseudocode in "Fast Training of Support Vector Machines using Sequential Minimal Optimization" by John C. Platt.
I observed that for certain training examples, the SMO may diverge and run into an infinite loop.
The following loop in the main routine
numChanged = 0;
examineAll = 1;
while (numChanged > 0 || examineAll > 0) {…}
may run into an infinite loop.
Is there a clue or criterion to prevent the SMO routine from running into an infinite loop?
I would like to thank you for your answer.
Regards
You can add a max iteration condition if you want:
while ((numChanged > 0 || examineAll) && iter < MaxIter)
but in most cases it shouldn't run into an infinite loop. This is the full pseudocode from Platt's paper:
while (numChanged > 0 || examineAll)
{
    numChanged = 0;
    // Adding curly brackets for better readability
    if (examineAll)
    {
        loop I over all training examples
            numChanged += examineExample(I);
    }
    else
    {
        loop I over examples where alpha is not 0 & not C
            numChanged += examineExample(I);
    }
    if (examineAll == 1)
    {
        examineAll = 0;
    }
    else
    {
        examineAll = 1;
    }
}
Notice that what it does is perform one iteration examining all the examples, and then the next iteration does the same only for those examples where alpha is not 0 and not C. If nothing changes after the "examine all" iteration, the while-loop condition will be false, hence stopping the loop.
So, for it to end up in an infinite loop there must be a corner case (probably a numerical error) that introduces oscillations, making examples change during the "examine all" phase but not during the phase that examines only the examples whose alpha is not 0 and not C.
Usually, if the data is normalized to [-1,1] or [0,1] and the parameters of the algorithm have reasonable values, those corner cases will be rare. In any case, if you want to be extra careful you can add the max-iteration safety net.
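For reference, the outer loop with that safety net folded in might look like the following C++ sketch; examineExample is only declared here as a placeholder for the asker's own routine from Platt's paper, and maxIter is an assumed tunable guard, not part of the original pseudocode.

#include <vector>

// Placeholder: returns 1 if the example's alpha changed, 0 otherwise.
int examineExample(int i);

void trainSMO(int numExamples, const std::vector<double>& alpha, double C,
              int maxIter = 10000) {
    int numChanged = 0;
    bool examineAll = true;

    for (int iter = 0; (numChanged > 0 || examineAll) && iter < maxIter; ++iter) {
        numChanged = 0;
        if (examineAll) {
            // pass over all training examples
            for (int i = 0; i < numExamples; ++i)
                numChanged += examineExample(i);
        } else {
            // pass over non-bound examples only (alpha not 0 and not C)
            for (int i = 0; i < numExamples; ++i)
                if (alpha[i] != 0.0 && alpha[i] != C)
                    numChanged += examineExample(i);
        }
        examineAll = !examineAll; // same toggle as in the pseudocode above
    }
}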

Make interpreter execute faster

I've created an interpreter for a simple language. It is AST-based (to be more exact, an irregular heterogeneous AST) with visitors executing and evaluating nodes. However, I've noticed that it is extremely slow compared to "real" interpreters. For testing, I ran this code:
i = 3
j = 3
has = false
while i < 10000
    j = 3
    has = false
    while j <= i / 2
        if i % j == 0 then
            has = true
        end
        j = j+2
    end
    if has == false then
        puts i
    end
    i = i+2
end
I ran it in both Ruby and my interpreter (it just finds primes primitively). Ruby finished in under 0.63 seconds, and my interpreter took over 15 seconds.
I'm developing the interpreter in C++ in Visual Studio, so I used the profiler to see what takes the most time: the evaluation methods.
50% of the execution time was spent calling the abstract evaluation method, which casts the passed expression and calls the proper eval method. Something like this:
Value * eval (Exp * exp)
{
    Value * result = nullptr;
    switch (exp->type)
    {
        case EXP_ADDITION:
            result = eval ((AdditionExp*) exp);
            break;
        ...
    }
    return result;
}
I could put the eval methods into the Exp nodes themselves, but I want to keep the nodes clean (Terence Parr said something about reusability in his book).
Also, at evaluation I always reconstruct the Value object, which stores the result of the evaluated expression. Value is actually abstract, and it has derived value classes for different types (that's why I work with pointers - to avoid object slicing when returning). I think this could be another reason for the slowness.
How could I make my interpreter as optimized as possible? Should I create bytecode out of the AST and then interpret the bytecode instead? (As far as I know, that could be much faster.)
Here is the source if it helps understanding my problem: src
Note: I haven't done any error handling yet, so an illegal statement or an error will simply freeze the program. (Also sorry for the stupid "error messages" :))
The syntax is pretty simple; the currently executed file is OTZ1core/testfiles/test.txt (which is the prime finder).
I appreciate any help I can get; I'm a real beginner at compilers and interpreters.
One possibility for a speed-up would be to use a function table instead of the switch with dynamic retyping. Your call to the typed-eval is going through at least one, and possibly several, levels of indirection. If you distinguish the typed functions instead by name and give them identical signatures, then pointers to the various functions can be packed into an array and indexed by the type member.
Value * (*evaltab[])(Exp *) = { // the order of functions must match
    Exp_Add,                    // the order of the type values
    //...
};
Then the whole switch becomes:
evaltab[exp->type](exp);
1 indirection, 1 function call. Fast.
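Here is a small self-contained sketch of that table-dispatch pattern; the node types, Value struct, and enum below are illustrative placeholders, not the asker's actual classes.

#include <cstdio>

// Illustrative expression and value types.
enum ExpType { EXP_LITERAL = 0, EXP_ADDITION = 1, EXP_TYPE_COUNT };

struct Value { double num; };

struct Exp {
    ExpType type;
    double literal;   // used by EXP_LITERAL
    Exp *lhs, *rhs;   // used by EXP_ADDITION
};

Value *eval(Exp *exp); // forward declaration for the recursive calls

Value *evalLiteral(Exp *exp) {
    return new Value{exp->literal};
}

Value *evalAddition(Exp *exp) {
    Value *l = eval(exp->lhs);
    Value *r = eval(exp->rhs);
    Value *result = new Value{l->num + r->num};
    delete l;
    delete r;
    return result;
}

// The function table: the index order must match the ExpType enum order.
Value *(*evaltab[EXP_TYPE_COUNT])(Exp *) = { evalLiteral, evalAddition };

Value *eval(Exp *exp) {
    return evaltab[exp->type](exp); // one indirection, one call
}

int main() {
    Exp a{EXP_LITERAL, 2.0, nullptr, nullptr};
    Exp b{EXP_LITERAL, 3.0, nullptr, nullptr};
    Exp sum{EXP_ADDITION, 0.0, &a, &b};
    Value *v = eval(&sum);
    std::printf("%g\n", v->num); // prints 5
    delete v;
    return 0;
}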