Is it possible to avoid specifying a default in order to get an X in Chisel? - hardware

The following Chisel code works as expected.
class Memo extends Module {
val io = new Bundle {
val wen = Bool(INPUT)
val wrAddr = UInt(INPUT, 8)
val wrData = UInt(INPUT, 8)
val ren = Bool(INPUT)
val rdAddr = UInt(INPUT, 8)
val rdData = UInt(OUTPUT, 8)
}
val mem = Mem(UInt(width = 8), 256)
when (io.wen) { mem(io.wrAddr) := io.wrData }
io.rdData := UInt(0)
when (io.ren) { io.rdData := mem(io.rdAddr) }
}
However, it's a compile-time error if I don't specify io.rdData := UInt(0), because defaults are required. Is there a way to either explicitly specify X or to have modules without defaults output X, erm, by default?
Some reasons you might want to do this are that nothing should rely on the output if ren isn't asserted, and Xs let you specify it, and specifying an X can tell the synthesis tool that it's a don't care, for optimization purposes.

Chisel does not support X.
This is probably what you want:
io.rdData := mem(io.rdAddr)
(this prevents muxing between zero and mem's readout data).
If, for example, one is trying to infer a single-ported SRAM, the Chisel tutorial/manual describes how to do so (https://chisel.eecs.berkeley.edu/latest/chisel-tutorial.pdf):
val ram1p =
Mem(UInt(width = 32), 1024, seqRead = true)
val reg_raddr = Reg(UInt())
when (wen) { ram1p(waddr) := wdata }
.elsewhen (ren) { reg_raddr := raddr }
val rdata = ram1p(reg_raddr)
In short, the read address needs to be registered (since we're dealing with synchronous memory in this example), and the enable signal on this register dictates that the read-out data only changes with that enable is true. Thus, the backend understands this a read-enable signal exists on the read port, and in this example, the write port and the read port are one and the same (since only one is ever accessed at a time). Or if you want 1w/1r (as per your example), you can change the "elsewhen (ren)" to a "when (ren)".

From the Chisel 2.0 Tutorial paper, only binary logic is supported at the moment:
2 Hardware expressible in Chisel
This version of Chisel also only supports binary logic, and does not
support tri-state signals.
We focus on binary logic designs as they constitute the vast majority
of designs in practice. We omit support for tri-state logic in the
current Chisel language as this is in any case poorly supported by
industry flows, and difficult to use reliably outside of controlled
hard macros.

Related

Kotlin: succinct way of inverting Int sign depending on Boolean value

I have
var x: Int
var invert: Boolean
and I need the value of the expression
if (invert) -x else x
Is there any more succinct way to write that expression in Kotlin?
There is no shorter expression using only the stdlib to my knowledge.
This is pretty clear, though. Using custom functions to make it shorter is possible, but it would only obscure the meaning IMO.
It's hard to tell which approach might be best without seeing more of the code, but one option is an extension function. For example:
fun Int.negateIf(condition: Boolean) = if (condition) -this else this
(I'm using the term ‘negate’ here, as that's less ambiguous: when dealing with numbers, I think ‘inverse’ more often refers to a multiplicative inverse, i.e. reciprocal.)
You could then use:
x.negateIf(invert)
I think that makes the meaning very clear, and saves a few characters. (The saving is greater if x is a long name or an expression, of course.)
If invert didn't change (e.g. if it were a val), another option would be to derive a multiplier from it, e.g.:
val multiplier = if (invert) -1 else 1
Then you could simply multiply by that:
x * multiplier
That's even shorter, though a little less clear; if you did that, it would be worth adding a comment to explain it.
(BTW, whichever approach you use, there's an extremely rare corner case here: no positive Int has the same magnitude as Int.MIN_VALUE (-2147483648), so you can't negate that one value. Either way, you'll get that same number back. There's no easy way around that, but it's worth being aware of.)
You could create a local extension function.
Local functions are helpful when you want to reduce some repetitive code and you want to access a local variable (in this case, the invert boolean).
Local functions are particularly useful when paired with extension functions, as extension functions only have one 'receiver' - so it would be difficult, or repetitive, to access invert if invert() wasn't a local function.
fun main() {
val x = 1
val y = 2
val z = 3
var invert = false
// this local function can still access 'invert'
fun Int.invert(): Int = if (invert) -this else this
println("invert = false -> (${x.invert()}, ${y.invert()}, ${z.invert()})")
invert = true
println("invert = true -> (${x.invert()}, ${y.invert()}, ${z.invert()})")
}
invert = false -> (1, 2, 3)
invert = true -> (-1, -2, -3)

Nim sequence assignment value semantics

I was under the impression that sequences and strings always get deeply copied on assignment. Today I got burned when interfacing with a C library to which I pass unsafeAddr of a Nim sequence. The C library writes into the memory area starting at the passed pointer.
Since I don't want the original Nim sequence to be changed by the library I thought I'll simply copy the sequence by assigning it to a new variable named copy and pass the address of the copy to the library.
Lo and behold, the modifications showed up in the original Nim sequence nevertheless. What's even more weird is that this behavior depends on whether the copy is declared via let copy = ... (changes do show up) or via var copy = ... (changes do not show up).
The following code demonstrates this in a very simplified Nim example:
proc changeArgDespiteCopyAssignment(x: seq[int], val: int): seq[int] =
let copy = x
let copyPtr = unsafeAddr(copy[0])
copyPtr[] = val
result = copy
proc dontChangeArgWhenCopyIsDeclaredAsVar(x: seq[int], val: int): seq[int] =
var copy = x
let copyPtr = unsafeAddr(copy[0])
copyPtr[] = val
result = copy
let originalSeq = #[1, 2, 3]
var ret = changeArgDespiteCopyAssignment(originalSeq, 9999)
echo originalSeq
echo ret
ret = dontChangeArgWhenCopyIsDeclaredAsVar(originalSeq, 7777)
echo originalSeq
echo ret
This prints
#[9999, 2, 3]
#[9999, 2, 3]
#[9999, 2, 3]
#[7777, 2, 3]
So the first call changes originalSeq while the second doesn't. Can someone explain what is going on under the hood?
I'm using Nim 1.6.6 and a total Nim newbie.
Turns out there are a lot of issues concerned with this behavior in the nim-lang issue tracker. For example:
let semantics gives 3 different results depends on gc, RT vs VM, backend, type, global vs local scope
Seq assignment does not perform a deep copy
Let behaves differently in proc for default gc
assigning var to local let does not properly copy in default gc
clarify spec/implementation for let: move or copy?
RFC give default gc same semantics as --gc:arc as much as possible
Long story short, whether a copy is made depends on a lot of factors, for sequences especially on the scope (global vs. local ) and the gc (refc, arc, orc) in use.
More generally, the type involved (seq vs. array), the code generation backend (C vs. JS) and whatnot can also be relevant.
This behavior has tricked a lot of beginners and is not well received by some of the contributors. It doesn't happen with the newer GCs --gc:arc or --gc:orc where the latter is expected to become the default GC in an upcoming Nim version.
It has never been fixed in the current default gc because of performance concerns, backward compatibility risks and the expectation that it will disappear anyway once we transition to the newer GCs.
Personally, I would have expected that it at least gets clearly singled out in the Nim language manual. Well, it isn't.
I took a quick look at the generated C code, a highly edited and simplified version is below.
The essential thing that is missing in the let code = ... variant is the call to genericSeqAssign() which makes a copy of the argument and assigns that to copy, instead the argument is assigned to copy directly. So, there is definitely no deep copy on assignment in that case.
I don't know if that is intentional or if it is a bug in the code generation (my first impression was astonishment). Any ideas?
tySequence_A* changeArgDespiteCopyAssignment(tySequence_A* x, NI val) {
tySequence_A* result = NIM_NIL;
tySequence_A* copy;
NI* copyPtr;
copy = x; /* NO genericSeqAssign() call !*/
copyPtr = &copy->data[(NI) 0];
*copyPtr = val;
genericSeqAssign(&result, copy, &NTIseqLintT_B);
return result;
}
tySequence_A* dontChangeArgWhenCopyIsDeclaredAsVar(tySequence_A* x, NI val) {
tySequence_A* result = NIM_NIL;
tySequence_A copy;
NI* copyPtr;
genericSeqAssign(&copy, x, &NTIseqLintT_B);
copyPtr = &copy->data[(NI) 0];
*copyPtr = val;
genericSeqAssign(&result, copy, &NTIseqLintT_B);
return result;
}

Megamorphic virtual call - trying to optimize it

we know that there are some techniques that make virtual calls not so expensive in JVM like Inline Cache or Polymorphic Inline Cache.
Let's consider the following situation:
Base is an interface.
public void f(Base[] b) {
for(int i = 0; i < b.length; i++) {
b[i].m();
}
}
I see from my profiler that calling virtual (interface) method m is relatively expensive.
f is on the hot path and it was compiled to machine code (C2) but I see that call to m is a real virtual call. It means that it was not optimised by JVM.
The question is, how to deal with a such situation? Obviously, I cannot make the method m not virtual here because it requires a serious redesign.
Can I do anything or I have to accept it? I was thinking how to "force" or "convince" a JVM to
use polymorphic inline cache here - the number of different types in b` is quite low - between 4-5 types.
to unroll this loop - length of b is also relatively small. After an unroll it is possible that Inline Cache will be helpful here.
Thanks in advance for any advices.
Regards,
HotSpot JVM can inline up to two different targets of a virtual call, for more receivers there will be a call via vtable/itable [1].
To force inlining of more receivers, you may try to devirtualize the call manually, e.g.
if (b.getClass() == X.class) {
((X) b).m();
} else if (b.getClass() == Y.class) {
((Y) b).m();
} ...
During execution of profiled code (in the interpreter or C1), JVM collects receiver type statistics per call site. This statistics is then used in the optimizing compiler (C2). There is just one call site in your example, so the statistics will be aggregated throughout the entire execution.
However, for example, if b[0] always has just two receivers X or Y, and b[1] always has another two receivers Z or W, JIT compiler may benefit from splitting the code into multiple call sites, i.e. manual unrolling:
int len = b.length;
if (len > 0) b[0].m();
if (len > 1) b[1].m();
if (len > 2) b[2].m();
...
This will split the type profile, so that b[0].m() and b[1].m() can be optimized individually.
These are low level tricks relying on the particular JVM implementation. In general, I would not recommend them for production code, since these optimizations are fragile, but they definitely make the source code harder to read. After all, megamorphic calls are not that bad [2].
[1] https://shipilev.net/blog/2015/black-magic-method-dispatch/
[2] https://shipilev.net/jvm/anatomy-quarks/16-megamorphic-virtual-calls/

Vulkan memory barrier for indirect compute shader dispatch

I have two compute shaders and the first one modifies DispatchIndirectCommand buffer, which is later used by the second one.
// This is the code of the first shader
struct DispatchIndirectCommand{
uint x;
uint y;
uint z;
};
restrict layout(std430, set = 0, binding = 5) buffer Indirect{
DispatchIndirectCommand[1] dispatch_indirect;
};
void main(){
// some stuff
dispatch_indirect[0].x = number_of_groups; // I used debugPrintfEXT to make sure that this number is correct
}
I execute them as follows
vkCmdBindPipeline(cmd_buffer, VK_PIPELINE_BIND_POINT_COMPUTE, first_shader);
vkCmdDispatch(cmd_buffer, x, 1, 1);
vkCmdBindPipeline(cmd_buffer, VK_PIPELINE_BIND_POINT_COMPUTE, second_shader);
vkCmdDispatchIndirect(cmd_buffer, indirect_buffer, 0);
The problem is that the changes made by first shader are not reflected by the second one.
// This is the code of the second shader
void main(){
debugPrintfEXT("%d", gl_GlobalInvocationID.x); //this seems to never be called
}
I initialise the indirect_buffer with VkDispatchIndirectCommand{.x=0,.y=1,.z=1}, and it seems that the second shader always executes with x==0, because the debugPrintfEXT never prints anything. I tried to add a memory barrier like
VkBufferMemoryBarrier barrier;
barrier.sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;
barrier.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
barrier.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;
barrier.srcQueueFamilyIndex = queue_idx;
barrier.dstQueueFamilyIndex = queue_idx;
barrier.buffer = indirect_buffer;
barrier.offset = 0;
barrier.size = sizeof_indirect_buffer;
However, this does not seem to make any difference. What does seem to work, is when I use
barrier.dstAccessMask = VK_ACCESS_SHADER_READ_BIT | VK_ACCESS_INDIRECT_COMMAND_READ;
When I use such access flags, the all compute shaders work properly. However, I get a validation error
Validation Error: [ VUID-vkCmdPipelineBarrier-dstAccessMask-02816 ] Object 0: handle = 0x5561b60356c8, type = VK_OBJECT_TYPE_COMMAND_BUFFER; | MessageID = 0x69c8467d | vkCmdPipelineBarrier(): .pMemoryBarriers[1].dstAccessMask bit VK_ACCESS_INDIRECT_COMMAND_READ_BIT is not supported by stage mask (VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT). The Vulkan spec states: The dstAccessMask member of each element of pMemoryBarriers must only include access flags that are supported by one or more of the pipeline stages in dstStageMask, as specified in the table of supported access types (https://vulkan.lunarg.com/doc/view/1.2.182.0/linux/1.2-extensions/vkspec.html#VUID-vkCmdPipelineBarrier-dstAccessMask-02816)
Vulkan's documentation states that
VK_ACCESS_INDIRECT_COMMAND_READ_BIT specifies read access to indirect command data read as part of an indirect build, trace, drawing or dispatching command. Such access occurs in the VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT pipeline stage
It looks really confusing. It clearly does mention "dispatching command", but at the same time it says that the stage must be VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT and not VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT. Is the specification contradictory/imprecise or am I missing something?
You seem to be employing trial-and-error strategy. Do not use such strategy in low level APIs, and preferrably in computer engineering in general. Sooner or later you will encounter something that will look like it works when you test it, but be invalid code anyway. The spec did tell you the exact thing to do, so you never ever had a legitimate reason to try any other flags or with no barriers at all.
As you discovered, the appropriate access flag for indirect read is VK_ACCESS_INDIRECT_COMMAND_READ_BIT. And as the spec also says, the appropriate stage is VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT.
So the barrier for your case should probably be:
srcStageMask = VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT
srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
dstStageMask = VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT
dstAccessMask = VK_ACCESS_INDIRECT_COMMAND_READ_BIT;
The name of the stage flag is slightly confusing, but the spec is otherwisely very clear it aplies to compute:
VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT specifies the stage of the pipeline where VkDrawIndirect* / VkDispatchIndirect* / VkTraceRaysIndirect* data structures are consumed.
and also:
For the compute pipeline, the following stages occur in this order:
VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT
VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT
PS: Also relevant GitHub Issue: KhronosGroup/Vulkan-Docs#176

Kotlin stdlib operatios vs for loops

I wrote the following code:
val src = (0 until 1000000).toList()
val dest = ArrayList<Double>(src.size / 2 + 1)
for (i in src)
{
if (i % 2 == 0) dest.add(Math.sqrt(i.toDouble()))
}
IntellJ (in my case AndroidStudio) is asking me if I want to replace the for loop with operations from stdlib. This results in the following code:
val src = (0 until 1000000).toList()
val dest = ArrayList<Double>(src.size / 2 + 1)
src.filter { it % 2 == 0 }
.mapTo(dest) { Math.sqrt(it.toDouble()) }
Now I must say, I like the changed code. I find it easier to write than for loops when I come up with similar situations. However upon reading what filter function does, I realized that this is a lot slower code compared to the for loop. filter function creates a new list containing only the elements from src that match the predicate. So there is one more list created and one more loop in the stdlib version of the code. Ofc for small lists it might not be important, but in general this does not sound like a good alternative. Especially if one should chain more methods like this, you can get a lot of additional loops that could be avoided by writing a for loop.
My question is what is considered good practice in Kotlin. Should I stick to for loops or am I missing something and it does not work as I think it works.
If you are concerned about performance, what you need is Sequence. For example, your above code will be
val src = (0 until 1000000).toList()
val dest = ArrayList<Double>(src.size / 2 + 1)
src.asSequence()
.filter { it % 2 == 0 }
.mapTo(dest) { Math.sqrt(it.toDouble()) }
In the above code, filter returns another Sequence, which represents an intermediate step. Nothing is really created yet, no object or array creation (except a new Sequence wrapper). Only when mapTo, a terminal operator, is called does the resulting collection is created.
If you have learned java 8 stream, you may found the above explaination somewhat familiar. Actually, Sequence is roughly the kotlin equivalent of java 8 Stream. They share similiar purpose and performance characteristic. The only difference is Sequence isn't designed to work with ForkJoinPool, thus a lot easier to implement.
When there is multiple steps involved or the collection may be large, it's suggested to use Sequence instead of plain .filter {...}.mapTo{...}. I also suggest you to use the Sequence form instead of your imperative form because it's easier to understand. Imperative form may become complex, thus hard to understand, when there are 5 or more steps involved in the data processing. If there is just one step, you don't need a Sequence, because it just creates garbage and gives you nothing useful.
You're missing something. :-)
In this particular case, you can use an IntProgression:
val progression = 0 until 1_000_000 step 2
You can then create your desired list of squares in various ways:
// may make the list larger than necessary
// its internal array is copied each time the list grows beyond its capacity
// code is very straight forward
progression.map { Math.sqrt(it.toDouble()) }
// will make the list the exact size needed
// no copies are made
// code is more complicated
progression.mapTo(ArrayList(progression.last / 2 + 1)) { Math.sqrt(it.toDouble()) }
// will make the list the exact size needed
// a single intermediate list is made
// code is minimal and makes sense
progression.toList().map { Math.sqrt(it.toDouble()) }
My advice would be to choose whichever coding style you prefer. Kotlin is both object-oriented and functional language, meaning both of your propositions are correct.
Usually, functional constructs favor readability over performance; however, in some cases, procedural code will also be more readable. You should try to stick with one style as much as possible, but don't be afraid to switch some code if you feel like it's better suited to your constraints, either readability, performance, or both.
The converted code does not need the manual creation of the destination list, and can be simplified to:
val src = (0 until 1000000).toList()
val dest = src.filter { it % 2 == 0 }
.map { Math.sqrt(it.toDouble()) }
And as mentioned in the excellent answer by #glee8e you can use a sequence to do a lazy evaluation. The simplified code for using a sequence:
val src = (0 until 1000000).toList()
val dest = src.asSequence() // change to lazy
.filter { it % 2 == 0 }
.map { Math.sqrt(it.toDouble()) }
.toList() // create the final list
Note the addition of the toList() at the end is to change from a sequence back to a final list which is the one copy made during the processing. You can omit that step to remain as a sequence.
It is important to highlight the comments by #hotkey saying that you should not always assume that another iteration or a copy of a list causes worse performance than lazy evaluation. #hotkey says:
Sometimes several loops. even if they copy the whole collection, show good performance because of good locality of reference. See: Kotlin's Iterable and Sequence look exactly same. Why are two types required?
And excerpted from that link:
... in most cases it has good locality of reference thus taking advantage of CPU cache, prediction, prefetching etc. so that even multiple copying of a collection still works good enough and performs better in simple cases with small collections.
#glee8e says that there are similarities between Kotlin sequences and Java 8 streams, for detailed comparisons see: What Java 8 Stream.collect equivalents are available in the standard Kotlin library?