How to flatten Verilog bus to individual wires using Yosys - yosys

Simple question here. Is there a method in Yosys to flatten arrays? i.e.:
wire [1:0] rdata; becomes wire rdata_1; wire rdata_0;

Here's my answer. Not sure when it should be called. After proc seems to work.
proc
opt
splitnets -format __ # <---
[..]
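As a sketch of the effect (assuming a file named top.v and a single underscore as the separator character), a minimal script would be:

```
read_verilog top.v
proc; opt
splitnets -format _
# wire [1:0] rdata  ->  individual single-bit wires (exact names depend on -format)
write_verilog top_split.v
```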

Related

Gstreamer: How to read a structure inside the property of an element

First of all, I am very new to GStreamer, so it would be very helpful if someone could give me a simple explanation of what I'm asking here.
There is a pipeline which feeds raw video from a camera source to a TensorFlow element which detects faces, stores the face ROI coordinates into a structure, and also updates some kind of metadata. After this, there is a display overlay element which draws a bounding box using the inference results read from the metadata updated by the TensorFlow element.
Tensorflow (plugin) --> Post processing (property) --> Detection (structure)
To put it simply, I need to get the values of a structure which gets updated on each face detection while the pipeline is running. I checked the gst_bin_get_by_name() + g_object_class_find_property() + g_object_get() API combination, but it seems it can only read state such as enabled/disabled, a parameter string, etc. from the property.
I don't know if I was able to convey my requirement properly.
Someone please help me out.
It appears that you can use nnstreamer for your application: https://github.com/nnstreamer/nnstreamer
The corresponding (GStreamer) pipeline would look like this (with a lot of assumptions):
v4l2src ! # assuming a USB camera
tee name=rawvideo ! queue max-size-buffers=2 leaky=2 !
videoconvert ! videoscale ! videorate ! # basic preprocessing for live video
video/x-raw,format=RGB,width=300,height=300 ! # another assumption on your model
tensor_converter ! # now, the format becomes video/x-raw --> other/tensors
tensor_transform mode=arithmetic option=typecast:float32,div:255 ! # you may add different mode & options for your pre-processing needs (e.g., transpose, standardization, ...)
tensor_filter framework=tensorflow model=YOURMODELFILE.pb !
tee name=result ! appsink name=yourappcangetrawstructuretensors
result. ! tensor_decoder mode=bounding_boxes option..... ! # if the corresponding subplugin exists for the given structure, you can simply designate it here. otherwise, you can write your own code and attach it here.
mix.sink_1
rawvideo. ! queue leaky=2 max-size-buffers=2 ! mix.sink_0
### use composite or videomixer to overlay boundingboxes
compositor name=mix sink_0::zorder=1 sink_1::zorder=2 ! videoconvert !
autovideosink ## you may use different video sink for your systems.
In other words, you can have separate streams for the "structure" (output of TensorFlow) and the "video" (input of TensorFlow, or the original video stream) and merge them later at the compositor, as in the example above.

Yosys synthesis - is this optimal?

I'm using yosys to synthesize simple circuits and show how the result varies with the cell library.
However, it looks like the result is not well optimized.
I'm using the library vsclib013.lib downloaded from: http://www.vlsitechnology.org/synopsys/vsclib013.lib
E.g., I synthesize an adder composed of 4 full adders. Since I do not use Carry_in and Carry_out, I expect that a half adder (a two-input XOR) is synthesized for the LSB adder.
The result of the synthesis is the following.
Number of cells 12
cgi2v0x05 4
iv1v0x05 4
xor3v1x05 4
It uses 4 cells that are XOR with three inputs.
This is also clear from the graph of the circuit (obtained using the Yosys command 'show').
The circuit is simply composed of four identical full adders; there is no optimization for Carry_in being tied to '0' or for Carry_out being left unconnected.
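The expected simplification can be checked in plain Python (a sketch, not Yosys output): with Carry_in fixed to 0, the full-adder equations collapse exactly to a half adder.

```python
def full_adder(a, b, cin):
    # standard full-adder equations
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout

def half_adder(a, b):
    # XOR for sum, AND for carry-out
    return a ^ b, a & b

# with cin = 0 the two agree on every input combination
ok = all(full_adder(a, b, 0) == half_adder(a, b)
         for a in (0, 1) for b in (0, 1))
```

This is the optimization one would hope the tool performs for the LSB stage.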
The script I used to synthesize is:
ghdl TOP_ENTITY
hierarchy -check -top TOP_ENTITY
proc; opt; memory; opt; fsm; opt
techmap; opt
read_liberty -lib vsclib013.lib
dfflibmap -liberty vsclib013.lib
abc -liberty vsclib013.lib -D 1000 -constr constraint_file_vsclib013.txt
splitnets -ports; opt
clean
write_verilog TOP_ENTITY.v
flatten
show -stretch -format pdf -lib TOP_ENTITY.v
Thank you for any suggestion to improve the synthesis.
Thanks for your answer.
After some trial and error I obtained good results by simply using flatten.
I also added -full to the opt commands for (hopefully) good measure.
Now, my working script is like this:
ghdl TOP_ENTITY
hierarchy -check -top TOP_ENTITY
flatten
proc; opt -full; memory; opt -full; fsm; opt -full
techmap; opt -full
read_liberty -lib vsclib013.lib
dfflibmap -liberty vsclib013.lib
abc -liberty vsclib013.lib -D 1000 -constr constraint_file_vsclib013.txt
splitnets -ports; opt -full
clean -purge
write_verilog TOP_ENTITY.v
flatten
show -stretch -format pdf -lib TOP_ENTITY.v
I also added -purge option to the clean command to get a nicer printed schematic.
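To compare the two script variants, Yosys can report per-cell-type counts (like the ones quoted above) after mapping. A sketch, using the same library file as in the scripts:

```
# run after abc, before write_verilog
stat -liberty vsclib013.lib   # prints number of cells per type
```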

TensorFlow Operator Source Code

I'm trying to find the source code for TensorFlow's low-level linear-algebra and matrix arithmetic operators for execution on CPU. For example, where is the actual implementation of tf.add() for execution on a CPU? As far as I know, most linear algebra operators are actually implemented by Eigen, but I'd like to know which Eigen functions, specifically, are being called.
I've tried tracing back from the high-level API, but this is difficult as there are a lot of steps between placing an operator on the graph, and the actual execution of the operator by the TF runtime.
The implementation is hidden behind some template metaprogramming (not unusual for Eigen).
Each operation in TensorFlow is registered at some point. Add is registered here and here.
REGISTER3(BinaryOp, GPU, "Add", functor::add, float, Eigen::half, double);
The actual implementation of operations is based on OpKernel. The Add operation is implemented in BinaryOp::Compute. The class hierarchy is BinaryOp : BinaryOpShared : OpKernel.
In the case of adding two scalars, the entire implementation is just:
functor::BinaryFunctor<Device, Functor, 1>().Right(
    eigen_device, out_flat, in0.template flat<Tin>(),
    in1.template scalar<Tin>(), error_ptr);
where in0, in1 are the incoming Tensor-Scalars, Device is either GPU or CPU, and Functor is the operation itself. The other lines are just for performing the broadcasting.
Scrolling down in this file and expanding the REGISTER3 macro shows how the arguments are passed from REGISTER3 to functor::BinaryFunctor<Device, Functor, ...>.
You should not expect to see explicit loops, as Eigen uses expression templates to do lazy evaluation and aliasing. The Eigen "call" is here:
https://github.com/tensorflow/tensorflow/blob/7a0def60d45c1841a4e79a0ddf6aa9d50bf551ac/tensorflow/core/kernels/cwise_ops.h#L693-L696
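The lazy-evaluation idea can be illustrated with a tiny Python sketch (no Eigen involved, and much simpler than Eigen's expression templates): the addition builds an expression object, and the arithmetic happens only when the expression is evaluated.

```python
class Expr:
    """A value that is computed lazily, on demand."""
    def __init__(self, fn):
        self.fn = fn

    def __add__(self, other):
        # building the sum does no arithmetic yet;
        # it just composes a new deferred computation
        return Expr(lambda: self.fn() + other.fn())

    def eval(self):
        return self.fn()

a = Expr(lambda: 2)
b = Expr(lambda: 3)
expr = a + b          # nothing is computed here
result = expr.eval()  # the addition happens only now
```

Eigen does the analogous composition at compile time with C++ templates, which is why no explicit loop is visible at the call site.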

How to implement a custom op that uses intermediate data from the forward pass to compute its gradient faster in tensorflow?

I am trying to implement a custom op in TensorFlow that represents a computationally heavy transfer function computed in C++ using Eigen on GPU. I would like to accelerate the computation of the gradient (also in C++ for speed) of the op by re-using some of the intermediate values obtained while computing its output.
In the source code of tensorflow/core/kernels/cwise_ops_gradients.h we see that many functions already do that to some extent by re-using the output of the op to compute its derivative. Here is the example of the sigmoid:
template <typename T>
struct scalar_sigmoid_gradient_op {
  EIGEN_EMPTY_STRUCT_CTOR(scalar_sigmoid_gradient_op)
  EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE const T
  operator()(const T& output, const T& output_gradient) const {
    return output_gradient * output * (T(1) - output);
  }
};
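The pattern above, computing the gradient from the forward output alone, can be illustrated in plain Python (independent of TensorFlow/Eigen), since sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad_from_output(output, output_gradient):
    # mirrors scalar_sigmoid_gradient_op: only the forward
    # output is needed, not the original input x
    return output_gradient * output * (1.0 - output)

y = sigmoid(0.5)
g = sigmoid_grad_from_output(y, 1.0)
```

This is why the sigmoid kernel never has to recompute (or even see) the forward input.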
However, I don't see how I can access something other than just the output, for example some other values I stored during the forward pass, to accelerate the computation of my derivative.
I've thought about adding a second output to my op, carrying all the data required for the derivative, and using it in the computation of the gradient of the actual output, but I've not managed to make it work yet. I'm not sure whether it can work in principle.
Another approach I imagined is to manually modify the full graph (forward and backprop) to shortcut an output from the op directly towards its derivative block. I'm not sure how to do it.
Otherwise, there may be a data storage scheme I'm not aware of and that would allow me to store data in the forward pass of an op and retrieve it during gradient computation.
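The "second output" idea from the question can be sketched in plain Python, independent of TensorFlow's op machinery: the forward pass returns its result plus a cached intermediate, and the backward pass reuses the cache instead of recomputing it. (The function f(x) = x / (x^2 + 1) here is a hypothetical stand-in for the heavy transfer function.)

```python
def forward(x):
    # expensive intermediate computed once in the forward pass
    intermediate = x * x + 1.0          # stand-in for a heavy computation
    output = x / intermediate
    return output, intermediate         # the "second output" carries the cache

def backward(x, intermediate, output_gradient):
    # quotient rule for f(x) = x / (x^2 + 1), reusing the cached
    # intermediate instead of recomputing x*x + 1.0
    d_output_dx = (intermediate - x * (2.0 * x)) / (intermediate ** 2)
    return output_gradient * d_output_dx

y, cache = forward(3.0)
g = backward(3.0, cache, 1.0)
```

In graph terms this corresponds to wiring the extra output tensor from the forward op into the gradient op, so the cache lives as an ordinary edge in the graph.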
Thank you for your attention, I would greatly appreciate any ideas.
D

Block for collecting samples - Simulink

Is there a way in Simulink to collect the samples generated during the simulation? I have a random integer generator block which generates integers between 0 and 15, and I am mapping the integers to chip sequences as specified in the 802.15.4 standard. The data-to-chip mapper outputs a 32x1 vector, and I would like to store n such chip sequences and serialize them before OQPSK-modulating the signal. Is there a block in Simulink to do this? If not, an idea on how to implement this would be greatly appreciated.
Thanks, Sommer
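The collect-then-serialize step the question describes can be sketched in plain Python (with a hypothetical placeholder mapping, not the real 802.15.4 chip table) to clarify what such a buffering block would have to do:

```python
import random

CHIP_LEN = 32      # chips per symbol, as in the question
N_SEQUENCES = 4    # assumed number of sequences to collect

def chip_sequence(symbol):
    # hypothetical stand-in for the data-to-chip mapper:
    # a deterministic 32-chip sequence per 4-bit symbol
    rng = random.Random(symbol)
    return [rng.randint(0, 1) for _ in range(CHIP_LEN)]

symbols = [random.randrange(16) for _ in range(N_SEQUENCES)]
collected = [chip_sequence(s) for s in symbols]

# serialize: n sequences of 32 chips -> one flat stream of n*32 chips
serial_stream = [chip for seq in collected for chip in seq]
```

In Simulink terms, this is a buffer of n frames followed by a parallel-to-serial conversion on the 32x1 vectors.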