What is required to target a new device? - yosys

From a high-level point of view, what is required to target a new device with Yosys? I'd like to target a Xilinx XC9572XL. I have one these development boards: XC9572XL-CPLD-development-board-v1b. The architecture of this CPLD is fairly well covered in the Xilinx documentation here.
I think I need to do the following:
Work out how to get Yosys to synthesise a design to a Sum-of-Product and D-type Flip Flop based netlist.
Output that netlist as a BLIF format from Yosys.
Create a 'fitter' (analogous to arachne-pnr for the ICE40 FPGA) for the XC9572XL
Output a JEDEC file with the appropriate fuses that need to be set to implement the design in the previous step.
Flash the design to the CPLD using xc3sprog.
It looks possible. The hard bit is building a 'fitter' tool. This tool needs to understand the CPLD's resources and then needs some clever algorithms to fit the design and output the required fuses in a JEDEC format. One import missing piece is the mapping between the 'fuses' in the physical CPLD and the fuses in the JEDEC file. This would have to be reverse engineered. I note that a JEDEC file from Xilinx WebPACK ISE contains 46656 fuses. Each of those map back to some configurable node in the CPLD.
I'd like to know what others think about this approach. What types of issues am I likely to encounter?
What legal aspects do I need to consider if I was to undertake this? Should I write to Xilinx first and seek permission from them should I decide I want to reverse engineer a JEDEC file produced by their tool?
The XC9572XL is an obsolete part...

Work out how to get Yosys to synthesise a design to a Sum-of-Product and D-type Flip Flop based netlist.
Output that netlist as a BLIF format from Yosys.
You can do two-level synthesis with ABC from a logic-level BLIF file. For example:
$ yosys -p synth -o test.blif tests/simple/fiedler-cooley.v
$ yosys-abc
abc> read_blif test.blif
abc> collapse
abc> write_pla test.pla
Now you can aim for writing a program that converts a .pla file (plus auxiliary information that might be generated by a yosys plugin you'd need to write) to a JEDEC file.
What legal aspects do I need to consider if I was to undertake this?
IANAL. TINLA.
When you reverse engineer it by analyzing the software provided by the chip vendor: In this case it really depends on the country you are living in. For example, in europe you can reverse-engineer, even disassemble, software in certain situations, even when the software EULA prohibits it. I explain this in a bit more depth here.
I think reverse engineering the silicon itself (instead of analyzing the software) is less problematic in places like north america.

Have you considered targeting the CoolRunner-II family? I did some fairly extensive RE on it (https://recon.cx/2015/slides/recon2015-18-andrew-zonenberg-From-Silicon-to-Compiler.pdf) and understand the majority of the bitstream format. Porting Yosys to it is high on my priority list once I figure out the last of the clock network structure.
These devices are more recent and lower power, plus the internal architecture is cleaner and easier to target (nice regular AND/OR array vs having some pterms dedicated to certain OR terms).
In either case please contact me to discuss further, I'd love to collaborate.
EDIT: Clifford is right, reversing silicon is explicitly legal in the US (17 USC 906) while software is more of a gray area. ISE is also such a giant monster that nobody with their head screwed on right would want to reverse engineer it; the chip is a lot easier to follow.
Although the XC9500XL series is an older 350nm family (less metal layers, larger features, easier to see detail under a microscope) it also uses a lot of nasty analog tricks with floating gate EEPROM/flash cells directly in the logic and sense amplifiers on the output. CoolRunner-II is 180nm with 4 or 5 metal layers depending on density, and the main logic array is entirely digital and a lot easier to reverse engineer.

Related

What's the point of "make move" and "undo move" in chessengines?

I want to experiment with massive parallel chess computing. From what I saw and understood in wikis and source code of some engines is that in most (all?) implementations of the min-max(negamax, alpha-beta, ...)-algorithm there is one internal position that gets updated for every new branch and then gets undone after receiving the evaluation of that branch.
What are the benefits of that technic compared to just generating a new position-object and passing that to the next branch?
This is what I have done in my previous engines and I believe this method is superior for the purpose of parallelism.
The answer to your question depends heavily on a few factors, such as how you are storing the chess board, and what your make/unmake move function looks like. I'm sure there are ways of storing the board that would be better suited towards your method, and indeed historically some top tiered chess engine (in particular crafty) used to use that method, but you are correct in saying that modern engines no longer do it that way. Here is a link to a discussion about this very point:
http://www.talkchess.com/forum/viewtopic.php?t=50805
If you want to understand why this is, then you must understand how today's engines represent the board. The standard implementation revolves around bitboards, and that requires 12- 64 bit integers per position, in addition to a redundant mailbox (array in non computer chess jargon) used in conjunction. Copying all that is usually more expensive than a good makeMove/unMakeMove function.
I also want to point out that a hybrid approach is also common. Usually make and unmake is used on the board itself, where as other information like en passant squares and castle rights are copied and changed like you suggested.
Interestingly, for board representations of < ~300 bytes, it IS cheaper to copy the board on each move (on modern x86), as opposed to making and unmaking the move.
As you suggest the immutable properties of copying the board on each move are from a programming perspective, including parallelising, very attractive.
My board rep is 208 bytes. C++ compiled with g++ 7.4.0.
Empricism shows that board copy is 20% faster than move make/unmake. I presume the code is using 32/64 byte-wide AVX instructions for the copy. In theory you can do 32/64 byte copy per cycle.
Just for you: https://github.com/rolandpj1968/oom-skaak/tree/init-impl-legal-move-gen-move-do-undo-2

Analyzing bitstreams using Icestorm

I'm trying to understand the bitstreams generated by Yosys/arachne-pnr as described on http://www.clifford.at/icestorm/:
The recommended approach for learning how to use this documentation is to synthesize very simple circuits using Yosys and Arachne-pnr, run the icestorm tool icebox_explain on the resulting bitstream files, and analyze the results using the HTML export of the database mentioned above. icebox_vlog can be used to convert the bitstream to Verilog. The output file of this tool will also outline the signal paths in comments added to the generated Verilog code.
In order to understand the effect a change in the bitstream has, it would be helpful if I could change the .ex file and convert it back to an ASCII bitstream (instead of having to identify the bit manually) for uploading to the FPGA. Is there a way to do so?
I'm a bit concerned about damaging the FPGA with an invalid bitstream. Are there situations where this is known to happen? Is there a way to simulate a bitstream?
Also, it would be helpful to have some kind of “higher-level” explanation format which e.g. shows the IE/REN bits on the I/O blocks to which they correspond, not the one on which they have to be set in the bitstream. Is there such a format?
I know of the possibility to generate an equivalent Verilog circuit, but the problem with this is that it doesn't usually allow me a lossless round-trip back into a bitstream. Is there a way to generate an equivalent Verilog circuit which (e.g. by instantiating the blocks explicitly) yields the exact same bitstream when processed with Yosys/arachne-pnr?
I'm a bit concerned about damaging the FPGA with an invalid bitstream. Are there situations where this is known to happen? Is there a way to simulate a bitstream?
I have not damaged any FPGA so far. (I have, however, managed to damage the serial flash on one icestick after running some test that reprogrammed it in a loop.)
But this does not mean that you cannot damage your FPGA by programming it with an invalid bitstream. You could theoretically configure the FPGA in a way that produces a driver-driver conflict. I don't know how well the hardware deals with something like that. I have not run any experiments to find out..
Also, it would be helpful to have some kind of “higher-level” explanation format which e.g. shows the IE/REN bits on the I/O blocks to which they correspond, not the one on which they have to be set in the bitstream. Is there such a format?
icebox_vlog produces a higher-level output. But it does not output things like I/O blocks, so it might be too high-level for your needs.
I know of the possibility to generate an equivalent Verilog circuit, but the problem with this is that it doesn't usually allow me a lossless round-trip back into a bitstream. Is there a way to generate an equivalent Verilog circuit which (e.g. by instantiating the blocks explicitly) yields the exact same bitstream when processed with Yosys/arachne-pnr?
Not at the moment. But it should not be too hard to extend icebox_vlog to provide this functionality. So if you really need that, it might be something within your reach to add yourself.

Automated Design in CAD, Analysis in FEA, and Optimization

I would like to optimize a design by having an optimizer make changes to a CAD file, which is then analyzed in FEM, and the results fed back into the optimizer to make changes on the design based on the FEM, until the solution converges to an optimum (mass, stiffness, else).
This is what I envision:
create a blueprint of the part in a CAD software (e.g. CATIA).
run an optimizer code (e.g. fmincon) from within a programming language (e.g. Python). The parameters of the optimizer are parameters of the CAD model (angles, lengths, thicknesses, etc.).
the optimizer evaluates a certain design (parameter set). The programming language calls the CAD software and modifies the design accordingly.
the programming language extracts some information (e.g. mass).
then the programming language extracts a STEP file and passes it a FEA solver (e.g. Abaqus) where a predefined analysis is performed.
the programming language reads the results (e.g. max van Mises stress).
the results from CAD and FEM (e.g. mass and stress) are fed to the optimizer, which changes the design accordingly.
until it converges.
I know this exists from within a closed architecture (e.g. isight), but I want to use an open architecture where the optimizer is called from within an open programming language (ideally Python).
So finally, here are my questions:
Can it be done, as I described it or else?
References, tutorials please?
Which softwares do you recommend, for programming, CAD and FEM?
Yes, it can be done. What you're describing is a small parametric structural sizing multidisciplinary optimization (MDO) environment. Before you even begin coding up the tools or environment, I suggest doing some preliminary work on a few areas
Carefully formulate the minimization problem (minimize f(x), where x is a vector containing ... variables, subject to ... constraints, etc.)
Survey and identify individual tools of interest
How would each tool work? Input variables? Output variables?
Outline in a Design Structure Matrix (a.k.a. N^2 diagram) how the tools will feed information (variables) to each other
What optimizer is best suited to your problem (MDF?)
Identify suitable convergence tolerance(s)
Once the above steps are taken, I would then start to think MDO implementation details. Python, while not the fastest language, would be an ideal environment because there are many tools that were built in Python to solve MDO problems like the one you have and the low development time. I suggest going with the following packages
OpenMDAO (http://openmdao.org/): a modern MDO platform written by NASA Glenn Research Center. The tutorials do a good job of getting you started. Note that each "discipline" in the Sellar problem, the 2nd problem in the tutorial, would include a call to your tool(s) instead of a closed-form equation. As long as you follow OpenMDAO's class framework, it does not care what each discipline is and treats it as a black-box; it doesn't care what goes on in-between an input and an output.
Scipy and numpy: two scientific and numerical optimization packages
I don't know what software you have access to, but here are a few tool-related tips to help you in your tool survey and identification:
Abaqus has a Python API (http://www.maths.cam.ac.uk/computing/software/abaqus_docs/docs/v6.12/pdf_books/SCRIPT_USER.pdf)
If you need to use a program that does not have an API, you can automate the GUI using Python's win32com or Pywinauto (GUI automation) package
For FEM/FEA, I used both MSC PATRAN and MSC NASTRAN on previous projects since they have command-line interfaces (read: easy to interface with via Python)
HyperSizer also has a Python API
Install Pythonxy (https://code.google.com/p/pythonxy/) and use the Spyder Python IDE (included)
CATIA can be automated using win32com (quick Google search on how to do it: http://code.activestate.com/recipes/347243-automate-catia-v5-with-python-and-pywin32/)
Note: to give you some sort of development time-frame, what you're asking will probably take at least two weeks to develop.
I hope this helps.

VHDL optimization tips [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 7 years ago.
Improve this question
I am quite new in VHDL, and by using different IP cores (by different providers) can see that sometimes they differ massively as per the space that they occupy or timing constraints.
I was wondering if there are rules of thumb for optimization in VHDL (like there are in C, for example; unroll your loops, etc.).
Is it related to the synthesis tools I am using (like the different compilers are using other methods of optimization in C, so you need to learn to read the feedback asm files they return), or is it dependent on my coding skills?
Is it related to the synthesis tools I am using (like the different compilers are using other methods of optimization in C, so you need to learn to read the feedback asm files they return), or is it dependent on my coding skills?
The answer is "yes." When you are coding in HDL, you are actually describing hardware (go figure). Instead of the code being converted into machine code (like it would with C) it is synthesized to logical functions (AND, NOT, OR, XOR, etc) and memory elements (RAM, ROM, FF...).
VHDL can be used in many different ways. You can use VHDL in a purely structural sense where at the base level you are calling our primitives of the underlying technology that you are targeting. For example, you literally instantiate every AND, OR, NOT, and Flip Flop in your design. While this can give you a lot of control, it is not an efficient use of time in 99% of cases.
You can also implement hardware using behavioral constructs with VHDL. Instead of explicitly calling out each logic element, you describe a function to be implemented. For example, if this, do this, otherwise, do something else. You can describe state machines, math operations, and memories all in a behavioral sense. There are huge advantages to describing hardware in a behavioral sense:
Easier for humans to understand
Easier for humans to maintain
More portable between synthesis tools and target hardware
When using behavioral constructs, knowing your synthesis tool and your target hardware can help in understanding how what you write will actually be implemented. For example, if you describe a memory element with a asynchronous reset the implementation in hardware will be different for architectures with a dedicated asynchronous reset input to the memory element and one without.
Synthesis tools will generally publish in their reference manual or user guide a list of suggested HDL constructs to use in order to obtain some desired implementation result. For basic cases, they will be what you would expect. For more elaborate behavior models (e.g. a dual port RAM) there may be some form that you need to follow for the tool to "recognize" what you are describing.
In summary, for best use of your target device:
Know the device you are targeting. How are the programmable elements laid out? How many inputs and outputs are there from lookup tables? Read the device user manual to find out.
Know your synthesis engine. What types of behavioral constructs will be recognized and how will they be implemented? Read the synthesis tool user guide or reference manual to find out. Additionally, experiment by synthesizing small constructs to see how it gets implemented (via RTL or technology viewer, if available).
Know VHDL. Understand the differences between signals and variables. Be able to recognize statements that will generate many levels of logic in your design.
I was wondering if there are rules of thumb for optimization in VHDL
Now that you know the hardware, synthesis tool, and VHDL... Assuming you want to design for maximum performance, the following concepts should be adhered to:
Pipeline, pipeline, pipeline. The more levels of logic you have between synchronous elements, the more difficulty you are going to have making your timing constraint/goal.
Pipeline some more. Having additional stages of registers can provide additional wiggle-room in the future if you need to add more processing steps to your algorithm without affecting the overall latency/timeline.
Be careful when operating on the boundaries of the normal fabric. For example, if interfacing with an IO pin, dedicated multiplies, or other special hardware, you will take more significant timing hits. Additional memory elements should be placed here to avoid critical paths forming.
Review your synthesis and implementation reports frequently. You can learn a lot from reviewing these frequently. For example, if you add a new feature, and your timing takes a hit, you just introduced a critical path. Why? How can you alleviate this issue?
Take care with your "global" structures -- such as resets. Logic that must be widely distributed in your design deserves special care, since it needs to reach across your whole device. You may need special pipeline stages, or timing constraints on this type of logic. If at all possible, avoid "global" structures, unless truly a requirement.
While synthesis tools have design goals to focus on area, speed or power, the designer's choices and skills is the major contributor for the quality of the output. A designer should have a goal to maximize speed or minimize area and it will greatly influence his choices. A design optimized for speed can be made smaller by asking the tool to reduce the area, but not nearly as much as the same design thought for area in the first place.
However, it is more complicated than that. IP cores often target several FPGA technologies as well as ASIC. This can only be achieved by using general VHDL constructs, (or re-writing the code for each target, which non-critical IP providers don't do). FPGA and ASIC vendor have primitives that will improve speed/area when used, but if you write code to use a primitive for a technology, it doesn't mean that the resulting code will be optimized if you change the technology. Both Xilinx and Altera have DSP blocks to speed multiplication and whatnot, but they don't work exactly the same and writing code that uses the full potential of both is very challenging.
Synthesis tools are notorious for doing exactly what you ask them to, even if a more optimized solution is simple, for example:
a <= (x + y) + z; -- This is a 2 cascaded 2-input adder
b <= x + y + z; -- This is a 3-input adder
Will likely lead a different path from xyz to b/c. In the end, the designer need to know what he wants, and he has to verify that the synthesis tool understands his intent.

What's the difference between code written for a desktop machine and a supercomputer?

Hypothetically speaking, if my scientific work was leading toward the development of functions/modules/subroutines (on a desktop), what would I need to know to incorporate it into a large-scale simulation to be run on a supercomputer (which might simulate molecules, fluids, reactions, and so on)?
My impression is that it has to do with taking advantage of certain libraries (e.g., BLAS, LAPLACK) where possible, revising algorithms (reducing iteration), profiling, parallelizing, considering memory-hard disk-processor use/access... I am aware of the adage, "want to optimize your code? don't do it", but if one were interested in learning about writing efficient code, what references might be available?
I think this question is language agnostic, but since many number-crunching packages for biomolecular simulation, climate modeling, etc. are written in some version of Fortran, this language would probably be my target of interest (and I have programmed rather extensively in Fortran 77).
Profiling is a must at any level of machinery. In common usage, I've found that scaling to larger and larger grids requires a better understanding of the grid software and the topology of the grid. In that sense, everything you learn about optimizing for one machine is still applicable, but understanding the grid software gets you additional mileage. Hadoop is one of the most popular and widespread grid systems, so learning about the scheduler options, interfaces (APIs and web interfaces), and other aspects of usage will help. Although you may not use Hadoop for a given supercomputer, it is one of the less painful methods for learning about distributed computing. For parallel computing, you may pursue MPI and other systems.
Additionally, learning to parallelize code on a single machine, across multiple cores or processors, is something you can begin learning on a desktop machine.
Recommendations:
Learn to optimize code on a single machine:
Learn profiling
Learn to use optimized libraries (after profiling: so that you see the speedup)
Be sure you know algorithms and data structures very well (*)
Learn to do embarrassingly parallel programming on multiple core machines.
Later: consider multithreaded programming. It's harder and may not pay off for your problem.
Learn about basic grid software for distributed processing
Learn about tools for parallel processing on a grid
Learn to program for alternative hardware, e.g. GPUs, various specialized computing systems.
This is language agnostic. I have had to learn the same sequence in multiple languages and multiple HPC systems. At each step, take a simpler route to learn some of the infrastructure and tools; e.g. learn multicore before multithreaded, distributed before parallel, so that you can see what fits for the hardware and problem, and what doesn't.
Some of the steps may be reordered depending on local computing practices, established codebases, and mentors. If you have a large GPU or MPI library in place, then, by all means, learn that rather than foist Hadoop onto your collaborators.
(*) The reason to know algorithms very well is that as soon as your code is running on a grid, others will see it. When it is hogging up the system, they will want to know what you're doing. If you are running a process that is polynomial and should be constant, you may find yourself mocked. Others with more domain expertise may help you find good approximations for NP-hard problems, but you should know that the concept exists.
Parallelization would be the key.
Since the problems you cited (e.g. CFD, multiphysics, mass transfer) are generally expressed as large-scale linear algebra problems, you need matrix routines that parallelize well. MPI is a standard for those types of problems.
Physics can influence as well. For example, it's possible to solve some elliptical problems efficiently using explicit dynamics and artificial mass and damping matricies.
3D multiphysics means coupled differential equations with varying time scales. You'll want a fine mesh to resolve details in both space and time, so the number of degrees of freedom will rise rapidly; time steps will be governed by the stability requirements of your problem.
If someone ever figures out how to run linear algebra as a map-reduce problem they'll have it knocked.
Hypothetically speaking, if my scientific work was leading toward the development of functions/modules/subroutines (on a desktop), what would I need to know to incorporate it into a large-scale simulation to be run on a supercomputer (which might simulate molecules, fluids, reactions, and so on)?
First, you would need to understand the problem. Not all problems can be solved in parallel (and I'm using the term parallel in as wide meaning as it can get). So, see how the problem is now solved. Can it be solved with some other method quicker. Can it be divided in independent parts ... and so on ...
Fortran is the language specialized for scientific computing, and during the recent years, along with the development of new language features, there has also been some very interesting development in terms of features that are aiming for this "market". The term "co-arrays" could be an interesting read.
But for now, I would suggest reading first into a book like Using OpenMP - OpenMP is a simpler model but the book (fortran examples inside) explains nicely the fundamentals. Message parsing interface (for friends, MPI :) is a larger model, and one of often used. Your next step from OpenMP should probably go in this direction. Books on the MPI programming are not rare.
You mentioned also libraries - yes, some of those you mentioned are widely used. Others are also available. A person who does not know exactly where the problem in performance lies should IMHO never try to undertake the task of trying to rewrite library routines.
Also there are books on parallel algorithms, you might want to check out.
I think this question is language agnostic, but since many number-crunching packages for biomolecular simulation, climate modeling, etc. are written in some version of Fortran, this language would probably be my target of interest (and I have programmed rather extensively in Fortran 77).
In short it comes down to understanding the problem, learning where the problem in performance is, re-solving the whole problem again with a different approach, iterating a few times, and by that time you'll already know what you're doing and where you're stuck.
We're in a position similar to yours.
I'm most in agreement with #Iterator's answer, but I think there's more to say.
First of all, I believe in "profiling" by the random-pausing method, because I'm not really interested in measuring things (It's easy enough to do that) but in pinpointing code that is causing time waste, so I can fix it. It's like the difference between a floodlight and a laser.
For one example, we use LAPACK and BLAS. Now, in taking my stack samples, I saw a lot of the samples were in the routine that compares characters. This was called from a general routine that multiplies and scales matrices, and that was called from our code. The matrix-manipulating routine, in order to be flexible, has character arguments that tell it things like, if a matrix is lower-triangular or whatever. In fact, if the matrices are not very large, the routine can spend more than 50% of its time just classifying the problem. Of course, the next time it is called from the same place, it does the same thing all over again. In a case like that, a special routine should be written. When it is optimized by the compiler, it will be as fast as it reasonably can be, and will save all that classifying time.
For another example, we use a variety of ODE solvers. These are optimized to the nth degree of course. They work by calling user-provided routines to calculate derivatives and possibly a jacobian matrix. If those user-provided routines don't actually do much, samples will indeed show the program counter in the ODE solver itself. However, if the user-provided routines do much more, samples will find the lower end of the stack in those routines mostly, because they take longer, while the ODE code takes roughly the same time. So, optimization should be concentrated in the user-provided routines, not the ODE code.
Once you've done several of the kind of optimization that is pinpointed by stack sampling, which can speed things up by 1-2 orders of magnitude, then by all means exploit parallelism, MPI, etc. if the problem allows it.