Analyzing bitstreams using Icestorm - yosys

I'm trying to understand the bitstreams generated by Yosys/arachne-pnr as described on http://www.clifford.at/icestorm/:
The recommended approach for learning how to use this documentation is to synthesize very simple circuits using Yosys and Arachne-pnr, run the icestorm tool icebox_explain on the resulting bitstream files, and analyze the results using the HTML export of the database mentioned above. icebox_vlog can be used to convert the bitstream to Verilog. The output file of this tool will also outline the signal paths in comments added to the generated Verilog code.
In order to understand the effect a change in the bitstream has, it would be helpful if I could change the .ex file and convert it back to an ASCII bitstream (instead of having to identify the bit manually) for uploading to the FPGA. Is there a way to do so?
I'm a bit concerned about damaging the FPGA with an invalid bitstream. Are there situations where this is known to happen? Is there a way to simulate a bitstream?
Also, it would be helpful to have some kind of “higher-level” explanation format which e.g. shows the IE/REN bits on the I/O blocks to which they correspond, not the one on which they have to be set in the bitstream. Is there such a format?
I know of the possibility to generate an equivalent Verilog circuit, but the problem with this is that it doesn't usually allow me a lossless round-trip back into a bitstream. Is there a way to generate an equivalent Verilog circuit which (e.g. by instantiating the blocks explicitly) yields the exact same bitstream when processed with Yosys/arachne-pnr?

I'm a bit concerned about damaging the FPGA with an invalid bitstream. Are there situations where this is known to happen? Is there a way to simulate a bitstream?
I have not damaged any FPGA so far. (I have, however, managed to damage the serial flash on one icestick after running some test that reprogrammed it in a loop.)
But this does not mean that you cannot damage your FPGA by programming it with an invalid bitstream. You could theoretically configure the FPGA in a way that produces a driver-driver conflict. I don't know how well the hardware deals with something like that. I have not run any experiments to find out.
Also, it would be helpful to have some kind of “higher-level” explanation format which e.g. shows the IE/REN bits on the I/O blocks to which they correspond, not the one on which they have to be set in the bitstream. Is there such a format?
icebox_vlog produces a higher-level output. But it does not output things like I/O blocks, so it might be too high-level for your needs.
I know of the possibility to generate an equivalent Verilog circuit, but the problem with this is that it doesn't usually allow me a lossless round-trip back into a bitstream. Is there a way to generate an equivalent Verilog circuit which (e.g. by instantiating the blocks explicitly) yields the exact same bitstream when processed with Yosys/arachne-pnr?
Not at the moment. But it should not be too hard to extend icebox_vlog to provide this functionality. So if you really need that, it might be something within your reach to add yourself.


What is required to target a new device?

From a high-level point of view, what is required to target a new device with Yosys? I'd like to target a Xilinx XC9572XL. I have one of these development boards: XC9572XL-CPLD-development-board-v1b. The architecture of this CPLD is fairly well covered in the Xilinx documentation here.
I think I need to do the following:
Work out how to get Yosys to synthesise a design to a Sum-of-Product and D-type Flip Flop based netlist.
Output that netlist as a BLIF format from Yosys.
Create a 'fitter' (analogous to arachne-pnr for the ICE40 FPGA) for the XC9572XL
Output a JEDEC file with the appropriate fuses that need to be set to implement the design in the previous step.
Flash the design to the CPLD using xc3sprog.
It looks possible. The hard bit is building a 'fitter' tool. This tool needs to understand the CPLD's resources and then needs some clever algorithms to fit the design and output the required fuses in a JEDEC format. One important missing piece is the mapping between the 'fuses' in the physical CPLD and the fuses in the JEDEC file. This would have to be reverse engineered. I note that a JEDEC file from Xilinx WebPACK ISE contains 46656 fuses. Each of those maps back to some configurable node in the CPLD.
I'd like to know what others think about this approach. What types of issues am I likely to encounter?
What legal aspects do I need to consider if I was to undertake this? Should I write to Xilinx first and seek permission from them should I decide I want to reverse engineer a JEDEC file produced by their tool?
The XC9572XL is an obsolete part...
Work out how to get Yosys to synthesise a design to a Sum-of-Product and D-type Flip Flop based netlist.
Output that netlist as a BLIF format from Yosys.
You can do two-level synthesis with ABC from a logic-level BLIF file. For example:
$ yosys -p synth -o test.blif tests/simple/fiedler-cooley.v
$ yosys-abc
abc> read_blif test.blif
abc> collapse
abc> write_pla test.pla
Now you can aim for writing a program that converts a .pla file (plus auxiliary information that might be generated by a yosys plugin you'd need to write) to a JEDEC file.
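As a very rough illustration of the front end such a converter might need, here is a sketch in C that reads the .i/.o/.p header and the product-term lines of a Berkeley-format .pla file. The structure names and limits are made up, and the fuse mapping this would feed is exactly the part that has to be reverse engineered.
/* Minimal sketch of a Berkeley .pla reader (front end of a pla-to-JEDEC
 * converter).  Field names and limits are illustrative only. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_TERMS 1024

struct pla {
    int num_inputs;               /* .i */
    int num_outputs;              /* .o */
    int num_terms;                /* .p */
    char *inputs[MAX_TERMS];      /* e.g. "01-1" : 0, 1, don't-care, 1 */
    char *outputs[MAX_TERMS];     /* e.g. "1"    : term drives output 0 */
};

int read_pla(FILE *f, struct pla *p)
{
    char line[512];
    int n = 0;
    memset(p, 0, sizeof(*p));
    while (fgets(line, sizeof(line), f)) {
        if (line[0] == '#' || line[0] == '\n' || line[0] == '\r')
            continue;                                   /* comment / blank */
        if (!strncmp(line, ".i ", 3)) { p->num_inputs  = atoi(line + 3); continue; }
        if (!strncmp(line, ".o ", 3)) { p->num_outputs = atoi(line + 3); continue; }
        if (!strncmp(line, ".p ", 3)) { p->num_terms   = atoi(line + 3); continue; }
        if (!strncmp(line, ".e", 2))  break;            /* end marker */
        if (line[0] == '.') continue;                   /* ignore other directives */
        /* product-term line: "<input cube> <output cube>" */
        char *in  = strtok(line, " \t\r\n");
        char *out = strtok(NULL, " \t\r\n");
        if (!in || !out || n >= MAX_TERMS) return -1;
        p->inputs[n]  = strdup(in);
        p->outputs[n] = strdup(out);
        n++;
    }
    return (n == p->num_terms) ? 0 : -1;
}

/* The back end would walk inputs[]/outputs[] and set the corresponding
 * entries in a 46656-bit fuse array, then emit that array as a JEDEC
 * file -- that mapping is the part that needs to be reverse engineered. */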
What legal aspects do I need to consider if I was to undertake this?
IANAL. TINLA.
When you reverse engineer it by analyzing the software provided by the chip vendor: in this case it really depends on the country you are living in. For example, in Europe you can reverse engineer, even disassemble, software in certain situations, even when the software EULA prohibits it. I explain this in a bit more depth here.
I think reverse engineering the silicon itself (instead of analyzing the software) is less problematic in places like North America.
Have you considered targeting the CoolRunner-II family? I did some fairly extensive RE on it (https://recon.cx/2015/slides/recon2015-18-andrew-zonenberg-From-Silicon-to-Compiler.pdf) and understand the majority of the bitstream format. Porting Yosys to it is high on my priority list once I figure out the last of the clock network structure.
These devices are more recent and lower power, plus the internal architecture is cleaner and easier to target (nice regular AND/OR array vs having some pterms dedicated to certain OR terms).
In either case please contact me to discuss further, I'd love to collaborate.
EDIT: Clifford is right, reversing silicon is explicitly legal in the US (17 USC 906) while software is more of a gray area. ISE is also such a giant monster that nobody with their head screwed on right would want to reverse engineer it; the chip is a lot easier to follow.
Although the XC9500XL series is an older 350nm family (fewer metal layers, larger features, easier to see detail under a microscope), it also uses a lot of nasty analog tricks, with floating-gate EEPROM/flash cells directly in the logic and sense amplifiers on the output. CoolRunner-II is 180nm with 4 or 5 metal layers depending on density, and the main logic array is entirely digital and a lot easier to reverse engineer.

What are good compression-oriented application programming interfaces (APIs)?

Do people still use the 1991 "data compression interface" draft standard or the 1991 "stream transformation algorithm interface" draft standard (both draft standards by Ross Williams)?
Are there any alternatives to those draft standards?
(I'm particularly looking for C APIs, but links to compression-oriented APIs in C++ and other languages would also be appreciated).
I'm experimenting with some data compression algorithms.
Typically the compressed file I'm producing is composed of a series of blocks,
with a block header indicating which compression algorithm needs to be used to decompress the remaining data in that block -- Huffman, LZW, LZP, "stored uncompressed", etc.
The block header also indicates which filter(s) need to be used to convert the intermediate stream or buffer of data from the decompressor into a lossless copy of the original plaintext -- Burrows–Wheeler transform, delta encoding, XML end-tag restoration, "copy unchanged", etc.
Rather than use a huge switch statement that selects based on the "compression type" and calls the selected decompression or filter algorithm, each procedure with its own special number and order of parameters,
it simplifies my code if every algorithm has exactly the same API -- the same number and order of parameters, etc.
Rather than waiting for the decompressor to run through the entire input stream before handing its output to the first filter,
it would be nice if the API supported decompressed output data coming out of the final filter "relatively quickly" (low latency) after relatively little compressed data has been fed into the initial decompressor.
It would be nice if the API could be used in systems that have only one thread or process.
Currently I'm kludging together my own internal API,
re-using existing compression algorithm implementations by
writing short wrapper functions to convert between my internal API and the special number and order of parameters used by each implementation.
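For concreteness, the kind of uniform interface I have in mind looks roughly like this (a sketch in C; all the names here are made up for illustration):
/* Hypothetical uniform codec interface -- every compressor, decompressor
 * and filter is wrapped so it looks exactly like this. */
#include <stddef.h>

typedef struct codec_stream {
    const unsigned char *next_in;   /* next input byte to consume */
    size_t avail_in;                /* bytes available at next_in */
    unsigned char *next_out;        /* where to write output */
    size_t avail_out;               /* room available at next_out */
    void *state;                    /* per-algorithm private state */
} codec_stream;

typedef struct codec {
    const char *name;                               /* e.g. "huffman", "lzp", "bwt" */
    int  (*init)(codec_stream *s);                  /* allocate state */
    /* Consume some input, produce some output; may be called repeatedly,
     * which is what lets stage 2 start before stage 1 has finished. */
    int  (*process)(codec_stream *s, int is_last_block);
    void (*end)(codec_stream *s);                   /* free state */
} codec;

/* A block header would then just select entries from a table of codecs: */
extern const codec *codec_table[];   /* indexed by the "compression type" byte */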
Is there an already-existing API that I could use rather than designing my own from scratch?
Where can I find such an API?
I fear such an "API" does not exist.
In particular, a requirement such as "starting stage 2 while stage 1 is still ongoing and unfinished" is completely implementation dependent and cannot be added later by an API layer.
Btw, Maciej Adamczyk just tried the same thing as you.
He made an open-source benchmark comparing multiple compression algorithms over a block-compression scenario. The code can be consulted here:
http://encode.ru/threads/1371-Filesystem-benchmark?p=26630&viewfull=1#post26630
He had to "encapsulate" all these different compressor interfaces in order to cope with the differences.
Now for the good news: most compressors tend to have a relatively similar C interface when it comes to compressing a block of data.
As an example, they can be as simple as this one:
http://code.google.com/p/lz4/source/browse/trunk/lz4.h
So, in the end, the adaptation layer is not so heavy.
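For instance, assuming the one-shot block functions of a recent lz4.h (LZ4_compress_default / LZ4_decompress_safe), a wrapper onto a common block-compression signature can be just a few lines (the common signature itself is hypothetical):
/* Example adaptation layer: wrapping LZ4's one-shot block API behind a
 * common "compress/decompress a buffer" signature. */
#include <lz4.h>

/* Common signature used for every compressor in the benchmark:
 * returns the number of bytes written to dst, or 0 on error. */
typedef int (*block_fn)(const char *src, int src_size,
                        char *dst, int dst_capacity);

static int lz4_compress_wrapper(const char *src, int src_size,
                                char *dst, int dst_capacity)
{
    return LZ4_compress_default(src, dst, src_size, dst_capacity);
}

static int lz4_decompress_wrapper(const char *src, int src_size,
                                  char *dst, int dst_capacity)
{
    int n = LZ4_decompress_safe(src, dst, src_size, dst_capacity);
    return n < 0 ? 0 : n;   /* normalise LZ4's negative error codes to 0 */
}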

Creating simple waveforms with CoreAudio

I am new to CoreAudio, and I would like to output a simple sine wave and square wave with a given frequency and amplitude through the speakers using CA. I don't want to use sound files as I want to synthesize the sound.
What do I need to do this? And can you give me an example or tutorial? Thanks.
There are a number of errors in the previous answer. I, the legendary :-) James McCartney, not James Harkins, wrote the sinewavedemo; I also wrote SuperCollider, which is what the audiosynth.com website is about. I also now work at Apple on CoreAudio. The sinewavedemo DOES use CoreAudio, since it uses AudioHardware.h from CoreAudio.framework as its way to play the sound.
You should not use the sinewavedemo. It is very old code and it makes dangerous assumptions about the buffer layout of the audio hardware. The easiest way nowadays to play a sound that you are generating is to use the AudioQueue, or to use an output audio unit with a render callback set.
The best and easiest way to do that without files is to prepare a single-cycle buffer containing one cycle of the wave (this is technically called a wavetable).
In the playback function called by CoreAudio thread, fill the output buffer with samples read from the wave buffer.
Note, however, that you will face two problems very quickly:
- for the sine wave, if the playback frequency is not an integer multiple of the desired sine frequency, you will probably need to implement an interpolator if you want good quality. Using only an integer sample index will generate a significant level of harmonic noise.
- for the square wave, avoid just filling an array with +1/-1 values. Such a signal is not band-limited and will alias a lot. Do not forget that the spectrum of a square wave is virtually infinite!
To get good algorithms for signal generation, take a look at musicdsp.org; it is probably one of the best resources for that.
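As a rough illustration of the wavetable approach described above, here is a sketch in C of a single-cycle table plus a fractional phase accumulator with linear interpolation (no CoreAudio glue; the table size and names are arbitrary):
/* Single-cycle wavetable oscillator with linear interpolation (sketch). */
#include <math.h>

#define TABLE_SIZE 1024

static float  sine_table[TABLE_SIZE];
static double phase      = 0.0;          /* current position in the table */
static double phase_incr = 0.0;          /* table steps per output sample */

void osc_init(double freq_hz, double sample_rate)
{
    for (int i = 0; i < TABLE_SIZE; i++)
        sine_table[i] = sinf(2.0f * (float)M_PI * i / TABLE_SIZE);
    phase_incr = freq_hz * TABLE_SIZE / sample_rate;
}

/* Called from the audio render callback to fill one buffer. */
void osc_render(float *out, int num_frames, float amplitude)
{
    for (int n = 0; n < num_frames; n++) {
        int    i0   = (int)phase;
        int    i1   = (i0 + 1) % TABLE_SIZE;
        double frac = phase - i0;
        /* linear interpolation between adjacent table entries */
        out[n] = amplitude * (float)((1.0 - frac) * sine_table[i0]
                                     + frac       * sine_table[i1]);
        phase += phase_incr;
        if (phase >= TABLE_SIZE)
            phase -= TABLE_SIZE;
    }
}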
Are you new to audio programming in general? As a starting point I would check out
http://www.audiosynth.com/sinewavedemo.html
This is a minimal OS X sine-wave implementation by the legendary James Harkins. Note, it doesn't use CoreAudio at all.
If you specifically want to use CoreAudio for your sine wave, you need to create an output unit (RemoteIO on the iPhone, AUHAL on OS X) and supply an input callback, where you can pretty much use the code from the above example. Check out
http://developer.apple.com/mac/library/technotes/tn2002/tn2091.html
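In condensed form, the output-unit setup on a current OS X SDK looks roughly like this (a sketch using the AudioComponent API, error handling omitted; the render callback is where you generate your samples):
/* Sketch: create the default output AudioUnit on OS X and attach a
 * render callback.  Error handling omitted for brevity. */
#include <AudioUnit/AudioUnit.h>

static OSStatus render(void *inRefCon,
                       AudioUnitRenderActionFlags *ioActionFlags,
                       const AudioTimeStamp *inTimeStamp,
                       UInt32 inBusNumber,
                       UInt32 inNumberFrames,
                       AudioBufferList *ioData)
{
    /* Fill ioData->mBuffers[i].mData with inNumberFrames samples of your
     * sine/square wave here. */
    return noErr;
}

void start_output(void)
{
    AudioComponentDescription desc = {
        .componentType         = kAudioUnitType_Output,
        .componentSubType      = kAudioUnitSubType_DefaultOutput,  /* AUHAL-based default output */
        .componentManufacturer = kAudioUnitManufacturer_Apple,
    };
    AudioComponent comp = AudioComponentFindNext(NULL, &desc);

    AudioUnit unit;
    AudioComponentInstanceNew(comp, &unit);

    AURenderCallbackStruct cb = { .inputProc = render, .inputProcRefCon = NULL };
    AudioUnitSetProperty(unit, kAudioUnitProperty_SetRenderCallback,
                         kAudioUnitScope_Input, 0, &cb, sizeof(cb));

    AudioUnitInitialize(unit);
    AudioOutputUnitStart(unit);
}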
The benefits of CoreAudio are chiefly that you can chain other effects with your sine wave, write plugins for hosts like Logic and provide the interfaces for them, or write a host (like Logic) for plugins that can be chained together.
If you don't want to write a plugin or host plugins, then CoreAudio might not actually be for you. But one of the best things about using CoreAudio is that once you get your sine-wave callback working, it is easy to add effects or mix multiple sines together.
To do this you need to put your output unit in a graph, to which you can add effects, mixers, etc.
Here is some help on setting up graphs http://timbolstad.com/2010/03/16/core-audio-getting-started-pt2/
It isn't as difficult as it looks. Apple provides C++ helper classes for many things (/Developer/Examples/CoreAudio/PublicUtility) and even if you don't want to use C++ (you don't have to!) they can be a useful guide to the CoreAudio API.
If you are not doing this in real time, using the sin() function from math.h is not a bad idea. Just fill however many samples you need with sin() beforehand, and when it is time to play them, send them to the audio buffer. sin() can be quite slow to call once every sample if you are doing this in real time; using an interpolated wavetable lookup method is much faster, but the resulting sound will not be as spectrally pure.
There is a good and well documented sine wave player code example in Chapter 7 of the Adamson/Avila "Learning Core Audio" book, published by Addison-Wesley Professional (ISBN-10: 0-321-63684-8 ):
http://www.informit.com/store/learning-core-audio-a-hands-on-guide-to-audio-programming-9780321636843
It is a rather new publication (2012) and addresses precisely the issue of this question. It's only a starting point, but it's a valuable starting point.
BTW, don't jump to graphs before you have this basic lesson (which involves some math) behind you.
Concerning example code, a quick and efficient method I often use deals with a pre-filled sine-wave lookup table which has as many entries as the sample rate; for 44100 Hz the table has a size of 44100. In other words, the cycle length equals the sample rate. This gives an acceptable trade-off between speed and quality in many cases. You can initialize it when the program starts.
If you generate floating-point samples (which is the default in OS X) and use math functions, use sinf() rather than (float)sin(). Promotions in the inner loop of a render callback are always resource-expensive. So are repeated multiplications of constants, such as 2.0*M_PI, which can too often be found in code examples.
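A minimal sketch of that table approach (names are arbitrary; because the table length equals the sample rate, the integer phase increment per output sample is simply the frequency in Hz):
/* Sketch: sine lookup table whose length equals the sample rate, so the
 * phase increment per output sample is just the frequency in Hz. */
#include <math.h>
#include <stdlib.h>

static float *table;
static int    table_len;      /* == sample rate, e.g. 44100 */

void table_init(int sample_rate)
{
    table_len = sample_rate;
    table = malloc(sizeof(float) * table_len);
    const float two_pi = 2.0f * (float)M_PI;     /* compute the constant once */
    for (int i = 0; i < table_len; i++)
        table[i] = sinf(two_pi * i / table_len);
}

void table_render(float *out, int num_frames, int freq_hz, float amp)
{
    static int phase = 0;
    for (int n = 0; n < num_frames; n++) {
        out[n] = amp * table[phase];
        /* integer phase; no interpolation needed for integer frequencies */
        phase = (phase + freq_hz) % table_len;
    }
}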

How can I accelerate the generation of an MD5 checksum within VB.NET?

I'm working with some very large files residing on P2 (Panasonic) cards. Part of the process we employ is to first generate a checksum of the file we are going to copy, then copy the file, then run a checksum on the copy to confirm that it copied OK. The problem is that the files are large (70 GB+) and take a long time to process. It's an issue since we will eventually be dealing with thousands of these files.
I would like to find a faster way to generate the checksum other than using the System.Security.Cryptography.MD5CryptoServiceProvider
I don't care if this means using a specialized hardware card, provided it works and is not too ungodly expensive. I would prefer a method that provides some feedback on how far along the process is, so I can display progress like I do now.
The application is written in VB.NET. I would prefer to be able to use it as a component, library, or reference within my application, but I'm willing to call an outside application if there is enough improvement in the speed of generating the checksum.
Needless to say, the checksum must be consistent and correct. :-)
Thank you in advance for your time and efforts,
Richard
I see one potential way to speed up this process: calculate the MD5 of the source file while performing the copy, not prior to it. This will reduce the number of times you'll need to read the entire file from 3 (source hash, copy, destination hash) to 2 (copy, destination hash).
The downside of all this is that you'll have to write your own copying code (as opposed to just relying on System.IO.File.Copy), and there's a non-zero chance that this will turn out to be slower in the end than the 3-step process anyway.
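The shape of that loop, sketched here in C with OpenSSL's MD5 functions for brevity (in .NET the analogous pattern is to feed each buffer through the hash with TransformBlock while writing it out), is simply read, update, write:
/* Sketch of "hash while copying": one read pass feeds both the MD5 state
 * and the destination file. */
#include <stdio.h>
#include <openssl/md5.h>

int copy_with_md5(const char *src_path, const char *dst_path,
                  unsigned char digest[MD5_DIGEST_LENGTH])
{
    FILE *src = fopen(src_path, "rb");
    FILE *dst = fopen(dst_path, "wb");
    if (!src || !dst) return -1;

    MD5_CTX ctx;
    MD5_Init(&ctx);

    unsigned char buf[64 * 1024];
    size_t n;
    while ((n = fread(buf, 1, sizeof(buf), src)) > 0) {
        MD5_Update(&ctx, buf, n);                    /* hash the chunk ...   */
        if (fwrite(buf, 1, n, dst) != n) return -1;  /* ... and write it out */
        /* a progress callback (bytes copied so far) could go here */
    }
    MD5_Final(digest, &ctx);

    fclose(src);
    fclose(dst);
    return 0;
}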
Other than that, I don't think there's much you can do here, as the entire process is I/O bound by design. You're spending most of your time reading/writing the file, and even at 100MB/s (a respectable I/O speed for your typical SATA drive), you'll do about 5.8GB/min at best.
With a modern processor, the overhead of calculating the MD5 (or anything else) doesn't factor into things very much, so speeding it up won't improve your overall throughput. Crypto accelerators in particular won't help you here, as unless the driver implementation is very efficient, they'll add more overhead due to context switches required to feed the data to the external card than they'll save.
What you do want to improve is the I/O speed. The .NET framework is already pretty efficient when it comes to this (using nicely-sized buffers, overlapped I/O and such), but it's possible an optimized native Windows application will perform better here. My advice: Google around for a few native MD5 calculators, and see how they compare to your current .NET implementation. If the difference in hash calculation speed is >10%, it's worth switching to using said external app.
The correct answer is to avoid using MD5. MD5 is a cryptographic hash function, designed to provide certain cryptographic features. For merely detecting accidental corruption, it is way over-engineered and slow. There are many faster checksums, whose design can be understood by examining the literature on error detection and correction. Some common examples are the CRC checksums, of which CRC32 is very common, but you can also relatively easily compute 64-bit, 128-bit, or even larger CRCs much, much faster than an MD5 hash.
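For illustration, a standard table-driven CRC-32 (the common reflected 0xEDB88320 polynomial, the same CRC used by zip/gzip/PNG) is only a few lines and touches each byte once:
/* Standard table-driven CRC-32 (reflected polynomial 0xEDB88320). */
#include <stdint.h>
#include <stddef.h>

static uint32_t crc_table[256];

void crc32_init(void)
{
    for (uint32_t i = 0; i < 256; i++) {
        uint32_t c = i;
        for (int k = 0; k < 8; k++)
            c = (c & 1) ? 0xEDB88320u ^ (c >> 1) : c >> 1;
        crc_table[i] = c;
    }
}

/* Feed data incrementally: start with crc = 0xFFFFFFFF, and finish by
 * XORing the result with 0xFFFFFFFF. */
uint32_t crc32_update(uint32_t crc, const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        crc = crc_table[(crc ^ buf[i]) & 0xFF] ^ (crc >> 8);
    return crc;
}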

What would be a good (de)compression routine for this scenario

I need a FAST decompression routine optimized for a restricted-resource environment like an embedded system, operating on binary (hex) data with the following characteristics:
Data is 8-bit (byte) oriented (the data bus is 8 bits wide).
Byte values do NOT range uniformly from 0 - 0xFF, but have a Poisson distribution (bell curve) in each DataSet.
The dataset is fixed in advance (to be burnt into Flash) and each set is rarely > 1 - 2 MB.
Compression can take as much time as required, but decompression of a byte should take 23 µs in the worst-case scenario, with a minimal memory footprint, as it will be done on a restricted-resource environment like an embedded system (3 MHz - 12 MHz core, 2 KB of RAM).
What would be a good decompression routine?
Basic run-length encoding seems too wasteful - I can immediately see that adding a header section to the compressed data, to put unused byte values to use representing oft-repeated patterns, would give phenomenal performance!
If I can see that after investing only a few minutes, surely there must already be much better algorithms from people who love this stuff?
I would like to have some "ready to go" examples to try out on a PC so that I can compare the performance vis-a-vis a basic RLE.
The two solutions I use when performance is the only concern:
LZO - has a GPL license.
liblzf - has a BSD license.
miniLZO.tar.gz - this is LZO, just repacked into a 'minified' version that is better suited to embedded development.
Both are extremely fast when decompressing. I've found that LZO will create slightly smaller compressed data than liblzf in most cases. You'll need to do your own benchmarks for speeds, but I consider them to be "essentially equal". Both are light-years faster than zlib, though neither compresses as well (as you would expect).
LZO, in particular miniLZO, and liblzf are both excellent for embedded targets.
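To give a feel for how small the decompression side is, a minimal miniLZO call looks roughly like this (a sketch; check minilzo.h on your toolchain for the exact types, and note that lzo_init() would normally be called once at start-up):
/* Sketch: decompressing a block with miniLZO.  Decompression needs no
 * working memory, which is part of what makes it attractive for small
 * targets. */
#include "minilzo.h"

int decompress_block(const unsigned char *src, lzo_uint src_len,
                     unsigned char *dst, lzo_uint dst_capacity)
{
    if (lzo_init() != LZO_E_OK)     /* cheap; normally done once at start-up */
        return -1;

    lzo_uint out_len = dst_capacity;
    int r = lzo1x_decompress_safe(src, src_len, dst, &out_len, NULL);
    return (r == LZO_E_OK) ? (int)out_len : -1;
}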
If you have a preset distribution of values, meaning the probability of each value is fixed over all datasets, you can create a Huffman encoding with fixed codes (the code tree does not have to be embedded in the data).
Depending on the data, I'd try Huffman with fixed codes or LZ77 (see Brian's links).
Well, the main two algorithms that come to mind are Huffman and LZ.
The first basically just creates a dictionary. If you restrict the dictionary's size sufficiently, it should be pretty fast...but don't expect very good compression.
The latter works by adding back-references to repeating portions of the output file. This probably would take very little memory to run, except that you would need to either use file I/O to read the back-references or store a chunk of the recently read data in RAM.
I suspect LZ is your best option, if the repeated sections tend to be close to one another. Huffman works by having a dictionary of often repeated elements, as you mentioned.
Since this seems to be audio, I'd look at either differential PCM or ADPCM, or something similar, which will reduce it to 4 bits/sample without much loss in quality.
With the most basic differential PCM implementation, you just store a 4-bit signed difference between the current sample and an accumulator, add that difference to the accumulator, and move to the next sample. If the difference is outside [-8, 7], you have to clamp the value, and it may take several samples for the accumulator to catch up. Decoding is very fast and uses almost no memory: just add each value to the accumulator and output the accumulator as the next sample.
A small improvement over basic DPCM, to help the accumulator catch up faster when the signal gets louder and higher-pitched, is to use a lookup table to decode the 4-bit values to a larger non-linear range, where they are still 1 apart near zero but increase in larger increments toward the limits. And/or you could reserve one of the values to toggle a multiplier; deciding when to use it is up to the encoder. With these improvements, you can either achieve better quality or get away with 3 bits per sample instead of 4.
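A sketch of what the decode side of that scheme looks like in C (the step-table values here are purely illustrative; a real encoder/decoder pair would agree on the table):
/* Sketch: 4-bit differential PCM decoder.  Each nibble is looked up in a
 * small non-linear step table and added to an accumulator, which is the
 * output sample. */
#include <stdint.h>
#include <stddef.h>

/* 16 step sizes: fine resolution near zero, coarser steps at the extremes
 * so the accumulator can catch up on loud / high-pitched passages. */
static const int16_t step_table[16] = {
    -128, -64, -32, -16, -8, -4, -2, -1,
       1,   2,   4,   8, 16, 32, 64, 128
};

void dpcm_decode(const uint8_t *in, size_t num_bytes, int16_t *out)
{
    int32_t acc = 0;                        /* the accumulator */
    size_t  o   = 0;

    for (size_t i = 0; i < num_bytes; i++) {
        /* two samples per input byte: high nibble first, then low nibble */
        acc += step_table[in[i] >> 4];
        out[o++] = (int16_t)acc;
        acc += step_table[in[i] & 0x0F];
        out[o++] = (int16_t)acc;
    }
}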
If your device has a non-linear μ-law or A-law ADC, you can get quality comparable to 11-12 bit with 8 bit samples. Or you can probably do it yourself in your decoder. http://en.wikipedia.org/wiki/M-law_algorithm
There might be inexpensive chips out there that already do all this for you, depending on what you're making. I haven't looked into any.
You should try different compression algorithms, using either a compression software tool with command-line switches or a compression library where you can try out several algorithms.
Use typical data for your application.
Then you will know which algorithm is best suited to your needs.
I have used zlib in embedded systems for a bootloader that decompresses the application image to RAM on start-up. The licence is nicely permissive, no GPL nonsense. It does make a single malloc call, but in my case I simply replaced this with a stub that returned a pointer to a static block, and a corresponding free() stub. I did this by monitoring its memory allocation usage to get the size right. If your system can support dynamic memory allocation, then it is much simpler.
http://www.zlib.net/
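For reference, one clean way to avoid touching malloc at all is to use the zalloc/zfree hooks that z_stream already provides, rather than stubbing malloc itself; a sketch of that pattern (the pool size here is a placeholder that must be determined by measuring real usage, as described above):
/* Sketch: zlib inflate with static memory, using the zalloc/zfree hooks
 * in z_stream.  POOL_SIZE is a guess; size it by measuring the real
 * allocation pattern. */
#include <string.h>
#include <zlib.h>

#define POOL_SIZE (32 * 1024)
static unsigned char pool[POOL_SIZE];
static size_t pool_used;

static voidpf static_alloc(voidpf opaque, uInt items, uInt size)
{
    size_t bytes = ((size_t)items * size + 7u) & ~(size_t)7u;  /* keep 8-byte alignment */
    if (pool_used + bytes > POOL_SIZE)
        return Z_NULL;
    voidpf p = pool + pool_used;
    pool_used += bytes;
    return p;
}

static void static_free(voidpf opaque, voidpf address)
{
    /* nothing to do: the whole pool is reset before the next decompression */
    (void)opaque; (void)address;
}

int decompress_image(const unsigned char *src, unsigned long src_len,
                     unsigned char *dst, unsigned long dst_len)
{
    z_stream strm;
    memset(&strm, 0, sizeof(strm));
    strm.zalloc    = static_alloc;
    strm.zfree     = static_free;
    strm.next_in   = (Bytef *)src;
    strm.avail_in  = src_len;
    strm.next_out  = dst;
    strm.avail_out = dst_len;
    pool_used = 0;

    if (inflateInit(&strm) != Z_OK)
        return -1;
    int r = inflate(&strm, Z_FINISH);
    inflateEnd(&strm);
    return (r == Z_STREAM_END) ? 0 : -1;
}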