Hardware implementation of Data Encryption Standard (S-boxes /permutations) - hardware

I am trying to implement the DES circuit and according to a lot of papers, the S-boxes usually is implemented using a SRL or LUT, i'm not familiar with SRL, so i thought i use 8 LUT, each one has 6 adress lines and 4 data lines ( the 1st 2 adress lines represent the 1st and last bits of the bloc, and the 4 other adress lines represent the rest which will define the column)
For example if we take S-box 1 (shown in this figure)
Here is the table that comes with it
That's just for one box, it seems to me wrong, to do all of the boxes we have to write 512 lines. My first question is: is a LUT a hardware component? if so, i am using it correctly? and, is there a more appropriate implementation or representation?
My second question is: what does it mean hardware wiring? I found out that all the permutation function could be implemented using wire crossing, i didn't understand it. Should i make a wire for every bit?

Related

Block types in GNU Radio

I am still learning GNU Radio and I have trouble understanding something about signal processing block type. I understand that if I create a block taking let say 2 samples in the input and output 4 samples, it will be an interpolator of 2.
But now, I would like to create a block which will be a framer. So, it will have two inputs and one output. The block will receives the n samples from the first input, then take m inputs from the second input and append to the samples received from input one, and then output them. In this case, my samples are supposed to be bytes.
How to proceed in this case please ? Am I taking the right path like that? Do any one know to proceed with this type of scenario?
Your case (input 0 and input 1 having different relative rates to the output) is not covered by the sync_block/interpolator/decimator "templates" that GNU Radio has, so you have to use the general block approach.
Assuming you're familiar with gr_modtool¹, you can use it to add things like interpolator (relative rate >1), decimators (<1) and sync (=1) blocks:
-t BLOCK_TYPE, --block-type=BLOCK_TYPE
One of sink, source, sync, decimator, interpolator,
general, tagged_stream, hier, noblock.
But also note the general type. Using that, you can implement a block that doesn't have any restrictions on the relation between in- and output. That means that
you will have to manually consume() items from the inputs, because the number of items you took from the input can no longer be derived by the number of output items, and
you will have to implement a forecast method to tell the GNU Radio scheduler how much items you'll need for a given output.
gr_modtool will give you a stub where you'll only have to add the right code!
¹ if you're not: It's introduced in the GNU Radio Guided Tutorials, part 3 or so, somethig that I think will be a quick and fun read to you.
Considering that the question was asked 4 years ago and that there has been many changes in GNU Radio since then, I want to add to the answer that now this is possible to do with the Patterned Interleaver block.
patterned_interleaver_image
This block works the following way: it receives inputs in the ports to the left and outputs a single interleaved pattern in the port that is to the right. So let's imagine a block with 2 inputs, V1 and V2:
V1 = [0,1,0,0,1,1]
V2 = [1,1,1,0,1,0]
Suppose we want the output to be the first 2 bits of V1 followed by the first 2 bits of V2 followed by the next 2 bits of V1 and then the next 2 bits of V2 and so on...that is, we want the output to be
Vo = [0,1,1,1,0,0,1,0,1,1,1,0].
In order to accomplish this we go to the properties of the Patterned Interleaver block which looks like this:
patterned_interleaver_properties
The Pattern field allows us to control the order in which the bits in the input ports will be interleaved. By default they are in [0,0,1,1] meaning that the block will take 2 bits from input port 0 followed by 2 bits from input port 1. The corresponding output will be
[0,1,1,1,0,0,1,0,1,1,1,0],
that is, the first 2 bits of V1 followed by the first 2 bits of V2 and then the next 2 bits of V1, etc.
Let's see another example. In case the Pattern field is set to [0,0,1,1,1,0] the output will be 2 bits from input port 0 followed by 3 bits from input port 1 and then 1 bit from input port 0. In the output we will obtain [0,1,1,1,1,0,0,1,0,1,0,0].
Lastly, the Pattern field is also used to determine the number of input ports. If the Pattern field is [0,0,1,2] we will see that another input port is added to the block.
patterned_interleaver_3_inputs

Method to get non-base units?

Is there a method of using the exponent properties of LabView units for carrying custom units? For example I would find it convenient to use milli-Amperes instead of Amperes in my data wires.
My first attempt at doing so looks like this, but trying to get the value out at the end gives me nothing.
I would find it convenient to use milli-Amperes instead of Amperes in my data wires
For a wire, it's not possible, and it's not a problem, here's why:
I'm afraid what you want make little sense, since you're milli-Amperes instead of Amperes refers to representing your data, while a wire is just raw data. Adding the milli- to a floating point changes the exponent, not the mantissa, so there's no loss or gain of precision in the value that your number carries.
Now if we talk about an indicator which is technically a display of the wire value, you change the unit from "A" to "mA" to have the display you want.
Finally, in your attempt with "set numeric info", the -3 factor added next to Amperes means the unit is A^-3, not mA.
You can use data that don't use units, however than you will loose your automatic check of the units.
For display properties you can tweak the display format to show different outputs:
This format string is constructed as following:
% numeric
^ engineering notation, exponents in multiples of three
# no trailing zeros
_6 six significat digits
e scientific notation (1e1 for instance)
The prefix is the best way to affect the presentation of the value on a specific front panel.
When passing data from VI to VI, the prefix is not passed, and the data uses the base ( Amps, Volts, etc...)
In my example below, the unitless value 3 is assigned units of Amp in mA.vi. The front panel indicator is set to show units of mA.
In Watts.vi I multiply the Amps OUT of mA.vi by a constant of 9V and the result is wired to the indicator x*y.
x*y has units of W and I changed the prefix to k for presentation.
The NI forums have several threads that report certain functions (square and square root specifically) can cause unit errors or broken wires. Most folks don't even know the units capability exists, and most that do have tried and abandoned them. :)

Advice for bit level manipulation

I'm currently working on a project that involves a lot of bit level manipulation of data such as comparison, masking and shifting. Essentially I need to search through chunks of bitstreams between 8kbytes - 32kbytes long for bit patterns between 20 - 40bytes long.
Does anyone know of general resources for optimizing for such operations in CUDA?
There has been a least a couple of questions on SO on how to do text searches with CUDA. That is, finding instances of short byte-strings in long byte-strings. That is similar to what you want to do. That is, a byte-string search is much like a bit-string search where the number of bits in the byte-string can only be a multiple of 8, and the algorithm only checks for matches every 8 bits. Search on SO for CUDA string searching or matching, and see if you can find them.
I don't know of any general resources for this, but I would try something like this:
Start by preparing 8 versions of each of the search bit-strings. Each bit-string shifted a different number of bits. Also prepare start and end masks:
start
01111111
00111111
...
00000001
end
10000000
11000000
...
11111110
Then, essentially, perform byte-string searches with the different bit-strings and masks.
If you're using a device with compute capability >= 2.0, store the shifted bit-strings in global memory. The start and end masks can probably just be constants in your program.
Then, for each byte position, launch 8 threads that each checks a different version of the 8 shifted bit-strings against the long bit-string (which you now treat like a byte-string). In each block, launch enough threads to check, for instance, 32 bytes, so that the total number of threads per block becomes 32 * 8 = 256. The L1 cache should be able to hold the shifted bit-strings for each block, so that you get good performance.

Custom EQ AudioUnit on iOS

The only effect AudioUnit on iOS is the "iTunes EQ", which only lets you use EQ pre-sets. I would like to use a customized eq in my audio graph
I came across this question on the subject and saw an answer suggesting using this DSP code in the render callback. This looks promising and people seem to be using this effectively on various platforms. However, my implementation has a ton of noise even with a flat eq.
Here's my 20 line integration into the "MixerHostAudio" class of Apple's "MixerHost" example application (all in one commit):
https://github.com/tassock/mixerhost/commit/4b8b87028bfffe352ed67609f747858059a3e89b
Any ideas on how I could get this working? Any other strategies for integrating an EQ?
Edit: Here's an example of the distortion I'm experiencing (with the eq flat):
http://www.youtube.com/watch?v=W_6JaNUvUjA
In the code in EQ3Band.c, the filter coefficients are used without being initialized. The init_3band_state method initialize just the gains and frequencies, but the coefficients themselves - es->f1p0 etc. are not initialized, and therefore contain some garbage values. That might be the reason for the bad output.
This code seems wrong in more then one way.
A digital filter is normally represented by the filter coefficients, which are constant, the filter inner state history (since in most cases the output depends on history) and the filter topology, which is the arithmetic used to calculate the output given the input and the filter (coeffs + state history). In most cases, and of course when filtering audio data, you expect to get 0's at the output if you feed 0's to the input.
The problems in the code you linked to:
The filter coefficients are changed in each call to the processing method:
es->f1p0 += (es->lf * (sample - es->f1p0)) + vsa;
The input sample is usually multiplied by the filter coefficients, not added to them. It doesn't make any physical sense - the sample and the filter coeffs don't even have the same physical units.
If you feed in 0's, you do not get 0's at the output, just some values which do not make any sense.
I suggest you look for another code - the other option is debugging it, and it would be harder.
In addition, you'd benefit from reading about digital filters:
http://en.wikipedia.org/wiki/Digital_filter
https://ccrma.stanford.edu/~jos/filters/

What are the most hardcore optimisations you've seen?

I'm not talking about algorithmic stuff (eg use quicksort instead of bubblesort), and I'm not talking about simple things like loop unrolling.
I'm talking about the hardcore stuff. Like Tiny Teensy ELF, The Story of Mel; practically everything in the demoscene, and so on.
I once wrote a brute force RC5 key search that processed two keys at a time, the first key used the integer pipeline, the second key used the SSE pipelines and the two were interleaved at the instruction level. This was then coupled with a supervisor program that ran an instance of the code on each core in the system. In total, the code ran about 25 times faster than a naive C version.
In one (here unnamed) video game engine I worked with, they had rewritten the model-export tool (the thing that turns a Maya mesh into something the game loads) so that instead of just emitting data, it would actually emit the exact stream of microinstructions that would be necessary to render that particular model. It used a genetic algorithm to find the one that would run in the minimum number of cycles. That is to say, the data format for a given model was actually a perfectly-optimized subroutine for rendering just that model. So, drawing a mesh to the screen meant loading it into memory and branching into it.
(This wasn't for a PC, but for a console that had a vector unit separate and parallel to the CPU.)
In the early days of DOS when we used floppy discs for all data transport there were viruses as well. One common way for viruses to infect different computers was to copy a virus bootloader into the bootsector of an inserted floppydisc. When the user inserted the floppydisc into another computer and rebooted without remembering to remove the floppy, the virus was run and infected the harddrive bootsector, thus permanently infecting the host PC. A particulary annoying virus I was infected by was called "Form", to battle this I wrote a custom floppy bootsector that had the following features:
Validate the bootsector of the host harddrive and make sure it was not infected.
Validate the floppy bootsector and
make sure that it was not infected.
Code to remove the virus from the
harddrive if it was infected.
Code to duplicate the antivirus
bootsector to another floppy if a
special key was pressed.
Code to boot the harddrive if all was
well, and no infections was found.
This was done in the program space of a bootsector, about 440 bytes :)
The biggest problem for my mates was the very cryptic messages displayed because I needed all the space for code. It was like "FFVD RM?", which meant "FindForm Virus Detected, Remove?"
I was quite happy with that piece of code. The optimization was program size, not speed. Two quite different optimizations in assembly.
My favorite is the floating point inverse square root via integer operations. This is a cool little hack on how floating point values are stored and can execute faster (even doing a 1/result is faster than the stock-standard square root function) or produce more accurate results than the standard methods.
In c/c++ the code is: (sourced from Wikipedia)
float InvSqrt (float x)
{
float xhalf = 0.5f*x;
int i = *(int*)&x;
i = 0x5f3759df - (i>>1); // Now this is what you call a real magic number
x = *(float*)&i;
x = x*(1.5f - xhalf*x*x);
return x;
}
A Very Biological Optimisation
Quick background: Triplets of DNA nucleotides (A, C, G and T) encode amino acids, which are joined into proteins, which are what make up most of most living things.
Ordinarily, each different protein requires a separate sequence of DNA triplets (its "gene") to encode its amino acids -- so e.g. 3 proteins of lengths 30, 40, and 50 would require 90 + 120 + 150 = 360 nucleotides in total. However, in viruses, space is at a premium -- so some viruses overlap the DNA sequences for different genes, using the fact that there are 6 possible "reading frames" to use for DNA-to-protein translation (namely starting from a position that is divisible by 3; from a position that divides 3 with remainder 1; or from a position that divides 3 with remainder 2; and the same again, but reading the sequence in reverse.)
For comparison: Try writing an x86 assembly language program where the 300-byte function doFoo() begins at offset 0x1000... and another 200-byte function doBar() starts at offset 0x1001! (I propose a name for this competition: Are you smarter than Hepatitis B?)
That's hardcore space optimisation!
UPDATE: Links to further info:
Reading Frames on Wikipedia suggests Hepatitis B and "Barley Yellow Dwarf" virus (a plant virus) both overlap reading frames.
Hepatitis B genome info on Wikipedia. Seems that different reading-frame subunits produce different variations of a surface protein.
Or you could google for "overlapping reading frames"
Seems this can even happen in mammals! Extensively overlapping reading frames in a second mammalian gene is a 2001 scientific paper by Marilyn Kozak that talks about a "second" gene in rat with "extensive overlapping reading frames". (This is quite surprising as mammals have a genome structure that provides ample room for separate genes for separate proteins.) Haven't read beyond the abstract myself.
I wrote a tile-based game engine for the Apple IIgs in 65816 assembly language a few years ago. This was a fairly slow machine and programming "on the metal" is a virtual requirement for coaxing out acceptable performance.
In order to quickly update the graphics screen one has to map the stack to the screen in order to use some special instructions that allow one to update 4 screen pixels in only 5 machine cycles. This is nothing particularly fantastic and is described in detail in IIgs Tech Note #70. The hard-core bit was how I had to organize the code to make it flexible enough to be a general-purpose library while still maintaining maximum speed.
I decomposed the graphics screen into scan lines and created a 246 byte code buffer to insert the specialized 65816 opcodes. The 246 bytes are needed because each scan line of the graphics screen is 80 words wide and 1 additional word is required on each end for smooth scrolling. The Push Effective Address (PEA) instruction takes up 3 bytes, so 3 * (80 + 1 + 1) = 246 bytes.
The graphics screen is rendered by jumping to an address within the 246 byte code buffer that corresponds to the right edge of the screen and patching in a BRanch Always (BRA) instruction into the code at the word immediately following the left-most word. The BRA instruction takes a signed 8-bit offset as its argument, so it just barely has the range to jump out of the code buffer.
Even this isn't too terribly difficult, but the real hard-core optimization comes in here. My graphics engine actually supported two independent background layers and animated tiles by using different 3-byte code sequences depending on the mode:
Background 1 uses a Push Effective Address (PEA) instruction
Background 2 uses a Load Indirect Indexed (LDA ($00),y) instruction followed by a push (PHA)
Animated tiles use a Load Direct Page Indexed (LDA $00,x) instruction followed by a push (PHA)
The critical restriction is that both of the 65816 registers (X and Y) are used to reference data and cannot be modified. Further the direct page register (D) is set based on the origin of the second background and cannot be changed; the data bank register is set to the data bank that holds pixel data for the second background and cannot be changed; the stack pointer (S) is mapped to graphics screen, so there is no possibility of jumping to a subroutine and returning.
Given these restrictions, I had the need to quickly handle cases where a word that is about to be pushed onto the stack is mixed, i.e. half comes from Background 1 and half from Background 2. My solution was to trade memory for speed. Because all of the normal registers were in use, I only had the Program Counter (PC) register to work with. My solution was the following:
Define a code fragment to do the blend in the same 64K program bank as the code buffer
Create a copy of this code for each of the 82 words
There is a 1-1 correspondence, so the return from the code fragment can be a hard-coded address
Done! We have a hard-coded subroutine that does not affect the CPU registers.
Here is the actual code fragments
code_buff: PEA $0000 ; rightmost word (16-bits = 4 pixels)
PEA $0000 ; background 1
PEA $0000 ; background 1
PEA $0000 ; background 1
LDA (72),y ; background 2
PHA
LDA (70),y ; background 2
PHA
JMP word_68 ; mix the data
word_68_rtn: PEA $0000 ; more background 1
...
PEA $0000
BRA *+40 ; patched exit code
...
word_68: LDA (68),y ; load data for background 2
AND #$00FF ; mask
ORA #$AB00 ; blend with data from background 1
PHA
JMP word_68_rtn ; jump back
word_66: LDA (66),y
...
The end result was a near-optimal blitter that has minimal overhead and cranks out more than 15 frames per second at 320x200 on a 2.5 MHz CPU with a 1 MB/s memory bus.
Michael Abrash's "Zen of Assembly Language" had some nifty stuff, though I admit I don't recall specifics off the top of my head.
Actually it seems like everything Abrash wrote had some nifty optimization stuff in it.
The Stalin Scheme compiler is pretty crazy in that aspect.
I once saw a switch statement with a lot of empty cases, a comment at the head of the switch said something along the lines of:
Added case statements that are never hit because the compiler only turns the switch into a jump-table if there are more than N cases
I forget what N was. This was in the source code for Windows that was leaked in 2004.
I've gone to the Intel (or AMD) architecture references to see what instructions there are. movsx - move with sign extension is awesome for moving little signed values into big spaces, for example, in one instruction.
Likewise, if you know you only use 16-bit values, but you can access all of EAX, EBX, ECX, EDX , etc- then you have 8 very fast locations for values - just rotate the registers by 16 bits to access the other values.
The EFF DES cracker, which used custom-built hardware to generate candidate keys (the hardware they made could prove a key isn't the solution, but could not prove a key was the solution) which were then tested with a more conventional code.
The FSG 2.0 packer made by a Polish team, specifically made for packing executables made with assembly. If packing assembly isn't impressive enough (what's supposed to be almost as low as possible) the loader it comes with is 158 bytes and fully functional. If you try packing any assembly made .exe with something like UPX, it will throw a NotCompressableException at you ;)