Performance tests for Lightweight Encryption - cryptography

I am currently doing my final project, and I need to conduct some performance tests on at least 3 lightweight encryption algorithms (symmetric block ciphers). Ideally encrypt/decrypt of a text file and measure/compare at least 3 metrics such as execution time, memory, code size, throughput. I'm struggling to work out how to achieve this so if anyone has any pointers I would be extremely grateful. I've no experience with code although I'm trying to work this out (looking at C#, Java Cryptographic Extension, FELICS).
Thanks.

Consider building speed and memory tests for a one-time pad (as an example of perfect secrecy) or perhaps a substitution cipher, DES as an outdated form of symmetric encryption, and its modern replacement, AES. Pick a language, and search (DuckDuckGo, for instance) for library implementations of the latter two.
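A minimal sketch of how such a test harness might look in Python, assuming you plug in real cipher functions later. The `xor_placeholder` function and the `benchmark` helper are hypothetical names made up for illustration; the XOR routine is only a stand-in and is not a secure cipher. It measures three of the metrics from the question: execution time, throughput, and peak memory (Python heap only, via `tracemalloc`).

```python
import time
import tracemalloc

def xor_placeholder(data: bytes, key: bytes) -> bytes:
    """Stand-in for a real cipher call; repeating-key XOR is NOT secure."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

def benchmark(encrypt, decrypt, data: bytes, key: bytes, repeats: int = 5) -> dict:
    """Return best-of-N round-trip time, throughput and peak Python heap use."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        ciphertext = encrypt(data, key)
        plaintext = decrypt(ciphertext, key)
        best = min(best, time.perf_counter() - start)
        assert plaintext == data  # the round trip must be lossless

    # Measure memory in a separate run so tracing does not distort the timing.
    tracemalloc.start()
    decrypt(encrypt(data, key), key)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    return {
        "seconds": best,
        "throughput_MB_per_s": 2 * len(data) / best / 1e6,  # encrypt + decrypt
        "peak_bytes": peak,
    }

result = benchmark(xor_placeholder, xor_placeholder, b"sample text " * 10_000, b"k3y")
print(result)
```

To compare algorithms, you would call `benchmark` once per cipher with the same input file contents and tabulate the results. Code size is not covered here; for compiled languages it is usually read off the built binary.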

Related

Cryptography in Q# - Is it possible?

Title says it all.
Since quantum computers are said to be the next big thing, I figured the speed at which these systems operate should be enough to decrypt files/applications by brute force.
Is it possible? When will it be possible?
Quantum computers operate differently from classical computers, rather than faster or slower. For some problems they're much faster than the best known algorithms, while for others they'd be slower if they would work at all.
For decrypting, there are quantum algorithms for attacking some specific ciphers. Probably the best known is Shor's Algorithm, which on a large enough quantum computer would allow you to factor large numbers efficiently, thus breaking RSA. Breaking RSA would require many thousands of high-quality qubits, and so is not something that's going to be available in the next few years. Longer term, I myself wouldn't try to guess when such a quantum computer will be available, although others may have more confidence.
There are quantum attacks on other ciphers as well, including elliptic curve cryptography. The good news is that post-quantum cryptography is an active field of research, and there are some promising developments already. Also, most symmetric ciphers in use today are quantum-resistant; while brute force search time on a quantum computer would in theory scale with the square root of the number of possible keys, doubling the key size addresses this neatly.
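The square-root scaling mentioned above can be turned into a quick back-of-the-envelope calculation. This is only a sketch of the arithmetic, assuming an idealized Grover search and ignoring all hardware overheads; `grover_effective_bits` is a hypothetical helper name.

```python
def grover_effective_bits(key_bits: int) -> float:
    """Grover needs about sqrt(2**key_bits) = 2**(key_bits / 2) evaluations."""
    return key_bits / 2

for name, bits in [("AES-128", 128), ("AES-256", 256)]:
    print(f"{name}: about 2^{grover_effective_bits(bits):.0f} quantum evaluations")
```

So a 256-bit key under this idealized quantum search costs roughly what a 128-bit key costs classically, which is why doubling the key size is the usual remedy.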
There are good resources for this on Wikipedia: https://en.wikipedia.org/wiki/Shor%27s_algorithm and https://en.wikipedia.org/wiki/Post-quantum_cryptography. The Microsoft Quantum samples repository includes a Q# implementation of Shor's algorithm.
The threat of quantum computers to today's encrypted data is real. Please refer to the "harvest now, decrypt later" attack.
You can implement Shor's algorithm in Q#. Shor's algorithm is the threat to asymmetric key cryptography.
You can also implement Grover's algorithm in Q#. Grover's algorithm speeds up brute-force (unstructured) search, which applies to searching for symmetric keys.
Much progress has been made with Post Quantum Cryptography (PQC) since this question was originally answered in 2018. NIST is driving a standardization process to identify PQC algorithms. These PQC algorithms will replace RSA- and ECC-based DHE/DSA/KEM algorithms.
So more than "when is it possible?", we have to act now, because the threat is real and encrypted data as we know it today is perhaps being passively captured (we won't know about it). Data elements such as social security IDs (in the USA) and similar information have a shelf life that exceeds 7 to 10 years.

What's the difference between code written for a desktop machine and a supercomputer?

Hypothetically speaking, if my scientific work was leading toward the development of functions/modules/subroutines (on a desktop), what would I need to know to incorporate it into a large-scale simulation to be run on a supercomputer (which might simulate molecules, fluids, reactions, and so on)?
My impression is that it has to do with taking advantage of certain libraries (e.g., BLAS, LAPACK) where possible, revising algorithms (reducing iteration), profiling, parallelizing, considering memory/hard disk/processor use and access... I am aware of the adage, "want to optimize your code? don't do it", but if one were interested in learning about writing efficient code, what references might be available?
I think this question is language agnostic, but since many number-crunching packages for biomolecular simulation, climate modeling, etc. are written in some version of Fortran, this language would probably be my target of interest (and I have programmed rather extensively in Fortran 77).
Profiling is a must at any level of machinery. In common usage, I've found that scaling to larger and larger grids requires a better understanding of the grid software and the topology of the grid. In that sense, everything you learn about optimizing for one machine is still applicable, but understanding the grid software gets you additional mileage. Hadoop is one of the most popular and widespread grid systems, so learning about the scheduler options, interfaces (APIs and web interfaces), and other aspects of usage will help. Although you may not use Hadoop for a given supercomputer, it is one of the less painful methods for learning about distributed computing. For parallel computing, you may pursue MPI and other systems.
Additionally, learning to parallelize code on a single machine, across multiple cores or processors, is something you can begin learning on a desktop machine.
Recommendations:
Learn to optimize code on a single machine:
Learn profiling
Learn to use optimized libraries (after profiling: so that you see the speedup)
Be sure you know algorithms and data structures very well (*)
Learn to do embarrassingly parallel programming on multiple core machines.
Later: consider multithreaded programming. It's harder and may not pay off for your problem.
Learn about basic grid software for distributed processing
Learn about tools for parallel processing on a grid
Learn to program for alternative hardware, e.g. GPUs, various specialized computing systems.
This is language agnostic. I have had to learn the same sequence in multiple languages and multiple HPC systems. At each step, take a simpler route to learn some of the infrastructure and tools; e.g. learn multicore before multithreaded, distributed before parallel, so that you can see what fits for the hardware and problem, and what doesn't.
Some of the steps may be reordered depending on local computing practices, established codebases, and mentors. If you have a large GPU or MPI library in place, then, by all means, learn that rather than foist Hadoop onto your collaborators.
(*) The reason to know algorithms very well is that as soon as your code is running on a grid, others will see it. When it is hogging up the system, they will want to know what you're doing. If you are running a process that is polynomial and should be constant, you may find yourself mocked. Others with more domain expertise may help you find good approximations for NP-hard problems, but you should know that the concept exists.
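The first recommendation above (profile before optimizing) can be tried on a desktop right away. Here is a minimal sketch using Python's built-in `cProfile`; the two workload functions are hypothetical and exist only to give the profiler something to find. Fortran and C toolchains have their own profilers, so treat this as an illustration of the workflow rather than of any HPC tool.

```python
import cProfile
import io
import pstats

def slow_sum_of_squares(n: int) -> int:
    """Deliberately naive loop, standing in for a hot spot."""
    total = 0
    for i in range(n):
        total += i * i
    return total

def simulation_step() -> int:
    """Hypothetical workload that calls the hot spot repeatedly."""
    return sum(slow_sum_of_squares(20_000) for _ in range(20))

profiler = cProfile.Profile()
profiler.enable()
simulation_step()
profiler.disable()

# Sort by cumulative time so the most expensive call chains come first.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(10)
print(report.getvalue())
```

The report should show most of the time inside `slow_sum_of_squares`, which is the function you would then replace with an optimized library call and profile again.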
Parallelization would be the key.
Since the problems you cited (e.g. CFD, multiphysics, mass transfer) are generally expressed as large-scale linear algebra problems, you need matrix routines that parallelize well. MPI is a standard for those types of problems.
The physics can influence the approach as well. For example, it's possible to solve some elliptic problems efficiently using explicit dynamics and artificial mass and damping matrices.
3D multiphysics means coupled differential equations with varying time scales. You'll want a fine mesh to resolve details in both space and time, so the number of degrees of freedom will rise rapidly; time steps will be governed by the stability requirements of your problem.
If someone ever figures out how to run linear algebra as a map-reduce problem they'll have it knocked.
Hypothetically speaking, if my scientific work was leading toward the development of functions/modules/subroutines (on a desktop), what would I need to know to incorporate it into a large-scale simulation to be run on a supercomputer (which might simulate molecules, fluids, reactions, and so on)?
First, you would need to understand the problem. Not all problems can be solved in parallel (and I'm using the term parallel in as wide a meaning as it can get). So, see how the problem is solved now. Can it be solved more quickly with some other method? Can it be divided into independent parts? And so on.
Fortran is the language specialized for scientific computing, and during the recent years, along with the development of new language features, there has also been some very interesting development in terms of features that are aiming for this "market". The term "co-arrays" could be an interesting read.
But for now, I would suggest first reading a book like Using OpenMP. OpenMP is a simpler model, and the book (with Fortran examples inside) explains the fundamentals nicely. The Message Passing Interface (for friends, MPI :) is a larger model, and one of the most often used. Your next step from OpenMP should probably go in this direction. Books on MPI programming are not rare.
You mentioned also libraries - yes, some of those you mentioned are widely used. Others are also available. A person who does not know exactly where the problem in performance lies should IMHO never try to undertake the task of trying to rewrite library routines.
There are also books on parallel algorithms that you might want to check out.
I think this question is language agnostic, but since many number-crunching packages for biomolecular simulation, climate modeling, etc. are written in some version of Fortran, this language would probably be my target of interest (and I have programmed rather extensively in Fortran 77).
In short it comes down to understanding the problem, learning where the problem in performance is, re-solving the whole problem again with a different approach, iterating a few times, and by that time you'll already know what you're doing and where you're stuck.
We're in a position similar to yours.
I'm most in agreement with @Iterator's answer, but I think there's more to say.
First of all, I believe in "profiling" by the random-pausing method, because I'm not really interested in measuring things (It's easy enough to do that) but in pinpointing code that is causing time waste, so I can fix it. It's like the difference between a floodlight and a laser.
For one example, we use LAPACK and BLAS. Now, in taking my stack samples, I saw a lot of the samples were in the routine that compares characters. This was called from a general routine that multiplies and scales matrices, and that was called from our code. The matrix-manipulating routine, in order to be flexible, has character arguments that tell it things like, if a matrix is lower-triangular or whatever. In fact, if the matrices are not very large, the routine can spend more than 50% of its time just classifying the problem. Of course, the next time it is called from the same place, it does the same thing all over again. In a case like that, a special routine should be written. When it is optimized by the compiler, it will be as fast as it reasonably can be, and will save all that classifying time.
For another example, we use a variety of ODE solvers. These are optimized to the nth degree of course. They work by calling user-provided routines to calculate derivatives and possibly a jacobian matrix. If those user-provided routines don't actually do much, samples will indeed show the program counter in the ODE solver itself. However, if the user-provided routines do much more, samples will find the lower end of the stack in those routines mostly, because they take longer, while the ODE code takes roughly the same time. So, optimization should be concentrated in the user-provided routines, not the ODE code.
Once you've done several of the kind of optimization that is pinpointed by stack sampling, which can speed things up by 1-2 orders of magnitude, then by all means exploit parallelism, MPI, etc. if the problem allows it.
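The random-pausing idea described above is normally done by hand under a debugger. As a rough automated approximation, here is a sketch in Python where a background thread periodically grabs the main thread's call stack and counts which functions appear. It relies on `sys._current_frames()`, which is CPython-specific, and the workload functions are hypothetical. This is an assumption-laden illustration, not the author's actual tooling.

```python
import collections
import sys
import threading
import time

def sample_stacks(target_thread_id, counts, stop, interval=0.005):
    """Periodically record every function on the target thread's call stack."""
    while not stop.is_set():
        frame = sys._current_frames().get(target_thread_id)
        seen = set()
        while frame is not None:
            seen.add(frame.f_code.co_name)
            frame = frame.f_back
        counts.update(seen)  # one hit per function per sample
        time.sleep(interval)

def cheap_setup():
    return sum(range(1_000))

def expensive_kernel():
    return sum(i * i for i in range(200_000))

def workload():
    for _ in range(30):
        cheap_setup()
        expensive_kernel()

counts = collections.Counter()
stop = threading.Event()
sampler = threading.Thread(
    target=sample_stacks, args=(threading.get_ident(), counts, stop)
)
sampler.start()
workload()
stop.set()
sampler.join()

for name, hits in counts.most_common(5):
    print(f"{name}: {hits} samples")
```

A function that shows up on a large fraction of the samples is responsible for roughly that fraction of the run time, whether or not it sits at the bottom of the stack. Exact counts vary from run to run.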

Encrypting(MD5) multiple times can improve security?

I saw someone who hashes users' passwords multiple times with MD5 to improve security. I'm not sure if this works, but it doesn't look good. So, does it make sense?
Let's assume the hash function you use is a perfect one-way function. Then you can view its output like that of a "random oracle": its output values lie in a finite range (2^128 values for MD5).
Now what happens if you apply the hash multiple times? The output will still stay in the same range (2^128). It's like you saying "Guess my random number!" twenty times, each time thinking of a new number - that doesn't make it harder or easier to guess. There isn't any "more random" than random. That's not a perfect analogy, but I think it helps to illustrate the problem.
When it comes to brute-forcing a password, your scheme doesn't add any security at all. Even worse, the only thing you could "accomplish" is to weaken security by introducing some way to exploit the repeated application of the hash function. That's unlikely, but either way you are guaranteed not to gain anything.
So why is all not lost with this approach? Because of the point others made about using thousands of iterations instead of just twenty. Why is slowing the algorithm down a good thing? Because most attackers will try to gain access using a dictionary (or a rainbow table of often-used passwords), hoping that one of your users was negligent enough to use one of those (I'm guilty; at least Ubuntu told me so upon installation). On the other hand, it's inhumane to require your users to remember, let's say, 30 random characters.
That's why we need some form of trade-off between easy to remember passwords but at the same time making it as hard as possible for attackers to guess them. There are two common practices, salts and slowing the process down by applying lots of iterations of some function instead of a single iteration. PKCS#5 is a good example to look into.
In your case, applying MD5 20000 times instead of 20 would slow down attackers using a dictionary significantly, because each of their candidate passwords would also have to be hashed 20000 times to be useful in an attack. Note that this procedure does not affect brute-forcing as illustrated above.
But why is using a salt still better? Because even if you apply the hash 20000 times, a resourceful attacker could pre-compute a large database of passwords, hashing each of them 20000 times, effectively generating a customized rainbow table specifically targeted at your application. Having done this, they could quite easily attack your application or any other application using your scheme. That's why you also need a salt: it forces a high cost per password and makes such rainbow tables impractical to use.
If you want to be on the really safe side, use something like PBKDF2 illustrated in PKCS#5.
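As a rough illustration of the salt-plus-iterations idea, here is a sketch using PBKDF2 from Python's standard library (`hashlib.pbkdf2_hmac`). The iteration count, the choice of SHA-256, and the helper names `hash_password` and `verify_password` are assumptions for the example, not recommendations from the answer above.

```python
import hashlib
import hmac
import os

ITERATIONS = 200_000  # assumed work factor; tune it to your own hardware

def hash_password(password: str, salt=None):
    """Return (salt, derived_key); a fresh random salt is made if none is given."""
    salt = salt or os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest

def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Recompute the derived key with the stored salt and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected)

salt, stored = hash_password("correct horse battery staple")
print(verify_password("correct horse battery staple", salt, stored))  # True
print(verify_password("wrong guess", salt, stored))                   # False
```

The salt is stored next to the derived key. Because it differs per user, a precomputed table built for one salt is useless against any other account.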
Hashing a password is not encryption. It is a one-way process.
Check out security.stackexchange.com, and the password related questions. They are so popular we put together this blog post specifically to help individuals find useful questions and answers.
This question specifically discusses using md5 20 times in a row - check out Thomas Pornin's answer. Key points in his answer:
20 is too low, it should be 20000 or more - password processing is still too fast
There is no salt: an attacker may attack passwords with very low per-password cost, e.g. rainbow tables - which can be created for any number of md5 cycles
Since there is no sure test for knowing whether a given algorithm is secure or not, inventing your own cryptography is often a recipe for disaster. Don't do it
There is such a question on crypto.SE but it is NOT public now. The answer by Paŭlo Ebermann is:
For password-hashing, you should not use a normal cryptographic hash,
but something made specially to protect passwords, like bcrypt.
See How to safely store a password for details.
The important point is that password crackers don't have to bruteforce
the hash output space (2^160 for SHA-1), but only the
password space, which is much much smaller (depending on your password
rules - and often dictionaries help). Thus we don't want a fast
hash function, but a slow one. Bcrypt and friends are designed for
this.
And similar question has these answers:
The question is "Guarding against cryptanalytic breakthroughs: combining multiple hash functions"
Answer by Thomas Pornin:
Combining is what SSL/TLS does with MD5 and SHA-1, in its
definition of its internal "PRF" (which is actually a Key Derivation
Function). For a given hash function, TLS defines a KDF which
relies on HMAC which relies on the hash function. Then the KDF is
invoked twice, once with MD5 and once with SHA-1, and the results are
XORed together. The idea was to resist cryptanalytic breaks in either
MD5 or SHA-1. Note that XORing the outputs of two hash functions
relies on subtle assumptions. For instance, if I define SHB-256(m) =
SHA-256(m) XOR C, for a fixed constant C, then SHB-256 is as
good a hash function as SHA-256; but the XOR of both always yields
C, which is not good at all for hashing purposes. Hence, the
construction in TLS is not really sanctioned by the authority of
science (it just happens not to have been broken). TLS-1.2 does
not use that combination anymore; it relies on the KDF with a single,
configurable hash function, often SHA-256 (which is, in 2011, a smart
choice).
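The SHB-256 counterexample in the paragraph above is easy to check numerically. A small sketch, assuming an arbitrary fixed constant C (any 32-byte value works); `shb256` and `xor_bytes` are names made up for this demo.

```python
import hashlib

C = bytes(range(32))  # arbitrary fixed 32-byte constant, chosen for the demo

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def sha256(m: bytes) -> bytes:
    return hashlib.sha256(m).digest()

def shb256(m: bytes) -> bytes:
    """SHB-256(m) = SHA-256(m) XOR C, as defined in the quoted answer."""
    return xor_bytes(sha256(m), C)

for message in (b"first message", b"a completely different message"):
    combined = xor_bytes(sha256(message), shb256(message))
    print(combined == C)  # True for every message
```

Each function on its own behaves like a fine hash, yet their XOR is the same constant for every input, which is exactly the failure mode described.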
As @PulpSpy points out, concatenation is not a good generic way of
building hash functions. This was published by Joux in 2004 and then
generalized by Hoch and Shamir in 2006, for a large class of
construction involving iterations and concatenations. But mind the
fine print: this is not really about surviving weaknesses in hash
functions, but about getting your money's worth. Namely, if you take a
hash function with a 128-bit output and another with a 160-bit output,
and concatenate the results, then collision resistance will be no
worse than the strongest of the two; what Joux showed is that it will
not be much better either. With 128+160 = 288 bits of output, you
could aim at 2^144 resistance, but Joux's result implies
that you will not go beyond about 2^87.
So the question becomes: is there a way, if possible an efficient
way, to combine two hash functions such that the result is as
collision-resistant as the strongest of the two, but without incurring
the output enlargement of concatenation ? In 2006, Boneh and
Boyen have published a result which simply states that the answer
is no, subject to the condition of evaluating each hash function only
once. Edit: Pietrzak lifted the latter condition in 2007
(i.e. invoking each hash function several times does not help).
And by PulpSpy:
I'm sure @Thomas will give a thorough answer. In the interim, I'll just
point out that the collision resistance of your first construction,
H1(M)||H2(M) is surprisingly not that much better than just H1(M). See
section 4 of this paper:
http://web.cecs.pdx.edu/~teshrim/spring06/papers/general-attacks/multi-joux.pdf
No, it's not good practice. You must use a salt when hashing, because otherwise the password can be cracked with rainbow tables.

Is it unwise to fuzz-test with a cryptographically weak pseudorandom generator?

When working on a large software project I often use fuzz-testing as part of my test cases to help smoke out bugs that may only show up when the input reaches a certain size or shape. I've done this most commonly by just using the standard random number facilities that are bundled with the programming language I happen to be using.
Recently I've started wondering, ignoring the advantages or disadvantages of fuzz testing in general, whether or not it's a good idea to be using non-cryptographically secure pseudorandom number generators when doing fuzz testing. Weak random number generators often exhibit patterns that distinguish them from true random sequences, even if those patterns are not readily obvious. It seems like a fuzz test using a weak PRNG might always fail to trigger certain latent bugs that only show up in certain circumstances because the pseudorandom numbers could be related to one another in a way that never trigger those circumstances.
Is it inherently unwise to use a weak PRNG for fuzz testing? If it is theoretically unsound to do so, is it still reasonable in practice?
You're confusing two very different grades of "weakness":
Statistical weakness means the output of the PRNG exhibits statistical patterns, such as having certain sequences occur more often than others. This could actually lead to ineffective fuzz testing in some rare cases. Statistically strong PRNGs are performant and widely available though (most prominently the Mersenne Twister).
Cryptographic weakness means that the output of the RNG is in some way predictable given knowledge other than the seed (such as the output itself). It makes absolutely no sense to require a PRNG used for fuzz testing to be cryptographically strong, because the "patterns" exhibited by statistically-strong-but-cryptographically-weak PRNGs are pretty much only an issue if you need to prevent a cryptographically versed attacker from predicting the output.
I don't think it really matters, but I can't prove it.
Fuzz-testing will only try some inputs, in most cases a minuscule proportion of the possibilities. No matter how good the RNG you use, it may or may not find one of the inputs that breaks your code, depending on what proportion of all possible inputs break your code. Unless the pattern in the PRNG is very simple, it seems to me unlikely that it will correspond in any way to a pattern in the "bad" inputs you're looking for, so it will hit it neither more nor less than true random.
In fact, if you knew how to pick an RNG to maximise the probability of it finding bad inputs, you could probably use that knowledge to help find the bug more directly...
I don't think you should use a really bad PRNG. rand for example is permitted to exhibit very simple patterns, such as the LSB alternating. And if your code uses a PRNG internally, you probably want to avoid using the same PRNG in a similar way in the test, just to be sure you don't accidentally only test cases where the input data matches the internally-generated number stream! Small risk, of course, since you'd hope they'll be using different seeds, but still.
It's usually not that hard in a given language to find crypto or at least secure hash libraries. SHA-1 is everywhere and easy to use to generate a stream, or failing that RC4 is trivial to implement yourself. Both provide pretty good PRNG, if not quite so secure as Blum Blum Shub. I'd have thought the main concern is speed - if for example a Mersenne Twister can generate fuzz test cases 10 times as fast, and the code under test is reasonably fast, then it might have a better chance of finding bad inputs in a given time regardless of the fact that given 624 outputs you can deduce the complete state of the RNG...
You don't need an unpredictable source (that's exactly what a cryptographically secure generator is); you only need a source with good statistical properties.
So using a general-purpose generator is enough: it is fast and usually reproducible (which means problems are also reproducible).
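Reproducibility is simple to get with a seeded general-purpose generator. A minimal sketch using Python's `random.Random`, which is a Mersenne Twister; `parse_record` is a hypothetical function under test with a deliberately planted bug, and the seed value is arbitrary.

```python
import random

def parse_record(data: bytes) -> int:
    """Hypothetical function under test, with a planted bug for one input shape."""
    if len(data) > 3 and data[0] == 0xFF:
        raise ValueError("unhandled header")
    return len(data)

def fuzz(seed: int, cases: int = 10_000):
    """Return (case_index, input) for the first failure, or None if none is found."""
    rng = random.Random(seed)  # seeded, so every run generates the same inputs
    for case in range(cases):
        data = bytes(rng.randrange(256) for _ in range(rng.randrange(1, 8)))
        try:
            parse_record(data)
        except ValueError:
            return case, data
    return None

first = fuzz(seed=1234)
second = fuzz(seed=1234)
print(first == second)  # the same seed replays the same failing input
```

Logging the seed with every test run is what lets you replay a failure later.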

How can I accelerate the generation of an MD5 checksum within vb.net?

I'm working with some very large files residing on P2 (Panasonic) cards. Part of the process we employ is to first generate a checksum of the file we are going to copy, then copy the file, then run a checksum on the copy to confirm that it copied OK. The problem is that the files are large (70 GB+) and the checksums take a long time to complete. It's an issue since we will eventually be dealing with thousands of these files.
I would like to find a faster way to generate the checksum other than using the System.Security.Cryptography.MD5CryptoServiceProvider
I don't care if this means using a specialized hardware card, provided it works and is not too ungodly expensive. I would prefer a method that provides some feedback as to how far the process has gone, so I can display it like I do now.
The application is written in vb.net. I would prefer to be able to use it as a component, library, or reference within my application, but I'm willing to call an outside application if there is enough improvement in the speed of generating the checksum.
Needless to say, the checksum must be consistent and correct. :-)
Thank you in advance for your time and efforts,
Richard
I see one potential way to speed up this process: calculate the MD5 of the source file while performing the copy, not prior to it. This will reduce the number of times you'll need to read the entire file from 3 (source hash, copy, destination hash) to 2 (copy, destination hash).
The downside of this all is that you'll have to write your own copying code (as opposed to just relying on System.IO.File.Copy), and there's a non-zero chance that this will turn out to be slower in the end anyway than the 3-step process.
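The hash-while-copying idea can be sketched as follows. This is Python rather than VB.NET, purely to illustrate the control flow; the function names, the chunk size, and the progress callback signature are all assumptions. The progress callback also covers the progress-feedback requirement from the question.

```python
import hashlib
import os
import tempfile

def copy_with_md5(src, dst, chunk_size=4 * 1024 * 1024, progress=None):
    """Copy src to dst, hashing the source bytes as they stream through."""
    md5 = hashlib.md5()
    total = os.path.getsize(src)
    done = 0
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:
                break
            md5.update(chunk)
            fout.write(chunk)
            done += len(chunk)
            if progress:
                progress(done, total)  # e.g. update a progress bar
    return md5.hexdigest()

def md5_of_file(path, chunk_size=4 * 1024 * 1024):
    """Second pass: hash the destination to confirm the copy."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()

# Demo on a small temporary file standing in for a large media file.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "source.bin")
    dst = os.path.join(tmp, "copy.bin")
    with open(src, "wb") as f:
        f.write(os.urandom(1_000_000))
    source_hash = copy_with_md5(src, dst, chunk_size=64 * 1024)
    copy_ok = source_hash == md5_of_file(dst)
    print("copy verified:", copy_ok)
```

One caveat: reading the destination back immediately may be served from the operating system's cache rather than the physical medium, so a real verification pass may need to account for that.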
Other than that, I don't think there's much you can do here, as the entire process is I/O bound by design. You're spending most of your time reading/writing the file, and even at 100MB/s (a respectable I/O speed for your typical SATA drive), you'll do about 5.8GB/min at best.
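To put numbers on that estimate, here is the arithmetic for a single 70 GB file, assuming the 100 MB/s figure above, 1024 MB per GB, and that each full read of the file counts as one pass. Writes and seek overhead are ignored, so treat the results as lower bounds.

```python
FILE_GB = 70
RATE_MB_PER_S = 100
MB_PER_GB = 1024  # assumption: binary units, which matches the 5.8 GB/min figure

gb_per_min = RATE_MB_PER_S * 60 / MB_PER_GB
minutes_per_pass = FILE_GB / gb_per_min

print(f"{gb_per_min:.2f} GB/min")
print(f"3 passes: {3 * minutes_per_pass:.1f} min, 2 passes: {2 * minutes_per_pass:.1f} min")
```

Dropping from three passes to two saves about a third of the total time per file, which adds up over thousands of files.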
With a modern processor, the overhead of calculating the MD5 (or anything else) doesn't factor into things very much, so speeding it up won't improve your overall throughput. Crypto accelerators in particular won't help you here, as unless the driver implementation is very efficient, they'll add more overhead due to context switches required to feed the data to the external card than they'll save.
What you do want to improve is the I/O speed. The .NET framework is already pretty efficient when it comes to this (using nicely-sized buffers, overlapped I/O and such), but it's possible an optimized native Windows application will perform better here. My advice: Google around for a few native MD5 calculators, and see how they compare to your current .NET implementation. If the difference in hash calculation speed is >10%, it's worth switching to using said external app.
The correct answer is to avoid using MD5. MD5 is a cryptographic hash function, designed to provide certain cryptographic features. For merely detecting accidental corruption, it is way over-engineered and slow. There are many faster checksums, the design of which can be understood by examining the literature of error detection and correction. Some common examples are the CRC checksums, of which CRC32 is very common, but you can also relatively easily compute 64 or 128 bit or even larger CRCs much much faster than an MD5 hash.
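For illustration, a streaming CRC32 looks like this in Python using `zlib.crc32`, which accepts a running value so the file never has to be held in memory. The helper name and chunking are assumptions; .NET would need its own CRC implementation or library.

```python
import zlib

def crc32_of_chunks(chunks) -> int:
    """Fold an iterable of byte chunks into a single CRC32 value."""
    crc = 0
    for chunk in chunks:
        crc = zlib.crc32(chunk, crc)  # pass the running CRC back in
    return crc

data = b"The quick brown fox jumps over the lazy dog"
chunked = crc32_of_chunks([data[:10], data[10:25], data[25:]])
whole = zlib.crc32(data)
print(f"{chunked:08x}", chunked == whole)
```

Keep in mind that a CRC only detects accidental corruption; it offers no protection against deliberate tampering.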