BigInteger subtraction in JavaCard - cryptography

I am attempting a proof of concept under very constrained technological conditions. My question is: how do I efficiently subtract big integers (represented as byte arrays) on a Java Card?
Now, the details are what make the task tricky. I have access to one smart card. The model is Feitian JavaCOS A22 and it runs Java Card 2.2. For full detail, Java Card allows only a very restricted subset of the Java API (namely, no int, no char, and naturally, no BigInteger), but it does support a series of cryptographic primitives, which are detailed in this list.
In particular, my task is to implement classic ElGamal on card. I have found two relevant replies so far. In the first one, Maarten points out that ElGamal is not part of the standard, and therefore the functionality would need to be implemented. In another answer, thotheolh shares a link to an implementation of Diffie-Hellman on Java Card 2.2 based on the same principle: since it is not natively supported, it leverages the RSA functionality.
The logic is seamless: RSA, ElGamal and Diffie-Hellman rely on the same basic operation, $a^b \bmod c$. Based on thotheolh's code, I have managed to achieve key generation. Encryption happens off card, so it is not my concern. But decryption requires a particular variant: for decryption, $b = p - 1 - x$, where both $p$ and $x$ are big integers. This is the point where I get stuck: how do I efficiently calculate $p - 1 - x$?

Well, in fact there is no such thing as native BigInteger support in Java Card. There is BigNumber, but I don't think it will fit your requirements.
However, there is a way to work around this limitation.
There is a Java Card library that should allow you to deal with arbitrarily long big integers; the problem is that your applet could run out of memory.
The library's sources are here, and here is the prebuilt .jar.
This approach might work, but it is also likely to be drastically slow on a real card. However, that isn't an issue if you run such code in a simulator just for a PoC.
I have no idea which IDE you use, but this is how you can add the library in IntelliJ.
However, as Maarten Bodewes pointed out, you might be better off focusing on byte-array subtraction, simply because of the probable inefficiency of any BigInteger Java Card library.
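To make that concrete, here is a rough sketch of byte-array subtraction with borrow in plain Java. The class and method names are mine, not part of any Java Card API; a real applet would keep the data in its own buffers and stick to byte/short arithmetic, but the borrow-propagation loop is the part that carries over. $p - 1 - x$ then becomes two such subtractions.

public class ByteSub {

    // In-place subtraction of unsigned big-endian byte arrays: a = a - b.
    // Assumes a >= b and a.length == b.length.
    static void subtract(byte[] a, byte[] b) {
        short borrow = 0;
        for (short i = (short) (a.length - 1); i >= 0; i--) {
            short ai = (short) (a[i] & 0xFF);   // unsigned value of a[i]
            short bi = (short) (b[i] & 0xFF);   // unsigned value of b[i]
            short diff = (short) (ai - bi - borrow);
            if (diff < 0) {                     // borrow from the next byte
                diff += 256;
                borrow = 1;
            } else {
                borrow = 0;
            }
            a[i] = (byte) diff;
        }
    }

    public static void main(String[] args) {
        // p - 1 - x computed as two subtractions: p -= 1, then p -= x.
        byte[] p   = { (byte) 0x01, (byte) 0x00 }; // 256
        byte[] one = { (byte) 0x00, (byte) 0x01 }; // 1
        byte[] x   = { (byte) 0x00, (byte) 0x64 }; // 100
        subtract(p, one);
        subtract(p, x);
        System.out.println(((p[0] & 0xFF) << 8) | (p[1] & 0xFF)); // 155
    }
}

On card you would of course drop the println and the main method; the point is only that the whole operation is a single pass with a one-bit borrow, so it is cheap even without BigInteger support.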
Hope this helps.
UPD
BigNumber is guaranteed to be at least 8 bytes, but as far as I have tried it, it allows exactly 8 bytes, which is way too small to hold security-grade parameters. For instance, it cannot contain the safe prime p = 57896044618658097711785492504343953926634992332820282019728792003956564821041.
You can check this yourself with the method getMaxBytesSupported().
So, as you can see, BigNumber is relatively big for Java Card, but still smaller than what most crypto protocols need.

As others have said, you won't find native integers or BigInts in most Java Cards, even today.
However, for anyone still wondering 4 years later, JCMathLib actually implements this functionality.
It is not as fast as a native implementation would be but it uses the crypto coprocessor (where possible) and achieves decent performance.


Provably correct marshalling?

Do there exist serialization (marshalling) schemes for data structures that can be formally proven to be correct?
I am agnostic about the particular programming language; it could be OCaml/Haskell, C++, Java or something else, as long as the data to be serialized can be assumed to be properly typed.
Perhaps as a way to reformulate/clarify my question: what I am interested in is whether there exist known standard encoding schemes for writing data structures to disk that can be proved to have 100% fidelity, in the sense that the deserialized data is exactly the same as the original.
As a simplifying assumption, there are no complications from pointers/references. The input is "pure data", for lack of a better way to say it.
It's a slightly vague question, but I'll have a go.
Heterogeneous Environments
Serialisation's job is to take data in the memory of one computer program, convert it to some sort of standardised representation, and convert that back to data in the memory of, quite possibly, another computer program on a completely different sort of computer. That does open up some interesting possibilities.
For example, the representation of a floating point value on many computers is IEEE754. But it's not wholly universal; historically companies like Cray and IBM used alternate formats, and so there exists the possibility that a value when deserialised on those machines might not be exactly the same value as was serialised in the first place. Generally no one cares, because the differences are numerically very small.
This shows up in some serialisation technologies; ASN.1's own wire formats for floats are either a text representation or its own binary format that is not IEEE754. The idea behind the text representation is that it can convey any floating point value, with no constraints. In contrast, a binary format often has limits on precision, maximum value, etc.
Text is another potential problem area; serialised unicode strings sent to another computer that doesn't support unicode will likely result in the deserialised string being different to the original.
Similarly with platforms that don't support 64-bit integers, etc. Java is very annoying here - historically it had no unsigned integers, so handling 64-bit unsigned values received from, say, a C++ program is a nuisance.
Conclusion - It's a Logical Impossibility
So in some senses, for heterogeneous environments, there can be no serialisation technologies formally proven to reproduce identical values, because the destination machine is of a different architecture, and its representation may well be different, or limited in some way.
Homogeneous Environments
Serialisation used to convey data from a computer program on one computer to exactly the same program on an identical computer (i.e. a homogeneous environment) ought to produce exactly the same values on deserialisation. AFAIK there are no formally proven serialisation technologies. If there is serialisation built into the Ada language (I don't know), the Greenhills Ada compiler is formally proven. Boost for C++ has a serialisation library and is heavily peer reviewed, so that comes close, especially if used on top of Greenhills' formally proven C++ compiler. Some of the commercial ASN.1 tools / libraries are very mature and highly trusted.
What is it that is Formally Proven?
In that last paragraph I have touched on a difficulty with your question; formal proof is perhaps only of value if the entire software development stack (libraries, compiler, CPU) and your application source code are themselves formally proven. Otherwise you could have perfect source code for a serialisation library being compiled by a junk compiler, linked against rubbish libraries, running on a shonky CPU; it's not going to work.
So, when one is talking about "formally proven" one is generally talking about the whole system, not just an individual component. A component part that is, by itself, formally proven to meet its specification is a good aid to achieving a proven system, but it does not magically confer "correctness" on the whole system all on its own. Every other component needs to meet its specification too.
And what we've seen historically is that, quite often, CPUs don't really do what their data sheet says they do. Some will take shortcuts in floating point arithmetic in the interests of completing instructions in a single cycle in preference to achieving a numerically perfect result.
Sorry for the rambling answer, but I hope that's of interest and help.

AES encryption/decryption for a beginner

I am trying to encrypt an NSString to both NSString and NSData in Objective-C and so I began a search.
I started off here, but that went way over my head, unfortunately.
I then found myself at this post, and it came across as very easy to follow, so I went along and tried to figure out the implementation. After looking over the implementation, I saw the second answer in the post and saw he had more adaptable implementations, which brought me to his gist. As per the gist readme, he "took down this Gist due to concerns about the security of the encryption/decryption". That leads me to believe that the implementation above has security flaws as well.
From that gist, however, he mentioned another alternative that I could use for encryption. After taking a look at the code, I noticed that it generates NSData with "a header, encryption salt, HMAC salt, IV, ciphertext, and HMAC". I know how to handle that to decode using the same library again, but how would I pass this off to a server guy, given that I don't quite know what I'm sending to him?
At the root of it all, I'm in over my head. Given what I said above, and knowing that I don't have the time to take on a lot of learning for this unless it is absolutely necessary, how should I best go about this encoding/decoding process, given a private key, with the end goal of shipping it off to a server that is not designed by me? (How's that for a run-on sentence!)
Maybe you should ask the server guy? Whenever you have encryption between two parties, you have to have some kind of agreement on the format of the data; the raw primitives don't handle that alone. Not to mention that it's easy to mess things up security-wise when dealing with just the primitives, and the desire to send the AES ciphertext alone is going to cause mistakes.
RNCryptor, which you mention, is a high-level encryption library. It defines a simple format that others would have to conform to; being simple helps it go cross-platform, and it has the extras that you need to do AES properly. There are other libraries like that too (NaCl, GPGME, and Keyczar) that are not as simple in format but are simple in usage. You'd need to be able to use the library on both ends, but I'd highly recommend that you use something like that, if you can, rather than rolling your own.
Keyczar specifically exists for Java, Python, C++, C# and Go, so if you can use the C++ version on iOS (or Mac, whichever you are targeting on the client), you should be good on the server as there are several choices.

Choosing CPU architecture for LLVM/CLANG

I am designing a TTL serial computer, and I am struggling to choose the architecture better suited to an LLVM compiler backend (I want to be able to run any C++ software on it). There will be no MMU, no multiplication/division, no hardware stack, no interrupts.
I have 2 main options:
1) 8-bit memory, 8-bit ALU, 8-bit registers (~12-16). Memory address width is 24 bits, so I would need to use 3 registers as the IP and 3 registers for any memory location.
Needless to say, any address calculation would be pure pain to implement in the compiler.
2) 24-bit memory, 24-bit ALU, 24-bit registers (~6-8). Flat memory, nice. The drawback is that, due to the serial nature of the design, each operation would take 3 times more clocks, even if we are operating on booleans. A 24-bit memory data width is expensive, and it's harder to implement in hardware in general.
The question is: do you think implementing all C++ features on this 8-bit, stackless hardware is possible, or do I need more complex hardware to get generated code of reasonable quality and speed?
I second the suggestion to use LCC. I used it in this homebrew 16-bit RISC project: http://fpgacpu.org/xsoc/cc.html .
I don't think it should make much difference whether you build the 8-bit variant and use 3 add-with-carries to increment IP, or the 24-bit variant and do the whole thing in hardware. You can hide the difference in your assembler.
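Purely to illustrate what the assembler would be hiding, here is a small Java sketch (names are mine, not tied to any real ISA) of incrementing a 24-bit IP held in three 8-bit registers as three add-with-carry steps:

public class Ip24 {
    // A 24-bit IP held as three 8-bit "registers", most significant first.
    static int ipHigh = 0x00, ipMid = 0xFF, ipLow = 0xFF;

    // Increment the IP: add 1 to the low byte, then propagate the carry
    // through the middle and high bytes -- three 8-bit add-with-carry steps.
    static void incrementIp() {
        int carry = 1;                        // the "+1"
        int sum = ipLow + carry;
        ipLow = sum & 0xFF;  carry = sum >> 8;
        sum = ipMid + carry;
        ipMid = sum & 0xFF;  carry = sum >> 8;
        sum = ipHigh + carry;
        ipHigh = sum & 0xFF;                  // carry out of the top byte is dropped
    }

    public static void main(String[] args) {
        incrementIp();   // 0x00FFFF + 1 = 0x010000
        System.out.printf("%02X%02X%02X%n", ipHigh, ipMid, ipLow); // prints 010000
    }
}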
If you look at my article above, or an even simpler CPU here: http://fpgacpu.org/papers/soc-gr0040-paper.pdf, you will see you really don't need that many operators / instructions to cover the integer C repertoire. In fact there is an lcc utility (ops) to print the minimal operator set for a given machine.
For more information see my article on porting lcc to a new machine here: http://www.fpgacpu.org/usenet/lcc.html
Once I had ported lcc, I wrote an assembler, and it synthesized a larger repertoire of instructions from the basic ones. For example, my machine had load-byte-unsigned but not load-byte-signed, so I emitted this sequence:
lbs rd,imm(rs) ->
lbu rd,imm(rs)
lea r1,0x80
xor rd,r1
sub rd,r1
(The xor/sub pair is the usual sign-extension trick: for an unsigned byte x, (x ^ 0x80) - 0x80 yields the signed value.)
So I think you can get by with this min cover of operations:
registers
load register with constant
load rd = *rs
store *rs1 = rs2
+ - (w/ w/o carry) // actually can do + with - and ^
>> 1 // << 1 is just +
& ^ // synthesize ~ from ^, | from & and ^ (see the sketch after this list)
jump-and-link rd,rs // rd = pc, pc = rs
skip-z/nz/n/nn rs // skip next insn on rs==0, !=0, <0, >=0
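To make those parenthetical identities concrete, here is a tiny Java sketch (illustrative only, nothing to do with lcc itself) checking that the synthesized forms match the native operators:

public class MinOps {
    public static void main(String[] args) {
        int x = 0xA5, y = 0x3C;

        int not = x ^ 0xFFFFFFFF;        // ~x from ^ (xor with all ones)
        int or  = (x & y) ^ (x ^ y);     // x | y from & and ^
        int shl = x + x;                 // x << 1 is just x + x

        // Sanity checks against the native operators.
        System.out.println(not == ~x);        // true
        System.out.println(or  == (x | y));   // true
        System.out.println(shl == (x << 1));  // true
    }
}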
Even simpler is to have no registers (or equivalently blur registers with memory -- all registers have a memory address).
Set aside a register for SP, and write the function prolog/epilog handler in the compiler and you won't have to worry about stack instructions. There's just code to store each of the callee save registers, adjust the SP by the frame size, and so forth.
Interrupts (and return from interrupts) are straightforward. All you need to do is force a jump-and-link instruction into the instruction register. If you choose the bit pattern for that to be something like 0, and put the right address into the source register rs (especially if it is r0), it can be done with a flip-flop reset input or an extra force-to-0 AND gate. I use a similar trick in the second paper above.
Interesting project. I see a TTL / 7400 contest is underway, and I was wondering myself how simple a machine you could get away with, and whether it would be cheating to add a 32 KB or 128 KB async SRAM to the machine to hold the code and data.
Anyway, happy hacking!
p.s.
1) You will want to decide how large each integral type is. You can certainly make char, short, int, long, long long, etc. the same size, one 24b word, if you wish, although it won't be compliant with the minimum representation ranges.
2) And although I focused on lcc here, you were asking about C++. I recommend pursuing C first. Once you have things figured out for C, including the *, /, % operators in software, etc., it should be more tractable to move to full-blown C++, whether in LLVM or GCC. The difference between C and C++ is "only" the extra vtables, RTTI tables and code sequences (entirely built up out of the primitive C integer operator repertoire) required to handle virtual function calls, pointer-to-member dereference, dynamic casts, static constructors, exception handling, etc.
IMHO, it is possible for a C compiler. I am not sure about C++, though.
LLVM/Clang could be a hard choice for an 8-bit computer.
Instead, first try lcc, then LLVM etc. HTH.
Bill Buzbee succeeded in retargeting the lcc compiler for his Magic-1 (known as homebrewcpu).
Although the hardware design and construction of Magic-1 usually gets the most attention, the largest part of the project (by far) has been developing/porting the software. To this end, I've had to write an assembler and linker from scratch, retarget a C compiler, write and port the standard C libraries, write a simplified operating system and then port a more sophisticated one. It's been a challenge, but a fun one. I suppose I'm somewhat twisted, but I happen to enjoy debugging difficult problems. And, when the bug you're trying to track down could involve one or more of: hardware design flaw, loose or broken wire, loose or bad TTL chip, assembler bug, linker bug, compiler bug, C runtime library bug, or finally a bug in the program in question there's lot of opportunity for fun. Oh, and I also don't have the luxury of blaming the bugs on anyone else.
I'm continually amazed that the damn thing runs at all, much less runs as well as it does.
In my opinion, stackless hardware is already poorly suited for C and C++ code. If you have nested function calls, you will need to emulate a stack in software anyway, which of course is much slower.
When going the stackless route, you will probably allocate most of your variables as 'static', and have no re-entrant functions. In this case, 6502-style addressing modes can be effective. You could for example have these addressing modes:
Immediate address (24bit) as part of opcode
Immediate address (24bit) plus index register (8bit)
Indirect access: immediate 24bit address to memory, which contains the actual address
Indirect access: 24 bit address to memory, 8 bit index register added to value from memory.
The address modes outlined above would allow efficient access to arrays, structures and objects allocated at a constant address (static allocation). They would be less efficient (but still usable) for dynamically and stack-allocated objects.
You would also get some benefit from your serial design: the 24-bit + 8-bit addition usually does not need to take 24 cycles, because you can short-circuit the addition once the carry is 0.
Instead of mapping the IP as registers directly, you could allow changing it only through goto/branch instructions, using the same address modes as above. Jumps into dynamically computed addresses are quite rare so it makes more sense to give the whole 24-bit address directly in the opcode.
I think that if you design the CPU carefully, you can use many C++ features quite efficiently. However, do not expect that any random C++ code would run fast on such a limited CPU.
The implementation is certainly possible, but I doubt it will be usable (at least for C++ code). As was already noted, the first problem is the lack of a stack. Next, a lot of C++ relies heavily on dynamic memory allocation, and C++ "internal" structures are quite big.
So, as it seems to me, it will be better, if you:
Get rid of C++ requirement (or at least, limit yourself to some subset)
Use 24 bits, not 8 bits for everything (for registers as well)
Add hardware stack
You are not going to be able to run "any" C++ code there - for example fork(), system(), etc., or anything that clearly relies on interrupts. You can get a long way there, sure.
Now do you mean any programs that can/have been written in C++ or are you limiting yourself to the language only and not the libraries that are commonly associated with C/C++? The language itself is a much easier rule to live with.
I think the easier question/answer is: why not just try? What have you tried so far? It could be argued that the x86 is an 8-bit machine: no regard for alignment and many 8-bit instructions. The msp430 was ported to LLVM to show how easily and quickly it could be done. I would like to see that platform with better support (not where my strengths lie, otherwise I would be doing it): a 16-bit platform, no MMU. It does have a stack and interrupts, sure, but you don't have to use them, and if you remove the library rules then what is left that needs an interrupt?
I would look at LLVM, but note that the documentation produced to show how easy it is to port is dated and wrong, and you basically have to figure it out on your own from the compiler sources. lcc has a book; it is known for that, not for optimization. The sources don't compile well on modern computers; you always have to go backwards in time to use it, and any time I go near it, after an evening of just trying to build it as is, I give up. vbcc: simple, clean, documented, not unfriendly to smaller processors. Is it C++? I don't remember. Of all of them it is the easiest to get a compiler up and running, though. Of all of them LLVM is the most attractive and most useful when all is said and done. Don't go near gcc or even think of it; it's duct tape and baling wire inside holding it together.
Have you invented your instruction set yet? Do you have a simulator and assembler yet? Look up lsasim on GitHub to find my instruction set. You can write an LLVM backend for mine as practice for yours...grin...(my vbcc backend is horrible, I need to start over)...
You have to have some idea of how the high level will be implemented, but you really have to start with an instruction set, an instruction set simulator and an assembler of some sort. Then start hand-converting C/C++ code into assembly for your instruction set; that should pretty quickly get you through "can I do this without a stack?", etc. In the process, define your calling convention and implement more C/C++ code by hand using it. THEN dig into a compiler and make a back end. I think you should consider vbcc as a stepping stone, then head for LLVM if it appears that it (the ISA) will work.

Getting Embedded with D (the programming language)

I like a lot of what I've read about D.
Unified Documentation (That would make my job a lot easier.)
Testing capability built in to the language.
Debug code support in the language.
Forward Declarations. (I always thought it was stupid to declare the same function twice.)
Built in features to replace the Preprocessor.
Modules
Typedef used for proper type checking instead of aliasing.
Nested functions. (Cough PASCAL Cough)
In and Out Parameters. (How obvious is that!)
Supports low level programming - Embedded systems, oh yeah!
However:
Can D support an embedded system that is not going to be running an OS?
Does the outright declaration that it doesn't support 16-bit processors preclude it entirely from embedded applications running on such machines? Sometimes you don't need a hammer to solve your problem.
Garbage collection is great on Windows or Linux, but unfortunately embedded applications sometimes must do explicit memory management.
Array bounds checking, you love it, you hate it. Great for design assurance, but not always permissible for performance reasons.
What are the implications on an embedded system, not running an OS, for multithreading support? We have a customer that doesn't even like interrupts. Much less OS/multithreading.
Is there a D-Lite for embedded systems?
So basically: is D suitable for embedded systems with only a few megabytes (sometimes less than a megabyte), not running an OS, where max memory usage must be known at compile time (per requirements), and possibly on something smaller than a 32-bit processor?
I'm very interested in some of the features, but I get the impression it's aimed at desktop application developers.
What is it specifically that makes it unsuitable for a 16-bit implementation? (Assuming the 16-bit architecture could address a sufficient amount of memory to hold the runtime, either in flash memory or RAM.) 32-bit values could still be calculated using library code, albeit more slowly than 16-bit values and requiring more operations.
I have to say that the short answer to this question is "No".
If your machines are 16-bit, you'll have big problems fitting D onto them - it is explicitly not designed for that.
D is not a light language in itself; it generates a lot of runtime type info that is normally linked into your app and that is also needed for typesafe variadics (and thus the standard formatting features, be it Tango or Phobos). This means that even the smallest applications are surprisingly large in size, and this may disqualify D from systems with low RAM. Also, D with the runtime as a shared lib (which could alleviate some of these issues) has been little tested.
All current D libraries require a C standard library below them, and thus typically also an OS, so even that works against using D. However, there do exist experimental kernels in D, so it is not impossible per se. There just wouldn't be any libraries for it, as of today.
I would personally like to see you succeed, but doubt that it will be easy work.
First and foremost, read larsivi's answer. He's worked on the D runtime and knows what he's talking about.
I just wanted to add: Some of what you asked about is already possible. It won't get you all the way, and a miss is as good as a mile here but still, FYI:
Garbage collection is great on Windoze or Linux, but unfortunately embedded apps sometimes must do explicit memory management.
You can turn garbage collection off. The various experimental D OSes out there do it. See the std.gc module, in particular std.gc.disable. Note also that you do not need to allocate memory with new: you can use malloc and free. Even arrays can be allocated this way; you just need to attach a D array to the allocated memory using a slice.
Array bounds checking, you love it, you hate it. Great for design assurance, but not always permissible for performance reasons.
The specification for arrays specifically requires that compilers allow for bounds checking to be turned off (see the "Implementation Note"). gdc provides -fno-bounds-check, and in dmd using -release should disable it.
What are the implications on an embedded system, not running an OS, for multithreading support? We have a customer that doesn't even like interrupts. Much less OS/multithreading.
This I'm less clear on, but given that most C runtimes allow turning off multithreading, it seems likely one could get the D runtime to disable it as well. Whether that's easy or possible right now though I can't tell you.
The answers to this question are outdated:
Can D support an embedded system that is not going to be running an OS?
D can be cross-compiled for ARM Linux and for ARM Cortex-M. Some projects aim at creating libraries for Cortex-M architectures like MiniLibD for the STM32 or this project which uses a generic library for the STM32. (You could implement your own minimalistic OS in D on ARM Cortex-M.)
Does the outright declaration that it doesn't support 16-bit processors preclude it entirely from embedded applications running on such machines? Sometimes you don't need a hammer to solve your problem.
No, see answer above... (But I would not expect that "smaller" architectures than Cortex-M will be supported in the near future.)
Garbage collection is great on Windows or Linux, but unfortunately embedded applications sometimes must do explicit memory management.
You can write garbage-collection-free code. (The D Foundation seems to aim at a "GC-free compliant" standard library, Phobos, but that is work in progress.)
Array bounds checking, you love it, you hate it. Great for design assurance, but not always permissible for performance reasons.
(As you said, this depends on your "personal taste" and design decisions. But I would assume an acceptable performance overhead for bounds checking, given the background of the D compiler developers and D's design aims.)
What are the implications on an embedded system, not running an OS, for multithreading support? We have a customer that doesn't even like interrupts. Much less OS/multithreading.
(What is the question? One could implement multithreading using D's language capabilities, e.g. as explained in this question. BTW: if you want to use interrupts, consider this "hello world" project for a Cortex-M3.)
Is there a D-Lite for embedded systems?
The SafeD subset of D targets the embedded domain.

Really Big Numbers and Objective-C

I've been toying around with some Project Euler problems and naturally am running into a lot that require handling numbers bigger than the long long type. I am committed to using Cocoa and Objective-C (I need to stay sharp for work) but can't find an elegant way (read: library) to handle these really big numbers.
I'd love to use GMP, but it sounds like using it with Xcode is a complete world of hurt.
Does anyone know of any other options?
If I were you, I would compile GMP outside Xcode and use just gmp.h and libgmp.a (or libgmp.dylib) in my Xcode project.
Try storing the digits in arrays.
You will have to write some new functions for all your arithmetic operations, but that's how we were told to do it in college.
Plus, the speed of calculations was decent, since the big numbers weren't really that big after all and weren't really handled as numbers at all.
See if it helps.
Regards
vBigNum in vecLib implements 1024-bit integers (signed or unsigned). Is that big enough?
If you wanted to use MATLAB (or anything close), you could look at my implementation of a big integer form (vpi) on the File Exchange.
It is rather simple. Store each digit separately. Adds and subtracts are simple; just implement a carry operation. Multiplies are best done using convolution, then a carry step. Implement divide and mod operators, then a powermod operation, which is useful for many of the PE problems. Powers are easy: just repeated squaring and multiplication, based on the binary representation of the exponent.
This will let you solve many PE problems.
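As an illustration of the digit-array idea (language-neutral; written here in Java with names of my own choosing), addition with a carry looks like the sketch below, and subtraction with a borrow follows the same pattern:

public class DigitAdd {
    // Add two non-negative numbers stored as base-10 digit arrays,
    // least significant digit first; returns a new digit array.
    static int[] add(int[] a, int[] b) {
        int n = Math.max(a.length, b.length);
        int[] result = new int[n + 1];          // one extra slot for a final carry
        int carry = 0;
        for (int i = 0; i < n; i++) {
            int da = i < a.length ? a[i] : 0;
            int db = i < b.length ? b[i] : 0;
            int sum = da + db + carry;
            result[i] = sum % 10;               // keep one digit
            carry = sum / 10;                   // propagate the rest
        }
        result[n] = carry;
        return result;
    }

    public static void main(String[] args) {
        int[] a = {9, 9, 9};        // 999, least significant digit first
        int[] b = {2, 1};           // 12
        int[] sum = add(a, b);
        StringBuilder s = new StringBuilder();
        for (int i = sum.length - 1; i >= 0; i--) s.append(sum[i]);
        System.out.println(s);      // prints 1011
    }
}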
I too got the bright idea to attempt some Project Euler problems with Cocoa/Objective-C and have found it frustrating. I previously used Java and perhaps some PHP. I posted my exact problem in this thread.
I always considered using a library cheating for this project. Just write a class with the things you need. And don't be afraid to use malloc and uint64_t and so on. NSNumber is not a good idea in many cases.
On the other hand, there are many problems where the obvious solution would require huge to enormously huge numbers, and the trick is to find a way to solve the problem without using those huge numbers. (For example, what is the sum of the last thousand digits of 1,000,000 factorial?)