How does libgcrypt increment the counter for CTR mode? - cryptography

I have a file encrypted with AES-256 using libgcrypt's CTR mode implementation.
I want to be able to decrypt the file in parts (e.g. decrypting blocks 5-10 out of 20 blocks without decrypting the whole file).
I know that by using CTR mode, I should be able to do it. All I need is to know the correct counter.
The problem lies in the fact that all I have is the initial counter for block 0. If I want to decrypt block 5 for example, I need another counter, one that is achieved by doing some action on the initial counter for each block from 0 to 5.
I can't seem to find an API that libgcrypt exposes in order to calculate counter for later blocks given the initial counter.
How can I calculate the counter of later blocks (e.g. block #5) given the counter of block #0?

When in doubt, go to the source. Here's the code in gcrypt's generic CTR mode implementation (_gcry_cipher_ctr_encrypt() in cipher-ctr.c) that increments the counter:
for (i = blocksize; i > 0; i--)
{
c->u_ctr.ctr[i-1]++;
if (c->u_ctr.ctr[i-1] != 0)
break;
}
There are other, more optimized implementations of counter incrementing found in other places in the libgcrypt source, e.g. in the various cipher-specific fast bulk CTR encryption implementations, but this generic one happens to be nice and readable. (Of course, all those alternative implementations need to produce the same sequence of counter values anyway, so that gcrypt stays compatible with itself.)
OK, so what does it actually do?
Well, looking at the context (or, more specifically, cipher-internal.h), it's clear that c->u_ctr.ctr is an array of blocksize unsigned bytes (where blocksize equals 16 bytes for AES). The code above increments its last byte by one, and checks if the result wrapped around to zero. If it didn't, it stops; if it did wrap, the code then moves to the second-to-last byte, increments it, checks to see if it wrapped, and keeps looping until it either finds a byte that doesn't wrap around when incremented, or it has incremented all of the blocksize bytes.
So, for example, if your original counter value was {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}, then after incrementing it would become {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1}. If incremented again, it would become {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2}, then {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3}, and so on up to {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,255}, after which the next counter value would be {0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0} (and after that {0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1}, {0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,2}, {0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,3}, etc.).
Of course, what this is really doing is just arithmetically incrementing a single (blocksize × 8)-bit integer, stored in memory in big-endian byte order.

Related

How to extract encryption and MAC keys using KDF (X9.63) defined by javacardx.security.derivation

As per Java Card v3.1 new package is defined javacardx.security.derivation
https://docs.oracle.com/en/java/javacard/3.1/jc_api_srvc/api_classic/javacardx/security/derivation/package-summary.html
KDF X9.63 works on three inputs: input secret, counter and shared info.
Depends on length of generated key material, multiple rounds on hash is carried out to generated final output.
I am using this KDF via JC API to generated 64 bytes of output (which is carried out by 2 rounds of SHA-256) for a 16 bytes-Encryption Key, a 16 bytes-IV, and a 32 bytes-MAC Key.
Note: This is just pseudo code to put my question with necessary details.
DerivationFunction df = DerivationFunction.getInstance(DerivationFunction.ALG_KDF_ANSI_X9_63, false);
df.init(KDFAnsiX963Spec(MessageDigest.ALG_SHA_256, input, sharedInfo, (short) 64);
SecretKey encKey = KeyBuilder.buildKey(KeyBuilder.TYPE_AES, (short)16, false);
SecretKey macKey = KeyBuilder.buildKey(KeyBuilder.TYPE_HMAC, (short)32, false);
df.nextBytes(encKey);
df.nextBytes(IVBuffer, (short)0, (short)16);
df.lastBytes(macKey);
I have the following questions:
When rounds of KDF are performed? Are these performed during df.init() or during df.nextBytes() & df.lastBytes()?
One KDF round will generate 32 bytes output (considering SHA-256 algorithm) then how API's df.nextBytes() & df.lastBytes() will work with any output expected length < 32 bytes?
In this KDF counter is incremented in every next round then how counter will be managed between df.nextBytes() & df.lastBytes() API's?
When rounds of KDF are performed? Are these performed during df.init() or during df.nextBytes() & df.lastBytes()?
That seems to be implementation specific to me. It will probably be faster to perform all the calculations at one time, but in that case it still makes sense to wait for the first request of the bytes. On the other hand RAM is also often an issue, so on demand generation also makes some sense. That requires a somewhat trickier implementation though.
The fact that the output size is pre-specified probably indicates that the simpler method of generating all the key material at once is at least foreseen by the API designers (they probably created an implementation before subjecting it to peer review in the JCF).
One KDF round will generate 32 bytes output (considering SHA-256 algorithm) then how API's df.nextBytes() & df.lastBytes() will work with any output expected length < 32 bytes?
It will commonly return the leftmost bytes (of the hash output) and likely leave the rest of the bytes in a buffer. This buffer will likely be destroyed together with the rest of the state when lastBytes is called (so don't forget to do so).
Note that the API clearly states that you have to re-initialize the DerivationFunction instance if you want to use it again. So that is a very strong indication that they though of destruction of key material (something that is required by FIPS and Common Criteria certification, not just common sense).
Other KDF's could have a different way of returning bytes, but using the leftmost bytes and then add rounds to the right is so common you can call it universal. For the ANSI X9.63 KDF this is certainly the case and it is clearly specified in the standard that way.
In this KDF counter is incremented in every next round then how counter will be managed between df.nextBytes() & df.lastBytes() API's?
These are methods of the same class and cannot be viewed separately, so they are not separate API's. Class instances can keep state in anyway they want. It might simply hold the counter as class variable, but if it decided to generate the bytes during init or the first nextBytes / lastBytes call then the counter is not even required anymore.

A general-purpose warp-level std::copy-like function - what should it account for?

A C++ standard library implements std::copy with the following code (ignoring all sorts of wrappers and concept checks etc) with the simple loop:
for (; __first != __last; ++__result, ++__first)
*__result = *__first;
Now, suppose I want a general-purpose std::copy-like function for warps (not blocks; not grids) to use for collaboratively copying data from one place to another. Let's even assume for simplicity that the function takes pointers rather than an arbitrary iterator.
Of course, writing general-purpose code in CUDA is often a useless pursuit - since we might be sacrificing a lot of the benefit of using a GPU in the first place in favor of generality - so I'll allow myself some boolean/enum template parameters to possibly select between frequently-occurring cases, avoiding runtime checks. So the signature might be, say:
template <typename T, bool SomeOption, my_enum_t AnotherOption>
T* copy(
T* __restrict__ destination,
const T* __restrict__ source,
size_t length
);
but for each of these cases I'm aiming for optimal performance (or optimal expected performance given that we don't know what other warps are doing).
Which factors should I take into consideration when writing such a function? Or in other words: Which cases should I distinguish between in implementing this function?
Notes:
This should target Compute Capabilities 3.0 or better (i.e. Kepler or newer micro-architectures)
I don't want to make a Runtime API memcpy() call. At least, I don't think I do.
Factors I believe should be taken into consideration:
Coalescing memory writes - ensuring that consecutive lanes in a warp write to consecutive memory locations (no gaps).
Type size vs Memory transaction size I - if sizeof(T) is sizeof(T) is 1 or 2, and we have have each lane write a single element, the entire warp would write less than 128B, wasting some of the memory transaction. Instead, we should have each thread place 2 or 4 input elements in a register, and write that
Type size vs Memory transaction size II - For type sizes such that lcm(4, sizeof(T)) > 4, it's not quite clear what to do. How well does the compiler/the GPU handle writes when each lane writes more than 4 bytes? I wonder.
Slack due to the reading of multiple elements at a time - If each thread wishes to read 2 or 4 elements for each write, and write 4-byte integers - we might have 1 or 2 elements at the beginning and the end of the input which must be handled separately.
Slack due to input address mis-alignment - The input is read in 32B transactions (under reasonable assumptions); we thus have to handle the first elements up to the multiple of 32B, and the last elements (after the last such multiple,) differently.
Slack due to output address mis-alignment - The output is written in transactions of upto 128B (or is it just 32B?); we thus have to handle the first elements up to the multiple of this number, and the last elements (after the last such multiple,) differently.
Whether or not T is trivially-copy-constructible. But let's assume that it is.
But it could be that I'm missing some considerations, or that some of the above are redundant.
Factors I've been wondering about:
The block size (i.e. how many other warps are there)
The compute capability (given that it's at least 3)
Whether the source/target is in shared memory / constant memory
Choice of caching mode

Why are the outputs of this pseudo random number generator (LFSR) so predictable?

Recently I asked here, how to generate random numbers in hardware and was told to use an LFSR. It will be random but will start repeating after a certain value.
The problem is that the random numbers generated are so predictable that the next value can be easily guessed. For example check the simulation below:
The next "random" number can be guessed by adding the previous number with a +1 of itself. Can someone please verify if this is normal and to be expected.
Here is the code I used for the LFSR:
module LFSR(
input clock,
input reset,
output [12:0] rnd
);
wire feedback = rnd[12] ^ rnd[3] ^ rnd[2] ^ rnd[0];
reg [12:0] random;
always # (posedge clock or posedge reset)
begin
if (reset)
random <= 13'hF; //An LFSR cannot have an all 0 state, thus reset to FF
else
random <= {random[11:0], feedback}; //shift left the xor'd every posedge clock
end
assign rnd = random;
endmodule
The location of the bits to XOR are picked up from here: The table page 5
LFSR only generates one random bit per clock. It doesn't generate a new (in your case) 13-bit number each cycle. The other 12 bits in rnd are just the old random values, so it will not appear very random.
If you need a 13-bit random number, then you must either sample LFSR every 13 cycles, or put 13 LFSR in parallel with different seeds, and use the 13 zero bits as your random number.
An LFSR is most certainly not 'random' in any real sense whatsoever. To quote Von Neumann "Any one who considers arithmetical methods of producing random digits is, of course, in a state of sin." I haven't looked up whether the feedback terms you've chosen are maximal, meaning that they'll provide a sequence with a length equal to the number of bits in your LFSR, but that's the best you can do.
So yes, the next value in your LFSR is extremely predictable. If you need something more securely 'random' you need to look into cryptographic methods, these depend on a secret key of course, and are also much more computationally intensive than an LFSR. You 'get what you pay for' though.
Incidentally, a system where you get predictable 'random' numbers is highly useful in it's own right. Usually for simulation purposes.

Does the "C" code algorithm in RFC1071 work well on big-endian machine?

As described in RFC1071, an extra 0-byte should be added to the last byte when calculating checksum in the situation of odd count of bytes:
But in the "C" code algorithm, only the last byte is added:
The above code does work on little-endian machine where [Z,0] equals Z, but I think there's some problem on big-endian one where [Z,0] equals Z*256.
So I wonder whether the example "C" code in RFC1071 only works on little-endian machine?
-------------New Added---------------
There's one more example of "breaking the sum into two groups" described in RFC1071:
We can just take the data here (addr[]={0x00, 0x01, 0xf2}) for example:
Here, "standard" represents the situation described in the formula [2], while "C-code" representing the C code algorithm situation.
As we can see, in "standard" situation, the final sum is f201 regardless of endian-difference since there's no endian-issue with the abstract form of [Z,0] after "Swap". But it matters in "C-code" situation because f2 is always the low-byte whether in big-endian or in little-endian.
Thus, the checksum is variable with the same data(addr&count) on different endian.
I think you're right. The code in the RFC adds the last byte in as low-order, regardless of whether it is on a litte-endian or big-endian machine.
In these examples of code on the web we see they have taken special care with the last byte:
https://github.com/sjaeckel/wireshark/blob/master/epan/in_cksum.c
and in
http://www.opensource.apple.com/source/tcpdump/tcpdump-23/tcpdump/print-ip.c
it does this:
if (nleft == 1)
sum += htons(*(u_char *)w<<8);
Which means that this text in the RFC is incorrect:
Therefore, the sum may be calculated in exactly the same way
regardless of the byte order ("big-endian" or "little-endian")
of the underlaying hardware. For example, assume a "little-
endian" machine summing data that is stored in memory in network
("big-endian") order. Fetching each 16-bit word will swap
bytes, resulting in the sum; however, storing the result
back into memory will swap the sum back into network byte order.
The following code in place of the original odd byte handling is portable (i.e. will work on both big- and little-endian machines), and doesn't depend on an external function:
if (count > 0)
{
char buf2[2] = {*addr, 0};
sum += *(unsigned short *)buf2;
}
(Assumes addr is char * or const char *).

Interoperability of AES CTR mode?

I use AES128 crypto in CTR mode for encryption, implemented for different clients (Android/Java and iOS/ObjC). The 16 byte IV used when encrypting a packet is formated like this:
<11 byte nonce> | <4 byte packet counter> | 0
The packet counter (included in a sent packet) is increased by one for every packet sent. The last byte is used as block counter, so that packets with fewer than 256 blocks always get a unique counter value. I was under the assumption that the CTR mode specified that the counter should be increased by 1 for each block, using the 8 last bytes as counter in a big endian way, or that this at least was a de facto standard. This also seems to be the case in the Sun crypto implementation.
I was a bit surprised when the corresponding iOS implementation (using CommonCryptor, iOS 5.1) failed to decode every block except the first when decoding a packet. It seems that CommonCryptor defines the counter in some other way. The CommonCryptor can be created in both big endian and little endian mode, but some vague comments in the CommonCryptor code indicates that this is not (or at least has not been) fully supported:
http://www.opensource.apple.com/source/CommonCrypto/CommonCrypto-60026/Source/API/CommonCryptor.c
/* corecrypto only implements CTR_BE. No use of CTR_LE was found so we're marking
this as unimplemented for now. Also in Lion this was defined in reverse order.
See <rdar://problem/10306112> */
By decoding block by block, each time setting the IV as specified above, it works nicely.
My question: is there a "right" way of implementing the CTR/IV mode when decoding multiple blocks in a single go, or can I expect it to be interoperability problems when using different crypto libs? Is CommonCrypto bugged in this regard, or is it just a question of implementing the CTR mode differently?
The definition of the counter is (loosely) specified in NIST recommendation sp800-38a Appendix B. Note that NIST only specifies how to use CTR mode with regards to security; it does not define one standard algorithm for the counter.
To answer your question directly, whatever you do you should expect the counter to be incremented by one each time. The counter should represent a 128 bit big endian integer according to the NIST specifications. It may be that only the least significant (rightmost) bits are incremented, but that will usually not make a difference unless you pass the 2^32 - 1 or 2^64 - 1 value.
For the sake of compatibility you could decide to use the first (leftmost) 12 bytes as random nonce, and leave the latter ones to zero, then let the implementation of the CTR do the increments. In that case you simply use a 96 bit / 12 byte random at the start, in that case there is no need for a packet counter.
You are however limited to 2^32 * 16 bytes of plaintext until the counter uses up all the available bits. It is implementation specific if the counter returns to zero or if the nonce itself is included in the counter, so you may want to limit yourself to messages of 68,719,476,736 = ~68 GB (yes that's base 10, Giga means 1,000,000,000).
because of the birthday problem you've got a 2^48 chance (48 = 96 / 2) of creating a collision for the nonce (required for each message, not each block), so you should limit the amount of messages;
if some attacker tricks you into decrypting 2^32 packets for the same nonce, you run out of counter.
In case this is still incompatible (test!) then use the initial 8 bytes as nonce. Unfortunately that does mean that you need to limit the number of messages because of the birthday problem.
Further investigations sheds some light on the CommonCrypto problem:
In iOS 6.0.1 the little endian option is now unimplemented. Also, I have verified that CommonCrypto is bugged in that the CCCryptorReset method does not in fact change the IV as it should, instead using pre-existing IV. The behaviour in 6.0.1 is different from 5.x.
This is potentially a security risc, if you initialize CommonCrypto with a nulled IV, and reset it to the actual IV right before encrypting. This would lead to all your data being encrypted with the same (nulled) IV, and multiple streams (that perhaps should have different IV but use same key) would leak data via a simple XOR of packets with corresponding ctr.