Basic group arithmetic in libsodium

I am trying to implement a simple cryptographic primitive.
In the following code, given sa, sk and hn, I want to compute sb such that sa*G = (sb + sk*hn)*G.
However, after finding sb, the following equality does not hold: sb*G + (sk*hn)*G = sa*G.
My understanding is that the arithmetic in the exponent is modulo the order of the group rather than modulo L.
However, I have a few questions about the implementation:
Why does the scalar have to be chosen from [0,L], where L is the order of the subgroup?
Is there a "helper" function that multiplies two large scalars without reducing modulo L?
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sodium.h>

int main(void)
{
    if (sodium_init() < 0) {
        /* panic! the library couldn't be initialized, it is not safe to use */
        return -1;
    }
    uint8_t sb[crypto_core_ed25519_SCALARBYTES];
    uint8_t sa[crypto_core_ed25519_SCALARBYTES];
    uint8_t hn[crypto_core_ed25519_SCALARBYTES];
    uint8_t sk[crypto_core_ed25519_SCALARBYTES];
    crypto_core_ed25519_scalar_random(sa); // s_a <- [0,l]
    crypto_core_ed25519_scalar_random(sk); // sk <- [0,l]
    crypto_core_ed25519_scalar_random(hn); // hn <- [0,l]
    uint8_t product[crypto_core_ed25519_SCALARBYTES];
    crypto_core_ed25519_scalar_mul(product, sk, hn);  // sk*hn
    crypto_core_ed25519_scalar_sub(sb, sa, product);  // sb = sa - hn*sk
    uint8_t point1[crypto_core_ed25519_BYTES];
    crypto_scalarmult_ed25519_base(point1, sa);       // sa*G
    uint8_t point2[crypto_core_ed25519_BYTES];
    uint8_t sum[crypto_core_ed25519_BYTES];
    // this variant gives equal points:
    // crypto_core_ed25519_scalar_add(sum, sb, product);
    // crypto_scalarmult_ed25519_base(point2, sum);
    // this variant does not:
    uint8_t temp1[crypto_core_ed25519_BYTES];
    uint8_t temp2[crypto_core_ed25519_BYTES];
    crypto_scalarmult_ed25519_base(temp1, sb);        // sb*G
    crypto_scalarmult_ed25519_base(temp2, product);   // (sk*hn)*G
    crypto_core_ed25519_add(point2, temp1, temp2);    // sb*G + (sk*hn)*G
    if (memcmp(point1, point2, 32) != 0)
    {
        printf("[-] Not equal ");
        return -1;
    }
    printf("[+] equal");
    return 0;
}

I got the answer from jedisct1, the author of libsodium, and I will post it here:
crypto_scalarmult_ed25519_base() clamps the scalar (clears the 3 lowest bits, sets the high bit) before performing the multiplication.
Use crypto_scalarmult_ed25519_base_noclamp() to prevent this.
Or, even better, use the Ristretto group instead.
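For readers wondering why clamping breaks the equality, here is a minimal sketch of the conventional Ed25519 clamping applied to a 32-byte little-endian scalar. The helper names clamp_scalar and clamping_changes are hypothetical, and this is not libsodium's source; it only illustrates the bit operations described above. Any scalar reduced modulo L is below 2^253, so its bit 254 is clear and clamping always alters it. As a result sb*G, (sk*hn)*G and sa*G are each computed from a different, modified scalar.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: conventional Ed25519/X25519 clamping, in place,
 * on a 32-byte little-endian scalar. Not taken from libsodium. */
static void clamp_scalar(uint8_t s[32])
{
    assert(s != NULL);
    s[0]  &= 248;  /* clear the 3 lowest bits */
    s[31] &= 127;  /* clear bit 255 */
    s[31] |= 64;   /* set bit 254 */
}

/* Returns 1 if clamping would alter the scalar, 0 otherwise. */
static int clamping_changes(const uint8_t s[32])
{
    uint8_t t[32];
    memcpy(t, s, sizeof t);
    clamp_scalar(t);
    return memcmp(t, s, sizeof t) != 0;
}
```

With crypto_scalarmult_ed25519_base_noclamp() no clamping is applied, so the three multiplications in the question operate on sb, sk*hn and sa themselves.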

Addressing individual pins of a register in microcontrollers

I'm working with Keil software and an LM3S316 microcontroller. Usually we address registers in microcontrollers in this form:
#define GPIO_PORTC_DATA_R (*((volatile uint32_t *)0x400063FC))
My question is how I can access a single pin of a register. For example, if I have this function:
char process_key(int a)
{ PC_0 = a ;}
How can I get PC_0, and how do I define it?
Thank you
Given say:
#define PIN0 (1u<<0)
#define PIN1 (1u<<1)
#define PIN2 (1u<<2)
// etc...
Then:
char process_key(int a)
{
if( a != 0 )
{
// Set bit
GPIO_PORTC_DATA_R |= PIN0 ;
}
else
{
// Clear bit
GPIO_PORTC_DATA_R &= ~PIN0 ;
}
}
A generalisation of this idiomatic technique is presented at How do you set, clear, and toggle a single bit?
However the read-modify-write implied by |= / &= can be problematic if the register might be accessed in different thread/interrupt contexts, as well as adding a possibly undesirable overhead. Cortex-M3/4 parts have a feature known as bit-banding that allows individual bits to be addressed directly and atomically. Given:
volatile uint32_t* getBitBandAddress( volatile const void* address, int bit )
{
__IO uint32_t* bit_address = 0;
uint32_t addr = reinterpret_cast<uint32_t>(address);
// This bit manipulation makes the function valid for RAM
// and Peripheral bitband regions
uint32_t word_band_base = addr & 0xf0000000u;
uint32_t bit_band_base = word_band_base | 0x02000000u;
uint32_t offset = addr - word_band_base;
// Calculate bit band address
bit_address = reinterpret_cast<__IO uint32_t*>(bit_band_base + (offset * 32u) + (static_cast<uint32_t>(bit) * 4u));
return bit_address ;
}
Then you can have:
char process_key(int a)
{
static volatile uint32_t* PC0_BB_ADDR = getBitBandAddress( &GPIO_PORTC_DATA_R, 0 ) ;
*PC0_BB_ADDR = a ;
}
You could of course determine and hard-code the bit-band address; for example:
#define PC0 (*((volatile uint32_t *)0x420C7F80u))
Then:
char process_key(int a)
{
PC0 = a ;
}
Details of the bit-band address calculation can be found in the ARM Cortex-M Technical Reference Manual, and there is an online calculator here.
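As a cross-check of the address arithmetic, here is a small plain-C sketch of the same calculation done on integers only, so it can be run on a PC. The name bitband_alias is hypothetical. It assumes the standard Cortex-M3/M4 layout, where only the lowest 1 MB of the SRAM (0x20000000) and peripheral (0x40000000) regions is bit-banded; the function does not check that the address is inside that range.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: maps (address, bit number) in a Cortex-M3/M4
 * bit-band region to the address of the corresponding alias word. */
static uint32_t bitband_alias(uint32_t addr, unsigned bit)
{
    assert(bit < 32u);
    uint32_t region_base = addr & 0xF0000000u;         /* 0x20000000 or 0x40000000 */
    uint32_t alias_base  = region_base | 0x02000000u;  /* 0x22000000 or 0x42000000 */
    uint32_t byte_offset = addr - region_base;
    /* every bit of the region gets its own 32-bit word in the alias region */
    return alias_base + byte_offset * 32u + bit * 4u;
}
```

For GPIO_PORTC_DATA_R at 0x400063FC, bit 0 maps to 0x42000000 + 0x63FC * 32 = 0x420C7F80. On the target, the result would be cast to a volatile uint32_t * and written with 0 or 1.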

How to calculate CRC32 over a large amount of data that is split into buffered blocks?

Let's say I have 1024 kB of data, which is buffered 1 kB at a time and transferred 1024 times from a transmitter to a receiver.
The last buffer contains a calculated CRC32 value as the last 4 bytes.
However, the receiver has to calculate the CRC32 buffer by buffer because of RAM constraints.
I wonder how to combine the per-buffer CRC32 calculations so that they match the CRC32 value of the total data.
I looked at the CRC calculation and its distributive property. The calculation and its linearity are not clear enough to me to implement.
So, is there a mathematical expression for combining the CRC32s calculated over the buffers so that the result matches the CRC32 calculated over the total?
Such as:
int CRC32Total = 0;
int CRC32[1024];
for(int i = 0; i < 1024; i++){
CRC32Total = CRC32Total + CRC32[i];
}
Kind Regards
You did not say which implementation, or even which language, you mean when you write that you "looked at CRC calculation". However, every implementation I've seen is designed to compute CRCs piecemeal, exactly as you want.
For the crc32() routine provided in zlib, it is used like this (in C):
crc = crc32(0, NULL, 0); // initialize CRC value
crc = crc32(crc, firstchunk, 1024); // update CRC value with first chunk
crc = crc32(crc, secondchunk, 1024); // update CRC with second chunk
...
crc = crc32(crc, lastchunk, 1024); // complete CRC with the last chunk
Then crc is the CRC of the concatenation of all of the chunks. You do not need a function to combine the CRCs of individual chunks.
If for some other reason you do want a function to combine CRCs, e.g. if you need to split the CRC calculation over multiple CPUs, then zlib provides the crc32_combine() function for that purpose.
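To make the piecemeal idea concrete without depending on any library, here is a minimal sketch of a CRC-32 update routine with the same calling convention as shown above: pass 0 for the first chunk, then feed each result back in. The name crc32_update is hypothetical, and the routine is bitwise rather than table-driven, so it is a stand-in for illustration and not zlib's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a library routine such as zlib's crc32():
 * pass 0 for the first chunk, then feed the previous result back in. */
static uint32_t crc32_update(uint32_t crc, const void *data, size_t len)
{
    const uint8_t *p = data;
    assert(p != NULL || len == 0);
    crc = ~crc;                      /* undo the final inversion of the previous call */
    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)  /* reflected polynomial 0xEDB88320, one bit at a time */
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;                     /* final inversion */
}
```

Feeding "123456789" in one call or as "1234" followed by "56789" gives the same value, 0xCBF43926, which is the standard CRC-32 check value. The receiver therefore only ever needs the running 32-bit value and the current buffer in RAM.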
When you start the transfer, reset CrcChecksum to its initial value with the OnFirstBlock function. For every block received, call OnBlockReceived to update the checksum. Note that the blocks must be processed in the correct order. When the final block has been processed, the final CRC is in the CrcChecksum variable.
// In crc32.c
// Crc32Lookup is the usual 256-entry CRC-32 lookup table (not shown)
uint32_t UpdateCrc(uint32_t crc, const void *data, size_t length)
{
    const uint8_t *current = data;
    while (length--)
        crc = (crc >> 8) ^ Crc32Lookup[(crc & 0xFF) ^ *current++];
    return crc;
}
// In your block processing application
static uint32_t CrcChecksum;
void OnFirstBlock(void) {
CrcChecksum = 0;
}
void OnBlockReceived(const void *data, size_t length) {
CrcChecksum = UpdateCrc(CrcChecksum, data, length);
}
To complement my comment on your question, I have added code here that goes through the whole process: data generation as a linear array, CRC32 appended to the transmitted data, injection of errors, and reception in 'chunks' with the CRC32 computed and errors detected. You're probably only interested in the 'reception' part, but I think a complete example makes it clearer.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>
// ---------------------- buildCRC32table ------------------------------
static const uint32_t CRC32_POLY = 0xEDB88320;
static const uint32_t CRC32_XOR_MASK = 0xFFFFFFFF;
static uint32_t CRC32TABLE[256];
void buildCRC32table (void)
{
uint32_t crc32;
for (uint16_t byte = 0; byte < 256; byte++)
{
crc32 = byte;
// iterate thru all 8 bits
for (int i = 0; i < 8; i++)
{
uint8_t feedback = crc32 & 1;
crc32 = (crc32 >> 1);
if (feedback)
{
crc32 ^= CRC32_POLY;
}
}
CRC32TABLE[byte] = crc32;
}
}
// -------------------------- myCRC32 ----------------------------------
uint32_t myCRC32 (uint32_t previousCRC32, uint8_t *pData, int dataLen)
{
uint32_t newCRC32 = previousCRC32 ^ CRC32_XOR_MASK; // remove last XOR mask (or add first)
// add new data to CRC32
while (dataLen--)
{
uint32_t crc32Top24bits = newCRC32 >> 8;
uint8_t crc32Low8bits = newCRC32 & 0x000000FF;
uint8_t data = *pData++;
newCRC32 = crc32Top24bits ^ CRC32TABLE[crc32Low8bits ^ data];
}
newCRC32 ^= CRC32_XOR_MASK; // put XOR mask back
return newCRC32;
}
// ------------------------------ main ---------------------------------
int main()
{
// build CRC32 table
buildCRC32table();
uint32_t crc32;
// use a union so we can access the same data linearly (TX) or by chunks (RX);
// it is static because 1 MiB is too large for the stack on many systems
static union
{
uint8_t array[1024*1024];
uint8_t chunk[1024][1024];
} data;
// use time to seed randomizer so we have different data every run
srand((unsigned int)time(NULL));
/////////////////////////////////////////////////////////////////////////// Build data to be transmitted
////////////////////////////////////////////////////////////////////////////////////////////////////////
// populate array with random data sparing space for the CRC32 at the end
for (size_t i = 0; i < (sizeof(data.array) - sizeof(uint32_t)); i++)
{
data.array[i] = (uint8_t) (rand() & 0xFF);
}
// now compute array's CRC32
crc32 = myCRC32(0, data.array, sizeof(data.array) - sizeof(uint32_t));
printf ("array CRC32 = 0x%08X\n", crc32);
// to store the CRC32 into the array, we want to remove the XOR mask so we can compute the CRC32
// of all received data (including the CRC32 itself) and expect the same result all the time,
// regardless of the data, when no errors are present
crc32 ^= CRC32_XOR_MASK;
// load CRC32 at the very end of the array
data.array[sizeof(data.array) - 1] = (uint8_t)((crc32 >> 24) & 0xFF);
data.array[sizeof(data.array) - 2] = (uint8_t)((crc32 >> 16) & 0xFF);
data.array[sizeof(data.array) - 3] = (uint8_t)((crc32 >> 8) & 0xFF);
data.array[sizeof(data.array) - 4] = (uint8_t)((crc32 >> 0) & 0xFF);
/////////////////////////////////////////////// At this point, data is transmitted and errors may happen
////////////////////////////////////////////////////////////////////////////////////////////////////////
// to make things interesting, let's add one bit error with 1/8 probability
if ((rand() % 8) == 0)
{
uint32_t index = rand() % sizeof(data.array);
uint8_t errorBit = 1 << (rand() & 0x7);
// add error
data.array[index] ^= errorBit;
printf("Error injected on byte %u, bit mask = 0x%02X\n", index, errorBit);
}
else
{
printf("No error injected\n");
}
/////////////////////////////////////////////////////// Once received, the data is processed in 'chunks'
////////////////////////////////////////////////////////////////////////////////////////////////////////
// now we access the data and compute its CRC32 one chunk at a time
crc32 = 0; // initialize CRC32
for (int i = 0; i < 1024; i++)
{
crc32 = myCRC32(crc32, data.chunk[i], sizeof data.chunk[i]);
}
printf ("Final CRC32 = 0x%08X\n", crc32);
// because the CRC32 algorithm applies an XOR mask at the end, when we have no errors, the computed
// CRC32 will be the mask itself
if (crc32 == CRC32_XOR_MASK)
{
printf ("No errors detected!\n");
}
else
{
printf ("Errors detected!\n");
}
}

Determine Position of Most Significant Set Bit in a Byte

I have a byte I am using to store bit flags. I need to compute the position of the most significant set bit in the byte.
Example Byte: 00101101 => 6 is the position of the most significant set bit
Compact Hex Mapping:
[0x00] => 0x00
[0x01] => 0x01
[0x02,0x03] => 0x02
[0x04,0x07] => 0x03
[0x08,0x0F] => 0x04
[0x10,0x1F] => 0x05
[0x20,0x3F] => 0x06
[0x40,0x7F] => 0x07
[0x80,0xFF] => 0x08
TestCase in C:
#include <stdio.h>
unsigned char check(unsigned char b) {
unsigned char c = 0x08;
unsigned char m = 0x80;
do {
if(m&b) { return c; }
else { c -= 0x01; }
} while(m>>=1);
return 0; // reached only when b == 0
}
int main() {
unsigned char input[256] = {
0x00,0x01,0x02,0x03,0x04,0x05,0x06,0x07,0x08,0x09,0x0a,0x0b,0x0c,0x0d,0x0e,0x0f,
0x10,0x11,0x12,0x13,0x14,0x15,0x16,0x17,0x18,0x19,0x1a,0x1b,0x1c,0x1d,0x1e,0x1f,
0x20,0x21,0x22,0x23,0x24,0x25,0x26,0x27,0x28,0x29,0x2a,0x2b,0x2c,0x2d,0x2e,0x2f,
0x30,0x31,0x32,0x33,0x34,0x35,0x36,0x37,0x38,0x39,0x3a,0x3b,0x3c,0x3d,0x3e,0x3f,
0x40,0x41,0x42,0x43,0x44,0x45,0x46,0x47,0x48,0x49,0x4a,0x4b,0x4c,0x4d,0x4e,0x4f,
0x50,0x51,0x52,0x53,0x54,0x55,0x56,0x57,0x58,0x59,0x5a,0x5b,0x5c,0x5d,0x5e,0x5f,
0x60,0x61,0x62,0x63,0x64,0x65,0x66,0x67,0x68,0x69,0x6a,0x6b,0x6c,0x6d,0x6e,0x6f,
0x70,0x71,0x72,0x73,0x74,0x75,0x76,0x77,0x78,0x79,0x7a,0x7b,0x7c,0x7d,0x7e,0x7f,
0x80,0x81,0x82,0x83,0x84,0x85,0x86,0x87,0x88,0x89,0x8a,0x8b,0x8c,0x8d,0x8e,0x8f,
0x90,0x91,0x92,0x93,0x94,0x95,0x96,0x97,0x98,0x99,0x9a,0x9b,0x9c,0x9d,0x9e,0x9f,
0xa0,0xa1,0xa2,0xa3,0xa4,0xa5,0xa6,0xa7,0xa8,0xa9,0xaa,0xab,0xac,0xad,0xae,0xaf,
0xb0,0xb1,0xb2,0xb3,0xb4,0xb5,0xb6,0xb7,0xb8,0xb9,0xba,0xbb,0xbc,0xbd,0xbe,0xbf,
0xc0,0xc1,0xc2,0xc3,0xc4,0xc5,0xc6,0xc7,0xc8,0xc9,0xca,0xcb,0xcc,0xcd,0xce,0xcf,
0xd0,0xd1,0xd2,0xd3,0xd4,0xd5,0xd6,0xd7,0xd8,0xd9,0xda,0xdb,0xdc,0xdd,0xde,0xdf,
0xe0,0xe1,0xe2,0xe3,0xe4,0xe5,0xe6,0xe7,0xe8,0xe9,0xea,0xeb,0xec,0xed,0xee,0xef,
0xf0,0xf1,0xf2,0xf3,0xf4,0xf5,0xf6,0xf7,0xf8,0xf9,0xfa,0xfb,0xfc,0xfd,0xfe,0xff };
unsigned char truth[256] = {
0x00,0x01,0x02,0x02,0x03,0x03,0x03,0x03,0x04,0x04,0x04,0x04,0x04,0x04,0x04,0x04,
0x05,0x05,0x05,0x05,0x05,0x05,0x05,0x05,0x05,0x05,0x05,0x05,0x05,0x05,0x05,0x05,
0x06,0x06,0x06,0x06,0x06,0x06,0x06,0x06,0x06,0x06,0x06,0x06,0x06,0x06,0x06,0x06,
0x06,0x06,0x06,0x06,0x06,0x06,0x06,0x06,0x06,0x06,0x06,0x06,0x06,0x06,0x06,0x06,
0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,
0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,
0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,
0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,
0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,
0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,
0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,
0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,
0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,
0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,
0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,
0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08};
int i,r;
int f = 0;
for(i=0; i<256; ++i) {
r=check(input[i]);
if(r !=(truth[i])) {
printf("failed %d : 0x%x : %d\n",i,0x000000FF & ((int)input[i]),r);
f += 1;
}
}
if(!f) { printf("passed all\n"); }
else { printf("failed %d\n",f); }
return 0;
}
I would like to simplify my check() function to not involve looping (or branching preferably). Is there a bit twiddling hack or hashed lookup table solution to compute the position of the most significant set bit in a byte?
Your question is about an efficient way to compute log2 of a value. And because you seem to want a solution that is not limited to the C language, I have been slightly lazy and tweaked some C# code I have.
You want to compute log2(x) + 1, and for x = 0 (where log2 is undefined) you define the result as 0 (i.e. you create a special case where log2(0) = -1).
static readonly Byte[] multiplyDeBruijnBitPosition = new Byte[] {
7, 2, 3, 4,
6, 1, 5, 0
};
public static Byte Log2Plus1(Byte value) {
if (value == 0)
return 0;
var roundedValue = value;
roundedValue |= (Byte) (roundedValue >> 1);
roundedValue |= (Byte) (roundedValue >> 2);
roundedValue |= (Byte) (roundedValue >> 4);
var log2 = multiplyDeBruijnBitPosition[((Byte) (roundedValue*0xE3)) >> 5];
return (Byte) (log2 + 1);
}
This bit twiddling hack is taken from Find the log base 2 of an N-bit integer in O(lg(N)) operations with multiply and lookup where you can see the equivalent C source code for 32 bit values. This code has been adapted to work on 8 bit values.
However, you may be able to use an operation that gives you the result using a very efficient built-in function (on many CPUs a single instruction such as Bit Scan Reverse is used). An answer to the question Bit twiddling: which bit is set? has some information about this. A quote from the answer provides one possible reason why there is low-level support for solving this problem:
Things like this are the core of many O(1) algorithms such as kernel schedulers which need to find the first non-empty queue signified by an array of bits.
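Since the question's test case is in C, here is a sketch of how the C# routine above might be ported, together with a variant built on a compiler intrinsic. The function names are hypothetical. msb_builtin assumes GCC or Clang, where __builtin_clz counts leading zeros of an unsigned int and is undefined for 0; on other compilers the sketch falls back to a plain loop.

```c
#include <assert.h>
#include <stdint.h>

/* Port of the 8-bit De Bruijn lookup: returns 0 for b == 0, otherwise
 * the 1-based position of the most significant set bit. */
static unsigned msb_debruijn(uint8_t b)
{
    static const uint8_t table[8] = { 7, 2, 3, 4, 6, 1, 5, 0 };
    if (b == 0)
        return 0;
    uint8_t v = b;
    v |= (uint8_t)(v >> 1);
    v |= (uint8_t)(v >> 2);
    v |= (uint8_t)(v >> 4);  /* all bits below the top set bit are now 1 */
    return table[(uint8_t)(v * 0xE3u) >> 5] + 1u;
}

/* Same result via a leading-zero count where the compiler offers one. */
static unsigned msb_builtin(uint8_t b)
{
#if defined(__GNUC__)
    /* __builtin_clz is undefined for 0, so keep the special case */
    return b ? (unsigned)(sizeof(unsigned) * 8) - (unsigned)__builtin_clz(b) : 0u;
#else
    unsigned pos = 0;
    while (b) { pos++; b >>= 1; }
    return pos;
#endif
}

/* Exhaustive comparison of the two variants over all 256 inputs. */
static void msb_selfcheck(void)
{
    for (unsigned x = 0; x < 256; x++)
        assert(msb_debruijn((uint8_t)x) == msb_builtin((uint8_t)x));
}
```

For the example byte 00101101 from the question, both functions return 6.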
That was a fun little challenge. I don't know if this one is completely portable since I only have VC++ to test with, and I certainly can't say for sure if it's more efficient than other approaches. This version was coded with a loop but it can be unrolled without too much effort.
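static unsigned char check(unsigned char b)
{
    unsigned char r = 8;
    unsigned char sub = 1;
    // s is decremented in the loop header; modifying it inside the
    // expression that also reads it would be undefined behaviour
    for (int s = 7; s >= 0; s--)
    {
        // (bit - 1) is 0 when bit s is set and all ones when it is clear,
        // so sub drops to 0 at the first set bit and stays there
        sub = sub & (((b & (1 << s)) >> s) - 1);
        r -= sub;
    }
    return r;
}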
I'm sure everyone else has long since moved on to other topics but there was something in the back of my mind suggesting that there had to be a more efficient branch-less solution to this than just unrolling the loop in my other posted solution. A quick trip to my copy of Warren put me on the right track: Binary search.
Here's my solution based on that idea:
Pseudo-code:
// see if there's a bit set in the upper half
if ((b >> 4) != 0)
{
offset = 4;
b >>= 4;
}
else
offset = 0;
// see if there's a bit set in the upper half of what's left
if ((b & 0x0C) != 0)
{
offset += 2;
b >>= 2;
}
// see if there's a bit set in the upper half of what's left
if ((b & 0x02) != 0)
{
offset++;
b >>= 1;
}
return b + offset;
Branch-less C++ implementation:
static unsigned char check(unsigned char b)
{
unsigned char adj = 4 & ((((unsigned char) - (b >> 4) >> 7) ^ 1) - 1);
unsigned char offset = adj;
b >>= adj;
adj = 2 & (((((unsigned char) - (b & 0x0C)) >> 7) ^ 1) - 1);
offset += adj;
b >>= adj;
adj = 1 & (((((unsigned char) - (b & 0x02)) >> 7) ^ 1) - 1);
return (b >> adj) + offset + adj;
}
Yes, I know that this is all academic :)
It is not possible in plain C. The best I can suggest is the following implementation of check. Despite being quite "ugly", I think it runs faster than the check version in the question.
int check(unsigned char b)
{
if(b&128) return 8;
if(b&64) return 7;
if(b&32) return 6;
if(b&16) return 5;
if(b&8) return 4;
if(b&4) return 3;
if(b&2) return 2;
if(b&1) return 1;
return 0;
}
Edit: I found a link to the actual code: http://www.hackersdelight.org/hdcodetxt/nlz.c.txt
The algorithm below is named nlz8 in that file. You can choose your favorite hack.
/*
From last comment of: http://stackoverflow.com/a/671826/315052
> Hacker's Delight explains how to correct for the error in 32-bit floats
> in 5-3 Counting Leading 0's. Here's their code, which uses an anonymous
> union to overlap asFloat and asInt: k = k & ~(k >> 1); asFloat =
> (float)k + 0.5f; n = 158 - (asInt >> 23); (and yes, this relies on
> implementation-defined behavior) - Derrick Coetzee Jan 3 '12 at 8:35
*/
unsigned char check (unsigned char b) {
union {
float asFloat;
int asInt;
} u;
unsigned k = b & ~(b >> 1);
u.asFloat = (float)k + 0.5f;
return 32 - (158 - (u.asInt >> 23));
}
Edit: I'm not exactly sure what the asker means by language independent, but below is the equivalent code in Python.
import ctypes
class Anon(ctypes.Union):
_fields_ = [
("asFloat", ctypes.c_float),
("asInt", ctypes.c_int)
]
def check(b):
k = int(b) & ~(int(b) >> 1)
a = Anon(asFloat=(float(k) + float(0.5)))
return 32 - (158 - (a.asInt >> 23))

Multiply 2 very big numbers in iOS

I have to multiply 2 large integers, each of them 80+ digits long.
What is the general approach for this kind of task?
You will have to use a large integer library. Some open-source ones are listed on Wikipedia's arbitrary-precision arithmetic page.
We forget how awesome it is that CPUs can multiply numbers that fit into a single register. Once you try to multiply two numbers that are bigger than a register you realize what a pain in the ass it is to actually multiply numbers.
I had to write a large number class a while back. Here is the code for my multiply function. KxVector is just an array of 32-bit values with a count; it is pretty self-explanatory and not included here. I removed all the other math functions for brevity. All the math operations are easy to implement except multiply and divide.
#define BIGNUM_NEGATIVE 0x80000000
class BigNum
{
public:
void mult( const BigNum& b );
KxVector<u32> mData;
s32 mFlags;
};
void BigNum::mult( const BigNum& b )
{
// special handling for multiply by zero
if ( b.isZero() )
{
mData.clear();
mFlags = 0;
return;
}
// apply sign
mFlags ^= b.mFlags & BIGNUM_NEGATIVE;
// multiply two numbers using a naive multiplication algorithm.
// this would be faster with karatsuba or FFT based multiplication
const BigNum* ppa;
const BigNum* ppb;
if ( mData.size() >= b.mData.size() )
{
ppa = this;
ppb = &b;
} else {
ppa = &b;
ppb = this;
}
assert( ppa->mData.size() >= ppb->mData.size() );
u32 aSize = ppa->mData.size();
u32 bSize = ppb->mData.size();
BigNum tmp;
for ( u32 i = 0; i < aSize + bSize; i++ )
tmp.mData.insert( 0 );
const u32* pb = ppb->mData.data();
u32 carry = 0;
for ( u32 i = 0; i < bSize; i++ )
{
u64 mult = *(pb++);
if ( mult )
{
carry = 0;
const u32* pa = ppa->mData.data();
u32* pd = tmp.mData.data() + i;
for ( u32 j = 0; j < aSize; j++ )
{
u64 prod = ( mult * *(pa++)) + *pd + carry;
*(pd++) = u32(prod);
carry = u32( prod >> 32 );
}
*pd = u32(carry);
}
}
// remove leading zeroes
while ( tmp.mData.size() && !tmp.mData.last() ) tmp.mData.pop();
mData.swap( tmp.mData );
}
It depends on what you want to do with the numbers. Do you want to use more arithmetic operators, or do you simply want to multiply two numbers and then output the result to a file? If it's the latter, it's fairly simple to put the digits in an int or char array and then implement a multiplication function that works just like the multiplication you learned to do by hand.
This is the simplest solution if you want to do this in C, but of course it's not very memory efficient. If you want to do more, I suggest looking for big-integer libraries (for C++, for example), or implementing it yourself to suit your needs.
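The digit-array approach just described could look roughly like the sketch below. The function name mul_decimal is hypothetical; it assumes both inputs are non-empty strings of decimal digits with no sign, and it does no input validation. It uses one byte per decimal digit, which is the memory inefficiency mentioned above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: multiplies two non-negative decimal strings with the
 * schoolbook method. Returns a malloc'd string the caller must free,
 * or NULL on allocation failure. */
static char *mul_decimal(const char *a, const char *b)
{
    size_t la = strlen(a), lb = strlen(b);
    assert(la > 0 && lb > 0);
    size_t n = la + lb;                  /* a product has at most la + lb digits */
    unsigned char *acc = calloc(n, 1);   /* acc[0] is the most significant digit */
    char *out = malloc(n + 1);
    if (!acc || !out) { free(acc); free(out); return NULL; }

    /* one row per digit of a, starting from the least significant digit */
    for (size_t i = la; i-- > 0; ) {
        unsigned carry = 0;
        for (size_t j = lb; j-- > 0; ) {
            unsigned t = acc[i + j + 1]
                       + (unsigned)(a[i] - '0') * (unsigned)(b[j] - '0')
                       + carry;
            acc[i + j + 1] = (unsigned char)(t % 10);
            carry = t / 10;
        }
        acc[i] = (unsigned char)(acc[i] + carry);
    }

    /* skip leading zeros but keep at least one digit */
    size_t start = 0;
    while (start + 1 < n && acc[start] == 0)
        start++;
    size_t len = n - start;
    for (size_t k = 0; k < len; k++)
        out[k] = (char)('0' + acc[start + k]);
    out[len] = '\0';
    free(acc);
    return out;
}
```

For example, mul_decimal("123456789", "987654321") yields "121932631112635269". Two 80-digit inputs need a 160-byte work array, so memory is not a practical concern at that size.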

ARM loop code optimization for calculating mean and standard deviation

I want to improve some code that uses 25% of my app's CPU time. The code is the following:
for (int i=0; i<8; i++) {
unsigned f = *p++;
sum += f;
sqsum += f*f;
}
I wrote some ARM assembly, but it does not work; it does not even compile. Here it is:
void loop(uint8_t * p , int * sum ,int * qsum)
{
__asm__ volatile("vld4.8 {d0}, [%0]! \n"
"mov r4, #0 \n"
"vmlal.u8 [%1]!, [%1]!, d0 \n"
"vmull.u8 r4, d0 , d0 \n"
"vmlal.u8 [%2]!, [%2]!, r4\n"
:
: "r"(p), "r"(sum), "r"(qsum)
: "r4"
);
}
Any help?
Here is the function I want to improve:
void calculateMeanStDev8x8(cv::Mat* patch, int sx, int sy, int& mean, float& stdev)
{
unsigned sum=0;
unsigned sqsum=0;
for (int j=0; j< 8; j++) {
const unsigned char* p = (const unsigned char*)(patch->data + (j+sy)*patch->step + sx); // pointer to the start of the matrix
//The code to improve
for (int i=0; i<8; i++) {
unsigned f = *p++;
sum += f;
sqsum += f*f;
}
}
mean = sum >> 6;
int r = (sum*sum) >> 6;
stdev = sqrtf(sqsum - r);
if (stdev < .1) {
stdev=0;
}
}
That loop is a perfect candidate for NEON optimization. You can fit your 8 unsigned 8-bit values into a single NEON register. There is no "sum all elements of a vector" instruction, but you can use the pairwise add to compute the sum of the 8 elements in 3 steps. Since we can't see the rest of your application, it's hard to know what the big picture is, but NEON is your best bet for improving the speed. All recent Apple products support NEON instructions, and in Xcode you can mix NEON intrinsics with your C++ code.
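A rough sketch of what that could look like with intrinsics follows. The function name sums8x8 and its stride parameter are hypothetical, and this version has not been benchmarked against the original loop. It processes one 8-pixel row per iteration, keeps per-lane accumulators across all 8 rows, and only reduces them to scalars once at the end. When NEON is not available it falls back to the scalar loop from the question, so the same file builds elsewhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: sum and sum of squares of an 8x8 block of bytes.
 * stride is the distance in bytes between consecutive rows. */
static void sums8x8(const uint8_t *p, size_t stride, unsigned *sum, unsigned *sqsum)
{
    assert(p != NULL && sum != NULL && sqsum != NULL);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint16x8_t s16 = vdupq_n_u16(0);  /* 8 column sums, at most 8 * 255 each */
    uint32x4_t q32 = vdupq_n_u32(0);  /* 4 partial sums of squares */
    for (int j = 0; j < 8; j++, p += stride) {
        uint8x8_t row = vld1_u8(p);                  /* load 8 pixels */
        s16 = vaddw_u8(s16, row);                    /* widen to 16 bits and accumulate */
        q32 = vpadalq_u16(q32, vmull_u8(row, row));  /* square, pairwise add into 32 bits */
    }
    uint32x4_t s32 = vpaddlq_u16(s16);               /* 8 x u16 -> 4 x u32 */
    uint32x2_t s2 = vadd_u32(vget_low_u32(s32), vget_high_u32(s32));
    uint32x2_t q2 = vadd_u32(vget_low_u32(q32), vget_high_u32(q32));
    *sum   = vget_lane_u32(s2, 0) + vget_lane_u32(s2, 1);
    *sqsum = vget_lane_u32(q2, 0) + vget_lane_u32(q2, 1);
#else
    /* scalar fallback: same arithmetic as the loop in the question */
    unsigned s = 0, q = 0;
    for (int j = 0; j < 8; j++, p += stride)
        for (int i = 0; i < 8; i++) {
            unsigned f = p[i];
            s += f;
            q += f * f;
        }
    *sum = s;
    *sqsum = q;
#endif
}
```

In calculateMeanStDev8x8 this would replace both loops, called as sums8x8(patch->data + sy*patch->step + sx, patch->step, &sum, &sqsum).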