S3c2440(ARM9) spi_read_write Flash Memory - embedded
I am working on SPI communication, trying to talk to an SST25VF032B (a 32 Mbit Microchip SPI flash).
When I read the manufacturer ID it shows MF_ID =>4A25BF
but it should be MF_ID =>BF254A. I am getting it exactly reversed: the first byte comes out third and the third byte comes out first.
What could be the possible reason for that?
My SPI Init function is here:
//Enable clock control register CLKCON 18 Bit enables SPI
CLKCON |= (0x01 << 18);//0x40000;
printk("s3c2440_clkcon=%08ld\n",CLKCON);
//Enable GPG2 Corresponding NSS port
GPGCON =0x1011;//010000 00 01 00 01
printk("s3c2440_GPGCON=%08ld\n",GPGCON);
SPNSS0_ENABLE();
//Enable GPE 11,12,13,Corresponding MISO0,MOSI0,SCK0 = 11 0x0000FC00
GPECON &= ~((3 << 22) | (3 << 24) | (3 << 26));
GPECON |= ((2 << 22) | (2 << 24) | (2 << 26));
//GPEUP Set; all disable
GPGUP &= ~(0x07 << 2);
GPEUP |= (0x07 << 11);
//SPI Register section
//SPI Prescaler register settings,
//Baud Rate=PCLK/2/(Prescaler value+1)
SPPRE0 = 0x18; //freq = 1M
printk("SPPRE0=%02X\n",SPPRE0);
//polling,en-sck,master,low,format A,normal = 0 | TAGD = 1
SPCON0 = (0<<5)|(1<<4)|(1<<3)|(0<<2)|(0<<1)|(0<<0);
printk("SPCON1=%02ld\n",SPCON0);
//Multi-host error detection is enabled
SPPIN0 = (0 << 2) | (1 << 1) | (0 << 0);
printk("SPPIN1=%02X\n",SPPIN0);
//Initialization procedure
SPTDAT0 = 0xff;
My spi_read_write function as follows:
static char spi_read_write (unsigned char outb)
{
// Write and Read a byte on SPI interface.
int j = 0;
unsigned char inb;
SPTDAT0 = outb;
while(!SPI_TXRX_READY) for(j = 0; j < 0xFF; j++);
SPTDAT0 = outb;
//SPTDAT0 = 0xff;
while(!SPI_TXRX_READY) for(j = 0; j < 0xFF; j++);
inb = SPRDAT0;
return (inb);
}
My Calling function is:
MEM_1_CS(0);
spi_read_write(0x9F);
m1 = spi_read_write(0x00);
m2 = spi_read_write(0x00);
m3 = spi_read_write(0x00);
MEM_1_CS(1);
printk("\n\rMF_ID =>%02X-%02X-%02X",m1,m2,m3);
Please guide me what to do?
Thanks in Advance!!
There's no apparent problem with the SPI function.
The problem is with your printing function.
ARM is a little-endian processor; it keeps the bytes reversed in memory.
You need to print them in reverse order and you'll be fine.
I had been banging my head against this for the last couple of days and finally found the solution. All I needed to do was change my spi_read_write function as follows.
static char spi_read_write (unsigned char outb)
{
int j = 0;
unsigned char inb;
while(!SPI_TXRX_READY) for(j = 0; j < 0xFF; j++);
SPTDAT0 = outb;
while(!SPI_TXRX_READY) for(j = 0; j < 0xFF; j++);
inb = SPRDAT0;
return (inb);
}
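CHANGES MADE:
First check SPI_TXRX_READY, and only then load the transmit register with SPTDAT0 = outb;. The original function wrote SPTDAT0 before checking the flag and then wrote it a second time, so each call sent the byte twice.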
Thanks all for your kind support.
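To see why the extra write matters, here is a minimal host-side sketch in C. It does not touch real S3C2440 registers: `sim_write_sptdat`, `sim_sprdat`, `sim_select` and `read_id` are hypothetical stand-ins, and the model assumes (this is not confirmed by the thread or checked against the datasheet) that the flash keeps repeating its three ID bytes for as long as it is clocked. Under those assumptions, two writes per call clock two bytes per call and give the reversed-looking ID from the question, while one write per call gives BF-25-4A.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the flash: the first clocked byte is the 0x9F
 * command (nothing useful comes back), then the three ID bytes are
 * ASSUMED to repeat for as long as the master keeps clocking. */
static const uint8_t sim_id[3] = { 0xBF, 0x25, 0x4A };
static unsigned sim_clocked;   /* bytes clocked since chip select went low */
static uint8_t sim_sprdat;     /* stands in for SPRDAT0 */

/* Stands in for MEM_1_CS(0): start a new transaction. */
static void sim_select(void)
{
    sim_clocked = 0;
    sim_sprdat = 0xFF;
}

/* Stands in for "SPTDAT0 = out" plus the ready wait: clocks one byte. */
static void sim_write_sptdat(uint8_t out)
{
    (void)out;
    sim_sprdat = (sim_clocked == 0) ? 0xFF : sim_id[(sim_clocked - 1) % 3];
    sim_clocked++;
}

/* Routine from the question: two writes, so two bytes clocked per call. */
static uint8_t xfer_double_write(uint8_t out)
{
    sim_write_sptdat(out);
    sim_write_sptdat(out);
    return sim_sprdat;
}

/* Fixed routine: exactly one write per call. */
static uint8_t xfer_single_write(uint8_t out)
{
    sim_write_sptdat(out);
    return sim_sprdat;
}

/* Sends 0x9F and collects three bytes using the given transfer routine. */
static void read_id(uint8_t (*xfer)(uint8_t), uint8_t id[3])
{
    assert(xfer != NULL && id != NULL);
    sim_select();
    xfer(0x9F);
    id[0] = xfer(0x00);
    id[1] = xfer(0x00);
    id[2] = xfer(0x00);
}
```

In this model, read_id(xfer_double_write, id) fills id with 4A, 25, BF and read_id(xfer_single_write, id) fills it with BF, 25, 4A. The bytes are printed one at a time with %02X, so the order on screen is simply the order in which they were received.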
Related
Attiny 84 Communicating with RTC Through SPI Troubles
I am currently trying to use an ATtiny84 to communicate with an RTC (DS1305) through SPI to make a buzzer vibrate at a variable interval. I've been trying to set alarm0 on the DS1305. However, the 84 does not "technically" have SPI. It has USI, which can be programmed to behave like SPI. I was wondering if any of you could review my code and board connections and let me know if you see any problems. The current problem is that I cannot get any communication going through SPI and I am having trouble finding what the issue could be.
Current board connections:
ATtiny84 | DS1305
MOSI ------ DI
MISO ------ DO
USCLK ---- CLK
Datasheets: Attiny84 DS1305
/*
 * Atmel_Jolt_Code.c
 *
 * Created: 11/28/2018 10:44:30 PM
 * Author : Nick Hulsey
 */
#include <avr/io.h>
#define F_CPU 16000000UL
#include <avr/interrupt.h>
#include <util/delay.h>

//variables for SPI
#define SPI_DDR_PORT DDRA
#define CE_PIN DDA3 //I ADDED *****
#define DO_DD_PIN DDA5 // SHOULD WE
#define DI_DD_PIN DDA6 // THEM FLIP
#define USCK_DD_PIN DDA4
#define SPI_MODE0 0x00
#define SPI_MODE1 0x04
#define MOTOR_PIN DDA7 //I ADDED *****

void SPI_begin();
void setDataMode(uint8_t spiDataMode);
uint8_t transfer(uint8_t spiData);
void flipLatch(uint8_t on);

int main(void)
{
    SPI_begin();
    setDataMode(SPI_MODE1);
    DDRA |= (1 << MOTOR_PIN);

    //**startup**
    uint8_t status_register = 0x10;
    uint8_t control_register = 0x8F;
    uint8_t control_byte = 0x05;
    uint8_t alarm_registers[] = {0x8A, 0x89, 0x88, 0x87};

    //set control
    flipLatch(1);
    transfer(control_register);
    transfer(0);
    flipLatch(0);
    flipLatch(1);
    transfer(control_register);
    transfer(control_byte);
    flipLatch(0);

    //set alarm:
    for (int i = 0; i < 4; i++){
        flipLatch(1);
        transfer(alarm_registers[i]);
        transfer(0x80); //0b10000000
        flipLatch(0);
    }

    //THIS MIGHT NEED WORK
    //GIMSK |= (1 << PCIE1);//set external interrupt (A1)
    PCMSK0 |= (1 << PCINT1);
    sei();

    while (1) //our main loop
    {
        //reading the flag from the status register
        uint8_t status = transfer(status_register);
        if(status == 0x01){//if alarm 0 has been flagged
            PORTA ^= (1 << MOTOR_PIN);
            _delay_ms(100);
        }
    }
}

//if A1 has changed state at all this function will fire
ISR(PCINT1_vect){
    PORTA ^= (1 << MOTOR_PIN);//invert motor power
    _delay_ms(100);
}

void SPI_begin(){
    USICR &= ~((1 << USISIE) | (1 << USIOIE) | (1 << USIWM0));//Turn off these bits
    USICR |= (1 << USIWM0) | (1 << USICS1) | (1 << USICLK);//Turn on these bits
    //REVIEW THIS PAGE 128
    //external,positive edge software clock
    //What does this mean
    SPI_DDR_PORT |= 1 << USCK_DD_PIN; // set the USCK pin as output
    SPI_DDR_PORT |= 1 << DO_DD_PIN; // set the DO pin as output
    SPI_DDR_PORT |= 1 << CE_PIN;// ******** I ADDED
    SPI_DDR_PORT &= ~(1 << DI_DD_PIN); // set the DI pin as input
}

void setDataMode(uint8_t spiDataMode)
{
    if (spiDataMode == SPI_MODE1)
        USICR |= (1 << USICS0);
    else
        USICR &= (1 << USICS0);
}

//returns values returned from the IC
uint8_t transfer(uint8_t spiData)
{
    USIDR = spiData;
    USISR = (1 << USIOIF); // clear counter and counter overflow interrupt flag
    //ATOMIC_BLOCK(ATOMIC_RESTORESTATE) // ensure a consistent clock period
    //{
    while ( !(USISR & (1 << USIOIF)) ) USICR |= (1 << USITC);
    //}
    return USIDR;
}

void flipLatch(uint8_t on){
    if (on == 1)
        PORTA |= (1 << CE_PIN);
    else
        PORTA &= ~(1 << CE_PIN);
}
crc16 XMODEM from hexstring [Vb.net]
I want to figure out how CRC16 XMODEM works and write code for it. It will be calculated over 3 to 18 bytes and triggered by a button; it takes hex values and shows the result as a hex value as well. For example: 0x05 0x02 0xAA 0xAA gives 0x3430 according to http://crccalc.com/, and this is correct. But how do I implement this in code? Does anyone have any info, please?
unsigned crc16xmodem(unsigned crc, unsigned char const *data, size_t len)
{
    if (data == NULL)
        return 0;
    while (len--) {
        crc ^= (unsigned)(*data++) << 8;
        for (unsigned k = 0; k < 8; k++)
            crc = crc & 0x8000 ? (crc << 1) ^ 0x1021 : crc << 1;
    }
    return crc & 0xffff;
}
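Since the question starts from a hex string, a small usage sketch in C may help (the question asks for VB.net, so this would still need porting). It repeats the routine above so that it stands alone; `crc16xmodem_hex` is a hypothetical helper name, and the 18-byte limit is taken from the question. Parsing stops at the first token that is not a hex number.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Bitwise CRC-16/XMODEM: polynomial 0x1021, MSB first, no reflection. */
static unsigned crc16xmodem(unsigned crc, const unsigned char *data, size_t len)
{
    if (data == NULL)
        return 0;
    while (len--) {
        crc ^= (unsigned)(*data++) << 8;
        for (unsigned k = 0; k < 8; k++)
            crc = crc & 0x8000 ? (crc << 1) ^ 0x1021 : crc << 1;
    }
    return crc & 0xffff;
}

/* Hypothetical helper: parses whitespace-separated hex bytes such as
 * "05 02 AA AA" or "0x05 0x02 0xAA 0xAA" and returns their CRC with an
 * initial value of 0. Returns -1 if a value exceeds 0xFF or more than
 * 18 bytes are given. */
static long crc16xmodem_hex(const char *text)
{
    unsigned char buf[18];
    size_t n = 0;
    unsigned value;
    int used;

    assert(text != NULL);
    /* %x skips leading whitespace and accepts an optional 0x prefix;
     * %n reports how many characters were consumed. */
    while (sscanf(text, "%x%n", &value, &used) == 1) {
        if (value > 0xFF || n == sizeof buf)
            return -1;
        buf[n++] = (unsigned char)value;
        text += used;
    }
    return (long)crc16xmodem(0, buf, n);
}
```

Calling crc16xmodem_hex("05 02 AA AA") returns 0x3430, the value quoted from crccalc.com in the question.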
Bitwise operations, bits falling off, fit into 8 bit container
I'm trying to fit 16, 32 and 64-bit values into an 8-bit container. I'm able to shift the bits down so that only 8 bits of data are left for the container, but I do not know how to bring them back to reflect the original value. I've been racking my brain all day and I cannot figure it out. Any help would be so amazing. Here's the code I've been experimenting with before I start my lab project, because this is the thing that has me confused. How would I store the 8 bits and then go back and pull the value up again without the lost bits?
#include <iostream>
#include <stdio.h>
#include <cstdint>
using namespace std;

int main()
{
    const int MAX_SIZE = 1000;
    uint16_t data = 62153;
    uint8_t mem[MAX_SIZE];
    cout << data << endl;
    data = ( data >> 8) & 0xff;
    cout << data << endl;
    data = ( data << 8);
    cout << data << endl;
    return 0;
}
output is:
1111 0010 1100 1001 = 62153
1111 0010 = 242 after bit shift
1111 0010 0000 0000 = 61952 shift back
Those bits are gone. How can I save space by breaking up the bits so I can store them in a smaller container, while still having a function that can go back and read what the full value was before the shift? This is homework, so I'm not asking for an answer. A push in the right direction would be greatly appreciated.
This should give you a push in A direction. You could also use a stack to achieve something similar. By using a vector you could determine what the output would be by checking the size of the vector. For example if the vector::size() == 2 you know you need to create a new uint16_t. If the vector size is 4 you need a uint32_t etc...
#include <vector>
#include <cstdint>
#include <iostream>

int main()
{
    std::vector<uint8_t> bytes;
    uint16_t data = 62153;
    uint8_t lsb = data & 0xff;
    uint8_t msb = (data >> 8) & 0xff;
    bytes.push_back(lsb);
    bytes.push_back(msb);
    uint16_t new_data = 0;
    new_data = new_data | (bytes.at(1) << 8);
    new_data = new_data | bytes.at(0);
    std::cout << new_data << std::endl;
    return 0;
}
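The same idea can be written in plain C for any width up to 64 bits. This is only a sketch: `store_le` and `load_le` are hypothetical helper names, and storing the least significant byte first is a choice made for the example, not something the question requires.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: writes the low `width` bytes of `value` into
 * mem[0..width-1], least significant byte first. */
static void store_le(uint8_t *mem, uint64_t value, size_t width)
{
    assert(mem != NULL && width <= 8);
    for (size_t i = 0; i < width; i++)
        mem[i] = (uint8_t)(value >> (8 * i));
}

/* Hypothetical helper: rebuilds the value from `width` bytes that were
 * written by store_le. */
static uint64_t load_le(const uint8_t *mem, size_t width)
{
    uint64_t value = 0;

    assert(mem != NULL && width <= 8);
    for (size_t i = 0; i < width; i++)
        value |= (uint64_t)mem[i] << (8 * i);
    return value;
}
```

Every byte is kept in its own slot of the 8-bit array, so nothing "falls off": store_le(mem, 62153, 2) writes 0xC9 and 0xF2, and load_le(mem, 2) returns 62153 again.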
Determine Position of Most Significant Set Bit in a Byte
I have a byte I am using to store bit flags. I need to compute the position of the most significant set bit in the byte.
Example Byte: 00101101 => 6 is the position of the most significant set bit
Compact Hex Mapping:
[0x00] => 0x00
[0x01] => 0x01
[0x02,0x03] => 0x02
[0x04,0x07] => 0x03
[0x08,0x0F] => 0x04
[0x10,0x1F] => 0x05
[0x20,0x3F] => 0x06
[0x40,0x7F] => 0x07
[0x80,0xFF] => 0x08
TestCase in C:
#include <stdio.h>

unsigned char check(unsigned char b) {
    unsigned char c = 0x08;
    unsigned char m = 0x80;
    do {
        if(m&b) { return c; }
        else { c -= 0x01; }
    } while(m>>=1);
    return 0; //reached only when b == 0
}

int main() {
    unsigned char input[256] = {
        0x00,0x01,0x02,0x03,0x04,0x05,0x06,0x07,0x08,0x09,0x0a,0x0b,0x0c,0x0d,0x0e,0x0f,
        0x10,0x11,0x12,0x13,0x14,0x15,0x16,0x17,0x18,0x19,0x1a,0x1b,0x1c,0x1d,0x1e,0x1f,
        0x20,0x21,0x22,0x23,0x24,0x25,0x26,0x27,0x28,0x29,0x2a,0x2b,0x2c,0x2d,0x2e,0x2f,
        0x30,0x31,0x32,0x33,0x34,0x35,0x36,0x37,0x38,0x39,0x3a,0x3b,0x3c,0x3d,0x3e,0x3f,
        0x40,0x41,0x42,0x43,0x44,0x45,0x46,0x47,0x48,0x49,0x4a,0x4b,0x4c,0x4d,0x4e,0x4f,
        0x50,0x51,0x52,0x53,0x54,0x55,0x56,0x57,0x58,0x59,0x5a,0x5b,0x5c,0x5d,0x5e,0x5f,
        0x60,0x61,0x62,0x63,0x64,0x65,0x66,0x67,0x68,0x69,0x6a,0x6b,0x6c,0x6d,0x6e,0x6f,
        0x70,0x71,0x72,0x73,0x74,0x75,0x76,0x77,0x78,0x79,0x7a,0x7b,0x7c,0x7d,0x7e,0x7f,
        0x80,0x81,0x82,0x83,0x84,0x85,0x86,0x87,0x88,0x89,0x8a,0x8b,0x8c,0x8d,0x8e,0x8f,
        0x90,0x91,0x92,0x93,0x94,0x95,0x96,0x97,0x98,0x99,0x9a,0x9b,0x9c,0x9d,0x9e,0x9f,
        0xa0,0xa1,0xa2,0xa3,0xa4,0xa5,0xa6,0xa7,0xa8,0xa9,0xaa,0xab,0xac,0xad,0xae,0xaf,
        0xb0,0xb1,0xb2,0xb3,0xb4,0xb5,0xb6,0xb7,0xb8,0xb9,0xba,0xbb,0xbc,0xbd,0xbe,0xbf,
        0xc0,0xc1,0xc2,0xc3,0xc4,0xc5,0xc6,0xc7,0xc8,0xc9,0xca,0xcb,0xcc,0xcd,0xce,0xcf,
        0xd0,0xd1,0xd2,0xd3,0xd4,0xd5,0xd6,0xd7,0xd8,0xd9,0xda,0xdb,0xdc,0xdd,0xde,0xdf,
        0xe0,0xe1,0xe2,0xe3,0xe4,0xe5,0xe6,0xe7,0xe8,0xe9,0xea,0xeb,0xec,0xed,0xee,0xef,
        0xf0,0xf1,0xf2,0xf3,0xf4,0xf5,0xf6,0xf7,0xf8,0xf9,0xfa,0xfb,0xfc,0xfd,0xfe,0xff
    };
    unsigned char truth[256] = {
        0x00,0x01,0x02,0x02,0x03,0x03,0x03,0x03,0x04,0x04,0x04,0x04,0x04,0x04,0x04,0x04,
        0x05,0x05,0x05,0x05,0x05,0x05,0x05,0x05,0x05,0x05,0x05,0x05,0x05,0x05,0x05,0x05,
        0x06,0x06,0x06,0x06,0x06,0x06,0x06,0x06,0x06,0x06,0x06,0x06,0x06,0x06,0x06,0x06,
        0x06,0x06,0x06,0x06,0x06,0x06,0x06,0x06,0x06,0x06,0x06,0x06,0x06,0x06,0x06,0x06,
        0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,
        0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,
        0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,
        0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,
        0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,
        0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,
        0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,
        0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,
        0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,
        0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,
        0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,
        0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08};
    int i,r;
    int f = 0;
    for(i=0; i<256; ++i) {
        r=check(input[i]);
        if(r !=(truth[i])) {
            printf("failed %d : 0x%x : %d\n",i,0x000000FF & ((int)input[i]),r);
            f += 1;
        }
    }
    if(!f) { printf("passed all\n"); }
    else { printf("failed %d\n",f); }
    return 0;
}
I would like to simplify my check() function to not involve looping (or branching preferably). Is there a bit twiddling hack or hashed lookup table solution to compute the position of the most significant set bit in a byte?
Your question is about an efficient way to compute log2 of a value. And because you seem to want a solution that is not limited to the C language, I have been slightly lazy and tweaked some C# code I have. You want to compute log2(x) + 1, and for x = 0 (where log2 is undefined) you define the result as 0 (i.e. you create a special case where log2(0) = -1).
static readonly Byte[] multiplyDeBruijnBitPosition = new Byte[] { 7, 2, 3, 4, 6, 1, 5, 0 };

public static Byte Log2Plus1(Byte value) {
    if (value == 0)
        return 0;
    var roundedValue = value;
    roundedValue |= (Byte) (roundedValue >> 1);
    roundedValue |= (Byte) (roundedValue >> 2);
    roundedValue |= (Byte) (roundedValue >> 4);
    var log2 = multiplyDeBruijnBitPosition[((Byte) (roundedValue*0xE3)) >> 5];
    return (Byte) (log2 + 1);
}
This bit twiddling hack is taken from "Find the log base 2 of an N-bit integer in O(lg(N)) operations with multiply and lookup", where you can see the equivalent C source code for 32-bit values. This code has been adapted to work on 8-bit values.
However, you may be able to use an operation that gives you the result using a very efficient built-in function (on many CPUs a single instruction like Bit Scan Reverse is used). An answer to the question "Bit twiddling: which bit is set?" has some information about this. A quote from the answer provides one possible reason why there is low-level support for solving this problem:
Things like this are the core of many O(1) algorithms such as kernel schedulers which need to find the first non-empty queue signified by an array of bits.
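Since the test case in the question is written in C, here is a sketch of the same 8-bit multiply-and-lookup routine ported to C. The names `debruijn_pos`, `log2_plus1` and `log2_plus1_selfcheck` are made up for this example; the table and the 0xE3 multiplier are the ones from the C# code above. The self-check compares every input against a plain shift-and-count loop.

```c
#include <assert.h>
#include <stdint.h>

/* Lookup table from the C# answer, indexed by the top 3 bits of the
 * 8-bit product (v * 0xE3). */
static const uint8_t debruijn_pos[8] = { 7, 2, 3, 4, 6, 1, 5, 0 };

/* Returns the 1-based position of the most significant set bit,
 * or 0 when no bit is set. */
static uint8_t log2_plus1(uint8_t value)
{
    uint8_t v = value;

    if (v == 0)
        return 0;
    /* Smear the highest set bit into every lower position. */
    v |= v >> 1;
    v |= v >> 2;
    v |= v >> 4;
    /* Keep only the low 8 bits of the product, then take its top 3 bits. */
    return (uint8_t)(debruijn_pos[(uint8_t)(v * 0xE3u) >> 5] + 1);
}

/* Checks all 256 inputs against a straightforward shift-and-count loop. */
static void log2_plus1_selfcheck(void)
{
    for (unsigned x = 0; x < 256; x++) {
        unsigned expected = 0;
        for (unsigned t = x; t != 0; t >>= 1)
            expected++;
        assert(log2_plus1((uint8_t)x) == expected);
    }
}
```

For the example byte from the question, log2_plus1(0x2D) returns 6. The single remaining branch is the test for zero.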
That was a fun little challenge. I don't know if this one is completely portable since I only have VC++ to test with, and I certainly can't say for sure if it's more efficient than other approaches. This version was coded with a loop but it can be unrolled without too much effort.
static unsigned char check(unsigned char b)
{
    unsigned char r = 8;
    unsigned char sub = 1;
    unsigned char s = 7;
    for (char i = 0; i < 8; i++)
    {
        sub = sub & ((( b & (1 << s)) >> s--) - 1);
        r -= sub;
    }
    return r;
}
I'm sure everyone else has long since moved on to other topics but there was something in the back of my mind suggesting that there had to be a more efficient branch-less solution to this than just unrolling the loop in my other posted solution. A quick trip to my copy of Warren put me on the right track: Binary search. Here's my solution based on that idea:
Pseudo-code:
// see if there's a bit set in the upper half
if ((b >> 4) != 0)
{
    offset = 4;
    b >>= 4;
}
else
    offset = 0;
// see if there's a bit set in the upper half of what's left
if ((b & 0x0C) != 0)
{
    offset += 2;
    b >>= 2;
}
// see if there's a bit set in the upper half of what's left
if ((b & 0x02) != 0)
{
    offset++;
    b >>= 1;
}
return b + offset;
Branch-less C++ implementation:
static unsigned char check(unsigned char b)
{
    unsigned char adj = 4 & ((((unsigned char) - (b >> 4) >> 7) ^ 1) - 1);
    unsigned char offset = adj;
    b >>= adj;
    adj = 2 & (((((unsigned char) - (b & 0x0C)) >> 7) ^ 1) - 1);
    offset += adj;
    b >>= adj;
    adj = 1 & (((((unsigned char) - (b & 0x02)) >> 7) ^ 1) - 1);
    return (b >> adj) + offset + adj;
}
Yes, I know that this is all academic :)
It is not possible in plain C. The best I would suggest is the following implementation of check. Despite being quite "ugly", I think it runs faster than the check version in the question.
int check(unsigned char b)
{
    if(b&128) return 8;
    if(b&64) return 7;
    if(b&32) return 6;
    if(b&16) return 5;
    if(b&8) return 4;
    if(b&4) return 3;
    if(b&2) return 2;
    if(b&1) return 1;
    return 0;
}
Edit: I found a link to the actual code: http://www.hackersdelight.org/hdcodetxt/nlz.c.txt
The algorithm below is named nlz8 in that file. You can choose your favorite hack.
/*
From last comment of: http://stackoverflow.com/a/671826/315052
> Hacker's Delight explains how to correct for the error in 32-bit floats
> in 5-3 Counting Leading 0's. Here's their code, which uses an anonymous
> union to overlap asFloat and asInt: k = k & ~(k >> 1); asFloat =
> (float)k + 0.5f; n = 158 - (asInt >> 23); (and yes, this relies on
> implementation-defined behavior) - Derrick Coetzee Jan 3 '12 at 8:35
*/
unsigned char check (unsigned char b)
{
    union {
        float asFloat;
        int asInt;
    } u;
    unsigned k = b & ~(b >> 1);
    u.asFloat = (float)k + 0.5f;
    return 32 - (158 - (u.asInt >> 23));
}
Edit -- not exactly sure what the asker means by language independent, but below is the equivalent code in python.
import ctypes

class Anon(ctypes.Union):
    _fields_ = [
        ("asFloat", ctypes.c_float),
        ("asInt", ctypes.c_int)
    ]

def check(b):
    k = int(b) & ~(int(b) >> 1)
    a = Anon(asFloat=(float(k) + float(0.5)))
    return 32 - (158 - (a.asInt >> 23))
g++ SSE intrinsics dilemma - value from intrinsic "saturates"
I wrote a simple program to implement SSE intrinsics for computing the inner product of two large (100000 or more elements) vectors. The program compares the execution time for both, inner product computed the conventional way and using intrinsics. Everything works out fine, until I insert (just for the fun of it) an inner loop before the statement that computes the inner product. Before I go further, here is the code:
//this is a sample Intrinsics program to compute inner product of two vectors and compare Intrinsics with traditional method of doing things.
#include <iostream>
#include <iomanip>
#include <xmmintrin.h>
#include <stdio.h>
#include <time.h>
#include <stdlib.h>
using namespace std;

typedef float v4sf __attribute__ ((vector_size(16)));

double innerProduct(float* arr1, int len1, float* arr2, int len2) { //assume len1 = len2.
    float result = 0.0;
    for(int i = 0; i < len1; i++) {
        for(int j = 0; j < len1; j++) {
            result += (arr1[i] * arr2[i]);
        }
    }
    //float y = 1.23e+09;
    //cout << "y = " << y << endl;
    return result;
}

double sse_v4sf_innerProduct(float* arr1, int len1, float* arr2, int len2) { //assume that len1 = len2.
    if(len1 != len2) {
        cout << "Lengths not equal." << endl;
        exit(1);
    }
    /*steps:
     * 1. load a long-type (4 float) into a v4sf type data from both arrays.
     * 2. multiply the two.
     * 3. multiply the same and store result.
     * 4. add this to previous results.
     */
    v4sf arr1Data, arr2Data, prevSums, multVal, xyz;
    //__builtin_ia32_xorps(prevSums, prevSums); //making it equal zero.
    //can explicitly load 0 into prevSums using loadps or storeps (Check).
    float temp[4] = {0.0, 0.0, 0.0, 0.0};
    prevSums = __builtin_ia32_loadups(temp);
    float result = 0.0;
    for(int i = 0; i < (len1 - 3); i += 4) {
        for(int j = 0; j < len1; j++) {
            arr1Data = __builtin_ia32_loadups(&arr1[i]);
            arr2Data = __builtin_ia32_loadups(&arr2[i]); //store the contents of two arrays.
            multVal = __builtin_ia32_mulps(arr1Data, arr2Data); //multiply.
            xyz = __builtin_ia32_addps(multVal, prevSums);
            prevSums = xyz;
        }
    }
    //prevSums will hold the sums of 4 32-bit floating point values taken at a time. Individual entries in prevSums also need to be added.
    __builtin_ia32_storeups(temp, prevSums); //store prevSums into temp.
    cout << "Values of temp:" << endl;
    for(int i = 0; i < 4; i++)
        cout << temp[i] << endl;
    result += temp[0] + temp[1] + temp[2] + temp[3];
    return result;
}

int main() {
    clock_t begin, end;
    int length = 100000;
    float *arr1, *arr2;
    double result_Conventional, result_Intrinsic;
    // printStats("Allocating memory.");
    arr1 = new float[length];
    arr2 = new float[length];
    // printStats("End allocation.");
    srand(time(NULL)); //init random seed.
    // printStats("Initializing array1 and array2");
    begin = clock();
    for(int i = 0; i < length; i++) {
        // for(int j = 0; j < length; j++) {
        // arr1[i] = rand() % 10 + 1;
        arr1[i] = 2.5;
        // arr2[i] = rand() % 10 - 1;
        arr2[i] = 2.5;
        // }
    }
    end = clock();
    cout << "Time to initialize array1 and array2 = " << ((double) (end - begin)) / CLOCKS_PER_SEC << endl;
    // printStats("Finished initialization.");
    // printStats("Begin inner product conventionally.");
    begin = clock();
    result_Conventional = innerProduct(arr1, length, arr2, length);
    end = clock();
    cout << "Time to compute inner product conventionally = " << ((double) (end - begin)) / CLOCKS_PER_SEC << endl;
    // printStats("End inner product conventionally.");
    // printStats("Begin inner product using Intrinsics.");
    begin = clock();
    result_Intrinsic = sse_v4sf_innerProduct(arr1, length, arr2, length);
    end = clock();
    cout << "Time to compute inner product with intrinsics = " << ((double) (end - begin)) / CLOCKS_PER_SEC << endl;
    //printStats("End inner product using Intrinsics.");
    cout << "Results: " << endl;
    cout << " result_Conventional = " << result_Conventional << endl;
    cout << " result_Intrinsics = " << result_Intrinsic << endl;
    return 0;
}
I use the following g++ invocation to build this:
g++ -W -Wall -O2 -pedantic -march=i386 -msse intrinsics_SSE_innerProduct.C -o innerProduct
Each of the loops above, in both the functions, runs a total of N^2 times. However, given that arr1 and arr2 (the two floating point vectors) are loaded with a value 2.5, the length of the array is 100,000, the result in both cases should be 6.25e+10. The results I get are:
Results:
 result_Conventional = 6.25e+10
 result_Intrinsics = 5.36871e+08
This is not all. It seems that the value returned from the function that uses intrinsics "saturates" at the value above. I tried putting other values for the elements of the array and different sizes too. But it seems that any value above 1.0 for the array contents and any size above 1000 meets with the same value we see above. Initially, I thought it might be because all operations within SSE are in floating point, but floating point should be able to store a number that is of the order of e+08. I am trying to see where I could be going wrong but cannot seem to figure it out. I am using g++ version: g++ (GCC) 4.4.1 20090725 (Red Hat 4.4.1-2). Any help on this is most welcome. Thanks, Sriram.
The problem that you are having is that while a float can store 6.25e+10, it only has about 7 significant decimal digits of precision (a 24-bit significand). This means that when you build a large number by adding lots of small numbers together a bit at a time, you reach a point where the smaller number is below the lowest-precision digit of the larger number, so adding it has no effect. As to why you are not getting this behaviour in the non-intrinsic version, it is likely that the result variable is being held in a register which uses a higher precision than the actual storage of a float, so it is not being truncated to the precision of a float on every iteration of the loop. You would have to look at the generated assembler code to be sure.
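The stall can be reproduced without SSE. The sketch below is written in plain C, assumes IEEE 754 single precision with the default round-to-nearest mode, and uses made-up names (`sum_as_float`, `sum_as_double`). The `volatile` qualifier is there only to force the running sum to be rounded to a float on every iteration, which is what happens in each lane of the SSE register.

```c
#include <assert.h>

/* Adds `term` to a float accumulator `count` times. The volatile store
 * forces the sum to be rounded to float precision on every iteration. */
static float sum_as_float(float term, long count)
{
    volatile float acc = 0.0f;

    assert(count >= 0);
    for (long i = 0; i < count; i++)
        acc = acc + term;
    return acc;
}

/* Same loop with a double accumulator, for comparison. */
static double sum_as_double(double term, long count)
{
    double acc = 0.0;

    assert(count >= 0);
    for (long i = 0; i < count; i++)
        acc += term;
    return acc;
}
```

With term = 6.25 (that is, 2.5 * 2.5) and 40,000,000 additions, sum_as_float stops growing at 134217728 (2^27): neighbouring floats are 16 apart there, so adding 6.25 rounds back to the same value. The double version reaches 250000000. Four lanes each stuck at 134217728 add up to 536870912, which is consistent with the 5.36871e+08 reported in the question.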