I am trying to implement some functionality in pure C inside an iOS app. The code runs fine when the square matrix size is 50 (w = h = 50). If I increase the matrix size to 100, I get an EXC_BAD_ACCESS error. Below is the code I am using:
double solutionMatrixRed[w][h];
double solutionMatrixGreen[w][h];
double solutionMatrixBlue[w][h];
double solutionMatrixAlpha[w][h];
for(int x=0;x<w;x++)
{
for(int y=0;y<h;y++)
{
//NSLog(#"x=%d y=%d",x,y);
solutionMatrixRed[x][y] = 0;
solutionMatrixGreen[x][y] = 0;
solutionMatrixBlue[x][y] = 0;
solutionMatrixAlpha[x][y] = 0;
}
}
Even if w = h = 100, the total size should be 100*100*8 bytes = 80 KB, which seems normal. Can anyone tell me what could be wrong?
Your code allocates all four matrices in automatic storage*, which may be limited in size. Even four times 80 KB may be too much for a mobile device.
If you need to deal with that much memory, consider allocating it from dynamic memory using malloc:
double (*solutionMatrixRed)[h] = malloc((sizeof *solutionMatrixRed) * w);
// allocate other matrices in the same way, then do your processing
free(solutionMatrixRed); // Do not forget to free the memory.
* Often referred to as the "stack", after the data structure that is frequently used to implement automatic storage.
I believe the stack size on iOS is limited to 512 KB. At w = 100 and h = 100, your four arrays together require about 312.5 KB. I suspect that you are exceeding the stack size and should allocate the arrays on the heap instead (use malloc()).
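A minimal sketch of that approach, using calloc() so the matrix also comes back zeroed (which makes the initialization loop from the question unnecessary):
double (*solutionMatrixRed)[h] = calloc(w, sizeof *solutionMatrixRed); // w rows of h doubles on the heap, zero-initialized
if (solutionMatrixRed == NULL) {
    // out of memory; handle the error instead of crashing
}
// ... use solutionMatrixRed[x][y] exactly like the stack array ...
free(solutionMatrixRed);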
Because you're trying to allocate all that memory on the stack, while you should allocate it on the heap using dynamic allocation (malloc):
double **solutionMatrixRed = malloc(w * sizeof(double *));
for (int i = 0; i < w; i++)
    solutionMatrixRed[i] = malloc(h * sizeof(double)); // one row of h doubles per x index
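With this scheme every row is a separate allocation, so the cleanup has to mirror it (a sketch matching the loop above):
for (int i = 0; i < w; i++)
    free(solutionMatrixRed[i]); // free each row first
free(solutionMatrixRed);        // then the array of row pointers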
I am writing a program for AVR microcontrollers. It should read the current temperature and show it on a 7-segment display. Here is my problem: I put all the variables related to temperature (temperature, pointer position, sign, and unit) into a struct, and noticed that the execution time of operations such as dividing by 10 or taking mod 10 is much longer than when I use a normal local variable. I don't know why. I am using Atmel Studio 6.2.
struct dane
{
int32_t temperature;
int8_t pointer;
int8_t sign;
int8_t unit;
};
//************************************
//inside function of timer interrupt
static struct dane present;
//*****************************
//tested operations:
present.temperature % 10; //execution time: ~380 processor cycles; on a normal local variable: ~4 cycles
present.temperature /= 10; //execution time: ~611 cycles
Here is the function where I use it, along with a little bit of the generated assembly code.
ISR(TIMER0_OVF_vect)
{
static int8_t i = 4;
static struct dane present;
if(i == 4 && (TCCR0 & (1 << CS01)))
{
i = 0;
present = current;
if(present.temperature < 0)
present.temperature = -present.temperature;
}
if((TCCR0 & ((1 << CS00) | (1 << CS02))) && i != 0)
{
i = 0;
}
if(present.unit == current.unit) //time between here and the first instruction in the print function is about 300 cycles
{
print((i * present.sign == 3 && present.temperature % 10 == 0) ? 16 : present.temperature % 10, displays[i], i == present.pointer);
}
else
{
print(current.unit, displays[i],0);
if(i == 4)
{
i = 3;
TCCR0 = (1 << CS01);
present.unit = current.unit;
}
}
present.temperature /= 10;
i++;
}
And the assembly code for the second-to-last statement:
present.temperature /= 10;
0000021F LDI R28,0x7D Load immediate
00000220 LDI R29,0x00 Load immediate
00000221 LDD R22,Y+0 Load indirect with displacement
00000222 LDD R23,Y+1 Load indirect with displacement
00000223 LDD R24,Y+2 Load indirect with displacement
00000224 LDD R25,Y+3 Load indirect with displacement
00000225 LDI R18,0x0A Load immediate
00000226 LDI R19,0x00 Load immediate
00000227 LDI R20,0x00 Load immediate
00000228 LDI R21,0x00 Load immediate
00000229 RCALL PC+0x01AC Relative call subroutine
0000022A STD Y+0,R18 Store indirect with displacement
0000022B STD Y+1,R19 Store indirect with displacement
0000022C STD Y+2,R20 Store indirect with displacement
0000022D STD Y+3,R21 Store indirect with displacement
I can't use int16_t for the temperature because I use the same structure inside the function that converts the temperature from the sensor, and it is easier to operate on a number with a decimal part when I multiply it by a suitable power of 10.
There must be something wrong with your timings:
present.temperature % 10; //execution time: ~380 processor cycles; on a normal local variable: ~4 cycles
present.temperature /= 10; //execution time: ~611 cycles
A modulo-10 operation on a 32-bit value is never going to happen in 4 clock cycles on an AVR. 380 cycles sounds like a lot, but it is realistic for a 32-bit by 32-bit division. I am afraid even an integer division on an AVR will take a lot of time with long integers.
It is quite natural that operations on static variables take a bit longer, because they have to be fetched from and stored back to RAM. This takes maybe 10 extra clock cycles per byte compared to register variables (local variables are often kept in registers). The variable being in a struct should not change the timings at all in a case like this (with pointers to structs it may have an effect).
The only real way to get to know what is happening is to look at the assembly code produced by the compiler in each case.
And, please, include a minimal but complete example of both cases in your question. Then it is easier to see if there is something clearly wrong.
If you are interested in making your code faster, I suggest you try to use int16_t for the temperature. Your dynamic range in temperature measurement is hardly more than 12 bits (that would be, e.g., 0.1 °C resolution for temperatures between -100 °C and +300 °C), so 16-bit integers should be sufficient.
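A minimal sketch combining both suggestions, assuming the struct field is changed to int16_t and the temperature is kept scaled (e.g., in tenths of a degree):
int16_t t = present.temperature; // copy to a local so the compiler can keep it in registers
if (t < 0)
    t = -t;
uint8_t digit = t % 10;          // 16-bit div/mod is far cheaper on AVR than 32-bit
t /= 10;
present.temperature = t;         // a single store back to the static struct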
I'm starting to play around with some C code within Objective-C programs. The function I'm trying to write sorts all of the lat/long coordinates from a KML file into clusters based on 2D arrays.
I'm using three 2D arrays to accomplish this:
NSUInteger **bucketCounts refers to the number of CLLocationCoordinate2Ds in a cluster.
CLLocationCoordinate2D **coordBuckets is an array of arrays of coordinates
NSUInteger **bucketPointers refers to an index in the array of coordinates from coordBuckets
Here's the code that is messing me up:
//Initialize C arrays and indexes
int n = 10;
bucketCounts = (NSUInteger**)malloc(sizeof(NSUInteger*)*n);
bucketPointers = (NSUInteger**)malloc(sizeof(NSUInteger*)*n);
coordBuckets = (CLLocationCoordinate2D **)malloc(sizeof(CLLocationCoordinate2D*)*n);
for (int i = 0; i < n; i++) {
bucketPointers[i] = malloc(sizeof(NSUInteger)*n);
bucketCounts[i] = malloc(sizeof(NSUInteger)*n);
}
NSUInteger nextEmptyBucketIndex = 0;
int bucketMax = 500;
Then for each CLLocationCoordinate2D that needs to be added:
//find location to enter point in matrix
int latIndex = (int)(n * (oneCoord.latitude - minLat)/(maxLat-minLat));
int lonIndex = (int)(n * (oneCoord.longitude - minLon)/(maxLon-minLon));
//see if necessary bucket exists yet. If not, create it.
NSUInteger positionInBucket = bucketCounts[latIndex][lonIndex];
if (positionInBucket == 0) {
coordBuckets[nextEmptyBucketIndex] = malloc(sizeof(CLLocationCoordinate2D) * bucketMax);
bucketPointers[latIndex][lonIndex] = nextEmptyBucketIndex;
nextEmptyBucketIndex++;
}
//Insert point in bucket.
NSUInteger bucketIndex = bucketPointers[latIndex][lonIndex];
CLLocationCoordinate2D *bucketForInsert = coordBuckets[bucketIndex];
bucketForInsert[positionInBucket] = oneCoord;
bucketCounts[latIndex][lonIndex]++;
positionInBucket++;
//If bucket is full, expand it.
if (positionInBucket % bucketMax == 0) {
coordBuckets[bucketIndex] = realloc(coordBuckets[bucketIndex], (sizeof(CLLocationCoordinate2D) * (positionInBucket + bucketMax)));
}
Things seem to go well for about 800 coordinates, but at some point a value in either bucketCounts or bucketPointers gets set to an impossibly high number, which causes a reference to a bad value and crashes the program. I'm sure this is a memory management issue, but I don't know C well enough to troubleshoot it myself. Any helpful pointers as to where I'm going wrong? Thanks!
It seems to me each entry in bucketPointers can potentially have its own "bucket", requiring a unique element of coordBuckets to hold the pointer to that bucket.
The entries in bucketPointers are indexed by bucketPointers[latIndex][lonIndex], so there can be n*n of them, but you allocated only n places in coordBuckets.
I think you should allocate for n*n elements in coordBuckets.
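In code, that would be a change to one line of the setup (a sketch based on the allocation in the question):
coordBuckets = (CLLocationCoordinate2D **)malloc(sizeof(CLLocationCoordinate2D*) * n * n); // one possible bucket per (latIndex, lonIndex) cell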
Two problems I see:
You don't initialize bucketCounts[] in the given code. It may well happen to be all 0s, but you should still initialize it explicitly with calloc() or memset():
bucketCounts[i] = calloc(n, sizeof(NSUInteger));
If oneCoord.latitude == maxLat, then latIndex == n, which overflows your arrays, whose valid indexes run from 0 to n-1. The same issue applies to lonIndex. Either allocate n+1 elements and/or make sure latIndex and lonIndex are clamped to the range 0 to n-1.
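A sketch of the clamping, reusing the index computation from the question:
int latIndex = (int)(n * (oneCoord.latitude - minLat)/(maxLat-minLat));
if (latIndex < 0) latIndex = 0;         // guard against rounding below the range
if (latIndex > n - 1) latIndex = n - 1; // oneCoord.latitude == maxLat would otherwise yield n
// ... clamp lonIndex the same way ...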
In code using raw arrays like this you can solve a lot of issues with two simple rules:
Initialize all arrays (even if you technically don't need to).
Check/verify all array indexes to prevent out-of-bounds accesses.
I have gone through this site. From there I learned that pinned memory allocated with cudaMallocHost gives better transfer performance than pageable memory from malloc. Then I used two different simple programs and measured the execution time:
using cudaMallocHost
#include <stdio.h>
#include <time.h> // for clock()
#include <cuda.h>
// Kernel that executes on the CUDA device
__global__ void square_array(float *a, int N)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx<N) a[idx] = a[idx] * a[idx];
}
// main routine that executes on the host
int main(void)
{
clock_t start;
start = clock();
clock_t finish;
float *a_h, *a_d; // Pointer to host & device arrays
const int N = 100000; // Number of elements in arrays
size_t size = N * sizeof(float);
cudaMallocHost((void **) &a_h, size);
//a_h = (float *)malloc(size); // Allocate array on host
cudaMalloc((void **) &a_d, size); // Allocate array on device
// Initialize host array and copy it to CUDA device
for (int i=0; i<N; i++) a_h[i] = (float)i;
cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
// Do calculation on device:
int block_size = 4;
int n_blocks = N/block_size + (N%block_size == 0 ? 0:1);
square_array <<< n_blocks, block_size >>> (a_d, N);
// Retrieve result from device and store it in host array
cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
// Print results
for (int i=0; i<N; i++) printf("%d %f\n", i, a_h[i]);
// Cleanup
cudaFreeHost(a_h);
cudaFree(a_d);
finish = clock() - start;
double interval = finish / (double)CLOCKS_PER_SEC;
printf("%f seconds elapsed", interval);
}
using malloc
#include <stdio.h>
#include <time.h> // for clock()
#include <cuda.h>
// Kernel that executes on the CUDA device
__global__ void square_array(float *a, int N)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx<N) a[idx] = a[idx] * a[idx];
}
// main routine that executes on the host
int main(void)
{
clock_t start;
start = clock();
clock_t finish;
float *a_h, *a_d; // Pointer to host & device arrays
const int N = 100000; // Number of elements in arrays
size_t size = N * sizeof(float);
a_h = (float *)malloc(size); // Allocate array on host
cudaMalloc((void **) &a_d, size); // Allocate array on device
// Initialize host array and copy it to CUDA device
for (int i=0; i<N; i++) a_h[i] = (float)i;
cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
// Do calculation on device:
int block_size = 4;
int n_blocks = N/block_size + (N%block_size == 0 ? 0:1);
square_array <<< n_blocks, block_size >>> (a_d, N);
// Retrieve result from device and store it in host array
cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
// Print results
for (int i=0; i<N; i++) printf("%d %f\n", i, a_h[i]);
// Cleanup
free(a_h); cudaFree(a_d);
finish = clock() - start;
double interval = finish / (double)CLOCKS_PER_SEC;
printf("%f seconds elapsed", interval);
}
During execution, both programs took almost the same time.
Is there anything wrong with the implementation? What is the exact difference in execution between malloc and cudaMallocHost?
Also, with each run the execution time decreases.
If you want to see the difference in execution time for the copy operation, just time the copy operation. In many cases you will see approximately a 2x difference in execution time for just the copy operation when the underlying memory is pinned. And make your copy operation large enough/long enough so that you are well above the granularity of whatever timing mechanism you are using. The various profilers such as the visual profiler and nvprof can help here.
The cudaMallocHost operation under the hood is doing something like a malloc plus additional OS functions to "pin" each page associated with the allocation. These additional OS operations take extra time, as compared to just doing a malloc. And note that as the size of the allocation increases, the registration ("pinning") cost will generally increase as well.
Therefore, for many examples, just timing the overall execution doesn't show much difference, because while the cudaMemcpy operation may be quicker from pinned memory, the cudaMallocHost takes longer than the corresponding malloc.
So what's the point?
You may be interested in using pinned memory (i.e. cudaMallocHost) when you will be doing repeated transfers from a single buffer. You only pay the extra cost to pin it once, but you benefit on each transfer/usage.
Pinned memory is required to overlap data transfer operations (cudaMemcpyAsync) with compute activities (kernel calls). Refer to the programming guide.
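As a sketch, here is one way to time just the copy with CUDA events (a_h, a_d, and size are the variables from the programs above):
cudaEvent_t begin, end;
cudaEventCreate(&begin);
cudaEventCreate(&end);
cudaEventRecord(begin, 0);
cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice); // the operation under test
cudaEventRecord(end, 0);
cudaEventSynchronize(end);
float ms = 0.0f;
cudaEventElapsedTime(&ms, begin, end); // milliseconds between the two events
printf("H2D copy: %f ms\n", ms);
Run this once with a_h from malloc and once with a_h from cudaMallocHost to isolate the pinned-vs-pageable difference.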
I too found that just allocating a piece of memory with cudaHostAlloc / cudaMallocHost doesn't do much by itself.
To be sure, run nvprof with --print-gpu-trace and see whether the throughput for memcpyHtoD or memcpyDtoH is good. For PCIe 2.0, you should get around 6-8 GB/s.
However, pinned memory is a prerequisite for cudaMemcpyAsync.
After I called cudaMemcpyAsync, I moved whatever computations I had on the host to right after it. In this way you can overlap the asynchronous memcpys with the host computations.
I was surprised by how much time I was able to save this way; it's worth a try.
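A sketch of that pattern (the stream and the do_host_work() placeholder are mine, for illustration):
cudaStream_t stream;
cudaStreamCreate(&stream);
cudaMemcpyAsync(a_d, a_h, size, cudaMemcpyHostToDevice, stream); // returns immediately when a_h is pinned
do_host_work();                // hypothetical CPU work, overlapped with the transfer
cudaStreamSynchronize(stream); // wait for the copy to finish before using a_d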
I'm having trouble using realloc in a very basic way.
I'm trying to expand the region of memory at **ret, which points to an array of structs, with ret = realloc(ret, newsize);. Based on my debug output I know newsize is correctly increasing over the course of the loop (going from the original size of 4 to 8 to 12, etc.), but sizeof(ptr) still returns the original size of 4, and the things I'm trying to place into the newly allocated space can't be found. (I think I've narrowed it down to realloc(), which is why I'm formatting the question like this.)
I can post the function in its entirety if the problem isn't immediately evident to you; I'm just trying not to "cheat" on my homework too much (the code is kind of messy right now anyway, with heavy use of printf() for debugging).
[EDIT] Alright, so based on your answers I'm failing at debugging my code, so I guess I'll post the whole function so you can tell me more about what I'm doing wrong.
(You can ignore the printf()'s, since most of that is debug output that isn't even working.)
Booking **bookingSelectPaid(Booking **booking) {
Booking **ret = malloc(sizeof(Booking*));
printf("Initial address of ret = %p\n", ret);
size_t i = 0;
int numOfPaid = 0;
while (booking[i] != NULL)
{
if (booking[i]->paid == 1)
{
printf("Paying customer! sizeof(Booking*) = %d\n", (int)sizeof(Booking*));
++numOfPaid;
size_t newsize = sizeof(Booking*) * (numOfPaid + 1);
printf("Newsize = %d\n", (int)newsize);
Booking **temp = realloc(NULL, (size_t)newsize);
if (temp != NULL)
printf("Expansion success! => %p sizeof(new pointer) = %d ret = %p\n", temp, (int)sizeof(temp), ret);
ret = realloc(ret, newsize);
ret[i] = booking[i];
ret[i+1] = NULL;
}
++i;
printf("Sizeof(ret) = %d numOfPaid = %d\n", (int)sizeof(ret), numOfPaid);
}
return ret;
}
[EDIT2] --> http://pastebin.com/xjzUBmPg
[EDIT3] Just to be clear, the printf()'s, the temp pointer, and things of that nature are debug code, not part of the intended functionality. The line that is puzzling me is either the one with realloc(ret, newsize); or ret[i] = booking[i].
Basically I know for sure that booking contains a table of structs that ends in NULL, and I'm trying to bring the ones that have a specific value set to 1 (paid) onto the new table, which is what my main() is trying to get from this function... So where am I going wrong?
I think the problem here is that sizeof(ptr) only returns the size of the pointer itself, which depends on your architecture (you say 4, so that would mean you're running a 32-bit system).
If you allocate memory dynamically, you have to keep track of its size yourself.
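A small illustration of the difference (the names are mine, for demonstration only):
Booking **ret = malloc(4 * sizeof *ret); // room for 4 pointers
printf("%zu\n", sizeof ret);             // prints the size of the pointer itself: 4 or 8
size_t capacity = 4;                     // the allocated element count must be tracked separately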
Because sizeof(ptr) returns the size of the pointer, not the allocated size.
Yep, sizeof(ptr) is a compile-time constant. As the other answer says, it depends on the architecture: on a 32-bit architecture it will be 4, and on a 64-bit architecture it will be 8.
Good luck.
The following code is crashing my program. I found the problem is the size of the array: if I reduce the size to 320*320, it works fine. Does it make sense that this would be a limitation? If so, what is a workaround? I am coding in Objective-C for iPhone. Any help would be appreciated.
Thanks!
int life_matrix[320*350];
x_size=320;
y_size=350;
for (int counter=0; counter < x_size; counter++)
{
for (int counter2=0;counter2 < (y_size); counter2++)
{
life_matrix[counter*(int)x_size+counter2] = rand()%2;
}
}
The array is allocated on the stack, and the stack size is usually limited. If you need a big array, it is usually a good idea to allocate it on the heap.
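A minimal sketch of the heap version, using the dimensions from the question:
int *life_matrix = malloc(320 * 350 * sizeof *life_matrix); // heap instead of stack
if (life_matrix == NULL) {
    // allocation failed; handle the error
}
// ... fill and use life_matrix exactly as before ...
free(life_matrix);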
leiz's advice is correct: you really should be allocating this dynamically, otherwise you run the risk of the array being larger than the available memory on the stack.
Also, the formula you are using to map a 2-dimensional grid to a 1-dimensional array is incorrect. You should be multiplying by y_size instead of x_size:
life_matrix[counter*(int)y_size+counter2] = rand()%2;
Or you could flip your counters:
life_matrix[counter2*(int)x_size+counter] = rand()%2;
Another approach to solving this would be to use it as a 1-dimensional array for initialization:
for(int n = 0; n < x_size * y_size; ++n) {
life_matrix[n] = rand()%2;
}