Snowflake simulation in OpenGL? - objective-c

I want to do an animation of snowflakes using OpenGL. Can anyone suggest a tutorial or some sample code?

You can use transform feedback to calculate the particle physics in the vertex shader, and instancing if you want to use 3D snowflakes. Point sprites should make billboards faster and use less memory for storing vertices (you only need one vertex per snowflake).
Running the particle system in the vertex shader will make it a few times faster while the math essentially stays the same.
You could also use a 3D texture to offset the damping calculation so you can have visible turbulence.
If you use a heightmap for the ground, you can use that data to reset snowflakes that aren't visible anymore.
Transform feedback and instancing are explained in OpenGL SuperBible (Fifth Edition) in chapter 12, and point sprites in chapter 7. The source code for all examples is available online. The example code for Mac OS X only goes up to chapter 7, but it should be possible to make most of it work.
I couldn't find any good online tutorials, but the code is well commented. The transform feedback example is called "flocking". For a snowflake simulation, one vertex shader should be enough for both updating and rendering the particles in one pass.
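For reference, the relevant GL calls look roughly like the sketch below. This is a minimal sketch only, assuming an OpenGL 3.x context; the program, VAO and buffer handles, the particle count, and the out_Position/out_Velocity varying names are placeholders, and the GLSL shaders themselves are not shown.
// Register which vertex shader outputs get captured, before linking the program:
const char *varyings[] = { "out_Position", "out_Velocity" };
glTransformFeedbackVaryings(program, 2, varyings, GL_INTERLEAVED_ATTRIBS);
glLinkProgram(program);

// Each frame: read the particles from buffer A, capture the updated ones into
// buffer B, and rasterize them as point sprites in the same pass.
glUseProgram(program);
glEnable(GL_PROGRAM_POINT_SIZE);           // the vertex shader writes gl_PointSize
glBindVertexArray(vaoReadingFromBufferA);
glBindBufferBase(GL_TRANSFORM_FEEDBACK_BUFFER, 0, bufferB);
glBeginTransformFeedback(GL_POINTS);
glDrawArrays(GL_POINTS, 0, particleCount);
glEndTransformFeedback();
// ...then swap buffers A and B for the next frame.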
If you want lots of fast moving snow, the water particles from Nvidia's Cascades Demo (starting at page 114) show an interesting approach to faking a large amount of particles.

One possible solution is to use a particle system. I once made some explosion effects with it, and I think that is pretty close to what you want.
Here's one of the tutorials I used, and I think this might be helpful (BTW, there are many good tutorials on that site, you could check them out).
Also, for the snowflake generation, you could just use identical flakes, but if you want something fancier (not too fancy, but relatively easy), you could use triangle strips (which the tutorial uses) to achieve better effects, since snowflakes are symmetric.

#include <stdlib.h>
#include <sys/time.h>
#include <GL/gl.h>
#include <GL/glut.h>

#define PETALS_NUMBER 100
#define X 0
#define Y 1
#define ORG_X 2
#define ORG_Y 3
#define SIZE 4
#define GROWING 5
#define SPEED 6

struct timeval start;
unsigned long int last_idle_time;
GLdouble size = 5;
GLboolean growing = GL_TRUE;
GLdouble tab[PETALS_NUMBER][7];

unsigned int get_ticks() {
    struct timeval now;
    gettimeofday(&now, NULL);
    return (now.tv_sec - start.tv_sec) * 1000 +
           (now.tv_usec - start.tv_usec) / 1000;
}

void init() {
    for (int i = 0; i < PETALS_NUMBER; i++) {
        tab[i][X] = -300 + rand() % 600;
        tab[i][Y] = 200 + rand() % 500;
        tab[i][ORG_X] = tab[i][X];
        tab[i][ORG_Y] = tab[i][Y];
        tab[i][SIZE] = 1 + rand() % 9;
        tab[i][GROWING] = rand() % 2;
        tab[i][SPEED] = rand() % 10;
    }
}

void Idle() {
    unsigned long int time_now = get_ticks();
    for (int i = 0; i < PETALS_NUMBER; i++) {
        // fall down; reset to the original height once below the window
        tab[i][Y] -= (tab[i][SPEED] + 40.0) * (time_now - last_idle_time) / 1000.0;
        if (tab[i][Y] < -200.0) tab[i][Y] = tab[i][ORG_Y];
        // let each flake "pulse" between size 1 and 5 while drifting sideways
        if (tab[i][SIZE] > 5) {
            tab[i][GROWING] = 0;
        }
        if (tab[i][SIZE] < 1) {
            tab[i][GROWING] = 1;
        }
        if (tab[i][GROWING] == 1.0) {
            tab[i][SIZE] += 8.0 * (time_now - last_idle_time) / 1000.0;
            tab[i][X] -= (tab[i][SPEED] + 1.0) * (time_now - last_idle_time) / 1000.0;
        }
        else {
            tab[i][SIZE] -= 8.0 * (time_now - last_idle_time) / 1000.0;
            tab[i][X] += (tab[i][SPEED] + 2.0) * (time_now - last_idle_time) / 1000.0;
        }
    }
    last_idle_time = time_now;
    glutPostRedisplay();
}

// minimal display callback: draw each flake as a point
void Display() {
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
    for (int i = 0; i < PETALS_NUMBER; i++) {
        glPointSize(tab[i][SIZE]);
        glBegin(GL_POINTS);
        glVertex2d(tab[i][X], tab[i][Y]);
        glEnd();
    }
    glutSwapBuffers();
}

int main(int argc, char *argv[]) {
    glutInit(&argc, argv);
    glutInitDisplayMode(GLUT_DOUBLE | GLUT_RGB | GLUT_DEPTH);
    glutInitWindowSize(600, 400); // size of window
    gettimeofday(&start, NULL);
    init();
    glutCreateWindow(argv[0]);
    glutDisplayFunc(Display);
    glutIdleFunc(Idle);
    // simple orthographic projection matching the coordinate ranges used in init()
    glMatrixMode(GL_PROJECTION);
    glLoadIdentity();
    glOrtho(-300, 300, -200, 200, -1, 1);
    glMatrixMode(GL_MODELVIEW);
    glEnable(GL_POINT_SMOOTH);
    glutMainLoop();
    return 0;
}

Control LED brightness of microcontroller RTOS/BIOS

I'm trying to control my LED with 256 (0-255) different levels of brightness. My controller is set to 80 MHz and runs an RTOS. I'm setting the clock module to interrupt every 5 microseconds and the brightness, e.g., to 150. The LED is dimming, but I'm not sure whether this is done right to really get 256 different levels.
int counter = 1;
int brightness = 0;

void SetUp(void)
{
    SysCtlClockSet(SYSCTL_SYSDIV_2_5|SYSCTL_USE_PLL|SYSCTL_OSC_MAIN|SYSCTL_XTAL_16MHZ);
    GPIOPinTypeGPIOOutput(PORT_4, PIN_1);

    Clock_Params clockParams;
    Clock_Handle myClock;
    Error_Block eb;
    Error_init(&eb);
    Clock_Params_init(&clockParams);
    clockParams.period = 400; // every 5 microseconds
    clockParams.startFlag = TRUE;
    myClock = Clock_create(myHandler1, 400, &clockParams, &eb);
    if (myClock == NULL) {
        System_abort("Clock create failed");
    }
}

void myHandler1 (){
    brightness = 150;
    while(1){
        counter = (++counter) % 256;
        if (counter < brightness){
            GPIOPinWrite(PORT_4, PIN_1, PIN_1);
        }else{
            GPIOPinWrite(PORT_4, PIN_1, 0);
        }
    }
}
A 5 microsecond interrupt is a tall ask for an 80 MHz processor and will leave little time for other work. If you are not doing other work, you need not use interrupts at all - you could simply poll the clock counter - but even then it would be a lot of processor to throw at a rather trivial task, and the RTOS is overkill too.
A better way to perform your task is to use the timer's PWM (Pulse Width Modulation) feature. You will then be able to accurately control the brightness with zero software overhead; leaving your processor to do more interesting things.
Using a PWM you could manage with a far lower performance processor if LED control is all it will do.
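For illustration, a hardware PWM setup on this kind of part might look something like the sketch below. It assumes a TivaWare/StellarisWare-style driverlib (the question already uses SysCtlClockSet and GPIOPinTypeGPIOOutput from one); the peripheral, generator, output and pin-routing macros are placeholders that depend on which device and pin the LED is actually wired to.
#include <stdint.h>
#include <stdbool.h>
#include "driverlib/sysctl.h"
#include "driverlib/gpio.h"
#include "driverlib/pwm.h"

#define PWM_PERIOD_TICKS 256                 // one PWM count per brightness step

void PwmSetUp(void)
{
    SysCtlPeripheralEnable(SYSCTL_PERIPH_PWM0);      // placeholder: depends on the device
    // Route the LED pin to the PWM module here with GPIOPinConfigure()/GPIOPinTypePWM();
    // the exact macros depend on the device and pin.
    PWMGenConfigure(PWM0_BASE, PWM_GEN_0, PWM_GEN_MODE_DOWN | PWM_GEN_MODE_NO_SYNC);
    PWMGenPeriodSet(PWM0_BASE, PWM_GEN_0, PWM_PERIOD_TICKS);
    PWMGenEnable(PWM0_BASE, PWM_GEN_0);
    PWMOutputState(PWM0_BASE, PWM_OUT_0_BIT, true);
}

void setBrightness(uint8_t br)               // 0..255, no interrupts involved
{
    PWMPulseWidthSet(PWM0_BASE, PWM_OUT_0, br);
}
Changing the brightness is then a single register write, and the waveform is generated entirely in hardware.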
If you must use an interrupt/GPIO (for example your timer does not support PWM generation or the LED is not connected to a PWM capable pin), then it would be more efficient to set the timer incrementally. So for example, for a mark:space of 150:105, you would set the timer for 150*5us (750us), on the interrupt toggle the GPIO, then set the timer to 105*5us (525us).
A major problem with your solution is that the interrupt handler does not return - interrupt handlers must run to completion, be as short as possible, and preferably be deterministic in execution time.
Without using hardware PWM, the following based on your code fragment is probably closer to what you need:
#define PWM_QUANTA 400   // 5us

static volatile uint8_t brightness = 150 ;
static Clock_Handle myClock ;

void setBrightness( uint8_t br )
{
    brightness = br ;
}

void SetUp(void)
{
    SysCtlClockSet(SYSCTL_SYSDIV_2_5|SYSCTL_USE_PLL|SYSCTL_OSC_MAIN|SYSCTL_XTAL_16MHZ);
    GPIOPinTypeGPIOOutput(PORT_4, PIN_1);

    Clock_Params clockParams;
    Error_Block eb;
    Error_init(&eb);
    Clock_Params_init(&clockParams);
    clockParams.period = brightness * PWM_QUANTA ;
    clockParams.startFlag = TRUE;
    myClock = Clock_create(myHandler1, 400, &clockParams, &eb);
    if (myClock == NULL)
    {
        System_abort("Clock create failed");
    }
}

void myHandler1(void)
{
    static int pin_state = 1 ;

    // Toggle pin state and timer period
    if( pin_state == 0 )
    {
        pin_state = 1 ;
        Clock_setPeriod( myClock, brightness * PWM_QUANTA ) ;
    }
    else
    {
        pin_state = 0 ;
        Clock_setPeriod( myClock, (255 - brightness) * PWM_QUANTA ) ;
    }

    // Set pin state
    GPIOPinWrite(PORT_4, PIN_1, pin_state ? PIN_1 : 0) ;
}
At Clifford's urging I am elaborating on an alternate strategy for reducing the load of software dimming, as servicing interrupts every 400 clock cycles may prove difficult. The preferred solution should of course be to use hardware pulse-width modulation whenever available.
One option is to trigger interrupts only at the PWM edges. Unfortunately this strategy tends to introduce races and drift, since time elapses while adjustments are taking place, and it scales poorly to multiple channels.
Alternatively, we may switch from pulse-width to delta-sigma modulation. There is a fair bit of theory behind the concept, but in this context it boils down to toggling the pin on and off as quickly as possible while maintaining an average on-time proportional to the dimming level. As a consequence the interrupt frequency may be reduced without bringing the overall switching frequency down to visible levels.
Below is an example implementation:
// Brightness to display. More than 8 bits are required to handle the full 257-step range.
// The resolution could also be increased if desired.
volatile unsigned int brightness = 150;

void Interrupt(void) {
    // Increment the accumulator with the desired brightness
    static uint8_t accum;
    unsigned int addend = brightness;
    accum += addend;

    // Light the LED pin on overflow, relying on native integer wraparound.
    // Consequently, higher brightness values translate to keeping the LED lit more often.
    GPIOPinWrite(PORT_4, PIN_1, (accum < addend) ? PIN_1 : 0);
}
A limitation is that the switching frequency decreases with the distance from 50% brightness. Thus the final N steps may need to be clamped to 0 or 256 to prevent visible flicker.
Oh, and take care if switching losses are a concern in your application.

Do constrained refinement with CGAL isotropic_remeshing

I'd like to do refinement of, e.g., a simple cube (from a .off file); there are a few ways, but the ones suitable for what I want to do next end up with 'wrinkles', i.e. the object shape gets distorted.
The approach below promises to preserve the boundaries (shape?) of the object, permitting what you'd expect of refinement: just adding more edges and vertices:
http://doc.cgal.org/latest/Polygon_mesh_processing/Polygon_mesh_processing_2isotropic_remeshing_example_8cpp-example.html
I want an edge constraint map (and if that isn't sufficient then I'll want a vertex constraint map as well) but can't figure out the template abstractions well enough. I tried an OpenMesh Constrained_edge_map from a different CGAL example, but that's too different and won't compile. What I'm asking for is an edge map and maybe a vertex map that I can feed to the call:
PMP::isotropic_remeshing(
    faces(mesh),
    target_edge_length,
    mesh,
    PMP::parameters::number_of_iterations(nb_iter)
    .protect_constraints(true) // i.e. protect border, here
);
I'm using CGAL 4.8.1, the latest at time of writing. Thanks.
Here is a minimal example to remesh a triangulated cube:
#include <CGAL/Exact_predicates_inexact_constructions_kernel.h>
#include <CGAL/Surface_mesh.h>
#include <CGAL/boost/graph/graph_traits_Surface_mesh.h>
#include <CGAL/Polygon_mesh_processing/remesh.h>
#include <CGAL/Mesh_3/dihedral_angle_3.h>
#include <boost/foreach.hpp>
#include <fstream>

typedef CGAL::Exact_predicates_inexact_constructions_kernel K;
typedef CGAL::Surface_mesh<K::Point_3> Mesh;
typedef boost::graph_traits<Mesh>::halfedge_descriptor halfedge_descriptor;
typedef boost::graph_traits<Mesh>::edge_descriptor edge_descriptor;
namespace PMP = CGAL::Polygon_mesh_processing;

int main(int, char* argv[])
{
  std::ifstream input(argv[1]);
  Mesh tmesh;
  input >> tmesh;

  double target_edge_length = 0.20;
  unsigned int nb_iter = 10;

  // add a per-edge Boolean property marking constrained edges; the default is false
  Mesh::Property_map<edge_descriptor,bool> is_constrained =
    tmesh.add_property_map<edge_descriptor,bool>("e:is_constrained", false).first;

  // detect sharp features
  BOOST_FOREACH(edge_descriptor e, edges(tmesh))
  {
    halfedge_descriptor hd = halfedge(e,tmesh);
    if ( !is_border(e,tmesh) )
    {
      double angle = CGAL::Mesh_3::dihedral_angle(tmesh.point(source(hd,tmesh)),
                                                  tmesh.point(target(hd,tmesh)),
                                                  tmesh.point(target(next(hd,tmesh),tmesh)),
                                                  tmesh.point(target(next(opposite(hd,tmesh),tmesh),tmesh)));
      if ( CGAL::abs(angle) < 100 )
        is_constrained[e] = true;
    }
  }

  // remesh, keeping the constrained (sharp) edges in place
  PMP::isotropic_remeshing(
    faces(tmesh),
    target_edge_length,
    tmesh,
    PMP::parameters::number_of_iterations(nb_iter)
    .edge_is_constrained_map(is_constrained) );

  std::ofstream out("out.off");
  out << tmesh;
  return 0;
}

Different execution time between local variable and static structure

I am writing a program for an AVR microcontroller. It should check the current temperature and show it on a 7-segment display. And this is the problem I have: I made a structure with all variables related to temperature (temperature, pointer position, sign and unit) and saw that the execution time of, e.g., dividing by 10 or mod 10 is much longer than when I use a normal local variable. I don't know why. I use Atmel Studio 6.2.
struct dane
{
    int32_t temperature;
    int8_t pointer;
    int8_t sign;
    int8_t unit;
};

//************************************
//inside function of timer interrupt
static struct dane present;

//*****************************
//tested operations:
present.temperature % 10;  //execution time: ~380 processor's cycles, on normal local variable ~4 cycles.
present.temperature /= 10; //execution time: ~611 cycles
Here is the function where I use it, and a little bit of the assembly code.
ISR(TIMER0_OVF_vect)
{
    static int8_t i = 4;
    static struct dane present;

    if(i == 4 && (TCCR0 & (1 << CS01)))
    {
        i = 0;
        present = current;
        if(present.temperature < 0)
            present.temperature = -present.temperature;
    }
    if((TCCR0 & ((1 << CS00) | (1 << CS02))) && i != 0)
    {
        i = 0;
    }
    if(present.unit == current.unit) //time between here and the first instruction in function print equals about 300 cycles.
    {
        print((i * present.sign == 3 && present.temperature % 10 == 0) ? 16 : present.temperature % 10, displays[i], i == present.pointer);
    }
    else
    {
        print(current.unit, displays[i], 0);
        if(i == 4)
        {
            i = 3;
            TCCR0 = (1 << CS01);
            present.unit = current.unit;
        }
    }
    present.temperature /= 10;
    i++;
}
And the assembly code for the next-to-last instruction:
present.temperature /= 10;
0000021F LDI R28,0x7D Load immediate
00000220 LDI R29,0x00 Load immediate
00000221 LDD R22,Y+0 Load indirect with displacement
00000222 LDD R23,Y+1 Load indirect with displacement
00000223 LDD R24,Y+2 Load indirect with displacement
00000224 LDD R25,Y+3 Load indirect with displacement
00000225 LDI R18,0x0A Load immediate
00000226 LDI R19,0x00 Load immediate
00000227 LDI R20,0x00 Load immediate
00000228 LDI R21,0x00 Load immediate
00000229 RCALL PC+0x01AC Relative call subroutine
0000022A STD Y+0,R18 Store indirect with displacement
0000022B STD Y+1,R19 Store indirect with displacement
0000022C STD Y+2,R20 Store indirect with displacement
0000022D STD Y+3,R21 Store indirect with displacement
I can't use int16_t for the temperature because I use the same structure inside the function which converts the temperature from the sensor, and it is easier to operate on a number with a decimal part when I multiply it by a suitable power of 10.
There must be something wrong with your timings:
present.temperature % 10; //execution time: ~380 processor's cycles, on normal local variable ~4 cycles.
present.temperature /= 10; //execution time: ~611 cycles
A modulo-10 operation for a 32-bit value is never going to happen in 4 clock cycles with an AVR. The 380 cycles sounds like a lot, but it is more realistic for a 32:32 division operation. I am afraid even an integer division on an AVR will take a lot of time with long integers.
It is quite natural that operations on static variables take a bit longer, because they have to be fetched from and stored in RAM. This takes maybe 10 extra clock cycles per byte when compared to register variables (local variables are often kept in a register). The variable being in a struct should not change the timings at all in a case like this (with pointers to structs it may have an effect).
The only real way to get to know what is happening is to look at the assembly code produced by the compiler in each case.
And, please, include a minimal but complete example of both cases in your question. Then it is easier to see if there is something clearly wrong.
If you are interested in making your code faster, I suggest you try to use int16_t for the temperature. Your dynamic range in temperature measurement is hardly more than 12 bits (that would be, e.g., 0.1 °C resolution for temperatures between -100 °C and +300 °C), so 16-bit ints should be sufficient.
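As a rough sketch of that suggestion (the field names follow the question's struct; the tenths-of-a-degree scaling is an assumption), a 16-bit fixed-point representation keeps the decimal part while avoiding 32-bit division:
#include <stdint.h>

/* Temperature stored as a fixed-point value in tenths of a degree,
   e.g. 23.5 degrees is stored as 235 - well within int16_t range. */
struct dane
{
    int16_t temperature;   /* tenths of a degree */
    int8_t  pointer;
    int8_t  sign;
    int8_t  unit;
};

/* Extracting a digit is now a 16-bit divide/modulo instead of a 32-bit one,
   which is considerably cheaper on an 8-bit AVR. */
static inline int8_t last_digit(int16_t t)
{
    return (int8_t)(t % 10);
}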

cudaMallocHost vs malloc for better performance shows no difference

I have gone through this site. From there I got that pinned memory allocated using cudaMallocHost gives better performance than malloc. Then I used two different simple programs and tested the execution time as follows.
using cudaMallocHost
#include <stdio.h>
#include <time.h>
#include <cuda.h>

// Kernel that executes on the CUDA device
__global__ void square_array(float *a, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) a[idx] = a[idx] * a[idx];
}

// main routine that executes on the host
int main(void)
{
    clock_t start;
    start = clock();                  /* Line 8 */
    clock_t finish;
    float *a_h, *a_d;                 // Pointers to host & device arrays
    const int N = 100000;             // Number of elements in arrays
    size_t size = N * sizeof(float);
    cudaMallocHost((void **) &a_h, size); // Allocate pinned array on host
    //a_h = (float *)malloc(size);        // (pageable alternative)
    cudaMalloc((void **) &a_d, size);     // Allocate array on device
    // Initialize host array and copy it to CUDA device
    for (int i = 0; i < N; i++) a_h[i] = (float)i;
    cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
    // Do calculation on device:
    int block_size = 4;
    int n_blocks = N/block_size + (N % block_size == 0 ? 0 : 1);
    square_array <<< n_blocks, block_size >>> (a_d, N);
    // Retrieve result from device and store it in host array
    cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
    // Print results
    for (int i = 0; i < N; i++) printf("%d %f\n", i, a_h[i]);
    // Cleanup
    cudaFreeHost(a_h);
    cudaFree(a_d);
    finish = clock() - start;
    double interval = finish / (double)CLOCKS_PER_SEC;
    printf("%f seconds elapsed\n", interval);
    return 0;
}
using malloc
#include <stdio.h>
#include <time.h>
#include <cuda.h>

// Kernel that executes on the CUDA device
__global__ void square_array(float *a, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) a[idx] = a[idx] * a[idx];
}

// main routine that executes on the host
int main(void)
{
    clock_t start;
    start = clock();                  /* Line 8 */
    clock_t finish;
    float *a_h, *a_d;                 // Pointers to host & device arrays
    const int N = 100000;             // Number of elements in arrays
    size_t size = N * sizeof(float);
    a_h = (float *)malloc(size);      // Allocate pageable array on host
    cudaMalloc((void **) &a_d, size); // Allocate array on device
    // Initialize host array and copy it to CUDA device
    for (int i = 0; i < N; i++) a_h[i] = (float)i;
    cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
    // Do calculation on device:
    int block_size = 4;
    int n_blocks = N/block_size + (N % block_size == 0 ? 0 : 1);
    square_array <<< n_blocks, block_size >>> (a_d, N);
    // Retrieve result from device and store it in host array
    cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
    // Print results
    for (int i = 0; i < N; i++) printf("%d %f\n", i, a_h[i]);
    // Cleanup
    free(a_h);
    cudaFree(a_d);
    finish = clock() - start;
    double interval = finish / (double)CLOCKS_PER_SEC;
    printf("%f seconds elapsed\n", interval);
    return 0;
}
Here, during execution of both programs, the execution times were almost the same.
Is there anything wrong with the implementation? What is the exact difference in execution between cudaMalloc and cudaMallocHost?
Also, with each run the execution time decreases.
If you want to see the difference in execution time for the copy operation, just time the copy operation. In many cases you will see approximately a 2x difference in execution time for just the copy operation when the underlying memory is pinned. And make your copy operation large enough/long enough so that you are well above the granularity of whatever timing mechanism you are using. The various profilers such as the visual profiler and nvprof can help here.
The cudaMallocHost operation under the hood is doing something like a malloc plus additional OS functions to "pin" each page associated with the allocation. These additional OS operations take extra time, as compared to just doing a malloc. And note that as the size of the allocation increases, the registration ("pinning") cost will generally increase as well.
Therefore, for many examples, just timing the overall execution doesn't show much difference, because while the cudaMemcpy operation may be quicker from pinned memory, the cudaMallocHost takes longer than the corresponding malloc.
So what's the point?
You may be interested in using pinned memory (i.e. cudaMallocHost) when you will be doing repeated transfers from a single buffer. You only pay the extra cost to pin it once, but you benefit on each transfer/usage.
Pinned memory is required to overlap data transfer operations (cudaMemcpyAsync) with compute activities (kernel calls). Refer to the programming guide.
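For example, a minimal sketch of timing just the host-to-device copy with CUDA events (the buffer size and names are arbitrary, not taken from the question's code) might look like this:
#include <stdio.h>
#include <string.h>
#include <cuda_runtime.h>

int main(void)
{
    const size_t size = 64 << 20;                 // 64 MB, big enough to measure reliably
    float *h_pinned, *h_pageable, *d_buf;
    cudaMallocHost((void **)&h_pinned, size);     // pinned host allocation
    h_pageable = (float *)malloc(size);           // ordinary pageable allocation
    cudaMalloc((void **)&d_buf, size);
    memset(h_pinned, 0, size);                    // touch the pages before timing
    memset(h_pageable, 0, size);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float ms;

    cudaEventRecord(start);
    cudaMemcpy(d_buf, h_pageable, size, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("pageable H2D: %.3f ms\n", ms);

    cudaEventRecord(start);
    cudaMemcpy(d_buf, h_pinned, size, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("pinned   H2D: %.3f ms\n", ms);

    cudaFreeHost(h_pinned);
    free(h_pageable);
    cudaFree(d_buf);
    return 0;
}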
I too found that just allocating a piece of memory with cudaHostAlloc / cudaMallocHost doesn't do much by itself.
To be sure, run nvprof with --print-gpu-trace and see whether the throughput for memcpyHtoD or memcpyDtoH is good. For PCIe 2.0, you should get around 6-8 GB/s.
However, pinned memory is a prerequisite for cudaMemcpyAsync.
After calling cudaMemcpyAsync, I moved whatever computations I had on the host to right after it. In this way you can "layer" the asynchronous memcpys with the host computations.
I was surprised that I was able to save quite a lot of time this way; it's worth a try.
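A minimal sketch of that pattern (the stream, buffer and do_host_work names are made up for illustration):
#include <cuda_runtime.h>

static void do_host_work(void) { /* stand-in for real CPU-side computation */ }

// h_data must come from cudaMallocHost / cudaHostAlloc; with pageable memory the
// copy falls back to a staged path and will not overlap with the host work.
void overlap_example(float *h_data, float *d_data, size_t bytes)
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaMemcpyAsync(d_data, h_data, bytes, cudaMemcpyHostToDevice, stream);  // returns immediately

    do_host_work();                  // CPU work runs while the DMA transfer is in flight

    cudaStreamSynchronize(stream);   // wait before anything on the device reads d_data
    cudaStreamDestroy(stream);
}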

Optimization of HLSL shaders

I'm trying to optimize the terrain shader for my XNA game, as it seems to consume a lot of resources. It runs at around 10 to 20 FPS on my computer, and my terrain is 512*512 vertices, so the pixel shader is called a lot.
I've read that branching uses some resources, and I have 3-4 conditions in my shaders.
What could I do to avoid them? Are ternary operators more efficient than if/else conditions?
For instance:
float a = (b == x) ? c : d;
or
float a;
if (b == x)
    a = c;
else
    a = d;
I'm also using the functions lerp and clamp multiple times; would it be more efficient to use arithmetic operations instead?
Here's the least efficient part of my code:
float fog;
if (FogWaterActivated && input.WorldPosition.y - 0.1 < FogWaterHeight)
{
    if (!IsUnderWater)
        fog = clamp(input.Depth * 0.005 * (FogWaterHeight - input.WorldPosition.y), 0, 1);
    else
        fog = clamp(input.Depth * 0.02, 0, 1);

    return float4(lerp(lerp(output * light, FogColorWater, fog), ShoreColor, shore), 1);
}
else
{
    fog = clamp((input.Depth * 0.01 - FogStart) / (FogEnd - FogStart), 0, 0.8);
    return float4(lerp(lerp(output * light, FogColor, fog), ShoreColor, shore), 1);
}
Thanks!
Any time you can precalculate operations on shader constants, the better. Removing division operations by passing the inverse into the shader is another useful tip, as division is typically slower than multiplication.
In your case, precalculate (1 / (FogEnd - FogStart)) on the CPU, pass it in as a shader constant, and multiply by it on the second-to-last line of code.