AF_XDP queue ring index overflow?

I'm struggling with the UMEM implementation of AF_XDP, trying to understand the following structure from the kernel's net/xdp/xsk_queue.h:
struct xsk_queue {
    u64 chunk_mask;
    u64 size;
    u32 ring_mask;
    u32 nentries;
    u32 prod_head;
    u32 prod_tail;
    u32 cons_head;
    u32 cons_tail;
    struct xdp_ring *ring;
    u64 invalid_descs;
};
As for {prod,cons}_{head,tail}, there are comments in the same file that explain how these four members are used.
After hours, I still can't figure out one question:
all these members are incremented as needed (e.g. when the consumer consumes one item from the queue, cons_tail is incremented), and the actual index is obtained by ANDing with ring_mask, which is obvious, but I can find no place that deals with u32 overflow.
I know it's hard to reach 2^32 in normal applications, but it's still possible with long-running processes, isn't it? Did I miss something here?
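For reference, here is a minimal sketch (my own, not from the kernel source) of why free-running u32 counters are safe with a power-of-two ring, assuming ring_mask = nentries - 1:

#include <stdint.h>

#define NENTRIES  1024u                  /* power of two */
#define RING_MASK (NENTRIES - 1u)

/* The counters run freely and are reduced to a slot only on access. */
static uint32_t slot(uint32_t counter)              { return counter & RING_MASK; }
/* Unsigned subtraction is modulo 2^32, so the fill level survives wraparound. */
static uint32_t level(uint32_t prod, uint32_t cons) { return prod - cons; }

/* Example: prod = 0xFFFFFFFF, cons = 0xFFFFFFFD -> level() = 2.
   After prod advances and wraps to 0, level(0, 0xFFFFFFFD) = 3 (mod 2^32),
   which is still correct: the arithmetic keeps working across the wrap. */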
thanks in advance.

Related

Precision issue with CGAL's Delaunay triangulation

I am having an issue that is probably due to the mismatch between the precision of a double and the infinite precision used in CGAL, but I cannot seem to solve it, and I do not see any way to set a tolerance.
I input a set of points (the locations are initially doubles).
When the points are aligned horizontally on the upper part (and it only happens there), I sometimes (not always) get too many triangles (with really small, almost-zero areas) being generated: see image.
(*Notice in the upper section how there is a line that seems to be thicker than the rest, because there are at least 3 triangles there.)
This is what I am doing in my code; I tried to set the kernel to handle the imprecision of doubles:
typedef CGAL::Simple_cartesian<double> CK;
typedef CGAL::Filtered_kernel<CK> K;
//typedef CGAL::Exact_predicates_inexact_constructions_kernel K;
typedef K::FT FT;
typedef K::Point_2 Point;
typedef K::Segment_2 Segment;
typedef CGAL::Polygon_2<K> Polygon_2;
typedef CGAL::Triangulation_vertex_base_with_info_2<unsigned long, K> Vb2;
typedef CGAL::Triangulation_face_base_2<K> Fb; // face base needed by the TDS
typedef CGAL::Triangulation_data_structure_2<Vb2, Fb> Tds2;
typedef CGAL::Delaunay_triangulation_2<K, Tds2> Delaunay;
std::vector<std::vector<long> > Geometry::delaunay(std::vector<double> xs, std::vector<double> ys){
    std::vector<Point> points;
    std::vector<unsigned long> indices;
    points.resize(xs.size());
    indices.resize(xs.size());
    for(std::size_t i = 0; i < xs.size(); i++){
        indices[i] = i;
        points[i] = Point(xs[i], ys[i]);
    }
    std::vector<long> idAs;
    std::vector<long> idBs;
    Delaunay dt;
    dt.insert(boost::make_zip_iterator(boost::make_tuple(points.begin(), indices.begin())),
              boost::make_zip_iterator(boost::make_tuple(points.end(), indices.end())));
    for(Delaunay::Finite_edges_iterator it = dt.finite_edges_begin(); it != dt.finite_edges_end(); ++it){
        Delaunay::Edge e = *it;
        long i1 = e.first->vertex((e.second + 1) % 3)->info();
        long i2 = e.first->vertex((e.second + 2) % 3)->info();
        idAs.push_back(i1);
        idBs.push_back(i2);
    }
    std::vector<std::vector<long> > result;
    result.resize(2);
    result[0] = idAs;
    result[1] = idBs;
    return result;
}
I am completely new to CGAL, and this code is something I was able to put together after a lot of looking up in web pages over the last 2 days. So if there is something else that might be improved, please do not hesitate to mention it; the syntax of CGAL is not really straightforward.
*The code works perfectly for random points, and 70% of the time even for points that are aligned, but the other 30% worries me.
THE QUESTION IS: how can I set a tolerance so that CGAL does not generate triangles on top of points that are almost aligned? Or is there a better kernel for this? (As you can see, I also tried Exact_predicates_inexact_constructions_kernel, but it is even worse.)
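One common workaround, sketched here as a suggestion rather than a CGAL feature: CGAL has no tolerance setting, but you can quantize the input coordinates yourself so that nearly collinear points become exactly collinear before insertion. The helper names and the tolerance value below are my own assumptions:

// Hedged sketch: snap coordinates to a grid of spacing `tol` before building
// the Point vector. `tol` must be chosen to suit your data (e.g. 1e-9).
#include <cmath>
#include <vector>

static double snap(double v, double tol){
    return std::floor(v / tol + 0.5) * tol; // quantize to a multiple of tol
}

static void snap_all(std::vector<double>& xs, std::vector<double>& ys, double tol){
    for(std::size_t i = 0; i < xs.size(); i++){
        xs[i] = snap(xs[i], tol);
        ys[i] = snap(ys[i], tol);
    }
}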

Fastest way to compare uchar arrays in OpenCL

I need to do many comparisons in an OpenCL program. Right now I do it like this:
int memcmp(__global unsigned char* a, __global unsigned char* b, int size){
    for (int i = 0; i < size; i++){
        if (a[i] != b[i]) return 0;
    }
    return 1;
}
How can I make it faster? Maybe by using vectors like uchar4, or something else? Thanks!
I guess that your kernel computes "size" elements for each thread. I think your code can improve if your accesses are more coalesced. Thanks to the L1 caches of current GPUs this is not a huge problem, but it can imply a noticeable performance penalty. For example, say you have 4 threads (work-items) and size = 128, so the buffers have 512 uchars. In your case, thread #0 accesses a[0] and b[0], but it brings a[0]...a[63] into the cache, and the same for b. Thread #1, which belongs to the same warp (aka wavefront), accesses a[128] and b[128], so it brings a[128]...a[191] into the cache, etc. After thread #3, the whole buffer is in the cache. This is not a problem here, taking into account the small size of this domain.
However, if each thread accesses its elements consecutively, only one "cache line" is needed at a time for your 4-thread execution (the accesses are coalesced). The behavior improves as more threads per block are used. Please try it and tell me your conclusions. Thank you.
See: http://www.nvidia.com/content/cudazone/download/opencl/nvidia_opencl_programmingguide.pdf, Section 3.1.2.1.
It is a bit old, but its concepts still apply.
PS: By the way, after this I would try using uchar4 as you suggested, and also loop unrolling.
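As a hedged illustration of the uchar4 idea (a sketch only, under the assumption that size is a multiple of 4; memcmp4 is a made-up name):

// Compare 4 bytes at a time; the built-in any() returns nonzero if any
// component of the component-wise comparison is true, so we can early-out.
int memcmp4(__global const uchar4* a, __global const uchar4* b, int size4){
    for (int i = 0; i < size4; i++){        // size4 = size / 4
        if (any(a[i] != b[i])) return 0;    // mismatch in any of the 4 bytes
    }
    return 1;
}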

How to bitwise-and CFBitVector

I have two instances of CFMutableBitVector, like so:
CFBitVectorRef ref1, ref2;
How can I do bit-wise operations on these guys? For right now I only care about AND, but obviously XOR, OR, etc. would be useful to know.
Obviously I can iterate through the bits in the vector, but that seems silly when I'm working at the bit level. I feel like there are just some Core Foundation functions that I'm missing, but I can't find them.
Thanks,
Kurt
Well, a CFBitVectorRef is a
typedef const struct __CFBitVector *CFBitVectorRef;
which is a
struct __CFBitVector {
    CFRuntimeBase _base;
    CFIndex _count;       /* number of bits */
    CFIndex _capacity;    /* maximum number of bits */
    __CFBitVectorBucket *_buckets;
};
Where
/* The bucket type must be unsigned, at least one byte in size, and
a power of 2 in number of bits; bits are numbered from 0 from left
to right (bit 0 is the most significant) */
typedef uint8_t __CFBitVectorBucket;
So you can dive in and do byte-wise operations, which could speed things up. Of course, being non-mutable might hinder things a bit :D
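A hedged sketch of that byte-wise approach using only public API calls (BitVectorAnd is a made-up helper, not a Core Foundation function, and it assumes both vectors hold the same number of bits):

// Extract the packed bytes of both vectors, AND them byte-wise, and build
// a new CFBitVector from the result.
#include <CoreFoundation/CoreFoundation.h>
#include <stdlib.h>

CFBitVectorRef BitVectorAnd(CFBitVectorRef a, CFBitVectorRef b) {
    CFIndex bits  = CFBitVectorGetCount(a);   /* assumed equal for a and b */
    CFIndex bytes = (bits + 7) / 8;
    UInt8 *ba = (UInt8 *)calloc(bytes, 1);
    UInt8 *bb = (UInt8 *)calloc(bytes, 1);
    CFBitVectorGetBits(a, CFRangeMake(0, bits), ba);
    CFBitVectorGetBits(b, CFRangeMake(0, bits), bb);
    for (CFIndex i = 0; i < bytes; i++)
        ba[i] &= bb[i];                       /* the actual bitwise AND */
    CFBitVectorRef result = CFBitVectorCreate(kCFAllocatorDefault, ba, bits);
    free(ba);
    free(bb);
    return result;
}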

How to do numerical integration with quantum harmonic oscillator wavefunction?

How do I do numerical integration (what numerical method, and what tricks should I use) for one-dimensional integration over an infinite range, where one or more functions in the integrand are 1d quantum harmonic oscillator wave functions? Among other things, I want to calculate matrix elements of some function in the harmonic oscillator basis:
phi_n(x) = N_n H_n(x) exp(-x^2/2)
where H_n(x) is a Hermite polynomial, and
V_{m,n} = \int_{-\infty}^{\infty} phi_m(x) V(x) phi_n(x) dx
I also need the case where the harmonic wavefunctions have different widths.
The problem is that the wavefunctions phi_n(x) have oscillatory behaviour, which is a problem for large n: algorithms like the adaptive Gauss-Kronrod quadrature from GSL (GNU Scientific Library) take a long time to compute and have large errors.
An incomplete answer, since I'm a little short on time at the moment; if others can't complete the picture, I can supply more details later.
Apply orthogonality of the wavefunctions whenever and wherever possible. This should significantly cut down the amount of computation.
Do analytically whatever you can. Lift constants, split integrals by parts, whatever. Isolate the region of interest; most wavefunctions are band-limited, and reducing the area of interest will do a lot to save work.
For the quadrature itself, you probably want to split the wavefunctions into three pieces and integrate each separately: the oscillatory bit in the center plus the exponentially-decaying tails on either side. If the wavefunction is odd, you get lucky and the tails will cancel each other, meaning you only have to worry about the center. For even wavefunctions, you only have to integrate one tail and double it (hooray for symmetry!). Otherwise, integrate the tails using a high-order Gauss-Laguerre quadrature rule.
You might have to calculate the rules yourself; I don't know if tables list good Gauss-Laguerre rules, as they're not used too often. You probably also want to check the error behavior as the number of nodes in the rule goes up; it's been a long time since I used Gauss-Laguerre rules, and I don't remember if they exhibit Runge's phenomenon.
Integrate the center part using whatever method you like; Gauss-Kronrod is a solid choice, of course, but there's also Fejer quadrature (which sometimes scales better to high numbers of nodes, and might work nicer on an oscillatory integrand) and even the trapezoidal rule (which exhibits stunning accuracy with certain oscillatory functions). Pick one and try it out; if results are poor, give another method a shot.
Hardest question ever on SO? Hardly :)
I'd recommend a few other things:
Try transforming the function onto a finite domain to make the integration more manageable.
Use symmetry where possible - break it up into the sum of two integrals, from negative infinity to zero and from zero to infinity, and see if the function is symmetric or anti-symmetric. It could make your computation easier.
Look into Gauss-Laguerre quadrature and see if it can help you.
The WKB approximation?
I am not going to explain or qualify any of this right now. This code is written as-is and is probably incorrect. I am not even sure if it is the code I was looking for; I just remember that years ago I did this problem, and upon searching my archives I found this. You will need to plot the output yourself; some instruction is provided. I will say that the integration over infinite range is a problem that I addressed, and upon execution the code states the round-off error at 'infinity' (which numerically just means large).
// compile: g++ base.cc -lm
#include <iostream>
#include <cstdlib>
#include <fstream>
#include <math.h>
using namespace std;

int main ()
{
    double xmax,dfx,dx,x,hbar,k,dE,E,E_0,m,psi_0,psi_1,psi_2;
    double w,num;
    int n,temp,parity,order;
    double last;
    double propogator(double E,int parity);
    double eigen(double E,int parity);
    double f(double x, double psi, double dpsi);
    double g(double x, double psi, double dpsi);
    double rk4(double x, double psi, double dpsi, double E);
    ofstream datas ("test.dat");
    E_0 = 1.602189*pow(10.0,-19.0); // eV-joules conversion
    dE = E_0*.001;
    //w^2=k/m v=1/2 k x^2 V=??? = E_0/xmax x^2 k-->
    //w=sqrt( (2*E_0)/(m*xmax) );
    //E=(0+.5)*hbar*w;
    cout << "Enter what energy level you're looking for, as an (0,1,2...) INTEGER: ";
    cin >> order;
    E = 0;
    for (n=0; n<=order; n++)
    {
        parity = 0;
        //if it's even, parity is 1 (true)
        temp = n;
        if ( (n%2)==0 ) { parity = 1; }
        cout << "Energy " << n << " has these parameters: ";
        E = eigen(E,parity);
        if (n==order)
        {
            propogator(E,parity);
            cout << " The positive values of the wave function were written to sho.dat \n";
            cout << " In order to plot, the data should be reflected about the y-axis \n";
            cout << " evenly for even energy levels and oddly for odd energy levels\n";
        }
        E = E+dE;
    }
}
double propogator(double E,int parity)
{
    ofstream datas ("sho.dat");
    double hbar = 1.054*pow(10.0,-34.0);
    double m    = 9.109534*pow(10.0,-31.0);
    double E_0  = 1.602189*pow(10.0,-19.0);
    double dx   = pow(10.0,-10);
    double xmax = 100*pow(10.0,-10.0)+dx;
    double dE   = E_0*.001;
    double last = 1;
    double x = dx;
    double psi_2 = 0.0;
    double psi_0 = 0.0;
    double psi_1 = 1.0;
    // cout << parity << " parity passed \n";
    psi_0 = 0.0;
    psi_1 = 1.0;
    if (parity==1)
    {
        psi_0 = 1.0;
        psi_1 = m*(1.0/(hbar*hbar))*dx*dx*(0-E)+1;
    }
    do
    {
        datas << x << "\t" << psi_0 << "\n";
        psi_2 = (2.0*m*(dx/hbar)*(dx/hbar)*(E_0*(x/xmax)*(x/xmax)-E)+2.0)*psi_1-psi_0;
        //cout << psi_1 << "=psi_1\n";
        psi_0 = psi_1;
        psi_1 = psi_2;
        x = x+dx;
    } while ( x <= xmax );
    // 666 is returned as a dummy value to check that the function has run
    return 666;
}
double eigen(double E,int parity)
{
    double hbar = 1.054*pow(10.0,-34.0);
    double m    = 9.109534*pow(10.0,-31.0);
    double E_0  = 1.602189*pow(10.0,-19.0);
    double dx   = pow(10.0,-10);
    double xmax = 100*pow(10.0,-10.0)+dx;
    double dE   = E_0*.001;
    double last = 1;
    double x = dx;
    double psi_2 = 0.0;
    double psi_0 = 0.0;
    double psi_1 = 1.0;
    do
    {
        psi_0 = 0.0;
        psi_1 = 1.0;
        if (parity==1)
        {
            // even-parity initial conditions
            psi_0 = 1.0;
            psi_1 = m*(1.0/(hbar*hbar))*dx*dx*(0-E)+1;
        }
        x = dx;
        do
        {
            psi_2 = (2.0*m*(dx/hbar)*(dx/hbar)*(E_0*(x/xmax)*(x/xmax)-E)+2.0)*psi_1-psi_0;
            psi_0 = psi_1;
            psi_1 = psi_2;
            x = x+dx;
        } while ( x <= xmax );
        if ( sqrt(psi_2*psi_2) <= 1.0*pow(10.0,-3.0) )
        {
            cout << E << " is an eigen energy and " << psi_2 << " is psi of 'infinity' \n";
            return E;
        }
        else
        {
            if ( (last > 0.0 && psi_2 < 0.0) || (psi_2 > 0.0 && last < 0.0) )
            {
                E = E-dE;
                dE = dE/10.0;
            }
        }
        last = psi_2;
        E = E+dE;
    } while (E <= E_0);
    return E; // reached only if no eigenvalue was found below E_0
}
If this code seems correct, wrong, or interesting, or if you have specific questions, ask and I will answer them.
I am a student majoring in physics, and I also encountered this problem. These days I have kept thinking about this question and have arrived at my own answer. I think it may help you solve it.
1. In GSL there are functions that can help you integrate an oscillatory function: qawo and qawf. You can set a value a, and the integration can then be separated into two parts, [0, a] and [a, +infinity). In the first interval you can use any GSL integration function you want, and in the second interval you can use qawo or qawf.
2. Or you can integrate the function only up to an upper limit b, i.e. over [0, b], so the integral can be calculated with the Gauss-Legendre method provided in GSL. Although there may be some difference between the real value and the computed value, if you set b properly the difference can be neglected, as long as it is less than the accuracy you want. The GSL setup for this only needs to be called once and can be reused many times, because its return values are the points and their corresponding weights, and the integral is just the sum of f(x_i)*w_i; for more details you can search for Gauss-Legendre quadrature on Wikipedia. Multiplication and addition are much faster than adaptive integration.
3. There is also a function that can calculate the integral over an infinite range: qagi; you can find it in the GSL user's guide. But it is called every time you need to calculate an integral, which may be time-consuming; I'm not sure how long it would take in your program.
I suggest choice No. 2.
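A hedged sketch of option 2 using GSL's fixed-order Gauss-Legendre tables (the API calls are real GSL functions; the integrand f and the cutoff b here are placeholder assumptions):

// Precompute a 64-point Gauss-Legendre table once, then reuse it for many
// integrals. Link with -lgsl -lgslcblas -lm.
#include <gsl/gsl_integration.h>
#include <stdio.h>

double f(double x, void *params) {
    (void)params;
    return x * x;   /* placeholder for phi_m(x) * V(x) * phi_n(x) */
}

int main(void) {
    const double b = 10.0;   /* assumed cutoff, chosen so the tails are negligible */
    gsl_integration_glfixed_table *t = gsl_integration_glfixed_table_alloc(64);
    gsl_function F = { &f, NULL };
    double result = gsl_integration_glfixed(&F, 0.0, b, t);
    printf("integral on [0, %g] = %g\n", b, result);
    gsl_integration_glfixed_table_free(t);
    return 0;
}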
If you are going to work with harmonic oscillator functions with n less than 100, you might want to try:
http://www.mymathlib.com/quadrature/gauss_hermite.html
The program computes an integral via Gauss-Hermite quadrature with 100 zeroes and weights (the zeroes of H_100). Once you go beyond H_100, the integrals are not as accurate.
Using this integration method, I wrote a program calculating exactly what you want to calculate, and it works fairly well. Also, there might be a way to go beyond n=100 by using the asymptotic form of the Hermite-polynomial zeroes, but I haven't looked into it.

What are your favorite low-level code optimization tricks? [closed]

I know that you should only optimize things when it is deemed necessary. But if it is deemed necessary, what are your favorite low-level (as opposed to algorithmic-level) optimization tricks?
For example: loop unrolling.
gcc -O2
Compilers do a much better job of it than you can.
Picking a power of two for filters, circular buffers, etc.
So very, very convenient.
-Adam
Why, bit twiddling hacks, of course!
One of the most useful in scientific code is to replace pow(x,4) with x*x*x*x. pow is almost always more expensive than multiplication. This is followed by changing
for(int i = 0; i < N; i++)
{
    z += x/y;
}
to
double denom = 1/y;
for(int i = 0; i < N; i++)
{
    z += x*denom;
}
But my favorite low-level optimization is to figure out which calculations can be moved out of a loop. It's always faster to do the calculation once rather than N times. Depending on your compiler, some of these may be done for you automatically.
Inspect the compiler's output, then try to coerce it to do something faster.
I wouldn't necessarily call it a low-level optimization, but I have saved orders of magnitude more cycles through judicious application of caching than through all my applications of low-level tricks combined. Many of these methods are application-specific.
Having an LRU cache of database queries (or any other IPC based request).
Remembering the last failed database query and returning a failure if re-requested within a certain time frame.
Remembering your location in a large data structure to ensure that if the next request is for the same node, the search is free.
Caching calculation results to prevent duplicate work. In addition to more complex scenarios, this is often found in if or for statements.
CPUs and compilers are constantly changing. Whatever low-level code trick made sense three CPU generations ago with a different compiler may actually be slower on the current architecture, and there is a good chance it will confuse whoever maintains the code in the future.
++i can be faster than i++, because it avoids creating a temporary.
Whether this still holds for modern C/C++/Java/C# compilers, I don't know. It might well be different for user-defined types with overloaded operators, whereas in the case of simple integers it probably doesn't matter.
But I've come to like the syntax... it reads like "increment i" which is a sensible order.
Using template metaprogramming to calculate things at compile time instead of at run-time.
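A minimal illustration of the idea (my own sketch, the classic compile-time factorial):

// The recursion is expanded by the compiler, so Factorial<12>::value is a
// plain constant at runtime; no multiplications are executed.
template <unsigned N>
struct Factorial {
    static const unsigned long value = N * Factorial<N - 1>::value;
};

template <>
struct Factorial<0> {
    static const unsigned long value = 1;
};

static const unsigned long f12 = Factorial<12>::value; // 479001600, folded at compile time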
Years ago, with a not-so-smart compiler, I got great mileage from function inlining, walking pointers instead of indexing arrays, and iterating down to zero instead of up to a maximum.
When in doubt, a little knowledge of assembly will let you look at what the compiler is producing and attack the inefficient parts (in your source language, using structures friendlier to your compiler.)
Precalculating values.
For instance, instead of sin(a) or cos(a), if your application doesn't need angles to be very precise, maybe you can represent angles in 1/256ths of a circle and create arrays of floats sine[] and cosine[], precalculating the sin and cos of those angles.
And if you frequently need a vector at some angle of a given length, you might precalculate all those sines and cosines already multiplied by that length.
Or, to put it more generally, trade memory for speed.
Or, even more generally, "All programming is an exercise in caching" -- Terje Mathisen
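A sketch of that table idea (the 1/256-of-a-circle resolution is from the answer above; the helper names are mine):

// Precomputed sine/cosine tables indexed by a 0..255 "binary angle".
#include <math.h>

#define ANGLE_STEPS 256

static float sine_tab[ANGLE_STEPS], cosine_tab[ANGLE_STEPS];

static void init_trig_tables(void) {
    const double two_pi = 6.28318530717958647692;
    for (int i = 0; i < ANGLE_STEPS; i++) {
        double a = two_pi * i / ANGLE_STEPS;  /* i/256 of a full circle */
        sine_tab[i]   = (float)sin(a);
        cosine_tab[i] = (float)cos(a);
    }
}
/* Lookup afterwards: sine_tab[angle & 0xFF] instead of calling sin(). */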
Some things are less obvious. For instance, when traversing a two-dimensional array, you might do something like
for (x = 0; x < maxx; x++)
    for (y = 0; y < maxy; y++)
        do_something(a[x][y]);
You might find the processor cache likes it better if you do:
for (y = 0; y < maxy; y++)
    for (x = 0; x < maxx; x++)
        do_something(a[x][y]);
or vice versa.
Don't do loop unrolling. Don't do Duff's device. Make your loops as small as possible, anything else inhibits x86 performance and gcc optimizer performance.
Getting rid of branches can be useful, though - so getting rid of loops completely is good, and those branchless math tricks really do work. Beyond that, try never to go out of the L2 cache - this means a lot of precalculation/caching should also be avoided if it wastes cache space.
And, especially for x86, try to keep the number of variables in use at any one time down. It's hard to tell what compilers will do with that kind of thing, but usually having less loop iteration variables/array indexes will end up with better asm output.
Of course, this is for desktop CPUs; a slow CPU with fast memory access can precalculate a lot more, but these days that might be an embedded system with little total memory anyway…
I've found that changing from pointer to indexed access may make a difference; the compiler has different instruction forms and register usages to choose from. Vice versa, too. This is extremely low-level and compiler-dependent, though, and only good when you need the last few percent.
E.g.
for (i = 0; i < n; ++i)
    *p++ = ...; // some complicated expression
vs.
for (i = 0; i < n; ++i)
    p[i] = ...; // some complicated expression
Optimizing cache locality - for example when multiplying two matrices that don't fit into cache.
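As a sketch of what that looks like (my own example; the block size of 64 is an assumption to tune per cache):

// Blocked (tiled) multiply: work on BLOCK x BLOCK tiles so the data a tile
// reuses stays resident in cache. C must be zero-initialized by the caller.
#define BLOCK 64

void matmul_blocked(int n, const double *A, const double *B, double *C) {
    for (int ii = 0; ii < n; ii += BLOCK)
        for (int kk = 0; kk < n; kk += BLOCK)
            for (int jj = 0; jj < n; jj += BLOCK)
                for (int i = ii; i < ii + BLOCK && i < n; i++)
                    for (int k = kk; k < kk + BLOCK && k < n; k++) {
                        double aik = A[i * n + k];   /* reused across the j loop */
                        for (int j = jj; j < jj + BLOCK && j < n; j++)
                            C[i * n + j] += aik * B[k * n + j];
                    }
}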
Allocating with new on a pre-allocated buffer using C++'s placement new.
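For example (a minimal sketch):

// Placement new constructs an object into memory you already own, avoiding
// a heap allocation per object. Destroy explicitly; never delete the storage.
#include <new>

struct Particle { double x, y; };

static double storage[2];                 // pre-allocated, suitably aligned for Particle

Particle *p = new (storage) Particle();   // construct in place, no heap allocation
// ... use *p ..., then:
// p->~Particle();                        // explicit destructor call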
Counting down a loop. It's cheaper to compare against 0 than N:
for (i = N; --i >= 0; ) ...
Shifting and masking by powers of two is cheaper than division and remainder, / and %
#define WORD_LOG 5
#define SIZE (1 << WORD_LOG)
#define MASK (SIZE - 1)

uint32_t bits[K];   // K = number of 32-bit words in the bitset

void set_bit(unsigned i)
{
    bits[i >> WORD_LOG] |= (1u << (i & MASK));
}
Edit
(i >> WORD_LOG) == (i / SIZE) and
(i & MASK) == (i % SIZE)
because SIZE is 32, i.e. 2^5.
Jon Bentley's Writing Efficient Programs is a great source of low- and high-level techniques -- if you can find a copy.
Eliminating branches (if/elses) by using boolean math:
if (x == 0)
    x = 5;

// becomes:
x += (x == 0) * 5;

// if '5' were a power of two, say 4:
x += (x == 0) << 2;

// divide by 2 if the flag is set
sum >>= (blendMode == BLEND);
This REALLY speeds things up, especially when those ifs are in a loop or somewhere that is called a lot.
The one from Assembler:
xor ax, ax
instead of:
mov ax, 0
Classical optimization for program size and performance.
In SQL, if you only need to know whether any data exists or not, don't bother with COUNT(*):
SELECT 1 FROM table WHERE some_primary_key = some_value
If your WHERE clause is likely to return multiple rows, add a LIMIT 1 too.
(Remember that databases can't see what your code's doing with their results, so they can't optimise these things away on their own!)
Recycling the frame pointer all of a sudden
Pascal calling convention
Rewriting stack-frame tail call optimization (although it sometimes messes with the above)
Using vfork() instead of fork() before exec()
And one I am still looking for an excuse to use: data-driven code generation at runtime
Liberal use of __restrict to eliminate load-hit-store stalls.
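For instance (a sketch; the __restrict spelling varies by compiler, e.g. restrict in C99):

// Promising the compiler that dst and src never alias lets it keep loaded
// values in registers instead of reloading after every store.
void scale(float *__restrict dst, const float *__restrict src, float k, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = src[i] * k;   /* loads can be hoisted and vectorized freely */
}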
Rolling up loops.
Seriously, the last time I needed to do anything like this was in a function that took 80% of the runtime, so it was worth trying to micro-optimize if I could get a noticeable performance increase.
The first thing I did was to roll up the loop. This gave me a very significant speed increase. I believe this was a matter of cache locality.
The next thing I did was add a layer of indirection, and put some more logic into the loop, which allowed me to only loop through the things I needed. This wasn't as much of a speed increase, but it was worth doing.
If you're going to micro-optimize, you need to have a reasonable idea of two things: the architecture you're actually using (which is vastly different from the systems I grew up with, at least for micro-optimization purposes), and what the compiler will do for you.
A lot of the traditional micro-optimizations trade space for time. Nowadays, using more space increases the chances of a cache miss, and there goes your performance. Moreover, a lot of them are now done by modern compilers, and typically better than you're likely to do them.
Currently, you should (a) profile to see if you need to micro-optimize, and then (b) try to trade computation for space, in the hope of keeping as much as possible in cache. Finally, run some tests, so you know if you've improved things or screwed them up. Modern compilers and chips are far too complex for you to keep a good mental model, and the only way you'll know if some optimization works or not is to test.
In addition to Joshua's comment about code generation (a big win), and other good suggestions, ...
I'm not sure if you would call it "low-level", but (and this is downvote-bait) 1) stay away from using any more levels of abstraction than absolutely necessary, and 2) stay away from event-driven notification-style programming, if possible.
If a computer executing a program is like a car running a race, a method call is like a detour. That's not necessarily bad, except there's a strong temptation to nest those things, because once you've written a method call, you tend to forget what that call could cost you.
If you're relying on events and notifications, it's because you have multiple data structures that need to be kept in agreement. This is costly, and should only be done if you can't avoid it.
In my experience, the biggest performance killers are too much data structure and too much abstraction.
I was amazed at the speedup I got from rewriting a for loop that adds numbers together in structs:
#include <stdlib.h>

const unsigned long SIZE = 100000000;

typedef struct {
    int a;
    int b;
    int result;
} addition;

addition *sum;

void start() {
    unsigned int byte_count = SIZE * sizeof(addition);
    sum = malloc(byte_count);
    unsigned int i = 0;
    if (i < SIZE) {
        do {
            sum[i].a = i;
            sum[i].b = i;
            i++;
        } while (i < SIZE);
    }
}

void test_func() {
    unsigned int i = 0;
    if (i < SIZE) { // this is about 30% faster than the more obvious for loop, even with O3
        do {
            addition *s1 = &sum[i];
            s1->result = s1->b + s1->a;
            i++;
        } while (i < SIZE);
    }
}

void finish() {
    free(sum);
}
Why doesn't gcc optimise for loops into this? Or is there something I missed? Some cache effect?