Valgrind vs. Linux perf correlation

Suppose that I choose the perf events instructions, LLC-load-misses, and LLC-store-misses. Suppose further that I test a program prog while varying its input. Is Valgrind supposed to give me the "same" functional results for the same input and the same counter? That is, if one value in perf goes up, should the corresponding value in Valgrind always go up as well? Is there any effect of Valgrind being a simulation that I should be aware of while profiling my code?
EDIT: BTW, before people grill me for not experimenting myself, I have to say that I (kinda) have. The problem is that I have a Sandy Bridge processor, and perf has a "bug" that prevents me from measuring LLC-* events. There is a patch, but I don't feel like recompiling my kernel...

Well, Cachegrind is a cache simulator. Even though it tries to mimic some of your hardware's characteristics (cache size, associativity, etc), it does not model every single feature and behavior of your system. Therefore you might in some cases see some differences.
For example, Valgrind's doc states that "Cachegrind simulates branch predictors intended to be typical of mainstream desktop/server processors of around 2004". Sandy Bridge processors first appeared in 2011, and you can guess that branch predictors have improved quite a lot since 2004.
That being said, Valgrind is still a wonderful tool to have in your toolbox.
What's the problem with perf's LLC events on Sandy Bridge processors? I use these events every day at work on my Sandy Bridge laptop and they work as expected (Arch Linux 64-bit, Linux 3.6).
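One way to test the "goes up together" question empirically is to record both counters over the same sequence of inputs and compare the direction of each successive change. The sketch below assumes you have already collected the numbers by hand; the function name and the sample values are made up for illustration and are not real measurements.

```python
def moves_together(perf_counts, sim_counts):
    """Fraction of consecutive input pairs where both counters change
    in the same direction (up, down, or unchanged)."""
    if len(perf_counts) != len(sim_counts) or len(perf_counts) < 2:
        raise ValueError("need two equally long series with at least 2 points")

    def sign(x):
        return (x > 0) - (x < 0)

    steps = len(perf_counts) - 1
    agree = 0
    for i in range(steps):
        d_perf = perf_counts[i + 1] - perf_counts[i]
        d_sim = sim_counts[i + 1] - sim_counts[i]
        if sign(d_perf) == sign(d_sim):
            agree += 1
    return agree / steps


# Hypothetical counts for four inputs of increasing size (not measured data).
perf_llc_misses = [1200, 1850, 1700, 4100]
cachegrind_ll_misses = [900, 1500, 1600, 3900]
agreement = moves_together(perf_llc_misses, cachegrind_ll_misses)
```

A value of 1.0 would mean the two tools always moved in the same direction on your inputs. Anything lower points at inputs where the simulation and the hardware disagree and are worth a closer look.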

hardware emulation project

Greetings.
I am interested in writing an emulator for some old computer. However, I'd like to pick something simple for a start, some architecture that is not too complicated and relatively well-known, so that it's easy to find documentation. Could you suggest something?
Also welcome: links to technical specs/documentation of the suggested platform, rom archives, etc. :)
The good old Commodore 64 would be a good choice. Well-documented, lots of ROM archives available, and a fair amount of community support available.
It runs on an 8-bit microprocessor (the MOS 6510) which has a small, simple instruction set and should be fairly straightforward to simulate (inasmuch as any hardware emulation can be called "simple" :)
The processor datasheet is even available!
Having already done something like this I would agree with e.James and go with something like the 6502. The 6502 is manageable, I think fewer than 256 opcodes. The Z80, for example, multiplexes some of the opcodes and is a lot more work. With the 6502 you can go after the VIC-20, the Commodore 64, etc., as well as arcade machines like Asteroids, Lunar Lander, Breakout and some others. The Apple IIe, the Atari VCS (2600) and others are also 6502-based.
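To give a feel for how small the core of such an emulator can start out, here is a minimal fetch/decode/execute sketch for a handful of 6502 opcodes. It is only an illustration under simplifying assumptions: no cycle counting, no stack, no interrupts, BRK is treated as "stop", and the class name and the load address 0x0600 are arbitrary choices of mine.

```python
class Cpu6502:
    """Tiny subset of a 6502: A and X registers, Z/N flags, 64 KiB of memory."""

    def __init__(self, program, origin=0x0600):
        self.mem = bytearray(0x10000)
        self.mem[origin:origin + len(program)] = bytes(program)
        self.pc = origin
        self.a = 0
        self.x = 0
        self.zero = False
        self.negative = False
        self.halted = False

    def fetch(self):
        """Read the byte at PC and advance PC (wrapping at 64 KiB)."""
        value = self.mem[self.pc]
        self.pc = (self.pc + 1) & 0xFFFF
        return value

    def set_zn(self, value):
        """Update the zero and negative flags from an 8-bit result."""
        self.zero = value == 0
        self.negative = bool(value & 0x80)

    def step(self):
        opcode = self.fetch()
        if opcode == 0xA9:        # LDA #imm
            self.a = self.fetch()
            self.set_zn(self.a)
        elif opcode == 0x85:      # STA zero page
            self.mem[self.fetch()] = self.a
        elif opcode == 0xAA:      # TAX
            self.x = self.a
            self.set_zn(self.x)
        elif opcode == 0xE8:      # INX
            self.x = (self.x + 1) & 0xFF
            self.set_zn(self.x)
        elif opcode == 0x00:      # BRK: simplified to "halt" in this sketch
            self.halted = True
        else:
            raise NotImplementedError(f"opcode {opcode:#04x}")

    def run(self, max_steps=1000):
        for _ in range(max_steps):
            if self.halted:
                break
            self.step()


# LDA #$C0 ; TAX ; INX ; STA $10 ; BRK
cpu = Cpu6502([0xA9, 0xC0, 0xAA, 0xE8, 0x85, 0x10, 0x00])
cpu.run()
```

A real emulator grows from here by adding the remaining addressing modes, the carry and overflow flags, the stack and the cycle timing, which is where comparing against the datasheet and an existing implementation pays off.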
It is good to go with something like this that has already been emulated (and for which there is open source), something where you can examine both datasheets and implementations together when making your own. Beware, not all are bug-free: they may emulate one thing well, perhaps because that one thing never uses this broken instruction or that flag. You may also find there are different interpretations of the datasheet.
Thanks to MAME and others there are a lot of video games (not necessarily 6502-based, in general) out there; perhaps you have a favorite. The processor emulators in MAME, as well as others out there, are often written for execution speed and can be difficult to follow. Certainly not educational code, but heavily hand-tuned for performance (that was needed for a 486 platform, but you don't necessarily need that tuning today).
If the 6502 is too big to digest, or the peripherals you would have to emulate look like too much, you might go with just the processor or with a microcontroller like the 12-bit Microchip PIC or the MSP430 instruction set. Both are very digestible and still in production, so tools are available; both have C compilers, for example. You are not going to have sexy, well-known programs running or anything like that, but it is no less educational.

LabVIEW + National Instruments hardware or ???

I'm in the process of buying a new data acquisition system for my company to use for various projects. At first, its primary purpose will be to monitor up to 20 thermocouples and control the temperature of a composites oven. However, I also plan on using it to monitor accelerometers and strain gauges, and to act as a signal generator.
I probably won't be the only one to use it, but I have a good bit of programming experience with Atmel microcontrollers (C). I've used LabVIEW before, but ~5 years ago. LabVIEW would be good because it is easy to pick up for both me and my coworkers. On the flip side, it's expensive. Right now I have an NI CompactDAQ system with two voltage cards and one thermocouple card + LabVIEW specced out and it's going to cost $5779!
I'm going to try to get the same I/O capabilities with different NI hardware + LabVIEW to see if I can bring the cost down. I'd also like to see if anyone has any suggestions other than LabVIEW for me.
Thanks in advance!
Welcome to test and measurement. It's pretty expensive for pre-built stuff, but you trade money for time.
You might check out the somewhat less expensive Agilent 34970A (and associated cards). It's a great workhorse for different kinds of sensing, and, if I recall correctly, it comes with some basic software.
For simple temperature control, you might consider a PID controller (Watlow or Omega used to be the brands, but it's been a few years since I've looked at them).
You also might look into the low-cost USB solutions from NI. The channel count is lower, but they're fairly inexpensive. They do still require software of some type, though.
There are also a fair number of good smaller companies (like Hytek Automation) that produce some types of measurement and control devices or sub-assemblies, but YMMV.
There's a lot of misconception about what will and will not work with LabVIEW and what you do and do not need to build a decent system with it.
First off, as others have said, test and measurement is expensive. Regardless of what you end up doing, the system you describe IS going to cost thousands to build.
Second, you don't NEED to use NI hardware with LabView. For thermocouples your best bet is to look into multichannel or multiple single-channel thermocouple units - something that reads from a thermocouple and outputs to something like RS-232, etc. The OMEGABUS Digital Transmitters are an example, but many others exist.
In this way, you need only a breakout card with lots of RS-232 ports and you can grow your system as it needs it. You can still use LabVIEW to acquire the data via RS-232 and then display, log, process, etc., it however you like.
Third party signal generators would also work, for example. You can pick up good ones (with GPIB connection) reasonably cheaply and with a GPIB board can integrate it into LabView as well. This if you want something like a function generator, of course (duty cycled pulses, standard sine/triangle/ramp functions, etc). If you're talking about arbitrary signal generation then this remains a reasonably expensive thing to do (if $5000 is our goalpost for "expensive").
This also hinges on what you need the signal generation for. If you're thinking of control signals then, again, there may be cheaper and more robust options available. For temperature control, for example, separate hardware PID controllers are probably the best bet. This also takes care of your thermocouple problem since PID controllers will typically accept thermocouple inputs as well. In this way you only need one interface (RS-232, for example) to the external PID controller and you have total access in LabVIEW to temperature readings as well as the ability to control setpoints and PID parameters in one unit.
Perhaps if you could elaborate on not just the system components as you've planned them at present, but the ultimate system functionality, it may be easier to suggest alternatives: not simply alternative hardware, but alternative system design altogether.
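For anyone unfamiliar with what such a PID unit does internally, here is a rough software sketch of the control loop. Everything in it is an assumption made for illustration: the oven is modelled as a simple first-order system with invented constants, the gains are hand-picked for that model, and the function names are hypothetical. A commercial controller adds autotuning, input filtering and safety limits on top of this.

```python
def make_pid(kp, ki, kd=0.0, out_min=0.0, out_max=1.0):
    """Return a step(setpoint, measured, dt) function that keeps its own state."""
    state = {"integral": 0.0, "prev_error": None}

    def step(setpoint, measured, dt):
        error = setpoint - measured
        prev = state["prev_error"]
        derivative = 0.0 if prev is None else (error - prev) / dt
        candidate = state["integral"] + error * dt
        raw = kp * error + ki * candidate + kd * derivative
        output = min(out_max, max(out_min, raw))
        # Simple anti-windup: only accumulate the integral while the
        # output is not clamped at one of its limits.
        if output == raw:
            state["integral"] = candidate
        state["prev_error"] = error
        return output

    return step


def simulate_oven(setpoint=180.0, seconds=900, dt=1.0):
    """Drive an invented first-order oven model and return the final temperature."""
    ambient = 25.0        # deg C, assumed
    heater_gain = 300.0   # deg C above ambient at full power, assumed
    tau = 60.0            # thermal time constant in seconds, assumed
    pid = make_pid(kp=0.05, ki=0.002)
    temp = ambient
    for _ in range(int(seconds / dt)):
        power = pid(setpoint, temp, dt)   # heater duty between 0 and 1
        temp += dt * (-(temp - ambient) + heater_gain * power) / tau
    return temp


final_temp = simulate_oven()
```

With a hardware PID unit this whole loop lives in the box, and the PC side is reduced to sending setpoints and reading temperatures back over the serial link.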
edit :
Have a look at Omega CNi8C22-C24 and CNiS8C24-C24 units -> these are temperature and strain DIN PID units which will take inputs from your thermocouples and strain gauges, process the inputs into proper measurements, and communicate with LabView (or anything else) via RS-232.
This isn't necessarily a software answer, but if you want low-cost data acquisition, you might want to look at the LabJack. It's basically a microcontroller and USB interface wrapped in a nice box (like an Arduino (Atmel AVR + USB-serial converter) but closed source) with a lot of drivers and functions for various languages, including LabVIEW.
Reading a thermocouple can be tough because microvolts are significant, so you either need a high resolution A/D or an amplifier on the input. I think NI may sell a specialized digitizer for thermocouple readings, but again you'll pay.
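To put numbers on the microvolt problem, the sketch below estimates how many degrees one A/D count represents. It assumes a type K thermocouple at roughly 41 µV per °C, a 5 V full-scale converter and an ideal amplifier; it ignores noise, cold-junction compensation and the thermocouple's nonlinearity, so treat it as back-of-the-envelope arithmetic only.

```python
TYPE_K_UV_PER_DEG_C = 41.0  # approximate type K sensitivity near room temperature


def degrees_per_count(adc_bits, full_scale_volts, gain=1.0,
                      uv_per_deg=TYPE_K_UV_PER_DEG_C):
    """Temperature change (deg C) represented by one ADC count."""
    lsb_volts = full_scale_volts / (2 ** adc_bits)
    # Refer the step size back to the thermocouple side of the amplifier.
    lsb_at_sensor_uv = lsb_volts / gain * 1e6
    return lsb_at_sensor_uv / uv_per_deg


bare_10_bit = degrees_per_count(10, 5.0)                 # about 119 deg C per count
amplified_10_bit = degrees_per_count(10, 5.0, gain=100)  # about 1.2 deg C per count
bare_24_bit = degrees_per_count(24, 5.0)                 # well under 0.01 deg C per count
```

So a plain 10-bit microcontroller input is useless on its own, a gain of 100 gets you to roughly one degree per count, and a high-resolution converter can get by without the amplifier, at least on paper.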
As far as the software answer goes, LabVIEW will work nicely with almost any hardware you choose, e.g. I built my own temperature controller based on an Arduino (with an AD7780), wrote a little interface using serial commands and then talked to it using LabVIEW. But if you're willing to pay a premium for a solution that is guaranteed to work out of the box, you can't go wrong with LabVIEW and an NI part.
LabWindows CVI is NI's C IDE, with good integration with their instrument libraries and drivers. If you're willing to write C code, maybe you could get by with the base version of LabWindows CVI, versus having to buy a higher-end LabVIEW version that has the functionality you need. LabWindows CVI and LabVIEW are priced identically for the base versions, so that may not be much of an advantage.
Given the range of measurement types you plan to make and the fact that you want colleagues to be able to use this, I would suggest LabVIEW is a good choice - it will support everything you want to do and make it straightforward to put a decent GUI on it. Assuming you're on Windows then the base package should be adequate and if you want to build stand-alone applications, either to deploy on other PCs or to make a particular setup as simple as possible for your colleagues, you can buy the application builder separately later.
As for the DAQ hardware, you can certainly save money - e.g. Measurement Computing have a low cost 8-channel USB thermocouple input device - but that may cost you in setup time or be less robust to repeated changes in your hardware configuration for different tests.
I've got a bit of experience with LabView stuff, and if you can afford it, it's awesome (and useful for a lot of different applications).
However, if your applications are simple you might actually be able to hack together something with one or two Arduinos here. The platform is open source and has some good, cheap hardware boards.
LabView really comes into its own with real time applications or RAD (because GUI dev is super easy), so if all you're doing is running a couple of thermopiles I'd find something cheaper.
A few thousand dollars is not a lot of money for process monitoring and control systems. If you do a cost/benefit analysis, you will very quickly recover your development costs if the scope of the system is right and if it does the job it is intended to do.
Another tool to consider is National Instruments Measurement Studio with VB.NET. This way you can still use the NI hardware if you want and can still build nice GUIs quickly.
Alternatively, as others have said, it is perfectly viable to get industrial serial based instruments and talk to them with LabVIEW, VB .NET, c# or whatever you like.
If you go down the route of serial instruments, another piece of hardware that might be useful is a serial terminal server. These allow you to connect arbitrary numbers of devices to your network. Your computers can then use them as though they were physical COM ports.
Have you looked at MATLAB? It has a toolbox called Data Acquisition Toolbox, and CompactDAQ is supported hardware.
LabVIEW is a great visual programming environment if you want to drag, drop and visualize your system. NI hardware also comes with the NI-DAQmx library, which can be accessed from your own code. A feasible solution for you would probably be to call that library from another programming language and write code for all the activities you would otherwise have performed in LabVIEW. Overheads such as code optimization then become your responsibility, but you are free to tweak the normal flow by introducing your own improvements at suitable points in the DAQ process.

Which platform should I choose for scientific computing?

What are the pros and cons of choosing the PS3 as a platform for scientific computing instead of GPUs? Is it the better choice?
Stick with a PC, you will have a far easier life at the end of the day. I also wouldn't be surprised if you get more horsepower out of GPUs.
p.s., from what I know dispatching work to the cells is not an enjoyable task :D
I'd go for GPU, for three reasons:
(a) GPU code can be developed, tested, and run on pretty much any PC you may want to use, with the only dependency being a $150 video card, whereas CELL/PS3 is a much more custom development environment and won't run natively on your laptop, etc.;
(b) I'm willing to bet a lot that GPUs and CUDA will be alive and well in 5 years, but I wouldn't put money on the PS3 being around that long -- what are you going to do if the PS4 has a totally different architecture and CELL effectively dies?
(c) There's a more vibrant research and development community around GPU than there is around PS3/Cell (outside of strict game development), so you're likely to be in more good company, have example code and tools to work with, etc.
There is no broad "better" choice, it is all dependent on the situation and what you're doing. Probably the biggest PRO to a PS3 is they're cheap by comparison. A computer can more easily scale bigger though (for a price) when looking into things like CUDA.
CUDA is pretty slick. I was shown a presentation recently demonstrating how easy it is to get at the power of the GPU's many cores using a C++ based syntax. If I was starting a parallel computing project now, I would probably take the PC/GPU-based route.
A major objection to the PS3 (which is already quite a wacky choice unless you're under some pretty extreme price/performance constraints) has to be that Sony are dropping support for installation of other OS. In future, PS3s without the disabling firmware update may become harder and harder to get hold of.

Optimisation, Compilers and Their Effects

(i) If a program is optimised for one CPU class (e.g. a multi-core Core i7) by compiling the code on that same CPU, will its performance be sub-optimal on CPUs from older generations (e.g. Pentium 4)? In other words, can optimising prove harmful for performance on other CPUs?
(ii) For optimisation, compilers may use x86 extensions (like SSE4) which are not available in older CPUs. Is there a fall-back to some routine that does not use the extensions on older CPUs?
(iii) Is the Intel C++ Compiler more optimising than the Visual C++ compiler or GCC?
(iv) Will a truly multi-threaded application written for multi-core CPUs perform efficiently on older CPUs (like the Pentium III or 4)?
Compiling on a platform does not mean optimizing for this platform. (maybe it's just bad wording in your question.)
In all compilers I've used, optimizing for platform X does not affect the instruction set, only how it is used, e.g. optimizing for i7 does not enable SSE2 instructions. (GCC, for instance, separates the two: -mtune tunes the code for a CPU without changing the instruction set, while -march also enables that CPU's instruction set extensions.)
Also, optimizers in most cases avoid "pessimizing" non-optimized platforms, e.g. when optimizing for i7, typically a small improvement on i7 will not be chosen if it means a major hit for another common platform.
It also depends on the performance differences between the instruction sets. My impression is that they've become much smaller in the last decade (but I haven't delved too deep lately, so I might be wrong for the latest generations). Also consider that optimizations make a notable difference only in a few places.
To illustrate possible options for an optimizer, consider the following methods to implement a switch statement:
sequence if (x==c) goto label
range check and jump table
binary search
combination of the above
The "best" algorithm depends on the relative cost of comparisons, jumps by fixed offsets and jumps to an address read from memory. They don't differ much on modern platforms, but even small differences can create a preference for one or other implementation.
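The three basic strategies listed above can be mimicked in a few lines of Python. This is only an analogy for what a compiler emits as machine code: the case values, labels and function names are invented, and in Python the relative speeds say nothing about real hardware. The point is that all three produce the same result, so the optimizer is free to pick whichever is cheapest on the target.

```python
from bisect import bisect_left

# Hypothetical switch with a sparse set of case values.
CASES = {1: "one", 2: "two", 3: "three", 5: "five"}
DEFAULT = "default"


def dispatch_chain(x):
    """Sequence of compare-and-branch tests, one per case."""
    for key, label in CASES.items():
        if x == key:
            return label
    return DEFAULT


# Range check plus jump table: one slot per value between the lowest and
# highest case, with holes filled by the default target.
LOW, HIGH = min(CASES), max(CASES)
JUMP_TABLE = [CASES.get(v, DEFAULT) for v in range(LOW, HIGH + 1)]


def dispatch_table(x):
    if LOW <= x <= HIGH:
        return JUMP_TABLE[x - LOW]
    return DEFAULT


# Binary search over the sorted case values.
SORTED_KEYS = sorted(CASES)
SORTED_LABELS = [CASES[k] for k in SORTED_KEYS]


def dispatch_bsearch(x):
    i = bisect_left(SORTED_KEYS, x)
    if i < len(SORTED_KEYS) and SORTED_KEYS[i] == x:
        return SORTED_LABELS[i]
    return DEFAULT
```

The chain costs one comparison per case, the table costs a range check plus an indirect jump, and the binary search costs about log2(n) comparisons, which is exactly the trade-off the paragraph above describes.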
(i) It is probably true that optimising code for execution on CPU X will make that code less optimal on CPU Y than the same code optimised for execution on CPU Y. Probably.
(ii) Probably not.
(iii) Impossible to generalise. You have to test your code and come to your own conclusions.
(iv) Probably not.
For every argument about why X should be faster than Y under some set of conditions (choice of compiler, choice of CPU, choice of optimisation flags for compilation) some clever SOer will find a counter-argument, for every example a counter-example. When the rubber meets the road the only recourse you have is to test and measure. If you want to know whether compiler X is 'better' than compiler Y first define what you mean by better, then run a lot of experiments, then analyse the results.
I) If you did not tell the compiler which CPU type to favor, the odds are that it will be slightly sub-optimal on all CPUs. On the other hand, if you let the compiler know to optimize for your specific type of CPU, then it can definitely be sub-optimal on other CPU types.
II) No (for Intel and MS at least). If you tell the compiler to compile with SSE4, it will feel safe using SSE4 anywhere in the code without testing. It becomes your responsibility to ensure that your platform is capable of executing SSE4 instructions, otherwise your program will crash. You might want to compile two libraries and load the proper one. An alternative to compiling for SSE4 (or any other instruction set) is to use intrinsics; these will check internally for the best-performing set of instructions (at the cost of a slight overhead). Note that I am not talking about instruction intrinsics here (those are specific to an instruction set), but about intrinsic functions.
III) That is a whole other discussion in itself. It changes with every version, and may be different for different programs. So the only solution here is to test. Just a note though: Intel compilers are known not to compile well for running on anything other than Intel (e.g. intrinsic functions may not recognize the instruction set of an AMD or Via CPU).
IV) If we ignore the on-die efficiencies of newer CPUs and the obvious architecture differences, then yes, it may perform as well on an older CPU. Multi-core processing is not dependent per se on the CPU type. But the performance is VERY dependent on the machine architecture (e.g. memory bandwidth, NUMA, chip-to-chip bus) and on differences in the multi-core communication (e.g. cache coherency, bus locking mechanism, shared cache). All this makes it impossible to compare newer and older CPU efficiencies in MP, but I believe that is not what you are asking. So on the whole, an MP program made for newer CPUs should not use the MP aspects of older CPUs any less efficiently. In other words, just tweaking the MP aspects of a program specifically for an older CPU will not do much. Obviously you could rewrite your algorithm to use a specific CPU more efficiently (e.g. a shared cache may permit you to use an algorithm that exchanges more data between worker threads, but this algorithm will die on a system with no shared cache, full bus lock and poor memory latency/bandwidth), but that involves a lot more than just MP-related tweaks.
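The "compile two libraries and load the proper one" idea boils down to checking the CPU's feature flags once at start-up and then choosing a build. The sketch below shows that selection logic only. It assumes Linux on x86, where /proc/cpuinfo has a "flags" line containing names such as sse2 and sse4_2, and the library file names are hypothetical placeholders.

```python
def parse_cpu_flags(cpuinfo_text):
    """Extract the feature flag set from /proc/cpuinfo-style text (Linux, x86)."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def pick_build(flags):
    """Choose the most specialised build the CPU can run (names are made up)."""
    if "sse4_2" in flags:
        return "libkernel_sse42.so"
    if "sse2" in flags:
        return "libkernel_sse2.so"
    return "libkernel_generic.so"


# Hand-written sample text; on a real machine you would read /proc/cpuinfo.
SAMPLE_NEW = "processor\t: 0\nflags\t\t: fpu sse sse2 ssse3 sse4_1 sse4_2\n"
SAMPLE_OLD = "processor\t: 0\nflags\t\t: fpu sse sse2\n"
new_choice = pick_build(parse_cpu_flags(SAMPLE_NEW))
old_choice = pick_build(parse_cpu_flags(SAMPLE_OLD))
```

Doing the check once and dispatching at the library level keeps the per-call overhead at zero, at the price of shipping several builds of the same code.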
(1) Not only is it possible but it has been documented on pretty much every generation of x86 processor. Go back to the 8088 and work your way forward, every generation. Clock for clock the newer processor was slower for the current mainstream applications and operating systems (including Linux). The 32 to 64 bit transition is not helping, more cores and less clock speed is making it even worse. And this is true going backward as well for the same reason.
(2) Bank on your binaries failing or crashing. Sometimes you get lucky, most of the time you don't. There are new instructions, yes, and to support them would probably mean trapping on an undefined instruction and having a software emulation of that instruction, which would be horribly slow, and the lack of demand for it means it is probably not well done or just not there. Optimization can use new instructions, but more than that, the bulk of the optimization that I am guessing you are talking about has to do with reordering the instructions so that the various pipelines do not stall. So if you arrange them to be fast on one generation of processor, they will be slower on another, because in the x86 family the cores change too much. AMD had a good run there for a while, as they would make the same code just run faster instead of trying to invent new processors that eventually would be faster when the software caught up. That is no longer true; both AMD and Intel are struggling to just keep chips running without crashing.
(3) Generally, yes. For example gcc is a horrible compiler; one size fits all fits no one well, and it can never and will never be any good at optimizing. For example gcc 4.x code is slower than gcc 3.x code for the same processor (yes, all of this is subjective, it all depends on the specific application being compiled). The in-house compilers I have used were leaps and bounds ahead of the cheap or free ones (I am not limiting myself to x86 here). Are they worth the price though? That is the question.
In general, because of the horrible new programming languages and gobs of memory, storage and layers of caching, software engineering skills are at an all-time low. This means the pool of engineers capable of making a good compiler, much less a good optimizing compiler, decreases with time; this has been going on for at least 10 years. So even the in-house compilers are degrading with time, or the companies just have their employees work on and contribute to the open source tools instead of having an in-house tool. Also, the tools the hardware engineers use are degrading for the same reason, so we now have processors that we hope will just run without crashing, rather than processors we try to optimize for. There are so many bugs and chip variations that most of the compiler work is avoiding the bugs. Bottom line, gcc has singlehandedly destroyed the compiler world.
(4) See (2) above. Don't bank on it. Your operating system that you want to run this on will likely not install on the older processor anyway, saving you the pain. For the same reason that the binaries optimized for your pentium III ran slower on your Pentium 4 and vice versa. Code written to work well on multi core processors will run slower on single core processors than if you had optimized the same application for a single core processor.
The root of the problem is that the x86 instruction set is dreadful. So many far superior instruction sets have come along that do not require hardware tricks to make them faster every generation. But the Wintel machine created two monopolies and the others couldn't penetrate the market. My friends keep reminding me that these x86 machines are microcoded such that you really don't see the instruction set inside. It angers me even more that the horrible ISA is just an interpretation layer. It is kinda like using Java. The problems you have outlined in your questions will continue so long as Intel stays on top. If the replacement does not become the monopoly, then we will be stuck forever in the Java model where you are on one side or the other of a common denominator: either you emulate the common platform on your specific hardware, or you are writing apps and compiling to the common platform.

Multi core programming

I want to get into multi core programming (not language specific) and wondered what hardware could be recommended for exploring this field.
My aim is to upgrade my existing desktop.
If at all possible, I would suggest getting a dual-socket machine, preferably with quad-core chips. You can certainly get a single-socket machine, but dual-socket would let you start seeing some of the effects of NUMA memory that are going to be exacerbated as the core counts get higher and higher.
Why do you care? There are two huge problems facing multi-core developers right now:
The programming model: Parallel programming is hard, and there is (currently) no getting around this. A quad-core system will let you start playing around with real concurrency and all of the popular paradigms (threads, UPC, MPI, OpenMP, etc).
Memory: Whenever you start having multiple threads, there is going to be contention for resources, and the memory wall is growing larger and larger. A recent article at Ars Technica outlines some (very preliminary) research at Sandia that shows just how bad this might become if current trends continue. Multicore machines are going to have to keep everything fed, and this will require that people be intimately familiar with their memory system. Dual-socket adds NUMA to the mix (at least on AMD machines), which should get you started down this difficult road.
If you're interested in more info on performance inconsistencies with multi-socket machines, you might also check out this technical report on the subject.
Also, others have suggested getting a system with a CUDA-capable GPU, which I think is also a great way to get into multithreaded programming. It's lower level than the stuff I mentioned above, but throw one of those on your machine if you can. The new Portland Group compilers have provisional support for optimizing loops with CUDA, so you could play around with your GPU even if you don't want to learn CUDA yourself.
Quad-core, because it'll permit you to work on problems where the number of concurrent processes is > 2, which is often where problems stop being trivial.
I would also, for sheer geek squee, pick up a nice NVidia card and use the CUDA API. If you have the bucks, there's a stand-alone CUDA workstation that plugs into your main computer via a cable and an expansion slot.
It depends what you want to do.
If you want to learn the basics of multithreaded programming, then you can do that on your existing single-core PC. (If you have 2 threads, then the OS will switch between them on a single-core PC. Then when you move to a dual-core PC they should automatically run in parallel on separate cores, for up to a 2x speedup.) This has the advantage of being free! The disadvantages are that you won't see a speedup (in fact a parallel implementation is probably slightly slower due to overheads), and that buggy code has a slightly higher chance of working.
However, although you can learn multithreaded programming on a single-core box, a dual-core (or even HyperThreading) CPU would be a great help.
If you want to really stress-test the code you're writing, then as "blue tuxedo" says, you should go for as many cores as you can easily afford, and if possible get hyperthreading too.
If you want to learn about algorithms for running on graphics cards - which is a very different area from x86 multicore - then get CUDA and buy a normal NVIDIA graphics card that supports it.
I'd recommend at least a quad-core processor.
You could try tinkering with CUDA. It's free, not that hard to use and will run on any recent NVIDIA card.
Alternatively, you could get a PlayStation 3 and the Linux SDK and work out how to program a Cell processor. Note that the next cheapest option for Cell BE development is an order of magnitude more expensive than a PS3.
Finally, any modern motherboard that will take a Core Quad or quad-core Opteron (get a good one from Asus or some other reputable manufacturer) will let you experiment with a multi-core PC system for a reasonable sum of money.
The difficult thing with multithreaded/core programming is that it opens a whole new can of worms. The bugs you'll be faced with are usually not the one you're used to. Race conditions can remain dormant for ages until they bite and your mainstream language compiler won't assist you in any way. You'll get random data and/or crashes that only happen once a day/week/month/year, usually under the most mysterious conditions...
One thing remains true, fortunately: the higher the concurrency exhibited by a computer, the more race conditions you'll unveil.
So if you're serious about multithreaded/core programming, then go for as many CPU cores as possible. Keep in mind that neither hyperthreading nor SMT allow for the level of concurrency that multiple cores provide.
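As a concrete picture of the kind of bug described above, consider several threads incrementing one shared counter. The increment is a read-modify-write sequence that is not guaranteed to be atomic, so without synchronisation updates can occasionally be lost. The sketch below shows the protected version using a lock; it is written in Python purely for illustration, the function name is made up, and the same pattern applies to mutexes in any threaded language.

```python
import threading


def count_with_lock(n_threads=4, increments=10000):
    """Increment a shared counter from several threads, guarded by a lock."""
    counter = 0
    lock = threading.Lock()

    def worker():
        nonlocal counter
        for _ in range(increments):
            # Without the lock, two threads could read the same old value
            # and one of the two increments would be lost.
            with lock:
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


total = count_with_lock(n_threads=4, increments=5000)
```

Removing the lock is a cheap experiment: the result may still be correct on most runs, which is exactly why such bugs stay dormant for so long.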
I would agree that, depending on what you ultimately want to do, you can probably get by with just your current single-core system. Multi-core programming is basically multi-threaded programming, and you can certainly do that on a single-core chip.
When I was a student, one of our projects was to build a thread-safe implementation of the malloc library for C. Even on a single-core processor, that was more than enough to cure me of my desire to get into multi-threaded programming. I would try something small like that before you start thinking about spending lots of money.
I agree with the others where I would upgrade to a quad-core processor. I am also a BIG FAN of ASUS Motherboards (the P5Q Pro is excellent for Core2Quad and Core2Duo processors)!
The draw for multi-core programming is that you have more resources to get things done faster. If you are serious about multi-core programming, then I would absolutely get a quad-core processor. I don't believe that you need to get the new i7 architecture from Intel to take advantage of multi-core processing, because anything written to take advantage of the Core2Duo or Core2Quad will just run better on the newer architecture.
If you are going to dabble in multi-core programming, then I would get a good Core2Duo processor. Remember, it's not just how many cores you have, but also how FAST the cores are to process the jobs. My Core2Duo running at 4GHz routinely completes jobs faster than my Core2Quad running at 2.4GHz even with a multi-core program.
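The clock-versus-cores observation can be checked with an idealised Amdahl-style model: the serial part of a job runs on one core, the rest is split evenly, and speed is assumed to scale linearly with clock frequency. Those are strong assumptions (no memory or cache effects, identical cores), and the function name is hypothetical, so take this as rough arithmetic rather than a benchmark.

```python
def run_time(work, serial_fraction, cores, ghz):
    """Idealised run time: serial part on one core, the rest split evenly."""
    serial = work * serial_fraction / ghz
    parallel = work * (1.0 - serial_fraction) / (cores * ghz)
    return serial + parallel


# Dual-core at 4 GHz versus quad-core at 2.4 GHz, same amount of work.
duo_20 = run_time(1.0, 0.20, cores=2, ghz=4.0)    # 20% serial work
quad_20 = run_time(1.0, 0.20, cores=4, ghz=2.4)
duo_05 = run_time(1.0, 0.05, cores=2, ghz=4.0)    # 5% serial work
quad_05 = run_time(1.0, 0.05, cores=4, ghz=2.4)

# Under this model the two machines tie when 1/9 of the work is serial.
BREAK_EVEN = 1.0 / 9.0
```

In this model the faster dual-core wins as soon as more than about 11% of the work is serial, so the result above is plausible for any program that is not almost perfectly parallel.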
Let me know if this helps!
JFV