Why not 1 CPU core in place of multiple cores? - ram

I've just read something about how CPU cores interact with each other. I can be wrong on some points so don't hesitate to correct me.
So a CPU will basically run instructions that are stored in the L2 or L3 cache. These instructions are addresses that reference to an object in the DRAM.
A multi-core CPU will be able to run more instructions, this will result in better performance. But there is a little problem with that: these cores have to interact with each other, and this is slowing down a little bit the process.
So now that I go back to my question: Why do we not use 1 CPU with bigger cache? As I think, this should give more performance for less costs? Right?
I know these are some basic things that you should know lol. I feel a little bit weird asking this.
Any answer would be welcome!

Multiple cores means you have duplicated circuitry which allow you to do more work in parallel. Each core has its own L1 dcache and icache along with their own registers, decode units, execution pipelines etc.
Just having a bigger cache and a 20 ghz clock won't give you as much performance because you still have to share all other resources.

As someone said to me, I'm forgetting clock speed of CPU's.
LOL
Just Imagine a single core 20GHz; the refresh rate would be too fast to interact with the RAM. In other terms, that means that too much data would be lost, which would result in a crash.
Same holds true with overclocking. -_-

Related

Optimizing Tensorflow for a 32-cores computer

I'm running a tensorflow code on an Intel Xeon machine with 2 physical CPU each with 8 cores and hyperthreading, for a grand total of 32 available virtual cores. However, I run the code keeping the system monitor open and I notice that just a small fraction of these 32 vCores are used and that the average CPU usage is below 10%.
I'm quite the tensorflow beginner and I haven't configured the session in any way. My question is: should I somehow tell tensorflow how many cores it can use? Or should I assume that it is already trying to use all of them but there is a bottleneck somewhere else? (for example, slow access to the hard disk)
TensorFlow will attempt to use all available CPU resources by default. You don't need to configure anything for it. There can be many reasons why you might be seeing low CPU usage. Here are some possibilities:
The most common case, as you point out, is the slow input pipeline.
Your graph might be mostly linear, i.e. a long narrow chain of operations on relatively small amounts of data, each depending on outputs of the previous one. When a single operation is running on smallish inputs, there is little benefit in parallelizing it.
You can also be limited by the memory bandwidth.
A single session.run() call takes little time. So, you end up going back and forth between python and the execution engine.
You can find useful suggestions here
Use the timeline to see what is executed when

Could a GPU be Programmed and/or Modified to Carry Out CPU Instructions?

I was wondering if a GPU could behave like a CPU if modified or programmed to do so. If there is a way, I would also like to know how that could be done. The reason why is, well, sometimes I do that kind of stuff as experiments, just for fun. Plus, if it isn't a big hassle, then it would be much better than buying an expensive processor just to get better performance. I usually don't need my GPU, only because I use my computer for the simplest of things. My other computer, that's a slightly different story (because I use it for video playback), but you get the idea.
Yes, it's called GPGPU (general purpose GPU), and with it you could program some CPU-like workloads on your GPU using languages like CUDA or OpenCL.
Of course this method doesn't work well with any workload, the CPU is still much better in single-threaded hard-to-parallelize codes, or codes with complicated control flow (due to branch predictors) or memory locality (due to better caching and prefetching). GPGPUs are mostly better for performing very straight-forward highly parallel vectorizable code.
In fact, this method of computation caught enough traction to create a new lines of products, (such as Xeon Phi, formely Larrabee), and enhancing existing GPUs (e.g. Tesla/Fermi, and others)
EDIT
Having reread your question - if you mean running actual CPU ISA on such GPGPU, not just some general CPU task, then the best bet is Xeon Phi mentioned above, it's intended to be based on the same ISA as the CPU (it's the only x86 GPGPU I know of).

Programe Execution Optimization

I am writing a Program for Parabolic Time Price Systems based on the book written by J.Welles Wilder Jr.
I am have way through the program, running with an execution time of 122 microsecs. This is way above the benchmark limit. What I was looking for is a few views and tips if I
write a kernel space program to achieve the same. Implementing it through drivers
[Really keen on this method] Is it possible, if yes then how and where I should start looking, passing instructions to a graphic driver to perform the steps and calculation (Read this in a blog somewhere).
Thanks in Advance.
--->Programming on c
What makes GPU very fast is the fact that it can run around 2000~ (depending on the card) threads asynchronously.
If you code can be divided into threads then it might improve your performance to do the calculations on the gpgpu since average CPU speed is 50-100 GFlops and average GPU speed is 1500~ when used correctly.
Also you might want to consider the difficulties of maintaining gpgpu code. I suggest you that if you have an NVidia GPU you should check out 'Managed CUDA' since it contains a debugger and a GPU profiler which makes it possible to work with.
TL;DR: use gpgpu only for async code and preferably use 'managed CUDA' if possible

Creating a heater application

This might seem weird, but I'm interesting in creating an electric heater out of my computer, that is program an application, that heats up my PC, and I need some help.
I currently made an application, that runs infinite loops on the GPU (using a little shader), and on the CPU cores, however I'm interesting in getting the ram going too, as well as the several output ports, so.. About the ram heating, just allocate, and start randomly accessing and writing using all 8 cores?
And what about triggering CD-ROM, floppy etc, how do I do this?
How about heater with a purpose? Just run World Community Grid, create tons of heat while making your computer do valuable computations for science. It runs the processors wide open, is stable, and isn't just wasting cycles.
Have a look at How to stress test a computer If your interested in making your own try searching for open source stress test software that you could modify to your liking.
Use Furmark together with LinX/Prime95. Max out your settings. Make sure you have a strong enough PSU.
There`s a torture test option for CPU & RAM in Prime95 that looks like what you want. As for the GPU, there is Furmark which achieves the same kind of stress.
The heat from the other components will likely be not relevant (unless you have something really specific like a physx card) if you stress enough your cpu and gpu imho.

Optimizing for ARM: Why different CPUs affects different algorithms differently (and drastically)

I was doing some benchmarks for the performance of code on Windows mobile devices, and noticed that some algorithms were doing significantly better on some hosts, and significantly worse on others. Of course, taking into account the difference in clock speeds.
The statistics for reference (all results are generated from the same binary, compiled by Visual Studio 2005 targeting ARMv4):
Intel XScale PXA270
Algorithm A: 22642 ms
Algorithm B: 29271 ms
ARM1136EJ-S core (embedded in a MSM7201A chip)
Algorithm A: 24874 ms
Algorithm B: 29504 ms
ARM926EJ-S core (embedded in an OMAP 850 chip)
Algorithm A: 70215 ms
Algorithm B: 31652 ms (!)
I checked out floating point as a possible cause, and while algorithm B does use floating point code, it does not use it from the inner loop, and none of the cores seem to have a FPU.
So my question is, what mechanic may be causing this difference, preferrably with suggestions on how to fix/avoid the bottleneck in question.
Thanks in advance.
One possible cause is that the 926 has a shorter pipeline (5 cycles vs. 8 cycles for the 1136, iirc), so branch mispredictions are less costly on the 926.
That said, there are a lot of architectural differences between those processors, too many to say for sure why you see this effect without knowing something about the instructions that you're actually executing.
Clock speed is only one factor. Bus width and latency are big if not bigger factors. Cache is a factor. Speed of the media the program is run from if run from media and not memory.
Is this test using any shared libraries at all at any point in the test or is it all internal code? Fetching shared libraries on media that will vary from platform to platform (even if it is say the same sd card).
Is this the same algorithm compiled separately for each platform or the same binary? You can and will see some compiler induced variation as well. 50% faster and slower can easily come from the same compiler on the same platform by varying compiler settings. If possible you want to execute the same binary, and insure that no shared libraries are used in the loop under test. If not the same binary disassemble the loop under test for each platform and insure that there are no variations other than register selection.
From the data you have presented, its difficult to point the exact problem, but we can share some of the prior experience
Cache setting (check if all the
processors has the same CACHE
setting)
You need to check both D-Cache and I-Cache
For analysis,
Break down your code further, not just as algorithm but at a block level, and try to understand the block that causes the bottle-neck. After you find the block that causes the bottle-neck, try to disassemble the block's source code, and check the assembly. It may help.
Looks like the problem is in cache settings or something memory-related (maybe I-Cache "overflow").
Pipeline stalls, branch miss-predictions usually give less significant differences.
You can try to count some basic operations, executed in each algorithm, for example:
number of "easy" arithmetical/bitwise ops (+-|^&) and shifts by constant
number of shifts by variable
number of multiplications
number of "hard" arithmetics operations (divides, floating point ops)
number of aligned memory reads (32bit)
number of byte memory reads (8bit) (it's slower than 32bit)
number of aligned memory writes (32bit)
number of byte memory writes (8bit)
number of branches
something else, don't remember more :)
And you'll get info, that things get 926 much slower. After this you can check suspicious blocks, making using of them more or less intensive. And you'll get the answer.
Furthermore, it's much better to enable assembly listing generation in VS and use it (but not your high-level source code) as base for research.
p.s.: maybe the problem is in OS/software/firmware? Did you testing on clean system? OS is the same on all devices?