Why is von Neumann faster than Harvard architecture? [closed] - embedded

I read about these two types of architecture, and somewhere on the internet someone said that systems using the von Neumann architecture are faster than ones using the Harvard architecture. I tried searching for why this is the case, but I have yet to find an explanation that clears things up for me.
In my understanding:
- in a von Neumann architecture the CPU can do one memory operation at a time, meaning it can fetch data or fetch an instruction from memory in one cycle. So to perform some sort of operation on data it needs 2 cycles (one to fetch the data and one to fetch the instruction).
- in a Harvard architecture the CPU can fetch both data and an instruction in the same clock cycle, since there are 2 separate memory blocks and two separate sets of address and data buses.
So if the Harvard architecture can do in one cycle what von Neumann needs two cycles for, why would it be the slower one? Don't fewer cycles for the same work mean it should be faster than the other? Please go easy on me, I'm a noob in embedded systems. Thank you for reading my post!

In a von Neumann architecture, the CPU operates sequentially, i.e. it fetches an instruction, decodes it, fetches the operands (data), computes the result, and stores it. All these steps use the same memory channel.
A Harvard architecture has two memory channels, one for instructions and one for data. It has an advantage over the von Neumann architecture if the CPU supports pipelining, i.e. while instruction x, which has already been decoded, is fetching its operands (data) over the data channel, instruction x+1 is fetched at the same time over the instruction channel.
So, if the CPU is pipelined, a Harvard architecture is faster than a von Neumann architecture.
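To make that concrete, here is a toy cycle-count sketch (the instruction mix and the one-cycle-per-access assumption are invented purely for illustration) of how a pipelined Harvard machine hides the data access behind the next instruction fetch, while a single shared bus has to serialize them:

    #include <stdio.h>

    int main(void) {
        /* Assumed toy workload: every instruction costs one fetch cycle,
           and some of them are loads/stores that need one extra memory
           cycle for the data access. */
        const int instructions = 100;
        const int loads_stores = 30;

        /* Single bus: the data access blocks the next instruction fetch. */
        const int von_neumann_cycles = instructions + loads_stores;

        /* Two buses + pipeline: the data access overlaps the next fetch. */
        const int harvard_cycles = instructions;

        printf("von Neumann: %d cycles, Harvard: %d cycles\n",
               von_neumann_cycles, harvard_cycles);
        return 0;
    }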

This is all purely academic, and very dated. From the textbook point of view, the Harvard architecture can be performing a data transaction and an instruction transaction at the same time, where von Neumann can only do one or the other at a time.
True Harvard has the problem that you can't actually use it. You can't have a bootloader and you can't have an operating system (one that loads programs), because you can't use data transactions to put instructions into memory and then branch to those instructions and run them; the two memory systems are separate. Once you cross the paths it isn't Harvard any more, it's a modified Harvard or a von Neumann.
Looking at how Wikipedia defines it, the modern buses you see today are modified Harvard, because of the definition that you can't do data and instruction at the same time with von Neumann, yet they use the same buses. You will see a read address bus, a read data bus, a write address bus and a write data bus; both instructions and data cross the read buses, and data goes across the write buses. Many transactions can be happening at the same time: a multi-bus-width-sized instruction fetch can start on one clock cycle with a read address request, the next clock cycle a data read address request can start on the same bus, some number of clocks later the instruction address request is acked, then the data read address request is acked (they don't necessarily have to come back in the same order, it depends on the design), then the read data bus delivers the data and the processor acks that. The write bus can be handling multiple data transactions in flight at the same time as well. And the read and write buses, being independent, can be doing things at the same time, not just each having multiple transactions in flight at the same time.
None of this has anything to do with the instruction set; you can put, and people do put, different buses behind the same instruction set. Depending on the instruction set and on how the fetching, pipeline and caching work, a pure textbook von Neumann can come close to matching the performance of a pure textbook Harvard. But if you think of pre-cache, pre-pipeline, one-instruction-at-a-time architectures, then you can say 1) neither wins, as the instruction fetch has to wait for the data transaction of loads and stores (or other instructions with memory access) to complete before the next fetch happens, so Harvard can't really do data and instruction at the same time; or 2) you can say that Harvard is allowed to do things in parallel and the von Neumann isn't, and Harvard wins, as it can complete a simple data transaction and do the next fetch in the same cycle, periodically beating von Neumann by a cycle.
In a pure sense though, one instruction at a time, the von Neumann cannot be faster than a Harvard; it can tie but can't win. Harvard has two buses that can operate in parallel, and with all other factors held constant (instruction set, pipeline design, prefetching, etc.) that difference gives Harvard a slight performance advantage.
Note that one instruction at a time with no pipeline means it takes multiple clock cycles to perform most instructions, as you see with pre-cache, pre-pipeline processors; they have tables of how many clocks each instruction takes, and you can just look at the instruction and see how and why it takes that many. Even with a pipeline, Harvard has a slight advantage. But if you, say, double the width of the von Neumann bus compared to the Harvard, so you can fetch two instructions at a time and perform data operations on two sequential data locations at a time, now you have better bandwidth than the Harvard and can tie or beat it at times; but that isn't a pure comparison.
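As a back-of-the-envelope sketch of that last point (the bus widths and the one-transfer-per-cycle assumption are invented for illustration), comparing peak bits moved per cycle:

    #include <stdio.h>

    int main(void) {
        /* Invented numbers: 32-bit buses, one transfer per bus per cycle. */
        const int harvard_bits_per_cycle     = 32 + 32; /* instruction + data bus in parallel */
        const int von_neumann_bits_per_cycle = 32;      /* one shared bus */
        const int wide_vn_bits_per_cycle     = 64;      /* von Neumann with a doubled bus width */

        printf("Harvard: %d bits/cycle, von Neumann: %d, wide von Neumann: %d\n",
               harvard_bits_per_cycle, von_neumann_bits_per_cycle, wide_vn_bits_per_cycle);
        return 0;
    }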
Again, these notions are very dated. There are a very small number of Harvard-ish processors, but to make them useful they are really modified Harvard, as there is some way to bridge the gap between the two memory systems.

Related

How can computational requirements be compared?

Calculating the solution to an optimization problem takes a 2 GHz CPU one hour. During this process there are no background processes, no RAM is being used and the CPU is at 100% capacity.
Based on this information, can it be derived that a 1 GHz CPU will take two hours to solve the same problem?
A quick search on IPC, frequency, and chip architecture will show you this topic has been broached many times. There are many things that can determine the execution speed of a program (without even going into threading at all); the main ones that come to mind:
Instruction set - If one chip has an instruction for multiplication, then a*b is atomic. If not, you will need a lot of atomic instructions to perform such an action - a big difference in speed, which can make even higher-frequency chips slower.
Cycles per second - this is the frequency of the chip.
Instructions per cycle (IPC) - what you are really interested in is IPC*frequency, not just frequency: how many atomic actions you can perform in a second. Once the number of atomic actions is accounted for (see point 1), on a single-threaded application this might behave as you expect (x2 this => x2 faster program), though there are no guarantees; see the sketch after this answer.
And there are a ton of other nuanced technologies that can affect this, like branch prediction, which hit the news big time recently. For a complete understanding a book/course might be a better resource.
So, in general, no. If you are comparing two single core, same architecture chips (unlikely), then maybe yes.
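As a rough sketch of how these factors combine (the instruction count below is chosen to match the one-hour figure above; the IPC numbers are invented): time = instructions / (IPC * frequency).

    #include <stdio.h>

    int main(void) {
        /* Made-up workload sized so the 2 GHz, IPC = 1 chip takes one hour. */
        const double instructions = 7.2e12;
        const double ipc_a = 1.0, freq_a = 2.0e9;   /* the 2 GHz chip             */
        const double ipc_b = 2.5, freq_b = 1.0e9;   /* a 1 GHz chip with higher IPC */

        printf("2 GHz chip: %.2f hours\n", instructions / (ipc_a * freq_a) / 3600.0);
        printf("1 GHz chip: %.2f hours\n", instructions / (ipc_b * freq_b) / 3600.0);
        return 0;
    }

With these made-up numbers the 1 GHz chip actually finishes first, which is exactly why "half the frequency = double the time" does not hold in general.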

How many nand gates does a computer actually need to operate?

At first I was thinking that logic gates were much smaller than they actually are:
https://www.google.com/search?q=nand+gates#q=nand+gates&tbm=shop
So my question is, how many logic gates (similar to the one above) does a computer actually need to operate? Since this number must be somewhat small due to size limitations (there clearly cannot be millions of these in a computer), how is it that the computer can work with such a small number of these gates?
You can't just shop for the NAND gates that go into a computer; those discrete parts are for hobbyists and other markets. The NAND gates in your computer's processor are not individual components: they are lithographed directly onto the die, they are a few nanometers in size, and there are billions of them on a modern processor.
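As a small illustration of why NAND alone is enough - every other gate can be built from NANDs, and a processor is billions of such compositions - here is a software simulation (not real hardware):

    #include <stdio.h>

    /* A NAND gate, and the other basic gates built purely out of NANDs. */
    static int nand_(int a, int b) { return !(a && b); }
    static int not_ (int a)        { return nand_(a, a); }
    static int and_ (int a, int b) { return not_(nand_(a, b)); }
    static int or_  (int a, int b) { return nand_(not_(a), not_(b)); }
    static int xor_ (int a, int b) { return or_(and_(a, not_(b)), and_(not_(a), b)); }

    int main(void) {
        /* A 1-bit half adder: sum and carry, built from nothing but NANDs. */
        for (int a = 0; a <= 1; ++a)
            for (int b = 0; b <= 1; ++b)
                printf("%d + %d -> sum=%d carry=%d\n", a, b, xor_(a, b), and_(a, b));
        return 0;
    }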

Relationship between number of logic cells on an FPGA and performance

Hey, so I have a question about FPGAs. If you look at the current lineup of Xilinx products, specifically the 7 series, there is a massive price differential between the models. What I don't understand is why I would buy a Virtex-7 with ~2000k logic cells for in excess of $20,000 when I could buy an Artix-7 with ~200k logic cells for $300. Could I just buy 10 Artix-7s and get the same performance? Furthermore, is performance linearly related to the number of logic cells, and if not, then how are they related? Is there any advantage to having more logic cells per core? I'm sure it depends on what you are doing, but as my interest in the matter, although theoretical, lies in cryptographic applications, my question relates specifically to implementations of MD5, SHA-0/1/2/3, and similar hash algorithms.
An FPGA doesn't have "performance" like a processor. It just has a bunch of logic elements (LEs) that you can use. If a high-end part has 2MLEs and a low-end part has 200kLEs, but you only need 20kLEs for your processing core, it makes little difference which one you use, all else being equal. Of course, if you have a problem that can easily be parallelized, then you can turn those extra LEs into extra performance by building more processing cores. But that's up to you to do.
Now, all else is not always equal, because there's a lot more to an FPGA than simply the number of logic cells. I can't speak for Xilinx parts (I work for another major FPGA vendor) but typically the high-end families will have things like very high-speed transceivers that the midrange and low-end families do not. In addition, sometimes they have different mixes of embedded RAM, DSP, etc.
So, can you use a bunch of small FPGAs instead of a large one? Remember that an FPGA will only have about 1000-2000 I/Os, whereas there will be more like hundreds of thousands of internal wires between the corresponding regions of the higher-end part. So not only will you have to build a pretty complicated board, you might also find yourself I/O-limited in getting signals off of one chip and onto another.
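As a rough sketch of the "more LEs can become more parallel cores" point (the LEs-per-core and clock figures below are assumptions, not real resource counts for any of these parts):

    #include <stdio.h>

    int main(void) {
        /* Assumed: one pipelined hash core fits in ~20k LEs and produces
           one digest per clock at ~200 MHz.  Real numbers vary widely. */
        const int    le_per_core     = 20000;
        const double hashes_per_core = 200e6;

        const int artix_le  = 200000;    /* ~200k LEs  */
        const int virtex_le = 2000000;   /* ~2000k LEs */

        printf("Artix-7 : ~%d cores, ~%.1e hashes/s\n",
               artix_le / le_per_core,  (artix_le / le_per_core)  * hashes_per_core);
        printf("Virtex-7: ~%d cores, ~%.1e hashes/s\n",
               virtex_le / le_per_core, (virtex_le / le_per_core) * hashes_per_core);
        return 0;
    }

For an embarrassingly parallel workload like hash search, throughput really does scale roughly with the number of cores you can fit, but the board-level and I/O points above still apply if you try to split the design across ten small chips.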

Embarrassingly parallelizable computation with CUDA, how to start?

I need to accelerate many computations I am now doing with PyLab. I thought of using CUDA. The overall unit of computation (A) consists of doing several thousand entirely independent smaller computations (B). Each of them involves, at its initial stage, doing 40-41 independent, even smaller, computations (C). So parallel programming should really help. With PyLab the overall (A) takes 20 minutes and (B) takes a few tenths of a second.
As a beginner in this realm, my question is what level I should parallelize the computation at, whether at (C) or at (B).
I should clarify that the (C) stage consists of taking a bunch of data (thousands of floats) which is shared between all the (C) processes and doing various tasks on it, among which one of the most time-consuming is linear regression, which is itself parallelizable! The output of each procedure (C) is a single float. Each computation (B) basically consists of running procedure (C) many times and doing a linear regression on the data that comes out. Its output, again, is a single float.
I'm not familiar with CUDA programming so I am basically asking what would be the wisest strategy to start with.
An important consideration when deciding how (and if) to convert your project to CUDA is what kind of memory access patterns your code requires. The GPU runs threads in groups of 32, called warps, and to get the best performance the threads in a warp should access memory in certain basic patterns, which are described in the CUDA Programming Guide (included with CUDA). In general, the more random the access patterns, the more likely the kernel is to become memory bound. In that case, the compute power of the GPU cannot be fully utilized.
The other main case where the compute power of the GPU cannot be fully utilized is when conditional logic and loops cause the threads in a warp to run through different code paths, as the GPU has to run all the threads in the warp through each code path.
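To make both points concrete, here is a minimal CUDA sketch (kernel names are illustrative; host-side allocation and launch code is omitted). The first two kernels differ only in their access pattern; the third shows warp divergence:

    // Coalesced: consecutive threads in a warp touch consecutive addresses,
    // so the warp's loads are served by a few wide memory transactions.
    __global__ void scale_coalesced(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = 2.0f * in[i];
    }

    // Strided: neighbouring threads touch addresses far apart, so the same
    // warp needs many more transactions and the kernel becomes memory bound.
    __global__ void scale_strided(const float *in, float *out, int n, int stride) {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        if (i < n) out[i] = 2.0f * in[i];
    }

    // Divergent: odd and even threads take different branches, so the warp
    // executes both paths one after the other instead of in parallel.
    __global__ void divergent(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if (i % 2 == 0) data[i] = data[i] * data[i];
        else            data[i] = data[i] + 1.0f;
    }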
If you find that these points may cause issues for your code, you should also do some research to see if there are known alternative ways to implement your code to run better on the GPU (this is often the case).
If you see your question about at which level to parallelize the computation in the light of the above considerations, it may become clear which choice to make.

Can you program a pure GPU game?

I'm a CS master student, and next semester I will have to start working on my thesis. I've had trouble coming up with a thesis idea, but I decided it will be related to Computer Graphics as I'm passionate about game development and wish to work as a professional game programmer one day.
Unfortunately I'm kind of new to the field of 3D computer graphics. I took an undergraduate course on the subject and hope to take an advanced course next semester, and I'm already reading a variety of books and articles to learn more. Still, my supervisor thinks it's better if I come up with a general thesis idea now and then spend time learning about it in preparation for my thesis proposal. My supervisor has supplied me with some good ideas, but I'd rather do something more interesting on my own, which hopefully has to do with games and gives me more opportunities to learn about the field. I don't care if it's already been done; for me the thesis is more of an opportunity to learn about things in depth and to do substantial work on my own.
I don't know much about GPU programming, and I'm still learning about shaders and languages like CUDA. One idea I had is to program an entire game (or as much as possible) on the GPU, including all the game logic, AI, and tests. This is inspired by reading papers on GPGPU and questions like this one. I don't know how feasible that is with my knowledge, and my supervisor doesn't know a lot about recent GPUs. I'm sure with time I will be able to answer this question on my own, but it'd be handy if I could know the answer in advance so I could also consider other ideas.
So, if you've got this far, my question: Using only shaders or something like CUDA, can you make a full, simple 3D game that exploits the raw power and parallelism of GPUs? Or am I missing some limitation or difference between GPUs and CPUs that will always make a large portion of my code bound to CPU? I've read about physics engines running on the GPU, so why not everything else?
DISCLAIMER: I've done a PhD, but have never supervised a student of my own, so take all of what I'm about to say with a grain of salt!
I think trying to force as much of a game as possible onto a GPU is a great way to start off your project, but eventually the point of your work should be: "There's this thing that's an important part of many games, but in its present state doesn't fit well on a GPU: here is how I modified it so it would fit well."
For instance, fortran mentioned that AI algorithms are a problem because they tend to rely on recursion. True, but this is not necessarily a deal-breaker: the art of converting recursive algorithms into an iterative form is looked upon favorably by the academic community, and would form a nice centerpiece for your thesis.
However, as a masters student, you haven't got much time so you would really need to identify the kernel of interest very quickly. I would not bother trying to get the whole game to actually fit onto the GPU as part of the outcome of your masters: I would treat it as an exercise just to see which part won't fit, and then focus on that part alone.
But be careful with your choice of supervisor. If your supervisor doesn't have any relevant experience, you should pick someone else who does.
I'm still waiting for a Gameboy Emulator that runs entirely on the GPU, which is just fed the game ROM itself and current user input and results in a texture displaying the game - maybe a second texture for sound output :)
The main problem is that you can't access persistent storage, user input or audio output from a GPU. These parts have to be on the CPU, by definition (even though cards with HDMI have audio output, I think you can't control it from the GPU). Apart from that, you can already push large parts of the game code onto the GPU, but I think it's not enough for a 3D game, since someone has to feed the 3D data to the GPU and tell it which shaders apply to which part. You can't really randomly access data on the GPU or run arbitrary code; someone has to do the setup.
Some time ago, you would just set up a texture with the source data, a render target for the result data, and a pixel shader that would do the transformation. Then you would render a quad with the shader to the render target, which would perform the calculations, and then read the texture back (or use it for further rendering). Today, things have been made simpler by the fourth and fifth generations of shaders (Shader Model 4.0 and whatever is in DirectX 11), so you can have larger shaders and access memory more easily. But they still have to be set up from the outside, and I don't know how things are today regarding keeping data between frames. In the worst case, the CPU has to read back from the GPU and push the data in again to retain game state, which is always a slow thing to do. But if you can really get to a point where a single generic setup/rendering cycle is sufficient for your game to run, you could say that the game runs on the GPU. The code would be quite different from normal game code, though. Most of the performance of GPUs comes from the fact that they execute the same program in hundreds or even thousands of parallel shading units, and you can't just write a shader that draws an image at a certain position. A pixel shader always runs, by definition, on one pixel, and the other shaders can do things at arbitrary coordinates, but they don't deal with pixels. It won't be easy, I guess.
I'd suggest just trying out the points I mentioned. The most important one is retaining state between frames, in my opinion, because if you can't retain all your data, everything else is impossible.
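A minimal CUDA-style sketch of the "retain state on the GPU between frames" idea (the GameState layout and kernel are purely hypothetical): the state is allocated once in device memory, only inputs go down each frame, and only pixels would ever come back:

    #include <cuda_runtime.h>

    struct GameState { float paddle_x; float ball_x, ball_y, ball_vx, ball_vy; };

    // One thread is enough for this toy update; the state never leaves the GPU.
    __global__ void step_game(GameState *s, float input_dx, float dt) {
        s->paddle_x += input_dx;
        s->ball_x   += s->ball_vx * dt;
        s->ball_y   += s->ball_vy * dt;
        if (s->ball_y < 0.0f || s->ball_y > 1.0f) s->ball_vy = -s->ball_vy;
    }

    int main() {
        GameState *d_state;
        cudaMalloc(&d_state, sizeof(GameState));
        cudaMemset(d_state, 0, sizeof(GameState));

        for (int frame = 0; frame < 600; ++frame) {
            float input_dx = 0.0f;                // would come from user input on the CPU
            step_game<<<1, 1>>>(d_state, input_dx, 1.0f / 60.0f);
            // render/draw kernels would read d_state here; only pixels go back out
        }
        cudaDeviceSynchronize();
        cudaFree(d_state);
        return 0;
    }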
First, I'm not a computer engineer, so my assumptions cannot even be a grain of salt, maybe nano scale.
Artificial intelligence? No problem. There are countless neural network examples running in parallel at Google. Example: http://www.heatonresearch.com/encog
Pathfinding? Just try some of the parallel pathfinding algorithms that are already on the internet. Here's one of them: https://graphics.tudelft.nl/Publications-new/2012/BB12a/BB12a.pdf
Drawing? Use the interoperability of DX or GL with CUDA or CL so drawing doesn't cross the PCI-e lane. You can even do raytracing at the corners so there is no z-fighting anymore; even a pure raytraced screen is doable with a mainstream GPU using a low depth limit.
Physics? The easiest part: just iterate a simple Euler or Verlet integration, with frequent stability checks if the order of error gets big (see the sketch after this list).
Map/terrain generation? You just need a Mersenne Twister and a triangulator.
Save game? Sure, you can compress the data in parallel before writing it to a buffer. Then a scheduler writes that data piece by piece to the HDD through DMA, so no lag.
Recursion? Write your own stack algorithm using main VRAM, not local memory, so other kernels can run in wavefronts and GPU occupancy is better.
Too much integer math needed? You can cast to a float, then do 50-100 calcs using all cores, then cast the result back to integer.
Too much branching? Compute both cases if they are simple, so every core stays in line and finishes in sync. If not, you can write a branch predictor of your own so that the next time it predicts better than the hardware (could it?) with your own genuine algorithm.
Too much memory needed? You can add another GPU to the system and open a DMA channel, or use CF/SLI for faster communication.
The hardest part, in my opinion, is the object-oriented design, since it is very weird and hardware-dependent to build pseudo-objects on a GPU. Objects should be represented in host (CPU) memory, but they must be split over many arrays on the GPU to be efficient. Example objects in host memory: orc1xy_orc2xy_orc3xy. Example objects in GPU memory: orc1_x__orc2_x__ ... orc1_y__orc2_y__ ...
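For the physics point above, here is a minimal CUDA sketch of a position-Verlet step over a structure-of-arrays particle layout (names and layout are illustrative, echoing the orc1_x__orc2_x__ example; host-side setup and launch code is omitted):

    // Position Verlet step, one particle per thread:
    // x_new = 2*x - x_prev + a*dt*dt
    __global__ void verlet_step(float *x, float *y,
                                float *x_prev, float *y_prev,
                                const float *ax, const float *ay,
                                float dt, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        float nx = 2.0f * x[i] - x_prev[i] + ax[i] * dt * dt;
        float ny = 2.0f * y[i] - y_prev[i] + ay[i] * dt * dt;

        x_prev[i] = x[i];  y_prev[i] = y[i];   // remember the old position
        x[i] = nx;         y[i] = ny;          // advance to the new one
    }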
The answer was already chosen 6 years ago, but for those interested in the actual question: Shadertoy, a live-coding WebGL platform, recently added a "multipass" feature allowing preservation of state.
Here's a live demo of the Bricks game running on the GPU.
"I don't care if it's already been done, for me the thesis is more of an opportunity to learn about things in depth and to do substantial work on my own."
Then your idea of what a thesis is is completely wrong. A thesis must be original research. --> edit: I was thinking about a PhD thesis, not a master's thesis ^_^
About your question: the GPU's instruction sets and capabilities are very specific to vector floating point operations. Game logic usually does little floating point and a lot of logic (branches and decision trees).
If you take a look at the CUDA Wikipedia page you will see:
"It uses a recursion-free, function-pointer-free subset of the C language"
So forget about implementing there any AI algorithms that are essentially recursive (like A* for pathfinding). Maybe you could simulate the recursion with stacks, but if it's not allowed explicitly, it must be for a reason. Not having function pointers also limits somewhat the ability to use dispatch tables to handle the different actions depending on the state of the game (again, you could use chained if-else constructions, but something smells bad there).
Those limitations in the language reflect that the underlying hardware is mostly designed for stream-processing tasks. Of course there are workarounds (stacks, chained if-else), and you could theoretically implement almost any algorithm there, but they would probably make the performance suffer a lot.
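As a sketch of the "simulate recursion with stacks" workaround (a plain iterative depth-first walk with an explicit, fixed-size stack; the graph is made up): the same trick lets other recursive searches be expressed in a recursion-free subset:

    #include <stdio.h>

    #define MAX_NODES 8
    #define MAX_STACK 64

    /* Depth-first search over a small adjacency matrix, using an explicit
       stack instead of recursive calls. */
    void dfs_iterative(const int adj[MAX_NODES][MAX_NODES], int n, int start) {
        int stack[MAX_STACK], top = 0;
        int visited[MAX_NODES] = {0};

        stack[top++] = start;
        while (top > 0) {
            int node = stack[--top];
            if (visited[node]) continue;
            visited[node] = 1;
            printf("visit %d\n", node);
            for (int next = n - 1; next >= 0; --next)
                if (adj[node][next] && !visited[next] && top < MAX_STACK)
                    stack[top++] = next;
        }
    }

    int main(void) {
        /* A tiny graph with edges 0-1, 0-2, 1-3, 2-3. */
        const int adj[MAX_NODES][MAX_NODES] = {
            {0,1,1,0,0,0,0,0},
            {1,0,0,1,0,0,0,0},
            {1,0,0,1,0,0,0,0},
            {0,1,1,0,0,0,0,0},
        };
        dfs_iterative(adj, 4, 0);
        return 0;
    }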
The other point is about handling the IO; as already mentioned above, this is a task for the main CPU (because it is the one that executes the OS).
It is viable to do a master's thesis on a subject, and with tools, that you are unfamiliar with when you begin. However, it's a big chance to take!
Of course a master's thesis should be fun. But ultimately it's imperative that you pass with distinction, and that might mean tackling a difficult subject that you have already mastered.
Equally important is your supervisor. It's imperative that you tackle a problem they show an interest in - one that they are themselves familiar with - so that they can become interested in helping you get a great grade.
You've had lots of hobby time for scratching itches, you'll have lots more hobby time in the future too no doubt. But master thesis time is not the time for hobbies unfortunately.
Whilst GPUs today have some immense computational power, they are, regardless of things like CUDA and OpenCL, limited to a restricted set of uses, whereas the CPU is more suited to computing general things, with extensions like SSE to speed up specific common tasks. If I'm not mistaken, some GPUs cannot even divide two floating point numbers in hardware. Certainly things have improved greatly compared to 5 years ago.
It'd be impossible to develop a game that runs entirely on a GPU - it would need the CPU at some stage to execute something. However, making a GPU perform more than just the graphics (and even the physics) of a game would certainly be interesting, with the catch that PC game developers have the big issue of having to contend with a variety of machine specifications, and thus have to restrict themselves to incorporating backwards compatibility, which complicates things. The architecture of a system will be a crucial issue - for example, the PlayStation 3 can do multiple gigabytes per second of throughput between the CPU and RAM and between the GPU and video RAM, yet the CPU accessing GPU memory peaks out at just past 12 MiB/s.
The approach you may be looking for is called "GPGPU" for "General Purpose GPU". Good starting points may be:
http://en.wikipedia.org/wiki/GPGPU
http://gpgpu.org/
Rumors about spectacular successes in this approach have been around for a few years now, but I suspect that this will become everyday practice in a few years (unless CPU architectures change a lot, and make it obsolete).
The key here is parallelism: the GPU helps when you have a problem that can use a large number of parallel processing units. Thus, neural networks or genetic algorithms may be a good range of problems to attack with the power of a GPU. Maybe also looking for vulnerabilities in cryptographic hashes (cracking DES on a GPU would make a nice thesis, I imagine :)). But problems requiring high-speed serial processing don't seem well suited to the GPU. So emulating a GameBoy may be out of scope. (But emulating a cluster of low-power machines might be considered.)
I would think a project dealing with a game architecture that targets multi-core CPUs and GPUs would be interesting. I think this is still an area where a lot of work is being done. In order to take advantage of current and future computer hardware, new game architectures are going to be needed. I went to GDC 2008 and there were some talks related to this. Gamebryo had an interesting approach where they create threads for processing computations. You can designate the number of cores you want to use, so that you don't starve out other libraries that might be multi-core. I imagine the computations could be targeted at GPUs as well.
Other approaches included targeting different systems at different cores so that computations could be done in parallel. For instance, the first split one talk suggested was to put the renderer on its own core and the rest of the game on another. There are other, more complex techniques, but it all basically boils down to how you get the data around to the different cores.