What's the point of "make move" and "undo move" in chessengines?

What's the point of "make move" and "undo move" in chessengines? - chess

I want to experiment with massive parallel chess computing. From what I saw and understood in wikis and source code of some engines is that in most (all?) implementations of the min-max(negamax, alpha-beta, ...)-algorithm there is one internal position that gets updated for every new branch and then gets undone after receiving the evaluation of that branch.
What are the benefits of that technic compared to just generating a new position-object and passing that to the next branch?
This is what I have done in my previous engines and I believe this method is superior for the purpose of parallelism.

The answer to your question depends heavily on a few factors, such as how you are storing the chess board, and what your make/unmake move function looks like. I'm sure there are ways of storing the board that would be better suited towards your method, and indeed historically some top tiered chess engine (in particular crafty) used to use that method, but you are correct in saying that modern engines no longer do it that way. Here is a link to a discussion about this very point:
http://www.talkchess.com/forum/viewtopic.php?t=50805
If you want to understand why this is, then you must understand how today's engines represent the board. The standard implementation revolves around bitboards, and that requires 12- 64 bit integers per position, in addition to a redundant mailbox (array in non computer chess jargon) used in conjunction. Copying all that is usually more expensive than a good makeMove/unMakeMove function.
I also want to point out that a hybrid approach is also common. Usually make and unmake is used on the board itself, where as other information like en passant squares and castle rights are copied and changed like you suggested.

Interestingly, for board representations of < ~300 bytes, it IS cheaper to copy the board on each move (on modern x86), as opposed to making and unmaking the move.
As you suggest the immutable properties of copying the board on each move are from a programming perspective, including parallelising, very attractive.
My board rep is 208 bytes. C++ compiled with g++ 7.4.0.
Empricism shows that board copy is 20% faster than move make/unmake. I presume the code is using 32/64 byte-wide AVX instructions for the copy. In theory you can do 32/64 byte copy per cycle.
Just for you: https://github.com/rolandpj1968/oom-skaak/tree/init-impl-legal-move-gen-move-do-undo-2

Related

Synopsys: Repeated compiles produce different results. How to automate iterated compile?

I'm new to using Design Compiler. In the past, I've done mostly FPGA work. Right now, I'm using Synopsys to determine the minimum are necessary to represent some circuits (using the Nangate 45nm library). I'm not doing P&R right now; I'm just trying to determine transistor area.
My only optimization constraint is to minimize area. I've noticed that if I tell DC to compile more than one time in a row, it produces different (and usually smaller) results each time.
I've looked and looked and failed to see if this is mentioned in a manual or anywhere in any discussion. Is it meant to work this way?
This suggests that optimization is stopping earlier than it could, so it's not REALLY minimizing area. Any idea why?
Is there a way I can tell it to increase the effort and/or tell it to automatically iterate compiles so that it will converge on the smallest design?
I'm guessing that DC is expecting to meet timing constraints, but I've given it a purely combinatorial block and no timing constraint. Did they never consider the usage scenario when all you want to do is work out the minimum gate area for a combinatorial circuit?

On a pure combinatorial circuit you can use a set_max_delay constraint and DC will attempt to meet that.
For reduced area you can use -map_effort high or -map_effort ultra to get it to work harder.
DC is a funny beast, and the algorithms it uses change as processes advance and make certain activities more or less useful. A lot of pre-layout optimization is less useful since the whole situation can change once the gates are actually placed and routed.

I filed a support ticket with Synopsys. I was using a 2010 version of design compiler. Apparently, area optimization has been improved since then, and the 2014 version will minimize area in one compiler pass.

Usage of frame pointer optimizations

This is related but NOT the same as frame pointer omitting ? Any risk?
I am trying to follow this old (but still relevan article)
http://blogs.msdn.com/b/larryosterman/archive/2007/03/12/fpo.aspx
Larry (author writes)
machines got sufficiently faster since 1995 that the performance
improvements that were achieved by FPO weren't sufficient to counter
the pain in debugging and analysis that FPO caused
However in the discussion further down the page one user writes
Disabling FPO can have both serious code size and performance impact.
Tail call optimizations have to be disabled when a frame pointer is
present, leading to much greater stack usage in affected paths. Small
functions are also disproportionately affected by prolog/epilog code.
Third, although there are still six registers available with a frame
pointer on X86, only three of them are nonvolatile with respect to
nested calls: EBX, ESI, and EDI. Opening up a fourth register can drop
out a bunch of spill code.
I have a couple of question.
Spill code == Register spillage?
Is the author correct that FPO is generally considered a pain and
the gain doe not out-weigh the benefits.
Is FPO still relevant today in x64 architecture since there are a
LOT more registers o play with.
Do you use FPO? What for (if yes) and does it make a difference to
you?
Finally in this article
http://www.altdevblogaday.com/2012/05/24/x64-abi-intro-to-the-windows-x64-calling-convention/
The author says
[with repect to Windows x64 calling convention].....
All parameters have space reserved on the stack, even the ones passed in registers. In fact, there’s stack space for 4 parameters
even if your function doesn’t have any params. Those parameters are 8
bytes so that’s at least 32 bytes on the stack for every function
(every function actually has at least 48 bytes on the stack…I’ll
explain that another time). This stack area is called the home space.
There are few reasons behind this home space:
If the registers need to be used for something else, the called function can store the data in the home space without moving the stack
pointer.
It keeps the stack structure easy to determine. That’s very handy for debugging, and perhaps necessary for x64′s stack metadata (another
point I’ll come back to another time). ...... The compiler can use it
for whatever it wants, and an optimized build will likely make great
use of it.
Wouldn't an optimized build optimize the excess allocation away?

1.Spill code == Register spillage?
Almost. Stricly speaking, spill code is the code added by the compiler to implement a register spill. The spill itself is the decision to tag a live range as not able to be placed in a register.
2.Is the author correct that FPO is generally considered a pain and the gain doe not out-weigh the benefits.
The author is probably correct that in modern processor architectures, the kinds of functions where FPOs will generate a significant performance gain is a smaller set than in the past. Yet FPO's do make code smaller, reducing cache pressure. They do reduce register pressure. These can be important in some settings. They do speed up prolog and epilog code by a few instructions. The point about debuggers not working well without the FP is noteworthy. It means core dumps are less useful for post mortems on production-optimized code. You'd never want to use FPO during development except for final testing.
3.Is FPO still relevant today in x64 architecture since there are a LOT more registers o play with.
Modern processors are so various and complex that you just about never know what's "relevant" until you try it and measure.
4.Do you use FPO? What for (if yes) and does it make a difference to you?
I have written a medium-size C library (20K SLOC) where it made a small (~5%) difference in run time overall under gcc. This was a native language extension to a scripting language that had to compile under both gcc and Visual C. Using it would have split the build path. I decided 5% was not worth it for the purpose the extension served. But if it had been a dynamic fluid simulation to predict the weather, 5% could have been worth many millions of dollars. The decision would have been different.
5.Wouldn't an optimized build optimize the excess allocation away?
That's entirely up to the compiler and optimizer designer. It looks from the MS documentation here that MS has defined the ABI to require home space for all data even if it's whole lifetime is spent in a register.

1) When you need to use a register and don't have any unused ones, you need to write code to save some register value on the stack and later restore it.
2) FPO was a pain back when unwinding was primarily done by walking the stack. Nowadays standard unwind ABIs exist anyway (e.g. to enable exception handling), so the information already exists, and is organized more efficiently (away from the hot code), so there's no pain. Sure, there would be some pain if you wrote all your machine code by hand, but that's not the typical use case.
3) Typical x86_64 ABIs don't use frame pointers at all (except when absolutely necessary, like for variable-length arrays in C).
4) I'm not a compiler. My compiler doesn't generate frame pointers.
Optimize excess away) Not sure what your question is. The space consumption for the home area isn't a problem. The benefit of not having to adjust any stack pointers is a big advantage, since you need a lot less code. The same goes for the red zone just beyond the stack frame, which allows leaf code to use a lot of memory without ever needing any stack pointer gymnastics.

What's the difference between code written for a desktop machine and a supercomputer?

Hypothetically speaking, if my scientific work was leading toward the development of functions/modules/subroutines (on a desktop), what would I need to know to incorporate it into a large-scale simulation to be run on a supercomputer (which might simulate molecules, fluids, reactions, and so on)?
My impression is that it has to do with taking advantage of certain libraries (e.g., BLAS, LAPLACK) where possible, revising algorithms (reducing iteration), profiling, parallelizing, considering memory-hard disk-processor use/access... I am aware of the adage, "want to optimize your code? don't do it", but if one were interested in learning about writing efficient code, what references might be available?
I think this question is language agnostic, but since many number-crunching packages for biomolecular simulation, climate modeling, etc. are written in some version of Fortran, this language would probably be my target of interest (and I have programmed rather extensively in Fortran 77).

Profiling is a must at any level of machinery. In common usage, I've found that scaling to larger and larger grids requires a better understanding of the grid software and the topology of the grid. In that sense, everything you learn about optimizing for one machine is still applicable, but understanding the grid software gets you additional mileage. Hadoop is one of the most popular and widespread grid systems, so learning about the scheduler options, interfaces (APIs and web interfaces), and other aspects of usage will help. Although you may not use Hadoop for a given supercomputer, it is one of the less painful methods for learning about distributed computing. For parallel computing, you may pursue MPI and other systems.
Additionally, learning to parallelize code on a single machine, across multiple cores or processors, is something you can begin learning on a desktop machine.
Recommendations:
Learn to optimize code on a single machine:
Learn profiling
Learn to use optimized libraries (after profiling: so that you see the speedup)
Be sure you know algorithms and data structures very well (*)
Learn to do embarrassingly parallel programming on multiple core machines.
Later: consider multithreaded programming. It's harder and may not pay off for your problem.
Learn about basic grid software for distributed processing
Learn about tools for parallel processing on a grid
Learn to program for alternative hardware, e.g. GPUs, various specialized computing systems.
This is language agnostic. I have had to learn the same sequence in multiple languages and multiple HPC systems. At each step, take a simpler route to learn some of the infrastructure and tools; e.g. learn multicore before multithreaded, distributed before parallel, so that you can see what fits for the hardware and problem, and what doesn't.
Some of the steps may be reordered depending on local computing practices, established codebases, and mentors. If you have a large GPU or MPI library in place, then, by all means, learn that rather than foist Hadoop onto your collaborators.
(*) The reason to know algorithms very well is that as soon as your code is running on a grid, others will see it. When it is hogging up the system, they will want to know what you're doing. If you are running a process that is polynomial and should be constant, you may find yourself mocked. Others with more domain expertise may help you find good approximations for NP-hard problems, but you should know that the concept exists.

Parallelization would be the key.
Since the problems you cited (e.g. CFD, multiphysics, mass transfer) are generally expressed as large-scale linear algebra problems, you need matrix routines that parallelize well. MPI is a standard for those types of problems.
Physics can influence as well. For example, it's possible to solve some elliptical problems efficiently using explicit dynamics and artificial mass and damping matricies.
3D multiphysics means coupled differential equations with varying time scales. You'll want a fine mesh to resolve details in both space and time, so the number of degrees of freedom will rise rapidly; time steps will be governed by the stability requirements of your problem.
If someone ever figures out how to run linear algebra as a map-reduce problem they'll have it knocked.

Hypothetically speaking, if my scientific work was leading toward the development of functions/modules/subroutines (on a desktop), what would I need to know to incorporate it into a large-scale simulation to be run on a supercomputer (which might simulate molecules, fluids, reactions, and so on)?
First, you would need to understand the problem. Not all problems can be solved in parallel (and I'm using the term parallel in as wide meaning as it can get). So, see how the problem is now solved. Can it be solved with some other method quicker. Can it be divided in independent parts ... and so on ...
Fortran is the language specialized for scientific computing, and during the recent years, along with the development of new language features, there has also been some very interesting development in terms of features that are aiming for this "market". The term "co-arrays" could be an interesting read.
But for now, I would suggest reading first into a book like Using OpenMP - OpenMP is a simpler model but the book (fortran examples inside) explains nicely the fundamentals. Message parsing interface (for friends, MPI :) is a larger model, and one of often used. Your next step from OpenMP should probably go in this direction. Books on the MPI programming are not rare.
You mentioned also libraries - yes, some of those you mentioned are widely used. Others are also available. A person who does not know exactly where the problem in performance lies should IMHO never try to undertake the task of trying to rewrite library routines.
Also there are books on parallel algorithms, you might want to check out.
I think this question is language agnostic, but since many number-crunching packages for biomolecular simulation, climate modeling, etc. are written in some version of Fortran, this language would probably be my target of interest (and I have programmed rather extensively in Fortran 77).
In short it comes down to understanding the problem, learning where the problem in performance is, re-solving the whole problem again with a different approach, iterating a few times, and by that time you'll already know what you're doing and where you're stuck.

We're in a position similar to yours.
I'm most in agreement with #Iterator's answer, but I think there's more to say.
First of all, I believe in "profiling" by the random-pausing method, because I'm not really interested in measuring things (It's easy enough to do that) but in pinpointing code that is causing time waste, so I can fix it. It's like the difference between a floodlight and a laser.
For one example, we use LAPACK and BLAS. Now, in taking my stack samples, I saw a lot of the samples were in the routine that compares characters. This was called from a general routine that multiplies and scales matrices, and that was called from our code. The matrix-manipulating routine, in order to be flexible, has character arguments that tell it things like, if a matrix is lower-triangular or whatever. In fact, if the matrices are not very large, the routine can spend more than 50% of its time just classifying the problem. Of course, the next time it is called from the same place, it does the same thing all over again. In a case like that, a special routine should be written. When it is optimized by the compiler, it will be as fast as it reasonably can be, and will save all that classifying time.
For another example, we use a variety of ODE solvers. These are optimized to the nth degree of course. They work by calling user-provided routines to calculate derivatives and possibly a jacobian matrix. If those user-provided routines don't actually do much, samples will indeed show the program counter in the ODE solver itself. However, if the user-provided routines do much more, samples will find the lower end of the stack in those routines mostly, because they take longer, while the ODE code takes roughly the same time. So, optimization should be concentrated in the user-provided routines, not the ODE code.
Once you've done several of the kind of optimization that is pinpointed by stack sampling, which can speed things up by 1-2 orders of magnitude, then by all means exploit parallelism, MPI, etc. if the problem allows it.

How to make a 2D Soft-body physics engine?

The definition of rigid body in Box2d is
A chunk of matter that is so strong
that the distance between any two bits
of matter on the chunk is completely
constant.
And this is exactly what i don't want as i would like to make 2D (maybe 3D eventually), elastic, deformable, breakable, and even sticky bodies.
What I'm hoping to get out of this community are resources that teach me the math behind how objects bend, break and interact. I don't care about the molecular or chemical properties of these objects, and often this is all I find when I try to search for how to calculate what a piece of wood, metal, rubber, goo, liquid, organic material, etc. might look like after a force is applied to it.
Also, I'm a very visual person, so diagrams and such are EXTREMELY HELPFUL for me.
================================================================================
Ignore these questions, they're old, and I'm only keeping them here for contextual purposes
1.Are there any simple 2D soft-body physics engines out there like this?
preferably free or opensource?
2.If not would it be possible to make my own without spending years on it?
3.Could i use existing engines like bullet and box2d as a start and simply transform their code, or would this just lead to more problems later, considering my 1 year of programming experience and bullet being 3D?
4.Finally, if i were to transform another library, would it be the best change box2D's already 2d code, Bullet's already soft code, or mixing both's source code?
Thanks!

(1) Both Bullet and PhysX have support for deformable objects in some capacity. Bullet is open source and PhysX is free to use. They both have ports for windows, mac, linux and all the major consoles.
(2) You could hack something together if you really know what you are doing, and it might even work. However, there will probably be bugs unless you have a damn good understanding of how Box2D's sequential impulse constraint solver works and what types of measures are going to be necessary to keep your system stable. That said, there are many ways to get deformable objects working with minimal fuss within a game-like environment. The first option is to take a second (or higher) order approximation of the deformation. This lets you deal with deformations in much the same way as you deal with rigid motions, only now you have a few extra degrees of freedom. See for example the following paper:
http://www.matthiasmueller.info/publications/MeshlessDeformations_SIG05.pdf
A second method is pressure soft bodies, which basically model the body as a set of particles with some distance constraints and pressure forces. This is what both PhysX and Bullet do, and it is a pretty standard technique by now (for example, Gish used it):
http://citeseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.4.2828%26rep%3Drep1%26type%3Dpdf
If you google around, you can find lots of tutorials on implementing it, but I can't vouch for their quality. Finally, there has been a more recent push to trying to do deformable objects the `right' way using realistic elastic models and finite element type approaches. This is still an area of active research, so it is not for the faint of heart. For example, you could look at any number of the papers in this year's SIGGRAPH proceedings:
http://kesen.realtimerendering.com/sig2011.html
(3) Probably not. Though there are certain 2D style games that can work with a 3D physics engine (for example top down type games) for special effects.
(4) Based on what I just said, you should probably know the answer by now. If you are the adventurous sort and got some time to kill and the will to learn, then I say go for it! Of course it will be hard at first, but like anything it gets easier over time. Plus, learning new stuff is lots of fun!
On the other hand, if you just want results now, then don't do it. It will take a lot of time, and you will probably fail (a lot). If you just want to make games, then stick to the existing libraries and build on whatever abstractions it provides you.

Quick and partial answer:
rigid body are easy to model due to their property (you can use physic tools, like "Torseur+ (link on french on wikipedia, english equivalent points to screw theory) to modelate forces applying at any point in your element.
in comparison, non-solid elements move from almost solid (think very hard rubber : it can move but is almost solid) to almost liquid (think very soft ruber, latex). Meaning that dynamical properties applying to that knd of objects are much complex and depend of the nature of the object
Take the example of a spring : it's easy to model independantly (f=k.x), but creating a generic tool able to model that specific case is a nightmare (especially if you think of corner cases : extension is not infinite, compression reaches a lower point, material is non linear...)
as far as I know, when dealing with "elastic" materials, people do their own modelisation for their own purpose (a generic one does not exist)
now the answers:
Probably not, not that I know at least
not easily, see previously why
Unless you got high level background in elastic materials, I fear it's gonna be painful
Hope this helped

Some specific cases such as deformable balls can be simulated pretty well using spring-joint bodies:
Here is an implementation example with cocos2d: http://2sa-studio.blogspot.com/2014/05/soft-bodies-with-cocos2d-v3.html

Depending on the complexity of the deformable objects that you need, you might be able to emulate them using box2d, constraining rigid bodies with joints or springs. I did it in the past using a box2d clone for xna (farseer) and it worked nicely. Hope this helps.

The physics of your question breaks down into two different topics:
Inelastic Collisions: The math here is easy, and you could write a pretty decent library yourself without too much work for 2D points/balls. (And with more work, you could learn the physics for extended bodies.)
Materials Bending and Breaking: This will be hard. In general, you will have to model many of the topics in Mechanical Engineering:
Continuum Mechanics
Structural Analysis
Failure Analysis
Stress Analysis
Strain Analysis
I am not being glib. Modeling the bending and breaking of materials is, in general, a very detailed and varied topic. It will take a long time. And the only way to succeed will be to understand the science well enough that you can make clever shortcuts in limiting the scope of the science you need to model in your game.
However, the other half of your problem (modeling Inelastic Collisions) is a much more achievable goal.
Good luck!

Can you program a pure GPU game?

I'm a CS master student, and next semester I will have to start working on my thesis. I've had trouble coming up with a thesis idea, but I decided it will be related to Computer Graphics as I'm passionate about game development and wish to work as a professional game programmer one day.
Unfortunately I'm kinda new to the field of 3D Computer Graphics, I took an undergraduate course on the subject and hope to take an advanced course next semester, and I'm already reading a variety of books and articles to learn more. Still, my supervisor thinks its better if I come up with a general thesis idea now and then spend time learning about it in preparation for doing my thesis proposal. My supervisor has supplied me with some good ideas but I'd rather do something more interesting on my own, which hopefully has to do with games and gives me more opportunities to learn more about the field. I don't care if it's already been done, for me the thesis is more of an opportunity to learn about things in depth and to do substantial work on my own.
I don't know much about GPU programming and I'm still learning about shaders and languages like CUDA. One idea I had is to program an entire game (or as much as possible) on the GPU, including all the game logic, AI, and tests. This is inspired by reading papers on GPGPU and questions like this one I don't know how feasible that is with my knowledge, and my supervisor doesn't know a lot about recent GPUs. I'm sure with time I will be able to answer this question on my own, but it'd be handy if I could know the answer in advance so I could also consider other ideas.
So, if you've got this far, my question: Using only shaders or something like CUDA, can you make a full, simple 3D game that exploits the raw power and parallelism of GPUs? Or am I missing some limitation or difference between GPUs and CPUs that will always make a large portion of my code bound to CPU? I've read about physics engines running on the GPU, so why not everything else?

DISCLAIMER: I've done a PhD, but have never supervised a student of my own, so take all of what I'm about to say with a grain of salt!
I think trying to force as much of a game as possible onto a GPU is a great way to start off your project, but eventually the point of your work should be: "There's this thing that's an important part of many games, but in it's present state doesn't fit well on a GPU: here is how I modified it so it would fit well".
For instance, fortran mentioned that AI algorithms are a problem because they tend to rely on recursion. True, but, this is not necessarily a deal-breaker: the art of converting recursive algorithms into an iterative form is looked upon favorably by the academic community, and would form a nice center-piece for your thesis.
However, as a masters student, you haven't got much time so you would really need to identify the kernel of interest very quickly. I would not bother trying to get the whole game to actually fit onto the GPU as part of the outcome of your masters: I would treat it as an exercise just to see which part won't fit, and then focus on that part alone.
But be careful with your choice of supervisor. If your supervisor doesn't have any relevant experience, you should pick someone else who does.

I'm still waiting for a Gameboy Emulator that runs entirely on the GPU, which is just fed the game ROM itself and current user input and results in a texture displaying the game - maybe a second texture for sound output :)
The main problem is that you can't access persistent storage, user input or audio output from a GPU. These parts have to be on the CPU, by definition (even though cards with HDMI have audio output, but I think you can't control it from the GPU). Apart from that, you can already push large parts of the game code into the GPU, but I think it's not enough for a 3D game, since someone has to feed the 3D data into the GPU and tell it which shaders should apply to which part. You can't really randomly access data on the GPU or run arbitrary code, someone has to do the setup.
Some time ago, you would just setup a texture with the source data, a render target for the result data, and a pixel shader that would do the transformation. Then you rendered a quad with the shader to the render target, which would perform the calculations, and then read the texture back (or use it for further rendering). Today, things have been made simpler by the fourth and fifth generation of shaders (Shader Model 4.0 and whatever is in DirectX 11), so you can have larger shaders and access memory more easily. But still they have to be setup from the outside, and I don't know how things are today regarding keeping data between frames. In worst case, the CPU has to read back from the GPU and push again to retain game data, which is always a slow thing to do. But if you can really get to a point where a single generic setup/rendering cycle would be sufficient for your game to run, you could say that the game runs on the GPU. The code would be quite different from normal game code, though. Most of the performance of GPUs comes from the fact that they execute the same program in hundreds or even thousands of parallel shading units, and you can't just write a shader that can draw an image to a certain position. A pixel shader always runs, by definition, on one pixel, and the other shaders can do things on arbitrary coordinates, but they don't deal with pixels. It won't be easy, I guess.
I'd suggest just trying out the points I said. The most important is retaining state between frames, in my opinion, because if you can't retain all data, all is impossible.

First, Im not a computer engineer so my assumptions cannot even be a grain of salt, maybe nano scale.
Artificial intelligence? No problem.There are countless neural network examples running in parallel in google. Example: http://www.heatonresearch.com/encog
Pathfinding? You just try some parallel pathfinding algorithms that are already on internet. Just one of them: https://graphics.tudelft.nl/Publications-new/2012/BB12a/BB12a.pdf
Drawing? Use interoperability of dx or gl with cuda or cl so drawing doesnt cross pci-e lane. Can even do raytracing at corners so no z-fighting anymore, even going pure raytraced screen is doable with mainstream gpu using a low depth limit.
Physics? The easiest part, just iterate a simple Euler or Verlet integration and frequently stability checks if order of error is big.
Map/terrain generation? You just need a Mersenne-twister and a triangulator.
Save game? Sure, you can compress the data parallelly before writing to a buffer. Then a scheduler writes that data piece by piece to HDD through DMA so no lag.
Recursion? Write your own stack algorithm using main vram, not local memory so other kernels can run in wavefronts and GPU occupation is better.
Too much integer needed? You can cast to a float then do 50-100 calcs using all cores then cast the result back to integer.
Too much branching? Compute both cases if they are simple, so every core is in line and finish in sync. If not, then you can just put a branch predictor of yourself so the next time, it predicts better than the hardware(could it be?) with your own genuine algorithm.
Too much memory needed? You can add another GPU to system and open DMA channel or a CF/SLI for faster communication.
Hardest part in my opinion is the object oriented design since it is very weird and hardware dependent to build pseudo objects in gpu. Objects should be represented in host(cpu) memory but they must be separated over many arrays in gpu to be efficient. Example objects in host memory: orc1xy_orc2xy_orc3xy. Example objects in gpu memory: orc1_x__orc2_x__ ... orc1_y__orc2_y__ ...

The answer has already been chosen 6 years ago but for those interested to the actual question, Shadertoy, a live-coding WebGL platform, recently added the "multipass" feature allowing preservation of state.
Here's a live demo of the Bricks game running on Gpu.

I don't care if it's already been
done, for me the thesis is more of an
opportunity to learn about things in
depth and to do substantial work on my
own.
Then your idea of what a thesis is is completely wrong. A thesis must be an original research. --> edit: I was thinking about a PhD thesis, not a master thesis ^_^
About your question, the GPU's instruction sets and capabilities are very specific to vector floating point operations. The game logic usually does little floating point, and much logic (branches and decision trees).
If you take a look to the CUDA wikipedia page you will see:
It uses a recursion-free,
function-pointer-free subset of the C
language
So forget about implementing any AI algorithms there, that are essentially recursive (like A* for pathfinding). Maybe you could simulate the recursion with stacks, but if it's not allowed explicitly it should be for a reason. Not having function pointers also limits somewhat the ability to use dispatch tables for handling the different actions depending on state of the game (you could use again chained if-else constructions, but something smells bad there).
Those limitations in the language reflect that the underlying HW is mostly thought to do streaming processing tasks. Of course there are workarounds (stacks, chained if-else), and you could theoretically implement almost any algorithm there, but they will probably make the performance suck a lot.
The other point is about handling the IO, as already mentioned up there, this is a task for the main CPU (because it is the one that executes the OS).

It is viable to do a masters thesis on a subject and with tools that you are, when you begin, unfamiliar. However, its a big chance to take!
Of course a masters thesis should be fun. But ultimately, its imperative that you pass with distinction and that might mean tackling a difficult subject that you have already mastered.
Equally important is your supervisor. Its imperative that you tackle some problem they show an interest in - that they are themselves familiar with - so that they can become interested in helping you get a great grade.
You've had lots of hobby time for scratching itches, you'll have lots more hobby time in the future too no doubt. But master thesis time is not the time for hobbies unfortunately.

Whilst GPUs today have got some immense computational power, they are, regardless of things like CUDA and OpenCL limited to a restricted set of uses, whereas the CPU is more suited towards computing general things, with extensions like SSE to speed up specific common tasks. If I'm not mistaken, some GPUs have the inability to do a division of two floating point integers in hardware. Certainly things have improved greatly compared to 5 years ago.
It'd be impossible to develop a game to run entirely in a GPU - it would need the CPU at some stage to execute something, however making a GPU perform more than just the graphics (and physics even) of a game would certainly be interesting, with the catch that game developers for PC have the biggest issue of having to contend with a variety of machine specification, and thus have to restrict themselves to incorporating backwards compatibility, complicating things. The architecture of a system will be a crucial issue - for example the Playstation 3 has the ability to do multi gigabytes a second of throughput between the CPU and RAM, GPU and Video RAM, however the CPU accessing GPU memory peaks out just past 12MiB/s.

The approach you may be looking for is called "GPGPU" for "General Purpose GPU". Good starting points may be:
http://en.wikipedia.org/wiki/GPGPU
http://gpgpu.org/
Rumors about spectacular successes in this approach have been around for a few years now, but I suspect that this will become everyday practice in a few years (unless CPU architectures change a lot, and make it obsolete).
The key here is parallelism: if you have a problem where you need a large number of parallel processing units. Thus, maybe neural networks or genetic algorithms may be a good range of problems to attack with the power of a GPU. Maybe also looking for vulnerabilities in cryptographic hashes (cracking the DES on a GPU would make a nice thesis, I imagine :)). But problems requiring high-speed serial processing don't seem so much suited for the GPU. So emulating a GameBoy may be out of scope. (But emulating a cluster of low-power machines might be considered.)

I would think a project dealing with a game architecture that targets multiple core CPUs and GPUs would be interesting. I think this is still an area where a lot of work is being done. In order to take advantage of current and future computer hardware, new game architectures are going to be needed. I went to GDC 2008 and there were ome talks related to this. Gamebryo had an interesting approach where they create threads for processing computations. You can designate the number of cores you want to use so that if you don't starve out other libraries that might be multi-core. I imagine the computations could be targeted to GPUs as well.
Other approaches included targeting different systems for different cores so that computations could be done in parallel. For instance, the first split a talk suggested was to put the renderer on its own core and the rest of the game on another. There are other more complex techniques but it all basically boils down to how do you get the data around to the different cores.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas