Basic skills to work as an optimiser in the gaming industry [closed] - optimization

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
I'm curious about a certain job title, that of "senior developer with a specialty in optimisation." It's not the actual title but that's essentially what it would be. What would this mean in the gaming industry in terms of knowledge and skills? I would assume basic stuff like
B-trees
Path finding
Algorithmic analysis
Memory management
Threading (and related topics like thread safety, atomicity, etc)
But this is only me conjecturing. What would be the real-life (and academic) basic knowledge required for such a job?

I interviewed for such a position a few years ago at one of the Big North American game studios.
The job required a lot of deep pipeline assembly programming, arithmetic optimization algorithms (think Duff's Device, branchless ifs), compile-time computation (SWAR), meta-template programming, computation of many values at once in parallel in very large registers (I forget the name for that)... You'll need to be solid on operating system fundementals, low level system operations, linear algebra, and C++ especially templates. You'll also become very familiar with the peculiar architecture of the PlayStation3, and probably be involved in developing libraries for that environment that the company's game teams will build on top of.

Generally I concur with Ether's post; this will typically be more about low-level optimisation than algorithmic stuff. Knowing good algorithms comes in handy, but there are many cases in games where you prefer the O(N) solution over the O(logN) solution because the first is far friendlier on the cache and requires less memory management. So you need a more holistic knowledge.
Perhaps on a more general level, the job may want to know if you can do some or all of the following:
use a CPU profiler (eg. VTune, CodeAnalyst) in both sampling and call graph mode;
use graphical profilers (eg. Microsoft Pix, NVPerfHud)
write your own profiling/timer code and generate useful output with it;
rewrite functions to remove dynamic memory allocations;
reorganise and reduce data to be more cache-friendly;
reorganise data to make it more SIMD-friendly;
edit graphics shaders to use fewer and cheaper instructions;
...and more, I'm sure.

This is a lot like my job actually. Real-life knowledge that would be practical for this:
Experience in using profilers of all kinds to locate bottlenecks.
Experience and skill in determining the reason those bottlenecks exist.
Good understanding of CPU caches, virtual memory, and common bottlenecks such as load-hit-store penalties, L2 misses, floating point code, etc.
Good understanding of multithreading and both lockless and locking solutions.
Good understanding of HLSL and graphics programming, including linear algebra.
Good understanding of SIMD techniques and the specific SIMD interfaces on relevant hardware (paired singles, VMX, SSE/MMX).
Good understanding of the assembly language used on relevant hardware. If writing assembly, a good understanding of instruction pairing, branch prediction, delay slots (if applicable), and any and all applicable stalls on the target platform.
Good understanding of the compilation and linking process, binary formats used on the target hardware, and tools to manipulate all of the above (including available compiler flags and optimizations).
Every once in a while people ask how to become good at low-level optimization. There are a few good sources of info, mostly proprietary, but I think it generally comes down to experience.

This is one of those "if you got it you know it" type of things. It's hard to list out specifics, and some studios will have different criteria than others.
To put it simply, the 'Senior Developer' part means you've been around the block; you have multiple years of experience in which you've excelled and have shipped games. You should have a working knowledge of a wide range of topics, with things such as memory management high up the list.
"Specialty in Optimization" essentially means that you know how to make a game run faster. You've already spent a significant amount of time successfully optimizing games which have shipped. You should have a wide knowledge of algorithms, 3d rendering (a lot of time is spent rendering), cpu intrinsics, memory management, and others. You should also typically have an in depth knowledge of the hardware you'd be working on (optimizing PS3 can be substantially different than optimizing for PC).
This is at best a starting point for understanding. The key is having significant real world experience in the topic; at a senior level it should preferably be from working on titles that have shipped.

Related

VHDL optimization tips [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 7 years ago.
Improve this question
I am quite new in VHDL, and by using different IP cores (by different providers) can see that sometimes they differ massively as per the space that they occupy or timing constraints.
I was wondering if there are rules of thumb for optimization in VHDL (like there are in C, for example; unroll your loops, etc.).
Is it related to the synthesis tools I am using (like the different compilers are using other methods of optimization in C, so you need to learn to read the feedback asm files they return), or is it dependent on my coding skills?
Is it related to the synthesis tools I am using (like the different compilers are using other methods of optimization in C, so you need to learn to read the feedback asm files they return), or is it dependent on my coding skills?
The answer is "yes." When you are coding in HDL, you are actually describing hardware (go figure). Instead of the code being converted into machine code (like it would with C) it is synthesized to logical functions (AND, NOT, OR, XOR, etc) and memory elements (RAM, ROM, FF...).
VHDL can be used in many different ways. You can use VHDL in a purely structural sense where at the base level you are calling our primitives of the underlying technology that you are targeting. For example, you literally instantiate every AND, OR, NOT, and Flip Flop in your design. While this can give you a lot of control, it is not an efficient use of time in 99% of cases.
You can also implement hardware using behavioral constructs with VHDL. Instead of explicitly calling out each logic element, you describe a function to be implemented. For example, if this, do this, otherwise, do something else. You can describe state machines, math operations, and memories all in a behavioral sense. There are huge advantages to describing hardware in a behavioral sense:
Easier for humans to understand
Easier for humans to maintain
More portable between synthesis tools and target hardware
When using behavioral constructs, knowing your synthesis tool and your target hardware can help in understanding how what you write will actually be implemented. For example, if you describe a memory element with a asynchronous reset the implementation in hardware will be different for architectures with a dedicated asynchronous reset input to the memory element and one without.
Synthesis tools will generally publish in their reference manual or user guide a list of suggested HDL constructs to use in order to obtain some desired implementation result. For basic cases, they will be what you would expect. For more elaborate behavior models (e.g. a dual port RAM) there may be some form that you need to follow for the tool to "recognize" what you are describing.
In summary, for best use of your target device:
Know the device you are targeting. How are the programmable elements laid out? How many inputs and outputs are there from lookup tables? Read the device user manual to find out.
Know your synthesis engine. What types of behavioral constructs will be recognized and how will they be implemented? Read the synthesis tool user guide or reference manual to find out. Additionally, experiment by synthesizing small constructs to see how it gets implemented (via RTL or technology viewer, if available).
Know VHDL. Understand the differences between signals and variables. Be able to recognize statements that will generate many levels of logic in your design.
I was wondering if there are rules of thumb for optimization in VHDL
Now that you know the hardware, synthesis tool, and VHDL... Assuming you want to design for maximum performance, the following concepts should be adhered to:
Pipeline, pipeline, pipeline. The more levels of logic you have between synchronous elements, the more difficulty you are going to have making your timing constraint/goal.
Pipeline some more. Having additional stages of registers can provide additional wiggle-room in the future if you need to add more processing steps to your algorithm without affecting the overall latency/timeline.
Be careful when operating on the boundaries of the normal fabric. For example, if interfacing with an IO pin, dedicated multiplies, or other special hardware, you will take more significant timing hits. Additional memory elements should be placed here to avoid critical paths forming.
Review your synthesis and implementation reports frequently. You can learn a lot from reviewing these frequently. For example, if you add a new feature, and your timing takes a hit, you just introduced a critical path. Why? How can you alleviate this issue?
Take care with your "global" structures -- such as resets. Logic that must be widely distributed in your design deserves special care, since it needs to reach across your whole device. You may need special pipeline stages, or timing constraints on this type of logic. If at all possible, avoid "global" structures, unless truly a requirement.
While synthesis tools have design goals to focus on area, speed or power, the designer's choices and skills is the major contributor for the quality of the output. A designer should have a goal to maximize speed or minimize area and it will greatly influence his choices. A design optimized for speed can be made smaller by asking the tool to reduce the area, but not nearly as much as the same design thought for area in the first place.
However, it is more complicated than that. IP cores often target several FPGA technologies as well as ASIC. This can only be achieved by using general VHDL constructs, (or re-writing the code for each target, which non-critical IP providers don't do). FPGA and ASIC vendor have primitives that will improve speed/area when used, but if you write code to use a primitive for a technology, it doesn't mean that the resulting code will be optimized if you change the technology. Both Xilinx and Altera have DSP blocks to speed multiplication and whatnot, but they don't work exactly the same and writing code that uses the full potential of both is very challenging.
Synthesis tools are notorious for doing exactly what you ask them to, even if a more optimized solution is simple, for example:
a <= (x + y) + z; -- This is a 2 cascaded 2-input adder
b <= x + y + z; -- This is a 3-input adder
Will likely lead a different path from xyz to b/c. In the end, the designer need to know what he wants, and he has to verify that the synthesis tool understands his intent.

Will an FPGA be useful here?

I'm writing some code for a piece of network middleware. Right now, our code is running too slowly.
We've already done one round of rewrites and optimizations, but we seem to be running into hard limitations on what can be done in software.
The slowness of the code stems from one subroutine - an esoteric sampling algorithm from computational statistics. Because the math involved is somewhat similar to the stuff done in DSP, I'm wondering if we can use an FPGA to accelerate the computation.
My question is basically in the title - how do I tell whether an FPGA (or even ASIC) will give a useful speedup in my use case?
EDIT: A 'useful' speedup is one that is significant enough to justify the cost and dev time required to build the FPGA.
The short answer is to have an experienced FPGA engineer look at the algorithm and tell you what it would take in development time and bill of material cost for a solution.
Without knowing the details of your algorithm it is harder to guess. How parallelizable is the problem? Or can it be heavily pipelined? How many multiply/accumulate/dsp operations are required? Can you approximate some calculations with a big look up table or other FPGA dsp techniques (CORDIC). FPGAs can do many, many more of these operations in parallel (every clock cycle) 100s or even 1000s depending on how much you are willing to spend on an FPGA. Without knowing the details and having an experienced FPGA/DSP engineer look at the problem it is going to be hard to get a real feel though.
Some other options are:
look into DSP chips (TI dsp chips are one example).
Does your processor have SIMD instructions available that are not being used?

What's the difference between code written for a desktop machine and a supercomputer?

Hypothetically speaking, if my scientific work was leading toward the development of functions/modules/subroutines (on a desktop), what would I need to know to incorporate it into a large-scale simulation to be run on a supercomputer (which might simulate molecules, fluids, reactions, and so on)?
My impression is that it has to do with taking advantage of certain libraries (e.g., BLAS, LAPLACK) where possible, revising algorithms (reducing iteration), profiling, parallelizing, considering memory-hard disk-processor use/access... I am aware of the adage, "want to optimize your code? don't do it", but if one were interested in learning about writing efficient code, what references might be available?
I think this question is language agnostic, but since many number-crunching packages for biomolecular simulation, climate modeling, etc. are written in some version of Fortran, this language would probably be my target of interest (and I have programmed rather extensively in Fortran 77).
Profiling is a must at any level of machinery. In common usage, I've found that scaling to larger and larger grids requires a better understanding of the grid software and the topology of the grid. In that sense, everything you learn about optimizing for one machine is still applicable, but understanding the grid software gets you additional mileage. Hadoop is one of the most popular and widespread grid systems, so learning about the scheduler options, interfaces (APIs and web interfaces), and other aspects of usage will help. Although you may not use Hadoop for a given supercomputer, it is one of the less painful methods for learning about distributed computing. For parallel computing, you may pursue MPI and other systems.
Additionally, learning to parallelize code on a single machine, across multiple cores or processors, is something you can begin learning on a desktop machine.
Recommendations:
Learn to optimize code on a single machine:
Learn profiling
Learn to use optimized libraries (after profiling: so that you see the speedup)
Be sure you know algorithms and data structures very well (*)
Learn to do embarrassingly parallel programming on multiple core machines.
Later: consider multithreaded programming. It's harder and may not pay off for your problem.
Learn about basic grid software for distributed processing
Learn about tools for parallel processing on a grid
Learn to program for alternative hardware, e.g. GPUs, various specialized computing systems.
This is language agnostic. I have had to learn the same sequence in multiple languages and multiple HPC systems. At each step, take a simpler route to learn some of the infrastructure and tools; e.g. learn multicore before multithreaded, distributed before parallel, so that you can see what fits for the hardware and problem, and what doesn't.
Some of the steps may be reordered depending on local computing practices, established codebases, and mentors. If you have a large GPU or MPI library in place, then, by all means, learn that rather than foist Hadoop onto your collaborators.
(*) The reason to know algorithms very well is that as soon as your code is running on a grid, others will see it. When it is hogging up the system, they will want to know what you're doing. If you are running a process that is polynomial and should be constant, you may find yourself mocked. Others with more domain expertise may help you find good approximations for NP-hard problems, but you should know that the concept exists.
Parallelization would be the key.
Since the problems you cited (e.g. CFD, multiphysics, mass transfer) are generally expressed as large-scale linear algebra problems, you need matrix routines that parallelize well. MPI is a standard for those types of problems.
Physics can influence as well. For example, it's possible to solve some elliptical problems efficiently using explicit dynamics and artificial mass and damping matricies.
3D multiphysics means coupled differential equations with varying time scales. You'll want a fine mesh to resolve details in both space and time, so the number of degrees of freedom will rise rapidly; time steps will be governed by the stability requirements of your problem.
If someone ever figures out how to run linear algebra as a map-reduce problem they'll have it knocked.
Hypothetically speaking, if my scientific work was leading toward the development of functions/modules/subroutines (on a desktop), what would I need to know to incorporate it into a large-scale simulation to be run on a supercomputer (which might simulate molecules, fluids, reactions, and so on)?
First, you would need to understand the problem. Not all problems can be solved in parallel (and I'm using the term parallel in as wide meaning as it can get). So, see how the problem is now solved. Can it be solved with some other method quicker. Can it be divided in independent parts ... and so on ...
Fortran is the language specialized for scientific computing, and during the recent years, along with the development of new language features, there has also been some very interesting development in terms of features that are aiming for this "market". The term "co-arrays" could be an interesting read.
But for now, I would suggest reading first into a book like Using OpenMP - OpenMP is a simpler model but the book (fortran examples inside) explains nicely the fundamentals. Message parsing interface (for friends, MPI :) is a larger model, and one of often used. Your next step from OpenMP should probably go in this direction. Books on the MPI programming are not rare.
You mentioned also libraries - yes, some of those you mentioned are widely used. Others are also available. A person who does not know exactly where the problem in performance lies should IMHO never try to undertake the task of trying to rewrite library routines.
Also there are books on parallel algorithms, you might want to check out.
I think this question is language agnostic, but since many number-crunching packages for biomolecular simulation, climate modeling, etc. are written in some version of Fortran, this language would probably be my target of interest (and I have programmed rather extensively in Fortran 77).
In short it comes down to understanding the problem, learning where the problem in performance is, re-solving the whole problem again with a different approach, iterating a few times, and by that time you'll already know what you're doing and where you're stuck.
We're in a position similar to yours.
I'm most in agreement with #Iterator's answer, but I think there's more to say.
First of all, I believe in "profiling" by the random-pausing method, because I'm not really interested in measuring things (It's easy enough to do that) but in pinpointing code that is causing time waste, so I can fix it. It's like the difference between a floodlight and a laser.
For one example, we use LAPACK and BLAS. Now, in taking my stack samples, I saw a lot of the samples were in the routine that compares characters. This was called from a general routine that multiplies and scales matrices, and that was called from our code. The matrix-manipulating routine, in order to be flexible, has character arguments that tell it things like, if a matrix is lower-triangular or whatever. In fact, if the matrices are not very large, the routine can spend more than 50% of its time just classifying the problem. Of course, the next time it is called from the same place, it does the same thing all over again. In a case like that, a special routine should be written. When it is optimized by the compiler, it will be as fast as it reasonably can be, and will save all that classifying time.
For another example, we use a variety of ODE solvers. These are optimized to the nth degree of course. They work by calling user-provided routines to calculate derivatives and possibly a jacobian matrix. If those user-provided routines don't actually do much, samples will indeed show the program counter in the ODE solver itself. However, if the user-provided routines do much more, samples will find the lower end of the stack in those routines mostly, because they take longer, while the ODE code takes roughly the same time. So, optimization should be concentrated in the user-provided routines, not the ODE code.
Once you've done several of the kind of optimization that is pinpointed by stack sampling, which can speed things up by 1-2 orders of magnitude, then by all means exploit parallelism, MPI, etc. if the problem allows it.

Can you program a pure GPU game?

I'm a CS master student, and next semester I will have to start working on my thesis. I've had trouble coming up with a thesis idea, but I decided it will be related to Computer Graphics as I'm passionate about game development and wish to work as a professional game programmer one day.
Unfortunately I'm kinda new to the field of 3D Computer Graphics, I took an undergraduate course on the subject and hope to take an advanced course next semester, and I'm already reading a variety of books and articles to learn more. Still, my supervisor thinks its better if I come up with a general thesis idea now and then spend time learning about it in preparation for doing my thesis proposal. My supervisor has supplied me with some good ideas but I'd rather do something more interesting on my own, which hopefully has to do with games and gives me more opportunities to learn more about the field. I don't care if it's already been done, for me the thesis is more of an opportunity to learn about things in depth and to do substantial work on my own.
I don't know much about GPU programming and I'm still learning about shaders and languages like CUDA. One idea I had is to program an entire game (or as much as possible) on the GPU, including all the game logic, AI, and tests. This is inspired by reading papers on GPGPU and questions like this one I don't know how feasible that is with my knowledge, and my supervisor doesn't know a lot about recent GPUs. I'm sure with time I will be able to answer this question on my own, but it'd be handy if I could know the answer in advance so I could also consider other ideas.
So, if you've got this far, my question: Using only shaders or something like CUDA, can you make a full, simple 3D game that exploits the raw power and parallelism of GPUs? Or am I missing some limitation or difference between GPUs and CPUs that will always make a large portion of my code bound to CPU? I've read about physics engines running on the GPU, so why not everything else?
DISCLAIMER: I've done a PhD, but have never supervised a student of my own, so take all of what I'm about to say with a grain of salt!
I think trying to force as much of a game as possible onto a GPU is a great way to start off your project, but eventually the point of your work should be: "There's this thing that's an important part of many games, but in it's present state doesn't fit well on a GPU: here is how I modified it so it would fit well".
For instance, fortran mentioned that AI algorithms are a problem because they tend to rely on recursion. True, but, this is not necessarily a deal-breaker: the art of converting recursive algorithms into an iterative form is looked upon favorably by the academic community, and would form a nice center-piece for your thesis.
However, as a masters student, you haven't got much time so you would really need to identify the kernel of interest very quickly. I would not bother trying to get the whole game to actually fit onto the GPU as part of the outcome of your masters: I would treat it as an exercise just to see which part won't fit, and then focus on that part alone.
But be careful with your choice of supervisor. If your supervisor doesn't have any relevant experience, you should pick someone else who does.
I'm still waiting for a Gameboy Emulator that runs entirely on the GPU, which is just fed the game ROM itself and current user input and results in a texture displaying the game - maybe a second texture for sound output :)
The main problem is that you can't access persistent storage, user input or audio output from a GPU. These parts have to be on the CPU, by definition (even though cards with HDMI have audio output, but I think you can't control it from the GPU). Apart from that, you can already push large parts of the game code into the GPU, but I think it's not enough for a 3D game, since someone has to feed the 3D data into the GPU and tell it which shaders should apply to which part. You can't really randomly access data on the GPU or run arbitrary code, someone has to do the setup.
Some time ago, you would just setup a texture with the source data, a render target for the result data, and a pixel shader that would do the transformation. Then you rendered a quad with the shader to the render target, which would perform the calculations, and then read the texture back (or use it for further rendering). Today, things have been made simpler by the fourth and fifth generation of shaders (Shader Model 4.0 and whatever is in DirectX 11), so you can have larger shaders and access memory more easily. But still they have to be setup from the outside, and I don't know how things are today regarding keeping data between frames. In worst case, the CPU has to read back from the GPU and push again to retain game data, which is always a slow thing to do. But if you can really get to a point where a single generic setup/rendering cycle would be sufficient for your game to run, you could say that the game runs on the GPU. The code would be quite different from normal game code, though. Most of the performance of GPUs comes from the fact that they execute the same program in hundreds or even thousands of parallel shading units, and you can't just write a shader that can draw an image to a certain position. A pixel shader always runs, by definition, on one pixel, and the other shaders can do things on arbitrary coordinates, but they don't deal with pixels. It won't be easy, I guess.
I'd suggest just trying out the points I said. The most important is retaining state between frames, in my opinion, because if you can't retain all data, all is impossible.
First, Im not a computer engineer so my assumptions cannot even be a grain of salt, maybe nano scale.
Artificial intelligence? No problem.There are countless neural network examples running in parallel in google. Example: http://www.heatonresearch.com/encog
Pathfinding? You just try some parallel pathfinding algorithms that are already on internet. Just one of them: https://graphics.tudelft.nl/Publications-new/2012/BB12a/BB12a.pdf
Drawing? Use interoperability of dx or gl with cuda or cl so drawing doesnt cross pci-e lane. Can even do raytracing at corners so no z-fighting anymore, even going pure raytraced screen is doable with mainstream gpu using a low depth limit.
Physics? The easiest part, just iterate a simple Euler or Verlet integration and frequently stability checks if order of error is big.
Map/terrain generation? You just need a Mersenne-twister and a triangulator.
Save game? Sure, you can compress the data parallelly before writing to a buffer. Then a scheduler writes that data piece by piece to HDD through DMA so no lag.
Recursion? Write your own stack algorithm using main vram, not local memory so other kernels can run in wavefronts and GPU occupation is better.
Too much integer needed? You can cast to a float then do 50-100 calcs using all cores then cast the result back to integer.
Too much branching? Compute both cases if they are simple, so every core is in line and finish in sync. If not, then you can just put a branch predictor of yourself so the next time, it predicts better than the hardware(could it be?) with your own genuine algorithm.
Too much memory needed? You can add another GPU to system and open DMA channel or a CF/SLI for faster communication.
Hardest part in my opinion is the object oriented design since it is very weird and hardware dependent to build pseudo objects in gpu. Objects should be represented in host(cpu) memory but they must be separated over many arrays in gpu to be efficient. Example objects in host memory: orc1xy_orc2xy_orc3xy. Example objects in gpu memory: orc1_x__orc2_x__ ... orc1_y__orc2_y__ ...
The answer has already been chosen 6 years ago but for those interested to the actual question, Shadertoy, a live-coding WebGL platform, recently added the "multipass" feature allowing preservation of state.
Here's a live demo of the Bricks game running on Gpu.
I don't care if it's already been
done, for me the thesis is more of an
opportunity to learn about things in
depth and to do substantial work on my
own.
Then your idea of what a thesis is is completely wrong. A thesis must be an original research. --> edit: I was thinking about a PhD thesis, not a master thesis ^_^
About your question, the GPU's instruction sets and capabilities are very specific to vector floating point operations. The game logic usually does little floating point, and much logic (branches and decision trees).
If you take a look to the CUDA wikipedia page you will see:
It uses a recursion-free,
function-pointer-free subset of the C
language
So forget about implementing any AI algorithms there, that are essentially recursive (like A* for pathfinding). Maybe you could simulate the recursion with stacks, but if it's not allowed explicitly it should be for a reason. Not having function pointers also limits somewhat the ability to use dispatch tables for handling the different actions depending on state of the game (you could use again chained if-else constructions, but something smells bad there).
Those limitations in the language reflect that the underlying HW is mostly thought to do streaming processing tasks. Of course there are workarounds (stacks, chained if-else), and you could theoretically implement almost any algorithm there, but they will probably make the performance suck a lot.
The other point is about handling the IO, as already mentioned up there, this is a task for the main CPU (because it is the one that executes the OS).
It is viable to do a masters thesis on a subject and with tools that you are, when you begin, unfamiliar. However, its a big chance to take!
Of course a masters thesis should be fun. But ultimately, its imperative that you pass with distinction and that might mean tackling a difficult subject that you have already mastered.
Equally important is your supervisor. Its imperative that you tackle some problem they show an interest in - that they are themselves familiar with - so that they can become interested in helping you get a great grade.
You've had lots of hobby time for scratching itches, you'll have lots more hobby time in the future too no doubt. But master thesis time is not the time for hobbies unfortunately.
Whilst GPUs today have got some immense computational power, they are, regardless of things like CUDA and OpenCL limited to a restricted set of uses, whereas the CPU is more suited towards computing general things, with extensions like SSE to speed up specific common tasks. If I'm not mistaken, some GPUs have the inability to do a division of two floating point integers in hardware. Certainly things have improved greatly compared to 5 years ago.
It'd be impossible to develop a game to run entirely in a GPU - it would need the CPU at some stage to execute something, however making a GPU perform more than just the graphics (and physics even) of a game would certainly be interesting, with the catch that game developers for PC have the biggest issue of having to contend with a variety of machine specification, and thus have to restrict themselves to incorporating backwards compatibility, complicating things. The architecture of a system will be a crucial issue - for example the Playstation 3 has the ability to do multi gigabytes a second of throughput between the CPU and RAM, GPU and Video RAM, however the CPU accessing GPU memory peaks out just past 12MiB/s.
The approach you may be looking for is called "GPGPU" for "General Purpose GPU". Good starting points may be:
http://en.wikipedia.org/wiki/GPGPU
http://gpgpu.org/
Rumors about spectacular successes in this approach have been around for a few years now, but I suspect that this will become everyday practice in a few years (unless CPU architectures change a lot, and make it obsolete).
The key here is parallelism: if you have a problem where you need a large number of parallel processing units. Thus, maybe neural networks or genetic algorithms may be a good range of problems to attack with the power of a GPU. Maybe also looking for vulnerabilities in cryptographic hashes (cracking the DES on a GPU would make a nice thesis, I imagine :)). But problems requiring high-speed serial processing don't seem so much suited for the GPU. So emulating a GameBoy may be out of scope. (But emulating a cluster of low-power machines might be considered.)
I would think a project dealing with a game architecture that targets multiple core CPUs and GPUs would be interesting. I think this is still an area where a lot of work is being done. In order to take advantage of current and future computer hardware, new game architectures are going to be needed. I went to GDC 2008 and there were ome talks related to this. Gamebryo had an interesting approach where they create threads for processing computations. You can designate the number of cores you want to use so that if you don't starve out other libraries that might be multi-core. I imagine the computations could be targeted to GPUs as well.
Other approaches included targeting different systems for different cores so that computations could be done in parallel. For instance, the first split a talk suggested was to put the renderer on its own core and the rest of the game on another. There are other more complex techniques but it all basically boils down to how do you get the data around to the different cores.

Given this expectations, what language or system would you choose to implement the solution? [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 9 years ago.
Here are the estimates the system should handle:
3000+ end users
150+ offices around the world
1500+ concurrent users at peak times
10.000+ daily updates
4-5 commits per second
50-70 transactions per second (reads/searches/updates)
This will be internal only business application, dedicated to help shipping company with worldwide shipment management.
What would be your technology choice, why that choice and roughly how long would it take to implement it? Thanks.
Note: I'm not recruiting. :-)
So, you asked how I would tackle such a project. In the Smalltalk world, people seem to agree that Gemstone makes things scale somewhat magically.
So, what I'd really do is this: I'd start developing in a simple Squeak image, using SandstoneDB. Then, this moment would come where a single image begins being too slow.
GemStone then takes care of copying your public objects (those visible from a certain root) back and forth between all instances. You get sessions and enhanced query functionalities, plus quite a fast VM.
It shares data with C, Java and Ruby.
In fact, they have their own VM for ruby, which is also worth a look.
wikipedia manages much more demanding requirements with MySQL
Your volumes are significant but not likely to strain any credible RDBMS if programmed efficiently. If your team is sloppy (i.e., casually putting SQL queries directly into components which are then composed into larger components), you face the likelihood of a "multiplier" effect where one logical requirement (get the data necessary for this page) turns into a high number of physical database queries.
So, rather than focussing on the capacity of your RDBMS, you should focus on the capacity of your programmers and the degree to which your implementation language and environment facilitate profiling and refactoring.
The scenario you propose is clearly a 24x7x365 one, too, so you should also consider the need for monitoring / dashboard requirements.
There's no way to estimate development effort based on the needs you've presented; it's great that you've analyzed your transactions to this level of granularity, but the main determinant of development effort will be the domain and UI requirements.
Choose the technology your developers know and are familiar with. All major technologies out there will handle such requirements with ease.
Your daily update numbers vs commits do not add up. Four commits per second = 14,400 per hour.
You did not mention anything about expected database size.
In any case, I would concentrate my efforts on choosing a robust back end like Oracle, Sybase, MS etc. This choice will make the most difference in performance. The front end could either be a desktop app or WEB app depending on needs. Since this will be used in many offices around the world, a WEB app might make the most sense.
I'd go with MySQL or PostgreSQL. Not likely to have problems with either one for your requirements.
I love object-databases. In terms of commits-per-second and database-roundtrip, no relational database can hold up. Check out db4o. It's dead easy to learn, check out the examples!
As for the programming language and UI framework: Well, take what your team is good at. Dynamic languages with fewer meta-time wasting will probably save time.
There is not enough information provided here to give a proper recommendation. A little more due diligence is in order.
What is the IT culture like? Do they prefer lots of little servers or fewer bigger servers or big iron? What is their position on virtualization?
What is the corporate culture like? What is the political climate like? The open source offerings may very well handle the load but you may need to go with a proprietary vendor just because they are already used to navigating the political winds of a large company. Perception is important.
What is the maturity level of the organization? Do they already have an Enterprise Architecture team in place? Do they even know what EA is?
You've described the operational side but what about the analytical side? What OLAP technology are they expecting to use or already have in place?
Speaking of integration, what other systems will you need to integrate with?