How do GPU/PPU timings work on the Gameboy? - gpu

I’m writing a gameboy emulator and I’ve come to implementing the graphics. However I can’t quite figure out how it works with the cpu as far as timing/clock cycles go. Does the CPU execute a certain amount of cycles (if so how many) and then hand it of to the GPU? Or is the gameboy always in a hblank/vblank state and the GPU uses the CPU in between them? I can’t find any information that helps me with this, only how to use the control registers.

This has been answered at https://forums.nesdev.com/viewtopic.php?f=20&t=17754&p=225009#p225009
It turns out I had it completely wrong and they are completely different.
Here is the post:
The Game Boy CPU and PPU run in parallel. The 4.2 MHz master clock is
also the dot clock. It's divided by 2 to form the PPU's 2.1 MHz memory
access clock, and divided by 4 to form a multi-phase 1.05 MHz clock
used by the CPU.
Each scanline is 456 dots (114 CPU cycles) long and consists of mode 2
(OAM search), mode 3 (active picture), and mode 0 (horizontal
blanking). Mode 2 is 80 dots long (2 for each OAM entry), mode 3 is
about 168 plus about 10 more for each sprite on a given line, and mode
0 is the rest. After 144 scanlines are drawn are 10 lines of mode 1
(vertical blanking), for a total of 154 lines or 70224 dots per
screen. The CPU can't see VRAM (writes are ignored and reads are $FF)
during mode 3, but it can during other modes. The CPU can't see OAM
during modes 2 and 3, but it can during blanking modes (0 and 1).

The link gives more of a general answer instead of implementation specifics, so I want to give my 2 cents.
CPU usually is the main part of your emulator and what actually counts cycles. Each time your CPU does something for any amount of cycles you pass that amount of cycles to other components of your emulator so that they can synchronize themselves.
For example, some CPU instructions read and write memory as part of a single instruction. That means is would take Gameboy CPU 4 (read) + 4 (write) cycles to complete the instruction. So in emulator you do the read, pass 4 cycles to GPU, do the write, pass 4 cycles to GPU. You do the same for other components that run parallel to the CPU like timers and sound.
It's actually important to do it that way instead of emulating whole instruction and then synchronizing everything else. Don't know about real ROMs but there're test ROMs that verify this exact behavior. 8 cycles is a long time and in the middle of multiple memory accesses some other Gameboy component might make a change.

Related

Optaplanner - multithreading

I am using optaplanner 8.17.FINAL with Java 17.0.2 inside a kubernetes cluster, my server has 32 cores + hyper threading. My app scales to 14 pods and I use moveThreadCount = 4 . On a single run, everything works fine, but on a parallel run, the speed of the optaplanner drops. With 7 launches, the drop is insignificant, 5-10%. But with 14 launches, the speed drop is about 50%. Of course, you can say that there are not enough physical cores, but I'm not sure that hyperthreading works like that. In resource monitoring, I see that 60 logical cores are involved with 14 launches, but why then do the speed drop twice?
I'm tried to inscrease heap size and change garbage collector (G1GC, SerialGC, ParallelGC), but it has little effect
I am not an expert on hyperthreading by any means but perhaps OptaPlanner, by
fully utilizing the entire core(s), cannot benefit from HT so much. If so, you just don't have enough CPU cores to run so many solvers in parallel, which leads to context switching and performance drop, as a result.
You can prove that by adding more cores. If it helps, it means there is no artificial bottleneck for this amount of tasks.

Do gpu cores switch tasks when they're done with one?

I'm experimenting with c++ AMP, one thing thats unclear from MS documentation is this:
If I dispatch a parallel_for_each with an extent of say 1000, then that would mean that it spawns 1000 threads. If the gpu is unable to take on those 1000 threads at the same time, it completes them 300 at a time or 400 or whatever number it can do. Then there was some vague stuff on warps and tiles out of which I got this impression:
Regardless of how the threads are tiled together (or not at all), the whole group must finish before taking on new tasks so if the internally assigned group has the size of 128 and 30 of them finish, the 30 cores will idle until the other 98 are done too. Is that true? Also, how do I find out what this internal groups size is?
During my experimentation, it certainly appears to have some truth to it because assigning more even amounts of work to the threads seems to speed things up, even if there is slightly more work overall.
The reason I'm trying to figure it out is because I'm deciding whether or not to engage in another lengthy experiment that would be based on threads getting uneven amounts of work (sometimes by the factor of 10x) but all the threads would be independent so data wise, the cores would be free to pick up another thread.
In practice, the underlying execution model of AMP on GPU is the same as CUDA, OpenCL, Compute Shaders, etc. The only thing that changes is the naming of each concept. So if you feel that the AMP documentation is lacking, consider reading up on CUDA or OpenCL. Those are significantly more mature APIs and the knowledge you gain from them applies as well to AMP.
If I dispatch a parallel_for_each with an extent of say 1000, then that would mean that it spawns 1000 threads. If the gpu is unable to take on those 1000 threads at the same time, it completes them 300 at a time or 400 or whatever number it can do.
Maybe. From the high-level view of parallel_for_each, you don't have to care about this. The threads may as well be executed sequentially, one at a time.
If you launch 1000 threads without specifying a tile size, the AMP runtime will choose a tile size for you, based on the underlying hardware. If you specify a tile size, then AMP will use that one.
GPUs are made of multiprocessors (in CUDA parlance, or compute units in OpenCL), each composed of a number of cores.
Tiles are assigned per multiprocessor: all threads within the same tile will be ran by the same multiprocessor, until all threads within that tile run to completion. Then, the multiprocessor will pick another available tile (if any) and run it, until all tiles are executed. Multiprocessors can execute multiple tiles simultaneously.
if the internally assigned group has the size of 128 and 30 of them finish, the 30 cores will idle until the other 98 are done too. Is that true?
Not necessarily. As mentionned earlier, a multiprocessor may have multiple active tiles. It may therefore schedule threads from other tiles to remain busy.
Important note: On GPU, threads are not executed on a granularity of 1. For example, NVIDIA hardware executes 32 threads at once.
To not make this answer needlessly lengthy, I encourage you to read up on the concept of warp.
The GPU certainly won't run 1000 threads at the same time, but it also won't complete them 300 at a time.
It uses multithreading, which means that just like in a CPU, it will share run time among the 1000 threads allowing them to complete seemingly at the same time.
Keep in mind creating a lot of threads may be not interesting for several reasons. For instance, if you must complete all 1000 tasks in step 1 before doing step 2, you might aswell distribute them on a number of threads equal to the number of cores in your GPU and no more than that.
Using more threads than the number of cores only makes sense if you want to dispatch tasks that are not being waited on, or because you felt like doing your code this way is easier. But keep in mind thread management is time-costly too and may drag down your performance.

Logging 16-bit data to an SD card at the rate of 44 kHz

I am using the STM32F4 microcontroller with a microSD card. I am capturing analogue data via DMA.
I am using a double buffer, taking 1280 (10*128 - 10 FFTs) samples at a time.
When one buffer is full I am setting a flag and I then look at 128 samples at a time and run an FFT calculation on it. All of this is running well.
The data is being sampled at the rate I want and FFT calculation is as I would expect. If I just let the program run for one second, I see that it runs the FFT approximately 343 times (44000/128).
But the problem is I would like to save 64 values from this FFT to the SD card.
I am using the HCC fat file system library.
Each loop of the FFT calculation I am copy the 64 values into an array.
After every 10 calculations I write the contents of this array to file and start again.
The array stores 640 float_32 values (10*64).
This works perfectly for a one-second test run. I get 22,000 values stored to the SD card.
But as I increase the time I start losing samples as it take the SD card longer to write. I need the SD card to store over 87 kbit/s (4 bytes * 64 * 343 = 87808) consistently. I have tried increasing the DMA buffer sample size and then the number of times it writes, but didn't find it helped.
I am using an 8G microSD card, class 4. I formatted the SD card to the default FAT32 allocation unit size 2048.
How should I organize the buffering of data to allow for this? I thought using fewer writes might help. Would a queue help? How would I implement this and would anyone have an example?
I saw that clifford had a similar problem and he was using a queue, How can I use an SD card for logging 16-bit data at 48 ksamples/s?.
In my case I got it to work by trying a large number of different cards - they vary a great deal. If I had enough RAM available for a longer buffer that would have worked too.
If you are not using an RTOS, the queue buffering option may not be available to you, or at least would be non-trivial to implement.
Using an RTOS queue, I suggest that you create a queue of messages each of length 64*sizeof(float_32), the number of messages in the queue will be determined by the ammount of card latency you need to deal with; a length of 343 for example, will sustain a card stall of 1 second, and will require 87Kb of RAM. The application will then have a high priority thread performing the FFT and placing data in the queue, while a low priority thread takes data from the queue and writes to the file.
You might improve performance further by accumulating multiple message blocks in your DMA buffer before initiating a write, and there may be some benefit in carefully selecting an optimum DMA buffer length.
Flash is very, very sensitive to overwrites. Writing 3kB and then a further 3kB may count as an overwrite of the first 4 kB. In your case, there's no good reason why you'd want such small writes anyway. I'd advise 16 kB writes (32 frames/write * 64 samples/frame * 4 bytes/sample). You'd need 5 or 6 writes per second, which should be well in spec of any old SD card.
Now it's quite likely that you'd get another 1280 samples it while writing; you'll have to deal with that on another thread. Should be no problem as the writing should block without using CPU (it's a low-level Flash delay)
The most probable cause of the problem might be the way you are interfacing the card through the library.
SD cards over the SPI protocol (which I assume being used here) can be read or written in 512 byte sector units, some SD commands making it possible to stream (to perform sequential sector access faster). An important element of the SD card SPI protocol are various delays, where you have to poll the card whether you could start an operation (such as writing data to a sector).
You should read the library's API to discover how its writing process might work. You will need to perform some regular action which in the end would poll the card to know whether the writing process could continue. Some cards might require a set number of accesses before becoming ready for an operation, some others might use timeouts for state transitions. It might not work well to have the function called relatively rarely (such as once in 2-3 milliseconds) anticipating the card getting ready meanwhile. You have to keep on nagging it whether it completed already.
Just from own experiences with SD interfacing.

CUDA optimisation - kernel launch conditions

I am fairly new to CUDA and would like to find out more about optimising kernel launch conditions to speed up my code. This is quite a specific scenario but I'll try to generalise it as much as possible so anyone else with a similar question can gain from this in the future.
Assume I've got an array of 300 elements (Array A) that is sent to the kernel as an input. This array is made of a few repeating integers with each integer having a device function specific to it. For example, every time 5 appears in Array A, the kernel performs the function specific to 5. These functions are device functions.
How I have parallelised this problem is by launching 320 blocks (probably not the best number) so that each block will perform the device function relevant to its element in parallel.
The CPU would handle the entire problem in a serial fashion where it will take element by element and call each function one after the other whereas the GPU would allocate an element to each block so that all 320 blocks can access the relevant device functions and calculate simultaneously.
In theory for a large number of elements the GPU should be faster - at least I though so but in my case it isn't. My assumption is that since 300 elements is a small number the CPU will always be faster than the GPU.
This is acceptable BUT what I want to know is how I can cut down the GPU execution time at least by a little. Currently, the CPU takes 2.5 milliseconds and the GPU around 12 ms.
Question 1 - How can I choose the optimum number of blocks/threads to launch at the start?
First I tried 320 blocks with 1 thread per block. Then 1 block with 320 threads. No real change in execution time. Will tweaking the number of blocks/threads improve the speed?
Question 2 - If 300 elements is too small, why is that, and roughly how many elements do I need to see the GPU outperforming the CPU?
Question 3 - What optimisation techniques should I look into?
Please let me know if any of this isn't that clear and I'll expand on it.
Thanks in advance.
Internally, CUDA manages threads in groups of 32 (so-called warps). If you have 1 thread per block device will still execute 32 of those - 31 thread will simply be in divergent state. This is potentially an occupancy issue though you may not observe it on your device and with your problem size. There is also limit on number of blocks given multiprocessor (SM) can execute. AFAIR, GeForce 4x can run up to 8 blocks on one SM. Hence if you have a device with 8 SMs you can simultaneously run 64 threads if you have block size of 1. You can use a tool called occupancy calculator to estimate a better block size - or you can use a visual profiler.
This can only be decided by profiling. There are too many unknowns - e.g. what is your ratio of memory accesses to actual computations, how parallelizable your task is, etc.
I would really recommend you to start with best practices guide.

Drive high frequency output (32Khz) on 20Mhz Renesas microcontroller IO pin

I need to drive a 32Khz square wave on pin 19 of a Renesas R8C/36C µController. The pin is non-negotiable (the circuit design is already complete.)
The software design uses a 250 µsec interrupt for simulating multi-tasking, but that's only good for 2Khz full-wave.
Do I need to create another higher-priority interrupt for driving 32 Khz, or is there some other trick that I'm not aware of?
R8C/36C Hardware Manual
R8C/36C Software Manual
I am not familiar with the RC8 and Renesas don't say much on the subject of performance, but it is a CISC processor with typically 4 cycles per instruction, so lets estimate about 4 MIPS? Some instructions are much longer with division up to 30 cycles.
So if you create a 64KHz timer and flip the output on each interrupt, you have about 63 instructions between each interrupt, you have the interrupt latency plus the code to flip the bit. If it works at all, it is likely to constitute a significant CPU load and may affect the timeliness of other operations.
Be realistic, without a redesign, the project may not be viable. You are already stressing it with the 4KHz OS tick in my opinion - the software overhead at that rate is likley to be a significant chunk of your CPU load.
[ADDED]
I previously suggested 6 instructions between interrupts - finger trouble in the calculator, I have changed that estimate to 63, and moderated my conclusion to "barely feasible".
However I looked again at the data sheet, interrupt latency is variable because the instruction execution is variable, and the current instruction must complete before the interrupt is serviced, the worst case is when the DIVX instruction is executing, when it takes up-to 51 cycles before the first instruction of the interrupt routine. That's 2.55us, when you need the interrupt to trigger every 15.625us, the variable latency will impose significant jitter and constitutes 6 to 16 % of your total CPU time without even considering that used by the ISR itself.. Plus if the interrupt itself is pre-empted, or a higher priority interrupt is running when this one becomes due, further jitter will be imposed.
Whether it works will depend on the accuracy and jitter constraints of the 32KHz, and whatever else your code needs to get done.
As many people have pointed out, this design doesn't seem to be very good from a hardware standpoint if the 32khz clock is meant to be generated with a gpio.
However, I don't know How desperate is your situation, nor do I know the volume involved. But if it is a prototype or very short series, and pin 20 is free, you can short-circuit pins 19 and 20, setup pin 19 as an input and 20 as output. Since pin 20 can be used as output from timer rd, you could set up that timer to output the 32khz without using any interrupts.
I am not a renesas micro expert, but I'm talking from what I've seen in the data sheet you attached and previous experience with other mcu's.
I hope this helps.
Looking at the datasheet for that chip:
It looks like your only real option is to use the pin as a generic output port.
the only usable output mode seems to be the generic output port.
If you can't strap pin 19 to another pin that has the hardware to generate 32KHz and just make pin 19 an input? Not a proud moment but it was easy on a DIL package.
Could you call an interrupt every 15.6us and toggle pin19 then on the sixteenth interrupt do the multi-tasking stuff but that is likely to be wasteful. With an interrupt rate of 32Khz, setting pin19 then eighth of the time doing the multi-tasking decisions and the other seven times wait till a point you can reset pin19 and do some background code for less than half the CPU time