Easiest way to log and save basic GPU stats in Unreal Engine?

I need to log the basic GPU stats (computation times) to a file while testing in the Unreal Engine editor so that I can analyze them afterwards.
What would be the easiest way of doing that? I'm using UE 5.1.
I have no preference regarding Blueprints; the solution may or may not use them.
I don't need the synchronized events to be logged (it's OK if they are included too, I just don't need them). I just need the plain basic stats over time.
Any constructive feedback is appreciated. Cheers!

Unreal ships with a powerful profiler called Unreal Insights, which can be used to record and analyze GPU processes as well. It is a standalone tool which you can attach to your Editor session when testing.
You can find it at Engine/Binaries/Win64/UnrealInsights.exe (relative to engine install directory). It saves data to Engine/Programs/UnrealInsights/Saved/TraceSessions (also relative to engine install dir).
Here is the in-depth documentation.
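If you would rather start and stop the recording from your own code or a Blueprint instead of attaching the standalone tool by hand, one option is to drive the trace through console commands. The following is a minimal sketch, not a definitive recipe: the helper class and function names are made up for illustration, only UKismetSystemLibrary::ExecuteConsoleCommand and the Trace.Start / Trace.Stop console commands are engine features, and the "gpu,frame" channel list is an assumption you should verify against your 5.1 build.

    // GpuTraceHelpers.h -- illustrative Blueprint function library for your own game module.
    // Starts/stops an Unreal Insights trace so the GPU timings can be analyzed offline.
    #pragma once

    #include "CoreMinimal.h"
    #include "Kismet/BlueprintFunctionLibrary.h"
    #include "Kismet/KismetSystemLibrary.h"
    #include "GpuTraceHelpers.generated.h"

    UCLASS()
    class UGpuTraceHelpers : public UBlueprintFunctionLibrary
    {
        GENERATED_BODY()

    public:
        // Start recording a trace with the GPU and frame channels enabled.
        // (Assumed channel names; adjust to whatever channels your build exposes.)
        UFUNCTION(BlueprintCallable, Category = "Profiling", meta = (WorldContext = "WorldContextObject"))
        static void StartGpuTrace(const UObject* WorldContextObject)
        {
            UKismetSystemLibrary::ExecuteConsoleCommand(WorldContextObject, TEXT("Trace.Start gpu,frame"));
        }

        // Stop recording; the resulting .utrace file can then be opened in UnrealInsights.exe.
        UFUNCTION(BlueprintCallable, Category = "Profiling", meta = (WorldContext = "WorldContextObject"))
        static void StopGpuTrace(const UObject* WorldContextObject)
        {
            UKismetSystemLibrary::ExecuteConsoleCommand(WorldContextObject, TEXT("Trace.Stop"));
        }
    };

Call StartGpuTrace when your test begins and StopGpuTrace when it ends, then open the recorded trace in Unreal Insights to inspect the per-frame GPU timings.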

Related

Why do TTS (Text-To-Speech) prompts play normally while testing in one environment but not in others?

I am a software engineer working at a company that uses TTS for telephony projects. When I place calls to test that our VUI, its corresponding functions, and TTS prompts are working correctly, I often run into the following problem.
When I run tests (placing phone calls and navigating the VUI) in our local environment, I'll randomly have prompts that stop playing for a few seconds. Instead of hearing the prompt, there is silence, and then the prompt resumes where you'd expect it to be, a few seconds after the cutoff began.
For example, take the prompt: "Hello, thank you for calling today."
At certain times, while testing in our local environment, I'll hear, for example, "Hello, [brief silence] calling today."
But when I run the exact same test in the environment we deploy to, I hear the same prompt just as I'd expect. I know environment issues can be common with TTS, specifically prompts cutting out and not playing clearly, but I'm curious: can anyone elaborate on what these "environment problems" could be? Furthermore, I do know that these issues aren't grammar issues. I'll run tests where the prompt is spoken perfectly, but then, when I give a no-input or no-match response to hit the next function (which in that case plays essentially the same prompt), the cut-off / silence occurs.
Any information, sites or books are much appreciated. I personally haven't found anything online about this stuff. Thanks in advance!
TTS (Text-to-Speech) is an active process. Depending on how your platform implements TTS, the audio might be streamed directly from the TTS server. What may be happening is that the TTS engine can't keep up with the requests.
If this is on premise (unlikely these days), monitor the performance of the TTS server(s); CPU is the best metric. If the platform uses MRCP (likely), the logs for that communication may provide insights.
If this is a hosted solution, contact your provider. Odds are, their test environment is under-provisioned for TTS, mostly because everybody else is hitting the same shared test environment. In production, many apps switch to recorded audio for quality, so the scale of TTS resources is reduced.
As an ugly hack, you could play a recording (an actual audio file) of one second of silence at the beginning of all forms. This might give the TTS server enough time to get ahead and buffer the audio generation.

Linux embedded sched_setscheduler workload

I am developing a testing approach for embedded software and would like to try it on a POSIX-based, real-world application that uses the Linux scheduler functions (e.g. sched_setscheduler).
It is relatively easy to find open source software that uses POSIX threads and locks (e.g. http://ctrace.sourceforge.net). However, I cannot find any real-world application that also uses the Linux scheduling API. Google keeps pointing me toward WCET calculation instead.
Does anyone know any open source, embedded, POSIX based software, that uses scheduling and affinity Linux functions?
Thank you.
Have a look here: https://git.kernel.org/pub/scm/utils/rt-tests/rt-tests.git/ and here: http://elinux.org/Realtime_Testing_Best_Practices. These are synthetic test suites for real-time testing of Linux kernels. They basically exercise parallel and prioritized CPU load, scheduling, and missed computation deadlines.
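For reference, the scheduling and affinity calls those suites exercise boil down to a couple of Linux/POSIX functions. Here is a minimal, self-contained sketch (the CPU number and priority value are arbitrary placeholders; it needs root or CAP_SYS_NICE to succeed):

    // rt_pin_example.cpp -- pin the calling process to one CPU and switch it to
    // the SCHED_FIFO real-time policy, the core pattern used by rt-tests.
    // Build with: g++ rt_pin_example.cpp -o rt_pin
    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE   // for cpu_set_t / sched_setaffinity on some toolchains
    #endif
    #include <sched.h>
    #include <cstdio>

    int main()
    {
        // Restrict this process to CPU 0 (arbitrary choice for the example).
        cpu_set_t cpus;
        CPU_ZERO(&cpus);
        CPU_SET(0, &cpus);
        if (sched_setaffinity(0 /* this process */, sizeof(cpus), &cpus) != 0)
        {
            std::perror("sched_setaffinity");
            return 1;
        }

        // Switch to the SCHED_FIFO real-time policy with a mid-range priority.
        sched_param param{};
        param.sched_priority = 50;   // arbitrary; query sched_get_priority_min/max for the valid range
        if (sched_setscheduler(0 /* this process */, SCHED_FIFO, &param) != 0)
        {
            std::perror("sched_setscheduler");
            return 1;
        }

        std::printf("Running SCHED_FIFO, priority %d, pinned to CPU 0\n", param.sched_priority);
        // ... real-time workload would go here ...
        return 0;
    }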

How to estimate size of Windows CE run-time image

I am developing an application and need to estimate how much memory (RAM and ROM) it will need to run on a device. I have been looking online but couldn't find any good tips on how to do this.
The system in question is an industrial system. The application itself will need to have a .NET Compact framework, and following components besides Windows CE Core: SYSGEN_HTTPD (Web Server), SYSGEN_ATL (Active Template Libraries), SYSGEN_SOAPTK_SERVER (SOAP Server), SYSGEN_MSXML_HTTP (XML/HTTP), SYSGEN_CPP_EH_AND_RTTI (Exception Handling and Runtime Type Information).
Tx
There really is no way to estimate this, because application behavior and code can have wildly different requirements. Something that does image manipulation is likely going to require more RAM than a simple HMI, but even two graphical apps that do the same thing could be vastly different based on how image algorithms and buffer sizes are set up.
The only way to really get an idea is to actually run the application and see what the footprint looks like. I would guess that you're going to want a BOM that includes at least 64MB of RAM and 32MB of flash at a bare minimum. Depending on the app, I'd probably ask for 128MB of RAM. Flash would depend greatly on what the app needs to do.
Since you are specifying core OS components, and since I assume you can estimate your own application's resources, I take it you are asking for an estimate of the OS as a whole.
The simplest way to get an approximation is to build an emulator image (CE6 has an ARM one), which should give you a sense of the size. The difference from the final image will mainly be the size of the drivers for the actual platform you will use.

A PDF reader - please provide step-by-step guidance or a reference to guidance

I have to make a hardware project using a microcontroller, memory, screens, etc.
Is it possible to make an independent PDF / documents reader, which is capable of running on battery power?
Please note I don't want to use any technology that needs licensing. It must all be freeware, and the programming language can be assembly, C, Flash, or anything else.
I have submitted a proposal for a PDF reader project (independent hardware). Many say it's impossible. What should I do?
Reading and displaying a PDF document is quite a "high level operation".
You should start with a microcontroller starter kit with an ARM9 processor or something similar. Then install a Linux operating system on it, include a standard display driver, and run an X server. You should then be able to find a Linux-based PDF reader that works under X.
To second another comment here, I would say that you're not going to do this with a microcontroller; you're going to need a more powerful ARM CPU like an ARM9, Cortex-A8, or similar with a decent amount of RAM.
You'll probably need something that's capable of running Linux if you want to build on existing software rather than writing quite a large volume of it from scratch.
Note that the commercial devices that are out there, including the Kindle, run Linux and aren't based on a microcontroller.
You might be best off getting something like a BeagleBoard, attaching a display to it, and starting from there with an X-based PDF viewer.

How Do You Profile & Optimize CUDA Kernels?

I am somewhat familiar with the CUDA visual profiler and the occupancy spreadsheet, although I am probably not leveraging them as well as I could. Profiling & optimizing CUDA code is not like profiling & optimizing code that runs on a CPU. So I am hoping to learn from your experiences about how to get the most out of my code.
There was a post recently looking for the fastest possible code to identify self numbers, and I provided a CUDA implementation. I'm not satisfied that this code is as fast as it can be, but I'm at a loss as to both what the right questions are and which tool I can get the answers from.
How do you identify ways to make your CUDA kernels perform faster?
If you're developing on Linux, then the CUDA Visual Profiler gives you a whole load of information, but knowing what to do with it can be a little tricky. On Windows you can also use the CUDA Visual Profiler, or (on Vista/7/2008) you can use Nexus, which integrates nicely with Visual Studio and gives you combined host and GPU profiling information.
Once you've got the data, you need to know how to interpret it. The Advanced CUDA C presentation from GTC has some useful tips. The main things to look out for are:
Optimal memory accesses: you need to know what you expect your code to do and then look for exceptions. So if you are always loading floats, and each thread loads a different float from an array, then you would expect to see only 64-byte loads (on current h/w). Any other loads are inefficient. The profiling information will probably improve in future h/w.
Minimise serialization: the "warp serialize" counter indicates that you have shared memory bank conflicts or constant serialization; the presentation goes into more detail on what to do about this, as does the SDK (e.g. the reduction sample)
Overlap I/O and compute: this is where Nexus really shines (you can get the same information manually using cudaEvents; a host-side sketch follows this answer). If you have a large amount of data transfer, you want to overlap the compute and the I/O
Execution configuration: the occupancy calculator can help with this, but simple methods like commenting out the compute to measure expected vs. measured bandwidth are really useful (and vice versa for compute throughput)
This is just a start, check out the GTC presentation and the other webinars on the NVIDIA website.
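To illustrate the cudaEvent route mentioned in the overlap point above, here is a minimal host-side sketch (the 64 MB buffer size is an arbitrary placeholder, and error checking is omitted for brevity). It times an asynchronous host-to-device copy on its own stream; in a real overlap experiment you would enqueue kernel work on a second stream alongside it and compare the timings:

    // event_timing.cpp -- time GPU work with CUDA events (host-side CUDA runtime API).
    // Build (assumed): nvcc event_timing.cpp -o event_timing
    #include <cuda_runtime.h>
    #include <cstdio>

    int main()
    {
        const size_t bytes = 64u * 1024u * 1024u;   // 64 MB, arbitrary for the example

        // Pinned host memory is required for genuinely asynchronous copies.
        float* hostBuf = nullptr;
        float* devBuf  = nullptr;
        cudaMallocHost(reinterpret_cast<void**>(&hostBuf), bytes);
        cudaMalloc(reinterpret_cast<void**>(&devBuf), bytes);

        cudaStream_t copyStream;
        cudaStreamCreate(&copyStream);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        // Bracket the work to be timed with events recorded on the same stream.
        cudaEventRecord(start, copyStream);
        cudaMemcpyAsync(devBuf, hostBuf, bytes, cudaMemcpyHostToDevice, copyStream);
        cudaEventRecord(stop, copyStream);

        // Wait for the stop event, then read back the elapsed time in milliseconds.
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        std::printf("H2D copy: %.3f ms (%.2f GB/s)\n", ms, (bytes / 1e9) / (ms / 1e3));

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaStreamDestroy(copyStream);
        cudaFree(devBuf);
        cudaFreeHost(hostBuf);
        return 0;
    }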
If you are using Windows... Check Nexus:
http://developer.nvidia.com/object/nexus.html
The CUDA profiler is rather crude and doesn't provide a lot of useful information. The only way to seriously micro-optimize your code (assuming you have already chosen the best possible algorithm) is to have a deep understanding of the GPU architecture, particularly with regard to using shared memory, external memory access patterns, register usage, thread occupancy, warps, etc.
Maybe you could post your kernel code here and get some feedback?
The nVidia CUDA developer forum is also a good place to go for help with this kind of problem.
I hung back because I'm no CUDA expert, and the other answers are pretty good IF the code is already pretty near optimal. In my experience, that's a big IF, and there's no harm in verifying it.
To verify it, you need to find out if the code is for sure not doing anything it doesn't really have to do. Here are ways I can see to verify that:
Run the same code on the vanilla processor, and either take stackshots of it, or use a profiler such as Oprofile or RotateRight/Zoom that can give you equivalent information.
Run it on a CUDA processor and do the same thing, if possible.
What you're looking for are lines of code that have high occupancy on the call stack, as shown by the fraction of stack samples containing them. Those are your "bottlenecks". It does not take a very large number of samples to locate them.