How to understand the asynchrony of CopyResource? - direct3d11

virtual void STDMETHODCALLTYPE CopyResource(
/* [annotation] */
_In_ ID3D11Resource *pDstResource,
/* [annotation] */
_In_ ID3D11Resource *pSrcResource);
I have a question about this function, specifically about what happens when pDstResource is read and written in quick succession, as in the code below. Since the call is asynchronous, can I be sure that pDstResource always holds the newly rendered image rather than black_color?
I suspect this because when I use IDXGIOutputDuplication::AcquireNextFrame to grab the first frame and then call CopyResource, I have to Sleep for a short time before the copy actually shows up in pDstResource (judging from the rendered output).
Even after reading the performance considerations, I still don't understand the underlying principle: is it that when WinB wants to access pDstResource_shared, Direct3D synchronizes internally (that is, it finishes executing the CopyResource command first)? I'm worried I'm misunderstanding this.
for(;;)
{
1, WinA_Clear(black_color)
2, WinA_DrawRedTriangleOnTexture(pSrcResource)
3, WinA_CopyResource(pDstResource_shared, pSrcResource)
4, WinB_Clear(white_color)
5, WinB_RenderTexture(pDstResource_shared)
}

Can I ensure that every time pDstResource is a newly rendered image instead of black_color?
Unlike lower-level APIs such as Direct3D 12, in Direct3D 10 and 11 the GPU driver handles that synchronization automatically, and it usually does it pretty well.
If you call CopyResource and then update the source texture, the new changes won't be visible in the copy. If you call CopyResource and then read from the destination texture somehow (either render with it or move it to system memory), the GPU driver will first wait for the copy to complete.
However, there are edge cases where this workflow breaks. One of them is DXGI surface sharing. If you use that mechanism to share textures across graphics APIs, or across processes, it's possible you'll get incomplete renders or other artifacts.
Since one of your textures is named pDstResource_shared, I assume that’s what you’re doing.
To work around this, wait for the copy to complete before passing the destination texture to someone else through surface sharing. Here's how.
On startup, create an ID3D11Query object of type D3D11_QUERY_EVENT.
After you have called ID3D11DeviceContext::CopyResource, call ID3D11DeviceContext::Flush and then ID3D11DeviceContext::End, passing that query.
You now need to wait for the ID3D11DeviceContext::GetData method to return S_OK for that query. That happens once the GPU has completed all commands submitted before ID3D11DeviceContext::End, applying all pending updates to all resources.
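A minimal sketch of those steps, assuming the device and immediate context already exist (error handling and COM release patterns abbreviated):

#include <d3d11.h>
#include <windows.h>

void CopyAndWait(ID3D11Device* device, ID3D11DeviceContext* context,
                 ID3D11Resource* dst, ID3D11Resource* src)
{
    // In real code the query is created once at startup and reused.
    D3D11_QUERY_DESC desc = {};
    desc.Query = D3D11_QUERY_EVENT;
    ID3D11Query* query = nullptr;
    device->CreateQuery(&desc, &query);

    context->CopyResource(dst, src);

    // Push the pending commands to the GPU, then mark the end of this workload.
    context->Flush();
    context->End(query);

    // GetData returns S_FALSE until the GPU has executed everything submitted before End().
    while (context->GetData(query, nullptr, 0, 0) == S_FALSE)
        Sleep(0);   // or SwitchToThread(); a real app could do useful work here instead

    query->Release();
}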
Eventually, MS implemented a proper way to wait for the GPU: use a fence instead of a query. ID3D11Fence::SetEventOnCompletion can signal an event once the GPU has done the job, and you can then put a thread to sleep with WaitForSingleObject or similar. However, this requires a recent enough version of Windows 10; I don't remember exactly which one, but I think it was either the Anniversary Update from 2016 or the Creators Update from 2017. Queries are far more compatible: they work on all versions of Windows 10, and even on Windows 7 and 8.
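For completeness, a similarly hedged sketch of the fence-based wait; it assumes you have already queried the ID3D11Device5 and ID3D11DeviceContext4 interfaces (available only on recent Windows 10 builds) from your device and context:

#include <d3d11_4.h>
#include <windows.h>

void CopyAndWaitFence(ID3D11Device5* device5, ID3D11DeviceContext4* context4,
                      ID3D11Resource* dst, ID3D11Resource* src)
{
    ID3D11Fence* fence = nullptr;
    device5->CreateFence(0, D3D11_FENCE_FLAG_NONE, IID_PPV_ARGS(&fence));

    HANDLE done = CreateEvent(nullptr, FALSE, FALSE, nullptr);

    context4->CopyResource(dst, src);
    context4->Signal(fence, 1);             // GPU sets the fence to 1 when it reaches this point
    fence->SetEventOnCompletion(1, done);   // the event fires once the fence value reaches 1
    context4->Flush();

    WaitForSingleObject(done, INFINITE);    // sleep the calling thread until the copy is done

    CloseHandle(done);
    fence->Release();
}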

Related

C++ | Adding workload to an existing thread from an injected DLL

In my project I injected a DLL (64-bit Windows 10) into an external process with manual mapping & thread hijacking, and I do some work in there.
In its current state I use "RtlCreateUserThread" to create a new thread and run some extra workload there, distributing it for better performance.
My question is: is it possible to access other threads of the current process (hijack them) and add your own workload/code there, without creating a new thread?
I haven't found anything helpful on the internet yet, and the code I used and modified for thread hijacking seems to work only for loading a DLL. I'm pretty new to C++ and still learning, so I'm thankful for any help.
(If you want to see the source for the injector, Google GHInjector; you'll find the library on GitHub.)
It is possible, but it is quite complicated and may not work in all cases.
You need to splice the existing thread's machine code, so you will need write access to the code pages in memory.
Logic:
Find the thread ID and thread handle, then suspend the thread with the SuspendThread WinAPI call.
The suspended thread may be in a wait state or inside a system DLL call, so you need to analyze its current execution stack, backtrace it, and find an execution address that belongs to the application's own address space. You need the StackWalk family of API functions, and PDB files in some cases. It also depends on the architecture (x86, amd64, ...). Walk the stack until the EIP/RIP you find lies within the application's memory address space.
Decode the machine instruction at that point (it will be a 'call') and splice the following instructions into a call to your own function. You need a function declared __declspec(naked), or one implemented in assembly, to execute your code plus the replaced instructions.
Call ResumeThread.
This method may work only once, because there is no guarantee that the application code is executed in a loop.
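A rough, hedged sketch of just the suspend-and-inspect part of the steps above, for x64 user mode (the stack walking and instruction splicing are omitted, and the names are illustrative):

#include <windows.h>

bool InspectThread(DWORD threadId)
{
    HANDLE thread = OpenThread(THREAD_SUSPEND_RESUME | THREAD_GET_CONTEXT | THREAD_SET_CONTEXT,
                               FALSE, threadId);
    if (!thread)
        return false;

    if (SuspendThread(thread) == (DWORD)-1)
    {
        CloseHandle(thread);
        return false;
    }

    CONTEXT ctx = {};
    ctx.ContextFlags = CONTEXT_FULL;
    if (GetThreadContext(thread, &ctx))
    {
        // If the thread was blocked in a wait, RIP now points into a system DLL.
        // A real implementation would walk the stack from RSP here (StackWalk64 etc.)
        // until it finds a return address inside the application's own module,
        // and only then patch instructions at that location.
    }

    ResumeThread(thread);
    CloseHandle(thread);
    return true;
}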

Create a wxPython app that has only one instance

I would like to create a wxPython app such that:
If I run a second instance of that app (e.g., call the Python script from the shell a second time), no new instance should be created.
Instead, the toplevel frame of the already running instance should be raised and focussed.
The first point can be easily implemented by wx.SingleInstanceChecker (see the example code there), but at least the example code only gives a way for making the second instance of the app abort, but not raise the existing app's main frame.
I am using wxPython-Phoenix with Python 3.
Clarification: I would much prefer an out-of-the-box solution like wx.SingleInstanceChecker (that is, not implementing my own locking and IPC solution).
You can use any kind of IPC to send a message asking the other program to do whatever needs to be done (just raise its top level window or maybe handle the command line options passed to the second instance). In C++ there are wxConnection and the related wxServer and wxClient classes that can be used for this, but I'm not sure if they're wrapped by wxPython -- however you could use any Python IPC module instead, if they aren't.
As has been pointed out, the "correct" way to do this is IPC because you have a new process that is supposed to affect a change (raise and focus) in another process.
What you seem to want is to take advantage of the IPC channel that wx.SingleInstanceChecker is already using to do your work. Unfortunately, you can't. That class is implemented in the wxWidgets c++ code and therefore there are no Python bindings to the internal workings of the class.
However, you can probably abuse wx.SingleInstanceChecker to do what you want. In your program, you can set up a timer at some rapid interval (say, 250 ms) that will constantly check IsAnotherRunning() from your main process. Therefore, when your second process starts up, the first will notice and can raise itself to the front. You would just have to wait a little bit in the secondary process before it exits, to give the first instance time to notice.
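For what it's worth, here is a hedged sketch of that polling idea in C++/wxWidgets terms (wxPython exposes the same classes under the same names); whether IsAnotherRunning() in the first instance actually reports the short-lived second instance is exactly the part you would need to verify on your platform:

#include <wx/wx.h>
#include <wx/snglinst.h>

class MyApp : public wxApp
{
    wxSingleInstanceChecker* m_checker = nullptr;
    wxFrame* m_frame = nullptr;
    wxTimer m_timer;

public:
    bool OnInit() override
    {
        m_checker = new wxSingleInstanceChecker("MyAppName");
        if (m_checker->IsAnotherRunning())
        {
            wxMilliSleep(500);   // second instance: linger briefly so the first one can notice
            return false;        // then refuse to start
        }

        m_frame = new wxFrame(nullptr, wxID_ANY, "Main window");
        m_frame->Show();

        // First instance: poll for a second instance and raise the frame when one appears.
        m_timer.SetOwner(this);
        Bind(wxEVT_TIMER, [this](wxTimerEvent&) {
            if (m_checker->IsAnotherRunning())
                m_frame->Raise();
        });
        m_timer.Start(250);
        return true;
    }

    int OnExit() override { delete m_checker; return 0; }
};

wxIMPLEMENT_APP(MyApp);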

ARM Cortex-M3 Startup Code

I'm trying to understand how the initialization code works that ships with Keil (realview v4) for the STM32 microcontrollers. Specifically, I'm trying to understand how the stack is initialized.
In the documentation on ARM's website it mentions that one of the routines in startup_xxx.s, __user_initial_stack_heap, should not use more than 88 bytes of stack. Do you know where that limitation is coming from?
It seems that when the reset handler calls System_Init it is executing a couple of functions in a C environment, which I believe means it is using some form of temporary stack (it allocates a few automatic variables). However, all of those stacked items should be out of scope once it returns and then calls __main, which is where __user_initial_stack_heap is called from.
So why is there this requirement for __user_initial_stack_heap to not use more than 88 bytes? Does the rest of __main use a ton of stack or something?
Any explanation of the cortex-m3 stack architecture as it relates to the startup sequence would be fantastic.
You will see from the __user_initial_stackheap() documentation that the function is for legacy support and that it is superseded by __user_setup_stackheap(); the documentation for the latter provides a clue regarding your question:
Unlike __user_initial_stackheap(), __user_setup_stackheap() works with systems where the application starts with a value of sp (r13) that is already correct, for example, Cortex-M3
[..]
Using __user_setup_stackheap() rather than __user_initial_stackheap() improves code size because there is no requirement for a temporary stack.
On Cortex-M the SP is initialised on reset by the hardware from a value stored in the vector table; on older ARM7 and ARM9 devices this is not the case, and it is necessary to set the stack pointer in software. The start-up code needs a small stack for use before the user-defined stack is available - this may be the case, for example, if the user stack is in external memory and cannot be used until the memory controller has been initialised. The 88 byte restriction is imposed simply because this temporary stack is sized to be as small as possible, since it is probably unused after start-up.
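As a hedged illustration (GNU-style C/C++ rather than the Keil assembler actually used in startup_xxx.s; the symbol and section names are illustrative), the very first word of a Cortex-M vector table is the initial stack pointer, which the hardware loads into SP before Reset_Handler ever executes:

#include <cstdint>

extern "C" {
    extern std::uint32_t _estack;   // top-of-stack address, typically provided by the linker script
    void Reset_Handler();
    void NMI_Handler();
    void HardFault_Handler();
}

typedef void (*vector_t)();

__attribute__((section(".isr_vector"), used))
const vector_t vector_table[] =
{
    reinterpret_cast<vector_t>(&_estack),   // word 0: initial SP, loaded by hardware at reset
    Reset_Handler,                          // word 1: reset vector
    NMI_Handler,
    HardFault_Handler,
    // ... remaining exception and interrupt vectors ...
};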
In your case on the STM32 (a Cortex-M device), it is likely that there is in fact no such restriction, but you should perhaps update your start-up code to use the newer function to be certain. That said, given the required behaviour of this function and the fact that its results are returned in registers, I would suggest that 88 bytes would be rather extravagant if you were to need that much! Moreover, you only need to reimplement it if you are using a scatter-loading file, as described.

Simulating multiple instances of an embedded processor

I'm working on a project which will entail multiple devices, each with an embedded (ARM) processor, communicating. One development approach which I have found useful in the past with projects that only entailed a single embedded processor was develop the code using Visual Studio, divided into three portions:
Main application code (in unmanaged C/C++ [see note])
I/O-simulating code (C/C++) that runs under Visual Studio
Embedded I/O code (C), which Visual Studio is instructed not to build, runs on the target system. Previously this code was for the PIC; for most future projects I'm migrating to the ARM.
Feeding the embedded compiler/linker the code from parts 1 and 3 yields a hex file that can run on the target system. Running parts 1 and 2 together yields code which can run on the PC, with the benefit of better debugging tools and more precise control over I/O behavior (e.g. I can make the simulation code introduce certain types of random hiccups more easily than I can induce controlled hiccups on real hardware).
Target code is written in C, but the simulation environment uses C++ so as to simulate I/O registers. For example, I have a PortArray data structure; the header file for the embedded compiler includes a line like unsigned char LATA # 0xF89; and my header file for simulation includes #define LATA _IOBIT(f89,1) which in turn invokes a macro that accesses a suitable property of an I/O object, so a statement like LATA |= 4; will read the simulated latch, "or" the read value with 4, and write the new value. To make this work, the target code has to compile under C++ as well as under C, but this mostly isn't a problem. The biggest annoyance is probably with enum types (which behave as integers in C, but have to be coaxed to do so in C++).
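As a hedged sketch of what such a simulation header might look like (the _IOBIT macro, IoBit proxy, and g_ports array here are illustrative, not the author's actual code), the idea is that LATA expands to a proxy object whose operators forward to the simulated port array, so target statements like LATA |= 4; compile unchanged:

#include <cstdint>

std::uint8_t g_ports[0x1000];   // the simulated I/O space ("PortArray")

class IoBit
{
    std::uint16_t addr_;
public:
    explicit IoBit(std::uint16_t addr) : addr_(addr) {}
    operator std::uint8_t() const { return g_ports[addr_]; }                    // read the latch
    IoBit& operator=(std::uint8_t v)  { g_ports[addr_] = v;  return *this; }    // plain write
    IoBit& operator|=(std::uint8_t v) { g_ports[addr_] |= v; return *this; }    // read-modify-write
};

#define _IOBIT(addr, size) IoBit(0x##addr)   // simulation build; the size argument is ignored here
#define LATA _IOBIT(f89, 1)

int main()
{
    LATA |= 4;                  // reads the simulated latch, ORs in 4, writes it back
    return g_ports[0xF89];      // now 4
}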
Previously, I've used two approaches to making the simulation interactive:
Compile and link a DLL with target-application and simulation code, and have VB code in the same project which interacts with it.
Compile the target-application code and some simulation code to an EXE with one instance of Visual Studio, and use a second instance of Visual Studio for the simulation UI. Have the two programs communicate via TCP, so nearly all "real" I/O logic is in the simulation program. For example, the aforementioned `LATA |= 4;` would send a "read port 0xF89" command to the TCP port, get the response, process the received value, and send a "write port 0xF89" command with the result.
I've found the latter approach to run a tiny bit slower than the former in some cases, but it seems much more convenient for debugging, since I can suspend execution of the unmanaged simulation code while the simulation UI remains responsive. Indeed, for simulating a single target device at a time, I think the latter approach works extremely well. My question is how I should best go about simulating a plurality of target devices (e.g. 16 of them).
The difficulty I have is figuring out how to make each simulated instance get its own set of global variables. If I were to compile to an EXE and run one instance of the EXE for each simulated target device, that would work, but I don't know any practical way to maintain debugger support while doing that. Another approach would be to arrange the target code so that everything would compile as one module joined together via #include. For simulation purposes, everything could then be wrapped into a single C++ class, with global variables turning into class-instance variables. That would be a bit more object-oriented, but I really don't like the idea of forcing all the application code to live in one compiled and linked module.
What would perhaps be ideal would be if the code could load multiple instances of the DLL, each with its own set of global variables. I have no idea how to do that, however, nor do I know how to make things interact with the debugger. I don't think it's really necessary that all simulated target devices actually execute code simultaneously; it would be perfectly acceptable for simulation instances to use cooperative multitasking. If there were some way of finding out what range of memory holds the global variables, it might be possible to have the 'task-switch' method swap out all of the global variables used by the previously-running instance and swap in the contents applicable to the instance being switched in. Although I'd know how to do that in an embedded context, I'd have no idea how to do it on the PC.
Edit
My questions would be:
Is there any nicer way to allow simulation logic to be paused and examined in VS2010 debugger, while keeping a responsive UI for the simulator front-end, than running the simulator front end and the simulator logic in separate instances of VS2010, if the simulation logic must be written in C and the simulation front end in managed code? For example, is there a way to tell the debugger that when a breakpoint is hit, some or all other threads should be allowed to keep running while the thread that had hit the breakpoint sits paused?
If the bulk of the simulation logic must be source-code compatible with an embedded system written in C (so that the same source files can be compiled and run for simulation purposes under VS2010, and then compiled by the embedded-systems compiler for use in real hardware), is there any way to have the VS2010 debugger interact with multiple simulated instances of the embedded device? Assume performance is not likely to be an issue, but the number of instances will be large enough that creating a separate project for each instance would likely be annoying in the absence of any way to automate the process. I can think of three somewhat-workable approaches, but don't know how to make any of them work really nicely. There's also an approach which would be better if it's possible, but I don't know how to make it work.
Wrap all the simulation code within a single C++ class, such that what would be global variables in the target system become class members. I'm leaning toward this approach, but it would seem to require everything to be compiled as a single module, which would annoyingly affect the design of the target system code. Is there any nice way to have code access class instance members as though they were globals, without requiring all functions using such instances to be members of the same module?
Compile a separate DLL for each simulated instance (so that e.g. if I want to run up to 16 instances, I would include 16 DLL's in the project, all sharing the same source files). This could work, but every change to the project configuration would have to be repeated 16 times. Really ugly.
Compile the simulation logic to an EXE, and run an appropriate number of instances of that EXE. This could work, but I don't know of any convenient way to do things like set a breakpoint common to all instances. Is it possible to have multiple running instances of an EXE attached to a single debugger instance?
Load multiple instances of a DLL in such a way that each instance gets its own global variables, while still being accessible in the debugger. This would be nicest if it were possible, but I don't know any way to do so. Is it possible? How? I've never used AppDomains, but my intuition would suggest that might be useful here.
If I use one VS2010 instance for the front-end, and another for the simulation logic, is there any way to arrange things so that starting code in one will automatically launch the code in the other?
I'm not particularly committed to any single simulation approach; while it might be nice to know if there's some way of slightly improving the above, I'd also like to know of any other alternative approaches that could work even better.
I would think that you'd still have to run 16 copies of your main application code, but that your TCP-based I/O simulator could keep a different set of registers/state for each TCP connection that comes in.
Instead of a bunch of global variables, put them into a single structure that encompasses the I/O state of a single device. Either spawn off a new thread for each socket, or just keep a list of active sockets and dedicate a single instance of the state structure to each socket.
The simulators I have seen that handle multiple instances of the instruction set/processor are designed that way. There is usually a structure that contains a complete set of registers, and a pointer to (or an array of) these structures is used to multiply them into multiple instances of the processor.
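A minimal sketch of that per-device state structure idea (all names illustrative): each TCP connection, thread, or cooperative task owns one DeviceState, so nothing that differs between simulated devices lives in a true global.

#include <cstdint>
#include <vector>

struct DeviceState
{
    // What would be globals / I/O registers on the real target:
    std::uint8_t  lata  = 0;        // e.g. the simulated latch at 0xF89
    std::uint8_t  trisa = 0xFF;
    std::uint32_t tick_count = 0;
    // ... any other per-device registers and application state ...
};

class Simulator
{
    std::vector<DeviceState> devices_;

public:
    explicit Simulator(std::size_t count) : devices_(count) {}

    // Handle a "read port" request arriving on the socket belonging to one device.
    std::uint8_t readPort(std::size_t device, std::uint16_t address)
    {
        DeviceState& d = devices_[device];
        return (address == 0xF89) ? d.lata : 0;     // dispatch on address as needed
    }

    void writePort(std::size_t device, std::uint16_t address, std::uint8_t value)
    {
        DeviceState& d = devices_[device];
        if (address == 0xF89)
            d.lata = value;
    }
};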

VB.NET Synchronization confusion

VB.NET, .NET 4
Hello all,
I have an application that controls an industrial system. It has a GUI which, once a process is started, principally displays the states of various attached devices. It basically works like this:
A System.Timers.Timer object is always running. At each Elapsed event, it polls the devices for their current values and invokes controls on the GUI, updating them with the new values.
A start button is clicked, a process time Stopwatch object is created and started (Labels on the GUI are now invoked and updated on the System.Timers.Timer's Elapsed event, in addition to the other work that is taken care of on this event)
A new thread is created which runs a Process() subroutine
Some Stopwatch objects are created and started (these Stopwatches are periodically restarted during the process via their Restart() method).
Some logic is executed on the new Stopwatches' ElapsedMilliseconds properties to determine when to do things like write new setpoints to the devices, update the data log, etc.
Here's my problem: the program occasionally freezes. My ignorant efforts at tracking down the problem have led me to suspect that reads/writes to the subset of devices that are RS-232 controlled are the culprits most of the time. However, I occasionally see other strange things upon a program freeze, e.g., one of the time Labels whose Text property is determined by a Stopwatch's ElapsedMilliseconds property sometimes shows an impossible value (e.g., -50 hours or something).
For the RS-232 problems, I suspect something like a read event is being executed at the same time as a write event and this causes a freeze(?). I tried to prevent this by making sure that all communication with an RS-232 device is funneled through a Transmit() subroutine which has the following attribute:
That attribute, as far as my ignorance permits me to understand, should force one Transmit() execution to finish completely before another one can start. Perhaps another risk is the code getting blocked here if one Transmit() never finishes?
Regarding the Stopwatch trouble, I speculate that the problem is that the Timer is trying to update a GUI Label at the same time that the Stopwatch's Restart() method is being executed. I'm unsure if this could cause a problem. All I know is that this problem has only occurred at a point in the process when a Restart() call would be made.
I am wondering if I could use a SyncLock or something to lock a Stopwatch while the Label is being updated (or, conversely, while it's being restarted)? Or perhaps I should stop the Timer, restart the Stopwatch, and then start the Timer again, like so:
Timer.Stop
Stopwatch.Restart
Timer.Start
My trepidation regarding how to proceed is due to my complete lack of understanding of how .NET synchronization objects actually work. I've tried slapping a few SyncLocks in various places, but I really have no idea if they're implemented correctly or not. I'm wondering if, having provided all this context, someone really smart might be able to tell me how I'm stupid and how to do this right. I would really appreciate any input. If it would be useful to provide some code snippets, I'd be happy to; I just worry that everything's so convoluted that it would just detract from what I'm hoping is a conceptual question.
Thanks in advance!
Brian
I would consider a shift to a task-scheduling framework instead of relying on manual manipulation of timers if you're working on anything SCADA-related. A simple starting point would be something similar to the hardcodet.Scheduling classes, and you can move up to something like the beast that is Quartz. Most of these types of frameworks will provide you with a way to pause and resume scheduled actions.
If I'm working with Modbus, I normally keep a local cache of the register values and have any change to a value fire a change event. This has the benefit of allowing you to implement things like refreshing values manually without interfering with your process scheduling, and checking for deadband when evaluating your polled responses. This happened to be a side effect of implementing a polled protocol against a subset of the OPC DA interface.
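As a hedged, language-neutral sketch of that cached-register idea (written in C++ rather than VB.NET, with illustrative names): the poll loop writes into the cache, and only values that move outside a deadband fire a change event, which is where GUI updates or logging would hook in.

#include <cmath>
#include <cstdint>
#include <functional>
#include <map>

class RegisterCache
{
public:
    using ChangeHandler = std::function<void(std::uint16_t reg, double value)>;

    RegisterCache(double deadband, ChangeHandler onChange)
        : deadband_(deadband), onChange_(std::move(onChange)) {}

    // Called from the polling loop with the latest value read from the device.
    void update(std::uint16_t reg, double value)
    {
        auto it = values_.find(reg);
        if (it == values_.end() || std::fabs(it->second - value) > deadband_)
        {
            values_[reg] = value;
            onChange_(reg, value);   // notify subscribers only on a meaningful change
        }
    }

    // Manual refreshes and the GUI read from the cache, not from the device.
    double read(std::uint16_t reg) const
    {
        auto it = values_.find(reg);
        return it == values_.end() ? 0.0 : it->second;
    }

private:
    double deadband_;
    ChangeHandler onChange_;
    std::map<std::uint16_t, double> values_;
};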