Encode multiple streams using NVENC - single or multi-thread?

Given that the CUDA context needs to be synchronized across threads while making NVENC calls, would there be true concurrency when encoding multiple streams using multiple threads (each thread handling a single stream)?
Wouldn't we be better off doing everything in a single thread, saving the syscalls for mutex locks etc.?

Related

Does Vulkan parallel rendering rely on multiple queues?

I'm new to Vulkan and not very clear on how parallel rendering works. Here are some questions (the "queue" mentioned below refers specifically to the graphics queue):
1. Does parallel rendering rely on a device that supports more than one queue?
2. If the answer to question 1 is yes, what if the physical device has only one queue but Vulkan exposes 4 queues (which is the case on my MacBook's GPU)? Will the rendering really be parallel in this case?
3. If the answer to question 1 is yes, what if there is only one queue in Vulkan's abstraction? Does that mean the device definitely cannot render objects in parallel?
P.S. About question 2: when I use the Metal API there is only one queue, but when I use the Vulkan API there are 4, so I'm not sure it is right to say "the physical device has only one queue".
I have the sneaking suspicion you are abusing the word "parallel". Make sure you know what it means.
Rendering on GPU is by nature embarrassingly parallel. Typically one queue can feed the entire GPU, and typically apps assume that is true.
In all likelihood they made the number of queues equal to the CPU core count. In Vulkan, submissions to a single queue always need to be externally synchronized. Having more queues allows you to submit from multiple threads without synchronization.
If there is only one Vulkan queue, you can only submit to that one queue, and every submission has to be either synchronized with a mutex or made from a single thread in the first place.
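The "submit from one thread in the first place" option can be sketched as follows. This is only an illustration in Python, not Vulkan code: `FakeGpuQueue`, `record_commands`, `submitter`, and `run_single_submitter` are hypothetical names, and the strings stand in for recorded command buffers. Worker threads prepare work in parallel and hand it to a single submitter thread, which is the only one that ever touches the queue.

```python
import queue
import threading


class FakeGpuQueue:
    """Hypothetical stand-in for a single device queue (not a real Vulkan object)."""

    def __init__(self):
        self.submitted = []

    def submit(self, command_buffer):
        # Deliberately unsynchronized: only the submitter thread may call this.
        self.submitted.append(command_buffer)


def record_commands(worker_id, count, pending):
    # Workers "record" command buffers in parallel and hand them off.
    for i in range(count):
        pending.put(f"worker{worker_id}-cmd{i}")


def submitter(gpu_queue, pending):
    # The only thread that touches gpu_queue, so no mutex is needed around submit().
    while True:
        item = pending.get()
        if item is None:  # sentinel: no more work
            break
        gpu_queue.submit(item)


def run_single_submitter(num_workers=4, per_worker=3):
    gpu_queue = FakeGpuQueue()
    pending = queue.Queue()  # thread-safe hand-off channel
    sub = threading.Thread(target=submitter, args=(gpu_queue, pending))
    workers = [
        threading.Thread(target=record_commands, args=(w, per_worker, pending))
        for w in range(num_workers)
    ]
    sub.start()
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    pending.put(None)  # all workers are done; tell the submitter to stop
    sub.join()
    return gpu_queue.submitted
```

The hand-off channel still synchronizes internally, but the queue object itself is only ever accessed from one thread.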

Can I do transfer operations on the transfer queue and the graphics queue at the same time?

I have made 2 instances of VkQueue: one from graphics family and another one from transfer family. Command pools and command buffers are separated accordingly. Both are doing transfer operations.
The purpose of the first one, apart from rendering, is to update uniform buffers on each frame.
The purpose of the second one is to update resources: model vertex/index buffers, texture images etc.
They work in parallel in different threads asynchronously. So it is possible that there will be 2 calls of vkQueueSubmit at the same time.
Is such usage allowed and is it safe?
Note: since I made my program multithreaded, I sometimes get VK_DEVICE_LOST from vkQueueSubmit, and it seems to happen more frequently while resources are loading. That is why I came to this question.
The Vulkan specification is pretty clear about CPU synchronization of Vulkan functions. vkQueueSubmit says:
Host access to queue must be externally synchronized
Where "queue" is the parameter passed to vkQueueSubmit. It doesn't say every queue; it says "that queue".
And if "external synchronization" is not specifically stated as a requirement of a command, then it isn't a requirement of that command.
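A minimal sketch of what "externally synchronized per queue" means in practice, written in Python as an analogy rather than as Vulkan code. `LockedQueue`, `hammer`, and `demo` are hypothetical names; in a real program the guarded call would be vkQueueSubmit on the corresponding VkQueue. Each queue gets its own mutex, so two threads submitting to the same queue are serialized while a thread submitting to a different queue never waits on them.

```python
import threading


class LockedQueue:
    """Hypothetical wrapper: one mutex per queue object, never a global one."""

    def __init__(self, name):
        self.name = name
        self._lock = threading.Lock()
        self.submissions = []

    def submit(self, work):
        # External synchronization applies to *this* queue only.
        with self._lock:
            self.submissions.append(work)


def hammer(q, label, n):
    for i in range(n):
        q.submit((label, i))


def demo():
    graphics = LockedQueue("graphics")
    transfer = LockedQueue("transfer")
    threads = [
        # Two threads share the graphics queue and are serialized by its lock.
        threading.Thread(target=hammer, args=(graphics, "render", 100)),
        threading.Thread(target=hammer, args=(graphics, "uniforms", 100)),
        # The transfer queue has its own lock and does not contend with them.
        threading.Thread(target=hammer, args=(transfer, "upload", 100)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(graphics.submissions), len(transfer.submissions)
```

This only covers host-side access to the queue objects; it says nothing about other objects that may carry their own synchronization requirements.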

What are the architectural differences between Erlang/OTP and OpenResty?

In Erlang/OTP, I've read how light weight processes, the actor model, and supervisors are important in creating reliable services. How would this compare to OpenResty (master/worker, async IO, embedded Lua)?
I am curious over a general architectural overview on the main concepts to better understand how OpenResty would be used alongside (or instead of) Erlang/OTP.
These two links partially answer the question:
https://github.com/openresty/lua-nginx-module/blob/master/README.markdown
The Lua interpreter or LuaJIT instance is shared across all the requests in a single nginx worker process but request contexts are segregated using lightweight Lua coroutines.
Loaded Lua modules persist in the nginx worker process level resulting in a small memory footprint...
https://github.com/openresty/lua-nginx-module/wiki/Introduction
...for each incoming request, lua-nginx-module creates a coroutine to run user code to process the request, and the coroutine will be destroyed when the request handling process is done. Every coroutine has its own independent global environment, which inherits shared read-only common data.
...lua-nginx-module can handle tens of thousands of concurrent requests with very low memory overhead. According to our tests, the memory overhead for each request in lua-nginx-module is only 2 KB or even half smaller if LuaJIT is used.
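The coroutine-per-request idea quoted above can be illustrated with a rough analogy. This sketch uses Python's asyncio, not Lua or the OpenResty API, and `SHARED_CONFIG`, `handle_request`, `worker`, and `run_worker` are hypothetical names: one worker runs a single event loop, each request gets its own coroutine with private state, and data loaded once by the worker is shared read-only between them.

```python
import asyncio

# Loaded once per worker and shared read-only by every request coroutine.
SHARED_CONFIG = {"greeting": "hello"}


async def handle_request(request_id):
    local_state = {"id": request_id}  # private to this coroutine
    await asyncio.sleep(0)  # yield to other requests, as non-blocking I/O would
    return f"{SHARED_CONFIG['greeting']} #{local_state['id']}"


async def worker(num_requests):
    # One coroutine per request, all multiplexed on this worker's single thread.
    return await asyncio.gather(*(handle_request(i) for i in range(num_requests)))


def run_worker(num_requests=3):
    return asyncio.run(worker(num_requests))
```

Unlike Erlang processes, these coroutines are scheduled cooperatively inside one OS process and are not isolated from each other or supervised.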

Syncing 3 threads sharing buffers using NSConditionLock. It's hard

I have 3 threads (in addition to the main thread). The threads read, process, and write. They each do this to a number of buffers, which are cycled through and reused. The reason it's set up this way is so the program can continue to do the other tasks while one of them is running. So, for example, while the program is writing to disk, it can simultaneously be reading more data.
The problem is I need to synchronize all this so the processing thread doesn't try to process buffers that haven't been filled with new data. Otherwise, there is a chance that the processing step could process leftover data in one of the buffers.
The read thread reads data into a buffer, then marks the buffer as "new data" in an array. So, it works like this:
//set up in main thread
NSConditionLock *readlock = [[NSConditionLock alloc] initWithCondition:0];
//set up lock in thread
[readlock lockWhenCondition:buffer_new[current_buf]];
//copy data to buffer
memcpy(buffer[current_buf],source_data,data_length);
//mark buffer as new (this is reset to 0 once the data is processed)
buffer_new[current_buf] = 1;
//unlock
[readlock unlockWithCondition:0];
I use buffer_new[current_buf] as a condition variable to NSConditionLock. If the buffer isn't marked as new, then the thread in question will lock, waiting for the previous thread to write new data. That part seems to work okay.
The main problem is I need to sync this in both directions. If the read thread happens to take too long for some reason and the processing thread has already finished with processing all the buffers, the processing thread needs to wait and vice-versa.
I'm not sure NSConditionLock is the appropriate way to do this.
I'd turn this on its head. As you say, threading is hard and multi-way synchronization of threads is even harder. Queue based concurrency is often much more natural.
Define three queues; a read queue, a write queue and a processing queue. Then employ a rule stating that no buffer shall be enqueued in more than one queue at a time.
That is, a buffer may be enqueued onto the read queue and, once done reading, enqueued into the processing queue, and once done processing, enqueued into the write queue.
You could use a stack of buffers if you want but, typically, the cost of allocation is pretty cheap compared to the cost of processing and, thus, enqueue-for-read could also do the allocation while dequeue-once-written could do the free.
This would be pretty straightforward to code with GCD. Note that if you really want parallelism, your various queues would really just be throttles, using semaphores -- potentially shared -- to enqueue the work to the global concurrent queues.
Note also that this design has a distinct advantage over what you are currently using in that it uses no locks. The only locks are hidden below the GCD APIs as a part of queue management, but that is effectively invisible to your code.
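The buffer hand-off described above can be sketched like this. It is a Python stand-in using `queue.Queue` and plain threads, not GCD, and `reader`, `processor`, `writer`, `run_pipeline`, and the uppercase "processing" step are made up for illustration. Each buffer lives in exactly one queue at a time, and the bounded queues make a fast stage wait for a slow one in either direction.

```python
import queue
import threading

STOP = object()  # sentinel marking the end of the stream


def reader(chunks, to_process):
    for chunk in chunks:
        to_process.put(bytearray(chunk))  # allocate a fresh buffer per read
    to_process.put(STOP)


def processor(to_process, to_write):
    while (buf := to_process.get()) is not STOP:
        to_write.put(bytes(buf).upper())  # placeholder for real processing
    to_write.put(STOP)


def writer(to_write, sink):
    while (buf := to_write.get()) is not STOP:
        sink.append(buf)  # placeholder for writing to disk


def run_pipeline(chunks, depth=2):
    # Bounded queues: put() blocks when a stage is `depth` buffers ahead,
    # get() blocks when a stage has nothing to do yet.
    to_process = queue.Queue(maxsize=depth)
    to_write = queue.Queue(maxsize=depth)
    sink = []
    threads = [
        threading.Thread(target=reader, args=(chunks, to_process)),
        threading.Thread(target=processor, args=(to_process, to_write)),
        threading.Thread(target=writer, args=(to_write, sink)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sink
```

For example, `run_pipeline([b"ab", b"cd", b"ef"])` passes three buffers through all three stages in order. The application code contains no explicit locks; the only locking is inside the queue implementation.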
Have you seen the Apple Concurrency Programming Guide?
It recommends several preferable methods for moving away from a threads-and-locks concurrency model. Using operation queues, for example, can not only reduce and simplify your code, but also speed up your development and give you better performance.
Sometimes you need to use threads, and you already have the correct idea. You will need to keep adding locks, and with each it will get exponentially more complicated until you can't understand your own code. Then you can start adding locks at random places. Then you're screwed.
Read the concurrency guide, then follow bbum's advice.

What is the difference between a thread/process/task?

Process:
A process is an instance of a computer program that is being executed.
It contains the program code and its current activity.
Depending on the operating system (OS), a process may be made up of multiple threads of execution that execute instructions concurrently.
Process-based multitasking enables you to run the Java compiler at the same time that you are using a text editor.
When multiple processes run on a single CPU, context switching between their memory contexts is used.
Each process has a complete set of its own variables.
Thread:
A thread is a basic unit of CPU utilization, consisting of a program counter, a stack, and a set of registers.
A thread of execution results from a fork of a computer program into two or more concurrently running tasks.
The implementation of threads and processes differs from one operating system to another, but in most cases, a thread is contained inside a process. Multiple threads can exist within the same process and share resources such as memory, while different processes do not share these resources.
An example of threads in the same process is automatic spell checking and automatic saving of a file while writing.
Threads are basically processes that run in the same memory context.
Threads may share the same data during execution.
[Thread diagram: single thread vs. multiple threads]
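As a small illustration of threads sharing data inside one process, here is a hedged Python sketch. The names `count_words` and `demo` and the word-counting task are made up for the example. Both threads update the same dictionary, which is only possible because they live in the same address space, and the lock keeps their updates from interleaving.

```python
import threading


def count_words(lines, totals, lock):
    for line in lines:
        n = len(line.split())
        with lock:  # shared data needs synchronization
            totals["words"] += n


def demo():
    totals = {"words": 0}  # one object, visible to both threads
    lock = threading.Lock()
    a = threading.Thread(target=count_words, args=(["one two", "three"], totals, lock))
    b = threading.Thread(target=count_words, args=(["four five six"], totals, lock))
    a.start()
    b.start()
    a.join()
    b.join()
    return totals["words"]
```

Two separate processes would each see their own copy of `totals` and would need an inter-process mechanism to combine results.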
Task:
A task is a set of program instructions that are loaded in memory.
Short answer:
A thread is a scheduling concept; it's what the CPU actually 'runs' (you don't run a process). A process needs at least one thread that the CPU/OS executes.
A process is a data-organizational concept. Resources (e.g. memory for holding state, allowed address space, etc.) are allocated for a process.
To explain in simpler terms:
Process: a process is a set of instructions (code) operating on related data. It has its own states, such as sleeping, running, and stopped. When a program is loaded into memory, it becomes a process. Each process has at least one thread; a process with exactly one thread is called a single-threaded program.
Thread: a thread is a portion of a process. More than one thread can exist as part of a process. Each thread has its own stack and registers, but threads inside one process share that process's memory and can access each other's data, so the process has to synchronize its threads to achieve the desired behaviour.
Task: "task" is not a universally used concept. When program instructions are loaded into memory, people call the result either a process or a task. Task and process are synonyms nowadays.
A process invokes or initiates a program. It is an instance of a program, and several instances can run the same application. A thread is the smallest unit of execution and lives within a process. A process can have multiple threads running. The execution of a thread results in a task. Hence, in a multithreading environment, multitasking takes place.
A program in execution is known as process. A program can have any number of processes. Every process has its own address space.
Threads use the address space of their process. The difference between a thread and a process is that when the CPU switches from one process to another, the current information has to be saved in the process descriptor and the information of the new process has to be loaded. Switching from one thread to another is simpler.
A task is simply a set of instructions loaded into memory. Threads can split themselves into two or more simultaneously running tasks.
For more detail, see: http://www.careerride.com/os-thread-process-and-task.aspx
Wikipedia sums it up quite nicely:
Threads compared with processes
Threads differ from traditional multitasking operating system processes in that:
processes are typically independent, while threads exist as subsets of a process
processes carry considerable state information, whereas multiple threads within a process share state as well as memory and other resources
processes have separate address spaces, whereas threads share their address space
processes interact only through system-provided inter-process communication mechanisms.
Context switching between threads in the same process is typically faster than context switching between processes.
Systems like Windows NT and OS/2 are said to have "cheap" threads and "expensive" processes; in other operating systems there is not so great a difference except the cost of address space switch which implies a TLB flush.
Task and process are used synonymously.
Wikipedia also gives a clear explanation of the threading models:
1:1 (Kernel-level threading)
Threads created by the user are in 1-1 correspondence with schedulable entities in the kernel.[3] This is the simplest possible threading implementation. Win32 used this approach from the start. On Linux, the usual C library implements this approach (via the NPTL or older LinuxThreads). The same approach is used by Solaris, NetBSD and FreeBSD.
N:1 (User-level threading)
An N:1 model implies that all application-level threads map to a single kernel-level scheduled entity;[3] the kernel has no knowledge of the application threads. With this approach, context switching can be done very quickly and, in addition, it can be implemented even on simple kernels which do not support threading. One of the major drawbacks however is that it cannot benefit from the hardware acceleration on multi-threaded processors or multi-processor computers: there is never more than one thread being scheduled at the same time.[3] For example: If one of the threads needs to execute an I/O request, the whole process is blocked and the threading advantage cannot be utilized. The GNU Portable Threads uses User-level threading, as does State Threads.
M:N (Hybrid threading)
M:N maps some M number of application threads onto some N number of kernel entities,[3] or "virtual processors." This is a compromise between kernel-level ("1:1") and user-level ("N:1") threading. In general, "M:N" threading systems are more complex to implement than either kernel or user threads, because changes to both kernel and user-space code are required. In the M:N implementation, the threading library is responsible for scheduling user threads on the available schedulable entities; this makes context switching of threads very fast, as it avoids system calls. However, this increases complexity and the likelihood of priority inversion, as well as suboptimal scheduling without extensive (and expensive) coordination between the userland scheduler and the kernel scheduler.
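To make the N:1 model concrete, here is a minimal sketch of user-level threads built from Python generators. The names `user_thread`, `run_round_robin`, and `demo` are hypothetical. A tiny scheduler multiplexes several "threads" onto the single kernel thread that runs the interpreter: switching is just a function return, and if any one of them blocked in a real system call, all of them would stall.

```python
from collections import deque


def user_thread(name, steps, log):
    for i in range(steps):
        log.append(f"{name}:{i}")
        yield  # cooperative switch point back to the scheduler


def run_round_robin(threads):
    # The kernel sees one thread; this loop is the whole "userland scheduler".
    ready = deque(threads)
    while ready:
        current = ready.popleft()
        try:
            next(current)  # run until the next yield
            ready.append(current)  # still runnable: back of the line
        except StopIteration:
            pass  # this user-level thread has finished


def demo():
    log = []
    run_round_robin([user_thread("A", 2, log), user_thread("B", 3, log)])
    return log
```

Calling `demo()` returns `['A:0', 'B:0', 'A:1', 'B:1', 'B:2']`: the two user-level threads interleave, but never run at the same time.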