GPU Shader architecture

I have two questions regarding GPU architecture.
Firstly, I was looking at how L1 caches work and noticed that they need multiple ports to support requests from multiple threads (even after memory coalescing). How many ports do L1 caches have in general? What about L2 caches?
Secondly, in CPUs the requested block from the cache goes directly into the register file. How does this work in GPUs? Does the register file also have multiple ports (equal to the number of ports in the L1 cache)? How exactly is it synchronized with the operand collector?
Is there any document (say from NVIDIA) which explains the detailed architecture of the caches, the operand collectors and their interaction?

Related

Why do queues in a queue family in Vulkan need priority if we can't distinguish between them?

As asked in the title. My main point is the "why": what is the benefit of such a logical structure for queues and queue families?
Do chip/card makers actually etch multiple independent queues onto their chips, queues that are separately distinguishable?
Does implementing separate processing units/streams provide any benefit to implementations? And by extension, does it retroactively benefit older APIs such as OpenCL?
I've observed an interesting fact: on my "Intel(R) Core(TM) i3-8100B CPU @ 3.60GHz" Mac Mini, there are 2 GPUs listed in "vulkaninfo.app" (from the LunarG SDK). My bad, the app linked against 2 copies of libMoltenVK.dylib (1 in "Contents/Frameworks", 1 in "/usr/local/lib").
"Why" is not a great question for SO format. It leads to speculation.
The queues are distinguishable in Vulkan: each has an index by which it can be identified. Keep in mind that they are largely a driver construct. Even when the driver exposes multiple queues, a single one can typically use all of the GPU's compute resources.
Furthermore, the Vulkan specification does not really say what should happen when you supply a specific priority value. It is perfectly valid for the driver/GPU to ignore it.
Chip makers do build compute units that are independent and can, in theory, execute different code from each other. But that is not usually advantageous: the typical workload, rendering some regular W × H image, already saturates all the compute units with the same work.
Why: because you can submit different types of work that have different importance, and the priority gives the Vulkan implementation a hint about what you would like done first (see the sketch below).
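For concreteness, here is a minimal sketch of where that hint lives in the API. The function name, family index, queue count and priority values are made-up examples (a family exposing at least two queues is assumed), and the implementation is free to ignore the priorities entirely.

    #include <vulkan/vulkan.h>

    // Request two queues from one family, hinting that the first (e.g. graphics)
    // matters more than the second (e.g. async transfers). Priorities are hints only.
    VkResult create_device_with_priorities(VkPhysicalDevice physicalDevice, VkDevice* device)
    {
        float priorities[] = { 1.0f, 0.5f };

        VkDeviceQueueCreateInfo queueInfo{};
        queueInfo.sType            = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO;
        queueInfo.queueFamilyIndex = 0;          // assumed family exposing >= 2 queues
        queueInfo.queueCount       = 2;
        queueInfo.pQueuePriorities = priorities; // one float in [0.0, 1.0] per queue

        VkDeviceCreateInfo deviceInfo{};
        deviceInfo.sType                = VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO;
        deviceInfo.queueCreateInfoCount = 1;
        deviceInfo.pQueueCreateInfos    = &queueInfo;
        // Enabled features/extensions omitted for brevity.

        return vkCreateDevice(physicalDevice, &deviceInfo, nullptr, device);
    }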
Everything else in the question is beside the point:
Do chip/card makers actually etch multiple independent queues onto their chips, queues that are separately distinguishable?
Not necessarily; those may be logical queues that are time-sliced.
Does implementing separate processing units/streams provide any benefit to implementations? And by extension, does it retroactively benefit older APIs such as OpenCL?
No; Metal, a contemporary API from Apple, doesn't have a queue count or the concept of queue families at all.

Memory reference traces with Intel Pin of packet processing applications

I'm learning how to use Intel Pin and I have a couple of questions regarding the instrumentation process for a particular use case. I would like to create a memory reference trace of a simple packet processing application. I have developed the required pintool for that purpose, and my questions are the following.
Assuming I use the same network packet trace at all times as input to my packet processing application and let's say I instrument that same application on two different machines. How will the memory reference traces be different? Apparently Pin instruments userspace and is architecture independent so I wouldn't expect to see big qualitative differences in the two output memory reference traces. Is that assumption correct?
How will the memory trace change if I experiment with the rate at which I inject network packets into my packet processing application? Or will it change at all, and if so, how can I detect how the output traces differ?
Thank you
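For reference, the "required pintool" mentioned above typically looks something like this minimal sketch, modeled on the pinatrace example that ships with Pin (the output file name and format are illustrative):

    #include <cstdio>
    #include "pin.H"

    static FILE* trace;

    // Analysis routines: log the instruction pointer and the effective address.
    static VOID RecordMemRead(VOID* ip, VOID* addr)  { fprintf(trace, "%p: R %p\n", ip, addr); }
    static VOID RecordMemWrite(VOID* ip, VOID* addr) { fprintf(trace, "%p: W %p\n", ip, addr); }

    // Instrumentation routine: called for every instruction, inserts a call
    // before every memory operand that is read or written.
    static VOID Instruction(INS ins, VOID* v)
    {
        UINT32 memOperands = INS_MemoryOperandCount(ins);
        for (UINT32 memOp = 0; memOp < memOperands; memOp++) {
            if (INS_MemoryOperandIsRead(ins, memOp))
                INS_InsertPredicatedCall(ins, IPOINT_BEFORE, (AFUNPTR)RecordMemRead,
                                         IARG_INST_PTR, IARG_MEMORYOP_EA, memOp, IARG_END);
            if (INS_MemoryOperandIsWritten(ins, memOp))
                INS_InsertPredicatedCall(ins, IPOINT_BEFORE, (AFUNPTR)RecordMemWrite,
                                         IARG_INST_PTR, IARG_MEMORYOP_EA, memOp, IARG_END);
        }
    }

    static VOID Fini(INT32 code, VOID* v) { fclose(trace); }

    int main(int argc, char* argv[])
    {
        if (PIN_Init(argc, argv)) return 1;
        trace = fopen("memtrace.out", "w");
        INS_AddInstrumentFunction(Instruction, 0);
        PIN_AddFiniFunction(Fini, 0);
        PIN_StartProgram(); // never returns
        return 0;
    }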
I assume you are doing something related to following the data flow / code flow of the network packet, probably closely related to data tainting?
Assuming I use the same network packet trace at all times as input to my packet processing application and let's say I instrument that same application on two different machines. How will the memory reference traces be different?
There are multiple factors that can make the memory traces quite different, the crucial point being "two different machines":
Exact copy of the same OS: traces nearly the same (as the stack, heap and virtual memory manager will behave the same), except that addresses will change (ASLR).
Same OS (but not necessarily the same versions of the system shared libraries): probably the same as above if the target application is not recompiled. Maybe minor differences due to the heap manager behaving differently.
Different OS (where a recompilation of the traced application is needed): completely different traces.
Apparently Pin instruments userspace and is architecture independent so I wouldn't expect to see big qualitative differences in the two output memory reference traces. Is that assumption correct?
Pintools need to be recompiled for different architectures, but the pintool itself should not change the way the target application is traced (same pintool + same OS + same application = nearly the same trace).
How will the memory trace change if I experiment with the rate at which I inject network packets into my packet processing application?
This is system dependent and also depends on your insertion point(s). If you start tracing at recv() or recvfrom(), there might be some congestion or dropped packets (UDP) if, for example, the rate is too high. It depends on the protocol, your receive window, etc. There are really multiple factors here.
Or will it change at all, and if so, how can I detect how the output traces differ?
I'd probably check the code flow rather than the data flow for this case (it seems easier to me). Given exactly the same packet but different rates, if the code branches are not the same (maybe at the basic block (BBL) level), this immediately tells you that the same packet is handled differently; a sketch of such a basic-block trace follows.
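A minimal sketch of that idea, again using Pin's standard TRACE/BBL instrumentation (output file name and format are illustrative); diffing the logs of two runs shows where the code paths diverge:

    #include <cstdio>
    #include "pin.H"

    static FILE* bblLog;

    // Analysis routine: log the address of every basic block that executes.
    static VOID LogBbl(ADDRINT addr) { fprintf(bblLog, "%lx\n", (unsigned long)addr); }

    // Instrumentation routine: called once per trace, inserts a call at the
    // head of each basic block in the trace.
    static VOID Trace(TRACE trace, VOID* v)
    {
        for (BBL bbl = TRACE_BblHead(trace); BBL_Valid(bbl); bbl = BBL_Next(bbl))
            BBL_InsertCall(bbl, IPOINT_BEFORE, (AFUNPTR)LogBbl,
                           IARG_ADDRINT, BBL_Address(bbl), IARG_END);
    }

    static VOID Fini(INT32 code, VOID* v) { fclose(bblLog); }

    int main(int argc, char* argv[])
    {
        if (PIN_Init(argc, argv)) return 1;
        bblLog = fopen("bbltrace.out", "w");
        TRACE_AddInstrumentFunction(Trace, 0);
        PIN_AddFiniFunction(Fini, 0);
        PIN_StartProgram(); // never returns
        return 0;
    }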

Resource usage of a static web server

I came across this question in a blog post. It was asked by Mozilla in their internship interview. (Blog Post)
You are running a HTTP server (nginx, Apache, etc) that is configured to serve static files off the local filesystem of your modern, multi-core server connected to a gigabit network. A handful of clients start requesting the same 4kb static file as fast as they can. What system resource do you think will be exhausted first?
a. CPU
b. Disk / I/O
c. Memory
d. Network
e. Other
In my opinion, none of these would be exhausted on a modern machine running Nginx/Apache. Won't the web server cache such a small file and just keep serving it? Also, for repeated requests it can easily send a Not-Modified header.
In the case of Apache, which handles multiple clients by spawning threads, I guess the CPU would be exhausted first, but for a "handful" of clients that won't matter.
I wanted to know what others have to say about this question.
It reeeeeeeeally depends. 4 kB is that magical size that fits into practically all caches and buffers at their default settings, so it is easy (and fast) to pass around. Memory is not a limiting factor here, as web servers operate on file handles, not entire files. In this case I would assume they keep it right in memory, but that would be one copy per worker instance, which usually comes down to 4 kB * (num_cores + 1) at most, which is not really an issue.
One could argue that either memory speed or disk speed might be an issue. But the former is negligible when mechanisms like sendfile are properly configured, enabling a zero-copy approach (see the sketch below), and the latter amortizes over time once a copy of the file has been loaded into memory.
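For illustration, a minimal sketch of that zero-copy path using the Linux sendfile(2) call. Descriptor setup and the HTTP response headers are assumed to be handled elsewhere, and the function name is made up:

    #include <fcntl.h>
    #include <sys/sendfile.h>
    #include <sys/stat.h>
    #include <unistd.h>

    // Send a static file to an already-accepted client socket without copying
    // the data through user space.
    static ssize_t serve_static_file(int client_fd, const char* path)
    {
        int file_fd = open(path, O_RDONLY);
        if (file_fd < 0) return -1;

        struct stat st;
        if (fstat(file_fd, &st) < 0) { close(file_fd); return -1; }

        off_t offset = 0;
        ssize_t sent = 0;
        while (offset < st.st_size) {
            // sendfile() advances 'offset' by the number of bytes transferred.
            ssize_t n = sendfile(client_fd, file_fd, &offset, st.st_size - offset);
            if (n <= 0) break;
            sent += n;
        }
        close(file_fd);
        return sent;
    }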
Lastly, there's the interface and the CPU(s). Overall, CPU time tends to be a lot cheaper than network time, so I would expect the NIC to be the bottleneck long before the CPU - if at all.
The question is a bit unspecific on the location of the clients. If they are connected to the same GbE network, they could indeed have the power to saturate your NIC with their requests. If not, some intermediary could become the limiting factor.
Now let us assume those clients are on our network and we have a single-homed 10GbE NIC here, connected via 8 lanes (which is fairly standard IMHO): PCIe 3.0 x8 is specified at roughly 7,877 MB/s (8 GT/s per lane with 128b/130b encoding, i.e. about 985 MB/s per lane). A Core i7 3770 has a bus speed of 5 GT/s, which translates to roughly 8 GB/s across 8 lanes. Assuming no other I/O-intensive workload, this CPU could easily saturate the NIC.
So in summary: Network/NIC saturation before CPU saturation before anything else.

How does Redis achieve the high throughput and performance?

I know this is a very generic question. But I wanted to understand the major architectural decisions that allow Redis (or caches like Memcached, Cassandra) to work at such amazing performance levels.
How are connections maintained?
Are connections TCP or HTTP?
I know that it is completely written in C. How is the memory managed?
What are the synchronization techniques used to achieve high throughput in spite of competing reads/writes?
Basically, what is the difference between a plain vanilla implementation of a machine with an in-memory cache and a server that can respond to commands, and a Redis box? I also understand that a complete answer would have to be huge and include very complex details. But what I'm looking for are the general techniques used, rather than all the nuances.
There is a wealth of information in the Redis documentation to understand how it works. Now, to answer specifically your questions:
1) How are connections maintained?
Connections are maintained and managed using the ae event loop (designed by the Redis author). All network I/O operations are non-blocking. You can see ae as a minimalistic implementation using the best network I/O demultiplexing mechanism of the platform (epoll on Linux, kqueue on BSD, etc.), just like libevent, libev, libuv, etc.
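For illustration, the pattern that ae wraps boils down to something like the following sketch, written here directly against Linux epoll (this is not Redis's actual code; names are made up):

    #include <sys/epoll.h>
    #include <vector>

    // One thread, one loop: wait for readiness events and handle them in turn.
    void event_loop(int listen_fd)
    {
        int ep = epoll_create1(0);

        epoll_event ev{};
        ev.events  = EPOLLIN;
        ev.data.fd = listen_fd;
        epoll_ctl(ep, EPOLL_CTL_ADD, listen_fd, &ev);

        std::vector<epoll_event> events(1024);
        for (;;) {
            int n = epoll_wait(ep, events.data(), (int)events.size(), -1);
            for (int i = 0; i < n; ++i) {
                int fd = events[i].data.fd;
                if (fd == listen_fd) {
                    // accept() the new client and register it with epoll_ctl().
                } else {
                    // Read the request, execute the command, queue the reply;
                    // all on this one thread, so commands are naturally serialized.
                }
            }
        }
    }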
2) Are connections TCP or HTTP?
Connections are TCP, using the Redis protocol, which is a simple, telnet-compatible, text-oriented protocol that also supports binary data. This protocol is typically more efficient than HTTP.
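For example, this is roughly what the command SET key hello and its reply look like on the wire (an array of bulk strings going in, a simple string coming back):

    // RESP framing of "SET key hello"; any telnet-like TCP client can send this.
    const char request[] =
        "*3\r\n"            // array of 3 elements
        "$3\r\nSET\r\n"     // bulk string, 3 bytes: SET
        "$3\r\nkey\r\n"     // bulk string, 3 bytes: key
        "$5\r\nhello\r\n";  // bulk string, 5 bytes: hello

    const char reply[] = "+OK\r\n";  // simple-string reply from the server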
3) How is the memory managed?
Memory is managed by relying on a general-purpose memory allocator. On some platforms, this is actually the system memory allocator. On some other platforms (including Linux), jemalloc has been selected, since it offers a good balance between CPU consumption, concurrency support, fragmentation and memory footprint. The jemalloc source code is part of the Redis distribution.
Contrary to other products (such as memcached), Redis does not implement a slab allocator.
A number of optimized data structures have been implemented on top of the general-purpose allocator to reduce the memory footprint.
4) What are the synchronization techniques used to achieve high throughput in spite of competing reads/writes?
Redis is a single-threaded event loop, so there is no synchronization to be done since all commands are serialized. Now, some threads also run in the background for internal purposes. In the rare cases they access the data managed by the main thread, classical pthread synchronization primitives are used (mutexes for instance). But 100% of the data accesses made on behalf of multiple client connections do not require any synchronization.
You can find more information here:
Redis is single-threaded, then how does it do concurrent I/O?
What is the difference between a plain vanilla implementation of a machine with an in-memory cache and a server that can respond to commands, and a Redis box?
There is no difference. Redis is a plain vanilla implementation of a machine with an in-memory cache and a server that can respond to commands. But it is an implementation which is done right:
using the single threaded event loop model
using simple and minimalistic data structures optimized for their corresponding use cases
offering a set of commands carefully chosen to balance minimalism and usefulness
constantly targeting the best raw performance
well adapted to modern OS mechanisms
providing multiple persistence mechanisms, because the "one size fits all" approach is only a dream.
providing the building blocks for HA mechanisms (replication system for instance)
avoiding stacking up useless abstraction layers like pancakes
resulting in a clean and understandable code base that any good C developer can be comfortable with

How to avoid Boost ASIO reactor becoming constrained to a single core?

TL;DR: Is it possible that I am reactor throughput limited? How would I tell? How expensive and scalable (across threads) is the implementation of the io_service?
I have a fairly massively parallel application, running on a hyperthreaded dual-quad-core Xeon machine with tons of RAM and a fast SSD RAID. It is developed using boost::asio.
This application accepts connections from about 1,000 other machines, reads data, decodes a simple protocol, and shuffles data into files mapped using mmap(). The application also pre-fetches "future" mmap pages using madvise(WILLNEED) so it's unlikely to be blocking on page faults, but just to be sure, I've tried spawning up to 300 threads.
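For reference, that mapping/prefetch pattern is roughly the following sketch; the helper name and parameters are made up, and the offset must be page-aligned:

    #include <sys/mman.h>

    // Map a window of the data file and ask the kernel to start faulting in
    // those pages now, so later writes are unlikely to block on page faults.
    void* map_and_prefetch(int fd, off_t offset, size_t window_size)
    {
        void* p = mmap(nullptr, window_size, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, offset);
        if (p == MAP_FAILED) return nullptr;

        madvise(p, window_size, MADV_WILLNEED); // prefetch hint only
        return p;
    }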
This is running on Linux kernel 2.6.32-27-generic (Ubuntu Server x64 LTS 10.04). Gcc version is 4.4.3 and boost::asio version is 1.40 (both are stock Ubuntu LTS).
Running vmstat, iostat and top, I see that disk throughput (both in TPS and data volume) is in the single digits of a percent. Similarly, the disk queue length is always a lot smaller than the number of threads, so I don't think I'm I/O bound. Also, the RSS climbs but then stabilizes at a few gigs (as expected) and vmstat shows no paging, so I imagine I'm not memory bound. CPU is constant at 0-1% user, 6-7% system and the rest idle. Clue! One full "core" (remember hyper-threading) is 6.25% of the CPU.
I know the system is falling behind, because the client machines block on TCP send when more than 64kB is outstanding, and report the fact; they all keep reporting this fact, and throughput to the system is much less than desired, intended, and theoretically possible.
My guess is I'm contending on a lock of some sort. I use an application-level lock to guard a look-up table that may be mutated, so I sharded this into 256 top-level locks/tables to break that dependency. However, that didn't seem to help at all.
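For reference, the 256-way sharding described above amounts to something like this sketch (key and value types are placeholders, and C++11 std::mutex is used here for brevity):

    #include <array>
    #include <functional>
    #include <mutex>
    #include <string>
    #include <unordered_map>

    // Hashing the key picks one of 256 independent mutex/table pairs, so
    // unrelated lookups no longer contend on a single lock.
    class ShardedTable {
    public:
        void put(const std::string& key, int value) {
            Shard& s = shard_for(key);
            std::lock_guard<std::mutex> lock(s.mutex);
            s.map[key] = value;
        }
        bool get(const std::string& key, int& value) {
            Shard& s = shard_for(key);
            std::lock_guard<std::mutex> lock(s.mutex);
            auto it = s.map.find(key);
            if (it == s.map.end()) return false;
            value = it->second;
            return true;
        }
    private:
        struct Shard {
            std::mutex mutex;
            std::unordered_map<std::string, int> map;
        };
        Shard& shard_for(const std::string& key) {
            return shards_[std::hash<std::string>()(key) & 0xFF]; // 256 shards
        }
        std::array<Shard, 256> shards_;
    };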
All threads go through one global io_service instance. Running strace on the application shows that it spends most of its time dealing with futex calls, which I imagine have to do with the event-based implementation of the io_service reactor.
Is it possible that I am reactor throughput limited? How would I tell? How expensive and scalable (across threads) is the implementation of the io_service?
EDIT: I didn't initially find this other thread because it used a set of tags that didn't overlap mine :-/ It is quite possible my problem is excessive locking used in the implementation of the boost::asio reactor. See C++ Socket Server - Unable to saturate CPU
However, the question remains: How can I prove this? And how can I fix it?
The answer is indeed that even the latest boost::asio only calls into the epoll file descriptor from a single thread, not entering the kernel from more than one thread at a time. I can kind of understand why, because thread safety and object lifetimes become extremely precarious when multiple threads can each get notifications for the same file descriptor. When I code this up myself (using pthreads), it works and scales beyond a single core. I'm not using boost::asio at that point; it's a shame that an otherwise well-designed and portable library should have this limitation.
I believe that if you use multiple io_service objects (say, one per CPU core), each run by a single thread, you will not have this problem. See HTTP server example 2 on the Boost.Asio page.
I have done various benchmarks against the server example 2 and server example 3 and have found that the implementation I mentioned works the best.
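For illustration, a minimal sketch of that io_service-per-core scheme with the Boost 1.40-era API (the round-robin hand-off of accepted sockets is only indicated in a comment):

    #include <boost/asio.hpp>
    #include <boost/bind.hpp>
    #include <boost/shared_ptr.hpp>
    #include <boost/thread.hpp>
    #include <vector>

    // Each io_service is driven by exactly one thread.
    static void run_service(boost::shared_ptr<boost::asio::io_service> ios)
    {
        ios->run();
    }

    int main()
    {
        unsigned n = boost::thread::hardware_concurrency();
        if (n == 0) n = 2; // fall back if the core count cannot be detected

        std::vector<boost::shared_ptr<boost::asio::io_service> > services;
        std::vector<boost::shared_ptr<boost::asio::io_service::work> > work;
        boost::thread_group threads;

        for (unsigned i = 0; i < n; ++i) {
            boost::shared_ptr<boost::asio::io_service> ios(new boost::asio::io_service);
            services.push_back(ios);
            // The 'work' object keeps run() from returning while the service is idle.
            work.push_back(boost::shared_ptr<boost::asio::io_service::work>(
                new boost::asio::io_service::work(*ios)));
            threads.create_thread(boost::bind(&run_service, ios));
        }

        // An acceptor hosted on services[0] would hand each newly accepted socket
        // to services[next++ % n], so every connection stays on one reactor/thread.

        threads.join_all();
        return 0;
    }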
In my single-threaded application, I found out from profiling that a large portion of the processor instructions was spent on locking and unlocking by io_service::poll(). I disabled the lock operations with the BOOST_ASIO_DISABLE_THREADS macro. It may make sense for you too, depending on your threading situation.
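For reference, the macro must be visible before any Asio header is included (typically as a compiler-level define so it applies to every translation unit), and it is only safe if every io_service really is confined to a single thread:

    // Either pass -DBOOST_ASIO_DISABLE_THREADS to the compiler, or define it
    // before the first Asio include in every translation unit.
    #define BOOST_ASIO_DISABLE_THREADS
    #include <boost/asio.hpp>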