Inactive invocations in subgroups in Vulkan

I am reading the vulkan subgroup tutorial and it mentions that if the local workgroup size is less than the subgroup size, then we will always have inactive invocations.
This post clarifies that there is no direct relation between SubgroupLocalInvocationId and LocalInvocationId. If there is no relation between the subgroup and local workgroup IDs, how does a small local workgroup size guarantee inactive invocations?
My guess is as follows:
I am thinking that the invocations (threads) in a workgroup are divided into subgroups before executing on the GPU. Each subgroup would be an exact match for the basic unit of execution on the GPU (warp for an NVIDIA GPU). This means that if the workgroup size is smaller than the subgroup size then the system somehow tries to construct a minimal subgroup which can be executed on the GPU. This would require using some "inactive/dead" invocations just to meet the minimum subgroup size criteria leading to the aforementioned guaranteed inactive invocations. Is this understanding correct? (I deliberately tried to use basic words for simplicity, please let me know if any of the terminology is incorrect)
Thanks

A compute dispatch defines the global workgroup through its parameters. The global workgroup has x×y×z invocations.
Those invocations are divided into local workgroups (whose size is defined by the shader). A local workgroup likewise has its own x×y×z invocations.
A local workgroup is partitioned into subgroups: its invocations are rearranged into subgroups. A subgroup has a (one-dimensional) SubgroupSize number of invocations, and not all of them need to be assigned a local workgroup invocation. A subgroup must not span multiple local workgroups; it can only use invocations from a single local workgroup.
Beyond that, how this partitioning is done is largely unspecified, except that under very specific conditions you are guaranteed full subgroups, meaning none of the SubgroupSize invocations in a subgroup stays vacant. If those conditions are not fulfilled, the driver may leave invocations in a subgroup inactive as it sees fit.
If the local workgroup has fewer invocations in total than SubgroupSize, then some invocations of the subgroup do indeed need to stay inactive, because there are not enough local workgroup invocations to fill even one subgroup.
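The arithmetic behind that last point can be sketched directly (a minimal illustration, not any API; the function name and dimensions are made up for the example):

```python
import math

def subgroup_occupancy(local_size, subgroup_size):
    """How many subgroups a local workgroup needs, and how many
    subgroup slots are guaranteed to stay inactive.

    local_size: (x, y, z) local workgroup dimensions
    subgroup_size: the SubgroupSize reported by the device
    """
    invocations = local_size[0] * local_size[1] * local_size[2]
    # Subgroups never span local workgroups, so round up.
    subgroups = math.ceil(invocations / subgroup_size)
    inactive = subgroups * subgroup_size - invocations
    return subgroups, inactive

# A 4x4x1 workgroup (16 invocations) on a device with SubgroupSize 32:
# one subgroup is created and 32 - 16 = 16 of its slots stay inactive.
print(subgroup_occupancy((4, 4, 1), 32))   # (1, 16)
```

This is only the guaranteed lower bound; a driver may leave additional invocations inactive when the full-subgroups conditions are not met.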

Efficient combination of a large number of affinity calls in Apache Ignite

We would like to compute on a large, partitionable dataset of 'products' in Ignite (100,000+ products, each linked to a large amount of extra data in different caches). We have several use cases:
1) Launch a compute job, limited to a large number (hundreds) of products, with a strong focus on responsiveness (<200 ms). We can use the product ID as an affinity key to collocate all extra data with the products. But affinityRun only allows a single key to be specified, which would mean launching hundreds of compute jobs. Ideally we would do an affinityRun on the entire set of product IDs at once and let Ignite distribute the compute job to the relevant nodes, but we are struggling to find a way to do this. (The compute job would then use local queries only on those compute nodes.)
2) Launch a compute job over the entire space of products in an efficient manner. We could launch the compute job on each compute node and use local queries, but that would no longer give us the benefits of falling back to backup partitions in case a primary partition is unavailable. This is an extreme case of problem number 1, just with a huge (all) number of product IDs as input.
We've been brainstorming about this for a while now, but it seems like we're missing something. Any ideas?
There is a version of affinityRun that takes a partition number as a parameter. Distribute your task per partition, and each node on the receiving end will process the data residing in that partition (just run a scan query for the partition). In case of failure, you simply restart the process for a partition and can filter out already-processed items with custom logic.
An affinity job is simply one that executes on the data node where the key/value resides.
There are several ways to send a job to a particular node, not only via an affinity key. For example, you can target a node by its consistent ID, and in 2.4.10 (if I remember correctly) they added a way to query backups explicitly.
Regarding your scenario, I can think of the following solution:
SqlFieldsQuery query = new SqlFieldsQuery("select productID from CacheTable").setLocal(true);
You can prepare an affinity job with the above SQL, where you select all products (from that node only), iterate over them, and run all further queries locally to gather the product information. Send that job to the required node, do your computation, reduce the result, and return it to the client.
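The per-partition fan-out both answers describe can be sketched language-agnostically (in real Java code you would use the partition-number overload of IgniteCompute.affinityRun; the modulo affinity function and names below are hypothetical stand-ins for the cache's actual affinity function):

```python
NUM_PARTITIONS = 1024  # Ignite's default RendezvousAffinityFunction partition count

def partition_of(product_id, num_partitions=NUM_PARTITIONS):
    # Hypothetical stand-in for the cache's real affinity function.
    return hash(product_id) % num_partitions

def fan_out(product_ids):
    """Group the requested product IDs by partition so that one job per
    partition can be sent to whichever node currently owns that partition."""
    jobs = {}
    for pid in product_ids:
        jobs.setdefault(partition_of(pid), []).append(pid)
    return jobs  # {partition_number: [product IDs handled by that job]}

# Hundreds of products collapse into at most NUM_PARTITIONS jobs, and a
# failed partition can be retried independently against its backup copy.
jobs = fan_out(["p1", "p2", "p3"])
```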

What Algorithm to Use for a List Swap with No Duplication

I have a scenario where a number of organizations have joined together to swap names to increase the size of their respective mailing lists. I have a pool of names, each tagged to their source organization. Each org has provided a subset of their full list, and is entitled to an output of up to the number of names they put into the pool, but with these restrictions:
an org should not receive a name already on their full list
each name should only be swapped once (should be allocated to only a single output grouping)
Separate from this pool I have the full lists of each organization (hashed to obfuscate the details, but not particularly relevant to the question), so I have the data necessary to determine which records are available to swap to each list.
My question: is there a grouping/clustering algorithm that applies in this case, where we want to cluster records based on the organization(s) entitled to them, with the aforementioned requirements that each record is distributed to a single group and each group's size does not exceed the number of records originally sourced from that organization?
The data is currently in MySQL tables and I will likely use Node/JS for implementation, but I'm mainly seeking general advice on the algorithm.
Thanks in advance for any advice you might have!
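The constraints read like a bipartite assignment problem (which max-flow solves optimally); a minimal greedy sketch, with hypothetical data shapes, shows how the restrictions interact:

```python
def swap(pool, full_lists, quotas):
    """pool: list of (name, source_org) pairs contributed to the pool.
    full_lists: org -> set of names already on that org's full list.
    quotas: org -> maximum number of names it may receive.

    Greedy first-fit: each name is allocated at most once, never back to
    its source, never to an org that already has it, never over quota.
    A max-flow formulation would guarantee a maximum total allocation.
    """
    out = {org: [] for org in quotas}
    for name, source in pool:
        for org in quotas:
            if (org != source
                    and len(out[org]) < quotas[org]
                    and name not in full_lists[org]):
                out[org].append(name)   # allocated exactly once, then stop
                break
    return out
```

Greedy can leave some names unallocated that a flow-based solver would place, but it respects every stated restriction.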

SQL: how does parallelism of a join operation exactly work in a "shared nothing" architecture?

I'm currently reading a book about parallelism in a DBMS and I find it difficult to understand how parallelism works exactly in the join operation.
Suppose we have 10 systems; each system has its own disk space and main memory. There is a network through which the systems can communicate with each other, for example to share data.
Now suppose we have the following operation: A(X,Y) JOIN B(Y,Z)
The tables A and B are too big, so we want to use parallelism to gain better overall computing speed.
What we do is apply a hash function to the 'Y' attribute of each record of tables A and B, and send the records to different systems accordingly. Each system can then use a local algorithm to join the records that it received.
What I don't understand is, where exactly is the initial hash function being applied and where exactly are the initial tables A and B being stored?
While I was reading I thought that we had another "main" system, which had also its own disk space, and in this space we had all the initial information, which is table A and B with all their records. This system used its own main memory in order to apply the initial hash function, which determined for each record the system out of the total 10 where it will eventually go and be processed.
However, upon reading I got stuck on the following example (translated from Greek):
Let's say we have two tables R(X,Y) JOIN S(Y,Z), where R has 1000 pages and S has 500 pages. Suppose that we have 10 systems that can be used in parallel. So we start by using a hash function to determine where we should send each record. The total amount of I/O needed to read tables R and S is 1500 pages, which is 150 per system. Each system will have about 15 pages of data destined for each of the remaining systems, so it sends 135 pages to the other nine systems. Hence the total communication is 1350 pages.
I don't really understand the last part (each system sending 135 pages to the other nine). Why would a system have to send any data to the other systems? Doesn't the "main" system I was talking about previously do this job?
I imagine something like this:
main_system
||
\/
apply_hash(record)
||
\/
send record to the appropriate system
/ / / / / / / / / /
s1 s2 s3 s4 s5 s6 s7 s8 s9 s10
Now all systems have their own records; they apply the local algorithm and produce the result. There is no communication between the systems. What am I missing here? Does the book use a different approach, and if so, what kind? I have read the same unit three times and I still don't get it (maybe it's a bad translation, I'm not sure).
thanks in advance
In a shared-nothing system, the data is typically partitioned across the processors when the data is created. Although databases can be shared-nothing, probably the best documentation is for Hadoop and HDFS, the Hadoop distributed file system.
A function assigns rows to a partition. Some examples are: round-robin, where new rows are assigned to the processors one after the other; range-based, where rows are assigned to a processor based on the value of a column; hash-based, where rows are assigned to a processor based on a hash of the value. The process of partitioning the data is very similar to "partitioning" in databases such as SQL Server and Oracle which are not in a shared-nothing environment.
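The three partitioning functions named above can be sketched in a few lines (illustrative stand-ins, not any particular system's implementation):

```python
import hashlib

def round_robin(row_index, num_nodes):
    # New rows are dealt to processors one after the other.
    return row_index % num_nodes

def range_based(value, boundaries):
    # boundaries is a sorted list of upper bounds; the row goes to the
    # first range its column value falls into.
    for node, upper in enumerate(boundaries):
        if value < upper:
            return node
    return len(boundaries)   # last node takes everything above the top bound

def hash_based(value, num_nodes):
    # A stable hash of the column value picks the processor; equal values
    # always land on the same node, which is what makes hash joins local.
    digest = hashlib.md5(str(value).encode()).hexdigest()
    return int(digest, 16) % num_nodes
```

The hash-based variant is the one the join example in the question relies on: every record with the same Y value ends up on the same system.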
If your join uses the partition key for both tables, and the partitioning method is the same, then the data is already local. Otherwise, one or both tables need to be redistributed to continue the processing.
In the section that you quote, you are probably confused by the arithmetic. Remember that if you have 1500 pages across 10 processors, each will have on average 150 pages. These pages need to be redistributed. Say you are processor 3. About 15 pages will go to processor 1; another 15 to processor 2. And another 15 to processor 3. Wait! You don't have to send those; they are already in the right spot. You only have to send 9 × 15 = 135 pages to the other processors.
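The arithmetic can be checked directly: with a uniform hash, each processor keeps the roughly one-tenth of its local pages that hash back to itself and ships the rest over the network (a toy calculation, assuming a perfectly uniform distribution):

```python
def redistribution_cost(total_pages, num_nodes):
    """Pages held, pages sent per node, and total network traffic for a
    uniform hash repartitioning in a shared-nothing system."""
    pages_per_node = total_pages // num_nodes     # 1500 / 10 = 150 local pages
    kept_locally = pages_per_node // num_nodes    # ~15 hash back to this node
    sent_per_node = pages_per_node - kept_locally # 135 cross the network
    return pages_per_node, sent_per_node, sent_per_node * num_nodes

print(redistribution_cost(1500, 10))   # (150, 135, 1350)
```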
The key idea is that in a shared-nothing environment, the same processors that store the data also do the processing.
I'd take a wild guess that your "main" system is your local client, since it has a connection to all machines.

What is the difference/relationship between extent and allocation unit?

Can you explain the difference -or relationship- between 'Extent' and 'Allocation Unit' in SQL?
An allocation unit is basically just a set of pages. It can be small (one page) or large (many, many pages). It has a metadata entry in sys.allocation_units and is tracked by an IAM chain. The most common use of allocation units is the three well-known AUs of a rowset: IN_ROW_DATA, ROW_OVERFLOW_DATA and LOB_DATA.
An extent is any 8 consecutive pages that start from a page ID divisible by 8. SQL Server I/O is performed in an extent-aware fashion: ideally an entire extent is read in at once and an entire extent is written out at once. This is subject to the current state of the buffer pool; for details see How It Works: Bob Dorr's SQL Server I/O Presentation. Extents are usually allocated together, so all pages of an extent belong to the same allocation unit. But since this would lead to over-allocation for small tables, there is a special type of extent, the so-called 'mixed' extent, in which each page can belong to a separate allocation unit. For details see Inside The Storage Engine: GAM, SGAM, PFS and other allocation maps.
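The "starts from a page ID divisible by 8" rule means a page's extent is just its page ID integer-divided by 8 (a tiny illustration, with made-up function names):

```python
PAGES_PER_EXTENT = 8

def extent_of(page_id):
    # An extent is the run of 8 consecutive pages starting at a page ID
    # divisible by 8, so integer division identifies which extent a page is in.
    return page_id // PAGES_PER_EXTENT

def extent_start(page_id):
    # First page ID of the extent containing this page.
    return extent_of(page_id) * PAGES_PER_EXTENT

# Pages 16..23 all live in extent 2, which starts at page 16.
print([extent_of(p) for p in range(16, 24)])   # [2, 2, 2, 2, 2, 2, 2, 2]
```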
So as you see the concepts are related, but very different. Perhaps you should explain a bit what is the problem you're trying to solve or why are you interested in these concepts, perhaps we can then elaborate.
Each object (be it an index or a heap) has a number of partitions (1 to 15,000). Each partition can have three different allocation units: the HoBT (heap or b-tree, also jokingly known as the "hobbit"), where your actual data is stored; the LOB allocation unit for the LOB types; and the SLOB allocation unit for row-overflow data.
Pages belong to a certain allocation unit, and every page belongs to an extent, a group of 8 pages. In a uniform extent the pages always belong to the same object, while a mixed extent contains pages for different objects and potentially different allocation units.

Max memory per query

How can I configure the maximum memory that a query (a SELECT query) can use in SQL Server 2008?
I know there is a way to set the minimum value, but what about the maximum? I would like this because I run many processes in parallel. I know about the MAXDOP option, but that is for processors.
Update:
What I am actually trying to do is run a data load continuously. This data load is in ETL form (extract, transform and load). While the data is being loaded I want to run some SELECT queries, all of them expensive (containing GROUP BY). The most important process for me is the data load. I get an average speed of 10,000 rows/sec, and when I run the queries in parallel it drops to 4,000 rows/sec or even lower. I know that more details should be provided, but this is a more complex product that I work on and I cannot detail it further. Another thing I can guarantee is that my load speed does not drop due to lock problems, because I monitored for and removed them.
There isn't any way that I can think of to set a maximum memory at a per-query level.
If you are on Enterprise Edition you can use resource governor to set a maximum amount of memory that a particular workload group can consume which might help.
In SQL 2008 you can use Resource Governor to achieve this. There you can set request_max_memory_grant_percent to cap the memory (this is a percentage relative to the pool size specified by the pool's max_memory_percent value). This setting is not query-specific; it is session-specific.
In addition to Martin's answer
If your queries are all the same or similar, working on the same data, then they will be sharing memory anyway.
Example:
A busy web site with 100 concurrent connections running 6 different parametrised queries between them on broadly the same range of data.
6 execution plans
100 user contexts
one buffer pool with assorted flags and counters to show usage of each data page
If you have 100 different queries or they are not parametrised then fix the code.
Memory per query is something I've never thought or cared about since last millennium.