Why is there a VK_PIPELINE_STAGE_2_INDEX_INPUT_BIT and do I have to use it? - vulkan

According to the docs at Khronos.org the PIPELINE_STAGE_VERTEX_INPUT_BIT is:
VK_PIPELINE_STAGE_VERTEX_INPUT_BIT specifies the stage of the pipeline where vertex and index buffers are consumed
So that flag covers both vertex input and index input. However I've seen a:
VK_PIPELINE_STAGE_2_INDEX_INPUT_BIT
in my Vulkan header. Have they now been separated into two separate flags? And do I have to use them to flush or invalidate caches with respect to indices?

VK_PIPELINE_STAGE_VERTEX_INPUT_BIT is part of VkPipelineStageFlagBits. It covers both vertex and index buffers.
VK_PIPELINE_STAGE_2_INDEX_INPUT_BIT is part of VkPipelineStageFlagBits2. It covers only index buffers.
Have they now been separated into two separate flags?
There are two flags in the new enumeration, one for vertex and one for index. They are, respectively
VK_PIPELINE_STAGE_2_VERTEX_ATTRIBUTE_INPUT_BIT for vertex buffers only
VK_PIPELINE_STAGE_2_INDEX_INPUT_BIT for index buffers only
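However, there is also a VK_PIPELINE_STAGE_2_VERTEX_INPUT_BIT definition, which the specification defines as equivalent to the logical OR of those two stages (it has its own bit value rather than being the numeric OR of the two), so it acts just like VK_PIPELINE_STAGE_VERTEX_INPUT_BIT in the older versions of the barrier functions.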
do I have to use them to flush or invalidate caches with respect to indices?
It depends on what function you're calling. If you're calling vkCmdPipelineBarrier, you don't need to think about any of this, because it still uses VkPipelineStageFlagBits.
If you're calling vkCmdPipelineBarrier2 then you need to use the newer VkPipelineStageFlagBits2. But if you don't care about the distinction between the index and vertex buffers in the pipeline, then you could just swap out VK_PIPELINE_STAGE_VERTEX_INPUT_BIT with VK_PIPELINE_STAGE_2_VERTEX_INPUT_BIT directly and everything should work the same as before, because VK_PIPELINE_STAGE_2_VERTEX_INPUT_BIT intentionally covers both vertex and index buffers.

Did anyone write custom Affinity function?

I want all nodes in a cluster to carry an equal data load. With the default affinity function this is not happening.
As of now, we have 3 nodes. We use the group ID as the affinity key, and we have 3 group IDs (1, 2 and 3). We also limit the number of cache partitions to the number of group IDs, so overall nodes = group IDs = cache partitions, in the hope that each node gets an equal number of partitions.
Would it be okay to write a custom affinity function? What would we lose by doing so? Has anyone written a custom affinity function?
The affinity function doesn't guarantee an even distribution across all nodes. It's statistical... and three values isn't really enough to make sure the data is "fairly" distributed.
So, yes, writing a new affinity function would work. The downsides being you need to make it fast (it's called a lot) and you'd be hard-coding it to your current node topology. What happens when you choose to add a new node? What happens when a node fails? Also, you'd be potentially putting all your data into three partitions, which makes it harder to scale out (one of the main advantages of Ignite's architecture).
As an alternative, I'd look at your data model. Splitting your data into three chunks is too coarse for things to work automatically.

Redis ZRANGEBYLEX command complexity

According to the documentation for the ZRANGEBYLEX command, if keys are stored in a sorted set with a score of zero, they can later be retrieved in lexicographical order. The complexity of ZRANGEBYLEX is given as O(log(N)+M), where N is the total number of elements and M is the size of the result set. The documentation has some information about string comparison, but says nothing about the structure in which the elements are stored.
But after some experiments and reading the source code, it appears that ZRANGEBYLEX performs a linear-time search in which every element in the ziplist is matched against the request. If so, the complexity is higher than described above, about O(N), because every element in the ziplist is scanned.
After debugging with gdb, it's clear that the ZRANGEBYLEX command is implemented in the genericZrangebylexCommand function. Control flow continues at eptr = zzlFirstInLexRange(zl,&range);, so most of the work of retrieving elements is done in the zzlFirstInLexRange function. All the naming and the subsequent control flow suggest that a ziplist structure is used, and all comparisons with the input operands are done sequentially, element by element.
Inspecting memory after inserting well-known keys into the Redis store, it seems that ZSET elements really are stored in a ziplist; a byte-by-byte comparison against the known keys confirms it.
So the question is: how can the documentation be wrong and claim logarithmic complexity where linear complexity appears? Or maybe the ZRANGEBYLEX command works slightly differently? Thanks in advance.
how can the documentation be wrong and claim logarithmic complexity where linear complexity appears?
The documentation has been wrong on more than a few occasions, but it is an ongoing open source effort that you can contribute to via the repository (https://github.com/antirez/redis-doc).
Or maybe the ZRANGEBYLEX command works slightly differently?
Your conclusion is correct in the sense that Sorted Set search operations, whether lexicographical or not, exhibit linear time complexity when Ziplists are used for encoding them.
However.
Ziplists are an optimization that trades CPU time for memory savings, meaning it is meant for use on small sets (i.e. low N values). It is controlled via configuration (see the zset-max-ziplist-entries and zset-max-ziplist-value directives), and once the data grows above the specified thresholds the ziplist encoding is converted to a skip list.
Because ziplists are small (little Ns), their complexity can be assumed to be constant, i.e. O(1). On the other hand, due to their nature, skip lists exhibit logarithmic search time. IMO that means that the documentation's integrity remains intact, as it provides the worst case complexity.
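To make the two complexities above concrete, here is a minimal C++ sketch. It is not Redis code: a sorted std::vector scanned from the front stands in for the ziplist encoding, and std::set (a balanced tree, swapped in because the standard library has no skip list) stands in for the large encoding. The function names rangeByLexLinear and rangeByLexTree are hypothetical, and both range ends are treated as inclusive.

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Linear scan over a small sorted sequence: finding the start of the range
// costs O(N), which is the behaviour observed for ziplist-encoded sets.
std::vector<std::string> rangeByLexLinear(const std::vector<std::string>& sorted,
                                          const std::string& min,
                                          const std::string& max) {
    assert(std::is_sorted(sorted.begin(), sorted.end()));
    std::vector<std::string> out;
    for (const std::string& member : sorted) {
        if (member < min) continue;  // still before the range
        if (member > max) break;     // past the range, stop early
        out.push_back(member);
    }
    return out;
}

// Ordered-structure lookup: O(log N) to find the start of the range,
// then O(M) to walk the M matching members.
std::vector<std::string> rangeByLexTree(const std::set<std::string>& members,
                                        const std::string& min,
                                        const std::string& max) {
    std::vector<std::string> out;
    for (auto it = members.lower_bound(min);
         it != members.end() && *it <= max; ++it) {
        out.push_back(*it);
    }
    return out;
}
```

Both functions return the same members; only the cost of locating the first one differs, and that difference only matters once N is large enough that the small encoding would no longer be in use.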

How to represent a binary relation

I plan to make a class that represents a strict partially ordered set, and I assume the most natural way to model its interface is as a binary relation. This gives functions like:
bool test(elementA, elementB); //return true if elementA < elementB
void set(elementA, elementB); //declare that elementA < elementB
void clear(elementA, elementB); //forget that elementA < elementB
and possibly functions like:
void applyTransitivity(); //if test(a,b) and test(b, c), then set(a, c)
bool checkIrreflexivity(); //return true if for no a, a < a
bool checkAsymmetry(); //return true if for no a and b, a < b and b < a
The naive implementation would be to have a list of pairs such that (a, b) indicates a < b. However, it's probably not optimal. For example, test would be linear time. Perhaps it could be better done as a hash map of lists.
Ideally, though, the in-memory representation would by its nature enforce applyTransitivity to always be "in effect" and not permit the creation of edges that cause reflexivity or symmetry. In other words, the degrees of freedom of the data structure represent the degrees of freedom of a strict poset. Is there a known way to do this? Or, more realistically, is there a means of checking for cycles and maintaining transitivity that is amortized and iterative with each call to set and clear, so that the cost of enforcing correctness is low? Is there a working implementation?
Okay, let's talk about achieving bare metal-scraping micro-efficiency, and you can choose how deep down that abyss you want to go. At this architectural level, there are no data structures like hash maps and lists, there aren't even data types, just bits and bytes in memory.
As an aside, you'll also find a lot of info on representations here by looking into common representations of DAGs. However, most of the common reps are designed more for convenience than efficiency.
Here, we want the data for a to be fused with that adjacency data into a single memory block. So you want to store the 'list', so to speak, of items that have a relation to a in a's own memory block so that we can potentially access a and all the elements related to a within a single cache line (bonus points if those related elements might also fit in the same cache line, but that's an NP-hard problem).
You can do that by storing, say, 32-bit indices in a. We can model such objects like so if we go a little higher level and use C for exemplary purposes:
struct Node
{
    // node data
    ...
    int links[]; // variable-length struct
};
This makes the Node a variable-length structure whose size and potentially even address changes, so we need an extra level of indirection to get stability and avoid invalidation, like an index to an index (if you control the memory allocator/array and it's purely contiguous), or an index to a pointer (or reference in some languages).
That makes your test function still involve a linear time search, but linear with respect to the number of elements related to a, not the number of elements total. Because we used a variable-length structure, a and its neighbor indices will potentially fit in a single cache line, and it's likely that a will already be in the cache just to make the query.
It's similar to the basic idea you had of the hash map storing lists, but without the explosion of lists overhead and without the hash lookup (which may be constant time but not nearly as fast as just accessing the connections to a from the same memory block). Most importantly, it's far more cache-friendly, and that's often going to make the difference between a few cycles and hundreds.
Now this means that you still have to roll up your sleeves and check for things like cycles yourself. If you want a data structure that more directly and conveniently models the problem, you'll find a nicer fit with graph data structures revolving around a formalization of a directed edge. However, those are much more convenient than they are efficient.
If you need the container to be generic and a can be any given type, T, then you can always wrap it (using C++ now):
template <class T>
struct Node
{
    T node_data;
    int links[1]; // VLS, not necessarily actually storing 1 element
};
And still fuse this all into one memory block this way. We need placement new here to preserve those C++ object semantics and possibly keep an eye on alignment here.
Transitivity checks always involve a search of some sort (breadth first or depth first). I don't think there's any rep that avoids this unless you want to memoize/cache a potentially massive explosion of transitive data.
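As a rough sketch of that search-based approach, the class below stores only direct edges as 32-bit indices and answers test with an iterative depth-first search. It also rejects any set call that would create a cycle, so irreflexivity and asymmetry hold by construction. For readability it uses a plain vector of vectors instead of the fused variable-length blocks described above, so it shows the logic rather than the memory layout. The class name StrictPoset and its boolean return from set are assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Strict partial order kept as a DAG: links_[a] lists every b with a direct
// edge a < b. Transitive pairs are found by search instead of being stored.
class StrictPoset {
public:
    explicit StrictPoset(std::size_t n) : links_(n) {}

    // True if a < b directly or through a chain of edges (iterative DFS).
    bool test(std::uint32_t a, std::uint32_t b) const {
        assert(a < links_.size() && b < links_.size());
        std::vector<bool> seen(links_.size(), false);
        std::vector<std::uint32_t> stack{a};
        seen[a] = true;
        while (!stack.empty()) {
            std::uint32_t cur = stack.back();
            stack.pop_back();
            for (std::uint32_t next : links_[cur]) {
                if (next == b) return true;
                if (!seen[next]) {
                    seen[next] = true;
                    stack.push_back(next);
                }
            }
        }
        return false;
    }

    // Declare a < b. Returns false and changes nothing when a == b or when
    // b < a already holds, since either would create a cycle.
    bool set(std::uint32_t a, std::uint32_t b) {
        if (a == b || test(b, a)) return false;
        if (!hasDirect(a, b)) links_[a].push_back(b);
        return true;
    }

    // Forget only the direct edge a < b; a < b may still hold via a chain.
    void clear(std::uint32_t a, std::uint32_t b) {
        assert(a < links_.size());
        auto& out = links_[a];
        out.erase(std::remove(out.begin(), out.end(), b), out.end());
    }

private:
    bool hasDirect(std::uint32_t a, std::uint32_t b) const {
        const auto& out = links_[a];
        return std::find(out.begin(), out.end(), b) != out.end();
    }

    std::vector<std::vector<std::uint32_t>> links_;
};
```

The cost of each set is one reachability search from b, which is the price of never having to run a separate checkAsymmetry pass. Note that clear removes only the direct edge, so a relation implied by other edges survives it.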
At this point you should have something pretty fast if you want to go this deep down the abyss and have a solution that's really hard to maintain and understand. I've unfortunately found that this doesn't impress the ladies very much as with the case of having a car that goes really fast, but it can make your software go really, really fast.

SHA1-Indexed Hash table in D

I'm using a D builtin hash table indexed by SHA1-digests (ubyte[20]) to relate information in my file system search engine.
Are there any data structures more suitable for this (in D) because of all the nice properties of such a key (uniformly distributed, random, fixed-sized), or will D's builtin hash tables automatically figure out that they could, for example, just pick the first n (1-8) bytes of a SHA1-digest as a bucket index?
I think the hash function used inside standard maps is trivial enough (cost wise) that it won't make much if any difference unless you are running code that is mostly look-ups. Keep in mind that the full key will be read to do the final comparison so it will get loaded into the cache either way.
OTOH I think there is a toHash method you can define if you wrap the digest in a struct.
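The idea from the question, using the leading bytes of the digest as the hash value, can be sketched as follows. This is C++ rather than D, purely as an illustration of the technique and not of D's builtin associative arrays; the names Sha1Digest, DigestPrefixHash, DigestTable and makeDigest are hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <unordered_map>
#include <vector>

using Sha1Digest = std::array<std::uint8_t, 20>;

// A SHA-1 digest is already uniformly distributed, so its leading bytes can
// be used as the hash value without any further mixing.
struct DigestPrefixHash {
    std::size_t operator()(const Sha1Digest& d) const noexcept {
        static_assert(sizeof(std::size_t) <= sizeof(Sha1Digest),
                      "the prefix must fit inside the digest");
        std::size_t h = 0;
        std::memcpy(&h, d.data(), sizeof(h));  // first 4 or 8 bytes
        return h;
    }
};

// Key equality still compares all 20 bytes (std::array's operator==).
using DigestTable = std::unordered_map<Sha1Digest, std::string, DigestPrefixHash>;

// Copies exactly 20 raw bytes into a digest key.
Sha1Digest makeDigest(const std::vector<std::uint8_t>& bytes) {
    Sha1Digest d{};
    assert(bytes.size() == d.size());
    std::copy(bytes.begin(), bytes.end(), d.begin());
    return d;
}
```

As the answer points out, the full key is still read for the final comparison, so any gain from the cheaper hash is likely to be small unless the workload is dominated by look-ups.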

What is the cost of object reference in Scala?

Assume we build an object to represent some network (social, wireless, whatever). So we have some 'node' object to represent the KIND of network, different nodes might have different behaviors and so forth. The network has a MutableList of nodes.
But each node has neighbors, and these neighbors are also nodes. So somewhere, there has to be a list, per node, of all of the neighbors of that node--or such a list has to be generated on the fly whenever it is needed. If the list of neighbors is stored in the node objects, is it cheaper to store it (a) as a list of nodes, or (b) as list of numbers that can be used to reference nodes out of the network?
Some code for clarity:
import scala.collection.mutable.MutableList

// approach (a): each node holds direct references to its neighbors
class Network {
  val nodes = new MutableList[Node]
  // other stuff //
}
class Node {
  val neighbors = new MutableList[Node]
  // other stuff //
}
// approach (b): each node holds indices into the network's node list
class Network {
  val nodes = new MutableList[Node]
  val indexed_list = //(some function to get an indexed list off nodes)
  // other stuff //
}
class Node {
  val neighbors = new MutableList[Int]
  // other stuff //
}
Approach (a) seems like the easiest. My first question is whether this is costly in Scala 2.8, and the second is whether it breaks the principle of DRY?
Short answer: premature optimization is the root of etc. Use the clean reference approach. When you have performance issues there's no substitute for profiling and benchmarking.
Long answer: Scala uses the exact same reference machinery as Java so this is really a JVM question more than a Scala question. Formally the JVM spec doesn't say one word about how references are implemented. In practice they tend to be word sized or smaller pointers that either point to an object or index into a table that points to the object (the latter helps garbage collectors).
Either way, an array of refs is about the same size as an array of ints on a 32 bit vm or about double on a 64bit vm (unless compressed-oops are used). That doubling might be important to you or might not.
If you go with the ref based approach, each traversal from a node to a neighbor is a reference indirection. With the int based approach, each traversal from a node to a neighbor is a lookup into a table and then a reference indirection. So the int approach is more expensive computationally. And that's assuming you put the ints into a collection that doesn't box the ints. If you do box the ints then it's just pure craziness because now you've got just as many references as the original AND you've got a table lookup.
Anyway, if you go with the reference based approach then the extra references can make a bit of extra work for a garbage collector. If the only references to nodes lie in one array then the gc will scan that pretty damn fast. If they're scattered all over in a graph then the gc will have to work harder to track them all down. That may or may not affect your needs.
From a cleanliness standpoint the ref based approach is much nicer. So go with it and then profile to see where you're spending your time. That or benchmark both approaches.
The question is - what kind of a cost? Memory-wise, the b) approach would probably end up consuming more memory, since you have both mutable lists, and boxed integers in that list, and another global structure holding all the indices. Also, it would probably be slower because you would need several levels of indirection to reach the neighbour node.
One important note - as soon as you start storing integers into mutable lists, they will undergo boxing. So, you will have a list of heap objects in both cases. To avoid this, and furthermore to conserve memory, in the b) approach you would have to keep a dynamically grown array of integers that are the indices of the neighbours.
Now, even if you modify the approach b) as suggested above, and make sure the indexed list in the Network class is really an efficient structure (a direct lookup table or a hash table), you would still pay an indirection cost to find your Node. And memory consumption would still be higher. The only benefit I see is in keeping some sort of a table of weak references if you're concerned you might run out of memory, and recreate the Node object when you need it and you cannot find it in your indexed_list which keeps a set of weak references.
This is, of course, just a hypothesis, you would have to profile/benchmark your code to see the difference.
My suggestion would be to use something like an ArrayBuffer in Node and use it to store direct references to nodes.
If memory concerns are an issue, and you want to do the b) approach together with weak references, then I would further suggest rolling in your own dynamically grown integer-array for neighbours, to avoid boxing with ArrayBuffer[Int].