Vulkan WaW hazard & memory barrier

The Vulkan spec states:
Write-after-read hazards can be solved with just an execution dependency, but read-after-write and write-after-write hazards need appropriate memory dependencies to be included between them.
I thought that a WaW hazard could be solved with only an execution barrier as well.
Why do we need a memory barrier to solve WaW hazard if we're not going to read data?

Execution dependencies ensure the ordering of operations. Memory dependencies ensure the visibility of memory operations. These aren't the same thing.
In order for a write-after-write to work correctly, the second write has to happen after the first write, but also you have to make sure that the first write is visible to the operation doing the second write. Otherwise, it's possible for the second write to be overwritten by the first, even if the second write happened after the first.
If you want a more hardware-based explanation, consider what happens if the first write uses one cache, and the second write uses a separate cache from the first (GPUs have a lot of caches). Execution dependencies don't affect caches. So the second write's cache may write its data before the first write's cache does, which means that the first write will eventually overwrite the second.
Memory dependencies force data out of caches, thus ensuring that when the second write takes place, there isn't a write sitting around in a cache somewhere.
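As a concrete illustration, here is a minimal sketch in C (the command buffer `cmd` and dispatch size `groupsX` are hypothetical) of a WaW barrier between two compute dispatches that write the same buffer. The key point is that both access masks name the write: the source mask makes the first write available, and the destination mask makes it visible to the second.

```c
#include <vulkan/vulkan.h>

/* Record dispatch -> barrier -> dispatch, where both dispatches write the
 * same buffer. A bare execution barrier (zero memory barriers) would order
 * the two dispatches but would not make the first write available and
 * visible, so the WaW hazard would remain. */
static void recordWawSafe(VkCommandBuffer cmd, uint32_t groupsX)
{
    const VkMemoryBarrier barrier = {
        .sType         = VK_STRUCTURE_TYPE_MEMORY_BARRIER,
        .srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT, /* flush write #1 out of its cache */
        .dstAccessMask = VK_ACCESS_SHADER_WRITE_BIT, /* make it visible to write #2 */
    };

    vkCmdDispatch(cmd, groupsX, 1, 1);               /* first write */

    vkCmdPipelineBarrier(cmd,
        VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,        /* after dispatch #1 */
        VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,        /* before dispatch #2 */
        0,                                           /* dependencyFlags */
        1, &barrier, 0, NULL, 0, NULL);

    vkCmdDispatch(cmd, groupsX, 1, 1);               /* second write */
}
```

Passing zero barriers to vkCmdPipelineBarrier would give you only the execution dependency, which is exactly the insufficient case the spec is warning about.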

Pipeline barriers between transfer write commands

We have two transfer commands, vkCmdFillBuffer() followed by vkCmdCopyQueryPoolResults(). The transfer commands write to overlapping buffer ranges.
Is a pipeline barrier needed between the commands in order to avoid write-after-write hazards?
Does Vulkan provide any guarantee for commands executed in the same pipeline stage?
Of course; you virtually always have to synchronize in Vulkan. There are only a few places where Vulkan performs implicit synchronization.
You have the wrong intuition about pipeline stages. Commands independently "reach" stages of the pipeline. All commands start at VK_PIPELINE_STAGE_TOP_OF_PIPE (they "reach" it in submission order). Then (without sync) it is not determined which command(s) will proceed to the next pipeline stage. There is no order to it without explicit sync primitives; the spec says things like "execution of queue operations may overlap or happen out of order".
So without sync, vkCmdCopyQueryPoolResults may even happen before vkCmdFillBuffer, which I assume you do not want. If they both happen at the same time, that is even worse: the data may then contain some mess of writes from both sources (or from neither). The results would simply be undefined.
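Assuming that is what you want to avoid, a hedged C sketch of the barrier between the two commands might look like this (the buffer, query pool, and query count are placeholders). Both commands execute in the transfer stage, and both accesses are transfer writes:

```c
#include <vulkan/vulkan.h>

/* Fill, then barrier, then copy query results into the same buffer.
 * Both commands run in the transfer stage and both are transfer writes,
 * so the barrier is TRANSFER -> TRANSFER with TRANSFER_WRITE on each side. */
static void recordFillThenCopy(VkCommandBuffer cmd, VkBuffer buf,
                               VkQueryPool pool, uint32_t queryCount)
{
    vkCmdFillBuffer(cmd, buf, 0, VK_WHOLE_SIZE, 0);

    const VkBufferMemoryBarrier barrier = {
        .sType               = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER,
        .srcAccessMask       = VK_ACCESS_TRANSFER_WRITE_BIT, /* the fill */
        .dstAccessMask       = VK_ACCESS_TRANSFER_WRITE_BIT, /* the copy */
        .srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED,
        .dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED,
        .buffer              = buf,
        .offset              = 0,
        .size                = VK_WHOLE_SIZE,
    };
    vkCmdPipelineBarrier(cmd,
        VK_PIPELINE_STAGE_TRANSFER_BIT, VK_PIPELINE_STAGE_TRANSFER_BIT,
        0, 0, NULL, 1, &barrier, 0, NULL);

    vkCmdCopyQueryPoolResults(cmd, pool, 0, queryCount, buf, 0,
                              sizeof(uint64_t), VK_QUERY_RESULT_64_BIT);
}
```

A narrower barrier covering just the overlapping range would also work; VK_WHOLE_SIZE is used here only to keep the sketch short.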

What would be the value of VkAccessFlags for a VkBuffer or VkImage after being allocated memory?

So I create a bunch of buffers and images, and I need to set up a memory barrier for some reason.
How do I know what to specify in the srcAccessMask field for the barrier struct of a newly created buffer or image, seeing as at that point I wouldn't have specified the access flags for it? How do I decide what initial access flags to specify for the first memory barrier applied to a buffer or image?
Specifying initial values for other parameters in Vk*MemoryBarrier is easy since I can clearly know, say, the original layout of an image, but it isn't apparent to me what the value of srcAccessMask could be the first time I set up a barrier.
Is it based on the usage flags specified during creation of the object concerned? Or is there some other way that can be used to find out?
So, let's assume vkCreateImage and VK_IMAGE_LAYOUT_UNDEFINED.
Nowhere does the specification say that vkCreateImage defines some scheduled operation, so it is healthy to assume all its work is done as soon as it returns. Besides, the image does not even have memory yet.
Any synchronization needs would therefore be those of the memory you bind to it. Let's assume it is just fresh memory from vkAllocateMemory. Similarly, nowhere does the specification say that it defines some scheduled operation.
Even so, there are really only two options: either the implementation does nothing with the memory, or it null-fills it (for security reasons). If it null-fills it, that must be done in a way that you cannot access the original data (even by exploiting synchronization errors). So it is healthy to assume the memory has no "synchronization baggage" on it.
So simply srcStageMask = VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT (no previous outstanding scheduled operation) and srcAccessMask = 0 (no previous writes) should be correct.
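Putting that together, a minimal C sketch of a first barrier for a freshly created image might look like the following; the destination stage, access mask, and new layout are placeholders for whatever the image's first real use is (a transfer write here):

```c
#include <vulkan/vulkan.h>

/* First barrier ever recorded for a freshly created and freshly bound image:
 * nothing came before it, so the source scopes are empty. */
static void firstImageBarrier(VkCommandBuffer cmd, VkImage image)
{
    const VkImageMemoryBarrier barrier = {
        .sType               = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER,
        .srcAccessMask       = 0,                         /* no previous writes */
        .dstAccessMask       = VK_ACCESS_TRANSFER_WRITE_BIT,
        .oldLayout           = VK_IMAGE_LAYOUT_UNDEFINED, /* contents discardable */
        .newLayout           = VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL,
        .srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED,
        .dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED,
        .image               = image,
        .subresourceRange    = { VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1 },
    };
    vkCmdPipelineBarrier(cmd,
        VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT, /* no previous scheduled work to wait on */
        VK_PIPELINE_STAGE_TRANSFER_BIT,    /* first real use: e.g. an upload copy */
        0, 0, NULL, 0, NULL, 1, &barrier);
}
```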

Persist slower than non-persist calls

My settings are: Spark 2.1 on a 3 node YARN cluster with 160 GB, 48 vcores.
Dynamic allocation turned on.
spark.executor.memory=6G, spark.executor.cores=6
First, I am reading Hive tables: orders (329 MB) and lineitems (1.43 GB), and doing a left outer join.
Next, I apply 7 different filter conditions based on the joined dataset (something like var line1 = joinedDf.filter("linenumber=1"), var line2 = joinedDf.filter("l_linenumber=2"), etc.).
Because I'm filtering on the joined dataset multiple times, I thought doing a persist (MEMORY_ONLY) would help here, as the joined dataset fits fully in memory.
I noticed that with persist, the Spark application takes longer to run than without persist (3.5 mins vs 3.3 mins). With persist, the DAG shows that a single stage was created for persist and other downstream jobs are waiting for the persist to complete.
Does that mean persist is a blocking call? Or do stages in other jobs start processing when persisted blocks become available?
In the non-persist case, different jobs are creating different stages to read the same data. Data is read multiple times in different stages, but this still turns out to be faster than the persist case.
With larger data sets, persist actually causes executors to run out of memory (Java heap space). Without persist, the Spark jobs complete just fine. I looked at some other suggestions here: Spark java.lang.OutOfMemoryError: Java heap space.
I tried increasing/decreasing executor cores, persisting with disk only, increasing partitions, and modifying the storage ratio, but nothing seems to help with executor memory issues.
I would appreciate it if someone could explain how persist works, in what cases it is faster than not persisting, and, more importantly, how to go about troubleshooting out-of-memory issues.
I'd recommend reading up on the difference between transformations and actions in Spark. I must admit that I've been bitten by this myself on multiple occasions.
Data in Spark is evaluated lazily, which essentially means nothing happens until an "action" is performed. The .filter() function is a transformation, so nothing actually happens when your code reaches that point, except adding a step to the transformation pipeline. A call to .persist() behaves in the same way.
If your code downstream of the .persist() call has multiple actions that can be triggered simultaneously, then it's quite likely that you are actually "persisting" the data for each action separately and eating up memory (the "Storage" tab in the Spark UI will tell you the % cached of the dataset; if it's more than 100% cached, then you are seeing what I describe here). Worse, you may never actually be using the cached data.
Generally, if you have a point in code where the data set forks into two separate transformation pipelines (each of the separate .filter()s in your example), a .persist() is a good idea to prevent multiple readings of your data source, and/or to save the result of an expensive transformation pipeline before the fork.
Many times it's a good idea to trigger a single action right after the .persist() call (before the data forks) to ensure that later actions (which may run simultaneously) read from the persisted cache, rather than evaluate (and uselessly cache) the data independently.
TL;DR:
Do a joinedDF.count() after your .persist(), but before your .filter()s.
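A hedged Scala sketch of that pattern, runnable in spark-shell (the table names and the join key l_orderkey are placeholders, not taken from your job):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("persist-before-fork").getOrCreate()

val orders    = spark.table("orders")
val lineitems = spark.table("lineitems")

// persist() only marks the dataset for caching; it is still lazy.
val joinedDf = orders
  .join(lineitems, Seq("l_orderkey"), "left_outer")
  .persist(StorageLevel.MEMORY_ONLY)

// One eager action fills the cache exactly once, before the fork.
joinedDf.count()

// These downstream filters now read the cached blocks instead of each
// re-running (and separately re-caching) the join on their own.
val line1 = joinedDf.filter("l_linenumber = 1")
val line2 = joinedDf.filter("l_linenumber = 2")
```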

What's the point of using NSPurgeableData in NSCache?

I've read some articles with suggestions on using NSCache; many of them mention a recommendation to use NSPurgeableData in an NSCache.
However, I just can't see the point. Since NSCache is already able to evict its contents when memory is tight or when it reaches its count/cost limit, why do we still need NSPurgeableData here? Isn't it potentially slower than using the data object we already have? What advantage do we gain?
The count limit and the total-cost limit are not strictly enforced. That is, when the cache goes over one of its limits, some of its objects might get evicted immediately, later, or never, all depending on the implementation details of the cache.
So the advantages of using NSPurgeableData here are:
By using purgeable memory, you allow the system to quickly recover memory if it needs to, thereby increasing performance. Memory that is marked as purgeable is not paged to disk when it is reclaimed by the virtual memory system because paging is a time-consuming process. Instead, the data is discarded, and if needed later, it will have to be recomputed.
It works like a locking mechanism, or we can say it works like synchronization: while the data is being accessed by one thread, no other thread can access it until the first one has completed.
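For illustration, here is a minimal Objective-C sketch of the usual NSPurgeableData pattern (the key and payload are placeholders). Note that a freshly created NSPurgeableData already has an access count of 1, so it must be balanced with endContentAccess before the system is allowed to purge it:

```objc
#import <Foundation/Foundation.h>

int main(void) {
    @autoreleasepool {
        NSCache<NSString *, NSPurgeableData *> *cache = [NSCache new];

        // Store purgeable bytes; NSPurgeableData starts with access count 1.
        NSData *bytes = [@"expensive result" dataUsingEncoding:NSUTF8StringEncoding];
        NSPurgeableData *data = [NSPurgeableData dataWithData:bytes];
        [cache setObject:data forKey:@"payload"];
        [data endContentAccess];            // unpin: system may purge under pressure

        // Later: pin the bytes again before reading them.
        NSPurgeableData *cached = [cache objectForKey:@"payload"];
        if (cached != nil && [cached beginContentAccess]) {
            // Bytes cannot be purged inside this access window.
            NSLog(@"%lu bytes", (unsigned long)cached.length);
            [cached endContentAccess];
        } else {
            // Purged (or evicted): recompute the data and re-cache it.
        }
    }
    return 0;
}
```

The begin/endContentAccess pair is what lets the system reclaim the memory instantly (without paging it to disk) whenever nothing is actively using it, which is the performance advantage the quoted documentation describes.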