How to use MTLBlitCommandEncoder without the use of MTLBuffer - rendering

I've read in the Apple docs that the MTLBlitCommandEncoder should be used when copying between buffers. However, I've been working with buffers smaller than 4 KB and have been calling setVertexBytes on the MTLRenderCommandEncoder. Since the MTLBlitCommandEncoder only works with MTLBuffers, is this encoder needed to copy buffers? Does Apple have any documentation about copying UnsafeRawPointers instead? Are there any downsides, in terms of memory, to using MTLBuffers?

In your case (4 KB buffers), it would be better to perform the copy on the CPU.
To copy data between two buffers referenced by raw pointers, use the following method of UnsafeMutableRawPointer:
copyMemory(from:byteCount:)

Is it better to open or to read large matrices in Julia?

I'm in the process of switching over to Julia from other programming languages and one of the things that Julia will let you hang yourself on is memory. I think this is likely a good thing, a programming language where you actually have to think about some amount of memory management forces the coder to write more efficient code. This would be in contrast to something like R where you can seemingly load datasets that are larger than the allocated memory. Of course, you can't actually do that, so I wonder how does R get around that problem?
Part of what I've done in other programming languages is work on large tabular datasets, often converted to an R data frame or a matrix. I think the way this is handled in Julia is to stream data in wherever possible, so my main question is this:
Is it better to use readline("my_file.txt") to access data or is it better to use open("my_file.txt", "w")? If possible, wouldn't it be better to access a large dataset all at once for speed? Or would it be better to always stream data?
I hope this makes sense. Any further resources would be greatly appreciated.
I'm not an extensive user of Julia's data-ecosystem packages, but CSV.jl offers the Chunks and Rows alternatives to File, and these might let you process the files incrementally.
While it may not be relevant to your use case, the mechanisms mentioned in @Przemyslaw Szufel's answer are used in other places as well. Two I'm familiar with are the TiffImages.jl and NRRD.jl packages, both I/O packages mostly for loading image data into Julia. With these, you can load terabyte-sized datasets on a laptop. There may be more packages that use the same mechanism, and many package maintainers would probably be grateful to receive a pull request that supports optional memory-mapping when applicable.
In R you cannot have a data frame larger than memory. There is no magical buffering mechanism. However, when running R-based analytics you could use the disk.frame package for that.
Similarly, in Julia, if you want to process data frames larger than memory you need to use an appropriate package. The most reasonable and natural option in the Julia ecosystem is JuliaDB.
If you want a more low-level solution, have a look at:
Mmap, which provides memory-mapped I/O that solves exactly the issue of conveniently handling data too large to fit into memory
SharedArrays, which offers a disk-mapped array with an implementation based on Mmap
In conclusion, if your data is data frame based, try JuliaDB; otherwise have a look at Mmap and SharedArrays (look at the filename parameter).
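The memory-mapping idea described above is not specific to Julia. As a language-neutral illustration only, here is a minimal sketch using Python's standard mmap module (this is not Julia's Mmap API). The file layout (a flat run of little-endian float64 values), the function name column_sum, and the demo file are all assumptions made for the example.

```python
import mmap
import os
import struct
import tempfile


def column_sum(path, chunk_values=1024):
    """Sum a flat file of little-endian float64 values chunk by chunk."""
    total = 0.0
    item = struct.calcsize("<d")  # 8 bytes per value
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        for start in range(0, len(mm), chunk_values * item):
            # Only this slice is copied into process memory; the OS pages
            # the rest of the file in and out on demand.
            chunk = mm[start:start + chunk_values * item]
            count = len(chunk) // item
            total += sum(struct.unpack(f"<{count}d", chunk))
    return total


# Hypothetical demo data: the values 0.0 .. 9999.0 in a temporary file.
demo_path = os.path.join(tempfile.mkdtemp(), "values.bin")
with open(demo_path, "wb") as f:
    f.write(struct.pack("<10000d", *map(float, range(10000))))

result = column_sum(demo_path)
```

The point of the sketch is the access pattern: the file is addressed like an array, but only the chunk currently being processed has to be resident.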

When to use VK_IMAGE_LAYOUT_GENERAL

It isn't clear to me when it's a good idea to use VK_IMAGE_LAYOUT_GENERAL as opposed to transitioning to the optimal layout for whatever action I'm about to perform. Currently, my policy is to always transition to the optimal layout.
But VK_IMAGE_LAYOUT_GENERAL exists. Maybe I should be using it when I'm only going to use a given layout for a short period of time.
For example, right now, I'm writing code to generate mipmaps using vkCmdBlitImage. As I loop through the sub-resources performing the vkCmdBlitImage commands, should I transition to VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL as I scale down into a mip, then transition to VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL when I'll be the source for the next mip before finally transitioning to VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL when I'm all done? It seems like a lot of transitioning, and maybe generating the mips in VK_IMAGE_LAYOUT_GENERAL is better.
I appreciate the answer might be to measure, but it's hard to measure on all my target GPUs (especially because I haven't got anything running on Android yet) so if anyone has any decent rule of thumb to apply it would be much appreciated.
FWIW, I'm writing Vulkan code that will run on desktop GPUs and Android, but I'm mainly concerned about performance on the latter.
You would use it when:
1. You are lazy
2. You need to map the memory to host (unless you can use PREINITIALIZED)
3. You use the image as multiple incompatible attachments and you have no choice
4. For storage images
(5. Other cases where you would otherwise switch layouts too often (and you don't even need barriers) relative to the work done on the images. Measurement is needed to confirm GENERAL is better in that case. Most likely a premature optimization even then.)
PS: You could transition all the mip levels together to TRANSFER_DST with a single command beforehand and then transition only the one you need to TRANSFER_SRC. With a decent HDD, it may be even better to store the images with their mip-maps already generated, if that's an option (and perhaps get better quality by using a more sophisticated algorithm).
PS2: Too bad there's no mip-map creation command. The cmdBlit most likely does it anyway under the hood for images smaller than half resolution.
If you read from mipmap[n] to create mipmap[n+1], you should use the transfer layouts if you want your code to run on all Vulkan implementations and get the best performance across them, because the GPU may use the layout to optimize the image for reads or writes.
So if you want to go cross-vendor, only use VK_IMAGE_LAYOUT_GENERAL for setting up the descriptor that uses the final image, not for image reads or writes.
If you don't want to use that many transitions you may copy from a buffer instead of an image, though you obviously wouldn't get the format conversion, scaling and filtering that vkCmdBlitImage does for you for free.
Also don't forget to check whether the format actually supports the VK_FORMAT_FEATURE_BLIT_SRC_BIT or VK_FORMAT_FEATURE_BLIT_DST_BIT feature. This is independent of whether you use the transfer or general layout for copies.
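To make the transition pattern from the answers above concrete, here is a minimal sketch that only plans the order of barriers and blits for a full mip chain; it does not call Vulkan. It assumes all levels were already moved to TRANSFER_DST_OPTIMAL by one barrier and that level 0 has been uploaded. The function names and the tuple format are hypothetical, and layout names are written without the VK_IMAGE_LAYOUT_ prefix. Each "barrier" entry stands for a vkCmdPipelineBarrier and each "blit" entry for a vkCmdBlitImage.

```python
def mip_levels(width, height):
    """Levels in a full mip chain: floor(log2(max(w, h))) + 1."""
    return max(width, height).bit_length()


def plan_mip_generation(width, height):
    """Return the ordered barrier/blit steps as plain tuples.

    Precondition (assumed): every level is in TRANSFER_DST_OPTIMAL and
    level 0 already holds the full-resolution image.
    """
    levels = mip_levels(width, height)
    steps = []
    for dst in range(1, levels):
        src = dst - 1
        # The level just written becomes the source of the next blit.
        steps.append(("barrier", (src, src),
                      "TRANSFER_DST_OPTIMAL", "TRANSFER_SRC_OPTIMAL"))
        src_size = (max(1, width >> src), max(1, height >> src))
        dst_size = (max(1, width >> dst), max(1, height >> dst))
        steps.append(("blit", src, dst, src_size, dst_size))
    if levels > 1:
        # Every level except the last ended up as a blit source.
        steps.append(("barrier", (0, levels - 2),
                      "TRANSFER_SRC_OPTIMAL", "SHADER_READ_ONLY_OPTIMAL"))
    # The last level was only ever written to.
    steps.append(("barrier", (levels - 1, levels - 1),
                  "TRANSFER_DST_OPTIMAL", "SHADER_READ_ONLY_OPTIMAL"))
    return steps


plan = plan_mip_generation(256, 256)
for step in plan:
    print(step)
```

For a 256x256 image this is one small barrier per blit plus two at the end, which is the cost to weigh against keeping everything in VK_IMAGE_LAYOUT_GENERAL.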

Best way to load 24-bit BGR image from memory to ID2D1Bitmap1

I have a BGR 24-bit image in memory as continuous buffer (represented by cv::Mat, in case it may be of any help). I would like to load it to ID2D1Bitmap1 bitmap for 2D rendering. I have the following working code (showing a pseudo-code here):
IWICImagingFactory::CreateBitmapFromMemory(GUID_WICPixelFormat24bppBGR);
IWICFormatConverter::Initialize(GUID_WICPixelFormat32bppRGB);
ID2D1DeviceContext::CreateBitmapFromWicBitmap;
This works fine, the main issue being the time it takes: 20-40 milliseconds, which is too long for my application. I am looking for ways to optimize the process.
I can probably save the creation time of the ID2D1Bitmap1 by creating it once and then loading the converted image from memory using CopyFromMemory, but the conversion itself still takes a large amount of time. One way could be to load the raw BGR buffer into GPU memory and convert it to the native RGBA format on the GPU itself, but I have no idea how to start with that.
Your second idea is exactly the direction you should go. Create the ID2D1Bitmap(s) once, convert the buffer (more on that below), and use CopyFromMemory. I do something very similar in an app which provides a preview of a connected webcam (which may have one of various formats). Some cameras will deliver the images in MJPEG, YUY2, etc.
That app uses MediaFoundation, and an IMFTransform to convert the buffer. The IMFTransform in this case is an instance of CLSID_CColorConvertDMO (which uses SIMD registers/instructions when possible). However, prior to completing that, I tested with my own conversion code (which was CPU bound, and performed so-so), and another solution with HLSL and DirectCompute (performed well, but handled only one format). In the end I chose CLSID_CColorConvertDMO to handle the various types of input, but if you only have one known type, you may choose to use the HLSL solution (although you will have to write the conversion code and set up the 'views' of the data).
However, if you choose the MediaFoundation route, you can use an IMFTransform without all of the rest of the graph (source, sink, etc). After creating the CLSID_CColorConvertDMO instance and setting the input and output types (format, frame size, etc), create an IMFSample (using MFCreateSample) and an IMFMediaBuffer (using MFCreateMemoryBuffer), and add the buffer to the sample (using IMFSample::AddBuffer). Then all that is necessary is to call ProcessInput and ProcessOutput to convert the buffer (create all items upfront). This may sound like a lot, but if done correctly, the impact on CPU utilization will not be noticeable, and you will achieve the performance you are looking for, even for capture cards which often deliver large frames at 60+ FPS (having used DataPath and Blackmagic capture cards in the past).
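Whichever component performs the conversion, the work itself amounts to padding every 3-byte pixel to 4 bytes. The sketch below shows that byte layout in Python for clarity, not speed; a real implementation would use SIMD code (as the DMO does) or a compute shader. It assumes the target is a 32-bit layout in B, G, R, X byte order with the unused fourth byte set to 0xFF. The function name and the optional stride parameter are hypothetical.

```python
def bgr24_to_bgrx32(src, width, height, stride=None):
    """Expand 24-bit BGR rows into tightly packed 32-bit BGRX rows."""
    if stride is None:
        stride = width * 3  # rows are contiguous, no padding
    # Drop any per-row padding so the pixels form one contiguous run.
    packed = b"".join(
        src[y * stride: y * stride + width * 3] for y in range(height)
    )
    out = bytearray(width * height * 4)
    out[0::4] = packed[0::3]                 # blue
    out[1::4] = packed[1::3]                 # green
    out[2::4] = packed[2::3]                 # red
    out[3::4] = b"\xff" * (width * height)   # unused / opaque
    return bytes(out)


# Two pixels in one row: (B=1, G=2, R=3) and (B=4, G=5, R=6).
converted = bgr24_to_bgrx32(bytes([1, 2, 3, 4, 5, 6]), width=2, height=1)
```

The converted buffer has a pitch of width * 4 bytes, which is the value that would be passed along with it to CopyFromMemory.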
Good luck. I am certain you will crush it.

Is it possible to read from a VBO?

I'm trying to make an OpenGL renderer that mashes various shapes into one large mesh and stores these in two VBOs, one GL_ARRAY_BUFFER and one GL_ELEMENT_ARRAY_BUFFER. I'm aiming for it to work on both OpenGL ES 2 and OpenGL 3.2 core. I am currently trying to find the best way to handle deleting shapes from within this mesh and my current approach is to periodically rebuild the entire thing, possibly on a background thread.
The problem is that in order to rebuild the new and clean mesh, I need access to the vertices / indices that have been written to the buffers using glMapBuffer. According to the documentation for GL_OES_mapbuffer, WRITE_ONLY_OES is the only acceptable parameter for 'access'.
So, I don't think the data pointed at there is reliable to read from in order to create my new buffers. I know there are other functions in GL Core that allow you to copy the buffer data, but these also seem to be missing.
Can anyone verify that this is not possible on ES 2.0 or give some approach for achieving buffer reading? My current solution is to keep a shadow copy of all the data, which is obviously not ideal.
I think that keeping a shadow copy of GPU data in main memory is much better than reading the data back from GPU memory. It is recommended to discard previous data before using glMapBuffer anyway. Read this for more information (it will not give you a direct answer to your question, but it might be useful).
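As an illustration of the shadow-copy approach, here is a minimal sketch of rebuilding the merged mesh from CPU-side data after a shape is deleted. The data layout (a dict of shapes, each with its own local vertex and index lists) and the function name are assumptions; the resulting lists are what would then be uploaded to the two buffers, for example with glBufferData.

```python
def rebuild_mesh(shapes):
    """Merge the surviving shapes into one vertex list and one index list."""
    vertices, indices = [], []
    for shape in shapes:
        base = len(vertices)  # offset of this shape's first vertex
        vertices.extend(shape["vertices"])
        # Local indices are rebased onto the merged vertex list.
        indices.extend(base + i for i in shape["indices"])
    return vertices, indices


# Hypothetical shadow copy kept in main memory, keyed by shape id.
shadow = {
    "tri": {"vertices": [(0, 0), (1, 0), (0, 1)], "indices": [0, 1, 2]},
    "quad": {"vertices": [(2, 0), (3, 0), (3, 1), (2, 1)],
             "indices": [0, 1, 2, 0, 2, 3]},
    "tri2": {"vertices": [(4, 0), (5, 0), (4, 1)], "indices": [0, 1, 2]},
}

del shadow["tri"]  # deleting a shape never touches GPU memory
vertices, indices = rebuild_mesh(shadow.values())
```

Because the rebuild only reads main memory, it can run on a background thread, and nothing has to be read back from the VBOs.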

How can I accelerate the generation of an MD5 checksum within vb.net?

I'm working with some very large files residing on P2 (Panasonic) cards. Part of the process we employ is to first generate a checksum of the file we are going to copy, then copy the file, then run a checksum on the copy to confirm that it copied OK. The problem is that the files are large (70 GB+) and take a long time to complete. It's an issue since we will eventually be dealing with thousands of these files.
I would like to find a faster way to generate the checksum other than using the System.Security.Cryptography.MD5CryptoServiceProvider
I don't care if this means using a specialized hardware card, provided it works and is not too ungodly expensive. I would prefer a method that provides some feedback on how far the process has progressed so I can display it like I do now.
The application is written in vb.net. I would prefer to be able to use it as a component, library, or reference within my application, but I'm willing to call an outside application if there is enough improvement in the speed of generating the checksum.
Needless to say, the checksum must be consistent and correct. :-)
Thank you in advance for your time and efforts,
Richard
I see one potential way to speed up this process: calculate the MD5 of the source file while performing the copy, not prior to it. This will reduce the number of times you'll need to read the entire file from 3 (source hash, copy, destination hash) to 2 (copy, destination hash).
The downside of this all is that you'll have to write your own copying code (as opposed to just relying on System.IO.File.Copy), and there's a non-zero chance that this will turn out to be slower in the end anyway than the 3-step process.
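The hash-while-copying idea can be sketched as follows. This is a Python illustration of the technique, not VB.NET code; the function names, chunk size, and progress callback are assumptions for the example. In .NET the same shape should be achievable by feeding each chunk to HashAlgorithm.TransformBlock and finishing with TransformFinalBlock.

```python
import hashlib
import os
import tempfile


def copy_with_md5(src_path, dst_path, chunk_size=1024 * 1024, on_progress=None):
    """Copy src to dst, hashing the source bytes as they stream through."""
    digest = hashlib.md5()
    total = os.path.getsize(src_path)
    done = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)  # hash and write the same bytes: one read
            dst.write(chunk)
            done += len(chunk)
            if on_progress is not None:
                on_progress(done, total)
    return digest.hexdigest()


def md5_of_file(path, chunk_size=1024 * 1024):
    """Second pass: hash the destination to verify the copy."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Hypothetical demo with a small temporary file standing in for a 70 GB one.
work = tempfile.mkdtemp()
src_file = os.path.join(work, "source.bin")
dst_file = os.path.join(work, "copy.bin")
with open(src_file, "wb") as f:
    f.write(os.urandom(3 * 1024 * 1024 + 17))

progress = []
source_hash = copy_with_md5(
    src_file, dst_file, on_progress=lambda done, total: progress.append(done / total)
)
copy_ok = source_hash == md5_of_file(dst_file)
```

The progress callback fires once per chunk, which covers the requirement to show how far along the process is.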
Other than that, I don't think there's much you can do here, as the entire process is I/O bound by design. You're spending most of your time reading/writing the file, and even at 100MB/s (a respectable I/O speed for your typical SATA drive), you'll do about 5.8GB/min at best.
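A quick back-of-the-envelope check of those numbers, as a sketch under simplifying assumptions: every pass is limited to the same 100 MB/s, a GB is 1024 MB, and the file is exactly 70 GB. The function name is hypothetical.

```python
def minutes_per_pass(file_gb, mb_per_s):
    """Time to stream the whole file once at the given throughput."""
    return file_gb * 1024 / mb_per_s / 60


gb_per_min = 100 * 60 / 1024                  # about 5.86 GB/min
three_pass = 3 * minutes_per_pass(70, 100)    # hash, copy, hash
two_pass = 2 * minutes_per_pass(70, 100)      # copy+hash, hash

print(f"{gb_per_min:.2f} GB/min, {three_pass:.1f} min vs {two_pass:.1f} min")
```

Under these assumptions, dropping one pass saves roughly 12 minutes per 70 GB file, and no amount of faster hashing changes the remaining I/O time.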
With a modern processor, the overhead of calculating the MD5 (or anything else) doesn't factor into things very much, so speeding it up won't improve your overall throughput. Crypto accelerators in particular won't help you here, as unless the driver implementation is very efficient, they'll add more overhead due to context switches required to feed the data to the external card than they'll save.
What you do want to improve is the I/O speed. The .NET framework is already pretty efficient when it comes to this (using nicely-sized buffers, overlapped I/O and such), but it's possible an optimized native Windows application will perform better here. My advice: Google around for a few native MD5 calculators, and see how they compare to your current .NET implementation. If the difference in hash calculation speed is >10%, it's worth switching to using said external app.
The correct answer is to avoid using MD5. MD5 is a cryptographic hash function, designed to provide certain cryptographic features. For merely detecting accidental corruption, it is way over-engineered and slow. There are many faster checksums, the design of which can be understood by examining the literature of error detection and correction. Some common examples are the CRC checksums, of which CRC32 is very common, but you can also relatively easily compute 64 or 128 bit or even larger CRCs much much faster than an MD5 hash.
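For comparison, a streaming CRC32 can be sketched in a few lines. This uses Python's standard zlib.crc32 purely as an illustration of the technique; the function name is hypothetical. Note that a CRC only guards against accidental corruption, not deliberate tampering.

```python
import zlib


def crc32_of_chunks(chunks):
    """Fold a sequence of byte chunks into one CRC32 value."""
    crc = 0
    for chunk in chunks:
        crc = zlib.crc32(chunk, crc)  # continue from the running value
    return crc & 0xFFFFFFFF


# The chunk boundaries do not affect the result.
check = crc32_of_chunks([b"1234", b"56789"])
```

Because the checksum is updated chunk by chunk, it can be computed inside the same read loop that performs the copy and reports progress.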