I'm trying to make an OpenGL renderer that mashes various shapes into one large mesh and stores these in two VBOs, one GL_ARRAY_BUFFER and one GL_ELEMENT_ARRAY_BUFFER. I'm aiming for it to work on both OpenGL ES 2 and OpenGL 3.2 core. I am currently trying to find the best way to handle deleting shapes from within this mesh and my current approach is to periodically rebuild the entire thing, possibly on a background thread.
The problem is that in order to rebuild the new and clean mesh, I need access to the vertices / indices that have been written to the buffers using glMapBuffer. According to the documentation for GL_OES_mapbuffer, WRITE_ONLY_OES is the only acceptable parameter for 'access'.
So, I don't think the data pointed at there is reliable to read from in order to create my new buffers. I know there are other functions in GL Core that allow you to copy the buffer data, but these also seem to be missing.
Can anyone verify that this is not possible on ES 2.0 or give some approach for achieving buffer reading? My current solution is to keep a shadow copy of all the data, which is obviously not ideal.
I think that keeping a shadow copy of GPU data in main memory is much better than reading these data from GPU memory. It is recommended to discard previous data before using glMapBuffer anyway. Read this for more information (It will not give you direct answer to your question, but it might be usefull).
Related
I've been working on a GPU-based boid simulation recently. I've spent most of my time trying to get the underlying sorting system working, in an attempt to avoid having each boid check every other boid—I'm ideally looking for this algorithm to end up being scalable into the hundreds of thousands of individual particles. However, I'm a bit confused as to how I should try to organize my boids into some kind of spatial tree structure when I don't have access to pointers (I'm working in HLSL).
I elected to try and base my method off of this incredibly helpful article. I already have a relatively quick radix sort functioning properly, but what I'm confused about is how I can actually put the sorted z-order morton keys to use. I naïvely assumed that, once sorted, all sequential boids would be sorted by distance, but this assumption breaks down whenever the boids are near the edge of two "sections" in the z-order curve, which causes some bizarre behavior that I've pictured below:
It seems clear that I also need to construct some kind of BVH (Bounding Volume Hierarchy) data structure so I can predictably access boids within a set distance, instead of just iterating over nearby sorted boids, but I'm stuck on how to achieve this in a language like HLSL that doesn't include pointers. I've read this article a few times, but I'm not sure if it's well-suited to what I'm trying to do. Should I create nodes that store buffer indices instead of pointers? Or is there a simpler way that I could go about this?
I'd deeply appreciate any advice on how to move forward, thank you!
I'm in the process of switching over to Julia from other programming languages and one of the things that Julia will let you hang yourself on is memory. I think this is likely a good thing, a programming language where you actually have to think about some amount of memory management forces the coder to write more efficient code. This would be in contrast to something like R where you can seemingly load datasets that are larger than the allocated memory. Of course, you can't actually do that, so I wonder how does R get around that problem?
Part of what I've done in other programming languages is work on large tabular datasets, often converted over to a R dataframe or a matrix. I think the way this is handled in Julia is to stream data in wherever possible, so my main question is this:
Is it better to use readline("my_file.txt") to access data or is it better to use open("my_file.txt", "w")? If possible, wouldn't it be better to access a large dataset all at once for speed? Or would it be better to always stream data?
I hope this makes sense. Any further resources would be greatly appreciated.
I'm not an extensive user of Julia's data-ecosystem packages, but CSV.jl offers the Chunks and Rows alternatives to File, and these might let you process the files incrementally.
While it may not be relevant to your use case, the mechanisms mentioned in #Przemyslaw Szufel's answer are used other places as well. Two I'm familiar with are the TiffImages.jl and NRRD.jl packages, both I/O packages mostly for loading image data into Julia. With these, you can load terabyte-sized datasets on a laptop. There may be more packages that use the same mechanism, and many package maintainers would probably be grateful to receive a pull request that supports optional memory-mapping when applicable.
In R you cannot have a data frame larger than memory. There is no magical buffering mechanism. However, when running R-based analytics you could use a disk.frame package for that.
Similarly, in Julia if you want to process data frames larger than memory you need to use am appropriate package. The most reasonable and natural option in Julia ecosystem is JuliaDB.
If you want to do something more low-level solution have a look at:
Mmap that provides Memory-mapped I/O that exactly solves the issue of conveniently handling data too large to fit into memory
SharedArrays that offers a disk mapped array with implementation based on Mmap.
In conclusion, if your data is data frame based - try JuliaDB, otherwise have a look at Mmap and SharedArrays (look at the filename parameter)
I have a Windows application that currently renders graphics largely using MFC that I'd like to change to get better use out of the GPU. Most of the graphics are straightforward and could easily be built up into a scene graph, but some of the graphics could prove very difficult. Specifically, in addition to the normal mesh type objects, I'm also dealing with point clouds which are liable to contain billions of Cartesian stored in a very compact manner that use quite a lot of custom culling techniques to be displayed in real time (Example). What I'm looking for is a mechanism that does the bulk of the scene rendering to a buffer and then gives me access to that buffer, a z buffer, and camera parameters such that I can modify them before putting them out to the display. I'm wondering whether this is possible with Direct3D, OpenGL or possibly use a higher level framework like OpenSceneGraph, and what would be the best starting point? Given the software is Windows based, I'd probably prefer to use Direct3D as this is likely to lead to fewest driver issues which I'm eager to avoid. OpenSceneGraph seems to provide custom culling via octrees, which are close but not identical to what I'm using.
Edit: To clarify a bit more, currently I have the following;
A display list / scene in memory which will typically contain up to a few million triangles, lines, and pieces of text, which I cull in software and output to a bitmap using low performing drawing primitives
A point cloud in memory which may contain billions of points in a highly compressed format (~4.5 bytes per 3d point) which I cull and output to the same bitmap
Cursor information that gets added to the bitmap prior to output
A camera, z-buffer and attribute buffers for navigation and picking purposes
The slow bit is the highlighted part of section 1 which I'd like to replace with GPU rendering of some kind. The solution I envisage is to build a scene for the GPU, render it to a bitmap (with matching z-buffer) based on my current camera parameters and then add my point cloud prior to output.
Alternatively, I could move to a scene based framework that managed the cameras and navigation for me and provide points in view as spheres or splats based on volume and level of detail during the rendering loop. In this scenario I'd also need to be able add cursor information to the view.
In either scenario, the hosting application will be MFC C++ based on VS2017 which would require too much work to change for the purposes of this exercise.
It's hard to say exactly based on your description of a complex problem.
OSG can probably do what you're looking for.
Depending on your timeframe, I'd consider eschewing both OpenGL (OSG) and DirectX in favor of the newer Vulkan 3D API. It's a successor to both D3D and OGL, and is designed by the GPU manufacturers themselves to provide optimal performance exceeding both of its predecessors.
The OSG project is currently developing a Vulkan scenegraph known as VSG, which already demonstrates superior performance to OSG and will have more generalized culling ability.
I've worked a bunch with point clouds and am pretty experienced with them, but I'm not exactly clear on what you're proposing to do.
If you want to actually have a verbal discussion about the matter, I'm pretty easy to find (my company is AlphaPixel -- AlphaPixel.com) and you could call us. I'm in the European time zone right now, it's not clear from your question where you are but you sound US-based.
Im making a simple raytracer for a schoolproject were a compute shader is supposed to be used to shade a triangle or some other primitive.
For this I'd like to write to a backbuffer-surface directly in the compute shader, to then present the results imideatly. I know for certain that this is possible in DX11 though i can't seem to get it to work in DX12.
I couldn't gather that much information about this, but i found this gamedev thread discussing the exact same problem I try to figure out and they seem to come to the conclusion which was my go to workaround: writing to an intermediate texture and then sampling in a pipeline.
I can't fully accept that this would be impossible to achieve in dx12. Why would that feature be removed? Could it be that the queuing-systems removes some overhead that makes it unnecessary to have this feature?
Is there any way to achieve a raytracer without writing to a separate texture and then sampling in a pipeline or copy it onto the back-buffer? What are my best alternatives for achieving performance?
You will have to access the answer. They removed the capability to create an UAV the same way they removed the capability to use multisample surface in the swapchain.
The problem with authorizing UAV on the swapchain surface is that they would have to forfeit tracking of what is happening to it. DX12 rely on descriptor heaps that are 100% volatile at runtime for UAVs ( render targets are CPU side only and can be tracked ).
Microsoft need to track the swapchain surface status strongly in order to guarantee behavior with the desktop presentation and for that reason, they choose to deny the UAV binding.
It isn't clear to me when it's a good idea to use VK_IMAGE_LAYOUT_GENERAL as opposed to transitioning to the optimal layout for whatever action I'm about to perform. Currently, my policy is to always transition to the optimal layout.
But VK_IMAGE_LAYOUT_GENERAL exists. Maybe I should be using it when I'm only going to use a given layout for a short period of time.
For example, right now, I'm writing code to generate mipmaps using vkCmdBlitImage. As I loop through the sub-resources performing the vkCmdBlitImage commands, should I transition to VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL as I scale down into a mip, then transition to VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL when I'll be the source for the next mip before finally transitioning to VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL when I'm all done? It seems like a lot of transitioning, and maybe generating the mips in VK_IMAGE_LAYOUT_GENERAL is better.
I appreciate the answer might be to measure, but it's hard to measure on all my target GPUs (especially because I haven't got anything running on Android yet) so if anyone has any decent rule of thumb to apply it would be much appreciated.
FWIW, I'm writing Vulkan code that will run on desktop GPUs and Android, but I'm mainly concerned about performance on the latter.
You would use it when:
You are lazy
You need to map the memory to host (unless you can use PREINITIALIZED)
When you use the image as multiple incompatible attachments and you have no choice
For Store Images
( 5. Other cases when you would switch layouts too much (and you don't even need barriers) relatively to the work done on the images. Measurement needed to confirm GENERAL is better in that case. Most likely a premature optimalization even then.
)
PS: You could transition all the mip-maps together to TRANSFER_DST by a single command beforehand and then only the one you need to SRC. With a decent HDD, it should be even best to already have them stored with mip-maps, if that's a option (and perhaps even have a better quality using some sophisticated algorithm).
PS2: Too bad, there's not a mip-map creation command. The cmdBlit most likely does it anyway under the hood for Images smaller than half resolution....
If you read from mipmap[n] image for creating the mipmap[n+1] image then you should use the transfer image flags if you want your code to run on all Vulkan implementations and get the most performance across all implementations as the flags may be used by the GPU to optimize the image for reads or writes.
So if you want to go cross-vendor only use VK_IMAGE_LAYOUT_GENERAL for setting up the descriptor that uses the final image and not image reads or writes.
If you don't want to use that many transitions you may copy from a buffer instead of an image, though you obviously wouldn't get the format conversion, scaling and filtering that vkCmdBlitImage does for you for free.
Also don't forget to check if the target format actually supports the BLIT_SRC or BLIT_DST bits. This is independent of whether you use the transfer or general layout for copies.