Can I perform compute on an NVMe storage device? - low-level

I'll start by saying this is not something I am doing in a production system; it's just a personal project, and I'm just writing code to learn and to find challenging tasks.
So, NVMe drives are used in computers for storing information; they have controllers and perform a set of standard operations. I have been reading through the specification (https://nvmexpress.org/developers/) trying to figure out whether there is any way to use them so that they are actually computing on the information they store. There are a lot of different commands they can execute, but I haven't found anyone who has already tried this, though maybe I'm just using the wrong search terms.
I wanted to check whether anyone knows if this is something that has been done: using NVMe storage as a compute device.
Some of the things I thought I might be able to find:
Maybe a write option equivalent to a logical OR where it would write 1s over 0s but not change other values?
Maybe a way to compare values as they are overwritten? If there were a status code tracking wear levelling, or one indicating whether the data actually changed on a write, I could tell whether the location I wrote to already held the value I just wrote.
Maybe a command to check whether data is blank or all zeros?
Maybe a command to return the hash or parity bit for the data instead of the data?
I was hoping to find some combination of Move, BitShift, or Logical operators that could be chained together to do calculations on data without returning it.
Or possibly a status code on an operation that would give me information about the data.
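The closest thing I have spotted so far is the optional Compare command (opcode 05h), which checks data on the media against a host buffer and reports only a pass/fail status instead of returning the data. A rough sketch of issuing it through the Linux passthrough ioctl (the device node, a 512-byte LBA size, and root access are assumptions on my part):

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/nvme_ioctl.h>

    int main(void)
    {
        int fd = open("/dev/nvme0n1", O_RDONLY);    /* assumed device node       */
        if (fd < 0) { perror("open"); return 1; }

        unsigned char expected[512] = {0};          /* pattern I expect at LBA 0 */

        struct nvme_passthru_cmd cmd = {0};
        cmd.opcode   = 0x05;                        /* Compare (NVM command set) */
        cmd.nsid     = ioctl(fd, NVME_IOCTL_ID);    /* namespace of this device  */
        cmd.addr     = (uint64_t)(uintptr_t)expected;
        cmd.data_len = sizeof(expected);
        cmd.cdw10    = 0;                           /* starting LBA, low 32 bits */
        cmd.cdw11    = 0;                           /* starting LBA, high 32 bits*/
        cmd.cdw12    = 0;                           /* number of blocks - 1      */

        int status = ioctl(fd, NVME_IOCTL_IO_CMD, &cmd);
        if (status == 0)
            printf("LBA 0 already holds this data\n");
        else
            printf("compare failed or error: %d\n", status);

        close(fd);
        return 0;
    }

Of course this still ships the buffer to the drive, so it isn't really offloading any computation, but the result does come back as a status code rather than as data.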

Related

Improving our algorithm with SQLite vs storing everything in memory

The problem...I’m trying to figure out a way to make our algorithm faster.
Our algorithm...is written in C and runs on an embedded Linux system with little memory and a lackluster CPU. The entire algorithm makes heavy use of 2d arrays and stores them all in memory. At a high level, the algorithm’s input data, which is a single array of 250 doubles (0.01234, 0.02532….0.1286), is compared to a larger 2d array, which is 20k+ rows x 250 doubles. The input data is compared against the 20k+ rows using a for loop. For each iteration, the algorithm performs computations and stores those results in memory.
I’m not an embedded software developer, I am a cloud developer that uses databases (Postgres, mainly). Our embedded software doesn’t make use of any databases and, since that is what I know, I thought I’d look into SQLite.
My approach...applying what I know about databases, I'd go about it this way: I would have a single table with 6 columns: id, array, computation_1, computation_2, computation_3, and computation_4. I’d store all 20k+ rows in this table with the computation_* columns initially defaulted to null. Then I’d have the algorithm loop through each entry and update the values for each computation_* column accordingly.
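To make that concrete, the table I have in mind would be created roughly like this through the SQLite C API (a sketch; the table name and the choice of a BLOB for the 250 doubles are my own assumptions):

    #include <stdio.h>
    #include <sqlite3.h>

    int main(void)
    {
        sqlite3 *db;
        if (sqlite3_open("algorithm.db", &db) != SQLITE_OK)
            return 1;

        /* one row per reference array; the computation_* columns start out NULL */
        const char *ddl =
            "CREATE TABLE results ("
            "  id            INTEGER PRIMARY KEY,"
            "  array         BLOB,"          /* 250 doubles packed as bytes */
            "  computation_1 REAL,"
            "  computation_2 REAL,"
            "  computation_3 REAL,"
            "  computation_4 REAL"
            ");";

        char *err = NULL;
        if (sqlite3_exec(db, ddl, NULL, NULL, &err) != SQLITE_OK) {
            fprintf(stderr, "%s\n", err);
            sqlite3_free(err);
        }

        sqlite3_close(db);
        return 0;
    }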
Storing arrays in a database doesn't seem like a good fit, so I don't immediately see whether there is a benefit to doing this. But it seems like it would replace the extensive use of malloc()/calloc() we have baked into the algorithm.
My question is...can SQLite help speed up our algorithm if I use it in the way I've described? Since I don’t know how much benefit this would provide, if any, I thought I’d ask the experts here on SO before going down this path. If it will (or won't) provide an improvement, I'd like to know why from a technical standpoint so that I can learn.
Thanks in advance.
As you have described it so far, SQLite won't help you.
A relational database stores data into tables with various indexes and so on. When it receives SQL, it compiles it into a bytecode program, and then it runs that bytecode program in an interpreter against those tables. You can learn more about SQLite's bytecode from https://www.sqlite.org/opcode.html.
This has a lot of overhead compared to native data structures in a low-level language. In my experience the difference is up to several orders of magnitude.
Why, then, would anyone use a database? Because you'd have to write a lot of potentially buggy code to match what it gives you, doubly so if you've got multiple users at the same time. Furthermore, the database query optimizer is able to find plans for computing complex joins that are orders of magnitude more efficient than what most programmers produce on their own.
So a database is not a recipe for doing arbitrary calculations more efficiently. But if you can describe what you are doing in SQL (particularly if it involves joins), the database may be able to find a much more efficient calculation than the one you're currently performing.
Even then, squeezing performance out of a low-end embedded system is exactly the situation where it may be worth figuring out what a database would do, and then writing code to do that directly.
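For comparison, the direct in-memory version you already have is essentially this shape, and an interpreter has a hard time beating it (sizes taken from your description; the per-row computations below are placeholders):

    #define COLS 250
    #define ROWS 20000

    static double reference[ROWS][COLS];   /* the 20k+ x 250 2-D array      */
    static double results[ROWS][4];        /* computation_1..4 for each row */

    /* compare one input row of 250 doubles against every stored row */
    void process(const double input[COLS])
    {
        for (int r = 0; r < ROWS; r++) {
            double acc = 0.0;
            for (int c = 0; c < COLS; c++) {
                double d = input[c] - reference[r][c];
                acc += d * d;              /* placeholder computation       */
            }
            results[r][0] = acc;           /* computation_1 (placeholder)   */
            results[r][1] = acc / COLS;    /* computation_2 (placeholder)   */
            results[r][2] = 0.0;           /* computation_3 (placeholder)   */
            results[r][3] = 0.0;           /* computation_4 (placeholder)   */
        }
    }

Everything stays in two contiguous arrays, there is no per-row SQL parsing or bytecode interpretation, and the compiler is free to vectorize the inner loop.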

How to best create an Octree/BVH in GPU memory without pointers?

I've been working on a GPU-based boid simulation recently. I've spent most of my time trying to get the underlying sorting system working, in an attempt to avoid having each boid check every other boid—I'm ideally looking for this algorithm to end up being scalable into the hundreds of thousands of individual particles. However, I'm a bit confused as to how I should try to organize my boids into some kind of spatial tree structure when I don't have access to pointers (I'm working in HLSL).
I elected to try and base my method off of this incredibly helpful article. I already have a relatively quick radix sort functioning properly, but what I'm confused about is how I can actually put the sorted z-order morton keys to use. I naïvely assumed that, once sorted, all sequential boids would be sorted by distance, but this assumption breaks down whenever the boids are near the edge of two "sections" in the z-order curve, which causes some bizarre behavior that I've pictured below:
It seems clear that I also need to construct some kind of BVH (Bounding Volume Hierarchy) data structure so I can predictably access boids within a set distance, instead of just iterating over nearby sorted boids, but I'm stuck on how to achieve this in a language like HLSL that doesn't include pointers. I've read this article a few times, but I'm not sure if it's well-suited to what I'm trying to do. Should I create nodes that store buffer indices instead of pointers? Or is there a simpler way that I could go about this?
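To make the question concrete, what I mean by index-based nodes is roughly this (a C-style sketch of what one element of an HLSL StructuredBuffer might hold; the field names and layout are just my guess):

    /* A pointer-free BVH node: children are indices into the same flat
     * node buffer, and leaves reference a contiguous range of the
     * Morton-sorted boid buffer. */
    struct BVHNode {
        float minX, minY, minZ;   /* bounding box minimum corner             */
        float maxX, maxY, maxZ;   /* bounding box maximum corner             */
        int   leftChild;          /* index into node buffer, -1 for a leaf   */
        int   rightChild;         /* index into node buffer, -1 for a leaf   */
        int   firstBoid;          /* leaves: first index in the sorted boids */
        int   boidCount;          /* leaves: number of boids in that range   */
    };

    /* Traversal would then use an explicit stack of node indices instead
     * of recursion, e.g.:
     *   int stack[64]; int top = 0; stack[top++] = 0;   // push the root
     *   while (top > 0) { int node = stack[--top]; ... }
     */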
I'd deeply appreciate any advice on how to move forward, thank you!

When to use VK_IMAGE_LAYOUT_GENERAL

It isn't clear to me when it's a good idea to use VK_IMAGE_LAYOUT_GENERAL as opposed to transitioning to the optimal layout for whatever action I'm about to perform. Currently, my policy is to always transition to the optimal layout.
But VK_IMAGE_LAYOUT_GENERAL exists. Maybe I should be using it when I'm only going to use a given layout for a short period of time.
For example, right now, I'm writing code to generate mipmaps using vkCmdBlitImage. As I loop through the sub-resources performing the vkCmdBlitImage commands, should I transition to VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL as I scale down into a mip, then transition to VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL when I'll be the source for the next mip before finally transitioning to VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL when I'm all done? It seems like a lot of transitioning, and maybe generating the mips in VK_IMAGE_LAYOUT_GENERAL is better.
I appreciate the answer might be to measure, but it's hard to measure on all my target GPUs (especially because I haven't got anything running on Android yet) so if anyone has any decent rule of thumb to apply it would be much appreciated.
FWIW, I'm writing Vulkan code that will run on desktop GPUs and Android, but I'm mainly concerned about performance on the latter.
You would use it when:
You are lazy
You need to map the memory to host (unless you can use PREINITIALIZED)
When you use the image as multiple incompatible attachments and you have no choice
For storage images
(5. Other cases where you would otherwise switch layouts too often (and you don't even need barriers) relative to the work done on the images. Measurement is needed to confirm GENERAL is better in that case; most likely a premature optimization even then.)
PS: You could transition all the mip levels together to TRANSFER_DST with a single command beforehand, and then transition only the one you need to SRC. With a decent HDD, it might even be best to ship the textures with the mipmaps already stored, if that's an option (and you could perhaps get better quality by generating them offline with a more sophisticated algorithm).
PS2: Too bad there's no dedicated mipmap-creation command. vkCmdBlitImage most likely does something similar under the hood anyway when the destination is smaller than half the source resolution...
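To sketch the PS above (cmd, image, mipLevels, width and height are placeholders, the whole image is assumed to already be in TRANSFER_DST_OPTIMAL from the upload, and error handling plus the final transition to SHADER_READ_ONLY_OPTIMAL are omitted; real code should also clamp the blit extents to at least 1 for the smallest levels):

    VkImageMemoryBarrier barrier = {
        .sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER,
        .srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED,
        .dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED,
        .image = image,
        .subresourceRange = { VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1 },
    };

    for (uint32_t i = 1; i < mipLevels; i++) {
        /* level i-1: TRANSFER_DST -> TRANSFER_SRC, so it can feed level i */
        barrier.subresourceRange.baseMipLevel = i - 1;
        barrier.oldLayout     = VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL;
        barrier.newLayout     = VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL;
        barrier.srcAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT;
        barrier.dstAccessMask = VK_ACCESS_TRANSFER_READ_BIT;
        vkCmdPipelineBarrier(cmd, VK_PIPELINE_STAGE_TRANSFER_BIT,
                             VK_PIPELINE_STAGE_TRANSFER_BIT, 0,
                             0, NULL, 0, NULL, 1, &barrier);

        VkImageBlit blit = {
            .srcSubresource = { VK_IMAGE_ASPECT_COLOR_BIT, i - 1, 0, 1 },
            .srcOffsets = { {0, 0, 0},
                            { width >> (i - 1), height >> (i - 1), 1 } },
            .dstSubresource = { VK_IMAGE_ASPECT_COLOR_BIT, i, 0, 1 },
            .dstOffsets = { {0, 0, 0},
                            { width >> i, height >> i, 1 } },
        };
        vkCmdBlitImage(cmd,
                       image, VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL,
                       image, VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL,
                       1, &blit, VK_FILTER_LINEAR);
    }
    /* afterwards: transition the SRC levels and the last DST level to
     * SHADER_READ_ONLY_OPTIMAL before sampling */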
If you read from the mipmap[n] image to create the mipmap[n+1] image, then you should use the transfer layouts if you want your code to run on all Vulkan implementations and get the most performance across them, as those layouts may be used by the GPU to optimize the image for reads or writes.
So if you want to go cross-vendor, only use VK_IMAGE_LAYOUT_GENERAL for setting up the descriptor that uses the final image, not for the image reads or writes during generation.
If you don't want to use that many transitions you may copy from a buffer instead of an image, though you obviously wouldn't get the format conversion, scaling and filtering that vkCmdBlitImage does for you for free.
Also don't forget to check if the target format actually supports the BLIT_SRC or BLIT_DST bits. This is independent of whether you use the transfer or general layout for copies.
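That check is only a couple of lines (physicalDevice and format are assumed to be in scope):

    VkFormatProperties props;
    vkGetPhysicalDeviceFormatProperties(physicalDevice, format, &props);

    if (!(props.optimalTilingFeatures & VK_FORMAT_FEATURE_BLIT_SRC_BIT) ||
        !(props.optimalTilingFeatures & VK_FORMAT_FEATURE_BLIT_DST_BIT)) {
        /* this format can't be blitted with optimal tiling: pick another
           format, generate mips offline, or use a compute shader instead */
    }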

What is a good access time to a database (SQL)?

Hey, I want to know what counts as a good access time. I'm searching for a good SQL database and HSQLDB says their access time is 12 ms <-- good?
I think it would depend on your needs. Is it for a web server or a desktop application? The amount of data is also important, because reading lots of small records will perform differently than reading a few large records. Access time is also based upon your hardware, software and maybe even some other factors.
For example, you can use a database with lightning-fast access, but if your users need to connect to it over a 5 megabit VPN connection, passing through three different proxies and with traffic coming from around the world, the database would just be a waste of power.
Basically, the number they quote is a marketing claim. It's a good product, but don't focus only on access time; make sure you also look at your other needs. Another system might perform better overall, even with a slower access time, because it is better optimized at reading its indices and so on.
So, what do you want, exactly?
I don't think access time tells you anything, really. If you have slow or incorrectly configured storage, then this access time metric will be dwarfed by how much time is spent on waits and split I/Os. Network latency is also a factor, since I'm guessing you probably won't want to have your code on the same machine as your database, and you will most likely have a few network devices you'll need to traverse in your production environment.
In my experience, all the database platforms these days will perform adequately if configured correctly and paired with a complementary application. Pick the DBMS that best fits your requirements, follow the best practices for configuring that DBMS on your hardware, and you should be pleased with the outcome.

How to prove that code isn't broken, but the hardware is?

I'm sure this happens everywhere. You can 'feel' that the network is slow, or the machine is slow, or something. But the server/chassis logs don't show anything, so IT doesn't believe you. What do you do?
Your regressions are taking twice the time ... but that's not enough.
Okay, you transfer 100 GB using dd etc., but ... that's not enough.
Okay, you get the server placed in a different chassis for two weeks and it works fine ... but that's not enough...
So HOW do you get IT to replace the chassis?
More specifically:
Is there any suite I can run on two setups (supposed to be identical) that can show up a difference in network/CPU/disk access, and which IT will believe?
Computers don't age and slow down the same way we do. If your server is getting slower -- actually slower, not just feels slower because every other computer you use is getting faster -- then there is a reason and it is possible that you may be able to fix it. I'd try cleaning up some disk space, de-fragmenting the disk, and checking what other processes are running (perhaps someone's added more apps to the system and you're just not getting as many cycles).
If your app uses a database, you may want to analyze your query performance and see if some indices are in order. Queries that perform well when you have little data can start taking a long time as the amount of data grows if they have to use table scans. As a former "IT" guy, I'd also be reluctant to throw hardware at a problem just because someone tells me the system is slowing down. I'd want to know what has changed and see if I could get the system running the way it should be. If the app has simply outgrown the hardware -- after you've made suitable optimizations -- then upgrading is a reasonable choice.
Run a standard benchmark suite. See if it pinpoints memory, CPU, bus, or disk when compared to a "working" similar computer.
See http://en.wikipedia.org/wiki/Benchmark_(computing)#Common_benchmarks for some tips.
The only way to prove something is to do a stringent audit.
Now, traditionally, we keep the system constant between two different runs while altering the one variable we are interested in. In this case the variable is the hardware your code is running on. So, in simple terms, you should audit the running of your software on two different sets of hardware, one being the hardware you are unhappy about, and see the difference.
Now if you are to do this properly, which I am sure you are, you will first need to come up with a null hypothesis, something like:
"The slowness of the application is
unrelated to the specific hardware we
are using"
And now you set about disproving that hypothesis in favour of an alternative hypothesis. Once you have collected enough results, you can apply statistical analyses on them, to decide whether any differences are statistically significant. There are analyses to find out how much data you need, and then compare the two sets to decide if the differences are random, or not random (which would disprove your null hypothesis). The type of tests you do will mostly depend on your data, but clever people have made checklists to help us decide.
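For the run-time comparison itself, even a small Welch's t-statistic over the two sets of measurements gets you most of the way (a sketch with placeholder timings; you would compare the result against a t-table for the computed degrees of freedom):

    #include <math.h>
    #include <stdio.h>

    static void mean_var(const double *x, int n, double *mean, double *var)
    {
        double sum = 0.0, ss = 0.0;
        for (int i = 0; i < n; i++) sum += x[i];
        *mean = sum / n;
        for (int i = 0; i < n; i++) ss += (x[i] - *mean) * (x[i] - *mean);
        *var = ss / (n - 1);                      /* sample variance */
    }

    int main(void)
    {
        /* regression run times (minutes) on each chassis -- placeholders */
        double suspect[] = { 52, 49, 55, 51, 53 };
        double good[]    = { 26, 24, 27, 25, 26 };
        int ns = 5, ng = 5;

        double ms, vs, mg, vg;
        mean_var(suspect, ns, &ms, &vs);
        mean_var(good, ng, &mg, &vg);

        /* Welch's t and the Welch-Satterthwaite degrees of freedom */
        double t  = (ms - mg) / sqrt(vs / ns + vg / ng);
        double df = pow(vs / ns + vg / ng, 2) /
                    (pow(vs / ns, 2) / (ns - 1) + pow(vg / ng, 2) / (ng - 1));

        printf("Welch's t = %.2f with about %.1f degrees of freedom\n", t, df);
        return 0;
    }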
It sounds like your main problem is getting IT to listen to you, and raw technical data may not be persuasive to the right people. Getting backing from the business may help you, and that means talking about money.
Luckily, both platforms already contain a common piece of software - the application itself - designed to make or save money for someone. Why not measure how quickly it can do that, e.g. how long does it take to process an order?
By measuring how long your application spends on each sub-task or data source, you can get a rough idea of which piece of the underlying hardware is underperforming. Writing to a local database or handling a data structure larger than RAM will hit the disk, making network calls will hit the network hardware, and CPU-bound calculations will show up as CPU time.
This data will never be as precise as a benchmark, and it may require expensive coding, but it's easier to translate what it finds into money terms. Log4j's NDC and MDC features, and Spring's AOP, might be good enabling tools for you.
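Whatever the language, the instrumentation itself is tiny. In C it is roughly this (the sub-task functions are placeholders for your application's real steps):

    #include <stdio.h>
    #include <time.h>

    /* stand-ins for the application's real sub-tasks */
    void read_order_from_database(void);
    void call_remote_pricing_service(void);
    void compute_totals(void);

    static double now_seconds(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    void process_order(void)
    {
        double t0 = now_seconds();
        read_order_from_database();       /* exercises the disk    */
        double t1 = now_seconds();
        call_remote_pricing_service();    /* exercises the network */
        double t2 = now_seconds();
        compute_totals();                 /* exercises the CPU     */
        double t3 = now_seconds();

        printf("order: db=%.3fs net=%.3fs cpu=%.3fs total=%.3fs\n",
               t1 - t0, t2 - t1, t3 - t2, t3 - t0);
    }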
Run perfmon.msc from Start / Run in Windows 2000 through to Vista, then just add counters for CPU, disk, etc.
For SQL queries you should capture the actual queries then run them manually to see if they are slow.
For instance, if using SQL Server, run the profiler from Tools > SQL Server Profiler. Then perform some operations in your program and look at the capture for any suspicious database calls. Copy and paste one of the queries into a new query window in Management Studio and run it.
For networking you should try artificially limiting your network speed to see how it affects your code (e.g. Traffic Shaper XP is a simple freeware limiter).