Processing large amounts of files, slowed down by I/O, how to make more efficent? - vb.net

I've been maintaining an application for a few years ever since creator stopped working on it. This application does batch operations on a large amount of pictures such as resizing, applying text, adding a logo, etc.
Lately I've been rewriting this application in order to improve efficiency and try to fully use every capability of the system by using all the available memory and splitting the operation over multiple threads.
The previous maintainer implemented multithreading in somewhat awful ways such as doing long operations in the UI thread (rendering the stop button he made inoperable) and using a really small thread pool.
So I've been optimizing code and trying out increasing the size of the thread pool but apparently nothing improves at all, the bottleneck seems to be reading and writing to disk.
Is there a way to improve this situation?

Determine exactly what IO the application does at the moment. Then try to convert it to sequential access patterns. Random IO is 100x slower on magnetic disks and 10x slower on SSDs.
If it is a magnetic disk make sure that only one thread writes sequentially at a time. If you're forced to do random IO tune the optimal queue length.

Related

An example: Am I understanding GPU advantage correctly?

Just reading a bit about what the advantage of GPU is, and I want to verify I understand on a practical level. Lets say I have 10,000 arrays each containing a billion simple equations to run. On a cpu it would need to go through every single equation, 1 at a time, but with a GPU I could run all 10,000 arrays as as 10,000 different threads, all at the same time, so it would finish a ton faster...is this example spot on or have I misunderstood something?
I wouldn't call it spot on, but I think you're headed in the right direction. Mainly, a GPU is optimized for graphics-related calculations. This does not, however, mean that's all it is capable of.
Without knowing how much detail you want me to go into here, I can say at the very least the concept of running things in parallel is relevant. The GPU is very good at performing many tasks simultaneously in one go (known as running in parallel). CPUs can do this too, but the GPU is specifically optimized to handle much larger numbers of specific calculations with preset data.
For example, to render every pixel on your screen requires a calculation, and the GPU will attempt to do as many of these calculations as it can all at the same time. The more powerful the GPU, the more of these it can handle at once and the faster its clock speed. The end result is a higher-end GPU can run your OS and games in 4k resolution, whereas other cards (or integrated graphics) might only be able to handle 1080p or less.
There's a lot more to this as well, but I figured you weren't looking for the insanely technical explanation.
The bottom line is this: For running a single task on one piece of data, the CPU will normally be faster. A single CPU core is generally much faster than a single GPU core. However, they typically have many cores and for running a single task on many pieces of data (so you have to run it once for each), the GPU will usually be faster. But these are data-driven situations, and as such each situation should be assessed on an individual basis to determine which to use and how to use it.

updating 2 800 000 records with 4 threads

I have a VB.net application with an Access Database with one table that contains about 2,800,000 records, each raw is updated with new data daily. The machine has 64GB of ram and i7 3960x and its over clocked to 4.9GHz.
Note: data sources are local.
I wonder if I use ~10 threads will it finish updating the data to the rows faster.
If it is possiable what would be the mechanisim of deviding this big loop to multiple threads?
Update: Sometimes the loop has to repeat the calculation for some row depending on results also the loop have exacly 63 conditions and its 242 lines of code.
Microsoft Access is not particularly good at handling many concurrent updates, compared to other database platforms.
The more your tasks need to do calculations, the more you will typically benefit from concurrency / threading. If you spin up 10 threads that do little more than send update commands to Access, it is unlikely to be much faster than it is with just one thread.
If you have to do any significant calculations between reading and writing data, threads may show a performance improvement.
I would suggest trying the following and measuring the result:
One thread to read data from Access
One thread to perform whatever calculations are needed on the data you read
One thread to update Access
You can implement this using a Producer / Consumer pattern, which is pretty easy to do with a BlockingCollection.
The nice thing about the Producer / Consumer pattern is that you can add more producer and/or consumer threads with minimal code changes to find the sweet spot.
Supplemental Thought
IO is probably the bottleneck of your application. Consider placing the Access file on faster storage if you can (SSD, RAID, or even a RAM disk).
Well if you're updating 2,800,000 records with 2,800,000 queries, it will definitely be slow.
Generally, it's good to avoid opening multiple connections to update your data.
You might want to show us some code of how you're currently doing it, so we could tell you what to change.
So I don't think (with the information you gave) that going multi-thread for this would be faster. Now, if you're thinking about going multi-thread because the update freezes your GUI, now that's another story.
If the processing is slow, I personally don't think it's due to your servers specs. I'd guess it's more something about the logic you used to update the data.
Don't wonder, test. Write it so you could dispatch as much threads to make the work and test it with various numbers of threads. What does the loop you are talking about look like?
With questions like "if I add more threads, will it work faster"? it is always best to test, though there are rule of thumbs. If the DB is local, chances are that Oded is right.

Improve MPI program

I wrote a MPI program that seems to run ok, but I wonder about performance. Master thread needs to do 10 or more times MPI_Send, and the worker receives data 10 or more times and sends it. I wonder if it gives a performance penalty and whether I could transfer everything in single structs or which other technique could I benefit from.
Other general question, once a mpi program works more or less, what are the best optimization techniques.
It's usually the case that sending 1 large message is faster than sending 10 small messages. The time cost of sending a message is well modelled by considering a latency (how long it would take to send an empty message, which is non-zero because of the overhead of function calls, network latency, etc) and a bandwidth (how much longer it takes to send an extra byte given that the network communications has already started). By bundling up messages into one message, you only incurr the latency cost once, and this is often a win (although it's always possible to come up with cases where it isn't). The best way to know for any particular code is simply to try. Note that MPI datatypes allow you very powerful ways to describe the layout of your data in memory so that you can take it almost directly from memory to the network without having to do an intermediate copy into some buffer (so-called "marshalling" of the data).
As to more general optimization questions about MPI -- without knowing more, all we can do is give you advice which is so general as to not be very useful. Minimize the amount of communications which need to be done; wherever possible, use built-in MPI tools (collectives, etc) rather than implementing your own.
One way to fully understand the performance of your MPI application is to run it within the SimGrid platform simulator. The tooling and models provided are sufficient to get realistic timing predictions of mid-range applications (like, a few dozen thousands lines of C or Fortran), and it can be associated to adapted visualization tools that can help you fully understand what is going on in your application, and the actual performance tradeoffs that you have to consider.
For a demo, please refer to this screencast: https://www.youtube.com/watch?v=NOxFOR_t3xI

Scattered-write speed versus scattered-read speed on modern Intel or AMD CPUs?

I'm thinking of optimizing a program via taking a linear array and writing each element to a arbitrary location (random-like from the perspective of the CPU) in another array. I am only doing simple writes and not reading the elements back.
I understand that a scatted read for a classical CPU can be quite slow as each access will cause a cache miss and thus a processor wait. But I was thinking that a scattered write could technically be fast because the processor isn't waiting for a result, thus it may not have to wait for the transaction to complete.
I am unfortunately unfamiliar with all the details of the classical CPU memory architecture and thus there may be some complications that may cause this also to be quite slow.
Has anyone tried this?
(I should say that I am trying to invert a problem I have. I currently have an linear array from which I am read arbitrary values -- a scattered read -- and it is incredibly slow because of all the cache misses. My thoughts are that I can invert this operation into a scattered write for a significant speed benefit.)
In general you pay a high penalty for scattered writes to addresses which are not already in cache, since you have to load and store an entire cache line for each write, hence FSB and DRAM bandwidth requirements will be much higher than for sequential writes. And of course you'll incur a cache miss on every write (a couple of hundred cycles typically on modern CPUs), and there will be no help from any automatic prefetch mechanism.
I must admit, this sounds kind of hardcore. But I take the risk and answer anyway.
Is it possible to divide the input array into pages, and read/scan each page multiple times. Every pass through the page, you only process (or output) the data that belongs in a limited amount of pages. This way you only get cache-misses at the start of each input page loop.

Does it consume CPU when reading a large file

Suppose I want to do following opeartions on my 2-core machine:
Read a very large file
Compute
Does the file reading operation need to consume 1 core? Previously I just create 2 threads, one to read file and one to compute? Should I create an additional thread to do compute?
Thanks.
Edit
Thanks guys, yea, we should always consider if the file I/O blocks the computing. Now let's just consider that the file I/O will never block computing, you can think the computing doesn't depends on the file's data, we just read the file in for future processing. Now we have 2 core, we need to read in a file, and we need to do computing, is it the best solution to create 3 threads, 1 for file reading and 2 for computing, as most of you has already pointed out: file reading consumes very little CPU?
It depends on how your hardware is configured. Normally, reading is not CPU-intensive, thanks to DMA. It may be very expensive though, if it initiates swap-out of other applications. But there is more to it.
Don't read a huge file at once if you can
If your file is really big, you should use mmap or sequential processing, when you don't need to read a whole file at once. Try to consume it by chunks is possible.
For example, to sum all values in a huge file, you don't need to load this file into the memory. You can process it by small chunks, accumulating the sum. Memory is an expensive resource in most situations.
Reading is sequential
Does the file reading operation need to consume 1 core?
Yes, I think most low-level read operations are implemented sequentially (consume 1 core).
You can avoid blocking on read operation if you use asynchronous I/O, but it is just a variation of the same "read by small chunks" technique. You can launch several small asynchronous read operations at once, but you have always to check if an operation has finished before you use the result.
See also this Stack Overflow answer to a related question).
Reading and computing in parallel
Previously I just create 2 threads, one to read file and one to compute? Should I create an additional thread to do compute?
It depends, if you need all data to start computations, than there is no reason to start computation in parallel. It will have to wait effectively until reading is done.
If you can start computing even with partial data, likely you don't need to read the whole file at once. And it is usually much better not to do so with huge files.
What is your bottleneck — computation or IO?
Finally, you should know if your task is computation-bound or input-output bound. If it is limited by the performance of input-output subsystem, there is little benefit in parallelizing computation. If computation is very CPU-intensive, and reading time is negligible, you can benefit from parallelizing computation. Input-output is usually a bottleneck unless you are doing some number-crunching.
This is a good candidate for parallelization, because you have two types of operations here - disk I/O (for reading the file), and CPU load (for your computations). So the first step would be to write your application such that the file I/O wasn't blocking the computation. You could do this by reading a little bit at a time from the file and handing it off to the compute thread.
But now you're saying you have two cores that you want to utilize. Your second thought about parallelizing the CPU-intensive part is correct, because we can only parallelize compute tasks if we have more than one processor to use. But, it might be the case that the blocking part of your application is still the file I/O - that depends on a lot of factors, and the only way to tell what level of parallelization is appropriate is to benchmark.
SO required caveat: multithreading is hard and error-prone, and it's better to have correct code than fast code, if you can pick only one. But I don't advocate against threads, as you may find from others on the site.
I would think this depends on the computation you are performing. If you are doing very heavy computations then I would suggest threading the application. Reading a file demands very little from your CPU and because of this, the overhead created by threading the application might slow it down.
Another thing to consider is if you need to load the entire file before you can compute, if so, there is no point in threading it at all as you will have to complete one action before you can perform the other.