How to optimize OrangeFS? - file-io

I have built a parallel file system (cluster) with OrangeFS. I am new to this area. How can I optimize it? RAID and Logical Volume Management (LVM) seem to be good choices, but I cannot find any tutorial or documentation about them.

RAID 0 works. The instructions are here: https://help.ubuntu.com/community/Installation/SoftwareRAID

Related

Easiest way to log and save basic GPU stats in Unreal Engine?

I need to log the basic GPU stats (computation times) to a file while testing in the Unreal Engine editor so that I can analyze them afterwards.
What would be the easiest way of doing that? I'm using UE 5.1.
I have no preference for Blueprint; the solution can use Blueprint or not.
I don't need to log the synchronized events (it is OK if they are included too; I just don't need them). I just need the plain basic stats as time goes by.
Any constructive feedback is appreciated. Cheers!
Unreal ships with a powerful profiler called Unreal Insights, which can be used to record and analyze GPU processes as well. It is a standalone tool which you can attach to your Editor session when testing.
You can find it at Engine/Binaries/Win64/UnrealInsights.exe (relative to engine install directory). It saves data to Engine/Programs/UnrealInsights/Saved/TraceSessions (also relative to engine install dir).
Here is the in-depth documentation.

Apache SOLR erratic RAM consumption

I've set up SOLR indices on my computer, and everything works fine.
I'm not experienced with SOLR but after profiling the "start.jar" process for a while, I've noticed that the RAM consumption jumps around a lot, anywhere from 150MB to 400MBish. And this is just for 10K documents!
So as a response, I wrote a script that just waits for SOLR to go past my RAM consumption limit (on shared hosting), and when it does, it kills start.jar and restarts it.
Does this have any adverse effect? And if so, what better solutions are there, besides getting more RAM or using cloud-based SOLR (which also costs money)? Sorry if this sounds stupid, but I just need a working solution.
Thank you.
You need to provide some Index stats:
How big is your index (no. of docs & size in MB/GB)?
How many fields indexed/stored?
How much memory is allocated to JVM?
Do you optimize/commit very often or in real time?
What is the SOLR Query time you see in logs?
Best

Is it meaningful to monitor physical memory usage on AIX?

Due to AIX's special memory-management algorithm, is it meaningful to monitor physical memory usage in order to find memory bottlenecks during performance tuning?
If not, then what kind of KPI am I supposed to keep an eye on to determine whether we need to increase the RAM capacity?
Thanks
If a program requires more memory than is available as RAM, the OS will start swapping memory sections to disk as it sees fit. You'll need to monitor the output of vmstat and look for paging activity. I don't have access to an AIX machine now to illustrate with an example, but I recall the man page is pretty good at explaining what data is represented there.
Also, this looks to be a good writeup about another AIX-specific system monitoring tool, svmon, for watching your system's overall memory:
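As a rough illustration of "look for paging activity", here is a minimal Python sketch that reads captured vmstat text and pulls out the page-in and page-out columns. It assumes the columns are labelled `pi` and `po` in the header row, as in the usual AIX layout; the sample numbers are made up, and the function names are hypothetical. Check the vmstat man page on your own system before relying on the column names.

```python
# Made-up sample in an AIX-like vmstat layout (illustrative only).
SAMPLE = """\
kthr    memory              page              faults        cpu
----- ----------- ------------------------ ------------ -----------
 r  b   avm   fre  re  pi  po  fr   sr  cy  in   sy  cs us sy id wa
 1  0 350000 12000  0   0   0   0    0   0  12  400 150  3  2 95  0
 2  1 420000   900  0  12  40  80  300   0  30  900 310 20 10 50 20
 1  0 351000 11000  0   0   0   0    0   0  11  380 140  2  2 96  0
"""

def paging_samples(vmstat_text):
    """Return a list of (pi, po) tuples, locating columns by header name."""
    pi_idx = po_idx = None
    samples = []
    for line in vmstat_text.splitlines():
        fields = line.split()
        if "pi" in fields and "po" in fields:
            # Header row: remember where the paging columns are.
            pi_idx, po_idx = fields.index("pi"), fields.index("po")
            continue
        if pi_idx is None or len(fields) <= max(pi_idx, po_idx):
            continue  # not a data row (banner, separator, blank line)
        if fields[pi_idx].isdigit() and fields[po_idx].isdigit():
            samples.append((int(fields[pi_idx]), int(fields[po_idx])))
    return samples

def paging_fraction(samples):
    """Fraction of intervals in which any paging-space page-in or page-out occurred."""
    if not samples:
        return 0.0
    return sum(1 for pi, po in samples if pi > 0 or po > 0) / len(samples)

if __name__ == "__main__":
    s = paging_samples(SAMPLE)
    print(s, paging_fraction(s))
```

An occasional non-zero value is usually not a concern; a high fraction sustained over many intervals is the kind of signal that suggests the machine is short on RAM.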
http://www.aixhealthcheck.com/blog.php?id=255
To track the size of your individual application instance(s), there are several options, with the most common being ps. Again, you'll have to check the man page to get information on which options to use. There are several columns for memory size per process. You can compare those values to the overall memory that's available on your machine and understand, by tracking over time, whether your application only ever increases its memory or releases memory when it is done with a task.
Finally, there's quite a body of information from IBM on performance tuning for AIX, but I was never able to find a road-map guide to reading that information. A lot of it assumes you know facts and features that aren't explained in the current doc set, so you then have to try to find an explanation, which often leads to searching for yet another layer of explanations. ! :^/
IHTH.

How Do You Profile & Optimize CUDA Kernels?

I am somewhat familiar with the CUDA visual profiler and the occupancy spreadsheet, although I am probably not leveraging them as well as I could. Profiling & optimizing CUDA code is not like profiling & optimizing code that runs on a CPU. So I am hoping to learn from your experiences about how to get the most out of my code.
There was a post recently looking for the fastest possible code to identify self numbers, and I provided a CUDA implementation. I'm not satisfied that this code is as fast as it can be, but I'm at a loss as to figure out both what the right questions are and what tool I can get the answers from.
How do you identify ways to make your CUDA kernels perform faster?
If you're developing on Linux then the CUDA Visual Profiler gives you a whole load of information, although knowing what to do with it can be a little tricky. On Windows you can also use the CUDA Visual Profiler, or (on Vista/7/2008) you can use Nexus, which integrates nicely with Visual Studio and gives you combined host and GPU profile information.
Once you've got the data, you need to know how to interpret it. The Advanced CUDA C presentation from GTC has some useful tips. The main things to look out for are:
Optimal memory accesses: you need to know what you expect your code to do and then look for exceptions. So if you are always loading floats, and each thread loads a different float from an array, then you would expect to see only 64-byte loads (on current h/w). Any other loads are inefficient. The profiling information will probably improve in future h/w.
Minimise serialization: the "warp serialize" counter indicates that you have shared memory bank conflicts or constant serialization. The presentation goes into more detail on what to do about this, as does the SDK (e.g. the reduction sample).
Overlap I/O and compute: this is where Nexus really shines (you can get the same info manually using cudaEvents). If you have a large amount of data transfer, you want to overlap the compute and the I/O.
Execution configuration: the occupancy calculator can help with this, but simple methods like commenting out the compute to compare expected vs. measured bandwidth are really useful (and vice versa for compute throughput).
This is just a start; check out the GTC presentation and the other webinars on the NVIDIA website.
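The "expected vs. measured bandwidth" comparison mentioned in the list above comes down to simple arithmetic once you have a kernel time (for example from cudaEvents). A minimal Python sketch, assuming you know how many bytes the kernel reads and writes; the function names, the sizes and the peak figure are hypothetical placeholders, not values for any particular GPU:

```python
def effective_bandwidth_gbs(bytes_read, bytes_written, elapsed_ms):
    """Effective bandwidth in GB/s: total bytes moved divided by elapsed time."""
    seconds = elapsed_ms * 1e-3
    return (bytes_read + bytes_written) / seconds / 1e9

def fraction_of_peak(measured_gbs, peak_gbs):
    """How close the kernel gets to the device's theoretical peak bandwidth."""
    return measured_gbs / peak_gbs

if __name__ == "__main__":
    n = 16_000_000                 # hypothetical: 16M floats in, 16M floats out
    bytes_each_way = n * 4         # 4 bytes per float
    measured = effective_bandwidth_gbs(bytes_each_way, bytes_each_way, elapsed_ms=1.0)
    peak = 200.0                   # hypothetical peak in GB/s; use your card's spec
    print(f"{measured:.1f} GB/s, {fraction_of_peak(measured, peak):.0%} of peak")
```

If the memory-only version of a kernel (compute commented out) already sits far below peak, the access pattern is the first thing to look at; if it is near peak and the full kernel is much slower, the kernel is more likely compute-bound.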
If you are using Windows... Check Nexus:
http://developer.nvidia.com/object/nexus.html
The CUDA profiler is rather crude and doesn't provide a lot of useful information. The only way to seriously micro-optimize your code (assuming you have already chosen the best possible algorithm) is to have a deep understanding of the GPU architecture, particularly with regard to using shared memory, external memory access patterns, register usage, thread occupancy, warps, etc.
Maybe you could post your kernel code here and get some feedback?
The NVIDIA CUDA developer forum is also a good place to go for help with this kind of problem.
I hung back because I'm no CUDA expert, and the other answers are pretty good IF the code is already pretty near optimal. In my experience, that's a big IF, and there's no harm in verifying it.
To verify it, you need to find out if the code is for sure not doing anything it doesn't really have to do. Here are ways I can see to verify that:
Run the same code on the vanilla processor, and either take stackshots of it or use a profiler such as Oprofile or RotateRight/Zoom that can give you equivalent information.
Run it on a CUDA processor and do the same thing, if possible.
What you're looking for are lines of code that have high occupancy on the call stack, as shown by the fraction of stack samples containing them. Those are your "bottlenecks". It does not take a very large number of samples to locate them.
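To make "the fraction of stack samples containing them" concrete, here is a small Python sketch. It assumes each sample has already been captured as a list of "function:line" strings (how you capture them depends on your debugger or profiler); the sample data and names are invented for illustration.

```python
from collections import Counter

def line_fractions(stack_samples):
    """Map each code location to the fraction of samples whose stack contains it."""
    total = len(stack_samples)
    if total == 0:
        return {}
    counts = Counter()
    for stack in stack_samples:
        # Count a location once per sample, even if it recurs (e.g. recursion).
        for frame in set(stack):
            counts[frame] += 1
    return {frame: n / total for frame, n in counts.items()}

if __name__ == "__main__":
    samples = [
        ["main:10", "solve:42", "dot:7"],
        ["main:10", "solve:42"],
        ["main:10", "io:3"],
        ["main:10", "solve:42", "dot:7"],
    ]
    ranked = sorted(line_fractions(samples).items(), key=lambda kv: -kv[1])
    for frame, frac in ranked:
        print(f"{frame:10s} {frac:.0%}")
```

A location that appears on, say, 75% of samples is responsible for roughly that share of wall-clock time, so removing or speeding up that call is where the gain is.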

Embedded app and wearing out flash disks

I have an embedded app that needs to do a lot of writing to a flash disk (or other). We cannot use a hard disk due to the environment. This is an industrial system subject to vibration and explosive fuel vapour.
The trouble is, flash has a lifetime of around 100000 write cycles. That is ample for your digital camera, but it wears out after a year in our scenario.
Any alternatives that people have found work for them?
I was thinking of using FRAM but it's been done before here and it's slow and small.
As Nils says, commercial compact flash cards, and drive replacements (NAND) have wear levelling.
If you are using cheap onboard (NOR) flash you might have to do this yourself.
The best way is some sort of ring buffer where you only append data and start overwriting the oldest data once the drive is full. Remember that flash can only erase a full block (page), but it can then append individual bytes to existing data in that page.
Also, can you buffer a page in RAM and then write it once, or do you have to have individual bytes committed at all times?
Most app sheets for embedded processors will have examples of this.
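The ring-buffer idea above can be modelled in a few lines. This Python sketch is a toy simulation, not a flash driver: it only tracks how many times each erase block would be erased when records are appended and the oldest block is reclaimed on wrap-around. The class name, block count and block size are hypothetical.

```python
class RingLog:
    """Toy model of an append-only ring log spread over flash erase blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.blocks = [bytearray() for _ in range(num_blocks)]
        self.erase_counts = [0] * num_blocks
        self.head = 0  # index of the block currently being appended to

    def append(self, record: bytes):
        if len(record) > self.block_size:
            raise ValueError("record larger than one erase block")
        if len(self.blocks[self.head]) + len(record) > self.block_size:
            # Current block is full: move to the next one and erase it.
            # On wrap-around this discards the oldest data. For simplicity
            # the model counts an erase even on a block's first use.
            self.head = (self.head + 1) % len(self.blocks)
            self.blocks[self.head] = bytearray()
            self.erase_counts[self.head] += 1
        # Appending into an already-erased block needs no further erase.
        self.blocks[self.head] += record

if __name__ == "__main__":
    log = RingLog(num_blocks=4, block_size=16)
    for i in range(40):
        log.append(i.to_bytes(8, "little"))
    print(log.erase_counts)
```

Because the head only ever moves forward, erases are spread evenly: no block is erased more than once more than any other, which is the property you want from manual wear levelling.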
You really need to provide much more information:
how much capacity do you need?
what costs are acceptable?
what physical form factor do you need?
what lifetime do you want?
If your storage needs aren't particularly huge and you can deal with the cost, there are battery-backed SRAM parts (up to at least 2 Megabytes per part) that are as fast as RAM (that's what they are) and have no limit on the number of writes. But they cost a lot more than flash.
You could also get a drive with a SATA interface that's populated with DRAM.
This post refers to using embedded Linux. Not sure if this is what you want.
I have a not too different system, but for medical use. We use NOR flash for all parts that have a low update frequency and NAND flash for the rest. I would recommend using UBI/UBIFS as the top layer on the MTD device. UBI/UBIFS takes care of all the underlying problems for you. It helps if you design your system with a lot more physical flash than you need. For example, if you need 100MB, design your HW with 1GB of flash. Then the data can be shuffled around by UBI without any interaction from the systems above.
UBIFS documentation
UBI documentation
As Michael Burr pointed out, we need more info. (Please answer his questions.)
I have an additional question: What kind of interface is this? PATA? SATA? USB?
As others have pointed out, any decent Flash Drive will provide some kind of wear leveling. Look for this in the datasheet for the device. Many vendors will boast about their wear-leveling technique.
You mention 100000 cycles. This seems pretty low to me. Most "industrial grade" flash drives can do a lot more than that (millions). Make sure you aren't using a bargain-basement device. A good flash drive will usually include an equation or calculator tool you can use to figure out the expected lifespan of the device.
(I can say from personal experience that some brands of flash drives hold up a lot better than others, particularly the "industrial" ones. Our drives go through some pretty brutal usage scenarios.)
The other thing that can help a lot is capacity. The higher the capacity of the flash drive, the more room the wear-leveling algorithm has to work with, which means a longer lifespan.
Another thing you can look at is software techniques to minimize wear on the flash components. Do you have a pagefile/swapfile? Maybe you don't need it. If you are creating/deleting lots of temporary files, move them to a RAM disk. Remember, it is erase/reprogram cycles that usually wear out a flash cell, so reducing those operations will usually help.
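As a back-of-the-envelope version of the lifespan calculation and the capacity point above, here is a Python sketch. It assumes ideal wear leveling (every cell is worn equally) and a write-amplification factor you supply; the daily write volume is a hypothetical figure, and a real drive's datasheet formula should take precedence.

```python
def estimated_lifetime_days(capacity_bytes, endurance_cycles,
                            bytes_written_per_day, write_amplification=1.0):
    """Days until every cell reaches its rated cycles, assuming ideal wear leveling."""
    total_writable = capacity_bytes * endurance_cycles
    return total_writable / (bytes_written_per_day * write_amplification)

if __name__ == "__main__":
    cycles = 100_000                    # figure quoted in the question
    daily = 10_000_000_000              # hypothetical: 10 GB written per day
    for capacity in (100_000_000, 1_000_000_000):   # 100 MB vs 1 GB
        days = estimated_lifetime_days(capacity, cycles, daily)
        print(f"{capacity / 1e6:.0f} MB: {days:.0f} days ({days / 365:.1f} years)")
```

Under these assumptions lifetime scales linearly with capacity, which is why ten times the flash buys roughly ten times the life. Without wear leveling the estimate does not apply, because a few hot blocks wear out long before the rest.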
Use SD cards that have a built-in wear leveling controller. That way the write cycles get distributed over all the flash blocks and you get a very long life out of your flash.
"I was thinking of using FRAM but it's been done before here and it's slow and small."
Compare with nvSRAM; that may provide the performance you need.
I have used a Compact Flash card in an embedded system with great success. It has an onboard controller that does all the thinking for you. Not all Compact Flash controllers are equal, so get one that is a recent design and was intended to be used as a hard drive replacement, as those have better wear-levelling algorithms.