If I have an individual file for each process, does it make any difference whether I write to those files using a normal file pointer or using MPI.File?
If you are using simple contiguous I/O operations, then no, there is no difference.
If you are using MPI datatypes (to describe memory) and/or MPI file views (to describe structure or some other non-contiguous file access), then the MPI implementation may have some kind of noncontiguous I/O optimization (e.g. data sieving, list I/O, datatype I/O), in which case MPI.File can make a difference.
I have written my optimization problem in zimpl and used SCIP to solve it.
One of my constraints is
x'Qx <= 0.05 (portfolio risk <= 0.05)
where x is an n*1 vector and Q is the n*n covariance matrix. Currently I am reading my covariance matrix from a txt file, and it's quite large (3000*3000); I used something like param[I]=read "cov.txt".
When I use SCIP to read the zpl file, the parsing takes a long time. Is there a better way to load the data into my problem? Do I have to pass values to the parameters in the zimpl model through a file (disk I/O), or can I pass them in memory?
There are more efficient ways, but they would need programming.
1. You can implement your model directly through the SCIP C/C++ API.
2. You can write a program that embeds zimpl and SCIP; it is then possible to pass the file to zimpl as a string from memory. But I doubt there is a tutorial or documentation, and zimpl would still have to parse the file.
Given that the Linux file system caches files anyway if enough memory is available, this would probably not be much faster than the time you get now if you run the same model a second time directly after the first.
I have a molecular dynamics program that writes atom positions and velocities to a file every n steps of the simulation. The actual writing takes about 90% of the running time! (I checked this by eliminating the writes.) So I desperately need to optimize it.
I see that some Fortran compilers have an extension to change the write buffer size (called the I/O block size) and the "number of blocks" in the OPEN statement, but it appears that gfortran doesn't. I also read somewhere that gfortran uses an 8192-byte write buffer.
I even tried to do an FSTAT (right after opening, is that right?) to see what block size and number of blocks it is using, but it returns -1 for both. (I am compiling for 64-bit Windows.)
Isn't there a way to enlarge the write buffer for a file in gfortran? Will it be different when compiling for Linux rather than Windows?
I'd really rather stay in Fortran, but as a desperate measure, is there a way to do this by adding some C routine?
thanks!
IanH's question is key. Unformatted I/O is MUCH faster than formatted. The conversion from base 2 to base 10 is very CPU intensive. If you don't need the values to be human readable, then use unformatted I/O. If you want to be able to read the values in another language, then use access='stream'.
Another approach would be to add your own buffering. Replace the write statement with a call to a subroutine. Have that subroutine store values and write only when it has received M values. You'll also need a "flush" call to the subroutine to make it write the last values if there are fewer than M of them.
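The buffering idea above can be sketched in a few lines. This is an illustration in Python rather than Fortran, and the names BufferedWriter, put, flush and capacity (the "M" above) are made up for the example; the write_calls counter exists only to show how many real writes happen.

```python
import io


class BufferedWriter:
    """Collect values and write them to `stream` only once `capacity` have accumulated."""

    def __init__(self, stream, capacity):
        self.stream = stream
        self.capacity = capacity      # the "M" from the text
        self.pending = []
        self.write_calls = 0          # counts real writes, for illustration only

    def put(self, value):
        # Store the value; write only when the batch is full.
        self.pending.append(value)
        if len(self.pending) >= self.capacity:
            self.flush()

    def flush(self):
        # Write whatever is pending as a single call to the underlying stream.
        if self.pending:
            self.stream.write("".join(f"{v:.6e}\n" for v in self.pending))
            self.write_calls += 1
            self.pending.clear()


def demo():
    out = io.StringIO()               # stand-in for a real output file
    writer = BufferedWriter(out, capacity=4)
    for i in range(10):
        writer.put(float(i))
    writer.flush()                    # write the last, partial batch
    return out.getvalue().count("\n"), writer.write_calls
```

Here ten values reach the stream in three writes instead of ten; forgetting the final flush would silently drop the last two values.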
If gcc C is faster at IO, you could mix Fortran and C with Fortran's ISO_C_Binding: https://stackoverflow.com/questions/tagged/fortran-iso-c-binding. There are examples of the use of the ISO C Binding in the gfortran manual under "Mixed Language Programming".
If you spend 90% of your runtime writing coords/vels every n timesteps, the obvious quick fix would be to write data less often instead, say, every 100*n timesteps. But I'm sure you already thought of that yourself.
But yes, gfortran has a fixed 8k buffer, whose size cannot be changed except by modifying the libgfortran source and rebuilding it. The reason for the buffering is to amortize the syscall overhead; (simplistic) tests on Linux showed that 8k is sufficient and more than that goes far into diminishing returns territory. That being said, if you have some substantiated claims that bigger buffers are useful on some I/O patterns and/or OS, there's no reason why the buffer can't be made larger in a future release.
As for your performance issues, as already mentioned, unformatted I/O is a lot faster than formatted I/O. Additionally, gfortran has rather high per-I/O-statement overhead. You can amortize that by writing arrays (or array sections) rather than individual elements (this matters mostly for unformatted I/O; for formatted I/O there is so much to do that it doesn't help much).
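To make the formatted/unformatted distinction concrete, here is a rough Python analogue (not gfortran itself). The function names and file names are invented for the example, and it assumes 8-byte doubles, which is what Python's array type code "d" gives on common platforms.

```python
import os
import tempfile
from array import array


def write_text(path, values):
    # Formatted output: every value is converted to decimal text, one write per value.
    with open(path, "w") as f:
        for v in values:
            f.write(f"{v:.17g}\n")


def write_binary(path, values):
    # Unformatted output: the whole array goes out as raw doubles in one call.
    with open(path, "wb") as f:
        array("d", values).tofile(f)


def read_binary(path, count):
    # Read `count` raw doubles back, with no text parsing involved.
    data = array("d")
    with open(path, "rb") as f:
        data.fromfile(f, count)
    return list(data)


def demo():
    values = [0.1 * i for i in range(1000)]
    with tempfile.TemporaryDirectory() as tmp:
        text_path = os.path.join(tmp, "coords.txt")
        bin_path = os.path.join(tmp, "coords.bin")
        write_text(text_path, values)
        write_binary(bin_path, values)
        roundtrip = read_binary(bin_path, len(values)) == values
        return os.path.getsize(bin_path), roundtrip
```

The binary file holds exactly 8 bytes per value and reads back bit-for-bit, with no base-2 to base-10 conversion on either side.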
I think that if the cost of I/O is comparable to or even larger than the cost of the simulation itself, then it probably isn't a good idea to store all this data on disk in the first place. It is better to do whatever processing you intend to do directly during the simulation, instead of saving lots of intermediate data and then reading it back in later to do the processing.
Moreover, MD is an inherently highly parallelizable problem, and with IO you will severely cripple the efficiency of parallelization! I would avoid IO whenever possible.
For individual trajectories, normally you just need to store the initial condition of each trajectory, along with its key statistics, or important snapshots at a small number of time values. When you need one specific trajectory plotted you can regenerate the exact same trajectory or section of trajectory from the initial condition or the closest snapshot, and with similar cost as reading it from the disk.
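A minimal sketch of the snapshot-and-regenerate idea, assuming the simulation is deterministic. The step function here is a hypothetical stand-in (one explicit Euler step of a harmonic oscillator), not a real MD integrator, and the function names are made up for the example.

```python
def step(state):
    # Hypothetical deterministic update: one explicit Euler step of a harmonic oscillator.
    x, v = state
    dt = 0.01
    return (x + dt * v, v - dt * x)


def run(initial, n_steps, snapshot_every):
    """Advance the system, keeping only sparse snapshots {step_index: state}."""
    snapshots = {0: initial}
    state = initial
    for i in range(1, n_steps + 1):
        state = step(state)
        if i % snapshot_every == 0:
            snapshots[i] = state
    return snapshots


def state_at(snapshots, target):
    """Recompute the state at step `target` from the closest earlier snapshot."""
    start = max(i for i in snapshots if i <= target)
    state = snapshots[start]
    for _ in range(target - start):
        state = step(state)
    return state
```

For example, run((1.0, 0.0), 1000, 100) stores 11 states instead of 1001, and state_at recomputes any intermediate step in at most 99 updates. This only reproduces the original trajectory exactly if the update is deterministic, which may not hold for a parallel run with a different reduction order.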
I have a FORTRAN MPI code to solve a flow field.
At the start I want to read data from file and distribute it to the participating processes.
The data consists of several 3-D arrays (velocities in space x, y, z).
Every process stores only a part of each array.
So if every process reads the file (the easiest way, I think), it is not going to work, because each process will only store the first part of the file, corresponding to the number of array elements it can hold.
Can MPI_Bcast work for 3-D arrays? But then things become complex.
Or is there an easier way?
You have, broadly speaking, 2 or 3 choices, depending on your platform.
1. One process reads the input data and sends (parts of) it to the other processes. I wouldn't usually use broadcast for this since it is a collective operation and all processes have to take part. I'd usually just send the necessary information to each process. If it is convenient (and not a memory issue) you could certainly broadcast all the input data to all the processes, it's just not a pattern of operation that I use or see much.
2. All processes read the data that they require. This may involve a process reading an entire input file and only storing those parts it requires. But if you have very large input files you can write routines to read only the necessary part into each process's memory space. This approach may involve processes competing for disk access, which is only slow in a relative sense: if you are running large-scale and long-running parallel computations, waiting a few seconds while all the processes get their data is not much of an overhead.
3. If you have a parallel file system then you can use MPI's parallel I/O routines so that each process reads only those parts of the input data that it requires.
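As an illustration of each process reading only its own part, here is a sketch in Python rather than Fortran. It assumes a raw binary file of 8-byte doubles split into contiguous blocks, one per rank; slab_bounds, read_slab and the file name are made-up names, and no MPI is involved: the rank is simply passed in.

```python
import os
import tempfile
from array import array


def slab_bounds(n_items, n_ranks, rank):
    """Return (start, count) of the contiguous block owned by `rank`."""
    base, extra = divmod(n_items, n_ranks)
    start = rank * base + min(rank, extra)      # the first `extra` ranks get one more item
    count = base + (1 if rank < extra else 0)
    return start, count


def read_slab(path, n_items, n_ranks, rank):
    """Read only this rank's block of doubles from a raw binary file."""
    start, count = slab_bounds(n_items, n_ranks, rank)
    data = array("d")
    with open(path, "rb") as f:
        f.seek(start * data.itemsize)           # skip everything owned by lower ranks
        data.fromfile(f, count)
    return list(data)


def demo():
    values = [float(i) for i in range(10)]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "field.bin")
        with open(path, "wb") as f:
            array("d", values).tofile(f)
        # Pretend to be each of 3 ranks in turn; a real code would use its own MPI rank.
        return [read_slab(path, len(values), 3, r) for r in range(3)]
```

This works only because the offsets can be computed without reading the file. A real 3-D decomposition is usually not one contiguous block per rank, which is where MPI file views or a higher-level library come in.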
The canonical way to handle such an I/O pattern in MPI is to either
1. Read the data on rank 0, then use MPI_Scatter to distribute it. If memory is tight, do this blockwise, or use 1-to-1 communication rather than MPI_Scatter.
2. Use MPI-I/O, and have each rank read its own subset of the data file (to be useful, this of course requires a file format where you can figure out the boundaries without first reading through the entire file).
For extreme scalability, one can combine the two approaches: a subset of processes (say, sqrt(N) as a rough rule of thumb) use MPI I/O, and each MPI process exchanges data with its own I/O process.
If you are running your code on fewer than 1000 cores with a good file system (e.g. Lustre) then just use Fortran I/O where each rank opens the file and reads the data it needs (skipping the rest). Yes, it takes a few minutes, but you're only reading the file once at startup.
MPI I/O (binary only) is non-trivial and you are usually better off using higher-level libraries such as HDF5 or Parallel NetCDF. Performance will depend on how the data is read (contiguous vs non-contiguous and so on). The following links may be helpful ...
http://www.osc.edu/supercomputing/training/pario/parallel-io-nov04.pdf
https://support.scinet.utoronto.ca/wiki/images/0/01/Parallel_io_course.pdf
I was wondering if anyone had any experience with what I am about to embark on. I have several CSV files, each around a GB in size, and I need to load them into an Oracle database. While most of my work after loading will be read-only, I will need to load updates from time to time. Basically I just need a good tool for loading several rows of data at a time into my db.
Here is what I have found so far:
I could use SQL*Loader to do a lot of the work
I could use Bulk-Insert commands
I could use some sort of batch insert
Using prepared statements somehow might be a good idea. I guess I was wondering what everyone thinks is the fastest way to get this insert done. Any tips?
I would be very surprised if you could roll your own utility that will outperform SQL*Loader Direct Path Loads. Oracle built this utility for exactly this purpose - the likelihood of building something more efficient is practically nil. There is also the Parallel Direct Path Load, which allows you to have multiple direct path load processes running concurrently.
From the manual:
Instead of filling a bind array buffer and passing it to the Oracle database with a SQL INSERT statement, a direct path load uses the direct path API to pass the data to be loaded to the load engine in the server. The load engine builds a column array structure from the data passed to it.
The direct path load engine uses the column array structure to format Oracle data blocks and build index keys. The newly formatted database blocks are written directly to the database (multiple blocks per I/O request using asynchronous writes if the host platform supports asynchronous I/O).
Internally, multiple buffers are used for the formatted blocks. While one buffer is being filled, one or more buffers are being written if asynchronous I/O is available on the host platform. Overlapping computation with I/O increases load performance.
There are cases where Direct Path Load cannot be used.
With that amount of data, you'd better be sure of your backing store - the dbf disks' free space.
sqlldr is script-driven and very efficient, generally more efficient than a SQL script.
The only thing I wonder about is the magnitude of the data. I personally would consider several to many sqlldr processes and assign each one a subset of data and let the processes run in parallel.
You said you wanted to load a few records at a time? That may take a lot longer than you think. Did you mean a few files at a time?
You may be able to create an external table on the CSV files and load them in by SELECTing from the external table into another table. I'm not sure whether this method will load faster, but it might be quicker overall than fiddling with SQL*Loader to get it to work, especially when you have criteria for UPDATEs.
Suppose I want to do the following operations on my 2-core machine:
Read a very large file
Compute
Does the file reading operation need to consume 1 core? Previously I just created 2 threads, one to read the file and one to compute. Should I create an additional thread to do compute?
Thanks.
Edit
Thanks guys. Yes, we should always consider whether the file I/O blocks the computing. Now let's assume that the file I/O will never block computing: the computing doesn't depend on the file's data, and we just read the file in for future processing. We have 2 cores, we need to read in a file, and we need to do computing. Is the best solution to create 3 threads, 1 for file reading and 2 for computing, given that, as most of you have already pointed out, file reading consumes very little CPU?
It depends on how your hardware is configured. Normally, reading is not CPU-intensive, thanks to DMA. It may be very expensive though, if it initiates swap-out of other applications. But there is more to it.
Don't read a huge file at once if you can
If your file is really big, you should use mmap or sequential processing whenever you don't need the whole file in memory at once. Try to consume it in chunks if possible.
For example, to sum all values in a huge file, you don't need to load this file into the memory. You can process it by small chunks, accumulating the sum. Memory is an expensive resource in most situations.
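A minimal sketch of the chunked sum, assuming a text file with one number per line. The function name, the chunk size and the file name are arbitrary choices for the example.

```python
import os
import tempfile


def chunked_sum(path, chunk_chars=64 * 1024):
    """Sum one number per line without holding the whole file in memory."""
    total = 0.0
    leftover = ""
    with open(path, "r") as f:
        while True:
            chunk = f.read(chunk_chars)
            if not chunk:
                break
            lines = (leftover + chunk).split("\n")
            leftover = lines.pop()        # the last piece may be an incomplete line
            total += sum(float(s) for s in lines if s)
    if leftover:                          # final line without a trailing newline
        total += float(leftover)
    return total


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "values.txt")
        with open(path, "w") as f:
            f.write("\n".join(str(i) for i in range(1, 1001)))
        # Tiny chunks on purpose, to exercise the handling of lines split across chunks.
        return chunked_sum(path, chunk_chars=37)
```

Memory use is bounded by the chunk size plus one partial line, regardless of how large the file is.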
Reading is sequential
Does the file reading operation need to consume 1 core?
Yes, I think most low-level read operations are implemented sequentially (consume 1 core).
You can avoid blocking on read operation if you use asynchronous I/O, but it is just a variation of the same "read by small chunks" technique. You can launch several small asynchronous read operations at once, but you have always to check if an operation has finished before you use the result.
See also this Stack Overflow answer to a related question.
Reading and computing in parallel
Previously I just created 2 threads, one to read the file and one to compute. Should I create an additional thread to do compute?
It depends. If you need all the data before you can start computing, then there is no reason to run the computation in parallel with the read: it would effectively have to wait until reading is done.
If you can start computing even with partial data, likely you don't need to read the whole file at once. And it is usually much better not to do so with huge files.
What is your bottleneck — computation or IO?
Finally, you should know if your task is computation-bound or input-output bound. If it is limited by the performance of input-output subsystem, there is little benefit in parallelizing computation. If computation is very CPU-intensive, and reading time is negligible, you can benefit from parallelizing computation. Input-output is usually a bottleneck unless you are doing some number-crunching.
This is a good candidate for parallelization, because you have two types of operations here - disk I/O (for reading the file), and CPU load (for your computations). So the first step would be to write your application such that the file I/O wasn't blocking the computation. You could do this by reading a little bit at a time from the file and handing it off to the compute thread.
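The hand-off described above can be sketched with one reader thread and one compute thread connected by a bounded queue. The chunks iterable stands in for real file reads, and summing each chunk is a hypothetical placeholder for the actual computation; all names are made up for the example.

```python
import queue
import threading


def reader(chunks, q):
    # Stand-in for file I/O: `chunks` is any iterable yielding pieces of the input.
    for chunk in chunks:
        q.put(chunk)                # blocks if the compute side falls too far behind
    q.put(None)                     # sentinel: no more data


def compute(q, result):
    total = 0
    while True:
        chunk = q.get()
        if chunk is None:
            break
        total += sum(chunk)         # hypothetical per-chunk computation
    result.append(total)


def run_pipeline(chunks, max_pending=4):
    q = queue.Queue(maxsize=max_pending)   # bounded, so the reader cannot run far ahead
    result = []
    t_read = threading.Thread(target=reader, args=(chunks, q))
    t_comp = threading.Thread(target=compute, args=(q, result))
    t_read.start()
    t_comp.start()
    t_read.join()
    t_comp.join()
    return result[0]
```

For example, run_pipeline([[1, 2, 3], [4, 5], [6]]) returns 21. Note that in CPython, threads overlap I/O with computation but pure-Python compute code does not run on two cores at once, so using both cores for the compute part would need processes or native code.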
But now you're saying you have two cores that you want to utilize. Your second thought about parallelizing the CPU-intensive part is correct, because we can only parallelize compute tasks if we have more than one processor to use. But, it might be the case that the blocking part of your application is still the file I/O - that depends on a lot of factors, and the only way to tell what level of parallelization is appropriate is to benchmark.
SO required caveat: multithreading is hard and error-prone, and it's better to have correct code than fast code, if you can pick only one. But I don't advocate against threads, as you may find from others on the site.
I would think this depends on the computation you are performing. If you are doing very heavy computations then I would suggest threading the application. Reading a file demands very little from your CPU and because of this, the overhead created by threading the application might slow it down.
Another thing to consider is whether you need to load the entire file before you can compute. If so, there is no point in threading it at all, as you will have to complete one action before you can perform the other.