How to orchestrate members in a cluster to read new input from a single file once the current job is done?

I am working on a global optimization using brute force. I am wondering if it is possible to complete the following task with Fortran MPI file I/O:
I have three nodes, A, B, C. I want these nodes to search for the optima over six sets of parameter inputs, which are arranged in the following matrix:
0.1 0.2 0.3
0.4 0.5 0.6
0.7 0.8 0.9
1.1 1.2 1.3
1.4 1.5 1.6
1.7 1.8 1.9
Each row vector represents one set of parameter inputs. Which node reads which set does not matter. All I need is to orchestrate nodes A, B, and C so that they run through the six sets of parameters, obtain the corresponding values of the penalty function, and save the output to a single file.
For example, node A pulls the first set, node B the second, and node C the third. Each node takes a while to finish its computation. Since the computation time varies across nodes, it is possible that C finishes the first round first, followed by B and then A. In that case, I want node C to subsequently pull the fourth set of inputs, node B to pull the fifth, and node A to read in the last set.
A <--- 0.1 0.2 0.3
B <--- 0.4 0.5 0.6
C <--- 0.7 0.8 0.9
C <--- 1.1 1.2 1.3
B <--- 1.4 1.5 1.6
A <--- 1.7 1.8 1.9
What troubles me is that the order in which the nodes read the sets for the second-round computation is not known in advance, because of the uncertainty in each node's run time. So I would like to know whether there is a way to program this dynamically with MPI file I/O to meet this parallel need. Can anyone show me a code template that solves this problem?
Thank you very much.
Lee

As much as it pains me to suggest it, this might be the one good use of MPI "shared file pointers". These work in Fortran, too, but I'm going to get the syntax wrong.
Each process can read a row from the file with MPI_File_read_shared. This independent I/O routine updates a global "shared file pointer" bit of state. Should B or C finish their work quickly, they can call MPI_File_read_shared again. If A is slow, whenever it calls MPI_File_read_shared it will read whichever row has not been dealt with yet.
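The question is in Fortran, but as a rough sketch of the same pattern, here is a minimal illustration in Python with mpi4py; the fixed row width, the file names, and the evaluate_penalty function are all assumptions, and the Fortran equivalents are MPI_FILE_OPEN and MPI_FILE_READ_SHARED.
# Minimal sketch of the shared-file-pointer pattern (mpi4py).
# Assumptions: params.txt holds one parameter set per line, space-padded
# to a fixed width of ROW_LEN bytes; evaluate_penalty() stands in for
# your penalty function.
from mpi4py import MPI

ROW_LEN = 32                                  # assumed fixed row width

def evaluate_penalty(params):                 # placeholder objective
    return sum(p * p for p in params)

comm = MPI.COMM_WORLD
fh = MPI.File.Open(comm, "params.txt", MPI.MODE_RDONLY)
out = MPI.File.Open(comm, "results.txt", MPI.MODE_CREATE | MPI.MODE_WRONLY)

buf = bytearray(ROW_LEN)
while True:
    status = MPI.Status()
    # Each call advances the global shared file pointer by one row, so
    # whichever rank asks next automatically gets the next unread row.
    fh.Read_shared(buf, status)
    if status.Get_count(MPI.BYTE) < ROW_LEN:
        break                                 # no rows left
    params = [float(x) for x in buf.decode().split()]
    penalty = evaluate_penalty(params)
    # Results also go through a shared pointer, one line per evaluation.
    out.Write_shared(f"{params} -> {penalty}\n".encode())

out.Close()
fh.Close()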
Some warnings:
Shared file pointers don't get a lot of attention.
The global bit of shared state is typically... a hidden file. So yeah, it might not scale terribly well. Should be fine for a few tens of processes, though.
The global bit of shared state is stored on a file system. Some file systems, like PVFS, do not support the locking required to ensure this shared state is always correct.


Getting "DUAL_INFEASIBLE" when solving a very simple linear programming problem

I am solving a simple LP problem using Gurobi with dual simplex and presolve. Gurobi reports that the model is unbounded, but I cannot see why such a model would be unbounded. Can anyone help me figure out what goes wrong?
I attached the log and also the content in the .mps file.
Thanks very much in advance.
Kind regards,
Hongyu.
The output log and .mps file:
Link to the .mps file: https://studntnu-my.sharepoint.com/:u:/g/personal/hongyuzh_ntnu_no/EV5CBhH2VshForCL-EtPvBUBiFT8uZZkv-DrPtjSFi8PGA?e=VHktwf
Gurobi Optimizer version 9.5.2 build v9.5.2rc0 (mac64[arm])
Thread count: 8 physical cores, 8 logical processors, using up to 8 threads
Optimize a model with 1 rows, 579 columns and 575 nonzeros
Coefficient statistics:
Matrix range [3e-02, 5e+01]
Objective range [7e-01, 5e+01]
Bounds range [0e+00, 0e+00]
RHS range [7e+03, 7e+03]
Iteration Objective Primal Inf. Dual Inf. Time
0 handle free variables 0s
Solved in 0 iterations and 0.00 seconds (0.00 work units)
Unbounded model
The easiest way to debug this is to put a bound on the objective, so the model is no longer unbounded. Then inspect the solution. This is a super easy trick that somehow few people know about.
When we do this with a bound of 100000, we see:
phi = 100000.0000
gamma[11] = -1887.4290
(the rest are zero). Indeed, we can make gamma[11] as negative as we want while still satisfying the constraint R0. Note that gamma[11] is not in the objective.
More advice: it is also useful to write out the LP file of the model and study it carefully. You probably would have caught the error yourself, and that would have prevented this post.
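In case it helps, here is a minimal sketch of that trick with Gurobi's Python API (gurobipy); the file name, the 100000 bound, and the 1e-6 cutoff are assumptions for illustration.
# Sketch of the "bound the objective, then inspect" trick with gurobipy.
import gurobipy as gp
from gurobipy import GRB

m = gp.read("model.mps")          # assumed file name

# Bound the objective so the model can no longer be unbounded, then look
# at which variables run off toward the bound.
obj = m.getObjective()
if m.ModelSense == GRB.MAXIMIZE:
    m.addConstr(obj <= 100000, name="obj_bound")
else:
    m.addConstr(obj >= -100000, name="obj_bound")
m.optimize()

# Print the variables that are nonzero in the bounded solution.
for v in m.getVars():
    if abs(v.X) > 1e-6:
        print(v.VarName, v.X)

# Writing the model out as an LP file and reading it is another good way
# to spot the modelling error.
m.write("model_debug.lp")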

Understanding spacy textcat_multilabel scorer output

I'm trying to understand the output of my textcat_multilabel job. I have 4 text categories and I'm using spaCy version 3.2.0. (The methodology has changed a lot recently and I don't really understand the documentation.)
E    #      LOSS TEXTC...  CATS_SCORE  SCORE
0    0      1.00           51.86       0.52
0    200    122.15         52.90       0.53
This is what I have in my config file. (btw. What is v1?)
scorer = {"#scorers":"spacy.textcat_multilabel_scorer.v1"}
threshold = 0.5
In fact, everything in the standard config file is unchanged from the suggestions, except the dropout, which I increased to 0.5.
The final row of my job shows these values: 0 8400 2.59 87.29 0.87
I am very impressed with the results that I'm getting with this job. Just need to understand what I'm doing.
E is epochs
# is training iterations / batches (see here)
LOSS_TEXTCAT is the loss of your textcat component. Loss normally fluctuates the first few iterations and then trends downward. The exact values are meaningless.
SCORE_TEXTCAT is the score of your textcat component on your dev set, see the docs for some details on that.
SCORE is the overall score of your pipeline, a weighted average of the scores of the components you have. But you only have a textcat, so it's basically the same as that score.
v1 is just version 1; components are versioned in case they are updated later, so you can use older versions with newer code.
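If it helps to see where these scores come from at prediction time, here is a minimal sketch; the model path and the example text are placeholders, and 0.5 is the threshold from your config.
# Minimal sketch; the model path and example text are placeholders.
import spacy

nlp = spacy.load("training/model-best")
doc = nlp("some text to classify")

# textcat_multilabel scores each of your 4 categories independently in
# [0, 1]; CATS_SCORE on the dev set is computed from scores like these.
print(doc.cats)

# The threshold from the config (0.5) decides which labels count as "on"
# when per-label precision/recall/F are computed.
predicted = [label for label, score in doc.cats.items() if score >= 0.5]
print(predicted)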

Basic rules for custom cluster configuration when using distributed learning in Cloud ML

I am investigating the use of custom scale tiers in the Cloud Machine Learning API.
Now, I don't know precisely how to design my custom tiers! I basically use a CIFAR-type model, and I decided to use:
import yaml

if args.distributed:
    config['trainingInput']['scaleTier'] = 'CUSTOM'
    config['trainingInput']['masterType'] = 'complex_model_m'
    config['trainingInput']['workerType'] = 'complex_model_m'
    config['trainingInput']['parameterServerType'] = 'large_model'
    config['trainingInput']['workerCount'] = 12
    config['trainingInput']['parameterServerCount'] = 4

with open('custom_config.yaml', 'w') as f:
    yaml.dump(config, f)
But I can hardly find any information on how to dimension the cluster properly. Are there any "rules of thumb" out there, or do we have to try and test?
Many thanks in advance!
I have done a couple of small experiments, which might be worth sharing. My setup wasn't 100% clean, but I think the rough idea is correct.
The model looks like the CIFAR example, but with a lot of training data. I use averaging and a decaying gradient, as well as dropout.
The "config" naming is (hopefully) explicit: basically 'M{masterCost}_PS{nParameterServer}x{parameterServerCost}_W{nWorker}x{workerCost}'. For parameter servers, I always use "large_model".
The "speed" is the 'global_step/s'
The "cost" is the total ML unit
And I call "efficiency" the number of 'global_step/second/ML unit'
Here are some partial results:
   config         cost  speed  efficiency
0  M1                1    0.5        0.50
1  M6_PS1x3_W2x6    21   10.0        0.48
2  M6_PS2x3_W2x6    24   10.0        0.42
3  M3_PS1x3_W3x3    15   11.0        0.73
4  M3_PS1x3_W5x3    21   15.9        0.76
5  M3_PS2x3_W4x3    21   15.1        0.72
6  M2_PS1x3_W5x2    15    7.7        0.51
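To make the naming and the efficiency column explicit, here is a small Python sketch (just an illustration of the arithmetic, not the actual experiment code) that recomputes cost and efficiency from a config name.
# Recompute "cost" (total ML units) and "efficiency" from the config name;
# the name itself encodes the per-machine costs, e.g. M3_PS1x3_W3x3.
import re

def total_cost(name):
    m = re.fullmatch(r"M(\d+)(?:_PS(\d+)x(\d+))?(?:_W(\d+)x(\d+))?", name)
    master, n_ps, ps_cost, n_w, w_cost = (int(g) if g else 0 for g in m.groups())
    return master + n_ps * ps_cost + n_w * w_cost

def efficiency(name, speed):
    return speed / total_cost(name)           # global_step/s per ML unit

print(total_cost("M3_PS1x3_W3x3"))                    # 15
print(round(efficiency("M3_PS1x3_W3x3", 11.0), 2))    # 0.73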
I know that I should run many more experiments, but I have no time for this.
If I have time, I will dig in deeper.
The main conclusions are:
it might be worth trying a few setups on a small number of iterations, just to decide which config to use before moving on to hyperparameter tuning.
What is good is that the variation is quite limited: from 0.5 to 0.75 is a 50% efficiency increase, which is significant but not dramatic.
For my specific problem, basically, large and expensive units are overkill. The best value I can get is using "complex_model_m".

How do I create a histogram where the bar heights cover a range of values (preferably in Open Office Calc)?

I have a spreadsheet that contains data that ranges from 0.0 to 1.0, e.g.
a, 0.1
b, 0.11
c, 0.7
d, 0.12
...
I'd like a histogram where each bar covers a range of values, e.g. there would be a bar with a height of 3 for the range [0.1, 0.2). How do I do this in Open Office Calc? If it is hard to do, is there a commonly available tool that makes it easy? I'd prefer something that is available on both Linux and Windows.
So far, I've found two "solutions", both of which can do the job, but neither of which is ideal. However, they are both free and available for both Linux and Windows.
Ggobi provides a GUI that allows you to read in data from a CSV file and produce histograms. Unfortunately, the interface isn't that great, and it is hard to figure out how to manipulate the display. For example, by default, the histogram is "on its side", and thus far, I haven't figured out how to make the bars vertical rather than horizontal.
R provides a programming environment for statistics with some handy graphics packages. For example, you can create a histogram and put it into a PDF file with just a few lines of code:
result <- read.csv("myTable.csv")
str(result) # look at the structure of the resulting data frame
attach(result) # make the components of result available as objects
pdf("myTable.pdf") # send the following plots to a PDF file
hist(X.TCC) # histogram; pass breaks = seq(0, 1, by = 0.1) for 0.1-wide bins
plot(X.TCC, MWE, pch="*") # scatter plot of two of the columns
dev.off() # close the PDF device
The drawback is that you need to learn something about the R environment.

Moving beyond R's optim function

I am trying to use R to estimate a multinomial logit model with a manual specification. I have found a few packages that allow you to estimate MNL models here or here.
I've found some other writings on "rolling" your own MLE function here. However, from my digging around, all of these functions and packages rely on the internal optim function.
In my benchmark tests, optim is the bottleneck. Using a simulated dataset with ~16000 observations and 7 parameters, R takes around 90 seconds on my machine. The equivalent model in Biogeme takes ~10 seconds. A colleague who writes his own code in Ox reports around 4 seconds for this same model.
Does anyone have experience with writing their own MLE function or can point me in the direction of something that is optimized beyond the default optim function (no pun intended)?
If anyone wants the R code to recreate the model, let me know - I'll gladly provide it. I haven't provided it since it isn't directly relevant to the problem of optimizing the optim function and to preserve space...
EDIT: Thanks to everyone for your thoughts. Based on a myriad of comments below, we were able to get R in the same ballpark as Biogeme for more complicated models, and R was actually faster for several smaller / simpler models that we ran. I think the long term solution to this problem is going to involve writing a separate maximization function that relies on a fortran or C library, but am certainly open to other approaches.
Have you tried the nlm() function already? I don't know if it's much faster, but it does improve speed. Also check the options: optim uses a slow algorithm by default. You can gain a more than 5-fold speedup by using the quasi-Newton algorithm (method="BFGS") instead of the default. If you're not too concerned about the last digits, you can also relax the tolerance levels of nlm() to gain extra speed.
f <- function(x) sum((x-1:length(x))^2)
a <- 1:5
system.time(replicate(500,
optim(a,f)
))
user system elapsed
0.78 0.00 0.79
system.time(replicate(500,
optim(a,f,method="BFGS")
))
user system elapsed
0.11 0.00 0.11
system.time(replicate(500,
nlm(f,a)
))
user system elapsed
0.10 0.00 0.09
system.time(replicate(500,
nlm(f,a,steptol=1e-4,gradtol=1e-4)
))
user system elapsed
0.03 0.00 0.03
Did you consider the material on the CRAN Task View for Optimization?
I am the author of the R package optimParallel, which could be helpful in your case. The package provides parallel versions of the gradient-based optimization methods of optim(). The main function of the package is optimParallel(), which has the same usage and output as optim(). Using optimParallel() can significantly reduce optimization times, as illustrated in the following figure (p is the number of parameters).
See https://cran.r-project.org/package=optimParallel and http://arxiv.org/abs/1804.11058 for more information.
FWIW, I've done this in C-ish, using OPTIF9. You'd be hard-pressed to go faster than that. There are plenty of ways for something to go slower, such as by running an interpreter like R.
Added: From the comments, it's clear that OPTIF9 is used as the optimizing engine. That means that most likely the bulk of the time is spent in evaluating the objective function in R. While it is possible that C functions are being used underneath for some of the operations, there still is interpreter overhead. There is a quick way to determine which lines of code and function calls in R are responsible for most of the time, and that is to pause it with the Escape key and examine the stack. If a statement costs X% of time, it is on the stack X% of the time. You may find that there are operations that are not going to C and should be. Any speedup factor you get this way will be preserved when you find a way to parallelize the R execution.