Should I trim outliers from input features - tensorflow

Almost half of my input feature columns have extreme "outliers": for example, in one column the mean is 19.6 but the max is 2908.0. Is that OK, or should I trim (clip) those values to mean + std?
       msg_cnt_in_x  msg_cnt_in_other  msg_cnt_in_y
count      330096.0          330096.0      330096.0
mean           19.6               2.6          38.3
std            41.1               8.2          70.7
min             0.0               0.0           0.0
25%             0.0               0.0           0.0
50%             3.0               1.0           8.0
75%            21.0               2.0          48.0
max          2908.0            1296.0        4271.0

There is no general answer to that; it depends very much on your problem and data set.
You should look into your data and check whether these outlier points are actually valid and important. If they were caused by errors during data collection, delete them. If they are valid, you can expect similar values in your test data, so the points should stay in the data set.
If you are unsure, just test both and pick the one that works better.
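If you do decide to bound the extreme values, clipping is usually gentler than dropping rows. Here is a minimal pandas sketch, with a toy frame standing in for the real feature table and mean + 3*std as an arbitrary threshold (the question suggested mean + std; the multiplier is a knob to tune):

import numpy as np
import pandas as pd

# toy frame standing in for the real feature table
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "msg_cnt_in_x": rng.exponential(20.0, 1000),
    "msg_cnt_in_other": rng.exponential(3.0, 1000),
    "msg_cnt_in_y": rng.exponential(40.0, 1000),
})

# cap values above mean + 3*std per column instead of deleting rows,
# so unusually large counts are kept but bounded
upper = df.mean() + 3 * df.std()
df_clipped = df.clip(upper=upper, axis=1)

Train once on the raw features and once on the clipped ones, and keep whichever validates better; that is the "test both" suggestion above in code form.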

Related

Basic rules for custom cluster configuration when using distributed learning in Cloud ML

I am investigating the use of custom scale tiers in the Cloud Machine Learning API.
Now, I don't know precisely how to design my custom tiers! I basically use a CIFAR-type model, and I decided to use:
import yaml

if args.distributed:
    config['trainingInput']['scaleTier'] = 'CUSTOM'
    config['trainingInput']['masterType'] = 'complex_model_m'
    config['trainingInput']['workerType'] = 'complex_model_m'
    config['trainingInput']['parameterServerType'] = 'large_model'
    config['trainingInput']['workerCount'] = 12
    config['trainingInput']['parameterServerCount'] = 4

with open('custom_config.yaml', 'w') as f:
    yaml.dump(config, f)
But I can hardly find any information on how to size the cluster properly. Are there any "rules of thumb" out there, or do we have to try and test?
Many thanks in advance!
I have done a couple of small experiments which might be worth sharing. My setup wasn't 100% clean, but I think the rough idea is correct.
The model looks like the CIFAR example, but with a lot of training data. I use averaging and a decaying gradient, as well as dropout.
The "config" naming is (hopefully) explicit: basically 'M{masterCost}_PS{nParameterServer}x{parameterServerCost}_W{nWorker}x{workerCost}'. For parameter servers, I always use "large_model".
The "speed" is the 'global_step/s'.
The "cost" is the total number of ML units.
And I call "efficiency" the number of 'global_step/s' per ML unit.
Here are some partial results:
   config          cost  speed  efficiency
0  M1                 1    0.5        0.50
1  M6_PS1x3_W2x6     21   10.0        0.48
2  M6_PS2x3_W2x6     24   10.0        0.42
3  M3_PS1x3_W3x3     15   11.0        0.73
4  M3_PS1x3_W5x3     21   15.9        0.76
5  M3_PS2x3_W4x3     21   15.1        0.72
6  M2_PS1x3_W5x2     15    7.7        0.51
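For reference, the efficiency column is just speed divided by cost; a quick sketch reproducing and ranking it from the numbers above:

import pandas as pd

runs = pd.DataFrame({
    "config": ["M1", "M6_PS1x3_W2x6", "M6_PS2x3_W2x6", "M3_PS1x3_W3x3",
               "M3_PS1x3_W5x3", "M3_PS2x3_W4x3", "M2_PS1x3_W5x2"],
    "cost":   [1, 21, 24, 15, 21, 21, 15],
    "speed":  [0.5, 10.0, 10.0, 11.0, 15.9, 15.1, 7.7],
})
# global_step/s per ML unit
runs["efficiency"] = runs["speed"] / runs["cost"]
print(runs.sort_values("efficiency", ascending=False))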
I know that I should run many more experiments, but I have no time for this.
If I have time, I will dig in deeper.
The main conclusions are:
It might be worth trying a few setups on a small number of iterations, just to decide which config to use before moving on to hyperparameter tuning.
What is good is that the variation is quite limited: going from 0.5 to 0.75 is a 50% efficiency increase, which is significant but not explosive.
For my specific problem, large and expensive units are basically overkill; the best value I can get is using "complex_model_m".

How to orchestrate members in a cluster to read new input from a single file once the current job is done?

I am working on a global optimization using brute force. I am wondering if it is possible to complete the following task with Fortran MPI file I/O:
I have three nodes, A, B, C. I want these nodes to search for the optima over six sets of parameter inputs, which are arranged in the following matrix:
0.1 0.2 0.3
0.4 0.5 0.6
0.7 0.8 0.9
1.1 1.2 1.3
1.4 1.5 1.6
1.7 1.8 1.9
A row vector represents one set of parameter inputs. The order in which the nodes read the parameter sets does not matter. All I need is to orchestrate nodes A, B, C to run through the six sets of parameters, obtain the corresponding value of the penalty function, and save the output to a single file.
For example, node A pulls the first set, node B the second, and node C the third. Each node takes a while to finish its computation. Since the computation time varies across nodes, it is possible that C is the first to finish the first round, followed by B and then A. In that case, I want node C to subsequently pull the fourth set of inputs, node B the fifth, and node A the last one.
A <--- 0.1 0.2 0.3
B <--- 0.4 0.5 0.6
C <--- 0.7 0.8 0.9
C <--- 1.1 1.2 1.3
B <--- 1.4 1.5 1.6
A <--- 1.7 1.8 1.9
What troubles me is that the assignment of parameter sets to nodes in the second round is not known in advance, due to the uncertainty in each node's run time. So I would like to know if there is a way to program this dynamically with MPI file I/O to meet such a parallel need. Can anyone show me a code template that solves this problem?
Thank you very much.
Lee
As much as it pains me to suggest it, this might be the one good use of MPI "shared file pointers". These work in Fortran too, but I'm going to get the syntax wrong.
Each process can read a row from the file with MPI_File_read_shared. This independent I/O routine updates a global "shared file pointer" bit of state. Should B or C finish their work quickly, they can call MPI_File_read_shared again. If A is slow, whenever it calls MPI_File_read_shared it will read whatever has not been dealt with yet.
Some warnings:
Shared file pointers don't get a lot of attention.
The global bit of shared state is typically... a hidden file. So yeah, it might not scale terribly well. It should be fine for a few tens of processes, though.
The global bit of shared state is stored on a file system. Some file systems, like PVFS, do not support the locking required to ensure this shared state is always correct.
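For illustration only, here is a minimal sketch of that pattern in Python with mpi4py (the corresponding Fortran routine is MPI_FILE_READ_SHARED); it assumes the 6x3 parameter matrix is stored in a file called params.bin as consecutive 64-bit floats, and penalty() is a made-up stand-in for the real objective:

import numpy as np
from mpi4py import MPI

def penalty(params):
    # hypothetical stand-in for the real objective function
    return float(np.sum(params ** 2))

comm = MPI.COMM_WORLD
fh = MPI.File.Open(comm, "params.bin", MPI.MODE_RDONLY)
out = MPI.File.Open(comm, "results.txt", MPI.MODE_WRONLY | MPI.MODE_CREATE)

row = np.empty(3, dtype=np.float64)
status = MPI.Status()
while True:
    # each call atomically advances the *shared* file pointer, so whichever
    # rank is free next claims the next unread row
    fh.Read_shared(row, status)
    if status.Get_count(MPI.DOUBLE) < 3:
        break  # nothing left to read
    line = "rank %d  params %s  penalty %.6f\n" % (
        comm.Get_rank(), row.tolist(), penalty(row))
    out.Write_shared(line.encode())  # shared-pointer writes serialize too

out.Close()
fh.Close()

Because each Read_shared claims a distinct, not-yet-read row, the row-to-rank assignment simply falls out of whichever rank frees up first, which is the dynamic schedule asked for in the question.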

Save face color in .obj file

I have an .obj file storing a triangle mesh. I wish to record a color for each of the triangle faces. Is there a way to save this information in the .obj file so that software like MeshLab can identify and visualize it?
It's way too late to help you, but since I came across the same problem, someone else might run into it later.
Here is how I would do it:
As you can read on Wikipedia's page about .obj file format,
Some applications support vertex colors, by putting red, green and blue values after x y and z (this precludes specifying w). The color values range from 0 to 1.
This method is now widely supported. I figured out that giving the vertices the proper colors will make most software (including MeshLab) automatically color the faces with them. This is convenient when you want to store everything in a single OBJ file.
v 0.0 0.0 0.0 1.0 0.0 0.0
v 0.0 0.5 0.5 1.0 0.0 0.0
v 0.5 0.5 0.0 1.0 0.0 0.0
f 1 2 3
Will make a red triangle.
Obviously, it will look way better if you use an MTL file. But this is quick to make and useful for testing your code when you do 3D scans. Also, I did not try this with indexed vertices, so it might not work the same way.
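As a small illustration of the trick, here is a sketch that produces per-face colors by giving each face its own copy of its three vertices with the face color attached (write_colored_obj and the toy triangle below are made up for the example):

# write per-face colors by duplicating each face's vertices and
# attaching the face color as "v x y z r g b" lines (values in 0..1)
def write_colored_obj(path, vertices, faces, face_colors):
    with open(path, "w") as f:
        for tri, (r, g, b) in zip(faces, face_colors):
            for vi in tri:
                x, y, z = vertices[vi]
                f.write("v %f %f %f %f %f %f\n" % (x, y, z, r, g, b))
        for i in range(len(faces)):
            # vertices were written 3 per face; OBJ indices start at 1
            f.write("f %d %d %d\n" % (3 * i + 1, 3 * i + 2, 3 * i + 3))

# example: one red triangle
write_colored_obj("tri.obj",
                  [(0.0, 0.0, 0.0), (0.0, 0.5, 0.5), (0.5, 0.5, 0.0)],
                  [(0, 1, 2)],
                  [(1.0, 0.0, 0.0)])

Duplicating vertices per face costs some file size, but it keeps everything in a single OBJ that viewers supporting vertex colors can display.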

variable size rolling window regression

In pandas OLS the window size is a fixed number of rows. How can I set the window size based on the index instead of the number of rows?
I have a series with a variable number of observations per day and 10 years of history, and I want to run a rolling OLS over a 1-year rolling window. Looping through each date is a bit too slow; is there any way to make it faster? Here is an example of the data.
Date       x     y
2008-1-2   10.0  2
2008-1-2   5.0   1
2008-1-3   7.0   1.5
2008-1-5   9.0   3.0
...
2013-5-30  11.0  2.5
I would like something simple like pandas.ols(df.y, df.x, window='1y'), rather than looping over each row, since the loop will be slow.
There is a method for doing this in pandas; see the documentation at http://pandas.pydata.org/pandas-docs/dev/computation.html#computing-rolling-pairwise-correlations:
model = pandas.ols(y=df.y, x=df.x, window=250)
You will just have to provide your period as a number of rows in the frame instead of '1y'. There are also many additional options that you might find useful for your data.
All the rolling OLS statistics are in model:
model.beta.plot()
to show the rolling beta.
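Note that window=250 is a row-count window. If you really want a time-based ('1y') window with current pandas (pandas.ols was removed in later versions), one workaround is to compute the rolling beta directly as cov(x, y) / var(x) over a DatetimeIndex; a rough sketch with a toy frame shaped like the question's data:

import pandas as pd

# toy frame shaped like the question's data
df = pd.DataFrame({
    "Date": ["2008-01-02", "2008-01-02", "2008-01-03", "2008-01-05"],
    "x": [10.0, 5.0, 7.0, 9.0],
    "y": [2.0, 1.0, 1.5, 3.0],
})
df = df.set_index(pd.to_datetime(df["Date"])).sort_index()

# rolling beta of y on x over a ~1-year time window: cov(x, y) / var(x)
roll = df["x"].rolling("365D")
beta = roll.cov(df["y"]) / roll.var()
print(beta)

This stays vectorized, so it avoids looping over each date; '365D' is just one way to approximate a 1-year window.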

Unexpected value printed when using %.1f

I'm trying to display floats to just one decimal place. I'm getting unexpected results as follows:
Code:
float a = 1.25;
float b = 1.35;
NSLog(@"1.25 -> %.1f\n1.35 -> %.1f", a, b);
Output:
1.25 -> 1.2
1.35 -> 1.4
Expected output, either:
1.25 -> 1.3
1.35 -> 1.4
or:
1.25 -> 1.2
1.35 -> 1.3
Is this simply due to the internal conversion between binary and decimal? If so, how do I get the expected behaviour?
I'm using Xcode 4.6.
edit: Okay, thanks to TonyK and H2CO3, it's due to the binary representation of decimals.
float a = 1.25;
float b = 1.35;
NSLog(@"1.25 -> %.30f\n1.35 -> %.30f", a, b);
1.25 -> 1.250000000000000000000000000000
1.35 -> 1.350000000000000088817841970013
Lots of good info, but as far as I can see no one has approached the second question: how do I get the expected behaviour? Rounding numbers in Objective-C is quite a different question.
1.35 is 27/20, which in binary is
1.01 0110 0110 0110 0110 0110 0110....
A float has a 23-bit mantissa on most systems (not counting the implied leading 1.), so this gets rounded up to
1.01 0110 0110 0110 0110 0110 1
(because 0110 is unambiguously greater than half of 1000). So it's strictly greater than 1.35 by the time printf sees it. Hence 1.4.
As for 1.25, this is exactly representable in binary as
1.01
So printf sees its exact value. But how should it round 1.25? We were taught in school to round 5 up to 10. But most modern systems use a default rounding mode called "round to even" at the hardware level, because it lessens the effect of cumulative rounding errors. This means that when a number is exactly between the two nearest candidates for rounding, it gets rounded to the even candidate.
So it seems that printf is using "round to even" for decimal output! I checked this hypothesis with a quick test on ideone, and indeed 1.75 gets rounded up, to 1.8. This is a surprise to me, but not a huge one.
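The same two effects can be reproduced outside Objective-C; for instance, Python 3's float formatting (doubles rather than floats, but the same story) is also correctly rounded, with ties going to the even digit:

# 1.35 is stored as a slightly larger value, so it rounds up to 1.4;
# 1.25 and 1.75 are exact ties, and the tie goes to the even last digit
print("%.30f" % 1.35)   # 1.350000000000000088817841970013
print("%.1f" % 1.35)    # 1.4
print("%.1f" % 1.25)    # 1.2
print("%.1f" % 1.75)    # 1.8

If you really need schoolbook half-up rounding of the printed decimal, you generally have to round in a decimal representation (for example via NSDecimalNumber) rather than rely on %.1f applied to a binary float.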
That's because floating-point numbers aren't exact. %.1f prints the number rounded to one decimal place; however, 1.35 can't be represented exactly in binary, so what is actually stored (and then rounded) is a slightly different nearby value.
Read about this behavior here: What Every Computer Scientist Should Know About Floating-Point Arithmetic.