K-Fold Cross-Validation How Many Folds? - data-science

Working with K-Fold Cross-Validation I commonly see 5 folds and 10 folds employed. A 1995 paper recommends 10 fold cv. However that conclusion was based on small datasets using models of that time.
I'm just wondering if current use of 5 & 10 folds still harks back to that paper as a convention? Or are there are other good reasons to use 5 or 10 folds rather than say 6, 8, 12 etc?

This is just tradition. These are just nice numbers that people like and divide many things evenly. This works out to nice numbers like 10% and 20% each time. If you used 8, that would 12.5% each. Not as nice a number right?
It's possible for your dataset, another number works better, but it isn't worth the trouble to figure that out. If you tried to publish with 7-fold cross-validation, people would give you funny looks and become suspicious. Stick to the standards.

K-Fold Cross Validation is helpful when the performance of your model shows significant variance based on your Train-Test split.
Using 5 or 10 is neither is a norm nor there is a rule. you can use as many Folds (K= 2, 3, 4, to smart guess).
K fold cross validation is exploited to solve problems where Training data is limited .
I have came across an example in a book (Francois Chollet's book example shared below) where K=4 so it depend on your requirement.
enter code here
`k = 4
num_validation_samples = len(data) // k
np.random.shuffle(data)
validation_scores = []
for fold in range(k):
validation_data = data[num_validation_samples * fold:
num_validation_samples * (fold + 1)]
training_data = data[:num_validation_samples * fold] +
data[num_validation_samples * (fold + 1):
model = get_model()
model.train(training_data)
validation_score = model.evaluate(validation_data)
validation_scores.append(validation_score)
validation_score = np.average(validation_scores)
model = get_model()
model.train(data)
test_score = model.evaluate(test_data)`
Three-fold validation Pictorial Description

Related

Understanding Time2Vec embedding for implementing this as a keras layer

The paper time2vector link (the relevant theory is in section 4) shows an approach to include a time embedding for features to improve model performance. I would like to give this a try. I found a implementation as keras layer which I changed a little bit. Basically it creates two matrices for one feature:
(1) linear = w * x + b
(2) periodic = sin(w * x + b)
Currently I choose this feature manually. Concerning the paper there are a few things i don't understand. The first thing is the term k as the number of sinusoids. The authors use up to 64 sinusoids. What does this mean? I have just 1 sinusoid at the moment, right? Secondly I'm about to put every feature I have through the sinus transformation for me dataset that would make 6 (sinusoids) periodic features. The authors use only one linear term. How should I choose the feature for the linear term? Unfortunately the code from the paper is not available anymore. Has anyone worked with time embeddings or even with this particularly approach?
For my limited understanding, the linear transformation of time is a fixed element of the produced embedding and the parameter K allows you to select how many different learned time representations you want to use in your model. So, the resulting embedding has a size of K+1 elements.

How can I find a standard method of predicting next values of a stock market using Tensorflow?

Thank you for reading. I'm not good at English.
I am wondering how to predict and get future time series data after model training. I would like to get the values after N steps.
I wonder if the time series data has been properly learned and predicted.
How i do this right get the following (next) value?
I want to get the next value using like model.predict or etc
I have x_test and x_test[-1] == t, so the meaning of the next value is t+1, t+2, .... t+n,
In this example I want to get predictions of the next t+1, t+2 ... t+n
First
I tried using stock index data
inputs = total_data[len(total_data) - forecast - look_back:]
inputs = scaler.transform(inputs)
X_test = []
for i in range(look_back, inputs.shape[0]):
X_test.append(inputs[i - look_back:i])
X_test = np.array(X_test)
predicted = model.predict(X_test)
but the result is like below
The results from X_test[-20:] and the following 20 predictions looks like same.
I'm wondering if it's the correct train and predicted value.
I'm wondering if it was a right training and predict.
full source
The method I tried first did not work correctly.
Seconds
I realized something is wrong, I tried using another official data
So, I used the time series in the Tensorflow tutorial to practice predicting the model.
a = y_val[-look_back:]
for i in range(N-step prediction): # predict a new value n times.
tmp = model.predict(a.reshape(-1, look_back, num_feature)) # predicted value
a = a[1:] # remove first
a = np.append(a, tmp) # insert predicted value
The results were predicted in a linear regression shape very differently from the real data.
Output a linear regression that is independent of the real data:
full source (After the 25th line is my code.)
I'm really very curious what is a standard method of predicting next values of a stock market.
Thank you for reading the long question. I seek advice about your priceless opinion.
Q : "How can I find a standard method of predicting next values of a stock market...?"
First - salutes to C64 practitioner!
Next, let me say, there is no standard method - there cannot be ( one ).
Principally - let me draw from your field of a shared experience - one can easily predict the near future flow of laminar fluids ( a technically "working" market instrument - is a model A, for which one can derive a better or worse predictive tool )
That will never work, however, for turbulent states of the fluids ( just read the complexity of the attempts to formulate the many-dimensional high-order PDE for a turbulence ( and it still just approximates the turbulence ) ) -- and this is the fundamentally "working" market ( after some expected fundamental factor was released ( read NFP or CPI ) or some flash-news was announced in the news - ( read a Swiss release of currency-bonding of CHF to some USD parity or Cyprus one time state tax on all speculative deposits ... the financial Big Bangs follow ... )
So, please, do not expect one, the less any simple, model for reasonably precise predictions, working for both the laminar and turbulent fluidics - the real world is for sure way more complex than this :o)

Confused by random.randn()

I am a bit confused by the numpy function random.randn() which returns random values from the standard normal distribution in an array in the size of your choosing.
My question is that I have no idea when this would ever be useful in applied practices.
For reference about me I am a complete programming noob but studied math (mostly stats related courses) as an undergraduate.
The Python function randn is incredibly useful for adding in a random noise element into a dataset that you create for initial testing of a machine learning model. Say for example that you want to create a million point dataset that is roughly linear for testing a regression algorithm. You create a million data points using
x_data = np.linspace(0.0,10.0,1000000)
You generate a million random noise values using randn
noise = np.random.randn(len(x_data))
To create your linear data set you follow the formula
y = mx + b + noise_levels with the following code (setting b = 5, m = 0.5 in this example)
y_data = (0.5 * x_data ) + 5 + noise
Finally the dataset is created with
my_data = pd.concat([pd.DataFrame(data=x_data,columns=['X Data']),pd.DataFrame(data=y_data,columns=['Y'])],axis=1)
This could be used in 3D programming to generate non-overlapping random values. This would be useful for optimization of graphical effects.
Another possible use for statistical applications would be applying a formula in order to test against spacial factors affecting a given constant. Such as if you were measuring a span of time with some formula doing something but then needing to know what the effectiveness would be given various spans of time. This would return a statistic measuring for example that your formula is more effective in the shorter intervals or longer intervals, etc.
np.random.randn(d0, d1, ..., dn) Return a sample (or samples) from the “standard normal” distribution(mu=0, stdev=1).
For random samples from , use:
sigma * np.random.randn(...) + mu
This is because if Z is a standard normal deviate, then will have a normal distribution with expected value and standard deviation .
https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.random.randn.html
https://en.wikipedia.org/wiki/Normal_distribution

Numpy- Deep Learning, Training Examples

Silly Question, I am going through the third week of Andrew Ng's newest Deep learning course, and getting stuck at a fairly simple Numpy function ( i think? ).
The exercise is to find How many training examples, m , we have.
Any idea what the Numpy function is to find out about the size of a preloaded training example.
Thanks!
shape_X = X.shape
shape_Y = Y.shape
m = ?
print ('The shape of X is: ' + str(shape_X))
print ('The shape of Y is: ' + str(shape_Y))
print ('I have m = %d training examples!' % (m))
It depends on what kind of storage-approach you use.
Most python-based tools use the [n_samples, n_features] approach where the first dimension is the sample-dimension, the second dimension is the feature-dimension (like in scikit-learn and co.). Alternatively expressed: samples are rows and features are columns.
So:
# feature 1 2 3 4
x = np.array([[1,2,3,4], # first sample
[2,3,4,5], # second sample
[3,4,5,6]
])
is a training-set of 3 samples with 4 features each.
The sizes M,N (again: interpretation might be different for others) you can get with:
M, N = x.shape
because numpy's first dimension are rows, numpy's second dimension are columns like in matrix-algebra.
For the above example, the target-array is of shape (M) = n_samples.
Anytime you want to find the number of training examples or the size of an array, you can use
m = X.size
This will give you the size or the total number of the examples. In this case, it would be 400.
The above method is also correct but not the optimal method to find the size since, in large datasets, the values could be large and while python easily handles large values, it is not advisable to utilize extra unneeded space.
Or a better way of doing the above scenario is
m=X.shape[1]

Why shuffling data gives significantly higher accuracy?

In Tensorflow, I've wrote a big model for 2 image classes problem. My question is concerned with the following code snippet:
X, y, X_val, y_val = prepare_data()
probs = calc_probs(model, session, X)
accuracy = float(np.equal(np.argmax(probs, 1), np.argmax(y, 1)).sum()) / probs.shape[0]
loss = log_loss(y, probs)
X is an np.array of shape: (25000,244,244,3). That code results in accuracy=0.5834 (towards random accuracy) and loss=2.7106. But
when I shuffle the data, by adding these 3 lines after the first line:
sample_idx = random.sample(range(0, X.shape[0]), 25000)
X = X[sample_idx]
y = y[sample_idx]
, the results become convenient: accuracy=0.9933 and loss=0.0208.
Why shuffling data can give significantly higher accuracy ? or what can be a reason for that ?
The function calc_probs is mainly a run call:
probs = session.run(model.probs, feed_dict={model.X: X})
Update:
After hours of debugging, I figured out that evaluating a single image gives different result. For example, if you run the following line of code multiple times, you get a different result each time:
session.run(model.props, feed_dict={model.X: [X[20]])
My data is normally sorted, X contains class 1 samples first then class 2. And in calc_probs function, I run using each batch of the data sequentially. So, without shuffling, each run has data of a single class.
I've also noted that with shuffling, if batch size is very small, I get the random accuracy.
There is some mathematical justification for this in the context of randomized Kaczmarz algorithm. Regular Kaczmarz algorithm is an old algorithm which can be seen as an non-shuffling SGD on a least squares problem, and there are guaranteed faster convergence rates that come out if you use randomization, follow references in http://www.cs.ubc.ca/~nickhar/W15/Lecture21Notes.pdf