I am trying to learn KNN by working on the Breast Cancer dataset from the UCI repository. The dataset has 699 samples, with 9 continuous variables and 1 class variable.
I measured accuracy on a cross-validation set. For K = 21 and K = 19, the accuracy is 95.7%.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
neigh = KNeighborsClassifier(n_neighbors=21)
neigh.fit(X_train, y_train)
y_pred_val = neigh.predict(X_val)
print(accuracy_score(y_val, y_pred_val))
But for K = 1, I get an accuracy of 97.85%, and for K = 3, 97.14%.
I read the following here:
Choice of k is very critical. A small value of k means that noise will have a higher influence on the result. A large value makes it computationally expensive and kind of defeats the basic philosophy behind KNN (that points that are near might have similar densities or classes). A simple approach to select k is to set k = n^(1/2).
Which value of K should I choose for my model? Can you explain the logic behind the choice?
Thanks in advance!
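One way to compare candidate values of K is k-fold cross-validation instead of a single validation split. If the validation set holds roughly 140 samples (an assumption; 137/140, 136/140 and 134/140 match the reported percentages), the gap between K = 1 and K = 21 is only three samples, which a single split cannot resolve reliably. The sketch below uses NumPy only, on synthetic stand-in data rather than the UCI dataset; `knn_predict`, `cv_accuracy` and the candidate list are hypothetical names chosen for illustration.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k):
    """Predict binary labels (0/1) by majority vote among the k nearest points."""
    # Pairwise squared Euclidean distances, shape (n_test, n_train).
    d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]
    votes = y_train[nearest]                      # labels of the k neighbours
    return (votes.mean(axis=1) > 0.5).astype(int)  # odd k avoids ties

def cv_accuracy(X, y, k, n_folds=5, seed=0):
    """Mean accuracy of k-NN over n_folds shuffled folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    scores = []
    for i in range(n_folds):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        pred = knn_predict(X[train_idx], y[train_idx], X[test_idx], k)
        scores.append((pred == y[test_idx]).mean())
    return float(np.mean(scores))

# Synthetic stand-in data (assumption): two overlapping Gaussian classes, 9 features.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 1.0, (150, 9)), rng.normal(1.0, 1.0, (150, 9))])
y = np.array([0] * 150 + [1] * 150)

candidate_ks = [1, 3, 5, 11, 21]
cv_scores = {k: cv_accuracy(X, y, k) for k in candidate_ks}
best_k = max(cv_scores, key=cv_scores.get)
for k in candidate_ks:
    print(f"k={k:2d}  mean CV accuracy={cv_scores[k]:.3f}")
print("best k:", best_k)
```

When several values of K score within noise of each other, picking the larger one is a common tie-breaker because it is less sensitive to individual noisy points.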
I'm trying to implement the Gradient descent method for solving $Ax = b$ for a positive definite symmetric matrix $A$ of size about $9600 \times 9600$. I thought my code was relatively simple
import numpy as np
# Solves Ax = b for x to within epsilon tolerance, or until MAX_ITERATION is reached
def GradientDescent(Amat, target, epsilon=.01, MAX_ITERATION=100, x=np.zeros(9604)):
    CurrentRes = target - np.matmul(Amat, x)
    count = 0
    while np.linalg.norm(CurrentRes) > epsilon and count < MAX_ITERATION:
        Ar = np.matmul(Amat, CurrentRes)
        alpha = CurrentRes.T.dot(CurrentRes) / CurrentRes.T.dot(Ar)
        x = x + alpha * CurrentRes
        Ax = np.matmul(Amat, x)
        CurrentRes = target - Ax
        count = count + 1
    return (x, count, np.linalg.norm(CurrentRes))
#A is square matrix about 9600x9600 and b is about 9600x1
GDSum = GradientDescent(A,b)
but the above takes almost 3 minutes to run a single iteration of the main while loop.
I didn't think that $9600 \times 9600$ was too big for NumPy to handle effectively, but even the step of computing alpha which is just the quotient of two dot products is taking over 30 seconds.
I tried error-testing the code by timing each action in the while loop, and they are all running much slower than expected. A single matrix multiplication is taking almost a minute. The steps involving vector addition or subtraction at least seem to be running quickly.
Perhaps the most relevant bit of information is missing.
Your function is fast when A and b are NumPy arrays, but it's terribly slow when they are lists.
Is that the case for you?
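A minimal sketch of the point about input types, assuming a small random symmetric positive definite system in place of the real 9600 x 9600 one (the name `steepest_descent` and the test data are hypothetical). It converts the inputs to arrays once, and it also flattens b: if b really has shape (n, 1) while x has shape (n,), the subtraction `target - np.matmul(Amat, x)` broadcasts to an (n, n) matrix, which would make every later step a matrix-matrix product. Whether that happens in the original code is an assumption, since the shapes are only described as "about 9600x1".

```python
import numpy as np

def steepest_descent(A, b, tol=1e-8, max_iter=10_000):
    """Steepest descent for Ax = b with symmetric positive definite A."""
    # Convert once up front; nested Python lists would otherwise be
    # converted to arrays again on every matrix product.
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float).ravel()   # 1-D vector avoids broadcasting to (n, n)
    x = np.zeros_like(b)
    r = b - A @ x
    count = 0
    while np.linalg.norm(r) > tol and count < max_iter:
        Ar = A @ r
        alpha = (r @ r) / (r @ Ar)
        x += alpha * r
        r -= alpha * Ar      # equals b - A @ x, saves one matrix-vector product
        count += 1
    return x, count, np.linalg.norm(r)

rng = np.random.default_rng(0)
n = 200
M = rng.normal(size=(n, n))
A = M @ M.T + n * np.eye(n)     # symmetric positive definite by construction
b = rng.normal(size=(n, 1))     # column vector, as described in the question

# Lists are accepted because the function converts them once.
x, n_iter, res_norm = steepest_descent(A.tolist(), b.tolist())
print(n_iter, res_norm)

# Shape pitfall: an (n, 1) column minus an (n,) vector gives an (n, n) matrix.
pitfall_shape = (b - A @ np.zeros(n)).shape
print(pitfall_shape)
```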
There is something about the workings of GradientTape that escapes my understanding.
Suppose we want to train an agent on the classic bandit problem using an actor-critic RL framework. There are two bandits, A and B, and the agent must learn to select A, which yields higher returns on average. The training consists of, say, 1000 epochs, in each of which the agent draws, say, 100 samples from each bandit. The reward is 1 every time the agent selects A, and 0 otherwise.
Let's see how the agent learns by observing rewards over 10 training simulations. Here is the code defining the agent and the environment (neither needs to be more complicated than below).
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras import Model
from tensorflow.keras.optimizers import Adam
n_sims = 10 # number of simulations
for n in range(n_sims):
    # define actors and optimizers for each simulation
    actor_input = Input(shape=(2,))
    actor_output = Dense(2, activation='softmax')(actor_input)
    globals()[f'actor_{n}'] = Model(inputs=actor_input, outputs=actor_output)
    globals()[f'actor_opt_{n}'] = Adam(learning_rate=.1)
    # define critics and optimizers for each simulation
    critic_input = Input(shape=(2,))
    critic_output = Dense(1, activation='softmax')(critic_input)
    globals()[f'critic_{n}'] = Model(inputs=critic_input, outputs=critic_output)
    globals()[f'critic_opt_{n}'] = Adam(learning_rate=.1)
    globals()[f'mean_rewards_{n}'] = []  # list to store rewards over training epochs for each simulation
A = np.random.normal(loc=10, scale=15, size=int(1e5)) # bandit A
B = np.random.normal(loc=0, scale=1, size=int(1e5)) # bandit B
n_training_epochs = 1000
n_samples = 100
Let's consider two alternative codes for the training loop using GradientTape, both based on a simple 'vanilla' loss function.
The first is the slow one and literally involves a for loop over the samples drawn in each epoch. The cumulative actor and critic losses are computed iteratively, and their means are then used to update the respective network weights.
for _ in range(n_training_epochs):
    A_samples = np.random.choice(A, size=n_samples)
    B_samples = np.random.choice(B, size=n_samples)
    for n in range(n_sims):
        cum_actor_loss, cum_critic_loss, cum_reward = 0, 0, 0
        with tf.GradientTape() as actor_tape, tf.GradientTape() as critic_tape:
            for A_sample, B_sample in zip(A_samples, B_samples):
                probs = globals()[f'actor_{n}'](tf.reshape([A_sample, B_sample], (1,-1)))[0]
                action = np.random.choice(['A','B'], p=np.squeeze(probs))
                reward = 1 if action == 'A' else 0
                cum_reward += reward
                action_prob = probs[['A','B'].index(action)]
                value = globals()[f'critic_{n}'](tf.reshape([A_sample, B_sample], (1,-1)))[0]
                advantage = reward - value
                cum_actor_loss += -tf.math.log(action_prob)*advantage
                cum_critic_loss += advantage**2
            mean_actor_loss = cum_actor_loss/n_samples
            mean_critic_loss = cum_critic_loss/n_samples
        globals()[f'mean_rewards_{n}'].append(cum_reward/n_samples)
        actor_grads = actor_tape.gradient(mean_actor_loss, globals()[f'actor_{n}'].trainable_variables)
        globals()[f'actor_opt_{n}'].apply_gradients(zip(actor_grads, globals()[f'actor_{n}'].trainable_variables))
        critic_grads = critic_tape.gradient(mean_critic_loss, globals()[f'critic_{n}'].trainable_variables)
        globals()[f'critic_opt_{n}'].apply_gradients(zip(critic_grads, globals()[f'critic_{n}'].trainable_variables))
If you plot the average training rewards over each epoch, you'll probably get something like this figure
In the second option, instead of using an explicit for loop over samples in each epoch, we perform operations on arrays. This alternative is much faster in terms of computation time.
for _ in range(n_training_epochs):
    A_samples = np.random.choice(A, size=n_samples)
    B_samples = np.random.choice(B, size=n_samples)
    for n in range(n_sims):
        with tf.GradientTape() as actor_tape, tf.GradientTape() as critic_tape:
            probs = globals()[f'actor_{n}'](tf.reshape([[A_sample, B_sample] for A_sample, B_sample in zip(A_samples, B_samples)], (n_samples,-1)))
            actions = np.array([np.random.choice(['A','B'], p=np.squeeze(probs[i])) for i in range(len(probs))]).reshape(n_samples, -1)
            rewards = np.array([1.0 if action == 'A' else 0.0 for action in actions]).reshape(n_samples, -1)
            globals()[f'mean_rewards_{n}'].append(np.mean(rewards))
            values = globals()[f'critic_{n}'](tf.reshape([[A_sample, B_sample] for A_sample, B_sample in zip(A_samples, B_samples)], (n_samples,-1)))
            advantages = rewards + tf.math.negative(values)
            actions_num = [['A','B'].index(action) for action in actions]
            action_probs = tf.reduce_sum(tf.one_hot(actions_num, len(['A','B'])) * probs, axis=1)
            mean_actor_loss = -tf.reduce_mean(advantages * tf.math.log(action_probs))
            mean_critic_loss = tf.reduce_mean(tf.pow(advantages, 2))
        actor_grads = actor_tape.gradient(mean_actor_loss, globals()[f'actor_{n}'].trainable_variables)
        globals()[f'actor_opt_{n}'].apply_gradients(zip(actor_grads, globals()[f'actor_{n}'].trainable_variables))
        critic_grads = critic_tape.gradient(mean_critic_loss, globals()[f'critic_{n}'].trainable_variables)
        globals()[f'critic_opt_{n}'].apply_gradients(zip(critic_grads, globals()[f'critic_{n}'].trainable_variables))
Let's plot the average reward over epochs, to obtain something like this
As you can see, the agent tends to learn earlier and more stably in the first case than in the second (where learning may not even happen), although the two training loops are in theory mathematically equivalent. Why is that? The reason probably has something to do with the fact that, in the first option, GradientTape is watching the trainable variables several times per epoch before applying the gradient, whereas in the second option it does so only once. Even so, I can't figure out exactly why this produces the observed results. Can you help me understand?
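One difference between the two loops that may be worth ruling out is a shape mismatch rather than anything specific to GradientTape. In the vectorized loop, `advantages` has shape (n_samples, 1) while `action_probs` has shape (n_samples,), so their product broadcasts to (n_samples, n_samples) and the mean runs over every pair (i, j) instead of matched samples. The NumPy sketch below uses made-up numbers to show the effect (TensorFlow applies the same broadcasting rules); whether this fully explains the plots is an assumption, not something checked against the original run.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 100
advantages = rng.normal(size=(n_samples, 1))           # column, like rewards - values
log_probs = np.log(rng.uniform(0.1, 0.9, n_samples))   # flat vector, like log(action_probs)

broadcast = advantages * log_probs            # shape (100, 100): every pair (i, j)
elementwise = advantages[:, 0] * log_probs    # shape (100,): matched pairs only

loss_broadcast = -broadcast.mean()        # equals -mean(advantages) * mean(log_probs)
loss_elementwise = -elementwise.mean()    # per-sample loss, as in the explicit loop
print(broadcast.shape, elementwise.shape)
print(loss_broadcast, loss_elementwise)
```

If this is the cause, flattening the advantages before the multiplication, for example with `tf.squeeze(advantages, axis=1)`, would make the vectorized actor loss match the explicit loop.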
To keep the case simple and intuitive, I will use binary (0 and 1) classification for illustration.
Loss function
loss = np.multiply(np.log(predY), Y) + np.multiply((1 - Y), np.log(1 - predY)) #cross entropy
cost = -np.sum(loss)/m #num of examples in batch is m
Probability of Y
predY is computed using the sigmoid, and the logits can be thought of as the output of a neural network before the classification step
predY = sigmoid(logits) #binary case
def sigmoid(X):
    return 1/(1 + np.exp(-X))
Problem
Suppose we are running a feed-forward net.
Inputs: [3, 5]: 3 is number of examples and 5 is feature size (fabricated data)
Num of hidden units: 100 (only 1 hidden layer)
Iterations: 10000
Such an arrangement is bound to overfit. When it overfits, the network predicts the training examples perfectly; in other words, the sigmoid outputs exactly 1 or 0, because the exponential saturates in floating point. In that case np.log(0) is evaluated, which is undefined (NumPy returns -inf). How do you usually handle this issue?
If you don't mind the dependency on scipy, you can use scipy.special.xlogy. You would replace the expression
np.multiply(np.log(predY), Y) + np.multiply((1 - Y), np.log(1 - predY))
with
xlogy(Y, predY) + xlogy(1 - Y, 1 - predY)
If you expect predY to contain very small values, you might get better numerical results using scipy.special.xlog1py in the second term:
xlogy(Y, predY) + xlog1py(1 - Y, -predY)
Alternatively, knowing that the values in Y are either 0 or 1, you can compute the cost in an entirely different way:
Yis1 = Y == 1
cost = -(np.log(predY[Yis1]).sum() + np.log(1 - predY[~Yis1]).sum())/m
How do you usually handle this issue?
Add a small number (something like 1e-15) to the argument of each log. It barely changes the predictions, and it solves the log(0) issue.
BTW, if your algorithm outputs exact zeros and ones, it might be useful to check the histogram of the returned probabilities. When the algorithm is that sure about its predictions, it can be a sign of overfitting.
One common way to deal with log(x) and y / x where x is always non-negative but can become 0 is to add a small constant (as written by Jakub).
You can also clip the value (e.g. tf.clip_by_value or np.clip).
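A minimal sketch of the clipping approach with `np.clip`, assuming NumPy arrays of labels and predicted probabilities. The function name `binary_cross_entropy` and the default `eps=1e-15` are illustrative choices, not a fixed convention.

```python
import numpy as np

def binary_cross_entropy(pred_y, y, eps=1e-15):
    """Mean binary cross-entropy with predictions clipped away from 0 and 1."""
    p = np.clip(pred_y, eps, 1.0 - eps)   # keeps both log arguments strictly positive
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

y = np.array([1.0, 0.0, 1.0])
pred_y = np.array([1.0, 0.0, 1.0])        # saturated sigmoid outputs
cost = binary_cross_entropy(pred_y, y)
print(cost)                               # tiny but finite

# A confidently wrong prediction gives a large finite penalty instead of inf.
worst = binary_cross_entropy(np.array([0.0]), np.array([1.0]))
print(worst)                              # -log(1e-15), about 34.54
```

Clipping bounds the loss at about 34.5 per example for `eps=1e-15`, so a larger `eps` trades accuracy near 0 and 1 for a smaller maximum penalty.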
I am looking at the below image.
Can someone explain how they are calculated?
I thought it was -1 for a No and +1 for a Yes, but then I can't figure out how the little girl gets 0.1. That doesn't work for tree 2 either.
I agree with @user1808924. I think it's still worth explaining how XGBoost works under the hood, though.
What is the meaning of the leaves' scores?
First, the scores you see in the leaves are not probabilities. They are regression values.
Gradient boosted trees use only regression trees. To predict whether a person likes computer games, the model (XGBoost) treats it as a regression problem. The labels become 1.0 for Yes and 0.0 for No, and XGBoost fits regression trees to them. The trees return values such as +2, +0.1, -1, which are what we see at the leaves.
We sum up all the "raw scores" and then convert them to probabilities by applying the sigmoid function.
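A small sketch of that last step, assuming a binary logistic objective and using the two leaf values quoted elsewhere in this thread for the little boy (+2 from tree 1 and +0.9 from tree 2). Any base score or other offset the library may add is ignored here.

```python
import math

def sigmoid(z):
    """Logistic function mapping a raw score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Leaf values reached by the little boy in each tree (from the figure).
leaf_scores = [2.0, 0.9]

raw_score = sum(leaf_scores)         # about 2.9, still not a probability
probability = sigmoid(raw_score)     # about 0.948
print(raw_score, round(probability, 4))
```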
How are the leaf scores calculated?
The leaf score (w) is calculated by this formula:
w = - (sum(gi) / (sum(hi) + lambda))
where g and h are the first derivative (gradient) and the second derivative (hessian).
For the sake of demonstration, let's pick the leaf of the first tree that has the value -1. Suppose our objective function is mean squared error (MSE) and we choose lambda = 0.
With MSE, we have g = (y_pred - y_true) and h = 1. I dropped the constant 2; you can keep it, and with lambda = 0 the result stays the same because it cancels in the ratio. Another note: at the t-th iteration, y_pred is the prediction we have after the (t-1)-th iteration (the best we have so far).
Some assumptions:
The girl, grandpa, and grandma do NOT like computer games (y_true = 0 for each person).
The initial prediction is 1 for all 3 people (i.e., we guess that everyone loves games). Note that I chose 1 on purpose to reproduce the value in the first tree. In fact, the initial prediction can be the mean (default for mean squared error), the median (default for mean absolute error), etc. of all the observations' labels in the leaf.
We calculate g and h for each individual:
g_girl = y_pred - y_true = 1 - 0 = 1. Similarly, we have g_grandpa = g_grandma = 1.
h_girl = h_grandpa = h_grandma = 1
Putting the g, h values into the formula above, we have:
w = -( (g_girl + g_grandpa + g_grandma) / (h_girl + h_grandpa + h_grandma) ) = -1
Last note: in practice, the leaf score shown when plotting the tree is slightly different, because it is multiplied by the learning rate, i.e., w * learning_rate.
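The worked example above can be reproduced in a few lines of plain Python. The helper `leaf_weight` and the learning rate of 0.3 are hypothetical choices for illustration; the gradients and Hessians follow the squared-error assumptions stated in the text.

```python
def leaf_weight(gradients, hessians, reg_lambda=0.0):
    """Leaf score w = -sum(g) / (sum(h) + lambda)."""
    return -sum(gradients) / (sum(hessians) + reg_lambda)

y_true = [0.0, 0.0, 0.0]   # girl, grandpa, grandma do not like computer games
y_pred = [1.0, 1.0, 1.0]   # initial prediction assumed in the text

g = [p - t for p, t in zip(y_pred, y_true)]   # squared-error gradient, constant 2 dropped
h = [1.0] * len(y_true)                       # squared-error Hessian, constant 2 dropped

w = leaf_weight(g, h)                  # -3 / 3 = -1.0
w_reg = leaf_weight(g, h, 1.0)         # -3 / 4 = -0.75 with lambda = 1
learning_rate = 0.3                    # hypothetical value
print(w, w_reg, w * learning_rate)
```

With a non-zero lambda the score shrinks toward zero, which is the regularizing effect of that term.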
The values of leaf elements (aka "scores") - +2, +0.1, -1, +0.9 and -0.9 - were devised by the XGBoost algorithm during training. In this case, the XGBoost model was trained using a dataset where little boys (+2) appear somehow "greater" than little girls (+0.1). If you knew what the response variable was, then you could probably interpret/rationalize those contributions further. Otherwise, just accept those values as they are.
As for scoring samples, the first addend is produced by tree1 and the second by tree2. For little boys (age < 15, is male == Y, and use computer daily == Y), tree1 yields 2 and tree2 yields 0.9.
Read this
https://towardsdatascience.com/xgboost-mathematics-explained-58262530904a
and then this
https://medium.com/@gabrieltseng/gradient-boosting-and-xgboost-c306c1bcfaf5
and the appendix
https://gabrieltseng.github.io/appendix/2018-02-25-XGB.html
As a warm-up to writing my own elastic net solver, I'm trying to get a fast enough version of ordinary least squares implemented using coordinate descent.
I believe I've implemented the coordinate descent algorithm correctly, but when I use the "fast" version (see below), the algorithm is insanely unstable, outputting regression coefficients that routinely overflow a 64-bit float when the number of features is of moderate size compared to the number of samples.
Linear Regression and OLS
If b = A*x, where A is a matrix, x is a vector of the unknown regression coefficients, and b is the output, I want to find the x that minimizes
||b - Ax||^2
If A[j] is the jth column of A and A[-j] is A without column j, and the columns of A are normalized so that ||A[j]||^2 = 1 for all j, the coordinate-wise update is then
Coordinate Descent:
x[j] <-- A[j]^T * (b - A[-j] * x[-j])
I'm following along with these notes (page 9-10) but the derivation is simple calculus.
It's pointed out that instead of recomputing A[j]^T(b - A[-j] * x[-j]) all the time, a faster way to do it is with
Fast Coordinate Descent:
x[j] <-- A[j]^T*r + x[j]
where the total residual r = b - Ax is computed outside the loop over coordinates. The equivalence of these update rules follows from noting that Ax = A[j]*x[j] + A[-j]*x[-j] and rearranging terms.
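The rearrangement can be written out explicitly. It uses only the definitions above and the normalization $\|A_j\|^2 = 1$; the subscripts $A_j$ and $A_{-j}$ are shorthand for A[j] and A[-j]:

```latex
\begin{aligned}
r &= b - Ax = b - A_{-j}x_{-j} - A_j x_j \\
A_j^{T} r &= A_j^{T}\,(b - A_{-j}x_{-j}) - (A_j^{T} A_j)\,x_j
           = A_j^{T}\,(b - A_{-j}x_{-j}) - x_j \\
\Rightarrow\quad A_j^{T}\,(b - A_{-j}x_{-j}) &= A_j^{T} r + x_j
\end{aligned}
```

The identity holds only when r is the residual of the current x, so r has to reflect every coordinate already updated within the same sweep.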
My problem is that while the second method is indeed faster, it's wildly numerically unstable for me whenever the number of features isn't small compared to the number of samples. I was wondering if anyone might have some insight as to why that's the case. I should note that the first method, which is more stable, still starts disagreeing with more standard methods as the number of features approaches the number of samples.
Julia code
Below is some Julia code for the two update rules:
function OLS_builtin(A,b)
    x = A\b
    return(x)
end
function OLS_coord_descent(A,b)
    N,P = size(A)
    x = zeros(P)
    for cycle in 1:1000
        for j = 1:P
            x[j] = dot(A[:,j], b - A[:,1:P .!= j]*x[1:P .!= j])
        end
    end
    return(x)
end
function OLS_coord_descent_fast(A,b)
    N,P = size(A)
    x = zeros(P)
    for cycle in 1:1000
        r = b - A*x
        for j = 1:P
            x[j] += dot(A[:,j],r)
        end
    end
    return(x)
end
Example of the problem
I generate data with the following:
n = 100
p = 50
σ = 0.1
β_nz = float([i*(-1)^i for i in 1:10])
β = append!(β_nz,zeros(Float64,p-length(β_nz)))
X = randn(n,p); X .-= mean(X,1); X ./= sqrt(sum(abs2(X),1))
y = X*β + σ*randn(n); y .-= mean(y);
Here I use p=50, and I get good agreement between OLS_coord_descent(X,y) and OLS_builtin(X,y), whereas OLS_coord_descent_fast(X,y) returns exponentially large values for the regression coefficients.
When p is less than about 20, OLS_coord_descent_fast(X,y) agrees with the other two.
Conjecture
Since things agree in the regime p << n, I think the algorithm is formally correct but numerically unstable. Does anyone have any thoughts on whether this guess is correct, and if so, how to correct for the instability while retaining most of the performance gains of the fast version of the algorithm?
The quick answer: you forgot to update r after each x[j] update. The following fixed function behaves like OLS_coord_descent:
function OLS_coord_descent_fast(A,b)
    N,P = size(A)
    x = zeros(P)
    for cycle in 1:1000
        r = b - A*x
        for j = 1:P
            x[j] += dot(A[:,j],r)
            r -= A[:,j]*dot(A[:,j],r) # Add this line
        end
    end
    return(x)
end
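For readers who want to check the fix outside Julia, here is a rough NumPy port of the corrected function, compared against `np.linalg.lstsq` on data generated the same way as in the question (n = 100, p = 50, unit-norm centered columns). The port and its variable names are an illustrative sketch, not the original poster's code.

```python
import numpy as np

def ols_coord_descent_fast(A, b, n_cycles=1000):
    """Coordinate descent for least squares; columns of A must have unit norm."""
    n, p = A.shape
    x = np.zeros(p)
    for _ in range(n_cycles):
        r = b - A @ x                # residual for the current x
        for j in range(p):
            delta = A[:, j] @ r      # change in x[j] from the fast update rule
            x[j] += delta
            r -= A[:, j] * delta     # keep r equal to b - A @ x
    return x

rng = np.random.default_rng(0)
n, p = 100, 50
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)
X /= np.sqrt((X ** 2).sum(axis=0))   # unit-norm columns, as in the question
beta = np.zeros(p)
beta[:10] = [i * (-1) ** i for i in range(1, 11)]
y = X @ beta + 0.1 * rng.normal(size=n)
y -= y.mean()

x_cd = ols_coord_descent_fast(X, y)
x_ls = np.linalg.lstsq(X, y, rcond=None)[0]
max_diff = np.max(np.abs(x_cd - x_ls))
print(max_diff)
```

Without the residual update inside the inner loop, every coordinate in a sweep is corrected against the same stale residual, which overshoots once the columns are noticeably correlated.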