XGBoost showing same prediction for all test data - xgboost

I am working on a problem to predict output label based on certain input values.
Since I do not have real data, I am creating some dummy data so that I can have my code ready by the time I get the data.
Below is what the sample data looks like. There are a bunch of input values and the last column 'output' is the output label to be predicted.
input_1,input_2,input_3,input_4,input_5,input_6,input_7,input_8,input_9,input_10,input_11,input_12,input_13,input_14,input_15,input_16,input_17,input_18,input_19,input_20,input_21,input_22,input_23,input_24,input_25,input_26,input_27,input_28,input_29,input_30,input_31,input_32,output
0.0,97.0,155,143,98,145,102,102,144,100,96,193,90,98,98,122,101,101,101,98,99,96,118,148,98,99,112,94,98,100,96.0,95,loc12
96.0,94.0,116,99,98,105,95,101,168,101,96,108,95,98,98,96,102,98,98,99,98,98,132,150,102,101,195,104,96,97,93.0,98,loc27
Since this is dummy data, I am setting the output label to the input that has the maximum value.
For example, in the first row the maximum value is at the 12th position, so the output is set to loc12.
My expectation is that the XGBoost algorithm should learn this on its own and predict the output label correctly.
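The labelling rule described above can be sketched as follows. This is only an illustration of how such dummy data could be produced: the function name make_dummy_data, the value range and the seed are assumptions, not taken from the original data file.

```python
import numpy as np

def make_dummy_data(n_rows, n_inputs=32, seed=0):
    """Return (X, labels); each label names the 1-based column holding the row maximum."""
    rng = np.random.default_rng(seed)
    # Hypothetical value range, chosen only to resemble the sample rows.
    X = rng.integers(90, 200, size=(n_rows, n_inputs)).astype(float)
    # argmax is 0-based and picks the first maximum on ties; add 1 to get names like "loc12".
    winners = X.argmax(axis=1) + 1
    labels = np.array(["loc{}".format(k) for k in winners])
    return X, labels

X, labels = make_dummy_data(5)
print(labels)
```

With a rule like this, any row can be checked by hand: the label is just the position of the largest input.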
I have written the code below to train and test XGBoost.
from __future__ import division
import numpy as np
import pandas as pd
import scipy.sparse
import pickle
import xgboost as xgb
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, LabelBinarizer
df=pd.read_csv("data.txt", sep=',')
# Create training and validation sets
sz = df.shape
train = df.iloc[:int(sz[0] * 0.7), :]
test = df.iloc[int(sz[0] * 0.7):, :]
# Separate X & Y for training
train_X = train.iloc[:, :32].values
train_Y = train.iloc[:, 32].values
# Separate X & Y for test
test_X = test.iloc[:, :32].values
test_Y = test.iloc[:, 32].values
# Get the count of unique output labels
num_classes = df.output.nunique()
lb = LabelBinarizer()
train_Y = lb.fit_transform(train_Y.tolist())
test_Y = lb.fit_transform(test_Y.tolist())
# Normalize the training data
#train_X -= np.mean(train_X, axis=0)
#train_X /= np.std(train_X, axis=0)
#train_X /= 255
# Normalize the test data
#test_X -= np.mean(test_X, axis=0)
#test_X /= np.std(test_X, axis=0)
#test_X /= 255
xg_train = xgb.DMatrix(train_X, label=train_Y)
xg_test = xgb.DMatrix(test_X, label=test_Y)
# setup parameters for xgboost
param = {}
# use softmax multi-class classification
param['objective'] = 'multi:softmax'
# learning rate (step size shrinkage)
param['eta'] = 0.1
param['max_depth'] = 6
param['silent'] = 1
param['nthread'] = 4
param['num_class'] = num_classes
watchlist = [(xg_train, 'train'), (xg_test, 'test')]
num_round = 5
bst = xgb.train(param, xg_train, num_round, watchlist)
#bst.dump_model('dump.raw.txt')
# get prediction
pred = bst.predict(xg_test)
actual = np.argmax(test_Y, axis=1)
error_rate = np.sum(pred != actual) / test_Y.shape[0]
print('Test error using softmax = {}'.format(error_rate))
# do the same thing again, but output probabilities
param['objective'] = 'multi:softprob'
bst = xgb.train(param, xg_train, num_round, watchlist)
# Note: this convention has been changed since xgboost-unity
# get prediction, this is in 1D array, need reshape to (ndata, nclass)
pred_prob = bst.predict(xg_test).reshape(test_Y.shape[0], num_classes)
pred_label = np.argmax(pred_prob, axis=1)
actual_label = np.argmax(test_Y, axis=1)
error_rate = np.sum(pred_label != actual_label) / test_Y.shape[0]
print('Test error using softprob = {}'.format(error_rate))
However, I observe that it always predicts label 0, i.e. the first index in the one-hot encoded output.
Output:
[0] train-merror:0.11081 test-merror:0.111076
[1] train-merror:0.11081 test-merror:0.111076
[2] train-merror:0.11081 test-merror:0.111076
[3] train-merror:0.111216 test-merror:0.111076
[4] train-merror:0.11081 test-merror:0.111076
Test error using softmax = 0.64846954875355
[0] train-merror:0.11081 test-merror:0.111076
[1] train-merror:0.11081 test-merror:0.111076
[2] train-merror:0.11081 test-merror:0.111076
[3] train-merror:0.111216 test-merror:0.111076
[4] train-merror:0.11081 test-merror:0.111076
Test error using softprob = 0.64846954875355
Prediction:
pred_prob[0:10]
array([[0.34024397, 0.10218474, 0.07965304, 0.07965304, 0.07965304,
0.07965304, 0.07965304, 0.07965304, 0.07965304],
[0.34009758, 0.10257103, 0.07961877, 0.07961877, 0.07961877,
0.07961877, 0.07961877, 0.07961877, 0.07961877],
[0.34421352, 0.09171014, 0.08058234, 0.08058234, 0.08058234,
0.08058234, 0.08058234, 0.08058234, 0.08058234],
[0.33950377, 0.10413795, 0.07947975, 0.07947975, 0.07947975,
0.07947975, 0.07947975, 0.07947975, 0.07947975],
[0.3426607 , 0.09580766, 0.08021881, 0.08021881, 0.08021881,
0.08021881, 0.08021881, 0.08021881, 0.08021881],
[0.33777002, 0.10427278, 0.07970817, 0.07970817, 0.07970817,
0.07970817, 0.07970817, 0.07970817, 0.07970817],
[0.33733884, 0.10985068, 0.07897293, 0.07897293, 0.07897293,
0.07897293, 0.07897293, 0.07897293, 0.07897293],
[0.33953893, 0.10404517, 0.07948799, 0.07948799, 0.07948799,
0.07948799, 0.07948799, 0.07948799, 0.07948799],
[0.33987975, 0.10314585, 0.07956778, 0.07956778, 0.07956778,
0.07956778, 0.07956778, 0.07956778, 0.07956778],
[0.34013695, 0.10246711, 0.07962799, 0.07962799, 0.07962799,
0.07962799, 0.07962799, 0.07962799, 0.07962799]], dtype=float32)
Whatever accuracy I'm getting is because of predicting label 0 which is around 35% of the data.
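The 35% figure is the accuracy of a constant predictor, and a model that has learned the rule should beat it. A minimal sketch for computing that baseline, using hypothetical labels and a hypothetical helper name constant_baseline_error:

```python
import numpy as np

def constant_baseline_error(labels):
    """Return (majority label, error rate of always predicting it)."""
    values, counts = np.unique(labels, return_counts=True)
    majority = values[counts.argmax()]
    return majority, 1.0 - counts.max() / len(labels)

# Hypothetical labels in which class 0 makes up 35% of the rows.
y = np.array([0] * 35 + [1] * 33 + [2] * 32)
majority, error = constant_baseline_error(y)
print(majority, error)
```

If the reported test error matches this baseline, the model is effectively ignoring its inputs.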
Is my expectation correct here? Are the input features too many and data too little for it to learn properly?
Full code: Here
Test Data: Here

For anyone else who hits this issue, check the num_boost_round argument you pass to xgb.train. Make sure it is equal to, or close to, the number of rounds you used with xgb.cv.
I think the problem is that training stopped too early, so the model was not trained enough.

Related

Convert Tensorflow 1.x code with custom loss into 2.x

Suppose I have the following code written in Tensorflow 1.x where I define custom loss function. I wish to remove .compat.v1., Session, placeholder etc. and convert it into Tensorflow 2.x.
How to do so?
import DGM
import tensorflow as tf
import numpy as np
import scipy.stats as spstats
import matplotlib.pyplot as plt
from tqdm.notebook import trange
# Option parameters
phi = 10
n = 0.01
T = 4
# Solution parameters (domain on which to solve PDE)
t_low = 0.0 - 1e-10
x_low = 0.0 + 1e-10
x_high = 1.0
# neural network parameters
num_layers = 3
nodes_per_layer = 50
# Training parameters
sampling_stages = 2500 # number of times to resample new time-space domain points
steps_per_sample = 20 # number of SGD steps to take before re-sampling
# Sampling parameters
nsim_interior = 100
nsim_boundary_1 = 50
nsim_boundary_2 = 50
nsim_initial = 50
x_multiplier = 1.1 # multiplier for oversampling i.e. draw x from [x_low, x_high * x_multiplier]
def sampler(nsim_interior, nsim_boundary_1, nsim_boundary_2, nsim_initial):
    ''' Sample time-space points from the function's domain; points are sampled
        uniformly on the interior of the domain, at the initial/terminal time points
        and along the spatial boundary at different time points.
    Args:
        nsim_interior: number of space points in the interior of U
        nsim_boundary_1: number of space points in the boundary of U
        nsim_boundary_2: number of space points in the boundary of U_x
        nsim_initial: number of space points at the initial time
    '''
    # Sampler #1: domain interior
    t_interior = np.random.uniform(low=t_low, high=T, size=[nsim_interior, 1])
    x_interior = np.random.uniform(low=x_low, high=x_high*x_multiplier, size=[nsim_interior, 1])
    # Sampler #2: spatial boundary 1
    t_boundary_1 = np.random.uniform(low=t_low, high=T, size=[nsim_boundary_1, 1])
    x_boundary_1 = np.ones((nsim_boundary_1, 1))
    # Sampler #3: spatial boundary 2
    t_boundary_2 = np.random.uniform(low=t_low, high=T, size=[nsim_boundary_2, 1])
    x_boundary_2 = np.zeros((nsim_boundary_2, 1))
    # Sampler #4: initial condition
    t_initial = np.zeros((nsim_initial, 1))
    x_initial = np.random.uniform(low=x_low, high=x_high*x_multiplier, size=[nsim_initial, 1])
    return (
        t_interior, x_interior,
        t_boundary_1, x_boundary_1,
        t_boundary_2, x_boundary_2,
        t_initial, x_initial
    )
def loss(
    model,
    t_interior, x_interior,
    t_boundary_1, x_boundary_1,
    t_boundary_2, x_boundary_2,
    t_initial, x_initial
):
    ''' Compute total loss for training.
    Args:
        model: DGM model object
        t_interior, x_interior: sampled time / space points in the interior of U
        t_boundary_1, x_boundary_1: sampled time / space points in the boundary of U
        t_boundary_2, x_boundary_2: sampled time / space points in the boundary of U_x
        t_initial, x_initial: sampled time / space points at the initial time
    '''
    # Loss term #1: PDE
    # compute function value and derivatives at current sampled points
    u = model(t_interior, x_interior)
    u_t = tf.gradients(ys=u, xs=t_interior)[0]
    u_x = tf.gradients(ys=u, xs=x_interior)[0]
    u_xx = tf.gradients(ys=u_x, xs=x_interior)[0]
    diff_u = u_t - u_xx + phi**2 * (tf.nn.relu(u) + 1e-10)**n
    # compute average L2-norm for the PDE
    L1 = tf.reduce_mean(input_tensor=tf.square(diff_u))
    # Loss term #2: First b. c.
    u = model(t_boundary_1, x_boundary_1)
    bc1_error = u - 1
    # Loss term #3: Second b. c.
    u = model(t_boundary_2, x_boundary_2)
    u_x = tf.gradients(ys=u, xs=x_boundary_2)[0]
    bc2_error = u_x - 0
    # Loss term #4: Initial condition
    u = model(t_initial, x_initial)
    init_error = u - 1
    # compute average L2-norm for the initial/boundary conditions
    L2 = tf.reduce_mean(input_tensor=tf.square(bc1_error + bc2_error + init_error))
    return L1, L2
# initialize DGM model (last input: space dimension = 1)
model = DGM.DGMNet(nodes_per_layer, num_layers, 1)
# tensor placeholders (_tnsr suffix indicates tensors)
# inputs (time, space domain interior, space domain at initial time)
t_interior_tnsr = tf.compat.v1.placeholder(tf.float32, [None,1])
x_interior_tnsr = tf.compat.v1.placeholder(tf.float32, [None,1])
t_boundary_1_tnsr = tf.compat.v1.placeholder(tf.float32, [None,1])
x_boundary_1_tnsr = tf.compat.v1.placeholder(tf.float32, [None,1])
t_boundary_2_tnsr = tf.compat.v1.placeholder(tf.float32, [None,1])
x_boundary_2_tnsr = tf.compat.v1.placeholder(tf.float32, [None,1])
t_initial_tnsr = tf.compat.v1.placeholder(tf.float32, [None,1])
x_initial_tnsr = tf.compat.v1.placeholder(tf.float32, [None,1])
# loss
L1_tnsr, L2_tnsr = loss(
model,
t_interior_tnsr, x_interior_tnsr,
t_boundary_1_tnsr, x_boundary_1_tnsr,
t_boundary_2_tnsr, x_boundary_2_tnsr,
t_initial_tnsr, x_initial_tnsr
)
loss_tnsr = L1_tnsr + L2_tnsr
# set optimizer
starting_learning_rate = 3e-4
global_step = tf.Variable(0, trainable=False)
lr = tf.compat.v1.train.exponential_decay(
learning_rate=starting_learning_rate,
global_step=global_step,
decay_steps=1e5,
decay_rate=0.96,
staircase=True,
)
optimizer = tf.compat.v1.train.AdamOptimizer(learning_rate=lr).minimize(loss_tnsr)
# initialize variables
init_op = tf.compat.v1.global_variables_initializer()
# open session
sess = tf.compat.v1.Session()
sess.run(init_op)
try:
model.load_weights("checkpoint/")
print("Loading from checkpoint.")
except:
print("Checkpoint not found.")
# for each sampling stage
for i in trange(sampling_stages):
# sample uniformly from the required regions
t_interior, x_interior, \
t_boundary_1, x_boundary_1, \
t_boundary_2, x_boundary_2, \
t_initial, x_initial = sampler(
nsim_interior, nsim_boundary_1, nsim_boundary_2, nsim_initial
)
# for a given sample, take the required number of SGD steps
for _ in range(steps_per_sample):
loss, L1, L2, _ = sess.run(
[loss_tnsr, L1_tnsr, L2_tnsr, optimizer],
feed_dict = {
t_interior_tnsr: t_interior,
x_interior_tnsr: x_interior,
t_boundary_1_tnsr: t_boundary_1,
x_boundary_1_tnsr: x_boundary_1,
t_boundary_2_tnsr: t_boundary_2,
x_boundary_2_tnsr: x_boundary_2,
t_initial_tnsr: t_initial,
x_initial_tnsr: x_initial,
}
)
if i % 10 == 0:
print(f"Loss: {loss:.5f},\t L1: {L1:.5f},\t L2: {L2:.5f},\t iteration: {i}")
model.save_weights("checkpoint/")
I tried searching how to implement custom loss functions with model as an argument, but couldn't implement it.
model.compile has a loss argument that accepts the loss function. It may be a string (the name of a built-in loss function), a tf.keras.losses.Loss instance, or a callable. For example:
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss=tf.keras.losses.BinaryCrossentropy())
If you have written your own loss function, you can pass the function itself to the loss argument. For example:
def my_loss_fn(y_true, y_pred):
    squared_difference = tf.square(y_true - y_pred)
    return tf.reduce_mean(squared_difference, axis=-1)
model.compile(optimizer='adam', loss=my_loss_fn)
Thank You.

PPO: NaN Policy return in Tensorflow Keras

I am trying to implement the PPO algorithm with clipped loss in addition to KL penalties and run training on MuJoCo Gym environments. After ~15000 gradient steps, the policy collapses into returning NaN.
This is the policy training information logged just before the policy collapses:
A: tf.Tensor(-0.10426917, shape=(), dtype=float32)
LOG_A: tf.Tensor(37.021107, shape=(), dtype=float32)
LOSS: tf.Tensor(0.16812761, shape=(), dtype=float32)
GRAD: tf.Tensor(
[[-3.4624012e-04 -1.2807851e-04 -1.9778654e-01 ... -2.7586846e+00
-1.2552655e-01 -1.7212760e-03]
[ 4.6312678e-05 -2.2251482e-04 5.5088173e-03 ... 9.5249921e-02
2.2186586e-03 2.0080474e-04]
[ 2.0314787e-05 -1.6381161e-04 7.1509695e-03 ... 1.1740552e-01
3.4010289e-03 1.2105847e-04]
...
[ 1.7827883e-04 -1.1712313e-05 5.8873045e-01 ... 9.2354174e+00
2.9186043e-01 -2.2818900e-03]
[-9.0385452e-05 3.0951984e-03 -3.6487404e-02 ... -2.6829168e-01
-3.9602429e-02 2.0654879e-03]
[ 2.2925157e-04 4.6892464e-03 5.9946489e-01 ... 9.3497839e+00
3.0514282e-01 -1.3834883e-03]], shape=(11, 256), dtype=float32)
A: tf.Tensor(nan, shape=(), dtype=float32)
LOG_A: tf.Tensor(nan, shape=(), dtype=float32)
Note: The gradient info captures only the gradients of the first layer, as I have found capturing all gradient info to be messy and seemingly redundant.
What I have tried:
Tuning hyperparameters: I have tried multiple sets of hyperparameters, including the ones documented in the original paper. The same error occurs (the hyperparameters in the example below were chosen for higher sampling efficiency, to make debugging faster).
Gradient clipping: The gradient norm is clipped to 1, and as shown above, there does not appear to be an exploding-gradient issue.
Numerical stability of the tanh squashing in the policy log probability: A small epsilon is used in the clipping (1 - action**2 is clipped to at least EPSILON) so that the action log probability does not become inf after tanh squashing.
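For reference, the tanh correction term can also be computed without clipping, using the identity log(1 - tanh(x)^2) = 2 * (log 2 - x - softplus(-2x)). The sketch below uses NumPy purely for illustration, and the function name is hypothetical. It is not a diagnosis of the NaN above, only a way to compare the two formulations.

```python
import numpy as np

def log_one_minus_tanh_sq(x):
    """Stable log(1 - tanh(x)**2) via 2 * (log 2 - x - softplus(-2x))."""
    x = np.asarray(x, dtype=np.float64)
    # np.logaddexp(0, z) equals softplus(z) = log(1 + exp(z)) and does not overflow.
    return 2.0 * (np.log(2.0) - x - np.logaddexp(0.0, -2.0 * x))

x = np.array([0.0, 1.0, 20.0])
with np.errstate(divide="ignore"):
    # Direct formula: tanh saturates to 1.0 for large |x|, so the log hits -inf.
    naive = np.log(1.0 - np.tanh(x) ** 2)
print(naive)
print(log_one_minus_tanh_sq(x))
```

The direct formula and the identity agree for moderate inputs; they differ only where tanh saturates.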
Unitized code example:
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import gym
import scipy.signal
import time
from tensorflow.keras import Model
import matplotlib.pyplot as plt
import random
import tensorflow_probability as tfp
tf.keras.backend.set_floatx('float32')
EPSILON = 1e-10
################## GLOBAL SETUP P1 ##################
problem = "Hopper-v2"
env = gym.make(problem)
eval_env = gym.make(problem)
num_states = env.observation_space.shape[0]
print("Size of State Space -> {}".format(num_states), flush=True)
num_actions = env.action_space.shape[0]
print("Size of Action Space -> {}".format(num_actions), flush=True)
upper_bound = env.action_space.high[0]
lower_bound = env.action_space.low[0]
print("Max Value of Action -> {}".format(upper_bound), flush=True)
print("Min Value of Action -> {}".format(lower_bound), flush=True)
minibatch_size = 256
##########*****####################*****##########
#################### Auxiliaries ####################
def discounted_cumulative_sums(x, discount):
    # Discounted cumulative sums of vectors for computing rewards-to-go and advantage estimates
    return scipy.signal.lfilter([1], [1, float(-discount)], x[::-1], axis=0)[::-1]
##########*****####################*****##########
#################### Replay Buffer ####################
class Buffer:
    def __init__(self, observation_dimensions, action_dimensions, size, gamma=0.99, lam=0.95):
        self.observation_buffer = np.zeros(
            (size, observation_dimensions), dtype=np.float32
        )
        self.action_buffer = np.zeros((size, action_dimensions), dtype=np.int32)
        self.advantage_buffer = np.zeros(size, dtype=np.float32)
        self.reward_buffer = np.zeros(size, dtype=np.float32)
        self.return_buffer = np.zeros(size, dtype=np.float32)
        self.value_buffer = np.zeros(size, dtype=np.float32)
        self.logprobability_buffer = np.zeros(size, dtype=np.float32)
        self.gamma, self.lam = gamma, lam
        self.pointer, self.trajectory_start_index = 0, 0

    def store(self, observation, action, reward, value, logprobability):
        self.observation_buffer[self.pointer] = observation
        self.action_buffer[self.pointer] = action
        self.reward_buffer[self.pointer] = reward
        self.value_buffer[self.pointer] = value
        self.logprobability_buffer[self.pointer] = logprobability
        self.pointer += 1

    def finish_trajectory(self, last_value=0):
        path_slice = slice(self.trajectory_start_index, self.pointer)
        rewards = np.append(self.reward_buffer[path_slice], last_value)
        values = np.append(self.value_buffer[path_slice], last_value)
        deltas = rewards[:-1] + self.gamma * values[1:] - values[:-1]
        self.advantage_buffer[path_slice] = discounted_cumulative_sums(
            deltas, self.gamma * self.lam
        )
        self.return_buffer[path_slice] = discounted_cumulative_sums(
            rewards, self.gamma
        )[:-1]
        self.trajectory_start_index = self.pointer

    def get(self):
        # Draw a random minibatch from the buffer and normalize its advantages
        rindex = np.random.choice(self.pointer, minibatch_size)
        advantage_mean, advantage_std = (
            np.mean(self.advantage_buffer[rindex]),
            np.std(self.advantage_buffer[rindex]),
        )
        return (
            self.observation_buffer[rindex],
            self.action_buffer[rindex],
            (self.advantage_buffer[rindex] - advantage_mean) / advantage_std,
            self.return_buffer[rindex],
            self.logprobability_buffer[rindex],
        )

    def clear(self):
        self.pointer, self.trajectory_start_index = 0, 0
##########*****####################*****##########
#################### Models ####################
class Actor(Model):
    def __init__(self):
        super().__init__()
        self.action_dim = num_actions
        self.dense1_layer = layers.Dense(256, activation="relu")
        self.dense2_layer = layers.Dense(256, activation="relu")
        self.mean_layer = layers.Dense(self.action_dim)
        self.stdev_layer = layers.Dense(self.action_dim)

    def call(self, state, eval_mode=False):
        a1 = self.dense1_layer(state)
        a2 = self.dense2_layer(a1)
        mu = self.mean_layer(a2)
        log_sigma = self.stdev_layer(a2)
        sigma = tf.exp(log_sigma)
        covar_m = tf.linalg.diag(sigma**2)
        dist = tfp.distributions.MultivariateNormalTriL(loc=mu, scale_tril=tf.linalg.cholesky(covar_m))
        if eval_mode:
            action_ = mu
        else:
            action_ = dist.sample()
        action = tf.tanh(action_)
        log_pi_ = dist.log_prob(action_)
        log_pi = log_pi_ - tf.reduce_sum(tf.math.log(tf.clip_by_value(1 - action**2, EPSILON, 1.0)), axis=1)
        return action*upper_bound, log_pi

def get_critic():
    state_input = layers.Input(shape=(num_states))
    state_out = layers.Dense(256, activation="relu")(state_input)
    out = layers.Dense(256, activation="relu")(state_out)
    outputs = layers.Dense(1, dtype='float32')(out)
    model = tf.keras.Model(state_input, outputs)
    return model
##########*****####################*****##########
#################### GLOBAL SETUP P2 ####################
# Hyperparameters of the PPO algorithm
horizon = 2048
iterations = 2000
gamma = 0.99
clip_ratio = 0.2
epochs = 500
lam = 0.97
target_kl = 0.01
beta = 1.0
render = False
actor_model = Actor()
critic_model = get_critic()
lr = 0.0003
policy_optimizer = tf.keras.optimizers.Adam(learning_rate=lr, clipnorm=1.0)
value_optimizer = tf.keras.optimizers.Adam(learning_rate=lr, clipnorm=1.0)
buffer = Buffer(num_states, num_actions, horizon)
##########*****####################*****##########
#################### Training ####################
observation, episode_return, episode_length = env.reset(), 0, 0
tf_observation = tf.expand_dims(observation, 0)
def train_policy(
    observation_buffer, action_buffer, logprobability_buffer, advantage_buffer
):
    global beta
    with tf.GradientTape() as tape:  # Record operations for automatic differentiation.
        action, log_a = actor_model(observation_buffer)
        # print("A: ", tf.reduce_mean(action))
        # print("LOG_A: ", tf.reduce_mean(log_a))
        ratio = tf.exp(
            log_a
            - logprobability_buffer
        )
        # print("R: ", tf.reduce_mean(ratio), flush=True)
        cd_ratio = tf.clip_by_value(ratio, (1 - clip_ratio), (1 + clip_ratio))
        min_advantage = cd_ratio * advantage_buffer
        _kl = -beta*tf.math.reduce_max(logprobability_buffer - log_a)
        policy_loss = -tf.reduce_mean(tf.minimum(ratio * advantage_buffer, min_advantage) + _kl)
        # print("LOSS: ", policy_loss)
    policy_grads = tape.gradient(policy_loss, actor_model.trainable_variables)
    policy_optimizer.apply_gradients(zip(policy_grads, actor_model.trainable_variables))
    # print("GRAD: ", policy_grads[0], flush=True)
    action_opt, log_a_opt = actor_model(observation_buffer)
    kl = tf.reduce_mean(
        logprobability_buffer
        - log_a_opt
    )
    if kl < target_kl/1.5:
        beta = beta/2
    if kl > target_kl*1.5:
        beta = beta*2
    return kl

def train_value_function(observation_buffer, return_buffer):
    with tf.GradientTape() as tape:  # Record operations for automatic differentiation.
        value_loss = tf.reduce_mean((return_buffer - critic_model(observation_buffer)) ** 2)
    value_grads = tape.gradient(value_loss, critic_model.trainable_variables)
    value_optimizer.apply_gradients(zip(value_grads, critic_model.trainable_variables))
for ite in range(iterations):
    for t in range(horizon):
        if render:
            env.render()
        action, log_pi_a = actor_model(tf_observation)
        action = action[0]
        observation_new, reward, done, _ = env.step(action)
        episode_return += reward
        episode_length += 1
        value_t = critic_model(tf_observation)
        buffer.store(observation, action, reward, value_t, log_pi_a)
        observation = observation_new
        tf_observation = tf.expand_dims(observation, 0)
        terminal = done
        if terminal or (t == horizon - 1):
            last_value = 0 if done else critic_model(tf_observation)
            buffer.finish_trajectory(last_value)
            observation, episode_return, episode_length = env.reset(), 0, 0
            tf_observation = tf.expand_dims(observation, 0)
    for _ in range(epochs):
        (
            observation_buffer,
            action_buffer,
            advantage_buffer,
            return_buffer,
            logprobability_buffer,
        ) = buffer.get()
        kl = train_policy(
            observation_buffer, action_buffer, logprobability_buffer, advantage_buffer
        )
        train_value_function(observation_buffer, return_buffer)
    buffer.clear()
##########*****####################*****##########
Note:
The code base combines a modified version of the official Keras PPO tutorial (https://keras.io/examples/rl/ppo_cartpole/) with modules (mainly the policy network) that have been tested in other implementations.
I refrained from using the tf.function decorator because I am very new to TensorFlow and do not yet understand its impact, and I have read in various GitHub issues that it sometimes causes numerical instability due to caching. However, this could be a source of my issues.
Any help is appreciated, and apologies if something is missing or unclear.

I'm creating a linear regression model and I am receiving an error

I was creating a linear regression model using TensorFlow's linear estimator, but after I run the estimator's train function I receive an invalid argument error which says "Labels must be <= n_classes - 1". I don't know which part of the code is wrong.
This is the code I was running:
import tensorflow as tf
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv(r"C:\Users\XPRESS\Downloads\CarPrice_Assignment.csv") #load the data
data.head()
#split data into training and testing
from sklearn.model_selection import train_test_split
train , test = train_test_split(data,random_state=42,test_size=0.2)
train_x = train
train_y = train.pop('price')
eval_x = test
eval_y = test.pop('price')
lst = list(train_x.columns)
#get numerical and categorical columns
categorical_columns = []
numerical_columns = []
for cat in lst:
    if train_x[cat].dtypes == 'object':
        categorical_columns.append(cat)
for nums in lst:
    if nums not in categorical_columns:
        numerical_columns.append(nums)
train_x.info()
#convert categorical data to numeric data
feature_columns = []
for feature_name in categorical_columns:
    vocabulary = train_x[feature_name].unique()
    feature_columns.append(tf.feature_column.categorical_column_with_vocabulary_list(feature_name,vocabulary))
for feature_name in numerical_columns:
    feature_columns.append(tf.feature_column.numeric_column(feature_name,dtype=tf.float32))
def make_input_fn(data,label,num_epochs=10,shuffle=True,batch_size=32):
    def input_fn():
        ds = tf.data.Dataset.from_tensor_slices((dict(data),label))
        if shuffle:
            ds=ds.shuffle(1000)
        ds = ds.batch(batch_size).repeat(num_epochs)
        return ds
    return input_fn
train_input_funtion = make_input_fn(train_x,train_y)
eval_input_function = make_input_fn(eval_x,eval_y,shuffle=False,num_epochs=1)
linear_est = tf.estimator.LinearClassifier(feature_columns=feature_columns)
linear_est.train(train_input_funtion)
This is the error I received:
InvalidArgumentError: 2 root error(s) found.
(0) INVALID_ARGUMENT: assertion failed: [Labels must be <= n_classes - 1] [Condition x <= y did not hold element-wise:] [x (head/losses/Cast:0) = ] [[7895][10795][17710]...] [y (head/losses/check_label_range/Const:0) = ] [1]
[[{{function_node head_losses_check_label_range_assert_less_equal_Assert_AssertGuard_false_22323}}{{node Assert}}]]
[[training/Ftrl/gradients/gradients/linear/linear_model/linear/linear_model/linear/linear_model/enginelocation/weighted_sum_grad/Select_1/_1047]]
(1) INVALID_ARGUMENT: assertion failed: [Labels must be <= n_classes - 1] [Condition x <= y did not hold element-wise:] [x (head/losses/Cast:0) = ] [[7895][10795][17710]...] [y (head/losses/check_label_range/Const:0) = ] [1]
[[{{function_node head_losses_check_label_range_assert_less_equal_Assert_AssertGuard_false_22323}}{{node Assert}}]]
0 successful operations.
0 derived errors ignored.
...
[[training/Ftrl/gradients/gradients/linear/linear_model/linear/linear_model/linear/linear_model/enginelocation/weighted_sum_grad/Select_1/_1047]]
(1) INVALID_ARGUMENT: assertion failed: [Labels must be <= n_classes - 1] [Condition x <= y did not hold element-wise:] [x (head/losses/Cast:0) = ] [[7895][10795][17710]...] [y (head/losses/check_label_range/Const:0) = ] [1]
[[{{node Assert}}]]
0 successful operations.
0 derived errors ignored.
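You mentioned that you are building a regression model, but the code uses tf.estimator.LinearClassifier. Maybe you meant to use tf.estimator.LinearRegressor instead? A classifier expects class-index labels (here 0 or 1, since the error shows n_classes - 1 = 1), which is why prices such as 7895 fail the label range check.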

AttributeError: module 'tensorflow' has no attribute tensordot

I have a tensor of rank 3 and another tensor of rank 2. I want to use tf.tensordot, but it gives me this error.
I am using TensorFlow 0.12.0 and I am importing math_ops.
Can anyone help?
def self_attention(inputs, attention_size):
    hidden_size = 1200
    w_omega = tf.Variable(tf.random_normal([hidden_size, attention_size], stddev=0.1))
    b_omega = tf.Variable(tf.random_normal([attention_size], stddev=0.1))
    u_omega = tf.Variable(tf.random_normal([attention_size], stddev=0.1))
    with tf.name_scope('v'):
        # Applying fully connected layer with non-linear activation to each of the B*T timestamps;
        # the shape of `v` is (B,T,D)*(D,A)=(B,T,A), where A=attention_size
        v = tf.tanh(tf.tensordot(inputs, w_omega, axes=1) + b_omega)
    # For each of the timestamps its vector of size A from `v` is reduced with `u` vector
    vu = tf.tensordot(v, u_omega, axes=1, name='vu')  # (B,T) shape
    alphas = tf.nn.softmax(vu, name='alphas')  # (B,T) shape
    # Output of (Bi-)RNN is reduced with attention vector; the result has (B,D) shape
    m = inputs * tf.expand_dims(alphas, -1)
    # m has shape [-1,100,1200]; sum pooling (or average pooling) over the time axis then
    # creates a [-1,1200] vector that represents the sentence
    output = tf.reduce_sum(m, 1)
    return output, alphas
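For reference, the contraction that tensordot performs here with axes=1, the last axis of a rank-3 tensor against the first axis of a rank-2 tensor, can also be written as a reshape followed by an ordinary matrix product. The NumPy sketch below shows the equivalence. The shapes and names are hypothetical, and using the same reshape/matmul pattern where tf.tensordot is unavailable is an assumption, not something confirmed in this thread.

```python
import numpy as np

B, T, D, A = 2, 5, 4, 3  # hypothetical batch, time, hidden and attention sizes
rng = np.random.default_rng(0)
inputs = rng.normal(size=(B, T, D))
w_omega = rng.normal(size=(D, A))

# Direct contraction: (B, T, D) x (D, A) -> (B, T, A)
direct = np.tensordot(inputs, w_omega, axes=1)

# Equivalent form: flatten the leading axes, multiply, then restore the shape.
flat = inputs.reshape(-1, D) @ w_omega
via_matmul = flat.reshape(B, T, A)

print(direct.shape, np.allclose(direct, via_matmul))
```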

tensorflow giving nans when calculating gradient with sparse tensors

The following snippet is from a fairly large piece of code but hopefully I can give all the information necessary:
y2 = tf.matmul(y1,ymask)
dist = tf.norm(ystar-y2,axis=0)
y1 and y2 are 128x30 and ymask is 30x30. ystar is 128x30. dist is 1x30. When ymask is the identity matrix, everything works fine. But when I set it to be all zeros, apart from a single 1 along the diagonal (so as to set all columns but one in y2 to be zero), I get nans for the gradient of dist with respect to y2, using tf.gradients(dist, [y2]). The specific value of dist is [0,0,7.9,0,...], with all the ystar-y2 values being around the range (-1,1) in the third column and zero elsewhere.
I'm pretty confused as to why a numerical issue would occur here, given there are no logs or divisions, is this underflow? Am I missing something in the maths?
For context, I'm doing this to try to train individual dimensions of y, one at a time, using the whole network.
longer version to reproduce:
import tensorflow as tf
import numpy as np
import pandas as pd
batchSize = 128
eta = 0.8
tasks = 30
imageSize = 32**2
groups = 3
tasksPerGroup = 10
trainDatapoints = 10000
w = np.zeros([imageSize, groups * tasksPerGroup])
toyIndex = 0
for toyLoop in range(groups):
    m = np.ones([imageSize]) * np.random.randn(imageSize)
    for taskLoop in range(tasksPerGroup):
        w[:, toyIndex] = m * 0.1 * np.random.randn(1)
        toyIndex += 1
xRand = np.random.normal(0, 0.5, (trainDatapoints, imageSize))
taskLabels = np.matmul(xRand, w) + np.random.normal(0,0.5,(trainDatapoints, groups * tasksPerGroup))
DF = np.concatenate((xRand, taskLabels), axis=1)
trainDF = pd.DataFrame(DF[:trainDatapoints, ])
# define graph variables
x = tf.placeholder(tf.float32, [None, imageSize])
W = tf.Variable(tf.zeros([imageSize, tasks]))
b = tf.Variable(tf.zeros([tasks]))
ystar = tf.placeholder(tf.float32, [None, tasks])
ymask = tf.placeholder(tf.float32, [tasks, tasks])
dataLength = tf.cast(tf.shape(ystar)[0],dtype=tf.float32)
y1 = tf.matmul(x, W) + b
y2 = tf.matmul(y1,ymask)
dist = tf.norm(ystar-y2,axis=0)
mse = tf.reciprocal(dataLength) * tf.reduce_mean(tf.square(dist))
grads = tf.gradients(dist, [y2])
trainStep = tf.train.GradientDescentOptimizer(eta).minimize(mse)
# build graph
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)
randTask = np.random.randint(0, 9)
ymaskIn = np.zeros([tasks, tasks])
ymaskIn[randTask, randTask] = 1
batch = trainDF.sample(batchSize)
batch_xs = batch.iloc[:, :imageSize]
batch_ys = np.zeros([batchSize, tasks])
batch_ys[:, randTask] = batch.iloc[:, imageSize + randTask]
gradOut = sess.run(grads, feed_dict={x: batch_xs, ystar: batch_ys, ymask: ymaskIn})
sess.run(trainStep, feed_dict={x: batch_xs, ystar: batch_ys, ymask:ymaskIn})
Here's a very simple reproduction:
import tensorflow as tf
with tf.Graph().as_default():
    y = tf.zeros(shape=[1], dtype=tf.float32)
    dist = tf.norm(y,axis=0)
    (grad,) = tf.gradients(dist, [y])
    with tf.Session():
        print(grad.eval())
Prints:
[ nan]
The issue is that tf.norm computes sum(x**2)**0.5. The gradient is x / sum(x**2) ** 0.5 (see e.g. https://math.stackexchange.com/a/84333), so when sum(x**2) is zero we're dividing by zero.
There's not much to be done in terms of a special case: the gradient as x approaches all zeros depends on which direction it's approaching from. For example if x is a single-element vector, the limit as x approaches 0 could either be 1 or -1 depending on which side of zero it's approaching from.
So in terms of solutions, you could just add a small epsilon:
import tensorflow as tf
def safe_norm(x, epsilon=1e-12, axis=None):
    return tf.sqrt(tf.reduce_sum(x ** 2, axis=axis) + epsilon)
with tf.Graph().as_default():
    y = tf.constant([0.])
    dist = safe_norm(y,axis=0)
    (grad,) = tf.gradients(dist, [y])
    with tf.Session():
        print(grad.eval())
Prints:
[ 0.]
Note that this is not actually the Euclidean norm. It's a good approximation as long as the input is much larger than epsilon.
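To make the trade-off concrete, here is a NumPy sketch of the same formula. NumPy stands in for TensorFlow, and safe_norm_grad is a hypothetical helper that writes the gradient out by hand instead of relying on automatic differentiation.

```python
import numpy as np

def safe_norm(x, epsilon=1e-12):
    # sqrt(sum(x**2) + epsilon): never exactly zero, so its gradient is always defined.
    return np.sqrt(np.sum(x ** 2) + epsilon)

def safe_norm_grad(x, epsilon=1e-12):
    # d/dx sqrt(sum(x**2) + epsilon) = x / sqrt(sum(x**2) + epsilon)
    return x / safe_norm(x, epsilon)

zero = np.zeros(3)
big = np.array([3.0, 4.0])

print(safe_norm_grad(zero))                 # finite: all zeros instead of nan
print(safe_norm(big), np.linalg.norm(big))  # nearly identical for inputs far from zero
print(safe_norm(zero))                      # sqrt(epsilon) rather than 0
```

At the zero vector the epsilon version returns sqrt(epsilon) instead of 0, which is the approximation error the note above refers to.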