Python, XGBoost, custom objective function: predictions remain unchanged over iterations, even with non-zero gradient

I am trying to get a custom loss function to work with XGBoost. I am presenting the results for four choices of objective:
'reg:linear', defined by xgboost, which uses MSE
mse_approx_obj, a custom MSE objective (code below)
mae_approx_obj, a custom mean-absolute-error objective (code below)
pseudohuber_approx_obj, the Pseudo-Huber loss (see the Wikipedia article)
As you can see from the results below, the algorithm works well for the xgboost-defined MSE and the custom MSE. However, for the custom MAE and the smooth Pseudo-Huber loss it behaves erratically: for example, for MAE and Pseudo-Huber the predictions do not change at all even when the gradient is large. This seems strange to me. Any pointers?
import numpy as np
from xgboost import XGBRegressor

def mse_approx_obj(dtrain, preds):
    # squared-error objective: gradient = residual, hessian = 1
    d = preds - dtrain
    grad_mse = d
    hess_mse = np.full(d.shape, 1.0)
    return grad_mse, hess_mse

def mae_approx_obj(dtrain, preds):
    # absolute-error objective: gradient = +/-1 depending on the sign of the residual, hessian = 0
    d = preds - dtrain
    grad_mae = np.array(d)
    grad_mae[grad_mae > 0] = 1.
    grad_mae[grad_mae <= 0] = -1.
    hess_mae = np.full(d.shape, 0.0)
    return grad_mae, hess_mae

def pseudohuber_approx_obj(dtrain, preds):
    d = preds - dtrain
    h = 1  # h is the delta parameter of the Pseudo-Huber loss
    scale = 1 + (d / h) ** 2
    scale_sqrt = np.sqrt(scale)
    grad = d / scale_sqrt
    hess = 1 / scale / scale_sqrt
    return grad, hess
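For reference, the gradient and hessian returned above are the first and second derivatives of the Pseudo-Huber loss with parameter h, applied to the residual d = preds - dtrain:

L_h(d) = h^2\left(\sqrt{1 + (d/h)^2} - 1\right), \qquad
\frac{\partial L_h}{\partial d} = \frac{d}{\sqrt{1 + (d/h)^2}}, \qquad
\frac{\partial^2 L_h}{\partial d^2} = \left(1 + (d/h)^2\right)^{-3/2}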
def model(init_params, training_data, validation_data):
    xgb1 = XGBRegressor(learning_rate=0.1, n_estimators=5, max_depth=8,
                        min_child_weight=1, gamma=0, subsample=0.8,
                        colsample_bytree=0.8, objective=mae_approx_obj, seed=2)
    x_train, y_train = training_data[Xfeatures], training_data[target]
    x_test, y_test = validation_data[Xfeatures], validation_data[target]
    xgb1.fit(x_train, y_train, eval_set=[(x_train, y_train), (x_test, y_test)],
             eval_metric='rmse', verbose=True)
    return xgb1

xgb1 = model(init_params, training_data, validation_data)
The results for the four objectives, respectively:
[0] validation_0-rmse:61.5518 validation_1-rmse:57.0926
[1] validation_0-rmse:55.4669 validation_1-rmse:51.2765
[2] validation_0-rmse:49.9936 validation_1-rmse:46.3276
[3] validation_0-rmse:45.0812 validation_1-rmse:41.9738
[4] validation_0-rmse:40.6609 validation_1-rmse:37.9743
[0] validation_0-rmse:61.5506 validation_1-rmse:57.0223
[1] validation_0-rmse:55.4673 validation_1-rmse:51.5381
[2] validation_0-rmse:49.9943 validation_1-rmse:46.6359
[3] validation_0-rmse:45.0801 validation_1-rmse:42.2759
[4] validation_0-rmse:40.6617 validation_1-rmse:38.3059
For mae_approx_obj, I print preds and grad_mae every time mae_approx_obj is called, to help debug the process.
[ 0.5 0.5 0.5 ..., 0.5 0.5 0.5]
[-1. -1. -1. ..., -1. -1. -1.]
[0] validation_0-rmse:68.3176 validation_1-rmse:63.0391
[ 0.5 0.5 0.5 ..., 0.5 0.5 0.5]
[-1. -1. -1. ..., -1. -1. -1.]
[1] validation_0-rmse:68.3176 validation_1-rmse:63.0391
[ 0.5 0.5 0.5 ..., 0.5 0.5 0.5]
[-1. -1. -1. ..., -1. -1. -1.]
[2] validation_0-rmse:68.3176 validation_1-rmse:63.0391
[ 0.5 0.5 0.5 ..., 0.5 0.5 0.5]
[-1. -1. -1. ..., -1. -1. -1.]
[3] validation_0-rmse:68.3176 validation_1-rmse:63.0391
[ 0.5 0.5 0.5 ..., 0.5 0.5 0.5]
[-1. -1. -1. ..., -1. -1. -1.]
[4] validation_0-rmse:68.3176 validation_1-rmse:63.0391
And for pseudohuber_approx_obj:
[ 60.79294968 71.14537811 68.94273376 ..., 70.04405212 70.04405212 68.72246552]
[-1.99890065 -1.99919903 -1.9991467 ..., -1.9991734 -1.999173 -1.9991411]
[0] validation_0-rmse:1138.79 validation_1-rmse:1106.67
[ 60.79294968 71.14537811 68.94273376 ..., 70.04405212 70.04405212 68.72246552]
[ 1.99999702 1.99999702 1.99999702 ..., 1.99999702 1.9999970 1.99999702]
[1] validation_0-rmse:149.637 validation_1-rmse:261.029
[60.79294968 71.14537811 68.94273376 ..., 70.04405212 70.04405212 68.72246552]
[-1.9996798 -1.9997319 -1.99972188 ..., -1.99972689 -1.99972689 -1.99972081]
[2] validation_0-rmse:149.637 validation_1-rmse:261.029
[ 60.79294968 71.14537811 68.94273376 ..., 70.04405212 70.04405212 68.72246552]
[-1.9996798 -1.9997319 -1.99972188 ..., -1.99972689 -1.99972689 -1.99972081]
[3] validation_0-rmse:149.637 validation_1-rmse:261.029
[60.79294968 71.14537811 68.94273376 ..., 70.04405212 70.04405212 68.72246552]
[-1.9996798 -1.9997319 -1.99972188 ..., -1.99972689 -1.99972689 -1.999720]
[4] validation_0-rmse:149.637 validation_1-rmse:261.029
PS: There is a very similar question here, but with no answer. It has been a while, so I am asking again to see if there is something I am missing.

Related

emcee generates the same samples twice

I'm using emcee to generate samples with a given ln_prob twice, but both runs yield exactly the same samples.
I am using the same initial state for both samplers, but I don't see why that should matter.
Am I wrong in thinking that they should yield different results?
import emcee
import numpy as np

NWALKERS = 32
NDIM = 2
NSAMPLES = 1000

def ln_gaussian(x):
    # mu = 0, cov = 1
    a = (2 * np.pi) ** -0.5
    return np.log(a * np.exp(-0.5 * np.dot(x, x)))

p0 = np.random.rand(NWALKERS, NDIM)
sampler1 = emcee.EnsembleSampler(NWALKERS, NDIM, ln_gaussian)
sampler2 = emcee.EnsembleSampler(NWALKERS, NDIM, ln_gaussian)

state1 = sampler1.run_mcmc(p0, 100)  # burn in
state2 = sampler2.run_mcmc(p0, 100)  # burn in
sampler1.reset()
sampler2.reset()

# run sampler 1k times (x32 walkers)
sampler1.run_mcmc(state1, NSAMPLES)
sampler2.run_mcmc(state2, NSAMPLES)

s1 = sampler1.get_chain(flat=True)
s2 = sampler2.get_chain(flat=True)
s1 - s2
The output is
array([[0., 0.],
[0., 0.],
[0., 0.],
...,
[0., 0.],
[0., 0.],
[0., 0.]])
If I use different initial states
p0 = np.random.rand(NWALKERS, NDIM)
p1 = np.random.rand(NWALKERS, NDIM)
it yields different samples
array([[-0.70474519, -0.09671908],
[-0.31555036, -0.33661664],
[ 0.75735537, 0.01540277],
...,
[ 2.84810783, -2.11736446],
[-0.55164227, -0.26478868],
[ 0.01301593, -1.76233017]])
But why should that matter? I thought it was random.

f-string formatting for numpy array

Here is my code snippet. It prints the means and the standard deviations of the image pixels.
from numpy import asarray
from PIL import Image
import os
os.chdir("../images")
image = Image.open("dubai_2020.jpg")
pixels = asarray(image)
pixels = pixels.astype("float32")
means, stds = pixels.mean(axis=(0, 1), dtype="float64"), pixels.std(axis=(0, 1), dtype="float64")
print(f"Means: {means:%.2f}, Stds: {stds:%.2f} ")
And the output is
File "pil_local_standard5.py", line 15, in <module>
print(f"Means: {means:%.2f, %.2f, %.2f}, Stds: {stds:%.2f, %.2f, %.2f} ")
TypeError: unsupported format string passed to numpy.ndarray.__format__
How do I define the f-string format for the data in this case?
I think the easiest way to accomplish something similar to what you want currently requires the use of numpy.array2string.
For example, let's say means = np.random.random((5, 3)). Then you could do this:
import numpy as np
means = np.random.random((5, 3)).astype(np.float32) # simulate some array
print(f"{np.array2string(means, precision=2, floatmode='fixed')}")
which will print:
[[0.41 0.12 0.84]
[0.28 0.43 0.29]
[0.68 0.41 0.14]
[0.75 1.00 0.16]
[0.30 0.49 0.37]]
The same can be achieved with:
print(f"{np.array2string(means, formatter={'float': lambda x: f'{x:.2f}'})}")
You can also add separators, if you wish:
print(f"{np.array2string(means, formatter={'float': lambda x: f'{x:.2f}'}, separator=', ')}")
which would print:
[[0.41, 0.12, 0.84],
[0.28, 0.43, 0.29],
[0.68, 0.41, 0.14],
[0.75, 1.00, 0.16],
[0.30, 0.49, 0.37]]
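A simpler route for the specific case in the question: means and stds there are just 1-D arrays with one value per colour channel, so (a minimal sketch, assuming a 3-channel RGB image; the numbers below are placeholders, not real image statistics) you can unpack them and format each value individually:

import numpy as np

means = np.array([103.87, 112.49, 120.01])  # placeholder per-channel means, for illustration only
stds = np.array([48.26, 47.31, 51.94])      # placeholder per-channel stds
print(f"Means: {means[0]:.2f}, {means[1]:.2f}, {means[2]:.2f}, "
      f"Stds: {stds[0]:.2f}, {stds[1]:.2f}, {stds[2]:.2f}")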
Unfortunately, Python's f-string doesn't support formatting of numpy arrays.
A workaround I came up with:
def prettifyStr(numpyArray, fstringText):
    num_rows = numpyArray.ndim
    l = len(str(numpyArray))
    t = (l // num_rows)
    diff_to_center_align = 50 - t
    return f"{str(numpyArray)}{' ': <{diff_to_center_align}}{fstringText}"
Sample use
print( prettifyStr(a2, "this is some text") )
print( prettifyStr(a3, "this is some text") )
print( prettifyStr(a1, "this is some text") )
print( prettifyStr(a4, "this is some text") )
Output
[[0. 3. 4. ]
[0. 5. 5.1]] this is some text
[[0. 3. 4. 4.35]
[0. 5. 5.1 3.6 ]] this is some text
[[0 3]
[0 5]] this is some text
[[0. 3. 4. 4.35 4.25]
[0. 5. 5.1 3.6 3.1 ]] this is some text

tensorflow get boxes pair with maximum IOU but discard boxes with all zeros

This question builds on an earlier one: (tensorflow remember the index after calculating getting the maximum box). I find discarding boxes with all zeros particularly hard, so I am posting a new question.
Complete description:
Assume that I have two arrays of boxes, with shapes (?, b1, 4) and (?, b2, 4) respectively (treat ? as an unknown batch size):
box1: [[[1,2,3,4], [2,3,4,5], [3,4,5,6], [0,0,0,0], [0,0,0,0]...]...]
box2: [[[4,3,2,1], [3,2,5,4], [4,3,5,6]...]...]
(the numbers above are set arbitrarily)
Note that box1 may or may not have fake boxes ([0,0,0,0]) at the end.
I want to:
in each batch, for each non-fake box A in box1 (that is, each box that is not all zeros), find in box2 the box B which has the maximum IOU (intersection over union) with A (in the same batch, of course), and then append the tuple (A, B) to a list list_max.
append to list_nonmax all the boxes in box2 that do not have maximum IOU with any box in box1 (separated by batch, of course)
You can assume that:
b1 and b2 are both python variables, not tensorflow tensors.
methods for calculating IOU between single boxes or between batches of boxes already exist and can be used directly:
iou_single_box(box1, box2) : both box1 and box2 are of shape (4,).
iou_multiple_boxes(bbox1, bbox2) : both bbox1 and bbox2 are of shape (b1, 4) and (b2, 4) respectively.
iou_batch_boxes(bbbox1, bbbox2) : both bbbox1 and bbbox2 are of shape (?, b1, 4) and (?, b2, 4) respectively (treat ? as a unknown batch size).
You can take a look at the question (tensorflow remember the index after calculating getting the maximum box) I posted previously. I only add one constraint:
I don't want any fake box in box1 to match against any box in box2 when building list_max and list_nonmax.
Note that the number of fake boxes is not fixed.
Note: I know this question is quite complicated. I do all this because TensorFlow cannot handle dynamic-length arrays (you need a deterministic b1 for box1 at runtime), so I pad [0, 0, 0, 0] at the end of box1 to make the length fixed.
I believe this is easily doable with tf.boolean_mask(), like in this code (tested):
from __future__ import print_function
import tensorflow as tf

box1 = tf.reshape( tf.constant( range( 16 ), dtype = tf.float32 ), ( 2, 2, 4 ) )
box1 = tf.concat( [ box1, tf.zeros( ( 2, 2, 4 ) ) ], axis = 1 )
box2 = tf.reshape( tf.constant( range( 2, 26 ), dtype = tf.float32 ), ( 2, 3, 4 ) )
batch_size = box1.get_shape().as_list()[ 0 ]

def dummy_iou_batch_boxes( box1, box2 ):
    b1s, b2s = box1.get_shape().as_list(), box2.get_shape().as_list()
    return tf.constant( [ [ [ 9.0, 8, 7 ], [ 1, 2, 3 ], [ 0, 10, 0 ], [ 0, 0, 0 ],
                            [ 0, 1, 2 ], [ 0, 5, 0 ], [ 0, 0, 0 ], [ 0, 0, 0 ] ] ] )

iou = dummy_iou_batch_boxes( box1, box2 )
val, idx = tf.nn.top_k( iou, k = 1 )
idx = tf.reshape( idx, ( batch_size, box1.get_shape().as_list()[ 1 ] ) )
one_hot_idx = tf.one_hot( idx, depth = box2.get_shape().as_list()[ 1 ] )

# for listmax
full_idx = tf.where( tf.equal( 1.0, one_hot_idx ) )
box1_idx = full_idx[ :, 0 : 2 ]
box2_idx = full_idx[ :, 0 : 3 : 2 ]
box12 = tf.gather_nd( box1, box1_idx )
box22 = tf.gather_nd( box2, box2_idx )
list_max_raw = tf.stack( [ box12, box22 ], axis = 1 )

# filter out for a = [ 0, 0, 0, 0 ]
nonzero_mask = tf.reduce_any( tf.not_equal( 0.0, list_max_raw ), axis = 2 )[ :, 0 ]
list_max = tf.boolean_mask( list_max_raw, nonzero_mask )

# for list nonmax
nonzero_mask = tf.cast( tf.reduce_any( tf.not_equal( 0.0, box1 ), axis = 2 ), tf.float32 )[ ..., None ]
filtered_one_hot = one_hot_idx * nonzero_mask
active_box2 = tf.sign( tf.reduce_sum( filtered_one_hot, axis = 1 ) )
nonactive_box2 = 1.0 - active_box2
nonactive_box2_idx = tf.where( tf.equal( 1.0, nonactive_box2 ) )
list_nonmax = tf.gather_nd( box2, nonactive_box2_idx )

with tf.Session() as sess:
    res = sess.run( [ box1, box2, list_max ] )
    print( "Input boxes: " )
    for v in res[ : 2 ]:
        print( v )
        print( " ", "=" * 40 )
    print( "List max: " )
    for v in res[ 2 : ]:
        print( v )
        print( " ", "=" * 40 )
    res = sess.run( [ list_nonmax ] )
    print( "List nonmax: " )
    for v in res:
        print( v )
        print( " ", "=" * 40 )
will output
Input boxes:
[[[ 0. 1. 2. 3.]
[ 4. 5. 6. 7.]
[ 0. 0. 0. 0.]
[ 0. 0. 0. 0.]]
[[ 8. 9. 10. 11.]
[12. 13. 14. 15.]
[ 0. 0. 0. 0.]
[ 0. 0. 0. 0.]]]
========================================
[[[ 2. 3. 4. 5.]
[ 6. 7. 8. 9.]
[10. 11. 12. 13.]]
[[14. 15. 16. 17.]
[18. 19. 20. 21.]
[22. 23. 24. 25.]]]
========================================
List max:
[[[ 0. 1. 2. 3.]
[ 2. 3. 4. 5.]]
[[ 4. 5. 6. 7.]
[10. 11. 12. 13.]]
[[ 8. 9. 10. 11.]
[22. 23. 24. 25.]]
[[12. 13. 14. 15.]
[18. 19. 20. 21.]]]
========================================
List nonmax:
[[ 6. 7. 8. 9.]
[14. 15. 16. 17.]]
========================================
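To see what the nonzero_mask step buys you, here is a tiny NumPy sketch of the same masking idea (the one-hot values below are made up rather than computed from real IOUs): a fake box1 row is zeroed out before summing, so the box2 column it "won" is never marked active and therefore ends up in list_nonmax.

import numpy as np

# Hypothetical top-1 match matrix: row i of box1 matched column j of box2.
one_hot_idx = np.array([[0., 1., 0.],   # real box1[0] -> box2[1]
                        [1., 0., 0.],   # real box1[1] -> box2[0]
                        [0., 1., 0.]])  # fake box1[2] -> its match must be ignored
box1_is_real = np.array([1., 1., 0.])[:, None]

active_box2 = np.sign((one_hot_idx * box1_is_real).sum(axis=0))
print(active_box2)  # [1. 1. 0.] -> box2[2] would go to list_nonmax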

one-hot encoding and existing data

I have a numpy array (N,M) where some of the columns should be one-hot encoded. Please help me do the one-hot encoding using numpy and/or tensorflow.
Example:
[
[ 0.993, 0, 0.88 ]
[ 0.234, 1, 1.00 ]
[ 0.235, 2, 1.01 ]
.....
]
The 2nd column here should be one-hot encoded; I know that there are only 3 distinct values (0, 1, 2).
The resulting array should look like:
[
[ 0.993, 0.88, 0, 0, 0 ]
[ 0.234, 1.00, 0, 1, 0 ]
[ 0.235, 1.01, 1, 0, 0 ]
.....
]
That way I would be able to feed this array into tensorflow.
Please notice that the 2nd column was removed and its one-hot version was appended at the end of each sub-array.
Any help would be highly appreciated.
Thanks in advance.
Update:
Here is what I have right now. Well, it is not exactly the example above:
1. I have more than 3 columns in the array, but I still want to encode only the 2nd one.
2. The first array is structured, i.e. its shape is (N,).
Here is the code:
def one_hot(value, max_value):
    value = int(value)
    a = np.zeros(max_value, 'uint8')
    if value != 0:
        a[value] = 1
    return a

# data is a structured array with shape (N,)
# it has strings, ints, floats inside...
# it was obtained with np.genfromtxt(dtype=None)
unique_values = dict()
unique_values['categorical1'] = 1
unique_values['categorical2'] = 2
for row in data:
    row[col] = unique_values[row[col]]

codes = np.zeros((data.shape[0], len(unique_values)))
idx = 0
for row in data:
    codes[idx] = one_hot(row[col], len(unique_values))  # could be optimised by not creating a new array every time
    idx += 1

data = np.c_[data[:, [range(0, col), range(col + 1, 32)]], codes[data[:, col].astype(int)]]
Also trying to concatenate via:
print data.shape # shape (5000,)
print codes.shape # shape (5000,3)
data = np.concatenate((data, codes), axis=1)
Here's one approach -
In [384]: a # input array
Out[384]:
array([[ 0.993, 0. , 0.88 ],
[ 0.234, 1. , 1. ],
[ 0.235, 2. , 1.01 ]])
In [385]: codes = np.array([[0,0,0],[0,1,0],[1,0,0]]) # define codes here
In [387]: codes
Out[387]:
array([[0, 0, 0], # encoding for 0
[0, 1, 0], # encoding for 1
[1, 0, 0]]) # encoding for 2
# Slice out the second column and append one-hot encoded array
In [386]: np.c_[a[:,[0,2]], codes[a[:,1].astype(int)]]
Out[386]:
array([[ 0.993, 0.88 , 0. , 0. , 0. ],
[ 0.234, 1. , 0. , 1. , 0. ],
[ 0.235, 1.01 , 1. , 0. , 0. ]])
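A side note, not part of the answer above: if a conventional one-hot layout is acceptable (value k gets a 1 in position k, i.e. 0 -> [1,0,0], 1 -> [0,1,0], 2 -> [0,0,1], which differs from the mapping requested in the question), the codes table does not need to be written out by hand; an identity matrix does the job:

import numpy as np

a = np.array([[0.993, 0., 0.88],
              [0.234, 1., 1.00],
              [0.235, 2., 1.01]])

codes = np.eye(3)  # row k is the standard one-hot vector for value k
encoded = np.c_[a[:, [0, 2]], codes[a[:, 1].astype(int)]]
print(encoded)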

How can I find a basis for the column space of a rectangular matrix?

Given a numpy ndarray with dimensions m by n (where n>m), how can I find the linearly independent columns?
One way is to use the LU decomposition. The factor U will be of the same size as your matrix, but will be upper-triangular. In each row of U, pick the first nonzero element: these are pivot elements, which belong to linearly independent columns. A self-contained example:
import numpy as np
from scipy.linalg import lu
A = np.array([[1, 2, 3], [2, 4, 2]]) # example for testing
U = lu(A)[2]
lin_indep_columns = [np.flatnonzero(U[i, :])[0] for i in range(U.shape[0])]
Output: [0, 2], which means the 0th and 2nd columns of A form a basis for its column space.
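Another option, not used in the answer above but worth knowing, is a column-pivoted QR decomposition: scipy.linalg.qr with pivoting=True orders the columns so that the first rank(A) pivot indices point at a set of linearly independent columns (the particular basis it picks may differ from the LU-based one). A minimal sketch:

import numpy as np
from scipy.linalg import qr

A = np.array([[1, 2, 3], [2, 4, 2]], dtype=float)
Q, R, piv = qr(A, pivoting=True)        # piv is a permutation of the column indices
rank = np.linalg.matrix_rank(A)
lin_indep_columns = sorted(piv[:rank])  # indices of one valid column basis
print(lin_indep_columns)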
#user6655984's answer inspired this code; I developed a function to replace the author's last line of code (finding the pivot columns of U) so that it can handle a wider range of A's.
Here it is:
import numpy as np
from scipy import linalg as LA
np.set_printoptions(precision=1, suppress=True)
A = np.array([[1, 4, 1, -1],
              [2, 5, 1, -2],
              [3, 6, 1, -3]])
P, L, U = LA.lu(A)
print('P', P, '', 'L', L, '', 'U', U, sep='\n')
Output:
P
[[0. 1. 0.]
[0. 0. 1.]
[1. 0. 0.]]
L
[[1. 0. 0. ]
[0.3 1. 0. ]
[0.7 0.5 1. ]]
U
[[ 3. 6. 1. -3. ]
[ 0. 2. 0.7 -0. ]
[ 0. 0. -0. -0. ]]
I came up with this function:
def get_indices_for_linearly_independent_columns_of_A(U: np.ndarray) -> list:
    # I should first convert all "-0."s to "0." so that nonzero() can find them.
    U_copy = U.copy()
    U_copy[abs(U_copy) < 1.e-7] = 0

    # Because some rows in U may not have even one nonzero element,
    # I have to find the index for the first one in two steps.
    index_of_all_nonzero_cols_in_each_row = (
        [U_copy[i, :].nonzero()[0] for i in range(U_copy.shape[0])]
    )
    index_of_first_nonzero_col_in_each_row = (
        [indices[0] for indices in index_of_all_nonzero_cols_in_each_row
         if len(indices) > 0]
    )

    # Because two rows or more may have the same indices
    # for their first nonzero element, I should remove duplicates.
    unique_indices = sorted(list(set(index_of_first_nonzero_col_in_each_row)))
    return unique_indices
Finally:
col_sp_A = A[:, get_indices_for_linearly_independent_columns_of_A(U)]
print(col_sp_A)
Output:
[[1 4]
[2 5]
[3 6]]
Try this one
def LU_decomposition(A):
    """
    Perform LU decomposition of a given matrix
    Args:
        A: the given matrix
    Returns: P, L and U, s.t. PA = LU
    """
    assert A.shape[0] == A.shape[1]
    N = A.shape[0]
    P_idx = np.arange(0, N, dtype=np.int16).reshape(-1, 1)
    for i in range(N - 1):
        pivot_loc = np.argmax(np.abs(A[i:, [i]])) + i
        if pivot_loc != i:
            A[[i, pivot_loc], :] = A[[pivot_loc, i], :]
            P_idx[[i, pivot_loc], :] = P_idx[[pivot_loc, i], :]
        A[i + 1:, i] /= A[i, i]
        A[i + 1:, i + 1:] -= A[i + 1:, [i]] * A[[i], i + 1:]
    U, L, P = np.zeros_like(A), np.identity(N), np.zeros((N, N), dtype=np.int16)
    for i in range(N):
        L[i, :i] = A[i, :i]
        U[i, i:] = A[i, i:]
        P[i, P_idx[i][0]] = 1
    return P.astype(np.float64), L, U


def get_bases(A):
    assert A.ndim == 2
    # Note: gaussian_elimination() is assumed to reduce A to row-echelon form;
    # it is not defined in this answer.
    Q = gaussian_elimination(A)
    M, N = Q.shape
    pivot_idxs = []
    for i in range(M):
        j = i
        while j < N and abs(Q[i, j]) < 1e-5:
            j += 1
        if j < N:
            pivot_idxs.append(j)
    return A[:, list(set(pivot_idxs))]
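get_bases above relies on a gaussian_elimination helper that is not shown. A minimal sketch of what such a routine could look like for rectangular matrices (my own filler under that assumption, with partial pivoting and a small tolerance, not the original author's code) is:

import numpy as np

def gaussian_elimination(A, eps=1e-12):
    """Return a row-echelon form of A (rectangular allowed), using partial pivoting."""
    Q = A.astype(np.float64).copy()
    M, N = Q.shape
    row = 0
    for col in range(N):
        if row >= M:
            break
        pivot = row + np.argmax(np.abs(Q[row:, col]))
        if abs(Q[pivot, col]) < eps:
            continue                       # no usable pivot in this column
        Q[[row, pivot]] = Q[[pivot, row]]  # bring the pivot row up
        Q[row + 1:] -= (Q[row + 1:, [col]] / Q[row, col]) * Q[[row]]
        row += 1
    return Q

Any routine that produces a row-echelon form of A would work equally well as the elimination step for get_bases.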