tf.keras.losses.CategoricalCrossentropy gives different values than plain implementation - tensorflow

Does anyone know why a plain implementation of the categorical crossentropy function gives values so different from tf.keras's API function?
import tensorflow as tf
import numpy as np
tf.enable_eager_execution()
y_true = np.array([[1., 0., 0.], [0., 1., 0.], [0., 0., 1.]])
y_pred = np.array([[.9, .05, .05], [.5, .89, .6], [.05, .01, .94]])
ce = tf.keras.losses.CategoricalCrossentropy()
res = ce(y_true, y_pred).numpy()
print("use api:")
print(res)
print()
print("implementation:")
step1 = -y_true * np.log(y_pred)
step2 = np.sum(step1, axis=1)
print("step1.shape:", step1.shape)
print(step1)
print("sum step1:", np.sum(step1))
print("mean step1", np.mean(step1))
print()
print("step2.shape:", step2.shape)
print(step2)
print("sum step2:", np.sum(step2))
print("mean step2", np.mean(step2))
The above gives:
use api:
0.3239681124687195
implementation:
step1.shape: (3, 3)
[[0.10536052 0.         0.        ]
 [0.         0.11653382 0.        ]
 [0.         0.         0.0618754 ]]
sum step1: 0.2837697356318653
mean step1 0.031529970625762814
step2.shape: (3,)
[0.10536052 0.11653382 0.0618754 ]
sum step2: 0.2837697356318653
mean step2 0.09458991187728844
Now, with another y_true and y_pred:
y_true = np.array([[0, 1]])
y_pred = np.array([[0.99999999999, 0.00000000001]])
It gives:
use api:
16.11809539794922
implementation:
step1.shape: (1, 2)
[[-0.         25.32843602]]
sum step1: 25.328436022934504
mean step1 12.664218011467252
step2.shape: (1,)
[25.32843602]
sum step2: 25.328436022934504
mean step2 25.328436022934504

The difference is because of these values: [.5, .89, .6], since their sum is not equal to 1. I think you made a mistake and meant [.05, .89, .06] instead.
If you provide values whose sum equals 1, then both formulas give the same results:
import tensorflow as tf
import numpy as np
y_true = np.array([[1., 0., 0.], [0., 1., 0.], [0., 0., 1.]])
y_pred = np.array([[.9, .05, .05], [.05, .89, .06], [.05, .01, .94]])
print(tf.keras.losses.categorical_crossentropy(y_true, y_pred).numpy())
print(np.sum(-y_true * np.log(y_pred), axis=1))
#output
#[0.10536052 0.11653382 0.0618754 ]
#[0.10536052 0.11653382 0.0618754 ]
However, how is the loss calculated if the y_pred tensor is not scaled (i.e. the sum of its values is not equal to 1)? If you look at the source code of categorical crossentropy here, you will see that it scales y_pred so that the class probabilities of each sample sum to 1:
if not from_logits:
    # scale preds so that the class probas of each sample sum to 1
    output /= tf.reduce_sum(output,
                            reduction_indices=len(output.get_shape()) - 1,
                            keep_dims=True)
Since we passed a prediction whose probabilities do not sum to 1, let's see how this operation changes our tensor [.5, .89, .6]:
output = tf.constant([.5, .89, .6])
output /= tf.reduce_sum(output,
                        axis=len(output.get_shape()) - 1,
                        keepdims=True)
print(output.numpy())
# array([0.2512563 , 0.44723618, 0.30150756], dtype=float32)
So the results should be equal if we pass this scaled y_pred to your own implementation of categorical crossentropy, while passing the unscaled y_pred to the TensorFlow implementation:
y_true = np.array([[1., 0., 0.], [0., 1., 0.], [0., 0., 1.]])
#unscaled y_pred
y_pred = np.array([[.9, .05, .05], [.5, .89, .6], [.05, .01, .94]])
print(tf.keras.losses.categorical_crossentropy(y_true, y_pred).numpy())
#scaled y_pred (categorical_crossentropy scales above tensor to this internally)
y_pred = np.array([[.9, .05, .05], [0.2512563 , 0.44723618, 0.30150756], [.05, .01, .94]])
print(np.sum(-y_true * np.log(y_pred), axis=1))
Output:
[0.10536052 0.80466845 0.0618754 ]
[0.10536052 0.80466846 0.0618754 ]
Now, let's explore your second example. Why does it show a different output?
If you check the source code again, you will see this line:
output = tf.clip_by_value(output, epsilon, 1. - epsilon)
which clips values to the range [epsilon, 1 - epsilon]. Your input [0.99999999999, 0.00000000001] is converted to [0.9999999, 0.0000001] by this line, so it gives you a different result:
y_true = np.array([[0, 1]])
y_pred = np.array([[0.99999999999, 0.00000000001]])
print(tf.keras.losses.categorical_crossentropy(y_true, y_pred).numpy())
print(np.sum(-y_true * np.log(y_pred), axis=1))
#now let's first clip the values less than epsilon, then compare loss
epsilon=1e-7
y_pred = tf.clip_by_value(y_pred, epsilon, 1. - epsilon)
print(tf.keras.losses.categorical_crossentropy(y_true, y_pred).numpy())
print(np.sum(-y_true * np.log(y_pred), axis=1))
Output:
#results without clipping values
[16.11809565]
[25.32843602]
#results after clipping values if there is a value less than epsilon (1e-7)
[16.11809565]
[16.11809565]
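Putting both pieces together: a minimal sketch (assuming Keras' default backend epsilon of 1e-7) that reproduces the API result by scaling and clipping y_pred manually before applying the plain formula:
import numpy as np
def manual_categorical_crossentropy(y_true, y_pred, epsilon=1e-7):
    # scale preds so the class probas of each sample sum to 1
    y_pred = y_pred / np.sum(y_pred, axis=-1, keepdims=True)
    # clip to [epsilon, 1 - epsilon] as the backend does
    y_pred = np.clip(y_pred, epsilon, 1. - epsilon)
    return np.sum(-y_true * np.log(y_pred), axis=-1)
y_true = np.array([[1., 0., 0.], [0., 1., 0.], [0., 0., 1.]])
y_pred = np.array([[.9, .05, .05], [.5, .89, .6], [.05, .01, .94]])
print(manual_categorical_crossentropy(y_true, y_pred))
# [0.10536052 0.80466845 0.0618754 ]
print(manual_categorical_crossentropy(y_true, y_pred).mean())
# 0.3239681... -- the loss class tf.keras.losses.CategoricalCrossentropy additionally
# averages the per-sample losses, which is why ce(y_true, y_pred) returned a single scalar.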

Related

Why does the default GRU implementation in pytorch and keras differ significantly?

After implementing my own GRU cell, I was trying to validate it against the default implementations available in pytorch and keras. My implementation was very close to pytorch's but significantly different from keras's. So I first decided to compare the pytorch and keras implementations against each other and found that they, too, differ significantly. Here is some code:
import numpy as np
import torch as tt
import torch.nn as nn
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
# data generation
input_size = 2
seq_len = 4
hidden_size = 3
batch_size=1
rng = np.random.default_rng(10)
xx = rng.uniform(size=(batch_size, seq_len, input_size)).astype(np.float32)
print(xx.shape)
output:
(1, 4, 2)
First, create a keras GRU layer and do one forward pass to get the output:
gru_keras = layers.GRU(hidden_size, return_sequences=True)
out_gru_keras = gru_keras(xx)
out_gru_keras
output:
<tf.Tensor: shape=(1, 4, 3), dtype=float32, numpy=
array([[[0.05563376, 0.22639018, 0.14037813],
        [0.05021746, 0.27509487, 0.17843583],
        [0.0273784 , 0.23740683, 0.1731183 ],
        [0.10505023, 0.38027072, 0.31583947]]], dtype=float32)>
Then check the weights of the keras GRU:
gru_keras_weights = gru_keras.get_weights()
print(f'{len(gru_keras_weights)=}')
for w in gru_keras_weights:
    print(w.shape, w.dtype, w)
output:
len(gru_keras_weights)=3
(2, 9) float32 [[ 0.38249677 -0.67729133 -0.28855678 0.3081903 -0.530349 0.1531434
0.09444886 0.2978403 0.1516701 ]
[ 0.0833146 0.27516943 0.4720915 -0.7370237 -0.20921749 0.38180763
0.23018956 0.39872426 0.5722596 ]]
(3, 9) float32 [[-0.21449113 0.2944518 -0.25759113 0.2292317 0.35174483 0.42021522
-0.5116475 0.42759803 -0.05884372]
[-0.3477073 -0.15120703 0.7333025 0.418491 -0.07980055 -0.21007833
-0.1924745 0.20504965 0.11737184]
[ 0.01270523 0.15124948 0.4014033 -0.568793 0.4513449 -0.03860948
-0.39513308 -0.36090007 -0.02702253]]
(2, 9) float32 [[0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0.]]
Now, create a pytorch gru and check its weights
gru_torch = nn.GRU(
    input_size=input_size,
    hidden_size=hidden_size,
    bias=True,
    batch_first=True,
    num_layers=1,
    dropout=0.0,
    bidirectional=False,
    dtype=tt.float32
)
sd_gru_torch = gru_torch.state_dict()
for k, v in sd_gru_torch.items():
    print(f'{k}, {v.shape}')
output:
weight_ih_l0, torch.Size([9, 2])
weight_hh_l0, torch.Size([9, 3])
bias_ih_l0, torch.Size([9])
bias_hh_l0, torch.Size([9])
Clearly, pytorch implements two separate biases, bias_ih_l0 and bias_hh_l0, while keras has one stacked bias (which is zero here). Now I copied the weights accordingly.
with tt.no_grad():
    gru_torch.get_parameter('weight_ih_l0').copy_(tt.tensor(gru_keras_weights[0].T))
    gru_torch.get_parameter('weight_hh_l0').copy_(tt.tensor(gru_keras_weights[1].T))
    gru_torch.get_parameter('bias_ih_l0').copy_(tt.tensor(gru_keras_weights[2].T[:, 0]))
    gru_torch.get_parameter('bias_hh_l0').copy_(tt.tensor(gru_keras_weights[2].T[:, 1]))
... and check the parameters again
with tt.no_grad():
    for p in gru_torch.parameters():
        print(p.shape, p)
output:
torch.Size([9, 2]) Parameter containing:
tensor([[ 0.3825,  0.0833],
        [-0.6773,  0.2752],
        [-0.2886,  0.4721],
        [ 0.3082, -0.7370],
        [-0.5303, -0.2092],
        [ 0.1531,  0.3818],
        [ 0.0944,  0.2302],
        [ 0.2978,  0.3987],
        [ 0.1517,  0.5723]], requires_grad=True)
torch.Size([9, 3]) Parameter containing:
tensor([[-0.2145, -0.3477,  0.0127],
        [ 0.2945, -0.1512,  0.1512],
        [-0.2576,  0.7333,  0.4014],
        [ 0.2292,  0.4185, -0.5688],
        [ 0.3517, -0.0798,  0.4513],
        [ 0.4202, -0.2101, -0.0386],
        [-0.5116, -0.1925, -0.3951],
        [ 0.4276,  0.2050, -0.3609],
        [-0.0588,  0.1174, -0.0270]], requires_grad=True)
torch.Size([9]) Parameter containing:
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0.], requires_grad=True)
torch.Size([9]) Parameter containing:
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0.], requires_grad=True)
Now the weights match exactly. Take a forward pass through torch with the same data:
with tt.no_grad():
    out_gru_torch, _ = gru_torch(tt.tensor(xx))
    print(out_gru_torch.shape, out_gru_torch)
output:
torch.Size([1, 4, 3]) tensor([[[0.0638, 0.2232, 0.1145],
         [0.0553, 0.2741, 0.1618],
         [0.0306, 0.2399, 0.1644],
         [0.1231, 0.3970, 0.3153]]])
But it can be seen that the outputs differ in value:
out_gru_torch.numpy() - out_gru_keras.numpy()
output:
array([[[ 0.00813178, -0.00323714, -0.02592391],
        [ 0.0050362 , -0.000976  , -0.01662555],
        [ 0.00321813,  0.0024882 , -0.00869839],
        [ 0.0180769 ,  0.01671427, -0.00050169]]], dtype=float32)
Summing up the absolute differences:
np.sum(np.abs(out_gru_torch.numpy() - out_gru_keras.numpy()))
output:
0.10962818
Why is there such a significant difference?
I also tried the same thing with LSTMs and found that the difference was insignificant, which could perhaps be attributed to floating point rounding errors. Here is some code.
Make a keras LSTM
lstm_keras = layers.LSTM(hidden_size, return_sequences=True)
out_lstm_keras = lstm_keras(xx)
out_lstm_keras
output:
<tf.Tensor: shape=(1, 4, 3), dtype=float32, numpy=
array([[[-0.12176377, -0.07746243, -0.08807365],
        [-0.17760691, -0.11547467, -0.12406464],
        [-0.16980645, -0.1159803 , -0.12289675],
        [-0.16330168, -0.08463871, -0.18625976]]], dtype=float32)>
check keras LSTM weights
lstm_keras_weights = lstm_keras.get_weights()
print(f'{len(lstm_keras_weights)=}')
for w in lstm_keras_weights:
    print(w.shape, w.dtype, w)
output:
len(lstm_keras_weights)=3
(2, 12) float32 [[-0.18769234 0.6526979 0.27196562 -0.23817068 -0.05964065 -0.11090988
-0.6442989 -0.4168117 -0.344454 0.12466687 -0.6536666 -0.28540143]
[-0.35713544 0.34027737 0.09951967 -0.21514818 0.47551024 0.305395
0.16330504 0.22410381 -0.13371867 0.21646535 0.01366949 0.4818431 ]]
(3, 12) float32 [[-0.22128499 -0.17296375 0.03671373 0.16226508 0.19011612 -0.41836154
-0.5816412 -0.32847112 -0.31468534 -0.27402246 0.05426207 0.24291728]
[ 0.4650536 -0.57491106 0.01105271 -0.0380749 0.2271702 0.39930764
0.11620218 -0.19071549 -0.30224687 -0.13937864 -0.27111995 0.08010413]
[ 0.45147026 0.3288076 -0.37750363 0.35117835 -0.31541684 -0.1335725
-0.0910389 -0.24736843 0.03350063 -0.27691114 -0.28898126 -0.27222085]]
(12,) float32 [0. 0. 0. 1. 1. 1. 0. 0. 0. 0. 0. 0.]
Now create a pytorch LSTM and check its weights
lstm_torch = nn.LSTM(
    input_size=input_size,
    hidden_size=hidden_size,
    bias=True,
    batch_first=True,
    num_layers=1,
    dropout=0.0,
    bidirectional=False,
    dtype=tt.float32
)
sd_lstm_torch = lstm_torch.state_dict()
for k, v in sd_lstm_torch.items():
    print(f'{k}, {v.shape}')
output:
weight_ih_l0, torch.Size([12, 2])
weight_hh_l0, torch.Size([12, 3])
bias_ih_l0, torch.Size([12])
bias_hh_l0, torch.Size([12])
Note that the keras LSTM implements only one bias (not a stack of two biases as in the GRU), so I copy the keras bias vector into one of the pytorch biases and zero out the other. It doesn't matter much which one is chosen; here I copy into bias_ih_l0 and zero out bias_hh_l0.
with tt.no_grad():
    lstm_torch.get_parameter('weight_ih_l0').copy_(tt.tensor(lstm_keras_weights[0].T))
    lstm_torch.get_parameter('weight_hh_l0').copy_(tt.tensor(lstm_keras_weights[1].T))
    lstm_torch.get_parameter('bias_ih_l0').copy_(tt.tensor(lstm_keras_weights[2]))
    lstm_torch.get_parameter('bias_hh_l0').copy_(tt.zeros(lstm_keras_weights[2].shape))
Checking the parameters again
with tt.no_grad():
    for p in lstm_torch.parameters():
        print(p.shape, p)
output:
torch.Size([12, 2]) Parameter containing:
tensor([[-0.1877, -0.3571],
        [ 0.6527,  0.3403],
        [ 0.2720,  0.0995],
        [-0.2382, -0.2151],
        [-0.0596,  0.4755],
        [-0.1109,  0.3054],
        [-0.6443,  0.1633],
        [-0.4168,  0.2241],
        [-0.3445, -0.1337],
        [ 0.1247,  0.2165],
        [-0.6537,  0.0137],
        [-0.2854,  0.4818]], requires_grad=True)
torch.Size([12, 3]) Parameter containing:
tensor([[-0.2213,  0.4651,  0.4515],
        [-0.1730, -0.5749,  0.3288],
        [ 0.0367,  0.0111, -0.3775],
        [ 0.1623, -0.0381,  0.3512],
        [ 0.1901,  0.2272, -0.3154],
        [-0.4184,  0.3993, -0.1336],
        [-0.5816,  0.1162, -0.0910],
        [-0.3285, -0.1907, -0.2474],
        [-0.3147, -0.3022,  0.0335],
        [-0.2740, -0.1394, -0.2769],
        [ 0.0543, -0.2711, -0.2890],
        [ 0.2429,  0.0801, -0.2722]], requires_grad=True)
torch.Size([12]) Parameter containing:
tensor([0., 0., 0., 1., 1., 1., 0., 0., 0., 0., 0., 0.], requires_grad=True)
torch.Size([12]) Parameter containing:
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], requires_grad=True)
Now do a forward pass through pytorch LSTM with the same data
with tt.no_grad():
    out_lstm_torch, _ = lstm_torch(tt.tensor(xx))
    print(out_lstm_torch.shape, out_lstm_torch)
output:
torch.Size([1, 4, 3]) tensor([[[-0.1218, -0.0775, -0.0881],
         [-0.1776, -0.1155, -0.1241],
         [-0.1698, -0.1160, -0.1229],
         [-0.1633, -0.0846, -0.1863]]])
Summing the absolute differences:
np.sum(np.abs(out_lstm_torch.numpy() - out_lstm_keras.numpy()))
output:
2.7567148e-07
The difference is far smaller. I also found that both pytorch and keras assume all-zero initial hidden states, so I don't need to set them manually.

batch axis in keras custom layer

I want to make a custom layer that does the following, given a batch of input vectors.
For each vector a in the batch:
get the first element a[0].
multiply the vector a by a[0] elementwise.
So if the batch is
[[ 1., 2., 3.],
[ 4., 5., 6.],
[ 7., 8., 9.],
[10., 11., 12.]]
This should be a batch of 4 vectors, each with dimension 3 (or am I wrong here?).
Then my layer should transform the batch to the following:
[[ 1., 2., 3.],
[ 16., 20., 24.],
[ 49., 56., 63.],
[100., 110., 120.]]
Here is my implementation for the layer:
class MyLayer(keras.layers.Layer):
    def __init__(self, activation=None, **kwargs):
        super().__init__(**kwargs)
        self.activation = keras.activations.get(activation)
    def call(self, a):
        scale = a[0]
        return self.activation(a * scale)
    def get_config(self):
        base_config = super().get_config()
        return {**base_config,
                "activation": keras.activations.serialize(self.activation)}
But the output is different from what I expected:
batch = tf.Variable([[1, 2, 3],
                     [4, 5, 6],
                     [7, 8, 9],
                     [10, 11, 12]], dtype=tf.float32)
layer = MyLayer()
print(layer(batch))
Output:
tf.Tensor(
[[ 1.  4.  9.]
 [ 4. 10. 18.]
 [ 7. 16. 27.]
 [10. 22. 36.]], shape=(4, 3), dtype=float32)
It looks like the implementation actually treats each column as a vector, which is strange to me because other pre-written models, such as the sequential model, specify the input shape as (batch_size, ...), which means each row, not each column, is a vector.
How should I modify my code so that it behaves the way I want?
Actually, your input has shape (4,3), so slicing with a[0] takes the first row, [1,2,3], which is then broadcast against every row of the batch. To get what you want, you should instead take the first column as the scale, multiply the transposed matrix by it, and transpose back, like this:
def call(self, a):
    scale = a[:, 0]
    return tf.transpose(self.activation(tf.transpose(a) * scale))
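As a side note, a minimal sketch of an equivalent variant (my alternative, not required) that avoids the double transpose: slice with 0:1 so the scale keeps shape (batch, 1) and broadcasts row-wise:
def call(self, a):
    scale = a[:, 0:1]  # shape (batch, 1) instead of (batch,)
    return self.activation(a * scale)  # (batch, n) * (batch, 1) scales each row by its first element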

T5 Encoder model output all zeros?

I am trying out a project where I use the T5EncoderModel from HuggingFace in order to obtain hidden representations of my input sentences. I have 100K sentences which I tokenize and pad as follows:
for sentence in dataset[original]:
    sentence = tokenizer(sentence, max_length=40, padding='max_length', return_tensors='tf', truncation=True)
    original_sentences.append(sentence.input_ids)
    org_mask.append(sentence.attention_mask)
This gives me the right outputs and tokenizes everything decently. The problem arises when I try to actually train the model. The setup is a bit complex and is taken from https://keras.io/examples/vision/semantic_image_clustering/, which I am trying to apply to text.
The set-up for training is as follows:
def create_encoder(rep_dim):
    encoder = TFT5EncoderModel.from_pretrained('t5-small', output_hidden_states=True)
    encoder.trainable = True
    original_input = Input(shape=(max_length), name='originalIn', dtype=tf.int32)
    augmented_input = Input(shape=(max_length), name='originalIn', dtype=tf.int32)
    concat = keras.layers.Concatenate(axis=1)([original_input, augmented_input])
    # Take 0-index because it returns a TFBERTmodel type, and 0 returns a tensor
    encoded = encoder(input_ids=concat)[0]
    # This outputs shape: [sentences, max_length, encoded_dims]
    output = Dense(rep_dim, activation='relu')(encoded)
    return encoder
This function is fed into the RepresentationLearner class from the above link, as such:
class RepresentationLearner(keras.Model):
    def __init__(
        self,
        encoder,
        projection_units,
        temperature=0.8,
        dropout_rate=0.1,
        l2_normalize=False,
        **kwargs
    ):
        super(RepresentationLearner, self).__init__(**kwargs)
        self.encoder = encoder
        # Create projection head.
        self.projector = keras.Sequential(
            [
                layers.Dropout(dropout_rate),
                layers.Dense(units=projection_units, use_bias=False),
                layers.BatchNormalization(),
                layers.ReLU(),
            ]
        )
        self.temperature = temperature
        self.l2_normalize = l2_normalize
        self.loss_tracker = keras.metrics.Mean(name="loss")
    @property
    def metrics(self):
        return [self.loss_tracker]
    def compute_contrastive_loss(self, feature_vectors, batch_size):
        num_augmentations = tf.shape(feature_vectors)[0] // batch_size
        if self.l2_normalize:
            feature_vectors = tf.math.l2_normalize(feature_vectors, -1)
        # The logits shape is [num_augmentations * batch_size, num_augmentations * batch_size].
        logits = (
            tf.linalg.matmul(feature_vectors, feature_vectors, transpose_b=True)
            / self.temperature
        )
        # Apply log-max trick for numerical stability.
        logits_max = tf.math.reduce_max(logits, axis=1)
        logits = logits - logits_max
        # The shape of targets is [num_augmentations * batch_size, num_augmentations * batch_size].
        # targets is a matrix consisting of num_augmentations submatrices of shape [batch_size * batch_size].
        # Each [batch_size * batch_size] submatrix is an identity matrix (diagonal entries are ones).
        targets = tf.tile(tf.eye(batch_size), [num_augmentations, num_augmentations])
        # Compute cross entropy loss
        return keras.losses.categorical_crossentropy(
            y_true=targets, y_pred=logits, from_logits=True
        )
    def call(self, inputs):
        features = self.encoder(inputs[0])[0]
        # Apply projection head.
        return self.projector(features[0])
    def train_step(self, inputs):
        batch_size = tf.shape(inputs)[0]
        # Run the forward pass and compute the contrastive loss
        with tf.GradientTape() as tape:
            feature_vectors = self(inputs, training=True)
            loss = self.compute_contrastive_loss(feature_vectors, batch_size)
        # Compute gradients
        trainable_vars = self.trainable_variables
        gradients = tape.gradient(loss, trainable_vars)
        # Update weights
        self.optimizer.apply_gradients(zip(gradients, trainable_vars))
        # Update loss tracker metric
        self.loss_tracker.update_state(loss)
        # Return a dict mapping metric names to current value
        return {m.name: m.result() for m in self.metrics}
    def test_step(self, inputs):
        batch_size = tf.shape(inputs)[0]
        feature_vectors = self(inputs, training=False)
        loss = self.compute_contrastive_loss(feature_vectors, batch_size)
        self.loss_tracker.update_state(loss)
        return {"loss": self.loss_tracker.result()}
In order to train it, I use the Colab TPU and train it as such:
with strategy.scope():
    encoder = create_encoder(rep_dim)
    training_model = RepresentationLearner(encoder=encoder, projection_units=128, temperature=0.1)
    lr_scheduler = keras.experimental.CosineDecay(initial_learning_rate=0.001, decay_steps=500, alpha=0.1)
    training_model.compile(optimizer=tfa.optimizers.AdamW(learning_rate=lr_scheduler, weight_decay=0.0001))
    history = training_model.fit(x=[original_train, augmented_train], batch_size=32 * 8, epochs=10)
    training_model.save_weights('representation_learner.h5', overwrite=True)
Note that I am giving my model two inputs. When I predict on my input data, I get all zeros, and I cannot seem to understand why. I predict as follows:
training_model.load_weights('representation_learner.h5')
feature_vectors= training_model.predict([[original_train, augmented_train]], verbose = 1)
And the output is:
array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)
with a far too large shape of (1000000, 128).

How are the gradient and hessian of logarithmic loss computed in the custom objective function example script in xgboost's github repository?

I would like to understand how the gradient and hessian of the logloss function are computed in an xgboost sample script.
I've simplified the function to take numpy arrays, and generated y_hat and y_true which are a sample of the values used in the script.
Here is a simplified example:
import numpy as np
def loglikelihoodloss(y_hat, y_true):
    prob = 1.0 / (1.0 + np.exp(-y_hat))
    grad = prob - y_true
    hess = prob * (1.0 - prob)
    return grad, hess
y_hat = np.array([1.80087972, -1.82414818, -1.82414818, 1.80087972, -2.08465433,
                  -1.82414818, -1.82414818, 1.80087972, -1.82414818, -1.82414818])
y_true = np.array([1., 0., 0., 1., 0., 0., 0., 1., 0., 0.])
loglikelihoodloss(y_hat, y_true)
The log loss function is the sum over samples of $-\left(y\log(p) + (1-y)\log(1-p)\right)$, where $p = \frac{1}{1+e^{-\hat{y}}}$.
The gradient (with respect to $p$) is then $-\frac{y}{p} + \frac{1-y}{1-p}$, however in the code it is $p - y$.
Likewise the second derivative (with respect to $p$) is $\frac{y}{p^2} + \frac{1-y}{(1-p)^2}$, however in the code it is $p(1-p)$.
How are the equations equal?
The log loss function is given as
$$L = -\sum_i \left(y_i\log(p_i) + (1-y_i)\log(1-p_i)\right)$$
where
$$p_i = \frac{1}{1+e^{-\hat{y}_i}}$$
The key point is that xgboost differentiates with respect to the raw score $\hat{y}$, not with respect to $p$. Using $\frac{\partial p}{\partial \hat{y}} = p(1-p)$ and the chain rule, taking the partial derivative we get the gradient of the log likelihood as
$$\frac{\partial}{\partial \hat{y}}\left(y\log(p) + (1-y)\log(1-p)\right) = y(1-p) - (1-y)p = y - p$$
Thus we get the negative of the gradient as $p - y$.
Similar calculations can be done to obtain the hessian $p(1-p)$.
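As a quick numerical sanity check (my addition, not part of the original answer), finite differences of the loss with respect to y_hat confirm that grad = p - y and hess = p(1 - p):
import numpy as np
def logloss(y_hat, y_true):
    p = 1.0 / (1.0 + np.exp(-y_hat))
    return -(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))
y_hat = np.array([1.80087972, -1.82414818])
y_true = np.array([1., 0.])
eps = 1e-5
# central finite differences of the per-sample loss w.r.t. y_hat
num_grad = (logloss(y_hat + eps, y_true) - logloss(y_hat - eps, y_true)) / (2 * eps)
num_hess = (logloss(y_hat + eps, y_true) - 2 * logloss(y_hat, y_true)
            + logloss(y_hat - eps, y_true)) / eps**2
p = 1.0 / (1.0 + np.exp(-y_hat))
print(np.allclose(num_grad, p - y_true))                # True: gradient w.r.t. y_hat is p - y
print(np.allclose(num_hess, p * (1.0 - p), atol=1e-4))  # True: hessian w.r.t. y_hat is p(1 - p)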

numpy divide along axis

Is there a numpy function to divide an array along an axis with elements from another array? For example, suppose I have an array a with shape (l,m,n) and an array b with shape (m,); I'm looking for something equivalent to:
def divide_along_axis(a, b, axis=None):
    if axis is None:
        return a / b
    c = a.copy()
    for i, x in enumerate(c.swapaxes(0, axis)):
        x /= b[i]
    return c
For example, this is useful when normalizing an array of vectors:
>>> a = np.random.randn(4,3)
array([[ 1.03116167, -0.60862215, -0.29191449],
       [-1.27040355,  1.9943905 ,  1.13515384],
       [-0.47916874,  0.05495749, -0.58450632],
       [ 2.08792161, -1.35591814, -0.9900364 ]])
>>> np.apply_along_axis(np.linalg.norm,1,a)
array([ 1.23244853, 2.62299312, 0.75780647, 2.67919815])
>>> c = divide_along_axis(a,np.apply_along_axis(np.linalg.norm,1,a),0)
>>> np.apply_along_axis(np.linalg.norm,1,c)
array([ 1., 1., 1., 1.])
For the specific example you've given: dividing an (l,m,n) array by (m,) you can use np.newaxis:
a = np.arange(1,61, dtype=float).reshape((3,4,5)) # Create a 3d array
a.shape # (3,4,5)
b = np.array([1.0, 2.0, 3.0, 4.0]) # Create a 1-d array
b.shape # (4,)
a / b # Gives a ValueError
a / b[:, np.newaxis] # The result you want
You can read all about the broadcasting rules here. You can also use newaxis more than once if required. (e.g. to divide a shape (3,4,5,6) array by a shape (3,5) array).
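For instance, a small sketch of that last case (the shapes are hypothetical, purely to illustrate placing newaxis twice):
a = np.ones((3, 4, 5, 6))
b = np.ones((3, 5))
c = a / b[:, np.newaxis, :, np.newaxis]  # b broadcast from (3, 5) to (3, 1, 5, 1)
print(c.shape)  # (3, 4, 5, 6)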
From my understanding of the docs, using newaxis + broadcasting also avoids any unnecessary array copying.
Indexing, newaxis, etc. are described more fully here now. (The documentation has been reorganised since this answer was first posted.)
I think you can get this with numpy's usual broadcasting rules:
In [9]: a = np.array([[1., 2.], [3., 4.]])
In [10]: a / np.sum(a, axis=0)
Out[10]:
array([[ 0.25      ,  0.33333333],
       [ 0.75      ,  0.66666667]])
If I've interpreted correctly.
If you want the other axis you could transpose everything:
> a = np.random.randn(4,3).transpose()
> norms = np.apply_along_axis(np.linalg.norm,0,a)
> c = a / norms
> np.apply_along_axis(np.linalg.norm,0,c)
array([ 1., 1., 1., 1.])
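For the vector-normalization example specifically, a one-liner sketch using the axis and keepdims arguments of np.linalg.norm (available in modern numpy) avoids apply_along_axis entirely:
a = np.random.randn(4, 3)
c = a / np.linalg.norm(a, axis=1, keepdims=True)  # norms kept as shape (4, 1) for broadcasting
print(np.linalg.norm(c, axis=1))  # [1. 1. 1. 1.]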