After implementing my own GRU cell, I tried to validate it against the default implementations available in PyTorch and Keras. My implementation was very close to PyTorch but significantly different from Keras. So I first compared the PyTorch and Keras implementations against each other and found that they differ significantly. Here is some code:
import numpy as np
import torch as tt
import torch.nn as nn
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
# data generation
input_size = 2
seq_len = 4
hidden_size = 3
batch_size=1
rng = np.random.default_rng(10)
xx = rng.uniform(size=(batch_size, seq_len, input_size)).astype(np.float32)
print(xx.shape)
output:
(1, 4, 2)
First, create a Keras GRU layer and do one forward pass to get the output:
gru_keras = layers.GRU(hidden_size, return_sequences=True)
out_gru_keras = gru_keras(xx)
out_gru_keras
output:
<tf.Tensor: shape=(1, 4, 3), dtype=float32, numpy=
array([[[0.05563376, 0.22639018, 0.14037813],
[0.05021746, 0.27509487, 0.17843583],
[0.0273784 , 0.23740683, 0.1731183 ],
[0.10505023, 0.38027072, 0.31583947]]], dtype=float32)>
Then check the weights in keras gru
gru_keras_weights = gru_keras.get_weights()
print(f'{len(gru_keras_weights)=}')
for w in gru_keras_weights:
    print(w.shape, w.dtype, w)
output:
len(gru_keras_weights)=3
(2, 9) float32 [[ 0.38249677 -0.67729133 -0.28855678 0.3081903 -0.530349 0.1531434
0.09444886 0.2978403 0.1516701 ]
[ 0.0833146 0.27516943 0.4720915 -0.7370237 -0.20921749 0.38180763
0.23018956 0.39872426 0.5722596 ]]
(3, 9) float32 [[-0.21449113 0.2944518 -0.25759113 0.2292317 0.35174483 0.42021522
-0.5116475 0.42759803 -0.05884372]
[-0.3477073 -0.15120703 0.7333025 0.418491 -0.07980055 -0.21007833
-0.1924745 0.20504965 0.11737184]
[ 0.01270523 0.15124948 0.4014033 -0.568793 0.4513449 -0.03860948
-0.39513308 -0.36090007 -0.02702253]]
(2, 9) float32 [[0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0.]]
Now, create a pytorch gru and check its weights
gru_torch = nn.GRU(
    input_size=input_size,
    hidden_size=hidden_size,
    bias=True,
    batch_first=True,
    num_layers=1,
    dropout=0.0,
    bidirectional=False,
    dtype=tt.float32
)
sd_gru_torch = gru_torch.state_dict()
for k, v in sd_gru_torch.items():
    print(f'{k}, {v.shape}')
output:
weight_ih_l0, torch.Size([9, 2])
weight_hh_l0, torch.Size([9, 3])
bias_ih_l0, torch.Size([9])
bias_hh_l0, torch.Size([9])
Clearly, PyTorch keeps two separate biases, bias_ih_l0 and bias_hh_l0, while Keras stores both in a single stacked bias of shape (2, 9) (all zeros here). I copied the weights accordingly:
with tt.no_grad():
    gru_torch.get_parameter('weight_ih_l0').copy_(tt.tensor(gru_keras_weights[0].T))
    gru_torch.get_parameter('weight_hh_l0').copy_(tt.tensor(gru_keras_weights[1].T))
    gru_torch.get_parameter('bias_ih_l0').copy_(tt.tensor(gru_keras_weights[2].T[:,0]))
    gru_torch.get_parameter('bias_hh_l0').copy_(tt.tensor(gru_keras_weights[2].T[:,1]))
... and check the parameters again
with tt.no_grad():
    for p in gru_torch.parameters():
        print(p.shape, p)
output:
torch.Size([9, 2]) Parameter containing:
tensor([[ 0.3825, 0.0833],
[-0.6773, 0.2752],
[-0.2886, 0.4721],
[ 0.3082, -0.7370],
[-0.5303, -0.2092],
[ 0.1531, 0.3818],
[ 0.0944, 0.2302],
[ 0.2978, 0.3987],
[ 0.1517, 0.5723]], requires_grad=True)
torch.Size([9, 3]) Parameter containing:
tensor([[-0.2145, -0.3477, 0.0127],
[ 0.2945, -0.1512, 0.1512],
[-0.2576, 0.7333, 0.4014],
[ 0.2292, 0.4185, -0.5688],
[ 0.3517, -0.0798, 0.4513],
[ 0.4202, -0.2101, -0.0386],
[-0.5116, -0.1925, -0.3951],
[ 0.4276, 0.2050, -0.3609],
[-0.0588, 0.1174, -0.0270]], requires_grad=True)
torch.Size([9]) Parameter containing:
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0.], requires_grad=True)
torch.Size([9]) Parameter containing:
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0.], requires_grad=True)
Now the weights match exactly. Next, take a forward pass through the PyTorch GRU with the same data:
with tt.no_grad():
    out_gru_torch, _ = gru_torch(tt.tensor(xx))
print(out_gru_torch.shape, out_gru_torch)
output:
torch.Size([1, 4, 3]) tensor([[[0.0638, 0.2232, 0.1145],
[0.0553, 0.2741, 0.1618],
[0.0306, 0.2399, 0.1644],
[0.1231, 0.3970, 0.3153]]])
But it can be seen that the outputs are different in value
out_gru_torch.numpy() - out_gru_keras.numpy()
output:
array([[[ 0.00813178, -0.00323714, -0.02592391],
[ 0.0050362 , -0.000976 , -0.01662555],
[ 0.00321813, 0.0024882 , -0.00869839],
[ 0.0180769 , 0.01671427, -0.00050169]]], dtype=float32)
Summing up the absolute differences:
np.sum(np.abs(out_gru_torch.numpy() - out_gru_keras.numpy()))
output:
0.10962818
Why is there such a significant difference?
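One likely cause worth checking is gate ordering. As far as I know, Keras lays out the GRU kernel columns as [update z | reset r | candidate h], while PyTorch lays out the rows of weight_ih_l0 and weight_hh_l0 as [reset r | update z | new n], so a plain transpose pairs the wrong gates. The sketch below reorders the blocks with NumPy only. The helper name keras_gru_to_torch is hypothetical, and it assumes reset_after=True (which the (2, 9) bias shape above suggests).

```python
import numpy as np

def keras_gru_to_torch(kernel, recurrent_kernel, bias):
    """Hypothetical helper: map Keras GRU weights to the nn.GRU layout.

    Assumes reset_after=True, so bias has shape (2, 3 * hidden_size).
    """
    def reorder(w):
        # Keras columns are laid out as [z | r | h]; PyTorch expects [r | z | n].
        z, r, h = np.split(w, 3, axis=-1)
        return np.concatenate([r, z, h], axis=-1)

    weight_ih = reorder(kernel).T            # (3 * hidden, input)
    weight_hh = reorder(recurrent_kernel).T  # (3 * hidden, hidden)
    bias_ih, bias_hh = reorder(bias)         # each of shape (3 * hidden,)
    return weight_ih, weight_hh, bias_ih, bias_hh

# Shape check with dummy data (input_size=2, hidden_size=3)
kernel = np.arange(18, dtype=np.float32).reshape(2, 9)
recurrent_kernel = np.arange(27, dtype=np.float32).reshape(3, 9)
bias = np.zeros((2, 9), dtype=np.float32)
w_ih, w_hh, b_ih, b_hh = keras_gru_to_torch(kernel, recurrent_kernel, bias)
print(w_ih.shape, w_hh.shape, b_ih.shape, b_hh.shape)
```

The returned arrays could then be copied into gru_torch in place of the plain transposes used above. The LSTM case would not need this step if, as I understand it, both libraries order the LSTM gates as input, forget, cell, output.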
I also tried the same thing with LSTMs and found that the difference was insignificant, which can probably be attributed to floating-point rounding error. Here is some code.
Make a Keras LSTM:
lstm_keras = layers.LSTM(hidden_size, return_sequences=True)
out_lstm_keras = lstm_keras(xx)
out_lstm_keras
output:
<tf.Tensor: shape=(1, 4, 3), dtype=float32, numpy=
array([[[-0.12176377, -0.07746243, -0.08807365],
[-0.17760691, -0.11547467, -0.12406464],
[-0.16980645, -0.1159803 , -0.12289675],
[-0.16330168, -0.08463871, -0.18625976]]], dtype=float32)>
check keras LSTM weights
lstm_keras_weights = lstm_keras.get_weights()
print(f'{len(lstm_keras_weights)=}')
for w in lstm_keras_weights:
    print(w.shape, w.dtype, w)
output:
len(lstm_keras_weights)=3
(2, 12) float32 [[-0.18769234 0.6526979 0.27196562 -0.23817068 -0.05964065 -0.11090988
-0.6442989 -0.4168117 -0.344454 0.12466687 -0.6536666 -0.28540143]
[-0.35713544 0.34027737 0.09951967 -0.21514818 0.47551024 0.305395
0.16330504 0.22410381 -0.13371867 0.21646535 0.01366949 0.4818431 ]]
(3, 12) float32 [[-0.22128499 -0.17296375 0.03671373 0.16226508 0.19011612 -0.41836154
-0.5816412 -0.32847112 -0.31468534 -0.27402246 0.05426207 0.24291728]
[ 0.4650536 -0.57491106 0.01105271 -0.0380749 0.2271702 0.39930764
0.11620218 -0.19071549 -0.30224687 -0.13937864 -0.27111995 0.08010413]
[ 0.45147026 0.3288076 -0.37750363 0.35117835 -0.31541684 -0.1335725
-0.0910389 -0.24736843 0.03350063 -0.27691114 -0.28898126 -0.27222085]]
(12,) float32 [0. 0. 0. 1. 1. 1. 0. 0. 0. 0. 0. 0.]
Now create a pytorch LSTM and check its weights
lstm_torch = nn.LSTM(
    input_size=input_size,
    hidden_size=hidden_size,
    bias=True,
    batch_first=True,
    num_layers=1,
    dropout=0.0,
    bidirectional=False,
    dtype=tt.float32
)
sd_lstm_torch = lstm_torch.state_dict()
for k, v in sd_lstm_torch.items():
    print(f'{k}, {v.shape}')
output:
weight_ih_l0, torch.Size([12, 2])
weight_hh_l0, torch.Size([12, 3])
bias_ih_l0, torch.Size([12])
bias_hh_l0, torch.Size([12])
Note that the Keras LSTM has only one bias vector (not a stack of two as in the GRU), so I chose one of the PyTorch LSTM biases, copied the Keras bias vector to it, and set the other to zero. Which bias receives the copy makes no difference, since the two are summed. Here I copy to bias_ih_l0 and zero out bias_hh_l0.
with tt.no_grad():
    lstm_torch.get_parameter('weight_ih_l0').copy_(tt.tensor(lstm_keras_weights[0].T))
    lstm_torch.get_parameter('weight_hh_l0').copy_(tt.tensor(lstm_keras_weights[1].T))
    lstm_torch.get_parameter('bias_ih_l0').copy_(tt.tensor(lstm_keras_weights[2]))
    lstm_torch.get_parameter('bias_hh_l0').copy_(tt.zeros(lstm_keras_weights[2].shape))
Checking the parameters again
with tt.no_grad():
    for p in lstm_torch.parameters():
        print(p.shape, p)
output:
torch.Size([12, 2]) Parameter containing:
tensor([[-0.1877, -0.3571],
[ 0.6527, 0.3403],
[ 0.2720, 0.0995],
[-0.2382, -0.2151],
[-0.0596, 0.4755],
[-0.1109, 0.3054],
[-0.6443, 0.1633],
[-0.4168, 0.2241],
[-0.3445, -0.1337],
[ 0.1247, 0.2165],
[-0.6537, 0.0137],
[-0.2854, 0.4818]], requires_grad=True)
torch.Size([12, 3]) Parameter containing:
tensor([[-0.2213, 0.4651, 0.4515],
[-0.1730, -0.5749, 0.3288],
[ 0.0367, 0.0111, -0.3775],
[ 0.1623, -0.0381, 0.3512],
[ 0.1901, 0.2272, -0.3154],
[-0.4184, 0.3993, -0.1336],
[-0.5816, 0.1162, -0.0910],
[-0.3285, -0.1907, -0.2474],
[-0.3147, -0.3022, 0.0335],
[-0.2740, -0.1394, -0.2769],
[ 0.0543, -0.2711, -0.2890],
[ 0.2429, 0.0801, -0.2722]], requires_grad=True)
torch.Size([12]) Parameter containing:
tensor([0., 0., 0., 1., 1., 1., 0., 0., 0., 0., 0., 0.], requires_grad=True)
torch.Size([12]) Parameter containing:
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], requires_grad=True)
Now do a forward pass through pytorch LSTM with the same data
with tt.no_grad():
    out_lstm_torch, _ = lstm_torch(tt.tensor(xx))
print(out_lstm_torch.shape, out_lstm_torch)
output:
torch.Size([1, 4, 3]) tensor([[[-0.1218, -0.0775, -0.0881],
[-0.1776, -0.1155, -0.1241],
[-0.1698, -0.1160, -0.1229],
[-0.1633, -0.0846, -0.1863]]])
Taking absolute difference
np.sum(np.abs(out_lstm_torch.numpy() - out_lstm_keras.numpy()))
output:
2.7567148e-07
The difference is far smaller here. I also found that both PyTorch and Keras assume the initial hidden states are all zeros, so I don't need to set them manually.
Does anyone know why a raw implementation of categorical cross-entropy gives such different results from the tf.keras API function?
import numpy as np
import tensorflow as tf
# TF 1.x only; eager execution is the default in TF 2.x
tf.enable_eager_execution()
y_true =np.array( [[1., 0., 0.], [0., 1., 0.], [0., 0., 1.]])
y_pred = np.array([[.9, .05, .05], [.5, .89, .6], [.05, .01, .94]])
ce = tf.keras.losses.CategoricalCrossentropy()
res = ce(y_true, y_pred).numpy()
print("use api:")
print(res)
print()
print("implementation:")
step1 = -y_true * np.log(y_pred )
step2 = np.sum(step1, axis=1)
print("step1.shape:", step1.shape)
print(step1)
print("sum step1:", np.sum(step1, ))
print("mean step1", np.mean(step1))
print()
print("step2.shape:", step2.shape)
print(step2)
print("sum step2:", np.sum(step2, ))
print("mean step2", np.mean(step2))
Above gives:
use api:
0.3239681124687195
implementation:
step1.shape: (3, 3)
[[0.10536052 0. 0. ]
[0. 0.11653382 0. ]
[0. 0. 0.0618754 ]]
sum step1: 0.2837697356318653
mean step1 0.031529970625762814
step2.shape: (3,)
[0.10536052 0.11653382 0.0618754 ]
sum step2: 0.2837697356318653
mean step2 0.09458991187728844
With another y_true and y_pred:
y_true = np.array([[0, 1]])
y_pred = np.array([[0.99999999999, 0.00000000001]])
It gives:
use api:
16.11809539794922
implementation:
step1.shape: (1, 2)
[[-0. 25.32843602]]
sum step1: 25.328436022934504
mean step1 12.664218011467252
step2.shape: (1,)
[25.32843602]
sum step2: 25.328436022934504
mean step2 25.328436022934504
The difference comes from the row [.5, .89, .6], whose values do not sum to 1. I think this was a typo and you meant [.05, .89, .06].
If you provide values that sum to 1, both formulas give the same result:
import tensorflow as tf
import numpy as np
y_true = np.array( [[1., 0., 0.], [0., 1., 0.], [0., 0., 1.]])
y_pred = np.array([[.9, .05, .05], [.05, .89, .06], [.05, .01, .94]])
print(tf.keras.losses.categorical_crossentropy(y_true, y_pred).numpy())
print(np.sum(-y_true * np.log(y_pred), axis=1))
#output
#[0.10536052 0.11653382 0.0618754 ]
#[0.10536052 0.11653382 0.0618754 ]
However, let's explore how the loss is calculated when y_pred is not scaled (its values do not sum to 1). If you look at the source code of categorical cross-entropy here, you will see that it scales y_pred so that the class probabilities of each sample sum to 1:
if not from_logits:
    # scale preds so that the class probas of each sample sum to 1
    output /= tf.reduce_sum(output,
                            reduction_indices=len(output.get_shape()) - 1,
                            keep_dims=True)
Since we passed a prediction whose probabilities do not sum to 1, let's see how this operation changes our tensor [.5, .89, .6]:
output = tf.constant([.5, .89, .6])
output /= tf.reduce_sum(output,
                        axis=len(output.get_shape()) - 1,
                        keepdims=True)
print(output.numpy())
# array([0.2512563 , 0.44723618, 0.30150756], dtype=float32)
So the results should match if we pass the scaled y_pred (the output of the operation above) to your own categorical cross-entropy implementation and the unscaled y_pred to the TensorFlow implementation:
y_true =np.array( [[1., 0., 0.], [0., 1., 0.], [0., 0., 1.]])
#unscaled y_pred
y_pred = np.array([[.9, .05, .05], [.5, .89, .6], [.05, .01, .94]])
print(tf.keras.losses.categorical_crossentropy(y_true, y_pred).numpy())
#scaled y_pred (categorical_crossentropy scales above tensor to this internally)
y_pred = np.array([[.9, .05, .05], [0.2512563 , 0.44723618, 0.30150756], [.05, .01, .94]])
print(np.sum(-y_true * np.log(y_pred), axis=1))
Output:
[0.10536052 0.80466845 0.0618754 ]
[0.10536052 0.80466846 0.0618754 ]
Now for your second example: why does it give a different output?
If you check the source code again, you will see this line:
output = tf.clip_by_value(output, epsilon, 1. - epsilon)
which clips values to the range [epsilon, 1 - epsilon]. Your input [0.99999999999, 0.00000000001] becomes [0.9999999, 0.0000001] at this line, so it gives you a different result:
y_true = np.array([[0, 1]])
y_pred = np.array([[0.99999999999, 0.00000000001]])
print(tf.keras.losses.categorical_crossentropy(y_true, y_pred).numpy())
print(np.sum(-y_true * np.log(y_pred), axis=1))
#now let's first clip the values less than epsilon, then compare loss
epsilon=1e-7
y_pred = tf.clip_by_value(y_pred, epsilon, 1. - epsilon)
print(tf.keras.losses.categorical_crossentropy(y_true, y_pred).numpy())
print(np.sum(-y_true * np.log(y_pred), axis=1))
Output:
#results without clipping values
[16.11809565]
[25.32843602]
#results after clipping values if there is a value less than epsilon (1e-7)
[16.11809565]
[16.11809565]
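Putting the two steps together, here is a minimal NumPy sketch of what the non-logits path appears to do: rescale each row, clip, then take the negative log-likelihood. The function name categorical_crossentropy_np is made up for illustration, and epsilon=1e-7 is assumed to match the Keras default used above.

```python
import numpy as np

def categorical_crossentropy_np(y_true, y_pred, epsilon=1e-7):
    """Illustrative re-implementation; not the library's actual code."""
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    # Step 1: rescale each row so the class probabilities sum to 1
    y_pred = y_pred / np.sum(y_pred, axis=-1, keepdims=True)
    # Step 2: clip into [epsilon, 1 - epsilon] to avoid log(0)
    y_pred = np.clip(y_pred, epsilon, 1.0 - epsilon)
    # Step 3: per-sample negative log-likelihood
    return -np.sum(y_true * np.log(y_pred), axis=-1)

# First example from the question (second row does not sum to 1)
loss_a = categorical_crossentropy_np(
    [[1., 0., 0.], [0., 1., 0.], [0., 0., 1.]],
    [[.9, .05, .05], [.5, .89, .6], [.05, .01, .94]],
)
# Second example from the question (a value below epsilon)
loss_b = categorical_crossentropy_np([[0, 1]], [[0.99999999999, 0.00000000001]])

print(loss_a, loss_a.mean())
print(loss_b)
```

The mean of loss_a corresponds to the 0.3239681 reported by tf.keras.losses.CategoricalCrossentropy() in the question, since that class averages the per-sample losses by default.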
I am trying out a project where I use the T5EncoderModel from HuggingFace in order to obtain hidden representations of my input sentences. I have 100K sentences which I tokenize and pad as follows:
for sentence in dataset[original]:
    sentence = tokenizer(sentence, max_length=40, padding='max_length', return_tensors='tf', truncation=True)
    original_sentences.append(sentence.input_ids)
    org_mask.append(sentence.attention_mask)
This gives me the right outputs and tokenizes everything decently. The problem arises when I try to actually train the model. The setup is a bit complex and is taken from https://keras.io/examples/vision/semantic_image_clustering/, which I am trying to apply to text.
The set-up for training is as follows:
def create_encoder(rep_dim):
    encoder = TFT5EncoderModel.from_pretrained('t5-small', output_hidden_states=True)
    encoder.trainable = True
    original_input = Input(shape=(max_length), name = 'originalIn', dtype=tf.int32)
    augmented_input = Input(shape=(max_length), name = 'originalIn', dtype=tf.int32)
    concat = keras.layers.Concatenate(axis=1)([original_input, augmented_input])
    #Take 0-index because it returns a TFBERTmodel type, and 0 returns a tensor
    encoded = encoder(input_ids=concat)[0]
    #This outputs shape: [sentences, max_length, encoded_dims]
    output = Dense(rep_dim, activation='relu')(encoded)
    return encoder
This function is fed into the RepresentationLearner class from the above link as such:
class RepresentationLearner(keras.Model):
    def __init__(
        self,
        encoder,
        projection_units,
        temperature=0.8,
        dropout_rate=0.1,
        l2_normalize=False,
        **kwargs
    ):
        super(RepresentationLearner, self).__init__(**kwargs)
        self.encoder = encoder
        # Create projection head.
        self.projector = keras.Sequential(
            [
                layers.Dropout(dropout_rate),
                layers.Dense(units=projection_units, use_bias=False),
                layers.BatchNormalization(),
                layers.ReLU(),
            ]
        )
        self.temperature = temperature
        self.l2_normalize = l2_normalize
        self.loss_tracker = keras.metrics.Mean(name="loss")

    @property
    def metrics(self):
        return [self.loss_tracker]

    def compute_contrastive_loss(self, feature_vectors, batch_size):
        num_augmentations = tf.shape(feature_vectors)[0] // batch_size
        if self.l2_normalize:
            feature_vectors = tf.math.l2_normalize(feature_vectors, -1)
        # The logits shape is [num_augmentations * batch_size, num_augmentations * batch_size].
        logits = (
            tf.linalg.matmul(feature_vectors, feature_vectors, transpose_b=True)
            / self.temperature
        )
        # Apply log-max trick for numerical stability.
        logits_max = tf.math.reduce_max(logits, axis=1)
        logits = logits - logits_max
        # The shape of targets is [num_augmentations * batch_size, num_augmentations * batch_size].
        # targets is a matrix consisting of num_augmentations submatrices of shape [batch_size * batch_size].
        # Each [batch_size * batch_size] submatrix is an identity matrix (diagonal entries are ones).
        targets = tf.tile(tf.eye(batch_size), [num_augmentations, num_augmentations])
        # Compute cross entropy loss
        return keras.losses.categorical_crossentropy(
            y_true=targets, y_pred=logits, from_logits=True
        )

    def call(self, inputs):
        features = self.encoder(inputs[0])[0]
        # Apply projection head.
        return self.projector(features[0])

    def train_step(self, inputs):
        batch_size = tf.shape(inputs)[0]
        # Run the forward pass and compute the contrastive loss
        with tf.GradientTape() as tape:
            feature_vectors = self(inputs, training=True)
            loss = self.compute_contrastive_loss(feature_vectors, batch_size)
        # Compute gradients
        trainable_vars = self.trainable_variables
        gradients = tape.gradient(loss, trainable_vars)
        # Update weights
        self.optimizer.apply_gradients(zip(gradients, trainable_vars))
        # Update loss tracker metric
        self.loss_tracker.update_state(loss)
        # Return a dict mapping metric names to current value
        return {m.name: m.result() for m in self.metrics}

    def test_step(self, inputs):
        batch_size = tf.shape(inputs)[0]
        feature_vectors = self(inputs, training=False)
        loss = self.compute_contrastive_loss(feature_vectors, batch_size)
        self.loss_tracker.update_state(loss)
        return {"loss": self.loss_tracker.result()}
In order to train it, I use the Colab TPU and train it as such:
with strategy.scope():
    encoder = create_encoder(rep_dim)
    training_model = RepresentationLearner(encoder=encoder, projection_units=128, temperature=0.1)
    lr_scheduler = keras.experimental.CosineDecay(initial_learning_rate=0.001, decay_steps=500, alpha=0.1)
    training_model.compile(optimizer=tfa.optimizers.AdamW(learning_rate=lr_scheduler, weight_decay=0.0001))
    history = training_model.fit(x=[original_train, augmented_train], batch_size=32*8, epochs=10)
training_model.save_weights('representation_learner.h5', overwrite=True)
Note that I am giving my model two inputs. When I predict on my input data, I get all zeros, and I cannot understand why. I predict as follows:
training_model.load_weights('representation_learner.h5')
feature_vectors= training_model.predict([[original_train, augmented_train]], verbose = 1)
And the output is:
array([[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]], dtype=float32)
The output shape, (1000000, 128), is also far too large.
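For reference, the targets matrix built inside compute_contrastive_loss can be inspected in isolation. This is only a NumPy sketch, with np.tile and np.eye standing in for tf.tile and tf.eye, and the sizes chosen arbitrarily for readability.

```python
import numpy as np

# Illustrative sizes only; the training run above uses much larger batches
batch_size = 2
num_augmentations = 2

# A grid of num_augmentations x num_augmentations identity blocks:
# row i marks every view of sample (i % batch_size) as a positive
targets = np.tile(np.eye(batch_size), (num_augmentations, num_augmentations))
print(targets)
```

Each row then has num_augmentations ones, which is worth keeping in mind when reading the loss values, since categorical cross-entropy is normally fed rows that sum to 1.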
I am trying to create a simple input function where the feature data are the numbers 1-10 and the labels are 0 when x < 5, 5 when x = 5, and 10 when x > 5.
Example:
# data
nmbrs = [10., 1., 2., 3., 4., 5., 6. , 7., 8., 9.]
labels = [10., 0., 0., 0., 0., 5., 10., 10., 10., 10.]
# input function
input_fn = tf.estimator.inputs.numpy_input_fn(
    x={'numbers': np.array(nmbrs)}, y=np.array(labels),
    batch_size=batch_size, num_epochs=None, shuffle=True)
The problem I am having is that the nmbrs and labels arrays don't seem to be in the right form. I tried making them 2-D arrays, but that didn't work either. I'm sure I'm doing something really simple wrong here...
EDIT: model and neural net functions
def neural_net(x_dict):
    # TF Estimator input is a dict, in case of multiple inputs
    x = x_dict['numbers']
    # Hidden fully connected layer with 128 neurons
    layer_1 = tf.layers.dense(x, n_hidden_1)
    # Hidden fully connected layer with 128 neurons
    layer_2 = tf.layers.dense(layer_1, n_hidden_2)
    # Output fully connected layer with a neuron for each class
    out_layer = tf.layers.dense(layer_2, num_classes)
    return out_layer

# Define the model function (following TF Estimator Template)
def model_fn(features, labels, mode):
    # Build the neural network
    logits = neural_net(features)
    # Predictions
    pred_classes = tf.argmax(logits, axis=1)
    pred_probas = tf.nn.softmax(logits)
    # If prediction mode, early return
    if mode == tf.estimator.ModeKeys.PREDICT:
        return tf.estimator.EstimatorSpec(mode, predictions=pred_classes)
    # Define loss and optimizer
    loss_op = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(
        logits=logits, labels=tf.cast(labels, dtype=tf.int32)))
    optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
    train_op = optimizer.minimize(loss_op, global_step=tf.train.get_global_step())
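Two things may be worth checking with this setup. A dense layer expects input of rank at least 2, so the features probably need shape (N, 1) rather than (N,). And sparse_softmax_cross_entropy_with_logits expects integer class indices in [0, num_classes), so the raw label values 0, 5 and 10 would need to be mapped to 0, 1 and 2. The sketch below prepares both arrays with NumPy, assuming num_classes is 3; the names features, classes and class_ids are made up for illustration.

```python
import numpy as np

nmbrs = [10., 1., 2., 3., 4., 5., 6., 7., 8., 9.]
labels = [10., 0., 0., 0., 0., 5., 10., 10., 10., 10.]

# One feature per sample: shape (10, 1) instead of (10,)
features = np.array(nmbrs, dtype=np.float32).reshape(-1, 1)

# Map the label values {0, 5, 10} to class indices {0, 1, 2}
classes, class_ids = np.unique(np.array(labels), return_inverse=True)
class_ids = class_ids.astype(np.int32)

print(features.shape)
print(classes)
print(class_ids)
```

These could then be passed as x={'numbers': features} and y=class_ids, and a predicted class index can be mapped back to a label value with classes[index].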
Is there a numpy function to divide an array along an axis with elements from another array? For example, suppose I have an array a with shape (l,m,n) and an array b with shape (m,); I'm looking for something equivalent to:
def divide_along_axis(a,b,axis=None):
    if axis is None:
        return a/b
    c = a.copy()
    for i, x in enumerate(c.swapaxes(0,axis)):
        x /= b[i]
    return c
For example, this is useful when normalizing an array of vectors:
>>> a = np.random.randn(4,3)
>>> a
array([[ 1.03116167, -0.60862215, -0.29191449],
[-1.27040355, 1.9943905 , 1.13515384],
[-0.47916874, 0.05495749, -0.58450632],
[ 2.08792161, -1.35591814, -0.9900364 ]])
>>> np.apply_along_axis(np.linalg.norm,1,a)
array([ 1.23244853, 2.62299312, 0.75780647, 2.67919815])
>>> c = divide_along_axis(a,np.apply_along_axis(np.linalg.norm,1,a),0)
>>> np.apply_along_axis(np.linalg.norm,1,c)
array([ 1., 1., 1., 1.])
For the specific example you've given, dividing an (l,m,n) array by an (m,) array, you can use np.newaxis:
a = np.arange(1,61, dtype=float).reshape((3,4,5)) # Create a 3d array
a.shape # (3,4,5)
b = np.array([1.0, 2.0, 3.0, 4.0]) # Create a 1-d array
b.shape # (4,)
a / b # Gives a ValueError
a / b[:, np.newaxis] # The result you want
You can read all about the broadcasting rules here. You can also use newaxis more than once if required (e.g. to divide a shape (3,4,5,6) array by a shape (3,5) array).
From my understanding of the docs, using newaxis with broadcasting also avoids any unnecessary array copying.
Indexing, newaxis, etc. are now described more fully here. (The documentation has been reorganised since this answer was first posted.)
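The same idea generalises to any axis by reshaping b so that its only non-trivial dimension lines up with the chosen axis of a. This is a sketch rather than a built-in NumPy function; it reuses the question's name divide_along_axis, but requires the axis argument.

```python
import numpy as np

def divide_along_axis(a, b, axis):
    """Divide a by the 1-d array b, matching b against the given axis of a."""
    a = np.asarray(a)
    shape = [1] * a.ndim   # e.g. [1, 1, 1] for a 3-d array
    shape[axis] = -1       # e.g. [1, -1, 1] for axis=1
    return a / np.asarray(b).reshape(shape)

a = np.arange(1, 61, dtype=float).reshape((3, 4, 5))
b = np.array([1.0, 2.0, 3.0, 4.0])
result = divide_along_axis(a, b, axis=1)  # same as a / b[:, np.newaxis]

# newaxis used twice: divide a (3,4,5,6) array by a (3,5) array
x = np.ones((3, 4, 5, 6))
y = np.full((3, 5), 2.0)
z = x / y[:, np.newaxis, :, np.newaxis]
print(result.shape, z.shape)
```

For the normalisation use case in the question, a / np.linalg.norm(a, axis=1, keepdims=True) does the same thing in one step, because keepdims leaves the reduced axis in place for broadcasting.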
I think you can get this behavior with numpy's usual broadcasting behavior:
In [9]: a = np.array([[1., 2.], [3., 4.]])
In [10]: a / np.sum(a, axis=0)
Out[10]:
array([[ 0.25 , 0.33333333],
[ 0.75 , 0.66666667]])
That is, if I've interpreted the question correctly.
If you want the other axis you could transpose everything:
> a = np.random.randn(4,3).transpose()
> norms = np.apply_along_axis(np.linalg.norm,0,a)
> c = a / norms
> np.apply_along_axis(np.linalg.norm,0,c)
array([ 1., 1., 1., 1.])