numpy divide along axis

Is there a numpy function to divide an array along an axis with elements from another array? For example, suppose I have an array a with shape (l,m,n) and an array b with shape (m,); I'm looking for something equivalent to:
def divide_along_axis(a, b, axis=None):
    if axis is None:
        return a / b
    c = a.copy()
    for i, x in enumerate(c.swapaxes(0, axis)):
        x /= b[i]
    return c
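For comparison, here is a broadcasting-based sketch that should be equivalent (assuming numpy >= 1.18, where np.expand_dims accepts a tuple of axes):
import numpy as np

def divide_along_axis_broadcast(a, b, axis=None):
    if axis is None:
        return a / b
    # Give b a singleton dimension on every axis except `axis`,
    # so it broadcasts against a without an explicit loop or copy.
    expand = tuple(i for i in range(a.ndim) if i != axis)
    return a / np.expand_dims(b, axis=expand)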
For example, this is useful when normalizing an array of vectors:
>>> a = np.random.randn(4,3)
>>> a
array([[ 1.03116167, -0.60862215, -0.29191449],
       [-1.27040355,  1.9943905 ,  1.13515384],
       [-0.47916874,  0.05495749, -0.58450632],
       [ 2.08792161, -1.35591814, -0.9900364 ]])
>>> np.apply_along_axis(np.linalg.norm,1,a)
array([ 1.23244853, 2.62299312, 0.75780647, 2.67919815])
>>> c = divide_along_axis(a,np.apply_along_axis(np.linalg.norm,1,a),0)
>>> np.apply_along_axis(np.linalg.norm,1,c)
array([ 1., 1., 1., 1.])

For the specific example you've given, dividing an (l,m,n) array by an (m,) array, you can use np.newaxis:
a = np.arange(1,61, dtype=float).reshape((3,4,5)) # Create a 3d array
a.shape # (3,4,5)
b = np.array([1.0, 2.0, 3.0, 4.0]) # Create a 1-d array
b.shape # (4,)
a / b # Gives a ValueError
a / b[:, np.newaxis] # The result you want
You can read all about the broadcasting rules here. You can also use newaxis more than once if required (e.g. to divide a shape (3,4,5,6) array by a shape (3,5) array; a sketch follows below).
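A minimal sketch of that multi-newaxis case (the array contents here are just for illustration):
a = np.ones((3, 4, 5, 6))
b = np.arange(1, 16, dtype=float).reshape(3, 5)
# b[:, np.newaxis, :, np.newaxis] has shape (3,1,5,1), which
# broadcasts against (3,4,5,6) along axes 0 and 2.
c = a / b[:, np.newaxis, :, np.newaxis]
c.shape  # (3, 4, 5, 6)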
From my understanding of the docs, using newaxis + broadcasting also avoids any unnecessary array copying.
Indexing, newaxis etc. are described more fully here now. (The documentation has been reorganised since this answer was first posted.)

I think you can get this behavior with numpy's usual broadcasting behavior:
In [9]: a = np.array([[1., 2.], [3., 4.]])
In [10]: a / np.sum(a, axis=0)
Out[10]:
array([[ 0.25      ,  0.33333333],
       [ 0.75      ,  0.66666667]])
If I've interpreted correctly.
If you want the other axis you could transpose everything:
> a = np.random.randn(4,3).transpose()
> norms = np.apply_along_axis(np.linalg.norm,0,a)
> c = a / norms
> np.apply_along_axis(np.linalg.norm,0,c)
array([ 1., 1., 1., 1.])
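As a side note (assuming numpy >= 1.10, where np.linalg.norm gained keepdims), the norms can be computed and broadcast in one step, avoiding both apply_along_axis and the transpose:
a = np.random.randn(4, 3)
# keepdims=True leaves the norms with shape (4, 1), so they broadcast row-wise
c = a / np.linalg.norm(a, axis=1, keepdims=True)
np.apply_along_axis(np.linalg.norm, 1, c)  # array([1., 1., 1., 1.])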

Related

How to perform an outer product with custom function (pandas/numpy)?

My dataframe has N rows.
I have M centroids. Each centroid is the same shape as a dataframe-row.
I need to create an N-rows by M-cols matrix, where the m-th column is created by applying the m-th centroid to the dataframe.
My solution involves pre-creating the output matrix and filling it one column at a time as we manually iterate over centroids.
It feels clumsy. But I can't see clearly how to do it 'properly'.
df = pd.read_csv('test_data.csv')
centroids = df.sample(n=2)
centroids.reset_index(drop=True, inplace=True)

def getDistanceMatrix(df, centroids):
    distanceMatrix = np.zeros((len(df), len(centroids)))
    distFunc = lambda centroid, row: sum(centroid != row)
    iCentroid = 0
    for _, centroid in centroids.iterrows():
        distanceMatrix[:, iCentroid] = df.apply(
            lambda row: distFunc(centroid, row),
            axis=1
        )
        iCentroid += 1
    return distanceMatrix

distanceMatrix = getDistanceMatrix(df, centroids)
Here's an example test_data.csv with 9 rows:
A,B,C,D
1,2,1,1
2,1,1,2
1,2,3,4
2,2,1,2
2,3,3,4
1,1,3,1
4,2,1,2
2,3,3,3
4,1,1,2
It feels like some kind of outer-product-with-a-custom-function.
What's a good way to write this?
I mainly work with "vanilla numpy", so I cannot give a nice solution based on pandas. I would do it like this if they were only numpy arrays, but I am not sure if there are any conversion overheads with pandas:
# Convert to numpy arrays (as I'm not proficient with
# pandas dataframes (...yet))
df_np = df.to_numpy()
centroids_np = centroids.to_numpy()

# Broadcast df_np to (2,9,4) and centroids_np to (2,1,4),
# then subtract the two.
# The result is a (2,9,4) array, where:
#   - axis=0 corresponds to the centroid
#   - axis=1 corresponds to the element in the dataframe
#   - axis=2 corresponds to the individual coordinates
diff = np.broadcast_to(
    df_np,
    (centroids_np.shape[0], df_np.shape[0], df_np.shape[1])
) - centroids_np[:, None, :]

# Convert to a binary distance
diff = (diff != 0).astype(df_np.dtype)

# Now sum along the coordinates
distanceMatrix2 = np.sum(diff, axis=-1).T
# array([[0, 2],
#        [3, 3],
#        [2, 2],
#        [2, 4],
#        [4, 3],
#        [2, 0],
#        [2, 4],
#        [4, 3],
#        [3, 3]], dtype=int64)
For reference, your code gives me:
distanceMatrix = getDistanceMatrix(df, centroids)
# array([[0., 2.],
#        [3., 3.],
#        [2., 2.],
#        [2., 4.],
#        [4., 3.],
#        [2., 0.],
#        [2., 4.],
#        [4., 3.],
#        [3., 3.]])
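Incidentally (my own note), the explicit np.broadcast_to is not strictly needed; letting the comparison broadcast on its own gives the same matrix in one line:
# (9,4) vs (2,1,4) broadcasts to a (2,9,4) boolean array of mismatches
distanceMatrix3 = (df_np != centroids_np[:, None, :]).sum(axis=-1).T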

batch axis in keras custom layer

I want to make a custom layer that does the following, given a batch of input vectors.
For each vector a in the batch:
get the first element a[0].
multiply the vector a by a[0] elementwise.
So if the batch is
[[ 1.,  2.,  3.],
 [ 4.,  5.,  6.],
 [ 7.,  8.,  9.],
 [10., 11., 12.]]
This should be a batch of 4 vectors, each with dimension 3 (or am I wrong here?).
Then my layer should transform the batch to the following:
[[  1.,   2.,   3.],
 [ 16.,  20.,  24.],
 [ 49.,  56.,  63.],
 [100., 110., 120.]]
Here is my implementation for the layer:
class MyLayer(keras.layers.Layer):
    def __init__(self, activation=None, **kwargs):
        super().__init__(**kwargs)
        self.activation = keras.activations.get(activation)

    def call(self, a):
        scale = a[0]
        return self.activation(a * scale)

    def get_config(self):
        base_config = super().get_config()
        return {**base_config,
                "activation": keras.activations.serialize(self.activation)}
But the output is different from what I expected:
batch = tf.Variable([[ 1,  2,  3],
                     [ 4,  5,  6],
                     [ 7,  8,  9],
                     [10, 11, 12]], dtype=tf.float32)
layer = MyLayer()
print(layer(batch))
Output:
tf.Tensor(
[[ 1.  4.  9.]
 [ 4. 10. 18.]
 [ 7. 16. 27.]
 [10. 22. 36.]], shape=(4, 3), dtype=float32)
It looks like the implementation actually treats each column as a vector, which is strange to me: other pre-written models, such as the Sequential model, specify the input shape as (batch_size, ...), which means each row, not each column, is a vector.
How should I modify my code so that it behaves the way I want?
Actually, your input shape is (4, 3), so when you slice this tensor with a[0] you get the first row, which is [1, 2, 3]. To get what you want, you should instead take the first column and then transpose your matrix to produce the desired result, like this:
def call(self, a):
    scale = a[:, 0]
    return tf.transpose(self.activation(tf.transpose(a) * scale))
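A variant without the double transpose (my own sketch, not from the original answer) is to keep the sliced column two-dimensional so it broadcasts across each row directly:
def call(self, a):
    scale = a[:, 0:1]  # shape (batch_size, 1); broadcasts across each row
    return self.activation(a * scale)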

tf.keras.losses.CategoricalCrossentropy gives different values than plain implementation

Does anyone know why this raw implementation of the categorical crossentropy function is so different from tf.keras's API function?
import tensorflow as tf
import numpy as np
import math
tf.enable_eager_execution()

y_true = np.array([[1., 0., 0.], [0., 1., 0.], [0., 0., 1.]])
y_pred = np.array([[.9, .05, .05], [.5, .89, .6], [.05, .01, .94]])
ce = tf.keras.losses.CategoricalCrossentropy()
res = ce(y_true, y_pred).numpy()
print("use api:")
print(res)
print()
print("implementation:")
step1 = -y_true * np.log(y_pred )
step2 = np.sum(step1, axis=1)
print("step1.shape:", step1.shape)
print(step1)
print("sum step1:", np.sum(step1, ))
print("mean step1", np.mean(step1))
print()
print("step2.shape:", step2.shape)
print(step2)
print("sum step2:", np.sum(step2, ))
print("mean step2", np.mean(step2))
Above gives:
use api:
0.3239681124687195
implementation:
step1.shape: (3, 3)
[[0.10536052 0.         0.        ]
 [0.         0.11653382 0.        ]
 [0.         0.         0.0618754 ]]
sum step1: 0.2837697356318653
mean step1 0.031529970625762814
step2.shape: (3,)
[0.10536052 0.11653382 0.0618754 ]
sum step2: 0.2837697356318653
mean step2 0.09458991187728844
If now with another y_true and y_pred:
y_true = np.array([[0, 1]])
y_pred = np.array([[0.99999999999, 0.00000000001]])
It gives:
use api:
16.11809539794922
implementation:
step1.shape: (1, 2)
[[-0. 25.32843602]]
sum step1: 25.328436022934504
mean step1 12.664218011467252
step2.shape: (1,)
[25.32843602]
sum step2: 25.328436022934504
mean step2 25.328436022934504
The difference is because of these values: [.5, .89, .6], since their sum is not equal to 1. I think you made a mistake and meant this instead: [.05, .89, .06].
If you provide values whose sum is equal to 1, then both formulas give the same results:
import tensorflow as tf
import numpy as np
y_true = np.array([[1., 0., 0.], [0., 1., 0.], [0., 0., 1.]])
y_pred = np.array([[.9, .05, .05], [.05, .89, .06], [.05, .01, .94]])
print(tf.keras.losses.categorical_crossentropy(y_true, y_pred).numpy())
print(np.sum(-y_true * np.log(y_pred), axis=1))
#output
#[0.10536052 0.11653382 0.0618754 ]
#[0.10536052 0.11653382 0.0618754 ]
However, let's explore how it is calculated if the y_pred tensor is not scaled (i.e. the sum of its values is not equal to 1). If you look at the source code of categorical crossentropy here, you will see that it scales y_pred so that the class probas of each sample sum to 1:
if not from_logits:
    # scale preds so that the class probas of each sample sum to 1
    output /= tf.reduce_sum(output,
                            reduction_indices=len(output.get_shape()) - 1,
                            keep_dims=True)
Since we passed predictions whose probas do not sum to 1, let's see how this operation changes our tensor [.5, .89, .6]:
output = tf.constant([.5, .89, .6])
output /= tf.reduce_sum(output,
                        axis=len(output.get_shape()) - 1,
                        keepdims=True)
print(output.numpy())
# array([0.2512563 , 0.44723618, 0.30150756], dtype=float32)
So, the results should be equal if we take the output of that operation (the scaled y_pred) and pass it to your own implementation of categorical crossentropy, while passing the unscaled y_pred to the tensorflow implementation:
y_true = np.array([[1., 0., 0.], [0., 1., 0.], [0., 0., 1.]])
#unscaled y_pred
y_pred = np.array([[.9, .05, .05], [.5, .89, .6], [.05, .01, .94]])
print(tf.keras.losses.categorical_crossentropy(y_true, y_pred).numpy())
#scaled y_pred (categorical_crossentropy scales above tensor to this internally)
y_pred = np.array([[.9, .05, .05], [0.2512563 , 0.44723618, 0.30150756], [.05, .01, .94]])
print(np.sum(-y_true * np.log(y_pred), axis=1))
Output:
[0.10536052 0.80466845 0.0618754 ]
[0.10536052 0.80466846 0.0618754 ]
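As a sanity check (my own addition, not part of the original answer): the CategoricalCrossentropy loss object reduces per-sample losses with a mean by default, and averaging these per-sample values reproduces the question's "use api" number:
np.mean([0.10536052, 0.80466845, 0.0618754])
# 0.3239681... -- matches ce(y_true, y_pred).numpy() above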
Now, let's explore the results of your second example. Why does your second example show different output?
If you check the source code again, you will see this line:
output = tf.clip_by_value(output, epsilon, 1. - epsilon)
which clips values outside the range [epsilon, 1 - epsilon]. Your input [0.99999999999, 0.00000000001] will be converted to [0.9999999, 0.0000001] by this line, so it gives you a different result:
y_true = np.array([[0, 1]])
y_pred = np.array([[0.99999999999, 0.00000000001]])
print(tf.keras.losses.categorical_crossentropy(y_true, y_pred).numpy())
print(np.sum(-y_true * np.log(y_pred), axis=1))
#now let's first clip the values less than epsilon, then compare loss
epsilon=1e-7
y_pred = tf.clip_by_value(y_pred, epsilon, 1. - epsilon)
print(tf.keras.losses.categorical_crossentropy(y_true, y_pred).numpy())
print(np.sum(-y_true * np.log(y_pred), axis=1))
Output:
#results without clipping values
[16.11809565]
[25.32843602]
#results after clipping values if there is a value less than epsilon (1e-7)
[16.11809565]
[16.11809565]
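As an aside (my own note, not from the original answer): if you have raw, unnormalized scores, passing them with from_logits=True lets tensorflow apply the softmax internally, which sidesteps both the rescaling and the clipping discussed above:
logits = np.array([[2.0, -1.0, 0.5]])
y_true = np.array([[1.0, 0.0, 0.0]])
# softmax is applied inside the loss, so no manual scaling or clipping is needed
print(tf.keras.losses.categorical_crossentropy(y_true, logits, from_logits=True).numpy())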

Custom word2vec Transformer on pandas dataframe and using it in FeatureUnion

For the below pandas DataFrame df, I want to transform the type column to OneHotEncoding, and transform the word column to its vector representation using the dictionary word2vec. Then I want to concatenate the two transformed vectors with the count column to form the final feature for classification.
>>> df
       word type  count
0     apple    A      4
1       cat    B      3
2  mountain    C      1
>>> df.dtypes
word       object
type     category
count       int64
>>> word2vec
{'apple': [0.1, -0.2, 0.3], 'cat': [0.2, 0.2, 0.3], 'mountain': [0.4, -0.2, 0.3]}
I defined a custom Transformer, and used FeatureUnion to concatenate the features.
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import OneHotEncoder

class w2vTransformer(TransformerMixin):
    def __init__(self, word2vec):
        self.word2vec = word2vec

    def fit(self, x, y=None):
        return self

    def wv(self, w):
        return self.word2vec[w] if w in self.word2vec else [0, 0, 0]

    def transform(self, X, y=None):
        return df['word'].apply(self.wv)
pipeline = Pipeline([
    ('features', FeatureUnion(transformer_list=[
        # Part 1: get integer column
        ('numericals', Pipeline([
            ('selector', TypeSelector(np.number)),
        ])),
        # Part 2: get category column and its onehotencoding
        ('categoricals', Pipeline([
            ('selector', TypeSelector('category')),
            ('labeler', StringIndexer()),
            ('encoder', OneHotEncoder(handle_unknown='ignore')),
        ])),
        # Part 3: transform word to its embedding
        ('word2vec', Pipeline([
            ('w2v', w2vTransformer(word2vec)),
        ]))
    ])),
])
When I run pipeline.fit_transform(df), I got the error: blocks[0,:] has incompatible row dimensions. Got blocks[0,2].shape[0] == 1, expected 3.
However, if I remove the word2vec transformer (Part 3) from the pipeline, the pipeline (Part 1 + Part 2) works fine.
>>> pipeline_no_word2vec.fit_transform(df).todense()
matrix([[4., 1., 0., 0.],
        [3., 0., 1., 0.],
        [1., 0., 0., 1.]])
And if I keep only the w2v transformer in the pipeline, it also works.
>>> pipeline_only_word2vec.fit_transform(df)
array([list([0.1, -0.2, 0.3]), list([0.2, 0.2, 0.3]),
       list([0.4, -0.2, 0.3])], dtype=object)
My guess is that there is something wrong in my w2vTransformer class, but I don't know how to fix it. Please help.
This error is due to the fact that FeatureUnion expects a 2-d array from each of its parts.
The first two parts of your FeatureUnion, 'numericals' and 'categoricals', correctly send 2-d data of shape (n_samples, n_features).
n_samples = 3 in your example data. n_features depends on the individual part (e.g. OneHotEncoder changes it in the 2nd part, but it is 1 in the first part).
But the third part, 'word2vec', returns a pandas.Series object, which has the 1-d shape (3,). FeatureUnion treats this as shape (1, 3) by default, hence the complaint that it does not match the other blocks.
So you need to correct that shape.
Now even if you simply do a reshape() at the end and change it to shape (3, 1), your code will not run, because the internal contents of that array are lists from your word2vec dict, which are not transformed correctly to a 2-d array; instead it becomes an array of lists.
Change the w2vTransformer to correct the error:
class w2vTransformer(TransformerMixin):
    ...
    ...
    def transform(self, X, y=None):
        return np.array([np.array(vv) for vv in X['word'].apply(self.wv)])
And after that the pipeline will work.
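A slightly terser variant (my own sketch, equivalent under the same assumptions) lets np.vstack stack the per-word lists into a (n_samples, 3) array:
def transform(self, X, y=None):
    # np.vstack turns the Series of length-3 lists into a 2-d array
    return np.vstack(X['word'].apply(self.wv))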

How to iterate over a 3 dimensional tensor

I have a tensor say:
y_true = np.array([[[1.], [0.], [3.]], [[5.], [0.], [0.]]])
I want to iterate over y_true, accessing all individual values. I want to do something like the following in Java:
for (i = 0; i < y_true.length; i++) {
    arr2 = y_true[i];
    for (j = 0; j < arr2.length; j++) {
        print(arr2[j][0]);
    }
}
Are you looking for slicing with [:,:,0]?
>>> y_true[:,:,0]
array([[1., 0., 3.],
[5., 0., 0.]])
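Combined with ordinary Python iteration (a small sketch of my own), that slice reproduces the Java-style loop from the question:
for row in y_true[:, :, 0]:
    for v in row:
        print(v)  # prints 1.0, 0.0, 3.0, 5.0, 0.0, 0.0, one per line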
There are 2 cases:
Case 1: you know the rank (dimensionality) of the numpy array you created. In the example, y_true has rank 3. You can check the y_true.shape property, which gives you the exact size of each dimension; then you can write as many for loops as the rank of y_true and output each element separately, for example:
import numpy as np

y_true = np.array([[[1.], [0.], [3.]], [[5.], [0.], [0.]]])
dims = y_true.shape
for i in range(dims[0]):
    for j in range(dims[1]):
        for k in range(dims[2]):
            print("Element of np array with indices {} is equal to {}".format([i, j, k], y_true[i, j, k]))
Case 2: if you don't know the rank of the tensor you want to print, you can write a recursive function that prints all the elements, for example:
import numpy as np

def recursively_print_elems(np_arr, idx, pos):
    if pos >= len(np_arr.shape):
        print("Element of np array with indices {} is equal to: {}".format(idx, np_arr[tuple(idx)]))
        return
    for i in range(np_arr.shape[pos]):
        idx[pos] = i
        recursively_print_elems(np_arr, idx, pos + 1)

def print_elems(np_arr):
    idx = [0] * len(np_arr.shape)
    recursively_print_elems(np_arr, idx, 0)

y_true = np.array([[[1.], [0.], [3.]], [[5.], [0.], [0.]]])
print_elems(y_true)
The 2nd approach is more general: it will work for a tensor of any dimensionality.
Your array:
In [19]: y_true
Out[19]:
array([[[1.],
        [0.],
        [3.]],

       [[5.],
        [0.],
        [0.]]])
In [20]: y_true.shape
Out[20]: (2, 3, 1)
With a last dimension of size 1, we can reshape it
In [21]: y_true.reshape(2,3)
Out[21]:
array([[1., 0., 3.],
       [5., 0., 0.]])
Selecting on that index does just as well.
But you can access all values in order just by raveling/flattening:
In [22]: y_true.ravel()
Out[22]: array([1., 0., 3., 5., 0., 0.])
Or get a 1-d iterator:
In [23]: yiter = y_true.flat
In [24]: yiter?
Type: flatiter
String form: <numpy.flatiter object at 0x1fdd200>
Length: 6
File: ~/.local/lib/python3.6/site-packages/numpy/__init__.py
Docstring: <no docstring>
Class docstring:
Flat iterator object to iterate over arrays.
A `flatiter` iterator is returned by ``x.flat`` for any array `x`.
It allows iterating over the array as if it were a 1-D array,
either in a for-loop or by calling its `next` method.
...
So instead of constructing an iterator for each dimension we can iterate on this flat one:
In [25]: for item in yiter:print(item)
1.0
0.0
3.0
5.0
0.0
0.0
ndenumerate uses this flat iterator, and returns both coordinates and values:
In [26]: list(np.ndenumerate(y_true))
Out[26]:
[((0, 0, 0), 1.0),
((0, 1, 0), 0.0),
((0, 2, 0), 3.0),
((1, 0, 0), 5.0),
((1, 1, 0), 0.0),
((1, 2, 0), 0.0)]
A variation on this is ndindex:
In [27]: indexs = np.ndindex(y_true.shape)
In [28]: for ijk in indexs:
    ...:     print(ijk, y_true[ijk])
    ...:
(0, 0, 0) 1.0
(0, 1, 0) 0.0
(0, 2, 0) 3.0
(1, 0, 0) 5.0
(1, 1, 0) 0.0
(1, 2, 0) 0.0
But where possible it is better to operate on the whole array, rather than iterate. Those whole-array operations do the iteration in compiled code.
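For instance (my own illustration), summing or scaling the whole array is a single vectorized expression, with the per-element loop running in compiled code:
y_true.sum()          # 9.0
(2 * y_true).ravel()  # array([ 2.,  0.,  6., 10.,  0.,  0.])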