Partitioned matrix multiplication in TensorFlow or PyTorch

Assume I have a matrix P of size [4, 4] which is partitioned (blocked) into 4 smaller matrices of size [2, 2]. How can I efficiently multiply this block matrix by another matrix (not a partitioned matrix, but a smaller one)?
Let's assume our original matrix is:
P = [ 1 1 2 2
      1 1 2 2
      3 3 4 4
      3 3 4 4 ]
Which split into submatrices:
P_1 = [ 1 1    P_2 = [ 2 2    P_3 = [ 3 3    P_4 = [ 4 4
        1 1 ]          2 2 ]          3 3 ]          4 4 ]
Now our P is:
P = [ P_1 P_2
      P_3 P_4 ]
In the next step, I want to do element-wise multiplication between P and a smaller matrix whose size equals the number of sub-matrices, so that each element scales one block:
P * [ 1 0   = [ P_1 0   = [ 1 1 0 0
      0 0 ]     0   0 ]     1 1 0 0
                            0 0 0 0
                            0 0 0 0 ]

You can think of representing your large block matrix in a more efficient way.
For instance, a block matrix
P = [ 1 1 2 2
      1 1 2 2
      3 3 4 4
      3 3 4 4 ]
Can be represented using
a = [ 1 0    b = [ 1 1 0 0    p = [ 1 2
      1 0          0 0 1 1 ]        3 4 ]
      0 1
      0 1 ]
as
P = a @ p @ b
(with @ representing matrix multiplication). Matrices a and b represent/encode the block structure of P, and the small p represents the value of each block.
Now, if you want to multiply (element-wise) p with a small (2x2) matrix q, you simply compute
a @ (p * q) @ b
A simple pytorch example
In [1]: a = torch.tensor([[1., 0], [1., 0], [0., 1], [0, 1]])
In [2]: b = torch.tensor([[1., 1., 0, 0], [0, 0, 1., 1]])
In [3]: p = torch.tensor([[1., 2.], [3., 4.]])
In [4]: q = torch.tensor([[1., 0], [0., 0]])
In [5]: a @ p @ b
Out[5]:
tensor([[1., 1., 2., 2.],
        [1., 1., 2., 2.],
        [3., 3., 4., 4.],
        [3., 3., 4., 4.]])
In [6]: a @ (p*q) @ b
Out[6]:
tensor([[1., 1., 0., 0.],
        [1., 1., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]])
I leave it to you as an exercise to efficiently produce the "structure" matrices a and b given the sizes of the blocks.
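One possible construction (a minimal sketch of my own, assuming a square grid of n_blocks x n_blocks blocks, each of size block_size x block_size):
import torch

def structure_matrices(n_blocks, block_size):
    # Each column of a (and each row of b) is a 0/1 indicator of one
    # block row (respectively one block column).
    eye = torch.eye(n_blocks)
    a = torch.repeat_interleave(eye, block_size, dim=0)  # (n_blocks*block_size, n_blocks)
    b = torch.repeat_interleave(eye, block_size, dim=1)  # (n_blocks, n_blocks*block_size)
    return a, b

a, b = structure_matrices(2, 2)
p = torch.tensor([[1., 2.], [3., 4.]])
q = torch.tensor([[1., 0.], [0., 0.]])
print(a @ (p * q) @ b)  # reproduces Out[6] above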

The following is a general TensorFlow-based solution that works for input matrices p (large) and m (small) of arbitrary shapes, as long as the sizes of p are divisible by the sizes of m on both axes.
def block_mul(p, m):
    p_x, p_y = p.shape
    m_x, m_y = m.shape
    m_4d = tf.reshape(m, (m_x, 1, m_y, 1))
    m_broadcasted = tf.broadcast_to(m_4d, (m_x, p_x // m_x, m_y, p_y // m_y))
    mp = tf.reshape(m_broadcasted, (p_x, p_y))
    return p * mp
Test:
import tensorflow as tf
tf.enable_eager_execution()
p = tf.reshape(tf.constant(range(36)), (6, 6))
m = tf.reshape(tf.constant(range(9)), (3, 3))
print(f"p:\n{p}\n")
print(f"m:\n{m}\n")
print(f"block_mul(p, m):\n{block_mul(p, m)}")
Output (Python 3.7.3, Tensorflow 1.13.1):
p:
[[ 0  1  2  3  4  5]
 [ 6  7  8  9 10 11]
 [12 13 14 15 16 17]
 [18 19 20 21 22 23]
 [24 25 26 27 28 29]
 [30 31 32 33 34 35]]

m:
[[0 1 2]
 [3 4 5]
 [6 7 8]]

block_mul(p, m):
[[  0   0   2   3   8  10]
 [  0   0   8   9  20  22]
 [ 36  39  56  60  80  85]
 [ 54  57  80  84 110 115]
 [144 150 182 189 224 232]
 [180 186 224 231 272 280]]
Another solution that uses implicit broadcasting is the following:
def block_mul2(p, m):
    p_x, p_y = p.shape
    m_x, m_y = m.shape
    p_4d = tf.reshape(p, (m_x, p_x // m_x, m_y, p_y // m_y))
    m_4d = tf.reshape(m, (m_x, 1, m_y, 1))
    return tf.reshape(p_4d * m_4d, (p_x, p_y))
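A quick sanity check (my addition, reusing the p and m from the test above under eager execution) that both implementations agree:
print(bool(tf.reduce_all(tf.equal(block_mul(p, m), block_mul2(p, m)))))
# True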

I don't know about the most efficient method, but you can try these:
Method 1:
Using torch.cat()
import torch
def multiply(a, b):
    x1 = a[0:2, 0:2] * b[0, 0]
    x2 = a[0:2, 2:] * b[0, 1]
    x3 = a[2:, 0:2] * b[1, 0]
    x4 = a[2:, 2:] * b[1, 1]
    return torch.cat((torch.cat((x1, x2), 1), torch.cat((x3, x4), 1)), 0)

a = torch.tensor([[1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 4, 4], [3, 3, 4, 4]])
b = torch.tensor([[1, 0], [0, 0]])
print(multiply(a, b))
output:
tensor([[1, 1, 0, 0],
        [1, 1, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0]])
Method 2:
Using torch.nn.functional.pad()
import torch.nn.functional as F
import torch
def multiply(a, b):
    b = F.pad(input=b, pad=(1, 1, 1, 1), mode='constant', value=0)
    b[0, 0] = 1
    b[0, 1] = 1
    b[1, 0] = 1
    return a * b

a = torch.tensor([[1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 4, 4], [3, 3, 4, 4]])
b = torch.tensor([[1, 0], [0, 0]])
print(multiply(a, b))
output:
tensor([[1, 1, 0, 0],
        [1, 1, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0]])
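Note that Method 2 hard-codes the padded mask for this particular b. A more general variant (my own sketch, not part of the original answer) builds the mask with repeat_interleave instead:
import torch

def multiply_general(a, b):
    # Expand each element of b over its block, then multiply element-wise.
    rows = a.shape[0] // b.shape[0]  # block height
    cols = a.shape[1] // b.shape[1]  # block width
    mask = b.repeat_interleave(rows, dim=0).repeat_interleave(cols, dim=1)
    return a * mask

a = torch.tensor([[1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 4, 4], [3, 3, 4, 4]])
b = torch.tensor([[1, 0], [0, 0]])
print(multiply_general(a, b))  # same output as above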

If the matrices are small, you are probably fine with cat or pad. The solution with the factorization is very elegant, as is the one with the block_mul implementation.
Another solution is turning the 2D block matrix into a 3D volume where each 2D slice is a block (P_1, P_2, P_3, P_4). Then use the power of broadcasting to multiply each 2D slice by a scalar. Finally, reshape the output. Reshaping is not immediate, but it's doable; it is a port from numpy to pytorch of https://stackoverflow.com/a/16873755/4892874
In Pytorch:
import torch
h = w = 4
x = torch.ones(h, w)
x[:2, 2:] = 2
x[2:, :2] = 3
x[2:, 2:] = 4
# size of each block along rows and cols
nrows = 2
ncols = 2
vol3d = x.reshape(h//nrows, nrows, -1, ncols)
vol3d = vol3d.permute(0, 2, 1, 3).reshape(-1, nrows, ncols)
out = vol3d * torch.Tensor([1, 0, 0, 0])[:, None, None].float()
# reshape to original
n, nrows, ncols = out.shape
out = out.reshape(h//nrows, -1, nrows, ncols)
out = out.permute(0, 2, 1, 3)
out = out.reshape(h, w)
print(out)
tensor([[1., 1., 0., 0.],
        [1., 1., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]])
I haven't benchmarked this against the others, but it doesn't consume additional memory like padding would, and it doesn't do slow operations like concatenation. It also has the advantage of being easy to understand and visualize.
You can generalize it to any situation by playing with h, w, nrows, ncols.
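Wrapped in a function, the same idea looks like this (my own generalization of the snippet above; nrows and ncols are the size of each block):
import torch

def block_scale(x, weights, nrows, ncols):
    # Split x into (nrows, ncols) blocks, scale each block by the matching
    # weight, then reassemble the original layout.
    h, w = x.shape
    vol = x.reshape(h // nrows, nrows, -1, ncols).permute(0, 2, 1, 3)
    vol = vol * weights.reshape(h // nrows, w // ncols)[:, :, None, None]
    return vol.permute(0, 2, 1, 3).reshape(h, w)

# Reusing x from the snippet above:
print(block_scale(x, torch.tensor([1., 0., 0., 0.]), 2, 2))  # same output as above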

Although the other answer may be the solution, it is not an efficient way. I came up with another one to tackle the problem (but it is still not perfect): the following implementation needs too much memory when our inputs have 3 or 4 dimensions. For example, for an input size of 20*75*1024*1024, the following calculation needs around 12 GB of RAM.
Here is my implementation:
import tensorflow as tf
tf.enable_eager_execution()
inps = tf.constant([
    [1, 1, 1, 1, 2, 2, 2, 2],
    [1, 1, 1, 1, 2, 2, 2, 2],
    [1, 1, 1, 1, 2, 2, 2, 2],
    [1, 1, 1, 1, 2, 2, 2, 2],
    [3, 3, 3, 3, 4, 4, 4, 4],
    [3, 3, 3, 3, 4, 4, 4, 4],
    [3, 3, 3, 3, 4, 4, 4, 4],
    [3, 3, 3, 3, 4, 4, 4, 4]])
on_cells = tf.constant([[1, 0, 0, 1]])
on_cells = tf.expand_dims(on_cells, axis=-1)
# replicate the value to block-size (4*4)
on_cells = tf.tile(on_cells, [1, 1, 4 * 4])
# reshape to a format for permutation
on_cells = tf.reshape(on_cells, (1, 2, 2, 4, 4))
# permutation
on_cells = tf.transpose(on_cells, [0, 1, 3, 2, 4])
# reshape
on_cells = tf.reshape(on_cells, [1, 8, 8])
# element-wise operation
print(inps * on_cells)
Output:
tf.Tensor(
[[[1 1 1 1 0 0 0 0]
  [1 1 1 1 0 0 0 0]
  [1 1 1 1 0 0 0 0]
  [1 1 1 1 0 0 0 0]
  [0 0 0 0 4 4 4 4]
  [0 0 0 0 4 4 4 4]
  [0 0 0 0 4 4 4 4]
  [0 0 0 0 4 4 4 4]]], shape=(1, 8, 8), dtype=int32)
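The same tile/transpose recipe can be wrapped in a function for arbitrary grid and block sizes (a sketch of my own; the helper name is hypothetical). It produces a plain 2-D mask, so the result lacks the leading batch dimension seen above:
def expand_cells(cells, block_h, block_w):
    # cells: (grid_h, grid_w) -> mask of shape (grid_h*block_h, grid_w*block_w)
    g_h, g_w = cells.shape
    c = tf.reshape(cells, (g_h, g_w, 1))
    c = tf.tile(c, [1, 1, block_h * block_w])        # replicate each cell over its block
    c = tf.reshape(c, (g_h, g_w, block_h, block_w))  # one (block_h, block_w) tile per cell
    c = tf.transpose(c, [0, 2, 1, 3])                # interleave block rows with grid rows
    return tf.reshape(c, (g_h * block_h, g_w * block_w))

print(inps * expand_cells(tf.constant([[1, 0], [0, 1]]), 4, 4))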

Related

What's going on in torch.transform.ToTensor?

Let
x = np.array([[0, 254], [14, 77]])
>>> x
array([[  0, 254],
       [ 14,  77]])
y = torch.from_numpy(x).int().float()
>>> y
tensor([[  0., 254.],
        [ 14.,  77.]])
train_transform = transforms.Compose([transforms.ToPILImage(),
                                      transforms.ToTensor()])
z = train_transform(y)
>>> z
tensor([[[0.0000, 0.0078],
         [0.9490, 0.7020]]])
If I use ToTensor, the values should be divided by 255, but what is going on with that? Why do 14 and 77 come out so high?

Indices in Numpy and MATLAB

I have a piece of code in Matlab that I want to convert into Python/numpy.
I have a matrix ind which has the dimensions (32768, 24). I have another matrix X which has the dimensions (98304, 6). When I perform the operation
result = X(ind)
the shape of the result is (32768, 24).
But in numpy, when I perform the same operation
result = X[ind]
I get the shape of the result matrix as (32768, 24, 6).
I would greatly appreciate it if someone could help me understand why I get these two different results and how I can fix this. I want to get the shape (32768, 24) for the result matrix in numpy as well.
In Octave, if I define:
>> X=diag([1,2,3,4])
X =
Diagonal Matrix
   1   0   0   0
   0   2   0   0
   0   0   3   0
   0   0   0   4
>> idx = [6 7;10 11]
idx =
    6    7
   10   11
then the indexing selects a block:
>> X(idx)
ans =
   2   0
   0   3
The numpy equivalent is
In [312]: X = np.diag([1,2,3,4])
In [313]: X
Out[313]:
array([[1, 0, 0, 0],
       [0, 2, 0, 0],
       [0, 0, 3, 0],
       [0, 0, 0, 4]])
In [314]: idx = np.array([[5,6],[9,10]])  # shifted for 0-based indexing
In [315]: np.unravel_index(idx, (4,4))    # raveled to unraveled conversion
Out[315]:
(array([[1, 1],
        [2, 2]]),
 array([[1, 2],
        [1, 2]]))
In [316]: X[_]   # this indexes with a tuple of arrays
Out[316]:
array([[2, 0],
       [0, 3]])
another way:
In [318]: X.flat[idx]
Out[318]:
array([[2, 0],
       [0, 3]])
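For the asker's original shapes, one more detail matters: MATLAB linear indices are 1-based and column-major. A direct port (a sketch of mine, with random stand-ins for the real X and ind) therefore ravels X in Fortran order:
import numpy as np

X = np.random.randn(98304, 6)
ind = np.random.randint(1, X.size + 1, size=(32768, 24))  # 1-based MATLAB-style linear indices

# Subtract 1 for 0-based indexing; ravel in column-major (Fortran) order
# to match MATLAB's linear-index convention.
result = X.ravel(order='F')[ind - 1]
print(result.shape)  # (32768, 24)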

How to make a simple Vandermonde matrix with numpy?

My question is how to make a vandermonde matrix. This is the definition:
In linear algebra, a Vandermonde matrix, named after Alexandre-Théophile Vandermonde, is a matrix with the terms of a geometric progression in each row, i.e., an m × n matrix
I would like to make a 4*4 version of this.
So far I have defined values, but only for one row, as follows:
a = 2
n = 4
v = []
for a in range(n):
    for i in range(n):
        v.append(a**i)
v = np.array(v)
print(v)
I don't know how to scale this up. Please help!
Given a starting column a of length m, you can create a Vandermonde matrix v with n columns a**0 to a**(n-1) like so:
import numpy as np
m = 4
n = 4
a = range(1, m+1)
v = np.array([a]*n).T**range(n)
print(v)
#[[ 1  1  1  1]
# [ 1  2  4  8]
# [ 1  3  9 27]
# [ 1  4 16 64]]
As proposed by michael szczesny, you could use numpy.vander.
But this will not be according to the definition on Wikipedia:
x = np.array([1, 2, 3, 5])
N = 4
np.vander(x, N)
#array([[  1,   1,   1,   1],
#       [  8,   4,   2,   1],
#       [ 27,   9,   3,   1],
#       [125,  25,   5,   1]])
So, you'd have to use numpy.fliplr as well:
x = np.array([1, 2, 3, 5])
N = 4
np.fliplr(np.vander(x, N))
#array([[  1,   1,   1,   1],
#       [  1,   2,   4,   8],
#       [  1,   3,   9,  27],
#       [  1,   5,  25, 125]])
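Alternatively (a small addition of mine, not in the original answer), numpy.vander also accepts an increasing flag that yields the ascending powers directly:
x = np.array([1, 2, 3, 5])
N = 4
np.vander(x, N, increasing=True)
#array([[  1,   1,   1,   1],
#       [  1,   2,   4,   8],
#       [  1,   3,   9,  27],
#       [  1,   5,  25, 125]])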
This could also be achieved without numpy using nested list comprehensions:
x = [1, 2, 3, 5]
N = 4
[[xi**i for i in range(N)] for xi in x]
# [[1, 1, 1, 1],
#  [1, 2, 4, 8],
#  [1, 3, 9, 27],
#  [1, 5, 25, 125]]
# Vandermonde Matrix
def Vandermonde_Matrix(D, k):
    '''
    D = {(x_i, y_i): 0 <= i <= n}
    ----------------
    k: degree
    '''
    n = len(D)
    V = np.zeros(shape=(n, k))
    for i in range(n):
        V[i] = np.power(np.array(D[i][0]), np.arange(k))
    return V
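For example, a hypothetical usage with D as a list of (x, y) pairs:
D = [(1, 2), (2, 3), (3, 5), (5, 7)]
print(Vandermonde_Matrix(D, 4))
# [[  1.   1.   1.   1.]
#  [  1.   2.   4.   8.]
#  [  1.   3.   9.  27.]
#  [  1.   5.  25. 125.]]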

Numpy summation with sliding window is really slow

Code:
shape = np.array([6, 6])
grid = np.array([x.ravel() for x in np.meshgrid(*[np.arange(x) for i, x in enumerate(shape)], indexing='ij')]).T
slices = [tuple(slice(box[i], box[i] + 2) for i in range(len(box))) for box in grid]
score = np.zeros((7,7,3))
column = np.random.randn(36, 12) #just for example
column
>> array([[ 0, 1, 2, 3, ... 425, 426, 427, 428, 429, 430, 431]])
column = column.reshape((16, 3, 3, 3))
for i, window in enumerate(slices):
    score[window] += column[i]
score
>> array([[[0.000e+00, 1.000e+00, 2.000e+00],
           [3.000e+01, 3.200e+01, 3.400e+01],
           [9.000e+01, 9.300e+01, 9.600e+01], ...
           [8.280e+02, 8.300e+02, 8.320e+02],
           [4.290e+02, 4.300e+02, 4.310e+02]]])
It works, but the last 2 lines take a really long time as they will be in a loop. The problem is that the grid variable contains an array of windows, and I don't know how to speed up the process.
Let's simplify the problem a bit - reduce the dimensions, and drop the final size-3 dimension:
In [265]: shape = np.array([4,4])
In [266]: grid = np.array([x.ravel() for x in np.meshgrid(*[np.arange(x) for i, x in enumerate(shape)], indexing='ij')]).T
     ...: grid = [tuple(slice(box[i], box[i] + 3) for i in range(len(box))) for box in grid]
In [267]: len(grid)
Out[267]: 16
In [268]: score = np.arange(36).reshape(6,6)
In [269]: X = np.array([score[x] for x in grid]).reshape(4,4,3,3)
In [270]: X
Out[270]:
array([[[[ 0,  1,  2],
         [ 6,  7,  8],
         [12, 13, 14]],
        [[ 1,  2,  3],
         [ 7,  8,  9],
         [13, 14, 15]],
        [[ 2,  3,  4],
         [ 8,  9, 10],
         [14, 15, 16]],
        ....
        [[21, 22, 23],
         [27, 28, 29],
         [33, 34, 35]]]])
This is a moving window - one (3,3) array, shifted right by 1, ..., shifted down by 1, etc.
With as_strided it is possible to construct a view of the array that consists of all these windows, but without actually copying values. Having worked with as_strided before, I was able to construct the equivalent strides as:
In [271]: score.shape
Out[271]: (6, 6)
In [272]: score.strides
Out[272]: (48, 8)
In [273]: ast = np.lib.stride_tricks.as_strided
In [274]: x=ast(score, shape=(4,4,3,3), strides=(48,8,48,8))
In [275]: np.allclose(X,x)
Out[275]: True
This could be extended to your (28,28,3) dimensions, and turned into the summation.
Generating such moving windows has been covered in previous SO questions. And it's also implemented in one of the image processing packages.
Adaptation for a 3 channel image,
In [45]: arr.shape
Out[45]: (6, 6, 3)
In [46]: arr.strides
Out[46]: (144, 24, 8)
In [47]: arr[:3,:3,0]
Out[47]:
array([[ 0.,  1.,  2.],
       [ 6.,  7.,  8.],
       [12., 13., 14.]])
In [48]: x = ast(arr, shape=(4,4,3,3,3), strides=(144,24,144,24,8))
In [49]: x[0,0,:,:,0]
Out[49]:
array([[ 0.,  1.,  2.],
       [ 6.,  7.,  8.],
       [12., 13., 14.]])
Since we are moving the window by one element at a time, the strides for x are easily derived from the source strides.
For 4x4 windows, just change the shape
x = ast(arr, shape=(3,3,4,4,3), strides=(144,24,144,24,8))
In Efficiently Using Multiple Numpy Slices for Random Image Cropping,
@Divakar suggests using skimage.
With the default step=1, the result is compatible:
In [55]: from skimage.util.shape import view_as_windows
In [63]: y = view_as_windows(arr,(4,4,3))
In [64]: y.shape
Out[64]: (3, 3, 1, 4, 4, 3)
In [69]: np.allclose(x,y[:,:,0])
Out[69]: True
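As a side note (my addition, not in the original answer): newer NumPy versions (1.20+) expose a safer, documented wrapper around the same striding trick, numpy.lib.stride_tricks.sliding_window_view:
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

score = np.arange(36).reshape(6, 6)
windows = sliding_window_view(score, (3, 3))  # a view of shape (4, 4, 3, 3), no copy
print(windows.shape)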

The shape of the predicted_ids in the outputs of `tf.contrib.seq2seq.BeamSearchDecoder`

What is the shape of the contents in the outputs of tf.contrib.seq2seq.BeamSearchDecoder? I know that it is an instance of the class BeamSearchDecoderOutput(scores, predicted_ids, parent_ids), but what are the shapes of scores, predicted_ids and parent_ids?
I wrote the following toy code to explore it a little bit myself.
tgt_vocab_size = 20
embedding_decoder = tf.one_hot(list(range(0, tgt_vocab_size)), tgt_vocab_size)
batch_size = 2
start_tokens = tf.fill([batch_size], 0)
end_token = 1
beam_width = 3
num_units=18
decoder_cell = tf.nn.rnn_cell.BasicLSTMCell(num_units)
encoder_outputs = decoder_cell.zero_state(batch_size, dtype=tf.float32)
tiled_encoder_outputs = tf.contrib.seq2seq.tile_batch(encoder_outputs, multiplier=beam_width)
my_decoder = tf.contrib.seq2seq.BeamSearchDecoder(cell=decoder_cell,
                                                  embedding=embedding_decoder,
                                                  start_tokens=start_tokens,
                                                  end_token=end_token,
                                                  initial_state=tiled_encoder_outputs,
                                                  beam_width=beam_width)
# dynamic decoding
outputs, final_context_state, _ = tf.contrib.seq2seq.dynamic_decode(my_decoder,
                                                                    maximum_iterations=4,
                                                                    output_time_major=True)
final_predicted_ids = outputs.predicted_ids
scores = outputs.beam_search_decoder_output.scores
predicted_ids = outputs.beam_search_decoder_output.predicted_ids
parent_ids = outputs.beam_search_decoder_output.parent_ids
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    final_predicted_ids_vals = sess.run(final_predicted_ids)
    print("final_predicted_ids shape:")
    print(final_predicted_ids_vals.shape)
    print("final_predicted_ids_vals: \n%s" % final_predicted_ids_vals)
    print("scores shape:")
    print(sess.run(scores).shape)
    print("scores values: \n %s" % sess.run(scores))
    print("predicted_ids shape: ")
    print(sess.run(predicted_ids).shape)
    print("predicted_ids values: \n %s" % sess.run(predicted_ids))
    print("parent_ids shape:")
    print(sess.run(parent_ids).shape)
    print("parent_ids values: \n %s" % sess.run(parent_ids))
The print is as follows:
final_predicted_ids shape:
(4, 2, 3)
final_predicted_ids_vals:
[[[ 1  8  8]
  [ 1  8  8]]

 [[ 1 13 13]
  [ 1 13 13]]

 [[ 1 13 13]
  [ 1 13 13]]

 [[ 1 13  2]
  [ 1 13  2]]]
scores shape:
(4, 2, 3)
scores values:
[[[ -2.8376358  -2.843168   -2.8478816]
  [ -2.8376358  -2.843168   -2.8478816]]

 [[ -2.8478816  -5.655898   -5.6810265]
  [ -2.8478816  -5.655898   -5.6810265]]

 [[ -2.8478816  -8.478384   -8.495466 ]
  [ -2.8478816  -8.478384   -8.495466 ]]

 [[ -2.8478816 -11.292251  -11.307263 ]
  [ -2.8478816 -11.292251  -11.307263 ]]]
predicted_ids shape:
(4, 2, 3)
predicted_ids values:
[[[ 8 13  1]
  [ 8 13  1]]

 [[ 1 13 13]
  [ 1 13 13]]

 [[ 1 13 12]
  [ 1 13 12]]

 [[ 1 13  2]
  [ 1 13  2]]]
parent_ids shape:
(4, 2, 3)
parent_ids values:
[[[0 0 0]
  [0 0 0]]

 [[2 0 1]
  [2 0 1]]

 [[0 1 1]
  [0 1 1]]

 [[0 1 1]
  [0 1 1]]]
The output of tf.contrib.seq2seq.dynamic_decode(BeamSearchDecoder) is actually an instance of the class FinalBeamSearchDecoderOutput, which consists of:
predicted_ids: final outputs returned by the beam search after all decoding is finished; a tensor of shape [batch_size, num_steps, beam_width] (or [num_steps, batch_size, beam_width] if output_time_major is True). Beams are ordered from best to worst.
beam_search_decoder_output: an instance of BeamSearchDecoderOutput that describes the state of the beam search.
So, to get the final predictions/translations into shape [beam_width, batch_size, num_steps], you need to transpose with perm [2, 0, 1], or simply tf.transpose(final_predicted_ids) if output_time_major=True.
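Concretely, a small sketch matching the shapes printed above:
# Batch-major output (batch_size, num_steps, beam_width) -> (beam_width, batch_size, num_steps):
beam_major = tf.transpose(final_predicted_ids, perm=[2, 0, 1])
# Time-major output (num_steps, batch_size, beam_width): a plain transpose
# reverses the axes and also yields (beam_width, batch_size, num_steps).
beam_major = tf.transpose(final_predicted_ids)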