I want to use a 2D array which contains k-index values to quickly fill a 3D array with different mask values above/below each k-index. Only non-zero boundary indices will be used to fill.
Initialize 2D k-index array and extract valid i-j index arrays:
import numpy as np
boundary_indices = np.array([[0, 1, 2], [1, 2, 1], [0, 2, 0]])
ii, jj = np.where(boundary_indices > 0) # determine desired indices
kk = boundary_indices[ii, jj] # align boundary indices with valid indices
Yields:
boundary_indices = array([[0, 1, 2],
                          [1, 2, 1],
                          [0, 2, 0]])
ii = array([0, 0, 1, 1, 1, 2])
jj = array([1, 2, 0, 1, 2, 1])
kk = array([1, 2, 1, 2, 1, 2])
Loop through the indices and populate the output array:
output = np.zeros((3, 3, 3), dtype=np.int64)
for i, j, k in zip(ii, jj, kk):
    output[i, j, :k] = 7  # fill region above
    output[i, j, k:] = 8  # fill region below
While this does yield the correct results, it becomes quite slow once the size of the array increases significantly:
output[:, :, 0] = [[0, 7, 7],
                   [7, 7, 7],
                   [0, 7, 0]]
output[:, :, 1] = [[0, 8, 7],
                   [8, 7, 8],
                   [0, 7, 0]]
output[:, :, 2] = [[0, 8, 8],
                   [8, 8, 8],
                   [0, 8, 0]]
Is there a more efficient way to do this?
I tried output[ii, jj, kk] = 8, but that only imprints the boundary on the output array, not the regions above/below.
I was hoping that there would be some fancy-indexing magic and that something like this would work:
output[ii, jj, :kk] = 7
output[ii, jj, kk:] = 8
But it raises: TypeError: only integer scalar arrays can be converted to a scalar index
For this kind of operation, Numba or Cython can be used to produce efficient code. Here is an example with Numba:
import numba as nb
import numpy as np
# `parallel=True` can be added here for large arrays
@nb.njit('int64[:,:,::1](int64[:], int64[:], int64[:])')
def compute(ii, jj, kk):
    output = np.zeros((3, 3, 3), dtype=np.int64)
    n = output.shape[2]
    # `for idx in nb.prange(ii.size)` can be used here for large arrays
    for i, j, k in zip(ii, jj, kk):
        # `i, j, k = ii[idx], jj[idx], kk[idx]` can be used here for large arrays
        for l in range(k):     # fill region above
            output[i, j, l] = 7
        for l in range(k, n):  # fill region below
            output[i, j, l] = 8
    return output
# Either kk needs to be converted to an int64-based array with kk.astype(np.int64)
# or boundary_indices needs to be an int64-based array in the first place.
output = compute(ii, jj, kk)
Note that the Numba function can be faster if ii and jj are contiguous. Surprisingly, they are not contiguous when retrieved from np.where. I also assume that kk is a 64-bit integer array; you can change the signature (the string in the Numba jit decorator) to support 32-bit arrays. Numba can also compile the function lazily, based on the argument types provided at runtime, but this introduces a significant overhead during the first function call.
This code is significantly faster than the pure-Python loop, especially for large arrays, thanks to Numba's just-in-time compilation. The loop can be parallelized using prange and the parallel=True decorator flag, although the current code should already be pretty good. Finally, you can perform the boundary_indices > 0 test directly inside the Numba loop, on the fly, to avoid creating possibly expensive temporary arrays with np.where.
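If adding a Numba dependency is not an option, plain NumPy broadcasting is a possible alternative. The sketch below is not the Numba method from this answer. It assumes the output depth n is known up front and that boundary_indices holds the k-index for every (i, j) cell, with 0 meaning "leave this column untouched".

```python
import numpy as np

boundary_indices = np.array([[0, 1, 2], [1, 2, 1], [0, 2, 0]])
n = 3  # assumed depth of the output along the k axis

levels = np.arange(n)               # shape (n,)
k = boundary_indices[:, :, None]    # shape (3, 3, 1), broadcasts against levels

# 7 above the boundary (level < k), 8 at and below it
filled = np.where(levels < k, 7, 8)
# Columns whose boundary index is 0 stay at 0, as in the original loop
output = np.where(k > 0, filled, 0)

print(output[:, :, 1].tolist())  # [[0, 8, 7], [8, 7, 8], [0, 7, 0]]
```

This builds temporaries of the full output size, so for very large arrays the compiled loop may still be the better choice.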
I have a 2D array and a boolean mask of the same size. I want to use the mask to coalesce consecutive rows in the 2D array. By coalesce I mean reducing each run of masked rows to its first occurrence. An example:
rows = np.r_['1,2,0', :6, :6]
mask = np.tile([1, 1, 0, 0, 1, 1], (2,1)).T.astype(bool)
Expected output:
array([[0, 0],
       [2, 2],
       [3, 3],
       [4, 4]])
And to illustrate how the output might be obtained:
array([[0, 0],               array([[0, 0],               array([[0, 0],
       [1, 1],                      [0, 0],                      [2, 2],
       [2, 2], -> select ->         [2, 2], -> reduce ->         [3, 3],
       [3, 3],                      [3, 3],                      [4, 4]])
       [4, 4],                      [4, 4],
       [5, 5]])                     [4, 4]])
What I have tried:
rows[~mask].reshape(-1,2)
But this will only select the rows which should not be reduced.
Upgraded answer
I realized that my initial submission did a lot of unnecessary operations. Given the mask
mask = [1,1,0,0,1,1,0,0,1,1,1,0]
you simply want to negate the leading one of each run of ones:
#negate:v       v       v
mask = [0,1,0,0,0,1,0,0,0,1,1,0]
and then negate the mask to get the rows you want. This is MUCH more efficient than doing a forward fill on indices and removing repeated indices (see old answer). Revised solution:
import numpy as np
rows = np.r_['1,2,0', :6, :6]
mask = np.tile([1, 1, 0, 0, 1, 1], (2,1)).T.astype(bool)
def maskforwardfill(a: np.ndarray, mask: np.ndarray):
    mask = mask.copy()
    mask[1:] = mask[1:] & mask[:-1]  # Negate leading True values
    mask[0] = False  # First element should always be False, either it is False anyways, or it is a leading True value (which should be set to False)
    return a[~mask]  # index out wanted rows
# Reduce mask's dimension since I assume that you only do complete rows
print(maskforwardfill(rows, mask.any(1)))
#[[0 0]
# [2 2]
# [3 3]
# [4 4]]
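To see the mask step in isolation, the 12-element mask from the explanation above can be run through the same two lines. This is only a small sketch; the names cleared and kept are illustrative and not part of the solution.

```python
import numpy as np

mask = np.array([1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0], dtype=bool)

cleared = mask.copy()
# A True survives only if the previous element is also True,
# so the first True of every run is dropped
cleared[1:] = mask[1:] & mask[:-1]
cleared[0] = False

# Rows to keep are those where the cleared mask is False
kept = np.flatnonzero(~cleared)

print(cleared.astype(int).tolist())  # [0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0]
print(kept.tolist())                 # [0, 2, 3, 4, 6, 7, 8, 11]
```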
Old answer
Here I assume that you only need complete rows (like in @Arne's answer). My idea is that given the mask and the corresponding array indices
mask = [1,1,0,0,1,1]
indices = [0,1,2,3,4,5]
you can use np.diff to first obtain
indices = [0,-1,2,3,4,-1]
Then forward fill the indices (where -1 acts as nan) so that you get
[0,0,2,3,4,4]
on which you can use np.unique to remove repeated indices:
[0,2,3,4] # The rows indices you want
Code:
import numpy as np
rows = np.r_['1,2,0', :6, :6]
mask = np.tile([1, 1, 0, 0, 1, 1], (2,1)).T.astype(bool)
def maskforwardfill(a: np.ndarray, mask: np.ndarray):
    mask = mask.copy()
    indices = np.arange(len(a))
    mask[np.diff(mask, prepend=[0]) == 1] = False  # set leading True to False
    indices[mask] = -1
    indices = np.maximum.accumulate(indices)  # forward fill indices
    indices = np.unique(indices)  # remove repeats
    return a[indices]  # index out wanted rows
# Reduce mask's dimension since I assume that you only do complete rows
print(maskforwardfill(rows, mask.any(1)))
#[[0 0]
# [2 2]
# [3 3]
# [4 4]]
Assuming it's always about complete rows, you can reduce the mask to one dimension. Then a straightforward approach is to iterate over the rows:
# reduce mask to one dimension for row selection
mask_1d = mask.any(axis=1)
# replace rows with previous ones based on mask
for i in range(1, len(rows)):
    if mask_1d[i-1] and mask_1d[i]:
        rows[i] = rows[i-1]
# leave out repeated rows
reduced = [rows[0]]
for i in range(1, len(rows)):
    if not (rows[i] == rows[i-1]).all():
        reduced.append(rows[i])
reduced = np.array(reduced)
reduced
array([[0, 0],
       [2, 2],
       [3, 3],
       [4, 4]])
Imagine I use the Librispeech dataset via TFDS (or any other dataset containing sequences of varying length) and then use padded_batch to create batches, e.g. like this:
import tensorflow_datasets as tfds
dataset = tfds.load(name="librispeech", split="train_clean100")
dataset = dataset.shuffle(1024)
dataset = dataset.padded_batch(32)
Now when iterating through the resulting dataset, i.e. over the (padded) batches, how would I know the original sequence lengths in the padded batch? Or is this information lost at this point? How would I extend the pipeline to include it? Is there a special dataset like AddSeqLengthInfoDataset or so? This would need to run before the padded_batch, right?
(This is basically an equivalent of my question for TF PaddingFIFOQueue but for tf.data.Dataset.)
Is there an example? (I am a bit surprised that I have not found anything about this. I would assume that knowing the original sequence lengths is a pretty standard requirement when working with sequences, or not?)
You can just add a new field to the dataset holding the size of the sequence, for example like this:
import tensorflow as tf
# Make a dataset with variable-size data
def generate_data():
    for i in range(10):
        yield {'id': i, 'data': range(i % 5)}
ds = tf.data.Dataset.from_generator(generate_data,
                                    {'id': tf.int32, 'data': tf.int32},
                                    {'id': [], 'data': [None]})
# Add field with size of data
ds = ds.map(lambda item: {**item, 'size': tf.shape(item['data'])[0]})
# Padded batch
ds = ds.padded_batch(3)
# Show dataset
for batch in ds:
    tf.print(batch)
Output:
{'data': [[0 0]
[0 0]
[0 1]], 'id': [0 1 2], 'size': [0 1 2]}
{'data': [[0 1 2 0]
[0 1 2 3]
[0 0 0 0]], 'id': [3 4 5], 'size': [3 4 0]}
{'data': [[0 0 0]
[0 1 0]
[0 1 2]], 'id': [6 7 8], 'size': [1 2 3]}
{'data': [[0 1 2 3]], 'id': [9], 'size': [4]}
Then you can use for example tf.sequence_mask with the value of that field to mask the padding values.
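As a rough illustration of what such a mask looks like, here is a NumPy sketch of the same rule that tf.sequence_mask applies (position j of row i is valid when j < size[i]). It uses the second batch from the output above; the mean computation at the end is only an assumed example of how the mask might be used.

```python
import numpy as np

# Second batch from the output above: padded data and original lengths
data = np.array([[0, 1, 2, 0],
                 [0, 1, 2, 3],
                 [0, 0, 0, 0]])
size = np.array([3, 4, 0])

# True where a position holds real data, False where it is padding
mask = np.arange(data.shape[1]) < size[:, None]

# Example use: mean over the real entries only, guarding against empty sequences
totals = (data * mask).sum(axis=1)
means = totals / np.maximum(size, 1)

print(mask.astype(int).tolist())  # [[1, 1, 1, 0], [1, 1, 1, 1], [0, 0, 0, 0]]
print(means.tolist())             # [1.0, 1.5, 0.0]
```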
Another option is simply to pass some special padding_values to padded_batch that cannot appear in the actual data, e.g. -1 or nan, but that depends on whether those are actually invalid values for your problem.
I have a set of points and their transformations (the points they became after the unknown transformation occurred), here they are:
input_coordinates = {
    'A': (5, 2),
    'B': (2, -3),
    'C': (-3, 6)}
final_coordinates = {
    'A': (2, -3),
    'B': (-3, 6),
    'C': (6, 5)}
I also have a single input point whose location in the post-transformation space I would like to infer:
x_coordinate = (5, -7)
And here they all are graphed out visually.
So, given only a map of point-to-point conversions, and assuming a linear transformation, how do I infer the post-transformation point? How do I know where to place X on the right graph?
Are there already any libraries that will do this?
With John Hughes's help I've started a function that should return the correct result for linear transformations, but I can't figure out how to finish it.
Here's the starter code for the Linear Transformation solution:
def extrapolate(domain_coordinates, result_coordinates, point):
    '''
    given a set of input coordinates and their resulting
    coordinates post-transformation, and given an additional
    input coordinate return the location (coordinate) in the
    post-transformation space that corresponds to the most
    logical linear transformation of that space for that
    additional point. "extrapolate where this point ends up"
    '''
    import numpy as np
    # Add the number 1 to the coordinates for each point
    domain = [(x, y, 1) for x, y in domain_coordinates]
    # Do the same for the "target" coordinates too
    result = [(x, y, 1) for x, y in result_coordinates]
    # Put these coordinates, arranged vertically, into a 3×3 matrix
    domain = np.array(domain).T
    result = np.array(result).T
    # D^−1 is the "matrix inverse"
    inverse = np.linalg.inv(domain)
    # Let M=RD^−1
    matrix = result * inverse  # why do I need M?...
    # Do the same for the extrapolation point
    xpoint = [(x, y, 1) for x, y in [point]]
    xpoint = np.array(xpoint).T
    # extrapolate ???
    extrapolated_point = matrix * xpoint  # this isn't right...
    # testing
    print(domain * np.array([[1],[0],[0]]).T)
    print(domain * np.array([[1],[0],[0]]).T * matrix)
    return extrapolated_point
extrapolate(
    domain_coordinates=[(5, 2), (2, -3), (-3, 6)],
    result_coordinates=[(2, -3), (-3, 6), (6, 5)],
    point=(5, -7))
This code is not working right, it prints...
[[5 0 0]
[2 0 0]
[1 0 0]]
[[ 1.73076923 0. -0.]
[-0.57692308 -0. 0.]
[-0.34615385 0. 0.]]
whereas I would expect it to print...
[[5 0 0]
[2 0 0]
[1 0 0]]
[[ 2 0. 0.]
[-3 0. 0.]
[-1 0. 0.]]
Can you show me where I've gone wrong?
Thanks so much for your help!
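One likely culprit in the code above is that * on NumPy arrays multiplies element by element, whereas the matrix product is written @. Below is a minimal sketch under that assumption. The function name extrapolate_affine is hypothetical, and the approach assumes the three domain points are not collinear, so that D is invertible.

```python
import numpy as np

def extrapolate_affine(domain_coordinates, result_coordinates, point):
    # Homogeneous coordinates, one point per column
    D = np.array([(x, y, 1.0) for x, y in domain_coordinates]).T
    R = np.array([(x, y, 1.0) for x, y in result_coordinates]).T
    # M maps every column of D onto the matching column of R
    M = R @ np.linalg.inv(D)
    p = np.array([point[0], point[1], 1.0])
    q = M @ p
    return M, q[:2]  # drop the homogeneous coordinate

M, new_point = extrapolate_affine(
    domain_coordinates=[(5, 2), (2, -3), (-3, 6)],
    result_coordinates=[(2, -3), (-3, 6), (6, 5)],
    point=(5, -7))

# Sanity check: point A = (5, 2) should land on (2, -3)
print(np.round(M @ np.array([5, 2, 1.0]), 6))
print(new_point)
```

With exactly three point pairs the transformation is determined uniquely; with more pairs a least-squares fit (for example np.linalg.lstsq) would be the natural generalization.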
I have two tensors, x and y, with the same shape, (1, 64, 1, 1).
Basically, y is the output of many functions, and x is the input.
I want to compare these two tensors using a visualization tool such as matplotlib.
Is there any way to do this?
Below are example values of x and y; I only post 10 of the 64 because of the post size restriction.
x
tensor([[[[-0.8467]],
[[-0.0949]],
[[-0.8253]],
[[-0.1027]],
[[ 0.0476]],
[[-0.4173]],
[[-0.0870]],
[[ 0.0650]],
[[ 0.3816]],
[[ 0.2046]]]], grad_fn=<MulBackward0>)
y
tensor([[[[-2.0307]],
[[-0.1594]],
[[-1.5174]],
[[-0.2767]],
[[ 0.1049]],
[[-0.9605]],
[[-0.2127]],
[[ 0.1342]],
[[ 0.8275]],
[[ 2.0508]],
]])
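You can convert x and y to NumPy arrays and then use whatever matplotlib function you want. Since x carries a grad_fn, detach it first, because calling .numpy() on a tensor that requires grad raises an error:
import matplotlib.pyplot as plt
x_np = x.detach().cpu().numpy()[0, :, 0, 0]  # make it 1d
y_np = y.detach().cpu().numpy()[0, :, 0, 0]
plt.plot(x_np - y_np)  # element-wise difference per channel
plt.show()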