NumPy matrix multiplication is 20X slower than OpenCV's cvtColor - numpy

OpenCV converts BGR images to grayscale using the linear transformation Y = 0.299R + 0.587G + 0.114B, according to their documentation.
I tried to mimic it using NumPy, by multiplying the HxWx3 BGR matrix by the 3x1 vector of coefficients [0.114, 0.587, 0.299]', a multiplication that should result in a HxWx1 grayscale image matrix.
The NumPy code is as follows:
import cv2
import numpy as np
import time
im = cv2.imread(IM_PATHS[0], cv2.IMREAD_COLOR)
# Prepare destination grayscale memory
dst = np.zeros(im.shape[:2], dtype = np.uint8)
# BGR -> Grayscale projection column vector
bgr_weight_arr = np.array((0.114,0.587,0.299), dtype = np.float32).reshape(3,1)
for im_path in IM_PATHS:
im = cv2.imread(im_path , cv2.IMREAD_COLOR)
t1 = time.time()
# NumPy multiplication comes here
dst[:,:] = (im # bgr_weight_arr).reshape(*dst.shape)
t2 = time.time()
print(f'runtime: {(t2-t1):.3f}sec')
Using 12MP images (4000x3000 pixels), the above NumPy-powered process typically takes around 90ms per image, and that is without rounding the multiplication results.
On the other hand, when I replace the matrix multiplication part by OpenCV's function: dst[:,:] = cv2.cvtColor(im, cv2.COLOR_BGR2GRAY), the typical runtime I get is around 5ms per image. I.e., 18X faster!
Can anyone explain how is that possible? I have always been taught to believe that NumPy uses all available acceleration techniques, such as SIMD. So how can OpenCV get so dramatically faster?
Update:
Even when using quantized multiplications, NumPy's runtimes stay at the same range, around 90ms...
rgb_weight_arr_uint16 = np.round(256 * np.array((0.114,0.587,0.299))).astype('uint16').reshape(3,1)
for im_path in IM_PATHS:
im = cv2.imread(im_path , cv2.IMREAD_COLOR)
t1 = time.time()
# NumPy multiplication comes here
dst[:,:] = np.right_shift(im # bgr_weight_arr_uint16, 8).reshape(*dst.shape)
t2 = time.time()
print(f'runtime: {(t2-t1):.3f}sec')

Related

numpy - find all pixels near a set of pixels

I have a PIL.Image object input of mode '1' (a black & white bitmap) and I would like to determine, for every pixel in the image, whether it's within n pixels (Euclidean distance - n may be around 100 or so) of any of the white pixels.
The motivation is: input represents every pixel that is different between two other images, and I would like to create a highlight region around all those differences to show clearly where the differences occur.
So far I haven't been able to find a fast algorithm for this - the following code works, but the convolution is very slow because the kernel argument is larger than the convolution can apparently handle efficiently:
from scipy import ndimage
import numpy as np
from PIL import Image
n = 100
y, x = np.ogrid[:2*n, :2*n]
kernel = (x-n)**2 + (y-n)**2 <= n**2
img = Image.open('input.png')
result = ndimage.convolve(np.array(img), kernel) != 0
Image.fromarray(result).save('result.png')
Example input input.png:
Desired output result.png (there are also some undesired artifacts here that I assume come from over/underflow):
Even with these small images, the computation takes 30 seconds or so.
Can someone recommend a better procedure to compute this? Thanks.
ndimage.convolve use a very inefficient algorithm to perform the convolution certainly running in O(n m kn km) where (n,m) is the shape of the image and (kn, km) is the shape of the kernel. You can use an FFT to do that much more efficiently in O(n m log(n m)) time. Hopefully, scipy provide such a function. Here is an example of usage:
import scipy.signal
import numpy as np
from PIL import Image
n = 100
y, x = np.ogrid[:2*n, :2*n]
kernel = (x-n)**2 + (y-n)**2 <= n**2
img = Image.open('input.png')
result = scipy.signal.fftconvolve(img, kernel, mode='same') >= 1.0
Image.fromarray(result).save('result.png')
This is >500 times faster on my machine and this also fix the artefacts. Here is the result:

How to remove black canvas from image in TensorFlow

I'm currenly trying working with tensorflow dataset 'tf_flowers', and noticed that a lot of images consist mostly of black canvas, like this:
flower1
flower2
Is there any easy way to remove/or filter it out? Preferably it should work on batches, and compile into a graph with #tf.function, as I plan to use it also for bigger datasets with dataset.map(...)
The black pixels are just because of padding. This is a simple operation that allows you to have network inputs having the same size (i.e. you have batches containing images with the of size: 223x221 because smaller images are padded with black pixels).
An alternative to padding that removes the need of adding black pixels to the image, is that of preprocessing the images by:
removing padding via cropping operation
resizing the cropped images to the same size (e.g. 223x221)
You can do all of these operations in simple python, thanks to tensorflow map function. First, define your python function
def py_preprocess_image(numpy_image):
input_size = numpy_image.shape # this is (223, 221)
image_proc = crop_by_removing_padding(numpy_image)
image_proc = resize(image_proc, size=input_size)
return image_proc
Then, given your tensorflow dataset train_data, map the above python function on each input:
# train_data is your tensorflow dataset
train_data = train_data.map(
lambda x: tf.py_func(preprocess_image,
inp = [x], Tout=[tf.float32]),
num_parallel_calls=num_threads
)
Now, you only need to define crop_by_removing_padding and resize, which operate on ordinary numpy arrays and can thus be written in pure python code. For example:
def crop_by_removing_padding(img):
xmax, ymax = np.max(np.argwhere(img), axis=0)
img_crop = img[:xmax + 1, :ymax + 1]
return img_crop
def resize(img, new_size):
img_rs = cv2.resize(img, (new_size[1], new_size[0]), interpolation=cv2.INTER_CUBIC)
return img_rs

Fastest way to compute many 3x3 matrix-matrix multiplications

I need to compute the combination of many 3x3 rotation matrices.
Here is a comparison of applying functools.reduce on matmul with numpy and cupy:
import timeit
from functools import reduce
import numpy as np
import cupy as cp
from pyrr.matrix33 import create_from_axis_rotation
# generate random rotation matrices
axes = np.random.rand(10000, 3)
angles = np.pi * np.random.rand(10000)
rotations = [create_from_axis_rotation(*params) for params in zip(axes, angles)]
# then reduce with matmul
xp = np # numpy
xp_rotations = [xp.asarray(rotation) for rotation in rotations]
timexp = timeit.timeit("reduce(xp.matmul, xp_rotations)", number=10, globals=globals())
print(f"{xp.__name__}: {timexp * 1000:0.3}ms")
xp = cp # cupy
xp_rotations = [xp.asarray(rotation) for rotation in rotations]
timexp = timeit.timeit("reduce(xp.matmul, xp_rotations)", number=10, globals=globals())
print(f"{xp.__name__}: {timexp * 1000:0.3}ms")
On a good machine with a Titan GPU, this gives :
numpy: 1.63e+02ms
cupy: 8.78e+02ms
For some reason the GPU is much slower.
In any case, is there a way to calculate this significantly faster ?
Edit
I found a rather simple solution, that works for all chains of small linear transformations (and can be extended to affine transformations easily).
def reduce_loop(matrices):
""" non-optimized reduce """
mat = matrices[0]
for _mat in matrices[1:]:
mat = mat # _mat
return mat
def reduce_split(matrices):
""" reduce by multiplying pairs of matrices recursively """
if len(matrices) == 1:
return matrices[0]
neven = (len(matrices) // 2) * 2
reduced = matrices[:neven:2] # matrices[1:neven:2]
if len(matrices) > neven: # len(matrices) is odd
reduced[-1] = reduced[-1] # matrices[-1]
return reduce_split(reduced)
time = timeit.timeit("reduce_loop(rotations)", number=10, globals=globals())
print(f"reduce_loop: {time * 1000:0.3}ms")
time = timeit.timeit("reduce_split(rotations)", number=10, globals=globals())
print(f"reduce_split: {time * 1000:0.3}ms")
Giving:
reduce_loop: 2.14e+02ms
reduce_split: 24.5ms
I'm sure it's not optimal, but it uses numpy's (and probably cupy's) optimization.
functools.reduce() was removed from core python because it is inefficient and not pythonic. There is no cuPy equivalent, only the host version in the functools library
your cuPy code is spending most of its time fruitlessly copying data from host to device and back again... thousands of times - because reduce() runs only on the host not on the GPU. You are straining your PCI bus, not the GPU
consider making the list “rotations” into a cuPy matrix, and then use striding (not a python list)
use a cuPy reduction kernel to do the matmul
https://docs.cupy.dev/en/stable/reference/generated/cupy.ReductionKernel.html

TensorFlow: How to apply the same image distortion to multiple images

Starting from the Tensorflow CNN example, I'm trying to modify the model to have multiple images as an input (so that the input has not just 3 input channels, but multiples of 3 by stacking images).
To augment the input, I try to use random image operations, such as flipping, contrast and brightness provided in TensorFlow.
My current solution to apply the same random distortion to all input images is to use a fixed seed value for these operations:
def distort_image(image):
flipped_image = tf.image.random_flip_left_right(image, seed=42)
contrast_image = tf.image.random_contrast(flipped_image, lower=0.2, upper=1.8, seed=43)
brightness_image = tf.image.random_brightness(contrast_image, max_delta=0.2, seed=44)
return brightness_image
This method is called multiple times for each image at graph construction time, so I thought for each image it will use the same random number sequence and consequently, it will result in have the same applied image operations for my image input sequence.
# ...
# distort images
distorted_prediction = distort_image(seq_record.prediction)
distorted_input = []
for i in xrange(INPUT_SEQ_LENGTH):
distorted_input.append(distort_image(seq_record.input[i,:,:,:]))
stacked_distorted_input = tf.concat(2, distorted_input)
# Ensure that the random shuffling has good mixing properties.
min_queue_examples = int(num_examples_per_epoch *
MIN_FRACTION_EXAMPLES_IN_QUEUE)
# Generate a batch of sequences and prediction by building up a queue of examples.
return generate_sequence_batch(stacked_distorted_input, distorted_prediction, min_queue_examples,
batch_size, shuffle=True)
In theory, this works fine. And after doing some test runs, this really seemed to solve my problem. But after a while, I found out that I'm having a race-condition, because I use the input pipeline of the CNN-example code with multiple threads (which is the suggested method in TensorFlow to improve performance and reduce memory consumption at runtime):
def generate_sequence_batch(sequence_in, prediction, min_queue_examples,
batch_size):
num_preprocess_threads = 8 # <-- !!!
sequence_batch, prediction_batch = tf.train.shuffle_batch(
[sequence_in, prediction],
batch_size=batch_size,
num_threads=num_preprocess_threads,
capacity=min_queue_examples + 3 * batch_size,
min_after_dequeue=min_queue_examples)
return sequence_batch, prediction_batch
Because multiple threads create my examples, it is not guaranteed anymore that all image operations are performed in the right order (in sense of the right order of random operations).
Here I came to a point where I got completely stuck. Does anyone know how to solve this problem to apply the same image distortion to multiple images?
Some thoughts of mine:
I thought about to do some synchronizations arround these image distortion methods, but I could find anything provided by TensorFlow
I tried to generate to generate a random number for e.g. the random brightness delta using tf.random_uniform() by myself and use this value for tf.image.adjust_contrast(). But the result of the TensorFlow random generator is always a tensor, and I have not found a way to use this tensor as a parameter for tf.image.adjust_contrast() which expects a simple float32 for its contrast_factor parameter.
A solution that would (partly) work would be to combine all images to a huge image using tf.concat(), apply random operations to change contrast and brightness, and split the image afterwards. But this would not work for random flipping, because this would (at least in my case) change the order of the images, and there is no way to detect whether tf.image.random_flip_left_right() has performed a flip or not, which would be required to fix the wrong order of images if necessary.
Here is what I came up with by looking at the code of random_flip_up_down and random_flip_left_right within tensorflow :
def image_distortions(image, distortions):
distort_left_right_random = distortions[0]
mirror = tf.less(tf.pack([1.0, distort_left_right_random, 1.0]), 0.5)
image = tf.reverse(image, mirror)
distort_up_down_random = distortions[1]
mirror = tf.less(tf.pack([distort_up_down_random, 1.0, 1.0]), 0.5)
image = tf.reverse(image, mirror)
return image
distortions = tf.random_uniform([2], 0, 1.0, dtype=tf.float32)
image = image_distortions(image, distortions)
label = image_distortions(label, distortions)
I would do something like this using tf.case. It allows you to specify what to return if certain condition holds https://www.tensorflow.org/api_docs/python/tf/case
import tensorflow as tf
def distort(image, x):
# flip vertically, horizontally, both, or do nothing
image = tf.case({
tf.equal(x,0): lambda: tf.reverse(image,[0]),
tf.equal(x,1): lambda: tf.reverse(image,[1]),
tf.equal(x,2): lambda: tf.reverse(image,[0,1]),
}, default=lambda: image, exclusive=True)
return image
def random_distortion(image):
x = tf.random_uniform([1], 0, 4, dtype=tf.int32)
return distort(image, x[0])
To check if it works.
import numpy as np
import matplotlib.pyplot as plt
# create image
image = np.zeros((25,25))
image[:10,5:10] = 1.
# create subplots
fig, axes = plt.subplots(2,2)
for i in axes.flatten(): i.axis('off')
with tf.Session() as sess:
for i in range(4):
distorted_img = sess.run(distort(image, i))
axes[i % 2][i // 2].imshow(distorted_img, cmap='gray')
plt.show()

numpy: 1d histogram based on 2d-pixel euclidean distance from center

I am using python, with scipy, numpy, etc.
I want to compute the histogram of intensity values of a grayscale image, based on the distance of the pixels to the center of mass of the image. The following solution works, but is very slow:
import matplotlib.pyplot as plt
from scipy import ndimage
import numpy as np
import math
# img is a 2-dimensionsl numpy array
img = np.random.rand(300, 300)
# center of mass of the pixels is easy to get
centerOfMass = np.array(list(ndimage.measurements.center_of_mass(img)))
# declare histogram buckets
histogram = np.zeros(100)
# declare histogram range, which is half the diagonal length of the image, enough in this case.
maxDist = len(img)/math.sqrt(2.0)
# size of the bucket might be less than the width of a pixel, which is fine.
bucketSize = maxDist/len(histogram)
# fill the histogram buckets
for i in range(len(img)):
for j in range(len(img[i])):
dist = np.linalg.norm(centerOfMass - np.array([i,j]))
if(dist/bucketSize < len(histogram)):
histogram[int(dist/bucketSize)] += img[i, j]
# plot the img array
plt.subplot(121)
imgplot = plt.imshow(img)
imgplot.set_cmap('hot')
plt.colorbar()
plt.draw()
# plot the histogram
plt.subplot(122)
plt.plot(histogram)
plt.draw()
plt.show()
As I said before, this works, but is very slow because you are not supposed to double-loop arrays in this manner in numpy. Is there a more efficient way of doing the same thing? I assume I need to apply some function on all the array elements, but I need the index coordinates as well. How can I do that? Currently it takes several seconds for a 1kx1k image, which is ridiculously slow.
All numpy binning functions (bincount, histogram, histogram2d... have a weights keyword argument you can use to do really weird things, such as yours. This is how I would do it:
rows, cols = 300, 300
img = np.random.rand(rows, cols)
# calculate center of mass position
row_com = np.sum(np.arange(rows)[:, None] * img) / np.sum(img)
col_com = np.sum(np.arange(cols) * img) / np.sum(img)
# create array of distances to center of mass
dist = np.sqrt(((np.arange(rows) - row_com)**2)[:, None] +
(np.arange(cols) - col_com)**2)
# build histogram, with intensities as weights
bins = 100
hist, edges = np.histogram(dist, bins=bins, weights=img)
# to reproduce your exact results, you must specify the bin edges
bins = np.linspace(0, len(img)/math.sqrt(2.0), 101)
hist2, edges2 = np.histogram(dist, bins=bins, weights=img)
Haven't timed both approaches, but judging from the delay when running both from the terminal, this is noticeably faster.