tf.subtract takes too long for a large array - tensorflow

TensorFlow's tf.subtract takes too long on a large array.
My workstation configuration:
CPU: Xeon E5 2699 v3
Mem: 384 GB
GPU: NVIDIA K80
CUDA: 8.5
CUDNN: 5.1
Tensorflow: 1.1.0, GPU version
The following is the test code and result.
import tensorflow as tf
import numpy as np
import time

W = 3000
H = 4000
in_a = tf.placeholder(tf.float32, (W, H))
in_b = tf.placeholder(tf.float32, (W, H))

def test_sub(number):
    sess = tf.Session()
    out = tf.subtract(in_a, in_b)
    for i in range(number):
        a = np.random.rand(W, H)
        b = np.random.rand(W, H)
        feed_dict = {in_a: a,
                     in_b: b}
        t0 = time.time()
        out_ = sess.run(out, feed_dict=feed_dict)
        t_ = (time.time() - t0) * 1000
        print "index:", str(i), " total time:", str(t_), " ms"

test_sub(20)
Results:
index: 0 total time: 338.145017624 ms
index: 1 total time: 137.024879456 ms
index: 2 total time: 132.538080215 ms
index: 3 total time: 133.152961731 ms
index: 4 total time: 132.885932922 ms
index: 5 total time: 135.06102562 ms
index: 6 total time: 136.723041534 ms
index: 7 total time: 137.926101685 ms
index: 8 total time: 133.605003357 ms
index: 9 total time: 133.143901825 ms
index: 10 total time: 136.317968369 ms
index: 11 total time: 137.830018997 ms
index: 12 total time: 135.458946228 ms
index: 13 total time: 132.793903351 ms
index: 14 total time: 144.603967667 ms
index: 15 total time: 134.593963623 ms
index: 16 total time: 135.535001755 ms
index: 17 total time: 133.697032928 ms
index: 18 total time: 136.134147644 ms
index: 19 total time: 133.810043335 ms
The test result shows that tf.subtract takes more than 130 ms to perform a 3000x4000 subtraction, which is obviously too long, especially on an NVIDIA K80 GPU.
Can anyone suggest ways to optimize this tf.subtract call?
Thanks in advance.

You're measuring not only the execution time of tf.subtract but also the time required to transfer the input data from CPU memory to GPU memory: this is your bottleneck.
To avoid it, don't use placeholders to feed the data: either generate the data with TensorFlow (if you have to generate it randomly), or, if you have to read it, use the TensorFlow input pipeline (which creates threads that read the input for you before starting and then feed the graph without ever leaving the TensorFlow graph).
It's important to do as many operations as possible within the TensorFlow graph in order to remove the data-transfer bottleneck.
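For example, here is a rough sketch (not from the original answer; same shapes as in the question) that generates the random inputs inside the graph with tf.random_uniform, so the timed sess.run no longer includes the host-to-device copy of the inputs (it still includes copying the result back to the host):

from __future__ import print_function
import time
import tensorflow as tf

W, H = 3000, 4000

# The inputs are created by the graph itself, so no feed_dict
# (and therefore no CPU-to-GPU transfer of the inputs) is needed.
in_a = tf.random_uniform((W, H), dtype=tf.float32)
in_b = tf.random_uniform((W, H), dtype=tf.float32)
out = tf.subtract(in_a, in_b)

with tf.Session() as sess:
    for i in range(20):
        t0 = time.time()
        sess.run(out)
        print("index:", i, "total time:", (time.time() - t0) * 1000, "ms")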

It sounds reasonable that the time I measured included the time to transfer the data from CPU memory to GPU memory.
Since I have to read the input data (e.g., the input data are images generated by a mobile phone and sent to TensorFlow one by one), does that mean TensorFlow placeholders must be used?
In the situation mentioned above (images generated by a mobile phone and sent to TensorFlow one by one), if two images are not generated at the same time (i.e., the second image arrives long after the first one), how can the input-pipeline threads read the input data ahead of time (the second image does not yet exist while TensorFlow is processing the first one)? Could you give me a simple example to explain the TensorFlow input pipeline?

Related

Training with Roboflow-Train-YOLOv5 stops with a '^C'

Running Roboflow's notebook, 'Roboflow-Train-YOLOv5', stops after completing the epochs loop.
Instead of reporting the results, I get the following lines, with a ^C at the end of the 3rd line
from the end.
I would like to know the reason for the failure, and whether there is a way to fix it.
10 epochs completed in 0.191 hours.
Optimizer stripped from runs/train/yolov5s_results2/weights/last.pt, 14.9MB
Optimizer stripped from runs/train/yolov5s_results2/weights/best.pt, 14.9MB
Validating runs/train/yolov5s_results2/weights/best.pt...
Fusing layers...
my_YOLOv5s summary: 213 layers, 7015519 parameters, 0 gradients, 15.8 GFLOPs
Class Images Labels P R mAP@.5 mAP@.5:.95: 20% 1/5 [00:01<00:04, 1.03s/it]^C
CPU times: user 7.01 s, sys: 830 ms, total: 7.84 s
Wall time: 12min 31s
My Colab plan is Colab Pro, so I guess it is not a problem of resources.

Redis requests done in 1 to 3 ms taking 300ms

I'm currently using a graph database on Redis (RedisGraph) for a Julia project.
Sometimes the Redis requests take 300 ms to execute and I don't understand why.
I ran a simple request 10,000 times (the code of the request is below):
using Redis, BenchmarkTools

conn = RedisConnection(port=6382)
Redis.execute_command(conn, ["FLUSHDB"])

q = string("CREATE (:Type {nature:'Test', val:'test'})")
BenchmarkTools.DEFAULT_PARAMETERS.seconds = 1000
BenchmarkTools.DEFAULT_PARAMETERS.samples = 10000

stats = @benchmark Redis.execute_command(conn, ["GRAPH.QUERY", "GraphDetection", q])
And got these results:
BenchmarkTools.Trial:
  memory estimate:  3.09 KiB
  allocs estimate:  68
  --------------
  minimum time:     1.114 ms (0.00% GC)
  median time:      1.249 ms (0.00% GC)
  mean time:        18.623 ms (0.00% GC)
  maximum time:     303.269 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
The huge difference between the median time and the mean time comes from the problem I'm talking about (the request takes either 1-3 ms or 300-310 ms).
I'm not familiar with Julia, but please note that RedisGraph reports its internal execution time; I'd suggest using that report for measurement.
In addition, it would be helpful to understand when (on which sample) RedisGraph took over 100 ms to process the query; usually it is the first query, which causes RedisGraph to do some extra work.
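As a rough sketch (shown in Python rather than Julia, and assuming a RedisGraph instance on the same port as in the question), the internal execution time can be read from the statistics section of the GRAPH.QUERY reply:

import redis

# Assumption: RedisGraph is reachable on the port used in the question.
r = redis.Redis(port=6382)

query = "CREATE (:Type {nature:'Test', val:'test'})"
reply = r.execute_command("GRAPH.QUERY", "GraphDetection", query)

# The last element of the reply is a list of statistics strings; one of them
# reports the query's internal execution time as measured by RedisGraph.
for stat in reply[-1]:
    line = stat.decode() if isinstance(stat, bytes) else stat
    if "execution time" in line:
        print(line)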

denoise image segmentation (mode filtering?) - or how to vectorize this operation in numpy?

UPDATE: In my initial post I stupidly applied stats.mode patch-wise rather than along the axis of the patches. Fixing this increased my speed by a factor of 4. However, it's still slow and my original questions still stand: (1) can I increase the speed? (2) are there different/better/standard approaches to cleaning up noisy categorical data? Back to the post:
I have some image segmentation results that are noisy and I want to clean them up. My idea was to take the mode value over (3,3) patches. This code works, but it's too slow:
import numpy as np
from sklearn.feature_extraction import image
from scipy import stats

def _mode(a, axis=None):
    m, _ = stats.mode(a, axis=axis)
    return m

def mode_smoothing(data, kernel=(3, 3)):
    patches = image.extract_patches_2d(data, kernel)
    nb_patches = patches.shape[0]
    patches = patches.reshape(nb_patches, -1)
    return _mode(patches, 1).reshape(int(np.sqrt(nb_patches)), -1)

""" original method (new version is ~ 5 times faster, but still slow)
def _mode(arr):
    m, _ = stats.mode(arr, axis=None)
    return m

def mode_smoothing(data, kernel=(3, 3)):
    patches = image.extract_patches_2d(data, kernel)
    nb_patches = patches.shape[0]
    w = int(np.sqrt(nb_patches))
    o = np.array([_mode(patches[p]) for p in range(nb_patches)])
    return o.reshape(w, -1)
"""
Questions:
Is there a way to do this that is much, much faster? Eliminate the for loop / vectorize in numpy? Port to C directly, or use numba, etc.? I struggled to get anything working along these paths.
Are there better / more standard methods for accomplishing denoising like this on categorical image data?
Here is a before/after example from the mode_smoothing method above
Below I present 2 answers to my question:
by expanding out my initial attempt into a function that numba can handle
using the above suggestion by Alex Alex, which I'll call "categorical smoothing" (is there a standard name for this method?)
I haven't written out a mathematical proof yet, but it appears this patch-wise mode smoothing is equivalent to the categorical smoothing for the correct choice of parameters. Both lead to a big speed boost, but the categorical-smoothing solution is cleaner, faster, and doesn't involve numba - so it wins.
NUMBA
from numba import njit
import numpy as np

@njit
def mode_smoothing(data, kernel=(3, 3), step=(1, 1), edges=False, high_value=False, center_boost=False):
    """ mode smoothing over patches
    Args:
        data<np.array>: numpy array (ndim=2)
        kernel<tuple[int]>: (height,width) of patch window
        step<tuple[int]>: (y-step,x-step)
        edges<bool>:
            - if true
                * include edge patches by taking the mode over a smaller patch window
                * the returned image will be the same shape as the input data
            - if false
                * only run over patches with the full kernel size
                * the returned image will be reduced in size by the radius of the kernel
        high_value<bool>:
            when there are multiple possible mode values choose the highest if true,
            otherwise choose the lowest value
        center_boost<int|bool>:
            if true, instead of using the pure mode value increase the count on the center pixel
    Return:
        <np.array> of patch-wise mode values. shape may be different than input. see `edges` above
    """
    h, w = data.shape
    ry = int(kernel[0] // 2)
    rx = int(kernel[1] // 2)
    sy, sx = step
    _mode_vals = []
    if edges:
        j0, j1 = 0, h
        i0, i1 = 0, w
    else:
        j0, j1 = ry, h - ry
        i0, i1 = rx, w - rx
    for j in range(j0, j1, sy):
        for i in range(i0, i1, sx):
            ap = data[
                max(j - ry, 0):j + ry + 1,
                max(i - rx, 0):i + rx + 1]
            cv = data[j, i]
            values = np.unique(ap)
            count = 0
            for v in values:
                newcount = (ap == v).sum()
                if center_boost and (v == cv):
                    newcount += center_boost
                if high_value:
                    test = newcount >= count
                else:
                    test = newcount > count
                if test:
                    count = newcount
                    mode_value = v
            _mode_vals.append(mode_value)
    return np.array(_mode_vals).reshape(j1 - j0, i1 - i0)
CATEGORICAL SMOOTHING
import numpy as np
from scipy.signal import convolve2d

KERNEL = np.ones((3, 3))

def categorical_smoothing(data, nb_categories, kernel=KERNEL):
    # one-hot encode the categorical image along a new leading axis
    data = np.eye(nb_categories)[:, data]
    # smooth each category plane, then take the most common category per pixel
    for i in range(nb_categories):
        data[i] = convolve2d(data[i], kernel, mode='same')
    return data.argmax(axis=0)
EQUIVALENCE/SPEED-CHECK
This is probably easy to prove but...
S=512
N=19
data=np.random.randint(0,N,(S,S))
%time o1=mode_smoothing(data,edges=True,center_boost=False)
kernel=np.ones((3,3))
%time o2=categorical_smoothing(data,N,kernel=kernel)
print((o1==o2).all())
print()
data=np.random.randint(0,N,(S,S))
%time o1=mode_smoothing(data,edges=True,center_boost=1)
kernel=np.ones((3,3))
kernel[1,1]=2
%time o2=categorical_smoothing(data,N,kernel=kernel)
print((o1==o2).all())
""" OUTPUT
CPU times: user 826 ms, sys: 0 ns, total: 826 ms
Wall time: 825 ms
CPU times: user 416 ms, sys: 7.83 ms, total: 423 ms
Wall time: 423 ms
True
CPU times: user 825 ms, sys: 3.78 ms, total: 829 ms
Wall time: 828 ms
CPU times: user 422 ms, sys: 3.91 ms, total: 426 ms
Wall time: 425 ms
True
"""

TensorFlow Norm (LRN) doesn't support GPU

I am running the following code on Google Cloud ML using BASIC GPU (Tesla K80):
https://github.com/tensorflow/models/blob/master/tutorials/image/cifar10
LRN is taking the most time and it runs on the CPU. I am wondering whether the following stats quoted in https://github.com/tensorflow/models/blob/master/tutorials/image/cifar10/cifar10_train.py were obtained by running on a CPU, because I don't see that being the case.
System | Step Time (sec/batch) | Accuracy
1 Tesla K20m | 0.35-0.60 | ~86% at 60K steps (5 hours)
If I force it to run on the GPU it throws the following error:
Cannot assign a device to node 'norm1': Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available. [[Node: norm1 = LRN[T=DT_HALF, alpha=0.00011111111, beta=0.75, bias=1, depth_radius=4, _device="/device:GPU:0"]]]

How can I speed up application of Log(x+1) to a sparse array in Julia

A sparse matrix in Julia only stores nonzero elements.
Some functions, such as log(x+1) (in all bases),
map zero to zero, and thus don't need to be applied to those zero elements.
(I think we would call this a Monoid homomorphism.)
How can I use this fact to speed up an operation?
Example code:
X = sprand(10^4, 10^4, 10.0^-5, rand)

function naiveLog2p1(N::SparseMatrixCSC{Float64,Int64})
    log2(1+N) |> sparse
end
Running:
@time naiveLog2p1(X)
Output is:
elapsed time: 2.580125482 seconds (2289 MB allocated, 6.86% gc time in 3 pauses with 0 full sweep)
On a second time (so that the function is expected to be already compiled):
elapsed time: 2.499118888 seconds (2288 MB allocated, 8.17% gc time in 3 pauses with 0 full sweep)
Little change, presumably because it is so simple to compile.
As per the suggestion of the Julia manual on "Sparse matrix operations", I would extract the nonzero entries with findnz(), do the log operation on the values, and then reconstruct the sparse matrix with sparse().
function improvedLog2p1(N::SparseMatrixCSC{Float64,Int64})
    I, J, V = findnz(N)
    return sparse(I, J, log2(1+V))
end

@time improvedLog2p1(X)
elapsed time: 0.000553508 seconds (473288 bytes allocated)
My solution would be to actually operate on the inside of the data structure itself:
mysparselog(N::SparseMatrixCSC) =
    SparseMatrixCSC(N.m, N.n, copy(N.colptr), copy(N.rowval), log2(1+N.nzval))
Note that if you want to operate on the sparse matrix in place, which I imagine would be fairly common in practice, this would be a zero-memory operation. Benchmarking reveals this performs similarly to the @Oxinabox answer, as it is about the same in terms of memory operations (although that answer doesn't actually return the new matrix, as shown by the mean output):
with warmup times removed:
naiveLog2p1
elapsed time: 1.902405905 seconds (2424151464 bytes allocated, 10.35% gc time)
mean(M) => 0.005568094618997372
mysparselog
elapsed time: 0.022551705 seconds (24071168 bytes allocated)
elapsed time: 0.025841895 seconds (24071168 bytes allocated)
mean(M) => 0.005568094618997372
improvedLog2p1
elapsed time: 0.018682775 seconds (32068160 bytes allocated)
elapsed time: 0.027129497 seconds (32068160 bytes allocated)
mean(M) => 0.004995127985160583
What you are looking for is the sparse nonzeros function.
nonzeros(A)
Return a vector of the structural nonzero values in sparse matrix A. This includes zeros that are explicitly stored in the sparse
matrix. The returned vector points directly to the internal nonzero
storage of A, and any modifications to the returned vector will mutate
A as well.
You can use this as below:
function improvedLog2p1(N::SparseMatrixCSC{Float64,Int64})
    M = copy(N)
    ms = nonzeros(M)  # creates a view,
    ms = log2(1+ms)   # changes to ms, change M
    M
end

@time improvedLog2p1(X)
Running it for the first time, the output is:
elapsed time: 0.002447847 seconds (157 kB allocated)
Running it a second time, the output is:
0.000102335 seconds (109 kB allocated)
That is a 4 orders of magnitude improvement in speed and memory use.