denoise image segmentation (mode filtering?) - or how to vectorize this operation in numpy?

UPDATE: In my initial post I stupidly applied stats.mode patch-wise rather than along the axis of the patches. Fixing this increased my speed by a factor of 4. However, it's still slow and my original questions still stand: (1) can I increase the speed? (2) are there different/better/standard approaches to cleaning up noisy categorical data? Back to the post:
I have some image segmentation results that are noisy and I want to clean them up. My idea was to take the mode value over (3,3) patches. This code works, but it's too slow:
import numpy as np
from sklearn.feature_extraction import image
from scipy import stats

def _mode(a, axis=None):
    m, _ = stats.mode(a, axis=axis)
    return m

def mode_smoothing(data, kernel=(3, 3)):
    patches = image.extract_patches_2d(data, kernel)
    nb_patches = patches.shape[0]
    patches = patches.reshape(nb_patches, -1)
    return _mode(patches, 1).reshape(int(np.sqrt(nb_patches)), -1)
""" original method (new version is ~ 5 times faster, but still slow)
def _mode(arr):
m,_=stats.mode(arr,axis=None)
return m
def mode_smoothing(data,kernel=(3,3)):
patches=image.extract_patches_2d(data,kernel)
nb_patches=patches.shape[0]
w=int(np.sqrt(nb_patches))
o=np.array([_mode(patches[p]) for p in range(nb_patches)])
return o.reshape(w,-1)
"""
Questions:
Is there a way to do this that is much, much faster? Eliminate the for loop / vectorize in numpy? Port to C directly, or use numba, etc.? I struggled to get anything working along these paths.
Are there better / more standard methods for accomplishing denoising like this on categorical image data?
Here is a before/after example from the mode_smoothing method above

Below I present 2 answers to my question:
by expanding out my initial attempt into a function that numba can handle
using the above suggestion by Alex Alex, which I'll call "categorical-smoothing" (is there a standard name for this method?)
I haven't written out a mathematical proof yet, but it appears this patch-wise mode smoothing is equivalent to the categorical smoothing for the right choice of parameters. Both lead to a big speed boost, but the categorical-smoothing solution is cleaner, faster, and doesn't involve numba, so it wins.
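A sketch of why the two should agree (not a full proof, just the way I think about it): for each category c, the one-hot plane 1_c(i,j) = [data(i,j) = c], convolved with a kernel of ones, gives at every pixel the count of category c inside the surrounding patch,

$$(\mathbf{1}_c * K)(i,j) \;=\; \sum_{(u,v)\,\in\,\mathrm{patch}(i,j)} [\,\mathrm{data}(u,v)=c\,],$$

so taking the argmax over c returns the patch mode. Ties are broken toward the lowest category value in both methods (matching high_value=False), and center_boost corresponds to adding that amount to the kernel's center entry.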
NUMBA
from numba import njit
import numpy as np

@njit
def mode_smoothing(data, kernel=(3, 3), step=(1, 1), edges=False, high_value=False, center_boost=False):
    """ mode smoothing over patches

    Args:
        data<np.array>: numpy array (ndim=2)
        kernel<tuple[int]>: (height,width) of patch window
        step<tuple[int]>: (y-step,x-step)
        edges<bool>:
            - if true
                * include edge patches by taking the mode over a smaller patch window
                * the returned image will be the same shape as the input data
            - if false
                * only run over patches with the full kernel size
                * the returned image will be reduced in size by the radius of the kernel
        high_value<bool>:
            when there are multiple possible mode values, choose the highest if true,
            otherwise choose the lowest value
        center_boost<int|bool>:
            if set, instead of using the pure mode value, increase the count on the
            center pixel by this amount
    Return:
        <np.array> of patch-wise mode values. shape may differ from the input. see `edges` above
    """
    h, w = data.shape
    ry = int(kernel[0] // 2)
    rx = int(kernel[1] // 2)
    sy, sx = step
    _mode_vals = []
    if edges:
        j0, j1 = 0, h
        i0, i1 = 0, w
    else:
        j0, j1 = ry, h - ry
        i0, i1 = rx, w - rx
    for j in range(j0, j1, sy):
        for i in range(i0, i1, sx):
            ap = data[
                max(j - ry, 0):j + ry + 1,
                max(i - rx, 0):i + rx + 1]
            cv = data[j, i]
            values = np.unique(ap)
            count = 0
            for v in values:
                newcount = (ap == v).sum()
                if center_boost and (v == cv):
                    newcount += center_boost
                if high_value:
                    test = newcount >= count
                else:
                    test = newcount > count
                if test:
                    count = newcount
                    mode_value = v
            _mode_vals.append(mode_value)
    return np.array(_mode_vals).reshape(j1 - j0, i1 - i0)
CATEGORICAL SMOOTHING
import numpy as np
from scipy.signal import convolve2d

KERNEL = np.ones((3, 3))

def categorical_smoothing(data, nb_categories, kernel=KERNEL):
    data = np.eye(nb_categories)[:, data]
    for i in range(nb_categories):
        data[i] = convolve2d(data[i], kernel, mode='same')
    return data.argmax(axis=0)
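As an aside, here is a tiny sketch (toy labels of my own) of what the np.eye(nb_categories)[:,data] line does: it builds one indicator plane per category, and the per-category convolution then simply counts members of that category in each neighborhood.

import numpy as np

labels = np.array([[0, 1],
                   [2, 1]])

# np.eye(N)[:, labels] turns an (H, W) label image into an (N, H, W) stack of
# one-hot indicator planes, one plane per category
one_hot = np.eye(3)[:, labels]
print(one_hot.shape)  # (3, 2, 2)
print(one_hot[1])     # indicator plane for category 1: [[0. 1.], [0. 1.]]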
EQUIVALENCE/SPEED-CHECK
This is probably easy to prove but...
S=512
N=19
data=np.random.randint(0,N,(S,S))
%time o1=mode_smoothing(data,edges=True,center_boost=False)
kernel=np.ones((3,3))
%time o2=categorical_smoothing(data,N,kernel=kernel)
print((o1==o2).all())
print()
data=np.random.randint(0,N,(S,S))
%time o1=mode_smoothing(data,edges=True,center_boost=1)
kernel=np.ones((3,3))
kernel[1,1]=2
%time o2=categorical_smoothing(data,N,kernel=kernel)
print((o1==o2).all())
""" OUTPUT
CPU times: user 826 ms, sys: 0 ns, total: 826 ms
Wall time: 825 ms
CPU times: user 416 ms, sys: 7.83 ms, total: 423 ms
Wall time: 423 ms
True
CPU times: user 825 ms, sys: 3.78 ms, total: 829 ms
Wall time: 828 ms
CPU times: user 422 ms, sys: 3.91 ms, total: 426 ms
Wall time: 425 ms
True
"""

Related

Training with Roboflow-Train-YOLOv5 stops with a '^C'

Running Roboflow's notebook, 'Roboflow-Train-YOLOv5', stops after completing the epochs loop.
Instead of reporting the results, I get the following lines, with a ^C at the end of the 3rd line
from the end.
I would like to know the reason for the failure, and whether there is a way to fix it.
10 epochs completed in 0.191 hours.
Optimizer stripped from runs/train/yolov5s_results2/weights/last.pt, 14.9MB
Optimizer stripped from runs/train/yolov5s_results2/weights/best.pt, 14.9MB
Validating runs/train/yolov5s_results2/weights/best.pt...
Fusing layers...
my_YOLOv5s summary: 213 layers, 7015519 parameters, 0 gradients, 15.8 GFLOPs
Class Images Labels P R mAP@.5 mAP@.5:.95: 20% 1/5 [00:01<00:04, 1.03s/it]^C
CPU times: user 7.01 s, sys: 830 ms, total: 7.84 s
Wall time: 12min 31s
My Colab plan is Colab pro, so I guess it is not a problem of resources.

Loading hdf5 with resizable length is very very slow compared with non resizable

I have an original hdf5 file with a dataset that's of shape (3737, 224, 224, 3) and was not extendable, i.e. no maxshape argument was passed during its creation.
I decided to create a new hdf5 file and create the dataset with maxshape=(None, 224, 224, 3) such that I can resize it later. I then just copied the dataset from the original hdf5 to this new one, and saved.
The contents of the two hdf5 are exactly the same. I then tried to read all the data back, and I found significant performance degradation for the resizable version.
Original:
CPU times: user 660 ms, sys: 2.58 s, total: 3.24 s
Wall time: 6.08 s
Resizable:
CPU times: user 18.6 s, sys: 4.41 s, total: 23 s
Wall time: 49.5 s
That's almost 10 times as slow. Is this to be expected? The file size difference is less than 2 MB. Are there optimization tips/tricks I need to be aware of?
Upon reading the hdf5 docs carefully, it seems that specifying a maxshape during creation of a dataset (which enables it to be resized in the future) also turns chunking on, and this appears to be mandatory. The chunk size it chose by default is dataset.chunks = (234, 14, 28, 1).
According to the docs, this means the data are not stored contiguously but "haphazardly" in a B-tree-like structure. This most likely explains the slowness I observed; it is probably doing much more extra I/O than I thought.
I set the chunk size to be the entire dataset size by passing chunks=(3737, 224, 224, 3), and this time I got:
CPU times: user 809 µs, sys: 837 ms, total: 838 ms
Wall time: 914 ms
That's a big speed-up in loading my (3737, 224, 224, 3) tensor. I sort of understand why chunking is a scalability solution, but the fact that it magically assigns a chunk size is confusing. My context is mini-batch training for deep learning, so the optimal thing would be for each chunk to be a mini-batch.
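For what it's worth, here is a minimal h5py sketch (file name, dtype and the batch size of 32 are made up) of creating a resizable dataset while pinning the chunk shape to one mini-batch instead of accepting the auto-chosen chunks:

import h5py
import numpy as np

BATCH = 32  # hypothetical mini-batch size

with h5py.File("images_resizable.h5", "w") as f:
    dset = f.create_dataset(
        "images",
        shape=(0, 224, 224, 3),        # start empty
        maxshape=(None, 224, 224, 3),  # resizable along the first axis
        chunks=(BATCH, 224, 224, 3),   # one chunk per mini-batch
        dtype="float32",
    )
    # append a batch by resizing first, then writing into the new slots
    batch = np.zeros((BATCH, 224, 224, 3), dtype="float32")
    dset.resize(dset.shape[0] + BATCH, axis=0)
    dset[-BATCH:] = batch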

Avoiding exhausting GPU resources in convNN Tensorflow

I'm trying to run a hyperparameter optimization script, for a convNN using Tensorflow.
As you may know, TF's handling of GPU memory isn't that fancy (I don't think it ever will be, thanks to the TPU). So my question is: how do I choose the filter dimensions and the batch size so that GPU memory doesn't get exhausted?
Here's the equation that I'm thinking of:
image_shape = 128x128x3 (3 color channels)
batch_size = 20 (the smallest possible batch size, since I have 20 classes)
filter_shape = fw_fh_fd [filter_width=4, filter_height=4, filter_depth=32]
As far as I understand, using the tf.conv2d function will need the following amount of memory:
image_width * image_height * num_channels * batch_size * filter_height * filter_width * filter_depth * 32 bit
since we're using the tf.float32 type for each pixel.
In the given example, the needed memory will be:
128 * 128 * 3 * 20 * 4 * 4 * 32 * 32 = 16106127360 (bits), which is almost 16 GB of memory.
I'm not sure the formula is correct, so I hope to get a validation or a correction of what I'm missing.
Actually, this will take only about 44MB of memory, mostly taken by the output.
Your input is 20x128x128x3
The convolution kernel is 4x4x3x32
The output is 20x128x128x32
When you sum up the total, you get
(20*128*128*3 + 4*4*3*32 + 20*128*128*32) * 4 / 1024**2 ≈ 44MB
(In the above, 4 is for the size in bytes of float32 and 1024**2 is to get the result in MB).
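Spelling the same arithmetic out (numbers copied from above):

# per-tensor element counts for one conv layer (NHWC input, HWIO kernel)
input_elems  = 20 * 128 * 128 * 3    # batch of 128x128 RGB images
kernel_elems = 4 * 4 * 3 * 32        # 4x4 filters, 3 in-channels, 32 out-channels
output_elems = 20 * 128 * 128 * 32   # stride-1, "same"-padded output

total_mb = (input_elems + kernel_elems + output_elems) * 4 / 1024**2
print(round(total_mb, 2))            # ~43.75 MB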
Your batch size can be smaller than your number of classes. Think about ImageNet and its 1000 classes: people are training with batch sizes 10 times smaller.
EDIT
Here is a tensorboard screenshot of the net — it reports 40MB rather than 44MB, probably because it excludes the input — and you also have all the tensor sizes I mentioned earlier.

Groupby/Transform much better in 14.1 but still way slower than workaround

Edit to add: This operation appears to be tremendously improved due to the GIL unlocking in version 0.17.0 of pandas (and other improvements since version 0.14.1 and earlier). See updated benchmarks at the bottom of this question.
This is a followup to this very useful Q/A: Faster way to transform group with mean value in Pandas
I just updated from 0.14.0 to 0.14.1 to see how much improvement there was in groupby/transform operations. In brief, the improvement is substantial, but it's still much slower than the workaround and essentially unusable for the data I'm working with.
Here's an example with 100,000 obs with 3 obs per group:
import numpy as np
import pandas as pd
from pandas import DataFrame

df = DataFrame({'id': np.arange(100000) / 3,   # integer division under Python 2.7: 3 obs per group
                'val': np.random.randn(100000)})
grp = df.groupby('id')['val']
a = pd.Series(np.repeat(grp.mean().values, grp.count().values))
b = grp.transform(np.mean)
"a" is the awesome workaround from Mr E and Jeff (see link above) for which I am very grateful and "b" is what I think is the standard approach for this case.
In [42]: (a==b).all()
Out[42]: True
In [43]: %timeit pd.Series(np.repeat(grp.mean().values, grp.count().values))
100 loops, best of 3: 3.34 ms per loop
In [44]: %timeit grp.transform(np.mean)
1 loops, best of 3: 4.61 s per loop
Note that's "ms" vs "s" so 1000x difference! I tried to be careful here and do a fair comparison. Please let me know if I screwed that up somehow. I don't understand numpy/pandas internals very well, but assume they are both using the same underlying np.mean function?
More info:
In [61]: %timeit grp.transform('mean')
1 loops, best of 3: 4.59 s per loop
In [62]: pd.__version__
Out[62]: '0.14.1'
~/google drive/data>python -V
Python 2.7.8 :: Anaconda 2.0.1 (x86_64)
I've got a 13 inch macbook air (2012) and am using all Anaconda defaults except:
conda install pandas=0.14.1
Edit to add: Here are some updated benchmarks. I'm using a faster computer now so this will compare 0.16.2 and 0.17.0 on a macbook pro (15 inch, mid 2015).
version 0.16.2
%timeit pd.Series(np.repeat(grp.mean().values, grp.count().values))
100 loops, best of 3: 2.71 ms per loop
%timeit grp.transform(np.mean)
100 loops, best of 3: 18.9 ms per loop
version 0.17.0
%timeit pd.Series(np.repeat(grp.mean().values, grp.count().values))
100 loops, best of 3: 2.05 ms per loop
%timeit grp.transform(np.mean)
1000 loops, best of 3: 1.45 ms per loop
The perf improvement in 0.14.1 in this PR didn't touch the case of a cythonized function being passed directly (or via name); rather, it improved the perf of generic functions (e.g. a passed lambda) by optimizing how the results were set. This PR addresses (and uses) the above solution to provide a substantial perf improvement when using a cythonized (internal) function, e.g. 'mean' in this case.
In the test example it goes from 3.6 s to 100 ms. Note this is not quite as good as your example above, because you have an implicit optimization, namely that the group ordering is monotonically increasing, i.e. your groups are in the same order and not interspersed with each other.
Pandas will handle both cases, but checking whether the index is actually monotonic takes a small amount of time (hence the difference).
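To make the ordering caveat concrete, here is a small sketch (toy data, not from the original post) showing that the np.repeat workaround assumes contiguous, sorted group ids, while transform does not:

import numpy as np
import pandas as pd

df = pd.DataFrame({'id': np.arange(12) // 3,
                   'val': np.random.randn(12)})
shuffled = df.sample(frac=1, random_state=0)  # groups now interspersed

grp = shuffled.groupby('id')['val']
workaround = pd.Series(np.repeat(grp.mean().values, grp.count().values),
                       index=shuffled.index)  # block-ordered values land on the wrong rows
correct = grp.transform('mean')               # aligned to each row's own group

print((workaround == correct).all())          # generally False once groups are interspersed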
This is merged into master/0.15.0 (releasing probably end of September), though you can simply clone from master, and Windows binaries are posted frequently.

Doing efficient Numerics in Haskell

I was inspired by this post called "Only fast languages are interesting" to look at the problem he suggests (summing a couple of million numbers from a vector) in Haskell and compare to his results.
I'm a Haskell newbie, so I don't really know how to time correctly or how to do this efficiently; my first attempt at this problem was the following. Note that I'm not using random numbers in the vector, as I'm not sure how to do that in a good way. I'm also printing stuff in order to ensure full evaluation.
import System.TimeIt
import Data.Vector as V

vector :: IO (Vector Int)
vector = do
    let vec = V.replicate 3000000 10
    print $ V.length vec
    return vec

sumit :: IO ()
sumit = do
    vec <- vector
    print $ V.sum vec

time = timeIt sumit
Loading this up in GHCi and running time tells me that it takes about 0.22s for 3 million numbers and 2.69s for 30 million numbers.
Compared to the blog author's results of 0.02s and 0.18s in Lush, that's quite a lot worse, which leads me to believe this can be done in a better way.
Note: The above code needs the package TimeIt to run. cabal install timeit will get it for you.
First of all, realize that GHCi is an interpreter, and it's not designed to be very fast. To get more useful results you should compile the code with optimizations enabled. This can make a huge difference.
Also, for any serious benchmarking of Haskell code, I recommend using criterion. It uses various statistical techniques to ensure that you're getting reliable measurements.
I modified your code to use criterion and removed the print statements so that we're not timing the I/O.
import Criterion.Main
import Data.Vector as V

vector :: IO (Vector Int)
vector = do
    let vec = V.replicate 3000000 10
    return vec

sumit :: IO Int
sumit = do
    vec <- vector
    return $ V.sum vec

main = defaultMain [bench "sumit" $ whnfIO sumit]
Compiling this with -O2, I get this result on a pretty slow netbook:
$ ghc --make -O2 Sum.hs
$ ./Sum
warming up
estimating clock resolution...
mean is 56.55146 us (10001 iterations)
found 1136 outliers among 9999 samples (11.4%)
235 (2.4%) high mild
901 (9.0%) high severe
estimating cost of a clock call...
mean is 2.493841 us (38 iterations)
found 4 outliers among 38 samples (10.5%)
2 (5.3%) high mild
2 (5.3%) high severe
benchmarking sumit
collecting 100 samples, 8 iterations each, in estimated 6.180620 s
mean: 9.329556 ms, lb 9.222860 ms, ub 9.473564 ms, ci 0.950
std dev: 628.0294 us, lb 439.1394 us, ub 1.045119 ms, ci 0.950
So I'm getting an average of just over 9 ms with a standard deviation of less than a millisecond. For the larger test case, I'm getting about 100ms.
Enabling optimizations is especially important when using the vector package, as it makes heavy use of stream fusion, which in this case is able to eliminate the data structure entirely, turning your program into an efficient, tight loop.
It may also be worthwhile to experiment with the new LLVM-based code generator by using the -fllvm option. It is apparently well-suited for numeric code.
Your original file, uncompiled, then compiled without optimization, then compiled with a simple optimization flag:
$ runhaskell boxed.hs
3000000
30000000
CPU time: 0.35s
$ ghc --make boxed.hs -o unoptimized
$ ./unoptimized
3000000
30000000
CPU time: 0.34s
$ ghc --make -O2 boxed.hs
$ ./boxed
3000000
30000000
CPU time: 0.09s
Your file with import qualified Data.Vector.Unboxed as V instead of import qualified Data.Vector as V (Int is an unboxable type) --
first without optimization then with:
$ ghc --make unboxed.hs -o unoptimized
$ ./unoptimized
3000000
30000000
CPU time: 0.27s
$ ghc --make -O2 unboxed.hs
$ ./unboxed
3000000
30000000
CPU time: 0.04s
So, compile, optimize ... and where possible use Data.Vector.Unboxed
Try to use an unboxed vector, although I'm not sure whether it makes a noticeable difference in this case. Note also that the comparison is slightly unfair, because the vector package should optimize the vector away entirely (this optimization is called stream fusion).
If you use big enough vectors, Data.Vector.Unboxed might become impractical. For me, pure (lazy) lists are quicker if the vector size is > 50000000:
import System.TimeIt
sumit :: IO ()
sumit = print . sum $ replicate 50000000 10
main :: IO ()
main = timeIt sumit
I get these times:
Unboxed Vectors
CPU time: 1.00s
List:
CPU time: 0.70s
Edit: I've repeated the benchmark using Criterion and making sumit pure. Code and results follow:
Code:
import Criterion.Main
sumit :: Int -> Int
sumit m = sum $ replicate m 10
main :: IO ()
main = defaultMain [bench "sumit" $ nf sumit 50000000]
Results:
warming up
estimating clock resolution...
mean is 7.248078 us (80001 iterations)
found 24509 outliers among 79999 samples (30.6%)
6044 (7.6%) low severe
18465 (23.1%) high severe
estimating cost of a clock call...
mean is 68.15917 ns (65 iterations)
found 7 outliers among 65 samples (10.8%)
3 (4.6%) high mild
4 (6.2%) high severe
benchmarking sumit
collecting 100 samples, 1 iterations each, in estimated 46.07401 s
mean: 451.0233 ms, lb 450.6641 ms, ub 451.5295 ms, ci 0.950
std dev: 2.172022 ms, lb 1.674497 ms, ub 2.841110 ms, ci 0.950
It looks like print makes a big difference, as is to be expected!