I have the following simple code:
import tensorflow as tf

a = tf.constant([7, 6, 7, 4, 5, 4], dtype=tf.float32)
e = tf.constant(5.2, dtype=tf.float32)
with tf.Session() as sess:
    print(sess.run(a - e))
the outcome of this subtraction is
[ 1.8000002 0.8000002 1.8000002 -1.1999998 -0.19999981 -1.1999998 ]
instead of
[ 1.8 0.8 1.8 -1.2 -0.2 -1.2 ]
That is very weird. What could possibly be the problem?
Put simply, the reason is that 5.2 does not have a terminating binary representation. If you know what 1/3 looks like in decimal, just imagine the same situation for 5.2 in binary. Always remember the following point:
A decimal number will have a terminating binary representation if and
only if, the decimal written in its simplest fraction form, has a
denominator which is a power of 2.
If you think of 5.2, it is 52/10, or 26/5 in simplest form, and 5 is not a power of 2. Now, float32 is single-precision arithmetic, so there are only 32 bits available to represent it. It therefore ends up storing a number that is quite close to 5.2 but not exactly equal to it.
So, when TensorFlow computes the subtraction, it gives you a slightly different result.
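As a quick, framework-independent illustration (plain Python, not part of the original question), you can inspect the exact values that get stored for the literal 5.2:

from decimal import Decimal
import numpy as np

# The closest 64-bit float to 5.2 is already slightly off:
print(Decimal(5.2))
# 5.20000000000000017763568394002504646778106689453125

# In 32-bit precision the stored value is further away still:
print(Decimal(float(np.float32(5.2))))
# 5.19999980926513671875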
If you convert the numbers to tf.float64, however, the error will seem to disappear, although the representation is still not exact. This is just the print function trying to make your life simpler: it is not showing you the full picture. In 64-bit arithmetic the difference between the true value and the internally represented value is much smaller than in 32-bit arithmetic, so print sees a long run of zeros and truncates them.
But if you use something like print('%.60f' % <YOUR RESULT>) you will see something like
with tf.Session() as sess:
    c = sess.run(a - e)
for item in c:
    print('%.60f' % item)
1.780000000000000248689957516035065054893493652343750000000000
0.780000000000000248689957516035065054893493652343750000000000
1.780000000000000248689957516035065054893493652343750000000000
-1.219999999999999751310042483964934945106506347656250000000000
-0.219999999999999751310042483964934945106506347656250000000000
-1.219999999999999751310042483964934945106506347656250000000000
This question has been asked multiple times, but I still could not find what I was looking for. Imagine
data = np.random.rand(N, N)    # shape N x N
kernel = np.random.rand(3, 3)  # shape M x M
I know convolution typically means sliding the kernel all over the data, but in my case N and M are on the order of 10000, so I wish to get the value of the convolution at a specific location in the data, say at (10, 37), without doing unnecessary calculations at all locations. The output will be just a number. The main goal is to reduce the computation and memory expenses. Is there any built-in function that does this with minimal adjustments?
Indeed, evaluating the convolution at a particular position amounts to summing the entries of the pointwise product of the corresponding submatrix of data with the flipped kernel. Here is a reproducible example.
Code
import numpy as np

N = 1000
M = 3
np.random.seed(777)
data = np.random.rand(N, N)    # shape N x N
kernel = np.random.rand(M, M)  # shape M x M

# Pointwise product of the submatrix with the flipped kernel
data[10:10+M, 37:37+M] * kernel[::-1, ::-1]
>array([[0.70980514, 0.37426475, 0.02392947],
        [0.24387766, 0.1985901 , 0.01103323],
        [0.06321042, 0.57352696, 0.25606805]])
Summing these entries gives the convolution value:
conv = np.sum(data[10:10+M,37:37+M]*kernel[::-1, ::-1])
conv
>2.45430578
The kernel is flipped by the definition of convolution, as explained here; this was kindly pointed out by Warren Weckesser. Thanks!
The key is to make sense of the index you provided. I assumed it refers to the upper-left corner of the submatrix in data, but it can just as well refer to the midpoint of the window when M is odd (see the short sketch below).
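Continuing with the arrays defined in the code above, a one-line sketch of the midpoint interpretation (illustrative, not part of the original computation; it assumes M is odd and (i, j) lies far enough from the boundary):

# Treat (i, j) as the centre of the M x M window by shifting the slice by M // 2
i, j, r = 10, 37, M // 2
conv_centered = np.sum(data[i-r:i+r+1, j-r:j+r+1] * kernel[::-1, ::-1])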
Concept
A different example with N=7 and M=3 illustrates the idea and is shown here for the kernel
kernel = np.array([[3,0,-1], [2,0,1], [4,4,3]])
which, when flipped, yields
kernel[::-1, ::-1]
> array([[ 3, 4, 4],
[ 1, 0, 2],
[-1, 0, 3]])
EDIT 1:
Please note that the lecturer in this video does not explicitly mention that flipping the kernel is required before the pointwise multiplication to adhere to the mathematically proper definition of convolution.
EDIT 2:
For large M and a target index close to the boundary of data, a ValueError: operands could not be broadcast together with shapes ... might be thrown. Padding the matrix data with zeros prevents this (although it increases the memory requirement), i.e.
data = np.pad(data, pad_width=M, mode='constant')
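Putting the pieces together, a boundary-safe helper could look roughly like this (a sketch only; conv_at is a hypothetical name, the kernel is assumed square, and (i, j) is taken as the upper-left corner of the window):

import numpy as np

def conv_at(data, kernel, i, j):
    # Convolution value at (i, j); zero-padding handles windows that run past the edge.
    M = kernel.shape[0]
    padded = np.pad(data, pad_width=M, mode='constant')
    window = padded[i + M:i + 2 * M, j + M:j + 2 * M]
    return np.sum(window * kernel[::-1, ::-1])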
I'm having trouble achieving viable results with Mask R-CNN and I can't seem to pinpoint why. I am using a fairly limited dataset (13 images) of large greyscale images (2560 x 2160) where the detection target is very small (mean area of 26 pixels). I have run inspect_nucleus_data.ipynb across my data and verified that the masks and images are being interpreted correctly. I've also followed the wiki guide (https://github.com/matterport/Mask_RCNN/wiki) to have my images read and dealt with as greyscale images rather than just converting them to RGB. Here is one of the images with the detection targets labelled.
During training, the loss values are pretty unpredictable, bouncing between around 1 and 2 without ever reaching a steady decline where it seems like it's converging at all. I'm using these config values at the moment; they're the best I've been able to come up with while fighting off OOM errors:
Configurations:
BACKBONE resnet101
BACKBONE_STRIDES [4, 8, 16, 32, 64]
BATCH_SIZE 1
BBOX_STD_DEV [0.1 0.1 0.2 0.2]
COMPUTE_BACKBONE_SHAPE None
DETECTION_MAX_INSTANCES 450
DETECTION_MIN_CONFIDENCE 0
DETECTION_NMS_THRESHOLD 0.3
FPN_CLASSIF_FC_LAYERS_SIZE 1024
GPU_COUNT 1
GRADIENT_CLIP_NORM 5.0
IMAGES_PER_GPU 1
IMAGE_CHANNEL_COUNT 1
IMAGE_MAX_DIM 1024
IMAGE_META_SIZE 14
IMAGE_MIN_DIM 1024
IMAGE_MIN_SCALE 0
IMAGE_RESIZE_MODE square
IMAGE_SHAPE [1024 1024 1]
LEARNING_MOMENTUM 0.9
LEARNING_RATE 0.001
LOSS_WEIGHTS {'mrcnn_class_loss': 1.0, 'mrcnn_bbox_loss': 1.0, 'rpn_bbox_loss': 1.0, 'mrcnn_mask_loss': 1.0, 'rpn_class_loss': 1.0}
MASK_POOL_SIZE 14
MASK_SHAPE [28, 28]
MAX_GT_INSTANCES 450
MEAN_PIXEL [16.49]
MINI_MASK_SHAPE (56, 56)
NAME nucleus
NUM_CLASSES 2
POOL_SIZE 7
POST_NMS_ROIS_INFERENCE 1000
POST_NMS_ROIS_TRAINING 2000
PRE_NMS_LIMIT 6000
ROI_POSITIVE_RATIO 0.33
RPN_ANCHOR_RATIOS [0.5, 1, 2]
RPN_ANCHOR_SCALES (2, 4, 8, 16, 32)
RPN_ANCHOR_STRIDE 1
RPN_BBOX_STD_DEV [0.1 0.1 0.2 0.2]
RPN_NMS_THRESHOLD 0.9
RPN_TRAIN_ANCHORS_PER_IMAGE 512
STEPS_PER_EPOCH 11
TOP_DOWN_PYRAMID_SIZE 256
TRAIN_BN False
TRAIN_ROIS_PER_IMAGE 256
USE_MINI_MASK True
USE_RPN_ROIS True
VALIDATION_STEPS 1
WEIGHT_DECAY 0.0001
I'm training on all layers. The output I'm getting generally looks like this, with grid-like detections found in weird spots without ever seeming to accurately identify a nucleus. I've added the red square just to highlight a very obvious cluster of nuclei that have been missed:
Here is a binary mask of these same detections so you can see their shape:
Could anyone shed some light on what might be going wrong here?
Some information is lost when you downscale your 2560 x 2160 training images to 1024 x 1024.
Square mode also uses a lot of memory, so you might want to use crop mode instead: instead of 1024x1024 images in square mode, you would train on 512x512 crops. You might encounter NaN losses due to bounding boxes falling out of range; in that case you need to make some adjustments to your data feed.
You also want to turn off the mini mask, because it will hurt your accuracy on such small targets. Using crop mode should help a great deal with memory, so you shouldn't need to worry about that. A rough sketch of these changes is shown below.
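A minimal sketch of those settings (attribute names taken from your config dump above; the values are illustrative starting points, not tested on your data):

from mrcnn.config import Config

class NucleusConfig(Config):
    NAME = "nucleus"
    # Train on random crops instead of resizing the whole image to a 1024x1024 square
    IMAGE_RESIZE_MODE = "crop"
    IMAGE_MIN_DIM = 512
    IMAGE_MAX_DIM = 512
    # Keep full-resolution masks for the tiny nuclei
    USE_MINI_MASK = False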
I saw this solution, but it doesn't quite answer my question; it's also quite old so I'm not sure how relevant it is.
I keep getting conflicting outputs for the order of the GPUs. There are two of them: a Tesla K40 and an NVS 315 (a legacy device that is never used). When I run deviceQuery, I get
Device 0: "Tesla K40m"
...
Device PCI Domain ID / Bus ID / location ID: 0 / 4 / 0
Device 1: "NVS 315"
...
Device PCI Domain ID / Bus ID / location ID: 0 / 3 / 0
On the other hand, nvidia-smi produces a different order:
0 NVS 315
1 Tesla K40m
I find this very confusing. The solution I found for Tensorflow (and a similar one for Pytorch) is to use
import os
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"]="0"
The PCI bus ID is 4 for the Tesla and 3 for the NVS, so with this ordering device 0 should be the NVS (bus 3), is that right?
In pytorch I set
os.environ['CUDA_VISIBLE_DEVICES']='0'
...
device = torch.cuda.device(0)
print torch.cuda.get_device_name(0)
to get Tesla K40m
when I set instead
os.environ['CUDA_VISIBLE_DEVICES']='1'
device = torch.cuda.device(1)
print torch.cuda.get_device_name(0)
to get
UserWarning:
Found GPU0 NVS 315 which is of cuda capability 2.1.
PyTorch no longer supports this GPU because it is too old.
warnings.warn(old_gpu_warn % (d, name, major, capability[1]))
NVS 315
So I'm quite confused: what's the true order of GPU devices that tf and pytorch use?
By default, CUDA orders the GPUs by computing power. GPU:0 will be the fastest GPU on your host, in your case the K40m.
If you set CUDA_DEVICE_ORDER='PCI_BUS_ID' then CUDA orders your GPUs according to how they are plugged into the machine, meaning that GPU:0 will be the GPU on your first PCI-E lane.
Both Tensorflow and PyTorch use the CUDA GPU order. That is consistent with what you showed:
os.environ['CUDA_VISIBLE_DEVICES']='0'
...
device = torch.cuda.device(0)
print torch.cuda.get_device_name(0)
Default order, so GPU:0 is the K40m since it is the most powerful card on your host.
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
os.environ['CUDA_VISIBLE_DEVICES']='0'
...
device = torch.cuda.device(0)
print torch.cuda.get_device_name(0)
PCI-E lane order, so GPU:0 is the card with the lowest bus ID, in your case the NVS 315.
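If the goal is to always select the K40m regardless of the default ordering, here is a small sketch (not part of the answer above; it uses the bus IDs from your deviceQuery output, and the environment variables must be set before torch or tensorflow is imported so the CUDA runtime picks them up):

import os

# Order devices by PCI bus ID: bus 3 = NVS 315, bus 4 = Tesla K40m,
# so the K40m becomes device 1 under this ordering.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import torch
print(torch.cuda.get_device_name(0))  # expected: Tesla K40m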
The keras examples directory contains a lightweight version of a stacked what-where autoencoder (SWWAE) which they train on MNIST data. (https://github.com/fchollet/keras/blob/master/examples/mnist_swwae.py)
In the original SWWAE paper, the authors compute the what and where using soft functions. However, in the keras implementation, they use a trick to get these locations. I would like to understand this trick.
Here is the code of the trick.
def getwhere(x):
''' Calculate the 'where' mask that contains switches indicating which
index contained the max value when MaxPool2D was applied. Using the
gradient of the sum is a nice trick to keep everything high level.'''
y_prepool, y_postpool = x
return K.gradients(K.sum(y_postpool), y_prepool) # How exactly does this line work?
Here y_prepool is an M x N matrix and y_postpool is an M/2 x N/2 matrix (let's assume canonical pooling with a window of size 2 pixels).
I have verified that the output of getwhere() is a bed of nails matrix where the nails indicate the position of the max (the local argmax if you will).
Can someone construct a small example demonstrating how getwhere works using this "Trick?"
Let's focus on the simplest example, without really talking about convolutions. Say we have a vector
x = [1 4 2]
which we max-pool over (with a single, big window), we get
mx = 4
mathematically speaking, it is:
mx = x[argmax(x)]
now, the "trick" to recover one hot mask used by pooling is to do
magic = d mx / dx
There is no gradient through argmax itself; instead, the gradient "passes through" to the element of the vector at the location of the maximum, so:
d mx / dx = [d x[2]/d x[1], d x[2]/d x[2], d x[2]/d x[3]] = [0 1 0]
As you can see, the gradient is zero for all non-maximum elements (because of the argmax), and "1" appears at the maximum because d x[2]/d x[2] = 1.
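A minimal numerical check of this, using TensorFlow 1.x ops directly (the K.gradients call in getwhere wraps the same machinery; this sketch is not part of the original answer):

import tensorflow as tf

x = tf.constant([1., 4., 2.])
mx = tf.reduce_max(x)              # "max-pool" with one big window
where = tf.gradients(mx, x)[0]     # gradient of the pooled value w.r.t. the input

with tf.Session() as sess:
    print(sess.run(where))         # [0. 1. 0.] -- one-hot mask at the argmax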
Now for "proper" maxpool you have many pooling regions, connected to many input locations, thus taking analogous gradient of sum of pooled values, will recover all the indices.
Note, however, that this trick will not work cleanly if you have overlapping pooling windows: you might end up with values bigger than "1". Basically, if a pixel is the maximum of K pooling windows, it will receive the value K, not 1. For example:
x = [ 1,  2,  3]
    [13,  3,  1]
    [ 4,  2,  9]
if we max pool with 2x2 window we get
mx = [13,  3]
     [13,  9]
and the gradient trick gives you
magic = [0, 0, 1]
        [2, 0, 0]
        [0, 0, 1]
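The same check in TensorFlow 1.x for this overlapping 2x2 pooling (again just a sketch; the duplicated value 3 creates a tie in one window, so the exact placement of that single unit of gradient may differ from the hand-worked matrix):

import tensorflow as tf

x = tf.constant([[ 1.,  2.,  3.],
                 [13.,  3.,  1.],
                 [ 4.,  2.,  9.]])
x4d = tf.reshape(x, [1, 3, 3, 1])                   # NHWC layout expected by max_pool
mx = tf.nn.max_pool(x4d, ksize=[1, 2, 2, 1],
                    strides=[1, 1, 1, 1], padding='VALID')
magic = tf.gradients(tf.reduce_sum(mx), x4d)[0]

with tf.Session() as sess:
    print(sess.run(tf.reshape(mx, [2, 2])))         # the four window maxima
    print(sess.run(tf.reshape(magic, [3, 3])))      # 13 is credited by two windows -> value 2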
On my 64-bit Debian/Lenny system (4GByte RAM + 4GByte swap partition) I can successfully do:
v=array(10000*random([512,512,512]),dtype=np.int16)
f=fftn(v)
but with f being np.complex128 the memory consumption is shocking, and I can't do much more with the result (e.g. modulate the coefficients and then f = ifftn(f)) without a MemoryError traceback.
Rather than installing some more RAM and/or expanding my swap partitions, is there some way of controlling the scipy/numpy "default precision" and have it compute a complex64 array instead ?
I know I can just reduce it afterwards with f=array(f,dtype=np.complex64); I'm looking to have it actually do the FFT work in 32-bit precision and half the memory.
It doesn't look like there's any option for this in scipy's fft functions (see http://www.astro.rug.nl/efidad/scipy.fftpack.basic.html).
Unless you're able to find a fixed-point FFT library for Python, it's unlikely that the function you want exists, since the native hardware floating-point format is 64 bits (so a complex value takes 128 bits). It does look like you could use the rfft method, which exploits the fact that the input is real and stores only the non-redundant half of the spectrum, and that would save about half your RAM.
I ran the following in interactive python:
>>> from numpy import *
>>> v = array(10000*random.random([512,512,512]),dtype=int16)
>>> shape(v)
(512, 512, 512)
>>> type(v[0,0,0])
<type 'numpy.int16'>
At this point the RSS (Resident Set Size) of python was 265MB.
f = fft.fft(v)
And at this point the RSS of python is 2.3GB.
>>> type(f)
<type 'numpy.ndarray'>
>>> type(f[0,0,0])
<type 'numpy.complex128'>
>>> v = []
And at this point the RSS goes down to 2.0GB, since I've freed up v.
Using fft.rfft(v), which keeps only the non-redundant half of the spectrum for real input, results in a 1.3GB RSS (almost half, as expected).
Doing:
>>> f = complex64(fft.fft(v))
is the worst of both worlds: it first computes the complex128 version (2.3GB) and then copies that into a complex64 version (1.3GB), which means the peak RSS on my machine was 3.6GB before it settled down to 1.3GB again.
I think that if you've got 4GB RAM, this should all work just fine (as it does for me). What's the issue?
Scipy 0.8 will have single precision support for almost all the fft code (The code is already in the trunk, so you can install scipy from svn if you need the feature now).
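As a rough sketch of what that looks like once single-precision support is available (assuming a scipy build that includes it, e.g. 0.8+; the key point is that a float32 input keeps the transform in complex64):

import numpy as np
from scipy.fftpack import fftn

# A float32 input keeps the result in complex64, halving the memory of the output
# (a smaller array is used here just to keep the example light).
v = np.asarray(10000 * np.random.random([64, 64, 64]), dtype=np.float32)
f = fftn(v)
print(f.dtype)   # complex64 (a float64 input would give complex128)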