To motivate the 'efficient' in the title: I am working with volumetric image data which can be up to 512x512x1000 pixels. So slow loops etc. are not really an option, particularly if the images need to be viewed in a GUI. Imagine sitting just 10s in front of a viewer waiting for images to load...
From two 3D input volumes x and y I calculate new 3D output volumes, currently up to three at a time e.g. by solving equation systems for each pixel. Since a lot of x,y combinations are actually repetitive and often even only a coherent meshgrid range is of interest, I am trying to speed up by creating a lookup table for this subregion. Works well, in my test case I need only ca. 3000 calculations instead of 30 million.
Now, to the problem: I am utterly failing at efficiently looking up the solutions of the 30 million x,y combinations from the 3000 solutions lookup table in a numpythonic way!
Let's try with an example:
# x y s1 s2
lookup = np.array([[ 4, 11, 23., 4. ],
[ 4, 12, 25., 13. ],
[ 5, 11, 21., 19. ],
[ 5, 12, 26., 56. ]])
I succeed in getting the index of one x,y pair following this post:
ii = np.where((lookup[:,0] == 4) & (lookup[:,1]==12))[0][0]
s1, s2 = lookup[ii,-2:]
print('At index',ii,':',s1,s2)
>>> At index 1 : 25.0 13.0
Q1: But how do I vectorize this, i.e. get full solution arrays for all 30 million pixels?
s1, s2 = lookup[numpy_magic_with_xy, -2:]
Q2: And actually I'd like to set all solutions to zero for all x,y not within the region of interest. Where do I add that condition?
Q3: And what would really be the fastest way to achieve all this?
PS: I'm fine with using 1D image representations by working with x.ravel() etc. and reshaping at the end. Unless you tell me I don't need to and it's just slowing things down. Just doing it to still understand my own code I guess...
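One possible way to handle Q1 and Q2 together, sketched under the assumption that x and y are integer-typed arrays (as in the example table): scatter the solutions into a dense table addressed by (x - x_min, y - y_min), then index that table with whole arrays. The name `lookup_solutions` and the sample x and y arrays are made up for illustration. Whether this is the fastest option (Q3) would have to be measured on the real data.

```python
import numpy as np

# x, y, s1, s2 (the table from the question)
lookup = np.array([[4, 11, 23., 4.],
                   [4, 12, 25., 13.],
                   [5, 11, 21., 19.],
                   [5, 12, 26., 56.]])

def lookup_solutions(x, y, lookup):
    """Return an array of shape x.shape + (n_solutions,), zero outside the table."""
    keys = lookup[:, :2].astype(np.int64)
    x_min, y_min = keys.min(axis=0)
    x_max, y_max = keys.max(axis=0)
    n_sol = lookup.shape[1] - 2

    # Dense table: one cell per (x, y) combination inside the region of interest.
    table = np.zeros((x_max - x_min + 1, y_max - y_min + 1, n_sol))
    table[keys[:, 0] - x_min, keys[:, 1] - y_min] = lookup[:, 2:]

    # Q2: pixels outside the region keep the initial value of zero.
    inside = (x >= x_min) & (x <= x_max) & (y >= y_min) & (y <= y_max)
    out = np.zeros(x.shape + (n_sol,))
    out[inside] = table[x[inside] - x_min, y[inside] - y_min]
    return out

# Hypothetical 2x2 "images"; the pair (9, 11) is outside the lookup region.
x = np.array([[4, 5], [9, 4]])
y = np.array([[12, 11], [11, 12]])
sol = lookup_solutions(x, y, lookup)
s1, s2 = sol[..., 0], sol[..., 1]
```

Because the boolean mask and the fancy indexing work on arrays of any dimensionality, this sketch does not need x.ravel() and a reshape at the end; 3D volumes can be passed in directly.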
Let's say that I have a dataset with multiple input features and one single output. For the sake of simplicity, let's say the output is binary. Either zero or one.
I want to split this dataset into k parts and use a k-fold cross-validation model to learn the mapping from the input features to the output one. If the dataset is imbalanced, the ratio between the number of records with output 0 and 1 is not going to be one. To make it concrete, let's say that 90% of the records are 0 and only 10% are 1.
I think it's important that within each part of k-folds we should see the same ratio of 0s and 1s in order for successful training (the same 9 to 1 ratio). I know how to do this in Pandas but my question is how to do it in TFX.
Reading the TFX documentation, I know that I can split a dataset by specifying an output_config to the class loading the examples:
output = tfx.proto.Output(
split_config=tfx.proto.SplitConfig(splits=[
tfx.proto.SplitConfig.Split(name='fold_1', hash_buckets=1),
tfx.proto.SplitConfig.Split(name='fold_2', hash_buckets=1),
tfx.proto.SplitConfig.Split(name='fold_3', hash_buckets=1),
tfx.proto.SplitConfig.Split(name='fold_4', hash_buckets=1),
tfx.proto.SplitConfig.Split(name='fold_5', hash_buckets=1)
]))
example_gen = CsvExampleGen(input_base=input_dir, output_config=output)
But then, the aforementioned ratio of the examples in each fold will be random at best. My question is: Is there any way I can specify what goes into each split? Can I somehow enforce the ratio of a feature?
BTW, I have seen and experimented with the partition_feature_name argument of the SplitConfig class. It's not useful here unless there's a feature with the ID of the fold for each example which I think is not practical since I might want to change the number of folds as part of the experiment without changing the dataset.
I'm going to answer my own question but only as a workaround. I'll be happy to see someone develop a real solution to this question.
What I could come up with at this point was to split the dataset into a number of tfrecord files. I've chosen a highly composite number of files so I can group them into (almost) any number of folds I want. I settled on 60 since it is divisible by 2, 3, 4, 5, 6, 10, and 12 (I don't think anyone would want KFold with k higher than 12). Then, at the time of loading them, I have to somehow select which files go into each split. There are two things to consider here.
First, the ImportExampleGen class from TFX supports glob file patterns. This means we can have multiple files loaded for each split:
input = tfx.proto.Input(splits=[
tfx.proto.Input.Split(name="fold_1", pattern="fold_1*"),
tfx.proto.Input.Split(name="fold_2", pattern="fold_2*")
])
example_gen = tfx.components.ImportExampleGen(input_base=_dataset_folder,
input_config=input)
Next, we need some ingenuity to enable splitting the files into any number we like at the time of loading them. And this is my approach to it:
fold_3.0_4.0_5.0_6.0_10.0/part-###.tfrecords.gz
fold_3.0_4.0_5.1_6.0_10.6/part-###.tfrecords.gz
fold_3.0_4.0_5.2_6.0_10.2/part-###.tfrecords.gz
fold_3.0_4.0_5.3_6.0_10.8/part-###.tfrecords.gz
...
The file pattern is like this. Between each pair of underscores I put a divisor, a dot, and then a remainder. I include one such divisor.remainder pair for every number of folds I might want to split into later, at the time of loading the dataset.
In the example above, I'll have the option to load them into 3, 4, 5, 6, or 10 folds. The first file lands in split 0 whichever number of folds I pick, while the second file lands in split 1 for 5 folds and in split 6 for 10 folds (and in split 0 for 3, 4, or 6 folds).
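The post does not say how the remainders are derived. The four listed names are consistent with taking a running file index modulo each divisor (indices 0, 36, 12, and 48 out of 60), so a minimal sketch under that assumption could look like this. The helper name `fold_dir_name` is made up for illustration.

```python
def fold_dir_name(file_index, divisors=(3, 4, 5, 6, 10)):
    """Build a directory name such as 'fold_3.0_4.0_5.1_6.0_10.6'.

    Assumption: each remainder is file_index modulo the divisor.
    """
    parts = [f"{d}.{file_index % d}" for d in divisors]
    return "fold_" + "_".join(parts)

# One directory name per shard, assuming 60 shards numbered 0..59.
names = [fold_dir_name(i) for i in range(60)]
```

With 60 shards, every divisor in the tuple divides the shard count evenly, so each fold receives the same number of shards.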
And this is how I'll load them:
NUM_FOLDS = 5
input = tfx.proto.Input(splits=[
    tfx.proto.Input.Split(name=f'fold_{index + 1}',
                          pattern=f"fold_*{NUM_FOLDS}.{index}*/*")
    for index in range(NUM_FOLDS)
])
example_gen = tfx.components.ImportExampleGen(input_base=_dataset_folder,
                                              input_config=input)
I could change the NUM_FOLDS to any of the options 3, 4, 5, 6, or 10 and the loaded dataset will consist of pre-curated k-fold splits. It is worth mentioning that I have made sure of the ratio of the samples within each file at the time of creating them. So any combination of them will also have the same ratio.
Again, this is only a trick in the absence of an actual solution. The main drawback of this approach is the fact that you have to split the dataset manually yourself. I've done so, in this case, using pandas. That meant that I had to load the whole dataset into memory. Which might not be possible for all the datasets.
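For the manual split itself, one way to keep the label ratio in every file is to deal the records of each class out round-robin over the shards. The sketch below uses NumPy rather than pandas and is not the code used above; the function name `stratified_shard_ids` and the 9000/1000 example labels are assumptions for illustration.

```python
import numpy as np

def stratified_shard_ids(labels, num_shards=60, seed=0):
    """Assign every record a shard id so each shard keeps the class ratio."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    shard = np.empty(labels.shape[0], dtype=np.int64)
    for value in np.unique(labels):
        idx = np.flatnonzero(labels == value)  # records of this class
        rng.shuffle(idx)                       # random order within the class
        shard[idx] = np.arange(idx.size) % num_shards  # deal out round-robin
    return shard

# Example: 90% zeros and 10% ones, as in the question.
labels = np.repeat([0, 1], [9000, 1000])
shards = stratified_shard_ids(labels)
```

Each shard then differs from any other by at most one record per class, so any union of shards keeps roughly the same 9 to 1 ratio. This still needs the label column in memory, though not the full feature set.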
Given a large (~10 million) number of irregularly spaced points in two dimensions, where each point has some intensity ("weight") associated with it, what existing python implementations are there for interpolating the value at:
a specific point at some random position (i.e. point = (0.5, 0.8))
a large number of points at random positions (i.e. points = np.random.random((1_000_000, 2)))
a regular grid at integer positions (i.e. np.indices((1000, 1000)).T)
I am aware that Delaunay triangulation is often used for this purpose. Are there alternatives to doing it this way?
Do any solutions take advantage of multiple CPU cores or GPUs?
As an example, here is an approach using scipy's LinearNDInterpolator. It does not appear to use more than one CPU core.
There are also other options in scipy, but with this question I am especially interested in hearing about other solutions than the ones in scipy.
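import numpy as np
from scipy.spatial import Delaunay
from scipy.interpolate import LinearNDInterpolator

# The %time tags are IPython magic functions that time that specific line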
dimension_shape = (1000, 1000) # we spread the random [0-1] over [0-1000] to avoid floating point errors
N_points = dimension_shape[0] * dimension_shape[1]
known_points = np.random.random((N_points, 2)) * dimension_shape
known_weights = np.random.random((N_points,))
unknown_point = (0.5, 0.8)
unknown_points = np.random.random((N_points, 2)) * dimension_shape
unknown_grid = np.indices(dimension_shape, dtype=float).T.reshape((-1, 2)) # reshape to a list of 2D points
%time tesselation = Delaunay(known_points) # create grid to know neighbours # 6 sec
%time interp_func = LinearNDInterpolator(tesselation, known_weights) # 1 ms
%time interp_func(unknown_point) # 2 sec # run it once because the scipy function needs to compile
%time interp_func(unknown_point) # ~ns
%time interp_func(unknown_grid) # 400 ms
%time interp_func(unknown_points) # 1 min 13 sec
# Below I sort the above `unknown_points` array, and try again
%time ind = np.lexsort(np.transpose(unknown_points)[::-1]) # 306 ms
unknown_points_sorted = unknown_points[ind].copy()
%time interp_func(unknown_points_sorted) # 19 sec <- much less than 1 min!
In the above code, things that take an appreciable amount of time are the construction of the Delaunay grid, and interpolation on a non-regular grid of points. Note that sorting the non-regular points first results in a significant speed improvement!
Do not feel the need to give a complete answer from the start. Tackling any aspect of the above is welcome.
Scipy is pretty good and I don't think that there are better solutions in Python, but I can add a couple of things that might be helpful to you. First off, your idea of sorting the points is a really good one. The so-called "incremental algorithms" build the Delaunay triangulation by inserting vertices one at a time. The first step in inserting a vertex into an existing mesh is to figure out which triangle in the mesh contains it. To speed things up, some algorithms start the search right at the point where the most recent insertion occurred. So if your points are ordered so that each point inserted is relatively close to the previous one, the search is much faster. If you want more details, you can look up the "Lawson's Walk" algorithm. In my own implementation of the Delaunay triangulation (which is in Java, so I'm afraid it won't help you), I have a sort based on the Hilbert space-filling curve. The Hilbert sort works great. But even just sorting by x/y coordinates is a help.
In terms of whether there are other ways to interpolate without using the Delaunay triangulation... You could try something using Inverse-Distance-Weighting (IDW). IDW techniques don't require the Delaunay triangulation, but they do require some way to figure out which vertices are close to the point for which you wish to interpolate. I've played with dividing my coordinate space into uniformly spaced bins, storing the vertices in the appropriate bins, and then just pulling up the points I need for an interpolation by looking at the neighboring bins. It may be a lot of coding, but it will be reasonably fast and use less memory than the Delaunay triangulation.
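For anyone who wants to try a Hilbert-curve ordering in Python instead of the lexsort from the question, here is a minimal NumPy sketch. It is not the Java implementation mentioned above. It snaps the points to a 2**order grid and uses the textbook bitwise conversion from cell coordinates to curve position. The function names and the default grid resolution are assumptions, and I have not measured whether it beats a plain x/y sort with LinearNDInterpolator.

```python
import numpy as np

def hilbert_index(ix, iy, order):
    """Position along the Hilbert curve of integer cells on a 2**order grid."""
    x = np.array(ix, dtype=np.int64)  # np.array copies, inputs stay untouched
    y = np.array(iy, dtype=np.int64)
    n = 1 << order
    d = np.zeros_like(x)
    s = n >> 1
    while s > 0:
        rx = (x & s) > 0
        ry = (y & s) > 0
        d += s * s * ((3 * rx.astype(np.int64)) ^ ry.astype(np.int64))
        # Rotate/flip the quadrant so the next level sees a canonical layout.
        flip = rx & ~ry
        x[flip] = n - 1 - x[flip]
        y[flip] = n - 1 - y[flip]
        swap = ~ry
        x[swap], y[swap] = y[swap], x[swap]
        s >>= 1
    return d

def hilbert_sort(points, order=10):
    """Return indices that sort 2D points along a Hilbert curve."""
    points = np.asarray(points, dtype=float)
    lo = points.min(axis=0)
    span = points.max(axis=0) - lo
    span[span == 0] = 1.0  # avoid dividing by zero for degenerate input
    cells = ((points - lo) / span * ((1 << order) - 1)).astype(np.int64)
    keys = hilbert_index(cells[:, 0], cells[:, 1], order)
    return np.argsort(keys, kind="stable")
```

Usage would mirror the lexsort version: compute ind = hilbert_sort(unknown_points), interpolate unknown_points[ind], and scatter the results back with result[ind] = values if the original order matters.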
Interpolating on Delaunay triangles is certainly one possibility, but I would recommend sorting the points in a kD-tree, using the tree to query nearest neighbors (in a sufficient radius), and then interpolating with IDW, as was already suggested.
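A minimal sketch of the kD-tree plus IDW idea, using scipy.spatial.cKDTree with a fixed number of nearest neighbours instead of a search radius (that swap is my simplification). The function name `idw_interpolate`, the neighbour count k=8, and the exponent 2 are assumptions to tune, and the random data only mimics the setup from the question at a smaller size.

```python
import numpy as np
from scipy.spatial import cKDTree

def idw_interpolate(known_points, known_weights, query_points, k=8, power=2.0):
    """Inverse-distance-weighted average of the k nearest known points."""
    tree = cKDTree(known_points)
    query_points = np.atleast_2d(query_points)
    dist, idx = tree.query(query_points, k=k)  # both have shape (M, k)
    dist = np.maximum(dist, 1e-12)             # guard against division by zero
    w = 1.0 / dist ** power
    return (w * known_weights[idx]).sum(axis=1) / w.sum(axis=1)

rng = np.random.default_rng(0)
known_points = rng.random((10_000, 2)) * 1000
known_weights = rng.random(10_000)
unknown_points = rng.random((5, 2)) * 1000
values = idw_interpolate(known_points, known_weights, unknown_points)
```

On the multi-core part of the question: recent SciPy releases accept a workers argument in cKDTree.query to run the neighbour search in parallel, so that may be worth trying on the 10-million-point case.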
I recreated a deep learning network (Yolov3) and extracted a feature map after the prediction. It has the dimensions (1, 13, 13, 3, 50). The 13x13 stands for the grid and the 3 for the RGB values. The 50 stands for the 50 different classes my model can predict.
Currently I am trying to reformat the feature maps for each class individually. That is, I am trying to create 50 arrays from the structure described above, each containing 3 arrays (one per RGB feature), each of which holds the 13x13 grid.
What you have to consider is that the feature map contains the values of the 50 classes for each cell of the 13x13 grid.
Currently I have solved the problem with a for-loop that can only extract one class at a time. So I am wondering whether I can use NumPy, for example with resize, transpose, or reshape, to get a better solution than the current one.
def extract_feature_maps(model_output, class_index):
    for row in model_output:
        feature_maps = [[], [], []]
        for column in row:
            tmp = [[], [], []]
            for three_dim in column:
                counter = 0
                for feature_map_tmp in three_dim:
                    feature_map_tmp_0 = feature_map_tmp[5:]
                    feature_number = feature_map_tmp_0[class_index]
                    tmp[counter].append(feature_number)
                    counter += 1
            feature_maps[0].append(tmp[0])
            feature_maps[1].append(tmp[1])
            feature_maps[2].append(tmp[2])
    return np.array(feature_maps[0]), np.array(feature_maps[1]), np.array(feature_maps[2])
As I said, I can currently only extract the feature map of one class at a time, and in a very time-consuming way. Is there a cleverer way to do this?
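The nested loops can probably be replaced by plain indexing plus an axis reorder. The sketch below assumes the (1, 13, 13, 3, 50) layout described above and keeps the loop's habit of skipping the first 5 entries of the last axis (the [5:] slice), which leaves 45 rather than 50 values per cell. The function names and the random test array are made up for illustration.

```python
import numpy as np

def class_feature_maps(model_output, class_index, offset=5):
    """Return the three 13x13 maps of one class as an array of shape (3, 13, 13)."""
    # Pick one entry of the last axis, then move the size-3 axis to the front.
    return np.moveaxis(model_output[0, :, :, :, offset + class_index], -1, 0)

def all_class_feature_maps(model_output, offset=5):
    """Return the maps of every class at once, shape (n_classes, 3, 13, 13)."""
    return np.transpose(model_output[0, ..., offset:], (3, 2, 0, 1))

# Random stand-in for the real network output.
model_output = np.random.default_rng(0).random((1, 13, 13, 3, 50))
maps_7 = class_feature_maps(model_output, 7)
all_maps = all_class_feature_maps(model_output)
```

Both functions return views into model_output rather than copies, so they cost almost nothing; call .copy() on the result if it needs to be modified independently.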
I have an np.array of observations z where z.shape is (100000, 60). I want to efficiently calculate the 100000x100000 correlation matrix and then write to disk the coordinates and values of just those elements > 0.95 (this is a very small fraction of the total).
My brute-force version of this looks like the following but is, not surprisingly, very slow:
for i1 in range(z.shape[0]):
    for i2 in range(i1+1):
        r = np.corrcoef(z[i1,:],z[i2,:])[0,1]
        if r > 0.95:
            file.write("%6d %6d %.3f\n" % (i1,i2,r))
I realize that the correlation matrix itself could be calculated much more efficiently in one operation using np.corrcoef(z), but the memory requirement is then huge. I'm also aware that one could break up the data set into blocks and calculate bite-size subportions of the correlation matrix at one time, but programming that and keeping track of the indices seems unnecessarily complicated.
Is there another way (e.g., using memmap or pytables) that is both simple to code and doesn't put excessive demands on physical memory?
After experimenting with the memmap solution proposed by others, I found that while it was faster than my original approach (which took about 4 days on my Macbook), it still took a very long time (at least a day), presumably due to inefficient element-by-element writes to the output file. That wasn't acceptable given my need to run the calculation numerous times.
In the end, the best solution (for me) was to sign in to Amazon Web Services EC2 portal, create a virtual machine instance (starting with an Anaconda Python-equipped image) with 120+ GiB of RAM, upload the input data file, and do the calculation (using the matrix multiplication method) entirely in core memory. It completed in about two minutes!
For reference, the code I used was basically this:
import numpy as np
import pickle
import h5py
# read nparray, dimensions (102000, 60)
infile = open(r'file.dat', 'rb')
x = pickle.load(infile)
infile.close()
# z-normalize the data -- first compute means and standard deviations
xave = np.average(x,axis=1)
xstd = np.std(x,axis=1)
# transpose for the sake of broadcasting (doesn't seem to work otherwise!)
ztrans = x.T - xave
ztrans /= xstd
# transpose back
z = ztrans.T
# compute correlation matrix - shape = (102000, 102000)
arr = np.matmul(z, z.T)
# divide by the number of observations per row (60), not the number of rows
arr /= z.shape[1]
# output to HDF5 file
with h5py.File('correlation_matrix.h5', 'w') as hf:
    hf.create_dataset("correlation", data=arr)
From my rough calculations, you want a correlation matrix that has 100,000^2 elements. That takes up around 40 GB of memory, assuming 32-bit floats (about 80 GB with NumPy's default float64).
That probably won't fit in computer memory, otherwise you could just use corrcoef.
There's a fancy approach based on eigenvectors that I can't find right now, and that gets into the (necessarily) complicated category...
Instead, rely on the fact that for zero mean data the covariance can be found using a dot product.
z0 = z - mean(z, 1)[:, None]
cov = dot(z0, z0.T)
cov /= z.shape[-1]
And this can be turned into the correlation by normalizing by the standard deviations
sigma = std(z, 1)
corr = cov
corr /= sigma
corr /= sigma[:, None]
Of course memory usage is still an issue.
You can work around this with memory mapped arrays (make sure it's opened for reading and writing) and the out parameter of dot (For another example see Optimizing my large data code with little RAM)
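N = z.shape[0]
z0 = z0.astype('float32')  # dot's out array must have the dtype of the result
arr = np.memmap('corr_memmap.dat', dtype='float32', mode='w+', shape=(N,N))
dot(z0, z0.T, out=arr)
arr /= z.shape[-1]  # covariance, as in the in-memory version above
arr /= sigma
arr /= sigma[:, None]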
Then you can loop through the resulting array and find the indices with a large correlation coefficient. (You may be able to find them directly with where(arr > 0.95), but the comparison will create a very large boolean array which may or may not fit in memory).
You can use scipy.spatial.distance.pdist with metric='correlation' to get all the correlations without the symmetric terms (it returns the correlation distance, i.e. 1 - r). Unfortunately this will still leave you with about 5e9 terms that will probably overflow your memory.
You could try reformulating a KDTree (which can theoretically handle cosine distance, and therefore correlation distance) to filter for higher correlations, but with 60 dimensions it's unlikely that would give you much speedup. The curse of dimensionality sucks.
Your best bet is probably brute forcing blocks of data using scipy.spatial.distance.cdist(..., metric='correlation'), and then keeping only the high correlations in each block. Once you know how big a block your memory can handle without slowing down due to your computer's memory architecture, it should be much faster than doing one pair at a time.
Please check out the deepgraph package:
https://deepgraph.readthedocs.io/en/latest/tutorials/pairwise_correlations.html
I tried it with z.shape = (2500, 60) and pearsonr for 2500 * 2500. It was extremely fast.
I'm not sure about 100000 x 100000, but it's worth trying.
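The block bookkeeping is less painful than it sounds. Here is a minimal sketch of the blockwise idea that uses a plain matrix product on row-normalized data instead of cdist (that swap is mine, it avoids converting from correlation distance back to r). The function name `high_correlation_pairs` and the block size are assumptions, and rows with zero variance are assumed not to occur.

```python
import numpy as np

def high_correlation_pairs(z, threshold=0.95, block=1024):
    """Yield (i, j, r) with j <= i and r > threshold, one block of rows at a time."""
    z = np.asarray(z, dtype=float)
    zn = z - z.mean(axis=1, keepdims=True)
    zn /= np.linalg.norm(zn, axis=1, keepdims=True)  # rows now have unit length
    n = zn.shape[0]
    for start in range(0, n, block):
        stop = min(start + block, n)
        # Correlations of rows start..stop-1 against rows 0..stop-1 only.
        r = zn[start:stop] @ zn[:stop].T
        rows, cols = np.nonzero(r > threshold)
        keep = cols <= rows + start  # lower triangle, diagonal included
        for i, j in zip(rows[keep], cols[keep]):
            yield int(i + start), int(j), float(r[i, j])
```

Peak memory is one block x N slab of correlations, so block can be chosen to fit the machine. Each yielded tuple can be written directly with file.write("%6d %6d %.3f\n" % pair), matching the output format of the original loop.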
As far as my understanding goes, shared memory is divided into banks and accesses by multiple threads to a single data element within the same bank will cause a conflict (or broadcast).
At the moment I allocate a fairly large array which conceptually represents several pairs of two matrices:
__shared__ float A[34*N];
Where N is the number of pairs and the first 16 floats of a pair are one matrix and the following 18 floats are the second.
The thing is, access to the first matrix is conflict free but access to the second one has conflicts. These conflicts are unavoidable. However, my thinking is that because the second matrix has 18 elements, all subsequent matrices will be misaligned to the banks and therefore more conflicts than necessary will occur.
Is this true, if so how can I avoid it?
Every time I allocate shared memory, does it start at a new bank? So potentially could I do
__shared__ float Apair1[34];
__shared__ float Apair2[34];
...
Any ideas?
Thanks
If your pairs of matrices are stored contiguously, and if you are accessing the elements linearly by thread index, then you will not have shared memory bank conflicts.
In other words if you have:
A[0] <- mat1 element1
A[1] <- mat1 element2
A[2] <- mat1 element3
A[15] <- mat1 element16
A[16] <- mat2 element1
A[17] <- mat2 element2
A[33] <- mat2 element18
And you access this using:
float element;
element = A[pairindex * 34 + matindex * 16 + threadIdx.x];
Then adjacent threads are accessing adjacent elements in the matrix and you do not have conflicts.
In response to your comments (below) it does seem that you are mistaken in your understanding. It is true that there are 16 banks (in current generations, 32 in the next generation, Fermi) but consecutive 32-bit words reside in consecutive banks, i.e. the address space is interleaved across the banks. This means that provided you always have an array index that can be decomposed to x + threadIdx.x (where x is not dependent on threadIdx.x, or at least is constant across groups of 16 threads) you will not have bank conflicts.
When you access the matrices further along the array, you still access them in a contiguous chunk and hence you will not have bank conflicts. It is only when you start accessing non-adjacent elements that you will have bank conflicts.
The reduction sample in the SDK illustrates bank conflicts very well by building from a naive implementation to an optimised implementation, possibly worth taking a look.
Banks are set up such that each successive 32 bits are in the next bank. So, if you declare an array of 4-byte floats, each subsequent float in the array will be in the next bank (modulo 16 or 32, depending on your architecture). I'll assume you're on compute capability 1.x, so you have 16 banks.
If you have arrays of 18 and 16, things can be funny. You can avoid bank conflicts in the 16x16 array by declaring it like
__shared__ float sixteen[16][16+1];
which avoids bank conflicts when accessing transposed elements using threadIdx.x (as I assume you're doing if you're getting conflicts). When accessing the elements of, say, the first column of an unpadded 16x16 matrix, they are 16 words apart and so all reside in the same bank. What you want is for each of them to sit in a successive bank, and padding does this for you. You treat the array exactly as you would before, as sixteen[row][column], or similarly for a flattened matrix, as sixteen[row*(16+1)+column], if you want.
For the 18x18 case, when accessing in the transpose, you're moving at an even stride. The answer again is to pad by 1.
__shared__ float eighteens[18][18+1];
So now, when you access in the transpose (say accessing elements in the first column), the stride is (18+1)%16 = 3 banks, and you'll access banks 0, 3, 6, 9, 12, 15, 2, 5, 8 etc. Since 3 and 16 share no common factor, the 16 threads of a half-warp all land in different banks, so you should get no conflicts.
The particular alignment shift due to having a matrix of size 18 isn't the problem, because the starting point of the array makes no difference, it's only the order in which you access it. If you want to flatten the arrays I've proposed above, and merge them into 1, that's fine, as long as you access them in a similar fashion.
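The bank arithmetic above is easy to check on the host before touching the kernel. The following is a small Python model, not CUDA code: it assumes 16 banks of 32-bit words and a half-warp of 16 threads (compute capability 1.x, as assumed above), and simply counts how many threads land in the same bank. The helper names are made up for illustration.

```python
from collections import Counter

def max_conflict_degree(word_indices, num_banks=16):
    """Largest number of distinct word indices that fall into one bank."""
    banks = Counter(i % num_banks for i in set(word_indices))
    return max(banks.values())

def column_access(rows, row_stride, column=0, threads=16):
    """Word indices touched when thread t reads element [t][column]."""
    return [t * row_stride + column for t in range(min(threads, rows))]

unpadded_16 = max_conflict_degree(column_access(16, 16))  # sixteen[16][16]
padded_16 = max_conflict_degree(column_access(16, 17))    # sixteen[16][16+1]
unpadded_18 = max_conflict_degree(column_access(18, 18))  # eighteens[18][18]
padded_18 = max_conflict_degree(column_access(18, 19))    # eighteens[18][18+1]

# Linear access into the second matrix of pair 1 in the 34-float layout:
# A[pairindex * 34 + 16 + threadIdx.x]
linear = max_conflict_degree([1 * 34 + 16 + t for t in range(16)])
```

A degree of 1 means conflict free. In this model the unpadded 16-wide column access serializes 16 ways, the unpadded 18-wide one 2 ways, and both padded layouts as well as the linear access into the misaligned second matrix come out conflict free, which matches the point that the starting offset does not matter.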