Parallel h5py/hdf5 writing to large dataset skips data chunks - numpy

I am using MPI with h5py/HDF5 (both HDF5 and h5py were compiled with parallel support, and everything runs on Python 3.4) on a cluster to stitch a dataset of overlapping tiles (200 or more images/numpy arrays of 2048x2048).
Each tile has an assigned index number that corresponds to the position where it should be written, and all the indices are stored in an array:
Example tile_array:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]], dtype=int32)
The issues I have are:
Running the code on the cluster:
- When I use multiple cores to write the stitched image in parallel, I get randomly missing tiles.
- When I run the stitching serially on rank==0, I still have missing tiles.
- If the dataset is small, there are no issues.
- If I do not activate atomic mode for the parallel writing, the number of errors increases. The number of errors also increases if atomic mode is activated when I run the writing on rank==0.
Running the code on a laptop:
- When I run the same code on a single core on a laptop, stripped of the MPI code, no tiles are missing.
Any comments will be greatly appreciated.
Highlight of the procedure:
0) Before stitching, each image of the set is processed and the overlapping edges are blended in order to get a better result in the overlapping regions.
1) I open the HDF5 file where I want to save the data: stitching_file=h5py.File(fpath,'a',driver='mpio',comm=MPI.COMM_WORLD)
2) I create a big dataset that will contain the stitched images.
3) I fill the dataset with zeros.
4) The processed images will be written (added) to the big dataset by different cores. However, adjacent images have overlapping regions. To avoid different cores writing overlapping tiles at the same time and corrupting the write, I run four rounds of writing that cover the whole dataset with this kind of pattern:
Example tile_array:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]], dtype=int32)
Writing rounds:
writing round one:   array([ 0,  2,  8, 10], dtype=int32)
writing round two:   array([ 1,  3,  9, 11], dtype=int32)
writing round three: array([ 4,  6, 12, 14], dtype=int32)
writing round four:  array([ 5,  7, 13, 15], dtype=int32)
If I write on a single rank I don't use this scheme and just write each position sequentially.
Pseudocode
# Preprocessing of the images
------------------------------
------------------------------

# MPI data chunking function
def tasks_chunking(tasks, size):
    # This function scatters any kind of list, not only ordered ones.
    # If the number of cores is bigger than the length of the list,
    # some of the chunks will have size zero.
    # Chunk the list of tasks
    Chunked_list = np.array_split(tasks, size)
    NumberChunks = np.zeros(len(Chunked_list), dtype='int32')
    for idx, chunk in enumerate(Chunked_list):
        NumberChunks[idx] = len(chunk)
    # Calculate the displacement
    Displacement = np.zeros(len(NumberChunks), dtype='int32')
    for idx, count in enumerate(NumberChunks[0:-1]):
        Displacement[idx + 1] = Displacement[idx] + count
    return Chunked_list, NumberChunks, Displacement

# Flush the file after the preprocessing
stitching_file.flush()

# Writing dictionary. Each group will contain a subset of the tiles to be written.
# Data has the size of the final array where the data will be saved.
fake_tile_array = np.arange(Data.size, dtype='int32')
fake_tile_array = fake_tile_array.reshape(Data.shape)
writing_dict = {}
writing_dict['even_group1'] = fake_tile_array[::2, ::2].flat[:]
writing_dict['even_group2'] = fake_tile_array[::2, 1::2].flat[:]
writing_dict['odd_group1']  = fake_tile_array[1::2, ::2].flat[:]
writing_dict['odd_group2']  = fake_tile_array[1::2, 1::2].flat[:]

# Without activating the atomic mode the number of errors in parallel mode is higher.
stitching_file.atomic = True

# Loop through the dictionary items to start the writing
for key, item in writing_dict.items():
    # Chunk the positions that need to be written
    if rank == 0:
        Chunked_list, NumberChunks, Displacement = tasks_chunking(item, size)
    else:
        NumberChunks = None
        Displacement = None
        Chunked_list = None
    # Make the cores aware of the number of jobs each of them will need to run.
    # cnt is filled by Scatter with the number of tasks assigned to this core,
    # so it is different on every core.
    cnt = np.zeros(1, dtype='int32')
    comm.Scatter(NumberChunks, cnt, root=0)
    # Define the local buffer that will be filled with the scattered data;
    # cnt determines the size of xlocal on each core.
    xlocal = np.zeros(cnt, dtype='int32')
    # Scatter the tasks to the different cores
    comm.Scatterv([item, NumberChunks, Displacement, MPI.INT], xlocal, root=0)
    for tile_ind in xlocal:
        # This writing function is the same one called when I run the writing on rank 0 or on a single core on a laptop.
        paste_in_final_image(joining, temp_file, stitched_group, tile_ind, nr_pixels)
    comm.Barrier()

stitching_file.flush()
stitching_file.close()
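For reference, here is a minimal sketch of the disjoint-slab pattern that the h5py parallel documentation uses: every rank opens the file and creates the dataset (these are collective operations), and each rank then writes only its own, non-overlapping region. The file name, shapes and values below are placeholders, not the code from the question.

from mpi4py import MPI
import numpy as np
import h5py

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# File opening and dataset creation are collective: every rank must execute them.
with h5py.File('stitched_test.h5', 'w', driver='mpio', comm=comm) as f:
    dset = f.create_dataset('stitched', (size, 2048, 2048), dtype='float64')
    # Each rank writes a disjoint slab; no two ranks touch the same region.
    dset[rank, :, :] = np.full((2048, 2048), rank, dtype='float64')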

Related

numpy array of array adding up another array

I have the following arrays
a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([[1, 5, 10]])
and want to add the values in b to a, like
np.array([[2, 7, 13], [5, 10, 16]])
What is the best approach, performance-wise, to achieve this?
Thanks
Broadcasting does that for you, so:
>>> a + b
just works:
array([[ 2,  7, 13],
       [ 5, 10, 16]])
And it can also be done with
>>> a + np.tile(b, (2, 1))
which gives the result
array([[ 2,  7, 13],
       [ 5, 10, 16]])
Depending on the size of the inputs and your time constraints, both of the following methods are worth considering.
Method 1: Numpy broadcasting
Operations on two arrays are possible if they are compatible.
Such operations are generally done together with broadcasting.
Broadcasting, in layman's terms, means repeating elements along a specified axis.
Conditions for broadcasting:
- The arrays need to be compatible.
- Compatibility is decided based on their shapes.
- Shapes are compared from right to left.
- While comparing from right to left, the dimensions should either be equal or one of them should be 1.
- The smaller array is broadcast (repeated) over the bigger array.
a.shape, b.shape
((2, 3), (1, 3))
From these rules they are compatible, so they can be added. b is the smaller array, so b is repeated along the first dimension and can be treated as [[1, 5, 10], [1, 5, 10]]. But note that numpy does not allocate new memory for this; it is just a view.
a + b
array([[ 2,  7, 13],
       [ 5, 10, 16]])
Method 2: Numba
Numba gives parallelism.
It will convert the function to optimized machine code.
Why use it? Sometimes numpy broadcasting is not good enough; ufuncs (np.add, np.matmul, etc.) allocate temporary memory during operations, and that can be time consuming if you are already at memory limits.
Easy parallelization.
Using numba, based on your requirements, you might not need the temporary memory allocation or the various checks which numpy does, which can speed up the code for huge inputs. For example, see: Why are np.hypot and np.subtract.outer very fast?
import numpy as np
import numba as nb

@nb.njit(parallel=True)
def sum(a, b):
    s = np.empty(a.shape, dtype=a.dtype)
    # nb.prange gives numba a hint about what to parallelize
    for i in nb.prange(a.shape[0]):
        s[i] = a[i] + b
    return s

sum(a, b)

LGBMClassifier + Unbalanced data + GridSearchCV()

The dependent variable is binary, the classes are unbalanced 1:10, the dataset has 70k rows, the scoring is the ROC curve, and I'm trying to use LGBM + GridSearchCV to get a model. However, I'm struggling with the parameters, as sometimes they aren't recognized even when I use them as the documentation shows:
params = {'num_leaves': [10, 12, 14, 16],
          'max_depth': [4, 5, 6, 8, 10],
          'n_estimators': [50, 60, 70, 80],
          'is_unbalance': [True]}
best_classifier = GridSearchCV(LGBMClassifier(), params, cv=3, scoring="roc_auc")
best_classifier.fit(X_train, y_train)
So:
What is the difference between putting the parameters in the GridSearchCV() call and in params?
As it's unbalanced data, I'm trying to use roc_auc as the scoring metric, since it's a metric that accounts for the unbalanced data. Should I use the argument scoring="roc_auc" or put it in the params argument?
The difference between putting the parameters in GridSearchCV() or in params is explained in the docs of GridSearchCV.
When you put it in params:
Dictionary with parameters names (str) as keys and lists of parameter settings to try as values, or a list of such dictionaries, in which case the grids spanned by each dictionary in the list are explored. This enables searching over any sequence of parameter settings.
And yes, you can put the scoring in as well.
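As a rough sketch (reusing X_train/y_train and the grid from the question): single-valued settings such as is_unbalance can be set on the estimator directly, the values to search over go in params, and scoring="roc_auc" is an argument of GridSearchCV itself.

from lightgbm import LGBMClassifier
from sklearn.model_selection import GridSearchCV

# Values to search over go in the parameter grid.
params = {'num_leaves': [10, 12, 14, 16],
          'max_depth': [4, 5, 6, 8, 10],
          'n_estimators': [50, 60, 70, 80]}

# Fixed, single-valued settings can be passed to the estimator itself;
# scoring and cv are arguments of GridSearchCV, not grid entries.
best_classifier = GridSearchCV(LGBMClassifier(is_unbalance=True),
                               params, cv=3, scoring="roc_auc")
best_classifier.fit(X_train, y_train)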

How do I get a snakemake rule to execute M times to generate MxN files?

I'm converting a bioinformatics pipeline to snakemake and have a script which loops over M files (where M=22, one file per non-sex chromosome).
Each file essentially contains N label columns that I want split into individual files. The python script does this reliably; my issue is that if I provide the Snakefile with wildcards for the output (both chromosomes and labels), it will run the script MxN times, whereas I only want it to run M times.
I can circumvent the problem by only looking for one label file per chromosome, but this isn't in keeping with the spirit of snakemake, and the next step in the pipeline requires input from all label files.
I've already tried using the checkpoint feature (which, as I understand it, re-evaluates the DAG after the rule is executed) to check the output, recognise that N files have been generated and skip N jobs. But this crashes and I get an error. Because I know my labels beforehand, as I understand it I shouldn't need checkpoint/dynamic - I just don't know exactly what I do need.
Is it possible to stop a wildcard from generating jobs and just check that the output is produced?
LABELS = ['A', 'B', 'C', 'D']
CHROMOSOMES = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22]

rule all:
    input:
        "out/final.txt"

rule split_files:
    '''
    Splits the chromosome files by label.
    '''
    input:
        "per_chromosome/myfile.{chromosome}.txt"
    output:
        "per_label/myfile.{label}.{chromosome}.txt"
    script:
        "scripts/split_files_snake.py"

rule make_out_file:
    '''
    Makes the final output file by checking each of the label.chromosome files one by one.
    '''
    input:
        expand("per_label/myfile.{label}.{chromosome}.txt",
               label=LABELS,
               chromosome=CHROMOSOMES)
    output:
        "out/final.txt"
    script:
        "scripts/make_out_file_snake.py"
If you wish to avoid the script being run N times per chromosome, you can specify all your output files without the label wildcard in the output:
output:
    "per_label/myfile.A.{chromosome}.txt",
    "per_label/myfile.B.{chromosome}.txt",
    "per_label/myfile.C.{chromosome}.txt",
    "per_label/myfile.D.{chromosome}.txt"
To make the code more generic you can use the expand function, but pay special attention to the braces in the format string:
output:
    expand("per_label/myfile.{label}.{{chromosome}}.txt", label=LABELS)

Tensorflow sliding window transformation over large data

I would like to feed my model with stride-1 windows from a very long data sequence (tens of millions of entries). This is similar to the aim presented in this thread, only that my data sequence may contain several features to begin with, so the final number of features is n_features * window_size. i.e. with two original features and a window size of 3, this would mean transforming this:
[[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]
to:
[[1, 2, 3, 6, 7, 8], [2, 3, 4, 7, 8, 9], [3, 4, 5, 8, 9, 10]]
I was trying to use slicing with map_fn or Dataset.map, applied to a sequence of indices (per the answer in the above-mentioned thread), as in:
ti = tf.range(data.shape[0] - window_size)
train_dataset = tf.data.Dataset.from_tensor_slices((ti, labels))

def get_window(l, label):
    wnd = tf.reshape(data_tensor[l:(l + window_size), :], (-1, window_size * n_features))
    wnd = tf.squeeze(wnd)
    return (wnd, label)

train_dataset = train_dataset.map(get_window)
train_dataset = train_dataset.batch(batch_size)
...
This works in principle, but training is extremely slow, with minimal GPU utilization (1-5%), probably in part because the mapping is done on the CPU.
When trying to do the same with tf.map_fn, the graph building becomes very lengthy, with tremendous memory utilization.
Another option I tried is to transform all of the data in advance, before loading it into Tensorflow. This works much faster (even when considering the pre-processing time; I wonder why - shouldn't it be the same operation as the mapping during training?), but it is very inefficient in terms of memory and storage, as the data becomes window_size times larger. That is a deal-breaker for my large datasets.
I thought about splitting these transformed, bloated datasets into several files ("hyper-batches") and going through them in sequence for each epoch, but this seems very inefficient, and I was wondering if there is a better way to achieve this simple transformation.
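For illustration, here is a sketch of the same stride-1 window transformation done lazily with tf.data.Dataset.window, so the windows are never materialised all at once (this assumes TF 2.x eager execution; the toy data reproduces the small example above):

import numpy as np
import tensorflow as tf

# Toy stand-in for the real data: shape (n_features, T), as in the small example above.
data = np.array([[1, 2, 3, 4, 5],
                 [6, 7, 8, 9, 10]], dtype=np.float32)
window_size = 3

# Work time-major so that each dataset element is one time step of all features.
ds = tf.data.Dataset.from_tensor_slices(data.T)            # elements of shape (n_features,)
ds = ds.window(window_size, shift=1, drop_remainder=True)  # nested datasets of length window_size
ds = ds.flat_map(lambda w: w.batch(window_size))           # elements of shape (window_size, n_features)
ds = ds.map(lambda w: tf.reshape(tf.transpose(w), [-1]))   # flatten feature-major

for x in ds:
    print(x.numpy())  # [1. 2. 3. 6. 7. 8.], [2. 3. 4. 7. 8. 9.], [3. 4. 5. 8. 9. 10.]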

How do I resolve one hot encoding if my test data has missing values in a col?

For example, if my training data has the categorical values (1, 2, 3, 4, 5) in a column, then one-hot encoding will give me 5 columns. But in the test data I have, say, only 4 out of the 5 values, i.e. (1, 3, 4, 5), so one-hot encoding will give me only 4 columns. Therefore, if I apply my trained weights to the test data, I will get an error because the column dimensions of the train and test data do not match, dim(4) != dim(5). Any suggestions on what to do with the missing column values?
The image of my code is provided below:
[image not reproduced]
Guys, please don't make this mistake!
Yes, you can do this hack of concatenating train and test and fool yourself, but the real problem is in production: there your model will someday face an unknown level of your categorical variable and then break.
In reality, some of the more viable options are:
- Retrain your model periodically to account for new data.
- Do not use one-hot encoding. Seriously, there are many better options, like leave-one-out encoding (https://www.kaggle.com/c/caterpillar-tube-pricing/discussion/15748#143154), conditional probability encoding (https://medium.com/airbnb-engineering/designing-machine-learning-models-7d0048249e69), and target encoding, to name a few (see the sketch after this list). Some classifiers like CatBoost even have a built-in mechanism for encoding, and there are mature libraries like target_encoders in Python, where you will find lots of other options.
- Embed the categorical features; this could save you from a complete retrain (http://flovv.github.io/Embeddings_with_keras/).
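As a minimal sketch of the target-encoding route, using the category_encoders package (assuming that is the library meant above; the column and data are made up):

import pandas as pd
import category_encoders as ce  # assumed library: pip install category_encoders

# Toy train/test frames; the test set contains a level ('e') never seen in training.
X_train = pd.DataFrame({'cat': ['a', 'b', 'c', 'a', 'b']})
y_train = pd.Series([1, 0, 1, 1, 0])
X_test = pd.DataFrame({'cat': ['a', 'c', 'e']})

# Fit on the training data only: each level is mapped to a target statistic,
# and unseen levels fall back to the global target mean, so the column count never changes.
encoder = ce.TargetEncoder(cols=['cat'])
encoder.fit(X_train, y_train)
X_test_encoded = encoder.transform(X_test)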
You can first combine the two dataframes, then call get_dummies, then split them again so that they have the exact same number of columns, i.e.
import numpy as np
import pandas as pd

# Example Dataframes
Xtrain = pd.DataFrame({'x': np.array([4, 2, 3, 5, 3, 1])})
Xtest = pd.DataFrame({'x': np.array([4, 5, 1, 3])})
# Concat with keys, then get dummies
temp = pd.get_dummies(pd.concat([Xtrain, Xtest], keys=[0, 1]), columns=['x'])
# Select the data back out of the multi index and assign it, i.e.
Xtrain, Xtest = temp.xs(0), temp.xs(1)
# Xtrain.as_matrix()
# array([[0, 0, 0, 1, 0],
#        [0, 1, 0, 0, 0],
#        [0, 0, 1, 0, 0],
#        [0, 0, 0, 0, 1],
#        [0, 0, 1, 0, 0],
#        [1, 0, 0, 0, 0]], dtype=uint8)
# Xtest.as_matrix()
# array([[0, 0, 0, 1, 0],
#        [0, 0, 0, 0, 1],
#        [1, 0, 0, 0, 0],
#        [0, 0, 1, 0, 0]], dtype=uint8)
Do not follow this approach. It's a simple trick with a lot of disadvantages. @Vast Academician's answer explains this better.
Use dummy (binary) encoding instead of one-hot encoding. Pandas' pd.get_dummies() with drop_first=True creates a dummy encoding, i.e. k-1 dummies out of k categorical levels, by removing the first level. The default, drop_first=False, creates a one-hot encoding.
See the pandas official documentation.
Dummy (binary) encoding also creates fewer columns.
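For example, a quick sketch of the difference on a toy column:

import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 4, 5]})

# One-hot encoding: one column per level (5 columns here).
pd.get_dummies(df['x'])

# Dummy (binary) encoding: the first level is dropped, giving k-1 = 4 columns.
pd.get_dummies(df['x'], drop_first=True)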