LightGBM large dataset distributed training using Dask - dask-distributed

If I create
from dask.distributed import LocalCluster, Client
import dask.dataframe as dd
import lightgbm as lgb

cluster = LocalCluster(host='168.211.90.21', n_workers=2)
client = Client(cluster)

dX = dd.read_csv('demo.csv')  # demo.csv is a huge local file, time-consuming to load and train on
# dy (the Dask collection of targets) is built elsewhere
dask_model = lgb.DaskLGBMRegressor(n_estimators=10)
dask_model.fit(dX, dy)
After invoking fit, does that mean dX will be sent to the remote machine 168.211.90.21 first and then allocated to the 2 workers?
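Not an answer to the scheduling question itself, but a minimal sketch of how one can inspect where the partitions actually end up, assuming the same cluster setup as above. persist, wait, and has_what are standard dask.distributed calls; whether read_csv can see demo.csv depends on where the workers actually run.

from dask.distributed import LocalCluster, Client, wait
import dask.dataframe as dd

cluster = LocalCluster(host='168.211.90.21', n_workers=2)
client = Client(cluster)

dX = dd.read_csv('demo.csv')   # lazy: nothing is read yet
dX = dX.persist()              # partitions are computed and kept in worker memory
wait(dX)                       # block until all partitions have landed on workers

# Map of worker address -> keys (partitions) held by that worker
print(client.has_what())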

Related

TensorFlow tf.data.Dataset takes too much time to generate the dataset. Is there a better way to optimize it?

I have .stem.mp4 files, each of which is composed of multiple audio sources.
Each file is 2 to 6 minutes long; the length varies a lot.
When I try to build a tf.data.Dataset out of them, generating an input_batch takes much longer than my model takes to make a prediction on a given batch.
Let me illustrate with an example.
import numpy as np
import tensorflow as tf
import tensorflow.keras as keras

sample_data = tf.random.normal((5, 755200, 2))  # 5 sources of audio, stereo channel
# The first axis is the mixture of the audio, so this is the input
# The remaining 4 are each source of the audio (e.g. bass, drums, vocals, etc.), so these are the output
input_mixture = sample_data[0, :, :]
target_mixtures = sample_data[1:, :, :]
target_mixtures = np.column_stack(target_mixtures)

length = 44100 * 11  # I want to split these into windows of 11 seconds
strides = 44100      # 1-second stride

ds_inp = tf.data.Dataset.from_tensor_slices(input_mixture)
ds_inp = ds_inp.window(length, shift=strides, drop_remainder=True)
ds_inp = ds_inp.flat_map(lambda window: window.batch(length))
ds_inp = ds_inp.map(lambda window: window, num_parallel_calls=tf.data.AUTOTUNE)

ds_tar = tf.data.Dataset.from_tensor_slices(target_mixtures)
ds_tar = ds_tar.window(length, shift=strides, drop_remainder=True)
ds_tar = ds_tar.flat_map(lambda window: window.batch(length))
ds_tar = ds_tar.map(lambda window: window, num_parallel_calls=tf.data.AUTOTUNE)

total_ds = tf.data.Dataset.zip((ds_inp, ds_tar))
total_ds = total_ds.batch(BATCH_SIZE)  # BATCH_SIZE is defined elsewhere
total_ds = total_ds.prefetch(tf.data.AUTOTUNE)
This is how I made a tf.data.Dataset from the given file.
And when I measure how long it takes to produce an input_batch and output_batch:
%%time
for i, j in total_ds.take(1):
    pass
# Wall time: 18.3 s
My model has about 100 million variables, but since it has a fairly simple structure, it only takes about 6 seconds to generate a predicted_batch from a given input_batch.
So my question is: is there any way to generate the input_batch and output_batch faster?
(My assumption is that, since this approach 'windows' the given arrays, there may be no better way to improve it.)
Obviously, all of the files are too large to be cached.
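If the bottleneck really is the window/flat_map stage, one alternative worth sketching (on the in-memory example above only, not a tested drop-in for the file-based pipeline; it materializes all windows at once, so memory use goes up) is to build the sliding windows with tf.signal.frame and slice a dataset from the result:

import tensorflow as tf

length = 44100 * 11   # 11-second windows, as above
strides = 44100       # 1-second hop

def make_windows(signal):
    # (num_samples, channels) -> (num_windows, length, channels)
    return tf.signal.frame(signal, frame_length=length, frame_step=strides, axis=0)

inp_windows = make_windows(input_mixture)    # input_mixture / target_mixtures from the snippet above
tar_windows = make_windows(target_mixtures)

total_ds = (tf.data.Dataset.from_tensor_slices((inp_windows, tar_windows))
            .batch(BATCH_SIZE)               # BATCH_SIZE as in the original snippet
            .prefetch(tf.data.AUTOTUNE))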

Efficient way of storing 1TB of random data with Zarr

I'd like to store 1 TB of random data backed by a Zarr on-disk array. Currently, I am doing something like the following:
import numpy as np
import zarr
from numcodecs import Blosc
compressor = Blosc(cname='lz4', clevel=5, shuffle=Blosc.BITSHUFFLE)
store = zarr.DirectoryStore('TB1.zarr')
root = zarr.group(store)
TB1 = root.zeros('data',
                 shape=(1_000_000, 1_000_000),
                 chunks=(20_000, 5_000),
                 compressor=compressor,
                 dtype='|i2')
for i in range(1_000_000):
    TB1[i, :1_000_000] = np.random.randint(0, 3, size=1_000_000, dtype='|i2')
This is going to take some time -- I know things could probably be improved if I weren't always generating 1_000_000 random numbers and instead reused the array, but I'd like some more randomness for now. Is there a better way to go about building this random dataset?
Update 1
Using bigger numpy blocks speeds things up a bit:
for i in range(0, 1_000_000, 100_000):
    TB1[i:i+100_000, :1_000_000] = np.random.randint(0, 3, size=(100_000, 1_000_000), dtype='|i2')
I'd recommend using Dask Array which will enable parallel computation of random numbers and storage, e.g.:
import zarr
from numcodecs import Blosc
import dask.array as da
shape = 1_000_000, 1_000_000
dtype = 'i2'
chunks = 20_000, 5_000
compressor = Blosc(cname='lz4', clevel=5, shuffle=Blosc.BITSHUFFLE)
# set up zarr array to store data
store = zarr.DirectoryStore('TB1.zarr')
root = zarr.group(store)
TB1 = root.zeros('data',
                 shape=shape,
                 chunks=chunks,
                 compressor=compressor,
                 dtype=dtype)
# set up a dask array with random numbers
d = da.random.randint(0, 3, size=shape, dtype=dtype, chunks=chunks)
# compute and store the random numbers
d.store(TB1, lock=False)
By default Dask will compute using all available local cores, but can also be configured to run on a cluster via the Distributed package.
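For the cluster case, a minimal sketch would be to create a Client before calling store. The scheduler address below is a placeholder, and this assumes all workers can see the target directory (e.g. a shared filesystem):

from dask.distributed import Client
import dask.array as da

# Placeholder address -- point this at your running scheduler
client = Client('tcp://scheduler-address:8786')

shape = 1_000_000, 1_000_000
chunks = 20_000, 5_000
d = da.random.randint(0, 3, size=shape, dtype='i2', chunks=chunks)

# With an active Client, store() executes on the cluster workers
# instead of the local thread pool.
d.store(TB1, lock=False)   # TB1 is the zarr array created above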

Memory leak when running universal-sentence-encoder-large iterating over a dataframe

I have 140K sentences I want to get embeddings for. I am using the TF Hub Universal Sentence Encoder and am iterating over the sentences (I know it's not the best way, but when I try to feed more than 500 sentences into the model it crashes).
My Environment is:
Ubuntu 18.04
Python 3.7.4
TF 1.14
RAM: 16 GB
Processor: i5
My code is:
version 1
I iterate inside the tf.Session context manager:
import tensorflow as tf
import tensorflow_hub as hub

embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder-large/3")
df = pandas_repository.get_dataframe_from_table('sentences')

with tf.compat.v1.Session() as session:
    session.run(tf.global_variables_initializer())
    session.run(tf.tables_initializer())
    sentence_embedding = None
    for i, row in df.iterrows():
        sentence = row['content']
        embeddings = embed([sentence])
        sentence_embedding = session.run(embeddings)
        df.at[i, 'embedding'] = sentence_embedding
        print('processed index:', i)
version 2
I open and close a session within each iteration
embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder-large/3")
df = pandas_repository.get_dataframe_from_table('sentences')

for i, row in df.iterrows():
    sentence = row['content']
    embeddings = embed([sentence])
    sentence_embedding = None
    with tf.compat.v1.Session() as session:
        session.run(tf.global_variables_initializer())
        session.run(tf.tables_initializer())
        sentence_embedding = session.run(embeddings)
    df.at[i, 'embedding'] = sentence_embedding
    print('processed index:', i)
While version 2 does seem to trigger some sort of GC and memory gets cleared a bit, it still explodes after about 50 items.
Version 1 just goes on gobbling memory.
The correct solution, as given by arnoegw:
def calculate_embeddings(dataframe, table_name):
    sql_get_sentences = "SELECT * FROM semantic_similarity.sentences WHERE embedding IS NULL LIMIT 1500"
    sql_update = 'UPDATE {} SET embedding = data.embedding FROM (VALUES %s) AS data(id, embedding) WHERE {}.id = data.id'.format(table_name, table_name)
    df = pandas_repository.get_dataframe_from_sql(sql_get_sentences)
    with hub.eval_function_for_module("https://tfhub.dev/google/universal-sentence-encoder-large/3") as embed:
        while len(df) > 0:  # loop until no unembedded sentences are left
            sentence_array = df['content'].values
            sentence_embeddings = embed(sentence_array)
            df['embedding'] = sentence_embeddings.tolist()
            values = [tuple(x) for x in df[['id', 'embedding']].values]
            pandas_repository.update_db_from_df('semantic_similarity.sentences', sql_update, values)
            df = pandas_repository.get_dataframe_from_sql(sql_get_sentences)
I am a newbie to TF and can use any help I can get.
Your code uses tf.Session, so it falls under the TF1.x programming model of first building a dataflow graph and then running it repeatedly with inputs being fed and outputs being fetched from the graph.
But your code does not align well with that programming model. Both versions keep adding new applications of (calls to) the hub.Module to the default TensorFlow graph instead of applying it once and running the same graph repeatedly for the various inputs. Version 2 keeps going into and out of tf.Sessions, which frees some memory but is very inefficient.
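For illustration, here is a minimal sketch of that pattern for this use case (the placeholder name and the batch size of 100 are arbitrary choices, not from the original code): build the graph once with a string placeholder, then run the same ops repeatedly for each batch.

import tensorflow as tf
import tensorflow_hub as hub

embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder-large/3")
sentences_ph = tf.placeholder(dtype=tf.string, shape=[None])  # fed with a batch of sentences
embeddings_op = embed(sentences_ph)  # the module is applied to the graph exactly once

with tf.compat.v1.Session() as session:
    session.run([tf.global_variables_initializer(), tf.tables_initializer()])
    for start in range(0, len(df), 100):  # df as in the question; 100 is an arbitrary batch size
        batch = df['content'].values[start:start + 100]
        batch_embeddings = session.run(embeddings_op, feed_dict={sentences_ph: batch})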
Please see my answer to "Strongly increasing memory consumption when using ELMo from Tensorflow-Hub" for guidance on how to do it right in the graph-based programming model of TensorFlow 1.x.
TensorFlow 2.0, which is going to be released soon, defaults to the programming model of "eager execution", which does away with graphs and sessions and would have avoided this confusion. TensorFlow Hub will be updated in due course for TF2.0. For a preview close to your use-case, see https://colab.research.google.com/github/tensorflow/hub/blob/master/examples/colab/tf2_text_classification.ipynb

Multiprocessing with large HDF5 files

I have 5 HDF5 files that are 22 GB each. Each HDF5 file is a series of 4801 images that are 1920 by 1200 in size. I need to load the same frame number from each HDF5 file, get rid of some rogue pixels, average the stack of 5 images, and write a new HDF5 file with one processed image at each frame number. Because I can't load all 5 HDF5 files in at once without running out of RAM, I am only loading in chunks of images from each HDF5 file, putting 5 images for each frame number into a queue, processing the stack, and writing the resulting image to an HDF5 file. Right now I am using h5py to perform any reading/writing of HDF5 files.
I would like to know the most computationally efficient way of working on chunked data. Right now, I am dedicating one processor to be the writer, then looping through some chunk size of data for which I create a number of consumers, put the data in a queue, wait for the consumers to finish, then rinse and repeat until all of the images are processed. This means that every time the loop advances, it creates new consumer processes - I imagine there is some overhead in this. A sample of the code is below.
#!/usr/bin/env python
import time
import os
from multiprocessing import Process, Queue, JoinableQueue, cpu_count
import glob
import h5py
import numpy as np

'''Function definitions'''
# The consumer function takes data off of the Queue
def consumer(inqueue, output):
    # Run indefinitely
    while True:
        # If the queue is empty, queue.get() will block until the queue has data
        all_data = inqueue.get()
        if all_data:
            # n is the index corresponding to the projection location
            n, image_data = all_data
            # Replace zingers with the median, then average the stack
            # Find the median for each pixel of the prefiltered image
            med = np.median(image_data, axis=0)
            # Loop through the image set
            for j in range(image_data.shape[0]):
                replicate = image_data[j, ...]
                mask = replicate - med > zinger_level  # zinger_level is a threshold defined elsewhere
                replicate[mask] = med[mask]     # Substitute with median
                image_data[j, ...] = replicate  # Put data back in place
            out = np.mean(image_data, axis=0, dtype=np.float32).astype(np.uint16)
            output.put((n, out))
        else:
            break

# Function for writing out the HDF5 file
def write_hdf(output, output_filename):
    # Write each processed image into the output HDF5 file
    while True:
        args = output.get()
        if args:
            i, data = args
            with h5py.File(output_filename, 'a') as fout:
                fout['Prefiltered_images'][i, ...] = data
        else:
            break

def fprocess_hdf_stack(hdf_filenames, output_filename):
    file_list = []
    for fname in hdf_filenames:
        file_list.append(h5py.File(fname, 'r'))
    # Process chunks of data so that we don't run out of memory
    totsize = h5py.File(hdf_filenames[0], 'r')['exchange']['data'].shape[0]
    data_shape = h5py.File(hdf_filenames[0], 'r')['exchange']['data'].shape
    # Create the output file and the dataset that will hold the results
    fout = h5py.File(output_filename, 'w')
    fout.create_dataset('Prefiltered_images', data_shape, dtype=np.uint16)
    fout.close()
    ints = range(totsize)
    chunkSize = 100
    # Initialize how many consumers we would like working
    num_consumers = cpu_count() * 2
    # Create the Queue objects
    inqueue = JoinableQueue()
    output = Queue()
    # Start the process for writing the HDF5 file
    proc = Process(target=write_hdf, args=(output, output_filename))
    proc.start()
    print("Loading %i images into memory..." % chunkSize)
    for i in range(0, totsize, chunkSize):
        time0 = time.time()
        chunk = ints[i:i + chunkSize]
        data_list = []
        # Make a list of the HDF5 datasets we are reading in
        for files in file_list:
            # shape is (angles, rows, columns)
            data_list.append(files['exchange']['data'][chunk, ...])
        data_list = np.asarray(data_list)
        print("Elapsed time to load images %i-%i is %0.2f minutes." % (chunk[0], chunk[-1], (time.time() - time0) / 60))
        consumers = []
        # Create consumer processes
        for i in range(num_consumers):
            p = Process(target=consumer, args=(inqueue, output))
            consumers.append(p)
            p.start()
        for n in range(data_list.shape[1]):
            # Feed data into the queue
            inqueue.put((chunk[n], data_list[:, n, ...]))
        # Kill all of the consumer processes when everything is finished
        for i in range(num_consumers):
            inqueue.put(None)
        for c in consumers:
            c.join()
        print("Elapsed time to process images %i-%i is %0.2f minutes." % (chunk[0], chunk[-1], (time.time() - time0) / 60))
        time.sleep(1)
    output.put(None)
    proc.join()
    # Close the input HDF5 files.
    for hdf_file in file_list:
        hdf_file.close()
    print("Input HDF5 files closed.")
    return

if __name__ == '__main__':
    start_time = time.time()
    # raw_images_dir, raw_images_basename, and output_dir are defined elsewhere
    raw_images_filenames = glob.glob(raw_images_dir + raw_images_basename)
    tempname = os.path.basename(raw_images_filenames[0]).split('.')[0]
    tempname_split = tempname.split('_')[:-1]
    output_filename = output_dir + '_'.join(tempname_split) + '_Prefiltered.hdf5'
    fprocess_hdf_stack(raw_images_filenames, output_filename)
    print("Elapsed time is %0.2f minutes" % ((time.time() - start_time) / 60))
I don't think my bottleneck is actually in the loading of the images. It is in initializing the consumers and carrying out the processing on the 5 images for each frame number. I've played around with moving the consumer creation out of the for loop, but I don't know how to put a memory cap on this so that I don't run out of RAM. Thanks!
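One pattern that might address this (only a sketch, reusing consumer, totsize, chunkSize, chunk, and data_list from the script above; the maxsize value is an arbitrary choice) is to start the consumers once and give the input queue a maxsize, so put() blocks instead of letting unprocessed frames pile up in memory:

from multiprocessing import Process, JoinableQueue, Queue, cpu_count

num_consumers = cpu_count() * 2
# Bounded queue: put() blocks once 2*num_consumers items are waiting,
# so at most that many image stacks sit in the queue at any time.
inqueue = JoinableQueue(maxsize=2 * num_consumers)
output = Queue()

# Start the consumers once, before looping over chunks.
consumers = [Process(target=consumer, args=(inqueue, output)) for _ in range(num_consumers)]
for p in consumers:
    p.start()

for i in range(0, totsize, chunkSize):
    # ... read the next chunk from the HDF5 files into data_list, as before ...
    for n in range(data_list.shape[1]):
        inqueue.put((chunk[n], data_list[:, n, ...]))  # blocks when the queue is full

# Shut the consumers down once, after all chunks have been fed.
for _ in range(num_consumers):
    inqueue.put(None)
for p in consumers:
    p.join()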

Spark ml LogisticRegression failing for large libsvm format data

I am trying to fit a Spark (1.6.0) ml.LogisticRegression model with pure L1 regularization (elasticNetParam = 1.0, regParam = 0.1) to a large (2920874 rows, 2564827 features) dataset saved in libsvm format. Since this is pure L1, the OWLQN optimizer is called for the optimization part. However, in the resulting model, the regression weights remain at their initial state and the number of iterations seems to be just one, which means the optimizer exits right after initialization for some reason. Moreover, the optimizer does not print any info messages, which is further evidence that it does not go beyond the initialization step. My code is attached here:
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator}
import org.apache.spark.mllib.linalg.Vector
import java.io._
import org.apache.spark.mllib.util.MLUtils
import org.apache.log4j.Logger
import org.apache.log4j.Level
Logger.getLogger("org").setLevel(Level.OFF)
Logger.getLogger("akka").setLevel(Level.OFF)
val training = sqlContext.read.format("libsvm").option("numFeatures", "2564828").load("/path/to/large/libsvm/training/data")
val test = sqlContext.read.format("libsvm").option("numFeatures", "2564828").load("/path/to/large/libsvm/test/data")
val lr = (new LogisticRegression()).setMaxIter(100).setElasticNetParam(1.0).setRegParam(0.1)
val lrm = lr.fit(training)
val output = lrm.transform(test)
val evaluator = (new BinaryClassificationEvaluator())
val metric = evaluator.evaluate(output)
A few relevant points:
This code was run from spark-shell (1.6.0) with 40 cores (10 executors with 4 cores each), 4 GB driver memory and 4 GB executor memory.
The same code with L2 regularization (elasticNetParam = 0.0) runs without any issues.
The same code (pure L1) with the smaller matrix-format data used in the LogisticRegression test case runs without issues.