Processing a huge file (>30GB) in Python - pandas

I need to process a huge file of around 30GB containing hundreds of millions of rows. More precisely, I want to perform the three following steps:
Reading the file by chunks: given the size of the file, I don't have the memory to read the file in one go;
Computing stuff on the chunks before aggregating each of them to a more manageable size;
Concatenating the aggregated chunks into a final dataset containing the results of my analyses.
So far, I have coded two threads :
One thread in charge of reading the file by chunks and storing the chunks in a Queue (step 1);
One thread in charge of performing the analyses (step 2) on the chunks;
Here is the spirit of my code so far with dummy data:
import queue
import threading
import concurrent.futures
import os
import random
import pandas as pd
import time
def process_chunk(df):
return df.groupby(["Category"])["Value"].sum().reset_index(drop=False)
def producer(queue, event):
print("Producer: Reading the file by chunks")
reader = pd.read_table(full_path, sep=";", chunksize=10000, names=["Row","Category","Value"])
for index, chunk in enumerate(reader):
print(f"Producer: Adding chunk #{index} to the queue")
queue.put((index, chunk))
time.sleep(0.2)
print("Producer: Finished putting chunks")
event.set()
print("Producer: Event set")
def consumer(queue, event, result_list):
# The consumer stops iff queue is empty AND event is set
# <=> The consumer keeps going iff queue is not empty OR event is not set
while not queue.empty() or not event.is_set():
try:
index, chunk = queue.get(timeout=1)
except queue.Empty:
continue
print(f"Consumer: Retrieved chunk #{index}")
print(f"Consumer: Queue size {queue.qsize()}")
result_list.append(process_chunk(chunk))
time.sleep(0.1)
print("Consumer: Finished retrieving chunks")
if __name__=="__main__":
# Record the execution time
start = time.perf_counter()
# Generate a fake file in the current directory if necessary
path = os.path.dirname(os.path.realpath(__file__))
filename = "fake_file.txt"
full_path = os.path.join(path, filename)
if not os.path.exists(full_path):
print("Main: Generate a dummy dataset")
with open(full_path, "w", encoding="utf-8") as f:
for i in range(100000):
value = random.randint(1,101)
category = i%2
f.write(f"{i+1};{value};{category}\n")
# Defining a queue that will store the chunks of the file read by the Producer
queue = queue.Queue(maxsize=5)
# Defining an event that will be set by the Producer when he is done
event = threading.Event()
# Defining a list storing the chunks processed by the Consumer
result_list = list()
# Launch the threads Producer and Consumer
with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
executor.submit(producer, queue, event)
executor.submit(consumer, queue, event, result_list)
# Display that the program is finished
print("Main: Consumer & Producer have finished!")
print(f"Main: Number of processed chunks = {len(result_list)}")
print(f"Main: Execution time = {time.perf_counter()-start} seconds")
I know that each iteration of step 1 takes more time than each iteration of step 2 i.e. that the Consumer will always be waiting for the Producer.
How can I speed up the process of reading my file by chunks (step 1) ?

Related

How to consume all messages from RabbitMQ queue using pika

I would like to write a daemon in Python that wakes up periodically to process some data queued up in a RabbitMQ queue.
When the daemon wakes up, it should consume all messages in the queue (or min(len(queue), N), where N is some arbitrary number) because it's better for the data to be processed in batches. Is there a way of doing this in pika, as opposed to passing in a callback that gets called on every message arrival?
Thanks.
Here is the code written using pika. A similar function can be written using basic.get
The below code will make use of channel.consume to start consuming messages. We break out/stop when the desired number of messages is reached.
I have set a batch_size to prevent pulling of huge number of messages at once. You can always change the batch_size to fit your needs.
from pika import BasicProperties, URLParameters
from pika.adapters.blocking_connection import BlockingChannel, BlockingConnection
from pika.exceptions import ChannelWrongStateError, StreamLostError, AMQPConnectionError
from pika.exchange_type import ExchangeType
import json
def consume_messages(queue_name: str):
msgs = list([])
batch_size = 500
q = channel.queue_declare(queue_name, durable=True, exclusive=False, auto_delete=False)
q_length = q.method.message_count
if not q_length:
return msgs
msgs_limit = batch_size if q_length > batch_size else q_length
try:
# Get messages and break out
for method_frame, properties, body in channel.consume(queue_name):
# Append the message
try:
msgs.append(json.loads(bytes.decode(body)))
except:
logger.info(f"Rabbit Consumer : Received message in wrong format {str(body)}")
# Acknowledge the message
channel.basic_ack(method_frame.delivery_tag)
# Escape out of the loop when desired msgs are fetched
if method_frame.delivery_tag == msgs_limit:
# Cancel the consumer and return any pending messages
requeued_messages = channel.cancel()
print('Requeued %i messages' % requeued_messages)
break
except (ChannelWrongStateError, StreamLostError, AMQPConnectionError) as e:
logger.info(f'Connection Interrupted: {str(e)}')
finally:
# Close the channel and the connection
channel.stop_consuming()
channel.close()
return msgs
You can use the basic.get API, which pulls messages from the brokers, instead of subscribin for the messages to be pushed

Multiprocessing code works using numpy but deadlocked using pytorch

I'm hitting what appears to be a deadlock when trying to make use of multiprocessing with pytorch. The equivalent numpy code works like I expect it to.
I've made a simplified version of my code: a pool of 4 workers executing an array-wide broadcast operation 1000 times (so ~250 each worker). The array in question is 100,000 x 3 and the broadcast operation is subtraction of all rows by a single 1 x 3 row array. The large array is a shared/global array, and the row array is different at each iteration.
The code works exactly as expected using numpy, with the pooled workers showing a 4x speedup over the equivalent for loop.
The code in pytorch, however, hits a deadlock (I assume): none of the workers complete the array broadcast operation even once.
The numpy code below prints the following:
Finished for loop over my_subtractor: took 8.1504 seconds.
Finished pool over my_subtractor: took 2.2247 seconds.
The pytorch code, on the other hand, prints this then stalls:
Finished for loop over my_subtractor: took 3.1082 seconds.
BLA
BLA
BLA
BLA
"BLA" print statements are just to show that each worker is stuck in -- apparently -- a deadlock state. There are exactly 4 of these: one per worker entering -- and getting stuck in -- an iteration.
If you feel ambitious enough to reproduce, note that it doesn't work on Windows because it's not wrapped around if __name__ == '__main__': (I read somewhere that you need this because of the way Windows handles launching processes). Also you will need to create an empty file called my_globals.py.
Here is the numpy code
from time import time
import numpy as np
import my_globals
from multiprocessing import Pool as ThreadPool
# shared memory by virtue of being global
my_globals.minuend = np.random.rand(100000,3)
# array to be iterated over in for loop / pool of workers
subtrahends = np.random.rand(10000,3)
# function called at each iteration (broadcast operation)
def my_subtractor(subtrahend):
my_globals.minuend - subtrahend
return 0
# launch for loop
ts = time()
for idx, subtrahend in enumerate(subtrahends):
my_subtractor(subtrahend)
te = time()
print('Finished for loop over my_subtractor: took %2.4f seconds.' % (te - ts))
# launch equivalent pool of workers
ts = time()
pool = ThreadPool(4)
pool.map(my_subtractor, subtrahends)
pool.close()
pool.join()
te = time()
print('Finished pool over my_subtractor: took %2.4f seconds.' % (te - ts))
Here is the equivalent pytorch code:
from time import time
import torch
import my_globals
from torch.multiprocessing import Pool as ThreadPool
# necessary on my system because it has low limits for number of file descriptors; not recommended for most systems,
# see: https://pytorch.org/docs/stable/multiprocessing.html#file-descriptor-file-descriptor
torch.multiprocessing.set_sharing_strategy('file_system')
# shared memory by virtue of being global
my_globals.minuend = torch.rand(100000,3)
# array to be iterated over in for loop / pool of workers
subtrahends = torch.rand(10000,3)
# function called at each iteration (broadcast operation)
def my_subtractor(subtrahend, verbose=True):
if verbose:
print("BLA") # -- prints for every worker in the pool (so 4 times total)
my_globals.minuend - subtrahend
if verbose:
print("ALB") # -- doesn't print for any worker
return 0
# launch for loop
ts = time()
for idx, subtrahend in enumerate(subtrahends):
my_subtractor(subtrahend, verbose=False)
te = time()
print('Finished for loop over my_subtractor: took %2.4f seconds.' % (te - ts))
# launch equivalent pool of workers
ts = time()
pool = ThreadPool(4)
pool.map(my_subtractor, subtrahends)
pool.close()
pool.join()
te = time()
print('Finished pool over my_subtractor: took %2.4f seconds.' % (te - ts))
You can try to set OMP_NUM_THREADS=1 environment variable as an attempt to crunch-fix this. It helped me with DataLoader+OpenCV deadlock.

Multiprocessing with large HDF5 files

I have 5 HDF5 files that are 22 GB each. Each HDF5 file is a series of 4801 images that are 1920 by 1200 in size. I need to load the same frame number from each HDF5 file, get rid of some rogue pixels, average the stack of 5 images, and write a new HDF5 file with one processed image at each frame number. Because I can't load all 5 HDF5 files in at once without running out of RAM, I am only loading in chunks of images from each HDF5 file, putting 5 images for each frame number into a queue, processing the stack, and writing the resulting image to an HDF5 file. Right now I am using h5py to perform any reading/writing of HDF5 files.
I would like to know what the most computationally effective way is of working on chunked data? Right now, I am dedicating one processor to be the writer, then looping through some chunk size of data for which I create a number of consumers, put the data in a queue, wait for the consumers to be finished, then rinse and repeat until all of the images are processed. This means that every time the loop advances, it creates new consumer processes - I imagine there is some overhead in this. A sample of the code is below.
#!/usr/bin/env python
import time
import os
from multiprocessing import Process, Queue, JoinableQueue, cpu_count
import glob
import h5py
import numpy as np
'''Function definitions'''
# The consumer function takes data off of the Queue
def consumer(inqueue,output):
# Run indefinitely
while True:
# If the queue is empty, queue.get() will block until the queue has data
all_data = inqueue.get()
if all_data:
#n is the index corresponding to the projection location
n, image_data = all_data
#replace zingers with median and average stack
#Find the median for each pixel of the prefiltered image
med = np.median(image_data,axis=0)
#Loop through the image set
for j in range(image_data.shape[0]):
replicate = image_data[j,...]
mask = replicate - med > zinger_level
replicate[mask] = med[mask] # Substitute with median
image_data[j,...] = replicate # Put data back in place
out = np.mean(image_data,axis=0,dtype=np.float32).astype(np.uint16)
output.put((n,out))
else:
break
#function for writing out HDF5 file
def write_hdf(output,output_filename):
#create output HDF5 file
while True:
args = output.get()
if args:
i,data = args
with h5py.File(output_filename,'a') as fout:
fout['Prefiltered_images'][i,...] = data
else:
break
def fprocess_hdf_stack(hdf_filenames,output_filename):
file_list = []
for fname in hdf_filenames:
file_list.append(h5py.File(fname,'r'))
#process chunks of data so that we don't run out of memory
totsize = h5py.File(hdf_filenames[0],'r')['exchange']['data'].shape[0]
data_shape = h5py.File(hdf_filenames[0],'r')['exchange']['data'].shape
fout.create_dataset('Prefiltered_images',data_shape,dtype=np.uint16)
fout.close()
ints = range(totsize)
chunkSize= 100
#initialize how many consumers we would like working
num_consumers = cpu_count()*2
#Create the Queue objects
inqueue = JoinableQueue()
output = Queue()
#start process for writing HDF5 file
proc = Process(target=write_hdf, args=(output,output_filename))
proc.start()
print("Loading %i images into memory..."%chunkSize)
for i in range(0,totsize,chunkSize):
time0 = time.time()
chunk = ints[i:i+chunkSize]
data_list = []
#Make a list of the HDF5 datasets we are reading in
for files in file_list:
#shape is (angles, rows, columns)
data_list.append(files['exchange']['data'][chunk,...])
data_list = np.asarray(data_list)
print("Elapsed time to load images %i-%i is %0.2f minutes." %(chunk[0],chunk[-1],(time.time() - time0)/60))
consumers = []
#Create consumer processes
for i in range(num_consumers):
p = Process(target=consumer, args=(inqueue,output))
consumers.append(p)
p.start()
for n in range(data_list.shape[1]):
#Feed data into the queue
inqueue.put((chunk[n],data_list[:,n,...]))
#Kill all of the processes when everything is finished
for i in range(num_consumers):
inqueue.put(None)
for c in consumers:
c.join()
print("Elapsed time to process images %i-%i is %0.2f minutes." %(chunk[0],chunk[-1],(time.time() - time0)/60))
time.sleep(1)
output.put(None)
proc.join()
#Close the input HDF5 files.
for hdf_file in file_list:
hdf_file.close()
print("Input HDF5 files closed.")
return
if __name__ == '__main__':
start_time = time.time()
raw_images_filenames = glob.glob(raw_images_dir + raw_images_basename)
tempname = os.path.basename(raw_images_filenames[0]).split('.')[0]
tempname_split = tempname.split('_')[:-1]
output_filename = output_dir+'_'.join(tempname_split) + '_Prefiltered.hdf5'
fprocess_hdf_stack(raw_images_filenames,output_filename)
print("Elapsed time is %0.2f minutes" %((time.time() - start_time)/60))
I don't think my bottleneck is actually in the loading of the images. It is in initializing the consumers and carrying out the processing on the 5 images per each frame number. I've played around with taking the consumer function out of the for loop, but I don't know how to put a memory cap on this so that I don't run out of RAM. Thanks!

Example pipeline for TFRecords with chunking for long input sequences

I'm trying to optimise the input pipeline for a model I am using that uses GRUs. The data consists of a large number of files that contain time series of length 5000 with dimensionality of 50. I know that it isn't feasible to feed a single sequence of length 5000 into an RNN owing to the vanishing gradient, and you should instead try to chunk it into (5000-seq_len) overlapping chunks, where seq_len is a more manageable length, say 200 timesteps.
The most obvious method for getting this to work with TFRecords/SequenceExamples is to simply have each chunk included as a new SequenceExample within the same file. This seems massively inefficient however, as the majority of data in the resulting TFRecords file will be duplicate data.
Is there a better method of doing this? I've seen very few examples of how to use TFRecords that don't involve images, and no examples that use non-trivial sequence lengths!
For example:
def chunk_save_tfrecords(X, file_path_prefix, seq_length):
# Generate tfrecord writer
result_tf_file = file_path_prefix + '.tfrecords'
with tf.python_io.TFRecordWriter(result_tf_file) as writer:
# Chunk the data
for i in range(int(X.shape[0] - seq_length)):
chunk = X[i:i+seq_length]
data_features = [
tf.train.Feature(
float_list=tf.train.FloatList(value=chunk[t]))
for t in range(seq_length)] # FloatList per timestep
feature_lists = tf.train.FeatureLists(
feature_list={
'data': tf.train.FeatureList(feature=data_features)})
serialized = tf.train.SequenceExample(
feature_lists=feature_lists).SerializeToString()
writer.write(serialized)
def save_tfrecords(X, file_path_prefix):
# Generate tfrecord writer
result_tf_file = file_path_prefix + '.tfrecords'
with tf.python_io.TFRecordWriter(result_tf_file) as writer:
data_features = [
tf.train.Feature(
float_list=tf.train.FloatList(value=X[t]))
for t in range(X.shape[0])] # FloatList per timestep
feature_lists = tf.train.FeatureLists(
feature_list={
'data': tf.train.FeatureList(feature=data_features)})
serialized = tf.train.SequenceExample(
feature_lists=feature_lists).SerializeToString()
writer.write(serialized)
test = np.random.randn(5000,50)
save_tfrecords(test, 'test')
chunk_save_tfrecords(test, 'test_chunk', 200)
save_tfrecords creates a 1MB file, while chunk_save_tfrecords creates a 200MB file!

Multiprocessing with class functions and class attributes

I have a pandas Dataframe, that has millions of rows and I have to do row-wise operations. Since I have a Multicore CPU, I would like to speed up that process using Multiprocessing. The way I would like to do this is to just split up the dataframe in equally sized dataframes and process each of them within a separate process. So far so good...
The problem is, that my code is written in OOP style and I get Pickle errors using a Multiprocess Pool. What I do is, I pass a reference to a class function self.X to the pool. I further use class attributes within X (only read access). I really don't want to switch back to functional programming style... Hence, is it possible to do Multiprocessing in an OOP envirnoment?
It should be possible as long as all elements in your class (that you pass to the sub-processes) is picklable. That is the only thing you have to make sure. If there are any elements in your class that are not, then you cannot pass it to a Pool. Even if you only pass self.x, everything else like self.y has to be picklable.
I do my pandas Dataframe processing like that:
import pandas as pd
import multiprocessing as mp
import numpy as np
import time
def worker(in_queue, out_queue):
for row in iter(in_queue.get, 'STOP'):
value = (row[1] * row[2] / row[3]) + row[4]
time.sleep(0.1)
out_queue.put((row[0], value))
if __name__ == "__main__":
# fill a DataFrame
df = pd.DataFrame(np.random.randn(1e5, 4), columns=list('ABCD'))
in_queue = mp.Queue()
out_queue = mp.Queue()
# setup workers
numProc = 2
process = [mp.Process(target=worker,
args=(in_queue, out_queue)) for x in range(numProc)]
# run processes
for p in process:
p.start()
# iterator over rows
it = df.itertuples()
# fill queue and get data
# code fills the queue until a new element is available in the output
# fill blocks if no slot is available in the in_queue
for i in range(len(df)):
while out_queue.empty():
# fill the queue
try:
row = next(it)
in_queue.put((row[0], row[1], row[2], row[3], row[4]), block=True) # row = (index, A, B, C, D) tuple
except StopIteration:
break
row_data = out_queue.get()
df.loc[row_data[0], "Result"] = row_data[1]
# signals for processes stop
for p in process:
in_queue.put('STOP')
# wait for processes to finish
for p in process:
p.join()
This way I do not have to pass big chunks of DataFrames and I do not have to think about picklable elements in my class.