How to consume all messages from a RabbitMQ queue using pika

I would like to write a daemon in Python that wakes up periodically to process some data queued up in a RabbitMQ queue.
When the daemon wakes up, it should consume all messages in the queue (or min(len(queue), N), where N is some arbitrary number) because it's better for the data to be processed in batches. Is there a way of doing this in pika, as opposed to passing in a callback that gets called on every message arrival?
Thanks.

Here is code written using pika; a similar function can be written using basic.get.
The code below uses channel.consume to start consuming messages and breaks out of the loop once the desired number of messages has been reached.
I have set a batch_size to prevent pulling a huge number of messages at once. You can always change batch_size to fit your needs.
from pika import BasicProperties, URLParameters
from pika.adapters.blocking_connection import BlockingChannel, BlockingConnection
from pika.exceptions import ChannelWrongStateError, StreamLostError, AMQPConnectionError
from pika.exchange_type import ExchangeType
import json

# `channel` (a BlockingChannel) and `logger` are assumed to be created elsewhere in the module.

def consume_messages(queue_name: str):
    msgs = []
    batch_size = 500

    q = channel.queue_declare(queue_name, durable=True, exclusive=False, auto_delete=False)
    q_length = q.method.message_count
    if not q_length:
        return msgs

    msgs_limit = batch_size if q_length > batch_size else q_length

    try:
        # Get messages and break out
        for method_frame, properties, body in channel.consume(queue_name):
            # Append the message
            try:
                msgs.append(json.loads(bytes.decode(body)))
            except Exception:
                logger.info(f"Rabbit Consumer : Received message in wrong format {str(body)}")

            # Acknowledge the message
            channel.basic_ack(method_frame.delivery_tag)

            # Escape the loop once the desired number of msgs has been fetched
            # (delivery tags start at 1 on a fresh channel and increase by 1 per message)
            if method_frame.delivery_tag == msgs_limit:
                # Cancel the consumer and return any pending messages
                requeued_messages = channel.cancel()
                print('Requeued %i messages' % requeued_messages)
                break
    except (ChannelWrongStateError, StreamLostError, AMQPConnectionError) as e:
        logger.info(f'Connection Interrupted: {str(e)}')
    finally:
        # Stop consuming and close the channel
        channel.stop_consuming()
        channel.close()

    return msgs
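For reference, a minimal sketch of how the channel and logger that the function above relies on might be set up; the broker URL and queue name are placeholders, not values from the original answer:

import logging
from pika import URLParameters
from pika.adapters.blocking_connection import BlockingConnection

logger = logging.getLogger(__name__)

# Placeholder broker URL -- replace with your own connection string
connection = BlockingConnection(URLParameters("amqp://guest:guest@localhost:5672/%2F"))
channel = connection.channel()

batch = consume_messages("my_queue")  # example queue name
print(f"Fetched {len(batch)} messages")
connection.close()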

You can use the basic.get API, which pulls individual messages from the broker, instead of subscribing and having messages pushed to a callback.
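A minimal sketch of that pull-based approach, assuming an already-open BlockingChannel and an example queue name:

def get_batch(channel, queue_name: str, max_messages: int = 500):
    """Pull up to max_messages with basic.get and acknowledge each one."""
    msgs = []
    for _ in range(max_messages):
        method_frame, properties, body = channel.basic_get(queue_name)
        if method_frame is None:  # queue is empty, stop pulling
            break
        msgs.append(body)
        channel.basic_ack(method_frame.delivery_tag)
    return msgs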

Related

Simulating a system of batching jobs with interrupting set-up/switch-on times using Simpy

I am new to SimPy and have a problem combining batching jobs with interrupting set-up times, so could you please help me?
I would like to create a system with servers that need time to set up before they are ready to serve.
The system starts setting up a server whenever there are enough customers (M = 2, 3, ...) in the queue. If the number of customers in the system reaches the maximum K (50), the arriving customer balks.
When a batch (a group of M customers) leaves the system, we check whether there are M more customers (another batch) waiting to be served. If so, we keep the server on; otherwise, we turn the server off immediately.
I found code for almost the same problem in a SimPy Google group thread about a Covid test simulation that uses Store resources, and an answer by Michael R. Gibbs on interrupting set-up time with Container resources:
https://groups.google.com/g/python-simpy/c/iFYaDlL4fq0
Interrupt an earlier timeout event in Simpy
I tried to combine the two pieces of code, but it didn't work.
For example, when M = 2 and K = 50:
Customer 1 arrives and waits.
Customer 2 arrives; with 2 customers waiting, a server is requested.
Server 1 starts setting up, which takes t1 seconds.
Customer 3 arrives and waits.
Customer 4 arrives; with 2 customers waiting again, another server is requested.
Server 2 starts setting up, which takes t1 seconds.
Server 1 is ON.
Customers 1 and 2 occupy server 1.
Customers 1 and 2 complete service and leave the system.
Customers 3 and 4 occupy server 1 (because when server 1 finishes,
server 2 is still in the setup process).
Server 2 (still in SETUP mode) is turned off...
... Customer 100 arrives, sees 50 customers in the system, and balks.
I broke customer arrivals into two parts: a first queue where customers wait until there are enough of them to form a batch. When there are enough customers to make a batch, I do so, popping the batched customers from the batching queue and putting the batch in a processing queue. I count the customers in both queues to decide whether an arriving customer aborts entry.
When a batch is put in the processing queue, I also start up a server. This means the number of batches in the processing queue always equals the number of servers starting up, and that when a server finishes starting up there will be a batch for it to process. Since a server never has to wait for a batch, I use a simple list for the queue.
When a server finishes starting up, it grabs a batch and removes itself from the list of starting servers. After the server finishes processing a batch, it checks whether there is another batch in the processing queue. If so, it grabs that batch and keeps processing, but also kills the server that was starting up to process it. If there are no batches in the processing queue, the server shuts down.
Here is the code. In the log you should see the queues max out and customers abort, but also see servers start to shut down towards the end.
"""
Simulation of servers processing batches

Customers enter a queue where they wait for
enough customers to make a batch.
If there are too many customers in the queues,
the arriving customer will abort.

When a batch is made, it is put into a second
processing queue where the batch waits to be processed.
When a batch is put into the processing queue, it
starts a server. The server has a start-up delay,
then loops: seize a batch, process the batch, release
the batch, check if another batch is in the
processing queue. If there is another batch, stop a server
that is starting up and process the batch; else end the loop
and shut down the server.

Programmer: Michael R. Gibbs
"""

import simpy
import random

max_q_size = 50
batch_size = 2
server_start_time = 55
processing_time = lambda: random.triangular(5, 20, 10)
arrival_gap = lambda: random.triangular(1, 1, 1)

# there is no waiting, so normal lists are good enough
batching_q = list()
processing_q = list()
server_q = list()  # servers that are still starting up


class Server():
    """
    Server that processes batches

    Has two states: starting up, and batch processing
    """

    def __init__(self, id, env, processing_q, server_q):
        self.id = id
        self.env = env
        self.processing_q = processing_q
        self.server_q = server_q

        self.start_process = self.env.process(self.start_up())

    def start_up(self):
        """
        starts up the server, then starts processing batches

        start-up can be interrupted, stopping the server
        """
        # start up
        try:
            print(f'{self.env.now} server {self.id} starting up')
            yield self.env.timeout(server_start_time)
            print(f'{self.env.now} server {self.id} started')
            self.env.process(self.process())
        except simpy.Interrupt:
            print(f'{self.env.now} server {self.id} has been interrupted')

    def process(self):
        """
        processes batches

        keeps going as long as there are batches in the queue

        if it starts a second batch, it also interrupts a starting-up server
        """
        while True:
            print(f'{self.env.now} server {self.id} starting batch process')
            b = self.processing_q.pop(0)
            yield self.env.timeout(processing_time())
            print(f'{self.env.now} server {self.id} finished batch process')

            if len(self.processing_q) > 0:
                # more batches to do,
                # steal a batch from a starting-up server
                s = self.server_q.pop()  # lifo
                s.stop()
            else:
                print(f'{self.env.now} server {self.id} no more batches, shutting down')
                break

    def stop(self):
        """
        Interrupts server start-up, stopping the server
        """
        try:
            self.start_process.interrupt()
        except Exception:
            pass


def gen_arrivals(env, batching_q, processing_q, server_q):
    """
    Generates arriving customers

    If the queues are too big the customer will abort

    If there are enough customers, create a batch and start a server
    """
    id = 1
    while True:
        yield env.timeout(arrival_gap())

        q_size = len(batching_q) + (batch_size * len(processing_q))
        if q_size >= max_q_size:
            print(f'{env.now} customer arrived and aborted, q len: {q_size}')
        else:
            print(f'{env.now} customer has arrived, q len: {q_size}')
            customer = object()
            batching_q.append(customer)

            # check if a batch can be created
            while len(batching_q) >= batch_size:
                batch = list()
                while len(batch) < batch_size:
                    batch.append(batching_q.pop(0))

                # put batch in processing q
                processing_q.append(batch)

                # start a server
                server = Server(id, env, processing_q, server_q)
                id += 1
                server_q.append(server)


# boot up sim
env = simpy.Environment()
env.process(gen_arrivals(env, batching_q, processing_q, server_q))
env.run(100)
When I add a condition to limit the number of servers, it works until a server is interrupted or shut down. After that, those servers seem to have disappeared and are no longer active.
Sorry for asking so much. Here is my code:
import simpy
import random
import numpy as np


class param:
    def __init__(self, x):
        #self.FILE = 'Setup_time.csv'
        self.MEAN_INTERARRIVAL = x   # arrival_gap
        self.MEAN_SERVICE_TIME = 2   # processing_time
        self.MEAN_SWITCH_TIME = 3    # server_start_time
        self.NUM_OF_SERVER = 4       # maximum number of servers
        self.MAX_SYS_SIZE = 10       # maximum number of customers in the system
        self.BATCH_SIZE = 2
        self.RANDOM_SEED = 0


# there is no waiting, so normal lists are good enough
class Server():
    """
    Server that processes batches

    Has two states: starting up, and batch processing
    """

    def __init__(self, id, env, processing_q, server_q, param):
        self.id = id
        self.env = env
        self.processing_q = processing_q
        self.server_q = server_q

        self.start_process = self.env.process(self.start_up(param))

    def start_up(self, param):
        """
        starts up the server, then starts processing batches

        start-up can be interrupted, stopping the server
        """
        global num_servers
        # start up
        if self.id <= param.NUM_OF_SERVER:  # I add the condition to limit the number of servers
            try:
                num_servers += 1
                print(f'{self.env.now} server {self.id} starting up')
                yield self.env.timeout(param.MEAN_SWITCH_TIME)
                #yield env.timeout(np.random.exponential(1/param.MEAN_SWITCH_TIME))
                print(f'{self.env.now} server {self.id} started')
                self.env.process(self.process(param))
            except simpy.Interrupt:
                print(f'{env.now} server {self.id} has been interrupted-------------------')

    def process(self, param):
        """
        processes batches

        keeps going as long as there are batches in the queue

        if it starts a second batch, it also interrupts a starting-up server
        """
        global num_servers, num_active_server
        while True:
            num_active_server += 1
            b = processing_q.pop(0)
            print(f'{self.env.now} server {self.id} starting batch process')
            yield self.env.timeout(param.MEAN_SERVICE_TIME)
            #yield env.timeout(np.random.exponential(1/param.MEAN_SERVICE_TIME))
            num_servers -= 1
            num_active_server -= 1
            print(f'{self.env.now} server {self.id} finished batch process')

            if len(self.processing_q) > 0:
                # more batches to do,
                # steal a batch from a starting-up server
                #if self.server_q:
                    #s = self.server_q.pop(0)  # Do these lines work for the FIFO rule?
                    #s.stop()
                s = self.server_q.pop()  # lifo
                s.stop()
            else:
                print(f'{env.now} server {self.id} no more batches, shutting down')
                break

    def stop(self):
        """
        Interrupts server start-up, stopping the server
        """
        try:
            self.start_process.interrupt()
        except Exception:
            pass


def gen_arrivals(env, batching_q, processing_q, server_q, param):
    """
    Generates arriving customers

    If the queues are too big the customer will abort

    If there are enough customers, create a batch and start a server
    """
    global num_servers, num_balk, num_cumulative_customer, num_active_server
    id = 1
    while True:
        yield env.timeout(param.MEAN_INTERARRIVAL)
        #yield env.timeout(np.random.exponential(1/param.MEAN_INTERARRIVAL))
        num_cumulative_customer += 1

        customer = object()
        batching_q.append(customer)
        q_size = len(batching_q) + (param.BATCH_SIZE * len(processing_q))
        sys_size = q_size + (num_active_server * param.BATCH_SIZE)

        #if q_size > max_q_size:
        if sys_size > param.MAX_SYS_SIZE:  # I check the limit on the number of customers in the system instead of in the queue
            num_balk += 1
            batching_q.pop(-1)  # I added this statement
            print(f'{env.now} customer arrived and aborted, sys len: {sys_size}')
        else:
            #customer = object()  # I moved these 2 lines above to update the system size before the if statement
            #batching_q.append(customer)
            print(f'{env.now} customer has arrived, q len: {q_size}, sys len: {sys_size}')

            # check if a batch can be created
            while len(batching_q) >= param.BATCH_SIZE:
                batch = list()
                while len(batch) < param.BATCH_SIZE:
                    batch.append(batching_q.pop(0))

                # put batch in processing q
                processing_q.append(batch)

                # start a server
                server = Server(id, env, processing_q, server_q, param)
                id += 1
                server_q.append(server)

        # Calculate balking probability
        prob_balk = num_balk / num_cumulative_customer
        #print(f'{env.now} prob_balk {prob_balk}')
        list_prob_balk.append(prob_balk)


# boot up sim
trial = 0
Pb = []  # balking probability
global customer_balk_number

for x in range(1, 3):
    trial += 1
    print('trial:', trial)

    batching_q = list()
    processing_q = list()
    server_q = list()  # servers that are still starting up

    num_servers = 0              # number of servers in the system (both starting and serving)
    num_active_server = 0        # number of servers serving customers
    num_balk = 0                 # number of balking customers
    num_cumulative_customer = 0  # total arriving customers
    list_prob_balk = []          # list of balk probabilities for each trial

    paramtest1 = param(x)
    random.seed(paramtest1.RANDOM_SEED)

    # create and start the model
    env = simpy.Environment()
    env.process(gen_arrivals(env, batching_q, processing_q, server_q, paramtest1))
    env.run(30)

    Pb.append(list_prob_balk[-1])
    #print('List of balk prob', Pb)

Processing a huge file (>30GB) in Python

I need to process a huge file of around 30GB containing hundreds of millions of rows. More precisely, I want to perform the three following steps:
Reading the file by chunks: given the size of the file, I don't have the memory to read the file in one go;
Computing stuff on the chunks before aggregating each of them to a more manageable size;
Concatenating the aggregated chunks into a final dataset containing the results of my analyses.
So far, I have coded two threads:
One thread in charge of reading the file by chunks and storing the chunks in a queue (step 1);
One thread in charge of performing the analyses (step 2) on the chunks.
Here is the spirit of my code so far with dummy data:
import queue
import threading
import concurrent.futures
import os
import random
import pandas as pd
import time


def process_chunk(df):
    return df.groupby(["Category"])["Value"].sum().reset_index(drop=False)


def producer(chunk_queue, event):
    print("Producer: Reading the file by chunks")
    reader = pd.read_table(full_path, sep=";", chunksize=10000, names=["Row", "Category", "Value"])
    for index, chunk in enumerate(reader):
        print(f"Producer: Adding chunk #{index} to the queue")
        chunk_queue.put((index, chunk))
        time.sleep(0.2)
    print("Producer: Finished putting chunks")
    event.set()
    print("Producer: Event set")


def consumer(chunk_queue, event, result_list):
    # The consumer stops iff the queue is empty AND the event is set
    # <=> The consumer keeps going iff the queue is not empty OR the event is not set
    while not chunk_queue.empty() or not event.is_set():
        try:
            index, chunk = chunk_queue.get(timeout=1)
        except queue.Empty:
            continue
        print(f"Consumer: Retrieved chunk #{index}")
        print(f"Consumer: Queue size {chunk_queue.qsize()}")
        result_list.append(process_chunk(chunk))
        time.sleep(0.1)
    print("Consumer: Finished retrieving chunks")


if __name__ == "__main__":
    # Record the execution time
    start = time.perf_counter()

    # Generate a fake file in the current directory if necessary
    path = os.path.dirname(os.path.realpath(__file__))
    filename = "fake_file.txt"
    full_path = os.path.join(path, filename)
    if not os.path.exists(full_path):
        print("Main: Generate a dummy dataset")
        with open(full_path, "w", encoding="utf-8") as f:
            for i in range(100000):
                value = random.randint(1, 101)
                category = i % 2
                # write columns in the same order as the names= argument above
                f.write(f"{i+1};{category};{value}\n")

    # Define the queue that will store the chunks read by the Producer
    # (named chunk_queue so the queue module stays importable for queue.Empty)
    chunk_queue = queue.Queue(maxsize=5)

    # Define an event that will be set by the Producer when it is done
    event = threading.Event()

    # Define a list storing the chunks processed by the Consumer
    result_list = list()

    # Launch the Producer and Consumer threads
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
        executor.submit(producer, chunk_queue, event)
        executor.submit(consumer, chunk_queue, event, result_list)

    # Display that the program is finished
    print("Main: Consumer & Producer have finished!")
    print(f"Main: Number of processed chunks = {len(result_list)}")
    print(f"Main: Execution time = {time.perf_counter()-start} seconds")
I know that each iteration of step 1 takes more time than each iteration of step 2, i.e. the Consumer will always be waiting for the Producer.
How can I speed up the process of reading my file by chunks (step 1)?
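Not a full answer, but one common lever, sketched here under assumptions: larger chunks amortize parsing overhead, and declaring dtypes up front spares pandas from re-inferring types on every chunk. The column names and dtypes below come from the dummy data above, not from the real 30GB file.

import pandas as pd

# Larger chunks amortize per-chunk overhead; explicit dtypes skip per-chunk type inference.
reader = pd.read_table(
    "fake_file.txt",                    # path assumed from the dummy example above
    sep=";",
    names=["Row", "Category", "Value"],
    dtype={"Row": "int64", "Category": "int8", "Value": "int64"},
    chunksize=1_000_000,                # tune to the memory you have available
)
partial_results = [chunk.groupby("Category")["Value"].sum() for chunk in reader]
result = pd.concat(partial_results).groupby(level=0).sum()  # re-aggregate the per-chunk sums
print(result)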

Code implementation of the Redis "Pattern: Reliable queue"

The excellent Redis documentation lists a "Reliable queue" pattern as a good candidate/example for the RPOPLPUSH command.
I understand a "reliable queue" to be something with delivery guarantees like Amazon SQS FIFO's exactly-once pattern.
Specifically, you have some N processes feeding into a queue, and some M workers working from the queue. What does this actually look like as an implementation?
I would venture something like the following.
Make a feeder process that populates the work queue:
# feeder1
import redis
import datetime
import time

r = redis.Redis(host='localhost', port=6379, db=0)

while True:
    now = datetime.datetime.now()
    value_to_work_on = "f1:{}".format(now.second)
    r.lpush('workqueue', value_to_work_on)
    time.sleep(1)

Make another:

# f2
import redis
import datetime
import time

r = redis.Redis(host='localhost', port=6379, db=0)

while True:
    now = datetime.datetime.now()
    value_to_work_on = "f2:{}".format(now.second)
    r.lpush('workqueue', value_to_work_on)
    time.sleep(1)

Now make the workers:

# worker1
import redis

r = redis.Redis(host='localhost', port=6379, db=0)

def do_work(x):
    print(x)
    return True

while True:
    todo = r.rpoplpush("workqueue", "donequeue")  # returns None when workqueue is empty
    if do_work(todo):
        print("success")
    else:
        r.lpush("workqueue", todo)

# worker2 is exactly the same, just running elsewhere.
My questions are:
Is this generally what they mean in the documentation? If not, can you provide a fix as an answer?
This still seems incomplete and not really reliable. For example, should there be separate lists for error and completed queues? One for every possible error state? What happens if your Redis goes down during processing?
As #rainhacker pointed out in the comments, it is now recommended to use Redis Streams for this instead of the recipe described in "Pattern: Reliable Queue".
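A minimal sketch of that Streams-based approach with redis-py; the stream, group, and consumer names are made up for illustration. A consumer group gives each worker its own pending-entries list, so a message that was read but never acknowledged can be inspected and reclaimed after a crash.

import redis

r = redis.Redis(host='localhost', port=6379, db=0)

# Producer side: append work items to a stream
r.xadd('workstream', {'task': 'f1:42'})

# One-time setup: create a consumer group (ignore the error if it already exists)
try:
    r.xgroup_create('workstream', 'workers', id='0', mkstream=True)
except redis.exceptions.ResponseError:
    pass

# Worker side: read a message on behalf of this consumer, process it, then ack it.
# Until XACK is called the entry stays in this consumer's pending list,
# so it can be inspected (XPENDING) or reclaimed (XCLAIM/XAUTOCLAIM) after a crash.
resp = r.xreadgroup('workers', 'worker-1', {'workstream': '>'}, count=1, block=5000)
if resp:
    stream, messages = resp[0]
    msg_id, fields = messages[0]
    print('processing', fields)
    r.xack('workstream', 'workers', msg_id)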

Python Redis Queue ValueError: Functions from the __main__ module cannot be processed by workers

I'm trying to enqueue a basic job in Redis using python-rq, but it throws this error:
"ValueError: Functions from the __main__ module cannot be processed by workers"
Here is my program:
import requests

def count_words_at_url(url):
    resp = requests.get(url)
    return len(resp.text.split())

from rq import Connection, Queue
from redis import Redis

redis_conn = Redis()
q = Queue(connection=redis_conn)
job = q.enqueue(count_words_at_url, 'http://nvie.com')
print(job)
Break the provided code into two files:
count_words.py:

import requests

def count_words_at_url(url):
    resp = requests.get(url)
    return len(resp.text.split())

and main.py (where you'll import the required function):

from rq import Connection, Queue
from redis import Redis
from count_words import count_words_at_url  # added import!

redis_conn = Redis()
q = Queue(connection=redis_conn)
job = q.enqueue(count_words_at_url, 'http://nvie.com')
print(job)

I always separate the tasks from the logic that runs them into different files. It's just better organization. Also note that you can define a class of tasks and import/schedule tasks from that class instead of the (over-simplified) structure I suggest above. This should get you going.
Also see here to confirm you're not the first to struggle with this example. RQ is great once you get the hang of it.
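One further note: the enqueued job only runs once a worker process is started. A minimal sketch of such a worker, assuming Redis on localhost and the default queue (the bundled `rq worker` command-line tool does the same thing):

# worker.py -- run in a separate terminal alongside main.py
from redis import Redis
from rq import Queue, Worker

if __name__ == '__main__':
    redis_conn = Redis()
    worker = Worker([Queue(connection=redis_conn)], connection=redis_conn)
    worker.work()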
Currently there is a bug in RQ that leads to this error: you cannot pass a function to enqueue from the same file without explicitly importing it.
Just add from app import count_words_at_url above the enqueue call:

import requests

def count_words_at_url(url):
    resp = requests.get(url)
    return len(resp.text.split())

from rq import Connection, Queue
from redis import Redis

redis_conn = Redis()
q = Queue(connection=redis_conn)

from app import count_words_at_url
job = q.enqueue(count_words_at_url, 'http://nvie.com')
print(job)
The other way is to have the functions in a separate file and import them.

Celery Periodic Task with Broadcast Queue

I use Celery and Celery beat with RabbitMQ and everything works fine.
But now I've created a periodic task on a broadcast queue, and it is not working.
celeryconfig.py
from kombu.common import Broadcast

CELERY_QUEUES = (
    Broadcast('for_all_webhosts')
)
task.py
from celery.schedules import crontab
from celery.task import PeriodicTask  # imports not shown in the original snippet; assumed
from celery.utils.log import get_task_logger

logger = get_task_logger(__name__)

class PeriodicScript(PeriodicTask):
    run_every = crontab(minute='*/2', hour='*')
    time_limit = 300
    soft_time_limit = 600
    queue = 'for_all_webhosts'

    def run(self, *args, **kwargs):
        logger.error("START")
In the log file I can see that the task is triggered, but the task is not running:
Scheduler: Sending due task tasks.periodic_xxx.PeriodicScript (tasks.periodic_xxx.PeriodicScript)
How is it possible to start a periodic task on a broadcast queue?
Thanks
Marcel