Python Redis Queue ValueError: Functions from the __main__ module cannot be processed by workers - redis

I'm trying to enqueue a basic job in Redis using python-rq, but it throws this error:
"ValueError: Functions from the __main__ module cannot be processed by workers"
Here is my program:
import requests

def count_words_at_url(url):
    resp = requests.get(url)
    return len(resp.text.split())

from rq import Connection, Queue
from redis import Redis

redis_conn = Redis()
q = Queue(connection=redis_conn)
job = q.enqueue(count_words_at_url, 'http://nvie.com')
print job

Break the provided code into two files:
count_words.py:
import requests

def count_words_at_url(url):
    resp = requests.get(url)
    return len(resp.text.split())
and main.py (where you'll import the required function):
from rq import Connection, Queue
from redis import Redis
from count_words import count_words_at_url  # added import!

redis_conn = Redis()
q = Queue(connection=redis_conn)
job = q.enqueue(count_words_at_url, 'http://nvie.com')
print job
I always separate the tasks from the logic that runs those tasks into different files. It's just better organization. Also note that you can define a class of tasks and import/schedule tasks from that class instead of the (over-simplified) structure I suggest above. This should get you going.
Also see here to confirm you're not the first to struggle with this example. RQ is great once you get the hang of it.
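As a quick sanity check, here is a minimal sketch (assuming a worker has been started with the rq worker command from the same directory, so count_words.py is importable) of polling the job from main.py; job.id and job.result are standard RQ job attributes:
import time

from redis import Redis
from rq import Queue

from count_words import count_words_at_url

q = Queue(connection=Redis())
job = q.enqueue(count_words_at_url, 'http://nvie.com')
print(job.id)        # the job is queued immediately
time.sleep(2)        # give the running worker a moment to pick it up
print(job.result)    # the word count once finished, None while still queued/running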

There is currently a bug in RQ which leads to this error: you will not be able to pass functions to enqueue from the same file without explicitly importing them.
Just add from app import count_words_at_url above the enqueue call:
import requests

def count_words_at_url(url):
    resp = requests.get(url)
    return len(resp.text.split())

from rq import Connection, Queue
from redis import Redis

redis_conn = Redis()
q = Queue(connection=redis_conn)

from app import count_words_at_url  # explicit import of the task from this module (app.py)
job = q.enqueue(count_words_at_url, 'http://nvie.com')
print job
The other way is to have the functions in a separate file and import them.
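Alternatively, a sketch based on RQ also accepting a dotted string reference to the function, so the worker resolves the import itself and the enqueueing script never has to import the task:
from redis import Redis
from rq import Queue

q = Queue(connection=Redis())
# The worker imports count_words.count_words_at_url from the string reference,
# so nothing from __main__ needs to be pickled.
job = q.enqueue('count_words.count_words_at_url', 'http://nvie.com')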

Related

Fail to stream data from Kafka to BigQuery with Apache Beam

I am trying to load a stream of data from a kafka topic to a BigQuery table. While I am able to source the stream from the kafka topic and do transformations on it, I am stuck on loading the transformed data to a BQ table.
Note: I am using Apache Beam's Python SDK with the DirectRunner (for now) to test things. Here's the code:
import os
import argparse
import json
import logging

import apache_beam as beam
from apache_beam.io.gcp.bigquery import WriteToBigQuery
from beam_nuggets.io import kafkaio


def run(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--bq_table',
        required=True,
        help=('Output BigQuery table for results specified as: '
              'PROJECT:DATASET.TABLE'))
    known_args, pipeline_args = parser.parse_known_args(argv)

    bootstrap_server = "localhost:9092"
    kafka_topic = 'some_topic'
    consumer_config = {
        'bootstrap_servers': bootstrap_server,
        'group_id': 'etl',
        'topic': kafka_topic,
        'auto_offset_reset': 'earliest'
    }

    with beam.Pipeline(argv=pipeline_args) as p:
        events = (
            p | 'Read from kafka' >> kafkaio.KafkaConsume(
                consumer_config=consumer_config, value_decoder=bytes.decode)
            | 'toDict' >> beam.MapTuple(lambda k, v: json.loads(v))
            | 'extract' >> beam.Map(lambda e: {'x': e['key1'], 'y': e['key2']})
            # | 'log' >> beam.ParDo(logging.info)  # if I uncomment this (for validation), I see data printed in my console log
        )
        _ = events | 'Load data to BQ' >> WriteToBigQuery(
            known_args.bq_table,
            schema='x:STRING, y:STRING',
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            ignore_unknown_columns=True, method='STREAMING_INSERTS')
        p.run().wait_until_finish()


if __name__ == "__main__":
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/credentials/file.json"
    logging.getLogger().setLevel(logging.DEBUG)
    logging.getLogger("kafka").setLevel(logging.INFO)
    run()
And this is how I run the above code:
python3 main.py \
--bq_table=<table_id> \
--streaming \
--runner=DirectRunner
I have tried using batch-mode insertion as well with WriteToBigQuery (method='FILE_LOADS') and providing a temp GCS location, but that didn't help either.
There is no error even though I have enabled debug logs, so it's getting really difficult to trace the issue to its source. What do you think I am missing?
Edit:
The python process does not end/exit until I interrupt it.
The Kafka consumer group lag is 0, which indicates that the data is being fetched and processed but not getting loaded to the BQ table.
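One thing the setup above depends on is the DirectRunner actually being in streaming mode. Here is a minimal sketch (assuming the rest of the pipeline stays the same) of forcing the flag explicitly on the pipeline options instead of relying on --streaming being forwarded through parse_known_args:
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

# Inside run(), build the options object explicitly and force streaming mode.
options = PipelineOptions(pipeline_args)
options.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=options) as p:
    ...  # same Kafka -> transform -> WriteToBigQuery pipeline as above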

twisted: multiple logins with a uniq Perspective Broker TCP connection

I'm working on code that uses Twisted this way: a client establishes a permanent Perspective Broker TCP connection (i.e., the connection is kept open for the whole life of the process, which can be days, months or years), and performs logins over this broker for different Avatars when needed.
My problem is that the Avatar logout clean-up function is recorded by the broker, to be called and removed only when the broker's TCP connection is lost.
Since the connection is never lost, the broker.disconnects list grows more and more.
Even worse, the avatar logout function defined by the code I'm working on holds a reference to the Avatar (via a closure), and the avatars can be big, causing a problematic memory leak in the whole process.
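For illustration only, a compact sketch of that variant (the cleanup hook is hypothetical, and MyPerspective refers to the avatar class from the server example below), where the logout callable returned by the realm closes over the avatar:
from twisted.spread import pb

class LeakyRealm:  # sketch only, not part of the example below
    def requestAvatar(self, avatarId, mind, *interfaces):
        if pb.IPerspective not in interfaces:
            raise NotImplementedError
        avatar = MyPerspective(avatarId)
        def logout():
            # The closure holds a strong reference to `avatar`; the broker keeps
            # this callable in its disconnects list until the TCP connection is
            # lost, so the avatar is never freed while the connection stays open.
            avatar.cleanup()  # hypothetical cleanup hook
        return pb.IPerspective, avatar, logout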
Here is an example adapted from the docs.
The server:
#!/usr/bin/env python
# Inspired from pb6server.py here:
# https://twistedmatrix.com/documents/current/core/howto/pb-cred.html#two-clients
# Copyright (c) Twisted Matrix Laboratories.
# See LICENSE for details.
from zope.interface import implementer

from twisted.spread import pb
from twisted.cred import checkers, portal
from twisted.internet import reactor

PROTOCOL = None
# I didn't find another way to get a reference to the broker: suggestions
# welcome!


class MyFactory(pb.PBServerFactory):
    def buildProtocol(self, addr):
        global PROTOCOL
        PROTOCOL = super().buildProtocol(addr)
        return PROTOCOL


class MyPerspective(pb.Avatar):
    def __init__(self, name):
        self.name = name

    def perspective_foo(self, arg):
        assert PROTOCOL is not None
        print(f"Leaked avatars: {len(PROTOCOL.disconnects)}")


@implementer(portal.IRealm)
class MyRealm:
    def requestAvatar(self, avatarId, mind, *interfaces):
        if pb.IPerspective not in interfaces:
            raise NotImplementedError
        return pb.IPerspective, MyPerspective(avatarId), lambda: None


p = portal.Portal(MyRealm())
c = checkers.InMemoryUsernamePasswordDatabaseDontUse(
    user0=b"pass0", user1=b"pass1", user2=b"pass2", user3=b"pass3", user4=b"pass4"
)
p.registerChecker(c)
factory = MyFactory(p)
reactor.listenTCP(8800, factory)
reactor.run()
And the client:
#!/usr/bin/env python
# Inspired from pb6client.py here:
# https://twistedmatrix.com/documents/current/core/howto/pb-cred.html#two-clients
from twisted.spread import pb
from twisted.internet import reactor
from twisted.cred import credentials

USERS_COUNT = 5


def main():
    factory = pb.PBClientFactory()
    reactor.connectTCP("localhost", 8800, factory)
    for k in range(10):
        i = k % USERS_COUNT
        def1 = factory.login(
            credentials.UsernamePassword(f"user{i}".encode(), f"pass{i}".encode())
        )
        def1.addCallback(connected)
    reactor.run()


def connected(perspective):
    print("got perspective1 ref:", perspective)
    print("asking it to foo(13)")
    perspective.callRemote("foo", 13)


main()
How can I notify the broker that the avatar should be cleaned up, without closing the broker's TCP connection?
Thanks!

Passing a Queue with concurrent.futures regardless of executor type

Working up from threads to processes, I have switched to concurrent.futures, and would like to gain/retain flexibility in switching between a ThreadPoolExecutor and a ProcessPoolExecutor for various scenarios. However, despite the promise of a unified facade, I am having a hard time passing multiprocessing Queue objects as arguments to futures.submit() when I switch to using a ProcessPoolExecutor:
import multiprocessing as mp
import concurrent.futures


def foo(q):
    q.put('hello')


if __name__ == '__main__':
    executor = concurrent.futures.ProcessPoolExecutor()
    q = mp.Queue()
    p = executor.submit(foo, q)
    p.result()
    print(q.get())
bumps into the following exception coming from multiprocessing's code:
RuntimeError: Queue objects should only be shared between processes through inheritance
which I believe means it doesn't like receiving the queue as a pickled argument, but rather expects the child process to (not in any OOP sense) "inherit" it at fork time.
The twist is that with bare-bones multiprocessing, i.e. when not going through the facade that concurrent.futures provides, there seems to be no such limitation, as the following code works seamlessly:
import multiprocessing as mp


def foo(q):
    q.put('hello')


if __name__ == '__main__':
    q = mp.Queue()
    p = mp.Process(target=foo, args=(q,))
    p.start()
    p.join()
    print(q.get())
I wonder what I am missing here: how can I make ProcessPoolExecutor accept the queue as an argument through concurrent.futures, the same way it works with ThreadPoolExecutor, or with multiprocessing used directly as shown right above?
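For comparison, a minimal sketch (an assumption on my part, not from the question) of one way a queue can be passed as a submit() argument: a manager-backed queue, whose proxy object is picklable and therefore survives the trip to the worker process.
import multiprocessing as mp
import concurrent.futures


def foo(q):
    q.put('hello')


if __name__ == '__main__':
    with mp.Manager() as manager:
        q = manager.Queue()  # a proxy object, picklable, so it can be passed to submit()
        with concurrent.futures.ProcessPoolExecutor() as executor:
            executor.submit(foo, q).result()
        print(q.get())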

Code implementation of the Redis "Pattern: Reliable queue"

The excellent redis documentation lists a Reliable queue pattern as a good candidate/example for the RPOPLPUSH function.
I understand "reliable queue" to mean something with delivery guarantees like the Amazon SQS FIFO exactly-once pattern.
Specifically, you have some N processes feeding into a queue, and some M workers working from the queue. What does this actually look like as an implementation?
I would venture something like the following.
Make a feeder process that populates the work queue:
# feeder1
import redis
import datetime
import time

r = redis.Redis(host='localhost', port=6379, db=0)

while True:
    now = datetime.datetime.now()
    value_to_work_on = "f1:{}".format(now.second)
    r.lpush('workqueue', value_to_work_on)  # LPUSH so the worker's RPOPLPUSH consumes in FIFO order
    time.sleep(1)
Make another feeder:
# f2
import redis
import datetime
import time

r = redis.Redis(host='localhost', port=6379, db=0)

while True:
    now = datetime.datetime.now()
    value_to_work_on = "f2:{}".format(now.second)
    r.lpush('workqueue', value_to_work_on)
    time.sleep(1)
Now make the workers
# worker1
import redis

r = redis.Redis(host='localhost', port=6379, db=0)


def do_work(x):
    print(x)
    return True


while True:
    todo = r.rpoplpush("workqueue", "donequeue")
    if do_work(todo):
        print("success")
    else:
        r.lpush("workqueue", todo)  # put failed items back on the work queue

# worker2 is exactly the same, just running elsewhere.
My questions are:
Is this generally what they mean in the documentation? If not, can you provide a fix as an answer?
This still seems incomplete and not really reliable. For example, should there be separate lists for error vs. complete queues? One for every possible error state? What happens if your Redis goes down during processing?
As @rainhacker pointed out in the comments, it is now recommended to use Redis Streams for this instead of the recipe described in "Pattern: Reliable Queue".
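A rough sketch of that Streams-based approach with redis-py (stream, group, and consumer names are placeholders): XADD appends work, XREADGROUP delivers each entry to exactly one consumer in the group, and XACK marks it done; unacknowledged entries stay in the group's pending list, so a crashed worker's items can be recovered.
import redis

r = redis.Redis(host='localhost', port=6379, db=0)

# Producer side: append an entry to the stream.
r.xadd('workstream', {'value': 'f1:42'})

# One-time setup: create the consumer group (mkstream creates the stream if it is missing).
try:
    r.xgroup_create('workstream', 'workers', id='0', mkstream=True)
except redis.ResponseError:
    pass  # group already exists

# Worker side: read one entry for this consumer, process it, then acknowledge it.
entries = r.xreadgroup('workers', 'worker1', {'workstream': '>'}, count=1, block=5000)
for stream, messages in entries:
    for message_id, fields in messages:
        print(fields)  # do the work here
        r.xack('workstream', 'workers', message_id)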

Get info of exposed models in Tensorflow Serving

Once I have a TF server serving multiple models, is there a way to query the server to know which models are served?
Would it then be possible to get information about each of those models, like name, interface and, even more important, which versions of a model are present on the server and could potentially be served?
It is really hard to find info about this, but it is possible to get some model metadata.
import grpc
from tensorflow_serving.apis import get_model_metadata_pb2, prediction_service_pb2_grpc

channel = grpc.insecure_channel('localhost:8500')  # gRPC port of the serving instance
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)
request = get_model_metadata_pb2.GetModelMetadataRequest()
request.model_spec.name = 'your_model_name'
request.metadata_field.append("signature_def")
response = stub.GetModelMetadata(request, 10)  # 10 secs timeout
print(response.model_spec.version.value)
print(response.metadata['signature_def'])
Hope it helps.
Update
It is also possible to get this information from the REST API. Just issue a GET request to:
http://{serving_url}:8501/v1/models/{your_model_name}/metadata
The result is JSON, where you can easily find the model specification and the signature definition.
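For example, a minimal sketch with the requests library (host, port, and model name are placeholders, and the key names are those typically returned by the metadata endpoint):
import requests

resp = requests.get("http://localhost:8501/v1/models/your_model_name/metadata")
metadata = resp.json()
print(metadata["model_spec"])                  # model name and version actually served
print(metadata["metadata"]["signature_def"])   # inputs/outputs of each signature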
It is possible to get the model status as well as the model metadata. In the other answer only the metadata is requested, and the response's response.metadata['signature_def'] still needs to be decoded.
I found that the solution is to use the built-in protobuf method MessageToJson() to convert the message to a JSON string. This can then be converted to a Python dictionary with json.loads().
import grpc
import json
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc
from tensorflow_serving.apis import model_service_pb2_grpc
from tensorflow_serving.apis import get_model_status_pb2
from tensorflow_serving.apis import get_model_metadata_pb2
from google.protobuf.json_format import MessageToJson

PORT = 8500
model = "your_model_name"

channel = grpc.insecure_channel('localhost:{}'.format(PORT))

# Model status comes from the model service.
stub = model_service_pb2_grpc.ModelServiceStub(channel)
request = get_model_status_pb2.GetModelStatusRequest()
request.model_spec.name = model
result = stub.GetModelStatus(request, 5)  # 5 secs timeout
print("Model status:")
print(result)

# Model metadata comes from the prediction service.
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)
request = get_model_metadata_pb2.GetModelMetadataRequest()
request.model_spec.name = model
request.metadata_field.append("signature_def")
result = stub.GetModelMetadata(request, 5)  # 5 secs timeout
result = json.loads(MessageToJson(result))
print("Model metadata:")
print(result)
To continue the decoding process, either follow Tyler's approach and convert the message to JSON, or, more natively, Unpack it into a SignatureDefMap and take it from there:
signature_def_map = get_model_metadata_pb2.SignatureDefMap()
response.metadata['signature_def'].Unpack(signature_def_map)
print(signature_def_map.signature_def.keys())
To request additional data about a particular served model via the REST API, you can issue (with curl, Postman, etc.):
GET http://host:port/v1/models/${MODEL_NAME}
GET http://host:port/v1/models/${MODEL_NAME}/metadata
For more information, please check https://www.tensorflow.org/tfx/serving/api_rest