How to check the output of file.txt in Kafka Consumer? - file-upload

I am trying to integrate Flume with Kafka and I want to pass a file from my local machine to Flume and then to Kafka. I want to see the contents of my file in Kafka Consumer.
Here is my config file in Flume:
# example.conf: A single-node Flume configuration
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = cat /Users/myname/Downloads/file.txt
# Describe the sink
#a1.sinks.k1.type = logger
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.topic = k1
a1.sinks.k1.brokerList = kafkagames-1:9092,kafkagames-2:9092
a1.sinks.sink1.batchSize = 20
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 10000
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
I am not sure how to start Kafka from Flume and see the content of file.txt in Kafka Consumer. Please advice. Thanks.

Since Kafka is a message broker "service", you have to run it before producing a message with your Flume producer. After the producer produces (puts) a message to a Kafka topic (a kind of a buffer), you can consume (get) the message with a Kafka consumer.
How to start Kafka service:
http://kafka.apache.org/documentation.html#quickstart

Related

Celery fanout signals with `send_task`

I'm trying to establish communication between different processes running celery. I successfully sent tasks from one process to others using app.send_task on a celery instance. I am struggling now to broadcast tasks through a fanout rabbitmq exchange to all other instances (basically a publish-subscribe pattern for celery).
It must be related to the routing across exchanges and queues but I simply can't make it work.
This is the master emitter which broadcasts a task named signal through the default exchange of type fanout:
from celery import Celery
from kombu import Exchange, Queue
app = Celery('emitter',
broker='pyamqp://test#localhost//',
backend='db+sqlite:///results.db')
default_queue_name = 'default'
default_exchange_name = 'default'
default_routing_key = 'default'
default_exchange = Exchange(default_exchange_name, type='fanout')
default_queue = Queue(
default_queue_name,
default_exchange,
routing_key=default_routing_key)
app.conf.task_queues = (
default_queue,
)
app.conf.task_default_queue = default_queue_name
app.conf.task_default_exchange = default_exchange_name
app.conf.task_default_routing_key = default_routing_key
if __name__ == '__main__':
app.send_task(name='signal', exchange='default')
To my understanding the routing, queue and exchange setup on the other app needs to be identical. Thus, this is a very similarly looking piece of code but defining a task that gets called:
from celery import Celery
from kombu import Exchange, Queue
app = Celery('CLIENT_A',
broker='pyamqp://test#localhost//',
backend='db+sqlite:///results.db')
default_queue_name = 'default'
default_exchange_name = 'default'
default_routing_key = 'default'
default_exchange = Exchange(default_exchange_name, type='fanout')
default_queue = Queue(
default_queue_name,
default_exchange,
routing_key=default_routing_key)
app.conf.task_queues = (
default_queue,
)
app.conf.task_default_queue = default_queue_name
app.conf.task_default_exchange = default_exchange_name
app.conf.task_default_routing_key = default_routing_key
#app.task(name='signal')
def signal():
print('client_a signal')
return 'signal'
The second client will look exactly the same as the first except for the name and the print message:
# [...]
app = Celery('CLIENT_B', ...
# [...] identical to the part above
#app.task(name='signal')
def signal():
print('client_b signal')
I'm starting both client workers with different node names (otherwise celery will complain):
celery -A client_a worker -n node_a
celery -A client_b worker -n node_b
If I then call the emitter (first piece of code) I see the signals being triggered alternately by client_a and client_b but never both as I would like them to be triggered.
The rabbitmq management platform looks as expected with the default exchange defined as fanout and the routing looks alright.
I'm not sure if I'm on the completely wrong track here but that's what I imagined should be possible with correct routing.

How to set up a mirrored queue that will work when the master node goes down?

I have a cluster of two rabbitmq servers in my dev environment and I want to make it so that the queue and all of the messages will be available when the original master goes down.
I made a durable queue on a durable exchange with the following attributes:
ha-mode : all
ha-sync-mode : automatic
x-queue-master-locator : min-masters
I also published a persistant message to the queue.
When I bring down the host that is the master for the queue the state changes to down. I expected that ha-mode all would copy the queue and its messages to all nodes, that ha-sync-mode would keep the nodes synced, and that x-queue-master-locator would move the queue to the other node or in production to the node with the least queues. How do I set up a queue so that I can achieve this?
Edit(More info):
Server info:
rmq: 3.7.17
Erlang: 22.0.7
My config for both nodes:
vm_memory_high_watermark.relative = 0.65
vm_memory_high_watermark_paging_ratio = 0.8
disk_free_limit.relative = 2.0
channel_max = 32
num_acceptors.tcp = 20
num_acceptors.ssl = 0
handshake_timeout = 10000
frame_max = 160000
mirroring_sync_batch_size = 1024
background_gc_enabled = true
background_gc_target_interval = 300000
These attributes mean nothing when you create a queue with them. You need to create a Policy that adds these attributes to the queue.

Consuming from rabbitmq using Celery

I am using Celery with Rabbitmq broker on Server A. Some tasks require interaction with another server say, Server B and I am using Rabbitmq queues for this interaction.
Queue 1 - Server A (Producer), Server B (Consumer)
Queue 2 - Server B (Producer), Server A (Consumer)
My celery is unexpectedly hanging and I have found the reason to be incorrect implementation of Server A consumer code.
channel.start_consuming() keeps polling Rabbitmq as expected however putting this in a celery task creates multiple pollers which don't expire. I can add expiry but the time completion for the data being sent to Server B cannot be guaranteed. The code pasted below is one method I used to tackle the issue but I am not convinced this is best solution.
I wish to know what I am doing wrong and what is the right way to implement this because I have failed searching for articles on the web. Any tips, insights and even links to articles would be extremely helpful.
Finally, my code -
#celery.task
def task_a(data):
do_some_processing
# Create only 1 Rabbitmq consumer instance to avoid celery hangups
task_d.delay()
#celery.task
def task_b(data):
do_some_processing
if data is not None:
task_c.delay()
#celery.task
def task_c():
data = some_data
data = json.dumps(data)
conn_params = pika.ConnectionParameters(host=RABBITMQ_HOST)
connection = pika.BlockingConnection(conn_params)
channel = connection.channel()
channel.queue_declare(queue=QUEUE_1)
channel.basic_publish(exchange='',
routing_key=QUEUE_1,
body=data)
channel.close()
#celery.task
def task_d():
def queue_helper(ch, method, properties, body):
'''
Callback from queue.
'''
data = json.loads(body)
task_b.delay(data)
conn_params = pika.ConnectionParameters(host=RABBITMQ_HOST)
connection = pika.BlockingConnection(conn_params)
channel = connection.channel()
channel.queue_declare(queue=QUEUE_2)
channel.basic_consume(queue_helper,
queue=QUEUE_2,
no_ack=True)
channel.start_consuming()
channel.close()

celery - Programmatically list queues

How can I programmatically, using Python code, list current queues created on a RabbitMQ broker and the number of workers connected to them? It would be the equivalent to:
rabbitmqctl list_queues name consumers
I do it this way and display all the queues and their details (messages ready, unacknowledged etc.) on a web page -
import kombu
conn = kombu.Connection(broker_url)# example 'amqp://guest:guest#localhost:5672/'
conn.connect()
client = conn.get_manager()
queues = client.get_queues('/')#assuming vhost as '/'
You will need kombu to be installed and queues will be a dictionary with keys representing the queue names.
I think I got this when digging through the code of celery flower (The tool used for monitoring celery).
Update: As pointed out by #zaq178miami, you will also need the management plugin that has the http API. I had forgotten that I had enabled than in rabbitmq.
This way did it for me:
def get_queue_info(queue_name):
with celery.broker_connection() as conn:
with conn.channel() as channel:
return channel.queue_declare(queue_name, passive=True)
This will return a namedtuple with the name, number of messages waiting and consumers of that queue.
ksrini answer is correct too and can be used when you require more information about a queue.
Thanks to Ask Solem who gave me the hint.
As a rabbitmq client you can use pika. However it doesn't have option for list_queues. The easiest solution would be calling rabbitmqctl command from python using subprocess:
import subprocess
command = "/usr/local/sbin/rabbitmqctl list_queues name consumers"
process = subprocess.Popen(command.split(), stdout=subprocess.PIPE)
print process.communicate()
I would use simply this:
Just replace the user(default= guest), passwd(default= guest) and port with your values.
import requests
import json
def call_rabbitmq_api(host, port, user, passwd):
url = 'https://%s:%s/api/queues' % (host, port)
r = requests.get(url, auth=(user,passwd),verify=False)
return r
def get_queue_name(json_list):
res = []
for json in json_list:
res.append(json["name"])
return res
if __name__ == '__main__':
host = 'rabbitmq_host'
port = 55672
user = 'guest'
passwd = 'guest'
res = call_rabbitmq_api(host, port, user, passwd)
print ("--- dump json ---")
print (json.dumps(res.json(), indent=4))
print ("--- get queue name ---")
q_name = get_queue_name(res.json())
print (q_name)
Referred from here: https://gist.github.com/hiroakis/5088513#file-example_rabbitmq_api-py-L2

S3 path error with Flume HDFS Sink

I have a Flume consolidator which writes every entry on a S3 bucket on AWS.
The problem is with the directory path.
The events are supposed to be written on /flume/events/%y-%m-%d/%H%M, but they're on //flume/events/%y-%m-%d/%H%M.
It seems that Flume is appending one more "/" at the beginning.
Any ideas for this issue? Is that a problem with my path configuration?
master.sources = source1
master.sinks = sink1
master.channels = channel1
master.sources.source1.type = netcat
# master.sources.source1.type = avro
master.sources.source1.bind = 0.0.0.0
master.sources.source1.port = 4555
master.sources.source1.interceptors = inter1
master.sources.source1.interceptors.inter1.type = timestamp
master.sinks.sink1.type = hdfs
master.sinks.sink1.hdfs.path = s3://KEY:SECRET#BUCKET/flume/events/%y-%m-%d/%H%M
master.sinks.sink1.hdfs.filePrefix = event
master.sinks.sink1.hdfs.round = true
master.sinks.sink1.hdfs.roundValue = 5
master.sinks.sink1.hdfs.roundUnit = minute
master.channels.channel1.type = memory
master.channels.channel1.capacity = 1000
master.channels.channel1.transactionCapactiy = 100
master.sources.source1.channels = channel1
master.sinks.sink1.channel = channel1
The Flume NG HDFS sink doesn't implement anything special for S3 support. Hadoop has some built-in support for S3, but I don't know of anyone actively working on it. From what I have heard, it is somewhat out of date and may have some durability issues under failure.
That said, I know of people using it because it's "good enough".
Are you saying that "//xyz" (with multiple adjacent slashes) is a valid path name on S3? As you probably know, most Unixes collapse adjacent slashes.