In the Kafka 0.8 producer, we can't specify zk.connect without specifying broker.list

I find that in Kafka 0.7.2 we can specify either zk.connect or broker.list, but in Kafka 0.8 we can only specify broker.list, and we can't specify zk.connect without specifying broker.list. I think that in this case we can't balance the producer through ZooKeeper. Has anyone used Kafka 0.8, or does anyone have some understanding of this? Many thanks!

You can still use a ZooKeeper client to retrieve the broker list:
// Children of /brokers/ids are the broker ids; read each child znode's data for the host and port
ZkClient zkClient = new ZkClient("localhost:2181", 4000, 6000, new BytesPushThroughSerializer());
List<String> brokerList = zkClient.getChildren("/brokers/ids");
Given that, you do not have to "hardcode" the broker list on the client side, and you stay flexible as far as the system architecture is concerned. But this would add the ZooKeeper dependency back again, which is in fact a disadvantage for the producer in several environments.
If you want a detailed view of the so-called "cluster metadata API" solution, check out this link: https://issues.apache.org/jira/browse/KAFKA-369
Best
pre

In Kafka 0.8, the producer does its load balancing through the new cluster metadata API, and the use of ZooKeeper for that purpose has been removed.
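To illustrate, a minimal 0.8 producer sketch (the broker host names and the topic are hypothetical); the producer only needs a bootstrap list in metadata.broker.list and discovers the rest of the cluster, including leader changes, through the metadata API:

import java.util.Properties;
import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

Properties props = new Properties();
// bootstrap brokers only; the full cluster layout is discovered via the metadata API
props.put("metadata.broker.list", "broker1:9092,broker2:9092"); // hypothetical hosts
props.put("serializer.class", "kafka.serializer.StringEncoder");

Producer<String, String> producer = new Producer<String, String>(new ProducerConfig(props));
producer.send(new KeyedMessage<String, String>("my-topic", "hello")); // hypothetical topic
producer.close();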

Related

Is there any way to read messages from Kafka topic without consumer?

Just for testing purposes, I want to automate a scenario where I need to check the content of Kafka messages, so I wanted to know whether it is possible to read messages directly from a topic, without a consumer, using the Kafka Java libraries?
I'm new to Kafka, so any suggestion would be good for me.
Thanks in advance!
You could SSH to the broker in question and dump the log segments in a deserialized form, but it would take less time to simply use a consumer in any language, not necessarily Java.
"For testing purposes" Kafka Java API provides MockProducer and MockConsumer, which are backed by Lists, not a full broker

How to scale out Apache Atlas

There is no info in the Atlas documentation on how to scale it.
Apache Atlas is backed by Cassandra or HBase, which can scale out, but I don't know how the Apache Atlas engine itself (the REST web service and request processor) can scale out.
I could install multiple instances of it on different machines and put a load balancer in front of them to fan out requests. But would this model help? Does Atlas do any kind of locking or DB transactions that would make this model not work?
Does someone know how Apache Atlas scales out?
Thanks.
Apache Atlas runs Kafka as the message queue under the covers, and in my experience, the way they have designed the Kafka queue (a consumer group that says you should ONLY have ONE consumer) is the choke point.
Not only that, when you look at the code, the consumer has a poll time for the broker of 1 second hard-coded into the consumer. Put these two together, and it means that if the consumer can't process the messages from the various producers (Hive, Spark, etc.) within that second, the broker disengages the ONLY consumer and waits for a non-existent consumer to pick up the messages...
I need to design something similar, but this is as far as I have got...
Hope that helps somewhat...
Please refer to this page. http://atlas.apache.org/#/HighAvailability
Atlas does not support actual horizontal scale-out.
All requests are handled by the 'active instance'; the 'passive instances' just forward all requests to the 'active instance'.
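For reference, an active/passive pair is configured in atlas-application.properties roughly along these lines (host names and ids here are hypothetical; check the HighAvailability page above for the exact property list):

atlas.server.ha.enabled=true
atlas.server.ids=id1,id2
atlas.server.address.id1=atlas-host1.example.com:21000
atlas.server.address.id2=atlas-host2.example.com:21000
atlas.server.ha.zookeeper.connect=zk1.example.com:2181,zk2.example.com:2181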

Dynamically consume and sink Kafka topics with Flink

I haven't been able to find much information about this online. I'm wondering if it's possible to build a Flink app that can dynamically consume all topics matching a regex pattern and sink those topics to S3. Additionally, each topic consumed this way would contain Avro messages, and the Flink app would use Confluent's Schema Registry.
You're in luck! Flink 1.4 was released just a few days ago, and it is the first version that supports consuming Kafka topics using a regex. According to the Javadocs, here is how you can use it:
FlinkKafkaConsumer011
public FlinkKafkaConsumer011(Pattern subscriptionPattern, DeserializationSchema<T> valueDeserializer, Properties props)
Creates a new Kafka streaming source consumer for Kafka 0.11.x. Use
this constructor to subscribe to multiple topics based on a regular
expression pattern. If partition discovery is enabled (by setting a
non-negative value for
FlinkKafkaConsumerBase.KEY_PARTITION_DISCOVERY_INTERVAL_MILLIS in the
properties), topics with names matching the pattern will also be
subscribed to as they are created on the fly.
Parameters:
subscriptionPattern - The regular expression for a pattern of topic names to subscribe to.
valueDeserializer - The de-/serializer used to convert between Kafka's byte messages and Flink's objects.
props - The properties used to configure the Kafka consumer client, and the ZooKeeper client.
Just note that a running Flink streaming application fetches topic metadata from Kafka at the interval specified by the consumer config FlinkKafkaConsumerBase.KEY_PARTITION_DISCOVERY_INTERVAL_MILLIS.
Each consumer resyncs its metadata, including the list of matching topics, at that interval (as the Javadoc above notes, discovery only happens when this property is set to a non-negative value). So after adding a new topic, you should expect the consumer to start consuming it within at most one discovery interval. Set this configuration on the Flink consumer to your desired interval.
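A minimal sketch of that wiring (assuming Flink 1.4 with the flink-connector-kafka-0.11 dependency; the broker address, group id, and topic pattern are hypothetical):

import java.util.Properties;
import java.util.regex.Pattern;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

Properties props = new Properties();
props.setProperty("bootstrap.servers", "broker1:9092"); // hypothetical broker
props.setProperty("group.id", "regex-consumer");        // hypothetical group id
// discover new topics/partitions matching the pattern every 30 seconds
props.setProperty(FlinkKafkaConsumerBase.KEY_PARTITION_DISCOVERY_INTERVAL_MILLIS, "30000");

FlinkKafkaConsumer011<String> consumer = new FlinkKafkaConsumer011<>(
        Pattern.compile("mytopics-.*"),                  // hypothetical topic pattern
        new SimpleStringSchema(),
        props);

DataStream<String> stream = env.addSource(consumer);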
Subscribing to Kafka topics with a regex pattern was added in Flink 1.4. See the documentation here.
S3 is one of the file systems supported by Flink. For reliable, exactly-once delivery of a stream into a file system, use the flink-connector-filesystem connector.
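As a sketch of that part (assuming Flink 1.4 with the flink-connector-filesystem dependency and an S3 filesystem already configured for Flink; the bucket path is hypothetical):

import org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink;
import org.apache.flink.streaming.connectors.fs.bucketing.DateTimeBucketer;

BucketingSink<String> sink = new BucketingSink<>("s3://my-bucket/flink-output"); // hypothetical bucket
sink.setBucketer(new DateTimeBucketer<String>("yyyy-MM-dd--HH")); // one directory per hour
sink.setBatchSize(1024L * 1024L * 128L);                          // roll part files at ~128 MB
myStream.addSink(sink);                                           // myStream: your DataStream<String>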
You can configure Flink to use Avro, but I'm not sure what the status is of interop with Confluent's schema registry.
For searching on these and other topics, I recommend the search on the Flink doc page. For example: https://ci.apache.org/projects/flink/flink-docs-release-1.4/search-results.html?q=schema+registry

How to retrieve JMS Administered Objects from (GlassFish) server

My GlassFish server is up and running, and I can run simple JMS client programs that send and retrieve messages to queues and topics that are configured manually in the GlassFish console.
I'm busy writing a simple JMS browser to study JMS and would like to find out how I can retrieve, from a Java client, the names of the administered objects on the server (ConnectionFactory, Queues & Topics).
For example I have ConnectionFactory with JNDI name jms/__defaultConnectionFactory, a Queue jms/GlassFishBookQueue and a topic jms/GlassFishBookTopic.
How can I retrieve these names when I only know their resource types (javax.jms.ConnectionFactory, javax.jms.Queue and javax.jms.Topic)?
In this example I have one of every kind, but each could of course be a list.
Spent a lot of time trying to figure it out, to no avail.
It should be possible, as any JMS browser presents this information; see for instance this screenshot:
https://sourceforge.net/projects/jmstoolbox/
Any hint would be appreciated.
I'm the author of JMSToolBox
The JMS spec does not define a way to manage the JMS artefacts defined in a server, i.e. create/delete/list queues/topics/factories, etc.
Each queue manager has its own proprietary way to expose those features. In JMSToolBox, I usually use JMX for that (with the help of proprietary MBeans...), but sometimes it is proprietary code that connects and lists those objects.
If you connect to the queue manager server via JNDI, there is probably a way to list all the JMS artefacts from the JNDI tree and, based on some proprietary pattern, determine what "kind" of object they represent (Queue, ConnectionFactory, etc.).
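For instance, a minimal JNDI-browsing sketch (assuming the GlassFish naming provider is reachable from the client and that the resources live under the usual "jms" sub-context; adjust both to your setup):

import javax.naming.Binding;
import javax.naming.InitialContext;
import javax.naming.NamingEnumeration;

InitialContext ctx = new InitialContext();                     // relies on the GlassFish JNDI provider settings
NamingEnumeration<Binding> bindings = ctx.listBindings("jms"); // the "jms" sub-context is an assumption
while (bindings.hasMore()) {
    Binding b = bindings.next();
    Object obj = b.getObject();
    if (obj instanceof javax.jms.Queue) {
        System.out.println("Queue: " + b.getName());
    } else if (obj instanceof javax.jms.Topic) {
        System.out.println("Topic: " + b.getName());
    } else if (obj instanceof javax.jms.ConnectionFactory) {
        System.out.println("ConnectionFactory: " + b.getName());
    }
}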
BTW, GlassFish embeds OpenMQ. Here is the way it is done in JMSToolBox
I hope this helps

Using ServiceStack.Redis with RedisCloud

Using RedisCloud as a datastore for a ServiceStack-based, AppHarbor-hosted app.
The RedisCloud .net client documentation states to not use the ServiceStack.Redis connection managers:
Note: the ServiceStack.Redis client connection managers (BasicRedisClientManager and PooledRedisClientManager) should be disabled when working with the Garantia Data Redis Cloud. Use the single DNS provided upon DB creation to access your Redis DB. The Garantia Data Redis Cloud distributes your dataset across multiple shards and efficiently balances the load between these shards.
Why would they suggest that? Because they are doing fancy load balancing stuff in their 'Garantia Data' layer and don't want to handle unnecessary connections? The RedisClient class is not thread-safe, so it makes it much more difficult from the application programming perspective.
Should I just ignore their instructions and use a PooledRedisClientManager? How would I configure it with the single uri that RedisCloud provides?
Or will I need to write a basic RedisClient pool wrapper that just creates new RedisClient connections as needed to handle concurrent access (i.e. ignores all read/write pooling specifics, hopefully delegating all that up-stream to the RedisCloud layer)?
Why would they suggest that? Because they are doing fancy load balancing stuff in their 'Garantia Data' layer and don't want to handle unnecessary connections?
I think you could be right. To my knowledge these classes simply wrap creating/retrieving instances of RedisClient (though I think the Basic manager always creates a new RedisClient). While I looked over their site, I didn't see anything about a 'max number of connections' to the Redis server(s). The previous Redis vendor from AppHarbor (MyRedis) had plans that listed the max number of connections allowed per plan. However, I also didn't see anything on their site mentioning connection limits/handling.
Should I just ignore their instructions and use a PooledRedisClientManager? How would I configure it with the single uri that RedisCloud provides?
Well, if you do ignore their instructions my guess is you could eventually run into a 'max number of connections exceeded' error. That would make it difficult to get to your Redis Server(s). I think you could still use the BasicRedisClientManager because when you call GetClient() it always 'news up' a RedisClient in the same way shown in their example.