Flume config for Kafka source with ByteArraySerializer - serialization

I am trying to read data from Kafka in Flume. I have configured all the other necessary details for the Kafka source. The data in Kafka is serialized with ByteArraySerializer.
However, the following serializer configs don't appear to be working:
flume.sources.kafka-source.kafka.consumer.key.serializer = org.apache.kafka.common.serialization.ByteArraySerializer
flume.sources.kafka-source.kafka.consumer.value.serializer = org.apache.kafka.common.serialization.ByteArraySerializer
Is there anything wrong here?
PS: I am a flume newbie

Related

logstash not updating :sql_last_value meta data file

I have a pretty standard JDBC configuration for Logstash, but it has suddenly stopped updating :sql_last_value.
I want to understand under what circumstances this can happen. There are also no error logs at all.
I have looked at the solutions here and here; neither is applicable in my case.
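For reference, a minimal sketch of the kind of JDBC input configuration in question (connection string, credentials, query, and paths below are placeholders, not taken from the original post); :sql_last_value is persisted in the file named by last_run_metadata_path:
input {
  jdbc {
    jdbc_connection_string => "jdbc:mysql://localhost:3306/mydb"   # placeholder
    jdbc_user => "user"
    jdbc_password => "password"
    jdbc_driver_library => "/path/to/mysql-connector-java.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    # :sql_last_value below is substituted from the metadata file
    statement => "SELECT * FROM my_table WHERE updated_at > :sql_last_value"
    use_column_value => true
    tracking_column => "updated_at"
    tracking_column_type => "timestamp"
    # this is the file that should be updated after each run
    last_run_metadata_path => "/path/to/.logstash_jdbc_last_run"
    schedule => "* * * * *"
  }
}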

Where can I pass the properties file in Datastage when using Kafka Connector

There are some properties that I want to change, for instance security.protocol from SASL_PLAINTEXT to SASL_SSL. But the Kafka Connector in DataStage has a very limited number of properties (host, use kerberos, principal name, keytab, topic name, consumer group, max poll records, max messages, reset policy, timeout and classpath).
According to this documentation, the very first thing to do is to pass the JAAS configuration file. But my questions are:
Where should I put this file? In the Datastage or Kafka side?
How can I point to this file?
This is what I tried:
Added a before-job subroutine in DataStage that runs the following command:
export KAFKA_OPTS="-Djava.security.auth.login.config=/etc/kafka/kafka_server_jaas.conf"
Added -Djava.security.auth.login.config=/etc/kafka/kafka_server_jaas.conf to the Kafka Client Classpath in the Kafka Connector properties in DataStage.
But no matter what I do, every time I run the job the security.protocol parameter stays unchanged:
Kafka_Connector_2,1: security.protocol = SASL_PLAINTEXT
Meaning that it's not reading the properties file.
Have you faced a similar problem?
The Kafka Connector does have support for SASL SSL
Kafka Connector Properties
This was added in JR61201 for 11.5 and is available in 11.7.1.1
If you want to insert a JVM option such as
-Djava.security.auth.login.config=/etc/kafka/kafka_server_jaas.conf
You should be able to leverage the CC_JVM_OPTIONS environment variable.
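For example (a sketch only; exactly where you define the environment variable, e.g. at the project level or as a job parameter, depends on your DataStage setup, and the JAAS path is the one from the question):
export CC_JVM_OPTIONS="-Djava.security.auth.login.config=/etc/kafka/kafka_server_jaas.conf"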

Kafka Schema Registry - StoreInitializationException: Timed out trying to create or validate schema topic configuration

I'm trying to configure Schema Registry to work with SSL. I already have ZooKeeper and the Kafka brokers working with the same SSL keys, but whenever I start Schema Registry I get the following error:
ERROR Error starting the schema registry(io.confluent.kafka.schemaregistry.rest.SchemaRegistryRestApplication)
io.confluent.kafka.schemaregistry.exceptions.SchemaRegistryInitializationException: Error initializing kafka store while initializing schema registry
Caused by: io.confluent.kafka.schemaregistry.storage.exceptions.StoreInitializationException: Timed out trying to create or validate schema topic configuration
schema-registry.properties configuration :
listeners=https://IP:8081
kafkastore.connection.url=IP:2181
kafkastore.bootstrap.servers=SSL://IP:9092
kafkastore.topic=_schemas
kafkastore.topic.replication.factor=1
kafkastore.security.protocol=SSL
ssl.truststore.location=/.kafka_ssl/kafka.server.truststore.jks
ssl.truststore.password=password
ssl.keystore.location=/.kafka_ssl/kafka.server.keystore.jks
ssl.keystore.password=password
ssl.key.password=password
ssl.endpoint.identification.algorithm=
inter.instance.protocol=https
Can someone advise?
There are a couple of reasons that might cause this issue. Try to use a different topic for kafkastore.topic in case _schemas got corrupted.
For example,
kafkastore.topic=_schemas_new

How to create a Datalake using Apache Kafka, Amazon Glue and Amazon S3?

I want to store all the data from a Kafka topic in Amazon S3. I have a Kafka cluster that receives 200,000 messages per second on one topic, and each message value has 50 fields (strings, timestamps, integers, and floats).
My main idea is to use the Kafka Connector to store the data in an S3 bucket and then use Amazon Glue to transform the data and store it in another bucket. I have the following questions:
1) How should I do it? Will that architecture work well? I tried Amazon EMR (Spark Streaming) but I had too many concerns: How to decrease the processing time and failed tasks using Apache Spark for events streaming from Apache Kafka?
2) I tried to use Kafka Connect from Confluent, but I have a few questions:
Can I connect to my Kafka cluster from another Kafka instance and run my Kafka S3 connector in a standalone way?
What does this error mean: "ERROR Task s3-sink-0 threw an uncaught and unrecoverable exception"?
ERROR Task s3-sink-0 threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerTask:142)
java.lang.NullPointerException
    at io.confluent.connect.hdfs.HdfsSinkTask.close(HdfsSinkTask.java:122)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.commitOffsets(WorkerSinkTask.java:290)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.closePartitions(WorkerSinkTask.java:421)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:146)
    at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:140)
    at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:175)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
[2018-10-05 15:32:26,086] ERROR Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerTask:143)
[2018-10-05 15:32:27,980] WARN could not create Dir using directory from url file:/targ. skipping. (org.reflections.Reflections:104)
java.lang.NullPointerException
    at org.reflections.vfs.Vfs$DefaultUrlTypes$3.matches(Vfs.java:239)
    at org.reflections.vfs.Vfs.fromURL(Vfs.java:98)
    at org.reflections.vfs.Vfs.fromURL(Vfs.java:91)
    at org.reflections.Reflections.scan(Reflections.java:237)
    at org.reflections.Reflections.scan(Reflections.java:204)
    at org.reflections.Reflections.<init>(Reflections.java:129)
    at org.apache.kafka.connect.runtime.AbstractHerder.connectorPlugins(AbstractHerder.java:268)
    at org.apache.kafka.connect.runtime.AbstractHerder$1.run(AbstractHerder.java:377)
    at java.lang.Thread.run(Thread.java:745)
[2018-10-05 15:32:27,981] WARN could not create Vfs.Dir from url. ignoring the exception and continuing (org.reflections.Reflections:208)
org.reflections.ReflectionsException: could not create Vfs.Dir from url, no matching UrlType was found [file:/targ] either use fromURL(final URL url, final List<UrlType> urlTypes) or use the static setDefaultURLTypes(final List<UrlType> urlTypes) or addDefaultURLTypes(UrlType urlType) with your specialized UrlType.
    at org.reflections.vfs.Vfs.fromURL(Vfs.java:109)
    at org.reflections.vfs.Vfs.fromURL(Vfs.java:91)
    at org.reflections.Reflections.scan(Reflections.java:237)
    at org.reflections.Reflections.scan(Reflections.java:204)
    at org.reflections.Reflections.<init>(Reflections.java:129)
    at org.apache.kafka.connect.runtime.AbstractHerder.connectorPlugins(AbstractHerder.java:268)
    at org.apache.kafka.connect.runtime.AbstractHerder$1.run(AbstractHerder.java:377)
    at java.lang.Thread.run(Thread.java:745)
[2018-10-05 15:32:35,441] INFO Reflections took 12393 ms to scan 429 urls, producing 13521 keys and 95814 values (org.reflections.Reflections:229)
If you can resume the steps to connect to Kafka and keep on S3 from another Kafka instance, how will you do it?
What do all these fields mean: key.converter, value.converter, key.converter.schemas.enable, value.converter.schemas.enable, internal.key.converter, internal.value.converter, internal.key.converter.schemas.enable, internal.value.converter.schemas.enable?
What are the possible values for key.converter, value.converter?
3) Once my raw data is in a bucket, I would like to use Amazon Glue to take this data, deserialize the Protobuf, change the format of some fields, and finally store it in another bucket in Parquet. How can I use my own Java Protobuf library in Amazon Glue?
4) If I want to query with Amazon Athena, how can I load the partitions automatically (year, month, day, hour)? With the crawlers and schedulers of Amazon Glue?
To complement @cricket_007's answer:
Can I connect to my Kafka cluster from another Kafka instance and run my Kafka S3 connector in a standalone way?
The Kafka S3 connector is part of the Confluent distribution, which also includes Kafka as well as other related services, but it is not meant to run on your brokers directly. Rather, it runs:
as a standalone worker running a connector configuration given when the service is launched,
or as an additional workers' cluster running alongside your Kafka brokers' cluster. In that case, interacting with and running connectors is better done via the Kafka Connect REST API (search for "Managing Kafka Connectors" for documentation with examples). A sketch of both modes follows below.
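As a rough illustration of the two modes (the file names, the connector JSON, and the worker host below are placeholders, not taken from the original answer):
# Standalone: worker and connector properties are passed on the command line
bin/connect-standalone.sh worker.properties s3-sink.properties
# Distributed: the worker cluster is already running; connectors are created via the REST API
curl -X POST -H "Content-Type: application/json" --data @s3-sink.json http://connect-worker:8083/connectors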
If you can resume the steps to connect to Kafka and keep on S3 from another Kafka instance, how will you do it?
Are you talking about another Kafka Connect instance?
If so, you can simply run the Kafka Connect service in distributed mode, which was meant to give the reliability you seem to be looking for...
Or do you mean another Kafka (brokers) cluster?
In that case, you could try (but that would be experimental, and I haven't tried it myself...) to run Kafka Connect in standalone mode and simply update the bootstrap.servers parameter of your worker configuration to point to the new cluster. Why that might work: in standalone mode the offsets of your sink connector(s) are stored locally on your worker (contrary to distributed mode, where the offsets are stored on the Kafka cluster directly...). Why that might not work: it's simply not intended for this use, and I'm guessing you might need your topics and partitions to be exactly the same...?
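A minimal sketch of such a standalone worker configuration (the broker address and file path are placeholders); this is the file where bootstrap.servers would be repointed at the new cluster:
bootstrap.servers=new-cluster-broker:9092
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
# standalone workers keep offsets in a local file (distributed mode stores them in Kafka topics)
offset.storage.file.filename=/tmp/connect.offsets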
What are the possible values for key.converter, value.converter?
Check Confluent's documentation for kafka-connect-s3 ;)
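For orientation (an illustrative, non-exhaustive list; the Avro converter ships with the Confluent platform rather than plain Apache Kafka), commonly used converters include:
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
# other common choices:
#   org.apache.kafka.connect.converters.ByteArrayConverter
#   io.confluent.connect.avro.AvroConverter (used together with a Schema Registry URL)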
How can I use my own Java Protobuf library in Amazon Glue?
Not sure of the actual method, but Glue jobs spawn off an EMR cluster behind the scenes so I don't see why it shouldn't be possible...
If I want to query with Amazon Athena, how can I load the partitions automatically (year, month, day, hour)? With the crawlers and schedulers of Amazon Glue?
Yes.
Assuming daily partitioning, you could actually have your schedule run the crawler first thing in the morning, as soon as you can expect new data to have created that day's folder on S3 (so at least one object for that day exists on S3)... The crawler will add that day's partition, which will then be available for querying along with any newly added objects.
We use S3 Connect for hundreds of topics and process data using Hive, Athena, Spark, Presto, etc. Seems to work fine, though I feel like an actual database might return results faster.
In any case, to answer about Connect
Can I connect to my Kafka cluster from another Kafka instance and run my Kafka S3 connector in a standalone way?
I'm not sure I understand the question, but Kafka Connect needs to connect to one cluster; you don't need two Kafka clusters to use it. You'd typically run Kafka Connect processes as part of their own cluster, not on the brokers.
What does this error mean: "ERROR Task s3-sink-0 threw an uncaught and unrecoverable exception"?
It means you need to look at the logs to figure out what exception is being thrown and stopping the connector from reading data.
WARN could not create Dir using directory from url file:/targ ... If you're using HDFS connector, I don't think you should be using the default file:// URI
If you can resume the steps to connect to Kafka and keep on S3 from another Kafka instance, how will you do it?
You can't "resume from another Kafka instance". As mentioned, Connect can only consume from a single Kafka cluster, and any consumed offsets and consumer groups are stored with it.
What do all these fields mean?
internal.key.converter, internal.value.converter, internal.key.converter.schemas.enable, internal.value.converter.schemas.enable
These fields are removed from the latest Kafka releases; you can ignore them. You definitely should not change them.
key.converter, value.converter
These are your serializers and deserializers, like the ones the regular producer/consumer API has.
key.converter.schemas.enable, value.converter.schemas.enable
I believe these are only important for JSON converters. See https://rmoff.net/2017/09/06/kafka-connect-jsondeserializer-with-schemas-enable-requires-schema-and-payload-fields
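As a sketch of how these typically appear together in a worker or connector config (the values here are illustrative, not from the original answer): with the JSON converter, schemas.enable controls whether each record is wrapped in a schema/payload envelope:
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
# false = plain JSON records; true = {"schema": ..., "payload": ...} envelopes
key.converter.schemas.enable=false
value.converter.schemas.enable=false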
to deserialize Protobuf, to change the format of some fields, and finally to store it in another bucket in Parquet
Kafka Connect would need to be loaded with a Protobuf converter, and I don't know that there is one (I think Blue Apron wrote something... search GitHub).
Generally speaking, Avro would be much easier to convert to Parquet because native libraries already exist to do that. The S3 connector by Confluent doesn't currently write Parquet format, but there is an open PR. The alternative is to use the Pinterest Secor library.
I don't know Glue, but if it's like Hive, you would use ADD JAR during a query to load external code plugins and functions
I have minimal experience with Athena, but Glue maintains all the partitions as a Hive metastore. The automatic part would be the crawler, you can put a filter on the query to do partition pruning

Logstash + stomp + ActiveMQ

I'm using Logstash to read a CSV file and post the information to my ActiveMQ using the STOMP protocol.
Everything is working great; I only want to add persistence to those messages, but I don't know how to tell Logstash to do so.
The ActiveMQ site says I need to tell my STOMP producer to add the "persistent:true" parameter, but I can't find any documentation about this on the Logstash site.
Anyone knows anything about this?
Thanks in advance,
http://activemq.apache.org/stomp.html
Well, persistence cannot be set on the Logstash stomp output.
If this is very important to you, it should be a simple fix in the source.
You can find the file here:
And this line:
@client.send(event.sprintf(@destination), event.to_json)
should be something like this:
@client.send(event.sprintf(@destination), event.to_json, :persistent => true)
You have to build it and install the plugin yourself. My Ruby skills are limited, so I have no idea how to do that. Maybe consider adding it as a config param and contributing it with a pull request?
Now you can use the headers attribute to send persistent messages:
stomp {
  host => "localhost"
  port => 61612
  destination => "my_queue"
  headers => {
    "persistent" => true
  }
}
Source:
https://github.com/logstash-plugins/logstash-output-stomp/issues/7