How to create a Datalake using Apache Kafka, Amazon Glue and Amazon S3?

I want to store all the data from a Kafka topic into Amazon S3. I have a Kafka cluster that receives about 200,000 messages per second on one topic, and each message value has 50 fields (strings, timestamps, integers, and floats).
My main idea is to use Kafka Connect to store the data in an S3 bucket and then use Amazon Glue to transform the data and write it into another bucket. I have the following questions:
1) How do I do it? Will that architecture work well? I tried Amazon EMR (Spark Streaming), but I had too many concerns (see: How to decrease the processing time and failed tasks using Apache Spark for events streaming from Apache Kafka?).
2) I tried to use Kafka Connect from Confluent, but I have a few questions:
Can I connect to my Kafka cluster from another Kafka instance and run my S3 Kafka connector in standalone mode?
What does this error mean: "ERROR Task s3-sink-0 threw an uncaught and unrecoverable exception"?
ERROR Task s3-sink-0 threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerTask:142)
java.lang.NullPointerException
    at io.confluent.connect.hdfs.HdfsSinkTask.close(HdfsSinkTask.java:122)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.commitOffsets(WorkerSinkTask.java:290)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.closePartitions(WorkerSinkTask.java:421)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:146)
    at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:140)
    at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:175)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
[2018-10-05 15:32:26,086] ERROR Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerTask:143)
[2018-10-05 15:32:27,980] WARN could not create Dir using directory from url file:/targ. skipping. (org.reflections.Reflections:104)
java.lang.NullPointerException
    at org.reflections.vfs.Vfs$DefaultUrlTypes$3.matches(Vfs.java:239)
    at org.reflections.vfs.Vfs.fromURL(Vfs.java:98)
    at org.reflections.vfs.Vfs.fromURL(Vfs.java:91)
    at org.reflections.Reflections.scan(Reflections.java:237)
    at org.reflections.Reflections.scan(Reflections.java:204)
    at org.reflections.Reflections.<init>(Reflections.java:129)
    at org.apache.kafka.connect.runtime.AbstractHerder.connectorPlugins(AbstractHerder.java:268)
    at org.apache.kafka.connect.runtime.AbstractHerder$1.run(AbstractHerder.java:377)
    at java.lang.Thread.run(Thread.java:745)
[2018-10-05 15:32:27,981] WARN could not create Vfs.Dir from url. ignoring the exception and continuing (org.reflections.Reflections:208)
org.reflections.ReflectionsException: could not create Vfs.Dir from url, no matching UrlType was found [file:/targ] either use fromURL(final URL url, final List<UrlType> urlTypes) or use the static setDefaultURLTypes(final List<UrlType> urlTypes) or addDefaultURLTypes(UrlType urlType) with your specialized UrlType.
    at org.reflections.vfs.Vfs.fromURL(Vfs.java:109)
    at org.reflections.vfs.Vfs.fromURL(Vfs.java:91)
    at org.reflections.Reflections.scan(Reflections.java:237)
    at org.reflections.Reflections.scan(Reflections.java:204)
    at org.reflections.Reflections.<init>(Reflections.java:129)
    at org.apache.kafka.connect.runtime.AbstractHerder.connectorPlugins(AbstractHerder.java:268)
    at org.apache.kafka.connect.runtime.AbstractHerder$1.run(AbstractHerder.java:377)
    at java.lang.Thread.run(Thread.java:745)
[2018-10-05 15:32:35,441] INFO Reflections took 12393 ms to scan 429 urls, producing 13521 keys and 95814 values (org.reflections.Reflections:229)
If you can resume the steps to connect to Kafka and keep data on S3 from another Kafka instance, how would you do it?
What do all these fields mean: key.converter, value.converter, key.converter.schemas.enable, value.converter.schemas.enable, internal.key.converter, internal.value.converter, internal.key.converter.schemas.enable, internal.value.converter.schemas.enable?
What are the possible values for key.converter and value.converter?
3) Once my raw data is in a bucket, I would like to use Amazon Glue to take this data, deserialize the Protobuf, change the format of some fields, and finally store it in another bucket in Parquet. How can I use my own Java Protobuf library in Amazon Glue?
4) If I want to query with Amazon Athena, how can I load the partitions automatically (year, month, day, hour)? With the crawlers and schedulers of Amazon Glue?

To complement @cricket_007's answer:
Can I connect to my Kafka cluster from another Kafka instance and run my S3 Kafka connector in standalone mode?
The Kafka S3 connector is part of the Confluent distribution, which also includes Kafka as well as other related services, but it is not meant to run on your brokers directly. Rather, it runs:
as a standalone worker, with a connector's configuration given when the service is launched,
or as an additional workers' cluster running alongside your Kafka brokers' cluster. In that case, interaction with and running of connectors is better done via the Kafka Connect REST API (search for "Managing Kafka Connectors" for documentation with examples).
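As an illustrative sketch of the standalone option, a worker might be launched with two property files like the following (the broker addresses, bucket name, topic name, and file paths are assumptions for illustration, not values from the question):

```properties
# worker.properties -- standalone Kafka Connect worker (illustrative values)
bootstrap.servers=broker1:9092,broker2:9092
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
# In standalone mode, offsets are stored in a local file:
offset.storage.file.filename=/tmp/connect.offsets

# s3-sink.properties -- S3 sink connector config (illustrative values)
name=s3-sink
connector.class=io.confluent.connect.s3.S3SinkConnector
tasks.max=1
topics=my-topic
s3.bucket.name=my-raw-bucket
s3.region=us-east-1
storage.class=io.confluent.connect.s3.storage.S3Storage
format.class=io.confluent.connect.s3.format.json.JsonFormat
flush.size=1000
```

The worker would then be started with something like connect-standalone worker.properties s3-sink.properties (the exact script name and path vary between the Apache Kafka and Confluent distributions).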
If you can resume the steps to connect to Kafka and keep data on S3 from another Kafka instance, how would you do it?
Are you talking about another Kafka Connect instance?
if so, you can simply run the Kafka Connect service in distributed mode, which was meant to provide the reliability you seem to be looking for...
Or do you mean another Kafka (brokers) cluster?
in that case, you could try (but that would be experimental, and I haven't tried it myself...) to run Kafka Connect in standalone mode and simply update the bootstrap.servers parameter of your connector's configuration to point to the new cluster. Why that might work: in standalone mode, the offsets of your sink connector(s) are stored locally on your worker (contrary to distributed mode, where the offsets are stored on the Kafka cluster directly...). Why that might not work: it's simply not intended for this use, and I'm guessing you might need your topics and partitions to be exactly the same...?
What are the possible values for key.converter, value.converter?
Check Confluent's documentation for kafka-connect-s3 ;)
How can I use my own java protobuffer library in Amazon Glue?
Not sure of the actual method, but Glue jobs spawn an EMR cluster behind the scenes, so I don't see why it shouldn't be possible...
If I want to query with Amazon Athena, how can I load the partitions automatically (year, month, day, hour)? With the crawlers and schedulers of Amazon Glue?
Yes.
Assuming daily partitioning, you could actually have your schedule run the crawler first thing in the morning, as soon as you can expect new data to have created that day's folder on S3 (so at least one object for that day exists on S3)... The crawler will add that day's partition, which will then be available for querying, along with any newly added object.
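If a scheduled crawler doesn't fit, partitions can also be registered from Athena directly. A sketch (the table name, bucket, and path layout below are hypothetical):

```sql
-- If objects use Hive-style paths (year=2018/month=10/day=05/hour=15),
-- discover all missing partitions in one pass:
MSCK REPAIR TABLE raw_events;

-- Otherwise, add a single partition explicitly:
ALTER TABLE raw_events ADD IF NOT EXISTS
  PARTITION (year = '2018', month = '10', day = '05', hour = '15')
  LOCATION 's3://my-raw-bucket/topics/my-topic/year=2018/month=10/day=05/hour=15/';
```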

We use S3 Connect for hundreds of topics and process data using Hive, Athena, Spark, Presto, etc. Seems to work fine, though I feel like an actual database might return results faster.
In any case, to answer about Connect
Can I connect to my Kafka cluster from another Kafka instance and run my S3 Kafka connector in standalone mode?
I'm not sure I understand the question, but Kafka Connect needs to connect to one cluster, you don't need two Kafka clusters to use it. You'd typically run Kafka Connect processes as part of their own cluster, not on the brokers.
What does this error mean: "ERROR Task s3-sink-0 threw an uncaught and unrecoverable exception"?
It means you need to look at the logs to figure out what exception is being thrown and stopping the connector from reading data.
WARN could not create Dir using directory from url file:/targ ... If you're using the HDFS connector, I don't think you should be using the default file:// URI.
If you can resume the steps to connect to Kafka and keep data on S3 from another Kafka instance, how would you do it?
You can't "resume from another Kafka instance". As mentioned, Connect can only consume from a single Kafka cluster, and any consumed offsets and consumer groups are stored with it.
What do all these fields mean?
internal.key.converter, internal.value.converter, internal.key.converter.schemas.enable, internal.value.converter.schemas.enable
These fields are removed from the latest Kafka releases; you can ignore them. You definitely should not change them.
key.converter, value.converter
These are your serializers and deserializers, like the regular producer/consumer API has.
key.converter.schemas.enable, value.converter.schemas.enable
I believe these are only important for JSON converters. See https://rmoff.net/2017/09/06/kafka-connect-jsondeserializer-with-schemas-enable-requires-schema-and-payload-fields
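For illustration, a typical converter setup in a worker config might look like the following (this is one common combination, not the only valid one; the Schema Registry URL is a placeholder):

```properties
# Plain JSON records without embedded schemas:
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=false
value.converter.schemas.enable=false

# Or Avro with Confluent Schema Registry:
# value.converter=io.confluent.connect.avro.AvroConverter
# value.converter.schema.registry.url=http://schema-registry:8081
```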
to deserialize Protobuf, to change the format of some fields, and finally to store it in another bucket in Parquet
Kafka Connect would need to be loaded with a Protobuf converter, and I don't know if there is one (I think Blue Apron wrote something... search GitHub).
Generally speaking, Avro would be much easier to convert to Parquet because native libraries already exist to do that. S3 Connect by Confluent doesn't currently write Parquet format, but there is an open PR. The alternative is to use the Pinterest Secor library.
I don't know Glue, but if it's like Hive, you would use ADD JAR during a query to load external code plugins and functions.
I have minimal experience with Athena, but Glue maintains all the partitions as a Hive metastore. The automatic part would be the crawler; you can put a filter on the query to do partition pruning.
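A partition-pruned Athena query over such a table might look like this (the table and partition column names are hypothetical; the point is that filtering on partition columns limits how much S3 data is scanned):

```sql
SELECT *
FROM raw_events
WHERE year = '2018'
  AND month = '10'
  AND day = '05'
  AND hour = '15'
LIMIT 100;
```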

Related

Kafka Connect S3 source throws java.io.IOException

Kafka Connect S3 source connector throws the following exception around 20 seconds into reading an S3 bucket:
Caused by: java.io.IOException: Attempted read on closed stream.
at org.apache.http.conn.EofSensorInputStream.isReadAllowed(EofSensorInputStream.java:107)
at org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:133)
at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:90)
at com.amazonaws.event.ProgressInputStream.read(ProgressInputStream.java:180)
The error is preceded by the following warning:
WARN Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use. (com.amazonaws.services.s3.internal.S3AbortableInputStream:178)
I am running Kafka connect out of this image: confluentinc/cp-kafka-connect-base:6.2.0. Using the confluentinc-kafka-connect-s3-source-2.1.1 jar.
My source connector configuration looks like so:
{
  "connector.class": "io.confluent.connect.s3.source.S3SourceConnector",
  "tasks.max": "1",
  "s3.region": "eu-central-1",
  "s3.bucket.name": "test-bucket-yordan",
  "topics.dir": "test-bucket/topics",
  "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
  "partitioner.class": "io.confluent.connect.storage.partitioner.DefaultPartitioner",
  "schema.compatibility": "NONE",
  "confluent.topic.bootstrap.servers": "blockchain-kafka-kafka-0.blockchain-kafka-kafka-headless.default.svc.cluster.local:9092",
  "transforms": "AddPrefix",
  "transforms.AddPrefix.type": "org.apache.kafka.connect.transforms.RegexRouter",
  "transforms.AddPrefix.regex": ".*",
  "transforms.AddPrefix.replacement": "$0_copy"
}
Any ideas on what might be the issue? Also, I was unable to find the repository of the Kafka Connect S3 source connector; is it open source?
Edit: I don't see the problem if gzip compression on the kafka-connect sink is disabled.
The warning means that close() was called before the file was fully read. S3 was not done sending the data, but the connection was left hanging.
Two options:
Validate that the input stream contains no more data. That way the connection can be reused.
Call s3ObjectInputStream.abort() (note: this connection cannot be reused after you abort the input stream, and a new one will need to be created, which has a performance impact). In some cases this might make sense, e.g. when the read is getting too slow.
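To illustrate the first option, here is a minimal Python sketch of "drain, then close" on a generic stream. This demonstrates the principle only; it is not the AWS SDK API, and the byte counts are made up:

```python
import io

def drain(stream, chunk_size=8192):
    """Read and discard any remaining bytes so the underlying
    HTTP connection can be returned to the pool and reused."""
    discarded = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        discarded += len(chunk)
    return discarded

# Simulate reading only part of an object, then draining the rest.
stream = io.BytesIO(b"x" * 10_000)
stream.read(1_000)            # the consumer only needed the first 1 KB
leftover = drain(stream)      # drain instead of abandoning the stream
stream.close()
print(leftover)               # 9000
```

With the real SDK the idea is the same: either consume the stream to its end before closing it, or request only the byte range you need with a ranged GET so there is nothing left to drain.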

AWS DMS FATAL_ERROR Error with replicate-ongoing-changes only

I'm trying to migrate data from Aurora MySQL to S3. Since Aurora MySQL does not support replicating ongoing changes from the cluster reader endpoint, my source endpoint is attached to the cluster writer endpoint.
When I choose full-load migration only, DMS works. However, I get the error "Last Error Task 'courral-membership-s3-writer' was suspended after 9 successive recovery failures Stop Reason FATAL_ERROR Error Level FATAL" when I choose full-load + ongoing replication, or ongoing replication only.
Thanks in advance.
This could be caused by the replication instance class; you may need to upgrade it.

Where can I pass the properties file in Datastage when using Kafka Connector

There are some properties that I want to change, for instance, security.protocol from SASL_PLAINTEXT to SASL_SSL. But the Kafka Connector in DataStage has a very limited set of properties (host, use kerberos, principal name, keytab, topic name, consumer group, max poll records, max messages, reset policy timeout, and classpath).
According to the documentation, the very first thing to do is to pass the JAAS configuration file. But my questions are:
Where should I put this file? On the DataStage side or the Kafka side?
How can I point to this file?
This is what I tried:
Added a before-job subroutine in DataStage using the following command:
export KAFKA_OPTS="-Djava.security.auth.login.config=/etc/kafka/kafka_server_jaas.conf"
Added -Djava.security.auth.login.config=/etc/kafka/kafka_server_jaas.conf to the Kafka Client Classpath in the Kafka Connector properties in DataStage
But no matter what I do, every time I run the job the security.protocol parameter stays unchanged:
Kafka_Connector_2,1: security.protocol = SASL_PLAINTEXT
Meaning that it's not reading the properties file.
Have you faced a similar problem?
The Kafka Connector does have support for SASL SSL
Kafka Connector Properties
This was added in JR61201 for 11.5 and is available in 11.7.1.1
If you want to insert a JVM option such as
-Djava.security.auth.login.config=/etc/kafka/kafka_server_jaas.conf
You should be able to leverage the CC_JVM_OPTIONS environment variable.
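For example, the variable could carry the same JVM option mentioned above (the JAAS file path is taken from the question; where exactly you set the environment variable, e.g. project-level or job-level in DataStage, depends on your setup):

```shell
CC_JVM_OPTIONS=-Djava.security.auth.login.config=/etc/kafka/kafka_server_jaas.conf
```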

Unable to execute HTTP request: Timeout waiting for connection from pool in Flink

I'm working on an app which uploads some files to an s3 bucket and at a later point, it reads files from s3 bucket and pushes it to my database.
I'm using Flink 1.4.2 and fs.s3a API for reading and write files from the s3 bucket.
Uploading files to the S3 bucket works fine without any problem, but when the second phase of my app starts, reading those uploaded files from S3, my app throws the following error:
Caused by: java.io.InterruptedIOException: Reopen at position 0 on s3a://myfilepath/a/b/d/4: org.apache.flink.fs.s3hadoop.shaded.com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from pool
at org.apache.flink.fs.s3hadoop.shaded.org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:125)
at org.apache.flink.fs.s3hadoop.shaded.org.apache.hadoop.fs.s3a.S3AInputStream.reopen(S3AInputStream.java:155)
at org.apache.flink.fs.s3hadoop.shaded.org.apache.hadoop.fs.s3a.S3AInputStream.lazySeek(S3AInputStream.java:281)
at org.apache.flink.fs.s3hadoop.shaded.org.apache.hadoop.fs.s3a.S3AInputStream.read(S3AInputStream.java:364)
at java.io.DataInputStream.read(DataInputStream.java:149)
at org.apache.flink.fs.s3hadoop.shaded.org.apache.flink.runtime.fs.hdfs.HadoopDataInputStream.read(HadoopDataInputStream.java:94)
at org.apache.flink.api.common.io.DelimitedInputFormat.fillBuffer(DelimitedInputFormat.java:702)
at org.apache.flink.api.common.io.DelimitedInputFormat.open(DelimitedInputFormat.java:490)
at org.apache.flink.api.common.io.GenericCsvInputFormat.open(GenericCsvInputFormat.java:301)
at org.apache.flink.api.java.io.CsvInputFormat.open(CsvInputFormat.java:53)
at org.apache.flink.api.java.io.PojoCsvInputFormat.open(PojoCsvInputFormat.java:160)
at org.apache.flink.api.java.io.PojoCsvInputFormat.open(PojoCsvInputFormat.java:37)
at org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:145)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
at java.lang.Thread.run(Thread.java:748)
I was able to control this error by increasing the max connections parameter for the s3a API.
As of now, I have around 1000 files in the S3 bucket which are pushed and pulled by my app, and my max connections is 3000. I'm using Flink's parallelism to upload/download these files from the S3 bucket. My task manager count is 14.
This is an intermittent failure, I'm having success cases also for this scenario.
My questions are:
Why am I getting an intermittent failure? If the max connections value I set were too low, my app should throw this error every time I run.
Is there any way to calculate the optimal number of max connections required for my app to work without facing the connection pool timeout error? Or is this error related to something else that I'm not aware of?
Thanks in advance.
Some comments, based on my experience with processing lots of files from S3 via Flink (batch) workflows:
When you are reading the files, Flink will calculate "splits" based on the number of files, and each file's size. Each split is read separately, so the theoretical max # of simultaneous connections isn't based on the # of files, but a combination of files and file sizes.
The connection pool used by the HTTP client releases connections after some amount of time, as being able to reuse an existing connection is a win (server/client handshake doesn't have to happen). So that introduces a degree of randomness into how many available connections are in the pool.
The size of the connection pool doesn't impact memory much, so I typically set it pretty high (e.g. 4096 for a recent workflow).
When using AWS connection code, the setting to bump is fs.s3.maxConnections, which isn't the same as a pure Hadoop configuration.
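In other words, which key to raise depends on which S3 filesystem implementation is in play. The values below are illustrative, matching the "set it pretty high" advice above:

```properties
# EMR's AWS connection code (EMRFS):
fs.s3.maxConnections=4096

# Plain Hadoop s3a (e.g. Flink's fs.s3a filesystem):
fs.s3a.connection.maximum=4096
```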

How to fetch Spark Streaming job statistics using REST calls when running in yarn-cluster mode

I have a Spark Streaming program running on a YARN cluster in "yarn-cluster" mode (--master yarn-cluster).
I want to fetch spark job statistics using REST APIs in json format.
I am able to fetch basic statistics using REST url call:
http://yarn-cluster:8088/proxy/application_1446697245218_0091/metrics/json. But this gives very basic statistics.
However, I want to fetch per-executor or per-RDD statistics.
How can I do that using REST calls, and where can I find the exact REST URL to get these statistics?
Though the $SPARK_HOME/conf/metrics.properties file sheds some light regarding URLs, i.e.:
5. MetricsServlet is added by default as a sink in master, worker and client driver, you can send http request "/metrics/json" to get a snapshot of all the registered metrics in json format. For master, requests "/metrics/master/json" and "/metrics/applications/json" can be sent seperately to get metrics snapshot of instance master and applications. MetricsServlet may not be configured by self.
but that fetches HTML pages, not JSON. Only "/metrics/json" fetches stats in JSON format.
On top of that, knowing the application_id programmatically is a challenge in itself when running in yarn-cluster mode.
I checked the REST API section of the Spark Monitoring page, but that didn't work when running the Spark job in yarn-cluster mode. Any pointers/answers are welcome.
You should be able to access the Spark REST API using:
http://yarn-cluster:8088/proxy/application_1446697245218_0091/api/v1/applications/
From here you can select the app-id from the list and then use the following endpoint to get information about executors, for example:
http://yarn-cluster:8088/proxy/application_1446697245218_0091/api/v1/applications/{app-id}/executors
I verified this with my spark streaming application that is running in yarn cluster mode.
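As a sketch of consuming the executors endpoint, the JSON it returns can be processed like this. The sample below uses a trimmed, hand-written response with only a few of the fields the real API returns, so treat the field subset as illustrative:

```python
import json

# A trimmed, hand-made example of an /executors response
# (real responses contain many more fields per executor).
sample = json.loads("""
[
  {"id": "driver", "activeTasks": 0, "failedTasks": 0, "completedTasks": 0},
  {"id": "1", "activeTasks": 4, "failedTasks": 1, "completedTasks": 120},
  {"id": "2", "activeTasks": 3, "failedTasks": 0, "completedTasks": 98}
]
""")

# Aggregate per-executor stats, skipping the driver entry.
executors = [e for e in sample if e["id"] != "driver"]
total_completed = sum(e["completedTasks"] for e in executors)
print(total_completed)  # 218
```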
I'll explain how I arrived at the JSON response using a web browser. (This is for a Spark 1.5.2 streaming application in yarn-cluster mode).
First, use the hadoop url to view the RUNNING applications. http://{yarn-cluster}:8088/cluster/apps/RUNNING.
Next, select a running application, say http://{yarn-cluster}:8088/cluster/app/application_1450927949656_0021.
Next, click on the TrackingUrl link. This uses a proxy, and the port is different in my case: http://{yarn-proxy}:20888/proxy/application_1450927949656_0021/. This shows the Spark UI. Now, append api/v1/applications to this URL: http://{yarn-proxy}:20888/proxy/application_1450927949656_0021/api/v1/applications.
You should see a JSON response with the application name supplied to SparkConf and the start time of the application.
I was able to reconstruct the metrics in the columns seen in the Spark Streaming web UI (batch start time, processing delay, scheduling delay) using the /jobs/ endpoint.
The script I used is available here. I wrote a short post describing it and tying its functionality back to the Spark codebase. It does not need any web scraping.
It works for Spark 2.0.0 and YARN 2.7.2, but may work for other version combinations too.
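To show the idea behind reconstructing the per-batch metrics, here is a small sketch computing job durations from submission and completion timestamps. The entries are hand-made samples; real /jobs responses use timestamps like "2016-01-01T00:00:00.000GMT" (the "GMT" suffix is trimmed here for simplicity) and carry more fields:

```python
from datetime import datetime

# Simplified /jobs entries (timestamps trimmed of their "GMT" suffix).
jobs = [
    {"jobId": 0, "submissionTime": "2016-01-01T00:00:00.000",
     "completionTime": "2016-01-01T00:00:02.500"},
    {"jobId": 1, "submissionTime": "2016-01-01T00:00:10.000",
     "completionTime": "2016-01-01T00:00:11.000"},
]

def duration_ms(job):
    """Elapsed milliseconds between job submission and completion."""
    fmt = "%Y-%m-%dT%H:%M:%S.%f"
    start = datetime.strptime(job["submissionTime"], fmt)
    end = datetime.strptime(job["completionTime"], fmt)
    return (end - start).total_seconds() * 1000

durations = [duration_ms(j) for j in jobs]
print(durations)  # [2500.0, 1000.0]
```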
You'll need to scrape the HTML page to get the relevant metrics. There isn't a Spark REST endpoint for capturing this info.