Kafka s3 connector to minio in parquet format

Kafka s3 connector to minio in parquet format - amazon-s3

I use docker compose to start 3 services in their respect containers:
zookeeper, kafka broker, and minio-connector
The three services can be started and connected successfully when I use the following configurations in minio-connector to consume from kafka and dump record in JSON format to minio:
start up command:
root#e1d1294c6fe6:/opt/bitnami/kafka/bin# ./connect-standalone.sh /plugins/connector.properties /plugins/s3-sink.properties
connector.properties file:
bootstrap.servers=kafka:9092
plugin.path=/plugins
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.storage.StringConverter
internal.key.converter=org.apache.kafka.connect.json.JsonConverter
internal.value.converter=org.apache.kafka.connect.json.JsonConverter
internal.key.converter.schemas.enable=false
internal.value.converter.schemas.enable=false
offset.storage.file.filename=/tmp/connect.offsets
s3-sink.properties file:
name=s3-sink
connector.class=io.confluent.connect.s3.S3SinkConnector
tasks.max=1
topics=202208.minio.connector.test
s3.region=us-east-1
s3.bucket.name=minioUsr
s3.part.size=5242880
flush.size=1
store.url=https://minio.kube.url
storage.class=io.confluent.connect.s3.storage.S3Storage
format.class=io.confluent.connect.s3.format.json.JsonFormat
schema.compatibility=NONE
schema.generator.class=io.confluent.connect.storage.hive.schema.DefaultSchemaGenerator
partitioner.class=io.confluent.connect.storage.partitioner.DefaultPartitioner
Now I'd like the connector to consume records and dump to minio in parquet format.
kafka and zookeeper services remain the same. I modified the connector.properties and s3-sink.properties but the connector cannot start.
new connector.properties file:
bootstrap.servers=kafka:9092
plugin.path=/plugins
internal.key.converter=org.apache.kafka.connect.json.JsonConverter
internal.value.converter=org.apache.kafka.connect.json.JsonConverter
internal.key.converter.schemas.enable=false
internal.value.converter.schemas.enable=false
offset.storage.file.filename=/tmp/connect.offsets
key.converter=io.confluent.connect.avro.AvroConverter
value.converter=io.confluent.connect.avro.AvroConverter
key.converter.schemas.enable=false
value.converter.schemas.enable=true
key.converter.schema.registry.url=https://registry...:1443
value.converter.schema.registry.url=https://registry...:1443
new s3-sink.properties:
name=s3-sink
connector.class=io.confluent.connect.s3.S3SinkConnector
tasks.max=1
topics=202208.minio.connector.test
s3.region=us-east-1
s3.bucket.name=minioUsr
s3.part.size=5242880
flush.size=1
store.url=https://minio.kube.url
storage.class=io.confluent.connect.s3.storage.S3Storage
schema.generator.class=io.confluent.connect.storage.hive.schema.DefaultSchemaGenerator
partitioner.class=io.confluent.connect.storage.partitioner.DefaultPartitioner
schema.compatibility=NONE
format.class=io.confluent.connect.s3.format.parquet.ParquetFormat
enhanced.avro.schema.support=true
My questions are:
With the above configuration, the connector fails to start with an exception
[2022-08-29 15:21:10,610] ERROR Stopping due to error (org.apache.kafka.connect.cli.ConnectStandalone:130)
org.apache.kafka.common.config.ConfigException: Invalid value io.confluent.connect.avro.AvroConverter for configuration value.converter: Class
io.confluent.connect.avro.AvroConverter could not be found.
at org.apache.kafka.common.config.ConfigDef.parseType(ConfigDef.java:728)
at org.apache.kafka.common.config.ConfigDef.parseValue(ConfigDef.java:474)
at org.apache.kafka.common.config.ConfigDef.parse(ConfigDef.java:467)
at org.apache.kafka.common.config.AbstractConfig.<init>(AbstractConfig.java:108)
at org.apache.kafka.common.config.AbstractConfig.<init>(AbstractConfig.java:129)
at org.apache.kafka.connect.runtime.WorkerConfig.<init>(WorkerConfig.java:385)
at org.apache.kafka.connect.runtime.standalone.StandaloneConfig.<init>(StandaloneConfig.java:42)
at org.apache.kafka.connect.cli.ConnectStandalone.main(ConnectStandalone.java:81)
I installed the connector by downloading the confluentinc-kafka-connect-s3-10.1.0 zip file and manually unzip and copy all jars to /plugins/lib. I found the following jars related to parquet:
root#e1d1294c6fe6:/opt/bitnami/kafka/bin# ls -1 /plugins/lib/*parquet*
/plugins/lib/parquet-avro-1.11.1.jar
/plugins/lib/parquet-column-1.11.1.jar
/plugins/lib/parquet-common-1.11.1.jar
/plugins/lib/parquet-encoding-1.11.1.jar
/plugins/lib/parquet-format-structures-1.11.1.jar
/plugins/lib/parquet-hadoop-1.11.1.jar
What is missing in installation?
Any further change is required in the configuration?

I have to manually install avro converter using confluent-hub
download confluent-hub client:
https://docs.confluent.io/5.5.1/connect/managing/confluent-hub/client.html
using confluent-hub to install avro converter, following answers in:
Kafka Connect Confluent S3 Sink Connector: Class io.confluent.connect.avro.AvroConverter could not be found

Related

Reading files from S3 to kafka topic

I have a situation wherein all the event data is getting stored in an s3 bucket and I need to fetch that from S3 to Kafka topic on ec2. I am using CamelAWSS3Connector and am facing issues of the connector not working.
Following is the error I am facing
[2023-01-06 10:11:21,048] ERROR Failed to create job for config/s3_connect.properties (org.apache.kafka.connect.cli.ConnectStandalone:107)
[2023-01-06 10:11:21,053] ERROR Stopping after connector error (org.apache.kafka.connect.cli.ConnectStandalone:117)
java.util.concurrent.ExecutionException: java.lang.NoClassDefFoundError: org/jctools/queues/MessagePassingQueue$Supplier
at org.apache.kafka.connect.util.ConvertingFutureCallback.result(ConvertingFutureCallback.java:115)
at org.apache.kafka.connect.util.ConvertingFutureCallback.get(ConvertingFutureCallback.java:99)
at org.apache.kafka.connect.cli.ConnectStandalone.main(ConnectStandalone.java:114)
Caused by: java.lang.NoClassDefFoundError: org/jctools/queues/MessagePassingQueue$Supplier
I was expecting the publisher to push msg to topic from s3 to kafka
Following is my properties files
name=CamelAwss3SourceConnector
connector.class=org.apache.camel.kafkaconnector.aws2s3.CamelAws2s3SourceConnector
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.converters.ByteArrayConverter
camel.source.maxPollDuration=10000
topics=mytopic
camel.component.aws-s3.access-key=XXXXXXXX
camel.component.aws-s3.region=ap-south-1
camel.source.path.bucketNameOrArn=poc-s3-kafkatopic
camel.source.endpoint.autocloseBody=true
camel.source.endpoint.deleteAfterRead=true
After using export command and adding jars location before calling the publisher following is the error
[2023-01-11 06:43:05,528] ERROR Stopping after connector error (org.apache.kafka.connect.cli.ConnectStandalone:117) java.util.concurrent.ExecutionException: java.lang.NoClassDefFoundError: org/apache/camel/kafkaconnector/CamelSourceConnectorConfig
at org.apache.kafka.connect.util.ConvertingFutureCallback.result(ConvertingFutureCallback.java:115)
at org.apache.kafka.connect.util.ConvertingFutureCallback.get(ConvertingFutureCallback.java:99)
at org.apache.kafka.connect.cli.ConnectStandalone.main(ConnectStandalone.java:114) Caused by: java.lang.NoClassDefFoundError: org/apache/camel/kafkaconnector/CamelSourceConnectorConfig

Make sure you have added plugin.path=/path/to/extracted-camel-connector to the connect-standalone.properties file.
And if that doesn't work, you'll need to export CLASSPATH environment variable to include the jar files in that path.

Issue connecting Elasticsearch from Spark

I am receiving the below error message while try to connect elastic search with Spark..
Code used --
spark.read.format("org.elasticsearch.spark.sql")
.option("es.nodes", "xxxxx")
.option("es.port", "80")
.option("es.nodes.wan.only", "true")
.option("pushdown", "true")
.option("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.load("indexname/doc")
Error message
java.lang.ClassNotFoundException: Failed to find data source: org.elasticsearch.spark.sql. Please find packages at http://spark.apache.org/third-party-projects.html
Using
Databricks Runtime Version 7.3 LTS (includes Apache Spark 3.0.1, Scala 2.12)

Error while deploying Flink custom JAR file in AWS EMR

Basically I want to deploy a Flink custom JAR file to a new AWS EMR cluster. Here is a summary of what I did. I created a new AWS EMR cluster.
Step1:Software and steps changes -
Created a AWS EMR cluster with flink as the service. (EMR release version - 5.17.0) and clicked Flink 1.5.2 as the software configuration.
Entered the Configuration JSON:-
[
{
"Classification": "flink-conf",
"Properties": {
"jobmanager.heap.mb": "3072",
"taskmanager.heap.mb": "51200",
"taskmanager.numberOfTaskSlots":"2",
"taskmanager.memory.preallocate": "false",
"parallelism.default": "1"
}
]
Step2:Hardware - No change in the hardware configuration.By default we have 1 master, 2 core and 0 Task instances. All are m3.xlarge type.
Step3:General Cluster Settings - No change here.
Step4:Security - Provided my EC2 key pair.
Once the cluster creation is ready I SSHed to the EC2 machine and tried to deploy the custom jar file. Below are the different errors I got everytime tried to deploy it via the CLI.
1)
flink run -m yarn-cluster -yn 2 -c com.deepak.flink.examples.WordCount flink-examples-assembly-1.0.jar
Using the result of 'hadoop classpath' to augment the Hadoop classpath: /etc/hadoop/conf:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/.//*:/usr/lib/hadoop-hdfs/./:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/.//*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-yarn/.//*:/usr/lib/hadoop-mapreduce/lib/*:/usr/lib/hadoop-mapreduce/.//*::/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/cloudwatch-sink/lib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/flink/lib/slf4j-log4j12-1.7.7.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2018-10-09 06:30:36,766 INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ip-IPADDRESS.ec2.internal/IPADDRESS:8032
2018-10-09 06:30:36,909 INFO org.apache.flink.yarn.cli.FlinkYarnSessionCli - No path for the flink jar passed. Using the location of class org.apache.flink.yarn.YarnClusterDescriptor to locate the jar
2018-10-09 06:30:37,168 INFO org.apache.flink.yarn.AbstractYarnClusterDescriptor - Killing YARN application
2)
flink run -c com.deepak.flink.examples.WordCount flink-examples-assembly-1.0.jar
Using the result of 'hadoop classpath' to augment the Hadoop classpath: /etc/hadoop/conf:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/.//*:/usr/lib/hadoop-hdfs/./:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/.//*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-yarn/.//*:/usr/lib/hadoop-mapreduce/lib/*:/usr/lib/hadoop-mapreduce/.//*::/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/cloudwatch-sink/lib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/flink/lib/slf4j-log4j12-1.7.7.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
------------------------------------------------------------
The program finished with the following exception:
org.apache.flink.client.deployment.ClusterRetrieveException: Couldn't retrieve standalone cluster
at org.apache.flink.client.deployment.StandaloneClusterDescriptor.retrieve(StandaloneClusterDescriptor.java:51)
at org.apache.flink.client.deployment.StandaloneClusterDescriptor.retrieve(StandaloneClusterDescriptor.java:31)
at org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:253)
at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:214)
at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1025)
at org.apache.flink.client.cli.CliFrontend.lambda$main$9(CliFrontend.java:1101)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1840)
at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1101)
Caused by: org.apache.flink.util.ConfigurationException: Config parameter 'Key: 'jobmanager.rpc.address' , default: null (deprecated keys: [])' is missing (hostname/address of JobManager to connect to).
at org.apache.flink.runtime.highavailability.HighAvailabilityServicesUtils.getJobManagerAddress(HighAvailabilityServicesUtils.java:141)
at org.apache.flink.runtime.highavailability.HighAvailabilityServicesUtils.createHighAvailabilityServices(HighAvailabilityServicesUtils.java:81)
at org.apache.flink.client.program.ClusterClient.<init>(ClusterClient.java:158)
at org.apache.flink.client.program.rest.RestClusterClient.<init>(RestClusterClient.java:183)
at org.apache.flink.client.program.rest.RestClusterClient.<init>(RestClusterClient.java:156)
at org.apache.flink.client.deployment.StandaloneClusterDescriptor.retrieve(StandaloneClusterDescriptor.java:49)
... 10 more
Even I tried to deploy via the AWS Web UI, there also the jar failed to deploy.
So, Basically I want to deploy the custom JAR to the flink YARN Cluster. I am not sure what I am missing for the YARN flink configuration or anything else. Thanks for any help in advance.

You should reduce memory allocation for task manager. Currently, you are trying to allocate 51.2G of memory whereas single m3.xlarge machine has only 15G of memory and in total 30G for 2 machines cluster.

FATAL [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Error starting MRAppMaster java.lang.NoSuchMethodError:

I am trying to run a mapreduce job on EMR cluster. The version of Hadoop on EMR is 2.7.3.
The code is used to read HFiles residing on S3 bucket. But every time I run it fails with the below error.
2018-02-22 20:02:11,641 FATAL [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Error starting MRAppMaster
java.lang.NoSuchMethodError: org.apache.hadoop.mapred.TaskLog.createLogSyncer()Ljava/util/concurrent/ScheduledExecutorService;
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.<init>(MRAppMaster.java:250)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.<init>(MRAppMaster.java:233)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.main(MRAppMaster.java:1472)
2018-02-22 20:02:12,188 INFO [main] org.apache.hadoop.util.ExitUtil: Exiting with status 1
End of LogType:syslog
The actual code was designed to read files from HDFS which was all doing fine in CDH based clusters where the hadoop version is 2.6.0. However there was a requirement to read the HFiles from S3 bucket on EMR based cluster in AWS. I made few changes in the code which will allow it to read any file system. Below is the snippet of the change
...
Path JSONOutputjob2 = new Path( args[1] );
FileSystem.get(JSONOutputjob2.toUri(), conf2).delete(JSONOutputjob2, true);
...
I am passing the path as an argument and here are the options that I have tried with the file path.
s3n://emr-ip/path/to/the/file
s3a://emr-ip/path/to/the/file
s3://emr-ip/path/to/the/file
This error is really driving me crazy. I have updated my pom.xml file to use the available Hadoop version of the cluster and built the project. The build was also successful. But does not work. Any suggestions or help is much appreciated.
Edit:
I have update my pom to have the aws hadoop version i.e 2.7.3 which did not fix the issue.

Error while deploying ADF application in glassfish

Hello I am using JDeveloper 11.1.2.3.0
I have configured Glassfish in my computer and I have followed the instructions as Shay explained here: https://blogs.oracle.com/shay/entry/deploying_oracle_adf_applications_to
The problem is that when I try to deploy my ADF application as "Deploy to application server" with glassfish in this case I get an error saying that:
[#|2013-08-21T11:45:47.516+0200|SEVERE|glassfish3.1.2|org.apache.catalina.core.ContainerBase|_ThreadID=62;_ThreadName=Thread-2;|ContainerBase.addChild: start:
org.apache.catalina.LifecycleException: java.lang.IllegalArgumentException: java.lang.ClassNotFoundException: oracle.adf.share.glassfish.listener.ADFGlassFishAppLifeCycleListener
If I deploy the ADF aplication as an EAR file and then I try to deploy this EAR file to glassfish through the admin interface I get this other error:
[#|2013-08-21T15:40:16.452+0200|SEVERE|glassfish3.1.2|javax.enterprise.system.tools.deployment.org.glassfish.deployment.common|_ThreadID=65;_ThreadName=Thread-2;|Exception while invoking class com.sun.enterprise.web.WebApplication start method
java.lang.Exception: java.lang.IllegalStateException: ContainerBase.addChild: start: org.apache.catalina.LifecycleException: java.lang.RuntimeException: com.sun.faces.config.ConfigurationException: CONFIGURATION FAILED! javax.el.ELContext
Can anyone help on this?

A few things to check-
Did you extract the adf-essentials zip file with the -j option?
Did you mark both your model and view project to have a deployment platform for Glassfish?\

Please make sure that application is deleted from application folder. If not stop-admin server and delete it manually.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Kafka s3 connector to minio in parquet format - amazon-s3

Related

Reading files from S3 to kafka topic

Issue connecting Elasticsearch from Spark

Error while deploying Flink custom JAR file in AWS EMR

FATAL [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Error starting MRAppMaster java.lang.NoSuchMethodError:

Error while deploying ADF application in glassfish

Categories

Resources