Confluent Generalized S3 Source connector (silent failure) - amazon-s3

I'm running into a strange issue with the Generalized S3 Source connector running on Confluent Platform. I'm not able to pinpoint where exactly the error is or what the root cause is.
The only error I see in the SSH console is this (it looks related to logging itself):
[2023-02-11 11:12:45,464] INFO [Worker clientId=connect-1, groupId=connect-cluster-1] Finished starting connectors and tasks (org.apache.kafka.connect.runtime.distributed.DistributedHerder:1709)
log4j:ERROR A "io.confluent.log4j.redactor.RedactorAppender" object is not assignable to a "org.apache.log4j.Appender" variable.
log4j:ERROR The class "org.apache.log4j.Appender" was loaded by
log4j:ERROR [PluginClassLoader{pluginLocation=file:/usr/share/java/source-2.5.1/}] whereas object of type
log4j:ERROR "io.confluent.log4j.redactor.RedactorAppender" was loaded by [jdk.internal.loader.ClassLoaders$AppClassLoader#251a69d7].
log4j:ERROR Could not instantiate appender named "redactor".
[2023-02-11 11:13:35,741] INFO Injecting Confluent license properties into connector '<unspecified>' (org.apache.kafka.connect.runtime.WorkerConfigDecorator:412)
[2023-02-11 11:13:44,001] INFO Injecting Confluent license properties into connector 'S3GenConnectorConnector_7' (org.apache.kafka.connect.runtime.WorkerConfigDecorator:412)
[2023-02-11 11:13:44,006] INFO S3SourceConnectorConfig values:
aws.access.key.id = <<ACCESS KEY HERE>>
aws.secret.access.key = [hidden]
behavior.on.error = fail
bucket.listing.max.objects.threshold = -1
confluent.license = [hidden]
confluent.topic = _confluent-command
confluent.topic.bootstrap.servers = [172.27.157.66:9092]
confluent.topic.replication.factor = 3
directory.delim = /
file.discovery.starting.timestamp = 0
filename.regex = (.+)\+(\d+)\+.+$
folders = []
format.bytearray.extension = .bin
format.bytearray.separator =
format.class = class io.confluent.connect.s3.format.string.StringFormat
format.json.schema.enable = false
mode = RESTORE_BACKUP
parse.error.topic.prefix = error
partition.field.name = []
partitioner.class = class io.confluent.connect.storage.partitioner.DefaultPartitioner
path.format =
record.batch.max.size = 200
s3.bucket.name = mytestbucketamtk
s3.credentials.provider.class = class com.amazonaws.auth.DefaultAWSCredentialsProviderChain
s3.http.send.expect.continue = true
s3.part.retries = 3
s3.path.style.access = true
s3.poll.interval.ms = 60000
s3.proxy.password = null
s3.proxy.url =
s3.proxy.username = null
s3.region = us-east-1
s3.retry.backoff.ms = 200
s3.sse.customer.key = null
s3.ssea.name =
s3.wan.mode = false
schema.cache.size = 50
store.url = null
task.batch.size = 10
topic.regex.list = [first_topic:.*]
topics.dir = topics
(io.confluent.connect.s3.source.S3SourceConnectorConfig:376)
[2023-02-11 11:13:44,029] INFO Using configured AWS access key credentials instead of configured credentials provider class. (io.confluent.connect.s3.source.S3Storage:500)
Connector config file below:
{
"name": "S3GenConnectorConnector_7",
"config": {
"connector.class": "io.confluent.connect.s3.source.S3SourceConnector",
"tasks.max": "1",
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"value.converter": "org.apache.kafka.connect.storage.StringConverter",
"mode": "RESTORE_BACKUP",
"format.class": "io.confluent.connect.s3.format.string.StringFormat",
"s3.bucket.name": "mytestbucketamtk",
"s3.region": "us-east-1",
"aws.access.key.id": <<ACCESS KEY HERE>>,
"aws.secret.access.key": <<SECRET KEY>>,
"topic.regex.list":"first_topic:.*"
}
}
The tasks are not getting created, and there are no other errors in the Connect console. The Connect cluster is running on Confluent Platform. Any pointers in the right direction would be appreciated. Did I miss any required configuration?
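One way to dig further when the worker log stays quiet is the Connect REST API, which reports the connector state and any per-task error traces. A minimal check, assuming the worker's REST interface is on the default port 8083 of the same host:
# shows connector and task state, including any error stack trace
curl -s http://localhost:8083/connectors/S3GenConnectorConnector_7/status
# lists the task configs that were actually created (an empty list means no tasks)
curl -s http://localhost:8083/connectors/S3GenConnectorConnector_7/tasks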

Related

VM creation using terraform in vsphere gives "An error occurred while customizing VM"

provider "vsphere" {
vsphere_server = "myserver"
user = "myuser"
password = "mypass"
allow_unverified_ssl = true
version = "v1.21.0"
}
data "vsphere_datacenter" "dc" {
name = "pcloud-datacenter"
}
data "vsphere_datastore_cluster" "datastore_cluster" {
name = "pc-storage"
datacenter_id = data.vsphere_datacenter.dc.id
}
data "vsphere_compute_cluster" "compute_cluster" {
name = "pcloud-cluster"
datacenter_id = data.vsphere_datacenter.dc.id
}
data "vsphere_network" "network" {
name = "u32c01p26-1514"
datacenter_id = data.vsphere_datacenter.dc.id
}
data "vsphere_virtual_machine" "vm_template" {
name = "first-terraform-vm"
datacenter_id = data.vsphere_datacenter.dc.id
}
resource "vsphere_virtual_machine" "vm" {
count = 1
name = "first-terraform-vm-1"
resource_pool_id = data.vsphere_compute_cluster.compute_cluster.resource_pool_id
datastore_cluster_id = data.vsphere_datastore_cluster.datastore_cluster.id
num_cpus = 2
memory = 1024
wait_for_guest_ip_timeout = 2
wait_for_guest_net_timeout = 0
guest_id = data.vsphere_virtual_machine.vm_template.guest_id
scsi_type = data.vsphere_virtual_machine.vm_template.scsi_type
network_interface {
network_id = data.vsphere_network.network.id
adapter_type = data.vsphere_virtual_machine.vm_template.network_interface_types[0]
}
disk {
name = "disk0.vmdk"
size = data.vsphere_virtual_machine.vm_template.disks.0.size
eagerly_scrub = data.vsphere_virtual_machine.vm_template.disks.0.eagerly_scrub
thin_provisioned = data.vsphere_virtual_machine.vm_template.disks.0.thin_provisioned
}
folder = "virtual-machines"
clone {
template_uuid = data.vsphere_virtual_machine.vm_template.id
customize {
linux_options {
host_name = "first-terraform-vm-1"
domain = "localhost.localdomain"
}
network_interface {
ipv4_address = "10.10.14.100"
ipv4_netmask = 24
}
ipv4_gateway = "10.10.14.1"
}
}
}
Running the terraform script throws the below error:
Error:
Virtual machine customization failed on "/pcloud-datacenter/vm/virtual-machines/first-terraform-vm-1":
An error occurred while customizing VM first-terraform-vm-1. For details reference the log file <No Log> in the guest OS.
The virtual machine has not been deleted to assist with troubleshooting. If
corrective steps are taken without modifying the "customize" block of the
resource configuration, the resource will need to be tainted before trying
again. For more information on how to do this, see the following page:
https://www.terraform.io/docs/commands/taint.html
on create_vm.tf line 34, in resource "vsphere_virtual_machine" "vm":
34: resource "vsphere_virtual_machine" "vm" {
Somehow the generated VM "first-terraform-vm-1" doesn't have the "Connected" box checked in its network settings, while the template "first-terraform-vm" does have it checked.
I see a similar post on GitHub: https://github.com/hashicorp/terraform-provider-vsphere/issues/951
but I'm not sure why this issue is still surfacing.
Vsphere version: 6.7
Terraform v0.12.28
provider.vsphere v1.21.0
Is there anything wrong with my template, or am I missing something? Can anyone help, please? I've been stuck on this for the last 2 days.
The problem looks to be with the template I used. The Linux template should have NetworkManager installed and running; it looks like Terraform (via vSphere guest customization) uses NetworkManager to assign the IP address to the newly created VM.
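A quick way to verify the template before cloning again (a sketch, assuming a systemd-based guest); the taint step is only needed if the half-customized VM from the failed run is still in state:
# inside the template VM: NetworkManager must be installed, enabled and running
systemctl status NetworkManager
sudo systemctl enable --now NetworkManager

# on the Terraform side, mark the failed instance so the next apply recreates it
terraform taint 'vsphere_virtual_machine.vm[0]'
terraform apply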

how to associate a floating ip address to an instance in openstack using terraform

I am using Terraform to create a couple of instances in OpenStack, and I would like to automatically assign floating IP addresses to them without any manual intervention.
My .tf file is as below:
resource "openstack_networking_floatingip_v2" "floating-ip" {
count = 4
pool = "floating-ip-pool"
}
resource "openstack_compute_floatingip_associate_v2" "fip-associate" {
floating_ip = openstack_networking_floatingip_v2.floating-ip.address[count.0]
instance_id = openstack_compute_instance_v2.terraform-vm.id[count.0]
}`
I am getting the error:
Error: Missing resource instance key
on image-provisioning.tf line 33, in resource "openstack_compute_floatingip_associate_v2" "fip-associate":
33: instance_id = openstack_compute_instance_v2.terraform-vm.id[count.0]
My terraform version is : Terraform v0.12.24
+ provider.openstack 1.26.0
I was able to resolve this using the for_each option in Terraform:
resource "openstack_compute_instance_v2" "terraform_vm" {
image_id = "f8b9189d-2518-4a32-b1ba-2046ea8d47fd"
for_each = var.instance_name
name = each.key
flavor_id = "3"
key_pair = "openstack vm key"
security_groups = ["default"]
network {
name = "webapps-network"
}
}
resource "openstack_networking_floatingip_v2" "floating_ip" {
pool = "floating-ip-pool"
for_each = var.instance_name
}
resource "openstack_compute_floatingip_associate_v2" "fip_associate" {
for_each = var.instance_name
floating_ip = openstack_networking_floatingip_v2.floating_ip[each.key].address
instance_id = openstack_compute_instance_v2.terraform_vm[each.key].id
}
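Note that for_each needs a map or a set of strings, so var.instance_name has to be declared accordingly. A minimal declaration that works with the resources above (the instance names here are just placeholders):
variable "instance_name" {
  type    = set(string)
  default = ["vm-1", "vm-2", "vm-3", "vm-4"]
}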

Terraform Variables prompting me when defined in tfvars

There is something that I am not understanding about Terraform variables. I am getting prompted for two variables when I run "terraform apply". I don't think I should be prompted for any, as I defined them in terraform.tfvars. I am prompted for applicationNamespace and staticIpName, but I am not sure why. What am I misunderstanding?
I created a file (terraform.tfvars):
#--------------------------------------------------------------
# General
#--------------------------------------------------------------
cluster = "reddiyo-development"
project = "<MYPROJECTID>"
region = "us-central1"
credentialsLocation = "<MYCERTLOCATION>"
bucket = "reddiyo-terraform-state"
vpcLocation = "us-central1-b"
network = "default"
staticIpName = "dev-env-ip"
#--------------------------------------------------------------
# Specific To NODE
#--------------------------------------------------------------
terraformPrefix = "development"
mainNodeName = "primary-pool"
nodeMachineType = "n1-standard-1"
#--------------------------------------------------------------
# Specific To Application
#--------------------------------------------------------------
applicationNamespace = "application"
I also have a Terraform script:
variable "cluster" {}
variable "project" {}
variable "region" {}
variable "bucket" {}
variable "terraformPrefix" {}
variable "mainNodeName" {}
variable "vpcLocation" {}
variable "nodeMachineType" {}
variable "credentialsLocation" {}
variable "network" {}
variable "applicationNamespace" {}
variable "staticIpName" {}
data "terraform_remote_state" "remote" {
backend = "gcs"
config = {
bucket = "${var.bucket}"
prefix = "${var.terraformPrefix}"
}
}
provider "google" {
//This needs to be updated to wherever you put your credentials
credentials = "${file("${var.credentialsLocation}")}"
project = "${var.project}"
region = "${var.region}"
}
resource "google_container_cluster" "gke-cluster" {
name = "${var.cluster}"
network = "${var.network}"
location = "${var.vpcLocation}"
remove_default_node_pool = true
# node_pool {
# name = "${var.mainNodeName}"
# }
node_locations = [
"us-central1-a",
"us-central1-f"
]
//Get your credentials for the newly created cluster so that microservices can be deployed
provisioner "local-exec" {
command = "gcloud config set project ${var.project}"
}
provisioner "local-exec" {
command = "gcloud container clusters get-credentials ${var.cluster} --zone ${var.vpcLocation}"
}
}
resource "google_container_node_pool" "primary_pool" {
name = "${var.mainNodeName}"
cluster = "${var.cluster}"
location = "${var.vpcLocation}"
node_count = "2"
node_config {
machine_type = "${var.nodeMachineType}"
oauth_scopes = [
"https://www.googleapis.com/auth/logging.write",
"https://www.googleapis.com/auth/monitoring",
"https://www.googleapis.com/auth/devstorage.read_only",
"https://www.googleapis.com/auth/service.management.readonly",
"https://www.googleapis.com/auth/servicecontrol",
"https://www.googleapis.com/auth/trace.append",
]
}
management {
auto_repair = true
auto_upgrade = true
}
autoscaling {
min_node_count = 2
max_node_count = 10
}
}
# //Reserve a Static IP
resource "google_compute_address" "ip_address" {
name = "${var.staticIpName}"
}
//Install Ambassador
module "ambassador" {
source = "modules/ambassador"
applicationNamespace = "${var.applicationNamespace}"
}
You can try to force it to read your variables by using:
terraform apply -var-file=<path_to_your_vars>
For reference, in case anybody faces a similar issue:
"terraform.tfvars" is the default variable file name, from which Terraform reads variables.
If any other file name is used, it needs to be passed on the command line, e.g.: terraform plan -var-file=whateverName.tfvars
Also, this is the order in which Terraform loads variables (later sources take precedence over earlier ones):
Environment variables (TF_VAR_name)
terraform.tfvars
terraform.tfvars.json
Any *.auto.tfvars or *.auto.tfvars.json files
Any -var and -var-file options on the command line
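For completeness, the environment-variable form uses the TF_VAR_ prefix, so the two prompted variables could also be supplied like this (values taken from the tfvars above, only as an example):
export TF_VAR_applicationNamespace=application
export TF_VAR_staticIpName=dev-env-ip
terraform apply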

NPE while deserializing avro messages in kafka streams

I wrote a small Java class to test the consumption of an Avro-encoded Kafka topic.
Properties appProps = new Properties();
appProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "http://***kfk14bro1.lc:9092");
appProps.put(AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, "http://***kfk14str1.lc:8081");
appProps.put(StreamsConfig.APPLICATION_ID_CONFIG, "consumer");
appProps.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest");
appProps.put(StreamsConfig.DEFAULT_DESERIALIZATION_EXCEPTION_HANDLER_CLASS_CONFIG,LogAndContinueExceptionHandler.class);
StreamsBuilder streamsBuilder = new StreamsBuilder();
streamsBuilder.stream(
"coordinates", Consumed.with(Serdes.String(), new GenericAvroSerde()))
.peek((key, value) -> System.out.println("key=" + key + ", value=" + value));
new KafkaStreams(streamsBuilder.build(), appProps).start();
When I run this class, the Serde configs are logged fine, as can be seen in the log below:
[consumer-56b0e0ca-d336-45cc-b388-46a68dbfab8b-StreamThread-1] INFO io.confluent.kafka.serializers.KafkaAvroSerializerConfig - KafkaAvroSerializerConfig values:
schema.registry.url = [http://***kfk14str1.lc:8081]
basic.auth.user.info = [hidden]
auto.register.schemas = true
max.schemas.per.subject = 1000
basic.auth.credentials.source = URL
schema.registry.basic.auth.user.info = [hidden]
value.subject.name.strategy = class io.confluent.kafka.serializers.subject.TopicNameStrategy
key.subject.name.strategy = class io.confluent.kafka.serializers.subject.TopicNameStrategy
[normal-consumer-56b0e0ca-d336-45cc-b388-46a68dbfab8b-StreamThread-1] INFO io.confluent.kafka.serializers.KafkaAvroDeserializerConfig - KafkaAvroDeserializerConfig values:
schema.registry.url = [http://***kfk14str1.lc:8081]
basic.auth.user.info = [hidden]
auto.register.schemas = true
max.schemas.per.subject = 1000
basic.auth.credentials.source = URL
schema.registry.basic.auth.user.info = [hidden]
specific.avro.reader = false
value.subject.name.strategy = class io.confluent.kafka.serializers.subject.TopicNameStrategy
key.subject.name.strategy = class io.confluent.kafka.serializers.subject.TopicNameStrategy
but messages are not being consumed, and the below log is generated for every message:
[normal-consumer-56b0e0ca-d336-45cc-b388-46a68dbfab8b-StreamThread-1] WARN org.apache.kafka.streams.errors.LogAndContinueExceptionHandler - Exception caught during Deserialization, taskId: 0_0, topic: coordinates, partition: 0, offset: 782205986
org.apache.kafka.common.errors.SerializationException: Error deserializing Avro message for id 83
Caused by: java.lang.NullPointerException
at io.confluent.kafka.serializers.AbstractKafkaAvroDeserializer.deserialize(AbstractKafkaAvroDeserializer.java:116)
at io.confluent.kafka.serializers.AbstractKafkaAvroDeserializer.deserialize(AbstractKafkaAvroDeserializer.java:88)
at io.confluent.kafka.serializers.KafkaAvroDeserializer.deserialize(KafkaAvroDeserializer.java:55)
at io.confluent.kafka.streams.serdes.avro.GenericAvroDeserializer.deserialize(GenericAvroDeserializer.java:63)
at io.confluent.kafka.streams.serdes.avro.GenericAvroDeserializer.deserialize(GenericAvroDeserializer.java:39)
at org.apache.kafka.common.serialization.Deserializer.deserialize(Deserializer.java:58)
at org.apache.kafka.streams.processor.internals.SourceNode.deserializeValue(SourceNode.java:60)
But I am able to read just fine with the Avro console consumer, so I know there is nothing wrong with the data written to the topic. The below command prints the messages fine:
~/kafka/confluent-5.1.2/bin/kafka-avro-console-consumer --bootstrap-server http://***kfk14bro1.lc:9092 --topic coordinates --property schema.registry.url=http://***kfk14str1.lc:8081 --property auto.offset.reset=latest
When you instantiate an Avro Serde yourself, it is not automatically configured with the schema registry URL.
So either you have to configure it yourself, or you define default Serdes by adding:
appProps.setProperty(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
appProps.setProperty(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, GenericAvroSerde.class.getName());
And by removing
Consumed.with(Serdes.String(), new GenericAvroSerde())
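With the default Serdes in place, the stream can be declared without an explicit Consumed, along these lines (the schema.registry.url already set in appProps is what configures the default value Serde):
streamsBuilder
    .stream("coordinates")
    .peek((key, value) -> System.out.println("key=" + key + ", value=" + value));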
To configure the Serde yourself, use the following code (adapt it to your situation):
GenericAvroSerde genericAvroSerde = new GenericAvroSerde();
boolean isKeySerde = false;
genericAvroSerde.configure(
    Collections.singletonMap(
        AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG,
        "http://confluent-schema-registry-server:8081/"),
    isKeySerde);
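The configured Serde can then be passed where the unconfigured new GenericAvroSerde() was used before, for example:
streamsBuilder
    .stream("coordinates", Consumed.with(Serdes.String(), genericAvroSerde))
    .peek((key, value) -> System.out.println("key=" + key + ", value=" + value));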

Can apache flume hdfs sink accept dynamic path to write?

I am new to Apache Flume.
I am trying to see how I can take a JSON (as an HTTP source), parse it, and store it to a dynamic path on HDFS according to its content.
For example:
if the json is:
[{
"field1" : "value1",
"field2" : "value2"
}]
then the hdfs path will be:
/some-default-root-path/value1/value2/some-value-name-file
Is there a Flume configuration that enables me to do that?
Here is my current configuration (it accepts a JSON via HTTP and stores it in a path based on the timestamp):
#flume.conf: http source, hdfs sink
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = org.apache.flume.source.http.HTTPSource
a1.sources.r1.port = 9000
#a1.sources.r1.handler = org.apache.flume.http.JSONHandler
# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /user/uri/events/%y-%m-%d/%H%M/%S
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Thanks!
The solution was in the Flume documentation for the HDFS sink: the hdfs.path value supports the %{header} escape sequence, which is replaced at write time with the value of the event header of that name. The HTTP source's JSONHandler turns the "headers" object of each posted JSON event into Flume event headers, so those values can drive the path.
Here is the revised configuration:
#flume.conf: http source, hdfs sink
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = org.apache.flume.source.http.HTTPSource
a1.sources.r1.port = 9000
#a1.sources.r1.handler = org.apache.flume.http.JSONHandler
# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /user/uri/events/%{field1}
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
and the curl command to test it:
curl -X POST -d '[{ "headers" : { "timestamp" : "434324343", "host" :"random_host.example.com", "field1" : "val1" }, "body" : "random_body" }]' localhost:9000
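To get the two-level path from the original question, the same header substitution should extend to multiple fields (a sketch along the same lines, not tested):
a1.sinks.k1.hdfs.path = /some-default-root-path/%{field1}/%{field2}
with a matching curl that puts both values into the event headers:
curl -X POST -d '[{ "headers" : { "field1" : "value1", "field2" : "value2" }, "body" : "random_body" }]' localhost:9000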