KSQL stream to S3 bucket with node-red - amazon-s3

In Node-RED, I have been able to subscribe to the JSON stream ABC using the KSQL node. Now I am trying to push that stream to an S3 bucket as JSON files with the Kafka S3 connector, but I am only able to do this with the CLI, not with the KSQL and S3 nodes installed in Node-RED. Is there an additional node missing? Kindly help regarding the same.

"I am able to do this with CLI only"
I am not familiar with Node-RED, but you can send an HTTP POST request to a Kafka Connect distributed worker that has the S3 sink connector installed, for example:
curl -X POST http://connect-server:8083/connectors \
  -H "Content-Type: application/json" \
  -d '{
    "name": "sink-s3",
    "config": {
      "topics": "your_topic",
      "tasks.max": "2",
      "name": "sink-s3",
      "connector.class": "io.confluent.connect.s3.S3SinkConnector",
      "storage.class": "io.confluent.connect.s3.storage.S3Storage",
      "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
      "s3.bucket.name": "example-kafka-bucket",
      "key.converter": "org.apache.kafka.connect.json.JsonConverter",
      "value.converter": "org.apache.kafka.connect.json.JsonConverter",
      "__comment": "Confluent Kafka Connect properties",
      "flush.size": "200",
      "s3.part.size": "5242880",
      "schema.generator.class": "io.confluent.connect.storage.hive.schema.DefaultSchemaGenerator",
      "partitioner.class": "io.confluent.connect.storage.partitioner.DefaultPartitioner",
      "schema.compatibility": "BACKWARD"
    }
  }'
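Node-RED's built-in http request node should be able to issue the same POST from a flow, so the connector can be created without dropping to the CLI. Once it is created, the Connect REST API can confirm that it is running (connect-server is the same placeholder host as above):

# List registered connectors, then check the sink's status
curl http://connect-server:8083/connectors
curl http://connect-server:8083/connectors/sink-s3/status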

Related

How to create contact point in grafana using API?

I am trying to create a contact point in Grafana for PagerDuty using the Grafana API.
I tried with the help of these URLs: AlertProvisioning HTTP_API, the API call reference, and the YAML reference (I changed the YAML data to JSON and tried it that way).
But I am getting this error:
{"message":"invalid object specification: type should not be an empty string","traceID":"00000000000000000000000000000000"}
My API call is below, with a dummy integration key substituted for security.
curl -X POST --insecure \
  -H "Authorization: Bearer XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" \
  -H "Content-Type: application/json" \
  -d '{
    "contactPoints": [
      {
        "orgId": 1,
        "name": "test1",
        "receivers": [
          {
            "uid": "test1",
            "type": "pagerduty",
            "settings": {
              "integrationKey": "XXXXXXXXXXXXXXXX",
              "severity": "critical",
              "class": "ping failure",
              "component": "Grafana",
              "group": "app-stack",
              "summary": "{{ `{{ template \"default.message\" . }}` }}"
            }
          }
        ]
      }
    ],
    "overwrite": false
  }' http://XXXXXXXXXXXXXXXX.us-east-2.elb.amazonaws.com/api/v1/provisioning/contact-points
I would recommend enabling the Grafana Swagger UI. You will see the POST /api/v1/provisioning/contact-points model there:
Example:
{
  "disableResolveMessage": false,
  "name": "webhook_1",
  "settings": {},
  "type": "webhook",
  "uid": "my_external_reference"
}
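Applying that model to the PagerDuty contact point from the question would look roughly like this; the host, token, and integration key are placeholders, and I am assuming the settings keys from the original request can be carried over unchanged:

curl -X POST \
  -H "Authorization: Bearer XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "test1",
    "type": "pagerduty",
    "disableResolveMessage": false,
    "settings": {
      "integrationKey": "XXXXXXXXXXXXXXXX",
      "severity": "critical",
      "class": "ping failure",
      "component": "Grafana",
      "group": "app-stack"
    }
  }' \
  http://XXXXXXXXXXXXXXXX.us-east-2.elb.amazonaws.com/api/v1/provisioning/contact-points

Note that the payload is a single contact point object rather than a contactPoints array, which is presumably why the original request failed with an empty type.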

Get current S3 object version with AWS CLI without downloading the object body itself

Is there an AWS CLI command to get the current version of an S3 object without downloading the object itself?
The best I've come up with downloads the first byte of the object and writes it to /dev/null:
aws s3api get-object --bucket mybucket --key myfile --range bytes=0-0 /dev/null | jq '.VersionId'
Is there a better way?
Use the command shown below:
aws s3api list-object-versions --bucket mybucket --prefix myfile.css
Here's what the output looks like:
{
  "Versions": [
    {
      "ETag": "\"e4ac40b47c1e1b9269450424f4b72cc1\"",
      "Size": 3359,
      "StorageClass": "STANDARD",
      "Key": "myfile.css",
      "VersionId": "Nz7zrGFB_mdYp8Lx7g0rKkDeD3JHUv9f",
      "IsLatest": true,
      "LastModified": "2020-12-08T14:05:40+00:00",
      "Owner": {
        "DisplayName": "info",
        "ID": "07d06a23da0fa42d662773a9f6ca9f68d3109579d5937630fbe027cc77c36136"
      }
    }
  ]
}
Please refer to details here.
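If you only need the version ID of the latest version, the same command can be narrowed with the CLI's client-side --query filter (bucket and key as in the question):

aws s3api list-object-versions --bucket mybucket --prefix myfile.css \
  --query 'Versions[?IsLatest == `true`].VersionId' --output text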

Kafka-connect without schema registry

I have a Kafka topic and I would like to feed it with Avro data (currently in JSON). I know the "proper" way to do it is to use Schema Registry, but for testing purposes I would like to make it work without it.
So I am sending Avro data as Array[Byte] as opposed to regular JSON objects:
import java.io.{ByteArrayOutputStream, File}
import org.apache.avro.Schema
import org.apache.avro.generic.GenericData
import org.apache.avro.io.EncoderFactory
import org.apache.avro.specific.SpecificDatumWriter

// Parse the schema file, then serialize the record with a binary encoder
val schema = new Schema.Parser().parse(new File("mySchema.avsc"))
val writer = new SpecificDatumWriter[GenericData.Record](schema)
val out = new ByteArrayOutputStream
val encoder = EncoderFactory.get.binaryEncoder(out, null)
writer.write(myAvroData, encoder)
encoder.flush()
out.close()
out.toByteArray
The schema is embedded in each record; how can I make it work with kafka-connect? The kafka-connect configuration currently has the following properties (data is written to S3 as json.gz files), and I want to write Parquet files instead:
{
  "name": "someName",
  "connector.class": "io.confluent.connect.s3.S3SinkConnector",
  "tasks.max": "120",
  "topics": "user_sync",
  "s3.region": "someRegion",
  "s3.bucket.name": "someBucket",
  "s3.part.size": "5242880",
  "s3.compression.type": "gzip",
  "filename.offset.zero.pad.width": "20",
  "flush.size": "5000",
  "rotate.interval.ms": "600000",
  "value.converter": "org.apache.kafka.connect.json.JsonConverter",
  "value.converter.schemas.enable": "false",
  "storage.class": "io.confluent.connect.s3.storage.S3Storage",
  "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
  "partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
  "path.format": "YYYY/MM/dd/HH",
  "timezone": "UTC",
  "locale": "en",
  "partition.duration.ms": "600000",
  "timestamp.extractor": "RecordField",
  "timestamp.field": "ts",
  "schema.compatibility": "NONE"
}
I suppose I need to change "format.class" to "io.confluent.connect.hdfs.parquet.ParquetFormat"? But is that enough?
Thanks a lot!
JsonConverter will be unable to consume Avro-encoded data, since the binary format contains a schema ID from the registry that needs to be extracted before the converter can determine what the data looks like.
You'll want to use the registryless-avro-converter, which will create a Connect Struct object that can then be converted into a Parquet record.
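A minimal sketch of the config changes, assuming the registryless-avro-converter mentioned above (the converter class name and its schema.path property should be verified against that project's README) and an S3 connector version recent enough to ship ParquetFormat:

"value.converter": "me.frmr.kafka.connect.RegistrylessAvroConverter",
"value.converter.schema.path": "/path/to/mySchema.avsc",
"format.class": "io.confluent.connect.s3.format.parquet.ParquetFormat"

The rest of the sink config can stay as-is; value.converter.schemas.enable only applies to JsonConverter and can be dropped.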

Confluent Kafka S3 source connector is not working - Number of groups must be positive

The Confluent Kafka Connect S3 source connector is not working; it's failing with the error below. Here is the configuration. Can someone please help me resolve this issue?
config
{
  "name": 'test_s3_source',
  "config": {
    "connector.class": "io.confluent.connect.s3.source.S3SourceConnector",
    "s3.region": "us-east-1",
    "s3.bucket.name": "s3-bucket",
    "confluent.license": "",
    "confluent.topic.bootstrap.servers": "bootstrap-servers",
    "partitioner.class": " io.confluent.connect.storage.partitioner.DefaultPartitioner",
    "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
    "key.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "transforms": "AddPrefix",
    "transforms.AddPrefix.type": "org.apache.kafka.connect.transforms.RegexRouter",
    "transforms.AddPrefix.regex": ".*",
    "transforms.AddPrefix.replacement": "test.s3.source"
  }
}
Error:
Failed to reconfigure connector's tasks, retrying after backoff: (org.apache.kafka.connect.runtime.distributed.DistributedHerder:1129)
java.lang.IllegalArgumentException: Number of groups must be positive

Handle lags in Kafka S3 Connector

We're using Kafka Connect [distributed, Confluent 4.0].
It works very well, except that there always remain uncommitted messages in the topic the connector listens to. The behavior is probably related to the S3 connector configuration "flush.size": "20000"; the lag in the topic always stays below the flush size.
Our data comes in batches, and I don't want to wait until the next batch arrives, nor reduce flush.size and create tons of files.
Is there a way to set a timeout after which the S3 connector will flush the data even if it hasn't reached 20000 events?
Thanks!
"config": {
"connector.class": "io.confluent.connect.s3.S3SinkConnector",
"topics": "event",
"tasks.max": "3",
"topics.dir": "connect",
"s3.region": "some_region",
"s3.bucket.name": "some_bucket",
"s3.part.size": "5242880",
"flush.size": "20000",
"storage.class": "io.confluent.connect.s3.storage.S3Storage",
"format.class": "io.confluent.connect.s3.format.avro.AvroFormat",
"schema.generator.class": "io.confluent.connect.storage.hive.schema.DefaultSchemaGenerator",
"schema.compatibility": "FULL",
"partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
"path.format": "'\''day_ts'\''=YYYYMMdd/'\''hour_ts'\''=H",
"partition.duration.ms": "3600000",
"locale": "en_US",
"timezone": "UTC",
"timestamp.extractor": "RecordField",
"timestamp.field": "time"
}
}
To flush outstanding records periodically on low-volume topics with the S3 connector, you can use the configuration property:
rotate.schedule.interval.ms
(Complete list of configs here)
Keep in mind that by using the property above you might see duplicate messages in the event of reprocessing or recovery from errors, regardless of which partitioner you are using.
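For example, to force a flush of whatever has accumulated at least every ten minutes (the interval value is illustrative), the property can sit alongside the existing flush.size in the config above:

"flush.size": "20000",
"rotate.schedule.interval.ms": "600000"

Scheduled rotation is based on wall-clock time rather than record count, so small batches get committed even when the topic is quiet, which is also why it can produce the duplicates mentioned above.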