Kafka Connect S3 Sink: add metadata

I am trying to add metadata to the output that Kafka Connect writes from Kafka into the S3 bucket.
Currently, the output contains only the values of the messages from the Kafka topic.
I want each record wrapped with the following metadata: topic, timestamp, partition, offset, key, value.
Example:
{
  "topic": "some-topic",
  "timestamp": "some-timestamp",
  "partition": "some-partition",
  "offset": "some-offset",
  "key": "some-key",
  "value": "the-orig-value"
}
Note: when I fetch the messages with a consumer, I get all the metadata, as I wished.
My connector configuration:
{
  "name" : "test_s3_sink",
  "config" : {
    "connector.class" : "io.confluent.connect.s3.S3SinkConnector",
    "errors.log.enable" : "true",
    "errors.log.include.messages" : "true",
    "flush.size" : "10000",
    "format.class" : "io.confluent.connect.s3.format.json.JsonFormat",
    "name" : "test_s3_sink",
    "rotate.interval.ms" : "60000",
    "s3.bucket.name" : "some-bucket-name",
    "storage.class" : "io.confluent.connect.s3.storage.S3Storage",
    "topics" : "some.topic",
    "topics.dir" : "some-dir"
  }
}
Thanks.

Currently, the output is just the values from the messages from the kafka topic.
Correct, this is the documented behavior. There is a setting for including the key data, if you wanted that as well, but there are no settings to get the rest of the metadata.
For the record timestamp, you could edit your producer code to simply add it as part of your records (and everything else, for that matter, if you're able to query for the next offset of the topic every time you produce).
For topic and partition, those are part of the S3 object key, so whatever you're reading the files with should be able to parse out that information; the starting offset is also part of the filename, and you can add the line number within the file to get the (approximate) offset of each record.
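For example, with your config above and the connector's default partitioner, an object key looks roughly like this (the partition and offset values are made up for illustration; the file name pattern is topic+partition+startOffset):
some-dir/some.topic/partition=0/some.topic+0+0000012345.json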
Or you can use a Connect transform, such as this archive one, which relocates the Kafka record metadata (except offset and partition) into the Connect Struct value so that the sink connector then writes all of it to the files (a config sketch follows the link):
https://github.com/jcustenborder/kafka-connect-transform-archive
Either way, ConnectRecord has no offset field; a SinkRecord does, but I think that's too late in the API for transforms to access it.
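For reference, wiring that transform in is just a matter of adding the transforms properties to the "config" block above. A minimal sketch; the transform alias is arbitrary, and the exact class name should be checked against the version of kafka-connect-transform-archive you install:
"transforms" : "archive",
"transforms.archive.type" : "com.github.jcustenborder.kafka.connect.archive.Archive",
With that in place, the JsonFormat output should contain the topic, timestamp, key, and original value wrapped in the struct the transform produces.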

Related

Ingest data into warp10 - Performance tip

We're looking for the best way to ingest data into Warp 10. We have a microservices architecture that mainly uses Kafka.
Two solutions:
Use the Ingress endpoint as defined here: https://www.warp10.io/content/03_Documentation/03_Interacting_with_Warp_10/03_Ingesting_data/01_Ingress (this is the solution we use for now)
Use the Warp 10 Kafka plugin as defined here: https://blog.senx.io/introducing-the-warp-10-kafka-plugin/
As described above, we use the Ingress solution for now: we aggregate data for x seconds and call the Ingress API to send the data per packet, instead of calling the API each time we need to insert something.
For a few days we have been experimenting with the Kafka plugin. We successfully set up the plugin and created an .mc2 macro responsible for consuming data from a given topic and then inserting it into Warp 10 using UPDATE.
Questions:
Using the Kafka plugin, would it be better to apply the same buffering mechanism as the one we apply when using the Ingress endpoint? Or is there anything specific in the Warp 10 Kafka plugin that allows consuming the topic message by message and calling the UPDATE function for each one?
Today, as both solutions are working, we're trying to find the differences in order to get the best ingestion performance, if possible without any buffering mechanism, because we are trying to stay as close to real time as possible.
MC2 file:
{
  'topics' [ 'our_topic_name' ] // List of Kafka topics to subscribe to
  'parallelism' 1 // Number of threads to start for processing the incoming messages. Each thread will handle a certain number of partitions.
  'config' { // Map of Kafka consumer parameters
    'bootstrap.servers' 'kafka-headless:9092'
    'group.id' 'senx-consumer'
    'enable.auto.commit' 'true'
  }
  'macro' <%
    // macro executed each time a kafka record is consumed
    /*
    // received record format:
    {
      'timestamp' 123 // The record timestamp
      'timestampType' 'type' // The type of timestamp, can be one of 'NoTimestampType', 'CreateTime', 'LogAppendTime'
      'topic' 'topic_name' // Name of the topic which received the message
      'offset' 123 // Offset of the message in 'topic'
      'partition' 123 // Id of the partition which received the message
      'key' ... // Byte array of the message key
      'value' ... // Byte array of the message value
      'headers' { } // Map of message headers
    }
    */
    "recordArray" STORE
    "preprod.write" "token" STORE
    // macro can be called on timeout with an empty entry map
    $recordArray SIZE 0 !=
    <%
      $recordArray 'value' GET // kafka record value is retrieved in bytes
      'UTF-8' BYTES-> // convert bytes to string (WARP10 INGRESS format)
      JSON->
      "value" STORE
      "Records received through Kafka" LOGMSG
      $value LOGMSG
      $value
      <%
        DROP
        PARSE
        // PARSE outputs a gtsList, including only one gts
        0 GET
        // GTS rename is required to use UPDATE function
        "gts" STORE
        $gts $gts NAME RENAME
      %>
      LMAP
      // Store GTS in Warp 10
      $token
      UPDATE
    %>
    IFT
  %> // end macro
  'timeout' 10000 // Polling timeout (in ms), if no message is received within this delay, the macro will be called with an empty map as input
}
If you want to cache something in Warp 10 to avoid lots of UPDATE calls per second, you can use SHM (shared memory). This is a built-in extension you need to activate.
Once activated, use it with SHMSTORE and SHMLOAD to keep objects in RAM between two WarpScript executions.
In your example, you can push all the incoming GTS into a list, or a list of lists of GTS, using +! to append elements to an existing list.
The MERGE of all the GTS in the cache (by name + labels) and the UPDATE into the database can then be done in a runner (don't forget to use a MUTEX).
Don't forget the total operation cost:
The ingress format can be optimized for ingestion speed if you do not repeat the classname and labels, and if you group lines per GTS; see the example after this list.
PARSE deserializes data from the Warp 10 ingress format.
UPDATE serializes data to the Warp 10 optimized ingress format (and pushes it to the update endpoint).
The update endpoint deserializes it again.
It makes sense to do these deserialize/serialize/deserialize operations if your input data is far from the optimal ingress format. It also makes sense if you want to RANGECOMPACT your data to save disk space, or to do any preprocessing.
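For illustration, the optimized ingress format sends the classname and labels only once per GTS and lets the following lines of the same GTS reuse them by starting with =. The class name, label, timestamps, and values below are made up; each line is TS/LAT:LON/ELEV class{labels} value, with location and elevation left empty here:
1440000000000000// some.class{label0=value0} 42
=1440000001000000// 43
=1440000002000000// 44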

Trying to pass binary files through Logstash

Some process is producing binary files into my Kafka topic (from Java they come as a byte array).
I'm trying to consume from Kafka with Logstash and upload the files to S3.
My pipeline:
input {
  kafka {
    bootstrap_servers => "my-broker:9092"
    topics => ["my-topic"]
    partition_assignment_strategy => "org.apache.kafka.clients.consumer.StickyAssignor"
    value_deserializer_class => "org.apache.kafka.common.serialization.ByteArrayDeserializer"
  }
}
filter {
  mutate {
    remove_field => ["@timestamp", "host"]
  }
}
output {
  s3 {
    region => "eu-west-1"
    bucket => "my_bucket"
    time_file => 1
    prefix => "files/"
    rotation_strategy => "time"
  }
}
As you can see, I used a different deserializer class. However, it seems that Logstash by default uses a codec that converts the byte array to a string. My goal is to upload the file to S3 as-is. Is there any known codec that does nothing to the input data and uploads it as-is?
Right now the files are uploaded to S3, but I can't read or open them. The binary content was somehow corrupted by Logstash. For example, I tried sending a gzip that contains multiple files inside and I can't open it afterwards in S3.
The warning that I get from Logstash:
[2020-06-02T10:49:29,149][WARN ][logstash.codecs.plain ][my_pipeline] Received an event that has a different character encoding than you configured. {:text=>"7z\\xBC\\xAF'\\u001C\\u0000\\u0002\\xA6j<........more binary data", :expected_charset=>"UTF-8"}
I'm not sure that Logstash is the best fit for passing binary data, and in the end I implemented a Java consumer, but the following solution worked for me with Logstash:
The data sent to Kafka can be serialized as binary data. For example, I used Filebeat to send the binary data: in its Kafka output module there is a parameter called "value_serializer", which should be set to "org.apache.kafka.common.serialization.ByteArraySerializer".
In your Logstash settings (Kafka input), set value_deserializer_class to "org.apache.kafka.common.serialization.ByteArrayDeserializer", just as in the pipeline above.
Your output in Logstash can be any destination that can receive binary data.
Be aware that the output will get binary data and you will need to deserialize it.
I don't think you really understand what Logstash is for.
As its name (log-stash) suggests, it is for streaming ASCII-type files, using an EOL delimiter to distinguish between different log events.
I did manage to find the community-developed kafkabeat for reading data from Kafka topics; there are 2 options:
kafkabeat - Reads data from Kafka topics.
kafkabeat2 - Reads data (json or plain) from Kafka topics.
I didn't test those myself, but using the S3 output option with them might do the trick. If the S3 output is not yet supported, you can develop it yourself and contribute it to the open source project so everyone can enjoy it :-)

Can BigQuery report mismatched the schema field?

When I upsert a row that doesn't match the schema, I get a PartialFailureError along with a message, e.g.:
[ { errors:
    [ { message: 'Repeated record added outside of an array.',
        reason: 'invalid' } ],
  ...
]
However, for large rows this isn't sufficient, because I have no idea which field is causing the error. The bq command-line tool does report the malformed field.
Is there a way to configure or access the name of the offending field, or can this be added to the API endpoint?
Please see this GitHub issue: https://github.com/googleapis/nodejs-bigquery/issues/70. Apparently the Node.js client library does not pick up the location field from the API, so it's not able to return it to the caller.
A workaround that worked for me: I copied the JSON payload to my Postman client and manually sent the request to the REST API (let me know if you need more details on how to do it).
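For completeness, a sketch of that manual call, with the project, dataset, table, and row contents as placeholders: POST a body like the one below to the streaming insert endpoint https://bigquery.googleapis.com/bigquery/v2/projects/<PROJECT>/datasets/<DATASET>/tables/<TABLE>/insertAll.
{
  "rows": [
    { "insertId": "some-unique-id", "json": { "some_field": "some-value" } }
  ]
}
In the raw API response, insertErrors[].errors[].location names the offending field; that is the piece the Node.js client was not passing through to the caller.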

Properly Configuring Kafka Connect S3 Sink TimeBasedPartitioner

I am trying to use the TimeBasedPartitioner of the Confluent S3 sink. Here is my config:
{
  "name": "s3-sink",
  "config": {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "tasks.max": "1",
    "file": "test.sink.txt",
    "topics": "xxxxx",
    "s3.region": "yyyyyy",
    "s3.bucket.name": "zzzzzzz",
    "s3.part.size": "5242880",
    "flush.size": "1000",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "format.class": "io.confluent.connect.s3.format.avro.AvroFormat",
    "schema.generator.class": "io.confluent.connect.storage.hive.schema.DefaultSchemaGenerator",
    "partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
    "timestamp.extractor": "Record",
    "timestamp.field": "local_timestamp",
    "path.format": "YYYY-MM-dd-HH",
    "partition.duration.ms": "3600000",
    "schema.compatibility": "NONE"
  }
}
The data is binary and I use an Avro schema for it. I want to use the actual record field "local_timestamp", which is a UNIX timestamp, to partition the data, say into hourly files.
I start the connector with the usual REST API call:
curl -X POST -H "Content-Type: application/json" --data @s3-config.json http://localhost:8083/connectors
Unfortunately the data is not partitioned as I wish. I also tried removing the flush size because it might interfere, but then I got the error:
{"error_code":400,"message":"Connector configuration is invalid and contains the following 1 error(s):\nMissing required configuration \"flush.size\" which has no default value.\nYou can also find the above list of errors at the endpoint `/{connectorType}/config/validate`"}
Any idea how to properly set the TimeBasedPartitioner? I could not find a working example.
Also, how can one debug such a problem or gain further insight into what the connector is actually doing?
Greatly appreciate any help or further suggestions.
After studying the code at TimeBasedPartitioner.java and the logs with
confluent log connect tail -f
I realized that both timezone and locale are mandatory, although this is not specified as such in the Confluent S3 Connector documentation. The following config fields solve the problem and let me upload the records properly partitioned to S3 buckets:
"flush.size": "10000",
"storage.class": "io.confluent.connect.s3.storage.S3Storage",
"format.class": "io.confluent.connect.s3.format.avro.AvroFormat",
"schema.generator.class": "io.confluent.connect.storage.hive.schema.DefaultSchemaGenerator",
"partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
"path.format": "'year'=YYYY/'month'=MM/'day'=dd/'hour'=HH",
"locale": "US",
"timezone": "UTC",
"partition.duration.ms": "3600000",
"timestamp.extractor": "RecordField",
"timestamp.field": "local_timestamp",
Note two more things: First, a value for flush.size is also necessary; files are eventually partitioned into smaller chunks, no larger than specified by flush.size. Second, path.format is better selected as displayed above, so that a proper tree structure is generated.
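For illustration, with the path.format above and the default topics.dir, the resulting object keys look roughly like this (dates, partition, and offsets are made up; the file name pattern is topic+partition+startOffset):
topics/xxxxx/year=2018/month=03/day=21/hour=17/xxxxx+0+0000000000.avro
topics/xxxxx/year=2018/month=03/day=21/hour=18/xxxxx+0+0000001000.avro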
I am still not 100% sure whether the record field local_timestamp is really used to partition the records.
Any comments or improvements are greatly welcome.
Indeed your amended configuration seems correct.
Specifically, setting timestamp.extractor to RecordField allows you to partition your files based on the timestamp field that your records have and which you identify by setting the property timestamp.field.
When instead one sets timestamp.extractor=Record, then a time-based partitioner will use the Kafka timestamp for each record.
Regarding flush.size, setting this property to a high value (e.g. Integer.MAX_VALUE) is practically equivalent to ignoring it (see the example below).
Finally, schema.generator.class is no longer required in the most recent versions of the connector.
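For example, a hypothetical value (2147483647 is Integer.MAX_VALUE) that effectively disables size-based flushing, so that in practice only the time-based settings determine when files are written:
"flush.size": "2147483647",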

Ignite and Kafka Integration

I am trying the Ignite and Kafka integration to bring Kafka messages into an Ignite cache.
My message key is a random string (to work with Ignite, the Kafka message key can't be null), and the value is a JSON string representation of Person (a Java class).
When Ignite receives such a message, it looks like Ignite will use the message's key (the random string in my case) as the cache key.
Is it possible to change the message key to the person's ID, so that I can put the Person into the cache keyed by its ID?
It looks like streamer.receiver(new StreamReceiver(...)) would work:
streamer.receiver(new StreamReceiver<String, String>() {
    public void receive(IgniteCache<String, String> cache, Collection<Map.Entry<String, String>> entries) throws IgniteException {
        for (Map.Entry<String, String> entry : entries) {
            Person p = fromJson(entry.getValue());
            // ignore the message key, and use the person ID as the cache key
            cache.put(p.getId(), p);
        }
    }
});
Is this the recommended way? Also, I am not sure whether calling cache.put in a StreamReceiver is correct, since the receiver is only meant as a pre-processing step before writing to the cache.
The data streamer will map all your keys to cache affinity nodes, create batches of entries, and send the batches to those affinity nodes. After that, StreamReceiver will receive your entries, get the Person's ID, and invoke cache.put(K, V). Putting an entry leads to mapping your key to the corresponding cache affinity node and sending an update request to that node.
Everything looks good, but the result of mapping your random key from Kafka and the result of mapping the Person's ID will be different (most likely different nodes). As a result, you will get poor performance due to redundant network hops.
Unfortunately, the current KafkaStreamer implementation doesn't support stream tuple extractors (see e.g. the StreamSingleTupleExtractor class). But you can easily create your own Kafka streamer implementation using the existing one as an example.
Also, you can try to use KafkaStreamer's keyDecoder and valDecoder in order to extract the Person's ID from the Kafka message. I'm not sure, but it might help.