Properly Configuring Kafka Connect S3 Sink TimeBasedPartitioner

I am trying to use the TimeBasedPartitioner of the Confluent S3 sink. Here is my config:
{
  "name": "s3-sink",
  "config": {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "tasks.max": "1",
    "file": "test.sink.txt",
    "topics": "xxxxx",
    "s3.region": "yyyyyy",
    "s3.bucket.name": "zzzzzzz",
    "s3.part.size": "5242880",
    "flush.size": "1000",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "format.class": "io.confluent.connect.s3.format.avro.AvroFormat",
    "schema.generator.class": "io.confluent.connect.storage.hive.schema.DefaultSchemaGenerator",
    "partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
    "timestamp.extractor": "Record",
    "timestamp.field": "local_timestamp",
    "path.format": "YYYY-MM-dd-HH",
    "partition.duration.ms": "3600000",
    "schema.compatibility": "NONE"
  }
}
The data is binary and I use an Avro schema for it. I want to use the actual record field "local_timestamp", which is a UNIX timestamp, to partition the data, say into hourly files.
I start the connector with the usual REST API call
curl -X POST -H "Content-Type: application/json" --data @s3-config.json http://localhost:8083/connectors
Unfortunately the data is not partitioned as I wish. I also tried removing flush.size, since I thought it might interfere, but then I got the error:
{"error_code":400,"message":"Connector configuration is invalid and contains the following 1 error(s):\nMissing required configuration \"flush.size\" which has no default value.\nYou can also find the above list of errors at the endpoint `/{connectorType}/config/validate`"}%
Any idea how to properly set up the TimeBasedPartitioner? I could not find a working example.
Also, how can one debug such a problem or gain further insight into what the connector is actually doing?
Any help or further suggestions are greatly appreciated.

After studying the code at TimeBasedPartitioner.java and the logs with
confluent log connect tail -f
I realized that both timezone and locale are mandatory, although the Confluent S3 connector documentation does not state this. The following config fields solve the problem and let me upload the records, properly partitioned, to S3 buckets:
"flush.size": "10000",
"storage.class": "io.confluent.connect.s3.storage.S3Storage",
"format.class": "io.confluent.connect.s3.format.avro.AvroFormat",
"schema.generator.class": "io.confluent.connect.storage.hive.schema.DefaultSchemaGenerator",
"partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
"path.format": "'year'=YYYY/'month'=MM/'day'=dd/'hour'=HH",
"locale": "US",
"timezone": "UTC",
"partition.duration.ms": "3600000",
"timestamp.extractor": "RecordField",
"timestamp.field": "local_timestamp",
Note two more things: first, a value for flush.size is still required; files are eventually committed in chunks containing no more than flush.size records each. Second, choosing path.format as shown above produces a proper tree structure in the bucket.
I am still not 100% sure whether the record field local_timestamp is really what is used to partition the records.
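Putting it all together, a complete connector config along these lines should work (the bucket, region and topic names are placeholders; the values simply combine the fields above with those from the original question):
{
  "name": "s3-sink",
  "config": {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "tasks.max": "1",
    "topics": "xxxxx",
    "s3.region": "yyyyyy",
    "s3.bucket.name": "zzzzzzz",
    "s3.part.size": "5242880",
    "flush.size": "10000",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "format.class": "io.confluent.connect.s3.format.avro.AvroFormat",
    "schema.generator.class": "io.confluent.connect.storage.hive.schema.DefaultSchemaGenerator",
    "partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
    "path.format": "'year'=YYYY/'month'=MM/'day'=dd/'hour'=HH",
    "locale": "US",
    "timezone": "UTC",
    "partition.duration.ms": "3600000",
    "timestamp.extractor": "RecordField",
    "timestamp.field": "local_timestamp",
    "schema.compatibility": "NONE"
  }
}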
Any comments or improvements are greatly welcome.

Indeed your amended configuration seems correct.
Specifically, setting timestamp.extractor to RecordField allows you to partition your files based on the timestamp field that your records have and which you identify by setting the property timestamp.field.
When one sets timestamp.extractor=Record instead, the time-based partitioner uses the Kafka timestamp of each record.
Regarding flush.size, setting this property to a very high value (e.g. Integer.MAX_VALUE) is practically equivalent to ignoring it.
Finally, schema.generator.class is no longer required in the most recent versions of the connector.
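As an illustration of what to expect in the bucket (the exact naming comes from the connector's defaults, so treat the pattern as an assumption to verify against your version): with the path.format above and topics.dir left at its default of topics, the committed objects end up under keys such as
topics/xxxxx/year=2021/month=05/day=01/hour=12/xxxxx+0+0000000000.avro
topics/xxxxx/year=2021/month=05/day=01/hour=13/xxxxx+0+0000010000.avro
i.e. <topics.dir>/<topic>/<encoded partition>/<topic>+<kafka partition>+<start offset>.<extension>, where the start offset is that of the first record in the file.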

Related

Kafka Connect S3 Sink add MetaData

I am trying to add the metadata to the output from kafka into the S3 bucket.
Currently, the output is just the values from the messages from the kafka topic.
I want to get it wrapped with the following (metadata): topic, timestamp, partition, offset, key, value
example:
{
  "topic": "some-topic",
  "timestamp": "some-timestamp",
  "partition": "some-partition",
  "offset": "some-offset",
  "key": "some-key",
  "value": "the-orig-value"
}
Note: when I fetch it through a consumer, it fetches all the metadata, as I wished.
My connector configuration:
{
  "name": "test_s3_sink",
  "config": {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "errors.log.enable": "true",
    "errors.log.include.messages": "true",
    "flush.size": "10000",
    "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
    "name": "test_s3_sink",
    "rotate.interval.ms": "60000",
    "s3.bucket.name": "some-bucket-name",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "topics": "some.topic",
    "topics.dir": "some-dir"
  }
}
Thanks.
Currently, the output is just the values from the messages from the kafka topic.
Correct, this is the documented behavior. There is a setting for including the key data, if you want that as well, but no settings to get the rest of the data.
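As a rough sketch of those key-related properties (I believe they were added in the 10.x releases of the connector, so verify the exact names against the version you run):
"store.kafka.keys": "true",
"keys.format.class": "io.confluent.connect.s3.format.json.JsonFormat"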
For the record timestamp, you could edit your producer code to simply add that as part of your records. (and everything else, for that matter, if you're able to query for the next offset of the topic every time you produce)
For topic and partition, those are part of the S3 file path, so whatever you're reading the files with should be able to parse out that information; the starting offset is also part of the filename, and adding the line number within the file gives the (approximate) offset of the record.
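For example, with the default partitioner the object keys follow a pattern roughly like this (treat it as an illustration; the exact layout depends on your partitioner and topics.dir):
some-dir/some.topic/partition=3/some.topic+3+0000012345.json
where some.topic is the topic, 3 is the Kafka partition, and 12345 is the offset of the first record in the file.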
Or you can use a Connect transform, such as this archive one, that relocates the Kafka record metadata (except offset and partition) into the Connect Struct value, so that the sink connector writes all of it to the files:
https://github.com/jcustenborder/kafka-connect-transform-archive
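Applying that transform in the sink config would look roughly like this (the class name is taken from that repository, so double-check it against the version you install):
"transforms": "archive",
"transforms.archive.type": "com.github.jcustenborder.kafka.connect.archive.Archive"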
Either way, ConnectRecord has no offset field; a SinkRecord does, but I think that's too late in the API for transforms to access it.

Data Factory copy pipeline from API

We use an Azure Data Factory copy pipeline to transfer data from REST APIs to an Azure SQL Database, and it is doing some strange things. Because we loop over a set of APIs that need to be transferred, the mapping of the copy activity is left empty.
But for one API the automatic mapping goes wrong, even though the destination table is created with all the needed columns and correct data types based on the received metadata. When we run the pipeline for that specific API, the following message is shown:
{ "errorCode": "2200", "message": "ErrorCode=SchemaMappingFailedInHierarchicalToTabularStage,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Failed to process hierarchical to tabular stage, error message: Ticks must be between DateTime.MinValue.Ticks and DateTime.MaxValue.Ticks.\r\nParameter name: ticks,Source=Microsoft.DataTransfer.ClientLibrary,'", "failureType": "UserError", "target": "Copy data1", "details": [] }
As a test, we did the mapping for that API manually by using the "Import Schema" option on the Mapping page. There we see that all the fields are correctly mapped. We executed the pipeline again using that mapping and everything works fine.
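(For reference, such an explicit mapping ends up as a translator block on the copy activity, roughly like the sketch below; the field and column names are placeholders.)
"translator": {
  "type": "TabularTranslator",
  "mappings": [
    {
      "source": { "path": "$['someApiField']" },
      "sink": { "name": "SomeColumn", "type": "String" }
    }
  ]
}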
But of course, we don't want to use a manual mapping, because the pipeline is used in a loop for different APIs as well.

Exporting Cloudwatch logs in original format

I am looking for a way to export CloudWatch logs in their original form to S3. I used the console to export a day's worth of logs from a log group, and it seems that a timestamp was prepended to each line, breaking the original JSON formatting. I was looking to import this into Glue as a JSON file for a test transformation script. The original data is formatted as a normal JSON string when imported into CloudWatch, and when I normally process the data it looks like:
{ "a": 123, "b": "456", "c": 789 }
After exporting and decompressing the data, it looks like:
2019-06-28T00:00:00.099Z { "a": 123, "b": "456", "c": 789 }
This breaks reading the line as a JSON string, since it's no longer in a standard format.
The dataset is fairly large (100 GB+) for this run, and will possibly grow larger in the future, so running a CLI command and processing each line locally isn't feasible in my opinion. Is there any known way to do what I am looking to do?
Thank you
Timestamps are automatically added when you push logs to CloudWatch; every log event in CloudWatch has a timestamp.
You can create a subscription filter to Kinesis Data Firehose and, using a Lambda function on the delivery stream, reformat the log events (remove the timestamp) before storing the logs in S3.
https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/Subscriptions.html
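As a sketch, creating such a subscription filter with the AWS CLI would look something like this (the names and ARNs are placeholders, and the role must allow CloudWatch Logs to put records to the delivery stream):
aws logs put-subscription-filter \
  --log-group-name "my-log-group" \
  --filter-name "to-firehose" \
  --filter-pattern "" \
  --destination-arn "arn:aws:firehose:us-east-1:123456789012:deliverystream/my-stream" \
  --role-arn "arn:aws:iam::123456789012:role/cwl-to-firehose"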

Fiware Entities and STH

I am using the Orion Context Broker, an IoT Agent and Cygnus to handle and persist the data of several devices into a MongoDB. It's working, but I don't know if I'm doing it the Fiware way, because after reading the documentation I am still confused about some things:
I don't completely understand the difference between an Entity and an IoT Entity (or device?). My guess is that it is a matter of how they provide context data and the nature of the entity being modelled, but I would be grateful if someone could clarify it. I am especially confused because the creation of each entity type is different (it seems that I can't initialize an IoT entity at creation time, which I can when dealing with a regular Entity).
I can only persist the data of IoT Entities. Is it possible to have a Short Term History of a regular Entity?
I don't understand why the STH data repeats attributes that have not changed. If I have an IoT Entity with two attributes, 'a' and 'b', and I modify both of them, an STH entry is created for each one, which is fine. However, if I then change the value of attribute 'b', two more registers are created: one for 'a' (which hasn't changed and reflects the same value it already had) and one for 'b'. Could someone explain this behavior to me?
1. Entities vs IoT Entities
I assume that what you mean by an IoT entity is the entry made by the IoT Agent upon receiving a sensor reading from a provisioned device.
Logically there is no difference between an entity created and maintained by an IoT Agent and an entity created and maintained by any other service making NGSI request to the context broker.
Your so-called IoT Entity is merely a construct where an IoT Agent does all the heavy lifting for you and converts the data coming from a device in a proprietary format into the NGSI standard.
2. Short Term History of a regular Entity
To create a Short Term History you will need a separate Generic Enabler such as STH-Comet or QuantumLeap. Both of these enablers receive updates from Orion using the subscriptions mechanism. If you set up your IoT data under one fiware-service header and your non-IoT data under another fiware-service, you can easily set up subscriptions that differentiate between the two.
e.g. the following subscription:
curl -iX POST \
'http://localhost:1026/v2/subscriptions/' \
-H 'Content-Type: application/json' \
-H 'fiware-service: iotdata' \
-H 'fiware-servicepath: /' \
-d '<body>'
will only apply to entities within the iotdata service (set via the fiware-service header), which would be created when you provision your IoT service.
3. Repeating attributes that have not changed.
The <body> of the subscription can be used to narrow down the conditions under which the historical data is persisted.
The entities, the condition and the attrs are the important parts of the subscription body:
subject": {
"entities": [
{
"idPattern": "Motion.*"
}
],
"condition": {
"attrs": [
"count"
]
}
},
"notification": {
"http": {
"url": "http://quantumleap:8668/v2/notify"
},
"attrs": [
"count"
],
"metadata": ["dateCreated", "dateModified"]
},
"throttling": 1
}'
The subscription defined above will only fire if the count attribute is changed, and it will only persist the count attribute. If you do not limit the attrs in the notification, multiple lines will be persisted to the database. Similarly, if you do not limit the condition, multiple entries of count will be persisted whenever other attributes are updated.

Can BigQuery report the mismatched schema field?

When I upsert a row that mismatches the schema, I get a PartialFailureError along with a message, e.g.:
[ { errors:
[ { message: 'Repeated record added outside of an array.',
reason: 'invalid' } ],
...
]
However for large rows this isn't sufficient, because I have no idea which field is the one creating the error. The bq command does report the malformed field.
Is there either a way to configure or access name of the offending field, or can this be added to the API endpoint?
Please see this GitHub issue: https://github.com/googleapis/nodejs-bigquery/issues/70. Apparently the Node.js client library is not getting the location field from the API, so it's not able to return it to the caller.
A workaround that worked for me: I copied the JSON payload into my Postman client and manually sent a request to the REST API (let me know if you need more details on how to do it).
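For instance, sending the same payload directly to the tabledata.insertAll REST endpoint with curl looks roughly like this (project, dataset, table and the payload file are placeholders; payload.json contains {"rows": [{"json": { ... }}]}):
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  --data @payload.json \
  "https://bigquery.googleapis.com/bigquery/v2/projects/my-project/datasets/my_dataset/tables/my_table/insertAll"
The raw response's insertErrors should include a location field naming the offending column, which is the detail the Node.js client currently drops.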