Kafka Connect S3 Sink Flush data - Strange lag

I have a TABLE created from a KSQL query and an input STREAM that is backed by a Kafka topic.
This topic is written to S3 using the Kafka Connect S3 sink.
The topic receives around 1k msgs/sec.
The topic has 6 partitions and 3 replicas.
I am seeing a strange output ratio; the sink behaves oddly.
Here is my monitoring:
[monitoring screenshot]
You can see the first chart shows Input ratio B/s, the second Out ratio and the third the lag computed using Burrow.
Here is my s3-sink properties file:
{
  "name": "sink-feature-static",
  "config": {
    "topics": "FEATURE_APP_STATIC",
    "topics.dir": "users-features-stream",
    "tasks.max": "6",
    "consumer.override.auto.offset.reset": "latest",
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "format.class": "io.confluent.connect.s3.format.parquet.ParquetFormat",
    "parquet.codec": "snappy",
    "schema.generator.class": "io.confluent.connect.storage.hive.schema.DefaultSchemaGenerator",
    "partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
    "path.format": "'part_date'=YYYY-MM-dd/'part_hour'=HH",
    "partition.duration.ms": "3600000",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter.schema.registry.url": "http://cp-schema-registry.schema-registry.svc.cluster.local:8081",
    "flush.size": 1000000,
    "s3.part.size": 5242880,
    "rotate.interval.ms": "600000",
    "rotate.schedule.interval.ms": "600000",
    "locale": "fr-FR",
    "timezone": "UTC",
    "timestamp.extractor": "Record",
    "schema.compatibility": "NONE",
    "aws.secret.access.key": "secretkey",
    "aws.access.key.id": "accesskey",
    "s3.bucket.name": "feature-store-prod-kafka-test",
    "s3.region": "eu-west-1"
  }
}
Here is what I'm observing in the S3 bucket:
[s3 bucket screenshot]
In these files I have a small number of messages in parquet.snappy (sometimes only 1, sometimes more, ...). Around 2 files per second per partition. (As I'm using the Record timestamp extractor, I think this is because it's catching up on the lag.)
What I was expecting is:
A file commit every 1,000,000 messages (flush.size) or every 10 minutes (rotate.schedule.interval.ms).
So I'm expecting (as 1M messages > 10 min × 1k msgs/s):
1/ 6 (one every 10 min) × 6 (number of partitions) = 36 parquet files every hour
2/ Or, if I'm wrong about that, at least files with 1M messages inside ...
But neither 1/ nor 2/ is observed ...
Instead, I have a huge lag and a flush/commit of S3 files every hour (see monitoring).
Does "partition.duration.ms": "3600000" leads to that observation ?
Where am I wrong ?
Why I do not see a continuous Output flush of data but such spikes ?
Thanks !
Rémy

So yes: first, set partition.duration.ms to 10 minutes (600000) if you want one S3 object per 10 minutes. Second, if you really don't want small files, set rotate.interval.ms=-1 and rotate.schedule.interval.ms to 10 minutes (however, you lose the guarantee of exactly-once delivery).
When using rotate.interval.ms, what happens is that each time the connector receives a record with a timestamp earlier than the one that opened the current file, Kafka Connect flushes, leading to very small files at the beginning and end of each hour; it does, however, ensure exactly-once delivery in all failure cases.
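For illustration, a minimal sketch of applying those overrides through the Kafka Connect REST API (PUT /connectors/{name}/config expects the full configuration, so the unchanged keys from the properties file above would be resubmitted too; the Connect host below is an assumption):

import requests

# Sketch only: resubmit the full sink config, with the rotation-related keys changed.
config = {
    # ... all unchanged "config" keys from the properties file above ...
    "partition.duration.ms": "600000",        # one time-based partition per 10 minutes
    "rotate.interval.ms": "-1",               # disable record-timestamp-based rotation
    "rotate.schedule.interval.ms": "600000",  # rotate on wall-clock time every 10 minutes
}

resp = requests.put(
    "http://connect-host:8083/connectors/sink-feature-static/config",  # hypothetical host
    json=config,
)
resp.raise_for_status()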

Related

Merge two message threads into one

I have two message threads; each thread consists of ten messages. I need to display these two chains as one.
The new thread must consist of ten different messages: five messages from one system and five messages from the other (backup) system. Messages from a given system use the same SrcMsgId value, and each system has a unique SrcMsgId within the same chain. The message chain from the backup system enters Splunk immediately after the messages from the main system. Messages from the standby system also carry a Mainsys_srcMsgId value - this value is identical to the main system's SrcMsgId value. How can I display a chain of all ten messages? Perhaps first the messages from the first (main) system, then from the second (backup), showing the time of arrival at the server.
Specifically, we want to see all ten messages one after the other, in the order in which they arrived at the server: five messages from the primary, for example ("srcMsgId": "rwfsdfsfqwe121432gsgsfgd71"), and five from the backup ("srcMsgId": "rwfsdfsfqwe121432gsgsfgd72"). The problem is that messages from other systems also arrive at the server and everything is mixed together chaotically, which is why we want to group all the messages from one system and its counterpart in the search. Messages from the backup system are associated with the main system only by this parameter: "Mainsys_srcMsgId" - using this key, we understand that the messages come from the backup system (secondary to the main one).
Examples of messages from the primary and secondary system:
Main system:
{
  "event": "Sourcetype test please",
  "sourcetype": "testsystem-2",
  "host": "some-host-123",
  "fields":
  {
    "messageId": "ED280816-E404-444A-A2D9-FFD2D171F32",
    "srcMsgId": "rwfsdfsfqwe121432gsgsfgd71",
    "Mainsys_srcMsgId": "",
    "baseSystemId": "abc1",
    "routeInstanceId": "abc2",
    "routepointID": "abc3",
    "eventTime": "1985-04-12T23:20:50Z",
    "messageType": "abc4",
    .....................................
Message from backup system:
{
  "event": "Sourcetype test please",
  "sourcetype": "testsystem-2",
  "host": "some-host-123",
  "fields":
  {
    "messageId": "ED280816-E404-444A-A2D9-FFD2D171F23",
    "srcMsgId": "rwfsdfsfqwe121432gsgsfgd72",
    "Mainsys_srcMsgId": "rwfsdfsfqwe121432gsgsfgd71",
    "baseSystemId": "abc1",
    "routeInstanceId": "abc2",
    "routepointID": "abc3",
    "eventTime": "1985-04-12T23:20:50Z",
    "messageType": "abc4",
    "GISGMPRequestID": "PS000BA780816-E404-444A-A2D9-FFD2D1712345",
    "GISGMPResponseID": "PS000BA780816-E404-444B-A2D9-FFD2D1712345",
    "resultcode": "abc7",
    "resultdesc": "abc8"
  }
}
When we want to combine in a query only the five messages from one chain, related by "srcMsgId", we make the following request:
index="bl_logging" sourcetype="testsystem-2"
| transaction maxpause=5m srcMsgId Mainsys_srcMsgId messageId
| table _time srcMsgId Mainsys_srcMsgId messageId duration eventcount
| sort srcMsgId _time
| streamstats current=f window=1 values(_time) as prevTime by topic
| eval timeDiff=_time-prevTime
| delta _time as timediff

aws neptune bulk load parallelization

I am trying to insert 624,118,983 records divided into 1000 files; it takes 35 hours for all of them to load into Neptune, which is very slow.
I have configured db.r5.large instances, 2 of them.
I have 1000 files stored in an S3 bucket.
I have one load request pointing to the S3 bucket folder that holds the 1000 files.
When I get the load status, I get the response below:
{
  "status" : "200 OK",
  "payload" : {
    "feedCount" : [
      {
        "LOAD_NOT_STARTED" : 640
      },
      {
        "LOAD_IN_PROGRESS" : 1
      },
      {
        "LOAD_COMPLETED" : 358
      },
      {
        "LOAD_FAILED" : 1
      }
    ],
    "overallStatus" : {
      "fullUri" : "s3://myntriplesfiles/ntriple-folder/",
      "runNumber" : 1,
      "retryNumber" : 0,
      "status" : "LOAD_IN_PROGRESS",
      "totalTimeSpent" : 26870,
      "startTime" : 1639289761,
      "totalRecords" : 224444549,
      "totalDuplicates" : 17295821,
      "parsingErrors" : 1,
      "datatypeMismatchErrors" : 0,
      "insertErrors" : 0
    }
  }
}
What I see here is that LOAD_IN_PROGRESS is always 1; that means Neptune is not trying to load multiple files in parallel.
How do I tell Neptune to load the 1000 files with some parallelism, for example a parallelization factor of 10?
Am I missing any configuration?
This is how I use the bulk load API:
curl -X POST -H 'Content-Type: application/json' https://neptune-hostname:8182/loader -d '
{
  "source" : "s3://myntriplesfiles/ntriple-folder/",
  "format" : "nquads",
  "iamRoleArn" : "my aws arn values goes here",
  "region" : "us-east-2",
  "failOnError" : "FALSE",
  "parallelism" : "HIGH",
  "updateSingleCardinalityProperties" : "FALSE",
  "queueRequest" : "FALSE"
}'
Please advise.
The Amazon Neptune bulk loader does not load multiple files in parallel, but it does divide up the contents of each file among the available worker threads on the writer instance (limited by how you have the parallelism property set on the load command). If you have no other writes pending during the load period, you can set that field to OVERSUBSCRIBE, which will use all available worker threads. Secondly, larger files are better than smaller files, as they give the worker threads more work they can do in parallel. Thirdly, using a larger writer instance just for the duration of the load provides many more worker threads that can take on load tasks. The number of worker threads available in an instance is approximately twice the number of vCPUs the instance has (for example, a db.r5.large with 2 vCPUs has roughly 4, while a db.r5.12xlarge with 48 vCPUs has roughly 96). Quite often, people will use something like a db.r5.12xlarge just for the bulk load (for large loads) and then scale back to something much smaller for regular query workloads.
In addition to the above, gzip-compressing the files helps with faster network reads; Neptune understands gzip-compressed files by default.
Also, queueRequest: TRUE can be set to achieve better results. Neptune can queue up to 64 requests, so instead of sending only one request you can trigger multiple files in parallel, as in the sketch below. You can even configure dependencies among the files if you have to. Ref: https://docs.aws.amazon.com/neptune/latest/userguide/load-api-reference-load.html
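A minimal sketch of that queued approach (Python with requests; the per-file names are hypothetical, and since Neptune queues at most 64 jobs, a real script would submit in batches or throttle):

import requests

LOADER = "https://neptune-hostname:8182/loader"  # endpoint from the question

# Hypothetical file names; in practice you would list the S3 prefix.
files = [f"s3://myntriplesfiles/ntriple-folder/part-{i:04d}.nq" for i in range(64)]

for uri in files:  # one queued load job per file, up to Neptune's limit of 64
    resp = requests.post(LOADER, json={
        "source": uri,
        "format": "nquads",
        "iamRoleArn": "my aws arn values goes here",
        "region": "us-east-2",
        "failOnError": "FALSE",
        "parallelism": "OVERSUBSCRIBE",
        "queueRequest": "TRUE",  # queue the job instead of rejecting it while another load runs
    })
    resp.raise_for_status()
    print(uri, "->", resp.json()["payload"]["loadId"])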
You only need to move to a bigger writer instance in cases where CPU usage is consistently higher than 60%.

How frequently are the Azure Storage Queue metrics updated?

I observed that it took about 6 hours from the time of setting up Diagnostics (the newer offering, still in preview) for the Queue Message Count metric to move from 0 to the actual total number of messages in the queue. The other capacity metrics, Queue Capacity and Queue Count, took about 1 hour to reflect actual values.
Can anyone shed light on how these metrics are updated? It would be good to know how to predict the accuracy of the graphs.
I am concerned because if the latency of these metrics is typically this large, then an alert based on queue metrics could take too long to fire.
Update:
Platform metrics are created by Azure resources and give you visibility into their health and performance. Each type of resource creates a distinct set of metrics without any configuration required. Platform metrics are collected from Azure resources at one-minute frequency unless specified otherwise in the metric's definition.
And 'Queue Message Count' is a platform metric.
So it should update the data every minute.
But it doesn't. And this is not a problem that occurs only in the portal; even if you use the REST API to get the QueueMessageCount, it is still not updated after 1 minute:
https://management.azure.com/subscriptions/xxx-xxx-xxx-xxx-xxx/resourceGroups/0730BowmanWindow/providers/Microsoft.Storage/storageAccounts/0730bowmanwindow/queueServices/default/providers/microsoft.insights/metrics?interval=PT1H&metricnames=QueueMessageCount&aggregation=Average&top=100&orderby=Average&api-version=2018-01-01&metricnamespace=Microsoft.Storage/storageAccounts/queueServices
{
  "cost": 59,
  "timespan": "2021-05-17T08:57:56Z/2021-05-17T09:57:56Z",
  "interval": "PT1H",
  "value": [
    {
      "id": "/subscriptions/xxx-xxx-xxx-xxx-xxx/resourceGroups/0730BowmanWindow/providers/Microsoft.Storage/storageAccounts/0730bowmanwindow/queueServices/default/providers/Microsoft.Insights/metrics/QueueMessageCount",
      "type": "Microsoft.Insights/metrics",
      "name": {
        "value": "QueueMessageCount",
        "localizedValue": "Queue Message Count"
      },
      "displayDescription": "The number of unexpired queue messages in the storage account.",
      "unit": "Count",
      "timeseries": [
        {
          "metadatavalues": [],
          "data": [
            {
              "timeStamp": "2021-05-17T08:57:00Z",
              "average": 1.0
            }
          ]
        }
      ],
      "errorCode": "Success"
    }
  ],
  "namespace": "Microsoft.Storage/storageAccounts/queueServices",
  "resourceregion": "centralus"
}
This may be an issue that needs to be reported to the Azure team. It is so slow that it loses its practicality; I think sending an alert based on this metric is a bad idea (it's too slow).
Maybe you can implement your own logic in code to check the QueueMessageCount.
Just a sample (C#), with a sketch after this list:
1. Get the queues, and from them all of the queue names.
2. Get the properties of each queue, including the number of messages in it.
3. Sum the obtained numbers.
4. Send a custom alert.
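A minimal sketch of those four steps (shown here in Python with the azure-storage-queue SDK rather than C#; the connection string and alert threshold are placeholders):

from azure.storage.queue import QueueServiceClient

service = QueueServiceClient.from_connection_string("<connection-string>")

# Steps 1 and 2: list the queues, then read each queue's approximate message count.
total = 0
for queue in service.list_queues():
    props = service.get_queue_client(queue.name).get_queue_properties()
    total += props.approximate_message_count

# Steps 3 and 4: sum the counts and raise a custom alert on a threshold of your choosing.
if total > 100:  # hypothetical threshold
    print(f"ALERT: {total} messages pending across all queues")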
Original Answer:
At first, after I sent a message to one queue in queue storage, the 'Queue Message Count' also remained stubbornly at zero on my side, but a few hours later it did pick up the 'Queue Message Count'.
I thought it was a bug, but it seems to work well now.

NiFi: Calculate a processor data flow over a period of time (> 5 mins)

I am looking for data flow statistics for the most recent week (bytesIn, bytesOut). Using the NiFi REST API endpoint [GET] /nifi-api/processors/{id}, I got the statistics below for the last five minutes. Is there an existing API endpoint to retrieve data flow statistics for a week?
{
  "id": "1234aa-1234-1f23-1f23-123456ed51f1a",
  "status": {
    "name": "MyConsumeKafkaProcessor",
    "runStatus": "Running",
    "statsLastRefreshed": "14:39:55 EDT",
    "aggregateSnapshot": {
      "type": "ConsumeKafka_0_10",
      "runStatus": "Running",
      "executionNode": "ALL",
      "bytesRead": 0,
      "bytesWritten": 23948016,
      "read": "0 bytes",
      "written": "22.84 MB",
      "flowFilesIn": 0,
      "bytesIn": 0,
      "input": "0 (0 bytes)",
      "flowFilesOut": 2188,
      "bytesOut": 23948016,
      "output": "2,188 (22.84 MB)",
      "taskCount": 1179,
      "tasksDurationNanos": 15974094510,
      "tasks": "1,179",
      "tasksDuration": "00:00:15.974",
      "activeThreadCount": 0,
      "terminatedThreadCount": 0
    }
  }
}
I had a similar issue, trying to see what comes in from Kafka over a period of time.
I used counters with a frequency of 1 minute.
So it is like this:
1 - ConsumeKafka
2 - Capture the record count coming out of Kafka
3 - UpdateCounter IN (on a cloned success connection) - the delta will be the record count of the Kafka payload
4 - Do your stuff with the data (enrich/change, etc.)
5 - Capture the record count just before persisting the data
6 - Update the OUT counter
7 - Persist the data (DB/S3/etc.)
I then have a flow that interrogates https://${hostname(true)}:8443/nifi-api/counters every 60 seconds (see the sketch below).
I land this data in a monitoring DB repo.
I use this to measure data delivery IN/OUT of NiFi and look for dropouts, throughput, etc.
I do the same with my source data, where in the Kafka case I capture the number of messages generated every minute.
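A minimal sketch of that polling loop outside NiFi (Python with requests; the host, the IN/OUT counter names, and the way rows are landed are assumptions):

import time
import requests

COUNTERS_URL = "https://nifi-host:8443/nifi-api/counters"  # hypothetical host

def land_row(ts, name, value):
    print(ts, name, value)  # placeholder for writing into the monitoring DB repo

while True:
    entity = requests.get(COUNTERS_URL, verify=False).json()
    for counter in entity["counters"]["aggregateSnapshot"]["counters"]:
        if counter["name"] in ("IN", "OUT"):  # the counters updated by the flow above
            land_row(time.time(), counter["name"], counter["valueCount"])
    time.sleep(60)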

Big JSON record to BigQuery is not showing up

I wanted to try to upload big JSON record objects to BigQuery.
I am talking about JSON records of 1.5 MB each, with a complex nested schema up to the 7th degree.
For simplicity, I started by loading a file with a single record on one line.
At first I tried to have BigQuery autodetect my schema, but that resulted in a table that was not responsive and that I could not query, although it said it had at least one record.
Then, assuming that my schema could be too hard for the loader to reverse-engineer, I tried to write the schema myself and then tried to load my file with the single record.
At first I got a simple error with just "invalid":
bq load --source_format=NEWLINE_DELIMITED_JSON invq_data.test_table my_single_json_record_file
Upload complete.
Waiting on bqjob_r5a4ce64904bbba9d_0000015e14aba735_1 ... (3s) Current status: DONE
BigQuery error in load operation: Error processing job 'invq-test:bqjob_r5a4ce64904bbba9d_0000015e14aba735_1': JSON table encountered too many errors, giving up. Rows: 1; errors: 1.
Checking the job error just gave me the following:
"status": {
"errorResult": {
"location": "file-00000000",
"message": "JSON table encountered too many errors, giving up. Rows: 1; errors: 1.",
"reason": "invalid"
},
"errors": [
{
"location": "file-00000000",
"message": "JSON table encountered too many errors, giving up. Rows: 1; errors: 1.",
"reason": "invalid"
}
],
"state": "DONE"
},
Then, after a couple more attempts at creating new tables, it actually started to succeed on the command line, without reporting errors:
bq load --max_bad_records=1 --source_format=NEWLINE_DELIMITED_JSON invq_data.test_table_4 my_single_json_record_file
Upload complete.
Waiting on bqjob_r368f1dff98600a4b_0000015e14b43dd5_1 ... (16s) Current status: DONE
with no error on the status checker...
"statistics": {
"creationTime": "1503585955356",
"endTime": "1503585973623",
"load": {
"badRecords": "0",
"inputFileBytes": "1494390",
"inputFiles": "1",
"outputBytes": "0",
"outputRows": "0"
},
"startTime": "1503585955723"
},
"status": {
"state": "DONE"
},
But no actual records were added to my table.
I tried to do the same from the web UI, but the result is the same: the job completes green, yet no actual record is added.
Is there something else I can do to check where the data is sinking to? Maybe some more logs?
I can imagine that maybe I am on the edge of the 2 MB JSON row size limit, but if so, shouldn't this be reported as an error?
Thanks in advance for the help!!
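(For what it's worth, one way to surface per-row load errors instead of a bare job status is the Python BigQuery client; a minimal sketch, reusing the project, dataset, and file names from the commands above:)

from google.cloud import bigquery

client = bigquery.Client(project="invq-test")
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    max_bad_records=0,  # fail loudly instead of silently dropping the only row
)

with open("my_single_json_record_file", "rb") as f:
    job = client.load_table_from_file(f, "invq_data.test_table", job_config=job_config)

try:
    job.result()  # wait for the load; raises if the job failed
except Exception:
    for err in job.errors or []:
        print(err)  # per-row locations and reasons
print("rows loaded:", job.output_rows)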
EDIT:
It turned out the complexity of my schema was a bit of a devil here.
My JSON files were valid, but my complex schema had several errors.
It turned out that I had to simplify the schema anyway, because I got a new batch of data where single JSON instances were more than 30 MB, and I had to restructure the data in a more relational way while making smaller rows to insert into the database.
Funny enough, once the schema was scattered across multiple entities (ergo, simplified), the actual errors/inconsistencies of the schema started to show up in the returned errors, and it was easier to fix them. (Mostly it was new nested undocumented data which I was not aware of anyway... but still my bad.)
The lesson here is that when a table schema is too long (I didn't experiment with how long precisely is too long), BigQuery just hides itself behind reporting too many errors to show.
But that is the point where you should consider simplifying the schema (and structure) of your data.