Ingest data into Warp 10 - performance tip

We're looking for the best way to ingest data into Warp 10. We have a microservices architecture that mainly uses Kafka.
Two solutions:
Use the Ingress endpoint as described here: https://www.warp10.io/content/03_Documentation/03_Interacting_with_Warp_10/03_Ingesting_data/01_Ingress (this is the solution we currently use)
Use the Warp 10 Kafka plugin as described here: https://blog.senx.io/introducing-the-warp-10-kafka-plugin/
As mentioned, we currently use the Ingress solution: we aggregate data for x seconds, then call the Ingress API to send it in batches (instead of calling the API every time we need to insert something).
For a few days we have been experimenting with the Kafka plugin. We successfully set up the plugin and wrote an .mc2 macro responsible for consuming data from a given topic and inserting it into Warp 10 using UPDATE.
Questions:
Using the Kafka plugin, would it be better to apply the same buffering mechanism as the one we use with the Ingress endpoint? Or does the Warp 10 Kafka plugin have a specific implementation that makes it efficient to consume messages one by one and call UPDATE for each?
Today, both solutions work; we're trying to identify their differences to get the best ingestion performance, ideally without any buffering mechanism, because we want to stay as close to real time as possible.
MC2 file:
{
  'topics' [ 'our_topic_name' ] // List of Kafka topics to subscribe to
  'parallelism' 1 // Number of threads to start for processing the incoming messages. Each thread will handle a certain number of partitions.
  'config' { // Map of Kafka consumer parameters
    'bootstrap.servers' 'kafka-headless:9092'
    'group.id' 'senx-consumer'
    'enable.auto.commit' 'true'
  }
  'macro' <%
    // Macro executed each time a Kafka record is consumed
    /*
    // Received record format:
    {
      'timestamp' 123 // The record timestamp
      'timestampType' 'type' // The type of timestamp, one of 'NoTimestampType', 'CreateTime', 'LogAppendTime'
      'topic' 'topic_name' // Name of the topic which received the message
      'offset' 123 // Offset of the message in 'topic'
      'partition' 123 // Id of the partition which received the message
      'key' ... // Byte array of the message key
      'value' ... // Byte array of the message value
      'headers' { } // Map of message headers
    }
    */
    "recordArray" STORE
    "preprod.write" "token" STORE

    // The macro can be called on timeout with an empty entry map
    $recordArray SIZE 0 !=
    <%
      $recordArray 'value' GET // The Kafka record value is retrieved in bytes
      'UTF-8' BYTES-> // Convert bytes to string (Warp 10 ingress format)
      JSON->
      "value" STORE

      "Records received through Kafka" LOGMSG
      $value LOGMSG

      $value
      <%
        DROP
        PARSE
        // PARSE outputs a list of GTS, containing only one GTS here
        0 GET
        // A GTS rename is required to use the UPDATE function
        "gts" STORE
        $gts $gts NAME RENAME
      %>
      LMAP

      // Store the GTS in Warp 10
      $token
      UPDATE
    %>
    IFT
  %> // end macro
  'timeout' 10000 // Polling timeout (in ms); if no message is received within this delay, the macro is called with an empty map as input
}

If you want to cache data in Warp 10 to avoid lots of UPDATE calls per second, you can use SHM (SHared Memory). This is a built-in extension you need to activate.
Once activated, use it with SHMSTORE and SHMLOAD to keep objects in RAM between two WarpScript executions.
In your example, you can push all the incoming GTS into a list, or a list of lists of GTS, using +! to append elements to an existing list.
The MERGE of all the GTS in the cache (by name + labels) and the UPDATE into the database can then be done in a runner (don't forget to use a MUTEX).
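The append-under-a-mutex / flush-in-a-runner pattern described above can be sketched in a few lines. Here is a minimal Python analog, where a plain list stands in for the SHM cache and a lock plays the role of the WarpScript MUTEX (the function names are illustrative, not Warp 10 APIs):

```python
import threading

# In-memory buffer standing in for the SHM cache; the lock plays the
# role of the WarpScript MUTEX so the consumer and the runner never
# touch the list at the same time.
buffer_lock = threading.Lock()
buffer = []

def on_record(gts):
    """Called for each consumed record: append instead of UPDATE-ing immediately."""
    with buffer_lock:
        buffer.append(gts)

def flush():
    """Called periodically (the 'runner'): drain the buffer and return one batch.
    In the real setup this is where you would MERGE by name + labels and
    issue a single UPDATE."""
    with buffer_lock:
        batch = list(buffer)
        buffer.clear()
    return batch
```

The lock is only held for the append and the drain, so the consumer is never blocked for the duration of the flush itself.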
Don't forget the total operation cost:
The ingress format can be optimized for ingestion speed, if you do not repeat the classname and labels, and if you group lines per GTS. See here.
PARSE deserializes data from the Warp 10 ingress format.
UPDATE serializes data to the optimized Warp 10 ingress format (and pushes it to the update endpoint).
The update endpoint deserializes it again.
These deserialize/serialize/deserialize operations make sense if your input data is far from the optimal ingress format. They also make sense if you want to RANGECOMPACT your data to save disk space, or do any other preprocessing.
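As a concrete illustration of "not repeating classname and labels": the Warp 10 GTS input format lets continuation lines start with '=' to reuse the classname and labels of the previous line. A small Python builder, sketching that layout (the classname, labels and values below are made up for the example):

```python
def build_payload(classname, labels, points):
    """Build a Warp 10 ingress payload for one GTS: the classname and
    labels appear once, and subsequent points use the '=' continuation
    prefix of the GTS input format. `points` is a list of
    (timestamp, value) tuples."""
    label_str = ",".join(f"{k}={v}" for k, v in sorted(labels.items()))
    lines = []
    for i, (ts, value) in enumerate(points):
        if i == 0:
            # Full line: TS// classname{labels} value
            lines.append(f"{ts}// {classname}{{{label_str}}} {value}")
        else:
            # Continuation line: reuse previous classname and labels
            lines.append(f"={ts}// {value}")
    return "\n".join(lines)
```

Grouping all points of a GTS together is what makes the '=' continuation usable, which is why the answer suggests gathering lines per GTS before pushing.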

Related

Kafka Connect S3 Sink add MetaData

I am trying to add metadata to the output from Kafka into the S3 bucket.
Currently, the output is just the values of the messages from the Kafka topic.
I want it wrapped with the following (metadata): topic, timestamp, partition, offset, key, value.
example:
{
  "topic": "some-topic",
  "timestamp": "some-timestamp",
  "partition": "some-partition",
  "offset": "some-offset",
  "key": "some-key",
  "value": "the-orig-value"
}
Note: when I fetch it through a consumer, it fetches all the metadata, as I wished.
my connector configuration:
{
  "name" : "test_s3_sink",
  "config" : {
    "connector.class" : "io.confluent.connect.s3.S3SinkConnector",
    "errors.log.enable" : "true",
    "errors.log.include.messages" : "true",
    "flush.size" : "10000",
    "format.class" : "io.confluent.connect.s3.format.json.JsonFormat",
    "name" : "test_s3_sink",
    "rotate.interval.ms" : "60000",
    "s3.bucket.name" : "some-bucket-name",
    "storage.class" : "io.confluent.connect.s3.storage.S3Storage",
    "topics" : "some.topic",
    "topics.dir" : "some-dir"
  }
}
Thanks.
Currently, the output is just the values from the messages from the kafka topic.
Correct, this is the documented behavior. There is a setting for including the key data that you're missing, if you wanted that as well, but there are no settings to get the rest of the data.
For the record timestamp, you could edit your producer code to simply add that as part of your records (and everything else, for that matter, if you're able to query the next offset of the topic every time you produce).
For topic and partition, those are part of the S3 file path, so whatever you're reading the files with should be able to parse out that information; the starting offset is also part of the filename, and adding the line number within the file gives you the (approximate) offset of the record.
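To make the "parse it out of the path" suggestion concrete: the Confluent S3 sink's default object key layout is `<topics.dir>/<topic>/partition=<p>/<topic>+<p>+<startOffset>.<ext>`. A small Python parser, sketched against that default layout (the example key below is made up):

```python
import re

def locate_record(s3_key, line_number):
    """Recover (topic, partition, approximate offset) from a Confluent
    S3 sink object key of the default form
    topics-dir/topic/partition=P/topic+P+STARTOFFSET.ext.
    The record's offset is approximated as start offset + line number."""
    m = re.search(
        r"/(?P<topic>[^/]+)/partition=(?P<p>\d+)/[^/]+\+\d+\+(?P<start>\d+)\.\w+$",
        s3_key,
    )
    if not m:
        raise ValueError(f"unexpected key layout: {s3_key}")
    return m.group("topic"), int(m.group("p")), int(m.group("start")) + line_number
```

This only holds for the default partitioner and file naming; custom `partitioner.class` settings change the layout.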
Or you can use a Connect transform such as this archive one, which relocates the Kafka record metadata (except offset and partition) into the Connect Struct value, so that the sink connector then writes all of it to the files:
https://github.com/jcustenborder/kafka-connect-transform-archive
Either way, ConnectRecord has no offset field (a SinkRecord does), but I think that's too late in the API for transforms to access it.

attributes.headers getting lost after a http Request call in Mulesoft?

I am getting some attributes in an API, but they all get lost after an HTTP Request connector in Mule 4.
Why is this happening?
Look in the connector's configuration properties, under the Advanced tab of the operation (in this case the HTTP connector's Request operation), and you'll find a target variable and a target value. If you fill in the target with a name, the operation does an enrichment, which avoids overwriting the Mule message. If you leave it blank (the default), it saves the new message (attributes and payload) over the top of the existing one, which is what you're seeing now. This mirrors the old Mule 3 behaviour, but sometimes you want the operation to leave what you already have alone.
The target value lets you pick exactly what gets saved. If you want just the payload, put that in. If you want both payload and attributes, I'd use "message", as that means both get saved in the variable. Of course you may not want that much saved, so feel free to put in whatever DataWeave expression you like; you could even build something with bits from anywhere, like:
{
  statusCode: attributes.statusCode,
  headers: attributes.headers,
  payload: payload
}
A connector operation may replace the attributes with those of the operation. If you need to preserve the previous attributes, you need to save them to a variable.
This is the default behaviour of Mule: whenever a request crosses a transport barrier, it loses the existing attributes. You need to preserve the attributes in a variable before the HTTP Request.

Trying to pass binary files through Logstash

Some process is producing binary files into my Kafka topic (from Java, the payload comes as a byte array).
I'm trying to consume from Kafka with Logstash and upload the files to S3.
My pipeline:
input {
  kafka {
    bootstrap_servers => "my-broker:9092"
    topic => "my-topic"
    partition_assignment_strategy => "org.apache.kafka.clients.consumer.StickyAssignor"
    value_deserializer_class => "org.apache.kafka.common.serialization.ByteArrayDeserializer"
  }
}

filter {
  mutate {
    remove_field => ["#timestamp", "host"]
  }
}

output {
  s3 {
    region => "eu-west-1"
    bucket => "my_bucket"
    time_file => 1
    prefix => "files/"
    rotation_strategy => "time"
  }
}
As you can see, I used a different deserializer class. However, it seems that Logstash by default uses a codec that converts the byte array to a string. My goal is to upload the file to S3 as-is. Is there any known codec that does nothing to the input data and uploads it unchanged?
Right now the files are uploaded to S3, but I can't read or open them. The binary content was corrupted by Logstash somehow. For example, I tried sending a gzip that contains multiple files, and I can't open it afterwards in S3.
The warning that I get on Logstash:
0-06-02T10:49:29,149][WARN ][logstash.codecs.plain ][my_pipeline] Received an event that has a different character encoding than you configured. {:text=>"7z\\xBC\\xAF'\\u001C\\u0000\\u0002\\xA6j<........more binary data", :expected_charset=>"UTF-8"}
I'm not sure that Logstash is the best fit for passing binary data, and in the end I implemented a Java consumer, but the following solution worked for me with Logstash:
The data sent to Kafka should be serialized to binary. For example, I used Filebeat to send the binary data; in the Kafka output module there is a parameter called "value_serializer", which should be set to "org.apache.kafka.common.serialization.ByteArraySerializer".
In your Logstash settings (kafka input), define the value_deserializer_class as "org.apache.kafka.common.serialization.ByteArrayDeserializer", just as in the post.
Your output in Logstash can be any resource that can receive binary data.
Be aware that the output will get binary data and you will need to deserialize it.
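The corruption described in the question is exactly what happens when raw bytes are pushed through a text codec: decoding as UTF-8 (with replacement characters for invalid sequences) and re-encoding does not round-trip. A minimal Python illustration, with byte values mimicking the 7z signature from the warning above:

```python
# Raw binary data, starting with the 7z signature bytes seen in the
# Logstash warning (0xBC and 0xAF are not valid UTF-8 on their own).
raw = b"7z\xbc\xaf\x27\x1c\x00\x02"

# What a plain text codec effectively does: decode the bytes to a
# string, replacing invalid UTF-8 sequences, then encode back to bytes.
mangled = raw.decode("utf-8", errors="replace").encode("utf-8")

# The invalid bytes have been replaced by U+FFFD, so the original
# binary content is unrecoverable.
assert mangled != raw
```

This is why a pass-through requires keeping the value as a byte array end to end, which the plain codec does not do.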
I don't think you really understand what Logstash is for. As its name (log-stash) suggests, it is for streaming ASCII-type files, using an EOL delimiter to differentiate between log events.
I did manage to find community-developed Kafka beats for reading data from Kafka topics; there are 2 options:
kafkabeat - reads data from Kafka topics.
kafkabeat2 - reads data (JSON or plain) from Kafka topics.
I didn't test those myself, but using the S3 output option with them might do the trick. If the S3 option is not supported yet, you can develop it yourself and contribute it to the open source so everyone can enjoy it :-)

Ignite and Kafka Integration

I am trying the Ignite and Kafka integration to bring Kafka messages into an Ignite cache.
My message key is a random string (to work with Ignite, the Kafka message key can't be null), and the value is a JSON string representation of a Person (a Java class).
When Ignite receives such a message, it looks like Ignite will use the message's key (the random string in my case) as the cache key.
Is it possible to change the cache key to the person's id, so that I can put the Person into the cache?
It looks like streamer.receiver(new StreamReceiver) could work:
streamer.receiver(new StreamReceiver<String, String>() {
    public void receive(IgniteCache<String, String> cache, Collection<Map.Entry<String, String>> entries) throws IgniteException {
        for (Map.Entry<String, String> entry : entries) {
            Person p = fromJson(entry.getValue());
            // Ignore the message key and use the person id as the cache key
            cache.put(p.getId(), p);
        }
    }
});
Is this the recommended way? Also, I am not sure whether calling cache.put in a StreamReceiver is correct, since it is only meant as a pre-processing step before writing to the cache.
The data streamer will map all your keys to cache affinity nodes, create batches of entries, and send the batches to the affinity nodes. After that, the StreamReceiver will receive your entries, get the Person's ID and invoke cache.put(K, V). Putting the entry leads to mapping your key to the corresponding cache affinity node and sending an update request to that node.
Everything looks good, but the result of mapping your random key from Kafka and the result of mapping the Person's ID will be different (most likely different nodes). As a result, you will get poor performance due to redundant network hops.
Unfortunately, the current KafkaStreamer implementation doesn't support stream tuple extractors (see e.g. the StreamSingleTupleExtractor class). But you can easily create your own Kafka streamer implementation using the existing one as an example.
Also, you can try using KafkaStreamer's keyDecoder and valDecoder to extract the Person's ID from the Kafka message. I'm not sure, but it may help.
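Another way to avoid the redundant hop entirely, not mentioned above but following the same reasoning, is to set the Kafka message key to the Person's ID on the producer side, so the key the streamer maps already points at the right affinity node. A hedged sketch of that key-extraction step in Python (the 'id' field name is an assumption about the Person JSON):

```python
import json

def kafka_key_for(person_json):
    """Derive the Kafka message key from the Person's id field, so the
    record's key and the eventual cache key map to the same affinity
    node. Assumes the Person JSON carries an 'id' field."""
    return str(json.loads(person_json)["id"])
```

With the key fixed at produce time, the receiver-side cache.put no longer re-routes each entry to a different node.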

How to write to different hbase tables in apache flume

I have configured Apache Flume to receive messages (JSON) on an HTTP source. My sinks are MongoDB and HBase.
How can I write a message to different collections and tables according to one of its fields?
For example: let's assume we have T_1 and T_2, and an incoming message that should be saved in T_1. How can I handle such messages and decide where they should be saved?
Try using the Multiplexing Channel Selector. The default one (Replicating Channel Selector) copies the Flume event produced by the source to all of its configured channels. The multiplexing one, however, is able to put the event into a specific channel depending on the value of a header within the Flume event.
In order to create such a header according to your application logic, you will need to create a custom handler for the HTTPSource. This can easily be done by implementing the HTTPSourceHandler interface of the API.
You can use a regex interceptor for tagging the message type, plus multiplexing for sending it to the right destination.
For example, based on the message "TEST":
Regex for a string / field:
agent.sources.s1.interceptors.i1.type=regex_extractor
agent.sources.s1.interceptors.i1.regex=(TEST1)
Assign the interceptor to serializer SE1:
agent.sources.s1.interceptors.i1.serializers=SE1
agent.sources.s1.interceptors.i1.serializers.SE1.name=Test
Send to the required channel; channels (c1, c2) can be mapped to different sinks:
agent.sources.s1.selector.type=multiplexing
agent.sources.s1.selector.header=Test
agent.sources.s1.selector.mapping.Test=c1
All events matching the test regex will go to channel c1; others will be defaulted to c2:
agent.sources.s1.selector.default=c2