Trying to pass binary files through Logstash - amazon-s3

Some process is producing binary files into my Kafka topic (from Java they arrive as a byte array).
I'm trying to consume from Kafka with Logstash and upload the files to S3.
My pipeline:
input {
  kafka {
    bootstrap_servers => "my-broker:9092"
    topic => "my-topic"
    partition_assignment_strategy => "org.apache.kafka.clients.consumer.StickyAssignor"
    value_deserializer_class => "org.apache.kafka.common.serialization.ByteArrayDeserializer"
  }
}
filter {
  mutate {
    remove_field => ["#timestamp", "host"]
  }
}
output {
  s3 {
    region => "eu-west-1"
    bucket => "my_bucket"
    time_file => 1
    prefix => "files/"
    rotation_strategy => "time"
  }
}
As you can see, I used a different deserializer class. However, it seems that Logstash by default applies a codec that converts the byte array to a string. My goal is to upload the file to S3 as it is. Is there a known codec that leaves the input data untouched and uploads it exactly as received?
Right now the files are uploaded to S3, but I can't read or open them. The binary content was corrupted by Logstash somehow. For example, I tried sending a gzip that contains multiple files and I can't open it afterwards in S3.
The warning that I get on Logstash:
0-06-02T10:49:29,149][WARN ][logstash.codecs.plain ][my_pipeline] Received an event that has a different character encoding than you configured. {:text=>"7z\\xBC\\xAF'\\u001C\\u0000\\u0002\\xA6j<........more binary data", :expected_charset=>"UTF-8"}

I'm not sure that Logstash is the best fit for passing binary data, and in the end I implemented a Java consumer, but the following solution worked for me with Logstash:
The data sent to Kafka should be serialized as binary data. For example, I used Filebeat to send the binary data; in its Kafka output module there is a parameter called "value_serializer", which should be set to "org.apache.kafka.common.serialization.ByteArraySerializer".
In your Logstash settings (kafka input), define value_deserializer_class as "org.apache.kafka.common.serialization.ByteArrayDeserializer", just as I did in the post.
Your output in Logstash can be any resource that can accept binary data.
Be aware that the output will receive binary data and you will need to deserialize it.
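For completeness, here is a rough sketch of the consumer approach in Python (kafka-python + boto3) rather than Java; the broker, topic and bucket names are taken from the question's config and the object key scheme is made up, but the point is that the raw bytes are written to S3 without any decoding step:

# Sketch: consume raw byte-array messages from Kafka and upload them to S3 unchanged.
# Broker, topic and bucket names are placeholders from the question; the key scheme is made up.
import uuid

import boto3
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "my-topic",
    bootstrap_servers="my-broker:9092",
    # No value_deserializer: messages stay as raw bytes.
    enable_auto_commit=True,
    group_id="binary-to-s3",
)

s3 = boto3.client("s3", region_name="eu-west-1")

for record in consumer:
    # record.value is the untouched byte payload from the producer.
    key = f"files/{uuid.uuid4()}"
    s3.put_object(Bucket="my_bucket", Key=key, Body=record.value)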

I don't think you really understand what Logstash is for.
As its name (log-stash) suggests, it is meant for streaming ASCII/text log files, using an EOL delimiter to distinguish between different log events.
I did manage to find a community-developed kafkaBeat for reading data from Kafka topics; there are 2 options:
kafkabeat - Reads data from Kafka topics.
kafkabeat2 - Reads data (json or plain) from Kafka topics.
I didn't test those myself, but using the S3 output option with them might do the trick. If the S3 option is not yet supported, you can develop it yourself and contribute it to the open-source project so everyone can enjoy it :-)

Related

Ingest data into warp10 - Performance tip

We're looking for the best way to ingest data into Warp 10. We are on a microservices architecture that mainly uses Kafka.
Two solutions:
Use the Ingress endpoint as defined here: https://www.warp10.io/content/03_Documentation/03_Interacting_with_Warp_10/03_Ingesting_data/01_Ingress (this is the solution we use for now)
Use the Warp 10 Kafka plugin as defined here: https://blog.senx.io/introducing-the-warp-10-kafka-plugin/
As described above, we currently use the Ingress solution, aggregating data for x seconds and then calling the Ingress API to send the data per packet (instead of calling the API each time we need to insert something).
For a few days now, we have been experimenting with the Kafka plugin. We successfully set up the plugin and created an .mc2 file responsible for consuming data from a given topic and then inserting it into Warp 10 using UPDATE.
Questions:
Using the Kafka plugin, would it be better to apply the same buffering mechanism as the one we apply when using the Ingress endpoint? Or is there a specific implementation in the Warp 10 Kafka plugin that allows consuming the topic message by message and calling the UPDATE function for each one?
Today, as both solutions are working, we're trying to identify the differences in order to get the best ingestion performance, and if possible without any buffering mechanism, because we are trying to stay as close to real-time as possible.
MC2 file:
{
  'topics' [ 'our_topic_name' ] // List of Kafka topics to subscribe to
  'parallelism' 1 // Number of threads to start for processing the incoming messages. Each thread will handle a certain number of partitions.
  'config' { // Map of Kafka consumer parameters
    'bootstrap.servers' 'kafka-headless:9092'
    'group.id' 'senx-consumer'
    'enable.auto.commit' 'true'
  }
  'macro' <%
    // macro executed each time a kafka record is consumed
    /*
    // received record format :
    {
      'timestamp' 123 // The record timestamp
      'timestampType' 'type' // The type of timestamp, can be one of 'NoTimestampType', 'CreateTime', 'LogAppendTime'
      'topic' 'topic_name' // Name of the topic which received the message
      'offset' 123 // Offset of the message in 'topic'
      'partition' 123 // Id of the partition which received the message
      'key' ... // Byte array of the message key
      'value' ... // Byte array of the message value
      'headers' { } // Map of message headers
    }
    */
    "recordArray" STORE
    "preprod.write" "token" STORE
    // macro can be called on timeout with an empty entry map
    $recordArray SIZE 0 !=
    <%
      $recordArray 'value' GET // kafka record value is retrieved in bytes
      'UTF-8' BYTES-> // convert bytes to string (WARP10 INGRESS format)
      JSON->
      "value" STORE
      "Records received through Kafka" LOGMSG
      $value LOGMSG
      $value
      <%
        DROP
        PARSE
        // PARSE outputs a gtsList, including only one gts
        0 GET
        // GTS rename is required to use UPDATE function
        "gts" STORE
        $gts $gts NAME RENAME
      %>
      LMAP
      // Store GTS in Warp10
      $token
      UPDATE
    %>
    IFT
  %> // end macro
  'timeout' 10000 // Polling timeout (in ms), if no message is received within this delay, the macro will be called with an empty map as input
}
If you want to cache something in Warp 10 to avoid lots of UPDATE per second, you can use SHM (SHared Memory). This is a built-in extension you need to activate.
Once activated, use it with SHMSTORE and SHMLOAD to keep objects in RAM between two WarpScript executions.
In your example, you can push all the incoming GTS into a list, or a list of lists of GTS, using +! to append elements to an existing list.
The MERGE of all the GTS in the cache (by name + labels) and the UPDATE into the database can then be done in a runner (don't forget to use a MUTEX).
Don't forget the total operation cost:
The ingress format can be optimized for ingestion speed if you do not repeat the classname and labels, and if you gather lines per GTS. See here.
PARSE deserializes data from the Warp 10 ingress format.
UPDATE serializes data to the Warp 10 optimized ingress format (and pushes it to the update endpoint).
The update endpoint deserializes it again.
It makes sense to do these deserialize/serialize/deserialize operations if your input data is far from the optimal ingress format. It also makes sense if you want to RANGECOMPACT your data to save disk space, or do any preprocessing.
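To make the "optimized ingress format" point concrete, here is a rough Python sketch (not from the original discussion; the endpoint host, token and series name are placeholders) that buffers points and pushes them in one packet to the Warp 10 update endpoint, omitting the repeated classname/labels on continuation lines:

# Rough sketch (placeholder host, token and series name): buffer points for a
# few seconds, then push a single packet to the Warp 10 ingress/update endpoint.
# Lines starting with '=' reuse the previous classname/labels, i.e. the
# "do not repeat classname and labels" optimization mentioned above.
import requests

WARP10_UPDATE_URL = "https://warp10.example.com/api/v0/update"  # placeholder
WRITE_TOKEN = "preprod-write-token"                             # placeholder

def push_batch(points):
    # points: list of (timestamp_us, value) tuples for a single series
    lines = ["{}// sensor.temperature{{room=kitchen}} {}".format(*points[0])]
    for ts, val in points[1:]:
        lines.append("={}// {}".format(ts, val))  # classname/labels omitted
    resp = requests.post(
        WARP10_UPDATE_URL,
        data="\n".join(lines).encode("utf-8"),
        headers={"X-Warp10-Token": WRITE_TOKEN},
    )
    resp.raise_for_status()

push_batch([(1718000000000000, 21.5), (1718000001000000, 21.7)])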

How to push data from AWS IoT MQTT broker to a random file in S3 bucket

I have created a rule to forward all messages published to any topic (e.g. foo/bar) of my AWS IoT Core managed MQTT broker to a nested folder in an S3 bucket. For that, I am using the key section. I can send data to a nested folder like a/b/c. The problem is that it takes c as the destination file, and this file gets updated with new data as it arrives. Is there any configuration I can do to put the data in the bucket in a new file (with any random name) as it arrives (similar to what happens when we forward data from Firehose to S3)?
You can change your key to use the newuuid() function. e.g.
a/b/${newuuid()}
This will write the data to a file in the a/b folder with a filename that is a generated UUID.
The key in AWS IoT S3 actions allows you to use the IoT SQL Reference Functions to form the folder and filename.
The documentation for the key states:
The path to the file where the data is written. For example, if the value of this argument is "${topic()}/${timestamp()}", the topic the message was sent to is "this/is/my/topic,", and the current timestamp is 1460685389, the data is written to a file called "1460685389" in the "this/is/my/topic" folder on Amazon S3.
If you don't want to use a timestamp then you could form the name of the file using other functions such as a random float (rand()), calculate a hash (md5()), a UUID (newuuid()) or the trace id of the message (traceid()).
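For illustration, the same key expression can be set when creating the topic rule programmatically; a sketch with boto3 follows, in which the rule name, SQL statement, bucket and role ARN are made-up placeholders:

# Sketch: create an IoT topic rule whose S3 action writes each message to a new
# object named with newuuid(). Rule name, SQL, bucket and role ARN are placeholders.
import boto3

iot = boto3.client("iot", region_name="eu-west-1")

iot.create_topic_rule(
    ruleName="forward_foo_bar_to_s3",
    topicRulePayload={
        "sql": "SELECT * FROM 'foo/bar'",
        "actions": [
            {
                "s3": {
                    "roleArn": "arn:aws:iam::123456789012:role/iot-s3-role",
                    "bucketName": "my-iot-bucket",
                    # ${topic()} keeps the nested folder, ${newuuid()} gives a fresh file name
                    "key": "${topic()}/${newuuid()}",
                }
            }
        ],
    },
)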

Vb.Net Gmail API Send Message with attachment > 5Mb

In the Gmail API documentation I read that I have to make an HTTP request to "/upload/gmail/v1/users/userId/messages/send" when sending a message larger than 5 MB, but I can't find any example of how to implement this using the .NET client library.
All the examples on the site refer to the "messages.Send" function that takes a raw message and the user id as parameters, but I saw that there's another overload that also takes a stream of the content to upload and its content type.
The problem is that I have no clue how to call it correctly.
Has anyone successfully achieved this?
Thx for your answer
Simone
Simone, it means that you are using simple upload:
uploadType=media. For quick transfer of smaller files, for example, 5 MB or less.
You must use multipart upload or resumable upload instead (https://developers.google.com/gmail/api/guides/uploads).
You can send a POST request with a payload (see CURLOPT_POSTFIELDS if you use cURL) to https://www.googleapis.com/gmail/v1/users/me/messages/send?access_token=your_access_token&uploadType=multipart. The payload must contain a JSON-encoded message with a structure like this, for example:
$message = [
    'message' => [
        'raw' => str_replace(['+', '/'], ['-', '_'], base64_encode($mimeString)),
        'threadId' => $yourThreadId
    ]
];
The variable $mimeString must contain a correct MIME string.
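If you would rather let a client library drive the /upload endpoint, here is a hedged sketch in Python (google-api-python-client) instead of .NET; it assumes 'service' is an already-authorized Gmail client and that 'message.eml' holds the complete RFC 822 message, and it relies on the generated client's media upload support for messages.send:

# Sketch (Python client instead of .NET): send a large MIME message via the
# /upload endpoint using resumable upload. Assumes 'service' is an authorized
# Gmail API client and that 'message.eml' contains the full RFC 822 message.
from googleapiclient.discovery import build
from googleapiclient.http import MediaFileUpload

# service = build("gmail", "v1", credentials=creds)  # credentials obtained elsewhere

media = MediaFileUpload("message.eml", mimetype="message/rfc822", resumable=True)
sent = (
    service.users()
    .messages()
    .send(userId="me", body={}, media_body=media)
    .execute()
)
print(sent["id"])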

Setting metadata on S3 multipart upload

I'd like to upload a file to S3 in parts, and set some metadata on the file. I'm using boto to interact with S3. I'm able to set metadata with single-operation uploads.
Is there a way to set metadata with a multipart upload? I've tried this method of copying the key to change the metadata, but it fails with the error: InvalidRequest: The specified copy source is larger than the maximum allowable size for a copy source: <size>
I've also tried doing the following:
key = bucket.create_key(key_name)
key.set_metadata('some-key', 'value')
<multipart upload>
...but the multipart upload overwrites the metadata.
I'm using code similar to this to do the multipart upload.
Sorry, I just found the answer:
Per the docs:
If you want to provide any metadata describing the object being uploaded, you must provide it in the request to initiate multipart upload.
So in boto, the metadata can be set in the initiate_multipart_upload call. Docs here.
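For anyone on boto3 rather than classic boto, a rough equivalent (the bucket and key names are placeholders) is to supply the metadata when the multipart upload is initiated, which the high-level transfer manager can do for you via ExtraArgs:

# Sketch with boto3 (placeholder bucket/key names): metadata has to be supplied
# when the multipart upload is initiated, not on the individual parts. The
# high-level transfer manager passes it through for you.
import boto3

s3 = boto3.client("s3")

# upload_file switches to multipart automatically for large files and forwards
# the metadata to the underlying CreateMultipartUpload call.
s3.upload_file(
    "big-file.bin",
    "my-bucket",
    "big-file.bin",
    ExtraArgs={"Metadata": {"some-key": "value"}},
)

# Low-level alternative: s3.create_multipart_upload(Bucket=..., Key=...,
# Metadata={"some-key": "value"}), then UploadPart / CompleteMultipartUpload.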
Faced this issue earlier today and discovered that there is no clear information on how to do it right.
A code example of how we solved the issue is provided below.
$uploader = new MultipartUploader($client, $source, [
    'bucket' => $bucketName,
    'key' => $filename,
    'before_initiate' => function (\Aws\Command $command) {
        $command['ContentType'] = 'application/octet-stream';
        $command['ContentDisposition'] = 'attachment';
    },
]);
Unfortunately, the documentation (https://docs.aws.amazon.com/aws-sdk-php/v3/guide/service/s3-multipart-upload.html#customizing-a-multipart-upload) doesn't make it clear that if you'd like to provide alternative metadata with a multipart upload, you have to go this way.
I hope this helps.

Is there a fast way of accessing line in AWS S3 file?

I have a collection of JSON messages in a file stored on S3 (one message per line). Each message has a unique key as part of the message. I also have a simple DynamoDB table where this key is used as the primary key. The table contains the name of the S3 file where the corresponding JSON message is located.
My goal is to extract a JSON message from the file given the key. Of course, the worst case scenario is when the message is the very last line in the file.
What is the fastest way of extracting the message from the file using the boto library? In particular, is it possible to somehow read the file line by line directly? Of course, I can read the entire contents into a local file using boto.s3.key.get_file(), then open the file, read it line by line and check each line for the matching key. But is there a more efficient way?
Thanks much!
S3 cannot do this. That said, you have some other options:
Store the record's length and position (byte offset) instead of the line number in DynamoDB. This would allow you to retrieve just that record using the Range: header (see the sketch after this list).
Use caching layer to store { S3 object key, line number } => { position, length } tuples. When you want to look up a record by { S3 object key, line number }, reference the cache. If you don't already have this data, you have to fetch the whole file like you do now -- but having fetched the file, you can calculate offsets for every line within it, and save yourself work down the line.
Store the JSON record in DynamoDB directly. This may or may not be practical, given the 64 KB item limit.
Store each JSON record in S3 separately. You could then eliminate the DynamoDB key lookup, and go straight to S3 for a given record.
Which is most appropriate for you depends on your application architecture, the way in which this data is accessed, concurrency issues (probably not significant given your current solution), and your sensitivities for latency and cost.
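Here is a quick sketch of the first option in Python with boto3; the bucket, key and the stored offset/length values are placeholders standing in for what you would look up in DynamoDB:

# Sketch: fetch a single JSON record from a large S3 object using a ranged GET.
# Assumes DynamoDB stores the record's byte offset and length (placeholders here).
import json

import boto3

s3 = boto3.client("s3")

offset, length = 1048576, 532  # looked up from DynamoDB in a real setup

resp = s3.get_object(
    Bucket="my-bucket",
    Key="messages.jsonl",
    Range=f"bytes={offset}-{offset + length - 1}",
)
record = json.loads(resp["Body"].read())
print(record)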
You can use Node's built-in readline module with streams:
const readline = require('readline');
const AWS = require('aws-sdk');
const s3 = new AWS.S3();
const params = {Bucket: 'yourbucket', Key: 'somefile.txt'};
const readStream = s3.getObject(params).createReadStream();
const lineReader = readline.createInterface({
  input: readStream,
});
lineReader.on('line', (line) => console.log(line));
You can use S3 Select to accomplish this. It also works on Parquet files.
https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-glacier-select-sql-reference-select.html
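A minimal sketch of that approach with boto3 (the bucket, key and JSON field name are placeholders), filtering the line-delimited JSON server-side so only the matching record comes back:

# Sketch: use S3 Select to pull one record out of a line-delimited JSON object
# without downloading the whole file. Bucket, key and field name are placeholders.
import boto3

s3 = boto3.client("s3")

resp = s3.select_object_content(
    Bucket="my-bucket",
    Key="messages.jsonl",
    ExpressionType="SQL",
    Expression="SELECT * FROM S3Object s WHERE s.message_key = 'abc123'",
    InputSerialization={"JSON": {"Type": "LINES"}},
    OutputSerialization={"JSON": {}},
)

for event in resp["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"))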