How to push data from the AWS IoT MQTT broker to a random file in an S3 bucket

I have created a rule to forward all messages published to a topic, e.g. foo/bar, of my AWS IoT Core managed MQTT broker to a nested folder in an S3 bucket. For that, I am using the key section of the S3 action. I can send data to a nested folder like a/b/c. The problem is that it takes c as the destination file, and this file gets overwritten with new data as it arrives. Is there any configuration I can use to put each message into a new file in the bucket (with any random name) as it arrives, similar to how it works when forwarding data from Firehose to S3?

You can change your key to use the newuuid() function. e.g.
a/b/${newuuid()}
This will write the data to a file in the a/b folder with a filename that is a generated UUID.
The key in the AWS IoT S3 action allows you to use the IoT SQL Reference Functions to form the folder and filename.
The documentation for the key states:
The path to the file where the data is written. For example, if the value of this argument is "${topic()}/${timestamp()}", the topic the message was sent to is "this/is/my/topic", and the current timestamp is 1460685389, the data is written to a file called "1460685389" in the "this/is/my/topic" folder on Amazon S3.
If you don't want to use a timestamp, you could form the filename with other functions, such as a random float (rand()), a hash (md5()), a UUID (newuuid()), or the trace ID of the message (traceid()).
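If you manage the rule programmatically, a minimal sketch with boto3 might look like the following; the rule name, role ARN, and bucket name are placeholder assumptions:

import boto3

iot = boto3.client("iot")

# Hypothetical rule: forward everything on foo/bar to S3, one object per message.
iot.create_topic_rule(
    ruleName="forward_foo_bar_to_s3",
    topicRulePayload={
        "sql": "SELECT * FROM 'foo/bar'",
        "actions": [
            {
                "s3": {
                    "roleArn": "arn:aws:iam::123456789012:role/iot-s3-role",  # placeholder
                    "bucketName": "my-bucket",                                # placeholder
                    "key": "a/b/${newuuid()}",  # each message gets a fresh UUID filename
                }
            }
        ],
    },
)

Because ${newuuid()} is evaluated per message, every publish produces a new object instead of overwriting a/b/c.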

Azure Blob Storage: Copying blobs with access tier ARCHIVE within the same Azure storage account is not working

I'm using the startCopy API from the azure-storage Java SDK version 8.6.5 to copy blobs between containers within the same storage account. As per the docs, it will copy a block blob's contents, properties, and metadata to a new block blob. Does this also mean the source and destination access tiers will match?
String copyJobId = cloudBlockBlob.startCopy(sourceBlob);
If the source blob access tier is ARCHIVE, I am getting the following exception -
com.microsoft.azure.storage.StorageException: This operation is not permitted on an archived blob.
at com.microsoft.azure.storage.StorageException.translateException(StorageException.java:87) ~[azure-storage-8.6.5.jar:?]
at com.microsoft.azure.storage.core.StorageRequest.materializeException(StorageRequest.java:305) ~[azure-storage-8.6.5.jar:?]
at com.microsoft.azure.storage.core.ExecutionEngine.executeWithRetry(ExecutionEngine.java:196) ~[azure-storage-8.6.5.jar:?]
at com.microsoft.azure.storage.blob.CloudBlob.startCopy(CloudBlob.java:791) ~[azure-storage-8.6.5.jar:?]
at com.microsoft.azure.storage.blob.CloudBlockBlob.startCopy(CloudBlockBlob.java:302) ~[azure-storage-8.6.5.jar:?]
at com.microsoft.azure.storage.blob.CloudBlockBlob.startCopy(CloudBlockBlob.java:180) ~[azure-storage-8.6.5.jar:?]
As shown below, I used the startCopy API to copy all blobs in container02 (source) to container03 (destination). The blob with access tier ARCHIVE failed, and the test1.txt blob's access tier does not match the source.
I just want to confirm whether this is expected, or whether I am not using the right API and need to set these properties explicitly if I need the source and destination to look the same.
Thanks in Advance!!!
1. Blob with access tier ARCHIVE failed
You cannot execute the startCopy operation when the access tier is ARCHIVE.
Please refer to this official documentation:
While a blob is in archive storage, the blob data is offline and can't be read, overwritten, or modified. To read or download a blob in archive, you must first rehydrate it to an online tier. You can't take snapshots of a blob in archive storage. However, the blob metadata remains online and available, allowing you to list the blob, its properties, metadata, and blob index tags. Setting or modifying the blob metadata while in archive is not allowed; however you may set and modify the blob index tags. For blobs in archive, the only valid operations are GetBlobProperties, GetBlobMetadata, SetBlobTags, GetBlobTags, FindBlobsByTags, ListBlobs, SetBlobTier, CopyBlob, and DeleteBlob.
2. The test1.txt blob's access tier is not the same as in the source.
The access tier of the copied blob is likely determined by the destination account's default access tier.
Solution:
You may need to move the files from archive storage to the hot or cool access tier. Or you can use this API and specify standardBlobTier and rehydratePriority:
public final String startCopy(final CloudBlockBlob sourceBlob, String contentMd5, boolean syncCopy, final StandardBlobTier standardBlobTier, RehydratePriority rehydratePriority, final AccessCondition sourceAccessCondition, final AccessCondition destinationAccessCondition, BlobRequestOptions options, OperationContext opContext)
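For comparison, with the newer Python SDK (azure-storage-blob v12) the same copy-based rehydration can be sketched roughly as below; the connection string, container, and blob names are placeholders, and the keyword arguments are the v12 equivalents of the standardBlobTier/rehydratePriority parameters above:

from azure.storage.blob import (
    BlobServiceClient,
    RehydratePriority,
    StandardBlobTier,
)

service = BlobServiceClient.from_connection_string("<connection-string>")  # placeholder
src = service.get_blob_client("container02", "test1.txt")
dst = service.get_blob_client("container03", "test1.txt")

# Copy the archived blob to a new blob and rehydrate the copy to the hot tier.
dst.start_copy_from_url(
    src.url,
    standard_blob_tier=StandardBlobTier.HOT,
    rehydrate_priority=RehydratePriority.STANDARD,
)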

Trying to pass binary files through Logstash

Some process is producing binary files into my Kafka topic (from Java they arrive as a byte array).
I'm trying to consume from Kafka with Logstash and upload the files to S3.
My pipeline:
input {
  kafka {
    bootstrap_servers => "my-broker:9092"
    topics => ["my-topic"]
    partition_assignment_strategy => "org.apache.kafka.clients.consumer.StickyAssignor"
    value_deserializer_class => "org.apache.kafka.common.serialization.ByteArrayDeserializer"
  }
}
filter {
  mutate {
    remove_field => ["@timestamp", "host"]
  }
}
output {
  s3 {
    region => "eu-west-1"
    bucket => "my_bucket"
    time_file => 1
    prefix => "files/"
    rotation_strategy => "time"
  }
}
As you can see, I used a different deserializer class. However, it seems that Logstash uses by default a codec that converts the byte array to a string. My goal is to upload the file to S3 as it is. Is there any codec that doesn't do anything to the input data and uploads it as-is?
Right now the files are uploaded to S3, but I can't read or open them. The binary content was corrupted by Logstash somehow. For example, I tried sending a gzip that contains multiple files, and I can't open it afterwards in S3.
The warning that I get on Logstash:
0-06-02T10:49:29,149][WARN ][logstash.codecs.plain ][my_pipeline] Received an event that has a different character encoding than you configured. {:text=>"7z\\xBC\\xAF'\\u001C\\u0000\\u0002\\xA6j<........more binary data", :expected_charset=>"UTF-8"}
I'm not sure that Logstash is the best fit for passing binary data, and in the end I implemented a Java consumer, but the following solution worked for me with Logstash:
1. The data sent to Kafka must be serialized to binary. For example, I used Filebeat to send the binary data; in its Kafka output module there is a parameter called "value_serializer", and it should be set to "org.apache.kafka.common.serialization.ByteArraySerializer".
2. In your Logstash settings (kafka input), set value_deserializer_class to "org.apache.kafka.common.serialization.ByteArrayDeserializer", just as in the post above.
3. Your output in Logstash can be any resource that can accept binary data.
Be aware that the output will receive binary data and you will need to deserialize it.
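Since the answer ultimately fell back to a hand-written consumer, here is a minimal sketch of that consume-and-upload approach in Python; kafka-python and boto3, as well as the topic, bucket, and key naming, are assumptions rather than anything from the post:

import uuid

import boto3
from kafka import KafkaConsumer  # kafka-python, assumed for illustration

# Read raw bytes from Kafka and upload each message to S3 untouched.
consumer = KafkaConsumer(
    "my-topic",
    bootstrap_servers="my-broker:9092",
    value_deserializer=lambda v: v,  # keep the payload as raw bytes
)
s3 = boto3.client("s3")

for message in consumer:
    key = f"files/{uuid.uuid4()}"  # hypothetical key scheme: one object per message
    s3.put_object(Bucket="my_bucket", Key=key, Body=message.value)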
I don't think you really understand what Logstash is for.
As its name (log-stash) suggests, it is for streaming ASCII-type files, using an EOL delimiter to distinguish between log events.
I did manage to find community-developed Kafka beats for reading data from Kafka topics; there are 2 options:
kafkabeat - Reads data from Kafka topics.
kafkabeat2 - Reads data (json or plain) from Kafka topics.
I didn't test those myself, but using the S3 output option with them might do the trick. If the S3 option is not yet supported, you can develop it yourself and contribute it as open source so everyone can enjoy it :-)

Accessing FlowFile content in the NiFi PutS3Object processor

I am new to NiFi and want to push data from Kafka to an S3 bucket. I am using the PutS3Object processor and can push data to S3 if I hard-code the Bucket value as mphdf/orderEvent, but I want to choose the bucket based on a field in the content of the FlowFile, which is JSON. So, if the JSON content is {"menu": {"type": "file","value": "File"}}, can I set the value of the Bucket property to mphdf/$.menu.type? I have tried this and get the error below. I want to know if there is a way to access the FlowFile content from the PutS3Object processor and make bucket names configurable, or will I have to build my own processor?
ERROR [Timer-Driven Process Thread-10]
o.a.nifi.processors.aws.s3.PutS3Object
com.amazonaws.services.s3.model.AmazonS3Exception: The XML you
provided was not well-formed or did not validate against our
published schema (Service: Amazon S3; Status Code: 400; Error Code:
MalformedXML; Request ID: 77DF07828CBA0E5F)
I believe what you want to do is use an EvaluateJSONPath processor, which evaluates arbitrary JSONPath expressions against the JSON content and extracts the results to flowfile attributes. You can then reference the flowfile attribute using NiFi Expression Language in the PutS3Object configuration (see your first property Object Key which references ${filename}). In this way, you would evaluate $.menu.type and store it into an attribute menuType in the EvaluateJSONPath processor, then in PutS3Object you would have Bucket be mphdf/${menuType}.
You might have to play around with it a bit but off the top of my head I think that should work.
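Purely to illustrate what EvaluateJSONPath would extract here (NiFi does this evaluation internally), a small Python sketch using the jsonpath-ng library, which is an assumption for illustration only, would be:

import json

from jsonpath_ng import parse  # jsonpath-ng, assumed for illustration

flowfile_content = json.loads('{"menu": {"type": "file", "value": "File"}}')

# EvaluateJSONPath would store this result in a flowfile attribute, e.g. menuType.
menu_type = parse("$.menu.type").find(flowfile_content)[0].value

# PutS3Object's Bucket property of mphdf/${menuType} would then resolve to:
print(f"mphdf/{menu_type}")  # -> mphdf/file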

Deleting log files in an Amazon S3 bucket according to created date

How do I delete log files in Amazon S3 according to date? I have log files in a logs folder inside my bucket.
string sdate = datetime.ToString("yyyy-MM-dd");
string key = "logs/" + sdate + "*";
AmazonS3 s3Client = AWSClientFactory.CreateAmazonS3Client();
DeleteObjectRequest delRequest = new DeleteObjectRequest()
    .WithBucketName(S3_Bucket_Name)
    .WithKey(key);
DeleteObjectResponse res = s3Client.DeleteObject(delRequest);
I tried this, but it doesn't seem to work. I can delete individual files if I put the whole name in the key, but I want to delete all the log files created for a particular date.
You can use S3's Object Lifecycle feature, specifically Object Expiration, to delete all objects under a given prefix and over a given age. It's not instantaneous, but it beats having to make myriad individual requests. To delete everything, just make the age small.
http://docs.aws.amazon.com/AmazonS3/latest/dev/object-lifecycle-mgmt.html
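As a rough sketch, a lifecycle rule like this could be set with boto3; the bucket name and the one-day expiration are placeholder assumptions:

import boto3

s3 = boto3.client("s3")

# Expire everything under the "logs/" prefix one day after creation.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-bucket",  # placeholder
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-old-logs",
                "Filter": {"Prefix": "logs/"},
                "Status": "Enabled",
                "Expiration": {"Days": 1},
            }
        ]
    },
)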

Is there a fast way of accessing a line in an AWS S3 file?

I have a collection of JSON messages in a file stored on S3 (one message per line). Each message has a unique key as part of the message. I also have a simple DynamoDB table where this key is used as the primary key. The table contains the name of the S3 file where the corresponding JSON message is located.
My goal is to extract a JSON message from the file given the key. Of course, the worst case scenario is when the message is the very last line in the file.
What is the fastest way of extracting the message from the file using the boto library? In particular, is it possible to somehow read the file line by line directly? Of course, I can read the entire contents to a local file using boto.s3.key.get_file(), then open the file, read it line by line, and check for the id to match. But is there a more efficient way?
Thanks much!
S3 cannot do this. That said, you have some other options:
Store the record's length and position (byte offset) instead of the line number in DynamoDB. This would allow you to retrieve just that record using a Range header (see the ranged-read sketch after this list).
Use caching layer to store { S3 object key, line number } => { position, length } tuples. When you want to look up a record by { S3 object key, line number }, reference the cache. If you don't already have this data, you have to fetch the whole file like you do now -- but having fetched the file, you can calculate offsets for every line within it, and save yourself work down the line.
Store the JSON record in DynamoDB directly. This may or may not be practical, given the 64 KB item limit.
Store each JSON record in S3 separately. You could then eliminate the DynamoDB key lookup, and go straight to S3 for a given record.
Which is most appropriate for you depends on your application architecture, the way in which this data is accessed, concurrency issues (probably not significant given your current solution), and your sensitivities for latency and cost.
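For the first option, a minimal boto3 sketch of such a ranged read might look like this; the bucket, key, and the offset/length values (which would really come from DynamoDB) are placeholders:

import boto3

s3 = boto3.client("s3")

# In practice, offset and length would be read from the DynamoDB item.
offset, length = 1024, 256

resp = s3.get_object(
    Bucket="my-bucket",               # placeholder
    Key="messages/2016-04-15.jsonl",  # placeholder
    Range=f"bytes={offset}-{offset + length - 1}",  # HTTP ranges are inclusive
)
record = resp["Body"].read().decode("utf-8")
print(record)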
You can use Node's built-in readline module with streams:
const readline = require('readline');
const AWS = require('aws-sdk');
const s3 = new AWS.S3();
const params = {Bucket: 'yourbucket', Key: 'somefile.txt'};
const readStream = s3.getObject(params).createReadStream();
const lineReader = readline.createInterface({
input: readStream,
});
lineReader.on('line', (line) => console.log(line));
You can use S3 Select to accomplish this. It also works on Parquet files.
https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-glacier-select-sql-reference-select.html
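A hedged boto3 sketch of such an S3 Select query over line-delimited JSON follows; the bucket, key, and the name of the unique-id field are assumptions:

import boto3

s3 = boto3.client("s3")

resp = s3.select_object_content(
    Bucket="my-bucket",               # placeholder
    Key="messages/2016-04-15.jsonl",  # placeholder
    ExpressionType="SQL",
    # Assumes each line is a JSON object with a top-level "id" field.
    Expression="SELECT * FROM s3object s WHERE s.id = 'my-unique-key'",
    InputSerialization={"JSON": {"Type": "LINES"}},
    OutputSerialization={"JSON": {}},
)

# The response payload is an event stream; Records events carry the matching rows.
for event in resp["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"))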