Is there a fast way of accessing a line in an AWS S3 file? - amazon-s3

I have a collection of JSON messages in a file stored on S3 (one message per line). Each message has a unique key as part of the message. I also have a simple DynamoDB table where this key is used as the primary key. The table contains the name of the S3 file where the corresponding JSON message is located.
My goal is to extract a JSON message from the file given the key. Of course, the worst case scenario is when the message is the very last line in the file.
What is the fastest way of extracting the message from the file using the boto library? In particular, is it possible to somehow read the file line by line directly? Of course, I can read the entire contents to a local file using boto.s3.key.get_file() then open the file and read it line by line and check for the id to match. But is there a more efficient way?
Thanks much!

S3 cannot do this. That said, you have some other options:
Store the record's length and position (byte offset) instead of the line number in DynamoDB. This would allow you to retrieve just that record using the Range: header.
Use a caching layer to store { S3 object key, line number } => { position, length } tuples. When you want to look up a record by { S3 object key, line number }, consult the cache. If the data is not cached yet, you have to fetch the whole file as you do now. Having fetched the file, however, you can calculate offsets for every line within it and save yourself work on later lookups.
Store the JSON record in DynamoDB directly. This may or may not be practical, given DynamoDB's item size limit (64 KB when this answer was written, since raised to 400 KB).
Store each JSON record in S3 separately. You could then eliminate the DynamoDB key lookup, and go straight to S3 for a given record.
Which is most appropriate for you depends on your application architecture, the way in which this data is accessed, concurrency issues (probably not significant given your current solution), and your sensitivities for latency and cost.
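The first option (a byte offset and length per record, fetched with a Range header) could look roughly like the sketch below. It is only an illustration under assumptions: the helper names index_lines, range_header, and fetch_record are hypothetical, and fetch_record assumes a boto3-style client whose get_object accepts Bucket, Key, and Range.

```python
def index_lines(data):
    """Return an (offset, length) pair for every line in data, newline excluded."""
    index = []
    offset = 0
    for line in data.splitlines(keepends=True):
        length = len(line.rstrip(b"\r\n"))
        index.append((offset, length))
        offset += len(line)  # advance past the line and its terminator
    return index


def range_header(offset, length):
    """Build an HTTP Range value; byte ranges are inclusive on both ends."""
    return "bytes={}-{}".format(offset, offset + length - 1)


def fetch_record(s3_client, bucket, key, offset, length):
    """Fetch a single record. Assumes a boto3-style client (an assumption, not
    something the original answer specifies)."""
    response = s3_client.get_object(
        Bucket=bucket, Key=key, Range=range_header(offset, length)
    )
    return response["Body"].read()
```

index_lines would run once, when the file is written or the first time it is fetched, and the resulting (offset, length) pairs would be stored next to the S3 file name in DynamoDB. Empty lines are not handled, since a zero-length range is not a valid request.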

You can use the built-in readline module with streams:
const readline = require('readline');
const AWS = require('aws-sdk');
const s3 = new AWS.S3();
const params = {Bucket: 'yourbucket', Key: 'somefile.txt'};
const readStream = s3.getObject(params).createReadStream();
const lineReader = readline.createInterface({
  input: readStream,
});
lineReader.on('line', (line) => console.log(line));
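Since the question asks about Python, the same streaming idea can be sketched there as well. This is a minimal sketch, not a definitive implementation: find_message and the "id" field name are hypothetical, and the commented usage assumes boto3, whose response body offers iter_lines(). The function itself works on any iterable of lines.

```python
import json


def find_message(lines, wanted_id, id_field="id"):
    """Scan an iterable of JSON lines; return the first message whose id matches."""
    for raw in lines:
        if not raw.strip():
            continue  # skip blank lines
        message = json.loads(raw)
        if message.get(id_field) == wanted_id:
            return message  # stop early; the remaining lines are never read
    return None


# Hypothetical usage with boto3 (not executed here):
#   body = s3_client.get_object(Bucket="yourbucket", Key="somefile.txt")["Body"]
#   message = find_message(body.iter_lines(), "some-key")
```

Streaming avoids writing a local copy and stops at the first match, but in the worst case (the last line) the whole object is still transferred.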

You can use S3 Select to accomplish this. It also works on Parquet files.
https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-glacier-select-sql-reference-select.html
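A rough sketch of what an S3 Select lookup might look like from Python follows. The helper names build_select_request and collect_records are hypothetical; the request shape assumes boto3's select_object_content with JSON Lines input, and the "id" field name is an assumption about the message layout (it must be a plain identifier here).

```python
def build_select_request(bucket, key, id_field, wanted_id):
    """Build keyword arguments for a boto3-style select_object_content call."""
    escaped = str(wanted_id).replace("'", "''")  # escape single quotes for SQL
    expression = "SELECT * FROM S3Object s WHERE s.{} = '{}'".format(id_field, escaped)
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": expression,
        "InputSerialization": {"JSON": {"Type": "LINES"}},  # one JSON message per line
        "OutputSerialization": {"JSON": {}},
    }


def collect_records(event_stream):
    """Join the payload bytes of all Records events in the response stream."""
    chunks = []
    for event in event_stream:
        if "Records" in event:
            chunks.append(event["Records"]["Payload"])
    return b"".join(chunks)


# Hypothetical usage with boto3 (not executed here):
#   response = s3_client.select_object_content(
#       **build_select_request("yourbucket", "somefile.txt", "id", "some-key"))
#   print(collect_records(response["Payload"]))
```

With this approach the filtering happens on the S3 side, so only the matching record travels over the network.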

Related

Trying to pass binary files through Logstash

Some process is producing into my Kafka binary files (from Java it comes as bytearray).
I'm trying to consume from Kafka with Logstash and upload the file into s3.
My pipeline:
input {
  kafka {
    bootstrap_servers => "my-broker:9092"
    topic => "my-topic"
    partition_assignment_strategy => "org.apache.kafka.clients.consumer.StickyAssignor"
    value_deserializer_class => "org.apache.kafka.common.serialization.ByteArrayDeserializer"
  }
}
filter {
  mutate {
    remove_field => ["@timestamp", "host"]
  }
}
output {
  s3 {
    region => "eu-west-1"
    bucket => "my_bucket"
    time_file => 1
    prefix => "files/"
    rotation_strategy => "time"
  }
}
As you can see, I used a different deserializer class. However, it seems that Logstash by default uses a codec that converts the byte array to a string. My goal is to upload the file to S3 as it is. Is there a known codec that does nothing to the input data and uploads it unchanged?
Right now the files are uploaded to s3, but I can't read them or open them. The binary content was corrupted by Logstash somehow. For example - I tried sending a gzip that contains multiple files inside and I can't open it afterwards in s3.
The warning that I get on Logstash:
0-06-02T10:49:29,149][WARN ][logstash.codecs.plain ][my_pipeline] Received an event that has a different character encoding than you configured. {:text=>"7z\\xBC\\xAF'\\u001C\\u0000\\u0002\\xA6j<........more binary data", :expected_charset=>"UTF-8"}
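The kind of corruption described here can be reproduced outside Logstash. The sketch below is only an illustration of the general failure mode: it assumes that somewhere in the pipeline the bytes are decoded as UTF-8 text with invalid sequences replaced, which is an assumption about the behavior and not a statement of what the plain codec does internally.

```python
import gzip

# A small gzip archive stands in for the binary payload from Kafka.
original = gzip.compress(b"example payload")

# Treat the bytes as UTF-8 text, replacing anything that is not valid UTF-8.
as_text = original.decode("utf-8", errors="replace")

# Writing the text back out does not restore the original bytes.
round_tripped = as_text.encode("utf-8")

print(round_tripped == original)
```

The gzip header itself contains a byte that is not valid UTF-8, so the replacement step alters the data and the archive can no longer be opened.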
I'm not sure Logstash is the best fit for passing binary data, and in the end I implemented a Java consumer, but the following solution worked for me with Logstash:
The data sent to Kafka can be serialized as binary data. For example, I used Filebeat to send the binary data; its Kafka output module has a parameter called "value_serializer", which should be set to "org.apache.kafka.common.serialization.ByteArraySerializer".
In your Logstash settings (kafka input), set value_deserializer_class to "org.apache.kafka.common.serialization.ByteArrayDeserializer", just as I did in the post.
Your output in Logstash can be any resource that can accept binary data.
Be aware that the output will receive binary data, and you will need to deserialize it.
I don't think you really understand what logstash is for.
As its name (log-stash) suggests, it is for streaming ASCII-type files, using an EOL delimiter to distinguish between log events.
I did manage to find community-developed Beats for reading data from Kafka topics. There are 2 options:
kafkabeat - Reads data from Kafka topics.
kafkabeat2 - Reads data (json or plain) from Kafka topics.
I didn't test those myself, but using the S3 output option with them might do the trick. If the S3 option is not yet supported, you can develop it yourself and contribute it to the open-source project so everyone can enjoy it :-)

How to push data from AWS IoT MQTT broker to a random file in S3 bucket

I have created a rule to forward all messages published to any topic (e.g. foo/bar) of my AWS IoT Core managed MQTT broker to a nested folder in an S3 bucket. For that, I am using the key section. I can send data to a nested folder like a/b/c. The problem is that it treats c as the destination file, and this file gets updated with new data as it arrives. Is there any configuration that puts each message in the bucket as a new file (with any random name) as it arrives, similar to what happens when we forward data from Firehose to S3?
You can change your key to use the newuuid() function. e.g.
a/b/${newuuid()}
This will write the data to a file in the a/b folder with a filename that is a generated UUID.
The key in AWS IoT S3 Actions allows you to use the IoT SQL Reference Functions to form the folder and filename.
The documentation for the key states:
The path to the file where the data is written. For example, if the value of this argument is "${topic()}/${timestamp()}", the topic the message was sent to is "this/is/my/topic,", and the current timestamp is 1460685389, the data is written to a file called "1460685389" in the "this/is/my/topic" folder on Amazon S3.
If you don't want to use a timestamp, you could form the name of the file using other functions, such as a random float (rand()), a hash (md5()), a UUID (newuuid()), or the trace id of the message (traceid()).
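For illustration, a rule definition with such a key could be assembled as in the sketch below. The helper name build_rule_payload, the role ARN, and the bucket name are hypothetical placeholders; the dictionary layout assumes the topicRulePayload shape accepted by boto3's iot create_topic_rule.

```python
def build_rule_payload(topic_filter, bucket_name, role_arn, folder):
    """Build a topic rule payload whose S3 key ends in a generated UUID."""
    return {
        "sql": "SELECT * FROM '{}'".format(topic_filter),
        "actions": [
            {
                "s3": {
                    "roleArn": role_arn,
                    "bucketName": bucket_name,
                    # ${newuuid()} is evaluated by AWS IoT for every message,
                    # so each message lands in its own object.
                    "key": folder + "/${newuuid()}",
                }
            }
        ],
    }


payload = build_rule_payload(
    "foo/bar",
    "my-example-bucket",                            # hypothetical bucket
    "arn:aws:iam::123456789012:role/example-role",  # hypothetical role
    "a/b",
)
print(payload["actions"][0]["s3"]["key"])

# Hypothetical usage with boto3 (not executed here):
#   iot_client.create_topic_rule(ruleName="to_s3", topicRulePayload=payload)
```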

Sails Skipper: how to read and validate a csv file and exclude the invalid file types during upload?

I'm trying to write a controller that uploads a file to an S3 location. However, before the upload I need to validate whether the incoming file is a CSV or not. I then need to read the file to check for header columns and so on. I got the type of the file with the snippet below:
req.file('foo')._files[0].stream
But how do I read the entire file stream and check the headers, data, etc.? There were other similar questions (e.g. Sails.js Skipper: How to read the uploaded file stream during upload?), but the solution mentioned there is to use the skipper-csv adapter, which I cannot use because I already use skipper-s3 to upload to S3.
Can someone please post an example on how to read the upstreams and perform any validations before the upload?
Here is how I solved my problem: I make a copy of the stream before the actual upload. I then run my validations on the original stream and, once they pass, upload the copied stream to the desired location.
For reading the CSV stream, I found an npm package, csv-parser (https://github.com/mafintosh/csv-parser), which makes it easy to handle events such as headers and data.
For creating the copy of the stream, I used the following logic:
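const { PassThrough } = require('stream');
const upstream = req.file('file');
const fileStreamMap = {};
const fileStreamMapCopy = {};
_.each(upstream._files, (file) => {
  // Assumption: Skipper exposes the original file name as file.stream.filename.
  const fileName = file.stream.filename;
  // Pipe the incoming stream into two PassThrough streams: one to validate, one to upload.
  const stream = new PassThrough();
  const streamCopy = new PassThrough();
  file.stream.pipe(stream);
  file.stream.pipe(streamCopy);
  fileStreamMap[fileName] = stream;
  fileStreamMapCopy[fileName] = streamCopy;
});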
// validate and upload files to S3, if Valid.
validateAndUploadFile(fileStreamMap, fileStreamMapCopy);
}
validateAndUploadFile() contains my custom validation logic for my csv upload.
Also, we can use aws-sdk (https://www.npmjs.com/package/aws-sdk) for the S3 upload.
Hope this helps someone.

Setting metadata on S3 multipart upload

I'd like to upload a file to S3 in parts, and set some metadata on the file. I'm using boto to interact with S3. I'm able to set metadata with single-operation uploads like so:
Is there a way to set metadata with a multipart upload? I've tried this method of copying the key to change the metadata, but it fails with the error: InvalidRequest: The specified copy source is larger than the maximum allowable size for a copy source: <size>
I've also tried doing the following:
key = bucket.create_key(key_name)
key.set_metadata('some-key', 'value')
<multipart upload>
...but the multipart upload overwrites the metadata.
I'm using code similar to this to do the multipart upload.
Sorry, I just found the answer:
Per the docs:
If you want to provide any metadata describing the object being uploaded, you must provide it in the request to initiate multipart upload.
So in boto, the metadata can be set in the initiate_multipart_upload call. Docs here.
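A minimal sketch of that approach follows. The function name multipart_upload_with_metadata is hypothetical, and the code assumes a boto 2 Bucket object, whose initiate_multipart_upload accepts a metadata argument and returns an object with upload_part_from_file, complete_upload, and cancel_upload.

```python
import io


def multipart_upload_with_metadata(bucket, key_name, data, metadata,
                                   part_size=5 * 1024 * 1024):
    """Upload data in parts, attaching metadata when the upload is initiated."""
    # Metadata has to be supplied here; it cannot be added by later part uploads.
    mp = bucket.initiate_multipart_upload(key_name, metadata=metadata)
    try:
        starts = range(0, len(data), part_size)
        for part_num, start in enumerate(starts, start=1):
            chunk = io.BytesIO(data[start:start + part_size])
            mp.upload_part_from_file(chunk, part_num)
        return mp.complete_upload()
    except Exception:
        mp.cancel_upload()  # avoid leaving an unfinished upload behind
        raise


# Hypothetical usage with boto 2 (not executed here):
#   bucket = boto.connect_s3().get_bucket("mybucket")
#   multipart_upload_with_metadata(bucket, "big-file", payload, {"some-key": "value"})
```

The default part size reflects S3's 5 MB minimum for every part except the last.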
I faced this issue earlier today and found no clear information on how to do it properly.
The code example below shows how we solved it.
$uploader = new MultipartUploader($client, $source, [
    'bucket' => $bucketName,
    'key' => $filename,
    'before_initiate' => function (\Aws\Command $command) {
        $command['ContentType'] = 'application/octet-stream';
        $command['ContentDisposition'] = 'attachment';
    },
]);
Unfortunately, the documentation (https://docs.aws.amazon.com/aws-sdk-php/v3/guide/service/s3-multipart-upload.html#customizing-a-multipart-upload) doesn't make it clear that if you'd like to provide alternative metadata with a multipart upload, this is the way to do it.
I hope that will help.

boto - more concise way to get key's value from bucket?

I'm trying to figure out a concise way to get data from S3 via boto.
My current code looks like this. s3_manager is simply a class that does all the S3 setup for my app.
log.debug("generating downloader")
downloader = s3_manager()
log.debug("accessing bucket")
bucket_archive = downloader.s3_buckets['#archive']
log.debug("getting key")
key = bucket_archive.get_key(archive_filename)
log.debug("getting key into string")
source = key.get_contents_as_string()
The problem is that, looking at my debug logs, I'm making two requests to Amazon S3:
key = bucket_archive.get_key(archive_filename)
source = key.get_contents_as_string()
Looking at the docs [ http://boto.readthedocs.org/en/latest/ref/s3.html ], it seems that the call to get_key checks whether the key exists, while the second call gets the actual data. Does anyone know of a method to do both at once? A more concise way of doing this with one request is preferable for our app.
The get_key() method performs a HEAD request on the object to verify that it exists. If you are certain that the bucket and key exist and would prefer not to have the overhead of a HEAD request, you can simply create a Key object directly. Something like this would work:
import boto
s3 = boto.connect_s3()
bucket = s3.get_bucket('mybucket', validate=False)
key = bucket.new_key('myexistingkey')
contents = key.get_contents_as_string()
The validate=False on the call to get_bucket eliminates a GET request that also is intended to validate that the bucket exists.