Sails Skipper: how to read and validate a CSV file and exclude invalid file types during upload?

I'm trying to write a controller that uploads a file to an S3 location. However, before the upload I need to validate whether the incoming file type is a CSV or not, and then I need to read the file to check for header columns, etc. I got the type of the file as per the snippet below:
req.file('foo')._files[0].stream
But how do I read the entire file stream and check for headers and data, etc.? There are other similar questions, such as (Sails.js Skipper: How to read the uploaded file stream during upload?), but the solution mentioned there is to use the skipper-csv adapter, which I cannot use since I already use skipper-s3 to upload to S3.
Can someone please post an example of how to read the upstreams and perform any validations before the upload?

Here is how I solved my problem: I make a copy of the stream before the actual upload, run my validations on the original stream, and once they pass, I upload the copied stream to my desired location.
For reading the CSV stream, I found an npm package, csv-parser (https://github.com/mafintosh/csv-parser), which made it easy to handle events like headers and data.
For creating the copy of the stream, I used the following logic:
const { PassThrough } = require('stream');

const upstream = req.file('file');
const fileStreamMap = {};
const fileStreamMapCopy = {};

_.each(upstream._files, (file) => {
  const fileName = file.stream.filename; // filename skipper reports for this upstream file
  const stream = new PassThrough();
  const streamCopy = new PassThrough();
  file.stream.pipe(stream);       // stream used for validation
  file.stream.pipe(streamCopy);   // copy kept aside for the actual S3 upload
  fileStreamMap[fileName] = stream;
  fileStreamMapCopy[fileName] = streamCopy;
});

// validate and upload files to S3, if valid.
validateAndUploadFile(fileStreamMap, fileStreamMapCopy);
validateAndUploadFile() contains my custom validation logic for my CSV upload.
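As a rough illustration, a header check with csv-parser might look something like the sketch below; the expected header names and the promise-based helper are hypothetical, not part of the original code:
const csv = require('csv-parser');

// Hypothetical sketch: resolve if the CSV stream contains the expected headers, reject otherwise.
function validateCsvStream(stream, expectedHeaders) {
  return new Promise((resolve, reject) => {
    stream
      .pipe(csv())
      .on('headers', (headers) => {
        const missing = expectedHeaders.filter((h) => !headers.includes(h));
        if (missing.length) {
          reject(new Error(`Missing headers: ${missing.join(', ')}`));
        } else {
          resolve();
        }
      })
      .on('error', reject);
  });
}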
Also, we can use aws-sdk (https://www.npmjs.com/package/aws-sdk) for the S3 upload.
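And a rough sketch of the upload side with aws-sdk, assuming credentials are already configured; the bucket name and helper shape are placeholders:
const AWS = require('aws-sdk');
const s3 = new AWS.S3();

// Hypothetical sketch: pipe the copied PassThrough stream straight to S3.
function uploadToS3(stream, fileName) {
  return s3.upload({
    Bucket: 'your-bucket', // placeholder bucket name
    Key: fileName,
    Body: stream,          // s3.upload accepts a readable stream as the body
  }).promise();
}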
Hope this helps someone.

Related

Read csv from s3 and upload to external api as multipart

I want to read a CSV file from an S3 bucket using boto3 and upload it to an external API as a multipart/form-data request.
So far I am able to read the CSV:
response = s3.get_object(Bucket=bucket, Key=key)
body = response['Body']
I'm not sure how to convert this body into a multipart request.
The external API accepts requests as multipart/form-data.
Any suggestions would be helpful.
The following method solved my issue:
from requests_toolbelt import MultipartEncoder

body = response['Body'].read()
multipart_data = MultipartEncoder(
    fields={
        'file': (file_name, body, 'application/vnd.ms-excel'),
        'field01': 'test'
    }
)
# when posting multipart_data to the external API, set the request's
# Content-Type header to multipart_data.content_type
The .read() method returns the object's contents as bytes, which MultipartEncoder accepts as the file content.

Why do we have to set responseType when using XMLHttpRequest?

I implemented an HTML page which uploads a file and then downloads another file from the server.
While handling the download part, I noticed that if I download a binary file, I have to set responseType to blob or the file will be broken.
What confuses me is that the HTTP response contains a Content-Type header, which could tell XMLHttpRequest what type of file the server is sending. Why do I have to set it manually? I don't understand the logic, because it's the server's job to say what the file type is rather than the client's to predict it.
const xhr = new XMLHttpRequest();
xhr.responseType = 'blob';
.......
xhr.onload = function(e) {
  if (this.status == 200) {
    var blob = new Blob([this.response]); // if I don't set responseType, this.response will be broken
    let a = document.createElement("a");
Your question made me realise I'm not entirely sure of the following, but it is how I have always looked at it.
responseType sets the type of xhr.response so you can process it as a Blob; it lets you retrieve the result of the xhr request as a Blob. If you don't set it, xhr.response will be text.
A server may send the data correctly for a given MIME type, but it still only sends a stream of bytes along with that MIME type; interpreting the received bytes is up to your end, and the received data won't automatically become a Blob based on the MIME type.
Blob is not a file type on the server; the server may know and send the MIME types of files, but Blob isn't one of them, and xhr.response won't be a Blob just because the MIME type suggests a Blob would be the right type.
Also, you may want to process xhr.response differently from what the MIME type implies, so in that sense it is a kind of override (though not with the same functionality as xhr.overrideMimeType()).
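To make that concrete, here is a minimal sketch of the download side with responseType set to 'blob'; the endpoint and file name are made up:
const xhr = new XMLHttpRequest();
xhr.open('GET', '/download');        // hypothetical endpoint
xhr.responseType = 'blob';           // xhr.response will be a Blob instead of text
xhr.onload = function () {
  if (this.status === 200) {
    const url = URL.createObjectURL(this.response); // already a Blob, no re-wrapping needed
    const a = document.createElement('a');
    a.href = url;
    a.download = 'report.bin';       // hypothetical file name
    a.click();
    URL.revokeObjectURL(url);
  }
};
xhr.send();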

React Native FileSystem file could not be read? What the Blob?

I am trying to send an audio file recorded with the expo-av library to my server using WebSockets.
A WebSocket only lets me send a String, ArrayBuffer, or Blob. I spent the whole day trying to find out how to convert my .wav recording into a Blob, but without success. I tried to use the expo-file-system method FileSystem.readAsStringAsync to read the file as a string, but I get an error that the file could not be read. How is that possible? I passed it the correct URI (using recording.getURI()).
I tried to re-engineer my approach to use a fetch and FormData POST request with the same URI, and the audio gets sent correctly. But I would really like to use WebSockets so that later I can try to stream the sound to the server in real time instead of recording it first and then sending it.
You can try this ... but I can't find a way to read the blob itself:
// this is from my code ...
// `recording` is the Audio.Recording instance from expo-av that was used to record
const info = await FileSystem.getInfoAsync(recording.getURI() || "");
console.log(`FILE INFO: ${JSON.stringify(info)}`);
// get the file as a blob
const response = await fetch(info.uri);
const blob = await response.blob(); // slow - takes a lot of time
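For the WebSocket route, one possible sketch (an assumption, not a verified fix): reading the file as a Base64 string avoids the default UTF-8 text decoding and yields a plain string that WebSocket.send() accepts; the socket URL is a placeholder, and the server would need to decode the Base64 back into audio bytes.
import * as FileSystem from 'expo-file-system';

// Hypothetical sketch: send the recorded file over a WebSocket as a Base64 string.
async function sendRecording(uri, socketUrl) {
  const base64 = await FileSystem.readAsStringAsync(uri, {
    encoding: FileSystem.EncodingType.Base64, // read the binary content as Base64 text
  });
  const ws = new WebSocket(socketUrl);
  ws.onopen = () => ws.send(base64);
}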

Setting metadata on S3 multipart upload

I'd like to upload a file to S3 in parts and set some metadata on it. I'm using boto to interact with S3. I'm able to set metadata with single-operation uploads.
Is there a way to set metadata with a multipart upload? I've tried this method of copying the key to change the metadata, but it fails with the error: InvalidRequest: The specified copy source is larger than the maximum allowable size for a copy source: <size>
I've also tried doing the following:
key = bucket.create_key(key_name)
key.set_metadata('some-key', 'value')
<multipart upload>
...but the multipart upload overwrites the metadata.
I'm using code similar to this to do the multipart upload.
Sorry, I just found the answer:
Per the docs:
If you want to provide any metadata describing the object being uploaded, you must provide it in the request to initiate multipart upload.
So in boto, the metadata can be set in the initiate_multipart_upload call. Docs here.
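The same rule applies outside boto; for illustration, here is a hedged sketch with the aws-sdk for JavaScript (used elsewhere on this page, not the asker's boto), where the metadata goes into the call that initiates the multipart upload. Bucket, key, and metadata values are placeholders.
const AWS = require('aws-sdk');
const s3 = new AWS.S3();

// Hypothetical sketch: user-defined metadata is supplied when the multipart upload is initiated.
async function startUpload() {
  const init = await s3.createMultipartUpload({
    Bucket: 'your-bucket',             // placeholder
    Key: 'your-key',                   // placeholder
    Metadata: { 'some-key': 'value' }, // stored as x-amz-meta-some-key
  }).promise();
  // ...upload parts with s3.uploadPart() and finish with s3.completeMultipartUpload()...
  return init.UploadId;
}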
I faced this issue earlier today and discovered that there is little information on how to do it right.
A code example showing how we solved the issue is provided below.
$uploader = new MultipartUploader($client, $source, [
    'bucket' => $bucketName,
    'key' => $filename,
    'before_initiate' => function (\Aws\Command $command) {
        $command['ContentType'] = 'application/octet-stream';
        $command['ContentDisposition'] = 'attachment';
    },
]);
Unfortunately, the documentation (https://docs.aws.amazon.com/aws-sdk-php/v3/guide/service/s3-multipart-upload.html#customizing-a-multipart-upload) doesn't make it clear that if you'd like to provide alternative metadata with a multipart upload, you have to go this way.
I hope that helps.

Is there a fast way of accessing line in AWS S3 file?

I have a collection of JSON messages in a file stored on S3 (one message per line). Each message has a unique key as part of the message. I also have a simple DynamoDB table where this key is used as the primary key. The table contains the name of the S3 file where the corresponding JSON message is located.
My goal is to extract a JSON message from the file given the key. Of course, the worst case scenario is when the message is the very last line in the file.
What is the fastest way of extracting the message from the file using the boto library? In particular, is it possible to somehow read the file line by line directly? Of course, I can read the entire contents to a local file using boto.s3.key.get_file(), then open the file, read it line by line, and check each id for a match. But is there a more efficient way?
Thanks much!
S3 cannot do this. That said, you have some other options:
Store the record's length and position (byte offset) instead of the line number in DynamoDB. This would allow you to retrieve just that record using the Range: header (see the sketch after this list).
Use a caching layer to store { S3 object key, line number } => { position, length } tuples. When you want to look up a record by { S3 object key, line number }, reference the cache. If you don't already have this data, you have to fetch the whole file like you do now -- but having fetched the file, you can calculate offsets for every line within it, and save yourself work down the line.
Store the JSON record in DynamoDB directly. This may or may not be practical, given the 64 KB item limit.
Store each JSON record in S3 separately. You could then eliminate the DynamoDB key lookup, and go straight to S3 for a given record.
Which is most appropriate for you depends on your application architecture, the way in which this data is accessed, concurrency issues (probably not significant given your current solution), and your sensitivities for latency and cost.
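As an illustration of the first option, here is a rough sketch using the aws-sdk from the next answer rather than boto; the bucket, key, and byte offsets stand in for whatever you stored in DynamoDB:
const AWS = require('aws-sdk');
const s3 = new AWS.S3();

// Hypothetical sketch: fetch only the bytes of a single record via a ranged GET.
async function getRecord(bucket, key, offset, length) {
  const res = await s3.getObject({
    Bucket: bucket,
    Key: key,
    Range: `bytes=${offset}-${offset + length - 1}`, // inclusive byte range
  }).promise();
  return JSON.parse(res.Body.toString('utf8'));
}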
You can use the built-in readline module with streams:
const readline = require('readline');
const AWS = require('aws-sdk');
const s3 = new AWS.S3();
const params = {Bucket: 'yourbucket', Key: 'somefile.txt'};
const readStream = s3.getObject(params).createReadStream();
const lineReader = readline.createInterface({
  input: readStream,
});
lineReader.on('line', (line) => console.log(line));
You can use S3 Select to accomplish this. It also works on Parquet files.
https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-glacier-select-sql-reference-select.html
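For illustration, a hedged sketch of that approach with the aws-sdk used above; the bucket, key, and the s.id field name are assumptions about how the messages are laid out (one JSON document per line):
const AWS = require('aws-sdk');
const s3 = new AWS.S3();

// Hypothetical sketch: let S3 Select return only the line whose id matches.
s3.selectObjectContent({
  Bucket: 'yourbucket',
  Key: 'somefile.txt',
  ExpressionType: 'SQL',
  Expression: "SELECT * FROM S3Object s WHERE s.id = 'some-key'",
  InputSerialization: { JSON: { Type: 'LINES' } }, // one JSON message per line
  OutputSerialization: { JSON: {} },
}, (err, data) => {
  if (err) throw err;
  data.Payload.on('data', (event) => {
    if (event.Records) process.stdout.write(event.Records.Payload.toString());
  });
});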