How do I download an S3 file only if it has changed? - amazon-s3

I have a 900 MB file that I'd like to download from S3 to disk, but only if it isn't already downloaded and up to date. Is there an easy way to download the file only when the local copy is missing or stale? I know S3 supports querying the MD5 checksum of objects, but I'm hoping not to have to build this logic myself.

You can use AWS CLI's s3 sync command.
Syncs directories and S3 prefixes. Recursively copies new and updated files from the source directory to the destination.
According to this forum thread, you can use sync to synchronize only one file:
aws s3 sync s3://bucket/path/ local/path/ --exclude "*" --include "File.txt"
This says: sync the given paths, exclude all files, but include "File.txt" - so it will sync only "File.txt" under those paths. Since sync only copies objects that are new or have changed (it compares size and last-modified time), the 900 MB file is downloaded only when the local copy is missing or out of date.
Or with the Java SDK:
According to the javadoc, there is a getObjectMetadata method which will return information about an S3 object (file) without downloading its contents.
The method returns an ObjectMetadata object which can give you some useful information:
getLastModified method:
Gets the value of the Last-Modified header, indicating the date and time at which Amazon S3 last recorded a modification to the associated object.
getContentMD5 method:
Gets the base64 encoded 128-bit MD5 digest of the associated object (content - not including headers) according to RFC 1864.
getETag method:
Gets the hex encoded 128-bit MD5 digest of the associated object according to RFC 1864.
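Putting those pieces together, here is a minimal sketch of the check in code, assuming the AWS SDK for Java v1 and Apache commons-codec for the local MD5. The bucket, key and local path below are placeholders, and the ETag equals the content MD5 only for single-part uploads without SSE-KMS:
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.s3.model.ObjectMetadata;
import org.apache.commons.codec.digest.DigestUtils;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

public class ConditionalDownload {
    public static void main(String[] args) throws IOException {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        String bucket = "bucket";                    // placeholder bucket name
        String key = "path/File.txt";                // placeholder object key
        File local = new File("/data/test/File.txt"); // hypothetical local copy

        // HEAD the object: only metadata is transferred, not the 900 MB content.
        ObjectMetadata meta = s3.getObjectMetadata(bucket, key);

        boolean needsDownload = true;
        if (local.exists()) {
            try (FileInputStream in = new FileInputStream(local)) {
                String localMd5 = DigestUtils.md5Hex(in);
                // ETag matches the content MD5 only for single-part, non-KMS uploads.
                needsDownload = !localMd5.equalsIgnoreCase(meta.getETag());
            }
        }
        if (needsDownload) {
            s3.getObject(new GetObjectRequest(bucket, key), local);
        }
    }
}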

I have used the code below to download S3 files whose timestamp is newer than that of the local folder. It first checks whether any files in the S3 folder have a last-modified time later than the local folder's, and then downloads only those files.
TransferManager transferManager = TransferManagerBuilder.standard().build();
AmazonS3 amazonS3 = AmazonS3ClientBuilder.standard().build();

// Last-modified time of the local folder, used as the cutoff for downloads.
Path location = Paths.get("/data/test/");
FileTime lastModifiedTime = null;
try {
    lastModifiedTime = Files.getLastModifiedTime(location, LinkOption.NOFOLLOW_LINKS);
} catch (IOException e) {
    e.printStackTrace();
}
Date lastUpdatedTime = new Date(lastModifiedTime.toMillis());

// List the objects under the prefix and download only those modified after the local folder.
ObjectListing listing = amazonS3.listObjects("bucket", "test-folder");
List<S3ObjectSummary> summaries = listing.getObjectSummaries();
for (S3ObjectSummary os : summaries) {
    if (os.getLastModified().after(lastUpdatedTime)) {
        try {
            String fileName = "/data/test/" + os.getKey();
            Download download = transferManager.download("bucket", os.getKey(), new File(fileName));
            while (!download.isDone()) {
                Thread.sleep(1000);
            }
        } catch (InterruptedException i) {
            LOG.error("Exception occurred while downloading the file ", i);
        }
    }
}
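Note: a slightly simpler variant, assuming the same SDK, is to replace the polling loop with download.waitForCompletion(), which blocks until the transfer finishes and throws if it fails.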

Related

List out files inside folder in S3 bucket using minio

I am trying to read files from S3 bucket using minio client.
https://docs.min.io/docs/java-client-quickstart-guide.html
I am able to make a connection using this client and can access the bucket as well. Now I need to access a file inside a folder in the bucket, but I am not sure how to do it. I thought that once I had access to the bucket I could list the file names using the File library, but I'm not able to do it.
File path : s3 bucket endpoint/4275/input/test.csv
Code :
public void listS3BucketObject() {
    MinioClient minioClient =
            MinioClient.builder()
                    .endpoint(s3BucketEndpoint)
                    .credentials(s3BucketAccessKey, s3BucketSecretKey)
                    .build();
    String fileUrl = s3BucketEndpoint + "/" + "4275" + "/" + "input";
    File[] fileList = new File(fileUrl).listFiles();
    for (File file : fileList) {
        System.out.println("File name: " + file.getName()); // getting null exception here
    }
}
To list a "folder" (called a prefix in S3 terms), use the listObjects call.
See this for an example: https://docs.min.io/docs/java-client-api-reference.html#listObjects
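For the layout in the question, a minimal sketch using the builder-style API in recent minio-java releases (7.x and later) might look like this; the bucket name ("4275") and prefix ("input/") are guesses based on the path in the question and may need adjusting:
import io.minio.ListObjectsArgs;
import io.minio.MinioClient;
import io.minio.Result;
import io.minio.messages.Item;

public void listS3BucketObjects() throws Exception {
    MinioClient minioClient = MinioClient.builder()
            .endpoint(s3BucketEndpoint)
            .credentials(s3BucketAccessKey, s3BucketSecretKey)
            .build();

    // List every object under the "input/" prefix of bucket "4275".
    Iterable<Result<Item>> results = minioClient.listObjects(
            ListObjectsArgs.builder()
                    .bucket("4275")
                    .prefix("input/")
                    .recursive(true)
                    .build());

    for (Result<Item> result : results) {
        Item item = result.get(); // throws if the listing request failed
        System.out.println("Object name: " + item.objectName());
    }
}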

Uploading Multiple files in AWS S3 from terraform

I want to upload multiple files to AWS S3 from a specific folder on my local device. I am running into an error when I try this.
Here is my terraform code.
resource "aws_s3_bucket" "testbucket" {
bucket = "test-terraform-pawan-1"
acl = "private"
tags = {
Name = "test-terraform"
Environment = "test"
}
}
resource "aws_s3_bucket_object" "uploadfile" {
bucket = "test-terraform-pawan-1"
key = "index.html"
source = "/home/pawan/Documents/Projects/"
}
How can I solve this problem?
As of Terraform 0.12.8, you can use the fileset function to get a list of files for a given path and pattern. Combined with for_each, you should be able to upload every file as its own aws_s3_bucket_object:
resource "aws_s3_bucket_object" "dist" {
for_each = fileset("/home/pawan/Documents/Projects/", "*")
bucket = "test-terraform-pawan-1"
key = each.value
source = "/home/pawan/Documents/Projects/${each.value}"
# etag makes the file update when it changes; see https://stackoverflow.com/questions/56107258/terraform-upload-file-to-s3-on-every-apply
etag = filemd5("/home/pawan/Documents/Projects/${each.value}")
}
See terraform-providers/terraform-provider-aws : aws_s3_bucket_object: support for directory uploads #3020 on GitHub.
Note: This does not set metadata like content_type, and as far as I can tell there is no built-in way for Terraform to infer the content type of a file. This metadata is important for things like HTTP access from the browser working correctly. If that's important to you, you should look into specifying each file manually instead of trying to automatically grab everything out of a folder.
You are trying to upload a directory, whereas Terraform expects a single file in the source field. It is not yet supported to upload a folder to an S3 bucket.
However, you can invoke awscli commands using null_resource provisioner, as suggested here.
resource "null_resource" "remove_and_upload_to_s3" {
provisioner "local-exec" {
command = "aws s3 sync ${path.module}/s3Contents s3://${aws_s3_bucket.site.id}"
}
}
Since June 9, 2020, Terraform has had a built-in way to infer the content type (and a few other attributes) of a file, which you may need as you upload to an S3 bucket.
HCL format:
module "template_files" {
source = "hashicorp/dir/template"
base_dir = "${path.module}/src"
template_vars = {
# Pass in any values that you wish to use in your templates.
vpc_id = "vpc-abc123"
}
}
resource "aws_s3_bucket_object" "static_files" {
for_each = module.template_files.files
bucket = "example"
key = each.key
content_type = each.value.content_type
# The template_files module guarantees that only one of these two attributes
# will be set for each file, depending on whether it is an in-memory template
# rendering result or a static file on disk.
source = each.value.source_path
content = each.value.content
# Unless the bucket has encryption enabled, the ETag of each object is an
# MD5 hash of that object.
etag = each.value.digests.md5
}
JSON format:
{
  "resource": {
    "aws_s3_bucket_object": {
      "static_files": {
        "for_each": "${module.template_files.files}"
        # ...
      }
    }
  }
}
Source: https://registry.terraform.io/modules/hashicorp/dir/template/latest
My objective was to make this dynamic, so whenever I create a folder in a directory, Terraform automatically uploads that new folder and its contents to the S3 bucket with the same key structure.
Here's how I did it.
First you have to build a local variable with a list of each folder and the files under it. Then we can loop through that list to upload each source file to the S3 bucket.
Example: I have a folder called "Directories" with two subfolders called "Folder1" and "Folder2", each with its own files.
- Directories
  - Folder1
    * test_file_1.txt
    * test_file_2.txt
  - Folder2
    * test_file_3.txt
Step 1: Get the local var.
locals {
  folder_files = flatten([for d in flatten(fileset("${path.module}/Directories/*", "*")) : trim(d, "../")])
}
Output looks like this:
folder_files = [
  "Folder1/test_file_1.txt",
  "Folder1/test_file_2.txt",
  "Folder2/test_file_3.txt",
]
Step 2: Dynamically upload the S3 objects.
resource "aws_s3_object" "this" {
for_each = { for idx, file in local.folder_files : idx => file }
bucket = aws_s3_bucket.this.bucket
key = "/Directories/${each.value}"
source = "${path.module}/Directories/${each.value}"
etag = "${path.module}/Directories/${each.value}"
}
This loops over the local var, so in your S3 bucket you end up with the same structure as the local directory, its subdirectories, and their files:
Directories
  - Folder1
    - test_file_1.txt
    - test_file_2.txt
  - Folder2
    - test_file_3.txt

Correct code to upload local file to S3 proxy of API Gateway

I created an API function to work with S3. I imported the Swagger template. After deployment, I tested it from a Node.js project using the npm module aws-api-gateway-client.
It works well for getting bucket lists, getting bucket info, getting one item, creating a bucket, and putting a plain-text object; however, I am blocked on putting a binary file.
Firstly, I made sure the ACL allows all permissions on S3. Secondly, binary support was added for:
image/gif
application/octet-stream
The code snippet is below. The behavior is:
1) After invokeApi, the callback function is never hit; after some time the Node.js project stops responding, with no error message at all. The file (such as an image) is very small.
2) On just two occasions the upload seemed to work, but the resulting file was bigger (around 2 MB bigger) than the original, so the file was corrupt.
Could you help me out? Thank you!
var filepathname = './items/';
var filename = 'image1.png';

fs.stat(filepathname + filename, function (err, stats) {
    var fileSize = stats.size;
    fs.readFile(filepathname + filename, 'binary', function (err, data) {
        var len = data.length;
        console.log('file len' + len);
        var pathTemplate = '/my-test-bucket/' + filename;
        var method = 'PUT';
        var params = {
            folder: '',
            item: ''
        };
        var additionalParams = {
            headers: {
                'Content-Type': 'application/octet-stream',
                //'Content-Type': 'image/gif',
                'Content-Length': len
            }
        };
        var result1 = apigClient.invokeApi(params, pathTemplate, method, additionalParams, data)
            .then(function (result) {
                //never hit :(
                console.log(result);
            }).catch(function (result) {
                //never hit :(
                console.log(result);
            });
    });
});
We encountered the same problem. API Gateway is meant for limited payload sizes (10 MB as of now); the limits are shown here:
http://docs.aws.amazon.com/apigateway/latest/developerguide/limits.html
Pre-signed URL to S3:
Create an S3 pre-signed URL in the Lambda (or whichever endpoint you are trying to post from).
How do I put object to amazon s3 using presigned url?
Now POST the image directly to S3.
Presigned POST:
Apart from posting the image itself, if you want to post additional properties, you can post them in multipart form format as well.
http://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/S3.html#createPresignedPost-property
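A rough sketch of generating the pre-signed POST, assuming the AWS SDK for JavaScript v2 running in the signing Lambda; the bucket name, key and size limit below are placeholders:
// Sketch only: bucket and key are placeholders; runs in the Lambda that signs the upload.
var AWS = require('aws-sdk');
var s3 = new AWS.S3();

var params = {
    Bucket: 'my-test-bucket',                  // hypothetical bucket
    Fields: { key: 'items/image1.png' },       // object key the client must upload to
    Expires: 300,                              // URL validity in seconds
    Conditions: [
        ['content-length-range', 0, 10485760]  // refuse uploads over 10 MB
    ]
};

s3.createPresignedPost(params, function (err, data) {
    if (err) {
        console.error('Failed to create pre-signed POST', err);
        return;
    }
    // data.url and data.fields are returned to the client, which then submits a
    // multipart/form-data POST to data.url with the file appended as the last field.
    console.log(data.url, data.fields);
});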
If you want to process the file after it is delivered to S3, you can create an S3 trigger on object creation and process it with your Lambda or whatever endpoint needs to handle it.
Hope it helps.

File upload checking REAL file mime type when uploading directly to S3

I'm using DropzoneJS to upload files directly to S3. When they add a file, I use my backend to check the MIME type and create the S3 signature. When I say adding a file, I just mean it's been added to the Dropzone queue; the file isn't uploaded yet, it's only sending metadata about the file to the /upload/sign URL.
this.on('addedfile', function (file) {
    $.get('/upload/sign', {
        name: file.name,
        size: file.size,
        type: file.type,
    }).done(function (response) {
        myDropzone.options.url = response.attributes.action;
        file.additionalData = response.additionalData;
        myDropzone.processFile(file);
    }).fail(function (response) {
        var data = JSON.parse(response.responseText);
        myDropzone.emit('error', file, data);
    });
});
This is all good! The problem is the file's mime type is only determined by the file extension, so I can happily rename a file from image.jpg to image.mp3 and file.type will be audio/mp3. This I guess is fine for browser warnings, but not if I want that mp3 to play or if I eventually want to process the audio!
Is there any way of telling the REAL MIME type of the file, without having to pass the upload through the server's file system? I need to upload directly to S3, so passing it through an EC2 instance is not an option.

Files downloaded from Amazon S3 using Knox and Node.js are corrupt

I'm using knox to access my Amazon S3 bucket for file storage. I'm storing all kinds of files - mostly MS Office and PDFs, but they could be binary or any other kind. I'm also using Express 4.13.3 and busboy with connect-busboy for streaming support; uploads are handled with busboy and streamed directly to S3 via knox, so I avoid having to write them to local disk first.
The files upload fine (I can browse and download them manually using Transmit) but I'm having problems downloading.
For clarity I don't want to write the file to local disk, instead keeping it in an in-memory buffer. Here's the code I'm using to handle the GET request:
// instantiate a knox object
var s3client = knox.createClient({
    key: config.AWS.knox.key,
    secret: config.AWS.knox.secret,
    bucket: config.AWS.knox.bucket,
    region: config.AWS.region
});

var buffer = undefined;

s3client.get(path + '/' + fileName)
    .on('response', function (s3res) {
        s3res.setEncoding('binary');
        s3res.on('data', function (chunk) {
            buffer += chunk;
        });
        s3res.on('end', function () {
            buffer = new Buffer(buffer, 'binary');
            var fileLength = buffer.length;
            res.attachment(fileName);
            res.append('Set-Cookie', 'fileDownload=true; path=/');
            res.append('Content-Length', fileLength);
            res.status(s3res.statusCode).send(buffer);
        });
    }).end();
The file downloads to the browser - I'm using John Culviner's jquery.fileDownload.js - but what is downloaded is corrupt and can't be opened. As you can see I'm using express' .attachment to set the headers for mime type and .append for the additional headers (using .set instead makes no difference).
When the file downloads in Chrome I see the message 'Resource interpreted as Document but transferred with MIME type application/vnd.openxmlformats-officedocument.spreadsheetml.sheet:' (for an Excel file), so express is setting the header correctly, and the size of the file downloaded matches that I see when examining the bucket.
Any ideas what's going wrong?
Looks like the contents might not be being sent to the browser as binary. Try something like the following:
if (s3Res.headers['content-type']) {
    res.type(s3Res.headers['content-type']);
}
res.attachment(fileName);

s3Res.setEncoding('binary');
s3Res.on('data', function (data) {
    res.write(data, 'binary');
});
s3Res.on('end', function () {
    res.end();
});
It will also send the data one chunk at a time as it comes in, so it should be a bit more memory efficient.
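An even shorter variant of the same idea, sketched here on the assumption that the knox response is a plain Node readable stream, is to skip the manual data handlers and pipe it straight into the Express response:
// Sketch only: assumes s3res is the HTTP response stream emitted by s3client.get(...).
s3client.get(path + '/' + fileName)
    .on('response', function (s3res) {
        if (s3res.headers['content-type']) {
            res.type(s3res.headers['content-type']); // preserve the stored MIME type
        }
        res.attachment(fileName);
        res.append('Set-Cookie', 'fileDownload=true; path=/');
        res.status(s3res.statusCode);
        s3res.pipe(res);                             // stream the bytes straight through, untouched
    }).end();
Because no encoding is set on the stream, the bytes pass through unchanged, which avoids the kind of corruption that string concatenation can introduce.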