S3 — Auto generate folder structure? - file-upload

I need to store user-uploaded files in Amazon S3. I'm new to S3, but as I understand from the docs, S3 requires me to specify the upload path (the object key) in the PUT request.
I'm wondering whether there is a way to send a file to S3 and simply get back a link for HTTP(S) access. I'd like Amazon to handle all the headaches of the file/folder structure itself. For example, I just pipe a file from Node.js to S3, and in the callback I get an HTTP link with no expiration date, with Amazon itself creating something like /2014/12/01/.../$hash.jpg and returning the final link. This use case seems quite common.
Is this possible? If not, could you suggest any options to simplify the file storage/filesystem tree structure in S3?
Many thanks.

S3 doesn't have folders, actually. In a normal filesystem, 2014/12/01/blah.jpg would mean you've got a 2014 folder with a folder called 12 inside it and so on, but in S3 the entire 2014/12/01/blah.jpg is the key - essentially a single long filename. You don't have to create any folders.
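Since the key is just a string, you can generate the /2014/12/01/$hash.jpg layout yourself before uploading. Here is a minimal Python sketch of that idea; the function names, the choice of SHA-256, and the bucket name are my own assumptions for illustration, and the actual upload call is left out.

```python
import hashlib
from datetime import datetime, timezone


def build_object_key(data: bytes, extension: str, when: datetime) -> str:
    """Return a key such as 2014/12/01/<sha256-hex>.jpg for the given bytes."""
    digest = hashlib.sha256(data).hexdigest()  # content hash, so identical files share a key
    return f"{when:%Y/%m/%d}/{digest}.{extension.lstrip('.')}"


def public_url(bucket: str, key: str) -> str:
    """Virtual-hosted-style URL; it only works if the object is publicly readable."""
    return f"https://{bucket}.s3.amazonaws.com/{key}"


# "my-example-bucket" is a placeholder, not a real bucket.
key = build_object_key(b"example image bytes", "jpg", datetime(2014, 12, 1, tzinfo=timezone.utc))
print(key)
print(public_url("my-example-bucket", key))
```

You would pass the generated key to whatever PUT/upload call your client library offers; no folder has to exist beforehand.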

Related

Unzip files from S3 before putting them into Snowflake

I have data available in an S3 bucket we don't own, with a zipped folder containing files for each date.
We are using Snowflake as our data warehouse. Snowflake accepts gzip'd files, but does not ingest zip'd folders.
Is there a way to directly ingest the files into Snowflake that will be more efficient than copying them all into our own S3 bucket and unzipping them there, then pointing e.g. Snowpipe to that bucket? The data is on the order of 10GB per day, so copying is very doable, but would introduce (potentially) unnecessary latency and cost. We also don't have access to their IAM policies, so can't do something like S3 Sync.
I would be happy to write something myself, or use a product/platform like Meltano or Airbyte, but I can't find a suitable solution.
How about using SnowSQL to load the data into Snowflake, with a Snowflake table, user, or named stage holding the staged files?
https://docs.snowflake.com/en/user-guide/data-load-local-file-system-create-stage.html
I had a similar use case. I use an event-based trigger that runs a Lambda function every time there is a new zipped file in my S3 folder. The Lambda function opens the zipped file, gzips each individual file, and re-uploads them to a different S3 folder. Here's the full working code: https://betterprogramming.pub/unzip-and-gzip-incoming-s3-files-with-aws-lambda-f7bccf0099c9
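The core of such a function, converting one zip archive into individually gzipped files, can be sketched with the Python standard library alone. This is not the code from the linked article; `regzip_members` is a hypothetical helper name, and downloading the archive from S3 and uploading the results are deliberately left out.

```python
import gzip
import io
import zipfile
from typing import Dict


def regzip_members(zip_bytes: bytes) -> Dict[str, bytes]:
    """Map each file inside a zip archive to its gzip-compressed contents."""
    out: Dict[str, bytes] = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directory entries carry no data
            # Keep the member's path and add .gz so the target key stays recognisable.
            out[info.filename + ".gz"] = gzip.compress(archive.read(info))
    return out
```

This reads the whole archive into memory, which is fine for small files but would probably need a streaming approach for archives close to Lambda's memory limit.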

Simple way to load new files only into Redshift from S3?

The documentation for the Redshift COPY command specifies two ways to choose files to load from S3: you either provide a base path and it loads all the files under that path, or you specify a manifest file listing the specific files to load.
However in our case, which I imagine is pretty common, the S3 bucket periodically receives new files with more recent data. We'd like to be able to load only the files that haven't already been loaded.
Given that there is a table stl_file_scan that logs all the files that have been loaded from S3, it would be nice to somehow exclude those that have successfully been loaded. This seems like a fairly obvious feature, but I can't find anything in the docs or online about how to do this.
Even the Redshift S3 loading template in AWS Data Pipeline appears to manage this scenario by loading all the data -- new and old -- to a staging table, and then comparing/upserting to the target table. This seems like an insane amount of overhead when we can tell up front from the filenames that a file has already been loaded.
I know we could probably move the files that have already been loaded out of the bucket, but we can't do that: this bucket is the final storage place for another process which is not our own.
The only alternative I can think of is to have some other process running that tracks files that have been successfully loaded to Redshift, periodically compares that list to the S3 bucket to determine the differences, and then writes the manifest file somewhere before triggering the copy process. But what a pain! We'd need a separate EC2 instance to run the process, which would have its own management and operational overhead.
There must be a better way!
This is how I solved the problem:
S3 -- (Lambda Trigger on newly created Logs) -- Lambda -- Firehose -- Redshift
It works at any scale. With more load there are more calls to Lambda and more data sent to Firehose, and everything is taken care of automatically.
If there are issues with the format of a file, you can configure dead letter queues: the failed events will be sent there, and you can reprocess them once you fix the Lambda function.
Here are the steps I follow to load data into Redshift:
1. Export local RDBMS data to flat files (make sure you remove invalid characters and apply escape sequences during export).
2. Split the files into 10-15 MB chunks to get optimal performance during upload and the final data load.
3. Compress the files to *.gz format so you don't end up with a $1000 surprise bill :) In my case, text files compressed by a factor of 10-20.
4. List all file names in a manifest file so that when you issue the COPY command to Redshift, they are treated as one unit of load.
5. Upload the manifest file to the Amazon S3 bucket.
6. Upload the local *.gz files to the Amazon S3 bucket.
7. Issue the Redshift COPY command with the options you need.
8. Schedule file archiving from on-premises and from the S3 staging area on AWS.
9. Capture errors and set up restartability in case something fails.
For an easier way of doing this, you can follow this link.
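For the manifest step, combined with the question's idea of skipping files that were already loaded, a rough Python sketch could look like this. It assumes you already have the list of keys in the bucket and the list of keys loaded earlier (for example from your own bookkeeping); `build_manifest` and the bucket name are hypothetical, and only the `entries`/`url`/`mandatory` layout comes from the Redshift manifest format.

```python
import json
from typing import Iterable, List


def build_manifest(bucket: str, all_keys: Iterable[str], loaded_keys: Iterable[str]) -> str:
    """Return a COPY manifest (JSON text) listing only keys not loaded before."""
    already = set(loaded_keys)
    new_keys: List[str] = sorted(k for k in all_keys if k not in already)
    entries = [{"url": f"s3://{bucket}/{k}", "mandatory": True} for k in new_keys]
    return json.dumps({"entries": entries}, indent=2)


# Placeholder bucket and keys, for illustration only.
print(build_manifest("my-example-bucket", ["logs/1.gz", "logs/2.gz"], ["logs/1.gz"]))
```

The resulting text would be uploaded to S3 and referenced in COPY with the MANIFEST option.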
In general, comparing the loaded files against the files that exist on S3 is possible but bad practice. The common "industrial" practice is to put a message queue between the data producer and the data consumer that actually loads the data. Take a look at RabbitMQ, Amazon SQS, and similar.

Query regarding cloud file storage services - can I append data to an existing file?

I am working to create an application where some files will be stored in Amazon S3/Rackspace Cloud Files/other similar cloud file storage providers.
There are a couple of scenarios where it would be easier for me, if I could append data to an existing file... Is this possible? Or do I have to download the file from Amazon S3, then append data to it, and finally upload the modified file back to Amazon S3?
There is no way to append anything to existing files in S3.
You will have to download it and upload it again after modifying.
If you wish, though, you can always upload the new data as a separate object with a tag (a timestamp or a counter), e.g. file_201201011344. Then, when reading, you fetch all files matching your pattern and append them on the client side.
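A small Python sketch of this pattern, assuming minute-resolution timestamps as in the example name above. `part_key`, `read_appended`, and the `fetch` callback are hypothetical names; `fetch` stands in for whatever download call your storage client provides.

```python
from datetime import datetime
from typing import Callable, Iterable


def part_key(base: str, when: datetime) -> str:
    """Key for a new chunk, e.g. file_201201011344."""
    return f"{base}_{when:%Y%m%d%H%M}"


def read_appended(base: str, keys: Iterable[str], fetch: Callable[[str], bytes]) -> bytes:
    """Concatenate all chunks of `base` in timestamp order."""
    # Fixed-width timestamps sort lexicographically in chronological order.
    parts = sorted(k for k in keys if k.startswith(base + "_"))
    return b"".join(fetch(k) for k in parts)


# In-memory stand-in for a bucket, for illustration only.
store = {
    part_key("file", datetime(2012, 1, 1, 13, 45)): b"world",
    part_key("file", datetime(2012, 1, 1, 13, 44)): b"hello ",
}
print(read_appended("file", store.keys(), store.__getitem__))
```

Two chunks written within the same minute would collide under this naming, so a counter or a finer timestamp may be needed in practice.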

How to create folder on Amazon S3 using objective-c

I know this is a question that may have been asked before (at least for Python), but I am still struggling to get this right. I compare my local folder structure and content with what I have stored in my Amazon S3 bucket. Directories that exist locally but not on S3 are to be created in my S3 bucket. It seems that Amazon S3 does not have the concept of a folder, but rather a folder is identified as an empty file of size 0. My question is: how can I easily create a folder in Objective-C by putting an empty file (with a name corresponding to the folder name) on S3 (I use ASIHTTP for my GET and PUT requests)? I want to create the directory explicitly, not implicitly by copying a new file to a non-existing folder. I appreciate your help on this.
Regarding "It seems that Amazon S3 does not have the concept of a folder, but rather a folder is identified as an empty file of size 0":
The / character is often used as a delimiter when keys are used as pathnames. To make a folder called bar in the parent folder foo, create an empty object whose key is foo/bar/, i.e. PUT to the path /foo/bar/ in the bucket.
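The key construction itself is language-independent, so here is a short sketch in Python rather than Objective-C. It only computes which folder marker keys are missing; `folder_marker_key` and `missing_markers` are hypothetical helper names, and the empty-body PUT for each returned key is left to your HTTP or SDK client.

```python
from typing import Iterable, List


def folder_marker_key(*parts: str) -> str:
    """Join path segments into a key that ends with '/', e.g. foo/bar/."""
    cleaned = [p.strip("/") for p in parts if p.strip("/")]
    if not cleaned:
        raise ValueError("at least one non-empty path segment is required")
    return "/".join(cleaned) + "/"


def missing_markers(local_dirs: Iterable[str], existing_keys: Iterable[str]) -> List[str]:
    """Marker keys for local directories that have no marker object on S3 yet."""
    existing = set(existing_keys)
    return [k for k in (folder_marker_key(d) for d in local_dirs) if k not in existing]


print(missing_markers(["foo", "foo/bar"], ["foo/"]))
```

Each key in the result would then be uploaded as a zero-byte object.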
Amazon now has an AWS SDK for Objective C. The S3PutObjectRequest class has the method -initWithKey:inBucket:.

jets3t and Downloading Files from AmazonS3 with Different Name

We're using Amazon S3 for file storage and recently found out that we need to keep some sort of directory structure. Since S3 doesn't allow that, we know we can name the files according to their structure for storage. For example...
abc/123/draft.doc
What I want to know is: if I want to provide a public link to this particular file, is there any way that the downloaded file can simply be named draft.doc instead of abc/123/draft.doc?
I feel stupid. After some more investigation I realized that by creating a GET url to the resource, I get exactly what I need.
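For anyone who needs to set the download name explicitly rather than rely on the last path segment of the URL, S3 also lets a signed GET request override the Content-Disposition response header. The sketch below is in Python rather than jets3t and only builds the request parameters; `download_params` is a hypothetical helper, and passing the result to boto3's `generate_presigned_url("get_object", Params=...)` is one assumed way to use it.

```python
import posixpath
from typing import Dict


def download_params(bucket: str, key: str) -> Dict[str, str]:
    """GetObject parameters that ask S3 to serve the object under its base name."""
    filename = posixpath.basename(key)  # abc/123/draft.doc -> draft.doc
    return {
        "Bucket": bucket,
        "Key": key,
        # Sent as the response-content-disposition query parameter on a signed URL.
        "ResponseContentDisposition": f'attachment; filename="{filename}"',
    }


# Placeholder bucket name, for illustration only.
print(download_params("my-example-bucket", "abc/123/draft.doc"))
```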