Best approach for setting up AEM S3 Data Store

We have an existing setup of AEM 6.1 which uses TarMK for data storage. To migrate all the assets to S3, I followed all the steps here: https://docs.adobe.com/docs/en/aem/6-1/deploy/platform/data-store-config.html#Data%20Store%20Configurations (Amazon S3 Data Store). Apparently, the data synced to S3, but when I checked the disk usage report, I still see that assets are using disk space, even for existing and newly added assets. What's the purpose of using S3 for assets if they still use disk space? Or am I doing something wrong? How can I verify that my setup is really using S3? Here is my S3DataStore.config:
accessKey="xxxxxxxxxx"
secretKey="xxxxxxxxxx"
s3Bucket="dev-aem-assets-local"
s3Region="eu-west-1"
connectionTimeout="120000"
socketTimeout="120000"
maxConnections="40"
writeThreads="30"
maxErrorRetry="10"
continueOnAsyncUploadFailure=B"true"
cacheSize="0"
minRecordLength="10"
Another question: do I need to do the same setup on the publisher? Or is it OK to just do it on the author and use the publisher as is, replicating the binary data?

There are a few parts to your question, so I'll break the answer into logical blocks. Shout if I miss anything.
Your migration setup is correct, and yes, the S3 data store will still use disk space. That space is the write-through cache.
AEM uses a write-through cache for writing to S3, and all the settings for this cache are in your S3 config file. Any writes to the data store are first written to this cache; asynchronous background threads then upload them to the S3 bucket. This mechanism keeps AEM responsive, as it's not blocked by slow S3 writes. Reads of recently written blobs are also fast, because they don't need slow round trips to S3. In short, S3 I/O is too slow for AEM, so this cache boosts performance. You cannot disable it, as it is required for the asynchronous writes to S3. You can reduce its size, but it's recommended to be at least 50% of your S3 bucket size.
You can verify your S3 setup by looking at your logs for messages related to AWS (grep for aws).
As for the publisher: yes, you need to migrate from your old publisher to the new publisher. Assuming that you are not using binary-less replication, you will need a different S3 bucket for your publisher. In general, you migrate author to author and publisher to publisher for a standard implementation.

You can also verify your S3 data usage by looking at the S3 bucket and the traffic on it. If versioning is enabled on your S3 bucket, all the blobs will show version stamping.
Asynchronous upload of blobs can be monitored in the logs, and IP traffic monitoring will show activity related to your S3 bucket. The most useful check is to watch the network traffic between your AEM server and the S3 endpoint.
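If you want a check from outside AEM, here is a minimal boto3 sketch (it assumes your AWS credentials are available via the default credential chain; the bucket and region are taken from the config above) that simply counts the objects in the data store bucket. A count that grows after uploads confirms blobs are really reaching S3:

import boto3

# Assumes credentials are picked up from the default AWS credential chain.
s3 = boto3.client("s3", region_name="eu-west-1")

# Count the objects in the data store bucket; a growing count after new
# asset uploads confirms AEM is actually writing blobs to S3.
paginator = s3.get_paginator("list_objects_v2")
total = 0
for page in paginator.paginate(Bucket="dev-aem-assets-local"):
    total += page.get("KeyCount", 0)
print(f"objects in bucket: {total}")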

Related

How to stop Adobe Experience Manager from storing binaries into local file system?

We are using Amazon S3 for storing our binaries and we just want to keep references to those binaries in our local file system. Currently, binaries are getting stored in both S3 and the local file system.
Assuming you are referring to the repository/datastore folder while using the S3 data store, that folder is your S3 cache. You can change the size of the cache and, in theory, reduce it to some small number, but you cannot completely disable it. The size is controlled by
cacheSize=<size in bytes>
in your S3 config file.
Note that there is a practical lower limit to this number based on the purge factor parameters. Setting it below 10% of your S3 bucket size will trigger frequent cache purges, and this will slow down your system. Setting it to zero will cause a configuration error on startup.
Just for some background, the path property in your S3 config is the path to the data store on the file system. This is because the S3 data store is implemented as a write-through cache: all the S3 data is written to the file system and then asynchronously uploaded to the S3 bucket. The asynchronous uploads are controlled via other settings in the same file (number of retries, threads, etc.).
This write-through cache gives your application a significant performance boost, as write operations don't suffer from S3 network latency. Ideally, you should configure the cache size and purge ratio according to your disk capacity and performance requirements rather than reducing them to the bare minimum.
UPDATED 28 March 2017
Improved and updated the answer to reflect latest understanding.

File Service Architecture & Cost Analysis

Context
I am developing a webapp that
Takes a URL from the user
Downloads and stores the associated file onto my server
The user can fetch the file from my server at any time before the file eventually expires and is removed
I am planning to deploy this application on AWS. More specifically, using EC2 and S3.
Challenge
I am trying to come up with a design that is both cost-effective and performant to offer this service.
Analysis
The following assumptions are used:
the downloaded file will be available to only one user, the one who provided the URL and initiated the download
the user will only fetch the file once from the server
the file will only stay on the server for at most 24 hours before being removed
the file sizes are in the 100MB - 5GB range
Consider the following application flow:
Internet → EC2: Download the file onto local storage
EC2 → S3: Upload the downloaded file to S3, then delete the local copy on EC2
EC2 → User: Provide the user with a direct URL to fetch from S3
S3 → user: The user fetches the file from S3
S3: file is removed after 24 hours.
In terms of network performance, steps 1 and 2 will be the bottlenecks, as EC2 has limited download and upload bandwidth. Step 4 should not be a problem, since S3 takes care of the bandwidth for transferring the file to the end user.
In terms of costs, the fixed costs are the EC2 instances, and the main variable cost is step 4, where AWS charges $0.09/GB in data transfer. Since the files are removed after 24 hours, the storage fee is comparatively tiny.
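To make the variable cost concrete with the figures above: a single 5 GB file fetched once costs roughly 5 GB × $0.09/GB ≈ $0.45 in data transfer, so 1,000 such downloads per month would add on the order of $450 in egress, dwarfing the cost of storing files that live for only 24 hours.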
Question
Have I correctly identified the performance bottlenecks in this application flow?
Is my cost analysis correct?
Is this the optimal flow in terms of costs? Is there any way to further reduce the cost?
Since step 1 and step 2 (downloading from the Internet and uploading to S3) will be very bandwidth-consuming when multiple large files are downloaded simultaneously, will it significantly affect the responsiveness of my server when serving regular API requests? Should I use one dedicated EC2 instance just for handling API calls from the clients and another dedicated EC2 instance just for downloading and uploading? This will further complicate the design slightly, as I will have to manage the communication between the two instances as well.
Can you use more AWS services? Are you aware of AWS Lambda? https://aws.amazon.com/lambda/details/ It can perform actions in response to events, e.g. it could delete a file from S3 shortly after it is downloaded. http://docs.aws.amazon.com/lambda/latest/dg/with-s3-example.html
This alleviates the need to track downloads and delete them, once you get past the learning curve of AWS Lambda. It can also handle other processing, so you only have to upload to S3 from EC2.
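For illustration only, a minimal sketch of such a delete-after-download function in Python. The trigger wiring is an assumption and not shown here (plain S3 event notifications do not fire on downloads, so you would need something like CloudTrail data events routed through EventBridge, or your own application invoking the function), and the event field names below are placeholders tied to that assumed setup.

import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # Assumed event shape: a CloudTrail GetObject event delivered via
    # EventBridge. Adjust these lookups to match your actual trigger.
    params = event["detail"]["requestParameters"]
    bucket = params["bucketName"]
    key = params["key"]

    # Delete the object now that it has been downloaded once.
    s3.delete_object(Bucket=bucket, Key=key)
    return {"deleted": f"s3://{bucket}/{key}"}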
Regarding cost, S3 has different quality levels, and the "reduced redundancy" might be sufficient for your needs, saving a little money.
How about allowing the client to upload files directly to S3?
Your application would generate a pre-signed url, so that you can control which users can upload files, but after that the client interacts directly with S3. This would remove the costly "download then upload" process in steps 1 & 2.
See this document http://docs.aws.amazon.com/AmazonS3/latest/dev/PresignedUrlUploadObject.html
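As a rough sketch of that approach with boto3 (bucket, key, and expiry below are placeholders): your backend returns a pre-signed URL and the client PUTs the file straight to S3, so it never passes through your EC2 instance.

import boto3

s3 = boto3.client("s3")

def make_upload_url(bucket: str, key: str, expires_in: int = 3600) -> str:
    # Pre-signed PUT URL: the client uploads directly to S3 with an HTTP PUT.
    return s3.generate_presigned_url(
        ClientMethod="put_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires_in,
    )

# Example usage (names are placeholders):
# url = make_upload_url("my-upload-bucket", "uploads/user-123/file.bin")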

block file system on S3

I am a little puzzled; I hope someone can help me out.
We create some ORC files that we would like to query while they are stored on S3.
We noticed that the S3 native filesystem (S3n) does not really work out for this purpose. I am not really sure what the problem is, but my guess is that the reader is not able to seek to specific bytes inside the file, so it has to load the whole file before it can query it.
So we tried storing the files with the s3:// URI, which is a block file system just like HDFS backed by S3, and it worked great.
But I am a little worried after reading up on this source about Amazon EMR, which says:
Amazon S3 block file system (URI path: s3bfs://)
The Amazon S3 block file system is a legacy file storage system. We strongly discourage the use of this system.
Important
We recommend that you do not use this file system because it can trigger a race condition that might cause your cluster to fail. However, it might be required by legacy applications.
EMRFS (URI path: s3://)
EMRFS is an implementation of HDFS used for reading and writing regular files from Amazon EMR directly to Amazon S3.
I am not using EMR - I create my files by launching an EC2 cluster and then use S3 as cold storage - but I am kind of puzzled right now and not sure which filesystem I am using when I store my files on S3 with the URI scheme s3:// - am I using EMRFS, or am I using the deprecated s3bfs filesystem?
Amazon S3 is an object storage system. It is not recommended to "mount" S3 as a filesystem. Amazon Elastic Block Store (EBS) is a block storage system that appears as volumes on Amazon EC2 instances.
When used from Amazon Elastic MapReduce (EMR), Hadoop has extensions that make it easy to work with Amazon S3. However, if you are not using EMR, there is no need to use EMRFS (which is available only on EMR), nor should you use S3 as a block storage system.
The easiest way to use S3 from EC2 is via the AWS Command-Line Interface (CLI). You can copy files to/from S3 by using the aws s3 cp command. There's also a sync command to make it easy to synchronize data to/from S3.
You can also programmatically connect to Amazon S3 via an SDK, so that your app can directly transfer files to/from S3.
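For example, a minimal boto3 sketch that mirrors aws s3 cp in code (the bucket name, keys, and local paths are placeholders):

import boto3

s3 = boto3.client("s3")

# Upload a local ORC file to S3 for cold storage.
s3.upload_file("/data/part-0000.orc", "my-cold-storage-bucket", "orc/part-0000.orc")

# Download it back to local disk when you need to query it.
s3.download_file("my-cold-storage-bucket", "orc/part-0000.orc", "/data/part-0000.orc")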
As to which to choose... typically, applications like to work with files on a local filesystem, so copy your files from S3 to a local device. However, if your app can communicate directly with S3, there will be fewer "moving parts".

Best way of storing and retrieving files - BaaS or S3

We are facing the following dilemma:
Our mobile client application will authenticate users through a BaaS (Backend-as-a-Service) and will then need to send a file to the cloud - specifically to an Amazon EC2 server where the main processing will take place. Since processing of the file might take place later, there is a need to store the files (and there is also the prospect of keeping an archive of them for future use by the users). The question is what you would suggest as the preferred way from the following:
a) send the file to the EC2 server directly which will then issue an Amazon S3 request to save the file there
OR
b) store the file to the BaaS (which in our case is parse.com which uses S3 as its data-storage) and retrieve it later by the EC2 server
The cost of transferring a file from EC2 to S3 and back is zero as long as both are in the same region, which is true in both the a) and b) cases. The problem is that there is a need for mapping each user to the files that he has access to, and a) and b) differ a lot in this respect.
So basically you are sending the file to EC2 - is EC2 processing it and then saving it to S3, or is it just saving it to S3? I used a very easy way of transferring data from EC2 to S3: s3fuse. You can basically mount S3 as a drive on EC2, so when you store something on EC2 it is automatically stored in S3 as well. Might be handy for you.
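If you go with option a), one simple way to handle the user-to-file mapping mentioned in the question is to key each object by user ID. A minimal boto3 sketch, where the bucket name and key scheme are assumptions:

import boto3

s3 = boto3.client("s3")
BUCKET = "my-app-user-files"  # placeholder bucket name

def save_user_file(user_id: str, filename: str, data: bytes) -> str:
    # Store each object under a per-user prefix so ownership is
    # encoded in the key itself.
    key = f"{user_id}/{filename}"
    s3.put_object(Bucket=BUCKET, Key=key, Body=data)
    return key

def list_user_files(user_id: str):
    # List only the keys belonging to this user.
    resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=f"{user_id}/")
    return [obj["Key"] for obj in resp.get("Contents", [])]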

How do I use Amazon's new RRS for S3?

Reduced Redundancy Storage (RRS) is a new service from Amazon that is a bit cheaper than S3 because there is less redundancy.
However, I cannot find any information on how to specify that my data should use RRS rather than standard S3. In fact, there doesn't seem to be any website interface for the S3 service. If I log into AWS, there are only options for EC2, Elastic MapReduce, CloudFront and RDS, none of which I use.
I know this question is old, but it's worth mentioning that Amazon's interface for S3 now has an option to change your files (recursively) to RRS. Select a folder, right-click on it, and under Properties change the storage class to RRS.
You can use S3 Browser to switch to Reduced Redundancy Storage. It allows you to view/edit storage class for a single file or for multiple files. Moreover, you can configure default storage class for the bucket, so S3 Browser will automatically apply predefined storage class for all new files you are uploading through S3 Browser.
If you are using S3 Browser to work with RRS, the following article may be helpful:
Working with Amazon S3 Reduced Redundancy Storage (RRS)
Note: storage class preferences are stored in a local settings file. Other S3 applications use their own ways to store bucket defaults, and currently there is no single standard for this.
All objects in Amazon S3 have a storage class setting. The default setting is STANDARD. You can use an optional header on a PUT request to specify the setting REDUCED_REDUNDANCY.
From: http://aws.amazon.com/s3/faqs/#How_do_I_specify_that_I_want_to_store_my_data_using_RRS
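For uploads done programmatically, a small boto3 sketch of the same idea (bucket and key are placeholders); the storage class is specified per object at PUT time:

import boto3

s3 = boto3.client("s3")

# Upload a new object directly into Reduced Redundancy Storage.
s3.put_object(
    Bucket="my-bucket",
    Key="path/to/object.txt",
    Body=b"hello",
    StorageClass="REDUCED_REDUNDANCY",
)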
If you are looking for a way to convert existing data in Amazon S3, you can use a fairly recent version of boto and a script I wrote. Details are explained on my blog:
http://www.bryceboe.com/2010/07/02/amazon-s3-convert-objects-to-reduced-redundancy-storage/
If you're on a Mac, the free Cyberduck FTP program will do it. Log into S3, right-click on the bucket (or folder, or file), choose 'Info', and change the storage class from 'unknown' or 'regular s3 storage' to 'reduced redundancy storage'. It took about 2 hours to change 30,000 files for me...
If you use boto, you can do this:
key.change_storage_class('REDUCED_REDUNDANCY')
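With the newer boto3 SDK, roughly the same change can be made by copying the object onto itself with a new storage class; a sketch with placeholder bucket and key names:

import boto3

s3 = boto3.client("s3")

# Copy the object over itself, changing only its storage class.
s3.copy_object(
    Bucket="my-bucket",
    Key="path/to/object.txt",
    CopySource={"Bucket": "my-bucket", "Key": "path/to/object.txt"},
    StorageClass="REDUCED_REDUNDANCY",
    MetadataDirective="COPY",
)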