How to stop Adobe Experience Manager from storing binaries in the local file system?

We are using Amazon S3 to store our binaries and only want to keep references to those binaries in our local file system. Currently, binaries are stored in both S3 and the local file system.

Assuming you are referring to the repository/datastore folder when using the S3 data store: that is your S3 cache. You can change the size of the cache, and in theory reduce it to some small number, but you cannot completely disable it. Set
cacheSize=<size in bytes>
in your S3 config file.
Note that there is a practical lower limit to this number based on the purge factor parameters. Setting it below 10% of your S3 bucket size will trigger frequent cache purges and slow down your system. Setting it to zero will cause a configuration error on startup.
For some background, the path property in your S3 config is the path to the data store on the file system. This is because the S3 data store is implemented as a write-through cache: all S3 data is first written to the file system and then asynchronously uploaded to the S3 bucket. The asynchronous uploads are controlled by other settings in the same file (number of retries, threads, etc.).
This write-through cache gives your application a significant performance boost, as write operations do not suffer from S3 network latency. Ideally, you should configure the cache size and purge ratio according to your disk capacity and performance requirements rather than reducing the cache to a bare minimum.
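For illustration, a 64 GB cache with near-default purge factors would look roughly like this in S3DataStore.config (property names as documented for the AEM 6.x S3 connector; the exact set and defaults vary by version, so treat this as a sketch):
cacheSize="68719476736"
cachePurgeTrigFactor="0.95"
cachePurgeResizeFactor="0.85"
Purging kicks in once the cache passes the trigger factor (95% here) and removes cached files until usage is back under the resize factor (85%).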
UPDATED 28 March 2017
Improved and updated the answer to reflect the latest understanding.

Related

Streaming compression to S3 bucket with a custom directory structure

I have an application that needs to create a compressed file from different objects stored on S3. The issue I am facing is that I would like to compress the objects on the fly, without downloading the files into a container and compressing them there. The reason is that the files can be quite large, so I could easily run out of disk space, and there would also be the extra round-trip time of downloading the files to disk, compressing them, and uploading the compressed file back to S3.
It is worth mentioning that I would like to place the files in different directories inside the output archive, so that when a user decompresses the file they see the contents organized into folders.
Since S3 does not have the concept of a physical folder structure, I am not sure whether this is possible, or whether there is a better way than downloading and re-uploading the files.
NOTE
My issue is not about how to use AWS Lambda to export a set of big files. It is about how I can export files from S3 without downloading the objects to a local disk, then create a zip file and upload it to S3. I would like to simply zip the files on S3 on the fly and, most importantly, be able to customize the directory structure.
For example,
inputs:
big-file1
big-file2
big-file3
...
output:
big-zip.zip
with the directory structure of:
images/big-file1
images/big-file2
videos/big-file3
...
I have almost the same use case as yours. I researched it for about 2 months and tried multiple approaches, but in the end I had to use ECS (EC2) because the zip file can be huge, around 100 GB.
Currently AWS does not support a native way to perform compression. I have talked to them and they are considering it as a feature, but no timeline has been given yet.
If your files are around 3 GB in size, you can consider Lambda to achieve your requirement.
If your files are more than 4 GB, I believe it is safer to do it with ECS or EC2, attaching more volume if more space/memory is required for compression.
Thanks,
Yes, there are at least two ways: either using AWS-Lambda or AWS-EC2
EC2
Since aws-cli supports streaming to and from stdin/stdout with the cp command, you can pipe an S3 file to any archiver using a Unix pipe, e.g.:
aws s3 cp s3://yours-bucket/huge_file - | gzip | aws s3 cp - s3://yours-bucket/compressed_file
AWS-Lambda
Since maintaining an EC2 instance just for compression may be too expensive, you can use Lambda for one-off compressions.
But keep in mind that Lambda has an execution time limit of 15 minutes. So, if your files are really huge, try this sequence (a rough sketch follows below):
To make sure the file gets compressed within that limit, compress it in parts using Lambda
The compressed parts can then be merged into one file on S3 using Upload Part - Copy
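As a rough sketch of that idea in Python with boto3 (bucket and key names are placeholders, the part size is just an illustrative choice, and error handling is omitted), a single pass that streams an object through gzip and multipart-uploads the result without touching local disk could look like this:

import zlib
import boto3

s3 = boto3.client("s3")

SRC_BUCKET, SRC_KEY = "yours-bucket", "huge_file"            # placeholder names
DST_BUCKET, DST_KEY = "yours-bucket", "compressed_file.gz"   # placeholder names
PART_SIZE = 8 * 1024 * 1024  # multipart parts must be at least 5 MB (except the last)

def compress_object():
    compressor = zlib.compressobj(wbits=31)  # wbits=31 -> gzip container
    upload = s3.create_multipart_upload(Bucket=DST_BUCKET, Key=DST_KEY)
    parts, buf, part_no = [], b"", 1

    body = s3.get_object(Bucket=SRC_BUCKET, Key=SRC_KEY)["Body"]
    for chunk in body.iter_chunks(chunk_size=1024 * 1024):
        buf += compressor.compress(chunk)
        while len(buf) >= PART_SIZE:
            resp = s3.upload_part(Bucket=DST_BUCKET, Key=DST_KEY,
                                  UploadId=upload["UploadId"],
                                  PartNumber=part_no, Body=buf[:PART_SIZE])
            parts.append({"ETag": resp["ETag"], "PartNumber": part_no})
            buf, part_no = buf[PART_SIZE:], part_no + 1

    buf += compressor.flush()  # gzip trailer, so the last part is never empty
    resp = s3.upload_part(Bucket=DST_BUCKET, Key=DST_KEY,
                          UploadId=upload["UploadId"],
                          PartNumber=part_no, Body=buf)
    parts.append({"ETag": resp["ETag"], "PartNumber": part_no})

    s3.complete_multipart_upload(Bucket=DST_BUCKET, Key=DST_KEY,
                                 UploadId=upload["UploadId"],
                                 MultipartUpload={"Parts": parts})

if __name__ == "__main__":
    compress_object()

The same loop fits inside a Lambda handler as long as the object can be processed within the 15-minute limit; for anything bigger you are back to splitting the work across invocations and merging with Upload Part - Copy as described above.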

Best approach for setting up AEM S3 Data Store

We have an existing setup of AEM 6.1 which uses TarMK for data storage. To migrate all the assets to S3, I followed all the steps here: https://docs.adobe.com/docs/en/aem/6-1/deploy/platform/data-store-config.html#Data%20Store%20Configurations (Amazon S3 Data Store). Apparently the data synced to S3, but when I checked the disk usage report, I still see that assets are using disk space, even for existing and newly added assets. What's the purpose of using S3 for assets if they still use disk space? Or am I doing something wrong? How can I verify that my setup is really using S3? Here is my S3DataStore.config
accessKey="xxxxxxxxxx"
secretKey="xxxxxxxxxx"
s3Bucket="dev-aem-assets-local"
s3Region="eu-west-1"
connectionTimeout="120000"
socketTimeout="120000"
maxConnections="40"
writeThreads="30"
maxErrorRetry="10"
continueOnAsyncUploadFailure=B"true"
cacheSize="0"
minRecordLength="10"
Another question is: Do I need to do the same setup on the publisher? Or is it OK to do it only on the author and use the publisher as is, replicating the binary data?
There are a few parts to your question, so I'll break the answer down into logical blocks. Shout if I miss anything.
Your migration setup is correct, and yes, S3 will still use local disk space. That space is the write-through cache.
AEM uses a write-through cache for writing to S3, and all the settings for this cache are in your S3 config file. Any writes to the data store are first written to this cache, and asynchronous background threads then upload them to the S3 bucket. This mechanism makes AEM very responsive, as it is not blocked by slow S3 writes. Data reads for recently written blobs are also fast, because they do not need slow reads from S3. In short, S3 I/O is too slow for AEM, so this cache boosts performance. You cannot disable it, as it is required for the asynchronous writes to S3. You can reduce its size, but it is recommended to be at least 50% of your S3 bucket size.
You can verify your S3 setup by looking at your logs for messages related to AWS (grep for aws).
As for the publisher, yes, you need to migrate from your old publisher to the new publisher. Assuming you are not using binary-less replication, you will need a different S3 bucket for your publisher. In a standard implementation, you migrate author to author and publisher to publisher.
You can also verify your S3 data usage by looking at the S3 bucket and the traffic on it. If versioning is enabled on your S3 bucket, all the blobs will show version stamps.
Asynchronous upload of blobs can be monitored in the logs, and IP traffic monitoring will show activity related to your S3 bucket. The most useful check is the network traffic between your AEM server and the S3 endpoint.
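If you want an additional quick sanity check from outside AEM, a short boto3 listing of the datastore bucket (bucket name taken from the config in the question; this is just a sketch) shows whether blobs are actually accumulating there:

import boto3

s3 = boto3.client("s3")
count, total_bytes = 0, 0
for page in s3.get_paginator("list_objects_v2").paginate(Bucket="dev-aem-assets-local"):
    for obj in page.get("Contents", []):
        count += 1
        total_bytes += obj["Size"]
print(f"{count} objects, {total_bytes / 1024**3:.1f} GiB in the bucket")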

File Service Architecture & Cost Analysis

Context
I am developing a webapp that
Takes a URL from the user
Downloads and stores the associated file on my server
The user can fetch the file from my server at any time before it eventually expires and is removed
I am planning to deploy this application on AWS, more specifically using EC2 and S3.
Challenge
I am trying to come up with a design for this service that is both cost-effective and performant.
Analysis
The following assumptions are used:
the downloaded file will be available to only one user, the one who provided the URL and initiated the download
the user will only fetch the file once from the server
the file will stay on the server for at most 24 hours before being removed
the file sizes are in the 100 MB - 5 GB range
Consider the following application flow:
Internet → EC2: Download the file onto local storage
EC2 → S3: Upload the downloaded file to S3 and delete the local copy on EC2
EC2 → User: Provide the user with a direct URL to fetch the file from S3
S3 → User: The user fetches the file from S3
S3: file is removed after 24 hours.
In terms of network performance, steps 1 and 2 will be the bottlenecks, as EC2 has limited download and upload bandwidth. Step 4 should not be a problem, since S3 handles the bandwidth for transferring the file to the end user.
In terms of costs, the fixed cost is the EC2 instances, and the main variable cost is step 4, where AWS charges $0.09/GB for data transfer out. Since the files are removed after 24 hours, the storage fee is comparatively tiny.
Question
Have I correctly identified the performance bottlenecks in this application flow?
Is my cost analysis correct?
Is this the optimal flow in terms of costs? Is there any way to further reduce the cost?
Since steps 1 and 2 (downloading from the Internet and uploading to S3) will be very bandwidth-intensive when multiple large files are handled simultaneously, will this significantly affect my server's responsiveness for regular API requests? Should I use one dedicated EC2 instance just for handling API calls from clients and another just for downloading and uploading? This would complicate the design slightly, as I would also have to manage the communication between the two instances.
Can you use more AWS services? Are you aware of AWS Lambda? https://aws.amazon.com/lambda/details/ It can perform actions in response to events; for example, it could delete a file from S3 shortly after it is downloaded. http://docs.aws.amazon.com/lambda/latest/dg/with-s3-example.html
This alleviates the need to track downloads and delete them, once you get past the learning curve of AWS Lambda. It can also handle other processing, so you only have to upload to S3 from EC2.
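A minimal sketch of such a cleanup function (the invocation payload shape here is an assumption; note that S3 itself does not emit "object downloaded" events, so you would invoke this yourself, for example from the API that hands out the download link, or fall back to an S3 lifecycle rule for the 24-hour expiry):

import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # assumed payload shape: {"bucket": "...", "key": "..."}
    s3.delete_object(Bucket=event["bucket"], Key=event["key"])
    return {"deleted": event["key"]}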
Regarding cost, S3 has different storage classes, and "reduced redundancy" might be sufficient for your needs, saving a little money.
How about allowing the client to upload files directly to S3?
Your application would generate a pre-signed URL, so that you can control which users can upload files, but after that the client interacts directly with S3. This would remove the costly "download then upload" process in steps 1 & 2.
See this document http://docs.aws.amazon.com/AmazonS3/latest/dev/PresignedUrlUploadObject.html
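A minimal sketch of that with boto3 (bucket and key names are hypothetical; authorization checks and error handling omitted):

import boto3

s3 = boto3.client("s3")

def presigned_upload_url(bucket, key, expires=3600):
    # time-limited URL the client can HTTP PUT the file body to
    return s3.generate_presigned_url(
        "put_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires,
    )

print(presigned_upload_url("my-upload-bucket", "uploads/user123/file.bin"))

The same call with "get_object" gives the time-limited download link for the "EC2 → User" step of your flow.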

Script to take a S3 bucket, Compress it, push the compressed file to an SFTP server

I have an S3 bucket with about 100 GB of small files (in folders).
I have been requested to back this up to a local NAS on a weekly basis.
I have access to an EC2 instance that is attached to the S3 storage.
My NAS allows me to run an SFTP server.
I also have access to a local server on which I can run a cron job to pull the backup if need be.
How can I best go about this? If possible, I would like to download only the files that have been added or changed, or compress everything on the server end and then push the compressed file to the SFTP server on the NAS.
The end goal is to have a complete backup of the S3 bucket on my NAS with the least amount of transfer each week.
Any suggestions are welcome!
Thanks for your help!
Ryan
I think the most scalable way for you to achieve this is using AWS Elastic MapReduce and Data Pipeline.
The architecture is this way:
You would use Data Pipeline to configure S3 as an input data node, then EC2 with Pig/Hive scripts to do the required processing and send the data to SFTP. Pig can be extended with a custom UDF (user-defined function) to send data to SFTP. You can then set up this pipeline to run at a periodic interval. Having said that, it requires quite a bit of reading to achieve all this, but it is a good skill to have if you foresee future data transformation needs.
Start reading from here:
http://aws.typepad.com/aws/2012/11/the-new-amazon-data-pipeline.html
A similar method can be used for taking periodic backups of DynamoDB to S3, reading files from FTP servers, or processing and moving data to, say, S3/RDS.

AWS EC2- Synching source code files with S3 - is it a proper approach?

On an app server where a few source files change frequently, is the following approach recommended?
Use a cron job with s3tools to sync the source files to a private S3 bucket (every 15 minutes, for example).
On server startup, use a user data script to sync from the sources bucket and retrieve the latest sources.
Advantages:
1. No need to attach an EBS volume to the app server just to store a few files.
2. The same setup for all app servers.
3. Sources are automatically backed up.
4. As a byproduct, code is distributed to multiple app servers automatically.
Disadvantages:
keeping source code on S3
other?
What do you think about this methodology? Is this the right way to use EC2 when source code changes frequently (a few times a day)? Please recommend the best approach for running EC2 instances where the sources change often.
I think you're better off using a proper source code repository, like Subversion or Git, rather than storing the source files on S3. That way you can have a central location for the source files while avoiding the update consistency problems that kdgregory mentioned.
You can put the source repository on one of your own servers outside of EC2, or host it on an EC2 instance (make sure the repository files are on an EBS volume in the latter case).
If you're going to be running a large number of EC2 instances, then it will be less effort to have them sync themselves from a central location (i.e., you sync to the private bucket and the app servers sync from that bucket).
HOWEVER, recognize that updates to an S3 bucket are atomic only at the object level, and more importantly, are not guaranteed to be immediately consistent (although I recall seeing a recent note that the us-west endpoint does offer read-after-write consistency).
This means that your app-servers may load a set of new files that are internally inconsistent -- some will be old, some will be new. If this is a problem for you, then you should implement a scheme that uploads directly to the app-servers, and ensures changeset consistency (perhaps by uploading to a temporary directory that is then renamed).
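A minimal sketch of that temporary-directory-then-rename idea in Python with boto3 (bucket, prefix, and paths are placeholders; it pulls a full copy rather than a delta, and assumes the app reads its sources through a "current" symlink):

import os
import tempfile
import boto3

BUCKET, PREFIX = "my-private-sources-bucket", "app/"          # placeholder names
RELEASES_DIR, CURRENT_LINK = "/opt/app/releases", "/opt/app/current"

def deploy_latest():
    s3 = boto3.client("s3")
    target = tempfile.mkdtemp(dir=RELEASES_DIR)  # fresh directory per deployment
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
        for obj in page.get("Contents", []):
            rel = obj["Key"][len(PREFIX):]
            if not rel or rel.endswith("/"):   # skip "folder" placeholder keys
                continue
            dest = os.path.join(target, rel)
            os.makedirs(os.path.dirname(dest), exist_ok=True)
            s3.download_file(BUCKET, obj["Key"], dest)
    tmp_link = CURRENT_LINK + ".new"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(target, tmp_link)
    os.replace(tmp_link, CURRENT_LINK)  # atomic swap of the "current" pointer

if __name__ == "__main__":
    deploy_latest()

The app never sees a half-updated tree, which addresses the consistency problem on the server side; the bucket contents can still be mixed while an upload is in progress, so upload a complete changeset before triggering a deploy.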