I wanted to copy an S3 bucket on Kubernetes nodes as a DaemonSet, as the new node will also get the s3 bucket copy as soon it gets launched,
I prefer an S3 copy to the Kubernetes node because copying S3 to directly to the pod as an AWS API would mean multiple calls as multiple pods require it and it will take time to copy content each time when the pod is launching.
Assuming that your S3 content is static and doesn't change often. I believe more than a DaemonSet it makes more sense to use a one time Job to copy the whole S3 bucket to a local disk. It's not clear how you would signal the kube-scheduler that your node is not ready until the S3 bucket is fully copied. But, perhaps you can taint your node before the job is finished and remove the taint after the job finishes.
Note also that S3 is inherently slow and meant to be used for processing (reading/writing) single files at a time, so if your bucket has a large amount of data it would take a long time to copy to the node disk.
If your S3 content is dynamically (constantly changing) then it would be more challenging since you would have to files in sync. Your apps would probably have to cache architecture where you would go to the local disk to find files and if they are not there, then make a request to S3.
Related
We have an existing setup of AEM 6.1 which uses TarMK for data storage. To migrate the all assets to S3, I followed all steps here: https://docs.adobe.com/docs/en/aem/6-1/deploy/platform/data-store-config.html#Data%20Store%20Configurations (Amazon S3 Data Store). Apparently, the data synced to S3 but when I checked the disk usage report, I still see that assets are using disk space even for existing and newly added assets. What's the purpose of using S3 for assets if they still use the disk space? Or am I doing something wrong? How can I verify that my setup is really using S3? Here is my S3DataStore.config
accessKey="xxxxxxxxxx"
secretKey="xxxxxxxxxx"
s3Bucket="dev-aem-assets-local"
s3Region="eu-west-1"
connectionTimeout="120000"
socketTimeout="120000"
maxConnections="40"
writeThreads="30"
maxErrorRetry="10"
continueOnAsyncUploadFailure=B"true"
cacheSize="0"
minRecordLength="10"
Another question is: Do I need to do the same setup on publisher? Or is it ok just to do it on author and use publisher as is by replicating the binary data?
There are a few parts to your questiob so I'll break down the answer into logical blocks. Shout if I miss anything.
Your setup for migration is correct and S3 will use disk space. This is for the write-through cache.
AEM uses write-through cache for writing to S3 and all the settings for this cache are in your S3 config file. Any writes to data store are first written to this cache. Asynchronous background threads then uploaded to the S3 bucket. This mechanism makes AEM very responsive as it's not blocked by slow S3 writes. Also, data reads for recently written blobs are fast because they don't need slow reads from S3. In short, S3 IO traffic is too slow for AEM so this cache boosts the performance. You cannot disable it as it is required for asynchronous write to S3. You can reduce the size but it's recommended to be at least 50% of your S3 bucket size.
You can verify your S3 setup by looking at your logs for messages related to AWS (grep for aws).
As for publisher, yes you need to migrate from your old publisher to the new publisher. Assuming that you are not using binary-less replication, you will need a different S3 bucket for your publisher. In general, you migrate from author to author and publisher to publisher for a standard implementation.
You can also verify your S3 dat usage by looking at the S3 bucket and the traffic on it. If versioning is enabled on your S3 bucket all the blobs will show version stamping.
Async upload of blobs can be monitored from logs and IP traffic monitoring will show activities related to your S3 bucket. The most useful way is to see the network traffic between your AEM server and S3 end-point.
i am a little puzzled i hope someone can help me out.
we create some ORC-Files that we would like to query while they are stored on S3.
We noticed that the S3 native Filesystem S3n does not really work out for this manner. I am not really sure what the problem is - but my guess is, that the reader is not able to jump to specific bytes inside the file so that he has to load the whole file before he can query it.
So we tried storing the files on S3 (uri s3://) which is a block file system just like HDFS backed by s3 and it worked great.
But i am a little worried after reading up on this source about Amazon EMR which says
Amazon S3 block file system (URI path: s3bfs://)
The Amazon S3 block file system is a legacy file storage system. We strongly discourage the use of this system.
Important
We recommend that you do not use this file system because it can trigger a race condition that might cause your cluster to fail. However, it might be required by legacy applications.
EMRFS (URI path: s3://)
EMRFS is an implementation of HDFS used for reading and writing regular files from Amazon EMR directly to Amazon S3.
I am not using EMR - i create my files by launching an EC2 cluster and then use s3 as a cold storage - but I am kind of puzzled right now and not sure which filesystem I use when I store my files on s3 using the URI scheme s3:// - do i use EMRFS or do i use the deprecated s3bfs filesystem?
Amazon S3 is an object storage system. It is not recommended to "mount" S3 as a filesystem. Amazon Elastic Block Store (EBS) is a block storage system that appears as volumes on Amazon EC2 instances.
When used from Amazon Elastic MapReduce (EMR), Hadoop has extensions that make it easy to work with Amazon S3. However, if you are not using EMR, there is no need to use EMRFS (which is available only on EMR), nor should you use S3 as a block storage system.
The easiest way to use S3 from EC2 is via the AWS Command-Line Interface (CLI). You can copy files to/from S3 by using the aws s3 cp command. There's also a sync command to make it easy to syncrhonize data to/from S3.
You can also programmatically connect to Amazon S3 via an SDK, so that your app can directly transfer files to/from S3.
As to which to choose... typically, applications like to work with files on a local filesystem, so copy your files from S3 to a local device. However, if your app can directly communicate with S3, there will be less "moving parts".
We are facing the following dilemma:
Our mobile client application will be user-authenticating through a BaaS (Backend-as-a-Service) and will then need to send a file to the cloud - specifically an Amazon EC2 server where the main processing will take place. Since the time of processing of the file might take place later, there is a need to store the files (and there is also a prospect of keeping an archive of them for future use by the users). The question is what would you suggest as the preferred way from the following:
a) send the file to the EC2 server directly which will then issue an Amazon S3 request to save the file there
OR
b) store the file to the BaaS (which in our case is parse.com which uses S3 as its data-storage) and retrieve it later by the EC2 server
The cost of transferring a file from EC2 to S3 and inverse is 0 as long as both are on the same region which in both a) and b) cases is true. The problem is that there is a need for mapping each user to the files that he has access to and a) and b) differ a lot in this case.
So basically you are sending file to EC2 - EC2 is processing it and saving it to S3 or it is just saving it to S3 ??.I used a very easy way of transferring data from Ec2 to S3. i.e. s3fuse :- you can basically mount EC2 drive with S3 , so when you store something on EC2, it will automatically be stored in S3 also. Might be handy for you.
I'm running hive over EMR,
and need to copy some files to all EMR instances.
One way as I understand is just to copy files to the local file system on each node the other is to copy the files to the HDFS however I haven't found a simple way to copy stright from S3 to HDFS.
What is the best way to go about this?
the best way to do this is to use Hadoop's distcp command. Example (on one of the cluster nodes):
% ${HADOOP_HOME}/bin/hadoop distcp s3n://mybucket/myfile /root/myfile
This would copy a file called myfile from an S3 bucket named mybucket to /root/myfile in HDFS. Note that this example assumes you are using the S3 file system in "native" mode; this means that Hadoop sees each object in S3 as a file. If you use S3 in block mode instead, you would replace s3n with s3 in the example above. For more info about the differences between native S3 and block mode, as well as an elaboration on the example above, see http://wiki.apache.org/hadoop/AmazonS3.
I found that distcp is a very powerful tool. In addition to being able to use it to copy a large amount of files in and out of S3, you can also perform fast cluster-to-cluster copies with large data sets. Instead of pushing all the data through a single node, distcp uses multiple nodes in parallel to perform the transfer. This makes distcp considerably faster when transferring large amounts of data, compared to the alternative of copying everything to the local file system as an intermediary.
Now Amazon itself has a wrapper implemented over distcp, namely : s3distcp .
S3DistCp is an extension of DistCp that is optimized to work with
Amazon Web Services (AWS), particularly Amazon Simple Storage Service
(Amazon S3). You use S3DistCp by adding it as a step in a job flow.
Using S3DistCp, you can efficiently copy large amounts of data from
Amazon S3 into HDFS where it can be processed by subsequent steps in
your Amazon Elastic MapReduce (Amazon EMR) job flow. You can also use
S3DistCp to copy data between Amazon S3 buckets or from HDFS to Amazon
S3
Example Copy log files from Amazon S3 to HDFS
This following example illustrates how to copy log files stored in an Amazon S3 bucket into HDFS. In this example the --srcPattern option is used to limit the data copied to the daemon logs.
elastic-mapreduce --jobflow j-3GY8JC4179IOJ --jar \
s3://us-east-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar \
--args '--src,s3://myawsbucket/logs/j-3GY8JC4179IOJ/node/,\
--dest,hdfs:///output,\
--srcPattern,.*daemons.*-hadoop-.*'
Note that according to Amazon, at http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/FileSystemConfig.html "Amazon Elastic MapReduce - File System Configuration", the S3 Block FileSystem is deprecated and its URI prefix is now s3bfs:// and they specifically discourage using it since "it can trigger a race condition that might cause your job flow to fail".
According to the same page, HDFS is now 'first-class' file system under S3 although it is ephemeral (goes away when the Hadoop jobs ends).
On an app server in which a few source files change frequently, Is the following approach recommended?
Use a cron job with S3tools to sync the source files with S3 private bucket (every 15 mins for example).
On server start up - Use user data script to sync with the sources bucket to retrieve the latest sources.
Advantages:
1. No need to attach EBS for app server just to save a few files
2. Similar setup to all app servers
3. Sources automatically backed up.
4. As a byproduct, distributes code to multiple app servers automatically.
Disadvantages:
keeping source code on S3
other?
What do you think about this methodology? Is this the right way to use EC2 when source code change frequently (a few times a day) please recommend the best approach to run EC2 instances where sources change often.
I think you're better off using a proper source code repository, like Subversion or Git, rather than storing the source files on S3. That way you can have a central location for the source files while avoiding the update consistency problems that kdgregory mentioned.
You can put the source repository on one of your own servers outside of EC2, or host it on an EC2 instance (make sure the repository files are on an EBS volume in the latter case).
If you're going to be running a large number of EC2 instances, then it will be less effort to have them sync themselves from a central location (ie, you sync to private bucket, app-servers sync from that bucket).
HOWEVER, recognize that updates to an S3 bucket are atomic only at the object level, and more importantly, are not guaranteed to be immediately consistent (although I recall seeing a recent note that the us-west endpoint does offer read-after-write consistency).
This means that your app-servers may load a set of new files that are internally inconsistent -- some will be old, some will be new. If this is a problem for you, then you should implement a scheme that uploads directly to the app-servers, and ensures changeset consistency (perhaps by uploading to a temporary directory that is then renamed).