EMR Spark job - usage of HDFS and EBS storage

Does Spark on EMR distribute input data from Amazon S3 to the underlying HDFS?
What are the EBS volumes that are also attached to the nodes used for?

The root EBS volume for each node is used for the operating system and application files. This is a 10 GB volume by default. Additional volumes attached to the core nodes are used for HDFS. Task nodes may have additional volumes, but they do not run HDFS DataNodes and will not store HDFS data.
From the EMR Instance Storage documentation:
Instance store and/or EBS volume storage is used for HDFS data, as well as buffers, caches, scratch data, and other temporary content that some applications may "spill" to the local file system.
Spark will store temporary data in HDFS if configured to do so. You can set properties such as spark.local.dir to control where Spark writes its scratch data.
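For instance, here is a minimal sketch of pointing Spark's scratch space at specific directories from a Java application; the /mnt and /mnt1 paths are assumptions (use whatever volumes are actually mounted on your nodes):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;

public class LocalDirExample {
    public static void main(String[] args) {
        // Point Spark's shuffle/spill scratch space at specific local directories.
        // The /mnt and /mnt1 paths are placeholders for your node's mounted volumes.
        SparkConf conf = new SparkConf()
                .setAppName("local-dir-example")
                .set("spark.local.dir", "/mnt/spark,/mnt1/spark");

        SparkSession spark = SparkSession.builder().config(conf).getOrCreate();
        // ... run the job ...
        spark.stop();
    }
}
```

Note that when running on YARN (the default on EMR), the NodeManager's local directories generally take precedence over spark.local.dir.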
Unless you are specifically writing data to HDFS, you don't need to provision large EBS volumes for core nodes. I suggest launching a cluster with what you estimate you'll need, and then adding additional core nodes as your HDFS requirements increase.
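If you do need to provision HDFS capacity up front, a rough sketch using the AWS SDK for Java (v1) of attaching extra EBS volumes to the core instance group at launch follows; the instance type, volume size, and count are placeholders, not recommendations:

```java
import com.amazonaws.services.elasticmapreduce.model.EbsBlockDeviceConfig;
import com.amazonaws.services.elasticmapreduce.model.EbsConfiguration;
import com.amazonaws.services.elasticmapreduce.model.InstanceGroupConfig;
import com.amazonaws.services.elasticmapreduce.model.InstanceRoleType;
import com.amazonaws.services.elasticmapreduce.model.VolumeSpecification;

public class CoreGroupWithEbs {
    public static InstanceGroupConfig coreGroup() {
        // One extra 200 GB gp2 volume per core instance; HDFS will use it.
        // Size and instance type are placeholders -- size them for your workload.
        EbsConfiguration ebs = new EbsConfiguration()
                .withEbsBlockDeviceConfigs(new EbsBlockDeviceConfig()
                        .withVolumeSpecification(new VolumeSpecification()
                                .withVolumeType("gp2")
                                .withSizeInGB(200))
                        .withVolumesPerInstance(1));

        return new InstanceGroupConfig()
                .withInstanceRole(InstanceRoleType.CORE)
                .withInstanceType("m5.xlarge")
                .withInstanceCount(2)
                .withEbsConfiguration(ebs);
    }
}
```

The returned InstanceGroupConfig would then be passed into a RunJobFlowRequest alongside the master (and any task) instance groups.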

Whether you ask for it or not, HDFS is always spun up by EMR. I couldn't find any documentation explaining why EMR does this, but in my experience EMR first writes to HDFS as temporary storage and then copies that data to S3. Part of the root volume is used to host this HDFS, even if you never checked the HDFS checkbox when spinning up the cluster.

Related

Files related to AIX HACMP Commands

Are there any files which store the output of the HACMP commands below?
cllsgrp
clshowres
According to this document, the cluster configuration files are stored on a shared disk (shared between the nodes).
From here:
The cluster repository disk is used as the central repository for the cluster configuration data. The cluster repository disk must be accessible from all nodes in the cluster and is a minimum of 10 GB in size. Given the importance of the cluster configuration data, the cluster repository disk should be backed up by a redundant and highly available storage configuration.

What is the difference between a data lake with HDFS or S3 in AWS?

I need to build a data lake on AWS, but I don't know exactly how S3 differs from HDFS. I found some answers on the Internet, but I still don't understand the real difference.
I would also like to know if someone has a data lake architecture that uses HDFS and S3 in AWS.
HDFS is only accessible to the Hadoop cluster in which it exists. If the cluster turns off or is terminated, the data in HDFS will be gone.
Data in Amazon S3:
- Remains available at all times (it cannot be 'turned off')
- Is accessible to multiple clusters
- Is accessible to other AWS services, such as Amazon Athena (which is 'Presto as a service', so you might not even need a Hadoop cluster)
- Has multiple storage classes, such as storing less-frequently accessed data at a lower cost
- Does not have storage limits (while HDFS is limited to the storage available in the Hadoop cluster)
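To make the difference concrete, here is a minimal sketch in Java using the Hadoop FileSystem API; the bucket name and paths are hypothetical, and it assumes an S3 connector (EMRFS on EMR, or s3a elsewhere) is available on the classpath:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsVsS3 {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Default filesystem (HDFS on an EMR node): only resolvable while this
        // cluster and its NameNode are alive.
        FileSystem hdfs = FileSystem.get(conf);
        boolean inHdfs = hdfs.exists(new Path("/data/events/2024/01/"));

        // S3 path: independent of any one cluster; any cluster or service with
        // credentials can read the same objects. Bucket name is a placeholder.
        FileSystem s3 = FileSystem.get(URI.create("s3a://my-data-lake/"), conf);
        boolean inS3 = s3.exists(new Path("s3a://my-data-lake/data/events/2024/01/"));

        System.out.println("HDFS copy present: " + inHdfs + ", S3 copy present: " + inS3);
    }
}
```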

ML backup and Amazon S3. How?

I would like to move the MarkLogic backup directory to the cloud. Are there any limitations on pushing full and incremental backups, with journal archiving enabled, to S3-compatible object storage? Is journal archiving supported on S3-compatible cloud storage? What would happen if I put a backup with journal archiving enabled on S3 storage? Will it eventually work, or will I get errors?
Also, please provide a link to the documentation on configuring MarkLogic to point to cloud storage.
You can back up to S3, but if you want journaling enabled, you will need to have the journals written to a different location. Journal archiving is not supported on S3.
The default location for journals is in the backup, but when creating a backup programmatically you can specify a different $journal-archive-path.
Backing Up a Database
The directory you specified can be an operating system mounted directory path, it can be an HDFS path, or it can be an S3 path. For details on using HDFS and S3 storage in MarkLogic, see Disk Storage Considerations in the Query Performance and Tuning Guide.
S3 and MarkLogic
Storage on S3 has an 'eventual consistency' property, meaning that write operations might not be available immediately for reading, but they will be available at some point. Because of this, S3 data directories in MarkLogic have a restriction that MarkLogic does not create Journals on S3. Therefore, MarkLogic recommends that you use S3 only for backups and for read-only forests, otherwise you risk the possibility of data loss. If your forests are read-only, then there is no need to have journals.

How does Apache Hive work on Amazon?

I'm looking to figure out the mechanics of Apache Hive hosted by Amazon. I'm assuming it substitutes HDFS with S3 and Hadoop MapReduce with EMR. Are my assumptions correct?
You are mostly correct. I would say that the most convenient way to run Hive on Amazon is to
replace HDFS with S3. It is practical since the data lives on S3 and you can run a Hadoop/Hive cluster on demand. One drawback is slow write performance, so data transformations will be slow; aggregations are mostly fine.
At the same time, there are other configurations:
Build HDFS over local drives.
Build HDFS over EBS volumes.
Each one comes with its own trade-offs.
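As a concrete illustration of the S3-backed setup described above, here is a rough sketch using the Hive JDBC driver from Java; the HiveServer2 host, credentials, bucket, and table names are all placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveOnS3Example {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Connect to HiveServer2 on the EMR master node (hostname/user are placeholders).
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://emr-master:10000/default", "hadoop", "");
             Statement stmt = conn.createStatement()) {

            // The table's data lives on S3, so it survives cluster termination and
            // can be queried later by a new cluster (or by Athena).
            stmt.execute(
                "CREATE EXTERNAL TABLE IF NOT EXISTS page_views (" +
                "  user_id STRING, url STRING, ts TIMESTAMP) " +
                "STORED AS PARQUET " +
                "LOCATION 's3://my-data-lake/warehouse/page_views/'");

            ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM page_views");
            if (rs.next()) {
                System.out.println("rows: " + rs.getLong(1));
            }
        }
    }
}
```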

Hadoop upload files from local machine to amazon s3

I am working on a Java MapReduce app that has to be able to provide an upload service for some pictures from the local machine of the user to an S3 bucket.
The thing is, the app must run on an EC2 cluster, so I am not sure how I can refer to the local machine when copying the files. The method copyFromLocalFile(..) needs a path on the local machine, which in this case would be the EC2 cluster...
I'm not sure if I stated the problem correctly; can anyone understand what I mean?
Thanks
You might also investigate s3distcp: http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html
Apache DistCp is an open-source tool you can use to copy large amounts of data. DistCp uses MapReduce to copy in a distributed manner—sharing the copy, error handling, recovery, and reporting tasks across several servers. S3DistCp is an extension of DistCp that is optimized to work with Amazon Web Services, particularly Amazon Simple Storage Service (Amazon S3). Using S3DistCp, you can efficiently copy large amounts of data from Amazon S3 into HDFS where it can be processed by your Amazon Elastic MapReduce (Amazon EMR) job flow. You can also use S3DistCp to copy data between Amazon S3 buckets or from HDFS to Amazon S3.
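If the pictures have already landed in HDFS (or in another bucket), S3DistCp can be submitted as an EMR step. Below is a rough sketch using the AWS SDK for Java v1; the cluster ID, bucket, and paths are placeholders:

```java
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
import com.amazonaws.services.elasticmapreduce.model.AddJobFlowStepsRequest;
import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
import com.amazonaws.services.elasticmapreduce.model.StepConfig;

public class S3DistCpStep {
    public static void main(String[] args) {
        AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.defaultClient();

        // command-runner.jar lets an EMR step invoke s3-dist-cp, which is
        // preinstalled on EMR nodes. Cluster ID and paths are placeholders.
        HadoopJarStepConfig s3DistCp = new HadoopJarStepConfig()
                .withJar("command-runner.jar")
                .withArgs("s3-dist-cp",
                          "--src", "hdfs:///pictures/",
                          "--dest", "s3://my-picture-bucket/pictures/");

        emr.addJobFlowSteps(new AddJobFlowStepsRequest()
                .withJobFlowId("j-XXXXXXXXXXXXX")
                .withSteps(new StepConfig()
                        .withName("copy-to-s3")
                        .withActionOnFailure("CONTINUE")
                        .withHadoopJarStep(s3DistCp)));
    }
}
```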
You will need to get the files from the user's machine to at least one node before you will be able to use them in a MapReduce job.
The FileSystem and FileUtil functions refer to paths either on the HDFS or the local disk of one of the nodes in the cluster.
They cannot reference the user's local system. (Maybe if you did some SSH setup... maybe?)
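Once the pictures are on one of the cluster nodes (for example after an scp, or an HTTP upload handled by that node), here is a minimal sketch of pushing them to S3 with the Hadoop FileSystem API; the bucket and directory names are placeholders, and it assumes the S3 connector is configured as it is on EMR:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UploadToS3 {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // "Local" here means the local disk of the node this code runs on,
        // not the end user's machine -- the files must already be on the node.
        Path localDir = new Path("/home/hadoop/pictures/");
        Path s3Dir = new Path("s3a://my-picture-bucket/uploads/");

        FileSystem s3 = FileSystem.get(URI.create("s3a://my-picture-bucket/"), conf);
        // copyFromLocalFile(delSrc, overwrite, src, dst)
        s3.copyFromLocalFile(false, true, localDir, s3Dir);
    }
}
```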