I'm looking to figure out the mechanics of Apache Hive hosted by Amazon. I'm assuming it replaces HDFS with S3 and Hadoop MapReduce with EMR. Are my assumptions correct?
You're mostly correct. I would say the most convenient way to run Hive on Amazon is to replace HDFS with S3. It is practical since the data lives in S3 and you can run a Hadoop/Hive cluster on demand. One drawback is slow write performance, so data transformations will be slow; aggregations are mostly fine. (A sketch of this S3-backed setup follows the list below.)
At the same time, there are other configurations:
Build HDFS over local drives.
Build HDFS over EBS volumes.
Each one comes with its own trade-offs.
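To make the S3-backed option concrete, here's a minimal sketch of defining a Hive external table over data that already sits in S3, using PyHive against HiveServer2 on the EMR master node. The hostname, bucket, table, and schema are placeholders invented for illustration.

```python
# Minimal sketch, assuming a running EMR cluster with HiveServer2 on the master
# node; hostname, bucket, and schema below are made up, not from the question.
from pyhive import hive  # pip install "pyhive[hive]"

conn = hive.Connection(host="emr-master.example.internal", port=10000, username="hadoop")
cur = conn.cursor()

# Point Hive at data already living in S3; on EMR, EMRFS lets Hive read s3://
# paths directly, so nothing is copied into HDFS.
cur.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS page_views (
        user_id STRING,
        url     STRING,
        ts      TIMESTAMP
    )
    STORED AS PARQUET
    LOCATION 's3://my-data-lake/page_views/'
""")

# Aggregations over S3-backed tables are generally fine; heavy rewrites are slower.
cur.execute("SELECT url, COUNT(*) AS hits FROM page_views GROUP BY url LIMIT 10")
print(cur.fetchall())
```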
What would be the best solution to transfer files between S3 and an EC2 instance using Airflow?
After some research I found there is an s3_to_sftp_operator, but I know it's good practice to execute tasks on the external systems instead of on the Airflow instance...
I'm thinking about running a BashOperator that executes the AWS CLI on the remote EC2 instance, since that respects the principle above.
Do you have any production best practices to share for this case?
The s3_to_sftp_operator is going to be the better choice unless the files are large. Only if the files are large would I consider a BashOperator with an SSH onto a remote machine. As for what "large" means, I would just test with the s3_to_sftp_operator, and if the performance of everything else on Airflow isn't meaningfully impacted, then stay with it. I'm regularly downloading and opening ~1 GiB files with PythonOperators in Airflow on 2 vCPU Airflow nodes with 8 GiB RAM. It doesn't make sense to do anything more complex for files that small.
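For illustration, here is a rough sketch of both options in one DAG: the provider's S3ToSFTPOperator for small files, and an SSHOperator that runs the AWS CLI on the EC2 instance for large ones. Import paths and parameter names vary across Airflow and provider versions, and the bucket, keys, and connection IDs below are placeholders.

```python
# Sketch only: import paths and parameter names differ across Airflow/provider
# versions; bucket, keys, and connection IDs are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.transfers.s3_to_sftp import S3ToSFTPOperator
from airflow.providers.ssh.operators.ssh import SSHOperator

with DAG(
    dag_id="s3_to_ec2_transfer",
    start_date=datetime(2024, 1, 1),
    schedule=None,  # Airflow 2.4+; older versions use schedule_interval
    catchup=False,
) as dag:
    # Small files: let the Airflow worker do the copy over SFTP.
    small_file = S3ToSFTPOperator(
        task_id="copy_small_file",
        s3_bucket="my-bucket",
        s3_key="exports/report.csv",
        sftp_path="/home/ec2-user/report.csv",
        sftp_conn_id="ec2_ssh",
    )

    # Large files: push the work onto the EC2 instance itself with the AWS CLI,
    # so the Airflow worker never holds the data.
    large_file = SSHOperator(
        task_id="copy_large_file_on_ec2",
        ssh_conn_id="ec2_ssh",
        command="aws s3 cp s3://my-bucket/exports/big_dump.parquet /data/big_dump.parquet",
    )
```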
The best solution would be not to transfer the files at all, and most likely to get rid of the EC2 instance while you are at it.
If you have a task that needs to run on some data in S3, then just run that task directly in Airflow.
If you can't run that task in Airflow because it needs vast amounts of power or some unusual code that Airflow won't run, then have the EC2 instance read S3 directly.
If you're using Airflow to orchestrate the task because the task is watching the local filesystem on the EC2 instance, then just trigger the task and have it read S3.
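As a rough sketch of "just read S3 in the task", assuming Airflow 2 with the Amazon provider installed; the bucket, key, and connection ID are made up:

```python
# Minimal sketch: read the S3 object directly inside the Airflow task instead
# of copying it anywhere first. Bucket/key/connection names are placeholders.
from datetime import datetime

from airflow.decorators import dag, task
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

@dag(start_date=datetime(2024, 1, 1), schedule=None, catchup=False)
def process_s3_object():
    @task
    def summarize(bucket: str, key: str) -> int:
        # Pull the object straight from S3 and work on it in memory.
        body = S3Hook(aws_conn_id="aws_default").read_key(key=key, bucket_name=bucket)
        return len(body.splitlines())

    summarize("my-bucket", "incoming/events.csv")

process_s3_object()
```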
Does Spark on EMR distribute input data from Amazon S3 to the underlying HDFS?
What is the purpose of the EBS volumes that are also attached to the nodes?
The root EBS volume for each node is used for the operating system and application files. This is a 10 GB volume by default. Additional volumes attached to the core nodes are used for HDFS. Task nodes may have additional volumes, but task nodes do not run HDFS DataNodes and will not store HDFS data.
From the Instance Storage documentation for EMR:
Instance store and/or EBS volume storage is used for HDFS data, as well as buffers, caches, scratch data, and other temporary content that some applications may "spill" to the local file system.
Spark will store temporary data in HDFS only if you explicitly configure it to do so.
You can configure properties like spark.local.dir to set where Spark writes its local scratch and shuffle data (see the sketch below).
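For example, a minimal PySpark sketch of pointing the scratch space at local mounts; the /mnt paths are typical EMR mount points used here only for illustration, and note that on YARN the node manager's local directories usually take precedence over this setting.

```python
# Sketch: direct Spark's scratch space at local disk mounts (paths are examples).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("scratch-dir-example")
    # Shuffle spills, temp files, and block-manager data go here, not to HDFS.
    # On YARN, yarn.nodemanager.local-dirs typically overrides this value.
    .config("spark.local.dir", "/mnt/spark-scratch,/mnt1/spark-scratch")
    .getOrCreate()
)
```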
Unless you are specifically writing data to HDFS, you don't need to provision large EBS volumes for core nodes. I suggest launching a cluster with what you estimate you'll need, and then adding additional core nodes as your HDFS requirements increase.
Whether you specify HDFS or not, it is always spun up by EMR. I couldn't find any documentation on why EMR spins up HDFS, but in my experience EMR first writes to HDFS as temporary storage and then copies that data to S3. Part of the root volume is used to host this HDFS, even if you didn't check the HDFS checkbox when spinning up the cluster.
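A small sketch of that pattern in PySpark, assuming an EMR cluster where s3:// resolves through EMRFS; the bucket and paths are examples only:

```python
# Sketch of the usual EMR pattern: keep intermediate data on the cluster-local
# HDFS and persist only the final result to S3 (paths/bucket are placeholders).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-scratch-s3-output").getOrCreate()

events = spark.read.json("s3://my-bucket/raw/events/")

# Intermediate, throwaway data: cheap and fast on the cluster's HDFS.
events.repartition(64).write.mode("overwrite").parquet("hdfs:///tmp/events_staged/")

staged = spark.read.parquet("hdfs:///tmp/events_staged/")
daily = staged.groupBy("event_date").count()

# Durable output goes to S3, which outlives the cluster.
daily.write.mode("overwrite").parquet("s3://my-bucket/curated/daily_counts/")
```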
I need to build a data lake on AWS, but I don't know exactly how S3 is different from HDFS. I found some answers on the Internet, but I still don't understand the real difference.
I would also like to know if someone has a data lake architecture using HDFS and S3 on AWS.
HDFS is only accessible to the Hadoop cluster in which it exists. If the cluster turns off or is terminated, the data in HDFS will be gone.
Data in Amazon S3:
Remains available at all times (it cannot be 'turned off')
Is accessible to multiple clusters
Is accessible to other AWS services, such as Amazon Athena (which is 'Presto as a service', so you might not even need a Hadoop cluster; see the sketch after this list)
Has multiple storage classes, such as storing less-frequently accessed data at a lower cost
Does not have storage limits (while HDFS is limited to the storage available in the Hadoop cluster)
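To illustrate the Athena point, here is a minimal boto3 sketch that queries S3-resident data without any Hadoop cluster at all; the database, table, bucket, and region are placeholders.

```python
# Sketch of querying data in S3 via Athena (names/region are placeholders).
import boto3

athena = boto3.client("athena", region_name="us-east-1")

resp = athena.start_query_execution(
    QueryString="SELECT url, COUNT(*) AS hits FROM page_views GROUP BY url LIMIT 10",
    QueryExecutionContext={"Database": "my_data_lake"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
print("Query submitted:", resp["QueryExecutionId"])
```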
I am looking for a native offering, such as any of the RDS solutions, Elastic Cache, Amazon Redshift, not something that I would have to host myself.
From the Apache Kudu site (https://kudu.apache.org/):
Kudu provides a combination of fast inserts/updates and efficient columnar scans to enable multiple real-time analytic workloads across a single storage layer. As a new complement to HDFS and Apache HBase, Kudu gives architects the flexibility to address a wider variety of use cases without exotic workarounds.
As I understand it, Kudu is a columnar distributed storage engine for tabular data that allows for fast scans and ad-hoc analytical queries but ALSO allows for random updates and inserts. Every table has a primary key that you can use to find and update single records...
Second answer, after the question was revised.
The answer is Amazon EMR running Apache Kudu.
Amazon EMR is Amazon's service for Hadoop. Apache Kudu is a package that you install on Hadoop along with many others to process "Big Data".
If you are looking for a managed service for only Apache Kudu, then there is nothing. Apache Kudu is an open source tool that sits on top of Hadoop and is a companion to Apache Impala. On AWS both require Amazon EMR running Hadoop version 2.x or greater.
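As a sketch of how Kudu is typically used alongside Impala (not an AWS-managed offering), here is a minimal impyla example; the host, table, and schema are invented for illustration.

```python
# Sketch of using Kudu through Impala from Python with impyla
# (pip install impyla); host, table, and schema are placeholders.
from impala.dbapi import connect

conn = connect(host="impala-coordinator.example.internal", port=21050)
cur = conn.cursor()

# Kudu tables are declared through Impala and require a primary key.
cur.execute("""
    CREATE TABLE IF NOT EXISTS user_profiles (
        user_id BIGINT,
        email   STRING,
        score   DOUBLE,
        PRIMARY KEY (user_id)
    )
    PARTITION BY HASH (user_id) PARTITIONS 4
    STORED AS KUDU
""")

# Random upserts by primary key: the capability that sets Kudu apart from
# plain files in HDFS or S3.
cur.execute("UPSERT INTO user_profiles VALUES (42, 'a@example.com', 0.97)")
cur.execute("SELECT * FROM user_profiles WHERE user_id = 42")
print(cur.fetchall())
```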
I am working on a Java MapReduce app that has to provide an upload service for some pictures from the user's local machine to an S3 bucket.
The thing is, the app must run on an EC2 cluster, so I am not sure how I can refer to the local machine when copying the files. The method copyFromLocalFile(..) needs a path on the local machine, which in this case will be the EC2 cluster...
I'm not sure if I stated the problem correctly; can anyone understand what I mean?
Thanks
You might also investigate s3distcp: http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html
Apache DistCp is an open-source tool you can use to copy large amounts of data. DistCp uses MapReduce to copy in a distributed manner—sharing the copy, error handling, recovery, and reporting tasks across several servers. S3DistCp is an extension of DistCp that is optimized to work with Amazon Web Services, particularly Amazon Simple Storage Service (Amazon S3). Using S3DistCp, you can efficiently copy large amounts of data from Amazon S3 into HDFS where it can be processed by your Amazon Elastic MapReduce (Amazon EMR) job flow. You can also use S3DistCp to copy data between Amazon S3 buckets or from HDFS to Amazon S3.
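For reference, a minimal sketch of submitting S3DistCp as an EMR step with boto3; the cluster ID and paths are placeholders.

```python
# Sketch: run s3-dist-cp as an EMR step via command-runner.jar
# (cluster ID, bucket, and paths are placeholders).
import boto3

emr = boto3.client("emr", region_name="us-east-1")

emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",
    Steps=[{
        "Name": "copy-s3-to-hdfs",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "s3-dist-cp",
                "--src", "s3://my-bucket/pictures/",
                "--dest", "hdfs:///user/hadoop/pictures/",
            ],
        },
    }],
)
```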
You will need to get the files from the user's machine to at least one node before you will be able to use them in a MapReduce job.
The FileSystem and FileUtil functions refer to paths either on HDFS or on the local disk of one of the nodes in the cluster.
They cannot reference the user's local system. (Maybe if you did some SSH setup... maybe?)
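Since the question is really about getting pictures from the user's machine into S3, the piece that runs on the user's side can simply upload straight to S3 with the SDK and let the cluster read from there. A minimal boto3 sketch, with made-up paths and bucket names:

```python
# Sketch: upload from the user's machine directly to S3; the cluster then reads
# from S3 instead of the user's filesystem (path/bucket/key are placeholders).
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    Filename="/home/alice/pictures/holiday.jpg",  # path on the user's machine
    Bucket="my-picture-bucket",
    Key="uploads/alice/holiday.jpg",
)
```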