I want to download data from S3 to HDFS. I tried s3cmd, but it's not parallel and therefore slow. I am trying to make hadoop distcp work like this:
hadoop distcp -Dfs.s3n.awsAccessKeyId=[Access Key] -Dfs.s3n.awsSecretAccessKey=[Secret Key] s3n://[account-name]/[bucket]/folder /data
but it gives me:
ipc.Client: Retrying connect to server:
ec2-[ip].compute-1.amazonaws.com/[internal-ip]:9001. Already tried 0 time(s)
distcp is a MapReduce-based job. Make sure the JobTracker service is started. Try
hadoop/bin/start-all.sh
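A quick sanity check before retrying the copy, assuming a classic Hadoop 1.x setup like the one above:
# Confirm the JobTracker daemon is actually running on the master node
jps | grep JobTracker
# If it is not listed, start the Hadoop daemons, then rerun the distcp command above
hadoop/bin/start-all.sh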
I am trying to set up a PostgreSQL database as an external Hive metastore for AWS EMR.
I have tried hosting it on both EC2 and RDS.
I have already tried steps as given here.
But it doesn't go through; EMR fails at the provisioning step itself with the message
On the master instance (instance-id), application provisioning failed
I could not decipher anything from the failure log.
I also copied the PostgreSQL JDBC jar into the paths
/usr/lib/hive/lib/ and /usr/lib/hive/jdbc/
in case EMR doesn't already have it, but still no help!
Then I set up the system by manually editing hive-site.xml and setting these properties (a sketch of the edit is shown after the list):
javax.jdo.option.ConnectionURL
javax.jdo.option.ConnectionDriverName
javax.jdo.option.ConnectionUserName
javax.jdo.option.ConnectionPassword
datanucleus.fixedDatastore
datanucleus.schema.autoCreateTables
and had to run hive --service metatool -listFSRoot.
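For reference, a rough sketch of that manual edit: property blocks like the following go inside <configuration> in /usr/lib/hive/conf/hive-site.xml (the host, port, database name, credentials and the datanucleus values are placeholders for your own setup, not the values I used):
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:postgresql://[host]:5432/[dbname]</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.postgresql.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>[user]</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>[pass]</value>
</property>
<property>
  <name>datanucleus.fixedDatastore</name>
  <value>[true|false]</value>
</property>
<property>
  <name>datanucleus.schema.autoCreateTables</name>
  <value>[true|false]</value>
</property>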
After these manual settings I was able to get EMR to use postgres db as remote metastore.
Is there any way I can make it work using the configuration file as mentioned in official documentation?
Edit:
The configuration setting I am using for the remote MySQL metastore:
classification=hive-site,properties=[javax.jdo.option.ConnectionURL=jdbc:mysql://[host]:3306/[dbname]?createDatabaseIfNotExist=true,javax.jdo.option.ConnectionDriverName=org.mariadb.jdbc.Driver,javax.jdo.option.ConnectionUserName=[user],javax.jdo.option.ConnectionPassword=[pass]]
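For reference, the Postgres analogue of that classification would presumably look like the following (host, database name and credentials are placeholders; as described below, this by itself did not go through cleanly):
classification=hive-site,properties=[javax.jdo.option.ConnectionURL=jdbc:postgresql://[host]:5432/[dbname],javax.jdo.option.ConnectionDriverName=org.postgresql.Driver,javax.jdo.option.ConnectionUserName=[user],javax.jdo.option.ConnectionPassword=[pass]]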
I could never find a clean approach to configure this at the time of EMR startup itself.
The main problem is that EMR initializes the schema with MySQL using the command:
/usr/lib/hive/bin/schematool -initSchema -dbType MySQL
which should be postgres for our case.
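For the record, the corresponding initialization command for Postgres would be the following, which EMR does not run for you (shown only to make the mismatch concrete):
/usr/lib/hive/bin/schematool -initSchema -dbType postgres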
The following manual steps allow you to configure Postgres as the external metastore (consolidated into a script after the list):
1) Start the EMR cluster with the Hive application, using default configurations.
2) Stop Hive using the command:
sudo stop hive-server2
3) Copy the postgresql-jdbc jar (stored in some S3 location) to /usr/lib/hive/lib/ on EMR
4) Overwrite the default hive-site.xml in /usr/lib/hive/conf/ with a custom one containing the JDO configuration for the PostgreSQL instance running on the EC2 node
5) Execute the command:
sudo /usr/lib/hive/bin/schematool -upgradeSchema -dbType postgres
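Put together, steps 2-5 look roughly like this on the master node (the S3 locations of the jar and of the custom hive-site.xml are placeholders for whatever you staged yourself):
# Stop Hive, swap in the Postgres driver and config, then migrate the schema
sudo stop hive-server2
aws s3 cp s3://[my-bucket]/postgresql-jdbc.jar /tmp/
sudo cp /tmp/postgresql-jdbc.jar /usr/lib/hive/lib/
aws s3 cp s3://[my-bucket]/hive-site.xml /tmp/
sudo cp /tmp/hive-site.xml /usr/lib/hive/conf/hive-site.xml
sudo /usr/lib/hive/bin/schematool -upgradeSchema -dbType postgres
# Bring Hive back up
sudo start hive-server2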
I am running a Spark cluster on Amazon EMR. I am running the PageRank example programs on the cluster.
While running the programs on my local machine, I am able to see the output properly. But the same doesn't work on EMR. The S3 folder only shows empty files.
The commands I am using:
For starting the cluster:
aws emr create-cluster --name SparkCluster --ami-version 3.2 --instance-type m3.xlarge --instance-count 2 \
--ec2-attributes KeyName=sparkproj --applications Name=Hive \
--bootstrap-actions Path=s3://support.elasticmapreduce/spark/install-spark \
--log-uri s3://sampleapp-amahajan/output/ \
--steps Name=SparkHistoryServer,Jar=s3://elasticmapreduce/libs/script-runner/script-runner.jar,Args=s3://support.elasticmapreduce/spark/start-history-server
For adding the job:
aws emr add-steps --cluster-id j-9AWEFYP835GI --steps \
Name=PageRank,Jar=s3://elasticmapreduce/libs/script-runner/script-runner.jar,Args=[/home/hadoop/spark/bin/spark-submit,--deploy-mode,cluster,--master,yarn-cluster,--class,SparkPageRank,s3://sampleapp-amahajan/pagerank_2.10-1.0.jar,s3://sampleapp-amahajan/web-Google.txt,2],ActionOnFailure=CONTINUE
After a few unsuccessful attempts... I wrote the job's output to a text file, which is created successfully on my local machine. But I am unable to view the same when I SSH into the cluster. I tried FoxyProxy to view the logs for the instances, but nothing shows up there either.
Could you please let me know where I am going wrong?
Thanks!
How are you writing the text file locally? Generally, EMR jobs save their output to S3, so you could use something like outputRDD.saveAsTextFile("s3n://<MY_BUCKET>"). You could also save the output to HDFS, but storing the results in S3 is useful for "ephemeral" clusters, where you provision an EMR cluster, submit a job, and terminate upon completion.
"While running the programs on my local machine, I am able to see the
output properly. But the same doesn't work on EMR. The S3 folder only
shows empty files"
For the benefit of newbies:
If you are printing output to the console, it will be displayed in local mode, but when you execute on an EMR cluster the reduce operation is performed on worker nodes, and they can't write to the console of the Master/Driver node!
With a proper path you should be able to write the results to S3.
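If you still want to see what the driver printed to the console when running in yarn-cluster mode, you can pull the YARN container logs from the master node instead (the application id below is a placeholder, and this assumes YARN log aggregation is enabled):
# Find the application id of the finished Spark job
yarn application -list -appStates FINISHED
# Dump its aggregated container logs, which include the driver's stdout
yarn logs -applicationId application_1234567890123_0001 | less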
I feel that connecting EMR to Amazon S3 is highly unreliable because of the dependency on network speed.
I can only find links describing an S3 location. I want to use EMR with HDFS - how do I do this?
You can just use hdfs input and output paths like hdfs:///input/.
Say you have a job added to a cluster as follows:
ruby elastic-mapreduce -j $jobflow --jar s3://my-jar-location/myjar.jar --arg s3://input --arg s3://output
Instead, you can have it as follows if you need it to be on HDFS:
ruby elastic-mapreduce -j $jobflow --jar s3://my-jar-location/myjar.jar --arg hdfs:///input --arg hdfs:///output
To interact with HDFS on the EMR cluster, SSH to the master node and execute the usual HDFS commands.
For example, to fetch an output file, you might do the following:
hadoop fs -get hdfs:///output/part-r-00000 /home/ec2-user/firstPartOutputFile
But if you are working with transient clusters, using in-situ HDFS is discouraged, as you will lose the data when the cluster is terminated.
Also, I have benchmarks showing that using S3 or HDFS doesn't make much of a performance difference.
For a workload of ~200GB:
- the job finished in 22 seconds with S3 as the input source
- the job finished in 20 seconds with HDFS as the input source
EMR is super optimized to read/write data from/to S3.
For intermediate steps' output, writing to HDFS is best.
So, if you have 3 steps in your pipeline, you might lay out the input/output as follows (sketched as a command after the list):
Step 1: Input from S3, Output in HDFS
Step 2: Input from HDFS, Output in HDFS
Step 3: Input from HDFS, Output in S3
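As a rough sketch (the jar names, bucket and cluster id are hypothetical), such a pipeline could be submitted like this:
aws emr add-steps --cluster-id j-XXXXXXXXXXXX --steps \
Name=Step1,Jar=s3://my-bucket/step1.jar,Args=[s3://my-bucket/input,hdfs:///tmp/step1-out] \
Name=Step2,Jar=s3://my-bucket/step2.jar,Args=[hdfs:///tmp/step1-out,hdfs:///tmp/step2-out] \
Name=Step3,Jar=s3://my-bucket/step3.jar,Args=[hdfs:///tmp/step2-out,s3://my-bucket/final-output]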
I'm running hive over EMR,
and need to copy some files to all EMR instances.
One way, as I understand it, is to copy the files to the local file system on each node; the other is to copy the files to HDFS. However, I haven't found a simple way to copy straight from S3 to HDFS.
What is the best way to go about this?
The best way to do this is to use Hadoop's distcp command. Example (on one of the cluster nodes):
% ${HADOOP_HOME}/bin/hadoop distcp s3n://mybucket/myfile /root/myfile
This would copy a file called myfile from an S3 bucket named mybucket to /root/myfile in HDFS. Note that this example assumes you are using the S3 file system in "native" mode; this means that Hadoop sees each object in S3 as a file. If you use S3 in block mode instead, you would replace s3n with s3 in the example above. For more info about the differences between native S3 and block mode, as well as an elaboration on the example above, see http://wiki.apache.org/hadoop/AmazonS3.
I found that distcp is a very powerful tool. In addition to being able to use it to copy a large amount of files in and out of S3, you can also perform fast cluster-to-cluster copies with large data sets. Instead of pushing all the data through a single node, distcp uses multiple nodes in parallel to perform the transfer. This makes distcp considerably faster when transferring large amounts of data, compared to the alternative of copying everything to the local file system as an intermediary.
Now Amazon itself has a wrapper implemented over distcp, namely s3distcp.
S3DistCp is an extension of DistCp that is optimized to work with Amazon Web Services (AWS), particularly Amazon Simple Storage Service (Amazon S3). You use S3DistCp by adding it as a step in a job flow. Using S3DistCp, you can efficiently copy large amounts of data from Amazon S3 into HDFS where it can be processed by subsequent steps in your Amazon Elastic MapReduce (Amazon EMR) job flow. You can also use S3DistCp to copy data between Amazon S3 buckets or from HDFS to Amazon S3.
Example: Copy log files from Amazon S3 to HDFS
The following example illustrates how to copy log files stored in an Amazon S3 bucket into HDFS. In this example the --srcPattern option is used to limit the data copied to the daemon logs.
elastic-mapreduce --jobflow j-3GY8JC4179IOJ --jar \
s3://us-east-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar \
--args '--src,s3://myawsbucket/logs/j-3GY8JC4179IOJ/node/,--dest,hdfs:///output,--srcPattern,.*daemons.*-hadoop-.*'
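On later EMR releases the same tool is also preinstalled on the cluster itself as the s3-dist-cp command, so an equivalent copy can be run directly from the master node:
# Same copy as above, run from a shell on the master node
s3-dist-cp --src s3://myawsbucket/logs/j-3GY8JC4179IOJ/node/ \
--dest hdfs:///output \
--srcPattern '.*daemons.*-hadoop-.*'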
Note that according to Amazon, at http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/FileSystemConfig.html "Amazon Elastic MapReduce - File System Configuration", the S3 Block FileSystem is deprecated and its URI prefix is now s3bfs:// and they specifically discourage using it since "it can trigger a race condition that might cause your job flow to fail".
According to the same page, HDFS is now a 'first-class' file system alongside S3, although it is ephemeral (it goes away when the job flow ends).
There are some large datasets (25gb+, downloadable on the Internet) that I want to play around with using Amazon EMR. Instead of downloading the datasets onto my own computer, and then re-uploading them onto Amazon, what's the best way to get the datasets onto Amazon?
Do I fire up an EC2 instance, download the datasets (using wget) into S3 from within the instance, and then access S3 when I run my EMR jobs? (I haven't used Amazon's cloud infrastructure before, so not sure if what I just said makes any sense.)
I recommend the following...
fire up your EMR cluster
elastic-mapreduce --create --alive --other-options-here
log on to the master node and download the data from there
wget http://blah/data
copy into HDFS
hadoop fs -copyFromLocal data /data
There's no real reason to put the original dataset through S3. If you want to keep the results you can move them into S3 before shutting down your cluster.
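For example, to persist a results directory into S3 before terminating (the bucket name and paths are placeholders):
# Copy the job output from HDFS into S3 before the cluster is shut down
hadoop distcp hdfs:///output s3n://my-results-bucket/output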
If the dataset is represented by multiple files you can use the cluster to download it in parallel across the machines. Let me know if this is the case and I'll walk you through it.
Mat
If you're just getting started and experimenting with EMR, I'm guessing you want these on s3 so you don't have to start an interactive Hadoop session (and instead use the EMR wizards via the AWS console).
The best way would be to start a micro instance in the same region as your S3 bucket, download to that machine using wget and then use something like s3cmd (which you'll probably need to install on the instance). On Ubuntu:
wget -O dataset http://example.com/mydataset
sudo apt-get install s3cmd
s3cmd --configure
s3cmd put dataset s3://mybucket/
The reason you'll want your instance and S3 bucket in the same region is to avoid extra data transfer charges. Although you'll be charged for inbound bandwidth to the instance for the wget, the transfer to S3 will be free.
I'm not sure about it, but to me it seems like Hadoop should be able to download files directly from your sources.
Just enter http://blah/data as your input, and Hadoop should do the rest. It certainly works with S3, so why shouldn't it work with HTTP?