I can't get Hadoop to start using Amazon EC2/S3 - amazon-s3

I have created an AMI image and installed Hadoop from the Cloudera CDH2 build. I configured my core-site.xml like so:
<property>
  <name>fs.default.name</name>
  <value>s3://<BUCKET NAME>/</value>
</property>
<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value><ACCESS ID></value>
</property>
<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value><SECRET KEY></value>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/var/lib/hadoop-0.20/cache/${user.name}</value>
</property>
But I get the following error message in the namenode log when I start up the Hadoop daemons:
2010-11-03 23:45:21,680 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.lang.IllegalArgumentException: Invalid URI for NameNode address (check fs.default.name): s3://<BUCKET NAME>/ is not of scheme 'hdfs'.
at org.apache.hadoop.hdfs.server.namenode.NameNode.getAddress(NameNode.java:177)
at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:198)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:306)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1006)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1015)
2010-11-03 23:45:21,691 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
However, I am able to execute hadoop commands from the command line like so:
hadoop fs -put sun-javadb-common-10.5.3-0.2.i386.rpm s3://<BUCKET NAME>/
hadoop fs -ls s3://poc-jwt-ci/
Found 3 items
drwxrwxrwx - 0 1970-01-01 00:00 /
-rwxrwxrwx 1 16307 1970-01-01 00:00 /sun-javadb-common-10.5.3-0.2.i386.rpm
drwxrwxrwx - 0 1970-01-01 00:00 /var
You will notice there are / and /var folders in the bucket. I ran hadoop namenode -format when I first saw this error, then restarted all services, but I still receive the same weird error: Invalid URI for NameNode address (check fs.default.name): s3://<BUCKET NAME>/ is not of scheme 'hdfs'.
I also notice that the file system created looks like this:
hadoop fs -ls s3://<BUCKET NAME>/var/lib/hadoop-0.20/cache/hadoop/mapred/system
Found 1 items
-rwxrwxrwx 1 4 1970-01-01 00:00 /var/lib/hadoop0.20/cache/hadoop/mapred/system/jobtracker.info
Any ideas of what's going on?

First, I suggest you just use Amazon Elastic MapReduce. There is zero configuration required on your end. EMR also has a few internal optimizations and built-in monitoring that work to your benefit.
Second, do not use s3: as your default FS. For one, S3 is too slow to be used for storing intermediate data between jobs (a typical unit of work in Hadoop is a dozen to dozens of MR jobs). It also stores the data in a 'proprietary' format (blocks etc.), so external apps can't effectively touch the data in S3.
Note that s3: in EMR is not the same as s3: in the standard Hadoop distro. The Amazon guys actually alias s3: to s3n: (s3n: is just raw/native S3 access).
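To make this concrete: if you keep HDFS as the default filesystem (so the NameNode can actually start) and address S3 explicitly with s3n:// URIs, core-site.xml might look like the sketch below. The host name, port, and bucket are hypothetical placeholders, not values from the question:

```xml
<!-- Sketch only: keep hdfs:// as the default FS so the NameNode can start. -->
<!-- "namenode-host" and "my-bucket" are hypothetical placeholders. -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://namenode-host:8020</value>
</property>
<!-- Credentials for s3n:// URIs used explicitly as job input/output paths. -->
<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>YOUR_ACCESS_KEY_ID</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>YOUR_SECRET_ACCESS_KEY</value>
</property>
```

With this layout you would address S3 data explicitly, e.g. hadoop fs -ls s3n://my-bucket/, instead of making S3 the cluster's default filesystem.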

You could also use Apache Whirr for this workflow like this:
Start by downloading the latest release (0.7.0 at this time) from http://www.apache.org/dyn/closer.cgi/whirr/
Extract the archive and try to run ./bin/whirr version. You need to have Java installed for this to work.
Make your Amazon AWS credentials available as environment variables:
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
Update the Hadoop EC2 config to match your needs by editing recipes/hadoop-ec2.properties. Check the Configuration Guide for more info.
Start a Hadoop cluster by running:
./bin/whirr launch-cluster --config recipes/hadoop-ec2.properties
You can see verbose logging output by doing tail -f whirr.log
Now you can login to your cluster and do your work.
./bin/whirr list-cluster --config recipes/hadoop-ec2.properties
ssh namenode-ip
Start jobs as needed, or copy data to/from S3 using distcp.
For more explanations you should read the Quick Start Guide and the 5-minute guide.
Disclaimer: I'm one of the committers.

I think you should not execute bin/hadoop namenode -format, because it is used to format HDFS. In later versions, Hadoop has moved these functions into a separate script called bin/hdfs. After you set the configuration parameters in core-site.xml and the other configuration files, you can use S3 as the underlying file system directly.

Use
fs.defaultFS = s3n://awsAccessKeyId:awsSecretAccessKey@BucketName in your /etc/hadoop/conf/core-site.xml
Then do not start your datanode or namenode. If you have services that need your datanode and namenode, this will not work.
I did this and can access my bucket using commands like
sudo hdfs dfs -ls /
Note that if your awsSecretAccessKey contains a "/" character, you will have to URL-encode it.
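Since slashes are common in AWS secret keys, here is a small sketch of percent-encoding the "/" characters before embedding the secret in the URI. The key shown is made up, and this handles only the slash case mentioned above, not full URL-encoding:

```shell
# Percent-encode "/" in a (made-up) secret key so it can be embedded
# in the s3n:// URI without being mistaken for a path separator.
secret='abc/def+ghi/jkl'
encoded=$(printf '%s' "$secret" | sed 's,/,%2F,g')
echo "$encoded"   # abc%2Fdef+ghi%2Fjkl
```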

Use s3n instead of s3.
hadoop fs -ls s3n://<BUCKET NAME>/etc

rclone failing with "AccessControlListNotSupported" on cross-account copy -- AWS CLI Works

Quick Summary now that I think I see the problem:
rclone seems to always send an ACL with a copy request, with a default value of "private". This fails against a (2022) default AWS bucket, which (correctly) assumes "No ACL". I need a way to suppress the ACL send in rclone.
Detail
I assume an IAM role and attempt an rclone copy from a data-center Linux box to a default-options, private, no-ACL bucket in the same account as the role I assume. It succeeds.
I then configure a default-options, private, no-ACL bucket in another account than the role I assume. I attach a bucket policy to the cross-account bucket that trusts the role I assume. The role I assume has global permissions to write S3 buckets anywhere.
I test the cross-account bucket policy by using the AWS CLI to copy the same linux box source file to the cross-account bucket. Copy works fine with AWS CLI, suggesting that the connection and access permissions to the cross account bucket are fine. DataSync (another AWS service) works fine too.
Problem: an rclone copy fails with the AccessControlListNotSupported error below.
status code: 400, request id: XXXX, host id: ZZZZ
2022/08/26 16:47:29 ERROR : bigmovie: Failed to copy: AccessControlListNotSupported: The bucket does not allow ACLs
status code: 400, request id: XXXX, host id: YYYY
And of course it is true that the bucket does not allow ACLs ... which is the desired best practice and the AWS default for new buckets. However, the bucket does support a bucket policy that trusts my assumed role, and that role and bucket policy pair works just fine with the AWS CLI copy across accounts, but not with the rclone copy.
Given that AWS CLI copies just fine cross account to this bucket, am I missing one of rclone's numerous flags to get the same behaviour? Anyone think of another possible cause?
I tested older, current, and beta rclone versions; all behave the same.
Version Info
os/version: centos 7.9.2009 (64 bit)
os/kernel: 3.10.0-1160.71.1.el7.x86_64 (x86_64)
os/type: linux
os/arch: amd64
go/version: go1.18.5
go/linking: static
go/tags: none
Failing Command
$ rclone copy bigmovie s3-standard:SOMEBUCKET/bigmovie -vv
Failing RClone Config
type = s3
provider = AWS
env_auth = true
region = us-east-1
endpoint = https://bucket.vpce-REDACTED.s3.us-east-1.vpce.amazonaws.com
#server_side_encryption = AES256
storage_class = STANDARD
#bucket_acl = private
#acl = private
Note that I've tested all permutations of the commented-out lines with similar results.
Note that I have tested with and without the private endpoint listed with same results for both AWS CLI and rclone, e.g. CLI works, rclone fails.
A log from the command with the -vv flag
2022/08/25 17:25:55 DEBUG : Using config file from "PERSONALSTUFF/rclone.conf"
2022/08/25 17:25:55 DEBUG : rclone: Version "v1.55.1" starting with parameters ["/usr/local/rclone/1.55/bin/rclone" "copy" "bigmovie" "s3-standard:SOMEBUCKET" "-vv"]
2022/08/25 17:25:55 DEBUG : Creating backend with remote "bigmovie"
2022/08/25 17:25:55 DEBUG : fs cache: adding new entry for parent of "bigmovie", "MYDIRECTORY/testbed"
2022/08/25 17:25:55 DEBUG : Creating backend with remote "s3-standard:SOMEBUCKET/bigmovie"
2022/08/25 17:25:55 DEBUG : bigmovie: Need to transfer - File not found at Destination
2022/08/25 17:25:55 ERROR : bigmovie: Failed to copy: s3 upload: 400 Bad Request: <?xml version="1.0" encoding="UTF-8"?>
<Error><Code>AccessControlListNotSupported</Code><Message>The bucket does not allow ACLs</Message><RequestId>8DW1MQSHEN6A0CFA</RequestId><HostId>d3Rlnx/XezTB7OC79qr4QQuwjgR+h2VYj4LCZWLGTny9YAy985be5HsFgHcqX4azSDhDXefLE+U=</HostId></Error>
2022/08/25 17:25:55 ERROR : Attempt 1/3 failed with 1 errors and: s3 upload: 400 Bad Request: <?xml version="1.0" encoding="UTF-8"?>

Getting error while AWS EKS cluster backup using Velero tool

Please let me know what my mistake is!
I used this command to back up an AWS EKS cluster using the Velero tool, but it's not working:
./velero.exe install --provider aws --bucket backup-archive/eks-cluster-backup/prod-eks-cluster/ --secret-file ./minio.credentials --use-restic --backup-location-config region=minio,s3ForcePathStyle=true,s3Url=s3://backup-archive/eks-cluster-backup/prod-eks-cluster/ --kubeconfig ../kubeconfig-prod-eks --plugins velero/velero-plugin-for-aws:v1.0.0
cat minio.credentials
[default]
aws_access_key_id=xxxx
aws_secret_access_key=yyyyy/zzzzzzzz
region=ap-southeast-1
Getting Error:
../kubectl.exe --kubeconfig=../kubeconfig-prod-eks.txt logs deployment/velero -n velero
time="2020-12-09T09:07:12Z" level=error msg="Error getting backup store for this location" backupLocation=default controller=backup-sync error="backup storage location's bucket name \"backup-archive/eks-cluster-backup/\" must not contain a '/' (if using a prefix, put it in the 'Prefix' field instead)" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/persistence/object_store.go:110" error.function=github.com/vmware-tanzu/velero/pkg/persistence.NewObjectBackupStore logSource="pkg/controller/backup_sync_controller.go:168"
Note: I have tried --bucket backup-archive, but still no luck.
This is the source of your problem: --bucket backup-archive/eks-cluster-backup/prod-eks-cluster/.
The error says the bucket name must not contain a '/'.
This means it cannot contain a slash in the middle of the bucket name (leading/trailing slashes are trimmed, so that's not a problem). Source: https://github.com/vmware-tanzu/velero/blob/3867d1f434c0b1dd786eb8f9349819b4cc873048/pkg/persistence/object_store.go#L102-L111.
If you want to namespace your backups within a bucket, you may use the --prefix parameter. Like so:
--bucket backup-archive --prefix /eks-cluster-backup/prod-eks-cluster/.
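The constraint is easy to check yourself. Here is a small sketch (with made-up names) of the split Velero expects between the bare bucket name and the prefix, mirroring its validation:

```shell
# Velero wants the bare bucket name in --bucket and the path in --prefix.
# These values are illustrative, not taken from a real cluster.
bucket='backup-archive'
prefix='eks-cluster-backup/prod-eks-cluster'

# Reject bucket names with a '/' in them, as Velero's validation does.
case "$bucket" in
  */*) echo "invalid: move the path part into --prefix" ;;
  *)   echo "ok: --bucket $bucket --prefix $prefix" ;;
esac
```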

Fluentd grep + output logs

I have a service, deployed into a Kubernetes cluster, with fluentd set up as a DaemonSet. And I need to diversify the logs it receives so they end up in different S3 buckets.
One bucket would be for all logs generated by Kubernetes and our debug/error-handling code, and another bucket would be for a subset of logs generated by the service, parsed by a structured logger and identified by a specific field in the JSON. Think of it as: one bucket is for machine state and errors, the other is for "user_id created resource image_id at ts" descriptions of user actions.
The service itself is ignorant of fluentd, so I cannot manually set the tag for logs based on which S3 bucket I want them to end up in.
Now, the fluentd.conf I use sets up S3 like this:
<match **>
  # docs: https://docs.fluentd.org/v0.12/articles/out_s3
  # note: this configuration relies on the nodes having an IAM instance profile with access to your S3 bucket
  type copy
  <store>
    type s3
    log_level info
    s3_bucket "#{ENV['S3_BUCKET_NAME']}"
    s3_region "#{ENV['S3_BUCKET_REGION']}"
    aws_key_id "#{ENV['AWS_ACCESS_KEY_ID']}"
    aws_sec_key "#{ENV['AWS_SECRET_ACCESS_KEY']}"
    s3_object_key_format %{path}%{time_slice}/cluster-log-%{index}.%{file_extension}
    format json
    time_slice_format %Y/%m/%d
    time_slice_wait 1m
    flush_interval 10m
    utc
    include_time_key true
    include_tag_key true
    buffer_chunk_limit 128m
    buffer_path /var/log/fluentd-buffers/s3.buffer
  </store>
  <store>
    ...
  </store>
</match>
So, what I would like to do is have something like a grep plugin:
<store>
  type grep
  <regexp>
    key type
    pattern client-action
  </regexp>
</store>
which would send the matching logs to a separate S3 bucket from the one defined for all logs.
I am assuming that user action logs are generated by your service and system logs include docker, kubernetes and systemd logs from the nodes.
I found your example yaml file at the official fluent github repo.
If you check out the folder in that link, you'll see two more files called kubernetes.conf and systemd.conf. These files have source sections where they tag their data.
The match section in fluent.conf matches **, i.e. all logs, and sends them to S3. You want to split your log types here.
Your container logs are being tagged kubernetes.* in kubernetes.conf on this line.
So your above config turns into:
<match kubernetes.*>
  type s3
  # user log s3 bucket
  ...
</match>
and for system logs, match every other tag except kubernetes.*.
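Putting the two match blocks together, a hedged sketch of the full split. The second bucket's env var S3_USER_LOG_BUCKET_NAME is an assumption, and the buffer/format options are the same as in your original store and elided here:

```
# Order matters: fluentd routes each record to the FIRST matching <match>,
# so the kubernetes.* block must come before the catch-all **.
<match kubernetes.*>
  type s3
  # user-action logs; S3_USER_LOG_BUCKET_NAME is a hypothetical env var
  s3_bucket "#{ENV['S3_USER_LOG_BUCKET_NAME']}"
  s3_region "#{ENV['S3_BUCKET_REGION']}"
  ...
</match>

<match **>
  type s3
  # machine-state and error logs, as in your original config
  s3_bucket "#{ENV['S3_BUCKET_NAME']}"
  s3_region "#{ENV['S3_BUCKET_REGION']}"
  ...
</match>
```

If you also need the per-field split (type = client-action) rather than a per-tag split, the grep/rewrite-tag-filter approach you sketched would have to retag records before they reach these match blocks.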

Cloudera CDH 4.6.0 - Hive metastore service not starting

I installed Cloudera CDH 4.6.0 on my CentOS 6.2 Linux server machine (Cloudera Manager 4.8). I am able to start a few services, but I am not able to start the Hive metastore service.
Cloudera is using PostgreSQL as the remote metastore DB. My host name is delvmpll2, but when starting the Hive service, it gives java.net.UnknownHostException: localhost.localdomain.
I edited the hostname in hive-site.xml and restarted all the services, but the same exception still occurs. I could not find the place where Cloudera is picking up this hostname.
Could someone please let me know what might have gone wrong?
Here is the exception
Caused by: java.net.UnknownHostException: localhost.localdomain
at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:195)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
at java.net.Socket.connect(Socket.java:529)
at java.net.Socket.connect(Socket.java:478)
at java.net.Socket.<init>(Socket.java:375)
at java.net.Socket.<init>(Socket.java:189)
at org.postgresql.core.PGStream.<init>(PGStream.java:62)
at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:76)
... 58 more
2014-07-04 07:16:06,354 INFO org.apache.hadoop.hive.metastore.HiveMetaStore: Shutting down hive metastore.
Thanks in advance
Finally, I solved it.
I changed the server_host value in the config.ini file in /etc/cloudera-scm-agent to my host, and after I restarted the services, all the services are running well.

Applications not shown in yarn UI when running mapreduce hadoop job?

I am using Hadoop 2.2. I see that my jobs are completed with success, and I can browse the filesystem to find the output. However, when I browse http://NNode:8088/cluster/apps, I am unable to see any applications that have been completed so far (I ran 3 wordcount jobs, but none of them is shown here).
Are there any configurations that need to be taken into account?
Here is the yarn-site.xml
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>NNode</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<!--
<property>
  <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
-->
Here is mapred-site.xml:
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
I have job history server running too:
jps
4422 NameNode
5452 Jps
4695 SecondaryNameNode
4924 ResourceManager
72802 Jps
5369 JobHistoryServer
After applications are completed, responsibility for them may be moved to the Job History Server, so check the Job History Server URL. It normally listens on port 19888. E.g.
http://<job_history_server_address>:19888/jobhistory
Log directories and log retain durations are configurable in yarn-site.xml. With YARN, even one can aggregate logs to a single (configurable) location.
Sometimes, even though an application is listed, its logs are not available (I am not sure if it's due to some bug in YARN). However, almost every time I was able to get the logs using the command line:
yarn logs -applicationId the_application_id
Although there are multiple options; use help for details:
yarn logs --help
You can refer to Hadoop is not showing my job in the job tracker even though it is running:
conf.set("fs.defaultFS", "hdfs://master:9000");
conf.set("mapreduce.jobtracker.address", "master:54311");
conf.set("mapreduce.framework.name", "yarn");
conf.set("yarn.resourcemanager.address", "master:8032");
I tested in my cluster. It works!