Apache Hama on Amazon Elastic MapReduce

I am trying to run Apache Hama on Amazon Elastic MapReduce using the bootstrap script at https://github.com/awslabs/emr-bootstrap-actions/tree/master/hama. However, with one master node and two slave nodes, peer.getNumPeers() in the BSP code reports only 1 peer. I suspect Hama is running in local mode.
Moreover, from the configuration described at https://hama.apache.org/getting_started_with_hama.html, my understanding is that the list of all servers should go into hama-site.xml under the property hama.zookeeper.quorum and also into the groomservers file. I wonder whether the install script configures these properly. I would really appreciate it if anyone could point out whether this is a limitation of the script or whether I am doing something wrong.

#Madhura
Hama doesn't always need the groomservers file to run in fully distributed mode.
The groomservers file is only needed when the Hama cluster is started with start-bspd.sh. The Hama emr-bootstrap-action instead starts a groom server on each slave node using hama-daemon.sh. The install script executes the following:
$ /bin/hama-daemon.sh --config ${HAMA_HOME}/conf start groom
I think you need to check the EMR logs for errors.

Related

"java.lang.NullPointerException" error in JMeter Non-Gui mode

When I execute a recorded JMeter script (versions 5.1.1 and 5.2.1) in non-GUI mode using distributed testing, a "java.lang.NullPointerException" error is displayed while the HTML report is being generated. The JTL file is also created empty, without any data.
Note: this error occurs only when I place a CSV Data Set Config element in the test plan. When I remove or disable it, the HTML and JTL reports are generated without any error. I cannot leave out the CSV Data Set Config element for this test.
Please let me know if there is a solution to this issue.
Thanks in advance.
You're highlighting not the cause but a consequence. You should rather pay attention to the Summariser output, which states summary = 0.
This basically means that no Samplers were executed, so your test script execution on the slaves failed somewhere. First of all I would recommend checking jmeter.log on the master and the jmeter-server.log files on the remote machines; most probably you will be able to figure out the root cause from there.
Quick checklist:
Make sure to use the same Java version on the master and the slaves
Make sure to use the same JMeter version (better the latest one) on the master and the slaves
If your test relies on JMeter Plugins - you need to install all the plugins used in the test onto all the slaves
If you define some properties in user.properties file you need to do the same on all the remote machines (or alternatively pass them via -G command-line argument)
If you're using external 3rd-party files (CSV files, files to be uploaded, etc.) - you will need to manually copy them to the slave machines
Double check Remote hosts and RMI configuration to ensure that the slaves can communicate with the master in order to send Sample Results back to it. Also make sure that the relevant ports are open in Windows Firewall
More information: How to Perform Distributed Testing in JMeter
The issue looks like a problem with the CSV file path. Make sure you are providing the correct path in the CSV Data Set Config. This normally happens when JMeter is not able to read the data from that location.
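Parts of the checklist above can be scripted. The following is only a minimal sketch, not a definitive setup: the function names, the file names (plan.jmx, results.jtl), the host names and the csv_dir property are all hypothetical. The first helper prints a non-GUI distributed run command, using -R for the remote hosts and -G to send properties to them; the second is a check you could run on each slave to confirm that the CSV file is readable there.

```shell
#!/usr/bin/env bash

# Hypothetical helper: prints (does not run) a non-GUI distributed JMeter command.
# Arguments: test plan, comma-separated remote hosts, results file, report dir,
# then any number of key=value properties to send to the remote servers via -G.
jmeter_remote_cmd() {
  local plan="$1" hosts="$2" results="$3" report="$4"
  shift 4
  local cmd="jmeter -n -t $plan -R $hosts -l $results -e -o $report"
  local prop
  for prop in "$@"; do
    cmd="$cmd -G$prop"
  done
  printf '%s\n' "$cmd"
}

# Hypothetical check to run on each slave: succeeds only if the file used by
# CSV Data Set Config exists and is readable on that machine.
csv_readable() {
  [ -r "$1" ]
}
```

For example, jmeter_remote_cmd plan.jmx slave1,slave2 results.jtl report csv_dir=/data prints the command line to run on the master, and csv_readable /data/users.csv (a made-up path) can be run on each slave before starting the test.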

Copy / update the code in docker container without stopping container

I have a docker-compose setup in which I upload the source code for a server, say a Flask API. Whenever I change my Python code, I have to follow these steps:
stop the running containers (docker-compose stop)
build and load the updated code into the container (docker-compose up --build)
This takes quite a long time. Is there a better way, such as updating the code in the running container and then restarting the Apache server without stopping the whole container?
There are a few dirty ways to modify the file system of a running container.
First, you can find the directory that is used as the runtime root for the container. Run docker container inspect id/name and look for the key UpperDir in the JSON output. You can edit, copy, or delete files in that directory.
Another way is to get the process ID of the process running within the container and go to the /proc/process_id/root directory. This is the root directory for the process running inside Docker. You can edit it on the fly and the changes will appear in the container.
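Both lookups can be done with docker container inspect and a format template. This is a rough sketch with hypothetical function names; the UpperDir key is assumed to be present, which is the case for the overlay2 storage driver but may not hold for others.

```shell
#!/usr/bin/env bash

# Print the writable (upper) layer directory of a container.
# Assumption: the overlay2 storage driver; other drivers expose different keys.
container_upper_dir() {
  docker container inspect --format '{{ .GraphDriver.Data.UpperDir }}' "$1"
}

# Print the host-side path to the container's root filesystem via /proc,
# based on the PID of the container's main process.
container_proc_root() {
  local pid
  pid="$(docker container inspect --format '{{ .State.Pid }}' "$1")" || return 1
  printf '/proc/%s/root\n' "$pid"
}
```

Either path usually requires root on the host to read or write. Something like sudo cp app.py "$(container_proc_root my_container)/app/" would then copy a file in, where my_container and the /app path are made-up names.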
You can run the docker build while the old container is still running, and your downtime is limited to the changeover period.
It can be helpful for a couple of reasons to put a load balancer in front of your container. Depending on your application this could be a "dumb" load balancer like HAProxy, or a full Web server like nginx, or something your cloud provider makes available. This allows you to have multiple copies of the application running at once, possibly on different hosts (helps for scaling and reliability). In this case the sequence becomes:
docker build the new image
docker run it
Attach it to the load balancer (now traffic goes to both old and new containers)
Test that the new container works correctly
Detach the old container from the load balancer
docker stop && docker rm the old container
If you don't mind heavier-weight infrastructure, this sequence is basically exactly what happens when you change the image tag in a Kubernetes Deployment object, but adopting Kubernetes is something of a substantial commitment.
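The build-then-swap sequence above could be scripted roughly as follows. This is a sketch under assumptions, not a finished deployment tool: rolling_replace, lb_attach and lb_detach are hypothetical names, and the two load balancer functions are placeholders because the real step depends entirely on which load balancer you use. Setting DRY_RUN=1 prints the commands instead of running them.

```shell
#!/usr/bin/env bash

# Set DRY_RUN=1 to print each command instead of executing it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Hypothetical placeholders: replace the echo with whatever your load balancer
# needs (rewriting an HAProxy/nginx config and reloading, a cloud API call, ...).
lb_attach() { run echo "attach $1 to load balancer"; }
lb_detach() { run echo "detach $1 from load balancer"; }

# rolling_replace IMAGE_TAG OLD_CONTAINER NEW_CONTAINER [BUILD_CONTEXT]
rolling_replace() {
  local image="$1" old="$2" new="$3" context="${4:-.}"
  # The old container keeps serving traffic while the new image builds.
  run docker build -t "$image" "$context" || return 1
  run docker run -d --name "$new" "$image" || return 1
  lb_attach "$new"
  # Test the new container here before taking the old one out of rotation.
  lb_detach "$old"
  run docker stop "$old" && run docker rm "$old"
}
```

Running DRY_RUN=1 rolling_replace api:v2 api_old api_new first is a cheap way to review the exact commands before letting the script touch running containers.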

Download files from FTP to amazon EMR

I need to download files from an FTP server to Amazon EMR. I have a shell script to download the files; it works on Linux machines but not on the Amazon EMR namenode. I am not getting any error; the terminal displays nothing after I run the shell script.
Note: I have enabled the ports on the master security group. I know the other approach of downloading from FTP to S3 and then to Amazon EMR, but I need to download the files directly to Amazon EMR.
I assume you have tried to download the files from the FTP server to Amazon EMR using bootstrap scripts.
To debug what is going wrong, can you connect to the master and slave nodes once they are up and see whether your script runs well? This can help determine whether the script is running or not.
Another way to debug is to run the script manually on the EMR nodes once they are launched and see whether it throws an error.
Hope this helps you debug why the scripts are not running.

Not able to backup the log files during instance termination issued by Auto Scaling Policy

I have EC2 instances with Auto Scaling enabled.
As part of the scale-down policy, when one of the instances is terminated, the log files remaining on that instance need to be backed up to S3, but I cannot find any way to upload the log files to S3 for that instance. I have tried putting the needed script in the rc0.d directory through chkconfig with the highest priority. I also tried putting my script in /lib/systemd/system/halt.service (or reboot.service or poweroff.service), but no luck so far.
I have found some threads related to this on Stack Overflow and the AWS forum, but no proper solution so far.
Can anyone please tell me the solution to this problem?
The only reliable way I have found of achieving this behaviour is to use rsyslog/syslog to transfer the log files to a central host as soon as they are written to the syslog subsystem.
This means you will need to run another instance that receives the log files and ships them to S3, or use an SQS-based system such as logstash.
Unfortunately there is no other way to ensure all of your log messages will be stored on S3 - you can not guarantee that your script will finish before autoscaling "pulls the plug".
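On the instance side, forwarding to a central host comes down to a one-line rsyslog rule. A minimal sketch, assuming a hypothetical receiver name (logs.example.internal) and a made-up drop-in file name; the receiving host and the step that ships logs from there to S3 are not shown.

```shell
#!/usr/bin/env bash

# write_forwarding_conf HOST [PORT] [FILE]
# Writes an rsyslog rule that forwards all messages to HOST over TCP.
# "@@" means TCP; a single "@" would mean UDP.
# The default file name is an assumption, not a required name.
write_forwarding_conf() {
  local host="$1" port="${2:-514}" file="${3:-/etc/rsyslog.d/90-forward.conf}"
  printf '*.* @@%s:%s\n' "$host" "$port" > "$file"
}
```

After calling write_forwarding_conf logs.example.internal as root, rsyslog has to be restarted to pick up the rule. Only messages that pass through syslog are forwarded this way, so an application that writes to its own log files would need to log to syslog as well.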

Fastest / best way to copy data between S3 and EC2?

I have a fairly large amount of data (~30G, split into ~100 files) I'd like to transfer between S3 and EC2: when I fire up the EC2 instances I'd like to copy the data from S3 to EC2 local disks as quickly as I can, and when I'm done processing I'd like to copy the results back to S3.
I'm looking for a tool that'll do a fast / parallel copy of the data back and forth. I have several scripts hacked up, including one that does a decent job, so I'm not looking for pointers to basic libraries; I'm looking for something fast and reliable.
Unfortunately, Adam's suggestion won't work, as his understanding of EBS is wrong (although I wish he were right, and I have often thought it should work that way). EBS has nothing to do with S3; it only gives you an "external drive" for EC2 instances that is separate from, but attachable to, the instances. You still have to copy between S3 and EC2, even though there are no data transfer costs between the two.
You didn't mention an operating system of your instance, so I cannot give tailored information. A popular command line tool I use is http://s3tools.org/s3cmd ... it is based on Python and therefore, according to info on its website it should work on Win as well as Linux, although I use it ALL the time on Linux. You could easily whip up a quick script that uses its built in "sync" command that works similar to rsync, and have it triggered every time you're done processing your data. You could also use the recursive put and get commands to get and put data only when needed.
There are graphical tools like Cloudberry Pro that also have some command line options for Windows, which you can use to set up scheduled commands. http://s3tools.org/s3cmd is probably the easiest.
By now, there is a sync command in the AWS Command line tools, that should do the trick: http://docs.aws.amazon.com/cli/latest/reference/s3/sync.html
On startup:
aws s3 sync s3://mybucket /mylocalfolder
before shutdown:
aws s3 sync /mylocalfolder s3://mybucket
Of course, the details are always fun to work out, e.g. how parallel it is (and whether you can make it more parallel, and whether that is any faster given the virtual nature of the whole setup).
Btw hope you're still working on this... or somebody is. ;)
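On the parallelism question: the AWS CLI has a max_concurrent_requests setting for S3 transfers (10 by default) that can be raised with aws configure set. The sketch below wraps that and the two sync commands in hypothetical helper functions; whether a higher value is actually faster depends on the instance type and file sizes, so treat the number as something to measure rather than a recommendation. DRY_RUN=1 prints the commands instead of running them.

```shell
#!/usr/bin/env bash

# Set DRY_RUN=1 to print each command instead of executing it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# tune_s3_concurrency N
# Sets how many S3 requests the AWS CLI may have in flight at once
# for the default profile.
tune_s3_concurrency() {
  run aws configure set default.s3.max_concurrent_requests "$1"
}

# pull_data S3_URI LOCAL_DIR: copy from S3 to local disk on startup.
pull_data() { run aws s3 sync "$1" "$2"; }

# push_data LOCAL_DIR S3_URI: copy results back to S3 before shutdown.
push_data() { run aws s3 sync "$1" "$2"; }
```

A startup script could call tune_s3_concurrency 20 followed by pull_data s3://mybucket /mylocalfolder, and a shutdown script push_data /mylocalfolder s3://mybucket.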
I think you might be better off using an Elastic Block Store to store your files instead of S3. An EBS is akin to a 'drive' on S3 that can be mounted into your EC2 instance without having to copy the data each time, thereby allowing you to persist your data between EC2 instances without having to write to or read from S3 each time.
http://aws.amazon.com/ebs/
Install the s3cmd package with
yum install s3cmd
or
sudo apt-get install s3cmd
depending on your OS,
then copy the data with
s3cmd get s3://tecadmin/file.txt
s3cmd ls can also list the files.
For more details see this
For me the best form is:
wget http://s3.amazonaws.com/my_bucket/my_folder/my_file.ext
from PuTTY