Google DataProc Hive and Presto query doesn't work - hive

I have a Google Dataproc cluster with Presto installed as an optional component. I created an external table in Hive, about 1 GB in size. While the table is queryable (for example, group by, distinct, etc. succeed), I have problems performing a simple select * from tableA with both Hive and Presto:
For Hive, if I log in to the master node of the cluster and run the query from the Hive command line, it succeeds. However, when I run the following command from my local machine:
gcloud dataproc jobs submit hive --cluster $CLUSTER_NAME --region $REGION --execute "SELECT * FROM tableA;"
I get the following error:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
ERROR: (gcloud.dataproc.jobs.submit.hive) Job [3e165c0edcda4e35ad0d5f62b77725bc] entered state [ERROR] while waiting for [DONE].
I get this even though I've updated the configuration in mapred-site.xml as follows:
mapreduce.map.memory.mb=9000
mapreduce.map.java.opts=-Xmx7000m
mapreduce.reduce.memory.mb=9000
mapreduce.reduce.java.opts=-Xmx7000m
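For reference, the same sort of properties can also be set per job at submission time with gcloud's --properties flag instead of editing mapred-site.xml; this is just a sketch, untested against this cluster:
gcloud dataproc jobs submit hive \
  --cluster $CLUSTER_NAME --region $REGION \
  --properties mapreduce.map.memory.mb=9000,mapreduce.map.java.opts=-Xmx7000m \
  --execute "SELECT * FROM tableA;"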
For Presto, statements such as group by and distinct similarly work. However, select * from tableA hangs every time at around RUNNING 60% until it times out, regardless of whether I run it from my local machine or from the master node of the cluster.
I don't understand why such a small external table can cause this. Any help is appreciated, thank you!

The Presto CLI binary /usr/bin/presto specifies a jvm -Xmx argument inline (it uses some tricks to bootstrap itself as a java binary); unfortunately, that -Xmx is not normally fetched from /opt/presto-server/etc/jvm.config like the settings for the actual presto-server.
In your case, if you're selecting everything from a 1G parquet table, you're probably actually dealing with something like 6G uncompressed text, and you're trying to stream all of that to the console output. This is likely also not going to work with the Dataproc job-submission because the streamed output is designed to print out human-readable amounts of data, and will slow down considerably if dealing with non-human amounts of data.
If you want to still try doing that with the CLI, you can run:
sudo sed -i "s/Xmx1G/Xmx5G/" /usr/bin/presto
to modify the JVM settings for the CLI on the master before starting it back up. You'd probably then want to pipe the output to a local file instead of streaming it to your console, because you won't be able to read 6G of text streaming through your screen.
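For example, on the master node (a sketch; the flags are standard Presto CLI options, and /tmp/tableA.csv is a hypothetical output path):
presto --server localhost:8080 \
  --catalog hive --schema default \
  --execute "SELECT * FROM tableA" \
  --output-format CSV > /tmp/tableA.csv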

I think the problem is that the output of gcloud dataproc jobs submit hive --cluster $CLUSTER_NAME --region $REGION --execute "SELECT * FROM tableA;" went through the Dataproc server which OOMed. To avoid that, you can query data from the cluster directly without going through the server.
Try following the Presto CLI queries section of the Dataproc Presto tutorial: run these commands from your local machine:
gcloud compute ssh <master-node> \
--project=${PROJECT} \
--zone=${ZONE} \
-- -D 1080 -N
./presto-cli \
--server <master-node>:8080 \
--socks-proxy localhost:1080 \
--catalog hive \
--schema default
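Once the SSH tunnel is up, you can also skip the interactive shell and write the whole result to a local file, which avoids paging a large result set through your terminal (a sketch; tableA.csv is a hypothetical output path):
./presto-cli \
  --server <master-node>:8080 \
  --socks-proxy localhost:1080 \
  --catalog hive \
  --schema default \
  --execute "SELECT * FROM tableA" \
  --output-format CSV > tableA.csv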

Related

Using aws emr add-steps for spark-submit

We have a complicated spark-submit command that we would like to submit to AWS EMR using the aws emr add-steps CLI command. We are having trouble figuring out the correct syntax to use. For example, consider the example command from Apache's Running Spark on YARN page:
$ ./bin/spark-submit --class my.main.Class \
--master yarn \
--deploy-mode cluster \
--jars my-other-jar.jar,my-other-other-jar.jar \
my-main-jar.jar \
app_arg1 app_arg2
Following the guidance from this EMR command-runner page, we created something like this:
$ aws emr add-steps \
--cluster-id j-123456789 \
--steps Type=CUSTOM_JAR,Name='Test_Job',Jar='command-runner.jar',ActionOnFailure=CONTINUE,Args=[\
./bin/spark-submit,\
--class,my.main.Class,\
--master,yarn,\
--deploy-mode cluster,\
--jars,my-other-jar.jar,my-other-other-jar.jar,my-main-jar.jar,app_arg1,app_arg2]
However, because of the parsing at commas, the command appears to associate --jars only with "my-other-jar", while "my-other-other-jar" is not included. I'm hoping somebody can tell us the proper syntax to use. For example, should we use --jars for each extra jar, like:
--jars,my-other-jar.jar,--jars,my-other-other-jar.jar
or maybe there is some special list syntax, e.g.,
--jars,[my-other-jar.jar,my-other-other-jar.jar]
or something else. Can anybody tell us, or point us to, the correct syntax to use for spark-submit arguments that might take a list, i.e., not just --jars, but also --conf, --files, ...?
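For illustration, one way to sidestep the shorthand comma-splitting is to pass the step as JSON, where each Args element is an explicit string and the jar list stays a single comma-separated string. This is a hedged sketch, not verified against a real cluster; step.json is a hypothetical file name, and spark-submit is assumed to be on the PATH via command-runner.jar:
cat > step.json <<'EOF'
[
  {
    "Type": "CUSTOM_JAR",
    "Name": "Test_Job",
    "Jar": "command-runner.jar",
    "ActionOnFailure": "CONTINUE",
    "Args": ["spark-submit",
             "--class", "my.main.Class",
             "--master", "yarn",
             "--deploy-mode", "cluster",
             "--jars", "my-other-jar.jar,my-other-other-jar.jar",
             "my-main-jar.jar",
             "app_arg1", "app_arg2"]
  }
]
EOF
aws emr add-steps --cluster-id j-123456789 --steps file://./step.json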

SSH into Hadoop cluster using Paramiko, and then executing dependent commands

I am implementing a Python script which uses Paramiko to connect to a Hadoop cluster. My problem is that I can SSH in only as the root user, and from inside I have to switch to the hdfs user to execute my command.
Now I need something to automate switching to the hdfs user, cd'ing into /tmp/, and then executing the command from there. I have tried invoke_shell(), but it hangs; I have also tried && inside exec_command, and that doesn't work either.
I am getting a permission denied exception:
java.io.FileNotFoundException: file.txt (Permission denied)
There are two workflows that I have thought of:
1st one:
1. sudo -u hdfs -s
2. cd /tmp/
3. <execute the command> <outputDir>
2nd one:
sudo -u hdfs <execution command> /tmp/<outputDir>
The first one doesn't give the above error, but the second one does. I was trying the second one just to avoid the dependent-command issue.
Any help or suggestions will be appreciated.
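For what it's worth, here is a minimal sketch of the second workflow with the working directory made explicit. The whole line is meant to be sent as a single command string (for example to exec_command), so the user switch, the cd, and the command all run in one shell; the placeholders are the same ones used above, and my guess is that the Permission denied comes from the command running in root's home directory, where the hdfs user cannot write:
# switch user, change directory, and run the command in one shell invocation
sudo -u hdfs sh -c 'cd /tmp && <execution command> <outputDir>'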

How to run hive commands from shell?

I have to repair tables in hive from my shell script after successful completion of my spark application.
msck repair table <DATABASE_NAME>.<TABLE_NAME>;
Please suggest me a suitable approach for this which also works for large tables with partitions.
I have found a workaround for this using:
hive -S -e "msck repair table <DATABASE_NAME>.<TABLE_NAME>;"
-S : silences the output generated by Hive.
-e : runs a Hive command/query passed on the command line.
-f : runs an HQL script file (example below).
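For example, the -f form can be handy if the repair statements grow; a small sketch, where repair.hql is a hypothetical file name:
echo "msck repair table <DATABASE_NAME>.<TABLE_NAME>;" > repair.hql
hive -S -f repair.hql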

How to start pig with -t ColumnMapKeyPrune on aws emr

In my Pig script I want the file name with each record for some further processing, so I used the -tagFile option. After using -tagFile, the columns were getting misaligned, so I used the command below to get only the required columns, after referring to this blog: http://www.webopius.com/content/764/resolved-apache-pig-with-tagsource-tagfile-option-generates-incorrect-columns
pig -x mapreduce -t ColumnMapKeyPrune
Now I want to run the script on AWS EMR, but I am not sure how to enable this -t ColumnMapKeyPrune option on EMR Pig.
I am using the AWS CLI to create the cluster and submit jobs.
Any pointers on how to enable -t ColumnMapKeyPrune on EMR Pig?
I got the solution: I need to add the line below to the Pig script:
set pig.optimizer.rules.disabled 'ColumnMapKeyPrune';
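With the set line inside the script itself, the Pig script can then be submitted as a normal EMR Pig step; a hedged sketch using the AWS CLI, where the cluster id and S3 path are hypothetical:
aws emr add-steps \
  --cluster-id j-XXXXXXXX \
  --steps Type=PIG,Name='PigWithTagFile',ActionOnFailure=CONTINUE,Args=[-f,s3://mybucket/scripts/myscript.pig]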

Never successfully built a large hadoop&spark cluster

I was wondering if anybody could help me with this issue in deploying a Spark cluster using the bdutil tool.
When the total number of cores increases (>= 1024), it fails all the time for one of the following reasons:
1. Some machines are never sshable, like "Tue Dec 8 13:45:14 PST 2015: 'hadoop-w-5' not yet sshable (255); sleeping"
2. Some nodes fail with an "Exited 100" error when deploying Spark worker nodes, like "Tue Dec 8 15:28:31 PST 2015: Exited 100 : gcloud --project=cs-bwamem --quiet --verbosity=info compute ssh hadoop-w-6 --command=sudo su -l -c "cd ${PWD} && ./deploy-core-setup.sh" 2>>deploy-core-setup_deploy.stderr 1>>deploy-core-setup_deploy.stdout --ssh-flag=-tt --ssh-flag=-oServerAliveInterval=60 --ssh-flag=-oServerAliveCountMax=3 --ssh-flag=-oConnectTimeout=30 --zone=us-central1-f"
In the log file, it says:
hadoop-w-40: ==> deploy-core-setup_deploy.stderr <==
hadoop-w-40: dpkg-query: package 'openjdk-7-jdk' is not installed and no information is available
hadoop-w-40: Use dpkg --info (= dpkg-deb --info) to examine archive files,
hadoop-w-40: and dpkg --contents (= dpkg-deb --contents) to list their contents.
hadoop-w-40: Failed to fetch http://httpredir.debian.org/debian/pool/main/x/xml-core/xml-core_0.13+nmu2_all.deb Error reading from server. Remote end closed connection [IP: 128.31.0.66 80]
hadoop-w-40: E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?
I tried 16-core 128-node, 32-core 64-node, 32-core 32-node, and other 1024+-core configurations, but either reason 1 or 2 above shows up.
I also tried modifying the ssh-flag to change ConnectTimeout to 1200s, and changing bdutil_env.sh to set the polling interval to 30s, 60s, ...; none of it works. There are always some nodes that fail.
Here is one of the configurations that I used:
time ./bdutil \
--bucket $BUCKET \
--force \
--machine_type n1-highmem-32 \
--master_machine_type n1-highmem-32 \
--num_workers 64 \
--project $PROJECT \
--upload_files ${JAR_FILE} \
--env_var_files hadoop2_env.sh,extensions/spark/spark_env.sh \
deploy
To summarize some of the information that came out of a separate email discussion: as IP mappings change and different Debian mirrors get assigned, there can be occasional problems where the concurrent calls to apt-get install during a bdutil deployment either overload some unbalanced servers or trigger DDoS protections, leading to deployment failures. These do tend to be transient, and at the moment it appears I can deploy large clusters in zones like us-east1-c and us-east1-d successfully again.
There are a few options you can take to reduce the load on the Debian mirrors:
1. Set MAX_CONCURRENT_ASYNC_PROCESSES to a much smaller value than the default 150 inside bdutil_env.sh, such as 10, to only deploy 10 at a time; this will make the deployment take longer, but would lighten the load as if you just did several back-to-back 10-node deployments.
2. If the VMs were successfully created but the deployment steps fail, instead of needing to retry the whole delete/deploy cycle, you can try ./bdutil <all your flags> run_command -t all -- 'rm -rf /home/hadoop' followed by ./bdutil <all your flags> run_command_steps to just run through the whole deployment attempt.
3. Incrementally build your cluster using resize_env.sh; initially set --num_workers 10 and deploy your cluster, and then edit resize_env.sh to set NEW_NUM_WORKERS=20, and run ./bdutil <all your flags> -e extensions/google/experimental/resize_env.sh deploy and it will only deploy the new workers 10-20 without touching those first 10. Then you just repeat, adding another 10 workers to NEW_NUM_WORKERS each time. If a resize attempt fails, you simply ./bdutil <all your flags> -e extensions/google/experimental/resize_env.sh delete to only delete those extra workers without affecting the ones you already deployed successfully (see the sketch just after this list).
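A sketch of option 3 as a command sequence, with the flags elided as <all your flags> just like in the prose above:
# initial deployment with a small number of workers
./bdutil <all your flags> --num_workers 10 deploy
# edit resize_env.sh to set NEW_NUM_WORKERS=20, then add only the new workers
./bdutil <all your flags> -e extensions/google/experimental/resize_env.sh deploy
# if a resize attempt fails, delete only the extra workers and retry
./bdutil <all your flags> -e extensions/google/experimental/resize_env.sh delete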
Finally, if you're looking for more reproducible and optimized deployments, you should consider using Google Cloud Dataproc, which lets you use the standard gcloud CLI to deploy clusters, submit jobs, and further manage/delete clusters without needing to remember your bdutil flags or keep track of what clusters you have on your client machine. You can SSH into Dataproc clusters and use them basically the same as bdutil clusters, with some minor differences, such as Dataproc's DEFAULT_FS being HDFS, so that any GCS paths you use should fully specify the complete gs://bucket/object name.
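For comparison, a hedged sketch of a similar flow on Dataproc with gcloud; the cluster name, region, and gs:// paths are hypothetical, and the machine shapes mirror the bdutil configuration above:
# create a cluster roughly equivalent to the bdutil configuration
gcloud dataproc clusters create my-cluster \
  --region us-central1 \
  --num-workers 64 \
  --master-machine-type n1-highmem-32 \
  --worker-machine-type n1-highmem-32
# submit a Spark job; note the fully specified gs:// path for the jar
gcloud dataproc jobs submit spark \
  --cluster my-cluster \
  --region us-central1 \
  --class my.main.Class \
  --jars gs://mybucket/my-main-jar.jar \
  -- app_arg1 app_arg2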