I need to repair tables in Hive from my shell script after my Spark application completes successfully:
msck repair table <DATABASE_NAME>.<TABLE_NAME>;
Please suggest a suitable approach that also works for large tables with many partitions.
I found a workaround for this:
hive -S -e "msck repair table <DATABASE_NAME>.<TABLE_NAME>;"
-S : Silent mode; suppresses the informational output that Hive generates.
-e : Runs the Hive statement given on the command line.
-f : Runs the statements in an HQL script file instead.
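To tie this back to the original requirement (repair only after the Spark application succeeds), a minimal sketch could look like the following. The helper name repair_after_job is hypothetical, and so are the spark-submit arguments in the usage note; only the hive -S -e invocation comes from the workaround above.

```shell
#!/usr/bin/env bash
# Sketch: run a job, then repair a Hive table only if the job succeeded.
# repair_after_job is a hypothetical helper name, not part of Hive or Spark.
repair_after_job() {
    local table="$1"   # fully qualified name, e.g. mydb.mytable
    shift
    # Run the job command passed as the remaining arguments.
    if ! "$@"; then
        echo "job failed; skipping msck repair for ${table}" >&2
        return 1
    fi
    # -S suppresses Hive's informational output; -e runs the inline statement.
    hive -S -e "msck repair table ${table};"
}
```

It could be called as repair_after_job mydb.mytable spark-submit --class com.example.App app.jar, where the class and jar names are placeholders for your own application.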
I have some Parquet files stored in HDFS that I want to convert to CSV files first and then export to a remote machine over ssh.
I don't know whether this is simpler with a Spark job (I know we can convert Parquet to CSV by reading with spark.read.parquet and then writing the same DataFrame out as CSV), but I would really like to do it with an impala-shell request.
So I thought about something like this:
hdfs dfs -cat my-file.parquet | ssh myserver.com 'cat > /path/to/my-file.csv'
Can you please help me with this?
Thank you!
Example without kerberos:
impala-shell -i servername:portname -B -q 'select * from table' -o filename '--output_delimiter=\001'
I won't explain every option here, but this link covers them, including how to add the header if you want it: http://beginnershadoop.com/2019/10/02/impala-export-to-csv/
You can do that in multiple ways.
One approach is shown in the example below.
With impala-shell you can run a query and pipe the result to ssh, which writes the output to a file on a remote machine.
$ impala-shell --quiet --delimited --print_header --output_delimiter=',' -q 'USE fun; SELECT * FROM games' | ssh remoteuser@ip.address.of.remote.machine "cat > /home/..../query.csv"
This command switches from the default database to the fun database and runs a query on it.
You can change the delimiter (for example --output_delimiter='\t'), include or omit --print_header, and adjust the other options as needed.
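The same pipeline can be wrapped in a small function so the query, remote host, and destination path become parameters. This is only a sketch: export_query_csv and the IMPALA_SHELL override variable are hypothetical names introduced for illustration, while the impala-shell flags are the ones used in the command above.

```shell
#!/usr/bin/env bash
# Sketch: run an Impala query and stream the delimited result to a remote file.
# export_query_csv and the IMPALA_SHELL override are hypothetical conveniences.
IMPALA_SHELL="${IMPALA_SHELL:-impala-shell}"

export_query_csv() {
    local query="$1"        # SQL to run
    local remote="$2"       # user@host passed to ssh
    local remote_path="$3"  # destination file on the remote machine
    # --delimited with --output_delimiter=',' produces comma-separated rows;
    # --print_header adds the column names as the first line.
    "$IMPALA_SHELL" --quiet --delimited --print_header \
        --output_delimiter=',' -q "$query" |
        ssh "$remote" "cat > '$remote_path'"
}
```

For example: export_query_csv 'USE fun; SELECT * FROM games' remoteuser@remote.host /tmp/query.csv, where the host and path are placeholders.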
I have a Google Dataproc cluster with Presto installed as an optional component. I created an external table in Hive, and its size is about 1 GB. While the table is queryable (for example, GROUP BY and DISTINCT statements succeed), I have problems performing a simple select * from tableA with both Hive and Presto:
For Hive, if I log in to the master node of the cluster and run the query from the Hive command line, it succeeds. However, when I run the following command from my local machine:
gcloud dataproc jobs submit hive --cluster $CLUSTER_NAME --region $REGION --execute "SELECT * FROM tableA;"
I get the following error:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
ERROR: (gcloud.dataproc.jobs.submit.hive) Job [3e165c0edcda4e35ad0d5f62b77725bc] entered state [ERROR] while waiting for [DONE].
This happens even though I've updated the configuration in mapred-site.xml as follows:
mapreduce.map.memory.mb=9000;
mapreduce.map.java.opts=-Xmx7000m;
mapreduce.reduce.memory.mb=9000;
mapreduce.reduce.java.opts=-Xmx7000m;
For Presto, statements such as GROUP BY and DISTINCT similarly work. However, for select * from tableA, the query hangs every time at about RUNNING 60% until it times out. I get the same issue regardless of whether I run it from my local machine or from the master node of the cluster.
I don't understand why such a small external table causes this issue. Any help is appreciated, thank you!
The Presto CLI binary /usr/bin/presto specifies a jvm -Xmx argument inline (it uses some tricks to bootstrap itself as a java binary); unfortunately, that -Xmx is not normally fetched from /opt/presto-server/etc/jvm.config like the settings for the actual presto-server.
In your case, if you're selecting everything from a 1G parquet table, you're probably actually dealing with something like 6G uncompressed text, and you're trying to stream all of that to the console output. This is likely also not going to work with the Dataproc job-submission because the streamed output is designed to print out human-readable amounts of data, and will slow down considerably if dealing with non-human amounts of data.
If you still want to try doing that with the CLI, you can run:
sudo sed -i "s/Xmx1G/Xmx5G/" /usr/bin/presto
This modifies the JVM settings for the CLI on the master; run it before starting the CLI again. You'd probably then want to pipe the output to a local file instead of streaming it to your console, because you won't be able to read 6G of text scrolling across your screen.
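Redirecting the CLI output to a file could be sketched as follows. The function name presto_to_file and the PRESTO_CLI override variable are hypothetical, and the hive catalog and default schema are assumptions about your setup; --catalog, --schema, --output-format, and --execute are standard Presto CLI options.

```shell
#!/usr/bin/env bash
# Sketch: send Presto CLI results to a file instead of the terminal.
# presto_to_file and PRESTO_CLI are hypothetical names for illustration.
PRESTO_CLI="${PRESTO_CLI:-presto}"

presto_to_file() {
    local query="$1"     # SQL to run non-interactively
    local out_file="$2"  # local file that receives the rows
    # Catalog "hive" and schema "default" are assumptions; adjust as needed.
    "$PRESTO_CLI" --catalog hive --schema default \
        --output-format CSV --execute "$query" > "$out_file"
}
```

For example: presto_to_file 'SELECT * FROM tableA' /tmp/tableA.csv, run on the master node after raising the CLI heap as described above.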
I think the problem is that the output of gcloud dataproc jobs submit hive --cluster $CLUSTER_NAME --region $REGION --execute "SELECT * FROM tableA;" went through the Dataproc server, which ran out of memory. To avoid that, you can query data from the cluster directly without going through the server.
Try following the "Presto CLI queries" section of the Dataproc Presto tutorial and run these two commands from your local machine. The first opens an SSH tunnel that acts as a SOCKS proxy on port 1080; the second, run in a separate terminal, starts the Presto CLI through that proxy:
gcloud compute ssh <master-node> \
--project=${PROJECT} \
--zone=${ZONE} \
-- -D 1080 -N
./presto-cli \
--server <master-node>:8080 \
--socks-proxy localhost:1080 \
--catalog hive \
--schema default
I want to know the command to execute a Hive script.
Complete the code to execute the Hive script ./custexport.hql
Scripts hive>
If you are using hive, then:
hive -f 'your hql file'
If you are using beeline, you can likewise pass the -f option as part of the complete beeline command.
We have a couple of HQL files for compiling DDLs.
With hive we used the following command from bash:
hive -v -f abc.hql
But this doesn't work with beeline from bash. Any idea what the equivalent beeline command is?
Make sure your HiveServer2 is up and running on some port.
Then, with beeline:
beeline -u "jdbc:hive2://localhost:port/database_name" -f abc.hql
Refer to this doc for more commands:
https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients
Refer to this doc if you have not yet configured HiveServer2:
https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2
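If the beeline call is made from a bash script, a thin wrapper that checks the script file first can make failures easier to spot. This is a sketch only: run_hql_file and the BEELINE override variable are hypothetical names, and the JDBC URL in the comment assumes HiveServer2 on its default port 10000 with the default database.

```shell
#!/usr/bin/env bash
# Sketch: run an HQL file through beeline, failing early if the file is missing.
# run_hql_file and BEELINE are hypothetical names for illustration.
BEELINE="${BEELINE:-beeline}"

run_hql_file() {
    local jdbc_url="$1"   # assumed form: jdbc:hive2://localhost:10000/default
    local hql_file="$2"   # path to the .hql script
    if [ ! -r "$hql_file" ]; then
        echo "cannot read HQL file: ${hql_file}" >&2
        return 2
    fi
    # -u gives the HiveServer2 JDBC URL; -f runs the statements in the file.
    "$BEELINE" -u "$jdbc_url" -f "$hql_file"
}
```

For example: run_hql_file "jdbc:hive2://localhost:10000/default" abc.hql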
I have a DB2 script to first drop and then create some table spaces and functions. I can run the SQL script successfully in DB2 command line on the targeted database.
I need to execute this SQL script from a shell script multiple times. It runs successfully the first time, but then hangs on the second or third run. The commands that execute the SQL script are very simple:
db2 CONNECT TO ktest4
db2 -v -f /tmp/sql/application_system/opmdb2_privilege_remove.sql.5342
I use DB2 9.7.8 on Linux. While the script is hung, I can still run the same SQL script manually and successfully from the DB2 command line against the target database.
Does anyone know the reason? Thanks.
Xiaoyang Gao
Are you sure DB2 is blocking? Did you put a semicolon between the commands?
db2 CONNECT TO ktest4 ; db2 -v -f /tmp/sql/application_system/opmdb2_privilege_remove.sql.5342
To trace the execution, I suggest adding some output so you can detect where it is blocking:
date ; db2 -r /tmp/output.log CONNECT TO ktest4 ; db2 -r /tmp/output.log values current timestamp ; db2 -r /tmp/output.log -v -f /tmp/sql/application_system ; db2 -r /tmp/output.log values current timestamp ; db2 -r /tmp/output.log terminate
With a command like this, all output is saved to the log file, and you can then check where the error occurs.
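The same tracing idea can be written as a small wrapper that records a timestamp and the exit code around every db2 call, which makes it easier to see which step never returns. This is a sketch under assumptions: traced_db2, DB2_CMD, and TRACE_LOG are hypothetical names, and the log path is a placeholder.

```shell
#!/usr/bin/env bash
# Sketch: run each DB2 CLP step, logging a timestamp and the exit code.
# traced_db2, DB2_CMD and TRACE_LOG are hypothetical names for illustration.
DB2_CMD="${DB2_CMD:-db2}"
TRACE_LOG="${TRACE_LOG:-/tmp/db2_trace.log}"

traced_db2() {
    local rc=0
    printf '%s START db2 %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*" >> "$TRACE_LOG"
    # Capture both stdout and stderr of the step in the trace log.
    "$DB2_CMD" "$@" >> "$TRACE_LOG" 2>&1 || rc=$?
    printf '%s END rc=%s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$rc" >> "$TRACE_LOG"
    return "$rc"
}
```

In the script, each step would then become a call such as traced_db2 CONNECT TO ktest4, followed by traced_db2 -v -f with the SQL file and traced_db2 terminate. A START line without a matching END line in the log points at the step that hangs.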