Unreadable beeline query results - hive

Environment
Beeline version 1.1.0-cdh5.13.0 by Apache Hive
Context
I need to execute some select statements with beeline and process the results later on.
However, the output of beeline is not readable, not even "processable".
Code
I execute the following command:
beeline -u $IMPALA_URL -n writer -p writer --outputformat=tsv2 -e "select * from bas_fca.table_info" > results 2> errors
Output
The content of the results file:
-----------------------------
table type creation_date
The content of the errors file:
2021-06-14 14:39:03,580 WARN [main] mapreduce.TableMapReduceUtil: The hbase-prefix-tree module jar containing PrefixTreeCodec is not present. Continuing without it.
scan complete in 2ms
Connecting to jdbc:hive2://mynodes:21050/
Connected to: Impala (version 2.10.0-cdh5.13.0)
Driver: Hive JDBC (version 1.1.0-cdh5.13.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
cu_ddi.activ_from_api
cu_ddi.cumul_from_api
cu_ddi.data_from_api
cu_ddi.envoi_from_api
cu_ddi.inscript
cu_ddi.tmp
No rows selected (0.098 seconds)
Beeline version 1.1.0-cdh5.13.0 by Apache Hive
Closing: 0: jdbc:hive2://mynodes:21050/
Expected output
When running the above query in Hue, I get the following results:
cu_ddi.activ_from_api EXTERNAL_TABLE 2020-01-29 16:13:18
cu_ddi.cumul_from_api EXTERNAL_TABLE 2020-01-22 09:04:47
cu_ddi.data_from_api EXTERNAL_TABLE 2020-01-21 15:05:58
cu_ddi.envoi_from_api EXTERNAL_TABLE 2020-01-29 16:12:25
cu_ddi.inscript EXTERNAL_TABLE 2019-10-25 11:35:03
cu_ddi.tmp MANAGED_TABLE 2020-01-21 15:06:40
Questions
Why do I only have the table names?
Why are the incomplete results written to STDERR?
How can I get the results in a processable format (CSV, etc.)?
Edit 1
Requested by Koushik Roy.
Output of:
beeline -u $IMPALA_URL -n writer -p writer -e "select * from bas_fca.table_info" > results
cat results
+--------+-------+----------------+--+
| table | type | creation_date |
+--------+-------+----------------+--+
+--------+-------+----------------+--+
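Regarding the "process the results later on" goal: once the rows do land in the results file as tsv2 (a header line followed by tab-separated values, as in the expected output above), a minimal parsing sketch in Python might look like this. It assumes the results file and the three column names shown in the question:

import csv

# Read beeline's tsv2 output: a header row ("table", "type", "creation_date")
# followed by one tab-separated line per data row.
with open("results", newline="") as f:
    for row in csv.DictReader(f, delimiter="\t"):
        print(row["table"], row["type"], row["creation_date"])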

Related

How to use subprocess.run() to run Hive query?

So I'm trying to execute a hive query using the subprocess module, and save the output into a file data.txt as well as the logs (into log.txt), but I seem to be having a bit of trouble. I've looked at this gist as well as this SO question, but neither seems to give me what I need.
Here's what I'm running:
import subprocess
query = "select user, sum(revenue) as revenue from my_table where user = 'dave' group by user;"
outfile = "data.txt"
logfile = "log.txt"
log_buff = open("log.txt", "a")
data_buff = open("data.txt", "w")
# note - "hive -e [query]" would normally just print all the results
# to the console after finishing
proc = subprocess.run(["hive" , "-e" '"{}"'.format(query)],
                      stdin=subprocess.PIPE,
                      stdout=data_buff,
                      stderr=log_buff,
                      shell=True)
log_buff.close()
data_buff.close()
I've also looked into this SO question regarding subprocess.run() vs subprocess.Popen, and I believe I want .run() because I'd like the process to block until finished.
The final output should be a file data.txt with the tab-delimited results of the query, and log.txt with all of the logging produced by the hive job. Any help would be wonderful.
Update:
With the above way of doing things I'm currently getting the following output:
log.txt
[ralston@tpsci-gw01-vm tmp]$ cat log.txt
Java HotSpot(TM) 64-Bit Server VM warning: Using the ParNew young collector with the Serial old collector is deprecated and will likely be removed in a future release
Java HotSpot(TM) 64-Bit Server VM warning: Using the ParNew young collector with the Serial old collector is deprecated and will likely be removed in a future release
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/y/share/hadoop-2.8.3.0.1802131730/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/y/libexec/tez/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Logging initialized using configuration in file:/home/y/libexec/hive/conf/hive-log4j.properties
data.txt
[ralston@tpsci-gw01-vm tmp]$ cat data.txt
hive> [ralston@tpsci-gw01-vm tmp]$
And I can verify the java/hive process did run:
[ralston@tpsci-gw01-vm tmp]$ ps -u ralston
PID TTY TIME CMD
14096 pts/0 00:00:00 hive
14141 pts/0 00:00:07 java
14259 pts/0 00:00:00 ps
16275 ? 00:00:00 sshd
16276 pts/0 00:00:00 bash
But it looks like it's not finishing and not logging everything that I'd like.
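A side note on why data.txt contains only a hive> prompt: when shell=True is combined with a list of arguments, only the first element ("hive") is passed to the shell as the command and the remaining items become arguments to the shell itself, so hive starts interactively without ever seeing the query. A minimal blocking sketch without shell=True (assuming hive is on the PATH), rather than the exact code the asker ended up with, would be:

import subprocess

query = "select user, sum(revenue) as revenue from my_table where user = 'dave' group by user;"

with open("log.txt", "a") as log_buff, open("data.txt", "w") as data_buff:
    # Without shell=True, the list is passed to hive directly and the
    # query string reaches the -e option; run() blocks until hive exits.
    proc = subprocess.run(["hive", "-e", query],
                          stdout=data_buff,
                          stderr=log_buff)

print("hive exited with return code", proc.returncode)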
So I managed to get this working with the following setup:
import subprocess
import time
query = "select user, sum(revenue) as revenue from my_table where user = 'dave' group by user;"
outfile = "data.txt"
logfile = "log.txt"
log_buff = open("log.txt", "a")
data_buff = open("data.txt", "w")
# Remove shell=True from proc, and add "> outfile.txt" to the command
proc = subprocess.Popen(["hive" , "-e", '"{}"'.format(query), ">", "{}".format(outfile)],
                        stdin=subprocess.PIPE,
                        stdout=data_buff,
                        stderr=log_buff)
# keep track of job runtime and set limit
start, elapsed, finished, limit = time.time(), 0, False, 60
while not finished:
    try:
        outs, errs = proc.communicate(timeout=10)
        print("job finished")
        finished = True
    except subprocess.TimeoutExpired:
        elapsed = abs(time.time() - start) / 60.
        if elapsed >= 60:
            print("Job took over 60 mins")
            break
        print("Comm timed out. Continuing")
        continue
print("done")
log_buff.close()
data_buff.close()
Which produced the output as needed. I knew about process.communicate() but that previously didn't work. I believe the issue was related to not adding an output file with > ${outfile} to the hive query.
Feel free to add any details. I've never seen anyone have to loop over proc.communicate(), so I suspect I might be doing something wrong.
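As an aside, the loop over proc.communicate() can usually be avoided: subprocess.run() accepts a timeout argument and raises subprocess.TimeoutExpired (after killing the child) once the limit is exceeded. A sketch under the same assumptions as above (hive on the PATH, same output files):

import subprocess

query = "select user, sum(revenue) as revenue from my_table where user = 'dave' group by user;"

with open("log.txt", "a") as log_buff, open("data.txt", "w") as data_buff:
    try:
        # Blocks until hive finishes, or raises after a 60-minute cap.
        subprocess.run(["hive", "-e", query],
                       stdout=data_buff,
                       stderr=log_buff,
                       timeout=60 * 60)
        print("job finished")
    except subprocess.TimeoutExpired:
        print("Job took over 60 mins")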

How To Send Output To Terminal Window with Hive Script

I am familiar with storing the output/results of a Hive query to a file, but what command do I use in the script to display the results of the HQL to the terminal?
Normally Hive prints results to stdout; if not redirected, they are displayed on the console. You do not need any special command for this.
If you want to display results on the console screen and store them in a file at the same time, use the tee command:
hive -e "use mydb; select * from test_t" | tee ./results.txt
OK
123 {"value(B)":"Bye"}
123 {"value(G)":"Jet"}
Time taken: 1.322 seconds, Fetched: 2 row(s)
Check that the file contains the results:
cat ./results.txt
123 {"value(B)":"Bye"}
123 {"value(G)":"Jet"}
See here: https://ru.wikipedia.org/wiki/Tee
As for my own output: at first there was none, because I had yet to properly use the LOAD DATA INPATH command to load my data into HDFS. After loading, I received output from the SELECT statement in the script.

Execute an Impala query and get query time

I want to be able to execute a number of Impala queries and return the time it took for each query to execute. Using the Impala shell, I can do this with the following command:
impl -q "select count(*) from database.table;"
This gives me the output
Using service name 'impala'
SSL is enabled. Impala server certificates will NOT be verified (set --ca_cert to change)
Connected to *****.************:21000
Server version: impalad version 2.6.0-cdh5.8.3 RELEASE (build c644f476b774db9db87a619628f7a6ecc5f843e0)
Query: select count(*) from database.table
+----------+
| count(*) |
+----------+
| 1130976 |
+----------+
Fetched 1 row(s) in 0.86s
I want to be able to fetch that last line and extract the time. It doesn't really matter how, which is why I haven't tagged a language. I have tried using grep like this:
impl -q "select count(*) from database.table" | grep -Po "\d+\.\d+"
But that does nothing but remove the table. Putting the query in a Python script and using subprocess couldn't find impl as a command, and the same happened with Scala.
The weird thing is that impala-shell dumps those messages to stderr rather than to stdout, so to fetch the last line you have to append 2>&1 to redirect stderr to stdout:
impala-shell -q "query string" 2>&1 | grep -Po "\d+\.\d+(?=s)"
Note that the positive lookahead (?=s) is probably required to avoid capturing version numbers.
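If the queries are driven from a script anyway, the same idea (read stderr, pull the timing out with a regex) can be done in Python. This is only a sketch, assuming impala-shell is on the PATH and Python 3.7+ for capture_output:

import re
import subprocess

query = "select count(*) from database.table"

# As noted above, impala-shell writes the informational/summary lines
# (including "Fetched N row(s) in X.XXs") to stderr.
proc = subprocess.run(["impala-shell", "-q", query],
                      capture_output=True, text=True)

match = re.search(r"Fetched \d+ row\(s\) in (\d+\.\d+)s", proc.stderr)
if match:
    print("query time:", float(match.group(1)), "seconds")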

BigQuery bq command with asterisk (*) doesn't work in Compute Engine

I have a directory with a file named file1.txt, and I run the command:
bq query "SELECT * FROM [publicdata:samples.shakespeare] LIMIT 5"
In my local machine it works fine but in Compute Engine I receive this error:
Waiting on bqjob_r2aaecf624e10b8c5_0000014d0537316e_1 ... (0s) Current status: DONE
BigQuery error in query operation: Error processing job 'my-project-id:bqjob_r2aaecf624e10b8c5_0000014d0537316e_1': Field 'file1.txt' not found.
If the directory is empty it works fine. I'm guessing the asterisk is being expanded into the file name(s) inside the query, but I don't know why.
Apparently the bq command which is located at /usr/bin/bq has the following script:
#!/bin/sh
exec /usr/lib/google-cloud-sdk/bin/bq ${@}
Because ${@} is unquoted, the shell performs pathname expansion on the arguments, so the * inside the query string gets expanded against the files in the current directory (hence "Field 'file1.txt' not found").
As a current workaround I'm calling /usr/lib/google-cloud-sdk/bin/bq directly.

Loading Data from Remote Machine to Hive Database

I have a CSV file stored on a remote machine. I need to load this data into my Hive database, which is installed on a different machine. Is there any way to do this?
note: I am using Hive 0.12.
Since Hive basically applies a schema to data that resides in HDFS, you'll want to create a location in HDFS, move your data there, and then create a Hive table that points to that location. If you're using a commercial distribution, this may be possible from Hue (the Hadoop User Environment web UI).
Here's an example from the command line.
Create csv file on local machine:
$ vi famous_dictators.csv
... and this is what the file looks like:
$ cat famous_dictators.csv
1,Mao Zedong,63000000
2,Jozef Stalin,23000000
3,Adolf Hitler,17000000
4,Leopold II of Belgium,8000000
5,Hideki Tojo,5000000
6,Ismail Enver Pasha,2500000
7,Pol Pot,1700000
8,Kim Il Sung,1600000
9,Mengistu Haile Mariam,950000
10,Yakubu Gowon,1100000
Then scp the csv file to a cluster node:
$ scp famous_dictators.csv hadoop01:/tmp/
ssh into the node:
$ ssh hadoop01
Create a folder in HDFS:
[awoolford@hadoop01 ~]$ hdfs dfs -mkdir /tmp/famous_dictators/
Copy the csv file from the local filesystem into the HDFS folder:
[awoolford@hadoop01 ~]$ hdfs dfs -copyFromLocal /tmp/famous_dictators.csv /tmp/famous_dictators/
Then login to hive and create the table:
[awoolford@hadoop01 ~]$ hive
hive> CREATE TABLE `famous_dictators`(
> `rank` int,
> `name` string,
> `deaths` int)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ','
> LINES TERMINATED BY '\n'
> LOCATION
> 'hdfs:///tmp/famous_dictators';
You should now be able to query your data in Hive:
hive> select * from famous_dictators;
OK
1 Mao Zedong 63000000
2 Jozef Stalin 23000000
3 Adolf Hitler 17000000
4 Leopold II of Belgium 8000000
5 Hideki Tojo 5000000
6 Ismail Enver Pasha 2500000
7 Pol Pot 1700000
8 Kim Il Sung 1600000
9 Mengistu Haile Mariam 950000
10 Yakubu Gowon 1100000
Time taken: 0.789 seconds, Fetched: 10 row(s)