How do I set up a Hive query to be able to continue running even after I turn off the computer?
hive -e "select session_skey, visitor_nkey, purchaseid, session_seq_num, hit_dt_1, hit_dt_ts, hit_elapsed_time,split(page_url,'/')[2], split(page_url,'/')[3], split(page_url,'/')[4],split(page_url,'cm_src=')[1],page_url from omn_prod_etl.web_fct_hit where hit_dt_1 >= '2018-11-05' and hit_dt_1 <= '2018-01-05' and page_url like '%ewd.com/products/%cm_src%' and data_source_skey= 'gobli'" | sed 's/[[:space:]]\+/^/g' > /data/archive/hit_PR4.csv
Condition within Hive query in shell script not working properly
I wrote a shell script to send an email alert based on the result of a query, but only the second branch (the else part) ever gets sent, whatever the value of the variable. Please help me check it. The script is below:
#!/bin/sh
strata=$(impala connection string -q "SELECT calendar, COUNT(*) row_count FROM TABLE
WHERE calendar = CAST(from_unixtime(unix_timestamp(now() - interval 1 days), 'yyyyMMdd') AS INT)
GROUP BY calendar
ORDER BY calendar DESC;")
if [ $strata -eq 0 ] ;then
echo -e 'The table HAS NOT been refreshed today, kindly hold' | mailx -s 'Alerting:Refresh_Status' -c email.address -- email.address
else
echo -e The number of records is $strata | mailx -s 'Alerting: Refresh_Status' -c email.address -- email.address
fi
The output of the variable will either be 0 or the number of records in the table, and email will be sent based on that. But the else part is the only one that gets sent regardless of the result.
Correct me if I'm wrong, but I think the strata variable holds the whole result of your query, including the calendar column, so the comparison with 0 never succeeds and the script always takes the else branch.
I think the query should look like this:
SELECT COUNT(*) row_count FROM TABLE
WHERE calendar = CAST(from_unixtime(unix_timestamp(now() - interval 1 days), 'yyyyMMdd') AS INT)
You just need to drop calendar from the SELECT list (along with the GROUP BY and ORDER BY), so that the query returns only the count.
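Once the query returns a single count, the branching can be kept separate from the mail command. The following is only a sketch: `refresh_message` is a hypothetical helper, and it assumes the query output is one bare number (for impala-shell, the `-B` option prints delimited output without the table borders).

```shell
#!/bin/sh
# Hypothetical helper: turn the raw query output into the alert text.
# Assumption: the output contains a single number and nothing else of value.
refresh_message() {
    # Keep only the digits so stray whitespace does not break the -eq test.
    count=$(printf '%s' "$1" | tr -cd '0-9')
    if [ -z "$count" ]; then
        echo "Could not read a row count from the query output"
    elif [ "$count" -eq 0 ]; then
        echo "The table HAS NOT been refreshed today, kindly hold"
    else
        echo "The number of records is $count"
    fi
}
```

The result can then be piped to mailx as in the original script, for example `refresh_message "$strata" | mailx -s 'Alerting: Refresh_Status' email.address`.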
I have a partitioned table in Hive and I want to set the value of the date partition column dynamically (to yesterday's date). Below is my current query, but it is not working.
ALTER TABLE db1.table1 ADD IF NOT EXISTS PARTITION (loaddate="date_sub(FROM_UNIXTIME(UNIX_TIMESTAMP(),'yyyy-MM-dd') , 1)") LOCATION "hdfs://location1/abc/rawdata/externalhivetables/downloading/data";
Instead of the date value, the partition column contains the complete expression as a string:
select downloading.loaddate From downloading limit 3;
+------------------------------------------------------------+
| downloading.loaddate |
+------------------------------------------------------------+
| date_sub(FROM_UNIXTIME(UNIX_TIMESTAMP(),'yyyy-MM-dd') , 1) |
| date_sub(FROM_UNIXTIME(UNIX_TIMESTAMP(),'yyyy-MM-dd') , 1) |
| date_sub(FROM_UNIXTIME(UNIX_TIMESTAMP(),'yyyy-MM-dd') , 1) |
In the Hive shell we cannot yet assign a variable from the result of a query, so we need two steps:
Use Shell script to execute the query and store the result into a variable.
Then initialize the hive shell/script with the variable.
bash$ var=`hive -S -e "select date_sub(FROM_UNIXTIME(UNIX_TIMESTAMP(),'yyyy-MM-dd') , 1);"`
bash$ echo $var
Now initialize the hive/beeline shell with the variable's value:
bash$ hive -hiveconf dd=$var
hive> ALTER TABLE db1.table1 ADD IF NOT EXISTS PARTITION (loaddate='${hiveconf:dd}') LOCATION "hdfs://location1/abc/rawdata/externalhivetables/downloading/data";
Use shell to calculate date and substitute it using shell variable substitution:
bash$ dt=$(date -d '-1 day' +%Y-%m-%d)
bash$ hive -e "ALTER TABLE db1.table1 ADD IF NOT EXISTS PARTITION (loaddate='$dt') LOCATION 'hdfs://location1/abc/rawdata/externalhivetables/downloading/data'"
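Because the date is pasted straight into the query text, it can help to check its shape first. This is a minimal sketch: `build_add_partition` is a hypothetical helper, the table name and location are copied from the question, and the `date -d` call in the usage comment assumes GNU date.

```shell
#!/bin/sh
# Hypothetical helper: build the ALTER TABLE statement for one date.
build_add_partition() {
    # Accept only YYYY-MM-DD so nothing else ends up inside the statement.
    case "$1" in
        [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]) ;;
        *) echo "unexpected date: $1" >&2; return 1 ;;
    esac
    printf "ALTER TABLE db1.table1 ADD IF NOT EXISTS PARTITION (loaddate='%s') LOCATION 'hdfs://location1/abc/rawdata/externalhivetables/downloading/data'\n" "$1"
}

# Usage (needs GNU date and a hive client on the PATH):
#   hive -e "$(build_add_partition "$(date -d '-1 day' +%Y-%m-%d)")"
```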
FTIMESTAMP="2018-07-09 00:00:00"
LTIMESTAMP="2018-07-09 08:00:00"
echo $FTIMESTAMP
echo $LTIMESTAMP
bq query --nouse_legacy_sql 'insert `table1`(Time,UserId)
select Time,UserId from `table2`
WHERE _PARTITIONTIME >= "$FTIMESTAMP" AND _PARTITIONTIME < "$LTIMESTAMP"'
When I ran these commands in .sh script, it gave the following error:
*Error in query string: Error processing job '************': Could not cast literal "$FTIMESTAMP" to type TIMESTAMP at [3:25].*
I want to pass those parameters dynamically once this query is successful.
Or is there any other way to extract the data for the last 8 hours on the basis of partition time?
It's really a better idea to use query parameters instead of modifying your query text directly; you won't have issues where the query text ends up with syntax errors or other problems. Here is an example using parameters with the names from your question:
$ bq query --use_legacy_sql=false \
--parameter=FTIMESTAMP:TIMESTAMP:"2018-07-09 00:00:00" \
--parameter=LTIMESTAMP:TIMESTAMP:"2018-07-09 00:00:00" \
"SELECT @FTIMESTAMP, @LTIMESTAMP;"
+---------------------+---------------------+
| f0_ | f1_ |
+---------------------+---------------------+
| 2018-07-09 00:00:00 | 2018-07-09 00:00:00 |
+---------------------+---------------------+
In your case, you would want something like this:
$ bq query --nouse_legacy_sql \
--parameter=FTIMESTAMP:TIMESTAMP:"2018-07-09 00:00:00" \
--parameter=LTIMESTAMP:TIMESTAMP:"2018-07-09 08:00:00" \
'insert `table1`(Time,UserId)
select Time,UserId from `table2`
WHERE _PARTITIONTIME >= @FTIMESTAMP AND _PARTITIONTIME < @LTIMESTAMP'
If you still want to set the parameter values from shell variables, you can do so like this:
$ FTIMESTAMP="2018-07-09 00:00:00"
$ LTIMESTAMP="2018-07-09 08:00:00"
$ bq query --nouse_legacy_sql \
--parameter=FTIMESTAMP:TIMESTAMP:"$FTIMESTAMP" \
--parameter=LTIMESTAMP:TIMESTAMP:"$LTIMESTAMP" \
'insert `table1`(Time,UserId)
select Time,UserId from `table2`
WHERE _PARTITIONTIME >= @FTIMESTAMP AND _PARTITIONTIME < @LTIMESTAMP'
This sets the values of the query parameters from the shell variables, which are then passed to BigQuery.
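For the "last 8 hours" part of the question, the two values could be computed in the shell instead of being typed in. A rough sketch, assuming GNU date (for `-d @epoch`) and that the partition times should be compared in UTC; `hours_before` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print the UTC timestamp N hours before an epoch second.
# Assumption: GNU date, which accepts -d "@<seconds since 1970>".
hours_before() {
    epoch=$1
    hours=$2
    date -u -d "@$((epoch - hours * 3600))" +'%Y-%m-%d %H:%M:%S'
}

# Usage with the query from the question:
#   now=$(date +%s)
#   bq query --nouse_legacy_sql \
#     --parameter=FTIMESTAMP:TIMESTAMP:"$(hours_before "$now" 8)" \
#     --parameter=LTIMESTAMP:TIMESTAMP:"$(hours_before "$now" 0)" \
#     'insert `table1`(Time,UserId)
#      select Time,UserId from `table2`
#      WHERE _PARTITIONTIME >= @FTIMESTAMP AND _PARTITIONTIME < @LTIMESTAMP'
```

Taking the current time once and deriving both bounds from it keeps the window exactly 8 hours wide even if the script is slow to start.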
How can I run a Hive query in the background when the query looks like the one below?
Select count(1) from table1 where column1='value1';
I am trying to do it with a script like this:
#!/usr/bin/ksh
exec 1> /home/koushik/Logs/`basename $0 | cut -d"." -f1 | sed 's/\.sh//g'`_$(date +"%Y%m%d_%H%M%S").log 2>&1
ST_TIME=`date +%s`
cd $HIVE_HOME/bin
./hive -e 'SELECT COUNT(1) FROM TABLE1 WHERE COLUMN1 = ''value1'';'
END_TIME=`date +%s`
TT_SECS=$(( END_TIME - ST_TIME))
TT_HRS=$(( TT_SECS / 3600 ))
TT_REM_MS=$(( TT_SECS % 3600 ))
TT_MINS=$(( TT_REM_MS / 60 ))
TT_REM_SECS=$(( TT_REM_MS % 60 ))
printf "\n"
printf "Total time taken to execute the script="$TT_HRS:$TT_MINS:$TT_REM_SECS HH:MM:SS
printf "\n"
but I get an error like this:
FAILED: SemanticException [Error 10004]: Line 1:77 Invalid table alias or column reference 'value1'
Please let me know exactly where I am making a mistake.
Create a document named example
vi example
Enter the query in the document and save it.
create table sample as
Select count(1) from table1 where column1='value1';
Now run the document using the following command:
hive -f example 1>example.output 2>example.error &
The shell prints the job number, for example:
[1]
Now disown the process:
disown
The process will now keep running in the background. Hive writes its progress messages to standard error, so to follow the status of the job you can use
tail -f example.error
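A common alternative to disown is to start the command under nohup in the first place. The sketch below is not Hive-specific: `run_detached` is a hypothetical helper that works for any command. Note that either approach only survives closing the terminal or logging out; the machine that actually runs hive (for example a server reached over ssh) still has to stay on.

```shell
#!/bin/sh
# Hypothetical helper: start a command that keeps running after logout.
# Usage: run_detached LOGFILE COMMAND [ARGS...]
run_detached() {
    log=$1
    shift
    # nohup makes the command ignore the hangup signal sent at logout;
    # stdin comes from /dev/null so it never waits on the closed terminal;
    # stdout and stderr are both collected in the log file.
    nohup "$@" > "$log" 2>&1 < /dev/null &
    # Print the process id so the caller can check on the job later.
    echo $!
}

# Usage with the file from the answer above:
#   run_detached example.log hive -f example
#   tail -f example.log
```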
True @Koushik! Glad that you found the issue.
In the original command, bash could not pass the Hive query through intact because of the nested single quotes.
Though SELECT COUNT(1) FROM Table1 WHERE Column1 = 'Value1' is valid in Hive,
hive -e 'SELECT COUNT(1) FROM Table1 WHERE Column1 = 'Value1';' is not: the shell removes the inner quotes, so Hive sees Value1 as a column reference.
The best solution is to use double quotes around Value1:
hive -e 'SELECT COUNT(1) FROM Table1 WHERE Column1 = "Value1";'
or, as a quick and dirty solution, to wrap each single quote in double quotes:
hive -e 'SELECT COUNT(1) FROM Table1 WHERE Column1 = '"'"'Value1'"'"';'
This makes sure that the Hive query is properly formed and then executed accordingly. I would not suggest the second approach unless you desperately need single quotes ;)
I was able to resolve it by replacing the single quotes with double quotes. The modified statement now looks like this:
./hive -e 'SELECT COUNT(1) FROM Table1 WHERE Column1 = "Value1";'
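To see what the shell actually hands to hive, the same strings can be passed to printf instead. A small sketch (`show_arg` is a hypothetical helper; no Hive installation is needed to run it):

```shell
#!/bin/sh
# Hypothetical helper: print the argument exactly as a program would receive it.
show_arg() {
    printf '%s\n' "$1"
}

# Nested single quotes: the shell joins the pieces and the quotes vanish,
# so Hive would see Value1 as a bare column reference.
broken=$(show_arg 'SELECT COUNT(1) FROM Table1 WHERE Column1 = 'Value1';')

# Double quotes outside, single quotes inside: the literal survives.
fixed=$(show_arg "SELECT COUNT(1) FROM Table1 WHERE Column1 = 'Value1';")

echo "$broken"
echo "$fixed"
```

The first line printed has no quotes around Value1, which matches the "Invalid table alias or column reference" error in the question.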
I am looking for a way to count the number of columns in a table in Hive.
I know the following code works in Microsoft SQL Server. Is there a Hive equivalent?
SELECT COUNT(*)
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_CATALOG = 'database_name'
AND TABLE_SCHEMA = 'schema_name'
AND TABLE_NAME = 'table_name'
Try this
SHOW COLUMNS (FROM|IN) table_name [(FROM|IN) db_name]
Try this, it will show you the columns of your table:
DESCRIBE schemaName.tableName;
I do not know of a way to count the columns directly; however, I solved the problem for my needs indirectly via:
echo 'table1name:, '`hive -e 'describe schemaname.table1name;' | grep -v 'col_name' | wc -l` > num_columns.csv
echo 'table2name:, '`hive -e 'describe schemaname.table2name;' | grep -v 'col_name' | wc -l` >> num_columns.csv
...
(I needed the grep -v bit because I have headers on by default; without it you get one too many lines counted in the wc -l step.)
You have to check whether your Hive version includes HIVE-287, because versions that do not include it need COUNT(1) in place of COUNT(*).
Just run a DESCRIBE: it lists all the columns, and the number of rows fetched, reported at the bottom of the output, is the number of columns.
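The counting step can also be done with awk, which makes it easier to skip the header row and the partition section that DESCRIBE appends for partitioned tables. This is a sketch only: `count_columns` is a hypothetical helper, and it assumes the layout of one column per line, an optional col_name header, and a blank line or a line starting with # before any partition details. The exact output varies between Hive versions and clients.

```shell
#!/bin/sh
# Hypothetical helper: count column rows in DESCRIBE output read from stdin.
count_columns() {
    # Stop at the first blank line or '#' section (partition details),
    # skip the optional header row, and count the remaining lines.
    awk '/^[[:space:]]*$/ || /^#/ { exit }
         $1 != "col_name" { n++ }
         END { print n + 0 }'
}

# Usage (needs a hive client on the PATH):
#   hive -S -e 'describe schemaname.table1name;' | count_columns
```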