How to export hive query result to single local file? - hive

I want to export a Hive query result to a single local file with a pipe delimiter.
The Hive query contains an ORDER BY clause.
I have tried the solutions below.
Solution1:
hive -e "insert overwrite local directory '/problem1/solution' row format delimited fields terminated by '|' select * from table_name order by rec_date"
This solution creates multiple files, and after merging the files the data order is lost.
Solution2:
beeline -u 'jdbc:hive2://server_ip:10000/db_name' --silent --outputformat=dsv --delimiterForDSV='|' -e 'select * from table_name order by rec_date' > /problem1/solution
This solution creates a single file, but it has 2 empty lines at the top and 2 empty lines at the bottom.
I am removing the empty lines with a sed command, which takes a very long time.
Is there any other efficient way to achieve this?

Try these settings to execute the ORDER BY on a single reducer:
set hive.optimize.sampling.orderby=false; --disable parallel ORDER BY
Or try to set the number of reducers manually:
set mapred.reduce.tasks=1;
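Putting it together, a minimal sketch of the whole export (the table, column, and path come from the question above; which settings are available depends on your Hive version):
hive -e "set hive.optimize.sampling.orderby=false;
set mapred.reduce.tasks=1;
insert overwrite local directory '/problem1/solution'
row format delimited fields terminated by '|'
select * from table_name order by rec_date;"
With a single reducer the directory should contain just one file (000000_0), so there is no merge step and the ORDER BY is preserved.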

Related

Exporting Hive Table Data into .csv

This question may have been asked before, and I am relatively new to Hadoop and Hive. I'm trying to export content, as a test, to see if I am doing things correctly. The code is below.
USE MY_DATABASE_NAME;
INSERT OVERWRITE LOCAL DIRECTORY '/random/directory/test'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY "\n"
SELECT date_ts,script_tx,sequence_id FROM dir_test WHERE date_ts BETWEEN '2018-01-01' and '2018-01-02';
That is what I have so far, but it generates multiple files, and I want to combine them into a .csv or .xls file to work on. My question is: what do I do next to accomplish this?
Thanks in advance.
You can achieve this in the following ways:
Use a single reducer in the query, e.g. with ORDER BY <col_name>
Store the output to HDFS and then use the command hdfs dfs -getmerge [-nl] <src> <localdst> (see the sketch after this list)
Using beeline: beeline --outputformat=csv2 -f query_file.sql > <file_name>.csv
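A sketch of the getmerge route, assuming the query result is first written to an HDFS directory (the HDFS path /tmp/query_out is a placeholder):
hive -e "insert overwrite directory '/tmp/query_out'
row format delimited fields terminated by ','
select date_ts, script_tx, sequence_id from dir_test;"
hdfs dfs -getmerge /tmp/query_out /random/directory/test/result.csv
-getmerge concatenates all part files from the HDFS directory into one local file; the optional -nl flag adds a newline between the parts.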

Incremental updates in Hive using Sqoop append data into the middle of the table

I am trying to append new data from SQL Server to Hive using the following command:
sqoop import --connect 'jdbc:sqlserver://10.1.1.12;database=testdb' --username uname --password passwd --table testable --where "ID > 11854" --hive-import -hive-table hivedb.hivetesttable --fields-terminated-by ',' -m 1
This command appends the data.
But when I run
select * from hivetesttable;
it does not show the new data at the end.
This is because the Sqoop import that appends the new data writes the mapper output as part-m-00000-copy.
So the files in the Hive table directory look like:
part-m-00000
part-m-00000-copy
part-m-00001
part-m-00002
Is there any way to append the data at the end by changing the name of the mapper output file?
Hive, like any other relational database, doesn't guarantee any order unless you explicitly use an ORDER BY clause.
You're correct in your analysis: the reason the data appears in the "middle" is that Hive reads one file after another in lexicographical order, and Sqoop simply names the new files so that they get appended somewhere in the middle of that list.
However, the operation is fully valid: Sqoop appended data to the Hive table, and because your query doesn't have an explicit ORDER BY clause the result carries no guarantees with regard to order. In fact, Hive itself could change this behavior and read files based on creation time without breaking any compatibility.
I'm also interested in how this affects your use case. I'm assuming the query that lists all rows is just a test. Do you have any issues with actual production queries?
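If the rows need to come back in a deterministic order, the fix is an explicit ORDER BY in the query itself; a minimal sketch, assuming the ID column used in the Sqoop --where clause also exists in the Hive table:
select * from hivedb.hivetesttable order by ID;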

Hive query - INSERT OVERWRITE LOCAL DIRECTORY creates multiple files for a single table

I do the following from a Hive table myTable:
INSERT OVERWRITE LOCAL DIRECTORY '/myDir/out' SELECT concat_ws('',NAME,PRODUCT,PRC,field1,field2,field3,field4,field5) FROM myTable;
So this command generates two files, 000000_0 and 000001_0, inside the folder out/.
But, I need the contents as a single file. What should I do?
There are multiple files in the directory because every reducer writes one file. If you really need the contents as a single file, run your MapReduce job with only 1 reducer, which will write a single file.
However, depending on your data size, running a single reducer might not be a good approach.
Edit: Instead of forcing Hive to run 1 reduce task and output a single file, it would be better to use hadoop fs operations to merge the outputs into a single file.
For example
hadoop fs -text /myDir/out/* | hadoop fs -put - /myDir/out.txt
A bit late to the game, but I found that using LIMIT large_number, where large_number is bigger than the number of rows in your query, forces Hive to use at least one reducer. For example:
set mapred.reduce.tasks=1; INSERT OVERWRITE LOCAL DIRECTORY '/myDir/out' SELECT * FROM table_name LIMIT 1000000000
Worked flawlessly.
CLUSTER BY will do the work.

Storing query result in a variable

I have a query whose result I want to store in a variable.
How can I do it?
I tried:
./hive -e "use telecom;insert overwrite local directory '/tmp/result' select
avg(a) from abc;"
./hive --hiveconf MY_VAR=`cat /tmp/result/000000_0`;
I am able to get the average value in MY_VAR, but it drops me into the Hive CLI, which is not what I want.
Also, is there a way to access Unix commands inside the Hive CLI?
Use case: in MySQL the following is valid:
set @max_date := (select max(date) from some_table);
select * from some_other_table where date > @max_date;
This is super useful for scripts that need to repeatedly call this variable since you only need to execute the max date query once rather than every time the variable is called.
Hive does not currently support this. (Please correct me if I'm wrong! I have been trying to figure out how to do this all afternoon.)
My workaround is to store the required value in a table that is small enough to map-join onto the query in which it is used. Because the join is a map-side join rather than a shuffle join, it should not significantly hurt performance. For example:
drop table if exists var_table;
create table var_table as
select max(date) as max_date from some_table;
select some_other_table.*
from some_other_table
cross join var_table
where some_other_table.date > var_table.max_date;
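To make sure the one-row var_table is actually broadcast as a map-side join, automatic join conversion should be enabled (it is on by default in recent Hive versions):
set hive.auto.convert.join=true;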
The solution suggested by @visakh is not optimal, because it stores the string 'select count(1) from table_name;' rather than the returned value, and so will not be helpful in cases where you need to call a variable repeatedly during a script.
Storing Hive query output in a variable and using it in another query:
In the shell, create a variable with the desired value by doing:
var=`hive -S -e "select max(datekey) from ....;"`
echo $var
Use the variable's value in another Hive query with:
hive -hiveconf MID_DATE=$var -f test.hql
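Inside test.hql the passed-in value can then be read from the hiveconf namespace; a minimal sketch of what that file might contain (some_table is a placeholder):
-- test.hql
select * from some_table where datekey > ${hiveconf:MID_DATE};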
You can simply achieve this using a shell script.
Create a shell script
file: avg_op.sh
#!/bin/sh
hive -e 'use telecom;select avg(a) from abc;' > avg.txt
wait
value=`cat avg.txt`
hive --hiveconf avgval=$value -e "set avgval;set hiveconf:avgval;
use telecom;
select * from abc2 where avg_var=\${hiveconf:avgval};"
Execute the .sh file:
$ bash avg_op.sh
If you're trying to capture a number from a Hive or Impala query on Linux, you can achieve this by executing the query and extracting the number with a regex.
With Hive,
max=`beeline -u ${hiveConnectionUrl} -e "select max(col1) from schema_name.table_name;" | sed 's/[^0-9]*//g'`
The main part is extracting the number from the result. Also, if you're getting too much output, you can use the --silent=true flag to silence the execution, which reduces the log messages.
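The captured value can then be fed into a follow-up query from the same shell; a hedged sketch, where next_query.sql is a hypothetical script that reads the variable through --hivevar:
max=`beeline -u ${hiveConnectionUrl} -e "select max(col1) from schema_name.table_name;" | sed 's/[^0-9]*//g'`
beeline -u ${hiveConnectionUrl} --hivevar MAX_COL1=$max -f next_query.sql
Inside next_query.sql the value would be referenced as ${hivevar:MAX_COL1}.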
You can use BeeTamer for that. It allows you to store a result (or part of it) in a variable and use that variable later in your code.
BeeTamer is a macro language / macro processor that extends the functionality of the Apache Hive and Cloudera Impala engines.
select avg(a) from abc;
%capture MY_AVERAGE;
select * from abc2 where avg_var=#MY_AVERAGE#;
Here you save the average value from your query into the macro variable MY_AVERAGE and then reuse it in the second query.
Try the following:
$ var=$(hive -e "select '12' ")
$ echo $var
12 -- output

Export Hive Query Results

I'm new to Hive and could use some tips.
I'm trying to export query results from Hive as a CSV. When I try to pipe them out of the CLI like:
hive -e 'select * from table' > OutPut.txt
I get a text file that has all the records but doesn't have the column headers. Does anyone have a tip for how to export the query results, with the column headers, to a CSV file?
If I run the query in Hue and then download the results as a CSV, I get a CSV with the column headers but no records. If anyone has a tip on how to download query results from Hue with both records and column headers, I would greatly appreciate it too.
To export the column headers, you need to set the following in the hiverc file:
set hive.cli.print.header=true;
To get just the headers into a file, you could try the following:
hive -e 'set hive.cli.print.header=true; SELECT * FROM TABLE_NAME LIMIT 0;' > /file_path/file_name.txt
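To get both the headers and the records into one CSV, the same setting can be combined with the full query, converting the default tab delimiter to commas; a rough sketch (it assumes no field itself contains tabs or commas):
hive -e 'set hive.cli.print.header=true; SELECT * FROM TABLE_NAME;' | sed 's/\t/,/g' > /file_path/file_name.csv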
Having the column header but missing data is a known issue: HUE-544
The workaround is to use Hue 3 or later, or to switch to HiveServer2 (recommended starting with CDH 4.6).