Writing Hive query results to HDFS

I can export the results of a Hive query to the local filesystem with:
INSERT OVERWRITE LOCAL DIRECTORY '/home/user/events/'
but how do I export them to an HDFS directory such as /user/events/ instead? I tried this:
INSERT OVERWRITE DIRECTORY '/user/user/events/'
row format delimited
fields terminated by '\t'
select * from table;
but then I get this error:
FAILED: ParseException line 2:0 cannot recognize input near 'row' 'format' 'delimited' in statement

Remove the LOCAL keyword: it tells Hive to write to the local filesystem, and without it the results go to HDFS. Also note that some older Hive releases only accept the ROW FORMAT DELIMITED clause together with LOCAL, which is what produces the ParseException above; on those versions drop that clause as well and Hive will fall back to its default field delimiter. So:
INSERT OVERWRITE DIRECTORY '/user/events/'
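To check that the export actually landed in HDFS, you can list and sample the output directory. A minimal sketch, assuming the path from the query above (the part-file name 000000_0 is just the typical Hive naming, not guaranteed):
hdfs dfs -ls /user/events/                       # one file per reducer, typically 000000_0, 000001_0, ...
hdfs dfs -cat /user/events/000000_0 | head -n 5  # spot-check the delimited rows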

Related

Exporting Hive Table Data into .csv

This question may have been asked before, and I am relatively new to Hadoop and Hive. As a test, I'm trying to export content to see if I am doing things correctly. The code is below.
Use MY_DATABASE_NAME;
INSERT OVERWRITE LOCAL DIRECTORY '/random/directory/test'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY "\n"
SELECT date_ts,script_tx,sequence_id FROM dir_test WHERE date_ts BETWEEN '2018-01-01' and '2018-01-02';
That is what I have so far, but it generates multiple files, and I want to combine them into a single .csv or .xls file to work with. My question: what do I do next to accomplish this?
Thanks in advance.
You can achieve this in the following ways:
Force a single reducer in the query, e.g. with ORDER BY <col_name>, so only one output file is produced.
Store the output in HDFS and then merge it into one local file with hdfs dfs -getmerge [-nl] <src> <localdest> (see the sketch below).
Use beeline: beeline --outputformat=csv2 -f query_file.sql > <file_name>.csv
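A minimal sketch of the getmerge approach, reusing the query from the question; the staging directory and local file name are placeholders, and the ROW FORMAT clause on a non-LOCAL directory needs a reasonably recent Hive release:
-- write the result set to an HDFS staging directory as comma-separated text
INSERT OVERWRITE DIRECTORY '/tmp/dir_test_export'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT date_ts, script_tx, sequence_id FROM dir_test
WHERE date_ts BETWEEN '2018-01-01' AND '2018-01-02';

# then merge all part files from HDFS into a single local CSV
hdfs dfs -getmerge /tmp/dir_test_export /home/user/dir_test.csv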

Hive: query output to csv/excel file

I am trying to output the results of a Hive query to a file (preferably Excel). I tried the methods below, and none of them work as explained in most posts. I wonder whether that is because I am using the Hue environment. I am new to Hue and Hive; any help would be appreciated.
insert overwrite directory 'C:/Users/Microsoft/Windows/Data Assets' row format delimited fields terminated by '\n' stored as textfile select * from final_table limit 100;
INSERT OVERWRITE LOCAL DIRECTORY 'C:/Users/Microsoft/Windows/Data Assets'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\n'
STORED AS TEXTFILE
select * from final_table limit 100;
I have tried running the same query in my setup, and it works fine.
In your case it might be a permission issue with the 'C:/Users/Microsoft/Windows/Data Assets' folder.
Try writing to a different folder (the user's home folder).
Query:
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/fsimage' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\n' STORED AS TEXTFILE SELECT * FROM stream_data_partitioned limit 100;

Hive unable to move query results to a folder

I have written a SELECT query in Hive to move data to a particular folder, but I am getting an error. Please help.
Moving data to local directory /Dataproviders/DataSurgery/Order/out/jul24msngtxn/negtxns
Failed with exception Unable to move source hdfs://mycluster/tmp/hive/sshuser/253d3089-fcc0-4656-82ca-ccbe893196ed/hive_2018-08-16_06-58-29_220_388527949811395742-1/-mr-10000 to destination /Dataproviders/DataSurgery/Order/out/jul24msngtxn/negtxns
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask
INSERT OVERWRITE LOCAL DIRECTORY '/Dataproviders/DataSurgery/Order/out/jul24msngtxn/negtxns/'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\034'
STORED AS TEXTFILE
select * from sourcetable;
I have given full permissions to the following folders, but the issue still exists:
hdfs dfs -chmod 777 /tmp/hive
hdfs dfs -chmod -R 777 /Dataproviders/DataSurgery/
I made a terrible mistake: the LOCAL keyword should not be present when writing to an HDFS directory. I removed it and the query worked fine. Please find the correct query below:
INSERT OVERWRITE DIRECTORY '/Dataproviders/DataSurgery/Order/out/jul24msngtxn/negtxns/'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\034'
STORED AS TEXTFILE
select * from sourcetable;

Hadoop Hive: create external table with dynamic location

I am trying to create a Hive external table that points to an S3 output file.
The file name should reflect the current date (it is always a new file).
I tried this:
CREATE EXTERNAL TABLE s3_export (...)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION concat('s3://BlobStore/Exports/Daily_', from_unixtime(unix_timestamp(),'yyyy-MM-dd'));
but I get an error:
FAILED: Parse Error: line 3:9 mismatched input 'concat' expecting StringLiteral near 'LOCATION' in table location specification
Is there any way to dynamically specify the table location?
OK, I found the Hive variables feature.
So I pass the location on the CLI as follows:
hive -d s3file=s3://BlobStore/Exports/APKsCollection_test/`date +%F`/
and then use the variable in the Hive command:
CREATE EXTERNAL TABLE s3_export (...)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '${s3file}';
This doesn't work on my side; how did you make this happen?
hive -d s3file=s3://BlobStore/Exports/APKsCollection_test/`date +%F`/
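A minimal sketch of how the pieces typically fit together, assuming the DDL is saved in a script file (the file name create_export.hql and the column list are placeholders; the variable reference must match the -d name exactly, since substitution is case-sensitive):
-- create_export.hql
CREATE EXTERNAL TABLE s3_export (id STRING, payload STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '${s3file}';

# substitute today's date into the location and run the script
hive -d s3file=s3://BlobStore/Exports/APKsCollection_test/`date +%F`/ -f create_export.hql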

Hive external table - CSV file - header row

Below is the Hive table I have created:
CREATE EXTERNAL TABLE Activity (
column1 type,
column2 type
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/exttable/';
In my HDFS location /exttable, I have a lot of CSV files, and each CSV file contains a header row. When I run SELECT queries, the result contains the header rows as well.
Is there any way in Hive to ignore the header row or first line?
You can now skip the header line count as of Hive 0.13.0:
tblproperties ("skip.header.line.count"="1");
If you are using Hive version 0.13.0 or higher, you can specify "skip.header.line.count"="1" in your table properties to have the header skipped.
For detailed information on the patch see: https://issues.apache.org/jira/browse/HIVE-5795
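If the table already exists, the property can also be added afterwards. A minimal sketch, using the Activity table from the question:
ALTER TABLE Activity SET TBLPROPERTIES ("skip.header.line.count"="1");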
Let's say you want to load a CSV file like the one below, located at /home/test/que.csv:
1,TAP (PORTUGAL),AIRLINE
2,ANSA INTERNATIONAL,AUTO RENTAL
3,CARLTON HOTELS,HOTEL-MOTEL
Now, we need to create a location in HDFS that holds this data.
hadoop fs -put /home/test/que.csv /user/mcc
The next step is to create a table. There are two types (managed and external) to choose from.
Example of an external table:
create external table industry_
(
MCC string ,
MCC_Name string,
MCC_Group string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/mcc/'
tblproperties ("skip.header.line.count"="1");
Note: when the table is accessed via Spark SQL, the header row of the CSV is still shown as a data row. Tested on Spark 2.4.
There is not. However, you can pre-process your files to strip the first row before loading them into HDFS:
tail -n +2 withfirstrow.csv > withoutfirstrow.csv
Alternatively, you can build a WHERE clause into your Hive queries to filter out the header row (see the sketch below).
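A minimal sketch of that WHERE-clause approach, assuming the Activity table from the question and that the header line repeats the column name in its first field:
-- drop rows whose first field still holds the header text
SELECT * FROM Activity
WHERE column1 <> 'column1';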
If your Hive version doesn't support tblproperties ("skip.header.line.count"="1"), you can use the Unix command below to drop the first line (the column header) before putting the file in HDFS.
sed -n '2,$p' File_with_header.csv > File_with_No_header.csv
To remove the header from the CSV file in place, use:
sed -i 1d filename.csv