I'm new to Hive and could use some tips.
I'm trying to export query results from Hive as a CSV. When I try to pipe them out of the CLI like:
hive -e 'select * from table'>OutPut.txt
I get a text file that has all the records but no column headers. Does anyone have a tip for exporting the query results, with column headers, to a CSV file?
If I run the query in Hue and then download the results as a CSV, I get a CSV with the column headers but no records. If anyone has a tip on how to download query results from Hue with both records and column headers, I would greatly appreciate that too.
To include the column headers in the output, set the following in your .hiverc file:
set hive.cli.print.header=true;
To get just the headers into a file, you could try the following:
hive -e 'set hive.cli.print.header=true; SELECT * FROM TABLE_NAME LIMIT 0;' > /file_path/file_name.txt
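The Hive CLI separates output columns with tabs, so the redirected file is not yet a CSV. Below is a minimal sketch of a conversion filter, assuming no field contains a tab, comma, or double quote. The helper name tsv_to_csv and the output file name are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: turn tab-separated lines on stdin into comma-separated lines.
# Assumption: no field contains a tab, a comma, or a double quote.
tsv_to_csv() {
    tr '\t' ','
}

# Demo with stand-in data shaped like Hive CLI output (header row plus one record).
printf 'id\tname\n1\tfoo\n' | tsv_to_csv

# With Hive available, the same filter would sit at the end of the pipeline, e.g.:
#   hive -e 'set hive.cli.print.header=true; select * from table' | tsv_to_csv > OutPut.csv
```

If the data can contain commas or quotes, a plain character swap like this is not enough and the fields would need proper CSV quoting.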
Getting the column headers but no data is a known issue: HUE-544
The workaround is to upgrade to Hue 3 or later, or to switch to HiveServer2 (recommended starting from CDH4.6).
I want to export a Hive query result to a single local file with a pipe delimiter.
The query contains an ORDER BY clause.
I have tried the solutions below.
Solution1:
hive -e "insert overwrite local directory '/problem1/solution' row format delimited fields terminated by '|' select * from table_name order by rec_date"
This solution creates multiple files. After merging the files, the data order is lost.
Solution2:
beeline -u 'jdbc:hive2://server_ip:10000/db_name' --silent --outputformat=dsv --delimiterForDSV='|' -e 'select * from table_name order by rec_date' > /problem1/solution
This solution creates a single file, but it has two empty lines at the top and two at the bottom.
I am removing the empty lines with a sed command, which takes a very long time.
Is there any other efficient way to achieve this?
Try this setting to run ORDER BY on a single reducer:
set hive.optimize.sampling.orderby=false; --disable parallel ORDER BY
Or try to set the number of reducers manually:
set mapred.reduce.tasks=1;
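If the blank lines in the beeline output are the only remaining problem, they can be filtered while the output streams by instead of in a separate pass over the finished file. This is a sketch that has not been benchmarked against the sed approach from the question. The helper name strip_blank_lines is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: drop empty or whitespace-only lines from stdin.
strip_blank_lines() {
    grep -v '^[[:space:]]*$'
}

# Demo with stand-in data: two blank lines before and after the records.
printf '\n\n2018-01-01|a\n2018-01-02|b\n\n\n' | strip_blank_lines

# In the pipeline from the question this would look like:
#   beeline -u 'jdbc:hive2://server_ip:10000/db_name' --silent --outputformat=dsv \
#       --delimiterForDSV='|' -e 'select * from table_name order by rec_date' \
#       | strip_blank_lines > /problem1/solution
```

Note that this removes every blank line, including any that legitimately appear in the middle of the result.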
This question may have been asked before, and I am relatively new to Hadoop and Hive. As a test, I'm trying to export some content to see if I am doing things correctly. The code is below.
Use MY_DATABASE_NAME;
INSERT OVERWRITE LOCAL DIRECTORY '/random/directory/test'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY "\n"
SELECT date_ts,script_tx,sequence_id FROM dir_test WHERE date_ts BETWEEN '2018-01-01' and '2018-01-02';
That is what I have so far, but it generates multiple files, and I want to combine them into a single .csv or .xls file to work on. What do I do next to accomplish this?
Thanks in advance.
You can achieve this in any of the following ways:
Use a single reducer in the query, for example by adding ORDER BY <col_name>
Store the result in HDFS and then merge it with hdfs dfs -getmerge [-nl] <src> <localdest>
Use beeline: beeline --outputformat=csv2 -f query_file.sql > <file_name>.csv
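When the part files are already in a local directory, as with INSERT OVERWRITE LOCAL DIRECTORY, they can also be concatenated with cat. The sketch below uses stand-in part files in a temporary directory. The helper name merge_parts and the file names are assumptions for illustration.

```shell
#!/bin/sh
# Hypothetical helper: concatenate every file in a local output directory
# into one file, in the shell's glob (name) order. The '*' glob skips dotfiles.
# The destination file must live outside the source directory.
merge_parts() {
    src_dir=$1
    dest_file=$2
    cat "$src_dir"/* > "$dest_file"
}

# Demo with stand-in part files (names chosen for illustration).
work=$(mktemp -d)
mkdir "$work/test"
printf '2018-01-01,a,1\n' > "$work/test/000000_0"
printf '2018-01-02,b,2\n' > "$work/test/000001_0"
merge_parts "$work/test" "$work/combined.csv"
cat "$work/combined.csv"
rm -rf "$work"
```

Concatenating in name order only preserves a global sort if the query itself produced the files in that order, which is why the single-reducer option matters for ORDER BY queries.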
I know that you can get column names from a table via the following trick in hive:
hive> set hive.cli.print.header=true;
hive> select * from tablename;
Is it also possible to just get the column names from the table?
I dislike having to change a setting for something I only need once.
My current solution is the following:
hive> set hive.cli.print.header=true;
hive> select * from tablename;
hive> set hive.cli.print.header=false;
This seems too verbose and against the DRY-principle.
If you simply want to see the column names this one line should provide it without changing any settings:
describe database.tablename;
However, if that doesn't work in your version of Hive, the following will, although it also switches your current database to the one named in the use statement:
use database;
describe tablename;
You could also run show columns in $table, or see the question "Hive, how do I retrieve all the database's tables columns" for ways to access the Hive metadata.
The solution is
show columns in table_name;
This is simpler than using
describe tablename;
Thanks a lot.
Use desc tablename from the Hive CLI or beeline to get all the column names. If you want the column names in a file, run the command below from the shell.
$ hive -e 'desc dbname.tablename;' > ~/columnnames.txt
where dbname is the name of the Hive database in which your table resides.
You can find the file columnnames.txt in your home directory.
$ cd ~
$ ls
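The desc output also carries the data type and comment of each column. If only the names are wanted, the first field of each line can be cut out. This is a sketch that assumes the column name is the first whitespace-separated field and that any partition section starts after a blank line or a line beginning with '#'. The exact layout may differ between Hive versions, and column_names is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print only the first field of each line of desc output.
# Assumption: column rows come first; stop at the first blank line or '#' line,
# which is where a partition section (if any) is assumed to begin.
column_names() {
    awk 'NF == 0 || $1 ~ /^#/ { exit } { print $1 }'
}

# Demo with stand-in text shaped like desc output for a partitioned table.
printf 'id\tint\t\nname\tstring\t\n\n# Partition Information\ndt\tstring\t\n' | column_names

# With Hive available this would be used as, e.g.:
#   hive -e 'desc dbname.tablename;' | column_names > ~/columnnames.txt
```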
The best way to do this is to set the properties below:
set hive.cli.print.header=true;
set hive.resultset.use.unique.column.names=false;
Below is the Hive table I have created:
CREATE EXTERNAL TABLE Activity (
column1 type,
column2 type
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/exttable/';
In my HDFS location /exttable, I have a lot of CSV files, and each CSV file also contains a header row. When I run select queries, the result contains the header rows as well.
Is there any way in Hive to ignore the header row, that is, the first line of each file?
As of Hive 0.13.0, you can skip header rows by adding a table property:
tblproperties ("skip.header.line.count"="1");
If you are using Hive version 0.13.0 or higher you can specify "skip.header.line.count"="1" in your table properties to remove the header.
For detailed information on the patch see: https://issues.apache.org/jira/browse/HIVE-5795
Let's say you want to load a CSV file like the one below, located at /home/test/que.csv
1,TAP (PORTUGAL),AIRLINE
2,ANSA INTERNATIONAL,AUTO RENTAL
3,CARLTON HOTELS,HOTEL-MOTEL
Now, we need to create a location in HDFS that holds this data.
hadoop fs -put /home/test/que.csv /user/mcc
The next step is to create a table. There are two types to choose from: managed and external.
Example for External Table.
create external table industry_
(
MCC string ,
MCC_Name string,
MCC_Group string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/mcc/'
tblproperties ("skip.header.line.count"="1");
Note: When accessed via Spark SQL, the header row of the CSV will be shown as a data row.
Tested on: spark version 2.4.
There is not. However, you can pre-process your files to drop the first row before loading them into HDFS:
tail -n +2 withfirstrow.csv > withoutfirstrow.csv
Alternatively, you can add a WHERE clause to your Hive queries that filters out the header row.
If your Hive version doesn't support tblproperties ("skip.header.line.count"="1"), you can use the Unix command below to drop the first line (the column header) before putting the file in HDFS.
sed -n '2,$p' File_with_header.csv > File_with_No_header.csv
To remove the header from the CSV file in place (with GNU sed), use:
sed -i 1d filename.csv
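With many CSV files, the same header removal can be applied to a whole directory before upload. Below is a minimal sketch that writes header-free copies into a second directory, leaving the originals untouched. The helper name strip_headers and the directory names are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: copy every .csv in src_dir to dest_dir without its first line.
strip_headers() {
    src_dir=$1
    dest_dir=$2
    mkdir -p "$dest_dir"
    for f in "$src_dir"/*.csv; do
        [ -e "$f" ] || continue   # no matches: the glob pattern stays literal
        tail -n +2 "$f" > "$dest_dir/$(basename "$f")"
    done
}

# Demo with a stand-in file in a temporary directory.
work=$(mktemp -d)
mkdir "$work/in"
printf 'MCC,MCC_Name\n1,TAP (PORTUGAL)\n' > "$work/in/que.csv"
strip_headers "$work/in" "$work/out"
cat "$work/out/que.csv"
rm -rf "$work"

# The cleaned files could then be uploaded, e.g.:
#   hadoop fs -put /path/to/out/*.csv /exttable/
```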
I need to load data from a CSV file into SQL Server.
My CSV file looks like this:
Name,CustomerID
A,1
b,2
End
I need to load only the data rows into the SQL Server table:
A,1
b,2
I tried BULK INSERT, but the header row is loaded as well.
I need to remove the header and the footer.
Is the only option to create a bcp format file?
Any help appreciated.
Thanks
You can use the FIRSTROW and LASTROW options of BULK INSERT. If the file has 100 lines, set FIRSTROW = 2 and LASTROW = 99.
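When the number of lines is not known in advance, another option is to strip the first and last lines from the file before loading it. This is a sketch that assumes a POSIX shell with sed is available wherever the file is prepared. The helper name strip_header_and_footer and the file names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print a file without its first and last lines.
strip_header_and_footer() {
    sed '1d;$d' "$1"
}

# Demo with the sample file from the question.
work=$(mktemp -d)
printf 'Name,CustomerID\nA,1\nb,2\nEnd\n' > "$work/customers.csv"
strip_header_and_footer "$work/customers.csv" > "$work/data_only.csv"
cat "$work/data_only.csv"
rm -rf "$work"
```

The resulting data_only.csv contains only the data rows and can be loaded without any row-range options.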