Hive External table-CSV File- Header row - hive

Below is the Hive table I have created:
CREATE EXTERNAL TABLE Activity (
column1 type,
column2 type
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/exttable/';
In my HDFS location /exttable, I have many CSV files, and each CSV file contains a header row. When I run SELECT queries, the results include the header rows as well.
Is there any way in Hive to ignore the header row (the first line) of each file?

As of Hive 0.13.0, you can skip header rows by adding this table property:
tblproperties ("skip.header.line.count"="1");

If you are using Hive version 0.13.0 or higher you can specify "skip.header.line.count"="1" in your table properties to remove the header.
For detailed information on the patch see: https://issues.apache.org/jira/browse/HIVE-5795

Let's say you want to load a CSV file like the one below, located at /home/test/que.csv
1,TAP (PORTUGAL),AIRLINE
2,ANSA INTERNATIONAL,AUTO RENTAL
3,CARLTON HOTELS,HOTEL-MOTEL
Now, we need to create a location in HDFS that holds this data.
hadoop fs -put /home/test/que.csv /user/mcc
The next step is to create a table. There are two types to choose from (managed and external). Refer to this for choosing one.
Example of an external table:
create external table industry_
(
MCC string ,
MCC_Name string,
MCC_Group string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/mcc/'
tblproperties ("skip.header.line.count"="1");
Note: When accessed via Spark SQL, the header row of the CSV will be shown as a data row.
Tested on: spark version 2.4.

There is not. However, you can preprocess your files to drop the first row before loading them into HDFS:
tail -n +2 withfirstrow.csv > withoutfirstrow.csv
Alternatively, you can add a WHERE clause to your Hive queries that filters out the header row.

If your Hive version doesn't support tblproperties ("skip.header.line.count"="1"), you can use the Unix command below to drop the first line (the column header) before putting the file into HDFS.
sed -n '2,$p' File_with_header.csv > File_with_No_header.csv

To remove the header from the CSV file in place, use:
sed -i 1d filename.csv
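With many CSV files in one directory, as in the question, the same idea can be applied in a loop before uploading. This is only a minimal sketch: the function name strip_headers and the directory names are made up for illustration, and it assumes the files are still on the local file system.

```shell
#!/bin/sh
# Hypothetical helper: copy every *.csv in directory $1 to directory $2
# without its first line, leaving the originals untouched.
strip_headers() {
    src_dir=$1
    dst_dir=$2
    mkdir -p "$dst_dir"
    for f in "$src_dir"/*.csv; do
        [ -e "$f" ] || continue    # no CSV files: skip the unexpanded pattern
        tail -n +2 "$f" > "$dst_dir/$(basename "$f")"
    done
}

# Example run on a throwaway directory (sample data is made up).
work=$(mktemp -d)
mkdir "$work/in"
printf 'id,name\n1,TAP\n2,ANSA\n' > "$work/in/a.csv"
strip_headers "$work/in" "$work/out"
cat "$work/out/a.csv"    # shows only the two data rows
```

The cleaned directory can then be uploaded with hadoop fs -put as usual.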

Related

insert into hive external table as select and ensure it generates single file in table directory

My question is somewhat similar to the post below. I want to download some data from a Hive table using a SELECT query. Because the data is large, I want to write it to an external table at a given path so that I can create a CSV file. I use the code below:
create external table output(col1 STRING, col2 STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '{outdir}/output';
INSERT OVERWRITE TABLE output
Select col1, col2 from atable limit 1000;
This works fine and creates a file named in the 0000_ format, which can be copied as a CSV file.
But my question is: how can I ensure that the output will always be a single file? If no partition is defined, will it always be a single file? What rule does Hive use to split files?
I saw a few similar questions, like the one below, but they discuss HDFS file access.
How to point to a single file with external table
I know the below alternative, but I use a hive connection object to execute queries from a remote node.
hive -e ' selectsql; ' | sed 's/[\t]/,/g' > outpathwithfilename
You can set the property below before running the INSERT OVERWRITE:
set mapreduce.job.reduces=1;
Note: If Hive does not allow this parameter to be modified at runtime, whitelist it by setting the property below in hive-site.xml:
hive.security.authorization.sqlstd.confwhitelist.append=|mapreduce.job.*|mapreduce.map.*|mapreduce.reduce.*

How to create/copy data to partitions in hive manually

I am working on a Hive solution in which I need to append some values to high-volume files. Instead of appending to the files directly, I am trying a MapReduce approach.
The approach is below.
Table creation:
create external table demo_project_data(data string) PARTITIONED BY (business_date string, src_sys_file_nm string, prd_typ_cd string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LINES TERMINATED BY '\n'
LOCATION '/user/hive/warehouse/demo/project/data';
hadoop fs -mkdir -p /user/hive/warehouse/demo/project/data/business_date='20180707'/src_sys_file_nm='a_b_c_20180707_1.dat.gz'/prd_typ_cd='abcd'
echo "ALTER TABLE demo_project_data ADD IF NOT EXISTS PARTITION(business_date='20180707',src_sys_file_nm='a ch_ach_fotp_20180707_1.dat.gz',prd_typ_cd='ach')
LOCATION '/user/hive/warehouse/demo/project/data/business_date='20180707'/src_sys_file_nm='a_b_c_20180707_1.dat.gz'/prd_typ_cd='abcd';"|hive
hadoop fs -cp /apps/tdi/data/a_b_c_20180707_1.dat.gz /user/hive/warehouse/demo/project/data/business_date='20180707'/src_sys_file_nm='a_b_c_20180707_1.dat.gz'/prd_typ_cd='abcd'
echo "INSERT OVERWRITE DIRECTORY '/user/20180707' select *,'~karthick~kb~demo' from demo_project_data where src_sys_file_nm='a_b_c_20180707_1.dat.gz' and business_date='20180707' and prd_typ_cd='abcd';"|hive
I have data in the file, but the query above returns no results. The files are copied to the correct location.
What am I doing wrong? The query itself runs without errors.
I will also be looping over multiple dates, so I would like to know whether this is the right way to do it.
You can use the command below to register the partitions in the metastore so that queries return their data:
MSCK REPAIR TABLE <tablename>;
See the documentation for MSCK REPAIR TABLE:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RecoverPartitions(MSCKREPAIRTABLE)
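Since the question mentions looping over several dates, the partitions can also be registered explicitly with one ALTER TABLE statement per date. The sketch below only generates the statements. The function name add_partition_sql and the sample dates are assumptions for illustration; the table name and base path are taken from the question. Note that the partition values match the directory names and that the LOCATION path contains no quote characters, because the shell strips the quotes from the hadoop fs -mkdir arguments.

```shell
#!/bin/sh
# Base path of the external table, as given in the question.
BASE=/user/hive/warehouse/demo/project/data

# Hypothetical helper: print one ADD PARTITION statement.
# Arguments: business date, source file name, product type code.
add_partition_sql() {
    d=$1
    f=$2
    p=$3
    printf "ALTER TABLE demo_project_data ADD IF NOT EXISTS PARTITION (business_date='%s', src_sys_file_nm='%s', prd_typ_cd='%s') LOCATION '%s/business_date=%s/src_sys_file_nm=%s/prd_typ_cd=%s';\n" \
        "$d" "$f" "$p" "$BASE" "$d" "$f" "$p"
}

# Loop over several dates (the second date is made up).
for d in 20180707 20180708; do
    add_partition_sql "$d" "a_b_c_${d}_1.dat.gz" abcd
done
```

The generated statements can be saved to a file and run with hive -f, or piped to hive as in the question.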

Hue on Cloudera - NULL values (importing file)

Yesterday I installed Cloudera QuickStart VM 5.8. After importing files from the database through Hue, some tables contain NULL values (an entire column). In the previous steps the data was displayed properly, as it should have been imported.
First Pic.
Second Pic.
Can you run describe formatted table_name in the Hive shell and check the field delimiter, then go to the warehouse directory and check whether the delimiter in the data matches the one in the table definition? I am fairly sure they differ, which is why you see NULLs.
I am assuming you have imported the data into the default warehouse directory.
Then you can do one of the following:
1) Drop your Hive table and create it again with the delimiter that is actually used in the data (row format delimited fields terminated by "your delimiter"), giving the location of your data file,
or
2) Delete the imported data and run sqoop import again, passing --fields-terminated-by with the delimiter from the Hive table definition.
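To check which delimiter the data really uses, it can help to count the fields on the first line for each candidate delimiter. This is a rough sketch, assuming a copy of the data file is available locally; the function name count_fields and the sample data are made up.

```shell
#!/bin/sh
# Hypothetical helper: print how many fields the first line of a file has
# when it is split on the given single-character delimiter.
count_fields() {
    file=$1
    delim=$2
    head -n 1 "$file" | awk -F"$delim" '{ print NF }'
}

# Sample file that is semicolon-separated (data is made up).
sample=$(mktemp)
printf '1;alpha;beta\n2;gamma;delta\n' > "$sample"
count_fields "$sample" ','    # prints 1: the line is not comma-separated
count_fields "$sample" ';'    # prints 3: this is the delimiter in the data
```

A count of 1 for the delimiter named in the table definition means Hive sees each line as a single field, which is consistent with the remaining columns showing up as NULL.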
Check the data types of the second (col_1) and third (col_2) columns in the original database you are exporting from.
This cannot be a delimiter mismatch; otherwise the fourth column (col_3) would not have been populated correctly, and it is.

HiveQL Where In Clause That Points to a Set of Files

I have a set of ~100 files each with 50k IDs in them. I want to be able to make a query against Hive that has a Where In clause using the IDs from these files. I could also do this directly from Groovy, but I'm thinking the code would be cleaner if I did all of the processing from Hive instead of referencing an external Set. Is this possible?
Create an external table describing the format of your files, and set the location to the HDFS path of a directory containing the files. For example, for tab-delimited files:
create external table my_ids(
id bigint,
other_col string
)
row format delimited fields terminated by "\t"
stored as textfile
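location 'hdfs://mydfs/data/myids';
Now you can use Hive to query this data, for example by joining your main table against my_ids with a LEFT SEMI JOIN, which behaves like a WHERE ... IN filter.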

Configuring delimiter for Hive MR Jobs

Is there any way to configure the delimiter for Hive MR jobs?
The default delimiter used by Hive internally is the "Hive delimiter" (\001). My use case is to configure the delimiter so that I can use any delimiter, as required. In Hadoop there is a property, "mapred.textoutputformat.separator", which sets the key-value separator to the value specified for that property. Is there a similar way to configure the delimiter in Hive? I searched a lot but did not find any useful links. Please help.
As of Hive 0.11.0, you can write:
INSERT OVERWRITE LOCAL DIRECTORY '...'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
SELECT ...
See HIVE-3682 for the complete syntax.
You can try this:
SELECT (rest of your query)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY 'YourChar' (example: FIELDS TERMINATED BY '\t')
You can also use this:
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('field.delim'='-','serialization.format'='-')
This will separate columns using the - delimiter, but it is specific to LazySimpleSerDe.
I guess you are using the INSERT OVERWRITE DIRECTORY option to write to an HDFS file.
If you create a Hive table on top of the HDFS file without specifying a delimiter, it will use '\001' as the delimiter, so you can read the file through a Hive table without any issues.
If your source table does not specify a delimiter in its CREATE statement, you won't be able to change it; your output will always use the default. And yes, the delimiter is controlled by the CREATE statement of the source table, so that isn't configurable either.
I had a similar issue and ended up replacing the \001 delimiter as a second step after the Hive MR job finished.
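Such a second step can be as small as a tr call over the output files. The following is a minimal sketch, not the poster's actual script: the function name replace_ctrl_a is made up, and it assumes no field already contains the replacement character, since nothing is quoted or escaped.

```shell
#!/bin/sh
# Hypothetical helper: read Hive's default \001-delimited text on stdin and
# write it to stdout with another single-character delimiter (default: comma).
replace_ctrl_a() {
    tr '\001' "${1:-,}"
}

# Sample rows are made up; printf expands \001 to the Ctrl-A byte.
printf 'id1\001foo\001bar\n' | replace_ctrl_a        # prints id1,foo,bar
printf 'id1\001foo\001bar\n' | replace_ctrl_a '|'    # prints id1|foo|bar
```

For files written by INSERT OVERWRITE DIRECTORY, the input could come from hadoop fs -cat on the output directory's files.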