How to combine multiple text files into a Hive table - sql

I am currently trying to write a Hive script that takes in a directory path and generates a Hive table combining all the different files in that path. I have found how to load files when I know their direct paths, but how can I do this without knowing all the file paths?
This is how I would do it if I knew the file paths in the given directory, /combine:
LOAD DATA INPATH '/combine/file1.txt' INTO TABLE tablename;
LOAD DATA INPATH '/combine/file2.txt' INTO TABLE tablename;
But how would you get the same result if you don't know the file paths, only the directory?

Just using the * wildcard will load all the files under the directory into the table:
LOAD DATA INPATH '/combine/*' INTO TABLE tablename;
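Note that LOAD DATA INPATH also accepts a bare directory path, in which case Hive moves every file inside that directory into the table, with no wildcard needed:
LOAD DATA INPATH '/combine' INTO TABLE tablename;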

You can use an external Hive table.
Create a folder on HDFS and load the two files there:
hadoop fs -mkdir /hive-data
hadoop fs -put file1.txt /hive-data/file1.txt
hadoop fs -put file2.txt /hive-data/file2.txt
Alternatively, specify a directory from which to load all files:
hadoop fs -put directory-with-files/* /hive-data/
Verify the files are loaded properly:
hadoop fs -ls /hive-data
Create an external table in Hive and refer to the HDFS location:
(Change the schema, field, and line delimiters to match your data files.)
CREATE EXTERNAL TABLE tablename
(
id INT,
desc STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
LOCATION 'hdfs:///hive-data';
Verify data in Hive:
select * from tablename;
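Because the table is external, dropping it removes only the table metadata and leaves the files in /hive-data intact. New files added to the directory later are picked up by subsequent queries automatically; for example (file3.txt is just an illustration):
hadoop fs -put file3.txt /hive-data/file3.txt
select * from tablename;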

Related

hive not adding file extensions to file names

I've run the hive query
create table my_schema.my_table stored as parquet as select ...
It created the table and the files, but I do not see a .parq file extension on the files, which is a bit of a problem for me since I wanted to be able to run something like hdfs dfs -ls -R /path/to/directory | grep .parq to list all parquet files in a directory.
Is there either a way to filter parquet files regardless of file extension or a way to make hive include the extension?
I have a similar query using Impala, and there I can see the .parq files without any issue.
Hive will not add an extension to the file. You need to rename it manually; since both paths are in HDFS, this is a move rather than a put:
$ hadoop fs -mv /path/to/directory/000000_0 /path/to/directory/data.parquet
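If the query produced several part files, a small shell loop can rename them all; a sketch, assuming the default 000000_N-style names and a Hadoop version whose ls supports -C (print paths only):
for f in $(hdfs dfs -ls -C /path/to/directory); do
  hdfs dfs -mv "$f" "${f}.parq"
done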

load only few files from a HDFS directory

I want to load some of the files from an HDFS directory into a table.
The files in the HDFS directory are as below.
/data/log/user1log.csv
/data/log/user2log.csv
/data/log/user3log.csv
/data/log/user4log.csv
/data/log/user5log.csv
Now I want to load /data/log/user1log.csv and /data/log/user2log.csv files.
I have tried the below.
CREATE EXTERNAL TABLE log_data (username string,log_dt string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
tblproperties ("skip.header.line.count"="1");
load data inpath '/data/log/user1log.csv' into table log_data;
load data inpath '/data/log/user2log.csv' into table log_data;
But after loading data into the table, the files vanish from the HDFS location, and I need the files to stay there.
Please help me.
Thanks in advance.
I don't think it's possible: LOAD DATA INPATH moves the data rather than copying it.
However, since you have an external table, you can load data even without using LOAD DATA INPATH.
Here's how you can do it.
Specify the location for your Hive Table
CREATE EXTERNAL TABLE log_data (username string, log_dt string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/data/log_data/table'
tblproperties ("skip.header.line.count"="1");
Copy Files to Location
hdfs dfs -cp /data/log/user1log.csv /data/log_data/table/
hdfs dfs -cp /data/log/user2log.csv /data/log_data/table/
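Unlike LOAD DATA INPATH, cp leaves the original files under /data/log in place. Since the table's location already points at /data/log_data/table, the copied files are queryable immediately:
select * from log_data;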

LOAD DATA INPATH table files start with some string in Impala

Just a simple question; I'm new to Impala.
I want to load data from HDFS into my data lake using Impala.
So I have a csv this_is_my_data.csv, and what I want to do is load the file without specifying the full file name, I mean something like the following:
LOAD DATA INPATH 'user/myuser/this_is.*' INTO TABLE my_table;
That is, a string starting with this_is and whatever follows.
If you need some additional information, please let me know. Thanks in advance.
The documentation says:
You can specify the HDFS path of a single file to be moved, or the
HDFS path of a directory to move all the files inside that directory.
You cannot specify any sort of wildcard to take only some of the files
from a directory.
The workaround is to put your files into the table directory using an mv or cp command. Find your table directory using the DESCRIBE FORMATTED command, then run mv or cp (in a shell, not in Impala, of course):
hdfs dfs -mv "/user/myuser/this_is*" "/user/cloudera/mytabledir"
Note the glob is this_is* rather than the regex-style this_is.*, since a literal dot would not match a name like this_is_my_data.csv.
Or put files you need to load into some directory first then load all the directory.
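One caveat with this workaround: Impala caches table metadata, so it will not see the moved files until you refresh the table. A minimal sequence, using the table name from the question:
DESCRIBE FORMATTED my_table;   -- shows the table's HDFS location
REFRESH my_table;              -- after the mv/cp, pick up the new files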

Hive Managed table - Filename

In Hive managed tables, is there any way to specify the filename for the data files getting created?
For example, the data file below ends with "000000_0"; is it possible to have that file generated with a specific name?
hdfs://quickstart.cloudera:8020/user/hive/warehouse/orders_partitioned/order_month=Apr/000000_0
There is no way to specify the file name when you load the data using the Hive CLI or Sqoop. But there is a way to place a file with the name you want, using the copy command:
hadoop fs -cp <src_file> <dest_folder>
In this case you have to be careful that the data in the source file exactly matches the partition condition of the destination directory.
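For example, with the partitioned table from the question, the file simply keeps whatever name you give the destination path (orders_apr.csv is just an illustrative name, and its rows must all belong to order_month=Apr):
hadoop fs -cp /tmp/orders_apr.csv hdfs://quickstart.cloudera:8020/user/hive/warehouse/orders_partitioned/order_month=Apr/orders_apr.csv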

Hive External table-CSV File- Header row

Below is the hive table i have created:
CREATE EXTERNAL TABLE Activity (
column1 type,
column2 type
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/exttable/';
In my HDFS location /exttable, I have a lot of CSV files, and each CSV file contains a header row. When I run select queries, the result contains the header rows as well.
Is there any way in Hive to ignore the header row / first line?
If you are using Hive version 0.13.0 or higher, you can skip the header row by specifying it in the table properties:
tblproperties ("skip.header.line.count"="1");
For detailed information on the patch see: https://issues.apache.org/jira/browse/HIVE-5795
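Since the Activity table already exists, you don't have to recreate it; the property can also be set after the fact:
ALTER TABLE Activity SET TBLPROPERTIES ("skip.header.line.count"="1");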
Let's say you want to load a csv file like the one below, located at /home/test/que.csv:
1,TAP (PORTUGAL),AIRLINE
2,ANSA INTERNATIONAL,AUTO RENTAL
3,CARLTON HOTELS,HOTEL-MOTEL
Now, we need to create a location in HDFS that holds this data:
hadoop fs -mkdir /user/mcc
hadoop fs -put /home/test/que.csv /user/mcc
Next step is to create a table. There are two types to choose from: managed and external.
Example for External Table.
create external table industry_
(
MCC string ,
MCC_Name string,
MCC_Group string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/mcc/'
tblproperties ("skip.header.line.count"="1");
Note: when the table is accessed via Spark SQL, the header row of the CSV will be shown as a data row. Tested on Spark version 2.4.
There is not. However, you can pre-process your files to skip the first row before loading them into HDFS:
tail -n +2 withfirstrow.csv > withoutfirstrow.csv
Alternatively, you can build a filter into the WHERE clause in Hive to ignore the first row, as sketched below.
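A sketch of that WHERE-clause approach, assuming the header line repeats the column names (the string literal below stands for whatever actually appears in the first field of your header row):
SELECT * FROM Activity WHERE column1 <> 'column1';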
If your Hive version doesn't support tblproperties ("skip.header.line.count"="1"), you can use the Unix command below to remove the first line (the column header) before putting the file in HDFS:
sed -n '2,$p' File_with_header.csv > File_with_No_header.csv
To remove the header from the csv file in place, use:
sed -i 1d filename.csv
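If you have a lot of CSV files, as in the question, the same in-place edit can be wrapped in a shell loop before uploading; a sketch, assuming the files sit in the current local directory:
for f in *.csv; do sed -i 1d "$f"; done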