How to combine multiple text files into a Hive table

I am currently trying to write a Hive script to take in a directory path and generate a Hive table combining all the different files in the path together. I have found how to load files given I know the direct path to them, but how can I do this without knowing all the file paths?
This is how I would do it if I knew the paths, given the directory /combine:
LOAD DATA INPATH '/combine/file1.txt' INTO TABLE tablename;
LOAD DATA INPATH '/combine/file2.txt' INTO TABLE tablename;
But how would you get the same result if you don't know the file paths, only the directory?

Just use the * wildcard to load all the files under the directory into the table:
LOAD DATA INPATH '/combine/*' INTO TABLE tablename;
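Since the question asks for a script that takes a directory path, the statement can also be generated from that path. A minimal sketch, assuming a POSIX shell; load_dir_statement is a hypothetical helper name, not part of Hive:

```shell
# Hypothetical helper: print a LOAD DATA statement covering every file
# directly under the given HDFS directory.
load_dir_statement() {
  dir=${1%/}      # drop one trailing slash, if present
  table=$2
  printf "LOAD DATA INPATH '%s/*' INTO TABLE %s;\n" "$dir" "$table"
}

# Print the statement for review.
load_dir_statement /combine/ tablename

# On a machine with the Hive CLI, it could then be run with:
#   hive -e "$(load_dir_statement /combine tablename)"
```

Keep in mind that LOAD DATA INPATH moves the files into the table's directory rather than copying them, so /combine will be empty afterwards.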

You can use an external Hive table.
Create a folder on HDFS and load the two files there:
hadoop fs -mkdir /hive-data
hadoop fs -put file1.txt /hive-data/file1.txt
hadoop fs -put file2.txt /hive-data/file2.txt
Alternatively, specify a directory from which to load all files:
hadoop fs -put directory-with-files/* /hive-data/
Verify the files are loaded properly:
hadoop fs -ls /hive-data
Create an external table in Hive and refer to the HDFS location:
(Change the schema, field, and line delimiters to match your data files.)
id INT,
LOCATION 'hdfs:///hive-data';
Verify data in Hive:
select * from tablename;
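The CREATE statement in this answer is incomplete as posted. Below is a sketch of what a full statement might look like, printed by a small shell helper so it can be reviewed or saved to a file. The second column (name STRING) and the comma and newline delimiters are assumptions, not taken from the original answer, and external_table_ddl is a hypothetical helper name.

```shell
# Hypothetical helper: print DDL for an external table over an HDFS directory.
# The column list and the delimiters are placeholders; adjust them to the data.
external_table_ddl() {
  table=$1
  location=$2
  cat <<EOF
CREATE EXTERNAL TABLE $table (
  id INT,
  name STRING
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '$location';
EOF
}

# Print the statement for review.
external_table_ddl tablename hdfs:///hive-data

# With the Hive CLI available, it could be saved and run:
#   external_table_ddl tablename hdfs:///hive-data > create_table.hql
#   hive -f create_table.hql
```

Because the table is external, Hive reads whatever files are in the location at query time, so files added to /hive-data later show up without another load step.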


hive not adding file extensions to file names

I've run the hive query
create table my_schema.my_table stored as parquet as select ...
It created the table and the files, but I do not see a .parq file extension on the files, which is a bit of a problem for me since I wanted to be able to run something like hdfs dfs -ls -R /path/to/directory | grep .parq to list all Parquet files in a directory.
Is there either a way to filter parquet files regardless of file extension or a way to make hive include the extension?
I have a similar query using Impala, and there I can see the .parq files without any issue.
Hive will not add an extension to the files. You need to rename them manually (use -mv to rename within HDFS; -put would upload a local file instead):
$ hadoop fs -mv /path/to/directory/000000_0 /path/to/directory/data.parquet
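For the other half of the question, filtering Parquet files regardless of extension: a Parquet file begins (and ends) with the 4-byte magic string PAR1, so the content can be checked instead of the name. A minimal sketch, demonstrated on local stand-in files; is_parquet is a hypothetical helper name:

```shell
# Hypothetical helper: succeed if the data on stdin starts with the
# Parquet magic bytes "PAR1".
is_parquet() {
  [ "$(head -c 4)" = "PAR1" ]
}

# Local demonstration with two stand-in files (not real Parquet data).
tmp=$(mktemp -d)
printf 'PAR1 fake column data PAR1' > "$tmp/000000_0"
printf 'id,name\n1,a\n' > "$tmp/notes.csv"

for f in "$tmp"/*; do
  if is_parquet < "$f"; then
    echo "parquet: ${f##*/}"
  fi
done
rm -rf "$tmp"

# On a cluster, the same check could read from HDFS instead:
#   hdfs dfs -cat /path/to/directory/000000_0 | is_parquet && echo parquet
```

This only inspects the first four bytes, so it is a quick filter rather than a full validation of the file.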

load only few files from a HDFS directory

I want to load some of the files from an HDFS directory into a table.
The files in the HDFS directory are as below.
Now I want to load /data/log/user1log.csv and /data/log/user2log.csv files.
I have tried the below.
CREATE EXTERNAL TABLE log_data (username string,log_dt string)
tblproperties ("skip.header.line.count"="1");
load data inpath '/data/log/user1log.csv' into table log_data;
load data inpath '/data/log/user2log.csv' into table log_data;
But after loading the data into the table, the files vanish from the HDFS location.
We need to keep the files in that HDFS location.
Please help me.
Thanks in advance.
I don't think that's possible with LOAD DATA INPATH, because it moves the data rather than copying it.
However, since you have an external table, you can make the data available without using LOAD DATA INPATH.
Here's how you can do it.
Specify the location for your Hive Table
CREATE EXTERNAL TABLE log_data (username string,log_dt string)
location '/data/log_data/table'
tblproperties ("skip.header.line.count"="1");
Copy Files to Location
hdfs dfs -cp /data/log/user1log.csv /data/log_data/table/
hdfs dfs -cp /data/log/user2log.csv /data/log_data/table/

LOAD DATA INPATH table files start with some string in Impala

Just a simple question; I'm new to Impala.
I want to load data from HDFS into my data lake using Impala.
I have a CSV file, this_is_my_data.csv, and I want to load it without specifying the full file name, something like the following:
LOAD DATA INPATH 'user/myuser/this_is.*' INTO TABLE my_table
That is, a string starting with this_is and whatever follows.
If you need some additional information, please let me know. Thanks in advance.
The documentation says:
You can specify the HDFS path of a single file to be moved, or the
HDFS path of a directory to move all the files inside that directory.
You cannot specify any sort of wildcard to take only some of the files
from a directory.
The workaround is to put your files into the table directory using the mv or cp command. Find your table directory with the DESCRIBE FORMATTED command, then run mv or cp (in a shell, not in Impala, of course). Note that HDFS patterns are globs, so use this_is* rather than the regex-style this_is.*:
hdfs dfs -mv "user/myuser/this_is*" "/user/cloudera/mytabledir"
Or first put the files you need to load into a separate directory, then load the whole directory.
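HDFS path patterns are globs, as in a POSIX shell, not regular expressions: this_is* matches every name starting with this_is, whereas this_is.* only matches names that continue with a literal dot. The difference can be checked locally before touching the cluster. A small sketch using a temporary local directory with made-up file names:

```shell
# Create three stand-in files in a temporary local directory.
tmp=$(mktemp -d)
touch "$tmp/this_is_my_data.csv" "$tmp/this_is.other" "$tmp/unrelated.csv"

# Count how many names each pattern matches.
glob_matches=$(cd "$tmp" && ls -d this_is* | wc -l | tr -d ' ')
dot_matches=$(cd "$tmp" && ls -d this_is.* | wc -l | tr -d ' ')

echo "this_is*  matches $glob_matches file(s)"
echo "this_is.* matches $dot_matches file(s)"
rm -rf "$tmp"

# On the cluster, the move would be followed by a REFRESH so that Impala
# notices files that were added outside of it:
#   hdfs dfs -mv "user/myuser/this_is*" "/user/cloudera/mytabledir"
#   impala-shell -q 'REFRESH my_table'
```

Quoting the pattern in the hdfs command keeps the local shell from expanding it, so HDFS itself resolves the glob.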

Hive Managed table - Filename

In Hive managed tables, is there any way to specify the file name for the data files that get created?
For example, the data file below ends with "000000_0"; is it possible to have that file generated with a specific name?
There is no way to specify the file name when you load the data using the Hive CLI or Sqoop. But you can place a file with the name you want into the table directory using the copy command:
hadoop fs -cp <src_file> <dest_folder>
In this case, make sure the data in the source file exactly matches the partition condition of the destination directory.

Hive External table-CSV File- Header row

Below is the Hive table I have created:
column1 type,
column2 type
LOCATION '/exttable/';
In my HDFS location /exttable, I have a lot of CSV files, and each CSV file also contains a header row. When I run select queries, the result contains the header rows as well.
Is there any way in Hive to ignore the header row or first line?
You can skip header rows as of Hive 0.13.0:
tblproperties ("skip.header.line.count"="1");
If you are using Hive version 0.13.0 or higher you can specify "skip.header.line.count"="1" in your table properties to remove the header.
For detailed information on the patch see:
Let's say you want to load a CSV file like the one below, located at /home/test/que.csv
Now, we need to create a location in HDFS that holds this data.
hadoop fs -put /home/test/que.csv /user/mcc
The next step is to create a table. There are two types to choose from, managed and external.
Example for External Table.
create external table industry_ (
MCC string,
MCC_Name string,
MCC_Group string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/mcc/'
tblproperties ("skip.header.line.count"="1");
Note: When accessed via Spark SQL, the header row of the CSV will be shown as a data row.
Tested on: spark version 2.4.
There is not. However, you can pre-process your files to drop the first row before loading them into HDFS:
tail -n +2 withfirstrow.csv > withoutfirstrow.csv
Alternatively, you can add a condition to the WHERE clause in Hive to ignore the header row.
If your Hive version doesn't support tblproperties ("skip.header.line.count"="1"), you can use the Unix command below to drop the first line (the column header) and then put the file in HDFS.
sed -n '2,$p' File_with_header.csv > File_with_No_header.csv
To remove the header from the csv file in place use:
sed -i 1d filename.csv
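When a directory holds many CSV files, as in the question, the header can be stripped from each one in a loop before uploading. A sketch on local files, assuming every file has exactly one header line; strip_headers is a hypothetical helper name and the file names are made up for the demonstration:

```shell
# Hypothetical helper: copy every .csv in src to dest without its first line.
strip_headers() {
  src=$1
  dest=$2
  mkdir -p "$dest"
  for f in "$src"/*.csv; do
    [ -e "$f" ] || continue          # no .csv files: nothing to do
    tail -n +2 "$f" > "$dest/${f##*/}"
  done
}

# Local demonstration with two small files.
work=$(mktemp -d)
printf 'column1,column2\n1,a\n2,b\n' > "$work/part1.csv"
printf 'column1,column2\n3,c\n' > "$work/part2.csv"
strip_headers "$work" "$work/noheader"
cat "$work/noheader"/*.csv

# The cleaned files could then be uploaded, for example:
#   hadoop fs -put "$work"/noheader/*.csv /exttable/
```

Writing to a separate directory keeps the originals intact, unlike sed -i, which edits the files in place.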