Load only a few files from an HDFS directory - Hive

I want to load some of the files from an HDFS directory into a table.
The files in the HDFS directory are as below:
/data/log/user1log.csv
/data/log/user2log.csv
/data/log/user3log.csv
/data/log/user4log.csv
/data/log/user5log.csv
Now I want to load only the /data/log/user1log.csv and /data/log/user2log.csv files.
I have tried the below:
CREATE EXTERNAL TABLE log_data (username string,log_dt string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
tblproperties ("skip.header.line.count"="1");
load data inpath '/data/log/user1log.csv' into table log_data;
load data inpath '/data/log/user2log.csv' into table log_data;
But after loading the data into the table, the files vanish from the HDFS location.
The files should remain in the HDFS location.
Please help me.
Thanks in advance.

I don't think that's possible: when you do LOAD DATA INPATH, it moves the data rather than copying it.
However, you have an external table, so you can load data even without using LOAD DATA INPATH.
Here's how you can do it.
Specify the location for your Hive table:
CREATE EXTERNAL TABLE log_data (username string, log_dt string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/data/log_data/table'
tblproperties ("skip.header.line.count"="1");
Copy the files to that location:
hdfs dfs -cp /data/log/user1log.csv /data/log_data/table/
hdfs dfs -cp /data/log/user2log.csv /data/log_data/table/
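If you want to confirm the originals stayed put and the table sees the copies, a quick check along these lines should work (paths as above):
hdfs dfs -ls /data/log
hdfs dfs -ls /data/log_data/table
Then in Hive:
select count(*) from log_data;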

Related

Importing data into a Hive table from an external S3 bucket URL link

I need to import data from a public S3 bucket whose URL was shared with me. How do I load the data into a Hive table?
I have tried the below command but it's not working:
create external table airlines_info (.... ) row format
delimited fields terminated by '|' lines terminated by '\n'
stored as textfile location 'https://ml-cloud-dataset.....*.txt';
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:ml-cloud-dataset.s3.amazonaws.com/Airlines_data.txt is not a directory or unable to create one)
I am very new to Hive and I am not sure about the code. I also tried the below command after creating the table to load the data into the Hive table, but that's also not working:
load data inpath 'https://ml-cloud-dataset.....*.txt' into table airlines_info;
The table location should be a directory in HDFS or S3, not a file and not an HTTPS link.
Download the file manually, put it into the local filesystem, and if you already have the table created, then use:
load data local inpath 'local_path_to_file' into table airlines_info;
If you do not have the table yet, create it and specify some location inside your S3 bucket; alternatively, create a MANAGED table (remove EXTERNAL from your DDL) without a location specified and it will create the location for you. Check the location using the DESCRIBE FORMATTED command. Later you can convert the table to EXTERNAL if necessary using ALTER TABLE airlines_info SET TBLPROPERTIES('EXTERNAL'='TRUE');
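A minimal sketch of that managed-table route, assuming a local copy of the downloaded file; the column list and the local path here are placeholders, not the real schema:
CREATE TABLE airlines_info (carrier string, flight_no string)  -- placeholder columns
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH '/tmp/Airlines_data.txt' INTO TABLE airlines_info;  -- example local path
DESCRIBE FORMATTED airlines_info;  -- shows the location Hive created for the table
ALTER TABLE airlines_info SET TBLPROPERTIES('EXTERNAL'='TRUE');  -- optional: convert to external later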
Instead of the LOAD DATA command, you can simply copy the file into the table location using the AWS CLI (provide the correct local path and the table directory's S3 URL):
aws s3 cp C:\Users\My_user\Downloads\Airlines_data.txt s3://mybucket/path/airlines_info/

How can we load data into Hive using a URL

I have created a table in Hive and I need to load CSV data into it,
but the data is on GitHub (I have downloaded it and tested that it works fine). I need to load the data directly from the URL. Is it possible to load data into Hive from a URL?
Will something like this work?
LOAD DATA INPATH 'https://github.com/xx/stock-prices.csv' INTO TABLE stocks;
Loading data from flat files into Hive can be done using the below command.
From Apache Hive Wiki:
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)] [INPUTFORMAT 'inputformat' SERDE 'serde'] (3.0 or later)
If the keyword LOCAL is specified, Hive looks for the file path in the local filesystem and loads it from there. If the keyword LOCAL is not specified, Hive looks for the file path in HDFS and loads the data from there.
You can specify a full URI for HDFS files as well as for local files.
Example:
file:///user/data/project/datafolder (Local Path)
hdfs://namenode:10001/user/data/project/datafolder (HDFS path)
This means it is not possible to load data into Hive directly from an HTTPS URL, so you have to download the data first and then load it into Hive.
This may not be the solution you were hoping for, but it is the correct answer.
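For example, here is a rough sketch of the download-then-load approach; the raw-file URL and the local path are placeholders (GitHub files need the raw download URL, not the repository page):
wget -O /tmp/stock-prices.csv 'https://raw.githubusercontent.com/xx/stock-prices.csv'
hive -e "LOAD DATA LOCAL INPATH '/tmp/stock-prices.csv' INTO TABLE stocks;"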

How to combine multiple text files into a Hive table

I am currently trying to write a Hive script to take in a directory path and generate a Hive table combining all the different files in the path together. I have found how to load files given I know the direct path to them, but how can I do this without knowing all the file paths?
This is how I would do it if I knew the paths, given the directory /combine:
LOAD DATA INPATH '/combine/file1.txt' INTO TABLE tablename;
LOAD DATA INPATH '/combine/file2.txt' INTO TABLE tablename;
But how would you get the same result if you don't know the file paths, only the directory?
Just use the * wildcard to load all the files under the directory into the table:
LOAD DATA INPATH '/combine/*' INTO TABLE tablename;
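Per the Apache Hive wiki, the filepath can also be a directory, in which case all the files within that directory are moved into the table, so this should work as well:
LOAD DATA INPATH '/combine' INTO TABLE tablename;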
You can use an external Hive table.
Create a folder on HDFS and load the two files there:
hadoop fs -mkdir /hive-data
hadoop fs -put file1.txt /hive-data/file1.txt
hadoop fs -put file2.txt /hive-data/file2.txt
Alternatively, specify a directory from which to load all files:
hadoop fs -put directory-with-files/* /hive-data/
Verify the files are loaded properly:
hadoop fs -ls /hive-data
Create an external table in Hive and refer to the HDFS location:
(Change the schema, field, and line delimiters to match your data files.)
CREATE EXTERNAL TABLE tablename
(
id INT,
desc STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
LOCATION 'hdfs:///hive-data';
Verify data in Hive:
select * from tablename;
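A convenient side effect of this setup: any file dropped into the location later is picked up automatically, with no reload needed. For example:
hadoop fs -put file3.txt /hive-data/file3.txt
Then in Hive:
select count(*) from tablename;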

Files lost by overwriting into Hive managed tables

I am using hadoop 2.7.3 and hive 2.1.1.
I had some 8-9 files in HDFS. I created one internal Hive table, loaded the first of those files into it, and did some operations on that data.
After that I loaded the second of those files by overwriting into that table:
load data inpath '/path/path1/first.csv' into table ABC;
load data inpath '/path/path1/second.csv' overwrite into table ABC;
I did some operations on the second file's data.
I then loaded the third file, and so on till the last file, using "overwrite into".
Now I see that all those files are no longer in their original location. Also, at /user/hive/warehouse/ABC only the last of the files is there.
Where did those previous files go? Are they lost because of overwriting into the Hive table? I ran hdfs dfs -ls -R / | grep "filename" but could not find my files.
LOAD DATA INPATH will move (not copy) the file from the source HDFS path to the table's warehouse path.
OVERWRITE will delete the files that already exist in the table (or, if HDFS Trash is enabled, move them to Trash) and replace them with the files given in the path.
LOAD DATA LOCAL INPATH copies the files.
LOAD DATA INPATH moves the files.
OVERWRITE deletes the existing files before moving in the new ones.
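If Trash is enabled, the overwritten files may still be recoverable from your .Trash directory; the exact path depends on your user and trash configuration, but something like:
hdfs dfs -ls /user/<your_user>/.Trash/Current/user/hive/warehouse/abc
And if you want to keep the source files in future loads, copy them to a staging path first and load the copies; the staging path here is just an example:
hdfs dfs -mkdir -p /tmp/staging
hdfs dfs -cp /path/path1/first.csv /tmp/staging/
load data inpath '/tmp/staging/first.csv' into table ABC;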

Hive: source data gets moved to the Hive data warehouse even when the table is external

In Hue --> Hive Query Browser I created an external table in Hive and loaded data from one of my CSV files into it using the following statements:
CREATE EXTERNAL TABLE movies(movieId BIGINT, title VARCHAR(100), genres VARCHAR(100)) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE;
LOAD DATA INPATH '/user/admin/movie_data/movies' INTO TABLE movies;
I see that the source file "movies" disappears from HDFS and moves to the Hive data warehouse. I am under the impression that an external table acts only as a link to the original source data.
Should the external table not be independent of the source data, so that if I were to drop the table, the source file would still persist? How do I create such an external table?
External tables store their data in the HDFS location specified when the table is created. If no location is provided, it defaults to a folder under the HDFS warehouse directory.
Try running use mydatabase_name; show create table mytable_name; to get the table definition and see which location it points to.
If you need an HDFS location other than the default one, you need to specify it while creating the table. Refer to the query below:
CREATE EXTERNAL TABLE test (col1 string) LOCATION '/data/database/tablename';
Secondly, LOAD DATA INPATH moves (not copies) the files from the INPATH into your table's HDFS location, even for an external table; that is why your source file disappeared. If you want the source files to stay where they are, point the table's LOCATION at the source directory instead of using LOAD DATA.
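A minimal sketch of that, reusing the question's movies schema and assuming the files live under /user/admin/movie_data:
CREATE EXTERNAL TABLE movies(movieId BIGINT, title VARCHAR(100), genres VARCHAR(100))
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/user/admin/movie_data';
No LOAD DATA is needed: the table reads the files in place, and DROP TABLE removes only the table metadata, leaving the files untouched.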