Creating external table from file, not directory - sql

When I run the create external table query, I have to provide a directory for the 'Location' attribute. But if the directory I point to has more than one file, then it reads both files. For example, if I put LOCATION 'dir1/', and dir1 contains file1 and file2, both files will be read.
To avoid this, I want to point to a single file. When I tried LOCATION 'dir1/file1', it gave me an error that the file path is not a directory or unable to create one. Is there a way to point to just the single file?

If you want to load data from HDFS, try this:
LOAD DATA INPATH '/user/data/file1' INTO TABLE table1;
And if you want to load data from local storage:
LOAD DATA LOCAL INPATH '/data/file1' INTO TABLE table1;
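For context, a minimal sketch of the full flow, with hypothetical table name, columns, and paths: point the external table at a directory that contains nothing else, then move only the file you want into it with LOAD DATA.
-- Hypothetical names, columns, and paths for illustration only
CREATE EXTERNAL TABLE table1 (
  id INT,
  name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/data/table1_dir';

-- Moves just this one file into the table's directory
LOAD DATA INPATH '/user/data/file1' INTO TABLE table1;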

Related

importing data into hive table from external s3 bucket url link

I need to import data from a public S3 bucket whose URL was shared with me. How do I load the data into a Hive table?
I have tried the command below, but it's not working:
create external table airlines_info (.... ) row format
delimited fields terminated by '|' lines terminated by '\n'
stored as textfile location 'https://ml-cloud-dataset.....*.txt';
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:ml-cloud-dataset.s3.amazonaws.com/Airlines_data.txt is not a directory or unable to create one)
I am very new to Hive and I am not sure about the code. After creating the table, I also tried the code below to load the data into the Hive table, but that's not working either:
load data inpath 'https://ml-cloud-dataset.....*.txt' into table airlines_info;
The table location should be a directory in HDFS or S3, not a file and not an HTTPS link.
Download the file manually, put it into your local filesystem, and if you already have the table created, then use
load data local inpath 'local_path_to_file' into table airlines_info;
If you do not have the table yet, create it and specify some location inside your S3 bucket. Alternatively, create a MANAGED table (remove EXTERNAL from your DDL) without specifying a location; Hive will create the location for you. Check the location using the DESCRIBE FORMATTED command. Later you can convert the table to EXTERNAL if necessary using ALTER TABLE airlines_info SET TBLPROPERTIES('EXTERNAL'='TRUE');
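A minimal sketch of that sequence (the column list here is a hypothetical placeholder, not the real schema):
-- Managed table: no EXTERNAL keyword and no LOCATION, Hive chooses the directory
CREATE TABLE airlines_info (
  carrier STRING,   -- hypothetical columns for illustration
  flight_no INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

-- Shows the Location: line for the table
DESCRIBE FORMATTED airlines_info;

-- Convert to external later if necessary
ALTER TABLE airlines_info SET TBLPROPERTIES('EXTERNAL'='TRUE');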
Instead of the load data command, you can simply copy the file into the table location using the AWS CLI (provide the correct local path and table directory S3 URL):
aws s3 cp C:\Users\My_user\Downloads\Airlines_data.txt s3://mybucket/path/airlines_info/

LOAD DATA INPATH table files start with some string in Impala

Just a simple question; I'm new to Impala.
I want to load data from HDFS into my data lake using Impala.
So I have a CSV file this_is_my_data.csv and I want to load it without spelling out the full file name, I mean something like the following:
LOAD DATA INPATH 'user/myuser/this_is.*' INTO TABLE my_table
That is, a string starting with this_is and whatever follows.
If you need some additional information, please let me know. Thanks in advance.
The documentation says:
You can specify the HDFS path of a single file to be moved, or the
HDFS path of a directory to move all the files inside that directory.
You cannot specify any sort of wildcard to take only some of the files
from a directory.
The workaround is to put your files into the table directory using the mv or cp command. Check your table directory using the DESCRIBE FORMATTED command and run mv or cp (in a shell, not in Impala, of course):
hdfs dfs -mv "user/myuser/this_is.*" "/user/cloudera/mytabledir"
Or put the files you need to load into some directory first, then load the whole directory.
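For example (the paths are hypothetical; take your table directory from DESCRIBE FORMATTED):
# Stage only the matching files into a directory, then load that directory
hdfs dfs -mkdir -p /user/myuser/staging_dir
hdfs dfs -mv '/user/myuser/this_is*' /user/myuser/staging_dir/

Then, in Impala:
LOAD DATA INPATH '/user/myuser/staging_dir' INTO TABLE my_table;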

Files lost by overwriting into hive managed tables

I am using hadoop 2.7.3 and hive 2.1.1.
I had some 8-9 files in HDFS. I created one internal Hive table. I loaded the first of those files into that table and did some operations on that data.
After that I loaded the second of those files by overwriting into that table.
load data inpath '/path/path1/first.csv' into table ABC;
load data inpath '/path/path1/second.csv' overwrite into table ABC;
Did some operations on that second set of data.
I then loaded the third file, and so on till the last file, using "overwrite into".
Now I see that none of those files are in their original location anymore. Also, at /user/hive/warehouse/ABC only the last of the files is there.
Where did those previous files go? Are they lost because of overwriting into the Hive table? I ran hdfs dfs -ls -R / | grep "filename" but could not find my files.
LOAD DATA INPATH will move (not copy) the file from the source HDFS path to the table warehouse path.
OVERWRITE will delete the files that already exist in the table (if HDFS Trash is enabled, they are moved to Trash) and replace them with the files given in the path.
LOAD DATA LOCAL INPATH copies the files.
LOAD DATA INPATH moves the files.
overwrite deletes existing files before moving in new files.
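If the HDFS trash was enabled (fs.trash.interval > 0) when those loads ran, the overwritten files may still be sitting in the trash rather than being gone for good. A quick check, where the username is whatever user executed the queries (often your own user or the hive service user):
# Deleted files land under the deleting user's home directory, e.g. /user/<your_user>/.Trash/Current/...
hdfs dfs -ls -R /user/<your_user>/.Trash
hdfs dfs -ls -R /user/hive/.Trash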

Inserting csv files into Hive table dynamically

There is a folder named "Sample" which contains a number of CSV files. How can all of the CSV files be loaded into a Hive table dynamically?
For normal insert we use
load data inpath "file1.csv" into table Person;
Without hardcoding each file, can it be done for all the files?
You just need to pass in the directory name like:
load data inpath "/directory/name/here" into table Person;
Quoting the manual:
filepath can refer to a file (in which case Hive will move the file
into the table) or it can be a directory (in which case Hive will move
all the files within that directory into the table). In either case,
filepath addresses a set of files.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML
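So for the question above, pointing filepath at the folder itself should load every CSV in it (the HDFS path to Sample is a guess; adjust it to wherever the folder actually lives):
load data inpath '/user/hive/Sample' into table Person;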

How to load data into Hive table

I'm using Hortonworks' Hue (more like a GUI that ties HDFS, Hive, and Pig together) and I want to load data that is already in HDFS into a table I just created.
Suppose the table's name is "test", and the file which contains the data is at the path:
/user/hdfs/test/test.txt
But I'm unable to load the data into the table. I tried:
load data local inpath '/user/hdfs/test/test.txt' into table test
But there's an error saying the file can't be found, no matching path.
I'm still so confused.
Any suggestions?
Thanks
As you said, you want to "load the data within the hdfs into my current created table".
But in your command you are using:
load data local inpath '/user/hdfs/test/test.txt' into table test
With the LOCAL keyword, Hive looks for the file in your local filesystem. But your file is in HDFS.
I think you need to remove the LOCAL keyword from your command.
Hope it helps!
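So the statement would become (assuming the table test already exists and the file really is at that HDFS path):
load data inpath '/user/hdfs/test/test.txt' into table test;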
Since you are using Hue and the output says there is no matching path, I think you have to give the complete path.
For example:
load data local inpath '/home/cloudera/hive/Documents/info.csv' into table tablename;
In the same way, you can give the complete HDFS path of the directory in which the document resides.
You can use files in any other format as well.
Remove the LOCAL keyword; LOCAL makes Hive look in the local filesystem, but your file is in HDFS.