Hive query insertion into other directories - hive

I have one table called balance in warehouse directory that has following data.
Surender,HDFC,60000,CTS
Raja,AXIS,80000,TCS
Raj,HDFC,70000,TCS
Kumar,AXIS,70000,CTS
Remya,AXIS,40000,CTS
Arun,SBI,30000,TCS
I created a Internal table called balance and loaded the above file into balance table using.
LOAD data local inpath '/home/cloudera/bal.txt' into table balance
Now I just wanted to have all those rows in balance table into a HDFS directory.hence I used the below query.
Insert overwrite directory '/user/cloudera/surenhive' select * from balance;
when I run this query all the data also loaded into above mentioned directory in HDFS.
If i navigate to /user/cloudera/surenhive then i can see the data ,but there are some junk characters between data . Why is it junk characters appearing? How to remove those.
but the below query gives me result without any issues.
Insert overwrite local directory '/home/cloudera/surenhive' select * from balance;
if I load the file from local and store the output into HDFS directory creates any problem for that junk characters.

First of all, if you've loaded the data into a hive table, then it is already in HDFS. Do "describe formatted balance" and you will see the hdfs location of the hive table; the files are there.
But to answer your question more specifically, the default delimiter that hive uses is ^A. That's probably the You can change that by specifying a different delimiter when you do the insert:
insert overwrite directory '/user/cloudera/surenhive'
row format delimited fields terminated by ','
select * from balance;
Alternatively, since it seems you're using an older version of Hive, you could do a "create-table-as-select" with the correct file format, then make the table external and drop it. This will leave you with just the files on hdfs:
create table tmp
row format delimited fields terminated by ','
location '/user/cloudera/surenhive'
as select * from balance;
alter table tmp set tblproperties('EXTERNAL'='TRUE');
drop table tmp;

Related

INSERT OVERWRITE on just created table

I have to replicate a process for a client. I have never worked with Hive, so I am trying to understand what they were doing in other cases.
The Hive script I am trying to understand is this one:
DROP TABLE IF EXISTS distribution.030601_TI11;
CREATE EXTERNAL TABLE IF NOT EXISTS distribution.030601_TI11(
mygroup STRING, year STRING, type1 STRING, type2 STRING,
type3 STRING, myvalue INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
STORED AS TEXTFILE LOCATION '/warehouse/distribution/030601_TI11';
INSERT OVERWRITE TABLE distribution.030601_TI11
SELECT *
FROM develop.030601_TI11;
What are they doing?
As far as I have read about Hive, a DROP TABLE IF EXISTS statement over a external table will only delete the table metadata and not the table data. But I would like to know if that INSERT OVERWRITE statement is dropping the previous entries stored in the table, and inserting only the new rows contained in the specified location
And also, how is the LOCATION managed? I want to create the table from a single .csv file. Can I write something like LOCATION '/warehouse/develop/myfile.csv' or I can only provide a HDFS directory as a location?
INSERT OVERWRITE TABLE removes all files inside table location and moves new file. This happens at the very end when the query has already successfully executed and result files are created in the temporary location, after that load task removes all files in table location and moves files from temp location to the table location. See also this answer: https://stackoverflow.com/a/63378038/2700344
If you want to create table on top of single file, put it in some folder and make sure there are no other files in the same folder and and specify that folder as a location in create table DDL. Also you can put that file into existing table location using hdfs dfs -put command or using LOAD command or using some other means. Main point here is that table should have it's own location, does not matter how many files are in the location - single file or many files, location is a folder (directory), not a file. Even if it was possible to create table on top of single file instead of folder, it is unsafe, because overwrite can create another files and table will have location pointing to non-existing file. carefully read answers on this question: How to point to a single file with external table
You are right, the location for external table will remain as is. So, by drop-create statements they are ensuring that the table doesn't exist before dropping or creating. And the table seems to be dynamic in nature so that can be another reason of drop-create.
Please notice you are using CREATE EXTERNAL TABLE IF NOT EXISTS which means if table exist, it will not recreate.
Storage will be cleaned and loaded using INSERT OVERWRITE.
Now, if you want to create a table on top of csv file just use LOCATION '/warehouse/develop/myfile. You dont have to use .csv in location.

insert into hive external table as select and ensure it generates single file in table directory

My question is somewhat similar to the below post. I want to download some data from a hive table using select query. But because the data is large, I want to write it as an external table in a given path. so that I can create a csv file. Uses the below code
create external table output(col1 STRING, col2STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '{outdir}/output'
INSERT OVERWRITE TABLE output
Select col1, col2 from atable limit 1000
This works fine, and create a file in 0000_ format, which can be copied as a csv file.
But my question is how to ensure that the output will always have a single file? If there is no partition defined, will it always be single file? What is the rule it uses to split files?
Saw few similar questions like below. But it discuss hdfs file access.
How to point to a single file with external table
I know the below alternative, but I use a hive connection object to execute queries from a remote node.
hive -e ' selectsql; ' | sed 's/[\t]/,/g' > outpathwithfilename
You can set the below property before doing the overwrite
set mapreduce.job.reduces=1;
Note: If the hive engine doesn't allow to be modified at runtime, then whitelist the parameter by setting below property in hive-site.xml
hive.security.authorization.sqlstd.confwhitelist.append=|mapreduce.job.|mapreduce.map.|mapreduce.reduce.*

Hive: load gziped CSV from hdfs as read-only into a table

I have an hdfs folder with many csv.gz within, all with the same schema. My customer needs to read the content of these tables through Hive.
I tried to apply https://cwiki.apache.org/confluence/display/Hive/CompressedStorage . However it moves the file, whereas I need it to stay in its initial directory.
Another problem is that I should load each file one by one, I would rather create a table from the directory and not manage file individually.
I do not master Hive at all. Is his possible?
Yes, this is possible via Hive. You can create an external table and reference the existing HDFS location containing the gzip files. The schema for the data should be specified during the table creation.
hive> CREATE EXTERNAL TABLE my_data
(
column_1 int,
column_2 string
)
LOCATION 'hdfs:///my_data_folder_with_gzip_files';

External Table in Hive - Location

The below table returns no data while running a select statement
CREATE EXTERNAL TABLE foo (
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\073'
LINES TERMINATED BY '\n'
LOCATION '/user/data/CSV/2016/1/27/*/part-*';
I need my hive to point to a dynamic folder so as a mapreduce job puts a part file in a folder and hive loads into the table.
Is there any way the location be made dynamic like
/user/data/CSV/*/*/*/*/part-*
or just /user/data/CSV/* would do fine ?
(The same code works fine when created as internal table and loaded with the file path - hence there is no issues due to formatting)
First of, your table definition is missing columns. Second, external table location always points to folder, not particular files. Hive will consider all files in the folder to be data for the table.
If you have data that is generated e.g. on a daily basis by some external process you should consider partitioning your table by date. Then you need to add a new partition to the table when the data is available.
Hive does not iterate through multiple folders -
Hence for the above scenario
I ran a command line argument that iterates through these multiple folders and cat (print to the console) all the part files and then put it to a desired location.(that Hive points to)
hadoop fs -cat /user/data/CSV/*/*/*/*/part-* | hadoop fs -put - <destination folder>
This line
LOCATION '/user/data/CSV/2016/1/27/*/part-*';
Does not look correct, I don't think that the table can created from multiple locations. Have you tried just importing by a single location to confirm this?
Could also be the delimiter you're using is not correct. If you are using a CSV file to import your data try delimitating by ','.
You can use an alter table statement to change the locations. In the example below partitions are based on dates where data is stored in time dependent file locations. If I want to search many days I have to add an alter table statement for each location. This idea may extend to your situation quite well. You create a script to generate the create table statement as below using some other technology such as python.
CREATE EXTERNAL TABLE foo (
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\073'
LINES TERMINATED BY '\n'
;
alter table foo add partition (date='20160201') location /user/data/CSV/20160201/data;
alter table foo add partition (date='20160202') location /user/data/CSV/20160202/data;
alter table foo add partition (date='20160203') location /user/data/CSV/20160203/data;
alter table foo add partition (date='20160204') location /user/data/CSV/20160204/data;
You can use as many add and drop statements you need to define your locations. Then your table can find data held in many locations in HDFS rather than having all your files in one location.
You may also be able to leverage a
create table like
statement. To create a schema like you have in another table. Then alter the table to point at the files you want.
I know this isn't exactly what you want and is more of a work around. Good luck!

Hive Extended table

When we create using
Create external table employee (name string,salary float) row format delimited fields terminated by ',' location /emp
In /emp directory there are 2 emp files.
so when we run select * from employee, it get the data from both the file ad display.
What will be happen when there will be others file also having different kind of record which column is not matching with the employee table , so it will try to load all the files when we run "select * from employee"?
1.Can we specify the specific file name which we want to load?
2.Can we create other table also with the same location?
Thanks
Prashant
It will load all the files in emp directory even it doesn’t match with table.
for your first question. you can use Regex serde.if your data matches to regex.then it loads to the table.
regex for access log in hive serde
https://github.com/apache/hive/blob/trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/RegexSerDe.java
other options:I am pointing some links.these links has some ways.
when creating an external table in hive can I point the location to specific files in a direcotry?
https://issues.apache.org/jira/browse/HIVE-951
for your second question: yes we can create other tables also with the same location.
Here are your answers
1. If the data in the file dosent match with table format, hive doesnt throw an error. It tries to read the data as best as it could. If data for some columns are missing it will put NULL for them.
No we cannot specify the file name for any table to read data. Hive will consider all the files under the table directory.
Yes, we can create other tables with the same location.