Hive query - INSERT OVERWRITE LOCAL DIRECTORY creates multiple files for a single table - hive

I do the following from a hive table myTable.
INSERT OVERWRITE LOCAL DIRECTORY '/myDir/out' SELECT concat_ws('',NAME,PRODUCT,PRC,field1,field2,field3,field4,field5) FROM myTable;
So, this command generates 2 files 000000_0 and 000001_0 inside the folder out/.
But, I need the contents as a single file. What should I do?

There are multiple files in the directory because every reducer is writing one file. If you really need the contents as a single file, run your map reduce job with only 1 reducer which will write to a single file.
However depending on your data size, this might not be a good approach to run a single reducer.
Edit: Instead of forcing hive to run 1 reduce task and output a single reduce file, it would be better to use hadoop fs operations to merge outputs to a single file.
For example
hadoop fs -text /myDir/out/* | hadoop fs -put - /myDir/out.txt

A bit late to the game, but I found that using LIMIT large_number, where large_number is bigger than rows in your query. It forces hive to use at least a reducer. For example:
set mapred.reduce.tasks=1; INSERT OVERWRITE LOCAL DIRECTORY '/myDir/out' SELECT * FROM table_name LIMIT 1000000000
Worked flawlessly.

CLUSTER BY will make the work.

Related

How do I force hive to always create a consistent filename like 000000_0?

I am doing an Insert overwrite operation through a hive external table onto AWS S3. Hive creates a output file 000000_0 onto S3. However at times I am noticing that it creates file with other names like 0000003_0 etc. I always need to overwrite the existing file but with inconsistent file names I am unable to do so. How do I force hive to always create a consistent filename like 000000_0? Below is an example of how my code looks like, where tab_content is a hive external table.
INSERT OVERWRITE TABLE tab_content
PARTITION(datekey)
select * from source
Better do not do this and modify your program to accept any number of files in the directory.
Each reducer (or mapper if it runs on map-only) creates it's own file. These reducers do know nothing about each other, they named during creation. Files are marked as 000001_0,000002_0. But it can be 000001_1 also if attempt number 0 has failed and attempt number 1 has succeeded. Also if table is partitioned and there is no distribute by partition key at the end, each reducer will create it's own file in each partition.
You can force it to work on a single final reducer (it can be done for example if you add order by clause or setting set mapred.reduce.tasks = 1;). But bear in mind that this solution is not scalable, because too many data will cause performance problems on single reducer. Also What will happen if attempt 0 has failed and it was restarted and attempt 1 succeeded? It will create 000001_1 instead of 000001_0.

Hive: create table and write it locally at the same time

Is it possible in hive to create a table and have it saved locally at the same time?
When I get data for my analyses, I usually create temporary tables to track eventual
mistakes in the queries/scripts. Some of these are just temporary tables, while others contain the data that I actually need for my analyses.
What I do usually is using hive -e "select * from db.table" > filename.tsv to get the data locally; however when the tables are big this can take quite some time.
I was wondering if there is some way in my script to create the table and save it locally at the same time. Probably this is not possible, but I thought it is worth asking.
Honestly doing it the way you are is the best way out of the two possible ways but it is worth noting you can preform a similar task in an .hql file for automation.
Using syntax like this:
INSERT OVERWRITE LOCAL DIRECTORY '/home/user/temp' select * from table;
You can run a query and store it somewhere in the local directory (as long as there is enough space and correct privileges)
A disadvantage to this is that with a pipe you get the data stored nicely as '|' delimitation and new line separated, but this method will store the values in the hive default '^b' I think.
A work around is to do something like this:
INSERT OVERWRITE LOCAL DIRECTORY '/home/user/temp'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
select books from table;
But this is only in Hive 0.11 or higher

Pig : Load in table, then overwrite that table after transformation

Let's say I have a table:
db.table
I load the table and do some transforms on it, and, finally, attempt to store it
mytable = LOAD 'db.table' USING HCatLoader();
.
.
-- My transforms
.
.
STORE mytable_final INTO 'db.table' USING HCatStorer();
But the code complains I'm writing into a table with existing data.
I've looked at this JIRA ticket, which seems to be inactive (I have tried using FORCE and OVERWRITE in several places in the STORE command).
I've also looked at this SO post, but the author is loading from one location and storing in a different location. If I use what is in that post, the result from the transformation is no data. Deleting the files isn't an option. I'm thinking of storing the files temporarily, but I don't know if this is the best option.
I am trying to get the behavior you get in Hive using INSERT OVERWRITE.
I am not familiar with HCatLoader and HCatStorer. But if you LOAD from and STORE to HDFS, Pig provides shell commands that enable you to do the deleting and moving from within your script.
STORE A INTO '/this/path/is/temporary';
RMF '/this/path/is/permanent';
MV '/this/path/is/temporary' '/this/path/is/permanent';

how can i copy data from a Hive table into local system?

i have created a table in Hive "sample" and loaded a csv file "sample.txt" into it.
now i need that data from "sample" into my local /opt/zxy/sample.txt.
How can i do that?
Hortonworks' Sandbox lets you do it through its HCatalog menu. Otherwise, the syntax is
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/c' SELECT a.* FROM b
as per Hive language manual
Since your intention is just to copy the entire file from HDFS to your local FS, I would not suggest you to do it through a Hive query, because of the following reasons :
It'll start a Mapreduce job which will take more time than a normal copy.
It'll create file(s) with different names(000000_0, 000001_0 and so on), which will require you to rename the file manually afterwards.
You might face problem in opening these files as they are without any extension. Your OS would be unable to choose an application to open these files on its own. In such a case you either have to rename the file or manually select an application to open it.
To avoid these problems you could use HDFS get command :
bin/hadoop fs -get /user/hive/warehouse/sample/sample.txt /opt/zxy/sample.txt
Simple n easy. But if you need to copy some selected data, then you have to use a Hive query.
HTH
I usually run my query directly through Hive on the command line for this kind of thing, and pipe it into the local file like so:
hive -e 'select * from sample' > /opt/zxy/sample.txt
Hope that helps.
Readers who are accessing Hive from Windows OS can check out this script on Github.
It's a Python+paramiko script that extracts Hive data to local Windows OS file-system.

hive create table filename 000000_0?

I'm currently creating an external table like that:
CREATE EXTERNAL TABLE site_datatype (
....
yada yada
....
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
LOCATION '/user/accounting/summary/2011-12-15/site_datatype.result'
But instead of creating a file called "site_datatype.result" with the contents in it when i run the insert overwrite table select, it creates a directory "site_datatype.result" with a file called "000000_0" in it (correct contents though).
Is this supposed to work this way? And if yes, how can I workaround this (inside hive) to get it done the way I need it?
Thanks,
Mario
Hive operates at the directory level, so multiple reducers can quickly dump results into HDFS. If it were to operate at the file level, it would have to send it to a single reducer to consolidate into a single file, adding an unnecessary bottleneck.
If you absolutely need data from a Hive table in a single file, you can set the number of reducers to 1, then query your data and push it to a new table or directory (via Insert Overwrite).
Another option would be to get the table from HDFS (hadoop fs -get hive/warehouse/sampletable/ .) and then 'cat' all of the files back together.