Hive creating extra subfolders under partitioned directories on INSERT OVERWRITE

Hive creating extra subfolders under partitioned directories on INSERT OVERWRITE - hive

I have a table partitioned on year,month,day and hour. If I use the following INSERT OVERWRITE to a specific partition it places a file under appropriate directory structure. This file contains the string abc:-
INSERT OVERWRITE TABLE testtable PARTITION(year = 2017, month = 7, day=29, hour=18)
SELECT tbl.c1 FROM
(
select 'abc' as c1
) as tbl;
But if I use the following statement, Hive surprisingly creates three new folders under the folder "hour=18".
And there is a file inside each of these three subfolders.
INSERT OVERWRITE TABLE testtable PARTITION(year = 2017, month = 7, day=29, hour=18)
SELECT tbl.c1 FROM
(
select 'abc' as c1
union ALL
select 'xyz' as c1
union ALL
select 'mno' as c1
) as tbl;
When I query the data, it shows the data as expected. But why did it create these 3 new folders? Since the partitioning scheme is only for year,month,day and hour I wouldn't expect Hive to create folders for anything other than these.

Actually it has nothing to do with INSERT OVERWRITE or partitioning.
It's UNION ALL statement that adds additional directories.
Why it bothers you?
You can do some DISTRIBUTE BY shenanigans or set number of reducers to 1 to put this into one file.

Hi guys I had the same issue and thought of sharing.
Union all adds extra subfolder in the table.
The count(*) on the table will give 0 records and the msck repair will error out with the default properties.
After using set hive.msck.path.validator=ignore; MSCK will not error out but will message "Partitions not in metastore"
Only after setting the properties as mentioned above by DogBoneBlues
(SET hive.mapred.supports.subdirectories=TRUE;
SET mapred.input.dir.recursive=TRUE;) The table is returning values.(count(*))

You can use just "union" instead of "union all" if you dont care about duplicates. "union" should not create sub-folders.

Related

BigQuery: Run queries to create table and append to the table if table exist

Is it possible to run query to create a table if it does not exist, and append to the table if the table already exists? I like to write a single query to create or append. Note: I am using Admin console for now, will be using API eventually.
I have following query:
CREATE TABLE IF NOT EXISTS `project_1.dataset_1.tabe_1`
OPTIONS(
description="Some desc"
) AS
SELECT *
FROM source_table
I get following error:
A table named project_1.dataset_1.tabe_1 already exists.

Above query creates a table named 'table_1' if it does not exist under 'project_1.dataset_1', and append to the table if the table already exists.
IF
(
SELECT
COUNT(1)
FROM
`project_1.dataset_1.__TABLES_SUMMARY__`
WHERE
table_id='table_1') = 0
THEN
CREATE OR REPLACE TABLE
`project_1.dataset_1.table_1` AS
SELECT
'apple' AS item,
'fruit' AS category
UNION ALL
SELECT
'leek',
'vegetable';
ELSE
INSERT
`project_1.dataset_1.table_1` ( item,
category )
SELECT
'lettuce' AS item,
'vegetable' AS category
UNION ALL
SELECT
'orange',
'fruit';
END IF;

This seems like it may be a good opportunity to leverage scripting within a single query to accomplish your needs.
See this page for adding control flow to a query to handle an error (e.g. if the table create fails due to existing). For the exception case, you could then INSERT ... SELECT statement as needed.
You can do this via the API as well if you prefer. Simply issue a tables.get equivalent appropriate to the particular library/language you choose and see if the table exists, and then insert the appropriate query based on that outcome of that check.

Hive- how do I "create table as select.." with partitions from original table?

I need to create a "work table" from our hive dlk. While I can use:
create table my_table as
select *
from dlk.big_table
just fine, I have problem with carrying over partitions (attributes day, month and year) from original "big_table" or just creating new ones from these attributes.
Searching the web did not really helped me answer this question- all "tutorials" or solutions deal either with create as select OR creating partitions, never both.
Can anybody here please help?

Creating partitioned table as select is not supported. You can do it in two steps:
create table my_table like dlk.big_table;
This will create table with the same schema.
Load data.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table my_table partition (day, month, year)
select * from dlk.big_table;

Trying to append a query onto a temporary table

Using SQL Server 2008-R2
I have a csv of purchase IDs and in my database there is a table with these purchase IDs and there corresponding User IDs in our system. I need these to run a more complicated query after that using. I tried to bulk insert or run import wizard but I don't have permission. My new idea is to create a #temp using SELECT INTO and then have the query inside that like below.
SELECT *
INTO ##PurchaseIDs
FROM
(
SELECT PurchaseID, UserID, Added
FROM Users
WHERE PurchaseID IN (
/* These are the csv IDs just copied and pasted in */
'49397828',
'49397883',
etc.
What happens is that there are ~55,000 IDs so I get this error.
The query processor ran out of internal resources and could not
produce a query plan. This is a rare event and only expected for
extremely complex queries or queries that reference a very large
number of tables or partitions. Please simplify the query. If you
believe you have received this message in error, contact Customer
Support Services for more information.
It works if I upload about 30,000 so my new plan is to see if I can make a temp table, then append a new table to the end of that. I am also open to other ideas on how to accomplish what I am looking to do. I attached an idea of what I am thinking below.
INSERT *
INTO ##PurchaseIDs
FROM (
SELECT PurchaseID, UserID, Added
FROM Users
WHERE PurchaseID IN (
/* These are the OTHER csv IDS just copied and pasted in */
'57397828',
'57397883',
etc.

You need to create a temp table and insert the values in IN clause to the temp table and Join the temp table to get the result
Create table #PurchaseIDs (PurchaseID int)
insert into #PurchaseIDs (PurchaseID)
Select '57397828'
Union All
Select '57397828'
Union All
......
values from csv
Now use Exists to check the existence of PurchaseID in temp table instead of IN clause
SELECT PurchaseID,
UserID,
Added
FROM Users u
WHERE EXISTS (SELECT 1
FROM #PurchaseIDs p
WHERE u.PurchaseID = p.PurchaseID)

How do I copy data from one table to another in postgres using copy command

We use copy command to copy data of one table to a file outside database.
Is it possible to copy data of one table to another table using command.
If yes can anyone please share the query.
Or is there any better approach like we can use pg_dump or something like that.

You cannot easily do that, but there's also no need to do so.
CREATE TABLE mycopy AS
SELECT * FROM mytable;
or
CREATE TABLE mycopy (LIKE mytable INCLUDING ALL);
INSERT INTO mycopy
SELECT * FROM mytable;
If you need to select only some columns or reorder them, you can do this:
INSERT INTO mycopy(colA, colB)
SELECT col1, col2 FROM mytable;
You can also do a selective pg_dump and restore of just the target table.

If the columns are the same (names and datatypes) in both tables then you can use the following
INSERT INTO receivingtable (SELECT * FROM sourcetable WHERE column1='parameter' AND column2='anotherparameter');

Suppose there is already a table and you want to copy all records from this table to another table which is not currently present in the database then following query will do this task for you:
SELECT * into public."NewTable" FROM public."ExistingTable";

How to overwrite matching key in "load from" statement in informix sql

I have a table (in informix ) which stores two columns :empId and status (Y/N). There are some other scripts which, when run, update the status of these employeeIDs based on certain conditions.
The task at hand is , a user provides a path to a file containing employee-IDs. I have a script which then looks at this file and does a "load from user_supplied_file insert into employeeStatusTable".
All the employeeIDs mentioned in this file are to be inserted in this table with the status 'N'. Th real issue is the user-supplied file may contains an employeeId that is already present in the table with the status updated to 'Y' (by some other script or job). In this case, the existing entry should get overwritten. In short, the entry in the table should read "empId", "N".
Is there any way to acheive this? Thanks in advance.

As far I know , the LOAD statement is limited to use together of INSERT statement.
I pretty sure there a lot of ways to do this , I will suggest two way:
In both cases, is supported only for database version >= 11.50 and have certain limitations like:
The Merge works only if the two tables have 1 to 1 relationship
The external table is limited to Database Server file system, not will access anything on client machine
SUGGESTION 1
Load into a temporary table and then use the MERGE statement.
create temp table tp01 ( cols.... ) with no log ;
load from xyz.txt insert into tp01 ;
merge into destTable as A
using tp01 as B
ON A.empID = B.empID
WHEN MATCHED THEN UPDATE SET status = 'N'
WHEN NOT MATCHED THEN INSERT (empid, status) values ( b.empid, 'N');
drop table tp01;
SUGGESTION 2
Create a external table to you TXT file and then just use the MERGE or UPDATE using this table when needed.
create external table ex01 .... using ( datafile('file:/tmp/your.txt'), delimited ...);
merge into destTable as A
using ex01 as B
ON A.empID = B.empID
WHEN MATCHED THEN UPDATE SET status = 'N'
WHEN NOT MATCHED THEN INSERT (empid, status) values ( b.empid, 'N');

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Hive creating extra subfolders under partitioned directories on INSERT OVERWRITE - hive

Actually it has nothing to do with INSERT OVERWRITE or partitioning. It's UNION ALL statement that adds additional directories. Why it bothers you? You can do some DISTRIBUTE BY shenanigans or set number of reducers to 1 to put this into one file.

You can use just "union" instead of "union all" if you dont care about duplicates. "union" should not create sub-folders.

Related

BigQuery: Run queries to create table and append to the table if table exist

Hive- how do I "create table as select.." with partitions from original table?

Trying to append a query onto a temporary table

How do I copy data from one table to another in postgres using copy command

How to overwrite matching key in "load from" statement in informix sql

Categories

Resources