I am trying to update values from one table into another table. Both tables have the same field names but different values. The query should work fine on any normal DB, but here it returns
Error while compiling statement: FAILED: ParseException line 1:0
cannot recognize input near 'MERGE' 'INTO' 'FINAL'
MERGE
INTO FINAL
USING FIRST_STAGE
ON IMSI = FIRST_STAGE.IMSI and Site = FIRST_STAGE.Site
WHEN MATCHED THEN UPDATE SET
Min_Date = least(FIRST_STAGE.Min_Date, Min_Date),
Max_Date = greatest(FIRST_STAGE.Max_Date, Max_Date),
NoofDays = FIRST_STAGE.NoofDays + NoofDays,
Down_Link = FIRST_STAGE.Down_Link + Down_Link,
up_Link = FIRST_STAGE.up_Link + up_Link,
connection = FIRST_STAGE.connection + connection
WHEN NOT MATCHED THEN INSERT ( Min_Date,
Max_Date,
NoofDays,
IMSI,
Site,
Down_Link,
Up_Link,
Connection )
VALUES ( FIRST_STAGE.Min_Date,
FIRST_STAGE.Max_Date,
FIRST_STAGE.NoofDays,
FIRST_STAGE.IMSI,
FIRST_STAGE.Site,
FIRST_STAGE.Down_Link,
FIRST_STAGE.Up_Link,
FIRST_STAGE.Connection )
The Hive MERGE statement was introduced in the Hortonworks distribution.
The prerequisite for this MERGE statement to run is:
The final (target) table needs to be created with transactions enabled, stored as ORC, and bucketed.
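For reference, a minimal sketch of such a target table; the column types and bucket count below are assumptions, only the bucketing, ORC and transactional clauses matter here:

CREATE TABLE FINAL (
  IMSI       string,
  Site       string,
  Min_Date   date,
  Max_Date   date,
  NoofDays   int,
  Down_Link  double,
  Up_Link    double,
  Connection bigint
)
CLUSTERED BY (IMSI) INTO 8 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');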
AFAIK, in the Cloudera distribution we need to use Kudu to perform upsert operations, starting from CDH 5.10+.
Note: the UPSERT statement only works for Impala tables that use the Kudu storage engine.
I don't think we can run MERGE statements like the one in this post on CDH distributions as of now.
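For illustration, a minimal sketch of the Impala/Kudu form; the table name and columns here are hypothetical, not taken from the question:

-- only valid for an Impala table backed by Kudu
UPSERT INTO final_kudu (imsi, site, noofdays)
VALUES ('123456789012345', 'SITE_A', 1);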
I am trying to write an Oracle procedure to merge data from a remote database link into a local table. Individually the pieces work quickly, but together they time out. Here is a simplified version of what I am trying.
What works:
Select distinct ProjectID from Project where LastUpdated < (sysdate - 6/24);
-- Works in a split second.
Merge into project
using (select /*+DRIVING_SITE(rd)*/
              rd.projectID,
              rd.otherdata
       FROM Them.Remote_Data@DBLink rd
       WHERE rd.projectID in (1,2,3)) sourceData -- hardcoded IDs
On (sourceData.projectID = project.projectID)
When matched...
-- Merge statement works quickly when the IDs are hard coded
What doesn't work: Combining the two statements above.
Merge into project
using (select /*+DRIVING_SITE(rd)*/ -- driving-site hint helps when this piece is extracted from the larger statement
              rd.projectID,
              rd.otherdata
       FROM Them.Remote_Data@DBLink rd
       WHERE rd.projectID in -- IN subquery that works quickly by itself
             (Select distinct ProjectID from Project where LastUpdated < (sysdate - 6/24))
             -- the select in the IN clause returns 10 rows; it's a test database
      ) sourceData
On (sourceData.projectID = project.projectID)
When matched...
-- When I run this statement in SQL Developer, this is all that I get without the data updating
Connecting to the database local.
Process exited.
Disconnecting from the database local.
I also tried pulling the IN subquery out into a WITH clause, hoping it would execute differently, but it had no effect.
Any direction for paths to pursue would be appreciated.
Thanks.
The /*+DRIVING_SITE(rd)*/ hint doesn't work with MERGE because the operation must run in the database where the merged table sits, which in this case is the local database. That means the whole result set from the remote table is pulled across the database link and then filtered against the data from the local table.
So, discard the hint. I also suggest you convert the IN clause into a join:
Merge into project p
using (select rd.projectID,
              rd.otherdata
       FROM Project ld
       inner join Them.Remote_Data@DBLink rd
               on rd.projectID = ld.projectID
       where ld.LastUpdated < (sysdate - 6/24)) q
On (q.projectID = p.projectID)
When matched ...
Please bear in mind that answers to performance tuning questions without sufficient detail are just guesses.
I found your question while hitting the same problem. Yes, the hint in the query is ignored when the query is included in the USING clause of a MERGE command.
In my case I created a work table, say w_remote_data for your example, and split the MERGE command into two commands: (1) fill the work table, (2) run the MERGE command using the work table.
The pitfall is that we cannot simply use either create table w_remote_data as select /*+DRIVING_SITE(rd)*/ ... or insert into w_remote_data select /*+DRIVING_SITE(rd)*/ ... to fill the work table. Both commands are valid, but they are slow - the hint does not apply to them either, so we would not get rid of the problem. The solution is in PL/SQL: collect the result of the query from the USING clause into an intermediate collection. See the example below (for simplicity I assume w_remote_data has the same structure as remote_data; otherwise we would have to define a custom record type instead of using %rowtype):
declare
  type ct is table of w_remote_data%rowtype;
  c ct;
begin
  execute immediate 'truncate table w_remote_data';
  -- the DRIVING_SITE hint is honoured here because this is a plain SELECT
  select /*+DRIVING_SITE(rd)*/ *
    bulk collect into c
    from Them.Remote_Data@DBLink rd ...;
  -- copy the collected rows into the local work table
  if c.count > 0 then
    forall i in c.first..c.last
      insert into w_remote_data values c(i);
  end if;
  -- the MERGE now runs entirely against local tables
  merge into project p using (select * from w_remote_data) ...;
  execute immediate 'truncate table w_remote_data';
end;
My case was an ETL script where I could rely on it not running in parallel. Otherwise we would have to cope with temporary (session-private) tables; I didn't test whether the approach works with them.
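If concurrent runs were a concern, the work table could presumably be created as a global temporary table, which keeps each session's rows private; note that the answer above did not test this. A sketch of how the work table might be defined, reusing the remote table's structure:

create global temporary table w_remote_data
on commit preserve rows
as select * from Them.Remote_Data@DBLink where 1 = 0; -- copies the structure only, no rows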
We have to use Legacy SQL in BigQuery, but MERGE does not work in Legacy SQL. How can we write the query below in Legacy SQL?
MERGE [ABC:xyz.tmp_cards] AS target_tbl
USING [ABC:xyz.tmp_cards_1533188902] AS source_tbl
ON target_tbl.id = source_tbl.id
WHEN MATCHED AND target_tbl.id = source_tbl.id THEN
UPDATE SET target_tbl.id = source_tbl.id,
target_tbl.user_id = source_tbl.user_id,
target_tbl.expiration_date = source_tbl.expiration_date,
target_tbl.created_at = source_tbl.created_at,
target_tbl.updated_at = source_tbl.updated_at
WHEN NOT MATCHED THEN
INSERT (id, user_id, expiration_date, created_at, updated_at)
VALUES (source_tbl.id, source_tbl.user_id, source_tbl.expiration_date, source_tbl.created_at, source_tbl.updated_at)
Support for the DML MERGE statement appeared in beta just this year, and only for Standard SQL. It is not possible in Legacy SQL, which is one of the reasons Standard SQL is the preferred dialect for querying data stored in BigQuery: new features such as DML are added to Standard SQL, not to the old dialect.
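If switching the dialect for this one statement is an option, the MERGE can be written in Standard SQL. A sketch based on the tables in the question; the backtick project.dataset.table references and the #standardSQL prefix are the Standard SQL forms, and the statement is otherwise untested:

#standardSQL
MERGE `ABC.xyz.tmp_cards` AS target_tbl
USING `ABC.xyz.tmp_cards_1533188902` AS source_tbl
ON target_tbl.id = source_tbl.id
WHEN MATCHED THEN
  UPDATE SET user_id = source_tbl.user_id,
             expiration_date = source_tbl.expiration_date,
             created_at = source_tbl.created_at,
             updated_at = source_tbl.updated_at
WHEN NOT MATCHED THEN
  INSERT (id, user_id, expiration_date, created_at, updated_at)
  VALUES (source_tbl.id, source_tbl.user_id, source_tbl.expiration_date, source_tbl.created_at, source_tbl.updated_at)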
I am trying to UNION ALL multiple tables in SAS (using SQL pass-through) within Hadoop. I found threads on UNION ALL and was able to get it to run within my local SAS session; however, the output was too large and SAS crashed, so I have to put the datasets that I want to union into Hadoop and union them there. This is where I am having issues with the syntax. The code is below; I usually use the beginning and ending parts of the code for connecting to Hadoop.
Proc SQL noerrorstop;
Connect to HADOOP (server='X' port=X);
Execute (set X) by HADOOP;
Execute (drop Table X.CV_All) by HADOOP;
Execute (create Table X.CV_All as
SELECT cv.*
INTO: CV_All
FROM (SELECT * FROM X.CV_Dec
UNION ALL
SELECT * FROM X.CV_Jan
UNION ALL
SELECT * FROM X.CV_Feb) cv;
) by HADOOP;
DISCONNECT FROM HADOOP;
quit;
I receive the following error: ERROR: Execute error: Error while compiling statement: FAILED: ParseException line 1:86 missing EOF at ':' near 'INTO'
Thank you in advance.
I think Hive on Hadoop uses create table ... as select rather than select ... into. Does this work?
CREATE TABLE cv_all as
SELECT cv.*
FROM (SELECT * FROM X.CV_Dec
UNION ALL
SELECT * FROM X.CV_Jan
UNION ALL
SELECT * FROM X.CV_Feb
) cv;
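If that works, the statement still has to be sent through the pass-through EXECUTE block. A sketch using the connection code from the question (the server and library names are the question's placeholders, untested):

Proc SQL noerrorstop;
   Connect to HADOOP (server='X' port=X);
   Execute (drop table if exists X.CV_All) by HADOOP;
   Execute (create table X.CV_All as
            SELECT cv.*
            FROM (SELECT * FROM X.CV_Dec
                  UNION ALL
                  SELECT * FROM X.CV_Jan
                  UNION ALL
                  SELECT * FROM X.CV_Feb) cv
           ) by HADOOP;
   Disconnect from HADOOP;
quit;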
Some comments. First, I don't think the subquery is necessary for the statement, but I'm leaving it in.
Second, you are missing the point of Hadoop by having multiple tables with the same format. You should have a single table with a date column, and partition the data by that date, as sketched below.
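For example, a sketch of the single-table layout in Hive; the column names here are hypothetical, since the question does not show the structure of the CV_* tables:

CREATE TABLE X.CV_All (
  customer_id STRING,   -- hypothetical columns
  usage_value DOUBLE
)
PARTITIONED BY (load_month STRING)
STORED AS ORC;

-- load each monthly table into its own partition
INSERT INTO TABLE X.CV_All PARTITION (load_month = '2019_01')
SELECT customer_id, usage_value FROM X.CV_Jan;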
I am running a Python script on the server that should update an existing table, 'loading_log', using an ODBC connection.
The issue is that my script does not have any effect on the table in the database, i.e. it does not delete records and does not insert new records.
At the same time, I don't see any errors thrown after the execution.
If I run the same SQL query from my desktop using the same credentials, it works fine.
My question: why does it not work inside the Python script?
Here's an excerpt from my code:
curs.execute('''
delete from loading_log
''')
#
#record loaded record ids into loading_log table
#
#logging.info('insert loaded record id data into loading_log table')
curs.execute('''
insert into loading_log (catalog_sample_events_id,ShippingId)
select top 500
cs.catalog_sample_events_id,
cs.shipping_id ShippingId
from catalog_sample_events cs
join event_type et on et.event_type_id = cs.event_type_id
join event_source es on es.event_source_id = cs.event_source_id
join etl_status esi on esi.etl_status_id = cs.etl_status_id
where cs.catalog_sample_events_id > ?
order by cs.catalog_sample_events_id
''', max_id)
You need to commit the transaction:
curs.commit()
or tell pyodbc to use autocommit mode. See the pyodbc wiki for more details.
I am using APEX collections to store some values and pass them between pages in Oracle Application Express 4.2.3.
I would like to then perform an update statement on a table called "project" with the values from the collection.
My code so far is as follows:
update project
SET name=c.c002,
description=c.c007,
start_date=c.c004,
timeframe=c.c005,
status=c.c009
FROM
apex_collections c
WHERE
c.collection_name = 'PROJECT_DETAILS_COLLECTION'
and id = :p14_id;
where :p14_id is the value of a page item.
However, I am getting the following error:
ORA-00933: SQL command not properly ended
Anyone have any idea on how to approach this?
Thanks!
The UPDATE syntax you are using is not valid in Oracle; it does not allow you to use FROM in the way you are attempting.
The simplest way to do this in Oracle would be with a subquery:
update project
set (name, description, start_date, timeframe, status) =
(select c.c002, c.c007, c.c004, c.c005, c.c009
FROM
apex_collections c
WHERE
c.collection_name = 'PROJECT_DETAILS_COLLECTION'
)
WHERE
id = :p14_id
;
Note that if the subquery returns no rows, the columns in the target table will be updated to NULL; this could be avoided by adding a similar EXISTS condition in the predicate for the update. It could also be avoided by using a MERGE statement instead of an UPDATE.
If the subquery returns multiple rows, the statement will throw an error, but it looks like that should not be the case here.
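For reference, a sketch of the MERGE alternative mentioned above, reusing the columns and collection name from the question; if the collection returns no row, nothing matches and nothing is updated, which avoids the NULL problem:

merge into project p
using (select c002, c007, c004, c005, c009
         from apex_collections
        where collection_name = 'PROJECT_DETAILS_COLLECTION') c
on (p.id = :p14_id)
when matched then update set
    p.name        = c.c002,
    p.description = c.c007,
    p.start_date  = c.c004,
    p.timeframe   = c.c005,
    p.status      = c.c009;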