I need to update one table's records based on another one.
I tried
update currencies
set priority_order= t2.priority_order
from currencies t1
inner join currencies1 t2
on t1.id = t2.id
but it gives an error (the same query works for MySQL and SQL Server).
Then I tried the following:
update currencies
set priority_order = (
    select priority_order
    from currencies1
    where currencies.id = currencies1.id
)
It works, but it is very slow, and I need to do this for some big tables as well.
Any ideas?
In Postgres, this would look something like:
update currencies t1
set priority_order = t2.priority_order
from currencies1 t2
where t1.id = t2.id;
UPDATE currencies dst
SET priority_order = src.priority_order
FROM currencies src
WHERE dst.id = src.id
-- Suppress updates if the value does not actually change
-- This will avoid creation of row-versions
-- which will need to be cleaned up afterwards, by (auto)vacuum.
AND dst.priority_order IS DISTINCT FROM src.priority_order
;
Testing (10K rows, after cache warmup), using the same table for both source and target for the updates:
CREATE TABLE
INSERT 0 10000
VACUUM
Timing is on.
cache warming:
UPDATE 0
Time: 16,410 ms
zero-rows-touched:
UPDATE 0
Time: 8,520 ms
all-rows-touched:
UPDATE 10000
Time: 84,375 ms
Normally you will seldom see the case where no rows are affected, nor the case where all rows are affected. But with only 50% of the rows touched, the query would still be twice as fast (plus the reduced work for vacuum after the query).
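For reference, a minimal sketch of a comparable test setup in psql (an assumed layout; the actual test behind the timings above may have differed):
-- Build a 10K-row test table (assumed layout)
CREATE TABLE currencies
( id             integer PRIMARY KEY
, priority_order integer NOT NULL
);
INSERT INTO currencies (id, priority_order)
SELECT i, i FROM generate_series(1, 10000) i;
VACUUM ANALYZE currencies;
\timing on
-- zero-rows-touched: the IS DISTINCT FROM filter suppresses every update
UPDATE currencies dst
SET priority_order = src.priority_order
FROM currencies src
WHERE dst.id = src.id
AND dst.priority_order IS DISTINCT FROM src.priority_order;
-- all-rows-touched: every row gets a new value, so every row is rewritten
UPDATE currencies dst
SET priority_order = src.priority_order + 1
FROM currencies src
WHERE dst.id = src.id;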
I am inserting records using a left join in Hive. When I set limit 1, the query works, but for all records the query gets stuck at 99% of the reduce job.
The below query works:
Insert overwrite table tablename select a.id , b.name from a left join b on a.id = b.id limit 1;
But this does not:
Insert overwrite table tablename select table1.id , table2.name from table1 left join table2 on table1.id = table2.id;
I have increased the number of reducers but it still doesn't work.
Here are a few Hive optimizations that might help the query optimizer and reduce the overhead of data sent across the wire.
set hive.exec.parallel=true;
set mapred.compress.map.output=true;
set mapred.output.compress=true;
set hive.exec.compress.output=true;
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
However, I think there's a greater chance that the underlying problem is skew in the join key. For a full description of skew and possible workarounds, see https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization
You also mentioned that table1 is much smaller than table2. You might try a map-side join depending on your hardware constraints. (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins)
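A rough sketch of what that could look like (the settings and the hint below are standard Hive; the size threshold is only an example):
-- Let Hive convert the join to a map-side join when the small table fits in memory
set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=25000000; -- threshold in bytes for the small table
-- Or force it with the legacy MAPJOIN hint (table2 is the small, right-hand table here)
Insert overwrite table tablename select /*+ MAPJOIN(table2) */ table1.id , table2.name from table1 left join table2 on table1.id = table2.id;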
If your query is getting stuck at 99%, check out the following options:
Data skewness: if you have skewed data, it is possible that one reducer is doing all the work.
Duplicate keys on both sides: if you have many duplicate join keys on both sides, your output might explode and the query might get stuck. A quick check for both is sketched below.
One of your tables is small: try to use a map join or, if possible, an SMB join, which is a huge performance gain over a reduce-side join.
Go to the resource manager log and see the amount of data the job is reading and writing.
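For the first two points, a rough sketch of a quick check (table and column names are taken from the question; adjust to your data):
-- Count rows per join key to spot skew or heavy duplication
select id, count(*) as cnt
from table2
group by id
order by cnt desc
limit 20;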
Hive automatically does some optimizations when it comes to joins and loads one side of the join into memory if it fits the requirements. However, in some cases these jobs get stuck at 99% and never really finish.
I have faced this multiple times, and the way I have avoided it is by explicitly specifying some settings to Hive. Try the settings below and see if they work for you.
hive.auto.convert.join=false
mapred.compress.map.output=true
hive.exec.parallel=true
Make sure you don't have rows with duplicate id values in one of your data tables!
I recently encountered the same issue with a left join's map-reduce process getting stuck on 99% in Hue.
After a little snooping I discovered the root of my problem: there were rows with duplicate member_id join keys in one of my tables. Left joining all of the duplicate member_ids would have created a new table containing hundreds of millions of rows, consuming more than my allotted memory on our company's Hadoop server.
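For what it's worth, a rough sketch of deduplicating a table on its join key before the join (Hive window-function syntax; the table name right_table and the tie-break ordering are placeholders):
-- Keep a single row per member_id so the join cannot fan out
select t.*
from (
  select r.*, row_number() over (partition by member_id order by member_id) as rn
  from right_table r
) t
where t.rn = 1;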
Use these configurations and try:
hive> set mapreduce.map.memory.mb=9000;
hive> set mapreduce.map.java.opts=-Xmx7200m;
hive> set mapreduce.reduce.memory.mb=9000;
hive> set mapreduce.reduce.java.opts=-Xmx7200m;
I faced the same problem with a left outer join similar to:
select bt.*, sm.newparam from
big_table bt
left outer join
small_table sm
on bt.ident = sm.ident
and bt.cate = sm.cate
I made an analysis based on the answers already given and saw two of the problems described:
The left table was more than 100x bigger than the right table:
select count(*) from big_table -- returned 130M
select count(*) from small_table -- returned 1.3M
I also detected that one of the join variables was rather skewed in the right table:
select count(*), cate
from small_table
group by cate
-- returned
-- A 70K
-- B 1.1M
-- C 120K
I tried most of the solutions given in the other answers, plus some extra parameters I found, without success:
set hive.optimize.skewjoin=true;
set hive.skewjoin.key=500000;
set hive.skewjoin.mapjoin.map.tasks=10000;
set hive.skewjoin.mapjoin.min.split=33554432;
Lastly, I found out that the left table had a really high percentage of null values for the join columns bt.ident and bt.cate.
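A quick way to measure that (a sketch; column and table names are taken from the query above):
-- Count how many rows in the left table have null join keys
select count(*) as total_rows,
       sum(case when ident is null and cate is null then 1 else 0 end) as both_keys_null
from big_table;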
So I tried one last thing, which finally worked for me: splitting the left table depending on whether bt.ident and bt.cate were null or not, and then making a union all of both branches:
select * from
(
  select bt.*, sm.newparam
  from (select * from big_table where ident is not null or cate is not null) bt
  left outer join small_table sm
    on bt.ident = sm.ident
   and bt.cate = sm.cate
  union all
  select nbt.*, null as newparam
  from big_table nbt
  where ident is null and cate is null
) combined
I have 2 tables, Table1 and Table2, both holding large amounts of data; Table1 has 5 million records and Table2 has 80,000. I am running an update:
Update Table1 a
Set
a.id1 = (SELECT DISTINCT p.col21
         FROM Table2 p
         WHERE p.col21 = SUBSTR(a.id, 2, LENGTH(a.id)));
The substr and distinct in the query are making it slow.
How can this query be rewritten to speed up the process, and what columns do I need to index?
Maybe a merge:
merge into Table1 a
using Table2 p
on (p.col21 = SUBSTR(a.id, 2, LENGTH(a.id)))
When matched then
update set a.id1 = p.col21;
and a function-based index on the SUBSTR(a.id, 2, LENGTH(a.id)) expression.
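A sketch of such an index (the index name is hypothetical):
-- Function-based index matching the expression used in the merge's ON clause
CREATE INDEX ix_table1_id_substr ON Table1 (SUBSTR(id, 2, LENGTH(id)));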
I see that you are dynamically calculating:
p.col21=SUBSTR(a.id,2,LENGTH(a.id))
This will take substantial time and makes it impossible for a plain index on id to help. Did you consider actually creating a column with that value? This would allow you to index it and make the query much faster. If the id is static, this seems like an easy win.
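One hedged way to do that on Oracle 11g or later is a virtual column (the column and index names here are hypothetical):
-- Persist the derived value as a virtual column and index it
ALTER TABLE Table1 ADD (id_trimmed GENERATED ALWAYS AS (SUBSTR(id, 2, LENGTH(id))) VIRTUAL);
CREATE INDEX ix_table1_id_trimmed ON Table1 (id_trimmed);
-- The update can then join on id_trimmed instead of recomputing SUBSTR for every row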
How many rows does your subquery return, and how many rows are you updating? With a large number of updates, indexes may not help you out at all.
I have two tables. Table 2 contains more recent records.
Table 1 has 900K records and Table 2 about the same.
Executing the query below takes about 10 minutes. While it runs, most of the other queries against Table 1 give a timeout exception.
DELETE T1
FROM Table1 T1 WITH(NOLOCK)
LEFT OUTER JOIN Table2 T2
ON T1.ID = T2.ID
WHERE T2.ID IS NULL AND T1.ID IS NOT NULL
Could someone help me optimize the query above or write something more efficient?
Also, how can I fix the timeout issue?
The optimizer will likely choose to lock the whole table, as that is easier when it needs to delete that many rows. In a case like this I delete in chunks.
while (1 = 1)
begin
    with cte as
    (
        select *
        from Table1
        where Id not in (select Id from Table2)
    )
    delete top(1000) cte
    if @@rowcount = 0
        break
    waitfor delay '00:00:01' -- give it some rest :)
end
So the query deletes 1000 rows at a time. The optimizer will likely lock just a page to delete the rows, not the whole table.
The total execution time will be longer, but it will not block other callers.
Disclaimer: assumed MS SQL.
Another approach is to use a SNAPSHOT transaction. This way table readers will not be blocked while rows are being deleted.
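A rough sketch of enabling that (the database name is a placeholder):
-- Allow snapshot isolation at the database level
ALTER DATABASE YourDatabase SET ALLOW_SNAPSHOT_ISOLATION ON;
-- Readers then opt in per session and read a consistent snapshot instead of blocking
SET TRANSACTION ISOLATION LEVEL SNAPSHOT;
BEGIN TRANSACTION;
SELECT COUNT(*) FROM Table1;
COMMIT;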
Wait a second, are you trying to do this...
DELETE Table1 WHERE ID NOT IN (SELECT ID FROM Table2)
?
If so, that's how I would write it.
You could also try to update the statistics on both tables. And of course indexes on Table1.ID and Table2.ID could speed things up considerably.
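A sketch of both suggestions (the index names are hypothetical):
-- Refresh statistics so the optimizer has current row counts
UPDATE STATISTICS Table1;
UPDATE STATISTICS Table2;
-- Supporting indexes for the anti-join on ID
CREATE INDEX IX_Table1_ID ON Table1 (ID);
CREATE INDEX IX_Table2_ID ON Table2 (ID);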
EDIT: If you're getting timeouts from the designer, increase the "Designer" timeout value in SSMS (default is 30 seconds). Tools -> Options -> Designers -> "Override connection string time-out value for table designer updates" -> enter reasonable number (in seconds).
Both ID columns need an index.
Then use simpler SQL:
DELETE Table1 WHERE NOT EXISTS (SELECT * FROM Table2 WHERE Table1.ID = Table2.ID)
I've added a field to a MySQL table. I need to populate the new column with the value from another table. Here is the query that I'd like to run:
UPDATE table1 t1
SET t1.user_id =
(
SELECT t2.user_id
FROM table2 t2
WHERE t2.usr_id = t1.usr_id
)
I ran that query locally on 239K rows and it took about 10 minutes. Before I do that on the live environment, I wanted to ask if what I am doing looks OK, i.e. does 10 minutes sound reasonable? Or should I do it another way: a PHP loop? A better query?
Use an UPDATE JOIN! This gives you a native inner join to update from, rather than running the subquery for every bloody row. It tends to be much faster.
update table1 t1
inner join table2 t2 on
t1.usr_id = t2.usr_id
set t1.user_id = t2.user_id
Ensure that you have an index on each of the usr_id columns, too. That will speed things up quite a bit.
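A sketch of those indexes (the index names are hypothetical):
-- Supporting indexes for the join columns
ALTER TABLE table1 ADD INDEX ix_table1_usr_id (usr_id);
ALTER TABLE table2 ADD INDEX ix_table2_usr_id (usr_id);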
If you have some rows that don't match up, and you want to set t1.user_id = null, you will need to do a left join in lieu of an inner join. If the column is null already, and you're just looking to update it to the values in t2, use an inner join, since it's faster.
I should make mention, for posterity, that this is MySQL syntax only. The other RDBMS's have different ways of doing an update join.
There are two rather important pieces of information missing:
What type of tables are they?
What indexes exist on them?
If table2 has an index that contains user_id and usr_id as the first two columns and table1 is indexed on user_id, it shouldn't be that bad.
You don't have an index on t2.usr_id.
Create this index and run your query again, or use the multiple-table UPDATE proposed by @Eric (with LEFT JOIN, of course).
Note that MySQL has no JOIN methods other than NESTED LOOPS, so it's the index that matters, not the UPDATE syntax.
However, the multiple table UPDATE is more readable.
We want to delete some matching rows within the same table, but this seems to have a performance issue as the table has 1 billion rows.
Since it is an Oracle database, we could also use PL/SQL to delete incrementally, but we want to see what options are available to improve the performance using SQL alone.
DELETE
FROM schema.adress
WHERE key = 6776
AND matchSequence = 1
AND EXISTS
(
SELECT 1
FROM schema.adress t2
WHERE t2.flngEntityKey = 9909
AND t2.matchType = 'NEW'
AND t2.matchType = schema.adress.matchType
AND t2.key = schema.adress.key
AND t2.sequence = schema.adress.sequence
)
Additional details
Cardinality is 900 Million rows
No triggers
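For reference, a rough sketch of the incremental PL/SQL approach mentioned above (the batch size of 50000 is arbitrary; the predicate is copied from the DELETE statement):
-- Delete in batches and commit between them to keep undo manageable
-- Note: because the EXISTS references the same table, verify that batching does not change which rows qualify
BEGIN
  LOOP
    DELETE FROM schema.adress a
     WHERE a.key = 6776
       AND a.matchSequence = 1
       AND EXISTS (SELECT 1
                     FROM schema.adress t2
                    WHERE t2.flngEntityKey = 9909
                      AND t2.matchType = 'NEW'
                      AND t2.matchType = a.matchType
                      AND t2.key = a.key
                      AND t2.sequence = a.sequence)
       AND ROWNUM <= 50000;
    EXIT WHEN SQL%ROWCOUNT = 0;
    COMMIT;
  END LOOP;
  COMMIT;
END;
/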