SQL Server Update query very slow - sql

I ran the following query on a previous year's data and it took 3 hours; this year it took 13 days. I don't know why that is. Any help would be much appreciated.
I have just tested the queries on the old SQL Server and they complete in 3 hours, so the problem must have something to do with the new SQL Server I created. Do you have any ideas what the problem might be?
The query:
USE [ABCJan]
CREATE INDEX Link_Oct ON ABCJan2014 (Link_ref)
GO
CREATE INDEX Day_Oct ON ABCJan2014 (date_1)
GO
UPDATE ABCJan2014
SET ABCJan2014.link_id = LT.link_id
FROM ABCJan2014 MT
INNER JOIN [Central].[dbo].[LookUp_ABC_20142015] LT
ON MT.Link_ref = LT.Link_ref
UPDATE ABCJan2014
SET SumAvJT = ABCJan2014.av_jt * ABCJan2014.n
UPDATE ABCJan2014
SET ABCJan2014.DayType = LT2.DayType
FROM ABCJan2014 MT
INNER JOIN [Central].[dbo].[ABC_20142015_days] LT2
ON MT.date_1 = LT2.date1
With the following data structures:
ABCJan2014 (70 million rows - NO UNIQUE IDENTIFIER - Link_ref & date_1 together are unique)
Link_ID nvarchar (17)
Link_ref int
Date_1 smalldatetime
N int
Av_jt int
SumAvJT decimal(38,14)
DayType nvarchar (50)
LookUp_ABC_20142015
Link_ID nvarchar (17) PRIMARY KEY
Link_ref int INDEXED
Link_metres int
ABC_20142015_days
Date1 smalldatetime PRIMARY KEY & INDEXED
DayType nvarchar(50)
EXECUTION PLAN
It appears to be this part of the query that is taking such a long time.
Thanks again for any help; I'm pulling my hair out.

Create an index on the ABCJan2014 table, as it is currently a heap

If you look at the execution plan, the time is in the actual update
Look at the log file:
Is the log file on a fast disk?
Is the log file on the same physical disk as the data?
Does the log file need to grow?
Size the log file to roughly 1/2 the size of the data file (a sizing sketch follows this list)
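As a rough sketch (the log file name and target size here are assumptions; check sys.master_files for your actual values), pre-growing the log avoids repeated autogrow pauses during the big update:
SELECT name, physical_name, size * 8 / 1024 AS size_mb
FROM sys.master_files
WHERE database_id = DB_ID('ABCJan');
ALTER DATABASE [ABCJan]
MODIFY FILE (NAME = ABCJan_log, SIZE = 20GB);  -- roughly half the data file, per the advice above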
As far as indexes go, test and tune with the counts below
If the join columns are indexed there is not much to do here
select count(*)
FROM ABCJan2014 MT
INNER JOIN [Central].[dbo].[LookUp_ABC_20142015] LT
ON MT.Link_ref = LT.Link_ref
select count(*)
FROM ABCJan2014 MT
INNER JOIN [Central].[dbo].[ABC_20142015_days] LT2
ON MT.date_1 = LT2.date1
Start with a top (1000) to get update tuning working
For grins please give this a try
Please post this query plan
(do NOT add an index to ABCJan2014 link_id)
UPDATE top (1000) MT
SET MT.link_id = LT.link_id
FROM ABCJan2014 MT
JOIN [Central].[dbo].[LookUp_ABC_20142015] LT
ON MT.Link_ref = LT.Link_ref
AND MT.link_id <> LT.link_id
If LookUp_ABC_20142015 is not active then add a nolock
JOIN [Central].[dbo].[LookUp_ABC_20142015] LT with (nolock)
nvarchar(17) for a PK seems strange to me
why nvarchar - do you really have some Unicode data?
why not just char(17) and let it allocate the space?

Why have 3 update statements when you can do it in one?
UPDATE MT
SET MT.link_id = CASE WHEN LT.link_id IS NULL THEN MT.link_id ELSE LT.link_id END,
MT.SumAvJT = MT.av_jt * MT.n,
MT.DayType = CASE WHEN LT2.DayType IS NULL THEN MT.DayType ELSE LT2.DayType END
FROM ABCJan2014 MT
LEFT OUTER JOIN [Central].[dbo].[LookUp_ABC_20142015] LT
ON MT.Link_ref = LT.Link_ref
LEFT OUTER JOIN [Central].[dbo].[ABC_20142015_days] LT2
ON MT.date_1 = LT2.date1
Also, I would create only one index for the join. Create the following index after the updates.
CREATE INDEX Day_Oct ON ABCJan2014 (date_1)
GO
Before you run, compare the execution plans by putting the update query above and your 3 update statements together in one query window, and do Display Estimated Execution Plan. It will show the estimated percentages and you'll be able to tell if it's any better (if the new one is < 50%).
Also, it looks like the query is slow because it's doing a Hash Match. Please add a PK index on [LookUp_ABC_20142015].Link_ref.
[LookUp_ABC_20142015].Link_ID is a bad choice for PK, so drop the PK on that column.
Then add an index to [ABCJan2014].Link_ref.
See if that makes any improvement; a sketch of these index changes follows.
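A minimal sketch of those changes (the existing PK constraint name is an assumption; look up the real name first):
ALTER TABLE [Central].[dbo].[LookUp_ABC_20142015] DROP CONSTRAINT [PK_LookUp_ABC_20142015];  -- actual PK name may differ
ALTER TABLE [Central].[dbo].[LookUp_ABC_20142015] ADD CONSTRAINT [PK_LookUp_Link_ref] PRIMARY KEY (Link_ref);
CREATE INDEX IX_ABCJan2014_Link_ref ON ABCJan2014 (Link_ref);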

If you are going to update a table you need a unique identifier, so put one on ABCJan2014 ASAP, especially since it is so large. There is no reason why you can't create a unique index on the fields that together compose the unique record (a sketch follows). In the future, do not ever design a table that does not have a unique index or PK. That is simply asking for trouble, both in processing time and, more importantly, in data integrity.
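A minimal sketch, assuming the Link_ref + date_1 combination really is unique (building this on 70 million rows will take a while and a lot of space):
CREATE UNIQUE CLUSTERED INDEX UX_ABCJan2014_Link_ref_date_1 ON ABCJan2014 (Link_ref, date_1);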
When you have a lot of updating to do to a large table, it is sometimes more effective to work in batches. You don't tie up the table in a lock for a long period of time, and sometimes it is even faster because of how the database internals work the problem. Consider processing around 50,000 records at a time in a loop or cursor (you may need to experiment to find the sweet spot for the batch size; there is generally a point where the update starts to take significantly longer). A batched version is sketched a few lines below.
UPDATE ABCJan2014
SET ABCJan2014.link_id = LT.link_id
FROM ABCJan2014 MT
JOIN [Central].[dbo].[LookUp_ABC_20142015] LT ON MT.Link_ref = LT.Link_ref
The code above will update all records from the join. If some of the records already have the link_id, you might save considerable time by only updating the records where link_id is null or ABCJan2014.link_id <> LT.link_id. You have a 70 million record table; you don't need to be updating records that do not need a change. The same of course applies to your other updates as well.
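A minimal sketch of the batched, filtered version (the 50,000 batch size is illustrative; tune it as described above):
DECLARE @rows INT = 1;
WHILE @rows > 0
BEGIN
    UPDATE TOP (50000) MT
    SET MT.link_id = LT.link_id
    FROM ABCJan2014 MT
    JOIN [Central].[dbo].[LookUp_ABC_20142015] LT ON MT.Link_ref = LT.Link_ref
    WHERE MT.link_id IS NULL OR MT.link_id <> LT.link_id;
    SET @rows = @@ROWCOUNT;
END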
Not knowing how much data gets added to this table or how often this number needs updating, consider that SumAvJT might be best defined as a persisted computed column. Then it gets updated automatically when one of the two values changes. This wouldn't help if the table is bulk loaded, but it might if records come in individually.

In the execution plan, it makes recommendations for indexes to add. Have you created those indexes? Also, take a look at your older server's data structure - script out the table structures including indexes - and see if there are differences between them. At some point somebody may have built an index on your old server's tables to make this more efficient.
That said, what volume of data are you looking at? If you're looking at significantly different volumes of data, it could be that the execution plans generated by the servers differ significantly. SQL Server doesn't always guess right, when it builds the plans.
Also, are you using prepared statements (i.e., stored procedures)? If you are, then it's possible that the cached data access plan is simply out of date and needs to be refreshed, or you need to update statistics on the tables and then run the procedure with RECOMPILE so that a new data access plan is generated (a sketch follows).
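A minimal sketch, assuming the updates are wrapped in a stored procedure (the procedure name here is hypothetical):
UPDATE STATISTICS ABCJan2014 WITH FULLSCAN;
EXEC sp_recompile N'dbo.YourUpdateProc';  -- hypothetical name; alternatively run the procedure WITH RECOMPILE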

Where is the [Central] server located?
Is it possible to duplicate your [Central].[dbo].[LookUp_ABC_20142015] and [Central].[dbo].[ABC_20142015_days] tables locally?
1) Do:
select * into [ABC_20142015_days] from [Central].[dbo].[ABC_20142015_days]
select * into [LookUp_ABC_20142015] from [Central].[dbo].[LookUp_ABC_20142015]
2) Recreate the indexes on [ABC_20142015_days] and [LookUp_ABC_20142015]...
3) Rewrite your updates by removing the "[Central].[dbo]." prefix! (A sketch of steps 2 and 3 follows.)
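A rough sketch of steps 2 and 3 (the index names are assumptions):
CREATE INDEX IX_LookUp_Link_ref ON LookUp_ABC_20142015 (Link_ref);
CREATE UNIQUE CLUSTERED INDEX UX_Days_Date1 ON ABC_20142015_days (Date1);
UPDATE MT
SET MT.link_id = LT.link_id
FROM ABCJan2014 MT
INNER JOIN LookUp_ABC_20142015 LT ON MT.Link_ref = LT.Link_ref;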
Just after writing this solution, I found another one, but I'm not sure whether it applies to your server: add the "REMOTE" join hint... I have never used it, but you can find the documentation at https://msdn.microsoft.com/en-us/library/ms173815.aspx
Hoping this helps...

All the previous answers that suggest improving the structure of the tables and the queries themselves are good to know; there is no doubt about that.
However, your question is why the SAME data/structure and the SAME queries give this huge difference.
So before you look at optimising the SQL you must find the real cause, and the real cause is hardware, software, or configuration. Start by comparing the SQL Server setup with the old one, then move on to the hardware and benchmark it. Lastly, look at the software for differences.
Only when you have solved the actual problem can you start improving the SQL itself.

ALTER TABLE dbo.ABCJan2014
ADD SumAvJT AS av_jt * n --PERSISTED (the existing SumAvJT column would need to be dropped first)
CREATE INDEX ix_Link_ref ON ABCJan2014 (Link_ref) INCLUDE (link_id)
GO
CREATE INDEX ix_date_1 ON ABCJan2014 (date_1) INCLUDE (DayType)
GO
UPDATE ABCJan2014
SET ABCJan2014.link_id = LT.link_id
FROM ABCJan2014 MT
JOIN [Central].[dbo].[LookUp_ABC_20142015] LT ON MT.Link_ref = LT.Link_ref
UPDATE ABCJan2014
SET ABCJan2014.DayType = LT2.DayType
FROM ABCJan2014 MT
JOIN [Central].[dbo].[ABC_20142015_days] LT2 ON MT.date_1 = LT2.date1

I guess there is a lot of page splitting. Can you try this?
SELECT
(SELECT LT.link_id FROM [Central].[dbo].[LookUp_ABC_20142015] LT
WHERE MT.Link_ref = LT.Link_ref) AS Link_ID,
Link_ref,
Date_1,
N,
Av_jt,
MT.av_jt * MT.n AS SumAvJT,
(SELECT LT2.DayType FROM [Central].[dbo].[ABC_20142015_days] LT2
WHERE MT.date_1 = LT2.date1) AS DayType
INTO ABCJan2014new
FROM ABCJan2014 MT
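If this SELECT ... INTO approach is fast enough, a hedged follow-up (once the new table is verified and re-indexed) is to swap it in:
EXEC sp_rename 'ABCJan2014', 'ABCJan2014_old';
EXEC sp_rename 'ABCJan2014new', 'ABCJan2014';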

In addition to all the answers above:
i) Even 3 hours is a lot. I mean, even if a query takes 3 hours, I first check my requirement and revise it, and raise the issue. Of course I will also optimize my query.
In your query, none of the updates appears to be a serious matter.
As #Devart pointed out, one of the columns can be a computed column.
ii) Try running the other queries on the new server and compare.
iii) Rebuild the indexes.
iv) Use "with (nolock)" in your joins.
v) Create an index on table LookUp_ABC_20142015, column Link_ref.
vi) A clustered index on nvarchar(17) or datetime is always a bad idea.
Joins on datetime or varchar columns always take time.

Try using the alias instead of repeating the table name in the UPDATE query
USE [ABCJan]
CREATE INDEX Link_Oct ON ABCJan2014 (Link_ref)
GO
CREATE INDEX Day_Oct ON ABCJan2014 (date_1)
GO
UPDATE MT
SET MT.link_id = LT.link_id
FROM ABCJan2014 MT
INNER JOIN [Central].[dbo].[LookUp_ABC_20142015] LT
ON MT.Link_ref = LT.Link_ref
UPDATE ABCJan2014
SET SumAvJT = av_jt * n
UPDATE MT
SET MT.DayType = LT2.DayType
FROM ABCJan2014 MT
INNER JOIN [Central].[dbo].[ABC_20142015_days] LT2
ON MT.date_1 = LT2.date1

Frankly, I think you've already answered your own question.
ABCJan2014 (70 million rows - NO UNIQUE IDENTIFIER - Link_ref & date_1 together are unique)
If you know the combination is unique, then by all means 'enforce' it. That way the server will know it too and can make use of it.
Query Plan showing the need for an index on [ABCJAN2014].[date_1] 3 times in a row!
You shouldn't believe everything that MSSQL tells you, but you should at least give it a try =)
Combining both, I'd suggest you add a PK to the table on the fields [date_1] and [Link_ref] (in that order!). Mind: adding a Primary Key -- which is essentially a clustered unique index -- will take a while and require a lot of space, as the table pretty much gets duplicated along the way. A sketch follows.
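A minimal sketch of that primary key (clustered, in the suggested column order):
ALTER TABLE dbo.ABCJan2014
ADD CONSTRAINT PK_ABCJan2014 PRIMARY KEY CLUSTERED (date_1, Link_ref);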
As far as your query goes, you could put all 3 updates in 1 statement (similar to what joordan831 suggests) but you should take care about the fact that a JOIN might limit the number of rows affected. As such I'd rewrite it like this:
UPDATE MT
SET MT.link_id = (CASE WHEN LT.Link_ref IS NULL THEN MT.link_id ELSE LT.link_id END), -- update when there is a match, otherwise re-use existing value
MT.DayType = (CASE WHEN LT2.date1 IS NULL THEN MT.DayType ELSE LT2.DayType END), -- update when there is a match, otherwise re-use existing value
MT.SumAvJT = MT.av_jt * MT.n
FROM ABCJan2014 MT
LEFT OUTER JOIN [Central].[dbo].[LookUp_ABC_20142015] LT
ON MT.Link_ref = LT.Link_ref
LEFT OUTER JOIN [Central].[dbo].[ABC_20142015_days] LT2
ON MT.date_1 = LT2.date1
which should have the same effect as running your original 3 updates sequentially; but hopefully taking a lot less time.
PS: Going by the query plans, you already have indexes on the tables you JOIN to ([LookUp_ABC_20142015] & [ABC_20142015_days]) but they seem to be non-unique (and not always clustered). Assuming they're suffering from the 'we know it's unique but the server doesn't' illness: it would be advisable to also add a Primary Key to those tables on the fields you join to, both for data-integrity and performance reasons!
Good luck.

Update data
set
data.abcKey=surrogate.abcKey
from [MyData].[dbo].[fAAA_Stage] data with(nolock)
join [MyData].[dbo].[dBBB_Surrogate] surrogate with(nolock)
on data.MyKeyID=surrogate.MyKeyID
The surrogate table must have a unique nonclustered index; MyKeyID must be created as a unique non-clustered key. The performance improvement is significant.

Related

Oracle Sql tuning with index

I have a table T with some 500,000 records. That table is a hierarchical table.
My goal is to update the table by self-joining it, based on a condition for the parent-child relationship.
The update query is taking really long because the number of rows is really high. I have created a unique index on the columns which help identify the rows to update (meaning X and Y). After creating the index the cost has gone down, but the query is still performing very slowly.
This is my query format:
update T
set (a1, b1)
= (select parent.a1, parent.b1
from T parent, T child
where parent.id = child.parent_id
and T.X = child.X
and T.Y = child.Y)
After creating the index, the execution plan shows that it is doing an index scan for CRS.PARENT but a full table scan for CRS.CHILD, and also during the update; as a result the query is taking forever to complete.
Please suggest any tips or recommendations to solve this problem.
You are updating all 500,000 rows, so an index is a bad idea. 500,000 index lookups will take much longer than it needs to.
You would be better served using a MERGE statement.
It is hard to tell exactly what your table structure is, but it would look something like this, assuming X and Y are the primary key columns in T (...could be wrong about that):
MERGE INTO T
USING ( SELECT TC.X,
               TC.Y,
               TP.A1,
               TP.B1
        FROM T TC
        INNER JOIN T TP ON TP.ID = TC.PARENT_ID ) U
ON ( T.X = U.X AND T.Y = U.Y )
WHEN MATCHED THEN UPDATE SET T.A1 = U.A1,
                             T.B1 = U.B1;

Hive query stuck at 99%

I am inserting records using a left join in Hive. When I set limit 1 the query works, but for all records the query gets stuck at 99% of the reduce job.
The query below works:
Insert overwrite table tablename select a.id , b.name from a left join b on a.id = b.id limit 1;
But this one does not:
Insert overwrite table tablename select table1.id , table2.name from table1 left join table2 on table1.id = table2.id;
I have increased the number of reducers but it still doesn't work.
Here are a few Hive optimizations that might help the query optimizer and reduce overhead of data sent across the wire.
set hive.exec.parallel=true;
set mapred.compress.map.output=true;
set mapred.output.compress=true;
set hive.exec.compress.output=true;
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
However, I think there's a greater chance that the underlying problem is skew in the join key. For a full description of skew and possible workarounds see https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization
You also mentioned that table1 is much smaller than table2. You might try a map-side join, depending on your hardware constraints (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins).
If your query is getting stuck at 99%, check out the following options:
Data skewness: if you have skewed data, it is possible that 1 reducer is doing all the work.
Duplicate keys on both sides: if you have many duplicate join keys on both sides, your output might explode and the query might get stuck.
One of your tables is small: try a map join, or if possible an SMB join, which is a huge performance gain over a reduce-side join (a map-join sketch follows this list).
Go to the resource manager log and see the amount of data the job is accessing and writing.
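A rough sketch of the map-join route for the query in the question, assuming table2 is the small side and fits in memory (the MAPJOIN hint is one option; the auto-conversion setting is another):
set hive.auto.convert.join=true;  -- let Hive convert automatically; some setups may need the explicit hint below instead
Insert overwrite table tablename
select /*+ MAPJOIN(table2) */ table1.id, table2.name
from table1 left join table2 on table1.id = table2.id;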
Hive automatically does some optimizations when it comes to joins and loads one side of the join into memory if it fits the requirements. However, in some cases these jobs get stuck at 99% and never really finish.
I have faced this multiple times, and the way I have avoided it is by explicitly passing some settings to Hive. Try the settings below and see if they work for you.
hive.auto.convert.join=false
mapred.compress.map.output=true
hive.exec.parallel=true
Make sure you don't have rows with duplicate id values in one of your data tables!
I recently encountered the same issue with a left join's map-reduce process getting stuck at 99% in Hue.
After a little snooping I discovered the root of my problem: there were rows with duplicate member_id matching variables in one of my tables. Left joining all of the duplicate member_ids would have created a new table containing hundreds of millions of rows, consuming more than my allotted memory on our company's Hadoop server. A quick duplicate check is sketched below.
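A quick, hedged check for that situation (member_id is the column from my case; the table name here is a placeholder):
select member_id, count(*) as cnt
from your_table
group by member_id
having count(*) > 1
limit 100;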
Use these configurations and try:
hive> set mapreduce.map.memory.mb=9000;
hive> set mapreduce.map.java.opts=-Xmx7200m;
hive> set mapreduce.reduce.memory.mb=9000;
hive> set mapreduce.reduce.java.opts=-Xmx7200m;
I faced the same problem with a left outer join similar to:
select bt.*, sm.newparam from
big_table bt
left outer join
small_table sm
on bt.ident = sm.ident
and bt.cate = sm.cate
I made an analysis based on the answers already given, and I saw two of the problems they describe:
Left table was more than 100x bigger than the right table
select count(*) from big_table -- returned 130M
select count(*) from small_table -- returned 1.3M
I also detected that one of the join variable was rather skewed in the right table:
select count(*), cate
from small_table
group by cate
-- returned
-- A 70K
-- B 1.1M
-- C 120K
I tried most of the solutions given in the other answers, plus some extra parameters I found here, without success:
set hive.optimize.skewjoin=true;
set hive.skewjoin.key=500000;
set hive.skewjoin.mapjoin.map.tasks=10000;
set hive.skewjoin.mapjoin.min.split=33554432;
Lastly I found out that the left table had a really high percentage of null values for the join columns: bt.ident and bt.cate.
So I tried one last thing, which finally worked for me: splitting the left table depending on bt.ident and bt.cate being null or not, and then making a union all of both branches:
select * from
(select bt.*, sm.newparam from
(select * from big_table where ident is not null or cate is not null) bt
left outer join
small_table sm
on bt.ident = sm.ident
and bt.cate = sm.cate
union all
select nbt.*, null as newparam from big_table nbt where ident is null and cate is null) combined

SubQuery vs TempTable before Merge

I have a complex query that I want to use as the Source of a Merge into a table. This will be executed over millions of rows. Currently I am trying to apply constraints to the data by inserting it into a temp table before the merge.
The operations are:
Filter out duplicate data.
Join some tables to pull in additional data
Insert into the temp table.
Here is the query.
-- Get all Orders that aren't in the system
WITH Orders AS
(
SELECT *
FROM [Staging].Orders o
WHERE NOT EXISTS
(
SELECT 1
FROM Maps.VendorBOrders vbo
JOIN OrderFact of
ON of.Id = vbo.OrderFactId
AND InternalOrderId = o.InternalOrderId
AND of.DataSetId = o.DataSetId
AND of.IsDelete = 0
)
)
INSERT INTO #VendorBOrders
(
CustomerId
,OrderId
,OrderTypeId
,TypeCode
,LineNumber
,FromDate
,ThruDate
,LineFromDate
,LineThruDate
,PlaceOfService
,RevenueCode
,BillingProviderId
,Cost
,AdjustmentTypeCode
,PaymentDenialCode
,EffectiveDate
,IDRLoadDate
,RelatedOrderId
,DataSetId
)
SELECT
vc.CustomerId
,OrderId
,OrderTypeId
,TypeCode
,LineNumber
,FromDate
,ThruDate
,LineFromDate
,LineThruDate
,PlaceOfService
,RevenueCode
,bp.Id
,Cost
,AdjustmentTypeCode
,PaymentDenialCode
,EffectiveDate
,IDRLoadDate
,ro.Id
,o.DataSetId
FROM
Orders o
-- Join related orders to match orders sharing same instance
JOIN Maps.VendorBRelatedOrder ro
ON ro.OrderControlNumber = o.OrderControlNumber
AND ro.EquitableCustomerId = o.EquitableCustomerId
AND ro.DataSetId = o.DataSetId
JOIN BillingProvider bp
ON bp.ProviderNPI = o.ProviderNPI
-- Join on customers and fail if the customer doesn't exist
LEFT OUTER JOIN [Maps].VendorBCustomer vc
ON vc.ExtenalCustomerId = o.ExtenalCustomerId
AND vc.VendorId = o.VendorId;
I am wondering if there is anything I can do to optimize it for time. I have tried using the DB Engine Tuner, but this query takes 100x more CPU Time than the other queries I am running. Is there anything else that I can look into or can the query not be improved further?
A CTE is just syntax - that CTE is evaluated (run) as part of the join.
First just run it as a select statement (no insert).
If the select is slow then:
Move that CTE into a #TEMP table so it is evaluated once and materialized (see the sketch after this list).
Put an index (PK if applicable) on the three join columns.
If the select is not slow, then it is insert time on #VendorBOrders:
First only create the PK, and sort the insert on the PK so as not to fragment that clustered index.
Then AFTER the insert is complete, build any other necessary indexes.
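A minimal sketch of materializing the CTE first (names are taken from the question, except the OrderFact alias and the index, which are assumptions based on the later joins):
SELECT o.*
INTO #Orders
FROM [Staging].Orders o
WHERE NOT EXISTS
(
    SELECT 1
    FROM Maps.VendorBOrders vbo
    JOIN OrderFact ofact ON ofact.Id = vbo.OrderFactId
    AND InternalOrderId = o.InternalOrderId
    AND ofact.DataSetId = o.DataSetId
    AND ofact.IsDelete = 0
);
CREATE INDEX IX_Orders_Join ON #Orders (OrderControlNumber, EquitableCustomerId, DataSetId);
-- then point the INSERT INTO #VendorBOrders at #Orders instead of the CTE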
Generally when I do speed testing I check the individual parts of the SQL to see where the problem lies. Turn on the 'Execution plan' and see where a lot of the time is going. Also, if you want the quick and dirty check, highlight your CTE and run just that. Is that fast? Yes - move on.
I have at times found that a single index being off throws off a whole complex set of joins, merely by having the database do one part of something large and then find that piece.
Another idea: if you have a fast tempdb on the production environment or the like, dump your CTE to a temp table as well. Index it and see if that speeds things up. Sometimes CTEs, table variables, and temp tables lose some performance at joins. I have found that creating an index on a partial object will improve performance at times, but you are also putting more load on the tempdb to do this, so keep that in mind.

Horrible Oracle update performance

I am performing an update with a query like this:
UPDATE (SELECT h.m_id,
m.id
FROM h
INNER JOIN m
ON h.foo = m.foo)
SET m_id = id
WHERE m_id IS NULL
Some info:
Table h is roughly ~5 million rows
All rows in table h have NULL values for m_id
Table m is roughly ~500 thousand rows
m_id on table h is an indexed foreign key pointing to id on table m
id on table m is the primary key
There are indexes on m.foo and h.foo
The EXPLAIN PLAN for this query indicated a hash join and full table scans, but I'm no DBA, so I can't really interpret it very well.
The query itself ran for several hours and did not complete. I would have expected it to complete in no more than a few minutes. I've also attempted the following query rewrite:
UPDATE h
SET m_id = (SELECT id
FROM m
WHERE m.foo = h.foo)
WHERE m_id IS NULL
The EXPLAIN PLAN for this mentioned ROWID lookups and index usage, but it also went on for several hours without completing. I've also always been under the impression that queries like this would cause the subquery to be executed for every result from the outer query's predicate, so I would expect very poor performance from this rewrite anyway.
Is there anything wrong with my approach, or is my problem related to indexes, tablespace, or some other non-query-related factor?
Edit:
I'm also having abysmal performance from simple count queries like this:
SELECT COUNT(*)
FROM h
WHERE m_id IS NULL
These queries are taking anywhere from ~30 seconds to sometimes ~30 minutes(!).
I am noticing no locks, but the tablespace for these tables is sitting at 99.5% usage (only ~6MB free) right now. I've been told that this shouldn't matter as long as indexes are being used, but I don't know...
Some points:
Oracle does not index NULL values (it will index a NULL that is part of a globally non-null tuple, but that's about it).
Oracle is going for a HASH JOIN because of the size of both h and m. This is likely the best option performance-wise.
The second UPDATE might get Oracle to use indexes, but then Oracle is usually smart about merging subqueries. And it would be a worse plan anyway.
Do you have recent, reasonable statistics for your schema? Oracle really needs decent statistics.
In your execution plan, which is the first table in the HASH JOIN? For best performance it should be the smaller table (m in your case). If you don't have good cardinality statistics, Oracle will get messed up. You can force Oracle to assume fixed cardinalities with the cardinality hint, it may help Oracle get a better plan.
For example, in your first query:
UPDATE (SELECT /*+ cardinality(h 5000000) cardinality(m 500000) */
h.m_id, m.id
FROM h
INNER JOIN m
ON h.foo = m.foo)
SET m_id = id
WHERE m_id IS NULL
In Oracle, a FULL SCAN reads not only every record in the table; it basically reads all storage allocated up to the maximum ever used (the high water mark in Oracle documentation). So if you have had a lot of deleted rows, your tables might need some cleaning up. I have seen a SELECT COUNT(*) on an empty table take 30+ seconds because the table in question had something like 250 million deleted rows. If that is the case, I suggest analyzing your specific case with a DBA, so he/she can reclaim space from the deleted rows and lower the high water mark.
As far as I remember, a WHERE m_id IS NULL performs a full-table scan, since NULL values cannot be indexed.
Full-table scan means, that the engine needs to read every record in the table to evaluate the WHERE condition, and cannot use an index.
You could try to add a virtual column that is set to a not-null value when m_id IS NULL, index this column, and use it in the WHERE condition (a sketch follows).
Then you could also move the WHERE condition from the UPDATE statement into the sub-select, which will probably make the statement faster.
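A minimal sketch of that idea, assuming Oracle 11g or later (the column and index names are illustrative):
ALTER TABLE h ADD (m_id_missing NUMBER GENERATED ALWAYS AS (CASE WHEN m_id IS NULL THEN 1 END) VIRTUAL);
CREATE INDEX ix_h_m_id_missing ON h (m_id_missing);
-- the index only contains rows where m_id IS NULL, so this predicate can use it
UPDATE h
SET m_id = (SELECT m.id FROM m WHERE m.foo = h.foo)
WHERE m_id_missing = 1;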
Since JOINs are expensive, rewriting INNER JOIN m ON h.foo = m.foo as
WHERE h.foo IN (SELECT m.foo FROM m WHERE m.foo IS NOT NULL)
may also help.
For large tables, MERGE is often much faster than UPDATE. Try this (untested):
MERGE INTO h USING
(SELECT h.h_id,
m.id as new_m_id
FROM h
INNER JOIN m
ON h.foo = m.foo
WHERE h.m_id IS NULL
) new_data
ON (h.h_id = new_data.h_id)
WHEN MATCHED THEN
UPDATE SET h.m_id = new_data.new_m_id;
Try the undocumented hint /*+ BYPASS_UJVC */. If it works, add a UNIQUE/PK constraint on m.foo.
I would update the table in iterations: for example, add a condition such as WHERE h.date_created > sysdate-30, and after it finishes run the same query with the condition changed to WHERE h.date_created BETWEEN sysdate-60 AND sysdate-30, and so on. If you don't have a column like date_created, maybe there's another column you can filter by, for example: WHERE m.foo = h.foo AND m.foo BETWEEN 1 AND 10.
Only the execution plan can explain why the cost of this update is so high, but an educated guess is that both tables are very big, there are many NULL values, and there is a lot of matching (m.foo = h.foo)...

Performance issue with select query in Firebird

I have two tables, one small (~ 400 rows), one large (~ 15 million rows), and I am trying to find the records from the small table that don't have an associated entry in the large table.
I am encountering massive performance issues with the query.
The query is:
SELECT * FROM small_table WHERE NOT EXISTS
(SELECT NULL FROM large_table WHERE large_table.small_id = small_table.id)
The column large_table.small_id references small_table's id field, which is its primary key.
The query plan shows that the foreign key index is used for the large_table:
PLAN (large_table (RDB$FOREIGN70))
PLAN (small_table NATURAL)
Statistics have been recalculated for indexes on both tables.
The query takes several hours to run. Is this expected?
If so, can I rewrite the query so that it will be faster?
If not, what could be wrong?
I'm not sure about Firebird, but in other DBs a join is often faster.
SELECT *
FROM small_table st
LEFT JOIN large_table lt
ON st.id = lt.small_id
WHERE lt.small_id IS NULL
Maybe give that a try?
Another option, if you're really stuck, and depending on the situation this needs to be run in, is to take the small_id column out of the large_table, possibly into a temp table, and then do a left join / EXISTS query against that (a sketch follows).
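A rough sketch of that idea using a Firebird global temporary table (the names and the INTEGER type are illustrative):
CREATE GLOBAL TEMPORARY TABLE tmp_small_ids (small_id INTEGER) ON COMMIT PRESERVE ROWS;
INSERT INTO tmp_small_ids (small_id)
SELECT DISTINCT small_id FROM large_table;
SELECT st.*
FROM small_table st
LEFT JOIN tmp_small_ids t ON t.small_id = st.id
WHERE t.small_id IS NULL;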
If the large table only has relatively few distinct values for small_id, the following might perform better:
select *
from small_table st left outer join
(select distinct small_id
from large_table
) lt
on lt.small_id = st.id
where lt.small_id is null
In this case, performance would be better with a full scan of the large table and then index lookups in the small table -- the opposite of what it is doing now. The DISTINCT could be satisfied by just an index scan on the large table, which then uses the primary key index on the small table.