I'm using a v12 server in Azure SQL Database, and I have the following table:
CREATE TABLE [dbo].[AudienceNiches](
[Id] [bigint] IDENTITY(1,1) NOT NULL,
[WebsiteId] [nvarchar](128) NOT NULL,
[VisitorId] [nvarchar](128) NOT NULL,
[VisitDate] [datetime] NOT NULL,
[Interest] [nvarchar](50) NULL,
[Gender] [float] NULL,
[AgeFrom18To24] [float] NULL,
[AgeFrom25To34] [float] NULL,
[AgeFrom45To54] [float] NULL,
[AgeFrom55To64] [float] NULL,
[AgeFrom65Plus] [float] NULL,
[AgeFrom35To44] [float] NULL,
CONSTRAINT [PK_AudienceNiches] PRIMARY KEY CLUSTERED
(
[Id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
)
I'm executing this query: (UPDATED QUERY)
`select a.interest, count(interest) from (
select visitorid, interest
from audienceNiches
WHERE WebsiteId = @websiteid
AND VisitDate >= @startdate
AND VisitDate <= @enddate
group by visitorid, interest) as a
group by a.interest`
And I have the following indexes (all ASC):
idx_WebsiteId_VisitDate_VisitorId
idx_WebsiteId_VisitDate
idx_VisitorId
idx_Interest
The problem is that my query returns approximately 18K rows and takes 5 seconds. The whole table has 8.8M records, and if I expand the data a little the time increases a lot. What would be the best index for this query? What am I missing?
The best index for this query is a composite index on these columns, in this order:
WebsiteId
VisitDate
Interest
VisitorId
This allows the query to be answered entirely from the index. SQL Server can range scan on (WebsiteId, VisitDate), exclude null Interest values, and finally count distinct VisitorIds, all from the index. The index entries will be in the correct order to allow these operations to occur efficiently.
It's difficult for me to write SQL without having the data to test against, but see if this gives the results you're looking for with a better execution time.
SELECT interest, count(distinct visitorid)
FROM audienceNiches
WHERE WebsiteId = @websiteid
AND VisitDate between @startdate and @enddate
AND interest is not null
GROUP BY interest
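As a rough sketch, the composite index described above could be created as follows. The index name is made up for illustration, and the key order simply follows the column list given in this answer. Whether the optimizer actually picks it up should be checked against the real data.

```sql
-- Hypothetical index name; key columns in the order suggested above.
-- Every column the query touches is part of the key, so no lookup into
-- the clustered index should be needed.
CREATE NONCLUSTERED INDEX IX_AudienceNiches_WebsiteId_VisitDate_Interest_VisitorId
    ON dbo.AudienceNiches (WebsiteId, VisitDate, Interest, VisitorId);
```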
Indexes can require an almost infinite amount of understanding, but in your case I think you would see good performance gains by indexing the WebsiteId and VisitDate as separate indexes.
It's important though to make sure your indexes are in good shape. You need to maintain them by keeping statistics up to date, and rebuilding your indexes periodically.
Lastly, you should examine the query plan when tuning query performance. SQL Server will tell you if it thinks it would benefit from a column (or columns) being indexed, and also alert you to other performance related issues.
Press Ctrl+L from within Management Studio and see what's going on with the query.
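A minimal sketch of the maintenance and measurement steps mentioned above, assuming the table from the question. Rebuilding every index on a large table can be expensive, so treat this as an illustration and not as a schedule.

```sql
-- Refresh optimizer statistics for the table.
UPDATE STATISTICS dbo.AudienceNiches;

-- Rebuild all indexes on the table (this also refreshes their statistics).
ALTER INDEX ALL ON dbo.AudienceNiches REBUILD;

-- Report logical reads and CPU/elapsed time for statements run afterwards
-- in this session, so before/after comparisons are based on numbers.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
```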
Your query can be written this way: the final result set does not return the visitorid column from audienceNiches, so two levels of GROUP BY are not needed. Note that this gives the same counts only if each visitor has at most one row per interest in the date range; otherwise it counts rows, not visitors. Try this query and let me know if you still face a performance issue.
select interest, count(interest)
from audienceNiches
WHERE WebsiteId = @websiteid
AND VisitDate >= @startdate
AND VisitDate <= @enddate
group by interest
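One thing that may be worth verifying before adopting the single-level GROUP BY: the original query counts each visitor once per interest, whereas COUNT(interest) counts rows. The small self-contained sketch below uses a made-up table variable (not the real table) to show that the two can differ when a visitor has several rows for the same interest.

```sql
-- Made-up sample data, for illustration only.
DECLARE @t TABLE (VisitorId nvarchar(128), Interest nvarchar(50));

INSERT INTO @t (VisitorId, Interest)
VALUES (N'v1', N'sports'),
       (N'v1', N'sports'),   -- same visitor, same interest, second visit
       (N'v2', N'sports');

SELECT Interest,
       COUNT(Interest)           AS RowsCounted,       -- 3
       COUNT(DISTINCT VisitorId) AS DistinctVisitors   -- 2
FROM @t
GROUP BY Interest;
```

If the data guarantees one row per visitor and interest, both columns agree and the simpler form is safe.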
First off, your updated query can be effectively reduced to this:
select an.Interest, count(an.Interest)
from dbo.AudienceNiches an
where an.WebsiteId = @WebSiteId
and an.VisitDate between @startdate and @enddate
group by an.Interest;
Second, depending on the cardinality of your data, one of the following indices will provide the best possible performance:
create index IX_AudienceNiches_WebSiteId_VisitDate_Interest on dbo.AudienceNiches
(WebSiteId, VisitDate, Interest);
or
create index IX_AudienceNiches_VisitDate_WebSiteId_Interest on dbo.AudienceNiches
(VisitDate, WebSiteId, Interest);
As your data will grow, however, I think that eventually the latter one will become more efficient, on average.
P.S. Your table is severely denormalised in multiple aspects. I only hope you know what you are doing.
Related
I have a database table with about 3.5 million rows. The table holds contract data records, with an amount, a date, and some IDs related to other tables (VendorId, AgencyId, StateId), this is the database table:
CREATE TABLE [dbo].[VendorContracts]
(
[Id] [uniqueidentifier] NOT NULL,
[ContractDate] [datetime2](7) NOT NULL,
[ContractAmount] [decimal](19, 4) NULL,
[VendorId] [uniqueidentifier] NOT NULL,
[AgencyId] [uniqueidentifier] NOT NULL,
[StateId] [uniqueidentifier] NOT NULL,
[CreatedBy] [nvarchar](max) NULL,
[CreatedDate] [datetime2](7) NOT NULL,
[LastModifiedBy] [nvarchar](max) NULL,
[LastModifiedDate] [datetime2](7) NULL,
[IsActive] [bit] NOT NULL,
CONSTRAINT [PK_VendorContracts]
PRIMARY KEY CLUSTERED ([Id] ASC)
WITH (STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF,
OPTIMIZE_FOR_SEQUENTIAL_KEY = OFF) ON [PRIMARY]
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
I have a page on my site where the user can filter a paged grid by VendorId and ContractDate, and sort by the ContractAmount or ContractDate. This is the query that EF Core produces when sorting by ContractAmount for this particular vendor that has over a million records:
DECLARE @__vendorId_0 uniqueIdentifier = 'f39c7198-b05a-477e-b7bc-cb189c5944c0';
DECLARE @__startDate_1 datetime2 = '2017-01-01T07:00:00.0000000';
DECLARE @__endDate_2 datetime2 = '2018-01-02T06:59:59.0000000';
DECLARE @__p_3 int = 0;
DECLARE @__p_4 int = 50;
SELECT [v].[Id], [v].[AdminFee], [v].[ContractAmount], [v].[ContractDate], [v].[PONumber], [v].[PostalCode], [v].[AgencyId], [v].[StateId], [v].[VendorId]
FROM [VendorContracts] AS [v]
WHERE (([v].[VendorId] = @__vendorId_0) AND ([v].[ContractDate] >= @__startDate_1)) AND ([v].[ContractDate] <= @__endDate_2)
ORDER BY [v].[ContractAmount] ASC
OFFSET @__p_3 ROWS FETCH NEXT @__p_4 ROWS ONLY
When I run this, it takes 50 seconds. Whether I sort ASC or DESC, or offset by thousands of rows, it is always 50 seconds.
If I look at my execution plan, I see that it does use my index, but the Sort cost is what makes the query take so long.
This is my index:
CREATE NONCLUSTERED INDEX [IX_VendorContracts_VendorIdAndContractDate] ON [dbo].[VendorContracts]
(
[VendorId] ASC,
[ContractDate] DESC
)
INCLUDE([ContractAmount],[AdminFee],[PONumber],[PostalCode],[AgencyId],[StateId])
WITH (STATISTICS_NORECOMPUTE = OFF, DROP_EXISTING = OFF, ONLINE = OFF, OPTIMIZE_FOR_SEQUENTIAL_KEY = OFF)
The strange thing is that I have a similar index for sorting by ContractDate, and that one returns results in less than a second, even on the vendor that has millions of records.
Is there something wrong with my index? Or is sorting by a decimal data type just incredibly intensive?
You have an index that allows the VendorId = @__vendorId_0 AND ContractDate BETWEEN @__startDate_1 AND @__endDate_2 predicate to be seeked exactly.
SQL Server estimates that 6,657 rows will match this predicate and need to be sorted, so it requests a memory grant suitable for that number of rows.
In reality, for the parameter values where you see the problem, nearly half a million rows are sorted, the memory grant is insufficient, and the sort spills to disk.
50 seconds for 10,299 spilled pages does still sound unexpectedly slow but I assume you may well be on some very low SKU in Azure SQL Database?
Some possible solutions to resolve the issue might be to
Force it to use an execution plan that is compiled for parameter values with your largest vendor and wide date range (e.g. with OPTIMIZE FOR hint). This will mean an excessive memory grant for smaller vendors though which may mean other queries have to incur memory grant waits.
Use OPTION (RECOMPILE) so every invocation is recompiled for the specific parameter values passed. This means in theory every execution will get an appropriate memory grant at the cost of more time spent in compilation.
Remove the need for a sort at all. If you have an index on VendorId, ContractAmount INCLUDE (ContractDate) then the VendorId = @__vendorId_0 part can be seeked and the index read in ContractAmount order. Once 50 rows have been found that match the ContractDate BETWEEN @__startDate_1 AND @__endDate_2 predicate then query execution can stop. SQL Server might not choose this execution plan without hints though.
I'm not sure how easy or otherwise it is to apply query hints through EF but you could look at forcing a plan via query store if you manage to get the desired plan to appear there.
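The second and third options could be sketched roughly as below. The index name is hypothetical. The answer only calls for INCLUDE (ContractDate); the extra included columns are an assumption, copied from the existing index so that the query's select list stays covered. The variable names are shortened for readability and are not what EF Core generates.

```sql
-- Hypothetical index name. The key order lets SQL Server seek on VendorId
-- and read the rows already sorted by ContractAmount, so no Sort is needed.
-- Included columns beyond ContractDate are an assumption (see above).
CREATE NONCLUSTERED INDEX IX_VendorContracts_VendorIdAndContractAmount
    ON dbo.VendorContracts (VendorId ASC, ContractAmount ASC)
    INCLUDE (ContractDate, AdminFee, PONumber, PostalCode, AgencyId, StateId);

DECLARE @vendorId uniqueidentifier = 'f39c7198-b05a-477e-b7bc-cb189c5944c0';
DECLARE @startDate datetime2 = '2017-01-01T07:00:00.0000000';
DECLARE @endDate datetime2 = '2018-01-02T06:59:59.0000000';

SELECT v.Id, v.AdminFee, v.ContractAmount, v.ContractDate, v.PONumber,
       v.PostalCode, v.AgencyId, v.StateId, v.VendorId
FROM dbo.VendorContracts AS v
WHERE v.VendorId = @vendorId
  AND v.ContractDate >= @startDate
  AND v.ContractDate <= @endDate
ORDER BY v.ContractAmount ASC
OFFSET 0 ROWS FETCH NEXT 50 ROWS ONLY
OPTION (RECOMPILE);  -- compile for these specific values on every execution
```

With OPTION (RECOMPILE) the optimizer sees the actual values, so it can size the memory grant for the large vendor or pick the new index when that looks cheaper. Neither outcome is guaranteed, so the actual plan is worth checking.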
I use SQL Server 2014 Express.
I have a table consisting of information about various professional fights. I have assigned my own Fight ID to each row.
Sometimes, information about the same fight is recorded multiple times, using different Fight IDs. My goal is to identify these duplicates, and then delete them from my table.
This is the code for creating my table:
CREATE TABLE [dbo].[Fights](
[FightId] [int] IDENTITY(1,1) NOT NULL,
[LowIdFighter] [int] NOT NULL,
[HighIdFighter] [int] NOT NULL,
[LowIdFighterOutcome] [nvarchar](100) NOT NULL,
[EventName] [nvarchar](100) NOT NULL,
[EventDate] [datetime] NOT NULL,
[WinningMethod] [nvarchar](100) NOT NULL,
[Referee] [nvarchar](50) NOT NULL,
[FinishingRound] [int] NOT NULL,
[FinishingTime] [time](7) NOT NULL,
CONSTRAINT [PK_Fights] PRIMARY KEY CLUSTERED
(
[FightId] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
Here is my query:
SELECT
LowIdFighter, HighIdFighter, CAST(EventDate AS DATE),
LowIdFighterOutcome,
COUNT(*) as NumTimesSaved
INTO Duplicate_Fights
FROM Fights
GROUP BY
LowIdFighter, HighIdFighter, EventDate,
LowIdFighterOutcome
HAVING COUNT(*) > 1;
GO
My problem is this: The table Duplicate_Fights contains fights that are not actually duplicates. Fights are considered to be duplicates insofar as they share the same values in any two of the four columns (LowIdFighter, HighIdFighter, EventDate, LowIdFighterOutcome). E.g., two fights are considered to be duplicates of each other if they share the same LowIdFighter-HighIdFighter pair, even if these two fighters fought at two completely separate events, possibly with two completely different outcomes. Obviously, this is not what I want.
I want to write a query expression that returns a table of duplicate fights which share the same values in all of the four columns. I would appreciate any help on the matter. Thank you.
EDIT: Here is a screenshot of the output when I exclude the "HAVING COUNT(*) > 1" statement:
Row 149 and Row 150 are not duplicates of each other, because the dates in those two rows are different, and yet they are included in the table of duplicate fights.
Those two rows are not duplicate of each other, but both of them exist twice, i.e. fighter 45386 had two fights and both were inserted twice.
I am just trying to help.
Your query should work, as suggested by dnoeth, but make sure that the GROUP BY clause uses exactly the same expression (the column or its manipulation) as the SELECT clause, as below:
SELECT
LowIdFighter, HighIdFighter, CAST(EventDate AS DATE) AS EventDate, -- SELECT ... INTO needs a name for every column
LowIdFighterOutcome,
COUNT(*) as NumTimesSaved
INTO Duplicate_Fights
FROM Fights
GROUP BY
LowIdFighter, HighIdFighter, CAST(EventDate AS DATE),
LowIdFighterOutcome
HAVING COUNT(*) > 1;
GO
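Once the grouping is right, the surplus rows themselves can be listed with ROW_NUMBER() over the same four expressions. This is only a sketch: keeping the lowest FightId in each group is an assumption about which copy should survive, and the DELETE is left commented out so that the SELECT can be reviewed first.

```sql
WITH Ranked AS (
    SELECT FightId,
           ROW_NUMBER() OVER (
               PARTITION BY LowIdFighter, HighIdFighter,
                            CAST(EventDate AS DATE), LowIdFighterOutcome
               ORDER BY FightId            -- rn = 1 is the copy that is kept
           ) AS rn
    FROM dbo.Fights
)
-- Every row with rn > 1 is an extra copy of an earlier fight.
SELECT FightId
FROM Ranked
WHERE rn > 1;

-- After checking the list, the same CTE can drive the delete:
-- DELETE FROM Ranked WHERE rn > 1;
```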
Our database has many tables which follow an "insert only" schema. Rows are added to the end, the "current" value can then be found by finding the most recently inserted row for each logical key.
Here's an example:
CREATE TABLE [dbo].[SPOTQUOTE](
[ID] [numeric](19, 0) NOT NULL,
[QUOTETYPE] [varchar](255) NOT NULL,
[QUOTED_UTC_TS] [datetime] NOT NULL,
[QUOTED_UTC_MILLIS] [smallint] NOT NULL,
[RECEIVED_UTC_TS] [datetime] NOT NULL,
[RECEIVED_UTC_MILLIS] [smallint] NOT NULL,
[VALUE_ASK] [float] NULL,
[VALUE_BID] [float] NULL,
[FEEDITEM_ID] [numeric](19, 0) NOT NULL,
[SAMPLING] [int] NOT NULL,
CONSTRAINT [SPOTQUOTE_pk1] PRIMARY KEY CLUSTERED
(
[ID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
The logical key of this table is "feeditem_id". However, so that we can perform historical queries, rows are only ever inserted at the end of this table, using "ID" as the actual physical key.
Therefore, we know that the max(id) for each distinct feeditem_id is going to be found towards the end of the table, not the beginning.
When querying the table, we want to find the "latest" update for each "feeditem_id", which is the "logical key" for this table.
The following is the query we want:
select feeditem_id, max(id)
from spotquote
group by feeditem_id
having feeditem_id in (827, 815, 806)
so that we have the latest id for each feeditem_id.
Unfortunately, SQL Server 2008 generates a suboptimal query plan for this query.
From my understanding of SQL, the fact that this is selecting for the max id, which is the primary clustered key, implies that the optimal query plan for this is to:
Start at the end of the table
Walk backwards, keeping track of the max id encountered so far for each feeditem_id
Stop once an Id for each feeditem_id has been found
I would expect this to be extremely fast.
First question: is there some way I can explicitly tell SQL server to execute the above query plan?
I have tried:
SELECT feeditem_id, max(ID) as latest from SPOTQUOTE with (index(SPOTQUOTE_pk1)) group by feeditem_id
having FEEDITEM_ID in (827, 815, 806)
But, in practise, it seems to execute even more slowly.
I am wondering if the "clustered index scan" is walking the table forwards instead of backwards... Is there a way I can confirm if this is what is happening?
How can I confirm that this clustered index scan works from the back of the table, and how can I convince SQL server to search the clustered index backwards instead?
Update
The issue is indeed that the clustered index scan is not searching backwards when I perform a group by.
In contrast, the following SQL query produces essentially the correct query plan:
select
FEEDITEM_ID, MAX(id) from
(select top 100 * from
SPOTQUOTE
where FEEDITEM_ID in (827,815,806)
order by ID desc) s
group by feeditem_id
I can see in Management Studio that "Ordered = True" and "Scan Direction = BACKWARD",
and it executes blindingly fast - 2 milliseconds - and it "almost certainly" works.
I just want it to "stop" once it has found an entry for each feed id rather than the first 100 entries.
It's frustrating that there seems to be no way to tell SQL server to execute this obviously more efficient query.
If I do a normal "group by" with appropriate indexes on feeditem_id and id, it's faster - about 300 ms - but that's still 100 times slower than the backwards clustered index scan.
SQL Server is unable to produce such a query plan as of 2012. Rewrite the query:
SELECT ids.feeditem_id, MaxID
FROM (VALUES (827), (815), (806)) ids(feeditem_id)
CROSS APPLY (
select TOP 1 ID AS MaxID
from spotquote sq
where sq.feeditem_id = ids.feeditem_id
ORDER BY ID DESC
) x
This results in a plan that does a seek into the spotquote table per ID that you specify. This is the best we can do. SQL Server is unable to abort an aggregation as soon as all groups you are interested in have at least one value.
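For each of those seeks to be cheap, the table needs an index that leads on feeditem_id. The question mentions having such indexes already; a minimal sketch, with a hypothetical index name, might look like this:

```sql
-- Hypothetical index name. With FEEDITEM_ID leading and ID second, the
-- TOP 1 ... ORDER BY ID DESC inside the CROSS APPLY can be answered by
-- seeking to the last entry for each feeditem_id.
CREATE NONCLUSTERED INDEX IX_SPOTQUOTE_FEEDITEM_ID_ID
    ON dbo.SPOTQUOTE (FEEDITEM_ID, ID);
```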
I have a query as follows;
SELECT COUNT(Id) FROM Table
The table contains 33 million records - it contains a primary key on Id and no other indices.
The query takes 30 seconds.
The actual execution plan shows it uses a clustered index scan.
We have analysed the table and found it isn't fragmented using the first query shown in this link: http://sqlserverpedia.com/wiki/Index_Maintenance.
Any ideas as to why this query is so slow and how to fix it?
The Table Definition:
CREATE TABLE [dbo].[DbConversation](
[ConversationID] [int] IDENTITY(1,1) NOT NULL,
[ConversationGroupID] [int] NOT NULL,
[InsideIP] [uniqueidentifier] NOT NULL,
[OutsideIP] [uniqueidentifier] NOT NULL,
[ServerPort] [int] NOT NULL,
[BytesOutbound] [bigint] NOT NULL,
[BytesInbound] [bigint] NOT NULL,
[ServerOutside] [bit] NOT NULL,
[LastFlowTime] [datetime] NOT NULL,
[LastClientPort] [int] NOT NULL,
[Protocol] [tinyint] NOT NULL,
[TypeOfService] [tinyint] NOT NULL,
CONSTRAINT [PK_Conversation_1] PRIMARY KEY CLUSTERED
(
[ConversationID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
One thing I have noticed is that the database is set to grow in 1 MB chunks.
It's a live system so we are restricted in what we can play with - any ideas?
UPDATE:
OK - we've improved performance in the actual query of interest by adding new non-clustered indices on appropriate columns so it's not a critical issue anymore.
SELECT COUNT is still slow though - tried it with NOLOCK hints - no difference.
We're all thinking it's something to do with the autogrowth being set to 1 MB and not a larger number, but we're surprised it has this effect. Can MDF fragmentation on the disk be a possible cause?
Is this a frequently read/inserted/updated table? Is there update/insert activity concurrent with your select?
My guess is the delay is due to contention.
I'm able to run a count on 189m rows in 17 seconds on my dev server, but there's nothing else hitting that table.
If you aren't too worried about contention or absolute accuracy you can do:
exec sp_spaceused 'MyTableName' which will give a count based on meta-data.
If you want a more exact count but don't necessarily care if it reflect concurrent DELETE or INSERT activity you can do your current query with a NOLOCK hint:
SELECT COUNT(id) FROM MyTable WITH (NOLOCK) which will not get row-level locks for your query and should run faster.
Thoughts:
Use SELECT COUNT(*) which is correct for "how many rows" (as per ANSI SQL). Even if ID is the PK and thus not nullable, SQL Server will count ID. Not rows.
If you can live with approximate counts, then use sys.dm_db_partition_stats. See my answer here: Fastest way to count exact number of rows in a very large table?
If you can live with dirty reads use WITH (NOLOCK)
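A short sketch of the sys.dm_db_partition_stats approach mentioned above, using the table name from the question. It reads metadata only, so it returns quickly, but the figure is approximate while inserts or deletes are in flight. Querying this view normally requires the VIEW DATABASE STATE permission.

```sql
-- index_id 0 = heap, 1 = clustered index; summing covers partitioned tables.
SELECT SUM(ps.row_count) AS ApproxRowCount
FROM sys.dm_db_partition_stats AS ps
WHERE ps.object_id = OBJECT_ID(N'dbo.DbConversation')
  AND ps.index_id IN (0, 1);
```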
use [DatabaseName]
select tbl.name, dd.rows from sysindexes dd
inner join sysobjects tbl on dd.id = tbl.id where dd.indid < 2 and tbl.xtype = 'U'
select sum(dd.rows) from sysindexes dd
inner join sysobjects tbl on dd.id = tbl.id where dd.indid < 2 and tbl.xtype = 'U'
Using these queries you can fetch the row counts of all tables within 0-5 seconds.
Add a WHERE clause according to your requirements.
Another idea: when the files grow in 1 MB increments, they may become fragmented on the file system. You can't see this from SQL Server; you can see it using a disk defragmentation tool.
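To see the current growth settings from inside SQL Server, and to change the increment, something like the following could be used. The database name, the logical file name, and the 256 MB increment are placeholders, not recommendations derived from this system.

```sql
-- growth is expressed in 8 KB pages unless is_percent_growth = 1.
SELECT name, type_desc, size, growth, is_percent_growth
FROM sys.database_files;

-- Placeholder names; take the real logical file name from the query above.
ALTER DATABASE [DatabaseName]
    MODIFY FILE (NAME = N'LogicalDataFileName', FILEGROWTH = 256MB);
```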
I'm porting a process which creates a MASSIVE CROSS JOIN of two tables. The resulting table contains 15m records (looks like the process makes a 30m cross join with a 2600 row table and a 12000 row table and then does some grouping which must split it in half). The rows are relatively narrow - just 6 columns. It's been running for 5 hours with no sign of completion. I only just noticed the count discrepancy between the known good and what I would expect for the cross join, so my output doesn't have the grouping or deduping which will halve the final table - but this still seems like it's not going to complete any time soon.
First I'm going to look to eliminate this table from the process if at all possible - obviously it could be replaced by joining to both tables individually, but right now I do not have visibility into everywhere else it is used.
But given that the existing process does it (in less time, on a less powerful machine, using the FOCUS language), are there any options for improving the performance of large CROSS JOINs in SQL Server (2005) (hardware is not really an option, this box is a 64-bit 8-way with 32-GB of RAM)?
Details:
It's written this way in FOCUS (I'm trying to produce the same output, which is a CROSS JOIN in SQL):
JOIN CLEAR *
DEFINE FILE COSTCENT
WBLANK/A1 = ' ';
END
TABLE FILE COSTCENT
BY WBLANK BY CC_COSTCENT
ON TABLE HOLD AS TEMPCC FORMAT FOCUS
END
DEFINE FILE JOINGLAC
WBLANK/A1 = ' ';
END
TABLE FILE JOINGLAC
BY WBLANK BY ACCOUNT_NO BY LI_LNTM
ON TABLE HOLD AS TEMPAC FORMAT FOCUS INDEX WBLANK
JOIN CLEAR *
JOIN WBLANK IN TEMPCC TO ALL WBLANK IN TEMPAC
DEFINE FILE TEMPCC
CA_JCCAC/A16=EDIT(CC_COSTCENT)|EDIT(ACCOUNT_NO);
END
TABLE FILE TEMPCC
BY CA_JCCAC BY CC_COSTCENT AS COST CENTER BY ACCOUNT_NO
BY LI_LNTM
ON TABLE HOLD AS TEMPCCAC
END
So the required output really is a CROSS JOIN (it's joining a blank column from each side).
In SQL:
CREATE TABLE [COSTCENT](
[COST_CTR_NUM] [int] NOT NULL,
[CC_CNM] [varchar](40) NULL,
[CC_DEPT] [varchar](7) NULL,
[CC_ALSRC] [varchar](6) NULL,
[CC_HIER_CODE] [varchar](20) NULL,
CONSTRAINT [PK_LOOKUP_GL_COST_CTR] PRIMARY KEY NONCLUSTERED
(
[ID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY
= OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
CREATE TABLE [JOINGLAC](
[ACCOUNT_NO] [int] NULL,
[LI_LNTM] [int] NULL,
[PR_PRODUCT] [varchar](5) NULL,
[PR_GROUP] [varchar](1) NULL,
[AC_NAME_LONG] [varchar](40) NULL,
[LI_NM_LONG] [varchar](30) NULL,
[LI_INC] [int] NULL,
[LI_MULT] [int] NULL,
[LI_ANLZ] [int] NULL,
[LI_TYPE] [varchar](2) NULL,
[PR_SORT] [varchar](2) NULL,
[PR_NM] [varchar](26) NULL,
[PZ_SORT] [varchar](2) NULL,
[PZNAME] [varchar](26) NULL,
[WANLZ] [varchar](3) NULL,
[OPMLNTM] [int] NULL,
[PS_GROUP] [varchar](5) NULL,
[PS_SORT] [varchar](2) NULL,
[PS_NAME] [varchar](26) NULL,
[PT_GROUP] [varchar](5) NULL,
[PT_SORT] [varchar](2) NULL,
[PT_NAME] [varchar](26) NULL
) ON [PRIMARY]
CREATE TABLE [JOINCCAC](
[CA_JCCAC] [varchar](16) NOT NULL,
[CA_COSTCENT] [int] NOT NULL,
[CA_GLACCOUNT] [int] NOT NULL,
[CA_LNTM] [int] NOT NULL,
[CA_UNIT] [varchar](6) NOT NULL,
CONSTRAINT [PK_JOINCCAC_KNOWN_GOOD] PRIMARY KEY CLUSTERED
(
[CA_JCCAC] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY
= OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
With the SQL Code:
INSERT INTO [JOINCCAC]
(
[CA_JCCAC]
,[CA_COSTCENT]
,[CA_GLACCOUNT]
,[CA_LNTM]
,[CA_UNIT]
)
SELECT Util.PADLEFT(CONVERT(varchar, CC.COST_CTR_NUM), '0', 7)
+ Util.PADLEFT(CONVERT(varchar, GL.ACCOUNT_NO), '0', 9) AS CC_JCCAC
,CC.COST_CTR_NUM AS CA_COSTCENT
,GL.ACCOUNT_NO % 900000000 AS CA_GLACCOUNT
,GL.LI_LNTM AS CA_LNTM
,udf_BUPDEF(GL.ACCOUNT_NO, CC.COST_CTR_NUM, GL.LI_LNTM, 'N') AS CA_UNIT
FROM JOINGLAC AS GL
CROSS JOIN COSTCENT AS CC
Depending on how this table is subsequently used, it should be able to be eliminated from the process, by simply joining to both the original tables used to build it. However, this is an extremely large porting effort, and I might not find the usage of the table for some time, so I was wondering if there were any tricks to CROSS JOINing big tables like that in a timely fashion (especially given that the existing process in FOCUS is able to do it more speedily). That way I could validate the correctness of my building of the replacement query and then later factor it out with views or whatever.
I am also considering factoring out the UDFs and string manipulation and performing the CROSS JOIN first to break the process up a bit.
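A possible shape for that split, as a sketch only: first materialise the bare cross join, then apply the string padding and UDFs in a second statement. #CrossStage is a hypothetical temp table name, and the function calls are copied as written in the query above.

```sql
-- Step 1: materialise the plain cross join, with no function calls.
SELECT CC.COST_CTR_NUM, GL.ACCOUNT_NO, GL.LI_LNTM
INTO #CrossStage
FROM JOINGLAC AS GL
CROSS JOIN COSTCENT AS CC;

-- Step 2: apply the padding and UDFs while loading the target table.
INSERT INTO [JOINCCAC]
    ([CA_JCCAC], [CA_COSTCENT], [CA_GLACCOUNT], [CA_LNTM], [CA_UNIT])
SELECT Util.PADLEFT(CONVERT(varchar, S.COST_CTR_NUM), '0', 7)
     + Util.PADLEFT(CONVERT(varchar, S.ACCOUNT_NO), '0', 9),
       S.COST_CTR_NUM,
       S.ACCOUNT_NO % 900000000,
       S.LI_LNTM,
       udf_BUPDEF(S.ACCOUNT_NO, S.COST_CTR_NUM, S.LI_LNTM, 'N')
FROM #CrossStage AS S;
```

Timing the two steps separately should show whether the join itself or the per-row function calls account for most of the run time.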
RESULTS SO FAR:
It turns out that the UDFs do contribute a lot (negatively) to the performance. But there also appears to be a big difference between a 15m row cross join and a 30m row cross join. I do not have SHOWPLAN rights (boo hoo), so I can't tell whether the plan it is using is better or worse after changing indexes. I have not refactored it yet, but am expecting the entire table to go away shortly.
Examining that query shows only one column used from one table, and only two columns used from the other table. Due to the very low numbers of columns used, this query can be easily enhanced with covering indexes:
CREATE INDEX COSTCENTCoverCross ON COSTCENT(COST_CTR_NUM)
CREATE INDEX JOINGLACCoverCross ON JOINGLAC(ACCOUNT_NO, LI_LNTM)
Here are my questions for further optimization:
When you put the query in query analyzer and whack the "show estimated execution plan" button, it will show a graphical representation of what it's going to do.
Join Type: There should be a nested loop join in there. (the other options are merge join and hash join). If you see nested loop, then ok. If you see merge join or hash join, let us know.
Order of table access: Go all the way to the top and scroll all the way to the right. The first step should be accessing a table. Which table is that and what method is used(index scan, clustered index scan)? What method is used to access the other table?
Parallelism: You should see the little jaggedy arrows on almost all icons in the plan indicating that parallelism is being used. If you don't see this, there is a major problem!
That udf_BUPDEF concerns me. Does it read from additional tables? Util.PADLEFT concerns me less, but still.. what is it? If it isn't a Database Object, then consider using this instead:
RIGHT('z00000000000000000000000000' + columnName, 7)
Are there any triggers on JOINCCAC? How about indexes? With an insert this large, you'll want to drop all triggers and indexes on that table.
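To illustrate the padding suggestion above with built-in functions only: the sketch below assumes that Util.PADLEFT(value, '0', n) left-pads with zeros to a width of n, which is a guess from its name and arguments. The sample values are made up, and the syntax is kept compatible with SQL Server 2005.

```sql
DECLARE @costCentre int, @accountNo int;   -- made-up sample values
SET @costCentre = 123;
SET @accountNo = 4567;

-- Prepend enough zeros, then keep the rightmost 7 (or 9) characters.
SELECT RIGHT('0000000'   + CONVERT(varchar(7), @costCentre), 7)
     + RIGHT('000000000' + CONVERT(varchar(9), @accountNo), 9) AS CA_JCCAC;
-- returns '0000123000004567'
```

Values wider than the target width would need separate handling, since CONVERT to a too-short varchar does not return the digits.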
Continuing on what others are saying, DB functions that contain queries and are used in a select have always made my queries extremely slow. Off the top of my head, I believe I had a query run in 45 seconds; I removed the function, and the result came back in 0 seconds :)
So check udf_BUPDEF is not doing any queries.
Break down the query to make it a plain simple cross join.
SELECT CC.COST_CTR_NUM, GL.ACCOUNT_NO
,CC.COST_CTR_NUM AS CA_COSTCENT
,GL.ACCOUNT_NO AS CA_GLACCOUNT
,GL.LI_LNTM AS CA_LNTM
-- I don't know what BUPDEF is doing, but remove it from the query for the time being
-- ,udf_BUPDEF(GL.ACCOUNT_NO, CC.COST_CTR_NUM, GL.LI_LNTM, 'N') AS CA_UNIT
FROM JOINGLAC AS GL
CROSS JOIN COSTCENT AS CC
See how well the simple cross join performs (without any functions applied to it).