Fine tuning update query - sql

I am developing a batch process in PeopleSoft Application Engine.
I have inserted data into a staging table from the JOB table.
There are 120,596 employees in total whose data have to be processed in the development environment.
In the testing environment, the number of rows to be processed is 249,047.
There is also a lot of non-job data that has to be sent for each employee.
My design is to write individual UPDATE statements to populate the staging table, then select the data from the staging table and write it to a file.
The update is taking too much time, and I would like to know a technique to tune it.
I have searched around and even tried using /* +Append */ in the update query, but it throws an "SQL command not ended" error.
Also, my update query has to handle NULL values with NVL.
Is there any way to share the code here on Stack Overflow? These are INSERT and UPDATE statements written in PeopleSoft Application Engine actions, and it would help if people could have a look at them.
Kindly suggest a technique; my goal is to finish the execution within 5-10 minutes.
My update statement:
I have figured out the cause. It is this UPDATE statement:
UPDATE %Table(AZ_GEN_TMP)
SET AZ_HR_MANAGER_ID = NVL(
    (SELECT e.emplid
     FROM PS_EMAIL_ADDRESSES e
     WHERE UPPER(SUBSTR(e.email_addr, 0, INSTR(e.email_addr, '#') - 1)) =
           (SELECT c.contact_oprid
            FROM ps_az_can_employee c
            WHERE c.emplid = %Table(AZ_GEN_TMP).EMPLID
              AND c.rolename = 'HRBusinessPartner'
              AND c.seqnum = (SELECT MAX(c1.seqnum)
                              FROM ps_az_can_employee c1
                              WHERE c1.emplid = c.emplid
                                AND c1.rolename = c.rolename))
       AND e.e_addr_type = 'PINT'), ' ')
To tune this, I am inserting the value of contact_oprid into my staging table, using a hint:
SELECT /*+ ALL_ROWS */ c.contact_oprid
FROM ps_az_can_employee c
WHERE c.emplid = %Table(AZ_GEN_TMP).EMPLID
  AND c.rolename = 'HRBusinessPartner'
  AND c.seqnum = (SELECT MAX(c1.seqnum)
                  FROM ps_az_can_employee c1
                  WHERE c1.emplid = c.emplid
                    AND c1.rolename = c.rolename)
and then doing an update on the staging table:
UPDATE staging_table
SET AZ_HR_MANAGER_ID = NVL(
    (SELECT e.emplid
     FROM PS_EMAILtable e
     WHERE UPPER(REGEXP_SUBSTR(e.email_addr, '[^#]+', 1, 1)) = staging_table.CONTACT_OPRID
       AND e.e_addr_type = 'PINT'), ' ')
/
This takes 5 hours, as it has to process about 200,000 rows of data.
Is there any way to speed up the processing, for example using hints or indexes?
Also, if I leave this update out, the processing of the other values is very fast and finishes in 10 minutes.
Kindly help me with this.
Thanks.

I have resolved this by using an Oracle MERGE INTO statement, and the process now takes 10 minutes to execute, including the file-writing operation. Thanks all for your help and suggestions.
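For anyone interested, a rough sketch of the shape the MERGE took; the source query and column names here are assumed from the staging update above, not copied from the actual Application Engine step:
-- Sketch only: assumes one 'PINT' address per contact_oprid, otherwise
-- MERGE will complain about non-deterministic source rows (ORA-30926).
MERGE INTO staging_table stg
USING (SELECT e.emplid,
              UPPER(REGEXP_SUBSTR(e.email_addr, '[^#]+', 1, 1)) AS contact_oprid
       FROM PS_EMAIL_ADDRESSES e
       WHERE e.e_addr_type = 'PINT') src
ON (stg.CONTACT_OPRID = src.contact_oprid)
WHEN MATCHED THEN
    UPDATE SET stg.AZ_HR_MANAGER_ID = src.emplid;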

Related

How do I make an offset in this SQL Server 2000 query?

I want to page through the results in offsets: records 0 to 10000, then 10000 to 20000, and so on. How do I modify this query to add an offset? Also, how can I improve this query's performance?
SELECT
CASE
WHEN c.DataHoraUltimaAtualizacaoILR >= e.DataHoraUltimaAtualizacaoILR AND c.DataHoraUltimaAtualizacaoILR >= t.DataHoraUltimaAtualizacaoILR THEN c.DataHoraUltimaAtualizacaoILR
WHEN e.DataHoraUltimaAtualizacaoILR >= c.DataHoraUltimaAtualizacaoILR AND e.DataHoraUltimaAtualizacaoILR >= t.DataHoraUltimaAtualizacaoILR THEN e.DataHoraUltimaAtualizacaoILR
WHEN t.DataHoraUltimaAtualizacaoILR >= c.DataHoraUltimaAtualizacaoILR AND t.DataHoraUltimaAtualizacaoILR >= e.DataHoraUltimaAtualizacaoILR THEN t.DataHoraUltimaAtualizacaoILR
ELSE c.DataHoraUltimaAtualizacaoILR
END AS 'updated_at',
p.Email,
c.ID_Cliente,
p.Nome,
p.DataHoraCadastro,
p.Sexo,
p.EstadoCivil,
p.DataNascimento,
getdate() as [today],
datediff (yy,p.DataNascimento,getdate()) as 'Idade',
datepart(month,p.DataNascimento) as 'MesAniversario',
e.Bairro,
e.Cidade,
e.UF,
c.CodLoja as codloja_cadastro,
t.DDD,
t.Numero
FROM
PessoaFisica p
LEFT JOIN
Cliente c ON (c.ID_Pessoa = p.ID_PessoaFisica)
LEFT JOIN
Loja l ON (CAST(l.CodLoja AS integer) = CAST(c.CodLoja AS integer))
LEFT JOIN
PessoaEndereco pe ON (pe.ID_Pessoa = p.ID_PessoaFisica)
LEFT JOIN
Endereco e ON (e.ID_Endereco = pe.ID_Endereco)
LEFT JOIN
PessoaTelefone pt ON (pt.ID_Pessoa = p.ID_PessoaFisica)
LEFT JOIN
Telefone t ON (t.ID_Telefone = pt.ID_Telefone)
WHERE
p.Email IS NOT NULL
AND p.Email <> ''
--and p.Email = 'aline.salles#uol.com.br'
GROUP BY
p.Email, c.ID_Cliente, p.Nome, p.EstadoCivil, p.DataHoraCadastro,
c.CodLoja, p.Sexo, e.Bairro, p.DataNascimento, e.Cidade, e.UF,
t.DDD, t.Numero, c.DataHoraUltimaAtualizacaoILR, e.DataHoraUltimaAtualizacaoILR,
t.DataHoraUltimaAtualizacaoILR
ORDER BY
updated_at DESC
Overall Process
If you have access to a more modern SQL Server version, then you could set up a process to copy the raw data to a new database on a daily basis. This might initially be an exact copy of the source database, just for staging the data. Then build a transformation process, using stored procedures or perhaps SSIS for high performance. That process would transform your data into your desired end state and load it into the final database.
The copy process could be replication, but if your staging database is SQL Server 2005 or above, then you could also build a simple SSIS job to perform the copy. Run that job as a scheduled task (SQL Agent) on a daily basis. You could combine the two - load data, then transform - but if using SSIS, then I recommend keeping these as separate SSIS packages, which will help with debugging problems. In the scheduled task you could run the two packages back-to-back.
Performance
You'll need good indexing on the tables, but indexing alone is not sufficient. Casting CodLoja as an integer will prevent you from using indexes on that field. If you need to store those values as strings for some other reason, then consider adding computed columns:
ALTER TABLE xyz Add CodLojaAsInt as (CAST(CodLoja as int))
Then place an index on that new computed column. The problem is that any function call in an ON or WHERE clause will cause SQL Server to scan the entire table and convert every single row, instead of seeking into an index.
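For example, roughly like this (object names are placeholders matching the ALTER TABLE above, and the same treatment is assumed for the other table in the join):
-- CAST to int is deterministic and precise, so the computed column can be indexed directly.
CREATE INDEX IX_xyz_CodLojaAsInt ON xyz (CodLojaAsInt);
-- The join can then seek on the computed columns instead of casting inline, e.g.
-- LEFT JOIN Loja l ON l.CodLojaAsInt = c.CodLojaAsInt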
After searching and looking over my problem again, @sfuqua helped me with this solution. Basically I'll create some better-organized tables in my local DB, pull all the abstract/ugly data from the remote DB, and process it locally into the new tables.
I'm going to use Elasticsearch to speed up the indexing and queries.
It sounds like you're trying to emulate MySQL's SELECT ... LIMIT X,Y feature. SQL Server doesn't have that. In SQL Server 2005+, you can use ROW_NUMBER() in a subquery. Since you're on 2000, however, you're going to have to do it one of the hard ways.
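For reference, a simplified sketch of the ROW_NUMBER() approach (2005+ only), paging over just PessoaFisica rather than the full query above; the page bounds and sort column are illustrative:
DECLARE @StartRow int, @PageSize int;
SET @StartRow = 10000;
SET @PageSize = 10000;

SELECT Email, Nome, DataHoraCadastro
FROM (
    -- Number every row once, then keep only the requested page.
    SELECT p.Email, p.Nome, p.DataHoraCadastro,
           ROW_NUMBER() OVER (ORDER BY p.DataHoraCadastro DESC) AS rn
    FROM PessoaFisica AS p
) AS numbered
WHERE rn > @StartRow
  AND rn <= @StartRow + @PageSize
ORDER BY rn;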
The way I've always done it is like this:
SELECT ... FROM Table WHERE PK IN
    (SELECT TOP @PageSize PK FROM Table WHERE PK NOT IN
        (SELECT TOP @StartRow PK FROM Table ORDER BY SortColumn)
     ORDER BY SortColumn)
ORDER BY SortColumn
I recommend rewriting it to use EXISTS instead of IN, though, and seeing which works better. You'll have to use EXISTS if you have a compound primary key.
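A rough EXISTS-based sketch of the same paging idea (MyTable, PK, and SortColumn are placeholders, and the TOP values are literals because SQL Server 2000 cannot take a variable in TOP without dynamic SQL):
SELECT t.*
FROM MyTable AS t
WHERE EXISTS (
    SELECT 1
    FROM (SELECT TOP 10000 PK                       -- page size
          FROM MyTable AS pg
          WHERE NOT EXISTS (SELECT 1
                            FROM (SELECT TOP 10000 PK   -- rows to skip
                                  FROM MyTable
                                  ORDER BY SortColumn) AS skipped
                            WHERE skipped.PK = pg.PK)
          ORDER BY SortColumn) AS pageRows
    WHERE pageRows.PK = t.PK
)
ORDER BY t.SortColumn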
That code and the other solutions are here.

Stored procedure SQL optimization

I'm running a query and it's taking a long time. Here is my sample code:
SELECT @AppleCount = COUNT(*)
FROM (
    SELECT * FROM @iDToStoreMapping sm
    WHERE StoreFront = 73
      AND sm.CategoryCountryCategoryTYpeMappingID NOT IN
          (SELECT * FROM @FinishedDls)
) rows
@AppleCount is supposed to count all the CategoryCountryCategoryTypeMappingIDs in existence, and @FinishedDls holds an id once an application has finished its download and written that id in there, so this query is supposed to get the count of the ids which haven't been downloaded yet. There are about 50k ids, and I have to run this query 3 times, but each one takes a couple of minutes. Is there anything I'm doing wrong?
Sometimes using an explicit join instead of not in results in better performance:
SELECT @AppleCount = COUNT(*)
FROM @iDToStoreMapping sm left outer join
     @FinishedDls fd
     on sm.CategoryCountryCategoryTYpeMappingID = fd.id
WHERE StoreFront = 73 and
      fd.id is null;
I didn't use a primary key on my table variables, and that is what caused the terrible performance. Sorry, everyone.
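For reference, a minimal sketch of what the declarations with primary keys could look like (column names and types are assumptions based on the query above):
-- The PRIMARY KEY gives each table variable a unique index the join/NOT IN can use.
DECLARE @FinishedDls TABLE (
    id int NOT NULL PRIMARY KEY
);

DECLARE @iDToStoreMapping TABLE (
    CategoryCountryCategoryTYpeMappingID int NOT NULL PRIMARY KEY,
    StoreFront int NOT NULL
);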

Can this SQL Query be optimized to run faster?

I have a SQL query (for SQL Server 2008 R2) that takes a very long time to complete. Is there a better way of doing it?
SELECT @count = COUNT(Name)
FROM Table1 t
WHERE t.Name = @name AND t.Code NOT IN (SELECT Code FROM ExcludedCodes)
Table1 has around 90 million rows in it and is indexed by Name and Code.
ExcludedCodes only has around 30 rows in it.
This query is in a stored procedure that gets called around 40k times; the total time it takes the procedure to finish is 27 minutes. I believe this is my biggest bottleneck because of the massive number of rows it queries against and the number of times it does so.
So if you know of a good way to optimize this, it would be greatly appreciated! If it cannot be optimized, then I guess I'm stuck with 27 min...
EDIT
I changed the NOT IN to NOT EXISTS and it cut the time down to 10:59, so that alone is a massive gain on my part. I am still going to attempt the GROUP BY approach suggested below, but that will require a complete rewrite of the stored procedure and might take some time... (As I said before, I'm not the best at SQL, but it is starting to grow on me. ^^)
In addition to workarounds to get the query itself to respond faster, have you considered maintaining a column in the table that tells whether it is in this set or not? It requires a lot of maintenance but if the ExcludedCodes table does not change often, it might be better to do that maintenance. For example you could add a BIT column:
ALTER TABLE dbo.Table1 ADD IsExcluded BIT;
Make it NOT NULL and default to 0. Then you could create a filtered index:
CREATE INDEX n ON dbo.Table1(name)
WHERE IsExcluded = 0;
Now you just have to update the table once:
UPDATE t
SET IsExcluded = 1
FROM dbo.Table1 AS t
INNER JOIN dbo.ExcludedCodes AS x
ON t.Code = x.Code;
And on an ongoing basis you'd have to maintain this with triggers on both tables; a rough sketch of one follows the query below. With this in place, your query becomes:
SELECT @Count = COUNT(Name)
FROM dbo.Table1 WHERE IsExcluded = 0;
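A rough sketch of the insert trigger on ExcludedCodes (names are assumptions; the delete case and the Table1 side would need their own triggers):
CREATE TRIGGER dbo.trExcludedCodes_Insert
ON dbo.ExcludedCodes
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;
    -- Flag any rows whose Code was just added to the exclusion list.
    UPDATE t
       SET IsExcluded = 1
      FROM dbo.Table1 AS t
      INNER JOIN inserted AS i
        ON t.Code = i.Code;
END;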
EDIT
As for "NOT IN being slower than LEFT JOIN" here is a simple test I performed on only a few thousand rows:
EDIT 2
I'm not sure why this query wouldn't do what you're after, and be far more efficient than your 40K loop:
SELECT src.Name, COUNT(*)
FROM dbo.Table1 AS src
INNER JOIN #temptable AS t
ON src.Name = t.Name
WHERE src.Code NOT IN (SELECT Code FROM dbo.ExcludedCodes)
GROUP BY src.Name;
Or the LEFT JOIN equivalent:
SELECT src.Name, COUNT(*)
FROM dbo.Table1 AS src
INNER JOIN #temptable AS t
ON src.Name = t.Name
LEFT OUTER JOIN dbo.ExcludedCodes AS x
ON src.Code = x.Code
WHERE x.Code IS NULL
GROUP BY src.Name;
I would put money on either of those queries taking less than 27 minutes. I would even suggest that running both queries sequentially will be far faster than your one query that takes 27 minutes.
Finally, you might consider an indexed view. I don't know your table structure and whether your violate any of the restrictions but it is worth investigating IMHO.
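If the IsExcluded flag above is in place, an indexed view might look roughly like this (an untested sketch; indexed views require SCHEMABINDING, COUNT_BIG, and satisfy a number of other restrictions):
CREATE VIEW dbo.vIncludedNameCounts
WITH SCHEMABINDING
AS
SELECT Name, COUNT_BIG(*) AS NameCount
FROM dbo.Table1
WHERE IsExcluded = 0
GROUP BY Name;
GO

-- Materializes the per-name counts so reads become index seeks.
CREATE UNIQUE CLUSTERED INDEX IX_vIncludedNameCounts
    ON dbo.vIncludedNameCounts(Name);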
You say this gets called around 40K times. Why? Is it in a cursor? If so, do you really need a cursor? Couldn't you put the values you want for @name in a temp table, index it, and then join to it?
select t.name, count(t.name)
from Table1 t
join #name n on t.name = n.name
where NOT EXISTS (SELECT Code FROM ExcludedCodes WHERE Code = t.code)
group by t.name
That might get you all your results in one query and is almost certainly faster than 40K separate queries. Of course, if you need the count for all the names, it's even simpler:
select t.name, count(t.name)
from Table1 t
where NOT EXISTS (SELECT Code FROM ExcludedCodes WHERE Code = t.code)
group by t.name
NOT EXISTS typically performs better than NOT IN, but you should test it on your system.
SELECT @count = COUNT(Name)
FROM Table1 t
WHERE t.Name = @name AND NOT EXISTS (SELECT 1 FROM ExcludedCodes e WHERE e.Code = t.Code)
Without knowing more about your query it's tough to supply concrete optimization suggestions (i.e. code suitable for copy/paste). Does it really need to run 40,000 times? Sounds like your stored procedure needs reworking, if that's feasible. You could exec the above once at the start of the proc and insert the results in a temp table, which can keep the indexes from Table1, and then join on that instead of running this query.
This particular bit might not even be the bottleneck that makes your query run 27 minutes. For example, are you using a cursor over those 90 million rows, or scalar valued UDFs in your WHERE clauses?
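As a sketch of that temp-table idea (object names are assumptions; @name and @count are the procedure's existing parameter and variable):
-- Run once at the top of the procedure.
SELECT t.Name, COUNT(*) AS NameCount
INTO #NameCounts
FROM dbo.Table1 AS t
WHERE NOT EXISTS (SELECT 1 FROM dbo.ExcludedCodes AS x WHERE x.Code = t.Code)
GROUP BY t.Name;

CREATE CLUSTERED INDEX IX_NameCounts ON #NameCounts(Name);

-- Each of the 40K lookups then becomes a cheap seek:
SELECT @count = NameCount FROM #NameCounts WHERE Name = @name;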
Have you thought about running the query once and populating a table variable or temp table with the data? Something like:
insert into #temp (name, NameCount)
select Name, Count(name)
from Table1
where Code not in (select Code from ExcludedCodes)
group by Name
And don't forget that you could possibly use a filtered index as long as the excluded codes table is somewhat static.
Start evaluating the execution plan. Which is the heaviest part to compute?
Regarding the relation between the two tables, use a JOIN on indexed columns: indexes will optimize query execution.

SQL queries slow when running in sequence, but quick when running separately

I have a table that I will populate with values from an expensive calculation (with XQuery from an immutable XML column). To speed up deployment to production I have precalculated the values on a test server and saved them to a file with BCP.
My script is as follows:
-- Lots of other work, including modifying OtherTable
CREATE TABLE FOO (...)
GO
BULK INSERT FOO
FROM 'C:\foo.dat';
GO
-- rerun from here after the break
INSERT INTO FOO
(ID, TotalQuantity)
SELECT
e.ID,
SUM(e.Quantity) as TotalQuantity
FROM (select
o.ID,
h.n.value('TotalQuantity[1]/.', 'int') as TotalQuantity
FROM dbo.OtherTable o
CROSS APPLY XmlColumn.nodes('(item/.../salesorder/)') h(n)
WHERE o.ID NOT IN (SELECT DISTINCT ID FROM FOO)
) as E
GROUP BY e.ID
When I run the script in Management Studio, the first two statements complete within seconds, but the last statement takes 4 hours to complete. Since no rows have been added to OtherTable since my foo.dat was computed, Management Studio reports (0 row(s) affected).
If I cancel the query execution after a couple of minutes, select just the last query, and run it separately, it completes within 5 seconds.
Notable facts:
The OtherTable contains 200k rows and the data in XmlColumn is pretty large, total table size ~3GB
The FOO table gets 1.3M rows
What could possibly make the difference?
Management Studio has implicit transactions turned off. As far as I can understand, each statement will then run in its own transaction.
Update:
If I first select and run the script until -- rerun from here after the break, then select and run just the last query, it is still slow until I cancel execution and try again. This at least rules out any effects of running "together" with the previous code in the script and boils down to the same query being slow on first execution and fast on the second (running with all other conditions the same).
Probably different execution plans. See Slow in the Application, Fast in SSMS? Understanding Performance Mysteries.
Could it possibly be related to the statistics being completely wrong on the newly created Foo table? If SQL Server automatically updates the statistics when it first runs the query, the second run would have its execution plan created from up-to-date statistics.
What if you check the statistics right after the bulk insert (with the STATS_DATE function) and then check them again after having cancelled the long-running query? Did the stats get updated, even though the query was cancelled?
In that case, an UPDATE STATISTICS on Foo right after the bulk insert could help.
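Concretely, the check and the fix would be something along these lines (a sketch assuming SQL Server 2005 or later; FOO is the table created in the script above):
-- When was each statistic on FOO last updated?
SELECT s.name AS stats_name,
       STATS_DATE(s.object_id, s.stats_id) AS last_updated
FROM sys.stats AS s
WHERE s.object_id = OBJECT_ID('FOO');

-- Refresh the statistics right after the BULK INSERT.
UPDATE STATISTICS FOO WITH FULLSCAN;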
Not sure exactly why it helped, but I rewrote the last query to use a left outer join instead, and suddenly the execution time dropped to 15 milliseconds.
INSERT INTO FOO
(ID, TotalQuantity)
SELECT
e.ID,
SUM(e.Quantity) as TotalQuantity
FROM (select
o.ID,
h.n.value('TotalQuantity[1]/.', 'int') as TotalQuantity
FROM dbo.OtherTable o
LEFT OUTER JOIN FOO f ON o.ID = f.ID
CROSS APPLY o.XmlColumn.nodes('(item/.../salesorder/)') h(n)
WHERE f.ID IS NULL
) as E
GROUP BY e.ID

How to perform multiple SQL tasks when using SQL within code (in this case vbscript)

I am hitting a brick wall with something I'm trying to do.
I'm trying to perform a complex query and return the results to a VBScript (.vbs) recordset.
In order to speed up the query I create temporary tables and then use those tables in the main query (this gives a speed boost of around 1200% over just using subqueries).
The problem is that the calling code seems to ignore the main query, only 'seeing' the result of the very first command (i.e. it returns a 'records affected' figure).
For example, given a query like this..
delete from temp
select * into temp from sometable where somefield = somefilter
select sum(someotherfield) from yetanothertable where account in (select * from temp)
The calling code only seems to 'see' the returned result of the delete from temp; I can't access the data that the third command returns.
(Obviously the SQL above is pseudo/fake; the real query is large and its content is not relevant to the question. I need to solve this problem, because without the temporary table the query goes from taking 3 seconds to 6 minutes!)
Edit: I know I could get around this by making multiple calls to ADODB.Connection's Execute (one call to empty the temp tables, one to create them again, and finally one to get the data), but I'd rather find a more elegant way of doing it.
Edit 2: Below is the actual SQL code I've ended up with, added for the curiosity of people who have replied. It doesn't use NOCOUNT, as I'd already settled on a solution which works for me. It is also probably badly written; it evolved over time from something more basic. I could probably improve it myself, but as it works and returns data extremely quickly I have stuck with it (for now).
Here's the SQL.
Here's the code where it's called. My chosen solution is to run the first query into a third temp table, then run a SELECT * on that table from the code, and then a DELETE FROM on it from the code...
I make no claims about being a 'good' SQL scripter (self-taught via necessity mostly), and the database is not very well designed (a mix of old and new tables; the old tables are not relational and contain numerical values and date values stored as strings).
Here is the original (slow) query...
select
name,
program_name,
sum(handle) + sum(refund) as [Total Sales],
sum(refund) as Refunds,
sum(handle) as [Net Sales],
sum(credit - refund) as Payout,
cast(sum(comm) as money) as commission
from
(select accountnumber,program_name,
cast(credit_amount as money) as credit,cast(refund_amt as money) as refund,handle, handle * (
(select commission from amtotecommissions
where _date = a._date
and pool_type = (case when a.pool_type in ('WP','WS','PS','WPS') then 'WN' else a.pool_type end)
and program_name = a.program_name) / 100) as comm
from amtoteaccountactivity a where _date = '#yy/#mm/#dd' and transaction_type = 'Bet'
and accountnumber not in ('5067788','5096272') /*just to speed the query up a bit. I know these accounts aren't included*/
) a,
ews_db.dbo.amtotetrack t
where (a.accountnumber in (select accountno from ews_db.dbo.get_all_customers where country = 'US')
or a.accountnumber in ('5122483','5092147'))
and t.our_code = a.program_name collate database_default
and t.tracktype = 2
group by name,program_name
I suspect that with the right SQL and indexes you should be able to get equal performance with a single SELECT, however there isn't enough information in the original question to be able to give guidance on that.
I think you'll be best off doing this as a stored procedure and calling that.
CREATE PROCEDURE get_Count
    @somefilter int
AS
delete from temp;
select * into temp from sometable where somefield = @somefilter;
select sum(someotherfield) from yetanothertable
where account in (select * from temp);
However, an example that avoids the IN (the way you're using it) via a JOIN will probably fix the performance issue. Use EXPLAIN SELECT to see what's going on and optimise from there. For example, the following
select sum(transactions.value) from transactions
inner join user on transactions.user=user.id where user.name='Some User'
is much quicker than
select sum(transactions.value) from transactions
where user in (SELECT id from user where user.name='Some User')
because the amount of rows scanned in the second example will be the entire table, whereas in the first the indexes can be used.
Rev1
Taking the slow SQL posted, it appears that there are full table scans going on where the SQL uses WHERE ... IN, e.g.
where (a.accountnumber in (select accountno from ews_db.dbo.get_all_customers))
The above will pull in lots of records which may not be required. This, together with the other nested table selects, is not allowing the optimiser to pull in only the records that match, as would be the case when using JOIN at the outer level.
When building these type of complex queries I generally start with the inner detail, because we need to have the inner detail so we can perform joins and aggregate operations.
What I mean by this is if you have a typical DB with customers that have orders that create transactions that contain items then I would start with the items and pull in the rest of the detail with joins.
By way of example only I suggest building the query more like the following:
select name,
       program_name,
       SUM(handle) + SUM(refund) AS [Total Sales],
       SUM(refund) AS Refunds,
       SUM(handle) AS [Net Sales],
       SUM(credit - refund) AS Payout,
       CAST(SUM(comm) AS money) AS commission
FROM ews_db.dbo.get_all_customers AS cu
INNER JOIN amtoteaccountactivity AS a ON a.accountnumber = cu.accountno
INNER JOIN ews_db.dbo.amtotetrack AS track ON track.our_code = a.program_name
INNER JOIN amtotecommissions AS co ON co.program_name = a.program_name
WHERE cu.country = 'US'
AND track.tracktype = 2
AND a.transaction_type = 'Bet'
AND a._date = '#yy/#mm/#dd'
AND a.program_name = co.program_name
AND co.pool_type = (case when a.pool_type in ('WP','WS','PS','WPS') then 'WN' else a.pool_type end)
GROUP BY name, program_name, co.commission
NOTE: The above is not functional and is for illustration purposes. I'd need to have the database online to build the real query. I'm hoping you'll get the general idea and build from there.
My top tip for complex queries that don't work is simply to completely start again throwing away what you've already got. Sometimes I will do this three or four times when building a really tricky query.
Always build these queries gradually starting from the most detail and working outwards. Inspect the results at each stage because it helps visualise what the data are.
If you could come up with a common data structure for all the selects, you could UNION ALL them together, perhaps selecting a constant in each union so you know where the data came from, kind of like:
select '1',col1,col2,'' from table1
UNION ALL
select '2',col1,col2,col3 from table2
I just solved my original problem (that I came up against again today on a different query) in a slightly hacky way...
Conn.Execute(split(query,";")(0))
set rs = Conn.Execute(split(query,";")(1))
Works perfectly!
Edit: I just noticed that the first comment on my original question also provided a quick fix (SET NOCOUNT ON). I forgot about that. Well, there is this way and that way. I had tried to get the query working without the temporary table, but I couldn't get anywhere near the same performance as with it.
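For completeness, the SET NOCOUNT ON fix just means putting it at the top of the batch, so the DELETE and the table-population statement stop emitting 'rows affected' results and the final SELECT is the first resultset the ADO recordset sees. A sketch using the pseudo-query from the question (INSERT INTO is used here because the pseudo-query deletes from a temp table that already exists, and the column names are placeholders):
SET NOCOUNT ON;  -- suppress the 'n rows affected' counts from the first two statements

DELETE FROM temp;

INSERT INTO temp
SELECT * FROM sometable WHERE somefield = somefilter;

SELECT SUM(someotherfield)
FROM yetanothertable
WHERE account IN (SELECT account FROM temp);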