How to improve the efficiency of below query in SQL Server? - sql

I have a ten million level database. The client needs to read data and perform calculation.
Due to the large amount of data, if it is saved in the application cache, memory will be overflow and crash will occur.
If I use select statement to query data from the database in real time, the time may be too long and the number of operations on the database may be too frequent.
Is there a better way to read the database data? I use C++ and C# to access SQL Server database.
My database statement is similar to the following:
SELECT TOP 10 y.SourceName, MAX(y.EndTimeStamp - y.StartTimeStamp) AS ProcessTimeStamp
FROM
(
SELECT x.SourceName, x.StartTimeStamp, IIF(x.EndTimeStamp IS NOT NULL, x.EndTimeStamp, 134165256277210658) AS EndTimeStamp
FROM
(
SELECT
SourceName,
Active,
LEAD(Active) OVER(PARTITION BY SourceName ORDER BY TicksTimeStamp) NextActive,
TicksTimeStamp AS StartTimeStamp,
LEAD(TicksTimeStamp) OVER(PARTITION BY SourceName ORDER BY TicksTimeStamp) EndTimeStamp
FROM Table1
WHERE Path = N'App1' and TicksTimeStamp >= 132165256277210658 and TicksTimeStamp < 134165256277210658
) x
WHERE (x.Active = 1 and x.NextActive = 0) OR (x.Active = 1 and x.NextActive = null)
) y
GROUP BY y.SourceName
ORDER BY ProcessTimeStamp DESC, y.SourceName
The database structure is roughly as follows:
ID Path SourceName TicksTimeStamp Active
1 App1 Pipe1 132165256277210658 1
2 App1 Pipe1 132165256297210658 0
3 App1 Pipe1 132165956277210658 1
4 App2 Pipe2 132165956277210658 1
5 App2 Pipe2 132165956277210658 0
I use the ExecuteReader of C #. The same SQL statement runs on SQL Management for 4s, but the time returned by the ExecuteReader is 8-9s. Does the slow time have anything to do with this interface?

I don't really 'get' the entire query but I'm wondering about this part:
WHERE (x.Active = 1 and x.NextActive = 0) OR (x.Active = 1 and x.NextActive = null)
SQL doesn't really like OR's so why not convert this to
WHERE x.Active = 1 and ISNULL(x.NextActive, 0) = 0
This might cause a completely different query plan. (or not)
As CharlieFace mentioned, probably best to share the query plan so we might get an idea of what's going on.
PS: I'm also not sure what those 'ticksTimestamps' represent, but it looks like you're fetching a pretty wide range there, bigger volumes will also cause longer processing time. Even though you only return the top 10 it still has to go through the entire range to calculate those durations.

I agree with #Charlieface. I think the index you want is as follows:
CREATE INDEX idx ON Table1 (Path, TicksTimeStamp) INCLUDE (SourceName, Active);
You can add both indexes (with different names of course) and see which one the execution engine chooses.

I can suggest adding the following index which should help the inner query using LEAD:
CREATE INDEX idx ON Table1 (SourceName, TicksTimeStamp, Path) INCLUDE (Active);
The key point of the above index is that it should allow the lead values to be rapidly computed. It also has an INCLUDE clause for Active, to cover the entire select.

Related

Redshift SQL result set 100s of rows wide efficiency (long to wide)

Scenario: Medical records reporting to state government which requires a pipe delimited text file as input.
Challenge: Select hundreds of values from a fact table and produce a wide result set to be (Redshift) UNLOADed to disk.
What I have tried so far is a SQL that I want to make into a VIEW.
;WITH
CTE_patient_record AS
(
SELECT
record_id
FROM fact_patient_record
WHERE update_date = <yesterday>
)
,CTE_patient_record_item AS
(
SELECT
record_id
,record_item_name
,record_item_value
FROM fact_patient_record_item fpri
INNER JOIN CTE_patient_record cpr ON fpri.record_id = cpr.record_id
)
Note that fact_patient_record has 87M rows and fact_patient_record_item has 97M rows.
The above code runs in 2 seconds for 2 test records and the CTE_patient_record_item CTE has about 200 rows per record for a total of about 400.
Now, produce the result set:
,CTE_result AS
(
SELECT
cpr.record_id
,cpri002.record_item_value AS diagnosis_1
,cpri003.record_item_value AS diagnosis_2
,cpri004.record_item_value AS medication_1
...
FROM CTE_patient_record cpr
INNER JOIN CTE_patient_record_item cpri002 ON cpr.cpr.record_id = cpri002.cpr.record_id
AND cpri002.record_item_name = 'diagnosis_1'
INNER JOIN CTE_patient_record_item cpri003 ON cpr.cpr.record_id = cpri003.cpr.record_id
AND cpri003.record_item_name = 'diagnosis_2'
INNER JOIN CTE_patient_record_item cpri004 ON cpr.cpr.record_id = cpri004.cpr.record_id
AND cpri003.record_item_name = 'mediation_1'
...
) SELECT * FROM CTE_result
Result set looks like this:
record_id diagnosis_1 diagnosis_2 medication_1 ...
100001 09 9B 88X ...
...and then I use the Reshift UNLOAD command to write to disk pipe delimited.
I am testing this on a full production sized environment but only for 2 test records.
Those 2 test records have about 200 items each.
Processing output is 2 rows 200 columns wide.
It takes 30 to 40 minutes to process just just the 2 records.
You might ask me why I am joining on the item name which is a string. Basically there is no item id, no integer, to join on. Long story.
I am looking for suggestions on how to improve performance. With only 2 records, 30 to 40 minutes is unacceptable. What will happen when I have 1000s of records?
I have also tried making the VIEW a MATERIALIZED VIEW however, it takes 30 to 40 minutes (not surprisingly) to compile the materialized view also.
I am not sure which route to take from here.
Stored procedure? I have experience with stored procs.
Create new tables so I can create integer id's to join on and indexes? However, my managers are "new table" averse.
?
I could just stop with the first two CTEs, pull the data down to python and process using pandas dataframe which I've done before successfully but it would be nice if I could have an efficient query, just use Redshift UNLOAD and be done with it.
Any help would be appreciated.
UPDATE: Many thanks to Paul Coulson and Bill Weiner for pointing me in the right direction! (Paul I am unable to upvote your answer as I am too new here).
Using (pseudo code):
MAX(CASE WHEN t1.name = 'somename' THEN t1.value END ) AS name
...
FROM table1 t1
reduced execution time from 30 minutes to 30 seconds.
EXPLAIN PLAN for the original solution is 2700 lines long, for the new solution using conditional aggregation is 40 lines long.
Thanks guys.
Without some more information it is impossible to know what is going on for sure but what you are doing is likely not ideal. An explanation plan and the execution time per step would help a bunch.
What I suspect is getting you is that you are reading a 97M row table 200 times. This will slow things down but shouldn't take 40 min. So I also suspect that record_item_name is not unique per value of record_id. This will lead to row replication and could be expanding the data set many fold. Also is record_id unique in fact_patient_record? If not then this will cause row replication. If all of this is large enough to cause significant spill and significant network broadcasting your 40 min execution time is very plausible.
There is no need to be joining when all the data is in a single copy of the table. #PhilCoulson is correct that some sort of conditional aggregation could be applied and the decode() syntax could save you space if you don't like case. Several of the above issues that might be affecting your joins would also make this aggregation complicated. What are you looking for if there are several values for record_item_value for each record_id and record_item_name pair? I expect you have some discovery of what your data holds in your future.

Optimize complicated SQL Update

Somebody at work made this UPDATE some years ago and itt works, the problem is it's taking almost 5 hours when called multiple times in a process, this is not a regular UPDATE, there is no 1 to 1 record matching between tables, this does an update based on accumulative (SUM) of a parituclar field in the same table, and things get more complicated because this SUM is restricted to special conditions based on dates and another field.
I think this is something like an (implicit) inner join with no 1 to 1 match, like ALL VS ALL, so when having for example 7000 records in the table this thing will process 7000 * 7000 records, more than 55 million, in my opinion cursors should have been used here, but now i need more speed and i don't think cursors will get me there.
My question is: Is there any way to rewrite this and make it faster?? Pay attention to the conditions on that SUM, this is not an easy to see UPDATE (at least for me).
More info:
CodCtaCorriente and CodCtaCorrienteMon are primary keys on this table but, as I said before there is no intention to make a 1 to 1 match here that's why this keys are not used in the query, CodCtaCorrienteMon is used in conditions but not as a join condition (ON).
UPDATE #POS SET SaldoDespuesEvento =
(SELECT SUM(Importe)
FROM #POS CTACTE2
WHERE CTACTE2.CodComitente = #POS.CodComitente
AND CTACTE2.CodMoneda = #POS.CodMoneda
AND CTACTE2.EstaAnulado = 0
AND (DATEDIFF(day, CTACTE2.FechaLiquidacion, #POS.FechaLiquidacion) > 0
OR
(DATEDIFF(day, CTACTE2.FechaLiquidacion, #POS.FechaLiquidacion) = 0
AND (#POS.CodCtaCorrienteMon >= CTACTE2.CodCtaCorrienteMon))))
WHERE #POS.EstaAnulado = 0 AND #POS.EsSaldoAnterior = 0
From your query plan it looks like its spending most of the time in the filter right after the index spool.
If you are going to run this query a few times, I would create an index on the 'CodComitente', 'CodMoneda', 'EstaAnulado', 'FechaLiquidacion', and 'CodCtaCorrienteMon' columns.
I don't know much about the Index Spool iterator; but basically from what I understand about it, its used as a 'temporary' index created at query time. So if you are running this query multiple times, I would create that index once, then run the query as many times as you need.
Also, I would try creating a variable to store the result of your sum operation, so you can avoid running that as much as possible.
DECLARE #sumVal AS INT
SET #sumVal = SELECT SUM(Importe)
FROM #POS CTACTE2
WHERE CTACTE2.CodComitente = #POS.CodComitente
AND CTACTE2.CodMoneda = #POS.CodMoneda
AND CTACTE2.EstaAnulado = 0
AND (DATEDIFF(day, CTACTE2.FechaLiquidacion, #POS.FechaLiquidacion) > 0
OR
(DATEDIFF(day, CTACTE2.FechaLiquidacion, #POS.FechaLiquidacion) = 0
AND (#POS.CodCtaCorrienteMon >= CTACTE2.CodCtaCorrienteMon)))
UPDATE #POS SET SaldoDespuesEvento = #sumVal
WHERE #POS.EstaAnulado = 0 AND #POS.EsSaldoAnterior = 0
It is hard to help much without the query plan but I would make the an assumption that if there is not already indexes on the FechaLiquidacion and CodCtaCorrienteMon columns then performance would be improved by creating them as long as database storage space is not an issue.
Found the solution, this is a common problem: Running Totals
This is one of the few cases CURSORS perform better, see this and more available solutions here (or browse stackoverflow, there are many cases like this):
http://weblogs.sqlteam.com/mladenp/archive/2009/07/28/SQL-Server-2005-Fast-Running-Totals.aspx

Erratic query performance

I am new to this site, but please don't hold it against me. I have only used it once.
Here is my dilemma: I have moderate SQL knowledge but am no expert. The query below was created by a consultant a long time ago.
On most mornings it takes a 1.5 hours to run because there is lots of data. BUT other mornings, it takes 4-6 hours. I have tried eliminating any jobs that are running. I am thoroughly confused as to what to try to find out what is causing this problem.
Any help would be appreciated.
I have already broken this query into 2 queries, but any tips on ways to help boost performance would be greatly appreciated.
This query builds back our inventory transactions to find what our stock on hand value was at any given point in time.
SELECT
ITCO, ITIM, ITLOT, Time, ITWH, Qty, ITITCD,ITIREF,
SellPrice, SellCost,
case
when Transaction_Cost is null
then Qty * (SELECT ITIACT
FROM (Select Top 1 B.ITITDJ, B.ITIREF, B.ITIACT
From OMCXIT00 AS B
Where A.ITCO = B.ITCO
AND A.ITWH = B.ITWH
AND A.ITIM = B.ITIM
AND A.ITLOT = B.ITLOT
AND ((A.ITITDJ > B.ITITDJ)
OR (A.ITITDJ = B.ITITDJ AND A.ITIREF <= B.ITIREF))
ORDER BY B.ITITDJ DESC, B.ITIREF DESC) as C)
else Transaction_Cost
END AS Transaction_Cost,
case when ITITCD = 'S' then ' Shipped - Stock' else null end as TypeofSale,
case when ititcd = 'S' then ITIREF else null end as OrderNumber
FROM
dbo.InvTransTable2 AS A
Here is the execution plan.
http://i.imgur.com/mP0Cu.png
Here is the DTA but I am unsure how to read it since the recommedations are blank. Shouldn't that say "Create"?
http://i.imgur.com/4ycIP.png
You can not do match with dbo.InvTransTable2, because of you are selected all records from it, so it will be left scanning records.
Make sure that you have clustered index on OMCXIT00, it looks like it is a heap, no clustered index.
Make sure that clustered index is small, but has more distinct values in it.
If you have not many records OMCXIT00, it may be sufficient to create index with key ITCO and include following columns in include ( ITITDJ , ITIREF, ITWH,ITCO ,ITIM,ITLOT )
Index creation example:
CREATE INDEX IX_dbo_OMCXIT00
ON OMCXIT00 ([ITCO])
INCLUDE ( ITITDJ , ITIREF)
If it does not help, then you need to see which columns in the predicates that you are searching for has more distinct values, and create index with key one or some of them and make sure reorder predicate order in where clause.
A.ITCO = B.ITCO
AND A.ITWH = B.ITWH
AND A.ITIM = B.ITIM
AND A.ITLOT = B.ITLOT
besides adding indexes to change table scans for index seeks, ask to yourself: "do i really need this order by in this sql code?". if you dont neet this sorting, remove order by from your sql code. next, there is a good chance your code will be faster.

Optimizing MySQL statement with lot of count(row) an sum(row+row2)

I need to use InnoDB storage engine on a table with about 1mil or so records in it at any given time. It has records being inserted to it at a very fast rate, which are then dropped within a few days, maybe a week. The ping table has about a million rows, whereas the website table only about 10,000.
My statement is this:
select url
from website ws, ping pi
where ws.idproxy = pi.idproxy and pi.entrytime > curdate() - 3 and contentping+tcpping is not null
group by url
having sum(contentping+tcpping)/(count(*)-count(errortype)) < 500 and count(*) > 3 and
count(errortype)/count(*) < .15
order by sum(contentping+tcpping)/(count(*)-count(errortype)) asc;
I added an index on entrytime, yet no dice. Can anyone throw me a bone as to what I should consider to look into for basic optimization of this query. The result set is only like 200 rows, so I'm not getting killed there.
In the absence of the schemas of the relations, I'll have to make some guesses.
If you're making WHERE a.attrname = b.attrname clauses, that cries out for a JOIN instead.
Using COUNT(*) is both redundant and sometimes less efficient than COUNT(some_specific_attribute). The primary key is a good candidate.
Why would you test contentping+tcpping IS NOT NULL, asking for a calculation that appears unnecessary, instead of just testing whether the attributes individually are null?
Here's my attempt at an improvement:
SELECT url
FROM website AS ws
JOIN ping AS pi
ON ws.idproxy = pi.idproxy
WHERE
pi.entrytime > CURDATE() - 3
AND pi.contentping IS NOT NULL
AND pi.tcpping IS NOT NULL
GROUP BY url
HAVING
SUM(pi.contentping + pi.tcpping) / (COUNT(pi.idproxy) - COUNT(pi.errortype)) < 500
AND COUNT(pi.idproxy) > 3
AND COUNT(pi.errortype) / COUNT(pi.idproxy) < 0.15
ORDER BY
SUM(pi.contentping + pi.tcpping) / (COUNT(pi.idproxy) - COUNT(pi.errortype)) ASC;
Performing lots of identical calculations in both the HAVING and ORDER BY clauses will likely be costing you performance. You could either put them in the SELECT clause, or create a view that has those calculations as attributes and use that view for accessing the values.

LIMIT in FoxPro

I am attempting to pull ALOT of data from a fox pro database, work with it and insert it into a mysql db. It is too much to do all at once so want to do it in batches of say 10 000 records. What is the equivalent to LIMIT 5, 10 in Fox Pro SQL, would like a select statement like
select name, address from people limit 5, 10;
ie only get 10 results back, starting at the 5th. Have looked around online and they only make mention of top which is obviously not of much use.
Take a look at the RecNo() function.
FoxPro does not have direct support for a LIMIT clause. It does have "TOP nn" but that only provides the "top-most records" within a given percentage, and even that has a limitation of 32k records returned (maximum).
You might be better off dumping the data as a CSV, or if that isn't practical (due to size issues), writing a small FoxPro script that auto-generates a series of BEGIN-INSERT(x10000)-COMMIT statements that dump to a series of text files. Of course, you would need a FoxPro development environment for this, so this may not apply to your situation...
Visual FoxPro does not support LIMIT directly.
I used the following query to get over the limitation:
SELECT TOP 100 * from PEOPLE WHERE RECNO() > 1000 ORDER BY ID;
where 100 is the limit and 1000 is the offset.
It is very easy to get around LIMIT clause using TOP clause ; if you want to extract from record _start to record _finish from a file named _test, you can do :
[VFP]
** assuming _start <= _finish, if not you get a top clause error
*
_finish = MIN(RECCOUNT('_test'),_finish)
*
SELECT * FROM (SELECT TOP (_finish - _start + 1) * FROM (SELECT TOP _finish *, RECNO() AS _tempo FROM _test ORDER BY _tempo) xx ORDER BY _tempo DESC) yy ORDER BY _tempo
**
[/VFP]
I had to convert a Foxpro database to Mysql a few years ago. What I did to solve this was add an auto-incrementing id column to the Foxpro table and use that as the row reference.
So then you could do something like.
select name, address from people where id >= 5 and id <= 10;
The Foxpro sql documentation does not show anything similar to limit.
Here, adapt this to your tables. Took me like 2 mins, i do this waaaay too often.
N1 - group by whatever, and make sure you got a max(id), you can use recno() to make one, sorted correctly
N2 - Joins N1 where the ID = Max Id of N1, display the field you want from N2
Then if you want to join to other tables, put that all in brackets and give it an alias and include it in a join.
Select N1.reference, N1.OrderNoteCount, N2.notes_desc LastNote
FROM
(select reference, count(reference) OrderNoteCount, Max(notes_key) MaxNoteId
from custnote
where reference != ''
Group by reference
) N1
JOIN
(
select reference, count(reference) OrderNoteCount, notes_key, notes_desc
from custnote
where reference != ''
Group by reference, notes_key, notes_desc
) N2 ON N1.MaxNoteId = N2.notes_key
To expand on Eyvind's answer I would create a program to uses the RecNo() function to pull records within a given range, say 10,000 records.
You could then programmatically cycle through the large table in chucks of 10,000 records at a time and preform your data load into you MySQL database.
By using the RecNO() function you can be certain not to insert rows more than once, and be able to restart at a know point in the data load process. That by it's self can be very handy in the event you need to stop and restart the load process.
Depending on the number of the returned rows and if you are using .NET Framework you can offset/limit the gotten DataTable on the following way:
dataTable = dataTable.AsEnumerable().Skip(offset).Take(limit).CopyToDataTable();
Remember to add the Assembly System.Data.DataSetExtensions.