I have a program that is pulling data from a Visual FoxPro table and dumping it into a DataSet with VB.net. My connection string works great, and the query I'm using usually runs with respectable speed. As I've run it more, however, I've learned that there is a large amount of "bad" data in my table. So now I'm trying to refine my query to buffer against the "bad" data, but what I thought would be a very small tweak has yielded massive performance losses, and I'm not particularly sure why.
My original query is:
'Pull desired columns for orders that have not "shipped" and were received in past 60 days.
'To "ship", an order must qualify with both an updated ship date and Sales Order #.
sqlSelect = "SELECT job_id,cust_id,total_sale,received,due,end_qty,job_descr,shipped,so "
sqlFrom = "FROM job "
sqlWhere = "WHERE fac = 'North Side' AND shipped < {12/30/1899} AND so = '' AND received >= DATE()-60;"
sql = sqlSelect & sqlFrom & sqlWhere
This has a run-time of about 20 seconds; while I'd prefer it to be quicker, it's not a problem. In my original testing (and occasional debugging), I replaced sqlWhere with sqlWhere = "WHERE job_id = 127350". This runs pretty much instantaneously.
Now the problem block: Once I replaced sqlWhere with
'Find jobs that haven't "shipped" OR were received within last 21 days.
'Recently shipped items are desired in results.
sqlWhere = "WHERE fac = 'North Side' AND ((shipped < {12/30/1899} AND so = '') OR received >= DATE()-21);"
My performance jumped to about 3 min 40 sec. This time is almost exactly the same as the time to run with sqlWhere = "WHERE received >= DATE();".
I'm not the moderator of these tables; I'm merely pulling from them to create a series of reports for our users. My best guess is that the received field is not indexed and that this is the cause of my performance drop-off. But while my first search returns about 100 records, pulling the jobs only from today returns about 5, and it still takes about 11x as long.
So my question is three part:
1) Would someone be able to explain the phenomenon I'm experiencing right now? I feel like I'm somewhat on the right track, but my knowledge of SQL has been limited to circumstantial use within other languages...
2) Is there something I'm missing, or some better way to obtain the results I need? There is a large volume of records that haven't "shipped", but only because the user entered either a ship date or an S/O and didn't do the other. I need a way to view very recent orders (regardless of "shipped" status), and then also view less recent orders that have "bad" data, so I can get the user in the habit of cleaning up the data.
3) Is it bad SQL practice to overconstrain a WHERE clause? If I run fifteen field comparisons, joined together with nested ANDs/ORs, am I wasting my time when I could be doing something much cleaner?
Many thanks,
B
If your WHERE clause filters on a non-indexed column, the SQL engine must do a table scan, i.e., look at every record in the table.
The difference between the two queries is having the OR instead of the AND. When you have a non-indexed column in an AND, the SQL engine can use the indexes to narrow down the number of records it has to look at for the non-indexed column. When you have an OR, it now must look at every record in the table and compare on that column.
Adding an index on the Received column would probably fix the performance issue.
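Another common workaround when an OR defeats index optimization is to split the query into two AND-only halves and combine them with a UNION (which also removes duplicates, matching the OR semantics). This is only a rough sketch using the column list from the question, and it still assumes an index on received ends up being available:
SELECT job_id,cust_id,total_sale,received,due,end_qty,job_descr,shipped,so
FROM job
WHERE fac = 'North Side' AND shipped < {12/30/1899} AND so = ''
UNION
SELECT job_id,cust_id,total_sale,received,due,end_qty,job_descr,shipped,so
FROM job
WHERE fac = 'North Side' AND received >= DATE()-21
Each SELECT then has a simple WHERE clause that the engine can optimize on its own.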
In general, there are two things you don't want to have happen in your WHERE clause.
1. A primary condition on a non-indexed column
2. Using a calculation on a column. For example, doing WHERE Shipped-2 < date() is often worse than doing Shipped < Date() + 2, because the former doesn't typically allow the index to be used.
Refining your query through multiple WHERE clauses is generally a good thing. The fewer records you need to return to your application the better your performance will be, but you need to have appropriate indexing in place.
Related
So I am honestly a little puzzled by this!
I have a query that returns a set of transactions that contain both repair costs and an odometer reading at the time of repair on the master level. To get an accurate cost-per-mile reading, I need to do a subquery to get both the first meter reading between a starting date and an ending date, and an ending meter reading.
(select top 1 wf2.ro_num
 from wotrans wotr2
 left join wofile wf2
   on wotr2.rop_ro_num = wf2.ro_num
   and wotr2.rop_fac = wf2.ro_fac
 where wotr.rop_veh_num = wotr2.rop_veh_num
   and wotr.rop_veh_facility = wotr2.rop_veh_facility
   AND ((@sdate = '01/01/1900 00:00:00' and wotr2.rop_tran_date = 0)
        OR ([dbo].[udf_RTA_ConvertDateInt](@sdate) <= wotr2.rop_tran_date
            AND [dbo].[udf_RTA_ConvertDateInt](@edate) >= wotr2.rop_tran_date))
 order by wotr2.rop_tran_date asc) as highMeter
The reason I have the tables aliased as xx2 is because those tables are also used in the main query, and I don't want these to interact with each other except to pull the correct vehicle number and facility.
Basically, when I run the main query it returns a value that is not correct; it returns the one that is second (keep in mind that the first and second have the same date). But when I take the subquery and just copy and paste it into its own query and run it, it returns the correct value.
I do have a workaround for this, but I am just curious as to why this is happening. I have searched quite a bit and found not much (other than the fact that people don't like ORDER BYs in subqueries). Talking to one of my friends who also does quite a bit of SQL scripting, it looks to us as if the subquery inside the main query orders differently than the subquery run by itself when multiple rows have the same value for the ORDER BY column (i.e. 10 dates of 08/05/2016).
Any ideas would be helpful!
Like I said, I have a workaround that works in this one case, but I don't know yet if it will work on a larger dataset.
Let me know if you want more code.
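For what it's worth, when several rows share the same rop_tran_date, a TOP 1 is free to return any of them, and the choice can differ between the stand-alone query and the correlated subquery. The usual fix is to add a unique (or at least more specific) column as a secondary sort key so the ordering is deterministic; a minimal sketch of the subquery's last line, assuming wf2.ro_num is a suitable tiebreaker:
order by wotr2.rop_tran_date asc, wf2.ro_num asc) as highMeter
With a tiebreaker in place, both versions should return the same row.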
I am stumped on how to make this query run more efficiently/correctly. Here is the query first and then I can describe the tables that are involved:
UPDATE agg_pivot_test AS p
LEFT JOIN jd_cleaning AS c
ON c.Formerly = IIF(c.Formerly LIKE '*or*', '*' & p.LyFinalCode & '*', CStr(p.LyFinalCode))
SET p.CyFinalCode = c.FinalCode
WHERE p.CyFinalCode IS NULL AND c.Formerly IS NOT NULL;
agg_pivot_test has 200 rows of data and only 99 fit the criteria of WHERE p.CyFinalCode IS NULL. The JOIN needs some explaining. It is an IIF because some genius decided to link last year's data to this year's data using Formerly. It is a string because sometimes multiple items have been consolidated down to one so they use "or" (e.g., 632 or 631 or 630). So if I want to match this year's data I have to use Formerly to match last year's LyFinalCode. So this year the code might be 629, but I have to use the Formerly to map the items that were 632, 631, or 630 to the new code. Make sense? That is why the ON has an IIF. Also, Formerly is a string and LyFinalCode is an integer... fun.
Anyway, when you run the query it says it is updating 1807 records when again, there are only 200 records and only 99 that fit the criteria.
Any suggestions about why this is happening or how to fix it?
An interesting problem. I don't think I've ever come across something quite like this before.
I'm guessing what's happening is that rows where CyFinalCode is null are being matched multiple times by the join, so the join expression is effectively producing a Cartesian product of row matches, and that is the basis of the rows-updated message. It seems odd, as I would have expected Access to complain about multiple row matches, since matches should only be 1:1 in an update statement.
I would suggest rewriting the query (with this join) as a select statement and seeing what it gives you in the way of output, something like:
SELECT p.*, c.*
FROM agg_pivot_test p LEFT JOIN jd_cleaning c
ON c.Formerly = IIF(c.Formerly LIKE '*or*', '*' & p.LyFinalCode & '*', CStr(p.LyFinalCode))
WHERE p.CyFinalCode IS NULL AND c.Formerly IS NOT NULL
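To confirm whether multiple matches per row account for the 1807 figure, the same join can also be wrapped in a count; a sketch (if it returns roughly 1807, each row with a NULL CyFinalCode is being matched by several Formerly values):
SELECT COUNT(*) AS MatchedRows
FROM agg_pivot_test AS p LEFT JOIN jd_cleaning AS c
ON c.Formerly = IIF(c.Formerly LIKE '*or*', '*' & p.LyFinalCode & '*', CStr(p.LyFinalCode))
WHERE p.CyFinalCode IS NULL AND c.Formerly IS NOT NULL;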
I'm also inclined to suggest changing "... & p.LyFinalCode & ..." to "... & CStr(p.LyFinalCode) & ..." - though I can't really see why it should make a difference.
The only other thing I can think to suggest is to change the join a bit (this isn't necessarily guaranteed to be better, though it might be):
UPDATE agg_pivot_test AS p LEFT JOIN jd_cleaning AS c
ON (c.Formerly = CStr(p.LyFinalCode) OR InStr(c.Formerly, CStr(p.LyFinalCode)) > 0)
(Given the syntax of your statement, I assume this SQL is running within Access via ODBC, in which case this should be fine. If I'm wrong and the SQL is running server side, you'll need to change InStr to the server's equivalent, e.g. CharIndex on SQL Server.)
I have a table in MS Access which has stock prices arranged like
Ticker1, 9:30:00, $49.01
Ticker1, 9:30:01, $49.08
Ticker2, 9:30:00, $102.02
Ticker2, 9:30:01, $102.15
and so on.
I need to do a calculation where I compare the price in one row with the immediately previous price (and if the price movement is greater than X% in one second, I need to report the instance separately).
If I were doing this in Excel, it would be a fairly simple formula, but I have a few million rows of data, so that's not an option.
Any suggestions on how I could do it in MS Access?
I am open to any kind of solutions (with or without SQL or VBA).
Update:
I ended up trying to traverse my records by using ADODB.Recordset in nested loops. Code below. I thought it was a good idea, and the logic worked for a small table (20k rows). But when I ran it on a larger table (3m rows), Access ballooned to the 2GB limit without finishing the task (the original table was more like ~300MB; the growth comes from temporary tables). Posting it here in case it helps someone with smaller data sets.
Do While Not rstTickers.EOF          'loop over each ticker
    myTicker = rstTickers!ticker
    rstDates.MoveFirst
    Do While Not rstDates.EOF        'loop over each trading date
        myDate = rstDates!Date_Only
        strSql = "select * from Prices where ticker = """ & myTicker & """ and Date_Only = #" & myDate & "#" 'get all prices for a given ticker for a given date
        rst.Open strSql, cn, adOpenKeyset, adLockOptimistic 'I needed to do this to open in editable mode
        rst.MoveFirst
        sPrice1 = rst!Open_Price
        rst!Row_Num = i
        rst.MoveNext
        Do While Not rst.EOF         'walk the remaining rows, numbering them and computing the move vs. the previous price
            i = i + 1
            rst!Row_Num = i
            rst!Previous_Price = sPrice1
            sPrice2 = rst!Open_Price
            rst!Price_Move = Round(Abs((sPrice2 / sPrice1) - 1), 6)
            sPrice1 = sPrice2
            rst.MoveNext
        Loop
        i = i + 1
        rst.Close
        rstDates.MoveNext
    Loop
    rstTickers.MoveNext
Loop
If the data is always one second apart without any milliseconds, then you can join the table to itself on the Ticker ID and the time offsetting by one second.
Otherwise, if there is no sequence counter of some sort to join on, then you will need to create one. You can do this by doing a "ranking" query. There are multiple approaches to this. You can try each and see which one works the fastest in your situation.
One approach is to use a subquery that returns the number of rows that come before the current row. Another approach is to join the table to itself on all the rows before it and do a group by and count. Both approaches produce the same results, but depending on the nature of your data, how it's structured, and what indexes you have, one approach will be faster than the other.
Once you have a "rank column", you do the procedure described in the first paragraph, but instead of joining on an offset of time, you join on an offset of rank.
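A rough sketch of the ranking approach in Access SQL, assuming a Prices table with Ticker, Time_Date and Open_Price columns (names borrowed from the poster's later code); the saved-query name RankedPrices and the 1% threshold are just examples:
SELECT p1.Ticker, p1.Time_Date, p1.Open_Price,
       (SELECT COUNT(*)
        FROM Prices AS p0
        WHERE p0.Ticker = p1.Ticker
          AND p0.Time_Date <= p1.Time_Date) AS PriceRank
FROM Prices AS p1;
Save that as RankedPrices, then join the ranked rows to themselves offset by one rank:
SELECT cur.Ticker, cur.Time_Date,
       prev.Open_Price AS Previous_Price,
       cur.Open_Price,
       Abs(cur.Open_Price / prev.Open_Price - 1) AS Price_Move
FROM RankedPrices AS cur
INNER JOIN RankedPrices AS prev
  ON cur.Ticker = prev.Ticker
 AND cur.PriceRank = prev.PriceRank + 1
WHERE Abs(cur.Open_Price / prev.Open_Price - 1) > 0.01;
On a few million rows the correlated COUNT(*) can be slow, which is why it's worth testing it against the join-and-group-by variant mentioned above.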
I ended up moving my data to a SQL Server (which had its own issues). I added a row number column (Row_Num) like this:
ALTER TABLE Prices ADD Row_Num INT NOT NULL IDENTITY (1,1)
It worked for me (I think) because my underlying data was in the order that I needed it to be in. I've read enough comments saying you shouldn't rely on this, because you don't know what order the server stores the data in.
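If the physical order can't be trusted, a more deterministic option is to populate Row_Num from ROW_NUMBER(). This is just a sketch: it assumes a Time_Date column that actually reflects the intended order, and that Row_Num is a plain INT column rather than the IDENTITY above:
-- Assumes Time_Date reflects the intended ordering within each ticker and day,
-- and that Row_Num is an ordinary INT column (an IDENTITY column cannot be updated).
WITH Ordered AS (
    SELECT Row_Num,
           ROW_NUMBER() OVER (PARTITION BY Ticker, Date_Only ORDER BY Time_Date) AS rn
    FROM Prices
)
UPDATE Ordered SET Row_Num = rn;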
Anyway, after that it was a join on itself. Took me a while to figure out the syntax (I am new to SQL). Adding SQL here for reference (works on SQL server but not Access).
Update A Set Previous_Price = B.Open_Price
FROM Prices A INNER JOIN Prices B
ON A.Date_Only = B.Date_Only
WHERE ((A.Ticker=B.Ticker) AND (A.Row_Num=B.Row_Num+1));
BTW, I had to first add the column Date_Only like this (works on Access but not SQL server)
UPDATE Prices SET Prices.Date_Only = Format([Time_Date],"mm/dd/yyyy");
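The rough SQL Server equivalent would be something like this (a sketch, assuming Date_Only is a date column and Time_Date holds the full timestamp):
UPDATE Prices SET Date_Only = CAST(Time_Date AS date);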
I think the solution for row numbers described by @Rabbit should work better (broadly speaking); I just haven't had the time to try it out. It took me a whole day to get this far.
How do we match columns based on a condition of closeness to a value? This requires a complex query with range comparisons and multiple join conditions.
I'm getting a "Query size exceeded 2GB" error.
Tables:
InvDetails1 / InvDetails2 / INVDL / ExpectedResult
Field relations:
InvDetails1.F1 = InvDetails2.F3
InvDetails2.F5 = INVDL.F1
INVDL.DLID = ExpectedResult.DLID
ExpectedResult.Total - 1 < InvDetails1.F6 < ExpectedResult.Total + 1
left(InvDetails1.F21,10) = '2013-03-07'
Return results where the number of matching records from ExpectedResult is exactly 1. Grouping by InvDetails1.F1 with count(ExpectedResult.DLID) works for that part.
From that result, the final output should be:
InvDetails1.F1, InvDetails1.F16, ExpectedResult.DLID, ExpectedResult.NMR
ExpectedResult has millions of rows; InvDetails has a few hundred thousand.
If I were in that situation and found that my query was "hitting a wall" at 2GB, then one thing I would try is creating a separate saved Select query to isolate the InvDetails1 records just for the specific date in question. Then I would use that query instead of the full InvDetails1 table when joining to the other tables.
My reasoning is that the query optimizer may not be able to use your left(InvDetails1.F21,10) = '2013-03-07' condition to exclude InvDetails1 records early in the execution plan, possibly causing the query to grow much larger than it really needs to (internally, while it is being processed). Forcing the date selection to the beginning of the process by putting it in a separate (prerequisite) query may keep the size of the "main" query down to a more feasible size.
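For example (a sketch only, reusing the field names from the question; the saved-query name and the exact join layout are assumptions, and the "exactly one ExpectedResult match" grouping step is left out):
SELECT *
FROM InvDetails1
WHERE Left(InvDetails1.F21, 10) = '2013-03-07';
Save that as, say, InvDetails1_Filtered, then point the main query at it instead of the full table:
SELECT f.F1, f.F16, er.DLID, er.NMR
FROM ((InvDetails1_Filtered AS f
INNER JOIN InvDetails2 AS i2 ON f.F1 = i2.F3)
INNER JOIN INVDL AS d ON i2.F5 = d.F1)
INNER JOIN ExpectedResult AS er ON d.DLID = er.DLID
WHERE f.F6 > er.Total - 1
AND f.F6 < er.Total + 1;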
Also, if I found myself in the situation where my queries were getting that big I would also keep a watchful eye on the size of my .accdb (or .mdb) file to ensure that it does not get too close to 2GB. I've never had it happen myself, but I've heard that database files that hit the 2GB barrier can result in some nasty errors and be rather "challenging" to recover.
Background:
We narrowed down some performance issues in our ERP to one SQL statement. Our support person made one small change to the SQL statement and performance improved. The back-end is Microsoft Access. (yes I know, yes it's embarrassing, no I have no choice in the matter)
We went from the statement taking 90 seconds to run about 75% of the time (and 2-3 seconds the other 25%) to taking 2-3 seconds 100% of the time.
The POHDR table is large (20,000+ rows); SUPPNAME is under 1,000. Both tables have the vnum field indexed.
SQL statements below, but all that changed is the Where clause uses SUPPNAME.vnum instead of POHDR.vnum.
BEFORE:
SELECT DISTINCTROW POHDR.*,
SUPPNAME.SNAME1,
suppname.sname1 & chr(13) & chr(10) & POHDR.vnum as SUPPFLD
FROM POHDR
INNER JOIN SUPPNAME
ON POHDR.VNUM = SUPPNAME.VNUM
WHERE ((POHDR.VNUM= '20023' AND POHDR.RECDATE Is Null))
AND [POHDR].[CANCEL] Is Null and ((POHDR.CLOSED=No))
order by IIf(InStr(PO,'-'),Left(PO,InStr(PO,'-')) & '_' & Mid(PO,InStr(PO,'-')),PO)
AFTER:
SELECT DISTINCTROW POHDR.*,
SUPPNAME.SNAME1,
suppname.sname1 & chr(13) & chr(10) & POHDR.vnum as SUPPFLD
FROM POHDR
INNER JOIN SUPPNAME
ON POHDR.VNUM = SUPPNAME.VNUM
WHERE ((SUPPNAME.VNUM= '26037' AND POHDR.RECDATE Is Null))
AND [POHDR].[CANCEL] Is Null
and ((POHDR.CLOSED=No))
order by IIf(InStr(PO,'-'),Left(PO,InStr(PO,'-')) & '_' & Mid(PO,InStr(PO,'-')),PO)
Does changing where the vnum is selected to a smaller table with less or no duplication of the vnum really make that big of a difference, or is there something else going on?
thanks, Brian the Curious.
P.S. Also, I did not write and do not have control of this SQL statement, and I'm not sure exactly what is going on with the IIf in the ORDER BY clause either.
I don't know if there is a very good "WHY" answer to your question, other than the fact that changing that WHERE clause caused a different query plan to be chosen by the optimizer.
The optimizer can choose a query plan based on many factors (indexes, statistics, black magic, etc). As long as you've got good indexes, the best you can usually do is carefully test different similar versions of the query, and compare the query plans they generate.
The key in this kind of query optimization is indexing, so the first thing to check is whether all the relevant fields are indexed, in particular SUPPNAME.VNUM and POHDR.VNUM.
Then the second thing, of course, is the number of records in each table that satisfy each of these conditions. If there is only one row with SUPPNAME.VNUM = '26037', filtering on it first can eliminate a lot of POHDR rows that would fail on the other criteria, so those comparisons never need to be executed.
But without seeing the structure of your DB and data in it, I would not make any definite conclusions.
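A quick way to check the record counts mentioned above is to count how many rows satisfy each condition (a sketch reusing the literal values from the two statements; in Access these would be run as two separate queries):
SELECT COUNT(*) AS SuppMatches FROM SUPPNAME WHERE VNUM = '26037';
SELECT COUNT(*) AS PoMatches FROM POHDR WHERE VNUM = '20023' AND RECDATE Is Null;
If only a single SUPPNAME row matches while many POHDR rows share a vnum, that selectivity difference gives the optimizer a much cheaper starting point.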