Number of records discrepancy - only change is sorting - sql

I have an Access 2003 database with a query that is a left outer join of a table to another query. If I didn't sort that final query, I got 42 records. If I sorted the final query by the 2 joined fields, I got 43 records. No other changes were made to the query.
To verify this, I took the query, copied it, applied the sort with no other changes, and the record count went up by one. Perplexed, I copied the results into Excel, sorted, and compared row by row. I discovered one record was duplicated (all fields were exactly the same), even though there are actually no duplicate records in the source table or query.
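(For completeness: a grouped count along these lines would confirm the same thing in SQL; the field names are placeholders, not my real schema.)
SELECT Field1, Field2, COUNT(*) AS NumCopies
FROM SourceQuery
GROUP BY Field1, Field2
HAVING COUNT(*) > 1;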
I would think this is a bug, and I know there are a few in Access, but has anyone heard of this behavior before?

It is possible that you have a corrupt index. It may be worth taking a backup and then compacting and repairing the database, which should rebuild the indexes.
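If compacting does not cure it, you can also drop and recreate a suspect index by hand with DDL; this is just a sketch with placeholder names (idx_MyField, MyTable, MyField), and in Access you would run each statement separately:
DROP INDEX idx_MyField ON MyTable;
CREATE INDEX idx_MyField ON MyTable (MyField);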

I had a similar problem on Access 2003, where I had duplicate records in which one field was an AutoNumber that showed up twice.
My query was:
SELECT qry_Tasks_with_Names.*, Location.Location, WorkRequests.Element, WorkRequests.Site_ID
FROM ((WorkRequests LEFT JOIN Location ON WorkRequests.Site_ID=Location.LocationID) INNER JOIN qry_Tasks_with_Names_by_Individual ON WorkRequests.Work_Request_ID=qry_Tasks_with_Names_by_Individual.Work_Request_ID) INNER JOIN qry_Tasks_with_Names ON WorkRequests.Work_Request_ID=qry_Tasks_with_Names.Work_Request_ID
WHERE (WorkRequests.Site_ID=Forms!TaskList!comboFilter_by_Site Or Forms!TaskList!comboFilter_by_Site Is Null) And (qry_Tasks_with_Names.Assigned_to_User_ID=forms!taskList!comboFilter_by_Person Or forms!taskList!comboFilter_by_Person Is Null) And (qry_Tasks_with_Names.Assigned_to_Team_ID=forms!taskList!comboFilter_by_Team Or forms!taskList!comboFilter_by_Team Is Null) And qry_Tasks_with_Names.Assigned_to_User_ID>0
ORDER BY qry_Tasks_with_Names.SLA_Due;
The above was initially created by the query designer, but at this level of complexity it gets itself confused, and it seems that I do too.
Once I removed the inner join on qry_Tasks_with_Names_by_Individual all was OK.
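For reference, my best reconstruction of the working version is simply the same statement with that join removed:
SELECT qry_Tasks_with_Names.*, Location.Location, WorkRequests.Element, WorkRequests.Site_ID
FROM (WorkRequests LEFT JOIN Location ON WorkRequests.Site_ID=Location.LocationID) INNER JOIN qry_Tasks_with_Names ON WorkRequests.Work_Request_ID=qry_Tasks_with_Names.Work_Request_ID
WHERE (WorkRequests.Site_ID=Forms!TaskList!comboFilter_by_Site Or Forms!TaskList!comboFilter_by_Site Is Null) And (qry_Tasks_with_Names.Assigned_to_User_ID=forms!taskList!comboFilter_by_Person Or forms!taskList!comboFilter_by_Person Is Null) And (qry_Tasks_with_Names.Assigned_to_Team_ID=forms!taskList!comboFilter_by_Team Or forms!taskList!comboFilter_by_Team Is Null) And qry_Tasks_with_Names.Assigned_to_User_ID>0
ORDER BY qry_Tasks_with_Names.SLA_Due;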
No idea why, but hopefully this may save someone else some tears if they have the same problem.

Related

Poor performance with stacked joins

I'm not sure I can provide enough details for an answer, but my company is having a performance issue with an older MS SQL view. I've narrowed it down to the right outer joins, but I'm not familiar with the structure of joins following joins without an ON clause paired with each one, as in the code snippet below.
How do I rewrite the joins below to either improve performance or at least get them into the simpler JOIN TableName ON Field1 = Field2 format?
FROM dbo.tblObject AS tblObject_2
JOIN dbo.tblProspectB2B PB ON PB.Object_ID = tblObject_2.Object_ID
RIGHT OUTER JOIN dbo.tblProspectB2B_CoordinatorStatus
RIGHT OUTER JOIN dbo.tblObject
INNER JOIN dbo.vwDomain_Hierarchy
INNER JOIN dbo.tblContactUser
INNER JOIN dbo.tblProcessingFile WITH ( NOLOCK )
LEFT OUTER JOIN dbo.enumRetentionRealization AS RR ON RR.RetentionRealizationID = dbo.tblProcessingFile.RetentionLeadTypeID
INNER JOIN dbo.tblLoan
INNER JOIN dbo.tblObject AS tblObject_1 WITH ( NOLOCK ) ON dbo.tblLoan.Object_ID = tblObject_1.Object_ID ON dbo.tblProcessingFile.Loan_ID = dbo.tblLoan.Object_ID ON dbo.tblContactUser.Object_ID = dbo.tblLoan.ContactOwnerID ON dbo.vwDomain_Hierarchy.Object_ID = tblObject_1.Domain_ID ON dbo.tblObject.Object_ID = dbo.tblLoan.ContactOwnerID ON dbo.tblProspectB2B_CoordinatorStatus.Object_ID = dbo.tblLoan.ReferralSourceContactID ON tblObject_2.Object_ID = dbo.tblLoan.ReferralSourceContactID
Your last INNER JOIN is followed by a number of ON clauses. Per this question and answer, such syntax is equivalent to a nested subquery.
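As a tiny illustration of what that stacking means (the table and column names here are invented):
-- In the stacked form, ON clauses bind to the joins innermost-first,
-- so B joins to C before A joins to that result:
SELECT *
FROM A
INNER JOIN B
INNER JOIN C ON B.x = C.x
ON A.y = B.y;

-- ...which is equivalent to this explicitly nested form:
SELECT *
FROM A
INNER JOIN (B INNER JOIN C ON B.x = C.x) ON A.y = B.y;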
That is one of the worst queries I have ever seen. Since I cannot figure out how it is supposed to work without the underlying data, this is what I suggest to you.
First, find a good sample loan and write a query against this view to return the rows where loan_id = ... Now you have a data set you can check your changes against much more easily than the possibly millions of records the full view returns. Make sure these results make sense (that right join to tblObject is bothering me, as it makes no sense to return all of the object records).
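In other words, something like this sketch (the view name is a stand-in for your real one, and the sample id is a placeholder):
DECLARE @SampleLoanID int
SET @SampleLoanID = 12345  -- placeholder: use the id of the loan you picked
-- probe the existing view for one known-good loan
SELECT *
FROM dbo.vwTheProblemView
WHERE Loan_ID = @SampleLoanID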
Now start writing your query with what you think should be the first table (I would suggest tblLoan is the first table; if it is not, then the first table is tblObject left joined to tblLoan) and the WHERE clause for the loan id.
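Step one might then look like this (assuming, from the view's ON clauses, that tblLoan's key is Object_ID):
-- start small: just the driving table, filtered to the sample loan
SELECT L.*
FROM dbo.tblLoan L
WHERE L.Object_ID = @SampleLoanID  -- same placeholder as above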
Check your results: did you get the same loan information as the view query with the WHERE clause added?
Then add each join one at a time and see how it affects the query and whether the results appear to be going off track. Once you have figured out a query that gives the same results with all the tables added in, try it for several other loan ids. Once those have checked out, run the whole query with no WHERE clause and check against the view results (if it is a large number of rows, you may need to just see if the record counts match and visually spot-check; use ORDER BY on both queries to make sure the results are in the same order). In the process, try to use only left joins and not that combination of right and left joins (it's OK to leave the inner ones alone).
I make it a habit in complex queries to do all the inner joins first and then the left joins. I never use right joins in production code.
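Any right join can be flipped into a left join simply by swapping the order of the tables around it, for example:
-- these two queries return identical results:
SELECT O.Object_ID, L.ContactOwnerID
FROM dbo.tblLoan L
RIGHT OUTER JOIN dbo.tblObject O ON O.Object_ID = L.Object_ID

SELECT O.Object_ID, L.ContactOwnerID
FROM dbo.tblObject O
LEFT OUTER JOIN dbo.tblLoan L ON O.Object_ID = L.Object_ID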
Now you are ready to performance tune.
I would guess the right join to tblObject is causing a problem, in that it returns the whole table; the nature of that table name and the other joins to the same table lead me to believe the original author probably wanted a left join. Without knowing the meaning of the data, it is hard to be sure. So first, if you are returning too many records for one loan id, consider whether the real problem is that, as the tables have grown, returning that many records has become problematic.
Also consider that you can often take the view and replace it with code to get the same results. Views calling views are a poor technique that often leads to performance issues. Often the views on top of the other views call the same tables, and thus you end up joining to them multiple times when you don't need to.
Check your Explain plan or Execution plan depending on what database backend you have. Analysis of this should show where you might have missing indexes.
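In SQL Server, for instance, you can get the estimated plan as text before running anything (a sketch; in Management Studio you would normally just use the execution plan button):
SET SHOWPLAN_TEXT ON;
GO
SELECT * FROM dbo.vwTheProblemView WHERE Loan_ID = 12345;  -- placeholder query
GO
SET SHOWPLAN_TEXT OFF;
GO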
Also make sure that every table in the query is needed. This is especially true when you join to a view. The view may join to 12 other tables when you only need the data from one of them, and that one could join directly to one of your tables. Make sure that you are not using select * but only returning the fields you actually need. You have inner joins, so, by definition, select * is returning fields you don't need.
If the SELECT part of the view has a DISTINCT in it, consider whether you can weed out the multiple records that made the DISTINCT necessary by changing to a derived table or adding a WHERE clause. To see what is causing the multiples, you may need to temporarily use select * to see all the columns and find out which one is not unique and is causing the issue.
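One common way to weed them out is to pre-aggregate the many-side in a derived table so the DISTINCT becomes unnecessary; this sketch uses made-up names:
-- instead of SELECT DISTINCT across a one-to-many join,
-- collapse the many side first, then join one-to-one:
SELECT c.CustomerID, c.CustomerName, o.LastOrderDate
FROM dbo.Customers c
INNER JOIN (SELECT CustomerID, MAX(OrderDate) AS LastOrderDate
            FROM dbo.Orders
            GROUP BY CustomerID) o
    ON o.CustomerID = c.CustomerID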
This whole process is not going to be easy or fun. Just take it slowly, work carefully and methodically and you will get there and have a query that is understandable and maintainable in the end.

Inconsistent results from BigQuery: same query, different number of rows

I noticed today that one of my queries was returning inconsistent results: every time I run it, I get a different number of rows (cache deactivated).
Basically the query looks like this:
SELECT *
FROM mydataset.table1 AS t1
LEFT JOIN EACH mydataset.table2 AS t2
ON t1.deviceId=t2.deviceId
LEFT JOIN EACH mydataset.table3 AS t3
ON t2.email=t3.email
WHERE t3.email IS NOT NULL
AND (t3.date IS NULL OR DATE_ADD(t3.date, 5000, 'MINUTE')<TIMESTAMP('2016-07-27 15:20:11') )
The tables are not updated between each query. So I'm wondering if you also have noticed that kind of behaviour.
I usually write queries that return a lot of rows (>1000), so a few missing rows here and there are hardly noticeable. But this query returns only a few rows, and the count varies every time, between 10 and 20 rows :-/
If a Google engineer is reading this, here are two Job ID of the same query with different results:
picta-int:bquijob_400dd739_1562d7e2410
picta-int:bquijob_304f4208_1562d7df8a2
Unless I'm missing something, the query that you provide is completely deterministic and so should give the same result every time you execute it. But you say it's "basically" the same as your real query, so this may be due to something you changed.
There are a couple of things you can do to try to find the cause:
replace select * with an explicit selection of fields from your tables (a combination of fields that uniquely determines each row)
order the result by these fields, so that the order becomes the same each time you execute the query
simplify your query. In the above query, you can remove the first condition and turn the two left outer joins into inner joins and get the same result (see the sketch below). After that, you could start removing tables and conditions one by one.
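Putting those suggestions together, the simplified probe might look something like this (the field choice is a guess, since I don't know which combination uniquely identifies your rows):
SELECT t1.deviceId, t3.email, t3.date
FROM mydataset.table1 AS t1
JOIN EACH mydataset.table2 AS t2
ON t1.deviceId=t2.deviceId
JOIN EACH mydataset.table3 AS t3
ON t2.email=t3.email
WHERE t3.date IS NULL
OR DATE_ADD(t3.date, 5000, 'MINUTE')<TIMESTAMP('2016-07-27 15:20:11')
ORDER BY t1.deviceId, t3.email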
After each step, check if you still get different result sets. Then when you have found the critical step, try to understand why it causes your problem. (Or ask here.)

SQL query, view joined to table - number of results inconsistent

Apologies in advance for the vagueness of this question, but it involves a query which is too big to describe in full, and field/table names that I can't reveal. So I'm not really expecting a solution, but if someone could give some advice on how I could proceed in solving it myself, I'd be grateful.
SQL Server 2000.
I have a query which joins a view and a table with an INNER JOIN and has a WHERE clause:
SELECT
view.join_field
FROM
view INNER JOIN table ON view.join_field=table.join_field
WHERE
table.other_field='EE'
-- (23 rows)
This produces 23 results (it should be thousands). If I add another condition to the WHERE clause, I get more results instead of fewer:
SELECT
view.join_field
FROM
view INNER JOIN table ON view.join_field=table.join_field
WHERE
table.other_field='EE' AND
view.field2=1
This gives me a few thousand results, as was originally expected. Changing the value to 2 or 3 (the only other values present) also gives me thousands of results each, but if I change it to view.field2 IN (1,2,3) I end up with only 38 results.
Going back to the original query, which gave me 23 results, if I add the table field I have in the WHERE clause to the SELECT block, I get the right number of results:
SELECT
view.join_field,
table.other_field
FROM
view INNER JOIN table ON view.join_field=table.join_field
WHERE
table.other_field='EE'
-- (8764 rows)
If I instead use a WHERE clause of table.other_field='GG' (the only other value present in the table), none of these strange things happen, and I get the expected number of results.
If I SELECT the contents of view into a temporary table, and use that in my query, I also get the thousands of rows I was expecting.
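i.e. roughly this, keeping the anonymized names from above (bracketed here because view and table are reserved words):
SELECT *
INTO #view_snapshot
FROM [view]

SELECT v.join_field
FROM #view_snapshot v
INNER JOIN [table] t ON v.join_field = t.join_field
WHERE t.other_field = 'EE'
-- this version returns the thousands of rows I expected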
view itself is a LEFT OUTER JOIN of another view and two other tables. table, in my query, is not involved in any of the views.
Can anyone give me even the vaguest of ideas of what's going on? Are my tables or views corrupt, somehow?

That was not the right table: Access wiped the wrong data

I... don't quite know if I have the right idea about Access here.
I wrote the following, to grab some data that existed in two places:-
Select TableOne.*
from TableOne inner join TableTwo
on TableOne.[LINK] = TableTwo.[LINK]
Now, my interpretation of this is:
Find the table "TableOne"
Match the LINK field to the corresponding field in the table "TableTwo"
Show only records from TableOne that have a matching record in TableTwo
Just to make sure, I ran the query with some sample tables in SSMS, and it worked as expected.
So why, when I deleted the rows from within that query, did it delete the rows from TableTwo, and NOT from TableOne as expected? I've just lost ~3 days of work.
Edit: For clarity, I manually selected the rows in the query window and deleted them. I did not use a delete query - I've been stung by that a couple of times lately.
Since you have deleted the records manually, your query has to be updateable. This means that your query couldn't have been solely a Cartesian join or a join without referential integrity, since those queries are non-updateable in MS Access.
When I recreate your query based on two fields without indexes or primary keys, I am not even able to manually delete records. This leads me to believe a relationship was unknowingly established that deleted the records in TableTwo. Perhaps you should take a look in the design view of your queries and in the Relationships window, since the query itself should indeed select only records from TableOne.
Not sure why it got deleted, but I suggest rewriting your query:
DELETE FROM TableOne
WHERE [LINK] IN (SELECT [LINK] FROM TableTwo)
This should work for you (SQL Server syntax; note that the DELETE targets the alias of TableOne):
DELETE a
FROM TableOne a
INNER JOIN
TableTwo b ON b.[LINK] = a.[LINK]
AND [my filter condition]

Trying to fix SQL query with two tables

I have the following tables:
Entry
EntryID - int
EntryDate - datetime
Hour
EntryID - int
InHour - datetime
OutHour - datetime
For each record in the Entry table, there should be at least one (possibly many) corresponding records in the Hour table, like so:
Entry
EntryID: 8
EntryDate: 9/9/2010 12:31:25
Hour
EntryID: 8
InHour: 9/9/2010 12:31:25
OutHour: 9/9/2010 18:21:19
Now, this information is stored in two identical databases, one on the local machine and one on a server. I'm trying to write a query that will delete all the information that has already been passed to the server, with the condition that records that do not have an OutHour (null) must not be deleted.
I wrote the following query:
DELETE from [dbo].[Entry]
WHERE [dbo].[Entry].[EntryID] IN (SELECT [EntryID]
FROM [LINKEDSERVER].[MYDATABASE].[dbo].[Entry])
AND [dbo].[Entry].[EntryID] IN (SELECT [EntryID]
FROM [dbo].[Hour]
WHERE [OutHour] IS NOT NULL)
DELETE from [dbo].[Hour]
WHERE [dbo].[Hour].[InHour] IN (SELECT [InHour]
FROM [LINKEDSERVER].[MYDATABASE].[dbo].[Hour])
AND [dbo].[Hour].[OutHour] IS NOT NULL
AFAIK, this query first checks the Entry table and deletes any records that are already on the server, as long as they do not have a corresponding Hour record with a null OutHour. However, today I found out that an Entry record was deleted but the corresponding Hour record wasn't (it had a null OutHour).
What am I doing wrong? Any help is appreciated.
Thanks!
What's going wrong is that your second query only uses InHour, without referring to the EntryID. Also, your first query has its conditions completely independent of each other, which may not be a problem if your Hour table constraints are correct (the first column can never be null when the second is not null), but it's worth looking at.
In relational databases, it's best to get in the habit of thinking in terms of JOINs rather than IN(). Using IN() can often return the same results as a JOIN (with some differences in NULL handling) and often even gets the same execution plan, but it is #1 a "relaxed" way of thinking about the problem which doesn't lend itself well to the mental space needed for writing complex queries and #2 can't compare multiple values at once, it can only do a single comparison (at least in SQL Server, since some other DBMSes can do this).
Let me rewrite your queries as JOINs and maybe it will help you see what's wrong.
DELETE E
FROM
dbo.Entry E
INNER JOIN LINKEDSERVER.MYDATABASE.dbo.Entry L ON E.EntryID = L.EntryID
INNER JOIN Hour H ON E.EntryID = H.EntryID
WHERE
H.OutHour IS NOT NULL
DELETE H
FROM
dbo.Hour H
INNER JOIN LINKEDSERVER.MYDATABASE.dbo.Hour L ON H.InHour = L.InHour
WHERE
H.OutHour IS NOT NULL
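Seen this way, the fix also suggests itself: the second delete can match on both columns at once, which IN() could not do. A sketch of the repaired version (assuming an EntryID + InHour pair identifies an Hour row):
-- delete only Hour rows whose (EntryID, InHour) pair exists on the server
DELETE H
FROM dbo.Hour H
INNER JOIN LINKEDSERVER.MYDATABASE.dbo.Hour L
    ON H.EntryID = L.EntryID
    AND H.InHour = L.InHour
WHERE H.OutHour IS NOT NULL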
I recommend you put a cascade delete foreign key constraint on the hour table so that when you delete from the Entry table, the child Hour rows all disappear. There are still problems here as you could have many Hour rows per EntryID and semantically you can end up trying to delete the same row over the linked server multiple times.
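A sketch of that constraint (assuming Entry.EntryID is the primary key; the constraint name is made up):
ALTER TABLE dbo.Hour
ADD CONSTRAINT FK_Hour_Entry
    FOREIGN KEY (EntryID) REFERENCES dbo.Entry (EntryID)
    ON DELETE CASCADE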
Also, be aware that huge joins over linked servers can experience very poor performance because sometimes the query engine decides to pull huge rowsets over the link, even entire tables. You can mitigate this by doing things in batches, perhaps by first doing a select into a temp table based on a JOIN across the link, then deleting corresponding rows in small batches of 100 or 1000 or 5000 (testing is in order to find the right size).
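A rough shape for that batching approach (the temp table name and the batch size of 1000 are arbitrary):
-- pull the matching keys across the link once, into a local temp table
SELECT H.EntryID, H.InHour
INTO #to_delete
FROM dbo.Hour H
INNER JOIN LINKEDSERVER.MYDATABASE.dbo.Hour L
    ON H.EntryID = L.EntryID
    AND H.InHour = L.InHour
WHERE H.OutHour IS NOT NULL

-- then delete locally in modest batches
WHILE 1 = 1
BEGIN
    DELETE TOP (1000) H
    FROM dbo.Hour H
    INNER JOIN #to_delete T
        ON T.EntryID = H.EntryID
        AND T.InHour = H.InHour
    IF @@ROWCOUNT = 0 BREAK
END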
Last, if you do find that your queries are unnecessarily pulling huge sets of data over the link (determine this by running Query Profiler on the remote machine to see what actual queries are being submitted), then strategic use of CROSS APPLY can help by forcing row-by-row processing, which in the case of linked servers can be an enormous performance improvement, despite how counter-intuitive that is compared to the standard and strong recommendation to never do row-by-row processing in relational databases. Think of it as forcing a "stretch bookmark lookup" rather than a "stretch table scan" and you'll get an inkling of why this can be such a big help.
My very first suggestion is to put a foreign key relationship between the two on EntryID. This will prevent any deletions from the Entry table without first removing all instances from the Hour table.
Secondly, with a foreign key in place you have to do it from the child to the parent (i.e., start at the bottom of the hierarchy). This means I would do this first:
delete from dbo.Hour where OutHour is not null
delete e
from dbo.Entry e
left outer join dbo.Hour h
on e.entryid=h.entryid
where h.entryid is null