Should I avoid loops in joins?

I remember being taught never to create a loop when joining tables in SQL.
In fact, Business Objects even warns me when there are loops in the schema I've defined in the Universe.
I've tried searching the web for a reference to this rule, but I wasn't able to find one.
Why is it dangerous to do this?
Edit: maybe I was too succinct.
My question wasn't about looping in the sense of a "FOR LOOP" or similar.
I was talking about something like this WHERE clause in a SELECT statement:
WHERE TABLE1.foo = TABLE2.foo
AND TABLE2.bar = TABLE3.bar
AND TABLE3.baz = TABLE1.baz
If you draw the relations, you will see "a loop" in the joins.
Is this dangerous from a correctness and/or performance point of view?
Thanks to all.
Edit 2: added an example.
I've just thought of an example; maybe it isn't the best, but I think it will serve to aid understanding.
------------        -----------------        ---------------------
- DELIVERY -        - DELIVERY_DATE -        - DELIVERY_DETAILS  -
------------        -----------------        ---------------------
- id       - <----- - delivery_id   -   +--- - date_id           -
- company  -        - id            - <-+    - product           -
- year     -        - date          -        - quantity          -
- number   -        -----------------        - datetime_of_event -
- customer -                                 ---------------------
------------
     1 <-----> N            1 <-----> N
In the DELIVERY table every delivery appears only once.
In the DELIVERY_DATE table we have the list of every date on which the delivery was processed. So, a delivery may be prepared over several days.
In the last table we have the details of every delivery. So, in this table we track every event related to the preparation of the delivery.
So, the cardinalities are 1:N for each pair of tables.
The join is very simple:
DELIVERY.id = DELIVERY_DATE.delivery_id AND
DELIVERY_DATE.id = DELIVERY_DETAILS.date_id
Now, suppose I want to join another table, where I have some other information for a delivery on a certain date.
Let's define it:
------------
- EMPLOYEE -
------------
- company  -
- year     -
- number   -
- date     -
- employee -
------------
Now the join should be:
DELIVERY.id = DELIVERY_DATE.delivery_id AND
EMPLOYEE.company = DELIVERY.company AND
EMPLOYEE.year = DELIVERY.year AND
EMPLOYEE.number = DELIVERY.number AND
EMPLOYEE.date = DELIVERY_DATE.date
To sum up, I'll end up with EMPLOYEE joining both DELIVERY and DELIVERY_DATE, creating the cycle in the join.
Should I rewrite it this way instead?
EMPLOYEE.company = DELIVERY.company AND
EMPLOYEE.year = DELIVERY.year AND
EMPLOYEE.number = DELIVERY.number AND
EMPLOYEE.date IN (SELECT date FROM DELIVERY_DATE d WHERE d.delivery_id = DELIVERY.id)
Edit 3: finally found a link.
As usual, when you've given up searching for a link, you find it.
So, this article explains it all. It's related to Business Objects, but the content is generic.
Thanks to all for your time.

EDIT: I see from the update that this is an issue specific to a BO designer, where a table is used more than once but BO automatically combines join clauses, which then incorrectly (or, rather, unintentionally) restricts the result set. This question actually has nothing to do with cycles per se; it's really about using entities in more than one context in the same query. I'll leave my original answer below, even though it doesn't really address the OP's concern.
Disclaimer: This is a non-answer answer, because it's both an answer and a question. It probably should be a comment, but you can't comment unless you ask/answer questions; since I genuinely want to help, I'll do it the only way I can, even if it's not the way things are done here. So sue me.
The short answer is no, you shouldn't avoid loops (or "cycles") in joins.
More to the point, queries are constructed so as to declare the correct logical condition(s) to produce the data you're looking for. The same logical condition can often be established in many different ways, so sometimes it makes sense to ask if one way is preferable to another. This becomes of particular interest when performance is important. But first and foremost a query must return the right result set. How this is accomplished depends on the underlying schema. And this is what you really should be focused on.
In your example, what role does EMPLOYEE play with regard to the DELIVERY tables? Why would you join on those columns? What would it mean for an EMPLOYEE to have the same "date" as a delivery? I understand it's a contrived example, but the point I'm trying to make is that whether joins create a cycle in the graph or not is wholly (well, principally) dependent on what a particular result set's logical meaning is.
And on the issue of JOIN syntax, using JOIN...ON clauses is preferable to WHERE clauses because it separates what you need to do to combine entities from data filtering operations.

Well, your question is easy. You should not be using where clauses to perform joins. You should be using on clauses. Your joins can be expressed as:
from Table1 join
     Table2
     on Table1.foo = Table2.foo join
     Table3
     on Table2.bar = Table3.bar and
        Table1.baz = Table3.baz
Whether this is appropriate or not depends on your data structures. Sometimes it is. You shouldn't worry about it.
By the way, I wouldn't refer to this as "loops", a term strongly associated with "for loops" in programming and with nested loop joins in SQL. You might refer to this as cycles in join conditions.
Wmax . . . as for the new join syntax. It is not "just" a matter of taste. The "," in a from clause means "cross join". This is a very expensive operation, in most cases. It is much better to be clear about what you want to accomplish:
FROM A cross join B
is much clearer about its intentions than:
FROM A, B
Second, if you leave the "," out, you still have valid syntax:
FROM A B
However, this means something very different: it assigns the alias B to the table A.
The third reason is the most important reason. The old syntax does not have a way of expressing left outer join, right outer join, and full outer join.
So, to write clearer queries, to avoid errors, and to access better functionality, you should learn the new syntax.
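For example, a left outer join, which the comma syntax cannot express, reads naturally in the explicit syntax (A and B here are placeholder tables, not from the question):

SELECT A.id, B.some_column
FROM A LEFT OUTER JOIN B
     ON A.id = B.a_id

Every row of A appears in the result, with NULLs in B's columns where no match exists.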

Related

Understanding an SQL Query

I'm new to SQL and I've been racking my brain trying to figure out exactly what a query I received at work to modify is stating. I believe it's using an alias, but I'm not sure why, because it only refers to a single table. I think it's a fairly simple one; I just don't get it.
select [CUSTOMERS].Prefix,
       [CUSTOMERS].NAME,
       [CUSTOMERS].Address,
       [CUSTOMERS].[START_DATE],
       [CUSTOMERS].[END_DATE]
from [my_Company].[CUSTOMERS]
where [CUSTOMERS].[START_DATE] =
      (select max(a.[START_DATE])
       from [my_company].[CUSTOMERS] a
       where a.Prefix = [CUSTOMERS].Prefix
         and a.Address = [CUSTOMERS].ADDRESS
         and coalesce(a.Name, 'Go-Figure') =
             coalesce([CUSTOMERS].a.Name, 'Go-Figure'))
Here's a shot at it in English...
It looks like the intent is to get a list of customer names, addresses, and start dates.
But the table is expected to contain more than one row with the same customer name and address, and the author wants only the row with the most recent start date.
Fine Points:
If a customer has the same name and address and prefix as another customer, the one with the most recent start date appears.
If a customer's name is missing, 'Go-Figure' is used in the comparison, so two rows with missing names will match, and the one with the most recent start date will be returned. A row with a missing name will not match a row that has a name; both rows will be returned (see the sketch after this list).
Any row that has no start date will be excluded from results.
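You can see the COALESCE trick in isolation with a one-off sketch like this (SQL Server syntax; the literals are made up):

-- NULL = NULL evaluates to UNKNOWN, so rows with missing names would
-- never pair up; COALESCE substitutes a sentinel so they compare equal
select case when coalesce(NULL, 'Go-Figure') = coalesce(NULL, 'Go-Figure')
            then 'match' else 'no match'
       end
-- returns 'match'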
This does not look like a query from a real business application. Maybe it's just a conceptual prototype. It is full of problems in most real world situations. Matching names and addresses with simple equality just doesn't work well in the real world, unless the names and addresses are already cleaned and de-duplicated by some other process.
Regarding the use of alias: Yes. The sub-query uses a as an alias for the my_Company.CUSTOMERS table.
I believe there is an error on the last line.
[CUSTOMERS].a.Name
is not a valid reference. It was probably meant to be
[CUSTOMERS].Name
I assume it selects the customer records from the [CUSTOMERS] table with the most recent [CUSTOMERS].[START_DATE].
@Joshp gave a good answer, although I have seen these kinds of queries and worse in all kinds of real applications.
See if the query below gives you the same result, though. The queries would not be equivalent in general, but I suspect they are the same with the data you've got. I believe the only assumption I'm making is that the ranges between start and end dates never intersect or overlap, which implies that the max start and max end are always together in the same row.
select c.Prefix,
       c.NAME,
       c.Address,
       max(c.START_DATE) as Start_Date,
       max(c.END_DATE) as End_Date
from my_Company.CUSTOMERS as c
group by c.Prefix, c.NAME, c.Address
You'll notice the alias is a nice shorthand that keeps the query readable. Of course when there's only a single table they aren't strictly necessary at all.

SQL, to loop or not to loop?

The problem story goes like this:
Consider a program to manage bank accounts with balance limits for each customer {table Customers, table Limits}, where for each Customer.id there is one Limit record.
Then the client asked to store a history of the limits' changes. That's not a problem, since I already had a date column for Limit, but the view/query for the active (latest) limits needs to change.
Before: Customer-Limit was 1 to 1, so a simple select did the job.
Now: it would show all of the Limits records, i.e. multiple records for each Customer, and I need the latest Limits only, so I thought of something like this pseudo code:
foreach( id in Customers )
{
    select top 1 *
    from Limits
    where Limits.customer_id = id
    order by Limits.date desc   -- latest first
}
but while looking through SO for similar issues, I came across stuff like
"95% of the time when you need a looping structure in tSQL you are probably doing it wrong"-JohnFx
and
"SQL is primarily a set-orientated language - it's generally a bad idea to use a loop in it."-Mark Bannister
Can anyone confirm/explain why it is wrong to loop? And in the problem explained above, what am I getting wrong that makes me think I need to loop?
Thanks in advance.
Update: my solution.
In light of TomTom's answer & the link suggested here, and before Dean kindly answered with code, I came up with this:
SELECT *
FROM Customers c
LEFT JOIN Limits a ON a.customer_id = c.id
AND a.date =
(
SELECT MAX(date)
FROM Limits z
WHERE z.customer_id = a.customer_id
)
thought I'd share :>
thanks for your response,
happy coding
Will this do?
;with l as (
    select *,
           row_number() over (partition by customer_id order by date desc) as rn
    from limits
)
select *
from customers c
left join l on l.customer_id = c.id and l.rn = 1
I am assuming that earlier (i.e. before implementing the history functionality) you must have been updating the Limits table. Now, to implement the history functionality, you have started inserting new records. Doesn't this trigger a lot of changes in your database and code?
Instead of inserting new records, how about keeping the original functionality as is and creating a new table, say Limits_History, which will store all the old values from the Limits table before updating it? Then all you need to do is fetch records from this table if you want to show history. This will not cause any changes in your existing SPs and code, and hence will be less error prone.
To insert records into the Limits_History table, you can simply create an AFTER UPDATE trigger and use the deleted magic table. Hence you need not worry about calling an SP or something to maintain history; the trigger will do this for you. Good examples of triggers are here.
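A minimal sketch of such a trigger in T-SQL (customer_id and date come from the question; limit_amount is an assumed column name for illustration):

create trigger trg_Limits_History on Limits
after update
as
begin
    -- "deleted" holds the pre-update rows, i.e. the old limit values
    -- (limit_amount is an assumed column)
    insert into Limits_History (customer_id, limit_amount, date)
    select d.customer_id, d.limit_amount, d.date
    from deleted d;
end;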
Hope this helps
It is wrong. You can do the same by querying customers and limits with a subquery that limits the result to the most recent record in Limits.
This is similar in concept to the query presented in Most recent record in a left join
You may have to do so in 2 joins - get the most recent date, then get the limit for that date (see the sketch below). While this may look complex, it is a beginner issue; talk complex when you have SQL statements reaching 2 printed pages and more ;)
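That two-step approach might look like this (a sketch using the question's table names; untested):

SELECT c.*, l.*
FROM Customers c
LEFT JOIN (SELECT customer_id, MAX(date) AS max_date
           FROM Limits
           GROUP BY customer_id) latest
       ON latest.customer_id = c.id
LEFT JOIN Limits l
       ON l.customer_id = latest.customer_id
      AND l.date = latest.max_date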
Now, for an operational system the table design is broken - Limits should contain the most recent limit, and a LimitHistory table the historical (or: all) entries, allowing fast retrieval of the CURRENT limit (which is the one to apply to all transactions) without the overhead of the history. The table design you have assumes all limits are identical - that may be the truth (is the truth) for a reporting data warehouse, but it is wrong for a transactional system, as the history is not transacted.
Confirmation of why looping is wrong is exactly in the quoted parts of your question - SQL is a set-oriented language.
This means that when you work on sets there's no reason to loop through the individual rows, because you already have the 'result' (set) of data you want to work on.
The work you are doing should therefore be done on the set of rows, because otherwise your selection is wrong.
That being said, there are of course situations where looping is done in SQL: generally via cursors if looping over data, or via a while loop if calculating things (generally speaking - there are always exceptions).
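For reference, the cursor form looks roughly like this in T-SQL (a skeleton only; the per-row body is left as a comment):

declare @id int;
declare cust_cursor cursor for
    select id from Customers;
open cust_cursor;
fetch next from cust_cursor into @id;
while @@FETCH_STATUS = 0
begin
    -- per-row work would go here, e.g. the top-1 lookup from the question
    fetch next from cust_cursor into @id;
end;
close cust_cursor;
deallocate cust_cursor;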
However, as also mentioned in the quotes, often when you feel like using a loop you either shouldn't (it's poor performance) or you're doing logic in the wrong part of your application.
Basically, it is similar to how object-oriented languages work on objects and references to those objects. A set-based language works on - well, sets of data.
SQL is basically made to function in that manner - query relational data into result sets - so when working with the language, you should let it do what it does well and work on that. Just as if it were Java or any other language.

What can I do to speed this slow query up?

We have a massive, multi-table Sybase query we call the get_safari_exploration_data query, which fetches all sorts of info related to explorers going on a safari and all the animals they encounter.
This query is slow, and I've been asked to speed it up. The first thing that jumps out at me is that there doesn't seem to be a pressing need for the nested SELECT statement inside the outer FROM clause. In that nested SELECT, there also seem to be several fields that aren't necessary (vegetable, broomhilda, devoured, etc.). I'm also skeptical about the use of the joins ("*=" instead of "INNER JOIN...ON").
SELECT
dog_id,
cat_metadata,
rhino_id,
buffalo_id,
animal_metadata,
has_4_Legs,
is_mammal,
is_carnivore,
num_teeth,
does_hibernate,
last_spotted_by,
last_spotted_date,
purchased_by,
purchased_date,
allegator_id,
cow_id,
cow_name,
cow_alias,
can_be_ridden
FROM
(
SELECT
mp.dog_id as dog_id,
ts.cat_metadata + '-yoyo' as cat_metadata,
mp.rhino_id as rhino_id,
mp.buffalo_id as buffalo_id,
mp.animal_metadata as animal_metadata,
isnull(mp.has_4_Legs, 0) as has_4_Legs,
isnull(mp.is_mammal, 0) as is_mammal,
isnull(mp.is_carnivore, 0) as is_carnivore,
isnull(mp.num_teeth, 0) as num_teeth,
isnull(mp.does_hibernate, 0) as does_hibernate,
jungle_info.explorer as last_spotted_by,
exploring_journal.spotted_datetime as last_spotted_date,
early_jungle_info.explorer as purchased_by,
early_exploreration_journal.spotted_datetime as purchased_date,
alleg_id as allegator_id,
ho.cow_id,
ho.cow_name,
ho.cow_alias,
isnull(mp.is_ridable,0) as can_be_ridden,
ts.cat_metadata as broomhilda,
ts.squirrel as vegetable,
convert (varchar(15), mp.rhino_id) as tms_id,
0 as devoured
FROM
mammal_pickles mp,
very_tricky_animals vt,
possibly_venomous pv,
possibly_carniv_and_tall pct,
tall_and_skinny ts,
tall_and_skinny_type ptt,
exploration_history last_exploration_history,
master_exploration_journal exploring_journal,
adventurer jungle_info,
exploration_history first_exploration_history,
master_exploration_journal early_exploreration_journal,
adventurer early_jungle_info,
hunting_orders ho
WHERE
mp.exploring_strategy_id = 47
and mp.cow_id = ho.cow_id
and ho.cow_id IN (20, 30, 50)
and mp.rhino_id = vt.rhino_id
and vt.version_id = pv.version_id
and pv.possibly_carniv_and_tall_id = pct.possibly_carniv_and_tall_id
and vt.tall_and_skinny_id = ts.tall_and_skinny_id
and ts.tall_and_skinny_type_id = ptt.tall_and_skinny_type_id
and mp.alleg_id *= last_exploration_history.exploration_history_id
and last_exploration_history.master_exploration_journal_id *= exploring_journal.master_exploration_journal_id
and exploring_journal.person_id *= jungle_info.person_id
and mp.first_exploration_history_id *= first_exploration_history.exploration_history_id
and first_exploration_history.master_exploration_journal_id *= early_exploreration_journal.master_exploration_journal_id
and early_exploreration_journal.person_id *= early_jungle_info.person_id
) TEMP_TBL
So I ask:
Am I correct about the nested SELECT?
Am I correct about the unnecessary fields inside the nested SELECT?
Am I correct about the structure/syntax/usage of the joins?
Is there anything else about the structure/nature of this query that jumps out at you as being terribly inefficient/slow?
Unfortunately, unless there is irrefutable, matter-of-fact proof that decomposing this large query into smaller queries is beneficial in the long run, management will simply not approve refactoring it out into multiple, smaller queries, as this will take considerable time to refactor and test. Thanks in advance for any help/insight here!
Am I correct about the nested SELECT?
You would be in some cases, but a competent planner would collapse it and ignore it here.
Am I correct about the unnecessary fields inside the nested SELECT?
Yes, especially considering that some of them don't show up at all in the final list of fields.
Am I correct about the structure/syntax/usage of the joins?
Insofar as I'm aware, *= and =* are merely syntactic sugar for a left and right join, but I might be wrong in stating that. If not, then they merely force the way joins occur, but they may be necessary for your query to work at all.
Is there anything else about the structure/nature of this query that jumps out at you as being terribly inefficient/slow?
Yes.
Firstly, you've some calculations that aren't needed, e.g. convert (varchar(15), mp.rhino_id) as tms_id. Perhaps a join or two as well, but I admittedly haven't looked at the gory details of the query.
Next, you might have a problem with the db design itself, e.g. a cow_id field. (Seriously? :-) )
Last, there occasionally is something to be said about doing multiple queries instead of a single one, to avoid doing tons of joins.
In a blog, for instance, it's usually a good idea to grab the top 10 posts, and then to use a separate query to fetch their tags (where id in (id1, id2, etc.)). In your case, the selective part seems to be around here:
mp.exploring_strategy_id = 47
and mp.cow_id = ho.cow_id
and ho.cow_id IN (20, 30, 50)
so maybe isolate that part in one query, then build an in () clause using the resulting IDs, and fetch the cosmetic bits and pieces in one or more separate queries, along the lines of the sketch below.
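A rough sketch of that split, using the question's tables (the literal IDs in the second query stand in for whatever the first one returns):

SELECT mp.rhino_id
FROM mammal_pickles mp
JOIN hunting_orders ho ON mp.cow_id = ho.cow_id
WHERE mp.exploring_strategy_id = 47
  AND ho.cow_id IN (20, 30, 50)

-- then, with the IDs collected above:
SELECT vt.rhino_id, ts.cat_metadata
FROM very_tricky_animals vt
JOIN tall_and_skinny ts ON vt.tall_and_skinny_id = ts.tall_and_skinny_id
WHERE vt.rhino_id IN (101, 102, 103)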
Oh, and as pointed out by Gordon, check your indexes as well. But then, note that the indexes may end up being of little use without splitting the query into more manageable parts.
I would suggest the following approach.
First, rewrite the query using ANSI standard joins with the ON clause (see the sketch below). This will make the conditions and filtering much easier to understand. Also, this is "safe" -- you should get exactly the same results as the current version. Be careful, because *= is an outer join, so not everything is an inner join.
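A partial sketch of that rewrite, covering just the first few conditions (the remaining tables follow the same pattern; untested):

FROM mammal_pickles mp
JOIN hunting_orders ho
     ON mp.cow_id = ho.cow_id
JOIN very_tricky_animals vt
     ON mp.rhino_id = vt.rhino_id
LEFT JOIN exploration_history last_exploration_history
     ON mp.alleg_id = last_exploration_history.exploration_history_id
WHERE mp.exploring_strategy_id = 47
  AND ho.cow_id IN (20, 30, 50)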
I doubt this step will improve performance.
Then, check each of the reference tables and be sure that the join keys have indexes on them in the reference table. If keys are missing, then add them in.
Then, check whether the left outer joins are necessary. There are filters on tables that are left outer joined in . . . these filters convert the outer joins to inner joins. Probably not a performance hit, but you never know.
Then, consider indexing the fields used for filtering (in the where clause).
And, learn how to use the explain capabilities. Any nested loop joins (without an index) are likely culprits for performance problems.
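In Sybase, the usual way to get at this is showplan (syntax as I recall it; check your version's documentation):

set showplan on
go
-- run the query here
go
set showplan off
go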
As for the nested select, I think Sybase is smart enough to "do the right thing". Even if it wrote out and re-read the result set, that probably would have a marginal effect on the query compared to getting the joins right.
If this is your real data structure, by the way, it sounds like a very interesting domain. It is not often that I see a field called allegator_id in a table.
I will answer some of your questions.
You think that the fields (vegetable, broomhilda, devoured) in the nested SELECT could be causing a performance issue. Not necessarily. The two unused fields (vegetable, broomhilda) in the nested SELECT come from the ts table, but the cat_metadata field, which is used, is also from the ts table. So unless cat_metadata is covered by the index used on the ts table, there won't be any performance impact, because to extract the cat_metadata field the data page from the table will need to be fetched anyway. Extracting the other two fields will take a little CPU, that's it. So don't worry about that. The 'devoured' field is also a constant; that will not affect performance either.
Dennis pointed out the usage of the convert function, convert(varchar(15), mp.rhino_id). I disagree that it will affect performance, as it will consume CPU only.
Lastly, I would say: try setting the table count to 13, as there are 13 tables in there. By default, Sybase considers four tables at a time for optimisation.
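For reference, that session setting looks like this (Sybase ASE syntax as I recall it; verify against your version):

set table count 13
go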

Can scalar functions be applied before filtering when executing a SQL Statement?

I suppose I have always naively assumed that scalar functions in the select part of a SQL query will only get applied to the rows that meet all the criteria of the where clause.
Today I was debugging some code from a vendor and had that assumption challenged. The only reason I can think of for this code failing is that the Substring() function is getting called on data that should have been filtered out by the WHERE clause. It appears that the substring call is being applied before the filtering happens, and that is why the query is failing.
Here is an example of what I mean. Let's say we have two tables, each with 2 columns, having 2 rows and 1 row respectively. The first column in each is just an id. NAME is just a string, and NAME_LENGTH tells us how many characters are in the name with the same ID. Note that only names with more than one character have a corresponding row in the LONG_NAMES table.
NAMES: ID, NAME
1, "Peter"
2, "X"

LONG_NAMES: ID, NAME_LENGTH
1, 5
If I want a query to print each name with the last 3 letters cut off, I might first try something like this (assuming SQL Server syntax for now):
SELECT substring(NAME,1,len(NAME)-3)
FROM NAMES;
I would soon find out that this would give me an error, because when it reaches "X" it will try to use a negative number as the length in the substring call, and it will fail.
The way my vendor decided to solve this was by filtering out rows where the strings were too short for the len - 3 query to work. He did it by joining to another table:
SELECT substring(NAMES.NAME,1,len(NAMES.NAME)-3)
FROM NAMES
INNER JOIN LONG_NAMES
ON NAMES.ID = LONG_NAMES.ID;
At first glance, this query looks like it might work. The join condition will eliminate any rows that have NAME fields short enough for the substring call to fail.
However, from what I can observe, SQL Server will sometimes try to calculate the substring expression for everything in the table, and then apply the join to filter out rows. Is this supposed to happen this way? Is there a documented order of operations where I can find out when certain things will happen? Is it specific to a particular database engine, or part of the SQL standard? If I decided to include some predicate on my NAMES table to filter out short names (like len(NAME) > 3), could SQL Server also choose to apply that after trying to apply the substring? If so, then it seems the only safe way to do a substring would be to wrap it in a "case when" construct in the select?
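For reference, that defensive rewrite would look something like this (a sketch in SQL Server syntax):

SELECT CASE WHEN len(NAME) > 3
            THEN substring(NAME, 1, len(NAME) - 3)
       END                 -- NULL for short names instead of an error
FROM NAMES;

CASE evaluates its WHEN condition before the THEN expression (with some documented caveats around aggregates), which is why the guard works regardless of how the optimizer reorders the rest of the query.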
Martin gave this link (TSQL divide by zero encountered despite no columns containing 0), which pretty much explains what is going on: the query optimizer has free rein to reorder things however it likes. I am including this as an answer so I can accept something. Martin, if you create an answer with your link in it, I will gladly accept that instead of this one.
I do want to leave my question here because I think it is a tricky one to search for, and my particular phrasing of the issue may be easier for someone else to find in the future.
EDIT: As more responses have come in, I am again confused. It does not seem clear yet when exactly the optimizer is allowed to evaluate things in the select clause. I guess I'll have to go find the SQL standard myself and see if I can make sense of it.
Joe Celko, who helped write early SQL standards, has posted something similar to this several times in various USENET newsgroups. (I'm skipping over the clauses that don't apply to your SELECT statement.) He usually said something like "this is how statements are supposed to act like they work". In other words, SQL implementations should behave exactly as if they performed these steps, without actually being required to perform each of them:
1. Build a working table from all of the table constructors in the FROM clause.
2. Remove from the working table those rows that do not satisfy the WHERE clause.
3. Construct the expressions in the SELECT clause against the working table.
So, following this, no SQL dbms should act like it evaluates functions in the SELECT clause before it acts like it applies the WHERE clause.
In a recent posting, Joe expands the steps to include CTEs.
CJ Date and Hugh Darwen say essentially the same thing in chapter 11 ("Table Expressions") of their book A Guide to the SQL Standard. They also note that this chapter corresponds to the "Query Specification" section (sections?) in the SQL standards.
You are thinking about something called the query execution plan. It's based on query optimization rules, indexes, temporary buffers and execution-time statistics. If you are using SQL Server Management Studio, you have a toolbar above your query editor where you can look at the estimated execution plan, which shows how your query will be transformed to gain some speed. So if, say, your NAMES table is already in the buffer, the engine might first run the subquery on your data and then join it with the other table.

SQL: Alternative to "First" function?

I'm trying to write a query that I don't want any Cartesian products in. I was going to use the First function, because some Type_Codes have multiple descriptions, and I don't want to multiply my dollars.
Select Sum(A.Dollar) as Dollars,
       A.Type_Code,
       First(B.Type_Description) as FstTypeDescr
From Totals A,
     TypDesc B
Where A.Type_Code = B.Type_Code
Group by A.Type_Code
I just want to grab ANY of the descriptions for a given code (I don't really care which one). I get the following error when trying to use FIRST:
[IBM][CLI Driver][DB2/AIX64] SQL0440N No authorized routine named "FIRST" of type "FUNCTION"
Is there another way to do this?
Instead of First(), use MIN().
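Applied to the query above, that is the same shape with MIN in place of First (standard SQL, so it should work on DB2):

Select Sum(A.Dollar) as Dollars,
       A.Type_Code,
       Min(B.Type_Description) as FstTypeDescr
From Totals A,
     TypDesc B
Where A.Type_Code = B.Type_Code
Group by A.Type_Code

Since you don't care which description you get, any aggregate that picks a single value per group (MIN or MAX) will do; MIN just makes the choice deterministic.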
first() is not standard SQL. I forget which database product it works in, but it's not in most SQL engines. As Recursive points out, min() accomplishes the same thing for your purposes here; the difference is that, depending on indexes and other components of the query, it may require a scan of many records to find the minimum value, when in your case -- and many of my own -- all you really want is ANY match. I don't know a standard SQL way to ask that question. SQL appears to have been designed by mathematicians seeking a rigorous application of set theory, rather than by practical computer geeks seeking to solve real-world problems as quickly and efficiently as possible.
I forget the actual name of this feature, but you can make it so you actually join to a subquery. In that subquery, you can do as oxbow_lakes suggests and use "top 1".
Something like:
select * from table1
inner join (select top 1 id from table2) t2 on t2.id = table1.id