Strategy for avoiding a common sql development error (misleading result on join bug)

Strategy for avoiding a common sql development error (misleading result on join bug) - sql

Sometimes when i'm writing moderately complex SELECT statements with a few JOINs, wrong key columns are sometimes used in the JOIN statement that still return valid-looking results.
Because the auto numbering values (especially early in development) all tend to fall in similar ranges (sub 100s or so) the SELECT sill produces some results. These results often look valid at first glance and a problem is not detected until much, much later making debugging much more difficult because familiarity with the data structures and code has staled. (Gone stale in the dev's mind.)
i just spent several hours tracking down yet another of this issue that i've run into a too many times before. i name my tables and columns carefully, write my SQL statements methodically but this is an issue i can't seem to competely avoid. It comes back and bites me for hours of productivity about twice a year on average.
My question is: Has anyone come up with a clever method for avoiding this; what i assume is probably a common SQL bug/mistake?
i have thought of trying to auto-number starting with different start values but this feels cludgy and would get ugly trying to keep such a scheme straight for data models with dozens of tables... Any better ideas?
P.S.
i am very careful and methodical in naming my tables and columns. Patient table gets PatientId column, Facility get a FacilityId etc. This issues tends to arise when there are join tables involved where the linkage takes on extra meaning such as: RelatedPatientId, ReferingPatientId, FavoriteItemId etc.

When writing long complex SELECT statements try to limit the result to one record.
For instance, assume you have this gigantic enormous awesome CMS system and you have to write internal reports because the reports that come with it are horrendous. You notice that there are about 500 tables. Your select statement joins 30 of these tables. Your result should limit your row count by using a WHERE clause.
My advice is to rather then get all this code written and generalized for all cases, break the problem up and use WHERE and limit the row count to only say a record. Check all fields, if they look ok, break it up and let your code return more rows. Only after further checking should you generalize.
It bites a lot of us who keep adding more and more joins until it seems to look ok, but only after Joe Blow the accountant runs the report does he realize that the PO for 4 million was really the telephone bill for the entire year. Somehow that join got messed up!

One option would be to use your natural keys.
More practically, Red Gate SQL Prompt picks the FK columns for me.
I also tend to build up one JOIN at a time to see how things look.

If you have a visualization or diagramming tool for your SQL statements, you can follow the joins visually, and any errors will become immediately apparent, provided you have followed a sensible naming scheme for your primary and foreign keys.

Your column names should take care of this unless you named them all "ID". Are you writing multiple select statement using the same tables? You may want to create views for the more common ones.

If you're using SQL Server, you can use GUID columns as primary keys (that's what we do). You won't have problems with collisions again.

You could use GUIDs as your primary keys, but it has its pros and cons.
This pro is actually not mentioned on that page.
I have never tried doing this myself - I use a tool on top of SQL that makes incorrect joins very unlikely, so I don't have this problem. I just thought I'd mention it as another option though!

For IDs use TableNameID, for example for table Person, use PersonID
Use db model and look at the drawing when writing queries.
This way join looks like:
... ON p.PersonID = d.PersonID
as opposed to:
... ON p.ID = d.ID
Auto-increment integer PKs are among your best friends.

Related

SQL Server Performance Tuning of large table

I have a table of 755 columns and around holding 2 million records as of now and it will grow.There are many procedures accessing it with other tables join, are running slow. Now it's hard to split/normalize them as everything is already built and customer is not ready to spend much on it. Is there any way to make the query access to that table faster? Please advise.
Will column store index help?

How little are they prepared to spend?
It may be possible to split this table into multiple 1 to 1 joined tables (vertical partitioning), then use a view to present it as one single blob to existing code.
With some luck you may get join elimination happening frequently enough to make it worthwhile.
View will probably require INSTEAD OF triggers to fully replicate existing logic. INSTEAD OF triggers have a number of restrictions e.g. no support for OUTPUT clause, which can prove to be to hard to overcome depending on your specific setup.
You can name your view the same as existing table, which will eliminate the need of fixing code everywhere.
IMO this is the simplest you can do short of a full DB re-factoring exercise.
See: http://aboutsqlserver.com/2010/09/15/vertical-partitioning-as-the-way-to-reduce-io/ and https://logicalread.com/sql-server-optimizer-may-eliminate-foreign-key-joins-mc11/#.WXgEzlERW6I

755 Columns thats a lot. You should try to index the columns that are mostly used in where clause. this might speed up the process

It is fine, dont worry about it, actually how many columns you have it is not important in sql server (But be careful I said 'have'). The main problem is data count and how many column you select in queries. There is a few point firstly you can check.
Do not use * selector and change it if used in everywhere
In the joins, do not use it directly, you can firstly filter it as inner select. (Just try it, I have no idea about your table so I m telling the general rules.)
Try the diminish data count for ex: use history table for old records. This technicque depends on needs of your organization.
Try to use column index and something like that features.
And of course remove dynamic selects in your queries.
I wish one of them will work.

How to tell if a query will scale well?

What are some of the methods/techniques experienced SQL developers use to determine if a particular SQL query will scale well as load increases, rows in associated tables increase etc.

Some rules that I follow that make the most difference.
Don't use per-row functions in your queries like if, case, coalesce and so on. Work around them by putting data in the database in the format you're going to need it, even if that involves duplicate data.
For example, if you need to lookup surnames fast, store them in the entered form and in their lowercase form, and index the lowercase form. Then you don't have to worry about things like select * from tbl where lowercase(surname) = 'smith';
Yes, I know that breaks 3NF but you can still guarantee data integrity by judicious use of triggers or pre-computed columns. For example, an insert/update trigger on the table can force the lower_surname column to be set to the lowercase version of surname.
This moves the cost of conversion to the insert/update (which happens infrequently) and away from the select (which happens quite a lot more). You basically amortise the cost of conversion.
Make sure that every column used in a where clause is indexed. Not necessarily on its own but at least as the primary part of a composite key.
Always start off in 3NF and only revert if you have performance problems (in production). 3NF is often the easiest to handle and reverting should only be done when absolutely necessary.
Profile, in production (or elsewhere, as long as you have production data and schemas). Database tuning is not a set-and-forget operation unless the data in your tables never changes (very rare). You should be monitoring, and possibly tuning, periodically to avoid the possibility that changing data will bring down performance.
Don't, unless absolutely necessary, allow naked queries to your database. Try to control what queries can be run. Your job as a DBA will be much harder if some manager can come along and just run:
select * from very_big_table order by column_without_index;
on your database.
If managers want to be able to run ad-hoc queries, give them a cloned DBMS (or replica) so that your real users (the ones that need performance) aren't affected.
Don't use union when union all will suffice. If you know that there can be no duplicates between two selects of a union, there's no point letting the DBMS try to remove them.
Similarly, don't use select distinct on a table if you're retrieving all the primary key columns (or all columns in a unique constraint). There is no possibility of duplicates in those cases so, again, you're asking the DBMS to do unnecessary work.
Example: we had a customer with a view using select distinct * on one of their tables. Querying the view took 50 seconds. When we replaced it with a view starting select *, the time came down to sub-second. Needless to say, I got a good bottle of red wine out of that :-)
Try to avoid select * as much as possible. In other words, only get the columns you need. This makes little difference when you're using MySQL on your local PC but, when you have an app in California querying a database in Inner Mongolia, you want to minimise the amount of traffic being sent across the wire as much as possible.

don't make tables wide, keep them narrow as well as the indexes. Make sure that queries are fully covered by indexes and that those queries are SARGable.
Test with a ton of data before going in production, take a look at this: Your testbed has to have the same volume of data as on production in order to simulate normal usage

Pull up the execution plan and look for any of the following:
Table Scan
[Clustered] Index Scan
RID Lookup
Bookmark Lookup
Key Lookup
Nested Loops
Any of those things (in descending order from most to least scalable) mean that the database/query likely won't scale to much larger tables. An ideal query will have mostly index seeks, hash or merge joins, the occasional sort, and other low-impact operations (spools and so on).
The only way to prove that it will scale, as other answers have pointed out, is to test it on data of the desired size. The above is just a rule of thumb.

In addition (and along the same lines) to Robert's suggestion, consider the execution plan. Is it utilizing indexes? Are there any scans or such? Can you simply for the query in any way? For example, Eliminate IN in favor of EXISTS and only join to tables you need to join to.
You don't mention the technology -- keep in mind that different technologies can affect the efficiency of more complex queries.

I strongly recommend reading some reference material on this. This (hyperlink below) is probably a pretty good book to look into. Make sure to look under "Selectivity", among other topics.
SQL Tuning - Dan Tow

What is so bad about using SQL INNER JOIN

Every time a database diagram gets looked out, one area people are critical of is inner joins. They look at them hard and has questions to see if an inner join really needs to be there.
Simple Library Example:
A many-to-many relationship is normally defined in SQL with three tables: Book, Category, BookCategory.
In this situation, Category is a table that contains two columns: ID, CategoryName.
In this situation, I have gotten questions about the Category table, is it need? Can it be used as a lookup table, and in the BookCategory table store the CategoryName instead of the CategoryID to stop from having to do an additional INNER JOIN. (For this question, we are going to ignore the changing, deleting of any CategoryNames)
The question is, what is so bad about inner joins? At what point is doing them a negative thing (general guidelines like # of transactions, # of records, # of joins in a statement, etc)?

Your example is a good counterexample. How do you rename categories if they're spread throughout the various rows of the BookCategory table? Your UPDATE to do the rename would touch all the rows in the same category.
With the separate table, you only have to update one row. There is no duplicate information.

I would be more concerned about OUTER joins, and the potential to pick up info that wasn't intended.
In your example, having the Category table means that a book is limited to being filed under a preset Category (via a foriegn key relationship), if you just shoved multiple entries in to the BookCategory table then it would be harder to limit what is selected for the Category.
Doing an INNER join is not so bad, it is what databases are made for. The only time it is bad is when you are doing it on a table or column that is inadequately indexed.

I am not sure there is some thing wrong in inner join per se, it is like each IF you add to your code impacts performance (or should I say every line...), but still, you need a minimum number of those to make your system work (yes yes, I know about Turing machines).
So if you have something that is not needed, it will be frowned upon.

When you map your domain model onto the relational model you have to split the information across multiple relations in order to get a normalized model - there is no way around that. And then you have to use joins to combine the relations again and get your information back. The only bad thing about this is that joins are relative expensive.
The other option would be not to normalize your relational model. This will fill your database with much redundant data, give you many opportunities to turn your data inconsistent and make updates a nightmare.
The only reason not to normalize a relational model (I can think of at the moment) is that reading performance is extremely - and I mean extremely - critical.
By the way, why do you (they) only mention inner joins? How are left, right, and full outer joins significantly different from inner joins?

Nobody can offer much about general guidelines - they'd be specific to the server, hardware, database design, and expectations... way too many variables.
Specifically about INNER JOINs being inefficient or bad... JOINs are the center of relational DBs, and they've been around for decades. It's only wrong when you use it wrong, because obviously someone's doing it right since it's not extinct yet. Personally, I'd assume anyone throwing out blanket statements like that either don't know SQL or know just enough to get in trouble. Next time it comes up, teach them how to use the query cache.
(Not mentioning update/delete, but you didn't say inserts!: the increased maintainability through avoiding humans and their typos can easily be worth at least 10x the time a join will take.)

Has anyone written a higher level query langage (than sql) that generates sql for common tasks, on limited schemas

Sql is the standard in query languages, however it is sometime a bit verbose. I am currently writing limited query language that will make my common queries quicker to write and with a bit less mental overhead.
If you write a query over a good database schema, essentially you will be always joining over the primary key, foreign key fields so I think it should be unnecessary to have to state them each time.
So a query could look like.
select s.name, region.description from shop s
where monthly_sales.amount > 4000 and s.staff < 10
The relations would be
shop -- many to one -- region,
shop -- one to many -- monthly_sales
The sql that would be eqivilent to would be
select distinct s.name, r.description
from shop s
join region r on shop.region_id = region.region_id
join monthly_sales ms on ms.shop_id = s.shop_id
where ms.sales.amount > 4000 and s.staff < 10
(the distinct is there as you are joining to a one to many table (monthly_sales) and you are not selecting off fields from that table)
I understand that original query above may be ambiguous for certain schemas i.e if there the two relationship routes between two of the tables. However there are ways around (most) of these especially if you limit the schema allowed. Most possible schema's are not worth considering anyway.
I was just wondering if there any attempts to do something like this?
(I have seen most orm solutions to making some queries easier)
EDIT: I actually really like sql. I have used orm solutions and looked at linq. The best I have seen so far is SQLalchemy (for python). However, as far as I have seen they do not offer what I am after.

Hibernate and LinqToSQL do exactly what you want

I think you'd be better off spending your time just writing more SQL and becoming more comfortable with it. Most developers I know have gone through just this progression, where their initial exposure to SQL inspires them to bypass it entirely by writing their own ORM or set of helper classes that auto-generates the SQL for them. Usually they continue adding to it and refining it until it's just as complex (if not more so) than SQL. The results are sometimes fairly comical - I inherited one application that had classes named "And.cs" and "Or.cs", whose main functions were to add the words " AND " and " OR ", respectively, to a string.
SQL is designed to handle a wide variety of complexity. If your application's data design is simple, then the SQL to manipulate that data will be simple as well. It doesn't make much sense to use a different sort of query language for simple things, and then use SQL for the complex things, when SQL can handle both kinds of thing well.

I believe that any (decent) ORM would be of help here..

Entity SQL is slightly higher level (in places) than Transact SQL. Other than that, HQL, etc. For object-model approaches, LINQ (IQueryable<T>) is much higher level, allowing simple navigation:
var qry = from cust in db.Customers
select cust.Orders.Sum(o => o.OrderValue);
etc

Martin Fowler plumbed a whole load of energy into this and produced the Active Record pattern. I think this is what you're looking for?

Not sure if this falls in what you are looking for but I've been generating SQL dynamically from the definition of the Data Access Objects; the idea is to reflect on the class and by default assume that its name is the table name and all properties are columns. I also have search criteria objects to build the where part. The DAOs may contain lists of other DAO classes and that directs the joins.
Since you asked for something to take care of most of the repetitive SQL, this approach does it. And when it doesn't, I just fall back on handwritten SQL or stored procedures.

LEFT JOIN vs. multiple SELECT statements

I am working on someone else's PHP code and seeing this pattern over and over:
(pseudocode)
result = SELECT blah1, blah2, foreign_key FROM foo WHERE key=bar
if foreign_key > 0
other_result = SELECT something FROM foo2 WHERE key=foreign_key
end
The code needs to branch if there is no related row in the other table, but couldn't this be done better by doing a LEFT JOIN in a single SELECT statement? Am I missing some performance benefit? Portability issue? Or am I just nitpicking?

This is definitely wrong. You are going over the wire a second time for no reason. DBs are very fast at their problem space. Joining tables is one of those and you'll see more of a performance degradation from the second query then the join. Unless your tablespace is hundreds of millions of records, this is not a good idea.

There is not enough information to really answer the question. I've worked on applications where decreasing the query count for one reason and increasing the query count for another reason both gave performance improvements. In the same application!
For certain combinations of table size, database configuration and how often the foreign table would be queried, doing the two queries can be much faster than a LEFT JOIN. But experience and testing is the only thing that will tell you that. MySQL with moderately large tables seems to be susceptable to this, IME. Performing three queries on one table can often be much faster than one query JOINing the three. I've seen speedups of an order of magnitude.

I'm with you - a single SQL would be better

There's a danger of treating your SQL DBMS as if it was a ISAM file system, selecting from a single table at a time. It might be cleaner to use a single SELECT with the outer join. On the other hand, detecting null in the application code and deciding what to do based on null vs non-null is also not completely clean.
One advantage of a single statement - you have fewer round trips to the server - especially if the SQL is prepared dynamically each time the other result is needed.
On average, then, a single SELECT statement is better. It gives the optimizer something to do and saves it getting too bored as well.

It seems to me that what you're saying is fairly valid - why fire off two calls to the database when one will do - unless both records are needed independently as objects(?)
Of course while it might not be as simple code wise to pull it all back in one call from the database and separate out the fields into the two separate objects, it does mean that you're only dependent on the database for one call rather than two...
This would be nicer to read as a query:
Select a.blah1, a.blah2, b.something From foo a Left Join foo2 b On a.foreign_key = b.key Where a.Key = bar;
And this way you can check you got a result in one go and have the database do all the heavy lifting in one query rather than two...
Yeah, I think it seems like what you're saying is correct.

The most likely explanation is that the developer simply doesn't know how outer joins work. This is very common, even among developers who are quite experienced in their own specialty.
There's also a widespread myth that "queries with joins are slow." So many developers blindly avoid joins at all costs, even to the extreme of running multiple queries where one would be better.
The myth of avoiding joins is like saying we should avoid writing loops in our application code, because running a line of code multiple times is obviously slower than running it once. To say nothing of the "overhead" of ++i and testing i<20 during every iteration!

You are completely correct that the single query is the way to go. To add some value to the other answers offered let me add this axiom: "Use the right tool for the job, the Database server should handle the querying work, the code should handle the procedural work."
The key idea behind this concept is that the compiler/query optimizers can do a better job if they know the entire problem domain instead of half of it.

Considering that in one database hit you have all the data you need having one single SQL statement would be better performance 99% of the time. Not sure if the connections is being creating dynamically in this case or not but if so doing so is expensive. Even if the process if reusing existing connections the DBMS is not getting optimize the queries be best way and not really making use of the relationships.
The only way I could ever see doing the calls like this for performance reasons is if the data being retrieved by the foreign key is a large amount and it is only needed in some cases. But in the sample you describe it just grabs it if it exists so this is not the case and therefore not gaining any performance.

The only "gotcha" to all of this is if the result set to work with contains a lot of joins, or even nested joins.
I've had two or three instances now where the original query I was inheriting consisted of a single query that had so a lot of joins in it and it would take the SQL a good minute to prepare the statement.
I went back into the procedure, leveraged some table variables (or temporary tables) and broke the query down into a lot of the smaller single select type statements and constructed the final result set in this manner.
This update dramatically fixed the response time, down to a few seconds, because it was easier to do a lot of simple "one shots" to retrieve the necessary data.
I'm not trying to object for objections sake here, but just to point out that the code may have been broken down to such a granular level to address a similar issue.

A single SQL query would lead in more performance as the SQL server (Which sometimes doesn't share the same location) just needs to handle one request, if you would use multiple SQL queries then you introduce a lot of overhead:
Executing more CPU instructions,
sending a second query to the server,
create a second thread on the server,
execute possible more CPU instructions
on the sever, destroy a second thread
on the server, send the second results
back.
There might be exceptional cases where the performance could be better, but for simple things you can't reach better performance by doing a bit more work.

Doing a simple two table join is usually the best way to go after this problem domain, however depending on the state of the tables and indexing, there are certain cases where it may be better to do the two select statements, but typically I haven't run into this problem until I started approaching 3-5 joined tables, not just 2.
Just make sure you have covering indexes on both tables to ensure you aren't scanning the disk for all records, that is the biggest performance hit a database gets (in my limited experience)

You should always try to minimize the number of query to the database when you can. Your example is perfect for only 1 query. This way you will be able later to cache more easily or to handle more request in same time because instead of always using 2-3 query that require a connexion, you will have only 1 each time.

There are many cases that will require different solutions and it isn't possible to explain all together.
Join scans both the tables and loops to match the first table record in second table. Simple select query will work faster in many cases as It only take cares for the primary/unique key(if exists) to search the data internally.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas