SQL Code Smells

Could you please list some of the bad practices in SQL that novices tend to fall into?
I have found WHILE loops used in scenarios that could be resolved using set operations.
Another example is inserting data only if it does not already exist. This can be achieved using a LEFT OUTER JOIN, yet some people reach for IF instead.
Any other thoughts?
Edit: What I am looking for is specific scenarios (like those mentioned in the question) that can be handled in SQL without using procedural constructs.
Thanks
Lijo

Here are some I have seen:
Using cursors instead of equivalent (and faster) set operations (joins etc).
Dynamic SQL for everything.
Code that is open to SQL Injection attacks.
Full outer joins even when they are not needed.
Huge stored procedures (hundreds/thousands of lines).
No comments.
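
The insert-if-missing case from the question can be done as a single set-based statement. Below is a minimal sketch run through Python's sqlite3 module; the customer and staging tables and their rows are made up for illustration and are not from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables: customer is the target, staging holds candidates.
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE staging  (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO customer VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO staging  VALUES (2, 'Bob'), (3, 'Cid'), (4, 'Dee');
""")

# The LEFT OUTER JOIN keeps every staging row; "c.id IS NULL" then keeps
# only the rows that found no match in customer. No loop, no IF.
conn.execute("""
    INSERT INTO customer (id, name)
    SELECT s.id, s.name
    FROM staging AS s
    LEFT OUTER JOIN customer AS c ON c.id = s.id
    WHERE c.id IS NULL
""")

customer_ids = [row[0] for row in conn.execute("SELECT id FROM customer ORDER BY id")]
print(customer_ids)
```

Whether this beats a procedural check depends on the engine and on concurrency requirements; the sketch only shows the shape of the set-based form.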

Placing ODBC or dynamic SQL calls all over the code.
Often it is better to define a data abstraction layer that provides access
to the databases. All the SQL code can hide in that layer.
This often avoids replication of similar queries, and makes changing
data models easier to do.
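
A rough sketch of what such a layer might look like, using Python's sqlite3 module. The AccountStore class, the account table and its columns are hypothetical names chosen for the example. The query is parameterized, which also addresses the SQL injection smell mentioned in another answer.

```python
import sqlite3


class AccountStore:
    """Hypothetical data access layer: the only place that contains SQL."""

    def __init__(self, conn):
        self._conn = conn

    def find_by_owner(self, owner):
        # The "?" placeholder lets the driver bind the value, so caller input
        # is never spliced into the SQL text.
        cursor = self._conn.execute(
            "SELECT id, owner FROM account WHERE owner = ? ORDER BY id",
            (owner,),
        )
        return cursor.fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE account (id INTEGER PRIMARY KEY, owner TEXT);
    INSERT INTO account VALUES (1, 'alice'), (2, 'bob'), (3, 'alice');
""")

store = AccountStore(conn)
alice_rows = store.find_by_owner("alice")
# A classic injection string is treated as a plain value and matches nothing.
injection_rows = store.find_by_owner("alice' OR '1'='1")
print(alice_rows, injection_rows)
```

Callers never see SQL text, so a change to the data model only touches this one class.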

Personally, for me: anything that is not a plain INSERT, UPDATE, DELETE or SELECT statement.
I don't like logic in SQL.

My biggest beef here is definitely repetitive SQL. As an example, multiple stored procedures that perform the exact same joins but with different filters.
Using views in such cases can make your database MUCH easier to look at and work with.
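
A small sketch of the idea, run through Python's sqlite3 module. The customer and orders tables and the customer_orders view are invented for the example: the join is written once in the view, and each caller only adds its own filter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schema for illustration only.
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY, customer_id INTEGER, status TEXT, total REAL
    );
    INSERT INTO customer VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO orders VALUES
        (10, 1, 'open', 5.0), (11, 1, 'paid', 7.5), (12, 2, 'open', 3.0);

    -- The shared join lives in exactly one place.
    CREATE VIEW customer_orders AS
    SELECT c.name AS customer, o.id AS order_id, o.status, o.total
    FROM customer AS c
    JOIN orders AS o ON o.customer_id = c.id;
""")

# Two callers that previously would have repeated the join.
open_orders = conn.execute(
    "SELECT order_id FROM customer_orders WHERE status = 'open' ORDER BY order_id"
).fetchall()
ann_orders = conn.execute(
    "SELECT order_id FROM customer_orders WHERE customer = 'Ann' ORDER BY order_id"
).fetchall()
print(open_orders, ann_orders)
```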

Creating vendor-specific SQL, when generic SQL would do.
Creating tables dynamically at runtime (other than TEMPORARY tables).
Letting your application code have table create or super user privs.

Since the question asks for a list of SQL smells, no answer can be
exhaustive. I will be expanding my answer as time permits and memory
serves:
Redundant grouping
Redundant grouping is the application of the GROUP BY clause (and
consequently of aggregate functions) to more columns than required. It
occurs when the author starts by collecting most or all of the data that
they need, and only groups it at the very end. Redundant grouping is,
therefore, late grouping; the correct approach is to group early,
and to group only the data that needs grouping.
If a main entity (main) has a journal (jrnl) and refers to another
entity, an appendage (apnd), then the following query:
SELECT
    main.Id,
    main.Name,
    MAX(jrnl.Entry) AS Entry,
    MAX(jrnl.Date)  AS Date,
    apnd.Reference,
    apnd.Status
FROM main
JOIN jrnl ON jrnl.Parent = main.Id
JOIN apnd ON apnd.Id = main.ApndId
GROUP BY main.Id, main.Name, apnd.Reference, apnd.Status
has redundant grouping, because the sole purpose of the GROUP BY
clause is to obtain the latest journal entry. It should be rewritten in
a non-redundant manner as follows:
SELECT
    main.Id,
    main.Name,
    skel.MaxEntry AS Entry,
    jrnl.Date,
    apnd.Reference,
    apnd.Status
FROM
    ( SELECT
          jrnl.Parent     AS MainId,
          MAX(jrnl.Entry) AS MaxEntry
      FROM jrnl
      GROUP BY jrnl.Parent
    ) skel -- eton
JOIN main ON main.Id = skel.MainId
JOIN jrnl ON jrnl.Entry = skel.MaxEntry
JOIN apnd ON apnd.Id = main.ApndId
That is, we group on the narrowest dataset possible and join the rest
afterwards, even if it means referencing the same tables again!
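
To check that grouping early gives the same rows as grouping late, here is a small sketch using Python's sqlite3 module. The table and column names follow the answer; the sample rows are invented, and the sketch assumes that jrnl.Entry is unique and that dates increase with Entry (otherwise MAX(Date) in the late-grouping query need not belong to the latest entry).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE apnd (Id INTEGER PRIMARY KEY, Reference TEXT, Status TEXT);
    CREATE TABLE main (Id INTEGER PRIMARY KEY, Name TEXT, ApndId INTEGER);
    CREATE TABLE jrnl (Entry INTEGER PRIMARY KEY, Parent INTEGER, Date TEXT);
    -- Made-up sample data.
    INSERT INTO apnd VALUES (1, 'R-1', 'ok'), (2, 'R-2', 'hold');
    INSERT INTO main VALUES (1, 'first', 1), (2, 'second', 2);
    INSERT INTO jrnl VALUES
        (1, 1, '2020-01-01'), (2, 1, '2020-02-01'), (3, 2, '2020-01-15');
""")

# Late grouping: join everything, then group on every non-aggregated column.
late = """
    SELECT main.Id, main.Name, MAX(jrnl.Entry) AS Entry, MAX(jrnl.Date) AS Date,
           apnd.Reference, apnd.Status
    FROM main
    JOIN jrnl ON jrnl.Parent = main.Id
    JOIN apnd ON apnd.Id = main.ApndId
    GROUP BY main.Id, main.Name, apnd.Reference, apnd.Status
    ORDER BY main.Id
"""

# Early grouping: aggregate the journal alone, then join the rest.
early = """
    SELECT main.Id, main.Name, skel.MaxEntry AS Entry, jrnl.Date,
           apnd.Reference, apnd.Status
    FROM (SELECT Parent AS MainId, MAX(Entry) AS MaxEntry
          FROM jrnl
          GROUP BY Parent) AS skel
    JOIN main ON main.Id = skel.MainId
    JOIN jrnl ON jrnl.Entry = skel.MaxEntry
    JOIN apnd ON apnd.Id = main.ApndId
    ORDER BY main.Id
"""

late_rows = conn.execute(late).fetchall()
early_rows = conn.execute(early).fetchall()
print(early_rows)
```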

Related

Are joins the proper way to do cross table queries?

I have a few tables in third normal form and I need to do some cross-table queries to get the information I need.
I looked at joins, but it seems like a join will create a new table. Is this the proper way to perform such queries? Or should I just do nested queries? I guess a join might make sense if I have to run these queries a lot? I'm really not sure how well optimized these operations are. I'm using the Sequelize ORM and I don't see any clear solution.

It seems to me you are asking about joins vs subqueries. These are to some extent different. But let's start with a couple of points.
A join creates a new relvar, not a new table. A relvar is a variable standing in for the relation output by the join operation. It is transient (as opposed to a view which would be persistent).
Joins and subqueries are not always perfect substitutes. Sometimes you will need both.
Your query output is also a relvar.
The above being said, where possible I think joins are generally preferable. One reason is that subqueries in column lists often perform badly. The major reason is that a SQL query that can be written using the structure below is far easier (as you master the language) to both understand and debug than most alternatives:
SELECT [column_list]
FROM [initial_table]
[join list]
WHERE [filters]
GROUP BY [grouping list]
HAVING [post-aggregation filters]
LIMIT [limit and offset]
If your query fits the above structure then you can usually expect that specific kinds of problems will occur in logic in specific parts of the query. On the other hand, with subqueries, you have to check these independently.

How can I compare tuples using MySQL?

One more problem; I need your help.
Make a list of medications that have been entered as the same (table identical_with) but differ in their association with the disease (table association).
I don't know how to do that.
(The sample contents of identical_with and association, and the expected result, were shown as tables in the original post.)

To solve your problem, you need to join the association table twice. The following code should work:
select
    i.Name_1, i.Name_2
from
    association a1
inner join
    identical_with i
    on i.Name_1 = a1.Name
inner join
    association a2
    on i.Name_2 = a2.Name
where
    a2.Fachname <> a1.Fachname
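
A quick way to try the double join, sketched with Python's sqlite3 module. The table and column names come from the question; the rows are invented, since the original sample data is not available here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE identical_with (Name_1 TEXT, Name_2 TEXT);
    CREATE TABLE association (Name TEXT, Fachname TEXT);
    -- Made-up sample data.
    INSERT INTO identical_with VALUES ('A', 'B'), ('C', 'D');
    INSERT INTO association VALUES
        ('A', 'flu'), ('B', 'flu'),      -- same disease: not reported
        ('C', 'flu'), ('D', 'asthma');   -- different disease: reported
""")

rows = conn.execute("""
    SELECT DISTINCT i.Name_1, i.Name_2
    FROM identical_with AS i
    JOIN association AS a1 ON a1.Name = i.Name_1
    JOIN association AS a2 ON a2.Name = i.Name_2
    WHERE a2.Fachname <> a1.Fachname
""").fetchall()
print(rows)
```

Note that if a medication can be associated with several diseases, this reports a pair as soon as any two of their associations differ, which may or may not be what is wanted.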

This is a bit long for a comment, although the answer is essentially "you can't do this in MySQL".
The support you are looking for is for hierarchical or recursive queries. Almost every database except MySQL has built-in support for these types of queries. This leaves you with essentially five choices:
Switch to using a database that has such support. Among free databases, these include Postgres, SQL Server Express, and Oracle Express.
If you limit the depth of equivalence, you can use repeated self joins.
You can do this with a while loop in a stored procedure. However, that is not a single SQL statement.
Use a nested set model.
Use a method where you store the full path.
Unfortunately, the last two methods require triggers to maintain the data structure on inserts, updates, and deletes.
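
The "repeated self joins" option could look roughly like this when the depth is capped at two hops. This is a sketch with invented rows, run through Python's sqlite3 module; it follows pairs in the stored direction only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE identical_with (Name_1 TEXT, Name_2 TEXT);
    -- Made-up chain: A = B, B = C, C = D.
    INSERT INTO identical_with VALUES ('A', 'B'), ('B', 'C'), ('C', 'D');
""")

# Depth limited to two hops: the direct pairs, plus pairs reachable
# through exactly one intermediate name (one self join).
rows = conn.execute("""
    SELECT Name_1, Name_2
    FROM identical_with
    UNION
    SELECT x.Name_1, y.Name_2
    FROM identical_with AS x
    JOIN identical_with AS y ON y.Name_1 = x.Name_2
    ORDER BY 1, 2
""").fetchall()
print(rows)
```

The pair (A, D) needs three hops and is therefore missing; each extra level of depth costs one more self join, which is why this only works for a fixed, small depth.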

Sql Join on User Defined Function: how to optimize

I'm trying to optimize a query in a database. That query is similar to the following:
select * from Account
inner join udf_Account('user') udfAccount
on Account.Id = udfAccount.AccountId
Actually, the real query is much longer, but the most important point is that it contains a few inner joins on user-defined functions (UDFs) which depend on the user id. (So this is a constant parameter which does not change during the query evaluation.)
Due to the large amount of data, my query takes approximately 20 seconds on a production database, which is not acceptable.
I have already seen that storing the results of the functions in temporary tables and using these tables in the query reduces the duration of the query a lot.
I'm asking the following questions:
Can I avoid the temporary tables? Isn't there a way to tell SQL that the function can be evaluated only once? Using temporary tables would imply some significant changes in my code, which is why I would be happy to have another solution.
Are there any other ways to optimize my query?

In SQL Server, if your functions are Inline rather than Multi-Statement, SQL Server expands them (macro-like) into your queries. It's just as if they became sub-queries in your main query.
This notionally allows the optimiser to make a 'better' execution plan.
For example, provided that the fields you are joining on are directly derived from their source tables, this should make indexes on those fields available.
Without looking at the whole query and your individual functions, it appears that you're already in a good place with regard to your syntax. The next place to look is at the indexes that exist, aiming for index seeks rather than table scans or index scans.
(That's all a bit simplistic, but it's a good start for query optimisation, which is an immense topic.)
Another option is to look at using CROSS APPLY with your inline table valued functions.
(Available in SQL Server 2005 onwards)
This allows the values from tables in your queries to be used as parameters to your functions. Again, provided that the functions are inline, SQL Server expands the function inline when building the execution plan.
An example could be...
SELECT
    Account.AccountID,
    SubAccount.AccountID     AS SubAccountID,
    Balance.currentAvailable AS SubAccountBalance
FROM
    Account
CROSS APPLY
    dbo.getSubAccounts('User', Account.AccountID) AS SubAccount
CROSS APPLY
    dbo.getCurrentBalance(SubAccount.AccountID) AS Balance
WHERE
    Account.AccountID = 1234

I believe you want to define what MySQL calls a "deterministic" function. Depending on your flavor of SQL, this will have different syntax. But ultimately the biggest optimisation would be to not use a function at all, and simply add an account column to the user table.

Generating select statement with joins from table information

I've got a bunch of classes that describe a database schema: Table, Field, ForeignKey.
Tables have a ForeignKeys list and a Fields list.
Now I would like to generate a SELECT statement with all the joins that are described in the ForeignKey instances.
The question is: is the order of tables relevant for the query time? In other words, do I have to care, or is it done automatically for me by the db engine?

is the order of tables relevant for the query time? In other words, do I have to care, or is it done automatically for me by the db engine?
To the optimizer, no -- it doesn't matter.
For the sake of readability and maintenance, you might want to consider laying the FROM and JOIN clauses out in a manner that reads well. If you are only dealing with INNER joins, there's no issue, but OUTER JOINs I generally define after the FROM clause and use LEFT JOIN syntax exclusively. But that's a matter of style & taste...
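
For the generation step the question describes, a minimal sketch in Python might look like this. The class shapes are assumptions (the question names Table, Field and ForeignKey but does not show them; fields are simplified to plain strings here), and build_select is a hypothetical helper. It emits the joins in declaration order, which per the answer above is a readability choice rather than a performance one.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ForeignKey:
    column: str      # column in the owning table
    ref_table: str   # referenced table
    ref_column: str  # referenced column


@dataclass
class Table:
    name: str
    fields: List[str]
    foreign_keys: List[ForeignKey] = field(default_factory=list)


def build_select(table: Table) -> str:
    """Emit one INNER JOIN per foreign key, in declaration order."""
    columns = ", ".join(f"{table.name}.{name}" for name in table.fields)
    lines = [f"SELECT {columns}", f"FROM {table.name}"]
    for fk in table.foreign_keys:
        lines.append(
            f"INNER JOIN {fk.ref_table} "
            f"ON {table.name}.{fk.column} = {fk.ref_table}.{fk.ref_column}"
        )
    return "\n".join(lines)


orders = Table(
    name="orders",
    fields=["id", "customer_id"],
    foreign_keys=[ForeignKey("customer_id", "customer", "id")],
)
sql = build_select(orders)
print(sql)
```

The sketch does not alias tables, so it would need extending if two foreign keys reference the same table.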

Which are the SQL improvements you are waiting for?

Dealing with SQL shows us some limitations and gives us an opportunity to imagine what could be.
Which improvements to SQL are you waiting for? Which would you put on top of the wish list?
I think it would be nice if your answer named the database that lacks the feature you are requesting.

T-SQL Specific: A decent way to select from a result set returned by a stored procedure that doesn't involve putting it into a temporary table or using some obscure function.
SELECT * FROM EXEC [master].[dbo].[xp_readerrorlog]

I know it's wildly unrealistic, but I wish they'd make the syntax of INSERT and UPDATE consistent. Talk about gratuitous non-orthogonality.

Operator to manage range of dates (or numbers):
where interval(date0, date1) intersects interval(date3, date4)
EDIT: Dates or numbers, of course, work the same way.
EDIT 2: It seems Oracle has something along these lines: the undocumented OVERLAPS predicate. More info here.
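
Until such an operator exists, the usual workaround is a pair of comparisons: two closed intervals intersect when each one starts no later than the other ends. A sketch with Python's sqlite3 module; the booking table, its columns and the overlapping helper are hypothetical, and dates are stored as ISO strings so that text comparison orders them correctly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical table for illustration.
    CREATE TABLE booking (id INTEGER PRIMARY KEY, start_date TEXT, end_date TEXT);
    INSERT INTO booking VALUES
        (1, '2024-01-01', '2024-01-10'),
        (2, '2024-01-08', '2024-01-15'),
        (3, '2024-02-01', '2024-02-05');
""")


def overlapping(conn, start, end):
    """Return ids of bookings whose closed interval intersects [start, end]."""
    cursor = conn.execute(
        # Row starts before the probe ends AND the probe starts before the row ends.
        "SELECT id FROM booking WHERE start_date <= ? AND ? <= end_date ORDER BY id",
        (end, start),
    )
    return [row[0] for row in cursor]


hits = overlapping(conn, "2024-01-09", "2024-01-20")
print(hits)
```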

A decent way of walking a tree with hierarchical data. Oracle has CONNECT BY but the simple and common structure of storing an object and a self-referential join back to the table for 'parent' is hard to query in a natural way.

More SQL Server than SQL but better integration with Source Control. Preferably SVN rather than VSS.

Implicit joins, or whatever they should be called (that is, predefined views bound to the table definition):
SELECT CUSTOMERID, SUM(C.ORDERS.LINES.VALUE) FROM
CUSTOMER C
A redesign of the whole GROUP BY thing so that every expression in the SELECT clause doesn't have to be repeated in the GROUP BY clause.
Some support for let expressions, or otherwise more legal places to use an alias. This is a bit related to the GROUP BY thing, but there are other times when I just hate Oracle for forcing me to use an outer select just to reference a big expression by alias.

I would like to see the ability to use Regular Expressions in string handling.

A way of dynamically specifying columns/tables without having to resort to full dynamic sql that executes in another context.

Ability to define columns based on other columns ad infinitum (including disambiguation).
This is a contrived example and not a real world case, but I think you'll see where I'm going:
SELECT LTRIM(t1.a) AS [a.new]
,REPLICATE(' ', 20 - LEN([a.new])) + [a.new] AS [a.conformed]
,LEN([a.conformed]) as [a.length]
FROM t1
INNER JOIN t2
ON [a.new] = t2.a
ORDER BY [a.new]
instead of:
SELECT LTRIM(t1.a) AS [a.new]
,REPLICATE(' ', 20 - LEN(LTRIM(t1.a))) + LTRIM(t1.a) AS [a.conformed]
,LEN(REPLICATE(' ', 20 - LEN(LTRIM(t1.a))) + LTRIM(t1.a)) as [a.length]
FROM t1
INNER JOIN t2
ON LTRIM(t1.a) = t2.a
ORDER BY LTRIM(t1.a)
Right now, in SQL Server 2005 and up, I would use a CTE and build up in successive layers.
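
The layered-CTE workaround could be sketched as below. This runs on SQLite through Python's sqlite3 module rather than on SQL Server, so LENGTH, SUBSTR and || stand in for LEN, REPLICATE and +; the join to t2 is left out to keep it short, and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (a TEXT);
    INSERT INTO t1 VALUES ('  abc'), (' de');
""")

padding = " " * 20  # source of left padding, bound as a parameter

rows = conn.execute("""
    WITH trimmed AS (
        -- Layer 1: name the trimmed value once.
        SELECT LTRIM(a) AS a_new FROM t1
    ),
    conformed AS (
        -- Layer 2: reuse a_new instead of repeating LTRIM(a).
        SELECT a_new,
               SUBSTR(?, 1, 20 - LENGTH(a_new)) || a_new AS a_conformed
        FROM trimmed
    )
    -- Layer 3: reuse a_conformed instead of repeating the padding expression.
    SELECT a_new, a_conformed, LENGTH(a_conformed) AS a_length
    FROM conformed
    ORDER BY a_new
""", (padding,)).fetchall()
print(rows)
```

Each layer can refer to the aliases of the layer before it, which is the closest existing equivalent to the alias reuse wished for above.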

I'd like the vendors to actually standardise their SQL. They're all guilty of it. The LIMIT/OFFSET clause from MySQL and PostgreSQL is a good solution that no-one else appears to offer. Oracle has its own syntax for explicit JOINs whilst everyone else uses ANSI-92. MySQL should deprecate the CONCAT() function and use || like everyone else. And there are numerous clauses and statements outside the standard that could be more widespread. MySQL's REPLACE is a good example. There's more, with issues about casting and comparing types, quirks of column types, sequences, etc.

parameterized order by, as in:
select * from tableA order by #columName

Support in SQL to specify if you want your query plan to be optimized to return the first rows quickly, or all rows quickly.
Oracle has the concept of FIRST_ROWS hint, but a standard approach in the language would be useful.

Automatic denormalization.
But I may be dreaming.

Improved pivot tables. I'd like to tell it to automatically create the columns based on the keys found in the data.

On my wish list is a database supporting sub-queries in CHECK-constraints, without having to rely on materialized view tricks. And a database which supports the SQL standard's "assertions", i.e. constraints which may span more than one table.
Something else: A metadata-related function which would return the possible values of a given column, if the set of possible values is small. I.e., if a column has a foreign key to another column, it would return the existing values in the column being referred to. Or if the column has a CHECK constraint like "CHECK foo IN(1,2,3)", it would return 1,2,3. This would make it easier to create GUI elements based on a table schema: if the function returned a list of two values, the programmer could decide that a radio button widget would be appropriate, or if the function returned, e.g., 10 values, the application could show a dropdown widget instead. Etc.

UPSERT or MERGE in PostgreSQL. It's the one feature whose absence just boggles my mind. Postgres has everything else; why can't they get their act together and implement it, even in limited form?

Check constraints with subqueries, I mean something like:
CHECK ( 1 > (SELECT COUNT(*) FROM TABLE WHERE A = COLUMN))

These are all MS SQL Server/T-SQL specific:
1. "Natural" joins based on an existing foreign key relationship.
2. Easily use a stored proc result as a resultset.
3. Some other loop construct besides WHILE.
4. Unique constraints across non-NULL values.
5. EXCEPT, IN, ALL clauses instead of LEFT|RIGHT JOIN WHERE x IS [NOT] NULL.
6. Schema-bound stored procs (to ease #2).
7. Relationships, schema-bound views, etc. across multiple databases.

WITH clause for statements other than SELECT, that is, for UPDATE and DELETE.
For instance:
WITH table as (
SELECT ...
)
DELETE from table2 where not exists (SELECT ...)

Something which I call REFERENCE JOIN. It joins two tables together by implicitly using the FOREIGN KEY...REFERENCES constraint between them.

A relational algebra DIVIDE operator. I hate always having to re-think how to find all elements of table A that are related to every given element of table B.
http://www.tc.umn.edu/~hause011/code/SQLexample.txt
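
For reference, the usual way to spell division without a DIVIDE operator is a double NOT EXISTS. A sketch with Python's sqlite3 module; the pilot_skill and hangar tables and their rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables: which pilot can fly which plane, and the planes in the hangar.
    CREATE TABLE pilot_skill (pilot TEXT, plane TEXT);
    CREATE TABLE hangar (plane TEXT);
    INSERT INTO pilot_skill VALUES
        ('Ann', 'B1'), ('Ann', 'F16'),
        ('Bob', 'B1'),
        ('Cid', 'B1'), ('Cid', 'F16'), ('Cid', 'Piper');
    INSERT INTO hangar VALUES ('B1'), ('F16');
""")

# pilot_skill DIVIDE hangar: keep the pilots for whom there is no hangar
# plane that is missing from their own skills.
rows = conn.execute("""
    SELECT DISTINCT ps.pilot
    FROM pilot_skill AS ps
    WHERE NOT EXISTS (
        SELECT 1
        FROM hangar AS h
        WHERE NOT EXISTS (
            SELECT 1
            FROM pilot_skill AS ps2
            WHERE ps2.pilot = ps.pilot
              AND ps2.plane = h.plane
        )
    )
    ORDER BY ps.pilot
""").fetchall()
print(rows)
```

Bob is excluded because he cannot fly the F16; Cid's extra plane does not matter.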

String aggregation on GROUP BY (in Oracle it is possible with this trick):
SELECT deptno, string_agg(ename) AS employees
FROM emp
GROUP BY deptno;
    DEPTNO EMPLOYEES
---------- --------------------------------------------------
        10 CLARK,KING,MILLER
        20 SMITH,FORD,ADAMS,SCOTT,JONES
        30 ALLEN,BLAKE,MARTIN,TURNER,JAMES,WARD

More OOP features:
stored procedures and user functions
CREATE PROCEDURE tablename.spname ( params ) AS ...
called via
EXECUTE spname
FROM tablename
WHERE conditions
ORDER BY
which implicitly passes a cursor or the current record to the SP (similar to the inserted and deleted pseudo-tables).
table definitions with inheritance
table definition as derived from base table, inheriting common columns etc
Btw, this is not necessarily real OOP, only syntactic sugar on existing technology, but it would simplify development a lot.

Abstract tables and sub-classing
create abstract table person
(
id primary key,
name varchar(50)
);
create table concretePerson extends person
(
birth date,
death date
);
create table fictionalCharacter extends person
(
creator int references concretePerson.id
);

Increased temporal database support in Sql Server. Intervals, overlaps, etc.
Increased OVER support in Sql Server, including LAG, LEAD, and TOP.

Arrays
I'm not sure what's holding this back, but the lack of arrays leads to temp tables and related mess.

Some kind of UPGRADE TABLE statement which changes an existing table to match the given definition:
CREATE OR UPGRADE TABLE
(
a VARCHAR,
---
)

My wish list (for SQL Server)
Ability to store/use multiple execution plans for a stored procedure concurrently and have the system automatically pick the best stored plan at each execution.
Currently there's one plan: if it is no longer optimal, it is used anyway or a brand new one is computed in its place.
Native UTF-8 storage
Database mirroring with more than one standby server and the ability to use a recovery model approaching 'simple', provided of course that all servers are up and the transaction commits everywhere.
PCRE in replace functions
Some clever way of reusing fragments of large SQL queries, stored match conditions, select conditions, etc. Similar to functions but actually implemented more like preprocessor macros.

Comments for check constraints. With this feature, an application (or the database itself when raising an error) can query the metadata and retrieve that comment to show it to the user.

Automated DBA notification in the case where the optimizer generates a plan different from the plan that the query was tested with.
In other words, every query can be registered. At that time, the plan is saved. Later, when the query is executed, if there is a change to the plan, the DBA receives a notice that something unexpected occurred.
