Is this SQL select code following good practice?

I am using SQLite and will port to MySQL (5) later.
I want to know if I am doing something I shouldn't be doing. I purposely designed it so I compare to 0 instead of 1 (I changed hasApproved to NotApproved to do this; it's not a big deal and I haven't written any code yet). I was told I should never need to write a subquery, but I do here. My votes tables are just id, ip, postid (I don't think I can write that subquery as a join instead?), and that's pretty much all that is on my mind.
I don't really care about naming conventions, since the tables are created via reflection and the naming is all over the place.
select
    id,
    name,
    body,
    upvotes,
    downvotes,
    (select 1 from UpVotes where IPAddr = ? AND post = Post.id) as myup,
    (select 1 from DownVotes where IPAddr = ? AND post = Post.id) as mydown
from Post
where
    flag = '0'
limit ?, ?

Since you're asking about good practices... the upvotes and downvotes columns in your Post table look like duplicated data. That's a problem, because now you always have to worry about whether the data is in sync and correct. If you want to know the number of upvotes, count them; don't also store them in the Post table. I'm not positive that is what you're doing, but it's my guess.
On to your query... You will probably get better performance using a JOINed subquery instead of what you have now. Scalar subqueries in the column list have to be run once for every row that is returned, which can be a big performance hit if you're returning a lot of rows. Instead, try:
SELECT
    P.id,
    P.name,
    P.body,
    P.upvotes,
    P.downvotes,
    COALESCE(UV.cnt, 0) AS upvotes2,
    COALESCE(DV.cnt, 0) AS downvotes2
FROM
    Post P
    LEFT OUTER JOIN (SELECT post_id, COUNT(*) AS cnt FROM UpVotes GROUP BY post_id) AS UV ON UV.post_id = P.id
    LEFT OUTER JOIN (SELECT post_id, COUNT(*) AS cnt FROM DownVotes GROUP BY post_id) AS DV ON DV.post_id = P.id
Compare it to your own query and see if it gives you better performance.
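If you would rather measure than guess, both engines can show you the plan they choose for each version. For example, prefix either query like this (shown here with a trimmed-down query for brevity; substitute the real ones):
EXPLAIN QUERY PLAN SELECT id, name FROM Post WHERE flag = '0';  -- SQLite
EXPLAIN SELECT id, name FROM Post WHERE flag = '0';             -- MySQL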
EDIT: A couple of other posters have advocated a single table for up/down votes. They are absolutely correct. That makes the query even easier and also probably much faster:
SELECT
    P.id,
    P.name,
    P.body,
    P.upvotes,
    P.downvotes,
    SUM(CASE WHEN V.vote_type = 'UP' THEN 1 ELSE 0 END) AS upvotes2,
    SUM(CASE WHEN V.vote_type = 'DOWN' THEN 1 ELSE 0 END) AS downvotes2
FROM
    Post P
    LEFT OUTER JOIN Votes V ON V.post_id = P.id
GROUP BY
    P.id,
    P.name,
    P.body,
    P.upvotes,
    P.downvotes

I'm guessing that you're trying to ensure that a user only votes once on each post here.
I wouldn't - I don't - use separate tables for up votes and down votes. Add a vote type column to your votes table and you won't need the correlated subqueries.
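With a single votes table, the two correlated subqueries from the question collapse into one join. A sketch, assuming a hypothetical Votes(ip_addr, post_id, vote_type) table:
SELECT
    P.id,
    P.name,
    P.body,
    MAX(CASE WHEN V.vote_type = 'UP' THEN 1 ELSE 0 END) AS myup,
    MAX(CASE WHEN V.vote_type = 'DOWN' THEN 1 ELSE 0 END) AS mydown
FROM Post P
LEFT JOIN Votes V ON V.post_id = P.id AND V.ip_addr = ?
WHERE P.flag = '0'
GROUP BY P.id, P.name, P.body
LIMIT ?, ?
The GROUP BY collapses the matching vote rows into the two flags; posts this address never voted on come out as 0/0 thanks to the LEFT JOIN.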

Here are my opinions:
The tables "UpVotes" and "DownVotes" have the same structure and can be merged into one table.
The relation between "Post" and the votes table can be constrained by a foreign key (a sketch of both changes is below).
Although I am not sure about the performance difference, I think it would be better to use a join rather than nesting two select statements inside a select statement.
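A minimal sketch of that merged table, with assumed names (SQLite syntax; in MySQL 5, make it an InnoDB table so the foreign key is actually enforced):
CREATE TABLE Votes (
    id        INTEGER PRIMARY KEY,
    post_id   INTEGER NOT NULL REFERENCES Post(id),  -- the foreign key
    ip_addr   TEXT    NOT NULL,
    vote_type TEXT    NOT NULL,                      -- 'UP' or 'DOWN'
    UNIQUE (post_id, ip_addr)                        -- one vote per address per post
);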

You can use joins to achieve the same thing, and I would expect joins to work a lot more efficiently than embedded selects.

Related

Subquery select statement vs inner join

I'm confused about these two statements. Which is faster, which is more common to use, and which is best for memory?
select p.id, p.name, w.id, w.name
from person p
inner join work w on w.id = p.wid
where p.id in (somenumbers)
vs
select p.id, p.name, (select id from work where id=p.wid), (select name from work where id=p.wid)
from person p
where p.id in (somenumbers)
The whole idea is that if I have a huge database, making an inner join of the work table and the person table will take memory and perform worse, whereas the subquery selects only run one statement at a time. So which is best here?
First, the two queries are not the same. The first filters out any rows that have no matching rows in work.
The equivalent first query uses a left join:
select p.id, p.name, w.id, w.name
from person p left join
work w
on w.id = p.wid
where p.id in (somenumbers);
Then, the second query can be simplified to:
select p.id, p.name, p.wid,
(select name from work where work.id = p.wid)
from person p
where p.id in (somenumbers);
There is no reason to look up the id in work when it is already present in person.
If you want optimized queries, then you want indexes on person(id, wid, name) and work(id, name).
With these indexes, the two queries should have basically the same performance. The subquery will use the index on work for fetching the rows from work and the where clause will use the index on person. Either query should be fast and scalable.
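Creating those indexes is plain DDL (the index names here are made up):
CREATE INDEX idx_person_id_wid_name ON person(id, wid, name);
CREATE INDEX idx_work_id_name ON work(id, name);
Since they cover every column the queries touch, the engine can answer from the indexes alone without visiting the base tables.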
The subqueries in your second example will execute once for every row, which will perform badly. That said, some optimizers may be able to convert it to a join for you - YMMV.
A good rule to follow in general is: much prefer joins to subqueries.
Joins give better performance compared with subqueries. A join on an INT column, or an index on the join column, gives the best performance.
select p.id, p.name, w.id, w.name
from person p
inner join work w on w.id = p.wid
where p.id in (somenumbers)
It really depends on how you want to optimize the query (including, but not limited to, adding/removing/reordering indexes).
I have found that a setup which makes the join soar can make the subquery suffer, and the opposite can also be true, so there is not much point in comparing them under the same setup.
I choose to use and optimize with joins. In my experience a join, under its best setup, rarely loses to a subquery, and it is a lot easier to read.
When a vendor stuffs the system with an extreme load of subquery-heavy queries, it simply isn't worth the effort to change them unless performance starts to crawl, which the query optimization in my other work has kept it from doing.

Do subselects do an implicit join?

I have a SQL query that seems to work, but I don't really understand why, so I would very much appreciate it if someone could help explain what's going on.
THE QUERY RETURNS: all organisations that don't have any comments that were not created by the consultant who created the organisation record.
SELECT \"organisations\".*
FROM \"organisations\"
WHERE \"organisations\".\"id\" NOT IN
(SELECT \"comments\".\"commentable_id\"
FROM \"comments\"
WHERE \"comments\".\"commentable_type\" = 'Organisation'
AND (comments.author_id != organisations.consultant_id)
ORDER BY \"comments\".\"created_at\" ASC
)
It seems to do so correctly.
The part I don't understand is why (comments.author_id != organisations.consultant_id) works. I don't understand how Postgres even knows what "organisations" is inside that subselect; it is not defined in there.
If this were written as a join, where I had joined comments to organisations, then I would totally understand how you could do something like this, but in this case it's a subselect. How does it know how to map the comments and organisations tables and exclude the ones where (comments.author_id != organisations.consultant_id)?
That subselect is evaluated once per row of the outer query, so it can see all the columns of that row. You will probably get better performance with this:
select organisations.*
from organisations
where not exists (
    select 1
    from comments
    where
        commentable_type = 'Organisation' and
        commentable_id = organisations.id and
        author_id != organisations.consultant_id
)
Notice that it is not necessary to qualify commentable_type (or commentable_id): inside the subselect, a column from comments takes priority over any column from outside it. And if comments does not have a consultant_id column, it would be possible to drop that qualifier too, although that is not recommended, for legibility's sake.
The ORDER BY in your query buys you nothing; it just adds cost.
You are running a correlated subquery. http://technet.microsoft.com/en-us/library/ms187638(v=sql.105).aspx
This is commonly used in all databases: a subquery in the WHERE clause can refer to tables used in the parent query, and often does.
That being said, your current query could likely be written better.
Here is one way, using an outer join to comments in which no matches are found based on your criteria:
select o.*
from organisations o
left join comments c
    on c.commentable_id = o.id
    and c.commentable_type = 'Organisation'
    and c.author_id <> o.consultant_id
where c.commentable_id is null

SQL COUNT(col) vs extra logging column... efficiency?

I can't seem to find much information about this.
I have a table to log users' comments. I have another table to log likes/dislikes from other users for each comment.
Therefore, when selecting this data to be displayed on a web page, there is a complex query requiring joins and subqueries to count all the likes/dislikes.
My example is a query someone kindly helped me with on here to achieve the required results:
SELECT comments.comment_id, comments.descr, comments.created, usrs.usr_name,
    (SELECT COUNT(*) FROM comment_likers WHERE comment_id = comments.comment_id AND liker = 1) AS likes,
    (SELECT COUNT(*) FROM comment_likers WHERE comment_id = comments.comment_id AND liker = 0) AS dislikes,
    comment_likers.liker
FROM comments
INNER JOIN usrs ON (comments.usr_id = usrs.usr_id)
LEFT JOIN comment_likers ON (comments.comment_id = comment_likers.comment_id
    AND comment_likers.usr_id = $usrID)
WHERE comments.topic_id = $tpcID
ORDER BY comments.created DESC;
However, if I added likes and dislikes columns to the COMMENTS table, and created a trigger to automatically increment/decrement these columns as likes are inserted/deleted/updated in the LIKER table, then the SELECT statement would be simpler and more efficient than it is now. So I am asking: is it more efficient to have this complex query with the COUNTs, or to have the extra columns and triggers?
And to generalise: is it more efficient to COUNT, or to keep an extra counter column, when querying on a regular basis?
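For reference, the insert half of the trigger I have in mind would look roughly like this; a sketch in MySQL syntax, assuming new likes and dislikes columns on comments (a boolean comparison in MySQL evaluates to 1 or 0):
CREATE TRIGGER comment_likers_ai AFTER INSERT ON comment_likers
FOR EACH ROW
UPDATE comments
SET likes    = likes    + (NEW.liker = 1),
    dislikes = dislikes + (NEW.liker = 0)
WHERE comment_id = NEW.comment_id;
Matching AFTER UPDATE and AFTER DELETE triggers would be needed as well to keep the counters honest.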
Your query is very inefficient. You can easily eliminate those subqueries, which will dramatically increase performance.
Your two subqueries can be replaced by simply:
sum(liker) likes,
sum(abs(liker - 1)) dislikes,
Making the whole query this:
SELECT comments.comment_id, comments.descr, comments.created, usrs.usr_name,
    SUM(liker) AS likes,
    SUM(ABS(liker - 1)) AS dislikes,
    comment_likers.liker
FROM comments
INNER JOIN usrs ON comments.usr_id = usrs.usr_id
LEFT JOIN comment_likers ON comments.comment_id = comment_likers.comment_id
    AND comment_likers.usr_id = $usrID
WHERE comments.topic_id = $tpcID
GROUP BY comments.comment_id, comments.descr, comments.created, usrs.usr_name, comment_likers.liker
ORDER BY comments.created DESC;

Is there a better way to sort this query?

We generate a lot of SQL procedurally and SQL Server is killing us. Because of some issues documented elsewhere we basically do SELECT TOP 2 ** 32 instead of TOP 100 PERCENT.
Note: we must use the subqueries.
Here's our query:
SELECT * FROM (
SELECT [me].*, ROW_NUMBER() OVER( ORDER BY (SELECT(1)) )
AS rno__row__index FROM (
SELECT [me].[id], [me].[status] FROM (
SELECT TOP 4294967296 [me].[id], [me].[status] FROM
[PurchaseOrders] [me]
LEFT JOIN [POLineItems] [line_items]
ON [line_items].[id] = [me].[id]
WHERE ( [line_items].[part_id] = ? )
ORDER BY [me].[id] ASC
) [me]
) [me]
) rno_subq
WHERE rno__row__index BETWEEN 1 AND 25
Are there better ways to do this that anyone can see?
UPDATE: here is some clarification on the whole subquery issue:
The key word of my question is "procedurally". I need the ability to reliably encapsulate resultsets so that they can be stacked together like building blocks. For example, I want to get the first 10 CDs ordered by the name of the artist who produced them, and also get the related artist for each CD. What I do is assemble a monolithic subselect representing the CDs ordered by the joined artist names, then apply a limit to it, then join the nested subselects to the artist table, and only then execute the resulting query. The isolation is necessary because the code that requests the ordered CDs is unrelated and oblivious to the code selecting the top 10 CDs, which in turn is unrelated and oblivious to the code that requests the related artists.
Now you may say that I could move the inner ORDER BY into the OVER() clause, but then I break the encapsulation, as I would have to SELECT the columns of the joined table, so I can order by them later. An additional problem would be the merging of two tables under one alias; if I have identically named columns in both tables, the select me.* would stop right there with an ambiguous column name error.
I am willing to sacrifice a bit of the optimizer performance, but the 2**32 seems like too much of a hack to me. So I am looking for middle ground.
If you want top rows by me.id, just ask for that in the ROW_NUMBER's ORDER BY. Don't chase your tail around subqueries and TOP.
If you have a WHERE clause on a field of the joined table, an outer JOIN buys you nothing: on the non-matching rows those fields will all be NULL and get filtered out by the WHERE, so it is effectively an inner join.
WITH cteRowNumbered AS (
    SELECT [me].id, [me].status,
        ROW_NUMBER() OVER (ORDER BY [me].id ASC) AS rno__row__index
    FROM [PurchaseOrders] [me]
    JOIN [POLineItems] [line_items] ON [line_items].[id] = [me].[id]
    WHERE [line_items].[part_id] = ?)
SELECT id, status
FROM cteRowNumbered
WHERE rno__row__index BETWEEN 1 AND 25
I use CTEs instead of subqueries just because I find them more readable.
Use:
SELECT x.*
FROM (SELECT po.id,
             po.status,
             ROW_NUMBER() OVER (ORDER BY po.id) AS rno__row__index
      FROM [PurchaseOrders] po
      JOIN [POLineItems] li ON li.id = po.id
      WHERE li.part_id = ?) x
WHERE x.rno__row__index BETWEEN 1 AND 25
ORDER BY x.id ASC
Unless you've omitted details in order to simplify the example, there's no need for all your subqueries in what you provided.
Kudos to the only person who saw through the naysaying and actually tried the query on a large table we do not have access to. To all the rest saying this simply will not work (will return random rows): we know what the manual says, and we know it is a hack; this is why we asked the question in the first place. However, outright dismissing a query without even trying it is rather shallow. Can someone provide a real example (with preceding CREATE/INSERT statements) demonstrating the above query malfunctioning?
Your update makes things much clearer. I think the approach you're using is seriously flawed. While it's nice to have encapsulated, reusable code in your applications, front-end applications are a much different animal than a database. They typically deal with small structures and small, discrete processes that run against those structures. Databases, on the other hand, often deal with tables measured in the millions of rows, and sometimes more. Using the same methodologies will often result in code that simply performs so badly as to be unusable. Even if it works now, it's very likely that it won't scale and will cause major problems down the road.
Best of luck to you, but I don't think that this approach will end well in all but the smallest of databases.

Select Users with more Items

Each user HAS MANY photos and HAS MANY comments. I would like to order users by SUM(number_of_photos, number_of_comments)
Can you suggest me the SQL query?
GROUP BY with JOINs works more efficiently than dependent subqueries (in all relational DBs I know):
Select Users.*
From Users
Left Join Photos On (Photos.user_id = Users.id)
Left Join Comments On (Comments.user_id = Users.id)
Group By Users.id
Order By (Count(Distinct Photos.id) + Count(Distinct Comments.id))
with some assumptions on the tables (e.g. an id primary key in each of them).
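COUNT(DISTINCT ...) matters here: joining both Photos and Comments multiplies rows, so a user with 3 photos and 2 comments produces 6 joined rows and a plain COUNT would over-count. If that intermediate blow-up is a concern, here is a sketch of a pre-aggregated variant under the same assumptions:
SELECT U.*,
       COALESCE(P.cnt, 0) + COALESCE(C.cnt, 0) AS total
FROM Users U
LEFT JOIN (SELECT user_id, COUNT(*) AS cnt FROM Photos GROUP BY user_id) AS P ON P.user_id = U.id
LEFT JOIN (SELECT user_id, COUNT(*) AS cnt FROM Comments GROUP BY user_id) AS C ON C.user_id = U.id
ORDER BY total DESC;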
Select * From Users U
Order By (Select Count(*) From Photos
          Where user_id = U.id) +
         (Select Count(*) From Comments
          Where user_id = U.id)
EDIT: although every query that uses subqueries can also be written with joins, which of the two will be faster is not a simple question, and it is irrelevant unless the system is experiencing performance problems.
1) Both constructions must be translated by the query optimizer into a query plan which includes some type of correlated join, be it a nested loop join, hash join, merge join, or whatever. And it's entirely possible (even likely) that they will both result in the same query plan.
NOTE: This is because the entire SQL statement is translated into a single query plan. The subqueries do NOT get their own individual query plans as though they were being executed in isolation.
Which query plan and which join types are used will depend on the data structure and the data in each specific situation. The only way to tell which is faster is to try both, in controlled environments, and measure the performance... but,
2) Unless the system is experiencing an issue with performance (unacceptably poor performance), clarity is more important. And for problems like the one described above, where none of the data attributes in the "other" tables are required in the output of the SQL statement, a subquery is much clearer in describing the function and purpose of the SQL than a join with GROUP BYs would be.
I think that the accepted solutions would be problematic from a performance standpoint, assuming you have many users, photos, and comments. Your query runs two separate select statements for every row in the user table.
What you want to do is synthesize a query using ActiveRecord that looks like this:
SELECT u.*, COUNT(DISTINCT c.id) + COUNT(DISTINCT p.id) AS total_count
FROM users u
LEFT JOIN photos p ON u.id = p.user_id
LEFT JOIN comments c ON u.id = c.user_id
GROUP BY u.id
ORDER BY total_count DESC
The join will be much, much more efficient (the COUNT(DISTINCT ...) guards against the row multiplication that joining both tables causes). Using left joins ensures that even if a user has no comments or photos they will still be included in the results.
If I were to assume that you had a count of comments and a count of photos (user.number_of_photos, user.number_of_comments; as seen above), it would be simple (not stupid):
Select user_id from user order by number_of_photos DESC, number_of_comments DESC
In Ruby On Rails:
User.find(:all, :order => '((SELECT COUNT(*) FROM photos WHERE user_id=users.id) + (SELECT COUNT(*) FROM comments WHERE user_id=users.id)) DESC')