Sorting out the duplicates in an SQL table - sql

I give an analogy of my real problem below:
Imagine a website showing articles, where every article has comments associated with it. I want to get the articles that have comments posted after a certain date, let's say 2011-02-02. For each of those articles I also want the comment nearest in time to 2011-02-02. Keep in mind that every article has more than one comment associated with it. I want this to happen in one single SQL query.
I found it hard to explain my problem, so here is the SQL code:
SELECT articles.*, comments.date AS date
FROM articles, comments
WHERE comments.commentId in (SELECT commentId
FROM comments
WHERE date > 2011-02-02
ORDER BY date asc
LIMIT 1)
ORDER BY comments.date desc
The problem lies in the IN (...) section of the SQL query, because it only returns one single row. I want this to happen for each article.

Use a subquery. Unfortunately your question doesn't give me much of a schema, so I'll invent one as I go. Let's say you have a table 'article' with article_id as its PK, and your other table is 'comment' (linked on article_id). I'm assuming article_id + comment_date makes a comment unique.
select article.article_id, comment.comment_text, comment.comment_date from article
inner join (select min(comment_date) as comment_date, article_id
from comment
where comment_date > '2011-02-02'
group by article_id) c
on c.article_id = article.article_id
inner join comment on comment.article_id = c.article_id and c.comment_date = comment.comment_date
You can use subqueries as tables within joins. Use the subquery to isolate the single comment you want, then join back to the comment table to get the comment text. Hopefully this makes sense. I don't have a MySQL database to test this on, but I think the syntax should work (it does on MSSQL at least).
You can also include a WHERE clause at the bottom of this query to filter which articles you want to see.
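The derived-table join described above can be tried out on a small in-memory database. This is only a sketch using Python's sqlite3 module: the table and column names follow the invented schema from this answer, the sample rows are made up, and the helper names `make_demo_db` and `first_comment_after` are hypothetical.

```python
import sqlite3


def make_demo_db():
    """Build a throwaway database with the invented article/comment schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE article (article_id INTEGER PRIMARY KEY, title TEXT);
        CREATE TABLE comment (article_id INTEGER, comment_date TEXT, comment_text TEXT);
        INSERT INTO article VALUES (1, 'A'), (2, 'B'), (3, 'C');
        INSERT INTO comment VALUES
            (1, '2011-01-15', 'too early'),
            (1, '2011-02-05', 'nearest for 1'),
            (1, '2011-03-01', 'later'),
            (2, '2011-02-10', 'nearest for 2'),
            (3, '2010-12-31', 'only an old one');
    """)
    return conn


def first_comment_after(conn, cutoff):
    """For each article, return the earliest comment dated after cutoff."""
    sql = """
        SELECT a.article_id, c.comment_date, c.comment_text
        FROM article AS a
        JOIN (SELECT article_id, MIN(comment_date) AS comment_date
              FROM comment
              WHERE comment_date > ?          -- only comments after the cutoff
              GROUP BY article_id) AS nearest -- one row per article
          ON nearest.article_id = a.article_id
        JOIN comment AS c                     -- join back to fetch the text
          ON c.article_id = nearest.article_id
         AND c.comment_date = nearest.comment_date
        ORDER BY c.comment_date DESC
    """
    return conn.execute(sql, (cutoff,)).fetchall()
```

Dates are stored as ISO-8601 strings here so that plain string comparison orders them correctly. Articles with no comment after the cutoff (article 3 in the sample data) drop out because of the inner joins.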

You just have to query comments later than your date, returning the article ID (you do have a normalised structure, right? It's hard to tell without any detail).
To find the comment closest to your date, order the data by comment date in ascending order and take the first.

select top 1 a.* from articles a
inner join comments c on c.articleid = a.id
where c.date > '2011-02-02'
order by c.date asc
That should do it, although I'm not super familiar with MySQL (where TOP 1 is written as LIMIT 1 at the end of the query).

Related

How exactly is the value of count(*) determined in BigQuery?

I am joining a table of about 70000 rows with a slightly bigger second table through inner join each. Now count(a.business_column) and count(*) give different results. The former correctly reports back ~70000, while the latter gives ~200000. But this only happens when I select count(*) alone, when I select them together they give the same result (~70000). How is this possible?
select
count(*)
/*,count(a.business_column)*/
from table_a a
inner join each table_b b
on b.key_column = a.business_column
UPDATE: For a step by step explanation on how this works, see BigQuery flattens when using field with same name as repeated field instead.
To answer the title question: COUNT(*) in BigQuery is always accurate.
The caveat is that in SQL COUNT(*) and COUNT(column) have semantically different meanings - and the sample query can be interpreted in different ways.
See: http://www.xaprb.com/blog/2009/04/08/the-dangerous-subtleties-of-left-join-and-count-in-sql/
There they have this sample query:
select user.userid, count(email.subject)
from user
inner join email on user.userid = email.userid
group by user.userid;
That query turns out to be ambiguous, and the article author changes it for a more explicit one, adding this comment:
But what if that’s not what the author of the query meant? There’s no
way to really know. There are several possible intended meanings for
the query, and there are several different ways to write the query to
express those meanings more clearly. But the original query is
ambiguous, for a few reasons. And everyone who reads this query
afterwards will end up guessing what the original author meant. “I
think I can safely change this to…”
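The general SQL difference between COUNT(*) and COUNT(column) mentioned above can be seen on a tiny data set. This sketch uses Python's sqlite3 module rather than BigQuery, so it illustrates only the standard NULL-handling semantics, not BigQuery's treatment of repeated fields. The `users` and `emails` tables, their rows, and the helper names are made up for illustration.

```python
import sqlite3


def make_count_db():
    """Three users; user 1 has one email with a NULL subject, user 3 has none."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE users (userid INTEGER PRIMARY KEY);
        CREATE TABLE emails (userid INTEGER, subject TEXT);
        INSERT INTO users VALUES (1), (2), (3);
        INSERT INTO emails VALUES (1, 'hi'), (1, NULL), (2, 'yo');
    """)
    return conn


def count_variants(conn):
    """Return (userid, COUNT(*), COUNT(subject)) per user over a LEFT JOIN."""
    sql = """
        SELECT u.userid,
               COUNT(*)         AS joined_rows,       -- counts every joined row
               COUNT(e.subject) AS non_null_subjects  -- skips NULL subjects
        FROM users AS u
        LEFT JOIN emails AS e ON e.userid = u.userid
        GROUP BY u.userid
        ORDER BY u.userid
    """
    return conn.execute(sql).fetchall()
```

For user 3 the LEFT JOIN produces one NULL-extended row, so COUNT(*) reports 1 while COUNT(e.subject) reports 0. That is the kind of ambiguity the quoted article warns about.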
COUNT(*) counts the most repeated field in your query; if you want to count full records, use COUNT(0).

Do subselects do an implicit join?

I have an SQL query that seems to work, but I don't really understand why. I would very much appreciate it if someone could help explain what's going on:
THE QUERY RETURNS: All organisations that don't have any comments that were not created by the consultant who created the organisation record.
SELECT "organisations".*
FROM "organisations"
WHERE "organisations"."id" NOT IN
(SELECT "comments"."commentable_id"
FROM "comments"
WHERE "comments"."commentable_type" = 'Organisation'
AND (comments.author_id != organisations.consultant_id)
ORDER BY "comments"."created_at" ASC
)
It seems to do so correctly.
The part I don't understand is why (comments.author_id != organisations.consultant_id) works. I don't understand how Postgres even knows what "organisations" is inside that subselect; it is not defined in there.
If this were written as a join, where I had joined comments to organisations, then I would totally understand how you could do something like this, but in this case it's a subselect. How does it know how to map the comments and organisations tables and exclude the ones where (comments.author_id != organisations.consultant_id)?
That subselect is evaluated for each row of the outer query, so it can see all columns of that row. You will probably get better performance with this:
select organisations.*
from organisations
where not exists (
select 1
from comments
where
commentable_id = organisations.id and
commentable_type = 'Organisation' and
author_id != organisations.consultant_id
)
Notice that it is not necessary to qualify commentable_type, since the column in comments takes priority over any column of the same name outside the subselect. And if comments does not have a consultant_id column, it would be possible to drop that qualifier too, although that is not recommended because it hurts legibility.
The order by in your query buys you nothing; it just adds cost.
You are running a correlated subquery. http://technet.microsoft.com/en-us/library/ms187638(v=sql.105).aspx
This is commonly used in all databases. A subquery in the WHERE clause can refer to tables used in the parent query, and they often do.
That being said, your current query could likely be written better.
Here is one way, using an outer join with comments and keeping only the organisations for which no matching comment is found:
select o.*
from organisations o
left join comments c
on c.commentable_id = o.id
and c.commentable_type = 'Organisation'
and c.author_id <> o.consultant_id
where c.commentable_id is null
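To see a correlated subquery and the equivalent anti-join side by side, here is a small sketch using Python's sqlite3 module (not Postgres). The table and column names come from the question; the sample rows and helper names are made up. It assumes commentable_id is never NULL.

```python
import sqlite3


def make_org_db():
    """Org 1: only its own consultant commented. Org 2: an outsider commented.
    Org 3: no 'Organisation' comments (its one comment has another type)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE organisations (id INTEGER PRIMARY KEY, consultant_id INTEGER);
        CREATE TABLE comments (commentable_id INTEGER, commentable_type TEXT,
                               author_id INTEGER);
        INSERT INTO organisations VALUES (1, 10), (2, 20), (3, 30);
        INSERT INTO comments VALUES
            (1, 'Organisation', 10),
            (2, 'Organisation', 20),
            (2, 'Organisation', 99),
            (3, 'Person', 99);
    """)
    return conn


def orgs_via_not_exists(conn):
    """Correlated subquery: the inner query refers to the outer row's columns."""
    sql = """
        SELECT o.id FROM organisations AS o
        WHERE NOT EXISTS (
            SELECT 1 FROM comments AS c
            WHERE c.commentable_id = o.id            -- correlation on the key
              AND c.commentable_type = 'Organisation'
              AND c.author_id != o.consultant_id     -- correlation on the author
        )
        ORDER BY o.id
    """
    return [row[0] for row in conn.execute(sql)]


def orgs_via_left_join(conn):
    """Anti-join: keep organisations for which no offending comment matched."""
    sql = """
        SELECT o.id FROM organisations AS o
        LEFT JOIN comments AS c
          ON c.commentable_id = o.id
         AND c.commentable_type = 'Organisation'
         AND c.author_id <> o.consultant_id
        WHERE c.commentable_id IS NULL
        ORDER BY o.id
    """
    return [row[0] for row in conn.execute(sql)]
```

Both functions return the same ids on this data. Which form is faster depends on the database and its planner, so it is worth comparing the query plans on real data.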

MySQL performance of repeated count

I need to fetch the number of comments for each blog article, and I currently use this SQL:
select
id as article_id,
title,
content,
pic,
(select count(id) as comments from article_comments where
article_comments.article_parent_id = article_id group by article_id) as comments
from articles limit 1000;
This query has a significant delay compared to the query without the count(id) subquery. The delay is roughly 2-4 seconds for 1000 selected articles. Is there a way to improve the performance of this query?
Counting over a large table gets slower as the data grows. To get the number of comments on an article more cheaply, add a column to the articles table called comment_count. Every time someone posts a comment, increase that number by 1 in the corresponding article record. That way, when you retrieve the article, you don't have to count the comments every time the page loads; the count is just another column.
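A minimal sketch of that counter-column idea, using Python's sqlite3 module instead of MySQL and PHP. The comment_count column is the one proposed above; the `body` column, the sample rows, and the helper names `make_counter_db` and `add_comment` are assumptions made for illustration.

```python
import sqlite3


def make_counter_db():
    """Articles carry a denormalised comment_count that starts at zero."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE articles (
            id INTEGER PRIMARY KEY,
            title TEXT,
            comment_count INTEGER NOT NULL DEFAULT 0
        );
        CREATE TABLE article_comments (
            id INTEGER PRIMARY KEY,
            article_parent_id INTEGER NOT NULL,
            body TEXT
        );
        INSERT INTO articles (id, title) VALUES (1, 'first'), (2, 'second');
    """)
    return conn


def add_comment(conn, article_id, body):
    """Insert a comment and bump the article's counter in one transaction."""
    with conn:  # commits on success, rolls back if either statement fails
        conn.execute(
            "INSERT INTO article_comments (article_parent_id, body) VALUES (?, ?)",
            (article_id, body),
        )
        conn.execute(
            "UPDATE articles SET comment_count = comment_count + 1 WHERE id = ?",
            (article_id,),
        )
```

Keeping both statements in one transaction is what stops the counter from drifting away from the real number of rows. Deleting a comment would need the matching decrement.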
This is your query:
select id as article_id, title, content, pic,
(select count(id) as comments
from article_comments
where article_comments.article_parent_id = articles.article_id
group by article_id
) as comments
from articles
limit 1000;
First, the group by is unnecessary. Second, an index on article_comments(article_parent_id) should help. The final query might look like this:
select a.id as article_id, a.title, a.content, a.pic,
(select count(*) as comments
from article_comments ac
where ac.article_parent_id = a.id
) as comments
from articles a
limit 1000;
Note that this also introduces table aliases. Those make the query easier to write and read.
I discovered that, if circumstances allow it, it is much faster to run a first SQL query, extract the required ids from its result, and run a second SQL query with the IN() operator instead of joining tables or nesting queries.
select id as article_id, title, content, pic from articles limit 1000
At this point we need to declare a string variable containing the set of ids that will go into the IN() operator in the next query.
<?php $in = '1, 2, 3, 4,...,1000'; ?>
Now we select the comment count for each of the previously fetched article ids.
select article_id, count(*) from article_comments where article_id in ($in) group by article_id
This method is slightly messier in terms of PHP code, because at this point we need an $articles array containing the article data and a $comments['article_id'] array containing the comment count for each article.
Despite the improvement in performance, this method makes the PHP code messier and makes it impossible to search for values in the second or any further table.
This method is hence only applicable if performance is key and no other operations are required.
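The two-query approach can be sketched as follows, in Python with sqlite3 rather than PHP with MySQL. It uses the article_parent_id column from the question as the link to articles; the sample rows and the helper names `make_blog_db` and `articles_with_counts` are hypothetical. Bound parameters replace the hand-built `$in` string, which avoids quoting problems.

```python
import sqlite3


def make_blog_db():
    """Three articles; article 2 has no comments."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT);
        CREATE TABLE article_comments (id INTEGER PRIMARY KEY,
                                       article_parent_id INTEGER NOT NULL);
        INSERT INTO articles VALUES (1, 'a'), (2, 'b'), (3, 'c');
        INSERT INTO article_comments (article_parent_id) VALUES (1), (1), (3);
    """)
    return conn


def articles_with_counts(conn, limit=1000):
    """Query 1 fetches the articles; query 2 fetches grouped comment counts."""
    articles = conn.execute(
        "SELECT id, title FROM articles ORDER BY id LIMIT ?", (limit,)
    ).fetchall()
    ids = [article_id for article_id, _title in articles]
    if not ids:
        return []
    placeholders = ",".join("?" * len(ids))  # one '?' per fetched id
    counts = dict(conn.execute(
        f"SELECT article_parent_id, COUNT(*) FROM article_comments "
        f"WHERE article_parent_id IN ({placeholders}) "
        f"GROUP BY article_parent_id",
        ids,
    ).fetchall())
    # Articles without comments are absent from the grouped result: default to 0.
    return [(article_id, title, counts.get(article_id, 0))
            for article_id, title in articles]
```

Databases cap the number of bound parameters per statement, so a very long id list may need to be split into batches.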

SQL return distinct while sorting on another column

Afternoon all, hope you can help an SQL newbie with what's probably a simple request. I'll jump straight in with the question/problem.
For table Property_Information, I'd like to retrieve either a complete record, or even specified fields if possible where the below criteria are met.
The table has column PLCODE, which is not unique. The table also has column PCODE, which is unique, and there are multiple PCODEs per PLCODE (if that makes sense).
What I need to do is request the lowest PCODE record for each unique PLCODE.
E.g. there are 6500 records in this table and 255 unique PLCODEs; therefore I'd expect a result set of the 255 individual PLCODEs, each with the lowest PCODE record attached.
As I'm here, and already feel like a burden to the community, perhaps someone might suggest a good resource for developing existing (but basic) SQL skills?
Many thanks in advance
P.S. Query will be performed on MSSQLSMS 2012 on a 2005 DB if that's of any relevance
select PLCODE, min(PCODE) from Property_Information group by PLCODE
As for resources, you can google any ANSI SQL site or find SQL tutorials.
Something like this will give you all columns for your grouped rows.
WITH CTE AS
(
SELECT
PLCODE
, MIN(PCODE) AS PCODE
FROM Property_Information
GROUP BY PLCODE
)
SELECT p.* FROM CTE c
LEFT JOIN Property_Information p
ON c.PLCODE = p.PLCODE AND c.PCODE = p.PCODE
SELECT
*, MIN(PCODE)
FROM
Property_Information
GROUP BY
PLCODE
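The "group to find the minimum, then join back for the full record" idea from the answers above can be checked on a few rows. This is a sketch in Python with sqlite3 rather than SQL Server; the ADDRESS column, the sample data, and the helper names are invented for illustration.

```python
import sqlite3


def make_property_db():
    """Two PLCODE groups with two records each."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Property_Information (PLCODE TEXT, PCODE INTEGER UNIQUE,
                                           ADDRESS TEXT);
        INSERT INTO Property_Information VALUES
            ('A', 3, 'x'), ('A', 1, 'y'), ('B', 7, 'z'), ('B', 9, 'w');
    """)
    return conn


def lowest_pcode_rows(conn):
    """Return the complete record holding the lowest PCODE of each PLCODE."""
    sql = """
        SELECT p.PLCODE, p.PCODE, p.ADDRESS
        FROM Property_Information AS p
        JOIN (SELECT PLCODE, MIN(PCODE) AS PCODE   -- lowest PCODE per group
              FROM Property_Information
              GROUP BY PLCODE) AS m
          ON m.PLCODE = p.PLCODE AND m.PCODE = p.PCODE
        ORDER BY p.PLCODE
    """
    return conn.execute(sql).fetchall()
```

Because PCODE is unique, the join back yields exactly one full record per PLCODE.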

SQL left join query runs VERY slow

Basically I'm trying to pull a random poll question that a user has not yet responded to from a database. This query takes about 10-20 seconds to execute, which is obviously no good! The responses table is about 30K rows and the database also has about 300 questions.
SELECT questions.id
FROM questions
LEFT JOIN responses ON ( questions.id = responses.questionID
AND responses.username = 'someuser' )
WHERE
responses.username IS NULL
ORDER BY RAND() ASC
LIMIT 1
PK for questions and responses tables is 'id' if that matters.
Any advice would be greatly appreciated.
You most likely need an index on
responses.questionID
responses.username
Without the index searching through 30k rows will always be slow.
Here's a different approach to the query which might be faster:
SELECT q.id
FROM questions q
WHERE q.id NOT IN (
SELECT r.questionID
FROM responses r
WHERE r.username = 'someuser'
)
Make sure there is an index on r.username and that should be pretty quick.
The above will return all the unanswered questions. To choose a random one, you could go with the inefficient (but easy) ORDER BY RAND() LIMIT 1, or use the method suggested by Tom Leys.
The problem is probably not the join; it's almost certainly sorting 30k rows with ORDER BY RAND().
See: Do not order by rand
He suggests (replace quotes in this example with your query)
SELECT COUNT(*) AS cnt FROM quotes
-- generate random number between 0 and cnt-1 in your programming language and run
-- the query:
SELECT quote FROM quotes LIMIT $generated_number, 1
Of course you could probably make the first statement a subselect inside the second.
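Applied to the poll tables, the count-then-offset idea could look like the sketch below. It uses Python with sqlite3 instead of MySQL, takes the NOT IN filter from the earlier answer, and assumes responses.questionID is never NULL. The sample rows and the helper names `make_poll_db` and `random_unanswered_question` are hypothetical.

```python
import random
import sqlite3


def make_poll_db():
    """Five questions; 'someuser' answered 2 and 4, 'finisher' answered all."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE questions (id INTEGER PRIMARY KEY);
        CREATE TABLE responses (id INTEGER PRIMARY KEY,
                                questionID INTEGER NOT NULL,
                                username TEXT NOT NULL);
        INSERT INTO questions VALUES (1), (2), (3), (4), (5);
        INSERT INTO responses (questionID, username) VALUES
            (2, 'someuser'), (4, 'someuser'), (1, 'other'),
            (1, 'finisher'), (2, 'finisher'), (3, 'finisher'),
            (4, 'finisher'), (5, 'finisher');
    """)
    return conn


def random_unanswered_question(conn, username, rng=random):
    """Pick one unanswered question id at random, or None if none are left."""
    where = "id NOT IN (SELECT questionID FROM responses WHERE username = ?)"
    (cnt,) = conn.execute(
        f"SELECT COUNT(*) FROM questions WHERE {where}", (username,)
    ).fetchone()
    if cnt == 0:
        return None
    offset = rng.randrange(cnt)  # random number between 0 and cnt-1
    row = conn.execute(
        f"SELECT id FROM questions WHERE {where} ORDER BY id LIMIT 1 OFFSET ?",
        (username, offset),
    ).fetchone()
    return row[0]
```

The random number is generated in application code, so the database never has to sort by a random value. The fixed ORDER BY id keeps the offset well-defined between the two statements.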
Is OP even sure the original query returns the correct result set?
I assume the "AND responses.username = 'someuser'" clause was added to join specification with intention that join will then generate null rightside columns for only the id's that someuser has not answered.
My question: won't that join generate null rightside columns for every question.id that has not been answered by all users? The left join works such that, "If any row from the target table does not match the join expression, then NULL values are generated for all column references to the target table in the SELECT column list."
In any case, nickf's suggestion looks good to me.
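The concern raised in this last answer can be checked on a small data set. The sketch below runs the original LEFT JOIN, without the random ordering, through Python's sqlite3 module; the sample rows and helper names are made up. With the username condition inside the ON clause, only that user's responses can match a question, so responses from other users should not affect the result.

```python
import sqlite3


def make_join_db():
    """Q1: someuser only. Q2: other only. Q3: nobody. Q4: both users."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE questions (id INTEGER PRIMARY KEY);
        CREATE TABLE responses (id INTEGER PRIMARY KEY,
                                questionID INTEGER NOT NULL,
                                username TEXT NOT NULL);
        INSERT INTO questions VALUES (1), (2), (3), (4);
        INSERT INTO responses (questionID, username) VALUES
            (1, 'someuser'), (2, 'other'), (4, 'someuser'), (4, 'other');
    """)
    return conn


def unanswered_via_left_join(conn, username):
    """Questions with no response row for this particular user."""
    sql = """
        SELECT q.id
        FROM questions AS q
        LEFT JOIN responses AS r
          ON r.questionID = q.id
         AND r.username = ?        -- filter in ON: only this user's rows match
        WHERE r.username IS NULL   -- keep questions where nothing matched
        ORDER BY q.id
    """
    return [row[0] for row in conn.execute(sql, (username,))]
```

On this data the query returns the questions that the given user has not answered, including question 2, which only another user answered.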