SQL query in MySQL using GROUP BY - sql

Okay, so this query should be easy, but I'm having a bit of difficulty. Let's say I have a table called 'foo' with columns 'a' and 'b'.
I'm trying to figure out the following in one query:
Select how many values of column 'a' there are for each value of column 'b'. This is done with the following:
mysql> select count(a),b from foo GROUP BY b;
That's straightforward. But now I want to add a third output column to that query, showing count(a) divided by count(*) as a percentage. So if I have 100 rows in total and one of the GROUP BY results comes back with 20, the third column should output 20%, meaning that group makes up 20% of the aggregate pool.

Assuming you have > 0 rows in foo
SELECT count(a), b, (count(a) / (SELECT count(*) FROM foo)) * 100
FROM foo
GROUP BY b
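A quick way to sanity-check the single-query form is to run it against a throwaway table. The sketch below uses Python's built-in sqlite3 module rather than MySQL, and the sample rows are made up. Because SQLite truncates integer division, it multiplies by 100.0 first; MySQL does not need that, since its / operator already returns a fractional result.

```python
import sqlite3

# Hypothetical sample data: 10 rows in foo, split 2 / 3 / 5 across b.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (a INTEGER, b TEXT)")
rows = (
    [(i, "x") for i in range(2)]
    + [(i, "y") for i in range(3)]
    + [(i, "z") for i in range(5)]
)
conn.executemany("INSERT INTO foo (a, b) VALUES (?, ?)", rows)

# 100.0 forces floating-point arithmetic in SQLite.
query = """
    SELECT b,
           COUNT(a) AS n,
           100.0 * COUNT(a) / (SELECT COUNT(*) FROM foo) AS pct
    FROM foo
    GROUP BY b
    ORDER BY b
"""
result = conn.execute(query).fetchall()
for b, n, pct in result:
    print(b, n, pct)
```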

There is a risk of it running slowly; your best bet is to have whatever program issues the query perform two separate queries.
SELECT count(*) INTO @c FROM foo;
SELECT count(a), b, (count(a)/@c)*100 FROM foo GROUP BY b;
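Here is a minimal sketch of the two-query approach driven from application code. It assumes Python with the built-in sqlite3 module and made-up rows; the total is held in an ordinary Python variable and passed back in as a bound parameter, which plays the role of the stored total in the queries above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (a INTEGER, b TEXT)")
# Hypothetical rows: one in group 'x', three in group 'y'.
conn.executemany(
    "INSERT INTO foo (a, b) VALUES (?, ?)",
    [(1, "x"), (2, "y"), (3, "y"), (4, "y")],
)

# Query 1: fetch the total row count once.
(total,) = conn.execute("SELECT COUNT(*) FROM foo").fetchone()

# Query 2: reuse the total as a bound parameter.
# 100.0 keeps SQLite from doing integer division.
per_group = conn.execute(
    "SELECT b, COUNT(a), 100.0 * COUNT(a) / ? FROM foo GROUP BY b ORDER BY b",
    (total,),
).fetchall()
print(total, per_group)
```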

Related

Can I (in a many-to-many relationship) select only those ids in column A that have a connection to all ids in column B?

I need to retrieve only those ids in "A" that have a connection to all ids in "B".
In the example below, the result should be '...fa3e' because '...65d6' does NOT have a reference to all ids in "B".
However, if '...fa3e' and '...65d6' reference the same ids in column B, then the query should return both '...fa3e' and '...65d6'.
Subsequently, if a fifth row connected '...fa3e' with a completely new id in "B", then '...65d6' would be excluded because it would no longer hold a reference to all ids in column "B".
Is there a way to accomplish this in SQL Server?
I can't really come up with a good description or search term for what I'm trying to do ("Exclude column A based on values in column B" is not quite right), so I'm striking out when looking for resources.
I believe these values reside in the same table.
For distinct a values only:
select a
from T
group by a
having count(distinct b) = (select count(distinct b) from T);
To return all the rows:
select * from T where a in (
select a from T group by a
having count(distinct b) = (select count(distinct b) from T)
);
If (a, b) pairs are always unique then you wouldn't need the distinct qualifier on the left-hand counts. In fact you could even use count(*) for that.
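To see the HAVING comparison at work, here is a small sketch using Python's built-in sqlite3 module (not SQL Server). The ids "A1", "A2", "B1" and "B2" are hypothetical stand-ins for the real keys. Adding one row changes which values of a are connected to every b.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T (a TEXT, b TEXT)")
# Hypothetical data: A1 is linked to both b values, A2 to only one.
conn.executemany(
    "INSERT INTO T (a, b) VALUES (?, ?)",
    [("A1", "B1"), ("A1", "B2"), ("A2", "B1")],
)

query = """
    SELECT a
    FROM T
    GROUP BY a
    HAVING COUNT(DISTINCT b) = (SELECT COUNT(DISTINCT b) FROM T)
    ORDER BY a
"""
before = [row[0] for row in conn.execute(query)]

# Once A2 is also linked to B2, it covers every b and qualifies too.
conn.execute("INSERT INTO T (a, b) VALUES ('A2', 'B2')")
after = [row[0] for row in conn.execute(query)]
print(before, after)
```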
This seems like it's going to be a terrible query, but at its most basic, you want
All A where B in...
All B that are fully distinct
In SQL, that looks like
select distinct A
from test
where B in (select B from test group by B having count(1) = 1);
Absolutely zero guarantees on performance, but, this gives you the right value A. If you want to see which A/B pair actually made the cut, it could be SELECT A, B FROM test... too.

What's the difference between select distinct count, and select count distinct?

I am aware of select count(distinct a), but I recently came across select distinct count(a).
I'm not very sure if that is even valid.
If it is a valid use, could you give me some sample code with sample data that would explain the difference?
Hive doesn't allow the latter.
Any leads would be appreciated!
The query select count(distinct a) will give you the number of unique values in a.
The query select distinct count(a), on the other hand, will give you the list of unique counts of values in a. Without grouping, it will be just one row with the total count.
See the following example:
create table t(a int);
insert into t values (1),(2),(3),(3);
select count(distinct a) from t;
select distinct count(a) from t
group by a;
It will give you 3 for the first query and the values 1 and 2 for the second query.
I cannot think of any useful situation where you would want to use:
select distinct count(a)
If the query has no group by, then the distinct is redundant: the query only returns one row anyway. If there is a group by, then the grouping columns should be in the select, to identify each row.
I mean, technically, with a group by, it would be answering the question: "how many different non-null values of a are in groups". Usually, it is much more useful to know the value per group.
If you want to count the number of distinct values of a, then use count(distinct a).
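The three cases discussed in this thread can be checked side by side. This is a sketch using Python's built-in sqlite3 module and the same four sample values (1, 2, 3, 3); other engines (Hive, for instance, per the question) may reject the DISTINCT COUNT form.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER)")
conn.executemany("INSERT INTO t (a) VALUES (?)", [(1,), (2,), (3,), (3,)])

# Number of distinct values of a.
count_distinct = conn.execute("SELECT COUNT(DISTINCT a) FROM t").fetchall()

# No GROUP BY: a single row holding the total count, so DISTINCT changes nothing.
distinct_count = conn.execute("SELECT DISTINCT COUNT(a) FROM t").fetchall()

# With GROUP BY: the per-group counts are 1, 1, 2; DISTINCT collapses them.
distinct_count_grouped = sorted(
    conn.execute("SELECT DISTINCT COUNT(a) FROM t GROUP BY a").fetchall()
)
print(count_distinct, distinct_count, distinct_count_grouped)
```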

SQL Basic Syntax

I have the following problem:
What happens if the query didn't ask for B in the select? I think it would give an error, because the aggregate is computed based on the values in the select clause.
I have the following relation schema and queries:
Suppose R(A,B) is a relation with a single tuple (NULL, NULL).
SELECT A, COUNT(B)
FROM R
GROUP BY A;
SELECT A, COUNT(*)
FROM R
GROUP BY A;
SELECT A, SUM(B)
FROM R
GROUP BY A;
The first query returns NULL and 0. I am not sure what the second query returns. The aggregate COUNT(*) counts the number of tuples in a table; however, I don't know what it does to a group. The third returns NULL, NULL.
The only rule about SELECT and GROUP BY is that the unaggregated columns in the SELECT must be in the GROUP BY (with very specific exceptions).
You can have columns in the GROUP BY that never appear in the SELECT. That is fine. It doesn't affect the definition of a group, but multiple rows may seem to have the same values in the GROUP BY columns.
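The three queries from the question can be run against the single (NULL, NULL) tuple to see what each aggregate does within a group. The sketch below uses Python's built-in sqlite3 module, where NULL comes back as None. The last query is an extra illustration, not from the question, showing a GROUP BY column that is absent from the SELECT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (A INTEGER, B INTEGER)")
conn.execute("INSERT INTO R (A, B) VALUES (NULL, NULL)")

# COUNT(B) skips NULLs, so the single group counts zero values.
count_b = conn.execute("SELECT A, COUNT(B) FROM R GROUP BY A").fetchall()

# COUNT(*) counts rows in the group, regardless of NULLs.
count_star = conn.execute("SELECT A, COUNT(*) FROM R GROUP BY A").fetchall()

# SUM over only NULL inputs yields NULL.
sum_b = conn.execute("SELECT A, SUM(B) FROM R GROUP BY A").fetchall()

# Grouping by A without selecting A is also accepted.
without_a = conn.execute("SELECT COUNT(*) FROM R GROUP BY A").fetchall()
print(count_b, count_star, sum_b, without_a)
```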

Two ways to use Count, are they equivalent?

Is
SELECT COUNT(a.attr)
FROM TABLE a
equivalent to
SELECT B
FROM (SELECT COUNT(a.attr) as B
FROM TABLE a)
I would guess no, but I'm not sure.
I'm also assuming the answer would be the same for functions like min, max, avg, correct?
EDIT:
This is all out of curiosity, I'm still new at this. Is there a difference between the value returned for the count of the following and the above?
SELECT B, C
FROM (SELECT COUNT(a.attr) as B, a.C
FROM TABLE a
GROUP BY c)
EDIT AGAIN: I looked into it, lesson learned: I should be awake when I try to learn about these things.
Technically, they are not the same: the first one is a simple select, while the second is a select with a subselect.
But every sane optimizer will generate the same execution plan for both of them.
The results are the same, and would be the same as:
SELECT E
FROM
(SELECT D as E
FROM
(SELECT C as D
FROM
(SELECT B as C
FROM
(SELECT COUNT(a.attr) as B
FROM TABLE a))))
And equally as pointless.
The second query is essentially obfuscating a COUNT and should be avoided.
EDIT:
Yes, your edited query that was added to the OP is the same thing. It's just adding a subquery for no reason.
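As a quick check that the wrapper changes nothing about the result, here is a sketch using Python's built-in sqlite3 module. The table name items and its rows are assumptions for illustration; it only compares results, not execution plans.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# 'items' is a hypothetical table standing in for TABLE a in the question.
conn.execute("CREATE TABLE items (attr INTEGER, c INTEGER)")
conn.executemany(
    "INSERT INTO items (attr, c) VALUES (?, ?)",
    [(1, 1), (2, 1), (3, 1), (4, 2), (5, 2)],
)

# The plain aggregate.
direct = conn.execute("SELECT COUNT(attr) FROM items").fetchall()

# The same aggregate wrapped in a derived table that only renames it.
wrapped = conn.execute(
    "SELECT B FROM (SELECT COUNT(attr) AS B FROM items) AS t"
).fetchall()
print(direct, wrapped)
```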
Am posting this answer to supplement what has already been said in the other answers, and because you cannot format comments :)
You can always check the execution plan to see if queries are equivalent; this is what SQL Server makes of it:
DECLARE @A TABLE
(
attr int,
c int
)
INSERT @A(attr,c) VALUES(1,1)
INSERT @A(attr,c) VALUES(2,1)
INSERT @A(attr,c) VALUES(3,1)
INSERT @A(attr,c) VALUES(4,2)
INSERT @A(attr,c) VALUES(5,2)
SELECT count(attr) FROM @A
SELECT B
FROM (SELECT COUNT(attr) as B
FROM @A) AS T
SELECT B, C
FROM (SELECT COUNT(attr) as B, c AS C
FROM @A
GROUP BY c) AS T
Here's the execution plan of the SELECT statements; as you can see, there is no difference between the first two:
Yes, they are. All you're doing in the second one is naming the returned count B. They will return the same results.
http://www.roseindia.net/sql/sql-as-keyword.shtml
EDIT:
Better example:
http://www.w3schools.com/sql/sql_alias.asp
The third example will be different because it contains a group by. It will return the count for every distinct a.C entry. For example, this data:
B    C
w/e  a
w/e  a
w/e  b
w/e  a
w/e  c
would return:
3  a
1  b
1  c
though not necessarily in that order.
Easiest way to check all of this is to try it for yourself and see what it returns.
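In that spirit, here is a sketch of the grouped case using Python's built-in sqlite3 module. The table name tbl is hypothetical, the rows mirror the five-row example above, and an ORDER BY is added only so the output order is predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# 'tbl' is a hypothetical table; B holds arbitrary non-NULL values.
conn.execute("CREATE TABLE tbl (B TEXT, C TEXT)")
conn.executemany(
    "INSERT INTO tbl (B, C) VALUES (?, ?)",
    [("w/e", "a"), ("w/e", "a"), ("w/e", "b"), ("w/e", "a"), ("w/e", "c")],
)

# One output row per distinct C, each with the count of B in that group.
grouped = conn.execute(
    "SELECT COUNT(B) AS n, C FROM tbl GROUP BY C ORDER BY C"
).fetchall()
print(grouped)
```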
Your first code sample is correct, but the second does not make much sense.
You just select the same data twice without performing any further operation.
So the output of the first and second samples will be equal.

Can I use the MAX function for each tuple in the retrieved data set?

My table result contains fields:
id count
____________
1 3
2 2
3 2
From this table I have to form another table, score, which should look as follows:
id my_score
_____________
1 1.0000
2 0.6667
3 0.6667
That is, my_score = count / MAX(count). But if I write the query as
create TEMPORARY TABLE(select id,(count/MAX(count)) AS my_score from result);
only the first row is retrieved.
Can anyone suggest a query so that my_score is calculated for all tuples?
Thanks in advance.
SELECT
a.id,
a.count / b.total AS my_score
FROM result AS a
CROSS JOIN (SELECT MAX(count) AS total FROM result) AS b
B only returns one row so you want to take the Cartesian product of the table against its own aggregate to get your end value.
Not sure if this works in mysql, but try:
select id, count / (select max(count) from result) as my_score
from result
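Both the cross join and the scalar subquery can be tried on the three sample rows from the question. This is a sketch using Python's built-in sqlite3 module rather than MySQL; the column name is quoted as "count" to be safe, and the 1.0 factor is an SQLite-specific assumption to avoid integer division.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE result (id INTEGER, "count" INTEGER)')
conn.executemany(
    'INSERT INTO result (id, "count") VALUES (?, ?)',
    [(1, 3), (2, 2), (3, 2)],
)

# Scalar subquery: the maximum is computed separately and divides every row.
scalar = conn.execute("""
    SELECT id, 1.0 * "count" / (SELECT MAX("count") FROM result) AS my_score
    FROM result
    ORDER BY id
""").fetchall()

# Cross join against a one-row derived table holding the maximum.
joined = conn.execute("""
    SELECT r.id, 1.0 * r."count" / m.total AS my_score
    FROM result AS r
    CROSS JOIN (SELECT MAX("count") AS total FROM result) AS m
    ORDER BY r.id
""").fetchall()

# Round in Python for display, matching the four decimals in the question.
scores = [(row_id, round(score, 4)) for row_id, score in scalar]
print(scores)
```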
I don't think you can apply an aggregation function to each row in one step. Use a stored procedure to do the calculation in two steps -- calculate the max and store it in a variable, then do your selection and divide the count by the variable. Alternatively, you could use a subquery, but I don't really see that as an improvement.
create procedure calculate_score()
begin
declare maxcount decimal(6,4);
set maxcount := (select max(count) from result);
select id, count / maxcount as score from result;
end
Note: I'm not sure if MySQL will implicitly handle the data conversions from int to decimal or what types your columns are. If the data conversions need to be handled manually, you'll have to adjust the above.
I would hesitate because of likely performance issues, but semantically this should work:
select
a.id
, a.count / b.count
from
result a cross join result b
where
b.count = (select max(count) from result)
Edit: @eftpotrm has a much more elegant solution!