Get ANY(col) instead of MIN(col) from a group - sql

I have a SQL Query (simplified from real use):
SELECT MIN(cola), colb FROM tbl GROUP BY colb;
But actually, I don't need the minimum value- any cola value will do- it's only used to show an example value from the group.
At the moment PG has to do the group and then sort each group by cola to find the minimum value in the group, but this is slow because there's a lot of records in each group.
Does Postgres have some kind of FIRST(cola) or ANY(cola) that would just return whatever cola it sees first (like MySQL does when you don't use an aggregate function) or without needing to sort / read cola from every row?

I think using DISTINCT ON() with no order by will achieve what you are after:
SELECT DISTINCT ON (ColB) ColA, ColB
FROM tbl;
Example on SQL Fiddle
The docs state
DISTINCT ON ( expression [, ...] ) keeps only the first row of each set of rows where the given expressions evaluate to equal. The DISTINCT ON expressions are interpreted using the same rules as for ORDER BY (see above). Note that the "first row" of each set is unpredictable unless ORDER BY is used to ensure that the desired row appears first.
However, with no example data to work on I can't actually compare if this will outperform using MIN or any other aggregate function.

This statement:
At the moment PG has to do the group and then sort each group by cola
to find the minimum value in the group, but this is slow because
there's a lot of records in each group.
May logically describe what Postgres does, but it does not explain what is actually going on.
Postgres -- as with any database that I'm familiar with -- will keep a "register" for the minimum value. As new data comes in, it will compare the value in the next row to the minimum. If the new value is smaller, then it will be copied in. This, incidentally, is whay min(), max(), avg(), and count() are all faster than count(distinct). For the latter, the list of values within a group must be maintained.
The distinct on approach may be faster than the group by. The reason, however, is not because the database engine is sorting all values for a given colb to get the minimum.

Try using fetch first row at the end of your sql:
http://www.postgresql.org/docs/8.1/static/sql-fetch.html
SELECT MIN(cola), colb
FROM tbl
GROUP BY colb
FETCH FIRST ROW only;

Inspired by Gareth's answer above:
SQL Fiddle
; WITH C as (SELECT *, ROW_NUMBER() OVER (PARTITION BY ColB) as rn FROM tbl)
SELECT *
FROM c
WHERE rn = 1
Not sure if it will perform any better\worse than MIN().

Related

In SQL, does groupby on an ordered query behave the same as doing both in the same query?

Are the following queries identical, or might I get different results (in any major DB system, e.g. MSSQL, MySQL, Postgres, SQLite):
Doing both in the same query:
SELECT group, some_agg_func(some_value)
FROM my_table
GROUP BY group
ORDER BY some_other_value
vs. ordering in a subquery:
SELECT group, some_agg_func(some_value)
FROM (
SELECT group, some_value
FROM my_table
ORDER BY some_other_value
) as alias
GROUP BY group
Looking at the first sample:
SELECT group, some_agg_func(some_value)
FROM my_table
GROUP BY group
ORDER BY some_other_value
Let's think about what GROUP BY does by looking at this imaginary sample data:
A B
- -
1 1
1 2
Then think about this query:
SELECT A
FROM SampleData
GROUP BY A
ORDER BY B
The GROUP BY clause puts the two rows into a single group. Then we want to order by B... but the two rows in the group have different values for B. Which should it use?
Obviously in this situation it doesn't really matter: there's only one row in the results, so the order is not relevant. But generally, how does the database know what to do?
The database could guess which one you want, or just take the first value, or the last — whatever those mean in a setting where the data is unordered by definition. And in fact this is what MySql will try to do for you: it will try to guess are your meaning. But this response is really inappropriate. You specified an in-exact query; the only correct thing to do is throw an error, which is what most databases will do.
Now let's look at the second sample:
SELECT group, some_agg_func(some_value)
FROM (
SELECT group, some_value
FROM my_table
ORDER BY some_other_value
) as alias
GROUP BY group
Here it is important to remember databases have their roots in relational set theory, and what we think of as "tables" are more formally described as Unordered Relations. Again: the idea of being "unordered" is baked into the very nature of a table at the deepest level.
In this case the inner query can run and create results in the specified order, and then the outer query can use that with GROUP BY to create a new set... but just like tables, query results are unordered relations. Without an ORDER BY clause the final result is also unordered by definition.
Now you might tend to get results in the order you want, but the reality is all bets are off. In fact, the databases that run this query will tend to give you results in the order in which they first encountered each group, which will not tend to match the ORDER BY because the GROUP BY expression is looking at completely different columns. Other databases (Sql Server is in this group) will not even allow the query to run, though I might prefer a warning here.
So now we come to the final section, where we must re-think the question, like this:
How can I use GROUP BY on the one group column, while also ordering by some_other_column not in the group?
The answer is each group can contain multiple rows, and so you must tell the database which row to look at to get the correct (specific) some_other_column value. The typical way to do this is with another aggregate function, which might look like this:
SELECT group, some_agg_func(some_value)
FROM my_table
GROUP BY group
ORDER BY some_other_agg_func(some_other_column)
That code will run without error on pretty much any database.
Just be careful here. On one hand, when people want to do this it's often for the common case where they know every record for some_other_column in each group will have the same value. For example, you might GROUP BY UserID, but ORDER BY Email, where of course every record with the same UserID should have the same Email address. As humans, we have the ability to make that kind of inference. Computers, however, don't handle that kind of thinking as well, and so we help it out with an extra aggregate function like MIN() or MAX().
On the other hand, if you're not careful sometimes the two different aggregate functions don't match up, and you end up showing the value from one row in the group, while using a completely different row from the group for the ORDER BY expression in a way that is not good.
Tables are unordered sets of data. A query result is a table. So if you select from a subquery that contains an ORDER BY clause, that clause means nothing; the data set is unordered by definition. The DBMS is free to ignore the ORDER BY clause. Some DBMS may even issue a warning or error, but I suppose it's more common that the ORDER BY clause just has no effect - at least not guaranteed.
In this query
SELECT group, some_agg_func(some_value)
FROM my_table
GROUP BY group
ORDER BY some_other_value
you try to order your results by some_other_value. If this is meant to be a column, you can't, because that other column is no part of your results. You'll get a syntax error. If some_other_value is a fixed value, then there is nothing ordered, because you'd have the same sort key for every row. But it can be an expression based on your result data (group key and aggreation results) and you can order your result rows by that.
In this query
SELECT group, some_agg_func(some_value)
FROM (
SELECT group, some_value
FROM my_table
ORDER BY some_other_value
) as alias
GROUP BY group
the ORDER BY clause has no effect. You could just as well just select FROM my_table directly:
SELECT group, some_agg_func(some_value)
FROM my_table as alias
GROUP BY group
This gets the results unordered (or at least the order you see is not guaranteed to be thus every time you run that query), because your query doesn't have an ORDER BY clause.

Selecting distinct values from database

I have a table as follows:
ParentActivityID | ActivityID | Timestamp
1 A1 T1
2 A2 T2
1 A1 T1
1 A1 T5
I want to select unique ParentActivityID's along with Timestamp. The time stamp can be the most recent one or the first one as is occurring in the table.
I tried to use DISTINCT but i came to realise that it dosen't work on individual columns. I am new to SQL. Any help in this regard will be highly appreciated.
DISTINCT is a shorthand that works for a single column. When you have multiple columns, use GROUP BY:
SELECT ParentActivityID, Timestamp
FROM MyTable
GROUP BY ParentActivityID, Timestamp
Actually i want only one one ParentActivityID. Your solution will give each pair of ParentActivityID and Timestamp. For e.g , if i have [1, T1], [2,T2], [1,T3], then i wanted the value as [1,T3] and [2,T2].
You need to decide what of the many timestamps to pick. If you want the earliest one, use MIN:
SELECT ParentActivityID, MIN(Timestamp)
FROM MyTable
GROUP BY ParentActivityID
Try this:
SELECT [ParentActivityId],
MIN([Timestamp]) AS [FirstTimestamp],
MAX([Timestamp]) AS [RecentTimestamp]
FROM [Table]
GROUP BY [ParentActivityId]
This will provide you the first timestamp and the most recent timestamp for each ParentActivityId that is present in your table. You can choose the ones you need as per your need.
"Group by" is what you need here. Just do "group by ParentActivityID" and tell that most recent timestamp along all rows with same ParentActivityID is needed for you:
SELECT ParentActivityID, MAX(Timestamp) FROM Table GROUP BY ParentActivityID
"Group by" operator is like taking rows from a table and putting them in a map with a key defined in group by clause (ParentActivityID in this example). You have to define how grouping by will handle rows with duplicate keys. For this you have various aggregate functions which you specify on columns you want to select but which are not part of the key (not listed in group by clause, think of them as a values in a map).
Some databases (like mysql) also allow you to select columns which are not part of the group by clause (not in a key) without applying aggregate function on them. In such case you will get some random value for this column (this is like blindly overwriting value in a map with new value every time). Still, SQL standard together with most databases out there will not allow you to do it. In such case you can use min(), max(), first() or last() aggregate function to work around it.
Use CTE for getting the latest row from your table based on parent id and you can choose the columns from the entire row of the output .
;With cte_parent
As
(SELECT ParentActivityId,ActivityId,TimeStamp
, ROW_NUMBER() OVER(PARTITION BY ParentActivityId ORDER BY TimeStamp desc) RNO
FROM YourTable )
SELECT *
FROM cte_parent
WHERE RNO =1

SQL Server : GROUP error

I have a problem in SQL Server. I have a table name rTable and inside many columns of different types (Date, XML, varchar and so..)
Now I need group by one of this columns.
SELECT TOP 100 *
FROM rTable
GROUP BY integer_column
But this query give me error.
In MySQL it working normaly, but SQL Server asking me for do some operation in group with another columns, I need just show it.
What is recomendation?
You can not select all the column with group by clause in MS SQL.
SELECT integer_column, count(integer_column)
FROM rTable
GROUP BY integer_column
and also can use aggregate function like count, sum in select clause if you wanted.
Actually, this is not a limitation of MS SQL (although it has many), but it is conforming to the SQL standard.
You have to name all columns in the GROUP BY clause, except those using an aggregate function.
You are using two MySQL extensions that don't work in SQL Server.
In Standard SQL each column selected in an aggregation query must either be in the GROUP BY clause (integer_column in your case) or get aggregated (such as max(other_column)). You select * instead, so you select integer_column and other_column in this example. In MySQL you simply get a random match for other_column in the group. In SQL Server you get a syntax error instead.
Then: MySQL extends GROUP BY, so that it is guaranteed to get the records in the order specified in GROUP BY. In Standard SQL you must add an ORDER BY clause to be sure to get the records in the desired order. TOP without ORDER BY might give you 100 random records.
Something like this would work:
SELECT TOP 100
integer_column,
MIN(other_column) AS other_column
FROM rTable
GROUP BY integer_column
ORDER BY integer_column;
EDIT: I wonder whether you know what yor MySQL query is actually doing, so I think I'd better elaborate. Let's say your table only contains three columns integer_column, cola, and colb. And let's further say your table has just three records:
integer_column cola colb
1 a1 b1
1 a2 b2
2 a3 b3
Your query picks the lowest 100 integer_column. In this example exist just two: 1 and 2, so both are picked. Then per integer_column the query picks random cola and colb. Your result may look like this:
integer_column cola colb
1 a2 b1
2 a3 b3
(It usually won't. The dbms will play it easy and pick values from one record only, either the first or the last, so you probably get either 1-a1-b1 or 1-a2-b2, but there is no guarantee to that.)
Having said this, you will get the lowest integer_column along with indeterminate values. In Standard SQL you can use MIN or MAX, to get a value. It doesn't matter if you use MIN(cola) which is a1 or MAX(cola) which is a2, as you don't care anyhow.

"group by" needed in count(*) SQL statement?

The following statement works in my database:
select column_a, count(*) from my_schema.my_table group by 1;
but this one doesn't:
select column_a, count(*) from my_schema.my_table;
I get the error:
ERROR: column "my_table.column_a" must appear in the GROUP BY clause
or be used in an aggregate function
Helpful note: This thread: What does SQL clause "GROUP BY 1" mean? discusses the meaning of "group by 1".
Update:
The reason why I am confused is because I have often seen count(*) as follows:
select count(*) from my_schema.my_table
where there is no group by statement. Is COUNT always required to be followed by group by? Is the group by statement implicit in this case?
This error makes perfect sense. COUNT is an "aggregate" function. So you need to tell it which field to aggregate by, which is done with the GROUP BY clause.
The one which probably makes most sense in your case would be:
SELECT column_a, COUNT(*) FROM my_schema.my_table GROUP BY column_a;
If you only use the COUNT(*) clause, you are asking to return the complete number of rows, instead of aggregating by another condition. Your questing if GROUP BY is implicit in that case, could be answered with: "sort of": If you don't specify anything is a bit like asking: "group by nothing", which means you will get one huge aggregate, which is the whole table.
As an example, executing:
SELECT COUNT(*) FROM table;
will show you the number of rows in that table, whereas:
SELECT col_a, COUNT(*) FROM table GROUP BY col_a;
will show you the the number of rows per value of col_a. Something like:
col_a | COUNT(*)
---------+----------------
value1 | 100
value2 | 10
value3 | 123
You also should take into account that the * means to count everything. Including NULLs! If you want to count a specific condition, you should use COUNT(expression)! See the docs about aggragate functions for more details on this topic.
If you don't use the Group by clause at all then all that will be returned is a count of 1 for each row, which is already assumed anyway and therefore redundant data. By adding GROUP BY 1 you have categorized the information thereby making it non-redundant even though it returns the same result in theory as the statement that creates an error.
When you have a function like count, sum etc. you need to group the other columns. This would be equivalent to your query:
select column_a, count(*) from my_schema.my_table group by column_a;
When you use count(*) with no other column, you are counting all rows from SELECT * from the table. When you use count(*) alongside another column, you are counting the number of rows for each different value of that other column. So in this case you need to group the results, in order to show each value and its count only once.
group by 1 in this case refers to column_a which has the column position 1 in your query.
This why it works on your server. Indeed this is not a good practice in sql.
You should mention the column name because the column order may change in the table so it will be hard to maintain this code.
The best solution is:
select column_a, count(*) from my_schema.my_table group by column_a;

Select last n rows without use of order by clause

I want to fetch the last n rows from a table in a Postgres database. I don't want to use an ORDER BY clause as I want to have a generic query. Anyone has any suggestions?
A single query will be appreciated as I don't want to use FETCH cursor of Postgres.
That you get what you expect with Lukas' solution (as of Nov. 1st, 2011) is pure luck. There is no "natural order" in an RDBMS by definition. You depend on implementation details that could change with a new release without notice. Or a dump / restore could change that order. It can even change out of the blue when db statistics change and the query planner chooses a different plan that leads to a different order of rows.
The proper way to get the "last n" rows is to have a timestamp or sequence column and ORDER BY that column. Every RDBMS you can think of has ORDER BY, so this is as 'generic' as it gets.
The manual:
If ORDER BY is not given, the rows are returned in whatever order the
system finds fastest to produce.
Lukas' solution is fine to avoid LIMIT, which is implemented differently in various RDBMS (for instance, SQL Server uses TOP n instead of LIMIT), but you need ORDER BY in any case.
Use window functions!
select t2.* from (
select t1.*, row_number() over() as r, count(*) over() as c
from (
-- your generic select here
) t1
) t2
where t2.r + :n > t2.c
In the above example, t2.r is the row number of every row, t2.c is the total records in your generic select. And :n will be the n last rows that you want to fetch. This also works when you sort your generic select.
EDIT: A bit less generic from my previous example:
select * from (
select my_table.*, row_number() over() as r, count(*) over() as c
from my_table
-- optionally, you could sort your select here
-- order by my_table.a, my_table.b
) t
where t.r + :n > t.c