How to use DISTINCT and ORDER BY in same SELECT statement? - sql

After executing the following statement:
SELECT Category FROM MonitoringJob ORDER BY CreationDate DESC
I am getting the following values from the database:
test3
test3
bildung
test4
test3
test2
test1
but I want the duplicates removed, like this:
bildung
test4
test3
test2
test1
I tried to use DISTINCT but it doesn't work with ORDER BY in one statement. Please help.
Important:
I tried it with:
SELECT DISTINCT Category FROM MonitoringJob ORDER BY CreationDate DESC
it doesn't work.
Order by CreationDate is very important.

The problem is that the column used in the ORDER BY isn't part of the DISTINCT select list. To make this work, group by Category and sort on an aggregate of CreationDate; the GROUP BY already returns each category once, so DISTINCT is no longer needed.
Try something like this:
SELECT Category, MAX(CreationDate)
FROM MonitoringJob
GROUP BY Category
ORDER BY MAX(CreationDate) DESC, Category
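
To see what this GROUP BY query returns, here is a minimal sketch that runs it against an in-memory SQLite database. The sample dates are hypothetical, chosen only so that ORDER BY CreationDate DESC reproduces the listing from the question, and SQLite stands in for whatever database the question actually uses.

```python
import sqlite3

# Hypothetical sample rows (the dates are assumptions, not from the question).
ROWS = [
    ("test1", "2024-01-01"),
    ("test2", "2024-01-02"),
    ("test3", "2024-01-03"),
    ("test4", "2024-01-04"),
    ("bildung", "2024-01-05"),
    ("test3", "2024-01-06"),
    ("test3", "2024-01-07"),
]


def categories_by_latest(rows):
    """Return each category once, ordered by its newest CreationDate."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE MonitoringJob (Category TEXT, CreationDate TEXT)")
    conn.executemany("INSERT INTO MonitoringJob VALUES (?, ?)", rows)
    result = conn.execute(
        """
        SELECT Category, MAX(CreationDate)
        FROM MonitoringJob
        GROUP BY Category
        ORDER BY MAX(CreationDate) DESC, Category
        """
    ).fetchall()
    conn.close()
    return [category for category, _ in result]


print(categories_by_latest(ROWS))
```

With this data test3 comes first, because its most recent row is the newest row overall.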

Extended sort key columns
What you want to do doesn't work because of the logical order of operations in SQL, as I've elaborated in this blog post. For your first query, that order is (simplified):
FROM MonitoringJob
SELECT Category, CreationDate i.e. add a so called extended sort key column
ORDER BY CreationDate DESC
SELECT Category i.e. remove the extended sort key column again from the result.
So, thanks to the SQL standard extended sort key column feature, it is totally possible to order by something that is not in the SELECT clause, because it is being temporarily added to it behind the scenes.
So, why doesn't this work with DISTINCT?
If we add the DISTINCT operation, it would be added between SELECT and ORDER BY:
FROM MonitoringJob
SELECT Category, CreationDate
DISTINCT
ORDER BY CreationDate DESC
SELECT Category
But now, with the extended sort key column CreationDate, the semantics of the DISTINCT operation have changed, so the result will no longer be the same. This is not what we want, so both the SQL standard and all reasonable databases forbid this usage.
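The change in DISTINCT semantics can be made concrete with a small sketch. It counts the rows returned by DISTINCT over Category alone and over the (Category, CreationDate) pair, using an in-memory SQLite database and hypothetical dates as assumptions.

```python
import sqlite3

# Hypothetical sample rows (the dates are assumptions, not from the question).
ROWS = [
    ("test1", "2024-01-01"),
    ("test2", "2024-01-02"),
    ("test3", "2024-01-03"),
    ("test4", "2024-01-04"),
    ("bildung", "2024-01-05"),
    ("test3", "2024-01-06"),
    ("test3", "2024-01-07"),
]


def distinct_counts(rows):
    """Return (rows for DISTINCT Category, rows for DISTINCT Category, CreationDate)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE MonitoringJob (Category TEXT, CreationDate TEXT)")
    conn.executemany("INSERT INTO MonitoringJob VALUES (?, ?)", rows)
    only_category = conn.execute(
        "SELECT COUNT(*) FROM (SELECT DISTINCT Category FROM MonitoringJob)"
    ).fetchone()[0]
    with_sort_key = conn.execute(
        "SELECT COUNT(*) FROM (SELECT DISTINCT Category, CreationDate FROM MonitoringJob)"
    ).fetchone()[0]
    conn.close()
    return only_category, with_sort_key


print(distinct_counts(ROWS))
```

Once the sort key is part of the DISTINCT, the three test3 rows no longer collapse into one, which is why the extended sort key column cannot simply be added behind the scenes.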
Workarounds
It can be emulated with standard syntax as follows
SELECT Category
FROM (
SELECT Category, MAX(CreationDate) AS CreationDate
FROM MonitoringJob
GROUP BY Category
) t
ORDER BY CreationDate DESC
Or, just simply (in this case), as shown also by Prutswonder
SELECT Category, MAX(CreationDate) AS CreationDate
FROM MonitoringJob
GROUP BY Category
ORDER BY CreationDate DESC
I have blogged about SQL DISTINCT and ORDER BY more in detail here.

If the output of MAX(CreationDate) is not wanted - like in the example of the original question - the only answer is the second statement of Prashant Gupta's answer:
SELECT [Category] FROM [MonitoringJob]
GROUP BY [Category] ORDER BY MAX([CreationDate]) DESC
Explanation: you can't use the ORDER BY clause in an inline function, so the statement in Prutswonder's answer is not usable in this case: you can't put an outer select around it and discard the MAX(CreationDate) part.

Use this code if you want the values of both the [Category] and [CreationDate] columns:
SELECT [Category], MAX([CreationDate]) FROM [MonitoringJob]
GROUP BY [Category] ORDER BY MAX([CreationDate]) DESC
Or use this code if you want only the values of the [Category] column:
SELECT [Category] FROM [MonitoringJob]
GROUP BY [Category] ORDER BY MAX([CreationDate]) DESC
Either way you get each distinct category once, ordered by its latest creation date.

2) Order by CreationDate is very important
The original results indicated that "test3" had multiple results...
It's very easy to start using MAX all the time to remove duplicates in Group By's... and forget or ignore what the underlying question is...
The OP presumably realised that using MAX was giving him the last "created" and using MIN would give the first "created"...
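The difference between MAX and MIN can be checked with a short sketch. It assumes an in-memory SQLite database and hypothetical dates chosen so that ORDER BY CreationDate DESC reproduces the listing from the question; the helper name is made up for illustration.

```python
import sqlite3

# Hypothetical sample rows (the dates are assumptions, not from the question).
ROWS = [
    ("test1", "2024-01-01"),
    ("test2", "2024-01-02"),
    ("test3", "2024-01-03"),
    ("test4", "2024-01-04"),
    ("bildung", "2024-01-05"),
    ("test3", "2024-01-06"),
    ("test3", "2024-01-07"),
]


def order_categories(rows, aggregate):
    """Order distinct categories by MIN or MAX of CreationDate, newest first."""
    if aggregate not in ("MIN", "MAX"):
        raise ValueError("aggregate must be 'MIN' or 'MAX'")
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE MonitoringJob (Category TEXT, CreationDate TEXT)")
    conn.executemany("INSERT INTO MonitoringJob VALUES (?, ?)", rows)
    # The aggregate name is checked against a fixed whitelist above,
    # so interpolating it into the statement is safe here.
    result = conn.execute(
        f"SELECT Category FROM MonitoringJob "
        f"GROUP BY Category ORDER BY {aggregate}(CreationDate) DESC"
    ).fetchall()
    conn.close()
    return [row[0] for row in result]


print(order_categories(ROWS, "MAX"))
print(order_categories(ROWS, "MIN"))
```

With this data, MIN yields bildung, test4, test3, test2, test1, which is the order shown as the desired output in the question, while MAX puts test3 first.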

if object_id ('tempdb..#tempreport') is not null
begin
drop table #tempreport
end
create table #tempreport (
Category nvarchar(510),
CreationDate datetime )
insert into #tempreport (Category, CreationDate)
select Category, max(CreationDate) from MonitoringJob (nolock) group by Category
select Category from #tempreport ORDER BY CreationDate DESC

DISTINCT by itself does not guarantee any particular order of the result. If you want to sort by Category in descending order, use:
SELECT DISTINCT Category
FROM MonitoringJob
ORDER BY Category DESC
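If you want to sort records based on the CreationDate field, then this field must be in the select list. Note that DISTINCT then applies to the (Category, CreationDate) pair, so a category can appear more than once: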
SELECT DISTINCT Category, creationDate
FROM MonitoringJob
ORDER BY CreationDate DESC

You can use CTE:
WITH DistinctMonitoringJob AS (
SELECT DISTINCT Category Distinct_Category FROM MonitoringJob
)
SELECT Distinct_Category
FROM DistinctMonitoringJob
ORDER BY Distinct_Category DESC

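A subquery like the following is sometimes suggested, but the ORDER BY inside the IN subquery does not determine the order of the outer result, and SQL Server rejects an ORDER BY in a subquery unless TOP, OFFSET or FOR XML is also specified: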
SELECT distinct(Category) from MonitoringJob where Category in(select Category from MonitoringJob order by CreationDate desc);

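We can do this with a subquery. The subquery has to expose CreationDate (here as the latest date per category), otherwise the outer ORDER BY has no column to sort on.
Here is the query:
SELECT Tbl.Category FROM (
SELECT Category, MAX(CreationDate) AS CreationDate FROM MonitoringJob GROUP BY Category
) AS Tbl
ORDER BY Tbl.CreationDate DESC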

Try the following, but it is not well suited to huge data sets:
SELECT DISTINCT Cat FROM (
SELECT Category as Cat FROM MonitoringJob ORDER BY CreationDate DESC
);

It can be done using inner query Like this
$query = "SELECT *
FROM (SELECT Category
FROM currency_rates
ORDER BY id DESC) as rows
GROUP BY currency";

SELECT DISTINCT Category FROM MonitoringJob ORDER BY Category ASC

Related

BigQuery - Extract last entry of each group

I have one table where multiple records are inserted for each group of products. Now I want to extract (SELECT) only the last entry of each group. For more, see the screenshot. The yellow highlighted records should be returned by the SELECT query.
The HAVING MAX and HAVING MIN clauses for the ANY_VALUE function are now in preview.
They were recently introduced for some aggregate functions: https://cloud.google.com/bigquery/docs/release-notes#February_06_2023
With them the query can be very simple. Consider the approach below:
select any_value(t having max datetime).*
from your_table t
group by t.id, t.product
if applied to sample data in your question - output is
You might consider the following as well:
SELECT *
FROM sample_table
QUALIFY DateTime = MAX(DateTime) OVER (PARTITION BY ID, Product);
If you're more familiar with aggregate functions than with window functions, the following might be another option.
SELECT ARRAY_AGG(t ORDER BY DateTime DESC LIMIT 1)[SAFE_OFFSET(0)].*
FROM sample_table t
GROUP BY t.ID, t.Product
You can use a window function to partition the rows by the grouping key and then pick the row you need based on the ordering within each partition.
For example:
select * from (
select *,
rank() over (partition by ID, Product order by DateTime desc) as rnk
from `project.dataset.table`)
where rnk = 1
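The following was also suggested, but TOP is SQL Server syntax that BigQuery does not support, and TOP(1) would return a single row overall rather than the last record of each group: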
Select Top(1) * from Tablename group by ID order by DateTime Desc
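
The window-function approach to picking the last entry of each group can be tried locally. The sketch below uses an in-memory SQLite database (version 3.25 or later is assumed for window function support) instead of BigQuery, and the sample rows are hypothetical since the screenshot from the question is not available.

```python
import sqlite3

# Hypothetical sample rows: (ID, Product, DateTime).
ROWS = [
    (1, "A", "2023-01-01"),
    (1, "A", "2023-01-03"),
    (1, "B", "2023-01-02"),
    (2, "A", "2023-01-05"),
    (2, "A", "2023-01-04"),
]


def last_entry_per_group(rows):
    """Return the newest row for every (ID, Product) group."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sample_table (ID INTEGER, Product TEXT, DateTime TEXT)")
    conn.executemany("INSERT INTO sample_table VALUES (?, ?, ?)", rows)
    result = conn.execute(
        """
        SELECT ID, Product, DateTime
        FROM (
            SELECT ID, Product, DateTime,
                   ROW_NUMBER() OVER (
                       PARTITION BY ID, Product ORDER BY DateTime DESC
                   ) AS rn
            FROM sample_table
        )
        WHERE rn = 1
        ORDER BY ID, Product
        """
    ).fetchall()
    conn.close()
    return result


print(last_entry_per_group(ROWS))
```

ROW_NUMBER is used here instead of RANK so that exactly one row per group is returned even when two rows share the same DateTime; with RANK, ties would all be returned.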

Extract and concatenate the same field from multiple records in big query

I would like to be able to extract one field from multiple records from within a single table. For example, assuming I have a schema as follows
userId, eventTimestamp, theField
What I want to do is concatenate all instances of the field 'theField' into a single string for a given userId, ordered by eventTimestamp. And for an extra wrinkle, let's say I only want to include the fifty oldest records.
My first attempt was to try something like:
SELECT
userId,
eventTimestamp,
LEAD(theField,0) OVER (PARTITION BY userId ORDER BY eventTimestamp) AS step0,
LEAD(theField,1) OVER (PARTITION BY userId ORDER BY eventTimestamp) AS step1,
....,
LEAD(theField,50) OVER (PARTITION BY userId ORDER BY eventTimestamp) AS step50,
And then the next step was to wrap that first step up in another SELECT statement as follows:
SELECT userId, eventTimestamp, CONCAT(STRING(step0), STRING(step1),...,STRING(step50)) as concatenatedString
FROM [whateverDataset.whateverTable],
GROUP BY
userId, eventTimestamp
This approach doesn't work though because if I have more than 50 steps (which I do), then I end up getting multiple rows for each of those outer SELECT statements, basically N-50 rows, where N = the total number of records for a particular userId. A 'solution' to this would be to have a HAVING statement in the inner SELECT statement to limit itself to only reporting the first 50 records, but overall this seems like a rather cumbersome solution. In non-BigQuery variants of SQL the GROUP_CONCAT seems to be a good way to go forward, but it either doesn't work here or I lack the creativity to get it to work. Anyone have any suggestions?
Thanks,
Brad
For BigQuery Legacy SQL:
SELECT
userid, GROUP_CONCAT(theField) AS Fields
FROM (
SELECT
userid, eventTimestamp, theField,
ROW_NUMBER() OVER(PARTITION BY userid ORDER BY eventTimestamp) AS pos
FROM YourTable
ORDER BY eventTimestamp
)
WHERE pos < 51
GROUP BY userid
Please note: the inner ORDER BY does not guarantee the order of theField in GROUP_CONCAT. So far, in all practical cases I have seen, the order is preserved, but test carefully.
For BigQuery Standard SQL:
Don't forget to uncheck Use Legacy SQL checkbox under Show Options
SELECT
userid,
(SELECT STRING_AGG(fields) FROM t.fields) AS fields
FROM (
SELECT
userid,
ARRAY(SELECT theField FROM t.fields ORDER BY eventTimestamp) fields
FROM (
SELECT
userid,
ARRAY_AGG(STRUCT(theField, eventTimestamp)) fields
FROM (
SELECT
userid,
eventTimestamp,
theField,
ROW_NUMBER() OVER(PARTITION BY userid ORDER BY eventTimestamp) AS pos
FROM YourTable
)
WHERE pos < 51
GROUP BY userid
) t
) t
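
The same idea (number the rows per user, keep the oldest N, then concatenate in timestamp order) can be sketched outside BigQuery. The example below is an illustration under assumptions: it uses an in-memory SQLite database (3.25 or later for window functions), made-up sample rows, and does the concatenation in Python so that the order of the concatenated values is deterministic.

```python
import sqlite3
from itertools import groupby

# Hypothetical sample rows: (userId, eventTimestamp, theField).
ROWS = [
    ("u1", 3, "c"),
    ("u1", 1, "a"),
    ("u1", 2, "b"),
    ("u2", 5, "y"),
    ("u2", 4, "x"),
]


def concat_oldest(rows, limit=50, sep=","):
    """Concatenate theField of the `limit` oldest events per user, oldest first."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE YourTable (userId TEXT, eventTimestamp INTEGER, theField TEXT)"
    )
    conn.executemany("INSERT INTO YourTable VALUES (?, ?, ?)", rows)
    ordered = conn.execute(
        """
        SELECT userId, theField
        FROM (
            SELECT userId, eventTimestamp, theField,
                   ROW_NUMBER() OVER (
                       PARTITION BY userId ORDER BY eventTimestamp
                   ) AS pos
            FROM YourTable
        )
        WHERE pos <= ?
        ORDER BY userId, eventTimestamp
        """,
        (limit,),
    ).fetchall()
    conn.close()
    # Rows arrive sorted by user, then timestamp, so groupby sees each user once.
    return {
        user: sep.join(field for _, field in group)
        for user, group in groupby(ordered, key=lambda row: row[0])
    }


print(concat_oldest(ROWS, limit=2))
```

Ordering the window by ascending eventTimestamp keeps the oldest rows; switching it to DESC would keep the newest ones instead.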

Ambiguous column name using row_number() without alias

I'm trying to implement pagination in a query that is built using information from a view, and I need to use the row_number() function over a column when I don't know which table it is from.
SELECT * FROM (
SELECT class.ID as ID, user.ID as USERID, row_number() over (ORDER BY
ID desc) as row_number FROM class, user
) out_q WHERE row_number > @startrow ORDER BY row_number
The problem is that I only have the result column name (ID or USERID) that came from a previous query. If I execute this query, it will raise the error 'Ambiguous column name "ID"'. Is there a way to specify that I'm referencing the column ID that is being selected and not from a different table?
Is it possible to specify an alias to the query result itself?
I have already tried the following,
SELECT TOP 30 * FROM (
SELECT *, row_number() over (ORDER BY ID desc) as row_number FROM(
SELECT class.ID as ID, user.ID as USERID FROM class, user
) in_q
) out_q WHERE row_number > @startrow ORDER BY row_number
It works, but the DBMS gets confused about which query plan to use, because of the small row goal in the outer query and the large result set returned by the inner query. When the start row is a small number, the query executes in less than one second; when it is a big number, the query takes minutes to execute.
Your problem is the id in the row_number itself. If you want a stable sort, then include both ids:
SELECT *
FROM (SELECT class.ID as ID, user.ID as USERID,
row_number() over (ORDER BY class.ID desc, user.id) as row_number
FROM class CROSS JOIN user
) out_q
WHERE row_number > @startrow
ORDER BY row_number;
I assume the cartesian product is intentional. Sometimes, this indicates an error in the query. In general, I would advise you to avoid using commas in the from clause. If you do want a cartesian product, then be explicit by using CROSS JOIN.
You could keep the query you already tried and add the OPTIMIZE FOR hint:
OPTION ( OPTIMIZE FOR (@startrow = 100000) );
See a description of the hint in MSDN docs here: https://msdn.microsoft.com/en-us/library/ms181714.aspx.
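
The stable-sort pagination suggested above (qualify the column as class.ID and add user.ID as a tie-breaker) can be sketched as follows. This is an illustration under assumptions: it runs on an in-memory SQLite database (3.25 or later) rather than SQL Server, the table contents are made up, and the `page_size` parameter and `rn` alias are hypothetical additions.

```python
import sqlite3


def fetch_page(start_row, page_size):
    """Return rows numbered above start_row from the class x user product."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE class (ID INTEGER)")
    conn.execute("CREATE TABLE user (ID INTEGER)")
    conn.executemany("INSERT INTO class VALUES (?)", [(1,), (2,)])
    conn.executemany("INSERT INTO user VALUES (?)", [(10,), (20,), (30,)])
    rows = conn.execute(
        """
        SELECT ID, USERID, rn
        FROM (
            SELECT class.ID AS ID, user.ID AS USERID,
                   -- Qualified columns avoid the ambiguity; user.ID breaks ties.
                   ROW_NUMBER() OVER (ORDER BY class.ID DESC, user.ID) AS rn
            FROM class CROSS JOIN user
        ) out_q
        WHERE rn > ?
        ORDER BY rn
        LIMIT ?
        """,
        (start_row, page_size),
    ).fetchall()
    conn.close()
    return rows


print(fetch_page(2, 3))
```

Because the window ORDER BY covers both IDs, every row of the cross join gets a unique, repeatable row number, so consecutive pages neither overlap nor skip rows.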

SQL Order By using concat

I'm concatenating two fields and I only want to order by the second field (p.organizationname). Is that possible?
I'm displaying this field so I need a solution that doesn't include me having to select the fields separately.
Here is what i have so far:
SELECT distinct Concat(Concat(f.REFERENCEFILE, ','),p.ORGANIZATIONNAME)
FROM PEOPLE p,FOLDER f,FOLDERPEOPLE fp,folderinfo fi...
Order By concat(Concat(f.REFERENCEFILE, ','),p.ORGANIZATIONNAME)
Use GROUP BY and ORDER BY an aggregate instead of DISTINCT:
SELECT Concat(Concat(f.REFERENCEFILE, ','),p.ORGANIZATIONNAME)
FROM PEOPLE p,FOLDER f,FOLDERPEOPLE fp,folderinfo fi...
GROUP BY Concat(Concat(f.REFERENCEFILE, ','),p.ORGANIZATIONNAME)
Order By MAX(p.ORGANIZATIONNAME)
The problem can be illustrated with an example:
ID Col1
1 Dog
1 Cat
2 Horse
Distinct ID? Easy: 1,2
Distinct ID Order by Col1... wait.. which value of Col1 should SQL use? SQL is confused and angry.
Since you are using a concatenation of two fields and want to sort by one of those fields, you could also include the sort field in a DISTINCT subquery and then ORDER BY the sort field without including it in your SELECT list.
Since you use DISTINCT, the ORDER BY expression must appear in the SELECT list. You can use a subquery to achieve the same result; in your case the distinct values stay the same when you add P.ORGANIZATIONNAME:
SELECT a
FROM( SELECT distinct Concat(Concat(f.REFERENCEFILE, ','),p.ORGANIZATIONNAME) a,
p.ORGANIZATIONNAME b
FROM PEOPLE p,FOLDER f,FOLDERPEOPLE fp,folderinfo fi... ) t
order by b
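
A minimal sketch of this subquery pattern, run against an in-memory SQLite database, is shown below. The single `joined` table is a hypothetical stand-in for the result of the elided joins, the sample rows are made up, and SQLite's `||` operator replaces the nested Concat calls.

```python
import sqlite3

# Hypothetical rows: (REFERENCEFILE, ORGANIZATIONNAME), with one duplicate.
ROWS = [
    ("F3", "Acme"),
    ("F1", "Zeta"),
    ("F3", "Acme"),
    ("F2", "Beta"),
]


def concatenated_sorted_by_org(rows):
    """Return distinct 'file,organization' strings sorted by organization."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE joined (REFERENCEFILE TEXT, ORGANIZATIONNAME TEXT)")
    conn.executemany("INSERT INTO joined VALUES (?, ?)", rows)
    result = conn.execute(
        """
        SELECT a
        FROM (
            SELECT DISTINCT REFERENCEFILE || ',' || ORGANIZATIONNAME AS a,
                            ORGANIZATIONNAME AS b
            FROM joined
        ) t
        ORDER BY b, a  -- a is added only as a tie-breaker for this sketch
        """
    ).fetchall()
    conn.close()
    return [row[0] for row in result]


print(concatenated_sorted_by_org(ROWS))
```

Adding the sort column to the DISTINCT does not create extra rows here, because the organization name is already part of the concatenated value.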

SQL Select Bottom Records

I have a query where I wish to retrieve the oldest X records. At present my query is something like the following:
SELECT Id, Title, Comments, CreatedDate
FROM MyTable
WHERE CreatedDate > @OlderThanDate
ORDER BY CreatedDate DESC
I know that normally I would remove the 'DESC' keyword to switch the order of the records, however in this instance I still want to get records ordered with the newest item first.
So I want to know if there is any means of performing this query such that I get the oldest X items sorted such that the newest item is first. I should also add that my database exists on SQL Server 2005.
Why not just use a subquery?
SELECT T1.*
FROM
(SELECT TOP X Id, Title, Comments, CreatedDate
FROM MyTable
WHERE CreatedDate > @OlderThanDate
ORDER BY CreatedDate) T1
ORDER BY CreatedDate DESC
Embed the query. You take the top x when sorted in ascending order (i.e. the oldest) and then re-sort those in descending order ...
select *
from
(
SELECT top X Id, Title, Comments, CreatedDate
FROM MyTable
WHERE CreatedDate > @OlderThanDate
ORDER BY CreatedDate
) a
order by createddate desc
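
The two-step sort (take the oldest X in ascending order, then re-sort them descending) can be sketched like this. The example assumes an in-memory SQLite database, so LIMIT takes the place of SQL Server's TOP, and the sample rows are hypothetical.

```python
import sqlite3

# Hypothetical rows: (Id, CreatedDate).
ROWS = [
    (1, "2024-01-01"),
    (2, "2024-01-02"),
    (3, "2024-01-03"),
    (4, "2024-01-04"),
    (5, "2024-01-05"),
    (6, "2024-01-06"),
]


def oldest_x_newest_first(rows, older_than, x):
    """Return the x oldest rows created after older_than, newest of them first."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE MyTable (Id INTEGER, CreatedDate TEXT)")
    conn.executemany("INSERT INTO MyTable VALUES (?, ?)", rows)
    result = conn.execute(
        """
        SELECT Id, CreatedDate
        FROM (
            SELECT Id, CreatedDate
            FROM MyTable
            WHERE CreatedDate > ?
            ORDER BY CreatedDate      -- ascending: oldest rows first
            LIMIT ?                   -- stands in for TOP X
        ) t1
        ORDER BY CreatedDate DESC     -- re-sort the chosen rows, newest first
        """,
        (older_than, x),
    ).fetchall()
    conn.close()
    return result


print(oldest_x_newest_first(ROWS, "2024-01-01", 3))
```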