Aggregate functions in WHERE clause in SQLite

Simply put, I have a table with, among other things, a column for timestamps. I want to get the row with the most recent (i.e. greatest value) timestamp. Currently I'm doing this:
SELECT * FROM table ORDER BY timestamp DESC LIMIT 1
But I'd much rather do something like this:
SELECT * FROM table WHERE timestamp=max(timestamp)
However, SQLite rejects this query:
SQL error: misuse of aggregate function max()
The documentation confirms this behavior (bottom of page):
Aggregate functions may only be used in a SELECT statement.
My question is: is it possible to write a query to get the row with the greatest timestamp without ordering the select and limiting the number of returned rows to 1? This seems like it should be possible, but I guess my SQL-fu isn't up to snuff.

SELECT * from foo where timestamp = (select max(timestamp) from foo)
or, if SQLite insists on treating subselects as sets,
SELECT * from foo where timestamp in (select max(timestamp) from foo)

There are many ways to skin a cat.
If you have an auto-incrementing identity column, returning the last record by ID is usually faster, because that column is indexed, unless of course you'd rather put an index on the timestamp column.
SELECT * FROM TABLE ORDER BY ID DESC LIMIT 1

I think I've answered this question 5 times in the past week now, but I'm too tired to find a link to one of those right now, so here it is again...
SELECT *
FROM table T1
LEFT OUTER JOIN table T2
    ON T2.timestamp > T1.timestamp
WHERE T2.timestamp IS NULL
You're basically looking for the row for which no other, later row exists.
NOTE: As pointed out in the comments, this method will not perform as well in this kind of situation. It will usually work better (for SQL Server at least) in situations where you want the last row for each customer (as an example).
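For example, a rough sketch of that per-customer variant (orders, customer_id and order_date are hypothetical names, not from the question):
SELECT o1.*
FROM orders o1
LEFT OUTER JOIN orders o2
    ON o2.customer_id = o1.customer_id
    AND o2.order_date > o1.order_date
WHERE o2.order_date IS NULL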

you can simply do
SELECT *, max(timestamp) FROM table
Edit:
Since an aggregate function can't be used like this, the query gives an error. I guess what SquareCog suggested is the best thing to do:
SELECT * FROM table WHERE timestamp = (select max(timestamp) from table)

Why does SQLite return the wrong value from a subquery?

Given a schema and data in SQLite 3.7.17 (I'm stuck with this version):
CREATE TABLE reservations (id INTEGER NOT NULL PRIMARY KEY,NodeID INTEGER,ifIndex INTEGER,dt TEXT,action TEXT,user TEXT,p TEXT);
INSERT INTO "reservations" VALUES(1,584,436211200,'2022-03-12 10:10:00','R','s','x');
INSERT INTO "reservations" VALUES(2,584,436211200,'2022-03-12 10:10:01','R','s','x');
INSERT INTO "reservations" VALUES(3,584,436211200,'2022-03-12 10:10:05','U','s','x');
INSERT INTO "reservations" VALUES(4,584,436211200,'2022-03-12 10:09:01','R','s','x');
I'm trying to get the most recent action for each pair of (NodeID,ifIndex).
Running SELECT MAX(dt),action FROM reservations GROUP BY NodeId,ifIndex; I get:
MAX(dt)|action
2022-03-12 10:10:05|U
Perfect.
Now I want to select just the action from this query (dropping the MAX(dt)) with SELECT t.action FROM (SELECT MAX(dt),action FROM reservations GROUP BY NodeId,ifIndex) AS t; and get:
t.action
R
This I don't understand. Also, SELECT t.* FROM (SELECT MAX(dt),action FROM reservations GROUP BY NodeId,ifIndex) AS t;
MAX(dt)|action
2022-03-12 10:10:05|U
gives the correct value. So why does the query not seem to be querying against the subquery?
Perhaps it's a bug in this version of SQLite, as SQLFiddle works fine (http://sqlfiddle.com/#!7/f7619a/4).
In an attempt to work around this issue I use this query: SELECT t2.action FROM (SELECT MAX(dt),* FROM reservations GROUP BY NodeId,ifIndex) AS t1 INNER JOIN reservations AS t2 ON t1.id = t2.id, which seems to work:
action
U
You are right, this seems to be a bug in your SQLite version.
To get into more detail, you are using SQLite's GROUP BY extension "Bare columns in an aggregate query".
In standard SQL and almost all RDBMS your query
SELECT MAX(dt), action FROM reservations GROUP BY NodeId, ifIndex;
is invalid. Why is that? You group by NodeId and ifIndex, thus aggregating your data down to one result row per NodeId and ifIndex. In each such row you want to show the group's maximum date and the group's action. But while there is one maximum date for a group, there is no one action for it, but several. Your query is considered invalid in standard SQL, because you don't tell the DBMS which of the group's actions you want to see. This could be the minimum action for example (i.e. the first in alphabetical order). That means there must be an aggregation function invoked on that column.
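For example, a standard-conformant version of the query would aggregate action explicitly. With the sample data above, MIN(action) returns 'R', the alphabetically first action in the group, which is not the action of the MAX(dt) row:
SELECT MAX(dt), MIN(action) FROM reservations GROUP BY NodeId, ifIndex;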
Not so in SQLite. When SQLite finds a "bare column" in a GROUP BY query that computes a MAX or MIN of a column, it takes the bare column's value from the row in which that minimum or maximum was found. This is an extension to the SQL standard, and SQLite is the only DBMS I know of that features it. You can read about this in the SQLite docs: search for "Bare columns in an aggregate query" in https://www.sqlite.org/lang_select.html#resultset.
SELECT MAX(dt), action FROM reservations GROUP BY NodeId, ifIndex;
hence finds the action in the row with the maximum dt. If you selected MIN(dt) instead, it would get you the action of the row with the minimum dt.
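To illustrate with the sample data from the question, the documented behaviour means the following should return the action of the earliest row:
SELECT MIN(dt), action FROM reservations GROUP BY NodeId, ifIndex;
-- expected: 2022-03-12 10:09:01|R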
And of course a query selecting from a subquery result should still get the same value. It seems, however, that in your version SQLite gets confused with its bare column detection. It doesn't throw an error telling you it doesn't know which action to select, but it doesn't select the maximum dt's action either. Obviously a bug.
In standard SQL (and almost any RDBMS) your original query would be written like this:
SELECT dt, action
FROM reservations r
WHERE dt =
(
SELECT MAX(dt)
FROM reservations mr
WHERE mr.NodeId = r.NodeId AND mr.ifIndex = r.ifIndex
);
or like this:
SELECT dt, action
FROM reservations r
WHERE NOT EXISTS
(
SELECT NULL
FROM reservations gr
WHERE gr.NodeId = r.NodeId
AND gr.ifIndex = r.ifIndex
AND gr.dt > r.dt
);
or like this:
SELECT dt, action
FROM
(
SELECT dt, action, MAX(dt) OVER (PARTITION BY NodeId, ifIndex) AS max_dt
FROM reservations
) with_max_dt
WHERE dt = max_dt;
And there are still other ways to get the top row(s) per group.
In any of these proper SQL queries, you can remove dt from the select list and still get the maximum dt's action.
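For instance, dropping dt from the first (correlated subquery) variant still returns the maximum dt's action:
SELECT action
FROM reservations r
WHERE dt =
(
SELECT MAX(dt)
FROM reservations mr
WHERE mr.NodeId = r.NodeId AND mr.ifIndex = r.ifIndex
);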

SQL Server: verify that two columns are in the same sort order

I have a table with an ID and a date column. It's possible (likely) that when a new record is created, it gets the next larger ID and the current datetime. So if I were to sort by date or I were to sort by ID, the resulting data set would be in the same order.
How do I write a SQL query to verify this?
It's also possible that an older record is modified and the date is updated. In that case, the records would not be in the same sort order. I don't think this happens.
I'm trying to move the data to another location, and if I know that there are no modified records, that makes it a lot simpler.
I'm pretty sure I only need to query those two columns: ID, RecordDate. Other links indicate I should be able to use LAG, but I'm getting an error that it isn't a built-in function name.
In other words, both https://dba.stackexchange.com/questions/42985/running-total-to-the-previous-row and Is there a way to access the "previous row" value in a SELECT statement? should help, but I'm still not able to make that work for what I want.
If you cannot use window functions, you can use a correlated subquery and EXISTS.
SELECT *
FROM elbat t1
WHERE EXISTS (SELECT *
FROM elbat t2
WHERE t2.id < t1.id
AND t2.recorddate > t1.recorddate);
It'll select all records for which another record with a lower ID but a greater timestamp exists. If the result is empty, you know that no such record exists and the data is ordered the way you want it.
Maybe you want to restrict it a bit more by using t2.recorddate >= t1.recorddate instead of t2.recorddate > t1.recorddate; I'm not sure how strict you need it to be.
Use this:
SELECT ID, RecordDate FROM tablename t
WHERE
(SELECT COUNT(*) FROM tablename WHERE tablename.ID < t.ID)
<>
(SELECT COUNT(*) FROM tablename WHERE tablename.RecordDate < t.RecordDate);
For each row, it counts how many rows have a smaller ID and how many rows have an earlier RecordDate.
If these two counts are not equal, it outputs the row.
The result is all the rows that would not end up in the same position when sorted by ID as when sorted by RecordDate.
One method uses window functions:
select count(*)
from (select t.*,
row_number() over (order by id) as seqnum_id,
row_number() over (order by date, id) as seqnum_date
from t
) t
where seqnum_id <> seqnum_date;
When the count is zero, then the two columns have the same ordering. Note that the second order by includes id. Two rows could have the same date. This makes the sort stable, so the comparison is valid even when date has duplicates.
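Since the question mentions LAG, here is how the same check could look with LAG on SQL Server 2012 or later (a sketch only; the table name is assumed, and ID/RecordDate are the columns named in the question):
SELECT COUNT(*) AS out_of_order_rows
FROM (
    SELECT ID, RecordDate,
           LAG(RecordDate) OVER (ORDER BY ID) AS prev_date
    FROM YourTable
) t
WHERE prev_date > RecordDate;
A count of zero means the dates never decrease as the ID increases.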
The above solutions are all good, but if both the dates and the IDs increase together and the IDs are consecutive (no gaps), this should also work:
SELECT modifiedid = t2.id
FROM yourtable t1
JOIN yourtable t2
    ON t1.id = t2.id + 1
    AND t1.recordDate < t2.recordDate

Selecting distinct values from database

I have a table as follows:
ParentActivityID | ActivityID | Timestamp
1                | A1         | T1
2                | A2         | T2
1                | A1         | T1
1                | A1         | T5
I want to select unique ParentActivityIDs along with a Timestamp. The timestamp can be either the most recent one or the first one occurring in the table.
I tried to use DISTINCT but I came to realise that it doesn't work on individual columns. I am new to SQL. Any help in this regard will be highly appreciated.
DISTINCT applies to the whole selected row, not to individual columns. When you want one row per combination of values, use GROUP BY:
SELECT ParentActivityID, Timestamp
FROM MyTable
GROUP BY ParentActivityID, Timestamp
Actually I want only one row per ParentActivityID. Your solution will give every distinct pair of ParentActivityID and Timestamp. For example, if I have [1, T1], [2, T2], [1, T3], then I want [1, T3] and [2, T2].
You need to decide what of the many timestamps to pick. If you want the earliest one, use MIN:
SELECT ParentActivityID, MIN(Timestamp)
FROM MyTable
GROUP BY ParentActivityID
Try this:
SELECT [ParentActivityId],
MIN([Timestamp]) AS [FirstTimestamp],
MAX([Timestamp]) AS [RecentTimestamp]
FROM [Table]
GROUP BY [ParentActivityId]
This will provide the first timestamp and the most recent timestamp for each ParentActivityId present in your table. You can then pick whichever one you need.
"Group by" is what you need here. Just do "group by ParentActivityID" and tell that most recent timestamp along all rows with same ParentActivityID is needed for you:
SELECT ParentActivityID, MAX(Timestamp) FROM Table GROUP BY ParentActivityID
"Group by" operator is like taking rows from a table and putting them in a map with a key defined in group by clause (ParentActivityID in this example). You have to define how grouping by will handle rows with duplicate keys. For this you have various aggregate functions which you specify on columns you want to select but which are not part of the key (not listed in group by clause, think of them as a values in a map).
Some databases (like mysql) also allow you to select columns which are not part of the group by clause (not in a key) without applying aggregate function on them. In such case you will get some random value for this column (this is like blindly overwriting value in a map with new value every time). Still, SQL standard together with most databases out there will not allow you to do it. In such case you can use min(), max(), first() or last() aggregate function to work around it.
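As an illustration of that workaround, here is a rough MySQL 5.7+ sketch using the table from this question; ANY_VALUE() explicitly marks a column whose value may come from any row of the group (it is not guaranteed to come from the MAX row):
SELECT ParentActivityID,
       MAX(Timestamp) AS RecentTimestamp,
       ANY_VALUE(ActivityID) AS SomeActivityID
FROM MyTable
GROUP BY ParentActivityID;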
Use a CTE to get the latest row per parent ID; you can then pick any columns you need from the entire row in the output.
;WITH cte_parent AS
(
    SELECT ParentActivityId, ActivityId, TimeStamp,
           ROW_NUMBER() OVER (PARTITION BY ParentActivityId ORDER BY TimeStamp DESC) AS RNO
    FROM YourTable
)
SELECT *
FROM cte_parent
WHERE RNO = 1

SELECT DISTINCT returns more rows than expected

I have read many answers here, but until now nothing could help me. I'm developing a ticket system, where each ticket has many updates.
I have about 2 tables: tb_ticket and tb_updates.
I created a SELECT with subqueries, which took a long time (about 25 seconds) to get about 1000 rows. I then changed it to an INNER JOIN instead of many SELECTs in subqueries, and it is really fast (70 ms), but now I get duplicate tickets. I would like to know how to get only the last row (ordered by time).
My current result is:
...
67355;69759;"COMPANY X";"2014-08-22 09:40:21";"OPEN";"John";1
67355;69771;"COMPANY X";"2014-08-26 10:40:21";"UPDATE";"John";1
The first column is the ticket ID, the second is the update ID... I would like to get only one row per ticket ID, but DISTINCT does not work in this case. Which row should it be? Always the latest one, so in this case the one from 2014-08-26 10:40:21.
UPDATE:
It is a PostgreSQL database. I did not share my current query because it uses only Portuguese names, so I think it would not help at all.
SOLUTION:
Used_By_Already had the best solution to my problem.
Without the details of your tables one has to guess the field names, but it seems that tb_updates has many records for a single record in tb_ticket (a many to one relationship).
A generic solution to your problem - to get just the "latest" record - is to use a subquery on tb_updates (see alias mx below) and then join that back to tb_updates so that only the record that has the latest date is chosen.
SELECT
t.*
, u.*
FROM tb_ticket t
INNER JOIN tb_updates u
ON t.ticket_id = u.ticket_id
INNER JOIN (
SELECT
ticket_id
, MAX(updated_at) max_updated
FROM tb_updates
GROUP BY
ticket_id
) mx
ON u.ticket_id = mx.ticket_id
AND u.updated_at = mx.max_updated
;
If you have a dbms that supports ROW_NUMBER() then using that function can be a very effective alternative method, but you haven't informed us which dbms you are using.
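Since the update to the question says it's PostgreSQL, which does support ROW_NUMBER(), a rough sketch (still guessing the same field names as above) could look like this:
SELECT *
FROM (
    SELECT t.*, u.*,
           ROW_NUMBER() OVER (PARTITION BY u.ticket_id ORDER BY u.updated_at DESC) AS rn
    FROM tb_ticket t
    INNER JOIN tb_updates u ON u.ticket_id = t.ticket_id
) x
WHERE rn = 1;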
by the way:
These rows ARE distinct:
67355;69759;"COMPANY X";"2014-08-22 09:40:21";"OPEN";"John";1
67355;69771;"COMPANY X";"2014-08-26 10:40:21";"UPDATE";"John";1
69759 is different to 69771, and that is enough for the 2 rows to be DISTINCT
there are difference in the 2 dates also.
DISTINCT is a row operator, which means it considers the entire row, not just the first column, when deciding which rows are unique.
Used_By_Already's solution would work just fine. I'm not sure about the performance, but another solution would be to use CROSS APPLY, though that is supported by only a few DBMSs.
SELECT *
FROM tb_ticket ticket
CROSS APPLY (
    SELECT TOP (1) *
    FROM tb_updates details
    WHERE details.ticketID = ticket.ticketID
    ORDER BY updateTime DESC
) updates
You can try something like the below if your update ID is an identity column:
SELECT ticketid, MAX(updateid)
FROM tb_updates
GROUP BY ticketid
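Note that this only returns the ticket ID and the update ID. If you need the other columns, you can join the grouped result back to the table (a sketch, with ticketid/updateid as guessed column names):
SELECT u.*
FROM tb_updates u
JOIN (
    SELECT ticketid, MAX(updateid) AS max_updateid
    FROM tb_updates
    GROUP BY ticketid
) m ON u.updateid = m.max_updateid;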
To obtain the last row, end your query with ORDER BY time DESC, then use TOP (1) in the SELECT statement to return only the first row of the result.
ex:
select TOP (1) .....
from .....
where .....
order by time desc

How do I calculate a moving average using MySQL?

I need to do something like:
SELECT value_column1
FROM table1
WHERE datetime_column1 >= '2009-01-01 00:00:00'
ORDER BY datetime_column1;
Except in addition to value_column1, I also need to retrieve a moving average of the previous 20 values of value_column1.
Standard SQL is preferred, but I will use MySQL extensions if necessary.
This is just off the top of my head, and I'm on the way out the door, so it's untested. I also can't imagine that it would perform very well on any kind of large data set. I did confirm that it at least runs without an error though. :)
SELECT
    value_column1,
    (
        SELECT AVG(value_column1) AS moving_average
        FROM Table1 T2
        WHERE (
            SELECT COUNT(*)
            FROM Table1 T3
            WHERE date_column1 BETWEEN T2.date_column1 AND T1.date_column1
        ) BETWEEN 1 AND 20
    )
FROM Table1 T1
Tom H's approach will work. You can simplify it like this if you have an identity column:
SELECT T1.id, T1.value_column1, AVG(T2.value_column1) AS moving_average
FROM table1 T1
INNER JOIN table1 T2 ON T2.id BETWEEN T1.id - 19 AND T1.id
GROUP BY T1.id, T1.value_column1
I realize that this answer is about 7 years too late. I had a similar requirement and thought I'd share my solution in case it's useful to someone else.
There are some MySQL extensions for technical analysis that include a simple moving average. They're really easy to install and use: https://github.com/mysqludf/lib_mysqludf_ta#readme
Once you've installed the UDF (per instructions in the README), you can include a simple moving average in a select statement like this:
SELECT TA_SMA(value_column1, 20) AS sma_20 FROM table1 ORDER BY datetime_column1
When I had a similar problem, I ended up using temp tables for a variety of reasons, but it made this a lot easier! What I did looks very similar to what you're doing, as far as the schema goes.
Make the schema something like ID identity, start_date, end_date, value. When you select, do a subselect avg of the previous 20 based on the identity ID.
Only do this if you find yourself already using temp tables for other reasons though (I hit the same rows over and over for different metrics, so it was helpful to have the small dataset).
My solution adds a row number to the table. The following example code may help:
set @MA_period=5;
select id1, tmp1.date_time, tmp1.c, avg(tmp2.c) from
(select @b:=@b+1 as id1, date_time, c from websource.EURUSD, (select @b:=0) bb order by date_time asc) tmp1,
(select @a:=@a+1 as id2, date_time, c from websource.EURUSD, (select @a:=0) aa order by date_time asc) tmp2
where id1>@MA_period and id1>=id2 and id2>(id1-@MA_period)
group by id1
order by id1 asc, id2 asc
In my experience, MySQL as of 5.5.x tends not to use indexes on dependent selects, whether a subquery or a join. This can have a very significant impact on performance where the dependent select criteria change on every row.
Moving average is an example of a query which falls into this category. Execution time may increase with the square of the rows. To avoid this, choose a database engine which can perform indexed look-ups on dependent selects. I find Postgres works effectively for this problem.
In MySQL 8, a window function frame can be used to obtain the averages.
SELECT value_column1, AVG(value_column1) OVER (ORDER BY datetime_column1 ROWS 19 PRECEDING) as ma
FROM table1
WHERE datetime_column1 >= '2009-01-01 00:00:00'
ORDER BY datetime_column1;
This calculates the average of the current row and 19 preceding rows.