How do I calculate a moving average using MySQL?

I need to do something like:
SELECT value_column1
FROM table1
WHERE datetime_column1 >= '2009-01-01 00:00:00'
ORDER BY datetime_column1;
Except in addition to value_column1, I also need to retrieve a moving average of the previous 20 values of value_column1.
Standard SQL is preferred, but I will use MySQL extensions if necessary.

This is just off the top of my head, and I'm on the way out the door, so it's untested. I also can't imagine that it would perform very well on any kind of large data set. I did confirm that it at least runs without an error though. :)
SELECT
    T1.value_column1,
    (
        SELECT AVG(T2.value_column1)
        FROM Table1 T2
        WHERE
            -- keep T2 only if it is among the 20 most recent rows up to T1:
            -- count how many rows fall between T2 and T1 (inclusive)
            (
                SELECT COUNT(*)
                FROM Table1 T3
                WHERE T3.datetime_column1 BETWEEN T2.datetime_column1 AND T1.datetime_column1
            ) BETWEEN 1 AND 20
    ) AS moving_average
FROM Table1 T1

Tom H's approach will work. You can simplify it like this if you have an identity column:
SELECT T1.id, T1.value_column1, AVG(T2.value_column1) AS moving_average
FROM table1 T1
INNER JOIN table1 T2 ON T2.id BETWEEN T1.id - 19 AND T1.id  -- assumes ids are sequential with no gaps
GROUP BY T1.id, T1.value_column1  -- required so the average is computed per output row

I realize that this answer is about 7 years too late. I had a similar requirement and thought I'd share my solution in case it's useful to someone else.
There are some MySQL extensions for technical analysis that include a simple moving average. They're really easy to install and use: https://github.com/mysqludf/lib_mysqludf_ta#readme
Once you've installed the UDF (per instructions in the README), you can include a simple moving average in a select statement like this:
SELECT TA_SMA(value_column1, 20) AS sma_20 FROM table1 ORDER BY datetime_column1

When I had a similar problem, I ended up using temp tables for a variety of reasons, but it made this a lot easier! What I did looks very similar to what you're doing, as far as the schema goes.
Make the schema something like ID identity, start_date, end_date, value. When you select, do a subselect avg of the previous 20 based on the identity ID.
Only do this if you find yourself already using temp tables for other reasons though (I hit the same rows over and over for different metrics, so it was helpful to have the small dataset).
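What that looks like concretely, as a minimal sketch (the scratch-table name and column types are assumptions; the answer only names ID, start_date, end_date, and value):
-- MySQL can't reference the same TEMPORARY table twice in one query,
-- so this sketch uses an ordinary scratch table; drop it when done.
CREATE TABLE tmp_metrics (
    id INT AUTO_INCREMENT PRIMARY KEY,
    start_date DATETIME,
    end_date DATETIME,
    value DECIMAL(10,2)
);
INSERT INTO tmp_metrics (start_date, end_date, value)
SELECT datetime_column1, datetime_column1, value_column1
FROM table1
ORDER BY datetime_column1;
-- average of the current row plus the 19 preceding ones, keyed on the identity id
SELECT t1.id, t1.value,
       (SELECT AVG(t2.value)
        FROM tmp_metrics t2
        WHERE t2.id BETWEEN t1.id - 19 AND t1.id) AS moving_average
FROM tmp_metrics t1
ORDER BY t1.id;
DROP TABLE tmp_metrics;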

My solution adds a row number to the table. The following example code may help:
SET @MA_period = 5;

SELECT id1, tmp1.date_time, tmp1.c, AVG(tmp2.c) AS ma
FROM
    -- number the rows of each copy of the table in date order using user variables
    (SELECT @b := @b + 1 AS id1, date_time, c
     FROM websource.EURUSD, (SELECT @b := 0) bb
     ORDER BY date_time ASC) tmp1,
    (SELECT @a := @a + 1 AS id2, date_time, c
     FROM websource.EURUSD, (SELECT @a := 0) aa
     ORDER BY date_time ASC) tmp2
-- pair each row with the @MA_period rows ending at it, then average per row
WHERE id1 > @MA_period AND id1 >= id2 AND id2 > (id1 - @MA_period)
GROUP BY id1
ORDER BY id1 ASC;

In my experience, MySQL as of 5.5.x tends not to use indexes on dependent selects, whether a subquery or a join. This can have a very significant impact on performance where the dependent select criteria change on every row.
A moving average is an example of a query which falls into this category. Execution time may increase with the square of the rows. To avoid this, choose a database engine which can perform indexed look-ups on dependent selects. I find Postgres works effectively for this problem.
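For comparison, a minimal sketch of the same moving average in Postgres (8.4+) using standard window functions, with the column names from the question:
SELECT value_column1,
       AVG(value_column1) OVER (
           ORDER BY datetime_column1
           ROWS BETWEEN 19 PRECEDING AND CURRENT ROW
       ) AS moving_average
FROM table1
WHERE datetime_column1 >= '2009-01-01 00:00:00'
ORDER BY datetime_column1;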

In MySQL 8, a window function frame can be used to obtain the averages.
SELECT value_column1, AVG(value_column1) OVER (ORDER BY datetime_column1 ROWS 19 PRECEDING) as ma
FROM table1
WHERE datetime_column1 >= '2009-01-01 00:00:00'
ORDER BY datetime_column1;
This calculates the average of the current row and 19 preceding rows.
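One caveat worth noting: the first 19 rows are averaged over fewer than 20 values. If only complete windows are wanted, a count over the same frame can filter them out; a sketch, not part of the original answer:
SELECT value_column1, ma
FROM (
    SELECT value_column1,
           datetime_column1,
           AVG(value_column1) OVER w AS ma,
           COUNT(*) OVER w AS window_size
    FROM table1
    WHERE datetime_column1 >= '2009-01-01 00:00:00'
    WINDOW w AS (ORDER BY datetime_column1 ROWS 19 PRECEDING)
) t
WHERE window_size = 20  -- drop the leading rows whose window is not yet full
ORDER BY datetime_column1;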

Related

Selecting rows where price changed

I have a problem much like this one:
SQL: selecting rows where column value changed from previous row
Although mine is in SQL Server, not MySQL. I tried ypercube's first answer, jiri's answer, and egor's as well. All of them just run for over 5 minutes with no results (one I let run for over 10 minutes). The table contains over a million records, so I know this is a big part of the problem. I have a feeling ypercube's second answer might work well, but I don't know how to convert that variable-driven MySQL query to SQL Server.
Any help would be appreciated.
SQL Server version: 2008 R2
Basically I need to determine when a price has changed on a table containing the price, productID, serialnumber, and a datestamp.
I can get a quick list of which productIDs/serial numbers need to be checked to compare against. Sorry I did not include this earlier; I was thinking I could just adapt a solution to fit it.
A common table expression should do it fairly efficiently;
WITH cte AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY system ORDER BY timestamp) rn
FROM TableX
)
SELECT a.timestamp, a.system, a.statusa, a.statusb
FROM cte a JOIN cte b ON a.system = b.system AND a.rn = b.rn+1
WHERE a.statusa <> b.statusa
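To adapt that shape to the columns described in the question, a hedged sketch (the table name is hypothetical, and the column names price, productID, serialnumber, and datestamp are taken from the question's description, not confirmed):
WITH cte AS (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY productID, serialnumber
                               ORDER BY datestamp) rn
  FROM PriceTable  -- hypothetical table name
)
SELECT a.datestamp, a.productID, a.serialnumber, a.price
FROM cte a
JOIN cte b ON a.productID = b.productID
          AND a.serialnumber = b.serialnumber
          AND a.rn = b.rn + 1
WHERE a.price <> b.price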

generate rownum in select statement without using db specific functions

I want to get the row numbers in an SQL select statement, but it shouldn't be a DB-specific query; for example, I can't use Oracle's rownum. Please let me know how I can achieve this.
My table structure is as follows: pid, emplid, desc as columns, and the pid and emplid combination is used as the primary key. So please suggest a query for this use case.
Thanks,
Shyam
The row_number() function is supported in a lot of the major RDBMSs, but I don't believe it's in MySQL, so it really depends how agnostic you want it to be. It might be best to move it out of the database layer if you want it truly agnostic.
EDIT: valex's method of calculating rownum is probably a better option than moving it out of the DB.
To do this, your table has to have a unique Id-like field, anything to distinguish one row from another. If it does, then:
select t1.*,
(select count(id) from t as t2 where t2.id<=t1.id) as row_number
from t as t1 order by Id
UPD: if you have two columns that define the order, then it will look like:
select t1.*,
       (select count(*) from t as t2
        -- rank lexicographically on (id1, id2); comparing the columns independently would miscount
        where t2.id1 < t1.id1
           or (t2.id1 = t1.id1 and t2.id2 <= t1.id2)) as row_number
from t as t1 order by id1, id2
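Applied to the table described in the question, a sketch (the table name emp_table is hypothetical, and the composite primary key pid, emplid is assumed to define the order):
select t1.pid, t1.emplid, t1.descr,  -- "desc" is a reserved word, so the sketch renames it descr
       (select count(*)
        from emp_table t2
        where t2.pid < t1.pid
           or (t2.pid = t1.pid and t2.emplid <= t1.emplid)) as row_number
from emp_table t1
order by t1.pid, t1.emplid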

How to speed up group-based duplication-count queries on unindexed tables

When I need to know the number of rows containing more than n duplicates for a certain column c, I can do it like this:
WITH duplicateRows AS (
SELECT COUNT(1) AS cnt  -- CTE columns must be named in SQL Server
FROM [table]
GROUP BY c
HAVING COUNT(1) > n
) SELECT COUNT(1) FROM duplicateRows
This leads to an unwanted behaviour: SQL Server counts all rows grouped by c, which (when no index is on this table) leads to horrible performance.
However, altering the script so that SQL Server doesn't have to count all the rows doesn't solve the problem:
WITH duplicateRows AS (
SELECT 1 AS marker  -- CTE columns must be named in SQL Server
FROM [table]
GROUP BY c
HAVING COUNT(1) > n
) SELECT COUNT(1) FROM duplicateRows
Although SQL Server now in theory can stop counting after n + 1, it leads to the same query plan and query cost.
Of course, the reason is that the GROUP BY really introduces the cost, not the counting. But I'm not at all interested in the numbers. Is there another option to speed up the counting of duplicate rows, on a table without indexes?
The greatest two costs in your query are the re-ordering for the GROUP BY (due to lack of appropriate index) and the fact that you're scanning the whole table.
Unfortunately, to identify duplicates, re-ordering the whole table is the cheapest option.
You may get a benefit from the following change, but I highly doubt it would be significant, as I'd expect the execution plan to involve a sort again anyway.
WITH
sequenced_data AS
(
SELECT
-- ROW_NUMBER() requires an ORDER BY in its OVER clause; any order will do here
ROW_NUMBER() OVER (PARTITION BY fieldC ORDER BY fieldC) AS sequence_id
FROM
yourTable
)
SELECT
COUNT(*)
FROM
sequenced_data
WHERE
sequence_id = (n+1)
Assumes SQL Server 2005+.
Without indexing, the GROUP BY solution is the best; every PARTITION-based solution involves both a table (clustered index) scan and a sort, instead of the simple scan-and-count of the GROUP BY case.
If the only goal is to determine whether there are ANY rows in ANY group (or, to rephrase that, "there is a duplicate inside the table, given the distinction of column c"), adding TOP(1) to the SELECT queries could work some performance magic.
WITH duplicateRows AS (
SELECT TOP(1)
1 AS found  -- CTE columns must be named in SQL Server
FROM [table]
GROUP BY c
HAVING COUNT(1) > n
) SELECT 1 FROM duplicateRows
Theoretically, SQL Server doesn't need to determine all groups, so as soon as the first group with a duplicate is found, the query is finished (but worst-case will take as long as the original approach). I have to say though that this is a somewhat imperative way of thinking - not sure if it's correct...
Speed and "without indexes" almost never go together.
Although, as others here have mentioned, I seriously doubt it will have performance benefits. Perhaps you could try restructuring your query with PARTITION BY.
For example:
WITH duplicateRows AS (
SELECT a.aFK,
ROW_NUMBER() OVER(PARTITION BY a.aFK ORDER BY a.aFK) AS DuplicateCount
FROM Address a
) SELECT COUNT(*) FROM duplicateRows
WHERE DuplicateCount = n + 1  -- without this filter the query would simply count every row
I haven't tested the performance of this against the actual GROUP BY query. It's just a suggestion of how you could restructure it in another way.

How does sql optimization work internally?

My previous question:
Date of max id: sql/oracle optimization
In my previous question, I was finding different ways of finding the date of the record with the highest id number. Below are several of the offered solutions, and their 'cost' as calculated by explain plan.
select date from table where id in (
select max(id) from table)
has a cost of 8
select date from table where rownum < 2 order by id desc;
has a cost of 5
select date from (select date from table order by id desc) where rownum < 2;
also has a cost of 5
with ranked_table as (select rownum as rn, date from table order by id desc)
select date from ranked_table where rn = 1;
has a cost of 906665
SELECT t1.date
FROM table t1
LEFT OUTER JOIN table t2
ON t1.id < t2.id
WHERE t2.id IS NULL;
has a cost of 1438619
Obviously the index on id is doing its job. But I was wondering, in what cases would the last two perform at least as well, if not better? I want to understand the benefits of doing it that way.
This was done in Oracle. All varieties can be discussed, but kindly say what your answer applies to.
Use solution #1 if you want the most portable SQL that will work on a wide variety of other brands of RDBMS (i.e. not all brands support rownum):
select date from table where id in (select max(id) from table);
Use solution #3 if you want the most efficient solution for Oracle:
select date from (select date from table order by id desc) where rownum < 2;
Note that solution #2 doesn't always give the right answer, because it returns the "first" two rows before it has sorted them by id. If this happens to return the rows with the highest id values, it's only by coincidence.
select date from table where rownum < 2 order by id desc;
Regarding the more complex queries #4 and #5 that give such a high cost, I agree I wouldn't recommend using them for such a simple task as fetching the row with the highest id. But understanding how to use subquery factoring and self-joins can be useful for solving other more complex types of queries, where the simple solutions simply don't do the job.
Example: given a hierarchy of threaded forum comments, show the "hottest" comments with the most direct replies.
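For instance, a hedged sketch of that example (the comments table with id and parent_id columns is assumed, not taken from the thread):
-- rank comments by their number of direct replies
SELECT c.id, COUNT(r.id) AS direct_replies
FROM comments c
LEFT OUTER JOIN comments r ON r.parent_id = c.id
GROUP BY c.id
ORDER BY direct_replies DESC;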
Almost all decent databases have introduced instructions called optimizer hints, which are not portable. There are default costs for joining tables, and you can advise the query optimizer to use nested loop joins or hash joins. A good explanation for Oracle can be found in the Oracle performance tuning guide.
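As an illustration, a sketch of Oracle's hint syntax (the table, alias, and index names here are hypothetical):
-- ask the optimizer for a nested loop join, driving into d via a specific index
SELECT /*+ USE_NL(d) INDEX(d order_details_ix) */ o.id, d.qty
FROM orders o
JOIN order_details d ON d.order_id = o.id;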

Aggregate functions in WHERE clause in SQLite

Simply put, I have a table with, among other things, a column for timestamps. I want to get the row with the most recent (i.e. greatest value) timestamp. Currently I'm doing this:
SELECT * FROM table ORDER BY timestamp DESC LIMIT 1
But I'd much rather do something like this:
SELECT * FROM table WHERE timestamp=max(timestamp)
However, SQLite rejects this query:
SQL error: misuse of aggregate function max()
The documentation confirms this behavior (bottom of page):
Aggregate functions may only be used in a SELECT statement.
My question is: is it possible to write a query to get the row with the greatest timestamp without ordering the select and limiting the number of returned rows to 1? This seems like it should be possible, but I guess my SQL-fu isn't up to snuff.
SELECT * from foo where timestamp = (select max(timestamp) from foo)
or, if SQLite insists on treating subselects as sets,
SELECT * from foo where timestamp in (select max(timestamp) from foo)
There are many ways to skin a cat.
If you have an identity column with auto-increment functionality, a faster query would result if you return the last record by ID, due to the indexing of that column, unless of course you wish to put an index on the timestamp column.
SELECT * FROM TABLE ORDER BY ID DESC LIMIT 1
I think I've answered this question 5 times in the past week now, but I'm too tired to find a link to one of those right now, so here it is again...
SELECT
*
FROM
table T1
LEFT OUTER JOIN table T2 ON
T2.timestamp > T1.timestamp
WHERE
T2.timestamp IS NULL
You're basically looking for the row where no other row matches that is later than it.
NOTE: As pointed out in the comments, this method will not perform as well in this kind of situation. It will usually work better (for SQL Server at least) in situations where you want the last row for each customer (as an example).
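For illustration, a hedged sketch of that per-customer case (the orders table and its columns are assumed, not from the original answer):
-- latest row per customer: keep rows for which no later row exists for the same customer
SELECT o1.*
FROM orders o1
LEFT OUTER JOIN orders o2
  ON o2.customer_id = o1.customer_id
 AND o2.timestamp > o1.timestamp
WHERE o2.timestamp IS NULL;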
You can simply do:
SELECT *, max(timestamp) FROM table
Edit:
Since an aggregate function can't be used like this, it gives an error. I guess what SquareCog suggested was the best thing to do:
SELECT * FROM table WHERE timestamp = (select max(timestamp) from table)