Partition by rearranges table on each query run - sql

The below query always rearranges my table (2021-01-01 not followed by 2021-01-02 but any other random date ) at each run and messes up the average calculation. If I remove the partition by the table will get ordered by EventTime(date) correctly...but I have 6 kinds of Symbols I would like the average of. What am I doing wrong?
select ClosePrice, Symbol, EventTime, AVG(ClosePrice) over(
partition by Symbol
order by EventTime
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) [SMA]
from ClosePrices

The query is missing an ORDER BY clause for the final results. The ORDER BY inside the OVER() expression only applies to that window.
The SQL language is based on relational set theory concepts, which explicitly deny any built-in order for tables. That is, there is no guaranteed order for your tables or queries unless you deliberately set one via an ORDER BY clause.
In the past it may have seemed like you always get rows back in a certain order, but if so it's because you've been lucky. There are lots of things that can cause a database to return results in a different order, sometimes even for different runs of the same query.
If you care about the order of the results, you MUST use ORDER BY:
select ClosePrice, Symbol, EventTime, AVG(ClosePrice) over(
partition by Symbol
order by EventTime
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) [SMA]
from ClosePrices
ORDER BY EventTime

Related

What else do I need to add to my SQL query to bring related information in other columns if using MIN() GROUP BY

There is a table with the following column headers: indi_cod, ries_cod, date, time and level. Each ries_cod contains more than one indi_cod, and these indi_cod are random consecutive numbers.
Which SQL query would be appropriate to build if the aim is to find the smallest ID of each ries_cod, and at the same time bring its related information corresponding to date, time and level?
I tried the following query:
SELECT MIN (indi_cod) AS min_indi_cod
FROM my-project-01-354113.indi_cod.second_step
GROUP BY ries_cod
ORDER BY ries_cod
And, indeed, it presented me with the minimum value of indi_cod for each group of ries_cod, but I couldn't write the appropriate query to bring me the information from the date, time and level columns corresponding to each indi_cod.
I usually use some kind of ranking for this type of thing. you can use row_number, rank, or dense_rank depending on your rdbms. here is an example.
with t as(select a.*,
row_number() over (partition by ries_cod, order by indi_cod) as rn
from mytable)
select * from t where rn = 1
in addition if you are using oracle you can do this without two queries by using keep.
https://renenyffenegger.ch/notes/development/databases/SQL/select/group-by/keep-dense_rank/index
I think you just need to group by with the other columns
SELECT MIN (indi_cod), ries_cod, date, time, level AS min_indi_cod
FROM mytavke p
GROUP BY ries_cod, date, time, level
ORDER BY ries_cod

Count()over() have repeated records

I often use sum() over() to calculate cumulative value,but today,I tried count ()over(),the result is out of my expectation,can someone explain why the result have repeated records on the same day?
I know the regular way is to count (distinct I'd) group by date,and then sum()over(order by date),just curious for the result of "count(id)over(order by date)"
Select pre.date,count(person_id) over (order by pre.date)
From (select distinct person_id, date from events) pre
The result will be repeated records for the same day.
Because your outer query has not filtered or aggregated the results from the inner query. It returns the same number of rows.
You want aggregation:
select pre.date, count(*) as cnt_on_date,
sum(count(*)) over (order by pre.date) as running_count
from (select distinct person_id, date from events) pre
group by pre.date;
Almost all analytical functions, except row_number() which comes to mind, do not differentiate ties for the same value of columns in order by clause. In some documentation it is stated directly:
Oracle
If you specify a logical window with the RANGE keyword, then the function returns the same result for each of the rows
Postgresql
By default, if ORDER BY is supplied then the frame consists of all rows from the start of the partition up through the current row, plus any following rows that are equal to the current row according to the ORDER BY clause.
My SQL
With 'ORDER BY': The default frame includes rows from the partition start through the current row, including all peers of the current row (rows equal to the current row according to the ORDER BY clause).
But in general, the addition of ORDER BY in analytical clause implicitly sets window specification to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. As window calculation is made for each row in the defined window, with default to RANGE rows with the same value of ORDER BY columns will come into the same window and will produce the same result. So to have a real running total, there should be ROWS BETWEEN or more detail column in ORDER BY part of analytic clause. Functions that does not support windowing clause are exception of this rule, but it sometimes not documented directly, so I will not try to list them here. Functions that can be used as aggregate are not exception in general and produce the same value.

Does partition by / order by imply ordering in a query?

I see no difference in:
select
ID,
TYPE,
XTIME,
first_value(XTIME) over (partition by TYPE order by XTIME)
from SERIES;
and:
select
ID,
TYPE,
XTIME,
first_value(XTIME) over (partition by TYPE order by XTIME)
from SERIES
order by TYPE, XTIME;
Does partition by / order by imply ordering in a query?
No it does not.
The result set from a query is ordered only when it has an explicit order by clause. Otherwise the result set is in an arbitrary order. Often, the underlying algorithms used for running the query determine the ordering of the result set.
You can only depend on the final ordering when you have an order by clause for the outermost select. That the result sets are in the same order is simply a coincidence.
I am curious why you don't simply use this?
min(xtime) over (partition by type)
The difference between the two queries is that the second one would be ordered ascending by TYPE and then XTIME, while the first one would, in theory, have random ordering.
The PARTITION BY and ORDER BY clause used in the call to FIRST_VALUE have an effect on that particular analytic function, but do not by themselves affect the final order of the result set.
In general, if you want to order a result set in SQL, you need to use an ORDER BY clause. If both your queries happen to have the same ordering right now, it is only by chance, and is not guaranteed. How might this happen? Well, if Oracle is iterating the clustered index of your table, and that ordering happens to coincide with ORDER BY TYPE, XTIME, then you would see this order even without using ORDER BY.

Reverse initial order of SELECT statement

I want to run a SQL query in Postgres that is exactly the reverse of the one that you'd get by just running the initial query without an order by clause.
So if your query was:
SELECT * FROM users
Then
SELECT * FROM users ORDER BY <something here to make it exactly the reverse of before>
Would it just be this?
ORDER BY Desc
You are building on the incorrect assumption that you would get rows in a deterministic order with:
SELECT * FROM users;
What you get is really arbitrary. Postgres returns rows in any way it sees fit. For simple queries typically in order of their physical storage, which typically is the order in which rows were entered. But there are no guarantees, and the order may change any time between two calls. For instance after any UPDATE (writing a new physical row version), or when any background process reorders rows - like VACUUM. Or a more complex query might return rows according to an index or a join. Long story short: there is no reliable order for table rows in a relational database unless you specify it with ORDER BY.
That said, assuming you get rows from the above simple query in the order of physical storage, this would get you the reverse order:
SELECT * FROM users
ORDER BY ctid DESC;
ctid is the internal tuple ID signifying physical order. Related:
In-order sequence generation
How list all tables with data changes in the last 24 hours?
here is a tsql solution, thid might give you an idea how to do it in postgres
select * from (
SELECT *, row_number() over( order by (select 1)) rowid
FROM users
) x
order by rowid desc

Calculating SQL Server ROW_NUMBER() OVER() for a derived table

In some other databases (e.g. DB2, or Oracle with ROWNUM), I can omit the ORDER BY clause in a ranking function's OVER() clause. For instance:
ROW_NUMBER() OVER()
This is particularly useful when used with ordered derived tables, such as:
SELECT t.*, ROW_NUMBER() OVER()
FROM (
SELECT ...
ORDER BY
) t
How can this be emulated in SQL Server? I've found people using this trick, but that's wrong, as it will behave non-deterministically with respect to the order from the derived table:
-- This order here ---------------------vvvvvvvv
SELECT t.*, ROW_NUMBER() OVER(ORDER BY (SELECT 1))
FROM (
SELECT TOP 100 PERCENT ...
-- vvvvv ----redefines this order here
ORDER BY
) t
A concrete example (as can be seen on SQLFiddle):
SELECT v, ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) RN
FROM (
SELECT TOP 100 PERCENT 1 UNION ALL
SELECT TOP 100 PERCENT 2 UNION ALL
SELECT TOP 100 PERCENT 3 UNION ALL
SELECT TOP 100 PERCENT 4
-- This descending order is not maintained in the outer query
ORDER BY 1 DESC
) t(v)
Also, I cannot reuse any expression from the derived table to reproduce the ORDER BY clause in my case, as the derived table might not be available as it may be provided by some external logic.
So how can I do it? Can I do it at all?
The Row_Number() OVER (ORDER BY (SELECT 1)) trick should NOT be seen as a way to avoid changing the order of underlying data. It is only a means to avoid causing the server to perform an additional and unneeded sort (it may still perform the sort but it's going to cost the minimum amount possible when compared to sorting by a column).
All queries in SQL server ABSOLUTELY MUST have an ORDER BY clause in the outermost query for the results to be reliably ordered in a guaranteed way.
The concept of "retaining original order" does not exist in relational databases. Tables and queries must always be considered unordered until and unless an ORDER BY clause is specified in the outermost query.
You could try the same unordered query 100,000 times and always receive it with the same ordering, and thus come to believe you can rely on said ordering. But that would be a mistake, because one day, something will change and it will not have the order you expect. One example is when a database is upgraded to a new version of SQL Server--this has caused many a query to change its ordering. But it doesn't have to be that big a change. Something as little as adding or removing an index can cause differences. And more: Installing a service pack. Partitioning a table. Creating an indexed view that includes the table in question. Reaching some tipping point where a scan is chosen instead of a seek. And so on.
Do not rely on results to be ordered unless you have said "Server, ORDER BY".