When using a window frame clause with a range we define the start point and the end point of the window we aggregate over. If we order by something that has multiple rows for a value the actual row processed is not deterministic and will be somewhere within this set. So will the result include all rows with the same value as current row as well in this case?
https://my.vertica.com/docs/8.1.x/HTML/index.htm#Authoring/AnalyzingData/SQLAnalytics/WindowFraming.htm Does not mention this explicitly but seems to hint at it being the actual non-deterministic row.
So if I have the following table t:
| ts | x |
|------------------ |--- |
| 2017-11-29 10:00 | 1 |
| 2017-11-30 10:00 | 2 |
| 2017-11-30 11:00 | 3 |
| 2017-12-01 11:00 | 4 |
and the following query:
with results as (
select
sum(x) over (order by ts::date range between current row and unbounded following) as r
from t
)
select r from results where ts = '2017-11-30 11:00'
will it say 9 (2+3+4) or will it say either 9 or 7 depending on the how the ordering took place?
How do I include all items with the same value in my window as well?
So when just getting all data from the results you are actually able to test this using the following query:
with results as (
select
sum(x) over (order by ts::date range between current row and unbounded following) as r
from t
)
select r from results
The results are:
r 10 9 9 4 and the two 9's in there mean that it actually includes the whole date since it can't order any further, not just the current row.
I tested this using sqlfiddle on Postgres 9.6 and in Vertica 8.1 directly in the database.
Related
I am trying to create a report that pulls the date from a previous row, does some calculation and then displays the answer on the row below that row. The column in question is "Time Spent".
E.g. I have 3 rows.
+=====+===============+============+====+
|name | DatCompleted | Time Spent | idx|
+=====+===============+============+====+
| A | 1/1/17 | NULL | 0 |
+-----+---------------+------------+----+
| B | 11/1/17 | 10 days | 1 |
+-----+---------------+------------+----+
| C | 20/1/17 | 9 days | 2 |
+=====+===============+============+====+
Time Spent C = DatCompleted of C - DateCompleted of B
Apart from using a crazy loop and using row x row instead of set I can't see how I would complete this. Has anyone ever used this logic before in SQL? If how did you go about this?
Thanks in advance!
Most databases support the ANSI standard LAG() function. Date functions differ depending on the database, but something like this:
select t.*,
(DateCompleted - lag(DateCompleted) over (order by DateCompleted)) as TimeSpent
from t;
In SQL Server, you would use datediff():
select t.*,
datediff(day,
lag(DateCompleted) over (order by DateCompleted),
DateCompleted
) as TimeSpent
from t;
You can do this by using ROW number syntax is
ROW_NUMBER ( ) OVER ( [ PARTITION BY value_expression , ... [ n ] ] order_by_clause)
For reference you can use ROW_NUMBER
You have an index already (similar to rownumber above). Join to itself.
Select table1.*
,TimeSpent=DateDiff("d",table1.DateCompleted,copy.DateCompleted)
from table1
join table1 copy on table.idx=copy.idx-1
I have a problem that needs to be solved using sql in oracle.
I have a dataset like given below:
value | date
-------------
1 | 01/01/2017
2 | 02/01/2017
3 | 03/01/2017
3 | 04/01/2017
2 | 05/01/2017
2 | 06/01/2017
4 | 07/01/2017
5 | 08/01/2017
I need to show the result in the below format:
value | date | Group
1 | 01/01/2017 | 1
2 | 02/01/2017 | 2
3 | 03/01/2017 | 3
3 | 04/01/2017 | 3
2 | 05/01/2017 | 4
2 | 06/01/2017 | 4
4 | 07/01/2017 | 5
5 | 08/01/2017 | 6
The logic is whenever value changes over date, it gets assigned a new group/id, but if its the same as the previous one , then its part of the same group.
Here is one method using lag() and cumulative sum:
select t.*,
sum(case when value = prev_value then 0 else 1 end) over (order by date) as grp
from (select t.*,
lag(value) over (order by date) as prev_value
from t
) t;
The logic here is to simply count the number of times that the value changes from one month to the next.
This assumes that date is actually stored as a date and not a string. If it is a string, then the ordering will not be correct. Either convert to a date or use a column that specifies the correct ordering.
Here is a solution using the MATCH_RECOGNIZE clause, introduced in Oracle 12.*
select value, dt, mn as grp
from inputs
match_recognize (
order by dt
measures match_number() as mn
all rows per match
pattern ( a b* )
define b as value = prev(value)
)
order by dt -- if needed
;
Here is how this works: Other than SELECT, FROM and ORDER BY, the query has only one clause, MATCH_RECOGNIZE. What this clause does is: it takes the rows from inputs and it orders them by dt. Then it searches for patterns: one row, marked as a, with no constraints, followed by zero or more rows b, where b is defined by the condition that the value is the same as for the prev[ious] row. What the clause calculates or measures is the match_number() - first "match" of the pattern, second match etc. We use this match number as the group number (grp) in the outer query - that's all we needed!
*Notes: The existence of solutions like this shows why it is important for posters to state their Oracle version. (Run the statement select * from v$version to find out.) Also: date and group are reserved words in Oracle and shouldn't be used as column names. Not even for posting made-up sample data. (There are workarounds but they aren't needed in this case.) Also, whenever using dates like 03/01/2017 in a post, please indicate whether that is March 1 or January 3, there's no way for "us" to tell. (It wasn't important in this case, but it is in the vast majority of cases.)
I Have table t1 ordered by tasteRating
Fruit | tasteRating|Cost
-----------------------
Apple | 99 | 1
Banana| 87 | 2
Cherry| 63 | 5
I want t2
Fruit | Cost | Total Cost
-------------------------
Apple | 1 | 1
Banana| 2 | 3
Cherry| 5 | 8
Is there a way to generate Total Cost dynamically in SQL based on value of Cost?
Doing this on Redshift.
Thanks
A running sum like that, can easily be done in a modern DBMS using window functions:
select col_1,
sum(col_1) over (order by taste_rating desc) as col_2
from the_table;
Note however that a running sum without an order by doesn't make sense. So you have to include a column that defines the order of the rows.
SQLFiddle: http://sqlfiddle.com/#!15/166b9/1
EDIT: (By Gordon)
RedShift has weird limitations on Window functions. For some reason, it requires the rows between syntax:
sum(col_1) over (order by taste_rating desc
rows between unbounded preceding and current row
) as col_2
I have no idea why it has this requirement. It is not required by ANSI (although it is supported) and it is not a limitation in Postgres (the base database for Redshift).
This answer to shows how to produce High/Low/Open/Close values from a ticker:
Retrieve aggregates for arbitrary time intervals
I am trying to implement a solution based on this (PG 9.2), but am having difficulty in getting the correct value for first_value().
So far, I have tried two queries:
SELECT
cstamp,
price,
date_trunc('hour',cstamp) AS h,
floor(EXTRACT(minute FROM cstamp) / 5) AS m5,
min(price) OVER w,
max(price) OVER w,
first_value(price) OVER w,
last_value(price) OVER w
FROM trades
Where date_trunc('hour',cstamp) = timestamp '2013-03-29 09:00:00'
WINDOW w AS (
PARTITION BY date_trunc('hour',cstamp), floor(extract(minute FROM cstamp) / 5)
ORDER BY date_trunc('hour',cstamp) ASC, floor(extract(minute FROM cstamp) / 5) ASC
)
ORDER BY cstamp;
Here's a piece of the result:
cstamp price h m5 min max first last
"2013-03-29 09:19:14";77.00000;"2013-03-29 09:00:00";3;77.00000;77.00000;77.00000;77.00000
"2013-03-29 09:26:18";77.00000;"2013-03-29 09:00:00";5;77.00000;77.80000;77.80000;77.00000
"2013-03-29 09:29:41";77.80000;"2013-03-29 09:00:00";5;77.00000;77.80000;77.80000;77.00000
"2013-03-29 09:29:51";77.00000;"2013-03-29 09:00:00";5;77.00000;77.80000;77.80000;77.00000
"2013-03-29 09:30:04";77.00000;"2013-03-29 09:00:00";6;73.99004;77.80000;73.99004;73.99004
As you can see, 77.8 is not what I believe is the correct value for first_value(), which should be 77.0.
I though this might be due to the ambiguous ORDER BY in the WINDOW, so I changed this to
ORDER BY cstamp ASC
but this appears to upset the PARTITION as well:
cstamp price h m5 min max first last
"2013-03-29 09:19:14";77.00000;"2013-03-29 09:00:00";3;77.00000;77.00000;77.00000;77.00000
"2013-03-29 09:26:18";77.00000;"2013-03-29 09:00:00";5;77.00000;77.00000;77.00000;77.00000
"2013-03-29 09:29:41";77.80000;"2013-03-29 09:00:00";5;77.00000;77.80000;77.00000;77.80000
"2013-03-29 09:29:51";77.00000;"2013-03-29 09:00:00";5;77.00000;77.80000;77.00000;77.00000
"2013-03-29 09:30:04";77.00000;"2013-03-29 09:00:00";6;77.00000;77.00000;77.00000;77.00000
since the values for max and last now vary within the partition.
What am I doing wrong? Could someone help me better to understand the relation between PARTITION and ORDER within a WINDOW?
Although I have an answer, here's a trimmed-down pg_dump which will allow anyone to recreate the table. The only thing that's different is the table name.
CREATE TABLE wtest (
cstamp timestamp without time zone,
price numeric(10,5)
);
COPY wtest (cstamp, price) FROM stdin;
2013-03-29 09:04:54 77.80000
2013-03-29 09:04:50 76.98000
2013-03-29 09:29:51 77.00000
2013-03-29 09:29:41 77.80000
2013-03-29 09:26:18 77.00000
2013-03-29 09:19:14 77.00000
2013-03-29 09:19:10 77.00000
2013-03-29 09:33:50 76.00000
2013-03-29 09:33:46 76.10000
2013-03-29 09:33:15 77.79000
2013-03-29 09:30:08 77.80000
2013-03-29 09:30:04 77.00000
\.
SQL Fiddle
All the functions you used act on the window frame, not on the partition. If omitted the frame end is the current row. To make the window frame to be the whole partition declare it in the frame clause (range...):
SELECT
cstamp,
price,
date_trunc('hour',cstamp) AS h,
floor(EXTRACT(minute FROM cstamp) / 5) AS m5,
min(price) OVER w,
max(price) OVER w,
first_value(price) OVER w,
last_value(price) OVER w
FROM trades
Where date_trunc('hour',cstamp) = timestamp '2013-03-29 09:00:00'
WINDOW w AS (
PARTITION BY date_trunc('hour',cstamp) , floor(extract(minute FROM cstamp) / 5)
ORDER BY cstamp
range between unbounded preceding and unbounded following
)
ORDER BY cstamp;
Here's a quick query to illustrate the behaviour:
select
v,
first_value(v) over w1 f1,
first_value(v) over w2 f2,
first_value(v) over w3 f3,
last_value (v) over w1 l1,
last_value (v) over w2 l2,
last_value (v) over w3 l3,
max (v) over w1 m1,
max (v) over w2 m2,
max (v) over w3 m3,
max (v) over () m4
from (values(1),(2),(3),(4)) t(v)
window
w1 as (order by v),
w2 as (order by v rows between unbounded preceding and current row),
w3 as (order by v rows between unbounded preceding and unbounded following)
The output of the above query can be seen here (SQLFiddle here):
| V | F1 | F2 | F3 | L1 | L2 | L3 | M1 | M2 | M3 | M4 |
|---|----|----|----|----|----|----|----|----|----|----|
| 1 | 1 | 1 | 1 | 1 | 1 | 4 | 1 | 1 | 4 | 4 |
| 2 | 1 | 1 | 1 | 2 | 2 | 4 | 2 | 2 | 4 | 4 |
| 3 | 1 | 1 | 1 | 3 | 3 | 4 | 3 | 3 | 4 | 4 |
| 4 | 1 | 1 | 1 | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
Few people think of the implicit frames that are applied to window functions that take an ORDER BY clause. In this case, windows are defaulting to the frame ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. Think about it this way:
On the row with v = 1 the ordered window's frame spans v IN (1)
On the row with v = 2 the ordered window's frame spans v IN (1, 2)
On the row with v = 3 the ordered window's frame spans v IN (1, 2, 3)
On the row with v = 4 the ordered window's frame spans v IN (1, 2, 3, 4)
If you want to prevent that behaviour, you have two options:
Use an explicit ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING clause for ordered window functions
Use no ORDER BY clause in those window functions that allow for omitting them (as MAX(v) OVER())
More details are explained in this article about LEAD(), LAG(), FIRST_VALUE() and LAST_VALUE()
The result of max() as window function is base on the frame definition.
The default frame definition (with ORDER BY) is from the start of the frame up to the last peer of the current row (including the current row and possibly more rows ranking equally according to ORDER BY). In the absence of ORDER BY (like in my answer you are referring to), or if ORDER BY treats every row in the partition as equal (like in your first example), all rows in the partition are peers, and max() produces the same result for every row in the partition, effectively considering all rows of the partition.
Per documentation:
The default framing option is RANGE UNBOUNDED PRECEDING, which is the
same as RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. With ORDER BY,
this sets the frame to be all rows from the partition start
up through the current row's last peer. Without ORDER BY, all rows of the
partition are included in the window frame, since all rows become
peers of the current row.
Bold emphasis mine.
The simple solution would be to omit the ORDER BY in the window definition - just like I demonstrated in the example you are referring to.
All the gory details about frame specifications in the chapter Window Function Calls in the manual.
I'm using Vertica, which precludes me from using CROSS APPLY, unfortunately. And apparently there's no such thing as CTEs in Vertica.
Here's what I've got:
t:
day | id | metric | d_metric
-----------+----+--------+----------
2011-12-01 | 1 | 10 | 10
2011-12-03 | 1 | 12 | 2
2011-12-04 | 1 | 15 | 3
Note that on the first day, the delta is equal to the metric value.
I'd like to fill in the gaps, like this:
t_fill:
day | id | metric | d_metric
-----------+----+--------+----------
2011-12-01 | 1 | 10 | 10
2011-12-02 | 1 | 10 | 0 -- a delta of 0
2011-12-03 | 1 | 12 | 2
2011-12-04 | 1 | 15 | 3
I've thought of a way to do this day by day, but what I'd really like is a solution that works in one go.
I think I could get something working with LAST_VALUE, but I can't come up with the right JOIN statements that will let me properly partition and order on each id's day-by-day history.
edit:
assume I have a table like this:
calendar:
day
------------
2011-01-01
2011-01-02
...
that can be involved with joins. My intent would be to maintain the date range in calendar to match the date range in t.
edit:
A few more notes on what I'm looking for, just to be specific:
In generating t_fill, I'd like to exactly cover the date range in t, as well as any dates that are missing in between. So a correct t_fill will start on the same date and end on the same date as t.
t_fill has two properties:
1) once an id appears on some date, it will always have a row for each later date. This is the gap-filling implied in the original question.
2) Should no row for an id ever appear again after some date, the t_fill solution should merrily generate rows with the same metric value (and 0 delta) from the date of that last data point up to the end date of t.
A solution might backfill earlier dates up to the start of the date range in t. That is, for any id that appears after the first date in t, rows between the first date in t and the first date for the id will be filled with metric=0 and d_metric=0. I don't prefer this kind of solution, since it has a higher growth factor for each id that enters the system. But I could easily deal with it by selecting into a new table only rows where metric!=0 and d_metric!=0.
This about what Jonathan Leffler proposed, but into old-fashioned low-level SQL (without fancy CTE's or window functions or aggregating subqueries):
SET search_path='tmp'
DROP TABLE ttable CASCADE;
CREATE TABLE ttable
( zday date NOT NULL
, id INTEGER NOT NULL
, metric INTEGER NOT NULL
, d_metric INTEGER NOT NULL
, PRIMARY KEY (id,zday)
);
INSERT INTO ttable(zday,id,metric,d_metric) VALUES
('2011-12-01',1,10,10)
,('2011-12-03',1,12,2)
,('2011-12-04',1,15,3)
;
DROP TABLE ctable CASCADE;
CREATE TABLE ctable
( zday date NOT NULL
, PRIMARY KEY (zday)
);
INSERT INTO ctable(zday) VALUES
('2011-12-01')
,('2011-12-02')
,('2011-12-03')
,('2011-12-04')
;
CREATE VIEW v_cte AS (
SELECT t.zday,t.id,t.metric,t.d_metric
FROM ttable t
JOIN ctable c ON c.zday = t.zday
UNION
SELECT c.zday,t.id,t.metric, 0
FROM ctable c, ttable t
WHERE t.zday < c.zday
AND NOT EXISTS ( SELECT *
FROM ttable nx
WHERE nx.id = t.id
AND nx.zday = c.zday
)
AND NOT EXISTS ( SELECT *
FROM ttable nx
WHERE nx.id = t.id
AND nx.zday < c.zday
AND nx.zday > t.zday
)
)
;
SELECT * FROM v_cte;
The results:
zday | id | metric | d_metric
------------+----+--------+----------
2011-12-01 | 1 | 10 | 10
2011-12-02 | 1 | 10 | 0
2011-12-03 | 1 | 12 | 2
2011-12-04 | 1 | 15 | 3
(4 rows)
I am not Vertica user, but if you do not want to use their native support for GAP fillings, here you can find a more generic SQL-only solution to do so.
If you want to use something like a CTE, how about using a temporary table? Essentially, a CTE is a view for a particular query.
Depending on your needs you can make the temporary table transaction or session-scoped.
I'm still curious to know why gap-filling with constant-interpolation wouldn't work here.
Given the complete calendar table, it is doable, though not exactly trivial. Without the calendar table, it would be a lot harder.
Your query needs to be stated moderately precisely, which is usually half the battle in any issue with 'how to write the query'. I think you are looking for:
For each date in Calendar between the minimum and maximum dates represented in T (or other stipulated range),
For each distinct ID represented in T,
Find the metric for the given ID for the most recent record in T on or before the date.
This gives you a complete list of dates with metrics.
You then need to self-join two copies of that list with dates one day apart to form the deltas.
Note that if some ID values don't appear at the start of the date range, they won't show up.
With that as guidance, you should be able get going, I believe.