I have a table with rows that look like this:
ts,ticker,m1,m5,m15,m30,h1,h2,h4,d1,high,vwap,low
2020-12-03 00:00:00.000000,DOGEUSDT,0.00336,0.00336,0.00336,0.00336,0.00336,0.00336,0.00336,0.00336,0.003366,0.0033595364,0.003356
2020-12-03 00:01:00.000000,DOGEUSDT,0.00336,0.00336,0.00336,0.00336,0.00336,0.00336,0.00336,0.00336,0.003371,0.0033696603,0.003365
2020-12-03 00:02:00.000000,DOGEUSDT,0.00337,0.00337,0.00337,0.00337,0.00337,0.00337,0.00337,0.00337,0.003376,0.0033727777,0.00337
2020-12-03 00:03:00.000000,DOGEUSDT,0.00337,0.00337,0.00337,0.00337,0.00337,0.00337,0.00337,0.00337,0.003376,0.0033747195,0.003373
The queries are always the same: for a specific ticker, return the aggregate data for a time range.
For that reason, in the aggregated data, the timestamp should come from the first record (sorted by time), and the ticker (which is identical in all returned rows) should also come from the first row.
I have done this:
SELECT
(array_agg(ts))[1] as ts,
(array_agg(ticker))[1] as ticker,
And this one seems to be working too:
SELECT
min(ts) as ts,
min(ticker) as ticker,
Now, this will work too:
SELECT
min(ts) as ts,
min(ticker) as ticker
FROM ...
GROUP BY ts, ticker
LIMIT 1
What would be the recommended way, and why?
All the queries you wrote above return different data. To see this more clearly:
CREATE TABLE table5 (
ts timestamp NULL,
ticker text NULL,
m1 float8 NULL
);
INSERT INTO table5 (ts, ticker, m1) VALUES('2020-12-03 00:01:00.000', 'c1', 0.00336);
INSERT INTO table5 (ts, ticker, m1) VALUES('2020-12-03 00:03:00.000', 'a1', 0.00337);
INSERT INTO table5 (ts, ticker, m1) VALUES('2020-12-03 00:02:00.000', 'b1', 0.00336);
INSERT INTO table5 (ts, ticker, m1) VALUES('2020-12-03 00:00:00.000', 'e1', 0.00337);
/* Query 1
This query returns values from the first row in whatever order the rows happen to be read; without an ORDER BY inside array_agg, that order is not guaranteed and there is no sorting by date
*/
SELECT
(array_agg(ts))[1] as ts,
(array_agg(ticker))[1] as ticker
from table5
Returns:
ts |ticker|
-----------------------+------+
2020-12-03 00:01:00.000|c1 |
/* Query 2
Even if you add "order by ts" to this query, it still won't return the values from the first row of the sorted data. It returns the minimum value of the ts column and, independently, the minimum value of the ticker column
*/
SELECT
min(ts) as ts,
min(ticker) as ticker
from table5
Returns:
ts |ticker|
-----------------------+------+
2020-12-03 00:00:00.000|a1 |
/* Query 3
This looks similar to query 2, but it groups by both columns, so min() runs within each single-row group; without an ORDER BY, LIMIT 1 then returns an arbitrary row
*/
SELECT
min(ts) as ts,
min(ticker) as ticker
FROM table5
GROUP BY ts, ticker
LIMIT 1
Returns:
ts |ticker|
-----------------------+------+
2020-12-03 00:00:00.000|e1 |
If you need the rows sorted by date, and then only the values of the first row, you can use this query:
select
ts,
ticker,
m1
from table5
order by ts
limit 1;
If you need function calls in the SELECT list, try this query with the first_value() window function:
select
first_value(ts) over (order by ts) as ts,
first_value(ticker) over (order by ts) as ticker,
first_value(m1) over (order by ts) as m1
from table5
limit 1;
These queries return the same data:
ts |ticker|m1 |
-----------------------+------+-------+
2020-12-03 00:00:00.000|e1 |0.00337|
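For the original use case (a single ticker over a time range), min() is in fact safe for both columns: the first row by time carries the minimal ts, and ticker is constant across the filtered rows. A minimal sketch, assuming the column names from the question (the table name quotes is hypothetical):

```sql
-- min(ts) equals the first timestamp because ts is the sort key,
-- and min(ticker) is safe because ticker is constant under this filter.
SELECT
    min(ts)     AS ts,
    min(ticker) AS ticker,
    max(high)   AS high,
    min(low)    AS low
FROM quotes  -- hypothetical name for the original table
WHERE ticker = 'DOGEUSDT'
  AND ts >= '2020-12-03 00:00:00'
  AND ts <  '2020-12-03 00:04:00';
```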
The problem I'm trying to solve is removing duplicates from a particular partition as referenced by a TIMESTAMP type column. My table is something like the schema below with the timestamp column partition having day-based granularity:
requestID:STRING, ts:TIMESTAMP, recordNo:INTEGER, recordData:STRING
Now I have millions and millions of these and sometimes there are duplicates like this:
'server1234', '2020-06-10', 1, apple
'server1234', '2020-06-10', 1, apple
'server1234', '2020-06-10', 2, orange
'server1234', '2020-06-10', 2, orange
The uniqueness of the records is determined by two fields: requestID and recordNo. I'd like to remove the duplicates in the partition where CAST(ts AS DATE) = '2020-06-10'. I can see the distinct records with a simple select:
SELECT DISTINCT * FROM mytable WHERE CAST(ts AS DATE) = '2020-06-10'
There must be a way to combine a delete/update/merge with the select distinct so that I can replace the partition with the de-duplicated data.
Thoughts?
The safest way to do this is to select only the data (de-duplicated) you need out into a new table, delete the data in your permanent table, then insert your de-duplicated data back into the permanent location. BigQuery does not make update/delete methods as easy as some OLTP databases.
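A sketch of that three-step approach with the table from the question; the staging table name working.remove_dupes_stage is made up for illustration:

```sql
-- 1. Copy the de-duplicated partition out to a staging table.
CREATE TABLE working.remove_dupes_stage AS
SELECT DISTINCT requestID, ts, recordNo, recordData
FROM working.remove_dupes
WHERE CAST(ts AS DATE) = '2020-06-10';

-- 2. Delete that partition from the permanent table.
DELETE FROM working.remove_dupes
WHERE CAST(ts AS DATE) = '2020-06-10';

-- 3. Re-insert the clean rows, then drop the staging table.
INSERT INTO working.remove_dupes (requestID, ts, recordNo, recordData)
SELECT requestID, ts, recordNo, recordData
FROM working.remove_dupes_stage;

DROP TABLE working.remove_dupes_stage;
```

The staging copy doubles as your backup until you drop it, which is why this route is the safest.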
If you would prefer a more one-shot approach, here is an example with the data you provided that does the trick.
-- SETUP
CREATE TABLE working.remove_dupes
(
requestID STRING,
ts TIMESTAMP,
recordNo INT64,
recordData STRING
)
PARTITION BY TIMESTAMP_TRUNC(ts, HOUR);
INSERT INTO working.remove_dupes(requestID, ts, recordNo, recordData)
VALUES
('server1234', '2020-06-10', 1, 'apple'),
('server1234', '2020-06-10', 1, 'apple'),
('server1234', '2020-06-10', 2, 'orange'),
('server1234', '2020-06-10', 2, 'orange');
------------------------------------------------------------------------------------
-- SELECTING ONLY ONE OF THE ENTRIES (NO DUPLICATES)
SELECT
requestID,
ts,
recordNo,
recordData
FROM (
SELECT
requestID,
ts,
recordNo,
recordData,
ROW_NUMBER() OVER (PARTITION BY requestID, recordNo ORDER BY ts) AS instance_num
FROM
working.remove_dupes
)
WHERE
instance_num = 1;
------------------------------------------------------------------------------------
-- REPLACE THE ORIGINAL TABLE, REMOVING DUPLICATES IN THE PROCESS
-- BACK UP YOUR TABLE FIRST!!!!! (MAKE A COPY)
CREATE OR REPLACE TABLE working.remove_dupes
PARTITION BY TIMESTAMP_TRUNC(ts, HOUR)
AS
(SELECT
requestID,
ts,
recordNo,
recordData
FROM (
SELECT
requestID,
ts,
recordNo,
recordData,
ROW_NUMBER() OVER (PARTITION BY requestID, recordNo ORDER BY ts) AS instance_num
FROM
working.remove_dupes
)
WHERE
instance_num = 1);
EDIT: Note that replacing the table can (in my experience) wipe out table metadata (descriptions) and possibly the table partitioning. I've updated the example to include the partition setup.
I have a table with some fields (time, text, type). I want to build a query to return for every type the text that was introduced with the maximum value of time. Oracle has some restrictions and it is not simple to build the query without some tricks.
I am struggling to get the right query; can anyone help me with it?
TIME TEXT TYPE
--------------------------
03.05.2020 AA 2
02.04.2020 BB 2
01.04.2020 CC 1
I want a query that returns
03.05.2020 AA 2
01.04.2020 CC 1
One option would be to use the DENSE_RANK, FIRST and LAST analytic functions, as in
MAX(time) KEEP (DENSE_RANK LAST ORDER BY time ) OVER( PARTITION BY type )
WITH t2 AS
(
SELECT t.*, MAX(time) KEEP (DENSE_RANK LAST ORDER BY time ) OVER( PARTITION BY type )
AS highest
FROM t
)
SELECT time, text, type
FROM t2
WHERE highest = time
With this method, all ties (repeated time values within each type group) are listed.
First get the max(TIME) per type, then join it back to your tableA to get the other fields (TEXT).
select * from tableA t
inner join
(select max(TIME) mt, type from tableA group by type) t1 on t1.mt = t.time and t.type = t1.type
You can use row_number (or dense_rank):
select
t.*
from (
select
t.*,
row_number() over(partition by type_column order by time_column desc) rn
from tbl t
) t
where t.rn = 1
Hopefully the answer to this question is simple (it seems it should be, but I'm missing something). I have a simple query that gets the min value from a table (via a sub-query), then uses that min value to compute a simple months_between final result.
The question I'm trying to answer, as simply as possible, is: what is the earliest end_date, and how far back from the report_date is this end_date?
Here's the query
select end_date term_date, months_between (cast (report_date as date), cast (end_date as date)) term_length
from table1
where end_date =
(select min (end_date) end_date
from table1
where rec_load_date = (select max (rec_load_date) from table1)
and active_rec = 'Y' and end_date <= report_date);
My issue is that while the sub-query works perfectly, i.e., I get a single min(end_date) value back, using it in the main query returns multiple records.
Current result:
1. term_date = min(table1.end_date)
2. term_length = months_between value comparing table1.report_date and table1.end_date (i.e., the actual end_date values for the records in table1)
A sample of the records returned (as many rows as there are applicable end_date values in table1):
Sample data:
create table table1 (rec_load_date varchar(25), report_date varchar(25), end_date varchar(25));
insert into table1 (rec_load_date, report_date, end_date) values ('2017-08-10', '2017-07-31', '2017-02-28');
insert into table1 (rec_load_date, report_date, end_date) values ('2017-08-10', '2017-07-31', '2017-01-31');
insert into table1 (rec_load_date, report_date, end_date) values ('2017-08-10', '2017-07-31', '2017-04-30');
insert into table1 (rec_load_date, report_date, end_date) values ('2017-08-10', '2017-07-31', '2017-03-31');
insert into table1 (rec_load_date, report_date, end_date) values ('2017-08-10', '2017-07-31', '2017-01-31');
insert into table1 (rec_load_date, report_date, end_date) values ('2017-08-10', '2017-07-31', '2017-04-25');
insert into table1 (rec_load_date, report_date, end_date) values ('2017-08-10', '2017-07-31', '2017-01-31');
Any thoughts on how I can get back from the query the following result: '2017-01-31', 6
Based on your comment, you can use this:
select
report_date
,datediff(day,report_date,min(end_date)) as DaysBack
from
table1
group by
report_date
This will still give you the days back for each report_date. If you want it only for the earliest or latest report_date, you can add an ORDER BY and TOP 1:
select top 1
report_date
,datediff(day,report_date,min(end_date)) as DaysBack
from
table1
group by
report_date
order by
report_date asc --or desc
Looks like your example data is incomplete, but I'll ignore the missing rec_load_date for now and suppose your SELECT MIN(end_date) correctly returns one value: 2017-01-31.
But then the outer query effectively runs select all rows from table1 where end_date = '2017-01-31', and there are 3 rows (in your sample data, maybe more in your real table) that match this criterion. So you're getting multiple rows out!
If you want just a single row out, you're going to have to nail it down a bit more exclusively. Using ROW_NUMBER() OVER (... ORDER BY x) to number the rows and then selecting the row where the row number is 1 is one way, even if there are duplicates. One caveat, though: you don't have an easy way to guarantee WHICH row gets numbered 1 if you do a naive sort on something like end_date where multiple identical values are present. Either increase the number of columns you order by or pick a different sorting key.
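As a sketch of that ROW_NUMBER() idea applied to the query from the question (rec_load_date is used here purely as an illustrative tie-breaker, and active_rec comes from your original query, not the sample DDL):

```sql
-- Number the candidate rows and keep exactly one; the ORDER BY inside
-- OVER() decides which duplicate survives.
SELECT term_date, term_length
FROM (
    SELECT end_date AS term_date,
           months_between(CAST(report_date AS date), CAST(end_date AS date)) AS term_length,
           ROW_NUMBER() OVER (ORDER BY end_date, rec_load_date) AS rn
    FROM table1
    WHERE rec_load_date = (SELECT MAX(rec_load_date) FROM table1)
      AND active_rec = 'Y'
      AND end_date <= report_date
) t
WHERE rn = 1;
```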
I'm trying to find a way of comparing against the current row in the PARTITION BY clause of a WINDOW function in a PostgreSQL query.
Imagine I have the short list of the following 5 elements produced by this query (in the real case, I have thousands or even millions of rows). I am trying to get, for each row, the id of the next different element (event column) and the id of the previous different element.
WITH events AS(
SELECT 1 as id, 12 as event, '2014-03-19 08:00:00'::timestamp as date
UNION SELECT 2 as id, 12 as event, '2014-03-19 08:30:00'::timestamp as date
UNION SELECT 3 as id, 13 as event, '2014-03-19 09:00:00'::timestamp as date
UNION SELECT 4 as id, 13 as event, '2014-03-19 09:30:00'::timestamp as date
UNION SELECT 5 as id, 12 as event, '2014-03-19 10:00:00'::timestamp as date
)
SELECT lag(id) over w as previous_different, event
, lead(id) over w as next_different
FROM events ev
WINDOW w AS (PARTITION BY event!=ev.event ORDER BY date ASC);
I know the comparison event!=ev.event is incorrect but that's the point I want to reach.
The result I get is (the same as if I delete the PARTITION BY clause):
|12|2
1|12|3
2|13|4
3|13|5
4|12|
And the result I would like to get is:
|12|3
|12|3
2|13|5
2|13|5
4|12|
Does anyone know if it is possible, and how? Thank you very much!
EDIT: I know I can do it with two JOINs, an ORDER BY and a DISTINCT ON, but in the real case of millions of rows it is very inefficient:
WITH events AS(
SELECT 1 as id, 12 as event, '2014-03-19 08:00:00'::timestamp as date
UNION SELECT 2 as id, 12 as event, '2014-03-19 08:30:00'::timestamp as date
UNION SELECT 3 as id, 13 as event, '2014-03-19 09:00:00'::timestamp as date
UNION SELECT 4 as id, 13 as event, '2014-03-19 09:30:00'::timestamp as date
UNION SELECT 5 as id, 12 as event, '2014-03-19 10:00:00'::timestamp as date
)
SELECT DISTINCT ON (e.id, e.date) e1.id, e.event, e2.id
FROM events e
LEFT JOIN events e1 ON (e1.date<=e.date AND e1.id!=e.id AND e1.event!=e.event)
LEFT JOIN events e2 ON (e2.date>=e.date AND e2.id!=e.id AND e2.event!=e.event)
ORDER BY e.date ASC, e.id ASC, e1.date DESC, e1.id DESC, e2.date ASC, e2.id ASC
Using several different window functions and two subqueries, this should work decently fast:
WITH events(id, event, ts) AS (
VALUES
(1, 12, '2014-03-19 08:00:00'::timestamp)
,(2, 12, '2014-03-19 08:30:00')
,(3, 13, '2014-03-19 09:00:00')
,(4, 13, '2014-03-19 09:30:00')
,(5, 12, '2014-03-19 10:00:00')
)
SELECT first_value(pre_id) OVER (PARTITION BY grp ORDER BY ts) AS pre_id
, id, ts
, first_value(post_id) OVER (PARTITION BY grp ORDER BY ts DESC) AS post_id
FROM (
SELECT *, count(step) OVER w AS grp
FROM (
SELECT id, ts
, NULLIF(lag(event) OVER w, event) AS step
, lag(id) OVER w AS pre_id
, lead(id) OVER w AS post_id
FROM events
WINDOW w AS (ORDER BY ts)
) sub1
WINDOW w AS (ORDER BY ts)
) sub2
ORDER BY ts;
Using ts as name for the timestamp column.
Assuming ts to be unique - and indexed (a unique constraint does that automatically).
In a test with a real-life table of 50k rows it needed only a single index scan, so it should be decently fast even with big tables. In comparison, your query with the join / DISTINCT ON did not finish after a minute (as expected).
Even an optimized version, dealing with one cross join at a time (a LEFT JOIN with hardly a limiting condition is effectively a limited cross join), did not finish after a minute.
For best performance with a big table, tune your memory settings, in particular for work_mem (for big sort operations). Consider setting it (much) higher for your session temporarily if you can spare the RAM. Read more here and here.
How?
In subquery sub1, look at the event of the previous row and keep it only if it has changed, thus marking the first element of a new group. At the same time, get the id of the previous and the next row (pre_id, post_id).
In subquery sub2, count() only counts non-null values. The resulting grp marks peers in blocks of consecutive same events.
In the final SELECT, take the first pre_id and the last post_id per group for each row to arrive at the desired result.
Actually, this should be even faster in the outer SELECT:
last_value(post_id) OVER (PARTITION BY grp ORDER BY ts
RANGE BETWEEN UNBOUNDED PRECEDING
AND UNBOUNDED FOLLOWING) AS post_id
... since the sort order of the window agrees with the window for pre_id, so only a single sort is needed. A quick test seems to confirm it. More about this frame definition.
Although this question looks simple, it is kind of tricky.
Consider the following table:
CREATE TABLE A (
id INT,
value FLOAT,
"date" DATETIME,
group VARCHAR(50)
);
I would like to obtain the ID and value of the records that contain the maximum date grouped by the column group. In other words, something like "what is the newest value for each group?" What query will answer that question?
I can get each group and its maximum date:
SELECT group, MAX(date)
FROM A
GROUP BY group; -- I also need the "ID" and "value"
But I would like to have the "ID" and value of the record with the highest date.
Making a JOIN between A and the result could be the answer, but there is no way of knowing which record MAX(date) refers to (in case the date repeats).
Sample data:
INSERT INTO A
VALUES
(1, 1.0, '2000-01-01', 'A'),
(2, 2.0, '2000-01-02', 'A'),
(3, 3.0, '2000-01-01', 'B'),
(4, 2.0, '2000-01-02', 'B'),
(5, 1.0, '2000-01-02', 'B')
;
You could try a correlated subquery (matching the max date within each row's own group, so a date that happens to be another group's maximum doesn't leak in):
select group, id, value, date from A a where date =
( select MAX(date)
from A
where group = a.group )
order by group
This is just what analytic functions were made for:
select group,
id,
value
from (
select group,
id,
value,
date,
max(date) over (partition by group) max_date_by_group
from A
)
where date = max_date_by_group
If date is unique, then you already have your answer. If date is not unique, then you need some other uniqueifier. Absent a natural key, your ID is as good as any. Just put a MAX (or MIN, whichever you prefer) on it:
SELECT *
FROM A
JOIN (
--Dedupe any non-unique dates by getting the max id for each group that has the max date
SELECT Group, MAX(Id) as Id
FROM A
JOIN (
--Get max date for each group
SELECT group, MAX(date) as Date
FROM A
GROUP BY group
) as MaxDate ON
A.Group = MaxDate.Group
AND A.Date = MaxDate.Date
GROUP BY Group
) as MaxId ON
A.Group = MaxId.Group
AND A.Id= MaxId.Id
As long as the Date column is unique within each group, I think something like this might work:
SELECT A.ID, A.Value
FROM A
INNER JOIN (SELECT Group, MAX(Date) As MaxDate FROM A GROUP BY Group) B
ON A.Group = B.Group AND A.Date = B.MaxDate