I have a de-normalized structure in Redshift, and the plan is to keep creating records and, when retrieving, only consider the most recent attributes for each user.
Following is the table:
user_id state created_at
1 A 15-10-2015 02:00:00 AM
2 A 15-10-2015 02:00:01 AM
3 A 15-10-2015 02:00:02 AM
1 B 15-10-2015 02:00:03 AM
4 A 15-10-2015 02:00:04 AM
5 B 15-10-2015 02:00:05 AM
And the required result set is:
user_id state created_at
2 A 15-10-2015 02:00:01 AM
3 A 15-10-2015 02:00:02 AM
4 A 15-10-2015 02:00:04 AM
I have a query which retrieves said result:
select user_id, first_value AS state
from (
    select user_id, first_value(state) OVER (
               PARTITION BY user_id
               ORDER BY created_at desc
               ROWS between UNBOUNDED PRECEDING and CURRENT ROW) AS first_value
    from customer_properties
    order by created_at) t
where first_value = 'A'
Is this the best way to retrieve or can the query be improved?
The best query depends on various details: selectivity of the query predicate, cardinalities, data distribution. If state = 'A' is a selective condition (few rows qualify), this query should be substantially faster:
SELECT c.user_id, c.state
FROM customer_properties c
LEFT JOIN customer_properties c1 ON c1.user_id = c.user_id
AND c1.created_at > c.created_at
WHERE c.state = 'A'
AND c1.user_id IS NULL;
Provided there is an index on (state) (or even (state, user_id, created_at)) and another one on (user_id, created_at).
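For reference, a sketch of those indexes in Postgres syntax (Redshift itself has no secondary indexes and relies on sort and distribution keys instead):
CREATE INDEX ON customer_properties (state, user_id, created_at);
CREATE INDEX ON customer_properties (user_id, created_at);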
There are various ways to make sure a later version of the row does not exist:
Select rows which are not present in other table
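One such alternative is a NOT EXISTS anti-join; a sketch equivalent to the LEFT JOIN above, using the same table and columns:
SELECT c.user_id, c.state
FROM customer_properties c
WHERE c.state = 'A'
AND NOT EXISTS (
   SELECT 1
   FROM customer_properties c1
   WHERE c1.user_id = c.user_id
   AND c1.created_at > c.created_at  -- no later row may exist for this user
   );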
If 'A' is a common value in state, this more generic query will be faster:
SELECT user_id, state
FROM (
SELECT user_id, state
, row_number() OVER (PARTITION BY user_id ORDER BY created_at DESC) AS rn
FROM customer_properties
) t
WHERE t.rn = 1
AND t.state = 'A';
I removed NULLS LAST, assuming that created_at is defined NOT NULL. Also, I don't think Redshift has it:
PostgreSQL sort by datetime asc, null first?
Both queries should work with the limited functionality of Redshift. With modern Postgres, there are better options:
Select first row in each GROUP BY group?
Optimize GROUP BY query to retrieve latest record per user
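For instance, in modern Postgres (not Redshift), DISTINCT ON is a concise option; a sketch against the same table, filtering for state = 'A' afterwards:
SELECT user_id, state
FROM (
   SELECT DISTINCT ON (user_id) user_id, state
   FROM customer_properties
   ORDER BY user_id, created_at DESC  -- latest row per user
   ) latest
WHERE state = 'A';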
Your original query would return all rows per user_id if the latest row matches; you would then have to fold duplicates, which is needless work ...
Related
This is the table:
id  category  value
1   A         40
1   B         20
1   C         10
2   A         4
2   B         7
2   C         7
3   A         32
3   B         21
3   C         2
I want the result like this:
id  category
1   A
2   B
2   C
3   A
For small tables or for only very few rows per user, a subquery with the window function rank() (as demonstrated by The Impaler) is just fine. The resulting sequential scan over the whole table, followed by a sort will be the most efficient query plan.
For more than a few rows per user, this gets increasingly inefficient though.
Typically, you also have a users table holding one distinct row per user. If you don't have it, create it (one way is sketched after the links below)! See:
Is there a way to SELECT n ON (like DISTINCT ON, but more than one of each)
Select first row in each GROUP BY group?
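One way to create it, sketched with the table and column names used in the query below:
CREATE TABLE users AS
SELECT DISTINCT id FROM tbl;

ALTER TABLE users ADD PRIMARY KEY (id);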
We can leverage that for an alternative query that scales much better - using WITH TIES in a LATERAL JOIN. Requires Postgres 13 or later.
SELECT u.id, t.*
FROM users u
CROSS JOIN LATERAL (
SELECT t.category
FROM tbl t
WHERE t.id = u.id
ORDER BY t.value DESC
FETCH FIRST 1 ROWS WITH TIES -- !
) t;
db<>fiddle here
See:
Get top row(s) with highest value, with ties
Fetching a minimum of N rows, plus all peers of the last row
This can use a multicolumn index to great effect - which must exist, of course:
CREATE INDEX ON tbl (id, value);
Or:
CREATE INDEX ON tbl (id, value DESC);
Even faster index-only scans become possible with:
CREATE INDEX ON tbl (id, value DESC, category);
Or (the optimum for the query at hand):
CREATE INDEX ON tbl (id, value DESC) INCLUDE (category);
Assuming value is defined NOT NULL, or we have to use DESC NULLS LAST. See:
Sort by column ASC, but NULL values first?
To keep users in the result that don't have any rows in table tbl, use LEFT JOIN LATERAL (...) ON true, as sketched after the link below. See:
What is the difference between LATERAL JOIN and a subquery in PostgreSQL?
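A sketch of that variant of the query above:
SELECT u.id, t.*
FROM users u
LEFT JOIN LATERAL (
   SELECT t.category
   FROM tbl t
   WHERE t.id = u.id
   ORDER BY t.value DESC
   FETCH FIRST 1 ROWS WITH TIES
   ) t ON true;  -- users without rows in tbl get NULL for category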
You can use RANK() to identify the rows you want. Then, filtering is easy. For example:
select *
from (
select *,
rank() over(partition by id order by value desc) as rk
from t
) x
where rk = 1
Result:
id category value rk
--- --------- ------ --
1 A 40 1
2 B 7 1
2 C 7 1
3 A 32 1
See running example at DB Fiddle.
Struggling with this subquery - it should be basic, but I'm missing something. I need to make these available as a part of a larger query.
I have customers, and for each customer I want to get the ONE transaction with the HIGHEST timestamp.
Customer
customer foo
1 val1
2 val2
Transaction
tx_key customer timestamp value
1 1 11/22 10
2 1 11/23 15
3 2 11/24 20
4 2 11/25 25
The desired result of the query:
customer foo timestamp value
1 val1 11/23 15
2 val2 11/25 25
I successfully calculated what I needed by using multiple subqueries, but it is very slow when I have a larger data set.
I did it like this:
(select timestamp from transaction where transaction.customer = customer.customer order by timestamp desc limit 1) as tx_timestamp,
(select value from transaction where transaction.customer = customer.customer order by timestamp desc limit 1) as tx_value
So how do I reduce this down to only calculating it once? In my real data set, I have 15 columns joined over 100k rows, so doing this over and over is not performant enough.
In Postgres, the simplest method is distinct on:
select distinct on (customer) c.*, t.timestamp, t.value
from transaction t
join customer c using (customer)
order by customer, t.timestamp desc;
Try this query please:
SELECT
    C.customer, C.foo, T.timestamp, T.value
FROM Transaction T
JOIN
    (SELECT
         customer, max(timestamp) as timestamp
     from Transaction GROUP BY customer) MT ON
    T.customer = MT.customer
    AND T.timestamp = MT.timestamp
JOIN Customer C ON C.customer = T.customer
Suppose I have the following record in BQ:
id name age timestamp
1 "tom" 20 2019-01-01
I then perform two "updates" on this record by using the streaming API to 'append' additional data -- https://cloud.google.com/bigquery/streaming-data-into-bigquery. This is mainly to get around the update quota that BQ enforces (and it is a high-write application we have).
I then append two edits to the table, one update that just modifies the name, and then one update that just modifies the age. Here are the three records after the updates:
id name age timestamp
1 "tom" 20 2019-01-01
1 "Tom" null 2019-02-01
1 null 21 2019-03-03
I then want to query this record to get the most "up-to-date" information. Here is how I have started:
SELECT id, name, age, max(timestamp)
FROM table
GROUP BY id
-- 1,"Tom",21,2019-03-03
How would I get the correct name and age here? Note that there could be thousands of updates to a record, so I don't want to have to write 1000 case statements, if at all possible.
For various other reasons, I usually won't have all row data at one time; I will only have the RowID + FieldName + FieldValue.
I suppose plan B here is to do a query to get the current data and then add my changes to insert the new row, but I'm hoping there's a way to do this in one go without having to do two queries.
Below is for BigQuery Standard SQL
#standardSQL
SELECT id,
ARRAY_AGG(name IGNORE NULLS ORDER BY ts DESC LIMIT 1)[OFFSET(0)] name,
ARRAY_AGG(age IGNORE NULLS ORDER BY ts DESC LIMIT 1)[OFFSET(0)] age,
MAX(ts) ts
FROM `project.dataset.table`
GROUP BY id
You can test and play with the above using the sample data from your question, as in the example below:
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 id, "tom" name, 20 age, DATE '2019-01-01' ts UNION ALL
SELECT 1, "Tom", NULL, '2019-02-01' UNION ALL
SELECT 1, NULL, 21, '2019-03-03'
)
SELECT id,
ARRAY_AGG(name IGNORE NULLS ORDER BY ts DESC LIMIT 1)[OFFSET(0)] name,
ARRAY_AGG(age IGNORE NULLS ORDER BY ts DESC LIMIT 1)[OFFSET(0)] age,
MAX(ts) ts
FROM `project.dataset.table`
GROUP BY id
with the result:
Row id name age ts
1 1 Tom 21 2019-03-03
This is a classic case for analytic functions in Standard SQL.
Here is how you can achieve your results:
select id, name, age from (
select id, name, age, ts, rank() over (partition by id order by ts desc) rnk
from `yourdataset.yourtable`
)
where rnk = 1
This will sub-group your records based on id and pick the one with the most recent ts (indicating the record most recently added for a given id).
I have a table with columns: FILING_ID, DATE, and BLAH
I'm trying to write a query that, for each FILING_ID, returns the rows with the last three dates. If the table is:
FILING_ID DATE
aksjdfj 2/1/2006
b 2/1/2006
b 3/1/2006
b 4/1/2006
b 5/1/2006
I would like:
FILING_ID DATE
aksjdfj 2/1/2006
b 3/1/2006
b 4/1/2006
b 5/1/2006
I was thinking of maybe running some query to figure out the 3rd highest date for each FILING_ID, then doing a join and comparing the cutoff date with DATE?
I use PostgreSQL. Is there some way to use limit?
SELECT filing_id, date -- more columns?
FROM (
SELECT *, row_number() OVER (PARTITION BY filing_id ORDER BY date DESC NULLS LAST) AS rn
FROM tbl
) sub
WHERE rn < 4
ORDER BY filing_id, date; -- optionally order rows
NULLS LAST is only relevant if date can actually be NULL.
If date is not unique, you may need to break ties to get stable results; a sketch follows after the links below.
PostgreSQL sort by datetime asc, null first?
Select first row in each GROUP BY group?
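A sketch of such a tie-breaker; the id column used as the final sort key is hypothetical and stands in for any unique column:
SELECT filing_id, date
FROM (
   SELECT *, row_number() OVER (PARTITION BY filing_id
                                ORDER BY date DESC NULLS LAST, id) AS rn  -- id breaks ties
   FROM tbl
   ) sub
WHERE rn < 4
ORDER BY filing_id, date;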
Is there some way to use limit?
Maybe. If you have an additional table holding all distinct filing_id (and possibly a few more, which are removed by the join), you can use CROSS JOIN LATERAL (the bare comma before LATERAL in the query below is short syntax for it):
SELECT f.filing_id, t.*
FROM filing f -- table with distinct filing_id
, LATERAL (
SELECT date -- more columns?
FROM tbl
WHERE filing_id = f.filing_id
ORDER BY date DESC NULLS LAST
LIMIT 3 -- now you can use LIMIT
) t
ORDER BY f.filing_id, t.date;
What is the difference between LATERAL and a subquery in PostgreSQL?
If you don't have a filing table, you can create one. Or derive it on the fly:
Optimize GROUP BY query to retrieve latest record per user
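A sketch of deriving it on the fly with a CTE; a persisted filing table will typically be faster:
WITH filing AS (SELECT DISTINCT filing_id FROM tbl)
SELECT f.filing_id, t.*
FROM filing f
, LATERAL (
   SELECT date
   FROM tbl
   WHERE filing_id = f.filing_id
   ORDER BY date DESC NULLS LAST
   LIMIT 3
   ) t
ORDER BY f.filing_id, t.date;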
I have the following table structure
person_id organization_id
1 1
1 2
1 3
2 4
2 2
I want the result set as
person_id organization_id
1 1
2 4
meaning the TOP 1 row per person_id
You are using SQL Server, so you can use row_number(). However, you really cannot define top without ordering -- the results are not guaranteed.
So, the following will do a top without an order by:
select person_id, min(organization_id)
from t
group by person_id;
However, I assume that you intend for the order in which the rows appear to be the intended order. Alas, SQL tables are unordered, so that ordering is not reliable. You really need an id or creationdate or something to specify the order.
All that said, you can try the following:
select person_id, organization_id
from (select t.*,
row_number() over (partition by person_id order by (select NULL)) as seqnum
from t
) t
where seqnum = 1;
It is definitely not guaranteed to work. In my experience, order by (select NULL) returns rows in the same order as the select -- although there is no documentation to this effect (that I have found). Note that in a parallel system on a decent sized table, SQL Server's return order has little to do with the order of the rows on the pages or the insert order.
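If such a column exists (say, a hypothetical created_date on t), a deterministic version of the same query would look like this sketch:
select person_id, organization_id
from (select t.*,
             row_number() over (partition by person_id order by created_date) as seqnum  -- created_date is hypothetical
      from t
     ) t
where seqnum = 1;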