Efficient way to associate each row with the latest previous row matching a condition (PostgreSQL)

I have a table in which two different kinds of rows are inserted:
Some rows represent a datapoint, a key-value pair at a specific point in time
Other rows represent a new status, which persists into the future until the next status
In the real problem, I have a timestamp column which stores the order of the events. In the SQL Fiddle example I am using a SERIAL integer field, but it is the same idea.
Here is the example:
http://www.sqlfiddle.com/#!17/a0823/6
I am looking for an efficient way to retrieve each row of the first type together with its status (which is given by the latest status row before the current row).
The query at the SQL Fiddle link is an example, but it uses two subqueries, which is very inefficient.
I cannot change the structure of the table nor create other tables, but I can create any necessary index on the table.
I am using PostgreSQL 11.4

The most efficient method is probably to use window functions:
select p.*
from (select p.*,
             max(attrvalue) filter (where attrname = 'status_t1') over (partition by grp_1 order by id) as status_t1,
             max(attrvalue) filter (where attrname = 'status_t2') over (partition by grp_2 order by id) as status_t2
      from (select p.*,
                   count(*) filter (where attrname = 'status_t1') over (order by id) as grp_1,
                   count(*) filter (where attrname = 'status_t2') over (order by id) as grp_2
            from people p
           ) p
     ) p
where attrname not in ('status_t1', 'status_t2');
Here is a db<>fiddle.
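As a rough, self-contained check of the grouping trick above, the sketch below runs the same query shape against an in-memory SQLite database from Python. The table layout (id, attrname, attrvalue), the sample rows, and the use of SQLite in place of PostgreSQL are assumptions made purely for illustration; it needs SQLite 3.25 or later for window functions.

```python
import sqlite3

# Assumed schema and sample data; the real table lives in PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people (id INTEGER PRIMARY KEY, attrname TEXT, attrvalue TEXT)"
)
conn.executemany(
    "INSERT INTO people (id, attrname, attrvalue) VALUES (?, ?, ?)",
    [
        (1, "status_t1", "A"),  # status row: t1 becomes A
        (2, "x", "10"),         # datapoint
        (3, "status_t2", "M"),  # status row: t2 becomes M
        (4, "y", "20"),         # datapoint
        (5, "status_t1", "B"),  # status row: t1 becomes B
        (6, "x", "30"),         # datapoint
    ],
)

QUERY = """
SELECT id, attrname, attrvalue, status_t1, status_t2
FROM (
    SELECT g.*,
           -- within each group, the only status row is the one that opened it
           max(attrvalue) FILTER (WHERE attrname = 'status_t1')
               OVER (PARTITION BY grp_1 ORDER BY id) AS status_t1,
           max(attrvalue) FILTER (WHERE attrname = 'status_t2')
               OVER (PARTITION BY grp_2 ORDER BY id) AS status_t2
    FROM (
        SELECT p.*,
               -- running count of status rows: increments at each new status
               count(*) FILTER (WHERE attrname = 'status_t1')
                   OVER (ORDER BY id) AS grp_1,
               count(*) FILTER (WHERE attrname = 'status_t2')
                   OVER (ORDER BY id) AS grp_2
        FROM people p
    ) g
) s
WHERE attrname NOT IN ('status_t1', 'status_t2')
ORDER BY id
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

The running count gives every row the number of status rows seen so far, so all datapoints between two status changes share a group number with the status row that precedes them. A datapoint that appears before any status row ends up with NULL for that status.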

Related

Combining the filtering of duplicate rows in partitioned table and the query. BigQuery

I would like to get each customer_id's last purchased item from the item purchase table. However, the table is a concatenation of distributed tables and may contain duplicate rows, so I filter with ROW_NUMBER() = 1 [1], [2], partitioned by the log_key field.
Is there a better way (without a nested query) to filter out duplicate rows with the same log_key and get the last item purchased by each user? In particular, is it possible to combine the two PARTITION BY operations?
Currently:
WITH
purchase_logs AS (
SELECT
basis_dt, reg_datetime, logkey,
customer_id, customer_info_1, customer_info_2, -- customer info
item_id, item_info_1, item_info_2, -- item info
FROM `project.dataset.item_purchase_table`
WHERE basis_dt BETWEEN '2021-11-01' AND '2021-11-10'
QUALIFY ROW_NUMBER() OVER (PARTITION BY log_key ORDER BY reg_datetime ASC) = 1
)
SELECT *
FROM purchase_logs
QUALIFY ROW_NUMBER() OVER (PARTITION BY log_key, customer_id ORDER BY reg_datetime ASC) = 1
ORDER BY reg_datetime, customer_id
;
The query below isn't quite what I wanted, since it doesn't follow the usual convention of filtering by logkey first. However, I ended up combining the two PARTITION BY window operations into one, relying on the prior logic. (I was hoping for some kind of filter like a HAVING clause that would keep the convention of filtering by logkey.)
SELECT
basis_dt, reg_datetime, logkey,
customer_id, customer_info_1, customer_info_2, -- customer info
item_id, item_info_1, item_info_2, -- item info
FROM `project.dataset.item_purchase_table`
WHERE basis_dt BETWEEN '2021-11-01' AND '2021-11-10'
QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY reg_datetime ASC) = 1
;

Selecting 1 column's value in a group after grouping by another column

How would I include the name of any one of the books that belong to that particular type in the below query?
select distinct
(select sum(ob.Balance)),
ob.BookType
from orders.OrderBooks ob
group by ob.BookType
In its current state it does what I need: it groups books by BookType and sums their balances.
However I need the name of any book that belongs to that BookType as part of the result.
If I select the BookName column and then group by it like below, it results in more unique entries and to an extent undoes the original grouping.
select distinct
(select sum(ob.Balance)),
ob.BookType,
ob.BookName
from orders.OrderBooks ob
group by ob.BookType, ob.BookName
;WITH x AS
(
SELECT
Balance = SUM(Balance) OVER (PARTITION BY BookType),
BookType,
BookName,
rn = ROW_NUMBER() OVER (PARTITION BY BookType ORDER BY BookName DESC)
FROM orders.OrderBooks
)
SELECT Balance, BookType, BookName
FROM x
WHERE rn = 1;
db<>fiddle
ORDER BY BookName DESC was dealer's choice. If you truly don't care which title shows up in the result, you can use any ordering you like. If you want the results to be random every time, you can use ORDER BY NEWID().
In general I like this flexibility better than the TOP (1) subquery approach, in addition to a single scan instead of an additional table access per row. But you can also do it a different way; just take min/max of the bookname, too:
SELECT Balance = SUM(Balance),
BookType,
BookName = MIN(BookName) -- or MAX()
FROM orders.OrderBooks
GROUP BY BookType;
You can see these give similar results in this db<>fiddle. Plan is simpler, too; most notably: no spools. However when you use an aggregate function against that column, it makes it harder to provide arbitrary/random results, and if you intend to add other columns pulled from the right row, you'll need to go back to the row_number solution.
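To make the difference between the two approaches concrete, here is a small sketch that runs both against an in-memory SQLite database from Python. The sample rows are made up, the table is created without the orders schema, and the T-SQL "alias = expression" syntax is rewritten with AS because SQLite does not accept it; treat it as an illustration under those assumptions, not as the SQL Server plan behaviour discussed above.

```python
import sqlite3

# Hypothetical sample data for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE OrderBooks (BookType TEXT, BookName TEXT, Balance INTEGER)"
)
conn.executemany(
    "INSERT INTO OrderBooks (BookType, BookName, Balance) VALUES (?, ?, ?)",
    [
        ("Fiction", "Dune", 10),
        ("Fiction", "Emma", 5),
        ("Tech", "SQL Basics", 7),
    ],
)

# Approach 1: windowed SUM plus ROW_NUMBER, keeping one row per BookType.
window_rows = conn.execute(
    """
    WITH x AS (
        SELECT SUM(Balance) OVER (PARTITION BY BookType) AS Balance,
               BookType,
               BookName,
               ROW_NUMBER() OVER (
                   PARTITION BY BookType ORDER BY BookName DESC
               ) AS rn
        FROM OrderBooks
    )
    SELECT Balance, BookType, BookName
    FROM x
    WHERE rn = 1
    ORDER BY BookType
    """
).fetchall()

# Approach 2: plain GROUP BY, with MIN() picking an arbitrary-but-stable name.
group_rows = conn.execute(
    """
    SELECT SUM(Balance) AS Balance, BookType, MIN(BookName) AS BookName
    FROM OrderBooks
    GROUP BY BookType
    ORDER BY BookType
    """
).fetchall()

print(window_rows)
print(group_rows)
```

Both queries return one row per BookType with the same totals. They differ only in which title they surface: ORDER BY BookName DESC picks the alphabetically last name, while MIN() picks the first.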
You can use a correlated subquery to get a single book name of that type. This assumes there's an ID field and you want to pull the most recent one:
select
Balance = sum(ob.Balance),
ob.BookType,
BookName = (SELECT TOP(1) ob2.BookName FROM orders.OrderBooks ob2 WHERE ob2.BookType = ob.BookType ORDER BY ob2.ID DESC)
from orders.OrderBooks ob
group by ob.BookType

SQL Question regarding fields associated with the MAX([field]) only

I'm trying to gather the entire row information associated with the MAX() of a particular field.
I essentially have several [Flight_Leg] for a unique [Shipment_ID], and each one has unique [Destination_Airport], [Departure_Time], and [Arrival_Time]. Obviously, each [Shipment_ID] can have multiple [Flight_Leg], and each [Flight_Leg] has a unique row of information.
SELECT
[Shipment_ID],
MAX([Flight_Leg]) AS "Final Leg",
[Arrival_Time],
[Destination_Airport]
FROM
[Flight_Info]
Group By
[Shipment_ID],
[Arrival_Time]
The output is multiple lines, rather than having one unique line for [Shipment_ID]. I'm just trying to isolate the FINAL flight info for a shipment.
Most databases support window functions, though it depends on which one you are using. Here's one option using row_number():
select *
from (
select *, row_number() over (partition by shipment_id order by flight_leg desc) rn
from flight_info
) t
where rn = 1
Alternatively here's a more generic approach joining back to itself:
select fi.*
from flight_info fi
join (select shipment_id, max(flight_leg) max_flight_leg
from flight_info
group by shipment_id) t on fi.shipment_id = t.shipment_id
and fi.flight_leg = t.max_flight_leg
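The two queries above can be compared side by side with a small sketch that runs them against an in-memory SQLite database from Python. The column types and sample shipments are assumptions for illustration, and the window-function version needs SQLite 3.25 or later.

```python
import sqlite3

# Hypothetical flight data: shipment 1 has two legs, shipment 2 has one.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE flight_info (
        shipment_id INTEGER,
        flight_leg INTEGER,
        arrival_time TEXT,
        destination_airport TEXT
    )
    """
)
conn.executemany(
    "INSERT INTO flight_info VALUES (?, ?, ?, ?)",
    [
        (1, 1, "2024-01-01 10:00", "ORD"),
        (1, 2, "2024-01-01 15:00", "JFK"),
        (2, 1, "2024-01-02 09:00", "LAX"),
    ],
)

# Option 1: number the legs of each shipment from last to first, keep rn = 1.
row_number_rows = conn.execute(
    """
    SELECT shipment_id, flight_leg, destination_airport
    FROM (
        SELECT *,
               row_number() OVER (
                   PARTITION BY shipment_id ORDER BY flight_leg DESC
               ) AS rn
        FROM flight_info
    ) t
    WHERE rn = 1
    ORDER BY shipment_id
    """
).fetchall()

# Option 2: compute the max leg per shipment and join back to the full rows.
join_rows = conn.execute(
    """
    SELECT fi.shipment_id, fi.flight_leg, fi.destination_airport
    FROM flight_info fi
    JOIN (
        SELECT shipment_id, max(flight_leg) AS max_flight_leg
        FROM flight_info
        GROUP BY shipment_id
    ) t ON fi.shipment_id = t.shipment_id
       AND fi.flight_leg = t.max_flight_leg
    ORDER BY fi.shipment_id
    """
).fetchall()

print(row_number_rows)
print(join_rows)
```

Both return exactly one row per shipment here. If two rows could share the same shipment_id and flight_leg, the join version would return both while row_number() would pick one of them arbitrarily.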

BigQuery SQL Query Optimization

I managed to get a query that works, but I'm curious if there is a more succinct way to construct it (still learning!).
The BigQuery dataset that I'm working with comes from Hubspot. It's being kept in sync by Stitch. (For those unfamiliar with BigQuery, most integrations are append-only so I have to filter out old copies via the ROW_NUMBER() OVER line you'll see below, so that's why it exists. Seems like the standard way to deal with this quirk.)
The wrinkle with the companies table is that every single field, except for two ID ones, is of type RECORD (see the screenshot at the bottom for an example). This serves to keep a history of field value changes. Unfortunately, they don't seem to be in any order, so wrapping the fields (properties.first_conversion_event_name, for example) in a MIN() or MAX() and grouping by companyid doesn't work.
This is what I ended up with (the final query is much longer; I didn't include all of the fields in the sample below):
WITH companies AS (
SELECT
o.companyid as companyid,
ARRAY_AGG(STRUCT(o.properties.name.value, o.properties.name.timestamp) ORDER BY o.properties.name.timestamp DESC)[SAFE_OFFSET(0)] as name,
ARRAY_AGG(STRUCT(o.properties.industry.value, o.properties.industry.timestamp) ORDER BY o.properties.industry.timestamp DESC)[SAFE_OFFSET(0)] as industry,
ARRAY_AGG(STRUCT(o.properties.lifecyclestage.value, o.properties.lifecyclestage.timestamp) ORDER BY o.properties.lifecyclestage.timestamp DESC)[SAFE_OFFSET(0)] as lifecyclestage
FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY o.companyid ORDER BY o._sdc_batched_at DESC) as seqnum
FROM `project.hubspot.companies` o) o
WHERE seqnum = 1
GROUP BY companyid)
SELECT
companyid,
name.value as name,
industry.value as industry,
lifecyclestage.value as lifecyclestage
FROM companies
The WITH clause at the top is to get rid of the extra fields that the ARRAY_AGG(STRUCT()) includes. For each field I would have two columns - [field].value and [field].timestamp - and I only want the [field].value one.
Thanks in advance!
Schema Screenshot
Based on the schema you presented, and assuming your query really returns what you expect, the "optimized" version below should return the same result:
#standardSQL
WITH companies AS (
SELECT
o.companyid AS companyid,
STRUCT(o.properties.name.value, o.properties.name.timestamp) AS name,
STRUCT(o.properties.industry.value, o.properties.industry.timestamp) AS industry,
STRUCT(o.properties.lifecyclestage.value, o.properties.lifecyclestage.timestamp) AS lifecyclestage
FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY o.companyid ORDER BY o._sdc_batched_at DESC) AS seqnum
FROM `project.hubspot.companies` o
) o
WHERE seqnum = 1
)
SELECT
companyid,
name.value AS name,
industry.value AS industry,
lifecyclestage.value AS lifecyclestage
FROM companies
As you can see, I simply removed GROUP BY companyid: after you apply WHERE seqnum = 1 you already have just one row per companyid, so there is no reason to group. For the very same reason I removed ARRAY_AGG(... ORDER BY ...)[SAFE_OFFSET(0)]: it just aggregated one struct into an array and then extracted that one element, so there is no need for it.

Ranking over several columns

In the process of query optimization I got to following SQL query:
select r.*
from
(
select id, DATA, update_dt, inspection_dt, check_dt,
RANK() OVER
(PARTITION by ID
ORDER BY update_dt DESC, DATA) rank
FROM TABLE
where update_dt < inspection_dt or update_dt < check_dt
) r
where r.rank = 1
Query returns the DATA that corresponds to the latest check_dt.
However, what I want to get is:
1. DATA corresponding to latest check_dt
2. DATA corresponding to latest inspection_dt.
One trivial solution is to write two separate queries, each with a single where condition: one for inspection_dt and one for check_dt. However, that defeats the original intent, which was to shorten the running time.
By observing the source data I noticed a way to implement it: the check date is always later than the inspection date. Knowing that, I could just extract the record with rank = 1, which gives me the DATA corresponding to the latest CHECK_DT, while the record with the largest rank would correspond to INSPECTION.
However, I'm afraid the data will not always be consistent, so I am looking for a more general solution.
How about this?
select r.*
from (select id, DATA, update_dt, inspection_dt, check_dt,
RANK() OVER (PARTITION by ID
ORDER BY update_dt DESC, DATA
) as rank_upd,
RANK() OVER (PARTITION by ID
ORDER BY inspection_dt DESC, DATA
) as rank_insp
FROM TABLE
) r
where r.rank_upd = 1 or r.rank_insp = 1;
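A minimal sketch of the two-rank idea, run against an in-memory SQLite database from Python, is shown below. The table name records, the date strings, and the sample rows are hypothetical (TABLE in the query above is a placeholder), and the sketch needs SQLite 3.25 or later for window functions.

```python
import sqlite3

# Hypothetical data: three versions of the same id with different dates.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE records (
        id INTEGER,
        data TEXT,
        update_dt TEXT,
        inspection_dt TEXT
    )
    """
)
conn.executemany(
    "INSERT INTO records VALUES (?, ?, ?, ?)",
    [
        (1, "a", "2024-01-03", "2024-01-01"),  # latest update
        (1, "b", "2024-01-01", "2024-01-05"),  # latest inspection
        (1, "c", "2024-01-02", "2024-01-02"),  # latest in neither ordering
    ],
)

# One scan computes both rankings; rows that win either one are kept.
rows = conn.execute(
    """
    SELECT r.data, r.rank_upd, r.rank_insp
    FROM (
        SELECT id, data,
               RANK() OVER (
                   PARTITION BY id ORDER BY update_dt DESC, data
               ) AS rank_upd,
               RANK() OVER (
                   PARTITION BY id ORDER BY inspection_dt DESC, data
               ) AS rank_insp
        FROM records
    ) r
    WHERE r.rank_upd = 1 OR r.rank_insp = 1
    ORDER BY r.data
    """
).fetchall()

print(rows)
```

Keeping both rank columns in the output makes it possible to tell which ordering each returned row won, which matters when the latest update and the latest inspection fall on different rows.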