Combining the filtering of duplicate rows in a partitioned table with the main query (BigQuery SQL)

I would like to get each customer_id's last purchased item from the item purchase table. However, the table is a concatenation of distributed tables and may contain duplicate rows, so I filter them out with ROW_NUMBER() = 1 [1], [2], partitioned by the log_key field.
I was wondering if there is a better way (instead of using a nested query) to filter out duplicate rows sharing the same log_key and then get the last item each user purchased.
I was also wondering if it is possible to combine the two PARTITION BY operations.
Currently:
WITH
purchase_logs AS (
SELECT
basis_dt, reg_datetime, log_key,
customer_id, customer_info_1, customer_info_2, -- customer info
item_id, item_info_1, item_info_2, -- item info
FROM `project.dataset.item_purchase_table`
WHERE basis_dt BETWEEN '2021-11-01' AND '2021-11-10'
QUALIFY ROW_NUMBER() OVER (PARTITION BY log_key ORDER BY reg_datetime ASC) = 1
)
SELECT *
FROM purchase_logs
QUALIFY ROW_NUMBER() OVER (PARTITION BY log_key, customer_id ORDER BY reg_datetime ASC) = 1
ORDER BY reg_datetime, customer_id
;

The query below isn't quite what I wanted (the coding style isn't consistent, since it doesn't filter on log_key first), but it ends up combining the two PARTITION BY window operations with the prior logic. (I was hoping for some kind of HAVING-style filter while keeping the convention of deduplicating on log_key first.)
SELECT
basis_dt, reg_datetime, log_key,
customer_id, customer_info_1, customer_info_2, -- customer info
item_id, item_info_1, item_info_2, -- item info
FROM `project.dataset.item_purchase_table`
WHERE basis_dt BETWEEN '2021-11-01' AND '2021-11-10'
QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY reg_datetime ASC) = 1
;
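One idea I have considered (an untested sketch) is to put both window conditions into a single QUALIFY. As far as I can tell, this only matches the nested version when duplicate rows of a log_key are exact copies (same reg_datetime), and a customer could still get two rows if two different log_keys tie on the earliest reg_datetime:
SELECT
basis_dt, reg_datetime, log_key,
customer_id, customer_info_1, customer_info_2, -- customer info
item_id, item_info_1, item_info_2 -- item info
FROM `project.dataset.item_purchase_table`
WHERE basis_dt BETWEEN '2021-11-01' AND '2021-11-10'
-- RANK (not ROW_NUMBER) on customer_id so the one duplicate copy kept by the
-- log_key filter is not dropped by an arbitrary tie-break within the customer
QUALIFY ROW_NUMBER() OVER (PARTITION BY log_key ORDER BY reg_datetime ASC) = 1
AND RANK() OVER (PARTITION BY customer_id ORDER BY reg_datetime ASC) = 1
ORDER BY reg_datetime, customer_id
;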

Related

Efficient way to associate each row with the latest previous row matching a condition (PostgreSQL)

I have a table in which two different kinds of rows are inserted:
Some rows represent a datapoint, a key-value pair at a specific point in time
Other rows represent a new status, which persists until the next status row
In the real problem, I have a timestamp column which stores the order of the events. In the SQL Fiddle example I am using a SERIAL integer field, but it is the same idea.
Here is the example:
http://www.sqlfiddle.com/#!17/a0823/6
I am looking for an efficient way to retrieve each row of the first type together with its associated status (given by the latest status row before the current row).
The query at the SQL Fiddle link is an example, but it uses two subqueries, which is very inefficient.
I cannot change the structure of the table nor create other tables, but I can create any necessary index on the table.
I am using PostgreSQL 11.4
The most efficient method is probably to use window functions:
select p.*
from (select p.*,
             max(attrvalue) filter (where attrname = 'status_t1') over (partition by grp_1 order by id) as status_t1,
             max(attrvalue) filter (where attrname = 'status_t2') over (partition by grp_2 order by id) as status_t2
      from (select p.*,
                   count(*) filter (where attrname = 'status_t1') over (order by id) as grp_1,
                   count(*) filter (where attrname = 'status_t2') over (order by id) as grp_2
            from people p
           ) p
     ) p
where attrname not in ('status_t1', 'status_t2');
Here is a db<>fiddle.
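Since the real table orders events by a timestamp column rather than the fiddle's SERIAL id, an index on that ordering column may let PostgreSQL feed the window sorts from an index scan instead of sorting. A minimal sketch, with placeholder names because the real table and column are not given in the question:
-- Hypothetical names: "people" and "event_time" stand in for the real table
-- and the timestamp column used in the window ORDER BY
CREATE INDEX people_event_time_idx ON people (event_time);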

How to use RANK OVER PARTITION BY to create rankings based on two columns?

I have duplicate records caused by data inconsistency. I am trying to keep only one record for each patient (the latest one); each patient has dozens of duplicate records due to address changes.
When I run the code below, each record in my table seems to be assigned a rank of 1. How can I assign rankings specific to each Patient ID?
SELECT DISTINCT
PATIENT_ID
,ADDRESS_START_DATE
,ADDRESS_END_DATE
,RANK() OVER (PARTITION BY PATIENT_ID ,ADDRESS_START_DATE ORDER BY ADDRESS_START_DATE DESC) AS Ind
FROM Member_Table
;
You shouldn't partition by the address_start_date if you're ordering by it:
SELECT DISTINCT
PATIENT_ID
,ADDRESS_START_DATE
,ADDRESS_END_DATE
,RANK() OVER (PARTITION BY PATIENT_ID ORDER BY ADDRESS_START_DATE DESC) AS Ind
FROM Member_Table
;
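If you then want only the latest row per patient, as described in the question, one way (a sketch against the same Member_Table) is to wrap the ranking and keep rank 1:
SELECT PATIENT_ID
,ADDRESS_START_DATE
,ADDRESS_END_DATE
FROM (
SELECT PATIENT_ID
,ADDRESS_START_DATE
,ADDRESS_END_DATE
,RANK() OVER (PARTITION BY PATIENT_ID ORDER BY ADDRESS_START_DATE DESC) AS Ind
FROM Member_Table
) t
WHERE Ind = 1
;
Note that RANK() can still return more than one row per patient if two addresses share the same start date; ROW_NUMBER() would force exactly one row.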

SQL Question regarding fields associated with the MAX([field]) only

I'm trying to gather the entire row of information associated with the MAX() of a particular field.
I essentially have several [Flight_Leg] rows for each unique [Shipment_ID], and each one has its own [Destination_Airport], [Departure_Time], and [Arrival_Time]. Obviously, each [Shipment_ID] can have multiple [Flight_Leg] values, and each [Flight_Leg] has a unique row of information.
SELECT
[Shipment_ID],
MAX([Flight_Leg]) AS "Final Leg",
[Arrival_Time],
[Destination_Airport]
FROM
[Flight_Info]
Group By
[Shipment_ID],
[Arrival_Time]
The output has multiple lines per [Shipment_ID], rather than one unique line. I'm just trying to isolate the FINAL flight's info for each shipment.
You don't mention your database, but most support window functions. Here's one option using row_number():
select *
from (
select *, row_number() over (partition by shipment_id order by flight_leg desc) rn
from flight_info
) t
where rn = 1
Alternatively here's a more generic approach joining back to itself:
select fi.*
from flight_info fi
join (select shipment_id, max(flight_leg) max_flight_leg
from flight_info
group by shipment_id) t on fi.shipment_id = t.shipment_id
and fi.flight_leg = t.max_flight_leg

BigQuery SQL Query Optimization

I managed to get a query that works, but I'm curious if there is a more succinct way to construct it (still learning!).
The BigQuery dataset that I'm working with comes from Hubspot. It's being kept in sync by Stitch. (For those unfamiliar with BigQuery, most integrations are append-only, so I have to filter out old copies via the ROW_NUMBER() OVER line you'll see below; that's why it exists. It seems to be the standard way to deal with this quirk.)
The wrinkle with the companies table is that every field, except for two ID ones, is of type RECORD (see the screenshot at the bottom for an example). This keeps a history of field value changes. Unfortunately, the records don't seem to be in any order, so wrapping a field - properties.first_conversion_event_name, for example - in MIN() or MAX() and grouping by companyid doesn't work.
This is what I ended up with (the final query is much longer; I didn't include all of the fields in the sample below):
WITH companies AS (
SELECT
o.companyid as companyid,
ARRAY_AGG(STRUCT(o.properties.name.value, o.properties.name.timestamp) ORDER BY o.properties.name.timestamp DESC)[SAFE_OFFSET(0)] as name,
ARRAY_AGG(STRUCT(o.properties.industry.value, o.properties.industry.timestamp) ORDER BY o.properties.industry.timestamp DESC)[SAFE_OFFSET(0)] as industry,
ARRAY_AGG(STRUCT(o.properties.lifecyclestage.value, o.properties.lifecyclestage.timestamp) ORDER BY o.properties.lifecyclestage.timestamp DESC)[SAFE_OFFSET(0)] as lifecyclestage
FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY o.companyid ORDER BY o._sdc_batched_at DESC) as seqnum
FROM `project.hubspot.companies` o) o
WHERE seqnum = 1
GROUP BY companyid)
SELECT
companyid,
name.value as name,
industry.value as industry,
lifecyclestage.value as lifecyclestage
FROM companies
The WITH clause at the top is to get rid of the extra fields that the ARRAY_AGG(STRUCT()) includes. For each field I would have two columns - [field].value and [field].timestamp - and I only want the [field].value one.
Thanks in advance!
Schema Screenshot
Based on the schema you presented, and assuming your query really returns what you expect, the "optimized" version below should return the same result:
#standardSQL
WITH companies AS (
SELECT
o.companyid AS companyid,
STRUCT(o.properties.name.value, o.properties.name.timestamp) AS name,
STRUCT(o.properties.industry.value, o.properties.industry.timestamp) AS industry,
STRUCT(o.properties.lifecyclestage.value, o.properties.lifecyclestage.timestamp) AS lifecyclestage
FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY o.companyid ORDER BY o._sdc_batched_at DESC) AS seqnum
FROM `project.hubspot.companies` o
) o
WHERE seqnum = 1
)
SELECT
companyid,
name.value AS name,
industry.value AS industry,
lifecyclestage.value AS lifecyclestage
FROM companies
As you can see, I simply removed GROUP BY companyid: after WHERE seqnum = 1 is applied you already have just one row per companyid, so there is no reason to group a single row. For the same reason I removed ARRAY_AGG(... ORDER BY ...)[SAFE_OFFSET(0)] - it aggregated a single struct and then extracted that one element from the array, so there is no need for it.

SQL: generate ranks of groups and subgroups based on a third column

I want to write a SQL query that generates ranks for groups and subgroups based on a third column (Price in this case). While I know we can use dense_rank() to generate ranks based on one column, I have no idea how to generate the two rank columns shown below in a single query.
Both the rankings are based on price. So J3 comes first because J3 sum(price) is 1600. J1 comes second because J1 sum(price) is 1500 and so on.
Any inputs are appreciated.
I have provided the sample input and output. The name of the input table is "RENTAL"
First roll up prices to the jet_type level, then rank all jet_types by that rolled-up price (rank_jet), and finally, in the outer query, use a window function partitioned by jet_type and ordered by price descending to create rank_service_within_jet:
select a.jet_type, b.rownum rank_jet, a.service_type, a.price,
       row_number() over (partition by a.jet_type order by a.price desc) rank_service_within_jet
from yourtable a
join (
    select jet_type, row_number() over (order by price desc) rownum
    from (
        select jet_type, sum(price) price
        from yourtable
        group by jet_type
    ) a
) b on a.jet_type = b.jet_type
You can generate two columns as:
select t.*,
dense_rank() over (order by jet_type) as rank_jet,
row_number() over (partition by jet_type order by price desc) as rank_service_within_jet
. . .
This does not return exactly what is in your table, but the results are quite similar and, even more important, make sense.
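For completeness, against the RENTAL table named in the question, that second approach might look like this (untested):
select t.*,
dense_rank() over (order by jet_type) as rank_jet,
row_number() over (partition by jet_type order by price desc) as rank_service_within_jet
from RENTAL t;
As noted, rank_jet here orders the jet types by jet_type itself rather than by their total price.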