How to use RANK OVER PARTITION BY to create rankings based on two columns? - sql

I have duplicate records caused by data inconsistency. Each patient has dozens of duplicate records due to address changes, and I am trying to keep only one record per patient (the latest one).
When I run the code below, every record in my table is assigned a rank of 1. How can I assign rankings specific to each Patient ID?
SELECT DISTINCT
PATIENT_ID
,ADDRESS_START_DATE
,ADDRESS_END_DATE
,RANK() OVER (PARTITION BY PATIENT_ID ,ADDRESS_START_DATE ORDER BY ADDRESS_START_DATE DESC) AS Ind
FROM Member_Table
;

You shouldn't partition by ADDRESS_START_DATE if you're also ordering by it: every row within such a partition shares the same start date, so every row ties for rank 1. Partition by the patient only and order by the date:
SELECT DISTINCT
PATIENT_ID
,ADDRESS_START_DATE
,ADDRESS_END_DATE
,RANK() OVER (PARTITION BY PATIENT_ID ORDER BY ADDRESS_START_DATE DESC) AS Ind
FROM Member_Table
;
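If the goal is to keep only the latest address row per patient, you can filter on that rank in an outer query. A minimal sketch along those lines (same columns and table as in the question):
SELECT PATIENT_ID, ADDRESS_START_DATE, ADDRESS_END_DATE
FROM (
    SELECT
    PATIENT_ID
    ,ADDRESS_START_DATE
    ,ADDRESS_END_DATE
    ,RANK() OVER (PARTITION BY PATIENT_ID ORDER BY ADDRESS_START_DATE DESC) AS Ind
    FROM Member_Table
) ranked
WHERE Ind = 1;
Note that RANK() keeps more than one row per patient if two addresses share the same latest start date; swapping in ROW_NUMBER() keeps exactly one.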

Related

Partitioning in a BigQuery view is not preserved when creating a table

I'm trying to run
SELECT t.*,
ROW_NUMBER() OVER (PARTITION BY t.Barcode, t.Country_Code ) AS seqnum_c
FROM t
in BigQuery, which shows the appropriate result. But the problem is that when I create a table from the same query, the order becomes a mess and is not considered.
CREATE OR REPLACE TABLE `test_2` AS
SELECT t.*,
ROW_NUMBER() OVER (PARTITION BY t.Barcode, t.Country_Code ) AS seqnum_c
FROM t
In addition, I tried:
CREATE OR REPLACE TABLE `test_2` AS
SELECT t.*,
ROW_NUMBER() OVER (PARTITION BY t.Barcode, t.Country_Code ORDER BY t.Barcode, t.Country_Code) AS seqnum_c
FROM t
And got the same result.
Have you ever faced the same issue?
Thanks #ken for your response. I think I found my answer, which is:
CREATE OR REPLACE TABLE t AS (
  SELECT t.*,
    ROW_NUMBER() OVER (PARTITION BY t.Barcode, t.Country_Code ORDER BY Barcode, Country_Code) AS seqnum_c
  FROM t
  ORDER BY Barcode, Country_Code, seqnum_c
);
Best
You need to specify how the rows within each partition should be ordered for the numbering to be deterministic.
It looks like you attempted to do this in your second example, but you used ORDER BY t.Barcode, t.Country_Code, which are exactly your partition columns. Within each partition, every row already has the same Barcode and Country_Code, so effectively no ordering is happening.
For example, given the following rows
Barcode Country_Code Timestamp
111 USA 12345
111 USA 12346
111 JP 12350
You are partitioning by Barcode and Country_Code, so the first two rows will be part of the same partition. However, since you don't specify an order, you cannot know which row will get which row number. In the example above it would make sense to ORDER BY Timestamp, but without knowing your data or your goals it's hard to say what the right logic is for you.
In short, you need to specify an ORDER BY column that is not a part of the PARTITION BY columns in order to deterministically order the rows within each partition.
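For instance, using the Timestamp column from the example above (an assumed column; substitute whatever actually defines the order you care about), the numbering becomes deterministic:
CREATE OR REPLACE TABLE `test_2` AS
SELECT t.*,
  ROW_NUMBER() OVER (PARTITION BY t.Barcode, t.Country_Code ORDER BY t.Timestamp) AS seqnum_c
FROM t;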

Combining the filtering of duplicate rows in a partitioned table with the main query in BigQuery

I would like to get each customer_id's last purchased item from the item purchase table. However, the table is a concatenation of distributed tables and may contain duplicate rows, so I am filtering with ROW_NUMBER() = 1 [1], [2] partitioned by the log_key field.
I was wondering if there is a better way (instead of a nested query) to filter out duplicate rows sharing the same log_key and get the last item purchased by each user.
I was also wondering if it is possible to combine the two PARTITION BY operations.
currently
WITH
purchase_logs AS (
SELECT
basis_dt, reg_datetime, log_key,
customer_id, customer_info_1, customer_info_2, -- customer info
item_id, item_info_1, item_info_2 -- item info
FROM `project.dataset.item_purchase_table`
WHERE basis_dt BETWEEN '2021-11-01' AND '2021-11-10'
QUALIFY ROW_NUMBER() OVER (PARTITION BY log_key ORDER BY reg_datetime ASC) = 1
)
SELECT *
FROM purchase_logs
QUALIFY ROW_NUMBER() OVER (PARTITION BY log_key, customer_id ORDER BY reg_datetime ASC) = 1
ORDER BY reg_datetime, customer_id
;
The query below isn't quite what I wanted (the coding format isn't consistent, and it doesn't filter on log_key first), but it ends up combining the two PARTITION BY window operations from the logic above. (I was hoping for some kind of HAVING-style filter while keeping the convention of filtering on log_key first.)
SELECT
basis_dt, reg_datetime, log_key,
customer_id, customer_info_1, customer_info_2, -- customer info
item_id, item_info_1, item_info_2 -- item info
FROM `project.dataset.item_purchase_table`
WHERE basis_dt BETWEEN '2021-11-01' AND '2021-11-10'
QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY reg_datetime ASC) = 1
;
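One thing worth noting: if the duplicates sharing a log_key are exact copies of each other, the single per-customer window above already collapses them, since ROW_NUMBER() keeps exactly one row per customer. Adding log_key as a tie-breaker in the ORDER BY makes that choice deterministic. A sketch under that assumption:
SELECT
basis_dt, reg_datetime, log_key,
customer_id, customer_info_1, customer_info_2, -- customer info
item_id, item_info_1, item_info_2 -- item info
FROM `project.dataset.item_purchase_table`
WHERE basis_dt BETWEEN '2021-11-01' AND '2021-11-10'
QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY reg_datetime ASC, log_key) = 1
;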

Test whether MIN would work over ROW_NUMBER

Situation:
I have three columns:
id
date
tx_id
The primary key column is tx_id, which is unique in the table. Each tx_id is tied to an id and has a record date. I would like to test whether or not tx_id is incremental over time.
Objective:
I need to extract the first tx_id for each id, but I want to avoid using ROW_NUMBER,
i.e.
select id, date, tx_id, row_number() over (partition by id order by date asc) as First_transaction_id from table
and simply use
select id, min(tx_id) as First_transaction_id from table group by id
So how can I make sure, since I have more than 50 million ids, that using MIN(tx_id) will yield the earliest transaction for each id?
How can I add a flag column to segment the ids that don't satisfy the condition?
how can I make sure, since I have more than 50 million ids, that using MIN(tx_id) will yield the earliest transaction for each id?
Simply do the comparison:
You can get the exceptions with logic like this:
select t.*
from (select t.*,
min(tx_id) over (partition by id) as min_tx_id,
rank() over (partition by id order by date) as seqnum
from t
) t
where tx_id = min_tx_id and seqnum > 1;
Note: this uses rank(). It seems possible that there could be two transactions for an id on the same date.
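The question also asked for a flag column. One possible sketch building on the same window logic (assuming "flag" means marking the ids where MIN(tx_id) does not fall on the earliest date):
select id, date, tx_id,
       case when tx_id = min_tx_id and date > min_date then 1 else 0 end as out_of_order_flag -- 1 = this id's MIN(tx_id) is not on its earliest date
from (select t.*,
             min(tx_id) over (partition by id) as min_tx_id,
             min(date) over (partition by id) as min_date
      from t
     ) t;
Taking MAX(out_of_order_flag) grouped by id then turns this into a per-id flag.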
Use a correlated subquery:
select t.*
from table_name t
where t.date = (select min(date) from table_name t1 where t1.id = t.id)

filtering out duplicate rows using max

I have a table that, for the most part, contains individual users. Occasionally there is a joint user. For a joint user, all the fields in the table are exactly the same as for the primary user except for a b-score field. I want to display only one row of data per account, and use the highest b-score to decide which row to use when it is a joint account (so only the highest score is displayed).
I thought it would be a simple
SELECT DISTINCT accountNo, MAX(bscore) FROM table GROUP BY accountNo
but I'm still getting multiple rows for joints
You seem to want the ANSI-standard row_number() function:
select t.*
from (select t.*, row_number() over (partition by accountNo order by bscore desc) as seqnum
from t
) t
where seqnum = 1;
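The reason row_number() is the usual tool here is that it keeps every column of the winning row. If you only need the account number and its highest score, a plain aggregate (your own attempt, minus the DISTINCT, which GROUP BY already makes redundant) would do:
SELECT accountNo, MAX(bscore) AS bscore
FROM table
GROUP BY accountNo;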
This worked for me, though maybe not the most efficient: a correlated sub-query. The key part is accountNo = a.accountNo.
SELECT DISTINCT a.accountNo,
       (SELECT MAX(bscore) FROM table WHERE accountNo = a.accountNo) AS bscore
FROM table a
GROUP BY a.accountNo

how to select the most recent records

Select id, name, max(modify_time)
from customer
group by id, name
but I get all records.
Order by modify_time desc and use row_number to number the rows for each id, name combination. Then select each combination with row_number = 1:
select id, modify_time, name
from (
    select id, modify_time, name,
           row_number() over (partition by id order by modify_time desc) as r_no
    from customer
) a
where a.r_no = 1
Ids are unique, which means that grouping by id will result in the same table.
My suggestion would be to order the table by modify_time descending and limit the result to 1, maybe something like the following:
Select id, name, modify_time from customer ORDER BY modify_time DESC limit 1
The reason you are getting the whole table as a result is that you are grouping by id AND name, so every unique combination of id and name is returned. Since the names attached to an id differ between records, every combination is unique and the whole table comes back.
If you want the last modification per id (or per name), you should group by id only (or name only, respectively).
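A minimal sketch of that suggestion (assuming you only need each id and its latest modification time; to also pull the matching name you are back to the row_number approach above):
select id, max(modify_time) as last_modified
from customer
group by id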