I have origin table A:
dt          c1  value
2022/10/1   1   1
2022/10/2   1   2
2022/10/3   1   3
2022/10/1   2   4
2022/10/2   2   6
2022/10/3   2   5
Currently I get the latest dt's percent_rank with:
select * from
(
    select
        *,
        percent_rank() over (partition by c1 order by value) as prank
    from A
) as pt
where pt.dt = Date '2022-10-3'
Demo: https://www.db-fiddle.com/f/rXynTaD5nmLqFJdjDSCZpL/0
The expected result looks like:
dt          c1  value  prank
2022/10/3   1   3      1
2022/10/3   2   5      0.5
Which means that at 2022-10-3, the latest value's percent_rank over its history is 100% in group c1=1 and 50% in group c1=2.
But this SQL will sort every partition, which I think has time complexity O(n log n).
I just need the latest date's rank, and I thought I could get that by calculating count(last_value > value)/count(), which costs O(n).
Any suggestions?
Rather than hard-coding the maximum date, you can use the ROW_NUMBER() analytic function:
SELECT *
FROM   (
    SELECT t.*,
           PERCENT_RANK() OVER (PARTITION BY c1 ORDER BY value) AS prank,
           ROW_NUMBER() OVER (PARTITION BY c1 ORDER BY dt DESC) AS rn
    FROM   table_name t
) t
WHERE  rn = 1
Which, for the sample data:
CREATE TABLE table_name (dt, c1, value) AS
SELECT DATE '2022-10-01', 1, 1 FROM DUAL UNION ALL
SELECT DATE '2022-10-02', 1, 2 FROM DUAL UNION ALL
SELECT DATE '2022-10-03', 1, 3 FROM DUAL UNION ALL
SELECT DATE '2022-10-01', 2, 4 FROM DUAL UNION ALL
SELECT DATE '2022-10-02', 2, 6 FROM DUAL UNION ALL
SELECT DATE '2022-10-03', 2, 5 FROM DUAL;
Outputs:
DT                   C1  VALUE  PRANK  RN
2022-10-03 00:00:00  1   3      1      1
2022-10-03 00:00:00  2   5      .5     1
fiddle
But this SQL will sort every partition, which I think has time complexity O(n log n).
Whatever you do you will need to iterate over the entire result-set.
I just need the latest date's rank, and I thought I could get that by calculating count(last_value > value)/count().
Then you will need to find the last value, which (unless you are hard-coding the last date) will involve an index or table scan over all the values in each partition plus a sort; finding the count of the greater values then requires a second index or table scan. You can profile both solutions, but I expect you would find that using analytic functions is equally efficient, if not better, than trying to use aggregation functions.
For example:
SELECT c1,
       dt,
       value,
       ( SELECT ( COUNT(CASE WHEN value <= t.value THEN 1 END) - 1 )
                / ( COUNT(*) - 1 )
         FROM   table_name c
         WHERE  c.c1 = t.c1
       ) AS prank
FROM   table_name t
WHERE  dt = DATE '2022-10-03'
This is going to access the table twice, and you are likely to find that the I/O costs of the table accesses far outweigh any potential savings from using a different method. Moreover, if you look at the explain plan (fiddle), the query is still performing an aggregate sort, so there are not going to be any cost savings, only additional costs, from this method.
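For completeness, if the latest date really were hard-coded, the counting idea from the question can be written as a single aggregation per group. This is only a sketch against the same table_name sample (it assumes one row per (c1, dt)); it still scans the whole table, so the conclusion above stands:
SELECT t.c1,
       MAX(lv.value) AS value,
       ( COUNT(CASE WHEN t.value <= lv.value THEN 1 END) - 1 )
       / ( COUNT(*) - 1 ) AS prank
FROM   table_name t
       INNER JOIN (
         SELECT c1, value
         FROM   table_name
         WHERE  dt = DATE '2022-10-03'  -- the hard-coded "latest" date
       ) lv
       ON lv.c1 = t.c1
GROUP BY t.c1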
Try this:
select t.c1, t.dt, t.value
from TABLENAME t
inner join (
    select c1, max(dt) as MaxDate
    from TABLENAME
    group by c1
) tm on t.c1 = tm.c1 and t.dt = tm.MaxDate
order by t.dt desc;
Or as simple as
SELECT * from TABLENAME ORDER BY dt DESC;
I fiddled with it a bit; it is almost the same answer as MT0 already posted.
select dt, c1, val, prank * 100 as percent_rank
from (
    select
        t1.*,
        percent_rank() over (partition by c1 order by val) as prank,
        row_number() over (partition by c1 order by dt desc) rn
    from t1
)
where rn = 1;
result
DT C1 VAL PERCENT_RANK
2022-10-03 1 3 100
2022-10-03 2 5 50
http://sqlfiddle.com/#!4/ec60a/23
I used row_number() = 1 to get the latest date.
I also multiplied percent_rank by 100 to show it as a percentage.
Is this what you desire?
Related
Given a table of IDs in an Oracle database, what is the best method to randomly flag (x) percent of them? In the example below, I am randomly flagging 20% of all records.
ID ELIG
1 0
2 0
3 0
4 1
5 0
My current approach shown below works fine, but I am wondering if there is a more efficient way to do this?
WITH DAT
AS ( SELECT LEVEL AS "ID"
FROM DUAL
CONNECT BY LEVEL <= 5),
TICKETS
AS (SELECT "ID", 1 AS "ELIG"
FROM ( SELECT *
FROM DAT
ORDER BY DBMS_RANDOM.VALUE ())
WHERE ROWNUM <= (SELECT ROUND (COUNT (*) / 5, 0) FROM DAT)),
RAFFLE
AS (SELECT "ID", 0 AS "ELIG"
FROM DAT
WHERE "ID" NOT IN (SELECT "ID"
FROM TICKETS)
UNION
SELECT * FROM TICKETS)
SELECT *
FROM RAFFLE;
You could use a ROW_NUMBER approach here:
WITH cte AS (
SELECT t.*, ROW_NUMBER() OVER (ORDER BY dbms_random.value) rn,
COUNT(*) OVER () cnt
FROM yourTable t
)
SELECT t.*,
CASE WHEN rn / cnt <= 0.2 THEN 'FLAG' END AS flag -- 0.2 to flag 20%
FROM cte t
ORDER BY ID;
Demo
Sample output for one run of the above query:
Note that one of five records is flagged, which is 20%.
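An NTILE variant of the same idea is also possible; a minimal sketch against the same hypothetical yourTable (NTILE(5) splits the random ordering into five near-equal buckets, and bucket 1 is the flagged ~20%):
WITH cte AS (
    SELECT t.*, NTILE(5) OVER (ORDER BY dbms_random.value) AS bucket
    FROM yourTable t
)
SELECT t.*,
       CASE WHEN bucket = 1 THEN 'FLAG' END AS flag  -- bucket 1 of 5 = ~20%
FROM cte t
ORDER BY ID;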
Is there a way, using SQL (BigQuery more specifically), to get one line per unique value in a given column?
I know that this is possible using a sequence of UNION queries, with as many unions as there are distinct values in the column of interest, but I'm wondering if there is a better way to do it.
You can use row_number():
select t.* except (seqnum)
from (select t.*, row_number() over (partition by col order by col) as seqnum
      from t
     ) t
where seqnum = 1;
This returns an arbitrary row. You can control which row by adjusting the order by.
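For instance, to keep the most recent row per col (a sketch; ts is a hypothetical timestamp column, not part of the original question):
select t.* except (seqnum)
from (select t.*,
             -- order by ts desc so the newest row gets seqnum = 1
             row_number() over (partition by col order by ts desc) as seqnum
      from t
     ) t
where seqnum = 1;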
Another fun solution in BigQuery uses structs:
select array_agg(t limit 1)[ordinal(1)].*
from t
group by col;
You can add an order by (order by X limit 1) if you want a particular row.
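For instance, to keep the row with the smallest id per col (a sketch; id here is just an assumed ordering column):
select array_agg(t order by id limit 1)[ordinal(1)].*
from t
group by col;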
Here is just a more nicely formatted version:
select tab.* except(seqnum)
from (
select *, row_number() over (partition by column_x order by column_x) as seqnum
from `project.dataset.table`
) as tab
where seqnum = 1
Below is for BigQuery Standard SQL
#standardSQL
SELECT AS VALUE ANY_VALUE(t)
FROM `project.dataset.table` t
GROUP BY col
You can test, play with above using dummy data as in below example
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 id, 1 col UNION ALL
SELECT 2, 1 UNION ALL
SELECT 3, 1 UNION ALL
SELECT 4, 2 UNION ALL
SELECT 5, 2 UNION ALL
SELECT 6, 3
)
SELECT AS VALUE ANY_VALUE(t)
FROM `project.dataset.table` t
GROUP BY col
with result
Row id col
1 1 1
2 4 2
3 6 3
Please see the attached screenshot. I'm trying to figure out how we can achieve that using SQL Server.
Thanks.
You can achieve this using a recursive CTE.
For example (I assumed your date column is in MM/DD/YYYY format):
;with orig_src as
(
    select CAST('01/01/2018' AS DATETIME) As Dt, 'Alpha' Name, 3 Freq
    UNION ALL
    select CAST('12/01/2018' AS DATETIME) As Dt, 'Beta' Name, 2 Freq
), freq_cte as
(
    --start/anchor row
    select dt, name, 1 freq_new, Freq rn from orig_src
    --recursion
    union all
    select DATEADD(MONTH, 1, a.dt), a.name, 1 freq_new, a.rn - 1 from freq_cte a
    --terminator/constraint for recursion
    where a.rn - 1 != 0
)
select convert(varchar, dt, 101) dt, name, freq_new from freq_cte
order by 2, 1
The way this recursive logic works is:
First we get all the rows from the table in a CTE (freq_cte); then we recursively call this CTE and decrement rn (the original Freq) until the terminator condition is met, that is, when (rn - 1) = 0.
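One caveat worth noting: SQL Server caps recursion at 100 levels by default, so if a Freq value could exceed that, add a MAXRECURSION hint to the final query:
select convert(varchar, dt, 101) dt, name, freq_new
from freq_cte
order by 2, 1
option (maxrecursion 0); -- default limit is 100; 0 = unlimited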
I have a table with values:
val1
1
2
3
4
I want the following as output:
1.00
2.50
4.25
6.12
Each output value is computed as val1 + 0.5 * (the previous row's output).
So, for example:
row with 2 ---> output is computed as 2 + 0.5*1.00 = 2.50
row with 3 ---> output is computed as 3 + 0.5*2.50 = 4.25
When I use the following SQL window function:
SELECT *
,val1+SUM(0.50*val1) OVER (ORDER BY val1 ROWS between 1 PRECEDING and 1 PRECEDING) AS r
FROM #a1
I get the output:
1.00
2.500
4.000
5.500
This can be done with a recursive cte.
with rownums as (
    select val, row_number() over (order by val) as rnum
    from tbl
)
/* This is the recursive cte */
, cte as (
    select val, rnum, cast(val as float) as new_val
    from rownums
    where rnum = 1
    union all
    select r.val, r.rnum, r.val + 0.5*c.new_val
    from cte c
    join rownums r on c.rnum = r.rnum - 1
)
/* End Recursive cte */
select val, new_val
from cte
Sample Demo
This is called exponential averaging. You can do it with some sort of power function, say it is called power() (this might differ among databases).
The following will work -- but I'm not sure about what happens if the sequences get long. Note that this has an id column to specify the ordering:
with t as (
select 1 as id, 1 as val union all
select 2, 2 union all select 3, 3 union all select 4, 4
)
select t.*,
( sum(p_seqnum * val) over (order by id) ) / p_seqnum
from (select t.*,
row_number() over (order by id desc) as seqnum,
power(cast(0.5 as float), row_number() over (order by id desc)) as p_seqnum
from t
) t;
Here is a rextester for Postgres. Here is a SQL Fiddle for SQL Server.
This works because exponential averaging is "memory-less". If this were not true, you would need a recursive CTE, and that can be much more expensive.
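To see why the memory-less property lets a running sum replace recursion, unroll the recurrence (writing r(k) for the k-th output, v(k) for the k-th val, and n for the row count):
r(k) = v(k) + 0.5 * r(k-1)
     = sum over j <= k of 0.5^(k-j) * v(j)
     = ( sum over j <= k of 0.5^(n-j+1) * v(j) ) / 0.5^(n-k+1)
The numerator is the running sum(p_seqnum * val) and the denominator is p_seqnum, which is exactly what the query computes.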
I have some data and want to number each row sequentially, but give consecutive rows with the same type the same number, and continue the numbering when the type changes.
There will only be types 5 and 6; ID is actually more complex than abc123. I've tried RANK but I seem to get two different row counts - in the example, instead of 1 2 2 3 4 I get 1 1 2 2.
original image
dense rank result
MS SQL 2008 R2
As far as I understand, you want to number your continuous groups:
declare #Temp table (id1 bigint identity(1, 1), ID nvarchar(128), Date date, Type int)
insert into #Temp
select 'abc123', '20130101', 5 union all
select 'abc124', '20130102', 6 union all
select 'abc125', '20130103', 6 union all
select 'abc126', '20130104', 5 union all
select 'abc127', '20130105', 6 union all
select 'abc128', '20130106', 6 union all
select 'abc129', '20130107', 6 union all
select 'abc130', '20130108', 6 union all
select 'abc131', '20130109', 5
;with cte1 as (
select
*,
row_number() over (order by T.Date) - row_number() over (order by T.Type, T.Date) as grp
from #Temp as T
), cte2 as (
select *, min(Date) over (partition by grp) as grp2
from cte1
)
select
T.ID, T.Date, T.Type,
dense_rank() over (order by grp2)
from cte2 as T
order by id1
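One caveat: in this particular data each island happens to get a distinct grp value, but in general two islands of different types can share one. Here is the same islands technique with the safer partitioning, a sketch that computes the within-type row number per partition and separates the island start dates by Type as well:
;with cte1 as (
    select
        *,
        -- difference of a global and a per-type row number is constant within each island
        row_number() over (order by T.Date)
            - row_number() over (partition by T.Type order by T.Date) as grp
    from #Temp as T
), cte2 as (
    -- partition by (Type, grp) so islands of different types can never merge
    select *, min(Date) over (partition by Type, grp) as grp2
    from cte1
)
select
    T.ID, T.Date, T.Type,
    dense_rank() over (order by grp2)
from cte2 as T
order by id1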