Select rows based on distinct values of nested field in BigQuery

Select rows based on distinct values of nested field in BigQuery - sql

I have a table in BigQuery which looks like this:
The sequence field is a repeated RECORD. I want to select one row per stepName but if there are multiple rows per step name, I want to choose the one where sequence.step.elapsedSeconds and sequence.step.elapsedMinutes are not null, otherwise select the rows where these columns are null.
As shown in the image above, I want to select row no. 2, 4 and 5. I have calculated ROW_NUMBER like this: ROW_NUMBER() OVER(PARTITION BY step.stepName) AS RowNum.
Here´s my query so far in trying to filter out the unwanted rows:
WITH DistinctRows AS
(
select timestamp,
ARRAY (
SELECT
STRUCT(
STRUCT(
step.elapsedSeconds,
step.elapsedMinutes,
) as step
)
FROM
UNNEST(source_table.sequence) AS sequence
) AS sequence,
ROW_NUMBER() OVER(PARTITION BY step.stepName) AS RowNum
from source_table,
unnest(sequence) as previousCalls
order by timestamp asc
)
SELECT *
FROM DistinctRows,
unnest(sequence) as sequence
where (rowNum = 1 and (step.elapsedSeconds is null and step.elapsedMinutes is null)
or (RowNum > 1 and step.elapsedSeconds is not null and step.elapsedSeconds is not null)
order by timestamp asc
I need help in figuring out how to filter out the rows like no. 1 and 3 and would appreciate some help.
Thanks in advance.

Hmmm . . . Assuming that stepname is not part of the repeated column:
SELECT dr.* EXCEPT (sequence),
(SELECT seq
FROM unnest(dr.sequence) seq
ORDER BY seq.step.elapsedSeconds DESC NULLS LAST,
sequence.step.elapsedMinutes DESC NULLS LAST
) as sequence
FROM DistinctRows dr
ORDER BY timestamp asc;
If stepname is part of sequence, then the subquery would reaggregate:
SELECT dr.* EXCEPT (sequence),
(SELECT ARRAY_AGG(sequence ORDER BY stepName)
FROM (SELECT seq,
ROW_NUMBER() OVER (PARTITION BY seq.stepName
ORDER BY seq.step.elapsedSeconds DESC NULLS LAST, sequence.step.elapsedMinutes DESC NULLS
) as seqnum
FROM unnest(dr.sequence) seq
) s
WHERE seqnum = 1
) as sequence
FROM DistinctRows dr
ORDER BY timestamp asc

Related

PostgreSQL optimize query performance that contains Window function with CTE

Here the column amenity_category and parent_path is JSONB column with value like ["Tv","Air Condition"] and ["20000","20100","203"] respectively. Apart from that other columns are normal varchar and numeric type. I've around 2.5M rows with primary key on id and it is indexed. Basically the initial CTE part is taking time when rp.parent_path match multiple rows.
Sample dataset:
Current query:
WITH CTE AS
(
SELECT id,
property_name,
property_type_category,
review_score,
amenity_category.name,
count(*) AS cnt FROM table_name rp,
jsonb_array_elements_text(rp.amenity_categories) amenity_category(name)
WHERE rp.parent_path ? '203' AND number_of_review >= 1
GROUP BY amenity_category.name,id
),
CTE2 as
(
SELECT id, property_name,property_type_category,name,
ROW_NUMBER() OVER (PARTITION BY property_type_category,
name ORDER BY review_score DESC),
COUNT(id) OVER (PARTITION BY property_type_category,
name ORDER BY name DESC)
FROM CTE
)
SELECT id, property_name, property_type_category, name, COUNT
FROM CTE2
where row_number = 1
Current Output:
So my basic question is is there any other way I can re-write this query or optimize the current query?

If it's safe to assume that array elements in amenity_categories are distinct (no duplicate array elements), we can radically simplify to:
SELECT DISTINCT ON (property_type_category, ac.name)
id, property_name, property_type_category, ac.name
, COUNT(*) OVER (PARTITION BY property_type_category, ac.name) AS count
FROM table_name rp, jsonb_array_elements_text(rp.amenity_categories) ac(name)
WHERE parent_path ? '203'
AND number_of_review >= 1
ORDER BY property_type_category, ac.name, review_score DESC;
If review_score can be NULL, make that:
...
ORDER BY property_type_category, ac.name, review_score DESC NULLS LAST;
This works, because DISTINCT ON is applied as last step (after window functions). See:
Best way to get result count before LIMIT was applied
PostgreSQL: running count of rows for a query 'by minute'
parent_path and number_of_review should probably be indexed. Depends on data distribution and selectivity of the WHERE conditions, which you didn't disclose.
About DISTINCT ON:
Select first row in each GROUP BY group?
Assuming id is NOT NULL, count(*) is faster and equivalent to count(id).

Oracle optimise SQL query - Multiple Max()

I have a table where first I need to select data by max(event_date) then need to
filter the data by max(event_sequence) then filter again by max(event_number)
I wrote following query which works but takes time.
Here the the query
SELECT DISTINCT a.stuid,
a.prog,
a.stu_prog_id,
a.event_number,
a.event_date,
a.event_sequence,
a.prog_status
FROM table1 a
WHERE a.event_date=
(SELECT max(b.event_date)
FROM table1 b
WHERE a.stuid=b.stuid
AND a.prog=b.prog
AND a.stu_prog_id=b.stu_prog_id)
AND a.event_seq=
(SELECT max(b.event_sequence)
FROM table1 b
WHERE a.stuid=b.stuid
AND a.prog=b.prog
AND a.stu_prog_id=b.stu_prog_id
AND a.event_date=b.event_date)
AND a.event_number=
(SELECT max(b.event_number)
FROM table1 b
WHERE a.stuid=b.stuid
AND a.prog=b.prog
AND a.stu_prog_id=b.stu_prog_id
AND a.event_date=b.event_date
AND a.event_sequence=b.event_sequence
I was wondering is there there a faster way to get the data?
I am using Oracle 12c.

You could try rephrasing your query using analytic functions:
SELECT
stuid,
prog,
stu_prog_id,
event_number,
event_date,
event_sequence,
prog_status
FROM
(
SELECT t.*,
RANK() OVER (PARTITION BY studio, prog, stu_prog_id
ORDER BY event_date DESC) rnk1,
RANK() OVER (PARTITION BY studio, prog, stu_prog_id, event_date
ORDER BY event_sequence DESC) rnk2,
RANK() OVER (PARTITION BY studio, prog, stu_prog_id, event_date, event_sequence
ORDER BY event_number DESC) rnk3
FROM table1 t
) t
WHERE rnk1 = 1 AND rnk2 = 1 AND rnk3 = 1;
Note: I don't actually know if you really need all three subqueries there. Adding sample data to your question might help someone else improve upon the solution I have given above.

I think you want a simple row_number() or rank():
select t1.*
from (select t1.*,
rank() over (partition by stuid, prog, stu_prog_id
order by event_date desc, event_sequence desc, event_number desc
) as seqnum
from table1 t1
) t1
where seqnum = 1;

If you have multiple records with EVENT_DATE, EVENT_SEQUENCE, EVENT_NUMBER as max respectively then in Tim's solution, Use DENSE_RANK or use the following to fetch the exact max and compare with original column data.
SELECT DISTINCT
A.STUID,
A.PROG,
A.STU_PROG_ID,
A.EVENT_NUMBER,
A.EVENT_DATE,
A.EVENT_SEQUENCE,
A.PROG_STATUS
FROM
(
SELECT
A.STUID,
A.PROG,
A.STU_PROG_ID,
A.EVENT_NUMBER,
A.EVENT_DATE,
A.EVENT_SEQUENCE,
A.PROG_STATUS,
MAX(A.EVENT_DATE) OVER(
PARTITION BY A.STUID, A.PROG, A.STU_PROG_ID
) AS MAX_EVENT_DATE,
MAX(A.EVENT_SEQUENCE) OVER(
PARTITION BY A.STUID, A.PROG, A.STU_PROG_ID, A.EVENT_DATE
) AS MAX_EVENT_SEQUENCE,
MAX(A.EVENT_NUMBER) OVER(
PARTITION BY A.STUID, A.PROG, A.STU_PROG_ID, A.EVENT_DATE, A.EVENT_SEQUENCE
) AS MAX_EVENT_NUMBER
FROM
TABLE1 A
) A
WHERE
A.MAX_EVENT_DATE = A.EVENT_DATE
AND A.MAX_EVENT_SEQUENCE = A.EVENT_SEQUENCE
AND A.MAX_EVENT_NUMBER = A.EVENT_NUMBER;
Cheers!!

As being an Oracle 12c user, you can use
[ OFFSET offset { ROW | ROWS } ]
[ FETCH { FIRST | NEXT } [ { rowcount | percent PERCENT } ]
{ ROW | ROWS } { ONLY | WITH TIES } ]
syntax as :
SELECT DISTINCT a.stuid,
a.prog,
a.stu_prog_id,
a.event_number,
a.event_date,
a.event_sequence,
a.prog_status
FROM table1 a
ORDER BY event_date DESC, event_sequence DESC, event_number DESC
FETCH FIRST 1 ROW ONLY;
where WITH TIES clause is not needed for your case, since you're looking for DISTINCT rows, and OFFSET is not needed either, since starting point is just the beginning of a descendingly ordered columns. Even, using the keyword ROW as ROWS is optional, even for the case of plural rows such as FETCH FIRST 5 ROW ONLY;
^^ --> ROWS without S
Demo

How to I do multiple columns partitioning with the rows being duplicated?

I have a set of SQL Stored procedure to use partitioning for my ranking to get percentile. by doing the below partitioning I am able to get my percentiles data right. However my problem is there are duplicates in each row. E.g for each DESC there are multiple duplicates when it is suppose to be only 1 row. Why is this so?
row_nums AS
(
SELECT DATE, DESC, NUM, ROW_NUMBER() OVER (PARTITION BY DATE, DESC ORDER BY NUM ASC) AS Row_Num
FROM ******
)
SELECT .................
This is the output I get currently: (Where there are duplicate rows being returned - Refer to Row 6 to 8)
http://i.stack.imgur.com/foe7g.png[^]
This is the output I want to achieve: http://i.stack.imgur.com/GkrHP.png[^]

You can remove duplicate by adding one more INNER query in FROM clause like below:
;WTIH row_nums AS
(
SELECT DATE, DESC, NUM, ROW_NUMBER() OVER (PARTITION BY DATE, DESC ORDER BY NUM ASC) AS Row_Num
FROM (
SELECT your required columns, COUNT(duplicated_rows_columnsname)
FROM ***
GROUP BY columnnames
HAVING COUNT(duplicated_rows_columnsname) = 1
)
)
SELECT .................
However, You can also remove duplicate row using DISTINCT clause in INNER. query.

oracle sql wih rownum <=

why below query is not giving results if I remove the < sign from query.Because even without < it must match with results?
Query used to get second max id value:
select min(id)
from(
select distinct id
from student
order by id desc
)
where rownum <=2
student id
1
2
3
4

Rownum has a special meaning in Oracle. It is increased with every row, but the optimizer knows that is increasing continuously and all consecutive rows must met the rownum condition. So if you specify rownum = 2 it will never occur since the first row is already rejected.
You can see this very nice if you do an explain plan on your query. It will show something like:
Plan for rownum <=:
COUNT STOPKEY
Plan for rownum =:
FILTER

A ROWNUM value is not assigned permanently to a row (this is a common misconception). A row in a table does not have a number; you cannot ask for row 2 or 3 from a table
click Here for more Info.
This is from the link provided:
Also confusing to many people is when a ROWNUM value is actually assigned. A ROWNUM value is assigned to a row after it passes the predicate phase of the query but before the query does any sorting or aggregation. Also, a ROWNUM value is incremented only after it is assigned, which is why the following query will never return a row:
select *
from t
where ROWNUM > 1;
Because ROWNUM > 1 is not true for the first row, ROWNUM does not advance to 2. Hence, no ROWNUM value ever gets to be greater than 1. Consider a query with this structure:
select ..., ROWNUM
from t
where <where clause>
group by <columns>
having <having clause>
order by <columns>;

I think this is the query you are looking for:
select id
from (select distinct id
from student
order by id desc
) t
where rownum <= 2;
Oracle processes the rownum before the order by, so you need a subquery to get the first two rows. The min() was forcing an aggregation that returned only one result, but before the rownum was applied.
If you actually want only the second value, you need an additional layer of subqueries:
select min(id)
from (select id
from (select distinct id
from student
order by id desc
) t
where rownum <= 2
) t;
However, I would do:
select id
from (select id, dense_rank() over (order by id) as seqnum
from student
) t
where seqnum = 2;

Order asc instead of desc
select id from student where rownum <=2 order by id asc;

Why not just use
select id
from ( select distinct id
, row_number() over (order by id desc) x
from student
)
where x = 2
Or even really bad. Getting the count and index :)
select id
from ( select id
, row_number() over (order by id desc) idx
, sum(1) over (order by null) cnt
from student
group
by id
)
where idx = cnt - 1 -- get the pre-last
Or
where idx = cnt - 2 -- get the 2nd-last
Or
where idx = 3 -- get the 3rd

Try this
SELECT *
FROM (
SELECT id, row_number() over (order by id asc) row_num
FROM student
) AS T
WHERE row_num = 2 -- or 3 ... n
ROW_NUMBER

How to reverse the table that comes from SQL query which already includes ORDER BY

Here is my query:
SELECT TOP 8 id, rssi1, date
FROM history
WHERE (siteName = 'CCL03412')
ORDER BY id DESC
This the result:
How can I reverse this table based on date (Column2) by using SQL?

You can use the first query to get the matching ids, and use them as part of an IN clause:
SELECT id, rssi1, date
FROM history
WHERE id IN
(
SELECT TOP 8 id
FROM history
WHERE (siteName = 'CCL03412')
ORDER BY id DESC
)
ORDER BY date ASC

You could simply use a sub-query. If you apply a TOP clause the nested ORDER BY is allowed:
SELECT X.* FROM(
SELECT TOP 8 id, Column1, Column2
FROM dbo.History
WHERE (siteName = 'CCL03412')
ORDER BY id DESC) X
ORDER BY Column2
Demo
The SELECT query of a subquery is always enclosed in parentheses. It
cannot include a COMPUTE or FOR BROWSE clause, and may only include an
ORDER BY clause when a TOP clause is also specified.
Subquery Fundamentals

try the below :
select * from (SELECT TOP 8 id, rssi1, date
FROM history
WHERE (siteName = 'CCL03412')
ORDER BY id DESC ) aa order by aa.date DESC

didn't run it, but i think it should go well
WITH cte AS
(
SELECT id, rssi1, date, RANK() OVER (ORDER BY ID DESC) AS Rank
FROM history
WHERE (siteName = 'CCL03412')
)
SELECT id, rssi1, date
FROM cte
WHERE Rank <= 8
ORDER BY Date DESC

I have not run this but i think it will work. Execute and let me know if you face error
select id, rssi1, date from (SELECT TOP 8 id, rssi1, date
FROM history
WHERE (siteName = 'CCL03412')
ORDER BY id DESC) order by date ;

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Select rows based on distinct values of nested field in BigQuery - sql

Related

PostgreSQL optimize query performance that contains Window function with CTE

Oracle optimise SQL query - Multiple Max()

How to I do multiple columns partitioning with the rows being duplicated?

oracle sql wih rownum <=

How to reverse the table that comes from SQL query which already includes ORDER BY

Categories

Resources