PostgreSQL optimize query performance that contains Window function with CTE

Here the columns amenity_categories and parent_path are JSONB columns with values like ["Tv","Air Condition"] and ["20000","20100","203"] respectively. The other columns are ordinary varchar and numeric types. The table has around 2.5M rows with an indexed primary key on id. The initial CTE is the slow part when rp.parent_path matches many rows.
Sample dataset:
Current query:
WITH CTE AS
(
    SELECT id,
           property_name,
           property_type_category,
           review_score,
           amenity_category.name,
           count(*) AS cnt
    FROM table_name rp,
         jsonb_array_elements_text(rp.amenity_categories) amenity_category(name)
    WHERE rp.parent_path ? '203' AND number_of_review >= 1
    GROUP BY amenity_category.name, id
),
CTE2 AS
(
    SELECT id, property_name, property_type_category, name,
           ROW_NUMBER() OVER (PARTITION BY property_type_category, name
                              ORDER BY review_score DESC),
           COUNT(id) OVER (PARTITION BY property_type_category, name
                           ORDER BY name DESC)
    FROM CTE
)
SELECT id, property_name, property_type_category, name, COUNT
FROM CTE2
WHERE row_number = 1
Current Output:
So my basic question is: is there any other way I can rewrite this query or optimize the current one?

If it's safe to assume that array elements in amenity_categories are distinct (no duplicate array elements), we can radically simplify to:
SELECT DISTINCT ON (property_type_category, ac.name)
       id, property_name, property_type_category, ac.name
     , COUNT(*) OVER (PARTITION BY property_type_category, ac.name) AS count
FROM   table_name rp, jsonb_array_elements_text(rp.amenity_categories) ac(name)
WHERE  parent_path ? '203'
AND    number_of_review >= 1
ORDER  BY property_type_category, ac.name, review_score DESC;
If review_score can be NULL, make that:
...
ORDER BY property_type_category, ac.name, review_score DESC NULLS LAST;
This works because DISTINCT ON is applied as the last step (after window functions). See:
Best way to get result count before LIMIT was applied
PostgreSQL: running count of rows for a query 'by minute'
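parent_path and number_of_review should probably be indexed; the ? operator on a jsonb column can use a GIN index with the default jsonb_ops operator class. Whether an index helps depends on data distribution and the selectivity of the WHERE conditions, which you didn't disclose.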
About DISTINCT ON:
Select first row in each GROUP BY group?
Assuming id is NOT NULL, count(*) is faster and equivalent to count(id).
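To illustrate why the window count survives the de-duplication, here is a minimal sketch using Python's sqlite3 module. SQLite has no DISTINCT ON, so plain DISTINCT stands in for it; the evaluation order (window functions first, DISTINCT afterwards) is the point being shown. The listing table and its rows are invented for the example and are not the asker's data.

```python
import sqlite3

# Hypothetical miniature of the question's table; names and rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE listing ("
    "id INTEGER PRIMARY KEY, category TEXT, amenity TEXT, review_score REAL)"
)
conn.executemany(
    "INSERT INTO listing VALUES (?, ?, ?, ?)",
    [
        (1, "Hotel", "Tv", 8.5),
        (2, "Hotel", "Tv", 9.1),
        (3, "Hotel", "Tv", 7.0),
        (4, "Hostel", "Tv", 6.4),
    ],
)

# The window count is computed over all rows of each partition first;
# DISTINCT then collapses the identical (category, amenity, cnt) rows.
rows = conn.execute(
    """
    SELECT DISTINCT category, amenity,
           COUNT(*) OVER (PARTITION BY category, amenity) AS cnt
    FROM listing
    ORDER BY category
    """
).fetchall()

for row in rows:
    print(row)
```

Each group is reported once, but cnt still reflects every row that was in the partition before duplicates were dropped.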

Related

sql select everything with maximum date (that is smaller than a specific date) without subqueries

I would like to write a sql query where I choose all rows grouped by id where the column date is the latest date for this id but still smaller than for example 16-JUL-2021. I would like to do this without using subqueries (in oracle), is that possible?
I tried the below but it doesn't work.
SELECT *, max(date)
WHERE date < '16-JUL-2021'
OVER(PARTITION BY id ORDER BY date DESC) as sth
FROM table

You can find the maximum date without sub-queries.
SELECT t.*,
max("DATE") OVER(PARTITION BY id ORDER BY "DATE" DESC) as max_date
FROM "TABLE" t
WHERE "DATE" < DATE '2021-07-16'
You need a sub-query to filter to only show the row(s) with the maximum date:
SELECT *
FROM   (
  SELECT t.*,
         max("DATE") OVER(PARTITION BY id ORDER BY "DATE" DESC) as max_date
  FROM   "TABLE" t
  WHERE  "DATE" < DATE '2021-07-16'
)
WHERE  "DATE" = max_date;
However, you are still only querying the table once using this technique even though it uses a sub-query.
Note DATE and TABLE are reserved words and cannot be used as unquoted identifiers; it would be better practice to use different names for those identifiers.
You could equivalently use the RANK or DENSE_RANK analytic functions instead of MAX. ROW_NUMBER, however, does not give the same output: it returns only a single row per id and drops any tied rows.
SELECT *
FROM   (
  SELECT t.*,
         RANK() OVER(PARTITION BY id ORDER BY "DATE" DESC) as rnk
  FROM   "TABLE" t
  WHERE  "DATE" < DATE '2021-07-16'
)
WHERE  rnk = 1;
But you still need a sub-query to filter the rows.
If you want to not use a sub-query then you can use:
SELECT id,
MAX("DATE") AS "DATE",
MAX(col1) KEEP (DENSE_RANK LAST ORDER BY "DATE", ROWNUM) AS col1,
MAX(col2) KEEP (DENSE_RANK LAST ORDER BY "DATE", ROWNUM) AS col2,
MAX(col3) KEEP (DENSE_RANK LAST ORDER BY "DATE", ROWNUM) AS col3
FROM "TABLE"
GROUP BY id
However, that is not quite the same as it will only get a single row per id and will not return multiple rows tied for the greatest date per id.
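The difference between RANK and ROW_NUMBER on tied dates can be checked with a small sketch. It runs on SQLite through Python's sqlite3 module rather than on Oracle, and the events table, its columns and its rows are made up for illustration; dates are stored as ISO strings so that string comparison matches date order.

```python
import sqlite3

# Invented sample data: id 1 has two rows tied on the latest date before the cutoff.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, event_date TEXT, note TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        (1, "2021-07-10", "a"),
        (1, "2021-07-15", "b"),
        (1, "2021-07-15", "c"),  # tied with "b"
        (1, "2021-07-20", "d"),  # excluded by the cutoff
        (2, "2021-07-01", "e"),
    ],
)


def latest_rows(rank_function):
    """Return the rows ranked first per id, using RANK or ROW_NUMBER."""
    if rank_function not in ("RANK", "ROW_NUMBER"):
        raise ValueError("unsupported ranking function")
    sql = f"""
        SELECT id, event_date, note
        FROM (
            SELECT e.*,
                   {rank_function}() OVER (PARTITION BY id
                                           ORDER BY event_date DESC) AS rnk
            FROM events e
            WHERE event_date < '2021-07-16'
        )
        WHERE rnk = 1
        ORDER BY id, note
    """
    return conn.execute(sql).fetchall()


with_rank = latest_rows("RANK")              # keeps both tied rows for id 1
with_row_number = latest_rows("ROW_NUMBER")  # keeps one arbitrary row per id
print(with_rank)
print(with_row_number)
```

With RANK both tied rows for id 1 come back; with ROW_NUMBER exactly one row per id is returned, and which of the tied rows wins is not defined.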

In Oracle How do we get multiple columns in result which are not in tables

For example:
I have the below table named "T1"
and I need a result like this:
If "earliest_run_date", "last_rundate" and "remaining_run_dates" were in table T1, I could have used PIVOT.
But I don't know how to bring these 3 columns into the result set. Any solution will be much appreciated.

My guess is that you want something like this. There's probably a better way to eliminate the first and last row from the listagg that I'm not seeing off the top of my head but this should be reasonably efficient.
with ranked_t1 as (
  select t1.*,
         rank() over( partition by job_id
                      order by run_date asc ) asc_rank,
         rank() over( partition by job_id
                      order by run_date desc ) desc_rank
    from t1
)
select job_id,
       min( run_date ) earliest_run_date,
       max( run_date ) last_rundate,
       listagg( (case when asc_rank != 1
                       and desc_rank != 1
                      then run_date
                      else null
                  end), ' ' )
         within group( order by run_date ) remaining_run_dates
  from ranked_t1
 group by job_id;

Removing the remaining_run_dates column, you get a query as simple as
select
JOB_ID,
min(RUN_DATE) as earliest_run_date,
max(RUN_DATE) as last_rundate
from T1
group by JOB_ID

Select rows based on distinct values of nested field in BigQuery

I have a table in BigQuery which looks like this:
The sequence field is a repeated RECORD. I want to select one row per stepName but if there are multiple rows per step name, I want to choose the one where sequence.step.elapsedSeconds and sequence.step.elapsedMinutes are not null, otherwise select the rows where these columns are null.
As shown in the image above, I want to select row no. 2, 4 and 5. I have calculated ROW_NUMBER like this: ROW_NUMBER() OVER(PARTITION BY step.stepName) AS RowNum.
Here's my query so far in trying to filter out the unwanted rows:
WITH DistinctRows AS
(
select timestamp,
ARRAY (
SELECT
STRUCT(
STRUCT(
step.elapsedSeconds,
step.elapsedMinutes,
) as step
)
FROM
UNNEST(source_table.sequence) AS sequence
) AS sequence,
ROW_NUMBER() OVER(PARTITION BY step.stepName) AS RowNum
from source_table,
unnest(sequence) as previousCalls
order by timestamp asc
)
SELECT *
FROM DistinctRows,
unnest(sequence) as sequence
where (RowNum = 1 and step.elapsedSeconds is null and step.elapsedMinutes is null)
or (RowNum > 1 and step.elapsedSeconds is not null and step.elapsedMinutes is not null)
order by timestamp asc
I need help in figuring out how to filter out the rows like no. 1 and 3 and would appreciate some help.
Thanks in advance.

Hmmm . . . Assuming that stepname is not part of the repeated column:
SELECT dr.* EXCEPT (sequence),
       (SELECT seq
        FROM unnest(dr.sequence) seq
        ORDER BY seq.step.elapsedSeconds DESC NULLS LAST,
                 seq.step.elapsedMinutes DESC NULLS LAST
        LIMIT 1
       ) as sequence
FROM DistinctRows dr
ORDER BY timestamp asc;
If stepname is part of sequence, then the subquery would reaggregate:
SELECT dr.* EXCEPT (sequence),
       (SELECT ARRAY_AGG(s.seq ORDER BY s.seq.stepName)
        FROM (SELECT seq,
                     ROW_NUMBER() OVER (PARTITION BY seq.stepName
                                        ORDER BY seq.step.elapsedSeconds DESC NULLS LAST,
                                                 seq.step.elapsedMinutes DESC NULLS LAST
                                       ) as seqnum
              FROM unnest(dr.sequence) seq
             ) s
        WHERE seqnum = 1
       ) as sequence
FROM DistinctRows dr
ORDER BY timestamp asc

How do I create a new SQL table with custom column names and populate these columns

So I currently have an SQL statement that generates a table with the most frequent occurring value as well as the least frequent occurring value in a table. However this table has 2 rows with the row values as well as the fields. I need to create a custom table with 2 columns with min and max. Then have one row with one value for each. The value for these columns needs to be from the same row.
(SELECT name, COUNT(name) AS frequency
FROM firefighter_certifications
GROUP BY name
ORDER BY frequency DESC limit 1)
UNION
(SELECT name, COUNT(name) AS frequency
FROM firefighter_certifications
GROUP BY name
ORDER BY frequency ASC limit 1);
So for the above query I would need the names of the min and max values in one row. I need to be able to define the name of new columns for the generated SQL query as well.
Min_Name | Max_Name
Certif_1 | Certif_2

I think this query should give you the results you want. It ranks each name according to the number of times it appears in the table, then uses conditional aggregation to select the min and max frequency names in one row:
with cte as (
  select name,
         row_number() over (order by count(*) desc) as maxr,
         row_number() over (order by count(*)) as minr
  from firefighter_certifications
  group by name
)
select max(case when minr = 1 then name end) as Min_Name,
       max(case when maxr = 1 then name end) as Max_Name
from cte
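As a quick check of the conditional-aggregation idea, the sketch below runs a variant of it on SQLite through Python's sqlite3 module. The counts are computed in a separate CTE and name is added as a tie-breaker so the result is deterministic; both changes, and the sample rows, are assumptions made for the example rather than part of the answer above.

```python
import sqlite3

# Invented sample data: Certif_1 appears once, Certif_2 three times, Certif_3 twice.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE firefighter_certifications (name TEXT)")
conn.executemany(
    "INSERT INTO firefighter_certifications VALUES (?)",
    [("Certif_1",), ("Certif_2",), ("Certif_2",),
     ("Certif_2",), ("Certif_3",), ("Certif_3",)],
)

row = conn.execute(
    """
    WITH counted AS (
        SELECT name, COUNT(*) AS cnt
        FROM firefighter_certifications
        GROUP BY name
    ),
    ranked AS (
        SELECT name,
               ROW_NUMBER() OVER (ORDER BY cnt DESC, name) AS maxr,
               ROW_NUMBER() OVER (ORDER BY cnt ASC, name) AS minr
        FROM counted
    )
    -- Each CASE matches exactly one row, so MAX simply picks that name.
    SELECT MAX(CASE WHEN minr = 1 THEN name END) AS Min_Name,
           MAX(CASE WHEN maxr = 1 THEN name END) AS Max_Name
    FROM ranked
    """
).fetchone()

print(row)
```

The query returns a single row with the least frequent name in Min_Name and the most frequent name in Max_Name.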

Postgres doesn't offer "first" and "last" aggregation functions. But there are other, similar methods:
select distinct first_value(name) over (order by cnt desc, name) as name_at_max,
first_value(name) over (order by cnt asc, name) as name_at_min
from (select name, count(*) as cnt
from firefighter_certifications
group by name
) n;
Or without any subquery at all:
select first_value(name) over (order by count(*) desc, name) as name_at_max,
first_value(name) over (order by count(*) asc, name) as name_at_min
from firefighter_certifications
group by name
limit 1;
Here is a db<>fiddle

How do I do multiple columns partitioning with the rows being duplicated?

I have a set of SQL stored procedures that use partitioning for ranking to get percentiles. With the partitioning below I get the correct percentile data. However, my problem is that the result contains duplicate rows. E.g. for each DESC there are multiple duplicates when there is supposed to be only 1 row. Why is this so?
row_nums AS
(
SELECT DATE, DESC, NUM, ROW_NUMBER() OVER (PARTITION BY DATE, DESC ORDER BY NUM ASC) AS Row_Num
FROM ******
)
SELECT .................
This is the output I get currently: (Where there are duplicate rows being returned - Refer to Row 6 to 8)
http://i.stack.imgur.com/foe7g.png
This is the output I want to achieve: http://i.stack.imgur.com/GkrHP.png

You can remove the duplicates by adding one more inner query in the FROM clause, like below:
;WITH row_nums AS
(
    SELECT DATE, DESC, NUM, ROW_NUMBER() OVER (PARTITION BY DATE, DESC ORDER BY NUM ASC) AS Row_Num
    FROM (
        SELECT your required columns, COUNT(duplicated_rows_columnsname)
        FROM ***
        GROUP BY columnnames
        HAVING COUNT(duplicated_rows_columnsname) = 1
    ) AS deduped
)
SELECT .................
Alternatively, you can remove the duplicate rows with a DISTINCT clause in the inner query.
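A minimal sketch of the DISTINCT variant, run on SQLite through Python's sqlite3 module instead of SQL Server. The readings table, its column names and its rows are hypothetical, chosen only to show duplicates being removed before ROW_NUMBER is assigned.

```python
import sqlite3

# Invented sample data: the first two rows are exact duplicates.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (reading_date TEXT, label TEXT, num INTEGER)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [
        ("2024-01-01", "A", 10),
        ("2024-01-01", "A", 10),
        ("2024-01-01", "A", 20),
        ("2024-01-01", "B", 5),
    ],
)

# De-duplicate in the inner query, then number the surviving rows per partition.
numbered = conn.execute(
    """
    SELECT reading_date, label, num,
           ROW_NUMBER() OVER (PARTITION BY reading_date, label
                              ORDER BY num ASC) AS row_num
    FROM (SELECT DISTINCT reading_date, label, num FROM readings) AS deduped
    ORDER BY reading_date, label, row_num
    """
).fetchall()

for row in numbered:
    print(row)
```

Because the duplicates are dropped before the window function runs, each distinct (reading_date, label, num) combination gets exactly one row number.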
