Can I speed up this subquery nested PostgreSQL Query - sql

I have the following PostgreSQL code (which works, but slowly) which I'm using to create a materialized view, however it is quite slow and length of code seems cumbersome with the multiple sub-queries. Is there anyway I can improve the speed this code executes at or rewrite so it's shorter and easier to maintain?
CREATE MATERIALIZED VIEW station_views.obs_10_min_avg_ffdi_powerbi AS
SELECT t.station_num,
initcap(t.station_name) AS station_name,
t.day,
t.month_int,
to_char(to_timestamp(t.month_int::text, 'MM'), 'TMMonth') AS Month,
round(((date_part('year', age(t2.dmax, t2.dmin)) * 12 + date_part('month', age(t2.dmax, t2.dmin))) / 12)::numeric, 1) AS record_years,
round((t2.count_all_vals / t2.max_10_periods * 100)::numeric, 1) AS per_datset,
max(t.avg_bom_fdi) AS max,
avg(t.avg_bom_fdi) AS avg,
percentile_cont(0.95) WITHIN GROUP (ORDER BY t.avg_bom_fdi) AS percentile_cont_95,
percentile_cont(0.99) WITHIN GROUP (ORDER BY t.avg_bom_fdi) AS percentile_cont_99
FROM ( SELECT a.station_num,
d.station_name,
a.ten_minute_intervals_utc,
date_part('day', a.ten_minute_intervals_utc) AS day,
date_part('month', a.ten_minute_intervals_utc) AS month_int,
a.avg_bom_fdi
FROM analysis.obs_10_min_avg_ffdi_bom a,
obs_minute_stn_det d
WHERE d.station_num = a.station_num) t,
( SELECT obs_10_min_avg_ffdi_bom_view.station_num,
obs_10_min_avg_ffdi_bom_view.station_name,
min(obs_10_min_avg_ffdi_bom_view.ten_minute_intervals_utc) AS dmin,
max(obs_10_min_avg_ffdi_bom_view.ten_minute_intervals_utc) AS dmax,
date_part('epoch', max(obs_10_min_avg_ffdi_bom_view.ten_minute_intervals_utc) - min(obs_10_min_avg_ffdi_bom_view.ten_minute_intervals_utc)) / 600 AS max_10_periods,
count(*) AS count_all_vals
FROM analysis.obs_10_min_avg_ffdi_bom_view
GROUP BY obs_10_min_avg_ffdi_bom_view.station_num, obs_10_min_avg_ffdi_bom_view.station_name) t2
WHERE t.station_num = t2.station_num
GROUP BY t.station_num, t.station_name, Month, t.month_int, t.day, record_years, per_datset
ORDER BY t.month_int, t.day
WITH DATA;
The output I get is a row for each weather station (station_num & station_name) along with the day & month that a weather variable is recorded (avg_bom_fdi). The month value is retained and converted to a name for purposes of plotting values averaged per month on the chart. I also pull in the total number of years that recordings exist for that station (record_years) and a percentage of how complete that dataset is (per_datset). These both come from the second subquery (t2). The first subquery (t) is used to average the data per day and return the daily max, average and 95/99th percentiles.

I agree with the running the explain plan / execution plan on this
query.
Also , if not needed remove order by
If you see , lot of
time spent on fetching a particular value while reviewing execution plan,
try creating an index on that particular column.
Depending on high
and low cardinality , you can create B-Tree or Bit Map index,if you are deciding on index.

I think you need read something about Execution plan. It's good way to understand what doing with you query.
I recommended you documentation about this problem - LINK

Related

Oracle SQL: How to best go about counting how many values were in time intervals? Database query vs. pandas (or more efficient libraries)?

I currently have to wrap my head around programming the following task.
Situation: suppose we have one column where we have time data (Year-Month-Day Hours-Minutes). Our program shall get the input (weekday, starttime, endtime, timeslot) and we want to return the interval (specified by timeslot) where there are the least values. For further information, the database has several million entries.
So our program would be specified as
def calculate_optimal_window(weekday, starttime, endtime, timeslot):
return optimal_window
Example: suppose we want to input
weekday = Monday, starttime = 10:00, endtime = 12:00, timeslot = 30 minutes.
Here we want to count how many entries there are between 10:00 and 12:00 o'clock, and compute the number of values in every single 30 minute slot (i.e. 10:00 - 10:30, 10:01 - 10:31 etc.) and in the end return the slot with the least values. How would you go about formulating an efficient query?
Since I'm working with an Oracle SQL database, my second question is: would it be more efficient to work with libraries like Dask or Vaex to get the filtering and counting done? Where is the bottleneck in this situation?
Happy to provide more information if the formulation was too blurry.
All the best.
This part:
Since I'm working with an Oracle SQL database, my second question is:
would it be more efficient to work with libraries like Dask or Vaex to
get the filtering and counting done? Where is the bottleneck in this
situation?
Depending on your server's specs and the cluster/machine you have available for Dask, it is rather likely that the bottleneck in your analysis would be the transfer of data between the SQL and Dask workers, even in the (likely) case that this can be efficiently parallelised. From the DB's point of view, selecting data and serialising it is likely at least as expensive as counting in a relatively small number of time bins.
I would start by investigating how long the process takes with SQL alone, and whether this is acceptable, before moving the analysis to Dask. Usual rules would apply: having good indexing and sharding on the time index.
You should at least do the basic filtering and counting in the SQL query. With a simple predicate, Oracle can decide whether to use an index or a partition and potentially reduce the database processing time. And sending fewer rows will significantly decrease the network overhead.
For example:
select trunc(the_time, 'MI') the_minute, count(*) the_count
from test1
where the_time between timestamp '2021-01-25 10:00:00' and timestamp '2021-01-25 11:59:59'
group by trunc(the_time, 'MI')
order by the_minute desc;
(The trickiest part of these queries will probably be off-by-one issues. Do you really want "between 10:00 and 12:00", or do you want "between 10:00 and 11:59:59"?)
Optionally, you can perform the entire calculation in SQL. I would wager the SQL version will be slightly faster, again because of the network overhead. But sending one result row versus 120 aggregate rows probably won't make a noticeable difference unless this query is frequently executed.
At this point, the question veers into the more subjective question about where to put the "business logic". I bet most programmers would prefer your Python solution to my query. But one minor advantage of doing all the work in SQL is keeping all of the weird date logic in one place. If you process the results in multiple steps there are more chances for an off-by-one error.
--Time slots with the smallest number of rows.
--(There will be lots of ties because the data is so boring.)
with dates as
(
--Enter literals or bind variables here:
select
cast(timestamp '2021-01-25 10:00:00' as date) begin_date,
cast(timestamp '2021-01-25 11:59:59' as date) end_date,
30 timeslot
from dual
)
--Choose the rows with the smallest counts.
select begin_time, end_time, total_count
from
(
--Rank the time slots per count.
select begin_time, end_time, total_count,
dense_rank() over (order by total_count) smallest_when_1
from
(
--Counts per timeslot.
select begin_time, end_time, sum(the_count) total_count
from
(
--Counts per minute.
select trunc(the_time, 'MI') the_minute, count(*) the_count
from test1
where the_time between (select begin_date from dates) and (select end_date from dates)
group by trunc(the_time, 'MI')
order by the_minute desc
) counts
join
(
--Time ranges.
select
begin_date + ((level-1)/24/60) begin_time,
begin_date + ((level-1)/24/60) + (timeslot/24/60) end_time
from dates
connect by level <=
(
--The number of different time ranges.
select (end_date - begin_date) * 24 * 60 - timeslot + 1
from dates
)
) time_ranges
on the_minute between begin_time and end_time
group by begin_time, end_time
)
)
where smallest_when_1 = 1
order by begin_time;
You can run a db<>fiddle here.

Grouping by last day of each month—inefficient running

I am attempting to pull month end balances from all accounts a customer has for every month. Here is what I've written. This runs correctly and gives me what I want—but it also runs extremely slowly. How would you recommend speeding it up?
SELECT DISTINCT
[AccountNo]
,SourceDate
,[AccountType]
,[AccountSub]
,[Balance]
FROM [DW].[dbo].[dwFactAccount] AS fact
WHERE SourceDate IN (
SELECT MAX(SOURCEDATE)
FROM [DW].[dbo].[dwFactAccount]
GROUP BY MONTH(SOURCEDATE),
YEAR (SOURCEDATE)
)
ORDER BY SourceDate DESC
I'd try a window function:
SELECT * FROM (
SELECT
[AccountNo]
,[SourceDate]
,[AccountType]
,[AccountSub]
,[Balance]
,ROW_NUMBER() OVER(
PARTITION BY accountno, EOMONTH(sourcedate)
ORDER BY sourcedate DESC
) as rn
FROM [DW].[dbo].[dwFactAccount]
)x
WHERE x.rn = 1
The row number will establish an incrementing counter in order of sourcedate descending. The counter will restart from 1 when the month in sourcedate changes (or the account number changes) thanks to the EOMONTH function quantising any date in a given month to be the last date of the month (2020-03-9 12:34:56 becomes 2020-03-31, as do all other datetimes in March). Any similar tactic to quantise to a fixed date in the month would also work such as using YEAR(sourcedate), MONTH(sourcedate)
You need to build a dimension table for the date with Date as PK, and your SourceDate in the fact table ref. that date dimension table.
Date dimension table can have month, year, week, is_weekend, is_holiday, etc. columns. You join your fact table with the date dimension table and you can group data using any columns in date table you want.
Your absolute first step should be to view the execution plan for the query and determine why the query is slow.
The following explains how to see a graphical execution plan:
Display an Actual Execution Plan
The steps to interpreting the plan and optimizing the query are too much for an SO answer, but you should be able to find some good articles on the topic by Googling. You could also post the plan in an edit to your question and get some real feedback on what steps to take to improve query performance.

Chrome UX Report: improve query performance

I am querying the Chrome UX Report public dataset using the following query to get values for the indicated metrics over time for a set of country-specific tables. The query runs for a very long time (I stopped it at 180 seconds) because I don't know what the timeout is for a query or how to tell if the query hung.
I'm trying to get aggregate data year over year for average_fcp, average_fp and average_dcl. I'm not sure if I'm using BigQuery correctly or there are ways to optimize the query to make it runs faster
This is the query I'm using.
SELECT
_TABLE_SUFFIX AS yyyymm,
AVG(fcp.density) AS average_fcp,
AVG(fp.density) as average_fp,
AVG(dcl.density) as average_dcl
FROM
`chrome-ux-report.country_cl.*`,
UNNEST(first_paint.histogram.bin) as fp,
UNNEST(dom_content_loaded.histogram.bin) as dcl,
UNNEST(first_contentful_paint.histogram.bin) AS fcp
WHERE
form_factor.name = 'desktop' AND
fcp.start > 20000
GROUP BY
yyyymm
ORDER BY
yyyymm
I'm not sure if it makes mathematical sense to get the AVG() of all the densities - but let's do it anyways.
The bigger problem in the query is this:
UNNEST(first_paint.histogram.bin) as fp,
UNNEST(dom_content_loaded.histogram.bin) as dcl,
UNNEST(first_contentful_paint.histogram.bin) AS fcp
-- that's an explosive join: It transforms one row with 3 arrays with ~500 elements each, into 125 million rows!!! That's why the query isn't running.
A similar query that gives you similar results:
SELECT yyyymm,
AVG(average_fcp) average_fcp,
AVG(average_fp) average_fp,
AVG(average_dcl) average_dcl
FROM (
SELECT
_TABLE_SUFFIX AS yyyymm,
(SELECT AVG(fcp.density) FROM UNNEST(first_contentful_paint.histogram.bin) fcp WHERE fcp.start > 20000) AS average_fcp,
(SELECT AVG(fp.density) FROM UNNEST(first_paint.histogram.bin) fp) AS average_fp,
(SELECT AVG(dcl.density) FROM UNNEST(dom_content_loaded.histogram.bin) dcl) AS average_dcl
FROM `chrome-ux-report.country_cl.*`
WHERE form_factor.name = 'desktop'
)
GROUP BY yyyymm
ORDER BY yyyymm
The good news: This query runs in 3.3 seconds.
Now that the query runs in 3 seconds, the most important question is: Does it make sense mathematically?
Bonus: This query makes more sense to me mathematically speaking, but I'm not 100% sure about it:
SELECT yyyymm,
AVG(average_fcp) average_fcp,
AVG(average_fp) average_fp,
AVG(average_dcl) average_dcl
FROM (
SELECT yyyymm, origin, SUM(weighted_fcp) average_fcp, SUM(weighted_fp) average_fp, SUM(weighted_dcl) average_dcl
FROM (
SELECT
_TABLE_SUFFIX AS yyyymm,
(SELECT SUM(start*density) FROM UNNEST(first_contentful_paint.histogram.bin)) AS weighted_fcp,
(SELECT SUM(start*density) FROM UNNEST(first_paint.histogram.bin)) AS weighted_fp,
(SELECT SUM(start*density) FROM UNNEST(dom_content_loaded.histogram.bin)) AS weighted_dcl,
origin
FROM `chrome-ux-report.country_cl.*`
)
GROUP BY origin, yyyymm
)
GROUP BY yyyymm
ORDER BY yyyymm
After carefully reviewing your query, I concluded that the processing time for each of the actions you are performing is around 6 seconds or less. Therefore, I decided to execute each task from each unnest and then append the tables together, using UNION ALL method.
The query ran within 4 seconds. The syntax is:
SELECT
_TABLE_SUFFIX AS yyyymm,
AVG(fcp.density) AS average_fcp,
FROM
`chrome-ux-report.country_cl.*`,
UNNEST(first_contentful_paint.histogram.bin) AS fcp
WHERE
form_factor.name = 'desktop' AND
fcp.start > 20000
GROUP BY
yyyymm
UNION ALL
SELECT
_TABLE_SUFFIX AS yyyymm,
AVG(fp.density) as average_fp,
FROM
`chrome-ux-report.country_cl.*`,
UNNEST(first_paint.histogram.bin) as fp
WHERE
form_factor.name = 'desktop'
GROUP BY
yyyymm
UNION ALL
SELECT
_TABLE_SUFFIX AS yyyymm,
AVG(dcl.density) as average_dcl
FROM
`chrome-ux-report.country_cl.*`,
UNNEST(dom_content_loaded.histogram.bin) as dcl
WHERE
form_factor.name = 'desktop'
GROUP BY
yyyymm
ORDER BY
yyyymm
In addition, I would like to point that according to the documentation it is advisable to avoid the excessive use of wildcards opting to use date ranges and materializing large datasets results. Also, I would like to point that BigQuery limits cached results to 10gb.
I hope it helps.
Let me start saying that BigQuery query timeout is very long (6 hours) so you should not have a problem on this front but you might encounter other errors.
We had the same issue internally, we have datasets with data divided in country tables, even though the tables are partitioned on timestamp when running queries over hounders of tables, not only the query takes a long time, but sometime it will fail with resources exceeded error.
Our solution was to aggregate all this table into one single one adding 'country' column use it as clustering column. This not only made our queries executed but it made them even faster than our temporary solution of running the same query on a sub set of the country tables as intermediate steps and then combining the results together. It is now faster, easier and cleaner.
Coming back to your specific question, I suggest to create a new table (which you will need to host $$$) that is the combination of all the tables inside a dataset as a partitioned table.
The quickest way, unfortunately also the more expensive one (you will pay for the query scan), is to use a create table statement.
create table `project_id.dataset_id.table_id`
partition by date_month
cluster by origin
as (
select
date(PARSE_TIMESTAMP("%Y%m%d", concat(_table_suffix, "01"), "UTC")) as date_month,
*
from `chrome-ux-report.country_cl.*`
);
If this query fails you can run it on a sub set of table e.g. where starts_with(_table_suffix, '2018') and the run the following query with the 'write append' disposition against the table you create before.
select
date(PARSE_TIMESTAMP("%Y%m%d", concat(_table_suffix, "01"), "UTC")) as date_month,
*
from `chrome-ux-report.country_cl.*`
where starts_with(_table_suffix, '2019')
If you noticed I also used a clustering column, which is think is a best practice to do.
Note for who is curating Google public datasets.
It would be nice to have a public "chrome_ux_report" dataset with a just a single table partitioned by date and clustered by country.

How to group timestamps into islands (based on arbitrary gap)?

Consider this list of dates as timestamptz:
I grouped the dates by hand using colors: every group is separated from the next by a gap of at least 2 minutes.
I'm trying to measure how much a given user studied, by looking at when they performed an action (the data is when they finished studying a sentence.) e.g.: on the yellow block, I'd consider the user studied in one sitting, from 14:24 till 14:27, or roughly 3 minutes in a row.
I see how I could group these dates with a programming language by going through all of the dates and looking for the gap between two rows.
My question is: how would go about grouping dates in this way with Postgres?
(Looking for 'gaps' on Google or SO brings too many irrelevant results; I think I'm missing the vocabulary for what I'm trying to do here.)
SELECT done, count(*) FILTER (WHERE step) OVER (ORDER BY done) AS grp
FROM (
SELECT done
, lag(done) OVER (ORDER BY done) <= done - interval '2 min' AS step
FROM tbl
) sub
ORDER BY done;
The subquery sub returns step = true if the previous row is at least 2 min away - sorted by the timestamp column done itself in this case.
The outer query adds a rolling count of steps, effectively the group number (grp) - combining the aggregate FILTER clause with another window function.
fiddle
Related:
Query to find all timestamps more than a certain interval apart
How to label groups in postgresql when group belonging depends on the preceding line?
Select longest continuous sequence
Grouping or Window
About the aggregate FILTER clause:
Aggregate columns with additional (distinct) filters
Conditional lead/lag function PostgreSQL?
Building up on Erwin's answer, here is the full query for tallying up the amount of time people spent on those sessions/islands:
My data only shows when people finished reviewing something, not when they started, which means we don't know when a session truly started; and some islands only have one timestamp in them (leading to a 0-duration estimate.) I'm accounting for both by calculating the average review time and adding it to the total duration of islands.
This is likely very idiosyncratic to my use case, but I learned a thing or two in the process, so maybe this will help someone down the line.
-- Returns estimated total study time and average time per review, both in seconds
SELECT (EXTRACT( EPOCH FROM logged) + countofislands * avgreviewtime) as totalstudytime, avgreviewtime -- add total logged time to estimate for first-review-in-island and 1-review islands
FROM
(
SELECT -- get the three key values that will let us calculate total time spent
sum(duration) as logged
, count(island) as countofislands
, EXTRACT( EPOCH FROM sum(duration) FILTER (WHERE duration != '00:00:00'::interval) )/( sum(reviews) FILTER (WHERE duration != '00:00:00'::interval) - count(reviews) FILTER (WHERE duration != '00:00:00'::interval)) as avgreviewtime
FROM
(
SELECT island, age( max(done), min(done) ) as duration, count(island) as reviews -- calculate the duration of islands
FROM
(
SELECT done, count(*) FILTER (WHERE step) OVER (ORDER BY done) AS island -- give a unique number to each island
FROM (
SELECT -- detect the beginning of islands
done,
(
lag(done) OVER (ORDER BY done) <= done - interval '2 min'
) AS step
FROM review
WHERE clicker_id = 71 AND "done" > '2015-05-13' AND "done" < '2015-05-13 15:00:00' -- keep the queries small and fast for now
) sub
ORDER BY done
) grouped
GROUP BY island
) sessions
) summary

Enhancing Performance

I'm not as clued up on shortcuts in SQL so I was hoping to utilize the brainpower on here to help speed up a query I'm using. I'm currently using Oracle 8i.
I have a query:
SELECT
NAME_CODE, ACTIVITY_CODE, GPS_CODE
FROM
(SELECT
a.NAME_CODE, b.ACTIVITY_CODE, a.GPS_CODE,
ROW_NUMBER() OVER (PARTITION BY a.GPS_DATE ORDER BY b.ACTIVITY_DATE DESC) AS RN
FROM GPS_TABLE a, ACTIVITY_TABLE b
WHERE a.NAME_CODE = b.NAME_CODE
AND a.GPS_DATE >= b.ACTIVITY_DATE
AND TRUNC(a.GPS_DATE) > TRUNC(SYSDATE) - 2)
WHERE
RN = 1
and this takes about 7 minutes give or take 10 seconds to run.
Now the GPS_TABLE is currently 6.586.429 rows and continues to grow as new GPS coordinates are put into the system, each day it grows by about 8.000 rows in 6 columns.
The ACTIVITY_TABLE is currently 1.989.093 rows and continues to grow as new activities are put into the system, each day it grows by about 2.000 rows in 31 columns.
So all in all these are not small tables and I understand that there will always be a time hit running this or similar queries. As you can see I'm already limiting it to only the last 2 days worth of data, but anything to speed it up would be appreciated.
Your strongest filter seems to be the filter on the last 2 days of GPS_TABLE. It should filter the GPS_TABLE to about 15k rows. Therefore one of the best candidate for improvement is an index on the column GPS_DATE.
You will find that your filter TRUNC(a.GPS_DATE) > TRUNC(SYSDATE) - 2 is equivalent to a.GPS_DATE > TRUNC(SYSDATE) - 2, therefore a simple index on your column will work if you change the query. If you can't change it, you could add a function-based index on TRUNC(GPS_DATE).
Once you have this index in place, we need to access the rows in ACTIVITY_TABLE. The problem with your join is that we will get all the old activities and therefore a good portion of the table. This means that the join as it is will not be efficient with index scans.
I suggest you define an index on ACTIVITY_TABLE(name_code, activity_date DESC) and a PL/SQL function that will retrieve the last activity in the least amount of work using this index specifically:
CREATE OR REPLACE FUNCTION get_last_activity (p_name_code VARCHAR2,
p_gps_date DATE)
RETURN ACTIVITY_TABLE.activity_code%type IS
l_result ACTIVITY_TABLE.activity_code%type;
BEGIN
SELECT activity_code
INTO l_result
FROM (SELECT activity_code
FROM activity_table
WHERE name_code = p_name_code
AND activity_date <= p_gps_date
ORDER BY activity_date DESC)
WHERE ROWNUM = 1;
RETURN l_result;
END;
Modify your query to use this function:
SELECT a.NAME_CODE,
a.GPS_CODE,
get_last_activity(a.name_code, a.gps_date)
FROM GPS_TABLE a
WHERE trunc(a.GPS_DATE) > trunc(sysdate) - 2
Optimising an SQL query is generally done by:
Add some indexes
Try a different way to get the same information
So, start by adding an index for ACTIVITY_DATE, and perhaps some other fields that are used in the conditions.