Are inline queries a bad idea? - sql

I have a table containing the runtimes for generators on different sites, and I want to select the most recent entry for each site. Each generator is run once or twice a week.
I have a query that will do this, but I wonder if it's the best option. I can't help thinking that using WHERE x IN (SELECT ...) is lazy and not the best way to formulate the query - any query.
The table is as follows:
CREATE TABLE generator_logs (
    id integer NOT NULL,
    site_id character varying(4) NOT NULL,
    start timestamp without time zone NOT NULL,
    "end" timestamp without time zone NOT NULL,
    duration integer NOT NULL
);
And the query:
SELECT id, site_id, start, "end", duration
FROM generator_logs
WHERE start IN (SELECT MAX(start) AS start
                FROM generator_logs
                GROUP BY site_id)
ORDER BY start DESC
There isn't a huge amount of data, so I'm not worried about optimising this particular query. However, I do have to do similar things on tables with tens of millions of rows (big tables as far as I'm concerned!), and there optimisation is more important.
So is there a better query for this, and are inline queries generally a bad idea?

Should your query not be correlated? i.e.:
SELECT id, site_id, start, "end", duration
FROM generator_logs g1
WHERE start = (SELECT MAX(g2.start) AS start
               FROM generator_logs g2
               WHERE g2.site_id = g1.site_id)
ORDER BY start DESC
Otherwise you will potentially pick up non-latest logs whose start value happens to match the latest start for a different site.
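To make the failure mode concrete, consider this hypothetical data (made-up rows, purely for illustration):
INSERT INTO generator_logs (id, site_id, start, "end", duration) VALUES
    (1, 'A', '2024-01-05 08:00', '2024-01-05 09:00', 3600),
    (2, 'B', '2024-01-05 08:00', '2024-01-05 09:00', 3600), -- not B's latest
    (3, 'B', '2024-01-07 08:00', '2024-01-07 09:00', 3600);
The uncorrelated IN list contains both 2024-01-05 08:00 (site A's max) and 2024-01-07 08:00 (site B's max), so the stale row 2 is returned as well, because its start happens to equal site A's maximum.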
Or alternatively:
SELECT id, site_id, start, "end", duration
FROM generator_logs g1
WHERE (site_id, start) IN (SELECT site_id, MAX(g2.start) AS start
                           FROM generator_logs g2
                           GROUP BY site_id)
ORDER BY start DESC

I would use a join, as joins often perform much better than an IN clause:
select gl.id, gl.site_id, gl.start, gl."end", gl.duration
from generator_logs gl
inner join (
    select max(start) as start, site_id
    from generator_logs
    group by site_id
) gl2
    on gl.site_id = gl2.site_id
    and gl.start = gl2.start
Also, as Tony pointed out, you were missing the correlation in your original query.

In MySQL this could be problematic, because last I checked it was unable to optimise subqueries effectively (i.e. by query rewriting). Many DBMSs have genetic query planners that will produce the same plan regardless of how the input query is structured.
MySQL will in some cases create a temp table for this situation, other times not; depending on the circumstances, indexing and conditions, subqueries can still be rather quick.
Some complain that subqueries are hard to read, but they're perfectly fine if you factor them out into local variables:
$maxids = 'SELECT MAX(start) AS start FROM generator_logs GROUP BY site_id';
$q = "
    SELECT id, site_id, start, \"end\", duration
    FROM generator_logs
    WHERE start IN ($maxids)
    ORDER BY start DESC
";

This problem - finding not just the MAX, but the rest of the corresponding row - is a common one. Luckily, Postgres provides a nice way to do it in one query, using DISTINCT ON:
SELECT DISTINCT ON (site_id)
       id, site_id, start, "end", duration
FROM generator_logs
ORDER BY site_id, start DESC;
DISTINCT ON (site_id) means "return one record per site_id". The ORDER BY clause determines which record that is. Note, however, that this is subtly different from your original query: if you have two records for the same site with the same start, your query would return both, while this returns only one.
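If you need the result to be deterministic when such ties occur, extend the sort - a minimal sketch, assuming id is an acceptable tie-breaker:
SELECT DISTINCT ON (site_id)
       id, site_id, start, "end", duration
FROM generator_logs
ORDER BY site_id, start DESC, id DESC; -- id decides between rows with equal start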

A way to find records having the MAX value per group is to select those records for which there is no record within the same group having a higher value:
SELECT id, site_id, "start", "end", duration
FROM generator_logs g1
WHERE NOT EXISTS (
    SELECT 1
    FROM generator_logs g2
    WHERE g2.site_id = g1.site_id
      AND g2."start" > g1."start"
);

Related

Query Optimization with ROW_NUMBER

I have this query:
SELECT
    PE1.PRODUCT_EQUIPMENT_KEY, -- primary key
    PE1.Customer_Ban,
    PE1.Subscriber_No,
    PE1.Prod_Equip_Cd,
    PE1.Prod_Equip_Txt,
    PE1.Prod_Equip_Category_Txt--,
    -- PE2.ep_rnk ------------------ UNCOMMENT THIS LINE
FROM
    INT_ADM.Product_Equipment_Dim PE1
    INNER JOIN (
        SELECT
            PRODUCT_EQUIPMENT_KEY,
            ROW_NUMBER() OVER (PARTITION BY Customer_Ban, Subscriber_No ORDER BY Start_Dt ASC) AS ep_rnk
        FROM INT_ADM.Product_Equipment_Dim PE2
    ) PE2
        ON PE2.PRODUCT_EQUIPMENT_KEY = PE1.PRODUCT_EQUIPMENT_KEY
WHERE
    Line_Of_Business_Cd = 'M'
    AND /*v_Date_Start*/ TO_DATE( '2022/01/12', 'yyyy/mm/dd' ) BETWEEN Start_Dt AND End_Dt
    AND Current_Ind = 'Y'
If I run it as you see it, it runs in under a second.
If I run it with the -- PE2.ep_rnk line uncommented, the query takes up to 5 minutes to complete.
I know it's something to do with ROW_NUMBER(), but after looking all over online I can't find a good explanation or solution. Does anyone know why uncommenting that line makes the query so slow, and what I can do so it runs fast?
Thanks in advance for your help.
The root cause is that even if the predicate in the WHERE clause allows efficient access to the rows of the table (I suspect your sub-second response is the time to get the first page of the result), the subquery has to access all rows of the table, window-sort them, and finally join them to the first row source.
So if you comment out ep_rnk, Oracle is smart enough not to evaluate the subquery at all, because the subquery is on the same table and the join is on the primary key - so no row can be lost or duplicated in the join.
What can you improve?
Not much. If the WHERE condition filters the table very restrictively (so you end up with only a small number of PRODUCT_EQUIPMENT_KEY values), apply the same filter in the subquery:
(
    SELECT
        PRODUCT_EQUIPMENT_KEY,
        ROW_NUMBER() OVER (PARTITION BY Customer_Ban, Subscriber_No ORDER BY Start_Dt ASC) AS ep_rnk
    FROM INT_ADM.Product_Equipment_Dim PE2
    -- filter added
    WHERE PRODUCT_EQUIPMENT_KEY IN (
        SELECT PRODUCT_EQUIPMENT_KEY
        FROM INT_ADM.Product_Equipment_Dim
        WHERE ... same predicate as in the main query ...
    )
) PE2
If the predicate returns all (or most) of the PRODUCT_EQUIPMENT_KEY values, the only (often used) way is to pre-calculate the rank, e.g. in a materialized view.
The materialized view is defined as follows:
SELECT
    PE1.PRODUCT_EQUIPMENT_KEY, -- primary key
    PE1.Customer_Ban,
    PE1.Subscriber_No,
    PE1.Prod_Equip_Cd,
    PE1.Prod_Equip_Txt,
    PE1.Prod_Equip_Category_Txt,
    ROW_NUMBER() OVER (PARTITION BY Customer_Ban, Subscriber_No ORDER BY Start_Dt ASC) AS ep_rnk
FROM
    INT_ADM.Product_Equipment_Dim PE1
and you simply query from it, without a join.
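For illustration, the complete DDL might look something like this - a sketch only, where the view name and refresh options are assumptions to adapt to your environment:
CREATE MATERIALIZED VIEW INT_ADM.Product_Equipment_Rnk_MV
BUILD IMMEDIATE
REFRESH COMPLETE ON DEMAND -- assumed refresh strategy; depends on how fresh the rank must be
AS
SELECT
    PE1.PRODUCT_EQUIPMENT_KEY, -- primary key
    PE1.Customer_Ban,
    PE1.Subscriber_No,
    PE1.Prod_Equip_Cd,
    PE1.Prod_Equip_Txt,
    PE1.Prod_Equip_Category_Txt,
    ROW_NUMBER() OVER (PARTITION BY Customer_Ban, Subscriber_No ORDER BY Start_Dt ASC) AS ep_rnk
FROM
    INT_ADM.Product_Equipment_Dim PE1;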

Select columns from second subquery if first returns NULL

I have two queries that I'm running separately from a PHP script. The first is checking if an identifier (group) has a timestamp in a table.
SELECT
group, MAX(timestamp) AS timestamp, value
FROM table_schema.sample_table
GROUP BY group, value
If there is no timestamp, then it runs this second query that retrieves the minimum timestamp from a separate table:
SELECT
group, MIN(timestamp) as timestamp, value AS value
FROM table_schema.src_table
GROUP BY group, value
And goes on from there.
What I would like to do, for the sake of conciseness, is have a single query that runs the first statement but defaults to the second if it returns NULL. I've tried coalesce() and CASE statements, but they require subqueries to return single columns (which I hadn't run into as an issue before). I then decided to try a JOIN on the table with the aggregate timestamp to get the whole row, but quickly realized I can't vary the table being joined (not to my knowledge). So I opted to try joining both results and getting the max, something like this:
Edit: I am so tired - this should be a UNION, not a JOIN. Sorry for any possible confusion :(
SELECT smpl.group, smpl.value, MAX(smpl.timestamp) AS timestamp
FROM table_schema.sample_table AS smpl
INNER JOIN (
    SELECT src.group, src.value, MIN(src.timestamp) AS timestamp
    FROM source_table src
    GROUP BY src.group, src.value
) AS history
    ON smpl.group = history.group
GROUP BY smpl.group, smpl.value
I don't have a SELECT MAX() on this because it's really slow as is, most likely because my SQL is a bit rusty.
If anyone knows a better approach, I'd appreciate it!
Please try this:
select mx.group,
       (case when mx.timestamp is null then mn.timestamp else mx.timestamp end) timestamp,
       (case when mx.timestamp is null then mn.value else mx.value end) value
from (
    SELECT group, MAX(timestamp) AS timestamp, value
    FROM table_schema.sample_table
    GROUP BY group, value
) mx
left join (
    SELECT group, MIN(timestamp) AS timestamp, value
    FROM table_schema.src_table
    GROUP BY group, value
) mn
    on mx.group = mn.group
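Since both CASE expressions test the same condition, COALESCE can express the fallback more concisely - a sketch over the same derived tables, assuming value is NULL exactly when timestamp is (otherwise keep the CASE form):
select mx.group,
       coalesce(mx.timestamp, mn.timestamp) as timestamp, -- falls back to mn when mx has no timestamp
       coalesce(mx.value, mn.value) as value
from (
    SELECT group, MAX(timestamp) AS timestamp, value
    FROM table_schema.sample_table
    GROUP BY group, value
) mx
left join (
    SELECT group, MIN(timestamp) AS timestamp, value
    FROM table_schema.src_table
    GROUP BY group, value
) mn
    on mx.group = mn.group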

Fastest way to query latest version in large SQL Server table?

What is the fastest way to query for "latest version" on a large SQL table using an update timestamp in SQL Server?
I'm currently using the inner-join approach on a very large SQL Server weather forecast table with columns Date, City, Hour, Temperature, UpdateTimestamp. To get the latest temperature forecast, I created a view using an inner join on Date, City and Hour plus MAX(UpdateTimestamp), as in this other posting.
However, as the dataset in the original table grows, the view query is getting slower and slower.
I wonder if others have encountered a similar situation, and what the best way is to speed up this query (one alternative I'm considering is a stored procedure that runs each day and builds a separate table containing only the "latest version", which would then be very quick to access).
EDIT 4/4 - I've found the best solution so far (thanks Vikram) was to add an index to my table on the 3 fields "TSUnix", "CityId", "DTUnix", which sped up performance considerably (from 25 seconds to 4 seconds).
I've also tried a ROW_NUMBER solution (sample query below), although it appears a bit slower than the inner-join approach. Both queries, plus the index creation, are below:
Index Creation:
USE [<My DB>]
GO
CREATE NONCLUSTERED INDEX [index_WeatherForecastData]
ON [dbo].[<WeatherForecastData>] ([TSUnix], [CityId], [DTUnix])
INCLUDE ([Temperature], [TemperatureMin], [TemperatureMax], [Humidity], [WindSpeed], [Rain], [Snow])
GO
Query:
-- Inner Join Version
SELECT W.TSUnix, W.CityId, W.DTUnix, W.Temperature, W.*
FROM WeatherForecastData W
INNER JOIN (
    SELECT MAX(TSUnix) Latest, CityId, DTUnix
    FROM WeatherForecastData
    GROUP BY CityId, DTUnix
) L
    ON L.Latest = W.TSUnix
    AND L.CityID = W.CityID
    AND L.DTUnix = W.DTUnix

-- Row Number Version
SELECT W.TSUnix, W.CityId, W.DTUnix, W.Temperature, W.*
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY DTUnix, CityId ORDER BY TSUnix DESC) AS RowNumber
    FROM WeatherForecastData
) W
WHERE W.RowNumber = 1
Thanks!
Use ROW_NUMBER with an index, as shown below.
The specific index that will make this fast has Date, City, Hour and UpdateTimestamp descending. It allows a single pass over the table rather than the multiple passes an INNER JOIN would likely require.
Working code: http://sqlfiddle.com/#!18/8c0b4/1
SELECT Date, City, Hour, Temperature
FROM (
    SELECT
        Date, City, Hour, Temperature,
        ROW_NUMBER() OVER (PARTITION BY Date, City, Hour
                           ORDER BY UpdateTimestamp DESC) AS RowNumber
    FROM Test
) AS t
WHERE t.RowNumber = 1
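For reference, the supporting index described above might be created like this against the fiddle's Test table (the index name is illustrative):
-- Keys match the PARTITION BY columns, with the timestamp descending for the ORDER BY
CREATE INDEX IX_Test_LatestVersion
    ON Test (Date, City, Hour, UpdateTimestamp DESC);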

SQL Querying Column on Max Date

I apologize if my code is not properly typed. I am trying to query a table that will return the latest bgcheckdate and status report. The table contains additional bgcheckdates and statuses for each record but in my report I only need to see the latest bgcheckdate with its status.
SELECT BG.PEOPLE_ID, MAX(BG.DATE_RUN) AS DATERUN, BG.STATUS
FROM PKS_BGCHECK BG
GROUP BY BG.PEOPLE_ID, BG.status;
When I run the above query, I still get multiple background check dates and statuses for the same person.
Whereas when I run without the status, it works fine:
SELECT BG.PEOPLE_ID, MAX(BG.DATE_RUN)
FROM PKS_BGCHECK BG
GROUP BY BG.PEOPLE_ID;
So I'm just wondering if anyone can help me figure out how to query the date run and status, with both reflecting the latest date.
The best solution depends on which RDBMS you are using.
Here is one with basic, standard SQL:
SELECT bg.PEOPLE_ID, bg.DATE_RUN, bg.STATUS
FROM (
    SELECT PEOPLE_ID, MAX(DATE_RUN) AS MAX_DATERUN
    FROM PKS_BGCHECK
    GROUP BY PEOPLE_ID
) sub
JOIN PKS_BGCHECK bg ON bg.PEOPLE_ID = sub.PEOPLE_ID
                   AND bg.DATE_RUN = sub.MAX_DATERUN;
But you can get multiple rows per PEOPLE_ID if there are ties.
In Oracle, Postgres or SQL Server and others (but not MySQL) you can also use the window function row_number():
WITH cte AS (
    SELECT PEOPLE_ID, DATE_RUN, STATUS
         , ROW_NUMBER() OVER (PARTITION BY PEOPLE_ID ORDER BY DATE_RUN DESC) AS rn
    FROM PKS_BGCHECK
)
SELECT PEOPLE_ID, DATE_RUN, STATUS
FROM cte
WHERE rn = 1;
This guarantees 1 row per PEOPLE_ID. Ties are resolved arbitrarily. Add more expressions to ORDER BY to break ties deterministically.
In Postgres, the simplest solution would be with DISTINCT ON.
Details for both in this related answer:
Select first row in each GROUP BY group?
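For completeness, the DISTINCT ON version might look like this if you are on Postgres (a sketch using the question's columns):
SELECT DISTINCT ON (PEOPLE_ID)
       PEOPLE_ID, DATE_RUN, STATUS
FROM PKS_BGCHECK
ORDER BY PEOPLE_ID, DATE_RUN DESC; -- latest DATE_RUN wins per PEOPLE_ID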
Selecting the latest row in a time-sensitive set is fairly easy and largely platform independent:
SELECT BG.PEOPLE_ID, BG.DATE_RUN, BG.STATUS
FROM PKS_BGCHECK BG
WHERE BG.DATE_RUN = (
    SELECT MAX( DATE_RUN )
    FROM PKS_BGCHECK
    WHERE PEOPLE_ID = BG.PEOPLE_ID
      AND DATE_RUN < SYSDATE
);
If the PK is (PEOPLE_ID, DATE_RUN), the query will execute about as quickly as any other method. If they don't form the PK (why not???), then use them to form a unique index. But I'm sure you're already doing one or the other.
Btw, you don't really need the AND part of the subquery if you don't allow future dates to be entered. Some temporal implementations allow future dates (planned or scheduled events), so I'm used to adding it.
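If that unique index doesn't already exist, it might be created like this (a sketch; the index name is made up):
CREATE UNIQUE INDEX UX_PKS_BGCHECK_PEOPLE_DATE
    ON PKS_BGCHECK (PEOPLE_ID, DATE_RUN);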

Getting the Nth most recent business date - very different query performance using two different methods

I have a requirement to get txns on a T-5 basis, meaning I need to "go back" 5 business days.
I've coded up two SQL queries for this and the second method is 5 times slower than the first.
How come?
-- Fast
with BizDays as (
    select top 5 bdate
    from dbo.business_days
    where bdate < '20091211'
    order by bdate desc
),
BizDate as (
    select min(bdate) bdate from BizDays
)
select t.*
from txns t
join BizDate on t.bdate <= BizDate.bdate

-- Slow
with BizDays as (
    select dense_rank() over (order by bdate desc) RN
         , bdate
    from dbo.business_days
    where bdate < '20091211'
),
BizDate as (
    select bdate from BizDays where RN = 5
)
select t.*
from txns t
join BizDate on t.bdate <= BizDate.bdate
DENSE_RANK does not stop after the first 5 records the way TOP 5 does.
Though DENSE_RANK is monotonic, and hence could in theory be optimized to TOP WITH TIES, SQL Server's optimizer is not aware of that and does not perform this optimization.
If your business days are unique, you can replace DENSE_RANK with ROW_NUMBER and get the same performance, since ROW_NUMBER is optimized to a TOP.
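A sketch of that substitution in the slow query, assuming bdate is unique in dbo.business_days:
with BizDays as (
    -- row_number() lets the optimizer stop scanning after 5 rows, like TOP 5
    select row_number() over (order by bdate desc) RN, bdate
    from dbo.business_days
    where bdate < '20091211'
),
BizDate as (
    select bdate from BizDays where RN = 5
)
select t.*
from txns t
join BizDate on t.bdate <= BizDate.bdate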
Instead of putting the conditions in the WHERE and JOIN clauses, could you perhaps use ORDER BY on your data and then LIMIT offset, rowcount?
The reason this runs so slowly is that DENSE_RANK() and ROW_NUMBER() are functions. The engine has to read every record in the table that matches the WHERE clause, apply the function to each row, save the function value, and then get the top 5 from that list.
A "plain" TOP 5 uses the index on the table to get the first 5 records that meet the WHERE clause. In the best case, the engine may only have to read a couple of index pages; in the worst case, it may have to read a few data pages as well. Even without an index, the engine reads the rows but does not have to execute the function or work with temporary tables.