Fastest way to query latest version in large SQL Server table?

What is the fastest way to query for "latest version" on a large SQL table using an update timestamp in SQL Server?
I'm currently using the inner-join approach on a very large SQL Server weather forecast table keyed by Date, City, Hour, Temperature, UpdateTimestamp. To get the latest temperature forecast, I created a view using an inner join on Date, City, and Hour plus max(UpdateTimestamp), as in this other posting.
However, as the dataset in the original table grows, the view query is getting slower and slower over time.
I wonder if others have encountered a similar situation, and what the best way is to speed up this query. (One alternative I'm considering is a stored procedure that runs each day and builds a separate table holding only the "latest version" rows, which would then be very quick to access.)
EDIT 4/4 - The best solution I've found so far (thanks Vikram) was to add a nonclustered covering index to my table on the 3 fields "TSUnix", "CityId", "DTUnix", which sped up performance by ~4x (from 25 seconds to 4 seconds).
I've also tried the ROW_NUMBER solution (sample query below), although it appears a bit slower than the inner-join approach. Both queries and the index creation are below:
Index Creation:
USE [<My DB>]
GO
CREATE NONCLUSTERED INDEX [index_WeatherForecastData]
ON [dbo].[<WeatherForecastData>] ([TSUnix], [CityId], [DTUnix])
INCLUDE ([Temperature], [TemperatureMin], [TemperatureMax], [Humidity], [WindSpeed], [Rain], [Snow])
GO
Query:
-- Inner Join Version
SELECT W.TSUnix, W.CityId, W.DTUnix, W.Temperature, W.*
FROM WeatherForecastData W
INNER JOIN (
    SELECT MAX(TSUnix) AS Latest, CityId, DTUnix
    FROM WeatherForecastData
    GROUP BY CityId, DTUnix
) L
    ON L.Latest = W.TSUnix
    AND L.CityId = W.CityId
    AND L.DTUnix = W.DTUnix
-- Row Number Version
SELECT W.TSUnix, W.CityId, W.DTUnix, W.Temperature, W.*
FROM (
    SELECT *,
        ROW_NUMBER() OVER (PARTITION BY DTUnix, CityId ORDER BY TSUnix DESC) AS RowNumber
    FROM WeatherForecastData
) W
WHERE W.RowNumber = 1
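For comparing the two fairly, SQL Server's built-in statistics output shows elapsed time and logical reads per query; a minimal sketch:
SET STATISTICS TIME ON;
SET STATISTICS IO ON;
-- run each of the two queries above and compare the Messages output
SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;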
Thanks!

Use ROW_NUMBER with an index as shown below.
The specific index that will make this fast is one on Date, City, Hour, and UpdateTimestamp descending. This requires a single pass over the table rather than the multiple passes an INNER JOIN would likely require.
Working code: http://sqlfiddle.com/#!18/8c0b4/1
SELECT Date, City, Hour, Temperature
FROM (
    SELECT Date, City, Hour, Temperature,
        ROW_NUMBER() OVER (PARTITION BY Date, City, Hour
                           ORDER BY UpdateTimestamp DESC) AS RowNumber
    FROM Test
) AS t
WHERE t.RowNumber = 1
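The index described above might look like this (a sketch against the fiddle's Test table; adjust names for the real schema, and INCLUDE whatever extra columns the query selects):
CREATE NONCLUSTERED INDEX IX_Test_LatestVersion
ON Test ([Date], [City], [Hour], [UpdateTimestamp] DESC)
INCLUDE ([Temperature]);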

Related

DISTINCT ON slow for 300000 rows

I have a table named assets. Here is the ddl:
create table assets (
id bigint primary key,
name varchar(255) not null,
value double precision not null,
business_time timestamp with time zone,
insert_time timestamp with time zone default now() not null
);
create index idx_assets_name on assets (name);
I need to extract the newest (based on insert_time) value for each asset name. This is the query that I initially used:
SELECT DISTINCT ON (a.name) *
FROM home.assets a
WHERE a.name IN (
'USD_RLS',
'EUR_RLS',
'SEKKEH_RLS',
'NIM_SEKKEH_RLS',
'ROB_SEKKEH_RLS',
'BAHAR_RLS',
'GOLD_18_RLS',
'GOLD_OUNCE_USD',
'SILVER_OUNCE_USD',
'PLATINUM_OUNCE_USD',
'GOLD_MESGHAL_RLS',
'GOLD_24_RLS',
'STOCK_IR',
'AED_RLS',
'GBP_RLS',
'CAD_RLS',
'CHF_RLS',
'TRY_RLS',
'AUD_RLS',
'JPY_RLS',
'CNY_RLS',
'RUB_RLS',
'BTC_USD'
)
ORDER BY a.name,
a.insert_time DESC;
I have around 300,000 rows in the assets table. On my VPS this query takes about 800 ms, which leads to a total response time of about 1 second for a specific endpoint. This is a bit slow, and considering that the assets table is growing fast, this endpoint will get even slower in the near future. I also tried to avoid IN (...) using this query:
SELECT DISTINCT ON (a.name) *
FROM home.assets a
ORDER BY a.name,
a.insert_time DESC;
But I didn't notice a significant difference. Any idea how I could optimize this query?
You may try adding the following index to your table:
CREATE INDEX idx ON assets (name, insert_time DESC);
If used, Postgres can simply scan this index to find the distinct record having the most recent insert_time for each name.
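One way to verify that the index is actually used is Postgres's execution-plan output; a minimal sketch against the same query:
EXPLAIN (ANALYZE, BUFFERS)
SELECT DISTINCT ON (a.name) *
FROM home.assets a
ORDER BY a.name, a.insert_time DESC;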
With more than a few rows per name in the table (which looks to be the case), I expect this query to be substantially faster still:
SELECT a.*
FROM unnest('{USD_RLS, EUR_RLS, SEKKEH_RLS, NIM_SEKKEH_RLS, ROB_SEKKEH_RLS
, BAHAR_RLS, GOLD_18_RLS, GOLD_OUNCE_USD, SILVER_OUNCE_USD
, PLATINUM_OUNCE_USD, GOLD_MESGHAL_RLS, GOLD_24_RLS, STOCK_IR
, AED_RLS, GBP_RLS, CAD_RLS, CHF_RLS
, TRY_RLS, AUD_RLS, JPY_RLS, CNY_RLS
, RUB_RLS, BTC_USD}'::text[]) AS n(name)
CROSS JOIN LATERAL (
SELECT *
FROM home.assets a
WHERE a.name = n.name
ORDER BY a.insert_time DESC
LIMIT 1
) a;
Pass your list as an array, unnest it, and then get each latest row in a LATERAL subquery. The CROSS JOIN eliminates names that are not found at all. (You might be interested in LEFT JOIN LATERAL ... ON true instead, to keep those in the result.)
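That variant might look like this (a sketch with a shortened name list; names without a match come back with NULL asset columns):
SELECT n.name, a.*
FROM unnest('{USD_RLS, EUR_RLS, BTC_USD}'::text[]) AS n(name)
LEFT JOIN LATERAL (
   SELECT *
   FROM home.assets a
   WHERE a.name = n.name
   ORDER BY a.insert_time DESC
   LIMIT 1
) a ON true;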
You still need the multicolumn index that Tim mentioned.
CREATE INDEX ON assets (name, insert_time DESC);
Default ascending sort order would work, too, in this case. Postgres can scan backwards:
CREATE INDEX ON assets (name, insert_time);
See:
Postgres: getting latest rows for an array of keys
Optimize GROUP BY query to retrieve latest row per user - basically type 2a
What is the difference between a LATERAL JOIN and a subquery in PostgreSQL?
Not the number of rows in the table, but the number of rows per group (per name in your case) decides whether DISTINCT ON is the best choice. See this benchmark comparing relevant query styles:
Select first row in each GROUP BY group?

SQL Query Deduplication / Join Issue

I've been having the worst time trying to write what I feel should be a pretty simple query to deal with duplicate entries.
For context: I've created a data warehouse using BigQuery and am using Stitch to pull data from HubSpot. Everything works as expected, in that I have confirmed I have the right number of records in BigQuery.
The issue comes from how Stitch refreshes data. Instead of updating records based on object id, it appends a new row. According to their documentation, the query below should work, but it doesn't, for the simple reason that multiple versions of a given record can exist with the same _sdc_sequence (which I don't think should happen). There are other _sdc (Stitch system) fields that I could use to help, but they are also not completely reliable, for the same reasons as above.
SELECT DISTINCT o.*
FROM [sample-table:hubspot.companies] o
INNER JOIN (
    SELECT
        MAX(_sdc_sequence) AS seq,
        companyid
    FROM [sample-table:hubspot.companies]
    GROUP BY companyid
) oo
ON o.companyid = oo.companyid
AND o._sdc_sequence = oo.seq
The query above returns fewer results than it should. If I run the following query, I get the right number of results, but I need the other fields besides companyid like name, description, revenue, etc.
SELECT o.companyid
FROM [sample-table:hubspot.companies] o
GROUP BY o.companyid
I was trying something like this, but it doesn't work; I get the following error: Expression 'oo.properties.name.value' is not present in the GROUP BY list.
SELECT o.companyid,
oo.properties.name.value,
oo.properties.hubspot_owner_id.value,
oo.properties.description.value
FROM [sample-table:hubspot.companies] o
LEFT JOIN [sample-table:hubspot.companies] oo
ON o.companyid = oo.companyid
GROUP BY o.companyid
In my mind, the way I'm thinking about this is:
Get the list of unique record ids (companyid)
Do the SQL equivalent of a "vlookup" against the raw, ungrouped company table, sorted by insert time, to get the first record that matches each id (which will be the most recent, since the table is sorted)
I just don't know how to write this...
Try using window functions:
#standardSQL
SELECT c.*
FROM (SELECT c.*,
ROW_NUMBER() OVER (PARTITION BY companyid ORDER BY _sdc_sequence DESC) as seqnum
FROM `sample-table.hubspot.companies` c
) c
WHERE seqnum = 1;
Below is for BigQuery Standard SQL
#standardSQL
SELECT AS VALUE ARRAY_AGG(t ORDER BY _sdc_sequence DESC LIMIT 1)[OFFSET(0)]
FROM `sample-table.hubspot.companies` t
GROUP BY companyid
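If the QUALIFY clause is available in your BigQuery project, the same filter can be written without a subquery; a sketch equivalent to the ROW_NUMBER version above:
#standardSQL
SELECT *
FROM `sample-table.hubspot.companies`
WHERE TRUE  -- QUALIFY requires a WHERE, GROUP BY, or HAVING clause
QUALIFY ROW_NUMBER() OVER (PARTITION BY companyid ORDER BY _sdc_sequence DESC) = 1;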

Find the Max and related fields

Here is my (simplified) problem, very common I guess:
create table sample (client, recordDate, amount)
I want to find out the latest recording, for each client, with recordDate and amount.
I wrote the code below, which works, but I wonder if there is any better pattern or Oracle tweak to improve the efficiency of such a SELECT. (I am not allowed to modify the structure of the database, so indexes etc. are out of reach for me, and out of scope for the question.)
select s.client, s.recordDate, s.Amount
from sample s
inner join (select client, max(recordDate) lastDate
            from sample
            group by client) t
    on s.client = t.client and s.recordDate = t.lastDate
The table has half a million records and the select takes 2-4 secs, which is acceptable but I am curious to see if that can be improved.
Thanks
In most cases windowed aggregate functions perform better (and at the very least they're easier to write):
select client, recordDate, Amount
from
(
    select client, recordDate, Amount,
        rank() over (partition by client order by recordDate desc) as rn
    from sample s
) dt
where rn = 1
Another structure for the query is not exists. This can perform faster under some circumstances:
select client, recordDate, Amount
from sample s
where not exists (select 1
from sample s2
where s2.client = s.client and
s2.recordDate > s.recordDate
);
This would take good advantage of an index on sample(client, recordDate), if one were available.
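For completeness (the question rules out schema changes), that index would be something like:
create index ix_sample_client_date on sample (client, recordDate);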
And, another thing to try is keep:
select client, max(recordDate),
max(Amount) keep (dense_rank first order by recordDate desc)
from sample s
group by client;
This version assumes only one max record date per client (your original query does not make that assumption).
These queries (plus the one by dnoeth) should all have different query plans and you might get lucky on one of them. The best solution, though, is to have the appropriate index.

Set-based alternative to loop in SQL Server

I know that there are several posts about how BAD it is to try to loop in a SQL Server stored procedure, but I haven't quite found what I am trying to do. We are using data connectivity that can be linked directly into Excel.
I have seen some posts where a few people have said they could convert most loops to a standard query. But for the life of me I am having trouble with this one.
I need all custIDs who have orders right before an event of type 38 or 40, but only if there is no other order between the event and the order from the first query.
So there are 3 parts. I first query for all orders (orders table) based on a time frame into a temporary table.
SELECT odate, custId INTO temp1 FROM orders WHERE odate > '5/1/12'
Then I could use the temp table in an inner join against the secondary table to get a customer event (LogEvent table) that may have occurred some time in the past, prior to the current order.
SELECT eventdate, temp1.custID INTO temp2
FROM LogEvent INNER JOIN temp1 ON temp1.custID = LogEvent.custID
WHERE EventType IN (38, 40) AND temp1.odate > eventdate
ORDER BY eventdate DESC
The problem here is that these queries return all rows for each customer from the first query, where I only want the latest event per customer. On the client side I would normally loop to keep just the one most recent event instead of all the old ones, but since the whole query has to run inside Excel, I can't really loop client-side.
The third step could then use the results from the second query to check whether the event occurred between the most recent order and any previous order. I only want the data where the event precedes the order and no other orders are in between.
SELECT ordernum, shopcart.custID
FROM shopcart RIGHT OUTER JOIN temp2 ON shopcart.custID = temp2.custID
WHERE shopcart.odate >= temp2.eventdate AND ordernum IS NULL
Is there a way to simplify this and make it set-based so it runs in SQL Server, instead of some kind of loop performed at the client?
This is a great example of switching to set-based notation.
First, I combined all three of your queries into a single query. In general, a single query lets the query optimizer do what it does best: determine execution paths. It also prevents accidental serialization of the queries on a multithreaded/multiprocessor machine.
The key is row_number() for ordering the events so the most recent has a value of 1. You'll see this in the final WHERE clause.
select ordernum, shopcart.custID
from (select eventdate, temp1.custID,
             row_number() over (partition by temp1.CustID order by EventDate desc) as seqnum
      from LogEvent inner join
           (select odate, custId
            from order
            where odate > '5/1/12'
           ) temp1
           on temp1.custID = LogEvent.custID
      where EventType in (38, 40)
     ) temp2 left outer join
     ShopCart
     on shopcart.custID = temp2.custID
where seqnum = 1 and shopcart.odate >= temp2.eventdate and ordernum is null
I kept your naming conventions, even though I think "from order" should generate a syntax error. Even if it doesn't, it is bad practice to name tables and columns with reserved SQL words.
If you are using a newer version of SQL Server, you can use the ROW_NUMBER function. Here is an example:
;WITH myCTE AS
(
    SELECT eventdate, temp1.custID,
        ROW_NUMBER() OVER (PARTITION BY temp1.custID ORDER BY eventdate DESC) AS CustomerRanking
    FROM LogEvent
    JOIN temp1
        ON temp1.custID = LogEvent.custID
    WHERE EventType IN (38, 40) AND temp1.odate > eventdate
)
SELECT * INTO temp2 FROM myCTE WHERE CustomerRanking = 1;
This gets you the most recent event for each customer without a loop.
Also, you could use RANK; however, that will create duplicate numbers for ties, whereas ROW_NUMBER guarantees no duplicate numbers within a partition.
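A quick way to see the difference, with hypothetical tied dates:
SELECT eventdate,
       RANK()       OVER (ORDER BY eventdate DESC) AS rnk,
       ROW_NUMBER() OVER (ORDER BY eventdate DESC) AS rn
FROM (VALUES ('2012-06-01'), ('2012-06-01'), ('2012-05-15')) v(eventdate);
-- the two tied rows both get rnk = 1, but rn = 1 and rn = 2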

Are inline queries a bad idea?

I have a table containing the runtimes for generators on different sites, and I want to select the most recent entry for each site. Each generator is run once or twice a week.
I have a query that will do this, but I wonder if it's the best option. I can't help thinking that using WHERE x IN (SELECT ...) is lazy and not the best way to formulate the query - any query.
The table is as follows:
CREATE TABLE generator_logs (
id integer NOT NULL,
site_id character varying(4) NOT NULL,
start timestamp without time zone NOT NULL,
"end" timestamp without time zone NOT NULL,
duration integer NOT NULL
);
And the query:
SELECT id, site_id, start, "end", duration
FROM generator_logs
WHERE start IN (SELECT MAX(start) AS start
FROM generator_logs
GROUP BY site_id)
ORDER BY start DESC
There isn't a huge amount of data, so I'm not worried about optimizing this query. However, I do have to do similar things on tables with tens of millions of rows (big tables as far as I'm concerned!), and there optimisation is more important.
So is there a better query for this, and are inline queries generally a bad idea?
Should your query not be correlated? i.e.:
SELECT id, site_id, start, "end", duration
FROM generator_logs g1
WHERE start = (SELECT MAX(g2.start) AS start
FROM generator_logs g2
WHERE g2.site_id = g1.site_id)
ORDER BY start DESC
Otherwise you will potentially pick up non-latest logs whose start value happens to match the latest start for a different site.
Or alternatively:
SELECT id, site_id, start, "end", duration
FROM generator_logs g1
WHERE (site_id, start) IN (SELECT site_id, MAX(g2.start) AS start
FROM generator_logs g2
GROUP BY site_id)
ORDER BY start DESC
I would use joins, as they perform much better than an IN clause:
select gl.id, gl.site_id, gl.start, gl."end", gl.duration
from
generator_logs gl
inner join (
select max(start) as start, site_id
from generator_logs
group by site_id
) gl2
on gl.site_id = gl2.site_id
and gl.start = gl2.start
Also, as Tony pointed out, you were missing the correlation in your original query.
In MySQL it could be problematic because, last I checked, it was unable to optimize subqueries effectively (i.e., by query rewriting).
Many DBMSs have genetic query planners which will do the same thing regardless of how the input query is structured.
MySQL will in some cases create a temp table for this situation, other times not; depending on the circumstances, indexing, and conditions, subqueries can still be rather quick.
Some complain that subqueries are hard to read, but they're perfectly fine if you fork them into local variables:
$maxids = 'SELECT MAX(start) AS start FROM generator_logs GROUP BY site_id';
$q ="
SELECT id, site_id, start, \"end\", duration
FROM generator_logs
WHERE start IN ($maxids)
ORDER BY start DESC
";
This problem - finding not just the MAX, but the rest of the corresponding row - is a common one. Luckily, Postgres provides a nice way to do this with one query, using DISTINCT ON:
SELECT DISTINCT ON (site_id)
id, site_id, start, "end", duration
FROM generator_logs
ORDER BY site_id, start DESC;
DISTINCT ON (site_id) means "return one record per site_id". The ORDER BY clause determines which record that is. Note, however, that this is subtly different from your original query - if you have two records for the same site with the same start, your query would return both, while this returns only one.
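If that distinction matters, adding a tie-breaker column to the sort makes the pick deterministic; a sketch using id as the tie-breaker:
SELECT DISTINCT ON (site_id)
       id, site_id, start, "end", duration
FROM generator_logs
ORDER BY site_id, start DESC, id DESC;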
A way to find records having the MAX value per group is to select those records for which there is no record within the same group having a higher value:
SELECT id, site_id, "start", "end", duration
FROM generator_logs g1
WHERE NOT EXISTS (
SELECT 1
FROM generator_logs g2
WHERE g2.site_id = g1.site_id
AND g2."start" > g1."start"
);
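Like the other variants, this benefits from a composite index covering the correlation and the comparison; a sketch:
CREATE INDEX idx_generator_logs_site_start ON generator_logs (site_id, "start");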