I have a table containing prices for a lot of different "things" in a MS SQL 2005 table. There are hundreds of records per thing per day and the different things gets price updates at different times.
ID uniqueidentifier not null,
ThingID int NOT NULL,
PriceDateTime datetime NOT NULL,
Price decimal(18,4) NOT NULL
I need to get today's latest prices for a group of things. The below query works but I'm getting hundreds of rows back and I have to loop trough them and only extract the latest one per ThingID. How can I (e.g. via a GROUP BY) say that I want the latest one per ThingID? Or will I have to use subqueries?
SELECT *
FROM Thing
WHERE ThingID IN (1,2,3,4,5,6)
AND PriceDate > cast( convert(varchar(20), getdate(), 106) as DateTime)
UPDATE: In an attempt to hide complexity I put the ID column in a an int. In real life it is GUID (and not the sequential kind). I have updated the table def above to use uniqueidentifier.
I think the only solution with your table structure is to work with a subquery:
SELECT *
FROM Thing
WHERE ID IN (SELECT max(ID) FROM Thing
WHERE ThingID IN (1,2,3,4)
GROUP BY ThingID)
(Given the highest ID also means the newest price)
However I suggest you add a "IsCurrent" column that is 0 if it's not the latest price or 1 if it is the latest. This will add the possible risk of inconsistent data, but it will speed up the whole process a lot when the table gets bigger (if it is in an index). Then all you need to do is to...
SELECT *
FROM Thing
WHERE ThingID IN (1,2,3,4)
AND IsCurrent = 1
UPDATE
Okay, Markus updated the question to show that ID is a uniqueid, not an int. That makes writing the query even more complex.
SELECT T.*
FROM Thing T
JOIN (SELECT ThingID, max(PriceDateTime)
WHERE ThingID IN (1,2,3,4)
GROUP BY ThingID) X ON X.ThingID = T.ThingID
AND X.PriceDateTime = T.PriceDateTime
WHERE ThingID IN (1,2,3,4)
I'd really suggest using either a "IsCurrent" column or go with the other suggestion found in the answers and use "current price" table and a separate "price history" table (which would ultimately be the fastest, because it keeps the price table itself small).
(I know that the ThingID at the bottom is redundant. Just try if it is faster with or without that "WHERE". Not sure which version will be faster after the optimizer did its work.)
I would try something like the following subquery and forget about changing your data structures.
SELECT
*
FROM
Thing
WHERE
(ThingID, PriceDateTime) IN
(SELECT
ThingID,
max(PriceDateTime )
FROM
Thing
WHERE
ThingID IN (1,2,3,4)
GROUP BY
ThingID
)
Edit the above is ANSI SQL and i'm now guessing having more than one column in a subquery doesnt work for T SQL. Marius, I can't test the following but try;
SELECT
p.*
FROM
Thing p,
(SELECT ThingID, max(PriceDateTime ) FROM Thing WHERE ThingID IN (1,2,3,4) GROUP BY ThingID) m
WHERE
p.ThingId = m.ThingId
and p.PriceDateTime = m.PriceDateTime
another option might be to change the date to a string and concatenate with the id so you have only one column. This would be slightly nasty though.
If the subquery route was too slow I would look at treating your price updates as an audit log and maintaining a ThingPrice table - perhaps as a trigger on the price updates table:
ThingID int not null,
UpdateID int not null,
PriceDateTime datetime not null,
Price decimal(18,4) not null
The primary key would just be ThingID and "UpdateID" is the "ID" in your original table.
Since you are using SQL Server 2005, you can use the new (CROSS|OUTTER) APPLY clause. The APPLY clause let's you join a table with a table valued function.
To solve the problem, first define a table valued function to retrieve the top n rows from Thing for a specific id, date ordered:
CREATE FUNCTION dbo.fn_GetTopThings(#ThingID AS GUID, #n AS INT)
RETURNS TABLE
AS
RETURN
SELECT TOP(#n) *
FROM Things
WHERE ThingID= #ThingID
ORDER BY PriceDateTime DESC
GO
and then use the function to retrieve the top 1 records in a query:
SELECT *
FROM Thing t
CROSS APPLY dbo.fn_GetTopThings(t.ThingID, 1)
WHERE t.ThingID IN (1,2,3,4,5,6)
The magic here is done by the APPLY clause which applies the function to every row in the left result set then joins with the result set returned by the function then retuns the final result set. (Note: to do a left join like apply, use OUTTER APPLY which returns all rows from the left side, while CROSS APPLY returns only the rows that have a match in the right side)
BlaM:
Because I can't post comments yet( due to low rept points) not even to my own answers ^^, I'll answer in the body of the message:
-the APPLY clause even, if it uses table valued functions it is optimized internally by SQL Server in such a way that it doesn't call the function for every row in the left result set, but instead takes the inner sql from the function and converts it into a join clause with the rest of the query, so the performance is equivalent or even better (if the plan is chosen right by sql server and further optimizations can be done) than the performance of a query using subqueries), and in my personal experience APPLY has no performance issues when the database is properly indexed and statistics are up to date (just like a normal query with subqueries behaves in such conditions)
It depends on the nature of how your data will be used, but if the old price data will not be used nearly as regularly as the current price data, there may be an argument here for a price history table. This way, non-current data may be archived off to the price history table (probably by triggers) as the new prices come in.
As I say, depending on your access model, this could be an option.
I'm converting the uniqueidentifier to a binary so that I can get a MAX of it.
This should make sure that you won't get duplicates from multiple records with identical ThingIDs and PriceDateTimes:
SELECT * FROM Thing WHERE CONVERT(BINARY(16),Thing.ID) IN
(
SELECT MAX(CONVERT(BINARY(16),Thing.ID))
FROM Thing
INNER JOIN
(SELECT ThingID, MAX(PriceDateTime) LatestPriceDateTime FROM Thing
WHERE PriceDateTime >= CAST(FLOOR(CAST(GETDATE() AS FLOAT)) AS DATETIME)
GROUP BY ThingID) LatestPrices
ON Thing.ThingID = LatestPrices.ThingID
AND Thing.PriceDateTime = LatestPrices.LatestPriceDateTime
GROUP BY Thing.ThingID, Thing.PriceDateTime
) AND Thing.ThingID IN (1,2,3,4,5,6)
Since ID is not sequential, I assume you have a unique index on ThingID and PriceDateTime so only one price can be the most recent for a given item.
This query will get all of the items in the list IF they were priced today. If you remove the where clause for PriceDate you will get the latest price regardless of date.
SELECT *
FROM Thing thi
WHERE thi.ThingID IN (1,2,3,4,5,6)
AND thi.PriceDateTime =
(SELECT MAX(maxThi.PriceDateTime)
FROM Thing maxThi
WHERE maxThi.PriceDateTime >= CAST( CONVERT(varchar(20), GETDATE(), 106) AS DateTime)
AND maxThi.ThingID = thi.ThingID)
Note that I changed ">" to ">=" since you could have a price right at the start of a day
It must work without using a global PK column (for complex primary keys for example):
SELECT t1.*, t2.PriceDateTime AS bigger FROM Prices t1
LEFT JOIN Prices t2 ON t1.ThingID = t2.ThingID AND t1.PriceDateTime < t2.PriceDateTime
HAVING t2.PriceDateTime IS NULL
Try this (provided you only need the latest price, not the identifier or datetime of that price)
SELECT ThingID, (SELECT TOP 1 Price FROM Thing WHERE ThingID = T.ThingID ORDER BY PriceDateTime DESC) Price
FROM Thing T
WHERE ThingID IN (1,2,3,4) AND DATEDIFF(D, PriceDateTime, GETDATE()) = 0
GROUP BY ThingID
maybe i missunderstood the taks but what about a:
SELECT ID, ThingID, max(PriceDateTime), Price
FROM Thing GROUP BY ThingID
Related
I want to apply pagination on a table with huge data. All I want to know a better option than using OFFSET in SQL Server.
Here is my simple query:
SELECT *
FROM TableName
ORDER BY Id DESC
OFFSET 30000000 ROWS
FETCH NEXT 20 ROWS ONLY
You can use Keyset Pagination for this. It's far more efficient than using Rowset Pagination (paging by row number).
In Rowset Pagination, all previous rows must be read, before being able to read the next page. Whereas in Keyset Pagination, the server can jump immediately to the correct place in the index, so no extra rows are read that do not need to be.
For this to perform well, you need to have a unique index on that key, which includes any other columns you need to query.
In this type of pagination, you cannot jump to a specific page number. You jump to a specific key and read from there. So you need to save the unique ID of page you are on and skip to the next. Alternatively, you could calculate or estimate a starting point for each page up-front.
One big benefit, apart from the obvious efficiency gain, is avoiding the "missing row" problem when paginating, caused by rows being removed from previously read pages. This does not happen when paginating by key, because the key does not change.
Here is an example:
Let us assume you have a table called TableName with an index on Id, and you want to start at the latest Id value and work backwards.
You begin with:
SELECT TOP (#numRows)
*
FROM TableName
ORDER BY Id DESC;
Note the use of ORDER BY to ensure the order is correct
In some RDBMSs you need LIMIT instead of TOP
The client will hold the last received Id value (the lowest in this case). On the next request, you jump to that key and carry on:
SELECT TOP (#numRows)
*
FROM TableName
WHERE Id < #lastId
ORDER BY Id DESC;
Note the use of < not <=
In case you were wondering, in a typical B-Tree+ index, the row with the indicated ID is not read, it's the row after it that's read.
The key chosen must be unique, so if you are paging by a non-unique column then you must add a second column to both ORDER BY and WHERE. You would need an index on OtherColumn, Id for example, to support this type of query. Don't forget INCLUDE columns on the index.
SQL Server does not support row/tuple comparators, so you cannot do (OtherColumn, Id) < (#lastOther, #lastId) (this is however supported in PostgreSQL, MySQL, MariaDB and SQLite).
Instead you need the following:
SELECT TOP (#numRows)
*
FROM TableName
WHERE (
(OtherColumn = #lastOther AND Id < #lastId)
OR OtherColumn < #lastOther
)
ORDER BY
OtherColumn DESC,
Id DESC;
This is more efficient than it looks, as SQL Server can convert this into a proper < over both values.
The presence of NULLs complicates things further. You may want to query those rows separately.
On very big merchant website we use a technic compound of ids stored in a pseudo temporary table and join with this table to the rows of the product table.
Let me talk with a clear example.
We have a table design this way :
CREATE TABLE S_TEMP.T_PAGINATION_PGN
(PGN_ID BIGINT IDENTITY(-9 223 372 036 854 775 808, 1) PRIMARY KEY,
PGN_SESSION_GUID UNIQUEIDENTIFIER NOT NULL,
PGN_SESSION_DATE DATETIME2(0) NOT NULL,
PGN_PRODUCT_ID INT NOT NULL,
PGN_SESSION_ORDER INT NOT NULL);
CREATE INDEX X_PGN_SESSION_GUID_ORDER
ON S_TEMP.T_PAGINATION_PGN (PGN_SESSION_GUID, PGN_SESSION_ORDER)
INCLUDE (PGN_SESSION_ORDER);
CREATE INDEX X_PGN_SESSION_DATE
ON S_TEMP.T_PAGINATION_PGN (PGN_SESSION_DATE);
We have a very big product table call T_PRODUIT_PRD and a customer filtered it with many predicates. We INSERT rows from the filtered SELECT into this table this way :
DECLARE #SESSION_ID UNIQUEIDENTIFIER = NEWID();
INSERT INTO S_TEMP.T_PAGINATION_PGN
SELECT #SESSION_ID , SYSUTCDATETIME(), PRD_ID,
ROW_NUMBER() OVER(ORDER BY --> custom order by
FROM dbo.T_PRODUIT_PRD
WHERE ... --> custom filter
Then everytime we need a desired page, compound of #N products we add a join to this table as :
...
JOIN S_TEMP.T_PAGINATION_PGN
ON PGN_SESSION_GUID = #SESSION_ID
AND 1 + (PGN_SESSION_ORDER / #N) = #DESIRED_PAGE_NUMBER
AND PGN_PRODUCT_ID = dbo.T_PRODUIT_PRD.PRD_ID
All the indexes will do the job !
Of course, regularly we have to purge this table and this is why we have a scheduled job which deletes the rows whose sessions were generated more than 4 hours ago :
DELETE FROM S_TEMP.T_PAGINATION_PGN
WHERE PGN_SESSION_DATE < DATEADD(hour, -4, SYSUTCDATETIME());
In the same spirit as SQLPro solution, I propose:
WITH CTE AS
(SELECT 30000000 AS N
UNION ALL SELECT N-1 FROM CTE
WHERE N > 30000000 +1 - 20)
SELECT T.* FROM CTE JOIN TableName T ON CTE.N=T.ID
ORDER BY CTE.N DESC
Tried with 2 billion lines and it's instant !
Easy to make it a stored procedure...
Of course, valid if ids follow each other.
When I join to the right table I am getting way too many duplicates. I am trying to grab the most recent record from the right table however, it does not matter what I try it does not work.
So Far I have tried:
PROC SQL;
CREATE TABLE fs1.sample AS
SELECT A.*,
B.xx1,
max(B.time_s)
FROM lx1.results a left join (Select Distinct C.id, c.per FROM lx2.results c
Where c.id = a.id
and COMPGED(a.txt1, c.txt1,'i') < 100
and c.dt > a.dt
and c.ksv = 37
and datepart(c.lsg) >= '12DEC2020'd ) b
ON a.id = b.id
group by a.id, a.txt1
QUIT;
Unfortunately, I get an error. I also tried using case when exists, but that takes way too long. Essentially I am trying to grab the most recent record from the right table based on time_s. I also want to make sure the record I grab from the right table somewhat matches a.txt1.
Cheers
When you perform a join, you attach all records from the table that match your join conditions.
If the table is indexed appropriately, a subquery could achieve the goal of obtaining the most recent value, however, if the query uses the wrong index, TOP or equivalent functions may return the wrong result.
There are a number of ways to accomplish the task of retrieving the most recent record but they are contingent on a couple of things.
Firstly, you need to be able to identify what the most recent row is, usually by a column called CreatedDate or something similar against the IDs. (You should know what that business logic is, it may be that the table is chronologically entered [as most tables are] and therefore, SubID might be a thing. We're going to assume it is CreatedDate.)
Secondly, you need to rank the rows in terms of the CreatedDate in a descending order so that the newest matching ID is ranked 1.
Finally, you filter your results by 1 to return the newest result, but you could also filter by <= x if you are interested in the top x newest return results per ID.
To use more mathematical language: We are deriving a value from the CreatedDate and ID values and then using that derivative value to sort and filter the data. In this case we are deriving the RowNumber from the CreatedDate in descending order for each ID.
In order to accomplish this, you can use the Windowed Function ROW_NUMBER(),
ROW_NUMBER() OVER (PARTITION BY id ORDER BY CreatedDate DESC) as RankID
This windowed function will return a row value for each ID relative to the CreatedDate in descending order, where the newest created date is equal to 1.
You can then put brackets around the whole query to make it into a table so you will be able to filter the results of that Windowed Function.
SELECT id, txt
(SELECT id, txt
,ROW_NUMBER() OVER (PARTITION BY id ORDER BY CreatedDate DESC) as RankID
FROM SourceTable) A
WHERE RankID = 1
This should achieve your goal of returning the "newest result".
What ever your column is that determines the age of the data relative to the ID, it can be multiple, should be placed within the ORDER BY.
In order to make this query perform faster, you should index your data appropriately, whereby ID is the the first column, and CreatedDate Desc is your next column. This means your system will not have to perform a costly sort every time this runs, but that depends on whether you plan on using this query often and how much overhead it is grabbing.
I have a table with an ID and a date column. It's possible (likely) that when a new record is created, it gets the next larger ID and the current datetime. So if I were to sort by date or I were to sort by ID, the resulting data set would be in the same order.
How do I write a SQL query to verify this?
It's also possible that an older record is modified and the date is updated. In that case, the records would not be in the same sort order. I don't think this happens.
I'm trying to move the data to another location, and if I know that there are no modified records, that makes it a lot simpler.
I'm pretty sure I only need to query those two columns: ID, RecordDate. Other links indicate I should be able to use LAG, but I'm getting an error that it isn't a built-in function name.
In other words, both https://dba.stackexchange.com/questions/42985/running-total-to-the-previous-row and Is there a way to access the "previous row" value in a SELECT statement? should help, but I'm still not able to make that work for what I want.
If you cannot use window functions, you can use a correlated subquery and EXISTS.
SELECT *
FROM elbat t1
WHERE EXISTS (SELECT *
FROM elbat t2
WHERE t2.id < t1.id
AND t2.recorddate > t1.recorddate);
It'll select all records where another record with a lower ID and a greater timestamp exists. If the result is empty you know that no such record exists and the data is like you want it to be.
Maybe you want to restrict it a bit more by using t2.recorddate >= t1.recorddate instead of t2.recorddate > t1.recorddate. I'm not sure how you want it.
Use this:
SELECT ID, RecordDate FROM tablename t
WHERE
(SELECT COUNT(*) FROM tablename WHERE tablename.ID < t.ID)
<>
(SELECT COUNT(*) FROM tablename WHERE tablename.RecordDate < t.RecordDate);
It counts for each row, how many rows have id less than the row's id and
how many rows have RecordDate less than the row's RecordDate.
If these counters are not equal then it outputs this row.
The result is all the rows that would not be in the same position after sorting by ID and RecordDate
One method uses window functions:
select count(*)
from (select t.*,
row_number() over (order by id) as seqnum_id,
row_number() over (order by date, id) as seqnum_date
from t
) t
where seqnum_id <> seqnum_date;
When the count is zero, then the two columns have the same ordering. Note that the second order by includes id. Two rows could have the same date. This makes the sort stable, so the comparison is valid even when date has duplicates.
the above solutions are all good but if both dates and ids are in increment then this should also work
select modifiedid=t2.id from
yourtable t1 join yourtable t2
on t1.id=t2.id+1 and t1.recordDate<t2.recordDate
Suppose that I have a table with three columns:
EventID (PK)
TagName
TagValue
I need to create a query that results in something like:
EventID
TagName
TagValue
PreviousConditionTag
PreviousConditionValue
Where PreviousConditionTag/Value is from TagName and TagValue of a previous row (when ordered by EventID).
In a simpler version of this problem, PreviousConditionTag was always the same as TagName - that is, I only needed to retrieve the previous value for the current TagName. I solved this using Oracle's LAG analytic function, partitioning by the TagName.
However, I now need to perform something similar but for cases where PreviousConditionTag is an arbitrary tag related to TagName by another table where the relation between TagName and PreviousConditionTag is not one-to-one.
For example, if a given row has a TagName of "ABC", the relation table may say that I need to look-up the previous value of either "IJK" or "XYZ".
I was able to come up with this logic in an Oracle function that does a SELECT against the same table and looks for MAX(EventID) that matches the criteria. For example:
SELECT * FROM MyTable WHERE EventID = (
SELECT MAX(EventID) FROM MyTable WHERE TagName IN (
SELECT ConditionTagName FROM ConditionMappingTable WHERE TagName = [CurrentTagName]
)
) AND EventID <= [CurrentEventId]
However, as you can imagine, since this query is executed in a function for every row of MyTable, I am concerned about its performance.
I was trying to think of a way to use Oracle's LAG analytic again, but I wasn't sure how to come up with the PARTITION clause for it, since the partitions appear to overlap. (e.g. Tag ABC needs to look at IJK and XYZ, and Tag DEF needs to look at IJK and UVW)
Any ideas?
This is a rewritten form of the answer, now that I understand it better.
You want to look up overlapping sets of tags and still get the previous event id. The idea is the following:
Add to the mapping table the identity for all Current Tags (so Current Tag = Condition Tag)
Join in the mapping table based on the condition tag, to get the current tags that match. So, the rows are getting relabeled with the "current" tag they match, and you can use this for the lag.
Get the most recent EventId based on the lag logic, partitioning by the Current Tag.
Select the results where the Current and Condition tags are the same.
select t.*
from (select t.*, mt.CurrentTagName, mt.ConditionTagName,
lag(EventId, 1, NULL)
over (partition by mt.CurrentTagName
order by EventId)
from t join
(select CurrentTagName, ConditionTagName
from ((select CurrentTagName, ConditionTagName
from ConditionMappingTable mt
) union all
(select distinct CurrentTagName, CurrentTagName
from ConditionMappingTable mt
)
) mt
)
on mt.ConditionTagName = t.tagname
) t
on CurrentTagName = ConditionTagName
This may seem counterintuive, because you are looking things up backwards, by the condition rather than the current. And, you are multiplying the number of rows being processed. However, it may still be faster than the join solution you were using.
Simply put, I have a table with, among other things, a column for timestamps. I want to get the row with the most recent (i.e. greatest value) timestamp. Currently I'm doing this:
SELECT * FROM table ORDER BY timestamp DESC LIMIT 1
But I'd much rather do something like this:
SELECT * FROM table WHERE timestamp=max(timestamp)
However, SQLite rejects this query:
SQL error: misuse of aggregate function max()
The documentation confirms this behavior (bottom of page):
Aggregate functions may only be used in a SELECT statement.
My question is: is it possible to write a query to get the row with the greatest timestamp without ordering the select and limiting the number of returned rows to 1? This seems like it should be possible, but I guess my SQL-fu isn't up to snuff.
SELECT * from foo where timestamp = (select max(timestamp) from foo)
or, if SQLite insists on treating subselects as sets,
SELECT * from foo where timestamp in (select max(timestamp) from foo)
There are many ways to skin a cat.
If you have an Identity Column that has an auto-increment functionality, a faster query would result if you return the last record by ID, due to the indexing of the column, unless of course you wish to put an index on the timestamp column.
SELECT * FROM TABLE ORDER BY ID DESC LIMIT 1
I think I've answered this question 5 times in the past week now, but I'm too tired to find a link to one of those right now, so here it is again...
SELECT
*
FROM
table T1
LEFT OUTER JOIN table T2 ON
T2.timestamp > T1.timestamp
WHERE
T2.timestamp IS NULL
You're basically looking for the row where no other row matches that is later than it.
NOTE: As pointed out in the comments, this method will not perform as well in this kind of situation. It will usually work better (for SQL Server at least) in situations where you want the last row for each customer (as an example).
you can simply do
SELECT *, max(timestamp) FROM table
Edit:
As aggregate function can't be used like this so it gives error. I guess what SquareCog had suggested was the best thing to do
SELECT * FROM table WHERE timestamp = (select max(timestamp) from table)