Is there SQL Logic to reduce type 2 table along a dimension - sql

I have a slowly changing type 2 price change table which I need to reduce the size of to improve performance. Often rows are written to the table even if no price change occurred (when some other dimensional field changed) and the result is that for any product the table could be 3-10x the size it needs to be if it were including only changes in price.
I'd like to compress the table so that it contains only the first effective date and last expiration date for each price until that price changes. The solution must also:
Deal with an unknown number of rows of the same price
Deal with products going back to an old price
As an example, if I have this raw data:
Product  Price Effective Date  Price Expiration Date  Price
123456   6/22/18               9/19/18                120
123456   9/20/18               11/8/18                120
123456   11/9/18               11/29/18               120
123456   11/30/18              12/6/18                120
123456   12/7/18               12/19/18               85
123456   12/20/18              1/1/19                 85
123456   1/2/19                2/19/19                85
123456   2/20/19               2/20/19                120
123456   2/21/19               3/19/19                85
123456   3/20/19               5/22/19                85
123456   5/23/19               10/10/19               85
123456   10/11/19              6/19/19                80
123456   6/20/20               12/31/99               80
I need to transform it into this:
Product  Price Effective Date  Price Expiration Date  Price
123456   6/22/18               12/6/18                120
123456   12/7/18               2/19/19                85
123456   2/20/19               2/20/19                120
123456   2/21/19               10/10/19               85
123456   10/11/19              12/31/99               80

You can first find the intervals where the price does not change, and then group on those intervals:
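-- c counts the earlier rows of the same product that have a different price, so (price, c) is constant within each run of an unchanged price
with to_r as (select row_number() over (partition by t.product order by t.effective) r, t.* from data_table t),
to_group as (select t.*, (select sum(t1.r < t.r and t1.price != t.price) from to_r t1 where t1.product = t.product) c from to_r t)
select t.product, min(t.effective), max(t.expiration), t.price from to_group t group by t.product, t.price, t.c order by t.product, min(t.effective);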
Output:
Product  Price Effective Date  Price Expiration Date  Price
123456   6/22/18               12/6/18                120
123456   12/7/18               2/19/19                85
123456   2/20/19               2/20/19                120
123456   2/21/19               10/10/19               85
123456   10/11/19              12/31/99               80

This is a type of gaps-and-islands problem. I would recommend reconstructing the data, saving it in a temporary table, and then reloading the existing table.
The code to reconstruct the data is:
select product, price,
       min(effective_date) as effective_date,
       max(expiration_date) as expiration_date
from (select t.*,
             -- a row starts a new group unless it directly follows the previous row with the same price
             sum(case when prev_expiration_date = effective_date - interval '1 day' then 0 else 1 end) over (partition by product order by effective_date) as grp
      from (select t.*,
                   lag(expiration_date) over (partition by product, price order by effective_date) as prev_expiration_date
            from t
           ) t
     ) t
group by product, price, grp;
Note that the logic for date arithmetic varies depending on the database.
Save this result into a temporary table, temp_t or whatever, using select into, create table as, or whatever your database supports.
Then empty the current table and reload it:
truncate table t;
insert into t
select product, price, effective_date, expiration_date
from temp_t;
Notes:
Validate the data before using truncate table!
If there are triggers or columns with default values, you might want to be careful.
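As a rough check of the gaps-and-islands query above, here is a minimal sketch that runs it in SQLite through Python's sqlite3 module. It assumes SQLite 3.25 or later (for window functions), dates stored as ISO strings, and SQLite's date(x, '-1 day') in place of the interval arithmetic. The sample rows are a subset of the question's data.

```python
import sqlite3

# Subset of the question's rows; dates rewritten as ISO strings (an assumption)
# so that SQLite's date() function can do the "minus one day" arithmetic.
ROWS = [
    (123456, "2018-06-22", "2018-09-19", 120),
    (123456, "2018-09-20", "2018-11-08", 120),
    (123456, "2018-11-09", "2018-11-29", 120),
    (123456, "2018-11-30", "2018-12-06", 120),
    (123456, "2018-12-07", "2018-12-19", 85),
    (123456, "2018-12-20", "2019-01-01", 85),
    (123456, "2019-01-02", "2019-02-19", 85),
    (123456, "2019-02-20", "2019-02-20", 120),
    (123456, "2019-02-21", "2019-03-19", 85),
]

COMPRESS_SQL = """
SELECT product, price,
       MIN(effective_date)  AS effective_date,
       MAX(expiration_date) AS expiration_date
FROM (SELECT lagged.*,
             -- running count of "new run" flags gives each run its own number
             SUM(CASE WHEN prev_expiration_date = date(effective_date, '-1 day')
                      THEN 0 ELSE 1 END)
                 OVER (PARTITION BY product ORDER BY effective_date) AS grp
      FROM (SELECT src.*,
                   LAG(expiration_date)
                       OVER (PARTITION BY product, price
                             ORDER BY effective_date) AS prev_expiration_date
            FROM t AS src) AS lagged) AS flagged
GROUP BY product, price, grp
ORDER BY product, MIN(effective_date)
"""


def compress(rows):
    """Load rows into an in-memory table t and return the compressed intervals."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE t (product INTEGER, effective_date TEXT, "
        "expiration_date TEXT, price INTEGER)"
    )
    conn.executemany("INSERT INTO t VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(COMPRESS_SQL).fetchall()
    conn.close()
    return result


compressed = compress(ROWS)
for row in compressed:
    print(row)
```

The nine input rows collapse to four, and the one-day return to 120 stays a separate interval because its previous 120 row does not end the day before it starts.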

It sounds like you are asking for a temporal schema? Where for a given date you can know the price of an asset?
This is done with two tables, price_current and price_history.
price_current:
price_id  item_id  price  rec_created
1         1        100    '2015-04-18'
price_history:
price_id  item_id  from          to            price
1         1        '2001-01-01'  '2004-05-01'  114
1         1        '2004-05-01'  '2015-04-18'  102
I.e. for any item, you can ascertain the date its price was set without polluting your "current" table. For this to work effectively you will need UPDATE triggers on price_current. When you update a record, you insert its details and the period it was valid for into the history table.
CREATE OR REPLACE TRIGGER trg_price_current_update
AS
BEGIN
INSERT INTO price_history(price_id, item_id, from, to, price)
SELECT price_id, item_id, rec_created, GETDATE(), price
FROM rows_updated
END
Now you have a distinction between current and historical, without your current table (presumably the busier table) getting out of hand because of maintaining historical state. Hope i understood the question.
To ignore 'dummy' updates, just alter the trigger to ignore empty changes (if that's not handled by the DBMS anyway). Tbh, this should and could be done application side easily enough, but to manage it via the trigger:
CREATE OR REPLACE TRIGGER trg_price_current_update
AS
BEGIN
INSERT INTO price_history(price_id, item_id, from, to, price)
SELECT p.price_id, p.item_id, p.rec_created, GETDATE(), p.price
FROM rows_updated u
INNER JOIN price_current p ON u.price_id = p.price_id
WHERE u.price <> p.price
END
i.e. rows_updated contains the record from the update, we insert into the history table the previous row, providing the previous row's price is different from the current row's price.
(edited to include new trigger. I also changed the date held in rec_created, this must be the date the row is created, not the first instance that product had a price assigned to it. that was a mistake. Regarding the dates, I am lazy to put the full DD-MM-YYYY hh:mm:ss:zzz, but that would generally be useful in between queries)
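The trigger above is pseudocode, so here is a minimal runnable sketch of the same idea in SQLite through Python's sqlite3 module. It is an illustration under assumptions, not the answer's exact trigger: valid_from and valid_to are hypothetical column names standing in for the reserved words from and to, and the new rec_created value is supplied by the UPDATE statement rather than by GETDATE().

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE price_current (
    price_id    INTEGER PRIMARY KEY,
    item_id     INTEGER NOT NULL,
    price       INTEGER NOT NULL,
    rec_created TEXT    NOT NULL
);
-- valid_from / valid_to are hypothetical names replacing "from" / "to".
CREATE TABLE price_history (
    price_id   INTEGER,
    item_id    INTEGER,
    valid_from TEXT,
    valid_to   TEXT,
    price      INTEGER
);
-- Archive the old row only when the price actually changed.
CREATE TRIGGER trg_price_current_update
AFTER UPDATE ON price_current
WHEN OLD.price <> NEW.price
BEGIN
    INSERT INTO price_history (price_id, item_id, valid_from, valid_to, price)
    VALUES (OLD.price_id, OLD.item_id, OLD.rec_created, NEW.rec_created, OLD.price);
END;
INSERT INTO price_current VALUES (1, 1, 100, '2015-04-18');
""")

# A "dummy" update that leaves the price unchanged: nothing is archived.
conn.execute("UPDATE price_current SET price = 100 WHERE price_id = 1")

# A real price change: the previous row is archived with its validity period.
conn.execute(
    "UPDATE price_current SET price = 95, rec_created = '2016-01-01' "
    "WHERE price_id = 1"
)

history = conn.execute("SELECT * FROM price_history").fetchall()
current = conn.execute("SELECT * FROM price_current").fetchall()
print(history)
print(current)
```

Only the second update produces a history row, which is the behaviour the question needs to keep no-change writes out of the table.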

What you are asking for is a versioning system. Many RDBMS platforms implement support for this out of the box (it's a SQL standard), which may be suitable, depending on your requirements.
You have not tagged a specific platform so it's not possible to be specific to your situation. I use the concept of system versioning regularly in MS Sql Server, where you would implement it thus:
assuming schema "history" exists,
alter table dbo.MyTable add
ValidFrom datetime2 generated always as row start hidden constraint DF_MyTableSysStart default sysutcdatetime(),
ValidTo datetime2 generated always as row end hidden constraint DF_MyTableSysEnd default convert(datetime2, '9999-12-31 23:59:59.9999999'),
period for system_time (ValidFrom, ValidTo);
alter table dbo.MyTable set (system_versioning = on (history_table = History.MyTable));
create clustered index ix_MyTable on History.MyTable (ValidTo, ValidFrom) with (data_compression = page,drop_existing=on) on History;
A number of syntax extensions exist to aid querying the temporal data for example to find historical data at a point in time.
Alternatively, to use a single table but still avoid the duplication, you could create an INSTEAD OF trigger.
The idea here is that the trigger intercepts the data before it is inserted, so you can check whether the value differs from the last value and discard or insert the row as appropriate.
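The INSTEAD OF trigger idea can be sketched as follows. This is a simplified illustration in SQLite (through Python's sqlite3 module), not SQL Server code: SQLite only allows INSTEAD OF triggers on views, so writers insert into a hypothetical view price_feed, and the prices table here keeps just product, effective date and price.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE prices (product_id INTEGER, effective TEXT, price INTEGER);

-- Hypothetical view that writers insert into instead of the base table.
CREATE VIEW price_feed AS
SELECT product_id, effective, price FROM prices;

-- Forward the row to prices only if its price differs from the latest
-- stored price for that product (or if the product has no rows yet).
CREATE TRIGGER trg_price_feed_insert
INSTEAD OF INSERT ON price_feed
BEGIN
    INSERT INTO prices (product_id, effective, price)
    SELECT NEW.product_id, NEW.effective, NEW.price
    WHERE NEW.price IS NOT (SELECT p.price
                            FROM prices p
                            WHERE p.product_id = NEW.product_id
                            ORDER BY p.effective DESC
                            LIMIT 1);
END;
""")

feed = [
    (123456, "2018-06-22", 120),
    (123456, "2018-09-20", 120),  # same price: discarded
    (123456, "2018-12-07", 85),
    (123456, "2018-12-20", 85),   # same price: discarded
    (123456, "2019-02-20", 120),  # back to an old price: kept
]
conn.executemany("INSERT INTO price_feed VALUES (?, ?, ?)", feed)

stored = conn.execute(
    "SELECT product_id, effective, price FROM prices ORDER BY effective"
).fetchall()
print(stored)
```

The sketch assumes rows arrive in date order. A real type 2 table would also need to extend the expiration date of the surviving row when a duplicate is discarded.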

something along the lines of:
WITH keeps AS
(
SELECT p.product_id, p.effective, p.expires, p.price,
CASE WHEN EXISTS(SELECT 1 FROM prices p1 WHERE p1.product_id = p.product_id AND p1.effective = DATEADD(DAY, 1, p.expires) AND p1.price <> p.price) THEN 1 ELSE 0 END AS has_after,
CASE WHEN EXISTS(SELECT 1 FROM prices p1 WHERE p1.product_id = p.product_id AND p1.expires = DATEADD(DAY, -1, p.effective) AND p1.price <> p.price) THEN 1 ELSE 0 END AS has_before
FROM prices p
)
SELECT product_id, effective, expires, price FROM keeps
WHERE has_after = 1
OR has_before = 1
UNION
SELECT p.product_id, p.effective, p.expires, p.price
FROM prices p
WHERE p.effective = (SELECT MIN(effective) FROM prices p1 WHERE p1.product_id = p.product_id)
What's it doing:
For each entry, has_after is 1 when another entry with a different price becomes effective the day after this entry expires, and has_before is 1 when another entry with a different price expired the day before this entry became effective. Rows with either flag set sit on the boundary of an actual price change. This misses the first price entry, which has no predecessor, so we simply add it to the results.
e.g.:
product_id  effective  expires   price  has_before  has_after
123456      6/22/18    9/19/18   120    0           0
123456      9/20/18    11/8/18   120    0           0
123456      11/9/18    11/29/18  120    0           0
123456      11/30/18   12/6/18   120    0           1
123456      12/7/18    12/19/18  85     1           0
123456      12/20/18   1/1/19    85     0           0
123456      1/2/19     2/19/19   85     0           1
123456      2/20/19    2/20/19   120    1           1
123456      2/21/19    3/19/19   85     1           0

Related

SQL: Trying to flag the earliest shipment date within a the same customer ID and order number across all rows of a database

Think of an order fulfillment database where each customer ID can have the same Order Number for a product shipment and its refills. I am trying to flag the refills by adding 'Y' to a new column for refills. The first shipment is identified by the earliest ship date in the database for the same customer ID and order number. The shipments after the first shipment date with the same customer ID and order number would be the refills.
Customer # and Order # are varchars. Date is a date type.
Table I currently have. I want to be able to fill a new column called "Refill" with Y or N:
Customer # Order # ShipDate Refill <---New Column I want to create
1234 2124 5/25/2015 Y
1234 2124 3/25/2015 N
1234 2124 4/25/2015 Y
5678 4439 12/25/2014 Y
5678 4439 2/20/2015 Y
5678 4439 9/10/2014 N
6666 5920 1/12/2012 Y
6666 5920 5/12/2011 N
6666 6053 6/12/2016 Y
6666 6053 4/12/2016 N
6666 6053 8/12/2016 Y
It appears that the logic for the update is that the initial shipping record for a given customer and order is 'N' and all subsequent records are 'Y'.
In the update query below I join your original table to a subquery which finds the initial shipping record for each customer/order group. A record in your original table that matches must be an 'N', while a record that does not match must be a 'Y'.
UPDATE t1
SET Refill = CASE WHEN t2.Customer IS NULL THEN 'Y' ELSE 'N' END
FROM yourTable t1
LEFT JOIN
(
SELECT Customer, [Order], MIN(ShipDate) AS ShipDate -- this query finds
FROM yourTable -- the original
GROUP BY Customer, [Order] -- ship date
) t2
ON t1.Customer = t2.Customer AND
t1.[Order] = t2.[Order] AND
t1.ShipDate = t2.ShipDate
WHERE t1.[Order] IS NOT NULL OR t1.ShipDate IS NOT NULL
This answer also assumes that you already have a varchar column called Refill defined. If you don't, then go ahead and create one.
You would want to add the new field
ALTER TABLE MyTable Add Refill char(1) Not Null Default 'N';
Update rows in the table to set Refill if they are not the oldest ShipDate for that Customer and Order combination.
With cte As (
SELECT Customer, [Order], ShipDate, Refill, RowNo = Row_Number() Over (Partition By Customer, [Order] Order By ShipDate Asc)
From MyTable)
Update cte
Set Refill = 'Y'
Where RowNo <> 1;
Note that this has the advantage of handling cases where there were two shipments on the first date. We can't distinguish between them to say which one was first, but we will only mark one of them as 'N'.
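For readers who want to try the ROW_NUMBER approach without a SQL Server instance, here is a minimal sketch in SQLite through Python's sqlite3 module. It computes the flag in a SELECT rather than updating through a CTE, and it assumes ISO date strings and hypothetical names (shipments, customer, order_no, ship_date).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Table and column names are hypothetical; dates are ISO strings so that
# text ordering matches chronological ordering.
conn.execute("CREATE TABLE shipments (customer TEXT, order_no TEXT, ship_date TEXT)")
conn.executemany(
    "INSERT INTO shipments VALUES (?, ?, ?)",
    [
        ("1234", "2124", "2015-05-25"),
        ("1234", "2124", "2015-03-25"),
        ("1234", "2124", "2015-04-25"),
        ("6666", "5920", "2012-01-12"),
        ("6666", "5920", "2011-05-12"),
    ],
)

# The earliest shipment per (customer, order) gets 'N'; every later one is a refill.
flagged = conn.execute("""
    SELECT customer, order_no, ship_date,
           CASE WHEN ROW_NUMBER() OVER (PARTITION BY customer, order_no
                                        ORDER BY ship_date) = 1
                THEN 'N' ELSE 'Y' END AS refill
    FROM shipments
    ORDER BY customer, order_no, ship_date
""").fetchall()

for row in flagged:
    print(row)
```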

Postgresql : Check if the last number is the highest

I have a large database in which one field should be an incrementing number, but it sometimes resets, and I must detect those resets (the bold rows).
Table 1:
Shop #Sell DATE
EC1 56 1/10/2015
EC1 57 2/10/2015
**EC1 11 3/10/2015
EC1 12 4/10/2015**
AS2 20 1/10/2015
AS2 21 2/10/2015
AS2 22 3/10/2015
AS2 23 4/10/2015
To solve this problem I thought to find the highest number of each SHOP and check if it is the number with the highest DATE. Do you know another easier way to do it?
My concern is that it can be a problem to do the way I am thinking since I have a large database.
Do you know how I can do the query I am thinking of or do you have any others ideas?
The query you have in mind will give you all Shop values having a discontinuity in Sell number.
If you want to get the offending record you can use the following query:
SELECT Shop, Sell, DATE
FROM (
SELECT Shop, Sell, DATE,
LAG(Sell) OVER (PARTITION BY Shop ORDER BY DATE) AS prevSell
FROM Shops ) t
WHERE Sell < prevSell
ORDER BY DATE
LIMIT 1
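Because of ORDER BY DATE and LIMIT 1, the above query returns only the earliest discontinuity across all shops. Remove LIMIT 1 to list every row where Sell drops below the previous value within its Shop partition.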
Output:
Shop Sell DATE
---------------------
EC1 11 2015-03-10
EDIT:
In case you cannot use window functions and you only want the ID of the shop having the discontinuity, you can use the following query:
SELECT s.Shop
FROM Shops AS s
INNER JOIN (
SELECT Shop, MAX(Sell) AS Sell, MAX(DATE) AS DATE
FROM Shops
GROUP BY Shop ) t
ON s.Shop = t.Shop AND s.DATE = t.DATE
WHERE t.Sell <> s.Sell
The above will work provided that you have unique DATE values per Shop.
I think the following is the type of query you want:
select distinct s.shop, s.maxsell, s.lastsell
from (select shop,
max(sell) over (partition by shop) as maxsell,
first_value(sell) over (partition by shop order by date desc) as lastsell
from shops
) s
where s.maxsell <> s.lastsell;
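The LAG-based detection can be tried on the question's sample rows with this minimal sketch. It uses SQLite through Python's sqlite3 module instead of PostgreSQL, ISO date strings, and a hypothetical column name sale_date (chosen to avoid the keyword DATE). It lists every reset row instead of limiting the result to one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE shops (shop TEXT, sell INTEGER, sale_date TEXT)")
conn.executemany(
    "INSERT INTO shops VALUES (?, ?, ?)",
    [
        ("EC1", 56, "2015-01-10"),
        ("EC1", 57, "2015-02-10"),
        ("EC1", 11, "2015-03-10"),  # counter reset
        ("EC1", 12, "2015-04-10"),
        ("AS2", 20, "2015-01-10"),
        ("AS2", 21, "2015-02-10"),
        ("AS2", 22, "2015-03-10"),
        ("AS2", 23, "2015-04-10"),
    ],
)

# A reset is any row whose counter is lower than the previous row of the same shop.
resets = conn.execute("""
    SELECT shop, sell, sale_date
    FROM (SELECT shop, sell, sale_date,
                 LAG(sell) OVER (PARTITION BY shop ORDER BY sale_date) AS prev_sell
          FROM shops) AS t
    WHERE sell < prev_sell
    ORDER BY shop, sale_date
""").fetchall()

print(resets)
```

Only the row where the counter drops is reported. The following EC1 row (12) is higher than its predecessor (11), so it is not flagged even though it belongs to the reset sequence.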

How to increment a value in SQL based on a unique key

Apologies in advance if some of the trigger solutions already cover this but I can't get them to work for my scenario.
I have a table of over 50,000 rows, all of which have an ID, with roughly 5000 distinct ID values. There could be 100 rows with an instrumentID = 1 and 50 with an instrumentID = 2 within the table etc but they will have slightly different column entries. So I could write a
SELECT * from tbl WHERE instrumentID = 1
and have it return 100 rows (I know this is easy stuff but just to be clear)
What I need to do is form an incrementing value for each time a instrument ID is found, so I've tried stuff like this:
IntIndex INT IDENTITY(1,1),
dDateStart DATE,
IntInstrumentID INT,
IntIndex1 AS IntInstrumentID + IntIndex,
at the table create step.
However, I need the IntIndex1 to increment when an instrumentID is found, irrespective of where the record is found in the table so that it effectively would provide a count of the records just by looking at the last IntIndex1 value alone. Rather than what the above does which is increment on all of the rows of the table irrespective of the instrumentID so you would get 5001,4002,4003 etc.
An example would be: for intInstruments 5000 and 4000
intInstrumentID | IntIndex1
--------- ------------------
5000 | 5001
5000 | 5002
4000 | 4001
5000 | 5003
4000 | 4002
The reason I need to do this is because I need to join two tables based on these values (a start and end date for each instrumentID). I have tried GROUP BY etc but this can't work in both tables and the JOIN then doesn't work.
Many thanks
I'm not entirely sure I understand your problem, but if you just need IntIndex1 to join to, could you just join to the following query, rather than trying to actually keep the calculated value in the database:
SELECT *,
intInstrumentID + RANK() OVER(PARTITION BY intInstrumentID ORDER BY dDateStart ASC) AS IntIndex1
FROM tbl
Edit: If I understand your comment correctly (which is not certain!), then presumably you know that your end date and start date tables have exactly the same number of rows, which leads to a one-to-one mapping between them based on their respective dates within each instrument ID?
If that's the case then maybe this join is what you are looking for:
SELECT SD.intInstrumentID, SD.dDateStart, ED.dEndDate
FROM
(
SELECT intInstrumentID,
dDateStart,
RANK() OVER(PARTITION BY intInstrumentID ORDER BY dDateStart ASC) AS IntIndex1
FROM tblStartDate
) SD
JOIN
(
SELECT intInstrumentID,
dEndDate,
RANK() OVER(PARTITION BY intInstrumentID ORDER BY dEndDate ASC) AS IntIndex1
FROM tblEndDate -- assumed name of the end date table
) ED
ON SD.intInstrumentID = ED.intInstrumentID
AND SD.IntIndex1 = ED.IntIndex1
If not, please will you post some example data for both tables and the expected results?
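The pairing join can be sketched in SQLite through Python's sqlite3 module as below. The sample dates and the table name tblEndDate are assumptions made for illustration, and ROW_NUMBER is used in place of RANK so that two identical dates for one instrument still receive distinct positions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tblStartDate (intInstrumentID INTEGER, dDateStart TEXT);
CREATE TABLE tblEndDate   (intInstrumentID INTEGER, dEndDate   TEXT);
-- Hypothetical sample data: two periods for 5000, one for 4000.
INSERT INTO tblStartDate VALUES (5000, '2020-01-01'), (4000, '2020-02-01'), (5000, '2020-03-01');
INSERT INTO tblEndDate   VALUES (5000, '2020-03-31'), (5000, '2020-01-31'), (4000, '2020-02-15');
""")

# Number the start dates and the end dates per instrument, then pair the
# n-th start with the n-th end of the same instrument.
pairs = conn.execute("""
    SELECT sd.intInstrumentID, sd.dDateStart, ed.dEndDate
    FROM (SELECT intInstrumentID, dDateStart,
                 ROW_NUMBER() OVER (PARTITION BY intInstrumentID
                                    ORDER BY dDateStart) AS IntIndex1
          FROM tblStartDate) AS sd
    JOIN (SELECT intInstrumentID, dEndDate,
                 ROW_NUMBER() OVER (PARTITION BY intInstrumentID
                                    ORDER BY dEndDate) AS IntIndex1
          FROM tblEndDate) AS ed
      ON sd.intInstrumentID = ed.intInstrumentID
     AND sd.IntIndex1 = ed.IntIndex1
    ORDER BY sd.intInstrumentID, sd.dDateStart
""").fetchall()

for row in pairs:
    print(row)
```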

Selecting the latest per group of items [duplicate]

i have 2 tables products and cost
PRODUCT
ProdCode - PK
ProdName
COST
Effectivedate - PK
RetailCOst
Prodcode
i tried this query:
SELECT a.ProdCOde AS id, MAX(EffectiveDate) AS edate, RetailCOst AS retail
FROM cost a
INNER JOIN product b USING (ProdCode)
WHERE EffectiveDate <= '2009-10-01'
GROUP BY a.ProdCode;
It shows the right EffectiveDate, but the cost doesn't match that specific EffectiveDate.
So I want to select the latest date with its matching cost per item.
for example the date i selected is '2009-12-25' and the records for 1 item are these:
ProdCode |EffectiveDate| Cost
00010000 | 2009-01-05 | 50
00010000 | 2009-05-25 | 48
00010000 | 2010-07-01 | 40
so in result i should get 00010000|2009-05-25|48 because it is lesser than the date on my query and it is the latest for that item. and then i want to to show on my query the latest costs on each product.
hope to hear from you soon! thanks!
You need to use a subquery here:
SELECT maxdates.ProdCode, maxdates.maxDate, cost.RetailCost as retail
FROM (
SELECT ProdCode, max(EffectiveDate) as maxDate
FROM cost
WHERE EffectiveDate < '2009-10-01'
GROUP BY ProdCode
) maxdates
LEFT JOIN cost ON (maxdates.ProdCode=cost.ProdCode
AND maxdates.maxDate=cost.EffectiveDate)
Explanation:
The inner SELECT gives a list of all Products and their respective maximum EffectiveDates. The join "glues" the retail cost per data entry to the result.
Alternatively, the old max-concat trick should work. The date occupies the first 10 characters of the concatenated string, so the cost starts at position 11.
SELECT
p.ProdCode,
SUBSTRING(MAX(CONCAT(c.EffectiveDate, c.RetailCost)), 1, 10) AS date,
SUBSTRING(MAX(CONCAT(c.EffectiveDate, c.RetailCost)), 11, 100) + 0 AS cost
FROM
product p,
cost c
WHERE
p.ProdCode = c.ProdCode AND
c.EffectiveDate < '2009-10-01'
GROUP BY
p.ProdCode
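The subquery approach can be checked against the question's example with this minimal sketch, run in SQLite through Python's sqlite3 module. The second product (00020000) is hypothetical sample data added to show the per-product grouping, and the cutoff is passed as a bound parameter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cost (ProdCode TEXT, EffectiveDate TEXT, RetailCost INTEGER)")
conn.executemany(
    "INSERT INTO cost VALUES (?, ?, ?)",
    [
        ("00010000", "2009-01-05", 50),
        ("00010000", "2009-05-25", 48),
        ("00010000", "2010-07-01", 40),
        ("00020000", "2009-03-01", 20),  # hypothetical second product
    ],
)

cutoff = "2009-12-25"
# The inner query finds the latest effective date per product on or before the
# cutoff; joining back to cost picks up the price that belongs to that date.
latest = conn.execute("""
    SELECT c.ProdCode, c.EffectiveDate, c.RetailCost
    FROM cost AS c
    JOIN (SELECT ProdCode, MAX(EffectiveDate) AS maxDate
          FROM cost
          WHERE EffectiveDate <= ?
          GROUP BY ProdCode) AS m
      ON m.ProdCode = c.ProdCode
     AND m.maxDate = c.EffectiveDate
    ORDER BY c.ProdCode
""", (cutoff,)).fetchall()

for row in latest:
    print(row)
```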

Optimizing Query With Subselect

I'm trying to generate a sales reports which lists each product + total sales in a given month. Its a little tricky because the prices of products can change throughout the month. For example:
Between Jan-01 and Jan-15, my company sells 50 Widgets at a cost of $10 each
Between Jan-15 and Jan-31, my company sells 50 more Widgets at a cost of $15 each
The total sales of Widgets for January = (50 * 10) + (50 * 15) = $1250
This setup is represented in the database as follows:
Sales table
Sale_ID ProductID Sale_Date
1 1 2009-01-01
2 1 2009-01-01
3 1 2009-01-02
...
50 1 2009-01-15
51 1 2009-01-16
52 1 2009-01-17
...
100 1 2009-01-31
Prices table
Product_ID Sale_Date Price
1 2009-01-01 10.00
1 2009-01-16 15.00
When a price is defined in the prices table, it is applied to all products sold with the given ProductID from the given SaleDate going forward.
Basically, I'm looking for a query which returns data as follows:
Desired output
Sale_ID ProductID Sale_Date Price
1 1 2009-01-01 10.00
2 1 2009-01-01 10.00
3 1 2009-01-02 10.00
...
50 1 2009-01-15 10.00
51 1 2009-01-16 15.00
52 1 2009-01-17 15.00
...
100 1 2009-01-31 15.00
I have the following query:
SELECT
Sale_ID,
Product_ID,
Sale_Date,
(
SELECT TOP 1 Price
FROM Prices
WHERE
Prices.Product_ID = Sales.Product_ID
AND Prices.Sale_Date <= Sales.Sale_Date
ORDER BY Prices.Sale_Date DESC
) as Price
FROM Sales
This works, but is there a more efficient query than a nested sub-select?
And before you point out that it would just be easier to include "price" in the Sales table, I should mention that the schema is maintained by another vendor and I'm unable to change it. And in case it matters, I'm using SQL Server 2000.
If you start storing start and end dates, or create a view that includes the start and end dates (you can even create an indexed view) then you can heavily simplify your query. (provided you are certain there are no range overlaps)
SELECT
Sale_ID,
Sales.Product_ID,
Sales.Sale_Date,
Price
FROM Sales
JOIN Prices ON Prices.Product_ID = Sales.Product_ID
AND Sales.Sale_Date >= StartDate AND Sales.Sale_Date < EndDate
-- careful not to use BETWEEN: it includes both ends
Note:
A technique along these lines will allow you to do this with a view. Note, if you need to index the view, it will have to be juggled around quite a bit ..
create table t (d datetime)
insert t values(getdate())
insert t values(getdate()+1)
insert t values(getdate()+2)
go
create view myview
as
select start = isnull(max(t2.d), '1975-1-1'), finish = t1.d from t t1
left join t t2 on t1.d > t2.d
group by t1.d
go
select * from myview
start finish
----------------------- -----------------------
1975-01-01 00:00:00.000 2009-01-27 11:12:57.383
2009-01-27 11:12:57.383 2009-01-28 11:12:57.383
2009-01-28 11:12:57.383 2009-01-29 11:12:57.383
It's best to avoid these types of correlated subqueries. Here's a classic technique for such cases.
SELECT
s.Sale_ID,
s.Product_ID,
s.Sale_Date,
p1.Price
FROM Sales AS s
LEFT JOIN Prices AS p1 ON s.Product_ID = p1.Product_ID
AND s.Sale_Date >= p1.Sale_Date
LEFT JOIN Prices AS p2 ON s.Product_ID = p2.Product_ID
AND s.Sale_Date >= p2.Sale_Date
AND p2.Sale_Date > p1.Sale_Date
WHERE p2.Price IS NULL -- want this one not to be found
Use a left outer join on the pricing table as p2, and look for a NULL record demonstrating that the matched product-price record found in p1 is the most recent on or before the sales date.
(I would have inner-joined the first price match, but if there is none, it's nice to have the product show up anyway so you know there's a problem.)
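The double left join can be verified on a cut-down version of the question's data with this minimal sketch, run in SQLite through Python's sqlite3 module. Only four of the hundred sales are loaded, and the column is named Product_ID in both tables for consistency.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Sales  (Sale_ID INTEGER, Product_ID INTEGER, Sale_Date TEXT);
CREATE TABLE Prices (Product_ID INTEGER, Sale_Date TEXT, Price REAL);
INSERT INTO Sales VALUES
    (1, 1, '2009-01-01'), (50, 1, '2009-01-15'),
    (51, 1, '2009-01-16'), (100, 1, '2009-01-31');
INSERT INTO Prices VALUES (1, '2009-01-01', 10.0), (1, '2009-01-16', 15.0);
""")

# p1 is any price in force on or before the sale date; p2 looks for a later
# price that is also in force. Keeping rows where no p2 exists leaves only the
# most recent price for each sale.
priced = conn.execute("""
    SELECT s.Sale_ID, s.Product_ID, s.Sale_Date, p1.Price
    FROM Sales AS s
    LEFT JOIN Prices AS p1 ON s.Product_ID = p1.Product_ID
                          AND s.Sale_Date >= p1.Sale_Date
    LEFT JOIN Prices AS p2 ON s.Product_ID = p2.Product_ID
                          AND s.Sale_Date >= p2.Sale_Date
                          AND p2.Sale_Date > p1.Sale_Date
    WHERE p2.Price IS NULL
    ORDER BY s.Sale_ID
""").fetchall()

for row in priced:
    print(row)
```

Sale 51, made on the day the new price takes effect, picks up 15.00, which matches the desired output in the question.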
Are you actually running into performance problems or are you just anticipating them? I would implement this exactly as you have, were my hands tied from a schema-modification standpoint as yours are.
I agree with Sean. The code you have written is very clean and understandable. If you are having performance issues, then take the extra effort to make the code faster. Otherwise, you are making the code more complex for no reason. Nested sub-selects are extremely useful when used judiciously.
The combination of Product_ID and Sale_Date is your foreign key. Try a select-join on Product_ID, Sale_Date.