Deterministic sort order for window functions - sql

I've a status table and I want to fetch the latest details.
Slno | ID | Status | date
1 | 1 | Pass | 15-06-2015 11:11:00 - this is inserted first
2 | 1 | Fail | 15-06-2015 11:11:00 - this is inserted second
3 | 2 | Fail | 15-06-2015 12:11:11 - this is inserted first
4 | 2 | Pass | 15-06-2015 12:11:11 - this is inserted second
I use a window function with partition by ID order by date desc to fetch the first value.
Expected output:
2 | 1 | Fail | 15-06-2015 11:11:00 - this is inserted second
4 | 2 | Pass | 15-06-2015 12:11:11 - this is inserted second
Actual Output :
1 | 1 | Pass | 15-06-2015 11:11:00 - this is inserted first
3 | 2 | Fail | 15-06-2015 12:11:11 - this is inserted first
According to [http://docs.aws.amazon.com/redshift/latest/dg/r_Examples_order_by_WF.html], adding a second ORDER BY column to the window function may solve the problem. But I don't have any other column to differentiate the rows!
Is there another approach to solve the issue?
EDIT: I've added slno here for clarity. I don't have slno as such in the table!
My SQL:
with range as (
select id from status where date between 01-06-2015 and 30-06-2015
), latest as (
select status, id, row_number() OVER (PARTITION BY id ORDER BY date DESC) row_num
)
select * from latest where row_num = 1

If you don't have slno in your table, then you have no reliable information about which row was inserted first. There is no natural order in a table; the physical order of rows can change at any time (with any update, with VACUUM, etc.).
You could use an unreliable trick: order by the internal ctid.
select *
from  (
   select id, status
        , row_number() OVER (PARTITION BY id
                             ORDER BY date DESC, ctid DESC) AS row_num  -- latest date first, then last physical row
   from   status  -- that's your table name??
   where  date >= '2015-06-01'  -- assuming column is actually a date
   and    date <  '2015-07-01'
   ) sub
where  row_num = 1;
In the absence of any other information about which row came first (which is a design error to begin with, fix it!), you might try to save what you can using the internal tuple ID ctid:
In-order sequence generation
Rows will be in physical order when inserted initially, but that can change any time with any write operation to the table or VACUUM or other events.
This is a measure of last resort and it will break.
Your presented query was invalid on several counts: missing column name in 1st CTE, missing table name in 2nd CTE, ...
You don't need a CTE for this.
Simpler with DISTINCT ON (considerations for ctid apply the same):
SELECT DISTINCT ON (id)
       id, status
FROM   status
WHERE  date >= '2015-06-01'
AND    date <  '2015-07-01'
ORDER  BY id, date DESC, ctid DESC;
Select first row in each GROUP BY group?
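The "fix it" advice above can be sketched by storing an explicit insertion-order column and using it as the tiebreaker. This is only an illustration under assumptions: it runs against an in-memory SQLite database (3.25+ for window functions) through Python's sqlite3 module rather than the asker's database, and it treats slno as a real auto-incrementing column, which the asker's table does not currently have.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE status (
    slno   INTEGER PRIMARY KEY AUTOINCREMENT,  -- assumed insertion-order column
    id     INTEGER NOT NULL,
    status TEXT    NOT NULL,
    date   TEXT    NOT NULL                    -- ISO-8601 text sorts chronologically
);
INSERT INTO status (id, status, date) VALUES
    (1, 'Pass', '2015-06-15 11:11:00'),
    (1, 'Fail', '2015-06-15 11:11:00'),
    (2, 'Fail', '2015-06-15 12:11:11'),
    (2, 'Pass', '2015-06-15 12:11:11');
""")

# Ties on date are broken by slno, so the result no longer depends on
# the physical order of the rows.
latest = conn.execute("""
SELECT id, status
FROM (
    SELECT id, status,
           ROW_NUMBER() OVER (PARTITION BY id
                              ORDER BY date DESC, slno DESC) AS row_num
    FROM status
    WHERE date >= '2015-06-01' AND date < '2015-07-01'
) sub
WHERE row_num = 1
ORDER BY id
""").fetchall()

print(latest)  # [(1, 'Fail'), (2, 'Pass')]
```

With a unique tiebreaker in the ORDER BY, every row in a partition gets a well-defined position, which is what makes the window function deterministic.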

Related

SQL Server query - keeping first and last unique records of a group

We are trying to remove and rank data in tables that is provided in a daily feed to our system. The example data of course isn't the actual product, but clearly represents the concept.
Daily inserts:
data is imported daily into tables that continually updates the status of the products
the daily status updates tell us when products were listed, are they currently listed and then the last date it was listed
after a period of {X} time, we can normalize the data
Cleanup & ranking:
we are now trying to remove duplicate records for values in a group that fall in-between the first and last values
we also want to set identifiers for the records that represent the first and last occurrence of those unique values in that group
Sample data:
I've found that the photo is the easiest way to show the data, show what's needed and not needed - I hope this makes it easier and not obtuse.
In the sample data:
"ridgerapp" we want to keep the records for 03/12/17 & 06/12/17.
"ridgerapp" we want to delete the records that fall between the dates above.
"ridgerapp" we want to also set/update the records for 03/12/17 & 06/12/17 as the first and last occurrence - something like -
update table set 03/12/17 = 0 (first), 06/12/17 = 1 (last)
"sierra" is just another expanded data sample, and we want to keep the records for 12/06/16 and 12/11/16.
"sierra" delete the records that fall between 12/06/16 and 12/11/16.
"sierra" update the status/rank for the 12/06/16 and 12/11/16 records as the first and last occurrence.
update table set 12/06/16 = 0 (first), 12/11/16 = 1 (last).
Conclusion:
Using pseudo code, this is the overall objective:
select distinct records in table (using id,name,color,value as unique identifiers)
for the records in each group look at the history and find the top and bottom dates
delete records between top and bottom dates for each group
update the history with a status/rank (field name is rank) of 0 and 1 for values in each group
using the sample data, the results would end up
Updated table values:
23 ridgerapp blue 25 03/12/17 0
23 ridgerapp blue 25 06/12/17 1
57 sierra red 15 12/06/16 0
57 sierra red 15 12/11/16 1
I'd use a CTE with the row_number() window function to find the first and last rows for each group, and then update it.
You didn't specify what makes a group a group, so I only based this off the ID. If you want the group to be a set of columns, i.e. ID, Color and Value, then just add these columns to the partition by list. For the sample data the result would be the same, but different sample data would have different outcomes.
Notice I didn't include the exact rows for the sierra group because I wanted to show you how it'd handle duplicate history dates.
declare @table table (id int, [name] varchar(64), color varchar(16), [value] int, history date)
insert into @table
values
(23,'ridgerapp','blue',25,'20170312'),
(23,'ridgerapp','blue',25,'20170325'),
(23,'ridgerapp','blue',25,'20170410'),
(23,'ridgerapp','blue',25,'20170610'),
(23,'ridgerapp','blue',25,'20170612'),
(57,'sierra','red',15,'20161206'),
(57,'sierra','red',15,'20161208'),
(57,'sierra','red',15,'20161210'),
(57,'sierra','red',15,'20161210') --notice this is a duplicate row
;with cte as(
    select
        *
        ,fst = row_number() over (partition by id order by history asc)
        ,lst = row_number() over (partition by id order by history desc)
    from @table
)
delete from cte
where fst !=1 and lst !=1
select
    *
    ,flag = case when row_number() over (partition by id order by history asc) = 1 then 0 else 1 end
from @table
RETURNS
+----+-----------+-------+-------+------------+------+
| id | name | color | value | history | flag |
+----+-----------+-------+-------+------------+------+
| 23 | ridgerapp | blue | 25 | 2017-03-12 | 0 |
| 23 | ridgerapp | blue | 25 | 2017-06-12 | 1 |
| 57 | sierra | red | 15 | 2016-12-06 | 0 |
| 57 | sierra | red | 15 | 2016-12-10 | 1 |
+----+-----------+-------+-------+------------+------+
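As a hedged variation on the answer above, the sketch below partitions by all four identifying columns (id, name, color, value), as the question's pseudo code suggests, and adds a tiebreaker so that duplicate history dates are ranked deterministically. It is an illustration only: it uses an in-memory SQLite database (3.25+) via Python's sqlite3 instead of SQL Server, the table name products is hypothetical, and SQLite's built-in rowid stands in for a real key column. SQLite cannot delete through a CTE, so the delete goes through a rowid lookup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (id INTEGER, name TEXT, color TEXT, value INTEGER, history TEXT);
INSERT INTO products VALUES
    (23, 'ridgerapp', 'blue', 25, '2017-03-12'),
    (23, 'ridgerapp', 'blue', 25, '2017-03-25'),
    (23, 'ridgerapp', 'blue', 25, '2017-04-10'),
    (23, 'ridgerapp', 'blue', 25, '2017-06-10'),
    (23, 'ridgerapp', 'blue', 25, '2017-06-12'),
    (57, 'sierra', 'red', 15, '2016-12-06'),
    (57, 'sierra', 'red', 15, '2016-12-08'),
    (57, 'sierra', 'red', 15, '2016-12-10'),
    (57, 'sierra', 'red', 15, '2016-12-10');  -- duplicate history date

-- Number each group from both ends; rowid breaks ties between equal dates.
WITH ranked AS (
    SELECT rowid AS rid,
           ROW_NUMBER() OVER (PARTITION BY id, name, color, value
                              ORDER BY history ASC, rowid ASC)   AS fst,
           ROW_NUMBER() OVER (PARTITION BY id, name, color, value
                              ORDER BY history DESC, rowid DESC) AS lst
    FROM products
)
DELETE FROM products
WHERE rowid IN (SELECT rid FROM ranked WHERE fst <> 1 AND lst <> 1);
""")

# Flag the first remaining row of each group with 0 and the last with 1.
kept = conn.execute("""
SELECT id, name, history,
       CASE WHEN ROW_NUMBER() OVER (PARTITION BY id, name, color, value
                                    ORDER BY history, rowid) = 1
            THEN 0 ELSE 1 END AS flag
FROM products
ORDER BY id, history
""").fetchall()

for row in kept:
    print(row)
```

For this sample data the result matches the table above; with other data, partitioning by the extra columns would split an id into several groups.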

SQL Server, complex query

I have an Azure SQL Database table which is filled by importing XML-files.
The order of the files is random so I could get something like this:
ID | Name | DateFile | IsCorrection | Period | Other data
1 | Mr. A | March, 1 | false | 3 | Foo
20 | Mr. A | March, 1 | true | 2 | Foo
13 | Mr. A | Apr, 3 | true | 2 | Foo
4 | Mr. B | Feb, 1 | false | 2 | Foo
This table is joined with another table, which is also joined with a 3rd table.
I need to get the join of these 3 tables for the person with the newest data, based on Period, DateFile and Correction.
In my above example, Id=1 is the original data for Period 3, I need this record.
But in the same file was also a correction for Period 2 (Id=20) and in the file of April, the data was corrected again (Id=13).
So for Period 3, I need Id=1, for Period 2 I need Id=13 because it has the last corrected data and I need Id=4 because it is another person.
I would like to do this in a view, but using a stored procedure would not be a problem.
I have no idea how to solve this. Any pointers will be much appreciated.
EDIT:
My datamodel is of course much more complex than this sample. DateFile and Period are DateTime types in the table. Actually Period is two DateTime columns: StartPeriod and EndPeriod.
Well, looking at your data, I believe we can disregard the IsCorrection column and just pick the latest row for each user/period.
Let's start by ordering the rows, placing the latest on top (YourTable is a placeholder, since the real table name wasn't given):
SELECT ROW_NUMBER() OVER (PARTITION BY Period, Name ORDER by DateFile DESC), *
FROM YourTable
And from this result you select all rows with row number 1:
;with numberedRows as (
    SELECT ROW_NUMBER() OVER (PARTITION BY Period, Name ORDER by DateFile DESC) as rowIndex, *
    FROM YourTable -- placeholder table name
)
select * from numberedRows where rowIndex=1
The PARTITION BY tells ROW_NUMBER() to reset the counter whenever it encounters a change in the columns Period and Name. The ORDER BY tells ROW_NUMBER() that we want the newest row to be number 1, with older rows numbered after it. We only need the latest row.
The WITH declares a "common table expression" which is a kind of subquery or temporary table.
Not knowing your exact data, I might recommend something wrong, but you should be able to join this last query with your other tables to get your desired result.
Something like:
;with numberedRows as (
    SELECT ROW_NUMBER() OVER (PARTITION BY Period, Name ORDER by DateFile DESC) as rowIndex, *
    FROM YourTable -- placeholder table name
)
select * from numberedRows a
JOIN periods b on b.empId = a.Id
JOIN msg c on b.msgId = c.Id
where a.rowIndex=1

How to update a Date field on all records when another record is inserted or updated?

This update runs as part of a stored procedure that will insert or update a record in my Information table:
UPDATE [Information] SET
    [TermDate] = @aEffDate
WHERE [InformationID] = (SELECT ISNULL(MAX(InformationID),0)
                         FROM Information
                         WHERE InformationID < @aInformationID
                         AND [DeletedBy] IS NULL
                         AND [DeletedOn] IS NULL
                         AND Code = @aCode)
Basically, it looks for the second newest record (based off ID) and sets that record's TermDate to the current record's EffDate. The problem is that this assumes the user enters records in oldest-to-newest order.
I've added another clause to the nested select statement above, AND EffDate < @aEffDate, which ensures that the date isn't improperly terminated. However, now I just have a bunch of records with null TermDate columns when the record that is returned from MAX(InformationID) has a greater EffDate.
So, assume the following
1) Record entered with 11/02/2015 EffDate
2) Record entered with 09/01/2015 EffDate
3) Record entered with 10/03/2015 EffDate
4) Record entered with 09/15/2015 EffDate
The database WILL look like this:
InformationID | EffDate | TermDate
---------------------------------------------
1 | 11/02/15 | 09/01/15
2 | 09/01/15 | 10/03/15
3 | 10/03/15 | 09/15/15
4 | 09/15/15 | NULL
But it SHOULD look like this:
InformationID | EffDate | TermDate
---------------------------------------------
1 | 11/02/15 | NULL
2 | 09/01/15 | 09/15/15
3 | 10/03/15 | 11/02/15
4 | 09/15/15 | 10/03/15
The user should have entered the records in the order of 2, 4, 3, 1 and this would assign the proper TermDate to each record. But unfortunately we can't really force them to do that.
QUESTION: How can I get the EffDate closest to, but not less than, another record's EffDate and assign that value as the other record's TermDate?
I've changed the check on InformationID to simply make sure they're not the same (InformationID <> @aInformationID), but beyond that I can't figure out how to get the "closest" dates.
Use window functions and an updatable CTE/subquery. As a formal answer to your question, this should get the second to last record:
with toupdate as (
      select i.*,
             row_number() over (order by effdate desc) as seqnum
      from Information i
      where DeletedBy IS NULL and DeletedOn IS NULL and Code = @aCode
     )
update toupdate
    set TermDate = @aEffDate
    where seqnum = 2;
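The "SHOULD look like" table in the question amounts to giving every record the next-later EffDate as its TermDate, regardless of entry order. One way to sketch that is to recompute all TermDate values with LEAD(). This swaps in a different technique from the answer above, and it rests on assumptions: it runs on in-memory SQLite (3.25+) via Python's sqlite3 rather than SQL Server, the Code value 'A' is made up, and the DeletedBy/DeletedOn filters are left out for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Information (
    InformationID INTEGER PRIMARY KEY,
    Code          TEXT,   -- 'A' below is a made-up value
    EffDate       TEXT,   -- ISO-8601 text sorts chronologically
    TermDate      TEXT
);
-- Entered out of date order, as in the question.
INSERT INTO Information (InformationID, Code, EffDate) VALUES
    (1, 'A', '2015-11-02'),
    (2, 'A', '2015-09-01'),
    (3, 'A', '2015-10-03'),
    (4, 'A', '2015-09-15');
""")

# LEAD() returns the EffDate of the next row in EffDate order within
# the same Code, or NULL for the newest record.
conn.execute("""
WITH nxt AS (
    SELECT InformationID,
           LEAD(EffDate) OVER (PARTITION BY Code ORDER BY EffDate) AS NextEff
    FROM Information
)
UPDATE Information
SET TermDate = (SELECT NextEff FROM nxt
                WHERE nxt.InformationID = Information.InformationID)
""")

rows = conn.execute(
    "SELECT InformationID, EffDate, TermDate FROM Information ORDER BY InformationID"
).fetchall()
for row in rows:
    print(row)
```

Because every TermDate is derived from the data rather than from the order of entry, rerunning the update after each insert keeps the chain consistent.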

SQL Select statements that returns differences in minutes from next date-time from the same table

I have a user activity table. I want to see a difference between each process in minutes.
To be more specific here is a partial data from table.
Date |TranType| Prod | Loc
-----------------------------------------------------------
02/27/12 3:17:21 PM | PICK | LIrishGreenXL | E01C015
02/27/12 3:18:18 PM | PICK | LAntHeliconiaS | E01A126
02/27/12 3:19:00 PM | PICK | LAntHeliconiaL | E01A128
02/27/12 3:19:07 PM | PICK | LAntHeliconiaXL | E01A129
I want to retrieve the time difference in minutes between the first and second process, then the second and third, and so on.
Thank you
Something like this will work in MS SQL, just change the field names to match yours:
select a.ActionDate, datediff(minute,a.ActionDate,b.ActionDate) as DiffMin
from
(select ROW_NUMBER() OVER(ORDER BY ActionDate) AS Row, ActionDate
 from [dbo].[Attendance]) a
inner join
(select ROW_NUMBER() OVER(ORDER BY ActionDate) AS Row, ActionDate
 from [dbo].[Attendance]) b
on a.Row +1 = b.Row
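The same pairing of each row with the next one can also be sketched with LEAD() instead of a self-join on row numbers. The example below is an assumption-laden illustration: it uses in-memory SQLite (3.25+) via Python's sqlite3, a hypothetical table name activity, and the question's sample rows rewritten as ISO timestamps. It reports the gap in seconds, since the sample picks are less than a minute apart; dividing by 60.0 gives fractional minutes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE activity (ActionDate TEXT, TranType TEXT, Prod TEXT, Loc TEXT);
INSERT INTO activity VALUES
    ('2012-02-27 15:17:21', 'PICK', 'LIrishGreenXL',   'E01C015'),
    ('2012-02-27 15:18:18', 'PICK', 'LAntHeliconiaS',  'E01A126'),
    ('2012-02-27 15:19:00', 'PICK', 'LAntHeliconiaL',  'E01A128'),
    ('2012-02-27 15:19:07', 'PICK', 'LAntHeliconiaXL', 'E01A129');
""")

# strftime('%s', ...) converts a timestamp to Unix seconds; LEAD() fetches
# the next row's timestamp. The last row has no successor, so its gap is NULL.
gaps = conn.execute("""
SELECT Prod,
       CAST(strftime('%s', LEAD(ActionDate) OVER (ORDER BY ActionDate)) AS INTEGER)
     - CAST(strftime('%s', ActionDate) AS INTEGER) AS DiffSec
FROM activity
ORDER BY ActionDate
""").fetchall()

for prod, diff_sec in gaps:
    print(prod, diff_sec)
```

Note that T-SQL's DATEDIFF(minute, ...) counts minute boundaries crossed rather than elapsed minutes, so its results for these rows can differ from a seconds-based calculation.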

Remove redundant SQL price cost records

I have a table costhistory with fields id,invid,vendorid,cost,timestamp,chdeleted. It looks like it was populated with a trigger every time a vendor updated their list of prices.
It has redundant records - since it was populated regardless of whether price changed or not since last record.
Example:
id | invid | vendorid | cost | timestamp | chdeleted
1 | 123 | 1 | 100 | 1/1/01 | 0
2 | 123 | 1 | 100 | 1/2/01 | 0
3 | 123 | 1 | 100 | 1/3/01 | 0
4 | 123 | 1 | 500 | 1/4/01 | 0
5 | 123 | 1 | 500 | 1/5/01 | 0
6 | 123 | 1 | 100 | 1/6/01 | 0
I would want to remove records with ID 2,3,5 since they do not reflect any change since the last price update.
I'm sure it can be done, though it might take several steps.
Just to be clear, this table has swelled to 100gb and contains 600M rows. I am confident that a proper cleanup will take this table's size down by 90% - 95%.
Thanks!
The approach you take will vary depending on the database you are using. For SQL Server 2005+, the following query should give you the records you want to remove:
select id
from (
select id, Rank() over (Partition BY invid, vendorid, cost order by timestamp) as Rank
from costhistory
) tmp
where Rank > 1
You can then delete them like this:
delete from costhistory
where id in (
    select id
    from (
        select id, Rank() over (Partition BY invid, vendorid, cost order by timestamp) as Rank
        from costhistory
    ) tmp
    where Rank > 1
)
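One caveat with the RANK() query above: it partitions by cost, so in the question's sample it would also flag id 6, where the price returns to 100 after having been 500. A hedged alternative is to compare each row only with the row immediately before it, using LAG(). The sketch below is illustrative rather than a drop-in fix: it runs on in-memory SQLite (3.25+) via Python's sqlite3 (LAG() is not available in SQL Server 2005), and it assumes the sample dates 1/1/01 through 1/6/01 mean six consecutive days in January 2001.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE costhistory (
    id INTEGER PRIMARY KEY, invid INTEGER, vendorid INTEGER,
    cost INTEGER, timestamp TEXT, chdeleted INTEGER
);
INSERT INTO costhistory VALUES
    (1, 123, 1, 100, '2001-01-01', 0),
    (2, 123, 1, 100, '2001-01-02', 0),
    (3, 123, 1, 100, '2001-01-03', 0),
    (4, 123, 1, 500, '2001-01-04', 0),
    (5, 123, 1, 500, '2001-01-05', 0),
    (6, 123, 1, 100, '2001-01-06', 0);
""")

# A row is redundant when its cost equals the cost of the previous row
# for the same item and vendor; id breaks ties on equal timestamps.
redundant = [row[0] for row in conn.execute("""
SELECT id
FROM (
    SELECT id, cost,
           LAG(cost) OVER (PARTITION BY invid, vendorid
                           ORDER BY timestamp, id) AS prev_cost
    FROM costhistory
) t
WHERE cost = prev_cost
ORDER BY id
""")]

print(redundant)  # [2, 3, 5]
```

The first row of each partition has a NULL prev_cost, so the equality test never matches it and the earliest record is always kept.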
I would suggest that you recreate the table using a group by query. Also, I assume that the "id" column is not used in any other tables. If it is, then you need to fix those tables as well.
Deleting such a large quantity of records is likely to take a long, long time.
The query would look like:
insert into newversionoftable(invid, vendorid, cost, timestamp, chdeleted)
select invid, vendorid, cost, timestamp, chdeleted
from table
group by invid, vendorid, cost, timestamp, chdeleted
If you do opt for a delete, I would suggest:
(1) Fix the code first, so no duplicates are going in.
(2) Determine the duplicate ids and place them in a separate table.
(3) Delete in batches.
To find the duplicate ids, use something like:
select *
from (select id,
row_number() over (partition by invid, vendorid, cost, timestamp, chdeleted order by timestamp) as seqnum
from table
) t
where seqnum > 1
If you want to keep the most recent version instead, then use "timestamp desc" in the order by clause.