I have a list of values in a database. There are many redundancies and I want to get rid of them. As you can see in the list below, dates [10/1/2010 - 7/1/2011) have a value of 0. I can make that into one entry with a start date of 10/1/2010, an end date of 6/1/2011, and a value of 0, and delete all the other rows. I can do that for all the other repeated values as well.
Here is my problem. I did this by writing a query that groups these together and then takes the Min(start date) as the start date and the Max(end date) as the end date. Notice that I have two groups of 0 though. When I group this in the query, the start date is 10/1/2010 and the end date is 2/1/2013. This is a problem elsewhere in my code because whenever it looks for a value at 2/1/2012 it finds 0 but it should be finding .955186.
Any suggestions on how I can write a query to account for this problem?
This is a "gaps-and-islands" problem.
If I assume that the first column is sufficient for defining the groups, then you can use a difference of row_number()s:
select min(startdate), max(enddate), value
from (select t.*,
             row_number() over (order by startdate) as seqnum,
             row_number() over (partition by value order by startdate) as seqnum_v
      from t
     ) t
group by (seqnum - seqnum_v), value;
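To see why the difference of the two row_number()s identifies each island, here is a minimal, self-contained sketch; the dates and values are made up for illustration and only loosely follow the question:
with t(startdate, value) as (
    select cast('2010-10-01' as date), 0        union all
    select cast('2010-11-01' as date), 0        union all
    select cast('2010-12-01' as date), 0.955186 union all
    select cast('2011-01-01' as date), 0.955186 union all
    select cast('2011-02-01' as date), 0
)
select startdate, value,
       row_number() over (order by startdate) as seqnum,
       row_number() over (partition by value order by startdate) as seqnum_v,
       row_number() over (order by startdate)
         - row_number() over (partition by value order by startdate) as diff
from t
order by startdate;
-- The first two rows (value 0) get diff 0, the next two rows (value 0.955186) get diff 2,
-- and the final row (value 0 again) also gets diff 2. Grouping on (diff, value) therefore
-- keeps the second run of 0 separate from both the first run of 0 and the 0.955186 island.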
It is a gaps-and-islands problem. You may use the following query (SQL Server syntax, but it can easily be adapted).
select min(startdate) startDate, max(enddate) endDate, value
from
(
    select *,
           row_number() over (partition by value order by startDate)
               - (year(startDate) * 12) - month(startDate) grp
    from data
) t
group by value, grp
order by startDate
It uses just one row_number(), which may perform better than two since the DBMS does not have to make two passes over the data to generate the sequences.
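As a hypothetical illustration of the grp arithmetic: two consecutive months with the same value, say 3/1/2012 and 4/1/2012 with row numbers 5 and 6 inside that value's partition, give grp = 5 - (2012 * 12) - 3 = -24142 and grp = 6 - (2012 * 12) - 4 = -24142, so they fall into the same group; a skipped month or a change of value breaks the equality and starts a new island.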
My dataset looks like below:
I am trying to get the Min start date & Max end date of an employee whenever there is a team change.
The problem here is that the dates do not come out correctly when a team is repeated.
Any help would be appreciated.
Teradata has a nice SQL extension for normalizing overlapping date ranges. This assumes that you want to get extra rows when a month is missing, i.e. there's a gap:
SELECT
   emp_id
  ,team
  -- split the period into separate columns again
  ,Begin(pd)
  ,last_day(add_months(End(pd), -1)) -- end of previous month
FROM
 (
   SELECT NORMALIZE -- normalize overlapping periods
      emp_id
     ,team
     -- NORMALIZE only works with periods, so create a period based on the current row's date plus one month
     ,PERIOD(month_end_date
            ,last_day(add_months(month_end_date, 1))
            ) AS pd
   FROM vt
 ) AS dt;
If I understand correctly, this is a gaps-and-islands problem that can be solved using the difference of row numbers.
You can use:
select emp_id, team, min(month_end_date), max(month_end_date)
from (select t.*,
             row_number() over (partition by emp_id order by month_end_date) as seqnum,
             row_number() over (partition by emp_id, team order by month_end_date) as seqnum_2
      from t
     ) t
group by emp_id, team, (seqnum - seqnum_2);
Note: This puts the dates on a single row, which seems more useful than your expected results.
Working on migrating old system data to a new system.
I need to group the data based on id and name.
We need the start date as the min date and the end date as the max date.
If any id and name combination falls under the same period, we can avoid the duplicates and choose the lowest to highest date.
Legacy System
New System Expectation
ID - 139247 contains duplicate rows based on name.
Added data in - https://dbfiddle.uk/?rdbms=oracle_18&fiddle=8d6877847c5e052adf703430b5c7f083
Please let me know if more details are needed. Thanks in advance.
This is a type of gaps-and-islands problem. Because you want to combine any overlaps, I would go for a cumulative max of the previous enddate to determine where the islands begin:
select id, name, min(startdate) as startdate,
       -- if any row in the island has a null enddate the island is still open, so return null;
       -- otherwise take the latest enddate
       (case when count(enddate) = count(*) then max(enddate)
        end) as enddate
from (select t.*,
             sum(case when prev_enddate >= startdate then 0 else 1 end) over
                 (partition by id, name order by startdate) as grp
      from (select t.*,
                   max(enddate) over (partition by id, name
                                      order by startdate
                                      range between unbounded preceding and interval '1' day preceding
                                     ) as prev_enddate
            from t
           ) t
     ) t
group by id, name, grp
order by name, startdate;
Here is a db<>fiddle.
I have a table with a foreign_key_id column and a date column.
For each row that has the same foreign key, there is a different date, and if I order by foreign_key_id, date, 90% of the time all the dates are consecutive.
There are some edge cases though, where there are multiple entries with the same foreign_key that don't have consecutive dates.
Trying to come up with an easy way to identify all the foreign_key_ids that don't have consecutive dates. Any ideas?
I was thinking of left joining onto a generated series, somehow partitioning by track id, but I keep hitting a mental wall. My SQL query editor keeps crashing, so that is adding some unrelated frustration.
EDIT:
I ended up doing an order by foreign_key_id, date, copying and pasting the result into Excel, and then finding what I needed with this type of logic formula:
=IF( (B91 = B90), (F91 =(F90 + 1)) , 1 ) , where b is the foreign key column and F is the date column
but I'm wondering if something similar could be done in SQL. Here's what I had when I gave up and went to Excel:
select to_char(date_range.days, 'yyyy-mm-dd') as x
     , data.*
from (
    select generate_series('2019-04-30'::date, '2019-11-05'::date, '1 day')::date as days
) as date_range
left join (
    select foreign_key_id, date
    from table_a
    order by foreign_key_id, date
) data on data.date = date_range.days
where foreign_key_id is null
You could do that, sure, but no joins are needed. Use LAG(datecol) OVER (PARTITION BY foreignkeycol ORDER BY datecol) to get the date of the previous row for the same foreign key, diff it against the current date to see how many intervals (days? minutes?) have passed since then, and then wrap it all in something that does WHERE thedifference <> 1 (or however you define consecutive: if consecutive to you is "every 2 days", then it would be anything that doesn't have a difference of 2).
If you want both rows on either side of the gap, use LEAD (same format as LAG) to get the next date and calculate two diffs, then do WHERE difftoprev <> 1 OR difftonext <> 1, etc.
It would look something like this (untested):
WITH cte AS (
    SELECT foreignkeycol, datecol,
           LAG(datecol)  OVER (PARTITION BY foreignkeycol ORDER BY datecol) AS prevdate,
           LEAD(datecol) OVER (PARTITION BY foreignkeycol ORDER BY datecol) AS nextdate
    FROM table
)
SELECT *
FROM cte
WHERE DATE_PART('day', datecol - prevdate) <> 1 OR
      DATE_PART('day', nextdate - datecol) <> 1
I would use lead():
select t.*
from (select t.*,
             lead(date) over (partition by foreign_key_id order by date) as next_date
      from t
     ) t
where next_date <> date + interval '1 day';
This will provide each row where the next row does not have the expected date.
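If you only need the list of offending foreign_key_ids rather than the individual rows, you can wrap the same idea in a select distinct. A sketch, reusing the placeholder table name t from the query above:
select distinct foreign_key_id
from (select t.*,
             lead(date) over (partition by foreign_key_id order by date) as next_date
      from t
     ) t
where next_date <> date + interval '1 day';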
I've got duplicated rows in a temp table, mainly because there are some date values which are only seconds/milliseconds apart from each other.
For example:
2018-08-30 12:30:19.000
2018-08-30 12:30:20.000
This is what causes the duplication.
How can I keep only one of those values? Let's say the higher one?
Thank you.
Well, one method is to use lead():
select t.*
from (select t.*, lead(ts) over (order by ts) as next_ts
      from t
     ) t
where next_ts is null or
      datediff(second, ts, next_ts) > 60; -- or whatever threshold you want
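If the goal is to actually delete the lower of each near-duplicate pair from the temp table, rather than just select the survivors, one option is to delete through an updatable CTE built on the same windowed query. This is only a sketch (SQL Server syntax, same placeholder table t and column ts as above):
with cte as (
    select ts,
           lead(ts) over (order by ts) as next_ts
    from t
)
delete from cte
where next_ts is not null
  and datediff(second, ts, next_ts) <= 60;  -- same threshold as in the select above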
You could assign a Row_Number to each value, as follows:
Select *
     , Row_Number() over
         (partition by ObjectID, cast(date as date)... -- whichever criteria you want to consider duplicates
          order by date desc) -- assign the latest date to row 1; you may want other order criteria if you might have ties on this field
       as RN
from MyTable
Then retain only the rows where RN = 1 to remove duplicates. See this answer for examples of how to round your dates to the nearest hour, minute, etc. as needed; I used truncating to the day above as an example.
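For completeness, a sketch of that filtering step (placeholder names taken from the query above; the PARTITION BY should list whatever columns define a duplicate for you):
with numbered as (
    select *,
           row_number() over (partition by ObjectID, cast(date as date)
                              order by date desc) as RN
    from MyTable
)
select *
from numbered
where RN = 1;   -- keep only the latest row in each duplicate group
-- or, to remove the duplicates from the table itself:
-- delete from numbered where RN > 1;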
I have source data at the day granularity and I need to aggregate it to week granularity. Most fields are easy sum aggregations. But, I have one field that I need to take Sunday's value (kinda like a "first" aggregation) and another field that I need to take Saturday's value.
The road I'm going down using SSIS is to Multicast my source data three times, doing a regular Aggregate for the easy fields, and then using lookup joins to a calendar table to match the other two to Saturday and Sunday respectively to grab those values.... then merge joining everything back together.
Is there a better way to do this?
example source data:
What the output should look like:
Is there a better way to do this? Yes. Don't use a complicated SSIS solution for something that is a simple SQL statement:
SELECT
    DATEPART(wk, Day) AS [Week],
    SUM(Sales) AS Sales,
    MAX(
        CASE WHEN DATEPART(dw, Day) = 1 THEN BOP ELSE NULL END
    ) AS BOP,
    MAX(
        CASE WHEN DATEPART(dw, Day) = 7 THEN EOP ELSE NULL END
    ) AS EOP
FROM Table
GROUP BY DATEPART(wk, Day)
You might need to tweak the 1 and 7 depending on your server settings but hopefully you get the idea.
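For reference, here is a small sketch (SQL Server, with arbitrary example dates) of how to check that setting and of an expression that yields the same weekday number regardless of DATEFIRST:
-- @@DATEFIRST tells you where the week currently starts (7 = Sunday, the US default).
SELECT @@DATEFIRST AS current_datefirst,
       DATEPART(dw, CAST('2023-01-01' AS date)) AS raw_dw,                             -- varies with DATEFIRST
       (DATEPART(dw, CAST('2023-01-01' AS date)) + @@DATEFIRST) % 7 AS sunday_check,   -- always 1 (2023-01-01 is a Sunday)
       (DATEPART(dw, CAST('2023-01-07' AS date)) + @@DATEFIRST) % 7 AS saturday_check; -- always 0 (2023-01-07 is a Saturday)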
You can use First_value and Last_Value for this as below:
select top 1 with ties
       datepart(week, [day]) as [Week],
       sum(sales) over (partition by datepart(week, [day])) as Sales,
       FIRST_VALUE(BOP) over (partition by datepart(week, [day]) order by [day]) as BOP,
       EOP = LAST_VALUE(EOP) over (partition by datepart(week, [day]) order by [day]
                                   range between current row and unbounded following)
from #youraggregate
order by row_number() over (partition by datepart(week, [day]) order by [day])
Use a Derived Column transformation to get the week first:
DATEPART("wk", Day)
After that, use an Aggregate transformation grouping on the Week column.