Issue with the repeated records in SQL - sql

My dataset looks like below:
I am trying to get Min start date & Max end date of an employee whenever there is a team change.
The problem is that I do not get separate dates when an employee returns to a team they were in before (a repeated team).
Any help would be appreciated.

Teradata has a nice SQL extension for normalizing overlapping date ranges. This assumes that you want to get extra rows when a month is missing, i.e. there's a gap:
SELECT
emp_id
,team
-- split the Period into separate columns again
,Begin(pd)
,last_day(add_months(End(pd),-1)) -- end of previous month
FROM
(
SELECT NORMALIZE -- normalize overlapping periods
emp_id
,team
-- NORMALIZE only works with periods, so create a Period based on current date plus one month
,PERIOD(month_end_date
,last_day(add_months(month_end_date, 1))
) AS pd
FROM vt
) AS dt;
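For readers without Teradata, here is a rough Python sketch of what NORMALIZE does to the derived periods: rows with the same key whose periods meet or overlap are merged into one. The function name normalize, the sample rows and the half-open [begin, end) convention are assumptions made for illustration; this is not Teradata's implementation.

```python
from datetime import date


def normalize(periods):
    """Merge half-open [begin, end) periods that meet or overlap, per key."""
    merged = []
    for key, begin, end in sorted(periods):
        if merged and merged[-1][0] == key and begin <= merged[-1][2]:
            # Same key and the period meets or overlaps the previous one: extend it.
            merged[-1][2] = max(merged[-1][2], end)
        else:
            merged.append([key, begin, end])
    return [tuple(m) for m in merged]


# Hypothetical rows: (emp_id, team), period begin, period end (exclusive).
periods = [
    ((1, "A"), date(2020, 1, 31), date(2020, 2, 29)),
    ((1, "A"), date(2020, 2, 29), date(2020, 3, 31)),
    ((1, "B"), date(2020, 3, 31), date(2020, 4, 30)),
    ((1, "A"), date(2020, 5, 31), date(2020, 6, 30)),
]

for row in normalize(periods):
    print(row)
```

The two adjacent team A months collapse into one period, while the later return to team A stays a separate row because of the gap.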

If I understand correctly, this is a gaps-and-islands problem that can be solved using the difference of row number.
You can use:
select emp_id, team, min(month_end_date), max(month_end_date)
from (select t.*,
row_number() over (partition by emp_id order by month_end_date) as seqnum,
row_number() over (partition by emp_id, team order by month_end_date) as seqnum_2
from t
) t
group by emp_id, team, (seqnum - seqnum_2);
Note: This puts the dates on a single row, which seems more useful than your expected results.
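To see why the difference of row numbers separates a repeated team, here is a small sketch that runs the same query shape against an in-memory SQLite database from Python. The sample rows and the column aliases start_date and end_date are made up for illustration, and it assumes an SQLite build with window functions (3.25 or later).

```python
import sqlite3

# Hypothetical data: employee 1 is in team A, moves to B, then returns to A.
rows = [
    (1, "A", "2020-01-31"),
    (1, "A", "2020-02-29"),
    (1, "B", "2020-03-31"),
    (1, "B", "2020-04-30"),
    (1, "A", "2020-05-31"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (emp_id INTEGER, team TEXT, month_end_date TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)

query = """
SELECT emp_id, team,
       MIN(month_end_date) AS start_date,
       MAX(month_end_date) AS end_date
FROM (SELECT t.*,
             ROW_NUMBER() OVER (PARTITION BY emp_id ORDER BY month_end_date) AS seqnum,
             ROW_NUMBER() OVER (PARTITION BY emp_id, team ORDER BY month_end_date) AS seqnum_2
      FROM t) AS x
GROUP BY emp_id, team, (seqnum - seqnum_2)
ORDER BY start_date
"""
islands = conn.execute(query).fetchall()
for island in islands:
    print(island)
```

Within one unbroken run of a team both counters advance together, so their difference stays constant. When the employee comes back to team A, seqnum has moved on by the two team B rows, so the difference changes and the second stint lands in its own group.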

Related

How to conditional SQL select

My table consists of user_id, revenue, publish_month columns.
Right now I use group_by user_id and sum(revenue) to get revenue for all individual users.
Is there a single SQL query I can use to query for user revenue across a time period conditionally? If for a specific user, there is a row for this month, I want to query for this month, last month and the month before. If there is not yet a row for this month, I want to query for last month and the two months before.
Any advice on which approach to take would be helpful. Should I be using CASE expressions, IF/ELSE with EXISTS, or is this doable with a single SQL query?
UPDATE: Since I did a bad job of describing the question, I have added some example data and expected results.
Where current month is not present for user 33
Where current month is present
Assuming publish_month is a DATE datatype, this should get the most recent three months of data per user...
SELECT
user_id, SUM(revenue) as s_revenue
FROM
(
SELECT
user_id, revenue, publish_month,
MAX(publish_month) OVER (PARTITION BY user_id) AS user_latest_publish_month
FROM
yourtableyoudidnotname
)
summarised
WHERE
publish_month >= DATEADD(month, -2, user_latest_publish_month)
GROUP BY
user_id
If you want to limit that to the most recent 3 months out of the last 4 calendar months, just add AND publish_month >= DATEADD(month, -3, DATE_TRUNC(month, GETDATE()))
The ambiguity here is why it is important to include a Minimal Reproducible Example.
With input data and required results, we could test our code against your requirements.
If you're using strings for the publish_month, you shouldn't be, and should fix that with utmost urgency.
You can use a windowing function to "number" the months. In this way the most recent one will have a value of 1, the prior 2, and the one before 3. Then you can only select the items with a number of 3 or less.
Here is how:
SELECT user_id, revenue, publish_month,
ROW_NUMBER() OVER(PARTITION BY user_id ORDER BY publish_month DESC) as RN
FROM yourtableyoudidnotname
Now you just select the rows with RN of 3 or less and do your sum:
SELECT user_id, SUM(revenue) as s_revenue
FROM (
SELECT user_id, revenue, publish_month,
ROW_NUMBER() OVER(PARTITION BY user_id ORDER BY publish_month DESC) as RN
FROM yourtableyoudidnotname
) X
WHERE RN <= 3
GROUP BY user_id
You could also do this without a sub query if you use the windowing function for SUM and a range, but I think this is easier to understand.
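A quick way to try the ROW_NUMBER() version is an in-memory SQLite database driven from Python. This is only a sketch: the table name user_revenue and the sample rows are invented, publish_month is stored as ISO date text so that it sorts chronologically, and the SQLite build needs window function support (3.25 or later).

```python
import sqlite3

# Hypothetical data: user 33 has no row for the latest month, user 44 does.
data = [
    (33, 10, "2022-10-01"), (33, 20, "2022-11-01"),
    (33, 30, "2022-12-01"), (33, 40, "2023-01-01"),
    (44, 5, "2022-11-01"), (44, 6, "2022-12-01"),
    (44, 7, "2023-01-01"), (44, 8, "2023-02-01"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_revenue (user_id INTEGER, revenue INTEGER, publish_month TEXT)"
)
conn.executemany("INSERT INTO user_revenue VALUES (?, ?, ?)", data)

query = """
SELECT user_id, SUM(revenue) AS s_revenue
FROM (SELECT user_id, revenue, publish_month,
             ROW_NUMBER() OVER (PARTITION BY user_id
                                ORDER BY publish_month DESC) AS rn
      FROM user_revenue) AS x
WHERE rn <= 3
GROUP BY user_id
ORDER BY user_id
"""
totals = conn.execute(query).fetchall()
print(totals)
```

Keep in mind that this counts rows, not calendar months: if a user skipped a month, the three newest rows reach further back than three months.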
From the comments: there could be an issue if you have months from more than one year and publish_month holds only the month number (this assumes a separate publish_year column). To solve this, make the largest number in the ORDER BY always the most recent month. So instead of
ORDER BY publish_month DESC
you would have
ORDER BY (100*publish_year)+publish_month DESC
This means more recent years always have a higher number: January of 2023 will be 202301, while December of 2022 will be 202212. Since January has the bigger number, it gets a row number of 1 and December gets a row number of 2.
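The arithmetic behind that sort key can be checked in a few lines of Python. The helper name year_month_key and the sample pairs are made up for illustration.

```python
def year_month_key(publish_year, publish_month):
    """Combine year and month (1-12) into one sortable integer such as 202301."""
    return 100 * publish_year + publish_month


# Hypothetical (year, month) pairs spanning a year boundary.
months = [(2022, 11), (2023, 1), (2022, 12)]
newest_first = sorted(months, key=lambda ym: year_month_key(*ym), reverse=True)
print(newest_first)
```

Multiplying the year by 100 leaves room for the two-digit month, so comparing the combined numbers is the same as comparing year first and month second.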

Teradata SQL help. Need help getting the start date and end date (yellow) of the most recent employment status. Thank you

Teradata SQL help. Need help getting the start date and end date (yellow) of the most recent employment status. Thank you.
There are several ways to get your expected result. Based on your data, you might apply Teradata's NORMALIZE option, a SQL extension that combines overlapping periods:
SELECT NAME, job_title, status, next(Begin(pd)) AS start_date, End(pd) AS end_date
FROM
( -- returns one group for consecutive overlapping rows
SELECT NORMALIZE
name,
job_title,
status,
-- need to subtract 1 to create a valid period
PERIOD(prior(start_date), end_date) AS pd
FROM tab
) AS dt
QUALIFY
-- return the latest row only
Row_Number()
Over(PARTITION BY name
ORDER BY start_date DESC) = 1
Caution: This returns a new group whenever name, job_title or status changes.
Give this a try:
SELECT
t.name,
MAX(t.job_title) AS job_title,
status,
MIN(t.start_date) AS start_date,
MAX(t.end_date) AS end_date
FROM mytable t
GROUP BY t.name, t.status
QUALIFY RANK() OVER(PARTITION BY t.name ORDER BY MIN(t.start_date) DESC) = 1
Teradata will do the aggregates first and then apply the window function. So, this will first get the MIN/MAX dates within each person's status and then assign a RANK to each of these rows based on the most recent start_date.
I don't have a system to test, but give it a try and let me know.

SQL - Grouping a 3 Column List Issue

I have a list of values in a database. There are many redundancies and I want to get rid of them. As you can see in the list below, dates [10/1/2011 - 7/1/2011) have a value of 0. I can make that into one entry with a start date of 10/1/2011 and an end date of 6/1/2011 and a value of 0 and delete all the other rows. I can do that for all the other similar values as well.
Here is my problem. I did this by writing a query that groups these together and then takes the Min(start date) as the start date and the Max(end date) as the end date. Notice that I have two groups of 0 though. When I group this in the query, the start date is 10/1/2010 and the end date is 2/1/2013. This is a problem elsewhere in my code because whenever it looks for a value at 2/1/2012 it finds 0 but it should be finding .955186.
Any suggestions on how I can write a query to account for this problem?
This is a "gaps-and-islands" problem.
If I assume that the first column is sufficient for defining the groups, then you can use a difference of row_number()s:
select min(startdate), max(enddate), value
from (select t.*,
row_number() over (order by startdate) as seqnum,
row_number() over (partition by value order by startdate) as seqnum_v
from t
) t
group by (seqnum - seqnum_v), value;
This is a gaps-and-islands problem. You may use the following query (written in SQL Server syntax, but it can easily be adapted).
select min(startdate) startDate, max(enddate) endDate, value
from
(
select *,
row_number() over (partition by value order by startDate) - (year(startDate) * 12) - month(startDate) grp
from data
) t
group by value, grp
order by startDate
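It uses just one row_number(), which may be better than two since the DBMS may not need a second sort to generate the sequences. It assumes one row per calendar month: the group key stays constant only while each successive row for a value advances by exactly one month.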
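Here is a sketch of the single row_number() variant, ported to SQLite and run from Python so the grouping can be inspected. The sample rows are invented (one row per month, with the value 0 appearing in two separate runs), strftime stands in for SQL Server's year() and month(), and SQLite 3.25 or later is assumed for window functions.

```python
import sqlite3

# Hypothetical monthly rows: value 0, then 0.955186, then 0 again.
rows = [
    ("2010-10-01", "2010-11-01", 0.0),
    ("2010-11-01", "2010-12-01", 0.0),
    ("2010-12-01", "2011-01-01", 0.0),
    ("2011-01-01", "2011-02-01", 0.955186),
    ("2011-02-01", "2011-03-01", 0.955186),
    ("2011-03-01", "2011-04-01", 0.0),
    ("2011-04-01", "2011-05-01", 0.0),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (startdate TEXT, enddate TEXT, value REAL)")
conn.executemany("INSERT INTO data VALUES (?, ?, ?)", rows)

query = """
SELECT MIN(startdate) AS first_start, MAX(enddate) AS last_end, value
FROM (SELECT d.*,
             ROW_NUMBER() OVER (PARTITION BY value ORDER BY startdate)
             - (CAST(strftime('%Y', startdate) AS INTEGER) * 12
                + CAST(strftime('%m', startdate) AS INTEGER)) AS grp
      FROM data AS d) AS x
GROUP BY value, grp
ORDER BY first_start
"""
islands = conn.execute(query).fetchall()
for island in islands:
    print(island)
```

The month index grows by one per row inside a run, just like the row number, so their difference is constant. Across the gap between the two runs of 0 the month index jumps by more than one, which yields a different grp and keeps the runs apart.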

Redshift list 3 most recent values per year

I have a column of dates and I want to find the three maximum dates for each year. I have tried the following:
select max(date, rank() over (partition by SPLIT_PART(date, '-', 1) order by date desc)
from table
;
My desired output would be
2013,2010-12-31
2013,2010-12-30
2013,2010-12-29
Also, there are repeated dates in the table, so I would have to filter those out as well.
Assuming there are no duplicate dates, you can partition by the year part of date and get the latest 3 dates per year. Use distinct (if needed) in the final query to remove the duplicates, if any.
select yr,date
from (select date_part(year,date) as yr,date
,dense_rank() over (partition by date_part(year,date) order by date desc) as rnk
from table
) t
where rnk<=3
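The interaction between dense_rank() and duplicate dates is easy to check with a small SQLite sketch run from Python. The table name dates, the column name d and the sample values are hypothetical, strftime('%Y', ...) stands in for Redshift's date_part, and SQLite 3.25 or later is assumed.

```python
import sqlite3

# Hypothetical dates; 2010-12-31 appears twice on purpose.
values = [
    ("2010-12-31",), ("2010-12-31",), ("2010-12-30",),
    ("2010-12-29",), ("2010-12-28",),
    ("2011-06-01",), ("2011-05-01",),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dates (d TEXT)")
conn.executemany("INSERT INTO dates VALUES (?)", values)

query = """
SELECT DISTINCT yr, d
FROM (SELECT strftime('%Y', d) AS yr, d,
             DENSE_RANK() OVER (PARTITION BY strftime('%Y', d)
                                ORDER BY d DESC) AS rnk
      FROM dates) AS x
WHERE rnk <= 3
ORDER BY yr, d DESC
"""
top_dates = conn.execute(query).fetchall()
for row in top_dates:
    print(row)
```

Because dense_rank() gives tied dates the same rank without leaving gaps, the duplicate does not push 2010-12-29 out of the top three, and DISTINCT then collapses the repeated row.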

Last day of the month with a twist in SQLPLUS

I would appreciate a little expert help please.
In a SQL SELECT statement, I am trying to get the last day with data per month for the last year.
For example, I can easily get the last day of each month and join that to my data table, but if the last day of the month has no data, nothing is returned for that month. What I need is for the SELECT to return the last day with data for the month.
This is probably easy to do, but to be honest, my brain fart is starting to hurt.
I've attached the select below that works for returning the data for only the last day of the month for the last 12 months.
Thanks in advance for your help!
SELECT fd.cust_id,fd.server_name,fd.instance_name,
TRUNC(fd.coll_date) AS coll_date,fd.column_name
FROM super_table fd,
(SELECT TRUNC(daterange,'MM')-1 first_of_month
FROM (
select TRUNC(sysdate-365,'MM') + level as DateRange
from dual
connect by level<=365)
GROUP BY TRUNC(daterange,'MM')) fom
WHERE fd.cust_id = :CUST_ID
AND fd.coll_date > SYSDATE-400
AND TRUNC(fd.coll_date) = fom.first_of_month
GROUP BY fd.cust_id,fd.server_name,fd.instance_name,
TRUNC(fd.coll_date),fd.column_name
ORDER BY fd.server_name,fd.instance_name,TRUNC(fd.coll_date)
You probably need to group your data so that each month's data is in the group, and then within the group select the maximum date present. The sub-query might be:
SELECT MAX(coll_date) AS last_day_of_month
FROM Super_Table AS fd
GROUP BY YEAR(coll_date) * 100 + MONTH(coll_date);
This presumes that the functions YEAR() and MONTH() exist to extract the year and month from a date as an integer value. Clearly, this doesn't constrain the range of dates - you can do that, too. If you don't have the functions in Oracle, then you do some sort of manipulation to get the equivalent result.
Using information from Rhose (thanks):
SELECT MAX(coll_date) AS last_day_of_month
FROM Super_Table fd
GROUP BY TO_CHAR(coll_date, 'YYYYMM');
This achieves the same net result, putting all dates from the same calendar month into a group and then determining the maximum value present within that group.
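The grouping idea can be tried outside Oracle with a small SQLite sketch run from Python. The sample dates are made up, and strftime('%Y%m', ...) plays the role of TO_CHAR(coll_date, 'YYYYMM').

```python
import sqlite3

# Hypothetical collection dates; January has no data on the 31st.
values = [
    ("2023-01-10",), ("2023-01-27",),
    ("2023-02-28",),
    ("2023-03-05",), ("2023-03-30",),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE super_table (coll_date TEXT)")
conn.executemany("INSERT INTO super_table VALUES (?)", values)

query = """
SELECT MAX(coll_date) AS last_day_with_data
FROM super_table
GROUP BY strftime('%Y%m', coll_date)
ORDER BY last_day_with_data
"""
last_days = [row[0] for row in conn.execute(query)]
print(last_days)
```

Each month contributes its latest date that actually has a row, whether or not that is the calendar month end.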
Here's another approach, if ANSI row_number() is supported:
with RevDayRanked(itemDate,rn) as (
select
cast(coll_date as date),
row_number() over (
partition by datediff(month,coll_date,'2000-01-01') -- rewrite datediff as needed for your platform
order by coll_date desc
)
from super_table
)
select itemDate
from RevDayRanked
where rn = 1;
Rows numbered 1 will be nondeterministically chosen among rows on the last active date of the month, so you don't need distinct. If you want information out of the table for all rows on these dates, use rank() over days instead of row_number() over coll_date values, so a value of 1 appears for any row on the last active date of the month, and select the additional columns you need:
with RevDayRanked(cust_id, server_name, coll_date, rk) as (
select
cust_id, server_name, coll_date,
rank() over (
partition by datediff(month,coll_date,'2000-01-01')
order by cast(coll_date as date) desc
)
from super_table
)
select cust_id, server_name, coll_date
from RevDayRanked
where rk = 1;
If row_number() and rank() aren't supported, another approach is this (for the second query above). Select all rows from your table for which there's no row in the table from a later day in the same month.
select
cust_id, server_name, coll_date
from super_table as ST1
where not exists (
select *
from super_table as ST2
where datediff(month,ST1.coll_date,ST2.coll_date) = 0
and cast(ST2.coll_date as date) > cast(ST1.coll_date as date)
)
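As a sanity check of the NOT EXISTS formulation, here is a sketch against SQLite run from Python. The sample rows are invented, and datediff(month, ...) = 0 is replaced by comparing strftime('%Y%m', ...) values, with date() standing in for cast(... as date).

```python
import sqlite3

# Hypothetical rows: two servers report on the last active day of January.
rows = [
    (1, "srvA", "2023-01-10 08:00:00"),
    (1, "srvA", "2023-01-27 08:00:00"),
    (1, "srvB", "2023-01-27 09:00:00"),
    (1, "srvA", "2023-02-01 08:00:00"),
    (1, "srvA", "2023-02-28 08:00:00"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE super_table (cust_id INTEGER, server_name TEXT, coll_date TEXT)"
)
conn.executemany("INSERT INTO super_table VALUES (?, ?, ?)", rows)

query = """
SELECT cust_id, server_name, coll_date
FROM super_table AS st1
WHERE NOT EXISTS (
    SELECT 1
    FROM super_table AS st2
    WHERE strftime('%Y%m', st2.coll_date) = strftime('%Y%m', st1.coll_date)
      AND date(st2.coll_date) > date(st1.coll_date)
)
ORDER BY coll_date, server_name
"""
last_day_rows = conn.execute(query).fetchall()
for row in last_day_rows:
    print(row)
```

Both rows from 27 January survive because neither has a later day in the same month, which matches the behaviour of the rank() version.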
If you have to do this kind of thing a lot, see if you can create an index over computed columns that hold cast(coll_date as date) and a month indicator like datediff(month,'2001-01-01',coll_date). That will make more of the predicates sargable, meaning the optimizer can use an index to evaluate them.
Putting the above pieces together, would something like this work for you?
SELECT fd.cust_id,
fd.server_name,
fd.instance_name,
TRUNC(fd.coll_date) AS coll_date,
fd.column_name
FROM super_table fd
WHERE fd.cust_id = :CUST_ID
AND TRUNC(fd.coll_date) IN (
SELECT MAX(TRUNC(coll_date))
FROM super_table
WHERE coll_date > SYSDATE - 400
AND cust_id = :CUST_ID
GROUP BY TO_CHAR(coll_date,'YYYYMM')
)
GROUP BY fd.cust_id,fd.server_name,fd.instance_name,TRUNC(fd.coll_date),fd.column_name
ORDER BY fd.server_name,fd.instance_name,TRUNC(fd.coll_date)