Return overlapping date records in SQL - sql

I used the following query to fetch the overlapping records in SQL:
SELECT QUOTE_ID,FUNCTION_ID,FUNCTION_DT,FUNC_SPACE_ID,FN_START_TIME,FN_END_TIME,DATE_AUTH_LEVEL
FROM R_13_ALL_RESERVED A
WHERE
A.FUNC_SPACE_ID = '401-ZFU-52'
AND A.FUNCTION_DT = TO_DATE('09/03/2015','MM/DD/YYYY')
AND EXISTS ( SELECT 'X'
FROM R_13_ALL_RESERVED B
WHERE A.PROPERTY = B.PROPERTY
AND A.FUNCTION_DT = B.FUNCTION_DT
AND A.FUNCTION_ID <> B.FUNCTION_ID
AND ( ( A.FN_START_TIME > B.FN_START_TIME
AND A.FN_START_TIME < B.FN_END_TIME)
OR ( B.FN_START_TIME > A.FN_START_TIME
AND B.FN_START_TIME < A.FN_END_TIME)
OR ( A.FN_START_TIME = B.FN_START_TIME
AND A.FN_END_TIME = B.FN_END_TIME)
)
)
But even though the dates are not overlapping, it still returns the records as overlapping.
Am I missing something here?
Also, if the date records overlap, I need to compare the count of function_id records with DATE_AUTH_LEVEL. For example, if 2 function_id records overlap, the count of function_id is 2; if DATE_AUTH_LEVEL is 1, such records should be in the result set.
Please find the data set in SQLFiddle
http://sqlfiddle.com/#!9/95874/1
Desired Output: The SQL should return overlapping FN_START_TIME and FN_END_TIME for a function_space_id and its function_dt
In the provided example, rows 5 and 6 overlap for the function space id '401-ZFU-12' and function_dt 'August, 15 2015', and all others are not overlapping

The simplest predicate (where clause condition) for detecting the overlap of two ranges is to compare the start of the first range with the end of the 2nd range, and the start of the 2nd range with the end of the first range:
WHERE R1.Start_Date <= R2.End_Date
AND R2.Start_Date <= R1.End_Date
As you can see, each of the two inequalities compares a start value and an end value from separate records (R1 and R2, then R2 and R1). All that remains is to add the conditions that correlate the records and ensure that you aren't comparing a row to itself. So if you want to find all Common_IDs that have Distinct_IDs with overlapping date ranges:
select *
from Your_Table R1
where exists (select 1 from Your_Table R2
where R1.Common_ID = R2.Common_ID
and R1.Distinct_ID <> R2.Distinct_ID
and R1.Start_Date <= R2.End_Date
and R2.Start_Date <= R1.End_Date)
If there is no Distinct_ID to use, you can use R1.rowid <> R2.rowid in place of R1.Distinct_ID <> R2.Distinct_ID
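To see the predicate in action, here is a minimal sketch that runs the query above against an in-memory SQLite database from Python. The table name bookings and the sample rows are made up for illustration; dates are stored as ISO-formatted text so that plain string comparison orders them correctly.

```python
import sqlite3

# Hypothetical sample data: three bookings in room-1, one in room-2.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bookings (common_id TEXT, distinct_id INTEGER,
                           start_date TEXT, end_date TEXT);
    INSERT INTO bookings VALUES
        ('room-1', 1, '2015-08-15 09:00', '2015-08-15 11:00'),
        ('room-1', 2, '2015-08-15 10:00', '2015-08-15 12:00'),  -- overlaps 1
        ('room-1', 3, '2015-08-15 13:00', '2015-08-15 14:00'),  -- no overlap
        ('room-2', 4, '2015-08-15 09:30', '2015-08-15 10:30');  -- other room
""")

# Same shape as the EXISTS query above: correlate on the common id,
# exclude the row itself, then apply the two-inequality overlap test.
overlapping = [row[0] for row in conn.execute("""
    SELECT r1.distinct_id
    FROM bookings r1
    WHERE EXISTS (SELECT 1 FROM bookings r2
                  WHERE r1.common_id = r2.common_id
                    AND r1.distinct_id <> r2.distinct_id
                    AND r1.start_date <= r2.end_date
                    AND r2.start_date <= r1.end_date)
    ORDER BY r1.distinct_id
""")]
print(overlapping)
```

With <=, two ranges that merely touch (one ends exactly when the other starts) count as overlapping; switch to < if back-to-back bookings should be allowed.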

Here is an approach to troubleshooting the issue on your end.
My first suspicion is that your exists clause matches too broadly, so rows that pass the outer filters are returned even though nothing in the same space overlaps them. Likely there are rows that do not fall on the desired date or space id but share one component of their interval with your inner criteria.
Inspect the results of the inner select statement (the one within the exists clause) for an example row, replacing all the 'A' aliased values with actual values from one of the returned rows that you did not expect to receive.
Additionally, you can inspect what I think would be a semi join in the execution profile to see what the join criteria are. If you expect it to be filtered by a constant for 'FUNC_SPACE_ID' of '401-ZFU-52', you will discover that it is not.
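The effect of the missing space correlation can be reproduced on a small scale. The sketch below uses SQLite from Python with a hypothetical table reserved that borrows the question's column names; the three rows are invented. It runs the same EXISTS query twice, once without and once with a func_space_id condition inside the subquery.

```python
import sqlite3

# Hypothetical rows: functions 1 and 3 share a space and do not overlap;
# function 2 overlaps function 1 in time but is in a different space.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE reserved (property TEXT, function_id INTEGER, function_dt TEXT,
                           func_space_id TEXT, fn_start_time TEXT, fn_end_time TEXT);
    INSERT INTO reserved VALUES
        ('P1', 1, '2015-09-03', '401-ZFU-52', '09:00', '11:00'),
        ('P1', 2, '2015-09-03', '401-ZFU-99', '10:00', '12:00'),
        ('P1', 3, '2015-09-03', '401-ZFU-52', '13:00', '14:00');
""")

QUERY = """
    SELECT a.function_id
    FROM reserved a
    WHERE a.func_space_id = '401-ZFU-52'
      AND EXISTS (SELECT 1 FROM reserved b
                  WHERE a.property = b.property
                    AND a.function_dt = b.function_dt
                    AND a.function_id <> b.function_id
                    {space_filter}
                    AND a.fn_start_time < b.fn_end_time
                    AND b.fn_start_time < a.fn_end_time)
    ORDER BY a.function_id
"""

def overlapping_ids(space_filter):
    """Run the query with an optional extra condition inside the subquery."""
    return [row[0] for row in conn.execute(QUERY.format(space_filter=space_filter))]

without_space = overlapping_ids("")  # subquery ignores the space
with_space = overlapping_ids("AND a.func_space_id = b.func_space_id")
print(without_space, with_space)
```

Without the extra condition, function 1 is reported because it overlaps a booking in another space; with it, nothing is reported for this data.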

Related

SQL Filtering duplicate rows due to bad ETL

The database is Postgres but any SQL logic should help.
I am retrieving the set of sales quotations that contain a given product within the bill of materials. I'm doing that in two steps: step 1, retrieve all DISTINCT quote numbers which contain a given product (by product number).
The second step, retrieve the full quote, with all products listed for each unique quote number.
So far, so good. Now the tough bit. Some rows are duplicates, some are not. Those that are duplicates (quote number & quote version & line number) might or might not have maintenance on them. I want to pick the row that has maintenance greater than 0. The duplicate rows I want to exclude are those that have a 0 maintenance. The problem is that some rows, which have no duplicates, have 0 maintenance, so I can't just filter on maintenance.
To make this exciting, the database holds quotes spanning 20+ years. And the data science guys have just admitted that maybe the ETL process has some bugs...
--- step 0
--- cleanup the workspace
SET CLIENT_ENCODING TO 'UTF8';
DROP TABLE IF EXISTS product_quotes;
--- step 1
--- get list of Product Quotes
CREATE TEMPORARY TABLE product_quotes AS (
SELECT DISTINCT master_quote_number
FROM w_quote_line_d
WHERE item_number IN ( << model numbers >> )
);
--- step 2
--- Now join on that list
SELECT
d.quote_line_number,
d.item_number,
d.item_description,
d.item_quantity,
d.unit_of_measure,
f.ref_list_price_amount,
f.quote_amount_entered,
f.negtd_discount,
--- need to calculate discount rate based on list price and negtd discount (%)
CASE
WHEN ref_list_price_amount > 0
THEN 100 - (ref_list_price_amount + negtd_discount) / ref_list_price_amount *100
ELSE 0
END AS discount_percent,
f.warranty_months,
f.master_quote_number,
f.quote_version_number,
f.maintenance_months,
f.territory_wid,
f.district_wid,
f.sales_rep_wid,
f.sales_organization_wid,
f.install_at_customer_wid,
f.ship_to_customer_wid,
f.bill_to_customer_wid,
f.sold_to_customer_wid,
d.net_value,
d.deal_score,
f.transaction_date,
f.reporting_date
FROM w_quote_line_d d
INNER JOIN product_quotes pq ON (pq.master_quote_number = d.master_quote_number)
INNER JOIN w_quote_f f ON
(f.quote_line_number = d.quote_line_number
AND f.master_quote_number = d.master_quote_number
AND f.quote_version_number = d.quote_version_number)
WHERE d.net_value >= 0 AND item_quantity > 0
ORDER BY f.master_quote_number, f.quote_version_number, d.quote_line_number
The logic to filter the duplicate rows is like this:
For each master_quote_number / version_number pair, check to see if there are duplicate line numbers. If so, pick the one with maintenance > 0.
Even in a CASE statement, I'm not sure how to write that.
Thoughts? The database is Postgres but any SQL logic should help.
I think you will want to use Window Functions. They are, in a word, awesome.
Here is a query that would "dedupe" based on your criteria:
select *
from (
select
* -- simplifying here to show the important parts
,row_number() over (
partition by f.master_quote_number, f.quote_version_number, f.quote_line_number
order by f.maintenance_months desc) as seqnum
from w_quote_line_d d
inner join product_quotes pq
on (pq.master_quote_number = d.master_quote_number)
inner join w_quote_f f
on (f.quote_line_number = d.quote_line_number
and f.master_quote_number = d.master_quote_number
and f.quote_version_number = d.quote_version_number)
) x
where seqnum = 1
The use of row_number() with the chosen partition by and order by criteria guarantees that only ONE row for each combination of quote number, version number and line number gets the value 1, and it will be the one with the highest value in maintenance_months (if your colleagues are right, there would only be one with a value > 0 anyway). Note that the partition has to include the line number: partitioning by quote and version alone would keep just one line per quote version.
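As a quick check of the row_number() approach, here is a small sketch using SQLite from Python (window functions need SQLite 3.25 or later). The single table quote_lines and its rows are hypothetical stand-ins for the joined tables, and the partition uses quote number, version and line number, the duplicate key described in the question.

```python
import sqlite3

# Hypothetical, simplified stand-in for the joined quote tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE quote_lines (master_quote_number INTEGER,
                              quote_version_number INTEGER,
                              quote_line_number INTEGER,
                              maintenance_months INTEGER);
    INSERT INTO quote_lines VALUES
        (100, 1, 1, 0),    -- duplicate line without maintenance: drop
        (100, 1, 1, 12),   -- duplicate line with maintenance: keep
        (100, 1, 2, 0),    -- unique line with 0 maintenance: keep
        (200, 1, 1, 24);   -- unique line: keep
""")

# Number the rows within each (quote, version, line) group, highest
# maintenance first, and keep only the first row of every group.
deduped = conn.execute("""
    SELECT master_quote_number, quote_version_number,
           quote_line_number, maintenance_months
    FROM (SELECT q.*,
                 ROW_NUMBER() OVER (
                     PARTITION BY master_quote_number, quote_version_number,
                                  quote_line_number
                     ORDER BY maintenance_months DESC) AS seqnum
          FROM quote_lines q) x
    WHERE seqnum = 1
    ORDER BY master_quote_number, quote_version_number, quote_line_number
""").fetchall()
print(deduped)
```

The unique line with 0 maintenance survives because it is alone in its partition, which is the case a plain filter on maintenance > 0 would get wrong.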
Can you do something like...
select
*
from
w_quote_line_d d
inner join
(
select
...
,max(maintenance)
from
w_quote_line_d
group by
...
) d1
on
d1.id = d.id
and d1.maintenance = d.maintenance;
Am I understanding your problem correctly?
Edit: Forgot the group by!
I'm not sure, but maybe you could Group By all other columns and use MAX(Maintenance) to get only the greatest.
What do you think?

Fetch rows based on condition

I am using PostgreSQL on Amazon Redshift.
My table is :
drop table APP_Tax;
create temp table APP_Tax(APP_nm varchar(100),start timestamp,end1 timestamp);
insert into APP_Tax values('AFH','2018-01-26 00:39:51','2018-01-26 00:39:55'),
('AFH','2016-01-26 00:39:56','2016-01-26 00:40:01'),
('AFH','2016-01-26 00:40:05','2016-01-26 00:40:11'),
('AFH','2016-01-26 00:40:12','2016-01-26 00:40:15'), --row x
('AFH','2016-01-26 00:40:35','2016-01-26 00:41:34') --row y
Expected output:
'AFH','2016-01-26 00:39:51','2016-01-26 00:40:15'
'AFH','2016-01-26 00:40:35','2016-01-26 00:41:34'
I need to compare the end time of each record with the start time of the next record, and if the time difference is < 10 seconds, take the next record's end time, continuing until the last record.
i.e. datediff(seconds, '2018-01-26 00:39:55', '2018-01-26 00:39:56') is < 10 seconds
I tried this :
SELECT a.app_nm
,min(a.start)
,max(b.end1)
FROM APP_Tax a
INNER JOIN APP_Tax b
ON a.APP_nm = b.APP_nm
AND b.start > a.start
WHERE datediff(second, a.end1, b.start) < 10
GROUP BY 1
It works, but it doesn't return row y when the condition fails.
There are two reasons that row y is not returned:
b.start > a.start means that a row will never join with itself
The GROUP BY will return only one record per APP_nm value, yet all rows have the same value.
However, there are further logic errors that the query will not handle. For example, how does it know when a "new" session begins?
The logic you seek can be achieved in normal PostgreSQL with the help of the DISTINCT ON clause, which returns one row per distinct value of the specified expressions. However, DISTINCT ON is not supported by Redshift.
Some potential workarounds: DISTINCT ON like functionality for Redshift
The output you seek would be trivial using a programming language (which can loop through results and store variables) but is difficult to apply to an SQL query (which is designed to operate on rows of results). I would recommend extracting the data and running it through a simple script (eg in Python) that could then output the Start & End combinations you seek.
This is an excellent use-case for a Hadoop Streaming function, which I have successfully implemented in the past. It would take the records as input, then 'remember' the start time and would only output a record when the desired end-logic has been met.
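A simple script along those lines might look like the following sketch. The function names merge_sessions and ts are made up for this example. The first sample row is dated 2016 here so that it lines up with the expected output (the question's insert uses 2018 for that row), which is an assumption about the intended data.

```python
from datetime import datetime, timedelta

def merge_sessions(rows, max_gap=timedelta(seconds=10)):
    """Merge (app, start, end) rows whose gap to the previous row is < max_gap."""
    sessions = []
    for app, start, end in sorted(rows):  # ordered by app, then start time
        last = sessions[-1] if sessions else None
        if last and last[0] == app and start - last[2] < max_gap:
            last[2] = max(last[2], end)   # still the same session: extend it
        else:
            sessions.append([app, start, end])  # gap too large: new session
    return [tuple(s) for s in sessions]

def ts(text):
    """Parse a 'YYYY-MM-DD HH:MM:SS' string."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")

rows = [
    ("AFH", ts("2016-01-26 00:39:51"), ts("2016-01-26 00:39:55")),
    ("AFH", ts("2016-01-26 00:39:56"), ts("2016-01-26 00:40:01")),
    ("AFH", ts("2016-01-26 00:40:05"), ts("2016-01-26 00:40:11")),
    ("AFH", ts("2016-01-26 00:40:12"), ts("2016-01-26 00:40:15")),  # row x
    ("AFH", ts("2016-01-26 00:40:35"), ts("2016-01-26 00:41:34")),  # row y
]

sessions = merge_sessions(rows)
for app, start, end in sessions:
    print(app, start, end)
```

The first four rows collapse into one session (gaps of 1, 4 and 1 seconds), while row y starts a new one because it begins 20 seconds after row x ends.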
It sounds like what you are after is "sessionisation" of the activity events. You can achieve that in Redshift using window functions.
The complete solution might look like this:
SELECT
start AS session_start,
session_end
FROM (
SELECT
start,
end1,
lead(end1, 1)
OVER (
ORDER BY end1) AS session_end,
session_boundary
FROM (
SELECT
start,
end1,
CASE WHEN session_switch = 0 AND reverse_session_switch = 1
THEN 'start'
ELSE 'end' END AS session_boundary
FROM (
SELECT
start,
end1,
CASE WHEN datediff(seconds, end1, lead(start, 1)
OVER (
ORDER BY end1 ASC)) > 10
THEN 1
ELSE 0 END AS session_switch,
CASE WHEN datediff(seconds, lead(end1, 1)
OVER (
ORDER BY end1 DESC), start) > 10
THEN 1
ELSE 0 END AS reverse_session_switch
FROM app_tax
)
AS sessioned
WHERE session_switch != 0 OR reverse_session_switch != 0
UNION
SELECT
start,
end1,
'start'
FROM (
SELECT
start,
end1,
row_number()
OVER (PARTITION BY APP_nm
ORDER BY end1 ASC) AS row_num
FROM APP_Tax
) AS with_row_number
WHERE row_num = 1
) AS with_boundary
) AS with_end
WHERE session_boundary = 'start'
ORDER BY start ASC
;
Here is the breakdown (by subquery name):
sessioned - we first identify the switch rows (out and in), i.e. the rows where the gap between one row's end and the next row's start exceeds the limit.
with_row_number - just a patch to extract the first row, because there is no switch into it (there is an implicit switch that we record as 'start')
with_boundary - then we identify the rows where specific switches occur. If you run the subquery by itself, it is clear that a session starts when session_switch = 0 AND reverse_session_switch = 1, and ends when the opposite occurs. All other rows are in the middle of sessions, so they are ignored.
with_end - finally, we pair each 'start' row with the end1 of the boundary row that follows it (thus defining the session duration), and remove the 'end' rows
The with_boundary subquery answers your initial question, but typically you'd want to combine those rows to get the final result, which is the session duration.
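For comparison, the same sessions can be derived with a shorter "gaps and islands" formulation: flag each row that starts a new session with LAG, then turn the flags into a session number with a running SUM. This is a different technique from the query above, shown here as a sketch on SQLite via Python (SQLite 3.25+). The columns are renamed start_ts and end_ts, the first row is dated 2016 to match the expected output, and the strftime('%s', ...) arithmetic is SQLite-specific; on Redshift the gap would presumably be computed with datediff instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE app_tax (app_nm TEXT, start_ts TEXT, end_ts TEXT);
    INSERT INTO app_tax VALUES
        ('AFH', '2016-01-26 00:39:51', '2016-01-26 00:39:55'),
        ('AFH', '2016-01-26 00:39:56', '2016-01-26 00:40:01'),
        ('AFH', '2016-01-26 00:40:05', '2016-01-26 00:40:11'),
        ('AFH', '2016-01-26 00:40:12', '2016-01-26 00:40:15'),  -- row x
        ('AFH', '2016-01-26 00:40:35', '2016-01-26 00:41:34');  -- row y
""")

sessions = conn.execute("""
    WITH flagged AS (
        -- 1 when the gap to the previous row's end is >= 10 s (or there is
        -- no previous row, where the NULL comparison falls through to ELSE)
        SELECT app_nm, start_ts, end_ts,
               CASE WHEN strftime('%s', start_ts)
                         - strftime('%s', LAG(end_ts) OVER (
                               PARTITION BY app_nm ORDER BY start_ts)) < 10
                    THEN 0 ELSE 1 END AS new_session
        FROM app_tax
    ),
    numbered AS (
        -- running total of the flags gives every row its session number
        SELECT app_nm, start_ts, end_ts,
               SUM(new_session) OVER (
                   PARTITION BY app_nm ORDER BY start_ts
                   ROWS UNBOUNDED PRECEDING) AS session_id
        FROM flagged
    )
    SELECT app_nm, MIN(start_ts), MAX(end_ts)
    FROM numbered
    GROUP BY app_nm, session_id
    ORDER BY MIN(start_ts)
""").fetchall()
print(sessions)
```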

SQL Query - combine 2 rows into 1 row

I have the following query (a view) in SQL Server. The query produces a result set that is needed to populate a grid. However, a new requirement has come up where the users would like to see the data on one row in our app. The tblTasks table can produce 1 or 2 rows. The issue arises when there are two rows that have the same job_number but different fldProjectContextId (1 or 31). I need to get the MechApprovalOut and ElecApprovalOut columns on one row instead of two.
I've tried restructuring the query using a CTE and OVER (PARTITION BY ...) but haven't been able to get the results I need.
SELECT TOP (100) PERCENT
CAST(dbo.Job_Control.job_number AS int) AS Job_Number,
dbo.tblTasks.fldSalesOrder, dbo.tblTaskCategories.fldTaskCategoryName,
dbo.Job_Control.Dwg_Sent, dbo.Job_Control.Approval_done,
dbo.Job_Control.fldElecDwgSent, dbo.Job_Control.fldElecApprovalDone,
CASE WHEN DATEDIFF(day, dbo.Job_Control.Dwg_Sent, GETDATE()) > 14
AND dbo.Job_Control.Approval_done IS NULL
AND dbo.tblProjectContext.fldProjectContextID = 1
THEN 1 ELSE 0
END AS MechApprovalOut,
CASE WHEN DATEDIFF(day, dbo.Job_Control.fldElecDwgSent, GETDATE()) > 14
AND dbo.Job_Control.fldElecApprovalDone IS NULL
AND dbo.tblProjectContext.fldProjectContextID = 31
THEN 1 ELSE 0
END AS ElecApprovalOut,
dbo.tblProjectContext.fldProjectContextName,
dbo.tblProjectContext.fldProjectContextId, dbo.Job_Control.Drawing_Info,
dbo.Job_Control.fldElectricalAppDwg
FROM dbo.tblTaskCategories
INNER JOIN dbo.tblTasks
ON dbo.tblTaskCategories.fldTaskCategoryId = dbo.tblTasks.fldTaskCategoryId
INNER JOIN dbo.Job_Control
ON dbo.tblTasks.fldSalesOrder = dbo.Job_Control.job_number
INNER JOIN dbo.tblProjectContext
ON dbo.tblTaskCategories.fldProjectContextId = dbo.tblProjectContext.fldProjectContextId
WHERE (dbo.tblTaskCategories.fldTaskCategoryName = N'Approval'
OR dbo.tblTaskCategories.fldTaskCategoryName = N'Re-Approval')
AND (CASE WHEN DATEDIFF(day, dbo.Job_Control.Dwg_Sent, GETDATE()) > 14
AND dbo.Job_Control.Approval_done IS NULL
AND dbo.tblProjectContext.fldProjectContextID = 1
THEN 1 ELSE 0
END = 1)
OR (dbo.tblTaskCategories.fldTaskCategoryName = N'Approval'
OR dbo.tblTaskCategories.fldTaskCategoryName = N'Re-Approval')
AND (CASE WHEN DATEDIFF(day, dbo.Job_Control.fldElecDwgSent, GETDATE()) > 14
AND dbo.Job_Control.fldElecApprovalDone IS NULL
AND dbo.tblProjectContext.fldProjectContextID = 31
THEN 1 ELSE 0
END = 1)
ORDER BY dbo.Job_Control.job_number, dbo.tblTaskCategories.fldProjectContextId
The above query gives me the following result set:
I've created a workaround in code (which I don't like, but it works for now) where I populate a "temp" table the way I need the data displayed, that is, one record per job number when there are duplicates, so that the MechApprovalOut and ElecApprovalOut columns are on one row (see first record in following screen shot).
Example:
With the desired result set and one row per job_number, this is how the form looks with the data and how I am using the result set.
Any help restructuring my query to combine duplicate rows with the same job number, so that the MechApprovalOut and ElecApprovalOut columns are on one row, is greatly appreciated! I'd much prefer to use a view in SQL than code in the app to populate a temp table.
Thanks,
Jimmy
What I would do is LEFT JOIN the main table to itself at the beginning of the query, matching on Job Number and Sales Order, such that the left side of the join is only looking at Approval task categories and the right side of the join is only looking at Re-Approval task categories. Then I would make extensive use of the COALESCE() function to select data from the correct side of the join for use later on and in the select clause. This may also be the piece you were missing to make a CTE work.
There is probably also a solution that uses a ranking/windowing function (maybe not RANK itself, but something in that category) along with the PARTITION BY clause. However, as those are fairly new to SQL Server, I haven't used them enough personally to be comfortable writing an example solution for you without direct access to the data to play with, and it would still take me a little more time to get right than I can devote to this right now. Maybe this paragraph will motivate someone else to do that work.
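Another option, not the self-join described above, is plain conditional aggregation: group by the job number and take MAX of each flag, so a 1 on either source row ends up on the single combined row. The sketch below shows the idea on SQLite from Python; the table approvals and its rows are hypothetical and stand in for the output of the original view.

```python
import sqlite3

# Hypothetical stand-in for the view's output: job 1001 has two rows
# (context 1 and 31), job 1002 has one.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE approvals (job_number INTEGER, context_id INTEGER,
                            mech_approval_out INTEGER,
                            elec_approval_out INTEGER);
    INSERT INTO approvals VALUES
        (1001, 1,  1, 0),
        (1001, 31, 0, 1),
        (1002, 1,  1, 0);
""")

# One row per job: MAX picks up the flag from whichever row carries it.
combined = conn.execute("""
    SELECT job_number,
           MAX(mech_approval_out) AS mech_approval_out,
           MAX(elec_approval_out) AS elec_approval_out
    FROM approvals
    GROUP BY job_number
    ORDER BY job_number
""").fetchall()
print(combined)
```

In the real view, any other selected column would need to be either part of the GROUP BY or wrapped in an aggregate as well.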

Determine values in a database that occur before a certain date, but not after that date

I need to determine if a certain field value in a database table occurs before a certain date, but not after that date.
I can determine the values that occur before the cutoff date with a simple select, but there may be records after that date.
The field values that I am using are the 'entereddate' and the value I am looking for (in this case a carriercode).
Thanks for your help!
This is the best I can do without seeing the data structure.
SELECT *
FROM BillTBL a
INNER JOIN carriertbl b ON a.carrier_key = b.carrier_key
WHERE a.billentereddate < '2009-09-01'
AND NOT EXISTS (SELECT 1
FROM BillTBL
WHERE whatever_the_key_is = a.whatever_the_key_is
AND billentereddate > '2009-09-01')
select a.carriercode
from carriertbl as a
inner join BillTBL as b ON b.carrier_key = a.carrier_key and b.entereddate < '2009-09-01'
Maybe you have to adjust some column names...
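One compact way to express "occurs before the cutoff but never after it" is to group by the value and require that its latest date is still before the cutoff. The following sketch runs that on SQLite from Python; the single table bills and its rows are invented, since the real data structure is not shown.

```python
import sqlite3

# Hypothetical data: AAA appears before and after the cutoff, BBB only
# before it, CCC only after it.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bills (carriercode TEXT, entereddate TEXT);
    INSERT INTO bills VALUES
        ('AAA', '2009-06-01'),
        ('AAA', '2009-10-15'),
        ('BBB', '2009-07-20'),
        ('CCC', '2009-11-02');
""")

# A carrier qualifies only if its most recent entry precedes the cutoff.
only_before = [row[0] for row in conn.execute("""
    SELECT carriercode
    FROM bills
    GROUP BY carriercode
    HAVING MAX(entereddate) < '2009-09-01'
    ORDER BY carriercode
""")]
print(only_before)
```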

SQL query to retrieve discrepancies in punch order

Consider the table below.
The rule is - an employee cannot take a break (needs to clock out) from job num 1 before clocking in to job num 2. In this case the employee "A" was supposed to clock OUT instead of BREAK on jobnum 1 because he later clocked in to JobNum#2
Is it possible to write a query to find this in plain SQL?
The idea is to check whether the next record is the proper one. To find the next record, find the first punchtime after the current one for the same employee. Once that is retrieved, you can isolate the record itself and check the fields of interest, specifically whether jobnum is the same and, optionally, whether punch_type is 'IN'. If it is not, NOT EXISTS evaluates to true and the record is output.
select *
from #punch p
-- Isolate breaks only
where p.punch_type = 'BREAK'
-- The ones having no proper entry
and not exists
(
select null
-- The same table
from #punch a
where a.emplid = p.emplid
and a.jobnum = p.jobnum
-- Next record has punchtime from subquery
and a.punchtime = (select min (n.punchtime)
from #punch n
where n.emplid = p.emplid
and n.punchtime > p.punchtime
)
-- Optionally you might force next record to be 'IN'
and a.punch_type = 'IN'
)
Replace #punch with your table name. -- starts a comment in SQL Server; if you are not using this database, remove these lines. It is a good idea to tag your database and version, as there are probably faster/better ways to do this.
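Since the original table is not shown, here is a sketch that runs the same NOT EXISTS query against made-up punches in SQLite from Python. The table is named punch here, and employee B is added as a control case whose break is followed by an IN on the same job.

```python
import sqlite3

# Hypothetical punches: A breaks on job 1 and then clocks in to job 2
# (a violation); B breaks on job 1 and clocks back in to job 1 (fine).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE punch (emplid TEXT, jobnum INTEGER,
                        punch_type TEXT, punchtime TEXT);
    INSERT INTO punch VALUES
        ('A', 1, 'IN',    '2015-01-05 08:00'),
        ('A', 1, 'BREAK', '2015-01-05 10:00'),
        ('A', 2, 'IN',    '2015-01-05 10:30'),
        ('B', 1, 'IN',    '2015-01-05 08:00'),
        ('B', 1, 'BREAK', '2015-01-05 10:00'),
        ('B', 1, 'IN',    '2015-01-05 10:15');
""")

# A BREAK is reported unless the employee's very next punch is an IN
# on the same job.
bad_breaks = conn.execute("""
    SELECT p.emplid, p.jobnum, p.punchtime
    FROM punch p
    WHERE p.punch_type = 'BREAK'
      AND NOT EXISTS (
          SELECT NULL
          FROM punch a
          WHERE a.emplid = p.emplid
            AND a.jobnum = p.jobnum
            AND a.punchtime = (SELECT MIN(n.punchtime)
                               FROM punch n
                               WHERE n.emplid = p.emplid
                                 AND n.punchtime > p.punchtime)
            AND a.punch_type = 'IN')
    ORDER BY p.emplid
""").fetchall()
print(bad_breaks)
```

Note that a BREAK with no later punch at all is also reported, because the MIN subquery yields NULL and the equality test never matches.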
Here is the SQL
select * from employees e1 cross join employees e2 where e1.JOBNUM = (e2.JOBNUM + 1)
and e1.PUNCH_TYPE = 'BREAK' and e2.PUNCH_TYPE = 'IN'
and e1.PUNCHTIME < e2.PUNCHTIME
and e1.EMPLID = e2.EMPLID