De-duplicating transaction records

De-duplicating transaction records - sql

I am a bit stuck! I have data, like the below.
I need to calculate the sum of frequency between each customer. In the above, FROM customer1 TO customer2 should be summed with FROM customer2 TO customer1 - like below.
It doesn't matter which direction the message went in; I just need to sum all communication between customer1 and customer2.

You can use the greatest and least functionality as follows:
select least(from,to) as from, greatest(from,to) as to, sum(frequency) as freq
from your_Table
group by least(from,to), greatest(from,to)
If greatest and least is not supported in your version then you can use the case..when also.
select case when from > to then to else from end as from,
case when from > to then from else to end as to,
sum(frequency) as freq
from your_Table
group by case when from > to then to else from end,
case when from > to then from else to end

you can try sorting the From To
WITH tab AS
(SELECT * FROM (VALUES ('Customer 1', 'Customer 2', 2)
, ('Customer 2', 'Customer 1', 4)
, ('Customer 3', 'Customer 1', 4)
) a ([From], [To], [Frequency])
)
SELECT IIF([From] > [To], [To], [From]) [From]
, IIF([From] > [To], [From], [To]) [To]
, SUM([Frequency]) Frequency
From tab
GROUP BY IIF([From] > [To], [To], [From])
, IIF([From] > [To], [From], [To])

You can group by a sorted array:
select
sort_array(array(`from`, `to`))[0] `from`,
sort_array(array(`from`, `to`))[1] `to`,
sum(frequency)
from mytable
group by sort_array(array(`from`, `to`));

Taking into account that two map() with the same key-value pairs but in different order are equal, because maps are unordered by definition, you can exploit this property to aggregate frequency.
Demo with your data example:
with mytable as(
select stack (3,
'Customer 1', 'Customer 2', 2,
'Customer 2', 'Customer 1', 4,
'Customer 3', 'Customer 1', 4
) as (`from`, `to` , frequency)
)
select map_keys(vmap)[0] as `from`, map_keys(vmap)[1] as `to`, frequency
from
(
select map(`from`, 1, `to`, 1) vmap, sum(frequency) frequency
from mytable group by map(`from`, 1, `to`, 1)
)s;
Result:
from to frequency
Customer 2 Customer 1 6
Customer 3 Customer 1 4

Related

Assign Specific Value To All Rows in Partition If Condition Met

I have to build the Exceptions Report to catch Overlaps or Gaps. The dataset has clients and assigned supervisors with start and end dates of supervision.
CREATE TABLE Report
(Id INT, ClientId INT, ClientName VARCHAR(30), SupervisorId INT, SupervisorName
VARCHAR(30), SupervisionStartDate DATE, SupervisionEndDate DATE);
INSERT INTO Report
VALUES
(1, 22, 'Client A', 33, 'Supervisor A', '2022-01-01', '2022-04-30'),
(2, 22, 'Client A', 44, 'Supervisor B', '2022-05-01', '2022-08-23'),
(3, 22, 'Client A', 55, 'Supervisor C', '2022-08-24', NULL),
(4, 23, 'Client B', 33, 'Supervisor A', '2022-01-01', '2022-04-30'),
(5, 23, 'Client B', 44, 'Supervisor B', '2022-04-30', '2022-08-23'),
(6, 24, 'Client C', 33, 'Supervisor A', '2022-01-01', '2022-04-30'),
(7, 24, 'Client C', 44, 'Supervisor B', '2022-05-01', '2022-08-23'),
(8, 24, 'Client C', 55, 'Supervisor C', '2022-07-22', '2022-10-25'),
(9, 25, 'Client D', 33, 'Supervisor A', '2022-01-01', '2022-04-30'),
(10, 25, 'Client D', 44, 'Supervisor B', '2022-07-23', NULL)
SELECT * FROM Report
'Valid' status should be assigned to all rows associated with Client if no Gaps or Overlaps present, for example:
Client A has 3 Supervisors - Supervisor A (01/01/2022 - 04/30/2022), Supervisor B (05/01/2022 - 08/23/2022) and Supervisor C (08/24/2022 - Present).
'Issue Found' status should be assigned to all rows associated with Client if any Gaps or Overlaps present, for example:
Client B has 2 Supervisors - Supervisor A (01/01/2022 - 04/30/2022) and Supervisor B (04/30/2022 - 08/23/2022).
Client C has 3 Supervisors - Supervisor A (01/01/2022 - 04/30/2022), Supervisor B (05/01/2022 - 08/23/2022) and Supervisor C (07/22/2022 - 10/25/2022).
These are examples of the Overlap.
Client D has 2 Supervisors - Supervisor A (01/01/2022 - 04/30/2022) and Supervisor B (07/23/2022 - Present).
This is the example of the Gap.
The Output I need:
I added some columns that might be helpful, but don't know how to accomplish the main goal.
However, I noticed, that if the first record in the [Diff Between PreviousEndDate And SupervisionStartDate] column is NULL and all other = 1, then it will be Valid.
SELECT
Report.*,
ROW_NUMBER() OVER (PARTITION BY Report.ClientId ORDER BY COALESCE(Report.SupervisionStartDate, Report.SupervisionEndDate)) AS [ClientRecordNumber],
COUNT(*) OVER (PARTITION BY Report.ClientId) AS [TotalNumberOfClientRecords],
DATEDIFF(DAY, Report.SupervisionStartDate, Report.SupervisionEndDate) AS SupervisionAging,
LAG(Report.SupervisionStartDate) OVER (PARTITION BY Report.ClientId ORDER BY COALESCE(Report.SupervisionStartDate, Report.SupervisionEndDate)) AS PreviousStartDate,
LAG(Report.SupervisionEndDate) OVER (PARTITION BY Report.ClientId ORDER BY COALESCE(Report.SupervisionStartDate, Report.SupervisionEndDate)) AS PreviousEndDate,
LEAD(Report.SupervisionStartDate) OVER (PARTITION BY Report.ClientId ORDER BY COALESCE(Report.SupervisionStartDate, Report.SupervisionEndDate)) AS NextStartDate,
LEAD(Report.SupervisionEndDate) OVER (PARTITION BY Report.ClientId ORDER BY COALESCE(Report.SupervisionStartDate, Report.SupervisionEndDate)) AS NextEndDate,
DATEDIFF(dd, LAG(Report.SupervisionEndDate) OVER (PARTITION BY Report.ClientId ORDER BY COALESCE(Report.SupervisionStartDate, Report.SupervisionEndDate)), Report.SupervisionStartDate) AS [Diff Between PreviousEndDate And SupervisionStartDate]
FROM Report

One approach:
Use the additional LAG parameters to provide a default value for when its null, and make that value a valid value i.e. 1 day before the StartDate
Use a CTE to calculate the difference in days between the StartDate and previous EndDate.
Then use a second CTE to determine for any given client whether there is an issue.
Finally display your desired results.
WITH cte1 AS (
SELECT
R.*
, DATEDIFF(day, LAG(R.SupervisionEndDate,1,dateadd(day,-1,R.SupervisionStartDate)) OVER (PARTITION BY R.ClientId ORDER BY COALESCE(R.SupervisionStartDate, R.SupervisionEndDate)), R.SupervisionStartDate) AS Diff
FROM Report R
), cte2 AS (
SELECT *
, MAX(COALESCE(Diff,0)) OVER (PARTITION BY ClientId) MaxDiff
, MIN(COALESCE(Diff,0)) OVER (PARTITION BY ClientId) MinDiff
FROM cte1
)
SELECT Id, ClientId, ClientName, SupervisorId, SupervisorName, SupervisionStartDate, SupervisionEndDate
--, Diff, MaxDiff, MinDiff -- Debug
, CASE WHEN MaxDiff = 1 AND MinDiff = 1 THEN 'Valid' ELSE 'Issue Found' END [Status]
FROM cte2
ORDER BY Id;
Notes:
Use the fullname of the datepart you are diff-ing - its much clearer and easier to maintain.
Use short, relevant, table aliases to reduce the code.

I would like to display current month, year to date and previous year to date on ssrs

Data Presentation
Would like to calculate year to date and previous year to date current month on sql view so that to simply data presentation on ssrs to make it run faster. Is there a way to write a view which can perform this ?

Ignoring the fact that I think you have some errors in your Previous YTD summary numbers..
I recreated the data as per your example
CREATE TABLE #t (TransDate date, Customer varchar(30), Amount float)
INSERT INTO #t VALUES
('2020-09-21', 'Customer 1', 200),
('2020-09-22', 'Customer 2', 300),
('2020-08-03', 'Customer 2', 450),
('2020-08-04', 'Customer 1', 1200),
('2019-09-14', 'Customer 1', 859),
('2019-02-05', 'Customer 2', 230),
('2019-07-26', 'Customer 2', 910),
('2019-11-17', 'Customer 1', 820)
Then the following statement will produce what you need. It is NOT the most elegant way of doing this but it will convert to a view easily and was all I could come up with in the time I had.
SELECT
m.Customer
, m.MTD as [Current Month]
, y.YTD as [Current YTD]
, p.YTD as [Previous YTD]
FROM (
SELECT Customer, Yr = Year(TransDate), Mn = MONTH(TransDate), MTD = SUM(Amount) FROM #t t WHERE MONTH(TransDate) = MONTH(GetDate()) and YEAR(TransDate) = YEAR(GetDate())
GROUP by Customer, YEAR(TransDate), MONTH(TransDate)
) m
JOIN (SELECT Customer, Yr = YEAR(TransDate), YTD = SUM(Amount) FROM #t t GROUP by Customer, YEAR(TransDate)) y on m.Customer =y.Customer and m.Yr = y.Yr
JOIN (SELECT Customer, Yr = YEAR(TransDate), YTD = SUM(Amount) FROM #t t GROUP by Customer, YEAR(TransDate)) p on y.Customer =p.Customer and y.Yr = p.Yr + 1
This gives the following results (which don;t match your example but I think your sample is incorrect)

Find date of most recent overdue

I have the following problem: from the table of pays and dues, I need to find the date of the last overdue. Here is the table and data for example:
create table t (
Id int
, [date] date
, Customer varchar(6)
, Deal varchar(6)
, Currency varchar(3)
, [Sum] int
);
insert into t values
(1, '2017-12-12', '1110', '111111', 'USD', 12000)
, (2, '2017-12-25', '1110', '111111', 'USD', 5000)
, (3, '2017-12-13', '1110', '122222', 'USD', 10000)
, (4, '2018-01-13', '1110', '111111', 'USD', -10100)
, (5, '2017-11-20', '2200', '222221', 'USD', 25000)
, (6, '2017-12-20', '2200', '222221', 'USD', 20000)
, (7, '2017-12-31', '2201', '222221', 'USD', -10000)
, (8, '2017-12-29', '1110', '122222', 'USD', -10000)
, (9, '2017-11-28', '2201', '222221', 'USD', -30000);
If the value of "Sum" is positive - it means overdue has begun; if "Sum" is negative - it means someone paid on this Deal.
In the example above on Deal '122222' overdue starts at 2017-12-13 and ends on 2017-12-29, so it shouldn't be in the result.
And for the Deal '222221' the first overdue of 25000 started at 2017-11-20 was completly paid at 2017-11-28, so the last date of current overdue (we are interested in) is 2017-12-31
I've made this selection to sum up all the payments, and stuck here :(
WITH cte AS (
SELECT *,
SUM([Sum]) OVER(PARTITION BY Deal ORDER BY [Date]) AS Debt_balance
FROM t
)
Apparently i need to find (for each Deal) minimum of Dates if there is no 0 or negative Debt_balance and the next date after the last 0 balance otherwise..
Will be gratefull for any tips and ideas on the subject.
Thanks!
UPDATE
My version of solution:
WITH cte AS (
SELECT ROW_NUMBER() OVER (ORDER BY Deal, [Date]) id,
Deal, [Date], [Sum],
SUM([Sum]) OVER(PARTITION BY Deal ORDER BY [Date]) AS Debt_balance
FROM t
)
SELECT a.Deal,
SUM(a.Sum) AS NET_Debt,
isnull(max(b.date), min(a.date)),
datediff(day, isnull(max(b.date), min(a.date)), getdate())
FROM cte as a
LEFT OUTER JOIN cte AS b
ON a.Deal = b.Deal AND a.Debt_balance <= 0 AND b.Id=a.Id+1
GROUP BY a.Deal
HAVING SUM(a.Sum) > 0

I believe you are trying to use running sum and keep track of when it changes to positive, and it can change to positive multiple times and you want the last date at which it became positive. You need LAG() in addition to running sum:
WITH cte1 AS (
-- running balance column
SELECT *
, SUM([Sum]) OVER (PARTITION BY Deal ORDER BY [Date], Id) AS RunningBalance
FROM t
), cte2 AS (
-- overdue begun column - set whenever running balance changes from l.t.e. zero to g.t. zero
SELECT *
, CASE WHEN LAG(RunningBalance, 1, 0) OVER (PARTITION BY Deal ORDER BY [Date], Id) <= 0 AND RunningBalance > 0 THEN 1 END AS OverdueBegun
FROM cte1
)
-- eliminate groups that are paid i.e. sum = 0
SELECT Deal, MAX(CASE WHEN OverdueBegun = 1 THEN [Date] END) AS RecentOverdueDate
FROM cte2
GROUP BY Deal
HAVING SUM([Sum]) <> 0
Demo on db<>fiddle

You can use window functions. These can calculate intermediate values:
Last day when the sum is negative (i.e. last "good" record).
Last sum
Then you can combine these:
select deal, min(date) as last_overdue_start_date
from (select t.*,
first_value(sum) over (partition by deal order by date desc) as last_sum,
max(case when sum < 0 then date end) over (partition by deal order by date) as max_date_neg
from t
) t
where last_sum > 0 and date > max_date_neg
group by deal;
Actually, the value on the last date is not necessary. So this simplifies to:
select deal, min(date) as last_overdue_start_date
from (select t.*,
max(case when sum < 0 then date end) over (partition by deal order by date) as max_date_neg
from t
) t
where date > max_date_neg
group by deal;

Count number of consecutive grouped entries in SQL

I'd like to create and populate the following No. of Entries in Curr.Status field seen below using SQL (sql server).
ID Sequence Prev.Status Curr.Status No. of Entries in Curr.Status
9-9999-9 1 Status D Status A 1
9-9999-9 2 Status A Status A 2
9-9999-9 3 Status A Status A 3
9-9999-9 4 Status A Status A 4
9-9999-9 5 Status A Status B 1
9-9999-9 6 Status B Status B 2
9-9999-9 7 Status B Status B 3
9-9999-9 8 Status B Status A 1
9-9999-9 9 Status A Status A 2
9-9999-9 10 Status A Status C 1
9-9999-9 11 Status C Status C 2
Is there an quick way using something like row_number() --this alone doesn't appear to be sufficient-- to create the field I'm looking for?
Thanks!

This appears to be a Groups and Islands problem. there are plenty of examples out there on how to achieve this, however:
WITH VTE AS(
SELECT *
FROM (VALUES('9-9999-9',1 ,'Status D','Status A'),
('9-9999-9',2 ,'Status A','Status A'),
('9-9999-9',3 ,'Status A','Status A'),
('9-9999-9',4 ,'Status A','Status A'),
('9-9999-9',5 ,'Status A','Status B'),
('9-9999-9',6 ,'Status B','Status B'),
('9-9999-9',7 ,'Status B','Status B'),
('9-9999-9',8 ,'Status B','Status A'),
('9-9999-9',9 ,'Status A','Status A'),
('9-9999-9',10,'Status A','Status C'),
('9-9999-9',11,'Status C','Status C')) V(ID, Sequence, PrevStatus,CurrStatus)),
CTE AS(
SELECT ID,
[Sequence],
PrevStatus,
CurrStatus,
ROW_NUMBER() OVER (PARTITION BY ID ORDER BY [Sequence]) -
ROW_NUMBER() OVER (PARTITION BY ID,CurrStatus ORDER BY [Sequence]) AS Grp
FROM VTE V)
SELECT ID,
[Sequence],
PrevStatus,
CurrStatus,
ROW_NUMBER() OVER (PARTITION BY Grp ORDER BY [Sequence]) AS Entries
FROM CTE;

You can mark the rows where status changes using LAG function, and use SUM() OVER () to assign unique number to each group. Numbering within group is trivial:
DECLARE #t TABLE (ID VARCHAR(100), Sequence INT, PrevStatus VARCHAR(100), CurrStatus VARCHAR(100));
INSERT INTO #t VALUES
('9-9999-9', 1, 'Status D', 'Status A'),
('9-9999-9', 2, 'Status A', 'Status A'),
('9-9999-9', 3, 'Status A', 'Status A'),
('9-9999-9', 4, 'Status A', 'Status A'),
('9-9999-9', 5, 'Status A', 'Status B'),
('9-9999-9', 6, 'Status B', 'Status B'),
('9-9999-9', 7, 'Status B', 'Status B'),
('9-9999-9', 8, 'Status B', 'Status A'),
('9-9999-9', 9, 'Status A', 'Status A'),
('9-9999-9', 10, 'Status A', 'Status C'),
('9-9999-9', 11, 'Status C', 'Status C');
WITH cte1 AS (
SELECT *, CASE WHEN LAG(CurrStatus) OVER(ORDER BY Sequence) = CurrStatus THEN 0 ELSE 1 END AS chg
FROM #t
), cte2 AS (
SELECT *, SUM(chg) OVER(ORDER BY Sequence) AS grp
FROM cte1
), cte3 AS (
SELECT *, ROW_NUMBER() OVER(PARTITION BY grp ORDER BY Sequence) AS SeqInGroup
FROM cte2
)
SELECT *
FROM cte3
ORDER BY Sequence
Demo on DB Fiddle

If the Sequence is identity column then you can do :
select t.*,
row_number() over (partition by (Sequence - seq) order by Sequence) as [No. of Entries in Curr.Status]
from (select t.*,
row_number() over (partition by [Curr.Status] order by Sequence) as seq
from table t
) t;
else you need to generate two row_numbers :
select t.*,
row_number() over (partition by (seq1- seq2) order by Sequence) as [No. of Entries in Curr.Status]
from (select t.*,
row_number() over (partition by id order by Sequence) as seq1
row_number() over (partition by id, [Curr.Status] order by Sequence) as seq2
from table t
) t;

Count previous consecutive rows in SQL Server

I have attendance data list which is showing below. Now I am trying to find data by a specific date range (01/05/2016 – 07/05/2016) with total Present Column, Total Present Column will be calculated from previous present data (P). Suppose today is 04/05/2016. If a person has 01,02,03,04 status ‘p’ then it will show date 04-05-2016 total present 4.
Could you help me to find total present from this result set.

You can check this example, which have logic to calculate previous sum value.
declare #t table (employeeid int, datecol date, status varchar(2) )
insert into #t values (10001, '01-05-2016', 'P'),
(10001, '02-05-2016', 'P'),
(10001, '03-05-2016', 'P'),
(10001, '04-05-2016', 'P'),
(10001, '05-05-2016', 'A'),
(10001, '06-05-2016', 'P'),
(10001, '07-05-2016', 'P'),
(10001, '08-05-2016', 'L'),
(10002, '07-05-2016', 'P'),
(10002, '08-05-2016', 'L')
--select * from #t
select * ,
SUM(case when status = 'P' then 1 else 0 end) OVER (PARTITION BY employeeid ORDER BY employeeid, datecol
ROWS BETWEEN UNBOUNDED PRECEDING
AND current row)
from
#t
Another twist of the same thing via cte (as you written SQLSERVER2012, this below solution only work in Sqlserver 2012 and above)
;with cte as
(
select employeeid , datecol , ROW_NUMBER() over(partition by employeeid order by employeeid, datecol) rowno
from
#t where status = 'P'
)
select t.*, cte.rowno ,
case when ( isnull(cte.rowno, 0) = 0)
then LAG(cte.rowno) OVER (ORDER BY t.employeeid, t.datecol)
else cte.rowno
end LagValue
from #t t left join cte on t.employeeid = cte.employeeid and t.datecol = cte.datecol
order by t.employeeid, t.datecol

You could use a subquery to calculate TotalPresent for each row:
SELECT
main.EmployeeID,
main.[Date],
main.[Status],
(
SELECT SUM(CASE WHEN t.[Status] = 'P' THEN 1 ELSE 0 END)
FROM [TableName] t
WHERE t.EmployeeID = main.EmployeeID AND t.[Date] <= main.[Date]
) as TotalPresent
FROM [TableName] main
ORDER BY
main.EmployeeID,
main.[Date]
Here I used subquery to count the sum of records that have the same EmployeeID and date is less or equal to the date of current row. If status of the record is 'P', then 1 is added to the sum, otherwise 0, which counts only records that have status P.

Interesting question, this should work:
select *
, (select count(retail) from p g
where g.date <= p.date and g.id = p.id and retail = 'P')
from p
order by ID, Date;
So I believe I understand correctly. You would like to count the occurences of P per ID datewise.
This makes a lot of sense. That is why the first occurrence of ID2 was L and the Total is 0. This query will count P status for each occurrence, pause at non-P for each ID.
Here is an example

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

De-duplicating transaction records - sql

You can group by a sorted array: select sort_array(array(`from`, `to`))[0] `from`, sort_array(array(`from`, `to`))[1] `to`, sum(frequency) from mytable group by sort_array(array(`from`, `to`));

Related

Assign Specific Value To All Rows in Partition If Condition Met

I would like to display current month, year to date and previous year to date on ssrs

Find date of most recent overdue

Count number of consecutive grouped entries in SQL

Count previous consecutive rows in SQL Server

Categories

Resources