How to remove duplicates while sorting by unique datetime


I am working with a fairly bad data source: the information I need is in a delimited varchar(max) column. The data can also be duplicated across multiple rows, so I am trying to remove these duplicates.
This can be done by trimming the column I am interested in, because when a repeat occurs the "ID" gets re-appended to the end of the column. I then take a distinct of the trimmed values and concatenate the results; it isn't pretty.
Example data and the query I currently use (SQL Fiddle):
Data Table
| id | callID | callDateTime | history |
|----|--------|-----------------------------|-------------------------------------|
| 1 | 1 | 2021-01-01 10:00:00.0000000 | Amount: 10, Ref:123, ID:123 |
| 2 | 1 | 2021-01-01 10:01:00.0000000 | Amount: 10, Ref:123, ID:123, ID:123 |
| 3 | 2 | 2021-01-01 11:00:00.0000000 | Amount:12.44, Ref:SIS, ID:124 |
| 4 | 2 | 2021-01-01 11:02:00.0000000 | Amount:11.22, Ref:Dad, ID:124 |
| 5 | 2 | 2021-01-01 11:01:00.0000000 | Amount:11.22, Ref:Mum, ID:124 |
| 6 | 3 | 2021-01-01 12:00:00.0000000 | Amount:11, ID:125 |
Query
select CallID, Concat([1],',', [2],',',[3])
from
(
select CallID, historyEdit, ROW_NUMBER() over (partition by callID order by callID) as rowNum
from
(
select distinct callID,
substring(history, 0, charindex(', ID:',history)) historyEdit
from test
) a
)b
PIVOT(max(historyEdit) for rowNum IN ([1],[2],[3])) piv
Result
| CallID | |
|--------|-------------------------------------------------------------------|
| 1 | Amount: 10, Ref:123,, |
| 2 | Amount:11.22, Ref:Dad,Amount:11.22, Ref:Mum,Amount:12.44, Ref:SIS |
| 3 | Amount:11,, |
The issue is that I need the concatenation to follow the order in which the events occurred. In the result above, CallID 2 is in the wrong order: the Dad entry (11:02) comes before the Mum entry (11:01) and the SIS entry (11:00). I did try sorting the base table by callDateTime first and then running the query, but that yields somewhat random results: sometimes the order is correct, other times it isn't. I assume this is because I am not specifying any ORDER BY clause in the query.
Including callDateTime in the results stops the DISTINCT from returning the unique data rows, because callDateTime is still unique to each duplicated row of data.
I am using SQL Server v12 (SQL Server 2014).
Desired Result
| CallID | |
|--------|-------------------------------------------------------------------|
| 1 | Amount: 10, Ref:123,, |
| 2 | Amount:12.44, Ref:SIS,Amount:11.22, Ref:Mum,Amount:11.22, Ref:Dad |
| 3 | Amount:11,, |

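If I understand correctly, you want to break apart the history and recombine it -- without duplicates -- for each callid. If so, you can use string_split() and string_agg(). Note that string_split() needs SQL Server 2016 or later, and string_agg() and trim() need SQL Server 2017 or later, so this will not run as-is on SQL Server 2014 (v12):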
select callid, string_agg(value, ', ')
from (select distinct t.callid, s.value
from test t cross apply
(select trim(s.value) as value
from string_split(t.history, ',') s
) s
) st
group by callid;
Here is a db<>fiddle.
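
Since the question mentions SQL Server v12 (2014), here is a minimal sketch of one way the same idea might be done there: trim each row at the first ', ID:', collapse duplicates with GROUP BY while keeping the earliest callDateTime, then concatenate with FOR XML PATH ordered by that earliest time. The temp table #test and the names trimmed, dedup and firstSeen are assumptions made for illustration; the data is copied from the question.

```sql
-- Sample data copied from the question; #test is a scratch temp table.
CREATE TABLE #test (
    id           INT,
    callID       INT,
    callDateTime DATETIME2,
    history      VARCHAR(MAX)
);

INSERT INTO #test (id, callID, callDateTime, history) VALUES
    (1, 1, '2021-01-01 10:00:00', 'Amount: 10, Ref:123, ID:123'),
    (2, 1, '2021-01-01 10:01:00', 'Amount: 10, Ref:123, ID:123, ID:123'),
    (3, 2, '2021-01-01 11:00:00', 'Amount:12.44, Ref:SIS, ID:124'),
    (4, 2, '2021-01-01 11:02:00', 'Amount:11.22, Ref:Dad, ID:124'),
    (5, 2, '2021-01-01 11:01:00', 'Amount:11.22, Ref:Mum, ID:124'),
    (6, 3, '2021-01-01 12:00:00', 'Amount:11, ID:125');

WITH trimmed AS (
    -- Keep everything before the first ', ID:'. Appending the marker to the
    -- searched string means CHARINDEX never returns 0 for rows without it.
    SELECT callID,
           callDateTime,
           LEFT(history, CHARINDEX(', ID:', history + ', ID:') - 1) AS historyEdit
    FROM #test
),
dedup AS (
    -- One row per distinct trimmed value, tagged with its earliest occurrence.
    SELECT callID, historyEdit, MIN(callDateTime) AS firstSeen
    FROM trimmed
    GROUP BY callID, historyEdit
)
SELECT c.callID,
       STUFF((SELECT ',' + d.historyEdit
              FROM dedup AS d
              WHERE d.callID = c.callID
              ORDER BY d.firstSeen          -- this ORDER BY drives the concatenation order
              FOR XML PATH(''), TYPE).value('.', 'varchar(max)'), 1, 1, '') AS history
FROM (SELECT DISTINCT callID FROM dedup) AS c
ORDER BY c.callID;

DROP TABLE #test;
```

For callID 2 this should list the SIS entry first, then Mum, then Dad. Unlike the PIVOT version in the question, it does not pad the result with trailing commas when a call has fewer than three distinct entries.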

If you are sure of the number of records, you could use a TOP clause inside the select to order the records before pivoting the results, like below:
select callID, historyEdit
from
(
select distinct top 100000 callID, callDateTime,
substring(history, 0, charindex(', ID:',history)) historyEdit
from test
order by callDateTime
)t
Please see the results here.

One way to build the strings could be to CROSS APPLY an ordinal splitter that separates the 'history' column into components which can be enumerated. The result is very close to what was provided in the question. Maybe the provided expected results aren't accurately representative? Something like this.
The ordinal splitter is described here:
CREATE FUNCTION [dbo].[DelimitedSplit8K_LEAD]
--===== Define I/O parameters
(@pString VARCHAR(8000), @pDelimiter CHAR(1))
RETURNS TABLE WITH SCHEMABINDING AS
RETURN
WITH E1(N) AS (
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
), --10E+1 or 10 rows
E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
cteTally(N) AS (--==== This provides the "zero base" and limits the number of rows right up front
-- for both a performance gain and prevention of accidental "overruns"
SELECT 0 UNION ALL
SELECT TOP (DATALENGTH(ISNULL(@pString,1))) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
),
cteStart(N1) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter)
SELECT t.N+1
FROM cteTally t
WHERE (SUBSTRING(@pString,t.N,1) = @pDelimiter OR t.N = 0)
)
--===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found.
SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY s.N1),
Item = SUBSTRING(@pString,s.N1,ISNULL(NULLIF((LEAD(s.N1,1,1) OVER (ORDER BY s.N1) - 1),0)-s.N1,8000))
FROM cteStart s
;
query
with
unq_cte as (
select distinct callID
from #test),
exp_cte as (
select callID, callDateTime , dl.*,
row_number() over (partition by callID, dl.Item order by callDateTime) as rn
from #test t
cross apply dbo.DelimitedSplit8K_LEAD(t.history, ',') dl)
select t.callID,
stuff((select ',' + case when rn>1 then '' else Item end
from exp_cte tt
where t.callID = tt.callID
and ltrim(rtrim(Item)) not like 'ID%'
order by tt.callDateTime, tt.ItemNumber for xml path('')), 1, 1, '') [value1]
from unq_cte t
group by t.callID;
| callID | value1 |
|--------|--------------------------------------------------------|
| 1 | Amount: 10, Ref:123,, |
| 2 | Amount:12.44, Ref:SIS,Amount:11.22, Ref:Mum,, Ref:Dad |
| 3 | Amount:11 |

Related

SQL select a row X times and insert into new

I am trying to migrate a bunch of data from an old database to a new one. The old one stored the number of alarms that occurred on a single row; the new database inserts a new record for each alarm that occurs. Here is a basic version of how it might look. I want to select each row from Table 1 and insert one new row into Table 2 per alarm counted in "Alarm Value".
Table 1:
| Alarm ID | Alarm Value |
|--------------|----------------|
| 1 | 3 |
| 2 | 2 |
Should go into the alarm table as the below values.
Table 2:
| Alarm New ID | Value |
|--------------|----------|
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 4 | 2 |
| 5 | 2 |
I want to create a select insert script that will do this, so the select statement will bring back the number of rows that appear in the "Value" column.

A recursive CTE can be convenient for this:
with cte as (
select id, alarm, 1 as n
from t
union all
select id, alarm, n + 1
from cte
where n < alarm
)
select row_number() over (order by id) as alarm_id, id as value
from cte
order by 1
option (maxrecursion 0);
Note: If your values do not exceed 100, then you can remove OPTION (MAXRECURSION 0).

Replicate values out with a CTE.
DECLARE @T TABLE(AlarmID INT, Value INT)
INSERT @T VALUES
(1,3),
(2,2)
;WITH ReplicateAmount AS
(
SELECT AlarmID, Value FROM @T
UNION ALL
SELECT R.AlarmID, Value=(R.Value - 1)
FROM ReplicateAmount R
INNER JOIN @T T ON R.AlarmID = T.AlarmID
WHERE R.Value > 1
)
SELECT
AlarmID = ROW_NUMBER() OVER( ORDER BY AlarmID),
Value = AlarmID --??
FROM
ReplicateAmount
ORDER BY
AlarmID
This answers your question as asked. I would think the final SELECT below (run against the same ReplicateAmount CTE) would be more useful; however, you did not include usage context.
SELECT
AlarmID,
Value
FROM
ReplicateAmount
ORDER BY
AlarmID

Rather than using an rCTE, which is recursive (as the name suggests) and will fail once it passes the default limit of 100 recursion levels unless you raise MAXRECURSION, you can use a Tally table, which tends to be far faster as well:
WITH N AS(
SELECT N
FROM (VALUES(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL))N(N)),
Tally AS(
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS I
FROM N N1, N N2, N N3)
SELECT ROW_NUMBER() OVER (ORDER BY V.AlarmID,T.I) AS AlarmNewID,
V.AlarmID
FROM (VALUES(1,3),(2,2))V(AlarmID,AlarmValue)
JOIN Tally T ON V.AlarmValue >= T.I;
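
Because the question asks for a select-insert script, the Tally approach above could be extended into an INSERT ... SELECT roughly as follows. This is only a sketch: the table variables @Table1 and @Table2, and the assumption that the new ID is an IDENTITY column, are made up for illustration and are not stated in the question.

```sql
-- Hypothetical stand-ins for the old and new tables.
DECLARE @Table1 TABLE (AlarmID INT, AlarmValue INT);
DECLARE @Table2 TABLE (AlarmNewID INT IDENTITY(1, 1), Value INT);

INSERT INTO @Table1 (AlarmID, AlarmValue) VALUES (1, 3), (2, 2);

WITH N AS (
    -- Ten placeholder rows; cross joining three copies gives 1,000 numbers.
    SELECT N FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) AS X(N)
),
Tally AS (
    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS I
    FROM N AS N1 CROSS JOIN N AS N2 CROSS JOIN N AS N3
)
INSERT INTO @Table2 (Value)
SELECT T1.AlarmID
FROM @Table1 AS T1
JOIN Tally AS T ON T.I <= T1.AlarmValue   -- one output row per counted alarm
ORDER BY T1.AlarmID, T.I;                 -- controls how IDENTITY values are assigned

SELECT AlarmNewID, Value
FROM @Table2
ORDER BY AlarmNewID;
```

With the sample rows this should yield three rows with Value 1 followed by two rows with Value 2. The three-way cross join caps the Tally at 1,000, so an Alarm Value above that would need another copy of N in the join.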

SQL - Calculate next row based on previous in the same column

I have spent hours trying to solve this with loops and the lag function, but neither solves my problem. I have a table where only the first row of a particular field is populated. The next row is calculated by subtracting one column from the other on the previous row, and the row after that is based on the result of this, and so on. The original table and the desired result set are shown below:
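Original table:
| a | b |
|-------|--------|
| 502.5 | 33.85 |
| | 25.46 |
| | 20.83 |
| | 133.07 |
| | 144.65 |
| | 144.65 |
Result set:
| a | b |
|--------|--------|
| 502.5 | 33.85 |
| 468.65 | 25.46 |
| 443.19 | 20.83 |
| 422.36 | 133.07 |
| 289.29 | 144.65 |
| 144.64 | 144.65 |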
I have tried several different methods with stored procedures and can get the 2nd row of the result set, but I can't get it to continue and calculate the rest of the fields. It's easy in Excel but not so in SQL. Any suggestions?

If your RDBMS supports windowed aggregate functions, and assuming you have an id (or something similar) that determines the order of your rows (you indicated there is a first row), you can use the max() over() and sum() over() windowed aggregate functions. In this case min() works in place of max() as well.
select
id
, max(a) over (order by id) - (sum(b) over (order by id) - b) as a
, b
from t
rextester demo: http://rextester.com/MGKM17497
returns:
+----+--------+--------+
| id | a | b |
+----+--------+--------+
| 1 | 502,50 | 33,85 |
| 2 | 468,65 | 25,46 |
| 3 | 443,19 | 20,83 |
| 4 | 422,36 | 133,07 |
| 5 | 289,29 | 144,65 |
| 6 | 144,64 | 144,65 |
+----+--------+--------+
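
The same running subtraction can also be written with an explicit window frame, which may make the arithmetic easier to follow: each row's a is the opening value minus the sum of b over all earlier rows. This is a sketch assuming SQL Server 2012 or later and a hypothetical table variable @t loaded with the question's numbers.

```sql
-- Hypothetical table variable holding the question's data; only row 1 has a.
DECLARE @t TABLE (id INT PRIMARY KEY, a DECIMAL(10, 2) NULL, b DECIMAL(10, 2) NOT NULL);

INSERT INTO @t (id, a, b) VALUES
    (1, 502.50, 33.85), (2, NULL, 25.46), (3, NULL, 20.83),
    (4, NULL, 133.07), (5, NULL, 144.65), (6, NULL, 144.65);

SELECT id,
       -- Opening value (the only non-NULL a) minus everything subtracted so far.
       MAX(a) OVER ()
         - COALESCE(SUM(b) OVER (ORDER BY id
                                 ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING), 0) AS a,
       b
FROM @t
ORDER BY id;
```

For example, row 3 works out as 502.50 - (33.85 + 25.46) = 443.19, matching the result table above. The COALESCE covers the first row, where the frame of earlier rows is empty and SUM returns NULL.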

In case the data is laid out as I saw it before the question was edited:
This solution also assumes that you have an id column and that the order depends on this column.
with t(id, a, b) as(
select 1, 502.5, 33.85 union all
select 2, 25.46, null union all
select 3, 20.83, null union all
select 4, 133.07, null union all
select 5, 144.65, null union all
select 6, 144.65, null
)
select case when id = 1 then a else b end as a, case when id = 1 then (select b from t order by id offset 0 rows fetch next 1 rows only) else a end as b from (
select id, a, lag((select a from t order by id offset 0 rows fetch next 1 rows only)-s) over(order by id) as b from (
select id, a, sum(case when b is null then a else b end ) over(order by id) s
from t
) tt
) ttt

Removing duplicate results

I have a view with some records, many of which are duplicated. I need to filter the records and get only one of each.
I've tried
SELECT TOP 1 Item, Code, Desc, '1' AS Qty FROM vwTbl1 WHERE Code = '12' OR Code = '311'
but in this case it shows me only one record. I also tried DISTINCT, but I still get all records. Grouping by Code doesn't work.
Is there any other way to solve this?
Item | Code | Desc | QTY
a | 12 | 1 |1
a | 311 | 2 |1
b | 12 | 3 |1
b | 311 | 4 |1
c | 1 | 5 |1
Result should be like:
Item | Code | Desc | QTY
a | 12 | 1 |1
b | 311 | 3 |1
So for each criterion, get the first record.

The typical way of doing this uses row_number():
SELECT Item, Code, [Desc], 1 AS Qty -- Desc is a reserved word, so it needs brackets
FROM (SELECT v.*,
ROW_NUMBER() OVER (PARTITION BY Code ORDER BY (SELECT NULL)) as seqnum
FROM vwTbl1 v
WHERE Code IN ('12', '311') -- don't use single quotes if these are numbers
) v
WHERE seqnum = 1; -- one row per Code; no TOP 1, which would cut the result to a single row
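
A self-contained sketch of the same ROW_NUMBER() pattern, using a table variable @vwTbl1 as a hypothetical stand-in for the view and the rows from the question. The ORDER BY Item, [Desc] inside the window is an assumption about what "first record" means; with ORDER BY (SELECT NULL) the row kept for each Code is arbitrary.

```sql
-- Hypothetical stand-in for the view, loaded with the question's rows.
DECLARE @vwTbl1 TABLE (Item VARCHAR(10), Code VARCHAR(10), [Desc] INT, Qty INT);

INSERT INTO @vwTbl1 (Item, Code, [Desc], Qty) VALUES
    ('a', '12', 1, 1),
    ('a', '311', 2, 1),
    ('b', '12', 3, 1),
    ('b', '311', 4, 1),
    ('c', '1', 5, 1);

SELECT x.Item, x.Code, x.[Desc], x.Qty
FROM (
    SELECT t.*,
           -- Number the rows within each Code; the tie-break decides which one is "first".
           ROW_NUMBER() OVER (PARTITION BY t.Code ORDER BY t.Item, t.[Desc]) AS seqnum
    FROM @vwTbl1 AS t
    WHERE t.Code IN ('12', '311')
) AS x
WHERE x.seqnum = 1
ORDER BY x.Code;
```

This returns one row per Code value. Changing the ORDER BY inside OVER() changes which duplicate survives, so it is worth choosing it deliberately.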

SELECT Top 1 *
FROM
(
SELECT Item, Code, Desc, '1' AS Qty
FROM vwTbl1 WHERE Code = '12' OR Code ='311'
)A
Edited Code based on your expected result:
Declare @YourTable table (Id INT IDENTITY(1,1),Item varchar(50),Code INT,
_Desc INT,Qty INT)
Insert into @YourTable
SELECT 'a',12,1,1 UNION ALL
SELECT 'a',311,2,1 UNION ALL
SELECT 'b',12,3,1 UNION ALL
SELECT 'b',311,4,1 UNION ALL
SELECT 'c',1 ,5 ,1
SELECT Item ,A.Code , _Desc ,Qty
FROM @YourTable T
JOIN
(
SELECT MAX(Id) Id, Code FROM @YourTable GROUP BY Code
)A ON A.Id = T.Id

Group rows into sequences using a sliding window on a DateTime column

I have a table that stores timestamped events. I want to group the events into 'sequences' by using 5-min sliding window on the timestamp column, and write the 'sequence ID' (any ID that can distinguish sequences) and 'order in sequence' into another table.
Input - event table:
+----+-------+-----------+
| Id | Name | Timestamp |
+----+-------+-----------+
| 1 | test | 00:00:00 |
| 2 | test | 00:06:00 |
| 3 | test | 00:10:00 |
| 4 | test | 00:14:00 |
+----+-------+-----------+
Desired output - sequence table. Here SeqId is the ID of the starting event, but it doesn't have to be, just something to uniquely identify a sequence.
+---------+-------+----------+
| EventId | SeqId | SeqOrder |
+---------+-------+----------+
| 1 | 1 | 1 |
| 2 | 2 | 1 |
| 3 | 2 | 2 |
| 4 | 2 | 3 |
+---------+-------+----------+
What would be the best way to do it? This is MSSQL 2008, I can use SSAS and SSIS if they make things easier.

CREATE TABLE #Input (Id INT, Name VARCHAR(20), Time_stamp TIME)
INSERT INTO #Input
VALUES
( 1 ,'test','00:00:00' ),
( 2 ,'test','00:06:00' ),
( 3 ,'test','00:10:00' ),
( 4 ,'test','00:14:00' )
SELECT * FROM #Input;
WITH cte AS -- add a sequential number
(
SELECT *,
ROW_NUMBER() OVER(ORDER BY Id) AS sort
FROM #Input
), cte2 as -- find the Id's with a difference of more than 5min
(
SELECT cte.*,
CASE WHEN DATEDIFF(MI, cte_1.Time_stamp,cte.Time_stamp) < 5 THEN 0 ELSE 1 END as GrpType
FROM cte
LEFT OUTER JOIN
cte as cte_1 on cte.sort =cte_1.sort +1
), cte3 as -- assign a SeqId
(
SELECT GrpType, Time_Stamp,ROW_NUMBER() OVER(ORDER BY Time_stamp) SeqId
FROM cte2
WHERE GrpType = 1
), cte4 as -- find the Time_Stamp range per SeqId
(
SELECT cte3.*,cte_2.Time_stamp as TS_to
FROM cte3
LEFT OUTER JOIN
cte3 as cte_2 on cte3.SeqId =cte_2.SeqId -1
)
-- final query
SELECT
t.Id,
cte4.SeqId,
ROW_NUMBER() OVER(PARTITION BY cte4.SeqId ORDER BY t.Time_stamp) AS SeqOrder
FROM cte4 INNER JOIN #Input t ON t.Time_stamp>=cte4.Time_stamp AND (t.Time_stamp <cte4.TS_to OR cte4.TS_to IS NULL);
This code is slightly more complex, but it returns the expected output (which Gordon Linoff's solution, as originally posted, did not) and it is even slightly faster.

You seem to want things grouped together when they are less than five minutes apart. You can assign the groups by getting the previous time stamp and marking the beginning of a group. You then need to do a cumulative sum to get the group id:
with e as (
select e.*,
-- flag = 1 marks the start of a new sequence: no previous event, or a gap of 5+ minutes
(case when datediff(minute, prev_timestamp, timestamp) < 5 then 0 else 1 end) as flag
from (select e.*,
(select top 1 e2.timestamp
from events e2
where e2.timestamp < e.timestamp
order by e2.timestamp desc
) as prev_timestamp
from events e
) e
)
select e.Id as eventId, e.seqId,
row_number() over (partition by e.seqId order by e.timestamp) as seqOrder
-- cumulative sum of the start flags gives the sequence id
from (select e.*, (select sum(flag) from e e2 where e2.timestamp <= e.timestamp) as seqId
from e
) e;
By the way, this logic is easier to express in SQL Server 2012+ because the window functions are more powerful.
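
As a rough illustration of that remark, here is how the same grouping might look on SQL Server 2012 or later, using LAG() to fetch the previous timestamp and a running SUM() to number the sequences. The table variable @events and the column name isStart are assumptions for the sketch; the rows come from the question.

```sql
-- Hypothetical table variable holding the question's events.
DECLARE @events TABLE (Id INT PRIMARY KEY, Name VARCHAR(20), [Timestamp] TIME);

INSERT INTO @events (Id, Name, [Timestamp]) VALUES
    (1, 'test', '00:00:00'),
    (2, 'test', '00:06:00'),
    (3, 'test', '00:10:00'),
    (4, 'test', '00:14:00');

WITH flagged AS (
    SELECT Id, [Timestamp],
           -- 1 when a new sequence starts: first row (LAG is NULL) or a gap of 5+ minutes.
           CASE WHEN DATEDIFF(MINUTE,
                              LAG([Timestamp]) OVER (ORDER BY [Timestamp]),
                              [Timestamp]) < 5
                THEN 0 ELSE 1 END AS isStart
    FROM @events
),
grouped AS (
    SELECT Id, [Timestamp],
           -- Running total of start flags: every row in a sequence shares one value.
           SUM(isStart) OVER (ORDER BY [Timestamp] ROWS UNBOUNDED PRECEDING) AS SeqId
    FROM flagged
)
SELECT Id AS EventId,
       SeqId,
       ROW_NUMBER() OVER (PARTITION BY SeqId ORDER BY [Timestamp]) AS SeqOrder
FROM grouped
ORDER BY Id;
```

For the sample data this gives event 1 its own sequence and puts events 2, 3 and 4 together in a second one, as in the desired output. Note that DATEDIFF(MINUTE, ...) counts minute boundaries crossed, so timestamps with seconds may need DATEDIFF(SECOND, ...) < 300 instead.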

Query for missing elements

I have a table with the following structure:
timestamp | name | value
0 | john | 5
1 | NULL | 3
8 | NULL | 12
12 | john | 3
33 | NULL | 4
54 | pete | 1
180 | NULL | 4
400 | john | 3
401 | NULL | 4
592 | anna | 2
Now what I am looking for is a query that will give me the sum of the values for each name, and treats the nulls in between (ordered by the timestamp) as the first non-null name down the list, as if the table were as follows:
timestamp | name | value
0 | john | 5
1 | john | 3
8 | john | 12
12 | john | 3
33 | pete | 4
54 | pete | 1
180 | john | 4
400 | john | 3
401 | anna | 4
592 | anna | 2
and I would query SUM(value), name from this table group by name. I have thought and tried, but I can't come up with a proper solution. I have looked at recursive common table expressions, and think the answer may lie in there, but I haven't been able to properly understand those.
These tables are just examples, and I don't know the timestamp values in advance.
Could someone give me a hand? Help would be very much appreciated.

With Inputs As
(
Select 0 As [timestamp], 'john' As Name, 5 As value
Union All Select 1, NULL, 3
Union All Select 8, NULL, 12
Union All Select 12, 'john', 3
Union All Select 33, NULL, 4
Union All Select 54, 'pete', 1
Union All Select 180, NULL, 4
Union All Select 400, 'john', 3
Union All Select 401, NULL, 4
Union All Select 592, 'anna', 2
)
, NamedInputs As
(
Select I.timestamp
, Coalesce (I.Name
, (
Select I3.Name
From Inputs As I3
Where I3.timestamp = (
Select Max(I2.timestamp)
From Inputs As I2
Where I2.timestamp < I.timestamp
And I2.Name Is not Null
)
)) As name
, I.value
From Inputs As I
)
Select NI.name, Sum(NI.Value) As Total
From NamedInputs As NI
Group By NI.name
Btw, what would be orders of magnitude faster than any query is to correct the data first, i.e. update the name column to hold the proper value, make it non-nullable, and then run a simple Group By to get your totals.
Additional Solution
Select Coalesce(I.Name, I2.Name), Sum(I.value) As Total
From Inputs As I
Left Join (
Select I1.timestamp, MAX(I2.Timestamp) As LastNameTimestamp
From Inputs As I1
Left Join Inputs As I2
On I2.timestamp < I1.timestamp
And I2.Name Is Not Null
Group By I1.timestamp
) As Z
On Z.timestamp = I.timestamp
Left Join Inputs As I2
On I2.timestamp = Z.LastNameTimestamp
Group By Coalesce(I.Name, I2.Name)

You don't need a CTE, just a simple subquery.
select t.timestamp, ISNULL(t.name, (
select top(1) i.name
from inputs i
where i.timestamp < t.timestamp
and i.name is not null
order by i.timestamp desc
)), t.value
from inputs t
Then sum from there:
select name, SUM(value) as totalValue
from
(
select t.timestamp, ISNULL(t.name, (
select top(1) i.name
from inputs i
where i.timestamp < t.timestamp
and i.name is not null
order by i.timestamp desc
)) as name, t.value
from inputs t
) N
group by name
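
One thing to check against the question's example: its filled-in table gives each NULL the next non-null name further down the list (the row at timestamp 33 becomes pete, not john), whereas the subquery above looks backward with < and ORDER BY ... DESC. If the forward-looking reading is the intended one, a sketch of that variant could look like this. The table variable @inputs is a hypothetical stand-in loaded with the question's rows.

```sql
-- Hypothetical table variable holding the question's data.
DECLARE @inputs TABLE ([timestamp] INT PRIMARY KEY, name VARCHAR(20) NULL, value INT);

INSERT INTO @inputs ([timestamp], name, value) VALUES
    (0, 'john', 5), (1, NULL, 3), (8, NULL, 12), (12, 'john', 3),
    (33, NULL, 4), (54, 'pete', 1), (180, NULL, 4), (400, 'john', 3),
    (401, NULL, 4), (592, 'anna', 2);

SELECT f.name, SUM(f.value) AS totalValue
FROM (
    SELECT COALESCE(t.name,
                    -- Nearest LATER row that has a name (forward fill, per the example table).
                    (SELECT TOP (1) i.name
                     FROM @inputs AS i
                     WHERE i.[timestamp] > t.[timestamp]
                       AND i.name IS NOT NULL
                     ORDER BY i.[timestamp] ASC)) AS name,
           t.value
    FROM @inputs AS t
) AS f
GROUP BY f.name;
```

With the sample rows this gives john 30, pete 5 and anna 6, which is what summing the question's second table by name produces. A trailing NULL with no later name would stay NULL and form its own group.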

I hope I'm not going to be embarrassed by offering you this little recursive CTE query of mine as a solution to your problem.
;WITH
numbered_table AS (
SELECT
timestamp, name, value,
rownum = ROW_NUMBER() OVER (ORDER BY timestamp)
FROM your_table
),
filled_table AS (
SELECT
timestamp,
name,
value,
rownum -- carried along so the recursive member can join on it
FROM numbered_table
WHERE rownum = 1
UNION ALL
SELECT
nt.timestamp,
name = ISNULL(nt.name, ft.name),
nt.value,
nt.rownum
FROM numbered_table nt
INNER JOIN filled_table ft ON nt.rownum = ft.rownum + 1
)
SELECT *
FROM filled_table
-- add OPTION (MAXRECURSION 0) here if the table has more than about 100 rows
/* or go ahead aggregating instead */
