In SQL, how to collapse two rows into one? - sql

Let's say I have the following table:
C1         C2     C3      C4
--------------------------------------------
Alton      James  Webs    AltonJamesWebs
Alton      Webs   Jams    AltonJamsWebs
Buddarakh  Izme   Grill   BuddarakhGrillIzme
Buddarakh  Gri    Izmezh  BuddarakhGriIzmezh
How would I collapse the table based on the Column C1 so that the result looks like the following:
C1         C2_1   C3_1   C4_1                C2_2  C3_2    C4_2
----------------------------------------------------------------------------------
Alton      James  Webs   AltonJamesWebs      Webs  Jams    AltonJamsWebs
Buddarakh  Izme   Grill  BuddarakhGrillIzme  Gri   Izmezh  BuddarakhGriIzmezh
The ultimate purpose of this is to calculate the Levenshtein distance between the strings in columns C4_1 and C4_2.

Converting your example data into DDL/DML:
DECLARE @Table TABLE (C1 NVARCHAR(20), C2 NVARCHAR(20), C3 NVARCHAR(20), C4 NVARCHAR(20));
INSERT INTO @Table (C1, C2, C3, C4) VALUES
('Alton', 'James', 'Webs', 'AltonJamesWebs'),
('Alton', 'Webs', 'Jams', 'AltonJamsWebs'),
('Buddarakh', 'Izme', 'Grill', 'BuddarakhGrillIzme'),
('Buddarakh', 'Gri', 'Izmezh', 'BuddarakhGriIzmezh'),
('Buddarakh', 'Gric', 'Izmezh', 'BuddarakhGriIzmezh');
We can perform a self-join, but first we'll want to assign some row numbers so we can keep track of the rows later:
;WITH nowWithRowNumber AS (
SELECT t.C1, t.C2, t.C3, t.C4, ROW_NUMBER() OVER (PARTITION BY C1 ORDER BY c2, c3, c4) AS rn
FROM @Table t
)
SELECT t.C1, t.C2, t.c3, t.C4, t2.C2 AS C2_2, t2.C3 AS C3_2, t2.C4 AS C4_2, t2.rn
FROM nowWithRowNumber t
INNER JOIN nowWithRowNumber t2
ON t.C1 = t2.C1
AND t2.rn <> 1
AND (
t.c2 <> t2.c2
OR t.c3 <> t2.c3
)
WHERE t.rn = 1
C1         C2     c3      C4                  C2_2  C3_2    C4_2                rn
----------------------------------------------------------------------------------
Alton      James  Webs    AltonJamesWebs      Webs  Jams    AltonJamsWebs       2
Buddarakh  Gri    Izmezh  BuddarakhGriIzmezh  Gric  Izmezh  BuddarakhGriIzmezh  2
Buddarakh  Gri    Izmezh  BuddarakhGriIzmezh  Izme  Grill   BuddarakhGrillIzme  3
This assumes logic that you'll need to confirm or tune: rows are joined when their C1 values match but their C2 or C3 values differ, and the rows are partitioned by C1 and sorted by C2, C3, C4.

Trying to understand the wider context of your problem, I think this is an XY problem. In my experience, when I have wanted to calculate the Levenshtein distance I was trying to find duplicate rows, and I always wanted to do something with them once found. Pivoting them into columns makes any further processing very difficult. So I would keep the rows as they are and match each one against the first row found in its C1 group. This also handles any number of potential duplicates, although to be fair the logic is fairly simplistic.
DECLARE @Table TABLE (Id int, C1 nvarchar(20), C2 nvarchar(20), C3 nvarchar(20), C4 nvarchar(20));
INSERT INTO @Table (Id, C1, C2, C3, C4) VALUES
(1, 'Alton', 'James', 'Webs', 'AltonJamesWebs'),
(2, 'Alton', 'Webs', 'Jams', 'AltonJamsWebs'),
(3, 'Buddarakh', 'Izme', 'Grill', 'BuddarakhGrillIzme'),
(4, 'Buddarakh', 'Gri', 'Izmezh', 'BuddarakhGriIzmezh'),
(5, 'Buddarakh', 'Gric', 'Izmezh', 'BuddarakhGriIzmezh');
WITH cte1 AS (
-- First find the row number within the C1 group
SELECT *
, ROW_NUMBER() OVER (PARTITION BY C1 ORDER BY Id) rn
FROM @Table
), cte2 AS (
-- Second using lag for all but the first row, lag back using rn to the
-- first row in the C1 group
SELECT *
, CASE WHEN rn > 1 THEN LAG(Id, rn-1, null) OVER (PARTITION BY C1 ORDER BY Id) ELSE NULL END baseId
, CASE WHEN rn > 1 THEN LAG(C2, rn-1, null) OVER (PARTITION BY C1 ORDER BY Id) ELSE NULL END baseC2
, CASE WHEN rn > 1 THEN LAG(C3, rn-1, null) OVER (PARTITION BY C1 ORDER BY Id) ELSE NULL END baseC3
, CASE WHEN rn > 1 THEN LAG(C4, rn-1, null) OVER (PARTITION BY C1 ORDER BY Id) ELSE NULL END baseC4
FROM cte1
)
SELECT Id
, C1, C2, C3, C4
, baseId, baseC2, baseC3, baseC4
-- Some function to calculate Levenshtein Distance
, dbo.LevenshteinDistance(baseC4, C4) LevenshteinDistance
FROM cte2;
This returns:
Id  C1         C2     C3      C4                  baseId  baseC2  baseC3  baseC4
-----------------------------------------------------------------------------------------------
1   Alton      James  Webs    AltonJamesWebs      null    null    null    null
2   Alton      Webs   Jams    AltonJamsWebs       1       James   Webs    AltonJamesWebs
3   Buddarakh  Izme   Grill   BuddarakhGrillIzme  null    null    null    null
4   Buddarakh  Gri    Izmezh  BuddarakhGriIzmezh  3       Izme    Grill   BuddarakhGrillIzme
5   Buddarakh  Gric   Izmezh  BuddarakhGriIzmezh  3       Izme    Grill   BuddarakhGrillIzme
As you can see, every row except the first in its group also carries the details of the group's first row. Those can be used to calculate the Levenshtein distance, and then potentially to merge the rows, because each row records which row it is being compared against.
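The query above calls dbo.LevenshteinDistance, which is not a built-in SQL Server function and is not defined in the answer. As a rough illustration of what such a function computes, here is a minimal Python sketch of the classic dynamic-programming algorithm. The name levenshtein is made up for this example, and a T-SQL version would still have to be written or installed separately.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    # prev[j] is the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or keep a matching character
            ))
        prev = curr
    return prev[-1]


# C4 values from the question: each group's first row paired with a later row.
pairs = [
    ("AltonJamesWebs", "AltonJamsWebs"),
    ("BuddarakhGrillIzme", "BuddarakhGriIzmezh"),
]
distances = [levenshtein(base, other) for base, other in pairs]
print(distances)
```

Only two rows of the table are held in memory at a time, which keeps the sketch cheap for short strings like these names.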
DBFiddle
Thanks for the sample data Patrick Hurst.
Note: you might also consider the DIFFERENCE function.

Related

How to UNPIVOT all columns in a table and aggregate into Data Quality/Validation Metrics? SQL SNOWFLAKE

I have a table with 60+ columns in it that I would like to UNPIVOT so that each column becomes a row and then find the fill rate, min value and max value of each entry.
For example:
ID  START_DATE  END_DATE  EVENT_ID  PROVIDER_CODE
-------------------------------------------------
01  01/23/21    03/14/21  0023401   0012323
02  06/04/21    09/20/21  0025906   0023454
03  07/20/21    12/02/21  0027093   0034983
And I want the output to look like this:
Column_Name  Fill_Rate  Min       Max
---------------------------------------------
ID           0.7934     01        03
Start_Date   0.6990     01/23/21  07/20/21
End_Date     0.9089     03/14/21  12/02/21
Event_ID     1.0000     0023401   0027093
I'm struggling to get the desired output, especially because the columns have different data types.
I tried the following, but UNPIVOT doesn't allow aggregate functions inside it:
select *
from "DSVC_MERCKPAN_PROD"."COHORTS_LATEST"."MEDICAL_HEADERS"
UNPIVOT (
max(code) as max_value,
min(code) as min_value,
avg(code) as fill_rate,
code as column_name
)
For fill rate, I was trying to use the logic below. ID is always populated, so it gives the total number of rows, whereas the other columns can be null:
(COUNT_IF(start_date is not null))/(COUNT_IF(ID is not null)) as FILL_RATE,
I have 2 ideas to implement the report.
The first way is casting all values to VARCHAR and then using UNPIVOT:
-- Generate dummy data
create or replace table t1 (c1 int, c2 int, c3 int, c4 int, c5 int, c6 int, c7 int, c8 int, c9 int, c10 int) as
select
iff(random()%2=0, random(), null), iff(random()%2=0, random(), null),
iff(random()%2=0, random(), null), iff(random()%2=0, random(), null),
iff(random()%2=0, random(), null), iff(random()%2=0, random(), null),
iff(random()%2=0, random(), null), iff(random()%2=0, random(), null),
iff(random()%2=0, random(), null), iff(random()%2=0, random(), null)
from table(generator(rowcount => 1000000000))
;
-- Query
with
cols as (
select column_name, ordinal_position
from information_schema.columns
where table_catalog = current_database()
and table_schema = current_schema()
and table_name = 'T1'
),
stringified as (
select
c1::varchar c1, c2::varchar c2, c3::varchar c3, c4::varchar c4, c5::varchar c5,
c6::varchar c6, c7::varchar c7, c8::varchar c8, c9::varchar c9, c10::varchar c10
from t1
),
data as (
select column_name, column_value
from stringified
unpivot(column_value for column_name in (c1, c2, c3, c4, c5, c6, c7, c8, c9, c10))
)
select
c.column_name,
count(d.column_value)/(select count(*) from t1) fill_rate,
min(d.column_value) min,
max(d.column_value) max
from cols c
left join data d using (column_name)
group by c.column_name, c.ordinal_position
order by c.ordinal_position
;
/*
COLUMN_NAME FILL_RATE MIN MAX
C1 0.500000 -1000000069270747870 999999972962694409
C2 0.499980 -1000000027928146782 999999946877079818
C3 0.499996 -1000000012155323098 999999942281548701
C4 0.500017 -1000000056353213091 999999946421698482
C5 0.500015 -1000000015608859996 999999993977648967
C6 0.500003 -1000000007081089270 999999998851014730
C7 0.499987 -100000008605944993 999999968272328033
C8 0.499992 -1000000042470913027 999999977402822725
C9 0.500011 -1000000058928465662 999999969060696774
C10 0.500029 -1000000011306371004 99999996061390938
*/
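This is straightforward, but you still have to list every column name twice, which gets tedious when the table has a very large number of columns (though I believe it is still much better than a huge UNION ALL query). Note that because every value is cast to VARCHAR, MIN and MAX compare strings rather than numbers or dates.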
Another solution is a bit tricky, but you can unpivot a table by using OBJECT_CONSTRUCT(*) aggregation if the row length doesn't exceed a VARIANT value limit (16 MiB):
with
cols as (
select column_name, ordinal_position
from information_schema.columns
where table_catalog = current_database()
and table_schema = current_schema()
and table_name = 'T1'
),
data as (
select f.key column_name, f.value::varchar column_value
from (select object_construct(*) rec from t1) up,
lateral flatten(up.rec) f
)
select
c.column_name,
count(d.column_value)/(select count(*) from t1) fill_rate,
min(d.column_value) min,
max(d.column_value) max
from cols c
left join data d using (column_name)
group by c.column_name, c.ordinal_position
order by c.ordinal_position
;
/*
COLUMN_NAME FILL_RATE MIN MAX
C1 0.500000 -1000000069270747870 999999972962694409
C2 0.499980 -1000000027928146782 999999946877079818
C3 0.499996 -1000000012155323098 999999942281548701
C4 0.500017 -1000000056353213091 999999946421698482
C5 0.500015 -1000000015608859996 999999993977648967
C6 0.500003 -1000000007081089270 999999998851014730
C7 0.499987 -100000008605944993 999999968272328033
C8 0.499992 -1000000042470913027 999999977402822725
C9 0.500011 -1000000058928465662 999999969060696774
C10 0.500029 -1000000011306371004 99999996061390938
*/
OBJECT_CONSTRUCT(*) is a special usage of the OBJECT_CONSTRUCT function that uses each column name as a key of the resulting JSON object. As far as I know, this is the only way to extract column names from a table along with their values programmatically.
Since OBJECT_CONSTRUCT is a relatively heavy operation, this usually takes longer than the first solution, but you don't need to write out all the column names with this trick.
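To make the target metrics concrete, here is a small Python sketch (not Snowflake code) that unpivots rows into per-column fill rate, min and max. It uses the sample rows from the question, with one END_DATE set to None purely as an assumption so that a fill rate below 1 shows up. The function name column_metrics is invented for the illustration. Values are compared as strings, mirroring the VARCHAR cast used in both queries above.

```python
# Sample rows from the question; the None in END_DATE is an assumed gap.
rows = [
    {"ID": "01", "START_DATE": "01/23/21", "END_DATE": "03/14/21", "EVENT_ID": "0023401"},
    {"ID": "02", "START_DATE": "06/04/21", "END_DATE": None, "EVENT_ID": "0025906"},
    {"ID": "03", "START_DATE": "07/20/21", "END_DATE": "12/02/21", "EVENT_ID": "0027093"},
]


def column_metrics(rows):
    """Return {column: {fill_rate, min, max}} using string comparison."""
    total = len(rows)
    report = {}
    for name in rows[0]:
        # Unpivot: collect this column's non-null values across all rows.
        values = [r[name] for r in rows if r[name] is not None]
        report[name] = {
            "fill_rate": len(values) / total,
            "min": min(values) if values else None,
            "max": max(values) if values else None,
        }
    return report


report = column_metrics(rows)
for name, m in report.items():
    print(name, round(m["fill_rate"], 4), m["min"], m["max"])
```

Because the comparison is lexicographic, dates in MM/DD/YY form only sort correctly within a single year; casting to a sortable format first would be needed for a true earliest and latest date.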

SQL query to find all rows with same timestamp + or - one second

Row 3 in the following table is a duplicate. I know this because there is another row (row 5) that was created by the same user less than one second earlier.
row record created_by created_dt
1 5734 '00E759CF' '2020-06-05 19:59:36.610'
2 9856 '1E095CBA' '2020-06-05 19:57:31.207'
3 4592 '1E095CBA' '2020-06-05 19:54:41.930'
4 7454 '00E759CF' '2020-06-05 19:54:41.840'
5 4126 '1E095CBA' '2020-06-05 19:54:41.757'
I want a query that returns all rows created by the same user less than one second apart.
Like so:
row record created_by created_dt
1 4592 '1E095CBA' '2020-06-05 19:54:41.930'
2 4126 '1E095CBA' '2020-06-05 19:54:41.757'
This is what I have so far:
SELECT DISTINCT a1.*
FROM table AS a1
LEFT JOIN table AS a2
ON a1.created_by = a2.created_by
AND a1.created_dt > a2.created_dt
AND a1.created_dt <= DATEADD(second, 1, a2.created_dt)
WHERE a1.created_dt IS NOT NULL
AND a2.created_dt IS NOT NULL
This is what finally did the trick:
SELECT
a.*
FROM table a
WHERE EXISTS (SELECT TOP 1
*
FROM table a1
WHERE a1.created_by = a.created_by
AND ABS(DATEDIFF(SECOND, a.created_dt, a1.created_dt)) < 1
AND a.created_dt <> a1.created_dt)
ORDER BY created_dt DESC
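One caveat with DATEDIFF(SECOND, ...) < 1: SQL Server's DATEDIFF counts the second boundaries crossed between the two values, not the elapsed time, so two timestamps less than one second apart can still yield 1 and be missed. The Python sketch below emulates that behaviour to show the effect. The helper datediff_second is a hypothetical emulation, and timestamp c is invented for the example.

```python
from datetime import datetime


def datediff_second(start: datetime, end: datetime) -> int:
    """Emulate DATEDIFF(SECOND, start, end): count second boundaries crossed."""
    # Dropping the fractional part first reproduces the boundary counting.
    truncated_start = start.replace(microsecond=0)
    truncated_end = end.replace(microsecond=0)
    return int((truncated_end - truncated_start).total_seconds())


a = datetime(2020, 6, 5, 19, 54, 41, 757000)  # row 5 in the question
b = datetime(2020, 6, 5, 19, 54, 41, 930000)  # row 3 in the question
c = datetime(2020, 6, 5, 19, 54, 42, 100000)  # made up: 170 ms after b

same_second = datediff_second(a, b)   # both fall inside second 41
crosses_boundary = datediff_second(b, c)  # 41 -> 42 crosses one boundary
elapsed = (c - b).total_seconds()

print(same_second, crosses_boundary, elapsed)
```

If pairs that straddle a second boundary matter, comparing in milliseconds, for example ABS(DATEDIFF(MILLISECOND, a.created_dt, a1.created_dt)) < 1000, is one way to avoid the gap.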
You could use exists:
select t.*
from mytable t
where exists(
select 1
from mytable t1
where
t1.created_by = t.created_by
and abs(datediff(second, t.created_dt, t1.created_dt)) < 1
and t1.created_dt <> t.created_dt -- otherwise every row matches itself
)
How about something like this
SELECT DISTINCT a1.*
FROM @a1 AS a1
LEFT JOIN @a1 AS a2 ON a1.[Created_By] = a2.[Created_By]
AND a1.[Record] <> a2.[Record]
WHERE ABS(DATEDIFF(SECOND, a1.[Created_Dt], a2.[Created_Dt])) < 1
Here is the sample query I used to verify the results.
DECLARE @a1 TABLE (
[Record] INT,
[Created_By] NVARCHAR(10),
[Created_Dt] DATETIME
)
INSERT INTO @a1 VALUES
(5734, '00E759CF', '2020-06-05 19:59:36.610'),
(9856, '1E095CBA', '2020-06-05 19:57:31.207'),
(4592, '1E095CBA', '2020-06-05 19:54:41.930'),
(7454, '00E759CF', '2020-06-05 19:54:41.840'),
(4126, '1E095CBA', '2020-06-05 19:54:41.757')
SELECT DISTINCT a1.*
FROM @a1 AS a1
LEFT JOIN @a1 AS a2 ON a1.[Created_By] = a2.[Created_By]
AND a1.[Record] <> a2.[Record]
WHERE ABS(DATEDIFF(SECOND, a1.[Created_Dt], a2.[Created_Dt])) < 1
I would suggest lead() and lag() instead of self-joins:
select t.*
from (select t.*,
lag(created_dt) over (partition by created_by order by created_dt) as prev_created_dt,
lead(created_dt) over (partition by created_by order by created_dt) as next_created_dt
from t
) t
where created_dt < dateadd(second, 1, prev_created_dt) or
created_dt > dateadd(second, -1, next_created_dt)
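The neighbour comparison behind the lead()/lag() idea can be sketched outside the database: sort each user's rows by time, then compare every row with the one just before it. This is a minimal Python illustration using the question's sample data. The function name near_duplicates and the one-second window parameter are assumptions of the sketch, not part of any answer.

```python
from datetime import datetime, timedelta
from itertools import groupby

# (record, created_by, created_dt) rows from the question.
rows = [
    (5734, "00E759CF", "2020-06-05 19:59:36.610"),
    (9856, "1E095CBA", "2020-06-05 19:57:31.207"),
    (4592, "1E095CBA", "2020-06-05 19:54:41.930"),
    (7454, "00E759CF", "2020-06-05 19:54:41.840"),
    (4126, "1E095CBA", "2020-06-05 19:54:41.757"),
]


def near_duplicates(rows, window=timedelta(seconds=1)):
    """Return records created by the same user less than `window` apart."""
    parsed = sorted(
        (user, datetime.strptime(ts, "%Y-%m-%d %H:%M:%S.%f"), record)
        for record, user, ts in rows
    )
    hits = set()
    # groupby works here because the rows are already sorted by user.
    for _, group in groupby(parsed, key=lambda r: r[0]):
        group = list(group)
        # Pair each row with its predecessor, like LAG over (user, time).
        for prev, curr in zip(group, group[1:]):
            if curr[1] - prev[1] < window:
                hits.update((prev[2], curr[2]))
    return hits


print(sorted(near_duplicates(rows)))
```

Comparing only adjacent rows is enough, because if any two rows of a user are within the window, every consecutive pair between them is too.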

SQL Server - How to delete some rows of columns without disrupting the rest of the record

I have this:
-- -- -- --
01 A1 B1 99
01 A1 B1 98
02 A2 B2 97
02 A2 B2 96
I need this:
-- -- -- --
01 A1 B1 99
         98
02 A2 B2 97
         96
------------
The repeated data must not appear again when I present it in Excel; the result needs to look exactly like this.
In my actual table, the last column holds form responses and the first columns (the ones that must not repeat) hold customer data (phone, name, ...).
The result of this query will populate a DataTable and be exported to an .xlsx file.
Thanks for sharing knowledge ^^
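If you have SQL Server 2012 or later:
-- Every LAG uses the same ordering as the final ORDER BY,
-- so the blanks always land on the rows after the first of each group
SELECT
ISNULL(NULLIF(Column1,LAG(Column1) OVER(ORDER BY Column1,Column2,Column3,Column4 DESC)),'')
,ISNULL(NULLIF(Column2,LAG(Column2) OVER(ORDER BY Column1,Column2,Column3,Column4 DESC)),'')
,ISNULL(NULLIF(Column3,LAG(Column3) OVER(ORDER BY Column1,Column2,Column3,Column4 DESC)),'')
,Column4
FROM #mytable
ORDER BY Column1,Column2,Column3,Column4 DESC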
It's a little messy, but you can do it in the database. You basically make a subquery that gets the largest value for each group, then join that to the regular table and blank out the values on rows that don't match. I created your sample set like this:
CREATE TABLE mytable (N1 VARCHAR(2), A VARCHAR(2), B VARCHAR(2), N2 VARCHAR(2))
INSERT INTO mytable VALUES
('01', 'A1', 'B1', '99'),
('01', 'A1', 'B1', '98'),
('02', 'A2', 'B2', '97'),
('02', 'A2', 'B2', '96')
And then was able to get the result like this:
SELECT
CASE WHEN O.N2 = I.N2 THEN O.N1 ELSE '' END,
CASE WHEN O.N2 = I.N2 THEN O.A ELSE '' END,
CASE WHEN O.N2 = I.N2 THEN O.B ELSE '' END,
O.N2
FROM
(SELECT MAX(N2) AS N2, N1, A, B FROM mytable GROUP BY N1, A, B) I
INNER JOIN mytable O
ON O.A = I.A AND O.B = I.B AND O.N1 = I.N1
ORDER BY O.N1 ASC, O.N2 DESC
We can use ROW_NUMBER to get the sequence and substitute '' on all rows where the sequence is greater than 1:
with CTE
AS
( SELECT ID, ColumnA, ColumnB, value,ROW_NUMBER() over ( PARTITION by id order by id) as seq
FROM tableA
)
, CTE1
AS
(
select id, ColumnA, ColumnB, value, seq from CTE where seq =1
UNION
SELECT id ,'','', value , seq from CTE where seq >1
)
SELECT case when seq >1 THEN NULL ELSE id END as id, columnA, columnB, value from CTE1
You can achieve what you want using a query.
You haven't provided DDL, so I am going to assume your columns are called a, b, c and d respectively:
; WITH cte AS (
SELECT a
, b
, c
, d
, Row_Number() OVER (PARTITION BY a, b, c ORDER BY d) As sequence
FROM your_table
)
SELECT CASE WHEN sequence = 1 THEN a ELSE '' END As a
, CASE WHEN sequence = 1 THEN b ELSE '' END As b
, CASE WHEN sequence = 1 THEN c ELSE '' END As c
, d
FROM cte
ORDER
BY a
, b
, c
, d
The idea is to assign an incremental counter to each row, that restarts after each change of a + b + c.
We then use a conditional statement to show a value or not (basically only show on the first instance of each group)
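Since the question says the result ends up in a DataTable and an .xlsx file, the same "show only on the first row of each group" rule could also be applied in the presentation layer. The sketch below is a minimal Python illustration of that rule, assuming the rows arrive already sorted in display order. The function name blank_repeats and its key_len parameter are invented for the example.

```python
# Sample rows from the question, already in the desired display order.
rows = [
    ("01", "A1", "B1", "99"),
    ("01", "A1", "B1", "98"),
    ("02", "A2", "B2", "97"),
    ("02", "A2", "B2", "96"),
]


def blank_repeats(rows, key_len=3):
    """Blank the first key_len columns on rows that repeat the previous key."""
    out = []
    previous_key = None
    for row in rows:
        key = row[:key_len]
        if key == previous_key:
            # Same group as the row above: keep only the trailing columns.
            out.append(("",) * key_len + row[key_len:])
        else:
            # First row of a new group: show everything.
            out.append(row)
        previous_key = key
    return out


for line in blank_repeats(rows):
    print(line)
```

Comparing against the previous key plays the same role as checking that the row number within the group equals 1.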
The analytic ROW_NUMBER() function is good for this. I've made up column names because you didn't supply any. To assign a row number by customer, use something like this:
SELECT
Name,
Phone,
Address,
Response,
ROW_NUMBER() OVER (PARTITION BY Name, Phone, Address ORDER BY Response) AS CustRow
FROM myTable
That will assign row number within each customer. Try it yourself and I think it will make sense.
You can put it into a subquery or CTE from there and only show customer ID information like name, phone, and address when you're on the first row for each customer:
SELECT
CASE WHEN CustRow = 1 THEN Name ELSE '' END AS Name,
CASE WHEN CustRow = 1 THEN Phone ELSE '' END AS Phone,
CASE WHEN CustRow = 1 THEN Address ELSE '' END AS Address,
Response
FROM (
SELECT
Name,
Phone,
Address,
Response,
ROW_NUMBER() OVER (PARTITION BY Name, Phone, Address ORDER BY Response) AS CustRow
FROM myTable) custSubquery
ORDER BY Name, Phone, Address
The custSubquery on the second-to-last line is because SQL Server requires all subqueries to be aliased, even if the alias isn't used.
The most important thing is to determine how your last column will be ordered for display and to make sure that it's consistent in the ROW_NUMBER() function as well as the final ORDER BY.
If you need more help, please supply table and column names, and specify how results are ordered within each customer.

MS-SQL Average Columns with NULL

So I've got 3 different columns (basket 1, 2, and 3). Sometimes these columns have all the information and sometimes one or two of them are null. I have another column that I'm going to average these values into and save.
Is there a sleek/easy way to get the average of these three columns even if one of them is null? Or do I have to have a special check for each one being null?
Example data (~~ is null):
- B1 - B2 - B3 - Avg
------------------------------
- 10 - 20 - 30 - 20
- 10 - ~~ - 30 - 20
- ~~ - 20 - ~~ - 20
How would I write the T-SQL to update my temp table?
UPDATE #MyTable
SET Avg = ???
Answer:
Thanks to Aaronaught for the method I used. I'm going to put my code here just in case someone else has the same thing.
WITH AverageView AS
(
SELECT Results_Key AS xxx_Results_Key,
AVG(AverageValue) AS xxx_Results_Average
FROM #MyResults
UNPIVOT (AverageValue FOR B IN (Results_Basket_1_Price, Results_Basket_2_Price, Results_Basket_3_Price)) AS UnpivotTable
GROUP BY Results_Key
)
UPDATE #MyResults
SET Results_Baskets_Average_Price = xxx_Results_Average
FROM AverageView
WHERE Results_Key = xxx_Results_Key;
Assuming you have some sort of ID column, the most effective way is probably to use UNPIVOT so you can use the normal row-based AVG operator (which ignores NULL values):
DECLARE @Tbl TABLE
(
ID int,
B1 int,
B2 int,
B3 int
)
INSERT @Tbl (ID, B1, B2, B3) VALUES (1, 10, 20, 30)
INSERT @Tbl (ID, B1, B2, B3) VALUES (2, 10, NULL, 30)
INSERT @Tbl (ID, B1, B2, B3) VALUES (3, 10, NULL, NULL)
SELECT ID, AVG(Value) AS Average
FROM @Tbl
UNPIVOT (Value FOR B IN (B1, B2, B3)) AS u
GROUP BY ID
If you don't have the ID column, you can generate a surrogate ID using ROW_NUMBER:
;WITH CTE AS
(
SELECT
B1, B2, B3,
ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS ID
FROM @Tbl
)
)
SELECT ID, AVG(Value)
FROM CTE
UNPIVOT (Value FOR B IN (B1, B2, B3)) AS u
GROUP BY ID
SELECT (
SELECT AVG(b)
FROM (
SELECT b1 AS b
UNION ALL
SELECT b2
UNION ALL
SELECT b3
) q
)
FROM mytable
SELECT (ISNULL(B1,0) + ISNULL(B2,0) + ISNULL(B3,0))
/(CASE WHEN B1 IS NULL THEN 0 ELSE 1 END
+CASE WHEN B2 IS NULL THEN 0 ELSE 1 END
+CASE WHEN B3 IS NULL THEN 0 ELSE 1 END)
Add logic to exclude rows where all three columns are NULL if you need to; otherwise the divisor is zero for those rows.
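The rule all of these answers implement (average only the values that are present, and return nothing when every value is missing) fits in a few lines. This Python sketch uses None for NULL. The function name row_average is made up for the illustration.

```python
def row_average(*values):
    """Average the non-None values; return None if all of them are None."""
    present = [v for v in values if v is not None]
    if not present:
        return None  # avoids the divide-by-zero case for all-NULL rows
    return sum(present) / len(present)


# The three example rows from the question (None stands for ~~).
examples = [(10, 20, 30), (10, None, 30), (None, 20, None)]
averages = [row_average(*row) for row in examples]
print(averages)
```

Note that this sketch uses floating-point division, whereas AVG over an int column in SQL Server returns an int.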

sql select to start with a particular record

Is there any way to write a select record starting with a particular record? Suppose I have an table with following data:
SNO ID ISSUE
----------------------
1 A1 unknown
2 A2 some_issue
3 A1 unknown2
4 B1 some_issue2
5 B3 ISSUE4
6 B1 ISSUE4
Can I write a select to start showing records starting with B1 and then the remaining records? The output should be something like this:
4 B1 some_issue2
6 B1 ISSUE4
1 A1 unknown
2 A2 some_issue
3 A1 unknown2
5 B3 ISSUE4
It doesn't matter if B3 is last, just that B1 should be displayed first.
Couple of different options depending on what you 'know' ahead of time (i.e. the id of the record you want to be first, the sno, etc.):
Union approach:
select 1 as sortOrder, SNO, ID, ISSUE
from tableName
where ID = 'B1'
union all
select 2 as sortOrder, SNO, ID, ISSUE
from tableName
where ID <> 'B1'
order by sortOrder;
Case statement in order by:
select SNO, ID, ISSUE
from tableName
order by case when ID = 'B1' then 1 else 2 end;
You could also consider using temp tables, cte's, etc., but those approaches would likely be less performant...try a couple different approaches in your environment to see which works best.
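The CASE expression in the ORDER BY is a two-level sort key: first "is this the pinned ID?", then whatever tie-breaker you choose. A minimal Python sketch of the same idea on the question's sample data follows. The function name pinned_first is hypothetical, and using SNO as the tie-breaker is an assumption.

```python
# (SNO, ID, ISSUE) rows from the question.
rows = [
    (1, "A1", "unknown"),
    (2, "A2", "some_issue"),
    (3, "A1", "unknown2"),
    (4, "B1", "some_issue2"),
    (5, "B3", "ISSUE4"),
    (6, "B1", "ISSUE4"),
]


def pinned_first(rows, pinned_id):
    """Sort rows so that those with pinned_id come first, then by SNO."""
    # False sorts before True, so matching rows lead the result;
    # SNO is the tie-breaker within each of the two groups.
    return sorted(rows, key=lambda r: (r[1] != pinned_id, r[0]))


ordered = pinned_first(rows, "B1")
print([sno for sno, _, _ in ordered])
```

In SQL the order of rows that tie on the CASE value is not guaranteed, so adding a second sort key such as SNO to the ORDER BY makes the result deterministic.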
Assuming you are using MySQL, you could either use IF() in an ORDER BY clause...
SELECT SNO, ID, ISSUE FROM table ORDER BY IF( ID = 'B1', 0, 1 );
... or you could define a function that imposes your sort order...
DELIMITER $$
CREATE FUNCTION my_sort_order( ID VARCHAR(2), EXPECTED VARCHAR(2) )
RETURNS INT
BEGIN
RETURN IF( ID = EXPECTED, 0, 1 );
END$$
DELIMITER ;
SELECT SNO, ID, ISSUE FROM table ORDER BY my_sort_order( ID, 'B1' );
select * from table1
where id = 'B1'
union all
select * from table1
where id <> 'B1'