T-SQL: grouping rows by key, keeping the longest value in each column

I'm trying to come up with a way to combine rows in a table, keeping the longest string found in each column across the rows that share a key. Example:
CREATE TABLE test1
(akey int not null ,
text1 varchar(50) NULL,
text2 varchar(50) NULL,
text3 varchar(50) NULL )
INSERT INTO test1 VALUES ( 1,'Winchester Road','crawley',NULL)
INSERT INTO test1 VALUES ( 1,'Winchester Rd','crawley','P21869')
INSERT INTO test1 VALUES ( 1,'Winchester Road','crawley estate','P21869')
INSERT INTO test1 VALUES ( 1,'Winchester Rd','crawley','P21869A')
INSERT INTO test1 VALUES ( 2,'','birmingham','P53342B')
INSERT INTO test1 VALUES ( 2,'Smith Close','birmingham North East','P53342')
INSERT INTO test1 VALUES ( 2,'Smith Cl.',NULL,'P53342B')
INSERT INTO test1 VALUES ( 2,'Smith Close','birmingham North','P53342')
With these rows I would be looking for this result:
1 Winchester Road, crawley estate, P21869A
2 Smith Close, birmingham North East, P53342B
EDIT: the results above need to be in a table rather than just a comma-separated string.
As you can see in the result, the output for each column should be the longest value found across the rows sharing the same 'akey'.
I'm trying to come up with a solution that does not involve lots of subqueries on each column; the actual table has 32 columns and over 13 million rows.
The reason I'm doing this is to create a cleaned-up table that has the best value in each column, with just one row per ID.
This is my first post, so let me know if you need any more info; I'm happy to hear about any posting best practices I've broken!
Thanks,
Ben.

SELECT A.akey,
(
SELECT TOP 1 T1.text1
FROM test1 T1
WHERE T1.akey=A.akey AND LEN(T1.TEXT1) = MAX(LEN(A.text1))
) AS TEXT1,
(
SELECT TOP 1 T2.text2
FROM test1 T2
WHERE T2.akey=A.akey AND LEN(T2.TEXT2) = MAX(LEN(A.text2))
) AS TEXT2,
(
SELECT TOP 1 T3.text3
FROM test1 T3
WHERE T3.akey=A.akey AND LEN(T3.TEXT3) = MAX(LEN(A.text3))
) AS TEXT3
FROM TEST1 AS A
GROUP BY A.akey
I just realized you said you have 32 columns. I don't see a good way to do that, unless UNPIVOT would allow you to create separate rows (akey, textn) for each text* column.
Edit: I may not have a chance to finish this today, but UNPIVOT looks useful:
;
WITH COLUMNS AS
(
SELECT akey, [Column], ColumnValue
FROM
(
SELECT X.Akey, X.Text1, X.Text2, X.Text3
FROM test1 X
) AS p
UNPIVOT (ColumnValue FOR [Column] IN (Text1, Text2, Text3))
AS UNPVT
)
SELECT *
FROM COLUMNS
ORDER BY akey,[Column], LEN(ColumnValue)
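Here is a hedged sketch of one way the UNPIVOT idea could be finished (same three text columns as above; for the real 32-column table the IN lists would simply grow): rank the unpivoted values by length within each (akey, Column), keep the longest, then PIVOT back into one row per akey.
;WITH Unpvt AS
(
    SELECT akey, [Column], ColumnValue,
           ROW_NUMBER() OVER (PARTITION BY akey, [Column]
                              ORDER BY LEN(ColumnValue) DESC) AS rn
    FROM test1
    UNPIVOT (ColumnValue FOR [Column] IN (Text1, Text2, Text3)) AS u
)
SELECT akey, Text1, Text2, Text3
FROM (SELECT akey, [Column], ColumnValue FROM Unpvt WHERE rn = 1) AS longest
PIVOT (MAX(ColumnValue) FOR [Column] IN (Text1, Text2, Text3)) AS pv;
Note that UNPIVOT drops NULLs, so a column that is NULL in every row for a key simply comes back as NULL after the PIVOT.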

This seems really ugly, but at least works (on SQL2K) and doesn't need subqueries:
select test1.akey, A.text1, B.text2, C.text3
from test1
inner join test1 A on A.akey = test1.akey
inner join test1 B on B.akey = test1.akey
inner join test1 C on C.akey = test1.akey
group by test1.akey, A.text1, B.text2, C.text3
having len(a.text1) = max(len(test1.text1))
and len(B.text2) = max(len(test1.text2))
and len(C.text3) = max(len(test1.text3))
order by test1.akey
I must admit that it needs an inner join for each column, and I wonder how this would scale to the 32-column, 13-million-row table... I would try both this approach and the subquery-based one and look at the execution plans; I'd actually be curious to know.
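For what it's worth, a quick sketch-level way to compare the approaches is to run each candidate query with statistics switched on and compare the logical reads and CPU/elapsed times reported in the Messages tab:
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
-- run the subquery-based version here, then the join-based version
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;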

Related

UNION two SELECT queries but result set is smaller than one of them

In a SQL Server statement there is
SELECT id, book, acnt, prod, category from Table1 <where clause...>
UNION
SELECT id, book, acnt, prod, category from Table2 <where clause...>
The first query returned 131,972 lines of data; the 2nd one, 147,692 lines. I didn't notice any commonly shared lines of data between these two tables, so I expected the result set after UNION to be the sum: 131,972 + 147,692 = 279,384.
However, the result set after UNION has 133,857 rows. Even if there are overlapping lines that I accidentally missed, the result should be at least as large as the larger of the two result sets. I can't figure out where the number 133,857 came from.
Is my understanding of SQL UNION correct? I'm using SQL Server in this case.
To expand on the comment given under the question, which I think states what you already know:
UNION removes duplicates within each table as well, not just between the two tables.
Just take a look at an example:
SETUP:
create table tbl1 (col1 int, col2 int);
insert into tbl1 values
(1,2),
(3,4);
create table tbl2 (col1 int, col2 int);
insert into tbl2 values
(1,2),
(1,2),
(1,2),
(3,4);
Query
select * from tbl1
union
select * from tbl2;
will produce this output:
col1 | col2
-----|------
1 | 2
3 | 4
DB fiddle
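As a side note, if the goal is the raw combined total (the 279,384 rows from the question), UNION ALL is the variant that keeps every row, duplicates included. With the sample tables above it would return all six rows instead of two:
select * from tbl1
union all
select * from tbl2;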

Generating Lines based on a value from a column in another table

I have the following table:
EventID=00002,DocumentID=0005,EventDesc=ItemsReceived
I have the quantity in another table
DocumentID=0005,Qty=20
I want to generate a result of 20 lines (depending on the quantity) with an auto-generated column containing a sequence of:
ITEM_TAG_001,
ITEM_TAG_002,
ITEM_TAG_003,
ITEM_TAG_004,
..
ITEM_TAG_020
Here's your SQL query:
with cte as (
    select 1 as ctr, t2.Qty, t1.EventID, t1.DocumentId, t1.EventDesc
    from tableA t1
    inner join tableB t2 on t2.DocumentId = t1.DocumentId
    union all
    select ctr + 1, Qty, EventID, DocumentId, EventDesc
    from cte
    where ctr < Qty  -- stop once Qty rows have been generated (<= would produce one row too many)
)
select *, concat('ITEM_TAG_', right('000' + cast(ctr as varchar(3)), 3)) as ItemTag
from cte
option (maxrecursion 0);  -- lift the default 100-level recursion limit so Qty can exceed 100
Output: for DocumentID 0005 (Qty = 20), 20 rows tagged ITEM_TAG_001 through ITEM_TAG_020.
Best is to introduce a numbers table, which is very handy in many places...
Something along these lines:
Create some test data:
DECLARE @MockNumbers TABLE(Number BIGINT);
DECLARE @YourTable1 TABLE(DocumentID INT,ItemTag VARCHAR(100),SomeText VARCHAR(100));
DECLARE @YourTable2 TABLE(DocumentID INT, Qty INT);
INSERT INTO @MockNumbers SELECT TOP 100 ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) FROM master..spt_values;
INSERT INTO @YourTable1 VALUES(1,'FirstItem','qty 5'),(2,'SecondItem','qty 7');
INSERT INTO @YourTable2 VALUES(1,5), (2,7);
--The query
SELECT CONCAT(t1.ItemTag,'_',REPLACE(STR(A.Number,3),' ','0'))
FROM @YourTable1 t1
INNER JOIN @YourTable2 t2 ON t1.DocumentID=t2.DocumentID
CROSS APPLY(SELECT Number FROM @MockNumbers WHERE Number BETWEEN 1 AND t2.Qty) A;
The result
FirstItem_001
FirstItem_002
[...]
FirstItem_005
SecondItem_001
SecondItem_002
[...]
SecondItem_007
The idea in short:
We use an INNER JOIN to get the quantity joined to the item.
Now we use APPLY, which is a row-wise action, to bind as many rows to the set, as we need it.
The first item will return 5 lines, the second 7. The trick with STR() and REPLACE() is one way to create a padded number. You might use FORMAT() (v2012+) instead, but it performs rather slowly...
The table @MockNumbers is a declared table variable containing a list of numbers from 1 to 100. This answer provides an example of how to create a physical numbers-and-dates table. Any database should have such a table...
If you don't want to create a numbers table, you can search for a tally table or a tally on the fly. There are many answers showing approaches for creating a list of running numbers...
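A hedged sketch of such a tally built on the fly follows; the table and column names (tableA, tableB, Qty) are borrowed from the recursive-CTE answer above and are assumptions about the real schema:
;WITH Tally AS
(
    -- up to 1000 numbers generated on the fly; widen the CROSS JOIN if more are needed
    SELECT TOP (1000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS Number
    FROM sys.all_objects a CROSS JOIN sys.all_objects b
)
SELECT t1.EventID,
       t1.DocumentID,
       CONCAT('ITEM_TAG_', RIGHT('000' + CAST(n.Number AS varchar(3)), 3)) AS ItemTag
FROM tableA t1
INNER JOIN tableB t2 ON t2.DocumentID = t1.DocumentID
INNER JOIN Tally n ON n.Number <= t2.Qty
ORDER BY t1.DocumentID, n.Number;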

Need a SQL statement to filter rows

There is a table which looks like this:
I need to retrieve only the highlighted records, and I need a query that will work on a bigger table where millions of records exist.
Criteria:
There are 4 sets; the 1st and 3rd contain only similar values, but the 2nd and 4th sets contain differing values.
Edit:
I made a slight modification to the table (an ID column was added). How can we achieve the same with the ID column?
Return only the kind of set where one or more differing values exist in the set.
create table #ab
(
id int identity(1,1),  -- ID column added to match the edited question
col1a int,
colb char(2)
)
insert into #ab
values
(1,'a'),
(1,'a'),
(1,'a'),
(2,'b'),
(2,'c'),
(2,'c')
select id, col1a, colb
from #ab
where col1a in (
select col1a from #ab group by col1a having count(distinct colb) > 1)
Regarding performance over millions of rows, I would probably check the execution plan and deal with it from there. With my sample data set and my query, the Distinct Sort takes nearly 40% of the cost. With millions of rows it can probably spill to tempdb as well, so I suggest the index below, which can eliminate more rows:
create index nci on #ab(colb)
include(col1a)
You can also achieve it using an INNER JOIN instead of IN, since this is a query over millions of rows.
SELECT f.colA,f.colB
FROM
filtertable f
INNER JOIN
(
SELECT colA
FROM filtertable
GROUP BY colA
HAVING COUNT(DISTINCT colB)>1
) f1
ON f.colA = f1.colA
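Another single-scan option is a sketch using window aggregates instead of the IN/JOIN on a grouped subquery; it assumes colb is never NULL (MIN and MAX ignore NULLs, so results can differ from COUNT(DISTINCT) on nullable data):
SELECT col1a, colb
FROM (
    SELECT col1a, colb,
           MIN(colb) OVER (PARTITION BY col1a) AS min_b,
           MAX(colb) OVER (PARTITION BY col1a) AS max_b
    FROM #ab
) x
WHERE min_b <> max_b;  -- MIN <> MAX within the group means more than one distinct colb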

Can this ORDER BY on a CASE clause be made faster?

I'm selecting results from a table of ~350 million records, and it's running extremely slowly - around 10 minutes. The culprit seems to be the ORDER BY, as if I remove it the query only takes a moment. Here's the gist:
SELECT TOP 100
(columns snipped)
FROM (
SELECT
CASE WHEN (e2.ID IS NULL) THEN
CAST(0 AS BIT) ELSE CAST(1 AS BIT) END AS RecordExists,
(columns snipped)
FROM dbo.Files AS e1
LEFT OUTER JOIN dbo.Records AS e2 ON e1.FID = e2.FID
) AS p1
ORDER BY p1.RecordExists
Basically, I'm ordering the results by whether Files have a corresponding Record, as those without need to be handled first. I could run two queries with WHERE clauses, but I'd rather do it in a single query if possible.
Is there any way to speed this up?
The ultimate issue is that the use of CASE in the sub-query introduces an ORDER BY over something that is not being used in a sargable manner. Thus the entire intermediate result-set must first be ordered to find the TOP 100 - this is all 350+ million records!²
In this particular case, moving the CASE to the outside SELECT and ordering by the ID column ascending should do the trick¹ (NULL values sort lowest in SQL Server, so the rows with no matching Record, i.e. RecordExists = 0, come first). It's not a generic approach, though... but the ordering should be much, much faster iff Files.ID is indexed. (If the query is still slow, consult the query plan to find out why the ORDER BY is not using an index.)
Another alternative might be to include a persisted computed column for RecordExists (that is also indexed) that can be used as an index in the ORDER BY.
Once again, the idea is that the ORDER BY works over something sargable, which only requires reading sequentially inside the index (up to the desired number of records to match the outside limit) and not ordering 350+ million records on-the-fly :)
SQL Server is then able to push this ordering (and limit) down into the sub-query, instead of waiting for the intermediate result-set of the sub-query to come up. Look at the query plan differences based on what is being ordered.
¹ Example:
SELECT TOP 100
-- If needed
CASE WHEN (p1.ID IS NULL) THEN
CAST(0 AS BIT) ELSE CAST(1 AS BIT) END AS RecordExists,
(columns snipped)
FROM (
SELECT
(columns snipped)
FROM dbo.Files AS e1
LEFT OUTER JOIN dbo.Records AS e2 ON e1.FID = e2.FID
) AS p1
-- Hopefully ID is indexed; ascending order makes NULLs (!RecordExists) go first
ORDER BY p1.ID ASC
² Actually, it seems like it could hypothetically just stop after the first 100 0's without a full-sort .. at least under some extreme query planner optimization under a closed function range, but that depends on when the 0's are encountered in the intermediate result set (in the first few thousand or not until the hundreds of millions or never?). I highly doubt SQL Server accounts for this extreme case anyway; that is, don't count on this still non-sargable behavior.
Give this form a try
SELECT TOP(100) *
FROM (
SELECT TOP(100)
0 AS RecordExists
--,(columns snipped)
FROM dbo.Files AS e1
WHERE NOT EXISTS (SELECT * FROM dbo.Records e2 WHERE e1.FID = e2.FID)
ORDER BY SecondaryOrderColumn
) X
UNION ALL
SELECT * FROM (
SELECT TOP(100)
1 AS RecordExists
--,(columns snipped)
FROM dbo.Files AS e1
INNER JOIN dbo.Records AS e2 ON e1.FID = e2.FID
ORDER BY SecondaryOrderColumn
) X
ORDER BY SecondaryOrderColumn
Key indexes:
Records (FID)
Files (FID, SecondaryOrderColumn)
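A hedged sketch of those two indexes (the index names are made up, and SecondaryOrderColumn stands in for whatever column really drives the secondary ordering):
CREATE INDEX IX_Records_FID ON dbo.Records (FID);
CREATE INDEX IX_Files_FID_Secondary ON dbo.Files (FID, SecondaryOrderColumn);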
Well, the reason it is much slower is that it is really a very different query without the ORDER BY clause.
With the order by clause:
Find all matching records out of the entire 350 million rows. Then sort them.
Without the order by clause:
Find the first 100 matching records. Stop.
Q: If you say the only difference is "with/without" the "order by", then could you somehow move the "top 100" into the inner select?
EXAMPLE:
SELECT
(columns snipped)
FROM (
SELECT TOP 100
CASE WHEN (e2.ID IS NULL) THEN
CAST(0 AS BIT) ELSE CAST(1 AS BIT) END AS RecordExists,
(columns snipped)
FROM dbo.Files AS e1
LEFT OUTER JOIN dbo.Records AS e2 ON e1.FID = e2.FID
) AS p1
ORDER BY p1.RecordExists
In SQL Server, null values collate lower than any value in the domain. Given these two tables:
create table dbo.foo
(
id int not null identity(1,1) primary key clustered ,
name varchar(32) not null unique nonclustered
)
insert dbo.foo ( name ) values ( 'alpha' )
insert dbo.foo ( name ) values ( 'bravo' )
insert dbo.foo ( name ) values ( 'charlie' )
insert dbo.foo ( name ) values ( 'delta' )
insert dbo.foo ( name ) values ( 'echo' )
insert dbo.foo ( name ) values ( 'foxtrot' )
go
create table dbo.bar
(
id int not null identity(1,1) primary key clustered ,
foo_id int null foreign key references dbo.foo(id) ,
name varchar(32) not null unique nonclustered
)
go
insert dbo.bar( foo_id , name ) values( 1 , 'golf' )
insert dbo.bar( foo_id , name ) values( 5 , 'hotel' )
insert dbo.bar( foo_id , name ) values( 3 , 'india' )
insert dbo.bar( foo_id , name ) values( 5 , 'juliet' )
insert dbo.bar( foo_id , name ) values( 6 , 'kilo' )
go
The query
select *
from dbo.foo foo
left join dbo.bar bar on bar.foo_id = foo.id
order by bar.foo_id, foo.id
yields the following result set:
id  name     id    foo_id  name
--  -------  ----  ------  -------
2   bravo    NULL  NULL    NULL
4   delta    NULL  NULL    NULL
1   alpha    1     1       golf
3   charlie  3     3       india
5   echo     2     5       hotel
5   echo     4     5       juliet
6   foxtrot  5     6       kilo

(7 row(s) affected)
This should allow the query optimizer to use a suitable index (if such exists); however, it does not guarantee that any such index would be used.
Can you try this?
SELECT TOP 100
(columns snipped)
FROM dbo.Files AS e1
LEFT OUTER JOIN dbo.Records AS e2 ON e1.FID = e2.FID
ORDER BY e2.ID ASC
This should give you the rows where e2.ID is NULL first. Also, make sure Records.ID is indexed. This should give you the ordering you were wanting.

Updating uncommitted data in a cell within an UPDATE statement

I want to convert a table storing data as Name-Value pairs into relational form in SQL Server 2008.
Source table
Strings
ID Type String
100 1 John
100 2 Milton
101 1 Johny
101 2 Gaddar
Target required
Customers
ID FirstName LastName
100 John Milton
101 Johny Gaddar
I am following the strategy given below.
Populate the Customers table with the ID values from the Strings table:
INSERT INTO CUSTOMERS SELECT DISTINCT ID FROM Strings
You get the following
Customers
ID FirstName LastName
100 NULL NULL
101 NULL NULL
Update Customers with the rest of the attributes by joining it to Strings on the ID column. This way each record in Customers will have 2 corresponding matching records in Strings.
UPDATE Customers
SET FirstName = (CASE WHEN S.Type=1 THEN S.String ELSE FirstName END),
    LastName  = (CASE WHEN S.Type=2 THEN S.String ELSE LastName END)
FROM Customers
INNER JOIN Strings S ON Customers.ID = S.ID
An intermediate state would look like this:
ID FirstName LastName ID Type String
100 John NULL 100 1 John
100 NULL Milton 100 2 Milton
101 Johny NULL 101 1 Johny
101 NULL Gaddar 101 2 Gaddar
But this is not working as expected, because when assigning the values in the SET clause it only uses the committed values instead of the uncommitted ones. Is there any way to make the UPDATE statement use uncommitted values (within the processing time of the query)?
PS: I am not looking for alternate solutions; I want to make my approach work by telling SQL Server to use uncommitted data for the UPDATE.
The easiest way to do it would be to split the update into two:
UPDATE Customers
SET FirstName = Strings.String
FROM Customers
INNER JOIN Strings ON Customers.ID=Strings.ID AND Strings.Type = 1
And then:
UPDATE Customers
SET LastName = Strings.String
FROM Customers
INNER JOIN Strings ON Customers.ID=Strings.ID AND Strings.Type = 2
There are probably ways to do it in one query such as a derived table, but unless that's a specific requirement I'd just use this approach.
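For reference, a hedged sketch of that derived-table variant, setting both columns in a single statement (column names as in the question):
UPDATE c
SET c.FirstName = d.FirstName,
    c.LastName  = d.LastName
FROM Customers c
INNER JOIN (
    SELECT ID,
           MAX(CASE WHEN Type = 1 THEN String END) AS FirstName,
           MAX(CASE WHEN Type = 2 THEN String END) AS LastName
    FROM Strings
    GROUP BY ID
) d ON d.ID = c.ID;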
Have a look at this; it should avoid all the steps you had:
DECLARE @Table TABLE(
ID INT,
Type INT,
String VARCHAR(50)
)
INSERT INTO @Table (ID,[Type],String) SELECT 100 ,1 ,'John'
INSERT INTO @Table (ID,[Type],String) SELECT 100 ,2 ,'Milton'
INSERT INTO @Table (ID,[Type],String) SELECT 101 ,1 ,'Johny'
INSERT INTO @Table (ID,[Type],String) SELECT 101 ,2 ,'Gaddar'
SELECT IDs.ID,
tName.String NAME,
tSur.String Surname
FROM (
SELECT DISTINCT ID
FROM @Table
) IDs LEFT JOIN
@Table tName ON IDs.ID = tName.ID AND tName.[Type] = 1 LEFT JOIN
@Table tSur ON IDs.ID = tSur.ID AND tSur.[Type] = 2
OK, I do not think that you will find a solution to what you are looking for. From UPDATE (Transact-SQL), the documentation states:
Using UPDATE with the FROM Clause
The results of an UPDATE statement are undefined if the statement includes a FROM clause that is not specified in such a way that only one value is available for each column occurrence that is updated, that is, if the UPDATE statement is not deterministic. For example, in the UPDATE statement in the following script, both rows in Table1 meet the qualifications of the FROM clause in the UPDATE statement; but it is undefined which row from Table1 is used to update the row in Table2.
USE AdventureWorks;
GO
IF OBJECT_ID ('dbo.Table1', 'U') IS NOT NULL
DROP TABLE dbo.Table1;
GO
IF OBJECT_ID ('dbo.Table2', 'U') IS NOT NULL
DROP TABLE dbo.Table2;
GO
CREATE TABLE dbo.Table1
(ColA int NOT NULL, ColB decimal(10,3) NOT NULL);
GO
CREATE TABLE dbo.Table2
(ColA int PRIMARY KEY NOT NULL, ColB decimal(10,3) NOT NULL);
GO
INSERT INTO dbo.Table1 VALUES(1, 10.0), (1, 20.0), (1, 0.0);
GO
UPDATE dbo.Table2
SET dbo.Table2.ColB = dbo.Table2.ColB + dbo.Table1.ColB
FROM dbo.Table2
INNER JOIN dbo.Table1
ON (dbo.Table2.ColA = dbo.Table1.ColA);
GO
SELECT ColA, ColB
FROM dbo.Table2;
Astander is correct (I am accepting his answer). The update is not happening because of a read-uncommitted issue but because of the multiple rows returned by the JOIN. I have verified this. UPDATE picks only the first row generated from the multiple matching records to update the original table. This is the behavior for MSSQL, Sybase and similar RDBMSs, but Oracle does not allow this kind of update and throws an error. I have verified this for MSSQL.
And again, MSSQL does not support updating a cell with UNCOMMITTED data. I don't know the status with other RDBMSs, and I have no idea whether any RDBMS provides within-query isolation level management.
An alternate solution is to aggregate the Type/String rows into columns and insert the result in a single statement. This needs fewer scans than the methods given in the above answers.
INSERT INTO Customers
SELECT
ID
,MAX(CASE WHEN Type = 1 THEN String ELSE NULL END) AS FirstName
,MAX(CASE WHEN Type = 2 THEN String ELSE NULL END) AS LastName
FROM Strings
GROUP BY ID
Thanks to my friend Roji Thomas for helping me with this.