Help with SQL Grouping

A partial fragment of my output looks as follows:
CNEP P000000025 1
CNEP P000000029 1
NONMAT P000000029 1
CNEP P000000030 1
CWHCNP P000000030 1
MSN P000000030 1
Each row represents a term that a student is in a particular curriculum. Right now I am grouping the information to make sure that each UserID correlates to a particular curriculum only once.
Notice how "P000000029" and "P000000030" have multiple entries.
I would like to be able to show only those students who have multiple curriculum types within the system.

Assuming the columns are named curriculum and userid (no idea what the third column is ;-), you can get the userids of interest via, e.g.:
select userid
from thetable
group by userid
having count(distinct curriculum) > 1
and other info about the userids so selected via IN, joins, and similar operations as usual; for example, the join-back sketch below.
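A minimal sketch (untested, still assuming the names thetable, userid, and curriculum) pulling the full rows for every student enrolled in more than one curriculum:
select t.*
from thetable t
join (
    select userid
    from thetable
    group by userid
    having count(distinct curriculum) > 1
) multi on multi.userid = t.userid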

I don't think you are showing any student info in your sample data. But you can still use this to find groups with multiples (SQL Server example code, but the query will work just about anywhere):
DECLARE @YourTable table (col1 varchar(10), col2 char(10), col3 int)
INSERT INTO @YourTable VALUES ('NEP','P000000025',1)
INSERT INTO @YourTable VALUES ('CNEP','P000000029',1)
INSERT INTO @YourTable VALUES ('NONMAT','P000000029',1)
INSERT INTO @YourTable VALUES ('CNEP','P000000030',1)
INSERT INTO @YourTable VALUES ('CWHCNP','P000000030',1)
INSERT INTO @YourTable VALUES ('MSN','P000000030',1)
SELECT
col1,COUNT(*) AS CountOf
FROM @YourTable
GROUP BY col1
HAVING COUNT(col2)>1
OUTPUT
col1 CountOf
---------- -----------
CNEP 2
(1 row(s) affected)
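If what you're after is the students rather than the curricula, the same HAVING pattern grouped by col2 should give the userids with multiple curriculum types (an untested sketch against the same table variable):
SELECT
col2,COUNT(DISTINCT col1) AS CountOf
FROM @YourTable
GROUP BY col2
HAVING COUNT(DISTINCT col1)>1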


Add column to ensure composite key is unique

I have a table which needs to have a composite primary key based on 2 columns (Material number, Plant).
For example, this is how it is currently (note that these rows are not unique):
MATERIAL_NUMBER PLANT NUMBER
------------------ ----- ------
000000000000500672 G072 1
000000000000500672 G072 1
000000000000500672 G087 1
000000000000500672 G207 1
000000000000500672 G207 1
However, I'll need to add the additional column (NUMBER) to the composite key such that each row is unique, and it must work like this:
For each MATERIAL_NUMBER, for each PLANT, let NUMBER start at 1 and increment by 1 for each duplicate record.
This would be the desired output:
MATERIAL_NUMBER PLANT NUMBER
------------------ ----- ------
000000000000500672 G072 1
000000000000500672 G072 2
000000000000500672 G087 1
000000000000500672 G207 1
000000000000500672 G207 2
How would I go about achieving this, specifically in SQL Server?
Best Regards!
SOLVED.
See below:
SELECT MATERIAL_NUMBER, PLANT,
       ROW_NUMBER() OVER (PARTITION BY MATERIAL_NUMBER, PLANT ORDER BY VALID_FROM) AS NUMBER
FROM Table_Name
This will output the table in question, with the NUMBER column properly defined.
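If you need to persist the new column rather than just select it, one way is to update through a CTE (a sketch, assuming SQL Server and that an empty NUMBER column already exists on Table_Name):
;WITH numbered AS
(
    SELECT NUMBER,
           ROW_NUMBER() OVER (PARTITION BY MATERIAL_NUMBER, PLANT ORDER BY VALID_FROM) AS rn
    FROM Table_Name
)
UPDATE numbered SET NUMBER = rn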
Suppose this is the actual table:
create table #temp1(MATERIAL_NUMBER varchar(30),PLANT varchar(30), NUMBER int)
Suppose you want to insert only a single record; then:
declare @Num int
select @Num=isnull(max(number),0) from #temp1 where MATERIAL_NUMBER='000000000000500672' and PLANT='G072'
insert into #temp1 (MATERIAL_NUMBER, PLANT, NUMBER)
values ('000000000000500672','G072',@Num+1)
Suppose you want to insert records in bulk. Your bulk sample data looks like this:
create table #temp11(MATERIAL_NUMBER varchar(30),PLANT varchar(30))
insert into #temp11 (MATERIAL_NUMBER,PLANT)values
('000000000000500672','G072')
,('000000000000500672','G072')
,('000000000000500672','G087')
,('000000000000500672','G207')
,('000000000000500672','G207')
You want to insert `#temp11` into `#temp1` while maintaining the NUMBER id:
insert into #temp1 (MATERIAL_NUMBER, PLANT, NUMBER)
select t11.MATERIAL_NUMBER, t11.PLANT,
       row_number() over (partition by t11.MATERIAL_NUMBER, t11.PLANT order by (select null)) + isnull(mx.maxnum, 0) as NUMBER
from #temp11 t11
outer apply (select max(NUMBER) as maxnum
             from #temp1 t
             where t.MATERIAL_NUMBER = t11.MATERIAL_NUMBER
               and t.PLANT = t11.PLANT) mx
select * from #temp1
drop table #temp1
drop table #temp11
The main question is: why do you need the NUMBER column at all? In most cases you don't; you can compute ROW_NUMBER() over (partition by MATERIAL_NUMBER, PLANT order by (select null)) at display time wherever you need it. That will be more efficient.
Otherwise, describe the actual situation and the number of rows involved where you will need the NUMBER column.

A thought experiment in SQL

I want to show the number of times each distinct element in a column of a SQL database table appears, alongside that distinct element, in a new output table. Is it possible in a single statement, rather than ramming my head against it manually?
Without having actually tried, how about this:
SELECT tmp.DesiredField, (SELECT COUNT(*) FROM [Table] t WHERE t.DesiredField = tmp.DesiredField) AS Count
FROM
(
SELECT DISTINCT DesiredField FROM [Table]
) tmp
This would first select all distinct values from [Table] and in the outer select, take the values and the number of times they appear in the column.
You could also try
SELECT Field, SUM(1) AS Count FROM Table
GROUP BY Field
This should "flatten" the table so that it only contains distinct values in Field and the number of rows where Field has the same value.
I just tried the second - it seems to work nicely.
Turns out I was wrong all along. The second example and the following actually return the same results:
SELECT Field, COUNT(*) AS Count FROM Table
GROUP BY Field
Simplest is just to use COUNT(). You'll see variations in what the count parameter can be, so here are the options.
DECLARE @tbl TABLE(id INT, data INT)
INSERT INTO @tbl VALUES (1,1),(2,1),(3,2),(4,NULL)
SELECT data
,COUNT(*) Count_star
,COUNT(id) Count_id
,COUNT(data) Count_data
,COUNT(1) Count_literal
FROM @tbl
GROUP BY data
data Count_star Count_id Count_data Count_literal
----------- ----------- ----------- ----------- -------------
NULL 1 1 0 1
1 2 2 2 2
2 1 1 1 1
Warning: Null value is eliminated by an aggregate or other SET operation.
The difference shows up in the treatment of NULLs: COUNT(field) skips NULL values, while COUNT(*) and COUNT(1) count every row.

SQL aggregate function to return single value if there is only one, otherwise null

I'm looking for the best way to achieve an aggregate function that does this:
If the group contains only a single repeated value, return that value
If the group contains any nulls, then return null
If the group contains more than one value, return null
Here's some sample data:
CREATE TABLE EXAMPLE
( ID NUMBER(3),
VAL VARCHAR2(3));
INSERT INTO EXAMPLE VALUES (1,'A');
INSERT INTO EXAMPLE VALUES (2,'A');
INSERT INTO EXAMPLE VALUES (2,'B');
INSERT INTO EXAMPLE VALUES (3,null);
INSERT INTO EXAMPLE VALUES (3,'A');
INSERT INTO EXAMPLE VALUES (4,'A');
INSERT INTO EXAMPLE VALUES (4,'A');
SQLFiddle Link
The SQL should be something like:
SELECT ID, ????( VAL ) ONLY_VAL
FROM EXAMPLE
GROUP BY ID
ORDER BY ID
The result I am after should look like this:
ID ONLY_VAL
1 A
2
3
4 A
In the real thing, I want to do this on multiple VAL columns (grouped by the same ID). There would be several hundred records per ID.
I thought this was an interesting problem. The only solution I have is a mess of NVL, MIN and MAX, and it seems like there should be a neater way.
Will this work for your original data?
SELECT ID,
CASE WHEN COUNT(DISTINCT VAL) = 1 AND COUNT(ID) = COUNT(VAL)
THEN MAX(VAL)
ELSE NULL
END ONLY_VAL
FROM EXAMPLE
GROUP BY ID
ORDER BY ID
SQLFiddle Demo
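A variant that avoids COUNT(DISTINCT) may be easier to repeat across many VAL columns (a sketch, untested): since MIN and MAX ignore NULLs, MIN(VAL) = MAX(VAL) detects a single repeated value, and COUNT(*) = COUNT(VAL) rules out groups containing NULLs.
SELECT ID,
       CASE WHEN MIN(VAL) = MAX(VAL) AND COUNT(*) = COUNT(VAL)
            THEN MIN(VAL)
       END ONLY_VAL
FROM EXAMPLE
GROUP BY ID
ORDER BY ID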

Last Record of a Join Table (how to optimize)

I have the same "problem" as described in (Last record of Join table): I need to join a "Master Table" with a "History Table", where I only want to join the latest (by date) record of the history table. So whenever I query a record from the master table, I also get the "latest" data from the History Table.
Master Table
ID
FIRSTNAME
LASTNAME
...
History Table
ID
LASTACTION
DATE
This is possible by joining both tables and using a subselect to retrieve the latest history table record as described in the answer given in the link above.
My questions are:
How can I solve the problem that, in theory, there might be two History records with the same date?
Is this kind of join with a subselect really the best solution in terms of performance (and in general)? What do you think (I am NO expert in all this stuff) of integrating a further attribute into the History table, a boolean flag named "ISLATESTRECORD" (with a unique constraint) that I manage manually? This attribute would explicitly mark the latest record, so I would not need any subselects and could use the attribute directly in the WHERE clause of the join.
On the other hand, this makes inserting a new record a little more complicated: I first have to remove the "ISLATESTRECORD" flag from the current latest record, insert the new History record with "ISLATESTRECORD" set, and commit the transaction, as sketched below.
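For illustration, the maintenance would have to look something like this (a hypothetical sketch; the table name and the @MasterID/@Action parameters are made up):
DECLARE @MasterID int = 1, @Action varchar(20) = 'changed_name'  -- made-up sample values
BEGIN TRANSACTION;
UPDATE HistoryTable
    SET ISLATESTRECORD = 0
    WHERE ID = @MasterID AND ISLATESTRECORD = 1;
INSERT INTO HistoryTable (ID, LASTACTION, [DATE], ISLATESTRECORD)
    VALUES (@MasterID, @Action, GETDATE(), 1);
COMMIT TRANSACTION;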
What do you think is the recommended solution? I do not have any clue about the performance impact of the subselects: I might have millions of "Mastertable" records to search through for a specific record, also using search attributes of the joined History table, like: "Give me the Mastertable record with FIRSTNAME XYZ whose LASTACTION (in the History Table) was 'changed_name'". So this subselect might be called millions of times.
Or is it better to work with a subselect to find the latest record, since subselects are very efficient and it is better to keep everything normalized?
Thank you very much
I'll solve your problem twice: with a query on your existing tables, and then on your tables with an auto-incrementing identity column added to the history table. Adding an identity column to your history table gets around the problem of non-unique dates and makes the query easier.
To solve the problem with your tables (with SQL Server example code):
DECLARE @MasterTable table (MasterID int,FirstName varchar(20),LastName varchar(20))
DECLARE @HistoryTable table (MasterID int,LastAction char(1),HistoryDate datetime)
INSERT INTO @MasterTable VALUES (1,'AAA','aaa')
INSERT INTO @MasterTable VALUES (2,'BBB','bbb')
INSERT INTO @MasterTable VALUES (3,'CCC','ccc')
INSERT INTO @HistoryTable VALUES (1,'I','1/1/2009')
INSERT INTO @HistoryTable VALUES (1,'U','2/2/2009')
INSERT INTO @HistoryTable VALUES (1,'U','3/3/2009') --<<dups
INSERT INTO @HistoryTable VALUES (1,'U','3/3/2009') --<<dups
INSERT INTO @HistoryTable VALUES (2,'I','5/5/2009')
INSERT INTO @HistoryTable VALUES (3,'I','7/7/2009')
INSERT INTO @HistoryTable VALUES (3,'U','8/8/2009')
SELECT
MasterID,FirstName,LastName,LastAction,HistoryDate
FROM (SELECT
m.MasterID,m.FirstName,m.LastName,h.LastAction,h.HistoryDate,ROW_NUMBER() OVER(PARTITION BY m.MasterID ORDER BY m.MasterID) AS RankValue
FROM @MasterTable m
INNER JOIN (SELECT
MasterID,MAX(HistoryDate) AS MaxDate
FROM @HistoryTable
GROUP BY MasterID
) dt ON m.MasterID=dt.MasterID
INNER JOIN @HistoryTable h ON dt.MasterID=h.MasterID AND dt.MaxDate=h.HistoryDate
) AllRows
WHERE RankValue=1
OUTPUT:
MasterID FirstName LastName LastAction HistoryDate
----------- --------- -------- ---------- -----------
1 AAA aaa U 2009-03-03
2 BBB bbb I 2009-05-05
3 CCC ccc U 2009-08-08
(3 row(s) affected)
To solve the problem with a better HistoryTable (again with SQL Server example code), which is better because it has an auto-incrementing HistoryID identity column:
DECLARE @MasterTable table (MasterID int,FirstName varchar(20),LastName varchar(20))
DECLARE @HistoryTableNEW table (HistoryID int identity(1,1), MasterID int,LastAction char(1),HistoryDate datetime)
INSERT INTO @MasterTable VALUES (1,'AAA','aaa')
INSERT INTO @MasterTable VALUES (2,'BBB','bbb')
INSERT INTO @MasterTable VALUES (3,'CCC','ccc')
INSERT INTO @HistoryTableNEW VALUES (1,'I','1/1/2009')
INSERT INTO @HistoryTableNEW VALUES (1,'U','2/2/2009')
INSERT INTO @HistoryTableNEW VALUES (1,'U','3/3/2009') --<<dups
INSERT INTO @HistoryTableNEW VALUES (1,'U','3/3/2009') --<<dups
INSERT INTO @HistoryTableNEW VALUES (2,'I','5/5/2009')
INSERT INTO @HistoryTableNEW VALUES (3,'I','7/7/2009')
INSERT INTO @HistoryTableNEW VALUES (3,'U','8/8/2009')
SELECT
m.MasterID,m.FirstName,m.LastName,h.LastAction,h.HistoryDate,h.HistoryID
FROM @MasterTable m
INNER JOIN (SELECT
MasterID,MAX(HistoryID) AS MaxHistoryID
FROM @HistoryTableNEW
GROUP BY MasterID
) dt ON m.MasterID=dt.MasterID
INNER JOIN @HistoryTableNEW h ON dt.MasterID=h.MasterID AND dt.MaxHistoryID=h.HistoryID
OUTPUT:
MasterID FirstName LastName LastAction HistoryDate HistoryID
----------- --------- -------- ---------- ----------------------- ---------
1 AAA aaa U 2009-03-03 00:00:00.000 4
2 BBB bbb I 2009-05-05 00:00:00.000 5
3 CCC ccc U 2009-08-08 00:00:00.000 7
(3 row(s) affected)
If the history table has a Primary Key (and all tables should), you can modify the subselect to extract the record with either the larger (or the smaller) PK value of the multiples that match the date criteria...
Select M.*, H.*
From Master M
Join History H
On H.PK = (Select Max(PK) From History
Where FK = M.PK
And Date = (Select Max(Date) From History
Where FK = M.PK))
As to performance, that can be addressed by adding the appropriate indices to these tables (History.Date, History.FK); but in general, depending on the specific table data distribution patterns, subqueries can adversely affect performance. An index sketch follows.
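For example, a single composite index can serve both the MAX(Date) and the MAX(PK) lookups in the query above (a sketch, using the same abbreviated placeholder names):
-- Supports seeking the latest Date, then the largest PK, per FK
CREATE INDEX IX_History_FK_Date_PK ON History (FK, [Date], PK);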

T-SQL Grouping rows from the MAX length columns in different rows (?)

I'm trying to come up with a way to combine rows in a table based on the longest string in any of the rows sharing a row key.
Example:
CREATE TABLE test1
(akey int not null ,
text1 varchar(50) NULL,
text2 varchar(50) NULL,
text3 varchar(50) NULL )
INSERT INTO test1 VALUES ( 1,'Winchester Road','crawley',NULL)
INSERT INTO test1 VALUES ( 1,'Winchester Rd','crawley','P21869')
INSERT INTO test1 VALUES ( 1,'Winchester Road','crawley estate','P21869')
INSERT INTO test1 VALUES ( 1,'Winchester Rd','crawley','P21869A')
INSERT INTO test1 VALUES ( 2,'','birmingham','P53342B')
INSERT INTO test1 VALUES ( 2,'Smith Close','birmingham North East','P53342')
INSERT INTO test1 VALUES ( 2,'Smith Cl.',NULL,'P53342B')
INSERT INTO test1 VALUES ( 2,'Smith Close','birmingham North','P53342')
With these rows I would be looking for this result:
1 Winchester Road, crawley estate, P21869A
2 Smith Close, birmingham North East, P53342B
EDIT: the results above need to be in a table rather than just a comma-separated string.
As you can see in the result, the output should contain the longest value of each text column within each 'akey' group.
I'm trying to come up with a solution that does not involve lots of subqueries on each column; the actual table has 32 columns and over 13 million rows.
The reason I'm doing this is to create a cleaned-up table that has the best result in each column, one row per ID.
This is my first post, so let me know if you need any more info, and I'm happy to hear about any posting best practices I've broken!
thanks
Ben.
SELECT A.akey,
(
SELECT TOP 1 T1.text1
FROM test1 T1
WHERE T1.akey=A.akey AND LEN(T1.TEXT1) = MAX(LEN(A.text1))
) AS TEXT1,
(
SELECT TOP 1 T2.text2
FROM test1 T2
WHERE T2.akey=A.akey AND LEN(T2.TEXT2) = MAX(LEN(A.text2))
) AS TEXT2,
(
SELECT TOP 1 T3.text3
FROM test1 T3
WHERE T3.akey=A.akey AND LEN(T3.TEXT3) = MAX(LEN(A.text3))
) AS TEXT3
FROM TEST1 AS A
GROUP BY A.akey
I just realized you said you have 32 columns. I don't see a good way to do that, unless UNPIVOT would allow you to create separate rows (akey, textn) for each text* column.
Edit: I may not have a chance to finish this today, but UNPIVOT looks useful:
;
WITH COLUMNS AS
(
SELECT akey, [Column], ColumnValue
FROM
(
SELECT X.Akey, X.Text1, X.Text2, X.Text3
FROM test1 X
) AS p
UNPIVOT (ColumnValue FOR [Column] IN (Text1, Text2, Text3))
AS UNPVT
)
SELECT *
FROM COLUMNS
ORDER BY akey,[Column], LEN(ColumnValue)
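Following that through (a sketch, untested; PIVOT and UNPIVOT need SQL Server 2005 or later, so not SQL2K): rank the unpivoted values by length per (akey, column), keep the longest, and pivot back to one row per akey. Ties in length are broken arbitrarily.
;WITH unpvt AS
(
    SELECT akey, [Column], ColumnValue,
           ROW_NUMBER() OVER (PARTITION BY akey, [Column]
                              ORDER BY LEN(ColumnValue) DESC) AS rn
    FROM test1
    UNPIVOT (ColumnValue FOR [Column] IN (text1, text2, text3)) AS u
)
SELECT akey, text1, text2, text3
FROM (SELECT akey, [Column], ColumnValue FROM unpvt WHERE rn = 1) s
PIVOT (MAX(ColumnValue) FOR [Column] IN (text1, text2, text3)) AS p
ORDER BY akey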
This seems really ugly, but at least works (on SQL2K) and doesn't need subqueries:
select test1.akey, A.text1, B.text2, C.text3
from test1
inner join test1 A on A.akey = test1.akey
inner join test1 B on B.akey = test1.akey
inner join test1 C on C.akey = test1.akey
group by test1.akey, A.text1, B.text2, C.text3
having len(a.text1) = max(len(test1.text1))
and len(B.text2) = max(len(test1.text2))
and len(C.text3) = max(len(test1.text3))
order by test1.akey
I must admit that it needs an inner join for each column, and I wonder how that would play out on the 32-column, 13-million-row table... I'd try both this approach and the subquery-based one and look at the execution plans; I'd actually be curious to know which one holds up.