T-sql Inner Join erratic behavior - sql

I'm trying a simple self join, but it's output is some what erratic.
My table (input) looks like this:
ID | Value
1 | val1
1 | val2
1 | val3
2 | val4
2 | val5
2 | val6
2 | val7
What I am trying to achieve is the following:
ID 1 | Value 1 | ID 2 | Value 2
1 | val1 | 2 | val4
1 | val2 | 2 | val5
1 | val3 | 2 | val6
Null | Null | 2 | val7
My attempt at achieving this output has been the following:
SELECT DISTINCT
column1.ID,
column1.value,
column2.ID,
column2.value
FROM table column1
INNER JOIN table column2 ON column1.ID = 1 AND column2.ID = 2
This chunk of code returns the incorrect number of rows; the total number of rows I should get is 4 with a few a null values in the last. I don't get any null values but I do get some numbers which I don't know how are getting there. Additionally, if choose to display more fields from my table, the numbers of rows returned grows larger. I don't understand this behavior. Can Somebody please help me fix it? (and possibly tell me what I am doing wrong).

You cannot do this effectively. There is no relationship between the rows on the left side and the right side of the join. A join needs a relationship; your ON condition doesn't specify a relationship between the two so you will see lots more rows than you intend.
If you are trying to use SQL to format your data for display, DON'T. Get your data, then format it in your client application.

You separate both your data sets in different CTE or sub-queries and use ROW_NUMBER() function in the process to assign row numbers in order of value. In the end join the two on row number - but using FULL instead of INNER join so you can get null values on whatever side has fewer rows.
WITH CTE_1 AS
(
SELECT *, ROW_NUMBER() OVER (ORDER BY [Value]) AS RN
FROM dbo.table1
WHERE ID = 1
)
, CTE_2 AS
(
SELECT *, ROW_NUMBER() OVER (ORDER BY [Value]) AS RN
FROM dbo.table1
WHERE ID = 2
)
SELECT
c1.ID AS ID1
, c1.VALUE AS Value1
, c2.ID AS ID2
, c2.VALUE AS Value2
FROM CTE_1 c1
FULL JOIN CTE_2 c2 ON c1.RN = c2.RN
SQLFIddle is not working at the moment, I can't setup a demo, but here is sample table I have used:
CREATE TABLE Table1
([ID] int, [Value] varchar(4))
;
INSERT INTO Table1
([ID], [Value])
VALUES
(1, 'val1'),
(1, 'val2'),
(1, 'val3'),
(2, 'val4'),
(2, 'val5'),
(2, 'val6'),
(2, 'val7')
;

Related

SQL Pivot 2nd step -- Combine rows / Remove nulls

I am trying to pivot data. We start with two columns: ListNum and Value
The row ?integrity? doesn't matter, I just want all the values to collapse upwards and remove the nulls.
In this case ListNum is like an enum, the values are limited to List1, List2, or List3. Notice that they are not in order (1,3,3,1,2 rather than 1,2,3,1,2,3 etc.).
It would be nice to have a solution that uses standard sql so it would work across many databases.
Starting Point:
+---------+------------+
| ListNum | Value |
+---------+------------+
| List1 | A |
| List3 | 123 |
| List3 | CDE |
| List1 | Somestring |
| List2 | randString |
+---------+------------+
I was able to separate the Lists into columns with:
select
case when ListNum = "List1" then Value end as List1,
case when ListNum = "List2" then Value end as List2,
case when ListNum = "List3" then Value end as List3
from Table;
Midpoint:
+------------+------------+-------+
| List1 | List2 | List3 |
+------------+------------+-------+
| A | NULL | NULL |
| NULL | NULL | 123 |
| NULL | NULL | CDE |
| Somestring | NULL | NULL |
| NULL | randString | NULL |
+------------+------------+-------+
but now I need to collapse upwards/remove the nulls to get -
Desired Output:
+------------+------------+-------+
| List1 | List2 | List3 |
+------------+------------+-------+
| A | randString | 123 |
| Somestring | NULL | CDE |
+------------+------------+-------+
Aren't you missing some kind of grouping criterion? How do you determine, that A belongs to 123 and not to CDE? And why is randString in the first line and not in the second?
This is easy, with such a grouping key:
DECLARE #tbl TABLE(GroupingKey INT, ListNum VARCHAR(100),[Value] VARCHAR(100));
INSERT INTO #tbl VALUES
(1,'List1','A')
,(1,'List3','123')
,(2,'List3','CDE')
,(2,'List1','Somestring')
,(1,'List2','randString');
SELECT p.*
FROM #tbl
PIVOT
(
MAX([Value]) FOR ListNum IN(List1,List2,List3)
) p;
But with your data this seems rather random...
UPDATE: A random approach...
The following approach will sort the values into their columns rather randomly:
DECLARE #tbl TABLE(ListNum VARCHAR(100),[Value] VARCHAR(100));
INSERT INTO #tbl VALUES
('List1','A')
,('List3','123')
,('List3','CDE')
,('List1','Somestring')
,('List2','randString');
--This will use three independant, but numbered sets and join them:
WITH All1 AS (SELECT [Value],ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS RandomNumber FROM #tbl WHERE ListNum='List1')
,All2 AS (SELECT [Value],ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS RandomNumber FROM #tbl WHERE ListNum='List2')
,All3 AS (SELECT [Value],ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS RandomNumber FROM #tbl WHERE ListNum='List3')
SELECT All1.[Value] AS List1
,All2.[Value] AS List2
,All3.[Value] AS List3
FROM All1
FULL OUTER JOIN All2 ON All1.RandomNumber=All2.RandomNumber
FULL OUTER JOIN All3 ON All1.RandomNumber=All3.RandomNumber ;
Hint: There is no implicit sort order in your table!
From your comment:
It’s simply the index / instance number. randString is the first non-null row.
Without a specific ORDER BY the same SELECT may return your data in any random order. So there is no first non-null row, at least not in the meaning of first comes before second...
Something with recursive CTE may work:
DECLARE #tbl TABLE( ListNum VARCHAR(100),[Value] VARCHAR(100));
INSERT INTO #tbl VALUES
( 'List1','A')
,( 'List3','123')
,( 'List3','CDE')
,( 'List1','Somestring')
,( 'List2','randString');
DECLARE #mmax int;
SELECT #mmax = cnt from (SELECT TOP 1 count(*) cnt from #tbl group by ListNum ORDER BY count(*) DESC) t;
With rec AS (
SELECT 1 AS num
UNION ALL
SELECT num+1 FROM rec WHERE num+1<=#mmax
)
SELECT t1.List1, t2.List2, t3.List3 FROM rec
FULL JOIN (
select Value as List1, row_number() over(order by ListNum) rn from #tbl where ListNum = 'List1'
) t1
ON rec.num = t1.rn
FULL JOIN
(
select Value as List2, row_number() over(order by ListNum) rn from #tbl where ListNum = 'List2'
) t2
ON rec.num = t2.rn
FULL JOIN
(
select Value as List3, row_number() over(order by ListNum) rn from #tbl where ListNum = 'List3'
) t3
ON rec.num = t3.rn;
DEMO
Ordinarily you'd use an aggregating operation such as MAX because it hides nulls (a null can never be the max unless there is no other valid value in the group).. but your query is a bit odd because you don't seem to have a solid pivot anchor, and you're allowing any of your data to become associated with anything else. In the real world this probably wouldn't happen because it isn't particularly useful
Better example data:
Person, Attribute, Value
1, Name, John
1, Age, 10
2, Name, Sarah
3 Age, 39
Pivoting query:
SELECT
Person,
MAX(case when attribute = 'name' then value end) as name,
MAX(case when attribute = 'age' then value end) as age
FROM
data
GROUP BY person
Result:
Person, Name, Age
1, John, 10
2, Sarah, NULL
3, NULL, 39

Recursive SQL query to find all matching identifiers

I have a table with following structure
CREATE TABLE Source
(
[ID1] INT,
[ID2] INT
);
INSERT INTO Source ([ID1], [ID2])
VALUES (1, 2), (2, 3), (4, 5),
(2, 5), (6, 7)
Example of Source and Result tables:
Source table basically stores which id is matching which another id. From the diagram it can be seen that 1, 2, 3, 4, 5 are identical. And 6, 7 are identical. I need a SQL query to get a Result table with all matches between ids.
I found this item on the site - Recursive query in SQL Server
similar to my task, but with a different result.
I tried to edit the code for my task, but it does not work. "The statement terminated. The maximum recursion 100 has been exhausted before statement completion."
;WITH CTE
AS
(
SELECT DISTINCT
M1.ID1,
M1.ID1 as ID2
FROM Source M1
LEFT JOIN Source M2
ON M1.ID1 = M2.ID2
WHERE M2.ID2 IS NULL
UNION ALL
SELECT
C.ID2,
M.ID1
FROM CTE C
JOIN Source M
ON C.ID1 = M.ID1
)
SELECT * FROM CTE ORDER BY ID1
Thanks a lot for the help!
This is a challenging question. You are trying to walk through a graph in two directions. There are two key ideas:
Add "reverse" edges, so the graph behaves like a digraph but with edges in both directions.
Keep a list of edges that have been visited. In SQL Server, strings are one method.
So:
with s as (
select id1, id2 from source
union -- on purpose
select id2, id1 from source
),
cte as (
select s.id1, s.id2, ',' + cast(s.id1 as varchar(max)) + ',' + cast(s.id2 as varchar(max)) + ',' as ids
from s
union all
select cte.id1, s.id2, ids + cast(s.id2 as varchar(max)) + ','
from cte join
s
on cte.id2 = s.id1
where cte.ids not like '%,' + cast(s.id2 as varchar(max)) + ',%'
)
select *
from cte
order by 1, 2;
Here is a db<>fiddle.
Since all node connections are bidirectional - add reversed relations to the original list
Find all possible paths from each node; almost usual recursion, the only difference is - we need to keep root id1
Avoid cycles - we need to be aware of it because we don't have directions
source:
;with src as(
select id1, id2 from source
union
-- reversed connections
select id2, id1 from source
), rec as (
select id1, id2, CAST(CONCAT('/', src.id1, '/', src.id2, '/') as varchar(8000)) path
from src
union all
-- keep the root id1 from the start of each path
select rec.id1, src.id2, CAST(CONCAT(rec.path, src.id2, '/') as varchar(8000))
from rec
-- usual recursion
inner join src on src.id1 = rec.id2
-- avoid cycles
where rec.path not like CONCAT('%/', src.id2, '/%')
)
select id1, id2, path
from rec
order by 1, 2
output
| id1 | id2 | path |
|-----|-----|-----------|
| 1 | 2 | /1/2/ |
| 1 | 3 | /1/2/3/ |
| 1 | 4 | /1/2/5/4/ |
| 1 | 5 | /1/2/5/ |
| 2 | 1 | /2/1/ |
| 2 | 3 | /2/3/ |
| 2 | 4 | /2/5/4/ |
| 2 | 5 | /2/5/ |
| 3 | 1 | /3/2/1/ |
| 3 | 2 | /3/2/ |
| 3 | 4 | /3/2/5/4/ |
| 3 | 5 | /3/2/5/ |
| 4 | 1 | /4/5/2/1/ |
| 4 | 2 | /4/5/2/ |
| 4 | 3 | /4/5/2/3/ |
| 4 | 5 | /4/5/ |
| 5 | 1 | /5/2/1/ |
| 5 | 2 | /5/2/ |
| 5 | 3 | /5/2/3/ |
| 5 | 4 | /5/4/ |
| 6 | 7 | /6/7/ |
| 7 | 6 | /7/6/ |
http://sqlfiddle.com/#!18/76114/13
source table will contain about 100,000 records
There is nothing that can help you with this. The task is unpleasant - finding all possible connections. Almost CROSS JOIN. With even more connections in the end.
Looks like I came up with a similar answer as the other posters. My approach was to insert the existing value pairs, and then insert the reverse of each pair.
Once you expand the list of value pairs, you can transverse the table to find all the pairs.
CREATE TABLE #Source
([ID1] int, [ID2] int);
INSERT INTO #Source
(
[ID1]
,[ID2]
)
VALUES
(1, 2)
,(2, 3)
,(4, 5)
,(2, 5)
,(6, 7)
INSERT INTO #Source
(
[ID1]
,[ID2]
)
SELECT
[ID2]
,[ID1]
FROM #Source
;WITH expanded AS
(
SELECT DISTINCT
ID1 = s1.ID1
,ID2 = s1.ID2
FROM #Source s1
LEFT JOIN #Source s2 ON s1.ID2 = s2.ID1
UNION
SELECT DISTINCT
ID1 = s1.ID1
,ID2 = s2.ID2
FROM #Source s1
LEFT JOIN #Source s2 ON s1.ID2 = s2.ID1
WHERE s1.ID1 <> s2.ID2
)
,recur AS
(
SELECT DISTINCT
e1.ID1
,e1.ID2
FROM expanded e1
LEFT JOIN expanded e2 ON e1.ID2 = e2.ID1
WHERE e1.ID1 <> e1.ID2
UNION ALL
SELECT DISTINCT
e1.ID1
,e2.ID2
FROM expanded e1
INNER JOIN expanded e2 ON e1.ID2 = e2.ID1
WHERE e1.ID1 <> e2.ID2
)
SELECT DISTINCT
ID1, ID2
FROM recur
ORDER BY ID1, ID2
DROP TABLE #Source
This is a way to get that output by brute force, but may not be the best solution with a different/larger data set:
select sub1.rnk as ID1
,sub2.rnk as ID2
from
(
select a.*
,rank() over (partition by 1 order by id1, id2) as RNK
from source a
) sub1
cross join
(
select a.*
,rank() over (partition by 1 order by id1, id2) as RNK
from source a
) sub2
where sub1.rnk <> sub2.rnk
union all
select id1 as ID1
,id2 as ID2
from source
where id1 = 6
union all
select id2 as ID1
,id1 as ID2
from source
where id1 = 6;

SQL Server retrieving multiple columns with rank 1

I will try to describe my issue as clearly as possible.
I have a dataset of unique 1000 clients, say ##temp1.
I have another dataset which holds the information related to the 1000 clients from ##temp1 across the past 7 years. Lets call this dataset ##temp2. There are 6 specific columns in this second dataset (##temp2) that I am interested in, lets call them column A, B, C, D, E, F. Just for information, the information that columns A, C, E hold is year of some form in data type float (2012, 2013, 2014..) and the information that columns B, D, F hold is ratings of some form (1,2,3,..upto 5) in data type float. Both year and ratings columns have NULL values which I have converted to 0 for now.
My eventual goal is to create a report which holds the information for the 1000 clients in ##temp1, such that each row should hold the information in the following form,
ClientID | ClientName | ColA_Latest_Year1 | ColB_Corresponding_Rating_Year_1 | ColC_Latest_Year2 | ColD_Corresponding_Rating_Year_2 | ColE_Latest_Year3 | ColF_Corresponding_Rating_Year3.
ColA_Latest_Year1 should hold the latest year for that particular client from dataset ##temp2 and ColB_Corresponding_Rating_Year_1 should hold the rating from Column B corresponding to the year pulled in from Column A. Same goes for the other columns.
The approach which I have taken so far, was,
Create ##temp1 as needed
Create ##temp2 as needed
##temp1 LEFT JOIN ##temp2 on client ids to retrieve the year and rating information for all the clients in ##temp1 and put all that information in ##temp3. There will be multiple rows for every client in ##temp3 because the data in ##temp3 is for multiple years.
Ranked the year column (B,D,F) partition by client_ids and put in in ##temp4,
What I have now is something like this,
Rnk_A | Rnk_C | Rnk_F | ColA | ColB | ColC | ColD | ColE | ColF | Client_id | Client_name
2 | 1 | 1 | 0 | 0 | 0 | 0 | 2014 | 1 | 111 | 'ABC'
1 | 2 | 1 | 2012 | 1 | 0 | 0 | 0 | 0 | 111 | 'ABC'
My goal is
Rnk_A | Rnk_C | Rnk_F | ColA | ColB | ColC | ColD | ColE | ColF | Client_id | Client_name
1 | 1 | 1 | 2012| 1 | 0 | 0 | 2014| 1 | 111 | 'ABC'
Any help is appreciated.
This answer assumes you don't have any duplicates per client in columns A, C, E. If you do have duplicates you'd need to find a way to differentiate them and make the necessary changes.
The hurdle that you've failed to overcome in your attempt (as described) is that you're trying to join from temp1 to temp2 only once for lookup information that could come from 3 distinct rows of temp2. This cannot work as you hope. You must perform separate joins for each pair [A,B] [C,D] and [E,F]. The following demonstrates a solution using CTEs to derive the lookup data for each pair.
/********* Prepare sample tables and data ***********/
declare #t1 table (
ClientId int,
ClientName varchar(50)
)
declare #t2 table (
ClientId int,
ColA datetime,
ColB float,
ColC datetime,
ColD float,
ColE datetime,
ColF float
)
insert into #t1
select 1, 'Client 1' union all
select 2, 'Client 2' union all
select 3, 'Client 3' union all
select 4, 'Client 4'
insert into #t2
select 1, '20001011', 1, '20010101', 7, '20130101', 14 union all
select 1, '20040101', 4, '20170101', 1, '20120101', 1 union all
select 1, '20051231', 0, '20020101', 15, '20110101', 1 union all
select 2, '20060101', 2, NULL, 15, '20110101', NULL union all
select 2, '20030101', 3, NULL, NULL, '20100101', 17 union all
select 3, NULL, NULL, '20170101', 42, NULL, NULL
--select * from #t1
--select * from #t2
/********* Solution ***********/
;with MaxA as (
select ROW_NUMBER() OVER (PARTITION BY t2.ClientId ORDER BY t2.ColA DESC) rn,
t2.ClientId, t2.ColA, t2.ColB
from #t2 t2
--where t2.ColA is not null and t2.ColB is not null
), MaxC as (
select ROW_NUMBER() OVER (PARTITION BY t2.ClientId ORDER BY t2.ColC DESC) rn,
t2.ClientId, t2.ColC, t2.ColD
from #t2 t2
--where t2.ColC is not null and t2.ColD is not null
), MaxE as (
select ROW_NUMBER() OVER (PARTITION BY t2.ClientId ORDER BY t2.ColE DESC) rn,
t2.ClientId, t2.ColE, t2.ColF
from #t2 t2
--where t2.ColE is not null and t2.ColF is not null
)
select t1.ClientId, t1.ClientName, a.ColA, a.ColB, c.ColC, c.ColD, e.ColE, e.ColF
from #t1 t1
left join MaxA a on
a.ClientId = t1.ClientId
and a.rn = 1
left join MaxC c on
c.ClientId = t1.ClientId
and c.rn = 1
left join MaxE e on
e.ClientId = t1.ClientId
and e.rn = 1
If you run this you may notice some peculiar results for Client 2 in columns C and F. This is because (as per your question) there may be some NULL values. ColC date is "unknown" and ColF rating is "unknown".
My solution preserves NULL values instead of converting them to zeroes. This allows you to handle them explicitly if you so choose. I commented out lines in the above query that could be used to ignore NULL dates and ratings if necessary.

SQL - Select In from Array, and NOT in in same query

I have a userform built in VBA where my coworkers can enter multiple values that builds an array and places it into an IN statement, which works great. Problem is I need to also be able to display what values do not exist within the tables.
Example table
id | value
1 | value1
2 | value2
4 | value4
Then a query that could be generated would be
SELECT [id],[value] FROM [tablea] WHERE [id] IN (1,2,3,4)
Expected or desirable outcome would be as follows
id | value
1 | value1
2 | value2
3 | null
4 | value4
I've tried doing it like so;
SELECT [id],[value] FROM [tablea] WHERE [id] IN (1,2,3,4) AND [id] NOT IN (1,2,3,4)
since both arrays will be the same, this returns 0 of course.
I know I can do this with a union, and define the not in statement within the second union, but I'd like to do this without a union.. Any other thoughts?
This is on Microsoft SQL 2005
I unfortunately only have access to SELECT, since I'm performing queries either via VBA or Tableau. So I cannot create a derived table or have anything to reference other than the select statement.
You need a left join of some sort. One way would be to construct your query as:
select v.id, t.value
from (values (1), (2), (3), (4)
) v(id) left join
table t
on v.id = t.id;
Thanks to Joel Coehoorn for the tip towards using a CTE
I was able to accomplish this like so;
WITH numbers AS (
SELECT 1 AS num UNION ALL
SELECT 2 AS num UNION ALL
SELECT 3 AS num UNION ALL
SELECT 4 as num UNION ALL )
SELECT
COALESCE(id,num) as col1,
id as col2
FROM tablea
RIGHT JOIN numbers ON tablea.id = numbers.num
This would return
col1 | col2
1 | 1
2 | 2
3 | NULL
4 | 4

SQL Select First column and for each row select unique ID and the last date

I have a problems this mornig , I have tried many solutions and nothing gave me the expected result.
I have a table that looks like this :
+----+----------+-------+
| ID | COL2 | DATE |
+----+----------+-------+
| 1 | 1 | 2001 |
| 1 | 2 | 2002 |
| 1 | 3 | 2003 |
| 1 | 4 | 2004 |
| 2 | 1 | 2001 |
| 2 | 2 | 2002 |
| 2 | 3 | 2003 |
| 2 | 4 | 2004 |
+----+----------+-------+
And I have a query that returns a result like this :
I have the unique ID and for this ID I want to take the last date of the ID
+----+----------+-------+
| ID | COL2 | DATE |
+----+----------+-------+
| 1 | 4 | 2004 |
| 2 | 4 | 2004 |
+----+----------+-------+
But I don't have any idea how I can do that.
I tried Join , CROSS APPLY ..
If you have some idea ,
Thank you
Clement FAYARD
declare #t table (ID INT,Col2 INT,Date INT)
insert into #t(ID,Col2,Date)values (1,1,2001)
insert into #t(ID,Col2,Date)values (1,2,2001)
insert into #t(ID,Col2,Date)values (1,3,2001)
insert into #t(ID,Col2,Date)values (1,4,2001)
insert into #t(ID,Col2,Date)values (2,1,2002)
insert into #t(ID,Col2,Date)values (2,2,2002)
insert into #t(ID,Col2,Date)values (2,3,2002)
insert into #t(ID,Col2,Date)values (2,4,2002)
;with cte as(
select
*,
rn = row_number() over(partition by ID order by Col2 desc)
from #t
)
select
ID,
Col2,
Date
from cte
where
rn = 1
SELECT ID,MAX(Col2),MAX(Date) FROM tableName GROUP BY ID
If col2 and date allways the highest value in combination than you can try
SELECT ID, MAX(COL2), MAX(DATE)
FROM Table1
GROUP BY ID
But it is not realy good.
The alternative is a subquery with:
SELECT yourtable.ID, sub1.COL2, sub1.DATE
FROM yourtable
INNER JOIN -- try with CROSS APPLY for performance AND without ON 1=1
(SELECT TOP 1 COL2, DATE
FROM yourtable sub2
WHERE sub2.ID = topquery.ID
ORDER BY COL2, DATE) sub1 ON 1=1
You didn't tell what's the name of your table so I'll assume below it is tbl:
SELECT m.ID, m.COL2, m.DATE
FROM tbl m
LEFT JOIN tbl o ON m.ID = o.ID AND m.DATE < o.DATE
WHERE o.DATE is NULL
ORDER BY m.ID ASC
Explanation:
The query left joins the table tbl aliased as m (for "max") against itself (alias o, for "others") using the column ID; the condition m.DATE < o.DATE will combine all the rows from m with rows from o having a greater value in DATE. The row having the maximum value of DATE for a given value of ID from m has no pair in o (there is no value greater than the maximum value). Because of the LEFT JOIN this row will be combined with a row of NULLs. The WHERE clause selects only these rows that have NULL for o.DATE (i.e. they have the maximum value of m.DATE).
Check the SQL Antipatterns: Avoiding the Pitfalls of Database Programming book for other SQL tips.
In order to do this you MUST exclude COL2 Your query should look like this
SELECT ID, MAX(DATE)
FROM table_name
GROUP BY ID
The above query produces the Maximum Date for each ID.
Having COL2 with that query does not makes sense, unless you want the maximum date for each ID and COL2
In that case you can run:
SELECT ID, COL2, MAX(DATE)
GROUP BY ID, COL2;
When you use aggregation functions(like max()), you must always group by all the other columns you have in the select statement.
I think you are facing this problem because you have some fundemental flaws with the design of the table. Usually ID should be a Primary Key (Which is Unique). In this table you have repeated IDs. I do not understand the business logic behind the table but it seems to have some flaws to me.