Consolidate, Combine, Merge Rows - SQL

Every search I do leads me to results for people seeking array_agg to combine values from multiple rows into a single column. That's not what I am trying to figure out here, and maybe I am not using the right search terms (e.g., consolidate, combine, merge).
I am trying to combine rows by populating values in fields ... I am not sure the best way to describe this other than with an example:
Current:
--------------------------------
id   num_1   num_2   num_3   num_4
--------------------------------
1    111     222     0       0
2    111     333     0       0
3    111     0       0       444
4    0       222     555     0
5    777     999     0       0
6    0       999     888     0
After Processing:
--------------------------------
id   num_1   num_2   num_3   num_4
--------------------------------
1    111     222     555     444
2    111     333     555     444
3    111     333     555     444
4    111     222     555     444
5    777     999     888     0
6    777     999     888     0
After Deleting Duplicate Rows:
--------------------------------
id   num_1   num_2   num_3   num_4
--------------------------------
1    111     222     555     444
2    111     333     555     444
3    777     999     888     0
This will likely be a 2-step process ... first fill in the blanks, and then find/delete the duplicates. I can do the second step, but I am having trouble figuring out how to first populate the 0 values with values from another row when a column might have two different values (ids 1/2 for the num_2 column) but only one value for num_1 (e.g., 111).
I can do it in PHP, but would like to figure out how to do it using only Postgres.
EDIT: My example table is a relations table. I have multiple datasets with similar information (e.g., username) but different registration ID numbers. So, I do an inner join on table 1 and table 2 (for example) where the username is the same. Then I take the registration IDs (which are different) from each table and insert that as a row into my relations table. In my example tables above, Row 1 has two different registration IDs from the two tables I joined … the values 111 (num_1) and 222 (num_2) are inserted into the table and zeros inserted for num_3 and num_4. Then I compare table 1 and table 4 and the values 111 (num_1) and 444 (num_4) get inserted into the relations table and zeros for num_2 and num_3. Since registration ID 111 is related to registration ID 222 and registration ID 111 is related to registration ID 444, then registration IDs 111, 222, and 444 are all related (meaning the username is the same for each of those registration IDs). Does that help to clarify?
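To make the insert step concrete, here is a rough sketch of what I mean (the relations table name and the assumption that id is generated automatically are just placeholders for illustration):
-- Hypothetical sketch: build relations rows by joining table1 and table2 on username.
-- Table/column names are assumed; id is assumed to be auto-generated.
INSERT INTO relations (num_1, num_2, num_3, num_4)
SELECT t1.registration_id, t2.registration_id, 0, 0
FROM table1 t1
JOIN table2 t2 USING (username);
-- The table1/table4 comparison fills num_4 instead, with zeros elsewhere:
INSERT INTO relations (num_1, num_2, num_3, num_4)
SELECT t1.registration_id, 0, 0, t4.registration_id
FROM table1 t1
JOIN table4 t4 USING (username);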
EDIT 2: I corrected Tables 2 and 3. Hopefully now it makes sense. The username column is not unique. So, I have 4 tables like this:
Table 1:
bob - 111
mary - 777
Table 2:
bob - 222
bob - 333
mary - 999
Table 3:
bob - 555
mary - 888
Table 4:
bob - 444 -- mary does not exist in this table
So, in my relations table I should end up with 3 rows, as shown in the third example table above (After Deleting Duplicate Rows).

It seems like you started in the middle of a presumed solution, forgetting to present the initial problem. Based on your added information I suggest a completely different, much simpler solution. You have:
CREATE TABLE table1 (username text, registration_id int);
CREATE TABLE table2 (LIKE table1);
CREATE TABLE table3 (LIKE table1);
CREATE TABLE table4 (LIKE table1);
INSERT INTO table1 VALUES ('bob', 111), ('mary', 777);
INSERT INTO table2 VALUES ('bob', 222), ('bob', 333), ('mary', 999);
INSERT INTO table3 VALUES ('bob', 555), ('mary', 888);
INSERT INTO table4 VALUES ('bob', 444); -- no mary
Solution
What you really seem to need is FULL [OUTER] JOIN. Details in the manual on FROM and JOIN.
-- CREATE TABLE relations AS
SELECT username
, t1.registration_id AS reg1
, t2.registration_id AS reg2
, t3.registration_id AS reg3
, t4.registration_id AS reg4
FROM table1 t1
FULL JOIN table2 t2 USING (username)
FULL JOIN table3 t3 USING (username)
FULL JOIN table4 t4 USING (username)
ORDER BY username;
That's all. Produces your desired result directly.
username   reg1   reg2   reg3   reg4
-------------------------------------
bob        111    222    555    444
bob        111    333    555    444
mary       777    999    888    (null)
Your given example would work with LEFT JOIN as well, since all missing entries are to the right. But that would fail in other constellations. I added some more revealing test cases in the fiddle:
SQL Fiddle.
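For comparison, the LEFT JOIN variant would look like this (a sketch only; it happens to work here because table1 contains every username):
-- LEFT JOIN variant: only safe while table1 contains every username
SELECT username
     , t1.registration_id AS reg1
     , t2.registration_id AS reg2
     , t3.registration_id AS reg3
     , t4.registration_id AS reg4
FROM table1 t1
LEFT JOIN table2 t2 USING (username)
LEFT JOIN table3 t3 USING (username)
LEFT JOIN table4 t4 USING (username)
ORDER BY username;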
I assume you are aware that multiple entries in multiple tables will produce a huge number of output rows:
Two SQL LEFT JOINS produce incorrect result

If your values are always increasing (as in the example), then just use cumulative maximum and then select distinct:
select row_number() over (order by min(id)) as id,
       t.num1, t.num2, t.num3, t.num4
from (select id,
             max(num1) over (order by id) as num1,
             max(num2) over (order by id) as num2,
             max(num3) over (order by id) as num3,
             max(num4) over (order by id) as num4
      from t
     ) t
group by t.num1, t.num2, t.num3, t.num4;
If max() doesn't work, then what you really want is lag(... ignore nulls). That is not available in Postgres. Perhaps the simplest method is then a correlated subquery for each column:
select row_number() over (order by min(id)) as id,
       t.num1, t.num2, t.num3, t.num4
from (select id,
             (select t2.num1 from t t2 where t2.id <= t.id and t2.num1 <> 0 order by t2.id desc limit 1
             ) as num1,
             (select t2.num2 from t t2 where t2.id <= t.id and t2.num2 <> 0 order by t2.id desc limit 1
             ) as num2,
             (select t2.num3 from t t2 where t2.id <= t.id and t2.num3 <> 0 order by t2.id desc limit 1
             ) as num3,
             (select t2.num4 from t t2 where t2.id <= t.id and t2.num4 <> 0 order by t2.id desc limit 1
             ) as num4
      from t
     ) t
group by t.num1, t.num2, t.num3, t.num4;
This version would not be very efficient even on medium-sized tables.
A more efficient version is more complicated:
select row_number() over (order by t.id) as id,
       t1.num1, t2.num2, t3.num3, t4.num4
from (select min(id) as id,
             num1_id, num2_id, num3_id, num4_id
      from (select id,
                   max(case when num1 > 0 then id end) over (order by id) as num1_id,
                   max(case when num2 > 0 then id end) over (order by id) as num2_id,
                   max(case when num3 > 0 then id end) over (order by id) as num3_id,
                   max(case when num4 > 0 then id end) over (order by id) as num4_id
            from t
           ) t
      group by num1_id, num2_id, num3_id, num4_id
     ) t left join
     t t1
     on t1.id = t.num1_id left join
     t t2
     on t2.id = t.num2_id left join
     t t3
     on t3.id = t.num3_id left join
     t t4
     on t4.id = t.num4_id;
EDIT:
That was a little silly. There is an easier way using first_value() (which Postgres unfortunately does not support as an aggregate function):
select row_number() over (order by min(id)) as id,
       num1, num2, num3, num4
from (select id,
             first_value(num1) over (order by (case when num1 is not null then id end) nulls last
             ) as num1,
             first_value(num2) over (order by (case when num2 is not null then id end) nulls last
             ) as num2,
             first_value(num3) over (order by (case when num3 is not null then id end) nulls last
             ) as num3,
             first_value(num4) over (order by (case when num4 is not null then id end) nulls last
             ) as num4
      from t
     ) t
group by num1, num2, num3, num4;

Related

Filter rows and select into other columns in SQL?

I have a table like below.
If(OBJECT_ID('tempdb..#temp') Is Not Null)
Begin
Drop Table #Temp
End
create table #Temp
(
Type int,
Code Varchar(50)
)
Insert Into #Temp
SELECT 1,'1'
UNION
SELECT 1,'2'
UNION
SELECT 1,'3'
UNION
SELECT 2,'4'
UNION
SELECT 2,'5'
UNION
SELECT 2,'6'
select * from #Temp
And I would like to get the below result.
Type_1   Code_1   Type_2   Code_2
1        1        2        4
1        2        2        5
1        3        2        6
I have tried with union and inner join, but I am not getting the desired result. Please help.
You can use a full outer join and a CTE as follows:
With cte as
  (Select type, code,
          Row_number() over (partition by type order by code) as rn
   From your_table t)
Select t1.type, t1.code, t2.type, t2.code
From (Select * From cte Where type = 1) t1
Full join (Select * From cte Where type = 2) t2
  On t1.rn = t2.rn
Here is a query which will produce the output you expect:
WITH cte AS (
SELECT t.[Type], t.Code
, rn = ROW_NUMBER() OVER (PARTITION BY t.[Type] ORDER BY t.Code)
FROM #Temp t
)
SELECT Type_1 = t1.[Type], Code_1 = t1.Code
, Type_2 = t2.[Type], Code_2 = t2.Code
FROM cte t1
JOIN cte t2 ON t1.rn = t2.rn AND t2.[Type] = 2
AND t1.[Type] = 1
This query will filter out any Type_1 records which do not have a Type_2 record. This means if there are an uneven number of Type_1 vs Type_2 records, the extra records will get eliminated.
Explanation:
Since there is no obvious way to join the two sets of data, because there is no shared key between them, we need to create one.
So we use this query:
SELECT t.[Type], t.Code
, rn = ROW_NUMBER() OVER (PARTITION BY t.[Type] ORDER BY t.Code)
FROM #Temp t
Which assigns a ROW_NUMBER to every row... It restarts the numbering for every Type value, and it orders the numbering by the Code.
So it will produce:
| Type | Code | rn |
|------|------|----|
| 1 | 1 | 1 |
| 1 | 2 | 2 |
| 1 | 3 | 3 |
| 2 | 4 | 1 |
| 2 | 5 | 2 |
| 2 | 6 | 3 |
Now you can see that we have assigned a key to each row of Type 1's and Type 2's which we can use for the joining process.
In order for us to re-use this output, we can stick it in a CTE and perform a self join (not an actual type of join, it just means we want to join a table to itself).
That's what this query is doing:
SELECT *
FROM cte t1
JOIN cte t2 ON t1.rn = t2.rn AND t2.[Type] = 2
AND t1.[Type] = 1
It's saying, "give me a list of all Type 1 records, and then join all Type 2 records to that using the new ROW_NUMBER we've generated".
Note: All of this works based on the assumption that you always want to join the Type 1's and Type 2's based on the order of their Code.
You can also do this using aggregation:
select max(case when type = 1 then type end) as type_1,
max(case when type = 1 then code end) as code_1,
max(case when type = 2 then type end) as type_2,
max(case when type = 2 then code end) as code_2
from (select type, code,
row_number() over (partition by type order by code) as seqnum
from your_table t
) t
group by seqnum;
It would be interesting to know which is faster -- a join approach or aggregation.
Here is a db<>fiddle.

How to get first 2 children of a parent in SQL?

I have a table that looks like the below
ID    Name         ParentID
333   UK           NULL
124   Wales        333
126   Swansea      124
127   Llanrhidian  126
As you can see, all of the parents and children are in the same table. I need to create a view from this which shows the ID and name of each bottom-level child, the ID and name of the one above it, and then the highest-level parent of them. An output of the above for Wales would look like the following:
ChildID1   Child1Name    ChildID2   Child2Name   ParentID   ParentName
127        Llanrhidian   126        Swansea      333        UK
Sometimes the number of ancestors of a child can be different. In the example above, ChildID 127 has 3 ancestors. This can sometimes be more, but we will always need to see the lowest 2 (plus the top-level parent).
Does this make sense? Can someone help me with this?
You can try to use a recursive CTE with conditional aggregation.
The first query uses a recursive CTE to get all of the recursive relation data.
The second query uses the MAX window function to get the maximum level number, which identifies the top-level parent.
;WITH CTE AS (
    SELECT t1.ID, t1.Name, t1.ParentID, 1 num
    FROM T t1 LEFT JOIN T t2 ON t1.ID = t2.ParentID
    WHERE t2.ID IS NULL
    UNION ALL
    SELECT t2.ID, t2.NAME, t2.ParentID, num + 1
    FROM CTE t1 JOIN T t2 ON t1.ParentID = t2.ID
), CTE2 AS (
    SELECT *, MAX(num) OVER (ORDER BY num DESC) maxNum
    FROM CTE
)
SELECT MAX(CASE WHEN num = 1 THEN ID END) ChildID1,
       MAX(CASE WHEN num = 1 THEN Name END) Child1Name,
       MAX(CASE WHEN num = 2 THEN ID END) ChildID2,
       MAX(CASE WHEN num = 2 THEN Name END) Child2Name,
       MAX(CASE WHEN num = maxNum THEN ID END) ParentID,
       MAX(CASE WHEN num = maxNum THEN Name END) ParentName
FROM CTE2
sqlfiddle
In your SQL, use: SELECT TOP 2 * FROM ...
This will only fetch the first 2 rows it finds. (This is SQL Server syntax.)
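For example, a minimal sketch (the table and column names here are assumed, and an ORDER BY is needed to make "first 2" deterministic):
-- Sketch only: YourTable and the parent value are hypothetical
SELECT TOP 2 ID, Name
FROM YourTable
WHERE ParentID = 124
ORDER BY ID;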

SQL Server - Remove duplicates with different ordering

I have a table containing pairs of items bought together and the # of times the pairing occurred.
item_1   item_2   count
123      234      5
345      567      22
567      345      22
890      345      6
Some of the pairings are dupes that differ just by order (i.e., rows 2 & 3).
Is there an easy way to de-dupe this table?
If the "dups" can appear only once in either direction, then a convenient way is:
select t.*
from t
where t.item_1 <= t.item_2
union all
select t.*
from t
where t.item_1 > t.item_2 and
      not exists (select 1
                  from t t2
                  where t2.item_1 = t.item_2 and t2.item_2 = t.item_1 and t2.count = t.count
                 );
You can use this script.
DECLARE #T TABLE (item_1 INT, item_2 INT , [count] INT)
INSERT INTO #T
VALUES
(123 ,234, 5),
(345 ,567, 22),
(567 ,345, 22),
(890 ,345, 6)
;WITH BASE AS
(
SELECT RN = ROW_NUMBER() OVER(ORDER BY item_1), * FROM #T
)
SELECT T1.item_1, T1.item_2, T1.count FROM BASE T1
OUTER APPLY (SELECT TOP 1 *
FROM BASE T2
WHERE T2.RN > T1.RN AND T1.item_1 = T2.item_2 AND T1.item_2 = T2.item_1) X
WHERE X.RN IS NULL
Result
item_1 item_2 count
----------- ----------- -----------
123 234 5
567 345 22
890 345 6
You can classify a pair as the same by comparing the least and greatest of the two values, and then select one row from each such group.
select item_1,item_2,count
from (select t.*
,row_number() over(partition by case when item_1<item_2 then item_1 else item_2 end,
case when item_1>item_2 then item_1 else item_2 end
order by item_1) as rnum
from tbl t
) t
where rnum=1
Edit: Per Gordon's comment, if the duplicates have to be eliminated only when the count is the same, use
select item_1,item_2,count
from (select t.*
,row_number() over(partition by case when item_1<item_2 then item_1 else item_2 end,
case when item_1>item_2 then item_1 else item_2 end,
count
order by item_1) as rnum
from tbl t
) t
where rnum=1

Finding Missing Number Series when Data Is Grouped in SQL Server

I need to write a query that will calculate the missing numbers with their count in a sequence when the data is "grouped". The data are in multiple groups & each group is in sequence.
For example, I have number series like 1001-1050, 1245-1270, 4571-4590, and all the numbers (1001, 1002, 1003, ..., 1050) are stored in Table1. Some of those numbers are also stored in another table, Table2 (e.g., 1001, 1002, 1003, 1004, 1005).
I want to get output like this:
Utilized Numbers    | Balance Numbers
--------------------|---------------------
1001 - 1005 = 5     | 1006 - 1050 = 45
1245 - 1251 = 7     | 1252 - 1270 = 19
4571 - 4573 = 3     | 4574 - 4590 = 17
The numbers of each series are stored in a single column, which exists in both tables.
You haven't really explained your data, but I am guessing that "Utilized" means the numbers found in both Table1 and Table2, and "Balance" means the numbers only in Table1.
You can get the result at least this way, it's a little bit messy, mostly because of formatting the results:
Edit: This is a new version that does not use lag.
select
    min(case when C2 = 1 then MINID end), max(case when C2 = 1 then MAXID end), max(case when C2 = 1 then ROWS end),
    min(case when C2 = 0 then MINID end), max(case when C2 = 0 then MAXID end), max(case when C2 = 0 then ROWS end)
from (
    select min(ID) as MINID, max(ID) as MAXID, count(*) as ROWS, C2,
           row_number() over (partition by C2 order by min(ID)) as GRP3
    from (
        select *, ID - RN as GRP1, ID - RN2 as GRP2
        from (
            select
                T1.ID, row_number() over (order by T1.ID) as RN,
                case when T2.ID is NULL then 0 else 1 end as C2,
                row_number() over (partition by case when T2.ID is NULL then 0 else 1 end order by T1.ID) as RN2,
                T2.ID as ID2
            from #Table1 T1
            left outer join #Table2 T2 on T1.ID = T2.ID
        ) X
    ) Y
    group by GRP1, GRP2, C2
) Z
group by GRP3
order by 1
The idea here is to have a row number ordered by Table1.ID and to compare it to Table1.ID; when the difference between them changes, a new group starts. The same logic is used a second time, but partitioned by whether the row exists in Table2, to handle changes between "Utilized" and "Balance".
From those groupings you can get the min and max value plus the number of rows. There is one additional grouping with min/max and case expressions to format the result into two columns.
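To see the difference trick in isolation, here is a tiny standalone sketch (hypothetical values, just to show how the grouping key behaves):
-- Consecutive IDs share the same GRP value; a gap starts a new group.
select ID,
       ID - row_number() over (order by ID) as GRP
from (values (1001), (1002), (1003), (1245), (1246)) v(ID);
-- ID    GRP
-- 1001  1000
-- 1002  1000
-- 1003  1000
-- 1245  1241
-- 1246  1241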
See the demo.

How to replicate a SAS merge

I have two tables, t1 and t2:
t1
person | visit | code1 | type1
1      | 1     | 50    | 50
1      | 1     | 50    | 50
1      | 2     | 75    | 50
t2
person | visit | code2 | type2
1      | 1     | 50    | 50
1      | 1     | 50    | 50
1      | 1     | 50    | 50
When SAS runs the following code:
DATA t3;
MERGE t1 t2;
BY person visit;
RUN;
It generates the following dataset:
person | visit | code1 | type1 | code2 | type2
1      | 1     | 50    | 50    | 50    | 50
1      | 1     | 50    | 50    | 50    | 50
1      | 1     | 50    | 50    | 50    | 50
1      | 2     | 75    | 50    |       |
I want to replicate this process in SQL, and my idea was to use a full-outer-join. This works unless there are duplicate rows. When we have duplicate rows like in the above example, a full outer join produces the following table:
person | visit | code1 | type1 | code2 | type2
1      | 1     | 50    | 50    | 50    | 50
1      | 1     | 50    | 50    | 50    | 50
1      | 1     | 50    | 50    | 50    | 50
1      | 1     | 50    | 50    | 50    | 50
1      | 1     | 50    | 50    | 50    | 50
1      | 1     | 50    | 50    | 50    | 50
1      | 2     | 75    | 50    |       |
I'm wondering how I would get the SQL table to match the SAS table.
Gordon's answer is close, but it misses one point. Here is its output:
person   visit   code1   type1   seqnum   person   visit   code2   type2   seqnum
1        1       1       1       1        1        1       1       1       1
1        1       2       2       2        1        1       2       2       2
NULL     NULL    NULL    NULL    NULL     1        1       3       3       3
1        2       1       3       1        NULL     NULL    NULL    NULL    NULL
The third row's nulls are incorrect, while the fourth's are correct.
As far as I know, in SQL there's not a really good way to do this other than splitting things up into a few queries. I think there are five possibilities:
Matching person/visit, Matching seqnums
Matching person/visit, Left has more seqnums
Matching person/visit, Right has more seqnums
Left has unmatched person/visit
Right has unmatched person/visit
I think the last two might be workable into one query, but I think the second and third have to be separate queries. You can union everything together, of course.
So here's an example, using some temporary tables that are a little better suited to seeing what's going on. Note that the third row is now filled in for code1 and type1, even though those are 'extra'. I've only added three of the five criteria - the three you had in your initial example - but the other two aren't too hard.
Note that this is an example of something far faster in SAS - because SAS has a row-wise concept, i.e., it's capable of going one row at a time. SQL tends to take a lot longer at these, with large tables, unless it's possible to partition things very neatly and have very good indexes - and even then I've never seen a SQL DBA do anywhere near as well as SAS at some of these types of problems. That's something you'll have to accept of course - SQL has its own advantages, one of which being probably price...
Here's my example code. I'm sure it's not terribly elegant; hopefully one of the SQL folk can improve it. This is written to work in SQL Server (using table variables); the same thing should work, with some changes (to use temporary tables), in other variants, assuming they implement windowing. (SAS of course can't do this particular thing - as even FedSQL implements ANSI 1999, not ANSI 2008.) This is based on Gordon's initial query, then modified with the additional bits at the end. Anyone who wants to improve this please feel free to edit and/or copy to a new/existing answer any bit you wish.
declare #t1 table (person INT, visit INT, code1 INT, type1 INT);
declare #t2 table (person INT, visit INT, code2 INT, type2 INT);
insert into #t1 values (1,1,1,1)
insert into #t1 values (1,1,2,2)
insert into #t1 values (1,2,1,3)
insert into #t2 values (1,1,1,1)
insert into #t2 values (1,1,2,2)
insert into #t2 values (1,1,3,3)
select coalesce(t1.person, t2.person) as person, coalesce(t1.visit, t2.visit) as visit,
t1.code1, t1.type1, t2.code2, t2.type2
from (select *,
row_number() over (partition by person, visit order by type1) as seqnum
from #t1
) t1 inner join
(select *,
row_number() over (partition by person, visit order by type2) as seqnum
from #t2
) t2
on t1.person = t2.person and t1.visit = t2.visit and
t1.seqnum = t2.seqnum
union all
select coalesce(t1.person, t2.person) as person, coalesce(t1.visit, t2.visit) as visit,
t1.code1, t1.type1, t2.code2, t2.type2
from (
(select person, visit, MAX(seqnum) as max_rownum from (
select person, visit,
row_number() over (partition by person, visit order by type1) as seqnum
from #t1) t1_f
group by person, visit
) t1_m inner join
(select *, row_number() over (partition by person, visit order by type1) as seqnum
from #t1
) t1
on t1.person=t1_m.person and t1.visit=t1_m.visit
and t1.seqnum=t1_m.max_rownum
inner join
(select *,
row_number() over (partition by person, visit order by type2) as seqnum
from #t2
) t2
on t1.person = t2.person and t1.visit = t2.visit and
t1.seqnum < t2.seqnum
)
union all
select t1.person, t1.visit, t1.code1, t1.type1, t2.code2, t2.type2
from #t1 t1 left join #t2 t2
on t2.person=t1.person and t2.visit=t1.visit
where t2.code2 is null
You can replicate a SAS merge by adding a row_number() to each table:
select t1.*, t2.*
from (select t1.*,
row_number() over (partition by person, visit order by ??) as seqnum
from t1
) t1 full outer join
(select t2.*,
row_number() over (partition by person, visit order by ??) as seqnum
from t2
) t2
on t1.person = t2.person and t1.visit = t2.visit and
t1.seqnum = t2.seqnum;
Notes:
The ?? means to put in the column(s) used for ordering. SAS datasets have an intrinsic order. SQL tables do not, so the ordering needs to be specified.
You should list the columns explicitly (instead of using t1.*, t2.* in the outer query). I think SAS only includes person and visit once in the resulting dataset.
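For instance, if each table had a column that preserves the original row order (a hypothetical insert_seq column, e.g. an identity value), one of the subqueries might read:
-- insert_seq is an assumed column standing in for the ?? ordering
select t1.*,
       row_number() over (partition by person, visit order by insert_seq) as seqnum
from t1;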
EDIT:
Note: the above produces separate values for the key columns. This is easy enough to fix:
select coalesce(t1.person, t2.person) as person,
       coalesce(t1.visit, t2.visit) as visit,
t1.code1, t1.type1, t2.code2, t2.type2
from (select t1.*,
row_number() over (partition by person, visit order by ??) as seqnum
from t1
) t1 full outer join
(select t2.*,
row_number() over (partition by person, visit order by ??) as seqnum
from t2
) t2
on t1.person = t2.person and t1.visit = t2.visit and
t1.seqnum = t2.seqnum;
That fixes the columns issue. You can fix the copying issue by using first_value()/last_value() or by using a more complicated join condition:
select coalesce(t1.person, t2.person) as person,
coalesce(t1.visit, t2.visit) as visit,
t1.code1, t1.type1, t2.code2, t2.type2
from (select t1.*,
count(*) over (partition by person, visit) as cnt,
row_number() over (partition by person, visit order by ??) as seqnum
from t1
) t1 full outer join
(select t2.*,
count(*) over (partition by person, visit) as cnt,
row_number() over (partition by person, visit order by ??) as seqnum
from t2
) t2
on t1.person = t2.person and t1.visit = t2.visit and
(t1.seqnum = t2.seqnum or
(t1.cnt > t2.cnt and t1.seqnum > t2.seqnum and t2.seqnum = t2.cnt) or
     (t2.cnt > t1.cnt and t2.seqnum > t1.seqnum and t1.seqnum = t1.cnt));
This implements the "keep the last row" logic in a single join. Probably for performance reasons, you would want to put this into separate left joins on the original logic.
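A rough sketch of that split (untested; t1s/t2s stand for the derived tables above that add cnt and seqnum, the ordering columns are assumed, and fully unmatched person/visit groups are not handled here):
-- Sketch only: assumes code1/code2 as ordering columns
with t1s as (
    select t1.*,
           count(*) over (partition by person, visit) as cnt,
           row_number() over (partition by person, visit order by code1) as seqnum
    from t1
), t2s as (
    select t2.*,
           count(*) over (partition by person, visit) as cnt,
           row_number() over (partition by person, visit order by code2) as seqnum
    from t2
)
select t1s.person, t1s.visit, t1s.code1, t1s.type1, t2s.code2, t2s.type2
from t1s
join t2s
  on t2s.person = t1s.person and t2s.visit = t1s.visit
 and t2s.seqnum = t1s.seqnum              -- rows that pair up 1:1
union all
select t1s.person, t1s.visit, t1s.code1, t1s.type1, t2s.code2, t2s.type2
from t1s
join t2s
  on t2s.person = t1s.person and t2s.visit = t1s.visit
 and t2s.seqnum = t2s.cnt and t1s.seqnum > t2s.cnt   -- t1 has extra rows: repeat the last t2 row
union all
select t2s.person, t2s.visit, t1s.code1, t1s.type1, t2s.code2, t2s.type2
from t2s
join t1s
  on t1s.person = t2s.person and t1s.visit = t2s.visit
 and t1s.seqnum = t1s.cnt and t2s.seqnum > t1s.cnt;  -- t2 has extra rows: repeat the last t1 row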