Create a column that groups values - SQL

To summarize, I want to put values that are associated with each other into the same group.
Here is what I have:
col1 col2
1 2
1 3
2 3
4 5
5 6
and I want this :
col1 col2 group
1 2 1
1 3 1
2 3 1
4 5 2
5 6 2
To produce those two groups, here are the steps I would follow if I did it manually:
row 1 : 1 is associated with 2, so they are in the same group; let's call it group 1
row 2 : 1 is in group 1 and 1 is now associated with 3, so 3 is also in group 1
row 3 : 2 is in group 1 and 3 is also in group 1, so this row is in group 1
row 4 : 4 is not a value of group 1, so I create a new group called 2 and associate 5 with it
row 5 : 5 is in group 2 and is associated with 6, so 6 is in group 2 as well.
Do you have an idea of how to solve this in SQL?
Note that I am using Hive or PySpark.

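Based on A.R.Ferguson's answer, I was able to work out a solution using PySpark and GraphFrames:
from graphframes import GraphFrame
# Recent GraphFrames releases need a checkpoint directory for connectedComponents();
# `sc` is assumed to be the active SparkContext and the path is only an example.
sc.setCheckpointDir("/tmp/graphframes-checkpoints")
# GraphFrames expects an "id" column on the vertices
vertices = sqlContext.createDataFrame([
("A", 1),
("B", 2),
("C", 3),
("D", 4),
("E", 5),
("F", 6)], ["name", "id"])
# and "src" / "dst" columns on the edges
edges = sqlContext.createDataFrame([
(1, 2),
(1, 3),
(2, 3),
(4, 5),
(5, 6)], ["src", "dst"])
g = GraphFrame(vertices, edges)
# Adds a "component" column; vertices in the same group share its value
result = g.connectedComponents()
result.show()
Thanks again, Ferguson.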
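For small inputs, the manual steps from the question can also be followed without Spark. The following is only a minimal sketch in plain Python using a union-find structure; the function name assign_groups is made up for illustration, and groups are numbered in order of first appearance, which is an assumption about the desired labels.

```python
def assign_groups(pairs):
    """Label each (col1, col2) pair with the number of its connected group."""
    parent = {}

    def find(x):
        # Follow parent links to the representative, shortening the path as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # First pass: merge the two values of every row into one group.
    for a, b in pairs:
        union(a, b)

    # Second pass: number the groups 1, 2, ... in order of first appearance.
    group_ids = {}
    labelled = []
    for a, b in pairs:
        root = find(a)
        group = group_ids.setdefault(root, len(group_ids) + 1)
        labelled.append((a, b, group))
    return labelled


rows = [(1, 2), (1, 3), (2, 3), (4, 5), (5, 6)]
result = assign_groups(rows)
for row in result:
    print(*row)
```

Unlike the row-by-row description, the two passes also handle the case where a later row connects two groups that were created separately.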


case end / self-join postgres sql

I am trying to process data within the same table.
Input:
Table
id sort value
1 1 1
2 1 8
3 2 0
4 1 2
For each id, I want to obtain the first value encountered among the rows that share its sort, with the rows ordered by id.
Output
Table
id sort value new
1 1 1 1
2 1 8 1
3 2 0 0
4 1 2 1
I tried to self-join the table, but I constantly get "relation not found". I also tried a case statement, but I don't see how to refer to the same table; I get the same "relation not found" error.
The beauty of SQL is that many requirements (yours included) can be described in words in much the same way as they are finally coded:
with t(id, sort, value ) as (values
(1, 1, 1),
(2, 1, 8),
(3, 2, 0),
(4, 1, 2)
)
select t.*
, first_value(value) over (partition by sort order by id) as "new"
from t
order by id
id sort value new
1 1 1 1
2 1 8 1
3 2 0 0
4 1 2 1
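To make the window function less of a black box, here is a rough Python equivalent of first_value(value) over (partition by sort order by id). It is only a sketch; the function name first_value_by_partition is hypothetical and the rows are the four from the question.

```python
def first_value_by_partition(rows):
    """rows: (id, sort, value) tuples. Appends the first value seen per sort, by ascending id."""
    first_seen = {}
    out = []
    for row_id, sort, value in sorted(rows, key=lambda r: r[0]):
        # Remember the value of the first row encountered in each partition.
        first_seen.setdefault(sort, value)
        out.append((row_id, sort, value, first_seen[sort]))
    return out


table = [(1, 1, 1), (2, 1, 8), (3, 2, 0), (4, 1, 2)]
result = first_value_by_partition(table)
for row in result:
    print(*row)
```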

Sequence Numbering in SQL Server

I have a table (always ordered by ID ascending) with 6 records:
ID Sequence
1 1
2 2
3 3
4 4
8 3
9 3
And the desired output is :
ID Sequence
1 1
2 2
3 3
4 6
8 4
9 5
Looks like three direct updates to me; there is no point overcomplicating it (t stands for your table name):
UPDATE t SET sequence = 6 WHERE id = 4;
UPDATE t SET sequence = 4 WHERE id = 8;
UPDATE t SET sequence = 5 WHERE id = 9;
If you want to do this in one step:
update t
set sequence = v.sequence
from t join
(values (4, 6), (8, 4), (9, 5)
) v(id, sequence)
on t.id = v.id;
If you have to do many of these updates, then separate calls to update incur extra overhead.
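As a rough illustration of the single-statement idea, the sketch below builds one parameterized UPDATE from a mapping of id to new sequence. It is an assumption-heavy stand-in: it uses Python's built-in sqlite3 module rather than SQL Server, and a CASE expression instead of the VALUES join shown above, so that it can run anywhere.

```python
import sqlite3

# In-memory SQLite database standing in for the real table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, sequence INTEGER)")
conn.executemany(
    "INSERT INTO t (id, sequence) VALUES (?, ?)",
    [(1, 1), (2, 2), (3, 3), (4, 4), (8, 3), (9, 3)],
)

# id -> new sequence, taken from the desired output in the question.
new_sequence = {4: 6, 8: 4, 9: 5}

# One "WHEN ? THEN ?" branch per id, plus an IN list restricting the update.
case_sql = " ".join("WHEN ? THEN ?" for _ in new_sequence)
in_sql = ", ".join("?" for _ in new_sequence)
sql = f"UPDATE t SET sequence = CASE id {case_sql} END WHERE id IN ({in_sql})"
params = [v for pair in new_sequence.items() for v in pair] + list(new_sequence)

with conn:  # commits on success
    conn.execute(sql, params)

rows = conn.execute("SELECT id, sequence FROM t ORDER BY id").fetchall()
print(rows)
```

All values are passed as bound parameters, so only the number of placeholders depends on the input.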

Postgres Group and order by range (specific values)

I have a table like this:
id simcard simcard_order
80769 56407503370245588410 1
80788 66329183922439284822 2
80803 20993658565113174305 0
80804 81781641934100313243 4
80852 71560493627263868232 3
80784 23739383536995189713 1
80793 42702512646659519628 2
80805 17990699721985463506 0
80832 08525531276567944345 4
80854 74478849586042090832 3
80786 22535328208807554315 1
80812 34317440773382930807 0
80826 36103390459816949722 2
80858 15439885499080289130 3
80862 26786481240939036248 4
80792 59566921916027957512 1
80813 98968026512101636608 0
80835 65834894114116066528 2
80864 17764015687751814947 4
80882 41427844162545991837 3
80887 41587969946566907740 4
80891 46059625228552654737 3
80824 76381392106884963712 1
80863 77385361462191701926 2
80868 46607630719285200008 0
80892 08860583551940471945 4
80899 85443153649210377733 3
80934 90908807112484733323 2
80937 25660906025678471304 0
80967 34298088103509862330 3
The column simcard_order has repeating values from 0 to 4.
I want to order the table like this:
id simcard simcard_order
80769 56407503370245588410 0
80788 66329183922439284822 1
80803 20993658565113174305 2
80804 81781641934100313243 3
80852 71560493627263868232 4
80784 23739383536995189713 0
80793 42702512646659519628 1
80805 17990699721985463506 2
80832 08525531276567944345 3
80854 74478849586042090832 4
80786 22535328208807554315 0
80812 34317440773382930807 1
80826 36103390459816949722 2
80858 15439885499080289130 3
80862 26786481240939036248 4
....
and so on... So in this case I have 3 groups of (0, 1, 2, 3, 4).
The order must always be 0, 1, 2, 3, 4.
I have used this SQL, but it does not work properly:
SELECT id, simcard, simcard_order
FROM tmp_pending_simcards
WHERE tmp_pending_simcards.simcard_order IN (0, 1, 2, 3, 4)
ORDER BY (0, 1, 2, 3, 4)
If I understand correctly:
SELECT id, simcard, simcard_order
FROM tmp_pending_simcards tps
WHERE tps.simcard_order IN (0, 1, 2, 3, 4)
ORDER BY ROW_NUMBER() OVER (PARTITION BY tps.simcard_order ORDER BY tps.id),
tps.simcard_order;
Postgres does not require an ORDER BY inside ROW_NUMBER(), but without one the numbering within each simcard_order is arbitrary; ordering by id makes the result repeatable.
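The idea behind that ORDER BY (number the rows within each simcard_order value, then sort by that number first and by simcard_order second) can be sketched in Python as follows. The function name interleave is hypothetical, the simcard column is left out for brevity, rows are numbered by ascending id (an assumption), and only the first ten rows of the sample table are used.

```python
from collections import defaultdict


def interleave(rows):
    """rows: (id, simcard_order) tuples. Returns them so each run reads 0, 1, 2, 3, 4."""
    seen = defaultdict(int)
    numbered = []
    for row_id, order in sorted(rows):
        seen[order] += 1  # row number within this simcard_order, by ascending id
        numbered.append((seen[order], order, row_id))
    numbered.sort()  # row number first, then simcard_order
    return [(row_id, order) for _, order, row_id in numbered]


sample = [
    (80769, 1), (80788, 2), (80803, 0), (80804, 4), (80852, 3),
    (80784, 1), (80793, 2), (80805, 0), (80832, 4), (80854, 3),
]
result = interleave(sample)
for row in result:
    print(*row)
```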

Using UNPIVOT and column names for field values

The context is online assessment responses. The source data I have to work with is one record per Participant ( test taker ) with over 140 columns representing the answer value for each question. The column name would correspond to the "ItemCode" of the specific question.
I want to move this data into the following table
ResponseItems
ParticipantID
Form ( 1 or 2 depending on ParticipantID )
ItemCode ( Name of the Column )
ItemDesc ( pulled from Codebook on ItemCode and Form )
AnswerValue
I can attack this with a large Union statement but it does not scale to 140 columns wide.
But as an example, here is code that works for the 1st 3 Item Codes or Columns of answers:
Select ar.ParticipantID, ar.Form, c.ItemCode, c.ItemDesc, ar.SJT_01 as 'AnswerValue'
From dbo.Responses ar
Inner Join dbo.CodeBook c on c.ItemCode = 'SJT_01'
and c.Form = ar.Form
UNION ALL
Select ar.ParticipantID, ar.Form, c.ItemCode, c.ItemDesc, ar.SJT_02 as 'AnswerValue'
From dbo.Responses ar
Inner Join dbo.CodeBook c on c.ItemCode = 'SJT_02'
and c.Form = ar.Form
UNION ALL
Select ar.ParticipantID, ar.Form, c.ItemCode, c.ItemDesc, ar.SJT_03 as 'AnswerValue'
From dbo.Responses ar
Inner Join dbo.CodeBook c on c.ItemCode = 'SJT_03'
and c.Form = ar.Form
If I had 100 Test Takers and 3 Item Codes ( 3 columns with answers I want to capture ), I would be taking a 100 record table and creating 300 records.
I have studied the UNPIVOT command but I can't seem to get it to work.
Any suggestions would be greatly appreciated.
Source data with 10 test takers ... results desired are
ParticipantID, Form, ItemCode, AnswerValue
AICPAPSS003 1 SJT_01 5
AICPAPSS003 1 SJT_02 1
AICPAPSS003 1 SJT_03 3
AICPAPSS007 1 SJT_01 3
AICPAPSS007 1 SJT_02 1
AICPAPSS007 1 SJT_03 5
etc... ( a total of 30 records would be created )
Here you go - I think this is what you were after. If you have 140+ columns you might want to use some dynamic SQL to generate the final query. There are heaps of examples on SO.
Setup
IF(OBJECT_ID('Tempdb..#Test')) IS NOT NULL
DROP TABLE #Test;
SELECT * INTO #Test FROM (VALUES
('AICPAPSS003',1,5, 1, 3),
('AICPAPSS007',1,3, 1, 5),
('AICPAPSS012',1,2, 1, 4),
('AICPAPSS016',1,3, 2, 5),
('AICPAPSS019',1,1, 2, 5),
('AICPAPSS024',1,3, 2, 4),
('AICPAPSS025',1,1, 1, 4),
('AICPAPSS032',1,1, 2, 4),
('AICPAPSS033',1,3, 4, 5),
('AICPAPSS034',1,1, 2, 4)) A (ParticipantID, Form, SJT_01, SJT_02, SJT_03);
Query
SELECT ParticipantID, Form, ItemCode, Value
FROM #Test t
UNPIVOT
(
Value FOR ItemCode IN (SJT_01, SJT_02, SJT_03)
) u;
Results
ParticipantID Form ItemCode Value
------------- ---- -------- -----
AICPAPSS003 1 SJT_01 5
AICPAPSS003 1 SJT_02 1
AICPAPSS003 1 SJT_03 3
AICPAPSS007 1 SJT_01 3
AICPAPSS007 1 SJT_02 1
AICPAPSS007 1 SJT_03 5
AICPAPSS012 1 SJT_01 2
AICPAPSS012 1 SJT_02 1
AICPAPSS012 1 SJT_03 4
AICPAPSS016 1 SJT_01 3
AICPAPSS016 1 SJT_02 2
AICPAPSS016 1 SJT_03 5
AICPAPSS019 1 SJT_01 1
AICPAPSS019 1 SJT_02 2
AICPAPSS019 1 SJT_03 5
AICPAPSS024 1 SJT_01 3
AICPAPSS024 1 SJT_02 2
AICPAPSS024 1 SJT_03 4
AICPAPSS025 1 SJT_01 1
AICPAPSS025 1 SJT_02 1
AICPAPSS025 1 SJT_03 4
AICPAPSS032 1 SJT_01 1
AICPAPSS032 1 SJT_02 2
AICPAPSS032 1 SJT_03 4
AICPAPSS033 1 SJT_01 3
AICPAPSS033 1 SJT_02 4
AICPAPSS033 1 SJT_03 5
AICPAPSS034 1 SJT_01 1
AICPAPSS034 1 SJT_02 2
AICPAPSS034 1 SJT_03 4
(30 row(s) affected)
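For the 140+ column case, the column list does not have to be typed by hand. The sketch below generates the UNPIVOT statement from a list of column names in Python; build_unpivot_sql is a hypothetical helper, the SJT_01 ... SJT_NN naming pattern is an assumption based on the three columns shown, and in practice the names could be read from the table's metadata instead.

```python
def build_unpivot_sql(table, key_columns, item_columns):
    """Return a T-SQL UNPIVOT query for the given table and columns.

    Names are wrapped in square brackets; this assumes plain identifiers
    that do not themselves contain a closing bracket.
    """
    keys = ", ".join(f"[{c}]" for c in key_columns)
    items = ", ".join(f"[{c}]" for c in item_columns)
    return (
        f"SELECT {keys}, ItemCode, Value\n"
        f"FROM {table} t\n"
        f"UNPIVOT (Value FOR ItemCode IN ({items})) u;"
    )


# Assumed naming pattern; widen the range for more answer columns.
item_columns = [f"SJT_{n:02d}" for n in range(1, 4)]
sql = build_unpivot_sql("#Test", ["ParticipantID", "Form"], item_columns)
print(sql)
```

The generated text for three columns is the same query as above, so it can be checked against the 30-row result before scaling up.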

SQL SERVER Find out non duplicate data

I have a table with 3 columns as below
id a b
=================
1 1 2
2 1 3
3 1 4
4 2 4
5 2 5
6 3 4
7 3 5
I want to show the result for the case where column a or column b is duplicated.
I have tried GROUP BY (a, b), but the result is not what I want.
I want to group by a and show the first row (a, b) of each group such that b is not duplicated.
In my example, a will be grouped into {1, 2, 3},
and b will show {2, 4, 5}, not {2, 4, 4}, because 4, 4 is duplicated.
id a b
=================
1 1 2
4 2 4
7 3 5
How do I do this?
Sorry, I’m not good at English.
Thx for help.
This code goes from your example data to your example results. Seems strange though and I doubt this is what you're looking for. If you give more detail, then you can get a better answer.
CREATE TABLE Example
(
id INT NOT NULL,
a INT NOT NULL,
b INT NOT NULL
)
GO
INSERT Example
VALUES
(1, 1, 2)
, (2, 1, 3)
, (3, 1, 4)
, (4, 2, 3)
, (5, 2, 4)
, (6, 3, 4)
SELECT
MIN(id) AS id
, a
, MIN(b) AS b
FROM
Example
GROUP BY
a
DROP TABLE Example
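One possible reading of the question is "for each a, take the first row whose b has not been used by an earlier group". With the seven rows from the question, taking MIN(id) and MIN(b) per a would give (6, 3, 4) for a = 3 rather than the requested (7, 3, 5), so here is a small Python sketch of that greedy reading. The function name first_unused_b is made up, and whether this matches the asker's intent is an assumption.

```python
def first_unused_b(rows):
    """rows: (id, a, b) tuples. For each a, keep the first row (by id) whose b is still unused."""
    used_b = set()
    done_a = set()
    picked = []
    for row_id, a, b in sorted(rows):
        if a in done_a or b in used_b:
            continue  # this a already has a row, or this b was already taken
        done_a.add(a)
        used_b.add(b)
        picked.append((row_id, a, b))
    return picked


data = [
    (1, 1, 2), (2, 1, 3), (3, 1, 4), (4, 2, 4),
    (5, 2, 5), (6, 3, 4), (7, 3, 5),
]
result = first_unused_b(data)
for row in result:
    print(*row)
```

A greedy pass like this can leave a group without a row if all of its b values are already taken; the sample data does not hit that case.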