Group observations with SQL and assign linked observations to the same group

I have a table consisting of two columns (X,Y) that represent correlations between observations like below.
X Y
1 2
2 3
3 4
A B
B C
I want to create a new column that represents the relation between observations: 1 becomes 2, 2 becomes 3, 3 becomes 4. So I want to show these values in the same group (1, 2, 3, and 4 belong to the same group). The table should look like this:
X Y Z
1 2 Group 1
2 3 Group 1
3 4 Group 1
A B Group 2
B C Group 2
I am using SAS Enterprise Guide. A solution in PROC SQL or any SQL dialect would be great. I need the logic.
Note: I have no additional information except this table.

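Try the following. The demo is in PostgreSQL, but you may be able to use the same logic. It relies on each row's x matching the y of the previous row (ordered by x) whenever the two rows belong to the same group.
with cte as
(
    select
        *,
        lag(y) over (order by x) as rnk -- y of the previous row
    from myTable
)
select
    x,
    y,
    -- start a new group whenever x does not continue the previous row's y
    concat('Group ', sum(case when x = rnk then 0 else 1 end) over (order by x)) as z
from cte;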
Output:
| x | y | z |
| --- | --- | ------- |
| 1 | 2 | Group 1 |
| 2 | 3 | Group 1 |
| 3 | 4 | Group 1 |
| A | B | Group 2 |
| B | C | Group 2 |
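Since the question asks for the logic: the query above only compares neighbouring rows, so it handles chains that line up when sorted by x. Below is a minimal Python sketch of the more general idea (treat every row as a link and merge linked values with a union-find structure). It is not a SAS or PROC SQL solution, and the function name `label_groups` is made up for illustration.

```python
def label_groups(pairs):
    """Label each (x, y) pair so that all directly or indirectly linked values share a group."""
    parent = {}

    def find(v):
        # Walk up to the representative of v's group, shortening the path on the way.
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for x, y in pairs:
        # A row (x, y) links the two values, so their groups are merged.
        parent[find(x)] = find(y)

    labels = {}  # representative -> group number, numbered by first appearance
    result = []
    for x, y in pairs:
        root = find(x)
        labels.setdefault(root, len(labels) + 1)
        result.append((x, y, f"Group {labels[root]}"))
    return result


rows = [("1", "2"), ("2", "3"), ("3", "4"), ("A", "B"), ("B", "C")]
for row in label_groups(rows):
    print(row)
```

Unlike the lag-based query, this does not depend on row order, so a link such as (3, 4) listed before (1, 2) and (2, 3) still ends up in the same group.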

Related

Query to count occurrences of a particular column value

Let's say I have a table with the following values:
1
1
1
2
2
2
3
3
3
1
1
1
2
2
2
I need to get an output like this, which counts each occurrence of a particular value:
1 1
1 2
1 3
2 1
2 2
2 3
3 1
3 2
3 3
1 1
1 2
1 3
2 1
2 2
2 3
NB: This is a sample table. The actual table is complex, with lots of rows and columns, and the query contains some more conditions.
If the number repeats over different "islands", then you first need to calculate a value that identifies those islands (grpnum). That first step can be done by subtracting a partitioned row number from a raw top-to-bottom row number (raw_rownum). The result is constant within each "island", so together with num it gives a reference that can then be used to partition a subsequent row number. As each order by can disturb the outcome, I find it necessary to use individual steps and to pass the prior calculation up so it can be reused.
SQL Fiddle
MS SQL Server 2014 Schema Setup:
CREATE TABLE Table1 ([num] int);
INSERT INTO Table1 ([num])
VALUES (1),(1),(1),(2),(2),(2),(3),(3),(3),(1),(1),(1),(2),(2),(2);
Query 1:
select
      num
      -- partition on both columns: the sum grpnum + num alone can collide across islands
    , row_number() over(partition by num, grpnum order by raw_rownum) rn
    , grpnum + num island_num
from (
    select
          num
        , raw_rownum - row_number() over(partition by num order by raw_rownum) grpnum
        , raw_rownum
    from (
        select
              num
            , row_number() over(order by (select null)) as raw_rownum
        from table1
    ) r
) d
;
Results:
| num | rn | island_num |
|-----|----|------------|
| 1 | 1 | 1 |
| 1 | 2 | 1 |
| 1 | 3 | 1 |
| 2 | 1 | 5 |
| 2 | 2 | 5 |
| 2 | 3 | 5 |
| 1 | 1 | 7 |
| 1 | 2 | 7 |
| 1 | 3 | 7 |
| 3 | 1 | 9 |
| 3 | 2 | 9 |
| 3 | 3 | 9 |
| 2 | 1 | 11 |
| 2 | 2 | 11 |
| 2 | 3 | 11 |
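To make the row-number difference easier to follow, here is a small Python sketch of the same idea. It assumes the list order stands in for the raw row order, and the function name `count_within_islands` is only illustrative.

```python
from collections import defaultdict


def count_within_islands(values):
    """Return (value, position within its island) for every value, in input order."""
    seen_per_value = defaultdict(int)   # acts like row_number() partitioned by num
    seen_per_island = defaultdict(int)  # acts like the final row_number() per island
    out = []
    for raw_rownum, num in enumerate(values, start=1):
        seen_per_value[num] += 1
        # Raw row number minus per-value row number stays constant inside one island.
        grpnum = raw_rownum - seen_per_value[num]
        island = (num, grpnum)  # pair the value with the difference to identify the island
        seen_per_island[island] += 1
        out.append((num, seen_per_island[island]))
    return out


sample = [1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 1, 1, 2, 2, 2]
for num, rn in count_within_islands(sample):
    print(num, rn)
```

The sketch keys each island on the pair (num, grpnum) rather than on a single derived number, so two different values can never share an island key.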
SQL Server provides the row_number() function:
select ID, ROW_NUMBER() OVER (PARTITION BY ID ORDER BY ID) RN FROM <TABLE_NAME>
EDIT:
select *,
       case when (row_number() over (order by (select 1))) % 3 = 0 then 3
            else (row_number() over (order by (select 1))) % 3
       end [rn]
from table
I think there is a problem with your sample, in that you have an implied order but not an explicit one. There is no guarantee that the database will keep and store the values the way you have them listed, so there has to be some inherent/explicit ordering mechanism to tell the database to give those values back exactly the way you listed.
For example, if you did this:
update test
set val = val + 2
where val < 3
You would find your select * no longer comes back the way you expected.
You indicated your actual table was huge, so I assume you have something like this you can use. There should be something in the table to indicate the order you want... a timestamp, perhaps, or maybe a surrogate key.
That said, assuming you have something like that and can leverage it, I believe a series of windowing functions would work.
with rowed as (
    select
        val,
        case
            when lag (val, 1, -1) over (order by 1) = val then 0
            else 1
        end as idx,
        row_number() over (order by 1) as rn -- fix this once you have your order
    from
        test
),
partitioned as (
    select
        val, rn,
        sum (idx) over (order by rn) as instance
    from rowed
)
select
    val, instance, count (1) over (partition by instance order by rn)
from
    partitioned
This example orders by the way they are listed in the database, but you would want to change the row_number function to accommodate whatever your real ordering mechanism is.
1 1 1
1 1 2
1 1 3
2 2 1
2 2 2
2 2 3
3 3 1
3 3 2
3 3 3
1 4 1
1 4 2
1 4 3
2 5 1
2 5 2
2 5 3
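The three steps of that query (flag a change, take a running sum of the flags, count within each resulting instance) can be sketched in Python as follows. This is only an illustration of the logic, assuming the list order is the real ordering; `label_runs` is a hypothetical name.

```python
from itertools import accumulate


def label_runs(values):
    """Return (value, instance, running count within the instance) for each row."""
    # Step 1: flag rows whose value differs from the previous row (the lag comparison).
    flags = [1 if i == 0 or v != values[i - 1] else 0 for i, v in enumerate(values)]
    # Step 2: a running sum of the flags numbers the runs 1, 2, 3, ...
    instances = list(accumulate(flags))
    # Step 3: count rows within each run, restarting whenever a new run begins.
    out = []
    count = 0
    for value, flag, instance in zip(values, flags, instances):
        count = 1 if flag else count + 1
        out.append((value, instance, count))
    return out


for row in label_runs([1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 1, 1, 2, 2, 2]):
    print(*row)
```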

Distinct values by multiple columns after already applied GROUP BY

Basically, I have the following query (actually more complex, but I think this simplification is ok):
SELECT a, b, x
FROM table
output:
a | b | x
-----------
1 | 2 | 34
1 | 3 | 35
1 | 3 | 36
1 | 4 | 37
2 | 3 | 38
2 | 3 | 39
2 | 4 | 40
3 | 4 | 41
3 | 5 | 42
To count the number of occurrences of each "pair of a and b", I'm using GROUP BY:
SELECT a, b, COUNT(x) AS count
FROM table
GROUP BY a, b
ORDER BY count
output:
a | b | count
--------------
1 | 2 | 1
1 | 4 | 1
2 | 4 | 1
3 | 4 | 1
3 | 5 | 1
1 | 3 | 2
2 | 3 | 2
What bothers me is the multiple occurrences of a and b. I would like to keep the "count" as it is, but remove every following row if a or b was already in a previous row. It would be nice to have if it also removed a row when the value of "a" appeared in a previous row as "b", and vice versa.
Preferred expected output:
a | b | count
--------------
1 | 2 | 1
1 | 4 | 1 <- should not be in output since we had a=1
2 | 4 | 1 <- should not be in output since we had b=
3 | 4 | 1
3 | 5 | 1 <- should not be in output since we had a=3
1 | 3 | 2 <- should not be in output since we had a=1 / a=3
2 | 3 | 2 <- should not be in output since we had b=2 / a=3
Therefore, this:
a | b | count
--------------
1 | 2 | 1
3 | 4 | 1
Alternative expected output, if the above would be too complex:
a | b | count
--------------
1 | 2 | 1
1 | 4 | 1 <- should not be in output since we had a=1
2 | 4 | 1
3 | 4 | 1 <- should not be in output since we had b=4
3 | 5 | 1
1 | 3 | 2 <- should not be in output since we had a=1
2 | 3 | 2 <- should not be in output since we had a=2
Therefore, this:
a | b | count
--------------
1 | 2 | 1
2 | 4 | 1
3 | 5 | 1
This is rather a mess of a problem, but here's something to consider:
SELECT a, b, count
FROM (
    SELECT a, b, count,
        rank() over (partition by b order by count, a) as b_rank
    FROM (
        SELECT a, b, count,
            rank() over (partition by a order by count, b) as a_rank
        FROM (
            SELECT a, b, COUNT(*) AS count
            FROM t
            GROUP BY a, b
            ORDER BY count
        ) pc
    ) pc2
    WHERE a_rank < 3
) pc3
WHERE b_rank = 1
Each a value will appear at most twice in the results, but b values will be unique. Some b values appearing in low-count pairs may not be reflected in the results. There is a trade-off between possible duplication of a and the number of b values that may be missed altogether: allowing more duplicates of a (by changing to, e.g., WHERE a_rank < 4) reduces the number of b values that may be missed.
This script will give you the desired output (the alternative one, where a and b are each checked against their own column).
DECLARE @id INT = 1,
        @a INT,
        @b INT,
        @count INT
DECLARE @tbl TABLE
(
    id INT IDENTITY(1,1),
    a INT,
    b INT,
    count INT
)
-- number the pairs in the order they should be examined
INSERT INTO @tbl
SELECT a, b, COUNT(1) AS COUNT FROM dbo.myTable
GROUP BY a, b
ORDER BY COUNT, a, b
SELECT @count = COUNT(1) FROM @tbl
WHILE @id <= @count
BEGIN
    SELECT TOP 1 @a = a, @b = b FROM @tbl WHERE id = @id
    -- drop the row if an earlier surviving row already used this a or b
    IF EXISTS(SELECT 1 FROM @tbl WHERE id < @id AND (a = @a OR b = @b))
        DELETE FROM @tbl WHERE id = @id
    SET @id += 1
END
SELECT a, b, count FROM @tbl
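Both expected outputs in the question can be read as a greedy pass over the counted pairs: keep a row only if it does not clash with a row that was already kept. Here is a hedged Python sketch of that reading; the function name `greedy_filter` and the `cross_columns` switch are assumptions made for illustration, not part of any answer above.

```python
def greedy_filter(rows, cross_columns=False):
    """Keep a row only if its a/b values were not used by an earlier kept row."""
    seen_a, seen_b = set(), set()
    kept = []
    for a, b, count in rows:
        if cross_columns:
            # Stricter rule: a value may not reappear in either column.
            clash = bool({a, b} & (seen_a | seen_b))
        else:
            # Per-column rule: a must be new among kept a's, b among kept b's.
            clash = a in seen_a or b in seen_b
        if not clash:
            kept.append((a, b, count))
            seen_a.add(a)
            seen_b.add(b)
    return kept


# Pairs already grouped and sorted by count, then a, then b, as in the question.
pairs = [(1, 2, 1), (1, 4, 1), (2, 4, 1), (3, 4, 1), (3, 5, 1), (1, 3, 2), (2, 3, 2)]
print(greedy_filter(pairs))                      # alternative expected output
print(greedy_filter(pairs, cross_columns=True))  # preferred expected output
```

The result depends on the order in which rows are visited, which is why the pairs are sorted before the pass.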

PostgreSQL distinct rows joined with a count of distinct values in one column

I'm using PostgreSQL 9.4, and I have a table with 13 million rows and with data roughly as follows:
a | b | u | t
-----+---+----+----
foo | 1 | 1 | 10
foo | 1 | 2 | 11
foo | 1 | 2 | 11
foo | 2 | 4 | 1
foo | 3 | 5 | 2
bar | 1 | 6 | 2
bar | 2 | 7 | 2
bar | 2 | 8 | 3
bar | 3 | 9 | 4
bar | 4 | 10 | 5
bar | 5 | 11 | 6
baz | 1 | 12 | 1
baz | 1 | 13 | 2
baz | 1 | 13 | 2
baz | 1 | 13 | 3
There are indices on md5(a), on b, and on (md5(a), b). (In reality, a may contain values longer than 4k chars.) There is also a primary key column of type SERIAL which I have omitted above.
I'm trying to build a query which will return the following results:
a | b | u | t | z
-----+---+----+----+---
foo | 1 | 1 | 10 | 3
foo | 1 | 2 | 11 | 3
foo | 2 | 4 | 1 | 3
foo | 3 | 5 | 2 | 3
bar | 1 | 6 | 2 | 5
bar | 2 | 7 | 2 | 5
bar | 2 | 8 | 3 | 5
bar | 3 | 9 | 4 | 5
bar | 4 | 10 | 5 | 5
bar | 5 | 11 | 6 | 5
In these results, all rows are deduplicated as if GROUP BY a, b, u, t were applied, z is a count of distinct values of b for every partition over a, and only rows with a z value greater than 2 are included.
I can get just the z filter working as follows:
SELECT a, COUNT(b) AS z
FROM (SELECT DISTINCT a, b FROM t) AS foo
GROUP BY a
HAVING COUNT(b) > 2;
However, I'm stumped on combining this with the rest of the data in the table.
What's the most efficient way to do this?
Your first step can be simpler already:
SELECT md5(a) AS md5_a, count(DISTINCT b) AS z
FROM t
GROUP BY 1
HAVING count(DISTINCT b) > 2;
I'm working with md5(a) in place of a, since a can obviously be very long and you already have an index on md5(a).
Since your table is big, you need an efficient query. This should be among the fastest possible solutions - with adequate index support. Your index on (md5(a), b) is instrumental but - assuming b, u, and t are small columns - an index on (md5(a), b, u, t) would be even better for the second step of the query (the lateral join).
Your desired end result:
SELECT DISTINCT ON (md5(t.a), b, u, t)
       t.a, t.b, t.u, t.t, a.z
FROM  (
    SELECT md5(a) AS md5_a, count(DISTINCT b) AS z
    FROM   t
    GROUP  BY 1
    HAVING count(DISTINCT b) > 2
    ) a
JOIN   t ON md5(t.a) = md5_a
ORDER  BY 1, 2, 3, 4; -- optional
Or probably faster, yet:
SELECT a, b, u, t, z
FROM  (
    SELECT DISTINCT ON (1, 2, 3, 4)
           md5(t.a) AS md5_a, t.b, t.u, t.t, t.a
    FROM   t
    ) t
JOIN  (
    SELECT md5(a) AS md5_a, count(DISTINCT b) AS z
    FROM   t
    GROUP  BY 1
    HAVING count(DISTINCT b) > 2
    ) z USING (md5_a)
ORDER  BY 1, 2, 3, 4; -- optional
Detailed explanation for DISTINCT ON:
Select first row in each GROUP BY group?
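For readers who want to check the intended result independently of PostgreSQL, the two steps (count distinct b per a, then attach that count to the deduplicated rows) can be sketched in plain Python. This says nothing about performance on 13 million rows; the function name `dedupe_with_distinct_count` is hypothetical.

```python
from collections import defaultdict


def dedupe_with_distinct_count(rows, min_exclusive=2):
    """Deduplicate (a, b, u, t) rows and attach z = number of distinct b values per a."""
    distinct_b = defaultdict(set)
    for a, b, u, t in rows:
        distinct_b[a].add(b)
    result = []
    # dict.fromkeys drops exact duplicate rows while keeping first-seen order.
    for a, b, u, t in dict.fromkeys(rows):
        z = len(distinct_b[a])
        if z > min_exclusive:
            result.append((a, b, u, t, z))
    return result


data = [
    ("foo", 1, 1, 10), ("foo", 1, 2, 11), ("foo", 1, 2, 11), ("foo", 2, 4, 1),
    ("foo", 3, 5, 2), ("bar", 1, 6, 2), ("bar", 2, 7, 2), ("bar", 2, 8, 3),
    ("bar", 3, 9, 4), ("bar", 4, 10, 5), ("bar", 5, 11, 6), ("baz", 1, 12, 1),
    ("baz", 1, 13, 2), ("baz", 1, 13, 2), ("baz", 1, 13, 3),
]
for row in dedupe_with_distinct_count(data):
    print(row)
```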

Select non distinct rows from two columns

My question is very similar to "Multiple NOT distinct", only it deals with multiple columns instead of one. I have a table like so:
A B C
1 1 0
1 2 1
2 1 2
2 1 3
2 2 4
2 3 5
2 3 6
3 1 7
3 3 8
3 1 9
And the result should be:
A B C
2 1 2
2 1 3
2 3 5
2 3 6
3 1 7
3 1 9
Essentially, like the above question, removing all unique entries only where uniqueness is determined by two columns instead of one. I already tried various tweaks to the above answer but couldn't get any of them to work.
You are using SQL Server, so this is easier than in Access:
select A, B, C
from (select t.*, count(*) over (partition by A, B) as cnt
      from t
     ) t
where cnt > 1;
This use of count(*) is as a window function. It is counting the number of rows with the same value of A and B. The final where just selects the rows that have more than one entry.
Another possible solution with EXISTS:
SELECT a, b, c
FROM Table1 t
WHERE EXISTS
(
    SELECT 1
    FROM Table1
    WHERE a = t.a
      AND b = t.b
      AND c <> t.c
)
It should be fast enough.
Output:
| A | B | C |
-------------
| 2 | 1 | 2 |
| 2 | 1 | 3 |
| 2 | 3 | 5 |
| 2 | 3 | 6 |
| 3 | 1 | 7 |
| 3 | 1 | 9 |
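The windowed count boils down to counting each (A, B) pair and keeping the rows whose pair occurs more than once. A minimal Python sketch of that logic, with `non_unique_rows` as an illustrative name:

```python
from collections import Counter


def non_unique_rows(rows):
    """Keep rows whose (A, B) pair occurs more than once."""
    pair_counts = Counter((a, b) for a, b, _ in rows)
    return [row for row in rows if pair_counts[(row[0], row[1])] > 1]


table = [
    (1, 1, 0), (1, 2, 1), (2, 1, 2), (2, 1, 3), (2, 2, 4),
    (2, 3, 5), (2, 3, 6), (3, 1, 7), (3, 3, 8), (3, 1, 9),
]
for row in non_unique_rows(table):
    print(*row)
```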

SQL Finding multiple-line duplicates

At my work we have data stored in a database; the data is not normalized. I am looking for a way to find what data was duplicated.
Our database has 3 columns: Name, State, Strategy.
This data might look something like this:
OldTable:
Name | State | Strat
-----+-------+------
A | M | 1
A | X | 3
B | T | 6
C | M | 1
C | X | 3
D | X | 3
What I'd like to do is move the data to two tables, one containing the name and the other containing the set of States and Strats, so it would look more like this:
NewTable0:
Name | StratID
-----+--------
A | 1
B | 2
C | 1
D | 3
NewTable1:
StratID | State | Strat
--------+-------+------
1 | M | 1
1 | X | 3
2 | T | 6
3 | X | 3
So in the data example A and C would be duplicates, but D would not be. How would I go about finding and/or identifying these duplicates?
Try:
SELECT OT1.Name Name1, OT2.Name Name2
FROM OldTable OT1
JOIN OldTable OT2 ON OT1.Name < OT2.Name AND
                     OT1.State = OT2.State AND
                     OT1.Strat = OT2.Strat
GROUP BY OT1.Name, OT2.Name
HAVING COUNT(*) = (SELECT COUNT(*) FROM OldTable TC1 WHERE TC1.NAME = OT1.NAME)
   AND COUNT(*) = (SELECT COUNT(*) FROM OldTable TC2 WHERE TC2.NAME = OT2.NAME)
You could find this out by grouping the Names together, and only listing those where there is more than one record:
SELECT OldTable.Name, COUNT(1) Duplicates
FROM OldTable
GROUP BY OldTable.Name
HAVING COUNT(1) > 1 -- repeat the aggregate: not every dialect accepts the alias here
Should output:
OldTable:
Name | Duplicates
-----+------------
A | 2
C | 2
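Coming back to the question's goal (names count as duplicates when their whole set of State/Strat rows is identical), here is a small Python sketch of one way to derive the two new tables. It assumes the data fits in memory, and `assign_strat_ids` is a name made up for this example.

```python
from collections import defaultdict


def assign_strat_ids(rows):
    """Give every name a StratID shared by all names with the same (State, Strat) set."""
    per_name = defaultdict(set)
    for name, state, strat in rows:
        per_name[name].add((state, strat))
    strat_ids = {}   # frozenset of (state, strat) pairs -> StratID
    name_to_id = {}  # name -> StratID
    for name, pairs in per_name.items():
        key = frozenset(pairs)
        # Reuse the id if this exact set was seen before, otherwise hand out the next one.
        strat_ids.setdefault(key, len(strat_ids) + 1)
        name_to_id[name] = strat_ids[key]
    return name_to_id, strat_ids


old_table = [
    ("A", "M", 1), ("A", "X", 3), ("B", "T", 6),
    ("C", "M", 1), ("C", "X", 3), ("D", "X", 3),
]
name_to_id, strat_ids = assign_strat_ids(old_table)
print(name_to_id)
```

Here A and C share a StratID because their sets match exactly, while D gets its own id because it only has the (X, 3) row.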