Grouping by similar values in multiple columns - sql

I have a table of entities with an id, and a category (few different values with NULL allowed) from 3 different years (category can be different from 1 year to another), in 'wide' table format:
| ID | CATEG_Y1 | CATEG_Y2 | CATEG_Y3 |
+-----+----------+----------+----------+
| 1 | NULL | B | C |
| 2 | A | A | C |
| 3 | B | A | NULL |
| 4 | A | C | B |
| ... | ... | ... | ... |
I would like to simply count the number of entities by category, grouped by category, independently for the year:
+-------+----+----+----+
| CATEG | Y1 | Y2 | Y3 |
+-------+----+----+----+
| A | 6 | 4 | 5 | <- 6 entities w/ categ_y1, 4 w/ categ_y2, 5 w/ categ_y3
| B | 3 | 1 | 10 |
| C | 8 | 4 | 5 |
| NULL | 3 | 3 | 3 |
+-------+----+----+----+
I guess I could do it by grouping values one column after the other and UNION ALL the results, but I was wondering if there was a more rapid & convenient way, and if it can be generalized if I have more columns/years to manage (e.g. 20-30 different values)

A bit clumsy, but probably someone has a better idea. Query first collects all diferent categories (the union-query in the from part), and then counts the occurences with dedicated subqueries in the select part. One could omit the union-part if there is a table already defining the available categories (I suppose categ_y1 is a foreign key to such a primary category table). Hope there are not to many typos:
select categories.cat,
(select count(categ_y1) from table ty1 where select categories.cat = categ_y1) as y1,
(select count(categ_y2) from table ty2 where select categories.cat = categ_y2) as y2,
(select count(categ_y3) from table ty3 where select categories.cat = categ_y3) as y3
from ( select categ_y1 as cat from table t1
union select categ_y2 as cat from table t2
union select categ_y3 as cat from table t3) categories

Use jsonb functions to transpose the data (from the question) to this format:
select categ, jsonb_object_agg(key, count) as jdata
from (
select value as categ, key, count(*)
from my_table t,
jsonb_each_text(to_jsonb(t)- 'id')
group by 1, 2
) s
group by 1
order by 1;
categ | jdata
-------+-----------------------------------------------
A | {"categ_y1": 2, "categ_y2": 2}
B | {"categ_y1": 1, "categ_y2": 1, "categ_y3": 1}
C | {"categ_y2": 1, "categ_y3": 2}
| {"categ_y1": 1, "categ_y3": 1}
(4 rows)
For a known (static) number of years you can easily unpack the jsonb column:
select categ, jdata->'categ_y1' as y1, jdata->'categ_y2' as y2, jdata->'categ_y3' as y3
from (
select categ, jsonb_object_agg(key, count) as jdata
from (
select value as categ, key, count(*)
from my_table t,
jsonb_each_text(to_jsonb(t)- 'id')
group by 1, 2
) s
group by 1
) s
order by 1;
categ | y1 | y2 | y3
-------+----+----+----
A | 2 | 2 |
B | 1 | 1 | 1
C | | 1 | 2
| 1 | | 1
(4 rows)
To get fully dynamic solution you can use the function create_jsonb_flat_view() described in Flatten aggregated key/value pairs from a JSONB field.

I would do this as using union all followed by aggregation:
select categ, sum(categ_y1) as y1, sum(categ_y2) as y2,
sum(categ_y3) as y3
from ((select categ_y1, 1 as categ_y1, 0 as categ_y2, 0 as categ_y3
from t
) union all
(select categ_y2, 0 as categ_y1, 1 as categ_y2, 0 as categ_y3
from t
) union all
(select categ_y3, 0 as categ_y1, 0 as categ_y2, 1 as categ_y3
from t
)
)
group by categ ;

Related

Spark sql query to find many to many mappings between two columns of the same table ordered by maximum overlapedness

I wanted to write a Spark sql query or pyspark code to extract many to many mappings between two columns of the same table ordered by maximum overlapedness.
For example:
SysA SysB
A Y
A Z
B Z
B Y
C W
Which means there is therefore a M:M relationship between the above two columns.
Is there a way to extract all M:M combinations ordered by maximum overlapedness i.e values which share a lot among each other should be at the top? and discarding the one-one mappings like C W
Z maps to both A and B
Y maps to both A and B
A maps to both Y and Z
B maps to both Y and Z
Therefore both A ,B AND X,Y have M:M relationships and C W Is 1:1. The order would be sorted by the count i.e 2 , in above example only mappings of two are there between A,B:X,Y hence both are 2.
Similar question:https://social.msdn.microsoft.com/Forums/en-US/fa496933-e85a-4dfe-98df-b6c29ad812f4/sql-to-find-manytomany-combinations-of-two-columns
AS you requested and simplified version of your similar MSDN quesiton identifying just M:M relationships and ordered.
The following approaches may be used on Spark SQL.
CREATE TABLE SampleData (
`SysA` VARCHAR(1),
`SysB` VARCHAR(1)
);
INSERT INTO SampleData
(`SysA`, `SysB`)
VALUES
('A', 'Y'),
('A', 'Z'),
('B', 'Z'),
('B', 'G'),
('B', 'Y'),
('C', 'W');
Query #1
For demo purposes i have used * instead of SysA,SysB in the final projection below
SELECT
*
FROM
(
SELECT
*,
(
SELECT
count(*)
FROM
SampleData s
WHERE s.SysA=sd.SysA
) SysA_SysB,
(
SELECT
count(*)
FROM
SampleData s
WHERE s.SysB=sd.SysB
) SysB_SysA
FROM
SampleData sd
) t
WHERE t.SysA_SysB > 1 AND t.SysB_SysA>1
ORDER BY t.SysA_SysB DESC, t.SysB_SysA DESC;
| SysA | SysB | SysA_SysB | SysB_SysA |
| ---- | ---- | --------- | --------- |
| B | Z | 3 | 2 |
| B | Y | 3 | 2 |
| A | Y | 2 | 2 |
| A | Z | 2 | 2 |
Query #2
NB. Cross Joins should be enabled in spark i.e. setting spark.sql.crossJoin.enabled as true in your spark conf
SELECT
s1.SysA,
s1.SysB
FROM
SampleData s1
CROSS JOIN
SampleData s2
GROUP BY
s1.SysA, s1.SysB
HAVING
SUM(
CASE WHEN s1.SysA = s2.SysA THEN 1 ELSE 0 END
) > 1 AND
SUM(
CASE WHEN s1.SysB = s2.SysB THEN 1 ELSE 0 END
) > 1
ORDER BY
SUM(
CASE WHEN s1.SysA = s2.SysA THEN 1 ELSE 0 END
) DESC,
SUM(
CASE WHEN s1.SysB = s2.SysB THEN 1 ELSE 0 END
) DESC;
| SysA | SysB |
| ---- | ---- |
| B | Z |
| B | Y |
| A | Z |
| A | Y |
Query #3 (Recommended)
WITH SampleDataOcc AS (
SELECT
SysA,
SysB,
COUNT(SysA) OVER (PARTITION BY SysA) as SysAOcc,
COUNT(SysB) OVER (PARTITION BY SysB) as SysBOcc
FROM
SampleData
)
SELECT
SysA,
SysB,
SysAOcc,
SysBOcc
FROM
SampleDataOcc t
WHERE
t.SysAOcc > 1 AND
t.SysBOcc>1
ORDER BY
t.SysAOcc DESC,
t.SysBOcc DESC;
| SysA | SysB | SysAOcc | SysBOcc |
| ---- | ---- | --------- | --------- |
| B | Z | 3 | 2 |
| B | Y | 3 | 2 |
| A | Y | 2 | 2 |
| A | Z | 2 | 2 |
View on DB Fiddle

Oracle SQL, how to select * having distinct columns

I want to have a query something like this (this doesn't work!)
select * from foo where rownum < 10 having distinct bar
Meaning I want to select all columns from ten random rows with distinct values in column bar. How to do this in Oracle?
Here is an example. I have the following data
| item | rate |
-------------------
| a | 50 |
| a | 12 |
| a | 26 |
| b | 12 |
| b | 15 |
| b | 45 |
| b | 10 |
| c | 5 |
| c | 15 |
And result would be for example
| item no | rate |
------------------
| a | 12 | --from (26 , 12 , 50)
| b | 45 | --from (12 ,15 , 45 , 10)
| c | 5 | --from (5 , 15)
Aways having distinct item no
SQL Fiddle
Oracle 11g R2 Schema Setup:
Generate a table with 12 items A - L each with rates 0 - 4:
CREATE TABLE items ( item, rate ) AS
SELECT CHR( 64 + CEIL( LEVEL / 5 ) ),
MOD( LEVEL - 1, 5 )
FROM DUAL
CONNECT BY LEVEL <= 60;
Query 1:
SELECT item,
rate
FROM (
SELECT i.*,
-- Give the rates for each item a unique index assigned in a random order
ROW_NUMBER() OVER ( PARTITION BY item ORDER BY DBMS_RANDOM.VALUE ) AS rn
FROM items i
ORDER BY DBMS_RANDOM.VALUE -- Order all the rows randomly
)
WHERE rn = 1 -- Only get the first row for each item
AND ROWNUM <= 10 -- Only get the first 10 items.
Results:
| ITEM | RATE |
|------|------|
| A | 0 |
| K | 2 |
| G | 4 |
| C | 1 |
| E | 0 |
| H | 0 |
| F | 2 |
| D | 3 |
| L | 4 |
| I | 1 |
I mention table create and query for distinct and top 10 rows;
(Ref SqlFiddle)
create table foo(item varchar(20), rate int);
insert into foo values('a',50);
insert into foo values('a',12);
insert into foo values('a',26);
insert into foo values('b',12);
insert into foo values('b',15);
insert into foo values('b',45);
insert into foo values('b',10);
insert into foo values('c',5);
insert into foo values('c',15);
--Here first get the distinct item and then filter row number wise rows:
select item, rate from (
select item, rate, ROW_NUMBER() over(PARTITION BY item ORDER BY rate desc)
row_num from foo
) where row_num=1;

matching groups of rows in two databases

I have the following (simplified) situation in two databases:
ID Prog T Qt
|---------|--------|---------|---------|
| a | 1 | N | 100 |
| b | 1 | Y | 10 |
| b | 2 | N | 90 |
| c | 1 | N | 25 |
| c | 2 | Y | 25 |
| c | 3 | Y | 25 |
| c | 4 | Y | 25 |
|---------|--------|---------|---------|
ID Prog T Qt
|---------|--------|---------|---------|
| 1 | 1 | Y | 10 |
| 1 | 2 | N | 90 |
| 2 | 1 | Y | 100 |
| 3 | 1 | Y | 100 |
| 4 | 1 | Y | 50 |
| 4 | 2 | Y | 25 |
| 4 | 3 | Y | 25 |
|---------|--------|---------|---------|
I need to compare groups of rows (primary keys are ID and Prog), to find out which groups of rows represent the same combination of factors (not considering ID).
In the example above, ID "b" in the first table and ID "1" in the second have the same combination of values for Prog, T and Qt, while no one else can be considered exactly the same between the 2 dbs (while ID "2" and "3" in the second table are equal, I'm not interested in comparing in the same db).
I hope I explained everything.
A join and aggregation should work for this purpose:
select t1.id, t2.id
from (select t1.*, count(*) over (partition by id) as cnt
from t1
) t1 join
(select t2.*, count(*) over (partition by id) as cnt
from t2
) t2
on t1.prog = t2.prog and t1.T = t2.T and t1.Qt = t2.Qt and t1.cnt = t2.cnt
group by t1.id, t2.id, t1.cnt
having count(*) = t1.cnt;
This is a little tricky. The subqueries count the number of rows for each id in each table. The on clause gets matches between the three columns -- and checks that the ids have the same count. The group by and having then get rows where number of matching rows is the total number of rows.
Join the two tables on the conditions you want to match. The results will be the values that match between them.
CREATE TABLE a (ID CHAR(1), Prog INT, T CHAR(1), Qt INT);
CREATE TABLE b (ID int, Prog INT, T CHAR(1), Qt INT);
INSERT INTO dbo.a
( ID ,Prog ,T ,Qt)
VALUES ('a',1,'N',100), ('b',1,'Y',10), ('b',2,'N',90),('c',1,'N',25),('c',2,'Y',25),('c',3,'Y',25),('c',4,'Y',25)
INSERT INTO dbo.b
( ID ,Prog ,T ,Qt)
VALUES (1,1,'Y',10),(1,2,'N',90),(2,1,'Y',100),(3,1,'Y',100),(4,1,'Y',50),(4,2,'Y',25),(4,3,'Y',25)
WITH CTEa
AS (SELECT ID,
Prog,
T,
Qt,
Cnt = COUNT(ID) OVER (PARTITION BY ID)
FROM dbo.a
),
CTEb
AS (SELECT ID,
Prog,
T,
Qt,
Cnt = COUNT(ID) OVER (PARTITION BY ID)
FROM dbo.b
)
SELECT ID_A = a.ID,
ID_B = b.ID,
b.Prog,
b.T,
b.Qt,
b.Cnt
FROM CTEa AS a
INNER JOIN CTEb AS b
ON a.Prog = b.Prog
AND a.T = b.T
AND a.Qt = b.Qt
AND a.Cnt = b.Cnt;
Results:
ID_A ID_B Prog T Qt Cnt
b 1 1 Y 10 2
b 1 2 N 90 2

Row-wise maximum in T-SQL [duplicate]

This question already has answers here:
SQL MAX of multiple columns?
(24 answers)
Closed 7 years ago.
I've got a table with a few columns, and for each row I want the maximum:
-- Table:
+----+----+----+----+----+
| ID | C1 | C2 | C3 | C4 |
+----+----+----+----+----+
| 1 | 1 | 2 | 3 | 4 |
| 2 | 11 | 10 | 11 | 9 |
| 3 | 3 | 1 | 4 | 1 |
| 4 | 0 | 2 | 1 | 0 |
| 5 | 2 | 7 | 1 | 8 |
+----+----+----+----+----+
-- Desired result:
+----+---------+
| ID | row_max |
+----+---------+
| 1 | 4 |
| 2 | 11 |
| 3 | 4 |
| 4 | 2 |
| 5 | 8 |
+----+---------+
With two or three columns, I'd just write it out in iif or a CASE statement.
select ID
, iif(C1 > C2, C1, C2) row_max
from table
But with more columns this gets cumbersome fast. Is there a nice way to get this row-wise maximum? In R, this is called a "parallel maximum", so I'd love something like
select ID
, pmax(C1, C2, C3, C4) row_max
from table
What about unpivoting the data to get the result? You've said tsql but not what version of SQL Server. In SQL Server 2005+ you can use CROSS APPLY to convert the columns into rows, then get the max value for each row:
select id, row_max = max(val)
from yourtable
cross apply
(
select c1 union all
select c2 union all
select c3 union all
select c4
) c (val)
group by id
See SQL Fiddle with Demo. Note, this could be abbreviated by using a table value constructor.
This could also be accomplished via the UNPIVOT function in SQL Server:
select id, row_max = max(val)
from yourtable
unpivot
(
val
for col in (C1, C2, C3, C4)
) piv
group by id
See SQL Fiddle with Demo. Both versions gives a result:
| id | row_max |
|----|---------|
| 1 | 4 |
| 2 | 11 |
| 3 | 4 |
| 4 | 2 |
| 5 | 8 |
You can use the following query:
SELECT id, (SELECT MAX(c)
FROM (
SELECT c = C1
UNION ALL
SELECT c = C2
UNION ALL
SELECT c = C3
UNION ALL
SELECT c = C4
) as x(c)) maxC
FROM mytable
SQL Fiddle Demo
One method uses cross apply:
select t.id, m.maxval
from table t cross apply
(select max(val) as maxval
from (values (c1), (c2), (c3), (c4)) v(val)
) m

PostgreSQL distinct rows joined with a count of distinct values in one column

I'm using PostgreSQL 9.4, and I have a table with 13 million rows and with data roughly as follows:
a | b | u | t
-----+---+----+----
foo | 1 | 1 | 10
foo | 1 | 2 | 11
foo | 1 | 2 | 11
foo | 2 | 4 | 1
foo | 3 | 5 | 2
bar | 1 | 6 | 2
bar | 2 | 7 | 2
bar | 2 | 8 | 3
bar | 3 | 9 | 4
bar | 4 | 10 | 5
bar | 5 | 11 | 6
baz | 1 | 12 | 1
baz | 1 | 13 | 2
baz | 1 | 13 | 2
baz | 1 | 13 | 3
There are indices on md5(a), on b, and on (md5(a), b). (In reality, a may contain values longer than 4k chars.) There is also a primary key column of type SERIAL which I have omitted above.
I'm trying to build a query which will return the following results:
a | b | u | t | z
-----+---+----+----+---
foo | 1 | 1 | 10 | 3
foo | 1 | 2 | 11 | 3
foo | 2 | 4 | 1 | 3
foo | 3 | 5 | 2 | 3
bar | 1 | 6 | 2 | 5
bar | 2 | 7 | 2 | 5
bar | 2 | 8 | 3 | 5
bar | 3 | 9 | 4 | 5
bar | 4 | 10 | 5 | 5
bar | 5 | 11 | 6 | 5
In these results, all rows are deduplicated as if GROUP BY a, b, u, t were applied, z is a count of distinct values of b for every partition over a, and only rows with a z value greater than 2 are included.
I can get just the z filter working as follows:
SELECT a, COUNT(b) AS z from (SELECT DISTINCT a, b FROM t) AS foo GROUP BY a
HAVING COUNT(b) > 2;
However, I'm stumped on combining this with the rest of the data in the table.
What's the most efficient way to do this?
Your first step can be simpler already:
SELECT md5(a) AS md5_a, count(DISTINCT b) AS z
FROM t
GROUP BY 1
HAVING count(DISTINCT b) > 2;
Working with md5(a) in place of a, since a can obviously be very long, and you already have an index on md5(a) etc.
Since your table is big, you need an efficient query. This should be among the fastest possible solutions - with adequate index support. Your index on (md5(a), b) is instrumental but - assuming b, u, and t are small columns - an index on (md5(a), b, u, t) would be even better for the second step of the query (the lateral join).
Your desired end result:
SELECT DISTINCT ON (md5(t.a), b, u, t)
t.a, t.b, t.u, t.t, a.z
FROM (
SELECT md5(a) AS md5_a, count(DISTINCT b) AS z
FROM t
GROUP BY 1
HAVING count(DISTINCT b) > 2
) a
JOIN t ON md5(t.a) = md5_a
ORDER BY 1, 2, 3, 4; -- optional
Or probably faster, yet:
SELECT a, b, u, t, z
FROM (
SELECT DISTINCT ON (1, 2, 3, 4)
md5(t.a) AS md5_a, t.b, t.u, t.t, t.a
FROM t
) t
JOIN (
SELECT md5(a) AS md5_a, count(DISTINCT b) AS z
FROM t
GROUP BY 1
HAVING count(DISTINCT b) > 2
) z USING (md5_a)
ORDER BY 1, 2, 3, 4; -- optional
Detailed explanation for DISTINCT ON:
Select first row in each GROUP BY group?