Partially meeting conditions and sorting by match rank - sql

Given a table where very simplified data looks like the following (but it could include millions of rows with a lot more data in dozens of columns of different types):
+----+----+---+-----+
| ID | X | Y | Z |
+----+----+---+-----+
| 1 | 1 | 1 | "a" |
| 2 | 1 | 0 | "a" |
| 3 | 0 | 1 | "a" |
| 4 | 0 | 0 | "a" |
| 5 | 0 | 0 | "b" |
+----+----+---+-----+
What would be the approach to select data with full and possibly partial condition matching, down to a certain match rank, with the results sorted by that rank?
E.g. when the condition is WHERE ((X = 1) AND (Y = 1) AND (Z = "a")) how would it be possible to get the following results in the following order:
+----+----+---+-----+-------+
| ID | X | Y | Z | MATCH |
+----+----+---+-----+-------+
| 1 | 1 | 1 | "a" | 100% | <- 100% because all conditions matched
| 2 | 1 | 0 | "a" | 66% | <- 66% because X & Z matched but Y didn't
| 3 | 0 | 1 | "a" | 66% | <- 66% because Y & Z matched but X didn't
| 4 | 0 | 0 | "a" | 33% | <- 33% because Z matched but X & Y didn't
| 5 | 0 | 0 | "b" | 0% | <- 0% because nothing matched
+----+----+---+-----+-------+
Or being able to select up to a certain match rank, so with WHERE ((X = 1) AND (Y = 1) AND (Z = "a")) AND (MATCH >= 25) we'd only get the following:
+----+----+---+-----+-------+
| ID | X | Y | Z | MATCH |
+----+----+---+-----+-------+
| 1 | 1 | 1 | "a" | 100% |
| 2 | 1 | 0 | "a" | 66% |
| 3 | 0 | 1 | "a" | 66% |
| 4 | 0 | 0 | "a" | 33% |
+----+----+---+-----+-------+
Or with WHERE ((X = 1) AND (Y = 1) AND (Z = "a")) AND (MATCH >= 75) to get:
+----+----+---+-----+-------+
| ID | X | Y | Z | MATCH |
+----+----+---+-----+-------+
| 1 | 1 | 1 | "a" | 100% |
+----+----+---+-----+-------+
Since the table has tens of millions of rows, iterating over them wouldn't be feasible for scalability reasons (though other required conditions could be passed to narrow down the results).
The percentage values are for illustrative purposes only and aren't strictly required (the same applies for the looks of the MATCH >= XX% condition which would likely have to be represented differently).
I guess I'm looking for something like this:
SELECT *
FROM xyz
WHERE (X = 1 AND Y = 1 AND Z = "a")
   OR (X != 1 AND Y = 1 AND Z = "a")
   OR (X = 1 AND Y != 1 AND Z = "a")
   OR (X = 1 AND Y = 1 AND Z != "a")
   OR (X = 1 AND Y != 1 AND Z != "a")
   OR (X != 1 AND Y != 1 AND Z = "a")
   OR (X != 1 AND Y = 1 AND Z != "a")
   OR (X != 1 AND Y != 1 AND Z != "a")
But that of course wouldn't sort the results by match rank, nor allow specifying a minimum rank (other than programmatically generating the needed number of OR conditions, which is also an option).

This answers the original version of the question.
You can do the calculation in-line:
select t.*
from (select x, y,
             ((x = ?)::int + (y = ?)::int) / 2.0 as match
      from t
     ) t
where match = ?;
The ? are placeholders for your values.
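To sketch the idea end to end, here's the same inline calculation run through SQLite (chosen here only because a comparison already evaluates to 0/1 there; in Postgres you'd keep the ::int casts), extended to the question's three columns, threshold, and ordering:

```python
import sqlite3

# Illustrative sketch: table and sample data follow the question; the 33
# threshold and the integer percentage are examples, not a fixed API.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE xyz (id INTEGER, x INTEGER, y INTEGER, z TEXT)")
conn.executemany(
    "INSERT INTO xyz VALUES (?, ?, ?, ?)",
    [(1, 1, 1, "a"), (2, 1, 0, "a"), (3, 0, 1, "a"),
     (4, 0, 0, "a"), (5, 0, 0, "b")],
)
rows = conn.execute(
    """
    SELECT id, x, y, z, match_pct
    FROM (
        SELECT *,
               -- each comparison yields 0 or 1; integer division keeps it simple
               ((x = 1) + (y = 1) + (z = 'a')) * 100 / 3 AS match_pct
        FROM xyz
    ) AS ranked
    WHERE match_pct >= 33
    ORDER BY match_pct DESC, id
    """
).fetchall()
```

This reproduces the desired result: id 1 at 100, ids 2 and 3 at 66, id 4 at 33, and id 5 filtered out by the threshold.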

I can think of one way using JSONB to count the number of matches:
with vals (x, y, z) as (
    values (1, 1, 'a')
)
select d.*,
       (select count(*)
        from (
            select jsonb_build_object(k, v)
            from jsonb_each(to_jsonb(v)) as t1(k, v)
            intersect
            select jsonb_build_object(k, v)
            from jsonb_each(to_jsonb(d) - 'id') as t2(k, v)
        ) t
       ) as num_matches
from data d
    cross join vals v
where d.x = v.x
   or d.y = v.y
   or d.z = v.z
order by num_matches desc;
Not very pretty but at least the calculation of the number of matches is dynamic based on the number of columns of the "values" part.
returns:
id | x | y | z | num_matches
---+---+---+---+------------
1 | 1 | 1 | a | 3
2 | 1 | 0 | a | 2
3 | 0 | 1 | a | 2
4 | 0 | 0 | a | 1
If there are more columns that need to be ignored (not just id), you need to extend the to_jsonb(d) - 'id' to also remove the other columns - which makes this only partially "dynamic".
Doing this and calculating the percentage can all be put into a function:
create or replace function match_percent(p_values jsonb, p_row data)
    returns int
as
$$
    select ((count(*)::numeric / (select count(*) from jsonb_object_keys(p_values))) * 100)::int
    from (
        select jsonb_build_object(k, v)
        from jsonb_each(p_values) as t1(k, v)
        intersect
        select jsonb_build_object(k, v)
        from jsonb_each(to_jsonb(p_row)) as t2(k, v)
        where t2.k in (select jsonb_object_keys(p_values))
    ) x;
$$
language sql
stable;
Then the query can be simplified to:
with vals (x, y, z) as (
    values (1, 1, 'a')
)
select d.*,
       match_percent(to_jsonb(v), d)
from data d
    cross join vals v
where d.x = v.x
   or d.y = v.y
   or d.z = v.z
order by match_percent desc;
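Outside the database, the same "intersect the key/value pairs" idea can be sketched with plain Python dicts; match_percent here is an illustrative helper mirroring the SQL function, not part of any library:

```python
# Count how many requested key/value pairs a row shares, as a percentage.
def match_percent(wanted: dict, row: dict) -> int:
    hits = sum(1 for k, v in wanted.items() if row.get(k) == v)
    return round(hits * 100 / len(wanted))

rows = [
    {"id": 1, "x": 1, "y": 1, "z": "a"},
    {"id": 2, "x": 1, "y": 0, "z": "a"},
    {"id": 5, "x": 0, "y": 0, "z": "b"},
]
wanted = {"x": 1, "y": 1, "z": "a"}

# Sort by match rank, highest first, just like the ORDER BY above.
ranked = sorted(rows, key=lambda r: match_percent(wanted, r), reverse=True)
```

Because only the keys of `wanted` are inspected, extra columns such as `id` are ignored automatically, which is what the `- 'id'` / key-filter dance achieves in the JSONB version.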

Related

How to select unique rows (comparing by a few columns)?

I want to select unique rows from a table (without repeating the combination of 'f' and 'x' fields).
The table:
| f | x | z |
|---|---|---|
| 1 | 1 | a |
| 1 | 2 | b |
| 1 | 3 | c |
| 1 | 3 | d |
The result:
| f | x | z |
|---|---|---|
| 1 | 1 | a |
| 1 | 2 | b |
The following query groups rows in "the_table" by "f" and "x", selects the minimum value of "z" in each group and filters out groups with a count greater than 1, returning only unique combinations of "f" and "x".
SELECT f, x, MIN(z) AS z
FROM the_table
GROUP BY f, x
HAVING COUNT(*) = 1;
An alternative using a window function to count the repetitions:
WITH check_repetitions AS (
    SELECT *,
           COUNT(*) OVER (PARTITION BY f, x) AS repetitions
    FROM your_table
)
SELECT f, x, z
FROM check_repetitions
WHERE repetitions = 1
You can use the following query to select only rows where the combination of columns f and x do not repeat:
SELECT f, x, MIN(z) AS z
FROM table_name
GROUP BY f, x
HAVING COUNT(*) = 1
This query will group the rows based on the values of f and x, and then return only the rows where the combination of f and x occurs only once. The function MIN is used to select a single value for z for each group.
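For what it's worth, the GROUP BY / HAVING COUNT(*) = 1 query can be verified against the sample data using SQLite (table name the_table as in the first query):

```python
import sqlite3

# Reproduce the question's sample table and run the grouped query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE the_table (f INTEGER, x INTEGER, z TEXT)")
conn.executemany("INSERT INTO the_table VALUES (?, ?, ?)",
                 [(1, 1, "a"), (1, 2, "b"), (1, 3, "c"), (1, 3, "d")])
rows = conn.execute(
    """
    SELECT f, x, MIN(z) AS z
    FROM the_table
    GROUP BY f, x
    HAVING COUNT(*) = 1
    ORDER BY f, x
    """
).fetchall()
```

The duplicated (1, 3) group is dropped by the HAVING clause, leaving exactly the two expected rows.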

Spark SQL query to find many-to-many mappings between two columns of the same table, ordered by maximum overlap

I wanted to write a Spark SQL query or PySpark code to extract many-to-many mappings between two columns of the same table, ordered by maximum overlap.
For example:
SysA SysB
A Y
A Z
B Z
B Y
C W
Which means there is therefore a M:M relationship between the above two columns.
Is there a way to extract all M:M combinations ordered by maximum overlap (i.e. values which share the most mappings with each other should be at the top), discarding one-to-one mappings like C W?
Z maps to both A and B
Y maps to both A and B
A maps to both Y and Z
B maps to both Y and Z
Therefore both A,B and Y,Z have M:M relationships and C W is 1:1. The order would be sorted by the count, i.e. 2; in the above example only mappings of two exist between A,B and Y,Z, hence both are 2.
Similar question:https://social.msdn.microsoft.com/Forums/en-US/fa496933-e85a-4dfe-98df-b6c29ad812f4/sql-to-find-manytomany-combinations-of-two-columns
As you requested, a simplified version of your similar MSDN question, identifying just the M:M relationships and ordering them.
The following approaches may be used on Spark SQL.
CREATE TABLE SampleData (
`SysA` VARCHAR(1),
`SysB` VARCHAR(1)
);
INSERT INTO SampleData
(`SysA`, `SysB`)
VALUES
('A', 'Y'),
('A', 'Z'),
('B', 'Z'),
('B', 'G'),
('B', 'Y'),
('C', 'W');
Query #1
For demo purposes, I have used * instead of SysA, SysB in the final projection below.
SELECT
*
FROM
(
SELECT
*,
(
SELECT
count(*)
FROM
SampleData s
WHERE s.SysA=sd.SysA
) SysA_SysB,
(
SELECT
count(*)
FROM
SampleData s
WHERE s.SysB=sd.SysB
) SysB_SysA
FROM
SampleData sd
) t
WHERE t.SysA_SysB > 1 AND t.SysB_SysA>1
ORDER BY t.SysA_SysB DESC, t.SysB_SysA DESC;
| SysA | SysB | SysA_SysB | SysB_SysA |
| ---- | ---- | --------- | --------- |
| B | Z | 3 | 2 |
| B | Y | 3 | 2 |
| A | Y | 2 | 2 |
| A | Z | 2 | 2 |
Query #2
NB: cross joins must be enabled in Spark, i.e. set spark.sql.crossJoin.enabled to true in your Spark conf.
SELECT
s1.SysA,
s1.SysB
FROM
SampleData s1
CROSS JOIN
SampleData s2
GROUP BY
s1.SysA, s1.SysB
HAVING
SUM(
CASE WHEN s1.SysA = s2.SysA THEN 1 ELSE 0 END
) > 1 AND
SUM(
CASE WHEN s1.SysB = s2.SysB THEN 1 ELSE 0 END
) > 1
ORDER BY
SUM(
CASE WHEN s1.SysA = s2.SysA THEN 1 ELSE 0 END
) DESC,
SUM(
CASE WHEN s1.SysB = s2.SysB THEN 1 ELSE 0 END
) DESC;
| SysA | SysB |
| ---- | ---- |
| B | Z |
| B | Y |
| A | Z |
| A | Y |
Query #3 (Recommended)
WITH SampleDataOcc AS (
SELECT
SysA,
SysB,
COUNT(SysA) OVER (PARTITION BY SysA) as SysAOcc,
COUNT(SysB) OVER (PARTITION BY SysB) as SysBOcc
FROM
SampleData
)
SELECT
SysA,
SysB,
SysAOcc,
SysBOcc
FROM
SampleDataOcc t
WHERE
t.SysAOcc > 1 AND
t.SysBOcc>1
ORDER BY
t.SysAOcc DESC,
t.SysBOcc DESC;
| SysA | SysB | SysAOcc | SysBOcc |
| ---- | ---- | --------- | --------- |
| B | Z | 3 | 2 |
| B | Y | 3 | 2 |
| A | Y | 2 | 2 |
| A | Z | 2 | 2 |
View on DB Fiddle
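The window-function approach from Query #3 can be sanity-checked locally with SQLite (which also supports COUNT(...) OVER (PARTITION BY ...)); an explicit tie-break on SysA, SysB is added here to make the ordering deterministic:

```python
import sqlite3

# Sample data as in the INSERT above, including the extra (B, G) row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE SampleData (SysA TEXT, SysB TEXT)")
conn.executemany(
    "INSERT INTO SampleData VALUES (?, ?)",
    [("A", "Y"), ("A", "Z"), ("B", "Z"), ("B", "G"), ("B", "Y"), ("C", "W")],
)
rows = conn.execute(
    """
    WITH SampleDataOcc AS (
        SELECT SysA, SysB,
               COUNT(*) OVER (PARTITION BY SysA) AS SysAOcc,
               COUNT(*) OVER (PARTITION BY SysB) AS SysBOcc
        FROM SampleData
    )
    SELECT SysA, SysB, SysAOcc, SysBOcc
    FROM SampleDataOcc
    WHERE SysAOcc > 1 AND SysBOcc > 1
    ORDER BY SysAOcc DESC, SysBOcc DESC, SysA, SysB
    """
).fetchall()
```

The 1:1 rows (B, G) and (C, W) drop out because one side occurs only once, matching the result table above up to tie order.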

SQL Server - Better Solution for join between 2 tables pivoting rows into columns

Hi everyone, I am using SQL Server 2016. I have a table called support_event_log that looks like this:
| event_nr | data |
|--------------|-------------|
| 1 | x |
| 2 | x |
And a table called support_event_log_params that looks like this:
| event_nr | msg_param_nr | msg_param_value |
|-----------------|----------------|------------------|
| 1 | 1 | x |
| 2 | 1 | x |
| 2 | 2 | y |
| 2 | 3 | z |
I want to join both tables on the column event_nr and pivot the column msg_param_nr into 3 different columns depending on the number, with the value of the column msg_param_value, like this:
| event_nr | msg1 | msg2 | msg3 | data |
|----------|------|------|------|------|
| 1 | x | null | null | x |
| 2 | x | y | z | x |
I first tried the following query:
SELECT A.event_nr
,A.data
,CASE WHEN B.msg_param_nr = 1 THEN B.msg_param_value END AS msg1
,CASE WHEN B.msg_param_nr = 2 THEN B.msg_param_value END AS msg2
,CASE WHEN B.msg_param_nr = 3 THEN B.msg_param_value END AS msg3
FROM support_event_log A LEFT JOIN support_event_log_params B
on A.event_nr=B.event_nr
but I was getting the following result with repeated rows:
| event_nr | msg1 | msg2 | msg3 | data |
|----------|------|------|------|------|
| 1 | x | null | null | x |
| 2 | x | null | null | x |
| 2 | null | y | null | x |
| 2 | null | null | z | x |
Finally after a lot of thinking I got a working solution with the following query:
WITH col1 AS (
SELECT A.event_nr, A.msg_param_value
FROM support_event_log_params A
WHERE A.msg_param_nr=1
)
, col2 AS (
SELECT A.event_nr, A.msg_param_value
FROM support_event_log_params A
WHERE A.msg_param_nr=2
)
,col3 AS (
SELECT A.event_nr, A.msg_param_value
FROM support_event_log_params A
WHERE A.msg_param_nr=3
)
SELECT A.event_nr
,A.data
,B.msg_param_value as msg1
,C.msg_param_value as msg2
,D.msg_param_value as msg3
FROM support_event_log A
LEFT JOIN col1 B on A.event_nr=B.event_nr
LEFT JOIN col2 C on A.event_nr=C.event_nr
LEFT JOIN col3 D on A.event_nr=D.event_nr
but it seems very inefficient doing three CTEs against the same table. Is there a better solution to this problem? I can't seem to find one that works.
You just need aggregation on your first query:
SELECT el.event_nr, el.data,
MAX(CASE WHEN elp.msg_param_nr = 1 THEN elp.msg_param_value END) AS msg1,
MAX(CASE WHEN elp.msg_param_nr = 2 THEN elp.msg_param_value END) AS msg2,
MAX(CASE WHEN elp.msg_param_nr = 3 THEN elp.msg_param_value END) AS msg3
FROM support_event_log el LEFT JOIN
support_event_log_params elp
ON el.event_nr = elp.event_nr
GROUP BY el.event_nr, el.data;
Notice that I also changed the table aliases to be abbreviations for the table names, rather than meaningless letters such as A and B.
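Here's a quick reproduction of that conditional-aggregation query, using SQLite as a stand-in for SQL Server 2016 (the MAX(CASE ...) pattern itself is standard SQL):

```python
import sqlite3

# Recreate both tables from the question with their sample rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE support_event_log (event_nr INTEGER, data TEXT);
    CREATE TABLE support_event_log_params (
        event_nr INTEGER, msg_param_nr INTEGER, msg_param_value TEXT);
    INSERT INTO support_event_log VALUES (1, 'x'), (2, 'x');
    INSERT INTO support_event_log_params VALUES
        (1, 1, 'x'), (2, 1, 'x'), (2, 2, 'y'), (2, 3, 'z');
""")
rows = conn.execute(
    """
    SELECT el.event_nr, el.data,
           MAX(CASE WHEN elp.msg_param_nr = 1 THEN elp.msg_param_value END) AS msg1,
           MAX(CASE WHEN elp.msg_param_nr = 2 THEN elp.msg_param_value END) AS msg2,
           MAX(CASE WHEN elp.msg_param_nr = 3 THEN elp.msg_param_value END) AS msg3
    FROM support_event_log el
    LEFT JOIN support_event_log_params elp ON el.event_nr = elp.event_nr
    GROUP BY el.event_nr, el.data
    ORDER BY el.event_nr
    """
).fetchall()
```

The aggregation collapses the repeated rows from the plain LEFT JOIN into one row per event, with NULL where a parameter number is absent.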

SQL MIN across multiple columns with a GROUP BY

I'm trying to take a table with information as follows:
+----+---+---+
| ID | X | Y |
+----+---+---+
| A | 1 | 3 |
| A | 1 | 1 |
| A | 1 | 2 |
| A | 1 | 7 |
| B | 2 | 2 |
| B | 3 | 3 |
| B | 1 | 9 |
| B | 2 | 4 |
| B | 2 | 1 |
| C | 1 | 1 |
+----+---+---+
I'd like to be able to select the minimum across both columns, grouping by the first column - the "X" column is more important than the Y column. So for example, the query should return something like this:
+----+---+---+
| ID | X | Y |
+----+---+---+
| A | 1 | 1 |
| B | 1 | 9 |
| C | 1 | 1 |
+----+---+---+
Any ideas? I've gone through dozens of posts and experiments and no luck so far.
Thanks,
James
You seem to want the row that has the minimum x value. And, if there are duplicates on x, then take the one with the minimum y.
For this, use row_number():
select id, x, y
from (select t.*,
row_number() over (partition by id order by x, y) as seqnum
from t
) t
where seqnum = 1
If your database does not support window functions, you can still express this in SQL:
select t.id, t.x, min(t.y)
from t join
(select id, MIN(x) as minx
from t
group by id
) tmin
on t.id = tmin.id and t.x = tmin.minx
group by t.id, t.x
If your RDBMS supports window functions:
SELECT ID, X, Y
FROM
(
SELECT ID, X, Y,
ROW_NUMBER() OVER (PARTITION BY ID ORDER BY X, Y) rn
FROM tableName
) d
WHERE rn = 1
SQLFiddle Demo
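The ROW_NUMBER() version can be checked against the question's data in SQLite (which has supported window functions since 3.25):

```python
import sqlite3

# Sample data from the question: several rows per id, pick min x then min y.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id TEXT, x INTEGER, y INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?)", [
    ("A", 1, 3), ("A", 1, 1), ("A", 1, 2), ("A", 1, 7),
    ("B", 2, 2), ("B", 3, 3), ("B", 1, 9), ("B", 2, 4),
    ("B", 2, 1), ("C", 1, 1),
])
rows = conn.execute(
    """
    SELECT id, x, y
    FROM (
        SELECT t.*,
               ROW_NUMBER() OVER (PARTITION BY id ORDER BY x, y) AS seqnum
        FROM t
    ) AS ranked
    WHERE seqnum = 1
    ORDER BY id
    """
).fetchall()
```

Note that B yields (1, 9), not (1, 1): x takes priority, and y only breaks ties within the same x, which is exactly the asked-for behaviour.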

Self-join a table to merge multiple rows in to one row

I am very new to SQL and never did anything complex like this. Any help would be appreciated.
I have following data in the table:
ID - TAG
1 - U
1 - N
1 - U
1 - N
1 - U
My output needs to be
ID - U - N
1 - 3 - 2
Basically my output needs to count N's and U's for the ID and produce a single row for that ID.
Thanks for your time.
SELECT
ID,
COUNT(CASE WHEN TAG = 'U' THEN 1 END) AS [U],
COUNT(CASE WHEN TAG = 'N' THEN 1 END) AS [N]
FROM
someTable
GROUP BY
ID
UPDATE:
Another poster (see below) points out that I used square brackets where some RDBMSs require double quotes.
In MySQL this is simpler than you think :)
select id,
sum(if(tag = 'U', 1, 0)) as U,
sum(if(tag = 'N', 1, 0)) as N
from table1
group by id
Basically, for each id you get you create a new column (U and N) which will have a 1 or a 0 based on whether tag has value 'U' or 'N'. This results in something like this:
+----+---+---+
| ID | U | N |
+----+---+---+
| 1 | 1 | 0 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
+----+---+---+
Now, all we have to do is group by ID and sum all values from U and N, and we get:
+----+---+---+
| ID | U | N |
+----+---+---+
| 1 | 3 | 2 |
+----+---+---+
Hope this helps.
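Since IF() is MySQL-specific, here is the portable SUM(CASE ...) form of the same counting pivot, run through SQLite with the question's data:

```python
import sqlite3

# Five tag rows for id 1: three 'U' and two 'N'.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (id INTEGER, tag TEXT)")
conn.executemany("INSERT INTO table1 VALUES (?, ?)",
                 [(1, "U"), (1, "N"), (1, "U"), (1, "N"), (1, "U")])
rows = conn.execute(
    """
    SELECT id,
           SUM(CASE WHEN tag = 'U' THEN 1 ELSE 0 END) AS U,
           SUM(CASE WHEN tag = 'N' THEN 1 ELSE 0 END) AS N
    FROM table1
    GROUP BY id
    """
).fetchall()
```

One row per id comes back with the per-tag counts, matching the expected output table.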