How to compare each row against each other and get the best result? - sql

Suppose I have a table of values and categories:
+--+-----+---+
|ID|value|cat|
+--+-----+---+
|0 |1 |0 |
+--+-----+---+
|1 |3 |0 |
+--+-----+---+
|2 |2 |1 |
+--+-----+---+
|3 |1.2 |1 |
+--+-----+---+
|4 |1 |1 |
+--+-----+---+
And I want to know, for each row, the ID of the row which matches the value most closely and belongs to the same category, and I also want to know the difference.
So for row ID=0 the correct answer would be ID=1, and the difference value would be 2. The correct output would be this:
+--+----------+----------+
|ID|difference|best match|
+--+----------+----------+
|0 |2 |1 |
+--+----------+----------+
|1 |2 |0 |
+--+----------+----------+
|2 |0.8 |3 |
+--+----------+----------+
|3 |0.2 |4 |
+--+----------+----------+
|4 |0.2 |3 |
+--+----------+----------+
I'm just learning about CROSS JOIN and while I'm sure this can be done I don't really know where to start.

You can do this with a self-join, using the ROW_NUMBER() function in conjunction with MIN():
;WITH cte AS (SELECT a.ID aID
,MIN(ABS(a.value - b.value)) diff
,ROW_NUMBER() OVER(PARTITION BY a.ID ORDER BY MIN(ABS(a.value - b.value)))RN
,b.ID bID
FROM Table1 a
JOIN Table1 b
ON a.cat = b.cat
AND a.ID <> b.ID
GROUP BY a.ID,b.ID)
SELECT aID
,diff
,bID Best_Match
FROM cte
WHERE RN = 1
If you want to return multiple rows in case of a tie, use RANK() instead of ROW_NUMBER().
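As a rough, runnable sketch of the same idea, the snippet below loads the sample rows into an in-memory SQLite database from Python. SQLite stands in for SQL Server here, which is an assumption, and so is the tie-break on b.ID. The GROUP BY and MIN() are dropped because each (a, b) pair already appears only once in the self-join. It needs an SQLite build with window functions (3.25 or later).

```python
import sqlite3

# In-memory stand-in for Table1 from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Table1 (ID INTEGER PRIMARY KEY, value REAL, cat INTEGER)")
conn.executemany(
    "INSERT INTO Table1 VALUES (?, ?, ?)",
    [(0, 1, 0), (1, 3, 0), (2, 2, 1), (3, 1.2, 1), (4, 1, 1)],
)

query = """
WITH pairs AS (
    SELECT a.ID AS aID,
           b.ID AS bID,
           ABS(a.value - b.value) AS diff,
           -- rank every same-category partner of a by distance;
           -- b.ID is an assumed tie-breaker, not part of the original answer
           ROW_NUMBER() OVER (
               PARTITION BY a.ID
               ORDER BY ABS(a.value - b.value), b.ID
           ) AS rn
    FROM Table1 a
    JOIN Table1 b
      ON a.cat = b.cat
     AND a.ID <> b.ID
)
SELECT aID, diff, bID
FROM pairs
WHERE rn = 1
ORDER BY aID
"""

# Round the differences so floating-point noise (1.2 - 1) does not obscure the result.
closest = [(a, round(d, 6), b) for a, d, b in conn.execute(query)]
for row in closest:
    print(row)
```

Each printed tuple is (ID, difference, best match), in the same column order as the expected output in the question.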

Related

In SQL, query a table by transposing column results

Background
Forgive the title of this question, as I'm not really sure how to describe what I'm trying to do.
I have a SQL table, d, that looks like this:
+--+---+------------+------------+
|id|sex|event_type_1|event_type_2|
+--+---+------------+------------+
|a |m |1 |1 |
|b |f |0 |1 |
|c |f |1 |0 |
|d |m |0 |1 |
+--+---+------------+------------+
The Problem
I'm trying to write a query that yields the following summary of counts of event_type_1 and event_type_2 cut (grouped?) by sex:
+-------------+-----+-----+
| | m | f |
+-------------+-----+-----+
|event_type_1 | 1 | 1 |
+-------------+-----+-----+
|event_type_2 | 2 | 1 |
+-------------+-----+-----+
The thing is, this seems to involve some kind of transposition of the 2 event_type columns into rows of the query result that I'm not familiar with as a novice SQL user.
What I've tried
I've so far come up with the following query:
SELECT event_type_1, event_type_2, count(sex)
FROM d
group by event_type_1, event_type_2
But that only gives me this:
+------------+------------+-----+
|event_type_1|event_type_2|count|
+------------+------------+-----+
|1 |1 |1 |
|1 |0 |1 |
|0 |1 |2 |
+------------+------------+-----+
You can use a lateral join to unpivot the data. Then use conditional aggregation to calculate m and f:
select v.which,
count(*) filter (where d.sex = 'm') as m,
count(*) filter (where d.sex = 'f') as f
from d cross join lateral
(values (d.event_type_1, 'event_type_1'),
(d.event_type_2, 'event_type_2')
) v(val, which)
where v.val = 1
group by v.which;
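The query above relies on Postgres features (LATERAL and FILTER). As a hedged, portable sketch of the same two steps, the snippet below unpivots with UNION ALL and counts with SUM(CASE ...) against an in-memory SQLite database. The SQLite setup and the alias unpivoted are assumptions made for illustration.

```python
import sqlite3

# In-memory copy of table d from the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE d (id TEXT, sex TEXT, event_type_1 INTEGER, event_type_2 INTEGER)"
)
conn.executemany(
    "INSERT INTO d VALUES (?, ?, ?, ?)",
    [("a", "m", 1, 1), ("b", "f", 0, 1), ("c", "f", 1, 0), ("d", "m", 0, 1)],
)

query = """
SELECT which,
       SUM(CASE WHEN sex = 'm' THEN 1 ELSE 0 END) AS m,
       SUM(CASE WHEN sex = 'f' THEN 1 ELSE 0 END) AS f
FROM (
    -- unpivot: one output row per (person, event column)
    SELECT sex, event_type_1 AS val, 'event_type_1' AS which FROM d
    UNION ALL
    SELECT sex, event_type_2 AS val, 'event_type_2' AS which FROM d
) AS unpivoted
WHERE val = 1
GROUP BY which
ORDER BY which
"""

summary = conn.execute(query).fetchall()
for row in summary:
    print(row)
```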

Assign Rank to Row based on Alphabetical Order Using Window Functions in PySpark

I'm trying to assign a rank to the rows of a dataframe using a window function over a string column (user_id), based on alphabetical order. So, for example:
user_id | rank_num
-------------------
A |1
A |1
A |1
B |2
A |1
B |2
C |3
B |2
B |2
C |3
I tried using the following lines of code:
user_window = Window().partitionBy('user_id').orderBy('user_id')
data = (data
.withColumn('profile_row_num', dense_rank().over(user_window))
)
But I'm getting something like:
user_id | rank_num
-------------------
A |1
A |1
A |1
B |1
A |1
B |1
C |1
B |1
B |1
C |1
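Partitioning by user_id is the problem. Each distinct user_id lands in its own partition, and dense_rank() restarts at 1 within every partition, so every row gets a rank of 1. Order the window by user_id without partitioning and it should do what you want:
from pyspark.sql import Window
from pyspark.sql.functions import dense_rank
user_window = Window.orderBy('user_id')
data = data.withColumn('profile_row_num', dense_rank().over(user_window))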
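To see the effect of the partition clause without a Spark session, here is a small sketch that runs the equivalent SQL window functions in an in-memory SQLite database. Using SQLite instead of PySpark is an assumption made purely so the example is self-contained; the windowing semantics being illustrated are the same.

```python
import sqlite3

# The ten user_id values from the question, in their original order.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (user_id TEXT)")
conn.executemany("INSERT INTO data VALUES (?)", [(u,) for u in "AAABABCBBC"])

# Partitioned by user_id: the rank restarts inside every partition.
partitioned = conn.execute(
    """
    SELECT DISTINCT user_id,
           DENSE_RANK() OVER (PARTITION BY user_id ORDER BY user_id) AS rank_num
    FROM data
    ORDER BY user_id
    """
).fetchall()

# One global window ordered by user_id: ranks follow alphabetical order.
unpartitioned = conn.execute(
    """
    SELECT DISTINCT user_id,
           DENSE_RANK() OVER (ORDER BY user_id) AS rank_num
    FROM data
    ORDER BY user_id
    """
).fetchall()

print(partitioned)
print(unpartitioned)
```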

SQL COUNT ignoring a column

I have a question about a SQL query.
I have the following result from a query:
select distinct eb.event_type_id, eb.status from eid.event_backlog eb order by 1
|event_type_id|status |
|-------------|----------|
|1 |SUCCESS |
|2 |SUCCESS |
|2 |ERROR |
|3 |SUCCESS |
|3 |ERROR |
|4 |SUCCESS |
I would like to obtain this result, counting the distinct statuses per event type:
|event_type_id|count |
|-------------|-------|
|1 |1 |
|2 |2 |
|3 |2 |
|4 |1 |
but the only way that I see to obtain this result is doing the following query:
select
eb.event_type_id,
count(1)
from
(
select
distinct eb.event_type_id, eb.status
from
eid.event_backlog eb
order by
1) eb
group by
eb.event_type_id
I would rather not use a nested query. Is there another way to get what I want?
Simply use count(distinct eb.status), i.e.:
select
eb.event_type_id,
count(distinct eb.status)
from eid.event_backlog eb
group by
eb.event_type_id
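A quick way to convince yourself that the two forms agree is to run both against the same data. The sketch below does that in an in-memory SQLite database. The table is created without the eid schema prefix, and the duplicate rows are made-up sample data; both are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE event_backlog (event_type_id INTEGER, status TEXT)")
# Hypothetical raw rows; duplicates are included so DISTINCT has work to do.
conn.executemany(
    "INSERT INTO event_backlog VALUES (?, ?)",
    [
        (1, "SUCCESS"), (1, "SUCCESS"),
        (2, "SUCCESS"), (2, "ERROR"), (2, "ERROR"),
        (3, "SUCCESS"), (3, "ERROR"),
        (4, "SUCCESS"), (4, "SUCCESS"),
    ],
)

# Nested form from the question: deduplicate first, then count rows.
nested = conn.execute(
    """
    SELECT eb.event_type_id, COUNT(1)
    FROM (SELECT DISTINCT event_type_id, status FROM event_backlog) eb
    GROUP BY eb.event_type_id
    ORDER BY eb.event_type_id
    """
).fetchall()

# Flat form from the answer: count distinct statuses per group.
flat = conn.execute(
    """
    SELECT eb.event_type_id, COUNT(DISTINCT eb.status)
    FROM event_backlog eb
    GROUP BY eb.event_type_id
    ORDER BY eb.event_type_id
    """
).fetchall()

print(nested)
print(flat)
```

One difference worth knowing: COUNT(DISTINCT status) ignores NULL statuses, while the nested form counts a NULL status as one more row.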

Query Results For Consecutive Months In Column Grouped By Value

The following is sample data:
Name | Hours | RDate | Company |
------------------------------------
A |0 |2014-08-01 |W
A |0 |2014-07-01 |W
A |0 |2014-06-01 |W
A |0 |2014-05-01 |W
B |0 |2014-08-01 |X
C |0 |2014-07-01 |Y
C |0 |2014-06-01 |Y
D |0 |2014-08-01 |V
D |0 |2014-07-01 |Z
The following are the results I desire:
Name | Hours | RDate | Company |
------------------------------------
A |0 |2014-08-01 |W
A |0 |2014-07-01 |W
A |0 |2014-06-01 |W
A |0 |2014-05-01 |W
C |0 |2014-07-01 |Y
C |0 |2014-06-01 |Y
So the question is:
How do I get only the rows whose RDate values fall in consecutive months (e.g. 2014-08-01 and 2014-07-01; 2014-08-01 and 2014-06-01 would not qualify) for the same name and the same company?
I think this is a variation of the Grouping Islands of Contiguous Dates problem.
;WITH Cte AS(
SELECT *,
RN = DATEADD(MONTH, - ROW_NUMBER() OVER (PARTITION BY Name, Company ORDER BY RDate), RDate)
FROM Test
)
,CteCount AS(
SELECT *,
CC = COUNT(*) OVER(PARTITION BY Name, Company, RN)
FROM Cte
)
SELECT
Name, Hours, RDate, Company
FROM CteCount
WHERE CC > 1
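The trick in the query above is that subtracting the row number (in months) from each date gives the same anchor date for every row in a run of consecutive months. Below is a minimal sketch of that technique against an in-memory SQLite database, with SQLite's date() modifier standing in for DATEADD. The CTE names numbered, anchored and counted are made up for this example, and it needs SQLite 3.25 or later for window functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Test (Name TEXT, Hours INTEGER, RDate TEXT, Company TEXT)")
conn.executemany(
    "INSERT INTO Test VALUES (?, ?, ?, ?)",
    [
        ("A", 0, "2014-08-01", "W"),
        ("A", 0, "2014-07-01", "W"),
        ("A", 0, "2014-06-01", "W"),
        ("A", 0, "2014-05-01", "W"),
        ("B", 0, "2014-08-01", "X"),
        ("C", 0, "2014-07-01", "Y"),
        ("C", 0, "2014-06-01", "Y"),
        ("D", 0, "2014-08-01", "V"),
        ("D", 0, "2014-07-01", "Z"),
    ],
)

query = """
WITH numbered AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY Name, Company ORDER BY RDate) AS rn
    FROM Test
),
anchored AS (
    -- consecutive months collapse to the same anchor date
    SELECT *, date(RDate, '-' || rn || ' months') AS anchor
    FROM numbered
),
counted AS (
    SELECT *, COUNT(*) OVER (PARTITION BY Name, Company, anchor) AS cc
    FROM anchored
)
SELECT Name, Hours, RDate, Company
FROM counted
WHERE cc > 1
ORDER BY Name, RDate DESC
"""

islands = conn.execute(query).fetchall()
for row in islands:
    print(row)
```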
Although @wewesthemenace's answer is far more efficient, I kept working on my own solution and got it to work. I'm leaving the previously accepted answer marked because it is much better, but this works as well:
-- keep a row if the same Name/Company also has a row one month earlier or later
SELECT DISTINCT
one.*
FROM
foo one
INNER JOIN
foo two
ON
(one.Name = two.Name and one.Company = two.Company)
WHERE
ABS(DATEDIFF(MONTH, one.RDate, two.RDate)) = 1
ORDER BY
one.Name
,one.RDate DESC

Complicated min/max multi-table query

I need to get the min and max score of group ids, but only if they are enabled:
cdu_group_sl: cdu_group_cc: cdu_group_ph:
-------------------- -------------------- --------------------
|id |name |enabled | |id |name |enabled | |id |name |enabled |
-------------------- -------------------- --------------------
|1 |sl_1 |1 | |1 |cc_1 |1 | |1 |ph_1 |0 |
|2 |sl_3 |1 | |2 |cc_2 |0 | |2 |ph_2 |1 |
|3 |sl_4 |1 | |3 |cc_3 |1 | |3 |ph_3 |1 |
-------------------- -------------------- --------------------
Scores are found in a separate table:
cdu_user_progress
----------------------------------
|id |group_type |group_id |score |
----------------------------------
|1 |sl |1 |50 |
|1 |cc |1 |10 |
|1 |ph |1 |20 |
|1 |sl |2 |80 |
|1 |sl |3 |20 |
|1 |cc |3 |30 |
|1 |sl |1 |40 |
|1 |ph |1 |50 |
|1 |cc |1 |40 |
|1 |ph |2 |90 |
----------------------------------
I need to get a max and min score for each type of group for only enabled groups (for each type):
---------------------------------------------
|group_type |group_id |min_score |max_score |
---------------------------------------------
|sl |1 |40 |50 |
|sl |2 |80 |80 |
|sl |3 |20 |20 |
|cc |1 |10 |40 |
|cc |3 |30 |30 |
|ph |1 |20 |50 |
|ph |2 |90 |90 |
---------------------------------------------
Any idea what the query might be??? So far I have:
SELECT * FROM cdu_user_progress
JOIN cdu_group_sl ON (cdu_group_sl.id = cdu_user_progress.group_id AND cdu_user_progress.group_type = 'sl')
JOIN cdu_group_cc ON (cdu_group_cc.id = cdu_user_progress.group_id AND cdu_user_progress.group_type = 'cc')
JOIN cdu_group_ph ON (cdu_group_ph.id = cdu_user_progress.group_id AND cdu_user_progress.group_type = 'ph')
WHERE cdu_user_progress.uid = $student->uid
AND (cdu_user_progress.group_type = 'sl' AND cdu_group_sl.enabled = 1)
AND (cdu_user_progress.group_type = 'cc' AND cdu_group_cc.enabled = 1)
AND (cdu_user_progress.group_type = 'ph' AND cdu_group_ph.enabled = 1)
Probably completely wrong...
What about using a union to pick the groups you are interested in? Something like:
select scr.group_type, scr.group_id, min(score) min_score, max(score) max_score
from (
select id, 'sl' grp from cdu_group_sl where enabled = 1
union all
select id, 'cc' from cdu_group_cc where enabled = 1
union all
select id, 'ph' from cdu_group_ph where enabled = 1
) grps join cdu_user_progress scr
on grps.id = scr.group_id and grps.grp = scr.group_type
group by scr.group_type, scr.group_id
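Here is a small runnable sketch of the union approach, loading the question's sample tables into an in-memory SQLite database. SQLite is an assumed stand-in for whatever engine the asker uses. Note that ph group 1 is disabled in the sample data, so its scores drop out of the result, and ph group 3 is enabled but has no scores.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Three group tables with identical structure, as in the question.
groups = {
    "sl": [(1, "sl_1", 1), (2, "sl_3", 1), (3, "sl_4", 1)],
    "cc": [(1, "cc_1", 1), (2, "cc_2", 0), (3, "cc_3", 1)],
    "ph": [(1, "ph_1", 0), (2, "ph_2", 1), (3, "ph_3", 1)],
}
for suffix, rows in groups.items():
    conn.execute(f"CREATE TABLE cdu_group_{suffix} (id INTEGER, name TEXT, enabled INTEGER)")
    conn.executemany(f"INSERT INTO cdu_group_{suffix} VALUES (?, ?, ?)", rows)

conn.execute(
    "CREATE TABLE cdu_user_progress (id INTEGER, group_type TEXT, group_id INTEGER, score INTEGER)"
)
conn.executemany(
    "INSERT INTO cdu_user_progress VALUES (?, ?, ?, ?)",
    [
        (1, "sl", 1, 50), (1, "cc", 1, 10), (1, "ph", 1, 20), (1, "sl", 2, 80),
        (1, "sl", 3, 20), (1, "cc", 3, 30), (1, "sl", 1, 40), (1, "ph", 1, 50),
        (1, "cc", 1, 40), (1, "ph", 2, 90),
    ],
)

query = """
SELECT scr.group_type, scr.group_id,
       MIN(scr.score) AS min_score, MAX(scr.score) AS max_score
FROM (
    -- tag each enabled group with its type so one join covers all three tables
    SELECT id, 'sl' AS grp FROM cdu_group_sl WHERE enabled = 1
    UNION ALL
    SELECT id, 'cc' FROM cdu_group_cc WHERE enabled = 1
    UNION ALL
    SELECT id, 'ph' FROM cdu_group_ph WHERE enabled = 1
) AS grps
JOIN cdu_user_progress AS scr
  ON grps.id = scr.group_id AND grps.grp = scr.group_type
GROUP BY scr.group_type, scr.group_id
ORDER BY scr.group_type, scr.group_id
"""

scores = conn.execute(query).fetchall()
for row in scores:
    print(row)
```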
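The following is probably the fastest way to pick out the progress rows that belong to enabled groups; aggregate them by group_type and group_id to get the min and max scores. To optimize this, you should have an index on (id, enabled) on each of the three "sl", "cc", and "ph" tables: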
select cup.*
from cdu_user_progress cup
where (cup.group_type = 'sl' and
exists (select 1
from cdu_group_sl sl
where sl.id = cup.group_id and
sl.enabled = 1
)
) or
(cup.group_type = 'cc' and
exists (select 1
from cdu_group_cc cc
where cc.id = cup.group_id and
cc.enabled = 1
)
) or
(cup.group_type = 'ph' and
exists (select 1
from cdu_group_ph ph
where ph.id = cup.group_id and
ph.enabled = 1
)
)
As a note, having three tables with the same structure is usually a sign of a poor database schema. These three tables should probably be combined into a single table, which would make this query much easier to write.
If you are just starting up this project, I would recommend refining your data structure. Based on what you showed, you could benefit from only one cdu_groups table with a reference to a new cdu_group_types table, and removing the group_type column from cdu_user_progress.
If this is an established project where changing the structure would be too disruptive, then one of the other answers showing a query would be a better and easier fit.
Otherwise, you could simplify things with restructured tables and end up with a query like:
SELECT group_type,
group_id,
MIN(score) as min_score,
MAX(score) as max_score
FROM cdu_user_progress c
INNER JOIN cdu_groups g
ON c.group_id=g.id
INNER JOIN cdu_group_types t
ON g.group_type_id=t.id
WHERE enabled=1
GROUP BY group_type, group_id
This is shown, with expected results, in this SQLFiddle. With this structure you can add new group types whenever you want, and you also cut down on the number of tables and joins. The tables would be as follows (simplified here, with no foreign keys or other constraints):
CREATE TABLE cdu_user_progress
(id INT, group_id INT, score INT);
CREATE TABLE cdu_group_types
(id INT, group_type VARCHAR(3));
CREATE TABLE cdu_groups
(id INT, group_type_id INT, name VARCHAR(10), enabled BIT NOT NULL DEFAULT 1);
Granted, moving data to a new structure may be a pain or not reasonable, but I wanted to throw this out there as a possibility or just something to chew on.