Matching algorithm in SQL - sql

I have the following table in my database.
# select * FROM matches;
name | prop | rank
------+------+-------
carl | 1 | 4
carl | 2 | 3
carl | 3 | 9
alex | 1 | 8
alex | 2 | 5
alex | 3 | 6
alex | 3 | 8
alex | 2 | 11
anna | 3 | 8
anna | 3 | 13
anna | 2 | 14
(11 rows)
Each person is ranked at work by different properties/criteria called 'prop', and the performance is called 'rank'. The table contains multiple values of (name, prop), as the example shows. I want to get the best candidate given some requirements. E.g. I need a candidate that has (prop=1 AND rank > 5) and (prop=3 AND rank >= 8). Then we must be able to sort the candidates by their rankings to get the best candidate.
EDIT: Each person must fulfill ALL requirements
How can I do this in SQL?

select x.name, max(x.rank)
from matches x
join (
select name from matches where prop = 1 AND rank > 5
intersect
select name from matches where prop = 3 AND rank >= 8
) y
on x.name = y.name
group by x.name
order by max(rank);

Filtering the data to match your criteria here is quite simple (as shown by both Amir and sternze):
SELECT *
FROM matches
WHERE (prop=1 AND rank>5) OR (prop=3 AND rank>=8)
The problem is how to aggregate this data so as to have just one row per candidate.
I suggest you do something like this:
SELECT m.name,
MAX(DeltaRank1) AS MaxDeltaRank1,
MAX(DeltaRank3) AS MaxDeltaRank3
FROM (
SELECT name,
(CASE WHEN prop=1 THEN rank-5 END) AS DeltaRank1, -- NULL for other props
(CASE WHEN prop=3 THEN rank-8 END) AS DeltaRank3
FROM matches
) m
GROUP BY m.name
HAVING MAX(DeltaRank1)>0 AND MAX(DeltaRank3)>=0
ORDER BY MAX(DeltaRank1)+MAX(DeltaRank3) DESC;
This will order the candidates by the sum of how much they exceeded the target rank in prop1 and prop3. You could use different logic to indicate which is best though.
In the case above, this should be the result:
name | MaxDeltaRank1 | MaxDeltaRank3
------+---------------+--------------
alex | 3 | 0
... because neither anna nor carl reaches both of the required ranks.

A typical case of relational division. We assembled a whole arsenal of techniques under this related question:
How to filter SQL results in a has-many-through relation
Assuming you want the minimum rank of a person, I might solve your particular case with LEAST():
SELECT m1.name, LEAST(m1.rank, m2.rank, ...) AS best_rank
FROM matches m1
JOIN matches m2 USING (name)
...
WHERE m1.prop = 1 AND m1.rank > 5
AND m2.prop = 3 AND m2.rank >= 8
...
ORDER BY best_rank;
Also assuming name to be unique per individual person. You'd probably use some kind of foreign key to a pk column of a person table in reality.
And if you have such a person table like you should, the best rank would be stored in a column there ...
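For anyone who wants to try the relational-division idea without a database server, here is a minimal sketch using Python's built-in sqlite3 module. It is not taken from any of the answers above: it uses the count-based variant (filter on the requirements, then keep names that hit every distinct requirement), and the function name and the choice of MAX(rank) as the sort key are assumptions made only for illustration.

```python
import sqlite3

# Sample rows copied from the question: (name, prop, rank).
SAMPLE_ROWS = [
    ("carl", 1, 4), ("carl", 2, 3), ("carl", 3, 9),
    ("alex", 1, 8), ("alex", 2, 5), ("alex", 3, 6),
    ("alex", 3, 8), ("alex", 2, 11),
    ("anna", 3, 8), ("anna", 3, 13), ("anna", 2, 14),
]


def qualifying_candidates(rows):
    """Return (name, best_rank) for people who meet BOTH requirements."""
    con = sqlite3.connect(":memory:")
    # "rank" is quoted because newer SQLite versions also use it as a function name.
    con.execute('CREATE TABLE matches (name TEXT, prop INTEGER, "rank" INTEGER)')
    con.executemany("INSERT INTO matches VALUES (?, ?, ?)", rows)
    sql = """
        SELECT name, MAX("rank") AS best_rank
        FROM matches
        WHERE (prop = 1 AND "rank" > 5)
           OR (prop = 3 AND "rank" >= 8)
        GROUP BY name
        HAVING COUNT(DISTINCT prop) = 2   -- one hit per requirement
        ORDER BY best_rank DESC
    """
    result = con.execute(sql).fetchall()
    con.close()
    return result


print(qualifying_candidates(SAMPLE_ROWS))
```

With the sample data only alex satisfies both conditions. The number in the HAVING clause has to match the number of requirements, so it changes if you add a third condition.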

If I understand your question, then you just need to execute the following operation:
SELECT * FROM matches where (prop = 1 AND rank > 5) OR (prop = 3 AND rank >= 8) ORDER BY rank
It gives you the candidates that either have prop=1 and rank > 5 or prop=3 and rank >= 8, sorted by their rankings.

Related

Update statement to set a column based the maximum row of another table

I have a Family table:
SELECT * FROM Family;
id | Surname | Oldest | Oldest_Age
---+----------+--------+-------
1 | Byre | NULL | NULL
2 | Summers | NULL | NULL
3 | White | NULL | NULL
4 | Anders | NULL | NULL
The Family.Oldest column is not yet populated. There is another table of Children:
SELECT * FROM Children;
id | Name | Age | Family_FK
---+----------+------+--------
1 | Jake | 8 | 1
2 | Martin | 7 | 2
3 | Sarah | 10 | 1
4 | Tracy | 12 | 3
where many children (or no children) can be associated with one family. I would like to populate the Oldest column using an UPDATE ... SET ... statement that sets it to the Name and Oldest_Age of the oldest child in each family. Finding the name of each oldest child is a problem that is solved quite well here: How can I SELECT rows with MAX(Column value), DISTINCT by another column in SQL?
However, I don't know how to use the result of this in an UPDATE statement to update the column of an associated table using the h2 database.
The following is ANSI-SQL syntax that solves this problem:
update family
set oldest = (select name
from children c
where c.family_fk = family.id
order by age desc
fetch first 1 row only
)
In h2, I think you would use limit 1 instead of fetch first 1 row only.
EDIT:
For two columns -- alas -- the solution is two subqueries:
update family
set oldest = (select name
from children c
where c.family_fk = family.id
order by age desc
limit 1
),
oldest_age = (select age
from children c
where c.family_fk = family.id
order by age desc
limit 1
);
Some databases (such as SQL Server, Postgres, and Oracle) support lateral joins that can help with this. row_number() can also help solve this problem. Unfortunately, H2 doesn't support this functionality.
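The two-subquery UPDATE can be tried out quickly with Python's built-in sqlite3 module. This is only a sketch on the sample rows from the question: SQLite stands in for H2 here, which is an assumption, and the function name is made up for the example.

```python
import sqlite3


def populate_oldest():
    """Fill family.oldest / family.oldest_age from the children table."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE family (id INTEGER PRIMARY KEY, surname TEXT,
                             oldest TEXT, oldest_age INTEGER);
        CREATE TABLE children (id INTEGER PRIMARY KEY, name TEXT,
                               age INTEGER, family_fk INTEGER);
        INSERT INTO family (id, surname) VALUES
            (1, 'Byre'), (2, 'Summers'), (3, 'White'), (4, 'Anders');
        INSERT INTO children VALUES
            (1, 'Jake', 8, 1), (2, 'Martin', 7, 2),
            (3, 'Sarah', 10, 1), (4, 'Tracy', 12, 3);
    """)
    # One correlated subquery per target column; both pick the oldest child.
    con.execute("""
        UPDATE family
        SET oldest = (SELECT name FROM children c
                      WHERE c.family_fk = family.id
                      ORDER BY age DESC LIMIT 1),
            oldest_age = (SELECT age FROM children c
                          WHERE c.family_fk = family.id
                          ORDER BY age DESC LIMIT 1)
    """)
    rows = con.execute(
        "SELECT surname, oldest, oldest_age FROM family ORDER BY id"
    ).fetchall()
    con.close()
    return rows


for row in populate_oldest():
    print(row)
```

Families without children (Anders in the sample) keep NULL in both columns, because the subqueries return no row. If two children share the top age, the two subqueries could in principle pick different rows, so adding a tie-breaker such as the child id to both ORDER BY clauses is a reasonable precaution.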

Find spectators that have seen the same shows (match multiple rows for each)

For an assignment I have to write several SQL queries for a database stored on a server running PostgreSQL 9.3.0. However, I find myself stuck on the last query. The database models a reservation system for an opera house. The query is about associating a spectator with the other spectators who attend the same events every time.
The model looks like this:
Reservations table
id_res | create_date | tickets_presented | id_show | id_spectator | price | category
-------+---------------------+---------------------+---------+--------------+-------+----------
1 | 2015-08-05 17:45:03 | | 1 | 1 | 195 | 1
2 | 2014-03-15 14:51:08 | 2014-11-30 14:17:00 | 11 | 1 | 150 | 2
Spectators table
id_spectator | last_name | first_name | email | create_time | age
---------------+------------+------------+----------------------------------------+---------------------+-----
1 | gonzalez | colin | colin.gonzalez@gmail.com | 2014-03-15 14:21:30 | 22
2 | bequet | camille | bequet.camille@gmail.com | 2014-12-10 15:22:31 | 22
Shows table
id_show | name | kind | presentation_date | start_time | end_time | id_season | capacity_cat1 | capacity_cat2 | capacity_cat3 | price_cat1 | price_cat2 | price_cat3
---------+------------------------+--------+-------------------+------------+----------+-----------+---------------+---------------+---------------+------------+------------+------------
1 | madama butterfly | opera | 2015-09-05 | 19:30:00 | 21:30:00 | 2 | 315 | 630 | 945 | 195 | 150 | 100
2 | don giovanni | opera | 2015-09-12 | 19:30:00 | 21:45:00 | 2 | 315 | 630 | 945 | 195 | 150 | 100
So far I've started by writing a query to get the id of the spectator and the date of the show they are attending. The query looks like this:
SELECT Reservations.id_spectator, Shows.presentation_date
FROM Reservations
LEFT JOIN Shows ON Reservations.id_show = Shows.id_show;
Could someone help me better understand the problem and point me towards a solution? Thanks in advance.
So the result I'm expecting should be something like this:
id_spectator | other_id_spectators
-------------+--------------------
1| 2,3
Meaning that every time spectator with id 1 went to a show, spectators 2 and 3 did too.
Note based on comments: Wanted to make clear that this answer may be of limited use as it was answered in the context of SQL-Server (tag was present at the time)
There is probably a better way to do it, but you could do it with the STUFF function. The only drawback here is that, since your ids are ints, placing a comma between values requires a workaround (the values need to be cast to strings). Below is the method I can think of using that workaround.
SELECT [id_spectator], [id_show]
, STUFF((SELECT ',' + CAST(A.[id_spectator] as NVARCHAR(10))
FROM reservations A
Where A.[id_show]=B.[id_show] AND a.[id_spectator] != b.[id_spectator] FOR XML PATH('')),1,1,'') As [other_id_spectators]
From reservations B
Group By [id_spectator], [id_show]
This will show you all other spectators that attended the same shows.
Meaning that every time spectator with id 1 went to a show, spectators 2 and 3 did too.
In other words, you want a list of ...
all spectators that have seen all the shows that a given spectator has seen (and possibly more than the given one)
This is a special case of relational division. We have assembled an arsenal of basic techniques here:
How to filter SQL results in a has-many-through relation
It is special because the list of shows each spectator has to have attended is dynamically determined by the given prime spectator.
Assuming that (id_spectator, id_show) is unique in reservations, which has not been clarified.
A UNIQUE constraint on those two columns (in that order) also provides the most important index.
For best performance in queries 2 and 3 below, also create an index with leading id_show.
1. Brute force
The primitive approach would be to form a sorted array of shows the given user has seen and compare it with the same array for each other spectator:
SELECT 1 AS id_spectator, array_agg(sub.id_spectator) AS id_other_spectators
FROM (
SELECT id_spectator
FROM reservations r
WHERE id_spectator <> 1
GROUP BY 1
HAVING array_agg(id_show ORDER BY id_show)
@> (SELECT array_agg(id_show ORDER BY id_show)
FROM reservations
WHERE id_spectator = 1)
) sub;
But this is potentially very expensive for big tables. The whole table has to be processed, and in a rather expensive way, too.
2. Smarter
Use a CTE to determine relevant shows, then only consider those:
WITH shows AS ( -- all shows of id 1; 1 row per show
SELECT id_spectator, id_show
FROM reservations
WHERE id_spectator = 1 -- your prime spectator here
)
SELECT sub.id_spectator, array_agg(sub.other) AS id_other_spectators
FROM (
SELECT s.id_spectator, r.id_spectator AS other
FROM shows s
JOIN reservations r USING (id_show)
WHERE r.id_spectator <> s.id_spectator
GROUP BY 1,2
HAVING count(*) = (SELECT count(*) FROM shows)
) sub
GROUP BY 1;
@> is the "contains" operator for arrays, so we get all spectators that have seen at least the same shows.
Faster than 1. because only relevant shows are considered.
3. Real smart
To also exclude spectators that are not going to qualify early from the query, use a recursive CTE:
WITH RECURSIVE shows AS ( -- produces exactly 1 row
SELECT id_spectator, array_agg(id_show) AS shows, count(*) AS ct
FROM reservations
WHERE id_spectator = 1 -- your prime spectator here
GROUP BY 1
)
, cte AS (
SELECT r.id_spectator, 1 AS idx
FROM shows s
JOIN reservations r ON r.id_show = s.shows[1]
WHERE r.id_spectator <> s.id_spectator
UNION ALL
SELECT r.id_spectator, idx + 1
FROM cte c
JOIN reservations r USING (id_spectator)
JOIN shows s ON s.shows[c.idx + 1] = r.id_show
)
SELECT s.id_spectator, array_agg(c.id_spectator) AS id_other_spectators
FROM shows s
JOIN cte c ON c.idx = s.ct -- has an entry for every show
GROUP BY 1;
Note that the first CTE is non-recursive. Only the second part is recursive (iterative really).
This should be fastest for small selections from big tables. Rows that don't qualify are excluded early. The two indices I mentioned are essential.
SQL Fiddle demonstrating all three.
It sounds like you have one half of the total question--determining which id_shows a particular id_spectator attended.
What you want to ask yourself is how you can determine which id_spectators attended an id_show, given an id_show. Once you have that, combine the two answers to get the full result.
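Combining the two halves might look like the sketch below. It runs on SQLite through Python's built-in sqlite3 module rather than on PostgreSQL, so it uses plain counting instead of arrays. The reservation rows and the function name are hypothetical and only serve to illustrate the idea.

```python
import sqlite3

# Hypothetical reservations: (id_spectator, id_show).
RESERVATIONS = [
    (1, 1), (1, 11),
    (2, 1), (2, 11), (2, 5),
    (3, 1),
    (4, 11), (4, 1),
]


def companions(reservations, prime):
    """Spectators who attended every show that `prime` attended."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE reservations (id_spectator INTEGER, id_show INTEGER)"
    )
    con.executemany("INSERT INTO reservations VALUES (?, ?)", reservations)
    sql = """
        SELECT r.id_spectator
        FROM reservations r
        JOIN (SELECT DISTINCT id_show            -- half 1: shows of prime
              FROM reservations
              WHERE id_spectator = :prime) s ON s.id_show = r.id_show
        WHERE r.id_spectator <> :prime           -- half 2: who else was there
        GROUP BY r.id_spectator
        HAVING COUNT(DISTINCT r.id_show) =
               (SELECT COUNT(DISTINCT id_show)
                FROM reservations
                WHERE id_spectator = :prime)
        ORDER BY r.id_spectator
    """
    rows = con.execute(sql, {"prime": prime}).fetchall()
    con.close()
    return [r[0] for r in rows]


print(companions(RESERVATIONS, 1))
```

In the hypothetical data spectator 3 only attended one of the two shows of spectator 1, so only spectators 2 and 4 are returned. Spectator 2 also saw an extra show, which does not disqualify them.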
So the final answer I got, looks like this :
SELECT id_spectator, id_show,(
SELECT string_agg(to_char(A.id_spectator, '999'), ',')
FROM Reservations A
WHERE A.id_show=B.id_show
) AS other_id_spectators
FROM Reservations B
GROUP By id_spectator, id_show
ORDER BY id_spectator ASC;
Which prints something like this:
id_spectator | id_show | other_id_spectators
-------------+---------+---------------------
1 | 1 | 1, 2, 9
1 | 14 | 1, 2
Which suits my needs, however if you have any improvements to offer, please share :) Thanks again everybody!

How do I return rows in groups by certain values?

I want my query to return the rows of a table in groups where a column contains specific values. After I got the rows ordered in the groups I want to be able to order them by name.
Example Table
- Id - Name - Group
- 1 George Group_2_1
- 2 Alfred Group_2_2
- 3 Eric Group_3
- 4 Mary Group_1_2
- 5 Jon Group_1_1
I want them ordered by their group and after that ordered by their name
- Id - Name - Group
- 1 Jon Group_1_1
- 2 Mary Group_1_2
- 3 Alfred Group_2_2
- 4 George Group_2_1
- 5 Eric Group_3
I found this SQL-Query-Snippet
ORDER BY CASE WHEN Group LIKE '%Group_1%' THEN 1 ELSE 2 END, Group
but it is not enough. The result is only grouped by the first group (obviously) but I can't extend it to order the second group because it is in the same column.
Please don't get confused by the example.
I just want to be able to group certain rows and put them in front of the results. I want a result that has all rows containing group 1 in the top, containing group 2 in the middle and containing group 3 in the bottom.
The values are not "Group_1_1" or something like that. They are just some strings and I want certain strings to be always in the first row (group 1) and some always below group 1
The problem here seems to be that some of your group names have an extra underscore, otherwise you could just order by the Group and all would be good. You could probably do something like this to work around this?
WITH Data AS (
SELECT 'Group1_1' AS Value
UNION
SELECT 'Group_3_2' AS Value
UNION
SELECT 'Group_2_2' AS Value
UNION
SELECT 'Group_3_1' AS Value
)
SELECT * FROM Data ORDER BY CASE WHEN Value LIKE 'Group_%' THEN SUBSTRING(Value, 7, 10) ELSE SUBSTRING(Value, 6, 10) END;
Results:
Value
Group1_1
Group_2_2
Group_3_1
Group_3_2
---- EDIT ----
Okay, seeing as your example isn't really an "example" it sounds like you are going to need a really, REALLY long case statement. You could do something like this (using the original Group_1_1, Group_2_2 codes) that would extend to different values. The key is that a CASE statement works from left to right and a value is assigned to the first case that matches:
ORDER BY
CASE
WHEN [Group] = 'Group_1_1' THEN 1
WHEN [Group] = 'Group_1_2' THEN 2
WHEN [Group] LIKE 'Group_1_%' THEN 3
WHEN [Group] = 'Group_2_1' THEN 4
WHEN [Group] = 'Group_2_2' THEN 5
WHEN [Group] LIKE 'Group_2_%' THEN 6
etc.
END;
Obviously that's very generic and depends on what the actual values are in your database.
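If the CASE expression gets too long, one alternative (not proposed above, so treat it as an assumption) is to keep the desired order in a small lookup table and join to it. The sketch below uses Python's built-in sqlite3 module. The table and column names (people, group_priority, grp, priority) are made up for the example, and the column is called grp to avoid quoting the reserved word Group.

```python
import sqlite3


def ordered_people():
    """Order rows by a per-group priority first, then by name."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT, grp TEXT);
        INSERT INTO people VALUES
            (1, 'George', 'Group_2_1'), (2, 'Alfred', 'Group_2_2'),
            (3, 'Eric',   'Group_3'),   (4, 'Mary',   'Group_1_2'),
            (5, 'Jon',    'Group_1_1');

        -- Hypothetical lookup table: which strings belong to which block.
        CREATE TABLE group_priority (grp TEXT PRIMARY KEY, priority INTEGER);
        INSERT INTO group_priority VALUES
            ('Group_1_1', 1), ('Group_1_2', 1),
            ('Group_2_1', 2), ('Group_2_2', 2),
            ('Group_3',   3);
    """)
    rows = con.execute("""
        SELECT p.name, p.grp
        FROM people p
        LEFT JOIN group_priority gp ON gp.grp = p.grp
        ORDER BY COALESCE(gp.priority, 999),  -- unknown groups go last
                 p.name
    """).fetchall()
    con.close()
    return rows


for row in ordered_people():
    print(row)
```

With the example rows this yields Jon, Mary, Alfred, George, Eric, which is the order asked for in the question. New group strings then only need a new row in the lookup table instead of another WHEN branch.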
Edits for mssql
If there is ANY instance of 3 underscores then the following simply won't work. However if there is the possibility of Group_12_6 or Group_21_1 then this approach may be worth trying.
It removes Group_ or Group from the string, leaving 1_1 or 12_6 or 21_1 then it replaces the remaining underscore with . giving 1.1 or 12.6 or 21.1 and casts this to decimal.
All utterly dependent on the consistency of those group names.
SELECT
id
, name
, [Group]
FROM YourData
ORDER BY
CAST(REPLACE(REPLACE(REPLACE([Group], 'Group_', ''), 'Group', ''), '_', '.') AS decimal(12,3))
, name
I'm really hoping you do not have a column called [Group] but if you do it has to be referenced as [Group] or "Group". Test result:
| ID | NAME | GROUP |
|----|--------|-----------|
| 1 | Jon | Group_1_1 |
| 2 | Mary | Group_1_2 |
| 4 | George | Group_2_1 |
| 3 | Alfred | Group_2_2 |
| 5 | Eric | Group_3 |
see http://sqlfiddle.com/#!3/e95b07/1

SQL Query to select bottom 2 from each category

In MySQL, I want to select the bottom 2 items from each category
Category Value
1 1.3
1 4.8
1 3.7
1 1.6
2 9.5
2 9.9
2 9.2
2 10.3
3 4
3 8
3 16
Giving me:
Category Value
1 1.3
1 1.6
2 9.5
2 9.2
3 4
3 8
Before I migrated from sqlite3 I had to first select a lowest from each category, then excluding anything that joined to that, I had to again select the lowest from each category. Then anything equal to that new lowest or less in a category won. This would also pick more than 2 in case of a tie, which was annoying... It also had a really long runtime.
My ultimate goal is to count the number of times an individual is in one of the lowest 2 of a category (there is also a name field) and this is the one part I don't know how to do.
Thanks
SELECT c1.category, c1.value
FROM catvals c1
LEFT OUTER JOIN catvals c2
ON (c1.category = c2.category AND c1.value > c2.value)
GROUP BY c1.category, c1.value
HAVING COUNT(*) < 2;
Tested on MySQL 5.1.41 with your test data. Output:
+----------+-------+
| category | value |
+----------+-------+
| 1 | 1.30 |
| 1 | 1.60 |
| 2 | 9.20 |
| 2 | 9.50 |
| 3 | 4.00 |
| 3 | 8.00 |
+----------+-------+
(The extra decimal places are because I declared the value column as NUMERIC(9,2).)
Like other solutions, this produces more than 2 rows per category if there are ties. There are ways to construct the join condition to resolve that, but we'd need to use a primary key or unique key in your table, and we'd also have to know how you intend ties to be resolved.
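As a rough illustration of such a join condition, the sketch below assumes the table has a unique id column and that ties should be broken in favour of the lower id. Both are assumptions, since the question does not say. It runs on SQLite via Python's built-in sqlite3 module, and the sample rows are hypothetical and deliberately contain a three-way tie.

```python
import sqlite3

# Hypothetical rows (category, value) with a three-way tie in category 1.
ROWS = [(1, 1.3), (1, 1.3), (1, 1.3), (1, 4.8), (2, 9.5), (2, 9.2), (2, 9.9)]


def bottom_two(rows):
    """Return exactly the two lowest rows per category, ties broken by id."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE catvals "
        "(id INTEGER PRIMARY KEY, category INTEGER, value REAL)"
    )
    con.executemany(
        "INSERT INTO catvals (category, value) VALUES (?, ?)", rows
    )
    sql = """
        SELECT c1.category, c1.value
        FROM catvals c1
        LEFT JOIN catvals c2
          ON c1.category = c2.category
         AND (c2.value < c1.value
              OR (c2.value = c1.value AND c2.id < c1.id))  -- tie-breaker
        GROUP BY c1.id, c1.category, c1.value
        HAVING COUNT(c2.id) < 2     -- fewer than two rows rank ahead of c1
        ORDER BY c1.category, c1.value, c1.id
    """
    result = con.execute(sql).fetchall()
    con.close()
    return result


print(bottom_two(ROWS))
```

Because every row now has a strict position within its category, at most two rows per category survive, even when several rows share the lowest value.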
You could try this:
SELECT * FROM (
SELECT c.*,
(SELECT COUNT(*)
FROM user_category c2
WHERE c2.category = c.category
AND c2.value < c.value) cnt
FROM user_category c ) uc
WHERE cnt < 2
It should give you the desired results, but check if performance is ok.
Here's a solution that handles duplicates properly. Table name is 'zzz' and columns are int and float
select
smallest.category category, min(smallest.value) value
from
zzz smallest
group by smallest.category
union
select
second_smallest.category category, min(second_smallest.value) value
from
zzz second_smallest
where
concat(second_smallest.category,'x',second_smallest.value)
not in ( -- recreate the results from the first half of the union
select concat(c.category,'x',min(c.value))
from zzz c
group by c.category
)
group by second_smallest.category
order by category
Caveats:
If there is only one value for a given category, then only that single entry is returned.
If there was a unique recordID for each row you wouldn't need all the concats to simulate a unique key.
Your mileage may vary,
--Mark
A union should work. I'm not sure of the performance compared to Peter's solution.
SELECT smallest.category, MIN(smallest.value)
FROM categories smallest
GROUP BY smallest.category
UNION
SELECT second_smallest.category, MIN(second_smallest.value)
FROM categories second_smallest
WHERE second_smallest.value > (SELECT MIN(smallest.value) FROM categories smallest WHERE smallest.category = second_smallest.category)
GROUP BY second_smallest.category
Here is a very generalized solution, that would work for selecting first n rows for each Category. This will work even if there are duplicates in value.
/* creating temporary variables */
mysql> set @cnt = 0;
mysql> set @trk = 0;
/* query */
mysql> select Category, Value
from (select *,
@cnt:=if(@trk = Category, @cnt+1, 0) cnt,
@trk:=Category
from user_categories
order by Category, Value ) c1
where c1.cnt < 2;
Here is the result.
+----------+-------+
| Category | Value |
+----------+-------+
| 1 | 1.3 |
| 1 | 1.6 |
| 2 | 9.2 |
| 2 | 9.5 |
| 3 | 4 |
| 3 | 8 |
+----------+-------+
This is tested on MySQL 5.0.88
Note that the initial value of the @trk variable must not equal the least value of the Category field.

SQL: Find rows where field value differs

I have a database table structured like this (irrelevant fields omitted for brevity):
rankings
------------------
(PK) indicator_id
(PK) alternative_id
(PK) analysis_id
rank
All fields are integers; the first three (labeled "(PK)") are a composite primary key. A given "analysis" has multiple "alternatives", each of which will have a "rank" for each of many "indicators".
I'm looking for an efficient way to compare an arbitrary number of analyses whose ranks for any alternative/indicator combination differ. So, for example, if we have this data:
analysis_id | alternative_id | indicator_id | rank
----------------------------------------------------
1 | 1 | 1 | 4
1 | 1 | 2 | 6
1 | 2 | 1 | 3
1 | 2 | 2 | 9
2 | 1 | 1 | 4
2 | 1 | 2 | 7
2 | 2 | 1 | 4
2 | 2 | 2 | 9
...then the ideal method would identify the following differences:
analysis_id | alternative_id | indicator_id | rank
----------------------------------------------------
1 | 1 | 2 | 6
2 | 1 | 2 | 7
1 | 2 | 1 | 3
2 | 2 | 1 | 4
I came up with a query that does what I want for 2 analysis IDs, but I'm having trouble generalizing it to find differences between an arbitrary number of analysis IDs (i.e. the user might want to compare 2, or 5, or 9, or whatever, and find any rows where at least one analysis differs from any of the others). My query is:
declare @analysisId1 int, @analysisId2 int;
select @analysisId1 = 1, @analysisId2 = 2;
select
r1.indicator_id,
r1.alternative_id,
r1.[rank] as Analysis1Rank,
r2.[rank] as Analysis2Rank
from rankings r1
inner join rankings r2
on r1.indicator_id = r2.indicator_id
and r1.alternative_id = r2.alternative_id
and r2.analysis_id = @analysisId2
where
r1.analysis_id = @analysisId1
and r1.[rank] != r2.[rank]
(It puts the analysis values into additional fields instead of rows. I think either way would work.)
How can I generalize this query to handle many analysis ids? (Or, alternatively, come up with a different, better query to do the job?) I'm using SQL Server 2005, in case it matters.
If necessary, I can always pull all the data out of the table and look for differences in code, but a SQL solution would be preferable since often I'll only care about a few rows out of thousands and there's no point in transferring them all if I can avoid it. (However, if you have a compelling reason not to do this in SQL, say so--I'd consider that a good answer too!)
This will return your desired data set - Now you just need a way to pass the required analysis ids to the query. Or potentially just filter this data inside your application.
select r.* from rankings r
inner join
(
select alternative_id, indicator_id
from rankings
group by alternative_id, indicator_id
having count(distinct rank) > 1
) differ on r.alternative_id = differ.alternative_id
and r.indicator_id = differ.indicator_id
order by r.alternative_id, r.indicator_id, r.analysis_id, r.rank
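One possible way to pass the analysis ids is to build an IN list with one bound parameter per id. The sketch below does that from Python's built-in sqlite3 module, so it runs on SQLite rather than SQL Server 2005. The function name is made up for the example, and the rows are the sample data from the question.

```python
import sqlite3

# Sample data from the question:
# (analysis_id, alternative_id, indicator_id, rank)
RANKINGS = [
    (1, 1, 1, 4), (1, 1, 2, 6), (1, 2, 1, 3), (1, 2, 2, 9),
    (2, 1, 1, 4), (2, 1, 2, 7), (2, 2, 1, 4), (2, 2, 2, 9),
]


def differing_rows(rankings, analysis_ids):
    """Rows of the chosen analyses whose rank differs for some
    alternative/indicator combination."""
    ids = list(analysis_ids)
    if not ids:
        return []
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE rankings (analysis_id INTEGER, alternative_id INTEGER,"
        ' indicator_id INTEGER, "rank" INTEGER)'
    )
    con.executemany("INSERT INTO rankings VALUES (?, ?, ?, ?)", rankings)
    # Only "?" markers are interpolated into the SQL text; values stay bound.
    placeholders = ", ".join("?" for _ in ids)
    sql = f"""
        SELECT r.analysis_id, r.alternative_id, r.indicator_id, r."rank"
        FROM rankings r
        JOIN (
            SELECT alternative_id, indicator_id
            FROM rankings
            WHERE analysis_id IN ({placeholders})
            GROUP BY alternative_id, indicator_id
            HAVING COUNT(DISTINCT "rank") > 1
        ) d ON d.alternative_id = r.alternative_id
           AND d.indicator_id = r.indicator_id
        WHERE r.analysis_id IN ({placeholders})
        ORDER BY r.alternative_id, r.indicator_id, r.analysis_id
    """
    rows = con.execute(sql, ids + ids).fetchall()
    con.close()
    return rows


for row in differing_rows(RANKINGS, [1, 2]):
    print(row)
```

For analyses 1 and 2 this returns the four rows listed as the desired result in the question. Note that a combination that is missing entirely from one analysis is not reported as a difference, because only ranks that are present get compared.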
I don't know which database you are using; in SQL Server I would go like this:
-- STEP 1, create temporary table with all the alternative_id , indicator_id combinations with more than one rank:
select alternative_id , indicator_id
into #results
from rankings
group by alternative_id , indicator_id
having count (distinct rank)>1
-- STEP 2, retrieve the data
select a.* from rankings a, #results b
where a.alternative_id = b.alternative_id
and a.indicator_id = b.indicator_id
order by alternative_id , indicator_id, analysis_id
BTW, the other answers given here need the count(distinct rank)!
I think this is what you're trying to do:
select
r.analysis_id,
r.alternative_id,
rm.indicator_id_max,
rm.rank_max
from rankings r
join (
select
analysis_id,
alternative_id,
max(indicator_id) as indicator_id_max,
max(rank) as rank_max
from rankings
group by analysis_id,
alternative_id
having count(*) > 1
) as rm
on r.analysis_id = rm.analysis_id
and r.alternative_id = rm.alternative_id
Your example differences seem wrong. You say you want analyses whose ranks for any alternative/indicator combination differ, but the example rows 3 and 4 don't satisfy this criterion. A correct result according to your requirement is:
analysis_id | alternative_id | indicator_id | rank
----------------------------------------------------
1 | 1 | 2 | 6
2 | 1 | 2 | 7
1 | 2 | 1 | 3
2 | 2 | 1 | 4
One query you could try is this:
with distinct_ranks as (
select alternative_id
, indicator_id
, rank
, count (*) as count
from rankings
group by alternative_id
, indicator_id
, rank
having count(*) = 1)
select r.analysis_id
, r.alternative_id
, r.indicator_id
, r.rank
from rankings r
join distinct_ranks d on r.alternative_id = d.alternative_id
and r.indicator_id = d.indicator_id
and r.rank = d.rank
You have to realize that with multiple analyses the criterion you have is ambiguous. What if analyses 1, 2 and 3 have rank 1 and 4, 5 and 6 have rank 2 for alternative/indicator 1/1? The set (1,2,3) is 'different' from the set (4,5,6), but inside each set there is no difference. What behavior do you want in that case: should they show up or not? My query finds all records that have a different rank for the same alternative/indicator *from all other analyses*, but it is not clear whether this is correct for your requirement.