SQL: Find rows that match closely but not exactly

I have a table inside a PostgreSQL database with columns c1,c2...cn. I want to run a query that compares each row against a tuple of values v1,v2...vn. The query should not require an exact match but should return a list of rows ordered by descending similarity to the value vector v.
Example:
The table contains sports records:
1,USA,basketball,1956
2,Sweden,basketball,1998
3,Sweden,skating,1998
4,Switzerland,golf,2001
Now when I run a query against this table with v=(Sweden,basketball,1998), I want to get all records that have a similarity with this vector, sorted by number of matching columns in descending order:
2,Sweden,basketball,1998 --> 3 columns match
3,Sweden,skating,1998 --> 2 columns match
1,USA,basketball,1956 --> 1 column matches
Row 4 is not returned because it does not match at all.
Edit: All columns are equally important. Although, when I really think of it... it would be a nice add-on if I could give each column a different weight factor as well.
Is there any possible SQL query that would return the rows in a reasonable amount of time, even when I run it against a million rows?
What would such a query look like?

SELECT * FROM countries
WHERE country = 'Sweden'
   OR sport = 'basketball'
   OR year = 1998
ORDER BY
  cast(country = 'Sweden' AS integer) +
  cast(sport = 'basketball' AS integer) +
  cast(year = 1998 AS integer) DESC
It's not pretty, but it works: cast each boolean comparison to an integer and sum the results.
You can change a column's weight by multiplying its term by a factor:
cast(sport = 'basketball' as integer) * 5 +
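The weighted variant can be tried end to end with a small script. This is only a sketch: it uses Python's built-in sqlite3 as a stand-in for PostgreSQL, and the table name `records`, the helper `closest_rows` and the `weights` parameter are made up for illustration. SQLite comparisons already evaluate to 0 or 1, so no cast is needed here; in PostgreSQL you would keep the casts shown above. The `score > 0` filter assumes all weights are positive.

```python
import sqlite3

# Hypothetical table and column names; SQLite stands in for PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records (id INTEGER, country TEXT, sport TEXT, year INTEGER)"
)
conn.executemany(
    "INSERT INTO records VALUES (?, ?, ?, ?)",
    [
        (1, "USA", "basketball", 1956),
        (2, "Sweden", "basketball", 1998),
        (3, "Sweden", "skating", 1998),
        (4, "Switzerland", "golf", 2001),
    ],
)


def closest_rows(country, sport, year, weights=(1, 1, 1)):
    """Return (id, score) for rows matching at least one value, best first."""
    w_country, w_sport, w_year = weights
    sql = """
        SELECT id, score
        FROM (
            SELECT id,
                   (country = :country) * :w_country
                 + (sport   = :sport)   * :w_sport
                 + (year    = :year)    * :w_year AS score
            FROM records
        )
        WHERE score > 0          -- assumes positive weights
        ORDER BY score DESC, id
    """
    params = {
        "country": country, "sport": sport, "year": year,
        "w_country": w_country, "w_sport": w_sport, "w_year": w_year,
    }
    return conn.execute(sql, params).fetchall()


print(closest_rows("Sweden", "basketball", 1998))
print(closest_rows("Sweden", "basketball", 1998, weights=(1, 5, 1)))
```

With equal weights the rows come back as 2, 3, 1 with scores 3, 2, 1; giving sport a weight of 5 moves row 1 ahead of row 3.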

This is how I would do it. The multiplication factors in the CASE expressions set the importance (weight) of each column, so records that match on the highest-weighted columns come out on top even if their other columns don't match.
/*
-- Initial Setup
-- drop table sport
create table sport (id int, Country varchar(20), sport varchar(20), yr int);
insert into sport values
(1,'USA','basketball',1956),
(2,'Sweden','basketball',1998),
(3,'Sweden','skating',1998),
(4,'Switzerland','golf',2001);
select * from sport;
*/
select *,
  CASE WHEN Country = 'Sweden' then 1 else 0 end * 100 +
  CASE WHEN sport = 'basketball' then 1 else 0 end * 10 +
  CASE WHEN yr = 1998 then 1 else 0 end * 1 as match_score
from sport
WHERE
  Country = 'Sweden'
  OR sport = 'basketball'
  OR yr = 1998
ORDER BY match_score DESC;

It might help if you wrote a stored procedure that calculates a "similarity metric" between two rows. Then your query could refer to the return value of that procedure directly rather than having umpteen conditions in the where-expression and the order-by-expression.
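A rough sketch of that idea, assuming SQLite (via Python's sqlite3) instead of a PostgreSQL stored procedure: a user-defined function named `similarity` (a hypothetical name) is registered on the connection and called from the query. It takes the row's values followed by the target values and counts the positions that agree. Keep in mind that a function like this is evaluated for every row and cannot use indexes, so it tidies the query rather than speeding it up.

```python
import sqlite3


def similarity(*args):
    """Count positions where the row's values equal the target's values.

    The first half of args is the row, the second half is the target.
    """
    half = len(args) // 2
    return sum(1 for a, b in zip(args[:half], args[half:]) if a == b)


conn = sqlite3.connect(":memory:")
# narg = -1 lets the SQL function accept any number of arguments.
conn.create_function("similarity", -1, similarity)

# Hypothetical table name; data taken from the question.
conn.execute(
    "CREATE TABLE records (id INTEGER, country TEXT, sport TEXT, year INTEGER)"
)
conn.executemany(
    "INSERT INTO records VALUES (?, ?, ?, ?)",
    [
        (1, "USA", "basketball", 1956),
        (2, "Sweden", "basketball", 1998),
        (3, "Sweden", "skating", 1998),
        (4, "Switzerland", "golf", 2001),
    ],
)

target = ("Sweden", "basketball", 1998)
rows = conn.execute(
    """
    SELECT id, score
    FROM (
        SELECT id, similarity(country, sport, year, ?, ?, ?) AS score
        FROM records
    )
    WHERE score > 0
    ORDER BY score DESC, id
    """,
    target,
).fetchall()

print(rows)
```

The query text now contains the target values only once, and the scoring rule lives in one place.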

Related

Given a table of numbers, can I get all the rows which add up to less than or equal to a number?

Say I have a table with an incrementing id column and a random positive non zero number.
id rand
1 12
2 5
3 99
4 87
Write a query to return the rows which add up to a given number.
A couple rules:
Rows must be "consumed" in order, even if a later row makes it a perfect match. For example, querying for 104 would be a perfect match for rows 1, 2, and 4 but rows 1-3 would still be returned.
You can use a row partially if there is more available than is necessary to add up to whatever is left over of the number. E.g. rows 1, 2, and 3 would be returned if your max number is 50, because 12 + 5 + 33 equals 50 and row 3 (99) is only partially used.
If there are not enough rows to satisfy the amount, then return ALL the rows. E.g. in the above example a query for 1,000 would return rows 1-4. In other words, the sum of the rows should be less than or equal to the queried number.
It's possible for the answer to be "no this is not possible with SQL alone" and that's fine but I was just curious. This would be a trivial problem with a programming language but I was wondering what SQL provides out of the box to do something as a thought experiment and learning exercise.
You didn't mention which RDBMS, but assuming SQL Server:
DROP TABLE IF EXISTS #t;
CREATE TABLE #t (id int, rand int);
INSERT INTO #t (id, rand)
VALUES (1,12),(2,5),(3,99),(4,87);
DECLARE @target int = 104;
WITH dat
AS
(
  SELECT id, rand, SUM(rand) OVER (ORDER BY id) AS runsum
  FROM #t
),
dat2
AS
(
  SELECT id, rand
       , runsum
       , COALESCE(LAG(runsum,1) OVER (ORDER BY id),0) AS prev_runsum
  FROM dat
)
-- A row is needed only while the total before it is still short of the target.
SELECT id, rand
FROM dat2
WHERE prev_runsum < @target;
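To check the three rules from the question against the sample data, here is a small sketch using Python's sqlite3 (SQLite 3.25+ for window functions) rather than SQL Server. The table name `t`, the helper `consume` and the extra `used` column are assumptions added for illustration; `used` shows how much of each row is actually consumed, which makes the partial row visible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, rand INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, 12), (2, 5), (3, 99), (4, 87)])


def consume(target):
    """Return (id, rand, used) for each row consumed in id order up to target."""
    sql = """
        WITH dat AS (
            SELECT id, rand,
                   -- running total of all rows before this one
                   SUM(rand) OVER (ORDER BY id) - rand AS prev_runsum
            FROM t
        )
        SELECT id, rand,
               MIN(rand, :target - prev_runsum) AS used  -- two-argument MIN is scalar in SQLite
        FROM dat
        WHERE prev_runsum < :target
        ORDER BY id
    """
    return conn.execute(sql, {"target": target}).fetchall()


print(consume(50))    # rows 1-3, with only 33 of row 3 used
print(consume(104))   # rows 1-3 again; row 4 is never reached
print(consume(1000))  # not enough rows, so all four come back
```

The filter keeps a row only while the total of the rows before it is still below the target, so a target that is hit exactly (for example 17) does not pull in the following row.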

How to write a SQL query to calculate percentages based on values across different tables?

Suppose I have a database containing two tables, similar to below:
Table 1:
tweet_id tweet
1 Scrap the election results
2 The election was great!
3 Great stuff
Table 2:
politician tweet_id
TRUE 1
FALSE 2
FALSE 3
I'm trying to write a SQL query which returns the percentage of tweets that contain the word 'election' broken down by whether they were a politician or not.
So for instance here, the first 2 tweets in Table 1 contain the word election. By looking at Table 2, you can see that tweet_id 1 was written by a politician, whereas tweet_id 2 was written by a non-politician.
Hence, the result of the SQL query should return 50% for politicians and 50% for non-politicians (i.e. two tweets contained the word 'election', one by a politician and one by a non-politician).
Any ideas how to write this in SQL?
You could do this by creating one subquery to return all election tweets, and one subquery to return all election tweets by politicians, then join.
Here is a sample. The politician total is multiplied by 1.0 so that the division is not done in integer arithmetic, which would truncate the result in many databases.
select
  1.0 * politician_tweets.total / election_tweets.total
from
(
  select
    count(tweet) as total
  from
    table_1
    join table_2 on table_1.tweet_id = table_2.tweet_id
  where
    tweet like '%election%'
) election_tweets
join
(
  select
    count(tweet) as total
  from
    table_1
    join table_2 on table_1.tweet_id = table_2.tweet_id
  where
    tweet like '%election%' and
    politician = 1
) politician_tweets
on 1 = 1
You can use aggregation like this:
select t2.politician, avg( case when t.tweet like '%election%' then 1.0 else 0 end) as election_ratio
from tweets t join
table2 t2
on t.tweet_id = t2.tweet_id
group by t2.politician;
Here is a db<>fiddle.
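It is worth noting that there are two different percentages in play. The question asks how the tweets containing 'election' split between politicians and non-politicians (50% / 50%), while averaging a 0/1 flag per group gives the share of each group's tweets that contain the word. The sketch below computes both on the sample data, using Python's sqlite3 as a stand-in database. It assumes the table names `table_1` and `table_2`, stores `politician` as 1/0, and assumes every tweet has exactly one row in `table_2`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_1 (tweet_id INTEGER, tweet TEXT)")
conn.execute("CREATE TABLE table_2 (politician INTEGER, tweet_id INTEGER)")
conn.executemany(
    "INSERT INTO table_1 VALUES (?, ?)",
    [
        (1, "Scrap the election results"),
        (2, "The election was great!"),
        (3, "Great stuff"),
    ],
)
conn.executemany("INSERT INTO table_2 VALUES (?, ?)", [(1, 1), (0, 2), (0, 3)])

# Share of all 'election' tweets written by each group (what the question asks).
shares = conn.execute(
    """
    SELECT t2.politician,
           100.0 * COUNT(*) / (SELECT COUNT(*) FROM table_1
                               WHERE tweet LIKE '%election%') AS pct
    FROM table_1 AS t1
    JOIN table_2 AS t2 ON t1.tweet_id = t2.tweet_id
    WHERE t1.tweet LIKE '%election%'
    GROUP BY t2.politician
    ORDER BY t2.politician
    """
).fetchall()

# Fraction of each group's own tweets that mention 'election'.
rates = conn.execute(
    """
    SELECT t2.politician,
           AVG(CASE WHEN t1.tweet LIKE '%election%' THEN 1.0 ELSE 0 END) AS ratio
    FROM table_1 AS t1
    JOIN table_2 AS t2 ON t1.tweet_id = t2.tweet_id
    GROUP BY t2.politician
    ORDER BY t2.politician
    """
).fetchall()

print(shares)  # [(0, 50.0), (1, 50.0)]
print(rates)   # [(0, 0.5), (1, 1.0)]
```

Which of the two you want depends on the denominator: all election tweets, or all tweets by that group.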

Sample certain number of result rows from a postgres table based on given proportions

Let's say I have a table named population with 1000 rows like the following:
And I have another table named proportions that holds the desired proportions of different Group_Names that I want to extract:
I want to randomly sample 100 rows from population table where the proportions of the Group_Names within the sample is in line with that of the Proportion field within proportions table. So in that 100 rows sample, 50 rows should be Group-A, 30 rows should be Group-B and 20 rows should be Group-C.
I can manually sample like:
CREATE EXTENSION tsm_system_rows;
SELECT * FROM population TABLESAMPLE SYSTEM_ROWS(100);
But I do not know how to sample from population programmatically based on proportions table especially if proportions table has a lot more Group_Names than 3 as shown in the example.
The main problem that you will be facing is that TABLESAMPLE takes the sample before applying your group filter. Say that you want 20 rows from group C. The chances of getting those 20 by running
SELECT *
FROM population TABLESAMPLE system_rows(20)
WHERE group_name = 'C'
are pretty slim if group C is small relative to other groups in population.
I'd solve this by writing a stored function that receives the group name and the desired number of rows as parameters, and samples the table until it has collected that many rows.
You should also limit the number of iterations, in case the group is very sparse or there are not enough rows to fulfill the need.
The function could look like this:
CREATE OR REPLACE FUNCTION sample_group (p_group_name text, sample_size int, max_iterations int)
RETURNS int[]
LANGUAGE PLPGSQL AS $$
DECLARE
  result int[];
  i int := 0;
BEGIN
  WHILE i < max_iterations AND coalesce(array_length(result, 1), 0) < sample_size LOOP
    WITH sample AS (
      SELECT group_name, value
      FROM population TABLESAMPLE BERNOULLI (1)
      LIMIT 10 * sample_size
    ), add_rows AS (
      SELECT result || array_agg(value) arr
      FROM sample
      WHERE group_name = p_group_name
    )
    SELECT array_agg(DISTINCT value), i + 1
    INTO result, i
    FROM add_rows, unnest(arr) AS t(value);
  END LOOP;
  RETURN result[1:sample_size];
END;
$$;
I'm using BERNOULLI sampling to avoid getting the same rows over and over.
The function did most of the work for you. All that remains is to call it. In this example I'm setting an upper limit of 500 on the iterations.
SELECT group_name, unnest(sample_group(group_name, (100*proportion)::int, 500)) AS value
from proportions;
You can sample based on randomly assigned row numbers:
select *
from
(
select *
,case
when row_number()
over (partition by pop.group_name
order by random()) <= pr.proportion * 100 -- sample size
then 1
else 0
end as flag
from population as pop
join proportions as pr
on pop.group_name = pr.group_name
) as dt
where flag = 1
Edit:
If the table is large, taking a sample before applying ROW_NUMBER might greatly reduce the number of rows processed. Of course, that sample must be large enough to contain at least the required number of rows for every group, i.e. way over 100 rows.
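The row-number approach can be tried on synthetic data. This sketch uses Python's sqlite3 (SQLite 3.25+ for window functions) instead of PostgreSQL and assumes the column names `group_name`, `value` and `proportion` used in the answers; the generated population, the extra group D and the helper `sample_by_proportion` are invented for the demo. The quota is rounded because a product such as 0.3 * 100 is not exactly 30 in floating point.

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE population (group_name TEXT, value INTEGER)")
conn.execute("CREATE TABLE proportions (group_name TEXT, proportion REAL)")

# Synthetic population: 1000 rows spread evenly over four groups.
groups = ["A", "B", "C", "D"]
conn.executemany(
    "INSERT INTO population VALUES (?, ?)",
    [(groups[i % 4], i) for i in range(1000)],
)
# Group D has no entry here, so the join leaves it out of the sample.
conn.executemany(
    "INSERT INTO proportions VALUES (?, ?)",
    [("A", 0.5), ("B", 0.3), ("C", 0.2)],
)


def sample_by_proportion(sample_size):
    """Pick round(proportion * sample_size) random rows from each listed group."""
    sql = """
        SELECT group_name, value
        FROM (
            SELECT pop.group_name, pop.value, pr.proportion,
                   ROW_NUMBER() OVER (PARTITION BY pop.group_name
                                      ORDER BY random()) AS rn
            FROM population AS pop
            JOIN proportions AS pr ON pop.group_name = pr.group_name
        )
        WHERE rn <= round(proportion * :n)
    """
    return conn.execute(sql, {"n": sample_size}).fetchall()


sample = sample_by_proportion(100)
print(Counter(group for group, _ in sample))
```

Each run returns different rows, but the per-group counts stay at 50, 30 and 20 as long as every group has enough rows to fill its quota.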

SQL query to pull certain rows based on values in other rows in the same table

I have a set of data that contains 2 sets of identifiers: a unique number for that record, Widget_Number, and the original unique number for the record, Original_Widget_Number. Typically these two values are identical, but when a record has been revised, a new record is created with a new Widget_Number, preserving the old Widget_Number value in Original_Widget_Number. I.e. SELECT * FROM widgets WHERE Widget_Number != Original_Widget_Number returns all records that have been changed. (Widget_Number increments by 10 for new widgets and by 1 for revised widgets.)
I would like to return all records that were changed as well as the original records related to those records. For example if I had a table containing:
Row Widget_Number Original_Widget_Number More_Data
1 10 10 Stephen
2 11 10 Steven
3 20 20 Joe
I would like a query to return rows 1 & 2. I know I could loop through this in a higher-level language, but is there a straightforward way to do this in MS SQL?
using exists():
select *
from widgets as t
where exists (
select 1
from widgets as i
where i.original_widget_number = t.original_widget_number
and i.widget_number != i.original_widget_number
)
or in()
select *
from widgets as t
where t.original_widget_number in (
select i.original_widget_number
from widgets as i
where i.widget_number != i.original_widget_number
)
The following should get both the records that have changed and the original records:
select w.*
from widgets w
where w.widget_number <> w.original_widget_number or
exists (select 1
from widgets w2
where w.widget_number = w2.original_widget_number and
w2.widget_number <> w2.original_widget_number
);
select * from widgets
where original_widget_number in
(select original_widget_number from widgets
where widget_number <> original_widget_number)
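As a quick sanity check, the EXISTS and IN forms can be run against the three sample rows. This is a sketch with Python's sqlite3 standing in for MS SQL; the table and column names follow the question, and the variable names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE widgets ("
    "widget_number INTEGER, original_widget_number INTEGER, more_data TEXT)"
)
conn.executemany(
    "INSERT INTO widgets VALUES (?, ?, ?)",
    [(10, 10, "Stephen"), (11, 10, "Steven"), (20, 20, "Joe")],
)

# Keep every row whose original number belongs to at least one revised record.
with_exists = conn.execute(
    """
    SELECT t.widget_number, t.original_widget_number, t.more_data
    FROM widgets AS t
    WHERE EXISTS (
        SELECT 1
        FROM widgets AS i
        WHERE i.original_widget_number = t.original_widget_number
          AND i.widget_number <> i.original_widget_number
    )
    ORDER BY t.widget_number
    """
).fetchall()

# Same result, phrased with IN.
with_in = conn.execute(
    """
    SELECT widget_number, original_widget_number, more_data
    FROM widgets
    WHERE original_widget_number IN (
        SELECT original_widget_number
        FROM widgets
        WHERE widget_number <> original_widget_number
    )
    ORDER BY widget_number
    """
).fetchall()

print(with_exists)  # [(10, 10, 'Stephen'), (11, 10, 'Steven')]
```

Both forms return the original record 10 together with its revision 11 and leave the untouched record 20 out.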

SQL Sampling based on the whole population

I have a population of records...let's say 10,000 athletes, grouped by sports, where (numbers below would be variable):
4,000 are from NBA
2,000 are from NHL
3,000 are from MLB
1,000 are from NFL
How can I build a sample query that will sample 100 records based on the population, not fully random but pull out:
NBA/Whole Population=X
Select Top X * From MainTable Where league= 'NBA' (something like this)
40 names are from NBA
20 names are from NHL
30 names are from MLB
10 names are from NFL.
This is just a sample of the population, logic here is to calculate what the ratios are with regard to the whole population and then apply them to the sample size.
Regards
Consider using a correlated count subquery to compute a rank order, which you then use as the filtering criterion for the sample ratio.
SELECT main.*
FROM
(SELECT *,
(SELECT Count(*) FROM MainTable sub
WHERE sub.League = t.League AND sub.UniqueID <= t.UniqueID) As Rank
FROM MainTable t) AS main
WHERE main.Rank <= CInt((SELECT Count(*) FROM MainTable sub
WHERE sub.League = main.League) /
(SELECT Count(*) FROM MainTable) * 100)
ORDER BY main.League, main.Rank
To explain above query with nested subqueries and derived tables:
The derived table, main, is exact source MainTable with a new column called Rank that gives an ordinal count of records for each League. So for the first NBA record (not necessarily first row), it is tagged rank 1, next NBA record (which can appear anywhere like 89th row) is tagged 2, and so on for each League. And yes, Rank will go up to 4,000 if needed!
Once this Rank field is calculated giving ordinal 1, 2, 3, ... indicators for each League grouping, we then position this SELECT statement as a derived table in FROM clause in order to use Rank in WHERE filter for the sample ratio. We cannot calculate a column and filter in same SELECT call.
The sample ratio comes from the last two subqueries, whose quotient is (# of records in the current row's League / total # of records in the table). This value is then multiplied by 100, the sample size, to give each League's quota. CInt converts a possibly fractional quota to an integer by rounding (halves go to the nearest even integer); use Int(...) instead if you would rather truncate the decimals.
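The same quota logic can be sketched with window functions, which avoid running a correlated count for every row. Note that this is a swapped-in technique, not the Access query above: it runs on SQLite (3.25+) through Python's sqlite3, orders each league randomly instead of by UniqueID, and uses a scaled-down population of 1,000 rows. The helper name `stratified_sample` is hypothetical.

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MainTable (UniqueID INTEGER PRIMARY KEY, League TEXT)")

# Scaled-down population with the same 4:2:3:1 ratios as the question.
league_sizes = {"NBA": 400, "NHL": 200, "MLB": 300, "NFL": 100}
conn.executemany(
    "INSERT INTO MainTable (League) VALUES (?)",
    [(league,) for league, n in league_sizes.items() for _ in range(n)],
)


def stratified_sample(sample_size):
    """Sample each league in proportion to its share of the whole table."""
    sql = """
        SELECT UniqueID, League
        FROM (
            SELECT UniqueID, League,
                   ROW_NUMBER() OVER (PARTITION BY League
                                      ORDER BY random()) AS rn,
                   COUNT(*) OVER (PARTITION BY League) AS league_total,
                   COUNT(*) OVER () AS grand_total
            FROM MainTable
        )
        -- quota = league share of the population times the sample size
        WHERE rn <= round(1.0 * league_total * :n / grand_total)
    """
    return conn.execute(sql, {"n": sample_size}).fetchall()


picked = stratified_sample(100)
print(Counter(league for _, league in picked))
```

With these sizes the quotas come out as 40, 20, 30 and 10. When the ratios do not divide evenly, rounding each league separately can make the total differ slightly from the requested sample size.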
Dim Leagues(1 To 4) As String
Leagues(1) = "NHL"
Leagues(2) = "MLB"
Leagues(3) = "NFL"
Leagues(4) = "MLS"
Randomize ' seed the random number generator
Set db = CurrentDb
Set samp = db.OpenRecordset("RANDOMSAMPLE") ' table that receives the sample
For x = 1 To 4
    y = 0
    sqql = "Select * from Maintable Where League = '" & Leagues(x) & "'"
    Set cf = db.OpenRecordset(sqql)
    cf.MoveLast ' forces RecordCount to be populated
    Do While y < (x * 1000) ' adjust as necessary just swagged in you wanted 1000 from league 1, 2000 league 2 etc
        cf.MoveFirst
        i = Int(cf.RecordCount * Rnd) ' random offset 0 .. RecordCount - 1
        cf.Move i
        ' note: the same record can be picked more than once
        With samp
            .AddNew
            .fields("Yourfield here") = cf![your field ]
            ' repeat as nec
            .Update
        End With
        y = y + 1
    Loop
    cf.Close
Next x
samp.Close