Scoring database entries by multiple columns - sql

I'm faced with a situation where I have to find the best matches for a user's search request. Here is a (slightly abstract) example:
We have a table with lawyers:
Name      Location   Royality    Family Law  Criminal Law
-------------------------------------------------------------
Lawyer A  Berlin     100 €/hour  false       true
Lawyer B  Amsterdam  150 €/hour  true        true
A user should now be able to search by several features, and the weight of each feature should be a parameter. In my case the table contains many more such features (Location, Royality, 20+ boolean values). The result should of course include all "good" matches, ordered by some kind of "score" so that the best result appears at the top.
I'm not looking for an out-of-the-box solution, but rather for an introduction to this topic.
Kind regards,
matt

A generic approach is to assign a weight to each item, and add them up when they match. This will cause a full table scan to score every single record.
Assuming inputs of Berlin, >100/hr, Criminal Law = true, Family Law = null (no criterion), and that a Location match carries a weight of 5:
select *
from (
    select *,
        case when location = 'Berlin' then 5 else 0 end +
        case when royality >= 100 then 1 else 0 end +
        -- family law: no criterion was given, so it adds nothing to the score
        -- (a test such as "familylaw = null" is never true in SQL)
        case when criminallaw = true then 1 else 0 end as score
    from tbl
) scored
order by score desc
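To make the weights real parameters, the same query can be driven by bound values. The following is only a minimal sketch, assuming SQLite through Python's built-in sqlite3 module, a made-up `criteria`/`weights` layout, and booleans stored as 0/1; the table and column names follow the example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tbl (name TEXT, location TEXT, royality INTEGER,"
    " familylaw INTEGER, criminallaw INTEGER)"
)
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?, ?, ?, ?)",
    [("Lawyer A", "Berlin", 100, 0, 1), ("Lawyer B", "Amsterdam", 150, 1, 1)],
)

# Hypothetical search request: None means "no criterion" for that feature.
criteria = {"location": "Berlin", "min_rate": 100, "familylaw": None, "criminallaw": 1}
weights = {"location": 5, "min_rate": 1, "familylaw": 1, "criminallaw": 1}

# A comparison against a NULL parameter is never true, so a feature
# without a criterion falls through to ELSE 0 and adds nothing.
query = """
SELECT name,
       CASE WHEN location = :location THEN :w_location ELSE 0 END +
       CASE WHEN royality >= :min_rate THEN :w_min_rate ELSE 0 END +
       CASE WHEN familylaw = :familylaw THEN :w_familylaw ELSE 0 END +
       CASE WHEN criminallaw = :criminallaw THEN :w_criminallaw ELSE 0 END
       AS score
FROM tbl
ORDER BY score DESC
"""
params = dict(criteria)
params.update({"w_" + key: value for key, value in weights.items()})
rows = conn.execute(query, params).fetchall()
print(rows)
```

With these inputs Lawyer A collects the location, rate and criminal-law weights, while Lawyer B only collects the rate and criminal-law weights, so Lawyer A sorts first.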

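You may be able to make use of the SOUNDEX function in your particular RDBMS. It maps a string to a short phonetic code, so strings that sound alike get the same or similar codes; some systems also offer a direct comparison (SQL Server's DIFFERENCE, for example, returns a value from 0 to 4 for how alike two strings sound).
You can then weight and/or sum the results for each column, as described in the answer above.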

Related

How to write a SQL query to calculate percentages based on values across different tables?

Suppose I have a database containing two tables, similar to below:
Table 1:
tweet_id | tweet
1        | Scrap the election results
2        | The election was great!
3        | Great stuff
Table 2:
politician | tweet_id
TRUE       | 1
FALSE      | 2
FALSE      | 3
I'm trying to write a SQL query which returns the percentage of tweets that contain the word 'election' broken down by whether they were a politician or not.
So for instance here, the first 2 tweets in Table 1 contain the word election. By looking at Table 2, you can see that tweet_id 1 was written by a politician, whereas tweet_id 2 was written by a non-politician.
Hence, the result of the SQL query should return 50% for politicians and 50% for non-politicians (i.e. two tweets contained the word 'election', one by a politician and one by a non-politician).
Any ideas how to write this in SQL?
You could do this by creating one subquery to return all election tweets, and one subquery to return all election tweets by politicians, then join.
Here is a sample. Note that you may need to cast the totals to decimals before dividing (depending on which SQL provider you are working in).
select
politician_tweets.total / election_tweets.total
from
(
select
count(tweet) as total
from
table_1
join table_2 on table_1.tweet_id = table_2.tweet_id
where
tweet like '%election%'
) election_tweets
join
(
select
count(tweet) as total
from
table_1
join table_2 on table_1.tweet_id = table_2.tweet_id
where
tweet like '%election%' and
politician = 1
) politician_tweets
on 1 = 1
You can use aggregation like this:
select t2.politician, avg( case when t.tweet like '%election%' then 1.0 else 0 end) as election_ratio
from tweets t join
table2 t2
on t.tweet_id = t2.tweet_id
group by t2.politician;
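For the split the question asks for (what share of all 'election' tweets came from each group), one option is to divide each group's count by the overall count of election tweets. This is a sketch only, assuming SQLite via Python's sqlite3, with `politician` stored as 1/0 and the table names from the first answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table_1 (tweet_id INTEGER, tweet TEXT);
CREATE TABLE table_2 (politician INTEGER, tweet_id INTEGER);
INSERT INTO table_1 VALUES (1, 'Scrap the election results'),
                           (2, 'The election was great!'),
                           (3, 'Great stuff');
INSERT INTO table_2 VALUES (1, 1), (0, 2), (0, 3);
""")

# Multiplying by 100.0 forces floating-point division, which is the
# "cast before dividing" point made in the first answer.
query = """
SELECT t2.politician,
       100.0 * COUNT(*) / (SELECT COUNT(*) FROM table_1
                           WHERE tweet LIKE '%election%') AS pct
FROM table_1 t1
JOIN table_2 t2 ON t1.tweet_id = t2.tweet_id
WHERE t1.tweet LIKE '%election%'
GROUP BY t2.politician
ORDER BY t2.politician
"""
shares = dict(conn.execute(query).fetchall())
print(shares)
```

On the sample data each group wrote one of the two election tweets, so both shares come out at 50.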

Oracle SQL - Multiple return from case

I may be going about this the wrong way; I'm looking for whichever approach is best.
Requirement:
My query joins 4-5 tables on a few fields.
I have a column called product id. The table has 1.5 million rows, of which only about 10% have product ids of the following form:
A300X-%
A500Y-%
300, 500 and 700 are valid model numbers. X and Y are classifications. My query picks all the systems.
I have a check as follows:
CASE
WHEN PID LIKE 'A300X%'
THEN 'A300'
...
END AS MODEL
Similarly
CASE
WHEN PID LIKE 'A300X%'
THEN 'X'
...
END AS GENRE
I am looking for the best of the options below.
How do I combine both CASE expressions and add another (third) CASE that wraps these two? I.e.:
CASE
WHEN desc in ('AAA')
First Case
Second Case
ELSE
don't do anything for other systems
END
Is there a regex way of doing this? Take the part of the string before the first '-', then look for X or Y and for 300, 500 or 700.
Is there any other way of doing this? Or is doing it in code the best way?
Any suggestions?
EDIT:
Sample desc:
AAA,
SoftwARE,
sw-app
My query picks all the desc values, but the CASE should apply to AAA alone.
The valid models are:
A300X-2x-P
A500Y-5x-p
A700X-2x-p
A50CE-2x-P
I only have to consider 300, 500 and 700, and the two cases above.
Expected result:
MODEL GENRE
A300 X
A500 Y
A300 Y
Q: How do I Combine both CASE statement expressions
Each CASE expression will return a single value. If the requirement is to return two separate columns in the resultset, that will require two separate expressions in the SELECT list.
For example:
DESC  PID         model_number  genre
----  ----------  ------------  ------
AAA   A300X-2x-P  300           X
AAA   A500Y-5x-p  500           Y
AAA   A700X-2x-p  700           X
AAA   A50CE-2x-P  (NULL)        (NULL)
FOO   A300X-2x-P  (NULL)        (NULL)
There will need to be an expression to return the model_number column, and a separate expression to return the genre column.
It's not possible for a single expression to return two separate columns.
Q: and add another[third] case which will have these two cases.
A CASE expression returns a value; we can use a CASE expression almost anywhere in a SQL statement where we can use a value, including within another CASE expression.
We can also combine multiple conditions in a WHEN test with AND and OR.
As an example of combining conditions and nesting CASE expressions:
CASE
  WHEN ( ( t.PID LIKE '_300%' OR t.PID LIKE '_500%' OR t.PID LIKE '_700%' )
         AND ( t.DESC = 'AAA' )
       )
  THEN CASE
         WHEN ( t.PID LIKE '____X%' )
         THEN 'X'
         WHEN ( t.PID LIKE '____Y%' )
         THEN 'Y'
         ELSE NULL
       END
  ELSE NULL
END AS genre
There are other expressions that will return an equivalent result; the example shown here isn't necessarily the best expression. It just serves as a demonstration of combining conditions and nesting CASE expressions.
Note that to return another column model we would need to include another expression in the SELECT list. Similar conditions will need to be repeated; it's not possible to reference the WHEN conditions in another CASE expression.
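As a variation on the above, the shared condition can be computed once in a derived table so that the two CASE expressions stay short. This is a hedged sketch, not the answer's own method: it assumes SQLite via Python's sqlite3, a hypothetical table `t`, and a column renamed to `descr` because DESC is a reserved word. SQLite lets a comparison be selected as a 0/1 column; in Oracle the `is_valid` flag would itself have to be written as a CASE that returns 1 or 0.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (descr TEXT, pid TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [
    ("AAA", "A300X-2x-P"), ("AAA", "A500Y-5x-p"), ("AAA", "A700X-2x-p"),
    ("AAA", "A50CE-2x-P"), ("FOO", "A300X-2x-P"),
])

# The inner query evaluates the shared test once per row as a 0/1 flag;
# each output column still needs its own CASE expression.
query = """
SELECT descr, pid,
       CASE WHEN is_valid THEN SUBSTR(pid, 2, 3) END AS model_number,
       CASE WHEN is_valid AND SUBSTR(pid, 5, 1) IN ('X', 'Y')
            THEN SUBSTR(pid, 5, 1) END AS genre
FROM (
    SELECT descr, pid,
           (descr = 'AAA'
            AND SUBSTR(pid, 2, 3) IN ('300', '500', '700')) AS is_valid
    FROM t
)
"""
result = {(d, p): (m, g) for d, p, m, g in conn.execute(query)}
for key, value in result.items():
    print(key, value)
```

A CASE without an ELSE branch yields NULL, which is what gives the (NULL) cells for the A50CE and FOO rows.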
Based on your sample data, logic such as this would work:
(CASE WHEN REGEXP_LIKE(PID, '^A[0-9]{3}[A-Z]-')
      THEN SUBSTR(PID, 1, 4)
      ELSE PID
 END) AS MODEL,
(CASE WHEN REGEXP_LIKE(PID, '^A[0-9]{3}[A-Z]-')
      THEN SUBSTR(PID, 5, 1)
      ELSE PID
 END) AS GENRE
This assumes that the "model number" always starts with "A" and is followed by three digits (as in your example data). If the model number is more complicated, you may need regexp_substr() to extract the values you want.
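To try the pattern against the sample product ids without an Oracle instance, the logic can be approximated elsewhere. The sketch below assumes SQLite via Python's sqlite3, which has no REGEXP_LIKE; it registers a user-defined `regexp` function instead, which SQLite calls for its `REGEXP` operator. The table name `t` is hypothetical.

```python
import re
import sqlite3

def regexp(pattern, value):
    # SQLite evaluates "value REGEXP pattern" as regexp(pattern, value).
    return value is not None and re.search(pattern, value) is not None

conn = sqlite3.connect(":memory:")
conn.create_function("regexp", 2, regexp)
conn.execute("CREATE TABLE t (pid TEXT)")
conn.executemany("INSERT INTO t VALUES (?)", [
    ("A300X-2x-P",), ("A500Y-5x-p",), ("A700X-2x-p",), ("A50CE-2x-P",),
])

# Same shape as the two CASE expressions above, including ELSE pid.
query = """
SELECT pid,
       CASE WHEN pid REGEXP '^A[0-9]{3}[A-Z]-'
            THEN SUBSTR(pid, 1, 4) ELSE pid END AS model,
       CASE WHEN pid REGEXP '^A[0-9]{3}[A-Z]-'
            THEN SUBSTR(pid, 5, 1) ELSE pid END AS genre
FROM t
"""
parsed = {pid: (model, genre) for pid, model, genre in conn.execute(query)}
for pid, pair in parsed.items():
    print(pid, pair)
```

Note that A50CE-2x-P does not match the pattern, so with ELSE PID both columns simply repeat the full product id for that row.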

SQL: Find rows that match closely but not exactly

I have a table inside a PostgreSQL database with columns c1,c2...cn. I want to run a query that compares each row against a tuple of values v1,v2...vn. The query should not return an exact match but should return a list of rows ordered in descending similarity to the value vector v.
Example:
The table contains sports records:
1,USA,basketball,1956
2,Sweden,basketball,1998
3,Sweden,skating,1998
4,Switzerland,golf,2001
Now when I run a query against this table with v=(Sweden,basketball,1998), I want to get all records that have a similarity with this vector, sorted by number of matching columns in descending order:
2,Sweden,basketball,1998 --> 3 columns match
3,Sweden,skating,1998 --> 2 columns match
1,USA,basketball,1956 --> 1 column matches
Row 4 is not returned because it does not match at all.
Edit: All columns are equally important. Although, when I really think of it... it would be a nice add-on if I could give each column a different weight factor as well.
Is there any possible SQL query that would return the rows in a reasonable amount of time, even when I run it against a million rows?
What would such a query look like?
SELECT * FROM countries
WHERE country = 'Sweden'
OR sport = 'basketball'
OR year = 1998
ORDER BY
cast(country = 'Sweden' AS integer) +
cast(sport = 'basketball' AS integer) +
cast(year = 1998 AS integer) DESC
It's not pretty, but it works: you can cast the boolean expressions to integers and sum them. (String comparison in PostgreSQL is case-sensitive, so the literals have to match the stored values exactly.)
You can easily change the weight of a column by adding a multiplier:
cast(sport = 'basketball' AS integer) * 5 +
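The ranking can be checked against the four sample rows. This is a minimal sketch assuming SQLite via Python's sqlite3 rather than PostgreSQL; in SQLite a comparison already evaluates to 0 or 1, so no cast is needed. The table name `records` and the helper `ranked` are made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER, country TEXT, sport TEXT, yr INTEGER)")
conn.executemany("INSERT INTO records VALUES (?, ?, ?, ?)", [
    (1, "USA", "basketball", 1956), (2, "Sweden", "basketball", 1998),
    (3, "Sweden", "skating", 1998), (4, "Switzerland", "golf", 2001),
])

QUERY = """
SELECT id, score FROM (
    SELECT id,
           (country = :country) * :w_country +
           (sport = :sport) * :w_sport +
           (yr = :yr) * :w_yr AS score
    FROM records
)
WHERE score > 0
ORDER BY score DESC, id
"""

def ranked(w_country=1, w_sport=1, w_yr=1):
    """Return (id, score) pairs for v = (Sweden, basketball, 1998)."""
    params = {"country": "Sweden", "sport": "basketball", "yr": 1998,
              "w_country": w_country, "w_sport": w_sport, "w_yr": w_yr}
    return conn.execute(QUERY, params).fetchall()

print(ranked())            # equal weights
print(ranked(w_sport=5))   # sport counts five times as much
```

Filtering on `score > 0` plays the role of the OR conditions in the WHERE clause: row 4 matches nothing and is dropped.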
This is how I would do it. The multiplication factors in the CASE expressions express the importance (weight) of each match. They ensure that records matching the most heavily weighted columns come out on top, even if the other columns don't match for those records.
/*
-- Initial Setup
-- drop table sport
create table sport (id int, Country varchar(20) , sport varchar(20) , yr int )
insert into sport values
(1,'USA','basketball','1956'),
(2,'Sweden','basketball','1998'),
(3,'Sweden','skating','1998'),
(4,'Switzerland','golf','2001')
select * from sport
*/
select * ,
CASE WHEN Country='sweden' then 1 else 0 end * 100 +
CASE WHEN sport='basketball' then 1 else 0 end * 10 +
CASE WHEN yr=1998 then 1 else 0 end * 1 as Match
from sport
WHERE
country = 'sweden'
OR sport = 'basketball'
OR yr = 1998
ORDER BY Match Desc
It might help if you wrote a stored procedure that calculates a "similarity metric" between two rows. Then your query could refer to the return value of that procedure directly rather than having umpteen conditions in the where-expression and the order-by-expression.
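One way to picture the "similarity metric" idea is a user-defined function that the query calls per row. The sketch below assumes SQLite via Python's sqlite3, where `create_function` stands in for a stored procedure; the function name `similarity`, the table name `records`, and the 100/10/1 weights are illustrative choices, not part of the original answers.

```python
import sqlite3

def similarity(country, sport, yr, want_country, want_sport, want_yr):
    """Weighted count of matching columns (weights are illustrative)."""
    return (100 * (country == want_country)
            + 10 * (sport == want_sport)
            + 1 * (yr == want_yr))

conn = sqlite3.connect(":memory:")
conn.create_function("similarity", 6, similarity)
conn.execute("CREATE TABLE records (id INTEGER, country TEXT, sport TEXT, yr INTEGER)")
conn.executemany("INSERT INTO records VALUES (?, ?, ?, ?)", [
    (1, "USA", "basketball", 1956), (2, "Sweden", "basketball", 1998),
    (3, "Sweden", "skating", 1998), (4, "Switzerland", "golf", 2001),
])

# The metric is computed once per row in the inner query, then reused
# for both filtering and ordering.
query = """
SELECT id, score FROM (
    SELECT id, similarity(country, sport, yr, ?, ?, ?) AS score
    FROM records
)
WHERE score > 0
ORDER BY score DESC
"""
matches = conn.execute(query, ("Sweden", "basketball", 1998)).fetchall()
print(matches)
```

A function like this is opaque to the query planner, so every row still has to be scored; it tidies the query rather than speeding it up.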

Confusing with Having query in sql

I am using SQL Server Management Studio 2012 and have to write a query that shows which subject was failed most often on the first attempt (a fail is Point < 5.0), using this table:
StudentID | SubjectID | First/Second_Time | Point
1         | 02        | 1                 | 5.0
2         | 04        | 2                 | 7.0
3         | 03        | 2                 | 9
... etc
Here is my teacher's query:
SELECT SubjectID
FROM Result -- Result is the name of the table
WHERE [First/Second_Time] = 1 AND Point < 5
GROUP BY SubjectID
HAVING count(point) >= ALL
(
SELECT count(point)
FROM Result
WHERE [First/Second_Time] = 1 AND point < 5
GROUP BY SubjectID
)
I don't understand the purpose of the HAVING clause. Isn't count(point) always >= ALL (select count(point) from Result where [First/Second_Time] = 1 and point < 5 group by SubjectID)?
It also doesn't seem to show which subject the most students failed on their first attempt. Thanks in advance, and sorry for my bad English.
The subquery is returning a list of the number of times a subject was failed (on the first attempt). It might be easier for you to see what it's doing if you run it like this:
SELECT SubjectID, count(point)
FROM Result
WHERE [First/Second_Time] = 1 AND point < 5
GROUP BY SubjectID
So if someone failed math twice and science once, the subquery would return:
2
1
You want to know which subject was failed the most (in this case, which subject was failed 2 or more times, since that is the highest number of failures in your subquery). So you count again (also grouping by subject), and use having to return only subjects with 2 or more failures (greater than or equal to the highest value in your subquery).
SELECT SubjectID
FROM Result
WHERE [First/Second_Time] = 1 AND Point < 5
GROUP BY SubjectID
HAVING count(point)...
See https://msdn.microsoft.com/en-us/library/ms178543.aspx for more examples.
Sounds like you are working on a project for a class, so I'm not even sure I should answer this, but here goes. The question is why the having clause. Have you read the descriptions for having and all ?
All "Compares a scalar value with a single-column set of values".
The scalar value in this case is count(point) or the number of occurrences of a subject id with point less than 5. The single-column set in this case is a list of the number of occurrences of every subject that has less than 5 points.
The net result of the comparison is in the ">=". "All" will only evaluate to true if it is true for every value in the subquery. The subquery returns a set of counts of all subjects meeting the <5 and 1st time requirement. If you have three subjects that meet the <5 and 1st time criteria, and they have a frequency of 1,2,3 times respectively, then the main query will have three "having" results; 1,2,3. Each of the main query results has to be >= each of the subquery results for that main value to evaluate true. So going through step by step, First main value 1 is >= 1, but isn't >= 2 so 1 drops because the "having" is false. Second main value 2 is >=1, is >= 2, but is not >= 3 so it drops. Third value, 3, evaluates true as >= 1, 2, and 3, so you end up returning the subject with the highest frequency.
This is fairly clear in the "remarks" section of the MSDN discussion of "All" keyword, but not as relates to your specific application.
Remember, MSDN is our friend!
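The step-by-step comparison above can be reproduced on a small made-up data set. This sketch assumes SQLite via Python's sqlite3, which does not support the ALL quantifier, so it swaps in an equivalent test: a count is >= all of the counts exactly when it equals the largest one. The column name `Attempt` stands in for [First/Second_Time], and the rows are invented for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Result (StudentID INTEGER, SubjectID TEXT, Attempt INTEGER, Point REAL)"
)
conn.executemany("INSERT INTO Result VALUES (?, ?, ?, ?)", [
    (1, "02", 1, 4.0), (2, "02", 1, 3.5), (3, "02", 1, 2.0),  # three first-time fails
    (4, "03", 1, 4.5), (5, "03", 1, 9.0),                     # one first-time fail
    (6, "04", 2, 1.0), (7, "04", 2, 2.0),                     # second attempts: ignored
])

# What the subquery produces: one failure count per subject.
counts = conn.execute("""
    SELECT SubjectID, COUNT(Point) FROM Result
    WHERE Attempt = 1 AND Point < 5
    GROUP BY SubjectID ORDER BY SubjectID
""").fetchall()

# Stand-in for "HAVING COUNT(Point) >= ALL (...)": keep the groups whose
# count equals the maximum of those per-subject counts.
worst = conn.execute("""
    SELECT SubjectID FROM Result
    WHERE Attempt = 1 AND Point < 5
    GROUP BY SubjectID
    HAVING COUNT(Point) = (
        SELECT MAX(cnt) FROM (
            SELECT COUNT(Point) AS cnt FROM Result
            WHERE Attempt = 1 AND Point < 5
            GROUP BY SubjectID
        )
    )
""").fetchall()
print(counts, worst)
```

Without the HAVING clause both subjects with first-time failures would be returned; with it, only the subject whose count is the largest survives.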

Best way to count this Data

In short I have 2 tables:
USERS:
------------------------
UserID | Name
------------------------
0      | a
1      | b
2      | c
CALLS:
------------------------
ToUser | Result
------------------------
0      | ANSWERED
1      | ENGAGED
1      | ANSWERED
0      | ANSWERED
Etc, etc. (In reality I use a numerical reference for the result.)
I have over 2 million records, each detailing a call to a specific client. Currently I'm using CASE expressions to count each occurrence of a particular result AFTER I have already done the quick total count:
COUNT(DISTINCT l_call_log.line_id),
COALESCE (SUM(CASE WHEN l_call_log.line_result = 1 THEN 1 ELSE NULL END), 0) AS [Answered],
COALESCE (SUM(CASE WHEN l_call_log.line_result = 2 THEN 1 ELSE NULL END), 0) AS [Engaged],
COALESCE (SUM(CASE WHEN l_call_log.line_result = 4 THEN 1 ELSE NULL END), 0) AS [Unanswered]
Am I doing 3 scans of the data after my initial total count? If so, is there a way to do one sweep and count the calls per result in one go?
Thanks.
This would take one full table scan.
EDIT: There's not enough information to answer. Because of the duplicate removal (DISTINCT), which I missed earlier, we can't tell which strategy would be used, especially without knowing the database engine.
In just about every major query engine, each aggregate function is evaluated per column, per row, and it may use a cached result (COUNT(*), for example).
Is line_result indexed? If so, you could use a better query (GROUP BY + COUNT(*)) to take advantage of index statistics, though I'm not sure that's worthwhile, depending on the other tables in your query.
There is the GROUP BY construction in SQL. Try:
SELECT l_call_log.line_result, COUNT(DISTINCT l_call_log.line_id)
FROM l_call_log
GROUP BY l_call_log.line_result
I would guess it's a table scan, since you don't have any depending subqueries. Run explain on the query to be sure.
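Both shapes of the query, plus the "run explain" check, can be tried on a toy table. This is a sketch under assumptions: SQLite via Python's sqlite3, a single table `l_call_log` with made-up rows, and no joins, so the plan it prints says nothing about the real multi-table query or about other engines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE l_call_log (line_id INTEGER, line_result INTEGER)")
conn.executemany("INSERT INTO l_call_log VALUES (?, ?)",
                 [(1, 1), (2, 2), (3, 1), (4, 1), (5, 4)])

# Conditional aggregation: all counters are filled during one pass.
pivot_sql = """
SELECT COUNT(DISTINCT line_id),
       COALESCE(SUM(CASE WHEN line_result = 1 THEN 1 END), 0) AS answered,
       COALESCE(SUM(CASE WHEN line_result = 2 THEN 1 END), 0) AS engaged,
       COALESCE(SUM(CASE WHEN line_result = 4 THEN 1 END), 0) AS unanswered
FROM l_call_log
"""
pivot = conn.execute(pivot_sql).fetchone()

# GROUP BY alternative: one output row per result code.
grouped = conn.execute("""
    SELECT line_result, COUNT(*) FROM l_call_log
    GROUP BY line_result ORDER BY line_result
""").fetchall()

# Ask the engine how it intends to run the pivot query; the last
# column of each plan row is the human-readable step description.
plan = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + pivot_sql)]
print(pivot, grouped, plan)
```

The GROUP BY form returns rows instead of columns, so it keeps working unchanged if new result codes are added later.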