SQL Nearest Neighbor Query (Movie Recommendation Algorithm)

Need help making this (sort of) working query more dynamic.
I have three tables myShows, TVShows and Users
myShows
ID (PK)
User (FK to Users)
Show (FK to TVShows)
Would like to take this query and change it to a stored procedure that I can send a User ID into and have it do the rest...
SELECT showId, name, Count(1) AS no_users
FROM
myShows LEFT OUTER JOIN
tvshows ON myShows.Show = tvshows.ShowId
WHERE
[user] IN (
SELECT [user]
FROM
myShows
WHERE
show ='1' or show='4'
)
AND
show <> '1' and show <> '4'
GROUP BY
showId, name
ORDER BY
no_users DESC
This currently works. The problem is that the values in the WHERE clause (show = '1' or show = '4') and in the AND clause (show <> '1' and show <> '4') are hard-coded, and they need to be dynamic, since I have no idea whether the user has 3 or 30 shows to check against.
Also, how inefficient is this process? This will be used for an iPad application that might get a lot of users. I currently run a movie API (IMDbAPI.com) that gets about 130k hits an hour and had to do a lot of database/code optimization to make it run fast. Thanks again!
If you want the database schema for testing let me know.

This will meet your requirements
select name, count(distinct [user]) from myshows recommend
inner join tvshows on recommend.show = tvshows.showid
where [user] in
(
    -- users who share at least one show with @user
    select other.[user] from
    ( select show from myshows where [user] = @user ) my,
    ( select show, [user] from myshows where [user] <> @user ) other
    where my.show = other.show
)
-- leave out shows that @user already has
and show not in ( select show from myshows where [user] = @user )
group by name
order by count(distinct [user]) desc
If your SQL platform supports WITH Common Table Expressions, the above can be optimized to use them.
Will it be efficient as the data sizes increase? No.
Will it be effective? No. If just one user shares a show with your selected user, and they watch a popular show, then that popular show will rise to the top of the ranking.
I'd recommend
a) reviewing your thinking of what recommends a show
b) periodically calculating the results rather than performing it on demand.
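To see the parameterised query behave on a small data set, here is a minimal sketch that runs the same idea against an in-memory SQLite database from Python. The sample rows are made up, the columns user_id and show_id are stand-ins for [user] and show, and the :uid placeholder plays the role of the stored procedure parameter. It is an illustration under those assumptions, not a tuned implementation.

```python
import sqlite3

# Hypothetical sample data; user_id/show_id stand in for [user]/show.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tvshows (showid INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE myshows (id INTEGER PRIMARY KEY, user_id INTEGER, show_id INTEGER);
    INSERT INTO tvshows VALUES (1, 'Alpha'), (2, 'Beta'), (3, 'Gamma'), (4, 'Delta');
    INSERT INTO myshows (user_id, show_id) VALUES
        (1, 1), (1, 4),          -- the user we recommend for
        (2, 1), (2, 2), (2, 3),  -- shares show 1 with user 1
        (3, 4), (3, 2),          -- shares show 4 with user 1
        (4, 3);                  -- shares nothing with user 1
""")

RECOMMEND_SQL = """
SELECT t.name, COUNT(DISTINCT r.user_id) AS no_users
FROM myshows AS r
JOIN tvshows AS t ON t.showid = r.show_id
WHERE r.user_id IN (
        -- neighbours: other users with at least one show in common
        SELECT other.user_id
        FROM myshows AS mine
        JOIN myshows AS other ON other.show_id = mine.show_id
        WHERE mine.user_id = :uid AND other.user_id <> :uid)
  AND r.show_id NOT IN (SELECT show_id FROM myshows WHERE user_id = :uid)
GROUP BY t.name
ORDER BY no_users DESC, t.name
"""

def recommend(conn, user_id):
    """Return (show name, number of neighbouring users) pairs for one user."""
    return conn.execute(RECOMMEND_SQL, {"uid": user_id}).fetchall()

print(recommend(conn, 1))  # [('Beta', 2), ('Gamma', 1)]
```

The list of shows is never passed in; it is looked up from myshows for the given user, which is what removes the hard-coded '1' and '4'.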

Related

PostgreSQL - Best approach for summarize data

We have data as follows in system
User data
Experience
Education
Job Application
This data will be used across the application, and there is some logic attached to it.
To make sure the data is consistent across the application, I thought of creating a view that returns the counts of these records and then using this view in different places.
The question is: as the detail tables have no relation to each other, how should I create the view?
Create a different view for each table and then use GROUP BY
Create one view and write subqueries to get the data
From a performance perspective, which is the best approach?
For example:
SELECT
UserId,
COUNT(*) AS ExperienceCount,
0 AS EducationCount
FROM User
INNER JOIN Experience ON user_id = User_Id
GROUP BY
UserId
UNION ALL
SELECT
UserId,
0,
COUNT(*)
FROM User
INNER JOIN Education ON user_id = user_id
GROUP BY
UserId
And then group by this to get summary of all these data in one row per user.

One way to write the query that you have specified would probably be:
SELECT UserId, SUM(ExperienceCount), SUM(EducationCount)
FROM ((SELECT UserId, COUNT(*) as ExperienceCount, 0 AS EducationCount
       FROM Experience
       GROUP BY UserId
      ) UNION ALL
      (SELECT UserId, 0, COUNT(*)
       FROM Education
       GROUP BY UserId
      )
     ) u
GROUP BY UserId;
This can also be written using a FULL JOIN, a LEFT JOIN, or correlated subqueries. Each of these can be appropriate in different circumstances, depending on your data.
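As a quick check of the UNION ALL form, the sketch below runs it on an in-memory SQLite database from Python. The table contents (and the title/degree columns) are invented for the example; only UserId and the two table names come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical detail tables; only UserId matters for the counts.
    CREATE TABLE Experience (UserId INTEGER, title TEXT);
    CREATE TABLE Education  (UserId INTEGER, degree TEXT);
    INSERT INTO Experience VALUES (1, 'dev'), (1, 'lead'), (2, 'qa');
    INSERT INTO Education  VALUES (1, 'BSc'), (3, 'MSc');
""")

SUMMARY_SQL = """
SELECT UserId,
       SUM(ExperienceCount) AS ExperienceCount,
       SUM(EducationCount)  AS EducationCount
FROM (
    SELECT UserId, COUNT(*) AS ExperienceCount, 0 AS EducationCount
    FROM Experience
    GROUP BY UserId
    UNION ALL
    SELECT UserId, 0, COUNT(*)
    FROM Education
    GROUP BY UserId
) AS u
GROUP BY UserId
ORDER BY UserId
"""

rows = conn.execute(SUMMARY_SQL).fetchall()
print(rows)  # [(1, 2, 1), (2, 1, 0), (3, 0, 1)]
```

Each detail table is aggregated on its own before the rows are combined, so a user with two experiences and one education is not counted twice. Users with no rows in either table do not appear at all; joining from the user table would be needed to list them with zeros.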

FIRST ORDER BY ... THEN GROUP BY

I have two tables, one stores the users, the other stores the users' email addresses.
table users: (userId, username, etc)
table userEmail: (emailId, userId, email)
I would like to do a query that allows me to fetch the latest email address along with the user record.
I'm basically looking for a query that says
FIRST ORDER BY userEmail.emailId DESC
THEN GROUP BY userEmail.userId
This can be done with:
SELECT
users.userId
, users.username
, (
SELECT
userEmail.email
FROM userEmail
WHERE userEmail.userId = users.userId
ORDER BY userEmail.emailId DESC
LIMIT 1
) AS email
FROM users
ORDER BY users.username;
But this does a subquery for every row and is very inefficient. (It is faster to do 2 separate queries and 'join' them together in my program logic).
The intuitive query to write for what I want would be:
SELECT
users.userId
, users.username
, userEmail.email
FROM users
LEFT JOIN userEmail USING(userId)
GROUP BY users.userId
ORDER BY
userEmail.emailId
, users.username;
But, this does not function as I would like. (The GROUP BY is performed before the sorting, so the ORDER BY userEmail.emailId has nothing to do).
So my question is:
Is it possible to write the first query without making use of the subqueries?
I've searched and read the other questions on stackoverflow, but none seems to answer the question about this query pattern.

"But this does a subquery for every row and is very inefficient"
Firstly, do you have a query plan / timings that demonstrate this? The way you've done it (with the subselect) is pretty much the 'intuitive' way to do it. Many DBMS (though I'm not sure about MySQL) have optimisations for this case, and will have a way to execute the query only once.
Alternatively, you should be able to create a subtable with ONLY (user id, latest email id) tuples and JOIN onto that:
SELECT
users.userId
, users.username
, userEmail.email
FROM users
INNER JOIN
(SELECT userId, MAX(emailId) AS latestEmailId
FROM userEmail GROUP BY userId)
AS latestEmails
ON (users.userId = latestEmails.userId)
INNER JOIN userEmail ON
(latestEmails.latestEmailId = userEmail.emailId)
ORDER BY users.username;
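Here is a small sketch of the derived-table approach, run against SQLite from Python with invented rows. One difference from the query above is an assumption on my part: it uses LEFT JOINs so that a user with no email address still comes back (with a NULL email), which matches what the original correlated subquery returns. With INNER JOINs such users are dropped.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (userId INTEGER PRIMARY KEY, username TEXT);
    CREATE TABLE userEmail (emailId INTEGER PRIMARY KEY, userId INTEGER, email TEXT);
    -- Hypothetical sample data: alice has two addresses, carol has none.
    INSERT INTO users VALUES (1, 'alice'), (2, 'bob'), (3, 'carol');
    INSERT INTO userEmail VALUES
        (1, 1, 'alice@old.example'),
        (2, 2, 'bob@example.com'),
        (3, 1, 'alice@new.example');
""")

LATEST_EMAIL_SQL = """
SELECT u.username, e.email
FROM users AS u
LEFT JOIN (
    -- one row per user: the highest (latest) emailId
    SELECT userId, MAX(emailId) AS latestEmailId
    FROM userEmail
    GROUP BY userId
) AS latest ON latest.userId = u.userId
LEFT JOIN userEmail AS e ON e.emailId = latest.latestEmailId
ORDER BY u.username
"""

rows = conn.execute(LATEST_EMAIL_SQL).fetchall()
print(rows)
# [('alice', 'alice@new.example'), ('bob', 'bob@example.com'), ('carol', None)]
```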

If this is a query you do often, I recommend optimizing your tables to handle this.
I suggest adding an emailId column to the users table. When a user changes their email address, or sets an older email address as the primary email address, update the user's row in the users table to indicate the current emailId
Once you modify your code to do this update, you can go back and update your older data to set emailId for all users.
Alternatively, you can add an email column to the users table, so you don't have to do a join to get a user's current email address.

Complex select query question for hardcore SQL designers

I have been trying to construct a very complex query for a few days with no real success.
I'm using SQL Server 2005 Standard.
What I need is:
5 CampaignVariants from Campaigns, of which 2 are the ones with the largest PPU and 3 are random.
The next condition is that CampaignDailyBudget and CampaignTotalBudget are below what is set in Campaign (the calculation is the number of clicks in the Visitors table connected to Campaigns via the CampaignVariants on which users click).
The next condition is that CampaignLanguage, CampaignCategory, CampaignRegion and CampaignCountry must be the ones I send to this select (languageID, categoryID, regionID and countryID).
The next condition is that the IP address I send to this select statement is not in the IPs list for the current Campaign (I delete IPs that have been inactive for 24 hours).
In other words it gets 5 CampaignVariants for user that enters the site, when i take from user PublisherRegionUID,IP,Language,Country and Region
more details
I get countryID, regionID, ipID, PublisherRegionUID and languageID from the Visitor. These are the filter parameters. I first need to get what the Publisher is about to show on his site by its categories, language and so on, and then I filter all remaining Campaigns by the Visitor's parameters, using all of them besides PublisherRegionUID.
So it has two actual filters: one for what the Publisher wants to publish and another for what the Visitor can view.
campaignDailyBudget and campaignTotalBudget are values set by the Users who create a Campaign. These two are compared to (number of clicks per campaign)*(campaignPPU); for campaignDailyBudget the clicks are filtered by date, from 12:00AM to 11:59PM of today. campaignTotalBudget is not filtered by date, for obvious reasons.
Demo of Stored Procedure
ALTER PROCEDURE dbo.CampaignsGetCampaignVariants4Visitor
    @publisherSiteRegionUID uniqueidentifier,
    @visitorIP varchar(15),
    @browserID tinyint,
    @countryID tinyint,
    @osID tinyint,
    @languageID tinyint,
    @acceptsCookies bit
AS
BEGIN
    SET NOCOUNT ON;
    -- check if such @publisherSiteRegionUID exists
    if exists(select publisherSiteRegionID from PublisherSiteRegions where publisherSiteRegionUID=@publisherSiteRegionUID)
    begin
        declare @publisherSiteRegionID int
        select @publisherSiteRegionID = publisherSiteRegionID from PublisherSiteRegions where publisherSiteRegionUID=@publisherSiteRegionUID
        -- get CampaignVariants
        -- ** choose 2 highest PPU and 3 random CampaignVariants from Campaigns list
        -- where regionID,countryID,categoryID,languageID meets Publisher and Visitor requirements
        -- and Campaign.campaignDailyBudget<(sum of Clicks in Visitors per this Campaign)*Campaign.PPU during this day
        -- and Campaign.campaignTotalBudget<(sum of Clicks in Visitors per this Campaign)*Campaign.PPU
        -- and @visitorID does not appear in Campaigns2IPs with this Campaign
        -- insert visitor
        insert into Visitors (ipAddress,browserID,countryID,languageID,OSID,acceptsCookies)
        values (@visitorIP,@browserID,@countryID,@languageID,@osID,@acceptsCookies)
        declare @visitorID int
        select @visitorID = IDENT_CURRENT('Visitors')
        -- add IP to pool Campaigns ** adding ip to all Campaigns whose CampaignVariants were chosen
        -- add PublisherRegion2Visitor relationship
        insert into PublisherSiteRegions2Visitors values (@visitorID,@publisherSiteRegionID)
        -- add CampaignVariant2Visitor relationship
    end
END
GO

I also make a number of assumptions about your oblique requirements. I’ll spell them out as I go along, along with explaining the code. Please note that I of course have no reasonable way of testing this code for typos or minor logic errors.
It might be possible to write this as a single ginormous query, but that would be awkward, ugly, and prone to performance issues as the SQL optimizer can have problems building plans for overly-large queries. An option would be to write it as a series of queries, populating temp tables for use in subsequent queries (which allows for much simpler debugging). I chose to write this as a large common table expression statement with a series of CTE tables, largely because it kind of “flows” better that way, and it'd probably perform better than the many-temp-tables version.
First assumption: there are several circular references in there. Campaign has links to both Countries and Regions, so both of these parameter values must be checked, even though based on the table link from Countries to Region, this filter could possibly be simplified to just a check on Country (assuming that the country parameter value is always “in” the region parameter). The same applies to Language and Category, and perhaps to IPs and Visitors. This appears to be sloppy design; if it can be cleared up, or if assumptions on the validity of the data can be made, the query could be simplified.
Second assumption: Parameters are passed in as variables in the form of @Region, @Country, etc. Also, there is only one IP address being passed in; if not, then you’ll need to pass in multiple values, set up a temp table containing those values, and add that as a filter where I use the @IP parameter.
So, step 1 is a first pass identifying “eligible” campaigns, by pulling out all those that share the desired country, region, language, category, and that do not have the one IP address associated with them:
WITH cteEligibleCampaigns (CampaignId)
as (select CampaignId
     from Campaigns2Regions
     where RegionId = @RegionId
    intersect select CampaignId
     from Campaign2Countries
     where CountryId = @CountryId
    intersect select CampaignId
     from Campaign2Languages
     where LanguageId = @LanguageId
    intersect select CampaignId
     from Campaign2Categories
     where CategoryId = @CategoryId
    except select CampaignId
     from Campaigns2IPs
     where IPID = @IPId)
Next up, from these filter out those items where “CampaignDailyBudget and CampaignTotalBudget are below what is set in Campaign ( calculation is number of clicks in Visitors table connected to Campaigns via CampaignVariants on which users click)”. This requirement is not entirely clear to me. I have chosen to interpret it as “only include those campaigns where, if you count the number of visitors for those campaign’s CampaignVariants, the total count is less than both CampaignDailyBudget and CampaignTotalBudget”. Note that here I introduce a random value, used later on in selecting random rows.
,cteTargetCampaigns (CampaignId, RandomNumber)
as (select ec.CampaignId, checksum(newid()) RandomNumber
     from cteEligibleCampaigns ec
      inner join Campaigns ca
       on ca.CampaignId = ec.CampaignId
      inner join CampaignVariants cv
       on cv.CampaignId = ec.CampaignId
      inner join CampaignVariants2Visitors cvv
       on cvv.CampaignVariantId = cv.CampaignVariantId
     -- the budget columns must be grouped to be usable in HAVING
     group by ec.CampaignId, ca.CampaignDailyBudget, ca.CampaignTotalBudget
     having count(*) < ca.CampaignDailyBudget
      and count(*) < ca.CampaignTotalBudget)
Next up, identify the two “best” items.
,cteTopTwo (CampaignId, Ranking)
as (select tc.CampaignId, row_number() over (order by ca.CampaignPPU desc)
     from cteTargetCampaigns tc
      inner join Campaigns ca
       on ca.CampaignId = tc.CampaignId)
Next, line up all other campaigns by the randomly assigned number:
,cteRandom (CampaignId, Ranking)
as (select CampaignId, row_number() over (order by RandomNumber)
from cteTargetCampaigns
where CampaignId not in (select CampaignId
from cteTopTwo
where Ranking < 3))
And, at last, pull the data sets together:
select CampaignId
from cteTopTwo
where Ranking <= 2
union all select CampaignId
from cteRandom
where Ranking <= 3
Lump the above sections of code together, debug typos, invalid assumptions, and missed requirements (such as order or flags identifying the top two items from the random ones), and you should be good.
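To make the "two best plus three random" step concrete, here is a stripped-down sketch run on SQLite from Python. It is not the SQL Server code above: the campaigns table and its columns are hypothetical, the eligibility and budget filters are left out, and random() with LIMIT replaces the newid()-based ranking. It also assumes a SQLite build with window functions (3.25 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: ten campaigns, PPU grows with the id.
conn.execute("CREATE TABLE campaigns (campaign_id INTEGER PRIMARY KEY, ppu REAL)")
conn.executemany(
    "INSERT INTO campaigns VALUES (?, ?)",
    [(i, i * 0.5) for i in range(1, 11)],
)

PICK_SQL = """
WITH ranked AS (
    SELECT campaign_id,
           ROW_NUMBER() OVER (ORDER BY ppu DESC, campaign_id) AS ppu_rank
    FROM campaigns
)
SELECT campaign_id, 'top' AS pick
FROM ranked
WHERE ppu_rank <= 2
UNION ALL
SELECT campaign_id, 'random' AS pick
FROM (
    -- everything outside the top two, shuffled, first three kept
    SELECT campaign_id
    FROM ranked
    WHERE ppu_rank > 2
    ORDER BY random()
    LIMIT 3
)
"""

picks = conn.execute(PICK_SQL).fetchall()
top_ids = sorted(cid for cid, pick in picks if pick == "top")
random_ids = [cid for cid, pick in picks if pick == "random"]
print(top_ids, random_ids)
```

The 'pick' column is the kind of flag mentioned above for telling the top two apart from the random three. Because the random rows are drawn only from ranks above 2, the two groups cannot overlap.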

I'm not sure I understand this portion of your post:
"it gets 5 CampaignVariants for user that enters the site, when i take from user PublisherRegionUID,IP,Language,Country and Region"
I'm assuming "it" is the query. Is the user, given your second "Next condition", the IP? What does "when I take from user" mean? Does that mean that is the information you have at the time you execute your query, or is that information you returned from your query? If the latter, then there are a host of questions that would need to be answered, since many of those columns are part of a Many:Many relationship.
Regardless, below is a means to get the 5 campaigns where, according to your second "Next condition", you have an IP address that you want to filter out. I'm also assuming that you want five campaigns total, which means that the three random ones cannot include the two "highest PPU" ones.
With
ValidCampaigns As
(
    Select C.campaignId, C.campaignPPU
    From Campaigns As C
    Left Join (Campaigns2IPs As CIP
        Join IPs
            On IPs.ipID = CIP.ipID
            And IPs.ipAddress = @IPAddress)
        On CIP.campaignId = C.campaignId
    Where CIP.campaignID Is Null
)
, CampaignPPURanks As
(
    Select C.campaignId
        , Row_Number() Over ( Order By C.campaignPPU desc ) As ItemRank
    From ValidCampaigns As C
)
, RandomRanks As
(
    Select C.campaignId
        , Row_Number() Over ( Order By newid() desc ) As ItemRank
    From ValidCampaigns As C
    Left Join CampaignPPURanks As CR
        On CR.campaignId = C.campaignId
        And CR.ItemRank <= 2
    Where CR.campaignId Is Null
)
Select ...
From CampaignPPURanks As CPR
Join CampaignVariants As CV
    On CV.campaignId = CPR.campaignId
    And CPR.ItemRank <= 2
Union All
Select ...
From RandomRanks As RR
Join CampaignVariants As CV
    On CV.campaignId = RR.campaignId
    And RR.ItemRank <= 3

How to match/compare values in two resultsets in SQL Server 2008?

I'm working on an employee booking application. I've got two different entities, Projects and Users, that are both assigned a variable number of Skills.
I've got a Skills table with the various skills (columns: id, name)
I register the user skills in a table called UserSkills (with two foreign key columns: fk_user and fk_skill)
I register the project skills in another table called ProjectSkills (with two foreign key columns: fk_project and fk_skill).
A project can require maybe 6 different skills, and users set up their Skills as well when registering.
The tricky part is when I have to find users for my Projects based on their skills. I'm only interested in users that have ALL the skills required by the project. Users are of course allowed to have more skills than required.
The following code will not work, (and even if it did, would not be very performance friendly), but it illustrates my idea:
SELECT * FROM Users u WHERE
( SELECT us.fk_skill FROM UserSkills us WHERE us.fk_user = u.id )
>=
( SELECT ps.fk_skill FROM ProjectSkills ps WHERE ps.fk_project = [some_id] )
I'm thinking about making my own function that takes two TABLE-variables and then works out the comparison in that (kind of a modified IN-function), but I'd rather find a solution that's more performance friendly.
I'm developing on SQL Server 2008.
I really appreciate any ideas or suggestions on this. Thanks!

SELECT *
FROM Users u
WHERE NOT EXISTS
(
    -- a required skill ...
    SELECT NULL
    FROM ProjectSkills ps
    WHERE ps.fk_project = @someid
    AND NOT EXISTS
    (
        -- ... that this user does not have
        SELECT NULL
        FROM UserSkills us
        WHERE us.fk_user = u.id
        AND us.fk_skill = ps.fk_skill
    )
)

-- Assumes existence of variable @ProjectId, specifying
-- which project to analyze
SELECT us.UserId
 from UserSkills us
  inner join ProjectSkills ps
   on ps.SkillId = us.SkillId
    and ps.ProjectId = @ProjectId
 group by us.UserId
 having count(*) = (select count(*)
                     from ProjectSkills
                     where ProjectId = @ProjectId)
You'd want to test and debug this, as I have no test data to run it through. Ditto for indexing to optimize it.
(Now to post, and see if someone's come up with a better way--there should be something more subtle and effective than this.)
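For comparison, here is a small sketch that runs both approaches (the double NOT EXISTS and the grouped count) side by side on SQLite from Python. The sample rows and the Users.name column are invented, and both queries are written against the fk_user / fk_skill / fk_project column names from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE UserSkills (fk_user INTEGER, fk_skill INTEGER);
    CREATE TABLE ProjectSkills (fk_project INTEGER, fk_skill INTEGER);
    -- Hypothetical data: project 100 requires skills 10 and 20.
    INSERT INTO Users VALUES (1, 'ann'), (2, 'ben'), (3, 'cat');
    INSERT INTO UserSkills VALUES (1, 10), (1, 20), (1, 30), (2, 10), (3, 10), (3, 20);
    INSERT INTO ProjectSkills VALUES (100, 10), (100, 20);
""")

# "There is no required skill that the user lacks."
ALL_SKILLS_NOT_EXISTS = """
SELECT u.name
FROM Users AS u
WHERE NOT EXISTS (
    SELECT 1 FROM ProjectSkills AS ps
    WHERE ps.fk_project = :pid
      AND NOT EXISTS (
          SELECT 1 FROM UserSkills AS us
          WHERE us.fk_user = u.id AND us.fk_skill = ps.fk_skill))
ORDER BY u.name
"""

# "The user matches as many required skills as the project has."
ALL_SKILLS_COUNT = """
SELECT u.name
FROM Users AS u
JOIN UserSkills AS us ON us.fk_user = u.id
JOIN ProjectSkills AS ps
  ON ps.fk_skill = us.fk_skill AND ps.fk_project = :pid
GROUP BY u.id, u.name
HAVING COUNT(*) = (SELECT COUNT(*) FROM ProjectSkills WHERE fk_project = :pid)
ORDER BY u.name
"""

def names(sql, project_id):
    return [row[0] for row in conn.execute(sql, {"pid": project_id})]

print(names(ALL_SKILLS_NOT_EXISTS, 100))  # ['ann', 'cat']
print(names(ALL_SKILLS_COUNT, 100))       # ['ann', 'cat']
```

The two agree here, but they differ for a project with no required skills: the NOT EXISTS form returns every user, while the count form returns nobody because the join produces no rows to group. The count form also assumes that a (user, skill) or (project, skill) pair is never stored twice.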

Aggregation with two Joins (MySQL)

I have one table called gallery. For each row in gallery there are several rows in the table picture. One picture belongs to one gallery. Then there is the table vote. There each row is an upvote or a downvote for a certain gallery.
Here is the (simplified) structure:
gallery ( gallery_id )
picture ( picture_id, picture_gallery_ref )
vote ( vote_id, vote_value, vote_gallery_ref )
Now I want one query to give me the following information: all galleries with their own data fields, the number of pictures that are connected to each gallery, and the summed value of its votes.
Here is my query, but due to the multiple joining the aggregated values are not the right ones. (At least when there is more than one row of either pictures or votes.)
SELECT
*, SUM( vote_value ) as score, COUNT( picture_id ) AS pictures
FROM
gallery
LEFT JOIN
vote
ON gallery_id = vote_gallery_ref
LEFT JOIN
picture
ON gallery_id = picture_gallery_ref
GROUP BY gallery_id
Because I have noticed that COUNT( DISTINCT picture_id ) gives me the correct number of pictures I tried this:
( SUM( vote_value ) / GREATEST( COUNT( DISTINCT picture_id ), 1 ) ) AS score
It works in this example, but what if there were more joins in one query?
Just want to know whether there is a better or more 'elegant' way this problem can be solved. Also I'd like to know whether my solution is MySQL-specific or standard SQL?

This quote from William of Ockham applies here:
Entia non sunt multiplicanda praeter necessitatem
(Latin for "entities are not to be multiplied beyond necessity").
You should reconsider why you need this to be done in a single query. It's true that a single query has less overhead than multiple queries, but if the nature of that single query becomes too complex, both for you to develop and for the RDBMS to execute, then run separate queries.

Or just use subqueries...
I don't know if this is valid MySQL syntax, but you might be able to do something similar to:
SELECT
    gallery.*, a.score, b.pictures
FROM
    gallery
LEFT JOIN
(
    select vote_gallery_ref, sum(vote_value) as score
    from vote
    group by vote_gallery_ref
) a ON gallery_id = vote_gallery_ref
LEFT JOIN
(
    select picture_gallery_ref, count(picture_id) as pictures
    from picture
    group by picture_gallery_ref
) b ON gallery_id = picture_gallery_ref
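To show why the double join inflates the totals and how pre-aggregating avoids it, here is a sketch on SQLite run from Python (not MySQL, so treat it as an illustration of the idea only). The sample rows are invented: gallery 1 has three pictures and two upvotes, gallery 2 has no pictures and one downvote.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE gallery (gallery_id INTEGER PRIMARY KEY);
    CREATE TABLE picture (picture_id INTEGER PRIMARY KEY, picture_gallery_ref INTEGER);
    CREATE TABLE vote (vote_id INTEGER PRIMARY KEY, vote_value INTEGER, vote_gallery_ref INTEGER);
    INSERT INTO gallery VALUES (1), (2);
    INSERT INTO picture VALUES (1, 1), (2, 1), (3, 1);
    INSERT INTO vote VALUES (1, 1, 1), (2, 1, 1), (3, -1, 2);
""")

# Joining both child tables directly pairs every vote with every picture.
NAIVE_SQL = """
SELECT gallery_id, SUM(vote_value) AS score, COUNT(picture_id) AS pictures
FROM gallery
LEFT JOIN vote ON gallery_id = vote_gallery_ref
LEFT JOIN picture ON gallery_id = picture_gallery_ref
GROUP BY gallery_id
ORDER BY gallery_id
"""

# Aggregating each child table first leaves at most one row per gallery to join.
PREAGGREGATED_SQL = """
SELECT g.gallery_id,
       COALESCE(a.score, 0)    AS score,
       COALESCE(b.pictures, 0) AS pictures
FROM gallery AS g
LEFT JOIN (
    SELECT vote_gallery_ref, SUM(vote_value) AS score
    FROM vote GROUP BY vote_gallery_ref
) AS a ON g.gallery_id = a.vote_gallery_ref
LEFT JOIN (
    SELECT picture_gallery_ref, COUNT(picture_id) AS pictures
    FROM picture GROUP BY picture_gallery_ref
) AS b ON g.gallery_id = b.picture_gallery_ref
ORDER BY g.gallery_id
"""

naive = conn.execute(NAIVE_SQL).fetchall()
fixed = conn.execute(PREAGGREGATED_SQL).fetchall()
print(naive)  # [(1, 6, 6), (2, -1, 0)]
print(fixed)  # [(1, 2, 3), (2, -1, 0)]
```

For gallery 1 the naive join yields 3 x 2 = 6 rows, so both aggregates come out as 6. The COALESCE calls are my addition so that galleries without votes or pictures show 0 instead of NULL.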

How often do you add/change vote records?
How often do you add/remove picture records?
How often do you run this query for these totals?
It might be better to create total fields on the gallery table (total_pictures, total_votes, total_vote_values).
When you add or remove a record on the picture table you also update the total on the gallery table. This could be done using triggers on the picture table to automatically update the gallery table. It could also be done using a transaction combining two SQL statements to update the picture table and the gallery table. When you add a record on the picture table increment the total_pictures field on the gallery table. When you delete a record on the picture table decrement the total_pictures field.
Similarly, when a vote record is added or removed, or its vote_value changes, you update the total_votes and total_vote_values fields. Adding a record increments the total_votes field and adds vote_value to total_vote_values. Deleting a record decrements the total_votes field and subtracts vote_value from total_vote_values. Updating vote_value on a vote record should also update total_vote_values with the difference (subtract old value, add new value).
Your query now becomes trivial - it's just a straightforward query from the gallery table. But this is at the expense of more complex updates to the picture and vote tables.
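The trigger variant could look roughly like the sketch below. It uses SQLite trigger syntax run from Python, which differs from MySQL's, so take it as an outline of the bookkeeping rather than code to paste. The trigger names are made up; the total_* columns are the ones suggested above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE gallery (
        gallery_id        INTEGER PRIMARY KEY,
        total_pictures    INTEGER NOT NULL DEFAULT 0,
        total_votes       INTEGER NOT NULL DEFAULT 0,
        total_vote_values INTEGER NOT NULL DEFAULT 0
    );
    CREATE TABLE picture (picture_id INTEGER PRIMARY KEY, picture_gallery_ref INTEGER);
    CREATE TABLE vote (vote_id INTEGER PRIMARY KEY, vote_value INTEGER, vote_gallery_ref INTEGER);

    -- Keep total_pictures in step with the picture table.
    CREATE TRIGGER picture_added AFTER INSERT ON picture BEGIN
        UPDATE gallery SET total_pictures = total_pictures + 1
        WHERE gallery_id = NEW.picture_gallery_ref;
    END;
    CREATE TRIGGER picture_removed AFTER DELETE ON picture BEGIN
        UPDATE gallery SET total_pictures = total_pictures - 1
        WHERE gallery_id = OLD.picture_gallery_ref;
    END;

    -- Keep the vote count and the vote sum in step with the vote table.
    CREATE TRIGGER vote_added AFTER INSERT ON vote BEGIN
        UPDATE gallery
        SET total_votes = total_votes + 1,
            total_vote_values = total_vote_values + NEW.vote_value
        WHERE gallery_id = NEW.vote_gallery_ref;
    END;
    CREATE TRIGGER vote_changed AFTER UPDATE OF vote_value ON vote BEGIN
        UPDATE gallery
        SET total_vote_values = total_vote_values - OLD.vote_value + NEW.vote_value
        WHERE gallery_id = NEW.vote_gallery_ref;
    END;
    CREATE TRIGGER vote_removed AFTER DELETE ON vote BEGIN
        UPDATE gallery
        SET total_votes = total_votes - 1,
            total_vote_values = total_vote_values - OLD.vote_value
        WHERE gallery_id = OLD.vote_gallery_ref;
    END;
""")

conn.execute("INSERT INTO gallery (gallery_id) VALUES (1)")
conn.executemany("INSERT INTO picture VALUES (?, 1)", [(1,), (2,), (3,), (4,)])
conn.execute("DELETE FROM picture WHERE picture_id = 4")
conn.executemany("INSERT INTO vote VALUES (?, ?, 1)", [(1, 1), (2, 1), (3, -1)])
conn.execute("UPDATE vote SET vote_value = 1 WHERE vote_id = 3")
conn.execute("DELETE FROM vote WHERE vote_id = 1")

totals = conn.execute(
    "SELECT total_pictures, total_votes, total_vote_values FROM gallery WHERE gallery_id = 1"
).fetchone()
print(totals)  # (3, 2, 2)
```

This sketch does not handle a picture or vote being moved to a different gallery; that case would need the old gallery decremented and the new one incremented.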

As Bill Karwin said, doing this all within one query is pretty ugly.
But, if you have to do it, joining and selecting non-aggregate data with aggregate data requires joining against subqueries (I haven't used SQL that much in the past few years so I actually forgot the proper term for this).
Let's assume your gallery table has additional fields name and state:
select g.gallery_id, g.name, g.state, i.num_pictures, j.sum_vote_values
from gallery g
inner join (
select g.gallery_id, count(p.picture_id) as 'num_pictures'
from gallery g
left join picture p on g.gallery_id = p.picture_gallery_ref
group by g.gallery_id) as i on g.gallery_id = i.gallery_id
left join (
select g.gallery_id, sum(v.vote_value) as 'sum_vote_values'
from gallery g
left join vote v on g.gallery_id = v.vote_gallery_ref
group by g.gallery_id
) as j on g.gallery_id = j.gallery_id
This will yield a result set that looks like:
gallery_id, name, state, num_pictures, sum_vote_values
1, 'Gallery A', 'NJ', 4, 19
2, 'Gallery B', 'NY', 3, 32
3, 'Empty gallery', 'CT', 0,
