SQL: get records based on user likes of other records

I'm trying to write an SQL (Windows server) query that will provide some results based on what other users like.
It is a bit like on Amazon when it says 'Users who bought this also bought...'
It is based on the vote field, where a vote of '1' means a user liked a record; or a vote of '0' means they disliked it.
So when a user is on a particular record, I want to list 3 other records that users who liked the current record also liked.
Snippet of the relevant table:
ID  UserID  RecordID     Vote  DateAdded
16  9999    12013011290  1     2008-11-11 13:23:44.000
17  8888    12013011290  0     2008-11-11 13:23:44.000
18  7777    12013011290  0     2008-11-11 13:23:44.000
20  4930    12013011290  1     2013-11-19 15:04:06.263
I think this requires ordering by a sub-select, but I'm not sure. Can anyone advise whether this is possible and, if so, how? Thanks.
p.s.
To maintain the quality of the results I think it would be extra useful to filter by DateAdded. That is,
- 'user x' is seeing recommended records about 'record z'
- 'user y' is someone who has liked 'record z' and 'record a'
- only count 'user y's' like of 'record a' IF they liked 'record a' an HOUR before or after they liked 'record z'
- in other words, only count the 'record a's' like if it was during the same website-browsing session as 'record z'
Hope this makes sense!

Something like this?
select r.description
from record r
join (
    select top 3 v.recordid
    from votes v
    where v.vote = 1
      and v.recordid != 123456789
      and v.userid in (
          select userid from votes where recordid = 123456789 and vote = 1
      )
    order by v.dateadded desc
) as x on x.recordid = r.id
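The one-hour window from the question's p.s. could be added to a query of this shape with a self-join on the votes table. Below is a rough sketch, not a tested production query: it runs against an in-memory SQLite database through Python's sqlite3 module, the sample rows are invented, and julianday() is SQLite-specific (on SQL Server the comparison would be written with DATEDIFF instead).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE votes (id INTEGER PRIMARY KEY, userid INTEGER, "
    "recordid INTEGER, vote INTEGER, dateadded TEXT)"
)
# Invented sample data, for illustration only.
conn.executemany(
    "INSERT INTO votes (userid, recordid, vote, dateadded) VALUES (?, ?, ?, ?)",
    [
        (1, 100, 1, "2013-11-19 15:00:00"),  # user 1 likes the current record
        (1, 200, 1, "2013-11-19 15:30:00"),  # 30 minutes later: counts
        (1, 300, 1, "2013-11-21 09:00:00"),  # two days later: ignored
        (2, 100, 1, "2013-11-19 16:00:00"),
        (2, 200, 1, "2013-11-19 16:20:00"),  # 20 minutes later: counts
        (2, 400, 0, "2013-11-19 16:25:00"),  # a dislike: ignored
        (3, 100, 0, "2013-11-19 17:00:00"),  # user 3 disliked the current record
        (3, 500, 1, "2013-11-19 17:10:00"),  # so this like is ignored
    ],
)

CURRENT_RECORD = 100

rows = conn.execute(
    """
    SELECT other.recordid, COUNT(DISTINCT other.userid) AS likers
    FROM votes AS cur
    JOIN votes AS other
      ON other.userid = cur.userid
     AND other.recordid <> cur.recordid
     AND other.vote = 1
     -- julianday() returns days, so 1.0 / 24 is one hour
     AND ABS(julianday(other.dateadded) - julianday(cur.dateadded)) <= 1.0 / 24
    WHERE cur.recordid = ? AND cur.vote = 1
    GROUP BY other.recordid
    ORDER BY likers DESC, other.recordid
    LIMIT 3
    """,
    (CURRENT_RECORD,),
).fetchall()

print(rows)
```

Grouping by record and counting distinct users also avoids returning the same record several times, which a plain top 3 ordered by date can do.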

A method I used for the basic version of this problem does indeed use multiple selects: figure out which users liked a specific item, then query further on what else they liked.
with Likers as
(select user_id from likes where content_id = 10)
select count(user_id) as like_count, content_id
from likes
natural join likers
where content_id <> 10
group by content_id
order by like_count desc;
(Tested using Sqlite3)
What you will receive is a list of items that were liked by at least one of the users who liked item 10, ordered by the number of likes (within the search domain). I would probably want to limit this as well, since on a larger dataset it's likely to result in a large number of stray items with only one or two similar likes that are in turn buried under items with hundreds of likes.
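As a rough illustration of that limiting step, here is a sketch that adds a minimum like count and a row limit to the same query. It assumes a likes(user_id, content_id) table like the one above; the sample rows, the threshold of 2 and the limit of 3 are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE likes (user_id INTEGER, content_id INTEGER)")
# Invented sample data: users 1-3 liked item 10, user 4 did not.
conn.executemany(
    "INSERT INTO likes VALUES (?, ?)",
    [
        (1, 10), (1, 20), (1, 30),
        (2, 10), (2, 20), (2, 40),
        (3, 10), (3, 20), (3, 30),
        (4, 50),
    ],
)

ITEM = 10       # the item currently being viewed
MIN_LIKES = 2   # hypothetical threshold to drop stray one-off matches
MAX_ROWS = 3    # how many recommendations to show

rows = conn.execute(
    """
    WITH likers AS (SELECT user_id FROM likes WHERE content_id = ?)
    SELECT content_id, COUNT(user_id) AS like_count
    FROM likes
    JOIN likers USING (user_id)
    WHERE content_id <> ?
    GROUP BY content_id
    HAVING COUNT(user_id) >= ?
    ORDER BY like_count DESC, content_id
    LIMIT ?
    """,
    (ITEM, ITEM, MIN_LIKES, MAX_ROWS),
).fetchall()

print(rows)
```

Item 40 has only one shared like here, so the HAVING clause drops it before the limit is applied.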
I suspect the reason you are checking timestamps in the first place is so that if somebody likes laundry detergent, then comes back two days later to like a movie, the system would not associate "people who like Epic Shootout 17 also like Clean More."
I would not recommend using date arithmetic for this. I might suggest creating another table to represent individual "sessions" and using the session_id for this task. Since there are (hopefully!) many, many like records on your database, you want to reduce the amount of work you are making it do. You can also use this session_id for logging any other actions a person did (for analytics purposes.) It is also computationally cheaper to ask for all things that happened within a session with a simple index and identity comparison than to perform date computations on potentially millions of records.
For reference, Piwik defines a new session as thirty minutes since the last action taken.
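To make the session idea concrete, here is a small sketch of how it might look. The sessions table, the session_id column on likes and the index name are all assumptions for illustration; how sessions get created (for example after thirty minutes of inactivity) is left to the application.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical schema: one row per browsing session.
    CREATE TABLE sessions (
        session_id INTEGER PRIMARY KEY,
        user_id    INTEGER,
        started_at TEXT
    );
    CREATE TABLE likes (
        user_id    INTEGER,
        content_id INTEGER,
        session_id INTEGER REFERENCES sessions (session_id)
    );
    -- A plain index lets the session match be a simple equality lookup.
    CREATE INDEX likes_by_session ON likes (session_id, content_id);
    """
)
conn.executemany(
    "INSERT INTO sessions VALUES (?, ?, ?)",
    [
        (1, 1, "2013-11-19 15:00:00"),
        (2, 1, "2013-11-21 09:00:00"),  # same user, two days later
        (3, 2, "2013-11-19 16:00:00"),
    ],
)
conn.executemany(
    "INSERT INTO likes VALUES (?, ?, ?)",
    [
        (1, 10, 1), (1, 20, 1),              # liked together in session 1
        (1, 30, 2),                          # liked in a later session
        (2, 10, 3), (2, 20, 3), (2, 40, 3),  # liked together in session 3
    ],
)

rows = conn.execute(
    """
    SELECT other.content_id, COUNT(*) AS like_count
    FROM likes AS cur
    JOIN likes AS other
      ON other.session_id = cur.session_id
     AND other.content_id <> cur.content_id
    WHERE cur.content_id = ?
    GROUP BY other.content_id
    ORDER BY like_count DESC, other.content_id
    """,
    (10,),
).fetchall()

print(rows)
```

Item 30 is not associated with item 10 because the same user liked it in a different session, and no date arithmetic is needed to find that out.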

Related

SQL question with attempt on customer information

Schema
Question: List all paying customers with users who had 4 or 5 activities during the week of February 15, 2021; also include how many of the activities sent were paid, organic and/or app store. (i.e. include a column for each of the three source types).
My attempt so far:
SELECT source_type, COUNT(*)
FROM activities
WHERE activity_time BETWEEN '02-15-21' AND '02-19-21'
GROUP BY source_type
I would like to get a second opinion on it. I didn't include the accounts table because I don't believe that I need it for this query, but I could be wrong.
Have you tried to run this? It doesn't satisfy the brief on FOUR counts:
List all the ... customers (that match criteria)
There is no customer information included in the results at all, so this is an outright fail.
paying customers
This is the top level criteria, only customers that are not free should be included in the results.
Criteria: users who had 4 or 5 activities
There has been no attempt to evaluate this user criteria in the query, and the results do not provide enough information to deduce it.
There is further ambiguity in this requirement: does it mean that results should only be included if the account has individual users with 4 or 5 activities, or simply that the account should have 4 or 5 activities overall?
If this is a test question (clearly this is contrived, if it is not please ask for help on how to design a better schema) then the use of the term User is usually very specific and would suggest that you need to group by or otherwise make specific use of this facet in your query.
Bonus: (i.e. include a column for each of the three source types).
This is the only element that was attempted, as the data is grouped by source_type but the information cannot be correlated back to any specific user or customer.
Next time please include example data and the expected outcome with your post. In preparing the data for this post you would have come across these issues yourself and may have been inspired to ask a different question, or through the process of writing the post up you may have resolved the issue yourself.
Without further clarification, we can still start to evolve this query. A good place to start is to set the criteria aside and focus on the format of the output. The requirement mentions the following output requirements:
List Customers
Include a column for each of the source types.
Firstly, even though you don't think you need to, the request clearly states that Customer is an important facet in the output, and in your schema account holds the customer information, so although we do not need to, it makes the data readable by humans if we do include information from the account table.
This is a standard PIVOT style response then: we want a row for each customer, presenting a count that aggregates each of the values for source_type. Many RDBMS support some variant of a PIVOT operator or function. However, we can achieve the same thing with simple CASE expressions that conditionally put a value into projected columns in the result set, matching the values we want to aggregate; then we can use GROUP BY to evaluate the aggregation, in this case a COUNT.
The following syntax is for MS SQL, however you can achieve something similar easily enough in other RDBMS.
OP please tag this question with your preferred database engine...
NOTE: there is NO filtering in this query... yet
SELECT accounts.company_id
, accounts.company_name
, paid = COUNT(st_paid)
, organic = COUNT(st_organic)
, app_store = COUNT(st_app_store)
FROM activities
INNER JOIN accounts ON activities.company_id = accounts.company_id
-- PIVOT the source_type
CROSS APPLY (SELECT st_paid = CASE source_type WHEN 'paid' THEN 1 END
,st_organic = CASE source_type WHEN 'organic' THEN 1 END
,st_app_store = CASE source_type WHEN 'app store' THEN 1 END
) as PVT
GROUP BY accounts.company_id, accounts.company_name
This results in the following shape of result:
company_id  company_name  paid  organic  app_store
apl01       apples        4     8        0
ora01       oranges       6     12       0
Criteria
When you are happy with the shape of the results and that all the relevant information is available, it is time to apply the criteria to filter this data.
From the requirement, the following criteria can be identified:
paying customers
The spec doesn't mention paying specifically, but it does include a note that (free customers have current_mrr = 0)
Now aren't we glad we did join on the account table :)
users who had 4 or 5 activities
This is very specific about explicitly 4 or 5 activities, no more, no less.
For the sake of simplicity, let's assume that the user facet of this requirement is not important and that it is simply a reference to all users on an account, not just users who have individually logged 4 or 5 activities on their own. Proving the per-user version would require more demo data than I care to manufacture right now.
during the week of February 15, 2021.
This one was correctly identified in the original post, but we need to call it out just the same.
OP has used Monday to Friday of that week, there is no mention that weeks start on a Monday or that they end on Friday but we'll go along, it's only the syntax we need to explore today.
In the real world the actual values specified in the criteria should be parameterised, mainly because you don't want to manually re-construct the entire query every time, but also to sanitise input and prevent SQL injection attacks.
Even though it seems overkill for this post, using parameters even in simple queries helps to identify the variable elements, so I will use parameters for the 2nd criteria to demonstrate the concept.
DECLARE @from DateTime = '2021-02-15' -- Date in ISO format
DECLARE @to DateTime = (SELECT DateAdd(d, 5, @from)) -- 2021-02-20 00:00, so the range covers all of Friday 2021-02-19
/* NOTE: requirement only mentioned the start date, not the end
so your code should also only rely on the single fixed start date */
SELECT accounts.company_id, accounts.company_name
, paid = COUNT(st_paid), organic = COUNT(st_organic), app_store = COUNT(st_app_store)
FROM activities
INNER JOIN accounts ON activities.company_id = accounts.company_id
-- PIVOT the source_type
CROSS APPLY (SELECT st_paid = CASE source_type WHEN 'paid' THEN 1 END
,st_organic = CASE source_type WHEN 'organic' THEN 1 END
,st_app_store = CASE source_type WHEN 'app store' THEN 1 END
) as PVT
WHERE -- paid accounts = exclude 'free' accounts
accounts.current_mrr > 0
-- Date range filter
AND activity_time BETWEEN @from AND @to
GROUP BY accounts.company_id, accounts.company_name
-- The fun bit, we use HAVING to apply a filter AFTER the grouping is evaluated
-- Wording was explicitly 4 OR 5, not BETWEEN so we use IN for that
HAVING COUNT(source_type) IN (4,5)
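For anyone not on MS SQL, the same idea can be tried out with plain CASE expressions inside COUNT. The following is only a sketch against an in-memory SQLite database: the sample rows are invented, the column names follow the query above, and treating the week as seven days with a half-open date range is my own assumption (the query above stops at Friday).

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE accounts (company_id TEXT PRIMARY KEY, company_name TEXT, current_mrr REAL);
    CREATE TABLE activities (company_id TEXT, source_type TEXT, activity_time TEXT);
    """
)
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?, ?)",
    [("apl01", "apples", 100), ("ora01", "oranges", 50), ("fre01", "freebies", 0)],
)
# Invented activity rows, for illustration only.
activities = [
    ("apl01", "paid", "2021-02-15 09:00:00"),
    ("apl01", "paid", "2021-02-16 09:00:00"),
    ("apl01", "organic", "2021-02-17 09:00:00"),
    ("apl01", "app store", "2021-02-18 09:00:00"),
    ("apl01", "paid", "2021-02-22 08:00:00"),  # outside the week
]
activities += [("ora01", "organic", f"2021-02-{d} 10:00:00") for d in range(15, 21)]  # 6 rows
activities += [("fre01", "paid", f"2021-02-{d} 11:00:00") for d in range(15, 19)]     # free account
conn.executemany("INSERT INTO activities VALUES (?, ?, ?)", activities)

week_start = date(2021, 2, 15)
week_end = week_start + timedelta(days=7)  # exclusive upper bound (assumed 7-day week)

rows = conn.execute(
    """
    SELECT a.company_id, a.company_name,
           COUNT(CASE WHEN act.source_type = 'paid'      THEN 1 END) AS paid,
           COUNT(CASE WHEN act.source_type = 'organic'   THEN 1 END) AS organic,
           COUNT(CASE WHEN act.source_type = 'app store' THEN 1 END) AS app_store
    FROM activities AS act
    JOIN accounts AS a ON a.company_id = act.company_id
    WHERE a.current_mrr > 0              -- paying customers only
      AND act.activity_time >= ?         -- half-open range: start inclusive
      AND act.activity_time <  ?         -- end exclusive
    GROUP BY a.company_id, a.company_name
    HAVING COUNT(*) IN (4, 5)            -- explicitly 4 or 5 activities
    """,
    (week_start.isoformat(), week_end.isoformat()),
).fetchall()

print(rows)
```

With this data only "apples" survives: "oranges" has 6 activities in the week and the third account is free.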
I believe you are missing some information there.
Without more information on the tables, I can only guess that you also have a customer table. I am going to assume there is a customer_id column that serves as the key between both tables.
I would take your query and do something like:
SELECT customer_id,
SUM(num_operations) AS total,
MAX(CASE WHEN source_type = 'app' THEN num_operations END) AS app_totals,
MAX(CASE WHEN source_type = 'paid' THEN num_operations END) AS paid_totals,
MAX(CASE WHEN source_type = 'organic' THEN num_operations END) AS organic_totals
FROM (
SELECT customer_id, source_type, COUNT(*) AS num_operations
FROM activities
WHERE activity_time BETWEEN '02-15-21' AND '02-19-21'
GROUP BY customer_id, source_type
) tb1
GROUP BY customer_id
This is the most generic case I can think of, but it does not scale very well. If you get new source types, you need to modify the query, and the structure of the output table also changes. Depending on the SQL engine you are using (e.g. MySQL vs. Microsoft SQL Server) you may also be able to use a pivot function.
The previous query is a little bit rough, but it will give you a general idea. You can add ELSE branches to the CASE expressions to zero the fields when they have no values, and join with the customer table if you want only active customers, etc.

Writing a query to include certain values but exclude others when looking for a latest time period

I am trying to write a query that looks for people who have a certain code in their latest period (year), but not if they also have another code in that latest period. I'll be explicit so my example makes sense.
I want people who have the code A1, A2, A3, A4 or A5 but not AG, AP or AQ. There are people who have an A1 code for a period (like 2014) and an AG code for the same period; I'd like to exclude them. Not everyone has a code, so the field value could be NULL.
Is there a way to express this differently (i.e. in fewer characters) than the way I did?
SELECT
people.firstName
FROM
people
WHERE EXISTS (
SELECT *
FROM codes
WHERE
codes.people_id = people.id
AND period = (SELECT MAX(period) FROM codes codes2 WHERE codes2.people_id = codes.people_id)
AND code LIKE 'A[1-5]'
)
AND NOT EXISTS (
SELECT *
FROM codes
WHERE
codes.people_id = people.id
AND period = (
SELECT MAX(period)
FROM codes codes2
WHERE codes2.people_id = codes.people_id
)
AND code LIKE 'A[GPQ]'
)
Schema is as follows:
People
id (PK)
firstName
Codes
people_id (FK) many to one relation with People table
code (e.g. "A1", "A2", "AG")
period (e.g. "2013", "2014")
There are many ways you could do that. I'm not an SQL expert, but I can't see your query being too bad. If you want to reduce the number of sub-queries, you could consider using a GROUP BY clause along with a SUM aggregate function in a HAVING clause.
I started updating your code as follows:
SELECT
people.firstName
FROM
people
LEFT JOIN codes AS a15 ON a15.people_id = people.id AND a15.code LIKE 'A[1-5]'
LEFT JOIN codes AS agpq ON agpq.people_id = people.id AND agpq.code LIKE 'A[GPQ]'
GROUP BY
people.firstName
HAVING
SUM(CASE WHEN a15.code IS NULL THEN 0 ELSE 1 END) > 0
AND SUM(CASE WHEN agpq.code IS NULL THEN 0 ELSE 1 END) = 0
This however doesn't take into account anything to do with period specific requirements described. You could add the period to the GROUP BY clause or add it to a WHERE or one of the JOIN constraints but I'm not quite sure from your description exactly what you're after (I don't believe this is through any fault of your own, I just can't personally align the code provided to the description).
I would also like to point out that the SUM functions above will not give an accurate count of the number of matching codes. This is because if both A[GPQ] and A[1-5] return at least one row, the number returned by each constraint will be multiplied by the number returned for the other. It can however be used to determine whether there are "any" matching rows, because if the criteria is matched the SUM(...) will be > 0.
I'm sure a more experienced SQL Developer / DBA will be able to poke many holes in my proposed query but it might give them or someone else something to work from and hopefully gives you ideas for alternatives to using sub-queries.
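One possible way to bring the period requirement into the GROUP BY / HAVING idea is to join codes only once, restricted to each person's latest period, and test both code sets with conditional sums. This is a sketch under my reading of the question (a person qualifies if their latest period contains an A1-A5 code and no AG/AP/AQ code). It runs on SQLite, where LIKE has no bracket character classes, so the codes are listed with IN; the sample people are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE people (id INTEGER PRIMARY KEY, firstName TEXT);
    CREATE TABLE codes (people_id INTEGER REFERENCES people (id), code TEXT, period TEXT);
    """
)
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [(1, "Ann"), (2, "Bob"), (3, "Cy"), (4, "Dee"), (5, "Eve")],
)
# Invented sample data, for illustration only.
conn.executemany(
    "INSERT INTO codes VALUES (?, ?, ?)",
    [
        (1, "A1", "2014"),                     # Ann: wanted code in latest period
        (2, "A1", "2014"), (2, "AG", "2014"),  # Bob: both in latest period -> excluded
        (3, "A2", "2013"), (3, "AG", "2014"),  # Cy: latest period only has AG -> excluded
        (4, "AG", "2013"), (4, "A3", "2014"),  # Dee: AG is in an older period -> included
        # Eve has no codes at all
    ],
)

rows = conn.execute(
    """
    SELECT p.firstName
    FROM people AS p
    JOIN codes AS c ON c.people_id = p.id
    -- keep only the rows from each person's latest period
    WHERE c.period = (SELECT MAX(c2.period) FROM codes AS c2 WHERE c2.people_id = p.id)
    GROUP BY p.id, p.firstName
    HAVING SUM(CASE WHEN c.code IN ('A1', 'A2', 'A3', 'A4', 'A5') THEN 1 ELSE 0 END) > 0
       AND SUM(CASE WHEN c.code IN ('AG', 'AP', 'AQ') THEN 1 ELSE 0 END) = 0
    ORDER BY p.firstName
    """
).fetchall()

print(rows)
```

Because codes is joined only once, the sums are not multiplied against each other, and grouping by p.id keeps two people with the same first name apart.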

SQL : how to Case depending of the result of the 2 latest values of one column

I am discovering SQL as I have to build queries in my new company. I have understood the basics, but here is where I am stuck; maybe you could help me figure this out:
I would like to flag a product as unprocurable if sellers rejected my orders twice. The tricky part: I aggregate the furniture orders for all our local offices, so even though I send my purchase order(s) to one unique seller (the one with the best offer at the moment), I might have multiple lines for each item (one per office).
See the table below for purchase orders. The REF1 item should be set as unprocurable, as my orders were rejected on both 21 and 31 December (no matter the seller).
http://i.stack.imgur.com/r3W3E.jpg
So to put it in logic I would like to have something like this:
For each item whose 2 latest purchase orders were made on different dates and were both rejected (0 value in the table), attach a note saying "unprocurable"; otherwise mark it as procurable.
If it was only 1 value I think I could go with
Select
item
, MAX(date)
, case
when confirmed_units = 0
then 'Unprocurable'
else 'procurable'
end
From
purchase_table
Where
date between TO_DATE('01/01/2013', 'MM/DD/YYYY') AND TO_DATE('{RUN_DATE_YYYY/MM/DD}', 'YYYY/MM/DD')
But now I need to check the two latest purchase orders, which must not be from the same day.
I am a bit lost, could you give me a hand please?
Thanks !
Your question is a little unclear... have you tried using something along the lines of:
SELECT TOP 2 etc, etc... order by [column]
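TOP 2 on its own would not separate orders placed on the same day, so here is one way the "two latest distinct dates" rule might be written. It is a sketch only, run on SQLite rather than the asker's database: the column name order_date and the sample rows are invented, and it assumes confirmed_units is never negative and that every line on the two latest dates must be rejected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE purchase_table (item TEXT, office TEXT, order_date TEXT, confirmed_units INTEGER)"
)
# Invented sample data, for illustration only.
conn.executemany(
    "INSERT INTO purchase_table VALUES (?, ?, ?, ?)",
    [
        ("REF1", "Paris", "2013-12-11", 5),
        ("REF1", "Paris", "2013-12-21", 0),
        ("REF1", "Lyon",  "2013-12-21", 0),   # second line for the same day
        ("REF1", "Paris", "2013-12-31", 0),
        ("REF2", "Paris", "2013-12-21", 0),
        ("REF2", "Paris", "2013-12-31", 10),  # latest order was confirmed
        ("REF3", "Paris", "2013-12-31", 0),
        ("REF3", "Lyon",  "2013-12-31", 0),   # rejected, but only one distinct date
    ],
)

rows = conn.execute(
    """
    WITH ranked AS (
        SELECT p.item, p.order_date, p.confirmed_units,
               -- how many distinct later order dates exist for this item
               (SELECT COUNT(DISTINCT p2.order_date)
                FROM purchase_table AS p2
                WHERE p2.item = p.item AND p2.order_date > p.order_date) AS newer_dates
        FROM purchase_table AS p
    )
    SELECT item,
           CASE
               WHEN COUNT(DISTINCT order_date) = 2   -- two different dates exist
                AND MAX(confirmed_units) = 0         -- and every line on them was rejected
               THEN 'Unprocurable'
               ELSE 'procurable'
           END AS status
    FROM ranked
    WHERE newer_dates < 2        -- keep only the two latest distinct dates
    GROUP BY item
    ORDER BY item
    """
).fetchall()

print(rows)
```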

Splitting one table based on criteria and comparing

I'm not quite sure on the best way to phrase this particular query, so I hope the title is adequate, however, I will attempt to describe what it is I need to be able to understand how to do. Just to clarify, this is for oracle sql.
We have a table called assessments. There are different kinds of assessments within this table, however, some assessments should follow others in a logical order and within set time frames. The problems come in when a client has multiple assessments of the same type, as we have to use a fairly inefficient array formula in excel to identify which 'full' assessment corresponds with the 'initial' assessment.
I have an earlier query that was resolved on this site (Returning relevant date from multiple tables including additional table info) which I believe includes a lot of the logic for what is required (particularly in identifying a corresponding event which has occurred within a specified timeframe). However, whilst that query pulls data from 3 separate tables (assessments, events, responsibilities), I now need to create a query that generates a similar outcome but pulls from 1 main table and a 2nd table to return worker information. I thought the most logical way would be to create a query that looks at the assessment table with one type of assessment, and then joins to the assessment table again (possibly a temporary table?) with the assessment type that would follow the initial one.
For example:
Table 1 (Assessments):
Client  ID  Assessment Type  Start       End
P1      1   Initial          01/01/2012  05/01/2012
Table 2 (Assessments temp?):
Client  ID  Assessment Type  Start       End
P1      2   Full             12/01/2012
Table 3:
ID  Worker  Team
1   Bob     Team1
2   Lyn     Team2
Result:
Client  ID  Initial Start  Initial End  Initial Worker  Full Start  Full End
P1      1   01/01/2012     05/01/2012   Bob             12/01/2012
So table 1 and table 2 draw from the same table, except that each brings back a different assessment type. Ideally, there'd be a check to make sure that the 'full' assessment started within X days of the end of the 'initial' assessment (similar to the 'likely' check in the previous query mentioned earlier). If this can be achieved, it's probably worth mentioning that I'd also be interested in expanding this to look at multiple assessment types, as over the cycle a client could be expected to have 4 or 5 different types of assessment. Any pointers would be appreciated; I've already had a great deal of help from this community, which is very valuable.
Edit:
Edited to include solution following MBs advice.
Select
*
From(
Select
I.ASM_SUBJECT_ID as PNo,
I.ASM_ID As IAID,
I.ASM_QSA_ID as IAType,
I.ASM_START_DATE as IAStart,
I.ASM_END_DATE as IAEnd,
nvl(olm_bo.get_ref_desc(I.ASM_OUTCOME,'ASM_OUTCOME'),'') as IAOutcome,
C.ASM_ID as CAID,
C.ASM_QSA_ID as CAType,
C.ASM_START_DATE as CAStart,
C.ASM_END_DATE as CAEnd,
nvl(olm_bo.get_ref_desc(C.ASM_OUTCOME,'ASM_OUTCOME'),'') as CAOutcome,
ROUND(C.ASM_START_DATE -I.ASM_START_DATE,0) as "Likely",
row_number() over(PARTITION BY I.ASM_ID
ORDER BY
abs(I.ASM_START_DATE - C.ASM_START_DATE))as "Row Number"
FROM
O_ASSESSMENTS I
left join O_ASSESSMENTS C
on I.ASM_SUBJECT_ID = C.ASM_SUBJECT_ID
and C.ASM_QSA_ID IN ('AA523','AA1326') and
ROUND(C.ASM_START_DATE - I.ASM_START_DATE,0) >= -2
AND
ROUND(C.ASM_START_DATE - I.ASM_START_DATE,0) <= 25
and C.ASM_OUTCOME <>'ABANDON'
Where I.ASM_QSA_ID IN ('AA501','AA1323')
AND I.ASM_OUTCOME <> 'ABANDON'
AND
I.ASM_END_DATE >= '01-04-2011') WHERE "Row Number" = 1
You can access the same table multiple times in a given query in SQL, simply by using table aliases. So one way of doing this would be:
select i.client,
i.id initial_id,
i.start initial_start,
i.end initial_end,
w.worker initial_worker,
f.id full_id,
f.start full_start,
f.end full_end
from assessments i
join workers w on i.id = w.id
left join assessments f
on i.client = f.client and
f.assessment_type = 'Full' and
f.start between i.end and i.end + X
/* replace X with appropriate number of days */
where i.assessment_type = 'Initial'
Note: column names such as end (that are reserved words in Oracle SQL) should normally be double-quoted, but from the previous question it looks as though these are simplified versions of the actual column names.
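To try the self-join out without an Oracle instance, here is a small sketch on SQLite using the example rows from the question plus one extra 'Initial' assessment that has no follow-up. The names start_date and end_date, the third assessment and its worker row, and the 14-day window are assumptions made up for the example; the date difference uses SQLite's julianday(), whereas in Oracle subtracting two dates gives days directly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE assessments (
        id INTEGER PRIMARY KEY, client TEXT, assessment_type TEXT,
        start_date TEXT, end_date TEXT
    );
    CREATE TABLE workers (id INTEGER PRIMARY KEY, worker TEXT, team TEXT);
    """
)
conn.executemany(
    "INSERT INTO assessments VALUES (?, ?, ?, ?, ?)",
    [
        (1, "P1", "Initial", "2012-01-01", "2012-01-05"),
        (2, "P1", "Full",    "2012-01-12", None),
        (3, "P1", "Initial", "2012-06-01", "2012-06-05"),  # invented: no Full follows
    ],
)
conn.executemany(
    "INSERT INTO workers VALUES (?, ?, ?)",
    [(1, "Bob", "Team1"), (2, "Lyn", "Team2"), (3, "Lyn", "Team2")],
)

MAX_DAYS = 14  # hypothetical value for X

rows = conn.execute(
    """
    SELECT i.client, i.id, i.start_date, i.end_date, w.worker,
           f.start_date, f.end_date
    FROM assessments AS i                      -- alias i: the Initial assessments
    JOIN workers AS w ON w.id = i.id
    LEFT JOIN assessments AS f                 -- alias f: the same table, Full assessments
      ON f.client = i.client
     AND f.assessment_type = 'Full'
     AND julianday(f.start_date) - julianday(i.end_date) BETWEEN 0 AND ?
    WHERE i.assessment_type = 'Initial'
    ORDER BY i.id
    """,
    (MAX_DAYS,),
).fetchall()

for row in rows:
    print(row)
```

The LEFT JOIN keeps the second 'Initial' assessment in the output with empty 'Full' columns, which makes missing follow-ups easy to spot.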
From your post, I assume that you're using Oracle here (as I see "Oracle" in the question).
In terms of "temp" tables, Views come right to mind. An Oracle View can give you different looks of a table which is what it sounds like you're looking for with different kinds of assessments.
Don Burleson is a good source for anything Oracle related and he gives some tips on Oracle Views at http://www.dba-oracle.com/concepts/views.htm

Complicated SQL Query Question

I am developing a freelancer site where users (project owners and experts) can leave feedback for each other. I am trying to find the count of feedbacks that are still waiting to be left.
This query returns the number of projects awaiting feedback: projects with a suitable status code, whose status was set within the last 30 days, and for which the user with id = 3 has not left feedback:
SELECT COUNT(*)
FROM projects
WHERE projects.status IN (5, 10) AND projects.status_created >= DATE_SUB('2010-12-17 21:24:51', INTERVAL 30 DAY)
AND NOT EXISTS(
SELECT * FROM
feedbacks WHERE projects.id = feedbacks.project_id AND feedbacks.from_id = '3'
)
This query works when we have only 2 users in the database. Otherwise, for example, if we change user id 3 to 99 (a user who has no relationship with the project), the query still returns 1 for the count, but it should return 0.
My database scheme:
PROJECTS(id, project_owner_id, project_title, ...)
FEEDBACKS(id, project_id, to_id, from_id, ....)
PROJECT_BIDS(id, project_id, bid_owner_id, accepted, ...) We can use this table to find out which user's bid was accepted; the owner of the accepted bid then has the right to leave feedback.
We can use the project_bids.accepted field to find out which users have a relationship with a project. If accepted is true, then this user is the project's freelancer expert. projects.project_owner_id is the other column that determines a relationship.
How can I fix my query? Thank you.
Your query (as written) is looking for the number of projects that have been created in the last 30 days, that have attached comments/feedback, and where the person in question has commented on the project.
The first thing that stands out is that you're checking the date the project was created, not the date of the comments/feedback. If you do this, when the project becomes more than 30 days old, no more feedbacks will count when running the query. You most likely will want to add a timestamp to the feedbacks table and check that field instead.
Also, you're doing a count of the number of projects, rather than the number of feedbacks that meet the criteria.
For your query, I would try something like:
SELECT projects.id, COUNT(feedbacks.id)
FROM feedbacks, projects
WHERE
projects.id = feedbacks.project_id AND
projects.status IN (5, 10) AND
feedbacks.timestamp >= DATE_SUB('2010-12-17 21:24:51', INTERVAL 30 DAY)
GROUP BY projects.id
ORDER BY projects.id
This will find the number of feedbacks per each project (of the given status). If you want to count only the feedbacks that were given by the person who won the bid, you can add to the WHERE clause:
AND feedbacks.from_id IN (
SELECT project_bids.bid_owner_id
FROM project_bids
WHERE
project_bids.accepted = 1 AND
project_bids.project_id = projects.id
)
Your English is a bit difficult to understand, so please clarify if I misunderstood something.
Note to everyone else: I'm still trying to get used to the Mark Down system. Feel free to correct my formatting above.
NOT EXISTS(SELECT FROM ...feedbacks.from_id = '99')
is always true for user 99 (a user who has no relationship with the project), because that user has never left any feedback.
That's why the query still returns 1.
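One way to act on this would be to also require that the user has a relationship with the project, either as the owner or as the owner of the accepted bid, before applying NOT EXISTS. The sketch below runs on SQLite (so the date arithmetic uses datetime() rather than MySQL's DATE_SUB); the helper name pending_feedback_count, the trimmed-down schema and the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE projects (id INTEGER PRIMARY KEY, project_owner_id INTEGER,
                           status INTEGER, status_created TEXT);
    CREATE TABLE feedbacks (id INTEGER PRIMARY KEY, project_id INTEGER,
                            to_id INTEGER, from_id INTEGER);
    CREATE TABLE project_bids (id INTEGER PRIMARY KEY, project_id INTEGER,
                               bid_owner_id INTEGER, accepted INTEGER);
    """
)
# Invented sample data: project 1 is owned by user 2, user 3 won the bid.
conn.execute("INSERT INTO projects VALUES (1, 2, 5, '2010-12-10 10:00:00')")
conn.execute("INSERT INTO project_bids VALUES (1, 1, 3, 1)")


def pending_feedback_count(conn, user_id, now):
    """Count projects on which user_id may leave feedback but has not yet (hypothetical helper)."""
    return conn.execute(
        """
        SELECT COUNT(*)
        FROM projects AS p
        WHERE p.status IN (5, 10)
          AND p.status_created >= datetime(:now, '-30 days')
          -- the user must be related to the project: owner or accepted bidder
          AND (p.project_owner_id = :user
               OR EXISTS (SELECT 1 FROM project_bids AS b
                          WHERE b.project_id = p.id
                            AND b.bid_owner_id = :user
                            AND b.accepted = 1))
          -- and must not have left feedback yet
          AND NOT EXISTS (SELECT 1 FROM feedbacks AS f
                          WHERE f.project_id = p.id AND f.from_id = :user)
        """,
        {"user": user_id, "now": now},
    ).fetchone()[0]


now = "2010-12-17 21:24:51"
count_expert = pending_feedback_count(conn, 3, now)     # accepted bidder
count_owner = pending_feedback_count(conn, 2, now)      # project owner
count_stranger = pending_feedback_count(conn, 99, now)  # unrelated user

conn.execute("INSERT INTO feedbacks VALUES (1, 1, 2, 3)")  # user 3 leaves feedback
count_expert_after = pending_feedback_count(conn, 3, now)

print(count_expert, count_owner, count_stranger, count_expert_after)
```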