Create a "products you may be interested in" algorithm in SQL? - sql

i have a problem which im not sure how to approach.
I have a simple database where i store products , users , and purchases of products by users.
Each product has a name , a category and a price.
My goal is the following :
I want to display a list of 5 items that are suggested as "You might be interested in" to the user.The main problem is that i don't just want to search LIKE %..% for the name , but i also want to take into account the types of products the user usually buys , the price range he usually buys at , and giving priority to products being bought more often.
Is such an algorithm realistic? I can think of some metrics , like grouping all categories into semantically "similar" buckets and calculating distance from that , but im not sure how i should rank them when there is multiple criteria.
Maybe i should give each criteria an importance factor and have the result be a multiplication of the distance * the factor?

What you could do is create 2 additional fields for each product in your database. In the first field called Type for example you could say "RC" and in the second field called similar you could say, "RC, Radio, Electronics, Remote, Model" Then in your query in SQL later on you can tell it to select products that match up between type and similar. This provides a system that doesn't just rely on the product name, as these can be deceiving. It would be still using the LIKE command, but it would be far more accurate as it's pre-defined by you as to what other products are similar to this one.
Depending on the size of your database already, I believe this to be the simplest option.

I was using this on MySql for some weighted search :
SELECT *,
IF(
`libelle` LIKE :startSearch, 30,
IF(`libelle` LIKE :fullSearch, 20, 0)
)
+ IF(
`description` LIKE :startSearch, 10,
IF(`description` LIKE :fullSearch, 5, 0)
)
+ IF(
`keyword` LIKE :fullSearch, 1, 0
)
AS `weight`
FROM `t`
WHERE (
-- at least 1 match
`libelle` LIKE :fullSearch
OR `description` LIKE :fullSearch
OR `keyword` LIKE :fullSearch
)
ORDER BY
`weight` DESC
/*
'fullSearch'=>'%'.str_replace(' ', '_', trim($search)).'%',
'startSearch'=>str_replace(' ', '_', trim($search)).'%',
*/

Related

SQL question with attempt on customer information

Schema
Question: List all paying customers with users who had 4 or 5 activities during the week of February 15, 2021; also include how many of the activities sent were paid, organic and/or app store. (i.e. include a column for each of the three source types).
My attempt so far:
SELECT source_type, COUNT(*)
FROM activities
WHERE activity_time BETWEEN '02-15-21' AND '02-19-21'
GROUP BY source_type
I would like to get a second opinion on it. I didn't include the accounts table because I don't believe that I need it for this query, but I could be wrong.
Have you tried to run this? It doesn't satisfy the brief on FOUR counts:
List all the ... customers (that match criteria)
There is no customer information included in the results at all, so this is an outright fail.
paying customers
This is the top level criteria, only customers that are not free should be included in the results.
Criteria: users who had 4 or 5 activities
There has been no attempt to evaluate this user criteria in the query, and the results do not provide enough information to deduce it.
there is further ambiguity in this requirement, does it mean that it should only include results if the account has individual users that have 4 or 5 acitvities, or is it simply that the account should have 4 or 5 activities overall.
If this is a test question (clearly this is contrived, if it is not please ask for help on how to design a better schema) then the use of the term User is usually very specific and would suggest that you need to group by or otherwise make specific use of this facet in your query.
Bonus: (i.e. include a column for each of the three source types).
This is the only element that was attempted, as the data is grouped by source_type but the information cannot be correlated back to any specific user or customer.
Next time please include example data and the expected outcome with your post. In preparing the data for this post you would have come across these issues yourself and may have been inspired to ask a different question, or through the process of writing the post up you may have resolved the issue yourself.
without further clarification, we can still start to evolve this query, a good place to start is to exclude the criteria and focus on the format of the output. the requirement mentions the following output requirements:
List Customers
Include a column for each of the source types.
Firstly, even though you don't think you need to, the request clearly states that Customer is an important facet in the output, and in your schema account holds the customer information, so although we do not need to, it makes the data readable by humans if we do include information from the account table.
This is a standard PIVOT style response then, we want a row for each customer, presenting a count that aggregates each of the values for source_type. Most RDBMS will support some variant of a PIVOT operator or function, however we can achieve the same thing with simple CASE expressions to conditionally put a value into projected columns in the result set that match the values we want to aggregate, then we can use GROUP BY to evaluate the aggregation, in this case a COUNT
The following syntax is for MS SQL, however you can achieve something similar easily enough in other RBDMS
OP please tag this question with your preferred database engine...
NOTE: there is NO filtering in this query... yet
SELECT accounts.company_id
, accounts.company_name
, paid = COUNT(st_paid)
, organic = COUNT(st_organic)
, app_store = COUNT(st_app_store)
FROM activities
INNER JOIN accounts ON activities.company_id = accounts.company_id
-- PIVOT the source_type
CROSS APPLY (SELECT st_paid = CASE source_type WHEN 'paid' THEN 1 END
,st_organic = CASE source_type WHEN 'organic' THEN 1 END
,st_app_store = CASE source_type WHEN 'app store' THEN 1 END
) as PVT
GROUP BY accounts.company_id, accounts.company_name
This results in the following shape of result:
company_id
company_name
paid
organic
app_store
apl01
apples
4
8
0
ora01
oranges
6
12
0
Criteria
When you are happy with the shpe of the results and that all the relevant information is available, it is time to apply the criteria to filter this data.
From the requirement, the following criteria can be identified:
paying customers
The spec doesn't mention paying specifically, but it does include a note that (free customers have current_mrr = 0)
Now aren't we glad we did join on the account table :)
users who had 4 or 5 activities
This is very specific about explicitly 4 or 5 activities, no more, no less.
For the sake of simplicity, lets assume that the user facet of this requirement is not important and that is is simply a reference to all users on an account, not just users who have individually logged 4 or 5 activities on their own - this would require more demo data than I care to manufacture right now to prove.
during the week of February 15, 2021.
This one was correctly identified in the original post, but we need to call it out just the same.
OP has used Monday to Friday of that week, there is no mention that weeks start on a Monday or that they end on Friday but we'll go along, it's only the syntax we need to explore today.
In the real world the actual values specified in the criteria should be parameterised, mainly because you don't want to manually re-construct the entire query every time, but also to sanitise input and prevent SQL injection attacks.
Even though it seems overkill for this post, using parameters even in simple queries helps to identify the variable elements, so I will use parameters for the 2nd criteria to demonstrate the concept.
DECLARE #from DateTime = '2021-02-15' -- Date in ISO format
DECLARE #to DateTime = (SELECT DateAdd(d, 5, #from)) -- will match Friday: 2021-02-19
/* NOTE: requirement only mentioned the start date, not the end
so your code should also only rely on the single fixed start date */
SELECT accounts.company_id, accounts.company_name
, paid = COUNT(st_paid), organic = COUNT(st_organic), app_store = COUNT(st_app_store)
FROM activities
INNER JOIN accounts ON activities.company_id = accounts.company_id
-- PIVOT the source_type
CROSS APPLY (SELECT st_paid = CASE source_type WHEN 'paid' THEN 1 END
,st_organic = CASE source_type WHEN 'organic' THEN 1 END
,st_app_store = CASE source_type WHEN 'app store' THEN 1 END
) as PVT
WHERE -- paid accounts = exclude 'free' accounts
accounts.current_mrr > 0
-- Date range filter
AND activity_time BETWEEN #from AND #to
GROUP BY accounts.company_id, accounts.company_name
-- The fun bit, we use HAVING to apply a filter AFTER the grouping is evaluated
-- Wording was explicitly 4 OR 5, not BETWEEN so we use IN for that
HAVING COUNT(source_type) IN (4,5)
I believe you are missing some information there.
without more information on the tables, I can only guess that you also have a customer table. i am going to assume there is a customer_id key that serves as key between both tables
i would take your query and do something like:
SELECT customer_id,
COUNT() AS Total,
MAX(CASE WHEN source_type = "app" THEN "numoperations" END) "app_totals"),
MAX(CASE WHEN source_type = "paid" THEN "numoperations" END) "paid_totals"),
MAX(CASE WHEN source_type = "organic" THEN "numoperations" END) "organic_totals"),
FROM (
SELECT source_type, COUNT() AS num_operations
FROM activities
WHERE activity_time BETWEEN '02-15-21' AND '02-19-21'
GROUP BY source_type
) tb1 GROUP BY customer_id
This is the most generic case i can think of, but does not scale very well. If you get new source types, you need to modify the query, and the structure of the output table also changes. Depending on the sql engine you are using (i.e. mysql vs microsoft sql) you could also use a pivot function.
The previous query is a little bit rough, but it will give you a general idea. You can add "ELSE" statements to the clause, to zero the fields when they have no values, and join with the customer table if you want only active customers, etc.

Create a new row when encountering a special character (eg "/") in a field

I'm trying to replicate a record whenever I come across a slash in a specific field.
Background of the problem: I'm trying to compare two lists of data containing item number, item description, and item serial number. One list has item quantity and status information, the other list has item location information, so I'm trying to match up the location list onto the main list. The problem is that both lists were created independently of each other so they both have errors and I've only been able to inner join in SQL about 20%. The rest don't match up because the item number is wrong in one list or the other, the serial number might be missing a digit in one list, and I can't compare the nomenclatures very well either because one might say "Hand Wrench" and the other might say "Wrench, 5mm, socket".
Additionally, one list of data has multiple items, that related to some main item, saved in each record. They did this by storing the multiple serial numbers separated by slashes in the serial number field.
Tried using Levenshtein difference (fuzzy match) in Alteryx for serial number / item number matching. This created way too many false positives because serial numbers are sequential, item numbers are frequently incorrect, and item descriptions might look similar to a human but character length can be wildly different (eg. "Truck" in one list might not match well if the other list has something like "Truck, 8 wheel, cargo, flatbed").
I'm currently trying to just match the lists on if the serial number in one list is contained in the other list (with multiple serial numbers in the serial number field).
Example SQLite code I'm currently using
select * from [MISSING ITEMS LIST] as a
left join [RFID TAG SCAN] as b on
b.[SERIAL NUMBER] like '%' || (a.[SERIAL NUMBER] || '%')
where b.[SERIAL NUMBER] <> '' and b.[SERIAL NUMBER] is not null
What I'm trying to achieve:
Recopying this part from above:
So table A might have this:
Record# Item# Description SN
1, 156928, Truck, 1234
2, 209344, Truck Cover, 5588
And Table B might have this
Record# Item# Description SN
1, 156928, Truck, 5588/01234
To make the analysis a little easier, I'd like to convert Table B to this:
Record# Item# Description SN
1, 156928, Truck, 5588
1, 156928, Truck, 01234
If I assume there is at most one slash, then in MariaDB you can do:
select Record, Item, Description
substring_index(SN, '/', 1) as SN
from t
union all
select Record, Item, Description
substring_index(SN, '/', -1) as SN
from t
where SN like '%/%';
In order to convert table B to the format you want use the Text to Columns tool
In the Configuration, set the Delimiter to "/" and select "Split to Rows"

Writing a query to include certain values but exclude others when looking for a latest time period

I am trying to write a query that looks for a people that have a certain code with the latest period (year) but not if they have another code with that latest period(year). I'll be explicit just so my example makes sense.
I want people who have the code A1,A2,A3,A4,A5 but not AG,AP,AQ. There are people who have an A1 code for a period (like 2014) and an AG code for a the same period. I'd like to exclude them. Not everyone has a code so the field value could be NULL.
Is there a way to express this in a different way (i.e. less characters) than the way I did?
SELECT
people.firstName
FROM
people
WHERE EXISTS (
SELECT *
FROM codes
WHERE
codes.people_id = people.id
AND period = (SELECT MAX(period) FROM codes codes2 WHERE codes2.people_id = codes.people_id)
AND code LIKE 'A[1-5]'
)
AND NOT EXISTS (
SELECT *
FROM codes
WHERE
codes.people_id = people.id
AND period = (
SELECT MAX(period)
FROM codes codes2
WHERE codes2.people_id = codes.people_id
)
AND code LIKE 'A[GPQ]'
)
Schema is as follows:
People
id (PK)
firstName
Codes
people_id (FK) many to one relation with People table
code (e.g. "A1", "A2", "AG")
period (e.g. "2013", "2014")
There are so many ways you could do that, I'm not an SQL expert but I can't see your query being too bad, if you want to try and reduce the number of sub-queries you could consider using the GROUP BY clause along with a SUM Aggregate function in a HAVING clause.
I started updating your code as follows:
SELECT
people.firstName
FROM
people
LEFT JOIN codes AS a15 ON a15.people_id = people.id AND a15.code LIKE 'A[1-5]'
LEFT JOIN codes AS agpq ON agpq.people_id = people.id AND agpq.code LIKE 'A[GPQ]'
GROUP BY
people.firstName
HAVING
SUM(CASE WHEN a15.code IS NULL THEN 0 ELSE 1 END) > 0
AND SUM(CASE WHEN agpq.code IS NULL THEN 0 ELSE 1 END) = 0
This however doesn't take into account anything to do with period specific requirements described. You could add the period to the GROUP BY clause or add it to a WHERE or one of the JOIN constraints but I'm not quite sure from your description exactly what you're after (I don't believe this is through any fault of your own, I just can't personally align the code provided to the description).
I would also like to point out that the SUM functions above will not give an accurate count of the number of matching codes. This is because if both A[GPQ] and A[1_5] return at least one row, the number returned by each constraint will be multiplied by the number returned for the other, it can however be used to determine if there are "any" returned items as if the criteria is matched it will have a SUM(...) > 0
I'm sure a more experienced SQL Developer / DBA will be able to poke many holes in my proposed query but it might give them or someone else something to work from and hopefully gives you ideas for alternatives to using sub-queries.

Sorting SQL by first two characters of fields

I'm trying to sort some data by sales person initials, and the sales rep field is 3 chars long, and is Firstname, Lastname and Account type. So, Bob Smith would be BS* and I just need to sort by the first two characters.
How can I pull all data for a certain rep, where the first two characters of the field equals BS?
In some databases you can actually do
select * from SalesRep order by substring(SalesRepID, 1, 2)
Othere require you to
select *, Substring(SalesRepID, 1, 2) as foo from SalesRep order by foo
And in still others, you can't do it at all (but will have to sort your output in program code after you get it from the database).
Addition: If you actually want just the data for one sales rep, do as the others suggest. Otherwise, either you want to sort by the thing or maybe group by the thing.
What about this
SELECT * FROM SalesTable WHERE SalesRepField LIKE 'BS_'
I hope that you never end up with two sales reps who happen to have the same initials.
Also, sorting and filtering are two completely different things. You talk about sorting in the question title and first paragraph, but your question is about filtering. Since you can just ORDER BY on the field and it will use the first two characters anyway, I'll give you an answer for the filtering part.
You don't mention your RDBMS, but this will work in any product:
SELECT
my_columns
FROM
My_Table
WHERE
sales_rep LIKE 'BS%'
If you're using a variable/parameter then:
SELECT
my_columns
FROM
My_Table
WHERE
sales_rep LIKE #my_param + '%'
You can also use:
LEFT(sales_rep, 2) = 'BS'
I would stay away from:
SUBSTRING(sales_rep, 1, 2) = 'BS'
Depending on your SQL engine, it might not be smart enough to realize that it can use an index on the last one.
You haven't said what DBMS you are using. The following would work in Oracle, and something like them in most other DBMSs
1) where sales_rep like 'BS%'
2) where substr(sales_rep,1,2) = 'BS'
SELECT * FROM SalesRep
WHERE SUBSTRING(SalesRepID, 1, 2) = 'BS'
You didn't say what database you were using, this works in MS SQL Server.

How to sort and display mixed lists of alphas and numbers as the users expect?

Our application has a CustomerNumber field. We have hundreds of different people using the system (each has their own login and their own list of CustomerNumbers). An individual user might have at most 100,000 customers. Many have less than 100.
Some people only put actual numbers into their customer number fields, while others use a mixture of things. The system allows 20 characters which can be A-Z, 0-9 or a dash, and stores these in a VARCHAR2(20). Anything lowercase is made uppercase before being stored.
Now, let's say we have a simple report that lists all the customers for a particular user, sorted by Customer Number. e.g.
SELECT CustomerNumber,CustomerName
FROM Customer
WHERE User = ?
ORDER BY CustomerNumber;
This is a naive solution as the people that only ever use numbers do not want to see a plain alphabetic sort (where "10" comes before "9").
I do not wish to ask the user any unnecessary questions about their data.
I'm using Oracle, but I think it would be interesting to see some solutions for other databases. Please include which database your answer works on.
What do you think the best way to implement this is?
Probably your best bet is to pre-calculate a separate column and use that for ordering and use the customer number for display. This would probably involve 0-padding any internal integers to a fixed length.
The other possibility is to do your sorting post-select on the returned results.
Jeff Atwood has put together a blog posting about how some people calculate human friendly sort orders.
In Oracle 10g:
SELECT cust_name
FROM t_customer c
ORDER BY
REGEXP_REPLACE(cust_name, '[0-9]', ''), TO_NUMBER(REGEXP_SUBSTR(cust_name, '[0-9]+'))
This will sort by the first occurence of number, not regarding it's position, i. e.:
customer1 < customer2 < customer10
cust1omer ? customer1
cust8omer1 ? cust8omer2
, where a ? means that the order is undefined.
That suffices for most cases.
To force sort order on case 2, you may add a REGEXP_INSTR(cust_name, '[0-9]', n) to ORDER BY list n times, forcing order on the first appearance of n-th (2nd, 3rd etc.) group of digits.
To force sort order on case 3, you may add a TO_NUMBER(REGEXP_SUBSTR(cust_name, '[0-9]+', n)) to ORDER BY list n times, forcing order of n-th. group of digits.
In practice, the query I wrote is enough.
You may create a function based index on these expressions, but you'll need to force it with a hint, and a one-pass SORT ORDER BY will be performed anyway, as the CBO doesn't trust function-base indexes enough to allow an ORDER BY on them.
You could have a numeric column [CustomerNumberInt] that is only used when the CustomerNumber is purely numeric (NULL otherwise[1]), then
ORDER BY CustomerNumberInt, CustomerNumber
[1] depending on how your SQL version handles NULLs in ORDER BY you might want to default it to zero (or infinity!)
I have a similar horrible situation and have developed a suitably horrible function to deal with it (SQLServer)
In my situation I have a table of "units" (this is a work-tracking system for students, so unit in this context represents a course they're doing). Units have a code, which for the most part is purely numeric, but for various reasons it was made a varchar and they decided to prefix some by up to 5 characters. So they expect 53,123,237,356 to sort normally, but also T53, T123, T237, T356
UnitCode is a nvarchar(30)
Here's the body of the function:
declare #sortkey nvarchar(30)
select #sortkey =
case
when #unitcode like '[^0-9][0-9]%' then left(#unitcode,1) + left('000000000000000000000000000000',30-(len(#unitcode))) + right(#unitcode,len(#unitcode)-1)
when #unitcode like '[^0-9][^0-9][0-9]%' then left(#unitcode,2) + left('000000000000000000000000000000',30-(len(#unitcode))) + right(#unitcode,len(#unitcode)-2)
when #unitcode like '[^0-9][^0-9][^0-9][0-9]%' then left(#unitcode,3) + left('000000000000000000000000000000',30-(len(#unitcode))) + right(#unitcode,len(#unitcode)-3)
when #unitcode like '[^0-9][^0-9][^0-9][^0-9][0-9]%' then left(#unitcode,4) + left('000000000000000000000000000000',30-(len(#unitcode))) + right(#unitcode,len(#unitcode)-4)
when #unitcode like '[^0-9][^0-9][^0-9][^0-9][^0-9][0-9]%' then left(#unitcode,5) + left('000000000000000000000000000000',30-(len(#unitcode))) + right(#unitcode,len(#unitcode)-5)
when #unitcode like '%[^0-9]%' then #unitcode
else left('000000000000000000000000000000',30-len(#unitcode)) + #unitcode
end
return #sortkey
I wanted to shoot myself in the face after writing that, however it works and seems not to kill the server when it runs.
I used this in SQL SERVER and working great: Here the solution is to pad the numeric values with a character in front so that all are of the same string length.
Here is an example using that approach:
select MyCol
from MyTable
order by
case IsNumeric(MyCol)
when 1 then Replicate('0', 100 - Len(MyCol)) + MyCol
else MyCol
end
The 100 should be replaced with the actual length of that column.