SQL Query For Most Popular Combination - sql

Suppose I have a grocery store application with a table of purchases:
customerId int
itemId int
Four customers come into the store:
Bob buys a banana, lemonade, and a cookie
Kevin buys a banana, lemonade, and a donut
Sam buys a banana, orange juice, and a cupcake
Susie buys a banana
I am trying to write a query which would return which combinations of items are most popular. In this case, the results of this query should be:
banana and lemonade-2
I have already written a query which tells me a list of all items which were in a multi-item purchase (we exclude sales of one item - it cannot form a "combination"). It returns:
banana - 3
lemonade - 2
cookie - 1
donut - 1
cupcake - 1
orange juice - 1
Here is the query:
SELECT itemId, count( * )
FROM grocery_store
INNER JOIN (
SELECT customerId
FROM grocery_store
GROUP BY customerId
HAVING count( itemId ) > 1
)subQuery ON subQuery.customerId = grocery_store.customerId
GROUP BY itemId;
Could I get a pointer about how to expand my existing query to get the desired output?

select a.itemId, b.itemId, COUNT(*) countForCombination
from grocery_store a
inner join grocery_store b
on a.customerId = b.customerId
and a.itemId < b.itemId
group by a.itemId, b.itemId
order by countForCombination desc
Assumed:
grocery_store = sales records
customerId = unique sale
This query takes all the grocery_store records and, for each sales transaction, creates every possible combination (a.itemid, b.itemid) in a specific order (a.itemid < b.itemid).
This ordering eliminates duplicates: (apple, orange) is kept, whereas its mirror image (orange, apple) is dropped. It also prevents an item from being paired with itself.
After producing all the combinations from all sales, a simple group by and a sort by count show the most popular combinations at the top.
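To make the self-join concrete, here is a minimal sketch that runs the same query against the four customers from the question, using Python's built-in sqlite3 module. The numeric item ids and the ITEMS lookup are assumptions made for illustration (the question only says itemId is an int), and SQLite stands in for whatever database is actually used.

```python
import sqlite3

# Hypothetical item ids; the question does not say which id maps to which item.
ITEMS = {1: "banana", 2: "lemonade", 3: "cookie", 4: "donut",
         5: "orange juice", 6: "cupcake"}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE grocery_store (customerId INTEGER, itemId INTEGER)")
conn.executemany(
    "INSERT INTO grocery_store VALUES (?, ?)",
    [
        (1, 1), (1, 2), (1, 3),   # Bob: banana, lemonade, cookie
        (2, 1), (2, 2), (2, 4),   # Kevin: banana, lemonade, donut
        (3, 1), (3, 5), (3, 6),   # Sam: banana, orange juice, cupcake
        (4, 1),                   # Susie: banana only
    ],
)

pair_counts = conn.execute(
    """
    SELECT a.itemId, b.itemId, COUNT(*) AS countForCombination
    FROM grocery_store a
    INNER JOIN grocery_store b
            ON a.customerId = b.customerId
           AND a.itemId < b.itemId       -- keep each unordered pair once
    GROUP BY a.itemId, b.itemId
    ORDER BY countForCombination DESC, a.itemId, b.itemId
    """
).fetchall()

top_pair = pair_counts[0]
print(ITEMS[top_pair[0]], "and", ITEMS[top_pair[1]], "-", top_pair[2])
```

Susie's single-item purchase produces no pair at all, so the separate "more than one item" filter from the question is not needed for this query.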

Related

Query to fetch the similar products purchased by customer in SQL Server DB

Example :
C NAME PRODUCT DATE OF PURCHASE
**JOHN MILK 12/17/2015**
**JOHN BREAD 12/17/2015**
John soap 12/17/2015
**John milk 03/21/2016**
**John bread 03/21/2016**
John laptop 03/21/2016
John pen 07/30/2015
John Refils 07/30/2015
John Pen 08/05/2016
John Refils 08/05/2016
From the above example we can say that Mr. John always purchases "MILK" and "BREAD" together (meaning that whenever he purchases Milk he also buys Bread, and similarly for Pen and Refils).
Can anyone give me a query for the above example?
It seems that you want to perform some kind of market basket analysis on your data. If that is the case, and you are on Oracle, you can use the DBMS_FREQUENT_ITEMSET package.
Here are some good links on it.
https://docs.oracle.com/cd/B28359_01/appdev.111/b28419/d_frqist.htm
https://technology.amis.nl/2004/10/16/hidden-plsql-gem-in-10g-dbms_frequent_itemset-for-plsql-based-data-mining/
This is one way to obtain the result, returning the products grouped in pairs:
SELECT UPPER(t1.c_name), UPPER(t1.product), UPPER(t2.product), COUNT(*)
FROM products t1, products t2
WHERE UPPER(t1.c_name) = UPPER(t2.c_name)
AND t1.dt_purchase = t2.dt_purchase
AND UPPER(t1.product) < UPPER(t2.product)
GROUP BY UPPER(t1.c_name), UPPER(t1.product), UPPER(t2.product)
HAVING COUNT(*) = (
SELECT COUNT(*)
FROM products t3
WHERE UPPER(t3.c_name) = UPPER(t1.c_name)
AND UPPER(t3.product) = UPPER(t1.product)
) AND COUNT(*) = (
SELECT COUNT(*)
FROM products t3
WHERE UPPER(t3.c_name) = UPPER(t1.c_name)
AND UPPER(t3.product) = UPPER(t2.product)
);
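The query joins the table with itself, looking for pairs of products bought together on the same purchase date. The HAVING clause keeps a pair only if the number of times it was bought together equals the number of times each of the two products was bought at all.
This returns the following result:
JOHN BREAD MILK 2
JOHN PEN REFILS 2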
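The "always bought together" condition can also be written with common table expressions instead of correlated subqueries. The sketch below is a restructured variant of that idea, not the query above verbatim: the CTE names p, totals and pairs are made up for illustration, dates are stored as plain text, and it runs on SQLite through Python's sqlite3 module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (c_name TEXT, product TEXT, dt_purchase TEXT)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [
        ("JOHN", "MILK", "12/17/2015"), ("JOHN", "BREAD", "12/17/2015"),
        ("John", "soap", "12/17/2015"),
        ("John", "milk", "03/21/2016"), ("John", "bread", "03/21/2016"),
        ("John", "laptop", "03/21/2016"),
        ("John", "pen", "07/30/2015"), ("John", "Refils", "07/30/2015"),
        ("John", "Pen", "08/05/2016"), ("John", "Refils", "08/05/2016"),
    ],
)

always_together = conn.execute(
    """
    WITH p AS (            -- normalise case once
        SELECT UPPER(c_name) AS c_name, UPPER(product) AS product, dt_purchase
        FROM products
    ),
    totals AS (            -- how often each customer bought each product
        SELECT c_name, product, COUNT(*) AS n
        FROM p
        GROUP BY c_name, product
    ),
    pairs AS (             -- how often two products share a purchase date
        SELECT a.c_name AS c_name, a.product AS p1, b.product AS p2, COUNT(*) AS n
        FROM p a
        JOIN p b ON a.c_name = b.c_name
                AND a.dt_purchase = b.dt_purchase
                AND a.product < b.product
        GROUP BY a.c_name, a.product, b.product
    )
    SELECT pairs.c_name, pairs.p1, pairs.p2, pairs.n
    FROM pairs
    JOIN totals t1 ON t1.c_name = pairs.c_name AND t1.product = pairs.p1
    JOIN totals t2 ON t2.c_name = pairs.c_name AND t2.product = pairs.p2
    WHERE pairs.n = t1.n AND pairs.n = t2.n   -- never bought separately
    ORDER BY pairs.p1
    """
).fetchall()

for row in always_together:
    print(*row)
```

Soap and laptop drop out because milk and bread were each bought twice but appear alongside soap or laptop only once.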
This is one (ugly, but working) way to achieve this -
select C_Name, product
from products p1
where date_of_purchase = (select min(date_of_purchase)
from products p2
where p1.C_Name = p2.C_Name)
and product not in (select product
from products p3
where p1.C_Name = p3.C_Name
and date_of_purchase <> (select min(date_of_purchase)
from products p4
where p3.C_Name = p4.C_Name))

SQL: Combine result columns

SELECT Category, SUM (Volume) as Volume
FROM Product
GROUP BY Category;
The above query returns this result:
Category Volume
-------------------
Oth 2
Tv Kids 4
{null} 1
Humour 3
Tv 5
Theatrical 13
Doc 6
I want to combine some of the categories into one, as follows:
Oth,{null}, Humour, Doc as Others
Tv Kids, Tv as TV
Theatrical as Film
So my result would look like:
Category Volume
-------------------
Others 12
Tv 9
Film 13
How would I go about this?
You need a CASE here, like this:
SELECT
CASE
WHEN Category IN ('Oth','Humour','Doc')
OR Category IS NULL THEN 'Others'
WHEN Category IN ('Tv Kids','Tv') THEN 'TV'
WHEN Category = 'Theatrical' THEN 'Film'
END as category ,
SUM (Volume) as Volume
from Product
GROUP BY
CASE
WHEN Category IN ('Oth','Humour','Doc')
OR Category IS NULL THEN 'Others'
WHEN Category IN ('Tv Kids','Tv') THEN 'TV'
WHEN Category = 'Theatrical' THEN 'Film'
END;
Null must be dealt with outside the IN list as it is a special value.
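A small runnable check of the CASE approach, assuming one Product row per category with the volumes from the question, and SQLite (via Python's sqlite3) as a stand-in database. The aliases category_label and total_volume are chosen for this sketch to avoid clashing with the column names. The first query also shows why NULL cannot simply be listed inside IN: the comparison yields NULL rather than true.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Product (Category TEXT, Volume INTEGER)")
conn.executemany(
    "INSERT INTO Product VALUES (?, ?)",
    [("Oth", 2), ("Tv Kids", 4), (None, 1), ("Humour", 3),
     ("Tv", 5), ("Theatrical", 13), ("Doc", 6)],
)

# A NULL value compared against an IN list evaluates to NULL, never to true.
null_in_list = conn.execute("SELECT NULL IN ('Oth', NULL)").fetchone()[0]

grouped = conn.execute(
    """
    SELECT CASE
             WHEN Category IN ('Oth', 'Humour', 'Doc')
                  OR Category IS NULL THEN 'Others'
             WHEN Category IN ('Tv Kids', 'Tv') THEN 'TV'
             WHEN Category = 'Theatrical' THEN 'Film'
           END AS category_label,
           SUM(Volume) AS total_volume
    FROM Product
    GROUP BY CASE
             WHEN Category IN ('Oth', 'Humour', 'Doc')
                  OR Category IS NULL THEN 'Others'
             WHEN Category IN ('Tv Kids', 'Tv') THEN 'TV'
             WHEN Category = 'Theatrical' THEN 'Film'
           END
    """
).fetchall()

totals = dict(grouped)
print(null_in_list, totals)
```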
I think you need to use a case expression to group categories together.
select case when category in ('Tv', 'Tv Kids') then 'Tv'
when category = 'Theatrical' then 'Film'
else 'Others'
end as Category,
sum(Volume) as Volume
from (
SELECT Category, SUM (Volume) as Volume
FROM Product
GROUP BY Category
) subcategoryTotals
group by case when category in ('Tv', 'Tv Kids') then 'Tv'
when category = 'Theatrical' then 'Film'
else 'Others'
end
(The case expression is repeated in the group by because grouping by the alias Category is not portable, and here the alias also shares its name with the underlying column.)
Edit: Just a final thought (or two):
You should consider normalizing your database - for example, the Category column should really be a foreign key to a Categories table.
Also, this sql is reasonably ok because the case statement isn't too long or complex. If you wanted to split things up further it could quickly get to be unmanageable. I'd be inclined to use the idea of categories and subcategories in my database.
The best solution might be to implement those groups in the database. For instance:
category_group
id_category_group name sortkey
1 Others 3
2 TV 2
3 Film 1
category
id_category name id_category_group
1 Oth 1
2 Tv Kids 2
3 Humour 1
4 Tv 2
5 Theatrical 3
6 Doc 1
query
SELECT g.Name, SUM (p.Volume) as Volume
FROM Product p
LEFT JOIN Category c ON c.Id_Category = p.Id_Category
LEFT JOIN Category_Group g ON g.Id_Category_Group = c.Id_Category_Group
GROUP BY g.Id_Category_Group, g.Name, g.sortkey
ORDER BY g.sortkey;
This makes NULL a group of its own, though. But then it is a group of its own, as NULL means not known (yet), so you don't actually know whether it's TV, Film or Others. If you still want to count NULL as Others, change the ON clause accordingly:
LEFT JOIN Category_Group g
ON g.Id_Category_Group = COALESCE(c.Id_Category_Group, 1) -- default to group 'Others' (id 1)
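The lookup-table design can be tried out end to end with the sketch below. It assumes the two tables exactly as listed above, a Product table that references Category through Id_Category, and SQLite (via Python's sqlite3) as the database. The COALESCE default uses 1, the id of the 'Others' group in the category_group listing, and sortkey is added to the GROUP BY so the ORDER BY is valid on stricter databases too.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Category_Group (
        Id_Category_Group INTEGER PRIMARY KEY, Name TEXT, sortkey INTEGER);
    CREATE TABLE Category (
        Id_Category INTEGER PRIMARY KEY, Name TEXT,
        Id_Category_Group INTEGER REFERENCES Category_Group);
    CREATE TABLE Product (
        Id_Category INTEGER REFERENCES Category, Volume INTEGER);

    INSERT INTO Category_Group VALUES (1, 'Others', 3), (2, 'TV', 2), (3, 'Film', 1);
    INSERT INTO Category VALUES
        (1, 'Oth', 1), (2, 'Tv Kids', 2), (3, 'Humour', 1),
        (4, 'Tv', 2), (5, 'Theatrical', 3), (6, 'Doc', 1);
    -- one product row per category, plus one row with an unknown category
    INSERT INTO Product VALUES (1, 2), (2, 4), (NULL, 1), (3, 3), (4, 5), (5, 13), (6, 6);
    """
)

QUERY = """
    SELECT g.Name, SUM(p.Volume) AS total_volume
    FROM Product p
    LEFT JOIN Category c ON c.Id_Category = p.Id_Category
    LEFT JOIN Category_Group g ON g.Id_Category_Group = {group_id}
    GROUP BY g.Id_Category_Group, g.Name, g.sortkey
    ORDER BY g.sortkey
"""

# NULL category stays a group of its own.
null_separate = dict(conn.execute(
    QUERY.format(group_id="c.Id_Category_Group")).fetchall())

# NULL category is folded into 'Others' (group id 1).
null_as_others = conn.execute(
    QUERY.format(group_id="COALESCE(c.Id_Category_Group, 1)")).fetchall()

print(null_separate)
print(null_as_others)
```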
Try the following:
select category_group , sum(volume) as Volume from
(
SELECT
Category,
Volume,
case
WHEN Category IN ('Oth','Humour','Doc')
OR Category IS NULL THEN 'Others' -- {null} in the output is a NULL value, not the string '{null}'
WHEN Category IN ('Tv Kids','Tv') THEN 'TV'
WHEN Category = 'Theatrical' THEN 'Film'
end as category_group
FROM Product
) T
group by category_group

How do I select the complete set of rows for each Sales Order that contains one common item?

I would like to select all my sales order items based on one common item. In the below example, I want to select all my Sales Order rows filtered by item ‘bread’.
Dataset:
Order Items
10001 bread
10001 milk
10001 cheese
10001 apple
10001 milk
10002 cheese
10002 apple
10002 banana
10003 onions
10003 bread
10003 carrot
Desired output:
10001 bread
10001 milk
10001 cheese
10001 apple
10001 milk
10003 onions
10003 bread
10003 carrot
The result should not include the middle order, number 10002, because it has no 'bread' item.
I tried to use the EXISTS function but have had no luck.
Use a nested self join:
-- Order is a reserved word, so it is bracketed here (SQL Server syntax)
SELECT a.[Order], a.Item
FROM SalesOrder a
INNER JOIN (
SELECT [Order]
FROM SalesOrder
WHERE Item = 'bread'
) b
ON a.[Order] = b.[Order]
The inner query (b) gets all the IDs which include a bread item. The outer query gets all the items for each of the IDs chosen by the inner query.
Unless you have multiple rows in the table with the same ID and a bread item (e.g. 1, bread and a second 1, bread), DISTINCT is not necessary and will lower performance. Based on the limited information given, you shouldn't have such data if your schema is designed correctly (meaning that if your order has two bread items, it has a quantity = 2 column, not two rows with 'bread'; if it is designed to insert two rows to represent quantity 2 of the same item, you really should change the schema).
If you don't want a nested join for some reason, you could do it this way:
SELECT a.[Order], a.Item
FROM SalesOrder a
WHERE EXISTS (
SELECT 1
FROM SalesOrder b
WHERE b.Item = 'bread'
AND b.[Order] = a.[Order]
)
You mentioned you tried to use EXISTS but didn't get it working; here's how to do it. In almost all cases WHERE EXISTS and the WHERE IN version which other answers have suggested will generate the same plan (DISTINCT is still not necessary and will make a difference, though). It's possible but rather unlikely that there will be a difference based on statistics, indexes, etc., but you shouldn't worry about this.
Regardless of which query you use, having an index on Item will speed it up (whether you should add such an index if you don't have one is a completely separate question). String comparison scans aren't terribly performant on SQL Server.
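Both forms can be checked against the dataset from the question with the following sketch. It uses SQLite through Python's sqlite3 module as a stand-in, so the reserved word Order is quoted with standard double quotes rather than SQL Server brackets, and ORDER BY rowid is added only to keep the output in insertion order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE SalesOrder ("Order" INTEGER, Item TEXT)')
conn.executemany(
    "INSERT INTO SalesOrder VALUES (?, ?)",
    [
        (10001, "bread"), (10001, "milk"), (10001, "cheese"),
        (10001, "apple"), (10001, "milk"),
        (10002, "cheese"), (10002, "apple"), (10002, "banana"),
        (10003, "onions"), (10003, "bread"), (10003, "carrot"),
    ],
)

# Correlated EXISTS: keep a row if its order has at least one 'bread' row.
exists_rows = conn.execute(
    """
    SELECT a."Order", a.Item
    FROM SalesOrder a
    WHERE EXISTS (
        SELECT 1
        FROM SalesOrder b
        WHERE b.Item = 'bread'
          AND b."Order" = a."Order"
    )
    ORDER BY a.rowid
    """
).fetchall()

# Equivalent IN form; DISTINCT is not needed inside the subquery.
in_rows = conn.execute(
    """
    SELECT "Order", Item
    FROM SalesOrder
    WHERE "Order" IN (SELECT "Order" FROM SalesOrder WHERE Item = 'bread')
    ORDER BY rowid
    """
).fetchall()

for order_id, item in exists_rows:
    print(order_id, item)
```

Both milk rows of order 10001 are kept, because neither query collapses duplicates.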
Try this:
select SalesOrderLineID, SalesOrderLineItem FROM dbo.SalesOrderLines where SalesOrderLineID IN
(
select DISTINCT SalesOrderLineID FROM dbo.SalesOrderLines where SalesOrderLineItem = 'Bread'
)
Without knowing the exact data model, you'll probably end up using something like a subquery:
SELECT
SalesOrderLineID
, SalesOrderLineItem
FROM dbo.SalesOrderLines
WHERE SalesOrderLineID IN
(
SELECT DISTINCT
SalesOrderLineID
FROM dbo.SalesOrderLines
WHERE SalesOrderLineItem = 'Bread'
)
This will be OK for a small dataset, but if it grows to millions of rows, you'll want to review the query and possibly change it.

SQL Inner Join query

I have following table structures,
cust_info
cust_id
cust_name
bill_info
bill_id
cust_id
bill_amount
bill_date
paid_info
paid_id
bill_id
paid_amount
paid_date
My output should show the bills whose bill_date falls between two dates (for example, 1 Jan 2013 to 1 Feb 2013), with one row per bill, as follows:
cust_name | bill_id | bill_amount | tpaid_amount | bill_date | balance
where tpaid_amount is total paid for particular bill_id
For example,
for bill_id abcd, bill_amount is 10000 and the customer pays 2000 the first time and 3000 the second time.
This means the paid_info table contains two entries for the same bill_id:
bill_id | paid_amount
abcd 2000
abcd 3000
so, tpaid_amount = 2000 + 3000 = 5000 and balance = 10000 - tpaid_amount = 10000 - 5000 = 5000
Is there any way to do this with single query (inner joins)?
You'd want to join the 3 tables, then group them by bill ids and other relevant data, like so.
-- the select line, as well as getting your columns to display, is where you'll work
-- out your computed columns, or what are called aggregate functions, such as tpaid and balance
SELECT c.cust_name, p.bill_id, b.bill_amount, SUM(p.paid_amount) AS tpaid, b.bill_date, b.bill_amount - SUM(p.paid_amount) AS balance
-- joining up the 3 tables here on the id columns that point to the other tables
FROM cust_info c INNER JOIN bill_info b ON c.cust_id = b.cust_id
INNER JOIN paid_info p ON p.bill_id = b.bill_id
-- between pretty much does what it says
WHERE b.bill_date BETWEEN '2013-01-01' AND '2013-02-01'
-- in group by, we not only need to join rows together based on which bill they're for
-- (bill_id), but also any column we want to select in SELECT.
GROUP BY c.cust_name, p.bill_id, b.bill_amount, b.bill_date
A quick overview of group by: It will take your result set and smoosh rows together, based on where they have the same data in the columns you give it. Since each bill will have the same customer name, amount, date, etc, we are fine to group by those as well as the bill id, and we'll get a record for each bill. If we wanted to group it by p.paid_amount, though, since each payment would have a different one of those (possibly), you'd get a record for each payment as opposed to for each bill, which isn't what you'd want. Once group by has smooshed these rows together, you can run aggregate functions such as SUM(column). In this example, SUM(p.paid_amount) totals up all the payments that have that bill_id to work out how much has been paid. For more information, please look at W3Schools chapter on group by in their SQL tutorials.
Hope I've understood this correctly and that this helps you.
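Here is a runnable sketch of the join-and-group approach on the example from the question (bill abcd, payments of 2000 and 3000). The customer name, the second bill efgh with no payments, and the out-of-range bill old1 are made-up rows added to show the behaviour at the edges, and SQLite (via Python's sqlite3) stands in for the real database. The second query is a variation, not part of the answer above: it uses LEFT JOIN with COALESCE so that unpaid bills still appear with a paid total of 0.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE cust_info (cust_id INTEGER, cust_name TEXT);
    CREATE TABLE bill_info (bill_id TEXT, cust_id INTEGER,
                            bill_amount INTEGER, bill_date TEXT);
    CREATE TABLE paid_info (paid_id INTEGER, bill_id TEXT,
                            paid_amount INTEGER, paid_date TEXT);

    INSERT INTO cust_info VALUES (1, 'Asha');                  -- hypothetical customer
    INSERT INTO bill_info VALUES
        ('abcd', 1, 10000, '2013-01-15'),
        ('efgh', 1, 4000,  '2013-01-20'),                      -- hypothetical, unpaid
        ('old1', 1, 500,   '2012-12-01');                      -- hypothetical, out of range
    INSERT INTO paid_info VALUES
        (1, 'abcd', 2000, '2013-01-16'),
        (2, 'abcd', 3000, '2013-01-25'),
        (3, 'old1', 500,  '2012-12-05');
    """
)

inner_rows = conn.execute(
    """
    SELECT c.cust_name, b.bill_id, b.bill_amount,
           SUM(p.paid_amount) AS tpaid_amount,
           b.bill_date,
           b.bill_amount - SUM(p.paid_amount) AS balance
    FROM cust_info c
    INNER JOIN bill_info b ON c.cust_id = b.cust_id
    INNER JOIN paid_info p ON p.bill_id = b.bill_id
    WHERE b.bill_date BETWEEN '2013-01-01' AND '2013-02-01'
    GROUP BY c.cust_name, b.bill_id, b.bill_amount, b.bill_date
    ORDER BY b.bill_id
    """
).fetchall()

left_rows = conn.execute(
    """
    SELECT c.cust_name, b.bill_id, b.bill_amount,
           COALESCE(SUM(p.paid_amount), 0) AS tpaid_amount,
           b.bill_date,
           b.bill_amount - COALESCE(SUM(p.paid_amount), 0) AS balance
    FROM cust_info c
    INNER JOIN bill_info b ON c.cust_id = b.cust_id
    LEFT JOIN paid_info p ON p.bill_id = b.bill_id   -- keep bills with no payments
    WHERE b.bill_date BETWEEN '2013-01-01' AND '2013-02-01'
    GROUP BY c.cust_name, b.bill_id, b.bill_amount, b.bill_date
    ORDER BY b.bill_id
    """
).fetchall()

print(inner_rows)
print(left_rows)
```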
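This will do the trick:
select
cust_info.cust_name,
bill_info.bill_id, -- qualified: bill_id exists in both bill_info and paid_info
bill_info.bill_amount,
coalesce(sum(paid_info.paid_amount), 0) as tpaid_amount, -- 0 when nothing has been paid yet
bill_info.bill_date,
bill_info.bill_amount - coalesce(sum(paid_info.paid_amount), 0) as balance
from
cust_info
left outer join bill_info
on cust_info.cust_id=bill_info.cust_id
left outer join paid_info
on bill_info.bill_id=paid_info.bill_id
where
bill_info.bill_date between X and Y -- X and Y stand for the two dates
group by
cust_info.cust_name,
bill_info.bill_id,
bill_info.bill_amount,
bill_info.bill_date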

db2 query top group by

I've been trying for hours but can't get the query right. I want to do the following using DB2. From tables Company and Users I have the following ticket quantities per company/user.
QUERY USING:
SELECT T.USER, COUNT(T.USER) AS QUANTITY, T.COMPANY FROM TICKET T
INNER JOIN COMPANY P ON P.COMPANY = T.COMPANY
GROUP BY (T.USER, T.COMPANY) ORDER BY QUANTITY DESC
Outcome is:
user company quantity
----------------------------------
mark nissn 300
tom toyt 50
steve kryr 80
mark frd 20
tom toyt 120
jose toyt 230
tom nissn 145
steve toyt 10
jose kryr 35
steve frd 100
THIS SHOULD BE THE RESULT(Top user per company)
user company quantity
----------------------------------
mark nissn 300
jose toyt 230
steve frd 100
steve kryr 80
As you can see, a company has many users, and each user has a different quantity per company. The result should give the user with the highest quantity for each company. For example, company nissn has two users (mark with 300 and tom with 145), so it should return mark with 300. The same goes for toyt, frd, and kryr. I need all of them in a single query.
I wonder if that's possible in a query or I will need to create a stored procedure.
You can do this with analytic queries. But be careful. The pattern usually works out to involve nested subqueries. (One to produce a dataset, the next to add it to the pattern, the third to select out the rows you want.)
In this case it should look something like this.
Original query.
SELECT T.USER, COUNT(T.USER) AS QUANTITY, T.COMPANY
FROM TICKET T
JOIN COMPANY P
ON P.COMPANY = T.COMPANY
GROUP BY (T.USER, T.COMPANY)
Analytic query. (Note that the s names the subquery. I have not used DB2, so I don't know whether it lets you drop such aliases, but I know at least one database that requires them.)
SELECT user, quantity, company
, RANK () OVER (PARTITION BY company ORDER BY quantity DESC) as r
FROM ( ... previous query ... ) s
Final result.
SELECT user, quantity, company
FROM ( ... previous query ... ) t
WHERE r = 1
The combined query is:
SELECT user, quantity, company
FROM (
SELECT user, quantity, company
, RANK () OVER (PARTITION BY company ORDER BY quantity DESC) as r
FROM (
SELECT T.USER, COUNT(T.USER) AS QUANTITY, T.COMPANY
FROM TICKET T
JOIN COMPANY P
ON P.COMPANY = T.COMPANY
GROUP BY (T.USER, T.COMPANY)
) s
) t
WHERE r = 1
As I say I have not used DB2. But according to the SQL standard, that query should work.
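The three-level pattern can be exercised without a DB2 instance. The sketch below runs it on SQLite through Python's sqlite3 module, which is an assumption made for testing only and needs SQLite 3.25 or later for window functions. The column is named user_name instead of USER to stay clear of reserved words, and the ticket counts are small made-up numbers rather than the ones in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE company (company TEXT)")
conn.execute("CREATE TABLE ticket (user_name TEXT, company TEXT)")
conn.executemany("INSERT INTO company VALUES (?)",
                 [("nissn",), ("toyt",), ("frd",), ("kryr",)])

# One row per ticket; the counts are hypothetical.
tickets = (
    [("mark", "nissn")] * 3 + [("tom", "nissn")] * 1
    + [("jose", "toyt")] * 3 + [("tom", "toyt")] * 2
    + [("steve", "frd")] * 2 + [("mark", "frd")] * 1
    + [("steve", "kryr")] * 2 + [("jose", "kryr")] * 1
)
conn.executemany("INSERT INTO ticket VALUES (?, ?)", tickets)

top_users = conn.execute(
    """
    SELECT user_name, quantity, company
    FROM (
        SELECT user_name, quantity, company,
               RANK() OVER (PARTITION BY company
                            ORDER BY quantity DESC) AS r
        FROM (
            SELECT t.user_name, COUNT(*) AS quantity, t.company
            FROM ticket t
            JOIN company p ON p.company = t.company
            GROUP BY t.user_name, t.company
        ) s
    ) ranked
    WHERE r = 1
    ORDER BY company
    """
).fetchall()

for user_name, quantity, company in top_users:
    print(user_name, company, quantity)
```

RANK gives every tied user rank 1, so a company with two equally busy users returns both; ROW_NUMBER would be the function to use if exactly one row per company is required.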