Doing a market basket analysis on the order details - sql

I have a table that looks (abbreviated) like:
| order_id | item_id | amount | qty | date |
|---------- |--------- |-------- |----- |------------ |
| 1 | 1 | 10 | 1 | 10-10-2014 |
| 1 | 2 | 20 | 2 | 10-10-2014 |
| 2 | 1 | 10 | 1 | 10-12-2014 |
| 2 | 2 | 20 | 1 | 10-12-2014 |
| 2 | 3 | 45 | 1 | 10-12-2014 |
| 3 | 1 | 10 | 1 | 9-9-2014 |
| 3 | 3 | 45 | 1 | 9-9-2014 |
| 4 | 2 | 20 | 1 | 11-11-2014 |
I would like to run a query that would calculate the list of items
that most frequently occur together.
In this case the result would be:
|items|frequency|
|-----|---------|
|1,2 |2 |
|1,3 |1 |
|2,3 |1 |
|2 |1 |
Ideally, first presenting orders with more than one item, then presenting
the most frequently ordered single items.
Could anyone please provide an example for how to structure this SQL?

This query generates all of the requested output for the cases where 2 items occur together. It doesn't include the last row of the requested output, since a single value (2) technically doesn't occur together with anything, although you could easily add a UNION query to include items that are ordered alone. Note that it pairs items by date rather than by order_id, which works here only because every order in the sample has its own date.
This is written for PostgreSQL 9.3.
create table orders(
order_id int,
item_id int,
amount int,
qty int,
date timestamp
);
INSERT INTO ORDERS VALUES(1,1,10,1,'10-10-2014');
INSERT INTO ORDERS VALUES(1,2,20,2,'10-10-2014');
INSERT INTO ORDERS VALUES(2,1,10,1,'10-12-2014');
INSERT INTO ORDERS VALUES(2,2,20,1,'10-12-2014');
INSERT INTO ORDERS VALUES(2,3,45,1,'10-12-2014');
INSERT INTO ORDERS VALUES(3,1,10,1,'9-9-2014');
INSERT INTO ORDERS VALUES(3,3,45,1,'9-9-2014');
INSERT INTO ORDERS VALUES(4,2,20,1,'11-11-2014');
with order_pairs as (
select (pg1.item_id, pg2.item_id) as items, pg1.date
from
(select distinct item_id, date
from orders) as pg1
join
(select distinct item_id, date
from orders) as pg2
ON
(
pg1.date = pg2.date AND
pg1.item_id != pg2.item_id AND
pg1.item_id < pg2.item_id
)
)
SELECT items, count(*) as frequency
FROM order_pairs
GROUP by items
ORDER by items;
output
items | frequency
-------+-----------
(1,2) | 2
(1,3) | 2
(2,3) | 1
(3 rows)
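The UNION idea mentioned above can be sketched as follows. This is only an illustration, not the answerer's query: it runs on SQLite through Python's sqlite3 module instead of PostgreSQL, pairs items by order_id, and uses a helper column named size (my own invention) to list pairs before single items.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INT, item_id INT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 1), (1, 2), (2, 1), (2, 2), (2, 3), (3, 1), (3, 3), (4, 2)],
)

# Pairs come from a self-join; single items come from orders with one row.
# The "size" column is a hypothetical helper used only for sorting.
query = """
SELECT items, frequency
FROM (
    SELECT a.item_id || ',' || b.item_id AS items, COUNT(*) AS frequency, 2 AS size
    FROM orders a
    JOIN orders b ON a.order_id = b.order_id AND a.item_id < b.item_id
    GROUP BY a.item_id, b.item_id
    UNION ALL
    SELECT CAST(item_id AS TEXT), COUNT(*), 1
    FROM orders
    WHERE order_id IN (
        SELECT order_id FROM orders GROUP BY order_id HAVING COUNT(*) = 1
    )
    GROUP BY item_id
)
ORDER BY size DESC, frequency DESC, items
"""
rows = conn.execute(query).fetchall()
for items, frequency in rows:
    print(items, frequency)
```

The last row is item 2 on its own, coming from order 4, which is the only single-item order in the sample.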

Market Basket Analysis with Join.
Join the table to itself on order_id and keep only the rows where x.item_id < y.item_id, so that each pair of items sold in the same order appears exactly once. Then group by the pair and count the number of rows for each combination.
select items,count(*) as 'Freq' from
(select concat(x.item_id,',',y.item_id) as items from orders x
JOIN orders y ON x.order_id = y.order_id and
x.item_id != y.item_id and x.item_id < y.item_id) A
group by A.items order by A.items;
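Joining on order_id rather than on date matters as soon as two orders share a day. Here is a small check of that, assuming SQLite via Python's sqlite3; order 5 is a made-up row that is not in the question's data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INT, item_id INT, date TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, 1, "2014-10-10"),
        (1, 2, "2014-10-10"),
        (4, 2, "2014-11-11"),
        (5, 3, "2014-11-11"),  # hypothetical order on the same day as order 4
    ],
)

# Same pair-counting query, parameterised only by the join column.
PAIRS = """
SELECT x.item_id, y.item_id, COUNT(*)
FROM orders x
JOIN orders y ON x.{key} = y.{key} AND x.item_id < y.item_id
GROUP BY x.item_id, y.item_id
ORDER BY x.item_id, y.item_id
"""
by_order = conn.execute(PAIRS.format(key="order_id")).fetchall()
by_date = conn.execute(PAIRS.format(key="date")).fetchall()
print(by_order)
print(by_date)
```

Joined by date, items 2 and 3 show up as a pair even though they were never in the same order.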

Is there a way in SQL to aggregate a column across rows and potentially duplicate rows based on another field value in Redshift?

So I have a table, let's call it shipment_items that lists by a shipment_id the individual items contained within a shipment and their quantities.
+-------------+-------------+----------+
| shipment_id | item_name | quantity |
+-------------+-------------+----------+
| 1 | cleanser | 1 |
| 1 | moisturizer | 2 |
| 2 | cleanser | 2 |
| 2 | body wash | 1 |
| 3 | cleanser | 1 |
| 3 | moisturizer | 2 |
| 4 | cleanser | 1 |
| 4 | moisturizer | 1 |
+-------------+-------------+----------+
What I want is to return a table that looks like this
+------------------------------------+----------+
| items | num_ship |
+------------------------------------+----------+
| cleanser, moisturizer, moisturizer | 2 |
| body wash, cleanser, cleanser | 1 |
| cleanser, moisturizer | 1 |
+------------------------------------+----------+
Is there any way in sql to do that? I'm thinking something with list_agg, but the tricky part is duplicating the item_names based on the quantity field. What I'm trying to show in the new table is that there were 2 shipments that contained 2 moisturizers and 1 cleanser, and 1 shipment containing 2 cleansers and 1 body wash.
** EDIT **
Resolved thanks to @Gordon Linoff.
The new resulting table will look like this:
+------------------------------------+----------+
| items | num_ship |
+------------------------------------+----------+
| cleanser: 1, moisturizer: 2 | 2 |
| body wash: 1, cleanser: 2 | 1 |
| cleanser: 1, moisturizer: 1 | 1 |
+------------------------------------+----------+
You can use listagg():
select listagg(item_name, ', ') within group (order by item_name) as items,
quantity
from t
group by quantity
order by quantity desc;
EDIT:
I think you want two levels of aggregation:
select items, count(*)
from (select shipment_id,
listagg(distinct item_name, ', ') within group (order by item_name) as items
from t
group by shipment_id
) s
group by items
order by count(*) desc;
This does not include duplicates in the item list.
EDIT II:
For exact matches, include the quantity:
select items, count(*)
from (select shipment_id,
listagg(distinct item_name || ':' || quantity, ', ') within group (order by item_name) as items
from t
group by shipment_id
) s
group by items
order by count(*) desc;
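To sanity-check the two-level idea without Redshift, here is a rough sketch in Python. It is an assumption-laden stand-in: the rows live in SQLite, and both aggregation levels are done in Python instead of with listagg, because the order of concatenated values is easier to control that way. The table name shipment_items comes from the question.

```python
import sqlite3
from collections import Counter, defaultdict

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE shipment_items (shipment_id INT, item_name TEXT, quantity INT)"
)
conn.executemany(
    "INSERT INTO shipment_items VALUES (?, ?, ?)",
    [
        (1, "cleanser", 1), (1, "moisturizer", 2),
        (2, "cleanser", 2), (2, "body wash", 1),
        (3, "cleanser", 1), (3, "moisturizer", 2),
        (4, "cleanser", 1), (4, "moisturizer", 1),
    ],
)

# Level 1: build one "item: quantity" signature per shipment,
# with items in alphabetical order so equal baskets compare equal.
per_shipment = defaultdict(list)
for shipment_id, item_name, quantity in conn.execute(
    "SELECT shipment_id, item_name, quantity "
    "FROM shipment_items ORDER BY shipment_id, item_name"
):
    per_shipment[shipment_id].append(f"{item_name}: {quantity}")

# Level 2: count how many shipments share each signature.
num_ship = Counter(", ".join(parts) for parts in per_shipment.values())
for items, count in num_ship.most_common():
    print(items, count)
```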

tSQL aggregate functions and group bys

So, I can't seem to find a way to get this to work. What I need is as follows.
I have a table of, let's say, types.
Type_ID, Type_Description
Then I have a table of Items. [Item Type is fk to type table]
Item_ID, Item_Type
Then I have a Results Table. [Item_ID is fk to Item table]
Result_ID, Item_ID, Cost
So what I need for output, grouped by Type_ID, is:
- The Count of Items(can be 0),
- The Count of Results(Can be 0),
- and the sum of Cost(can be 0)
I don't have direct access to these tables. I have to build the SQL and send it to an API, so I don't get to see the error: just the results if it succeeds and an error 500 if not.
It seems to be an older T-SQL, as List and STRING_AGG don't seem to be available.
EDIT: As requested - Sample Data
+---------+------------------+
| Type_ID | Type_Description |
+---------+------------------+
| 1 | Example 1 |
+---------+------------------+
| 2 | Example 2 |
+---------+------------------+
+---------+---------------+
| ITEM_ID | ITEM_TYPE |
+---------+---------------+
| 1 | 1 |
+---------+---------------+
| 2 | 1 |
+---------+---------------+
| 3 | 1 |
+---------+---------------+
| 4 | 2 |
+---------+---------------+
| 5 | 2 |
+---------+---------------+
+-----------+---------+------+
| Result_ID | Item_ID | Cost |
+-----------+---------+------+
| 1 | 1 | 10 |
+-----------+---------+------+
| 2 | 1 | 20 |
+-----------+---------+------+
| 3 | 2 | 5 |
+-----------+---------+------+
| 4 | 5 | 100 |
+-----------+---------+------+
Desired Output
+---------+------------+--------------+------+
| Type_ID | Item_Count | Result_Count | Cost |
+---------+------------+--------------+------+
| 1 | 3 | 3 | 35 |
+---------+------------+--------------+------+
| 2 | 2 | 1 | 100 |
+---------+------------+--------------+------+
I think GMB's answer was quite good, but in case there is a type with no items (something in your requirements), it will not be displayed.
So first of all let's create the input data:
select 1 as Type_ID, 'Example 1' as Type_Description into #type
union all
select 2, 'Example 2'
union all
select 3, 'Example 3'
select 1 Result_ID, 1 Item_ID, 10 Cost into #item
union all
select 2, 1, 20 Cost
union all
select 3, 2, 5 Cost
union all
select 4, 5, 100 Cost
select 1 Item_ID, 1 Item_Type INTO #item_type
union all
select 2, 1
union all
select 3, 1
union all
select 4, 2
union all
select 5, 2
Note that I also added a type 3 with no items to test the no-item case.
And then the query you need:
SELECT
t.Type_ID,
COUNT(DISTINCT it.Item_ID) Item_Count,
COUNT(DISTINCT i.Result_ID) Result_Count,
SUM(ISNULL(Cost, 0)) Cost
FROM #type t
LEFT JOIN #item_type it on it.Item_Type = t.Type_ID
LEFT JOIN #item i on i.Item_ID = it.Item_ID
GROUP BY t.Type_ID
I think it is pretty straightforward and doesn't need much explanation, but feel free to ask in the comments if necessary.
The results are just like you requested, with also a line for type 3:
+---------+------------+--------------+------+
| Type_ID | Item_Count | Result_Count | Cost |
+---------+------------+--------------+------+
| 1 | 3 | 3 | 35 |
+---------+------------+--------------+------+
| 2 | 2 | 1 | 100 |
+---------+------------+--------------+------+
| 3 | 0 | 0 | 0 |
+---------+------------+--------------+------+
You mentioned that the count of items could be 0 and also the count of results. But aren't both values always either 0, or both > 0? Only in case your type-item many-to-many table doesn't have a FK, you could have that scenario. For example, if I add:
insert into #item_type
select 6, 3
Then the last row is:
+---------+------------+--------------+------+
| 3 | 1 | 0 | 0 |
+---------+------------+--------------+------+
I am not sure if that makes sense in your scenario, but as your post implies that items and results can be 0 independently, that confused me a bit.
You can join and aggregate. Assuming that your tables are called types, types_costs and costs, that would be:
select
t.type_id,
count(distinct tc.item_id) item_count,
count(distinct c.result_id) result_count,
sum(c.cost) cost
from types t
inner join types_costs tc on tc.item_type = t.type_id
left join costs c on c.item_id = tc.item_id
group by t.type_id
An important thing is to use a left join to bring in the costs table, so that item_ids that do not exist in costs are not eliminated before you get a chance to count them. Depending on your actual use case, you might also want a left join on table types_costs.
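The difference between the two joins on the item table is easy to see on the sample data. A minimal sketch, assuming SQLite via Python's sqlite3 and table names types, items and results that I picked for readability (the real names are unknown); type 3 is the extra empty type from the other answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE types   (type_id INT, type_description TEXT);
CREATE TABLE items   (item_id INT, item_type INT);
CREATE TABLE results (result_id INT, item_id INT, cost INT);
INSERT INTO types   VALUES (1, 'Example 1'), (2, 'Example 2'), (3, 'Example 3');
INSERT INTO items   VALUES (1, 1), (2, 1), (3, 1), (4, 2), (5, 2);
INSERT INTO results VALUES (1, 1, 10), (2, 1, 20), (3, 2, 5), (4, 5, 100);
""")

# Only the join type on the items table changes between the two runs.
QUERY = """
SELECT t.type_id,
       COUNT(DISTINCT i.item_id),
       COUNT(DISTINCT r.result_id),
       COALESCE(SUM(r.cost), 0)
FROM types t
{join} JOIN items i ON i.item_type = t.type_id
LEFT JOIN results r ON r.item_id = i.item_id
GROUP BY t.type_id
ORDER BY t.type_id
"""
with_left = conn.execute(QUERY.format(join="LEFT")).fetchall()
with_inner = conn.execute(QUERY.format(join="INNER")).fetchall()
print(with_left)
print(with_inner)
```

With the inner join, type 3 disappears from the output entirely; with the left join it comes back as a row of zeros.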

Getting a distinct value from one column if all rows matches a certain criteria

I'm trying to find a performant and easy-to-read query to get a distinct value from one column, if all rows for that value match a certain criterion.
I have a table that tracks e-commerce orders and whether they're delivered on time, contents and schema as following:
> select * from orders;
+----+--------------------+-------------+
| id | delivered_on_time | customer_id |
+----+--------------------+-------------+
| 1 | 1 | 9 |
| 2 | 0 | 9 |
| 3 | 1 | 10 |
| 4 | 1 | 10 |
| 5 | 0 | 11 |
+----+--------------------+-------------+
I would like to get all distinct customer_id's which have had all their orders delivered on time. I.e. I would like an output like this:
+-------------+
| customer_id |
+-------------+
| 10 |
+-------------+
What's the best way to do this?
I've found a solution, but it's a bit hard to read and I doubt it's the most efficient way to do it (using double CTE's):
> with hits_all as (
select customer_id,count(*) as count from orders group by customer_id
),
hits_true as
(select customer_id,count(*) as count from orders where delivered_on_time = 1 group by customer_id)
select
*
from
hits_true
inner join
hits_all on
hits_all.customer_id = hits_true.customer_id
and hits_all.count = hits_true.count;
+-------------+-------+-------------+-------+
| customer_id | count | customer_id | count |
+-------------+-------+-------------+-------+
| 10 | 2 | 10 | 2 |
+-------------+-------+-------------+-------+
You can use group by and having as follows:
select customer_id
from orders
group by customer_id
having sum(delivered_on_time) = count(*)
This works because an on-time delivery is identified by delivered_on_time = 1. So you can just ensure that the sum of delivered_on_time is equal to the number of records for the customer.
You can use aggregation and having:
select customer_id
from orders
group by customer_id
having min(delivered_on_time) = 1;
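Customer 11, whose only order was late, is the edge case worth testing with any of these HAVING conditions. A quick check, assuming SQLite via Python's sqlite3; the customers helper is just a convenience for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INT, delivered_on_time INT, customer_id INT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 1, 9), (2, 0, 9), (3, 1, 10), (4, 1, 10), (5, 0, 11)],
)


def customers(having):
    """Return the customer_ids that satisfy the given HAVING condition."""
    sql = (
        "SELECT customer_id FROM orders "
        f"GROUP BY customer_id HAVING {having} ORDER BY customer_id"
    )
    return [row[0] for row in conn.execute(sql)]


sum_eq_count = customers("SUM(delivered_on_time) = COUNT(*)")
min_is_one = customers("MIN(delivered_on_time) = 1")
# Comparing MIN to MAX only checks that all rows agree, not that they are 1.
min_eq_max = customers("MIN(delivered_on_time) = MAX(delivered_on_time)")
print(sum_eq_count, min_is_one, min_eq_max)
```

Comparing min to max on its own also lets customer 11 through, because all of that customer's orders agree on being late.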

Count within the result set of a subquery

I have the following relations in my database:
Invoice InvoiceMeal
--------------------- ---------------------------
| InvoiceId | Total | | Id | InvoiceId | MealId |
--------------------- ---------------------------
| 1 | 22.32 | | 1 | 1 | 3 |
--------------------- ---------------------------
| 2 | 12.18 | | 2 | 1 | 2 |
--------------------- ---------------------------
| 3 | 27.76 | | 3 | 2 | 2 |
--------------------- ---------------------------
Meal Type
----------------------------------- -------------------
| Id | Name | TypeId | | Id | Name |
----------------------------------- -------------------
| 1 | Hamburger | 1 | | 1 | Meat |
----------------------------------- -------------------
| 2 | Soja Beans | 2 | | 2 | Vegetarian |
----------------------------------- -------------------
| 3 | Chicken | 2 |
-----------------------------------
What I want to query from the database is InvoiceId and Total of all Invoices which consist of at least two Meals where at least one of the Meals is of Type Vegetarian. I have the following SQL query and it works:
SELECT
i."InvoiceId", i."Total"
FROM
public."Invoice" i
WHERE
(SELECT COUNT(*)
FROM public."InvoiceMeal" im
WHERE im."InvoiceId" = i."InvoiceId" AND
(SELECT COUNT(*)
FROM public."Meal" m, public."Type" t
WHERE im."MealId" = m."Id" AND
m."TypeId" = t."Id" AND
t."Name" = 'Vegetarian') > 0
) >= 2;
My problem with this query is that I cannot easily modify the condition that there must be at least one vegetarian Meal. I want to be able, for example, to change it to at least two vegetarian meals. How can I achieve this with my query?
I would approach this by joining the tables together and using aggregation. The having clause can handle the conditions; to require at least two vegetarian meals, change the last comparison to >= 2:
select i.InvoiceId, i.Total
from InvoiceMeal im join
Invoice i
on i.InvoiceId = im.InvoiceId join
Meal m
on im.MealId = m.Id join
Type t
on m.TypeId = t.Id
group by i.InvoiceId, i.Total
having count(distinct im.MealId) >= 2 and
sum(case when t.Name = 'Vegetarian' then 1 else 0 end) > 0;
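To show how both thresholds become simple knobs, here is a hedged sketch that passes them as query parameters. It assumes SQLite via Python's sqlite3 and unquoted table names; the invoices function and its parameter names are mine, not part of the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Invoice     (InvoiceId INT, Total REAL);
CREATE TABLE InvoiceMeal (Id INT, InvoiceId INT, MealId INT);
CREATE TABLE Meal        (Id INT, Name TEXT, TypeId INT);
CREATE TABLE Type        (Id INT, Name TEXT);
INSERT INTO InvoiceMeal VALUES (1, 1, 3), (2, 1, 2), (3, 2, 2);
INSERT INTO Meal VALUES (1, 'Hamburger', 1), (2, 'Soja Beans', 2), (3, 'Chicken', 2);
INSERT INTO Type VALUES (1, 'Meat'), (2, 'Vegetarian');
""")
conn.executemany(
    "INSERT INTO Invoice VALUES (?, ?)", [(1, 22.32), (2, 12.18), (3, 27.76)]
)


def invoices(min_meals, min_vegetarian):
    """Invoices with at least min_meals meals, min_vegetarian of them vegetarian."""
    sql = """
    SELECT i.InvoiceId, i.Total
    FROM Invoice i
    JOIN InvoiceMeal im ON im.InvoiceId = i.InvoiceId
    JOIN Meal m ON m.Id = im.MealId
    JOIN Type t ON t.Id = m.TypeId
    GROUP BY i.InvoiceId, i.Total
    HAVING COUNT(DISTINCT im.MealId) >= ?
       AND SUM(CASE WHEN t.Name = 'Vegetarian' THEN 1 ELSE 0 END) >= ?
    ORDER BY i.InvoiceId
    """
    return conn.execute(sql, (min_meals, min_vegetarian)).fetchall()


print(invoices(2, 1))
print(invoices(2, 2))
print(invoices(2, 3))
```

In the sample data both meals on invoice 1 are typed as Vegetarian, so it still qualifies when two vegetarian meals are required and drops out at three.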
I also see no reason to put double quotes around column names. That just makes the query harder to write and read.

problem with Update in MySQL

According to the documentation, joins, when used with the update statement, work in the same way as when used in selects.
For example, if we have these two tables:
mysql> SELECT * FROM orders;
+---------+------------+
| orderid | customerid |
+---------+------------+
| 1 | 1 |
| 2 | 2 |
| 3 | 3 |
| 4 | 1 |
+---------+------------+
mysql> SELECT * FROM customers;
+------------+------------+
| customerid | ordercount |
+------------+------------+
| 1 | 9 |
| 2 | 3 |
| 3 | 8 |
| 4 | 5 |
| 5 | 7 |
+------------+------------+
using this select statement:
SELECT orders.customerid
FROM orders
JOIN customers ON (customers.customerid = orders.customerid)
returns:
+------------+
| customerid |
+------------+
| 1 |
| 1 |
| 2 |
| 3 |
+------------+
So, I was expecting the statement below:
UPDATE orders
JOIN customers ON (customers.customerid = orders.customerid)
SET ordercount = ordercount + 1
to update ordercount for customer #1 (customerid = 1) to be 11, but actually this is not the case, here are the results after the update:
mysql> SELECT * FROM customers;
+------------+------------+
| customerid | ordercount |
+------------+------------+
| 1 | 10 |
| 2 | 4 |
| 3 | 9 |
| 4 | 5 |
| 5 | 7 |
+------------+------------+
As you can see it was only incremented once despite that it occurs twice in the orders table and despite that the select statement returns it correctly.
Is this a bug in MySQL or is it me doing something wrong? I'm trying to avoid using group by for performance reasons hence my interest to understand what's going on.
Thanks in advance
Yes, MySQL updates each record in a joined table at most once.
I cannot find it in the documentation, but practice says so.
I'll probably post it as a bug, so they at least add it to documentation:
CREATE TABLE updater (value INT NOT NULL);
INSERT
INTO updater
VALUES (1);
SELECT *
FROM updater;
value
---
1
UPDATE updater u
JOIN (
SELECT 1 AS newval
UNION ALL
SELECT 2
) q
SET u.value = u.value + newval;
SELECT *
FROM updater;
value
---
2
(expected 4).
SQL Server, by the way, behaves the same way in a multiple-table UPDATE.
You can use:
UPDATE customers c
SET ordercount = ordercount +
(
SELECT COUNT(*)
FROM orders o
WHERE o.customerid = c.customerid
)
which performs about the same as long as you have an index on orders (customerid).
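The correlated-subquery idea can be tried on the question's data as follows. This is a sketch on SQLite via Python's sqlite3, not MySQL, so it only illustrates the counting logic: each customers row is updated once, by the number of matching rows in orders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders    (orderid INT, customerid INT);
CREATE TABLE customers (customerid INT, ordercount INT);
INSERT INTO orders    VALUES (1, 1), (2, 2), (3, 3), (4, 1);
INSERT INTO customers VALUES (1, 9), (2, 3), (3, 8), (4, 5), (5, 7);
""")

# Each customer row is touched once; the subquery supplies the increment.
conn.execute("""
UPDATE customers
SET ordercount = ordercount + (
    SELECT COUNT(*) FROM orders o WHERE o.customerid = customers.customerid
)
""")
counts = conn.execute(
    "SELECT customerid, ordercount FROM customers ORDER BY customerid"
).fetchall()
print(counts)
```

Customer 1 ends up at 11 here, which is the value the question expected, and customers without orders are left unchanged because their count is 0.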