Frequent itemset SQL - sql

I'm using SAS for a piece of coursework. At the moment, I have a set of Order IDs and Product IDs. I want to find out which products are most frequently ordered together. Think milk and cereal in a grocery basket.
I am not very good at programming, so I would really appreciate it if anyone could spare a bit of time and write a few simple lines of SQL that I can easily use. It's not a heavy dataset and there are only two columns (Order_ID and Product_ID).
For example:
Order ID Product ID
10001 64564564
10001 546456
10001 54646
10003 5464
10003 342346
I've spent three hours researching now and am a bit desperate :(

If you think about it, you can find the answer by asking the question this way: for every possible pair of products, how many times did the two products occur on the same order? Then order by the count to float the answer(s) to the top:
select
p1.product_id, p2.product_id, count(*) times_order_together
from
orders p1
inner join
orders p2
on
p1.order_id = p2.order_id
and
p1.product_id != p2.product_id
group by
p1.product_id, p2.product_id
order by
count(*) desc
Products that weren't ever ordered together don't show up at all. Also - pairs are represented twice - a row for eggs with milk and a row for milk with eggs. These duplicate pairs are removable - but it gets uglier - and simple is good.
To elaborate a bit, p1 and p2 are aliases of orders. You do that to be able to use a data source more than once - and yet distinguish between them. Also, the count(*) times_order_together is just giving the name 'times_order_together' to the calculation count(*). It's counting the number of times a product pairing occurs in an order.

How about something like:
create table order_together (order_id int, product_id1 int, product_id2 int);
insert into order_together
(order_id, product_id1, product_id2)
select o1.order_id, o1.product_id, o2.product_id
from order_line o1, order_line o2
where o1.order_id = o2.order_id
/* you don't want them equal, and you also don't
want to insert both cereal-milk and milk-cereal for the same order */
and o1.product_id < o2.product_id;
Now you have pairs of products together and you can go wild with counts and stats. Mind you, this is quite naive and would blow up in volume quite quickly.
Maybe
select count(distinct order_id), o1.product_id, o2.product_id
...
group by o1.product_id, o2.product_id
would be better.
In response to the comment:
You are grabbing pairs of products that were ordered together, coming from different rows of the same order's order_lines.
Try this on sqlfiddle.com.
Put this in the left (Build Schema) pane. It creates the tables.
create table order_line(order_no int, product_id varchar(10));
create table order_together(order_no int, product_id1 varchar(10), product_id2 varchar(10));
Put this in the right pane and run the SQL:
insert into order_line(order_no, product_id) values(1, 'milk');
insert into order_line(order_no, product_id) values (1, 'cereal');
insert into order_line(order_no, product_id) values (1, 'rice');
insert into order_line(order_no, product_id) values (2, 'milk');
insert into order_line(order_no, product_id) values (2, 'cereal');
insert into order_line(order_no, product_id) values (3, 'milk');
insert into order_line(order_no, product_id) values (3, 'cookies');
insert into order_line(order_no, product_id) values(4, 'milk');
insert into order_line(order_no, product_id) values (4, 'cookies');
insert into order_line(order_no, product_id) values(5, 'rice');
insert into order_line(order_no, product_id) values (5, 'icecream');
select o1.order_no, o1.product_id as product_from_row1, o2.product_id as product_from_row2
from order_line o1, order_line o2
where o1.order_no = o2.order_no
and o1.product_id < o2.product_id
gives:
order_no product_from_row1 product_from_row2
1 milk rice
1 cereal milk
1 cereal rice
2 cereal milk
3 cookies milk
4 cookies milk
5 icecream rice
Give it a try, then think about what the query is requesting, which is a join between different order_lines of the same order. That's pretty much the definition of "ordered together".
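As a minimal sketch of the counting step on top of that join, the sample rows above can be loaded into an in-memory SQLite database and grouped by product pair. Using Python's sqlite3 module is only an assumption made to keep the example runnable; the SQL itself is the part that matters, and the column alias orders_together is made up for illustration.

```python
import sqlite3

# In-memory SQLite database; table and column names follow the answer above.
conn = sqlite3.connect(":memory:")
conn.execute("create table order_line(order_no int, product_id varchar(10))")
conn.executemany(
    "insert into order_line(order_no, product_id) values (?, ?)",
    [(1, "milk"), (1, "cereal"), (1, "rice"),
     (2, "milk"), (2, "cereal"),
     (3, "milk"), (3, "cookies"),
     (4, "milk"), (4, "cookies"),
     (5, "rice"), (5, "icecream")],
)

# Self-join on the order, keep each pair once (o1 < o2), count orders per pair.
pair_counts = conn.execute("""
    select o1.product_id, o2.product_id,
           count(distinct o1.order_no) as orders_together
    from order_line o1
    join order_line o2
      on o1.order_no = o2.order_no
     and o1.product_id < o2.product_id
    group by o1.product_id, o2.product_id
    order by orders_together desc, o1.product_id, o2.product_id
""").fetchall()

for row in pair_counts:
    print(row)
```

The most frequent pairs come first; pairs that never share an order produce no row at all.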

Related

Ordinary task - complex solution

I have some users (C1, C2, C3, etc.) who handle products (aa, bb, cc, dd, ee, ff, gg, hh, etc.) in different stores (St1, St2, St3, St4, etc.).
Every user in the list wants to know in which store they can get each product cheapest.
What should the tables and the queries look like if a user wants to get at least the following 3 things (one at a time):
1- Get their own list of products. Example:
Pr. St1 St2 St3 St4
aa $20 $12 $19 $22
bb $31 $44 $38 $44
cc $18 $12 $19 $22
dd $36 $44 $38 $44
ee $15 $12 $19 $22
2- Get a list of the lowest prices (greater than 0) and see how much he/she saves compared with handling the same products in the other stores. Example:
Pr. St4 St1 St3 St2
bb $23 $27 $26 $28
ee $14 $15 $15 $20
hh $36 $38 $40 $37
Sum $73 $80 $81 $85
Count products 3.
Pr. St2 St1 St3 St4
aa $32 $33 $38 $36
cc $21 $29 $27 $25
ff $13 $14 $17 $20
Sum $66 $76 $82 $81
Count products 3.
3- Get a list of the products that have a zero price in each store. Example:
Pr. St1 St2 St3 St4
kk $00 $12 $19 $22
ii $00 $44 $38 $44
Pr. St2 St1 St3 St4
ll $00 $21 $52 $20
mm $00 $13 $17 $15
A primitive, not very good solution:
CREATE TABLE usrs (
idc INT NOT NULL,
name VARCHAR(50) NOT NULL,
stores VARCHAR(50) NOT NULL,
pwd VARBINARY(72) NOT NULL,
PRIMARY KEY (idc)
)
COMMENT='Customers'
COLLATE='utf8_general_ci'
ENGINE=MyISAM;
CREATE TABLE stores (
ids INT NOT NULL,
nm VARCHAR(50) NOT NULL,
PRIMARY KEY (ids)
)
COLLATE='utf8_general_ci'
ENGINE=MyISAM;
CREATE TABLE products (
idp INT(11) NOT NULL AUTO_INCREMENT,
prod VARCHAR(50) NOT NULL,
st1 MEDIUMINT(9) NULL DEFAULT '0',
st2 MEDIUMINT(9) NULL DEFAULT '0',
st3 MEDIUMINT(9) NULL DEFAULT '0',
st4 MEDIUMINT(9) NULL DEFAULT '0',
PRIMARY KEY (idp)
)
COLLATE='utf8_general_ci'
ENGINE=MyISAM;
INSERT INTO products (prod,st1,st2,st3,st4) VALUES ('aa',14,20,13,17);
INSERT INTO products (prod,st1,st2,st3,st4) VALUES ('bb',33,29,38,33);
INSERT INTO products (prod,st1,st2,st3,st4) VALUES ('cc',19,20,00,21);
INSERT INTO products (prod,st1,st2,st3,st4) VALUES ('dd',22,29,25,33);
INSERT INTO products (prod,st1,st2,st3,st4) VALUES ('ee',30,00,35,29);
INSERT INTO products (prod,st1,st2,st3,st4) VALUES ('ff',10,14,11,13);
INSERT INTO products (prod,st1,st2,st3,st4) VALUES ('gg',00,00,00,00);
INSERT INTO products (prod,st1,st2,st3,st4) VALUES ('hh',16,22,30,10);
INSERT INTO products (prod,st1,st2,st3,st4) VALUES ('ii',23,34,34,26);
INSERT INTO products (prod,st1,st2,st3,st4) VALUES ('jj',41,32,39,41);
INSERT INTO products (prod,st1,st2,st3,st4) VALUES ('kk',25,29,26,19);
INSERT INTO products (prod,st1,st2,st3,st4) VALUES ('ll',24,27,10,24);
INSERT INTO products (prod,st1,st2,st3,st4) VALUES ('mm',29,41,37,36);
I don't know what the tables should look like when there are more than 2 stores and more than 1 user.
In my solution I have to change the tables manually whenever users or stores are added. I understand that a relationship between the tables is needed, but I can't work it out. Bad, very bad ...
Thanks in advance!
You could change your solution such that you had a list of prices connecting stores to products.
A price would be a tuple of Product, Store, and Price.
Then, you could select with a left join on this table. The where clause could have a specified store id, a price requirement, and/or product id.
Example:
I have modified the names of your tables for better clarity. Examples tested on MariaDB.
First, the models I will use:
CREATE TABLE `stores` (
`id` INT NOT NULL,
`name` VARCHAR(50) NOT NULL,
PRIMARY KEY (`id`)
);
CREATE TABLE `products` (
`id` INT(11) NOT NULL AUTO_INCREMENT,
`name` VARCHAR(50) NOT NULL,
PRIMARY KEY (`id`)
);
CREATE TABLE `prices` (
`product_id` INT(11) NOT NULL,
`store_id` INT NOT NULL,
`price` DOUBLE NOT NULL,
CONSTRAINT productIdToStoreID UNIQUE (product_id, store_id)
);
Here is how I would populate the database
INSERT INTO stores (id, name) VALUES (10, "Foo's Fun");
INSERT INTO stores (id, name) VALUES (20, "Bar's Barn");
INSERT INTO products (id, name) VALUES (10, "0xC0C0C01A");
INSERT INTO products (id, name) VALUES (20, "0xDEADBEEF");
INSERT INTO prices (product_id, store_id, price) VALUES (10, 10, 1.50);
INSERT INTO prices (product_id, store_id, price) VALUES (10, 20, 1.60);
INSERT INTO prices (product_id, store_id, price) VALUES (20, 10, 40);
INSERT INTO prices (product_id, store_id, price) VALUES (20, 20, 35);
Here are some starting-point queries you could write:
SELECT store_id, price FROM prices WHERE product_id = 10; -- All prices for product with id 10
SELECT store_id, price FROM prices WHERE product_id = 10 ORDER BY price ASC LIMIT 1; -- Get Cheapest Price and Store for product 10
SELECT stores.name, price FROM prices LEFT JOIN stores ON (stores.id = store_id) WHERE product_id = 10 ORDER BY price ASC LIMIT 1; -- Get the Store name and price.
Motivation:
We need a way to relate n stores and m products. (Not sure where users go here.) We need a data structure that can hold:
Whether a store has an item (conversely, whether a product is in a store). We can see that this is a many-to-many relationship. SQL databases tend to be really optimized for this type of relationship.
The price of a product at a store.
Thus, we create a table that links a product to a store uniquely, with the extra metadata of the price.
As you pointed out, the alternative is a table that contains all of the stores as columns. However, this does not scale well at all, because adding, removing, or renaming a store means altering the table, which is an O(n) update over its rows. In a production environment this would be detrimental, as it would likely lock the table. But I digress.
This solution solves the issue by decoupling the store list and the product list from the actual prices. If a price exists at a store for a product, there is a row. This means that the space complexity is at worst O(n*m), one row per store/product pair, but in reality it would be much less, as not all stores would have all products.
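To show how the prices table answers the "cheapest store per product" request, here is a small sketch that loads the sample rows above into an in-memory SQLite database. Running it through Python's sqlite3 module (rather than MariaDB) and ignoring prices of 0 are assumptions made for illustration; the schema is a trimmed copy of the one above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stores (id INT PRIMARY KEY, name VARCHAR(50) NOT NULL);
    CREATE TABLE products (id INT PRIMARY KEY, name VARCHAR(50) NOT NULL);
    CREATE TABLE prices (
        product_id INT NOT NULL,
        store_id INT NOT NULL,
        price DOUBLE NOT NULL,
        UNIQUE (product_id, store_id)
    );
""")
conn.executemany("INSERT INTO stores VALUES (?, ?)",
                 [(10, "Foo's Fun"), (20, "Bar's Barn")])
conn.executemany("INSERT INTO products VALUES (?, ?)",
                 [(10, "0xC0C0C01A"), (20, "0xDEADBEEF")])
conn.executemany("INSERT INTO prices VALUES (?, ?, ?)",
                 [(10, 10, 1.50), (10, 20, 1.60), (20, 10, 40), (20, 20, 35)])

# For each product, keep the row whose price equals that product's lowest
# non-zero price (a price of 0 is treated as "not available" here).
cheapest = conn.execute("""
    SELECT p.name, s.name, pr.price
    FROM prices pr
    JOIN products p ON p.id = pr.product_id
    JOIN stores s ON s.id = pr.store_id
    WHERE pr.price > 0
      AND pr.price = (SELECT MIN(x.price)
                      FROM prices x
                      WHERE x.product_id = pr.product_id AND x.price > 0)
    ORDER BY p.name
""").fetchall()

for row in cheapest:
    print(row)
```

Adding a store or a product is then just an INSERT into stores, products, and prices; no table definition has to change.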

SQL statement that displays items where buying a combo meal is cheaper than buying individual items

*NOTE! I am asking for study reasons. I have no idea how to write this query, and I haven't been able to find anything like it in references.
Hello stack, I have this query as an example of using temporary tables in a subquery for a join, and I have literally no idea how to do this. I am mostly doing this for study reasons. Here are the two tables used in the example. Again, the query is: display items where buying the combo is cheaper than buying the individual items.
create table items(
I# varchar2 (20),
IName varchar2(20),
UnitPrice Number)
;
insert into items
values ('101','Cheese Burger',3.99);
insert into items
values ('102','Double Cheeseburger',4.99);
insert into items
values ('103','French Fries',1.19);
insert into items
values ('104','Medium Drink',1.39);
insert into items
values ('105','Large Drink',1.89);
insert into items
values('106','Combo 1',6.99);
insert into items
values('107','Combo 2',8.99);
create table itemdetails(
I# varchar2(20),
IncludeI# varchar2(20)
);
insert into itemdetails
values ('106','101');
insert into itemdetails
values ('106','103');
insert into itemdetails
values ('106','104');
insert into itemdetails
values ('107','102');
insert into itemdetails
values ('107','103');
insert into itemdetails
values ('107','105');
Any answer would be greatly appreciated. I'm using standard oracle SQL.
You could query the sum of the component prices of a combo (this is the temporary table), e.g.
select sum(I1.UnitPrice)
from items I1, itemdetails I2
where I1.I# = I2.IncludeI#
and I2.I# = [one I'm looking at]
group by I2.I#
And then search for the combo prices
select I3.IName
from items I3
where UnitPrice < [sum of the individual prices of I3.I#]
Then combine the two.
select I3.IName
from items I3
where UnitPrice <
(
select sum(I1.UnitPrice)
from items I1, itemdetails I2
where I1.I# = I2.IncludeI#
and I2.I# = I3.I#
group by I2.I#
)
You could join itemdetails to items twice, once for the price of the combo and once for the prices of the individual items:
SELECT combo.IName, combo.UnitPrice
FROM itemdetails id
JOIN items combo ON id.I# = combo.I#
JOIN items individual ON id.IncludeI# = individual.I#
GROUP BY combo.IName, combo.UnitPrice
HAVING combo.UnitPrice < SUM(individual.UnitPrice)
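The double join can be checked against the sample rows with a short sketch. It runs on an in-memory SQLite database through Python's sqlite3 module rather than Oracle, and the column names item_id, include_id, iname, and unit_price are stand-ins (an assumption) for I#, IncludeI#, IName, and UnitPrice, since # is awkward in SQLite identifiers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table items (item_id text, iname text, unit_price real);
    create table itemdetails (item_id text, include_id text);
""")
conn.executemany("insert into items values (?, ?, ?)", [
    ("101", "Cheese Burger", 3.99), ("102", "Double Cheeseburger", 4.99),
    ("103", "French Fries", 1.19), ("104", "Medium Drink", 1.39),
    ("105", "Large Drink", 1.89), ("106", "Combo 1", 6.99),
    ("107", "Combo 2", 8.99),
])
conn.executemany("insert into itemdetails values (?, ?)", [
    ("106", "101"), ("106", "103"), ("106", "104"),
    ("107", "102"), ("107", "103"), ("107", "105"),
])

# Each combo next to the summed price of the items it includes.
combo_totals = conn.execute("""
    select combo.iname, combo.unit_price,
           round(sum(part.unit_price), 2) as parts_total
    from itemdetails d
    join items combo on combo.item_id = d.item_id
    join items part on part.item_id = d.include_id
    group by combo.item_id, combo.iname, combo.unit_price
    order by combo.item_id
""").fetchall()

# Only the combos that cost less than their parts bought separately.
cheaper_combos = conn.execute("""
    select combo.iname
    from itemdetails d
    join items combo on combo.item_id = d.item_id
    join items part on part.item_id = d.include_id
    group by combo.item_id, combo.iname, combo.unit_price
    having combo.unit_price < sum(part.unit_price)
""").fetchall()

print(combo_totals)
print(cheaper_combos)
```

With the sample prices as posted, the parts add up to less than either combo price, so the filtered query comes back empty; lowering a combo's price below its parts total would make it appear.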

How to Get Sum of One Column Based On Other Table in Sql Server

I have 2 tables in my database (like this):
tblCustomers:
id CustomerName
1 aaa
2 bbb
3 ccc
4 ddd
5 eee
6 fff
tblPurchases:
id CustomerID Price
1 1 300
2 2 100
3 3 500
4 1 150
5 4 50
6 3 250
7 6 700
8 2 30
9 1 310
10 4 25
Now I want to use a stored procedure to produce a new table that gives me the sum of prices for each customer, exactly like the one below.
How can I do that?
Procedure result:
id CustomerName SumPrice
1 aaa 760
2 bbb 130
3 ccc 750
4 ddd 75
5 eee 0
6 fff 700
select c.id, c.customername, sum(isnull(p.price, 0)) as sumprice
from tblcustomers c
left join tblpurchases p
on c.id = p.customerid
group by c.id, c.customername
SQL Fiddle test: http://sqlfiddle.com/#!3/9b573/1/0
Note the need for an outer join because your desired result includes customers with no purchases.
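As a quick check of the outer-join approach against the sample data, here is a sketch that runs on an in-memory SQLite database via Python's sqlite3 module. That setup is an assumption made to keep it self-contained, and COALESCE stands in for SQL Server's ISNULL, which SQLite does not provide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table tblCustomers (id int, CustomerName varchar(10));
    create table tblPurchases (id int, CustomerID int, Price int);
""")
conn.executemany("insert into tblCustomers values (?, ?)",
                 [(1, "aaa"), (2, "bbb"), (3, "ccc"),
                  (4, "ddd"), (5, "eee"), (6, "fff")])
conn.executemany("insert into tblPurchases values (?, ?, ?)",
                 [(1, 1, 300), (2, 2, 100), (3, 3, 500), (4, 1, 150),
                  (5, 4, 50), (6, 3, 250), (7, 6, 700), (8, 2, 30),
                  (9, 1, 310), (10, 4, 25)])

# LEFT JOIN keeps customers without purchases; COALESCE turns their NULL into 0.
totals = conn.execute("""
    select c.id, c.CustomerName, sum(coalesce(p.Price, 0)) as SumPrice
    from tblCustomers c
    left join tblPurchases p on c.id = p.CustomerID
    group by c.id, c.CustomerName
    order by c.id
""").fetchall()

for row in totals:
    print(row)
```

Customer eee has no purchases and still appears, with a total of 0.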
You can use the query below to get the result:
select id,CustomerName,sum(isnull(price,0)) as TotalPrice
from
(
select tc.id,tc.CustomerName,tp.price
from tblCustomers tc
left join
tblPurchases tp on tc.id = tp.CustomerID
) tab
group by id,CustomerName
Although the other answers here do work, they don't appear to be what I would consider standard practice, or optimal.
The simplest solution (standard, but not always optimal) requires no sub-query of any variety.
SELECT
cust.id,
cust.CustomerName,
SUM(prch.price) AS SumPrice
FROM
tblCustomers AS cust
INNER JOIN
tblPurchases AS prch
ON cust.id = prch.CustomerID
GROUP BY
cust.id,
cust.CustomerName
The only reason that this is not necessarily optimal is that it involves grouping by two fields, one of which is a string. This involves creating 'counters' in memory that are identified by this composite of an id and a string, which can be inefficient because you only really need the id to uniquely identify the counter. (The identifier is then a single small item, probably only 4 bytes, rather than multiple items, one of which is long, potentially many bytes.)
This means that you can do the following as a possible optimisation. Depending on your data this may be a premature optimisation, but it has no performance downside and is always good to know about...
SELECT
cust.id,
cust.CustomerName,
prch.SumPrice
FROM
tblCustomers AS cust
INNER JOIN
(
SELECT
CustomerID,
SUM(price) AS SumPrice
FROM
tblPurchases
GROUP BY
CustomerID
) AS prch
ON cust.id = prch.CustomerID
This makes the in-memory aggregation as simple as possible, and so as quick as possible.
In both cases you should get the best possible efficiency from the query by ensuring that you have indexes on tblCustomers(id) and on tblPurchases(CustomerID).
DECLARE @tblcustomers table (id int, customername varchar(10));
insert into @tblcustomers values (1, 'aaa');
insert into @tblcustomers values (2, 'bbb');
insert into @tblcustomers values (3, 'ccc');
insert into @tblcustomers values (4, 'ddd');
insert into @tblcustomers values (5, 'eee');
insert into @tblcustomers values (6, 'fff');
DECLARE @tblpurchases table (id int, customerid int, price int);
insert into @tblpurchases values (1, 1, 300);
insert into @tblpurchases values (2, 2, 100);
insert into @tblpurchases values (3, 3, 500);
insert into @tblpurchases values (4, 1, 150);
insert into @tblpurchases values (5, 4, 50);
insert into @tblpurchases values (6, 3, 250);
insert into @tblpurchases values (7, 6, 700);
insert into @tblpurchases values (8, 2, 30);
insert into @tblpurchases values (9, 1, 310);
insert into @tblpurchases values (10, 4, 25);
WITH CTE AS(
select c.id,c.customername from @tblcustomers c
)
Select c.id,c.customername,(Select ISNULL(SUM(P.price),0) from @tblpurchases P
WHERE P.customerid = C.id) AS Price from CTE c

SQL: couple people who assisted to the same event

create table people(
id_pers int,
nom_pers char(25),
d_nais date,
d_mort date,
primary key(id_pers)
);
create table event(
id_evn int,
primary key(id_evn)
);
create table assisted_to(
id_pers int,
id_evn int,
foreign key (id_pers) references people(id_pers),
foreign key (id_evn) references event(id_evn)
);
insert into people(id_pers, nom_pers, d_nais, d_mort) values (1, 'A', current_date - integer '20', current_date);
insert into people(id_pers, nom_pers, d_nais, d_mort) values (2, 'B', current_date - integer '50', current_date - integer '20');
insert into people(id_pers, nom_pers, d_nais, d_mort) values (3, 'C', current_date - integer '25', current_date - integer '20');
insert into event(id_evn) values (1);
insert into event(id_evn) values (2);
insert into event(id_evn) values (3);
insert into event(id_evn) values (4);
insert into event(id_evn) values (5);
insert into assisted_to(id_pers, id_evn) values (1, 5);
insert into assisted_to(id_pers, id_evn) values (2, 5);
insert into assisted_to(id_pers, id_evn) values (2, 4);
insert into assisted_to(id_pers, id_evn) values (3, 5);
insert into assisted_to(id_pers, id_evn) values (3, 4);
insert into assisted_to(id_pers, id_evn) values (3, 3);
I need to find couples who assisted to the same event on any particular day.
I tried:
select p1.id_pers, p2.id_pers from people p1, people p2, assisted_event ae
where ae.id_pers = p1.id_pers
and ae.id_pers = p2.id_pers
But returns 0 rows.
What am I doing wrong?
Try this:
select distinct ae1.id_evn,
p1.nom_pers personA, p2.nom_pers personB
from assisted_to ae1
join assisted_to ae2
on ae2.id_evn = ae1.id_evn
and ae2.id_pers > ae1.id_pers
join people p1
on p1.id_pers = ae1.id_pers
join people p2
on p2.id_pers = ae2.id_pers
This generates all pairs of people [couples] who assisted on the same event. With your schema, there is no way to restrict the results to cases where they assisted on the same day. The assumption is that if they assisted on the same event, then that event can only have occurred on one day.
You select two persons, so you need to select two assisted_to rows as well, because each person has their own row in the assisted_to table. The idea is to build a link between p1 and p2 through a pair of assisted_to rows sharing the same id_evn:
select p1.id_pers, p2.id_pers
from people p1, people p2
where p1.id_pers < p2.id_pers
and exists (
select *
from assisted_to e1
join assisted_to e2 on e1.id_evn=e2.id_evn
where e1.id_pers=p1.id_pers and e2.id_pers=p2.id_pers
)
When re-phrased into ANSI JOIN syntax so I can read it, your query reads:
select p1.id_pers, p2.id_pers
from assisted_event ae
inner join people p1 ON (ae.id_pers = p1.id_pers)
inner join people p2 ON (ae.id_pers = p2.id_pers)
Since id_pers is the primary key of people, ae.id_pers can only equal both p1.id_pers and p2.id_pers when p1 and p2 are the same row, so this query can never pair two different people. You'll need to find another approach.
You don't need to join on people at all for this, though you'll probably want to in order to populate their details. You need to self-join the people-to-events join table not the people table in order to get the desired results, filtering the self-join to include only rows where the event ID is the same but the people are different. Using > rather than <> means you don't have to use another pass to filter out the (a,b) vs (b,a) pairings.
Something like:
select ae1.id_evn event_id, ae1.id_pers id_pers1, ae2.id_pers id_pers2
from assisted_to ae1
inner join assisted_to ae2
on (ae2.id_evn = ae1.id_evn and ae1.id_pers > ae2.id_pers)
You can now, if desired, add additional joins on the event and people tables to populate details. You'll need to join people twice with different aliases to populate the two different "sides". See Charles Bretana's example.
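Putting the self-join and the two people joins together, a minimal sketch on the question's sample data might look like this. It uses an in-memory SQLite database via Python's sqlite3 module, and the people table is trimmed to id and name; both are assumptions made to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table people (id_pers int primary key, nom_pers char(25));
    create table assisted_to (id_pers int, id_evn int);
""")
conn.executemany("insert into people values (?, ?)",
                 [(1, "A"), (2, "B"), (3, "C")])
conn.executemany("insert into assisted_to values (?, ?)",
                 [(1, 5), (2, 5), (2, 4), (3, 5), (3, 4), (3, 3)])

# Self-join the link table on the event; the > comparison keeps each pair once
# and drops a person paired with themselves.
pairs = conn.execute("""
    select ae1.id_evn, p1.nom_pers, p2.nom_pers
    from assisted_to ae1
    join assisted_to ae2
      on ae2.id_evn = ae1.id_evn
     and ae2.id_pers > ae1.id_pers
    join people p1 on p1.id_pers = ae1.id_pers
    join people p2 on p2.id_pers = ae2.id_pers
    order by ae1.id_evn, p1.nom_pers, p2.nom_pers
""").fetchall()

for row in pairs:
    print(row)
```

Event 3 has a single attendee, so it contributes no pair.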

Select distinct rows based on some, but not all columns

I originally ran into this problem while working on SQL queries that select certain aggregate values (min, max etc) from grouped results. For example, select the cheapest fruit, its variety and the price, off each fruit group. The common solution is to first group the fruits along with the cheapest price using MIN, then self join it to get the other column ("variety" in this case).
Now say if we have more than one variety of a fruit with the same price, and that price happened to be the lowest price. So we end up getting results like this:
Apple Fuji 5.00
Apple Green 5.00
Orange valencia 3.00
Pear bradford 6.00
How do I make it so that only one kind of apple shows up in the final result? It can be any one of the varieties, be it the record that shows up the first, last or random.
So basically I need to eliminate rows based on two of the three columns being equal, and it doesn't matter which rows get eliminated as long as there is one left in the final result set.
Any help would be appreciated.
Try this... I added more fruits. The way to read it is to start from the inner most From clause and work your way out.
create table fruit (
FruitName varchar(50) not null,
FruitVariety varchar(50) not null,
Price decimal(10,2) not null
)
insert into fruit (FruitName, FruitVariety, Price)
values ('Apple','Fuji',5.00)
insert into fruit (FruitName, FruitVariety, Price)
values ('Apple','Green',5.00)
insert into fruit (FruitName, FruitVariety, Price)
values ('Orange','Valencia',3.00)
insert into fruit (FruitName, FruitVariety, Price)
values ('Orange','Navel',5.00)
insert into fruit (FruitName, FruitVariety, Price)
values ('Pear','Bradford',6.00)
insert into fruit (FruitName, FruitVariety, Price)
values ('Pear','Nashi',8.00)
select
rankedCheapFruits.FruitName,
rankedCheapFruits.FruitVariety,
rankedCheapFruits.Price
from (
select
f.FruitName,
f.FruitVariety,
f.Price,
row_number() over(
partition by f.FruitName
order by f.FruitName, f.FruitVariety
) as FruitRank
from (
select
f.FruitName,
min(f.Price) as LowestPrice
from Fruit f
group by
f.FruitName
) as cheapFruits
join Fruit f on cheapFruits.FruitName = f.FruitName
and f.Price = cheapFruits.LowestPrice
) rankedCheapFruits
where rankedCheapFruits.FruitRank = 1
You could use the MIN aggregate on the variety column as well; grouped by fruit and price, that would limit the result to a single row per fruit.
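One way to read the MIN suggestion is sketched below: join back to the cheapest price per fruit, then collapse ties by taking the alphabetically first variety. The in-memory SQLite database driven from Python's sqlite3 module and the use of a real column for Price are assumptions made to keep the example runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    create table fruit (
        FruitName varchar(50) not null,
        FruitVariety varchar(50) not null,
        Price real not null
    )
""")
conn.executemany("insert into fruit values (?, ?, ?)", [
    ("Apple", "Fuji", 5.00), ("Apple", "Green", 5.00),
    ("Orange", "Valencia", 3.00), ("Orange", "Navel", 5.00),
    ("Pear", "Bradford", 6.00), ("Pear", "Nashi", 8.00),
])

# The inner query finds the lowest price per fruit; the outer MIN picks one
# variety among those tied at that price.
cheapest = conn.execute("""
    select f.FruitName, min(f.FruitVariety) as FruitVariety, f.Price
    from fruit f
    join (select FruitName, min(Price) as LowestPrice
          from fruit
          group by FruitName) c
      on c.FruitName = f.FruitName
     and f.Price = c.LowestPrice
    group by f.FruitName, f.Price
    order by f.FruitName
""").fetchall()

for row in cheapest:
    print(row)
```

This only works cleanly when a single extra column has to be carried along; with several extra columns, separate MIN calls could mix values from different rows, which is where the row_number() approach is the safer choice.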
One option is to rank the rows based on some criteria (alphabetical order of fruit variety) and then pick the minimum of the rank.
There is a rank() function in ms-sql for exactly this purpose.