How to find every customer's favourite category with a query - sql

I have a table in MS Access which looks basically like this:
Table Name : Customer_Categories
+----------------------+------------+-------+
| Email                | CategoryID | Count |
+----------------------+------------+-------+
| jim@example.com      | 10         | 4     |
+----------------------+------------+-------+
| jim@example.com      | 2          | 1     |
+----------------------+------------+-------+
| simon@example.com    | 5          | 2     |
+----------------------+------------+-------+
| steven@example.com   | 10         | 16    |
+----------------------+------------+-------+
| steven@example.com   | 5          | 3     |
+----------------------+------------+-------+
The table holds ≈ 350,000 records. Its characteristics are:
Email, CategoryID and Count can each contain duplicate values
Count is the number of times the customer has ordered from that category
What I want
I want to create a table containing each unique email address along with the CategoryID that customer has purchased from the most.
For the example above, that would be:
+----------------------+------------+
| Email                | CategoryID |
+----------------------+------------+
| jim@example.com      | 10         |
+----------------------+------------+
| simon@example.com    | 5          |
+----------------------+------------+
| steven@example.com   | 10         |
+----------------------+------------+
What I have tried
I have written a query that achieves what I want:
SELECT main.Email, (SELECT TOP 1 CategoryID
FROM Customer_Categories
WHERE main.Email = Email
GROUP BY CategoryID
ORDER BY MAX(Count) DESC, CategoryID ASC) AS Category
FROM Customer_Categories AS main
GROUP BY main.Email;
This works a treat and does exactly what I want, returning results in around 8 seconds. However, I need this data in a new table because I then want to update another table with the CategoryID. When I add INTO Customer_Favourite_Categories after the sub-query, so that the data goes into a new table instead of just being returned as a result set, the query never finishes. I've left it running for about 45 minutes and it does nothing.
Is there any way around this?

If select into doesn't work, use insert into:
create table Customer_Favorite_Categories (
email <email type>,
FavoriteCategory <CategoryId type>
);
insert into Customer_Favorite_Categories
SELECT main.Email, (SELECT TOP 1 CategoryID
FROM Customer_Categories
WHERE main.Email = Email
GROUP BY CategoryID
ORDER BY MAX(Count) DESC, CategoryID ASC) AS Category
FROM Customer_Categories AS main
GROUP BY main.Email;

Try this:
SELECT Distinct(Email),Max(CategoryID )
FROM Customer_Categories group by Email

I use sub-queries for this quite frequently. Your query in "What I have tried" is close; something like the following should get what you are after. Count is in square brackets since it's a reserved word in SQL. The spacing is just my own style, so edit to your liking. Note that if a customer has two categories tied for the highest count, this returns a row for each of them.
SELECT m.Email,
m.CategoryID
FROM MyTable AS m,
(
SELECT Email,
MAX( [Count] ) AS mc
FROM MyTable
GROUP BY Email
) AS f
WHERE m.Email = f.Email
AND m.[Count] = f.mc;
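For anyone who wants to try the join-on-max idea outside Access, here is a minimal sketch using Python's built-in sqlite3 module. It is an assumption-laden illustration, not the Access query itself: SQLite has no TOP 1, so ties are broken with MIN(CategoryID), which mirrors the CategoryID ASC tie-break in the question. The function name favourite_categories and the in-memory database are made up for the example; the table and column names come from the question.

```python
import sqlite3

# Sample rows from the question: (Email, CategoryID, Count).
SAMPLE = [
    ("jim@example.com", 10, 4),
    ("jim@example.com", 2, 1),
    ("simon@example.com", 5, 2),
    ("steven@example.com", 10, 16),
    ("steven@example.com", 5, 3),
]


def favourite_categories(rows):
    """Fill Customer_Favourite_Categories and return its rows sorted by Email."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Customer_Categories "
        "(Email TEXT, CategoryID INTEGER, [Count] INTEGER)"
    )
    conn.executemany("INSERT INTO Customer_Categories VALUES (?, ?, ?)", rows)
    conn.execute(
        "CREATE TABLE Customer_Favourite_Categories "
        "(Email TEXT PRIMARY KEY, CategoryID INTEGER)"
    )
    # Derived table f holds each customer's highest Count; joining back on it
    # keeps only the top categories, and MIN() picks the lowest CategoryID on ties.
    conn.execute(
        """
        INSERT INTO Customer_Favourite_Categories (Email, CategoryID)
        SELECT m.Email, MIN(m.CategoryID)
        FROM Customer_Categories AS m
        JOIN (SELECT Email, MAX([Count]) AS mc
              FROM Customer_Categories
              GROUP BY Email) AS f
          ON m.Email = f.Email AND m.[Count] = f.mc
        GROUP BY m.Email
        """
    )
    result = conn.execute(
        "SELECT Email, CategoryID FROM Customer_Favourite_Categories ORDER BY Email"
    ).fetchall()
    conn.close()
    return result


print(favourite_categories(SAMPLE))
```

Because the inner query is computed once per table rather than once per customer, this shape tends to be cheaper than a correlated sub-query, though whether that holds in Access for 350,000 rows would need to be measured.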

Related

More efficient way to query shortest string value associated with each value in another column in Hive QL

I have a table in Hive containing store names, order IDs, and User IDs (as well as some other columns including item ID). There is a row in the table for every item purchased (so there can be more than one row per order if the order contains multiple items). Order IDs are unique within a store, but not across stores. A single order can have more than one user ID associated with it.
I'm trying to write a query that will return a list of all stores and order IDs and the shortest user ID associated with each order.
So, for example, if the data looks like this:
STORE | ORDERID | USERID | ITEMID
------+---------+--------+-------
a     | 1       | bill   | abc
a     | 1       | susan  | def
a     | 2       | jane   | abc
b     | 1       | scott  | ghi
b     | 1       | tony   | jkl
Then the output would look like this:
STORE | ORDERID | USERID
------+---------+-------
a | 1 | bill
a | 2 | jane
b | 1 | tony
I've written a query that will do this, but I feel like there must be a more efficient way to go about it. Does anybody know a better way to produce these results?
This is what I have so far:
select
users.store, users.orderid, users.userid
from
(select
store, orderid, userid, length(userid) as len
from
sales) users
join
(select distinct
store, orderid,
min(length(userid)) over (partition by store, orderid) as len
from
sales) len on users.store = len.store
and users.orderid = len.orderid
and users.len = len.len
This will probably work for you. It reaches your goal with a single SELECT and no self-join:
select distinct
store, orderid,
first_value(userid) over(partition by store, orderid order by length(userid) asc) f_val
from
sales;
The result will be:
store orderid f_val
a 1 bill
a 2 jane
b 1 tony
Probably rank() is the best way:
select s.*
from (select s.*, rank() over (partition by store, orderid order by length(userid)) as seqnum
from sales s
) s
where seqnum = 1;
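A small sketch of the ranking approach, run against the sample data with Python's sqlite3 module rather than Hive (that substitution is an assumption; it needs an SQLite build with window functions, 3.25 or later). It uses row_number() instead of rank() and adds userid as a second sort key, so each order yields exactly one row even when two user IDs have the same length. The function name shortest_user_per_order is hypothetical.

```python
import sqlite3

# Sample rows from the question: (store, orderid, userid, itemid).
SALES = [
    ("a", 1, "bill", "abc"),
    ("a", 1, "susan", "def"),
    ("a", 2, "jane", "abc"),
    ("b", 1, "scott", "ghi"),
    ("b", 1, "tony", "jkl"),
]


def shortest_user_per_order(rows):
    """Return (store, orderid, userid) with the shortest userid per order."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales (store TEXT, orderid INTEGER, userid TEXT, itemid TEXT)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(
        """
        SELECT store, orderid, userid
        FROM (SELECT store, orderid, userid,
                     row_number() OVER (PARTITION BY store, orderid
                                        ORDER BY length(userid), userid) AS seqnum
              FROM sales) AS s
        WHERE seqnum = 1
        ORDER BY store, orderid
        """
    ).fetchall()
    conn.close()
    return result


print(shortest_user_per_order(SALES))
```

The table is scanned once and no join is needed, which is the main attraction over the original two-subquery version.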

Counting the total number of rows with SELECT DISTINCT ON without using a subquery

I am performing some queries using PostgreSQL's SELECT DISTINCT ON syntax. I would like the query to return the total number of rows alongside every result row.
Assume I have a table my_table like the following:
CREATE TABLE my_table(
id int,
my_field text,
id_reference bigint
);
I then have a couple of values:
id | my_field | id_reference
----+----------+--------------
1 | a | 1
1 | b | 2
2 | a | 3
2 | c | 4
3 | x | 5
Basically my_table contains some versioned data. The id_reference is a reference to a global version of the database. Every change to the database will increase the global version number and changes will always add new rows to the tables (instead of updating/deleting values) and they will insert the new version number.
My goal is to perform a query that retrieves only the latest values in the table, along with the total number of rows.
For example, in the above case I would like to retrieve the following output:
| total | id | my_field | id_reference |
+-------+----+----------+--------------+
| 3 | 1 | b | 2 |
+-------+----+----------+--------------+
| 3 | 2 | c | 4 |
+-------+----+----------+--------------+
| 3 | 3 | x | 5 |
+-------+----+----------+--------------+
My attempt is the following:
select distinct on (id)
count(*) over () as total,
*
from my_table
order by id, id_reference desc
This returns almost the correct output, except that total is the number of rows in my_table instead of being the number of rows of the resulting query:
total | id | my_field | id_reference
-------+----+----------+--------------
5 | 1 | b | 2
5 | 2 | c | 4
5 | 3 | x | 5
(3 rows)
As you can see it has 5 instead of the expected 3.
I can fix this by wrapping the DISTINCT ON query in a CTE and applying count(*) over () to its result:
with my_values as (
select distinct on (id)
*
from my_table
order by id, id_reference desc
)
select count(*) over (), * from my_values
Which produces my expected output.
My question: is there a way to avoid using this subquery and have something similar to count(*) over () return the result I want?
You are looking at my_table 3 ways:
to find the latest id_reference for each id
to find my_field for the latest id_reference for each id
to count the distinct number of ids in the table
I therefore prefer this solution:
select
c.id_count as total,
a.id,
a.my_field,
b.max_id_reference
from
my_table a
join
(
select
id,
max(id_reference) as max_id_reference
from
my_table
group by
id
) b
on
a.id = b.id and
a.id_reference = b.max_id_reference
join
(
select
count(distinct id) as id_count
from
my_table
) c
on true;
This is a bit longer (especially in the long, thin way I write SQL), but it makes clear what is happening. If you come back to it in a few months' time (somebody usually does), it will take less time to understand what is going on.
The "on true" at the end is a deliberate cartesian product because there can only ever be exactly one result from the subquery "c" and you do want a cartesian product with that.
There is nothing necessarily wrong with subqueries.
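To check the shape of the join-based solution on the sample data, here is a rough sketch using Python's sqlite3 module. SQLite stands in for PostgreSQL purely as an assumption for the demo, and CROSS JOIN is used where the query above writes "on true"; the function name latest_with_total is made up.

```python
import sqlite3

# Sample rows from the question: (id, my_field, id_reference).
MY_TABLE = [
    (1, "a", 1),
    (1, "b", 2),
    (2, "a", 3),
    (2, "c", 4),
    (3, "x", 5),
]


def latest_with_total(rows):
    """Return (total, id, my_field, id_reference) for the latest row of each id."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE my_table (id INTEGER, my_field TEXT, id_reference INTEGER)"
    )
    conn.executemany("INSERT INTO my_table VALUES (?, ?, ?)", rows)
    result = conn.execute(
        """
        SELECT c.id_count AS total, a.id, a.my_field, a.id_reference
        FROM my_table AS a
        JOIN (SELECT id, max(id_reference) AS max_id_reference
              FROM my_table
              GROUP BY id) AS b
          ON a.id = b.id AND a.id_reference = b.max_id_reference
        -- c always has exactly one row, so the cross join only adds a column
        CROSS JOIN (SELECT count(DISTINCT id) AS id_count FROM my_table) AS c
        ORDER BY a.id
        """
    ).fetchall()
    conn.close()
    return result


print(latest_with_total(MY_TABLE))
```

The total comes from count(distinct id) rather than from counting the filtered rows, which is why it reports 3 and not 5.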

CTE to represent a logical table for the rows in a table which have the max value in one column

I have an "insert only" database, wherein records aren't physically updated, but rather logically updated by adding a new record, with a CRUD value, carrying a larger sequence. In this case, the "seq" (sequence) column is more in line with what you may consider a primary key, but the "id" is the logical identifier for the record.
This is the physical representation of the table:
seq | id  | name   | CRUD |
----|-----|--------|------|
1   | 10  | john   | C    |
2   | 10  | joe    | U    |
3   | 11  | kent   | C    |
4   | 12  | katie  | C    |
5   | 12  | sue    | U    |
6   | 13  | jill   | C    |
7   | 14  | bill   | C    |
This is the logical representation of the table, considering the "most recent" records:
seq | id  | name   | CRUD |
----|-----|--------|------|
2   | 10  | joe    | U    |
3   | 11  | kent   | C    |
5   | 12  | sue    | U    |
6   | 13  | jill   | C    |
7   | 14  | bill   | C    |
In order to, for instance, retrieve the most recent record for the person with id=12, I would currently do something like this:
SELECT
*
FROM
PEOPLE P
WHERE
P.ID = 12
AND
P.SEQ = (
SELECT
MAX(P1.SEQ)
FROM
PEOPLE P1
WHERE P1.ID = P.ID
)
...and I would receive this row:
seq | id  | name   | CRUD |
----|-----|--------|------|
5   | 12  | sue    | U    |
What I'd rather do is something like this:
WITH
NEW_P
AS
(
--CTE representing all of the most recent records
--i.e. for any given id, the most recent sequence
)
SELECT
*
FROM
NEW_P P2
WHERE
P2.ID = 12
The first SQL example using the subquery already works for us.
Question: How can I leverage a CTE to simplify our predicates when I need the "most recent" logical view of the table? In essence, I don't want to inline a subquery every single time I want to get at the most recent record. I'd rather define a CTE and leverage that in any subsequent predicate.
P.S. While I'm currently using DB2, I'm looking for a solution that is database agnostic.
This is a clear case for window (or OLAP) functions, which are supported by all modern SQL databases. For example:
WITH
ORD_P
AS
(
SELECT p.*, ROW_NUMBER() OVER ( PARTITION BY id ORDER BY seq DESC) rn
FROM people p
)
,
NEW_P
AS
(
SELECT * from ORD_P
WHERE rn = 1
)
SELECT
*
FROM
NEW_P P2
WHERE
P2.ID = 12
PS. Not tested. You may need to explicitly list all columns in the CTE clauses.
I guess you already put it together. First find the max seq associated with each id, then use that to join back to the main table:
WITH newp AS (
SELECT id, MAX(seq) AS latestseq
FROM people
GROUP BY id
)
SELECT p.*
FROM people p
JOIN newp n ON (n.latestseq = p.seq)
ORDER BY p.id
What you originally had would work, as would moving the CTE into the "from" clause. Maybe you want to use a timestamp field rather than a sequence number for the ordering?
Following up from @Glenn's answer, here is an updated query which meets my original goal and is on par with @mustaccio's answer, but I'm still not sure what the performance (and other) implications of this approach are compared with the other.
WITH
LATEST_PERSON_SEQS AS
(
SELECT
ID,
MAX(SEQ) AS LATEST_SEQ
FROM
PERSON
GROUP BY
ID
)
,
LATEST_PERSON AS
(
SELECT
P.*
FROM
PERSON P
JOIN
LATEST_PERSON_SEQS L
ON
(
L.LATEST_SEQ = P.SEQ
)
)
SELECT
*
FROM
LATEST_PERSON L2
WHERE
L2.ID = 12
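Here is a quick sketch of the MAX(seq) CTE approach run against the sample rows, using Python's sqlite3 module as a stand-in for DB2 (an assumption for the demo only; the CTE itself is plain SQL). The names latest_people, latest_seqs and new_p are illustrative, and the join also matches on id, which is harmless when seq is unique.

```python
import sqlite3

# Physical table from the question: (seq, id, name, crud).
PEOPLE = [
    (1, 10, "john", "C"),
    (2, 10, "joe", "U"),
    (3, 11, "kent", "C"),
    (4, 12, "katie", "C"),
    (5, 12, "sue", "U"),
    (6, 13, "jill", "C"),
    (7, 14, "bill", "C"),
]

# The CTE pair defines the "logical" table once; callers only add predicates.
LATEST_SQL = """
WITH latest_seqs AS (
    SELECT id, MAX(seq) AS latest_seq
    FROM people
    GROUP BY id
),
new_p AS (
    SELECT p.seq, p.id, p.name, p.crud
    FROM people AS p
    JOIN latest_seqs AS l ON l.id = p.id AND l.latest_seq = p.seq
)
SELECT seq, id, name, crud FROM new_p
"""


def latest_people(rows, person_id=None):
    """Return the most recent record per id, optionally for a single id."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE people (seq INTEGER, id INTEGER, name TEXT, crud TEXT)")
    conn.executemany("INSERT INTO people VALUES (?, ?, ?, ?)", rows)
    sql, params = LATEST_SQL, ()
    if person_id is not None:
        sql += " WHERE id = ?"
        params = (person_id,)
    result = conn.execute(sql + " ORDER BY id", params).fetchall()
    conn.close()
    return result


print(latest_people(PEOPLE, 12))
```

A CTE only lives for the statement it is attached to, so the text has to be repeated (or generated, as here) in every query that needs the logical view.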

SQL SELECT only rows where a max value is present, and the corresponding ID from another linked table

I have a simple Parts database which I'd like to use for calculating costs of assemblies, and I need to keep a cost history, so that I can update the costs for parts without the update affecting historic data.
So far I have the info stored in 2 tables:
tblPart:
PartID | PartName
1 | Foo
2 | Bar
3 | Foobar
tblPartCostHistory
PartCostHistoryID | PartID | Revision | Cost
1 | 1 | 1 | £1.00
2 | 1 | 2 | £1.20
3 | 2 | 1 | £3.00
4 | 3 | 1 | £2.20
5 | 3 | 2 | £2.05
What I want to end up with is just the PartID for each part, and the PartCostHistoryID where the revision number is highest, so this:
PartID | PartCostHistoryID
1 | 2
2 | 3
3 | 5
I've had a look at some of the other threads on here and I can't quite get it. I can manage to get the PartID along with the highest Revision number, but if I try to then do anything with the PartCostHistoryID I end up with multiple PartCostHistoryIDs per part.
I'm using MS Access 2007.
Many thanks.
Mihai's (very concise) answer will work assuming that [PartCostHistoryID] and [Revision] always ascend together for each [PartID].
A solution that does not rely on that assumption would be
SELECT
tblPartCostHistory.PartID,
tblPartCostHistory.PartCostHistoryID
FROM
tblPartCostHistory
INNER JOIN
(
SELECT
PartID,
MAX(Revision) AS MaxOfRevision
FROM tblPartCostHistory
GROUP BY PartID
) AS max
ON max.PartID = tblPartCostHistory.PartID
AND max.MaxOfRevision = tblPartCostHistory.Revision
SELECT PartID, MAX(PartCostHistoryID) FROM tblPartCostHistory GROUP BY PartID
Here is a query:
select PartCostHistoryId, PartId from tblCost
where PartCostHistoryId in
(select PartCostHistoryId from
(select * from tblCost as tbl order by Revision desc) as tbl1
group by PartId
)
Here is SQL Fiddle http://sqlfiddle.com/#!2/19c2d/12
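To see why joining on MAX(Revision) is the safer choice, here is a small sketch in Python's sqlite3 module (SQLite rather than Access, which is an assumption of the demo). The function name latest_cost_ids is made up, and Cost is stored as a plain number without the currency symbol.

```python
import sqlite3

# Rows from the question: (PartCostHistoryID, PartID, Revision, Cost).
HISTORY = [
    (1, 1, 1, 1.00),
    (2, 1, 2, 1.20),
    (3, 2, 1, 3.00),
    (4, 3, 1, 2.20),
    (5, 3, 2, 2.05),
]


def latest_cost_ids(rows):
    """Return (PartID, PartCostHistoryID) for the highest Revision of each part."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tblPartCostHistory ("
        "PartCostHistoryID INTEGER PRIMARY KEY, PartID INTEGER, "
        "Revision INTEGER, Cost REAL)"
    )
    conn.executemany("INSERT INTO tblPartCostHistory VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(
        """
        SELECT h.PartID, h.PartCostHistoryID
        FROM tblPartCostHistory AS h
        INNER JOIN (SELECT PartID, MAX(Revision) AS MaxOfRevision
                    FROM tblPartCostHistory
                    GROUP BY PartID) AS latest
          ON latest.PartID = h.PartID
         AND latest.MaxOfRevision = h.Revision
        ORDER BY h.PartID
        """
    ).fetchall()
    conn.close()
    return result


print(latest_cost_ids(HISTORY))

# Hypothetical data where revision 2 was entered before revision 1:
# MAX(PartCostHistoryID) would pick ID 2, but the latest revision is ID 1.
print(latest_cost_ids([(1, 1, 2, 1.20), (2, 1, 1, 1.00)]))
```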

Join multiple rows in a table by field value

I have a table company with rows like this:
id(int) |name(string) |maincategory(int) |subcategory(string)
1 |Google |1 |1,2,3
2 |yahoo |4 |4,1
and another table category like:
id(int) |name(string)
1 |Search
2 |Email
3 |Image
4 |Video
I want to join the two tables by company.subcategory = category.id.
Is this possible in SQL?
Start by splitting your subcategory column. In the end you should have an additional company_category table with company_id and category_id as columns.
company_id(int) |category_id(int)
1 |1
1 |2
1 |3
2 |4
2 |1
Your design is invalid. You should have another table called companySubcategories or something like that.
This table should have two columns, companyId and categoryId.
Then your select would look like this:
select <desired fields> from
company c
join companySubcategories cs on cs.companyId = c.id
join category ct on ct.id = cs.categoryId
You can do it like below. Wrapping both sides in commas stops id 1 from also matching a value such as 10:
select * from
company c, category cc
where ','||c.subcategory||',' like '%,'||cc.id||',%';
It works in an Oracle database.
You could introduce a new table company_subcategory to keep track of subcategories
company (int) | subcategory (int)
1 | 1
1 | 2
1 | 3
2 | 1
2 | 4
then you would be able to run select as
select company.name AS company, category.name AS category
FROM company
JOIN company_subcategory
ON company.id = company_subcategory.company
JOIN category
ON company_subcategory.subcategory = category.id;
to get
+---------+----------+
| company | category |
+---------+----------+
| google | search |
| google | email |
| google | image |
| yahoo | search |
| yahoo | video |
+---------+----------+
SELECT *
FROM COMPANY CMP, CATEGORY CT
WHERE (SELECT CASE
WHEN INSTR(CMP.SUB_CATEGORY, CT.ID) > 0 THEN
'TRUE'
ELSE
'FALSE'
END
FROM DUAL) = 'TRUE'
This query looks for the ID in the SUB_CATEGORY, using the INSTR function.
In case it does exist, the row is returned.
The output is as below
ID NAME MAIN_CATEGORY SUB_CATEGORY ID NAME
1 Google 1 1,2,3 1 Search
1 Google 1 1,2,3 2 Email
1 Google 1 1,2,3 3 Image
2 yahoo 2 4,1 1 Search
2 yahoo 2 4,1 4 Video
Hope it helps.
However, I suggest you avoid storing data this way: each ID should have its own row rather than being combined into one value. It may cause problems in the future, so it is better to fix it now.
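The junction-table suggestion made in several answers can be sketched as follows. This is only an illustration in Python's sqlite3 module: the splitting is done in Python because the way to split a string differs between databases, and the names company_category and company_categories are assumptions taken from or modelled on the answers above.

```python
import sqlite3

# Rows from the question: (id, name, maincategory, subcategory list as text).
COMPANIES = [(1, "Google", 1, "1,2,3"), (2, "yahoo", 4, "4,1")]
CATEGORIES = [(1, "Search"), (2, "Email"), (3, "Image"), (4, "Video")]


def company_categories(companies, categories):
    """Normalise the comma-separated column, then join company to category."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE company (id INTEGER PRIMARY KEY, name TEXT, maincategory INTEGER)"
    )
    conn.execute("CREATE TABLE category (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute(
        "CREATE TABLE company_category (company_id INTEGER, category_id INTEGER)"
    )
    conn.executemany("INSERT INTO category VALUES (?, ?)", categories)
    for company_id, name, main, subs in companies:
        conn.execute("INSERT INTO company VALUES (?, ?, ?)", (company_id, name, main))
        # One junction row per id found in the comma-separated string.
        links = [(company_id, int(part)) for part in subs.split(",") if part.strip()]
        conn.executemany("INSERT INTO company_category VALUES (?, ?)", links)
    result = conn.execute(
        """
        SELECT c.name, ct.name
        FROM company AS c
        JOIN company_category AS cc ON cc.company_id = c.id
        JOIN category AS ct ON ct.id = cc.category_id
        ORDER BY c.id, ct.id
        """
    ).fetchall()
    conn.close()
    return result


print(company_categories(COMPANIES, CATEGORIES))
```

Once the junction table exists, the join is an ordinary equality join that can use indexes, and no string matching is involved.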