PostgreSQL: Customers' preferred product and second most preferred product - sql

I'm pretty new to SQL (currently using PostgreSQL, but interested in knowledge about any SQL dialect) and am trying to figure out something that I guess should be relatively straightforward.
I have a table containing one row per customer transaction; for each transaction I know what the customer bought. I am interested in finding out which product is each customer's preferred choice, and then their second most preferred choice (and in the end, on a general basis, what the preferred second choice is when the preferred choice is unavailable).
Below is a mock-up of what the data could look like:
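+-------------+----------------+
| Customer_id | Product bought |
+-------------+----------------+
| 1           | DVD            |
+-------------+----------------+
| 1           | DVD            |
+-------------+----------------+
| 1           | Blu-ray        |
+-------------+----------------+
| 1           | DVD            |
+-------------+----------------+
| 2           | DVD            |
+-------------+----------------+
| 2           | DVD            |
+-------------+----------------+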
The successful results would be something like this:
+-------------+--------------+--------------+
| Customer_id | Preferred #1 | Preferred #2 |
+-------------+--------------+--------------+
| 1           | DVD          | Blu-ray      |
+-------------+--------------+--------------+
| 2           | DVD          | $NULL$       |
+-------------+--------------+--------------+
(And as mentioned earlier, the final result, most likely done in Python/R and not in SQL, would be to see on a general basis "If Preferred #1 is DVD, then Preferred #2 is Blu-ray", "If Preferred #1 is Blu-ray, then Preferred #2 is Sandwich", and so on.)
Cheers

This is a combination of a greatest-n-per-group and a pivot problem (sometimes also referred to as crosstab)
The first step you need to do is to identify the two preferred products.
In your case you need to combine a group by query with window functions.
The following query counts how often each customer has bought each product:
select customer_id,
product_bought,
count(*) as num_products
from sales
group by customer_id, product_bought
order by customer_id;
This can be enhanced to include a rank for the number of times a product was bought:
select customer_id,
product_bought,
count(*) as num_products,
dense_rank() over (partition by customer_id order by count(*) desc) as rnk
from sales
group by customer_id, product_bought
order by customer_id;
This would return the following result (based on your sample data):
customer_id | product_bought | num_products | rnk
------------+----------------+--------------+----
1 | DVD | 3 | 1
1 | Blu-ray | 1 | 2
2 | DVD | 2 | 1
We cannot apply a where condition on the rnk column in the same query, because window functions are evaluated after the where clause, so we need a derived table for that:
select customer_id, product_bought
from (
select customer_id,
product_bought,
count(*) as num_products,
dense_rank() over (partition by customer_id order by count(*) desc) as rnk
from sales
group by customer_id, product_bought
) t
where rnk <= 2
order by customer_id;
Now we need to convert the two rows for each customer into columns. This could e.g. be done using a common table expression:
with preferred_products as (
select *
from (
select customer_id,
product_bought,
count(*) as num_products,
dense_rank() over (partition by customer_id order by count(*) desc) as rnk
from sales
group by customer_id, product_bought
) t
where rnk <= 2
)
select p1.customer_id,
p1.product_bought as "Product #1",
p2.product_bought as "Product #2"
from preferred_products p1
left join preferred_products p2 on p1.customer_id = p2.customer_id and p2.rnk = 2
where p1.rnk = 1
This then returns
customer_id | Product #1 | Product #2
------------+------------+-----------
1 | DVD | Blu-ray
2 | DVD |
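The above is standard SQL and should work on any DBMS that supports window functions and common table expressions. Note that dense_rank() gives tied products the same rank, so a customer whose products are tied at rank 1 or rank 2 will appear in more than one row.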
Online example: http://rextester.com/VAID15638
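The last step mentioned in the question (a general "if Preferred #1 is X, then Preferred #2 is Y" summary) could be sketched in Python roughly as follows. This is only an illustration: the function names top_two and second_choice_by_first are made up, customers 3 and 4 are hypothetical additions to the sample data, and ties are broken by first occurrence rather than by dense_rank(), so it will not match the SQL exactly when counts are tied.

```python
from collections import Counter, defaultdict

# Customers 1 and 2 mirror the question's sample; customers 3 and 4 are
# hypothetical, added so that the summary has something to count.
transactions = [
    (1, "DVD"), (1, "DVD"), (1, "Blu-ray"), (1, "DVD"),
    (2, "DVD"), (2, "DVD"),
    (3, "DVD"), (3, "DVD"), (3, "Blu-ray"),
    (4, "Blu-ray"), (4, "Blu-ray"), (4, "DVD"),
]

def top_two(transactions):
    """Return {customer_id: (first_choice, second_choice_or_None)}."""
    per_customer = defaultdict(Counter)
    for customer_id, product in transactions:
        per_customer[customer_id][product] += 1
    result = {}
    for customer_id, counts in per_customer.items():
        # most_common sorts by count descending; equal counts keep
        # first-seen order (unlike dense_rank, which would tie them).
        ranked = [product for product, _ in counts.most_common(2)]
        second = ranked[1] if len(ranked) > 1 else None
        result[customer_id] = (ranked[0], second)
    return result

def second_choice_by_first(preferences):
    """Count how often each second choice goes with each first choice."""
    summary = defaultdict(Counter)
    for first, second in preferences.values():
        if second is not None:
            summary[first][second] += 1
    return summary

preferences = top_two(transactions)
summary = second_choice_by_first(preferences)
for first, seconds in summary.items():
    second, n = seconds.most_common(1)[0]
    print(f"If Preferred #1 is {first}, then Preferred #2 is {second} ({n} customers)")
```

The same counts could of course be produced in SQL by grouping the pivoted result on "Product #1" and "Product #2"; doing it in Python just follows the plan stated in the question.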

Related

Sum Column A While Not Summing Column B Depending On Column C - SQL Server

I am attempting to devise a method of always summing column A (Units) and sometimes summing column B (Price) depending on the value of column C (Version) while rolling up to the column D (Product) level. I am thinking of using a case when statement, but the issue I am running into revolves around the dependence on Price which has kind of a tiered system. I have to take the Price value of the Current Version where available, then the Price value of the Recent Version where available, then if both Current and Recent Versions are not available, I need to take the Price value of the Old Version. The tier list goes Current > Recent > Old. Right now the data is in SQL Server.
To add another wrinkle, if the same Product has different Prices for two higher Version tiers, the returned Price should be different than the Price of the oldest Version. For example, if three rows for the same Product had Prices of 50, 50, and 10 and Versions of Old, Recent, and Recent (respectively), the returned Price would be 10.
So if the starting data looks like this:
--------------------------------------------------
| Units | Price | Version | Product |
--------------------------------------------------
| 105 | 50 | Old | Bear |
--------------------------------------------------
| 100 | 100 | Recent | Bear |
--------------------------------------------------
| 100 | 150 | Current | Bear |
--------------------------------------------------
| 97 | 50 | Old | Bear |
--------------------------------------------------
| 67 | 50 | Old | Goose |
--------------------------------------------------
| 28 | 50 | Recent | Goose |
--------------------------------------------------
| 10 | 10 | Recent | Goose |
--------------------------------------------------
The rolled up version of the data will look like this:
--------------------------------------------------
| Units | Price | Version | Product |
--------------------------------------------------
| 402 | 150 | Current | Bear |
--------------------------------------------------
| 105 | 10 | Recent | Goose |
--------------------------------------------------
I am new to SQL so apologies if this is a rookie question. Any help you can provide is greatly appreciated.
Another option is using the WITH TIES clause in concert with Row_Number()
Example
Select Top 1 with ties
Units = sum(Units) over (Partition By Product)
,Price
,Version
,Product
From YourTable
Order By Row_Number() over (Partition By Product Order by case when Version='Old' then 3 when Version='Recent' then 2 else 1 end)
Returns
Units Price Version Product
402 150 Current Bear
105 10 Recent Goose
EDIT - Requested Update
Here, we use the lag() function within a CTE to determine the change in price
Declare @YourTable Table ([Units] int,[Price] int,[Version] varchar(50),[Product] varchar(50))
Insert Into @YourTable Values
(105,50,'Old','Bear')
,(100,100,'Recent','Bear')
,(100,150,'Current','Bear')
,(97,50,'Old','Bear')
,(67,50,'Old','Goose')
,(28,50,'Recent','Goose')
,(10,10,'Recent','Goose')
;with cte as (
Select Units = sum(Units) over (Partition By Product)
,Price
,Version
,Product
,PrevPrice = abs(Price-lag(Price,1) over (Partition By Product Order by case when Version='Old' then 3 when Version='Recent' then 2 else 1 end desc) )
From @YourTable
)
Select top 1 with ties
Units
,Price
,Version
,Product
From cte
Order By Row_Number() over (Partition By Product Order by case when Version='Old' then 3 when Version='Recent' then 2 else 1 end ,PrevPrice desc)
Returns
Units Price Version Product
402 150 Current Bear
105 10 Recent Goose
You can use conditional aggregation. The trick is to order by your priority. One method uses row_number() and case:
select sum(units) as units,
max(case when seqnum = 1 then price end) as price,
max(case when seqnum = 1 then version end) as version,
product
from (select t.*,
row_number() over (partition by product
order by (case version when 'old' then 3 when 'recent' then 2 when 'current' then 1 else 4 end)
) as seqnum
from t
) t
group by product;
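To make the priority ordering and the extra "wrinkle" from the question concrete, here is a rough Python sketch of the roll-up. It rests on an assumption about the rule: among rows in the best available tier, a price that differs from the price(s) of the oldest tier present is preferred. The names PRIORITY and roll_up are invented for this example and are not part of either SQL answer.

```python
from collections import defaultdict

PRIORITY = {"Current": 1, "Recent": 2, "Old": 3}  # lower number wins

rows = [  # (units, price, version, product), copied from the question
    (105, 50, "Old", "Bear"),
    (100, 100, "Recent", "Bear"),
    (100, 150, "Current", "Bear"),
    (97, 50, "Old", "Bear"),
    (67, 50, "Old", "Goose"),
    (28, 50, "Recent", "Goose"),
    (10, 10, "Recent", "Goose"),
]

def roll_up(rows):
    """Sum units per product and pick price/version from the best tier."""
    by_product = defaultdict(list)
    for row in rows:
        by_product[row[3]].append(row)

    result = []
    for product, group in by_product.items():
        total_units = sum(units for units, _, _, _ in group)
        best_tier = min(PRIORITY[version] for _, _, version, _ in group)
        worst_tier = max(PRIORITY[version] for _, _, version, _ in group)
        oldest_prices = {
            price for _, price, version, _ in group
            if PRIORITY[version] == worst_tier
        }
        candidates = [r for r in group if PRIORITY[r[2]] == best_tier]
        # Assumed tie-break: rows whose price differs from the oldest
        # tier's price sort first (False sorts before True).
        candidates.sort(key=lambda r: r[1] in oldest_prices)
        _, price, version, _ = candidates[0]
        result.append((total_units, price, version, product))
    return result

rolled_up = roll_up(rows)
for row in rolled_up:
    print(row)
```

With the sample data this gives 402 units at price 150 (Current) for Bear and 105 units at price 10 (Recent) for Goose, which matches the desired output in the question.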

More efficient way to query shortest string value associated with each value in another column in Hive QL

I have a table in Hive containing store names, order IDs, and User IDs (as well as some other columns including item ID). There is a row in the table for every item purchased (so there can be more than one row per order if the order contains multiple items). Order IDs are unique within a store, but not across stores. A single order can have more than one user ID associated with it.
I'm trying to write a query that will return a list of all stores and order IDs and the shortest user ID associated with each order.
So, for example, if the data looks like this:
STORE | ORDERID | USERID | ITEMID
------+---------+--------+-------
| a | 1 | bill | abc |
| a | 1 | susan | def |
| a | 2 | jane | abc |
| b | 1 | scott | ghi |
| b | 1 | tony | jkl |
Then the output would look like this:
STORE | ORDERID | USERID
------+---------+-------
a | 1 | bill
a | 2 | jane
b | 1 | tony
I've written a query that will do this, but I feel like there must be a more efficient way to go about it. Does anybody know a better way to produce these results?
This is what I have so far:
select
users.store, users.orderid, users.userid
from
(select
store, orderid, userid, length(userid) as len
from
sales) users
join
(select distinct
store, orderid,
min(length(userid)) over (partition by store, orderid) as len
from
sales) len on users.store = len.store
and users.orderid = len.orderid
and users.len = len.len
This will probably work for you. It achieves your goal with a single SELECT and no self-join, using first_value():
select distinct
store, orderid,
first_value(userid) over(partition by store, orderid order by length(userid) asc) f_val
from
sales;
The result will be:
store orderid f_val
a 1 bill
a 2 jane
b 1 tony
Probably rank() is the best way:
select s.*
from (select s.*, rank() over (partition by store, orderid order by length(userid)) as seqnum
from sales s
) s
where seqnum = 1;
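For comparison, the logic that both window-function answers express (one pass over the rows, keeping the shortest user ID per store and order) can be sketched in plain Python. The variable names are illustrative only. Unlike rank(), which returns every user ID tied for the shortest length, this sketch keeps the first one it sees.

```python
rows = [  # (store, orderid, userid, itemid), copied from the question
    ("a", 1, "bill", "abc"),
    ("a", 1, "susan", "def"),
    ("a", 2, "jane", "abc"),
    ("b", 1, "scott", "ghi"),
    ("b", 1, "tony", "jkl"),
]

shortest = {}
for store, orderid, userid, _itemid in rows:
    key = (store, orderid)  # order IDs are only unique within a store
    # Keep the current value unless this user ID is strictly shorter.
    if key not in shortest or len(userid) < len(shortest[key]):
        shortest[key] = userid

for (store, orderid), userid in sorted(shortest.items()):
    print(store, orderid, userid)
```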

Grouping by two values in same table

I have a table on the format
Ship_type | userid | Message
None of these columns is unique.
I want to count how many unique user IDs belong to each ship type, and thus find out which ship type is the most popular.
Example:
Ship_type | userid| Message
-------------- ------- ----------
Sailboat | 34241 | hello
Sailboat | 34241 | hi
Sailboat | 34241 | I'm on a boat!
Fishingvessel | 31245 | yo
Fishingvessel | 98435 | hi there
Here we see that there are two different fishingvessels and one sailboat.
If I do the following query:
select ship_type, count(ship_type) FROM db1.MessageType5 GROUP BY ship_type ORDER BY count(ship_type) ASC;
I get
Sailboat | 3
Fishingvessel | 2
which is wrong - as it counts the number of messages belonging to each ship_type.
Desired result:
Fishingvessel | 2
Sailboat | 1
You have to COUNT DISTINCT user ids (and ORDER BY ... DESC if you want the provided result):
SELECT ship_type, COUNT(DISTINCT userid) as cnt
FROM db1.MessageType5
GROUP BY ship_type
ORDER BY cnt DESC
See this fiddle.
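If you want to see the difference between the two counts side by side, here is a small self-contained sketch using Python's built-in sqlite3 module. It runs on SQLite rather than on the asker's database, and the table is created without the db1 schema prefix, so treat it as an illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE MessageType5 (ship_type TEXT, userid INTEGER, message TEXT)"
)
conn.executemany(
    "INSERT INTO MessageType5 VALUES (?, ?, ?)",
    [
        ("Sailboat", 34241, "hello"),
        ("Sailboat", 34241, "hi"),
        ("Sailboat", 34241, "I'm on a boat!"),
        ("Fishingvessel", 31245, "yo"),
        ("Fishingvessel", 98435, "hi there"),
    ],
)

query = """
SELECT ship_type,
       COUNT(ship_type)       AS messages,  -- counts rows (messages)
       COUNT(DISTINCT userid) AS users      -- counts each user once
FROM MessageType5
GROUP BY ship_type
ORDER BY users DESC
"""
rows = conn.execute(query).fetchall()
for ship_type, messages, users in rows:
    print(ship_type, messages, users)
```

Sailboat has three messages but only one distinct user, which is why it drops below Fishingvessel once DISTINCT is applied.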

SQL SELECT only rows where a max value is present, and the corresponding ID from another linked table

I have a simple Parts database which I'd like to use for calculating costs of assemblies, and I need to keep a cost history, so that I can update the costs for parts without the update affecting historic data.
So far I have the info stored in 2 tables:
tblPart:
PartID | PartName
1 | Foo
2 | Bar
3 | Foobar
tblPartCostHistory
PartCostHistoryID | PartID | Revision | Cost
1 | 1 | 1 | £1.00
2 | 1 | 2 | £1.20
3 | 2 | 1 | £3.00
4 | 3 | 1 | £2.20
5 | 3 | 2 | £2.05
What I want to end up with is just the PartID for each part, and the PartCostHistoryID where the revision number is highest, so this:
PartID | PartCostHistoryID
1 | 2
2 | 3
3 | 5
I've had a look at some of the other threads on here and I can't quite get it. I can manage to get the PartID along with the highest Revision number, but if I try to then do anything with the PartCostHistoryID I end up with multiple PartCostHistoryIDs per part.
I'm using MS Access 2007.
Many thanks.
Mihai's (very concise) answer will work assuming that the order of both [PartCostHistoryID] and [Revision] for each [PartID] is always ascending.
A solution that does not rely on that assumption would be:
SELECT
tblPartCostHistory.PartID,
tblPartCostHistory.PartCostHistoryID
FROM
tblPartCostHistory
INNER JOIN
(
SELECT
PartID,
MAX(Revision) AS MaxOfRevision
FROM tblPartCostHistory
GROUP BY PartID
) AS max
ON max.PartID = tblPartCostHistory.PartID
AND max.MaxOfRevision = tblPartCostHistory.Revision
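To illustrate when the two approaches diverge, here is a sketch using Python's sqlite3 module (SQLite, not MS Access, so syntax details may differ). Row 6 is a hypothetical addition to the question's data: an older revision entered late, so its PartCostHistoryID is higher than that of the newer revision. The alias latest is used here instead of max purely for readability.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tblPartCostHistory "
    "(PartCostHistoryID INTEGER, PartID INTEGER, Revision INTEGER, Cost REAL)"
)
conn.executemany(
    "INSERT INTO tblPartCostHistory VALUES (?, ?, ?, ?)",
    [
        (1, 1, 1, 1.00), (2, 1, 2, 1.20), (3, 2, 1, 3.00),
        (4, 3, 1, 2.20), (5, 3, 2, 2.05),
        (6, 2, 0, 2.80),  # hypothetical: older revision entered late
    ],
)

# Join each part to its highest revision, then read the ID from that row.
by_revision = conn.execute("""
    SELECT h.PartID, h.PartCostHistoryID
    FROM tblPartCostHistory AS h
    INNER JOIN (
        SELECT PartID, MAX(Revision) AS MaxOfRevision
        FROM tblPartCostHistory
        GROUP BY PartID
    ) AS latest
      ON latest.PartID = h.PartID
     AND latest.MaxOfRevision = h.Revision
    ORDER BY h.PartID
""").fetchall()

# Shortcut that assumes IDs and revisions always increase together.
by_id = conn.execute("""
    SELECT PartID, MAX(PartCostHistoryID)
    FROM tblPartCostHistory
    GROUP BY PartID
    ORDER BY PartID
""").fetchall()

print(by_revision)
print(by_id)
```

For part 2 the join still returns PartCostHistoryID 3 (revision 1), while the MAX(PartCostHistoryID) shortcut returns 6, the late-entered revision 0.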
SELECT PartID,MAX(PartCostHistoryID) FROM table GROUP BY PartID
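Here is a query. Note that it selects a non-grouped column under GROUP BY, which MySQL tolerates in its permissive mode but standard SQL and MS Access reject: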
select PartCostHistoryId, PartId from tblCost
where PartCostHistoryId in
(select PartCostHistoryId from
(select * from tblCost as tbl order by Revision desc) as tbl1
group by PartId
)
Here is SQL Fiddle http://sqlfiddle.com/#!2/19c2d/12

GROUP BY and aggregate sequential numeric values

Using PostgreSQL 9.0.
Let's say I have a table containing the fields: company, profession and year. I want to return a result which contains unique companies and professions, but aggregates (into an array is fine) years based on numeric sequence:
Example Table:
+-----------------------------+
| company | profession | year |
+---------+------------+------+
| Google | Programmer | 2000 |
| Google | Sales | 2000 |
| Google | Sales | 2001 |
| Google | Sales | 2002 |
| Google | Sales | 2004 |
| Mozilla | Sales | 2002 |
+-----------------------------+
I'm interested in a query which would output rows similar to the following:
+-----------------------------------------+
| company | profession | year |
+---------+------------+------------------+
| Google | Programmer | [2000] |
| Google | Sales | [2000,2001,2002] |
| Google | Sales | [2004] |
| Mozilla | Sales | [2002] |
+-----------------------------------------+
The essential feature is that only consecutive years shall be grouped together.
Identifying non-consecutive values is always a bit tricky and involves several nested sub-queries (at least I cannot come up with a better solution).
The first step is to identify non-consecutive values for the year:
Step 1) Identify non-consecutive values
select company,
profession,
year,
case
when row_number() over (partition by company, profession order by year) = 1 or
year - lag(year,1,year) over (partition by company, profession order by year) > 1 then 1
else 0
end as group_cnt
from qualification
This returns the following result:
company | profession | year | group_cnt
---------+------------+------+-----------
Google | Programmer | 2000 | 1
Google | Sales | 2000 | 1
Google | Sales | 2001 | 0
Google | Sales | 2002 | 0
Google | Sales | 2004 | 1
Mozilla | Sales | 2002 | 1
Now with the group_cnt value we can create "group IDs" for each group that has consecutive years:
Step 2) Define group IDs
select company,
profession,
year,
sum(group_cnt) over (order by company, profession, year) as group_nr
from (
select company,
profession,
year,
case
when row_number() over (partition by company, profession order by year) = 1 or
year - lag(year,1,year) over (partition by company, profession order by year) > 1 then 1
else 0
end as group_cnt
from qualification
) t1
This returns the following result:
company | profession | year | group_nr
---------+------------+------+----------
Google | Programmer | 2000 | 1
Google | Sales | 2000 | 2
Google | Sales | 2001 | 2
Google | Sales | 2002 | 2
Google | Sales | 2004 | 3
Mozilla | Sales | 2002 | 4
(6 rows)
As you can see, each "group" got its own group_nr, which we can finally use for the aggregation by adding yet another derived table:
Step 3) Final query
select company,
profession,
array_agg(year) as years
from (
select company,
profession,
year,
sum(group_cnt) over (order by company, profession, year) as group_nr
from (
select company,
profession,
year,
case
when row_number() over (partition by company, profession order by year) = 1 or
year - lag(year,1,year) over (partition by company, profession order by year) > 1 then 1
else 0
end as group_cnt
from qualification
) t1
) t2
group by company, profession, group_nr
order by company, profession, group_nr
This returns the following result:
company | profession | years
---------+------------+------------------
Google | Programmer | {2000}
Google | Sales | {2000,2001,2002}
Google | Sales | {2004}
Mozilla | Sales | {2002}
(4 rows)
Which is exactly what you wanted, if I'm not mistaken.
There's much value to @a_horse_with_no_name's answer, both as a correct solution and, like I already said in a comment, as good material for learning how to use different kinds of window functions in PostgreSQL.
And yet I cannot help feeling that the approach taken in that answer is a bit too much of an effort for a problem like this one. Basically, what you need is an additional criterion for grouping before you go on aggregating years in arrays. You've already got company and profession, now you only need something to distinguish years that belong to different sequences.
That is just what the above mentioned answer provides and that is precisely what I think can be done in a simpler way. Here's how:
WITH MarkedForGrouping AS (
SELECT
company,
profession,
year,
year - ROW_NUMBER() OVER (
PARTITION BY company, profession
ORDER BY year
) AS seqID
FROM atable
)
SELECT
company,
profession,
array_agg(year) AS years
FROM MarkedForGrouping
GROUP BY
company,
profession,
seqID
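Why this works: within one company and profession, consecutive years and their row numbers both go up by one, so year minus row number stays constant for a run and jumps at every gap. A quick Python sketch of the same idea follows (the function name consecutive_runs is made up, and it assumes years are distinct within each company and profession, as in the sample data):

```python
from itertools import groupby

rows = [  # (company, profession, year), copied from the question
    ("Google", "Programmer", 2000),
    ("Google", "Sales", 2000),
    ("Google", "Sales", 2001),
    ("Google", "Sales", 2002),
    ("Google", "Sales", 2004),
    ("Mozilla", "Sales", 2002),
]

def consecutive_runs(rows):
    """Group years into runs using the year - row_number trick."""
    result = []
    ordered = sorted(rows)  # company, profession, year
    for (company, profession), group in groupby(ordered, key=lambda r: r[:2]):
        # enumerate(..., start=1) plays the role of ROW_NUMBER().
        numbered = [
            (year - n, year)
            for n, (_, _, year) in enumerate(group, start=1)
        ]
        # The difference is constant inside a run of consecutive years.
        for _seq_id, run in groupby(numbered, key=lambda pair: pair[0]):
            result.append((company, profession, [year for _, year in run]))
    return result

runs = consecutive_runs(rows)
for run in runs:
    print(run)
```

For Google / Sales the differences are 1999, 1999, 1999 and 2000, so 2004 lands in its own group.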
Procedural solution with PL/pgSQL
The problem is rather unwieldy for plain SQL with aggregate / window functions. While looping is typically slower than set-based solutions with plain SQL, a procedural solution with PL/pgSQL can make do with a single sequential scan over the table (implicit cursor of a FOR loop) and should be substantially faster in this particular case:
Test table:
CREATE TEMP TABLE tbl (company text, profession text, year int);
INSERT INTO tbl VALUES
('Google', 'Programmer', 2000)
, ('Google', 'Sales', 2000)
, ('Google', 'Sales', 2001)
, ('Google', 'Sales', 2002)
, ('Google', 'Sales', 2004)
, ('Mozilla', 'Sales', 2002)
;
Function:
CREATE OR REPLACE FUNCTION f_periods()
RETURNS TABLE (company text, profession text, years int[])
LANGUAGE plpgsql AS
$func$
DECLARE
r tbl; -- use table type as row variable
r0 tbl;
BEGIN
FOR r IN
SELECT * FROM tbl t ORDER BY t.company, t.profession, t.year
LOOP
IF ( r.company, r.profession, r.year)
<> (r0.company, r0.profession, r0.year + 1) THEN -- not true for first row
RETURN QUERY
SELECT r0.company, r0.profession, years; -- output row
years := ARRAY[r.year]; -- start new array
ELSE
years := years || r.year; -- add to array - year can be NULL, too
END IF;
r0 := r; -- remember last row
END LOOP;
RETURN QUERY -- output last iteration
SELECT r0.company, r0.profession, years;
END
$func$;
Call:
SELECT * FROM f_periods();
db<>fiddle here
Produces the requested result.