So I'm working on an RFM analysis, and with lots of help, was able to put together the following query that outputs the customer_id, r score, f score, m score, and lastly a combined rfm score:
--This will first create quintiles using the ntile function
--Then factor in the conditions
--Then combine the score
--Then the substrings will separate each score's individual points
SELECT *,
       SUBSTRING(rfm_combined, 1, 1) AS recency_score,
       SUBSTRING(rfm_combined, 2, 1) AS frequency_score,
       SUBSTRING(rfm_combined, 3, 1) AS monetary_score
FROM (
    SELECT
        customer_id,
        rfm_recency*100 + rfm_frequency*10 + rfm_monetary AS rfm_combined
    FROM (
        SELECT
            customer_id,
            NTILE(5) OVER (ORDER BY last_order_date) AS rfm_recency,
            NTILE(5) OVER (ORDER BY count_order) AS rfm_frequency,
            NTILE(5) OVER (ORDER BY total_spent) AS rfm_monetary
        FROM (
            SELECT
                customer_id,
                MAX(oms_order_date) AS last_order_date,
                COUNT(*) AS count_order,
                SUM(quantity_ordered * unit_price_amount) AS total_spent
            FROM l_dmw_order_report
            WHERE order_type NOT IN ('Sales Return', 'Sales Price Adjustment')
              AND item_description_1 NOT IN ('freight', 'FREIGHT', 'Freight')
              AND line_status NOT IN ('CANCELLED', 'HOLD')
              AND oms_order_date BETWEEN '2019-01-01' AND CURRENT_DATE
              AND customer_id = 'US621111112234061'
            GROUP BY customer_id
        ) base
    ) scored
) combined
ORDER BY customer_id DESC
In the above, you will notice that I am forcing it to only output a particular customer_id. That is because I wanted to test whether this query accounts for a customer_id that appears in multiple YearMonth categories (because they could have bought in Jan, then again in Feb, then again in Nov).
The issue is that, although the query outputs the right scores, it only seems to account for the customer_id once, regardless of whether it appears in multiple months. This particular customer ID appears in Jan 2019, Feb 2019, and Nov 2019, so it should be giving me 3 rows instead of just 1. I've been testing for a few hours and can't find the cause, but I suspect that my grouping may be wrong.
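For reference, if one row per customer per month is really the goal, I think the innermost aggregation would need to group on a month bucket as well; an untested sketch (DATE_TRUNC is a guess at the right function for this dialect):

-- Untested sketch: one row per customer per YearMonth.
-- DATE_TRUNC('month', ...) is an assumption; the function name varies by dialect.
SELECT
    customer_id,
    DATE_TRUNC('month', oms_order_date) AS year_month,
    MAX(oms_order_date) AS last_order_date,
    COUNT(*) AS count_order,
    SUM(quantity_ordered * unit_price_amount) AS total_spent
FROM l_dmw_order_report
WHERE order_type NOT IN ('Sales Return', 'Sales Price Adjustment')
  AND item_description_1 NOT IN ('freight', 'FREIGHT', 'Freight')
  AND line_status NOT IN ('CANCELLED', 'HOLD')
  AND oms_order_date BETWEEN '2019-01-01' AND CURRENT_DATE
GROUP BY customer_id, DATE_TRUNC('month', oms_order_date)

(Conversely, if one row per customer is the goal, the existing GROUP BY customer_id already collapses all of a customer's months into a single row, which would explain why I only see one.)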
Thank you for your help and let me know if you have any questions!!
Best,
Z
Can anyone please help and tell me where the error is? What am I doing wrong? (Databricks)
Even the example from the Databricks website doesn't work and produces the same error as below.
Is there any other method to calculate this metric?
select
customerid,
yearid,
monthid,
sum(TotalSpendings) as TotalSpendings,
sum(TotalQuantity) as TotalQuantity,
count(distinct ticketid) as TotalTickets,
AVG(AvgIndexesPerTicket) as AvgIndexesPerTicket,
max(transactiondate) as DateOfLastVisit,
count(distinct transactiondate) as TotalNumberOfVisits,
AVG(TotalSpendings) as AverageTicket,
sum(TotalQuantity)/count(distinct ticketid) as AvgQttyPerTicket,
sum(TotalDiscount) as TotalDiscount,
percentile_disc(0.25) WITHIN GROUP (ORDER BY TotalQuantity) as PercentileQttyTicket_25,
percentile_disc(0.50) WITHIN GROUP (ORDER BY TotalQuantity) as PercentileQttyTicket_50,
percentile_disc(0.75) WITHIN GROUP (ORDER BY TotalQuantity) as PercentileQttyTicket_75,
percentile_disc(0.90) WITHIN GROUP (ORDER BY TotalQuantity) as PercentileQttyTicket_90,
percentile_disc(0.25) WITHIN GROUP (ORDER BY TotalSpendings) as PercentileSpendingsTicket_25,
percentile_disc(0.50) WITHIN GROUP (ORDER BY TotalSpendings) as PercentileSpendingsTicket_50,
percentile_disc(0.75) WITHIN GROUP (ORDER BY TotalSpendings) as PercentileSpendingsTicket_75,
percentile_disc(0.90) WITHIN GROUP (ORDER BY TotalSpendings) as PercentileSpendingsTicket_90
from (
select
a.customerid,
a.ticketid,
a.transactiondate,
extract(year from a.transactiondate) as yearid,
extract(month from a.transactiondate) as monthid,
sum(positionvalue) as TotalSpendings,
sum(quantity) as TotalQuantity,
count(distinct productindex)/count(distinct a.ticketid) as AvgIndexesPerTicket,
sum(discountvalue) as TotalDiscount
from default.TICKET_ITEM a
where 1=1
and a.transactiondate between '2022-10-01' and '2022-10-31'
and a.transactiontype = 'S'
and a.transactiontypeheader = 'S'
and a.customerid in ('94861b2c83c54d03930af4585a3a325a')
and length(a.customerid) > 10
group by 1,2,3,4,5) DETAL
group by 1,2,3"""
I still receive the error:
ParseException:
no viable alternative at input 'GROUP (' (line 15, pos 43)
Try reducing the complexity of the problem until you figure out what is wrong. Unless I have your TICKET_ITEM Hive table, I cannot debug the issue in my environment. Many times I break a complex query into pieces.
First, always put data into a schema (database) for management.
%sql
create database STACK_OVER_FLOW
Thus, your table would be recreated as STACK_OVER_FLOW.TICKET_ITEM.
Second, place the inner query into a permanent or temporary view. The code below creates a permanent view in the new schema.
%sql
create view STACK_OVER_FLOW.FILTERED_TICKET_ITEM as
select
a.customerid,
a.ticketid,
a.transactiondate,
extract(year from a.transactiondate) as yearid,
extract(month from a.transactiondate) as monthid,
sum(a.positionvalue) as TotalSpendings,
sum(a.quantity) as TotalQuantity,
count(distinct a.productindex) / count(distinct a.ticketid) as AvgIndexesPerTicket,
sum(a.discountvalue) as TotalDiscount
from
STACK_OVER_FLOW.TICKET_ITEM a
where
1=1
and a.transactiondate between '2022-10-01' and '2022-10-31'
and a.transactiontype = 'S'
and a.transactiontypeheader = 'S'
and a.customerid in ('94861b2c83c54d03930af4585a3a325a')
and length(a.customerid) > 10
group by
customerid,
ticketid,
transactiondate,
yearid,
monthid
Third, always group by or order by name, not by position. You might change the field order over time. I did notice an extra """ at the end of the query, but it might be a typo.
At this point you will know if the inner query works correctly in the view and you can focus on the outer query with the percentiles.
In data engineering, I have seen the Spark optimizer get confused when the number of temporary views is large. In these cases, the intermediate view might have to be written to a file as a step. Then you can expose that file as a view and continue with your engineering effort.
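As a sketch of that pattern (the table and view names below are placeholders, not from the original post), you can materialize the view's output as a table and then expose the table through a new view:

%sql
-- Sketch: materialize the intermediate result, then re-expose it as a view.
-- The *_MATERIALIZED / *_V names are placeholders for illustration.
create table STACK_OVER_FLOW.FILTERED_TICKET_ITEM_MATERIALIZED as
select * from STACK_OVER_FLOW.FILTERED_TICKET_ITEM;

create view STACK_OVER_FLOW.FILTERED_TICKET_ITEM_V as
select * from STACK_OVER_FLOW.FILTERED_TICKET_ITEM_MATERIALIZED;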
The percentile_disc function is part of the Databricks distribution.
https://docs.databricks.com/sql/language-manual/functions/percentile_disc.html
It is not a core function that is part of the open source distribution.
https://spark.apache.org/docs/latest/api/sql/index.html#percentile
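If percentile_disc is not available in your runtime, the open-source percentile function is one possible substitute; note that it interpolates between values rather than returning an actual member of the set, so results can differ slightly from percentile_disc. A sketch against the view created above:

%sql
-- Sketch: open-source percentile() as a stand-in for percentile_disc().
-- percentile() interpolates; percentile_disc() returns a discrete value from the set.
select
  customerid,
  yearid,
  monthid,
  percentile(TotalQuantity, array(0.25, 0.50, 0.75, 0.90)) as PercentileQttyTicket,
  percentile(TotalSpendings, array(0.25, 0.50, 0.75, 0.90)) as PercentileSpendingsTicket
from STACK_OVER_FLOW.FILTERED_TICKET_ITEM
group by customerid, yearid, monthid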
Please add more information to the post if you reduce the complexity and still cannot find your issue.
I am using the ga_sessions sample data in BigQuery and I was aiming to divide the customers into segments based on how often they placed an order using APPROX_QUANTILES. Ideally I want the output to tell me which segment the customer belongs to based on their orders.
Unfortunately, I cannot get the code to run properly, as I now get a 1 for each segment and the 100% segment returns 36 every time. Any idea how to improve this query?
WITH transdata AS (
SELECT
DISTINCT fullVisitorId AS VisitorId
,COUNT(DISTINCT FORMAT('%s%i',fullVisitorId,visitId)) AS uniqueVisits
,SUM(totals.transactions) AS total_transactions
,SUM(totals.totalTransactionRevenue) AS total_transaction_revenue
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
_table_suffix BETWEEN '20160801' AND '20170801'
GROUP BY 1
ORDER BY 3 DESC
)
-- determine percentiles for total transactions per customer
SELECT
a.*
,b.percentiles[offset (20)] AS v20
,b.percentiles[offset (40)] AS v40
,b.percentiles[offset (60)] AS v60
,b.percentiles[offset (80)] AS v80
,b.percentiles[offset (100)] AS v100
FROM
transdata AS a,
(SELECT APPROX_QUANTILES(total_transactions, 100) percentiles FROM transdata) AS b
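One possible way to turn those cutoffs into an actual segment label per visitor (a sketch reusing the transdata CTE from the query above; the quartile split and the segment names are my own choices, not from the original):

-- Sketch: label each visitor by quartile of total_transactions.
-- Assumes the same transdata CTE as defined in the query above.
-- APPROX_QUANTILES(x, 4) returns 5 values: min, 25th, 50th, 75th percentile, max.
WITH quantiles AS (
  SELECT APPROX_QUANTILES(total_transactions, 4) AS q FROM transdata
)
SELECT
  a.VisitorId,
  a.total_transactions,
  CASE
    WHEN a.total_transactions <= q[OFFSET(1)] THEN 'bottom 25%'
    WHEN a.total_transactions <= q[OFFSET(2)] THEN '25-50%'
    WHEN a.total_transactions <= q[OFFSET(3)] THEN '50-75%'
    ELSE 'top 25%'
  END AS segment
FROM transdata AS a, quantiles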
I am working on a restaurant management system. There I have two tables
order_details(orderId,dishId,createdAt)
dishes(id,name,imageUrl)
My customer wants to see a report of the top 3 selling items / least selling 3 items by month.
For the moment, I did something like this:
SELECT
*
FROM
(SELECT
SUM(qty) AS qty,
order_details.dishId,
MONTHNAME(order_details.createdAt) AS mon,
dishes.name,
dishes.imageUrl
FROM
rms.order_details
INNER JOIN dishes ON order_details.dishId = dishes.id
GROUP BY order_details.dishId , MONTHNAME(order_details.createdAt)) t
ORDER BY t.qty
This gives me the sold count for every dish, ordered by qty.
I have to manually take the top 3 records and reject the rest. There should be a SQL way of doing this. How do I do it in SQL?
You would use row_number() for this purpose. You don't specify the database you are using, so I am guessing at the appropriate date functions. I also assume that you mean a month within a year, so you need to take the year into account as well:
SELECT ym.*
FROM (SELECT YEAR(od.CreatedAt) as yyyy,
MONTH(od.createdAt) as mm,
SUM(qty) AS qty,
od.dishId, d.name, d.imageUrl,
ROW_NUMBER() OVER (PARTITION BY YEAR(od.CreatedAt), MONTH(od.createdAt) ORDER BY SUM(qty) DESC) as seqnum_desc,
ROW_NUMBER() OVER (PARTITION BY YEAR(od.CreatedAt), MONTH(od.createdAt) ORDER BY SUM(qty) ASC) as seqnum_asc
FROM rms.order_details od INNER JOIN
dishes d
ON od.dishId = d.id
GROUP BY YEAR(od.CreatedAt), MONTH(od.CreatedAt), od.dishId, d.name, d.imageUrl
) ym
WHERE seqnum_asc <= 3 OR
seqnum_desc <= 3;
Using the above info, I used a combination of GROUP BY, ORDER BY, and LIMIT,
as shown below. I hope this is what you are looking for.
SELECT
t.qty,
t.dishId,
t.month,
d.name,
d.imageUrl
from
(
SELECT
od.dishId,
count(od.dishId) AS 'qty',
date_format(od.createdAt,'%Y-%m') as 'month'
FROM
rms.order_details od
group by date_format(od.createdAt,'%Y-%m'),od.dishId
order by qty desc
limit 3) t
join rms.dishes d on (t.dishId = d.id)
I would like to count the number of daily unique active users by subreddit and day, and then aggregate these counts into monthly unique active users by subreddit and month. Doing each one individually is simple enough, but when I try to do them in one combined query, it tells me that I need to group by date_month_day in my second-level subquery, which would result in monthly_unique_authors being the same as daily_unique_authors (Error: Expression 'date_month_day' is not present in the GROUP BY list [invalidQuery]).
Here is the query I have so far:
SELECT * FROM
(
SELECT *,
(daily_unique_authors/monthly_unique_authors) * 1.0 AS ratio,
ROW_NUMBER() OVER (PARTITION BY date_month_day ORDER BY ratio DESC) rank
FROM
(
SELECT subreddit,
date_month_day,
daily_unique_authors,
SUM(daily_unique_authors) AS monthly_unique_authors,
LEFT(date_month_day, 7) as date_month
FROM
(
SELECT subreddit,
LEFT(DATE(SEC_TO_TIMESTAMP(created_utc)), 10) as date_month_day,
COUNT(UNIQUE(author)) as daily_unique_authors
FROM TABLE_QUERY([fh-bigquery:reddit_comments], "table_id CONTAINS \'20\' AND LENGTH(table_id)<8")
GROUP EACH BY subreddit, date_month_day
)
GROUP EACH BY subreddit, date_month))
WHERE rank <= 100
ORDER BY date_month ASC
The final output should ideally be something like:
subreddit date_month date_month_day daily_unique_users monthly_unique_users ratio
1 google 2005-12 2005-12-29 77 600 0.128
2 google 2005-12 2005-12-31 52 600 0.087
3 google 2005-12 2005-12-28 81 600 0.135
4 google 2005-12 2005-12-27 73 600 0.121
Below is for BigQuery Standard SQL
#standardSQL
SELECT * FROM (
SELECT *,
ROW_NUMBER() OVER(PARTITION BY date_month_day ORDER BY ratio DESC) rank
FROM (
SELECT
daily.subreddit subreddit,
daily.date_month date_month,
date_month_day,
daily_unique_authors,
monthly_unique_authors,
1.0 * daily_unique_authors / monthly_unique_authors AS ratio
FROM (
SELECT subreddit,
DATE(TIMESTAMP_SECONDS(created_utc)) AS date_month_day,
FORMAT_DATE('%Y-%m', DATE(TIMESTAMP_SECONDS(created_utc))) AS date_month,
COUNT(DISTINCT author) AS daily_unique_authors
FROM `fh-bigquery.reddit_comments.2018*`
GROUP BY subreddit, date_month_day, date_month
) daily
JOIN (
SELECT subreddit,
FORMAT_DATE('%Y-%m', DATE(TIMESTAMP_SECONDS(created_utc))) AS date_month,
COUNT(DISTINCT author) AS monthly_unique_authors
FROM `fh-bigquery.reddit_comments.2018*`
GROUP BY subreddit, date_month
) monthly
ON daily.subreddit = monthly.subreddit
AND daily.date_month = monthly.date_month
)
)
WHERE rank <= 100
ORDER BY date_month
Note: I tried to leave the original logic and structure as much as possible as in the question, so the OP will be able to correlate the answer with the question and make further adjustments if needed :o)
How do I obtain the highest value for each year within a table? So let's say we have a table Movies, and I want to find the most profitable film for each year.
This is my attempt so far:
SELECT year, MAX(income - cost) AS profit, title
FROM Movies m, Movies m2
GROUP BY year
I am pretty certain it is going to need some subselects, but I can't visualise what I need to do. I was also thinking it probably needs some sort of distinct option to rule out duplicate years.
Title Year Income Cost Length
A 2000 10 2 2
B 2000 9 7 2
So from this the expected result would be
Title Year Profit
A 2000 8
I'm guessing slightly at what you want, but since you've not specified any RDBMS a generic solution would be:
SELECT m.Year, (m.Income - m.Cost) AS Profit, m.Title
FROM Movies m
INNER JOIN
( SELECT m.Year, MAX(m.Income - m.Cost) AS Profit
  FROM Movies m
GROUP BY m.Year
) MaxProfit
ON MaxProfit.Year = m.Year
AND MaxProfit.Profit = (m.Income - m.Cost)
ORDER BY m.Year
You can also do this using analytic functions if your DBMS permits, e.g. SQL Server:
WITH MovieCTE AS
( SELECT m.Year,
Profit = (m.Income - m.Cost),
m.Title,
RowNumber = ROW_NUMBER() OVER(PARTITION BY m.Year ORDER BY (m.Income - m.Cost) DESC)
  FROM Movies m
)
SELECT year, Profit, Title
FROM MovieCTE
WHERE RowNumber = 1
It is possible I have misunderstood your exact criteria, but I am sure the same principles can be applied; you will just need to alter the grouping and the join in the first example, or the partition by in the second.
select m1year, m1profit, title
from
    (select year as m1year, max(income - cost) as m1profit from movies group by year) m1
    join
    (select year as m2year, (income - cost) as m2profit, title from movies) m2
    on m1year = m2year
    and m1profit = m2profit
This will give the highest profit movie for each year, and choose the first title in the event of a tie:
select a.year, a.profit,
(select min(title) from Movies where year = a.year and income - cost = a.profit) as title
from (
select year, max(income - cost) as profit
from Movies -- title, year, cost, income, number
group by year
) as a
order by year desc