Can anyone please help and tell me where the error is? What am I doing wrong? (Databricks)
Even the example from the Databricks website doesn't work and produces the same error as below.
Is there any other method to calculate this metric?
select
customerid,
yearid,
monthid,
sum(TotalSpendings) as TotalSpendings,
sum(TotalQuantity) as TotalQuantity,
count (distinct ticketid) as TotalTickets,
AVG(AvgIndexesPerTicket) as AvgIndexesPerTicket,
max (transactiondate) as DateOfLastVisit,
count(distinct transactiondate) as TotalNumberOfVisits,
AVG(TotalSpendings) as AverageTicket,
sum(TotalQuantity)/count(distinct ticketid) as AvgQttyPerTicket,
sum(TotalDiscount) as TotalDiscount,
percentile_disc(0.25) WITHIN GROUP (ORDER BY TotalQuantity) as PercentileQttyTicket_25,
percentile_disc(0.50) WITHIN GROUP (ORDER BY TotalQuantity) as PercentileQttyTicket_50,
percentile_disc(0.75) WITHIN GROUP (ORDER BY TotalQuantity) as PercentileQttyTicket_75,
percentile_disc(0.90) WITHIN GROUP (ORDER BY TotalQuantity) as PercentileQttyTicket_90,
percentile_disc(0.25) WITHIN GROUP (ORDER BY TotalSpendings) as PercentileSpendingsTicket_25,
percentile_disc(0.50) WITHIN GROUP (ORDER BY TotalSpendings) as PercentileSpendingsTicket_50,
percentile_disc(0.75) WITHIN GROUP (ORDER BY TotalSpendings) as PercentileSpendingsTicket_75,
percentile_disc(0.90) WITHIN GROUP (ORDER BY TotalSpendings) as PercentileSpendingsTicket_90
from (
select
a.customerid,
a.ticketid,
a.transactiondate,
extract(year from a.transactiondate) as yearid,
extract(month from a.transactiondate) as monthid,
sum(positionvalue) as TotalSpendings,
sum(quantity) as TotalQuantity,
count(distinct productindex)/count(distinct a.ticketid) as AvgIndexesPerTicket,
sum(discountvalue) as TotalDiscount
from default.TICKET_ITEM a
where 1=1
and a.transactiondate between '2022-10-01' and '2022-10-31'
and a.transactiontype = 'S'
and a.transactiontypeheader = 'S'
and a.customerid in ('94861b2c83c54d03930af4585a3a325a')
and length(a.customerid) > 10
group by 1,2,3,4,5) DETAL
group by 1,2,3"""
I still receive the error:
ParseException:
no viable alternative at input 'GROUP ('(line 15, pos 43)
Try reducing the complexity of the problem until you figure out what is wrong. Unless I have your TICKET_ITEM Hive table, I cannot try debugging the issue in my environment. Many times I break a complex query into pieces.
First, always put data into a schema (database) for management.
%sql
create database STACK_OVER_FLOW
Thus, your table would be recreated as STACK_OVER_FLOW.TICKET_ITEM.
Second, place the inner query into a permanent or temporary view. The code below creates a permanent view in the new schema.
%sql
create view STACK_OVER_FLOW.FILTERED_TICKET_ITEM as
select
a.customerid,
a.ticketid,
a.transactiondate,
extract(year from a.transactiondate) as yearid,
extract(month from a.transactiondate) as monthid,
sum(a.positionvalue) as TotalSpendings,
sum(a.quantity) as TotalQuantity,
count(distinct a.productindex) / count(distinct a.ticketid) as AvgIndexesPerTicket,
sum(a.discountvalue) as TotalDiscount
from
STACK_OVER_FLOW.TICKET_ITEM a
where
1=1
and a.transactiondate between '2022-10-01' and '2022-10-31'
and a.transactiontype = 'S'
and a.transactiontypeheader = 'S'
and a.customerid in ('94861b2c83c54d03930af4585a3a325a')
and length(a.customerid) > 10
group by
customerid,
ticketid,
transactiondate,
yearid,
monthid
Third, always group by or order by name, not by position. You might change the field order over time. I did notice an extra """ at the end of the query, but it might be a typo.
At this point you will know if the inner query works correctly in the view and you can focus on the outer query with the percentiles.
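For example, once the view exists, the outer query can be tested on its own against it. Here is a minimal sketch covering just a couple of the percentiles (assuming your Databricks runtime supports the WITHIN GROUP syntax; the remaining aggregates can be added back one at a time):
%sql
select
  customerid,
  yearid,
  monthid,
  sum(TotalSpendings) as TotalSpendings,
  count(distinct ticketid) as TotalTickets,
  percentile_disc(0.25) within group (order by TotalQuantity) as PercentileQttyTicket_25,
  percentile_disc(0.50) within group (order by TotalQuantity) as PercentileQttyTicket_50
from STACK_OVER_FLOW.FILTERED_TICKET_ITEM
group by customerid, yearid, monthid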
In data engineering, I have seen the Spark optimizer get confused when the number of temporary views is large. In these cases, the intermediate view might have to be written to a file as a step. Then you can expose that file as a view and continue with your engineering effort.
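A rough sketch of that pattern (the table and view names below are placeholders, not from your schema):
%sql
-- Materialize the intermediate result to a managed table ...
create table STACK_OVER_FLOW.FILTERED_TICKET_ITEM_TBL as
select * from STACK_OVER_FLOW.FILTERED_TICKET_ITEM;

-- ... then expose it again as a view for the downstream steps.
create view STACK_OVER_FLOW.FILTERED_TICKET_ITEM_V as
select * from STACK_OVER_FLOW.FILTERED_TICKET_ITEM_TBL;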
The percentile_disc function is part of the Databricks distribution.
https://docs.databricks.com/sql/language-manual/functions/percentile_disc.html
It is not a core function that is part of the open-source Apache Spark distribution.
https://spark.apache.org/docs/latest/api/sql/index.html#percentile
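If your cluster runs a version where the WITHIN GROUP syntax does not parse, one possible fallback is the open-source percentile aggregate. Be aware that percentile interpolates (it behaves like percentile_cont), so it can return values that are not actual data points, unlike percentile_disc. A sketch against the view above:
%sql
select
  customerid,
  yearid,
  monthid,
  percentile(TotalQuantity, 0.25) as PercentileQttyTicket_25,
  percentile(TotalQuantity, 0.50) as PercentileQttyTicket_50,
  percentile(TotalSpendings, 0.90) as PercentileSpendingsTicket_90
from STACK_OVER_FLOW.FILTERED_TICKET_ITEM
group by customerid, yearid, monthid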
Please add more information to the post if you reduce the complexity and still cannot find your issue.
Related
All,
I am having problems with the query below. I am trying to get stat data from our database for the last 3 years, but I keep getting the error message:
***Column 'OC_VDATA.DATA1' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.***
I know it has something to do with the DATA1 column, but I am not familiar enough with the PERCENTILE_CONT function to know what the solution is.
Anyone have any ideas?
WITH Q AS
(
SELECT stagingPLM.dbo.ITEM_CODES.ITEM_CODE,
AVG(OC_VDATA.DATA1) AS Mean,
STDEVP(OC_VDATA.DATA1) AS StandardDev,
PERCENTILE_CONT(0.5)
WITHIN GROUP (ORDER BY OC_VDATA.DATA1)
OVER (PARTITION BY stagingPLM.dbo.ITEM_CODES.ITEM_CODE) AS Median
FROM OC_VDATA INNER JOIN
OC_VDAT_AUX ON OC_VDATA.PARTNO = OC_VDAT_AUX.PARTNOAUX
AND OC_VDATA.DATETIME = OC_VDAT_AUX.DATETIMEAUX INNER JOIN
stagingPLM.dbo.ITEM_CODES ON LEFT(OC_VDATA.PARTNO, 12) = stagingPLM.dbo.ITEM_CODES.SPEC_NO
AND LEFT(OC_VDAT_AUX.PARTNOAUX, 12) = stagingPLM.dbo.ITEM_CODES.SPEC_NO
WHERE (OC_VDAT_AUX.UDL28 LIKE '%PLASTIC%')
AND (RIGHT(OC_VDATA.PARTNO, 6) = '036150')
AND (CAST(OC_VDAT_AUX.UDL40 AS DATETIME)
BETWEEN CONVERT(datetime, '2019-05-18 00:00:00', 102) AND CONVERT(datetime, '2022-05-18 00:00:00', 102))
GROUP BY stagingPLM.dbo.ITEM_CODES.ITEM_CODE
)
SELECT * FROM Q
The error is because of the code WITHIN GROUP (ORDER BY OC_VDATA.DATA1).
You are doing a GROUP BY (for AVG and STDEVP) on ITEM_CODE, whereas the ORDER BY on OC_VDATA.DATA1 belongs to the window function.
It is better to calculate AVG, STDEVP, and PERCENTILE_CONT entirely with window functions, instead of half through GROUP BY and half through a window function.
By considering the minimum required columns to reproduce the issue, you can rewrite the query as below to get the desired output.
SELECT DISTINCT item_codes.item_code,
Avg(oc_vdata.data1)
over(
PARTITION BY item_codes.item_code) AS Mean,
Stdevp(oc_vdata.data1)
over(
PARTITION BY item_codes.item_code) AS StandardDev,
Percentile_cont(0.5)
within GROUP (ORDER BY oc_vdata.data1) over (
PARTITION BY item_codes.item_code) AS Median
FROM oc_vdata
inner join item_codes
ON Left(oc_vdata.partno, 12) = item_codes.spec_no
Minimum steps to reproduce the error:
SELECT item_codes.item_code,
Avg(oc_vdata.data1) AS Mean,
Stdevp(oc_vdata.data1) AS StandardDev
FROM oc_vdata
INNER JOIN item_codes
ON LEFT(oc_vdata.partno, 12) = item_codes.spec_no
GROUP BY item_codes.item_code
ORDER BY oc_vdata.data1 -- This will cause the error
I am working on a restaurant management system. There I have two tables:
order_details(orderId,dishId,createdAt)
dishes(id,name,imageUrl)
My customer wants to see a report of the top 3 selling items and the 3 least-selling items by month.
For the moment, I did something like this:
SELECT
*
FROM
(SELECT
SUM(qty) AS qty,
order_details.dishId,
MONTHNAME(order_details.createdAt) AS mon,
dishes.name,
dishes.imageUrl
FROM
rms.order_details
INNER JOIN dishes ON order_details.dishId = dishes.id
GROUP BY order_details.dishId , MONTHNAME(order_details.createdAt)) t
ORDER BY t.qty
This gives me the sold count for all dishes, ordered by qty.
I then have to manually filter to keep a maximum of 3 records and reject the rest. There should be a SQL way of doing this. How do I do it in SQL?
You would use row_number() for this purpose. You don't specify the database you are using, so I am guessing at the appropriate date functions. I also assume that you mean a month within a year, so you need to take the year into account as well:
SELECT ym.*
FROM (SELECT YEAR(od.CreatedAt) as yyyy,
MONTH(od.createdAt) as mm,
SUM(qty) AS qty,
od.dishId, d.name, d.imageUrl,
ROW_NUMBER() OVER (PARTITION BY YEAR(od.CreatedAt), MONTH(od.createdAt) ORDER BY SUM(qty) DESC) as seqnum_desc,
ROW_NUMBER() OVER (PARTITION BY YEAR(od.CreatedAt), MONTH(od.createdAt) ORDER BY SUM(qty) ASC) as seqnum_asc
FROM rms.order_details od INNER JOIN
dishes d
ON od.dishId = d.id
GROUP BY YEAR(od.CreatedAt), MONTH(od.CreatedAt), od.dishId
) ym
WHERE seqnum_asc <= 3 OR
seqnum_desc <= 3;
Using the above info, I used a combination of GROUP BY, ORDER BY, and LIMIT,
as shown below. I hope this is what you are looking for.
SELECT
t.qty,
t.dishId,
t.month,
d.name,
d.imageUrl
from
(
SELECT
od.dishId,
count(od.dishId) AS 'qty',
date_format(od.createdAt,'%Y-%m') as 'month'
FROM
rms.order_details od
group by date_format(od.createdAt,'%Y-%m'),od.dishId
order by qty desc
limit 3) t
join rms.dishes d on (t.dishId = d.id)
So I'm working on an RFM analysis, and with lots of help, was able to put together the following query that outputs the customer_id, r score, f score, m score, and lastly a combined rfm score:
--This will first create quintiles using the ntile function
--Then factor in the conditions
--Then combine the score
--Then the substrings will separate each score's individual points
SELECT *,
SUBSTRING(rfm_combined,1,1) AS recency_score,
SUBSTRING(rfm_combined,2,1) AS frequency_score,
SUBSTRING(rfm_combined,3,1) AS monetary_score
FROM (
SELECT
customer_id,
rfm_recency*100 + rfm_frequency*10 + rfm_monetary AS rfm_combined
FROM
(SELECT
customer_id,
ntile(5) over (order by last_order_date) AS rfm_recency,
ntile(5) over (order by count_order) AS rfm_frequency,
ntile(5) over (order by total_spent) AS rfm_monetary
FROM
(SELECT
customer_id,
MAX(oms_order_date) AS last_order_date,
COUNT(*) AS count_order,
SUM(quantity_ordered * unit_price_amount) AS total_spent
FROM
l_dmw_order_report
WHERE
order_type NOT IN ('Sales Return', 'Sales Price Adjustment')
AND item_description_1 NOT IN ('freight', 'FREIGHT', 'Freight')
AND line_status NOT IN ('CANCELLED', 'HOLD')
AND oms_order_date BETWEEN '2019-01-01' AND CURRENT_DATE
AND customer_id = 'US621111112234061'
GROUP BY customer_id))
ORDER BY customer_id desc)
In the above, you will notice that I am forcing it to output only a particular customer_id. That is because I wanted to test whether this query accounts for a customer_id appearing in multiple YearMonth categories (because they could have bought in Jan, then again in Feb, then again in Nov).
The issue here is that, although the query outputs the right scores, it only seems to account for the customer_id once, regardless of whether it appears in multiple months. For this particular customer ID, I see that they appear in Jan 2019, Feb 2019, and Nov 2019, so it should be giving me 3 rows instead of just 1. I've been testing for a few hours and can't seem to find the cause, but I suspect that my grouping may be wrong.
Thank you for your help and let me know if you have any questions!!
Best,
Z
I have the below query that was created to show the summation of the "last" values for a year. Usually this is a December value, but the year could potentially end in any month, so I want to add together the last values for each GoalMonteCarloHeaderID. I have it working 99%, but there are some random duplicates in the [year] value.
WITH endBalances AS (
SELECT ROW_NUMBER() OVER (PARTITION BY GoalMonteCarloHeaderID, Year(Convert(date, MonthDate))
                          ORDER BY Max(Month(Convert(date, MonthDate))) DESC) n,
       Max(Month(Convert(date, MonthDate))) maxMonth,
       GrowthBucket, WithdrawalBucket, NoTaxesBucket,
       Year(MonthDate) [year]
From GoalMonteCarloMedianResults mcmr
full join GoalMonteCarloHeader mch on mch.ID = mcmr.GoalMonteCarloHeaderID
full join GoalChartData gcd on gcd.ID = mch.GoalChartDataID and gcd.TypeID = 2
inner join Goal g on g.iGoalID = gcd.GoalID
where g.iTypeID in (1) and g.iHHID = 850802
group by GoalMonteCarloHeaderID, MonthDate, GrowthBucket, WithdrawalBucket, NoTaxesBucket
)
SELECT [year], Sum(GrowthBucket) GrowthBucket, Sum(WithdrawalBucket) WithdrawalBucket,Sum(NoTaxesBucket) NoTaxesBucket, maxMonth
From endBalances
where [year] is not null and n=1
Group By [year], maxMonth
order by [year] asc
[Image: query results showing two random duplicates in the [year] value.]
You can see in the image there are two examples where the year is duplicated and displayed for more than just the 'last' month in the year. Am I doing something wrong with the GROUP BY or the PARTITION BY in my query? I am not the most familiar with this functionality of T-SQL.
T-SQL has a lovely function for this which has no direct equivalent in MySQL.
ROW_NUMBER() OVER (PARTITION BY [year] ORDER BY MonthDate DESC) AS rn
Then anything with rn=1 will be the last entry in a year.
The answers to this question have a few ideas:
ROW_NUMBER() in MySQL
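Applied to your query, the idea would look roughly like the sketch below (the joins and Goal filters from the original are omitted for brevity, and I kept GoalMonteCarloHeaderID in the partition so each header keeps its own last month). The key difference is ranking by MonthDate directly instead of grouping by it:
WITH endBalances AS (
    SELECT GoalMonteCarloHeaderID,
           Year(Convert(date, MonthDate)) AS [year],
           GrowthBucket, WithdrawalBucket, NoTaxesBucket,
           -- rank each header/year's rows so the latest month gets rn = 1
           ROW_NUMBER() OVER (PARTITION BY GoalMonteCarloHeaderID,
                                           Year(Convert(date, MonthDate))
                              ORDER BY MonthDate DESC) AS rn
    FROM GoalMonteCarloMedianResults
)
SELECT [year],
       SUM(GrowthBucket)     AS GrowthBucket,
       SUM(WithdrawalBucket) AS WithdrawalBucket,
       SUM(NoTaxesBucket)    AS NoTaxesBucket
FROM endBalances
WHERE [year] IS NOT NULL AND rn = 1
GROUP BY [year]
ORDER BY [year];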
I am trying to calculate the percentage change in price between days. As the days are not consecutive, I built into the query a calculated field that tells me what relative day it is (day 1, day 2, etc.). In order to compare today with yesterday, I offset the calculated day number by 1 in a subquery. What I want to do is join the inner and outer queries on the calculated relative day. The code I came up with is:
SELECT TOP 11
P.Date,
(AVG(P.SettlementPri) - PriceY) / PriceY as PriceChange,
P.Symbol,
(RANK() OVER (ORDER BY P.Date desc)) as dayrank_Today
FROM OTE P
JOIN (SELECT TOP 11
C.Date,
AVG(SettlementPri) as PriceY,
(RANK() OVER (ORDER BY C.Date desc))+1 as dayrank_Yest
FROM OTE C
WHERE C.ComCode = 'C-'
GROUP BY c.Date) C ON dayrank_Today = C.dayrank_Yest
WHERE P.ComCode = 'C-'
GROUP BY P.Symbol, P.Date
If I try to execute the query, I get an error message indicating dayrank_Today is an invalid column. I have tried renaming it, qualifying it, and yelling obscenities at it, and I get squat. Still an error.
You can't do a select of a calculated column and then use it in a join. You can use CTEs, which I'm not so familiar with, or you can just do table selects like so:
SELECT
P.Date,
(AVG(AvgPrice) - C.PriceY) / C.PriceY as PriceChange,
P.Symbol,
P.dayrank_Today FROM
(SELECT TOP 11
ComCode,
Date,
AVG(SettlementPri) as AvgPrice,
Symbol,
(RANK() OVER (ORDER BY Date desc)) as dayrank_Today
FROM OTE WHERE ComCode = 'C-') P
JOIN (SELECT TOP 11
C.Date,
AVG(SettlementPri) as PriceY,
(RANK() OVER (ORDER BY C.Date desc))+1 as dayrank_Yest
FROM OTE C
WHERE C.ComCode = 'C-'
GROUP BY c.Date) C ON dayrank_Today = C.dayrank_Yest
GROUP BY P.Symbol, P.Date
If possible, consider using a CTE, as it makes this very easy. Something like this:
With Raw as
(
SELECT TOP 11 C.Date,
Avg(SettlementPri) As PriceY,
Rank() OVER (ORDER BY C.Date desc) as dayrank
FROM OTE C WHERE C.Comcode = 'C-'
Group by C.Date
)
select today.pricey as todayprice ,
yesterday.pricey as yesterdayprice,
(today.pricey - yesterday.pricey)/today.pricey * 100 as percentchange
from Raw today
left outer join Raw yesterday on today.dayrank = yesterday.dayrank + 1
Obviously this doesn't include the symbol, but that can be included pretty easily.
If using the WITH syntax doesn't suit, you can also use calculated fields with OUTER APPLY: http://technet.microsoft.com/en-us/library/ms175156.aspx
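For reference, the OUTER APPLY pattern lets you name a calculated expression once and then reuse the alias elsewhere in the query. A minimal sketch of the general idea (the OpeningPri column is hypothetical, not from the OTE table):
SELECT P.Symbol,
       P.Date,
       calc.DayChange                   -- alias defined once in the APPLY
FROM OTE P
OUTER APPLY (
    SELECT P.SettlementPri - P.OpeningPri AS DayChange   -- hypothetical column
) calc
WHERE P.ComCode = 'C-'
  AND calc.DayChange > 0;              -- the alias is usable here, unlike a SELECT alias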
Although the CTE will mean that you only need to write your price calculation once, which is a lot cleaner.
Cheers
I had the same problem, found this thread, and found a solution, so I thought I'd post it here.
Instead of using the column name as the parameter for ON, copy the statement that gave you the column name in the first place:
replace:
ON dayrank_Today = C.dayrank_Yest
with:
ON (RANK() OVER (ORDER BY Date desc)) = C.dayrank_Yest
Granted, you're displeasing the Programming Gods by violating DRY, but you could be pragmatic and mention the duplication in the comments, which should appease their wrath to a mild grumbling.