Running calculations on different percentages of data (SQL)

I'm brainstorming ways to find trends in a dataset of transaction amounts that spans a year.
I'd like to compute the average of the top 25% of observations and of the bottom 75%, and vice versa.
If the entire dataset contains 1000 observations, I'd like to run:
An average of the top 25%, then separately an average of the bottom 75%, and then the average of those two results.
Inversely, the top 75% average, then the bottom 25%, then the average of the two.
For the overall average I have: avg(transaction_amount)
I am aware that for the sectioned averages to be useful, I will have to order the data by date, which I have already accounted for in my SQL code:
select avg(transaction_amount)
from example.table
order by transaction_date
I am now struggling to find a way to split the data between 25% and 75% based on the number of observations.
Thanks.

If you're using MSSQL, it's pretty trivial depending on exactly the output you're looking for.
SELECT AVG(sub.transaction_amount) AS avg_amt
FROM (
    SELECT TOP 25 PERCENT transaction_amount
    FROM example.table
    ORDER BY transaction_amount DESC
) AS sub;
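If you also need the bottom 75% and the average of the two in one statement, NTILE can split the ordered rows into four equal buckets. A sketch, assuming SQL Server and ordering by date as the question describes; example.[table] stands for the question's placeholder name, bracketed because table is a reserved word in T-SQL:
WITH quartiles AS (
    SELECT transaction_amount,
           NTILE(4) OVER (ORDER BY transaction_date) AS q  -- buckets 1..4, equal sizes
    FROM example.[table]
)
SELECT AVG(CASE WHEN q = 4 THEN transaction_amount END) AS top_25_avg,
       AVG(CASE WHEN q < 4 THEN transaction_amount END) AS bottom_75_avg,
       (AVG(CASE WHEN q = 4 THEN transaction_amount END)
      + AVG(CASE WHEN q < 4 THEN transaction_amount END)) / 2.0 AS avg_of_avgs
FROM quartiles;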

Use PERCENT_RANK in order to see which percentage block a row belongs to. Then use this to group your data:
with data as
(
    select t.*, percent_rank() over (order by transaction_amount) as pr
    from example.table t
)
select
    case when pr <= 0.75 then '0-75%' else '75-100%' end as percent,
    avg(transaction_amount) as avg,
    avg(avg(transaction_amount)) over () as avg_of_avg
from data
group by case when pr <= 0.75 then '0-75%' else '75-100%' end
union all
select
    case when pr <= 0.25 then '0-25%' else '25-100%' end as percent,
    avg(transaction_amount) as avg,
    avg(avg(transaction_amount)) over () as avg_of_avg
from data
group by case when pr <= 0.25 then '0-25%' else '25-100%' end;

Related

How to calculate the percentage of entries that match a condition out of the total? (SQL)

I've been playing around with this for the whole day, and it's by far the hardest topic in SQL for me to understand.
Say we have a students table that consists of a group number and a student rating, like so:
Each group can contain multiple students.
Now I want to find groups where at least 60% of students have a rating of 4 or higher.
Expecting something like:
group_n    percentage_of_goodies
1120       0.7
1200       0.66
1111       1
I tried this
with group_goodies as (
select group_n, count(id) goodies from students
where rating >= 4
group by group_n
), group_counts as (
select group_n, count(id) acount from students
group by group_n
)
select cast(group_goodies.goodies as float)/group_counts.acount from group_counts, group_goodies
where cast(group_goodies.goodies as float)/group_counts.acount > 0.6;
and got an unexpected result
where the percentage seems to surpass 100% (and it's not because I misplaced the denominator, since there are contradictory outputs below as well), which is obviously not intended. There are also more output rows than there are groups. Apparently I could use window functions, but I can't figure them out myself. So how can I get this query right?
The problem is that extracting the count of students before and after the filter seems to be impossible within a single query, so I had to create two CTEs to get the results I need. Both CTEs seem to output proper results, where in the first CTE the number of students rarely exceeds 10, and in the second CTE the counts are naturally smaller and match expectations. But when I divide them like that, the result is unreasonable.
If someone can explain this properly, it will make my day 😳
If I understand correctly, this is a pretty direct aggregation query:
select group_n, avg( (rating >= 4)::int ) as good_students
from students
group by group_n
having avg( (rating >= 4)::int ) > 0.6;
I don't see why two levels of aggregation would be needed.
The avg() works by converting each rating to 1 if it is 4 or higher and 0 otherwise; the average of these values is then the ratio of ratings that qualify. For example, ratings 5, 4, 3 become 1, 1, 0, whose average is 2/3 ≈ 0.67.
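As for why the original two-CTE query misbehaved: from group_counts, group_goodies is a cross join, so every group's goodie count gets paired with every other group's total. That is why there were more output rows than groups and ratios above 100%. Joining the CTEs on group_n makes the original approach work too; a sketch:
with group_goodies as (
    select group_n, count(id) as goodies
    from students
    where rating >= 4
    group by group_n
), group_counts as (
    select group_n, count(id) as acount
    from students
    group by group_n
)
select gg.group_n,
       cast(gg.goodies as float) / gc.acount as percentage_of_goodies
from group_counts gc
join group_goodies gg on gg.group_n = gc.group_n
where cast(gg.goodies as float) / gc.acount > 0.6;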
First, aggregate the students per group and rating to get partial counts.
Then calculate the fraction of ratings that are 4 or better, and report only those groups where that fraction exceeds 0.6.
SELECT group_n,
       100.0 *
       sum(ratings_per_group) FILTER (WHERE rating >= 4) /
       sum(ratings_per_group)
       AS percentage_of_goodies
FROM (SELECT group_n, rating,
count(*) AS ratings_per_group
FROM students
GROUP BY group_n, rating
) AS per_group_and_rating
GROUP BY group_n
HAVING sum(ratings_per_group) FILTER (WHERE rating >= 4) /
sum(ratings_per_group) > 0.6;
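FILTER is not available in every database; where it isn't, a conditional sum is the portable equivalent. A sketch of the same query rewritten with CASE:
SELECT group_n,
       100.0 *
       sum(CASE WHEN rating >= 4 THEN ratings_per_group ELSE 0 END) /
       sum(ratings_per_group) AS percentage_of_goodies
FROM (SELECT group_n, rating,
             count(*) AS ratings_per_group
      FROM students
      GROUP BY group_n, rating
     ) AS per_group_and_rating
GROUP BY group_n
HAVING 100.0 *
       sum(CASE WHEN rating >= 4 THEN ratings_per_group ELSE 0 END) /
       sum(ratings_per_group) > 60;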

Sum of amount grouped by condition - optimise query

I have a view consisting of data from different tables; the major fields are BillNo, ITEM_FEE, and GroupNo. I need to calculate the total discount for a given GroupNo. The discount calculation is based on the fractional part of the amount, grouped by BillNo (a single BillNo can have multiple entries). If there are multiple transactions for a single BillNo, the discount is calculated when the decimal part of the sum of ITEM_FEE is greater than 0; if there is only a single transaction and the decimal part of its ITEM_FEE is greater than 0, then that decimal part is treated as the discount.
I have prepared script and I am getting total discount for a particular groupNo.
declare @GroupNo as nvarchar(100)
set @GroupNo = '3051'

SELECT Sum(disc) Discount
FROM --sum(ITEM_FEE) TotalAmount,
     (SELECT (SELECT CASE
                       WHEN Sum(item_fee) % 1 > 0 THEN Sum(item_fee % 1)
                     END
              FROM view_bi_sales VBS
              WHERE VBS.billno = VB.billno
              GROUP BY billno) Disc
      FROM view_bi_sales VB
      WHERE groupno = @GroupNo) temp
The problem is that it takes almost 2 minutes to get the result.
Please help me to find the result faster if possible.
Thank you for all your help and support. As I was already calculating the sum of the decimal part of ITEM_FEE grouped by BillNo, there was no need to check whether it was greater than 0. The query below gives me the desired output in less than 10 seconds:
select sum(discount)
from (select sum(ITEM_FEE % 1) discount
      from view_BI_Sales
      where groupno = 3051
      group by BillNo) temp
If I understand correctly, you don't need the correlated subquery. This might help performance:
SELECT SUM(disc) as Discount
FROM (SELECT (CASE WHEN SUM(item_fee % 1) > 0
THEN SUM(item_fee % 1)
END) as disc
FROM view_bi_sales VBS
WHERE groupno = #GroupNo
GROUP BY billno
) vbs;
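One caveat: SUM(item_fee % 1) and SUM(item_fee) % 1 are not interchangeable. The first sums the decimal parts of the items; the second takes the decimal part of the bill's sum. A minimal illustration using T-SQL's VALUES constructor, with made-up amounts:
-- two items on one bill: 10.5 and 10.7
SELECT SUM(x % 1) AS sum_of_parts,  -- 0.5 + 0.7 = 1.2
       SUM(x) % 1  AS part_of_sum   -- 21.2 % 1 = 0.2
FROM (VALUES (10.5), (10.7)) AS t(x);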

Redshift - Find % as compared to total value

I have a table with a count by product. I am trying to add a new column that shows each row's count as a percentage of the sum of all rows in that column.
prod_name,count
prod_a,100
prod_b,50
prod_c,150
For example, I want to find % of prod_a as compared to the total count and so on.
Expected output:
prod_name,count,%
prod_a,100,0.33
prod_b,50,0.167
prod_c,150,0.5
Edit, showing the SQL I tried:
select count(*),ratio_to_report(prod_name)
over (partition by count(*))
from sales
group by prod_name;
Using window functions.
select t.*, 100.0 * cnt_by_prod / sum(cnt_by_prod) over () as pct
from tbl t
Edit: based on the OP's question change, to compute the counts and then the percentage, use:
select prod_name, 100.0 * count(*) / sum(count(*)) over () as pct
from tbl
group by prod_name
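Since the question's edit mentions RATIO_TO_REPORT: Redshift does have it as a window function, but it has to run over already-aggregated rows, e.g. over a subquery. A sketch that returns the fraction directly, matching the expected output (the sales table name is taken from the question's edit):
select prod_name,
       cnt,
       ratio_to_report(cnt) over () as pct
from (select prod_name, count(*) as cnt
      from sales
      group by prod_name) t;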

Outlier detection, excluding outliers?

I want to detect outliers (more than 20 standard deviations from the average), but I don't want values more than 3 standard deviations out to affect that average. I came up with this:
SELECT d.* FROM
(
SELECT
d.*,
(amount - avg(amount_excl_3z) OVER(PARTITION BY productid)) / NULLIF(STDEV(amount_excl_3z) OVER(PARTITION BY productid), 0) AS zscore_ex
FROM
(
SELECT
d.*,
--when the amount's z-score exceeds 3, null the amount so it is excluded from the baseline
CASE WHEN ABS(amount - avg(amount) OVER(PARTITION BY productid)) / NULLIF(STDEV(amount) OVER(PARTITION BY productid), 0) > 3
THEN NULL ELSE amount END AS amount_excl_3z
FROM sales d
WHERE --the past year's sales of product 1; one day I will consider all prods, hence why I left the partitions in
timestamp > GETUTCDATE()-365.0 AND
productid = 1
) d
) d
WHERE d.zscore_ex > 20
ORDER BY amount desc
The concern with the data is that if too many outliers occur, they will drastically affect the average: there might be 1000 occurrences of a product with an amount of 1, and then 5 occurrences with an amount of 20,000. I don't want the 20,000s to influence the average; I don't really want 50 occurrences of 20,000 to influence it either. 500 occurrences, though, would be OK and would represent a new norm.
The way I'm considering doing this is to strip out low-frequency, massive outliers. If they start occurring frequently enough that they pull the average into their range, then I'll start including them.
The query above is my best stab at "outlier detection that tries to exclude small numbers of wild outliers from having too much impact". Is there any other facility in SQL Server that I can leverage more effectively for this algorithm? Perhaps some analytic query that can indicate where on a distribution curve a point lies? I looked at PERCENT_RANK, CUME_DIST, PERCENTILE_CONT/DISC, and NTILE, but they seemed more linear in the distribution of their output than a z-score.
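One option along those lines is a trimmed mean: average only the values inside a percentile band, which caps how much any rare, extreme value can move the baseline. A sketch, assuming SQL Server 2012+ and the same sales table as above; the 1%/99% cutoffs are illustrative, and note this trims by rank rather than by frequency, so it is a different trade-off from the z-score approach:
SELECT AVG(amount) AS trimmed_avg
FROM (
    SELECT amount,
           -- per-product percentile bounds; SQL Server exposes
           -- PERCENTILE_CONT only as an analytic (windowed) function
           PERCENTILE_CONT(0.01) WITHIN GROUP (ORDER BY amount)
               OVER (PARTITION BY productid) AS lo,
           PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY amount)
               OVER (PARTITION BY productid) AS hi
    FROM sales
    WHERE productid = 1 AND timestamp > GETUTCDATE()-365.0
) d
WHERE amount BETWEEN lo AND hi;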

How can I make this query run efficiently?

In BigQuery, we're trying to run:
SELECT day, AVG(value)/(1024*1024) FROM (
SELECT value, UTC_USEC_TO_DAY(timestamp) as day,
PERCENTILE_RANK() OVER (PARTITION BY day ORDER BY value ASC) as rank
FROM [Datastore.PerformanceDatum]
WHERE type = "MemoryPerf"
) WHERE rank >= 0.9 AND rank <= 0.91
GROUP BY day
ORDER BY day desc;
which returns a relatively small amount of data. But we're getting the message:
Error: Resources exceeded during query execution. The query contained a GROUP BY operator, consider using GROUP EACH BY instead. For more details, please see https://developers.google.com/bigquery/docs/query-reference#groupby
What is making this query fail, the size of the subquery? Is there some equivalent query we can do which avoids the problem?
Edit in response to comments: if I add GROUP EACH BY (and drop the outer ORDER BY), the query fails, claiming that GROUP EACH BY is not parallelizable here.
I wrote an equivalent query that works for me:
SELECT day, AVG(value)/(1024*1024) FROM (
SELECT data value, UTC_USEC_TO_DAY(dtimestamp) as day,
PERCENTILE_RANK() OVER (PARTITION BY day ORDER BY value ASC) as rank
FROM [io_sensor_data.moscone_io13]
WHERE sensortype = "humidity"
) WHERE rank >= 0.9 AND rank <= 0.91
GROUP BY day
ORDER BY day desc;
If I run only the inner query, I get 3,660,624 results. Is your dataset bigger than that?
The outer select gives me only 4 results when grouped by day. I'll try a different grouping to see if I can hit a limit there:
SELECT day, AVG(value)/(1024*1024) FROM (
SELECT data value, dtimestamp / 1000 as day,
PERCENTILE_RANK() OVER (PARTITION BY day ORDER BY value ASC) as rank
FROM [io_sensor_data.moscone_io13]
WHERE sensortype = "humidity"
) WHERE rank >= 0.9 AND rank <= 0.91
GROUP BY day
ORDER BY day desc;
Runs too, now with 57,862 different groups.
I tried different combinations to get to the same error. I was able to reproduce it by doubling the amount of initial data. An easy "hack" to double the amount of data is changing:
FROM [io_sensor_data.moscone_io13]
To:
FROM [io_sensor_data.moscone_io13], [io_sensor_data.moscone_io13]
Then I get the same error. How much data do you have? Can you apply an additional filter? As you are already partitioning the percentile_rank by day, can you add an additional query to only analyze a fraction of the days (for example, only last month)?
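For illustration, a sketch of that last suggestion applied to the original query, using legacy-SQL date functions (this assumes timestamp is in microseconds, which the original UTC_USEC_TO_DAY call implies; the current-month cutoff is just an example):
SELECT day, AVG(value)/(1024*1024) FROM (
  SELECT value, UTC_USEC_TO_DAY(timestamp) as day,
    PERCENTILE_RANK() OVER (PARTITION BY day ORDER BY value ASC) as rank
  FROM [Datastore.PerformanceDatum]
  WHERE type = "MemoryPerf"
    AND timestamp >= UTC_USEC_TO_MONTH(NOW())  -- only the current month
) WHERE rank >= 0.9 AND rank <= 0.91
GROUP BY day
ORDER BY day desc;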