SQL calculating percentage - sql

I am trying to get a claim denied percentage count (total_count / denied_count * 100) for providers with under 100 claims. I am able to get the total count and denied count with separate queries, but I am having trouble pulling everything together.
SELECT
PROVID,
COUNT(CLAIMID) AS TOTAL_COUNT,
COUNT(CLAIMID) / (SELECT COUNT(CLAIMID) * 100
FROM #TEMPSTAGE
WHERE STATUS = 'DENY') AS DENIED_PERCENTAGE
FROM
#TEMPSTAGE
WHERE
PROVID IN (SELECT DISTINCT PROVID
FROM #TEMPSTAGE
GROUP BY PROVID
HAVING COUNT(CLAIMID) <= 100)
GROUP BY
PROVID
Results example:
ProvID / Total_Count / Denied Percentage
-----------------------------------------
X12345 / 77 / 0
I am getting zero denied percentage for everything as my subquery in the select statement isn't allowing me to group by provid.
Error
Only one expression can be specified in the select list when the subquery is not introduced with EXISTS.
What's the best way to go about this??

As in most languages, if you do 1 / 2 with integers, the result is 0, because there is no integer for 0.5. To get a decimal (fixed point or floating point) you need to convert the datatypes.
How depends on your dialect of SQL (MySQL, SQL Server, Oracle, PostgreSQL, etc).
CAST(COUNT(CLAIMID) AS FLOAT)
CAST(COUNT(CLAIMID) AS DECIMAL(10, 4))
COUNT(CLAIMID) * 1.0
etc, etc
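The effect is easy to reproduce. The sketch below uses Python's built-in sqlite3 module purely as a convenient test bed (an assumption; in SQL Server, FLOAT or DECIMAL plays the role of SQLite's REAL) and compares plain integer division with two of the conversions listed above. The function name division_variants is made up for the illustration.

```python
import sqlite3

def division_variants(numerator: int, denominator: int) -> dict:
    """Divide two integers in SQL three ways and return the results."""
    conn = sqlite3.connect(":memory:")
    row = conn.execute(
        """
        SELECT
            ? / ?               AS int_div,        -- integer / integer truncates
            CAST(? AS REAL) / ? AS cast_div,       -- convert one operand first
            ? * 1.0 / ?         AS times_one_div   -- multiplying by 1.0 also converts
        """,
        (numerator, denominator) * 3,
    ).fetchone()
    conn.close()
    return {"int_div": row[0], "cast_div": row[1], "times_one_div": row[2]}

result = division_variants(1, 2)
print(result)
```

Only the first expression loses the fractional part; converting either operand before the division is enough.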
Next, the parentheses. A literal list for IN is written in parentheses, IN (1, 2, 3), and a subquery is also written in parentheses, (SELECT x FROM y).
One pair serves both purposes: IN (SELECT x FROM y). A second pair is not required, and some dialects (PostgreSQL, for example) then treat the subquery as a scalar expression, which fails as soon as it returns more than one row.
So, the smallest changes to your query are...
SELECT
PROVID,
COUNT(CLAIMID) AS TOTAL_COUNT,
COUNT(CLAIMID) / (SELECT COUNT(CLAIMID) * 100.0
FROM #TEMPSTAGE
WHERE STATUS = 'DENY') AS DENIED_PERCENTAGE
FROM
#TEMPSTAGE
WHERE
PROVID IN (SELECT PROVID
FROM #TEMPSTAGE
GROUP BY PROVID
HAVING COUNT(CLAIMID) <= 100)
GROUP BY
PROVID
That said, the subquery in the WHERE clause can be replaced with a HAVING clause on the main query...
SELECT
PROVID,
COUNT(CLAIMID) AS TOTAL_COUNT,
COUNT(CLAIMID) / (SELECT COUNT(CLAIMID) * 100.0
FROM #TEMPSTAGE
WHERE STATUS = 'DENY') AS DENIED_PERCENTAGE
FROM
#TEMPSTAGE
GROUP BY
PROVID
HAVING
COUNT(CLAIMID) <= 100
Also, I've removed the DISTINCT keyword. With GROUP BY PROVID each provider already appears only once, so you don't need it.
Edited, following a comment:
You can skip the sub-query and just sum the number of rows in the group where the status is 'DENY'.
Also, a percentage is (x * 100) / y not x / (y * 100), so I reversed the calculation.
SELECT
PROVID,
COUNT(CLAIMID) AS TOTAL_COUNT,
SUM(CASE WHEN STATUS = 'DENY' THEN 1 ELSE 0 END) * 100.0 / COUNT(CLAIMID) AS DENIED_PERCENTAGE
FROM
#TEMPSTAGE
GROUP BY
PROVID
HAVING
COUNT(CLAIMID) <= 100
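To check the conditional-aggregation version on a small data set, here is a minimal sketch using Python's built-in sqlite3 module. The table name tempstage, the sample rows, and the function name are assumptions made for the illustration; the SQL itself mirrors the query above.

```python
import sqlite3

def denied_percentages(claims, max_claims=100):
    """claims: iterable of (provid, claimid, status) tuples."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tempstage (provid TEXT, claimid INTEGER, status TEXT)"
    )
    conn.executemany("INSERT INTO tempstage VALUES (?, ?, ?)", claims)
    rows = conn.execute(
        """
        SELECT provid,
               COUNT(claimid) AS total_count,
               -- count the denied rows of the group, then scale to a percentage
               SUM(CASE WHEN status = 'DENY' THEN 1 ELSE 0 END) * 100.0
                   / COUNT(claimid) AS denied_percentage
        FROM tempstage
        GROUP BY provid
        HAVING COUNT(claimid) <= ?
        ORDER BY provid
        """,
        (max_claims,),
    ).fetchall()
    conn.close()
    return rows

sample = [
    ("A", 1, "DENY"), ("A", 2, "PAID"), ("A", 3, "PAID"), ("A", 4, "PAID"),
    ("B", 5, "DENY"), ("B", 6, "DENY"),
]
rows = denied_percentages(sample)
print(rows)
```

Provider A has one denied claim out of four and provider B has two out of two, so the percentages differ per provider instead of all being zero.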

Related

SQL - Calculate percentage by group, for multiple groups

I have a table in GBQ in the following format :
UserId Orders Month
XDT 23 1
XDT 0 4
FKR 3 6
GHR 23 4
... ... ...
It shows the number of orders per user and month.
I want to calculate the percentage of users who have orders. I did it as follows:
SELECT
HasOrders,
ROUND(COUNT(*) * 100 / CAST( SUM(COUNT(*)) OVER () AS float64), 2) Parts
FROM (
SELECT
*,
CASE WHEN Orders = 0 THEN 0 ELSE 1 END AS HasOrders
FROM `Table` )
GROUP BY
HasOrders
ORDER BY
Parts
It gives me the following result:
HasOrders Parts
0 35
1 65
I need to calculate the percentage of users who have orders, by month, in a way that every month = 100%
Currently to do this I execute the query once per month, which is not practical :
SELECT
HasOrders,
ROUND(COUNT(*) * 100 / CAST( SUM(COUNT(*)) OVER () AS float64), 2) Parts
FROM (
SELECT
*,
CASE WHEN Orders = 0 THEN 0 ELSE 1 END AS HasOrders
FROM `Table` )
WHERE Month = 1
GROUP BY
HasOrders
ORDER BY
Parts
Is there a way to execute the query once and get this result?
HasOrders Parts Month
0 25 1
1 75 1
0 45 2
1 55 2
... ... ...
SELECT
SIGN(Orders),
ROUND(COUNT(*) * 100.000 / SUM(COUNT(*)) OVER (PARTITION BY Month), 2) AS Parts,
Month
FROM T
GROUP BY Month, SIGN(Orders)
ORDER BY Month, SIGN(Orders)
Demo on Postgres:
https://dbfiddle.uk/?rdbms=postgres_10&fiddle=4cd2d1455673469c2dfc060eccea8020
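For a quick local check of the PARTITION BY idea, the following sketch runs an equivalent query through Python's built-in sqlite3 module. It assumes a SQLite build with window-function support (3.25 or later), uses (orders > 0) in place of SIGN(Orders), and the table, sample rows, and function name are invented for the example.

```python
import sqlite3

def order_share_by_month(rows):
    """rows: iterable of (user_id, orders, month) tuples."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (user_id TEXT, orders INTEGER, month INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)
    result = conn.execute(
        """
        SELECT month,
               (orders > 0) AS has_orders,
               -- the window sum adds up the group counts within each month
               ROUND(COUNT(*) * 100.0
                     / SUM(COUNT(*)) OVER (PARTITION BY month), 2) AS parts
        FROM t
        GROUP BY month, (orders > 0)
        ORDER BY month, has_orders
        """
    ).fetchall()
    conn.close()
    return result

sample = [
    ("XDT", 23, 1), ("FKR", 3, 1), ("GHR", 5, 1), ("ABC", 0, 1),
    ("XDT", 0, 2), ("GHR", 23, 2),
]
shares = order_share_by_month(sample)
print(shares)
```

Because the denominator is computed per month, the two rows of each month add up to 100.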
You've stated that it's important for the total to be 100%, so you might consider rounding down in the case of no orders and rounding up in the case of has orders for those scenarios where the percentage falls precisely on an odd multiple of 0.5%. Rounding toward even, or rounding the smaller share down, might be better options:
WITH DATA AS (
SELECT SIGN(Orders) AS HasOrders, Month,
COUNT(*) * 100.000 / SUM(COUNT(*)) OVER (PARTITION BY Month) AS PartsPercent
FROM T
GROUP BY Month, SIGN(Orders)
ORDER BY Month, SIGN(Orders)
)
select HasOrders, Month, PartsPercent,
PartsPercent - TRUNC(PartsPercent) AS Fraction,
CASE WHEN HasOrders = 0
THEN FLOOR(PartsPercent) ELSE CEILING(PartsPercent)
END AS PartsRound0Down,
CASE WHEN PartsPercent - TRUNC(PartsPercent) = 0.5
AND MOD(TRUNC(PartsPercent), 2) = 0
THEN FLOOR(PartsPercent) ELSE ROUND(PartsPercent) -- halfway up
END AS PartsRoundTowardEven,
CASE WHEN PartsPercent - TRUNC(PartsPercent) = 0.5 AND PartsPercent < 50
THEN FLOOR(PartsPercent) ELSE ROUND(PartsPercent) -- halfway up
END AS PartsSmallestTowardZero
from DATA
It's usually not advisable to test floating-point values for equality, and I don't know how BigQuery's float64 will behave in the comparison against 0.5, although one half is exactly representable in binary. The fiddle below shows these columns in a case where the breakdown is 101 vs 99. I don't have immediate access to BigQuery, so be aware that Postgres's rounding behavior may be different:
https://dbfiddle.uk/?rdbms=postgres_10&fiddle=c8237e272427a0d1114c3d8056a01a09
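The floor/ceiling variant can also be reasoned about outside the database. The sketch below is plain Python rather than SQL, and the function name split_percentages is hypothetical; it uses integer arithmetic so that no floating-point comparison is needed, and it also checks the 101 vs 99 case mentioned above.

```python
from typing import Tuple

def split_percentages(no_orders: int, has_orders: int) -> Tuple[int, int]:
    """Whole-number percentages for a two-way split that always sum to 100."""
    total = no_orders + has_orders
    if total == 0:
        raise ValueError("need at least one row")
    rounded_down = (no_orders * 100) // total      # floor, like FLOOR(...)
    rounded_up = -((-has_orders * 100) // total)   # ceiling, like CEILING(...)
    return rounded_down, rounded_up

# 101 of 200 rows is exactly 50.5 percent; the fractional part compares
# equal to 0.5 because both values are exactly representable in binary.
fraction_is_half = (101 * 100.0 / 200) - 50 == 0.5

print(split_percentages(99, 101), fraction_is_half)
```

Rounding one share down and the other up guarantees a total of 100, at the cost of always favouring the has-orders group on ties.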
Consider below approach
select hasOrders, round(100 * parts, 2) as parts, month from (
select month,
countif(orders = 0) / count(*) `0`,
countif(orders > 0) / count(*) `1`,
from your_table
group by month
)
unpivot (parts for hasOrders in (`0`, `1`))
with output like below

How to avoid function repetition in SELECT, GROUP BY, and ORDER BY in SQL

I am writing a statistical query where the value is duplicated in SELECT, GROUP BY, and ORDER BY. Having to repeat the same value makes it hard to read the query and modify it.
How can I avoid repeating FLOOR(COALESCE(LEN(Body), 0) / 100) 3-4 times in the query below.
SELECT FLOOR(COALESCE(LEN(Body), 0) / 100) * 100 as BodyLengthStart,
(FLOOR(COALESCE(LEN(Body), 0) / 100) + 1) * 100 - 1 as BodyLengthEnd,
COUNT(*) as MessageCount
FROM [Message]
GROUP BY FLOOR(COALESCE(LEN(Body), 0) / 100)
ORDER BY FLOOR(COALESCE(LEN(Body), 0) / 100)
The output of the query is the number of messages bucketed by how many hundreds of characters they have.
BodyLengthStart BodyLengthEnd MessageCount
0 99 130
100 199 76
200 299 36
Using CROSS APPLYs
SELECT BodyLengthStart,
BodyLengthEnd,
COUNT(*)
FROM [Message]
CROSS APPLY (
VALUES
(FLOOR(COALESCE(LEN(Body), 0) / 100))
) a1(v)
CROSS APPLY (
VALUES
(v * 100, (v + 1) * 100 - 1)
) a2(BodyLengthStart, BodyLengthEnd)
GROUP BY BodyLengthStart,
BodyLengthEnd
One option may be a CTE (Common Table Expression), something along these lines:
WITH x AS
(
SELECT FLOOR(COALESCE(LEN(Body), 0) / 100) AS BodyLength
FROM [Message]
)
SELECT BodyLength * 100 AS BodyLengthStart,
(BodyLength + 1) * 100 - 1 AS BodyLengthEnd,
COUNT(*) as MessageCount
FROM x
GROUP BY BodyLength
ORDER BY BodyLength
As a side note - if the statement prior to this doesn't end with a semi-colon (;), this will not work as expected.
Use a sub-select:
SELECT BodyLengthStart,
BodyLengthEnd,
COUNT(*)
FROM (SELECT FLOOR(COALESCE(LEN(Body), 0) / 100) * 100 as BodyLengthStart,
(FLOOR(COALESCE(LEN(Body), 0) / 100) + 1) * 100 - 1 as BodyLengthEnd
FROM [Message]) as a
GROUP BY BodyLengthStart,
BodyLengthEnd
You can put a SELECT in the FROM clause (a derived table); this prepares the data first, so the outer query can refer to the computed columns by name.
You can use a common table expression:
WITH cte AS
(
SELECT FLOOR(COALESCE(LEN(Body), 0) / 100) * 100 as BodyLengthStart,
(FLOOR(COALESCE(LEN(Body), 0) / 100) + 1) * 100 - 1 as BodyLengthEnd
FROM [Message]
)
SELECT BodyLengthStart,BodyLengthEnd,COUNT(*)
FROM cte
GROUP BY BodyLengthStart,BodyLengthEnd
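The answers above share one idea: compute the bucket once, then refer to it by name. A minimal sketch with Python's built-in sqlite3 module follows; SQLite's LENGTH and integer division stand in for SQL Server's LEN and FLOOR, and the table name message, the sample data, and the function name are assumptions for the illustration.

```python
import sqlite3

def bucket_counts(bodies, width=100):
    """Count messages per length bucket; bodies may contain None."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE message (body TEXT)")
    conn.executemany("INSERT INTO message VALUES (?)", [(b,) for b in bodies])
    rows = conn.execute(
        """
        WITH x AS (
            -- the bucket expression is written exactly once
            SELECT COALESCE(LENGTH(body), 0) / :w AS bucket
            FROM message
        )
        SELECT bucket * :w           AS body_length_start,
               (bucket + 1) * :w - 1 AS body_length_end,
               COUNT(*)              AS message_count
        FROM x
        GROUP BY bucket
        ORDER BY bucket
        """,
        {"w": width},
    ).fetchall()
    conn.close()
    return rows

buckets = bucket_counts(["a" * 10, None, "b" * 150, "c" * 199, "d" * 250])
print(buckets)
```

Changing the bucket width or the bucketing rule now means editing a single expression inside the CTE.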

Referencing other columns in a SQL SELECT

I have a SQL query in BigQuery:
SELECT
creator.country,
(SUM(length) / 60) AS total_minutes,
COUNT(DISTINCT creator.id) AS total_users,
(SUM(length) / 60 / COUNT(DISTINCT creator.id)) AS minutes_per_user
FROM
...
You may have noticed that the last column is equivalent to total_minutes / total_users.
I tried this, but it doesn't work:
SELECT
creator.country,
(SUM(length) / 60) AS total_minutes,
COUNT(DISTINCT creator.id) AS total_users,
(total_minutes / total_users) AS minutes_per_user
FROM
...
Is there any way to make this simpler?
Not really. That is, you cannot re-use column aliases in expressions in the same SELECT. If you really want, you can use a subquery or CTE:
SELECT c.*,
total_minutes / total_users
FROM (SELECT creator.country,
(SUM(length) / 60) AS total_minutes,
COUNT(DISTINCT creator.id) AS total_users
FROM ...
) c;
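To see the derived-table approach run end to end, here is a small sketch using Python's built-in sqlite3 module. The table video, its flat columns country, creator_id and length, and the sample rows are hypothetical stand-ins for the BigQuery source that the question elides.

```python
import sqlite3

def minutes_per_user(rows):
    """rows: iterable of (country, creator_id, length_in_seconds) tuples."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE video (country TEXT, creator_id INTEGER, length INTEGER)"
    )
    conn.executemany("INSERT INTO video VALUES (?, ?, ?)", rows)
    result = conn.execute(
        """
        SELECT c.*,
               -- the aliases are visible here because they come from the subquery
               total_minutes / total_users AS minutes_per_user
        FROM (SELECT country,
                     SUM(length) / 60.0         AS total_minutes,
                     COUNT(DISTINCT creator_id) AS total_users
              FROM video
              GROUP BY country) c
        ORDER BY country
        """
    ).fetchall()
    conn.close()
    return result

stats = minutes_per_user([("US", 1, 120), ("US", 2, 240), ("DE", 3, 90)])
print(stats)
```

The inner query defines each metric once, and the outer query combines them without repeating the aggregate expressions.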
Another option is to move all business logic of metrics calculation into UDF (temp or permanent depends on usage needs) ...
create temp function custom_stats(arr any type) as ((
select as struct
sum(length) / 60 as total_minutes,
count(distinct id) as total_users,
sum(length) / 60 / count(distinct id) as minutes_per_user
from unnest(arr)
));
... and thus keep query itself simple and least verbose - as in below example
select creator.country,
custom_stats(array_agg(struct(length, creator.id))).*
from `project.dataset.table`
group by country

This query is too heavy and needs to be refactored. How can I do that?

This query would be too heavy, needs to be refactored. How can I do that?
Please help
SELECT
contract_type, SUM(fte), ROUND(SUM(fte * 100 / t.s ), 0) AS "% of total"
FROM
design_studio_testing.empfinal_tableau
CROSS JOIN
(SELECT SUM(fte) AS s
FROM design_studio_testing.empfinal_tableau) t
GROUP BY
contract_type;
Output should be like this:
Use window functions:
SELECT contract_type,
SUM(fte),
ROUND(SUM(fte) * 100.0 / SUM(SUM(fte)) OVER (), 0) AS "% of total"
FROM design_studio_testing.empfinal_tableau
GROUP BY contract_type;
That said, your original version should not be that much slower than this, unless perhaps empfinal_tableau is a view.
If it is a table, you could further speed this with an index on empfinal_tableau(contract_type, fte).
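The window-function form can be tried locally with Python's built-in sqlite3 module, as sketched below. This assumes a SQLite build with window-function support (3.25 or later); the table name emp and the sample rows are made up, and the column alias is simplified to pct_of_total.

```python
import sqlite3

def percent_of_total(rows):
    """rows: iterable of (contract_type, fte) tuples."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE emp (contract_type TEXT, fte REAL)")
    conn.executemany("INSERT INTO emp VALUES (?, ?)", rows)
    result = conn.execute(
        """
        SELECT contract_type,
               SUM(fte) AS sum_fte,
               -- SUM(SUM(fte)) OVER () is the grand total across all groups
               ROUND(SUM(fte) * 100.0 / SUM(SUM(fte)) OVER (), 0) AS pct_of_total
        FROM emp
        GROUP BY contract_type
        ORDER BY contract_type
        """
    ).fetchall()
    conn.close()
    return result

totals = percent_of_total([
    ("permanent", 1.0), ("permanent", 1.0),
    ("contract", 0.5), ("contract", 0.5),
    ("intern", 1.0),
])
print(totals)
```

The table is read once, and the grand total is derived from the per-group sums rather than from a second scan.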
There is no need to sum over the expression:
fte * 100 / t.s
which may slow the process.
Get SUM(fte) and then multiply and divide:
SELECT g.contract_type, g.sum_fte,
ROUND(100.0 * g.sum_fte / t.s, 0) AS [% of total]
FROM (
SELECT
contract_type,
SUM(fte) AS sum_fte
FROM design_studio_testing.empfinal_tableau
GROUP BY contract_type
) AS g CROSS JOIN (SELECT SUM(fte) AS s FROM design_studio_testing.empfinal_tableau) t
Edit for Oracle:
SELECT g.contract_type, g.sum_fte,
ROUND(100.0 * g.sum_fte / t.s, 0) AS "% of total"
FROM (
SELECT
contract_type,
SUM(fte) AS sum_fte
FROM empfinal_tableau
GROUP BY contract_type
) g CROSS JOIN (SELECT SUM(fte) AS s FROM empfinal_tableau) t

Cohort/ Retention query in BigQuery using Google Analytics exported data

I need help formulating a cohort/retention query
I am trying to build a query to look at visitors who performed ActionX on their first visit (in the time frame) and then how many days later they returned to perform Action X again
The output I (eventually) need looks like this...
The table I am dealing with is an export of Google Analytics to BigQuery
Could anyone help me with this, or has anyone written a similar query that I could adapt?
Thanks
Just to give you simple idea / direction
Below is for BigQuery Standard SQL
#standardSQL
SELECT
Date_of_action_first_taken,
ROUND(100 * later_1_day / Visits) AS later_1_day,
ROUND(100 * later_2_days / Visits) AS later_2_days,
ROUND(100 * later_3_days / Visits) AS later_3_days
FROM `OutputFromQuery`
You can test it with below dummy data from your question
#standardSQL
WITH `OutputFromQuery` AS (
SELECT '01.07.17' AS Date_of_action_first_taken, 1000 AS Visits, 800 AS later_1_day, 400 AS later_2_days, 300 AS later_3_days UNION ALL
SELECT '02.07.17', 1000, 860, 780, 860 UNION ALL
SELECT '29.07.17', 1000, 780, 120, 0 UNION ALL
SELECT '30.07.17', 1000, 710, 0, 0
)
SELECT
Date_of_action_first_taken,
ROUND(100 * later_1_day / Visits) AS later_1_day,
ROUND(100 * later_2_days / Visits) AS later_2_days,
ROUND(100 * later_3_days / Visits) AS later_3_days
FROM `OutputFromQuery`
The OutputFromQuery data is as below:
Date_of_action_first_taken Visits later_1_day later_2_days later_3_days
01.07.17 1000 800 400 300
02.07.17 1000 860 780 860
29.07.17 1000 780 120 0
30.07.17 1000 710 0 0
and the final output is:
Date_of_action_first_taken later_1_day later_2_days later_3_days
01.07.17 80.0 40.0 30.0
02.07.17 86.0 78.0 86.0
29.07.17 78.0 12.0 0.0
30.07.17 71.0 0.0 0.0
I found this query on Turn Your App Data into Answers with Firebase and BigQuery (Google I/O'19)
It should work :)
#standardSQL
###################################################
# Part 1: Cohort of New Users Starting on DEC 24
###################################################
WITH
new_user_cohort AS (
SELECT DISTINCT
user_pseudo_id as new_user_id
FROM
`[your_project].[your_firebase_table].events_*`
WHERE
event_name = '[chosen_event]' AND
#set the date from when starting cohort analysis
FORMAT_TIMESTAMP("%Y%m%d", TIMESTAMP_TRUNC(TIMESTAMP_MICROS(event_timestamp), DAY, "Etc/GMT+1")) = '20191224' AND
_TABLE_SUFFIX BETWEEN '20191224' AND '20191230'
),
num_new_users AS (
SELECT count(*) as num_users_in_cohort FROM new_user_cohort
),
#############################################
# Part 2: Engaged users from Dec 24 cohort
#############################################
engaged_users_by_day AS (
SELECT
FORMAT_TIMESTAMP("%Y%m%d", TIMESTAMP_TRUNC(TIMESTAMP_MICROS(event_timestamp), DAY, "Etc/GMT+1")) as event_day,
COUNT(DISTINCT user_pseudo_id) as num_engaged_users
FROM
`[your_project].[your_firebase_table].events_*`
INNER JOIN
new_user_cohort ON new_user_id = user_pseudo_id
WHERE
event_name = 'user_engagement' AND
_TABLE_SUFFIX BETWEEN '20191224' AND '20191230'
GROUP BY
event_day
)
####################################################################
# Part 3: Daily Retention = [Engaged Users / Total Users]
####################################################################
SELECT
event_day,
num_engaged_users,
num_users_in_cohort,
ROUND((num_engaged_users / num_users_in_cohort), 3) as retention_rate
FROM
engaged_users_by_day
CROSS JOIN
num_new_users
ORDER BY
event_day
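The three-part structure (cohort, engaged users per day, ratio) can be exercised on toy data. The sketch below uses Python's built-in sqlite3 module with a simplified, hypothetical events table holding one row per user and day; it is not the Firebase export schema, and the function and column names are invented for the example.

```python
import sqlite3

def daily_retention(events, cohort_day):
    """events: iterable of (user_id, 'YYYY-MM-DD') tuples."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (user_id TEXT, event_day TEXT)")
    conn.executemany("INSERT INTO events VALUES (?, ?)", events)
    rows = conn.execute(
        """
        WITH new_user_cohort AS (          -- part 1: users seen on the cohort day
            SELECT DISTINCT user_id FROM events WHERE event_day = :day
        ),
        num_new_users AS (
            SELECT COUNT(*) AS cohort_size FROM new_user_cohort
        ),
        engaged_users_by_day AS (          -- part 2: cohort members active per day
            SELECT e.event_day, COUNT(DISTINCT e.user_id) AS num_engaged
            FROM events e
            JOIN new_user_cohort c ON c.user_id = e.user_id
            WHERE e.event_day >= :day
            GROUP BY e.event_day
        )
        SELECT event_day, num_engaged, cohort_size,   -- part 3: the ratio
               ROUND(num_engaged * 1.0 / cohort_size, 3) AS retention_rate
        FROM engaged_users_by_day
        CROSS JOIN num_new_users
        ORDER BY event_day
        """,
        {"day": cohort_day},
    ).fetchall()
    conn.close()
    return rows

sample_events = [
    ("u1", "2019-12-24"), ("u2", "2019-12-24"),
    ("u3", "2019-12-24"), ("u4", "2019-12-24"),
    ("u1", "2019-12-25"), ("u2", "2019-12-25"), ("u5", "2019-12-25"),
    ("u1", "2019-12-26"),
]
retention = daily_retention(sample_events, "2019-12-24")
print(retention)
```

User u5 first appears on the 25th and is therefore not part of the cohort, so it does not count towards that day's retention.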
So I think I may have cracked it... from this output I then would need to manipulate it (pivot table it) to make it look like the desired output.
Can anyone review this for me and let me know what you think?
WITH
cohort_items AS (
SELECT
MIN( TIMESTAMP_TRUNC(TIMESTAMP_MICROS((visitStartTime*1000000 +
h.time*1000)), DAY) ) AS cohort_day, fullVisitorID
FROM
TABLE123 AS U,
UNNEST(hits) AS h
WHERE _TABLE_SUFFIX BETWEEN "20170701" AND "20170731"
AND 'ACTION TAKEN'
GROUP BY 2
),
user_activites AS (
SELECT
A.fullVisitorID,
DATE_DIFF(DATE(TIMESTAMP_TRUNC(TIMESTAMP_MICROS((visitStartTime*1000000 + h.time*1000)), DAY)), DATE(C.cohort_day), DAY) AS day_number
FROM `Table123` A
LEFT JOIN cohort_items C ON A.fullVisitorID = C.fullVisitorID,
UNNEST(hits) AS h
WHERE
A._TABLE_SUFFIX BETWEEN "20170701" AND "20170731"
AND 'ACTION TAKEN'
GROUP BY 1,2),
cohort_size AS (
SELECT
cohort_day,
count(1) as number_of_users
FROM
cohort_items
GROUP BY 1
ORDER BY 1
),
retention_table AS (
SELECT
C.cohort_day,
A.day_number,
COUNT(1) AS number_of_users
FROM
user_activites A
LEFT JOIN cohort_items C ON A.fullVisitorID = C.fullVisitorID
GROUP BY 1,2
)
SELECT
B.cohort_day,
S.number_of_users as total_users,
B.day_number,
B.number_of_users / S.number_of_users as percentage
FROM retention_table B
LEFT JOIN cohort_size S ON B.cohort_day = S.cohort_day
WHERE B.cohort_day IS NOT NULL
ORDER BY 1, 3
Thank you in advance!
If you use some of the techniques available in BigQuery, you can potentially solve this type of problem in a very cost-effective and performant way. As an example:
SELECT
init_date,
ARRAY((SELECT AS STRUCT days, freq, ROUND(freq * 100 / MAX(freq) OVER(), 2) FROM UNNEST(data) ORDER BY days)) data
FROM(
SELECT
init_date,
ARRAY_AGG(STRUCT(days, freq)) data
FROM(
SELECT
init_date,
data AS days,
COUNT(data) freq
FROM(
SELECT
init_date,
ARRAY(SELECT DATE_DIFF(PARSE_DATE("%Y%m%d", dts), PARSE_DATE("%Y%m%d", init_date), DAY) AS dt FROM UNNEST(dts) dts) data
FROM(
SELECT
MIN(date) init_date,
ARRAY_AGG(DISTINCT date) dts
FROM `Table123`
WHERE TRUE
AND EXISTS(SELECT 1 FROM UNNEST(hits) where eventinfo.eventCategory = 'recommendation') -- This is your 'ACTION TAKEN' filter
AND _TABLE_SUFFIX BETWEEN "20170724" AND "20170731"
GROUP BY fullvisitorid
)
),
UNNEST(data) data
GROUP BY init_date, days
)
GROUP BY init_date
)
I tested this query against our GA data and selected customers who interacted with our recommendation system (as you can see in the WHERE EXISTS... filter). Example of the result (absolute values of freq omitted for privacy reasons):
As you can see, on the 28th, for instance, 8% of customers came back 1 day later and interacted with the system again.
I recommend playing around with this query to see if it works well for you. It's simpler, cheaper, faster and hopefully easier to maintain.