How to calculate prevalence using SQL

I am trying to calculate prevalence in SQL.
I'm kind of stuck writing the code.
I want to make the code automated.
Using COUNT, I have checked that my sample size is 1453477 and that the number of people who have the disease is 851451.
The formula for calculating prevalence is: number of people who have the disease / sample size.
select (COUNT(condition_id)/COUNT(person_id)) as prevalence
from disease
where condition_id=12345;
When I run the above code, I get 1 as output, where I am supposed to get 0.5858.
Can someone please help me out?
Thanks!

In your current query you count the number of rows in the disease table, once using the column condition_id and once using the column person_id. But the number of rows is the same either way, which is why you get 1 as a result.
I think you need to find the number of different values for these columns instead. This can be done using COUNT(DISTINCT ...):
select (COUNT(DISTINCT condition_id)/COUNT(DISTINCT person_id)) as prevalence
from disease
where condition_id=12345;

You can cast with either
count(...)/count(...)::numeric or
count(...)/count(...)::decimal
as two options. (A bare numeric is safer than numeric(6,4) here: numeric(6,4) tops out at 99.9999, which a raw count of 1453477 would overflow.)
The important point is to apply the cast to the numerator or denominator (in this case the denominator). Do not apply it to the division itself, as in
(count(...)/count(...))::numeric(6,4)
because there the integer division has already truncated the result before the cast is applied.
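For example, with the counts from the question (Postgres-style casts; :: binds tighter than /, so the cast is applied before the division):
-- Integer division truncates: returns 0
select 851451 / 1453477;
-- Cast one operand first and the division is done in numeric math: returns 0.5858...
select 851451 / 1453477::numeric;
-- Too late: the integer division already produced 0 before the cast
select (851451 / 1453477)::numeric(6,4);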

I am pretty sure that the logic that you want is something like this:
select avg( (condition_id = 12345)::int )
from disease;
Your version doesn't have the sample size, because you are filtering out people without the condition.
If you have duplicate people in the data, then this is a little more complicated. One method is:
select (count(distinct person_id) filter (where condition_id = 12345))::numeric /
       count(distinct person_id) as prevalence
from disease;
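If your database doesn't support casting a boolean with ::int (that is Postgres syntax), a portable sketch of the same idea uses a CASE expression:
-- avg of a 1/0 flag over all rows = share of rows where the flag is 1
select avg(case when condition_id = 12345 then 1.0 else 0 end) as prevalence
from disease;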

Related

Why is my SQL aliasing not being recognized?

This may be an incredibly simple question, but I'm not seeing what the problem is here. I'm trying to teach myself SQL and was working on an experiment to play with subqueries and aliasing. When I try to enter the following query (into BigQuery), I get the error message "Unrecognized name: cast1 at [3:1]", which persists even if I copy the COUNT lines into the outer query. Obviously there is something I'm not understanding about aliasing here, but I'm not sure where I'm going wrong. I would appreciate any help from more experienced SQL users out there on how to improve. Thank you in advance!
SELECT
cast__1_,
cast1 + cast2 + cast3 + cast4 AS num_films,
(
SELECT
cast__1_,
COUNT (cast__1_) AS cast1,
COUNT (cast__2_) AS cast2,
COUNT (cast__3_) AS cast3,
COUNT (cast__4_) AS cast4,
FROM `dataproject1-351413.movie_data.movies` AS movies
WHERE
cast__1_ IS NOT Null
GROUP BY
cast__1_
)
FROM `dataproject1-351413.movie_data.movies`
GROUP BY
cast__1_
(The intended result was two columns, pairing each actor with the number of films across the dataset, in case that is not clear from the query)
The structure of your query is a bit skewed: an alias defined in a SELECT list (such as cast1) cannot be referenced by other expressions in that same list, which is exactly what "Unrecognized name: cast1" is telling you. You need to treat the aggregate query as a derived table, such as:
SELECT cast__1_, cast1 + cast2 + cast3 + cast4 AS num_films
FROM (
    SELECT
        cast__1_,
        COUNT(cast__1_) AS cast1,
        COUNT(cast__2_) AS cast2,
        COUNT(cast__3_) AS cast3,
        COUNT(cast__4_) AS cast4
    FROM `dataproject1-351413.movie_data.movies`
    WHERE cast__1_ IS NOT NULL
    GROUP BY cast__1_
) t;
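A minimal toy query makes the scoping rule visible:
-- Fails with the same "Unrecognized name" error: b is not visible
-- to other expressions in its own SELECT list
-- SELECT 1 AS b, b + 1 AS c;
-- Works: b is a real column of the derived table
SELECT b + 1 AS c
FROM (SELECT 1 AS b) t;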

COUNT Function: Result of a query

I'm having trouble with a part of the following question. Thank you in advance for your help. I have a hard time visualizing this "fake" database table. I was hoping someone could help me run through my logic and see if it's correct. If someone could just point me in the right direction that would be great!
About:
Sesame is a way to find online classes & activities for adults around you.
Imagine a database table named activities. It has four columns:
activity_id [int, non null]
activity_provider_id [int, non null]
area_id [int, nullable]
starts_at [timestamp, non null]
Question: Given the following query, which counts would you expect to return the highest and lowest values? Which counts would you expect to be the same? Why?
select
count(activity_id),
count(distinct activity_provider_id),
count(area_id),
count(distinct area_id),
count(*)
from activities
My Solution
Highest value: count(*)
Reasoning: the count(*) function returns the number of rows returned by a SELECT statement, including NULLs and duplicates.
Lowest value: count(distinct activity_provider_id)
Reasoning: there are presumably fewer distinct activity providers than activities or areas.
Same: Unsure - could someone just point me in the right direction?
count(*) takes into account all rows in the table, while count(some_col) only counts non-null values of some_col.
Since activity_id is a non-nullable column, one would expect the following expressions to return the same, "highest" count:
count(activity_id)
count(*)
As for which expression returns the lowest count out of the three remaining choices, it is not really possible to tell from the information provided in the question: it depends on whether there are more, or fewer, distinct areas than activity providers.
There is even an edge case where all five expressions return the same count: when every activity provider (resp. area) is non-null and unique in the table.
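A hypothetical three-row sample (Postgres-style VALUES syntax) makes the behaviour concrete:
with activities as (
    select * from (values
        (1, 10, 100,  timestamp '2020-01-01 09:00:00'),
        (2, 10, 100,  timestamp '2020-01-01 10:00:00'),
        (3, 20, null, timestamp '2020-01-02 09:00:00')
    ) as t(activity_id, activity_provider_id, area_id, starts_at)
)
select
    count(activity_id),                    -- 3: non-null column, matches count(*)
    count(distinct activity_provider_id),  -- 2
    count(area_id),                        -- 2: the NULL row is skipped
    count(distinct area_id),               -- 1
    count(*)                               -- 3: every row
from activities;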

Is there a way to find percentage of non-zero vs zero values in one column?

I'm supposed to find the percentage of people having received aid.
I'm assuming the best way to do this is to find the number of rows that received 0 aid and the number of rows with a value greater than 0, create two variables for those, and divide accordingly to find the percentage. It's been a while since I've worked with SQL, so this is challenging me.
select
rprawrd_aidy_code as year,
sum(rprawrd_accept_amt)
from
rprawrd
where
rprawrd_aidy_code = '1819'
group by
rprawrd_aidy_code
This only gives me the total amount of aid provided for the year in question. I need to figure out the number of rows that received aid vs. the number that didn't.
If the only output you need from your script is that ratio, there are a few ways to go about this one:
WITH cte (awrd) AS (
    SELECT
        CASE WHEN rprawrd_accept_amt > 0 THEN 1.0
             ELSE 0.0
        END AS awrd
    FROM rprawrd
    WHERE rprawrd_aidy_code = '1819'
)
SELECT SUM(awrd) / COUNT(awrd)
FROM cte
This will get you the percentage of people who received an award, but if you need to know the amounts as well you'll have to approach it differently.
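If the ratio really is all you need, an equivalent without the CTE (same assumed table and column names) lets AVG do the division, since the average of a 1/0 flag is exactly the share of rows where the flag is 1:
SELECT rprawrd_aidy_code AS year,
       AVG(CASE WHEN rprawrd_accept_amt > 0 THEN 1.0 ELSE 0.0 END) AS pct_received
FROM rprawrd
WHERE rprawrd_aidy_code = '1819'
GROUP BY rprawrd_aidy_code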

Writing a query to include certain values but exclude others when looking for a latest time period

I am trying to write a query that looks for people who have a certain code in the latest period (year), but not if they also have another code in that latest period. I'll be explicit just so my example makes sense.
I want people who have the code A1, A2, A3, A4 or A5, but not AG, AP or AQ. There are people who have an A1 code for a period (like 2014) and an AG code for the same period; I'd like to exclude them. Not everyone has a code, so the field value could be NULL.
Is there a way to express this in a different way (i.e. with fewer characters) than the way I did?
SELECT
people.firstName
FROM
people
WHERE EXISTS (
SELECT *
FROM codes
WHERE
codes.people_id = people.id
AND period = (SELECT MAX(period) FROM codes codes2 WHERE codes2.people_id = codes.people_id)
AND code LIKE 'A[1-5]'
)
AND NOT EXISTS (
SELECT *
FROM codes
WHERE
codes.people_id = people.id
AND period = (
SELECT MAX(period)
FROM codes codes2
WHERE codes2.people_id = codes.people_id
)
AND code LIKE 'A[GPQ]'
)
Schema is as follows:
People
id (PK)
firstName
Codes
people_id (FK) many to one relation with People table
code (e.g. "A1", "A2", "AG")
period (e.g. "2013", "2014")
There are so many ways you could do that. I'm not an SQL expert, but I can't see your query being too bad. If you want to try to reduce the number of sub-queries, you could consider using the GROUP BY clause along with a SUM aggregate function in a HAVING clause.
I started updating your code as follows:
SELECT
people.firstName
FROM
people
LEFT JOIN codes AS a15 ON a15.people_id = people.id AND a15.code LIKE 'A[1-5]'
LEFT JOIN codes AS agpq ON agpq.people_id = people.id AND agpq.code LIKE 'A[GPQ]'
GROUP BY
people.firstName
HAVING
SUM(CASE WHEN a15.code IS NULL THEN 0 ELSE 1 END) > 0
AND SUM(CASE WHEN agpq.code IS NULL THEN 0 ELSE 1 END) = 0
This however doesn't take into account the period-specific requirements described. You could add the period to the GROUP BY clause, to a WHERE, or to one of the JOIN constraints, but I'm not quite sure from your description exactly what you're after (I don't believe this is through any fault of your own; I just can't personally align the code provided to the description).
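For instance, one untested sketch (reusing the correlated MAX(period) subquery from your own query) pins each join to the person's latest period:
SELECT
    people.firstName
FROM
    people
    LEFT JOIN codes AS a15 ON a15.people_id = people.id
        AND a15.code LIKE 'A[1-5]'
        AND a15.period = (SELECT MAX(period) FROM codes c WHERE c.people_id = people.id)
    LEFT JOIN codes AS agpq ON agpq.people_id = people.id
        AND agpq.code LIKE 'A[GPQ]'
        AND agpq.period = (SELECT MAX(period) FROM codes c WHERE c.people_id = people.id)
GROUP BY
    people.firstName
HAVING
    SUM(CASE WHEN a15.code IS NULL THEN 0 ELSE 1 END) > 0
    AND SUM(CASE WHEN agpq.code IS NULL THEN 0 ELSE 1 END) = 0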
I would also like to point out that the SUM functions above will not give an accurate count of the number of matching codes: if both A[GPQ] and A[1-5] match at least one row, the two LEFT JOINs produce a cross product per person, so the number counted for each constraint is multiplied by the number returned for the other. They can, however, still be used to determine whether there are "any" matching items, since any match at all guarantees SUM(...) > 0.
I'm sure a more experienced SQL Developer / DBA will be able to poke many holes in my proposed query but it might give them or someone else something to work from and hopefully gives you ideas for alternatives to using sub-queries.

BigQuery: GROUP BY clause for QUANTILES

Based on the BigQuery query reference, QUANTILES currently does not allow any kind of grouping by another column. I am mainly interested in getting medians grouped by a certain column. The only workaround I see right now is to generate a quantile query per distinct group member, where the group member is a condition in the WHERE clause.
For example, I run the query below once for every distinct value in column-y to get the desired result.
SELECT QUANTILE( <column-x>, 1001)
FROM <table>
WHERE
<column-y> = <each distinct value in column-y>
Does the big query team plan on having some functionality to allow grouping on quantiles in the future?
Is there a better way to get what I am trying to get here?
Thanks
With the recently announced percentile_cont() window function you can get medians.
Look at the example in the announcement blog post:
http://googlecloudplatform.blogspot.com/2013/06/google-bigquery-bigger-faster-smarter-analytics-functions.html
SELECT MAX(median) AS median, room FROM (
SELECT percentile_cont(0.5) OVER (PARTITION BY room ORDER BY data) AS median, room
FROM [io_sensor_data.moscone_io13]
WHERE sensortype='temperature'
)
GROUP BY room
While there are efficient algorithms to compute quantiles they are somewhat memory intensive - trying to do multiple quantile calculations in a single query gets expensive.
There are plans to improve QUANTILES, but I don't know what the timeline is.
Do you need median? Can you filter outliers and do an average of the remainder?
If your per-group size is fixed, you may be able to hack it using a combination of ORDER BY, NEST and NTH. For instance, if there are 9 distinct values of f2 per value of f1, for the median:
select f1,nth(5,f2) within record from (
select f1,nest(f2) f2 from (
select f1, f2 from table
group by f1,f2
order by f2
) group by f1
);
Not sure if the sorted order in the subquery is guaranteed to survive the second GROUP BY, but it worked in a simple test I tried.
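Update: BigQuery standard SQL now supports grouped quantiles directly via APPROX_QUANTILES. A sketch of an approximate per-group median against the same sensor table:
SELECT
  room,
  APPROX_QUANTILES(data, 100)[OFFSET(50)] AS approx_median
FROM `io_sensor_data.moscone_io13`
WHERE sensortype = 'temperature'
GROUP BY room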