The question is: For the previous Track, compute the weighted average (instead of the simple average) yield by nation (yield is defined as dividend divided by price, and is reported as a percentage). The weighted average uses the quantity of shares as the weight.
But I have no clue how to find a weighted average.
I already have some code that computes the average of the yield in percent, but I'm having trouble computing the weighted average. I talked to some classmates and they said they're trying to use the formula sum(x*w)/sum(w), where w is the weight.
But I'm having trouble implementing this in my code:
SELECT Nations.nationName,
AVG(dividend/price) AS Yield
FROM Shares, Nations
WHERE Shares.nationID=Nations.ID
GROUP BY Nations.nationName;
In the end the weighted average should be computed, but I just don't know how to implement that in the code. The material we're supposed to base this problem on is no help.
If I followed you correctly, you should consider:
SELECT
n.nationName,
SUM(s.quantity * s.dividend / s.price)/IIF(SUM(s.quantity) = 0, NULL, SUM(s.quantity)) AS weighted_yield
FROM Shares s
INNER JOIN Nations n ON s.nationID = n.ID
GROUP BY n.nationName;
Notes:
Always prefer explicit JOINs over old-school implicit JOINs.
Also, using table aliases is a best practice, since they make the query shorter and easier to follow.
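To see what the sum(x*w)/sum(w) formula from the question does outside of SQL, here is a minimal Python sketch. The rows and the function name `weighted_yield_by_nation` are made up for illustration and are not taken from the original Shares table.

```python
# Hypothetical rows: (nation, dividend, price, quantity). Not the original data.
shares = [
    ("A", 2.0, 40.0, 100),
    ("A", 1.0, 10.0, 300),
    ("B", 3.0, 60.0, 50),
]

def weighted_yield_by_nation(rows):
    """Return {nation: sum(yield * quantity) / sum(quantity)}.

    A nation whose total quantity is 0 maps to None, mirroring the
    IIF(SUM(quantity) = 0, NULL, ...) guard in the SQL answer.
    """
    totals = {}
    for nation, dividend, price, quantity in rows:
        weighted_sum, weight = totals.get(nation, (0.0, 0))
        totals[nation] = (weighted_sum + quantity * dividend / price,
                          weight + quantity)
    return {n: (ws / w if w else None) for n, (ws, w) in totals.items()}

result = weighted_yield_by_nation(shares)
print(result)
```

For nation A the two yields are 0.05 and 0.10; the large holding of the second share pulls the weighted figure above the simple average of 0.075. Multiply by 100 if the result should be shown as a percentage.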
Related
The question asked is: Report the nations where the average yield of the shares exceeds the average yield of all shares
I have a solution but I am unsure if it answers the question. I only received one row of output and I'm not sure if I need multiple.
SELECT Nations.nationName,
AVG(dividend/price*100) AS Yield
FROM Shares, Nations
WHERE Shares.nationID=Nations.ID
GROUP BY Nations.nationName
HAVING AVG(dividend/price*100)>
(SELECT AVG(dividend/price*100) FROM Shares);
My output only shows one result; I'm just wondering if this should be the outcome.
I'll provide the previous table and my output
output of the previous table that is required to get this
When Percentage formatting is applied to the column, the result will be automatically multiplied by 100 when displayed, i.e. a value of 0.25 will be displayed as 25%; as such, either remove the formatting or remove the 100 multiplier from your code.
I would also suggest using an inner join over a Cartesian Product, i.e.:
select n.nationname, avg(s.dividend/s.price) as yield
from shares s inner join nations n on s.nationid = n.id
group by n.nationname
As for the correctness of your query, you can easily check this by calculating the average yield of all shares either manually or using a separate query, and then identifying which nations should be output by your query.
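The check described above can be sketched with Python's built-in `sqlite3` module. The table contents below are invented for illustration, and SQLite stands in for the database used in the question, so treat this as a way to reason about the query rather than a reproduction of the original output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE nations (id INTEGER PRIMARY KEY, nationname TEXT);
    CREATE TABLE shares (nationid INTEGER, dividend REAL, price REAL);
    -- Hypothetical data, not the original tables.
    INSERT INTO nations VALUES (1, 'A'), (2, 'B'), (3, 'C');
    INSERT INTO shares VALUES (1, 1.0, 10.0), (1, 3.0, 10.0),
                              (2, 1.0, 20.0), (3, 1.0, 10.0);
""")

# Step 1: the average yield over all shares, from a separate query.
overall = conn.execute("SELECT AVG(dividend / price) FROM shares").fetchone()[0]

# Step 2: the per-nation averages, compared by hand against the overall value.
per_nation = dict(conn.execute("""
    SELECT n.nationname, AVG(s.dividend / s.price)
    FROM shares s INNER JOIN nations n ON s.nationid = n.id
    GROUP BY n.nationname
"""))
expected = sorted(name for name, y in per_nation.items() if y > overall)

# Step 3: the HAVING query should report exactly the same nations.
reported = [row[0] for row in conn.execute("""
    SELECT n.nationname
    FROM shares s INNER JOIN nations n ON s.nationid = n.id
    GROUP BY n.nationname
    HAVING AVG(s.dividend / s.price) > (SELECT AVG(dividend / price) FROM shares)
    ORDER BY n.nationname
""")]
print(expected, reported)
```

With this sample data only one nation is above the overall average, so a single output row is not in itself a sign that the query is wrong.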
I am trying to calculate prevalence in SQL.
I am kind of stuck writing the code.
I want the calculation to be automated.
Using count, I have checked that the sample size is 1453477 and that the number of people who have the disease is 851451.
The formula for prevalence is: number of people who have the disease / sample size.
select (COUNT(condition_id)/COUNT(person_id)) as prevalence
from disease
where condition_id=12345;
When I run the above code, I get 1 as output, where I am supposed to get 0.5858.
Can someone please help me out?
Thanks!
In your current query you count the number of rows in the disease table, once using the column condition_id, once using the column person_id. But the number of rows is the same - this is why you get 1 as a result.
I think you need to find the number of different values for these columns. This can be done using count distinct:
select (COUNT(DISTINCT condition_id)/COUNT(DISTINCT person_id)) as prevalence
from disease
where condition_id=12345;
You can cast with either
count(...)/count(...)::numeric(6,4) or
count(...)/count(...)::decimal
as two options.
The important point is to apply the cast to the numerator or the denominator (in this case the denominator). Do not apply it to the result of the division, as in
(count(...)/count(...))::numeric(6,4), which still performs integer division first and so loses the fractional part.
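The effect of where the cast is placed can be reproduced with Python's built-in `sqlite3` module. This is only an illustration: SQLite uses `CAST(... AS REAL)` rather than the `::numeric` syntax above, but it also truncates integer division. The two counts are the ones quoted in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
row = conn.execute("""
    SELECT 851451 / 1453477,                 -- integer / integer: fraction is lost
           851451 / CAST(1453477 AS REAL),   -- cast the denominator before dividing
           CAST(851451 / 1453477 AS REAL)    -- cast after dividing: too late
""").fetchone()
int_div, cast_first, cast_last = row
print(int_div, round(cast_first, 4), cast_last)
```

Only the middle expression keeps the fraction, because the conversion happens before the division rather than after it.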
I am pretty sure that the logic that you want is something like this:
select avg( (condition_id = 12345)::int )
from disease;
Your version doesn't have the sample size, because you are filtering out people without the condition.
If you have duplicate people in the data, then this is a little more complicated. One method is:
select count(distinct person_id) filter (where condition_id = 12345)::numeric /
       count(distinct person_id)
from disease;
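To make the difference between the filtered and unfiltered versions concrete, here is a small sketch using Python's built-in `sqlite3` module. The rows are invented, and a `CASE` expression is used in place of `FILTER` so that it also runs on older SQLite builds; it is an illustration of the idea, not the exact Postgres query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE disease (person_id INTEGER, condition_id INTEGER);
    -- Hypothetical rows: person 1 is recorded twice with condition 12345,
    -- persons 3 and 4 only have some other condition.
    INSERT INTO disease VALUES (1, 12345), (1, 12345), (2, 12345),
                               (3, 999), (4, 999);
""")

# The WHERE clause removes everyone without the condition, so both counts
# see the same rows and the ratio is always 1.
filtered = conn.execute("""
    SELECT COUNT(condition_id) * 1.0 / COUNT(person_id)
    FROM disease
    WHERE condition_id = 12345
""").fetchone()[0]

# Counting distinct people with the condition over all distinct people.
prevalence = conn.execute("""
    SELECT COUNT(DISTINCT CASE WHEN condition_id = 12345 THEN person_id END) * 1.0
           / COUNT(DISTINCT person_id)
    FROM disease
""").fetchone()[0]
print(filtered, prevalence)
```

Two of the four distinct people have the condition, so the second query gives one half, while the filtered query cannot give anything other than 1.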
I have a column of numbers in my database. How can I compute the standard deviation? I do not want to use the stddev function.
Just because I was curious, I decided to test against the actual STDEV(). I could not quite match the built-in function.
I was close... 0.000141009220002264 or 0.00748% off
Also, the total average and count have to be converted to float (the discrepancy was greater with decimal).
The example below is going after my Treasury Rates Table for the 10 Year Yield (not that it matters)
Select SQLFunction = Stdev([TR_Y10])
,ManualCalc = Sqrt(Sum(Power(((cast([TR_Y10] as float)-B.TotalAvg)),2) / B.TotalCnt))
,Variance = Stdev([TR_Y10]) - Sqrt(Sum(Power(((cast([TR_Y10] as float)-B.TotalAvg)),2) / B.TotalCnt))
From [Chinrus-Shared].[dbo].[DS_Treasury_Rates]
Join (Select TotalAvg=Avg(cast([TR_Y10] as float)),TotalCnt=count(*) From [Chinrus-Shared].[dbo].[DS_Treasury_Rates]) B on 1=1
Returns
SQLFunction ManualCalc Variance
1.88409468982299 1.88395368060299 0.000141009220002264
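One likely explanation for the small gap is the divisor: SQL Server's STDEV is the sample standard deviation (it divides by N-1), whereas the manual calculation divides by TotalCnt (N), which is what STDEVP does. The Python sketch below uses made-up values, not the treasury data, to show that the two divisors reproduce the two library functions and differ by a factor of sqrt((N-1)/N).

```python
import math
import statistics

values = [2.1, 2.4, 1.9, 3.0, 2.6]  # hypothetical values, not the TR_Y10 column
n = len(values)
mean = sum(values) / n
sum_sq = sum((v - mean) ** 2 for v in values)  # sum of squared deviations

population_sd = math.sqrt(sum_sq / n)       # divisor N, like the manual query
sample_sd = math.sqrt(sum_sq / (n - 1))     # divisor N-1, like STDEV

# The standard library agrees with each hand-rolled version.
print(population_sd, statistics.pstdev(values))
print(sample_sd, statistics.stdev(values))
print(population_sd / sample_sd, math.sqrt((n - 1) / n))
```

Under that assumption, dividing by (B.TotalCnt - 1) in the manual query should bring it in line with Stdev(), and comparing against Stdevp() should match the query as written.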
The standard deviation is the square root of the variance.
The (population) variance is the sum of the squared differences between each observed value and the average, divided by n.
So, in most databases, you can use window functions:
select sqrt(avg(var))
from (select square(t.x - avg(t.x) over ()) as var
from t
) t;
Notes:
The square() function might have some other name (such as power()).
The sqrt() function might have some other name.
This is not a good way to calculate the standard deviation in general. In particular, this is a numerically unstable algorithm (it will work just fine for finite numbers of normal numbers).
The subquery is needed because window functions cannot be the arguments to aggregation functions.
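The same two steps (squared deviation from the overall average, then the square root of the mean of those squares) can be sketched in Python. The function name `population_std` and the input values are chosen for illustration only.

```python
import math

def population_std(xs):
    """Population standard deviation, mirroring sqrt(avg(var)) in the query."""
    mean = sum(xs) / len(xs)                   # avg(t.x) over ()
    squared = [(x - mean) ** 2 for x in xs]    # the inner subquery's var column
    return math.sqrt(sum(squared) / len(xs))   # sqrt(avg(var))

sd = population_std([2, 4, 4, 4, 5, 5, 7, 9])
print(sd)  # 2.0
```

Because the final step averages over all n values, this corresponds to the population standard deviation rather than the sample one.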
I am trying to determine Medicare costs per capita in each state using Google BigQuery.
I already have population numbers for each state (represented as Total) as well as the total Medicare cost (Cost) in each state. I am trying to divide the total cost by the population of each state.
At the moment the query runs; however, every entry is null. I am admittedly a beginner with both BigQuery and SQL.
Here is my code:
SELECT State, Cost / Total AS PerCapita
FROM medicare.population, medicare.CostByState
GROUP BY State, PerCapita;
One thing that may be causing issues is that the 'State' column exists in both 'population' and 'CostByState' tables. Not sure how to address this.
Here are my tables:
population
CostByState
You seem to have data with one row per state, so you only need a JOIN.
SELECT p.State, cbs.Cost / p.Total AS PerCapita
FROM medicare.population p JOIN
medicare.CostByState cbs
ON p.state = cbs.state;
You would only need aggregation if the tables had multiple rows per state.
Indeed, you need to join the tables.
If the relationship is one-to-one, you're good. But if not, you may need some type of aggregation:
sum(cost)/sum(total) as per_capita
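As an illustration of the join and of one possible way to handle several cost rows per state, here is a sketch using Python's built-in `sqlite3` module. The data is invented, SQLite stands in for BigQuery, and aggregating the cost table before joining is an assumption about what the data might need, not something stated in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE population (State TEXT, Total REAL);
    CREATE TABLE CostByState (State TEXT, Cost REAL);
    -- Hypothetical data: 'AA' has two cost rows, so the tables are not one-to-one.
    INSERT INTO population VALUES ('AA', 1000), ('BB', 500);
    INSERT INTO CostByState VALUES ('AA', 3000), ('AA', 1000), ('BB', 2500);
""")

# Sum the costs per state first, then join, so each population row is
# matched exactly once and is not counted twice.
rows = conn.execute("""
    SELECT p.State, c.Cost / p.Total AS PerCapita
    FROM population p
    JOIN (SELECT State, SUM(Cost) AS Cost
          FROM CostByState
          GROUP BY State) c
      ON p.State = c.State
    ORDER BY p.State
""").fetchall()
print(rows)
```

Qualifying the column as `p.State` also resolves the ambiguity the question mentions, since `State` exists in both tables.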
I have a table:
LocationId OriginalValue Mean
1 0.45 3.99
2 0.33 3.99
3 16.74 3.99
4 3.31 3.99
and so forth...
How would I work out the Standard Deviation using this table and also what would you recommend - STDEVP or STDEV?
To use it, simply:
SELECT STDEVP(OriginalValue)
FROM yourTable
From below, you probably want STDEVP.
From here:
STDEV is used when the group of numbers being evaluated are only a partial sampling of the whole population. The denominator for dividing the sum of squared deviations is N-1, where N is the number of observations ( a count of items in the data set ). Technically, subtracting the 1 is referred to as "non-biased."
STDEVP is used when the group of numbers being evaluated is complete - it's the entire population of values. In this case, the 1 is NOT subtracted and the denominator for dividing the sum of squared deviations is simply N itself, the number of observations ( a count of items in the data set ). Technically, this is referred to as "biased." Remembering that the P in STDEVP stands for "population" may be helpful. Since the data set is not a mere sample, but constituted of ALL the actual values, this standard deviation function can return a more precise result.
Generally, you should use STDEV when you have to estimate standard deviation based on a sample. But if you have entire column-data given as arguments, then use STDEVP.
In general, if your data represents the entire population, use STDEVP; otherwise, use STDEV.
Note that for large samples, the two functions return nearly the same value, so it is better to use STDEV in this case.
In statistics, there are two types of standard deviations: one for a sample and one for a population.
The sample standard deviation, generally notated by the letter s, is used as an estimate of the population standard deviation.
The population standard deviation, generally notated by the Greek letter lower case sigma, is used when the data constitutes the complete population.
It is difficult to answer your question directly -- sample or population -- because it is difficult to tell what you are working with: a sample or a population. It often depends on context.
Consider the following example.
If I want to know the standard deviation of the age of students in my class, then I would use STDEVP because the class is my population. But if I want to use my class as a sample of the population of all students in the school (this would be what is known as a convenience sample, and would likely be biased, but I digress), then I would use STDEV because my class is a sample. The resulting value would be my best estimate of STDEVP.
As mentioned above (1) for large sample sizes (say, more than thirty), the difference between the two becomes trivial, and (2) generally you should use STDEV, not STDEVP, because in practice we usually don't have access to the population. Indeed, one could argue that if we always had access to populations, then we wouldn't need statistics. The entire point of inferential statistics is to be able to make inferences about a population based on the sample.
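How quickly the two functions converge can be checked directly: for the same n values, the N-1 and N divisors differ only by a factor of sqrt(n / (n - 1)). The helper name below is hypothetical and the sample sizes are arbitrary.

```python
import math

def sample_to_population_ratio(n):
    """Ratio STDEV / STDEVP for the same n values: sqrt(n / (n - 1))."""
    if n < 2:
        raise ValueError("the sample standard deviation needs at least 2 values")
    return math.sqrt(n / (n - 1))

for n in (5, 30, 1000):
    print(n, round(sample_to_population_ratio(n), 4))
```

At n = 30 the two results differ by less than 2%, and at n = 1000 by well under 0.1%, which is why the choice matters mostly for small data sets.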