Trouble writing a query to select one row per "date", given certain conditions

I am having trouble writing a query to select one row per "date", given certain conditions. My table has this structure:
ID  date      expiration  callput  iv   delta
1   1/1/2009  1/20/2009   C        0.4  0.61
2   1/1/2009  1/20/2009   C        0.3  0.51
3   1/1/2009  2/20/2009   C        0.2  0.41
I would like to write a query with the following characteristics:
For each row, calculate the "days", i.e. the expiration date minus the date. For instance, for row one, the "days" is 19 (1/20 minus 1/1)
The result set should only have rows with a "days" of between 15 and 50
The "callput" value must be "C"
For each date, show only one row. That row should have the following characteristics:
The delta should be greater than 0.5
The delta should be the smallest number greater than 0.5
If there are two rows, the row with the lower days should be selected
Here is 'days' for the sample data above:
ID  date      expiration  days  callput  iv   delta
1   1/1/2009  1/20/2009   19    C        0.4  0.61
2   1/1/2009  1/20/2009   19    C        0.3  0.51
3   1/1/2009  2/20/2009   50    C        0.2  0.41
For my sample dataset, the answer should be row 2, because row 2's "delta" is above 0.5, row 2's delta of 0.51 is closer to 0.5 than row 1's 0.61, and row 2's "days" of 19 is less than row 3's "days" of 50.
This is the query I've written so far:
SELECT date, Min(delta) AS MaxOfdelta, [expiration]-[date] AS days
FROM RAWDATA
WHERE (((delta)>0.5) AND ((callput)="C") AND (([expiration]-[date])>=15 And ([expiration]-[date])<=50))
GROUP BY date, [expiration]-[date]
ORDER BY date;
This works somewhat, but sometimes, there are multiple rows for one date, because two rows on a given day can have a "days" between 15 and 50. I can't get my query to obey the rule "If there are two rows, the row with the lower days should be selected". I would also like the "iv" value for that row to be present in my query result set.
I happen to be using Microsoft Access, but syntax for any SQL engine would be appreciated! :-)

What you can do is select the right rows in a subquery. This query should find the rows you're looking for:
select [date], min([expiration]-[date])
from rawdata
where delta > 0.5
and callput = 'C'
and [expiration]-[date] between 15 and 50
group by [date]
To find the delta that belongs to these rows, put it in a subquery and join on it:
select *
from rawdata
inner join (
select [date]
, min([expiration]-[date]) as days
from rawdata
where delta > 0.5
and callput = 'C'
and [expiration]-[date] between 15 and 50
group by [date]
) as filter
on filter.date = rawdata.date
and filter.days = rawdata.[expiration] - rawdata.[date]
where delta > 0.5
and callput = 'C'
To search for the lowest delta within rows with identical "days", you could add another subquery:
select
SubDaysDelta.date
, SubDaysDelta.MinDays
, SubDaysDelta.MinDelta
, min(rawdata.iv) as MinIv
from rawdata
inner join (
select
SubDays.date
, SubDays.MinDays
, min(delta) as MinDelta
from rawdata
inner join (
select [date]
, min([expiration]-[date]) as MinDays
from rawdata
where delta > 0.5
and callput = 'C'
and [expiration]-[date] between 15 and 50
group by [date]
) as SubDays
on SubDays.date = rawdata.date
and SubDays.MinDays = rawdata.[expiration] - rawdata.[date]
where delta > 0.5
and callput = 'C'
group by SubDays.date, SubDays.MinDays
) as SubDaysDelta
on SubDaysDelta.date = rawdata.date
and SubDaysDelta.MinDays = rawdata.[expiration] - rawdata.[date]
and SubDaysDelta.MinDelta = rawdata.delta
where delta > 0.5
and callput = 'C'
group by SubDaysDelta.date, SubDaysDelta.MinDays, SubDaysDelta.MinDelta
The first subquery "SubDays" searches for rows with the lowest "days". The second subquery "SubDaysDelta" searches for the lowest delta within the "SubDays" set. The outer query filters any duplicates remaining.
It would be more readable and maintainable if you used views. The first view could filter on callput and the 15-50 "days" limit. That would make it a lot easier.
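The nested query can be sanity-checked against the sample rows in SQLite through Python. This is a sketch, not Access: julianday() stands in for Access's [expiration]-[date] subtraction, and ISO date strings replace the 1/1/2009 format.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE rawdata (id INTEGER, date TEXT, expiration TEXT,
                      callput TEXT, iv REAL, delta REAL);
INSERT INTO rawdata VALUES
  (1, '2009-01-01', '2009-01-20', 'C', 0.4, 0.61),
  (2, '2009-01-01', '2009-01-20', 'C', 0.3, 0.51),
  (3, '2009-01-01', '2009-02-20', 'C', 0.2, 0.41);
""")

# Same shape as the answer's final query; julianday() computes "days".
rows = conn.execute("""
SELECT SubDaysDelta.date, SubDaysDelta.MinDays, SubDaysDelta.MinDelta,
       MIN(rawdata.iv) AS MinIv
FROM rawdata
INNER JOIN (
    SELECT SubDays.date, SubDays.MinDays, MIN(delta) AS MinDelta
    FROM rawdata
    INNER JOIN (
        SELECT date, MIN(julianday(expiration) - julianday(date)) AS MinDays
        FROM rawdata
        WHERE delta > 0.5 AND callput = 'C'
          AND julianday(expiration) - julianday(date) BETWEEN 15 AND 50
        GROUP BY date
    ) AS SubDays
      ON SubDays.date = rawdata.date
     AND SubDays.MinDays = julianday(rawdata.expiration) - julianday(rawdata.date)
    WHERE delta > 0.5 AND callput = 'C'
    GROUP BY SubDays.date, SubDays.MinDays
) AS SubDaysDelta
  ON SubDaysDelta.date = rawdata.date
 AND SubDaysDelta.MinDays = julianday(rawdata.expiration) - julianday(rawdata.date)
 AND SubDaysDelta.MinDelta = rawdata.delta
WHERE delta > 0.5 AND callput = 'C'
GROUP BY SubDaysDelta.date, SubDaysDelta.MinDays, SubDaysDelta.MinDelta
""").fetchall()
print(rows)  # row 2 wins: days 19, delta 0.51, iv 0.3
```

As expected, the single surviving row is row 2: days 19, delta 0.51, iv 0.3.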

VBA!
I wish I could be as thorough, dedicated and helpful a servant as Andomar. I can only up-vote his answer in sheer awe of him.
However ... I would point out there are perhaps compelling reasons to switch to VBA. Even if you are new to VBA, the benefits in control and troubleshooting may put you ahead, and I'd guess any new learning will help elsewhere in your project.
I wish I could provide as complete an answer as Andomar did. But give it a whack.

Related

SQL calculate percentage from calculated column

I have a table with multiple columns, but I need to calculate a total percentage based on 2 columns.
Column 1 has a unique identifier (a number, i.e. 15211, 36521, 45987, etc.)
Column 2 has a "Y" or is blank (the criteria are built into the DWH)
What I want to do is get the percentage of rows where Column 2 is "Y", using the Column 1 count as the denominator.
Column 1   Column 2
25638      y
69857      n
78561      n
23149      y
Based on the example above I'm expecting 2/4 = 0.50, or 50%.
You can divide the result of a conditional aggregation on Column2 = 'Y' by the overall count.
SELECT COUNT(CASE WHEN Column2 = 'y' THEN 1 END) / COUNT(*) AS perc_y
FROM tab
Output:
perc_y
0.5000
If you want a percentage, multiply by 100, round, and concatenate with '%'.
Here's a demo in MySQL, although it should work on all the most common DBMSs.

Assign an age to a person based on known population average but no Date of birth

I would like to use PostgreSQL to assign an age category to a list of households, where we don't know the date of birth of any of the family members.
Dataset looks like:
household_id   household_size
x1             5
x2             1
x3             8
...            ...
I then have a set of percentages for each age group with that dataset looking like:
age_group   percentage
0-18        30
19-30       40
31-100      30
I want the query to calculate an assignment that brings the whole dataset as close to those percentages as possible, and if possible something similar at the household level (not as important). The dataset will end up looking like:
household_id   household_size   0-18   19-30   31-100
x1             5                2      2       1
x2             1                0      1       0
x3             8                3      3       2
...            ...              ...    ...     ...
I have looked at the ntile function, but any pointers on how I could handle this in Postgres would be really helpful.
I didn't want to post an answer with just a link, so I figured I'd give it a shot and see if I could simplify depesz's weighted_random to plain SQL. The result is a slower, less readable, worse version of it, but shorter and in plain SQL:
CREATE FUNCTION weighted_random( IN p_choices ANYARRAY, IN p_weights float8[] )
RETURNS ANYELEMENT language sql as $$
select choice
from
( select case when (sum(weight) over (rows UNBOUNDED PRECEDING)) >= hit
then choice end as choice
from ( select unnest(p_choices) as choice,
unnest(p_weights) as weight ) inputs,
( select sum(weight)*random() as hit
from unnest(p_weights) a(weight) ) as random_hit
) chances
where choice is not null
limit 1
$$;
It's not inlineable because of the aggregate and window function calls. It's faster if you can assume the weights are probabilities that sum to 1.
The principle is that you provide an array of choices and an equal-length array of weights (they can be percentages, but don't have to be, nor do they have to sum to any specific number):
update test_area t
set ("0-18",
"19-30",
"31-100")
= (with cte AS (
select weighted_random('{0-18,19-30,31-100}'::TEXT[], '{30,40,30}')
as age_group
from generate_series(1,household_size,1))
select count(*) filter (where age_group='0-18') as "0-18",
count(*) filter (where age_group='19-30') as "19-30",
count(*) filter (where age_group='31-100') as "31-100"
from cte)
returning *;
Online demo showing that both his version and mine are statistically reliable.
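For readers without Postgres at hand, the cumulative-weight idea behind weighted_random can be sketched in plain Python. This is an analogue of the technique, not the SQL function itself: draw a uniform number in [0, sum(weights)) and return the first choice whose running total reaches it.

```python
import random
import itertools

def weighted_random(choices, weights):
    """Pick one choice with probability proportional to its weight."""
    hit = sum(weights) * random.random()
    for choice, cum in zip(choices, itertools.accumulate(weights)):
        if cum >= hit:
            return choice
    return choices[-1]  # guard against float rounding at the top edge

# The age-group example from the question: weights 30/40/30.
picks = [weighted_random(['0-18', '19-30', '31-100'], [30, 40, 30])
         for _ in range(10_000)]
print(picks.count('19-30') / len(picks))  # roughly 0.40
```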
A minimum start could be:
SELECT
household_id,
MIN(household_size) as size,
ROUND(SUM(CASE WHEN agegroup_from=0 THEN g ELSE 0 END),1) as g1,
ROUND(SUM(CASE WHEN agegroup_from=19 THEN g ELSE 0 END),1) as g2,
ROUND(SUM(CASE WHEN agegroup_from=31 THEN g ELSE 0 END),1) as g3
FROM (
SELECT
h.household_id,
h.household_size,
p.agegroup_from,
p.percentage/100.0 * h.household_size as g
FROM households h
CROSS JOIN PercPerAge p) x
GROUP BY household_id
ORDER BY household_id;
output:
household_id   size   g1    g2    g3
x1             5      1.5   2.0   1.5
x2             1      0.3   0.4   0.3
x3             8      2.4   3.2   2.4
see: DBFIDDLE
Notes:
Of course you should round the g columns to whole numbers, taking the complete split into account (g1+g2+g3 = total)
Because g1, g2 and g3 are based on percentages, their values can shift, as long as the total split stays correct (see, for more info: Return all possible combinations of values on columns in SQL )
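One standard way to do that rounding while keeping g1+g2+g3 equal to the household size is the largest-remainder method. A minimal Python sketch (the choice of method is mine, not part of the answer above):

```python
import math

def largest_remainder(fractions):
    """Round non-negative numbers to integers whose sum equals the
    rounded sum of the inputs (largest-remainder apportionment)."""
    target = round(sum(fractions))
    floors = [math.floor(f) for f in fractions]
    short = target - sum(floors)
    # hand the missing units to the largest fractional remainders
    order = sorted(range(len(fractions)),
                   key=lambda i: fractions[i] - floors[i], reverse=True)
    for i in order[:short]:
        floors[i] += 1
    return floors

# Household x1 from the output above: 1.5 + 2.0 + 1.5 = 5 persons.
print(largest_remainder([1.5, 2.0, 1.5]))  # [2, 2, 1], still sums to 5
```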

Calculating Column value based on row above and previous column [duplicate]

This question already has answers here:
How to calculate Running Multiplication
(4 answers)
Closed 6 months ago.
I have a table I'm trying to create that has a column that needs to be calculated based on the row above it multiplied by the previous column. The first row is defaulted to 100,000 and the rest of the rows would be calculated off of that. Here's an example:
Age   Population   Deaths   DeathRate   DeathPro   DeathProb   SurvivalProb   PersonsAlive
0     1742         0        0           0.1        0           1              100,000
51    2048         1        0.00048     0.5        0.00048     0.99951        99951.18379
52    1921         0        0           0.5        0           1              99951.18379
61    1965         1        0.00051     0.5        0.00051     0.99949        99900.33
I skipped some ages so I didn't have to type it all in; the ages go from 0 to 85. This was originally done in Excel, where the formula for PersonsAlive (which is what I'm trying to recreate) was G3*H2, i.e. the previous value of PersonsAlive * SurvivalProb.
I was thinking I could accomplish this with the LAG function, but with the example I provided above, I get NULL values for everything after age 1 because there is no value in the previous row. What I want is for PersonsAlive to return 100,000 until I get a death (Age 51 in the example), then do the calculation and return that value (99951) until another death happens (Age 61). Here's my code, which includes two extra columns: ZipCode (the reason we want to do it in SQL is so we can calculate all zips at once) and PersonsAliveTemp, which I used to set Age 0 to 100,000:
SELECT
ZipCode
,Age
,[Population]
,Deaths
,DeathRate
,Death_Proportion
,DeathProbablity
,SurvivalProbablity
,PersonsAliveTemp
,(LAG(PersonsAliveTemp,1) OVER(PARTITION BY ZipCode ORDER BY Age))*SurvivalProbablity as PersonsAlive
FROM #temp4
I also tried it with defaulting PersonsAliveTemp to 100,000 and 0, which "works" but doesn't do the running calculation.
Is it possible to get the lag function (or some other function) to do a running row by row calc?
This converts the running product into an addition via logarithms, using the identity prod(x) = exp(sum(log(x))).
select *,
100000 * exp(sum(log(SurvivalProb)) over
(partition by ZipCode order by Age
rows between unbounded preceding and current row)
) as PersonsAlive
from data
order by Age;
https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=36be4d66260c74196f7d36833018682a
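The identity the query relies on, a running product equals exp of a running sum of logs, can be checked in plain Python against the SurvivalProb values shown in the question (only the four excerpted rows, so the last value won't match the full table's 99900.33):

```python
import math
import itertools

# SurvivalProb from the question's excerpt, in Age order (0, 51, 52, 61)
survival = [1, 0.99951, 1, 0.99949]

# direct running product
direct = list(itertools.accumulate(survival, lambda a, b: a * b))

# the log/exp trick the query uses: exp(running sum of logs)
via_logs = [math.exp(s)
            for s in itertools.accumulate(math.log(p) for p in survival)]

persons_alive = [100000 * p for p in via_logs]
for d, v in zip(direct, via_logs):
    assert math.isclose(d, v)  # the two computations agree
print(persons_alive)  # starts at 100000.0, drops to ~99951.0 after the first death
```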

Lookup values based solely on minimum value of range

I want to place values within a range given only a minimum value, similar to using VLOOKUP/HLOOKUP in Excel with the "FALSE" criterion.
As seen below, TableScore lists the low-end cutpoints (CutpointVal) for a value to be assigned a specific number of points (the minimum value in a range). The below SQL code accomplishes this in two steps, with the first query generating a datasheet that includes a high value for each low value, thus creating a full range.
However, this is a somewhat clunky way of doing this, especially when trying to iterate this many times. The original table (TableScore) cannot be altered to include high values. Is there a way to accomplish a similar mechanism with only one query?
Main
ID     Score
72625  2.5
78261  3.2
82766  4.7
58383  0.3
TableScore
CutpointVal  Points
0            0
0.3          1
1.2          2
2.7          3
3.4          4
Upper and lower range query (RangeQry):
SELECT a.CutpointVal AS LowVal, Val(Nz((SELECT TOP 1 [CutpointVal]-0.001
FROM TableScore b
WHERE b.Points > a.Points
ORDER BY b.Points ASC),9999)) AS HighVal, a.Points
FROM TableScore AS a
ORDER BY a.Points;
Range assignment query:
SELECT Main.ID, Main.Score, RangeQry.LowVal, RangeQry.HighVal, RangeQry.Points AS PTS
FROM RangeQry, Main
WHERE (((Main.Score) Between [RangeQry].[LowVal] And [RangeQry].[HighVal]));
Desired output:
ID Score Points
72625 2.5 2
78261 3.2 3
82766 4.7 4
58383 0.3 1
Consider:
SELECT Main.ID, Main.Score, (
SELECT Max(Points) FROM TableScore WHERE CutpointVal<=Main.Score) AS Pts
FROM Main;
Or
SELECT Main.ID, Main.Score, (
SELECT TOP 1 Points FROM TableScore
WHERE CutpointVal <= Main.Score
ORDER BY Points DESC) AS Pts
FROM Main;
Or
SELECT Main.ID, Main.Score, DMax("Points","TableScore","CutpointVal<=" & [Score]) AS Pts
FROM Main;
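The first suggested query uses no Access-specific functions, so it can be verified against the sample data in SQLite through Python:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Main (ID INTEGER, Score REAL);
INSERT INTO Main VALUES (72625,2.5),(78261,3.2),(82766,4.7),(58383,0.3);
CREATE TABLE TableScore (CutpointVal REAL, Points INTEGER);
INSERT INTO TableScore VALUES (0,0),(0.3,1),(1.2,2),(2.7,3),(3.4,4);
""")

# Correlated subquery: for each Score, take the highest Points whose
# cutpoint does not exceed it (the VLOOKUP "FALSE"-style range match).
rows = conn.execute("""
SELECT Main.ID, Main.Score,
       (SELECT MAX(Points) FROM TableScore
        WHERE CutpointVal <= Main.Score) AS Pts
FROM Main
""").fetchall()
print(rows)
```

The result matches the desired output: 2.5 scores 2 points, 3.2 scores 3, 4.7 scores 4, 0.3 scores 1.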

Powerpivot - Only Show Minimum Value

Newbie to DAX/PowerPivot and struggling with a specific problem.
I have a table
Location Category Distance
1 A 1.244
2 A 2.111
3 B 5.113
4 C 0.124
etc
I need to identify the Minimum distance out of the selection and only output for that record. So I'd have
Location Category Distance MinDist
1 A 1.244
2 A 0.111 0.111
3 B 5.113
4 C 3.124
etc
I've tried various measures but always end up with simply a repeat of the Distance column....whatever filters I try to apply.
Please help.
If your table was called 'table1' then this would give you the overall minimum:
=CALCULATE(MIN(Table1[Distance]), ALL(Table1))
Depending on your requirements, you may have to specify columns in the ALL() to reduce how much of the filter is opened up (I suggest you research ALL(), as it is a very important DAX function).
To return zero (blank is tricky) for the non-matchers, you could package it in:
=
IF (SUM ( Table1[Distance] ) = CALCULATE ( MIN ( Table1[Distance] ), ALL ( Table1 ) ),
CALCULATE ( MIN ( Table1[Distance] ), ALL ( Table1 ) ),
0
)
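The logic of that measure can be cross-checked in plain Python (an analogue of what the DAX computes, not DAX itself): find the minimum over all rows, then emit it only on rows whose Distance equals it, and 0 elsewhere.

```python
# Sample rows from the question's desired output
rows = [
    {"Location": 1, "Category": "A", "Distance": 1.244},
    {"Location": 2, "Category": "A", "Distance": 0.111},
    {"Location": 3, "Category": "B", "Distance": 5.113},
    {"Location": 4, "Category": "C", "Distance": 3.124},
]

# CALCULATE(MIN(Table1[Distance]), ALL(Table1)): minimum ignoring filters
overall_min = min(r["Distance"] for r in rows)

# The IF: the row's own value vs. the overall minimum; 0 for non-matchers
for r in rows:
    r["MinDist"] = overall_min if r["Distance"] == overall_min else 0

print([r["MinDist"] for r in rows])  # [0, 0.111, 0, 0]
```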