Calculating Column value based on row above and previous column [duplicate] - sql

This question already has answers here:
How to calculate Running Multiplication
(4 answers)
Closed 6 months ago.
I have a table I'm trying to create that has a column that needs to be calculated based on the row above it multiplied by the previous column. The first row is defaulted to 100,000 and the rest of the rows would be calculated off of that. Here's an example:
Age  Population  Deaths  DeathRate  DeathPro  DeathProb  SurvivalProb  PersonsAlive
0    1742        0       0          0.1       0          1             100,000
51   2048        1       0.00048    0.5       0.00048    0.99951       99951.18379
52   1921        0       0          0.5       0          1             99951.18379
61   1965        1       0.00051    0.5       0.00051    0.99949       99900.33
I skipped some ages so I didn't have to type it all in, but the ages go from 0 to 85. This was originally done in Excel, and the formula for PersonsAlive (which is what I'm trying to recreate) was G3*H2, i.e. the previous value of PersonsAlive * Survival Probability.
I was thinking I could accomplish this with the lag function, but with the example I provided above, I get null values for everything after age 1 because there is no value in the previous row. What I want to happen is that PersonsAlive returns 100,000 until I get a death (in the example at Age 51) and then it does the calculation and returns the value (99951) until another death happens (Age 61). Here's my code, which includes two extra columns, ZipCode (the reason we want to do it in SQL is so we can calculate all zips at once) and PersonsAliveTemp, which I used to set Age 0 to 100,000:
SELECT
ZipCode
,Age
,[Population]
,Deaths
,DeathRate
,Death_Proportion
,DeathProbablity
,SurvivalProbablity
,PersonsAliveTemp
,(LAG(PersonsAliveTemp,1) OVER(PARTITION BY ZipCode ORDER BY Age))*SurvivalProbablity as PersonsAlive
FROM #temp4
I also tried it with defaulting PersonsAliveTemp to 100,000 and 0, which "works" but doesn't do the running calculation.
Is it possible to get the lag function (or some other function) to do a running row by row calc?

This converts a running product into an addition via logarithms.
select *,
100000 * exp(sum(log(SurvivalProb)) over
(partition by ZipCode order by Age
rows between unbounded preceding and current row)
) as PersonsAlive
from data
order by Age;
https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=36be4d66260c74196f7d36833018682a
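To see why the logarithm trick works, here is a minimal Python sketch (not part of the original answer) that compares a direct running product with exp(sum(log(x))). The survival probabilities are illustrative values, not the asker's real data. One caveat that follows from the math: the logarithm is undefined for 0, so this only works while every SurvivalProb is greater than 0.

```python
import math
from itertools import accumulate

# Illustrative survival probabilities for consecutive ages (assumed values).
survival_prob = [1.0, 0.99951, 1.0, 0.99949]
START = 100_000  # PersonsAlive at age 0

# Running product computed directly: each row multiplies the previous result.
direct = [START * p for p in accumulate(survival_prob, lambda a, b: a * b)]

# The same series as exp(sum(log(x))), mirroring the window-function query.
running_log_sum = accumulate(math.log(p) for p in survival_prob)
via_logs = [START * math.exp(s) for s in running_log_sum]

for d, v in zip(direct, via_logs):
    print(round(d, 2), round(v, 2))
```

Both columns agree up to floating-point rounding, and rows with a survival probability of 1 simply carry the previous value forward, which is the behaviour the question asks for.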

Related

Query smallest number of rows to match a given value threshold

I would like to create a query that operates similar to a cash register. Imagine a cash register full of coins of different sizes. I would like to retrieve a total value of coins in the fewest number of coins possible.
Given this table:
id  value
1   100
2   100
3   500
4   500
5   1000
How would I query for a list of rows that:
has a total value of AT LEAST a given threshold
with the minimum excess value (value above the threshold)
in the fewest possible rows
For example, if my threshold is 1050, this would be the expected result:
id  value
1   100
5   1000
I'm working with postgres and elixir/ecto. If it can be done in a single query great, if it requires a sequence of multiple queries no problem.
I had a go at this myself, using answers from previous questions:
Using ABS() to order by the closest value to the threshold
Select rows until a sum reduction of a single column reaches a threshold
Based on @TheImpaler's comment above, this prioritises the minimum number of rows over the minimum excess. It's not 100% what I was looking for, so I'm open to improvements if anyone has them, but if not I think this is going to be good enough:
-- outer query selects all rows underneath the threshold
-- inner subquery adds a running total column
-- window function orders by the difference between value and threshold
SELECT
*
FROM (
SELECT
i.*,
SUM(i.value) OVER (
ORDER BY
ABS(i.value - $THRESHOLD),
i.id
) AS total
FROM
inputs i
) t
WHERE
t.total - t.value < $THRESHOLD;
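The following is a small Python emulation of that query on the sample table, assuming a threshold of 1050 as in the question. The function name pick_rows is made up for this sketch. It shows the trade-off mentioned above: the result reaches the threshold, but with more excess than the ideal answer (ids 1 and 5).

```python
from itertools import accumulate

# Sample rows from the question: (id, value).
rows = [(1, 100), (2, 100), (3, 500), (4, 500), (5, 1000)]
THRESHOLD = 1050

def pick_rows(rows, threshold):
    # ORDER BY ABS(value - threshold), id
    ordered = sorted(rows, key=lambda r: (abs(r[1] - threshold), r[0]))
    # SUM(value) OVER (ORDER BY ...) gives a running total per row
    totals = accumulate(value for _, value in ordered)
    # WHERE total - value < threshold: keep a row only while the sum
    # of the rows *before* it is still short of the threshold
    return [(rid, value) for (rid, value), total in zip(ordered, totals)
            if total - value < threshold]

picked = pick_rows(rows, THRESHOLD)
print(picked)  # [(5, 1000), (3, 500)]
```

The picked rows total 1500 (an excess of 450), whereas ids 1 and 5 would total 1100, so the query is a greedy approximation rather than an exact minimum-excess solution.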

fetch aggregate value along with data

I have a table with the following fields
ID,Content,QuestionMarks,TypeofQuestion
350, What is the symbol used to represent Bromine?,2,MCQ
758,What is the symbol used to represent Bromine? ,2,MCQ
2425,What is the symbol used to represent Bromine?,3,Essay
2080,A quadrilateral has four sides, four angles ,1,MCQ
2614,A circular cone has a curved surface area of ,2,MCQ
2520,Two triangles have sides 5 cm, 11 cm, 2 cm . ,2,MCQ
2196,Life supporting process mediated by water? ,2,Essay
I would like to get random questions where total marks is an input number.
For example, if I say 25, the result should be random questions whose Sum(QuestionMarks) is 25 (+/- 1).
Is this really possible using SQL?
select content,id,questionmarks,sum(questionmarks) from quiz_question
group by content,id,questionmarks;
Expected Input 25
Expected Result (Sum of Question Marks =25)
Update:
How do I ensure I get at least 2 Essay-type questions? (This is just an example; I would extend this to other conditions.) Thank you for all the help.
S-Man's cumulative sum is the right approach. For your logic, though, I think you want to get up to the first row that is 24 or more. That logic is:
where total - questionmark < 24
If you have enough questions, then you could get exactly 25 using:
with q25 as (
select *
from (select t.*,
sum(questionmark) over (order by random()) as running_questionmark
from t
) t
where running_questionmark < 25
)
select q.ID, q.Content, q.QuestionMarks, q.TypeofQuestion
from q25 q
union all
(select t.ID, t.Content, t.QuestionMarks, t.TypeofQuestion
from t cross join
(select sum(questionmark) as questionmark_25 from q25) x
where not exists (select 1 from q25 where q25.id = t.id)
order by abs(questionmark - (25 - questionmark_25))
limit 1
)
This selects questions up to 25 but not at 25. It then tries to find one more to make the total 25.
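The two steps can be sketched in Python as follows. This is an illustration of the idea, not a translation tested against the query; the function name pick_questions and the question bank are made up, and it assumes every question is worth a positive number of marks so that the running sum only grows.

```python
import random

def pick_questions(questions, target, rng):
    """Pick (id, marks) pairs whose marks add up to roughly `target`."""
    shuffled = list(questions)
    rng.shuffle(shuffled)                      # ORDER BY random()

    # Step 1: keep the random prefix whose running sum stays below the target.
    picked, running = [], 0
    for qid, marks in shuffled:
        if running + marks >= target:
            break
        running += marks
        picked.append((qid, marks))

    # Step 2: add the one unused question whose marks are closest to the gap.
    used = {qid for qid, _ in picked}
    unused = [q for q in questions if q[0] not in used]
    if unused:
        gap = target - running
        picked.append(min(unused, key=lambda q: abs(q[1] - gap)))
    return picked

# Hypothetical question bank: 30 questions worth 1, 2 or 3 marks.
bank = [(qid, 1 + qid % 3) for qid in range(1, 31)]
chosen = pick_questions(bank, 25, random.Random(0))
print(sum(marks for _, marks in chosen), chosen)
```

The total is exactly the target only when an unused question with the right number of marks exists; otherwise the closest available one is used, which is why the answer says "if you have enough questions".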
Suppose questionmark is of type integer, and you want to get some records in random order whose questionmark sum is not more than 25.
You can use SUM() as a cumulative window function. The order is random. The cumulative SUM() adds each current value to the previous sum, so you can filter where SUM() <= <your value>:
SELECT
*
FROM (
SELECT
*,
SUM(questionmark) OVER (ORDER BY random()) as total
FROM
t
)s
WHERE total <= 25
Note:
This returns a list of records in random order whose total is no more than 25, but as close to it as the random order allows.
Finding an exact match for your value is a combinatorial problem which shouldn't be solved in a database, especially when there's a random factor. What if your current SUM is 22 and the next randomly chosen value is 4? Would you retry, maybe forever, until you randomly find a value = 3? Or would you try to remove an already counted record with value = 1?

How can I retrieve data from a Hive table from two columns with non null values and top 500 records in one query?

I have a Hive table (my_table) which is in ORC format and has 30 columns. Two of the columns (col_us, col_ds) store numeric values which can be 0, null, or some integer. The table is partitioned by day and hour.
The table has approx. 8 million x 96 records in a day's partition, and I am querying 15 daily partitions.
Currently I am running separate queries to retrieve the top 500 records with a value greater than 0 using a rank function: one query for col_us and another for col_ds.
It is possible that col_us has a numeric value while col_ds is 0 or null.
Question:
I want to retrieve the top 500 non-null, non-zero records from each of these columns in one query.
My Query:
From(
SELECT D.COL_US, D.DATESTAMP,
ROW_NUMBER() OVER (PARTITION BY D.ID,D.SUB_ID ORDER BY CONCAT (D.DATESTAMP,D.HOURSTAMP,D.TIMESTAMP) DESC) AS RNK
FROM ${wf_table_name} D
WHERE DATESTAMP >= '${datestamp_15}' AND DATESTAMP < '${datestamp}'
AND COL_US > 0)T
INSERT OVERWRITE TABLE ${wf_us_table}
SELECT T.COL_US, T.DATESTAMP, T.RNK WHERE T.RNK < 500;
From your query, I guess you are trying to get the top 500 rows from your table ordered by date/time, that is, the latest 500 rows where col_us and col_ds both have a value > 0, rather than the top 500 from each of these columns.
From your question, your table may hold rows like these, for example:
col_us  col_ds
0       5
NULL    10
10      0
5       NULL
or both columns may have a value > 0.
So instead of 'AND COL_US > 0' in the WHERE clause, use 'AND (COL_US > 0 and col_ds > 0)'.
But with this condition you will not get any of the 4 rows shown above.
So if you want to get 10, 5 from col_us along with 5, 10 from col_ds, then I would say it's not possible using a single query.
Then again, your question states "I want to retrieve top 500 non null and non 0 records from each of these columns from one query."
From that I guess you want the top 500 records ranked by the value of col_us/col_ds, in which case you must use these columns in the rank clause instead of date/time.
You may be able to get what you want with an UPDATE query that depends on other available columns, but before that I would ask you to share exactly what you want (top 500 based on col_us/col_ds, or the latest 500) along with your base and target table structures.

Is there a way to dynamically set ROWS BETWEEN X PRECEDING AND CURRENT ROW?

I'm looking for a way to dynamically set the beginning of the window frame in my query on SQL Server, using ROWS BETWEEN.
Something like:
SUM(field) OVER(ORDER BY field2 ROWS BETWEEN field3 PRECEDING AND CURRENT ROW)
field3 holds the amount of items (via group by from a CTE) that represent a group.
Is that possible or should i try a different approach?
>> EDIT
My query is too big and messy to share here, but let me try to explain what I need. It's from a report builder which allows users to create custom formulas, like "employees/10". It also allows the user to simply input a constant formula like "12", and I need to calculate subtotals and the grand total for those. When using a field, like "employees", everything works fine. But for constant values I can't sum the values without rewriting a lot of stuff (which I'm trying to avoid).
So, consider a CTE called "aggregator" and the following query:
SELECT
*,
"employees"/10 as "ten_percent",
12 as "twelve"
FROM aggregator
This query returns this output:
row_type  counter  company_name  department_name  employees  ten_percent  twelve
data      1        A             A1               10         1            12
data      1        A             A2               15         1,5          12
data      1        A             A3               10         1            12
subtotal  3        A                              35         3,5          12
data      1        B             B1               10         1            12
subtotal  1        B                              10         1            12
total     4                                       45         4,5          12
As you can see, the values for "twelve" are wrong for the subtotal and total row types. I'm trying to solve this without changing the CTE.
ROLLUP won't work because I already have the sum for other columns.
I tried this (I omitted "row_type_sort" from the table above; it defines the sorting):
CASE
WHEN row_type = 'data' THEN
MAX(aggregator.[twelve])
ELSE
SUM(SUM(aggregator.[twelve]))
OVER (ORDER BY "row_type_sort" ROWS BETWEEN unbounded PRECEDING AND CURRENT ROW)
END AS "twelve"
This would work OK if I could replace "unbounded" with the value of the column "counter", which was my original question.
LAG/LEAD weren't helpful either.
I'm out of ideas. Is it possible to achieve what I need by changing only this part of the query, or does the result of the CTE have to change as well?
Thanks

Why percentage is not working properly in SQLite3?

I have the following query, but the percentage always comes out as zero:
SELECT p.state, (p.popestimate2011/sum(p.popestimate2011)) * 100
FROM pop_estimate_state_age_sex_race_origin p
WHERE p.age >= 21
GROUP BY p.state;
Also here's the table schema:
sqlite> .schema pop_estimate_state_age_sex_race_origin
CREATE TABLE pop_estimate_state_age_sex_race_origin (
sumlev NUMBER,
region NUMBER,
division NUMBER,
state NUMBER,
sex NUMBER,
origin NUMBER,
race NUMBER,
age NUMBER,
census2010pop NUMBER,
estimatesbase2010 NUMBER,
popestimate2010 NUMBER,
popestimate2011 NUMBER,
PRIMARY KEY(state, age, sex, race, origin),
FOREIGN KEY(sumlev) REFERENCES SUMLEV(sumlev_cd),
FOREIGN KEY(region) REFERENCES REGION(region_cd),
FOREIGN KEY(division) REFERENCES DIVISION(division_cd),
FOREIGN KEY(sex) REFERENCES SEX(sex_cd),
FOREIGN KEY(race) REFERENCES RACE(race_cd),
FOREIGN KEY(origin) REFERENCES ORIGIN(origin_cd));
So when I run the query it just shows 0 for the percentage:
stat p.popestimate
---- -------------
1 0
2 0
4 0
5 0
6 0
8 0
9 0
10 0
11 0
12 0
13 0
15 0
16 0
17 0
18 0
19 0
20 0
21 0
22 0
23 0
I also tried writing it with a nested query, but didn't get anywhere either:
SELECT p.state, 100.0 * sum(p.popestimate2011) / total_pop AS percentage
FROM pop_estimate_state_age_sex_race_origin p
JOIN (SELECT state, sum(p2.popestimate2011) AS total_pop
FROM pop_estimate_state_age_sex_race_origin p2) s ON (s.state = p.state)
WHERE age >= 21
GROUP BY p.state, total_pop
ORDER BY p.state;
The current problem I am having is that it just shows one row as result and just shows the result for the last state number (state ID=56):
56 0.131294163192301
Here's an approach (not tested) that does not require an inner query. It makes a single pass over the table, aggregating by state, and using CASE to calculate the numerator of population aged over 20 and denominator of total state population.
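SELECT
state,
-- 100.0 forces floating-point arithmetic; integer operands would truncate the ratio to 0
100.0 * SUM(CASE WHEN age >= 21 THEN popestimate2011 ELSE 0 END) / SUM(popestimate2011) AS percentage
FROM pop_estimate_state_age_sex_race_origin
GROUP BY state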
I'm not sure why your SQL statement is executing at all. You are including the non-aggregated column value popestimate2011 in a GROUP BY select and that should generate an error.
A closer reading of the SQLite documentation indicates that it does, in fact, allow non-aggregated columns in the result list of a GROUP BY query, taking their value from an arbitrary row of each group (a feature also offered by MySQL). This explains:
Why your SELECT statement is able to execute (an arbitrary value is chosen for the non-aggregated popestimate2011 reference).
Why you are seeing a result of 0: the value chosen probably comes from the first row of the group and, if the rows were added to the database in order, that row probably has an age value of 0. Since the numerator in your division would then be 0, the result is also 0.
As to the meat of your calculation it's not clear from your table definition whether the data in your base table is already aggregated or not and, if so, what the age column represents (an average? the grouping factor for that row?)
Finally, SQLite does not have a NUMBER data type. These columns will get the default affinity of NUMERIC which is probably what you want but might not be.
You need something along these lines (not tested):
SELECT state,
-- 100.0 comes first so that the division is not done on integers
100.0 * SUM(popestimate2011) /
(SELECT SUM(popestimate2011)
FROM pop_estimate_state_age_sex_race_origin
WHERE age >= 21) AS percentage
FROM pop_estimate_state_age_sex_race_origin
WHERE age >= 21
GROUP BY state
;
SQLite has no NUMBER type. A column declared as NUMBER gets NUMERIC affinity, so whole numbers are stored as integers, and dividing one integer by another discards the decimals. That is why
(p.popestimate2011 / sum (p.popestimate2011))
is always 0.
Either declare the column popestimate2011 as REAL, or use CAST (...):
(CAST (p.popestimate2011 AS REAL) / SUM (p.popestimate2011))
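The effect is easy to reproduce with Python's built-in sqlite3 module. The sketch below uses a made-up two-column table named pop (not the asker's schema) with the same NUMBER declaration, and compares the integer division with the CAST version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical cut-down table; NUMBER mirrors the declared type in the question.
conn.execute("CREATE TABLE pop (state NUMBER, popestimate2011 NUMBER)")
conn.executemany("INSERT INTO pop VALUES (?, ?)", [(1, 30), (1, 20), (2, 50)])

# Both operands are stored as integers, so the division truncates to 0.
int_div = conn.execute(
    "SELECT state, SUM(popestimate2011) / (SELECT SUM(popestimate2011) FROM pop) * 100 "
    "FROM pop GROUP BY state ORDER BY state"
).fetchall()

# Casting one operand to REAL forces floating-point division.
real_div = conn.execute(
    "SELECT state, CAST(SUM(popestimate2011) AS REAL) "
    "/ (SELECT SUM(popestimate2011) FROM pop) * 100 "
    "FROM pop GROUP BY state ORDER BY state"
).fetchall()

print(int_div)   # [(1, 0), (2, 0)]
print(real_div)  # [(1, 50.0), (2, 50.0)]
conn.close()
```

Multiplying by 100.0 before dividing has the same effect as the CAST, because one floating-point operand is enough to make SQLite use floating-point division.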