Access SQL Crosstab Function SELECT TOP 1

I am calculating waste based on an administered amount corrected using a correction factor read from a table. The structure of the tables with some sample data is below:
RMP Administrations:
Nuclide Product MBq Date Given
-------------------------------------------------
Tc-99m Pertechnetate 700 2018/01/01
I-131 NaI 399 2018/02/01
I-131 NaI 555 2018/01/01
I-123 MIBG 181 2018/01/01
I-123 NaI 29 2018/01/03
WasteFactors
Nuclide Product MinActivity MaxActivity Factor
------------------------------------------------------------
Tc-99m * 0.3
I-123 * 150 0.3
I-123 * 150 1
I-123 MIBG 0.6
I-131 * 400 0.5
I-131 * 400 1
This table is complex, but it is the best way I can think of to represent the correction factors in a table. Nuclide is matched first; then, if the product matches, that correction factor is used; finally we check the activity (MBq) against the Min / Max columns to decide. We then use this factor along with the activity to determine the waste using the following SQL:
SELECT
Nuclide,
[Date Given] AS Given,
(SELECT TOP 1
Factor
FROM WasteFactors
WHERE [RMP Administrations].Nuclide = WasteFactors.Nuclide
AND [RMP Administrations].Product LIKE WasteFactors.Product
AND (WasteFactors.MinActivity IS NULL
OR WasteFactors.MinActivity > [RMP Administrations].MBq)
AND (WasteFactors.MaxActivity IS NULL
OR WasteFactors.MaxActivity <= [RMP Administrations].MBq)
ORDER BY WasteFactors.Nuclide ASC, WasteFactors.Product DESC)
AS Waste
FROM [RMP Administrations] WHERE NOT [RMP Administrations].Nuclide IS NULL AND NOT [RMP Administrations].MBq IS NULL
So this achieves what we require by sorting the factor table so that factors with a specific product name appear before factors which apply to all other products; with the data above, 'I-123 MIBG' is checked before 'I-123 *'.
Running this SQL against the data above should return the following:
Nuclide Given Waste
--------------------------------------------------------
Tc-99m 2018/01/01 0.3 (All Tc-99m is 0.3)
I-131 2018/02/01 1 (Activity <=400)
I-131 2018/01/01 0.5 (Activity >400)
I-123 2018/01/01 0.6 (Product is MIBG)
I-123 2018/01/03 0.3 (Not MIBG, <150)
with that factor used as MBq * (SELECT TOP 1...) AS Waste in the real code. So... this works fine, and I can summarise my data annually with a normal SUM(Waste), GROUP BY Nuclide and WHERE Year(Given)=[Enter Year]. My issues begin when I try to use this in the following crosstab query:
PARAMETERS [Enter Year] Short;
TRANSFORM SUM(T.MBq *
(SELECT TOP 1
Factor
FROM WasteFactors
WHERE T.Nuclide = WasteFactors.Nuclide
AND T.Product LIKE WasteFactors.Product
AND (WasteFactors.MinActivity IS NULL
OR WasteFactors.MinActivity >T.MBq)
AND (WasteFactors.MaxActivity IS NULL
OR WasteFactors.MaxActivity <= T.MBq)
ORDER BY WasteFactors.Nuclide ASC, WasteFactors.Product DESC)
)
SELECT T.Nuclide
FROM [RMP Administrations] AS T
WHERE Year(T.[Date Given])=[Enter Year]
GROUP BY T.Nuclide
PIVOT Format(T.[Date Given],"mm - mmm");
giving the error 'Access does not recognise T.Nuclide as a valid field or....'. I can't see an error in my SQL and don't see why it should not work as written. I have tried making a VBA function to calculate the waste amount, SUM(GetWaste(Nuclide, Product, MBq)), with the function running the same SQL as above in a recordset, but then Access tells me my query is too complex to be evaluated.
Does anyone have any ideas on where I have gone wrong with my crosstab query, or how I can restructure my WasteFactors table to make it easier to query? Or is this just too complex to do in SQL, so that I should do it in VBA instead?
The real data set is ~1000 records spanning multiple months. I would like to change the table and column names to something better, but I didn't create the database. Expected output for the above data would be:
Nuclide 01 - Jan 02 - Feb
-------------------------------
Tc-99m 210
I-131 277.5 399
I-123 117.3

You can't use a subquery in a TRANSFORM clause in that way, as far as I know.
Use a subquery in your FROM clause, multiplying its result with MBq there. Then use only a plain aggregate in the TRANSFORM clause.
Sample, probably needs refining:
PARAMETERS [Enter Year] Short;
TRANSFORM SUM(TransformField)
SELECT R.Nuclide
FROM (SELECT *,
Mbq * (SELECT TOP 1
Factor
FROM WasteFactors
WHERE T.Nuclide = WasteFactors.Nuclide
AND T.Product LIKE WasteFactors.Product
AND (WasteFactors.MinActivity IS NULL
OR WasteFactors.MinActivity >T.MBq)
AND (WasteFactors.MaxActivity IS NULL
OR WasteFactors.MaxActivity <= T.MBq)
ORDER BY WasteFactors.Nuclide ASC, WasteFactors.Product DESC) As TransformField
FROM [RMP Administrations] T
) AS R
WHERE Year(R.[Date Given])=[Enter Year]
GROUP BY R.Nuclide
PIVOT Format(R.[Date Given],"mm - mmm");
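If Access still objects to the nested derived table, another common workaround is to save the inner SELECT (the one computing TransformField above) as its own query and pivot that instead. A sketch, where qryWasteBase is an assumed name for that saved query, not something from the original database:
PARAMETERS [Enter Year] Short;
TRANSFORM SUM(Q.TransformField)
SELECT Q.Nuclide
FROM qryWasteBase AS Q
WHERE Year(Q.[Date Given])=[Enter Year]
GROUP BY Q.Nuclide
PIVOT Format(Q.[Date Given],"mm - mmm");
Splitting the calculation into a saved query also makes the factor lookup reusable for the plain annual summary mentioned in the question.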

Related

How to isolate a row that contains a value unlike other values in that column in SQL Server query?

I am trying to write a query in SSMS 2016 that will isolate the value(s) for a group that are unlike the other values within a column. I can explain better with an example:
Each piece of equipment in our fleet has an hour meter reading that gets recorded from a handheld device. Sometimes people in the field enter in a typo meter reading which skews our hourly readings.
So a unit's meter history may look like this:
10/1/2019: 2000
10/2/2019: 2208
10/4/2019: 2208
10/7/2019: 2212
10/8/2019: 2
10/8/2019: 2225
...etc.
It's obvious that the "2" is a bad record because an hour meter can never decrease.
Edit: Sometimes the opposite extreme may occur, where someone enters a reading like "22155", and then I would need the query to adapt to find values that are too high and isolate those as well.
This data is stored in a meter history table with a single row for each meter reading. I am tasked with creating some type of procedure that will automatically isolate the bad data and delete those rows from the table. How can I write a query that understands the context of the meter history and knows that the 2 is bad?
Any tips welcome, thanks in advance.
You can use a filter to get rid of "decreases":
select t.*
from (select t.*, lag(col2) over (order by col1) as prev_col2
      from t
     ) t
where prev_col2 is null or prev_col2 <= col2;  -- keeps the first row and any non-decreasing reading
I would not advise "automatically deleting" such records.
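For concreteness, here is the same pattern against a hypothetical meter-history layout (MeterHistory, unit_id, reading_date and reading are assumed names, not from the question), this time keeping the suspect rows for review rather than the good ones:
select t.*
from (select t.*,
             lag(reading) over (partition by unit_id order by reading_date) as prev_reading
      from MeterHistory t
     ) t
where reading < prev_reading;  -- rows that decreased, e.g. the stray "2"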
Automatically deleting data is risky, so I'm not certain I'd recommend unleashing that without some serious thought, but here's my idea based on your sample data showing that it's usually a pretty consistent number.
DECLARE @Median numeric(22,0);
;with CTE as
(
select t.*, row_number() over (order by t.value) as rn from t
)
select @Median = cte.value
from CTE cte
where cte.rn = (select (MAX(rn) + MIN(rn)) / 2 from CTE); -- integer division floors when max+min is odd
select * from dataReadings where reading_value < (0.8 * @Median) OR reading_value > (1.2 * @Median);
The goal of this is to give you a +/- 20% range of the median value, which shouldn't be as skewed by mistakes as an average would be. Again, this assumes that your values should fall into an acceptable range.
If this is meant to be an always-increasing reading and you shouldn't ever encounter lower values, Gordon's answer is perfect.
I would look at the variation of each reading from the mean reading value. (I picked up the lag() check from Gordon Linoff's reply too.) For example:
create table #test (the_date date, reading int)
insert #test (the_date, reading) values ('10/1/2019', 2000)
, ('10/2/2019', 2208)
, ('10/4/2019', 2208)
, ('10/7/2019', 2212)
, ('10/8/2019', 2)
, ('10/8/2019', 2225)
, ('10/8/2019', 2224)
, ('10/9/2019', 22155)
declare @avg int, @stdev float
select @avg = avg(reading)
, @stdev = stdev(reading) * 0.5
from #test
select t.*
, case when reading < @avg - @stdev then 'SUSPICIOUS - too low'
when reading > @avg + @stdev then 'SUSPICIOUS - too high'
when reading < prev_reading then 'SUSPICIOUS - decrease'
end Comment
from (select t.*, lag(reading) over (order by the_date) as prev_reading
from #test t
) t
Which results in:
the_date reading prev_reading Comment
2019-10-01 2000 NULL NULL
2019-10-02 2208 2000 NULL
2019-10-04 2208 2208 NULL
2019-10-07 2212 2208 NULL
2019-10-08 2 2212 SUSPICIOUS - too low
2019-10-08 2225 2 NULL
2019-10-08 2224 2225 SUSPICIOUS - decrease
2019-10-09 22155 2224 SUSPICIOUS - too high

Limit rows by first two dates within column. How?

Is it possible to write a query that returns only the first and second date for each customer? I tried UP TO 2 ROWS, but that only limits the whole result set to 2 rows.
SELECT knvv~kunnr vbak~vbeln vbak~erdat FROM vbak INNER JOIN knvv ON vbak~kunnr = knvv~kunnr.
The sample result of the above query would be:
Customer no. Document No Date
1 100000 01/01/18
1 200000 01/02/18
1 300000 01/03/18
1 400000 01/04/18
2 100001 01/01/18
2 200000 01/04/18
2 100040 01/06/18
But what I need is to limit the result to the first two dates per customer. The result must look like the one below, with only the first two dates of each customer. Is it possible to do this in the query?
Customer no. Document No Date
1 100000 01/01/18
1 200000 01/02/18
2 100001 01/01/18
2 200000 01/04/18
SELECT CustomerNo, DocumentNo, Date, (@Count := IF(@TempID - CustomerNo = 0, @Count + 1, 1)) Counter, (@TempID := CustomerNo) Tempid
FROM vbak, (SELECT @Count := 0) counter, (SELECT @TempID := 0) tempid
HAVING Counter <= 2 ORDER BY CustomerNo;
You can try this. Basically I declared two variables (@Count and @TempID), both set to 0.
Initially, for the first row, @TempID - CustomerNo = -1 makes the condition false, so the counter is set to 1 rather than incremented. Then @TempID is set to the current CustomerNo of that row.
The next row produces @TempID - CustomerNo = 0, which makes the condition true and increments @Count by 1.
So on and so forth.
The HAVING clause keeps rows where Counter is less than or equal to 2, which returns the desired results.
Hopefully this gives you an idea for your application.
I couldn't find a way to do this with a single query in OpenSQL. It just doesn't seem to offer the kind of sub-query or window function that would be required.
However, I noticed that you added the hana tag. With SAP HANA, this can be quite easily realized with an ABAP-Managed Database Procedure (AMDP) or an equivalent scripted Calculation View:
METHOD select BY DATABASE PROCEDURE FOR HDB LANGUAGE SQLSCRIPT
USING vbak.
lt_first_dates = SELECT kunnr,
min(erdat) AS erdat
FROM vbak
GROUP BY kunnr;
lt_second_dates = SELECT kunnr,
min(erdat) AS erdat
FROM vbak
WHERE (kunnr, erdat) NOT IN ( SELECT * FROM :lt_first_dates )
GROUP BY kunnr;
lt_first_two_dates = SELECT * FROM :lt_first_dates
UNION
SELECT * FROM :lt_second_dates;
et_result = SELECT src.kunnr,
src.vbeln,
src.erdat
FROM vbak AS src
WHERE (kunnr, erdat) IN ( SELECT * FROM :lt_first_two_dates )
ORDER BY kunnr, vbeln, erdat;
ENDMETHOD.
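As an aside, since this runs on HANA anyway, the body of the procedure could presumably be collapsed into a single window-function query. A rough sketch (untested, same tables; DENSE_RANK keeps ties on the same date, matching the date-based logic above):
et_result = SELECT kunnr, vbeln, erdat
            FROM ( SELECT kunnr, vbeln, erdat,
                          DENSE_RANK() OVER ( PARTITION BY kunnr ORDER BY erdat ) AS date_rank
                   FROM vbak ) AS ranked
            WHERE date_rank <= 2
            ORDER BY kunnr, vbeln, erdat;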

Outer table reference in sub-select

I have two tables, one that represents stock trades:
Blotter
TradeDate Symbol Shares Price
2014-09-02 ABC 100 157.79
2014-09-10 ABC 200 72.50
2014-09-16 ABC 100 36.82
and one that stores a history of stock splits for all symbols:
Splits
SplitDate Symbol Factor
2014-09-08 ABC 2
2014-09-15 ABC 2
2014-09-20 DEF 2
I am trying to write a report that reflects trades and includes what their current split adjustment factor should be. For these table values, I would expect the report to look like:
TradeDate Symbol Shares Price Factor
2014-09-02 ABC 100 157.79 4
2014-09-10 ABC 200 72.50 2
2014-09-16 ABC 100 36.82 1
The first columns are taken straight from Blotter - the Factor should represent the split adjustments that have taken place since the trade occurred (the Price is not split-adjusted).
Complicating matters is that each symbol could have multiple splits, which means I can't just OUTER JOIN the Splits table or I will start duplicating rows.
I have a subquery that I adapted from https://stackoverflow.com/a/3912258/3063706 to allow me to calculate the product of rows, grouped by symbol, but how do I only return the product of all Splits records with SplitDates occurring after the TradeDate?
A query like the following
SELECT tb.TradeDate, tb.Symbol, tb.Shares, tb.Price, ISNULL(s.Factor, 1) AS Factor
FROM Blotter tb
LEFT OUTER JOIN (
SELECT Symbol, EXP(Factor) AS Factor
FROM
(SELECT Symbol, SUM(LOG(ABS(NULLIF(Factor, 0)))) AS Factor
FROM Splits s
WHERE s.SplitDate > tb.TradeDate -- tb is unknown here
GROUP BY Symbol
) splits) s
ON s.Symbol = tb.Symbol
returns the error "Msg 4104, Level 16, State 1, Line 1 The multi-part identifier "tb.TradeDate" could not be bound."
Without the inner WHERE clause I get results like:
TradeDate Symbol Shares Price Factor
2014-09-02 ABC 100 157.79 4
2014-09-10 ABC 200 72.50 4
2014-09-16 ABC 100 36.82 4
Update: The trade rows in Blotter are not guaranteed to be unique, so I think that rules out one suggested solution using a GROUP BY.
One way without changing the logic too much is to put the factor calculation into a table valued function:
create function dbo.FactorForDate(
@Symbol char(4), @TradeDate datetime
) returns table as
return (
select
exp(Factor) as Factor
from (
select
sum(log(abs(nullif(Factor, 0)))) as Factor
from
Splits s
where
s.SplitDate > @TradeDate and
s.Symbol = @Symbol
) splits
);
select
tb.TradeDate,
tb.Symbol,
tb.Shares,
tb.Price,
isnull(s.Factor, 1) as Factor
from
Blotter tb
outer apply
dbo.FactorForDate(tb.Symbol, tb.TradeDate) s;
To do it in a single statement is going to be something like:
select
tb.TradeDate,
tb.Symbol,
tb.Shares,
tb.Price,
isnull(exp(sum(log(abs(nullif(factor, 0))))), 1) as Factor
from
Blotter tb
left outer join
Splits s
on s.Symbol = tb.Symbol and s.SplitDate > tb.TradeDate
group by
tb.TradeDate,
tb.Symbol,
tb.Shares,
tb.Price;
This will probably perform better if you can get it to work.
Apologies for any syntax errors, don't have access to SQL at the moment.
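As a quick sanity check of the EXP(SUM(LOG(...))) trick both queries rely on to aggregate a product, here is a standalone snippet (values are illustrative only) showing two splits of 2 combining into a factor of approximately 4:
SELECT EXP(SUM(LOG(v.Factor))) AS Product  -- LOG(2) + LOG(2) = LOG(4)
FROM (VALUES (2.0), (2.0)) AS v(Factor);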

How to consolidate blocks of time?

I have a derived table with a list of relative seconds to a foreign key (ID):
CREATE TABLE Times (
ID INT
, TimeFrom INT
, TimeTo INT
);
The table contains mostly non-overlapping data, but there are occasions where I have a TimeTo < TimeFrom of another record:
+----+----------+--------+
| ID | TimeFrom | TimeTo |
+----+----------+--------+
| 10 | 10 | 30 |
| 10 | 50 | 70 |
| 10 | 60 | 150 |
| 10 | 75 | 150 |
| .. | ... | ... |
+----+----------+--------+
The result set is meant to be a flattened linear idle report, but with too many of these overlaps I end up with negative time in use. I.e., if the window above for ID = 10 was 150 seconds long and I summed the differences of relative seconds to subtract from the window size, I'd wind up with 150-(20+20+90+75)=-55. I've tried this approach, and it is what led me to realize there were overlaps that needed to be flattened.
So, what I'm looking for is a solution to flatten the overlaps into one set of times:
+----+----------+--------+
| ID | TimeFrom | TimeTo |
+----+----------+--------+
| 10 | 10 | 30 |
| 10 | 50 | 150 |
| .. | ... | ... |
+----+----------+--------+
Considerations: Performance is very important here, as this is part of a larger query that will perform well on its own, and I'd rather not impact its performance much if I can help it.
On a comment regarding "Which seconds have an interval", this is something I have tried for the end result, and am looking for something with better performance. Adapted to my example:
SELECT SUM(C.N)
FROM (
SELECT A.N, ROW_NUMBER()OVER(ORDER BY A.N) RowID
FROM
(SELECT TOP 60 1 N FROM master..spt_values) A
, (SELECT TOP 720 1 N FROM master..spt_values) B
) C
WHERE EXISTS (
SELECT 1
FROM Times SE
WHERE SE.ID = 10
AND SE.TimeFrom <= C.RowID
AND SE.TimeTo >= C.RowID
AND EXISTS (
SELECT 1
FROM Times2 D
WHERE ID = SE.ID
AND D.TimeFrom <= C.RowID
AND D.TimeTo >= C.RowID
)
GROUP BY SE.ID
)
The problem I have with this solution is that I get a Row Count Spool out of the EXISTS query in the query plan, with a number of executions equal to COUNT(C.*). I left the real numbers in that query to illustrate that getting around this approach is for the best, because even with a Row Count Spool reducing the cost of the query by quite a bit, its execution count increases the cost of the query as a whole by quite a bit as well.
Further Edit: The end goal is to put this in a procedure, so Table Variables and Temp Tables are also a possible tool to use.
OK, I'm still trying to do this with just one SELECT. But this totally works:
DECLARE @tmp TABLE (ID INT, GroupId INT, TimeFrom INT, TimeTo INT)
INSERT INTO @tmp
SELECT ID, 0, TimeFrom, TimeTo
FROM Times
ORDER BY Id, TimeFrom
DECLARE @timeTo int, @id int, @groupId int
SET @groupId = 0
UPDATE @tmp
SET
@groupId = CASE WHEN id != @id THEN 0
WHEN TimeFrom > @timeTo THEN @groupId + 1
ELSE @groupId END,
GroupId = @groupId,
@timeTo = TimeTo,
@id = id
SELECT Id, MIN(TimeFrom), Max(TimeTo) FROM #tmp
GROUP BY ID, GroupId ORDER BY ID
Left join each row to its successor overlapping row on the same ID value (where such exist).
Now for each row in the result-set of LHS left join RHS the contribution to the elapsed time for the ID is:
isnull(RHS.TimeFrom,LHS.TimeTo) - LHS.TimeFrom as TimeElapsed
Summing these by ID should give you the correct answer.
Note that:
- where there isn't an overlapping successor row the calculation is simply
LHS.TimeTo - LHS.TimeFrom
- where there is an overlapping successor row the calculation will net to
(RHS.TimeFrom - LHS.TimeFrom) + (RHS.TimeTo - RHS.TimeFrom)
which simplifies to
RHS.TimeTo - LHS.TimeFrom
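A minimal sketch of that approach in T-SQL (assuming, as in the sample data, that each row overlaps at most its immediate successor):
SELECT LHS.ID,
       SUM(ISNULL(RHS.TimeFrom, LHS.TimeTo) - LHS.TimeFrom) AS TimeElapsed
FROM Times AS LHS
LEFT JOIN Times AS RHS
       ON  RHS.ID = LHS.ID
       AND RHS.TimeFrom > LHS.TimeFrom  -- a later row...
       AND RHS.TimeFrom < LHS.TimeTo    -- ...that overlaps this one
GROUP BY LHS.ID;
With the sample data for ID 10 this sums 20 + 10 + 15 + 75 = 120, the same total as the flattened intervals (10-30) and (50-150).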
What about something like below (assumes SQL 2008+ due to CTE):
WITH Overlaps
AS
(
SELECT t1.Id,
TimeFrom = MIN(t1.TimeFrom),
TimeTo = MAX(t2.TimeTo)
FROM dbo.Times t1
INNER JOIN dbo.Times t2 ON t2.Id = t1.Id
AND t2.TimeFrom > t1.TimeFrom
AND t2.TimeFrom < t1.TimeTo
GROUP BY t1.Id
)
SELECT o.Id,
o.TimeFrom,
o.TimeTo
FROM Overlaps o
UNION ALL
SELECT t.Id,
t.TimeFrom,
t.TimeTo
FROM dbo.Times t
INNER JOIN Overlaps o ON o.Id = t.Id
AND (o.TimeFrom > t.TimeFrom OR o.TimeTo < t.TimeTo);
I do not have a lot of data to test with but seems decent on the smaller data sets I have.
I also wrapped my head around this issue, and after all I found that the problem is your data.
You claim (if I get that right) that these entries should reflect the relative times when a user goes idle / comes back.
So you should consider sanitizing your data and refactoring your inserts to produce valid data sets.
For instance, the two lines:
+----+----------+--------+
| ID | TimeFrom | TimeTo |
+----+----------+--------+
| 10 | 50 | 70 |
| 10 | 60 | 150 |
how can it be possible that a user is idle until second 70, but goes idle again at second 60? This already implies that he was back at the latest around second 59.
I can only assume that this issue comes from different threads and/or browser windows (tabs) a user might be using your application with (each having its own "idle detection").
So instead of working around the symptoms, you should fix the cause! Why is this data entry inserted into the table? You could avoid this by simply checking whether the user is already idle before inserting a new row.
Create a unique key constraint on ID and TimeTo
Whenever an idle-event is detected, execute the following query:
INSERT IGNORE INTO Times (ID, TimeFrom, TimeTo) VALUES ('10', currentTimeStamp, -1);
-- (If the user is already "idle" - nothing will happen)
Whenever an comeback-event is detected, execute the following query:
UPDATE Times SET TimeTo=currentTimeStamp WHERE ID='10' and TimeTo=-1
-- (If the user is already "back" - nothing will happen)
The fiddle linked here: http://sqlfiddle.com/#!2/dcb17/1 would reproduce the chain of events for your example, but resulting in a clean and logical set of idle-windows:
ID TIMEFROM TIMETO
10 10 30
10 50 70
10 75 150
Note: The output is slightly different from the output you desired, but I feel that this is more accurate, because of the reason outlined above: a user cannot go idle at second 70 without returning from his current idle state before that. He either STAYS idle (and a second thread/tab runs into the idle event), or he returned in between.
Especially given your need to maximize performance, you should fix the data and not invent a work-around query. This is maybe 3 ms upon inserts, but could be worth 20 seconds upon select!
Edit: If multi-threading / multiple sessions is the cause of the wrong inserts, you would also need to implement a check that most_recent_come_back_time < now() - idleTimeout; otherwise a user might come back on tab1 and be recorded idle on tab2 a few seconds later, because tab2 ran into its idle timeout after the user only refreshed tab1.
I had the 'same' problem once with 'days' (additionally without counting weekends and holidays).
The word 'counting' gave me the following idea:
create table Seconds ( sec INT);
insert into Seconds values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9), ...
select count(distinct sec) from times t, seconds s
where s.sec between t.timefrom and t.timeto-1
and id=10;
You can cut the start to 0 (I put the '10' here in parentheses):
select count(distinct sec) from times t, seconds s
where s.sec between t.timefrom- (10) and t.timeto- (10)-1
and id=10;
and finally:
select count(distinct sec) from times t, seconds s,
(select min(timefrom) m from times where id=10) as m
where s.sec between t.timefrom-m.m and t.timeto-m.m-1
and id=10;
Additionally, you can "ignore" e.g. 10 seconds by dividing; you lose some precision but gain speed:
select count(distinct sec)*d.d from times t, seconds s,
(select min(timefrom) m from times where id=10) as m,
(select 10 d) as d
where s.sec between (t.timefrom-m.m)/d.d and (t.timeto-m.m)/d.d-1
and id=10;
Sure, it depends on the range you have to look at, but a 'day' or two of seconds should work (although I did not test it).
fiddle ...

Select random row from a PostgreSQL table with weighted row probabilities

Example input:
SELECT * FROM test;
id | percent
----+----------
1 | 50
2 | 35
3 | 15
(3 rows)
How would you write such a query that, on average, 50% of the time I get the row with id=1, 35% of the time the row with id=2, and 15% of the time the row with id=3?
I tried something like SELECT id FROM test ORDER BY p * random() DESC LIMIT 1, but it gives wrong results. After 10,000 runs I get a distribution like {1=6293, 2=3302, 3=405}, but I expected the distribution to be nearly {1=5000, 2=3500, 3=1500}.
Any ideas?
This should do the trick:
WITH CTE AS (
SELECT random() * (SELECT SUM(percent) FROM YOUR_TABLE) R
)
SELECT *
FROM (
SELECT id, SUM(percent) OVER (ORDER BY id) S, R
FROM YOUR_TABLE CROSS JOIN CTE
) Q
WHERE S >= R
ORDER BY id
LIMIT 1;
The sub-query Q gives the following result:
1 50
2 85
3 100
We then simply generate a random number in the range [0, 100) and pick the first row that is at or beyond that number (the WHERE clause). We use a common table expression (WITH) to ensure the random number is calculated only once.
BTW, the SELECT SUM(percent) FROM YOUR_TABLE allows you to have any weights in percent - they don't strictly need to be percentages (i.e. add up to 100).
[SQL Fiddle]
ORDER BY random() ^ (1.0 / p)
from the algorithm described by Efraimidis and Spirakis.
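Spelled out as a full query against the example table (my reading of that one-liner; Efraimidis-Spirakis keeps the rows with the largest keys, so the sort presumably needs to be descending):
SELECT id
FROM test
ORDER BY random() ^ (1.0 / percent) DESC
LIMIT 1;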
Branko's accepted solution is great (thanks!). However, I'd like to contribute an alternative that is just as performant (according to my tests), and perhaps easier to visualize.
Let's recap. The original question can perhaps be generalized as follows:
Given a map of ids and relative weights, create a query that returns a random id from the map, with probability proportional to its relative weight.
Note the emphasis on relative weights, not percent. As Branko points out in his answer, using relative weights will work for anything, including percents.
Now, consider some test data, which we'll put in a temporary table:
CREATE TEMP TABLE test AS
SELECT * FROM (VALUES
(1, 25),
(2, 10),
(3, 10),
(4, 05)
) AS test(id, weight);
Note that I'm using a more complicated example than the one in the original question, in that it does not conveniently add up to 100, and in that the same weight (10) is used more than once (for ids 2 and 3), which is important to consider, as you'll see later.
The first thing we have to do is turn the weights into probabilities from 0 to 1, which is nothing more than a simple normalization (weight / sum(weights)):
WITH p AS ( -- probability
SELECT *,
weight::NUMERIC / sum(weight) OVER () AS probability
FROM test
),
cp AS ( -- cumulative probability
SELECT *,
sum(p.probability) OVER (
ORDER BY probability DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS cumprobability
FROM p
)
SELECT
cp.id,
cp.weight,
cp.probability,
cp.cumprobability - cp.probability AS startprobability,
cp.cumprobability AS endprobability
FROM cp
;
This will result in the following output:
id | weight | probability | startprobability | endprobability
----+--------+-------------+------------------+----------------
1 | 25 | 0.5 | 0.0 | 0.5
2 | 10 | 0.2 | 0.5 | 0.7
3 | 10 | 0.2 | 0.7 | 0.9
4 | 5 | 0.1 | 0.9 | 1.0
The query above is admittedly doing more work than strictly necessary for our needs, but I find it helpful to visualize the relative probabilities this way, and it does make the final step of choosing the id trivial:
SELECT id FROM (queryabove)
WHERE random() BETWEEN startprobability AND endprobability;
Now, let's put it all together with a test that ensures the query is returning data with the expected distribution. We'll use generate_series() to generate a random number a million times:
WITH p AS ( -- probability
SELECT *,
weight::NUMERIC / sum(weight) OVER () AS probability
FROM test
),
cp AS ( -- cumulative probability
SELECT *,
sum(p.probability) OVER (
ORDER BY probability DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS cumprobability
FROM p
),
fp AS ( -- final probability
SELECT
cp.id,
cp.weight,
cp.probability,
cp.cumprobability - cp.probability AS startprobability,
cp.cumprobability AS endprobability
FROM cp
)
SELECT fp.id, count(*) AS count
FROM fp
CROSS JOIN (SELECT random() FROM generate_series(1, 1000000)) AS random(val)
WHERE random.val BETWEEN fp.startprobability AND fp.endprobability
GROUP BY fp.id
ORDER BY count DESC
;
This will result in output similar to the following:
id | count
----+--------
1 | 499679
3 | 200652
2 | 199334
4 | 100335
Which, as you can see, tracks the expected distribution perfectly.
Performance
The query above is quite performant. Even on my average machine, with PostgreSQL running in a WSL1 instance (the horror!), execution is relatively fast:
count | time (ms)
-----------+----------
1,000 | 7
10,000 | 25
100,000 | 210
1,000,000 | 1950
Adaptation to generate test data
I often use a variation of the query above when generating test data for unit/integration tests. The idea is to generate random data that approximates a probability distribution that tracks reality.
In that situation I find it useful to compute the start and end probabilities once and store the results in a table:
CREATE TEMP TABLE test AS
WITH test(id, weight) AS (VALUES
(1, 25),
(2, 10),
(3, 10),
(4, 05)
),
p AS ( -- probability
SELECT *, (weight::NUMERIC / sum(weight) OVER ()) AS probability
FROM test
),
cp AS ( -- cumulative probability
SELECT *,
sum(p.probability) OVER (
ORDER BY probability DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) cumprobability
FROM p
)
SELECT
cp.id,
cp.weight,
cp.probability,
cp.cumprobability - cp.probability AS startprobability,
cp.cumprobability AS endprobability
FROM cp
;
I can then use these precomputed probabilities repeatedly, which results in extra performance and simpler use.
I can even wrap it all in a function that I can call any time I want to get a random id:
CREATE OR REPLACE FUNCTION getrandomid(p_random FLOAT8 = random())
RETURNS INT AS
$$
SELECT id
FROM test
WHERE p_random BETWEEN startprobability AND endprobability
;
$$
LANGUAGE SQL STABLE STRICT
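Usage is then a single call, optionally passing an explicit value for reproducible tests:
SELECT getrandomid();      -- weighted-random id using the default random() argument
SELECT getrandomid(0.42);  -- deterministic pick for a given probability value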
Window function frames
It's worth noting that the technique above uses a window function with an explicit ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW frame rather than the default RANGE frame. This is necessary to deal with the fact that some weights might be repeated, which is why I chose test data with repeated weights in the first place!
Your proposed query appears to work; see this SQLFiddle demo. It creates the wrong distribution though; see below.
To prevent PostgreSQL from optimising the subquery away, I've wrapped it in a VOLATILE SQL function. PostgreSQL has no way to know that you intend the subquery to run once for every row of the outer query, so if you don't force it to be volatile it'll just execute it once. Another possibility - though one that the query planner might optimize out in future - is to make it appear to be a correlated subquery, like this hack that uses an always-true WHERE clause: http://sqlfiddle.com/#!12/3039b/9
At a guess (before you updated to explain why it didn't work), your testing methodology was at fault, or you were using this as a subquery in an outer query where PostgreSQL noticed it isn't a correlated subquery and executed it just once, like in this example.
UPDATE: The distribution produced isn't what you're expecting. The issue here is that you're skewing the distribution by taking multiple samples of random(); you need a single sample.
This query produces the correct distribution (SQLFiddle):
WITH random_weight(rw) AS (SELECT random() * (SELECT sum(percent) FROM test))
SELECT id
FROM (
SELECT
id,
sum(percent) OVER (ORDER BY id),
coalesce(sum(prev_percent) OVER (ORDER BY id),0) FROM (
SELECT
id,
percent,
lag(percent) OVER () AS prev_percent
FROM test
) x
) weighted_ids(id, weight_upper, weight_lower)
CROSS JOIN random_weight
WHERE rw BETWEEN weight_lower AND weight_upper;
Performance is, needless to say, horrible. It's using two nested sets of windows. What I'm doing is:
- creating (id, percent, previous_percent), then using that to create two running sums of weights that are used as range brackets; then
- taking a random value, scaling it to the range of weights, and then picking a value that has weights within the target bracket
Here is something for you to play with:
select t1.id as id1
, case when t2.id is null then 0 else t2.id end as id2
, t1.percent as percent1
, case when t2.percent is null then 0 else t2.percent end as percent2
from "Test1" t1
left outer join "Test1" t2 on t1.id = t2.id + 1
where random() * 100 between t1.percent and
case when t2.percent is null then 0 else t2.percent end;
Essentially, perform a left outer join so that you have two columns to apply a BETWEEN clause to.
Note that it will only work if you get your table ordered in the right way.
Based on Branko Dimitrijevic's answer, I wrote this query, which may or may not be faster: it uses the sum total of percent via tiered window functions (not unlike a ROLLUP).
WITH random AS (SELECT random() AS random)
SELECT id FROM (
SELECT id, percent,
SUM(percent) OVER (ORDER BY id) AS rank,
SUM(percent) OVER () * random AS roll
FROM test CROSS JOIN random
) t WHERE roll <= rank LIMIT 1
If the ordering isn't important, SUM(percent) OVER (ROWS UNBOUNDED PRECEDING) AS rank may be preferable because it avoids having to sort the data first.
I also tried Mechanic Wei's answer (as described in this paper, apparently), which seems very promising in terms of performance, but after some testing the distribution appears to be off:
SELECT id
FROM test
ORDER BY random() ^ (1.0/percent)
LIMIT 1