How to dynamically perform a weighted random row selection in PostgreSQL?

I have the following table for an app where a student is assigned the task of playing an educational game.
Student{id, last_played_datetime, total_play_duration, total_points_earned}
The app selects a student at random and assigns the task. The student earns a point just for playing the game. The app records the date and time when the game was played and for how long. I want to randomly select a student and assign the task; at any given time only one student can be assigned the task. To give all students an equal opportunity, I dynamically calculate a weight for each student using the date and time the student last played the game, the total play duration, and the total points earned. A student is then chosen at random, with the choice influenced by that weight.
How do I, in PostgreSQL, randomly select a row from a table depending on the dynamically calculated weight of the row?
The weight for each student is calculated as follows: (minutes(current_datetime - last_played_datetime) * 0.75 + total_play_duration * 0.5 + total_points_earned * 0.25) / 1.5
Sample data:
+----+----------------------+---------------------+---------------------+
| Id | last_played_datetime | total_play_duration | total_points_earned |
+----+----------------------+---------------------+---------------------+
|  1 | 01/02/2011           | 300 mins            | 7                   |
|  2 | 06/02/2011           | 400 mins            | 6                   |
|  3 | 01/03/2011           | 350 mins            | 8                   |
|  4 | 22/03/2011           | 550 mins            | 9                   |
|  5 | 01/03/2011           | 350 mins            | 8                   |
|  6 | 10/01/2011           | 130 mins            | 2                   |
|  7 | 03/01/2011           | 30 mins             | 1                   |
|  8 | 07/10/2011           | 0 mins              | 0                   |
+----+----------------------+---------------------+---------------------+

Here is a solution that works as follows:
first compute the weight of each student,
sum the weights of all students and multiply the sum by a random number to get a target,
then pick the first student whose weight is at or above that random target weight.
Query:
with
  student_with_weight as (
    select
      id,
      (
        extract(epoch from (now() - last_played_datetime)) / 60 * 0.75
        + total_play_duration * 0.5
        + total_points_earned * 0.25
      ) / 1.5 as weight
    from student
  ),
  random_weight as (
    select random() * (select sum(weight) from student_with_weight) as weight
  )
select id
from student_with_weight s
inner join random_weight r on s.weight >= r.weight
order by id
limit 1;

You can use a cumulative sum on the weights and compare it to random(). It looks like this:
with s as (
      select s.*,
             <your expression> as weight
      from student s
     )
select s.*
from (select s.*,
             sum(weight) over (order by weight) as running_weight,
             sum(weight) over () as total_weight
      from s
     ) s cross join
     (values (random())) r(rand)
where r.rand * total_weight >= running_weight - weight and
      r.rand * total_weight < running_weight;
The values() clause ensures that the random value is calculated only once for the query. Funky things can happen if you put random() in the where clause, because it will be recalculated for each comparison.
Basically, you can think of the cumulative sum as dividing the total weight up into discrete regions. The random() value then just chooses one of them.
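For example, here is a sketch of that cumulative-sum pattern applied to the Student table from the question, with the question's weight expression plugged in (the ordering column for the running sum is arbitrary; id is used here):

with student_with_weight as (
    select id,
           ( extract(epoch from (now() - last_played_datetime)) / 60 * 0.75
             + total_play_duration * 0.5
             + total_points_earned * 0.25
           ) / 1.5 as weight
    from student
),
cumulated as (
    select id,
           weight,
           sum(weight) over (order by id) as running_weight,  -- cumulative weight up to and including this row
           sum(weight) over ()            as total_weight     -- sum of all weights
    from student_with_weight
)
select id
from cumulated
     cross join (values (random())) as r(rand)
where r.rand * total_weight >= running_weight - weight
  and r.rand * total_weight <  running_weight;

Each student occupies a slice of the interval [0, total_weight) proportional to their weight, and the single random() value picks the slice it lands in.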

Related

Workload distribution in Oracle SQL

I am trying to build a workload distribution in SQL, but it seems hard.
My data are :
work-station | workload
-------------+---------
Station1     |      500
Station2     |      450
Station3     |       50
Station4     |      600
Station5     |        2
Station6     |      350
And :
Real Worker Number : 5
My needs are the following:
I require an exact match between the real worker number and the theoretical worker number (the distribution must add up to exactly the available workers).
I don't want to put someone in a station if it is not required (example: Station5).
I don't need to know whether my workers will be able to finish the complete workload.
I want the best theoretical placement of my workers to get the best productivity.
Is it possible to do this workload distribution in a SQL query?
Possible result :
work-station | workload | theoretical worker distribution
-------------+----------+--------------------------------
Station1     |      500 | 1
Station2     |      450 | 1
Station3     |       50 | 0
Station4     |      600 | 2
Station5     |        2 | 0
Station6     |      350 | 1
Here is a very simplistic way to do it by prorating the workers by the percentage of total work assigned to each station.
The complexity comes from making sure that an integer number of workers is assigned and that the total number of assigned workers equals the number of workers that are available. Here is the query that does that:
with params as (
  SELECT 5 total_workers FROM DUAL
),
info ( station, workload ) AS (
  SELECT 'Station1', 500 FROM DUAL UNION ALL
  SELECT 'Station2', 450 FROM DUAL UNION ALL
  SELECT 'Station3', 50  FROM DUAL UNION ALL
  SELECT 'Station4', 600 FROM DUAL UNION ALL
  SELECT 'Station5', 2   FROM DUAL UNION ALL
  SELECT 'Station6', 350 FROM DUAL
),
targets as (
  select station,
         workload,
         -- What % of total work is assigned to the station?
         workload / sum(workload) over ( partition by null ) pct_work,
         -- How many workers (target_workers) would we assign if we could assign fractional workers?
         total_workers * ( workload / sum(workload) over ( partition by null ) ) target_workers,
         -- Take the integer part of target_workers
         floor( total_workers * ( workload / sum(workload) over ( partition by null ) ) ) target_workers_floor,
         -- Take the fractional part of target_workers
         mod( total_workers * ( workload / sum(workload) over ( partition by null ) ), 1 ) target_workers_frac
  from params, info
)
select t.station,
       t.workload,
       -- Start with the integer part of target workers
       target_workers_floor +
       -- Order the stations by the fractional part of target workers and assign 1 additional worker to each
       -- station until the total number of workers assigned = the number of workers we have available.
       case when row_number() over ( partition by null order by target_workers_frac desc )
              <= total_workers - sum(target_workers_floor) over ( partition by null )
            then 1 else 0 end target_workers
from params, targets t
order by station;
+----------+----------+----------------+
| STATION  | WORKLOAD | TARGET_WORKERS |
+----------+----------+----------------+
| Station1 |      500 |              1 |
| Station2 |      450 |              1 |
| Station3 |       50 |              0 |
| Station4 |      600 |              2 |
| Station5 |        2 |              0 |
| Station6 |      350 |              1 |
+----------+----------+----------------+
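To see how the numbers work out: the total workload is 1952, so the fractional worker shares for Station1 through Station6 are roughly 1.28, 1.15, 0.13, 1.54, 0.01 and 0.90. The integer parts assign 3 of the 5 workers; the 2 left over go to the largest fractional parts, 0.90 (Station6) and 0.54 (Station4), which is exactly the distribution shown above.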
The query below should work:
First I assign workers to the stations that have more workload than the mean workload.
Then I distribute the rest of the workers to the stations in descending order of their remaining workloads.
http://sqlfiddle.com/#!4/55491/12
5 represents the number of workers.
SELECT workload,
       station,
       SUM(worker_count)
FROM (
       -- give each station floor(workload / mean workload per worker) workers,
       -- i.e. assign workers to the stations that have more workload than the mean
       SELECT workload,
              station,
              floor( workload / ( SELECT SUM(workload) / 5 FROM work_station ) ) worker_count
       FROM work_station works
       UNION ALL
       -- hand the remaining workers, one each, to the stations with the largest remaining workload
       SELECT t_table.*, 1
       FROM ( SELECT workload, station
              FROM work_station
              ORDER BY ( workload
                         - floor( workload / ( SELECT SUM(workload) / 5 FROM work_station ) )
                           * ( SELECT SUM(workload) / 5 FROM work_station ) ) DESC
            ) t_table
       -- rownum limits this branch to the count of the rest of the workers
       WHERE rownum < ( 5 - ( SELECT SUM( floor( workload / ( SELECT SUM(workload) / 5 FROM work_station ) ) )
                              FROM work_station ) + 1 )
     ) table_sum
GROUP BY workload,
         station
ORDER BY station

SQL get the time of different rows

I want to write a select that gives me the time an employee spent resolving a ticket.
The problem is that the ticket is divided into actions, so it's not just about getting the time of one row; it can come from n rows.
This is an abbreviation of what I have:
Tickets
TicketID | Days | Hours | Minutes
------------------------------------------------
12 | 0 | 2 | 32
12 | 1 | 0 | 12
12 | 4 | 6 | 0
13 | 2 | 5 | 12
13 | 0 | 2 | 33
And this is what I want to get:
TicketID | Time (in minutes)
------------------------------------------------
12 | 2994
13 | 1425
(Or just one row when a specific TicketID is given in the WHERE clause.)
This is the select I'm doing right now:
select distinct ((Days*8)*60) + (Hours*60) + Minutes from Tickets where ticketid = 12
But it is not working as I want.
select ticketid, sum((Days*8)*60), sum((Hours*60)), sum (Minutes)
from tickets
group by ticketid
select TicketID, sum((Days*8)*60) + sum(Hours*60) + sum(Minutes) as Time_in_minutes
from Tickets
group by TicketID
Distinct, as you were trying before, takes each row in the source table (Tickets) and filters out all of the duplicate rows. Instead, you are trying to sum up the days, minutes, and hours for each ticket. So sum them up, and group by the ticket number.
Try this:
SELECT TicketID, (Sum(Minutes)+(Sum(Hours)*60)+(sum(Days)*24*60) ) time
FROM Tickets Group by TicketID
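As a quick sanity check against the sample data: ticket 13 totals 2 days, 7 hours and 45 minutes; with 8-hour working days that is 2*8*60 + 7*60 + 45 = 1425 minutes, which matches the expected output. So the Days*8*60 variants reproduce the asker's numbers, while the Days*24*60 variant assumes calendar days instead.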

Split rows into m parts and build the average of each part [SQLite]

I've got the following problem.
Given a table consisting basically of two columns, a timestamp and a value, I need to reduce the n rows between two timestamps down to m data rows by averaging both value and timestamp.
Let's say I want all data between times 15 and 85 in a maximum of 3 data rows.
time | value
10 | 7
20 | 6
30 | 2
40 | 9
50 | 4
60 | 3
70 | 2
80 | 9
90 | 2
Remove unneeded rows and split them into 3 parts
20 | 6
30 | 2
40 | 9
50 | 4
60 | 3
70 | 2
80 | 9
Average them
25 | 4
50 | 8
75 | 5.5
I know how to remove the unwanted rows with a WHERE clause, and how to average a given set of rows, but I can't think of a way to split the wanted dataset into m parts.
Any help and ideas appreciated!
I use SQLite, which doesn't make this any easier, and sadly I can't switch to any other dialect.
I tried to group the rows based on row number and row count, without success. The only other solution that came to my mind was getting the count of affected rows and UNIONing m SELECTs, each with its own limit and offset.
I've got it working. My problem was that I did integer division by accident.
The formula I use to group the items is:
group = floor ( row_number / (row_count / limit) )
The full SQL-query looks something like this:
SELECT
    avg(measurement_timestamp) AS measurement_timestamp,
    avg(measurement_value)     AS average
FROM (
    SELECT
        (SELECT COUNT(*)
         FROM measurements_table
         WHERE measurement_timestamp > start_time
           AND measurement_timestamp < end_time)       AS row_count,
        (SELECT COUNT(0)
         FROM measurements_table t1
         WHERE t1.measurement_timestamp < t2.measurement_timestamp
           AND t1.measurement_timestamp > start_time
           AND t1.measurement_timestamp < end_time)    AS row_number,
        *
    FROM measurements_table t2
    WHERE t2.measurement_timestamp > start_time
      AND t2.measurement_timestamp < end_time
    ORDER BY measurement_timestamp ASC)
GROUP BY CAST((row_number / (row_count / ?)) AS INT)
(Note: the value bound to ?, i.e. the number of parts, has to be passed as a float, otherwise the grouping falls back into the integer-division trap mentioned above.)
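If your SQLite is 3.25 or newer, window functions can replace the correlated COUNT subqueries. Here is a sketch under that assumption, keeping the same measurements_table, the same start_time/end_time placeholders, and ? for the number of parts:

SELECT avg(measurement_timestamp) AS measurement_timestamp,
       avg(measurement_value)     AS average
FROM (
    SELECT measurement_timestamp,
           measurement_value,
           -- 0-based position of the row inside the selected time range
           row_number() OVER (ORDER BY measurement_timestamp) - 1 AS rn,
           -- total number of rows in the range
           count(*) OVER () AS row_count
    FROM measurements_table
    WHERE measurement_timestamp > start_time
      AND measurement_timestamp < end_time
)
GROUP BY CAST(rn / (row_count * 1.0 / ?) AS INT)
ORDER BY measurement_timestamp;

The * 1.0 forces the division to happen in floating point, which sidesteps the same integer-division pitfall.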

SQL GROUP BY and differences on same field (for MS Access)

Hi, I have the following style of table under MS Access (I didn't make the table and can't change it):
Date_r | Id_Person |Points |Position
25/05/2015 | 120 | 2000 | 1
25/05/2015 | 230 | 1500 | 2
25/05/2015 | 100 | 500 | 3
21/12/2015 | 120 | 2200 | 1
21/12/2015 | 230 | 2000 | 4
21/12/2015 | 100 | 200 | 20
What I am trying to do is get a list of players (identified by Id_Person) ordered by the points difference between two dates.
So for example if I pick date1=25/05/2015 and date2=21/12/2015 I would get:
Id_Person |Points_Diff
230 | 500
120 | 200
100 |-300
I think I need something like:
SELECT Id_Person , MAX(Points)-MIN(Points)
FROM Table
WHERE date_r = #25/05/2015# or date_r = #21/12/2015#
GROUP BY Id_Person
ORDER BY MAX(Points)-MIN(Points) DESC
But my problem is that I don't really want to order by (MAX(Points)-MIN(Points)) but rather by (points at date2 - points at date1), which can be different because points can decrease over time.
One method is to use FIRST and LAST. However, these can sometimes produce strange results, so I think that conditional aggregation is best:
SELECT Id_Person,
       (MAX(IIF(date_r = #21/12/2015#, Points, 0)) -
        MAX(IIF(date_r = #25/05/2015#, Points, 0))
       ) as PointsDiff
FROM Table
WHERE date_r IN (#25/05/2015#, #21/12/2015#)
GROUP BY Id_Person
ORDER BY (MAX(IIF(date_r = #21/12/2015#, Points, 0)) -
          MAX(IIF(date_r = #25/05/2015#, Points, 0))
         ) DESC;
Because you have two dates, this is more easily written as:
SELECT Id_Person,
       SUM(IIF(date_r = #21/12/2015#, Points, -Points)) as PointsDiff
FROM Table
WHERE date_r IN (#25/05/2015#, #21/12/2015#)
GROUP BY Id_Person
ORDER BY SUM(IIF(date_r = #21/12/2015#, Points, -Points)) DESC;
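Checking this against the sample rows: player 230 gets 2000 - 1500 = 500, player 120 gets 2200 - 2000 = 200, and player 100 gets 200 - 500 = -300, which matches the expected output and its ordering.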

Select the difference of two consecutive columns

I have a table car that looks like this:
| mileage | carid |
------------------
| 30 | 1 |
| 50 | 1 |
| 100 | 1 |
| 0 | 2 |
| 70 | 2 |
I would like to get the average difference for each car. So for example for car 1 I would like to get ((50-30)+(100-50))/2 = 35. So I created the following query
SELECT AVG(diff),carid FROM (
SELECT (mileage-
(SELECT Max(mileage) FROM car Where mileage<mileage AND carid=carid GROUP BY carid))
AS diff,carid
FROM car GROUP BY carid)
But this doesn't work, as I'm not able to reference the current row's values from the subquery, and I'm quite clueless about how to solve this in a different way.
So how would I be able to obtain the value of the next row?
The average difference is the maximum minus the minimum, divided by one less than the count: the consecutive differences form a telescoping sum, so everything except the largest and smallest mileage cancels out (for car 1, (50-30)+(100-50) = 100-30 = 70, and 70/2 = 35).
Hence:
select carid,
       ( (max(mileage) - min(mileage)) / nullif(count(*) - 1, 0) ) as avg_diff
from car
group by carid;
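On databases that support window functions, you can also compute the consecutive differences explicitly with LAG, which gives the same 35 for car 1. A sketch, assuming the car table from the question and that "consecutive" means ordered by mileage:

select carid, avg(diff) as avg_diff
from (
    select carid,
           -- difference between this row's mileage and the previous one for the same car
           mileage - lag(mileage) over (partition by carid order by mileage) as diff
    from car
) d
where diff is not null
group by carid;

For a car with a single row this returns no row at all, whereas the max/min version above returns NULL because of the nullif.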