How to calculate bandwidth by SQL query

I have a table like this:
 first      | last       | bytes
------------+------------+-------
 1441013602 | 1441013602 |    10
 1441013602 | 1441013603 |    20
 1441013603 | 1441013605 |    30
 1441013610 | 1441013612 |    30
where
'first' is the switching time of the first packet of a traffic flow,
'last' is the switching time of the last packet of a traffic flow, and
'bytes' is the volume of the traffic flow.
How can I calculate the bandwidth usage for each second from 1441013602 to 1441013612?
I want this:
1441013602 20 B/s
1441013603 20 B/s
1441013604 10 B/s
1441013605 10 B/s
1441013606 0 B/s
1441013607 0 B/s
1441013608 0 B/s
1441013609 0 B/s
1441013610 10 B/s
1441013611 10 B/s
1441013612 10 B/s

You can use PostgreSQL's generate_series function for this. Generate a series of rows, one for each second, since that's what you want. Then left join on the table of info, so that you get one row for each second for each data flow. GROUP BY seconds, and sum the data flow bytes.
e.g.:
SELECT seconds.second,
       -- spread each flow's bytes evenly over its inclusive duration, then sum per second
       coalesce(sum(t.bytes::float8 / (t.last::float8 - t.first::float8 + 1)), 0) AS bytes_per_second
FROM generate_series(
       (SELECT min(t1.first) FROM table1 t1),
       (SELECT max(t1.last)  FROM table1 t1)
     ) seconds(second)
LEFT JOIN table1 t
       ON seconds.second BETWEEN t.first AND t.last
GROUP BY seconds.second
ORDER BY seconds.second;
http://sqlfiddle.com/#!15/b3b07/7
Note that we calculate the bytes per second of the flow, then sum that over the seconds of the flow across all flows. This only gives an estimate, since we don't know if the flow rate was steady over the flow duration.
For formatting the bytes, use the format function and/or pg_size_pretty.
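For example, a minimal sketch of that formatting (assuming PostgreSQL 9.4+ for the numeric overload of pg_size_pretty; the '/s' suffix is plain string concatenation):
-- pg_size_pretty needs bigint or numeric, hence the ::numeric cast
SELECT seconds.second,
       pg_size_pretty(coalesce(sum(t.bytes::numeric / (t.last - t.first + 1)), 0)) || '/s' AS bandwidth
FROM generate_series(
       (SELECT min(t1.first) FROM table1 t1),
       (SELECT max(t1.last)  FROM table1 t1)
     ) seconds(second)
LEFT JOIN table1 t
       ON seconds.second BETWEEN t.first AND t.last
GROUP BY seconds.second
ORDER BY seconds.second;
This prints values like '10 bytes/s' instead of raw numbers.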

Here is an approach, at SQL Fiddle.
PostgreSQL 9.3 Schema Setup:
create table t ( first int, last int, bytes int );
insert into t values
(1441013602 , 1441013602 , 10 ),
(1441013602 , 1441013603 , 20 ),
(1441013603 , 1441013605 , 30 ),
(1441013610 , 1441013612 , 30 );
The query:
with
bytes as (
  select first, last,
         (last - first) as calc_time,
         bytes
  from t
  where (last - first) > 0
),
bytes_per_second as (
  select first, last, bytes / calc_time as Bs
  from bytes
),
calc_interval as (
  select * from generate_series(1441013602, 1441013612)
)
select i.generate_series, bps.Bs
from calc_interval i
left outer join bytes_per_second bps
  on i.generate_series between bps.first and bps.last - 1
order by i.generate_series;
Results:
| generate_series | bs |
|-----------------|--------|
| 1441013602 | 20 |
| 1441013603 | 15 |
| 1441013604 | 15 |
| 1441013605 | (null) |
| 1441013606 | (null) |
| 1441013607 | (null) |
| 1441013608 | (null) |
| 1441013609 | (null) |
| 1441013610 | 15 |
| 1441013611 | 15 |
| 1441013612 | (null) |
Explanation:
bytes and bytes_per_second are for cleaning the data; it may be more accurate to compute an average.
calc_interval is a generator for your seconds.
The last select does the final calculation, joining the generated seconds with the bandwidth.
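If you would rather keep the single-second flows and average each flow over its inclusive duration (closer to what the first answer does), a sketch of the adjusted CTE could look like this:
with bytes_per_second as (
  -- inclusive duration (last - first + 1) keeps flows where first = last
  select first, last,
         bytes::float8 / (last - first + 1) as Bs
  from t
)
select i.generate_series, coalesce(sum(bps.Bs), 0) as Bs
from generate_series(1441013602, 1441013612) i
left outer join bytes_per_second bps
  on i.generate_series between bps.first and bps.last
group by i.generate_series
order by i.generate_series;
With this variant the 10-byte flow is no longer dropped and every second gets a value instead of null.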


SQL query: Fill missing values with 0

I have a table with gaps in data at certain times (for example, there is no data between 911137 and 911146). I need to fill in those gaps with 0 for better display on the frontend.
date | mydata
--------+-----------------
911130 | 10
911131 | 11
911132 | 9
911133 | 6
911134 | 5
911135 | 5
911136 | 10
911137 | 8
911146 | 4
911147 | 5
911148 | 9
911149 | 14
911150 | 8
The times are sequential integers (UNIX timestamps initially). I have aggregated my data into 5 minute time buckets.
The frontend query will pass in a start & end time and aggregate the data into larger buckets. For example:
SELECT
(five_min_time / 6) AS date,
SUM(mydata) AS mydata
FROM mydata_table_five_min
WHERE
five_min_time BETWEEN (1640000000 / 300) AND (1640086400 / 300)
GROUP BY date
ORDER BY date ASC;
I would like to be able to get a result:
date | mydata
--------+-----------------
911130 | 10
911131 | 11
911132 | 9
911133 | 6
911134 | 5
911135 | 5
911136 | 10
911137 | 8
911138 | 0
911139 | 0
911140 | 0
911141 | 0
911142 | 0
911143 | 0
911144 | 0
911145 | 0
911146 | 4
911147 | 5
911148 | 9
911149 | 14
911150 | 8
As a note, this query is being run in AWS Redshift.
Not sure if a recursive CTE works in Redshift, but something like this works in PostgreSQL.
with recursive rcte as (
select
min(half_hour) as n,
max(half_hour) as n_max
from cte_data
union all
select n+1, n_max
from rcte
where n < n_max
)
, cte_data as (
select
(five_min_time / 6) as half_hour,
sum(mydata) as mydata
from mydata_table_five_min
-- ::timestamp keeps the 12:00 time component; ::date would silently truncate it to midnight
where five_min_time between (date_part('epoch','2021-12-20 12:00'::timestamp)::int / 300)
and (date_part('epoch','2021-12-21 12:00'::timestamp)::int / 300)
group by half_hour
)
select n as date
--, to_timestamp(n*6*300) as dt
, coalesce(t.mydata, 0) as mydata
from rcte c
left join cte_data t
on t.half_hour = c.n
order by date;
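The rcte part on its own just counts from the smallest bucket to the largest one. A stripped-down sketch of that generator (PostgreSQL, with the sample's min and max hard-coded in place of the min/max over cte_data):
with recursive seq(n) as (
  select 911130
  union all
  select n + 1 from seq where n < 911150
)
select n from seq;
The full query simply left joins the aggregated data onto that gap-free sequence and coalesces missing buckets to 0.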

How to dynamically perform a weighted random row selection in PostgreSQL?

I have following table for an app where a student is assigned task to play educational game.
Student{id, last_played_datetime, total_play_duration, total_points_earned}
The app selects a student at random and assigns the task. The student earns a point just for playing the game. The app records the date and time when the game was played and for how long. I want to randomly select a student and assign the task; only one student can be assigned the task at a time. To give all students an equal opportunity, I dynamically calculate a weight for each student using the date and time the student last played the game, the total play duration, and the total points earned. A student is then chosen at random, influenced by that weight.
How do I, in PostgreSQL, randomly select a row from a table depending on the dynamically calculated weight of the row?
The weight for each student is calculated as follows: (minutes(current_datetime - last_played_datetime) * 0.75 + total_play_duration * 0.5 + total_points_earned * 0.25) / 1.5
Sample data:
+====+======================+=====================+=====================+
| Id | last_played_datetime | total_play_duration | total_points_earned |
+====+======================+=====================+=====================+
| 1 | 01/02/2011 | 300 mins | 7 |
+----+----------------------+---------------------+---------------------+
| 2 | 06/02/2011 | 400 mins | 6 |
+----+----------------------+---------------------+---------------------+
| 3 | 01/03/2011 | 350 mins | 8 |
+----+----------------------+---------------------+---------------------+
| 4 | 22/03/2011 | 550 mins | 9 |
+----+----------------------+---------------------+---------------------+
| 5 | 01/03/2011 | 350 mins | 8 |
+----+----------------------+---------------------+---------------------+
| 6 | 10/01/2011 | 130 mins | 2 |
+----+----------------------+---------------------+---------------------+
| 7 | 03/01/2011 | 30 mins | 1 |
+----+----------------------+---------------------+---------------------+
| 8 | 07/10/2011 | 0 mins | 0 |
+----+----------------------+---------------------+---------------------+
Here is a solution that works as follows:
first, compute the weight of each student
then, sum the weights of all students and multiply that sum by a random value between 0 and 1 to get a target weight
finally, pick the first student (ordered by id) whose weight is at or above that random target weight
Query:
with
student_with_weight as (
select
id,
(
extract(epoch from (now() - last_played_datetime)) / 60 * 0.75
+ total_play_duration * 0.5
+ total_points_earned * 0.25
) / 1.5 weight
from student
),
random_weight as (
select random() * (select sum(weight) weight from student_with_weight ) weight
)
select id
from
student_with_weight s
inner join random_weight r on s.weight >= r.weight
order by id
limit 1;
You can use a cumulative sum on the weights and compare it to random(). It looks like this:
with s as (
      select s.*,
             <your expression> as weight
      from student s
     )
select s.*
from (select s.*,
             sum(weight) over (order by weight) as running_weight,
             sum(weight) over () as total_weight
      from s
     ) s cross join
     (values (random())) r(rand)
where r.rand * total_weight >= running_weight - weight and
      r.rand * total_weight < running_weight;
The values() clause ensures that the random value is calculated only once for the query. Funky things can happen if you put random() in the where clause, because it will be recalculated for each comparison.
Basically, you can think of the cumulative sum as dividing the total weight into discrete regions. The random value then just picks one of them.
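A toy illustration of those regions (a sketch with made-up weights): weights 1, 3 and 6 give regions [0,1), [1,4) and [4,10), so a uniform draw over [0,10) lands on each row in proportion to its weight.
-- made-up weights purely for illustration
with w(id, weight) as (values (1, 1.0), (2, 3.0), (3, 6.0))
select id,
       weight,
       sum(weight) over (order by id) - weight as region_start,
       sum(weight) over (order by id)          as region_end
from w;
Comparing random() * total_weight against those boundaries is exactly what the where clause above does.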

Anyway to make case statements on rankings smarter?

I want to group my ranks into chunks of data. I thought about using a CASE expression, but that not only looks silly, it's also slow.
Any tips on how this can be improved?
Please note the chunks vary in size (first the top 100 listed individually, then chunks of 100, then chunks of 500, then one chunk of 5000 and 3 other chunks of 15K).
select
transaction_code
,row_number() over (order by SALES_AMOUNT desc) as rank
,SALES_AMOUNT
,CASE
WHEN rank <=100 THEN to_varchar(rank)
WHEN rank <=200 then '101-200'
WHEN rank <=300 then '201-300'
WHEN rank <=400 then '301-400'
WHEN rank <=500 then '401-500'
WHEN rank <=1000 then '501-1000'
WHEN rank <=1500 then '1001-1500'
WHEN rank <=2000 then '1501-2000'
WHEN rank <=2500 then '2001-2500'
WHEN rank <=3000 then '2501-3000'
WHEN rank <=3500 then '3001-3500'
WHEN rank <=4000 then '3501-4000'
WHEN rank <=4500 then '4001-4500'
WHEN rank <=5000 then '4501-5000'
WHEN rank <=5500 then '5001-5500'
WHEN rank <=6000 then '5501-6000'
WHEN rank <=6500 then '6001-6500'
WHEN rank <=7000 then '6501-7000'
WHEN rank <=7500 then '7001-7500'
WHEN rank <=8000 then '7501-8000'
WHEN rank <=8500 then '8001-8500'
WHEN rank <=9000 then '8501-9000'
WHEN rank <=9500 then '9001-9500'
WHEN rank <=10000 then '9501-10000'
WHEN rank <=15000 then '10001-15000'
WHEN rank <=30000 then '15001-30000'
WHEN rank <=45000 then '30001-45000'
WHEN rank <=60000 then '45001-60000'
ELSE 'Bottom'
END AS "TRANSACTION GROUPS"
The fastest way is to create a lookup table that maps rank into a group name. You could do it using a stateful JavaScript UDF (initializing the map just once).
But you can also just do it in SQL
Table definition
Simple mapping from a number to a string
create or replace table rank2group(rank integer, grp string);
UDF to generate group name
Your code is indeed very long.
Instead, we can create a function that, for a given rank, group_size, and group_base (the number from which the groups of group_size start), generates the group name string.
Note, this function will be slower than your code, as it generates a string from input, but we'll only use it to fill the lookup table, so it doesn't matter.
create or replace function group_name(rank integer, group_base integer, group_size integer)
returns varchar
as $$
(group_base + 1 + group_size * floor((rank - 1 - group_base) / group_size))
|| '-' ||
(group_base + group_size + group_size * floor((rank - 1 - group_base) / group_size))
$$;
Example outputs:
select group_name(101, 100, 100), group_name(1678, 500, 500), group_name(15000, 10000, 5000);
---------------------------+----------------------------+--------------------------------+
GROUP_NAME(101, 100, 100) | GROUP_NAME(1678, 500, 500) | GROUP_NAME(15000, 10000, 5000) |
---------------------------+----------------------------+--------------------------------+
101-200 | 1501-2000 | 10001-15000 |
---------------------------+----------------------------+--------------------------------+
Table data generation
We'll generate values that map range 1 .. 60000 only, using Snowflake generators, group_name, and your simplified CASE statement:
create or replace table rank2group(rank integer, grp string);
insert into rank2group
select rank,
CASE
WHEN rank <=100 THEN to_varchar(rank)
-- groups of size 100, starting at 100
WHEN rank <=500 then group_name(rank, 100, 100)
WHEN rank <=10000 then group_name(rank, 500, 500)
-- groups of size 5000, starting at 10000
WHEN rank <=15000 then group_name(rank, 10000, 5000)
WHEN rank <=60000 then group_name(rank, 15000, 15000)
ELSE 'Bottom'
END AS "TRANSACTION GROUPS"
from (
select row_number() over (order by 1) as rank
from table(generator(rowCount=>60000))
);
Usage
To use it, we simply join on rank.
Note, you need an outer join followed by ifnull for the Bottom values.
Example, using a generated input that produces rapidly increasing numbers:
with input as (
select 1 + (seq8() * seq8() * seq8()) AS rank
from table(generator(rowCount=>50))
)
select input.rank, ifnull(grp, 'Bottom') grp
from input left outer join rank2group on input.rank = rank2group.rank
order by input.rank;
--------+-------------+
RANK | GRP |
--------+-------------+
1 | 1 |
2 | 2 |
9 | 9 |
28 | 28 |
65 | 65 |
126 | 101-200 |
217 | 201-300 |
344 | 301-400 |
513 | 501-1000 |
730 | 501-1000 |
1001 | 1001-1500 |
1332 | 1001-1500 |
1729 | 1501-2000 |
2198 | 2001-2500 |
2745 | 2501-3000 |
3376 | 3001-3500 |
4097 | 4001-4500 |
4914 | 4501-5000 |
5833 | 5501-6000 |
6860 | 6501-7000 |
8001 | 8001-8500 |
9262 | 9001-9500 |
10649 | 10001-15000 |
12168 | 10001-15000 |
13825 | 10001-15000 |
15626 | 15001-30000 |
17577 | 15001-30000 |
19684 | 15001-30000 |
21953 | 15001-30000 |
24390 | 15001-30000 |
27001 | 15001-30000 |
29792 | 15001-30000 |
32769 | 30001-45000 |
35938 | 30001-45000 |
39305 | 30001-45000 |
42876 | 30001-45000 |
46657 | 45001-60000 |
50654 | 45001-60000 |
54873 | 45001-60000 |
59320 | 45001-60000 |
64001 | Bottom |
68922 | Bottom |
74089 | Bottom |
79508 | Bottom |
85185 | Bottom |
91126 | Bottom |
97337 | Bottom |
103824 | Bottom |
110593 | Bottom |
117650 | Bottom |
--------+-------------+
Possible optimization
If your ranges are always in multiples of 100, you can make the table 100 times smaller by storing only the values ending with 00, and then join on e.g. CEIL(rank)+1.
But then you also need to handle values 1..100 after the join, e.g. IFNULL(grp, IFF(rank <= 100, rank::varchar, 'Bottom'))
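A hedged sketch of that optimization (the table name rank2group_small and the exact join key are my interpretation, not from the answer; the input CTE is reused from the usage example above):
-- keep only the rows whose rank is a multiple of 100
create or replace table rank2group_small as
select rank, grp
from rank2group
where rank % 100 = 0;
-- join each rank to the upper bound of its 100-block, then patch the 1..100 and 'Bottom' cases
with input as (
  select 1 + (seq8() * seq8() * seq8()) as rank
  from table(generator(rowCount=>50))
)
select i.rank, iff(i.rank <= 100, i.rank::varchar, ifnull(s.grp, 'Bottom')) as grp
from input i
left outer join rank2group_small s
  on s.rank = ceil(i.rank / 100) * 100
order by i.rank;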

Workload distribution in Oracle SQL

I am trying to do a workload distribution in SQL, but it seems hard.
My data are:
work-station | workload
------------------------
Station1 | 500
Station2 | 450
Station3 | 50
Station4 | 600
Station5 | 2
Station6 | 350
And:
Real worker number: 5
My needs are the following:
The theoretical worker distribution must add up exactly to the real worker number (5).
I don't want to put someone at a station if it's not required (example: Station5).
I don't need to know whether my workers will be able to finish the complete workload.
I want the best theoretical placement of my workers, for the best productivity.
Is it possible to do this workload distribution in a SQL query?
Possible result :
work-station | workload | theoretical worker distribution
------------------------
Station1 | 500 | 1
Station2 | 450 | 1
Station3 | 50 | 0
Station4 | 600 | 2
Station5 | 2 | 0
Station6 | 350 | 1
Here is a very simplistic way to do it by prorating the workers by the percentage of total work assigned to each station.
The complexity comes from making sure that an integer number of workers is assigned and that the total number of assigned workers equals the number of workers that are available. Here is the query that does that:
with params as ( SELECT 5 total_workers FROM DUAL),
info ( station, workload) AS (
SELECT 'Station1', 500 FROM DUAL UNION ALL
SELECT 'Station2', 450 FROM DUAL UNION ALL
SELECT 'Station3', 50 FROM DUAL UNION ALL
SELECT 'Station4', 600 FROM DUAL UNION ALL
SELECT 'Station5', 2 FROM DUAL UNION ALL
SELECT 'Station6', 350 FROM DUAL ),
targets as (
select station,
workload,
-- What % of total work is assigned to station?
workload/sum(workload) over ( partition by null) pct_work,
-- How many workers (target_workers) would we assign if we could assign fractional workers?
total_workers * (workload/sum(workload) over ( partition by null)) target_workers,
-- Take the integer part of target_workers
floor(total_workers * (workload/sum(workload) over ( partition by null))) target_workers_floor,
-- Take the fractional part of target workers
mod(total_workers * (workload/sum(workload) over ( partition by null)),1) target_workers_frac
from params, info )
select t.station,
t.workload,
-- Start with the integer part of target workers
target_workers_floor +
-- Order the stations by the fractional part of target workers and assign 1 additional worker to each station until
-- the total number of workers assigned = the number of workers we have available.
case when row_number() over ( partition by null order by target_workers_frac desc )
<= total_workers - sum(target_workers_floor) over ( partition by null) THEN 1 ELSE 0 END target_workers
from params, targets t
order by station;
+----------+----------+----------------+
| STATION  | WORKLOAD | TARGET_WORKERS |
+----------+----------+----------------+
| Station1 |      500 |              1 |
| Station2 |      450 |              1 |
| Station3 |       50 |              0 |
| Station4 |      600 |              2 |
| Station5 |        2 |              0 |
| Station6 |      350 |              1 |
+----------+----------+----------------+
The query below should also work:
First I assign workers to the stations that have more workload than the mean workload.
Then I distribute the rest of the workers to the stations in descending order of their remaining workloads.
http://sqlfiddle.com/#!4/55491/12
The literal 5 represents the number of workers.
SELECT
workload,
station,
SUM (worker_count)
FROM
(
SELECT workload, station, floor( workload / ( SELECT SUM (workload) / 5 FROM work_station ) ) worker_count -- divide workers to the stations have more workload then mean
FROM
work_station works
UNION ALL
SELECT t_table.*, 1
FROM ( SELECT workload, station
FROM work_station
ORDER BY
( workload - floor( workload / ( SELECT SUM (workload) / 5 FROM work_station ) ) * ( SELECT SUM (workload) / 5 FROM work_station )
) DESC
) t_table
WHERE
rownum < ( 5 - ( SELECT SUM ( floor( workload / ( SELECT SUM (workload) / 5 FROM work_station ) ) ) FROM work_station ) + 1
) -- count of the rest of the workers
) table_sum
GROUP BY
workload,
station
ORDER BY
station

Calculate time difference between rows

I currently have a database in the following format
ID | DateTime | PID | TIU
1 | 2013-11-18 00:15:00 | 1551 | 1005
2 | 2013-11-18 00:16:03 | 1551 | 1885
3 | 2013-11-18 00:16:30 | 9110 | 75527
4 | 2013-11-18 00:22:01 | 1022 | 75
5 | 2013-11-18 00:22:09 | 1019 | 1311
6 | 2013-11-18 00:23:52 | 1022 | 89
7 | 2013-11-18 00:24:19 | 1300 | 44433
8 | 2013-11-18 00:38:57 | 9445 | 2010
I have a scenario where I need to identify gaps of more than 5 minutes between processes, using the DateTime column.
An example of what I am trying to achieve is:
ID | DateTime | PID | TIU
3 | 2013-11-18 00:16:30 | 9110 | 75527
4 | 2013-11-18 00:22:01 | 1022 | 75
7 | 2013-11-18 00:24:50 | 1300 | 44433
8 | 2013-11-18 00:38:57 | 9445 | 2010
ID3 is the last row before a 6 minute 1 second gap, ID4 is the next row after it.
ID7 is the last row before a 14 minute 7 second gap, ID8 is the next record available.
I am trying to do this in SQL; however, if need be, I can do the processing in C# instead.
I have tried a number of inner joins, but the table has over 3 million rows, so performance suffers greatly.
This is a CTE solution but, as has been indicated, this may not always perform well - because we're having to compute functions against the DateTime column, most indexes will be useless:
create table #t (ID int not null, [DateTime] datetime not null,
    PID int not null, TIU int not null)
insert into #t(ID,[DateTime],PID,TIU) values
(1,'2013-11-18 00:15:00',1551,1005 ),
(2,'2013-11-18 00:16:03',1551,1885 ),
(3,'2013-11-18 00:16:30',9110,75527 ),
(4,'2013-11-18 00:22:01',1022,75 ),
(5,'2013-11-18 00:22:09',1019,1311 ),
(6,'2013-11-18 00:23:52',1022,89 ),
(7,'2013-11-18 00:24:19',1300,44433 ),
(8,'2013-11-18 00:38:57',9445,2010 )
;With Islands as (
select ID as MinID,[DateTime],ID as RecID from #t t1
where not exists
(select * from #t t2
where t2.ID < t1.ID and --Or by date, if needed
--Use 300 seconds to avoid most transition issues
DATEDIFF(second,t2.[DateTime],t1.[DateTime]) < 300
)
union all
select i.MinID,t2.[DateTime],t2.ID
from Islands i
inner join
#t t2
on
i.RecID < t2.ID and
DATEDIFF(second,i.[DateTime],t2.[DateTime]) < 300
), Ends as (
select MinID,MAX(RecID) as MaxID from Islands group by MinID
)
select * from #t t
where exists(select * from Ends e where e.MinID = t.ID or e.MaxID = t.ID)
This also returns a row for ID 1, since that row has no preceding row within 5 minutes of it - but that should be easy enough to exclude in the final select, if needed.
I've assumed we can use ID as a proxy for increasing dates - that if for two rows, the ID is higher in the second row, then the DateTime will also be later.
Islands is a recursive CTE. The top half (the anchor) just selects rows which do not have any preceding row within 5 minutes of themselves. We select the ID twice for those rows and also keep the DateTime around.
In the recursive portion, we try to find a new row from the table that can be "added on" to an existing Islands row - based on this new row being no more than 5 minutes later than the current end-point of the island.
Once the recursion is complete, we then exclude the intermediate rows that the CTE produces. E.g. for the "4" island, it generated the following rows:
4,00:22:01,4
4,00:22:09,5
4,00:23:52,6
4,00:24:19,7
And all that we care about is that final row where we've identified an "island" of time from ID 4 to ID 7 - that's what the second CTE (Ends) is finding for us.
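As noted above, ID 1 only comes back because it starts the very first island. One sketch of the exclusion (replacing the final select of the same statement; it drops the leading boundary of the first island and the trailing boundary of the last island, leaving just the rows on either side of a real gap):
-- sketch only: keep island boundaries that actually sit next to a gap
select * from #t t
where exists(select * from Ends e
             where (e.MaxID = t.ID and e.MaxID <> (select max(e2.MaxID) from Ends e2))
                or (e.MinID = t.ID and e.MinID <> (select min(e2.MinID) from Ends e2)))
For the sample data this should return IDs 3, 4, 7 and 8, matching the desired output.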