Postgres where clause over two columns - sql

Database: I am working in Postgres 9.6.5.
I am analyzing data from the US Airport Authority (RITA) about flight arrivals and departures.
This link (http://stat-computing.org/dataexpo/2009/the-data.html) lists all the columns in the table.
The table has the following 29 columns:
No Name Description
1 Year 1987-2008
2 Month 1-12
3 DayofMonth 1-31
4 DayOfWeek 1 (Monday) - 7 (Sunday)
5 DepTime actual departure time (local, hhmm)
6 CRSDepTime scheduled departure time (local, hhmm)
7 ArrTime actual arrival time (local, hhmm)
8 CRSArrTime scheduled arrival time (local, hhmm)
9 UniqueCarrier unique carrier code
10 FlightNum flight number
11 TailNum plane tail number
12 ActualElapsedTime in minutes
13 CRSElapsedTime in minutes
14 AirTime in minutes
15 ArrDelay arrival delay, in minutes
16 DepDelay departure delay, in minutes
17 Origin origin IATA airport code
18 Dest destination IATA airport code
19 Distance in miles
20 TaxiIn taxi in time, in minutes
21 TaxiOut taxi out time in minutes
22 Cancelled was the flight cancelled?
23 CancellationCode reason for cancellation (A = carrier, B = weather, C = NAS, D = security)
24 Diverted 1 = yes, 0 = no
25 CarrierDelay in minutes
26 WeatherDelay in minutes
27 NASDelay in minutes
28 SecurityDelay in minutes
29 LateAircraftDelay in minutes
There are about a million rows for each year.
I am trying to find the busiest airports, counting only flights delayed by more than 15 minutes.
The DepDelay column has the delay time.
Origin is the origin code for the airport.
All the data has been loaded into a table called 'ontime'.
I am forming the query as follows in stages.
Step 1: select airports where the delay is more than 15 minutes
select origin,year,count(*) as depdelay_count from ontime
where
depdelay > 15
group by year,origin
order by depdelay_count desc;
Step 2: Now I wish to pull out only the top 10 airports per year, which I am doing as follows:
select x.origin,x.year from (with subquery as (
select origin,year,count(*) as depdelay_count from ontime
where
depdelay > 15
group by year,origin
order by depdelay_count desc
)
select origin,year,rank() over (partition by year order by depdelay_count desc) as rank from subquery) x where x.rank <= 10;
Step 3: Now that I have the top 10 airports by depdelay, I wish to get a count of the total flights out of these airports.
select origin,count(*) from ontime where origin in
(select x.origin from (with subquery as (
select origin,year,count(*) as depdelay_count from ontime
where
depdelay > 15
group by year,origin
order by depdelay_count desc
)
select origin,year,rank() over (partition by year order by depdelay_count desc) as rank from subquery) x where x.rank <= 2)
group by origin
order by origin;
I can get the counts for a single year by modifying the Step 3 query to add the year to the where clause:
-- <YEAR> will be any value from 1987 to 2008
select origin,count(*) from ontime where year = (<YEAR>) and origin in
(select x.origin from (with subquery as (
select origin,year,count(*) as depdelay_count from ontime
where
depdelay > 15
group by year,origin
order by depdelay_count desc
)
select origin,year,rank() over (partition by year order by depdelay_count desc) as rank from subquery) x where x.rank <= 2)
group by origin
order by origin;
But I have to do this manually for all years from 1987 to 2008 which I want to avoid.
Please can you help refine the query so that I can select the data for all the years without having to select each year manually.

I find CTEs in the middle of queries to be confusing. You can basically do this with one CTE/subquery:
with oy as (
select origin, year, count(*) as numflights,
sum( (depdelay > 15)::int ) as depdelay_count,
row_number() over (partition by year order by sum( (depdelay > 15)::int ) desc) as seqnum
from ontime
group by origin, year
)
select oy.*
from oy
where seqnum <= 10;
Note the use of conditional aggregation and using window functions with aggregation functions.
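To see the conditional aggregation and the per-year ranking on a small scale, here is a minimal sketch that runs a similar query through Python's built-in sqlite3 module. The sample rows are hypothetical, SQLite stands in for Postgres (it needs SQLite 3.25 or later for window functions), `sum(depdelay > 15)` replaces the Postgres cast `sum((depdelay > 15)::int)` because SQLite comparisons already yield 0 or 1, and the ranking is moved into a second CTE purely to keep the sketch portable.

```python
import sqlite3

# Hypothetical miniature of the 'ontime' table: (year, origin, depdelay).
rows = [
    (2007, "ORD", 30), (2007, "ORD", 20), (2007, "ORD", 5),
    (2007, "ATL", 40), (2007, "ATL", 0),
    (2007, "DFW", 3),
    (2008, "ATL", 25), (2008, "ATL", 16), (2008, "ATL", 2),
    (2008, "ORD", 50), (2008, "ORD", 1),
    (2008, "DFW", 0),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table ontime (year int, origin text, depdelay int)")
conn.executemany("insert into ontime values (?, ?, ?)", rows)

TOP_N = 2  # the answer uses 10; 2 keeps the sample output short

query = """
with oy as (
    -- one row per airport and year: all flights, and flights delayed > 15 min
    select origin, year,
           count(*) as numflights,
           sum(depdelay > 15) as depdelay_count
    from ontime
    group by origin, year
),
ranked as (
    -- rank airports within each year by number of delayed flights
    select oy.*,
           row_number() over (partition by year
                              order by depdelay_count desc) as seqnum
    from oy
)
select year, origin, numflights, depdelay_count
from ranked
where seqnum <= ?
order by year, seqnum
"""

top = conn.execute(query, (TOP_N,)).fetchall()
for row in top:
    print(row)
```

Because every year is ranked in its own partition, a single pass over the table covers 1987 to 2008 without substituting each year by hand.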

Related

How do I use SQL to perform a cumulative sum where the increments have an expiration?

Say the scenario is this:
I have a database of student infractions. When a student is late to class, or misses a homework assignment they get an infraction.
student_id  infraction_type    day
1           tardy              0
2           missed_assignment  0
1           tardy              29
2           missed_assignment  15
1           tardy              99
2           missed_assignment  29
The school has a three-strike system: at each infraction a disciplinary action is taken. Call them D0, D1, D2.
Infractions expire after 30 days.
I want to be able to perform a query to calculate the total counts of disciplinary actions taken in a given time period.
So the number of disciplinary actions taken in the last 100 days (at day 99) would be
disciplinary_action  count
D0                   3
D1                   2
D2                   1
A table generated showing the disciplinary actions taken would look like:
student_id  infraction_type    day  disciplinary_action_gen
1           tardy              0    D0
2           missed_assignment  0    D0
1           tardy              29   D1
2           missed_assignment  15   D1
1           tardy              99   D0
2           missed_assignment  29   D2
What SQL query could I use to do such a cumulative sum?
You can solve your problem by checking in the following order:
if fewer than 30 days have passed since the infraction before last, assign D2
if fewer than 30 days have passed since the last infraction, assign D1
otherwise assign D0 (it counts as a first infraction)
This will work assuming your DBMS supports the tools used for this solution, namely:
the CASE expression, to conditionally assign infraction values
the LAG window function, to retrieve the previous "day" values
SELECT *,
CASE WHEN day - LAG(day,2) OVER(PARTITION BY student_id
ORDER BY day ) < 30 THEN 'D2'
WHEN day - LAG(day,1) OVER(PARTITION BY student_id
ORDER BY day ) < 30 THEN 'D1'
ELSE 'D0'
END AS disciplinary_action_gen
FROM tab
Check a MySQL demo here.
A similar approach using COUNT() as a window function and a frame definition -
SELECT
*,
CONCAT(
'D',
LEAST(
3,
COUNT(*) OVER (
PARTITION BY student_id
ORDER BY day ASC
RANGE BETWEEN 30 PRECEDING AND CURRENT ROW
)
) - 1
) AS disciplinary_action_gen
FROM infractions;
The frame definition (RANGE BETWEEN 30 PRECEDING AND CURRENT ROW) tells the server that we want to include all rows with a day value between (current row's value of day - 30) and (the current row's value of day). So, if the current row has a day value of 99, the count will be for all rows in the partition with a day value between 69 and 99.
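As a rough illustration of what that frame counts, the following Python sketch reproduces the same logic on the sample rows from the question. The function name and its parameters are hypothetical, and it is a plain-Python stand-in for the window function, not a replacement for the query.

```python
from collections import Counter

# Sample rows from the question: (student_id, infraction_type, day).
infractions = [
    (1, "tardy", 0), (2, "missed_assignment", 0),
    (1, "tardy", 29), (2, "missed_assignment", 15),
    (1, "tardy", 99), (2, "missed_assignment", 29),
]

def disciplinary_actions(rows, window=30, max_level=2):
    """Label each row D0..D2 by counting the student's rows in [day - window, day]."""
    labelled = []
    for student, kind, day in rows:
        # Mirrors RANGE BETWEEN 30 PRECEDING AND CURRENT ROW within one student_id.
        in_frame = sum(
            1 for s, _, d in rows
            if s == student and day - window <= d <= day
        )
        # Mirrors LEAST(3, COUNT(*)) - 1: the level is capped at D2.
        level = min(in_frame, max_level + 1) - 1
        labelled.append((student, kind, day, f"D{level}"))
    return labelled

actions = disciplinary_actions(infractions)
totals = Counter(action for *_, action in actions)
print(actions)
print(totals)
```

On the sample data this gives three D0, two D1 and one D2, matching the counts in the question.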
To get the disciplinary counts, we can simply wrap this in a normal GROUP BY -
SELECT disciplinary_action, COUNT(*) AS count
FROM (
SELECT
CONCAT(
'D',
LEAST(
3,
COUNT(*) OVER (
PARTITION BY student_id
ORDER BY day ASC
RANGE BETWEEN 30 PRECEDING AND CURRENT ROW
)
) - 1
) AS disciplinary_action
FROM infractions
) t
GROUP BY disciplinary_action;
If your infractions are stored with a date, as opposed to the days in your example, this can be easily updated to use a date interval in the frame definition. And, if looking at counts of disciplinary actions in the last 100 days we need to include the previous 30 days, as these could impact the action (D0, D1 or D2) on the first day we are interested in.
SELECT disciplinary_action, COUNT(*) AS count
FROM (
SELECT
`date`,
CONCAT(
'D',
LEAST(
3,
COUNT(*) OVER (
PARTITION BY student_id
ORDER BY `date` ASC
RANGE BETWEEN INTERVAL 30 DAY PRECEDING AND CURRENT ROW
)
) - 1
) AS disciplinary_action
FROM infractions
WHERE `date` >= CURRENT_DATE - INTERVAL 130 DAY
) t
WHERE `date` >= CURRENT_DATE - INTERVAL 100 DAY
GROUP BY disciplinary_action;
Here's a db<>fiddle

How to use percentile and rank to give avg days for 90 percent of orders

I have a table called order and it has 3 columns I'm interested in: order ID, day order placed, day fulfilled. The order ID is unique.
I need to find out how many days, on average, 90% of the orders placed in January 2016 took to be paid.
If order 1 was fulfilled in 1 day, order 2 in 2 days, order 3 in 3 days... order 10 in 10 days, then I would need to calculate as such:
number of orders = 10
90% of 10 = 9
the first 9 of those 10 orders that were fulfilled, when arranged in ascending order, took: 1+2+3+4+5+6+7+8+9 = 45 days to fulfill
hence, avg day for first 90% of orders fulfilled is: 45/9 = 5 days.
How can I write a query to first arrange orders by "number of days to fulfill" and then calculate avg days it took for the first 90% of orders for that period?
First, we would have to assume that most of the orders have been filled from January.
Second, you can do this with analytic functions. Although the percentile functions work, I usually do this the old fashioned way . . . by using row_number() and count(*):
select min(days)
from (select (coalesce(datefulfilled, trunc(sysdate)) - dateordered) as days,
-- running count of orders fulfilled within this many days or fewer
sum(count(*)) over (order by (coalesce(datefulfilled, trunc(sysdate)) - dateordered)) as cumecnt,
-- total number of orders
sum(count(*)) over () as totalcnt
from orders o
group by (coalesce(datefulfilled, trunc(sysdate)) - dateordered)
) d
where cumecnt >= 0.9 * totalcnt;
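Note that the query above returns the smallest number of days within which 90% of the orders were fulfilled, which is a percentile rather than the average described in the question. For comparison, here is a minimal Python sketch of the question's own arithmetic. The function name is hypothetical, and rounding the 90% count down when it is not a whole number is an assumption, since the question does not say how to handle that case.

```python
def avg_days_fastest_percent(days_to_fulfil, percent=90):
    """Average fulfilment time of the fastest `percent` percent of orders."""
    ordered = sorted(days_to_fulfil)
    # Integer arithmetic avoids floating-point surprises; rounds down (assumption).
    keep = len(ordered) * percent // 100
    if keep == 0:
        raise ValueError("not enough orders to take the requested share")
    return sum(ordered[:keep]) / keep

# The example from the question: order 1 took 1 day, ..., order 10 took 10 days.
days = list(range(1, 11))
average = avg_days_fastest_percent(days)
print(average)
```

With the ten orders from the question, the nine fastest take 45 days in total, so the average is 5 days.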

How can I do this in SQL?

Today, I need your help.
I have a stats website, I get data from Game Webservices.
I want to implement a new function but I don't know how.
I want to guess players' connection hours.
I have a script which collects data every hour and stores this data in a table.
Imagine that I have a table with: player_id, score and the hour (Integer, just H), and the day number of the month.
Then, for example, if the score at hour 18 differs from the score at hour 17, the player has been connected to his account.
To simplify, imagine that I have a table with day from 1 to 31 and hour from 0 to 23 for every day.
At the end of the month I need to execute a query to calculate for each hour, the number of days the player has been connected during this hour.
Example :
0 => 31 The player has been connected between 23 and 0: every day
1 => 3 The player has been connected between 0 and 1: 3 days a month
2 => 5 The player has been connected between 1 and 2: 5 days a month
3 => 10 The player has been connected between 2 and 3: 10 days a month
...
23 => 4
I think I can ORDER BY days and hour and player_id from day 1 hour 0 to day 31 hour 23
And do a first SELECT with a CASE like :
SELECT
table.*,
(CASE WHEN ACTUAL_ROW.score!=PREVIOUS_ROW.score THEN 1 ELSE 0 END) AS active
FROM table
to know for each row whether the player has been connected.
After that, it is simple to do a GROUP BY and a SUM for each hour.
But I don't know how to compare the previous row with the current one.
Do you have any idea or hint on how to do this? Is PL/SQL better for this?
Note: I'm using PostgreSQL.
Thanks
You can access the previous row of the table with LAG window function.
Try using something like
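SELECT player_id, count(CASE WHEN score > prev_score THEN 1 END)
FROM(
-- PARTITION BY keeps each player's previous score separate from other players'
SELECT player_id, score, mm, hh, LAG(score) OVER (PARTITION BY player_id ORDER BY mm,hh) as prev_score
FROM your_table) t
GROUP BY player_id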
Additional advice: store full timestamps instead of separate day and hour fields. You can always get the day and hour from a timestamp with functions.
Manual on window functions: one, two
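To make the previous-row comparison concrete, here is a small Python sketch of the same idea as LAG, extended to the per-hour count the question asks for. The snapshot rows and the function name are hypothetical. Like LAG, it compares each snapshot with the player's previous stored snapshot, whatever the gap between them.

```python
from collections import defaultdict

# Hypothetical hourly snapshots: (player_id, day, hour, score).
snapshots = [
    (1, 1, 22, 100), (1, 1, 23, 100), (1, 2, 0, 120), (1, 2, 1, 120),
    (1, 2, 23, 120), (1, 3, 0, 150),
    (2, 1, 23, 10), (2, 2, 0, 10),
]

def active_days_per_hour(rows):
    """Count, per (player, hour), the snapshots whose score differs from the previous one."""
    counts = defaultdict(int)
    previous = {}  # player_id -> score of that player's previous snapshot
    # Sort by player, then day, then hour, like PARTITION BY player ORDER BY day, hour.
    for player, day, hour, score in sorted(rows, key=lambda r: (r[0], r[1], r[2])):
        if player in previous and score != previous[player]:
            counts[(player, hour)] += 1
        previous[player] = score
    return dict(counts)

activity = active_days_per_hour(snapshots)
print(activity)
```

Since there is at most one snapshot per player, day and hour, each count is the number of days on which the score changed during that hour.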
The problem here is that we're not checking when the player "was connected"
but instead when the player "earned points", which can be similar - or not;
and this at intervals of one hour (three logins in one hour count as one).
Likewise, a player who stays logged in for three hours and accrues points in that
period will show up as "logged in" at one, two or three data points, depending.
With those caveats, we can JOIN the score table with itself:
SELECT a.player_id, a.day, a.hour, a.score - b.score AS chg
FROM cdata AS a
JOIN cdata AS b
ON (
(a.player_id = b.player_id AND a.score != b.score)
AND (
(a.hour > 0 AND a.day = b.day AND b.hour = a.hour-1)
OR
(a.hour = 0 AND a.day = b.day+1 AND b.hour = 23)
)
)
This will yield a series of statistics for the user, with the day and hour when his
score changed.
You can use this in a collecting subSELECT
SELECT player_id, hour, COUNT(player_id) FROM ( ... ) AS changes
GROUP BY player_id, hour
ORDER BY player_id, hour;
and this will return, for each player and hour, a count between 1 and 31. Hours with no logins will
not be counted.
I have attempted to provide a test case with this SQLFiddle. The above is not PostgreSQL specific, you can optimize the inner query using PostgreSQL window functions.

query to display additional column based on aggregate value

I've been mulling over this problem for a couple of hours now with no luck, so I thought people on SO might be able to help :)
I have a table with data regarding processing volumes at stores. The first three columns shown below can be queried from that table. What I'm trying to do is add a 4th column that's basically a flag for whether a store has processed >=$150 and, if so, displays the corresponding date. The date that gets displayed is the first instance where the store's volume surpasses $150. Processing volumes after that first instance don't count. For example, for store 4, there's just one instance of the activated date.
store_id sales_volume date activated_date
----------------------------------------------------
2 5 03/14/2012
2 125 05/21/2012
2 30 11/01/2012 11/01/2012
3 100 02/06/2012
3 140 12/22/2012 12/22/2012
4 300 10/15/2012 10/15/2012
4 450 11/25/2012
5 100 12/03/2012
Any insights as to how to build out this fourth column? Thanks in advance!
The solution starts by calculating the cumulative sales. Then, you want the activation date only when the cumulative sales first pass through the $150 level. This happens when adding the current sales amount pushes the cumulative amount over the threshold. The following case expression handles this.
select t.store_id, t.sales_volume, t.date,
(case when 150 > cumesales - t.sales_volume and 150 <= cumesales
then date
end) as ActivationDate
from (select t.*,
sum(sales_volume) over (partition by store_id order by date) as cumesales
from t
) t
If you have an older version of Postgres that does not support cumulative sum, you can get the cumulative sales with a subquery like:
(select sum(sales_volume) from t t2 where t2.store_id = t.store_id and t2.date <= t.date) as cumesales
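The threshold test can be checked against the rows from the question with a short Python sketch. The function name is hypothetical and the dates are entered as `date` objects so that they sort chronologically. This only mirrors the running-sum logic and is not a substitute for the query.

```python
from datetime import date
from itertools import accumulate, groupby
from operator import itemgetter

# Rows from the question: (store_id, sales_volume, date).
sales = [
    (2, 5, date(2012, 3, 14)), (2, 125, date(2012, 5, 21)), (2, 30, date(2012, 11, 1)),
    (3, 100, date(2012, 2, 6)), (3, 140, date(2012, 12, 22)),
    (4, 300, date(2012, 10, 15)), (4, 450, date(2012, 11, 25)),
    (5, 100, date(2012, 12, 3)),
]

def activation_dates(rows, threshold=150):
    """Add a fourth field: the date on the row where the running sum first reaches the threshold."""
    result = []
    ordered = sorted(rows, key=lambda r: (r[0], r[2]))
    for store, group in groupby(ordered, key=itemgetter(0)):
        group = list(group)
        # Running sum per store, like sum(...) over (partition by store_id order by date).
        running = accumulate(volume for _, volume, _ in group)
        for (_, volume, day), cumesales in zip(group, running):
            # Same test as the case expression: below the threshold before, at or above it after.
            crossed = cumesales - volume < threshold <= cumesales
            result.append((store, volume, day, day if crossed else None))
    return result

activated = activation_dates(sales)
for row in activated:
    print(row)
```

On these rows, stores 2, 3 and 4 get exactly one activation date each and store 5 gets none, as in the expected output.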
Variant 1
You can LEFT JOIN to a derived table that calculates the first date surpassing the $150 limit per store:
SELECT t.*, b.activated_date
FROM tbl t
LEFT JOIN (
SELECT store_id, min(thedate) AS activated_date
FROM (
SELECT store_id, thedate
,sum(sales_volume) OVER (PARTITION BY store_id
ORDER BY thedate) AS running_sum
FROM tbl
) a
WHERE running_sum >= 150
GROUP BY 1
) b ON t.store_id = b.store_id AND t.thedate = b.activated_date
ORDER BY t.store_id, t.thedate;
The calculation of the first day has to be done in two steps, since the window function accumulating the running sum has to be applied in a separate SELECT.
Variant 2
Another window function instead of the LEFT JOIN. May or may not be faster. Test with EXPLAIN ANALYZE.
SELECT *
,CASE WHEN running_sum >= 150 AND thedate = first_value(thedate)
OVER (PARTITION BY store_id, running_sum >= 150 ORDER BY thedate)
THEN thedate END AS activated_date
FROM (
SELECT *
,sum(sales_volume)
OVER (PARTITION BY store_id ORDER BY thedate) AS running_sum
FROM tbl
) b
ORDER BY store_id, thedate;
->sqlfiddle demonstrating both.

Select top Values For each day of Specified month

I am sorry if the question is silly; I'm new to SQL Server. I want to select the top 5 records for each day of a specified month.
e.g.
top 5 records for day 1 in month september
top 5 records for day 2 in month september
top 5 records for day 3 in month september
.
.
top 5 records for day 31 in month september
and show all these records as one result.
Let's say you're checking speeding records for the month June 2012, and you wanted the top 5 speeds (by speed desc).
SELECT *
FROM (
SELECT *, RowNum = Row_number() over (partition by Cast(EventTime as Date)
order by Speed desc)
FROM Events
WHERE EventTime >= '20120601'
AND EventTime < '20120701'
) X
WHERE RowNum <= 5
Try this one,
WITH TopFiveRecords
AS
(
SELECT *,
ROW_NUMBER() OVER (PARTITION BY dayColumn ORDER BY colName DESC) RN
FROM tableName
)
SELECT *
FROM TopFiveRecords
WHERE RN <= 5
-- AND date condition here ....
dayColumn: the column that contains the day of the month
colName: the column to sort by
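For readers who want to see what the ROW_NUMBER() filter does, here is a minimal Python sketch of the same top-N-per-day selection. The event data and the function name are hypothetical; it follows the speeding example above, keeping the 5 highest speeds for each day of June 2012.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical events: (event_time, speed).
events = (
    [(datetime(2012, 6, 1, 8, m), 60 + m) for m in range(7)]    # 7 events on June 1
    + [(datetime(2012, 6, 2, 9, m), 70 + m) for m in range(3)]  # 3 events on June 2
    + [(datetime(2012, 7, 1, 0, 0), 120)]                       # outside the month
)

def top_n_per_day(rows, year, month, n=5):
    """Keep the n highest-speed events for each day of the given month."""
    by_day = defaultdict(list)
    for when, speed in rows:
        # Same role as the WHERE clause on the date range.
        if when.year == year and when.month == month:
            by_day[when.date()].append((when, speed))
    result = []
    for day in sorted(by_day):
        # Same role as ROW_NUMBER() ... ORDER BY speed DESC, then RowNum <= n.
        ranked = sorted(by_day[day], key=lambda e: e[1], reverse=True)
        result.extend(ranked[:n])
    return result

top = top_n_per_day(events, 2012, 6)
print([speed for _, speed in top])
```

Days with fewer than five events simply contribute all of their rows, which is also how the SQL versions behave.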