Multiple aggregate sums from different conditions in one SQL query

While I believe this is a fairly general SQL question, I am working in PostgreSQL 9.4 without the option to use other database software, and thus request that any answer be compatible with its capabilities.
I need to be able to return multiple aggregate totals from one query, such that each sum is in a new row, and each grouping is determined by a unique span of time, e.g. WHERE time_stamp BETWEEN '2016-02-07' AND '2016-02-14'. The number of records that satisfy the WHERE clause is unknown and may be zero, in which case ideally the result is "0". This is what I have worked out so far:
(
SELECT SUM(minutes) AS min
FROM downtime
WHERE time_stamp BETWEEN '2016-02-07' AND '2016-02-14'
)
UNION ALL
(
SELECT SUM(minutes)
FROM downtime
WHERE time_stamp BETWEEN '2016-02-14' AND '2016-02-21'
)
UNION ALL
(
SELECT SUM(minutes)
FROM downtime
WHERE time_stamp BETWEEN '2016-02-28' AND '2016-03-06'
)
UNION ALL
(
SELECT SUM(minutes)
FROM downtime
WHERE time_stamp BETWEEN '2016-03-06' AND '2016-03-13'
)
UNION ALL
(
SELECT SUM(minutes)
FROM downtime
WHERE time_stamp BETWEEN '2016-03-13' AND '2016-03-20'
)
UNION ALL
(
SELECT SUM(minutes)
FROM downtime
WHERE time_stamp BETWEEN '2016-03-20' AND '2016-03-27'
)
Result:
   | min
---+-----
 1 | 119
 2 |   4
 3 |  30
 4 |
 5 |  62
 6 | 350
That query gets me almost exactly the result I want; certainly good enough in that I can do exactly what I need with the results. Time spans with no records come back blank, but that was predictable, and while I would prefer "0", I can account for the blank rows in software.
But while it isn't terrible for the 6 weeks it represents, I want to be flexible and able to do the same thing for different time spans and for a different number of data points, such as each day in a week, each week in 3 or 6 months, each month in 1 or 2 years, etc. As written above, it feels as if it is going to get tedious fast; for instance, 1-week spans over a 2-year period would mean 104 subqueries.
What I'm after is a more elegant way to get the same (or similar) result.
I also don't know whether running 104 iterations of a query like the one above (vs. the 6 it does now) is particularly efficient.
Ultimately I am going to write some code to help me build (and thus abstract away) the long, ugly query, but it would still be great to have a more concise and scalable one.

In Postgres, you can generate a series of times and then use these for the aggregation:
select g.dte, coalesce(sum(dt.minutes), 0) as minutes
from generate_series('2016-02-07'::timestamp, '2016-03-20'::timestamp, interval '7 day') g(dte)
left join downtime dt
       on dt.time_stamp >= g.dte
      and dt.time_stamp < g.dte + interval '7 day'
group by g.dte
order by g.dte;
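The same pattern scales to any granularity just by changing the series bounds and step. As a sketch (assuming the same downtime(time_stamp, minutes) table), daily totals for a single week would be:

-- one row per day, 0 when no downtime was recorded that day
select g.dte::date as day, coalesce(sum(dt.minutes), 0) as minutes
from generate_series('2016-02-07'::timestamp, '2016-02-13'::timestamp, interval '1 day') g(dte)
left join downtime dt
       on dt.time_stamp >= g.dte
      and dt.time_stamp < g.dte + interval '1 day'
group by g.dte
order by g.dte;

A 2-year weekly report is the same query with different bounds and a 7-day step, rather than 104 subqueries.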

Related

How do I design a SQL query that will show me all users who visited at least one page for the last 20 out of 24 hours?

In order to identify human traffic (as opposed to crawlers, bots, etc.), I would like to design an SQL query that will identify all unique visitor IDs that have visited websites in the last 20 of 24 hours (as most humans would not be browsing for that long). I believe I understand how I want to structure it: "How many UNIQUE hours have any activity for each visitor in the past 24 hours, and WHERE 20 hours have at least some activity".
While the specifics of such a query would depend on the tables involved, I'm having trouble understanding if my structure is on the right track:
SELECT page_url, affinity, num
FROM (
SELECT AGG GROUP BY visitor_id, pages.page_url, max(v.max_affinity) as affinity, COUNT(*) as num, Row_Number()
OVER (Partition By v.visitor_id ORDER BY COUNT(visitor_id) DESC) AS RowNumber
FROM audience_lab_active_visitors v
SELECT pages ON pages.p_date >= '2017-09-14'
WHERE v.p_date='2017-09-14'
GROUP BY v.vispage_visitors, pages.page_url
) tbl WHERE RowNumber < 20
I don't believe your query is valid SQL, but I have an idea of what you're trying to accomplish. Rather than use a static date, I filtered by the past 24 hours and truncated the current timestamp to the hour, otherwise the query would be considering 25 unique hours. I also removed page_url from the query since it didn't seem relevant to the results based on what you're trying to solve.
For each visitor_id, the query counts the number of unique hours recorded, based on the column (timestamp_col in this example) used to record the timestamp of the page view. HAVING COUNT(DISTINCT DATE_TRUNC('hour', timestamp_col)) < 20 returns those you've identified as humans, meaning they visited the website up to 19 of the past 24 hours.
SELECT
visitor_id,
COUNT(DISTINCT DATE_TRUNC('hour', timestamp_col)) AS num,
MAX(v.max_affinity) AS affinity
FROM audience_lab_active_visitors AS v
JOIN pages AS p ON v.page_url = p.page_url
WHERE
v.p_date >= DATE_TRUNC('hour', CURRENT_TIMESTAMP) - INTERVAL '24' hour
GROUP BY 1
HAVING COUNT(DISTINCT DATE_TRUNC('hour', timestamp_col)) < 20;
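As a quick sanity check of the distinct-hour logic (a hypothetical mini-example, not from the original data), DATE_TRUNC('hour', ...) collapses every timestamp within the same hour to one value, so COUNT(DISTINCT ...) counts active hours rather than page views:

-- three page views, but only two distinct hours of activity
SELECT COUNT(DISTINCT DATE_TRUNC('hour', ts)) AS active_hours
FROM (VALUES
        (TIMESTAMP '2017-09-14 10:05:00'),
        (TIMESTAMP '2017-09-14 10:55:00'),
        (TIMESTAMP '2017-09-14 11:10:00')
     ) AS t(ts);
-- active_hours = 2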

SELECT statement optimization

I'm no expert in SQL queries, but not a complete newbie either.
I'm exporting data from a MS-SQL database to an excel file using a SQL query.
I'm exporting many columns, and two of these columns contain a date and an hour; these are the columns I use in the WHERE clause.
In detail, I have about 200 rows for each day, each with a different hour, over many days. I need to extract the first value after 15:00 of each day, for multiple days.
Since the hours are different for each day, I can't specify something like
SELECT a,b,hour,day FROM table WHERE hour='15:01'
because sometimes the value is at 15:01, sometimes 15:03 and so on (I'm looking for the closest value after 15:00). To fix this I used this workaround:
SELECT TOP 1 a,b,hour,day FROM table WHERE hour > "15:00"
This way I can take the first value after 15:00 for one day. The problem is that I need this for multiple days, for a user-specified interval of days. At the moment I handle this with a UNION ALL statement, like this:
SELECT TOP 1 a,b,hour,day FROM table WHERE data="first_day" AND hour > "15:00"
UNION ALL SELECT TOP 1 a,b,hour,day FROM table WHERE data="second_day" AND hour > "15:00"
UNION ALL SELECT TOP 1 a,b,hour,day FROM table WHERE data="third_day" AND hour > "15:00"
...and so on for all the days (I build the SQL string with a loop over each day in the specified interval).
Until now this worked, but I now need to expand the interval of days (currently a week at most, so 5 days) to up to 60 days. I don't want to build a huge query string, but I can't imagine an alternative way to write the SQL.
Any help appreciated
Ettore
A typical solution for this uses ROW_NUMBER():
SELECT a, b, hour, day
FROM (SELECT t.*,
ROW_NUMBER() OVER (PARTITION BY day ORDER BY hour) as seqnum
FROM table t
WHERE hour > '15:00'
) t
WHERE seqnum = 1;
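To cover the user-specified interval in a single statement (a sketch; table, day, and hour are the placeholder names from above, and @first_day/@last_day stand in for your parameters), just add the range to the inner WHERE clause; one query then replaces up to 60 UNION ALL branches:

SELECT a, b, hour, day
FROM (SELECT t.*,
             ROW_NUMBER() OVER (PARTITION BY day ORDER BY hour) AS seqnum
      FROM table t
      WHERE hour > '15:00'
        AND day BETWEEN @first_day AND @last_day  -- user-specified interval
     ) t
WHERE seqnum = 1;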

oracle sql: efficient way to calculate business days in a month

I have a pretty huge table with columns dates, account, amount, etc. eg.
date account amount
4/1/2014 XXXXX1 80
4/1/2014 XXXXX1 20
4/2/2014 XXXXX1 840
4/3/2014 XXXXX1 120
4/1/2014 XXXXX2 130
4/3/2014 XXXXX2 300
...........
(I have 40 months' worth of daily data and multiple accounts.)
The final output I want is the average amount of each account each month. Since there may or may not be a record for any account on a given day, and I have a separate table of holidays from 2011-2014, I am summing up the amount of each account within a month and dividing it by the number of business days of that month. Note that there are very likely to be record(s) on weekends/holidays, so I need to exclude them from the calculation. Also, I want to have a record for each of the dates available in the original table, e.g.
date account amount
4/1/2014 XXXXX1 48 ((80+20+840+120)/22)
4/2/2014 XXXXX1 48
4/3/2014 XXXXX1 48
4/1/2014 XXXXX2 19 ((130+300)/22)
4/3/2014 XXXXX2 19
...........
(Suppose the above is the only data I have for Apr-2014.)
I am able to do this in a hacky and slow way, but as I need to join this process with other subqueries, I really need to optimize this query. My current code looks like:
select
date,
account,
sum(amount/days_mon) over (partition by last_day(date))
from(
select
date,
-- there are more calculation to get the account numbers,
-- so this subquery is necessary
account,
amount,
-- this is a list of month-end dates that the number of
-- business days in that month is 19. similar below.
case when last_day(date) in ('','',...,'') then 19
when last_day(date) in ('','',...,'') then 20
when last_day(date) in ('','',...,'') then 21
when last_day(date) in ('','',...,'') then 22
when last_day(date) in ('','',...,'') then 23
end as days_mon
from mytable tb
inner join lookup_businessday_list busi
on tb.date = busi.date)
So how can I perform the above purpose efficiently? Thank you!
This approach uses sub-query factoring, which other RDBMS flavours call common table expressions. The attraction here is that we can pass the output from one CTE as input to another.
The first CTE generates a list of dates in a given month (you can extend this over any range you like).
The second CTE uses an anti-join on the first to filter out dates which are holidays, and also dates which aren't weekdays. Note that the day number varies according to the NLS_TERRITORY setting; in my realm the weekend is days 6 and 7, but SQL Fiddle is American, so there it is 1 and 7.
with dates as ( select date '2014-04-01' + ( level - 1) as d
from dual
connect by level <= 30 )
, bdays as ( select d
, count(d) over () tot_d
from dates
left join holidays
on dates.d = holidays.hol_date
where holidays.hol_date is null
and to_number(to_char(dates.d, 'D')) between 2 and 6
)
select yt.account
, yt.txn_date
, sum(yt.amount) over (partition by yt.account, trunc(yt.txn_date,'MM'))
/tot_d as avg_amt
from your_table yt
join bdays
on bdays.d = yt.txn_date
order by yt.account
, yt.txn_date
/
I haven't rounded the average amount.
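The dates CTE extends naturally from one month to the whole 40-month window. As a sketch (the bound dates here are assumptions), generate one row per calendar day between two bounds:

with dates as ( select date '2011-01-01' + (level - 1) as d
                from dual
                -- one row per day between the bounds, inclusive
                connect by level <= date '2014-12-31' - date '2011-01-01' + 1 )
select min(d), max(d), count(*) as total_days
from dates;
-- total_days = 1461

Over a multi-month range, the bdays count would then need a per-month window, count(d) over (partition by trunc(d, 'MM')), rather than the global count(d) over ().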
You have 40 months of data, and most of it should be very stable.
I will assume that you have a cold body (a big, stable, easily definable range of data) and a hot tail (a small and active part).
Next, I would like to define a minimal period: the smallest data range that is interesting to the business.
It might be a year, month, day, hour, etc. Do you expect to get questions like "what was the average for that account between 19:00 and 12am yesterday?"
I will assume that the answer is DAY.
Then:
I will calculate sum(amount) and count(*) for every account for every DAY of the cold body.
I will not create dummy records if a particular account had no activity on some day,
and I will save day, account, total amount, and count in a TABLE.
If there are later modifications to the cold body, you delete and reload the affected days in that table.
For the hot tail there might be multiple strategies:
1. Do the same as above (same process, easy to support).
2. Always calculate on the fly.
3. Use a materialized view as a middle ground between 1 and 2.
The cold body table totalc could also be implemented as a materialized view, but if the data never changes, there is no need to rebuild it.
With this you go from (number of accounts) x (number of transactions per day) x (number of days) records to (number of accounts) x (number of active days) records.
That should speed up all subsequent calculations.
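A minimal sketch of that cold-body table (reusing the your_table/txn_date names from the earlier answer; the table name account_daily_totals is an assumption):

-- pre-aggregated "cold body": one row per account per active day
create table account_daily_totals as
select trunc(txn_date) as txn_day
     , account
     , sum(amount)     as total_amount
     , count(*)        as txn_count
from your_table
group by trunc(txn_date), account;

Monthly sums and business-day averages then scan this much smaller table instead of the raw transactions.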

Change select to a previous date

I have basic knowledge of SQL and have a question:
I am trying to select data from a time series (date and wind speed). I want to select the original wind speed value if it lies between hours 7 and 21. If the hour is outside this range, I would like to assign the previous wind speed at hour 21. There is also a concern that there is the occasional day where hour 21 does not exist, in which case I would like to assign the wind speed from hour 20... 19, etc., until it finds the next available hour.
SELECT
date,
CASE WHEN DATEPART(HH,date) < 7 OR DATEPART(HH,date) > 21
THEN <WIND SPEED AT HOUR 21> ELSE <WIND SPEED> END AS ModifiedWindspeed
,WindSpeed, winddirection
from TerrainCorrectedHourlyWind w
This might make things clearer. If the hour is in the specified range, select windspeed. If not then select the wind speed from the prior day at 21 hours.
Though you've tagged the question mysql, I'm guessing this is actually SQL Server because of the DATEPART() function used. Try the following, which uses an OUTER APPLY to get your alternate value:
SELECT Date
     , CASE
           WHEN DATEPART(HOUR, Date) BETWEEN 7 AND 21 THEN w.WindSpeed
           ELSE m.WindSpeed
       END AS ModifiedWindSpeed
     , w.WindSpeed
     , w.WindDirection
FROM TerrainCorrectedHourlyWind AS w
OUTER APPLY (SELECT TOP 1 WindSpeed
             FROM TerrainCorrectedHourlyWind
             WHERE DATEPART(HOUR, Date) BETWEEN 7 AND 21
               AND Date < w.Date
             ORDER BY Date DESC) AS m;
Just to explain what this is doing: the OUTER APPLY will get the single most recent record (TOP 1 and ORDER BY Date DESC) for dates prior to the record in question (Date < w.Date), as well as within the hours specified. The CASE near the top chooses whether to use the current value or this alternate one, based on the hour.
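For comparison (a sketch against the same table and columns), the same fallback can be written as a correlated scalar subquery; OUTER APPLY is usually preferred when you need more than one column from the lookup, but for a single value both forms work:

SELECT Date
     , CASE
           WHEN DATEPART(HOUR, Date) BETWEEN 7 AND 21 THEN w.WindSpeed
           ELSE (SELECT TOP 1 WindSpeed            -- most recent in-range reading
                 FROM TerrainCorrectedHourlyWind
                 WHERE DATEPART(HOUR, Date) BETWEEN 7 AND 21
                   AND Date < w.Date
                 ORDER BY Date DESC)
       END AS ModifiedWindSpeed
FROM TerrainCorrectedHourlyWind AS w;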

Postgres SQL select a range of records spaced out by a given interval

I am trying to determine if it is possible, using only SQL in Postgres, to select a range of time-ordered records at a given interval.
Lets say I have 60 records, one record for each minute in a given hour. I want to select records at 5 minute intervals for that hour. The resulting rows should be 12 records each one 5 minutes apart.
This is currently accomplished by selecting the full range of records and then looping through the results, pulling out the records at the given interval. I am trying to see if I can do this purely in SQL, as our DB is large and we may be dealing with tens of thousands of records.
Any thoughts?
Yes you can. It's really easy once you get the hang of it. I think it's one of the jewels of SQL, and it's especially easy in PostgreSQL because of its excellent temporal support. Often, complex functions can turn into very simple queries in SQL that can scale and be indexed properly.
This uses generate_series to draw up sample time stamps that are spaced 1 minute apart. The outer query then extracts the minute and uses modulo to find the values that are 5 minutes apart.
select
ts,
extract(minute from ts)::integer as minute
from
( -- generate some time stamps - one minute apart
select
current_time + (n || ' minute')::interval as ts
from generate_series(1, 30) as n
) as timestamps
-- extract the minute and check if it's on a 5 minute interval
where extract(minute from ts)::integer % 5 = 0
-- only pick this hour
and extract(hour from ts) = extract(hour from current_time)
;
ts | minute
--------------------+--------
19:40:53.508836-07 | 40
19:45:53.508836-07 | 45
19:50:53.508836-07 | 50
19:55:53.508836-07 | 55
Notice how adding a computed (expression) index matching the WHERE clause, where the value of the expression makes up the index, could lead to major speed improvements. Maybe not very selective in this case, but good to be aware of.
I wrote a reservation system once in PostgreSQL (which had lots of temporal logic where date intervals could not overlap) and never had to resort to iterative methods.
http://www.amazon.com/SQL-Design-Patterns-Programming-Focus/dp/0977671542 is an excellent book that has lots of interval examples. Hard to find in bookstores now, but well worth it.
Extract the minutes, convert to int4, and see if the remainder from dividing by 5 is 0:
select *
from TABLE
where int4 (date_part ('minute', COLUMN)) % 5 = 0;
If the intervals are not time-based and you just want every 5th row, or if the times are regular and you always have one record per minute, then the query below gives you one record for every 5:
select *
from
(
select *, row_number() over (order by timecolumn) as rown
from tbl
) X
where mod(rown, 5) = 1
If your time records are not regular, then you need to generate a time series (given in another answer) and left join that into your table, group by the time column (from the series) and pick the MAX time from your table that is less than the time column.
Pseudo
select thetimeinterval, max(timecolumn)
from ( < the time series subquery > ) X
left join tbl on tbl.timecolumn <= thetimeinterval
group by thetimeinterval
And further join it back to the table for the full record (assuming unique times)
select t.* from
tbl inner join
(
select thetimeinterval, max(timecolumn) timecolumn
from ( < the time series subquery > ) X
left join tbl on tbl.timecolumn <= thetimeinterval
group by thetimeinterval
) y on tbl.timecolumn = y.timecolumn
How about this:
select min(ts), extract(minute from ts)::integer / 5 as bucket
from your_table
group by bucket
order by bucket;
This has the advantage of doing the right thing if you have two readings for the same minute, or if your readings skip a minute. Instead of using min, even better would be to use one of the first() aggregate functions, code for which you can find here:
http://wiki.postgresql.org/wiki/First_%28aggregate%29
This assumes that your five minute intervals are "on the fives", so to speak. That is, that you want 07:00, 07:05, 07:10, not 07:02, 07:07, 07:12. It also assumes you don't have two rows within the same minute, which might not be a safe assumption.
select your_timestamp
from your_table
where cast(extract(minute from your_timestamp) as integer) % 5 = 0;
If you might have two rows with timestamps within the same minute, like
2011-01-01 07:00:02
2011-01-01 07:00:59
then this version is safer.
select min(your_timestamp)
from your_table
group by (cast(extract(minute from your_timestamp) as integer) / 5)
Wrap either of those in a view, and you can join it to your base table.
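For instance (a sketch reusing the hypothetical your_table/your_timestamp names; it also adds an hour term to the grouping so buckets from different hours stay separate):

create view five_minute_samples as
select min(your_timestamp) as your_timestamp
from your_table
group by date_trunc('hour', your_timestamp),  -- keep each hour's buckets distinct
         cast(extract(minute from your_timestamp) as integer) / 5;

-- join back to the base table for the full rows
select t.*
from your_table t
join five_minute_samples s using (your_timestamp);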