How can I select one row of data per hour, from a table of time stamps? - sql

Excuse me if this is confusing, as I am not very familiar with postgresql. I have a postgres database with a table full of "sites". Each site reports about once an hour, and when it reports, it makes an entry in this table, like so:
site | tstamp
-----+--------------------
6000 | 2013-05-09 11:53:04
6444 | 2013-05-09 12:58:00
6444 | 2013-05-09 13:01:08
6000 | 2013-05-09 13:01:32
6000 | 2013-05-09 14:05:06
6444 | 2013-05-09 14:06:25
6444 | 2013-05-09 14:59:58
6000 | 2013-05-09 19:00:07
As you can see, the time stamps are almost never on-the-nose, and sometimes there will be 2 or more within only a few minutes/seconds of each other. Furthermore, some sites won't report for hours at a time (on occasion). I want to only select one entry per site, per hour (as close to each hour as I can get). How can I go about doing this in an efficient way? I also will need to extend this to other time frames (like one entry per site per day -- as close to midnight as possible).
Thank you for any and all suggestions.

You could use DISTINCT ON:
select distinct on (site, date_trunc('hour', tstamp)) site, tstamp
from t
order by site, date_trunc('hour', tstamp), tstamp
Be careful with the ORDER BY if you care about which entry you get.
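For example, a sketch of one such tweak (same table t as above): to keep the latest report in each site/hour bucket instead of the earliest, reverse the tie-breaking sort:
-- Sketch: take the last report in each hour rather than the first.
select distinct on (site, date_trunc('hour', tstamp)) site, tstamp
from t
order by site, date_trunc('hour', tstamp), tstamp desc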
Alternatively, you could use the row_number window function to mark the rows of interest and then peel off the first result in each group from a derived table:
select site, tstamp
from (
select site, tstamp,
row_number() over (partition by site, date_trunc('hour', tstamp) order by tstamp) as r
from t
) as dt
where r = 1
Again, you'd adjust the ORDER BY to select the specific row of interest for each date.

You are looking for the closest value per hour. Some are before the hour and some are after. That makes this a hardish problem.
First, we need to identify the range of values that work for a particular hour. For this, I'll consider anything from 15 minutes before the hour to 45 minutes after as being for that hour. So, the period of consideration for 2:00 goes from 1:45 to 2:45 (arbitrary, but seems reasonable for your data). We can do this by shifting the time stamps by 15 minutes.
Second, we need to get the value closest to the hour, so we prefer 1:57 over 2:05. We can do this by ordering on least(minute, 60 - minute): for 1:57 the distance is least(57, 3) = 3 minutes, for 2:05 it is least(5, 55) = 5 minutes, so 1:57 wins.
We can put these rules into a SQL statement, using row_number():
select site, tstamp, usedTimestamp
from (select site, tstamp,
             date_trunc('hour', tstamp + time '00:15') as usedTimestamp,
             row_number() over (partition by site, to_char(tstamp + time '00:15', 'YYYY-MM-DD-HH24')
                                order by least(extract(minute from tstamp), 60 - extract(minute from tstamp))
                               ) as seqnum
      from t
     ) as dt
where seqnum = 1;

For the extensibility aspect of your question.
I also will need to extend this to other time frames (like one entry per site per day
From the distinct set of site ids, and using a (recursive) CTE, I would build a set comprised of one entry per site per hour (or other specified interval), within a specified StartDateTime, EndDateTime range.
SITE   DATE-TIME-HOUR
6000 12.1.2013 00:00:00
6000 12.1.2013 01:00:00
.
.
.
6000 12.1.2013 24:00:00
7000 12.1.2013 00:00:00
7000 12.1.2013 01:00:00
.
.
.
7000 12.1.2013 24:00:00
Then I would left join that CTE against your SITES log on site id and on the min absolute difference between the CTE point-in-time and the LOG's point-in-time.
That way you are assured of a row for each site per interval.
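A minimal sketch of this idea in Postgres, using generate_series rather than a recursive CTE, and assuming the log table is t(site, tstamp), a one-day range, and Postgres 9.3+ for LATERAL (all assumptions):
-- Build one slot per site per hour, then pick the log row closest to each slot.
with slots as (
    select s.site, g.slot
    from (select distinct site from t) s
    cross join generate_series(timestamp '2013-05-09 00:00',
                               timestamp '2013-05-09 23:00',
                               interval '1 hour') as g(slot)
)
select sl.site, sl.slot, closest.tstamp
from slots sl
left join lateral (
    select t.tstamp
    from t
    where t.site = sl.site
    order by abs(extract(epoch from (t.tstamp - sl.slot)))
    limit 1
) closest on true
order by sl.site, sl.slot;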
P.S. For a site that has not phoned home for a long time, its most recent phone-in timestamp will be repeated multiple times as the closest one available.

Related

Cross apply historical date range in BigQuery

I have a growing table of orders which looks something like this:
units_sold | timestamp
-----------+---------------------
         1 | 2021-03-02 10:00:00
         2 | 2021-03-02 11:00:00
         4 | 2021-03-02 12:00:00
         3 | 2021-03-03 13:00:00
         9 | 2021-03-03 14:00:00
I am trying to partition the table into each day, and gather statistics on units sold on the day, and on the day before. I can pretty easily get the units sold today and yesterday for just today, but I need to cross apply a date range for every date in my orders table.
The expected result would look like this:
units_sold_yesterday | units_sold_today | date_measured
---------------------+------------------+--------------
                  12 |                7 | 2021-03-02
                NULL |               12 | 2021-03-03
One way of doing it is by creating or appending the order data every day to a new table. However, this table could grow very large, and I need historical data as well.
In my mind's eye I know I have to cascade the data, so that BigQuery compares the data to "today's date", which would shift across all the dates in the table.
I'm thinking this shift could come from a cross apply of all the distinct dates in the table, so I would get a copy of the orders table for each date, but with a different "today's date" column that I can derive the units_sold_today data from by date-diffing the sales date against that column.
This would still, however, create a massive amount of data to process, and I guess maybe there is a simple function for this in BigQuery or standard SQL syntax.
This sounds like aggregation and lag():
select timestamp_trunc(timestamp, day), sum(units_sold) as sold_today,
lag(sum(units_sold)) over (order by min(timestamp)) as sold_yesterday
from t
group by 1
order by 1;
Note: This assumes that you have data for every day.
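If days with no orders at all are possible, one hedged workaround (the date range below is just an illustration) is to generate the calendar first, so that lag() really means the previous calendar day:
-- Sketch: left join the orders onto a generated calendar so empty days appear.
select d as date_measured,
       coalesce(sum(t.units_sold), 0) as units_sold_today,
       lag(coalesce(sum(t.units_sold), 0)) over (order by d) as units_sold_yesterday
from unnest(generate_date_array(date '2021-03-01', date '2021-03-31')) as d
left join `project.dataset.table` t
  on date(t.timestamp) = d
group by d
order by d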
Consider below
select date_measured, units_sold_today,
       lag(units_sold_today) over (order by date_measured) as units_sold_yesterday
from (
  select date(timestamp) as date_measured,
         sum(units_sold) as units_sold_today
  from `project.dataset.table`
  group by date_measured
)
if applied to sample data in your question - output is

BQ: Select latest date from multiple columns

Good day, all. I wrote a question relating to this earlier, but now I have encountered another problem.
I have to calculate the timestamp difference between the install_time and contributor_time columns. HOWEVER, I have three contributor_time columns, and I need to select the latest time from those columns first and then subtract it from install_time.
Sample Data
users | install_time | contributor_time_1 | contributor_time_2 | contributor_time_3
------+--------------+--------------------+--------------------+-------------------
    1 | 8:00         | 7:45               | 7:50               | 7:55
    2 | 10:00        | 9:15               | 9:45               | 9:30
    3 | 11:00        | 10:30              | null               | null
For example, in the table above I would need to select contributor_time_3 and subtract it from install_time for user 1. For user 2, I would do the same, but with contributor_time_2.
Sample Results
users | install_time | time_diff_min
------+--------------+--------------
    1 | 8:00         | 5
    2 | 10:00        | 15
    3 | 11:00        | 30
The problem I am facing is that 1) the contributor_time columns are in string format, and 2) some of them have 'null' string values (which means I cannot cast them to a timestamp).
I created a query, but I am facing an error stating that I cannot subtract a string from a timestamp. So I added safe_cast; however, the time_diff_min results only show when all three contributor_time columns cast to a timestamp. For example, in the sample table above, only the first two rows will pull.
The query I have so far is below:
SELECT
users,
install_time,
TIMESTAMP_DIFF(install_time, greatest(contributor_time_1, contributor_time_2, contributor_time_3), MINUTE) as ctct_min
FROM
(SELECT
users,
install_time,
safe_cast(contributor_time_1 as timestamp) as contributor_time_1,
safe_cast(contributor_time_2 as timestamp) as contributor_time_2,
safe_cast(contributor_time_3 as timestamp) as contributor_time_3,
FROM
(SELECT
users,
install_time,
case when contributor_time_1 = 'null' then '0' else contributor_time_1 end as contributor_time_1,
....
FROM datasource
Any help to point me in the right direction is appreciated! Thank you in advance!
Consider below
select users, install_time,
time_diff(
parse_time('%H:%M',install_time),
greatest(
parse_time('%H:%M',contributor_time_1),
parse_time('%H:%M',contributor_time_2),
parse_time('%H:%M',contributor_time_3)
),
minute) as time_diff_min
from `project.dataset.table`
if applied to sample data in your question - output is
Above can be refactored slightly into below
create temp function latest_time(arr any type) as ((
select parse_time('%H:%M',val) time
from unnest(arr) val
order by time desc
limit 1
));
select users, install_time,
time_diff(
parse_time('%H:%M',install_time),
latest_time([contributor_time_1, contributor_time_2, contributor_time_3]),
minute) as time_diff_min
from `project.dataset.table`
This version is less verbose and avoids redundant parsing, with the same result, so it is just a matter of preference.
You can use greatest():
select t.*,
time_diff(install_time, greatest(contributor_time_1, contributor_time_2, contributor_time_3), minute) as diff_min
from t;
Note: this assumes the columns are real TIME values that are never NULL; with the 'null' strings in your sample data you would need to clean and cast them first.
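A hedged sketch of one way to do that cleaning, treating the literal string 'null' as a missing value and taking the latest of the remaining times (table and column names as in the question):
-- Ignore 'null' strings, parse the rest, and take the latest contributor time per row.
select users, install_time,
  time_diff(
    parse_time('%H:%M', install_time),
    (select max(parse_time('%H:%M', ct))
     from unnest([contributor_time_1, contributor_time_2, contributor_time_3]) ct
     where ct != 'null'),
    minute) as time_diff_min
from `project.dataset.table`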

Multiple aggregate sums from different conditions in one sql query

Whereas I believe this is a fairly general SQL question, I am working in PostgreSQL 9.4 without an option to use other database software, and thus request that any answer be compatible with its capabilities.
I need to be able to return multiple aggregate totals from one query, such that each sum is in a new row, and each grouping is determined by a unique span of time, e.g. WHERE time_stamp BETWEEN '2016-02-07' AND '2016-02-14'. The number of records that satisfy the WHERE clause is unknown and may be zero, in which case ideally the result is "0". This is what I have worked out so far:
(
SELECT SUM(minutes) AS min
FROM downtime
WHERE time_stamp BETWEEN '2016-02-07' AND '2016-02-14'
)
UNION ALL
(
SELECT SUM(minutes)
FROM downtime
WHERE time_stamp BETWEEN '2016-02-14' AND '2016-02-21'
)
UNION ALL
(
SELECT SUM(minutes)
FROM downtime
WHERE time_stamp BETWEEN '2016-02-28' AND '2016-03-06'
)
UNION ALL
(
SELECT SUM(minutes)
FROM downtime
WHERE time_stamp BETWEEN '2016-03-06' AND '2016-03-13'
)
UNION ALL
(
SELECT SUM(minutes)
FROM downtime
WHERE time_stamp BETWEEN '2016-03-13' AND '2016-03-20'
)
UNION ALL
(
SELECT SUM(minutes)
FROM downtime
WHERE time_stamp BETWEEN '2016-03-20' AND '2016-03-27'
)
Result:
min
---+-----
1 | 119
2 | 4
3 | 30
4 |
5 | 62
6 | 350
That query gets me almost the exact result that I want; certainly good enough in that I can do exactly what I need with the results. Time spans with no records are blank but that was predictable, and whereas I would prefer "0" I can account for the blank rows in software.
But, while it isn't terrible for the 6 weeks that it represents, I want to be flexible and to be able to do the same thing for different time spans, and for a different number of data points, such as each day in a week, each week in 3 months, 6 months, each month in 1 year, 2 years, etc... As written above, it feels as if it is going to get tedious fast... for instance 1 week spans over a 2 year period is 104 sub-queries.
What I'm after is a more elegant way to get the same (or similar) result.
I also don't know if doing 104 iterations of a similar query to the above (vs. the 6 that it does now) is a particularly efficient usage.
Ultimately I am going to write some code which will help me build (and thus abstract away) the long, ugly query--but it would still be great to have a more concise and scale-able query.
In Postgres, you can generate a series of times and then use these for the aggregation:
select g.dte, coalesce(sum(dt.minutes), 0) as minutes
from generate_series('2016-02-07'::timestamp, '2016-03-20'::timestamp, interval '7 day') g(dte) left join
downtime dt
on dt.time_stamp >= g.dte and dt.time_stamp < g.dte + interval '7 day'
group by g.dte
order by g.dte;
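The same pattern extends to the other granularities you mention by changing the series step and the bucket width; for instance, a sketch of daily buckets over a single week (the dates are just an illustration):
-- One row per day, 0 when no downtime was recorded that day.
select g.dte::date as day, coalesce(sum(dt.minutes), 0) as minutes
from generate_series('2016-02-07'::timestamp, '2016-02-13'::timestamp, interval '1 day') g(dte)
     left join downtime dt
     on dt.time_stamp >= g.dte and dt.time_stamp < g.dte + interval '1 day'
group by g.dte
order by g.dte;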

oracle sql: efficient way to calculate business days in a month

I have a pretty huge table with columns dates, account, amount, etc. eg.
date account amount
4/1/2014 XXXXX1 80
4/1/2014 XXXXX1 20
4/2/2014 XXXXX1 840
4/3/2014 XXXXX1 120
4/1/2014 XXXXX2 130
4/3/2014 XXXXX2 300
...........
(I have 40 months' worth of daily data and multiple accounts.)
The final output I want is the average amount of each account each month. Since there may or may not be a record for any account on a single day, and I have a separate table of holidays from 2011~2014, I am summing up the amount of each account within a month and dividing it by the number of business days of that month. Notice that there are very likely to be records on weekends/holidays, so I need to exclude them from the calculation. Also, I want to have a record for each of the dates available in the original table. eg.
date account amount
4/1/2014 XXXXX1 48 ((80+20+840+120)/22)
4/2/2014 XXXXX1 48
4/3/2014 XXXXX1 48
4/1/2014 XXXXX2 19 ((130+300)/22)
4/3/2014 XXXXX2 19
...........
(Suppose the above is the only data I have for Apr-2014.)
I am able to do this in a hacky and slow way, but as I need to join this process with other subqueries, I really need to optimize this query. My current code looks like:
select
date,
account,
sum(amount/days_mon) over (partition by last_day(date))
from(
select
date,
-- there are more calculation to get the account numbers,
-- so this subquery is necessary
account,
amount,
-- this is a list of month-end dates that the number of
-- business days in that month is 19. similar below.
case when last_day(date) in ('','',...,'') then 19
when last_day(date) in ('','',...,'') then 20
when last_day(date) in ('','',...,'') then 21
when last_day(date) in ('','',...,'') then 22
when last_day(date) in ('','',...,'') then 23
end as days_mon
from mytable tb
inner join lookup_businessday_list busi
on tb.date = busi.date)
So how can I perform the above purpose efficiently? Thank you!
This approach uses sub-query factoring - what other RDBMS flavours call common table expressions. The attraction here is that we can pass the output from one CTE as input to another.
The first CTE generates a list of dates in a given month (you can extend this over any range you like).
The second CTE uses an anti-join on the first to filter out dates which are holidays and also dates which aren't weekdays. Note that the day number varies depending on the NLS_TERRITORY setting; in my realm the weekend is days 6 and 7, but SQL Fiddle is American, so there it is days 1 and 7.
with dates as ( select date '2014-04-01' + ( level - 1) as d
from dual
connect by level <= 30 )
, bdays as ( select d
, count(d) over () tot_d
from dates
left join holidays
on dates.d = holidays.hol_date
where holidays.hol_date is null
and to_number(to_char(dates.d, 'D')) between 2 and 6
)
select yt.account
, yt.txn_date
, sum(yt.amount) over (partition by yt.account, trunc(yt.txn_date,'MM'))
/tot_d as avg_amt
from your_table yt
join bdays
on bdays.d = yt.txn_date
order by yt.account
, yt.txn_date
/
I haven't rounded the average amount.
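If the NLS dependence is a concern, one hedged alternative is to filter on day abbreviations with a forced date language instead of day numbers:
-- Sketch: NLS-independent weekday filter using fixed English abbreviations.
select d
from ( select date '2014-04-01' + (level - 1) as d
       from dual
       connect by level <= 30 )
where to_char(d, 'DY', 'NLS_DATE_LANGUAGE=ENGLISH') not in ('SAT', 'SUN');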
You have 40 months of data; this data should be very stable.
I will assume that you have a cold body (a big, stable, easily definable range of data) and a hot tail (a small, active part).
Next, I would like to define a minimal period: the smallest date range that is interesting for the business.
It might be a year, month, day, hour, etc. Do you expect to get questions like "what was the average for that account between 19:00 and 12am yesterday?"
I will assume that the answer is DAY.
Then,
I will calculate sum(amount) and count(*) for every account for every DAY of the cold body.
I will not create dummy records if a particular account had no activity on some day,
and I will save day, account, total amount, and count in a TABLE.
If there are later modifications to the cold body, you delete and reload the affected day in that table.
For the hot tail there might be multiple strategies:
Do the same as above (same process, simple to support).
Always calculate on the fly.
Use a materialized view as a middle ground between 1 and 2.
The cold-body totals table could also be implemented as a materialized view, but if the data never changes there is no need to rebuild it.
With this you go from (number of accounts) x (number of transactions per day) x (number of days) records to (number of accounts) x (number of active days) records.
That should speed up all following calculations.
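A minimal sketch of that cold-body summary in Oracle SQL, with assumed table and column names (mytable and tx_date are placeholders; the summary table name is made up):
-- Hypothetical daily summary ("cold body") table: one row per account per active day.
create table daily_account_totals as
select trunc(tx_date) as tx_day,
       account,
       sum(amount)    as total_amount,
       count(*)       as tx_count
from   mytable
group by trunc(tx_date), account;
Monthly averages would then come from joining this summary to the business-day counts instead of scanning the raw rows.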

Plot a graph of the number of people online between a time range

I have a database model that stores
visit time
last seen time
how many seconds online (a derived value, calculated by subtracting the visit time from the last seen time)
I need to build a graph of online people for a time range (say 8pm to 9pm). I'm thinking of the x-axis as the time with the y-axis as the number of people. The granularity is 1 minute for the graph, but I have data granular to 5 seconds.
I can't just sum the seconds online value because people visit before or after 8pm.
I was thinking of just loading up all records found in a particular day and doing calculations in memory (which I would probably do for now, then just cache the derived values for later) but I wanted to know if there's a more efficient way?
I wonder if there's a special sql query group by thing I can do to make this work.
Edit: Here's a graphical representation I am stealing from another question (Count Grouped Gaps In Time For Time Range) :P
|.=========]
|.=============]
|=========.======]
|===.=================.====]
|.=================.==========]
T 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6
The bars represent the data I've stored (visit time, last seen time, and seconds online) and I need to know at a particular point how many are online. In this example, for T=0 the number is 3 and for T=9 the number is 4.
Q: I can't understand what you mean by "but I have data granular to 5 seconds"; how many records do you store per visit? Can you add some example data?
A: There's only one record per visit. Granular to 5 seconds means I'm storing up to 5 seconds worth of accurate data.
Sample data as requested:
id visit_time last_seen_time seconds_online
1 00:00:00 00:00:12 10
2 00:12:41 00:12:47 5
3 00:01:20 00:01:22 0
4 00:01:22 00:01:27 5
In this particular case, if I graph the people online at 00:00:00 there would be one person, until 00:00:15 when there would be 0 people:
4|
3|
2| *
1|* *
-*****-******************-
Very interesting and hard question. If we suppose that the interval of the graph should be one hour (for example from 8:00 to 8:59) and the granularity one minute, we can simplify the problem by extracting those date parts (in PostgreSQL the function to use is EXTRACT). I also suggest using Common Table Expressions.
We can then build a CTE to have first minute and last minute of each visit in the target hour, like:
SELECT CASE WHEN EXTRACT(hour FROM visit_time) = 8
THEN EXTRACT(minute FROM visit_time)
ELSE 0 END AS first_minute,
CASE WHEN EXTRACT(hour FROM last_seen_time) = 8
THEN EXTRACT(minute FROM last_seen_time)
ELSE 59 END AS last_minute
FROM visit_table
WHERE EXTRACT(hour FROM visit_time) <= 8 AND EXTRACT(hour FROM last_seen_time) >= 8
The number of visitors changes when a new visit begins or a visit ends, so we can build a second CTE from the first to get a list of all minutes where the visitor count changes. Let's name the first CTE target; then the second could be defined as:
SELECT first_minute AS minute
FROM target
UNION
SELECT last_minute AS minute
FROM target
The UNION will also eliminate duplicates.
Finally we can join the two tables and count the visitors:
WITH target AS (
SELECT CASE WHEN EXTRACT(hour FROM visit_time) = 8
THEN EXTRACT(minute FROM visit_time)
ELSE 0 END AS first_minute,
CASE WHEN EXTRACT(hour FROM last_seen_time) = 8
THEN EXTRACT(minute FROM last_seen_time)
ELSE 59 END AS last_minute
FROM visit_table
WHERE EXTRACT(hour FROM visit_time) <= 8
AND EXTRACT(hour FROM last_seen_time) >= 8
), time_table AS (
SELECT first_minute AS minute
FROM target
UNION
SELECT last_minute AS minute
FROM target
)
SELECT time_table.minute, COUNT(*) AS Users
FROM target INNER JOIN
time_table ON time_table.minute BETWEEN target.first_minute
AND target.last_minute
GROUP BY time_table.minute
ORDER BY time_table.minute
You should obtain a table where the first record contains the first minute, within the target hour, at which there is at least one online visitor, together with the number of online people. After that there is a record for each change in the number of online people, giving the minute of the change and the new count, so you can easily build your graph from this.
Sorry that I can't test this solution, but I hope it helps you anyway.