I have a table in PostgreSQL 13 that looks like this (modified for the purpose of this question):
SELECT * FROM visits.visitors_log;
 visitor_id |          day           | source
------------+------------------------+---------
          9 | 2019-12-30 12:10:10-05 | Twitter
          7 | 2019-12-14 22:10:26-04 | Netflix
          5 | 2019-12-13 15:21:04-05 | Netflix
          9 | 2019-12-22 23:34:47-05 | Twitter
          7 | 2019-12-22 00:10:26-04 | Netflix
          9 | 2019-12-22 13:20:42-04 | Twitter
After converting the times to another timezone, I want to calculate the percentage of visits on 2019-12-22 that came from a specific source.
There are 4 steps involved:
Convert timezones
Calculate how many total visits happened on that day
Calculate how many total visits happened on that day that came from source Netflix
Divide those two numbers to get the percentage.
I wrote this code, which works but seems repetitive and not very clean:
SELECT (SELECT COUNT(*)
        FROM (SELECT visitor_id, source, day AT TIME ZONE 'PST' AS day
              FROM visits.visitors_log
              WHERE day::date = '2019-12-22') AS a
        WHERE day::date = '2019-12-22' AND source = 'Netflix') * 100.0
       /
       (SELECT COUNT(*)
        FROM (SELECT visitor_id, source, day AT TIME ZONE 'PST' AS day
              FROM visits.visitors_log
              WHERE day::date = '2019-12-22') AS b
        WHERE day::date = '2019-12-22')
       AS visitors_percentage;
Can anyone suggest a neater way of answering this question?
Use an aggregate FILTER clause:
SELECT count(*) FILTER (WHERE source = 'Netflix') * 100.0
/ count(*) AS visitors_percentage
FROM visits.visitors_log
WHERE day >= timestamp '2019-12-22' AT TIME ZONE 'PST'
AND day < timestamp '2019-12-23' AT TIME ZONE 'PST';
See:
Aggregate columns with additional (distinct) filters
I rephrased the WHERE condition so it is "sargable" and can use an index on (day). A predicate with an expression on the column cannot use a plain index. So I moved the computation of inclusive lower and exclusive upper bound (day boundaries for the given time zone) to the right side of the expressions in the WHERE clause.
Makes a huge difference for performance with big tables.
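For reference, the rewritten predicate can then use a plain B-tree index on the column; a minimal sketch (the index name is just illustrative):
CREATE INDEX visitors_log_day_idx ON visits.visitors_log (day);  -- hypothetical index name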
If you use that query a lot, consider creating a function for it:
CREATE OR REPLACE FUNCTION my_func(_source text, _day date, _tz text)
RETURNS numeric
LANGUAGE sql STABLE PARALLEL SAFE AS
$func$
SELECT round(count(*) FILTER (WHERE source = _source) * 100.0 / count(*), 2) AS visitors_percentage
FROM visits.visitors_log
WHERE day >= _day::timestamp AT TIME ZONE _tz
AND day < (_day + 1)::timestamp AT TIME ZONE _tz;
$func$;
Call:
SELECT my_func('Netflix', '2019-12-22', 'PST');
I threw in round(), which is a totally optional addition.
db<>fiddle here
Aside: "day" is a rather misleading name for a timestamp with time zone column.
Hmmm . . . You can use window functions to calculate the total:
SELECT source, COUNT(*) / SUM(COUNT(*)) OVER () AS visitors_percentage
FROM visits.visitors_log
WHERE (day AT TIME ZONE 'PST')::date = '2019-12-22'
GROUP BY source
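If an actual percentage (0 to 100) per source is wanted, the same idea can be combined with the sargable range predicate from the answer above; an untested sketch:
SELECT source, count(*) * 100.0 / sum(count(*)) OVER () AS visitors_percentage
FROM visits.visitors_log
WHERE day >= timestamp '2019-12-22' AT TIME ZONE 'PST'
  AND day <  timestamp '2019-12-23' AT TIME ZONE 'PST'
GROUP BY source;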
I have a schema with the following fields:
Name of row | Type
------------+--------
name        | string
value1      | numeric
timestamp   | bigint
The rows contain entries with a name, a numeric value and a bigint value storing the unix timestamp in nanoseconds. I am using TimescaleDB and would like to use time_bucket_gapfill to retrieve the data. Given that the timestamps are stored in bigint, this is quite cumbersome.
I would like to get aggregated data for these intervals: 5 min, hour, day, week, month, quarter, year. I have managed to make it work using normal time_bucket, but now I would like to fill the gaps as well. I am using the following query now:
SELECT COALESCE(COUNT(*), 0),
       COALESCE(SUM(value1), 0),
       time_bucket_gapfill('5 min',
                           date_trunc('quarter', to_timestamp(timestamp/1000000000)),
                           to_timestamp(1599100000),
                           to_timestamp(1599300000)) AS bucket
FROM playground
WHERE name = 'test' AND timestamp >= 1599100000000000000 AND timestamp <= 1599300000000000000
GROUP BY bucket
ORDER BY bucket ASC
This returns the values correctly, but does not fill the empty spaces. If I modified my query to
time_bucket_gapfill('5 min',
                    date_trunc('quarter', to_timestamp(timestamp/1000000000)),
                    to_timestamp(1599100000),
                    to_timestamp(1599200000))
I would get the first entry correctly and then empty rows every 5 minutes. How could I make it work? Thanks!
Here is a DB fiddle, but it doesn't work as it doesn't support TimescaleDB. The query above returns the following:
 coalesce | coalesce |         bucket
----------+----------+------------------------
        3 |      300 | 2020-07-01 00:00:00+00
        0 |        0 | 2020-09-03 02:25:00+00
        0 |        0 | 2020-09-03 02:30:00+00
You should use data types in your time_bucket_gapfill that match the data types in your table. The following query should get you what you are looking for:
SELECT
COALESCE(count(*), 0),
COALESCE(SUM(value1), 0),
time_bucket_gapfill(300E9::BIGINT, timestamp) AS bucket
FROM
t
WHERE
name = 'example'
AND timestamp >= 1599100000000000000
AND timestamp < 1599200000000000000
GROUP BY
bucket;
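The same pattern should cover the other intervals from the question by changing only the bucket width in nanoseconds (3600E9 for an hour, 86400E9 for a day, and so on); for example, an hourly variant under the same assumptions:
SELECT
    COALESCE(count(*), 0),
    COALESCE(SUM(value1), 0),
    time_bucket_gapfill(3600E9::BIGINT, timestamp) AS bucket  -- one hour in nanoseconds
FROM
    t
WHERE
    name = 'example'
    AND timestamp >= 1599100000000000000
    AND timestamp < 1599200000000000000
GROUP BY
    bucket;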
I have managed to solve it by building on Sven's answer. It first uses his query to fill the gaps and then calls date_trunc, collapsing the 5-minute buckets into the coarser interval.
WITH gapfill AS (
SELECT
COALESCE(count(*), 0) as count,
COALESCE(SUM(value1), 0) as sum,
time_bucket_gapfill(300E9::BIGINT, timestamp) as bucket
FROM
playground
WHERE
name = 'test'
AND timestamp >= 1599100000000000000
AND timestamp < 1599300000000000000
GROUP BY
bucket
)
SELECT
SUM(count),
SUM(sum),
date_trunc('quarter', to_timestamp(bucket/1000000000)) as truncated
FROM
gapfill
GROUP BY truncated
ORDER BY truncated ASC
I would like to subtract 2 timestamps to get the hours between them. I have used the days_between function, but it returns an error: Invalid operation: function days_between has no timezone setup. Below is a sample table with the timestamps that I want to subtract.
job_number | timestamp 1         | timestamp 2
-----------+---------------------+--------------------
 123456789 | 2020-03-16 16:59:26 | 2020-03-17 10:58:25
 134232125 | 2020-03-18 08:57:05 | 2020-03-19 01:47:26
The HOURS_BETWEEN function is the cleanest way to find the number of full hours between two timestamps in Db2.
https://www.ibm.com/support/knowledgecenter/en/SSEPGG_11.5.0/com.ibm.db2.luw.sql.ref.doc/doc/r0061478.html
The HOURS_BETWEEN function returns the number of full hours between the specified arguments.
For example
VALUES
( HOURS_BETWEEN('2020-03-17-10.58.25', '2019-03-16-16.59.26')
, HOURS_BETWEEN('2020-03-19-01.47.26', '2019-03-18-08.57.05')
)
returns
   1|   2
----|----
8801|8800
Note that the value is negative if the first value is less than the second value in the function.
Also note that this function does not exist in versions of Db2 (for LUW) lower than 11.1.
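On those older versions, a common workaround is to derive the elapsed seconds from DAYS and MIDNIGHT_SECONDS and divide by 3600; a sketch only, reusing the sample values above:
VALUES
  ( (DAYS('2020-03-17-10.58.25') - DAYS('2019-03-16-16.59.26')) * 86400
    + MIDNIGHT_SECONDS('2020-03-17-10.58.25') - MIDNIGHT_SECONDS('2019-03-16-16.59.26')
  ) / 3600   -- integer division gives full hours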
You can use the TIMESTAMPDIFF() function with numeric-expression equal to 8, which stands for hours:
SELECT job_number, TIMESTAMPDIFF(8,CHAR(timestamp2 - timestamp1)) AS ts_diff
FROM T
Demo
The TIMESTAMPDIFF function has quite a specific implementation: it estimates the difference assuming 365-day years and 30-day months, so the result can deviate from the true number of elapsed hours. See Table 3. TIMESTAMPDIFF computations.
I've set an earlier year (2019) for the timestamp 1 values deliberately to make that deviation visible.
WITH TAB (job_number, timestamp1, timestamp2) AS
(
VALUES
(123456789, TIMESTAMP('2019-03-16-16.59.26'), TIMESTAMP('2020-03-17-10.58.25'))
, (134232125, TIMESTAMP('2019-03-18-08.57.05'), TIMESTAMP('2020-03-19-01.47.26'))
)
SELECT job_number
, TIMESTAMPDIFF(8, CHAR(TIMESTAMP2 - TIMESTAMP1)) HOURS_TSDIFF
, ((DAYS(TIMESTAMP2) - DAYS(TIMESTAMP1)) * 86400 + MIDNIGHT_SECONDS(TIMESTAMP2) - MIDNIGHT_SECONDS(TIMESTAMP1)) / 3600 HOURS_REAL
FROM TAB;
The result is:
|JOB_NUMBER |HOURS_TSDIFF|HOURS_REAL |
|-----------|------------|-----------|
|123456789  |8777        |8801       |
|134232125  |8776        |8800       |
I want to filter some data by both yyyymmdd (date) and hhmmss (time), but ClickHouse doesn't support a time type. So I chose DateTime to combine them. But how do I do things like the following:
This is DolphinDB code (DolphinDB supports a second type to represent hhmmss):
select avg(ofr + bid) / 2.0 as avg_price
from taq
where
date between 2007.08.05 : 2007.08.07,
time between 09:30:00 : 16:00:00
group by symbol, date
This is the ClickHouse code, but it is logically problematic:
SELECT avg(ofr + bid) / 2.0 AS avg_price
FROM taq
WHERE
time BETWEEN '2007-08-05 09:30:00' AND '2007-08-07 16:00:00'
GROUP BY symbol, toYYYYMMDD(time)
;
How can I express it in SQL just like the DolphinDB code?
Assuming that you just want to average the trading price during normal trading hours, excluding after-hours trading, a possible solution:
SELECT avg(ofr + bid) / 2.0 AS avg_price
FROM taq
WHERE
toYYYYMMDD(time) BETWEEN 20070805 AND 20070807 AND
toYYYYMMDDhhmmss(time)%1000000 BETWEEN 93000 and 160000
GROUP BY symbol, toYYYYMMDD(time)
This filters the taq table by the specified date and time ranges.
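An equivalent way to write the date part of the filter, if preferred, is with toDate(); this variant is an untested sketch:
SELECT avg(ofr + bid) / 2.0 AS avg_price
FROM taq
WHERE
    toDate(time) BETWEEN '2007-08-05' AND '2007-08-07' AND
    toYYYYMMDDhhmmss(time) % 1000000 BETWEEN 93000 AND 160000
GROUP BY symbol, toDate(time)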
The gem we have installed (Blazer) on our site limits us to one query.
We are trying to write a query to show how many hours each employee has for the past 10 days. The first column would have employee names and the rest would have hours with the column header being each date. I'm having trouble figuring out how to make the column headers dynamic based on the day. The following is an example of what we have working without dynamic column headers and only using 3 days.
SELECT
pivot_table.*
FROM
crosstab(
E'SELECT
"User",
"Date",
"Hours"
FROM
(SELECT
"q"."qdb_users"."name" AS "User",
to_char("qdb_works"."date", \'YYYY-MM-DD\') AS "Date",
sum("qdb_works"."hours") AS "Hours"
FROM
"q"."qdb_works"
LEFT OUTER JOIN
"q"."qdb_users" ON
"q"."qdb_users"."id" = "q"."qdb_works"."qdb_user_id"
WHERE
"qdb_works"."date" > current_date - 20
GROUP BY
"User",
"Date"
ORDER BY
"Date" DESC,
"User" DESC) "x"
ORDER BY 1, 2')
AS
pivot_table (
"User" VARCHAR,
"2017-10-06" FLOAT,
"2017-10-05" FLOAT,
"2017-10-04" FLOAT
);
This results in
| User | 2017-10-05 | 2017-10-04 | 2017-10-03 |
|------|------------|------------|------------|
| John | 1.5 | 3.25 | 2.25 |
| Jill | 6.25 | 6.25 | 6 |
| Bill | 2.75 | 3 | 4 |
This is correct, but tomorrow, the column headers will be off unless we update the query every day. I know we could pivot this table with date on the left and names on the top, but that will still need updating with each new employee – and we get new ones often.
We have tried using functions and queries in the "AS" section with no luck. For example:
AS
pivot_table (
"User" VARCHAR,
current_date - 0 FLOAT,
current_date - 1 FLOAT,
current_date - 2 FLOAT
);
Is there any way to pull this off with one query?
You could select a row for each user, and then per column sum the hours for one day:
with user_work as
(
  select u.name as "user"
       , to_char(w.date, 'YYYY-MM-DD') as dt_str
       , w.hours
  from qdb_works w
  join qdb_users u
    on u.id = w.qdb_user_id
  where w.date >= current_date - interval '2 days'
)
select "user"
     , sum(case when dt_str = to_char(current_date,
            'YYYY-MM-DD') then hours end) as Today
     , sum(case when dt_str = to_char(current_date - interval '1 day',
            'YYYY-MM-DD') then hours end) as Yesterday
     , sum(case when dt_str = to_char(current_date - interval '2 days',
            'YYYY-MM-DD') then hours end) as DayBeforeYesterday
from user_work
group by
  "user"
It's often easier to return a list and pivot it client side. That also allows you to generate column names with a date.
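For reference, the long form that a client could pivot might look like this; a sketch that assumes the table and column names from the question:
select u.name       as "User"
     , w.date::date as "Date"
     , sum(w.hours) as "Hours"
from q.qdb_works w
join q.qdb_users u
  on u.id = w.qdb_user_id
where w.date > current_date - 10
group by u.name, w.date::date
order by w.date::date desc, u.name;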
Is there any way to pull this off with one query?
No, because a fixed SQL query cannot have any variability in its output columns. The SQL engine determines the number, types and names of every column of a query before executing it, without reading any data except in the catalog (for the structure of tables and other objects), execution being just the last of 5 stages.
A single-query dynamic pivot, if such a thing existed, couldn't be prepared, since a prepared query always has the same result structure, whereas by definition a dynamic pivot doesn't, as the rows that pivot into columns can change between executions. That would again be at odds with the Prepare-Bind-Execute model.
You may find some limited workarounds and additional explanations in other questions, for example: Execute a dynamic crosstab query, but since you mentioned specifically:
The gem we have installed (Blazer) on our site limits us to one query
I'm afraid you're out of luck. Whatever the workaround, it always needs at least one step with a query that figures out the columns and generates a dynamic query from them, and a second step that executes the query generated in the previous step.
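To illustrate those two steps: a first query can generate the column definition list for crosstab(), which then has to be spliced into a dynamically built second statement and executed separately. This is only a sketch of the idea:
-- step 1: build the output column definitions for the last 10 days
SELECT string_agg(format('%I FLOAT', d::date), ', ' ORDER BY d DESC) AS crosstab_columns
FROM generate_series(current_date - 9, current_date, interval '1 day') AS g(d);
-- step 2 (the part a single Blazer query cannot do): interpolate that result into
-- crosstab(...) AS pivot_table ("User" VARCHAR, <crosstab_columns>) and execute
-- the generated statement.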
While I believe this is a fairly general SQL question, I am working in PostgreSQL 9.4 without an option to use other database software, and thus request that any answer be compatible with its capabilities.
I need to be able to return multiple aggregate totals from one query, such that each sum is in a new row, and each of the groupings is determined by a unique span of time, e.g. WHERE time_stamp BETWEEN '2016-02-07' AND '2016-02-14'. The number of records that satisfy the WHERE clause is unknown and may be zero, in which case ideally the result is "0". This is what I have worked out so far:
(
SELECT SUM(minutes) AS min
FROM downtime
WHERE time_stamp BETWEEN '2016-02-07' AND '2016-02-14'
)
UNION ALL
(
SELECT SUM(minutes)
FROM downtime
WHERE time_stamp BETWEEN '2016-02-14' AND '2016-02-21'
)
UNION ALL
(
SELECT SUM(minutes)
FROM downtime
WHERE time_stamp BETWEEN '2016-02-28' AND '2016-03-06'
)
UNION ALL
(
SELECT SUM(minutes)
FROM downtime
WHERE time_stamp BETWEEN '2016-03-06' AND '2016-03-13'
)
UNION ALL
(
SELECT SUM(minutes)
FROM downtime
WHERE time_stamp BETWEEN '2016-03-13' AND '2016-03-20'
)
UNION ALL
(
SELECT SUM(minutes)
FROM downtime
WHERE time_stamp BETWEEN '2016-03-20' AND '2016-03-27'
)
Result:
   | min
---+-----
 1 | 119
 2 |   4
 3 |  30
 4 |
 5 |  62
 6 | 350
That query gets me almost the exact result that I want; certainly good enough in that I can do exactly what I need with the results. Time spans with no records are blank but that was predictable, and whereas I would prefer "0" I can account for the blank rows in software.
But, while it isn't terrible for the 6 weeks that it represents, I want to be flexible and to be able to do the same thing for different time spans, and for a different number of data points, such as each day in a week, each week in 3 months, 6 months, each month in 1 year, 2 years, etc... As written above, it feels as if it is going to get tedious fast... for instance 1 week spans over a 2 year period is 104 sub-queries.
What I'm after is a more elegant way to get the same (or similar) result.
I also don't know if doing 104 iterations of a similar query to the above (vs. the 6 that it does now) is a particularly efficient usage.
Ultimately I am going to write some code which will help me build (and thus abstract away) the long, ugly query--but it would still be great to have a more concise and scale-able query.
In Postgres, you can generate a series of times and then use these for the aggregation:
select g.dte, coalesce(sum(dt.minutes), 0) as minutes
from generate_series('2016-02-07'::timestamp, '2016-03-20'::timestamp, interval '7 day') g(dte) left join
downtime dt
on dt.time_stamp >= g.dte and dt.time_stamp < g.dte + interval '7 day'
group by g.dte
order by g.dte;
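The same pattern adapts to other granularities by changing the series step and the matching bucket width; for example, daily buckets (an untested sketch):
select g.dte::date as day, coalesce(sum(dt.minutes), 0) as minutes
from generate_series('2016-02-07'::timestamp, '2016-03-26'::timestamp, interval '1 day') g(dte)
left join downtime dt
  on dt.time_stamp >= g.dte and dt.time_stamp < g.dte + interval '1 day'
group by g.dte
order by g.dte;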