BigQuery - Query for each and set elements in column - sql

I would like to loop over several elements in a query.
Here is the query:
SELECT
timestamp_trunc(timestamp, DAY) as Day,
count(1) as Number
FROM `table`
WHERE user_id="12345" AND timestamp >= '2021-07-05 00:00:00 UTC' AND timestamp <= '2021-07-08 23:59:59 UTC'
GROUP BY 1
ORDER BY Day
So for user "12345" I have a row count for each day between two dates, which is perfect.
But I would like to run this query for every user_id in my table,
and if possible with each day as a column, so that each row is a user and each column (a day) holds that user's count.
Result wanted :
User | 2021-07-05 | 2021-07-06 | 2021-07-07
---------------------------------------------
user_1 | 345 | 16 | 41
user_2 | 555 | 53 | 26
Thank you very much

Use the approach below:
SELECT * FROM (
SELECT
user_id,
DATE(timestamp) as Day,
COUNT(1) as Number
FROM `project.dataset.table`
WHERE timestamp >= '2021-07-05 00:00:00 UTC' AND timestamp <= '2021-07-08 23:59:59 UTC'
GROUP BY 1, 2
)
PIVOT (SUM(Number) FOR Day IN ('2021-07-05','2021-07-06','2021-07-07'))
Or even simpler (without the GROUP BY from your original query):
SELECT * FROM (
SELECT
user_id,
DATE(timestamp) as Day
FROM `project.dataset.table`
WHERE timestamp >= '2021-07-05 00:00:00 UTC' AND timestamp <= '2021-07-08 23:59:59 UTC'
)
PIVOT (COUNT(*) FOR Day IN ('2021-07-05','2021-07-06','2021-07-07'))
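If the list of days isn't known in advance, one option (a sketch, not from the original answer) is to build the IN list dynamically with EXECUTE IMMEDIATE; `project.dataset.table` and the date range are placeholders as above:
DECLARE day_list STRING;

-- build a quoted, comma-separated list of the days present in the range
SET day_list = (
  SELECT STRING_AGG(DISTINCT FORMAT("'%t'", DATE(timestamp)), ',')
  FROM `project.dataset.table`
  WHERE timestamp >= '2021-07-05 00:00:00 UTC' AND timestamp <= '2021-07-08 23:59:59 UTC'
);

-- run the same pivot with the generated list
EXECUTE IMMEDIATE FORMAT("""
SELECT * FROM (
  SELECT user_id, DATE(timestamp) AS Day
  FROM `project.dataset.table`
  WHERE timestamp >= '2021-07-05 00:00:00 UTC' AND timestamp <= '2021-07-08 23:59:59 UTC'
)
PIVOT (COUNT(*) FOR Day IN (%s))
""", day_list);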

Related

Query that counts total records per day and total records with the same timestamp and id per day in BigQuery

I have timeseries data like this:
time                    | id | value
------------------------------------
2018-04-25 22:00:00 UTC | A  | 1
2018-04-25 23:00:00 UTC | A  | 2
2018-04-25 23:00:00 UTC | A  | 2.1
2018-04-25 23:00:00 UTC | B  | 1
2018-04-26 23:00:00 UTC | B  | 1.3
How do I write a query to produce an output table with these columns:
date: the truncated time
records: the number of records during this date
records_conflicting_time_id: the number of records during this date where the combination of (time, id) is not unique. In the example data above, the two records with id == A at 2018-04-25 23:00:00 UTC would be counted for date 2018-04-25
So the output of our query should be:
date       | records | records_conflicting_time_id
----------------------------------------------------
2018-04-25 | 4       | 2
2018-04-26 | 1       | 0
Getting records is easy: I just truncate the time to get the date and then group by date. But I'm really struggling to produce a column that counts the number of records where the (time, id) combination is not unique within that date...
Consider the approach below:
select date(time) date,
sum(cnt) records,
sum(if(cnt > 1, cnt, 0)) conflicting_records
from (
select time, id, count(*) cnt
from your_table
group by time, id
)
group by date
If applied to the sample data in your question, the output is:
date       | records | conflicting_records
--------------------------------------------
2018-04-25 | 4       | 2
2018-04-26 | 1       | 0
with YOUR_DATA as (
  select cast('2018-04-25 22:00:00 UTC' as timestamp) as `time`, 'A' as id, 1.0 as value
  union all select cast('2018-04-25 23:00:00 UTC' as timestamp) as `time`, 'A' as id, 2.0 as value
  union all select cast('2018-04-25 23:00:00 UTC' as timestamp) as `time`, 'A' as id, 2.1 as value
  union all select cast('2018-04-25 23:00:00 UTC' as timestamp) as `time`, 'B' as id, 1.0 as value
  union all select cast('2018-04-26 23:00:00 UTC' as timestamp) as `time`, 'B' as id, 1.3 as value
)
select cast(timestamp_trunc(t1.`time`, day) as date) as `date`,
       count(*) as records,
       case when count(*) - count(distinct cast(t1.`time` as string) || t1.id) = 0 then 0
            else count(*) - count(distinct cast(t1.`time` as string) || t1.id) + 1
       end as records_conflicting_time_id
from YOUR_DATA t1
group by cast(timestamp_trunc(t1.`time`, day) as date)
;
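For a self-contained check, the first approach above can be run directly against the same YOUR_DATA sample (a sketch in BigQuery Standard SQL):
with YOUR_DATA as (
  select cast('2018-04-25 22:00:00 UTC' as timestamp) as `time`, 'A' as id, 1.0 as value
  union all select cast('2018-04-25 23:00:00 UTC' as timestamp) as `time`, 'A' as id, 2.0 as value
  union all select cast('2018-04-25 23:00:00 UTC' as timestamp) as `time`, 'A' as id, 2.1 as value
  union all select cast('2018-04-25 23:00:00 UTC' as timestamp) as `time`, 'B' as id, 1.0 as value
  union all select cast('2018-04-26 23:00:00 UTC' as timestamp) as `time`, 'B' as id, 1.3 as value
)
-- aggregate per (time, id) first, then roll duplicated combinations up per day
select date(`time`) as `date`,
       sum(cnt) as records,
       sum(if(cnt > 1, cnt, 0)) as records_conflicting_time_id
from (
  select `time`, id, count(*) as cnt
  from YOUR_DATA
  group by `time`, id
)
group by `date`;
This returns 2018-04-25 | 4 | 2 and 2018-04-26 | 1 | 0, matching the expected output.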

Google BigQuery different results based on same date filter

Edit 1: so the issue is that '<=' is acting as '<' in the Google query, which is strange. But '>=' acts normally. Any idea why this is happening?
Goal: to get data for May 2019.
Info about database here: https://packaging.python.org/en/latest/guides/analyzing-pypi-package-downloads/
Query 1 uses timestamp > '2019-04-30' AND timestamp < '2019-06-01'
SELECT file.project AS package, COUNT(file.project) AS installs, FORMAT_DATETIME('%Y-%m', timestamp) AS month
FROM `bigquery-public-data.pypi.file_downloads`
WHERE timestamp > '2019-04-30' AND timestamp < '2019-06-01'
GROUP BY month, package;
Query 2 uses timestamp >= '2019-05-01' AND timestamp <= '2019-05-31'
SELECT file.project AS package, COUNT(file.project) AS installs, FORMAT_DATETIME('%Y-%m', timestamp) AS month
FROM `bigquery-public-data.pypi.file_downloads`
WHERE timestamp >= '2019-05-01' AND timestamp <= '2019-05-31'
GROUP BY month, package;
Both queries should scan the same amount of data (May 2019), but they give different results and scan different amounts of data, as you can see in the attached images.
Which one is correct and why both are not matching?
You're comparing timestamp with a date literal. When a date literal is implicitly cast as timestamp, it will have '00:00:00' time.
Query 1 uses timestamp > '2019-04-30' AND timestamp < '2019-06-01'
This is same as
timestamp > '2019-04-30 00:00:00 UTC' AND timestamp < '2019-06-01 00:00:00 UTC'
which includes the data on 2019-04-30 (everything after '2019-04-30 00:00:00 UTC'), even though you only want May.
Query 2 uses timestamp >= '2019-05-01' AND timestamp <= '2019-05-31'
same as
timestamp >= '2019-05-01 00:00:00 UTC' AND timestamp <= '2019-05-31 00:00:00 UTC'
in this case, you're missing the data on 2019-05-31 after midnight (everything after '2019-05-31 00:00:00 UTC'), which is incorrect.
Correct Condition
You might want to use:
timestamp >= '2019-05-01' AND timestamp < '2019-06-01'
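Applied to your second query, the corrected condition looks like this:
SELECT file.project AS package, COUNT(file.project) AS installs, FORMAT_DATETIME('%Y-%m', timestamp) AS month
FROM `bigquery-public-data.pypi.file_downloads`
WHERE timestamp >= '2019-05-01' AND timestamp < '2019-06-01'
GROUP BY month, package;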
Note that since the BETWEEN condition is inclusive at both ends, the following conditions will not be what you want either.
WHERE timestamp BETWEEN '2019-05-01' AND '2019-05-31' --> this ignores data on the last day of May except exactly '2019-05-31 00:00:00 UTC'.
or
WHERE timestamp BETWEEN '2019-05-01' AND '2019-06-01' --> this includes the '2019-06-01 00:00:00 UTC' data, as the query below shows.
SELECT EXTRACT(MONTH FROM timestamp) month, COUNT(1) cnt
FROM `bigquery-public-data.pypi.file_downloads`
WHERE timestamp BETWEEN '2019-05-01' AND '2019-06-01' -- scan 22.57 GB
GROUP BY 1
(update)
SELECT EXTRACT(DAY FROM timestamp) day, COUNT(1) cnt
FROM `bigquery-public-data.pypi.file_downloads`
WHERE timestamp BETWEEN '2019-05-29' AND '2019-05-31'
GROUP BY 1
;
output:
+-----+-----+-----------+
| Row | day | cnt |
+-----+-----+-----------+
| 1 | 30 | 116744449 |
| 2 | 29 | 120865824 |
| 3 | 31 | 1027 | -- should be 112116613
+-----+-----+-----------+
The two filters are different; you can check the difference in the results with the script below.
Differences
SELECT timestamp, FORMAT_DATETIME('%Y-%m', timestamp) AS month
FROM `bigquery-public-data.pypi.file_downloads`
WHERE
timestamp > '2019-04-30' AND timestamp < '2019-06-01'
AND NOT (timestamp >= '2019-05-01' AND timestamp <= '2019-05-31')
;
Personal Preference
SELECT file.project AS package, COUNT(file.project) AS installs, FORMAT_DATETIME('%Y-%m', timestamp) AS month
FROM `bigquery-public-data.pypi.file_downloads`
WHERE timestamp BETWEEN '2019-05-01' AND '2019-05-31'
P.S. As you can check in the docs, the logical evaluation order of standard SQL is shown below. The WHERE filter happens before the SELECT, so you might want to store the result of the SELECT statement and then filter that result by date, not datetime.
FROM -> WHERE -> GROUP BY -> HAVING -> ...
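Alternatively (a sketch, not part of the original answer), you can do the conversion inside the WHERE clause itself, so the comparison is between dates rather than timestamps; note that this may change how much data is scanned on a partitioned table:
SELECT file.project AS package, COUNT(file.project) AS installs, FORMAT_DATETIME('%Y-%m', timestamp) AS month
FROM `bigquery-public-data.pypi.file_downloads`
WHERE DATE(timestamp) BETWEEN '2019-05-01' AND '2019-05-31' -- DATE comparison is inclusive and covers all of May 31
GROUP BY month, package;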

How to convert unix timestamp and aggregate min and max date in Oracle SQL Developer?

I have a table in Oracle SQL like below:
ID | date | place
-----------------------------
123 | 1610295784376 | OBJ_1
444 | 1748596758291 | OBJ_1
567 | 8391749204754 | OBJ_2
888 | 1747264526789 | OBJ_3
ID - ID of client
date - date in Unix timestamp in UTC
place - place of contact with client
And I need to aggregate the above data to achieve the results below, so I need to:
convert the Unix timestamp in UTC from column "date" to a normal date, as below
calculate the min and max date for each value from column "place"
min_date   | max_date   | distinct_place
------------------------------------------
2022-01-05 | 2022-02-15 | OBJ_1
2022-02-10 | 2022-03-20 | OBJ_2
2021-10-15 | 2021-11-21 | OBJ_3
You can use:
SELECT TIMESTAMP '1970-01-01 00:00:00 UTC'
+ MIN(date_column) * INTERVAL '0.001' SECOND(3)
AS min_date,
TIMESTAMP '1970-01-01 00:00:00 UTC'
+ MAX(date_column) * INTERVAL '0.001' SECOND(3)
AS max_date,
place
FROM table_name
GROUP BY place;
Note: the (3) after SECOND is optional and will just explicitly specify the precision of the fractional seconds.
or:
SELECT TIMESTAMP '1970-01-01 00:00:00 UTC'
+ NUMTODSINTERVAL( MIN(date_column) / 1000, 'SECOND')
AS min_date,
TIMESTAMP '1970-01-01 00:00:00 UTC'
+ NUMTODSINTERVAL( MAX(date_column) / 1000, 'SECOND')
AS max_date,
place
FROM table_name
GROUP BY place;
Which, for the sample data:
CREATE TABLE table_name (ID, date_column, place) AS
SELECT 123, 1610295784376, 'OBJ_1' FROM DUAL UNION ALL
SELECT 444, 1748596758291, 'OBJ_1' FROM DUAL UNION ALL
SELECT 567, 1391749204754, 'OBJ_2' FROM DUAL UNION ALL -- Fixed leading digit
SELECT 888, 1747264526789, 'OBJ_3' FROM DUAL;
Both output:
MIN_DATE                          | MAX_DATE                          | PLACE
--------------------------------------------------------------------------------
2021-01-10 16:23:04.376000000 UTC | 2025-05-30 09:19:18.291000000 UTC | OBJ_1
2014-02-07 05:00:04.754000000 UTC | 2014-02-07 05:00:04.754000000 UTC | OBJ_2
2025-05-14 23:15:26.789000000 UTC | 2025-05-14 23:15:26.789000000 UTC | OBJ_3
db<>fiddle here
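If you only want the day (as in your expected output) rather than the full timestamp, one option, a sketch along the same lines, is to cast the result to DATE and truncate the time portion; how it displays then depends on your NLS_DATE_FORMAT:
SELECT TRUNC(CAST(TIMESTAMP '1970-01-01 00:00:00 UTC'
                  + MIN(date_column) * INTERVAL '0.001' SECOND AS DATE)) AS min_date,
       TRUNC(CAST(TIMESTAMP '1970-01-01 00:00:00 UTC'
                  + MAX(date_column) * INTERVAL '0.001' SECOND AS DATE)) AS max_date,
       place
FROM   table_name
GROUP BY place;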

Group by with Unix time stamps

I am trying to write a query where the timestamps are in Unix format.
The objective of the query is to group these timestamps into five-minute segments and to count each unique Id in those segments.
Is there a simple way of doing this?
The result I am looking for is this:
Time_utc        | Id | count
-----------------------------
25/07/2019 1600 | 1  | 3
25/07/2019 1600 | 2  | 1
25/07/2019 1605 | 1  | 4
You haven't shown your data, so as a starting point you can group the Unix timestamps by dividing by 300 (five minutes' worth of seconds):
select 300 * floor(unix_ts/300) as unix_five_minute,
timestamp '1970-01-01 00:00:00 UTC'
+ (300*floor(unix_ts/300)) * interval '1' second as oracle_timestamp,
count(*)
from cte2
group by floor(unix_ts/300);
or if you have millisecond precision adjust by a factor of 1000:
select 300000 * floor(unix_ts/300000) as unix_five_minute,
timestamp '1970-01-01 00:00:00 UTC'
+ (300*floor(unix_ts/300000)) * interval '1' second as oracle_timestamp,
count(*)
from cte2
group by floor(unix_ts/300000);
Demo using made-up data generated from current time:
-- CTEs to generate some sample data
with cte1 (oracle_interval) as (
select systimestamp - level * interval '42' second
- timestamp '1970-01-01 00:00:00.0 UTC'
from dual
connect by level <= 30
),
cte2 (unix_ts) as (
select trunc(
extract(day from oracle_interval) * 86400000
+ extract(hour from oracle_interval) * 3600000
+ extract(minute from oracle_interval) * 60000
+ extract(second from oracle_interval) * 1000
)
from cte1
)
-- actual query
select 300000 * floor(unix_ts/300000) as unix_five_minute,
timestamp '1970-01-01 00:00:00 UTC'
+ (300*floor(unix_ts/300000)) * interval '1' second as oracle_timestamp,
count(*)
from cte2
group by floor(unix_ts/300000);
UNIX_FIVE_MINUTE ORACLE_TIMESTAMP COUNT(*)
---------------- ------------------------- ----------------
1564072500000 2019-07-25 16:35:00.0 UTC 7
1564072200000 2019-07-25 16:30:00.0 UTC 7
1564071600000 2019-07-25 16:20:00.0 UTC 4
1564071900000 2019-07-25 16:25:00.0 UTC 8
1564072800000 2019-07-25 16:40:00.0 UTC 4
Unix time stamps such as 155639.600 or 155639.637
Those are unusual values; Unix/epoch times are usually 10-digit numbers, or 13 digits for millisecond precision. Assuming (or rather, guessing) that they are tenths of a second for some reason:
-- CTE for sample data
with cte (unix_ts) as (
select 155639.600 from dual
union all
select 155639.637 from dual
)
-- actual query
select 300 * floor(unix_ts*10000/300) as unix_five_minute,
timestamp '1970-01-01 00:00:00 UTC'
+ (300*floor(unix_ts*10000/300)) * interval '1' second as oracle_timestamp,
count(*)
from cte
group by floor(unix_ts*10000/300);
UNIX_FIVE_MINUTE ORACLE_TIMESTAMP COUNT(*)
---------------- ------------------------- ----------------
1556396100 2019-04-27 20:15:00.0 UTC 1
1556395800 2019-04-27 20:10:00.0 UTC 1
The 10000/300 could be simplified to 100/3, but I think it's clearer left as it is.
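To match the desired output, which also splits each five-minute bucket by Id, the same bucketing can simply be grouped by the id column too. A sketch, assuming a table your_table(unix_ts, id) with epoch seconds (adjust the factor for milliseconds as above):
select timestamp '1970-01-01 00:00:00 UTC'
       + (300 * floor(unix_ts/300)) * interval '1' second as time_utc,
       id,
       count(*) as cnt   -- rows per (five-minute bucket, Id)
from your_table
group by floor(unix_ts/300), id
order by time_utc, id;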

Is there a way to group timestamp data by 30 day intervals starting from the min(date) and add them as columns

I am trying to use the min() value of a timestamp as a starting point and then group the data into 30-day intervals, in order to get a count of occurrences for each unique value within the timestamp date range, with the intervals as columns.
I have two tables that I am joining together to get a count. Table 1 (page_creation) has 2 columns, labeled link and dt_crtd. Table 2 (page_visits) has 2 other columns, labeled url and date. The tables are joined on page_creation.link = page_visits.url.
After the join I get a table similar to this:
+-------------------+------------------------+
| url | date |
+-------------------+------------------------+
| www.google.com | 2018-01-01 00:00:00' |
| www.google.com | 2018-01-02 00:00:00' |
| www.google.com | 2018-02-01 00:00:00' |
| www.google.com | 2018-02-05 00:00:00' |
| www.google.com | 2018-03-04 00:00:00' |
| www.facebook.com | 2014-01-05 00:00:00' |
| www.facebook.com | 2014-01-07 00:00:00' |
| www.facebook.com | 2014-04-02 00:00:00' |
| www.facebook.com | 2014-04-10 00:00:00' |
| www.facebook.com | 2014-04-11 00:00:00' |
| www.facebook.com | 2014-05-01 00:00:00' |
| www.twitter.com | 2016-02-01 00:00:00' |
| www.twitter.com | 2016-03-04 00:00:00' |
+-------------------+------------------------+
What I am trying to get is a result like this:
+------------------+----------------------+------------+------------+------------+
| url              | MIN_Date             | Interval 1 | Interval 2 | Interval 3 |
+------------------+----------------------+------------+------------+------------+
| www.google.com   | 2018-01-01 00:00:00' | 2          | 2          | 1          |
| www.facebook.com | 2014-01-05 00:00:00' | 2          | 0          | 1          |
| www.twitter.com  | 2016-02-01 00:00:00' | 1          | 1          | 0          |
+------------------+----------------------+------------+------------+------------+
So the 30 day intervals begin from the min(date) as shown in Interval 1 and are counted every 30 days.
I've looked at other questions, such as:
Group rows by 7 days interval starting from a certain date
MySQL query to select min datetime grouped by 30 day intervals
However, they did not seem to answer my specific problem.
I've also looked into PIVOT syntax but noticed it is only supported by certain DBMSs.
Any help would be greatly appreciated.
Thank you.
If I understood your question correctly, you want to count page visits in the 30-, 60-, and 90-day intervals after page creation. If that's the requirement, try the SQL code below:
select a11.url,
       sum(case when a12.date between a11.dt_crtd and a11.dt_crtd+30 then 1 else 0 end) Interval_1,
       sum(case when a12.date between a11.dt_crtd+31 and a11.dt_crtd+60 then 1 else 0 end) Interval_2,
       sum(case when a12.date between a11.dt_crtd+61 and a11.dt_crtd+90 then 1 else 0 end) Interval_3
from page_creation a11
join page_visits a12
  on a11.link = a12.url
group by a11.url
If you are using BigQuery, I would recommend:
countif() to count a boolean value
timestamp_add() to add intervals to timestamps
The exact boundaries are a bit vague, but I would go for:
select pc.url,
       countif(pv.date >= pc.dt_crtd and
               pv.date < timestamp_add(pc.dt_crtd, interval 30 day)
              ) as Interval_00_29,
       countif(pv.date >= timestamp_add(pc.dt_crtd, interval 30 day) and
               pv.date < timestamp_add(pc.dt_crtd, interval 60 day)
              ) as Interval_30_59,
       countif(pv.date >= timestamp_add(pc.dt_crtd, interval 60 day) and
               pv.date < timestamp_add(pc.dt_crtd, interval 90 day)
              ) as Interval_60_89
from page_creation pc
join page_visits pv
  on pc.link = pv.url
group by pc.url
The way I am reading your scenario, especially the example in "After the join I get a table similar to this", is that you have two tables that you need to UNION, not JOIN.
So, based on that reading, the example below is for BigQuery Standard SQL (project.dataset.page_creation and project.dataset.page_visits are here just to mimic your Table 1 and Table 2):
#standardSQL
WITH `project.dataset.page_creation` AS (
  SELECT 'www.google.com' link, TIMESTAMP '2018-01-01 00:00:00' dt_crtd UNION ALL
  SELECT 'www.facebook.com', '2014-01-05 00:00:00' UNION ALL
  SELECT 'www.twitter.com', '2016-02-01 00:00:00'
), `project.dataset.page_visits` AS (
  SELECT 'www.google.com' url, TIMESTAMP '2018-01-02 00:00:00' dt UNION ALL
  SELECT 'www.google.com', '2018-02-01 00:00:00' UNION ALL
  SELECT 'www.google.com', '2018-02-05 00:00:00' UNION ALL
  SELECT 'www.google.com', '2018-03-04 00:00:00' UNION ALL
  SELECT 'www.facebook.com', '2014-01-07 00:00:00' UNION ALL
  SELECT 'www.facebook.com', '2014-04-02 00:00:00' UNION ALL
  SELECT 'www.facebook.com', '2014-04-10 00:00:00' UNION ALL
  SELECT 'www.facebook.com', '2014-04-11 00:00:00' UNION ALL
  SELECT 'www.facebook.com', '2014-05-01 00:00:00' UNION ALL
  SELECT 'www.twitter.com', '2016-03-04 00:00:00'
), `After the join` AS (
  SELECT url, dt FROM `project.dataset.page_visits` UNION DISTINCT
  SELECT link, dt_crtd FROM `project.dataset.page_creation`
)
SELECT
  url, min_date,
  COUNTIF(dt BETWEEN min_date AND TIMESTAMP_ADD(min_date, INTERVAL 29 DAY)) Interval_1,
  COUNTIF(dt BETWEEN TIMESTAMP_ADD(min_date, INTERVAL 30 DAY) AND TIMESTAMP_ADD(min_date, INTERVAL 59 DAY)) Interval_2,
  COUNTIF(dt BETWEEN TIMESTAMP_ADD(min_date, INTERVAL 60 DAY) AND TIMESTAMP_ADD(min_date, INTERVAL 89 DAY)) Interval_3
FROM (
  SELECT url, dt, MIN(dt) OVER(PARTITION BY url ORDER BY dt) min_date
  FROM `After the join`
)
GROUP BY url, min_date
with the result:
Row | url              | min_date                | Interval_1 | Interval_2 | Interval_3
1   | www.facebook.com | 2014-01-05 00:00:00 UTC | 2          | 0          | 1
2   | www.google.com   | 2018-01-01 00:00:00 UTC | 2          | 2          | 1
3   | www.twitter.com  | 2016-02-01 00:00:00 UTC | 1          | 1          | 0