Adding all columns from multiple tables - SQL

I have a simple question.
I need to count all records from multiple tables by day and hour and combine all of them into a single final table.
So the query for each table is something like this:
select timestamp_trunc(timestamp, day) date, timestamp_trunc(timestamp, hour) hour, count(*) cnt from table_1 group by 1, 2
select timestamp_trunc(timestamp, day) date, timestamp_trunc(timestamp, hour) hour, count(*) cnt from table_2 group by 1, 2
select timestamp_trunc(timestamp, day) date, timestamp_trunc(timestamp, hour) hour, count(*) cnt from table_3 group by 1, 2
and so on and so forth.
I would like to combine all the results, showing the total number of records for each day and hour from these tables.
Expected results will be like this
date, hour, number of records of table 1, number of records of table 2, number of records of table 3 ........
What would be the most efficient SQL query for this?

Probably the simplest way is to union them together and then aggregate:
select timestamp_trunc(timestamp, hour) as hh,
       countif(which = 1) as num_1,
       countif(which = 2) as num_2
from ((select timestamp, 1 as which
       from table_1
      ) union all
      (select timestamp, 2 as which
       from table_2
      ) union all
      . . .
     ) t
group by hh
order by hh;
You are using timestamp_trunc(). It returns a timestamp truncated to the hour -- there is no need to also include the date.
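For instance, truncating to HOUR already carries the date:

```sql
-- TIMESTAMP_TRUNC to HOUR keeps the date portion of the timestamp:
SELECT TIMESTAMP_TRUNC(TIMESTAMP '2018-07-01 13:45:12', HOUR) AS hh
-- hh = 2018-07-01 13:00:00 UTC
```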

Below is for BigQuery Standard SQL
#standardSQL
SELECT
TIMESTAMP_TRUNC(TIMESTAMP, DAY) day,
EXTRACT(HOUR FROM TIMESTAMP) hour,
COUNT(*) cnt,
_TABLE_SUFFIX AS table
FROM `project.dataset.table_*`
GROUP BY day, hour, table
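If you need one column per table, as in the expected output in the question, the wildcard result can be pivoted with COUNTIF over _TABLE_SUFFIX. A sketch, assuming the tables are named table_1 .. table_3 under project.dataset, so the suffixes are '1', '2', '3':

```sql
#standardSQL
-- Pivot per-table counts into one column per table.
SELECT
  TIMESTAMP_TRUNC(timestamp, DAY) AS date,
  EXTRACT(HOUR FROM timestamp) AS hour,
  COUNTIF(_TABLE_SUFFIX = '1') AS records_table_1,
  COUNTIF(_TABLE_SUFFIX = '2') AS records_table_2,
  COUNTIF(_TABLE_SUFFIX = '3') AS records_table_3
FROM `project.dataset.table_*`
GROUP BY date, hour
ORDER BY date, hour
```

Note that this requires one COUNTIF per table, so it only pays off when the set of tables is small and known in advance.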

BigQuery: iterating groups within a window of 28days before a start_date column using _TABLE_SUFFIX

I got a table like this:
group_id  start_date  end_date
19335     20220613    20220714
19527     20220620    20220719
19339     20220614    20220720
19436     20220616    20220715
20095     20220711    20220809
I am trying to retrieve data from another table that is partitioned, and the data should be accessed with _TABLE_SUFFIX BETWEEN start_date AND end_date.
Each group_id contains different user_ids within the period [start_date, end_date]. What I need is to retrieve the users' data for a column/metric over the 28 days prior to the start_date of each group_id.
My idea is to:
Retrieve distinct user_id per group_id within the period [start_date, end_date]
Retrieve previous 28d metric data prior to the start date of each group_id
A snippet code on how to retrieve data from a single group_id is the following:
WITH users_per_group AS (
SELECT
users_metadata.user_id,
users_metadata.group_id
FROM
`my_table_users_*` users_metadata
WHERE
_TABLE_SUFFIX BETWEEN '20220314' --start_date
AND '20220413' --end_date
AND experiment_id = 16709
GROUP BY
1,
2
)
SELECT
_TABLE_SUFFIX AS date,
user_id,
SUM(
COALESCE(metric, 0)
) AS metric
FROM
users_per_group
JOIN `my_metric_table*` metric USING (user_id)
WHERE
_TABLE_SUFFIX BETWEEN FORMAT_TIMESTAMP(
'%Y%m%d',
TIMESTAMP_SUB(
PARSE_TIMESTAMP('%Y%m%d', '20220314'), --start_date
INTERVAL 28 DAY
)
) -- 28 days before it starts
AND FORMAT_TIMESTAMP(
'%Y%m%d',
TIMESTAMP_SUB(
PARSE_TIMESTAMP('%Y%m%d', '20220314'), --start_date
INTERVAL 1 DAY
)
) -- 1 day before it starts
GROUP BY
1,
2
ORDER BY
date ASC
Also, I want to avoid retrieving all data (across all dates) from that metric table, as it is huge and retrieving it would take a very long time.
Is there an easy way to retrieve the metric data of each user across groups, considering the 28 days prior to the start date of each group_id?
I can think of 2 approaches:
Join all the tables and then perform your query.
Create dynamic queries for each of your users.
Both approaches require search_from and search_to to be available beforehand, i.e., you need to calculate each user's search range before you do anything.
EG:
WITH users_per_group AS (
  SELECT
    user_id, group_id,
    DATE_SUB(parse_date("%Y%m%d", start_date), INTERVAL 4 DAY) AS search_from,
    DATE_SUB(parse_date("%Y%m%d", start_date), INTERVAL 1 DAY) AS search_to
  FROM TableName
)
Once you have this kind of table then you can use any of the mentioned approaches.
Since I don't have your data and don't know about your table names I am giving an example using a public dataset.
Approach 1
-- consider this your main table which contains user, grp, start_date, end_date
with maintable as (
  select 'India' visit_from, '20161115' as start_date, '20161202' end_date
  union all select 'Sweden', '20161201', '20161202'
),
-- then calculate the search from/to dates for every user and group
user_per_grp as (
  select *,
    DATE_SUB(parse_date("%Y%m%d", start_date), INTERVAL 4 DAY) search_from, -- change interval as per your need
    DATE_SUB(parse_date("%Y%m%d", start_date), INTERVAL 1 DAY) search_to
  from maintable
)
select visit_from, _TABLE_SUFFIX date, count(visitId) total_visits
from user_per_grp ug
left join `bigquery-public-data.google_analytics_sample.ga_sessions_*` as pub
  on pub.geoNetwork.country = ug.visit_from
where _TABLE_SUFFIX between format_date("%Y%m%d", ug.search_from) and format_date("%Y%m%d", ug.search_to)
group by 1, 2
Approach 2
declare queries array<string> default [];
create temp table maintable as (
select 'India' visit_from, '20161115' as start_date, '20161202' end_date
union all select 'Sweden' , '20161201', '20161202'
);
create temp table user_per_grp as(
select *, DATE_SUB(parse_date("%Y%m%d", start_date), INTERVAL 4 DAY)search_from
,DATE_SUB(parse_date("%Y%m%d", start_date), INTERVAL 1 DAY)search_to
from maintable
);
-- for each user create a separate query here
FOR record IN (SELECT * from user_per_grp)
DO
set queries = queries || [format('select "%s" Visit_From,_TABLE_SUFFIX Date,count(visitId) total_visits from `bigquery-public-data.google_analytics_sample.ga_sessions_*` where _TABLE_SUFFIX between format_date("%%Y%%m%%d","%t") and format_date("%%Y%%m%%d","%t") and geoNetwork.country="%s" group by 1,2',record.visit_from,record.search_from,record.search_to,record.visit_from)];
--replace your query here.
END FOR;
--aggregating all the queries and executing it
execute immediate (select string_agg(query, ' union all ') from unnest(queries) query);
Here the 2nd approach processed much less data (~750 KB) than the 1st approach (~17 MB). But that might not hold for your dataset, as the date ranges of two users may overlap, which leads to reading the same table twice.

Appending the result query in bigquery

I am writing a query in BigQuery where the result should accumulate the data from previous dates.
So the result for today will be higher than yesterday's, as the data accumulates day by day.
So far, I have only managed to get the data by individual days (where you can see the number of IDs declining rather than accumulating from the previous day), as in this result:
What should I do to add an accumulating function to the query, so that each day's result includes the data from the previous days?
code:
WITH
table1 AS (
SELECT
ID,
...
FROM t
WHERE date BETWEEN DATE_SUB('2020-01-31', INTERVAL 31 DAY) AND '2020-01-31'
),
table2 AS (
SELECT
ID,
COUNTIF(rating < 7) AS bad,
COUNTIF(rating >= 7 AND SAFE_CAST(NPS_Rating AS INT64) < 9) AS intermediate,
COUNTIF(SAFE_CAST(NPS_Rating AS INT64) >= 9) AS good
FROM
t
WHERE date BETWEEN DATE_SUB('2020-01-31', INTERVAL 31 DAY) AND '2020-01-31'
GROUP BY ID
)
SELECT
DATE_SUB('2020-01-31', INTERVAL 31 DAY) as date,
*
FROM table1
FULL OUTER JOIN table2 USING (ID)
If you have counts that you want to accumulate, then you want a cumulative sum. The query would look something like this:
select datecol, count(*), sum(count(*)) over (order by datecol)
from t
group by datecol
order by datecol;
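A worked sketch of that cumulative sum, with hypothetical inline data (the table name t and column datecol are placeholders carried over from the answer):

```sql
-- Hypothetical data: one row per event, with its date.
WITH t AS (
  SELECT DATE '2020-01-01' AS datecol UNION ALL
  SELECT DATE '2020-01-01' UNION ALL
  SELECT DATE '2020-01-02' UNION ALL
  SELECT DATE '2020-01-03' UNION ALL
  SELECT DATE '2020-01-03'
)
SELECT
  datecol,
  COUNT(*) AS daily_count,                               -- 2, 1, 2
  SUM(COUNT(*)) OVER (ORDER BY datecol) AS running_total -- 2, 3, 5
FROM t
GROUP BY datecol
ORDER BY datecol;
```

The key point is that the window SUM runs over the already-grouped per-day counts, which is what makes each day's total include all previous days.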

Group by for each row in bigquery

I have a table that stores user comments for each month. Comments are stored using UTC timestamps. I want to get the users that post more than 20 comments per day. I am able to get the start and end timestamps for each day, but I can't group the comments table by number of comments.
This is the script that I have for getting dates, timestamps and distinct users.
SELECT
DATE(TIMESTAMP_SECONDS(r.ts_start)) AS date,
r.ts_start AS timestamp_start,
r.ts_start+86400 AS timestamp_end,
COUNT(*) AS number_of_comments,
COUNT(DISTINCT s.author) AS distinct_authors
FROM ((
WITH
shifts AS (
SELECT
[STRUCT(" 00:00:00 UTC" AS hrs,
GENERATE_DATE_ARRAY('2018-07-01','2018-07-31', INTERVAL 1 DAY) AS dt_range) ] AS full_timestamps )
SELECT
UNIX_SECONDS(CAST(CONCAT( CAST(dt AS STRING), CAST(hrs AS STRING)) AS TIMESTAMP)) AS ts_start,
UNIX_SECONDS(CAST(CONCAT( CAST(dt AS STRING), CAST(hrs AS STRING)) AS TIMESTAMP)) + 86400 AS ts_end
FROM
shifts,
shifts.full_timestamps
LEFT JOIN
full_timestamps.dt_range AS dt)) r
INNER JOIN
`user_comments.2018_07` s
ON
(s.created_utc BETWEEN r.ts_start
AND r.ts_end)
GROUP BY
r.ts_start
ORDER BY
number_of_comments DESC
And this is the sample output 1:
The user_comments.2018_07 table is as the following:
More concretely, I want the first output to have one more column showing the number of authors with more than 20 comments for each date. How can I do that?
If the goal is only to get the number of users with more than twenty comments for each day from table user_comments.2018_07, and add it to the output you have so far, this should simplify the query you first used. So long as you're not attached to keeping the min/max timestamps for each day.
with nb_comms_per_day_per_user as (
SELECT
day,
author,
COUNT(*) as nb_comments
FROM
# unnest as we don't really want an array
unnest(GENERATE_DATE_ARRAY('2018-07-01','2018-07-31', INTERVAL 1 DAY)) AS day
INNER JOIN `user_comments.2018_07` c
on
# directly convert timestamp to a date, without using min/max timestamp
date(timestamp_seconds(created_utc))
=
day
GROUP BY day, c.author
)
SELECT
day,
sum(nb_comments) as total_comments,
count(*) as distinct_authors, # we have already grouped by author
# sum + if enables to count "very active" users
sum(if(nb_comments > 20, 1, 0)) as very_active_users
FROM nb_comms_per_day_per_user
GROUP BY day
ORDER BY total_comments desc
Also, I assumed the boolean column comment is not needed, as you do not use it in your initial query.

Google BigQuery: Rolling Count Distinct

I have a table which is simply a list of dates and user IDs (not aggregated).
We define a metric called active users for a given date by counting the distinct number of IDs that appear in the previous 45 days.
I am trying to run a query in BigQuery that, for each day, returns the day plus the number of active users for that day (count distinct user from 45 days ago until today).
I have experimented with window functions, but can't figure out how to define a range based on the date values in a column. Instead, I believe the following query would work in a database like MySQL, but does not in BigQuery.
SELECT
day,
(SELECT
COUNT(DISTINCT visid)
FROM daily_users
WHERE day BETWEEN DATE_ADD(t.day, -45, "DAY") AND t.day
) AS active_users
FROM daily_users AS t
GROUP BY 1
This doesn't work in BigQuery: "Subselect not allowed in SELECT clause."
How to do this in BigQuery?
BigQuery documentation claims that count(distinct) works as a window function. However, that doesn't help you, because you are not looking for a traditional window frame.
One method adds a record for each of the 45 days after a visit:
select theday, count(distinct visid)
from (select date_add(u.day, n.n, "day") as theday, u.visid
from daily_users u cross join
(select 1 as n union all select 2 union all . . .
select 45
) n
) u
group by theday;
Note: there may be simpler ways to generate a series of 45 integers in BigQuery.
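Indeed, in BigQuery Standard SQL the helper series can come from GENERATE_ARRAY instead of the long UNION ALL chain. A sketch, keeping the same daily_users table and visid column, and using 0..45 so that the visit day itself is counted:

```sql
#standardSQL
-- Spread each visit over the visit day and the following 45 days,
-- then count distinct visitors per resulting day.
SELECT
  DATE_ADD(u.day, INTERVAL n DAY) AS theday,
  COUNT(DISTINCT u.visid) AS active_users
FROM daily_users u
CROSS JOIN UNNEST(GENERATE_ARRAY(0, 45)) AS n
GROUP BY theday
ORDER BY theday;
```

This assumes day is a DATE column; if it is a string like '2016-01-10', wrap it in PARSE_DATE('%Y-%m-%d', day) first.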
Below should work with BigQuery
#legacySQL
SELECT day, active_users FROM (
SELECT
day,
COUNT(DISTINCT id)
OVER (ORDER BY ts RANGE BETWEEN 45*24*3600 PRECEDING AND CURRENT ROW) AS active_users
FROM (
SELECT day, id, TIMESTAMP_TO_SEC(TIMESTAMP(day)) AS ts
FROM daily_users
)
) GROUP BY 1, 2 ORDER BY 1
Above assumes that the day field is represented in '2016-01-10' format.
If that is not the case, you should adjust TIMESTAMP_TO_SEC(TIMESTAMP(day)) in the innermost select.
Also please take a look at the COUNT(DISTINCT) specifics in BigQuery.
Update for BigQuery Standard SQL
#standardSQL
SELECT
day,
(SELECT COUNT(DISTINCT id) FROM UNNEST(active_users) id) AS active_users
FROM (
SELECT
day,
ARRAY_AGG(id)
OVER (ORDER BY ts RANGE BETWEEN 3888000 PRECEDING AND CURRENT ROW) AS active_users
FROM (
SELECT day, id, UNIX_DATE(PARSE_DATE('%Y-%m-%d', day)) * 24 * 3600 AS ts
FROM daily_users
)
)
GROUP BY 1, 2
ORDER BY 1
You can test / play with it using below dummy sample
#standardSQL
WITH daily_users AS (
SELECT 1 AS id, '2016-01-10' AS day UNION ALL
SELECT 2 AS id, '2016-01-10' AS day UNION ALL
SELECT 1 AS id, '2016-01-11' AS day UNION ALL
SELECT 3 AS id, '2016-01-11' AS day UNION ALL
SELECT 1 AS id, '2016-01-12' AS day UNION ALL
SELECT 1 AS id, '2016-01-12' AS day UNION ALL
SELECT 1 AS id, '2016-01-12' AS day UNION ALL
SELECT 1 AS id, '2016-01-13' AS day
)
SELECT
day,
(SELECT COUNT(DISTINCT id) FROM UNNEST(active_users) id) AS active_users
FROM (
SELECT
day,
ARRAY_AGG(id)
OVER (ORDER BY ts RANGE BETWEEN 86400 PRECEDING AND CURRENT ROW) AS active_users
FROM (
SELECT day, id, UNIX_DATE(PARSE_DATE('%Y-%m-%d', day)) * 24 * 3600 AS ts
FROM daily_users
)
)
GROUP BY 1, 2
ORDER BY 1

SQL Statement Only latest entry of the day

It seems it has been too long since I last needed to create my own SQL statements. I have a table (GAS_COUNTER) with timestamps (TS) and values (VALUE).
There are hundreds of entries per day, but I only need the latest one of each day. I tried different ways but never got what I need.
Edit
Thanks for the fast replies, but some do not meet my needs (I need the latest value of each day in the table) and some don't work. My best own statement was:
select distinct (COUNT)
from
(select
extract (DAY_OF_YEAR from TS) as COUNT,
extract (YEAR from TS) as YEAR,
extract (MONTH from TS) as MONTH,
extract (DAY from TS) as DAY,
VALUE as VALUE
from GAS_COUNTER
order by COUNT)
but the value is missing. If I put it in the first select, all rows return (logically correct, as every line is distinct).
Here an example of the Table content:
TS VALUE
2015-07-25 08:47:12.663 0.0
2015-07-25 22:50:52.155 2.269999999552965
2015-08-10 11:18:07.667 52.81999999284744
2015-08-10 20:29:20.875 53.27999997138977
2015-08-11 10:27:21.49 54.439999997615814
2nd Edit and solution
select TS, VALUE from GAS_COUNTER
where TS in (
select max(TS) from GAS_COUNTER group by extract(DAY_OF_YEAR from TS)
)
This one would give you the very last record:
select top 1 * from GAS_COUNTER order by TS desc
Here is one that would give you last records for every day:
select VALUE from GAS_COUNTER
where TS in (
select max(TS) from GAS_COUNTER group by to_date(TS,'yyyy-mm-dd')
)
Depending on the database you are using, you might need to replace/adjust the to_date(TS,'yyyy-mm-dd') function. Basically it should extract the date-only part from the timestamp.
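A portable alternative is a window function; a sketch assuming CAST(TS AS DATE) is the date-extraction syntax for your database:

```sql
-- Rank each day's rows newest-first, then keep only the newest per day.
SELECT TS, VALUE
FROM (
  SELECT TS, VALUE,
         ROW_NUMBER() OVER (
           PARTITION BY CAST(TS AS DATE)  -- one partition per calendar day
           ORDER BY TS DESC               -- newest row gets rn = 1
         ) AS rn
  FROM GAS_COUNTER
) ranked
WHERE rn = 1
ORDER BY TS;
```

Unlike the max(TS) subquery, ROW_NUMBER also breaks ties deterministically if two rows ever share the same timestamp.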
Select the max value for the timestamp.
select MAX(TS), value -- or whatever other columns you want from the record
from GAS_COUNTER
group by value
Something like this would window the data and give you the last value on each day -- but what happens if you get two identical TS values? Which one do you want?
select *
from ( select distinct cast( TS as date ) as dt
from GAS_COUNTER ) as gc1 -- distinct days
cross apply (
select top 1 VALUE -- last value on the date.
from GAS_COUNTER as gc2
where gc2.TS < dateadd( day, 1, gc1.dt )
and gc2.TS >= gc1.dt
order by gc2.TS desc
) as x