Count users from multiple tables in the same dataset - sql

I have the following four data tables in the same dataset in Google BigQuery.
I need to count users from these four tables and organize the information into a table like this:
The following query returns the <projectID>:<dataset>.<tableID> path of all existing tables at this moment:
SELECT CONCAT(project_id, ':', dataset_id, '.', table_id) AS paths
FROM [<projectID>:<dataset>.__TABLES__]
WHERE MSEC_TO_TIMESTAMP(creation_time) < DATE_ADD(CURRENT_TIMESTAMP(), 0, 'DAY')
How can I iterate the count in Google BigQuery over all of these paths?

Wildcard tables should do the trick by pulling out the _TABLE_SUFFIX pseudo-column, e.g.
#standardSQL
SELECT
  COUNT(*) AS lazy_count,
  _TABLE_SUFFIX AS table
FROM
  `bigquery-public-data.noaa_gsod.*`
GROUP BY
  table
Note: I'm not sure what you are counting, so I've just used a lazy COUNT(*). You could simply change this to whatever column you need.
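Applied to the original question, a sketch that counts users per table across the whole dataset (this assumes a hypothetical user_id column, that the four tables share a compatible schema, and uses the question's <projectID> and <dataset> placeholders with standard SQL's dot separator):

```sql
#standardSQL
-- One row per table in the dataset, with its distinct-user count.
-- `user_id` is an assumed column name; replace it with the real one.
SELECT
  _TABLE_SUFFIX AS table_name,
  COUNT(DISTINCT user_id) AS user_count
FROM
  `<projectID>.<dataset>.*`
GROUP BY
  table_name
```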

Related

Querying all partition table

I have around 600 date-sharded tables called table.ga_session. Each table covers one day, and each has its own unique name; for example, the table for 30/12/2021 is named table.ga_session_20211230. The same goes for the other tables, following the naming format table.ga_session_YYYYMMDD.
Now, when I try to query all of the sharded tables, I cannot use a command like this; the error says that _PARTITIONTIME is unrecognized:
SELECT
*,
_PARTITIONTIME pt
FROM `table.ga_sessions_20211228`
where _PARTITIONTIME
BETWEEN TIMESTAMP('2019-01-01')
AND TIMESTAMP('2020-01-02')
I also tried this, and it does not work:
select *
from between `table.ga_sessions_20211228`
and
`table.ga_sessions_20211229`
I also cannot use FROM 'table.ga_sessions' to apply a WHERE clause on a time range, as no table with that exact name exists. How do I query all of these sharded tables? Thank you in advance!
You can query using wildcard tables. For example:
SELECT max
FROM `bigquery-public-data.noaa_gsod.gsod*`
WHERE _TABLE_SUFFIX = '1929'
This will specifically query the gsod1929 table, but the _TABLE_SUFFIX filter can be omitted if desired.
In your scenario you could do:
SELECT *
FROM `table.ga_sessions_*`
WHERE _TABLE_SUFFIX BETWEEN '20190101' AND '20200102'
For more information see the documentation here:
https://cloud.google.com/bigquery/docs/reference/standard-sql/wildcard-table-reference
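If you also need the shard date as a real column (which is what the _PARTITIONTIME attempt was after), _TABLE_SUFFIX can be parsed into a DATE. A sketch, assuming the YYYYMMDD naming from the question:

```sql
#standardSQL
-- Expose each shard's date as a column and filter on the suffix range,
-- so only the matching shards are scanned.
SELECT
  PARSE_DATE('%Y%m%d', _TABLE_SUFFIX) AS session_date,
  COUNT(*) AS row_count
FROM
  `table.ga_sessions_*`
WHERE
  _TABLE_SUFFIX BETWEEN '20190101' AND '20200102'
GROUP BY
  session_date
```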

Bigquery Select all latest partitions from a wildcard set of tables

We have a set of Google BigQuery tables which are all distinguished by a wildcard for technical reasons, for example content_owner_asset_metadata_*. These tables are updated daily, but at different times.
We need to select the latest partition from each table in the wildcard.
Right now we are using this query to build our derived tables:
SELECT *
FROM `project.content_owner_asset_metadata_*`
WHERE _PARTITIONTIME = (
  SELECT MIN(time)
  FROM (
    SELECT MAX(_PARTITIONTIME) AS time
    FROM `project.content_owner_asset_metadata_*`
    WHERE _PARTITIONTIME > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
    GROUP BY _TABLE_SUFFIX
  )
)
This statement finds the date that all of the up-to-date tables are guaranteed to have, and selects that date's data. However, I need a filter that selects the data from the maximum partition time of each table. I know that I'd need to use _TABLE_SUFFIX together with _PARTITIONTIME, but I cannot quite work out how to make the select work without just loading all our data (very costly) and using a standard greatest-n-per-group solution.
We cannot just union a bunch of static tables, as our dataset ingestion is liable to change, and the scripts we build need to be able to accommodate that.
With BigQuery scripting (currently in Beta), there is a way to prune the partitions.
Basically, a scripting variable is defined to capture the dynamic part of a subquery. In the subsequent query, the scripting variable is then used as a filter to prune the partitions to be scanned.
The example below uses a BigQuery public dataset to demonstrate how to prune partitions so that only the latest day of data is queried and scanned.
DECLARE max_date TIMESTAMP
DEFAULT (SELECT MAX(_PARTITIONTIME) FROM `bigquery-public-data.sec_quarterly_financials.numbers`);
SELECT * FROM `bigquery-public-data.sec_quarterly_financials.numbers`
WHERE _PARTITIONTIME = max_date;
With INFORMATION_SCHEMA.PARTITIONS (in preview as of this writing), this can be achieved by joining to the PARTITIONS view as follows (e.g. with HOUR partitioning):
SELECT i.*
FROM `project.dataset.prefix_*` i
JOIN (
SELECT * EXCEPT (r)
FROM (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY table_name ORDER BY partition_id DESC) AS r
FROM `project.dataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name LIKE "%prefix%"
AND partition_id NOT IN ("__NULL__", "__UNPARTITIONED__"))
WHERE r = 1) p
ON (FORMAT_TIMESTAMP("%Y%m%d%H", i._PARTITIONTIME) = p.partition_id
AND CONCAT("prefix_", i._TABLE_SUFFIX) = p.table_name)

BigQuery: Querying multiple datasets and tables using Standard SQL

I have Google Analytics data that's spread across multiple BigQuery datasets, all using the same schema. I would like to query multiple tables across these datasets at the same time using BigQuery's new Standard SQL dialect. I know I can query multiple tables within a single dataset like so:
FROM `12345678`.`ga_sessions_2016*` s
WHERE s._TABLE_SUFFIX BETWEEN '0501' AND '0720'
What I can't figure out is how to query against not just 12345678 but also against 23456789 at the same time.
How about using a simple UNION ALL, with a SELECT wrapped around it (I tested this using the new Standard SQL option and it worked as expected):
SELECT
  SUM(foo)
FROM (
  SELECT
    COUNT(*) AS foo
  FROM
    <YOUR_DATASET_1>.<YOUR_TABLE_1>
  UNION ALL
  SELECT
    COUNT(*) AS foo
  FROM
    <YOUR_DATASET_2>.<YOUR_TABLE_2>)
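For the ga_sessions setup from the question specifically, the same UNION ALL pattern can combine one wildcard query per dataset while recording where each row came from. A sketch using the dataset IDs from the question (this relies on the schemas being identical, as the question states):

```sql
#standardSQL
-- Combine the same wildcard suffix range from two datasets,
-- tagging each row with its source dataset.
SELECT '12345678' AS dataset_id, s.*
FROM `12345678.ga_sessions_2016*` s
WHERE s._TABLE_SUFFIX BETWEEN '0501' AND '0720'
UNION ALL
SELECT '23456789' AS dataset_id, s.*
FROM `23456789.ga_sessions_2016*` s
WHERE s._TABLE_SUFFIX BETWEEN '0501' AND '0720'
```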
I believe that using a table wildcard function plus a union (in legacy BigQuery SQL, a comma between tables performs the union) will get what you need very quickly, if the tables have the same schema. The legacy function for this is TABLE_DATE_RANGE, which takes a table prefix and a timestamp range:
select *
from
  (select * from TABLE_DATE_RANGE([dataset1.tablePrefix], date1, date2)),
  (select * from TABLE_DATE_RANGE([dataset2.tablePrefix], date3, date4)),
  ......

How to extract record's table name when using Table wildcard functions [duplicate]

I have a set of day-sharded data where individual entries do not contain the day. I would like to use table wildcards to select all available data and get back data that is grouped by both the column I am interested in and the day that it was captured. Something, in other words, like this:
SELECT table_id, identifier, Sum(AppAnalytic) as AppAnalyticCount
FROM (TABLE_QUERY(database_main,'table_id CONTAINS "Title_" AND length(table_id) >= 4'))
GROUP BY identifier, table_id order by AppAnalyticCount DESC LIMIT 10
Of course, this does not actually work because table_id is not visible in the table aggregation resulting from the TABLE_QUERY function. Is there any way to accomplish this? Some sort of join on table metadata perhaps?
This functionality is available now in BigQuery through _TABLE_SUFFIX pseudocolumn. Full documentation is at https://cloud.google.com/bigquery/docs/querying-wildcard-tables.
A couple of things to note:
You will need to use Standard SQL to enable table wildcards
You will have to rename _TABLE_SUFFIX to something else in your SELECT list; the following example illustrates it:
SELECT _TABLE_SUFFIX as table_id, ... FROM `MyDataset.MyTablePrefix_*`
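Applied to the query from the question, a sketch (assuming the day-sharded tables share the Title_ prefix inside the database_main dataset, per the TABLE_QUERY filter above):

```sql
#standardSQL
-- Group by both the identifier and the shard each row came from.
SELECT
  _TABLE_SUFFIX AS table_id,
  identifier,
  SUM(AppAnalytic) AS AppAnalyticCount
FROM
  `database_main.Title_*`
GROUP BY
  table_id, identifier
ORDER BY
  AppAnalyticCount DESC
LIMIT 10
```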
Not available today, but something I'd love to have too. The team takes feature requests seriously, so thanks for requesting this one :).
In the meantime, a workaround is doing a manual union of a SELECT of each table, plus an additional column with the date data.
For example, instead of:
SELECT x, #TABLE_ID
FROM table201401, table201402, table201403
You could do:
SELECT x, month
FROM
(SELECT x, '201401' AS month FROM table201401),
(SELECT x, '201402' AS month FROM table201402),
(SELECT x, '201403' AS month FROM table201403)
