Is there a way to select table_id in a Bigquery Table Wildcard Query - google-bigquery

I have a set of day-sharded data where individual entries do not contain the day. I would like to use table wildcards to select all available data and get back data that is grouped by both the column I am interested in and the day that it was captured. Something, in other words, like this:
SELECT table_id, identifier, Sum(AppAnalytic) as AppAnalyticCount
FROM (TABLE_QUERY(database_main,'table_id CONTAINS "Title_" AND length(table_id) >= 4'))
GROUP BY identifier, table_id order by AppAnalyticCount DESC LIMIT 10
Of course, this does not actually work because table_id is not visible in the table aggregation resulting from the TABLE_QUERY function. Is there any way to accomplish this? Some sort of join on table metadata perhaps?

This functionality is now available in BigQuery through the _TABLE_SUFFIX pseudocolumn. Full documentation is at https://cloud.google.com/bigquery/docs/querying-wildcard-tables.
A couple of things to note:
You will need to use Standard SQL to enable table wildcards.
You will have to alias _TABLE_SUFFIX to another name in your SELECT list, as in the following example:
SELECT _TABLE_SUFFIX as table_id, ... FROM `MyDataset.MyTablePrefix_*`

Not available today, but something I'd love to have too. The team takes feature requests seriously, so thanks for adding support for this one :).
In the meantime, a workaround is doing a manual union of a SELECT of each table, plus an additional column with the date data.
For example, instead of:
SELECT x, #TABLE_ID
FROM table201401, table201402, table201403
You could do:
SELECT x, month
FROM
(SELECT x, '201401' AS month FROM table201401),
(SELECT x, '201402' AS month FROM table201402),
(SELECT x, '201403' AS month FROM table201403)
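The union-with-a-literal-column workaround can be tried locally. The following is only a minimal sketch, assuming SQLite (through Python's sqlite3 module) as a stand-in for BigQuery and made-up table contents. SQLite needs an explicit UNION ALL, whereas the comma between subqueries in the legacy BigQuery query above already means union.

```python
import sqlite3

# In-memory SQLite database standing in for the sharded BigQuery tables.
conn = sqlite3.connect(":memory:")
shards = {"201401": [1, 2], "201402": [3], "201403": [4, 5, 6]}  # made-up rows
for month, values in shards.items():
    conn.execute(f"CREATE TABLE table{month} (x INTEGER)")
    conn.executemany(
        f"INSERT INTO table{month} (x) VALUES (?)", [(v,) for v in values]
    )

# One SELECT per shard, each tagging its rows with a literal month column.
union_sql = "\nUNION ALL\n".join(
    f"SELECT x, '{month}' AS month FROM table{month}" for month in shards
)

# The tag can now be grouped on like any other column.
rows = conn.execute(
    f"SELECT month, COUNT(*) AS n, SUM(x) AS total "
    f"FROM ({union_sql}) GROUP BY month ORDER BY month"
).fetchall()
print(rows)
```

Generating the per-shard SELECT statements from a list of suffixes keeps the query manageable when there are many tables.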

Related

Querying all partition table

I have around 600 partitioned tables named table.ga_sessions. Each table holds one day of data and has its own name; for example, the table for 30/12/2021 is table.ga_sessions_20211230. All the other tables follow the same naming format, table.ga_sessions_YYYYMMDD.
Now, when I try to query all of the partitioned tables, I cannot use a command like the one below. The error says that _PARTITIONTIME is unrecognized.
SELECT
*,
_PARTITIONTIME pt
FROM `table.ga_sessions_20211228`
where _PARTITIONTIME
BETWEEN TIMESTAMP('2019-01-01')
AND TIMESTAMP('2020-01-02')
I also tried this, which does not work either:
select *
from between `table.ga_sessions_20211228`
and
`table.ga_sessions_20211229`
I also cannot use FROM 'table.ga_sessions' with a WHERE clause to select a range of dates, because that table does not exist. How do I query all of these partitioned tables? Thank you in advance!
You can query using wildcard tables. For example:
SELECT max
FROM `bigquery-public-data.noaa_gsod.gsod*`
WHERE _TABLE_SUFFIX = '1929'
This queries only the gsod1929 table; the _TABLE_SUFFIX filter can be omitted to query every table that matches the wildcard.
In your scenario you could do:
select *
from `table.ga_sessions_*`
WHERE _TABLE_SUFFIX BETWEEN '20190101' and '20200102'
For more information see the documentation here:
https://cloud.google.com/bigquery/docs/reference/standard-sql/wildcard-table-reference
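If the date range comes from application code, the filter can be generated rather than typed by hand. This is a minimal sketch that only builds the query text; the helper name suffix_range_query is hypothetical, and it assumes the shards are named with a zero-padded YYYYMMDD suffix as in the question.

```python
from datetime import date


def suffix_range_query(table_prefix: str, start: date, end: date) -> str:
    """Build a Standard SQL wildcard query over shards named <prefix>YYYYMMDD.

    Hypothetical helper for illustration; it returns query text only and
    does not contact BigQuery.
    """
    lo, hi = start.strftime("%Y%m%d"), end.strftime("%Y%m%d")
    return (
        f"SELECT *\n"
        f"FROM `{table_prefix}*`\n"
        f"WHERE _TABLE_SUFFIX BETWEEN '{lo}' AND '{hi}'"
    )


query = suffix_range_query("table.ga_sessions_", date(2019, 1, 1), date(2020, 1, 2))
print(query)

# Zero-padded YYYYMMDD strings sort in the same order as the dates they encode,
# which is why a string BETWEEN selects the intended shards.
assert "20191231" < "20200101" and date(2019, 12, 31) < date(2020, 1, 1)
```

The comparison on _TABLE_SUFFIX is a string comparison, so it relies on every suffix having the same zero-padded format.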

Count users from multiple tables in the same dataset

I have the following four data tables in the same dataset at Google Bigquery:
I need to count users from these four tables, and organizing the information into a table like this:
The following query returns the <projectID>:<dataset>.<tableID> path of all existing tables at this moment:
SELECT CONCAT(project_id, ':', dataset_id, '.', table_id) AS paths,
FROM [<projectID>:<dataset>.__TABLES__]
WHERE MSEC_TO_TIMESTAMP(creation_time) < DATE_ADD(CURRENT_TIMESTAMP(), 0, 'DAY')
How to iterate the counting in Google Bigquery for all previous paths?
Wildcard tables should do the trick by pulling out the _TABLE_SUFFIX reserved column e.g.
#standardsql
SELECT
COUNT(*) AS lazy_count,
_TABLE_SUFFIX AS table
FROM
`bigquery-public-data.noaa_gsod.*`
GROUP BY
table
Note: I'm not sure what you are counting, so I've just used a lazy COUNT(*). You could simply change this to whatever column you need.

How to follow DRY when using a complex, constructed column as part of a GROUP BY clause in SQL

I often have to write queries with a fairly complex, constructed column that I will aggregate by. For example:
SELECT
EXTRACT(week FROM to_timestamp("Date Created"/1000)) AS week
...
I know that you cannot use aliases in the GROUP BY clause (the question "Why doesn't Oracle SQL allow us to use column aliases in GROUP BY clauses?" explains the logic), but is there anything else I can do other than repeating the column calculation, or am I stuck with this:
SELECT COUNT(*), EXTRACT(week FROM to_timestamp("Date Created"/1000)) AS week
FROM mytable
GROUP BY EXTRACT(week FROM to_timestamp("Date Created"/1000))
Often, I break complexity by using sub-queries.
select count(*), week
from
(
SELECT EXTRACT(week FROM to_timestamp("Date Created"/1000)) AS week
FROM mytable
) sel
GROUP BY week
The divide-and-conquer approach has paid off pretty well so far.
Update
Alternatives to solving this issue:
Computed columns (as @gbn stated in his answer).
Pros:
You can declare a column that is used in most queries
Some RDBMSs allow you to create an index over a computed column (pretty important for performance)
Cons:
Not all RDBMSs provide computed columns
You might end up declaring a column that's used in one very specific query (out of the thousands of queries you have in your system). Someday, this query will have changed and the column will just sit there...
CTEs
I think that you can do this:
SELECT COUNT(*), week
FROM ( SELECT *, EXTRACT(week FROM to_timestamp("Date Created"/1000)) AS week
FROM mytable) MT
GROUP BY week
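A common table expression gives the same single definition of the computed column as the subquery versions above. This is a minimal sketch, assuming SQLite (through Python's sqlite3 module) as the engine, so strftime('%W', ..., 'unixepoch') stands in for EXTRACT(week FROM to_timestamp(...)); the rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE mytable ("Date Created" INTEGER)')  # epoch milliseconds
# Made-up timestamps: 2014-01-06, 2014-01-07 and 2014-01-13 (midnight UTC).
conn.executemany(
    "INSERT INTO mytable VALUES (?)",
    [(1388966400000,), (1389052800000,), (1389571200000,)],
)

# The week expression is written once, inside the CTE, and then reused by name.
rows = conn.execute(
    """
    WITH weekly AS (
        SELECT strftime('%W', "Date Created" / 1000, 'unixepoch') AS week
        FROM mytable
    )
    SELECT week, COUNT(*) AS n
    FROM weekly
    GROUP BY week
    ORDER BY week
    """
).fetchall()
print(rows)
```

The first two rows fall in the same week and the third in the following one, so the query returns two groups.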
Derived tables
SELECT foo FROM (SELECT 1+1 AS foo FROM ...) WHERE foo = ...
Computed columns (not all RDBMS)
ALTER TABLE someTable ADD WeekPart AS WEEK(SomeDate)

Aggregate functions in WHERE clause in SQLite

Simply put, I have a table with, among other things, a column for timestamps. I want to get the row with the most recent (i.e. greatest value) timestamp. Currently I'm doing this:
SELECT * FROM table ORDER BY timestamp DESC LIMIT 1
But I'd much rather do something like this:
SELECT * FROM table WHERE timestamp=max(timestamp)
However, SQLite rejects this query:
SQL error: misuse of aggregate function max()
The documentation confirms this behavior (bottom of page):
Aggregate functions may only be used in a SELECT statement.
My question is: is it possible to write a query to get the row with the greatest timestamp without ordering the select and limiting the number of returned rows to 1? This seems like it should be possible, but I guess my SQL-fu isn't up to snuff.
SELECT * from foo where timestamp = (select max(timestamp) from foo)
or, if SQLite insists on treating subselects as sets,
SELECT * from foo where timestamp in (select max(timestamp) from foo)
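The scalar-subquery form can be checked quickly with Python's built-in sqlite3 module. This is a minimal sketch with a made-up table foo and made-up values, comparing it with the ORDER BY ... LIMIT 1 query from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (id INTEGER PRIMARY KEY, timestamp INTEGER)")
conn.executemany("INSERT INTO foo (timestamp) VALUES (?)", [(100,), (300,), (200,)])

# The aggregate lives in a scalar subquery, so the outer WHERE only compares values.
latest = conn.execute(
    "SELECT * FROM foo WHERE timestamp = (SELECT max(timestamp) FROM foo)"
).fetchall()

# The ORDER BY ... LIMIT 1 form from the question, for comparison.
ordered = conn.execute(
    "SELECT * FROM foo ORDER BY timestamp DESC LIMIT 1"
).fetchall()
print(latest, ordered)
```

The two forms differ when several rows share the greatest timestamp: the subquery version returns all of them, while LIMIT 1 returns a single row.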
There are many ways to skin a cat.
If you have an Identity Column that has an auto-increment functionality, a faster query would result if you return the last record by ID, due to the indexing of the column, unless of course you wish to put an index on the timestamp column.
SELECT * FROM TABLE ORDER BY ID DESC LIMIT 1
I think I've answered this question 5 times in the past week now, but I'm too tired to find a link to one of those right now, so here it is again...
SELECT
*
FROM
table T1
LEFT OUTER JOIN table T2 ON
T2.timestamp > T1.timestamp
WHERE
T2.timestamp IS NULL
You're basically looking for the row where no other row matches that is later than it.
NOTE: As pointed out in the comments, this method will not perform as well in this kind of situation. It will usually work better (for SQL Server at least) in situations where you want the last row for each customer (as an example).
you can simply do
SELECT *, max(timestamp) FROM table
Edit:
An aggregate function can't be used like this, so it gives an error. I guess what SquareCog suggested is the best thing to do:
SELECT * FROM table WHERE timestamp = (select max(timestamp) from table)