How to extract record's table name when using Table wildcard functions [duplicate] - google-bigquery

I have a set of day-sharded data where individual entries do not contain the day. I would like to use table wildcards to select all available data and get back data that is grouped by both the column I am interested in and the day that it was captured. Something, in other words, like this:
SELECT table_id, identifier, Sum(AppAnalytic) as AppAnalyticCount
FROM (TABLE_QUERY(database_main,'table_id CONTAINS "Title_" AND length(table_id) >= 4'))
GROUP BY identifier, table_id order by AppAnalyticCount DESC LIMIT 10
Of course, this does not actually work because table_id is not visible in the table aggregation resulting from the TABLE_QUERY function. Is there any way to accomplish this? Some sort of join on table metadata perhaps?

This functionality is now available in BigQuery through the _TABLE_SUFFIX pseudo column. Full documentation is at https://cloud.google.com/bigquery/docs/querying-wildcard-tables.
A couple of things to note:
You will need to use standard SQL to enable table wildcards.
You will have to alias _TABLE_SUFFIX to something else in your SELECT list; the following example illustrates this:
SELECT _TABLE_SUFFIX as table_id, ... FROM `MyDataset.MyTablePrefix_*`
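Applied to the original question, the rewritten query might look roughly like this (a sketch only, assuming standard SQL and that the day shards live under database_main with the Title_ prefix; adjust the names to your schema):
SELECT
  _TABLE_SUFFIX AS table_id,            -- suffix of the shard each row came from
  identifier,
  SUM(AppAnalytic) AS AppAnalyticCount
FROM `database_main.Title_*`
GROUP BY identifier, table_id
ORDER BY AppAnalyticCount DESC
LIMIT 10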

Not available today, but something I'd love to have too. The team takes feature requests seriously, so thanks for adding your support for this one :).
In the meantime, a workaround is to do a manual union of a SELECT from each table, plus an additional column holding the date.
For example, instead of:
SELECT x, #TABLE_ID
FROM table201401, table201402, table201403
You could do:
SELECT x, month
FROM
(SELECT x, '201401' AS month FROM table201401),
(SELECT x, '201402' AS month FROM table201402),
(SELECT x, '201403' AS month FROM table201403)
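In standard SQL, where the comma no longer means UNION ALL, the same workaround could be written explicitly (a sketch, assuming the monthly tables sit in a dataset named mydataset):
SELECT x, '201401' AS month FROM `mydataset.table201401`
UNION ALL
SELECT x, '201402' AS month FROM `mydataset.table201402`
UNION ALL
SELECT x, '201403' AS month FROM `mydataset.table201403`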

Related

Querying all partitioned tables

I have around 600 partitioned tables called table.ga_session. There is one table per day, and each table has its own unique name; for example, the table for 30/12/2021 is named table.ga_session_20211230. The same goes for the other tables, so the naming format is table.ga_session_YYYYMMDD.
Now, when I try to query all of these partitioned tables, I cannot use a query like the one below; the error says that _PARTITIONTIME is unrecognized.
SELECT
*,
_PARTITIONTIME pt
FROM `table.ga_sessions_20211228`
where _PARTITIONTIME
BETWEEN TIMESTAMP('2019-01-01')
AND TIMESTAMP('2020-01-02')
I also tried this, and it does not work either:
select *
from between `table.ga_sessions_20211228`
and
`table.ga_sessions_20211229`
I also cannot use FROM 'table.ga_sessions' with a WHERE clause to select a range of time, because that table does not exist. How do I query all of these partitioned tables? Thank you in advance!
You can query using wildcard tables. For example:
SELECT max
FROM `bigquery-public-data.noaa_gsod.gsod*`
WHERE _TABLE_SUFFIX = '1929'
This will specifically query the gsod1929 table, but the _TABLE_SUFFIX filter can be omitted if you want all of the matching tables.
In your scenario you could do:
SELECT *
FROM `table.ga_sessions_*`
WHERE _TABLE_SUFFIX BETWEEN '20190101' AND '20200102'
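If you would rather filter on real dates than on string comparison, the suffix can be parsed first (a sketch; PARSE_DATE is a standard BigQuery function, and the table names are the ones from the question):
SELECT *
FROM `table.ga_sessions_*`
WHERE PARSE_DATE('%Y%m%d', _TABLE_SUFFIX)
      BETWEEN DATE '2019-01-01' AND DATE '2020-01-02'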
For more information see the documentation here:
https://cloud.google.com/bigquery/docs/reference/standard-sql/wildcard-table-reference


SQL query to select one of multiple rows from a table

I have a table which contains more than one row for a particular value. Here is the table structure:
NAME,NUMBER,STATUS,DESC,START_DATE,END_DATE
A,3,X,DetailsOfX,13-10-15,13-10-15
A,2,Y,DetailsOfY,13-10-15,13-10-15
A,2,Z,DetailsOfZ,13-10-15,13-10-15
A,1,X,DetailsOfX,12-10-15,12-10-15
The output I need is:
A,3,X,DetailsOfX,13-10-15,13-10-15
A,2,Y,DetailsOfY-DetailsOfZ,13-10-15,13-10-15
A,1,X,DetailsOfX,12-10-15,12-10-15
So basically I want to select one of two or more rows from the table, with column data combined from both rows (the merged DESC value in the second output row above). The query below, which I tried using a JOIN, returns 4 rows.
SELECT A.NAME,A.NUMBER,B.STATUS,A.DESC||"-"||B.DESC,A.START_DATE,A.END_DATE
FROM TABLE A
JOIN (SELECT NUMBER,STATUS,DESC,START_DATE,END_DATE FROM TABLE WHERE NAME='A') B
ON A.NAME=B.NAME AND
A.NUMBER=B.NUMBER
Can somebody help me with a query that would work?
Thanks
If you are using IBM i 7.1 (formerly known as OS/400), you should be able to do this with two tricks: hierarchical queries and XML functions.
See my tutorial under Q: SQL concatenate strings, which explains how to do this on DB2 for i in order to merge the descriptions.
GROUP BY any fields by which you want rows combined into one; all other columns must then be the result of an aggregate function. So, for example, if you want one row per name and number but have varying values for Status, StartDate, and EndDate, you will need to say something like MIN(Status), MIN(StartDate), MAX(EndDate). Is the minimum status code actually the one you want to report?
If your OS is at version 6.1, you may still be able to use a conventional recursive query (or under V5R4), but you might need an additional CTE (or two?) to concatenate the descriptions.
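For illustration, a minimal sketch of the GROUP BY plus XML-aggregation idea on DB2 for i might look like the following (assuming the table is simply named T, that the DESC column has to be quoted because it is a reserved word, and that one output row per NAME/NUMBER is wanted; this is not the exact query from the linked tutorial):
SELECT NAME, NUMBER,
       MIN(STATUS) AS STATUS,                      -- arbitrary pick when statuses differ
       SUBSTR(XMLSERIALIZE(
                XMLAGG(XMLTEXT('-' CONCAT "DESC") ORDER BY STATUS)
                AS VARCHAR(2000)), 2) AS DESCS,    -- strip the leading '-'
       MIN(START_DATE) AS START_DATE,
       MAX(END_DATE)   AS END_DATE
FROM T
GROUP BY NAME, NUMBER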
You need to use GROUP BY and FOR XML PATH:
SELECT
    X.NAME, X.NUMBER, X.STATUS,
    STUFF((
        SELECT '-' + Y.[DESC]
        FROM YourTable Y
        WHERE Y.NAME = X.NAME
          AND Y.NUMBER = X.NUMBER
        FOR XML PATH(''), TYPE).value('.', 'VARCHAR(MAX)'), 1, 1, '') AS DescValues,
    X.START_DATE,
    X.END_DATE
FROM YourTable X
GROUP BY X.NAME, X.NUMBER, X.STATUS, X.START_DATE, X.END_DATE
This is assuming you want separate rows for any differences in name, number, status, start date, or end date.
Also, this is assuming SQL Server.
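On SQL Server 2017 and later, STRING_AGG does the same concatenation more directly (a sketch under the same YourTable assumption; note that it collapses differing statuses with MIN, which may or may not be what you want):
SELECT NAME, NUMBER,
       MIN(STATUS) AS STATUS,                    -- arbitrary pick when statuses differ
       STRING_AGG([DESC], '-') AS DescValues,    -- merge the descriptions
       START_DATE, END_DATE
FROM YourTable
GROUP BY NAME, NUMBER, START_DATE, END_DATE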

SQL Having on columns not in SELECT

I have a table with 3 columns:
userid mac_address count
The entries for one user could look like this:
57193 001122334455 42
57193 000C6ED211E6 15
57193 FFFFFFFFFFFF 2
I want to create a view that displays only those MACs that are considered "commonly used" for this user. For example, I want to filter out the MACs that are used less than 10% as often as the most used MAC address for that user. Furthermore, I want 1 row per user. This can easily be achieved with GROUP BY, HAVING and GROUP_CONCAT:
SELECT userid, GROUP_CONCAT(mac_address SEPARATOR ',') AS macs, count
FROM mactable
GROUP BY userid
HAVING count*10 >= MAX(count)
And indeed, the result is as follows:
57193 001122334455,000C6ED211E6 42
However, I really don't want the count column in my view. But if I take it out of the SELECT statement, I get the following error:
#1054 - Unknown column 'count' in 'having clause'
Is there any way I can perform this operation without being forced to have a nasty count-column in my view? I know I can probably do it using inner queries, but I would like to avoid doing that for performance reasons.
Your help is very much appreciated!
Since HAVING refers to the column names in the select list, what you want is not directly possible.
However, you can use your select as a subselect inside an outer select that returns only the columns you want.
SELECT a.userid, a.macs
FROM
(
SELECT userid, GROUP_CONCAT(mac_address SEPARATOR ',') AS macs, count
FROM mactable
GROUP BY userid
HAVING count*10 >= MAX(count)
) as a
UPDATE:
Because of a limitation of MySQL this is not possible, although it works in other DBMS like Oracle.
One solution would be to create a view for the subquery. Another solution seems cleaner:
CREATE VIEW YOUR_VIEW (userid, macs) AS
SELECT userid, GROUP_CONCAT(mac_address SEPARATOR ',') AS macs, count
FROM mactable
GROUP BY userid
HAVING count*10 >= MAX(count)
This will declare the view as returning only the columns userid and macs although the underlying SELECT statement returns more columns than those two.
Although I am not sure whether the non-DBMS MySQL supports this or not...
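For completeness, if your MySQL version accepts that view definition, using it is then a plain select (a minimal usage sketch; YOUR_VIEW and the sample data come from the question and answer above):
SELECT userid, macs
FROM YOUR_VIEW;
-- expected: 57193 | 001122334455,000C6ED211E6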

Aggregate functions in WHERE clause in SQLite

Simply put, I have a table with, among other things, a column for timestamps. I want to get the row with the most recent (i.e. greatest value) timestamp. Currently I'm doing this:
SELECT * FROM table ORDER BY timestamp DESC LIMIT 1
But I'd much rather do something like this:
SELECT * FROM table WHERE timestamp=max(timestamp)
However, SQLite rejects this query:
SQL error: misuse of aggregate function max()
The documentation confirms this behavior (bottom of page):
Aggregate functions may only be used in a SELECT statement.
My question is: is it possible to write a query to get the row with the greatest timestamp without ordering the select and limiting the number of returned rows to 1? This seems like it should be possible, but I guess my SQL-fu isn't up to snuff.
SELECT * from foo where timestamp = (select max(timestamp) from foo)
or, if SQLite insists on treating subselects as sets,
SELECT * from foo where timestamp in (select max(timestamp) from foo)
There are many ways to skin a cat.
If you have an identity column with auto-increment functionality, returning the last record by ID gives a faster query, because that column is indexed, unless of course you wish to put an index on the timestamp column instead.
SELECT * FROM TABLE ORDER BY ID DESC LIMIT 1
I think I've answered this question 5 times in the past week now, but I'm too tired to find a link to one of those right now, so here it is again...
SELECT *
FROM table T1
LEFT OUTER JOIN table T2
    ON T2.timestamp > T1.timestamp
WHERE T2.timestamp IS NULL
You're basically looking for the row where no other row matches that is later than it.
NOTE: As pointed out in the comments, this method will not perform as well in this kind of situation. It will usually work better (for SQL Server at least) in situations where you want the last row for each customer (as an example).
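As a concrete illustration of that last point, the same join pattern applied to a hypothetical orders table returns each customer's latest order (the table and column names here are invented for the example):
SELECT O1.*
FROM orders O1
LEFT OUTER JOIN orders O2
    ON O2.customer_id = O1.customer_id
   AND O2.order_date > O1.order_date
WHERE O2.order_date IS NULL   -- no later order exists for this customer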
you can simply do
SELECT *, max(timestamp) FROM table
Edit:
Since an aggregate function can't be used like this, it gives an error. I guess what SquareCog suggested was the best thing to do:
SELECT * FROM table WHERE timestamp = (select max(timestamp) from table)