BigQuery - creating timestamped 'versions' of tables - google-bigquery

How can I create tables with multiple 'generations' (like in old mainframe environments with JCL)? I've seen this done with the Firebase Analytics sample data.
e.g. I have the following table: mydataset.mytable (7), as listed on the UI.
If I expand the table details, I can see that I can select from the timestamped tables and the preview details for that data
In BigQuery, how can I go about emulating this? This looks REALLY useful.
EDIT: This is better explained with a picture!
Here's the table with the 7 snapshots:
Here, looking at the schema, I can select the snapshot I want to query:
I can't quite work out how to do this.
best wishes
Dave

You can use snapshot decorators for this.
For example, the query below gives you the version of the table as of one hour ago:
#legacySQL
SELECT .... FROM [project:dataset.table@-3600000]
In BigQuery standard SQL, you can use the syntax below:
#standardSQL
SELECT ... FROM `project.dataset.table` FOR SYSTEM_TIME AS OF <timestamp_expression>
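As a rough illustration of the two syntaxes above, here is a minimal Python sketch that only builds the table reference and query strings. The function names and the `SELECT *` are my own placeholders, not part of any BigQuery library, and nothing here talks to BigQuery.

```python
from datetime import datetime, timezone


def legacy_snapshot_ref(project: str, dataset: str, table: str, ms_ago: int) -> str:
    """Legacy SQL table reference with a relative snapshot decorator (@-<milliseconds>)."""
    return f"[{project}:{dataset}.{table}@-{ms_ago}]"


def time_travel_query(project: str, dataset: str, table: str, as_of: datetime) -> str:
    """Standard SQL query reading the table as of a timezone-aware datetime."""
    if as_of.tzinfo is None:
        raise ValueError("as_of must be timezone-aware")
    # Normalize to UTC and render as a TIMESTAMP literal.
    ts = as_of.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S+00:00")
    return (
        f"SELECT * FROM `{project}.{dataset}.{table}` "
        f"FOR SYSTEM_TIME AS OF TIMESTAMP '{ts}'"
    )


# One hour is 3,600,000 milliseconds.
print(legacy_snapshot_ref("project", "dataset", "table", 60 * 60 * 1000))
```

Keep in mind that both forms only reach back as far as BigQuery retains table history, so they are not a substitute for keeping separate dated tables.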
Update for "Here, looking at the schema, I can select the snapshot I want to query":
That drop-down represents actual sharded tables rather than snapshots.
They are just separate tables whose names end in a date suffix in YYYYMMDD format.
Whenever your dataset contains tables that share a common prefix followed by a YYYYMMDD suffix, the Web UI "collapses" them into one entry, with the count of actual tables shown in parentheses. This happens in the UI only; they are still separate tables.
You can then pick the exact table you want to work with from that drop-down (shown in the image in your question).
Hope this helps.
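To make the grouping concrete, here is a small Python sketch of the same idea: it collapses table names that end in eight digits under their common prefix. This is only my approximation of what the UI does, and the table names in the example are made up.

```python
import re
from collections import defaultdict

# A shard name is some prefix followed by exactly eight digits (YYYYMMDD).
SHARD_RE = re.compile(r"^(?P<prefix>.+?)(?P<date>\d{8})$")


def group_shards(table_ids):
    """Map each common prefix to the sorted list of date suffixes found for it."""
    groups = defaultdict(list)
    for table_id in table_ids:
        match = SHARD_RE.match(table_id)
        if match:
            groups[match.group("prefix")].append(match.group("date"))
    return {prefix: sorted(dates) for prefix, dates in groups.items()}


# Hypothetical dataset listing: two shards plus one ordinary table.
tables = ["app_events_20170101", "app_events_20170102", "users"]
for prefix, dates in group_shards(tables).items():
    print(f"{prefix} ({len(dates)})")
```

So to emulate the "generations" view, it should be enough to create one table per day using that naming pattern.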

Related

Impala/Hive get list of tables along with creator and date created

I'm trying to clean up some dev/test tables in Impala for my team, but can't seem to find a way to list out tables with their creator and the date last accessed. The show tables command simply lists out the table names. Because there are hundreds of tables, with less than a quarter belonging to our team, going through each table individually to see if it should be dropped would take hours.
Is there not a way to list out table names along with the creator and date created?
Edit: I can see the creator and creation date/time when I click on a table's info button in Hue, so I know the information is stored somewhere:
One way to do this is to describe each table in turn. Run the following command:
describe formatted <your_table_name>;
The output includes details such as the following:
Database:
Owner:
CreateTime:
LastAccessTime:
Alternatively, if your Hive metastore is backed by MySQL, the metadata is stored in the metastore database (named hive here), and you can query the tables and their metadata directly:
use hive;
select * from TBLS;
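Once the TBLS rows are fetched, narrowing them down to one team's tables is a small filtering step. The Python sketch below assumes each row has already been reduced to (TBL_NAME, OWNER, CREATE_TIME) and that CREATE_TIME is a Unix timestamp in seconds; check both against your own metastore. The function name and the sample rows are hypothetical.

```python
from datetime import datetime, timezone


def tables_owned_by(rows, owners):
    """Return (name, owner, created_at) for the given owners, oldest first.

    rows: iterable of (tbl_name, owner, create_time_epoch_seconds) tuples.
    """
    wanted = set(owners)
    result = []
    for name, owner, created in rows:
        if owner in wanted:
            # Assumption: CREATE_TIME is stored as epoch seconds.
            created_at = datetime.fromtimestamp(created, tz=timezone.utc)
            result.append((name, owner, created_at))
    return sorted(result, key=lambda row: row[2])


# Made-up rows standing in for the result of the TBLS query.
rows = [("tmp_a", "dave", 1500000000), ("tmp_b", "amit", 1400000000)]
for name, owner, created_at in tables_owned_by(rows, ["dave"]):
    print(name, owner, created_at.date())
```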

How to insert/update a partitioned table in Big Query

Problem statement:
I need to insert/update a few columns in a BigQuery table that is partitioned by date. So basically I need to make the necessary changes for each partition date (partitioned by day).
(It's the sessions table that is created automatically by linking the GA View to BQ, so I haven't done the partitioning manually; it's taken care of automatically by Google.)
query reference from google_docs
my query:
I also tried the below :
Can anyone help me here? Sorry, I am a bit new to BQ.
You are trying to insert into a wildcard table, a meta-table that is actually composed of multiple tables. A wildcard table is read-only and cannot be inserted into.
As Hua said, ga_sessions_* is not a partitioned table, but represents many tables, each with a different suffix.
You probably want to do this then:
INSERT INTO `p.d.ga_sessions_20191125` (visitNumber, visitId)
SELECT 1, 1574
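If you need to do this for many days, the concrete table name can be derived from the wildcard name and a date. A minimal Python sketch, where the helper name is my own and the `p.d` project and dataset are the placeholders from the query above:

```python
from datetime import date


def concrete_shard(wildcard: str, day: date) -> str:
    """Turn a wildcard name such as `p.d.ga_sessions_*` into one daily table name."""
    if not wildcard.endswith("*"):
        raise ValueError("expected a wildcard table name ending in '*'")
    return wildcard[:-1] + day.strftime("%Y%m%d")


target = concrete_shard("p.d.ga_sessions_*", date(2019, 11, 25))
print(f"INSERT INTO `{target}` (visitNumber, visitId) SELECT 1, 1574")
```

Each statement then targets exactly one real table, which is what the INSERT requires.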

Bigquery fails to return proper data from table when queried using wildcard query

We are using Looker (a dashboard/reporting solution) to create persistent derived tables in BigQuery. These are normal tables as far as BigQuery is concerned, but the naming follows the Looker standard (it creates a hash based on DB + SQL etc.) and names the table accordingly. These tables are generated from a view at a scheduled time each day. The table names in BigQuery look like this:
table_id
LR_Z504ZN0UK2AQH8N2DOJDC_AGG__table1
LR_Z5321I8L284XXY1KII4TH_MART__table2
LR_Z53WLHYCZO32VK3FWRS2D_JND__table3
If I query the resulting table in BQ by explicit name then the result is returned as expected.
select * from `looker_scratch.LR_Z53WLHYCZO32VK3FWRS2D_JND__table3`
Looker changes the hash value in the table name when the table is regenerated after a query/job change. Hence I wanted to create a view with a wildcard table query to make the changes in the table name transparent to outside world.
But the below query always fails.
SELECT *
FROM `looker_scratch.LR_*`
where _table_suffix like '%JND__table3'
I either get a completely random schema with null values or errors such as:
Error: Cannot read field 'reportDate' of type DATE as TIMESTAMP_MICROS
There are no clashing table suffixes, and I have tried all sorts of regular expression checks (lower, contains, etc.).
Is this happening because the table names have hash values in them? I have run multiple tests on other datasets and there are absolutely no problems; we have been running wildcard table queries for a long time and have faced no issues whatsoever.
Please let me know your thoughts.
When you use a wildcard like the one below
`looker_scratch.LR_*`
you are actually selecting ALL tables with this prefix, and then, when you apply the clause below
LIKE '%JND__table3'
you further narrow the result to tables with a matching suffix.
The catch is that a single table determines the schema of your output: BigQuery takes the wildcard table's schema from the most recently created table that matches the wildcard, regardless of the _TABLE_SUFFIX filter.
To address your issue, check whether more tables match `LR_*` than you expect, then look at the schema of the most recently created one.
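To check which tables a prefix plus a LIKE filter really selects, you can replay the matching offline against a list of table names. The Python sketch below is a simplified re-implementation of LIKE (it handles only % and _, not escape sequences), and the helper names are my own. Note that _ in a LIKE pattern matches any single character, so '%JND__table3' is looser than it looks.

```python
import re


def like_to_regex(pattern: str):
    """Translate a SQL LIKE pattern into a regex (% = any run, _ = any one char)."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("^" + "".join(parts) + "$", re.DOTALL)


def matching_tables(table_ids, prefix, suffix_like):
    """Tables that start with prefix and whose remaining suffix matches the LIKE pattern."""
    rx = like_to_regex(suffix_like)
    return [
        t for t in table_ids
        if t.startswith(prefix) and rx.match(t[len(prefix):])
    ]


tables = [
    "LR_Z504ZN0UK2AQH8N2DOJDC_AGG__table1",
    "LR_Z5321I8L284XXY1KII4TH_MART__table2",
    "LR_Z53WLHYCZO32VK3FWRS2D_JND__table3",
]
print(matching_tables(tables, "LR_", "%JND__table3"))
```

Running this against the full table list of the dataset (not just the three names shown in the question) should reveal any unexpected table that shares the prefix.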

Google Big Query - Date-Partitioned Tables with Eventual Data

Our use case for BigQuery is a little unique. I want to start using Date-Partitioned Tables but our data is very much eventual. It doesn't get inserted when it occurs, but eventually, when it's provided to the server. At times this can be days or even months after the event. Thus, the _PARTITION_LOAD_TIME attribute is useless to us.
My question: is there a way I can specify the column that would act like the _PARTITION_LOAD_TIME argument and still have the benefits of a Date-Partitioned table? If I could emulate this manually and have BigQuery update accordingly, then I could start using Date-Partitioned tables.
Anyone have a good solution here?
You don't need to create your own column.
The _PARTITIONTIME pseudo column will still work for you!
The only thing you need to do is insert/load each data batch into its respective partition by referencing not just the table name but the table with a partition decorator, like yourtable$20160718.
This way you can load data into the partition that it belongs to.
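For late-arriving data, this amounts to grouping rows by the day the event happened and sending each group to its own decorated table name. A minimal Python sketch of that grouping; the function names and the (event_day, payload) row shape are my assumptions, and the actual load call is left out.

```python
from collections import defaultdict
from datetime import date


def partition_target(table: str, event_day: date) -> str:
    """Table name with a partition decorator, e.g. yourtable$20160718."""
    return f"{table}${event_day:%Y%m%d}"


def batch_by_partition(table, rows):
    """Group (event_day, payload) rows by the partition they belong to."""
    batches = defaultdict(list)
    for event_day, payload in rows:
        batches[partition_target(table, event_day)].append(payload)
    return dict(batches)


# Rows that arrived together but belong to different days.
rows = [(date(2016, 7, 18), "a"), (date(2016, 7, 19), "b"), (date(2016, 7, 18), "c")]
for target, payloads in batch_by_partition("yourtable", rows).items():
    print(target, payloads)
```

Each batch can then be loaded with its decorated name as the destination table.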

BigQuery best practice for segmenting tables by dates

I am new to columnar DB concepts and BigQuery in particular. I noticed that for the sake of performance and cost efficiency it is recommended to split data across tables not only logically - but also by time.
For example - while I need a table to store my logs (1 logical table that is called "logs"), it is actually considered a good practice to have a separate table for different periods, like "logs_2012", "logs_2013", etc... or even "logs_2013_01", "logs_2013_02", etc...
My questions:
1) Is it actually the best practice?
2) Where would be best to draw the line - an annual table? A monthly table? A daily table? You get the point...
3) In terms of retrieving the data via queries - what is the best approach? Should I construct my queries dynamically using the UNION option? If I had all my logs in one table - I would naturally use the where clause to get data for the desired time range, but having data distributed over multiple tables makes it weird. I come from the world of relational DB (if it wasn't obvious so far) and I'm trying to make the leap as smoothly as possible...
4) Using the distributed method (different tables for different periods) still raises the following question: before querying the data itself, I want to be able to determine, for a specific log type, what range is available for querying. For example, for a specific machine I would like to first present to my users the relevant scope of their available logs, and let them choose the specific period within that scope to get insights for. The question is: how do I construct such a query when my data is distributed over a number of tables (one per period) and I don't know which tables are available? How can I construct a query when I don't know which tables exist? I might try to access the table "logs_2012_12" when this table doesn't actually exist, or even worse, I wouldn't know which tables are relevant and available for my query.
Hope my questions make sense...
Amit
Table naming
For daily tables, the suggested table name pattern is the specific name of your table + the date like in '20131225'. For example, "logs20131225" or "logs_20131225".
Ideal aggregation: Day, month, year?
The answer to this question will depend on your data and your queries.
Will you usually query one or two days of data? Then have daily tables, and your costs will be much lower, as you query only the data you need.
Will you usually query all your data? Then have all the data in one table. Having many tables in one query can get slower as the number of tables to query grow.
If in doubt, do both! You could have daily, monthly, yearly tables. For a small storage cost, you could save a lot when doing queries that target only the intended data.
Unions
Feel free to do unions.
Keep in mind that there is a limit of 1,000 tables per query. This means that with daily tables, you won't be able to query 3 years of data in one query (3*365 > 1000).
Remember that unions in BigQuery legacy SQL don't use the UNION keyword, but the "," that other databases use for joins (standard SQL uses UNION ALL instead). Joins in BigQuery can be done with the explicit SQL keyword JOIN (or, in legacy SQL, JOIN EACH for very big joins).
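When building the table list dynamically, it is easy to check the count against that limit before sending the query. A small Python sketch, assuming the "logs_" + YYYYMMDD naming from above; the helper names are hypothetical and the constant simply restates the 1,000-table limit mentioned in this answer.

```python
from datetime import date, timedelta

MAX_TABLES_PER_QUERY = 1000  # limit quoted in this answer


def daily_tables(prefix: str, start: date, end: date):
    """Names of the daily tables from start to end, both inclusive."""
    names = []
    day = start
    while day <= end:
        names.append(f"{prefix}{day:%Y%m%d}")
        day += timedelta(days=1)
    return names


def fits_in_one_query(prefix: str, start: date, end: date) -> bool:
    """True if the date range needs no more tables than the per-query limit."""
    return len(daily_tables(prefix, start, end)) <= MAX_TABLES_PER_QUERY


print(daily_tables("logs_", date(2013, 12, 24), date(2013, 12, 25)))
print(fits_in_one_query("logs_", date(2011, 1, 1), date(2013, 12, 31)))
```

When a range does not fit, falling back to the monthly or yearly tables for the older part of the range keeps the table count down.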
Table discovery
API: tables.list will list all tables in a dataset, through the API.
SQL: To query the list of tables within SQL... stay tuned.
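Once you have the table names (for example from tables.list), working out the available date range is plain string handling. A minimal Python sketch, assuming daily tables named prefix + YYYYMMDD; the function name and the sample names are my own.

```python
import re
from datetime import date, datetime
from typing import Optional, Tuple


def available_range(table_ids, prefix: str) -> Optional[Tuple[date, date]]:
    """Earliest and latest day that has a table named prefix + YYYYMMDD, or None."""
    rx = re.compile(re.escape(prefix) + r"(\d{8})$")
    days = []
    for table_id in table_ids:
        match = rx.match(table_id)
        if not match:
            continue
        try:
            days.append(datetime.strptime(match.group(1), "%Y%m%d").date())
        except ValueError:
            # Eight digits that are not a real calendar date; skip them.
            continue
    return (min(days), max(days)) if days else None


names = ["logs_20121230", "logs_20130102", "other_20110101"]
print(available_range(names, "logs_"))
```

This gives the outer bounds only; if some days in between may be missing, keep the full list of parsed dates and query just the tables that exist.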
New 2016 answer: Partitions
Now you can have everything in one table, and BigQuery will analyze only the data contained in the desired dates - if you set up the new partitioned tables:
https://cloud.google.com/bigquery/docs/creating-partitioned-tables