BigQuery copy command - google-bigquery

I have been going through the book Google BigQuery Analytics. On page 354 it states that a table copy completes in less than a minute irrespective of table size. Is this correct? How is it possible?

Let's test that assessment.
I have a 2 TB table with 55 billion rows.
I will ask BigQuery to make a copy of it:
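With the BigQuery Python client, that copy is a single job submission; here is a minimal sketch (project, dataset, and table names are placeholders, not the actual table from this test):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder table IDs - substitute your own project/dataset/table names.
source = "my-project.my_dataset.big_table"            # the ~2 TB, 55-billion-row table
destination = "my-project.my_dataset.big_table_copy"

copy_job = client.copy_table(source, destination)     # submit the copy job
copy_job.result()                                      # block until it finishes
print("Copy job state:", copy_job.state)
```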
The requested job completed in 55 seconds - less than a minute.
So let me answer the 2 questions above:
On page 354 it states that a table copy completes in less than a minute irrespective of table size. Is this correct?
The book is old, but the answer is still "yes".
How is it possible?
BigQuery is powered by Colossus, Google's distributed file system, and table data is stored in immutable files. A copy job therefore mostly writes new metadata that points at the existing files rather than rewriting the bytes, which is why it finishes quickly regardless of table size.


How to manually test a data retention requirement in a search functionality?

Say data needs to be kept for 2 years. Then any data created 2 years + 1 day ago should no longer be displayed and should be deleted from the server. How do you manually test that?
I’m new to testing and I can’t think of any other ways. Also, we cannot do automation due to time constraints.
You can create backdated data (more than two years old) in the database and test whether it is being deleted automatically. Alternatively, you can change the current business date in the database and test against that.
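A minimal sketch of that backdating idea, using SQLite as a stand-in for the real database (the table name, columns, and the two-year policy are assumptions for illustration):

```python
import sqlite3
from datetime import datetime, timedelta

RETENTION_DAYS = 2 * 365  # assumed two-year retention policy

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE search_history (keyword TEXT, created_at TEXT)")

# One row backdated past the retention window, one recent row.
old = (datetime.utcnow() - timedelta(days=RETENTION_DAYS + 1)).isoformat()
new = datetime.utcnow().isoformat()
conn.executemany(
    "INSERT INTO search_history VALUES (?, ?)",
    [("cricket news", old), ("sport", new)],
)

# Stand-in for the application's purge job that enforces retention.
cutoff = (datetime.utcnow() - timedelta(days=RETENTION_DAYS)).isoformat()
conn.execute("DELETE FROM search_history WHERE created_at < ?", (cutoff,))

# The backdated row should be gone; the recent one should remain.
remaining = [row[0] for row in conn.execute("SELECT keyword FROM search_history")]
assert remaining == ["sport"], remaining
print("Rows surviving the purge:", remaining)
```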
To test the data retention functionality manually, a tester needs to keep track of the search data so that the test cases for the search retention feature can be executed.
Taking a social networking app as an example, as a manual tester you need to remember all the users you searched for recently.
To check the retention period, you can ask the backend developer to shorten it (for example, from one year to 10 minutes) for testing purposes.
Even if you delete the search history and then start typing a previously entered search term, the related result should appear at the top of the search results. Data retention policies define what data should be stored or archived, where that should happen, and for exactly how long. Once the retention period for a particular data set expires, the data can be deleted or moved to secondary or tertiary storage as historical data, depending on the requirement.
Let's understand this with an example. Suppose we have the data below in our database table, based on past searches made by users. With the help of this table you can perform the testing with minimum effort and optimum results. The current date is '2022-03-10', and the Status column states whether the data is available in the database: Visible means available, while Expired means deleted from the table.

Search Keyword | Search On Date | Search Expiry Date | Status
-------------- | -------------- | ------------------ | ------------------
sport          | 2022-03-05     | 2024-03-04         | Visible
cricket news   | 2020-03-10     | 2022-03-09         | Expired - Deleted
holy books     | 2020-03-11     | 2022-03-10         | Visible
dance          | 2020-03-12     | 2022-03-11         | Visible
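The expected Status values can also be written down as a quick check; a small sketch using the dates from the table above, assuming the current date is 2022-03-10 and that a row stays visible until its expiry date has passed:

```python
from datetime import date

current_date = date(2022, 3, 10)

# (keyword, searched on, expires on) - rows taken from the example table above.
rows = [
    ("sport",        date(2022, 3, 5),  date(2024, 3, 4)),
    ("cricket news", date(2020, 3, 10), date(2022, 3, 9)),
    ("holy books",   date(2020, 3, 11), date(2022, 3, 10)),
    ("dance",        date(2020, 3, 12), date(2022, 3, 11)),
]

for keyword, searched_on, expires_on in rows:
    # Visible until the expiry date has passed; Expired - Deleted afterwards.
    status = "Visible" if expires_on >= current_date else "Expired - Deleted"
    print(f"{keyword:13} searched {searched_on}, expires {expires_on} -> {status}")
```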

Is there a way to export more than 10000 records daily from the Google Analytics Reporting API to AWS S3?

I have a GA 360 view that gets a decent amount of traffic daily, and I want to export the hit-level data (using GA_client_id) to AWS S3. The limitation here is that the GA API allows only 10000 records a day. Someone suggested that if we put GA_client_id in a custom dimension, the limit would not apply. Is that true? Please let me know if there is another solution to export more than 10000 records for a single view per day. Please note that this will be a single query that will auto-run daily at a specific time.
Thank you so much in advance.
"10000 records a day."
Correction: the limit is 10000 requests per day per view, and a request's response can include millions of records (rows).
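As an illustration of one scheduled daily export, here is a rough sketch that pages through a single report with the Reporting API v4 Python client and writes the rows to S3 with boto3. The view ID, key file, bucket, and the assumption that GA_client_id is stored in the custom dimension ga:dimension1 are all placeholders:

```python
import json
import boto3
from googleapiclient.discovery import build
from google.oauth2 import service_account

VIEW_ID = "XXXXXXXX"                  # placeholder GA view ID
KEY_FILE = "service_account.json"     # placeholder service-account key
BUCKET = "my-export-bucket"           # placeholder S3 bucket

creds = service_account.Credentials.from_service_account_file(
    KEY_FILE, scopes=["https://www.googleapis.com/auth/analytics.readonly"]
)
analytics = build("analyticsreporting", "v4", credentials=creds)

rows, page_token = [], None
while True:
    request = {
        "viewId": VIEW_ID,
        "dateRanges": [{"startDate": "yesterday", "endDate": "yesterday"}],
        "metrics": [{"expression": "ga:sessions"}],
        # ga:dimension1 is assumed to hold the client id as a custom dimension.
        "dimensions": [{"name": "ga:dimension1"}, {"name": "ga:dateHourMinute"}],
        "pageSize": 100000,
    }
    if page_token:
        request["pageToken"] = page_token
    response = analytics.reports().batchGet(
        body={"reportRequests": [request]}
    ).execute()
    report = response["reports"][0]
    rows.extend(report["data"].get("rows", []))
    page_token = report.get("nextPageToken")
    if not page_token:            # no more pages for this report
        break

# Ship the day's rows to S3 as one JSON object.
boto3.client("s3").put_object(
    Bucket=BUCKET, Key="ga_export/yesterday.json", Body=json.dumps(rows)
)
```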

How long do a temporary table and a job last on BigQuery by default?

By default, how long is a temp table kept so we can get data from it? I know that we can set the expiration, but what is the default?
And what about the job? What is its default expiration time, if it has one?
I tried to find this in the documentation but couldn't. We return the jobId to the client so they can get the data when the job is complete, but some of them like to store it and try to fetch data with a jobId from two weeks or a month ago.
What is the default lifetime here, so I can explain it to them better?
Query results are stored for 24 hours:
All query results, including both interactive and batch queries, are cached in temporary tables for approximately 24 hours with some exceptions.
https://cloud.google.com/bigquery/docs/cached-results
As mentioned by Alexey, query results are stored for 24 hours when the cache is used.
Regarding the lifetime of BigQuery jobs, you can get the job history for the last six months.
On the other hand, based on your description, creating a new table from your query results with an expiration time seems to be the most appropriate strategy. You could also check whether materialized views could help you store the results of recurring queries.
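For example, a sketch of that strategy with the BigQuery Python client: write the query results to a named destination table and give it an explicit expiration, so clients can keep fetching it after the 24-hour cache window (the table ID, query, and 30-day lifetime are placeholders):

```python
from datetime import datetime, timedelta, timezone
from google.cloud import bigquery

client = bigquery.Client()
dest = "my-project.my_dataset.report_results"   # placeholder destination table

# Write the query results to a named table instead of relying on the 24 h cache.
job_config = bigquery.QueryJobConfig(
    destination=dest,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
client.query("SELECT 1 AS placeholder_column", job_config=job_config).result()

# Give the table an explicit expiration, e.g. 30 days from now.
table = client.get_table(dest)
table.expires = datetime.now(timezone.utc) + timedelta(days=30)
client.update_table(table, ["expires"])
```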

Shifting window in a Google BigQuery dataset

I have 30 daily sharded tables in Big Query from Nov 1 to Nov 30, 2016.
Each of these tables follow the naming convention of "sample_datamart_YYYYMMDD".
Each of these daily tables have a field called timestampServer.
My goal is to advance the data by 24 hours at 00:00:00 UTC every day, so that the data is kept current without me having to copy the tables.
Is there any way to:
1) do a calculation on the field timestampServer so that it gets updated every 24 hours?
2) and at the same time rename the table from sample_datamart_20161130 to sample_datamart_20161201?
I've read the other posts, and I think those are more about aggregations over a 30-day window. My objective is not to do any aggregations. I just want to move the whole dataset forward by 24 hours so that when I search the last 1 day, there will always be data there.
Does anyone know whether Google Cloud Datasets: Update would be able to perform these tasks?
https://cloud.google.com/bigquery/docs/reference/rest/v2/datasets/update#try-it
Thanks very much for any guidance.
As for #2 - how to rename the table from sample_datamart_20161130 to sample_datamart_20161201?
This can be achieved by copying the table to a new table and then deleting the original table.
There is zero extra cost, as the copy job is free of charge.
The table can be copied with the Jobs: Insert API using a copy configuration, and the original can then be deleted using the Tables: Delete API.
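With the BigQuery Python client this boils down to two calls; a sketch (the project and dataset are placeholders, the table names are the ones from the question):

```python
from google.cloud import bigquery

client = bigquery.Client()
old_table = "my-project.my_dataset.sample_datamart_20161130"
new_table = "my-project.my_dataset.sample_datamart_20161201"

# Copy jobs are free of charge; wait for the copy before deleting the original.
client.copy_table(old_table, new_table).result()
client.delete_table(old_table)
print(f"{old_table} 'renamed' to {new_table}")
```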
Just wanted to note that the above answer directly addresses only your (second) question. But somehow I feel you may be going in the wrong direction. If you describe in more detail what you are trying to achieve (as opposed to how you think you will implement it), we might be able to provide better help. If you go that way - I would recommend posting it as a separate question :o)

BigQuery data availability

I am running a series of BigQuery jobs: two jobs each use a LOAD job to insert-overwrite data into two tables from Google Storage, and then a last job performs a JOIN on these tables to produce a result table.
The problem I am experiencing is that the result table from the JOIN does not reflect the data from one of the two tables I have loaded, implying that the data written during the LOAD job is not yet available for query.
When I re-ran the JOIN manually about an hour later, the result table was correct. This implies there is some unknown time period where the data was loaded but the contents of the table were not yet refreshed.
Is there more information that the google team can provide regarding this situation?
Here is the logging to understand the timeline:
table 1 LOAD complete
2015-03-20 16:22:54,237 INFO com.ni.google.application.ImportApplication - job job_U_OkoXXk91zl2wlyKWb5uWxNHkk is complete, table media_20150320 set to expire at 1434639614948
table 2 LOAD complete
2015-03-20 16:33:29,123 INFO com.ni.google.application.LoadTablesApplication - job job_QHxva8d6lXmxpaiZDyUmyDSWu6o is complete, table warehouse_dataview_interest_counts_1day set to expire at 1434645158930
# SHOULD I SLEEP HERE
table1 JOIN table2 BEGIN
2015-03-20 16:33:39,916 INFO com.ni.google.application.RollupApplication - loading query template: warehouse_comparison_1day
Thanks,
Luke
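For reference, the usual way to sequence this without guessing at a sleep is to block on each load job before submitting the dependent query; once a load job reports completion, its data should normally be queryable right away. A minimal sketch with the BigQuery Python client (bucket URIs, file format, and table IDs are placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder sources and destinations for the two loads.
load_config = bigquery.LoadJobConfig(
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    source_format=bigquery.SourceFormat.CSV,
)
load1 = client.load_table_from_uri(
    "gs://my-bucket/media.csv",
    "my-project.my_dataset.media_20150320",
    job_config=load_config,
)
load2 = client.load_table_from_uri(
    "gs://my-bucket/interest_counts.csv",
    "my-project.my_dataset.warehouse_dataview_interest_counts_1day",
    job_config=load_config,
)

# result() blocks until each load job has finished committing its data.
load1.result()
load2.result()

# Only at this point submit the JOIN query that depends on both tables,
# e.g. client.query(join_sql).result() with the warehouse_comparison_1day SQL.
print("Both loads finished; safe to run the JOIN.")
```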
It happened to me also, and it could have been a transient issue with some nodes. Anyway, we are good now, and I see from your updates that you are too.
If you see similar issue, please post it to the issue tracker:
https://code.google.com/p/google-bigquery/