BigQuery dataset deletion and name reuse - google-bigquery

When I delete a dataset in BigQuery and then create another one with the same dataset name, in the same project but in a different region, it throws an error. It simply says 'Not found: Dataset 'projectId:datasetName''.
This is an important problem, as GA360 imports rely on the dataset being named after the view ID. Now that BigQuery is available in Australia, we would like to be able to use it there.
How can I fix this problem?

False alarm. It turns out that BigQuery just needs some more time to complete the deletion. I tried again after a few minutes and it now works.
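For anyone scripting this, here is a minimal sketch of the delete-then-recreate flow with a retry, using the google-cloud-bigquery Python client (the project, dataset and region names are placeholders):

import time
from google.api_core.exceptions import Conflict, NotFound
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project

# Delete the old dataset (and its tables) in the original region.
client.delete_dataset("my-project.my_dataset", delete_contents=True, not_found_ok=True)

# Recreate the same dataset name in a different region. The deletion can take
# a few minutes to propagate, so retry if BigQuery still rejects the name.
dataset = bigquery.Dataset("my-project.my_dataset")
dataset.location = "australia-southeast1"

for attempt in range(10):
    try:
        client.create_dataset(dataset)
        break
    except (Conflict, NotFound):  # the error while the name is being released may vary
        time.sleep(60)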

Related

BigQuery create table error: dataset not found in location

Here is my situation:
My colleague has a dataset located in asia-northeast3 in his BigQuery project. He has already given me reader access to his dataset. I'm trying to extract some necessary data from one of his tables and save it into a new table under my dataset (location: us-central).
I wrote the following SQL to do this, but BigQuery reported an error:
Not found: Dataset my_project_id:dataset_in_us was not found in
location asia-northeast3
CREATE OR REPLACE TABLE `my_project_id.dataset_in_us.my_tablename` AS
SELECT
create_date
, totalid -- id for article.
, urlpath -- format like /article/xxxx
, article_title -- text article title
FROM `my_colleagues_project_id.dataset_in_asia_northeast3.tablename`
ORDER BY 1 DESC
;
I can't change my dataset location or his. I need to join the data from his dataset with data from my dataset. How can I solve this?
After a day of trying and failing, I found an imperfect solution.
I copied my colleague's entire dataset from asia-northeast3 to us-central following this guide.
After that I can run my query on the copied dataset.
This solution is time (and money) consuming. I'm still trying to figure out if there is a way to only copy a single table, instead of an entire dataset, from one location to another.
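A query job runs in a single location, and every table it reads or writes (including the destination) has to live in that location, which is why the data must be co-located first. For reference, the dataset copy described above can also be scripted with the BigQuery Data Transfer Service's cross_region_copy data source; a rough sketch using the project/dataset names from the question (the destination dataset must already exist in the US location):

from google.cloud import bigquery_datatransfer

transfer_client = bigquery_datatransfer.DataTransferServiceClient()

transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="dataset_in_us",
    display_name="Copy colleague's dataset to US",
    data_source_id="cross_region_copy",
    params={
        "source_project_id": "my_colleagues_project_id",
        "source_dataset_id": "dataset_in_asia_northeast3",
    },
)

transfer_config = transfer_client.create_transfer_config(
    parent=transfer_client.common_project_path("my_project_id"),
    transfer_config=transfer_config,
)
print("Created transfer config:", transfer_config.name)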

Airflow Pipeline CSV to BigQuery with Schema Changes

Background
I need to design an Airflow pipeline to load CSVs into BigQuery.
I know the CSVs frequently have a changing schema. After loading the first file the schema might be
id | ps_1 | ps_1_value
when the second file lands and I load it, it might look like
id | ps_1 | ps_1_value | ps_2 | ps_2_value.
Question
What's the best approach to handling this?
My first thought on approaching this would be
Load the second file
Compare the schema against the current table
Update the table, adding two columns (ps_2, ps_2_value)
Insert the new rows
I would do this in a PythonOperator.
If file 3 comes in and looks like id | ps_2 | ps_2_value, I would fill in the missing columns and do the insert.
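For reference, the "compare schemas and add missing columns" step could look roughly like this with the google-cloud-bigquery client (the table id is a placeholder, and the new columns are assumed to be nullable STRINGs for simplicity):

from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("my-project.my_dataset.my_table")  # placeholder table id

existing = {field.name for field in table.schema}
# Columns found in the newly landed CSV (example from above).
incoming = ["id", "ps_1", "ps_1_value", "ps_2", "ps_2_value"]

new_fields = [
    bigquery.SchemaField(name, "STRING", mode="NULLABLE")
    for name in incoming
    if name not in existing
]

if new_fields:
    table.schema = list(table.schema) + new_fields
    client.update_table(table, ["schema"])  # add ps_2 and ps_2_value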
Thanks for the feedback.
After loading two prior files, example_data_1.csv and example_data_2.csv, I can see that the fields are being inserted into the correct columns, with new columns being added as needed.
Edit: the light bulb moment was realizing that the schema_update_options parameter exists. See here: https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.job.SchemaUpdateOption.html
csv_to_bigquery = GoogleCloudStorageToBigQueryOperator(
    task_id='csv_to_bigquery',
    google_cloud_storage_conn_id='google_cloud_default',
    bucket=airflow_bucket,
    source_objects=['data/example_data_3.csv'],
    skip_leading_rows=1,
    bigquery_conn_id='google_cloud_default',
    destination_project_dataset_table='{}.{}.{}'.format(project, schema, table),
    source_format='CSV',
    create_disposition='CREATE_IF_NEEDED',
    write_disposition='WRITE_APPEND',
    schema_update_options=['ALLOW_FIELD_RELAXATION', 'ALLOW_FIELD_ADDITION'],
    autodetect=True,
    dag=dag
)
Basically, the recommended pipeline for your case consists of creating a temporary table to stage your new data.
Since Airflow is an orchestration tool, it's not recommended to push large volumes of data through it.
Given that, your DAG could be very similar to your current DAG:
Load the new file to a temporary table
Compare the actual table's schema and the temporary table's schema.
Run a query to move the data from the temporary table to the actual table. If the temporary table has new fields, add them to the actual table using the schema_update_options parameter. Besides that, if your actual table has fields in NULLABLE mode, it will easily handle the case where your new data is missing some fields (see the sketch after this list).
Delete your temporary table
If you're using GCS, move your file to another bucket or directory.
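A rough sketch of step 3 with the google-cloud-bigquery client (the table names are placeholders; using SELECT * keeps the example short, an explicit column list works too):

from google.cloud import bigquery

client = bigquery.Client()

temp_table = "my-project.staging.example_data_tmp"   # placeholder
final_table = "my-project.reporting.example_data"    # placeholder

job_config = bigquery.QueryJobConfig(
    destination=final_table,
    write_disposition="WRITE_APPEND",
    # Let the job add new columns / relax REQUIRED to NULLABLE as needed.
    schema_update_options=[
        bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION,
        bigquery.SchemaUpdateOption.ALLOW_FIELD_RELAXATION,
    ],
)

query_job = client.query(f"SELECT * FROM `{temp_table}`", job_config=job_config)
query_job.result()  # wait for the append to finish

# Step 4: drop the temporary table.
client.delete_table(temp_table, not_found_ok=True)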
Finally, I would like to point out some links that might be useful to you:
Airflow documentation (BigQuery operators)
An article which shows a problem similar to yours and where you can find some of the mentioned information.
I hope it helps!

Can't save GBQ results in any format

I ran a query in Google BigQuery that is a basic Select * From [table] where [single column = (name)]
The results came out to about 310 lines and 48 columns, but when I try to save the results in ANY format, nothing happens.
I've tried saving as a view AND a table, which I can do just fine, but trying to download the results, or trying to export the results to GCP, fails every time. There is no error, no notification that something went wrong, literally nothing happens.
I'm about ready to yank out my hair and throw my computer out the window. I ran a query that was almost identical except for the (name) this morning and had no issue. Now it's after 4pm and it's not working.
All of my browsers are up to date, my logins are fine, my queries aren't reliant on tables that update during that time, I've restarted my computer four times in the hope that SOMETHING will help.
Has anyone had this issue? What else can I do to troubleshoot?
Do you have any field of RECORD (REPEATED) type in your results? I had a similar problem today, trying to save my results to Google Sheets - literally nothing happened, no error message whatsoever - but fortunately (and quite puzzlingly) - I got this error while trying to save them to CSV on Google Drive instead: "Operation cannot be performed on a nested schema. Field: ...". After removing the "offending" field, which was of RECORD (REPEATED) type, I was able to save to Google Sheets again.
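If you hit the same thing, one way around it is to drop the nested column with SELECT * EXCEPT before saving or exporting (note that EXCEPT needs standard SQL, not the legacy [table] syntax). A small sketch with placeholder project/table/column names:

from google.cloud import bigquery

client = bigquery.Client()

# "events" is a placeholder for the RECORD (REPEATED) column blocking the export.
query = """
SELECT * EXCEPT (events)
FROM `my-project.my_dataset.my_table`
WHERE single_column = 'name'
"""

rows = client.query(query).result()
print(rows.total_rows, "flat rows ready to download or export")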

Binary Sankey Diagram in Tableau - Not All Activities Match The Corresponding Number of KPIs

How do I link my activities variable to only the corresponding KPIs variable?
Using guidance from a number of sources, but primarily the genius of Jeffery Shafer articulated through the SuperDataScience video, I built a Sankey Diagram for my work. For the most part it works, however, I have been trying to figure out how to adjust my Sankey Diagram model to line up each activity with ONLY the corresponding KPIs, but am having no luck.
The data structure looks like this:
You'll note I changed the binary values to 1, 2 instead of 0, 1 as it makes the visual calculations easier. For the "Viz" variable, I have "Activity" for the raw data set, then I copy/paste/replicate the data to mirror it (required for the model) but with "KPI" for the mirrored data.
In the following image, you'll see my main issue is that the smallest represented activity still shows as corresponding to all KPIs when in fact it does not. I want activity to line up only with the corresponding KPIs as some activities don't correspond with all, or even any, KPIs.
Finally, here is the model very similar to what the above video link shows:
Can someone help provide insight into how I can adjust the model to fit activities linking only to corresponding KPIs? I appreciate any insight. Thanks!
I have a solution to the issue, thanks to a helpful Tableau support member named Anthony. It was in the data structure. The data was not structured to associate each "Activities" value only with its corresponding "KPI" values, as Tableau requires, but instead associated every "Activities" value with every "KPI" value. To achieve the desired result, the data needs to be restructured to contain a row only for every valid "Activities" and "KPI" combination.
Once the table is restructured, the model produces the desired visual. It works like a charm!
Good luck out there!

Finding the query that created a table in BigQuery

I am a new employee at the company. The person before me built some tables in BigQuery, and I want to investigate the query that created one particular table.
Things I would want to check using the query are:
What joins were used?
What are the other tables used to make the table in question?
I have not worked with BigQuery before but I did my due diligence by reading tutorials and the documentation. I could not find anything related there.
Brief outline of the steps below:
Step 1 - gather all query jobs of that user using the Jobs.list API - you must have the Is Owner permission on the respective projects to get someone else's jobs
Step 2 - extract only those jobs run by the user you mentioned that reference your table of interest - using the destination table attribute
Step 3 - for those extracted jobs - simply check the respective queries, which shows you how that table was populated (see the sketch below)
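A minimal sketch of those steps with the google-cloud-bigquery Python client (the project and table names are placeholders):

from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project

# Placeholder table of interest.
target = ("my-project", "my_dataset", "my_table")

# Step 1: list query jobs across users (needs the permission mentioned above).
for job in client.list_jobs(all_users=True, state_filter="done"):
    if job.job_type != "query" or job.destination is None:
        continue
    dest = (job.destination.project, job.destination.dataset_id, job.destination.table_id)
    # Steps 2 and 3: keep jobs that wrote to the table and inspect their SQL.
    if dest == target:
        print(job.user_email, job.created)
        print(job.query)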
Hth!
I had been looking for an answer for a long time and finally found it:
Go to the navigation menu (the three bars at the top left).
From there go to the Analytics section.
Under BigQuery you will find the Scheduled queries option; click on that.
In the filter tab you can enter keywords and get the required query for the table.
For me, I was able to go through my query history and find the query I used.
Step 1.
Go to the BigQuery UI; at the bottom there are Personal history and Project history tabs. If you can use the same account that executed the query, I recommend Personal history.
Step 2.
Click on the tab and there will be a list of queries ordered from most recently run. Check the time the table was created and find a query that ran just before the table creation time.
Since the query runs first and then creates the table, the timestamps will differ slightly; for me it was only a few seconds.
Step 3.
After you find the query used to create the table, simply copy it. And you're done.