BigQuery create table error: dataset not found in location - sql

Here is my situation:
My colleague has a dataset located in asia-northeast3 in his BigQuery project. He has already given me reader access to his dataset. I'm trying to extract some necessary data from one of his tables and save it into a new table under my dataset (location: us-central).
I wrote the following SQL to do this, but BigQuery reported an error:
Not found: Dataset my_project_id:dataset_in_us was not found in location asia-northeast3
CREATE OR REPLACE TABLE `my_project_id.dataset_in_us.my_tablename` AS
SELECT
create_date
, totalid -- id for article.
, urlpath -- format like /article/xxxx
, article_title -- text article title
FROM `my_colleagues_project_id.dataset_in_asia_northeast3.tablename`
ORDER BY 1 DESC
;
I can't change my dataset location or his. I need to join the data from his dataset with data from my dataset. How can I solve this?

After a day of trying and failing, I found an imperfect solution.
I copied my colleague's entire dataset from asia-northeast3 to us-central following this guide.
After that I can run my query on the copied dataset.
This solution is time- (and money-) consuming. I'm still trying to figure out whether there is a way to copy only a single table, instead of an entire dataset, from one location to another.
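For copying just a single table, one option is to export it to Cloud Storage in the source region, copy the exported files to a bucket in the destination region, and load them into the destination dataset. Below is a rough sketch of that flow using the google-cloud-bigquery and google-cloud-storage Python clients; the bucket names are placeholders, and I'm assuming us-central1 as the destination region and Parquet as the interchange format.

# Rough sketch: move a single table across locations via Cloud Storage.
# Bucket names are placeholders; each bucket should be colocated with the
# dataset it is used with (asia-northeast3 for the export, us-central1 for the load).
from google.cloud import bigquery, storage

bq = bigquery.Client(project="my_project_id")
gcs = storage.Client(project="my_project_id")

src_table = "my_colleagues_project_id.dataset_in_asia_northeast3.tablename"
dst_table = "my_project_id.dataset_in_us.my_tablename"
asia_bucket = "my-export-bucket-asia"   # bucket in asia-northeast3
us_bucket = "my-import-bucket-us"       # bucket in us-central1

# 1. Export the source table to the bucket colocated with it.
bq.extract_table(
    src_table,
    f"gs://{asia_bucket}/export/tablename-*.parquet",
    location="asia-northeast3",
    job_config=bigquery.ExtractJobConfig(destination_format="PARQUET"),
).result()

# 2. Copy the exported files to the bucket in the destination region.
dst_bucket = gcs.bucket(us_bucket)
for blob in gcs.list_blobs(asia_bucket, prefix="export/"):
    gcs.bucket(asia_bucket).copy_blob(blob, dst_bucket, blob.name)

# 3. Load the files into the destination dataset.
bq.load_table_from_uri(
    f"gs://{us_bucket}/export/tablename-*.parquet",
    dst_table,
    location="us-central1",  # adjust to your dataset's actual location
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.PARQUET,
        write_disposition="WRITE_TRUNCATE",
    ),
).result()

As far as I know, the export and load jobs themselves don't incur query charges, so the main costs are the temporary files in Cloud Storage and the cross-region network traffic when copying them between buckets.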

Related

Cannot create a PARTITION BY DATE(timestamp)

Hello, I am using my personal GCP account to play around in BigQuery, and I am still within my free-tier range (a billing account is linked, but no fees have been incurred yet).
So I created a table to fetch the baseball.games_wide table from bigquery-public-data. The following is my simple CREATE TABLE query with a PARTITION BY on the timestamp column 'startTime'.
CREATE TABLE project.table
PARTITION BY date(startTime) AS
SELECT
gameId, seasonID, date(startTime) as game_date, startTime, year
FROM `bigquery-public-data.baseball.games_wide`
WHERE YEAR = 2016
The table was created successfully and I can see the worker has a write phase, which indicates that something is being written to the table. However, when I go to 'Preview' the table, there is no data to display, and the table size is 0 KB.
I tried removing the second line (i.e., PARTITION BY date(startTime)) when creating the table; then the data is ingested and I am able to preview it in the console. It seems the PARTITION BY clause is causing the problem, but I can't tell what goes wrong. Any ideas?
As you have mentioned in the comments, this issue is resolved by creating a new dataset after the billing account is linked to the project.
You can follow this tutorial to create a billing account and link it to the project.
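In case it helps, here is a rough sketch of the same fix done with the BigQuery Python client once billing is linked; the project and dataset IDs are placeholders, and the query simply repeats the partitioned CREATE TABLE from the question.

# Sketch: create a fresh dataset after billing is linked, then the partitioned table.
# "my-project" and "baseball_analysis" are placeholder IDs.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")
client.create_dataset("baseball_analysis", exists_ok=True)

query = """
CREATE TABLE `my-project.baseball_analysis.games_2016`
PARTITION BY DATE(startTime) AS
SELECT gameId, seasonID, DATE(startTime) AS game_date, startTime, year
FROM `bigquery-public-data.baseball.games_wide`
WHERE year = 2016
"""
client.query(query).result()  # wait for the CREATE TABLE ... AS SELECT to finish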

Grafana Status timeline not working with PostgreSQL and only one Query

I'm creating a dashboard in Grafana with data extracted from Google servers and stored in a PostgreSQL database.
In one of the visualizations I would like to create a Status Timeline.
I have created a query in PostgreSQL which returns the following table:
As far as I understand, that is the data needed to create a Status Timeline (time, count of a variable, and the name of the variable being counted).
But when I copy that query into Grafana, the chart is not the one I had imagined:
I don't know what else to do or how to fix it.
Has anyone faced this issue before, or does anyone know how to solve it, in order to get a Status Timeline like the one shown above?
Thank you very much!
As you can see in your last image, the metric names used in the panel are the column names, and the status values used are the column values. So you need a table result like this:
executed_on | dev | asia-dev | ...
2022-06-07 12:00:00 | 1 | 4 | ...
... | ... | ... | ...

Incompatible parquet type in Impala

I have seen some other posts about this, but have not found an answer that permanently works.
We have a table, and I had to add two columns to it. In order to do so, I dropped the table and recreated it. But since it was an external table, it did not drop the associated data files. The data gets loaded from a control file and is partitioned by date. So let's say the dates that were in the table were 2021-01-01 and 2021-01-02. But only 2021-01-02 is in the control file. So when I am loading that date, it gets re-run with the new columns and everything is fine. But 2021-01-01 is still there, but with a different schema.
This is no issue in Hive, as it seems to default to resolve by name, not position. But Impala resolves by position, so the new columns throw it off.
If I have a table that previously had the columns c1, c2, c3 and now has the additional columns c4, c5, and I try to run a query such as
select * from my_table where c5 is null limit 3;
this will give an incompatible Parquet schema error in Impala (but Hive is fine; it would just return NULL for c4 and c5 for the date 2021-01-01).
If I run the command set PARQUET_FALLBACK_SCHEMA_RESOLUTION=name; and then run the above query again, it works fine. But I would have to run set PARQUET_FALLBACK_SCHEMA_RESOLUTION=name; at the beginning of each session, which is not ideal.
From searching online, I have come up with a few solutions:
Drop all data files when creating the new table and just start loading from scratch (not ideal, since I think we want to keep the old files)
Re-load each date (this might not be ideal as there could be many, many dates that would have to be re-loaded and overwritten)
Change the setting permanently in Cloudera Manager (I do not have access to CM and don't know how feasible it would be to change it)
Are there any other solutions to have it so I don't have to run set PARQUET_FALLBACK_SCHEMA_RESOLUTION=name; each time I want to use this table in Impala?
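As far as I know, PARQUET_FALLBACK_SCHEMA_RESOLUTION is a per-session query option rather than a table property, so if the cluster-wide default can't be changed, one workaround is to have whatever client you use set it automatically right after connecting. A rough sketch with the impyla client (host, port, and table name are placeholders):

# Sketch: always apply the schema-resolution option when opening an Impala session.
from impala.dbapi import connect

def impala_cursor(host="impala-host", port=21050):
    conn = connect(host=host, port=port)
    cur = conn.cursor()
    # Query options only last for the session, so set it immediately after connecting.
    cur.execute("SET PARQUET_FALLBACK_SCHEMA_RESOLUTION=name")
    return cur

cur = impala_cursor()
cur.execute("SELECT * FROM my_table WHERE c5 IS NULL LIMIT 3")
print(cur.fetchall())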

Airflow Pipeline CSV to BigQuery with Schema Changes

Background
I need to design an Airflow pipeline to load CSVs into BigQuery.
I know the CSVs frequently have a changing schema. After loading the first file, the schema might be
id | ps_1 | ps_1_value
when the second file lands and I load it, it might look like
id | ps_1 | ps_1_value | ps_2 | ps_2_value.
Question
What's the best approach to handling this?
My first thought on approaching this would be
Load the second file
Compare the schema against the current table
Update the table, adding two columns (ps_2, ps_2_value)
Insert the new rows
I would do this in a PythonOperator.
If file 3 comes in and looks like id | ps_2 | ps_2_value I would fill in the missing columns and do the insert.
Thanks for the feedback.
Edit: The light-bulb moment was realizing that schema_update_options exists. See here: https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.job.SchemaUpdateOption.html
After loading the two prior files, example_data_1.csv and example_data_2.csv, I can see that the fields are being inserted into the correct columns, with new columns being added as needed. The operator now looks like this:
csv_to_bigquery = GoogleCloudStorageToBigQueryOperator(
    task_id='csv_to_bigquery',
    google_cloud_storage_conn_id='google_cloud_default',
    bucket=airflow_bucket,
    source_objects=['data/example_data_3.csv'],
    skip_leading_rows=1,
    bigquery_conn_id='google_cloud_default',
    destination_project_dataset_table='{}.{}.{}'.format(project, schema, table),
    source_format='CSV',
    create_disposition='CREATE_IF_NEEDED',
    write_disposition='WRITE_APPEND',
    schema_update_options=['ALLOW_FIELD_RELAXATION', 'ALLOW_FIELD_ADDITION'],
    autodetect=True,
    dag=dag
)
Basically, the recommended pipeline for your case consists of creating a temporary table to process your new data.
Since Airflow is an orchestration tool, it's not recommended to push big flows of data through it.
Given that, your DAG could be very similar to your current one (a sketch follows the list below):
Load the new file to a temporary table
Compare the actual table's schema and the temporary table's schema.
Run a query to move the data from the temporary table to the actual table. If the temporary table has new fields, add them to the actual table using the parameter schema_update_options. Besides that, if the fields in your actual table are in NULLABLE mode, it will easily handle missing columns in case your new data is missing some field.
Delete your temporary table
If you're using GCS, move your file to another bucket or directory.
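To make that flow concrete, here is a minimal sketch of such a DAG using the Google provider package's operators; the project, dataset, staging-table, and bucket names are placeholders, and the SELECT that moves the data is only illustrative.

# Sketch of the staging-table flow described above; all IDs are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator
from airflow.providers.google.cloud.operators.bigquery import (
    BigQueryDeleteTableOperator,
    BigQueryInsertJobOperator,
)

PROJECT, DATASET = "my-project", "my_dataset"
STAGING, TARGET = "example_data_staging", "example_data"

with DAG("csv_to_bq_staging", start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:

    # 1. Load the new file into a staging table, letting BigQuery autodetect its schema.
    load_staging = GCSToBigQueryOperator(
        task_id="load_staging",
        bucket="my-airflow-bucket",
        source_objects=["data/example_data_3.csv"],
        destination_project_dataset_table=f"{PROJECT}.{DATASET}.{STAGING}",
        source_format="CSV",
        skip_leading_rows=1,
        autodetect=True,
        create_disposition="CREATE_IF_NEEDED",
        write_disposition="WRITE_TRUNCATE",
    )

    # 2. Append the staged rows into the real table; schemaUpdateOptions lets the
    #    query job add or relax columns on the destination as needed.
    append_to_target = BigQueryInsertJobOperator(
        task_id="append_to_target",
        configuration={
            "query": {
                "query": f"SELECT * FROM `{PROJECT}.{DATASET}.{STAGING}`",
                "useLegacySql": False,
                "destinationTable": {
                    "projectId": PROJECT,
                    "datasetId": DATASET,
                    "tableId": TARGET,
                },
                "writeDisposition": "WRITE_APPEND",
                "schemaUpdateOptions": ["ALLOW_FIELD_ADDITION", "ALLOW_FIELD_RELAXATION"],
            }
        },
    )

    # 3. Drop the staging table once its data has been moved.
    drop_staging = BigQueryDeleteTableOperator(
        task_id="drop_staging",
        deletion_dataset_table=f"{PROJECT}.{DATASET}.{STAGING}",
    )

    load_staging >> append_to_target >> drop_staging

For the last step of the list (moving the processed file), a GCSToGCSOperator with move_object=True could be chained after the delete task.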
Finally, I would like to point out some links that might be useful to you:
AirFlow Documentation (BigQuery's operators)
An article that shows a problem similar to yours and where you can find some of the mentioned information.
I hope it helps.

One to Many - Calculated Column

I am trying to teach myself the new Tabular model for SQL 2012 SSAS to handle some analytic reports that were previously handled in (slow) stored procedures.
I've made decent progress on most of it, just figuring out how things work and how to add the calculations I need but I have been banging my head against the following:
I have a table that has file information -- it has:
ID
FileName
CurrentStatus
UploadedBy
And then a table that has the statuses that the file went through (the many side of a one-to-many relationship with the File table):
FileID
StatusID
TimeStamp
What I'm trying to do is add a calculated column to the File table that returns the TimeStamp for when a file was in a particular status, i.e. StatusID = 100 is 'uploaded'. I want to add a calculated column called UploadedDate on the File table that has the associated TimeStamp information from the FileStatus table.
It seems like this should be doable with DAX but I just can't seem to wrap my head around it. Any ideas out there?
In advance, many thanks,
Brent
Here's a formula that should work for what you want to do...
=MAXX(
    CALCULATETABLE(
        'FileStatus',
        'FileStatus'[StatusID] = 100
    ),
    'FileStatus'[TimeStamp]
)
I'm assuming each file can only be in each status once (there is only one row per FileID that has StatusID 100). I believe you can just use a LOOKUPVALUE formula. The formula for your UploadedDate calculated column would be something like
=LOOKUPVALUE(FileStatus[TimeStamp], FileStatus[FileID], File[ID], FileStatus[StatusID], 100)
Here's the MSDN description of LOOKUPVALUE. You provide the column containing the value you want returned, the column you want to search, and the value you are searching for. You can add multiple column/value pairs as search criteria. Here's a blog post that contains a good example.