How to ETL from BigQuery to BigQuery? - google-bigquery

I have an external data source that loads data into several tables, in batch mode, several times a day. However, the tables need cleaning, transformation, joining, etc.
Once the raw tables are transformed, they are visualized in data visualization software (Power BI or Data Studio).
Now the question is: how do I update the result tables every time new data is loaded into the raw tables?
In BigQuery, if you create a table from other tables, the created table is not refreshed when the source tables are updated, right?
Another option is to use something like Cloud Composer to query the source tables and load the data into the destination tables.
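For illustration, the first approach can be automated as a BigQuery scheduled query (or a query run by Cloud Composer) that rebuilds the result table after each batch load; a minimal sketch with hypothetical dataset and table names (etl.raw_orders, etl.raw_customers, etl.reporting_orders):

-- Rebuild the result table from the raw tables; run this on a schedule after each batch load.
CREATE OR REPLACE TABLE etl.reporting_orders AS
SELECT
  o.order_id,
  o.order_date,
  c.customer_name,
  SUM(o.amount) AS total_amount
FROM etl.raw_orders AS o
JOIN etl.raw_customers AS c
  ON o.customer_id = c.customer_id
GROUP BY o.order_id, o.order_date, c.customer_name;

For large raw tables, a MERGE that applies only the newly loaded rows is the usual alternative to a full rebuild.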

Related

Get table names in dataset after dataset truncate

It seems that the BigQuery CLI supports restoring tables in a dataset after they have been deleted by using BigQuery Time Travel functionality -- as in:
bq cp dataset.table@TIME_AGO_UNIX dataset.table
However, this assumes we know the names of the tables. I want to write a script to iterate over all the tables that were in the dataset at TIME_AGO_UNIX time.
How would I go about finding those tables at that time?

Small single parquet file on Data Lake, or relational SQL DB?

I am designing a Data Lake in Azure Synapse and, in my model, there is a table that will store a small amount of data (like 5000 rows).
The single parquet file that stores this data will surely be smaller than the smallest recommended size for a parquet file (128 MB) and I know that Spark is not optimized to handle small files. This table will be linked to a delta table, and I will insert/update new data by using the MERGE command.
In this scenario, regarding performance, is it better to stick with a delta table, or should I create a SQL relational table in another DB and store this data there?
It depends on multiple factors, such as the types of queries you will be running and how often you want to run the MERGE command to upsert data into the delta table.
But even if you do run analytical queries, given the size of the data I would go with a relational DB.
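For reference, the upsert mentioned above would typically be a Delta Lake MERGE in Spark SQL; a minimal sketch with hypothetical names (dim_lookup as the delta table, updates as the staged batch of new rows):

MERGE INTO dim_lookup AS t
USING updates AS s
  ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

At roughly 5,000 rows the merge itself is cheap either way, so the decision comes down mostly to how the table is queried and joined downstream.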

How to deal with data structure changes when performing a full historical data load?

I'm dealing with a SQL Server database which contains a column "defined data" with JSON data in it (and some other simple columns). The data builds up over time, right now we have about 8 million rows.
The data from this db is periodically read by an ETL system which then reads the JSON data in the "defined data" column and maps the data to a new SQL Server table based on the columns names contained in the JSON data.
This SQL Server table is prone to changes, meaning that about every 4 months additional columns are needed or column names change. Whenever this SQL Server table changes its data structure, a new version is introduced, which also forces the JSON data structure to change.
However, the ETL system should still be able to load all historical (JSON) data from the SQL Server database, regardless of the changing version throughout time. How can I make this work, taking into consideration version changes of the SQL Server tables and the JSON data?
(example image: version 1 and version 2 JSON rows, where version 2 adds an "AssetType" field)
So in this example my question is:
How can I ensure that I can load both client 20 and 21 into one SQL Server table without getting errors because the JSON data structure is not reflecting version 2 in the case of historical data?
Given the size of the SQL Server database, it doesn't seem like an option to update all historical JSON data according to the latest version (in this example that would mean adding "AssetType" for the 01-01-2021 data and filling it in with NULL).
Many, many thanks in advance!
First I would check whether the JSON fields already exist as columns in the table by looking them up in the information schema. If a field does not exist, then ALTER TABLE to add the column.
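A minimal T-SQL sketch of that check, with hypothetical names (dbo.TargetTable as the mapped table, AssetType as the new field coming from the JSON):

IF NOT EXISTS (
    SELECT 1
    FROM INFORMATION_SCHEMA.COLUMNS
    WHERE TABLE_SCHEMA = 'dbo'
      AND TABLE_NAME   = 'TargetTable'
      AND COLUMN_NAME  = 'AssetType'
)
BEGIN
    -- Add the column as NULLable so historical rows remain valid.
    ALTER TABLE dbo.TargetTable ADD AssetType NVARCHAR(100) NULL;
END;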
How can I ensure that I can load both client 20 and 21 into one SQL Server table without getting errors because the JSON data structure is not reflecting version 2 in the case of historical data?
You maintain two separate tables: a Raw/Staging/Bronze table that has the same schema as the source, and a Cleansed/Warehouse/Silver table that has the desired schema for reporting. If you have multiple separate sources, you may have separate Raw tables.
Periodically you extend the schema of the Cleansed table to pick up new data that has appeared in the Raw table.
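This works because, when mapping the JSON out of the Raw table, properties that are missing from older versions simply come back as NULL. A minimal T-SQL sketch of that mapping, assuming hypothetical names (dbo.RawClients as the Raw table holding the "defined data" column; AssetName present in both versions, AssetType only from version 2):

SELECT
    r.Client,
    r.[Version],
    j.AssetName,
    j.AssetType   -- NULL for version 1 rows, which lack this key
FROM dbo.RawClients AS r
CROSS APPLY OPENJSON(r.[defined data])
WITH (
    AssetName NVARCHAR(100) '$.AssetName',
    AssetType NVARCHAR(100) '$.AssetType'
) AS j;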

Pentaho multi table input multi table output

A question regarding Pentaho Spoon (Data Integration):
How can I transfer the input of multiple tables from one database to multiple tables in another database? Basically a 1:1 data migration, with the tables created automatically in the target database.
I basically want to replicate the following transformation for many tables: (picture of a table-to-table transformation)
Try the Copy Tables wizard, under the tools menu.
To use it, you will need to create a new transformation and define both database connections that you want to use.

Using SSIS to create new Database from two separate databases

I am new to SSIS. I was given a task according to the scenario explained below.
Scenario:
I have two databases, A and B, on different machines, with around 25 tables and 20 columns each, plus relationships and dependencies. My task is to create a database C with a selected number of those tables, and in each table I need only some of the columns, not all. The relationships must stay intact and be created automatically in the new database.
What I have done:
I have created a package using the Transfer SQL Server Objects task to transfer the tables and relationships.
Then I manually edited the tables to drop the columns that are not required,
and then I transferred the data using a data flow with a source and a destination.
My question is: can I achieve all of this in one package? Also, after the initial transfer, how can I schedule the package so that it transfers only the recently inserted rows to the new database?
Please help me
thanks in advance
You can schedule the package by using a SQL Server Agent job - one of the options for a job step is to run an SSIS package.
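For reference, a hedged T-SQL sketch of creating such a job; the job name, package path, and schedule below are hypothetical, and in practice this is usually set up through the SQL Server Agent UI:

USE msdb;
EXEC dbo.sp_add_job @job_name = N'Load_DatabaseC';
EXEC dbo.sp_add_jobstep
    @job_name  = N'Load_DatabaseC',
    @step_name = N'Run SSIS package',
    @subsystem = N'SSIS',
    @command   = N'/FILE "C:\Packages\LoadDatabaseC.dtsx"';   -- dtexec-style arguments
EXEC dbo.sp_add_jobschedule
    @job_name  = N'Load_DatabaseC',
    @name      = N'Nightly',
    @freq_type = 4,               -- daily
    @freq_interval = 1,
    @active_start_time = 010000;  -- 01:00:00
EXEC dbo.sp_add_jobserver @job_name = N'Load_DatabaseC', @server_name = N'(local)';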
With regard to transferring new rows, I would do one of the following:
Track your current "position" in another table (this assumes you have either an ascending key or a timestamp column): load the current position into an SSIS variable and use that variable in the WHERE clause of your data source queries.
Transfer all data across into "dump" copies of each table (no relationships/keys etc. required, just the same schema), use a T-SQL MERGE statement to load the new rows in, and then truncate the "dump" tables.
Hope this makes sense - it's a bit difficult to get across in writing; a rough sketch of both options is below.
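A hedged T-SQL sketch of both options, with hypothetical names (dbo.Orders_Source as the source table, dbo.Orders as the target, dbo.Orders_Dump as the staging copy, dbo.ETL_Watermark holding the last loaded key):

-- Option 1: incremental extract driven by a stored watermark.
-- In SSIS, the watermark would be loaded into a package variable and
-- substituted into the WHERE clause of the data source query.
DECLARE @LastLoadedID INT =
    (SELECT LastLoadedID FROM dbo.ETL_Watermark WHERE TableName = 'Orders');

SELECT OrderID, CustomerID, Amount, OrderDate
FROM dbo.Orders_Source
WHERE OrderID > @LastLoadedID;

-- Option 2: load everything into a "dump" copy, merge the new rows across, then truncate.
MERGE dbo.Orders AS t
USING dbo.Orders_Dump AS s
    ON t.OrderID = s.OrderID
WHEN NOT MATCHED THEN
    INSERT (OrderID, CustomerID, Amount, OrderDate)
    VALUES (s.OrderID, s.CustomerID, s.Amount, s.OrderDate);

TRUNCATE TABLE dbo.Orders_Dump;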