How to dynamically copy multiple datasets from a Google BigQuery project using Azure Synapse Analytics - google-bigquery

Is it possible to dynamically copy all datasets from a BigQuery project to Azure Synapse Analytics, and then dynamically copy all tables within each dataset? I know we can dynamically copy all tables within a BigQuery dataset (see this answered question: Loop over of table names ADFv2), but is there a way to do it at the project level, using the Lookup activity to loop through all datasets? Is there a way to do a SELECT * across the datasets?
SELECT
*
FROM
gcp_project_name.dataset_name.INFORMATION_SCHEMA.TABLES
WHERE table_type = 'BASE TABLE'
According to Microsoft's documentation for the Lookup activity in Azure Data Factory and Azure Synapse Analytics, this only reaches the dataset level.
I also tried putting just the GCP project name into the Lookup activity's query, but it did not work (see Understanding the "Not found: Dataset ### was not found in location US" error).

This can be done using a two-level pipeline. I tried to reproduce this, and below is the approach.
Add a Lookup activity with the Google BigQuery dataset as the source dataset. In the Query text box, enter the query below.
SELECT schema_name
FROM `project_name`.INFORMATION_SCHEMA.SCHEMATA
This query will list the datasets in the project.
Add a ForEach activity after the Lookup activity. In the ForEach activity's settings, set Items to the dynamic content @activity('Lookup1').output.value.
Then, inside the ForEach activity, add another Lookup activity with the same BigQuery dataset as the source dataset. Enter the query below as dynamic content.
SELECT
*
FROM
gcp_project_name.@{item().schema_name}.INFORMATION_SCHEMA.TABLES
WHERE table_type = 'BASE TABLE'
This will give the list of all tables within each dataset.
Since you cannot nest a ForEach inside another ForEach in ADF, design a two-level pipeline: the outer pipeline's ForEach loop invokes an inner pipeline that contains the nested loop.
Refer to NiharikaMoola-MT's answer on this SO thread about nested ForEach in ADF.
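Outside ADF, the same two-level enumeration can be expressed with the google-cloud-bigquery client. This is only a minimal sketch, assuming default credentials and the placeholder project name gcp_project_name:
from google.cloud import bigquery

client = bigquery.Client(project="gcp_project_name")  # placeholder project name

# Outer loop: datasets (equivalent to the INFORMATION_SCHEMA.SCHEMATA lookup).
for dataset in client.list_datasets():
    # Inner loop: tables in each dataset (equivalent to the nested lookup).
    for table in client.list_tables(dataset.dataset_id):
        if table.table_type == "TABLE":  # base tables only
            print("{}.{}".format(dataset.dataset_id, table.table_id))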

Copy data from Synapse instance to another instance

Is there a way to copy data from all tables in a Synapse instance to another?
I considered the Data Migration Assistant, but it doesn't allow selecting Synapse as the source.
I am also considering using a Copy activity in a pipeline, but the number of tables is not small.
Copy Data Tool can be used to copy data from Synapse to Synapse.
You can create a Synapse or ADF pipeline to achieve the same.
Step 1: Use a Lookup activity or Script activity with a dataset pointing to a Synapse SQL database table. Select the Query option there and write a query that lists all table names (see the sketch after these steps).
Step 2: Use a ForEach activity to iterate over each table name.
Step 3: Inside the ForEach activity, use a Copy activity. In the Copy activity you can use parameterized datasets that dynamically point to the tables in Synapse SQL.
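Outside the pipeline UI, the same three steps can be sketched in Python with pyodbc. This is only a rough sketch; the connection string is a placeholder, and in ADF the Copy activity would perform the actual data movement:
import pyodbc

# Placeholder connection string for the source Synapse SQL pool.
SRC_CONN = "Driver={ODBC Driver 18 for SQL Server};Server=source-synapse.sql.azuresynapse.net;Database=SourceDB;Uid=user;Pwd=password"

src = pyodbc.connect(SRC_CONN)
cursor = src.cursor()

# Step 1: the Lookup query that lists all base tables.
cursor.execute(
    "SELECT TABLE_SCHEMA, TABLE_NAME "
    "FROM INFORMATION_SCHEMA.TABLES "
    "WHERE TABLE_TYPE = 'BASE TABLE'"
)

# Steps 2-3: iterate over each table name; in ADF a Copy activity with a
# parameterized dataset would move each table to the target Synapse database.
for schema, name in cursor.fetchall():
    print("copy [{}].[{}] to the target Synapse SQL pool".format(schema, name))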

See managed tables in Databricks AWS

I need to identify and list all managed tables in a Databricks AWS workspace. I can see that manually in the table details, but I need to do this for several thousand tables on different databases, and I cannot find a way to automate it.
The only way I found to tell programmatically whether a table is managed or external is the DESCRIBE TABLE EXTENDED command, but that returns the information as a value in a column and cannot be used with SELECT or WHERE to filter, even if I try running it as a subquery.
What is the easiest way to filter the managed tables?
spark.sql('use my_database')
df = spark.sql('show tables in my_database')
managed = []
for t in df.collect():
    print('table {}'.format(t.tableName))
    type_row = spark.sql('describe table extended {}'.format(t.tableName)) \
        .where("col_name = 'Type' and data_type = 'MANAGED'")
    if type_row.count() > 0:  # the Type row says MANAGED, so keep this table
        managed.append(t.tableName)
print(managed)
# loop over all databases using "show databases" in an outer loop (see below)
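To extend this to every database in the workspace (as the last comment suggests), a minimal sketch along the same lines, reading the database name from the first column of show databases to stay version-agnostic:
managed_tables = []
for db_row in spark.sql('show databases').collect():
    db = db_row[0]  # first column holds the database name
    for t in spark.sql('show tables in {}'.format(db)).collect():
        is_managed = spark.sql(
            'describe table extended {}.{}'.format(db, t.tableName)
        ).where("col_name = 'Type' and data_type = 'MANAGED'").count() > 0
        if is_managed:
            managed_tables.append('{}.{}'.format(db, t.tableName))
print(managed_tables)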

Checking of replicated data Pentaho

I have about 100 tables to which we replicate data, e.g. from the Oracle database.
I would like to quickly check that the data replicated to the tables in db2 is the same as in the source system.
Does anyone have a way to do this? I can create 100 transformations, but that's monotonous and time consuming. I would prefer to process this in a loop.
I thought I would keep the queries in a table and reach into it for records.
I read the data with a Table input step (sql_db2, sql_source, table_name) and write it to a Copy rows to result step. Next I read a single record and put it into a loop.
But here a problem arises: I don't know how to dynamically compare the data for the tables, since each table has different columns.
Is this even possible?
You can inject metadata (in this case your metadata would be the column and table names) into a lot of steps in Pentaho. You create one transformation that collects the metadata and injects it into another transformation that contains only the steps and some basic information; the bulk of the information about the columns affected by the different steps lives in the transformation that injects the metadata.
Check Pentaho's official documentation about Metadata Injection (MDI) and the basic metadata injection sample available in your PDI installation.
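Outside Pentaho, the same driver-table-plus-loop idea can be sketched in Python. This is only an illustration of the looping structure, not a replacement for the metadata injection approach; the ODBC DSNs, the replication_check_list driver table, and the row-count-only comparison are all assumptions:
import pyodbc  # assumes ODBC drivers and DSNs exist for Oracle and DB2

oracle = pyodbc.connect("DSN=oracle_source")  # hypothetical DSN
db2 = pyodbc.connect("DSN=db2_target")        # hypothetical DSN

# Driver table holding the table names to check (the "queries in a table" idea).
tables = [row[0] for row in oracle.cursor().execute(
    "SELECT table_name FROM replication_check_list").fetchall()]

for name in tables:
    src_count = oracle.cursor().execute("SELECT COUNT(*) FROM {}".format(name)).fetchone()[0]
    dst_count = db2.cursor().execute("SELECT COUNT(*) FROM {}".format(name)).fetchone()[0]
    status = "OK" if src_count == dst_count else "MISMATCH"
    print("{}: source={} target={} {}".format(name, src_count, dst_count, status))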

ADF - How should I copy table data from source Azure SQL Database to 6 other Azure SQL Databases?

We curate data in the "Dev" Azure SQL Database and then currently use RedGate's Data Compare tool to push it up to 6 higher Azure SQL Databases. I am trying to migrate that manual process to ADFv2 and would like to avoid copy/pasting the 10+ Copy Data activities for each database (x6), to keep it more maintainable for future changes. The static tables have some customization in the Copy Data activity, but the basic idea follows this post to perform an upsert.
How can the implementation described above be done in Azure Data Factory?
I was imagining something like the following:
Using one parameterized linked service with a configurable server name & database name to generate a dynamic connection to Azure SQL Database.
Creating a pipeline for each table's Copy Data activity.
Creating a master pipeline to then nest each table's pipeline in.
Using variables to loop over the different connections and passing those to the sub-pipelines' parameters.
Not sure if that is the most efficient plan or even works yet. Other ideas/suggestions?
We cannot tell you whether that's the most efficient plan, but I think it is. Just make it work.
As you said in the comment:
we can use Dynamic Pipelines - Copy multiple tables in Bulk with 'Lookup' & 'ForEach'. We can perform dynamic copies of your data table lists in bulk within a single pipeline. Lookup returns either the lists of data or the first row of data. ForEach - @activity('Azure SQL Table lists').output.value ; @concat(item().TABLE_SCHEMA,'.',item().TABLE_NAME,'.csv'). This is efficient and cost optimized since we are using a smaller number of activities and datasets.
Usually, we would also choose the same solution as you: dynamic parameters/pipelines and Lookup + ForEach activities to achieve the scenario. In short, make the pipeline's logic strong, simple, and efficient.
Added the same info mentioned in the Comment as Answer.
Yup, we can use Dynamic Pipelines - Copy multiple tables in Bulk with 'Lookup' & 'ForEach'.
We can perform dynamic copies of your data table lists in bulk within a single pipeline. Lookup returns either the lists of data or first row of data.
ForEach - @activity('Azure SQL Table lists').output.value ;
@concat(item().TABLE_SCHEMA,'.',item().TABLE_NAME,'.csv')
This is efficient and cost optimized since we are using a smaller number of activities and datasets.
(A reference screenshot was attached in the original answer.)
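As a rough sketch of the parameterization in Python (the server names and table list below are placeholders; in ADF the master pipeline's variables and the Lookup output supply these values, and the parameterized Copy activity performs the upsert):
# Hypothetical parameter values; in ADF these come from the master pipeline's
# variables and the Lookup activity's output against the Dev database.
targets = [
    {"server": "sql-env1.database.windows.net", "database": "AppDb"},
    {"server": "sql-env2.database.windows.net", "database": "AppDb"},
    # ... the remaining higher environments
]
tables = [("dbo", "Customers"), ("dbo", "Products")]

for target in targets:            # master pipeline: loop over connections
    for schema, table in tables:  # sub-pipeline: loop over tables
        # In ADF this is where the parameterized Copy activity (upsert) runs;
        # the print only stands in for that call.
        print("upsert [{}].[{}] into {}/{}".format(schema, table, target["server"], target["database"]))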

U-SQL job to query multiple tables with dynamic names

Our challenge is the following:
In an Azure SQL database, we have multiple tables with names of the form table_num, where num is just an integer. These tables are created dynamically, so the number of tables can vary (from table_1, table_2 up to table_N). All tables have the same columns.
As part of a U-SQL script file, we would like to execute the same query on all of these tables and generate an output csv file with the combined results of all these queries.
We tried several things:
U-SQL does not allow looping, so we were thinking of creating a view in our Azure SQL database that would combine all the tables using a cursor of some sort. Then the U-SQL file would query this view (using an external source). However, a view in an Azure SQL database can only be created via a function, and a function cannot execute dynamic SQL or even call a stored procedure.
We did not find a way to call a stored procedure of the external data source directly from U-SQL.
We don't want to update our U-SQL job each time a new table is added.
Is there a way to do that in U-SQL through a custom extractor for instance? Any other ideas?
One solution I can think of is to use Azure Data Factory (v2) to assist in this.
You could create a pipeline with the following activities:
Lookup activity configured to execute the stored procedure
For Each activity that uses the output of the lookup activity as a source
As a child item, use a U-SQL activity that executes your U-SQL script, which writes the output of a single table (the item of the ForEach activity) to blob storage or Data Lake.
Add a Copy activity that merges the blobs from step 2.1 into one final blob.
If you have little or no experience working with ADF v2, do mind that it takes some time to get to know it, but once you do, you won't regret it. Having a GUI to create the pipeline is a nice bonus.
Edit: as @wBob mentions, another (far easier) solution is to create a single table with all rows, since all dynamically generated tables have the same schema. You can create a stored procedure to populate this table, for example.
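To illustrate that edit (combining the dynamically named tables into one), a minimal Python sketch, assuming a pyodbc connection to the Azure SQL database and a pre-created combined_table with the shared schema; the DSN and the combined_table name are hypothetical:
import pyodbc

conn = pyodbc.connect("DSN=azure_sql_db")  # hypothetical DSN
cur = conn.cursor()

# Find all dynamically created tables named table_<num>.
cur.execute(
    "SELECT TABLE_NAME FROM INFORMATION_SCHEMA.TABLES "
    "WHERE TABLE_NAME LIKE 'table[_]%' AND TABLE_TYPE = 'BASE TABLE'"
)
names = [row[0] for row in cur.fetchall()]

# Build one UNION ALL over every table_N (they all share the same columns)
# and load the result into a single table that U-SQL can then read through
# one external source.
union_sql = " UNION ALL ".join("SELECT * FROM [{}]".format(n) for n in names)
cur.execute("INSERT INTO combined_table " + union_sql)  # combined_table is hypothetical
conn.commit()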