Creating a dynamic frame using a connector from the Data Catalog and a custom SQL query string is not working - sql

I am trying to retrieve the result of a custom SQL query from an Oracle database source, using an existing connector from the Data Catalog, in an AWS Glue job script. I found this in the AWS documentation:
DataSource = glueContext.create_dynamic_frame.from_options(
    connection_type="custom.jdbc",
    connection_options={
        "query": "SELECT id, name, department FROM department WHERE id < 200",
        "connectionName": "test-connection-jdbc"
    },
    transformation_ctx="DataSource0")
But it's not working and I don't know why.
PS: the connector is well configured and tested.
The error raised is: getDynamicFrame. empty.reduceLeft
The expected behavior is that the query is executed and its result loaded to the target.
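For reference, here is a minimal sketch of a possible workaround that bypasses the custom connector and reads the query through Spark's native JDBC reader, then converts the result back to a DynamicFrame. The JDBC URL, credentials, and driver class below are placeholders for illustration only; adapt them to your Oracle setup (ideally pulling them from the Glue connection or Secrets Manager).
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Hypothetical connection details -- replace with your own, or read them
# from the Glue connection / AWS Secrets Manager instead of hard-coding.
jdbc_url = "jdbc:oracle:thin:@//my-oracle-host:1521/MYSERVICE"

# Spark's JDBC reader accepts a full query via the "query" option
# (or, equivalently, a parenthesized subquery via "dbtable").
df = (spark.read.format("jdbc")
      .option("url", jdbc_url)
      .option("query", "SELECT id, name, department FROM department WHERE id < 200")
      .option("user", "my_user")
      .option("password", "my_password")
      .option("driver", "oracle.jdbc.driver.OracleDriver")
      .load())

# Convert back to a DynamicFrame so the rest of the Glue job stays unchanged.
DataSource0 = DynamicFrame.fromDF(df, glue_context, "DataSource0")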

Related

PowerBI API Clone Report and Dataset changing the datasource

I am developing an application in .NET 6 using the Power BI client for managing workspaces, reports, datasets, etc.
The idea is that the application will be able to create client workspaces that inherit reports and datasets from a main workspace. In the main workspace there will be reports published from Power BI Desktop, and therefore the respective datasets will also be there.
At the moment of the clone, the datasource database, user, and password should be changed accordingly to match the customer workspace context. Using the following code I can list the reports in the main workspace (workspace_from_id) and create them in the customer workspace (workspace_towa_id):
var reports_from = pbiClient.Reports.GetReports(workspace_from_id);
foreach (Report report_from in reports_from.Value)
{
    Guid report_from_id = report_from.Id;
    CloneReportRequest cloneReportRequest = new();
    cloneReportRequest.TargetWorkspaceId = workspace_towa_id;
    cloneReportRequest.TargetModelId = dataset_towa.Id;
    cloneReportRequest.Name = report_from.Name;
    Report report_towa = pbiClient.Reports.CloneReport(workspace_from_id, report_from_id, cloneReportRequest);
}
The problem with the above code is that the dataset is not cloned; the source dataset is used as a shared dataset for both workspaces. I already tried to copy the dataset details and create a new one with a different database using the following code:
CreateDatasetRequest createDatasetRequest = new();
createDatasetRequest.Name = dataset_from.Name;
createDatasetRequest.Datasources = new List<Datasource>();
createDatasetRequest.Tables = new List<Table>();
Datasources datasources_from = pbiClient.Datasets.GetDatasources(workspace_from_id, dataset_from_id);
foreach (Datasource datasource_from in datasources_from.Value)
{
    // FOR EACH DATASOURCE IN DATASET
    Datasource datasource_towa = new();
    datasource_towa.Name = datasource_from.Name;
    datasource_towa.DatasourceType = datasource_from.DatasourceType;
    // CHANGE DATASOURCE CONNECTION DETAILS
    DatasourceConnectionDetails datasourceConnectionDetails = datasource_from.ConnectionDetails;
    datasourceConnectionDetails.Database = $"{Variables.reporting_db}_{group_towa.Name.ToLower()}";
    datasource_towa.ConnectionDetails = datasourceConnectionDetails;
    datasource_towa.ConnectionString = datasource_from.ConnectionString;
    datasource_towa.GatewayId = datasource_from.GatewayId;
    // ADD DATASOURCE INTO DATASET
    createDatasetRequest.Datasources.Add(datasource_towa);
}
Tables tables_from = pbiClient.Datasets.GetTables(workspace_from_id, dataset_from_id); // WORKS FOR PUSH DATASETS
foreach (Table table_from in tables_from.Value)
{
    // FOR EACH TABLE IN DATASET
    Table table_towa = new();
    table_towa.Name = table_from.Name;
    table_towa.Source = table_from.Source;
    table_towa.Columns = table_from.Columns;
    table_towa.Rows = table_from.Rows;
    table_towa.Description = table_from.Description;
    // ADD TABLE INTO DATASET
    createDatasetRequest.Tables.Add(table_towa);
}
The problem with the above code is that the pbiClient.Datasets.GetTables function does not work for normal datasets; it is only meant for push datasets. So, without being able to get the tables, the following code fails:
var dataset_towa = pbiClient.Datasets.PostDataset(workspace_towa_id, createDatasetRequest);
I finally discovered that the pbiClient.Datasets.PostDataset method is also only used to post push datasets, as described here: https://learn.microsoft.com/en-us/rest/api/power-bi/push-datasets/datasets-post-dataset
=======UPDATE 13/01/2023=======
I have already tried a few other ways to clone the report and dataset, such as creating a datasource, but for that we need a data gateway. In that case, even when the data is already in a cloud service like Azure Database for PostgreSQL, we do need a gateway. On the other side, I tried to create a virtual gateway in order to create the datasource on that gateway, but virtual gateways are not supported by the Power BI API and are only available on Premium capacities.
So it seems that I cannot clone a report together with its dataset and change the datasource.
Any ideas?
After many hours of research, I managed to download the reports from the main workspace and upload them into the customer workspaces, changing the datasource details along the way.
Steps to perform (a rough sketch of the export/import flow follows after this list):
1. Export the report with the Power BI API /Export endpoint as a stream into memory.
2. Import the report with the Power BI API /Imports endpoint as a stream from memory (needs special care to be sure that you read the whole stream from the HTTP content; using the SDK did not work for me).
3. Update the report ConnectionDetails with the Power BI client SDK via pbiClient.Datasets.UpdateDatasourcesInGroup (needs special care to change only the server and database attributes without assigning a new instance of the object).
4. Update the datasource of the dataset with the Power BI client SDK via pbiClient.Gateways.UpdateDatasource (needs special care to supply the credentials as JSON).
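To make the flow easier to follow end to end, here is a minimal sketch of steps 1-3 against the raw Power BI REST endpoints, written in Python with the requests library rather than the .NET SDK used above. The workspace, report, and dataset IDs, the token acquisition, and the old/new server and database names are placeholders, and the payload shapes are assumptions to verify against the official REST documentation.
import requests

API = "https://api.powerbi.com/v1.0/myorg"
headers = {"Authorization": "Bearer <access_token>"}  # token acquisition not shown

workspace_from_id = "<main-workspace-guid>"
workspace_towa_id = "<customer-workspace-guid>"
report_from_id = "<report-guid>"

# 1. Export the report from the main workspace as a .pbix stream in memory.
export = requests.get(
    f"{API}/groups/{workspace_from_id}/reports/{report_from_id}/Export",
    headers=headers)
export.raise_for_status()
pbix_bytes = export.content  # make sure the whole stream is read

# 2. Import the .pbix into the customer workspace.
import_resp = requests.post(
    f"{API}/groups/{workspace_towa_id}/imports",
    params={"datasetDisplayName": "customer_report.pbix", "nameConflict": "CreateOrOverwrite"},
    headers=headers,
    files={"file": ("customer_report.pbix", pbix_bytes)})
import_resp.raise_for_status()

# 3. Point the imported dataset at the customer database
#    (Datasets - Update Datasources In Group).
dataset_towa_id = "<imported-dataset-guid>"  # read it from the import result / dataset listing
update_body = {
    "updateDetails": [{
        "datasourceSelector": {
            "datasourceType": "Sql",
            "connectionDetails": {"server": "old-server", "database": "old-db"},
        },
        "connectionDetails": {"server": "new-server", "database": "customer-db"},
    }]
}
resp = requests.post(
    f"{API}/groups/{workspace_towa_id}/datasets/{dataset_towa_id}/Default.UpdateDatasources",
    headers=headers, json=update_body)
resp.raise_for_status()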

BQ client.load_table_from_uri behaves differently in standalone Python vs. within Airflow

I am trying to load a CSV into BQ using a custom operator in Airflow.
My custom operator is using
load_job_config = bigquery.LoadJobConfig(
    schema=self.schema_fields,
    skip_leading_rows=self.skip_leading_rows,
    source_format=bigquery.SourceFormat.CSV
)
load_job = client.load_table_from_uri(
    'gs://' + self.source_bucket + "/" + self.source_object,
    self.dsp_tmp_dataset_table,
    job_config=load_job_config
)
The issue I am facing is that I always get this error:
google.api_core.exceptions.BadRequest: 400 Provided Schema does not match Table nonprod-cloud-composer:dsp_data_transformation.tremorvideo_daily_datafeed. Field Date has changed type from TIMESTAMP to DATE
The exact same code works fine when run outside of Airflow as a standalone Python script.
I am using exactly the same schema object and the same source CSV file; only the environment is different.
Below are the high-level steps I followed:
1. Created the table in BQ.
2. Loaded data once using:
LOAD DATA OVERWRITE XXXX
FROM FILES (
  format = 'CSV',
  uris = ['gs://xxx.csv']);
This worked fine and the data was loaded into the table.
3. Truncated the table and tried to run the custom operator that has the code listed above. Then faced the errors.
4. Created a simple Python program to test the BQ load job, and that works fine too.
It's just that whenever the same load job is triggered through Airflow, the schema detection fails and leads to all sorts of errors.
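One thing worth checking, offered as a sketch rather than a confirmed diagnosis: when the operator is instantiated in a DAG, self.schema_fields often arrives as a list of plain dicts (for example from a templated or rendered field), and passing anything other than SchemaField objects can let the load fall back to inferred types. Building SchemaField objects explicitly and disabling autodetect pins the column types; the "impressions" field below is hypothetical, only "Date" comes from the error above.
from google.cloud import bigquery

# Hypothetical schema as it might be declared in the DAG / operator arguments.
raw_schema = [
    {"name": "Date", "type": "TIMESTAMP", "mode": "NULLABLE"},
    {"name": "impressions", "type": "INTEGER", "mode": "NULLABLE"},
]

# Convert plain dicts into SchemaField objects so nothing is inferred.
schema_fields = [
    bigquery.SchemaField(f["name"], f["type"], mode=f.get("mode", "NULLABLE"))
    for f in raw_schema
]

load_job_config = bigquery.LoadJobConfig(
    schema=schema_fields,
    skip_leading_rows=1,
    source_format=bigquery.SourceFormat.CSV,
    autodetect=False,  # never let BQ re-infer the Date column's type
)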

Hudi errors with 'DELETE is only supported with v2 tables.'

I'm trying out Hudi, Delta Lake, and Iceberg in the AWS Glue v3 engine (Spark 3.1) and have both Delta Lake and Iceberg running just fine end to end using a test pipeline I built with test data. Note I am not using any of the Glue custom connectors. I'm using PySpark and standard Spark code (not the Glue classes that wrap the standard Spark classes).
For Hudi, the install of the Hudi jar is working fine, as I'm able to write the table in the Hudi format, create the table DDL in the Glue Catalog just fine, and read it via Athena. However, when I try to run a CRUD statement on the newly created table, I get errors. For example, trying to run a simple DELETE Spark SQL statement, I get the error: 'DELETE is only supported with v2 tables.'
I've added the following jars when building the SparkSession:
org.apache.hudi:hudi-spark3.1-bundle_2.12:0.11.0
com.amazonaws:aws-java-sdk:1.10.34
org.apache.hadoop:hadoop-aws:2.7.3
And I set the following config for the SparkSession:
self.config('spark.serializer', 'org.apache.spark.serializer.KryoSerializer')
I've tried many different versions of writing the data/creating the table including:
hudi_options = {
    'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
    'hoodie.table.version': 2,
    'hoodie.table.name': 'db.table_name',
    'hoodie.datasource.write.recordkey.field': 'id',  # key is required in the table
    'hoodie.datasource.write.partitionpath.field': '',
    'hoodie.datasource.write.table.name': 'db.table_name',
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.precombine.field': 'date_modified',
    'hoodie.upsert.shuffle.parallelism': 2,
    'hoodie.insert.shuffle.parallelism': 2
}
df.write \
.format('hudi') \
.options(**hudi_options) \
.mode('overwrite') \
.save('s3://...')
sql = f"""CREATE TABLE {FULL_TABLE_NAME}
USING {DATA_FORMAT}
options (
type = 'cow',
primaryKey = 'id',
preCombineField = 'date_modified',
partitionPathField = '',
hoodie.table.name = 'db.table_name',
hoodie.datasource.write.recordkey.field = 'id',
hoodie.datasource.write.precombine.field = 'date_modified',
hoodie.datasource.write.partitionpath.field = '',
hoodie.table.version = 2
)
LOCATION '{WRITE_LOC}'
AS SELECT * FROM {SOURCE_VIEW};"""
spark.sql(sql)
The above works fine. It's when I try to run a CRUD operation on the table created above that I get errors. For instance, I try deleting records via a Spark SQL DELETE statement and get the error 'DELETE is only supported with v2 tables.' I can't figure out why it's complaining about the table not being a v2 table. Any clues would be hugely appreciated.
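In case it helps anyone hitting the same wall: this error typically means Spark is routing the DELETE through its generic v1 SQL path because the Hudi Spark SQL extensions are not registered on the session. Below is a minimal sketch of how the SparkSession might be built with the extension enabled, alongside the Kryo serializer already mentioned above; the config key and class name are the ones documented for Hudi 0.11 on Spark 3.1, but verify them against your Hudi version.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hudi-sql-test")
    # Kryo serializer, as already set in the question.
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    # Register Hudi's SQL extensions so DELETE/UPDATE/MERGE INTO are planned
    # by Hudi instead of falling through to Spark's default (v1) table path.
    .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
    .getOrCreate()
)

# With the extension registered, the DELETE should be handled by Hudi.
spark.sql("DELETE FROM db.table_name WHERE id = 1")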

How to show a table from a SQL Server database by using SqlKata?

I am trying to show a table from a database in my SQL Server 2017 by using SqlKata.
I have done some research. Based on one of the articles, I need to write this command: var books = db.Query("Books").Get();
My question here is: where do I put this command in a C# .NETCoreApp 1.1 target framework project? And how do I run it to display the result?
If you have a class to map the results to, use:
var books = db.Query("Books").Get<YourClass>();
But if you don't have such a class, use:
var books = db.Query("Books").Get<dynamic>();
If you want to log the executed query, write this code in Startup.cs:
var db = new QueryFactory(connection, new SqlServerCompiler());
// Log the compiled query to the console
db.Logger = compiled => {
    Console.WriteLine(compiled.ToString()); // NLog - GrayLog - API - DB - text file - more...
};
See https://sqlkata.com/docs/execution/logging for more.

Need information about JPA-based transactions for dynamically created SQL databases

Firstly, I would like to state our environment details.
We are trying to use EJB/Hibernate with SQL Azure to create apps on the Azure cloud using Eclipse.
We need to create and transact on databases dynamically. We are able to create the databases dynamically. However, when trying to transact on them we get this error:
"java.sql.SQLException: No suitable driver found for connection url"
Transacting statically using JPA was not a problem; however, dynamic transactions cannot be done. The EntityManager object is created but is not able to connect to the database.
Could someone help us and explain how we can handle transactions using JPA for dynamically created databases?
Thanks,
Saugata
[edit] We are using the following persistence.xml:
<provider>org.hibernate.ejb.HibernatePersistence</provider>
<!-- <jta-data-source>java:jboss/EDS</jta-data-source> -->
<class>net.oauth.database.Co</class>
<class>net.oauth.database.Cr</class>
<properties>
    <property name="hibernate.transaction.factory_class" value="org.hibernate.transaction.JTATransactionFactory" />
    <property name="hibernate.transaction.manager_lookup_class" value="org.hibernate.transaction.JBossTransactionManagerLookup" />
</properties>
Our code to connect to the db is as follows:
Map configOverrides = new HashMap();
configOverrides.put("hibernate.connection.password", "");
configOverrides.put("hibernate.connection.username", "");
configOverrides.put("hibernate.connection.driver_class","com.microsoft.sqlserver.jdbc.SQLServerDriver");
configOverrides.put("hibernate.connection.url", "jdbc:sqlsever://;" + "databaseName=;user=;password=");
EntityManagerFactory factory = Persistence.createEntityManagerFactory(ENTERPRISE_UNIT_NAME, configOverrides);
Please note that we are trying to create and connect to the database dynamically, and hence do not have the database created statically.
For this we are getting the error:
"java.sql.SQLException: No suitable driver found for connection url"
Create a persistence.xml with a persistence unit and put everything there that is static (e.g. database dialect, logging parameters, etc.).
Then use the following method to create the entity manager:
javax.persistence.Persistence.createEntityManagerFactory(String persistenceUnitName, Map properties);
Supply the variable parameters in the map, like this:
properties.put("hibernate.connection.url", "jdbc:postgresql://127.0.0.1/test");
properties.put("hibernate.connection.username", "joe");
properties.put("hibernate.connection.password", "pass");