Compare data in a table in Databricks and MS SQL Server - sql

I need to compare a table in Databricks with the same table in SQL Server and load only the missing records into Databricks. Can someone help me with how to connect to SQL Server from Databricks, and how and where to write the query that loads the missing data?
Thanks!

You can connect to SQL Server through the standard JDBC interface supported by Spark; Databricks runtimes should include the MS SQL drivers out of the box. Once the data is read, you can do an anti join between the SQL Server data and your data in Databricks. Something like this (in Python):
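# hostname, port, database, user and password are placeholders for your own settings
jdbc_url = f"jdbc:sqlserver://{hostname}:{port};database={database}"
connection_properties = {"user": user, "password": password}
# pass the properties by keyword: the third positional argument of jdbc() is a partition column
sql_data = spark.read.jdbc(jdbc_url, "your_table", properties=connection_properties)
your_data = spark.read.format("delta").load("path")
# get the rows that exist in SQL Server but are missing from the Delta table
diff = sql_data.join(your_data, <join-condition>, "leftanti")
# work with diff, e.g. append it to the Delta table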
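To make the "leftanti" semantics concrete without a cluster, here is a minimal plain-Python sketch of the same idea. The sample rows and the `id` key column are hypothetical; substitute whatever columns identify a record in your table.

```python
def left_anti_join(left_rows, right_rows, key_columns):
    """Return the rows of left_rows whose key does not appear in right_rows."""
    # Collect the keys that already exist on the right side.
    right_keys = {tuple(row[c] for c in key_columns) for row in right_rows}
    # Keep only the left rows whose key is absent from that set.
    return [
        row for row in left_rows
        if tuple(row[c] for c in key_columns) not in right_keys
    ]


# Hypothetical sample data; "id" is an assumed key column.
sql_server_rows = [
    {"id": 1, "name": "alpha"},
    {"id": 2, "name": "beta"},
    {"id": 3, "name": "gamma"},
]
databricks_rows = [
    {"id": 1, "name": "alpha"},
    {"id": 3, "name": "gamma"},
]

missing = left_anti_join(sql_server_rows, databricks_rows, ["id"])
print(missing)  # [{'id': 2, 'name': 'beta'}]
```

In Spark the equivalent, assuming the same `id` key, would be `sql_data.join(your_data, ["id"], "leftanti")`, and the result could then be appended with `diff.write.format("delta").mode("append").save("path")`. One difference from this sketch: Spark never matches NULL keys, so rows with a NULL key always end up in the anti join result.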
When reading from SQL Server, follow the guidance on optimizing JDBC read performance; the right settings depend on your actual schema.
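One common read optimization is a partitioned JDBC read, where Spark splits the table into ranges of a numeric column and reads them in parallel. The sketch below only builds the option map; the helper name, the `id` partition column, and the bounds are assumptions for illustration, not values from the question.

```python
def build_jdbc_read_options(jdbc_url, table, user, password,
                            partition_column, lower_bound, upper_bound,
                            num_partitions):
    """Build the option map for a partitioned Spark JDBC read."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if lower_bound >= upper_bound:
        raise ValueError("lower_bound must be smaller than upper_bound")
    return {
        "url": jdbc_url,
        "dbtable": table,
        "user": user,
        "password": password,
        # Spark expects these four options to be set together.
        "partitionColumn": partition_column,
        "lowerBound": str(lower_bound),
        "upperBound": str(upper_bound),
        "numPartitions": str(num_partitions),
    }


# Hypothetical values: adjust to your own server, table and key range.
options = build_jdbc_read_options(
    "jdbc:sqlserver://myhost:1433;database=mydb", "your_table",
    "etl_user", "secret",
    partition_column="id", lower_bound=1, upper_bound=1_000_000,
    num_partitions=8,
)
print(options["numPartitions"])

# On a cluster the map would be used like this:
# sql_data = spark.read.format("jdbc").options(**options).load()
```

The bounds only decide how the key range is sliced into partitions; they do not filter rows, so values outside the range are still read by the first or last partition.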

Related

How to copy one table data to another one between two Azure databases on the same server

I want to copy data from one database table into another database table on the same server in Azure SQL. I have done all of the 'Azure SQL Cross Database Query' steps described at https://www.mssqltips.com/sqlservertip/6445/azure-sql-cross-database-query/ but still get the same error whenever I execute a query:
'Reference to database and/or server name in 'db name' is not supported in this version of SQL Server.'
Can you please help me figure this out?
Azure SQL Database doesn't support cross-database queries directly.
The USE statement is not supported for switching databases, and neither are statements like select * from [other_database].[schema].[table]. That's why you get the error.
In Azure SQL Database, only elastic query (preview) can run cross-database queries:
The elastic query feature (in preview) enables you to run a
Transact-SQL query that spans multiple databases in Azure SQL
Database. It allows you to perform cross-database queries to access
remote tables, and to connect Microsoft and third-party tools (Excel,
Power BI, Tableau, etc.) to query across data tiers with multiple
databases.
You could follow the tutorial below; the setup may be more complex than on an on-premises SQL Server:
Get started with cross-database queries (vertical partitioning) (preview)
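As a rough outline of what that tutorial sets up, the Python sketch below renders the four T-SQL statements typically involved: a master key, a database scoped credential, an external data source of type RDBMS, and an external table. Every name passed in (server, database, credential, data source, table, columns) is a hypothetical placeholder, and the `<...>` secrets are deliberately left unfilled.

```python
def elastic_query_setup_sql(remote_server, remote_database, credential_name,
                            data_source_name, table_schema, table_name,
                            column_definitions):
    """Render T-SQL that exposes one remote table through elastic query."""
    columns = ",\n    ".join(column_definitions)
    return [
        # A master key must exist before a scoped credential can be created.
        "CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<master_key_password>';",
        # Login used to reach the remote database.
        f"CREATE DATABASE SCOPED CREDENTIAL {credential_name}\n"
        "WITH IDENTITY = '<remote_login>', SECRET = '<remote_password>';",
        # Points at the other database on the logical server.
        f"CREATE EXTERNAL DATA SOURCE {data_source_name} WITH (\n"
        "    TYPE = RDBMS,\n"
        f"    LOCATION = '{remote_server}',\n"
        f"    DATABASE_NAME = '{remote_database}',\n"
        f"    CREDENTIAL = {credential_name}\n"
        ");",
        # Local definition that mirrors the remote table.
        f"CREATE EXTERNAL TABLE [{table_schema}].[{table_name}] (\n"
        f"    {columns}\n"
        f") WITH (DATA_SOURCE = {data_source_name});",
    ]


# Hypothetical names for illustration only.
statements = elastic_query_setup_sql(
    "myserver.database.windows.net", "SourceDb", "RemoteCred",
    "RemoteSource", "dbo", "Customers",
    ["CustomerId INT NOT NULL", "Name NVARCHAR(100)"],
)
print("\n\n".join(statements))
```

Once the external table exists in the target database, the copy itself should be an ordinary INSERT ... SELECT from that external table. The helper interpolates identifiers directly into the script, so it is only suitable for trusted, hand-written input.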

How to set up SSIS to extract data from Postgres Database

I have a PostgreSQL database in the AWS cloud. I would like to use SSIS to extract tables and move them over to a local SQL Server.
Has anyone attempted to do this? Is it possible?
Ultimately, I would like to move tables from PostgreSQL to SQL Server without having to purchase a tool.
As per the documentation, you would need to follow these steps to connect SSIS to a Postgres database:
get the PostgreSQL ODBC driver, either with Stack Builder or using ODBC
connect to PostgreSQL with the PostgreSQL ODBC driver (psqlODBC), using the proper connection string, typically Driver={PostgreSQL ODBC Driver(UNICODE)};Server=<server>;Port=<port>;Database=<database>;UID=<user id>;PWD=<password>
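The connection string from the second step can be assembled programmatically, which helps avoid typos in the separators. This is a minimal sketch: the helper name and the sample host, database, and credentials are hypothetical, and the driver name is the one quoted above, which may differ on your machine.

```python
def build_postgres_odbc_connection_string(
        server, port, database, user, password,
        driver="PostgreSQL ODBC Driver(UNICODE)"):
    """Assemble a psqlODBC connection string from its parts."""
    parts = [
        ("Driver", "{" + driver + "}"),
        ("Server", server),
        ("Port", str(port)),
        ("Database", database),
        ("UID", user),
        ("PWD", password),
    ]
    # Note: values containing ';' would need extra escaping, which this sketch omits.
    return ";".join(f"{key}={value}" for key, value in parts)


# Hypothetical connection details.
connection_string = build_postgres_odbc_connection_string(
    "mydb.example.com", 5432, "sales", "etl_user", "secret"
)
print(connection_string)
```

The resulting string can be pasted into the ODBC connection manager in SSIS.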
You can use the Postgres OLE DB Provider to connect to Postgres using OLE DB Source. The following link contains a step by step guide to import data from Postgres into SQL Server:
Export data from Postgres to SQL Server using SSIS

How to use SQL query on CSV file in POWER BI Desktop

We can use the SQL Server connector in Power BI to run a SQL query against a database. But if I only have a CSV file, how can I query it with SQL? We can import the CSV file and (I suppose) query it with Power Query (it's the DAX language, I think), but I don't want to use DAX; I want to use a SQL query, as when requesting data from SQL Server.
Example of query with SQL Server: = Sql.Database("LAPTOP-P3DH07C9", "Projet Big Data", [Query="SELECT * FROM MatchComplete"])
Is it possible, or must I use DAX?

Optimal way to Load Data from SQL Server to DB2

We have 40+ tables in a SQL Server database and we need to copy the data to an IBM DB2 database. What methods do you recommend to accomplish this?
My analysis:
BCP and data import - the team is trying to avoid any BCP files
Write a stored procedure that uses a linked server in SQL Server to insert the data into DB2
SSIS packages to move the data
Please let us know if you have any better way to approach this issue.
Have you considered Information Integration, known in DB2 as federation? You can run a SELECT against SQL Server directly from DB2; with this feature you can define a cursor and then just use the LOAD command.

Oracle and SQL Dataset

Problem: I have to pull data from a SQL Server database and an Oracle database and put it together in one dataset. The problem I have is: The SQL Server query requires an ID that is only found from the return of the Oracle query.
What I am wondering is: How can I approach this problem in a way such that performance is not harmed?
You can do this with linked servers or by transferring all the data to one side. Which is better depends on the volume of data on each side.
A general rule of thumb is to execute the query on the side which has the most data.
For instance, if the set of Oracle IDs is small, but the SQL Server set is large, you make a linked server to the Oracle side and execute this on the SQL Server side:
SELECT *
FROM sqlservertable
INNER JOIN linkedserver.oracletable
ON whatever
With this setup, if the Oracle side is large (or cannot be prefiltered before it is joined with the SQL Server side), performance will normally be pretty poor. It will improve a lot if you pull the entire table (or the minimal subset you can determine) into a SQL Server table and do the JOIN entirely on the SQL Server side.
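When the set of Oracle IDs is small, another way to move the data to one side is to fetch the IDs in application code and pass them to SQL Server as query parameters, in batches. The sketch below only builds the parameterized queries; the `oracle_id` column name is hypothetical, the `?` placeholders assume an ODBC-style driver, and the batch size is chosen to stay well under SQL Server's limit of roughly 2,100 parameters per request.

```python
def chunked(values, size):
    """Yield consecutive slices of values with at most size items each."""
    for start in range(0, len(values), size):
        yield values[start:start + size]


def build_id_filter_queries(ids, batch_size=1000):
    """Return (sql, parameters) pairs that filter sqlservertable by ID batch."""
    unique_ids = sorted(set(ids))  # drop duplicates returned by the Oracle query
    queries = []
    for batch in chunked(unique_ids, batch_size):
        placeholders = ", ".join("?" for _ in batch)
        # "oracle_id" is a hypothetical join column.
        sql = f"SELECT * FROM sqlservertable WHERE oracle_id IN ({placeholders})"
        queries.append((sql, batch))
    return queries


# Hypothetical IDs as they might come back from the Oracle query.
queries = build_id_filter_queries([7, 3, 7, 9, 1], batch_size=2)
for sql, params in queries:
    print(sql, params)
```

Each pair would then be executed with the SQL Server driver's cursor and the results concatenated. For larger ID sets, loading the IDs into a temporary table and joining on the server is usually the better fit, in line with the advice above.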