How to connect a Databricks SQL notebook to a parquet file via file path - SSMS

I'm trying to connect to a list of parquet files that contain our data tables. I need to read them in order to create a new table within a Databricks notebook that has the following fields:
Field Name
Data Type
Table Name
I just need to know the syntax for connecting to these parquet files via SQL in a Databricks notebook, plus any help with setting up these fields so the output looks like a query against information_schema in SSMS. Thanks.

The syntax below creates a table from a given parquet file path:
%sql
CREATE TABLE <Table_Name>
USING parquet
OPTIONS (path "</path/to/Parquet>")
Replace <Table_Name> and </path/to/Parquet> with your values.
You can then read the data with a SELECT statement:
SELECT * FROM <Table_Name>
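To get the Field Name / Data Type / Table Name listing you asked about (similar to querying information_schema in SSMS), one option is to loop over the registered tables with the PySpark catalog API in a Python cell. This is only a sketch: the database name "default" and the output table name "table_schema_info" are assumptions, so adjust them to your workspace.
# Sketch: build a Field Name / Data Type / Table Name listing for every table
# registered in the (assumed) "default" database, similar to
# information_schema.columns in SSMS.
rows = []
for t in spark.catalog.listTables("default"):
    for c in spark.catalog.listColumns(t.name, "default"):
        rows.append((c.name, c.dataType, t.name))

schema_df = spark.createDataFrame(rows, ["Field_Name", "Data_Type", "Table_Name"])
schema_df.write.mode("overwrite").saveAsTable("table_schema_info")  # assumed output table name
display(schema_df)
spark.catalog.listColumns returns the column name and data type for each table, which maps directly onto the three fields you listed.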
Apache Spark also makes it easy to read from and write to Azure SQL Database:
(df.write
    .mode("overwrite")
    .format("jdbc")
    .option("url", f"jdbc:sqlserver://{servername}.database.windows.net;databaseName={databasename};")
    .option("dbtable", tablename)
    .option("user", localusername)
    .option("password", localpassword)
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .save())
Specify the JDBC connection string along with the SQL user name and password; the dbtable option should use a schema-qualified table name (for example, dbo.<tablename>).
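Reading back from Azure SQL Database works the same way; here is a sketch of the reverse direction, reusing the same (hypothetical) connection variables as the write example:
# Sketch: read the table back from Azure SQL Database over JDBC
# (servername, databasename, tablename, localusername, localpassword are
# assumed to be defined as in the write example above).
sql_df = (spark.read
    .format("jdbc")
    .option("url", f"jdbc:sqlserver://{servername}.database.windows.net;databaseName={databasename};")
    .option("dbtable", tablename)
    .option("user", localusername)
    .option("password", localpassword)
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .load())
display(sql_df)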
Useful link: https://www.c-sharpcorner.com/article/ingest-data-to-azure-sql-database-using-azure-databricks/
You can then open the Azure SQL database in SSMS using the same server name, user name, and password.

Related

How to Load data from Azure SQL DB to Synapse Spark Datalake?

I am looking for a solution to load data from Azure SQL DB into an Azure Synapse Spark data lake (not into a dedicated pool).
The requirements are:
We currently keep our data in a CSV file; updates and inserts go into that CSV, which we read through the Spark data lake and load into dataframes.
Instead of a CSV file, we want to load this data into Azure SQL DB, so that any future updates or inserts happen directly in Azure SQL DB only.
Our transformations run in Synapse using PySpark, reading the file data through Spark tables in our lake database; the CSV files sit in our Synapse ADLS and we read them from there.
We want to connect Azure SQL DB to the Synapse Spark data lake, so that any upsert in SQL DB is also reflected in our table in the Spark data lake, and whenever we load those tables as a dataframe in a Synapse notebook they always pick up the latest data.
Thanks in advance for your responses.
You can do it in the following ways.
By connecting Azure SQL Database to the Synapse notebook via a JDBC connection:
First, go to the SQL database and copy the JDBC connection string from the connection strings page.
In this approach, your table should have a last_modified date column, which is what lets you pick up only the new data.
Now, in the Synapse notebook, use the following code.
jdbcHostname = "rakeshserver.database.windows.net"
jdbcDatabase = "rakeshsqldatbase"
jdbcPort = "1433"
username = "rakeshadmin"
password = "< your password >"

jdbcUrl = "jdbc:sqlserver://{0}:{1};database={2}".format(jdbcHostname, jdbcPort, jdbcDatabase)
connectionProperties = {
    "user": username,
    "password": password,
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}

# Pushdown query: only rows modified in the given window are read from SQL.
pushdown_query = "(select * from ok where last_modified >= '2022-10-01' and last_modified <= '2022-11-01') ok2"
df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query, properties=connectionProperties)
display(df)
In the Synapse notebook, get the data from the SQL table as above, and schedule the notebook to run every day using a Notebook activity. To make the dates dynamic, pass them in from the Notebook activity, as sketched below.
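As a sketch, assuming start_date and end_date are declared in the notebook's parameters cell and overridden by the Notebook activity's base parameters:
# Sketch: start_date and end_date are assumed to come from a parameters cell,
# overridden at run time by the pipeline's Notebook activity.
start_date = "2022-10-01"
end_date = "2022-11-01"

pushdown_query = (
    f"(select * from ok where last_modified >= '{start_date}' "
    f"and last_modified <= '{end_date}') ok2"
)
df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query, properties=connectionProperties)
display(df)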
By using a Lookup on SQL and passing the JSON to a Notebook activity as a parameter:
First, use a Lookup query (filtering on the last_modified date) to get the desired rows as JSON, then pass that output array as a parameter to the Synapse notebook through the Notebook activity. You can read it as a dataframe in the Synapse code, as in the sketch below.
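A rough sketch of the notebook side of that approach (lookup_rows is a hypothetical string parameter that the Notebook activity fills with the Lookup activity's JSON output array):
import json

# Sketch: lookup_rows is assumed to be a notebook parameter (string) holding the
# Lookup activity's output array as JSON; the value below is only a placeholder.
lookup_rows = '[{"id": 1, "last_modified": "2022-10-02"}]'
df = spark.createDataFrame(json.loads(lookup_rows))
display(df)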

Get CSV Data from Blob Storage to SQL server Using ADF

I want to transfer data from a CSV file in Azure Blob Storage, with the correct data types, to a SQL Server table.
How can I get the structure for the table from the CSV file? (I mean like when we do Script Table to New Query in SSMS.)
Note that the CSV file is not available on premises.
If your target table is already created in SSMS, the copy activity will take care of mapping the source schema to the target schema.
As a sample, I used a CSV file from blob storage as the source and a table from an Azure SQL database as the sink; in your case, create a SQL Server dataset using a SQL Server linked service.
In the copy activity's mapping you can see the schema of the CSV, the schema of the target table, and the mapping between them.
If your target table is not yet created, you can use a data flow and define the schema you want in the projection.
Create a data flow and use the blob CSV file as the source. In the source projection, set the data types you want for the CSV columns.
Since the target table does not exist yet, check Edit on the sink dataset and type in the name for the new table.
In the sink, use this dataset (a SQL Server dataset in your case) and make sure Recreate table is selected in the sink settings, so that a new table with that name is created.
Execute the data flow and the target table will be created with your user-defined data types.

Creating external hive table in databricks

I am using Databricks Community Edition.
I am using a Hive query to create an external table. The query runs without any error, but the table is not getting populated from the file specified in the query.
Any help would be appreciated.
From the official docs ... make sure your S3/storage location path and schema (with respect to the file format [TEXT, CSV, JSON, JDBC, PARQUET, ORC, HIVE, DELTA, and LIBSVM]) are correct.
-- deletes the metadata
DROP TABLE IF EXISTS <example-table>;

-- deletes the data (run in a Python cell):
-- dbutils.fs.rm("<your-s3-path>", True)

CREATE TABLE <example-table>
USING org.apache.spark.sql.parquet
OPTIONS (PATH "<your-s3-path>")
AS SELECT <your-sql-query-here>;

-- alternative
CREATE TABLE <table-name> (id LONG, date STRING) USING PARQUET LOCATION "<storage-location>";
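If the table still shows up empty, a quick sanity check (a sketch, run in a Python cell, reusing the same placeholders) is to confirm the path really contains data files and that the table can read them:
# Sketch: confirm the storage path holds data files and the table can see them.
display(dbutils.fs.ls("<your-s3-path>"))
spark.sql("SELECT COUNT(*) FROM <example-table>").show()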

How do I export database table data from HDFS into a local CSV using Hive without write permission

I don't have write permission on the HDFS cluster.
I am accessing database tables created/stored on HDFS using Hive via an edge node, and I have read access.
I want to export data from the tables on HDFS into a CSV on my local system.
How should I do it?
insert overwrite local directory '/____/____/'
row format delimited fields terminated by ','
select * from table;
Note that this may create multiple files, and you may want to concatenate them on the client side after the export finishes.
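As a sketch of that client-side concatenation in Python (the output file name export.csv is arbitrary, and the directory is the one used in the query above):
import glob

# Sketch: Hive typically writes one file per reducer (000000_0, 000001_0, ...)
# into the target directory; this stitches them into a single CSV.
with open("export.csv", "wb") as out:                    # arbitrary output name
    for part in sorted(glob.glob("/____/____/0*")):      # directory from the query above
        with open(part, "rb") as f:
            out.write(f.read())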

How to query data from gz file of Amazon S3 using Qubole Hive query?

I need to get specific data from a gz file.
How do I write the SQL?
Can I just query it like a database table, for example:
Select * from gz_File_Name where key = 'keyname' limit 10
But it always comes back with an error.
You need to create a Hive external table over this file location (folder) to be able to query it using Hive. Hive will recognize the gzip format. Like this:
create external table hive_schema.your_table (
  col_one string,
  col_two string
)
stored as textfile -- specify your file type, or use a serde
location 's3://your_s3_path_to_the_folder_where_the_file_is_located';
See the manual on Hive table here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTableCreate/Drop/TruncateTable
To be precise, S3 under the hood does not store folders; a file name containing slashes ('/') in S3 is presented by tools such as Hive as a folder structure. See here: https://stackoverflow.com/a/42877381/2700344