Azure Synapse Analytics Spark pool VNet integration solution

So I'm hoping to move some ETL pipelines from Azure Databricks over to Azure Synapse Analytics, but I'm running into some issues with VNet integration.
The case is as follows:
We have a VNet that is peered to a VNG VNet that provides access to on-prem SQL servers. For these ETLs I need to be able to access the above-mentioned SQL servers.
Azure Databricks solves this problem by letting me assign specific subnets to the workspace instead of managing the networking for me:
resource "azurerm_subnet_network_security_group_association" "databricks_pub_sub_nsg_assos" {
subnet_id = azurerm_subnet.databricks_pub_sub.id
network_security_group_id = azurerm_network_security_group.databricks_nsg.id
}
resource "azurerm_subnet_network_security_group_association" "databricks_priv_sub_nsg_assos" {
subnet_id = azurerm_subnet.databricks_priv_sub.id
network_security_group_id = azurerm_network_security_group.databricks_nsg.id
}
resource "azurerm_databricks_workspace" "databricks" {
name = "newcorp-databricks"
resource_group_name = azurerm_resource_group.main_rg.name
managed_resource_group_name = "databricks-resources-rg"
location = azurerm_resource_group.main_rg.location
sku = "premium"
custom_parameters{
public_subnet_name = azurerm_subnet.databricks_pub_sub.name
private_subnet_name = azurerm_subnet.databricks_priv_sub.name
virtual_network_id = var.iver_vnet_id
}
depends_on = [azurerm_network_security_group.databricks_nsg]
}
Basically I'm wondering if anyone knows of any way to inject a VNet (subnet) that my Azure Synapse Spark pool can run in?
If not: can anyone think of a clever workaround where we still keep the ETL (or ELT, if you've got a storage account solution in mind) in code? I'm not fond of the Azure Data Factory drag/drop/config approach.
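As far as I can tell, the azurerm provider's azurerm_synapse_workspace resource doesn't expose anything comparable to the Databricks custom_parameters block; the closest workspace-level option I can see is the managed virtual network flag. A minimal, untested sketch with placeholder names and variables:

resource "azurerm_synapse_workspace" "synapse" {
  name                                 = "newcorp-synapse"
  resource_group_name                  = azurerm_resource_group.main_rg.name
  location                             = azurerm_resource_group.main_rg.location
  storage_data_lake_gen2_filesystem_id = var.synapse_filesystem_id
  sql_administrator_login              = var.synapse_sql_admin
  sql_administrator_login_password     = var.synapse_sql_admin_password

  # Spark pools run inside this Microsoft-managed VNet; there is no
  # public/private subnet injection like the Databricks block above.
  managed_virtual_network_enabled = true
}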

Related

Is there a way to automate this Python script in GCP?

I am a complete beginner with GCP functions/products.
I have written the code below, which takes a list of cities from a local file, calls in weather data for each city in that list, and eventually uploads those weather values into a table in BigQuery. I don't need to change the code anymore, as it creates new tables when a new week begins. Now I would like to "deploy" it (I am not even sure if this is called deploying code) in the cloud so it runs there automatically. I tried using App Engine and Cloud Functions but faced issues in both places.
import requests, json, sqlite3, os, csv, datetime, re
from google.cloud import bigquery
#from google.cloud import storage

# Read the list of cities from a local text file, one city per line
list_city = []
with open("list_of_cities.txt", "r") as pointer:
    for line in pointer:
        list_city.append(line.strip())

API_key = "PLACEHOLDER"
Base_URL = "http://api.weatherapi.com/v1/history.json?key="

yday = datetime.date.today() - datetime.timedelta(days=1)
Date = yday.strftime("%Y-%m-%d")

table_id = f"sonic-cat-315013.weather_data.Historical_Weather_{yday.isocalendar()[0]}_{yday.isocalendar()[1]}"

credentials_path = r"PATH_TO_JSON_FILE"
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = credentials_path
client = bigquery.Client()

# Create the weekly table if it does not exist yet
try:
    schema = [
        bigquery.SchemaField("city", "STRING", mode="REQUIRED"),
        bigquery.SchemaField("Date", "DATE", mode="REQUIRED"),
        bigquery.SchemaField("Hour", "INTEGER", mode="REQUIRED"),
        bigquery.SchemaField("Temperature", "FLOAT", mode="REQUIRED"),
        bigquery.SchemaField("Humidity", "FLOAT", mode="REQUIRED"),
        bigquery.SchemaField("Condition", "STRING", mode="REQUIRED"),
        bigquery.SchemaField("Chance_of_rain", "FLOAT", mode="REQUIRED"),
        bigquery.SchemaField("Precipitation_mm", "FLOAT", mode="REQUIRED"),
        bigquery.SchemaField("Cloud_coverage", "INTEGER", mode="REQUIRED"),
        bigquery.SchemaField("Visibility_km", "FLOAT", mode="REQUIRED"),
    ]
    table = bigquery.Table(table_id, schema=schema)
    table.time_partitioning = bigquery.TimePartitioning(
        type_=bigquery.TimePartitioningType.DAY,
        field="Date",  # name of column to use for partitioning
    )
    table = client.create_table(table)  # Make an API request.
    print(
        "Created table {}.{}.{}".format(table.project, table.dataset_id, table.table_id)
    )
except Exception:
    print("Table {}_{} already exists".format(yday.isocalendar()[0], yday.isocalendar()[1]))


def get_weather():
    # Parse the API response held in the global variable x into per-hour lists
    try:
        x["location"]
    except KeyError:
        print(f"API could not call city {city_name}")

    global day, time, dailytemp, dailyhum, dailycond, chance_rain, Precipitation, Cloud_coverage, Visibility_km
    day = []
    time = []
    dailytemp = []
    dailyhum = []
    dailycond = []
    chance_rain = []
    Precipitation = []
    Cloud_coverage = []
    Visibility_km = []

    for i in range(24):
        dayval = re.search(r"^\S*\s", x["forecast"]["forecastday"][0]["hour"][i]["time"])
        timeval = re.search(r"\s(.*)", x["forecast"]["forecastday"][0]["hour"][i]["time"])
        day.append(dayval.group()[:-1])
        time.append(timeval.group()[1:])
        dailytemp.append(x["forecast"]["forecastday"][0]["hour"][i]["temp_c"])
        dailyhum.append(x["forecast"]["forecastday"][0]["hour"][i]["humidity"])
        dailycond.append(x["forecast"]["forecastday"][0]["hour"][i]["condition"]["text"])
        chance_rain.append(x["forecast"]["forecastday"][0]["hour"][i]["chance_of_rain"])
        Precipitation.append(x["forecast"]["forecastday"][0]["hour"][i]["precip_mm"])
        Cloud_coverage.append(x["forecast"]["forecastday"][0]["hour"][i]["cloud"])
        Visibility_km.append(x["forecast"]["forecastday"][0]["hour"][i]["vis_km"])

    for i in range(len(time)):
        time[i] = int(time[i][:2])


def main():
    i = 0
    while i < len(list_city):
        try:
            global city_name
            city_name = list_city[i]
            complete_URL = Base_URL + API_key + "&q=" + city_name + "&dt=" + Date
            response = requests.get(complete_URL, timeout=10)
            global x
            x = response.json()
            get_weather()
            table = client.get_table(table_id)
            varlist = []
            for j in range(24):
                variables = city_name, day[j], time[j], dailytemp[j], dailyhum[j], dailycond[j], chance_rain[j], Precipitation[j], Cloud_coverage[j], Visibility_km[j]
                varlist.append(variables)
            client.insert_rows(table, varlist)
            print(f"City {city_name}, ({i+1} out of {len(list_city)}) successfully inserted")
            i += 1
        except Exception as e:
            print(e)
            i += 1  # move on to the next city so a failure doesn't loop forever


if __name__ == "__main__":
    main()
In the code, there are direct references to two files located locally: one is the list of cities and the other is the JSON file containing the credentials to access my project in GCP. I believed that uploading these files to Cloud Storage and referencing them there wouldn't be an issue, but then I realised that I can't actually access my buckets in Cloud Storage without using the credentials file.
This leaves me unsure whether the entire process is possible at all: how do I authenticate from the cloud in the first place if I need to reference the credentials file locally first? It seems like an endless circle, where I'd authenticate with the file in Cloud Storage, but I'd need authentication first to access that file.
I'd really appreciate some help here. I have no idea where to go from this, and I also don't have great knowledge of SE/CS; I only know Python, R, and SQL.
For Cloud Functions, the deployed function will run with the project service account credentials by default, without needing a separate credentials file. Just make sure this service account is granted access to whatever resources it will be trying to access.
You can read more info about this approach here (along with options for using a different service account if you desire): https://cloud.google.com/functions/docs/securing/function-identity
This approach is very easy, and keeps you from having to deal with a credentials file at all on the server. Note that you should remove the os.environ line, as it's unneeded. The BigQuery client will use the default credentials as noted above.
If you want the code to run the same whether on your local machine or deployed to the cloud, simply set a "GOOGLE_APPLICATION_CREDENTIALS" environment variable permanently in the OS on your machine. This is similar to what you're doing in the code you posted; however, you're temporarily setting it every time using os.environ rather than permanently setting the environment variable on your machine. The os.environ call only sets that environment variable for that one process execution.
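To make that concrete, here is a minimal sketch of what the entry point could look like once the script is wrapped as an HTTP-triggered Cloud Function (the function name weather_to_bigquery and the HTTP trigger are just illustrative assumptions; the point is that bigquery.Client() picks up the runtime's service account with no os.environ line):

# main.py -- illustrative sketch, not your full script
from google.cloud import bigquery

def weather_to_bigquery(request):
    # On Cloud Functions, Client() uses the function's service account
    # automatically; no GOOGLE_APPLICATION_CREDENTIALS needed.
    client = bigquery.Client()
    # ... create the weekly table, call the weather API, insert rows ...
    return "ok"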
If for some reason you don't want to use the default service account approach outlined above, you can instead reference the credentials file directly when you instantiate the bigquery.Client():
https://cloud.google.com/bigquery/docs/authentication/service-account-file
You just need to package the credentials file with your code (i.e. in the same folder as your main.py file) and deploy it alongside, so it's in the execution environment. In that case it is referenceable/loadable from your script without needing any special permissions or credentials. Just provide the relative path to the file (i.e. assuming it's in the same directory as your Python script, reference only the filename).
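For example (sketch only; service_account.json is a placeholder for whatever key file you package next to main.py):

from google.cloud import bigquery

# Loads credentials from the key file deployed alongside main.py
client = bigquery.Client.from_service_account_json("service_account.json")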
There may be different flavors and options to deploy your application, and these will depend on your application's semantics and execution constraints.
It would be too much to cover all of them here, but the official Google Cloud Platform documentation covers all of them in great detail:
Google Compute Engine
Google Kubernetes Engine
Google App Engine
Google Cloud Functions
Google Cloud Run
Based on my understanding of your application design, the most suitable ones would be:
Google App Engine
Google Cloud Functions
Google Cloud Run: check these criteria to see if your application is a good fit for this deployment style
I would suggest using Cloud Functions as your deployment option, in which case your application will default to using the project's App Engine service account to authenticate itself and perform allowed actions. Hence, you should only check that the default account PROJECT_ID@appspot.gserviceaccount.com under the IAM configuration section has proper access to the needed APIs (BigQuery in your case).
In such a setup, you won't need to push your service account key to Cloud Storage (which I would recommend avoiding in any case), and you won't need to pull it either, as the runtime will handle authenticating the function for you.

Create an Azure SQL Database with the serverless tier using the SDK

Currently, I create databases and attach them to an SQL elastic pool:
database = await sqlServer.Databases.Define(mainDb.DbName).WithExistingElasticPool(pool.Name).CreateAsync();
Instead, I want to create databases with tier "General Purpose: Serverless, Gen5, 1 vCore", but I couldn't find any method that offers that possibility.
This feature is still in preview, and I can't find anything about it on the forums. How can I achieve this?
As an addendum to @Jim Xu's accepted answer, the API has changed.
var database = sqlserver.Databases.Define("test").WithEdition("GeneralPurpose").WithServiceObjective("GP_S_Gen5_1").Create();
WithEdition now takes a DatabaseEdition type, and WithServiceObjective now takes a ServiceObjectiveName. Both of these are string-backed enums with lists of predefined values, and both also provide a .Parse() method. So the line should now be:
var database = sqlserver.Databases.Define("test")
.WithEdition(**Database.Edition.Parse("GeneralPurpose")**)
.WithServiceObjective(**ServiceObjectiveName.Parse("GP_S_Gen5_1")**)
.Create();
According to my test, we can use the following C# code to create a "General Purpose: Serverless, Gen5, 1 vCore" database:
var credentials = SdkContext.AzureCredentialsFactory.FromServicePrincipal(client, key, tenant, AzureEnvironment.AzureGlobalCloud);
var azure = Azure.Configure().Authenticate(credentials).WithSubscription(SubscriptionId);

var sqlserver = azure.SqlServers.GetById("/subscriptions/<your subscription id>/resourceGroups/<your resource group name>/providers/Microsoft.Sql/servers/<your server name>");
var database = sqlserver.Databases.Define("test").WithEdition("GeneralPurpose").WithServiceObjective("GP_S_Gen5_1").Create();

Console.WriteLine(database.ServiceLevelObjective);
Console.WriteLine(database.Edition);
Console.WriteLine(database.Name);
Console.ReadLine();
Please reference this tutorial: Create a new elastic database pool with C#.
It provides a code example for creating a new database in a pool:
Create a DatabaseCreateOrUpdateParameters instance with a DatabaseCreateOrUpdateProperties object and set the properties of the new database. Then call the CreateOrUpdate method with the resource group, server name, and new database name.
// Create a database: configure create or update parameters and properties explicitly
DatabaseCreateOrUpdateParameters newPooledDatabaseParameters = new DatabaseCreateOrUpdateParameters()
{
    Location = currentServer.Location,
    Properties = new DatabaseCreateOrUpdateProperties()
    {
        Edition = "Standard",
        RequestedServiceObjectiveName = "ElasticPool",
        ElasticPoolName = "ElasticPool1",
        MaxSizeBytes = 268435456000, // 250 GB
        Collation = "SQL_Latin1_General_CP1_CI_AS"
    }
};
var poolDbResponse = sqlClient.Databases.CreateOrUpdate("resourcegroup-name", "server-name", "Database2", newPooledDatabaseParameters);
Please try to replace "standard" with the price tier "General Purpose: Serverless, Gen5, 1 vCore".
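As a rough, untested sketch of that substitution (the edition string "GeneralPurpose" and the service objective name "GP_S_Gen5_1" are taken from the fluent example above; whether this older management API accepts them is an assumption on my part):

// Sketch only: same parameter object as above, with the serverless tier values
DatabaseCreateOrUpdateParameters newServerlessDatabaseParameters = new DatabaseCreateOrUpdateParameters()
{
    Location = currentServer.Location,
    Properties = new DatabaseCreateOrUpdateProperties()
    {
        Edition = "GeneralPurpose",                    // instead of "Standard"
        RequestedServiceObjectiveName = "GP_S_Gen5_1", // Serverless, Gen5, 1 vCore
        Collation = "SQL_Latin1_General_CP1_CI_AS"
    }
};
var serverlessDbResponse = sqlClient.Databases.CreateOrUpdate("resourcegroup-name", "server-name", "Database3", newServerlessDatabaseParameters);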
Hope this helps.

ArcGIS offline map layer changes synchronization

In my WPF application I’m trying to use offline map functionality. Right now my feature service is configured for data sync, and I’m able to create a data replica on the server and download a local copy of the geodatabase.
_gdbSyncTask = await GeodatabaseSyncTask.CreateAsync(_featureServiceUri);

Envelope extent = new Envelope(xmin, ymin, xmax, ymax, new SpatialReference(wkidStart));
GenerateGeodatabaseParameters generateParams = await _gdbSyncTask.CreateDefaultGenerateGeodatabaseParametersAsync(extent);

_generateGdbJob = _gdbSyncTask.GenerateGeodatabase(generateParams, _gdbPath);
_generateGdbJob.JobChanged += GenerateGdbJobChanged;
_generateGdbJob.ProgressChanged += ((object sender, EventArgs e) =>
{
    UpdateProgressBar();
});
_generateGdbJob.Start();
After the initial synchronization, I’m able to successfully work with the map in offline mode. This includes operations like adding new geometries or editing existing polygons inside the local DB.
However, when I try to synchronize the changes back to the server, I get no results.
To synchronize the local database, I’m using the following code:
SyncGeodatabaseParameters parameters = new SyncGeodatabaseParameters()
{
    GeodatabaseSyncDirection = SyncDirection.Bidirectional,
    RollbackOnFailure = false
};

Geodatabase gdb = await Geodatabase.OpenAsync(this.GetGdbPath());
foreach (GeodatabaseFeatureTable table in gdb.GeodatabaseFeatureTables)
{
    long id = table.ServiceLayerId;
    SyncLayerOption option = new SyncLayerOption(id);
    option.SyncDirection = SyncDirection.Bidirectional;
    parameters.LayerOptions.Add(option);
}

_gdbSyncTask = await GeodatabaseSyncTask.CreateAsync(_featureServiceUri);

SyncGeodatabaseJob job = _gdbSyncTask.SyncGeodatabase(parameters, gdb);
job.JobChanged += SyncJob_JobChanged;
job.ProgressChanged += SyncJob_ProgressChanged;
job.Start();
Everything goes well. The synchronization ends with the status “Succeeded”. The messages logged by the SyncGeodatabaseJob are shown in a screenshot (not included here).
However, when I open the edited feature layer from the server inside the map web client, I cannot find any of my local changes. In the server database I can also see that no new records were created during the synchronization.
An interesting thing is that when I open the “Replica” data in the web interface, I can see the following information:
Replica Server Gen: 2
Creation Date: 2018/02/07 10:49:54 UTC
Last Sync Date: 2018/02/07 10:49:54 UTC
The “Last Sync Date” is equal to the replica “Creation Date”. However, the replica log in ArcMap shows further information (screenshot not included here).
Can anyone tell me how I should interpret the situation described above? Am I missing some steps in my code? Or maybe some configuration is missing on the server? It looks like data modifications are successfully pushed back to the replica on the server, but after that the replica is not synchronized with the server database (should that happen automatically?).
I’m new to ArcGIS development, so any help will be appreciated.
Thanks for all the answers. It turned out that versioning is enabled on the server database, and the offline, versioned changes were not reconciled to the server.
After running the reconcile/post script (http://desktop.arcgis.com/en/arcmap/10.3/manage-data/geodatabases/automate-reconcile-post-after-sync.htm), the offline changes became visible to other system users.
The code looks OK at a quick glance, so I would assume there is something going on in the setup.
What do you get back from the sync operation after the sync has completed? Note that you can just use await syncJob.GetResultsAsync() to start the job and wait for the results.
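For example, a minimal sketch of grabbing the results directly (using the job variable from your code above; Debug.WriteLine is just for illustration, and which SyncLayerResult members you then inspect is up to you):

IReadOnlyList<SyncLayerResult> results = await job.GetResultsAsync();
Debug.WriteLine($"Sync finished with status {job.Status} and {results.Count} layer result(s)");
foreach (JobMessage message in job.Messages)
{
    // The job messages usually contain the per-layer details of what was (or wasn't) uploaded
    Debug.WriteLine(message.Message);
}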
How is the feature service set up on the server? Please refer to https://enterprise.arcgis.com/en/server/latest/publish-services/linux/prepare-data-for-offline-use.htm for the different ways to set these things up.

Cannot set Azure SQL Database Long-term backup retention

I'm trying to re-configure long-term backup retention for my Azure SQL Database from a previously deleted Recovery Services vault (deleted via PowerShell) to a new Recovery Services vault.
Now when I try to configure it, I get an error saying:
TemplateBladeVirtualPart
SQLAZUREEXTENSION
Here is the script I used to remove the old Recovery Services vault (if it matters):
$vault = Get-AzureRmRecoveryServicesVault -Name "is-vault-prod"
Set-AzureRmRecoveryServicesVaultContext -Vault $vault

$container = Get-AzureRmRecoveryServicesBackupContainer -ContainerType AzureSQL -FriendlyName $vault.Name
$item = Get-AzureRmRecoveryServicesBackupItem -Container $container -WorkloadType AzureSQLDatabase
$availableBackups = Get-AzureRmRecoveryServicesBackupRecoveryPoint -Item $item

$containers = Get-AzureRmRecoveryServicesBackupContainer -ContainerType AzureSQL -FriendlyName $vault.Name
ForEach ($container in $containers)
{
    $items = Get-AzureRmRecoveryServicesBackupItem -Container $container -WorkloadType AzureSQLDatabase
    ForEach ($item in $items)
    {
        Disable-AzureRmRecoveryServicesBackupProtection -Item $item -RemoveRecoveryPoints -ea SilentlyContinue
    }
    Unregister-AzureRmRecoveryServicesBackupContainer -Container $container
}

Remove-AzureRmRecoveryServicesVault -Vault $vault
Unfortunately, you just cannot choose another Recovery Services vault once you have already used one.
I did a test in my lab and tried to disable long-term backup retention, but it still failed. I found the following:
Once you have configured a Recovery Services vault for a SQL server, it is locked in; you cannot use another vault (screenshot not included here).
I also found this in a FAQ:
Can I register my server to store backups to more than one vault?
No, you can currently store backups to only one vault at a time.
I understand why you want to use another vault.
However, we can only use the original Recovery Services vault for now; if it was deleted, we cannot use long-term backup retention. It seems like a weak point in the design. I will report this issue, and I believe this feature will improve in the future.
You can also post your idea in the Feedback Forum.
Hope this helps!

Using Microsoft Sync Framework to sync files across network

The file synchronization example given here - http://code.msdn.microsoft.com/Release/ProjectReleases.aspx?ProjectName=sync&ReleaseId=3424 only talks about syncing files on the same machine. Has anyone come across a working example of using something like WCF to enable this to work for files across a network?
Bryant's example (http://bryantlikes.com/archive/2008/01/03/remote-file-sync-using-wcf-and-msf.aspx) is incomplete, covers only one-way sync, and is less than ideal.
The Sync Framework can synchronize files across the network as long as you have an available network share.
In the constructor of the FileSyncProvider, set the rootDirectoryPath to a network share location that you have read and write permissions to:
string networkPath = @"\\machinename\sharedfolderlocation";
FileSyncProvider provider = new FileSyncProvider(networkPath);
To do a two way sync in this fashion you will need to create a FileSyncProvider for both the source and destination systems and use the SyncOrchestrator to do the heavy lifting for you.
An example:
string firstLocation = @"\\sourcemachine\sourceshare";
string secondLocation = @"\\sourcemachine2\sourceshare2";

FileSyncProvider firstProvider = new FileSyncProvider(firstLocation);
FileSyncProvider secondProvider = new FileSyncProvider(secondLocation);

SyncOrchestrator orchestrator = new SyncOrchestrator();
orchestrator.LocalProvider = firstProvider;
orchestrator.RemoteProvider = secondProvider;
orchestrator.Direction = SyncDirectionOrder.DownloadAndUpload;
This defines two file sync providers, and the orchestrator will sync the files in both directions. It tracks creations, modifications, and deletions of files in the directories set on the providers.
All that is needed at this point is to call Synchronize on the SyncOrchestrator:
orchestrator.Synchronize();