Azure Data Lake Analytics U-SQL with Azure Data Lake Storage Gen2

Question: what is the path forward for using ADLA (U-SQL) with ADLS (Gen2)?
I have been running Azure Data Lake Analytics (U-SQL) jobs via Azure Data Factory (ADF v2) with Azure Data Lake Store Gen1 for quite a while now in East US 2.
I was planning to have another instance deployed to cater to Canadian clients and wanted to set up Azure Data Lake Store Gen1 there.
What I tried:
I was not able to create an Azure Data Lake Storage Gen1 account in Central Canada (or any Canadian region, for that matter).
I tried to move to Azure Data Lake Storage Gen2, but then ran into an issue where the Azure Data Factory U-SQL activity could not be linked with a Gen2 storage linked service to pick up the U-SQL script.
I stumbled upon multiple links about this topic:
https://feedback.azure.com/forums/327234-data-lake/suggestions/36445702-add-support-for-adls-gen2-to-adla
https://social.msdn.microsoft.com/Forums/en-US/5ce97eef-8940-4591-a19c-934f71825e7d/connect-data-lake-analytics-to-adls-gen-2
which essentially say that U-SQL / ADLA won't be supporting ADLS Gen2.
I am a bit confused, since there is no official documentation on ADLA's direction.

Update:
This is the structure of my U-SQL activity; it works and processes successfully. (You can try creating a new JSON definition for the U-SQL activity to replace your current one.)
{
    "name": "pipeline4",
    "properties": {
        "activities": [
            {
                "name": "U-SQL1",
                "type": "DataLakeAnalyticsU-SQL",
                "dependsOn": [],
                "policy": {
                    "timeout": "7.00:00:00",
                    "retry": 0,
                    "retryIntervalInSeconds": 30,
                    "secureOutput": false,
                    "secureInput": false
                },
                "userProperties": [],
                "typeProperties": {
                    "scriptPath": "test1/u-sql.txt",
                    "scriptLinkedService": {
                        "referenceName": "LinkTo0730",
                        "type": "LinkedServiceReference"
                    }
                },
                "linkedServiceName": {
                    "referenceName": "AzureDataLakeAnalytics1",
                    "type": "LinkedServiceReference"
                }
            }
        ],
        "annotations": []
    }
}
Original Answer:
I was not able to create an Azure Datalake Storage Gen 1 account in
Central Canada (or any Canadian region for that matter)
On my side, I also cannot create a Data Lake Gen1 account in the Central Canada region. That is a limitation of my subscription, but you can check the resource provider on your side; maybe you can. (The Azure Data Lake Gen1 resource provider is 'Microsoft.DataLakeStore'.)
Resource Manager is supported in all regions, but the resources you deploy might not be supported in all regions. In addition, there may be limitations on your subscription that prevent you from using some regions that support the resource. The resource explorer displays valid locations for the resource type.
Please have a look at this document:
https://learn.microsoft.com/en-us/azure/azure-resource-manager/management/resource-providers-and-types
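If you want to check programmatically which regions your subscription can deploy that provider to, here is a minimal sketch using the Python management SDK (my own illustration, not from the documentation above; the subscription ID is a placeholder):

# Sketch: list the regions where Microsoft.DataLakeStore accounts can be created
# for the current subscription. Requires azure-identity and azure-mgmt-resource.
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

subscription_id = "<your-subscription-id>"  # placeholder
client = ResourceManagementClient(DefaultAzureCredential(), subscription_id)

provider = client.providers.get("Microsoft.DataLakeStore")
for resource_type in provider.resource_types:
    if resource_type.resource_type == "accounts":
        print("Data Lake Store Gen1 accounts can be created in:", resource_type.locations)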
I tried to move to Azure Datalake Storage Gen2 but then ran into an
issue where Azure Data Factory - U-SQL activity could not be linked
with Gen2 Storage linked service to pick up U-SQL script
On my side, it seems to be reading the U-SQL script from Gen2. Did you get some error?
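If it helps, here is a quick sanity check you could run (just my suggestion, not required): verify from code that the U-SQL script is actually readable in the Gen2 account your scriptLinkedService points to. The account name, filesystem and credentials below are placeholders.

# Sketch: confirm the U-SQL script file can be read from ADLS Gen2.
# Requires azure-identity and azure-storage-file-datalake.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://<your-gen2-account>.dfs.core.windows.net",  # placeholder
    credential=DefaultAzureCredential(),
)
file_client = service.get_file_system_client("<filesystem>").get_file_client("test1/u-sql.txt")
script = file_client.download_file().readall().decode("utf-8")
print(script[:200])  # print the start of the script to confirm it is readable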

Related

Azure Stream Analytics implementation or the best approach

I am new to Stream Analytics and I need help here to achieve a specific task.
I have telemetry data coming from IoT Hub in this format.
Basically, I will be getting machine telemetry data, and the stage of the operations on that machine, streamed to IoT Hub.
The stages will be indicated with a tag, e.g. "stageid": "stage1". I need to calculate the time taken for each stage using Stream Analytics, based on the timestamp and the stage tag.
Packet example:
[
    {
        "Payload": {
            "devid": "01",
            "locid": "loc01",
            "machid": "mac01",
            "stageid": "stage1",
            "timestamp": "2020-01-24T09:22:00.3270000Z"
        }
    },
    {
        "Payload": {
            "devid": "02",
            "locid": "loc01",
            "machid": "mac01",
            "stageid": "stage1",
            "timestamp": "2020-01-24T09:22:00.3270000Z"
        }
    }
]
[
    {
        "Payload": {
            "devid": "01",
            "locid": "loc01",
            "machid": "mac01",
            "stageid": "stage2",
            "timestamp": "2020-01-24T09:26:00.3270000Z"
        }
    },
    {
        "Payload": {
            "devid": "02",
            "locid": "loc01",
            "machid": "mac01",
            "stageid": "stage2",
            "timestamp": "2020-01-26T09:24:00.3270000Z"
        }
    }
]
Please help me: can we achieve this with a query, and if so, what could the query be, or what is the other best approach?
Thanks,
To my knowledge, your needs can't be implemented with ASA built-in features. ASA is a real-time data collection and analytics service; in other words, data needs to be processed in real time, and the current event can't wait for the next dataset to do some calculation or merging. Even if you could use window functions and GROUP BY, I believe the frequency of messages pushed by the devices is also variable.
As a workaround, my idea is to use an IoT Hub Azure Function trigger. Inside the trigger, you could use code to parse the message and save the key columns (stageid, timestamp, devid) into some storage, maybe Azure Table Storage. Before every insert, just grab the latest row for the current device to calculate the time taken up to the current message, so that you can store that duration somewhere else. In the end, update the latest row for every device.
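Here is a rough sketch of that idea as a Python Azure Function, just illustrative: it assumes an Event Hub-compatible trigger bound to the IoT Hub, a pre-created table named DeviceStage, and a storage connection string in app settings; all of these names are placeholders.

# Sketch: IoT Hub-triggered Azure Function that keeps the latest stage per device
# in Azure Table Storage and computes the time spent since the previous stage.
# Requires the azure-functions and azure-data-tables packages.
import json
import os
from datetime import datetime

import azure.functions as func
from azure.data.tables import TableClient


def parse_ts(value: str) -> datetime:
    # Timestamps look like 2020-01-24T09:22:00.3270000Z; trim to microseconds.
    return datetime.fromisoformat(value[:26])


def main(event: func.EventHubEvent):
    body = json.loads(event.get_body().decode("utf-8"))
    records = body if isinstance(body, list) else [body]
    table = TableClient.from_connection_string(
        os.environ["STORAGE_CONNECTION"], table_name="DeviceStage"
    )

    for record in records:
        payload = record["Payload"]
        try:
            previous = table.get_entity(partition_key=payload["devid"], row_key="latest")
            elapsed = parse_ts(payload["timestamp"]) - parse_ts(previous["timestamp"])
            # Store or forward the duration of the previous stage here.
            print(f"{payload['devid']} spent {elapsed} in {previous['stageid']}")
        except Exception:
            pass  # first message for this device, nothing to compare against yet

        # Upsert the latest stage/timestamp for this device.
        table.upsert_entity({
            "PartitionKey": payload["devid"],
            "RowKey": "latest",
            "stageid": payload["stageid"],
            "timestamp": payload["timestamp"],
        })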

How to read the files from Azure Blob Storage with a folder structure like 'StartDateOfMonth-EndDateOfMonth'?

Scenario
We have an Azure Blob Storage container with the following folder structure:
• 20190601-20190630
Basically, this folder will contain daily CSV files for the given month.
This folder structure is dynamic. So, in the next month, the folder 20190701-20190731 will be populated with daily CSV files.
Problem
On a daily basis, we need to move these files from Azure Blob Storage to Azure Data Lake using Azure Data Factory (v2).
How do we specify the folder structure (dynamically) in the input dataset (Azure Blob Storage) in Azure Data Factory (v2)?
Example:
20190601-20190630/*.CSV for the month June 2019
Basically, StartDateOfMonth and EndDateOfMonth are dynamic.
Thanks in Advance
You could configure your dataset folder path like this:
"folderPath": {
    "value": "@concat(formatDateTime(pipeline().parameters.scheduledRunTimeStart, 'yyyyMMdd'), '-', formatDateTime(pipeline().parameters.scheduledRunTimeEnd, 'yyyyMMdd'), '/')",
    "type": "Expression"
}
And pass the parameters into the dataset:
"parameters": {
    "scheduledRunTimeStart": {
        "type": "String"
    },
    "scheduledRunTimeEnd": {
        "type": "String"
    }
}
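Just to illustrate the folder name this should resolve to (my own sketch, not from ADF itself): for a given run date, the first and last days of that month formatted as yyyyMMdd give exactly the folder from the question; the pipeline would pass those two values in as scheduledRunTimeStart and scheduledRunTimeEnd.

# Sketch: compute the "StartDateOfMonth-EndDateOfMonth" folder name for a run date,
# mirroring the value the folderPath expression above is expected to produce.
import calendar
from datetime import date


def month_folder(run_date: date) -> str:
    last_day = calendar.monthrange(run_date.year, run_date.month)[1]
    start = run_date.replace(day=1).strftime("%Y%m%d")
    end = run_date.replace(day=last_day).strftime("%Y%m%d")
    return f"{start}-{end}/"


print(month_folder(date(2019, 6, 15)))  # -> 20190601-20190630/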

Syncing Kafka with AWS S3 with a different directory structure

We have events coming into Kafka, and using Kafka Connect we are syncing these events to AWS S3.
Data is visible in S3 in the below directory structure:
bucket_name/sub_folder/
Partition=0/events.json
Partition=1/events.json
Partition=2/events.json
Is there a way to store it in the below directory structure instead:
Bucket_name/sub_folder/date=today_date/events.json or Partition=0..2/date=today/events.json
The motivation is to store that day's events in that day's directory. I searched the web but could not find a way to do it.
Thanks in advance.
You can use the TimeBasedPartitioner which
partitions data according to ingestion time.
e.g. for hourly partitioning:
[…]
"partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
"path.format": "'year'=YYYY/'month'=MM/'day'=dd/'hour'=HH",
"locale": "US",
"timezone": "UTC",
"partition.duration.ms": "3600000",
"timestamp.extractor": "RecordField",
"timestamp.field": "my_record_field_with_timestamp_in",
[…]
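For illustration only (my own sketch, not part of the connector config; the exact prefix also depends on topics.dir and the topic name): with the path.format above, a record whose extracted timestamp falls on a given day ends up under that day's prefix.

# Sketch: render the S3 prefix the above path.format produces for a record timestamp,
# showing how each day's events end up in that day's directory.
from datetime import datetime, timezone

record_ts = datetime(2020, 1, 24, 9, 22, tzinfo=timezone.utc)
prefix = record_ts.strftime("year=%Y/month=%m/day=%d/hour=%H")
print(f"s3://bucket_name/sub_folder/{prefix}/events.json")
# -> s3://bucket_name/sub_folder/year=2020/month=01/day=24/hour=09/events.json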

Azure SqlServer and SqlDatabase resource groups

I would like to use Azure to create environments under a single ResourceGroup for clients comprising:
X web servers
Y app servers
1 database
Ideally, the database would be hosted on a server that is part of a different resource group, so I can leverage elastic pools across multiple clients.
When I attempt to use New-AzureRmResourceGroupDeployment to do this, I get an error stating that the parent SqlServer cannot be found.
I have created a SqlServer via PowerShell:
New-AzureRmSqlServer -servername "MyServer" -ResourceGroupName "Common" -Location "South Central US"
I then attempt a deployment for ClientA with:
New-AzureRmResourceGroup -Name 'ClientA' -Location "South Central US"
New-AzureRmResourceGroupDeployment -ResourceGroupName 'ClientA' -TemplateFile azure.ClientDeployment.json -Client 'ClientA'
My deployment configuration is:
{
    "parameters": { ... },
    "variables": { ... },
    "resources": [
        {
            "name": "MySqlServer/ClientA",
            "type": "Microsoft.Sql/servers/databases",
            "apiVersion": "2014-04-01",
            "location": "[resourceGroup().location]",
            "tags": {},
            "properties": {
                "edition": "Basic"
            }
        }
    ],
    "outputs": { }
}
Results in the error message:
New-AzureRmResourceGroupDeployment : 5:04:33 PM - Resource Microsoft.Sql/servers/databases 'MySqlServer/ClientA' failed with message '{
"code": "NotFound",
"message": "Server 'MySqlServer' does not exist in resource group 'ClientA' in subscription '{...}'.",
"target": null,
"details": [],
"innererror": []
}
Note that both resource groups (Common and ClientA) are in the same subscription and location.
Is it possible to have a SqlServer part of one resource group, and the SqlDatabase part of a different resource group?
Separating a SQL Server in one resource group from its database in another resource group is not possible as of now.
See the similar question here in the MSDN forum.
An Azure resource group is also defined to be a strong container for
its resources, which defines their identity (look at the URI of all
Azure resources), and propagation of role-based access rights, and
controls resource lifetimes - thus granting access to a resource group
grants rights to its resources and deleting a resource group deletes
the resources it contains.
Combining the above means that the resource group that contains a
server must also contain the databases hosted/nested on that server.
Even if you create everything in a single subscription and then try to move just the SQL database to a different resource group afterwards, it won't allow you to do that (in the portal).
For your case, I think you should create all the services in a common subscription, in a location that is common to or easily accessible by all the clients.
You can use AzureSpeed to get the latency to each region from your current location and try to create the resources in a common region that has minimal latency for all your clients.

Correct JSON to POST to Pub/Sub - Dataflow - BigQuery? Correct data schema?

I'm taking my first experimental steps with Google's pre-set-up templates, in this case the Google Cloud Dataflow template "Cloud Pub/Sub to BigQuery".
As a milestone towards my final goal (having physical gadgets report a data stream to Google Cloud Pub/Sub), my wish is to achieve something like this:
Postman (making an authenticated POST request with a JSON message to a Google Cloud Platform, GCP, endpoint) --> GCP Pub/Sub --> GCP Dataflow --> GCP BigQuery.
Right now I am following the tutorial found in Executing Templates, https://cloud.google.com/dataflow/docs/templates/executing-templates, "Example 2: Custom template, streaming job". This section states:
...This example projects.templates.launch request creates a streaming job
from a template that reads from a Pub/Sub topic and writes to a
BigQuery table. The BigQuery table must already exist with the
appropriate schema. If successful, the response body contains an
instance of LaunchTemplateResponse. ...
and furthermore how to do the POST:
https://dataflow.googleapis.com/v1b3/projects/[YOUR_PROJECT_ID]/templates:launch?gcsPath=gs://[YOUR_BUCKET_NAME]/templates/TemplateName
{
    "jobName": "[JOB_NAME]",
    "parameters": {
        "topic": "projects/[YOUR_PROJECT_ID]/topics/[YOUR_TOPIC_NAME]",
        "table": "[YOUR_PROJECT_ID]:[YOUR_DATASET].[YOUR_TABLE_NAME]"
    },
    "environment": {
        "tempLocation": "gs://[YOUR_BUCKET_NAME]/temp",
        "zone": "us-central1-f"
    }
}
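For completeness, this is roughly how I plan to issue that launch request from Python instead of Postman (my own sketch, untested; the job name, bucket, dataset and table are placeholders, and I reuse my VEHICLE_STATUS topic):

# Sketch: call projects.templates.launch, the same request shown above.
# Requires the google-auth package and application-default credentials.
import google.auth
from google.auth.transport.requests import AuthorizedSession

credentials, project_id = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
session = AuthorizedSession(credentials)

url = (
    f"https://dataflow.googleapis.com/v1b3/projects/{project_id}/templates:launch"
    "?gcsPath=gs://YOUR_BUCKET_NAME/templates/TemplateName"  # placeholder template path
)
body = {
    "jobName": "vehicle-status-streaming-job",  # placeholder
    "parameters": {
        "topic": f"projects/{project_id}/topics/VEHICLE_STATUS",
        "table": f"{project_id}:YOUR_DATASET.YOUR_TABLE_NAME",  # placeholder
    },
    "environment": {
        "tempLocation": "gs://YOUR_BUCKET_NAME/temp",  # placeholder
        "zone": "us-central1-f",
    },
}
response = session.post(url, json=body)
print(response.status_code, response.json())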
There are two things that confuse me. Let's, for the sake of a simple example, say that I have multiple vehicles that should continuously report their current status. I have already created my MQTT topic: VEHICLE_STATUS. Each of my vehicles should be able to report its:
Position [String]
Speed [Float]
Time [String]
VehicleID [Integer]
=======
I'm aware of the prototype for a PubsubMessage:
{
    "data": string,
    "attributes": {
        string: string,
        ...
    },
    "messageId": string,
    "publishTime": string
}
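To make my question concrete, here is my own sketch of how I imagine the publish request body could carry the four fields, JSON-encoded into data (the Pub/Sub REST API expects data to be base64-encoded); the values and field names are just guesses on my part.

# Sketch: build a Pub/Sub publish request body with one vehicle reading in "data".
import base64
import json

vehicle_reading = {  # hypothetical field names matching the list above
    "Position": "59.3293,18.0686",
    "Speed": 42.5,
    "Time": "2020-01-24T09:22:00Z",
    "VehicleID": 17,
}

publish_body = {
    "messages": [
        {
            "data": base64.b64encode(
                json.dumps(vehicle_reading).encode("utf-8")
            ).decode("ascii")
        }
    ]
}
print(json.dumps(publish_body, indent=2))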
My questions:
How should my BigQuery table schema look (which columns do I need to create)?
How should the entire corresponding JSON message look? What should my vehicle report to the endpoint each time?