How to read the files from Azure Blob Storage with folder structure as 'StartDateOfMonth-EndDateOfMonth'? - azure-data-factory-2

Scenario
We have an Azure Blob Storage container with the following folder structure.
• 20190601-20190630
Basically, this folder will contain daily CSV files for the given month.
This folder structure is dynamic. So, in the next month, folder 20190701-20190731 will be populated with daily CSV files.
Problem
On a daily basis, we need to move these files from Azure Blob Storage to Azure Data Lake using Azure Data Factory (v2).
How do we specify the folder structure dynamically in the input dataset (Azure Blob Storage) in Azure Data Factory (v2)?
Example:
20190601-20190630/*.CSV for the month June 2019
Basically, StartDateOfMonth and EndDateOfMonth are dynamic.
Thanks in Advance

You could configure your dataset folder path like this:
"folderPath": {
    "value": "@concat(formatDateTime(dataset().scheduledRunTimeStart, 'yyyyMMdd'), '-', formatDateTime(dataset().scheduledRunTimeEnd, 'yyyyMMdd'), '/')",
    "type": "Expression"
}
And declare the corresponding parameters on the dataset:
"parameters": {
    "scheduledRunTimeStart": {
        "type": "String"
    },
    "scheduledRunTimeEnd": {
        "type": "String"
    }
}
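Those dataset parameters can then be supplied by the pipeline that uses the dataset. Below is a minimal sketch of the copy activity's input reference; the dataset name "BlobSourceDataset" and the pipeline parameter names are illustrative, and with a tumbling window trigger you could pass @trigger().outputs.windowStartTime / windowEndTime instead:
"inputs": [
    {
        "referenceName": "BlobSourceDataset",
        "type": "DatasetReference",
        "parameters": {
            "scheduledRunTimeStart": "@pipeline().parameters.windowStart",
            "scheduledRunTimeEnd": "@pipeline().parameters.windowEnd"
        }
    }
]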

Related

Azure Datalake Analytics U-SQL with Azure Datalake Storage Gen 2

Question: what is the path forward for using ADLA (U-SQL) with ADLS (Gen2)?
I have been running Azure Data Lake Analytics (U-SQL) jobs via Azure Data Factory (ADF v2) with Azure Data Lake Store Generation 1 for quite a while now in East US 2.
I was planning to have another instance deployed to cater to Canadian clients and wanted to set up Azure Data Lake Store Generation 1 there.
What I tried:
I was not able to create an Azure Data Lake Storage Gen1 account in Central Canada (or any Canadian region, for that matter).
I tried to move to Azure Data Lake Storage Gen2, but then ran into an issue where the Azure Data Factory U-SQL activity could not be linked with a Gen2 storage linked service to pick up the U-SQL script.
I stumbled upon multiple links about this topic:
https://feedback.azure.com/forums/327234-data-lake/suggestions/36445702-add-support-for-adls-gen2-to-adla
https://social.msdn.microsoft.com/Forums/en-US/5ce97eef-8940-4591-a19c-934f71825e7d/connect-data-lake-analytics-to-adls-gen-2
which essentially say that U-SQL / ADLA won't be supporting ADLS Gen2.
I am a bit confused, since there is no official documentation on ADLA's direction.
Update:
This is the structure of my U-SQL activity; it runs and processes successfully. (You can try creating a new JSON definition of the U-SQL activity to replace yours.)
{
    "name": "pipeline4",
    "properties": {
        "activities": [
            {
                "name": "U-SQL1",
                "type": "DataLakeAnalyticsU-SQL",
                "dependsOn": [],
                "policy": {
                    "timeout": "7.00:00:00",
                    "retry": 0,
                    "retryIntervalInSeconds": 30,
                    "secureOutput": false,
                    "secureInput": false
                },
                "userProperties": [],
                "typeProperties": {
                    "scriptPath": "test1/u-sql.txt",
                    "scriptLinkedService": {
                        "referenceName": "LinkTo0730",
                        "type": "LinkedServiceReference"
                    }
                },
                "linkedServiceName": {
                    "referenceName": "AzureDataLakeAnalytics1",
                    "type": "LinkedServiceReference"
                }
            }
        ],
        "annotations": []
    }
}
Original Answer:
I was not able to create an Azure Data Lake Storage Gen1 account in Central Canada (or any Canadian region, for that matter).
On my side, I also cannot create a Data Lake Gen1 account in the Central Canada region; that is a limitation of my subscription. But you can check the resource providers on your side, maybe you can create one. (The resource provider for Azure Data Lake Gen1 is 'Microsoft.DataLakeStore'.)
Resource Manager is supported in all regions, but the resources you deploy might not be supported in all regions. In addition, there may be limitations on your subscription that prevent you from using some regions that support the resource. The resource explorer displays valid locations for the resource type.
Please have a check of this document:
https://learn.microsoft.com/en-us/azure/azure-resource-manager/management/resource-providers-and-types
I tried to move to Azure Data Lake Storage Gen2, but then ran into an issue where the Azure Data Factory U-SQL activity could not be linked with a Gen2 storage linked service to pick up the U-SQL script.
On my side, it seems the activity reads the U-SQL script from Gen2 just fine. Did you get a specific error?
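For reference, a script linked service pointing at Gen2 storage is typically of type AzureBlobFS. Here is a minimal sketch, assuming account-key authentication; the URL and key are placeholders and your authentication method may differ:
{
    "name": "LinkTo0730",
    "properties": {
        "type": "AzureBlobFS",
        "typeProperties": {
            "url": "https://<storage-account-name>.dfs.core.windows.net",
            "accountKey": {
                "type": "SecureString",
                "value": "<account-key>"
            }
        }
    }
}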

Data Factory copy pipeline from API

We use an Azure Data Factory copy pipeline to transfer data from REST APIs to an Azure SQL Database, and it is doing some strange things. Because we loop over a set of APIs that need to be transferred, the mapping on the copy activity is left empty.
But for one API the automatic mapping goes wrong. The destination table is created with all the needed columns and correct data types based on the received metadata, yet when we run the pipeline for that specific API, the following message is shown:
{ "errorCode": "2200", "message": "ErrorCode=SchemaMappingFailedInHierarchicalToTabularStage,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Failed to process hierarchical to tabular stage, error message: Ticks must be between DateTime.MinValue.Ticks and DateTime.MaxValue.Ticks.\r\nParameter name: ticks,Source=Microsoft.DataTransfer.ClientLibrary,'", "failureType": "UserError", "target": "Copy data1", "details": [] }
As a test we did the mapping for that API manually using the "Import schema" option on the Mapping tab; there we can see that all the fields are correctly mapped. We executed the pipeline again using that mapping and everything worked fine.
But of course, we don't want to use a manual mapping, because the pipeline is also used in a loop for the other APIs.

azure stream analytics implementation or the best approach

I am new to Stream Analytics and I need help to achieve a specific task.
I have telemetry data coming from IoT Hub in the format below.
Basically I will be getting machine telemetry data, along with the stage of the operation on that machine, streamed to IoT Hub.
The stages are indicated with a tag, e.g. "stageid": "stage1". I need to calculate the time taken for each stage using Stream Analytics, based on the timestamp and the stage tag.
Packet example:
[{
    "Payload": {
        "devid": "01",
        "locid": "loc01",
        "machid": "mac01",
        "stageid": "stage1",
        "timestamp": "2020-01-24T09:22:00.3270000Z"
    }
},
{
    "Payload": {
        "devid": "02",
        "locid": "loc01",
        "machid": "mac01",
        "stageid": "stage1",
        "timestamp": "2020-01-24T09:22:00.3270000Z"
    }
}]
[{
    "Payload": {
        "devid": "01",
        "locid": "loc01",
        "machid": "mac01",
        "stageid": "stage2",
        "timestamp": "2020-01-24T09:26:00.3270000Z"
    }
},
{
    "Payload": {
        "devid": "02",
        "locid": "loc01",
        "machid": "mac01",
        "stageid": "stage2",
        "timestamp": "2020-01-26T09:24:00.3270000Z"
    }
}]
Please help me: can we achieve this with a query, and if so what would that query be, or what is the best alternative approach?
Thanks,
To my knowledge, your needs can't be implemented with ASA's built-in features. ASA is a real-time data collection and analytics service; in other words, data needs to be processed in real time. The current event can't wait for the next one to arrive before doing a calculation or merge. Even if you could use window functions and GROUP BY, I believe the frequency of the messages pushed by the devices is also variable.
As a workaround, my idea is to use an Azure Function with an IoT Hub trigger. Inside the trigger, you could parse the message and save the key columns (stageid, timestamp, devid) into some storage, maybe Azure Table Storage. Before every insert, grab the latest row for the current device and calculate the time taken against the current message, then store that duration wherever it is needed. Finally, update the latest row for that device, as sketched below.
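A rough sketch of that idea as a Python Azure Function on the IoT Hub (Event Hub-compatible) endpoint, assuming one Payload per message (the batched packets above would need a small loop); the connection string, table name and output destination are placeholders:
import json
import logging
from datetime import datetime

import azure.functions as func
from azure.core.exceptions import ResourceNotFoundError
from azure.data.tables import TableServiceClient  # pip install azure-data-tables

# Placeholder settings -- point these at your own storage account and table.
TABLE_CONN_STR = "<storage-connection-string>"
TABLE_NAME = "deviceStageState"


def parse_ts(value: str) -> datetime:
    # IoT Hub timestamps carry 7 fractional digits; trim to microseconds for strptime.
    head, _, frac = value.rstrip("Z").partition(".")
    return datetime.strptime(f"{head}.{(frac + '000000')[:6]}", "%Y-%m-%dT%H:%M:%S.%f")


def main(event: func.EventHubEvent):
    payload = json.loads(event.get_body().decode("utf-8"))["Payload"]
    devid, stageid = payload["devid"], payload["stageid"]
    ts = parse_ts(payload["timestamp"])

    table = TableServiceClient.from_connection_string(TABLE_CONN_STR).get_table_client(TABLE_NAME)

    # Grab the latest row for this device (if any) and compute the stage duration.
    try:
        previous = table.get_entity(partition_key=devid, row_key="latest")
        duration = (ts - parse_ts(previous["timestamp"])).total_seconds()
        logging.info("Device %s: %s took %.0f seconds", devid, previous["stageid"], duration)
        # ...persist the duration wherever it is needed (SQL, another table, etc.)...
    except ResourceNotFoundError:
        logging.info("No previous stage recorded for device %s", devid)

    # Upsert the latest row for this device so the next message can diff against it.
    table.upsert_entity({
        "PartitionKey": devid,
        "RowKey": "latest",
        "stageid": stageid,
        "timestamp": payload["timestamp"],
    })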

Exporting Cloudwatch logs in original format

I am looking for a way to export CloudWatch logs in their original form to S3. I used the console to export a day's worth of logs from a log group, and it seems that a timestamp was prepended to each line, breaking the original JSON formatting. I was looking to import this into Glue as a JSON file for a test transformation script. The original data is formatted as a normal JSON string when it is sent to CloudWatch; it looks like:
{ "a": 123, "b": "456", "c": 789 }
After exporting and decompressing the data it looks like
2019-06-28T00:00:00.099Z { "a": 123, "b": "456", "c": 789 }
This breaks reading the line as a JSON string, since it's no longer in a standard format.
The dataset is fairly large (100 GB+) for this run, and will possibly grow larger in the future, so running a CLI command and processing each line locally isn't feasible in my opinion. Is there any known way to do what I am looking for?
Thank you
Timestamps are automatically added when you push logs to CloudWatch; every log event in CloudWatch has a timestamp.
You can create a subscription filter to Kinesis Data Firehose, and on Firehose use a Lambda function to reformat the log events (remove the timestamp) before storing the logs in S3.
https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/Subscriptions.html
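If you go the subscription-filter route, the Firehose data-transformation Lambda only has to unwrap the CloudWatch Logs envelope and keep each event's message field (the original JSON line), since the timestamp is carried in a separate field. A minimal Python sketch, with error handling kept deliberately simple:
import base64
import gzip
import json


def lambda_handler(event, context):
    """Firehose data-transformation handler: unwrap CloudWatch Logs subscription
    records and emit only the original log messages (no prepended timestamp)."""
    output = []
    for record in event["records"]:
        payload = json.loads(gzip.decompress(base64.b64decode(record["data"])))

        # Control messages carry no log data -- drop them.
        if payload.get("messageType") != "DATA_MESSAGE":
            output.append({"recordId": record["recordId"], "result": "Dropped"})
            continue

        # Each logEvent keeps its timestamp in a separate field; "message" is the raw line.
        lines = "\n".join(e["message"] for e in payload["logEvents"]) + "\n"
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(lines.encode("utf-8")).decode("utf-8"),
        })

    return {"records": output}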

Syncing Kafka with AWS S3 with a different directory structure

We have events coming into Kafka, and using Kafka Connect we are syncing these events to AWS S3.
Data is visible in S3 in the directory structure below:
bucket_name/sub_folder/
Partition=0/events.json
Partition=1/events.json
Partition=2/events.json
Is there a way to store it in one of the directory structures below instead?
bucket_name/sub_folder/date=today_date/events.json or
bucket_name/sub_folder/Partition=0..2/date=today_date/events.json
The motivation is to store each day's events in that day's directory. I searched the web but could not find any other way.
Thanks in advance.
You can use the TimeBasedPartitioner, which partitions data according to ingestion time. For example, for hourly partitioning:
[…]
"partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
"path.format": "'year'=YYYY/'month'=MM/'day'=dd/'hour'=HH",
"locale": "US",
"timezone": "UTC",
"partition.duration.ms": "3600000",
"timestamp.extractor": "RecordField",
"timestamp.field": "my_record_field_with_timestamp_in",
[…]
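Note that with "timestamp.extractor": "RecordField" the partition path is derived from the named field inside each record rather than from wall-clock ingestion time; if you want partitioning by the time the connector processed the record, "Wallclock" is the usual choice.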