I am trying to copy Azure tables from one storage account to another storage account, but while doing this copy I want to change a column's DateTime value to a Unix timestamp.
I am using the Azure Data Factory Copy Activity. If I specify the InitialDate column type as Int64 in the output dataset, I get an error that DateTimeOffset cannot be converted to Int64.
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "AzureTableSource"
},
"sink": {
"type": "AzureTableSink",
"azureTablePartitionKeyName": "PartitionKey",
"azureTableRowKeyName": "RowKey",
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
},
"translator": {
"type": "TabularTranslator",
"columnMappings": "PartitionKey:PartitionKey,RowKey:RowKey,Timestamp:Timestamp,InitialDate"
},
"parallelCopies": 32,
"cloudDataMovementUnits": 32
},
"inputs": [
{
"name": "InputDataset-3tk"
}
],
"outputs": [
{
"name": "OutputDataset-3tk"
}
],
"policy": {
"timeout": "1.00:00:00",
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"style": "StartOfInterval",
"retry": 3,
"longRetry": 0,
"longRetryInterval": "00:00:00"
},
"scheduler": {
"frequency": "Day",
"interval": 1
},
"name": "Activity-0-Test->Test"
}
]
Is there any way I can change the InitialDate column values to a Unix timestamp (Int64) while copying to the output dataset?
Are there any translators other than TabularTranslator? I couldn't find any info on the web.
If I understand correctly, you want to change a column value while copying data from one storage table to another with Data Factory. Based on my experience, this is not supported by the Data Factory Copy Activity currently.
In my opinion, a workaround is to use a scheduled Azure WebJob. In the WebJob you could use the Azure Storage SDK to copy the table records, change the column value, and then insert them into the other table.
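For illustration, here is a minimal sketch of what such a WebJob could look like using the classic WindowsAzure.Storage table SDK; the connection strings and table names are placeholders, InitialDate is the column from the question, and the exact SDK calls may differ slightly depending on the version you use:

using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Table;

class CopyTableWithUnixTimestamp
{
    static void Main()
    {
        // Placeholder connection strings and table names.
        var source = CloudStorageAccount.Parse("<source-connection-string>")
            .CreateCloudTableClient().GetTableReference("SourceTable");
        var target = CloudStorageAccount.Parse("<target-connection-string>")
            .CreateCloudTableClient().GetTableReference("TargetTable");
        target.CreateIfNotExists();

        // Stream every entity from the source table.
        foreach (var entity in source.ExecuteQuery(new TableQuery<DynamicTableEntity>()))
        {
            // Replace the DateTimeOffset value of InitialDate with a Unix timestamp (Int64).
            if (entity.Properties.TryGetValue("InitialDate", out EntityProperty initialDate)
                && initialDate.DateTimeOffsetValue.HasValue)
            {
                long unixSeconds = initialDate.DateTimeOffsetValue.Value.ToUnixTimeSeconds();
                entity.Properties["InitialDate"] = EntityProperty.GeneratePropertyForLong(unixSeconds);
            }

            // Write the modified entity to the target table.
            target.Execute(TableOperation.InsertOrReplace(entity));
        }
    }
}

For large tables you would probably batch the writes with TableBatchOperation, but the idea is the same.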
I'm currently struggling with the Azure Data Factory v2 If activity which always fails with this error message:
[Screenshot: "Activity failed: Activity failed because an inner activity failed."]
I've designed two separate pipelines: one takes a full snapshot of the data (1333 records) from the on-premises SQL Server and loads it into the Azure SQL Database, and the other takes just the delta from the same source.
Both pipelines work fine when executed independently.
I then decided to wrap these two pipelines into the one parent pipeline which would do this:
1. Execute a Lookup activity to check whether the target table in the Azure SQL Database has any records, a basic Select Count(Request_ID) As record_count From target_table - the activity works fine and I can preview the returned record count.
2. Pass the output from the Lookup activity to the If activity, with the condition that if record_count = 0 the parent pipeline invokes the full load pipeline, otherwise it invokes the delta load pipeline.
This is the actual expression:
{#activity('lookup_sites_record_count').output.firstRow.record_count}==0
Whenever I try to execute this parent pipeline, it fails with the above message of "Activity failed: Activity failed because an inner activity failed."
Both inner activities, that is, full load and delta load pipelines, work just fine when triggered independently.
What am I missing?
Many thanks in advance :).
mikhailg
Pipeline's JSON definition below:
{
"name": "pl_remedyreports_load_rs_sites",
"properties": {
"activities": [
{
"name": "lookup_sites_record_count",
"type": "Lookup",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false
},
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "Select Count(Request_ID) As record_count From mdp.RS_Sites;"
},
"dataset": {
"referenceName": "ds_azure_sql_db_sites",
"type": "DatasetReference"
}
}
},
{
"name": "If_check_site_record_count",
"type": "IfCondition",
"dependsOn": [
{
"activity": "lookup_sites_record_count",
"dependencyConditions": [
"Succeeded"
]
}
],
"typeProperties": {
"expression": {
"value": "{#activity('lookup_sites_record_count').output.firstRow.record_count}==0",
"type": "Expression"
},
"ifFalseActivities": [
{
"name": "pl_remedyreports_invoke_load_sites_inc",
"type": "ExecutePipeline",
"typeProperties": {
"pipeline": {
"referenceName": "pl_remedyreports_load_sites_inc",
"type": "PipelineReference"
}
}
}
],
"ifTrueActivities": [
{
"name": "pl_remedyreports_invoke_load_sites_full",
"type": "ExecutePipeline",
"typeProperties": {
"pipeline": {
"referenceName": "pl_remedyreports_load_sites_full",
"type": "PipelineReference"
}
}
}
]
}
}
],
"folder": {
"name": "Load Remedy Reference Data"
}
}
}
Your expression should be:
@equals(activity('lookup_sites_record_count').output.firstRow.record_count,0)
An If Condition expression in Data Factory v2 has to start with @ and evaluate to a boolean, so use the equals() function instead of the {...}==0 comparison.
I'm not sure that what I'm trying to achieve is even possible in Data Factory, but I guess there should be a way.
Simply put, I have a table in the DW that needs to be updated by a stored procedure once a day.
This stored procedure resides on the source DB; I am looking for a way to pass some IDs to it, get the results back, and store them in the DW table.
Any help would be appreciated. The pipeline below is all I could think of:
{
"name": "UpdateColumnX",
"properties": {
"activities": [
{
"type": "SqlServerStoredProcedure?? Not Really Sure",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "$$Text.Format('Passing IDs to the stored Procedure', Time.AddHours(WindowStart,10), Time.AddHours(WindowEnd,10))\n"
},
"storedProcedureName": "UpdateDataThroughSP",
"storedProcedureParameters": {
"StartDate": "$$Text.Format('{0:yyyy-MM-dd HH:mm:ss}', Time.AddHours(WindowStart,10))",
"EndDate ": "$$Text.Format('{0:yyyy-MM-dd HH:mm:ss}', Time.AddHours(WindowEnd,10))"
}
},
"inputs": [
{
"name": "Not Sure which table should be my Input, the DW table having the IDs or the source table? "
}
],
"outputs": [
{
"name": "Sames and Input not sure"
}
],
"policy": {
"timeout": "01:00:00",
"concurrency": 1,
"retry": 3
},
"scheduler": {
"frequency": "Day",
"interval": 1,
"offset": "20:30:00"
},
"name": "Update Data through Source SP"
}
],
"start": "2017-09-13T20:30:00.045Z",
"end": "2099-12-30T13:00:00Z",
"isPaused": false,
"hubName": "HubName",
"pipelineMode": "Scheduled"
}
}
I am trying to ingest data into Druid from Hive ORC-compressed table data in HDFS. Any pointers on this would be very helpful.
Assuming you already have Druid and YARN/MapReduce set up, you can launch an index_hadoop task that will do what you ask.
There is a druid-orc-extensions extension that allows reading ORC files. I don't think it comes with the standard release, so you'll have to get it somehow (we build it from source).
(extension list http://druid.io/docs/latest/development/extensions.html)
Here is an example that would ingest a bunch of ORC files and append an interval to a datasource. POST it to an overlord at http://overlord:8090/druid/indexer/v1/task
(doc http://druid.io/docs/latest/ingestion/batch-ingestion.html)
You may have to adjust it depending on your distribution; I remember we had class-not-found issues on Hortonworks (classpathPrefix helps adjust the MapReduce classpath).
{
"type": "index_hadoop",
"spec": {
"ioConfig": {
"type": "hadoop",
"inputSpec": {
"type": "granularity",
"inputFormat": "org.apache.hadoop.hive.ql.io.orc.OrcNewInputFormat",
"dataGranularity": "hour",
"inputPath": "/apps/hive/warehouse/table1",
"filePattern": ".*",
"pathFormat": "'partition='yyyy-MM-dd'T'HH"
}
},
"dataSchema": {
"dataSource": "cube_indexed_from_orc",
"parser": {
"type": "orc",
"parseSpec": {
"format": "timeAndDims",
"timestampSpec": {
"column": "timestamp",
"format": "nano"
},
"dimensionsSpec": {
"dimensions": ["cola", "colb", "colc"],
"dimensionExclusions": [],
"spatialDimensions": []
}
},
"typeString": "struct<timestamp:bigint,cola:bigint,colb:string,colc:string,cold:bigint>"
},
"metricsSpec": [{
"type": "count",
"name": "count"
}],
"granularitySpec": {
"type": "uniform",
"segmentGranularity": "DAY",
"queryGranularity": "HOUR",
"intervals": ["2017-06-14T00:00:00.000Z/2017-06-15T00:00:00.000Z"]
}
},
"tuningConfig": {
"type": "hadoop",
"partitionsSpec": {
"type": "hashed",
"targetPartitionSize": 5000000
},
"leaveIntermediate": false,
"forceExtendableShardSpecs": "true"
}
}
}
I want to copy data from Azure Table Storage to Azure SQL Server using Azure Data Factory, but I get a strange error.
In my Azure Table Storage I have a column which contains multiple data types (this is how Table Storage works), e.g. DateTime and String.
In my Data Factory project I declared the entire column as string, but for some reason Data Factory infers the data type from the first cell it encounters during the extraction process.
In my Azure SQL Server database all columns are string.
Example
I have this table in Azure Table Storage: Flights
RowKey PartitionKey ArrivalTime
--------------------------------------------------
1332-2 2213dcsa-213 04/11/2017 04:53:21.707 PM - this cell is DateTime
1332-2 2213dcsa-214 DateTime.Null - this cell is String
If my table is like the one below, the copy process will work, because the first row is string and it will convert the entire column to string.
RowKey PartitionKey ArrivalTime
--------------------------------------------------
1332-2 2213dcsa-214 DateTime.Null - this cell is String
1332-2 2213dcsa-213 04/11/2017 04:53:21.707 PM - this cell is DateTime
Note: I am not allowed to change the data type in Azure Table Storage, move the rows, or add new ones.
Below are the input and output data sets from Azure Data Factory:
"datasets": [
{
"name": "InputDataset",
"properties": {
"structure": [
{
"name": "PartitionKey",
"type": "String"
},
{
"name": "RowKey",
"type": "String"
},
{
"name": "ArrivalTime",
"type": "String"
}
],
"published": false,
"type": "AzureTable",
"linkedServiceName": "Source-AzureTable",
"typeProperties": {
"tableName": "flights"
},
"availability": {
"frequency": "Day",
"interval": 1
},
"external": true,
"policy": {}
}
},
{
"name": "OutputDataset",
"properties": {
"structure": [
{
"name": "PartitionKey",
"type": "String"
},
{
"name": "RowKey",
"type": "String"
},
{
"name": "ArrivalTime",
"type": "String"
}
],
"published": false,
"type": "AzureSqlTable",
"linkedServiceName": "Destination-SQLAzure",
"typeProperties": {
"tableName": "[dbo].[flights]"
},
"availability": {
"frequency": "Day",
"interval": 1
},
"external": false,
"policy": {}
}
}
]
Does anyone know a solution to this issue?
I've just been playing around with this. I think you have 2 options to deal with this.
Option 1
Simply remove the data type attribute from your input dataset. In the 'structure' block of the input JSON table dataset you don't have to specify the type attribute. Remove or comment it out.
For example:
{
"name": "InputDataset-ghm",
"properties": {
"structure": [
{
"name": "PartitionKey",
"type": "String"
},
{
"name": "RowKey",
"type": "String"
},
{
"name": "ArrivalTime"
/* "type": "String" --<<<<<< Optional! */
},
This should mean the data type is not validated on read.
Option 2
Use a custom activity upstream of the SQL DB table load to cleanse and transform the table data. This means breaking out the C# and requires a lot more dev time, but you may be able to reuse the cleansing code for other datasets.
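To give an idea of the cleansing step (leaving out the Data Factory custom activity plumbing), here is a rough sketch of the core logic using the classic WindowsAzure.Storage SDK; the ArrivalTime column and the date format come from the question, everything else is an assumption:

using System.Collections.Generic;
using System.Globalization;
using Microsoft.WindowsAzure.Storage.Table;

static class FlightsCleanser
{
    // Normalise the mixed-type ArrivalTime column to a plain string so the
    // downstream SQL load always receives string values.
    public static IEnumerable<DynamicTableEntity> Cleanse(IEnumerable<DynamicTableEntity> rows)
    {
        foreach (var row in rows)
        {
            if (row.Properties.TryGetValue("ArrivalTime", out EntityProperty arrival))
            {
                string normalised =
                    arrival.PropertyType == EdmType.DateTime && arrival.DateTimeOffsetValue.HasValue
                        ? arrival.DateTimeOffsetValue.Value
                            .ToString("MM/dd/yyyy hh:mm:ss.fff tt", CultureInfo.InvariantCulture)
                        : arrival.StringValue; // already a string, e.g. "DateTime.Null"

                row.Properties["ArrivalTime"] = EntityProperty.GeneratePropertyForString(normalised);
            }
            yield return row;
        }
    }
}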
Hope this helps.
I have a JSON file stored in the Data Lake Store. I can extract the JSON file using the JsonExtractor from Microsoft.
Is it possible to load the JSON file into a POCO object without using the EXTRACT command? If I use the EXTRACT command, is it possible to combine all the rows into a single C# object?
Below is a sample JSON file which I want to deserialize and store in a C# object:
{
"sourcePath": "wasb://container#accountName.blob.core.net/Input/{*}.txt",
"destinationPath": "wasb://container#accountName.blob.core.net/Output/myfile.txt",
"errorPath": "wasb://container#accountName.blob.core.net/Error/error.txt",
"schema": [
{
"name": "column1",
"type": "string",
"allowNull": true,
"minLength": 12,
"maxLength": 50
},
{
"name": "column2",
"type": "int",
"allowNull": true,
"minLength": 0,
"maxLength": 0
},
{
"name": "column3",
"type": "bool",
"allowNull": true,
"minLength": 0,
"maxLength": 0
},
{
"name": "column4",
"type": "DateTime",
"allowNull": false,
"minLength": 0,
"maxLength": 0
}
]
}
You can write your own custom extractor that reads the data (input.BaseStream) and builds your object. Take a look at the Microsoft JSON extractor for the pattern.
Note that your extractor has a main-memory limit of about 0.5 GB.
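For example, a minimal custom extractor along those lines might read the whole stream, deserialise it into a POCO with Json.NET, and emit its contents as rows; the POCO and column names below are assumptions based on the sample file, and the U-SQL interfaces are the ones the Microsoft samples use:

using System.Collections.Generic;
using System.IO;
using Microsoft.Analytics.Interfaces;
using Newtonsoft.Json;

// POCOs matching the sample configuration file (names are assumptions).
public class ColumnSchema
{
    public string name { get; set; }
    public string type { get; set; }
    public bool allowNull { get; set; }
    public int minLength { get; set; }
    public int maxLength { get; set; }
}

public class JobConfig
{
    public string sourcePath { get; set; }
    public string destinationPath { get; set; }
    public string errorPath { get; set; }
    public List<ColumnSchema> schema { get; set; }
}

[SqlUserDefinedExtractor(AtomicFileProcessing = true)] // process the file as a single unit
public class ConfigExtractor : IExtractor
{
    public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
    {
        using (var reader = new StreamReader(input.BaseStream))
        {
            // Deserialise the whole file into one C# object.
            var config = JsonConvert.DeserializeObject<JobConfig>(reader.ReadToEnd());

            // Emit one row per schema entry; the whole config object stays available in memory.
            foreach (var column in config.schema)
            {
                output.Set("name", column.name);
                output.Set("type", column.type);
                output.Set("allowNull", column.allowNull);
                yield return output.AsReadOnly();
            }
        }
    }
}

AtomicFileProcessing = true keeps the JSON document on a single vertex, which is also why the roughly 0.5 GB memory limit mentioned above applies to the whole file.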