Load a JSON file into a C# object using U-SQL - azure-data-lake

I have a JSON file stored in the Data Lake Store. I can extract the JSON file using the JsonExtractor from Microsoft.
Is it possible to load the JSON file into a POCO object without using the EXTRACT command? If I use the EXTRACT command, is it possible for me to combine all the rows into a single C# object?
Below is a sample JSON file which I want to deserialize and store in a C# object:
{
"sourcePath": "wasb://container#accountName.blob.core.net/Input/{*}.txt",
"destinationPath": "wasb://container#accountName.blob.core.net/Output/myfile.txt",
"errorPath": "wasb://container#accountName.blob.core.net/Error/error.txt",
"schema": [
{
"name": "column1",
"type": "string",
"allowNull": true,
"minLength": 12,
"maxLength": 50
},
{
"name": "column2",
"type": "int",
"allowNull": true,
"minLength": 0,
"maxLength": 0
},
{
"name": "column3",
"type": "bool",
"allowNull": true,
"minLength": 0,
"maxLength": 0
},
{
"name": "column4",
"type": "DateTime",
"allowNull": false,
"minLength": 0,
"maxLength": 0
}
]
}

You can write your own custom extractor that reads the data from input.BaseStream and builds your object. Take a look at the Microsoft JSON extractor for the pattern.
Note that your extractor has a main-memory limit of about 0.5 GB.
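For illustration, here is a minimal sketch of what such a custom extractor could look like, assuming Newtonsoft.Json is deployed with the script assembly. The CopyConfig and ColumnSchema POCOs, the ConfigExtractor class name, and the output column names are assumptions that mirror the sample file above, not the Microsoft implementation.
using System.Collections.Generic;
using System.IO;
using Microsoft.Analytics.Interfaces;
using Newtonsoft.Json;

// Hypothetical POCOs mirroring the sample JSON file above.
public class ColumnSchema
{
    public string name { get; set; }
    public string type { get; set; }
    public bool allowNull { get; set; }
    public int minLength { get; set; }
    public int maxLength { get; set; }
}

public class CopyConfig
{
    public string sourcePath { get; set; }
    public string destinationPath { get; set; }
    public string errorPath { get; set; }
    public List<ColumnSchema> schema { get; set; }
}

// AtomicFileProcessing = true hands the whole file to one extractor instance
// instead of splitting it at line boundaries, so the JSON document stays intact.
[SqlUserDefinedExtractor(AtomicFileProcessing = true)]
public class ConfigExtractor : IExtractor
{
    public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
    {
        using (var reader = new StreamReader(input.BaseStream))
        {
            // Deserialize the entire file into a single C# object.
            var config = JsonConvert.DeserializeObject<CopyConfig>(reader.ReadToEnd());

            // Emit one row per schema entry; the path values repeat on every row.
            foreach (var column in config.schema)
            {
                output.Set("sourcePath", config.sourcePath);
                output.Set("columnName", column.name);
                output.Set("columnType", column.type);
                yield return output.AsReadOnly();
            }
        }
    }
}
The U-SQL script would then declare matching columns in the EXTRACT statement, e.g. @config = EXTRACT sourcePath string, columnName string, columnType string FROM "/Input/config.json" USING new ConfigExtractor(); (the path is a placeholder).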

Related

Nullable date in Avro schema for Google Pub/Sub

I'm using Avro as the schema for Google Pub/Sub to write directly to BigQuery.
One of the fields can be null, so I've written my Avro schema like this:
{
"type": "record",
"name": "Avro",
"fields": [
{
"name": "id",
"type": "string"
},
{
"name": "status",
"type": "string"
},
{
"name": "createDate",
"type": "string"
},
{
"name": "purchaseDate",
"type": ["null", "string"]
}
]
}
However, for an input to fit this schema, it has to look like one of the examples below:
{
"id": "123",
"status": "not-purchased",
"createDate": "2023-01-17T04:49:16.966Z",
"purchaseDate": null
}
{
"id": "123",
"status": "purchased",
"createDate": "2023-01-17T04:49:16.966Z",
"purchaseDate": {
"string": "2023-01-17T04:49:16.966Z"
}
}
The input in the 2nd example above is not in a format that is expected by the BigQuery subscription. I'm looking for something that looks like this instead:
{
"id": "123",
"status": "purchased",
"createDate": "2023-01-17T04:49:16.966Z",
"purchaseDate": "2023-01-17T04:49:16.966Z"
}
Is there something I did wrong with the Avro schema, or is that just how nullable fields work in Avro?
This is a known issue with Pub/Sub BigQuery subscriptions. You can follow the progress on a fix in the issue tracker. Once it is fixed, the example that uses the string keyword should work for inserting into BigQuery via a Pub/Sub subscription.
Nullability isn't the issue here; ["null", "string"] is simply a union type.
The string key is required for union values, as per the Avro JSON encoding specification.

Load Avro file with nested record to BigQuery using customized column names

I was trying to load an Avro file with a nested record. One of the fields has a union of schemas. When loaded into BigQuery, each union element got a very long column name like com_mycompany_data_nestedClassname_value. Is there a way to specify the name without having the full package name prefixed?
For example, take the following Avro schema:
{
"type": "record",
"name": "EventRecording",
"namespace": "com.something.event",
"fields": [
{
"name": "eventName",
"type": "string"
},
{
"name": "eventTime",
"type": "long"
},
{
"name": "userId",
"type": "string"
},
{
"name": "eventDetail",
"type": [
{
"type": "record",
"name": "Network",
"namespace": "com.something.event",
"fields": [
{
"name": "hostName",
"type": "string"
},
{
"name": "ipAddress",
"type": "string"
}
]
},
{
"type": "record",
"name": "DiskIO",
"namespace": "com.something.event",
"fields": [
{
"name": "path",
"type": "string"
},
{
"name": "bytesRead",
"type": "long"
}
]
}
]
}
]
}
BigQuery came up with long field names like eventDetail.com_something_event_Network_value. Is it possible to make them something like eventDetail.Network instead?
Avro loading is not as flexible as it should be in BigQuery (a basic example is that it does not support loading only a subset of the fields, i.e. a reader schema). Renaming columns is also not supported in BigQuery today (refer here). The only option is to recreate the table with the proper names, i.e. create a new table from your existing table.
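As a hedged sketch of that last option, the snippet below uses the Google.Cloud.BigQuery.V2 client to create a renamed copy of the table, rebuilding eventDetail with shorter member names. The project, dataset, and table identifiers are placeholders, and the long generated column names (com_something_event_Network_value, com_something_event_DiskIO_value) are inferred from the question, so verify them against your actual table first.
using Google.Cloud.BigQuery.V2;

public static class RenameUnionColumns
{
    public static void Main()
    {
        // Placeholder project id; the client uses application default credentials.
        var client = BigQueryClient.Create("my-project");

        // Rebuild eventDetail as a STRUCT with short member names and write the
        // result to a new table. The nested column names are assumptions based
        // on the question and may differ in your table.
        var sql = @"
            CREATE OR REPLACE TABLE `my_dataset.event_recording_renamed` AS
            SELECT
              eventName,
              eventTime,
              userId,
              STRUCT(
                eventDetail.com_something_event_Network_value AS Network,
                eventDetail.com_something_event_DiskIO_value AS DiskIO
              ) AS eventDetail
            FROM `my_dataset.event_recording`";

        client.ExecuteQuery(sql, parameters: null);
    }
}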

Reference object JSON schema for entity in array JSON schema

I am using Anypoint Studio 7.2.3 and Mule runtime 4.1 to write my RAML.
I have a JSON schema for an order object, and I also need a JSON schema for a list of order objects. I thought I could reference the order object from the JSON schema for the list of orders, to avoid maintaining the same fields in two schemas, but I am seeing an error because $schema appears twice when I add a JSON example of an order list to the RAML.
Is it possible to have a separate order object JSON schema that can be referenced by an order list JSON schema?
Order Object JSON Schema (cut down version)
{
"type": "object",
"$schema": "http://json-schema.org/draft-04/schema",
"properties": {
"orderId": {
"type": "string",
"maxLength": 255
},
"comments": {
"type": "string",
"maxLength": 255
}
},
"required": [
"orderId"
]
}
Order List JSON Schema
{
"type": "array",
"$schema": "http://json-schema.org/draft-04/schema",
"properties": {
"$ref": "#Order"
}
}
The JSON schema below works for the order list, but it means I will need to maintain the fields in two separate schemas, so any change to e.g. orderId will have to be made in both the order object schema and the order list schema.
{
"type": "array",
"$schema": "http://json-schema.org/draft-04/schema",
"properties": {
"orderId": {
"type": "string",
"maxLength": 255
},
"comments": {
"type": "string",
"maxLength": 255
}
},
"required": [
"orderId"
]
}
Thanks for any help.
Just change "properties" to "items" in your order list schema. Then reference the name of the schema file if it is in the same folder, or the full path to a file located elsewhere. See the code below:
{
"type": "array",
"$schema": "http://json-schema.org/draft-04/schema",
"items": {
"$ref": "order.schema.json"
}
}

Bigquery fails to load data from Google Cloud Storage

I have a json record which looks like
{"customer_id":"2349uslvn2q3","order_id":"9sufd23rdl40",
"line_item": [{"line":"1","sku":"10","amount":10},
{"line":"2","sku":"20","amount":20}]}
I am trying to load the record above into a table with the following schema definition:
"fields": [
{
"mode": "NULLABLE",
"name": "customer_id",
"type": "STRING"
},
{
"mode": "NULLABLE",
"name": "order_id",
"type": "STRING"
},
{
"mode": "REPEATED",
"name": "line_item",
"type": "STRING"
}
]
I am getting the following error message:
JSON parsing error in row starting at position 0 at file:
gs://gcs_bucket/file0. JSON object specified for non-record field:
line_item
I want the line_item JSON, which can contain more than one element, to be stored as an array of JSON strings in the line_item column of the table.
Any suggestion?
The first thing is that your input JSON shouldn't contain a "\n" character inside a record (each record must be on a single line), so you should save it like:
{"customer_id":"2349uslvn2q3","order_id":"9sufd23rdl40", "line_item": [{"line":"1","sku":"10","amount":10}, {"line":"2","sku":"20","amount":20}]}
An example of how your JSON file should look:
{"customer_id":"2349uslvn2q3","order_id":"9sufd23rdl40", "line_item": [{"line":"1","sku":"10","amount":10}, {"line":"2","sku":"20","amount":20}]}
{"customer_id":"2","order_id":"2", "line_item": [{"line":"2","sku":"20","amount":20}, {"line":"2","sku":"20","amount":20}]}
{"customer_id":"3","order_id":"3", "line_item": [{"line":"3","sku":"30","amount":30}, {"line":"3","sku":"30","amount":30}]}
And also your schema is not correct. It should be:
[
{
"mode": "NULLABLE",
"name": "customer_id",
"type": "STRING"
},
{
"mode": "NULLABLE",
"name": "order_id",
"type": "STRING"
},
{
"mode": "REPEATED",
"name": "line_item",
"type": "RECORD",
"fields": [{"name": "line", "type": "STRING"}, {"name": "sku", "type": "STRING"}, {"name": "amount", "type": "INTEGER"}]
}
]
For a better understanding of how schemas work, I've tried writing sort of a guide in this answer. Hopefully it can be of some value.
If your data is saved, for instance, in a file called gs://gcs_bucket/file0 and your schema in schema.json, then this command should work for you:
bq load --source_format=NEWLINE_DELIMITED_JSON dataset.table gs://gcs_bucket/file0 schema.json
(assuming you are using the CLI tool, as seems to be the case from your question).

Error when extracting data from Azure Table Storage using Azure Data Factory

I want to copy data from Azure Table Storage to Azure SQL Server using Azure Data Factory, but I get a strange error.
In my Azure Table Storage I have a column which contains multiple data types (this is how Table Storage works), e.g. DateTime and String.
In my Data Factory project I declared the entire column as String, but for some reason Data Factory infers the data type from the first cell it encounters during extraction.
In my Azure SQL Server database all columns are string.
Example
I have this table in Azure Table Storage: Flights
RowKey PartitionKey ArrivalTime
--------------------------------------------------
1332-2 2213dcsa-213 04/11/2017 04:53:21.707 PM - this cell is DateTime
1332-2 2213dcsa-214 DateTime.Null - this cell is String
If my table is like the one below, the copy process works, because the first row is a String and the entire column gets converted to String.
RowKey PartitionKey ArrivalTime
--------------------------------------------------
1332-2 2213dcsa-214 DateTime.Null - this cell is String
1332-2 2213dcsa-213 04/11/2017 04:53:21.707 PM - this cell is DateTime
Note: I am not allowed to change the data type in Azure Table Storage, to move the rows, or to add new ones.
Below are the input and output data sets from Azure Data Factory:
"datasets": [
{
"name": "InputDataset",
"properties": {
"structure": [
{
"name": "PartitionKey",
"type": "String"
},
{
"name": "RowKey",
"type": "String"
},
{
"name": "ArrivalTime",
"type": "String"
}
],
"published": false,
"type": "AzureTable",
"linkedServiceName": "Source-AzureTable",
"typeProperties": {
"tableName": "flights"
},
"availability": {
"frequency": "Day",
"interval": 1
},
"external": true,
"policy": {}
}
},
{
"name": "OutputDataset",
"properties": {
"structure": [
{
"name": "PartitionKey",
"type": "String"
},
{
"name": "RowKey",
"type": "String"
},
{
"name": "ArrivalTime",
"type": "String"
}
],
"published": false,
"type": "AzureSqlTable",
"linkedServiceName": "Destination-SQLAzure",
"typeProperties": {
"tableName": "[dbo].[flights]"
},
"availability": {
"frequency": "Day",
"interval": 1
},
"external": false,
"policy": {}
}
}
]
Does anyone know a solution to this issue?
I've just been playing around with this. I think you have two options to deal with it.
Option 1
Simply remove the data type attribute from your input dataset. In the 'structure' block of the input JSON table dataset you don't have to specify the type attribute; remove it or comment it out.
For example:
{
"name": "InputDataset-ghm",
"properties": {
"structure": [
{
"name": "PartitionKey",
"type": "String"
},
{
"name": "RowKey",
"type": "String"
},
{
"name": "ArrivalTime"
/* "type": "String" --<<<<<< Optional! */
},
This should mean the data type is not validated on read.
Option 2
Use a custom activity upstream of the SQL DB table load to cleanse and transform the table data. This will mean breaking out into C# and will require a lot more dev time, but you may want to reuse the cleansing code for other datasets.
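As a rough illustration of the cleansing step (not the Data Factory custom-activity boilerplate itself), here is a sketch that normalizes a mixed-type ArrivalTime cell to a plain string. The ArrivalTimeCleanser class, the NormalizeArrivalTime helper, and the handling of the "DateTime.Null" sentinel are assumptions based on the sample rows in the question.
using System;
using System.Globalization;

public static class ArrivalTimeCleanser
{
    // Normalizes a cell that may hold either a DateTime or a string such as
    // "DateTime.Null" into a plain string suitable for a string column in SQL.
    public static string NormalizeArrivalTime(object cell)
    {
        if (cell == null)
        {
            return null;
        }

        if (cell is DateTime dt)
        {
            // Culture-independent round-trip format, e.g. 2017-04-11T16:53:21.7070000
            return dt.ToString("o", CultureInfo.InvariantCulture);
        }

        var text = cell.ToString();

        // Treat the "DateTime.Null" sentinel as a missing value (assumption).
        return text == "DateTime.Null" ? null : text;
    }
}
Inside the custom activity you would apply something like this to each row before writing it to the SQL sink.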
Hope this helps.