Compare two structs in BigQuery recursively ignoring key order

Let's say I have a complex struct, perhaps with ten levels of nesting and repeated fields. Is there a built-in way to compare two such objects to see if they are the same apart from the ordering of keys? This may be related to: Compare two json values for equality. An example might be:
{
  "id": "0001",
  "type": "donut",
  "topping": [
    { "id": "5001", "type": "None" },
    { "id": "5002", "type": "Glazed" },
    { "id": "5005", "type": "Sugar" },
    { "type": "Powdered Sugar", "id": "5007" },
    { "id": "5006", "type": "Chocolate with Sprinkles" },
    { "id": "5003", "type": "Chocolate" },
    { "id": "5004", "type": "Maple" }
  ],
  "name": "Cake",
  "ppu": 0.55,
  "batters": {
    "batter": [
      { "id": "1001", "type": "Regular" },
      { "id": "1002", "type": "Chocolate" },
      { "id": "1003", "type": "Blueberry" },
      { "id": "1004", "type": "Devil's Food" }
    ]
  }
}
Versus:
{
  "id": "0001",
  "type": "donut",
  "name": "Cake",
  "ppu": 0.55,
  "batters": {
    "batter": [
      { "id": "1001", "type": "Regular" },
      { "id": "1002", "type": "Chocolate" },
      { "id": "1003", "type": "Blueberry" },
      { "id": "1004", "type": "Devil's Food" }
    ]
  },
  "topping": [
    { "type": "None", "id": "5001" },
    { "id": "5002", "type": "Glazed" },
    { "id": "5005", "type": "Sugar" },
    { "id": "5007", "type": "Powdered Sugar" },
    { "id": "5006", "type": "Chocolate with Sprinkles" },
    { "id": "5003", "type": "Chocolate" },
    { "id": "5004", "type": "Maple" }
  ]
}

From the previous question, we learned that the JSON type normalizes key order, so it is just a matter of using the JSON type as a proxy to compare the two structs: serialize each struct to a JSON string, parse that back into a JSON value, and compare the re-serialized results.
WITH data AS (
  SELECT
    STRUCT<a STRING, b STRUCT<x STRING, y STRING>>('a', ('x', 'y')) AS col1,
    STRUCT<b STRUCT<y STRING, x STRING>, a STRING>(('y', 'x'), 'a') AS col2
)
SELECT
  col1,
  col2,
  -- Round-tripping through the JSON type normalizes key order before comparing.
  TO_JSON_STRING(PARSE_JSON(TO_JSON_STRING(col1))) = TO_JSON_STRING(PARSE_JSON(TO_JSON_STRING(col2))) AS is_equal
FROM data;
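
If the two documents arrive as JSON text rather than structs, the same round trip applies directly. A minimal sketch, using abbreviated versions of the donut documents above as string literals:

WITH docs AS (
  SELECT
    '{"id": "0001", "type": "donut", "name": "Cake"}' AS doc_a,
    '{"name": "Cake", "id": "0001", "type": "donut"}' AS doc_b
)
SELECT
  -- PARSE_JSON normalizes object key order, so this returns TRUE here.
  TO_JSON_STRING(PARSE_JSON(doc_a)) = TO_JSON_STRING(PARSE_JSON(doc_b)) AS same_ignoring_key_order
FROM docs;

Note that only object key order is normalized this way; array elements (such as topping) are still compared position by position.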

Related

Avro Schema: multiple records reference same data type issue: Unknown union branch

I have an Avro schema in which the Customer record references the CustomerAddress record:
[
  {
    "type": "record",
    "namespace": "com.example",
    "name": "CustomerAddress",
    "fields": [
      { "name": "address", "type": "string" },
      { "name": "city", "type": "string" },
      { "name": "postcode", "type": ["string", "int"] },
      { "name": "type", "type": { "type": "enum", "name": "type", "symbols": ["POBOX", "RESIDENTIAL", "ENTERPRISE"] } }
    ]
  },
  {
    "type": "record",
    "namespace": "com.example",
    "name": "Customer",
    "fields": [
      { "name": "first_name", "type": "string" },
      { "name": "middle_name", "type": ["null", "string"], "default": null },
      { "name": "last_name", "type": "string" },
      { "name": "age", "type": "int" },
      { "name": "height", "type": "float" },
      { "name": "weight", "type": "float" },
      { "name": "automated_email", "type": "boolean", "default": true },
      { "name": "customer_emails", "type": { "type": "array", "items": "string" }, "default": [] },
      { "name": "customer_address", "type": "com.example.CustomerAddress" }
    ]
  }
]
I have this JSON payload:
{
  "Customer": {
    "first_name": "John",
    "middle_name": null,
    "last_name": "Smith",
    "age": 25,
    "height": 177.6,
    "weight": 120.6,
    "automated_email": true,
    "customer_emails": ["ning.chang#td.com", "test#td.com"],
    "customer_address": {
      "address": "21 2nd Street",
      "city": "New York",
      "postcode": "10021",
      "type": "RESIDENTIAL"
    }
  }
}
When I run the command java -jar avro-tools-1.8.2.jar fromjson --schema-file customer.avsc customer.json, I get the following exception:
Exception in thread "main" org.apache.avro.AvroTypeException: Unknown union branch Customer
In your JSON data you use the key Customer, but you have to use the fully qualified name, so it should be com.example.Customer.
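Applied to the payload above, only the outer key needs to change (a sketch; the inner fields stay as they are):
{
  "com.example.Customer": {
    "first_name": "John",
    ...
  }
}
Note that the JSON decoder applies the same rule to every union in the schema, so the union-typed postcode field (["string", "int"]) must also be written with its branch tag, e.g. "postcode": {"string": "10021"}.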

Running data flows in Azure Data Factory 4 times slower than running in Azure SSIS data flow

Here are the details of this (very simple) performance test. I'm trying to understand why running data flows in the cloud-native Azure Data Factory environment (Spark) is so much slower than running data flows hosted in the Azure SSIS IR. My results show that the latest ADFv2 is over 4 times slower than the exact same data flow in Azure SSIS, even with an IR cluster still warm from a previous run. I like all the new features of the v2 data flows, but they hardly seem worth the performance hit unless I'm completely missing something. Eventually I'll be adding more complex data flows, but first I wanted to understand the baseline performance behavior.
Source: 1 GB CSV stored in blob storage
Destination: Azure SQL Database (one table, truncated before each run)
Control flow in ADFv2 with a simple Copy Activity (no data flow): 91 seconds
Native SSIS package with a data flow (Azure Feature Pack pulling from the same blob storage), running on Azure SSIS with 8 cores: 76 seconds
Pure ADF cloud pipeline using a data flow on a warm Azure IR (cached from the previous run), 8 cores (+ 8 driver cores), default Spark partitioning: 360 seconds (this includes 96 seconds of cluster startup, which is another thing I don't understand, since the TTL on the IR is 30 minutes and it had run just 10 minutes prior)
Pipeline (LandWithCopy)
{
  "name": "LandWithCopy",
  "properties": {
    "activities": [
      {
        "name": "CopyData",
        "type": "Copy",
        "dependsOn": [],
        "policy": {
          "timeout": "7.00:00:00",
          "retry": 0,
          "retryIntervalInSeconds": 30,
          "secureOutput": false,
          "secureInput": false
        },
        "userProperties": [],
        "typeProperties": {
          "source": {
            "type": "DelimitedTextSource",
            "storeSettings": {
              "type": "AzureBlobStorageReadSettings",
              "recursive": true,
              "wildcardFileName": "data.csv",
              "enablePartitionDiscovery": false
            },
            "formatSettings": {
              "type": "DelimitedTextReadSettings"
            }
          },
          "sink": {
            "type": "AzureSqlSink",
            "preCopyScript": "TRUNCATE TABLE PatientAR",
            "disableMetricsCollection": false
          },
          "enableStaging": false,
          "translator": {
            "type": "TabularTranslator",
            "mappings": [
              { "source": { "name": "RecordAction", "type": "String" }, "sink": { "name": "RecordAction", "type": "String" } },
              { "source": { "name": "UniqueId", "type": "String" }, "sink": { "name": "UniqueId", "type": "String" } },
              { "source": { "name": "Type", "type": "String" }, "sink": { "name": "Type", "type": "String" } },
              { "source": { "name": "TypeDescription", "type": "String" }, "sink": { "name": "TypeDescription", "type": "String" } },
              { "source": { "name": "PatientId", "type": "String" }, "sink": { "name": "PatientId", "type": "String" } },
              { "source": { "name": "PatientVisitId", "type": "String" }, "sink": { "name": "PatientVisitId", "type": "String" } },
              { "source": { "name": "VisitDateOfService", "type": "String" }, "sink": { "name": "VisitDateOfService", "type": "String" } },
              { "source": { "name": "VisitDateOfEntry", "type": "String" }, "sink": { "name": "VisitDateOfEntry", "type": "String" } },
              { "source": { "name": "DoctorId", "type": "String" }, "sink": { "name": "DoctorId", "type": "String" } },
              { "source": { "name": "DoctorName", "type": "String" }, "sink": { "name": "DoctorName", "type": "String" } },
              { "source": { "name": "FacilityId", "type": "String" }, "sink": { "name": "FacilityId", "type": "String" } },
              { "source": { "name": "FacilityName", "type": "String" }, "sink": { "name": "FacilityName", "type": "String" } },
              { "source": { "name": "CompanyName", "type": "String" }, "sink": { "name": "CompanyName", "type": "String" } },
              { "source": { "name": "TicketNumber", "type": "String" }, "sink": { "name": "TicketNumber", "type": "String" } },
              { "source": { "name": "TransactionDateOfEntry", "type": "String" }, "sink": { "name": "TransactionDateOfEntry", "type": "String" } },
              { "source": { "name": "InternalCode", "type": "String" }, "sink": { "name": "InternalCode", "type": "String" } },
              { "source": { "name": "ExternalCode", "type": "String" }, "sink": { "name": "ExternalCode", "type": "String" } },
              { "source": { "name": "Description", "type": "String" }, "sink": { "name": "Description", "type": "String" } },
              { "source": { "name": "Fee", "type": "String" }, "sink": { "name": "Fee", "type": "String" } },
              { "source": { "name": "Units", "type": "String" }, "sink": { "name": "Units", "type": "String" } },
              { "source": { "name": "AREffect", "type": "String" }, "sink": { "name": "AREffect", "type": "String" } },
              { "source": { "name": "Action", "type": "String" }, "sink": { "name": "Action", "type": "String" } },
              { "source": { "name": "InsuranceGroup", "type": "String" }, "sink": { "name": "InsuranceGroup", "type": "String" } },
              { "source": { "name": "Payer", "type": "String" }, "sink": { "name": "Payer", "type": "String" } },
              { "source": { "name": "PayerType", "type": "String" }, "sink": { "name": "PayerType", "type": "String" } },
              { "source": { "name": "PatBalance", "type": "String" }, "sink": { "name": "PatBalance", "type": "String" } },
              { "source": { "name": "InsBalance", "type": "String" }, "sink": { "name": "InsBalance", "type": "String" } },
              { "source": { "name": "Charges", "type": "String" }, "sink": { "name": "Charges", "type": "String" } },
              { "source": { "name": "Payments", "type": "String" }, "sink": { "name": "Payments", "type": "String" } },
              { "source": { "name": "Adjustments", "type": "String" }, "sink": { "name": "Adjustments", "type": "String" } },
              { "source": { "name": "TransferAmount", "type": "String" }, "sink": { "name": "TransferAmount", "type": "String" } },
              { "source": { "name": "FiledAmount", "type": "String" }, "sink": { "name": "FiledAmount", "type": "String" } },
              { "source": { "name": "CheckNumber", "type": "String" }, "sink": { "name": "CheckNumber", "type": "String" } },
              { "source": { "name": "CheckDate", "type": "String" }, "sink": { "name": "CheckDate", "type": "String" } },
              { "source": { "name": "Created", "type": "String" }, "sink": { "name": "Created", "type": "String" } },
              { "source": { "name": "ClientTag", "type": "String" }, "sink": { "name": "ClientTag", "type": "String" } }
            ]
          }
        },
        "inputs": [
          { "referenceName": "PAR_Source_DS", "type": "DatasetReference" }
        ],
        "outputs": [
          { "referenceName": "PAR_Sink_DS", "type": "DatasetReference" }
        ]
      }
    ],
    "annotations": []
  }
}
Pipeline Data Flow (LandWithFlow)
{
  "name": "WriteData",
  "properties": {
    "type": "MappingDataFlow",
    "typeProperties": {
      "sources": [
        {
          "dataset": { "referenceName": "PAR_Source_DS", "type": "DatasetReference" },
          "name": "GetData"
        }
      ],
      "sinks": [
        {
          "dataset": { "referenceName": "PAR_Sink_DS", "type": "DatasetReference" },
          "name": "WriteData"
        }
      ],
      "transformations": [],
"script": "source(output(\n\t\tRecordAction as string,\n\t\tUniqueId as string,\n\t\tType as string,\n\t\tTypeDescription as string,\n\t\tPatientId as string,\n\t\tPatientVisitId as string,\n\t\tVisitDateOfService as string,\n\t\tVisitDateOfEntry as string,\n\t\tDoctorId as string,\n\t\tDoctorName as string,\n\t\tFacilityId as string,\n\t\tFacilityName as string,\n\t\tCompanyName as string,\n\t\tTicketNumber as string,\n\t\tTransactionDateOfEntry as string,\n\t\tInternalCode as string,\n\t\tExternalCode as string,\n\t\tDescription as string,\n\t\tFee as string,\n\t\tUnits as string,\n\t\tAREffect as string,\n\t\tAction as string,\n\t\tInsuranceGroup as string,\n\t\tPayer as string,\n\t\tPayerType as string,\n\t\tPatBalance as string,\n\t\tInsBalance as string,\n\t\tCharges as string,\n\t\tPayments as string,\n\t\tAdjustments as string,\n\t\tTransferAmount as string,\n\t\tFiledAmount as string,\n\t\tCheckNumber as string,\n\t\tCheckDate as string,\n\t\tCreated as string,\n\t\tClientTag as string\n\t),\n\tallowSchemaDrift: true,\n\tvalidateSchema: false,\n\twildcardPaths:['data.csv']) ~> GetData\nGetData sink(input(\n\t\tRecordAction as string,\n\t\tUniqueId as string,\n\t\tType as string,\n\t\tTypeDescription as string,\n\t\tPatientId as string,\n\t\tPatientVisitId as string,\n\t\tVisitDateOfService as string,\n\t\tVisitDateOfEntry as string,\n\t\tDoctorId as string,\n\t\tDoctorName as string,\n\t\tFacilityId as string,\n\t\tFacilityName as string,\n\t\tCompanyName as string,\n\t\tTicketNumber as string,\n\t\tTransactionDateOfEntry as string,\n\t\tInternalCode as string,\n\t\tExternalCode as string,\n\t\tDescription as string,\n\t\tFee as string,\n\t\tUnits as string,\n\t\tAREffect as string,\n\t\tAction as string,\n\t\tInsuranceGroup as string,\n\t\tPayer as string,\n\t\tPayerType as string,\n\t\tPatBalance as string,\n\t\tInsBalance as string,\n\t\tCharges as string,\n\t\tPayments as string,\n\t\tAdjustments as string,\n\t\tTransferAmount as string,\n\t\tFiledAmount as string,\n\t\tCheckNumber as string,\n\t\tCheckDate as string,\n\t\tCreated as string,\n\t\tClientTag as string,\n\t\tFileName as string,\n\t\tPractice as string\n\t),\n\tallowSchemaDrift: true,\n\tvalidateSchema: false,\n\tdeletable:false,\n\tinsertable:true,\n\tupdateable:false,\n\tupsertable:false,\n\tformat: 'table',\n\tpreSQLs:['TRUNCATE TABLE PatientAR'],\n\tmapColumn(\n\t\tRecordAction,\n\t\tUniqueId,\n\t\tType,\n\t\tTypeDescription,\n\t\tPatientId,\n\t\tPatientVisitId,\n\t\tVisitDateOfService,\n\t\tVisitDateOfEntry,\n\t\tDoctorId,\n\t\tDoctorName,\n\t\tFacilityId,\n\t\tFacilityName,\n\t\tCompanyName,\n\t\tTicketNumber,\n\t\tTransactionDateOfEntry,\n\t\tInternalCode,\n\t\tExternalCode,\n\t\tDescription,\n\t\tFee,\n\t\tUnits,\n\t\tAREffect,\n\t\tAction,\n\t\tInsuranceGroup,\n\t\tPayer,\n\t\tPayerType,\n\t\tPatBalance,\n\t\tInsBalance,\n\t\tCharges,\n\t\tPayments,\n\t\tAdjustments,\n\t\tTransferAmount,\n\t\tFiledAmount,\n\t\tCheckNumber,\n\t\tCheckDate,\n\t\tCreated,\n\t\tClientTag\n\t),\n\tskipDuplicateMapInputs: true,\n\tskipDuplicateMapOutputs: true) ~> WriteData"
    }
  }
}
We are having the same issue: a Copy Activity without a Data Flow is much faster than a Data Flow. Our scenario is simply copying 13 tables from source to destination based on a WHERE clause. We currently have two Copy Activities, which together take 1.5 minutes, so I thought I would build a Data Flow with one source and two sinks instead. But it runs for 5 to 8 minutes, depending on cluster startup time. Not sure if I'm doing anything wrong; hope we get an answer.

DocumentDB filter documents on multiple items in a child array

I have a Cosmos DB database with documents that have the following form:
{
  "Id": "1",
  "Price": 200,
  "Properties": [
    { "Name": "Name1", "Type": "Type1" },
    { "Name": "Name2", "Type": "Type2" }
  ]
},
{
  "Id": "2",
  "Price": 500,
  "Properties": [
    { "Name": "Name1", "Type": "Type1" },
    { "Name": "Name2", "Type": "Type3" }
  ]
},
{
  "Id": "3",
  "Price": 400,
  "Properties": [
    { "Name": "Name1", "Type": "Type2" }
  ]
}
I would like to create a query that returns documents satisfying multiple property constraints. For example, I would like to retrieve the documents that have properties of both Type1 and Type2; the result should contain only the document with Id = 1.
SELECT c.Id
FROM c
WHERE ARRAY_CONTAINS(c.Properties, {'Type': 'Type1' }, true)
AND ARRAY_CONTAINS(c.Properties, {'Type': 'Type2' }, true)
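The same filter can also be expressed with correlated subqueries, which helps when the match needs more than the exact-fragment semantics of ARRAY_CONTAINS. An equivalent sketch:
SELECT c.Id
FROM c
WHERE EXISTS (SELECT VALUE p FROM p IN c.Properties WHERE p.Type = 'Type1')
AND EXISTS (SELECT VALUE p FROM p IN c.Properties WHERE p.Type = 'Type2')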

Using a json schema in multiple layouts

I'm helping to build an interface that works with JSON Schema, and I have a question about generating the interface from that schema. There are two display types: one for internal users and one for external users. Both deal with the same data, but external users should see a smaller subset of fields than internal users.
For example, here is one schema; it defines an obituary:
{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "description": "",
  "type": "object",
  "required": ["id", "deceased"],
  "properties": {
    "id": { "type": "string" },
    "account": {
      "type": "object",
      "required": ["name"],
      "properties": {
        "id": { "type": "number" },
        "name": { "type": "string" },
        "website": {
          "anyOf": [
            { "type": "string", "format": "uri" },
            { "type": "string", "maxLength": 0 }
          ]
        },
        "email": {
          "anyOf": [
            { "type": "string", "format": "email" },
            { "type": "string", "maxLength": 0 }
          ]
        },
        "address": {
          "type": "object",
          "properties": {
            "address1": { "type": "string" },
            "address2": { "type": "string" },
            "city": { "type": "string" },
            "state": { "type": "string" },
            "postalCode": { "type": "string" },
            "country": { "type": "string" }
          }
        },
        "phoneNumber": {
          "anyOf": [
            { "type": "string", "format": "phone" },
            { "type": "string", "maxLength": 0 }
          ]
        },
        "faxNumber": {
          "anyOf": [
            { "type": "string", "format": "phone" },
            { "type": "string", "maxLength": 0 }
          ]
        },
        "type": { "type": "string" }
      }
    },
    "deceased": {
      "type": "object",
      "required": ["fullName"],
      "properties": {
        "fullName": { "type": "string" },
        "prefix": { "type": "string" },
        "firstName": { "type": "string" },
        "middleName": { "type": "string" },
        "nickName": { "type": "string" },
        "lastName1": { "type": "string" },
        "lastName2": { "type": "string" },
        "maidenName": { "type": "string" },
        "suffix": { "type": "string" }
      }
    },
    "description": { "type": "string" },
    "photos": {
      "type": "array",
      "items": { "type": "string" }
    }
  }
}
Internal users would be able to access all the fields, but external users shouldn't be able to read/write the account fields.
Should I make a second schema for the external users, or is there a way to indicate different display levels or public/private on each field?
You cannot restrict access to fields defined in a schema, but you can have two schema files: one defining the "public" fields, and the other defining the restricted fields while including the public schema via a reference.
So
public-schema.json:
{
  "properties" : {
    "id" : ...
  }
}
restricted-schema.json:
{
  "allOf" : [
    {
      "$ref" : "./public-schema.json"
    },
    {
      "properties" : {
        "account": ...
      }
    }
  ]
}
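Concretely, for the obituary schema above, public-schema.json would carry everything except account (a sketch, with field bodies elided with ... as in the snippets above):
{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "type": "object",
  "required": ["id", "deceased"],
  "properties": {
    "id": { "type": "string" },
    "deceased": ...,
    "description": ...,
    "photos": ...
  }
}
External users then validate and render against public-schema.json only, while internal users use restricted-schema.json, which layers account on top via allOf.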

Extracting values from a JSON string

I want to retrieve the values of particular tags from JSON held in an NSString.
NSString *test =
{
  "data": [
    {
      "id": "100002319144563_125257217561582",
      "from": {
        "name": "Umair Ahmed",
        "id": "100002319144563"
      },
      "message": "Hello Umair Here",
      "actions": [
        { "name": "Comment", "link": "http://www.facebook.com/100002319144563/posts/125257217561582" },
        { "name": "Like", "link": "http://www.facebook.com/100002319144563/posts/125257217561582" }
      ],
      "privacy": {
        "description": "Everyone",
        "value": "EVERYONE"
      },
      "type": "status",
      "application": {
        "name": "iPhone",
        "id": "213257025359930"
      },
      "created_time": "2011-07-08T11:59:15+0000",
      "updated_time": "2011-07-08T11:59:15+0000"
    },
    {
      "id": "100002319144563_125251050895532",
      "from": {
        "name": "Umair Ahmed",
        "id": "100002319144563"
      },
      "message": "Hello testing testing",
      "actions": [
        { "name": "Comment", "link": "http://www.facebook.com/100002319144563/posts/125251050895532" },
        { "name": "Like", "link": "http://www.facebook.com/100002319144563/posts/125251050895532" }
      ]
    }
  ]
}
How can I retrieve the name and message tag values into an array or dictionary?
It looks like a JSON string, so just use one of the JSON libraries, such as TouchJSON or JSONKit; they will parse it into dictionaries and arrays from which you can easily extract the name and message values.