Amazon Personalize dataset import job creation failed - amazon-s3

My schema looks like this:
{
"type": "record",
"name": "Interactions",
"namespace": "com.amazonaws.personalize.schema",
"fields": [
{
"name": "USER_ID",
"type": "string"
},
{
"name": "ITEM_ID",
"type": "string"
},
{
"name": "TIMESTAMP",
"type": "long"
},
{
"name": "EVENT_TYPE",
"type": "string"
},
{
"name": "EVENT_VALUE",
"type": "float"
},
{
"name": "SESSION_ID",
"type": "string"
},
{
"name": "SERVICE_TYPE",
"type": "string"
},
{
"name": "SERVICE_LOCATION",
"type": "string"
},
{
"name": "SERVICE_PRICE",
"type": "int"
},
{
"name": "SERVICE_TIME",
"type": "long"
},
{
"name": "USER_LOCATION",
"type": "string"
}
]
}
I uploaded my .csv file to the S3 bucket user-flights-bucket. When I tried to import it into Amazon Personalize, the dataset import job failed with the reason:
Path does not exist: s3://user-flights-bucket/null;

Give the full path to the file, including the file name, in the data location: s3://user-flights-bucket/<your file name>.csv. Pointing the data location at the .csv object itself rather than just the bucket will make the import work; the s3://user-flights-bucket/null in the error indicates that no file name was resolved.
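For illustration, a minimal boto3 sketch of creating the import job with the data location pointing at the file. The job name, dataset ARN, role ARN, and file name below are placeholders, not values from the question:

import boto3

personalize = boto3.client("personalize")

response = personalize.create_dataset_import_job(
    jobName="user-flights-import",                      # placeholder job name
    datasetArn="arn:aws:personalize:...:dataset/...",   # your INTERACTIONS dataset ARN
    dataSource={
        # Full object path, including the file name -- not just the bucket.
        "dataLocation": "s3://user-flights-bucket/your-file.csv"
    },
    roleArn="arn:aws:iam::...:role/PersonalizeS3Role"   # role with read access to the bucket
)
print(response["datasetImportJobArn"])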

Related

Avro Schema: multiple records reference same data type issue: Unknown union branch

I have an Avro schema in which the Customer record references the CustomerAddress record:
[
{
"type": "record",
"namespace": "com.example",
"name": "CustomerAddress",
"fields": [
{ "name": "address", "type": "string" },
{ "name": "city", "type": "string" },
{ "name": "postcode", "type": ["string", "int"] },
{ "name": "type","type": {"type": "enum","name": "type","symbols": ["POBOX","RESIDENTIAL","ENTERPRISE"]}}
]
},
{
"type": "record",
"namespace": "com.example",
"name": "Customer",
"fields": [
{ "name": "first_name", "type": "string" },
{ "name": "middle_name", "type": ["null", "string"], "default": null },
{ "name": "last_name", "type": "string" },
{ "name": "age", "type": "int" },
{ "name": "height", "type": "float" },
{ "name": "weight", "type": "float" },
{ "name": "automated_email", "type": "boolean", "default": true },
{ "name": "customer_emails", "type": {"type": "array","items": "string"},"default": []},
{ "name": "customer_address", "type": "com.example.CustomerAddress" }
]
}
]
I have this JSON payload:
{
"Customer" : {
"first_name": "John",
"middle_name": null,
"last_name": "Smith",
"age": 25,
"height": 177.6,
"weight": 120.6,
"automated_email": true,
"customer_emails": ["ning.chang#td.com", "test#td.com"],
"customer_address":
{
"address": "21 2nd Street",
"city": "New York",
"postcode": "10021",
"type": "RESIDENTIAL"
}
}
}
When I run the command:
java -jar avro-tools-1.8.2.jar fromjson --schema-file customer.avsc customer.json
I get the following exception:
Exception in thread "main" org.apache.avro.AvroTypeException: Unknown union branch Customer
In your JSON data you use the key Customer, but you have to use the fully qualified name, so it should be com.example.Customer.
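For example, since the top level of the schema file is a union of the two records, the JSON has to name the branch by its full name. The payload would then look roughly like this (same data as above, only the outer key changed; note that in Avro's JSON encoding, values of union-typed fields such as postcode also need a branch label):

{
  "com.example.Customer": {
    "first_name": "John",
    "middle_name": null,
    "last_name": "Smith",
    "age": 25,
    "height": 177.6,
    "weight": 120.6,
    "automated_email": true,
    "customer_emails": ["ning.chang#td.com", "test#td.com"],
    "customer_address": {
      "address": "21 2nd Street",
      "city": "New York",
      "postcode": {"string": "10021"},
      "type": "RESIDENTIAL"
    }
  }
}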

Running data flows in Azure Data Factory 4 times slower than running in Azure SSIS data flow

Here are the details of this (very simple) performance test. I'm trying to understand why running data flows in the cloud-native Azure Data Factory environment (Spark) is so much slower than running data flows hosted in the Azure SSIS IR. My results show that the latest ADFv2 data flow is over 4 times slower than the exact same data flow in Azure SSIS, even with an IR cluster still warm from a previous run. I like all the new features of the v2 data flows, but it hardly seems worth the performance hit unless I'm completely missing something. Eventually I'll be adding more complex data flows, but first I wanted to understand the baseline performance behavior.
Source:
1GB CSV stored in blob storage
Destination:
Azure SQL Server Database (one table and truncated before each run)
Control flow in ADFv2 using a simple Copy activity (no data flow):
91 seconds
Native SSIS package with a data flow (Azure Feature Pack pulling from the same blob storage), running on the Azure SSIS IR with 8 cores:
76 seconds
Pure ADF cloud pipeline using a Mapping Data Flow on a warm Azure IR (cached from the previous run), 8 cores (+ 8 driver cores) with default Spark partitioning:
(includes 96 seconds of cluster startup, which is another thing I don't understand since the IR's TTL is 30 minutes and it had run just 10 minutes prior; see the IR definition sketch below)
360 seconds
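For reference, the warm-cluster TTL is configured on the Azure integration runtime itself. Below is a rough sketch of the relevant part of an ADF integration runtime definition; the property names reflect my understanding of the Data Factory JSON format and the values simply mirror the setup described above, so treat it as an assumption to verify against your own IR:

{
  "name": "DataFlowAzureIR",
  "properties": {
    "type": "Managed",
    "typeProperties": {
      "computeProperties": {
        "location": "AutoResolve",
        "dataFlowProperties": {
          "computeType": "General",
          "coreCount": 8,
          "timeToLive": 30
        }
      }
    }
  }
}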
Pipeline (LandWithCopy)
{
"name": "LandWithCopy",
"properties": {
"activities": [
{
"name": "CopyData",
"type": "Copy",
"dependsOn": [],
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"storeSettings": {
"type": "AzureBlobStorageReadSettings",
"recursive": true,
"wildcardFileName": "data.csv",
"enablePartitionDiscovery": false
},
"formatSettings": {
"type": "DelimitedTextReadSettings"
}
},
"sink": {
"type": "AzureSqlSink",
"preCopyScript": "TRUNCATE TABLE PatientAR",
"disableMetricsCollection": false
},
"enableStaging": false,
"translator": {
"type": "TabularTranslator",
"mappings": [
{
"source": {
"name": "RecordAction",
"type": "String"
},
"sink": {
"name": "RecordAction",
"type": "String"
}
},
{
"source": {
"name": "UniqueId",
"type": "String"
},
"sink": {
"name": "UniqueId",
"type": "String"
}
},
{
"source": {
"name": "Type",
"type": "String"
},
"sink": {
"name": "Type",
"type": "String"
}
},
{
"source": {
"name": "TypeDescription",
"type": "String"
},
"sink": {
"name": "TypeDescription",
"type": "String"
}
},
{
"source": {
"name": "PatientId",
"type": "String"
},
"sink": {
"name": "PatientId",
"type": "String"
}
},
{
"source": {
"name": "PatientVisitId",
"type": "String"
},
"sink": {
"name": "PatientVisitId",
"type": "String"
}
},
{
"source": {
"name": "VisitDateOfService",
"type": "String"
},
"sink": {
"name": "VisitDateOfService",
"type": "String"
}
},
{
"source": {
"name": "VisitDateOfEntry",
"type": "String"
},
"sink": {
"name": "VisitDateOfEntry",
"type": "String"
}
},
{
"source": {
"name": "DoctorId",
"type": "String"
},
"sink": {
"name": "DoctorId",
"type": "String"
}
},
{
"source": {
"name": "DoctorName",
"type": "String"
},
"sink": {
"name": "DoctorName",
"type": "String"
}
},
{
"source": {
"name": "FacilityId",
"type": "String"
},
"sink": {
"name": "FacilityId",
"type": "String"
}
},
{
"source": {
"name": "FacilityName",
"type": "String"
},
"sink": {
"name": "FacilityName",
"type": "String"
}
},
{
"source": {
"name": "CompanyName",
"type": "String"
},
"sink": {
"name": "CompanyName",
"type": "String"
}
},
{
"source": {
"name": "TicketNumber",
"type": "String"
},
"sink": {
"name": "TicketNumber",
"type": "String"
}
},
{
"source": {
"name": "TransactionDateOfEntry",
"type": "String"
},
"sink": {
"name": "TransactionDateOfEntry",
"type": "String"
}
},
{
"source": {
"name": "InternalCode",
"type": "String"
},
"sink": {
"name": "InternalCode",
"type": "String"
}
},
{
"source": {
"name": "ExternalCode",
"type": "String"
},
"sink": {
"name": "ExternalCode",
"type": "String"
}
},
{
"source": {
"name": "Description",
"type": "String"
},
"sink": {
"name": "Description",
"type": "String"
}
},
{
"source": {
"name": "Fee",
"type": "String"
},
"sink": {
"name": "Fee",
"type": "String"
}
},
{
"source": {
"name": "Units",
"type": "String"
},
"sink": {
"name": "Units",
"type": "String"
}
},
{
"source": {
"name": "AREffect",
"type": "String"
},
"sink": {
"name": "AREffect",
"type": "String"
}
},
{
"source": {
"name": "Action",
"type": "String"
},
"sink": {
"name": "Action",
"type": "String"
}
},
{
"source": {
"name": "InsuranceGroup",
"type": "String"
},
"sink": {
"name": "InsuranceGroup",
"type": "String"
}
},
{
"source": {
"name": "Payer",
"type": "String"
},
"sink": {
"name": "Payer",
"type": "String"
}
},
{
"source": {
"name": "PayerType",
"type": "String"
},
"sink": {
"name": "PayerType",
"type": "String"
}
},
{
"source": {
"name": "PatBalance",
"type": "String"
},
"sink": {
"name": "PatBalance",
"type": "String"
}
},
{
"source": {
"name": "InsBalance",
"type": "String"
},
"sink": {
"name": "InsBalance",
"type": "String"
}
},
{
"source": {
"name": "Charges",
"type": "String"
},
"sink": {
"name": "Charges",
"type": "String"
}
},
{
"source": {
"name": "Payments",
"type": "String"
},
"sink": {
"name": "Payments",
"type": "String"
}
},
{
"source": {
"name": "Adjustments",
"type": "String"
},
"sink": {
"name": "Adjustments",
"type": "String"
}
},
{
"source": {
"name": "TransferAmount",
"type": "String"
},
"sink": {
"name": "TransferAmount",
"type": "String"
}
},
{
"source": {
"name": "FiledAmount",
"type": "String"
},
"sink": {
"name": "FiledAmount",
"type": "String"
}
},
{
"source": {
"name": "CheckNumber",
"type": "String"
},
"sink": {
"name": "CheckNumber",
"type": "String"
}
},
{
"source": {
"name": "CheckDate",
"type": "String"
},
"sink": {
"name": "CheckDate",
"type": "String"
}
},
{
"source": {
"name": "Created",
"type": "String"
},
"sink": {
"name": "Created",
"type": "String"
}
},
{
"source": {
"name": "ClientTag",
"type": "String"
},
"sink": {
"name": "ClientTag",
"type": "String"
}
}
]
}
},
"inputs": [
{
"referenceName": "PAR_Source_DS",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "PAR_Sink_DS",
"type": "DatasetReference"
}
]
}
],
"annotations": []
}
}
Pipeline Data Flow (LandWithFlow)
{
"name": "WriteData",
"properties": {
"type": "MappingDataFlow",
"typeProperties": {
"sources": [
{
"dataset": {
"referenceName": "PAR_Source_DS",
"type": "DatasetReference"
},
"name": "GetData"
}
],
"sinks": [
{
"dataset": {
"referenceName": "PAR_Sink_DS",
"type": "DatasetReference"
},
"name": "WriteData"
}
],
"transformations": [],
"script": "source(output(\n\t\tRecordAction as string,\n\t\tUniqueId as string,\n\t\tType as string,\n\t\tTypeDescription as string,\n\t\tPatientId as string,\n\t\tPatientVisitId as string,\n\t\tVisitDateOfService as string,\n\t\tVisitDateOfEntry as string,\n\t\tDoctorId as string,\n\t\tDoctorName as string,\n\t\tFacilityId as string,\n\t\tFacilityName as string,\n\t\tCompanyName as string,\n\t\tTicketNumber as string,\n\t\tTransactionDateOfEntry as string,\n\t\tInternalCode as string,\n\t\tExternalCode as string,\n\t\tDescription as string,\n\t\tFee as string,\n\t\tUnits as string,\n\t\tAREffect as string,\n\t\tAction as string,\n\t\tInsuranceGroup as string,\n\t\tPayer as string,\n\t\tPayerType as string,\n\t\tPatBalance as string,\n\t\tInsBalance as string,\n\t\tCharges as string,\n\t\tPayments as string,\n\t\tAdjustments as string,\n\t\tTransferAmount as string,\n\t\tFiledAmount as string,\n\t\tCheckNumber as string,\n\t\tCheckDate as string,\n\t\tCreated as string,\n\t\tClientTag as string\n\t),\n\tallowSchemaDrift: true,\n\tvalidateSchema: false,\n\twildcardPaths:['data.csv']) ~> GetData\nGetData sink(input(\n\t\tRecordAction as string,\n\t\tUniqueId as string,\n\t\tType as string,\n\t\tTypeDescription as string,\n\t\tPatientId as string,\n\t\tPatientVisitId as string,\n\t\tVisitDateOfService as string,\n\t\tVisitDateOfEntry as string,\n\t\tDoctorId as string,\n\t\tDoctorName as string,\n\t\tFacilityId as string,\n\t\tFacilityName as string,\n\t\tCompanyName as string,\n\t\tTicketNumber as string,\n\t\tTransactionDateOfEntry as string,\n\t\tInternalCode as string,\n\t\tExternalCode as string,\n\t\tDescription as string,\n\t\tFee as string,\n\t\tUnits as string,\n\t\tAREffect as string,\n\t\tAction as string,\n\t\tInsuranceGroup as string,\n\t\tPayer as string,\n\t\tPayerType as string,\n\t\tPatBalance as string,\n\t\tInsBalance as string,\n\t\tCharges as string,\n\t\tPayments as string,\n\t\tAdjustments as string,\n\t\tTransferAmount as string,\n\t\tFiledAmount as string,\n\t\tCheckNumber as string,\n\t\tCheckDate as string,\n\t\tCreated as string,\n\t\tClientTag as string,\n\t\tFileName as string,\n\t\tPractice as string\n\t),\n\tallowSchemaDrift: true,\n\tvalidateSchema: false,\n\tdeletable:false,\n\tinsertable:true,\n\tupdateable:false,\n\tupsertable:false,\n\tformat: 'table',\n\tpreSQLs:['TRUNCATE TABLE PatientAR'],\n\tmapColumn(\n\t\tRecordAction,\n\t\tUniqueId,\n\t\tType,\n\t\tTypeDescription,\n\t\tPatientId,\n\t\tPatientVisitId,\n\t\tVisitDateOfService,\n\t\tVisitDateOfEntry,\n\t\tDoctorId,\n\t\tDoctorName,\n\t\tFacilityId,\n\t\tFacilityName,\n\t\tCompanyName,\n\t\tTicketNumber,\n\t\tTransactionDateOfEntry,\n\t\tInternalCode,\n\t\tExternalCode,\n\t\tDescription,\n\t\tFee,\n\t\tUnits,\n\t\tAREffect,\n\t\tAction,\n\t\tInsuranceGroup,\n\t\tPayer,\n\t\tPayerType,\n\t\tPatBalance,\n\t\tInsBalance,\n\t\tCharges,\n\t\tPayments,\n\t\tAdjustments,\n\t\tTransferAmount,\n\t\tFiledAmount,\n\t\tCheckNumber,\n\t\tCheckDate,\n\t\tCreated,\n\t\tClientTag\n\t),\n\tskipDuplicateMapInputs: true,\n\tskipDuplicateMapOutputs: true) ~> WriteData"
}
}
}
We are having the same issue: a Copy activity without a data flow is much faster than a Data Flow. Our comparison is also Copy activity vs. Data Flow, and I'm not sure if I'm doing anything wrong.
Our scenario is just copying 13 tables from source to destination based on a WHERE clause. We currently have two Copy activities, which take about 1.5 minutes. I thought about creating a Data Flow with one source and two sinks instead, but it runs for 5 to 8 minutes depending on cluster startup time. Hope we get an answer.

Using a json schema in multiple layouts

I'm helping to build an interface that works with JSON Schema, and I have a question about interface generation based on that schema. There are two display types: one for internal users and one for external users. Both deal with the same data, but external users should see a smaller subset of fields than internal users.
For example, here is one schema; it defines an obituary:
{
"$schema": "http://json-schema.org/draft-04/schema#",
"description": "",
"type": "object",
"required": [
"id",
"deceased"
],
"properties": {
"id": { "type": "string" },
"account": {
"type": "object",
"required": [
"name"
],
"properties": {
"id": { "type": "number" },
"name": { "type": "string" },
"website": {
"anyOf": [
{
"type": "string",
"format": "uri"
},
{
"type": "string",
"maxLength": 0
}
]
},
"email": {
"anyOf": [
{
"type": "string",
"format": "email"
},
{
"type": "string",
"maxLength": 0
}
]
},
"address": {
"type": "object",
"properties": {
"address1": { "type": "string" },
"address2": { "type": "string" },
"city": { "type": "string" },
"state": { "type": "string" },
"postalCode": { "type": "string" },
"country": { "type": "string" }
}
},
"phoneNumber": {
"anyOf": [
{
"type": "string",
"format": "phone"
},
{
"type": "string",
"maxLength": 0
}
]
},
"faxNumber": {
"anyOf": [
{
"type": "string",
"format": "phone"
},
{
"type": "string",
"maxLength": 0
}
]
},
"type": { "type": "string" }
}
},
"deceased": {
"type": "object",
"required": [
"fullName"
],
"properties": {
"fullName": { "type": "string" },
"prefix": { "type": "string" },
"firstName": { "type": "string" },
"middleName": { "type": "string" },
"nickName": { "type": "string" },
"lastName1": { "type": "string" },
"lastName2": { "type": "string" },
"maidenName": { "type": "string" },
"suffix": { "type": "string" }
}
},
"description": { "type": "string" },
"photos": {
"type": "array",
"items": { "type": "string" }
}
}
}
Internal users would be able to access all the fields, but external users shouldn't be able to read/write the account fields.
Should I make a second schema for the external users, or is there a way to indicate different display levels or public/private on each field?
You cannot restrict access to the fields defined in a schema, but you can have two schema files: one defining the "public" fields, and the other including the public schema (via $ref) plus defining the restricted fields.
So
public-schema.json:
{
"properties" : {
"id" : ...
}
}
restricted-schema.json:
{
"allOf" : [
{
"$ref" : "./public-schema.json"
},
{
"properties" : {
"account": ...
}
}
]
}
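Filled in with fields from the obituary schema above, the two files could look roughly like this (a sketch that assumes external users may see everything except account; subschemas abbreviated, adjust the split to your needs):

public-schema.json:
{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "type": "object",
  "required": ["id", "deceased"],
  "properties": {
    "id": { "type": "string" },
    "deceased": { "type": "object", "required": ["fullName"], "properties": { "fullName": { "type": "string" } } },
    "description": { "type": "string" },
    "photos": { "type": "array", "items": { "type": "string" } }
  }
}
restricted-schema.json:
{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "allOf": [
    { "$ref": "./public-schema.json" },
    {
      "type": "object",
      "properties": {
        "account": { "type": "object", "properties": { "name": { "type": "string" } } }
      }
    }
  ]
}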

BQ command line how to use "--noflatten_results" option to have nested fields

I need to query a table with nested and repeated fields. Using the bq command line gives me flattened results, while I need the results in the original nested format.
The original format looks like this:
{
"fields": [
{
"fields": [
{
"mode": "REQUIRED",
"name": "version",
"type": "STRING"
},
{
"mode": "REQUIRED",
"name": "hash",
"type": "STRING"
},
{
"mode": "REQUIRED",
"name": "header",
"type": "STRING"
},
{
"name": "organization",
"type": "STRING"
},
{
"mode": "REQUIRED",
"name": "date",
"type": "TIMESTAMP"
},
{
"mode": "REQUIRED",
"name": "encoding",
"type": "STRING"
},
{
"mode": "REQUIRED",
"name": "message_type",
"type": "STRING"
},
{
"mode": "REQUIRED",
"name": "receiver_code",
"type": "STRING"
},
{
"mode": "REQUIRED",
"name": "sender_code",
"type": "INTEGER"
},
{
"mode": "REQUIRED",
"name": "segment_separator",
"type": "STRING"
},
{
"fields": [
{
"fields": [
{
"name": "name",
"type": "STRING"
},
{
"name": "description",
"type": "STRING"
},
{
"name": "value",
"type": "STRING"
},
{
"fields": [
{
"name": "name",
"type": "STRING"
},
{
"name": "description",
"type": "STRING"
},
{
"name": "value",
"type": "STRING"
}
],
"mode": "REPEATED",
"name": "composite_elements",
"type": "RECORD"
}
],
"mode": "REPEATED",
"name": "elements",
"type": "RECORD"
},
{
"name": "description",
"type": "STRING"
},
{
"mode": "REQUIRED",
"name": "name",
"type": "STRING"
}
],
"mode": "REPEATED",
"name": "segments",
"type": "RECORD"
},
{
"mode": "REQUIRED",
"name": "message_identifier",
"type": "INTEGER"
},
{
"mode": "REQUIRED",
"name": "element_separator",
"type": "STRING"
},
{
"name": "composite_element_separator",
"type": "STRING"
}
],
"mode": "REPEATED",
"name": "messages",
"type": "RECORD"
},
{
"mode": "REQUIRED",
"name": "syntax",
"type": "STRING"
},
{
"mode": "REQUIRED",
"name": "encoding",
"type": "STRING"
},
{
"mode": "REQUIRED",
"name": "file_name",
"type": "STRING"
},
{
"mode": "REQUIRED",
"name": "size",
"type": "INTEGER"
}
]
}
So how can I export the data locally with the nested representation?
[EDIT]
Export via Google Cloud Storage to keep the nested representation.
It seems the only way to export the nested representation is to write the query results to a table, extract that table to Google Cloud Storage, and finally download the file:
bq query --destination_table=DEV.EDI_DATA_EXPORT --replace \
--allow_large_results --noflatten_results \
"select * from DEV.EDI_DATA where syntax='EDIFACT' " \
&& bq extract --destination_format=NEWLINE_DELIMITED_JSON DEV.EDI_DATA_EXPORT gs://mybucket/data.json \
&& gsutil cp gs://mybucket/data.json .
It's surprising to me...
Whenever you use --noflatten_results you also have to use --allow_large_results and --destination_table. This stores the non-flattened results in a new table.

bigquery - Input contained no data

I'm testing the BigQuery platform with real traffic from my site (more than 80M events per day).
I'm uploading gzipped files using the Java API, via insert (load) jobs.
In some cases, I receive this message: Input contained no data
{
"kind": "bigquery#job",
"etag": "\"******************\"",
"id": "*********",
"selfLink": "********",
"jobReference": {
"projectId": "********",
"jobId": "**************"
},
"configuration": {
"load": {
"schema": {
"fields": [
{
"name": "tms",
"type": "TIMESTAMP"
},
{
"name": "page",
"type": "STRING"
},
{
"name": "user_agent",
"type": "STRING"
},
{
"name": "print_id",
"type": "STRING"
},
{
"name": "referer",
"type": "STRING"
},
{
"name": "gtms",
"type": "TIMESTAMP"
},
{
"name": "cookies",
"type": "STRING"
},
{
"name": "ip",
"type": "STRING"
},
{
"name": "site",
"type": "STRING"
},
{
"name": "call_params",
"type": "STRING"
},
{
"name": "domains",
"type": "RECORD",
"mode": "REPEATED",
"fields": [
{
"name": "name",
"type": "STRING"
},
{
"name": "ads",
"type": "RECORD",
"mode": "REPEATED",
"fields": [
{
"name": "id",
"type": "STRING"
},
{
"name": "type",
"type": "STRING"
},
{
"name": "position",
"type": "STRING"
},
{
"name": "strategy",
"type": "STRING"
},
{
"name": "score",
"type": "STRING"
},
{
"name": "cpc",
"type": "STRING"
},
{
"name": "site",
"type": "STRING"
},
{
"name": "categ",
"type": "STRING"
},
{
"name": "cust",
"type": "STRING"
},
{
"name": "campaign",
"type": "STRING"
}
]
}
]
}
]
},
"destinationTable": {
"projectId": "**********",
"datasetId": "*******",
"tableId": "********"
},
"createDisposition": "CREATE_IF_NEEDED",
"writeDisposition": "WRITE_APPEND",
"sourceFormat": "NEWLINE_DELIMITED_JSON"
}
},
"status": {
"state": "DONE",
"errors": [
{
"reason": "invalid",
"message": "Input contained no data"
}
]
},
"statistics": {
"creationTime": "1416491042309",
"startTime": "1416491061440",
"endTime": "1416491076876",
"load": {
"inputFiles": "1",
"inputFileBytes": "0",
"outputRows": "0",
"outputBytes": "0"
}
}
}
And after this, all my jobs return the same response.
Can anybody tell me the reason for this behaviour?
Thanks!
Your job succeeded: there is no "errorResult" field in the status.
First, I understand the confusion: the way errors and warnings are reported by the jobs API is, frankly, as clear as mud.
Here's the quick overview:
status.errorResult is where a job error is reported. If no errorResult is reported, the job succeeded.
status.errors is where individual errors and warnings are reported.
Please reference the documentation https://cloud.google.com/bigquery/docs/reference/v2/jobs and search for status.errorResult and status.errors.
Most people don't hit this confusion, since a job that only encounters a warning is pretty rare.
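To make the distinction concrete, here is a small sketch (Python for illustration, operating on the job resource JSON returned by the jobs API, like the one shown above) of how you would tell a failed job from a job that merely reported warnings:

def summarize_job(job):
    """job is the job resource dict returned by the BigQuery jobs API."""
    status = job.get("status", {})
    error_result = status.get("errorResult")   # set only when the job actually failed
    warnings = status.get("errors", [])        # individual errors/warnings, present even on success

    if error_result:
        return "FAILED: %s" % error_result["message"]
    if warnings:
        return "SUCCEEDED with warnings: %s" % "; ".join(e["message"] for e in warnings)
    return "SUCCEEDED"

# For the job posted above this returns "SUCCEEDED with warnings: Input contained no data".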
OK, the problem was very simple: the gz file itself (it contained no data, which matches the inputFileBytes of 0 in the job statistics).
Thanks!