I am trying to use runtime arguments for the BigQuery schema definition in the BigQuery sink plugin.
It is just two columns. Here is the definition in the Argument Setter JSON:
{
"arguments" : [
{"name":"bq.config.table","value":"activity_category"},
{
"name" : "sqloutput_schema",
"type" : "schema",
"value" :
[
{"name":"activity_category_id","type":"string","nullable":true},
{"name":"activity_category_description","type":"string"}
]
}
]
}
The issue is with 'sqloutput_schema', which fails at runtime. (Screenshot of the plugin attached.)
Error received:
Spark program 'phase-2' failed with error: Argument 'sqloutput_schema' is not defined. Please check the system logs for more details. io.cdap.cdap.api.macro.InvalidMacroException: Argument 'sqloutput_schema' is not defined.
I cannot figure out why this is failing.
The problem is in your schema definition. I had the same use case; my argument was of type string and its value had the following format:
"{\"name\":\"etlSchemaBody\",\"type\":\"record\",\"fields\":
[
{\"name\":\"Id\",\"type\":\"int\"},
{\"name\":\"name\",\"type\":\"string\"}
]}"
So change the type of the schema argument to string and format the schema JSON as shown above.
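For illustration only, here is a rough sketch of how the full Argument Setter payload could look with your two columns. The record name etlSchemaBody, keeping a "type" field on the argument, and encoding the nullable column as ["string","null"] are assumptions based on the format above; adjust to whatever your plugin version expects:
{
"arguments" : [
{"name":"bq.config.table","value":"activity_category"},
{
"name" : "sqloutput_schema",
"type" : "string",
"value" : "{\"name\":\"etlSchemaBody\",\"type\":\"record\",\"fields\":[{\"name\":\"activity_category_id\",\"type\":[\"string\",\"null\"]},{\"name\":\"activity_category_description\",\"type\":\"string\"}]}"
}
]
}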
Related
I am experimenting with the schema auto-detection feature in BigQuery and I am currently encountering issues updating the schema on my table.
What I have done so far:
I manually created a dataset and table in BigQuery.
Executed my first bq load command (works perfectly fine):
bq --location=${LOCATION} load --autodetect --source_format=${FORMAT} ${DATASET}.${TABLE} ${PATH_TO_SOURCE}
I then try to append a new JSON object that introduces a new field, in order to update the current schema.
Executed the 2nd bq load command:
bq --location=${LOCATION} load --autodetect --schema_update_option=ALLOW_FIELD_ADDITION --source_format=${FORMAT} ${DATASET}.${TABLE} ${PATH_TO_SOURCE}
Throws an error:
Error in query string.
Error processing job. Schema has no fields.
I thought that when the --autodetect flag is enabled, the bq load command would not require a schema for the load job. Has anyone encountered this issue?
First object:
{
"chatSessionId": "123",
"chatRequestId": "1234",
"senderType": "CUSTOMER",
"senderFriendlyName": "Player"
}
Second object:
{
"chatSessionId": "456",
"chatRequestId": "5678",
"senderType": "CUSTOMER",
"senderFriendlyName": "Player",
"languageCode": "EN"
}
I reproduced your steps but I couldn't reproduce the same error, as you can see in the attached screenshots: loading the first JSON, the first table's data, loading the second JSON, and the second table's data.
The only thing I changed in your data was the format: you provided regular JSON and I turned it into newline-delimited JSON (the kind of JSON that BigQuery expects).
You can find more information about it here.
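As a rough illustration (the location is just an example, and the file, dataset, and table names are placeholders), each file would contain its object on a single line, loaded with your same flags:
chat1.json:
{"chatSessionId":"123","chatRequestId":"1234","senderType":"CUSTOMER","senderFriendlyName":"Player"}
chat2.json:
{"chatSessionId":"456","chatRequestId":"5678","senderType":"CUSTOMER","senderFriendlyName":"Player","languageCode":"EN"}
bq --location=US load --autodetect --source_format=NEWLINE_DELIMITED_JSON mydataset.chat_sessions ./chat1.json
bq --location=US load --autodetect --schema_update_option=ALLOW_FIELD_ADDITION --source_format=NEWLINE_DELIMITED_JSON mydataset.chat_sessions ./chat2.json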
Please let me know if it clarifies something for you.
I hope it helps.
I am trying to export Cloud Firestore data into BigQuery to do SQL operations.
Exported Cloud Firestore to Cloud Storage using https://cloud.google.com/firestore/docs/manage-data/export-import:
gcloud beta firestore export gs://htna-3695c-storage --collection-ids='users','feeds'
Followed https://cloud.google.com/bigquery/docs/loading-data-cloud-firestore to import into BigQuery.
We have 2 collections in Cloud Firestore: Users and Feeds.
I have successfully loaded the Feeds collection but am not able to load the Users collection.
I am getting an error while importing the data from Storage into BigQuery:
Error: unexpected property name 'Contacts'. We have a Contacts field in the Users collection; this field is of type 'Map'.
I also tried the command line. Below is the command to load into BigQuery:
bq --location=US load --source_format=DATASTORE_BACKUP myproject_Dataset.users gs://myproject-storage/2019-04-19T13:29:28_75338/all_namespaces/kind_users/all_namespaces_kind_users.export_metadata
Here I also got the same error:
unexpected property name 'Contacts'
I then thought of adding projection fields to load only specified fields, something like below:
bq --location=US load --projection_fields=[Coins,Referral,Profile] --source_format=DATASTORE_BACKUP myproject_Dataset.users gs://myproject-storage/2019-04-19T13:29:28_75338/all_namespaces/kind_users/all_namespaces_kind_users.export_metadata
Here I got this error instead:
Waiting on bqjob_r73b7ddbc9398b737_0000016a4909dd27_1 ... (0s) Current status: DONE
BigQuery error in load operation: Error processing job 'myproject:bqjob_r73b7ddbc9398b737_0000016a4909dd27_1': An internal error occurred and the request could not be completed. Error: 4550751
Can anyone please let me know how to fix these issues? Thank you in advance.
(Image of the Firestore DB attached.)
It seems to be an issue with some of your documents. According to the limitations of the BigQuery import of Firestore data, your documents must have a schema with fewer than 10,000 unique field names. In your Contacts field you are using the contacts' names as map keys, and that design is likely to produce a large number of unique field names. You would need to check whether this is also happening in your other documents.
As a workaround, you could change the design (at some stage of the loading process) from:
"Contacts" : {
"contact1" : "XXXXX",
"contact2" : "YYYYY",
}
to:
"Contacts" : [
{
"name" : "contact1",
"number" : "XXXXX"
},
{
"name" : "contact2",
"number" : "YYYYY"
}
]
This schema would drastically reduce the number of field names, which will make the data easier to work with in BigQuery.
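For illustration, once Contacts is loaded as a repeated record like the one above, it becomes straightforward to flatten in BigQuery with a query along these lines (the project, dataset, and table names are hypothetical):
SELECT c.name, c.number
FROM `myproject.mydataset.users` AS u, UNNEST(u.Contacts) AS c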
We're trying BigQuery for the first time, with data extracted from Mongo in JSON format. I kept getting a generic parse error upon loading the file. But then I tried a smaller subset of the file, 20 records, and it loaded fine. This tells me it's not the general structure of the file, which I had originally thought was the problem. Is there any way to get more information about the parse error, such as the text of the record it was trying to parse when the error occurred?
I also tried using the max errors field, but that didn't work either.
This was via the website. I also tried it via the Google Cloud SDK command line 'bq load...' and got the same error.
This error is most likely caused by some of the JSON records not complying with the table schema. It is not clear whether you used the schema auto-detect feature or supplied a schema for the load, but here is one example where such an error could happen:
{ "a" : "1" }
{ "a" : { "b" : "2" } }
If you only have a few of these and they really are invalid records, you can automatically skip them by using the max_bad_records option for the load job. More details at: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-json
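For example, from the CLI it could look roughly like this (dataset, table, and bucket names are placeholders, 10 is an arbitrary threshold, and you would still add --autodetect or a schema file if the table does not already have a schema):
bq load --source_format=NEWLINE_DELIMITED_JSON --max_bad_records=10 mydataset.mytable gs://mybucket/mongo_export.json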
I'm attempting to extract data from Avro files produced by Event Hub Capture. In most cases this works flawlessly, but certain files are causing me problems. When I run the following U-SQL job:
USE DATABASE Metrics;
USE SCHEMA dbo;
REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];
REFERENCE ASSEMBLY [Avro];
REFERENCE ASSEMBLY [log4net];
USING Microsoft.Analytics.Samples.Formats.ApacheAvro;
USING Microsoft.Analytics.Samples.Formats.Json;
USING System.Text;
//DECLARE @input string = "adl://mydatalakestore.azuredatalakestore.net/event-hub-capture/v3/{date:yyyy}/{date:MM}/{date:dd}/{date:HH}/{filename}";
DECLARE @input string = "adl://mydatalakestore.azuredatalakestore.net/event-hub-capture/v3/2018/01/16/19/rcpt-metrics-us-es-eh-metrics-v3-us-0-35-36.avro";
@eventHubArchiveRecords =
    EXTRACT Body byte[],
            date DateTime,
            filename System.String
    FROM @input
    USING new AvroExtractor(@"
{
""type"":""record"",
""name"":""EventData"",
""namespace"":""Microsoft.ServiceBus.Messaging"",
""fields"":[
{""name"":""SequenceNumber"",""type"":""long""},
{""name"":""Offset"",""type"":""string""},
{""name"":""EnqueuedTimeUtc"",""type"":""string""},
{""name"":""SystemProperties"",""type"":{""type"":""map"",""values"":[""long"",""double"",""string"",""bytes""]}},
{""name"":""Properties"",""type"":{""type"":""map"",""values"":[""long"",""double"",""string"",""bytes""]}},
{""name"":""Body"",""type"":[""null"",""bytes""]}
]
}
");
@json =
    SELECT Encoding.UTF8.GetString(Body) AS json
    FROM @eventHubArchiveRecords;

OUTPUT @json
TO "/outputs/Avro/testjson.csv"
USING Outputters.Csv(outputHeader : true, quoting : true);
I get the following error:
Unhandled exception from user code: "The given key was not present in the dictionary."
An unhandled exception from user code has been reported when invoking the method 'Extract' on the user type 'Microsoft.Analytics.Samples.Formats.ApacheAvro.AvroExtractor'
Am I correct in assuming the problem is within the AVRO file produced by Event Hub Capture, or is there something wrong with my code?
The Key Not Present error is referring to the fields in your EXTRACT statement. It's not finding the date and filename fields. I removed those fields and your script runs correctly in my ADLA instance.
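For reference, the reduced EXTRACT would look roughly like this, reusing the same Avro schema string from your script:
@eventHubArchiveRecords =
    EXTRACT Body byte[]
    FROM @input
    USING new AvroExtractor(@"
    {
        ""type"":""record"",
        ""name"":""EventData"",
        ""namespace"":""Microsoft.ServiceBus.Messaging"",
        ""fields"":[
            {""name"":""SequenceNumber"",""type"":""long""},
            {""name"":""Offset"",""type"":""string""},
            {""name"":""EnqueuedTimeUtc"",""type"":""string""},
            {""name"":""SystemProperties"",""type"":{""type"":""map"",""values"":[""long"",""double"",""string"",""bytes""]}},
            {""name"":""Properties"",""type"":{""type"":""map"",""values"":[""long"",""double"",""string"",""bytes""]}},
            {""name"":""Body"",""type"":[""null"",""bytes""]}
        ]
    }
    ");
If you later switch back to the commented-out file-set pattern, you will presumably want the date and filename virtual columns again.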
The current implementation only supports primitive types, not the complex types of the Avro specification.
You have to build and use an extractor based on Apache Avro rather than the sample extractor provided by Microsoft. We went down the same path.
I have a JSON schema:
[{"name":"timestamp","type":"integer"},{"name":"xml_id","type":"string"},{"name":"prod","type":"string"},{"name":"version","type":"string"},{"name":"distcode","type":"string"},{"name":"origcode","type":"string"},{"name":"overcode","type":"string"},{"name":"prevcode","type":"string"},{"name":"ie","type":"string"},{"name":"os","type":"string"},{"name":"payload","type":"string"},{"name":"language","type":"string"},{"name":"userid","type":"string"},{"name":"sysid","type":"string"},{"name":"loc","type":"string"},{"name":"impetus","type":"string"},{"name":"numprompts","type":"record","mode":"repeated","fields":[{"name":"type","type":"string"},{"name":"count","type":"integer"}]},{"name":"rcode","type":"record","mode":"repeated","fields":[{"name":"offer","type":"string"},{"name":"code","type":"integer"}]},{"name":"bw","type":"string"},{"name":"pkg_id","type":"string"},{"name":"cpath","type":"string"},{"name":"rsrc","type":"string"},{"name":"pcode","type":"string"},{"name":"opage","type":"string"},{"name":"action","type":"string"},{"name":"value","type":"string"},{"name":"other","type":"record","mode":"repeated","fields":[{"name":"param","type":"string"},{"name":"value","type":"string"}]}]
(http://jsoneditoronline.org/ for pretty print)
When loading through the browser GUI, the schema is accepted as valid. The CLI throws the following error:
BigQuery error in load operation: Invalid schema entry: "fields":[{"name":"type"
Is there something wrong with my schema as specified?
If you are passing the schema as JSON, you should write it to a file and pass the file name as the schema parameter. Passing the schema inline on the command line is only allowed for simple, flat schemas.
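For example, assuming the source data is newline-delimited JSON and the schema array above is saved to schema.json, the load would look roughly like this (table, bucket, and file names are placeholders):
bq load --source_format=NEWLINE_DELIMITED_JSON mydataset.mytable gs://mybucket/data.json ./schema.json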