Loading AVRO from Bucket via CLI into BigQuery with Date partition - google-bigquery

I'm trying to import data into BigQuery from an AVRO file with a date partition. When loading via the CLI, I get an error saying that the partitioning field must be a DATE or TIMESTAMP but an INTEGER was found.
Given an AVRO file similar to the one below:
{
  "namespace": "test_namespace",
  "name": "test_name",
  "type": "record",
  "fields": [
    {
      "name": "partition_date",
      "type": "int",
      "logicalType": "date"
    },
    {
      "name": "unique_id",
      "type": "string"
    },
    {
      "name": "value",
      "type": "double"
    }
  ]
}
I am then using the following command through the CLI to try and create a new table:
bq load \
--replace \
--source_format=AVRO \
--use_avro_logical_types=True \
--time_partitioning_field partition_date \
--clustering_fields unique_id \
mydataset.mytable \
gs://mybucket/mydata.avro
The expectation is that a new table is created, partitioned on the DATE column "partition_date" and clustered by "unique_id".
Edit: Please see the error below
The field specified for the time partition can only be of type TIMESTAMP or DATE. The type found is: INTEGER.
The exact command I am using is as follows:
bq load --replace --source_format=AVRO --use_avro_logical_types=True --time_partitioning_field "partition_date" --clustering_fields "unique_id" BQ_DATASET BUCKET_URI
This is the AVRO schema that I am using:
{
  "namespace": "example.avro",
  "type": "record",
  "name": "Test",
  "fields": [
    { "name": "partition_date", "type": "int", "logicalType": "date" },
    { "name": "unique_id", "type": "string" },
    { "name": "value", "type": "float" }
  ]
}
It's worth noting that this is an old Google project (about 2-3 years old), if that is of any relevance.
I'm also on Windows 10 with the latest Google Cloud SDK.

Google finally got back to me (7 months later). In that time I have lost access to the initial project that I had issues with. However, I'm documenting a successful example with a new project for anyone who finds this later.
Following a comment from the issue tracker here, I found that I was not using a complex type for the logical date field.
So this:
{
  "name": "partition_date",
  "type": "int",
  "logicalType": "date"
}
should have been written like this (note the nested complex object for the type):
{
  "name": "partition_date",
  "type": {
    "type": "int",
    "logicalType": "date"
  }
}
Although the Avro specification defines a date as the number of days from the Unix epoch (1 January 1970), I had to write partition_date as datetime.date(1970, 1, 1) instead of just 0.
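A minimal sketch of writing such a file in Python, assuming the fastavro package (the writer library is an assumption; any Avro writer that honours logical types should behave similarly):

# Write an Avro file whose partition_date uses the nested logical-type form.
import datetime
from fastavro import parse_schema, writer

schema = parse_schema({
    "namespace": "example.avro",
    "type": "record",
    "name": "Test",
    "fields": [
        # Note the nested complex object for the logical date type, as described above.
        {"name": "partition_date", "type": {"type": "int", "logicalType": "date"}},
        {"name": "unique_id", "type": "string"},
        {"name": "value", "type": "float"},
    ],
})

# Written as datetime.date rather than a raw day count (see the note above).
records = [{"partition_date": datetime.date(1970, 1, 1), "unique_id": "id-1", "value": 1.0}]

with open("mydata.avro", "wb") as out:
    writer(out, schema, records)

The resulting file can be copied to the bucket and loaded with the same bq command as above.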
The commands (bq) were unchanged from the original post.
As stated, I don't know whether this would have fixed my issue with the original project, but hopefully this helps the next person.

I didn't receive any error message performing the same load operation, generating an equivalent Avro data schema and using the desired BigQuery sink table structure.
According to the GCP documentation, you've used the --use_avro_logical_types=True flag along with the bq command line properly, so that the date Avro logical type is translated to the equivalent DATE type in BigQuery.
You can refer to my BigQuery table schema to validate the table structure on your side. As you haven't provided the table structure or the error message itself, I can't suggest more so far:
$ bq show --project_id=<Project_ID> <Dataset>.<Table>
Table <Project_ID>:<Dataset>.<Table>

   Last modified            Schema             Total Rows   Total Bytes   Expiration       Time Partitioning        Clustered Fields   Labels
 ----------------- ------------------------- ------------ ------------- ------------ ----------------------------- ------------------ --------
  22 Apr 12:03:57   |- partition_date: date       3             66                    DAY (field: partition_date)      unique_id
                    |- unique_id: string
                    |- value: float
I used the FLOAT type for value to plainly map the Avro DOUBLE data type, as per the recommendations here.
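If it helps, the same check can be done programmatically; here is a hedged sketch with the google-cloud-bigquery Python client (the table reference is a placeholder for your <Dataset>.<Table>):

from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("mydataset.mytable")  # placeholder table reference

# Print the schema plus the partitioning and clustering settings of the loaded table.
for field in table.schema:
    print(field.name, field.field_type)
print("time partitioning:", table.time_partitioning.field if table.time_partitioning else None)
print("clustering fields:", table.clustering_fields)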
bq CLI version:
$ bq version
This is BigQuery CLI 2.0.56
Feel free to expand the original question with more specific information on the issue you're hitting, so I can assist more accurately with a solution.
UPDATE:
I've checked the information provided, but I'm still confused by the error you're getting. Apparently, in your case the use_avro_logical_types=True flag does not perform the logical type conversion. However, I've found this PIT (Public Issue Tracker) feature request where people are asking to "whitelist" their projects in order to get the AVRO logicalType functionality, i.e. this comment. Since this feature has been rolled out to the global community, it might be an oversight that some GCP projects are not enabled to use it.

Related

Azure Data Factory Source Dataset value from Parameter

I have a Dataset in Azure Data Factory backed by a CSV file. I added an additional column to the Dataset and want to pass its value from a Dataset parameter, but the value never gets copied to the column:
"type": "AzureBlob",
"structure":
[
{
"name": "MyField",
"type": "String"
}
]
I have defined a parameter as well:
"parameters": {
    "MyParameter": {
        "type": "String",
        "defaultValue": "ABC"
    }
}
How can I copy the parameter value to the column? I tried the following:
"type": "AzureBlob",
"structure":
[
    {
        "name": "MyField",
        "type": "String",
        "value": "#dataset().MyParameter"
    }
]
But this does not work; I am getting NULL in the destination although the parameter value is set.
Based on the document Expressions and functions in Azure Data Factory, #dataset().XXX is not supported in Azure Data Factory so far. So you can't use a parameter value as a custom column in the sink or source with the native copy activity directly.
However, you could adopt one of the workarounds below:
1. You could create a custom activity and write code to do whatever you need.
2. You could stage the CSV file in an Azure Data Lake, then execute a U-SQL script to read the data from the file and append the new column with the pipeline run ID. Then output it to a new area in the data lake so that the data can be picked up by the rest of your pipeline. To do this, you just need to pass a parameter to U-SQL from ADF. Please refer to the U-SQL Activity.
In this thread: use adf pipeline parameters as source to sink columns while mapping, the customer used the second way.

How to find data structure of an app in PowerApps

I want to create a SQL connection and import data from an app (Shoutouts template) into a SQL database. I created a SQL connection and tried to import the data there, but I got this error:
CreatedOnDateTime: The specified column is generated by the server and can't be specified
I do have the CreatedOnDateTime column created, but I guess its datatype is not the same, or something else is wrong.
Where can I look to see what fields and datatypes are being imported from PowerApps to the SQL table via the SQL connection?
Thank you for your help!
Overall, there's no easy way to find out the structure of a data source in PowerApps (please create a new feature request in the PowerApps Ideas board for that). There is a convoluted way to find it out, however, which I'll go over here.
But for your specific problem, this is the schema of a SQL table that would match the schema of the data source in PowerApps:
CREATE TABLE PowerAppsTest.StackOverflow51847975 (
    PrimaryID BIGINT PRIMARY KEY,
    [Id] NVARCHAR(MAX),
    [Message] NVARCHAR(MAX),
    CreatedOnDateTime NVARCHAR(MAX),
    CreatorEmail NVARCHAR(MAX),
    CreatorName NVARCHAR(MAX),
    RecipientEmail NVARCHAR(MAX),
    RecipientName NVARCHAR(MAX),
    ShoutoutType NVARCHAR(MAX),
    [Image] IMAGE
)
Now for the generic case. You've been warned that this is convoluted, so proceed at your own risk :)
First, save the app locally to your computer.
The app will be saved with the .msapp extension, but it's basically a .zip file. If you're using Windows, you can rename it to change the extension to .zip and you'll be able to uncompress and extract the files that describe the app.
One of those files, Entities.json, contains, among other things, the definition of the schema of all data sources used in the app. It is a huge JSON file with all of its whitespace removed, so you may want to use an online tool to format (or pretty-print) the JSON to make it easier to read. Once this is done, you can open the file in your favorite text editor (anything better than Notepad should be able to handle it).
With the file opened, search for an entry in the JSON root with the property "Name" and the value equal to the name of the data source. For example, in the shoutouts app case, the data source is called "Shoutout", so search for
"Name": "Shoutout"
You'll have to remove the space if you didn't pretty-print the JSON file prior to opening it. This should be an object that describes the data source, and it has one property called DataEntityMetadataJson that has the data source schema, formatted as a JSON string. Again in the Shoutouts example, this is the value:
"{\"name\":\"Shoutout\",\"title\":\"Shoutout\",\"x-ms-permission\":\"read-write\",\"schema\":{\"type\":\"array\",\"items\":{...
Notice that it again is not pretty-printed. You'll first need to decode that string, then pretty-print it again, and you'll end up with something like this:
{
  "name": "Shoutout",
  "title": "Shoutout",
  "x-ms-permission": "read-write",
  "schema": {
    "type": "array",
    "items": {
      "type": "object",
      "properties": {
        "PrimaryID": {
          "type": "number",
          "format": "double",
          ...
        },
        "Message": {
          "type": "string",
          ...
        },
        "Image": {
          "type": "string",
          "format": "uri",
          "x-ms-media-kind": "image",
          ...
        },
        "Id": {
          "type": "string",
          ...
        },
        "CreatedOnDateTime": {
          "type": "string",
          ...
        },
        ...
And this is the schema for the data source. From that I recreated the schema in SQL, removed the reference to the Shoutout data source from the app (which caused many errors), then added a reference to my SQL table, and since it has a different name, went looking for all places that have errors in the app to fix those.
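If you prefer to script the steps above, here is a hedged Python sketch; it assumes the .msapp is a plain zip archive whose Entities.json root is a list of entries with "Name" and "DataEntityMetadataJson" properties, as described above (the exact layout may vary by app version):

import json
import zipfile

def print_datasource_schema(msapp_path, datasource_name):
    # The .msapp file is treated as a zip archive, per the renaming trick above.
    with zipfile.ZipFile(msapp_path) as app:
        entities = json.loads(app.read("Entities.json"))
    for entry in entities:
        if entry.get("Name") == datasource_name:
            # DataEntityMetadataJson holds the schema as a JSON-encoded string;
            # decode it and pretty-print it.
            metadata = json.loads(entry["DataEntityMetadataJson"])
            print(json.dumps(metadata, indent=2))
            return
    print("Data source not found:", datasource_name)

print_datasource_schema("Shoutouts.msapp", "Shoutout")  # file name is a placeholder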
Hope this helps!

Cloudant json index vs text index

Hi, I am trying to understand JSON indexes vs text indexes in Cloudant. Now I know that using
{ "index": {}, "type": "text" }
will make the entire document searchable. But what is the difference between, say,
{
  "index": {
    "fields": [
      "title"
    ]
  },
  "type": "json"
}
and
{
  "index": {
    "fields": [
      {
        "name": "title",
        "type": "string"
      }
    ]
  },
  "name": "title-text",
  "type": "text"
}
Thanks.
the json type:
leverages the Map phase of MapReduce
will build and query faster than a text type for a fixed key
no bookmark field
cannot use combination or array logical operators such as $regex as the basis of a query
only equality operators such as $eq, $gt, $gte, $lt, and $lte (but not $ne) can be used as the basis of a query
might end up doing more work in memory for complex queries
sorting fields must be indexed
the text type:
leverages a Lucene search index
permits indexing all fields in documents automatically with a single simple command
provides more flexibility to perform adhoc queries and sort across multiple keys
permits you to use any operator as a basis for query in a selector
a type (:string, :number) sometimes needs to be appended to the sort field (see the sketch at the end of this answer)
from: https://docs.cloudant.com/cloudant_query.html
If you know exactly what data you want to look for, or you want to keep storage and processing requirements to a minimum, you can specify how the index is created, by making it of type json.
But for maximum possible flexibility when looking for data, you would typically create an index of type text.
additional information:
https://developer.ibm.com/clouddataservices/docs/cloudant/get-started/use-cloudant-query/
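To make the comparison concrete, here is a hedged sketch that creates both index types and runs a couple of queries using the Cloudant Query HTTP endpoints with the requests library; the account URL, database name and credentials are placeholders:

import requests

base = "https://ACCOUNT.cloudant.com/mydb"   # placeholder account and database
auth = ("USERNAME", "PASSWORD")              # placeholder credentials

# json index on a fixed key: builds and queries quickly, equality-style operators only.
json_index = {"index": {"fields": ["title"]}, "type": "json"}

# text index (Lucene-backed): flexible ad hoc querying, e.g. $regex selectors.
text_index = {
    "index": {"fields": [{"name": "title", "type": "string"}]},
    "name": "title-text",
    "type": "text",
}

for definition in (json_index, text_index):
    resp = requests.post(base + "/_index", json=definition, auth=auth)
    print(resp.status_code, resp.json())

# An equality selector can be served by the json index...
print(requests.post(base + "/_find", json={"selector": {"title": {"$eq": "Hamlet"}}}, auth=auth).json())
# ...whereas a $regex selector cannot use the json index as its basis (see the list above)
# and relies on the text index instead.
print(requests.post(base + "/_find", json={"selector": {"title": {"$regex": "^Ham"}}}, auth=auth).json())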

How to add field descriptions programmatically in BigQuery table

I want to add field descriptions to a BigQuery table programmatically; I know how to do it in the UI.
I have this requirement because a few tables in my dataset are refreshed on a daily basis and we use "writeMode": "WRITE_TRUNCATE". This also deletes the descriptions of all the fields of the table.
I have also added the description to my schema file for the table, like this:
{
  "name" : "tax",
  "type" : "FLOAT",
  "description" : "Tax amount customer paid"
}
But I don't see the descriptions in my final table after running the scripts to load data.
The Tables API (https://cloud.google.com/bigquery/docs/reference/v2/tables) allows you to set descriptions for the table and its schema fields.
You can set descriptions during
table creation - https://cloud.google.com/bigquery/docs/reference/v2/tables/insert
or after the table is created, using one of the APIs below:
Patch - https://cloud.google.com/bigquery/docs/reference/v2/tables/patch
or Update - https://cloud.google.com/bigquery/docs/reference/v2/tables/update
I think the Patch API is more suitable in your case.
The link below shows the table resource properties you can set with those APIs:
https://cloud.google.com/bigquery/docs/reference/v2/tables#resource
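For example, a hedged sketch with the google-cloud-bigquery Python client, whose update_table call patches just the listed properties (table name and field name are placeholders matching the schema snippet above):

from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("mydataset.mytable")  # placeholder table reference

# Rebuild the schema, attaching a description to the "tax" field.
new_schema = []
for field in table.schema:
    if field.name == "tax":
        field = bigquery.SchemaField(
            field.name, field.field_type, mode=field.mode,
            description="Tax amount customer paid",
        )
    new_schema.append(field)

table.schema = new_schema
client.update_table(table, ["schema"])  # only the schema property is patched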
BigQuery load jobs accept a schema that includes "description" with each field.
https://cloud.google.com/bigquery/docs/reference/v2/jobs#configuration.load
If you specify the description along with each field you are creating during your WRITE_TRUNCATE operation, the descriptions should be applied to the destination table.
Here's a snippet from the above link that includes the schema you are specifying:
"load": {
"sourceUris": [
string
],
"schema": {
"fields": [
{
"name": string,
"type": string,
"mode": string,
"fields": [
(TableFieldSchema)
],
"description": string
}
]
},
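For reference, here is a hedged sketch of the same idea with the google-cloud-bigquery Python client: the descriptions are supplied as part of the schema on a WRITE_TRUNCATE load (source format, URI and table name are placeholders):

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,  # adjust to your source format
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    schema=[
        bigquery.SchemaField("tax", "FLOAT", description="Tax amount customer paid"),
        # ... the remaining fields, each with its own description
    ],
)

load_job = client.load_table_from_uri(
    "gs://mybucket/mydata.json",  # placeholder source URI
    "mydataset.mytable",          # placeholder destination table
    job_config=job_config,
)
load_job.result()  # the descriptions end up on the truncated-and-reloaded table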

Null nested fields in Google BigQuery

I'm trying to upload a JSON file to BigQuery containing a nested field which is null, but it's not being accepted.
I tried a lot of different syntax but I always got the error:
File: 0 / Offset:0 / Line:1 / Column:410, missing required field(s)
I tried to send the value as the different values listed below, and even tried omitting it...
"quotas": []
"quotas": null
"quotas": "null"
etc...
The schema definition...
[..]
"name": "quotas",
"type": "record",
"mode": "repeated",
"fields":[
{
"name": "service",
"type": "string",
"mode": "nullable"
},
[..]
]
[..]
From what I can tell in the logs for the import worker for that job, the line in question is missing a required field (the field name starts with "msi"). The line is otherwise well formatted.
I've filed a bug that BigQuery should give the name of the required field or fields that are missing to make this easier to debug in the future.
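For what it's worth, an empty repeated record itself is accepted; the failure was about a separate required field. Here is a hedged sketch with the google-cloud-bigquery Python client (the schema, the "msi_..." field name and the table are illustrative placeholders, not the asker's actual ones):

import io
import json
from google.cloud import bigquery

client = bigquery.Client()

schema = [
    bigquery.SchemaField("quotas", "RECORD", mode="REPEATED", fields=[
        bigquery.SchemaField("service", "STRING", mode="NULLABLE"),
    ]),
    # Hypothetical required field standing in for the real "msi..." field from the logs.
    bigquery.SchemaField("msi_example", "STRING", mode="REQUIRED"),
]

row = {"quotas": [], "msi_example": "value"}  # an empty repeated record loads fine
job_config = bigquery.LoadJobConfig(
    schema=schema,
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
)
job = client.load_table_from_file(
    io.BytesIO((json.dumps(row) + "\n").encode()),
    "mydataset.mytable",  # placeholder destination table
    job_config=job_config,
)
job.result()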