No schema when creating Sheets-based external table from command line - google-bigquery

For one specific Sheets document I was referencing as an external table, the header row was being included in the data when executing queries. I decided to drop the table and recreate it using a table definition file, which should expose the options described in the docs. It doesn't seem to work: no schema is created, despite one being defined in the file.
I've recreated the issue with a simple sheet with 3 columns and a frozen header row and the following test.def file:
{
  "autodetect": false,
  "schema": {
    "fields": [
      {"name": "c1", "type": "STRING", "mode": "NULLABLE"},
      {"name": "c2", "type": "STRING", "mode": "NULLABLE"},
      {"name": "c3", "type": "STRING", "mode": "NULLABLE"}
    ]
  },
  "sourceFormat": "GOOGLE_SHEETS",
  "sourceUris": [
    "https://docs.google.com/spreadsheets/..."
  ],
  "googleSheetsOptions": {
    "skipLeadingRows": 1
  }
}
and then I try to create the table using:
bq mk myproject:mydataset.mytable < test.def
the table is created but no schema is present - what am I doing wrong?
This issue remains, but I cannot identify why: 95% of the time the table is created OK and the first/header row is correctly excluded from the data returned by a query, yet in one specific case, created the same way as all the others, the header row is returned in the data ...
Odd :(
M

OK so the correct syntax is:
bq mk --external_table_definition=myfile.def project:dataset.table
This also allows you to tell Google to skip leading rows on the sheet (at the time of writing this is not possible from the BigQuery UI).
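With the file and table names from the question above, that would presumably be:
bq mk --external_table_definition=test.def myproject:mydataset.mytable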
M

Related

Change JSON Keys in Nested JSON in SQL Table

I have a table with a column called tableJson which contains information of the following type:
[
  {
    "type": "TABLE",
    "content": {
      "rows": [
        {"Date": "2021-09-28", "Monthly return": "1.44%"},
        {"Date": "2021-11-24", "Yearly return": "0.62%"},
        {"Date": "2021-12-03", "Monthly return": "8.57%"},
        {}
      ]
    }
  }
]
I want to change "Monthly return" to "Weekly return" everywhere in the table column where it exists.
Thank you in advance!
I tried different approaches with JSON parsing, OPENJSON and CROSS APPLY but could not make it work.
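For what it's worth, a minimal sketch of one possible approach, assuming SQL Server and an illustrative table name myTable (neither is stated in the question), is a plain string replacement on the JSON text, since the key only ever appears as a quoted property name:
-- rename the JSON key in place via text replacement (assumes "Monthly return" never appears as a value)
UPDATE myTable
SET tableJson = REPLACE(tableJson, '"Monthly return"', '"Weekly return"')
WHERE tableJson LIKE '%"Monthly return"%';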

changing column type of delta table and glue catalog

I have a strange problem that I can't figure out. I'm stuck.
We are writing data as it comes to s3 in the format of delta lake like this:
df.write.format("delta").mode("overwrite").save("s3://path_to_table/")
delta_table = DeltaTable.forPath(spark, "s3://path_to_table/")
delta_table.generate("symlink_format_manifest")
We also manually create the glue database and glue table like this:
glue_clt = boto3.client("glue", region_name="us-east-1")
glue_clt.create_table(
    DatabaseName="database_name",
    TableInput={
        "Name": "table_name",
        "StorageDescriptor": {
            "Columns": [{"Name": "column1", "Type": "double"}, {"Name": "column2", "Type": "string"}],
            "Location": "s3://path_to_table/_symlink_format_manifest",
            "InputFormat": "org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            }
        },
        "PartitionKeys": [{"Name": "column3", "Type": "string"}, {"Name": "column4", "Type": "string"}],
        "TableType": "EXTERNAL_TABLE"
    }
)
We then had new data come in with a data type change on "column1". We catch this error and want to run the data through a new process, which changes the type like this:
delta_table_df = spark.read.format("delta").load("s3://path_to_table/")
delta_table_df = delta_table_df.withColumn("column1", col("column1").cast("string"))
delta_table_df.write.format("delta").mode("overwrite") \
    .partitionBy(["column3", "column4"]).option("overwriteSchema", "true") \
    .save("s3://path_to_table/")
delta_table = DeltaTable.forPath(spark, "s3://path_to_table/")
delta_table.generate("symlink_format_manifest")
After doing this I can confirm the schema has changed on the underlying parquet files, because when I read the data back in, I see that "column1" has a type of "string".
Yet I get this error when trying to query in Athena:
HIVE_BAD_DATA: Field column1's type BINARY in parquet file
s3://bucket/test_source/test_database/test_schema/test_table/column3=Q/column4=541/part-00001-2a918783-6cd1-4cd8-9a68-28c63ab40989.c000.snappy.parquet
is incompatible with type double defined in table schema
what am I missing?
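For reference, the Athena error says the parquet data is now string-like (BINARY) while the Glue table still defines column1 as double, so the catalog definition presumably needs to be brought in line with the rewritten data. A hedged sketch, mirroring the create_table call above with the same illustrative names, might look like:
# update the existing Glue table so the catalog type matches the rewritten delta/parquet data
glue_clt.update_table(
    DatabaseName="database_name",
    TableInput={
        "Name": "table_name",
        "StorageDescriptor": {
            "Columns": [{"Name": "column1", "Type": "string"}, {"Name": "column2", "Type": "string"}],
            "Location": "s3://path_to_table/_symlink_format_manifest",
            "InputFormat": "org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            }
        },
        "PartitionKeys": [{"Name": "column3", "Type": "string"}, {"Name": "column4", "Type": "string"}],
        "TableType": "EXTERNAL_TABLE"
    }
)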

How can a required field in bigquery be nullable in a select?

I'm doing some metaprogramming with BigQuery and noticed something I didn't expect.
I'm using this query:
SELECT * FROM `bigquery-public-data.samples.shakespeare` LIMIT 1000
which goes against a public dataset. If you analyze that query, you will see that the result schema looks like this:
"schema": {
"fields": [
{
"name": "word",
"type": "STRING",
"mode": "NULLABLE"
},
{
"name": "word_count",
"type": "INTEGER",
"mode": "NULLABLE"
},
{
"name": "corpus",
"type": "STRING",
"mode": "NULLABLE"
},
{
"name": "corpus_date",
"type": "INTEGER",
"mode": "NULLABLE"
}
]
},
This might look good at first, but if you look at the table definition for bigquery-public-data.samples.shakespeare you will notice that every field in that select is REQUIRED in the table, so why does it end up being NULLABLE in the schema for the select?
Some context:
I'm working on an F# type provider where I try to encode all the values as correctly as possible. That means nullable fields as option types and non-nullable fields as regular types. If I always get NULLABLE, usage becomes much more cumbersome for fields that can never actually be null.
Even though fields are REQUIRED in the table schema, a query can apply transformations which convert non-NULL values to NULL values, so the query result may have a different schema (both with respect to nullability and to data types) than the original table had.
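As a quick illustration (this query is not from the original post): word is REQUIRED in the table, but the expression below can clearly produce NULL, so the result schema for that column has to be NULLABLE:
SELECT IF(word_count > 10, word, NULL) AS word
FROM `bigquery-public-data.samples.shakespeare`
LIMIT 1000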

What properties of an item should be included in the json schema?

I am confused about which situations the properties in my JSON schemas are meant to cover.
Assume I have an item product for which I am trying to define a schema. In my database the products table has id, brand_id, name, item_number and description. All except description are required fields. The id is autogenerated by the database and the brand_id is set automatically by the API upon creation, based on the creating user.
This means I can POST /api/products using only the following data:
{
  "product": {
    "name": "Product Name",
    "item_number": "item001"
  }
}
However, how should I now define the product schema? Should I include the properties id and brand_id? If so, should I label them as required, even though they are set automatically?
This is what I came up with:
{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "id": "http://jsonschema.net/products",
  "type": "object",
  "properties": {
    "item_number": {
      "id": "http://jsonschema.net/products/item_number",
      "type": "string"
    },
    "name": {
      "id": "http://jsonschema.net/products/name",
      "type": "string"
    },
    "description": {
      "id": "http://jsonschema.net/products/description",
      "type": "string",
      "default": "null"
    }
  },
  "required": [
    "item_number",
    "name"
  ]
}
You should only define in your JSON schema properties that are dealt with by the user of the API.
In your case, it makes no sense to have id and brand_id in the schema that defines the POST body entity for the creation of a new product because these values are not provided by the API user.
This said, you may have another schema for existing product entities where these two fields are present, if it's OK to expose them publicly.
If that's the case, you can use the schema union mechanism: have the "existing product" schema use allOf with new_product.json and add id and brand_id to it.
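A rough sketch of such an "existing product" schema (the integer type for the ids and the exact file name are assumptions) could be:
{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "id": "http://jsonschema.net/existing_products",
  "allOf": [
    { "$ref": "new_product.json" },
    {
      "type": "object",
      "properties": {
        "id": { "type": "integer" },
        "brand_id": { "type": "integer" }
      },
      "required": ["id", "brand_id"]
    }
  ]
}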

AWS: Other function than COPY when transferring data from S3 to Redshift with amazon-data-pipeline

I'm trying to transfer data from Amazon S3 to Amazon Redshift with the AWS Data Pipeline tool.
Is it possible to change the data while transferring it, e.g. with an SQL statement, so that only the results of that SQL statement become the input into Redshift?
I only found the Copy command, like:
{
  "id": "S3Input",
  "type": "S3DataNode",
  "schedule": {
    "ref": "MySchedule"
  },
  "filePath": "s3://example-bucket/source/inputfile.csv"
},
Source: https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-get-started-copy-data-cli.html
Yes, it is possible. There are two approaches to it:
Use transformSql of RedshiftCopyActivity
transformSql is useful if the transformation is performed within the scope of the records that are getting loaded on a scheduled basis, e.g. every day or hour. That way changes are only applied to the batch and not to the whole table.
Here is an excerpt from the documentation:
transformSql: The SQL SELECT expression used to transform the input data. When you copy data from DynamoDB or Amazon S3, AWS Data Pipeline creates a table called staging and initially loads it in there. Data from this table is used to update the target table. If the transformSql option is specified, a second staging table is created from the specified SQL statement. The data from this second staging table is then updated in the final target table. So transformSql must be run on the table named staging and the output schema of transformSql must match the final target table's schema.
Please find an example of the usage of transformSql below. Notice that the select is from the staging table. It will effectively run CREATE TEMPORARY TABLE staging2 AS SELECT <...> FROM staging;. Also, all fields must be included and must match the existing table in the Redshift DB.
{
  "id": "LoadUsersRedshiftCopyActivity",
  "name": "Load Users",
  "insertMode": "OVERWRITE_EXISTING",
  "transformSql": "SELECT u.id, u.email, u.first_name, u.last_name, u.admin, u.guest, CONVERT_TIMEZONE('US/Pacific', cs.created_at_pst) AS created_at_pst, CONVERT_TIMEZONE('US/Pacific', cs.updated_at_pst) AS updated_at_pst FROM staging u;",
  "type": "RedshiftCopyActivity",
  "runsOn": {
    "ref": "OregonEc2Resource"
  },
  "schedule": {
    "ref": "HourlySchedule"
  },
  "input": {
    "ref": "OregonUsersS3DataNode"
  },
  "output": {
    "ref": "OregonUsersDashboardRedshiftDatabase"
  },
  "onSuccess": {
    "ref": "LoadUsersSuccessSnsAlarm"
  },
  "onFail": {
    "ref": "LoadUsersFailureSnsAlarm"
  },
  "dependsOn": {
    "ref": "BewteenRegionsCopyActivity"
  }
}
Use script of SqlActivity
SqlActivity allows operations on the whole dataset and can be scheduled to run after particular events through the dependsOn mechanism:
{
  "name": "Add location ID",
  "id": "AddCardpoolLocationSqlActivity",
  "type": "SqlActivity",
  "script": "INSERT INTO locations (id) SELECT 100000 WHERE NOT EXISTS (SELECT * FROM locations WHERE id = 100000);",
  "database": {
    "ref": "DashboardRedshiftDatabase"
  },
  "schedule": {
    "ref": "HourlySchedule"
  },
  "output": {
    "ref": "LocationsDashboardRedshiftDatabase"
  },
  "runsOn": {
    "ref": "OregonEc2Resource"
  },
  "dependsOn": {
    "ref": "LoadLocationsRedshiftCopyActivity"
  }
}
There is an optional field in RedshiftCopyActivity called 'transformSql'.
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-redshiftcopyactivity.html
I have not personally used this, but from the looks of it, it seems like you treat your S3 data as being in a temp table and this SQL statement returns the transformed data for Redshift to insert.
So, you will need to list all fields in the select whether or not you are transforming that field.
AWS Datapipeline SqlActivity
{
  "id": "MySqlActivity",
  "type": "SqlActivity",
  "database": { "ref": "MyDatabase" },
  "script": "insert into AnalyticsTable (select (cast(requestEndTime as bigint) - cast(requestBeginTime as bigint)) as requestTime, hostname from StructuredLogs where hostname LIKE '%.domain.sfx');",
  "schedule": { "ref": "Hour" },
  "queue": "priority"
}
So basically, in "script" you can put any SQL script/transformations/commands (see Amazon Redshift SQL Commands).
transformSql is fine, but it supports only "the SQL SELECT expression used to transform the input data" (ref: RedshiftCopyActivity).