changing column type of delta table and glue catalog - amazon-s3

I have a strange problem that I can't figure out. I'm stuck.
We write incoming data to S3 in Delta Lake format like this:
from delta.tables import DeltaTable

df.write.format("delta").mode("overwrite").save("s3://path_to_table/")
delta_table = DeltaTable.forPath(spark, "s3://path_to_table/")
delta_table.generate("symlink_format_manifest")
We also manually create the Glue database and Glue table like this:
import boto3

glue_clt = boto3.client("glue", region_name="us-east-1")
glue_clt.create_table(
    DatabaseName="database_name",
    TableInput={
        "Name": "table_name",
        "StorageDescriptor": {
            "Columns": [{"Name": "column1", "Type": "double"}, {"Name": "column2", "Type": "string"}],
            "Location": "s3://path_to_table/_symlink_format_manifest",
            "InputFormat": "org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe",
            }
        },
        "PartitionKeys": [{"Name": "column3", "Type": "string"}, {"Name": "column4", "Type": "string"}],
        "TableType": "EXTERNAL_TABLE"
    }
)
We then had new data come in with a data type change on "column1". We catch the resulting error and route the data through a new process, which changes the type like this:
from pyspark.sql.functions import col

# Read the existing table, cast column1 to string, and overwrite data and schema
delta_table_df = spark.read.format('delta').load("s3://path_to_table/")
delta_table_df = delta_table_df.withColumn("column1", col("column1").cast("string"))
delta_table_df.write.format("delta").mode("overwrite") \
    .partitionBy(["column3", "column4"]).option("overwriteSchema", "true") \
    .save("s3://path_to_table/")
delta_table = DeltaTable.forPath(spark, "s3://path_to_table/")
delta_table.generate("symlink_format_manifest")
After doing this, I can confirm that the schema has changed in the underlying Parquet files: when I read the data back in, "column1" has type "string".
Yet I get this error when trying to query in Athena:
HIVE_BAD_DATA: Field column1's type BINARY in parquet file
s3://bucket/test_source/test_database/test_schema/test_table/column3=Q/column4=541/part-00001-2a918783-6cd1-4cd8-9a68-28c63ab40989.c000.snappy.parquet
is incompatible with type double defined in table schema
What am I missing?
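The error message says the table schema still declares column1 as double, which matches the Glue table created earlier, while the rewritten Parquet files now store it as a string. One possible direction, sketched here under assumptions and not confirmed by the thread, is to update the column type in the Glue table definition as well. The helper names `retyped_table_input` and `sync_column_type` are hypothetical; `get_table` and `update_table` are boto3 Glue client calls.

```python
import copy

# Keys that update_table accepts inside TableInput (a subset). get_table also
# returns read-only keys such as CreateTime, which have to be dropped.
TABLE_INPUT_KEYS = ("Name", "Description", "Owner", "Retention",
                    "StorageDescriptor", "PartitionKeys", "TableType", "Parameters")


def retyped_table_input(table, column, new_type):
    """Build a TableInput dict from a get_table()["Table"] dict with one column retyped."""
    table_input = {k: copy.deepcopy(table[k]) for k in TABLE_INPUT_KEYS if k in table}
    for col in table_input["StorageDescriptor"]["Columns"]:
        if col["Name"] == column:
            col["Type"] = new_type
    return table_input


def sync_column_type(database, table_name, column, new_type, region="us-east-1"):
    """Hypothetical helper: read the table from Glue, retype one column, write it back."""
    import boto3  # imported lazily so the pure helper above works without boto3

    glue = boto3.client("glue", region_name=region)
    table = glue.get_table(DatabaseName=database, Name=table_name)["Table"]
    glue.update_table(DatabaseName=database,
                      TableInput=retyped_table_input(table, column, new_type))
```

With the names from the question, the call would be `sync_column_type("database_name", "table_name", "column1", "string")`. Whether this alone clears the Athena error is an assumption; existing partitions may carry their own column definitions.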

Related

Is there a way to match avro schema with Bigquery and Bigtable?

I'd like to import BigQuery data into Bigtable using Google Composer.
Exporting the BigQuery rows in Avro format to GCS was successful. However, importing the Avro data into Bigtable was not.
The error says
Caused by: org.apache.avro.AvroTypeException: Found Root, expecting com.google.cloud.teleport.bigtable.BigtableRow, missing required field key
I guess the BigQuery and Bigtable schemas have to match each other, but I have no idea how to do this.
For every record read from the Avro files:
Attributes present in the files and in the table are loaded into the table.
Attributes present in the file but not in the table are subject to ignore_unknown_fields.
Attributes that exist in the table but not in the file will use their default value, if one is set.
The links below are helpful.
[1] https://cloud.google.com/dataflow/docs/guides/templates/provided-batch#cloud-storage-avro-to-bigtable
[2] https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/master/src/main/resources/schema/avro/bigtable.avsc
[3] Avro to BigTable - Schema issue?
For those who, like me, still have problems because they are not familiar with Avro, here is one working schema transformation that I found after some tinkering.
For example, if you have a table from BigQuery like this
And you want to use user_id as the Bigtable row key and ingest all columns, here is example code to encode them as an Avro file.
import json
from datetime import datetime

# Parse is the avro-python3 name; newer releases of the avro package call it parse.
from avro.schema import Parse
from avro.io import DatumWriter
from avro.datafile import DataFileWriter

# Avro schema of a Bigtable row: a key plus an array of cells
bigtable_schema = {
    "name" : "BigtableRow",
    "type" : "record",
    "namespace" : "com.google.cloud.teleport.bigtable",
    "fields" : [
        { "name" : "key", "type" : "bytes"},
        { "name" : "cells",
          "type" : {
              "type" : "array",
              "items": {
                  "name": "BigtableCell",
                  "type": "record",
                  "fields": [
                      { "name" : "family", "type" : "string"},
                      { "name" : "qualifier", "type" : "bytes"},
                      { "name" : "timestamp", "type" : "long", "logicalType" : "timestamp-micros"},
                      { "name" : "value", "type" : "bytes"}
                  ]
              }
          }
        }
    ]
}
parsed_schema = Parse(json.dumps(bigtable_schema))

row_key = 'user_id'
family_name = 'feature_name'
feature_list = ['channel', 'zip_code', 'history']

# df is the pandas DataFrame holding the rows exported from BigQuery
with open('features.avro', 'wb') as f:
    writer = DataFileWriter(f, DatumWriter(), parsed_schema)
    for _, row in df.iterrows():
        # Cell timestamp in microseconds
        ts = int(datetime.now().timestamp()) * 1000 * 1000
        # One Avro record per feature, all sharing the same row key
        for feat in feature_list:
            writer.append({
                "key": row[row_key].encode('utf-8'),
                "cells": [{"family": family_name,
                           "qualifier": feat.encode('utf-8'),
                           "timestamp": ts,
                           "value": str(row[feat]).encode('utf-8')}]
            })
    writer.close()
Then you can use a Dataflow template job to run the ingestion.
Complete code can be found here: https://github.com/mitbal/sidu/blob/master/bigquery_to_bigtable.ipynb
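As a variant of the encoding step above, the record shape can also be built without the avro library, which makes it easy to inspect before writing. This is a minimal sketch, not the author's code: the helper name `to_bigtable_row` is hypothetical, and it packs all cells of one source row into a single record instead of emitting one record per feature.

```python
def to_bigtable_row(row, row_key, family, qualifiers, timestamp_micros):
    """Shape one source row as a dict matching the BigtableRow Avro record."""
    return {
        # The row key must be bytes, as required by the "key" field
        "key": str(row[row_key]).encode("utf-8"),
        # One cell per column to ingest, all in the same column family
        "cells": [
            {
                "family": family,
                "qualifier": q.encode("utf-8"),
                "timestamp": timestamp_micros,
                "value": str(row[q]).encode("utf-8"),
            }
            for q in qualifiers
        ],
    }
```

The returned dict has the `key` field whose absence triggered the "missing required field key" error in the question, so it can be passed to `writer.append` in place of the inline dict.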

postgresql filter data from bytea column

I have a table where I am saving data in a column of type bytea; the data is actually a JSON object.
I need to implement a filter on the JSON data.
SELECT cast(job_data::TEXT as jsonb) FROM job_details where job_data ->> "organization" = "ABC";
This query does not work.
The JSON object looks like this:
{
"uid": "FdUR4SB0h7",
"Type": "Reference Data Service",
"user": "hk#ss.com",
"SubType": "Reference Data Task",
"_version": 1,
"Frequency": "Once",
"Parameters": "sdfsdfsdfds",
"organization": "ABC",
"StartDateTime": "2020-01-20T10:30:00Z"
}
You need to apply the predicate to the converted column. Note that the conversion may not work, depending on the encoding of the stored bytes. Try something like this:
SELECT
*
FROM
job_details
WHERE
convert_from(job_data, 'UTF-8')::json ->> 'organization' = 'ABC';
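To make the two steps in that predicate explicit (decode the bytes as UTF-8 text, then read a key from the parsed JSON), here is a client-side Python illustration. It is only an analogy for what the query does, not a replacement for it, and the function name `organization_matches` is hypothetical. Unlike this sketch, which returns False on undecodable input, PostgreSQL raises an error when the conversion fails.

```python
import json


def organization_matches(job_data, wanted):
    """Return True if the UTF-8 JSON stored in the bytes has organization == wanted."""
    try:
        doc = json.loads(job_data.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return False  # bytes are not UTF-8 text, or the text is not valid JSON
    return isinstance(doc, dict) and doc.get("organization") == wanted
```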

BigQuery: --[no]use_avro_logical_types flag doesn't work

I am trying to use the bq command with the --[no]use_avro_logical_types flag to load Avro files into a BigQuery table that does not exist before the command runs. The Avro schema contains a timestamp-millis logical type. When the command is executed, a new table is created, but the type of that column becomes INTEGER.
This is a recently released feature, so I cannot find examples and I don't know what I am missing. Could anyone give me a good example?
My Avro schema looks like the following:
...
}, {
"name" : "timestamp",
"type" : [ "null", "long" ],
"default" : null,
"logicalType" : [ "null", "timestamp-millis" ]
}, {
...
And the command I execute is this:
bq load --source_format=AVRO --use_avro_logical_types <table> <path/to/file>
To use the timestamp-millis logical type, you can specify the field in the following way:
{
"name" : "timestamp",
"type" : {"type": "long", "logicalType" : "timestamp-millis"}
}
In order to provide an optional 'null' value, you can try out the following spec:
{
"name" : "timestamp",
"type" : ["null", {"type" : "long", "logicalType" : "timestamp-millis"}]
}
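The difference between the schema in the question and the two specs above is where logicalType lives: it has to be nested inside the type, not placed next to it at field level. A small sketch of that rewrite follows; the helper `nest_logical_type` is hypothetical and assumes the field uses only primitive types, optionally in a union with "null".

```python
def nest_logical_type(field):
    """Return a copy of an Avro field with a field-level logicalType moved into its type."""
    fixed = {k: v for k, v in field.items() if k != "logicalType"}
    logical = field.get("logicalType")
    if logical is None:
        return fixed
    # The question's schema used a list for logicalType; keep the non-null entry.
    if isinstance(logical, list):
        logical = next(name for name in logical if name != "null")

    def annotate(avro_type):
        # "null" stays a bare string; every other branch gets the annotation
        if avro_type == "null":
            return avro_type
        return {"type": avro_type, "logicalType": logical}

    if isinstance(field["type"], list):
        fixed["type"] = [annotate(t) for t in field["type"]]
    else:
        fixed["type"] = annotate(field["type"])
    return fixed
```

Applied to the field from the question, this produces the optional-null spec shown above, with the "default" entry carried over unchanged.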
For a full list of supported Avro logical types please refer to the Avro spec: https://avro.apache.org/docs/1.8.0/spec.html#Logical+Types.
According to https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-avro, the Avro logical type timestamp-millis is loaded into BigQuery as an INTEGER by default, that is, when the logical type is ignored and the underlying long is used.

No schema when creating Sheets-based external table from command line

For one specific Sheets document that I was trying to reference as an external table, I found that the heading row was being included in the data when executing queries*. I decided to drop the table and recreate it using a definition file, which would expose the options from the docs. It didn't seem to work: no schema is created, despite being defined in the file.
I've recreated the issue with a simple sheet with 3 columns and a frozen header row, and the following test.def file:
{
"autodetect": false,
"schema": {
"fields": [
{"name": "c1", "type": "STRING", "mode": "nullable"},
{"name": "c2", "type": "STRING", "mode": "nullable"},
{"name": "c3", "type": "STRING", "mode": "nullable"}
]
},
"sourceFormat": "GOOGLE_SHEETS",
"sourceUris": [
"https://docs.google.com/spreadsheets/..."
],
"googleSheetsOptions": {
"skipLeadingRows": 1
}
}
and then I try to create the table using:
bq mk myproject:mydataset.mytable < test.def
The table is created but no schema is present. What am I doing wrong?
* This issue remains, and I cannot identify why: 95% of the time the table is created OK and the first/header row is correctly excluded from the data returned by a query, but in one specific case, created the same way as all the others, the header row is returned in the data...
Odd :(
M
OK, so the correct syntax is:
bq mk --external_table_definition=myfile.def project:dataset.table
This also allows you to tell Google to skip leading rows on the sheet (as of the time of writing, this is not possible from the BigQuery UI).
M
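Since the definition file has to be valid JSON, one way to avoid hand-editing slips such as a stray trailing comma is to generate it. The sketch below mirrors the structure of the test.def file from the question; the function names are hypothetical, and the field values are copied from the question, not checked against the BigQuery reference.

```python
import json


def sheets_table_definition(sheet_uri, column_names, skip_leading_rows=1):
    """Build an external table definition dict for a Google Sheets source."""
    return {
        "autodetect": False,
        "schema": {
            "fields": [
                {"name": name, "type": "STRING", "mode": "nullable"}
                for name in column_names
            ]
        },
        "sourceFormat": "GOOGLE_SHEETS",
        "sourceUris": [sheet_uri],
        "googleSheetsOptions": {"skipLeadingRows": skip_leading_rows},
    }


def write_definition(path, definition):
    """Serialize the definition so it can be passed to --external_table_definition."""
    with open(path, "w") as f:
        json.dump(definition, f, indent=2)
```

For example, `write_definition("myfile.def", sheets_table_definition(uri, ["c1", "c2", "c3"]))` writes a file that can then be used with the bq mk command above, where `uri` is the sheet URL.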

AWS: Other function than COPY by transferring data from S3 to Redshift with amazon-data-pipeline

I'm trying to transfer data from Amazon S3 to Amazon Redshift with the AWS Data Pipeline tool.
Is it possible to change the data while transferring it, e.g. with an SQL statement, so that only the results of the SQL statement are loaded into Redshift?
I only found the copy command, like:
{
"id": "S3Input",
"type": "S3DataNode",
"schedule": {
"ref": "MySchedule"
},
"filePath": "s3://example-bucket/source/inputfile.csv"
},
Source: https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-get-started-copy-data-cli.html
Yes, it is possible. There are two approaches:
Use transformSql of RedshiftCopyActivity
transformSql is useful if the transformations are performed within the scope of the records that are loaded on a regular basis, e.g. every day or hour. That way, changes are applied only to the batch and not to the whole table.
Here is an excerpt from the documentation:
transformSql: The SQL SELECT expression used to transform the input data. When you copy data from DynamoDB or Amazon S3, AWS Data Pipeline creates a table called staging and initially loads it in there. Data from this table is used to update the target table. If the transformSql option is specified, a second staging table is created from the specified SQL statement. The data from this second staging table is then updated in the final target table. So transformSql must be run on the table named staging and the output schema of transformSql must match the final target table's schema.
Below is an example of using transformSql. Notice that the select is from the staging table. It will effectively run CREATE TEMPORARY TABLE staging2 AS SELECT <...> FROM staging;. Also, all fields must be included and must match the existing table in the Redshift database.
{
"id": "LoadUsersRedshiftCopyActivity",
"name": "Load Users",
"insertMode": "OVERWRITE_EXISTING",
"transformSql": "SELECT u.id, u.email, u.first_name, u.last_name, u.admin, u.guest, CONVERT_TIMEZONE('US/Pacific', u.created_at_pst) AS created_at_pst, CONVERT_TIMEZONE('US/Pacific', u.updated_at_pst) AS updated_at_pst FROM staging u;",
"type": "RedshiftCopyActivity",
"runsOn": {
"ref": "OregonEc2Resource"
},
"schedule": {
"ref": "HourlySchedule"
},
"input": {
"ref": "OregonUsersS3DataNode"
},
"output": {
"ref": "OregonUsersDashboardRedshiftDatabase"
},
"onSuccess": {
"ref": "LoadUsersSuccessSnsAlarm"
},
"onFail": {
"ref": "LoadUsersFailureSnsAlarm"
},
"dependsOn": {
"ref": "BewteenRegionsCopyActivity"
}
}
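The staging flow described above can be tried out locally. The sketch below uses SQLite from the Python standard library as a stand-in for Redshift, purely to illustrate the CREATE TEMPORARY TABLE staging2 AS SELECT ... FROM staging step; the table contents and the function name `transform_staging` are made up for the example.

```python
import sqlite3


def transform_staging(conn, transform_sql):
    """Materialize transform_sql (a SELECT on staging) as a second staging table."""
    conn.execute("DROP TABLE IF EXISTS staging2")
    conn.execute("CREATE TEMPORARY TABLE staging2 AS " + transform_sql)
    return conn.execute("SELECT * FROM staging2").fetchall()


conn = sqlite3.connect(":memory:")
# "staging" plays the role of the table the copy activity loads the input into
conn.execute("CREATE TEMPORARY TABLE staging (id INTEGER, email TEXT)")
conn.executemany("INSERT INTO staging VALUES (?, ?)",
                 [(1, "A@Example.com"), (2, "b@example.com")])

# Every target column is listed, whether or not it is transformed
rows = transform_staging(conn, "SELECT id, lower(email) AS email FROM staging")
```

The rows in staging2 are what would then be written to the target table, which is why the SELECT has to produce exactly the target table's columns.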
Use script of SqlActivity
SqlActivity allows operations on the whole dataset, and it can be scheduled to run after particular events through the dependsOn mechanism.
{
"name": "Add location ID",
"id": "AddCardpoolLocationSqlActivity",
"type": "SqlActivity",
"script": "INSERT INTO locations (id) SELECT 100000 WHERE NOT EXISTS (SELECT * FROM locations WHERE id = 100000);",
"database": {
"ref": "DashboardRedshiftDatabase"
},
"schedule": {
"ref": "HourlySchedule"
},
"output": {
"ref": "LocationsDashboardRedshiftDatabase"
},
"runsOn": {
"ref": "OregonEc2Resource"
},
"dependsOn": {
"ref": "LoadLocationsRedshiftCopyActivity"
}
}
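The script in this example is written so that it can run on every schedule tick without creating duplicates. A quick way to see that property is to run the same statement several times against a local SQLite database, used here only as a stand-in for Redshift; the single-column locations table is an assumption for the sketch.

```python
import sqlite3

# Same statement as the "script" field above
SCRIPT = ("INSERT INTO locations (id) SELECT 100000 "
          "WHERE NOT EXISTS (SELECT * FROM locations WHERE id = 100000);")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE locations (id INTEGER)")

# Simulate three scheduled runs of the activity
for _ in range(3):
    conn.execute(SCRIPT)

count = conn.execute(
    "SELECT COUNT(*) FROM locations WHERE id = 100000").fetchone()[0]
```

After the three runs, `count` holds the number of rows with id 100000, which stays at one because the NOT EXISTS guard skips the insert once the row is present.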
There is an optional field in RedshiftCopyActivity called 'transformSql'.
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-redshiftcopyactivity.html
I have not personally used this, but it looks like your S3 data is treated as if it were in a temp table, and this SQL statement returns the transformed data for Redshift to insert.
So you will need to list all fields in the select, whether or not you are transforming that field.
AWS Datapipeline SqlActivity
{
"id" : "MySqlActivity",
"type" : "SqlActivity",
"database" : { "ref": "MyDatabase" },
"script" : "insert into AnalyticsTable (select (cast(requestEndTime as bigint) - cast(requestBeginTime as bigint)) as requestTime, hostname from StructuredLogs where hostname LIKE '%.domain.sfx');",
"schedule" : { "ref": "Hour" },
"queue" : "priority"
}
So basically, "script" can contain any SQL script, transformation, or command (see Amazon Redshift SQL Commands).
transformSql is fine, but it supports only the SQL SELECT expression used to transform the input data (ref: RedshiftCopyActivity).