dbt snapshot fails on schema change in nested column - dbt

I get a snapshot datatypes error:
Value has type STRUCT<module STRING, reference_entity_class STRING,
reference_table STRING, ...> which cannot be inserted into column
created_by, which has type STRUCT<module STRING,
reference_entity_class STRING, reference_table STRING, ...> at [16:33]
The problematic column is a struct, and new fields/subcolumns were added to it. Does this mean snapshots cannot account for schema changes in nested columns, or am I missing something?
Thank you so much in advance!
Attached on the left is the updated schema of the tmp table to be merged; on the right is the original table.

Related

Table can't be queried after changing column position

When querying the table using "select * from t2p", the response is as below. I think I have missed some concepts; please help me out.
Failed with exception java.io.IOException:java.lang.ClassCastException: org.apache.hadoop.hive.serde2.lazy.objectinspector.LazyMapObjectInspector cannot be cast to org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector
Step1, create table
create table t2p(id int, name string, score map<string,double>)
partitioned by (class int)
row format delimited
fields terminated by ','
collection items terminated by '\;'
map keys terminated by ':'
lines terminated by '\n'
stored as textfile;
Step2, insert data like
1,zs,math:90.0;english:92.0
2,ls,chinese:89.0;math:80.0
3,xm,geo:87.0;math:80.0
4,lh,chinese:89.0;english:81.0
5,xw,physics:91.0;english:81.0
Step3, add another column
alter table t2p add columns (school string);
Step4, change column's order
alter table t2p change school school string after name;
Step5, do query and get error as mentioned above.
select * from t2p;
This is an obvious error.
Your command alter table t2p change school school string after name; changes metadata only. If you are moving columns, the data must already match the new schema, or you must change it to match by some other means.
Which means the map column's data has to line up with the new column position. In other words, if you want to move a column around, make sure the new column and the existing data types are the same.
I did a simple experiment with the int data type. It worked because the data types are not hugely different, but you can see that the metadata changed while the data stayed the same.
create table t2p(id int, name string, score int)
partitioned by (class int)
stored as textfile;
insert into t2p partition(class=1) select 100,'dum', 199;
alter table t2p add columns (school string);
alter table t2p change school school string after name;
MSCK REPAIR TABLE t2p;
select * from t2p;
You can see the new column school is mapped to position 3 (whose underlying data was defined as INT).
Solution - you can do this, but make sure the new structure and data types are compatible with the old structure.
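A minimal sketch of "changing the data to match by some other means": instead of reordering columns with ALTER ... CHANGE, create a new table with the desired column order and rewrite the data into it (t2p_new is an illustrative name, not from the original post), running this before the column move (or after reverting it):
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
create table t2p_new(id int, name string, school string, score map<string,double>)
partitioned by (class int)
row format delimited
fields terminated by ','
collection items terminated by '\;'
map keys terminated by ':'
lines terminated by '\n'
stored as textfile;
-- Rewriting physically reorders the data so it matches the new schema,
-- unlike the metadata-only ALTER ... CHANGE.
insert overwrite table t2p_new partition(class)
select id, name, school, score, class from t2p;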

Insert into Nested records in Bigquery FROM another nested table

I am trying to insert data from one Bigquery table (nested) to another bigquery table (nested). However, I am getting issues during insert.
Source schema: T1
FieldName Type Mode
User STRING NULLABLE
order RECORD REPEATED
order.Name STRING NULLABLE
order.location STRING NULLABLE
order.subscription RECORD NULLABLE
order.subscription.date TIMESTAMP NULLABLE
order.Details RECORD REPEATED
order.Details.id STRING NULLABLE
order.Details.nextDate STRING NULLABLE
Target schema: T2
FieldName Type Mode
User STRING NULLABLE
order RECORD REPEATED
order.Name STRING NULLABLE
order.location STRING NULLABLE
order.subscription RECORD NULLABLE
order.subscription.date TIMESTAMP NULLABLE
order.Details RECORD REPEATED
order.Details.id STRING NULLABLE
order.Details.nextDate STRING NULLABLE
I am trying to use the insert into functionality of BigQuery. I am looking to insert only a few fields from the source table. My query is like below:
INSERT INTO T2 (user,order.name,order.subscription.date,details.id)
SELECT user,order.name,order.subscription.date,details.id
from
T1 o
join unnest (o.order) order,
unnest ( order.details) details
After a bit of googling I am aware that I would need to use STRUCT when defining field names while inserting, but I'm not sure how to do it. Any help is appreciated. Thanks in advance!
You will have to insert the records in the shape your destination table needs; STRUCT types need to be inserted fully (with all the fields they contain).
I provide a small sample below. I build the following table with a single record to explain this:
create or replace table `project-id.dataset-id.table-source` (
user STRING,
order_detail STRUCT<name STRING, location STRING,subscription STRUCT<datesub TIMESTAMP>,details STRUCT<id STRING,nextDate STRING>>
)
insert into `project-id.dataset-id.table-source` (user,order_detail)
values ('Karen',STRUCT('ShopAPurchase','Germany',STRUCT('2022-03-01'),STRUCT('1','2022-03-05')))
With that information we can now start inserting into our destination tables. In our sample, I'm reusing the source table and just adding an additional record to it like this:
insert into `project-id.dataset-id.table-source` (user,order_detail)
select 'Anna',struct(ox.name,'Japan',ox.subscription,struct('2',dx.nextDate))
from `project-id.dataset-id.table-source` o
join unnest ([o.order_detail]) ox, unnest ([o.order_detail.details]) dx
You will see that in order to unnest a struct I have to wrap the value inside an array ([]), because UNNEST flattens an array into rows. Also, when inserting struct types you have to build the struct yourself or use the flattened records to create that struct column.
If you want to add additional records inside a STRUCT you will have to declare your destination table with an ARRAY inside of it. Let's look at this new table, source_array:
create or replace table `project-id.dataset-id.table-source_array` (
user STRING,
order_detail STRUCT<name STRING, location STRING,subscription STRUCT<datesub TIMESTAMP>,details ARRAY<STRUCT<id STRING ,nextDate STRING>>>
)
insert into `project-id.dataset-id.table-source_array` (user,order_detail)
values ('Karen',STRUCT('ShopAPurchase','Germany',STRUCT('2022-03-01'),[STRUCT('1','2022-03-05')]))
insert into `project-id.dataset-id.table-source_array` (user,order_detail)
select 'Anna',struct(ox.name,'Japan',ox.subscription,[struct('2',dx.nextDate),struct('3',dx.nextDate)])
from `project-id.dataset-id.table-source` o
join unnest ([o.order_detail]) ox, unnest ([o.order_detail.details]) dx
Keep in mind that you should be careful when dealing with this, as BigQuery does not allow arrays of arrays, so nesting subarrays directly can cause errors.
I make use of the following documentation for this sample:
STRUCT
UNNEST
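Putting this together for the T1/T2 schemas in the question, a hedged sketch of the insert could look like the one below. It assumes the repeated column really is named order (a reserved word in BigQuery, hence the backticks), and it fills the fields that are not being copied with NULLs, because the destination STRUCTs have to be built in full:
INSERT INTO T2 (User, `order`)
SELECT
  t.User,
  ARRAY(
    SELECT AS STRUCT
      o.Name,
      CAST(NULL AS STRING) AS location,     -- not copied, so filled with NULL
      o.subscription,
      ARRAY(
        SELECT AS STRUCT
          d.id,
          CAST(NULL AS STRING) AS nextDate  -- not copied, so filled with NULL
        FROM UNNEST(o.Details) AS d
      ) AS Details
    FROM UNNEST(t.`order`) AS o
  ) AS `order`
FROM T1 AS t;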

Are Databricks SQL tables & views duplicates of the source data, or do you update the same data source?

Let's say you create a table in DBFS as follows.
%sql
DROP TABLE IF EXISTS silver_loan_stats;
-- Explicitly define our table, providing schema for schema enforcement.
CREATE TABLE silver_loan_stats (
loan_status STRING,
int_rate FLOAT,
revol_util FLOAT,
issue_d STRING,
earliest_cr_line STRING,
emp_length FLOAT,
verification_status STRING,
total_pymnt DOUBLE,
loan_amnt FLOAT,
grade STRING,
annual_inc FLOAT,
dti FLOAT,
addr_state STRING,
term STRING,
home_ownership STRING,
purpose STRING,
application_type STRING,
delinq_2yrs FLOAT,
total_acc FLOAT,
bad_loan STRING,
issue_year DOUBLE,
earliest_year DOUBLE,
credit_length_in_years DOUBLE)
USING DELTA
LOCATION "/tmp/${username}/silver_loan_stats";
Later, you save data (a dataframe named 'loan_stats') to this source LOCATION.
# Configure destination path
DELTALAKE_SILVER_PATH = f"/tmp/{username}/silver_loan_stats"
# Write out the table
loan_stats.write.format('delta').mode('overwrite').save(DELTALAKE_SILVER_PATH)
# Read the table
loan_stats = spark.read.format("delta").load(DELTALAKE_SILVER_PATH)
display(loan_stats)
My questions are:
Are the table and the source data linked? So e.g. removing or joining data on the table updates it on the source as well, and removing or joining data on the source updates it in the table as well?
Does the above hold when you create a view instead of a table as well ('createOrReplaceTempView' instead of CREATE TABLE)?
I am trying to see the point of using Spark SQL when Spark dataframes already offer a lot of functionality. I guess it makes sense to me if the two are effectively the same data, but if CREATE TABLE (or createOrReplaceTempView) means you create a duplicate, then I find it difficult to understand why you would put so much effort (and compute resources) into doing so.
The table and source data are linked in that the metastore contains the table information (silver_loan_stats) and that table points to the location as defined in DELTALAKE_SILVER_PATH.
The CREATE TABLE is really a CREATE EXTERNAL TABLE, as the table and its metadata are defined by the DELTALAKE_SILVER_PATH - specifically the `DELTALAKE_SILVER_PATH/_delta_log`.
To clarify, you are not duplicating the data when you do this - it's just an intermixing of the SQL and DataFrame APIs. HTH!
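As a quick sanity check, here is a minimal sketch (reusing the question's path; delta.`<path>` is Spark SQL's syntax for querying a Delta location directly). Both statements read the same Delta files under the table's LOCATION, so creating the table did not copy anything:
%sql
-- Query via the metastore table and via the underlying path:
-- both hit the same Delta files, so the counts match.
SELECT COUNT(*) FROM silver_loan_stats;
SELECT COUNT(*) FROM delta.`/tmp/${username}/silver_loan_stats`;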

Altering the Hive table partitions by reducing the number of partitions

Create Statement:
CREATE EXTERNAL TABLE tab1(usr string)
PARTITIONED BY (year string, month string, day string, hour string, min string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
LOCATION '/tmp/hive1';
Data:
select * from tab1;
jhon,2017,2,20,10,11
jhon,2017,2,20,10,12
jhon,2017,2,20,10,13
Now I need to alter the tab1 table to have only 3 partition columns (year string, month string, day string) without manually copying/modifying files. I have thousands of files, so can I alter only the table definition without touching the files?
Please let me know how to do this.
If this is something that you will do only once, I would suggest creating a new table with the expected partitions and inserting into it from the old table using dynamic partitioning (see the sketch after this answer). This will also help you avoid keeping small files in your partitions. The other option is to create a new table pointing to the old location with the expected partitions and use the following properties:
TBLPROPERTIES ("hive.input.dir.recursive" = "TRUE",
"hive.mapred.supports.subdirectories" = "TRUE",
"hive.supports.subdirectories" = "TRUE",
"mapred.input.dir.recursive" = "TRUE");
After that, you can run MSCK REPAIR TABLE to recognize the partitions.
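Going back to the first option, a minimal sketch with dynamic partitioning (tab2 and /tmp/hive2 are illustrative names, not from the original post):
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
CREATE EXTERNAL TABLE tab2(usr string)
PARTITIONED BY (year string, month string, day string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
LOCATION '/tmp/hive2';
-- The trailing year/month/day columns in the SELECT drive the dynamic
-- partitioning; hour and min simply stop being partition keys.
INSERT OVERWRITE TABLE tab2 PARTITION (year, month, day)
SELECT usr, year, month, day FROM tab1;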

Delete records from Hive table using filename

I have a use case where I build a Hive table from a bunch of csv files. While writing the csv information into the Hive table, I assign (part of) INPUT__FILE__NAME to one of the columns. When I want to update the records for the same filename, I need to delete the records of that csv file before writing it again.
I used the query below, but it failed:
CREATE EXTERNAL TABLE T_TEMP_CSV(
F_FRAME_RANK BIGINT,
F_FRAME_RATE BIGINT,
F_SOURCE STRING,
F_PARAMETER STRING,
F_RECORDEDVALUE STRING,
F_VALIDITY INT,
F_VALIDITY_INTERPRETATION STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ';'
location '/user/baamarna5617/HUMS/csv'
TBLPROPERTIES ("skip.header.line.count"="2");
DELETE FROM T_RECORD
WHERE T_RECORD.F_SESSION = split(reverse(split(reverse(T_TEMP_CSV.INPUT__FILE__NAME),"/")[0]), "[.]")[0]
from T_TEMP_CSV;
The T_RECORD table has a column called F_SESSION which was assigned part of the INPUT__FILE__NAME using the split method shown above. I want to use the same method while removing those records. Can someone please point out where I am going wrong in this query?
I could successfully delete the records using the syntax below:
DELETE FROM T_RECORD
WHERE F_SESSION = 68;
I need to get that 68 from the INPUT__FILE__NAME.
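A hedged sketch of one way to do that, assuming T_RECORD is a transactional (ACID) table (which Hive requires for DELETE) and that your Hive version accepts a subquery in a DELETE's WHERE clause; it reuses the same split expression on the INPUT__FILE__NAME virtual column of T_TEMP_CSV:
-- Delete every T_RECORD row whose F_SESSION matches a session id derived
-- from the file names currently backing T_TEMP_CSV.
DELETE FROM T_RECORD
WHERE F_SESSION IN (
  SELECT split(reverse(split(reverse(INPUT__FILE__NAME), "/")[0]), "[.]")[0]
  FROM T_TEMP_CSV
);
If your version rejects the subquery form, you can first SELECT the derived session ids and plug them into the WHERE clause as literals, like the F_SESSION = 68 example above.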