apache spark sql table overwrite issue - apache-spark-sql

I am using the below code to create a table from a dataframe in databricks and run into error.
df.write.saveAsTable("newtable")
This works fine the very first time but for re-usability if I were to rewrite like below
df.write.mode(SaveMode.Overwrite).saveAsTable("newtable")
I get the following error.
Error Message:
org.apache.spark.sql.AnalysisException: Can not create the managed table newtable. The associated location dbfs:/user/hive/warehouse/newtable already exists

The SQL config 'spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation' was removed in the version 3.0.0. It was removed to prevent loosing of users data for non-default value.

What are the differences between saveAsTable and insertInto in different SaveMode(s)?
Run following command to fix issue :
dbutils.fs.rm("dbfs:/user/hive/warehouse/newtable/", true)
Or
Set the flag
spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation = true
spark.conf.set("spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation","true")

Related

HIVE_METASTORE_ERROR persists after removing problematic column from schema

I am trying to query my CloudTrail logging bucket through the use of Athena. I already deployed a crawler into the bucket and managed to populate a few tables. When I tried running a simple "preview table" query, I get the following error:
HIVE_METASTORE_ERROR:
com.amazonaws.services.datacatalog.model.InvalidInputException: Error: : expected at the position 121 of 'struct<roleArn:string,roleSessionName:string,durationSeconds:int,keySpec:string,keyId:string,encryptionContext:struct<aws\:cloudtrail\:arn:string,aws\:s3\......
I narrowed down the column name in question and removed it completely from my schema.
After removing it from the schema in AWS Glue and rerunning the preview table query I still get the same error at the same position. I tried again in a different browser but I get the same error. How can this be, am I missing something?
Please provide any advice.
Thanks in advance!

Getting a Databricks drop schema error for delta table

I have a delta table schema that needs new columns/changed data types (Usually I do this on non delta tables and those work fine)
I have already dropped the existing delta table and tried dropping the schema and getting a 'v1 session catalog' error.
I am currently using SQL, 10.4 LTS cluster, spark3.2.1, scala 2.12 (I cant change these computes), driver and workers are standard E_v4
What I already did, and worked as usual
drop table if exists dbname.tablename;
What I wanted to do next:
drop schema if exists dbname.tablename;
The error I got instead:
Error in SQL statement: AnalysisException: Nested databases are not supported by v1 session catalog: dbname.tablename
When I try recreating the schema in the same location I get the error:
AnalysisException: The specified schema does not match the existing schema at dbfs:locationOfMy/table
... Differences
-Specified schema has additional fields newColNameIAdded, anotherNewColIAdded
-Specified type for myOldCol is different from existing schema ...
If your intention is to keep the existing schema, you can omit the
schema from the create table command. Otherwise please ensure that
the schema matches.
How can I do the schema drop and re-register it in same location and same name with new definitions?
Answering a month later since I didnt get replies and found the right solution;
Delta files have left over partitions and logs that cannot be updated using the drop commands. I had to manually delete the logs depending on where my location was.
Try this:
dbutils.fs.rm(path, True)
Use the path of your schema.
Then create your table again.

BigQuery scheduled query: Cannot create a transfer in > JURISDICTION_US when destination dataset is located in > REGION_ASIA_SOUTHEAST_1

I am getting this error quite frequently while trying to create a scheduled query
Error creating scheduled query: Cannot create a transfer in
JURISDICTION_US when destination dataset is located in
REGION_ASIA_SOUTHEAST_1
I just need a scheduled query to overwrite data in a table.
I had the same problem while trying to create a scheduled query with python:
400 Cannot create a transfer in REGION_EUROPE_WEST_1 when destination dataset is located in JURISDICTION_EU
I figured out that even my project is located in europe-west1 but my destination dataset was located in multinational location: Europe. I had to update my parent path : parent=project_path to '{project_path}/locations/eu' so that it works.
I hope that it helps someone.
It's look like as a bug from BQ.
I got the same problems, with source and destination dataset located in EU both.
I've change just for testing purpose the destination for an other EU dataset, and it works.
I've finally update the scheduled query to use my first destination choice and now it works.. I can't explain why, but it's seem to be a workaround.
Maybe, you can try with starting from the Scheduled Queries BigQuery UI and click on "+ create schedule query" button, then I don't get error. If I start directly in BigQuery UI I get the same error.
As I tried, it may happen because I have existed table with the same id as the destination table. This happens even if the table is the result of manually running that query and saved.
I faced the same issue recently.I tried 2 things and they worked:
try setting the query location to destination dataset/table location, then try scheduling the query.
If that does not work try to run the query and save results to the intended table in bigquery i.e. try creating the destination table with storing the results of the query you are trying to schedule first. Then try scheduling the query.
Both cases worked for me in different cases.
I had this error and tried many of the solutions in this thread. I tried a new session in an incognito window and it worked so I believe this is a transient issue as suggested.
I just scheduled query select 1 and then edited it to the needed one – it worked
I think trouble with time when start schedule. If it is in the past relative to local time, then bg tries to run the request on another server.
I had the same issue. The way I solved it was to disable the editor tabs (there is a button at the top). Then opened the query settings and set the processing location to EU manually.
I was using bq command when I came across this issue and was able to resolve it by adding parameter --location='europe-west1
So my final query looked like this
bq query \
--use_legacy_sql=false \
--display_name='my_table' \
--location='europe-west1' \
'''create or replace table my_dataset.my_table as (select * from external_query('projects/my_mysql_connection/locations/europe-west1/connections/bi', '(select * from my_table)'))'''

Can't access external Hive metastore with Pyspark

I am trying to run a simple code to simply show databases that I created previously on my hive2-server. (note in this example there are both, examples in python and scala both with the same results).
If I log in into a hive shell and list my databases I see a total of 3 databases.
When I start Spark shell(2.3) on pyspark I do the usual and add the following property to my SparkSession:
sqlContext.setConf("hive.metastore.uris","thrift://*****:9083")
And re-start a SparkContext within my session.
If I run the following line to see all the configs:
pyspark.conf.SparkConf().getAll()
spark.sparkContext._conf.getAll()
I can indeed see the parameter has been added, I start a new HiveContext:
hiveContext = pyspark.sql.HiveContext(sc)
But If I list my databases:
hiveContext.sql("SHOW DATABASES").show()
It will not show the same results from the hive shell.
I'm a bit lost, for some reason it looks like it is ignoring the config parameter as I am sure the one I'm using it's my metastore as the address I get from running:
hive -e "SET" | grep metastore.uris
Is the same address also if I run:
ses2 = spark.builder.master("local").appName("Hive_Test").config('hive.metastore.uris','thrift://******:9083').getOrCreate()
ses2.sql("SET").show()
Could it be a permission issue? Like some tables are not set to be seen outside the hive shell/user.
Thanks
Managed to solve the issue, because a communication issue the Hive was not hosted in that machine, corrected the code and everything fine.

Redshift drop/create/select query failing in Data Pipeline

I'm trying to run a daily migration script in Redshift using Data Pipeline.
The script works as expected when I run it directly using SQL Workbench/J, but fails when triggered through Data Pipeline.
I have reproduced the problem with this simple code:
drop table if exists image_stg;
create table image_stg (like image_full);
select * from image_stg;
When I run it in Data Pipeline, I get this error:
[Amazon](500310) Invalid operation: relation "image_stg" does not exist;
I also got this error once, for the exact same code, without changing anything:
[Amazon](500310) Invalid operation: Relation with OID 108425 does not exist.;
Here's a screenshot of the two error messages:
I've found this thread on the AWS forums, but it didn't help: Pipeline started failing on simple Redshift SqlActivity and temp table
What is causing this error? Is there a workaround?
I've contacted Amazon, and it looks like a problem in Data Pipeline.
They did suggest a workaround that seems to work in my case: Change the JDBC connection string from jdbc:redshift://… to jdbc:postgresql://… .
I had the same problem when creating a temporary table in Redshift via Pipeline but the workaround of changing the connection string from jdbc:redshift://… to jdbc:postgresql://… didn't work for me though. My last resort is to create the table as physical table and drop it after use - through Pipeline.