I have a pandas dataframe that I've created. It prints out fine; however, I need to manipulate it in SQL.
I've run the following:
spark_df = spark.createDataFrame(df)
spark_df.write.mode("overwrite").saveAsTable("temp.testa")
pd_df = spark.sql('select * from temp.testa').toPandas()
But get an error:
AnalysisException: Database 'temp' not found;
Obviously I have not created a database, but I'm not sure how to do this.
Can anyone advise how I might go about achieving what I need?
The error message says it plainly: database 'temp' was not found. Once the database is created, you can run the query without any issue.
To create a database in SQL, use the command below:
CREATE DATABASE <database-name>
Reference: Azure Databricks - SQL
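For completeness, a minimal PySpark sketch of the full flow, using the temp database name from the question; CREATE DATABASE IF NOT EXISTS is safe to re-run:
# Create the database first, then save the table into it and read it back.
spark.sql("CREATE DATABASE IF NOT EXISTS temp")
spark_df = spark.createDataFrame(df)
spark_df.write.mode("overwrite").saveAsTable("temp.testa")
pd_df = spark.sql("select * from temp.testa").toPandas()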
I need to push some data from Databricks on AWS to SAP Data Warehouse Cloud, and have been encouraged to use the Python hdbcli (https://pypi.org/project/hdbcli/). The only documentation I have been able to find is the one on PyPI, which is quite scarce. I can see an example of how to push individual rows to a SQL table, but I have found no examples of how to save a PySpark dataframe to a table in SAP Data Warehouse Cloud.
Documentation examples:
sql = 'INSERT INTO T1 (ID, C2) VALUES (:id, :c2)'
cursor = conn.cursor()
id = 3
c2 = "goodbye"
cursor.execute(sql, {"id": id, "c2": c2})
# returns True
cursor.close()
I have tried the following in my Databricks notebook:
df.createOrReplaceTempView("final_result_local")
sql = "INSERT INTO final_result SELECT * FROM final_result_local"
cursor.execute(sql)
cursor.close()
After this I got the following error:
invalid table name: Could not find table/view FINAL_RESULT_LOCAL in
schema DATABRICKS_SCHEMA
It seems df.createOrReplaceTempView created the view in a different context from the one used by hdbcli, and I don't know how to push the local table to SAP Data Warehouse Cloud. Any help would be much appreciated.
You should consider using the Python machine learning client for SAP HANA (hana-ml). You can think of it as an abstraction layer on top of hdbcli. The central object to send and retrieve data is the HANA dataframe, which behaves similarly to a Pandas dataframe, but is persisted on the database side (i.e. this can be a table).
For your scenario, you should be able to create a HANA dataframe and thus a table using function create_dataframe_from_spark() (see documentation).
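A minimal sketch of that approach, assuming a reachable SAP Data Warehouse Cloud endpoint; the connection details and table name are placeholders, and the exact create_dataframe_from_spark() parameters may differ by hana-ml version, so check the documentation:
from hana_ml.dataframe import ConnectionContext, create_dataframe_from_spark

# Placeholder connection details -- substitute your host and credentials.
cc = ConnectionContext(address="<host>", port=443, user="<user>", password="<password>")

# Persist the Spark dataframe as a table on the database side; the table name is illustrative.
hana_df = create_dataframe_from_spark(cc, df, table_name="FINAL_RESULT", force=True)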
Regarding the direct use of hdbcli, you can find the full documentation here (also linked on PyPI).
I disagree with using hdbcli. Instead, look into connecting from Spark directly; this instruction should be helpful.
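For illustration, a hedged sketch of such a direct write over JDBC, assuming the SAP HANA JDBC driver (ngdbc.jar) is installed on the cluster; the URL, schema, and table name are placeholders:
# Write the Spark dataframe straight to the target table -- no temp view needed.
(df.write
    .format("jdbc")
    .option("driver", "com.sap.db.jdbc.Driver")
    .option("url", "jdbc:sap://<host>:<port>")
    .option("dbtable", "DATABRICKS_SCHEMA.FINAL_RESULT")
    .option("user", "<user>")
    .option("password", "<password>")
    .mode("append")
    .save())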
I have a delta table whose schema needs new columns/changed data types (usually I do this on non-delta tables and those work fine).
I have already dropped the existing delta table, and when I tried dropping the schema I got a 'v1 session catalog' error.
I am currently using SQL on a 10.4 LTS cluster, Spark 3.2.1, Scala 2.12 (I can't change these computes); driver and workers are Standard E_v4.
What I already did, and it worked as usual:
drop table if exists dbname.tablename;
What I wanted to do next:
drop schema if exists dbname.tablename;
The error I got instead:
Error in SQL statement: AnalysisException: Nested databases are not supported by v1 session catalog: dbname.tablename
When I try recreating the schema in the same location, I get this error:
AnalysisException: The specified schema does not match the existing schema at dbfs:locationOfMy/table
... Differences
- Specified schema has additional fields newColNameIAdded, anotherNewColIAdded
- Specified type for myOldCol is different from existing schema ...
If your intention is to keep the existing schema, you can omit the schema from the create table command. Otherwise please ensure that the schema matches.
How can I drop the schema and re-register it in the same location, with the same name and new definitions?
Answering a month later since I didn't get replies and found the right solution:
Delta files leave behind partitions and logs that cannot be removed by the drop commands. I had to manually delete the logs, depending on where my table location was.
Try this:
dbutils.fs.rm(path, True)  # recursively delete the leftover files, including the _delta_log
Use the path of your schema.
Then create your table again.
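Putting it together, a hedged end-to-end sketch; the path and names are placeholders for your environment:
path = "<dbfs-location-of-your-table>"  # the location shown in the AnalysisException

spark.sql("DROP TABLE IF EXISTS dbname.tablename")  # drop the catalog entry first
dbutils.fs.rm(path, True)  # recursively remove leftover data files and the _delta_log
# Now recreate the table at the same location with the new column definitions.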
I am using the below code to create a table from a dataframe in Databricks and run into an error.
df.write.saveAsTable("newtable")
This works fine the very first time, but for re-usability, if I were to rewrite it like below,
df.write.mode(SaveMode.Overwrite).saveAsTable("newtable")
I get the following error.
Error Message:
org.apache.spark.sql.AnalysisException: Can not create the managed table newtable. The associated location dbfs:/user/hive/warehouse/newtable already exists
The SQL config 'spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation' was removed in the version 3.0.0. It was removed to prevent loosing of users data for non-default value.
What are the differences between saveAsTable and insertInto in different SaveMode(s)?
Run the following command to fix the issue:
dbutils.fs.rm("dbfs:/user/hive/warehouse/newtable/", true)
Or set the flag spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation to true:
spark.conf.set("spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation", "true")
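Note the error message itself says this config was removed in Spark 3.0.0, so on Spark 3+ clusters removing the leftover location is the workable route. A PySpark sketch of that sequence:
# Remove the orphaned warehouse directory left behind by the previous write...
dbutils.fs.rm("dbfs:/user/hive/warehouse/newtable/", True)
# ...after which the overwrite succeeds.
df.write.mode("overwrite").saveAsTable("newtable")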
I am trying to write a PySpark dataframe to a Hive table with the following code in a Jupyter notebook:
df.repartition("dt").write.partitionBy("dt").format("orc").saveAsTable(fraud_nr.test)
fraud_nr is the database I am trying to write the table to. But I got this error.
> NameError: name 'fraud_nr' is not defined
I would like to know what else I need to do to be able to write to this database.
Keep db_name.table_name enclosed in quotes ("). saveAsTable expects the table name as a string; without the quotes, Python evaluates fraud_nr.test as an expression and raises the NameError because no variable fraud_nr exists.
df.repartition("dt").write.partitionBy("dt").format("orc").saveAsTable("fraud_nr.test")
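If the fraud_nr database itself might not exist yet, one hedged addition is to create it first; CREATE DATABASE IF NOT EXISTS is safe to re-run:
# Create the target database if needed, then write the partitioned ORC table.
spark.sql("CREATE DATABASE IF NOT EXISTS fraud_nr")
df.repartition("dt").write.partitionBy("dt").format("orc").saveAsTable("fraud_nr.test")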
I have a table in Redshift into which I am inserting data from S3.
I viewed the table before inserting the data and it returned a blank table.
However, after inserting data into the Redshift table, I am getting the error below when doing select * from table.
The command to copy data into the table from S3 runs successfully without any error.
java.lang.NoClassDefFoundError: com/amazon/jdbc/utils/DataTypeUtilities$NumericRepresentation error in redshift
What could be the possible cause and solution for this?
I have faced this java.lang.NoClassDefFoundError when the JDBC connection properties are set incorrectly.
If you are using the Postgres driver, make sure the connection URL uses the jdbc:postgresql:// prefix,
e.g.: jdbc:postgresql://HostName:5439/
Let me know if this works.
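If you are connecting from Spark, here is a minimal sketch of a correctly configured read, assuming the Postgres JDBC driver is on the classpath; the host, database, table, and credentials are placeholders:
# Read the Redshift table over JDBC with an explicit driver class and URL prefix.
df = (spark.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://<host>:5439/<database>")
    .option("driver", "org.postgresql.Driver")
    .option("dbtable", "my_table")
    .option("user", "<user>")
    .option("password", "<password>")
    .load())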