How to create a table from a dataframe in SparkR - sql

I am trying to find a way to convert a dataframe into a table to be used in another Databricks notebook. I cannot find any documentation regarding doing this in R.

First, convert the R dataframe to a SparkR DataFrame using SparkR::createDataFrame(R_dataframe). Then use the saveAsTable function to save it as a permanent table, which can be accessed from other notebooks. SparkR::createOrReplaceTempView will not help if you try to access the table from a different notebook.
require(SparkR)
data1 <- createDataFrame(output)
saveAsTable(data1, tableName = "default.sample_table", source="parquet", mode="overwrite")
In the code above, default is an existing database name, under which a new table named sample_table will be created.

Related

How can I save a pyspark dataframe in databricks to SAP Hana (SAP Data Warehouse Cloud) using hdbcli?

I need to push some data from Databricks on AWS to SAP Data Warehouse Cloud, and have been encouraged to use the Python hdbcli package (https://pypi.org/project/hdbcli/). The only documentation I have been able to find is the one on PyPI, which is quite sparse. I can see an example of how to push individual rows to a SQL table, but I have found no examples of how to save a PySpark dataframe to a table in SAP Data Warehouse Cloud.
Documentation examples:
sql = 'INSERT INTO T1 (ID, C2) VALUES (:id, :c2)'
cursor = conn.cursor()
id = 3
c2 = "goodbye"
cursor.execute(sql, {"id": id, "c2": c2})
# returns True
cursor.close()
I have tried the following in my Databricks notebook:
df.createOrReplaceTempView("final_result_local")
sql = "INSERT INTO final_result SELECT * FROM final_result_local"
cursor.execute(sql)
cursor.close()
After this I got the following error:
invalid table name: Could not find table/view FINAL_RESULT_LOCAL in
schema DATABRICKS_SCHEMA
It seems df.createOrReplaceTempView created the SQL table in a different context from the one used by hdbcli, and I don't know how to push the local table to SAP Data Warehouse Cloud. Any help would be much appreciated.
You should consider using the Python machine learning client for SAP HANA (hana-ml). You can think of it as an abstraction layer on top of hdbcli. The central object for sending and retrieving data is the HANA dataframe, which behaves similarly to a pandas dataframe but is persisted on the database side (i.e. it can be a table).
For your scenario, you should be able to create a HANA dataframe, and thus a table, using the function create_dataframe_from_spark() (see the documentation).
Regarding the direct use of hdbcli, you can find the full documentation here (also linked on PyPI).
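For illustration, here is a minimal sketch of the hana-ml route. It assumes hana-ml is installed on the cluster and that df is the PySpark dataframe to push; the host, port, credentials, and table name are placeholders, and parameter names may differ between hana-ml versions.
from hana_ml.dataframe import ConnectionContext, create_dataframe_from_spark

# Placeholder connection details for the SAP HANA / Data Warehouse Cloud instance.
cc = ConnectionContext(
    address="your-dwc-host.hanacloud.ondemand.com",
    port=443,
    user="DB_USER",
    password="DB_PASSWORD",
    encrypt=True,
)

# Persist the PySpark dataframe as a table on the database side.
hana_df = create_dataframe_from_spark(
    connection_context=cc,
    spark_df=df,
    table_name="FINAL_RESULT",  # placeholder table name
    force=True,                 # assumption: replace the table if it already exists
)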
I disagree with using hdbcli. Instead, look into connecting from Spark directly; these instructions should be helpful.
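As a rough sketch of that direct route (an assumption, not part of the original answer): with the SAP HANA JDBC driver (ngdbc.jar) installed on the cluster, the dataframe can be written over JDBC. The URL, credentials, and table name below are placeholders.
# Write the PySpark dataframe straight to SAP HANA / Data Warehouse Cloud over JDBC.
(df.write
    .format("jdbc")
    .option("driver", "com.sap.db.jdbc.Driver")
    .option("url", "jdbc:sap://your-dwc-host.hanacloud.ondemand.com:443/?encrypt=true")
    .option("dbtable", "DATABRICKS_SCHEMA.FINAL_RESULT")
    .option("user", "DB_USER")
    .option("password", "DB_PASSWORD")
    .mode("append")
    .save())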

Pyspark: Parquet tables visible in SQL?

I am fairly new to PySpark/Hive and I have a problem:
I have a dataframe and want to write it as a partitioned table to HDFS. So far, I've done that via:
df = spark.sql('''
CREATE EXTERNAL TABLE database.df(
ID STRING
)
PARTITIONED BY (
DATA_DATE_PART STRING
)
STORED AS PARQUET
LOCATION 'hdfs://path/file'
''')
df.createOrReplaceTempView("df")
df = spark.sql('''
INSERT INTO database.df PARTITION(DATA_DATE_PART = '{}')
SELECT ID
FROM df
'''.format(date))
But as the dataframes grow, instead of having to define all columns, I thought there must be a better solution:
df.write.mode('overwrite').partitionBy('DATA_DATE_PART').parquet("/path/file")
However, I cannot access a table written like this via spark.sql, nor can I see it in my Hue browser. I can see the files, though, from the shell: hdfs dfs -ls /path/
So my question is: why is that? I've read that parquet files can be special when read with SQL, but my first script works fine and the tables are visible everywhere.
You just need to use the saveAsTable function for that (doc). By default it stores data in the default location, but you can use the path option to redefine it and make the table "unmanaged" (see this doc for more details). Just use the following code:
df.write.mode('overwrite').partitionBy('DATA_DATE_PART') \
.format("parquet") \
.option("path", "/path/file") \
.saveAsTable("database.df")

How do I create a databricks table from a pandas dataframe?

I have a pandas dataframe that I've created. It prints out fine; however, I need to manipulate it in SQL.
I've run the following:
spark_df = spark.createDataFrame(df)
spark_df.write.mode("overwrite").saveAsTable("temp.testa")
pd_df = spark.sql('select * from temp.testa').toPandas()
But get an error:
AnalysisException: Database 'temp' not found;
Obviously I have not created a database, but not sure how to do this.
Can anyone advise how I might go about achieving what I need?
The error message clearly says that the database temp was not found: "AnalysisException: Database 'temp' not found;". Once the database is created, you can run the query without any issue.
To create a database in SQL, use the command below:
CREATE DATABASE <database-name>
Reference: Azure Databricks - SQL
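For example, a minimal sketch in a notebook cell (mirroring the code from the question) would be to create the database first and then re-run the original save:
# Create the missing database, then the original saveAsTable call succeeds.
spark.sql("CREATE DATABASE IF NOT EXISTS temp")

spark_df = spark.createDataFrame(df)
spark_df.write.mode("overwrite").saveAsTable("temp.testa")
pd_df = spark.sql("select * from temp.testa").toPandas()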

Rename whitespace in column name in Parquet file using spark sql

I want to show the content of a parquet file using Spark SQL, but since the column names in the parquet file contain spaces, I am getting an error:
Attribute name "First Name" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;
I have written the code below:
val r1 = spark.read.parquet("filepath")
val r2 = r1.toDF()
r2.select(r2("First Name").alias("FirstName")).show()
but I am still getting the same error.
Try renaming the column first instead of aliasing it:
r2 = r2.withColumnRenamed("First Name", "FirstName")
r2.show()
For anyone still looking for an answer,
There is no optimised way to remove spaces from column names while dealing with parquet data.
What can be done is:
Change the column names at the source, i.e., while creating the parquet data itself.
OR
(Not the optimised way - it won't work for huge datasets) Read the parquet file using pandas and rename the columns on the pandas dataframe. If required, write the dataframe back to parquet using pandas itself and then continue with Spark (see the sketch below).
PS: With the new pandas API on Spark available from PySpark 3.2, implementing pandas with Spark might be much faster and better optimised when dealing with huge datasets.
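A minimal sketch of that pandas route (the file paths are placeholders; the whole file is loaded into memory, so this is only feasible for small data):
import pandas as pd

# Read the parquet file with pandas, strip spaces from the column names,
# and write it back so Spark can read it without complaints.
pdf = pd.read_parquet("/path/with_spaces.parquet")        # placeholder path
pdf.columns = [c.replace(" ", "") for c in pdf.columns]
pdf.to_parquet("/path/clean.parquet", index=False)        # placeholder path

# Continue in Spark with the cleaned file.
clean_df = spark.read.parquet("/path/clean.parquet")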
For anybody struggling with this, the only thing that worked for me was:
# Build a schema with the spaces stripped out, then re-read the file with it.
base_df = spark.read.parquet(filename)
for c in base_df.columns:
    base_df = base_df.withColumnRenamed(c, c.replace(" ", ""))
df = spark.read.schema(base_df.schema).parquet(filename)
This is from this thread: Spark Dataframe validating column names for parquet writes (scala)
Alias, withColumnRenamed, and "as" SQL select statements wouldn't work. PySpark would still use the old name whenever trying to .show() the dataframe.

Pandas Dataframe to_sql with Flask-SQLAlchemy

I am working with two csv files that I have merged into one dataframe, which I am currently storing as a SQL database using pandas to_sql(). I am writing my app with Flask and I would like to use SQLAlchemy. So far I have created the SQL database like this:
df.to_sql("User", con=db.engine, if_exists="append")
The problem now is that I would like to keep using SQLAlchemy 'syntax' to do queries and to use features like pagination.
This is how it should look if I wanted to execute a query:
users = User.query.all().paginate(...)
However, I haven't defined my User table in models.py, since pandas to_sql automatically creates the table in the database for me. But now, since I don't have the table defined in my models.py file, I don't know how I can define my global variable 'User' so I can proceed to use queries and other SQLAlchemy methods.
Thank you,
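One possible direction, sketched only under assumptions (Flask-SQLAlchemy with SQLAlchemy 1.4+, the table name "User" as created by to_sql, a placeholder database URI, and the default pandas "index" column standing in as a primary key), is to reflect the existing table into a model instead of declaring every column by hand:
import sqlalchemy as sa
from flask import Flask
from flask_sqlalchemy import SQLAlchemy

app = Flask(__name__)
app.config["SQLALCHEMY_DATABASE_URI"] = "sqlite:///app.db"  # placeholder URI
db = SQLAlchemy(app)

with app.app_context():

    class User(db.Model):
        # Reflect the table that pandas to_sql() already created instead of
        # re-declaring its columns in models.py.
        __table__ = sa.Table("User", db.metadata, autoload_with=db.engine)
        # to_sql() does not add a primary key; point the mapper at a unique
        # column. "index" is the column pandas writes by default (an assumption).
        __mapper_args__ = {"primary_key": [__table__.c["index"]]}

    # Regular Flask-SQLAlchemy syntax, including pagination, now works:
    users = User.query.paginate(page=1, per_page=20)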