I am working with two CSV files that I have merged into one dataframe, which I am currently storing in an SQL database using pandas to_sql(). I am writing my whole app with Flask and I would like to use SQLAlchemy. So far I have created the SQL table like this:
df.to_sql("User", con=db.engine, if_exists="append")
The problem now is that I would like to keep using SQLAlchemy 'syntax' to do queries and to use features like pagination.
This is how it should look if I wanted to execute a query:
users = User.query.paginate(...)
However, I haven't defined a User table in my models.py, since pandas to_sql automatically creates the table in the database for me. But now, since the table isn't defined in my models.py file, I don't know how to define my global variable 'User' so I can go on to use query and the other SQLAlchemy methods.
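For reference, this is a rough sketch of what I imagine the model might need to look like, though I am not sure it is right (the column names below are just placeholders for my CSV columns):

# models.py -- a sketch only; the real columns come from the merged CSVs
from app import db   # hypothetical import, wherever db = SQLAlchemy(app) lives in my app

class User(db.Model):
    __tablename__ = "User"   # must match the name passed to df.to_sql()
    # to_sql() writes the DataFrame index as an "index" column by default,
    # and SQLAlchemy needs some primary key column to map the table
    index = db.Column(db.Integer, primary_key=True)
    name = db.Column(db.String)    # placeholder column
    email = db.Column(db.String)   # placeholder column

# pagination would then work like:
# users = User.query.paginate(page=1, per_page=20)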
Thank you,
I am quite new to Databricks, and I am trying to do some basic data exploration with koalas.
When I log into Databricks, under DATA I see 2 main tabs, DATABASE TABLES and DBFS. I managed to read CSV files as Koalas dataframes (ksdf = ks.read_csv('/FileStore/tables/countryCodes.csv')), but I do not know how I could read the tables I see under DATABASE TABLES as Koalas dataframes. None of those tables have filename extensions, so I guess those are SQL tables?
Sorry if my question is too basic, and thanks very much for your help.
You just need to use the read_table function, as pointed out in the documentation:
ksdf = ks.read_table('my_database.my_table')
P.S. It's part of the so-called Metastore API.
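For instance, a quick sketch of how that could look in a notebook cell (the database and table names are placeholders, same as in the snippet above):

import databricks.koalas as ks

# read a metastore table straight into a Koalas DataFrame
kdf = ks.read_table('my_database.my_table')
kdf.head()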
I have a pandas dataframe that I've created. This prints out fine; however, I need to manipulate it in SQL.
I've run the following:
spark_df = spark.createDataFrame(df)
spark_df.write.mode("overwrite").saveAsTable("temp.testa")
pd_df = spark.sql('select * from temp.testa').toPandas()
But get an error:
AnalysisException: Database 'temp' not found;
Obviously I have not created a database, but I am not sure how to do that.
Can anyone advise how I might go about achieving what I need?
The error message clearly says "AnalysisException: Database 'temp' not found;", i.e. the database temp does not exist. Once the database is created, you can run the query without any issue.
To create a database in SQL, you can use the command below:
CREATE DATABASE <database-name>
Reference: Azure Databricks - SQL
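Put together with the code from the question, a minimal sketch (using the database and table names from the question) could look like this:

# create the missing database once; after that the original code works
spark.sql("CREATE DATABASE IF NOT EXISTS temp")

spark_df = spark.createDataFrame(df)
spark_df.write.mode("overwrite").saveAsTable("temp.testa")

pd_df = spark.sql('select * from temp.testa').toPandas()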
I am trying to find a way to convert a dataframe into a table to be used in another Databricks notebook. I cannot find any documentation regarding doing this in R.
First, convert the R dataframe to a SparkR dataframe using SparkR::createDataFrame(R_dataframe). Then use the saveAsTable function to save it as a permanent table, which can be accessed from other notebooks. SparkR::createOrReplaceTempView will not help if you try to access the table from a different notebook.
require(SparkR)
data1 <- createDataFrame(output)
saveAsTable(data1, tableName = "default.sample_table", source="parquet", mode="overwrite")
In the above code, default is an existing database name, under which a new table named sample_table will be created.
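Once saved this way, the table can be read back from any other notebook, in any language. For example, a minimal Python cell (table name taken from the snippet above):

# read the permanent table created by the R notebook
df = spark.table("default.sample_table")
df.show(5)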
I Googled for a solution to create a table using Databricks and Azure SQL Server, and to load data into this same table. I found some sample code online, which seems pretty straightforward, but apparently there is an issue somewhere. Here is my code.
CREATE TABLE MyTable
USING org.apache.spark.sql.jdbc
OPTIONS (
url "jdbc:sqlserver://server_name_here.database.windows.net:1433;database = db_name_here",
user "u_name",
password "p_wd",
dbtable "MyTable"
);
Now, here is my error.
Error in SQL statement: SQLServerException: Invalid object name 'MyTable'.
My password, unfortunately, has spaces in it. That could be the problem, perhaps, but I don't think so.
Basically, I would like this to loop recursively through the files in a folder and its sub-folders, and load all files whose names match a string pattern, like 'ABC*', into a table. The blocker here is that I need the file name loaded into a field as well. So I want to load data from MANY files into 4 fields of actual data, plus 1 field that captures the file name. The only way I can distinguish the different data sets is by the file name. Is this possible? Or is this an exercise in futility?
My suggestion is to use the Azure SQL Spark library, as also mentioned in the documentation:
https://docs.databricks.com/spark/latest/data-sources/sql-databases-azure.html#connect-to-spark-using-this-library
'Bulk Copy' is what you want to use to get good performance. Just load your files into a DataFrame and bulk copy them to Azure SQL:
https://docs.databricks.com/data/data-sources/sql-databases-azure.html#bulk-copy-to-azure-sql-database-or-sql-server
To read files from subfolders, the answer is here:
How to import multiple csv files in a single load?
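For the file-name-into-a-field part, a possible PySpark sketch (the path, separator and column name are placeholders based on the question) is to add a column with input_file_name():

from pyspark.sql.functions import input_file_name

# read every matching file under the folder tree and keep the source file name
df = (spark.read.format("csv")
      .option("sep", "|")
      .option("header", "false")
      .option("inferSchema", "true")
      .load("/mnt/rawdata/2019/*/*/client/ABC*.gz")
      .withColumn("file_name", input_file_name()))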
I finally, finally, finally got this working.
val myDFCsv = spark.read.format("csv")
.option("sep","|")
.option("inferSchema","true")
.option("header","false")
.load("mnt/rawdata/2019/01/01/client/ABC*.gz")
myDFCsv.show()
myDFCsv.count()
Thanks for pointing me in the right direction, mauridb!!
I am new to Spark. I am trying to access a Hive table from Spark.
1) Created a HiveContext from the Spark context:
val hc=new HiveContext(sc)
val hivetable= hc.sql("Select * from test_db.Table")
My questions, now that I have the table in Spark, are:
1) Why do we need to register the table?
2) We can perform SQL operations directly, so why do we need DataFrame functions like join, select, filter, etc.? What makes the difference between a SQL query and the equivalent DataFrame operations?
3) What is Spark optimization? How does it work?
You don't need to register a temporary table if you are accessing a Hive table using the Spark HiveContext. Registering a DataFrame as a temporary table allows you to run SQL queries over its data. Suppose a scenario where you are reading data from a file at some location and you want to run SQL queries over it. Then you need to create a DataFrame from the Row RDD and register a temporary table over this DataFrame to run the SQL operations. To perform SQL queries over that data, you use the Spark SQLContext in your code.
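As a rough sketch of that scenario (the file path, column names and table name below are made up for illustration, and sc / sqlContext are assumed to already exist, as in a Spark shell):

from pyspark.sql import Row

# read raw data from a file, build a Row RDD, then a DataFrame
lines = sc.textFile("/data/users.txt")                 # hypothetical path
rows = lines.map(lambda l: l.split(",")).map(lambda p: Row(name=p[0], age=int(p[1])))
df = sqlContext.createDataFrame(rows)

# register it as a temporary table so SQL can be run over it
df.registerTempTable("users")
result = sqlContext.sql("SELECT name FROM users WHERE age > 21")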
Both methods use exactly the same execution engine and internal data structures. At the end of the day, it all boils down to the personal preference of the developer.
Arguably, DataFrame queries are much easier to construct programmatically and provide minimal type safety.
Plain SQL queries can be significantly more concise and easier to understand. They are also portable and can be used without any modifications in every supported language. With HiveContext, these can also be used to expose some functionality that can be inaccessible in other ways (for example, UDFs without Spark wrappers).
Reference: Spark sql queries vs dataframe functions
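To make the comparison concrete, here is a small made-up example (table and column names are invented) showing the same query expressed both ways; both go through the same Catalyst optimizer:

# SQL style
sql_result = sqlContext.sql(
    "SELECT dept, COUNT(*) AS n FROM employees WHERE salary > 50000 GROUP BY dept")

# DataFrame style
df_result = (sqlContext.table("employees")
             .filter("salary > 50000")
             .groupBy("dept")
             .count()
             .withColumnRenamed("count", "n"))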
Here is a good reference to read on the performance comparison between Spark RDDs vs DataFrames vs SparkSQL.
Apparently I don't have an answer for it, so I will leave it to you to do some research on the net and find a solution :)