I have 2 DataFrames in PySpark that I loaded from a Hive database using 2 Spark SQL queries.
When I try to join the 2 DataFrames using df1.join(df2, df1.id_1 == df2.id_2), it takes a long time.
Does Spark re-execute the SQL queries for df1 and df2 when I call the join?
The underlying database is Hive.
PySpark can be slower than Scala because data has to be serialized between the Python process and the JVM whenever the work is done in Python (for example in Python UDFs or RDD operations).
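On the re-execution question, one thing worth checking, offered as a sketch rather than a diagnosis: Spark DataFrames are evaluated lazily, so the two Hive queries run when an action on the joined result is triggered, and they run again for later actions unless the inputs are cached. The helper name join_cached below is hypothetical, and the join columns id_1 and id_2 are taken from the question.

```python
def join_cached(df1, df2):
    """Cache both inputs, then join them on id_1 == id_2.

    cache() only marks the DataFrames for caching; they are materialized by
    the first action and reused by later ones instead of re-running the queries.
    """
    left = df1.cache()
    right = df2.cache()
    # Note the double equals: a single '=' inside join() is a Python syntax error.
    condition = left["id_1"] == right["id_2"]
    return left.join(right, on=condition, how="inner")


# Assumed usage with DataFrames loaded through spark.sql(...):
# joined = join_cached(df1, df2)
# joined.explain()   # prints the physical plan, including both Hive scans
# joined.count()     # first action: runs the queries and fills the cache
```

Caching only helps if the DataFrames are reused and fit in memory or on local disk; for a single join followed by one action it may not change the runtime.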
Is there a way to make few columns read-only while writing the spark dataframe in excel?
Using com.crealytics.spark.excel and PySpark, Databricks
I have a Databricks notebook that runs every 2-4 weeks. It reads a small CSV, performs ETL in Python, then truncates and loads a Delta table.
This is what I am currently doing to avoid failures related to data type:
python to replace all '-' with '0'
python to drop rows with NaN or nan
spark_df = spark.createDataFrame(dfnew)
spark_df.write.saveAsTable("default.test_table", index=False, header=True)
This detects the data types automatically and works right now.
But what if a data type cannot be detected, or is detected incorrectly? I am mostly concerned about doubles, ints and bigints.
I tested casting, but it doesn't work on Databricks:
spark_df = spark.createDataFrame(dfnew.select(dfnew("Year").cast(IntegerType).as("Year")))
Is there a way to supply a DDL schema to a Spark DataFrame on Databricks? Should I not use Spark?
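One possible approach, as a sketch under assumptions: spark.createDataFrame accepts a schema argument, and that argument can be a DDL-style string such as "Year INT, Sales DOUBLE". The helper names and the Sales and Units columns below are hypothetical (only Year appears in the question), the column names are assumed to be plain identifiers, and the pandas values are assumed to already be compatible with the declared types.

```python
def build_ddl(column_types):
    """Turn {"Year": "INT", ...} into a DDL schema string such as "Year INT, Sales DOUBLE"."""
    return ", ".join(f"{name} {sql_type}" for name, sql_type in column_types.items())


def to_spark_with_schema(spark, pandas_df, column_types):
    """Create a Spark DataFrame from pandas with an explicit schema instead of inference."""
    # Align the pandas column order with the declared schema before handing it to Spark.
    ordered = pandas_df[list(column_types)]
    return spark.createDataFrame(ordered, schema=build_ddl(column_types))


# Hypothetical column names and types, for illustration only.
EXAMPLE_TYPES = {"Year": "INT", "Sales": "DOUBLE", "Units": "BIGINT"}
print(build_ddl(EXAMPLE_TYPES))

# Assumed usage inside a notebook where `spark` and `dfnew` already exist:
# spark_df = to_spark_with_schema(spark, dfnew, EXAMPLE_TYPES)
```

If the DataFrame has already been created with inferred types, casting afterwards is another option, for example spark_df.withColumn("Year", spark_df["Year"].cast("int")).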
I want to show the content of a parquet file using Spark SQL, but since the column names in the parquet file contain spaces I am getting this error:
Attribute name "First Name" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;
I have written the code below:
val r1 = spark.read.parquet("filepath")
val r2 = r1.toDF()
r2.select(r2("First Name").alias("FirstName")).show()
but I am still getting the same error.
Try renaming the column first instead of aliasing it (r2 is a val, so bind the result to a new name):
val r3 = r2.withColumnRenamed("First Name", "FirstName")
r3.show()
For anyone still looking for an answer,
There is no optimised way to remove spaces from column names while dealing with parquet data.
What can be done is:
Change the column names at the source itself, i.e., while creating the parquet data.
OR
(Not the optimised way; won't work for huge datasets) Read the parquet file using pandas and rename the columns of the pandas dataframe. If required, write the dataframe back to parquet using pandas itself and then continue with Spark.
PS: With the new Pandas API for PySpark scheduled to be present from PySpark 3.2, implementing pandas with spark might be much faster and optimised when dealing with huge datasets.
For anybody struggling with this, the only thing that worked for me was:
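base_df = spark.read.parquet(filename)
# Strip spaces from every column name, then re-read the file with the cleaned schema.
for c in base_df.columns:
    base_df = base_df.withColumnRenamed(c, c.replace(" ", ""))
df = spark.read.schema(base_df.schema).parquet(filename)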
This is from this thread: Spark Dataframe validating column names for parquet writes (scala)
Alias, withColumnRenamed, and "as" in SQL select statements wouldn't work. PySpark would still use the old name whenever I tried to .show() the dataframe.
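The loop above only strips spaces, while the error message lists more characters (" ,;{}()\n\t="). A hedged generalisation of the same schema trick is sketched below; the helper names are hypothetical, and whether re-reading with a cleaned schema is needed at all seems to depend on the Spark version, as the answers above suggest.

```python
INVALID_CHARS = " ,;{}()\n\t="  # the characters listed in the error message


def clean_name(name):
    """Drop every character Spark rejects in Parquet attribute names."""
    return "".join(ch for ch in name if ch not in INVALID_CHARS)


def read_parquet_with_clean_names(spark, path):
    """Re-read a Parquet file using a schema whose field names have been cleaned."""
    base_df = spark.read.parquet(path)
    renamed = base_df.toDF(*[clean_name(c) for c in base_df.columns])
    return spark.read.schema(renamed.schema).parquet(path)


print(clean_name("First Name"))
```

Two different original names can collapse to the same cleaned name (for example "a b" and "ab"), so it is worth checking for duplicates before relying on this.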
I have tabular data of 100 Million records, each record having 15 columns.
I need to query 3 columns of this data and filter out the records to be used in further processing.
Currently I'm deciding between two approaches
Approach 1
Store the data as CSV or Parquet in HDFS. When I need to query it, read the whole dataset and query it using Spark SQL.
Approach 2
Create a Hive table using HiveContext and persist the table and the Hive metadata. Query this table when needed using HiveContext.
Doubts:
In Approach 2, is the query pushed down to the storage level (HDFS), so that only the records that satisfy the criteria are read and returned? Or is the entire dataset read into memory (as is the case with most Spark jobs) and the query then run using the metadata?
Runtime: Of the two approaches, which one will be faster?
Please note that the Hive setup isn't Hive over Spark, it's HiveContext provided with Spark.
Spark Version: 2.2.0
In approach 2, you should have the Hive table structured and stored in a proper way.
Spark doesn't load all the data if the Hive table is partitioned and stored in a file format that supports indexing (like ORC).
Spark's optimized engine will use partition pruning and predicate pushdown and load only the relevant data for further processing (transformations/actions).
Partition Pruning:
Choose an appropriate column (one that distributes data evenly across partitions) to partition the Hive table.
Spark partition pruning works efficiently with the Hive metastore. It will look only into the relevant partitions, based on the partition column used in the WHERE clause of your query.
Predicate PushDown:
ORC files have min/max indexes and bloom filters. This also works for string columns in ORC (I am not sure about string support in the latest Parquet), but it is more efficient on numerical columns.
Spark pushes the filter down to the underlying storage (the ORC files), so it reads only the data that can match the filter.
Below is a sample Spark snippet to create such a Hive table (assuming raw_df is the dataframe created from your raw data):
sorted_df = raw_df.sort("column2")
sorted_df.write.mode("append").format("orc").partitionBy("column1").saveAsTable("hive_table_name")
This will partition the data by the values of column1, save ORC files in HDFS and update the Hive metastore.
The table is sorted by column2 on the assumption that column2 will be used in the WHERE clause of our query (sorting is needed for the ORC index to be efficient).
Then you can query Hive and load a Spark dataframe with only the relevant data. Below is a sample:
filtered_df = spark.sql('SELECT column1,column2,column3 FROM hive_table_name WHERE column1= "some_value1" AND column2= "some_value2"')
In the sample above, Spark will look only into the some_value1 partition, as column1 is the partition column of the Hive table created earlier.
Then Spark will push the predicate (i.e., the filter) "some_value2" for column2 down to the ORC files under the "some_value1" partition only.
Here Spark will load only the values of column1, column2 and column3, ignoring all other columns in the table.
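To check whether pruning and pushdown actually happen for a given query, one option is to print the query plans with DataFrame.explain. The sketch below is only an illustration: the helper name run_and_explain is hypothetical, and the table and column names are the ones used in the sample above. In the physical plan, the file scan node typically lists entries such as PartitionFilters and PushedFilters, though the exact output varies by Spark version and file format.

```python
def run_and_explain(spark, query):
    """Run a SQL query and print its plans so pruning and pushdown can be inspected."""
    df = spark.sql(query)
    # extended=True prints the parsed, analyzed, optimized and physical plans.
    df.explain(True)
    return df


QUERY = (
    "SELECT column1, column2, column3 FROM hive_table_name "
    "WHERE column1 = 'some_value1' AND column2 = 'some_value2'"
)

# Assumed usage in a session where `spark` exists and the table has been created:
# filtered_df = run_and_explain(spark, QUERY)
```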
Unless you combine the second approach with a more advanced storage layout (bucketBy / DISTRIBUTE BY) that can be used to optimize the query, there should be no difference between these two, as long as you don't use schema inference in approach 1 (you'll have to provide a schema for the DataFrameReader).
Bucketing can be used to optimize execution plans for joins, aggregations and filters on bucketing column, but everything is still executed with Spark. In general Spark will use Hive only as metastore, not as execution engine.
I am new to Spark and I am trying to access a Hive table from Spark.
1) Created Spark Context
val hc=new HiveContext(sc)
val hivetable= hc.sql("Select * from test_db.Table")
My question: I got the table into Spark.
1) Why do we need to register the table?
2) We can run SQL operations directly, so why do we need DataFrame functions
like join, select, filter, etc.?
What is the difference between SQL queries and DataFrame operations?
3) What is Spark optimization? How does it work?
You don't need to register a temporary table if you are accessing a Hive table through Spark's HiveContext. Registering a DataFrame as a temporary table allows you to run SQL queries over its data.
Suppose you are reading data from a file at some location and want to run SQL queries over it: then you need to create a DataFrame from the Row RDD and register a temporary table over this DataFrame to run the SQL operations. To run SQL queries over that data, you need to use Spark's SQLContext in your code.
Both methods use exactly the same execution engine and internal data structures. At the end of the day all boils down to the personal preferences of the developer.
Arguably DataFrame queries are much easier to construct programmatically and
provide a minimal type safety.
Plain SQL queries can be significantly more concise and easier to understand.
They are also portable and can be used without any modifications with every supported language. With HiveContext they can also be used to expose some functionality that can be inaccessible in other ways (for example, UDFs without Spark wrappers).
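As a small illustration of the two styles described above, the sketch below writes the same filter once as SQL text against a temporary view and once with DataFrame methods. The function name, the view name people, and the columns name and age are all hypothetical.

```python
def query_two_ways(spark, df):
    """Express the same filter as a SQL string and as DataFrame method calls."""
    # A temporary view gives the DataFrame a name that SQL text can refer to.
    df.createOrReplaceTempView("people")
    via_sql = spark.sql("SELECT name FROM people WHERE age > 30")
    via_api = df.where(df["age"] > 30).select("name")
    return via_sql, via_api


# Assumed usage with a DataFrame that has `name` and `age` columns:
# via_sql, via_api = query_two_ways(spark, df)
# via_sql.explain(); via_api.explain()   # compare the two physical plans
```

Comparing the output of explain() for the two results is a simple way to see for yourself whether they are planned the same way.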
Reference: Spark sql queries vs dataframe functions
Here is a good read reference on performance comparison between Spark RDDs vs DataFrames vs SparkSQL
I don't have an answer for it, so I will leave it to you to do some research online and find a solution :)