I am trying to transpose/unpivot a DataFrame in SparkR. I can't find any direct method in the SparkR package to unpivot a DataFrame, nor am I able to use an R package on a SparkR DataFrame, even after using the includePackage method. It would be helpful if someone could let me know whether there is a direct way to unpivot using SparkR, or an alternative such as Hive.
"Nor am I able to use an R package on a SparkR DataFrame"
Native R commands don't run on Spark DataFrames; only Spark commands do. If you want to run an R command on a Spark DataFrame you can use collect() to convert it to an R data.frame, but you lose the benefits of distributed processing.
The Spark DataFrame is a similar construct to a table in a relational database. By working with Spark commands on a Spark DataFrame you will retain the benefits of distributed processing across the cluster.
It's difficult to answer such a general question - normally on this forum people expect specific examples with data and code. In general, if I wanted to un-pivot a relational table then the most basic way would be to create a set of queries, each query containing the row key plus one column, filtered for non-null in the column. I would then union together the multiple results into a new DataFrame.
If your preference is for R language syntax, that union can be done using the unionAll(x,y) command in SparkR, which will be processed across the cluster (unlike an R command on an R data.frame).
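For concreteness, here is a minimal sketch of that select-one-column-then-union pattern in PySpark, using hypothetical names (wide_df as the input, id as the row key, col_a and col_b as value columns); in SparkR the same steps map onto select(), where() with isNotNull(), and unionAll():

from functools import reduce
from pyspark.sql import functions as F

# one projection per value column: row key + column name + value, non-nulls only
# (assumes the value columns share a compatible type; cast them first if not)
value_cols = ["col_a", "col_b"]
pieces = [
    wide_df.select(
        F.col("id"),
        F.lit(c).alias("column_name"),
        F.col(c).alias("value"),
    ).where(F.col(c).isNotNull())
    for c in value_cols
]

# union the per-column results into one long DataFrame
long_df = reduce(lambda a, b: a.union(b), pieces)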
Related
My goal is to write a dbt macro that will allow me to flatten a table column with arbitrarily nested JSON content.
I have already found a wonderful tutorial for this for Snowflake; however, I would like to implement it for Databricks (Delta Lake), using SQL.
Ultimately, I am looking for the Databricks equivalent of the LATERAL FLATTEN function in Snowflake.
The following is an example of a source:
Source
This should ultimately be transformed, using SQL, into the following target state:
Target
I have already looked at several projects, for example json-denormalize. However, I would like to implement this completely in SQL.
I have also seen the Databricks SQL features json_object_keys, LATERAL VIEW, and explode, but I can't make sense of how exactly I should approach the problem.
Can someone steer me in the right direction?
I want to run the same SQL logic twice, on two different underlying tables. Is there a way to do this in spark that doesn't involve writing the exact same logic twice with just the table name being different?
You can use spark.sql(s"query logic from ${tablename}").
Another way is to use unbound columns via col("column_name") instead of referencing them through a specific DataFrame, and then wrap the logic in a function:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}

def processDf(df: DataFrame): DataFrame = {
  // dummy transformation, just to illustrate the pattern
  df.withColumn("some_col", col("input_col") + lit(5))
}
Now you can pass to this function any DataFrame that has a numeric input_col in its schema, and it will work regardless of which DataFrame reference it came from. For incompatible schemas and more advanced use cases, I would advise looking into Transformers from Spark ML.
It is a common pattern in Spark ML for a Transformer's transform method to take a Dataset[_] and output a DataFrame; where schemas differ, you can pass the differing column names as parameters.
I have a DataFrame "Forecast" with columns Store, Item, FC_startdate, FC_enddate, FC_qty.
Another DataFrame "Actual" has columns Store, Item, Saledate, Sales_qty.
I want to create a UDF that takes parameters p_store, p_item, p_startdate, p_enddate, gets the sum of Sales_qty between those dates, and adds it as a new column (Act_qty) to the "Forecast" DataFrame.
But Spark does not allow passing a DataFrame into a UDF along with the fields of Forecast.
Instead of using a merge, what could the solution be?
After defining and registering your UDF, you can use it in your transformation code like any other function of the spark-sql library.
As with the spark-sql library functions, you can only pass in columns of your DataFrame and return the processed value; DataFrames cannot be passed to UDFs.
So in your case, you can transform your current DataFrame into another DataFrame by using the UDF as a function, and then proceed from there.
https://docs.databricks.com/spark/latest/spark-sql/udf-scala.html
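For illustration, a minimal PySpark sketch of that usage, assuming forecast_df is the Forecast DataFrame from the question (the UDF's logic here is just a dummy example):

from pyspark.sql.functions import udf, col
from pyspark.sql.types import DoubleType

# a UDF receives column values row by row and returns a single processed value
add_five = udf(lambda qty: qty + 5.0 if qty is not None else None, DoubleType())

# used like any other spark-sql function: columns in, a new column out
forecast_plus = forecast_df.withColumn("FC_qty_plus_5", add_five(col("FC_qty")))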
A golden rule is that anything that can be done without UDFs should be done without UDFs. They are better suited to very specific transformations on a single row than to the big aggregation-type operation you describe.
In this case it seems like you could just use Spark SQL: select the rows of Actual where the Saledate is between the dates you would like (Spark understands dates natively; refer to the documentation), sum Sales_qty per Store, per Item, or both (I am not sure what you intend to do), rename the sum column, and join this new DataFrame back into Forecast using Store, Item, or both again.
If you insist on using UDFs, however, you will have to pass columns rather than DataFrames as arguments, and I can't think of a straightforward way to achieve what you describe with UDFs without sacrificing a lot of performance.
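A rough PySpark sketch of that join-and-aggregate approach, assuming forecast_df and actual_df hold the Forecast and Actual data from the question (the between-dates condition is expressed as part of the join, since the dates come from each Forecast row):

from pyspark.sql import functions as F

# keep only Actual rows that match the store/item and fall inside the forecast window
joined = forecast_df.join(
    actual_df,
    (forecast_df.Store == actual_df.Store)
    & (forecast_df.Item == actual_df.Item)
    & (actual_df.Saledate >= forecast_df.FC_startdate)
    & (actual_df.Saledate <= forecast_df.FC_enddate),
    "left",
)

# sum the sales per forecast row and attach the total as Act_qty
forecast_with_actuals = joined.groupBy(
    forecast_df.Store, forecast_df.Item,
    forecast_df.FC_startdate, forecast_df.FC_enddate, forecast_df.FC_qty,
).agg(F.sum(actual_df.Sales_qty).alias("Act_qty"))

Forecast rows with no matching sales end up with a null Act_qty; wrap the sum in F.coalesce if you would rather have zero.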
I have been playing around with Apache Spark. I first learned PostgreSQL, and I have a few queries that I need to run on Spark. I managed to run them as SQL strings in Spark SQL, but now I have to perform RDD operations to get the same results. I load my data from CSV files into maps. Now I have to select specific columns in those maps, but I do not know how to join them (multiple maps/CSV files). My second question is how best to perform RDD operations in order to get the same results as the PostgreSQL queries.
I tried reading up on RDD operations, which include transformations, among them join, but it is not letting me join maps.
One of the queries:
SELECT Tournaments.TYear, Countries.Name,
       MAX(Matches.MatchDate) - MIN(Matches.MatchDate) AS LENGTH
FROM Tournaments, Countries, Hosts, Teams, Matches
WHERE Tournaments.TYear = Hosts.TYear
  AND Countries.Cid = Hosts.Cid
  AND (Teams.Tid = Matches.HomeTid OR Teams.Tid = Matches.VisitTid)
  AND date_part('year', Matches.MatchDate)::text LIKE (Tournaments.TYear || '%')
GROUP BY Tournaments.TYear, Countries.Name
ORDER BY LENGTH, Tournaments.TYear ASC
When you say you are trying to join "maps", are you referring to RDDs? Spark data is contained within RDDs, which can be transformed using map transformations. What is the reason you are unable to use Spark SQL? Running this query with Spark SQL on DataFrames would be the easiest way to translate what you already have into Spark.
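As a rough sketch of that suggestion, assuming each table was exported to a CSV file with a header row (the file names are hypothetical), and with the PostgreSQL-specific pieces (date_part(...)::text and the || operator) swapped for Spark SQL's year(), CAST and concat():

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# load each CSV as a DataFrame and expose it to Spark SQL under its table name
for name in ["Tournaments", "Countries", "Hosts", "Teams", "Matches"]:
    spark.read.csv(f"{name}.csv", header=True, inferSchema=True) \
        .createOrReplaceTempView(name)

result = spark.sql("""
    SELECT Tournaments.TYear, Countries.Name,
           datediff(MAX(Matches.MatchDate), MIN(Matches.MatchDate)) AS LENGTH
    FROM Tournaments, Countries, Hosts, Teams, Matches
    WHERE Tournaments.TYear = Hosts.TYear
      AND Countries.Cid = Hosts.Cid
      AND (Teams.Tid = Matches.HomeTid OR Teams.Tid = Matches.VisitTid)
      AND CAST(year(Matches.MatchDate) AS STRING)
          LIKE concat(CAST(Tournaments.TYear AS STRING), '%')
    GROUP BY Tournaments.TYear, Countries.Name
    ORDER BY LENGTH, Tournaments.TYear ASC
""")
result.show()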
I am new to PySpark DataFrames and I am following a sample from this link. In that link they use a pandas DataFrame, whereas I want to achieve the same using a Spark DataFrame. I am stuck on an issue where I want to transpose the table, and I couldn't find a better way to do it. As there are so many columns, I find it difficult to implement and understand pivot. Is there a better way to do that? Can I use pandas in PySpark in a cluster environment?
In the PySpark API, pyspark.mllib.linalg.distributed.BlockMatrix has a transpose function.
If you have a DataFrame df with columns id (a numeric row index) and features (an array of numbers), you can build an IndexedRowMatrix from it, convert it to a BlockMatrix, and transpose it (the 2, 2 block size is just an example):
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import IndexedRowMatrix
bm_transpose = IndexedRowMatrix(df.rdd.map(lambda x: (x[0], Vectors.dense(x[1])))) \
    .toBlockMatrix(2, 2).transpose()
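To get from the transposed BlockMatrix back to something DataFrame-shaped, one option (a sketch; this only works for purely numeric data) is to go through toIndexedRowMatrix():

# each IndexedRow carries the new row index and a vector of values
rows = bm_transpose.toIndexedRowMatrix().rows
transposed_df = rows.map(lambda r: (r.index, r.vector.toArray().tolist())) \
    .toDF(["id", "features"])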