PySpark Window function on entire data frame

Consider a PySpark data frame. I would like to summarize the entire data frame, per column, and append the result for every row.
+-----+----------+-----------+
|index|      col1|       col2|
+-----+----------+-----------+
|  0.0|0.58734024|0.085703015|
|  1.0|0.67304325| 0.17850411|
Expected result
+-----+----------+-----------+-----------+-----------+-----------+-----------+
|index|      col1|       col2|   col1_min|  col1_mean|   col2_min|  col2_mean|
+-----+----------+-----------+-----------+-----------+-----------+-----------+
|  0.0|0.58734024|0.085703015|         -5|        2.3|         -2|        1.4|
|  1.0|0.67304325| 0.17850411|         -5|        2.3|         -2|        1.4|
+-----+----------+-----------+-----------+-----------+-----------+-----------+
To my knowledge, I'll need a Window function with the whole data frame as the window, to keep the result attached to each row (instead of, for example, computing the stats separately and then joining them back to replicate them for each row).
My questions are:
How do I write a Window without any partitionBy or orderBy?
I know there is the standard Window with partitionBy and orderBy, but not one that takes everything as a single partition:
w = Window.partitionBy("col1", "col2").orderBy(desc("col1"))
df = df.withColumn("col1_mean", mean("col1").over(w))
How would I write a Window with everything as one partition?
Is there a way to write this dynamically for all columns?
Let's say I have 500 columns; it does not look great to write this out repeatedly:
df = (df
    .withColumn("col1_mean", mean("col1").over(w))
    .withColumn("col1_min", min("col1").over(w))
    .withColumn("col2_mean", mean("col2").over(w))
    .....
)
Let's assume I want multiple stats for each column, so each colx will spawn colx_min, colx_max, colx_mean.

Instead of using a window you can achieve the same with a custom aggregation in combination with a cross join:
import pyspark.sql.functions as F
from pyspark.sql.functions import broadcast
from itertools import chain
df = spark.createDataFrame([
    [1, 2.3, 1],
    [2, 5.3, 2],
    [3, 2.1, 4],
    [4, 1.5, 5]
], ["index", "col1", "col2"])
agg_cols = [(
    F.min(c).alias("min_" + c),
    F.max(c).alias("max_" + c),
    F.mean(c).alias("mean_" + c))
    for c in df.columns if c.startswith('col')]
stats_df = df.agg(*list(chain(*agg_cols)))
# there is no performance impact from crossJoin since we have only one row on the right table which we broadcast (most likely Spark will broadcast it anyway)
df.crossJoin(broadcast(stats_df)).show()
# +-----+----+----+--------+--------+---------+--------+--------+---------+
# |index|col1|col2|min_col1|max_col1|mean_col1|min_col2|max_col2|mean_col2|
# +-----+----+----+--------+--------+---------+--------+--------+---------+
# | 1| 2.3| 1| 1.5| 5.3| 2.8| 1| 5| 3.0|
# | 2| 5.3| 2| 1.5| 5.3| 2.8| 1| 5| 3.0|
# | 3| 2.1| 4| 1.5| 5.3| 2.8| 1| 5| 3.0|
# | 4| 1.5| 5| 1.5| 5.3| 2.8| 1| 5| 3.0|
# +-----+----+----+--------+--------+---------+--------+--------+---------+
Note 1: Using broadcast we avoid shuffling, since the broadcasted df will be sent to all the executors.
Note 2: With chain(*agg_cols) we flatten the list of tuples which we created in the previous step.
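For illustration, a tiny standalone snippet (plain Python, nothing Spark-specific) of the flattening that chain performs here:
from itertools import chain

pairs = [("a_min", "a_max"), ("b_min", "b_max")]
print(list(chain(*pairs)))
# ['a_min', 'a_max', 'b_min', 'b_max']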
UPDATE:
Here is the execution plan for the above program:
== Physical Plan ==
*(3) BroadcastNestedLoopJoin BuildRight, Cross
:- *(3) Scan ExistingRDD[index#196L,col1#197,col2#198L]
+- BroadcastExchange IdentityBroadcastMode, [id=#274]
+- *(2) HashAggregate(keys=[], functions=[finalmerge_min(merge min#233) AS min(col1#197)#202, finalmerge_max(merge max#235) AS max(col1#197)#204, finalmerge_avg(merge sum#238, count#239L) AS avg(col1#197)#206, finalmerge_min(merge min#241L) AS min(col2#198L)#208L, finalmerge_max(merge max#243L) AS max(col2#198L)#210L, finalmerge_avg(merge sum#246, count#247L) AS avg(col2#198L)#212])
+- Exchange SinglePartition, [id=#270]
+- *(1) HashAggregate(keys=[], functions=[partial_min(col1#197) AS min#233, partial_max(col1#197) AS max#235, partial_avg(col1#197) AS (sum#238, count#239L), partial_min(col2#198L) AS min#241L, partial_max(col2#198L) AS max#243L, partial_avg(col2#198L) AS (sum#246, count#247L)])
+- *(1) Project [col1#197, col2#198L]
+- *(1) Scan ExistingRDD[index#196L,col1#197,col2#198L]
Here we see a BroadcastExchange of a SinglePartition which broadcasts one single row, since stats_df fits into a single partition. Therefore the data being shuffled here is only one row (the minimum possible).

There is another good solution for PySpark 2.0+, where over requires a window argument: use an empty partitionBy or orderBy clause.
from pyspark.sql import functions as F, Window as W
df.withColumn(f"{c}_min", F.min(f"{c}").over(W.partitionBy()))
# or
df.withColumn(f"{c}_min", F.min(f"{c}").over(W.orderBy()))
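A minimal sketch of applying this dynamically over many columns (assuming, as in the question, that the columns to summarize all start with col):
from pyspark.sql import functions as F, Window as W

w = W.partitionBy()  # one single partition covering the whole data frame
for c in [c for c in df.columns if c.startswith("col")]:
    df = (df
          .withColumn(f"{c}_min", F.min(c).over(w))
          .withColumn(f"{c}_max", F.max(c).over(w))
          .withColumn(f"{c}_mean", F.mean(c).over(w)))
For hundreds of columns, building a single select with a list comprehension of the same expressions keeps the query plan smaller than repeated withColumn calls.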

We can also specify the window function without any orderBy/partitionBy clauses:
min("<col_name>").over()
Example:
//sample data
val df = Seq((1,2,3),(4,5,6)).toDF("i","j","k")
val df1 = df.columns.foldLeft(df)((df, c) => {
  df.withColumn(s"${c}_min", min(col(s"${c}")).over()).
    withColumn(s"${c}_max", max(col(s"${c}")).over()).
    withColumn(s"${c}_mean", mean(col(s"${c}")).over())
})
df1.show()
//+---+---+---+-----+-----+------+-----+-----+------+-----+-----+------+
//| i| j| k|i_min|i_max|i_mean|j_min|j_max|j_mean|k_min|k_max|k_mean|
//+---+---+---+-----+-----+------+-----+-----+------+-----+-----+------+
//| 1| 2| 3| 1| 4| 2.5| 2| 5| 3.5| 3| 6| 4.5|
//| 4| 5| 6| 1| 4| 2.5| 2| 5| 3.5| 3| 6| 4.5|
//+---+---+---+-----+-----+------+-----+-----+------+-----+-----+------+

I know this was a while ago, but you could also add a dummy variable column that has the same value for each row. Then the partition contains the entire dataframe.
df_dummy = df.withColumn("dummy", col("index") * 0)
w = Window.partitionBy("dummy")
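A minimal usage sketch continuing from that window (column names as in the question; lit(0) would work just as well for the dummy value):
from pyspark.sql import Window
from pyspark.sql.functions import col, mean

df_dummy = df.withColumn("dummy", col("index") * 0)
w = Window.partitionBy("dummy")
df_out = (df_dummy
          .withColumn("col1_mean", mean("col1").over(w))
          .withColumn("col2_mean", mean("col2").over(w))
          .drop("dummy"))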

Related

Mismatched feature counts in spark data frame

I am new to Spark and I am trying to clean a relatively large dataset. The problem I have is that the feature values seem to be mismatched in the original dataset. It looks something like this for the first line when I take a summary of the dataset:
+-------+---+---+
|summary|  A|  B|
+-------+---+---+
|  count|  5| 10|
+-------+---+---+
I am trying to find a way to filter based on the row with the lowest count across all features and maintain the ordering.
I would like to have:
+-------+---+---+
|summary|  A|  B|
+-------+---+---+
|  count|  5|  5|
+-------+---+---+
How could I achieve this? Thanks!
Here are two approaches for you to consider:
Simple approach
# Set up the example df
df = spark.createDataFrame([('count',5,10)],['summary','A','B'])
# +-------+---+---+
# |summary| A| B|
# +-------+---+---+
# | count| 5| 10|
# +-------+---+---+
from pyspark.sql.functions import udf, col
from pyspark.sql.types import IntegerType

@udf(returnType=IntegerType())
def get_row_min(A, B):
    return min([A, B])

df.withColumn('new_A', get_row_min(col('A'), col('B')))\
    .withColumn('new_B', col('new_A'))\
    .drop('A')\
    .drop('B')\
    .withColumnRenamed('new_A', 'A')\
    .withColumnRenamed('new_B', 'B')\
    .show()
# +-------+---+---+
# |summary| A| B|
# +-------+---+---+
# | count| 5| 5|
# +-------+---+---+
Generic approach for indirectly specified columns
# Set up df with an extra column (and an extra row to show it works)
df2 = spark.createDataFrame([('count',5,10,15),
('count',3,2,1)],
['summary','A','B','C'])
# +-------+---+---+---+
# |summary| A| B| C|
# +-------+---+---+---+
# | count| 5| 10| 15|
# | count| 3| 2| 1|
# +-------+---+---+---+
@udf(returnType=IntegerType())
def get_row_min_generic(*cols):
    return min(cols)

exclude = ['summary']
df3 = df2.withColumn('min_val', get_row_min_generic(*[col(col_name) for col_name in df2.columns
                                                      if col_name not in exclude]))
exclude.append('min_val')  # this could just be specified in the list
                           # from the beginning instead of appending
new_cols = [col('min_val').alias(c) for c in df2.columns if c not in exclude]
df_out = df3.select(['summary']+new_cols)
df_out.show()
# +-------+---+---+---+
# |summary| A| B| C|
# +-------+---+---+---+
# | count| 5| 5| 5|
# | count| 1| 1| 1|
# +-------+---+---+---+
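As a possible non-UDF alternative for the row-wise minimum, here is a sketch using the built-in least function (generally faster than a Python UDF, since it stays inside the JVM):
from pyspark.sql import functions as F

data_cols = [c for c in df2.columns if c != 'summary']
row_min = F.least(*[F.col(c) for c in data_cols])
df2.select('summary', *[row_min.alias(c) for c in data_cols]).show()
# +-------+---+---+---+
# |summary|  A|  B|  C|
# +-------+---+---+---+
# |  count|  5|  5|  5|
# |  count|  1|  1|  1|
# +-------+---+---+---+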

Spark SQL generate SCD2 without dropping historic state

Data from a relational database is loaded into Spark - supposedly daily, but in reality not every day. Furthermore, it is a full copy of the DB - no delta loading.
In order to join the dimension tables easily with the main event data I want to:
deduplicate it (i.e. this improves the potential for a broadcast join later)
have valid_to/valid_from columns so that even though the data is not available daily (inconsistently), it can still be used nicely downstream
I am using Spark 3.0.1 and want to transform the existing data SCD2-style - without losing history.
spark-shell
import org.apache.spark.sql.types._
import org.apache.spark.sql._
import org.apache.spark.sql.expressions.Window
case class Foo (key:Int, value:Int, date:String)
val d = Seq(Foo(1, 1, "20200101"), Foo(1, 8, "20200102"), Foo(1, 9, "20200120"),Foo(1, 9, "20200121"),Foo(1, 9, "20200122"), Foo(1, 1, "20200103"), Foo(2, 5, "20200101"), Foo(1, 10, "20200113")).toDF
d.show
val windowDeduplication = Window.partitionBy("key", "value").orderBy("key", "date")
val windowPrimaryKey = Window.partitionBy("key").orderBy("key", "date")
val nextThing = lead("date", 1).over(windowPrimaryKey)
d.withColumn("date", to_date(col("date"), "yyyyMMdd")).withColumn("rank", rank().over(windowDeduplication)).filter(col("rank") === 1).drop("rank").withColumn("valid_to", nextThing).withColumn("valid_to", when(nextThing.isNotNull, date_sub(nextThing, 1)).otherwise(current_date)).withColumnRenamed("date", "valid_from").orderBy("key", "valid_from", "valid_to").show
results in:
+---+-----+----------+----------+
|key|value|valid_from| valid_to|
+---+-----+----------+----------+
| 1| 1|2020-01-01|2020-01-01|
| 1| 8|2020-01-02|2020-01-12|
| 1| 10|2020-01-13|2020-01-19|
| 1| 9|2020-01-20|2020-10-09|
| 2| 5|2020-01-01|2020-10-09|
+---+-----+----------+----------+
which is already pretty good. However:
| 1| 1|2020-01-03| 2|2020-01-12|
is lost, i.e. any value which occurs again later (after an intermediary change) is lost.
How can I keep this row without keeping larger ranks such as:
d.withColumn("date", to_date(col("date"), "yyyyMMdd")).withColumn("rank", rank().over(windowDeduplication)).withColumn("valid_to", nextThing).withColumn("valid_to",
when(nextThing.isNotNull, date_sub(nextThing, 1)).otherwise(current_date)).withColumnRenamed("date", "valid_from").orderBy("key", "valid_from", "valid_to").show
+---+-----+----------+----+----------+
|key|value|valid_from|rank| valid_to|
+---+-----+----------+----+----------+
| 1| 1|2020-01-01| 1|2020-01-01|
| 1| 8|2020-01-02| 1|2020-01-02|
| 1| 1|2020-01-03| 2|2020-01-12|
| 1| 10|2020-01-13| 1|2020-01-19|
| 1| 9|2020-01-20| 1|2020-01-20|
| 1| 9|2020-01-21| 2|2020-01-21|
| 1| 9|2020-01-22| 3|2020-10-09|
| 2| 5|2020-01-01| 1|2020-10-09|
+---+-----+----------+----+----------+
which is definitely not desired.
The idea is to drop duplicates
but keep any historic changes to the data using valid_to/valid_from columns.
How can I properly transform this to an SCD2 representation, i.e. have valid_from/valid_to but not drop intermediary state?
NOTICE: I do not need to update existing data (merge into, JOIN). It is fine to recreate / overwrite it.
I.e. an approach like Implement SCD Type 2 in Spark seems to be way too complicated. Is there a better way in my case, where the state handling is not required? I.e. I have data originating from a daily full copy of a database and want to deduplicate it.
The previous approach only keeps the first (earliest) version of a duplicate. I think the only solution without a join for state handling is a window function where each value is compared against the following row - and if there is no change in the whole row, it is discarded.
Probably less efficient - but more accurate. This also depends on the use case at hand, i.e. how likely it is that a changed value will be seen again.
import org.apache.spark.sql.types._
import org.apache.spark.sql._
import org.apache.spark.sql.expressions.Window
case class Foo (key:Int, value:Int, value2:Int, date:String)
val d = Seq(Foo(1, 1,1, "20200101"), Foo(1, 8,1, "20200102"), Foo(1, 9,1, "20200120"),Foo(1, 6,1, "20200121"),Foo(1, 9,1, "20200122"), Foo(1, 1,1, "20200103"), Foo(2, 5,1, "20200101"), Foo(1, 10,1, "20200113"), Foo(1, 9,1, "20210120"),Foo(1, 9,1, "20220121"),Foo(1, 9,3, "20230122")).toDF
def compare2Rows(key: Seq[String], sortChangingIgnored: Seq[String], timeColumn: String)(df: DataFrame): DataFrame = {
  val windowPrimaryKey = Window.partitionBy(key.map(col): _*).orderBy(sortChangingIgnored.map(col): _*)
  val columnsToCompare = df.drop(key ++ sortChangingIgnored: _*).columns
  val nextDataChange = lead(timeColumn, 1).over(windowPrimaryKey)
  val deduplicated = df.withColumn("data_changes", columnsToCompare.map(e => col(e) =!= lead(col(e), 1).over(windowPrimaryKey)).reduce(_ or _)).filter(col("data_changes").isNull or col("data_changes"))
  deduplicated.withColumn("valid_to", when(nextDataChange.isNotNull, date_sub(nextDataChange, 1)).otherwise(current_date)).withColumnRenamed("date", "valid_from").drop("data_changes")
}
d.orderBy("key", "date").show
d.withColumn("date", to_date(col("date"), "yyyyMMdd")).transform(compare2Rows(Seq("key"), Seq("date"), "date")).orderBy("key", "valid_from", "valid_to").show
returns:
+---+-----+------+----------+----------+
|key|value|value2|valid_from| valid_to|
+---+-----+------+----------+----------+
| 1| 1| 1|2020-01-01|2020-01-01|
| 1| 8| 1|2020-01-02|2020-01-02|
| 1| 1| 1|2020-01-03|2020-01-12|
| 1| 10| 1|2020-01-13|2020-01-19|
| 1| 9| 1|2020-01-20|2020-01-20|
| 1| 6| 1|2020-01-21|2022-01-20|
| 1| 9| 1|2022-01-21|2023-01-21|
| 1| 9| 3|2023-01-22|2020-10-09|
| 2| 5| 1|2020-01-01|2020-10-09|
+---+-----+------+----------+----------+
for an input of:
+---+-----+------+--------+
|key|value|value2| date|
+---+-----+------+--------+
| 1| 1| 1|20200101|
| 1| 8| 1|20200102|
| 1| 1| 1|20200103|
| 1| 10| 1|20200113|
| 1| 9| 1|20200120|
| 1| 6| 1|20200121|
| 1| 9| 1|20200122|
| 1| 9| 1|20210120|
| 1| 9| 1|20220121|
| 1| 9| 3|20230122|
| 2| 5| 1|20200101|
+---+-----+------+--------+
This function has the downside that an unlimited amount of state is built up - for each key ... But as I plan to apply this to rather small dimension tables, I think it should be fine anyway.

Count types for every time difference from the time of one specific type within a time range with a granularity of one second in pyspark

I have the following time-series data in a DataFrame in pyspark:
(id, timestamp, type)
the id column can be any integer value and many rows of the same id
can exist in the table
the timestamp column is a timestamp represented by an integer (for simplification)
the type column is a string-type variable where each distinct
string in the column represents one category. One special category
among them is 'A'
My question is the following:
Is there any way to compute (with SQL or pyspark DataFrame operations):
the counts of every type
for all the time differences from the timestamp corresponding to all the rows
of type='A' within a time range (e.g. [-5,+5]), with granularity of 1 second
For example, for the following DataFrame:
ts_df = sc.parallelize([
(1,'A',100),(2,'A',1000),(3,'A',10000),
(1,'b',99),(1,'b',99),(1,'b',99),
(2,'b',999),(2,'b',999),(2,'c',999),(2,'c',999),(1,'d',999),
(3,'c',9999),(3,'c',9999),(3,'d',9999),
(1,'b',98),(1,'b',98),
(2,'b',998),(2,'c',998),
(3,'c',9998)
]).toDF(["id","type","ts"])
ts_df.show()
+---+----+-----+
| id|type| ts|
+---+----+-----+
| 1| A| 100|
| 2| A| 1000|
| 3| A|10000|
| 1| b| 99|
| 1| b| 99|
| 1| b| 99|
| 2| b| 999|
| 2| b| 999|
| 2| c| 999|
| 2| c| 999|
| 1| d| 999|
| 3| c| 9999|
| 3| c| 9999|
| 3| d| 9999|
| 1| b| 98|
| 1| b| 98|
| 2| b| 998|
| 2| c| 998|
| 3| c| 9998|
+---+----+-----+
for a time difference of -1 second the result should be:
# result for time difference = -1 sec
# b: 5
# c: 4
# d: 2
while for a time difference of -2 seconds the result should be:
# result for time difference = -2 sec
# b: 3
# c: 2
# d: 0
and so on so forth for any time difference within a time range for a granularity of 1 second.
I have tried many different ways, mostly using groupBy, but nothing seems to work.
I am mostly having difficulty with how to express the time difference from each row of type='A', even if I only have to do it for one specific time difference.
Any suggestions would be greatly appreciated!
EDIT:
If I only had to do it for one specific time difference time_difference, then I could do it the following way:
time_difference = -1
df_type_A = ts_df.where(F.col("type")=='A').selectExpr("ts as fts")
res = df_type_A.join(ts_df, on=df_type_A.fts+time_difference==ts_df.ts)\
.drop("ts","fts").groupBy(F.col("type")).count()
The returned res DataFrame gives me exactly what I want for one specific time difference. I can then create a loop and solve the problem by repeating the same query over and over again (sketched below).
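A rough sketch of that loop (assuming the [-5, +5] range mentioned above; how the per-offset results are collected is an assumption, here a simple dict):
from pyspark.sql import functions as F

df_type_A = ts_df.where(F.col("type") == 'A').selectExpr("ts as fts")

results = {}
for time_difference in range(-5, 6):
    res = (df_type_A
           .join(ts_df, on=df_type_A.fts + time_difference == ts_df.ts)
           .drop("ts", "fts")
           .groupBy("type")
           .count())
    results[time_difference] = res  # one small aggregated DataFrame per offset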
However, is there any more efficient way than that?
EDIT2 (solution)
So that's how I did it in the end:
df1 = sc.parallelize([
(1,'b',99),(1,'b',99),(1,'b',99),
(2,'b',999),(2,'b',999),(2,'c',999),(2,'c',999),(2,'d',999),
(3,'c',9999),(3,'c',9999),(3,'d',9999),
(1,'b',98),(1,'b',98),
(2,'b',998),(2,'c',998),
(3,'c',9998)
]).toDF(["id","type","ts"])
df1.show()
df2 = sc.parallelize([
(1,'A',100),(2,'A',1000),(3,'A',10000),
]).toDF(["id","type","ts"]).selectExpr("id as fid","ts as fts","type as ftype")
df2.show()
df3 = df2.join(df1, on=df1.id==df2.fid).withColumn("td", F.col("ts")-F.col("fts"))
df3.show()
df4 = df3.groupBy([F.col("type"),F.col("td")]).count()
df4.show()
I will update with performance details as soon as I have any.
Thanks!
Another way to solve this problem would be:
Divide the existing data frame into two data frames - one with A and one without A
Add a new column to the without-A data frame, which is the sum of "ts" and time_difference
Join both data frames, group by and count.
Here is the code:
from pyspark.sql.functions import lit
time_difference = 1
ts_df_A = (
ts_df
.filter(ts_df["type"] == "A")
.drop("id")
.drop("type")
)
ts_df_td = (
ts_df
.withColumn("ts_plus_td", lit(ts_df['ts'] + time_difference))
.filter(ts_df["type"] != "A")
.drop("ts")
)
joined_df = ts_df_A.join(ts_df_td, ts_df_A["ts"] == ts_df_td["ts_plus_td"])
agg_df = joined_df.groupBy("type").count()
>>> agg_df.show()
+----+-----+
|type|count|
+----+-----+
| d| 2|
| c| 4|
| b| 5|
+----+-----+
>>>
Let me know if this is what you are looking for.
Thanks,
Hussain Bohra

Pyspark Dataframes as View

For a script that I am running, I have a bunch of chained views that look at a specific set of data in SQL (I am using Apache Spark SQL):
%sql
create view view_1 as
select column_1,column_2 from original_data_table
This logic culminates in view_n.
However, I then need to perform logic that is difficult (or impossible) to implement in SQL, specifically the explode command:
%python
df_1 = sqlContext.sql("SELECT * from view_n")
df1_exploded = df_1.withColumn("exploded_column", explode(split(df_1.col_to_explode, ',')))
My Questions:
Is there a speed cost associated with switching back and forth between SQL tables and PySpark dataframes? Or, since PySpark dataframes are lazily evaluated, is it very similar to a view?
Is there a better way of switching from an SQL table to a PySpark dataframe?
You can use explode() and just about anything that DF has via Spark SQL (https://spark.apache.org/docs/latest/api/sql/index.html)
print(spark.version)
2.4.3
df = spark.createDataFrame([(1, [1,2,3]), (2, [4,5,6]), (3, [7,8,9]),],["id", "nest"])
df.printSchema()
root
|-- id: long (nullable = true)
|-- nest: array (nullable = true)
| |-- element: long (containsNull = true)
df.createOrReplaceTempView("sql_view")
spark.sql("SELECT id, explode(nest) as flatten FROM sql_view").show()
+---+-------+
| id|flatten|
+---+-------+
| 1| 1|
| 1| 2|
| 1| 3|
| 2| 4|
| 2| 5|
| 2| 6|
| 3| 7|
| 3| 8|
| 3| 9|
+---+-------+
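For comparison, a sketch of the same explode written directly with the DataFrame API; both forms are lazily evaluated and go through the same Catalyst optimizer, so switching between a temp view and a dataframe carries no inherent speed cost:
from pyspark.sql import functions as F

df.select("id", F.explode("nest").alias("flatten")).show()
# same output as the SQL version above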

How to create new columns based on cartesian product of multiple columns from pyspark dataframes

Let me take a simple example to explain what I am trying to do. Let us say we have two very simple dataframes as below:
Df1
+---+---+---+
| a1| a2| a3|
+---+---+---+
| 2| 3| 7|
| 1| 9| 6|
+---+---+---+
Df2
+---+---+
| b1| b2|
+---+---+
| 10| 2|
| 9| 3|
+---+---+
From df1 and df2 we need to create a new df with columns that are the Cartesian product of the original columns from df1 and df2. In particular, the new df will have ‘a1b1’, ’a1b2’, ’a2b1’, ’a2b2’, ’a3b1’, ’a3b2’, and the rows will be the multiplication of the corresponding columns from df1 and df2. The resulting df should look like the following:
Df3
+----+----+----+----+----+----+
|a1b1|a1b2|a2b1|a2b2|a3b1|a3b2|
+----+----+----+----+----+----+
| 20| 4| 30| 6| 70| 14|
| 9| 3| 81| 27| 54| 18|
+----+----+----+----+----+----+
I have searched the Spark online docs as well as questions posted here, but it seems they are all about the Cartesian product of rows, not columns. For example, rdd.cartesian() provides the Cartesian product of different combinations of row values, as in the following code:
r = sc.parallelize([1, 2])
r.cartesian(r).toDF().show()
+---+---+
| _1| _2|
+---+---+
| 1| 1|
| 1| 2|
| 2| 1|
| 2| 2|
+---+---+
But this is not what I need. Again, I need to create new columns instead of rows; the number of rows will remain the same in my problem. I understand a udf can eventually solve the problem. However, in my real application we have a huge dataset for which it takes too long to create all the columns (about 500 new columns as all the possible combinations of columns). We would prefer some sort of vector operation which may increase the efficiency. I may be wrong, but a Spark udf seems to be based on row operations, which may be the reason why it takes so long to finish.
Thanks a lot for any suggestions/feedback/comments.
For your convenience, I attached the simple code here to create the example dataframes shown above:
df1 = sqlContext.createDataFrame([[2,3,7],[1,9,6]],['a1','a2','a3'])
df1.show()
df2 = sqlContext.createDataFrame([[10,2],[9,3]],['b1','b2'])
df2.show()
It's not straightforward as far as I know. Here is a shot at it using eval:
# function to add row numbers to a dataframe
def addrownum(df):
    dff = df.rdd.zipWithIndex().toDF(['features', 'rownum'])
    odf = dff.map(lambda x: tuple(x.features) + tuple([x.rownum])).toDF(df.columns + ['rownum'])
    return odf
df1_ = addrownum(df1)
df2_ = addrownum(df2)
# Join based on row numbers
outputdf = df1_.join(df2_, df1_.rownum == df2_.rownum).drop(df1_.rownum).drop(df2_.rownum)
n1 = ['a1','a2','a3'] # columns in set1
n2 = ['b1','b2'] # columns in set2
# I create a string of expression that I want to execute
eval_list = ['x.'+l1+'*'+'x.'+l2 for l1 in n1 for l2 in n2]
eval_str = '('+','.join(eval_list)+')'
col_list = [l1+l2 for l1 in n1 for l2 in n2]
dfcartesian = outputdf.map(lambda x:eval(eval_str)).toDF(col_list)
Something else that might be of help to you is ElementwiseProduct in spark.ml.feature, but it will be no less complex: you multiply the elements of one vector element-wise with another vector and then expand the feature vectors back into a dataframe.
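Alternatively, assuming the joined outputdf from above is available, the same product columns can be built with plain Column arithmetic instead of eval, staying entirely within the DataFrame API (a sketch):
from pyspark.sql.functions import col

n1 = ['a1', 'a2', 'a3']  # columns from df1
n2 = ['b1', 'b2']        # columns from df2

dfcartesian = outputdf.select(
    *[(col(l1) * col(l2)).alias(l1 + l2) for l1 in n1 for l2 in n2]
)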