How can we iterate through a column vertically downwards using PySpark? - dataframe

For instance, in a dataframe where col1 is the name of a column with values 1, 2, 3 and so on for every row, how do I iterate through the 10, 20, 30, ... values alone?

Well... Bluntly said, in Spark you just don't iterate. You don't deal with rows in Spark. You just learn a new way of thinking and only deal with columns.
E.g., your example:
df = spark.range(101).toDF("col1")
df.show()
# +----+
# |col1|
# +----+
# | 0|
# | 1|
# | 2|
# | 3|
# | 4|
# | 5|
# | 6|
# | 7|
# | 8|
# | 9|
# | 10|
# | 11|
# | ...|
If you want to get only rows where col1 = 10, 20, 30, 40, ..., you must see a sequence there. You think about it and create a rule to smart-filter your dataframe:
df = df.filter('col1 % 10 = 0')
df.show()
# +----+
# |col1|
# +----+
# | 0|
# | 10|
# | 20|
# | 30|
# | 40|
# | 50|
# | 60|
# | 70|
# | 80|
# | 90|
# | 100|
# +----+
Row order is never deterministic in Spark; every action can change it. Sorting is available, but it's costly and impractical, as the next operation will ruin the order. When you sort, you pull everything onto one machine (only when data sits on a single node can you, at least temporarily, preserve the order; normally data is split across many machines and none of them is "first" or "second"). In distributed computing, data should stay distributed as much as possible.
That said, iterating may rarely be needed. There's df.collect(), which (same as sorting) collects all rows into one list on one machine (the driver, usually the weakest machine). This operation should be avoided, because it defeats the purpose of distributed computing, but in rare cases it is used. Iterating over rows is the exception, not the rule: almost any data operation is possible without it. You just search the web, think, and learn new ways of doing things.
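For completeness, a minimal sketch of that rare case, reusing the filtered dataframe from above (my addition, hedged: this is only sensible when the filtered result is known to be small):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(101).toDF("col1").filter('col1 % 10 = 0')

# collect() pulls every remaining row into a plain Python list on the driver,
# so only do this when the result is small
for row in df.collect():
    print(row['col1'])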

Related

How can I replace the values in one pyspark dataframe column with the values from another column in a sub-section of the dataframe?

I have to perform a group-by and pivot operation on a dataframe's "activity" column, and populate the new columns resulting from the pivot with the sum of the "quantity" column. One of the activity columns, however, has to be populated with the sum of the "cost" column.
Data frame before group-by and pivot:
+----+-----------+-----------+-----------+-----------+
| id | quantity | cost | activity | category |
+----+-----------+-----------+-----------+-----------+
| 1 | 2 | 2 | skiing | outdoor |
| 2 | 0 | 2 | swimming | outdoor |
+----+-----------+-----------+-----------+-----------+
pivot code:
pivotDF = df.groupBy("category").pivot("activity").sum("quantity")
result:
+----+-----------+-----------+-----------+
| id | category | skiing | swimming |
+----+-----------+-----------+-----------+
| 1 | outdoor | 2 | 5 |
| 2 | outdoor | 4 | 7 |
+----+-----------+-----------+-----------+
The problem is that for one of these activities, I need the activity column to be populated with sum("cost") instead of sum("quantity"). I can't seem to find a way to specify this during the pivot operation itself, so I thought maybe I can just exchange the values in the quantity column for the ones in the cost column wherever the activity column value corresponds to the relevant activity. However, I can't find an example of how to do this in a pyspark data frame.
Any help would be much appreciated.
You can provide more than 1 aggregation after the pivot.
Let's say the input dataframe looks like the following
# +---+---+----+--------+-------+
# | id|qty|cost| act| cat|
# +---+---+----+--------+-------+
# | 1| 2| 2| skiing|outdoor|
# | 2| 0| 2|swimming|outdoor|
# | 3| 1| 2| skiing|outdoor|
# | 4| 2| 4|swimming|outdoor|
# +---+---+----+--------+-------+
Do a pivot and use agg() to provide more than 1 aggregation.
from pyspark.sql import functions as func

data_sdf. \
    groupBy('id', 'cat'). \
    pivot('act'). \
    agg(func.sum('cost').alias('cost'),
        func.sum('qty').alias('qty')
        ). \
    show()
# +---+-------+-----------+----------+-------------+------------+
# | id| cat|skiing_cost|skiing_qty|swimming_cost|swimming_qty|
# +---+-------+-----------+----------+-------------+------------+
# | 2|outdoor| null| null| 2| 0|
# | 1|outdoor| 2| 2| null| null|
# | 3|outdoor| 2| 1| null| null|
# | 4|outdoor| null| null| 4| 2|
# +---+-------+-----------+----------+-------------+------------+
Notice the field names: PySpark automatically assigns the suffix based on the alias provided in each aggregation. Use a drop or select to retain the columns you need and rename them as you like (see the selection sketch after the second example below).
Removing id from the groupBy makes the result much better.
data_sdf. \
    groupBy('cat'). \
    pivot('act'). \
    agg(func.sum('cost').alias('cost'),
        func.sum('qty').alias('qty')
        ). \
    show()
# +-------+-----------+----------+-------------+------------+
# | cat|skiing_cost|skiing_qty|swimming_cost|swimming_qty|
# +-------+-----------+----------+-------------+------------+
# |outdoor| 4| 3| 6| 2|
# +-------+-----------+----------+-------------+------------+
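For instance, if the activity that should carry the cost is "skiing" (an assumption for illustration; the question does not say which one), and the pivoted result above is stored in a variable, say pivoted_sdf, the final selection could look like this:
pivoted_sdf = data_sdf. \
    groupBy('cat'). \
    pivot('act'). \
    agg(func.sum('cost').alias('cost'),
        func.sum('qty').alias('qty'))

result = pivoted_sdf.select(
    'cat',
    func.col('skiing_cost').alias('skiing'),     # this activity takes sum of cost
    func.col('swimming_qty').alias('swimming')   # the others take sum of quantity
)
result.show()
# +-------+------+--------+
# |    cat|skiing|swimming|
# +-------+------+--------+
# |outdoor|     4|       2|
# +-------+------+--------+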

Pyspark SQL Select but with a function?

I am looking at this SQL query:
SELECT
    tbl.id as id,
    tbl.name as my_name,
    tbl.account as new_account_id,
    CONVERT_TIMEZONE('UTC', 'America/Los_Angeles', tbl.entry_time)::DATE AS my_time
FROM tbl
I am wondering how I would convert this into a Pyspark dataframe?
Say I loaded tbl as a CSV into Pyspark like:
tbl_dataframe = spark...load('/files/tbl.csv')
Now I want to use SELECT on this dataframe, something like:
final_dataframe = tbl_dataframe.select('id', 'name', ...)
The issue here is:
How do I rename say that 'name' into 'my_name' with this select?
Is it even possible to apply that CONVERT_TIMEZONE function with select in a dataframe? What's the best/standard approach for this?
How do I rename say that 'name' into 'my_name' with this select?
Assuming your dataframe looks like this
# +---+----+
# | id|name|
# +---+----+
# | 1| foo|
# | 2| bar|
# +---+----+
There are a few different ways to do this "rename":
from pyspark.sql import functions as F

df.select(F.col('name').alias('my_name'))  # you select a specific column and give it an alias
# +-------+
# |my_name|
# +-------+
# | foo|
# | bar|
# +-------+
# or
df.withColumn('my_name', F.col('name')) # you create new column with value from old column
# +---+----+-------+
# | id|name|my_name|
# +---+----+-------+
# | 1| foo| foo|
# | 2| bar| bar|
# +---+----+-------+
# or
df.withColumnRenamed('name', 'my_name') # you rename column
# +---+-------+
# | id|my_name|
# +---+-------+
# | 1| foo|
# | 2| bar|
# +---+-------+
Is it even possible to apply that CONVERT_TIMEZONE function with select in dataframe? What's the best/standard approach for this?
CONVERT_TIMEZONE is not a standard Spark function, but if it's a Hive function that is already registered somewhere in your environment, you can try calling it through F.expr('convert_timezone()').
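As an alternative sketch (my addition, not the original CONVERT_TIMEZONE call): Spark's built-in from_utc_timestamp plus a cast to date covers the same "UTC to America/Los_Angeles, then truncate to a date" intent, assuming entry_time is stored as a UTC timestamp (or a string Spark can interpret as one):
from pyspark.sql import functions as F

final_dataframe = tbl_dataframe.select(
    F.col('id'),
    F.col('name').alias('my_name'),
    F.col('account').alias('new_account_id'),
    # interpret entry_time as UTC, shift it to America/Los_Angeles, keep only the date
    F.from_utc_timestamp('entry_time', 'America/Los_Angeles').cast('date').alias('my_time'),
)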

Mismatched feature counts in spark data frame

I am new to Spark and I am trying to clean a relatively large dataset. The problem I have is that the feature values seem to be mismatched in the original dataset. It looks something like this for the first line when I take a summary of the dataset:
+-------+---+---+
|summary|  A|  B|
+-------+---+---+
|  count|  5| 10|
+-------+---+---+
I am trying to find a way to filter based on the row with the lowest count across all features and maintain the ordering.
I would like to have:
+-------+---+---+
|summary|  A|  B|
+-------+---+---+
|  count|  5|  5|
+-------+---+---+
How could I achieve this? Thanks!
Here are two approaches for you to consider:
Simple approach
# Set up the example df
df = spark.createDataFrame([('count',5,10)],['summary','A','B'])
# +-------+---+---+
# |summary| A| B|
# +-------+---+---+
# | count| 5| 10|
# +-------+---+---+
from pyspark.sql.functions import udf, col
from pyspark.sql.types import IntegerType

@udf(returnType=IntegerType())
def get_row_min(A, B):
    return min([A, B])

df.withColumn('new_A', get_row_min(col('A'), col('B')))\
  .withColumn('new_B', col('new_A'))\
  .drop('A')\
  .drop('B')\
  .withColumnRenamed('new_A', 'A')\
  .withColumnRenamed('new_B', 'B')\
  .show()
# +-------+---+---+
# |summary| A| B|
# +-------+---+---+
# | count| 5| 5|
# +-------+---+---+
Generic approach for indirectly specified columns
# Set up df with an extra column (and an extra row to show it works)
df2 = spark.createDataFrame([('count', 5, 10, 15),
                             ('count', 3, 2, 1)],
                            ['summary', 'A', 'B', 'C'])
# +-------+---+---+---+
# |summary| A| B| C|
# +-------+---+---+---+
# | count| 5| 10| 15|
# | count| 3| 2| 1|
# +-------+---+---+---+
@udf(returnType=IntegerType())
def get_row_min_generic(*cols):
    return min(cols)

exclude = ['summary']
df3 = df2.withColumn('min_val', get_row_min_generic(*[col(col_name) for col_name in df2.columns
                                                      if col_name not in exclude]))
exclude.append('min_val')  # this could just be specified in the list from the beginning instead of appending
new_cols = [col('min_val').alias(c) for c in df2.columns if c not in exclude]
df_out = df3.select(['summary'] + new_cols)
df_out.show()
# +-------+---+---+---+
# |summary| A| B| C|
# +-------+---+---+---+
# | count| 5| 5| 5|
# | count| 1| 1| 1|
# +-------+---+---+---+
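As a side note (my addition, not part of the answer above): Spark ships a built-in least() function that computes the row-wise minimum natively, so the same result can be had without a Python UDF, which usually performs better. A minimal sketch against df2 from above:
from pyspark.sql import functions as F

exclude = ['summary']
value_cols = [c for c in df2.columns if c not in exclude]

# least() takes the row-wise minimum across the given columns natively,
# so no Python UDF (and no serialization overhead) is involved
row_min = F.least(*[F.col(c) for c in value_cols])

df2.select('summary', *[row_min.alias(c) for c in value_cols]).show()
# +-------+---+---+---+
# |summary|  A|  B|  C|
# +-------+---+---+---+
# |  count|  5|  5|  5|
# |  count|  1|  1|  1|
# +-------+---+---+---+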

PySpark Window function on entire data frame

Consider a PySpark data frame. I would like to summarize the entire data frame, per column, and append the result for every row.
+-----+----------+-----------+
|index| col1| col2 |
+-----+----------+-----------+
| 0.0|0.58734024|0.085703015|
| 1.0|0.67304325| 0.17850411|
Expected result
+-----+----------+-----------+-----------+-----------+-----------+-----------+
|index| col1| col2 | col1_min | col1_mean |col2_min | col2_mean
+-----+----------+-----------+-----------+-----------+-----------+-----------+
| 0.0|0.58734024|0.085703015| -5 | 2.3 | -2 | 1.4 |
| 1.0|0.67304325| 0.17850411| -5 | 2.3 | -2 | 1.4 |
To my knowledge, I'll need a Window function with the whole data frame as the window, to keep the result on each row (instead of, for example, computing the stats separately and then joining them back to replicate on each row).
My questions are:
How do I write a Window without any partitionBy or orderBy?
I know there is the standard Window with partition and order, but not one that treats everything as a single partition:
w = Window.partitionBy("col1", "col2").orderBy(desc("col1"))
df = df.withColumn("col1_mean", mean("col1").over(w))
How would I write a Window with everything as one partition?
Any way to write dynamically for all columns?
Let's say I have 500 columns, it does not look great to write repeatedly.
df = (df
      .withColumn("col1_mean", mean("col1").over(w))
      .withColumn("col1_min", min("col1").over(w))
      .withColumn("col2_mean", mean("col2").over(w))
      .....
     )
Let's assume I want multiple stats for each column, so each colx will spawn colx_min, colx_max, colx_mean.
Instead of using a window, you can achieve the same with a custom aggregation combined with a cross join:
import pyspark.sql.functions as F
from pyspark.sql.functions import broadcast
from itertools import chain
df = spark.createDataFrame([
    [1, 2.3, 1],
    [2, 5.3, 2],
    [3, 2.1, 4],
    [4, 1.5, 5]
], ["index", "col1", "col2"])

agg_cols = [(
    F.min(c).alias("min_" + c),
    F.max(c).alias("max_" + c),
    F.mean(c).alias("mean_" + c))
    for c in df.columns if c.startswith('col')]

stats_df = df.agg(*list(chain(*agg_cols)))
# there is no performance impact from crossJoin since we have only one row on the right table which we broadcast (most likely Spark will broadcast it anyway)
df.crossJoin(broadcast(stats_df)).show()
# +-----+----+----+--------+--------+---------+--------+--------+---------+
# |index|col1|col2|min_col1|max_col1|mean_col1|min_col2|max_col2|mean_col2|
# +-----+----+----+--------+--------+---------+--------+--------+---------+
# | 1| 2.3| 1| 1.5| 5.3| 2.8| 1| 5| 3.0|
# | 2| 5.3| 2| 1.5| 5.3| 2.8| 1| 5| 3.0|
# | 3| 2.1| 4| 1.5| 5.3| 2.8| 1| 5| 3.0|
# | 4| 1.5| 5| 1.5| 5.3| 2.8| 1| 5| 3.0|
# +-----+----+----+--------+--------+---------+--------+--------+---------+
Note1: Using broadcast we avoid shuffling, since the broadcasted df will be sent to all the executors.
Note2: with chain(*agg_cols) we flatten the list of tuples which we created in the previous step.
UPDATE:
Here is the execution plan for the above program:
== Physical Plan ==
*(3) BroadcastNestedLoopJoin BuildRight, Cross
:- *(3) Scan ExistingRDD[index#196L,col1#197,col2#198L]
+- BroadcastExchange IdentityBroadcastMode, [id=#274]
+- *(2) HashAggregate(keys=[], functions=[finalmerge_min(merge min#233) AS min(col1#197)#202, finalmerge_max(merge max#235) AS max(col1#197)#204, finalmerge_avg(merge sum#238, count#239L) AS avg(col1#197)#206, finalmerge_min(merge min#241L) AS min(col2#198L)#208L, finalmerge_max(merge max#243L) AS max(col2#198L)#210L, finalmerge_avg(merge sum#246, count#247L) AS avg(col2#198L)#212])
+- Exchange SinglePartition, [id=#270]
+- *(1) HashAggregate(keys=[], functions=[partial_min(col1#197) AS min#233, partial_max(col1#197) AS max#235, partial_avg(col1#197) AS (sum#238, count#239L), partial_min(col2#198L) AS min#241L, partial_max(col2#198L) AS max#243L, partial_avg(col2#198L) AS (sum#246, count#247L)])
+- *(1) Project [col1#197, col2#198L]
+- *(1) Scan ExistingRDD[index#196L,col1#197,col2#198L]
Here we see a BroadcastExchange of a SinglePartition which is broadcasting one single row since stats_df can fit into a SinglePartition. Therefore the data being shuffled here is only one row (the minimum possible).
There is another good solution for PySpark 2.0+, where over() requires a window argument: use an empty partitionBy() or orderBy() clause.
from pyspark.sql import functions as F, Window as W
df.withColumn(f"{c}_min", F.min(f"{c}").over(W.partitionBy()))
# or
df.withColumn(f"{c}_min", F.min(f"{c}").over(W.orderBy()))
We can also specify the window function without any orderBy/partitionBy clauses:
min("<col_name>").over()
Example:
//sample data
val df=Seq((1,2,3),(4,5,6)).toDF("i","j","k")
val df1 = df.columns.foldLeft(df)((df, c) => {
  df.withColumn(s"${c}_min", min(col(s"${c}")).over()).
    withColumn(s"${c}_max", max(col(s"${c}")).over()).
    withColumn(s"${c}_mean", mean(col(s"${c}")).over())
})
df1.show()
//+---+---+---+-----+-----+------+-----+-----+------+-----+-----+------+
//| i| j| k|i_min|i_max|i_mean|j_min|j_max|j_mean|k_min|k_max|k_mean|
//+---+---+---+-----+-----+------+-----+-----+------+-----+-----+------+
//| 1| 2| 3| 1| 4| 2.5| 2| 5| 3.5| 3| 6| 4.5|
//| 4| 5| 6| 1| 4| 2.5| 2| 5| 3.5| 3| 6| 4.5|
//+---+---+---+-----+-----+------+-----+-----+------+-----+-----+------+
I know this was a while ago, but you could also add a dummy variable column that has the same value for each row. Then the partition contains the entire dataframe.
df_dummy = df.withColumn("dummy", col("index") * 0)
w = Window.partitionBy("dummy")
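A short sketch of how that plays out end to end (my addition, not part of the original answer; lit(0) works just as well as multiplying index by zero):
from pyspark.sql import functions as F, Window

# a constant column means the single partition spans the whole dataframe
df_dummy = df.withColumn("dummy", F.lit(0))
w = Window.partitionBy("dummy")

result = (df_dummy
          .withColumn("col1_min", F.min("col1").over(w))
          .withColumn("col1_mean", F.mean("col1").over(w))
          .withColumn("col2_min", F.min("col2").over(w))
          .withColumn("col2_mean", F.mean("col2").over(w))
          .drop("dummy"))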

How to create new columns based on cartesian product of multiple columns from pyspark dataframes

Let me take a simple example to explain what I am trying to do. let us say we have two very simple dataframes as below:
Df1
+---+---+---+
| a1| a2| a3|
+---+---+---+
| 2| 3| 7|
| 1| 9| 6|
+---+---+---+
Df2
+---+---+
| b1| b2|
+---+---+
| 10| 2|
| 9| 3|
+---+---+
From df1 and df2, we need to create a new df with columns that are the Cartesian product of the original columns from df1 and df2. In particular, the new df will have columns 'a1b1', 'a1b2', 'a2b1', 'a2b2', 'a3b1', 'a3b2', and each row will be the product of the corresponding columns from df1 and df2. The result df should look like the following:
Df3
+----+----+----+----+----+----+
|a1b1|a1b2|a2b1|a2b2|a3b1|a3b2|
+----+----+----+----+----+----+
| 20| 4| 30| 6| 70| 14|
| 9| 3| 81| 27| 54| 18|
+----+----+----+----+----+----+
I have searched the Spark online docs as well as questions posted here, but it seems that they are all about the Cartesian product of rows, not columns. For example, rdd.cartesian() provides the Cartesian product of row value combinations, as in the following code:
r = sc.parallelize([1, 2])
r.cartesian(r).toDF().show()
+---+---+
| _1| _2|
+---+---+
| 1| 1|
| 1| 2|
| 2| 1|
| 2| 2|
+---+---+
But this is not what I need. Again, I need to create new columns instead of rows. The number of rows will remain the same in my problem. I understand a udf could eventually solve the problem. However, in my real application we have a huge dataset for which it takes too long to create all the columns (about 500 new columns, i.e. all possible combinations of columns). We would prefer some sort of vectorized operation that may increase efficiency. I may be wrong, but a Spark udf seems to be based on row operations, which may be the reason it took so long to finish.
Thanks a lot for any suggestions/feedback/comments.
For your convenience, I attached the simple code here to create the example dataframes shown above:
df1 = sqlContext.createDataFrame([[2,3,7],[1,9,6]],['a1','a2','a3'])
df1.show()
df2 = sqlContext.createDataFrame([[10,2],[9,3]],['b1','b2'])
df2.show()
It's not straightforward as far as I know. Here is a shot at it using eval:
# function to add row numbers to a dataframe
def addrownum(df):
    dff = df.rdd.zipWithIndex().toDF(['features', 'rownum'])
    odf = dff.rdd.map(lambda x: tuple(x.features) + tuple([x.rownum])).toDF(df.columns + ['rownum'])
    return odf

df1_ = addrownum(df1)
df2_ = addrownum(df2)

# Join based on row numbers
outputdf = df1_.join(df2_, df1_.rownum == df2_.rownum).drop(df1_.rownum).drop(df2_.rownum)

n1 = ['a1', 'a2', 'a3']  # columns in set1
n2 = ['b1', 'b2']        # columns in set2

# I create a string of expressions that I want to execute
eval_list = ['x.' + l1 + '*' + 'x.' + l2 for l1 in n1 for l2 in n2]
eval_str = '(' + ','.join(eval_list) + ')'
col_list = [l1 + l2 for l1 in n1 for l2 in n2]

dfcartesian = outputdf.rdd.map(lambda x: eval(eval_str)).toDF(col_list)
Something else that might be of help to you is ElementwiseProduct in spark.ml.feature, but it will be no less complex: you multiply the elements of one vector element-wise with the other vector and then expand the feature vectors back into a dataframe.
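As a hedged alternative sketch (my addition, not part of the answer above): once the two dataframes are joined on a row number, the products can be built as plain column expressions, which avoids eval and RDD round-trips and lets Catalyst optimize the whole plan. It assumes, as in the answer above, that rows of df1 and df2 are meant to be paired by position:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# attach a row number to each dataframe; monotonically_increasing_id is not
# consecutive, so row_number over it gives a joinable 1..n index on both sides
w = Window.orderBy(F.monotonically_increasing_id())
df1_ = df1.withColumn('rownum', F.row_number().over(w))
df2_ = df2.withColumn('rownum', F.row_number().over(w))

joined = df1_.join(df2_, 'rownum')

n1 = ['a1', 'a2', 'a3']
n2 = ['b1', 'b2']

# one product column per (a, b) pair -- pure column arithmetic, no eval
df3 = joined.select([(F.col(a) * F.col(b)).alias(a + b) for a in n1 for b in n2])
df3.show()
Note that the unpartitioned window used for the row number pulls data to a single partition, the same trade-off the zipWithIndex approach makes implicitly.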