Map Spark DF to (row_number, column_number, value) format - dataframe

I have a Dataframe in the following shape
1 2
5 9
How can I convert it to (row_num, col_num, value) format
0 0 1
0 1 2
1 0 5
1 1 9
Is there any way to apply some function or any mapper?
Thanks in advance

Check below code.
scala> import org.apache.spark.sql.expressions._
import org.apache.spark.sql.expressions._
scala> val colExpr = array(df.columns.zipWithIndex.map(c => struct(lit(c._2).as("col_name"),col(c._1).as("value"))):_*)
colExpr: org.apache.spark.sql.Column = array(named_struct(col_name, 0 AS `col_name`, NamePlaceholder(), a AS `value`), named_struct(col_name, 1 AS `col_name`, NamePlaceholder(), b AS `value`))
scala> df.withColumn("row_number",lit(row_number().over(Window.orderBy(lit(1)))-1)).withColumn("data",explode(colExpr)).select($"row_number",$"data.*").show(false)
+----------+--------+-----+
|row_number|col_name|value|
+----------+--------+-----+
|0 |0 |1 |
|0 |1 |2 |
|1 |0 |5 |
|1 |1 |9 |
+----------+--------+-----+

You can do it by transposing the data as:
from pyspark.sql.functions import *
from pyspark.sql import Window
df = spark.createDataFrame([(1,2),(5,9)],['col1','col2'])
# Rename the columns to their positions: '0', '1', ...
df = df.toDF(*[str(i) for i in range(len(df.columns))])
# Build the stack() arguments as "index, `column`" pairs, one pair per column
col_list = ','.join([f'{i},`{i}`' for i in df.columns])
n_cols = len(df.columns)  # stack() emits one output row per column
df.withColumn('row_id', row_number().over(Window.orderBy(lit(1))) - 1).select('row_id',
    expr(f'''stack({n_cols},{col_list}) as (col_id,col_value)''')).show()
+------+------+---------+
|row_id|col_id|col_value|
+------+------+---------+
| 0| 0| 1|
| 0| 1| 2|
| 1| 0| 5|
| 1| 1| 9|
+------+------+---------+

In PySpark, row_number() and posexplode() will be helpful. Try this:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
tst= sqlContext.createDataFrame([(1,7,80),(1,8,40),(1,5,100),(5,8,90),(7,6,50),(0,3,60)],schema=['col1','col2','col3'])
tst1= tst.withColumn("row_number",F.row_number().over(Window.orderBy(F.lit(1)))-1)
#%%
tst_arr = tst1.withColumn("arr",F.array(tst.columns))
tst_new = tst_arr.select('row_number','arr').select('row_number',F.posexplode('arr'))
results:
In [47]: tst_new.show()
+----------+---+---+
|row_number|pos|col|
+----------+---+---+
| 0| 0| 1|
| 0| 1| 7|
| 0| 2| 80|
| 1| 0| 1|
| 1| 1| 8|
| 1| 2| 40|
| 2| 0| 1|
| 2| 1| 5|
| 2| 2|100|
| 3| 0| 5|
| 3| 1| 8|
| 3| 2| 90|
| 4| 0| 7|
| 4| 1| 6|
| 4| 2| 50|
| 5| 0| 0|
| 5| 1| 3|
| 5| 2| 60|
+----------+---+---+
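The answers above number the rows with row_number() over Window.orderBy(lit(1)). A window without a partition clause moves all rows into a single partition, so on large data another way of getting the row index may be worth trying. The sketch below is one such option, assuming the RDD API is acceptable: RDD.zipWithIndex() pairs each row with its index, and a small helper (to_triples, a name made up for this example) turns one indexed row into (row_num, col_num, value) tuples. The helper is plain Python, so it is shown here on a local list, and the Spark call is left as a comment.

```python
def to_triples(indexed_row):
    """Turn (row, row_num) into a list of (row_num, col_num, value) tuples."""
    row, row_num = indexed_row
    return [(row_num, col_num, value) for col_num, value in enumerate(row)]

# Local stand-in for what rdd.zipWithIndex() yields: (row, index) pairs.
rows = [(1, 2), (5, 9)]
indexed = [(row, i) for i, row in enumerate(rows)]
triples = [t for pair in indexed for t in to_triples(pair)]
print(triples)

# With Spark (not run here), the same helper could be used as:
# df.rdd.zipWithIndex().flatMap(to_triples).toDF(['row_num', 'col_num', 'value'])
```

The row order produced this way follows the partition order of the RDD, so it is only meaningful if the dataframe has a defined order to begin with.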

Related

How to add a new column to a Spark Dataframe?

Currently I have a dataframe like below
+---+
| id|
+---+
| 0|
| 1|
+---+
and I want to add a new column called product_id.
+-----------+
| product_id|
+-----------+
| A|
| B|
| C|
+-----------+
For each id in the dataframe, I want to add all product_id:
+---+----------+
| id|product_id|
+---+----------+
| 0| A|
| 0| B|
| 0| C|
| 1| A|
| 1| B|
| 1| C|
+---+----------+
Is there a way to do this?
This is generation of a sample dataframe
df = spark.range(2)
df.show()
+---+
| id|
+---+
| 0|
| 1|
+---+
Option 1: stack
stack_df = df.selectExpr("*","stack(3,'A','B','C') as product_id")
stack_df.show()
+---+----------+
| id|product_id|
+---+----------+
| 0| A|
| 0| B|
| 0| C|
| 1| A|
| 1| B|
| 1| C|
+---+----------+
Option 2: explode
explode_df = df.selectExpr("*","explode(array('A','B','C')) as product_id")
explode_df.show()
+---+----------+
| id|product_id|
+---+----------+
| 0| A|
| 0| B|
| 0| C|
| 1| A|
| 1| B|
| 1| C|
+---+----------+
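Both options pair every id with every product_id, which is a cross product. As a plain-Python illustration of the expected result (not Spark code; the variable names are made up for this example):

```python
from itertools import product

ids = [0, 1]
product_ids = ['A', 'B', 'C']

# Every id combined with every product_id, in the same order as the tables above.
pairs = list(product(ids, product_ids))
for id_, product_id in pairs:
    print(id_, product_id)
```

If the product ids already live in their own dataframe, DataFrame.crossJoin is the built-in way to get the same pairing.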

spark sql spark.range(7).select('*,'id % 3 as "bucket").show // how to understand ('*,'id % 3 as "bucket")

spark.range(7).select('*,'id % 3 as "bucket").show
// result:
+---+------+
| id|bucket|
+---+------+
| 0| 0|
| 1| 1|
| 2| 2|
| 3| 0|
| 4| 1|
| 5| 2|
| 6| 0|
+---+------+
spark.range(7).withColumn("bucket",$"id" % 3).show
// result:
+---+------+
| id|bucket|
+---+------+
| 0| 0|
| 1| 1|
| 2| 2|
| 3| 0|
| 4| 1|
| 5| 2|
| 6| 0|
+---+------+
I want to understand what '* means and how the whole select statement works.
Are these two approaches fundamentally equivalent?
spark.range(7).select('*,'id % 3 as "bucket").show
spark.range(7).select($"*",$"id" % 3 as "bucket").show
spark.range(7).select(col("*"),col("id") % 3 as "bucket").show
val df = spark.range(7)
df.select(df("*"),df("id") % 3 as "bucket").show
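These four forms are equivalent. "*" selects all existing columns. '* and 'id are Scala Symbols that spark.implicits._ converts to Column, $"..." is a string interpolator that does the same, and col(...) and df(...) build the Column explicitly. See the Column API documentation:
// https://spark.apache.org/docs/2.4.4/api/scala/index.html#org.apache.spark.sql.Column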

Pyspark : How to find and convert top 5 row values to 1 and rest all to 0?

I have a dataframe and I need to find the 5 largest values in each row, convert only those values to 1 and all the rest to 0, while maintaining the dataframe structure, i.e. the column names should remain the same.
I tried using toLocalIterator, converting each row to a list and then setting the top 5 values to 1.
But it gives me a java.lang.OutOfMemoryError when I run the code on a large dataset.
While looking at the logs I found that a task of very large size (around 25000 KB) is submitted, while the maximum recommended size is 100 KB.
Is there a better way to find the top 5 values and convert them to a certain value (1 in this case) and all the rest to 0 that uses less memory?
EDIT 1:
For example, if I have these 10 columns and 5 rows as the input
+----+----+----+----+----+----+----+----+----+----+
| 1| 2| 3| 4| 5| 6| 7| 8| 9| 10|
+----+----+----+----+----+----+----+----+----+----+
|0.74| 0.9|0.52|0.85|0.18|0.23| 0.3| 0.0| 0.1|0.07|
|0.11|0.57|0.81|0.81|0.45|0.48|0.86|0.38|0.41|0.45|
|0.03|0.84|0.17|0.96|0.09|0.73|0.25|0.05|0.57|0.66|
| 0.8|0.94|0.06|0.44| 0.2|0.89| 0.9| 1.0|0.48|0.14|
|0.73|0.86|0.68| 1.0|0.78|0.17|0.11|0.19|0.18|0.83|
+----+----+----+----+----+----+----+----+----+----+
this is what I want as the output
+---+---+---+---+---+---+---+---+---+---+
| 1| 2| 3| 4| 5| 6| 7| 8| 9| 10|
+---+---+---+---+---+---+---+---+---+---+
| 1| 1| 1| 1| 0| 0| 1| 0| 0| 0|
| 0| 1| 1| 1| 0| 1| 1| 0| 0| 0|
| 0| 1| 0| 1| 0| 1| 0| 0| 1| 1|
| 1| 1| 0| 0| 0| 1| 1| 1| 0| 0|
| 1| 1| 0| 1| 1| 0| 0| 0| 0| 1|
+---+---+---+---+---+---+---+---+---+---+
As you can see, I want to find the top (max) 5 values in each row and convert them to 1 and the rest of the values to 0, while maintaining the structure, i.e. rows and columns.
This is what I am using (which gives me an OutOfMemoryError):
for row in prob_df.rdd.toLocalIterator():
    rowPredDict = {}
    for cat in categories:
        rowPredDict[cat] = row[cat]
    sorted_row = sorted(rowPredDict.items(), key=lambda kv: kv[1], reverse=True)
    #print(rowPredDict)
    rowPredDict = rowPredDict.fromkeys(rowPredDict, 0)
    rowPredDict[sorted_row[0:5][0][0]] = 1
    rowPredDict[sorted_row[0:5][1][0]] = 1
    rowPredDict[sorted_row[0:5][2][0]] = 1
    rowPredDict[sorted_row[0:5][3][0]] = 1
    rowPredDict[sorted_row[0:5][4][0]] = 1
    #print(count,sorted_row[0:2][0][0],",",sorted_row[0:2][1][0])
    rowPredList.append(rowPredDict)
    #count=count+1
I don't have enough volume for performance testing but could you try below approach using spark functions array apis
1. Prepare Dataset:
import pyspark.sql.functions as f
l1 = [(0.74,0.9,0.52,0.85,0.18,0.23,0.3,0.0,0.1,0.07),
(0.11,0.57,0.81,0.81,0.45,0.48,0.86,0.38,0.41,0.45),
(0.03,0.84,0.17,0.96,0.09,0.73,0.25,0.05,0.57,0.66),
(0.8,0.94,0.06,0.44,0.2,0.89,0.9,1.0,0.48,0.14),
(0.73,0.86,0.68,1.0,0.78,0.17,0.11,0.19,0.18,0.83)]
df = spark.createDataFrame(l1).toDF('col_1','col_2','col_3','col_4','col_5','col_6','col_7','col_8','col_9','col_10')
df.show()
+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
|col_1|col_2|col_3|col_4|col_5|col_6|col_7|col_8|col_9|col_10|
+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| 0.74| 0.9| 0.52| 0.85| 0.18| 0.23| 0.3| 0.0| 0.1| 0.07|
| 0.11| 0.57| 0.81| 0.81| 0.45| 0.48| 0.86| 0.38| 0.41| 0.45|
| 0.03| 0.84| 0.17| 0.96| 0.09| 0.73| 0.25| 0.05| 0.57| 0.66|
| 0.8| 0.94| 0.06| 0.44| 0.2| 0.89| 0.9| 1.0| 0.48| 0.14|
| 0.73| 0.86| 0.68| 1.0| 0.78| 0.17| 0.11| 0.19| 0.18| 0.83|
+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
2. Get top 5 for each row
Follow the steps below on df:
Create an array and sort the elements
Get the first 5 elements into a new column called all
UDF to get the max 5 elements from the sorted array:
Note: Spark >= 2.4.0 has a slice function that can do a similar task. I am currently using 2.2, so I am creating a UDF, but if you have 2.4 or higher you can try slice instead.
import pyspark.sql.types as t
def get_n_elements_(arr, n):
    # Return the first n elements of the (already sorted) array
    return arr[:n]
get_n_elements = f.udf(get_n_elements_, t.ArrayType(t.DoubleType()))
df_all = df.withColumn('all', get_n_elements(f.sort_array(f.array(df.columns), False), f.lit(5)))
df_all.show()
+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+------------------------------+
|col_1|col_2|col_3|col_4|col_5|col_6|col_7|col_8|col_9|col_10|all |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+------------------------------+
|0.74 |0.9 |0.52 |0.85 |0.18 |0.23 |0.3 |0.0 |0.1 |0.07 |[0.9, 0.85, 0.74, 0.52, 0.3] |
|0.11 |0.57 |0.81 |0.81 |0.45 |0.48 |0.86 |0.38 |0.41 |0.45 |[0.86, 0.81, 0.81, 0.57, 0.48]|
|0.03 |0.84 |0.17 |0.96 |0.09 |0.73 |0.25 |0.05 |0.57 |0.66 |[0.96, 0.84, 0.73, 0.66, 0.57]|
|0.8 |0.94 |0.06 |0.44 |0.2 |0.89 |0.9 |1.0 |0.48 |0.14 |[1.0, 0.94, 0.9, 0.89, 0.8] |
|0.73 |0.86 |0.68 |1.0 |0.78 |0.17 |0.11 |0.19 |0.18 |0.83 |[1.0, 0.86, 0.83, 0.78, 0.73] |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+------------------------------+
3. Create dynamic sql and execute with selectExpr
sql_stmt = ''' case when array_contains(all, {0}) then 1 else 0 end AS `{0}` '''
df_all.selectExpr(*[sql_stmt.format(c) for c in df.columns]).show()
+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
|col_1|col_2|col_3|col_4|col_5|col_6|col_7|col_8|col_9|col_10|
+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| 1| 1| 1| 1| 0| 0| 1| 0| 0| 0|
| 0| 1| 1| 1| 0| 1| 1| 0| 0| 0|
| 0| 1| 0| 1| 0| 1| 0| 0| 1| 1|
| 1| 1| 0| 0| 0| 1| 1| 1| 0| 0|
| 1| 1| 0| 1| 1| 0| 0| 0| 0| 1|
+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
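To make the per-row logic easier to check, here is a plain-Python sketch of the same idea (not Spark code; top_n_flags is a name made up for this example). It sorts the row in descending order, keeps the first n values, and flags every value found among them, which mirrors sort_array, the first-n step and array_contains above. It also shows one consequence of the membership test: values tied with the cut-off are all flagged.

```python
def top_n_flags(values, n=5):
    """Return 1 for every value that appears among the n largest, else 0."""
    top = sorted(values, reverse=True)[:n]          # same idea as sort_array + first n
    return [1 if v in top else 0 for v in values]   # same idea as array_contains

# Second row of the sample data above
row = [0.11, 0.57, 0.81, 0.81, 0.45, 0.48, 0.86, 0.38, 0.41, 0.45]
flags = top_n_flags(row)
print(flags)

# A value tied with the cut-off is flagged wherever it occurs,
# so a row can end up with more than n ones.
tied = top_n_flags([3, 3, 3, 1], n=2)
print(tied)
```

If exactly n ones per row are required, the ties would need an explicit rule, for example ranking by column position as a tie-breaker.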
You can do that easily like this.
For example, to do this for a single column named value: sort by value in descending order, take the 5th value as the threshold, and set the values using a when condition.
from pyspark.sql.functions import when
df2 = sc.parallelize([("fo", 100,20),("rogerg", 110,56),("franre", 1080,297),("f11", 10100,217),("franci", 10,227),("fran", 1002,5),("fran231cis", 10007,271),("franc3is", 1030,2)]).toDF(["name", "salary","value"])
df2 = df2.orderBy("value",ascending=False)
+----------+------+-----+
| name|salary|value|
+----------+------+-----+
| franre| 1080| 297|
|fran231cis| 10007| 271|
| franci| 10| 227|
| f11| 10100| 217|
| rogerg| 110| 56|
| fo| 100| 20|
| fran| 1002| 5|
| franc3is| 1030| 2|
+----------+------+-----+
maxx = df2.take(5)[4]["value"]
dff = df2.select(when(df2['value'] >= maxx, 1).otherwise(0).alias("value"),"name", "salary")
+-----+----------+------+
|value| name|salary|
+-----+----------+------+
| 1| franre| 1080|
| 1|fran231cis| 10007|
| 1| franci| 10|
| 1| f11| 10100|
| 1| rogerg| 110|
| 0| fo| 100|
| 0| fran| 1002|
| 0| franc3is| 1030|
+-----+----------+------+

Spark: Dataframe pipe delimited doesn't return correct values

I have data frame as below:
scala> products_df.show(5)
+--------------------+
| value|
+--------------------+
|1009|45|Diamond F...|
|1010|46|DBX Vecto...|
|1011|46|Old Town ...|
|1012|46|Pelican T...|
|1013|46|Perceptio...|
+--------------------+
I need to split each line into columns.
I use the query below, which works with every other delimiter but not here:
products_df.selectExpr(("cast((split(value,'|'))[0] as int) as product_id"),("cast((split(value,'|'))[1] as int) as product_category_id"),("cast((split(value,'|'))[2] as string) as product_name"),("cast((split(value,'|'))[3] as string) as description"), ("cast((split(value,'|'))[4] as float) as product_price") ,("cast((split(value,'|'))[5] as string) as product_image")).show
It returns -
product_id|product_category_id|product_name|description|product_price|product_image|
+----------+-------------------+------------+-----------+-------------+-------------+
| 1| 0| 0| 9| null| 4|
| 1| 0| 1| 0| null| 4|
| 1| 0| 1| 1| null| 4|
| 1| 0| 1| 2| null| 4|
| 1| 0| 1| 3| null| 4|
| 1| 0| 1| 4| null| 4|
| 1| 0| 1| 5| null| 4|
It works fine when the file is delimited by a comma (,) or a colon (:).
Only with a pipe (|) does it return the values above, whereas it should be:
product_id|product_category_id| product_name|description|product_price| product_image|
+----------+-------------------+--------------------+-----------+-------------+--------------------+
| 1009| 45|Quest Q64 10 FT. ...| | 59.98|http://images.acm...|
| 1010| 46|Under Armour Men'...| | 129.99|http://images.acm...|
| 1011| 47|Under Armour Men'...| | 89.99|http://images.acm...|
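Thanks for the suggestions, guys.
The second argument of split is a regular expression, and an unescaped | is the alternation operator, so the pipe has to be escaped. In the selectExpr query above it was not.
One way that works is to use withColumn with an escaped pattern: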
val products_df = spark.read.textFile("/user/code/products")
  .withColumn("product_id", split($"value", "\\|")(0).cast("int"))
  .withColumn("product_cat_id", split($"value", "\\|")(1).cast("int"))
  .withColumn("product_name", split($"value", "\\|")(2).cast("string"))
  .withColumn("product_description", split($"value", "\\|")(3).cast("string"))
  .withColumn("product_price", split($"value", "\\|")(4).cast("float"))
  .withColumn("product_image", split($"value", "\\|")(5).cast("string"))
  .select("product_id", "product_cat_id", "product_name", "product_description", "product_price", "product_image")
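To see the effect of escaping outside Spark, Python's re module can serve as a quick local check, since it treats | as alternation in the same way. This is plain Python, not Spark code, and the sample line is made up from the table above.

```python
import re

line = "1009|45|Diamond F"

# Escaping the pipe makes the regex engine treat it as a literal character.
fields = re.split(r"\|", line)
print(fields)

# A character class is an equivalent way to match a literal pipe.
fields_cc = re.split(r"[|]", line)

# re.escape builds the escaped pattern for you.
pattern = re.escape("|")
print(pattern)
```

Inside a selectExpr string the backslash also passes through the SQL parser, so the pattern may need extra escaping there; a character class such as '[|]' avoids the backslash altogether.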
With Spark 2.4.3, here is a neater version:
scala> var df =Seq(("1009|45|Diamond F"),("1010|46|DBX Vecto")).toDF("value")
scala> df.show
+-----------------+
| value|
+-----------------+
|1009|45|Diamond F|
|1010|46|DBX Vecto|
+-----------------+
val splitedViewsDF = df.withColumn("product_id", split($"value", "\\|").getItem(0)).withColumn("product_cat_id", split($"value", "\\|").getItem(1)).withColumn("product_name", split($"value", "\\|").getItem(2)).drop($"value")
scala> splitedViewsDF.show
+----------+--------------+------------+
|product_id|product_cat_id|product_name|
+----------+--------------+------------+
| 1009| 45| Diamond F|
| 1010| 46| DBX Vecto|
+----------+--------------+------------+
Here each field is extracted with getItem. Happy Hadoop

Need to perform multi-column join on a dataframe with a look-up dataframe

I have two dataframes like so
+---+---+---+---+---+
| c1| c2| c3| c4| c5|
+---+---+---+---+---+
| 0| 1| 2| 3| 4|
| 5| 6| 7| 8| 9|
+---+---+---+---+---+
+---+---+
|key|val|
+---+---+
| 0| A|
| 1| B|
| 2| C|
| 3| D|
| 4| E|
| 5| F|
| 6| G|
| 7| H|
| 8| I|
| 9| J|
+---+---+
I want to lookup each column on df1 with the equivalent key in df2 and return the lookup val from df2 for each.
Here is the code to produce the two input dataframes
df1 = sc.parallelize([('0','1','2','3','4',), ('5','6','7','8','9',)]).toDF(['c1','c2','c3','c4','c5'])
df1.show()
df2 = sc.parallelize([('0','A',), ('1','B', ),('2','C', ),('3','D', ),('4','E',),\
('5','F',), ('6','G', ),('7','H', ),('8','I', ),('9','J',)]).toDF(['key','val'])
df2.show()
I want to join the above to produce the following
+---+---+---+---+---+---+---+---+---+---+
| c1| c2| c3| c4| c5|lu1|lu2|lu3|lu4|lu5|
+---+---+---+---+---+---+---+---+---+---+
| 0| 1| 2| 3| 4| A| B| C| D| E|
| 5| 6| 7| 8| 9| F| G| H| I| J|
+---+---+---+---+---+---+---+---+---+---+
I can get it to work for a single column like so but I'm not sure how to extend it to all columns
df1.join(df2, df1.c1 == df2.key).select('c1','val').show()
+---+---+
| c1|val|
+---+---+
| 0| A|
| 5| F|
+---+---+
You can just chain the join:
df1 \
    .join(df2, on=df1.c1 == df2.key, how='left') \
    .withColumnRenamed('val', 'lu1') \
    .join(df2, on=df1.c2 == df2.key, how='left') \
    .withColumnRenamed('val', 'lu2')
# ...and so on for c3 to c5
You can even do it in a loop, but don't do it with too many columns:
from pyspark.sql import functions as f
df = df1
for i in range(1, 6):
    df = df \
        .join(df2.alias(str(i)), on=f.col('c{}'.format(i)) == f.col("{}.key".format(i)), how='left') \
        .withColumnRenamed('val', 'lu{}'.format(i))
df \
    .select('c1', 'c2', 'c3', 'c4', 'c5', 'lu1', 'lu2', 'lu3', 'lu4', 'lu5') \
    .show()
output
+---+---+---+---+---+---+---+---+---+---+
| c1| c2| c3| c4| c5|lu1|lu2|lu3|lu4|lu5|
+---+---+---+---+---+---+---+---+---+---+
| 5| 6| 7| 8| 9| F| G| H| I| J|
| 0| 1| 2| 3| 4| A| B| C| D| E|
+---+---+---+---+---+---+---+---+---+---+
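When the look-up table is small, another way to think about the task is as a dictionary look-up applied to every column, which avoids one join per column. The sketch below shows that logic in plain Python on the sample data (it is not Spark code, and add_lookups is a name made up for this example). In Spark the same idea would correspond to collecting or broadcasting the small table rather than joining it repeatedly.

```python
lookup = dict(zip("0123456789", "ABCDEFGHIJ"))  # key -> val, as in df2

rows = [('0', '1', '2', '3', '4'), ('5', '6', '7', '8', '9')]  # as in df1

def add_lookups(row, table):
    """Append the looked-up value for each column; None if the key is missing."""
    return row + tuple(table.get(key) for key in row)

result = [add_lookups(row, lookup) for row in rows]
for r in result:
    print(r)
```

Returning None for a missing key plays the same role as the null produced by the left joins above.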