How to add a new column to a Spark DataFrame?

Currently I have a dataframe like the one below:
+---+
| id|
+---+
| 0|
| 1|
+---+
and I want to add a new column called product_id.
+-----------+
| product_id|
+-----------+
| A|
| B|
| C|
+-----------+
For each id in the dataframe, I want to add all product_id:
+---+----------+
| id|product_id|
+---+----------+
| 0| A|
| 0| B|
| 0| C|
| 1| A|
| 1| B|
| 1| C|
+---+----------+
Is there a way to do this?

First, generate a sample dataframe:
df = spark.range(2)
df.show()
+---+
| id|
+---+
| 0|
| 1|
+---+
Option 1: stack
stack_df = df.selectExpr("*","stack(3,'A','B','C') as product_id")
stack_df.show()
+---+----------+
| id|product_id|
+---+----------+
| 0| A|
| 0| B|
| 0| C|
| 1| A|
| 1| B|
| 1| C|
+---+----------+
Option 2: explode
explode_df = df.selectExpr("*","explode(array('A','B','C')) as product_id")
explode_df.show()
+---+----------+
| id|product_id|
+---+----------+
| 0| A|
| 0| B|
| 0| C|
| 1| A|
| 1| B|
| 1| C|
+---+----------+

Related

Pyspark crossJoin with specific condition

A crossJoin of two dataframes with 5 rows each gives a dataframe of 25 rows (5*5).
What I want is a crossJoin that is "not full".
For example:
df1: df2:
+-----+ +-----+
|index| |value|
+-----+ +-----+
| 0| | A|
| 1| | B|
| 2| | C|
| 3| | D|
| 4| | E|
+-----+ +-----+
The result should be a dataframe with fewer than 25 rows, where each row in index is joined with a randomly chosen number of rows from value.
It would look something like this:
+-----+-----+
|index|value|
+-----+-----+
| 0| D|
| 0| A|
| 1| A|
| 1| D|
| 1| B|
| 1| C|
| 2| A|
| 2| E|
| 3| D|
| 4| A|
| 4| B|
| 4| E|
+-----+-----+
Thank you
You can use sample(withReplacement, fraction, seed=None) to keep only a fraction of the rows after the cross join.
Example:
spark.sql("set spark.sql.crossJoin.enabled=true")
df1.join(df2).sample(False,0.6).show()

Map Spark DF to (row_number, column_number, value) format

I have a Dataframe in the following shape
1 2
5 9
How can I convert it to (row_num, col_num, value) format, as shown below?
0 0 1
0 1 2
1 0 5
1 1 9
Is there any way to apply some function or any mapper?
Thanks in advance
Check the code below.
scala> import org.apache.spark.sql.expressions._
import org.apache.spark.sql.expressions._
scala> val colExpr = array(df.columns.zipWithIndex.map(c => struct(lit(c._2).as("col_name"),col(c._1).as("value"))):_*)
colExpr: org.apache.spark.sql.Column = array(named_struct(col_name, 0 AS `col_name`, NamePlaceholder(), a AS `value`), named_struct(col_name, 1 AS `col_name`, NamePlaceholder(), b AS `value`))
scala> df.withColumn("row_number",lit(row_number().over(Window.orderBy(lit(1)))-1)).withColumn("data",explode(colExpr)).select($"row_number",$"data.*").show(false)
+----------+--------+-----+
|row_number|col_name|value|
+----------+--------+-----+
|0 |0 |1 |
|0 |1 |2 |
|1 |0 |5 |
|1 |1 |9 |
+----------+--------+-----+
You can do it by transposing the data as:
from pyspark.sql.functions import *
from pyspark.sql import Window
df = spark.createDataFrame([(1,2),(5,9)],['col1','col2'])
#renaming the columns based on their position
df = df.toDF(*list(map(lambda x: str(x),[*range(len(df.columns))])))
#Transposing the dataframe as required
col_list = ','.join([f'{i},`{i}`'for i in df.columns])
rows = len(df.columns)
df.withColumn('row_id',lit(row_number().over(Window.orderBy(lit(1)))-1)).select('row_id',
expr(f'''stack({rows},{col_list}) as (col_id,col_value)''')).show()
+------+------+---------+
|row_id|col_id|col_value|
+------+------+---------+
| 0| 0| 1|
| 0| 1| 2|
| 1| 0| 5|
| 1| 1| 9|
+------+------+---------+
In pyspark, row_number() and posexplode will be helpful. Try this:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
tst= sqlContext.createDataFrame([(1,7,80),(1,8,40),(1,5,100),(5,8,90),(7,6,50),(0,3,60)],schema=['col1','col2','col3'])
tst1= tst.withColumn("row_number",F.row_number().over(Window.orderBy(F.lit(1)))-1)
#%%
tst_arr = tst1.withColumn("arr",F.array(tst.columns))
tst_new = tst_arr.select('row_number','arr').select('row_number',F.posexplode('arr'))
results:
In [47]: tst_new.show()
+----------+---+---+
|row_number|pos|col|
+----------+---+---+
| 0| 0| 1|
| 0| 1| 7|
| 0| 2| 80|
| 1| 0| 1|
| 1| 1| 8|
| 1| 2| 40|
| 2| 0| 1|
| 2| 1| 5|
| 2| 2|100|
| 3| 0| 5|
| 3| 1| 8|
| 3| 2| 90|
| 4| 0| 7|
| 4| 1| 6|
| 4| 2| 50|
| 5| 0| 0|
| 5| 1| 3|
| 5| 2| 60|
+----------+---+---+

spark sql spark.range(7).select('*,'id % 3 as "bucket").show // how to understand ('*,'id % 3 as "bucket")

spark.range(7).select('*,'id % 3 as "bucket").show
// result:
+---+------+
| id|bucket|
+---+------+
| 0| 0|
| 1| 1|
| 2| 2|
| 3| 0|
| 4| 1|
| 5| 2|
| 6| 0|
+---+------+
spark.range(7).withColumn("bucket",$"id" % 3).show
// result:
+---+------+
| id|bucket|
+---+------+
| 0| 0|
| 1| 1|
| 2| 2|
| 3| 0|
| 4| 1|
| 5| 2|
| 6| 0|
+---+------+
I want to understand what '* means here, and how to read the whole select statement.
Are the two approaches above equivalent?
spark.range(7).select('*,'id % 3 as "bucket").show
spark.range(7).select($"*",$"id" % 3 as "bucket").show
spark.range(7).select(col("*"),col("id") % 3 as "bucket").show
val df = spark.range(7)
df.select(df("*"),df("id") % 3 as "bucket").show
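These four forms are equivalent: with spark.implicits._ in scope, the symbol '* and $"*" are both converted to a Column, just like col("*") and df("*"), and * selects every existing column. See the Column API documentation: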
// https://spark.apache.org/docs/2.4.4/api/scala/index.html#org.apache.spark.sql.Column

Spark: Dataframe pipe delimited doesn't return correct values

I have data frame as below:
scala> products_df.show(5)
+--------------------+
| value|
+--------------------+
|1009|45|Diamond F...|
|1010|46|DBX Vecto...|
|1011|46|Old Town ...|
|1012|46|Pelican T...|
|1013|46|Perceptio...|
+--------------------+
I need to split each row into separate columns.
I use the query below, which works with every other delimiter but not with this one:
products_df.selectExpr(("cast((split(value,'|'))[0] as int) as product_id"),("cast((split(value,'|'))[1] as int) as product_category_id"),("cast((split(value,'|'))[2] as string) as product_name"),("cast((split(value,'|'))[3] as string) as description"), ("cast((split(value,'|'))[4] as float) as product_price") ,("cast((split(value,'|'))[5] as string) as product_image")).show
It returns:
product_id|product_category_id|product_name|description|product_price|product_image|
+----------+-------------------+------------+-----------+-------------+-------------+
| 1| 0| 0| 9| null| 4|
| 1| 0| 1| 0| null| 4|
| 1| 0| 1| 1| null| 4|
| 1| 0| 1| 2| null| 4|
| 1| 0| 1| 3| null| 4|
| 1| 0| 1| 4| null| 4|
| 1| 0| 1| 5| null| 4|
It works fine when the file is delimited by a comma (,) or a colon (:).
Only with a pipe (|) does it return the values above, whereas it should return:
product_id|product_category_id| product_name|description|product_price| product_image|
+----------+-------------------+--------------------+-----------+-------------+--------------------+
| 1009| 45|Quest Q64 10 FT. ...| | 59.98|http://images.acm...|
| 1010| 46|Under Armour Men'...| | 129.99|http://images.acm...|
| 1011| 47|Under Armour Men'...| | 89.99|http://images.acm...|
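Thanks for the suggestions, guys.
It seems selectExpr doesn't work as written when the file is delimited by a pipe (|): split treats its pattern as a regular expression, and an unescaped | means alternation, so the string is split between every character.
An alternative is to use withColumn with an escaped pipe: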
val products_df=spark.read.textFile("/user/code/products").withColumn("product_id",split($"value","\\|")(0).cast("int")).withColumn("product_cat_id",split($"value","\\|")(1).cast("int")).withColumn("product_name",split($"value","\\|")(2).cast("string")).withColumn("product_description",split($"value","\\|")(3).cast("string")).withColumn("product_price",split($"value","\\|")(4).cast("float")).withColumn("product_image",split($"value","\\|")(5).cast("string")).select("product_id","product_cat_id","product_name","product_description","product_price","product_image")
With Spark 2.4.3, here is a cleaner version of the code:
scala> var df =Seq(("1009|45|Diamond F"),("1010|46|DBX Vecto")).toDF("value")
scala> df.show
+-----------------+
| value|
+-----------------+
|1009|45|Diamond F|
|1010|46|DBX Vecto|
+-----------------+
val splitedViewsDF = df.withColumn("product_id", split($"value", "\\|").getItem(0)).withColumn("product_cat_id", split($"value", "\\|").getItem(1)).withColumn("product_name", split($"value", "\\|").getItem(2)).drop($"value")
scala> splitedViewsDF.show
+----------+--------------+------------+
|product_id|product_cat_id|product_name|
+----------+--------------+------------+
| 1009| 45| Diamond F|
| 1010| 46| DBX Vecto|
+----------+--------------+------------+
Here you can get each field by using getItem. Happy Hadoop!

Pyspark: Add new Column contain a value in a column counterpart another value in another column that meets a specified condition

I want to add a new column that holds the col2 value corresponding to col3 = 1, repeated for every row in the same col1 group.
For instance,
original DF as follows:
+-----+-----+-----+
|col1 |col2 |col3 |
+-----+-----+-----+
| A| 17| 1|
| A| 16| 2|
| A| 18| 2|
| A| 30| 3|
| B| 35| 1|
| B| 34| 2|
| B| 36| 2|
| C| 20| 1|
| C| 30| 1|
| C| 43| 1|
+-----+-----+-----+
For each group in col1, I need to repeat the col2 value that corresponds to col3 = 1. If a group has more than one row with col3 = 1, the minimum of those col2 values should be repeated.
The desired DataFrame is as follows:
+----+----+----+----------+
|col1|col2|col3|new_column|
+----+----+----+----------+
| A| 17| 1| 17|
| A| 16| 2| 17|
| A| 18| 2| 17|
| A| 30| 3| 17|
| B| 35| 1| 35|
| B| 34| 2| 35|
| B| 36| 2| 35|
| C| 20| 1| 20|
| C| 30| 1| 20|
| C| 43| 1| 20|
+----+----+----+----------+
df3=df.filter(df.col3==1)
+----+----+----+
|col1|col2|col3|
+----+----+----+
| B| 35| 1|
| C| 20| 1|
| C| 30| 1|
| C| 43| 1|
| A| 17| 1|
+----+----+----+
df3.createOrReplaceTempView("mytable")
To obtain the minimum value of col2, I followed the accepted answer to "How to find exact median for grouped data in Spark":
df6=spark.sql("select col1, min(col2) as minimum from mytable group by col1 order by col1")
df6.show()
+----+-------+
|col1|minimum|
+----+-------+
| A| 17|
| B| 35|
| C| 20|
+----+-------+
df_a=df.join(df6,['col1'],'leftouter')
+----+----+----+-------+
|col1|col2|col3|minimum|
+----+----+----+-------+
| B| 35| 1| 35|
| B| 34| 2| 35|
| B| 36| 2| 35|
| C| 20| 1| 20|
| C| 30| 1| 20|
| C| 43| 1| 20|
| A| 17| 1| 17|
| A| 16| 2| 17|
| A| 18| 2| 17|
| A| 30| 3| 17|
+----+----+----+-------+
Is there a better way than this solution?