PySpark: How to find and convert top 5 row values to 1 and the rest to 0? - apache-spark-sql

I have a dataframe and I need to find the maximum 5 values in each row, convert only those values to 1 and all the rest to 0, while maintaining the dataframe structure, i.e. the column names should remain the same.
I tried using toLocalIterator, converting each row to a list, and then setting the top 5 values in that list to 1.
But this gives me a java.lang.OutOfMemoryError when I run the code on a large dataset.
Looking at the logs, I found that a task of very large size (around 25000 KB) is submitted, while the maximum recommended size is 100 KB.
Is there a better way to find and convert the top 5 values to a certain value (1 in this case) and all the rest to 0, one that would use less memory?
EDIT 1:
For example, if I have this input with 10 columns and 5 rows:
+----+----+----+----+----+----+----+----+----+----+
| 1| 2| 3| 4| 5| 6| 7| 8| 9| 10|
+----+----+----+----+----+----+----+----+----+----+
|0.74| 0.9|0.52|0.85|0.18|0.23| 0.3| 0.0| 0.1|0.07|
|0.11|0.57|0.81|0.81|0.45|0.48|0.86|0.38|0.41|0.45|
|0.03|0.84|0.17|0.96|0.09|0.73|0.25|0.05|0.57|0.66|
| 0.8|0.94|0.06|0.44| 0.2|0.89| 0.9| 1.0|0.48|0.14|
|0.73|0.86|0.68| 1.0|0.78|0.17|0.11|0.19|0.18|0.83|
+----+----+----+----+----+----+----+----+----+----+
This is what I want as the output:
+---+---+---+---+---+---+---+---+---+---+
| 1| 2| 3| 4| 5| 6| 7| 8| 9| 10|
+---+---+---+---+---+---+---+---+---+---+
| 1| 1| 1| 1| 0| 0| 1| 0| 0| 0|
| 0| 1| 1| 1| 0| 1| 1| 0| 0| 0|
| 0| 1| 0| 1| 0| 1| 0| 0| 1| 1|
| 1| 1| 0| 0| 0| 1| 1| 1| 0| 0|
| 1| 1| 0| 1| 1| 0| 0| 0| 0| 1|
+---+---+---+---+---+---+---+---+---+---+
As you can see, I want to find the top (max) 5 values in each row, convert them to 1 and the rest of the values to 0, while maintaining the structure, i.e. the rows and columns.
This is what I am using (which gives me the OutOfMemoryError):
for row in prob_df.rdd.toLocalIterator():
    rowPredDict = {}
    for cat in categories:
        rowPredDict[cat] = row[cat]
    sorted_row = sorted(rowPredDict.items(), key=lambda kv: kv[1], reverse=True)
    rowPredDict = rowPredDict.fromkeys(rowPredDict, 0)
    for cat, _ in sorted_row[:5]:
        rowPredDict[cat] = 1
    rowPredList.append(rowPredDict)

I don't have enough data volume to performance-test this, but could you try the approach below, which uses Spark's array functions?
1. Prepare Dataset:
import pyspark.sql.functions as f
l1 = [(0.74,0.9,0.52,0.85,0.18,0.23,0.3,0.0,0.1,0.07),
(0.11,0.57,0.81,0.81,0.45,0.48,0.86,0.38,0.41,0.45),
(0.03,0.84,0.17,0.96,0.09,0.73,0.25,0.05,0.57,0.66),
(0.8,0.94,0.06,0.44,0.2,0.89,0.9,1.0,0.48,0.14),
(0.73,0.86,0.68,1.0,0.78,0.17,0.11,0.19,0.18,0.83)]
df = spark.createDataFrame(l1).toDF('col_1','col_2','col_3','col_4','col_5','col_6','col_7','col_8','col_9','col_10')
df.show()
+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
|col_1|col_2|col_3|col_4|col_5|col_6|col_7|col_8|col_9|col_10|
+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| 0.74| 0.9| 0.52| 0.85| 0.18| 0.23| 0.3| 0.0| 0.1| 0.07|
| 0.11| 0.57| 0.81| 0.81| 0.45| 0.48| 0.86| 0.38| 0.41| 0.45|
| 0.03| 0.84| 0.17| 0.96| 0.09| 0.73| 0.25| 0.05| 0.57| 0.66|
| 0.8| 0.94| 0.06| 0.44| 0.2| 0.89| 0.9| 1.0| 0.48| 0.14|
| 0.73| 0.86| 0.68| 1.0| 0.78| 0.17| 0.11| 0.19| 0.18| 0.83|
+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
2. Get the top 5 values for each row
Following the steps below on df:
Create an array of all the column values and sort its elements in descending order
Get the first 5 elements into a new column called all
UDF to get the max 5 elements from the sorted array:
Note: Spark >= 2.4.0 has a slice function which can do a similar task. I am currently using 2.2, so I am creating a UDF, but if you have version 2.4 or higher you can try slice.
import pyspark.sql.types as t

def get_n_elements_(arr, n):
    return arr[:n]

get_n_elements = f.udf(get_n_elements_, t.ArrayType(t.DoubleType()))
df_all = df.withColumn('all', get_n_elements(f.sort_array(f.array(df.columns), False), f.lit(5)))
df_all.show()
+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+------------------------------+
|col_1|col_2|col_3|col_4|col_5|col_6|col_7|col_8|col_9|col_10|all |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+------------------------------+
|0.74 |0.9 |0.52 |0.85 |0.18 |0.23 |0.3 |0.0 |0.1 |0.07 |[0.9, 0.85, 0.74, 0.52, 0.3] |
|0.11 |0.57 |0.81 |0.81 |0.45 |0.48 |0.86 |0.38 |0.41 |0.45 |[0.86, 0.81, 0.81, 0.57, 0.48]|
|0.03 |0.84 |0.17 |0.96 |0.09 |0.73 |0.25 |0.05 |0.57 |0.66 |[0.96, 0.84, 0.73, 0.66, 0.57]|
|0.8 |0.94 |0.06 |0.44 |0.2 |0.89 |0.9 |1.0 |0.48 |0.14 |[1.0, 0.94, 0.9, 0.89, 0.8] |
|0.73 |0.86 |0.68 |1.0 |0.78 |0.17 |0.11 |0.19 |0.18 |0.83 |[1.0, 0.86, 0.83, 0.78, 0.73] |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+------------------------------+
3. Create dynamic SQL and execute with selectExpr
sql_stmt = ''' case when array_contains(all, {0}) then 1 else 0 end AS `{0}` '''
df_all.selectExpr(*[sql_stmt.format(c) for c in df.columns]).show()
+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
|col_1|col_2|col_3|col_4|col_5|col_6|col_7|col_8|col_9|col_10|
+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| 1| 1| 1| 1| 0| 0| 1| 0| 0| 0|
| 0| 1| 1| 1| 0| 1| 1| 0| 0| 0|
| 0| 1| 0| 1| 0| 1| 0| 0| 1| 1|
| 1| 1| 0| 0| 0| 1| 1| 1| 0| 0|
| 1| 1| 0| 1| 1| 0| 0| 0| 0| 1|
+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+

You can do that easily like this.
For example, to do this for the value column: first sort by the value column, take the 5th value as a threshold, and change the values using a when condition.
from pyspark.sql.functions import when

df2 = sc.parallelize([("fo", 100, 20), ("rogerg", 110, 56), ("franre", 1080, 297), ("f11", 10100, 217), ("franci", 10, 227), ("fran", 1002, 5), ("fran231cis", 10007, 271), ("franc3is", 1030, 2)]).toDF(["name", "salary", "value"])
df2 = df2.orderBy("value", ascending=False)
df2.show()
+----------+------+-----+
| name|salary|value|
+----------+------+-----+
| franre| 1080| 297|
|fran231cis| 10007| 271|
| franci| 10| 227|
| f11| 10100| 217|
| rogerg| 110| 56|
| fo| 100| 20|
| fran| 1002| 5|
| franc3is| 1030| 2|
+----------+------+-----+
maxx = df2.take(5)[4]["value"]  # the 5th-largest value, used as the threshold
dff = df2.select(when(df2['value'] >= maxx, 1).otherwise(0).alias("value"), "name", "salary")
dff.show()
+-----+----------+------+
|value|      name|salary|
+-----+----------+------+
|    1|    franre|  1080|
|    1|fran231cis| 10007|
|    1|    franci|    10|
|    1|       f11| 10100|
|    1|    rogerg|   110|
|    0|        fo|   100|
|    0|      fran|  1002|
|    0|  franc3is|  1030|
+-----+----------+------+

Related

How to get dummies for each elements having multiple dummy variables in pyspark dataframe

Here is my original dataframe:
+---+---+---+
| ID| P1| P2|
+---+---+---+
| 0|447| O1|
| 0|448| O2|
| 1|447| O2|
| 1|450| O3|
| 2|450| O3|
| 3|451| O4|
| 3|452| O5|
+---+---+---+
What I want is a dataframe like this:
+---+------+------+------+------+------+-----+-----+-----+-----+-----+
| ID|P1_447|P1_448|P1_450|P1_451|P1_452|P2_O1|P2_O2|P2_O3|P2_O4|P2_O5|
+---+------+------+------+------+------+-----+-----+-----+-----+-----+
| 0| 1| 1| 0| 0| 0| 1| 1| 0| 0| 0|
| 1| 1| 0| 1| 0| 0| 0| 1| 1| 0| 0|
| 2| 0| 0| 1| 0| 0| 0| 0| 1| 0| 0|
| 3| 0| 0| 0| 1| 1| 0| 0| 0| 1| 1|
+---+------+------+------+------+------+-----+-----+-----+-----+-----+
I tried
df.groupby('ID').any().astype(int)
but it didn't work.
Thank you!!
import pyspark.sql.functions as F

df = spark.createDataFrame(
    [('0', '447', 'O1'),
     ('0', '448', 'O2'),
     ('1', '447', 'O2'),
     ('1', '450', 'O3'),
     ('2', '450', 'O3'),
     ('3', '451', 'O4'),
     ('3', '452', 'O5')],
    ['ID', 'P1', 'P2']
)

for c in df.columns:
    if c != 'ID':
        df = df\
            .withColumn(c, F.concat(F.lit(c), F.lit('_'), F.col(c)))\
            .groupBy(*[x for x in df.columns if x != c]).pivot(c).agg(F.lit(1))

df = df\
    .groupBy('ID').agg(*[F.sum(x).alias(x) for x in df.columns if x != 'ID'])\
    .fillna(0)\
    .orderBy('ID')
df.show()
# +---+------+------+------+------+------+-----+-----+-----+-----+-----+
# | ID|P1_447|P1_448|P1_450|P1_451|P1_452|P2_O1|P2_O2|P2_O3|P2_O4|P2_O5|
# +---+------+------+------+------+------+-----+-----+-----+-----+-----+
# | 0| 1| 1| 0| 0| 0| 1| 1| 0| 0| 0|
# | 1| 1| 0| 1| 0| 0| 0| 1| 1| 0| 0|
# | 2| 0| 0| 1| 0| 0| 0| 0| 1| 0| 0|
# | 3| 0| 0| 0| 1| 1| 0| 0| 0| 1| 1|
# +---+------+------+------+------+------+-----+-----+-----+-----+-----+

Map Spark DF to (row_number, column_number, value) format

I have a Dataframe in the following shape
1 2
5 9
How can I convert it to (row_num, col_num, value) format?
0 0 1
0 1 2
1 0 5
1 1 9
Is there any way to apply some function or mapper?
Thanks in advance
Check the code below.
scala> import org.apache.spark.sql.expressions._
import org.apache.spark.sql.expressions._
scala> val colExpr = array(df.columns.zipWithIndex.map(c => struct(lit(c._2).as("col_name"),col(c._1).as("value"))):_*)
colExpr: org.apache.spark.sql.Column = array(named_struct(col_name, 0 AS `col_name`, NamePlaceholder(), a AS `value`), named_struct(col_name, 1 AS `col_name`, NamePlaceholder(), b AS `value`))
scala> df.withColumn("row_number",lit(row_number().over(Window.orderBy(lit(1)))-1)).withColumn("data",explode(colExpr)).select($"row_number",$"data.*").show(false)
+----------+--------+-----+
|row_number|col_name|value|
+----------+--------+-----+
|0 |0 |1 |
|0 |1 |2 |
|1 |0 |5 |
|1 |1 |9 |
+----------+--------+-----+
You can do it by transposing the data as:
from pyspark.sql.functions import *
from pyspark.sql import Window

df = spark.createDataFrame([(1, 2), (5, 9)], ['col1', 'col2'])

# Renaming the columns based on their position
df = df.toDF(*[str(x) for x in range(len(df.columns))])

# Transposing the dataframe as required
col_list = ','.join([f'{i},`{i}`' for i in df.columns])
rows = len(df.columns)
df.withColumn('row_id', lit(row_number().over(Window.orderBy(lit(1))) - 1))\
  .select('row_id', expr(f'''stack({rows},{col_list}) as (col_id,col_value)'''))\
  .show()
+------+------+---------+
|row_id|col_id|col_value|
+------+------+---------+
| 0| 0| 1|
| 0| 1| 2|
| 1| 0| 5|
| 1| 1| 9|
+------+------+---------+
In PySpark, row_number() and posexplode() will be helpful. Try this:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

tst = sqlContext.createDataFrame([(1,7,80),(1,8,40),(1,5,100),(5,8,90),(7,6,50),(0,3,60)], schema=['col1','col2','col3'])
tst1 = tst.withColumn("row_number", F.row_number().over(Window.orderBy(F.lit(1))) - 1)
tst_arr = tst1.withColumn("arr", F.array(tst.columns))
tst_new = tst_arr.select('row_number', 'arr').select('row_number', F.posexplode('arr'))
results:
In [47]: tst_new.show()
+----------+---+---+
|row_number|pos|col|
+----------+---+---+
| 0| 0| 1|
| 0| 1| 7|
| 0| 2| 80|
| 1| 0| 1|
| 1| 1| 8|
| 1| 2| 40|
| 2| 0| 1|
| 2| 1| 5|
| 2| 2|100|
| 3| 0| 5|
| 3| 1| 8|
| 3| 2| 90|
| 4| 0| 7|
| 4| 1| 6|
| 4| 2| 50|
| 5| 0| 0|
| 5| 1| 3|
| 5| 2| 60|
+----------+---+---+

How to get the last row value when flag is 0, and the current row value in a new column when flag is 1, in a pyspark dataframe

Scenario 1, when Flag is 1:
For the row where Flag is 1, copy trx_date to destination.
Scenario 2, when Flag is 0:
For the row where Flag is 0, copy the previous destination value.
Input :
+-----------+----+----------+
|customer_id|Flag| trx_date|
+-----------+----+----------+
| 1| 1| 12/3/2020|
| 1| 0| 12/4/2020|
| 1| 1| 12/5/2020|
| 1| 1| 12/6/2020|
| 1| 0| 12/7/2020|
| 1| 1| 12/8/2020|
| 1| 0| 12/9/2020|
| 1| 0|12/10/2020|
| 1| 0|12/11/2020|
| 1| 1|12/12/2020|
| 2| 1| 12/1/2020|
| 2| 0| 12/2/2020|
| 2| 0| 12/3/2020|
| 2| 1| 12/4/2020|
+-----------+----+----------+
Output :
+-----------+----+----------+-----------+
|customer_id|Flag| trx_date|destination|
+-----------+----+----------+-----------+
| 1| 1| 12/3/2020| 12/3/2020|
| 1| 0| 12/4/2020| 12/3/2020|
| 1| 1| 12/5/2020| 12/5/2020|
| 1| 1| 12/6/2020| 12/6/2020|
| 1| 0| 12/7/2020| 12/6/2020|
| 1| 1| 12/8/2020| 12/8/2020|
| 1| 0| 12/9/2020| 12/8/2020|
| 1| 0|12/10/2020| 12/8/2020|
| 1| 0|12/11/2020| 12/8/2020|
| 1| 1|12/12/2020| 12/12/2020|
| 2| 1| 12/1/2020| 12/1/2020|
| 2| 0| 12/2/2020| 12/1/2020|
| 2| 0| 12/3/2020| 12/1/2020|
| 2| 1| 12/4/2020| 12/4/2020|
+-----------+----+----------+-----------+
Code to generate the Spark dataframe:
df = spark.createDataFrame(
    [(1,1,'12/3/2020'), (1,0,'12/4/2020'), (1,1,'12/5/2020'), (1,1,'12/6/2020'),
     (1,0,'12/7/2020'), (1,1,'12/8/2020'), (1,0,'12/9/2020'), (1,0,'12/10/2020'),
     (1,0,'12/11/2020'), (1,1,'12/12/2020'), (2,1,'12/1/2020'), (2,0,'12/2/2020'),
     (2,0,'12/3/2020'), (2,1,'12/4/2020')],
    ["customer_id", "Flag", "trx_date"])
Here is a PySpark way to do this. After converting trx_date to a date type, first take the incremental sum of Flag to create the groupings we need, so that we can use the first function over a window partitioned by those groupings. date_format then gets both columns back to the desired date format. I assumed your format was MM/dd/yyyy; if it was different, change it to dd/MM/yyyy in the code.
df.show() #sample data
#+-----------+----+----------+
#|customer_id|Flag| trx_date|
#+-----------+----+----------+
#| 1| 1| 12/3/2020|
#| 1| 0| 12/4/2020|
#| 1| 1| 12/5/2020|
#| 1| 1| 12/6/2020|
#| 1| 0| 12/7/2020|
#| 1| 1| 12/8/2020|
#| 1| 0| 12/9/2020|
#| 1| 0|12/10/2020|
#| 1| 0|12/11/2020|
#| 1| 1|12/12/2020|
#| 2| 1| 12/1/2020|
#| 2| 0| 12/2/2020|
#| 2| 0| 12/3/2020|
#| 2| 1| 12/4/2020|
#+-----------+----+----------+
from pyspark.sql import functions as F
from pyspark.sql.window import Window
w = Window().orderBy("customer_id", "trx_date")
w1 = Window().partitionBy("Flag2").orderBy("trx_date")\
             .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df.withColumn("trx_date", F.to_date("trx_date", "MM/dd/yyyy"))\
  .withColumn("Flag2", F.sum("Flag").over(w))\
  .withColumn("destination", F.when(F.col("Flag") == 0, F.first("trx_date").over(w1)).otherwise(F.col("trx_date")))\
  .withColumn("trx_date", F.date_format("trx_date", "MM/dd/yyyy"))\
  .withColumn("destination", F.date_format("destination", "MM/dd/yyyy"))\
  .orderBy("customer_id", "trx_date").drop("Flag2").show()
#+-----------+----+----------+-----------+
#|customer_id|Flag| trx_date|destination|
#+-----------+----+----------+-----------+
#| 1| 1|12/03/2020| 12/03/2020|
#| 1| 0|12/04/2020| 12/03/2020|
#| 1| 1|12/05/2020| 12/05/2020|
#| 1| 1|12/06/2020| 12/06/2020|
#| 1| 0|12/07/2020| 12/06/2020|
#| 1| 1|12/08/2020| 12/08/2020|
#| 1| 0|12/09/2020| 12/08/2020|
#| 1| 0|12/10/2020| 12/08/2020|
#| 1| 0|12/11/2020| 12/08/2020|
#| 1| 1|12/12/2020| 12/12/2020|
#| 2| 1|12/01/2020| 12/01/2020|
#| 2| 0|12/02/2020| 12/01/2020|
#| 2| 0|12/03/2020| 12/01/2020|
#| 2| 1|12/04/2020| 12/04/2020|
#+-----------+----+----------+-----------+
You can use window functions. I am unsure whether Spark SQL supports the standard IGNORE NULLS option to lag().
If it does, you can just do:
select t.*,
       case when flag = 1 then trx_date
            else lag(case when flag = 1 then trx_date end ignore nulls)
                 over (partition by customer_id order by trx_date)
       end as destination
from mytable t
Else, you can build groups with a window sum first:
select customer_id,
       flag,
       trx_date,
       case when flag = 1 then trx_date
            else min(trx_date) over (partition by customer_id, grp order by trx_date)
       end as destination
from (
    select t.*,
           sum(flag) over (partition by customer_id order by trx_date) as grp
    from mytable t
) t
You can achieve this in the following way if you are using the dataframe API:
from pyspark.sql import functions as f
from pyspark.sql.window import Window

# Convert the date format while creating the window itself
window = Window().orderBy("customer_id", f.to_date('trx_date', 'MM/dd/yyyy'))
df1 = df.withColumn('destination', f.when(f.col('Flag') == 1, f.col('trx_date')))\
        .withColumn('destination', f.last(f.col('destination'), ignorenulls=True).over(window))
df1.show()
+-----------+----+----------+-----------+
|customer_id|Flag| trx_date|destination|
+-----------+----+----------+-----------+
| 1| 1| 12/3/2020| 12/3/2020|
| 1| 0| 12/4/2020| 12/3/2020|
| 1| 1| 12/5/2020| 12/5/2020|
| 1| 1| 12/6/2020| 12/6/2020|
| 1| 0| 12/7/2020| 12/6/2020|
| 1| 1| 12/8/2020| 12/8/2020|
| 1| 0| 12/9/2020| 12/8/2020|
| 1| 0|12/10/2020| 12/8/2020|
| 1| 0|12/11/2020| 12/8/2020|
| 1| 1|12/12/2020| 12/12/2020|
| 2| 1| 12/1/2020| 12/1/2020|
| 2| 0| 12/2/2020| 12/1/2020|
| 2| 0| 12/3/2020| 12/1/2020|
| 2| 1| 12/4/2020| 12/4/2020|
+-----------+----+----------+-----------+
Hope it helps.

Spark: Dataframe pipe delimited doesn't return correct values

I have data frame as below:
scala> products_df.show(5)
+--------------------+
| value|
+--------------------+
|1009|45|Diamond F...|
|1010|46|DBX Vecto...|
|1011|46|Old Town ...|
|1012|46|Pelican T...|
|1013|46|Perceptio...|
+--------------------+
I need to split it into separate columns.
I use the query below, which works with all the other delimiters, but here it doesn't:
products_df.selectExpr(("cast((split(value,'|'))[0] as int) as product_id"),("cast((split(value,'|'))[1] as int) as product_category_id"),("cast((split(value,'|'))[2] as string) as product_name"),("cast((split(value,'|'))[3] as string) as description"), ("cast((split(value,'|'))[4] as float) as product_price") ,("cast((split(value,'|'))[5] as string) as product_image")).show
It returns -
product_id|product_category_id|product_name|description|product_price|product_image|
+----------+-------------------+------------+-----------+-------------+-------------+
| 1| 0| 0| 9| null| 4|
| 1| 0| 1| 0| null| 4|
| 1| 0| 1| 1| null| 4|
| 1| 0| 1| 2| null| 4|
| 1| 0| 1| 3| null| 4|
| 1| 0| 1| 4| null| 4|
| 1| 0| 1| 5| null| 4|
It works fine when the file is delimited by a comma (,) or a colon (:); only with a pipe (|) does it return the values above, whereas it should be:
product_id|product_category_id| product_name|description|product_price| product_image|
+----------+-------------------+--------------------+-----------+-------------+--------------------+
| 1009| 45|Quest Q64 10 FT. ...| | 59.98|http://images.acm...|
| 1010| 46|Under Armour Men'...| | 129.99|http://images.acm...|
| 1011| 47|Under Armour Men'...| | 89.99|http://images.acm...|
Thanks, guys, for the suggestions.
The root cause is that split() treats its pattern as a regular expression, and the pipe (|) is a regex metacharacter, so it has to be escaped as \\| (an unescaped | pattern matches the empty string and splits the value into single characters). With the escaping in place, withColumn works; escaping the pipe the same way should fix the selectExpr version as well.
val products_df = spark.read.textFile("/user/code/products")
  .withColumn("product_id", split($"value", "\\|")(0).cast("int"))
  .withColumn("product_cat_id", split($"value", "\\|")(1).cast("int"))
  .withColumn("product_name", split($"value", "\\|")(2).cast("string"))
  .withColumn("product_description", split($"value", "\\|")(3).cast("string"))
  .withColumn("product_price", split($"value", "\\|")(4).cast("float"))
  .withColumn("product_image", split($"value", "\\|")(5).cast("string"))
  .select("product_id", "product_cat_id", "product_name",
          "product_description", "product_price", "product_image")
Spark 2.4.3. Just adding a neat and clean version:
scala> val df = Seq(("1009|45|Diamond F"), ("1010|46|DBX Vecto")).toDF("value")
scala> df.show
+-----------------+
| value|
+-----------------+
|1009|45|Diamond F|
|1010|46|DBX Vecto|
+-----------------+
val splitedViewsDF = df
  .withColumn("product_id", split($"value", "\\|").getItem(0))
  .withColumn("product_cat_id", split($"value", "\\|").getItem(1))
  .withColumn("product_name", split($"value", "\\|").getItem(2))
  .drop($"value")
scala> splitedViewsDF.show
+----------+--------------+------------+
|product_id|product_cat_id|product_name|
+----------+--------------+------------+
| 1009| 45| Diamond F|
| 1010| 46| DBX Vecto|
+----------+--------------+------------+
Here you can get the data by using getItem. Happy Hadoop!

Drop multiple columns from DataFrame recursively in SPARK <= version 1.6.0

I want to drop multiple columns from the data frame in one go; I don't want to write .drop("col1").drop("col2").
Note: I am using spark-1.6.0
This functionality is available in current Spark versions (2.0 onwards); for earlier versions we can make use of the code below.
1.
import scala.annotation.tailrec

implicit class DataFrameOperation(df: DataFrame) {
  def dropCols(cols: String*): DataFrame = {
    @tailrec def deleteCol(df: DataFrame, cols: Seq[String]): DataFrame =
      if (cols.isEmpty) df else deleteCol(df.drop(cols.head), cols.tail)
    deleteCol(df, cols)
  }
}
To call the method
val finalDF = dataFrame.dropCols("col1","col2","col3")
This method is a workaround.
public static DataFrame drop(DataFrame dataFrame, List<String> dropCol) {
    // colname holds the names of all columns except the ones to be dropped
    List<String> colname = Arrays.stream(dataFrame.columns())
        .filter(col -> !dropCol.contains(col))
        .collect(Collectors.toList());
    return dataFrame.selectExpr(JavaConversions.asScalaBuffer(colname));
}
inputDataFrame:
+---+---+---+---+---+
| C0| C1| C2| C3| C4|
+---+---+---+---+---+
| 0| 0| 0| 0| 1|
| 1| 5| 6| 0| 14|
| 1| 6| 1| 0| 3|
| 1| 0| 1| 0| 1|
| 1| 37| 9| 0| 19|
+---+---+---+---+---+
If you want to drop C0, C2, C4 columns,
colDroppedDataFrame:
+---+---+
| C1| C3|
+---+---+
| 0| 0|
| 5| 0|
| 6| 0|
| 0| 0|
| 37| 0|
+---+---+