I am using pyspark so I have tried both pyspark code and SQL.
I am trying to get the time that the ADDRESS column was a different value, grouped by USER_ID. The rows are ordered by TIME. Take the below table:
+---+-------+-------+----+
| ID|USER_ID|ADDRESS|TIME|
+---+-------+-------+----+
| 1| 1| A| 10|
| 2| 1| B| 15|
| 3| 1| A| 20|
| 4| 1| A| 40|
| 5| 1| A| 45|
+---+-------+-------+----+
The correct new column I would like is as below:
+---+-------+-------+----+---------+
| ID|USER_ID|ADDRESS|TIME|LAST_DIFF|
+---+-------+-------+----+---------+
| 1| 1| A| 10| null|
| 2| 1| B| 15| 10|
| 3| 1| A| 20| 15|
| 4| 1| A| 40| 15|
| 5| 1| A| 45| 15|
+---+-------+-------+----+---------+
I have tried using different windows but none ever seem to get exactly what I want. Any ideas?
A simplified version of #jxc's answer.
from pyspark.sql.functions import *
from pyspark.sql import Window
#Window definition
w = Window.partitionBy(col('user_id')).orderBy(col('id'))
#Getting the previous time and classifying rows into groups
grp_df = df.withColumn('grp',sum(when(lag(col('address')).over(w) == col('address'),0).otherwise(1)).over(w)) \
.withColumn('prev_time',lag(col('time')).over(w))
#Window definition with groups
w_grp = Window.partitionBy(col('user_id'),col('grp')).orderBy(col('id'))
grp_df.withColumn('last_addr_change_time',min(col('prev_time')).over(w_grp)).show()
Use lag with running sum to assign groups when there is a change in the column value (based on the defined window). Get the time from the previous row, which will be used in the next step.
Once you get the groups, use the running minimum to get the last timestamp of the column value change. (Suggest you look at the intermediate results to understand the transformations better)
One way using two Window specs:
from pyspark.sql.functions import when, col, lag, sum as fsum
from pyspark.sql import Window
w1 = Window.partitionBy('USER_ID').orderBy('ID')
w2 = Window.partitionBy('USER_ID').orderBy('g')
# create a new sub-group label based on the values of ADDRESS and Previous ADDRESS
df1 = df.withColumn('g', fsum(when(col('ADDRESS') == lag('ADDRESS').over(w1), 0).otherwise(1)).over(w1))
# group by USER_ID and the above sub-group label and calculate the sum of time in the group as diff
# calculate the last_diff and then join the data back to the df1
df2 = df1.groupby('USER_ID', 'g').agg(fsum('Time').alias('diff')).withColumn('last_diff', lag('diff').over(w2))
df1.join(df2, on=['USER_ID', 'g']).show()
+-------+---+---+-------+----+----+---------+
|USER_ID| g| ID|ADDRESS|TIME|diff|last_diff|
+-------+---+---+-------+----+----+---------+
| 1| 1| 1| A| 10| 10| null|
| 1| 2| 2| B| 15| 15| 10|
| 1| 3| 3| A| 20| 105| 15|
| 1| 3| 4| A| 40| 105| 15|
| 1| 3| 5| A| 45| 105| 15|
+-------+---+---+-------+----+----+---------+
df_new = df1.join(df2, on=['USER_ID', 'g']).drop('g', 'diff')
I have a dataframe and i need to find the maximum 5 values in each row, convert only those values to 1 and rest all to 0 while maintaining the dataframe structure, i.e. the column names should remain the same
I tried using toLocalIterator and then converting each row to a list, then converting top 5 to values 1.
But it gives me a java.lang.outOfMemoryError when i run the code on large dataset.
While looking at the logs i found that a task of very large size(around 25000KB) is submitted while the max recommended size is 100KB
Is there a better way to find and convert top 5 values to a certain value(1 in this case) and rest all to 0, which would utilize less memory
EDIT 1:
For example if i have this 10 columns and 5 rows as the input
+----+----+----+----+----+----+----+----+----+----+
| 1| 2| 3| 4| 5| 6| 7| 8| 9| 10|
+----+----+----+----+----+----+----+----+----+----+
|0.74| 0.9|0.52|0.85|0.18|0.23| 0.3| 0.0| 0.1|0.07|
|0.11|0.57|0.81|0.81|0.45|0.48|0.86|0.38|0.41|0.45|
|0.03|0.84|0.17|0.96|0.09|0.73|0.25|0.05|0.57|0.66|
| 0.8|0.94|0.06|0.44| 0.2|0.89| 0.9| 1.0|0.48|0.14|
|0.73|0.86|0.68| 1.0|0.78|0.17|0.11|0.19|0.18|0.83|
+----+----+----+----+----+----+----+----+----+----+
this is what i want as the output
+---+---+---+---+---+---+---+---+---+---+
| 1| 2| 3| 4| 5| 6| 7| 8| 9| 10|
+---+---+---+---+---+---+---+---+---+---+
| 1| 1| 1| 1| 0| 0| 1| 0| 0| 0|
| 0| 1| 1| 1| 0| 1| 1| 0| 0| 0|
| 0| 1| 0| 1| 0| 1| 0| 0| 1| 1|
| 1| 1| 0| 0| 0| 1| 1| 1| 0| 0|
| 1| 1| 0| 1| 1| 0| 0| 0| 0| 1|
+---+---+---+---+---+---+---+---+---+---+
as you can see i want to find the top(max) 5 values in each row convert them to 1 and the rest of the values to 0, while maintaining the structure i.e. rows and columns
this is what i am using (which gives me outOfMemoryError)
for row in prob_df.rdd.toLocalIterator():
rowPredDict = {}
for cat in categories:
rowPredDict[cat]= row[cat]
sorted_row = sorted(rowPredDict.items(), key=lambda kv: kv[1],reverse=True)
#print(rowPredDict)
rowPredDict = rowPredDict.fromkeys(rowPredDict,0)
rowPredDict[sorted_row[0:5][0][0]] = 1
rowPredDict[sorted_row[0:5][1][0]] = 1
rowPredDict[sorted_row[0:5][2][0]] = 1
rowPredDict[sorted_row[0:5][3][0]] = 1
rowPredDict[sorted_row[0:5][4][0]] = 1
#print(count,sorted_row[0:2][0][0],",",sorted_row[0:2][1][0])
rowPredList.append(rowPredDict)
#count=count+1
I don't have enough volume for performance testing but could you try below approach using spark functions array apis
1. Prepare Dataset:
import pyspark.sql.functions as f
l1 = [(0.74,0.9,0.52,0.85,0.18,0.23,0.3,0.0,0.1,0.07),
(0.11,0.57,0.81,0.81,0.45,0.48,0.86,0.38,0.41,0.45),
(0.03,0.84,0.17,0.96,0.09,0.73,0.25,0.05,0.57,0.66),
(0.8,0.94,0.06,0.44,0.2,0.89,0.9,1.0,0.48,0.14),
(0.73,0.86,0.68,1.0,0.78,0.17,0.11,0.19,0.18,0.83)]
df = spark.createDataFrame(l1).toDF('col_1','col_2','col_3','col_4','col_5','col_6','col_7','col_8','col_9','col_10')
df.show()
+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
|col_1|col_2|col_3|col_4|col_5|col_6|col_7|col_8|col_9|col_10|
+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| 0.74| 0.9| 0.52| 0.85| 0.18| 0.23| 0.3| 0.0| 0.1| 0.07|
| 0.11| 0.57| 0.81| 0.81| 0.45| 0.48| 0.86| 0.38| 0.41| 0.45|
| 0.03| 0.84| 0.17| 0.96| 0.09| 0.73| 0.25| 0.05| 0.57| 0.66|
| 0.8| 0.94| 0.06| 0.44| 0.2| 0.89| 0.9| 1.0| 0.48| 0.14|
| 0.73| 0.86| 0.68| 1.0| 0.78| 0.17| 0.11| 0.19| 0.18| 0.83|
+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
2. Get top 5 for each row
Following below steps on df
Create array and Sort the elements
Get first 5 elements into new column called all
UDF to get Max 5 elements from sorted:
Note : spark >= 2.4.0 have slice function which can do similar task. I am using 2.2 in currently so creating UDF but if you have 2.4 or higher version then you can give a try with slice
def get_n_elements_(arr, n):
return arr[:n]
get_n_elements = f.udf(get_n_elements_, t.ArrayType(t.DoubleType()))
df_all = df.withColumn('all', get_n_elements(f.sort_array(f.array(df.columns), False),f.lit(5)))
df_all.show()
+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+------------------------------+
|col_1|col_2|col_3|col_4|col_5|col_6|col_7|col_8|col_9|col_10|all |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+------------------------------+
|0.74 |0.9 |0.52 |0.85 |0.18 |0.23 |0.3 |0.0 |0.1 |0.07 |[0.9, 0.85, 0.74, 0.52, 0.3] |
|0.11 |0.57 |0.81 |0.81 |0.45 |0.48 |0.86 |0.38 |0.41 |0.45 |[0.86, 0.81, 0.81, 0.57, 0.48]|
|0.03 |0.84 |0.17 |0.96 |0.09 |0.73 |0.25 |0.05 |0.57 |0.66 |[0.96, 0.84, 0.73, 0.66, 0.57]|
|0.8 |0.94 |0.06 |0.44 |0.2 |0.89 |0.9 |1.0 |0.48 |0.14 |[1.0, 0.94, 0.9, 0.89, 0.8] |
|0.73 |0.86 |0.68 |1.0 |0.78 |0.17 |0.11 |0.19 |0.18 |0.83 |[1.0, 0.86, 0.83, 0.78, 0.73] |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+------------------------------+
3. Create dynamic sql and execute with selectExpr
sql_stmt = ''' case when array_contains(all, {0}) then 1 else 0 end AS `{0}` '''
df_all.selectExpr(*[sql_stmt.format(c) for c in df.columns]).show()
+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
|col_1|col_2|col_3|col_4|col_5|col_6|col_7|col_8|col_9|col_10|
+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| 1| 1| 1| 1| 0| 0| 1| 0| 0| 0|
| 0| 1| 1| 1| 0| 1| 1| 0| 0| 0|
| 0| 1| 0| 1| 0| 1| 0| 0| 1| 1|
| 1| 1| 0| 0| 0| 1| 1| 1| 0| 0|
| 1| 1| 0| 1| 1| 0| 0| 0| 0| 1|
+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
You can do that easily like this.
For example we want to do that task for value column, so first sort the value column take the 5th value and change the values using a when condition.
df2 = sc.parallelize([("fo", 100,20),("rogerg", 110,56),("franre", 1080,297),("f11", 10100,217),("franci", 10,227),("fran", 1002,5),("fran231cis", 10007,271),("franc3is", 1030,2)]).toDF(["name", "salary","value"])
df2 = df2.orderBy("value",ascending=False)
+----------+------+-----+
| name|salary|value|
+----------+------+-----+
| franre| 1080| 297|
|fran231cis| 10007| 271|
| franci| 10| 227|
| f11| 10100| 217|
| rogerg| 110| 56|
| fo| 100| 20|
| fran| 1002| 5|
| franc3is| 1030| 2|
+----------+------+-----+
maxx = df2.take(5)[4]["value"]
dff = df2.select(when(df2['value'] >= maxx, 1).otherwise(0).alias("value"),"name", "salary")
+---+----------+------+
|value| name|salary|
+---+----------+------+
| 1| franre| 1080|
| 1|fran231cis| 10007|
| 1| franci| 10|
| 1| f11| 10100|
| 1| rogerg| 110|
| 0| fo| 100|
| 0| fran| 1002|
| 0| franc3is| 1030|
+---+----------+------+
I have data frame as below:
scala> products_df.show(5)
+--------------------+
| value|
+--------------------+
|1009|45|Diamond F...|
|1010|46|DBX Vecto...|
|1011|46|Old Town ...|
|1012|46|Pelican T...|
|1013|46|Perceptio...|
+--------------------+
I need to divide each column wise-
I use below query which works in all the other delimiter but here it doen't ==>
products_df.selectExpr(("cast((split(value,'|'))[0] as int) as product_id"),("cast((split(value,'|'))[1] as int) as product_category_id"),("cast((split(value,'|'))[2] as string) as product_name"),("cast((split(value,'|'))[3] as string) as description"), ("cast((split(value,'|'))[4] as float) as product_price") ,("cast((split(value,'|'))[5] as string) as product_image")).show
It returns -
product_id|product_category_id|product_name|description|product_price|product_image|
+----------+-------------------+------------+-----------+-------------+-------------+
| 1| 0| 0| 9| null| 4|
| 1| 0| 1| 0| null| 4|
| 1| 0| 1| 1| null| 4|
| 1| 0| 1| 2| null| 4|
| 1| 0| 1| 3| null| 4|
| 1| 0| 1| 4| null| 4|
| 1| 0| 1| 5| null| 4|
It works fine when the file is delimited by comma(,) or (:)
only with pipe(|) and returns above values whereas it should be
product_id|product_category_id| product_name|description|product_price| product_image|
+----------+-------------------+--------------------+-----------+-------------+--------------------+
| 1009| 45|Quest Q64 10 FT. ...| | 59.98|http://images.acm...|
| 1010| 46|Under Armour Men'...| | 129.99|http://images.acm...|
| 1011| 47|Under Armour Men'...| | 89.99|http://images.acm...|
Thanks, Guys for the suggestions-
-> It seems selectExpr doesn't work when file is delimited by pipe(|).
so the alternate way is to use withColumn.
val products_df=spark.read.textFile("/user/code/products").withColumn("product_id",split($"value","\|")(0).cast("int")).withColumn("product_cat_id",split($"value","\|")(1).cast("int")).withColumn("product_name",split($"value","\|")(2).cast("string")).withColumn("product_description",split($"value","\|")(3).cast("string")).withColumn("product_price",split($"value","\|")(4).cast("float")).withColumn("product_image",split($"value","\|")(5).cast("string")).select("product_id","product_cat_id","product_name","product_description","product_price","product_image")
Spark 2.4.3 Just adding a neat and clean code
scala> var df =Seq(("1009|45|Diamond F"),("1010|46|DBX Vecto")).toDF("value")
scala> df.show
+-----------------+
| value|
+-----------------+
|1009|45|Diamond F|
|1010|46|DBX Vecto|
+-----------------+
val splitedViewsDF = df.withColumn("product_id", split($"value", "\\|").getItem(0)).withColumn("product_cat_id", split($"value", "\\|").getItem(1)).withColumn("product_name", split($"value", "\\|").getItem(2)).drop($"value")
scala> splitedViewsDF.show
+----------+--------------+------------+
|product_id|product_cat_id|product_name|
+----------+--------------+------------+
| 1009| 45| Diamond F|
| 1010| 46| DBX Vecto|
+----------+--------------+------------+
here you can get data by using getItem. Happy Hadoop
Add new Column contain a value in a column counterpart another value in another column that meets a specified condition
For instance,
original DF as follows:
+-----+-----+-----+
|col1 |col2 |col3 |
+-----+-----+-----+
| A| 17| 1|
| A| 16| 2|
| A| 18| 2|
| A| 30| 3|
| B| 35| 1|
| B| 34| 2|
| B| 36| 2|
| C| 20| 1|
| C| 30| 1|
| C| 43| 1|
+-----+-----+-----+
I need to repeat the value in col2 that counterpart to 1 in col3 for each col1's groups. and if there are more value =1 in col3 for any group from col1 repeat the minimum value
the desired Df as follows:
+----+----+----+----------+
|col1|col2|col3|new_column|
+----+----+----+----------+
| A| 17| 1| 17|
| A| 16| 2| 17|
| A| 18| 2| 17|
| A| 30| 3| 17|
| B| 35| 1| 35|
| B| 34| 2| 35|
| B| 36| 2| 35|
| C| 20| 1| 20|
| C| 30| 1| 20|
| C| 43| 1| 20|
+----+----+----+----------+
df3=df.filter(df.col3==1)
+----+----+----+
|col1|col2|col3|
+----+----+----+
| B| 35| 1|
| C| 20| 1|
| C| 30| 1|
| C| 43| 1|
| A| 17| 1|
+----+----+----+
df3.createOrReplaceTempView("mytable")
To obtain minimum value of col2 I followed the accepted answer in this link How to find exact median for grouped data in Spark
df6=spark.sql("select col1, min(col2) as minimum from mytable group by col1 order by col1")
df6.show()
+----+-------+
|col1|minimum|
+----+-------+
| A| 17|
| B| 35|
| C| 20|
+----+-------+
df_a=df.join(df6,['col1'],'leftouter')
+----+----+----+-------+
|col1|col2|col3|minimum|
+----+----+----+-------+
| B| 35| 1| 35|
| B| 34| 2| 35|
| B| 36| 2| 35|
| C| 20| 1| 20|
| C| 30| 1| 20|
| C| 43| 1| 20|
| A| 17| 1| 17|
| A| 16| 2| 17|
| A| 18| 2| 17|
| A| 30| 3| 17|
+----+----+----+-------+
Is there way better than this solution?
I'm working with Spark 1.6
Here are my data :
eDF = sqlsc.createDataFrame([Row(v=1, eng_1=10,eng_2=20),
Row(v=2, eng_1=15,eng_2=30),
Row(v=3, eng_1=8,eng_2=12)])
eDF.select('v','eng_1','eng_2').show()
+---+-----+-----+
| v|eng_1|eng_2|
+---+-----+-----+
| 1| 10| 20|
| 2| 15| 30|
| 3| 8| 12|
+---+-----+-----+
I would like to 'flatten' this table.
That is to say :
+---+-----+---+
| v| key|val|
+---+-----+---+
| 1|eng_1| 10|
| 1|eng_2| 20|
| 2|eng_1| 15|
| 2|eng_2| 30|
| 3|eng_1| 8|
| 3|eng_2| 12|
+---+-----+---+
Note that since I'm working with Spark 1.6, I can't use pyspar.sql.functions.create_map or pyspark.sql.functions.posexplode.
Use rdd.flatMap to flatten it:
df = spark.createDataFrame(
eDF.rdd.flatMap(
lambda r: [Row(v=r.v, key=col, val=r[col]) for col in ['eng_1', 'eng_2']]
)
)
df.show()
+-----+---+---+
| key| v|val|
+-----+---+---+
|eng_1| 1| 10|
|eng_2| 1| 20|
|eng_1| 2| 15|
|eng_2| 2| 30|
|eng_1| 3| 8|
|eng_2| 3| 12|
+-----+---+---+