Apache Spark SQL: How to use GroupBy and Max to filter data

I have a given dataset with the following structure:
https://i.imgur.com/Kk7I1S1.png
I need to solve the problem below using Spark SQL DataFrames:
For each postcode, find the customer with the highest number of previous accidents. In the case of a tie, meaning more than one customer shares the highest number of accidents, return any one of them. For each selected customer, output the following columns: postcode, customer id, number of previous accidents.

I think you forgot to include the data you mentioned in the image link, so I created my own data set based on your problem. You can use the code snippet below as a reference, and replace the df DataFrame with your own data set, adding any required columns such as id.
scala> val df = spark.read.format("csv").option("header","true").load("/user/nikhil/acc.csv")
df: org.apache.spark.sql.DataFrame = [postcode: string, customer: string ... 1 more field]
scala> df.show()
+--------+--------+---------+
|postcode|customer|accidents|
+--------+--------+---------+
| 1| Nikhil| 5|
| 2| Ram| 4|
| 1| Shyam| 3|
| 3| pranav| 1|
| 1| Suman| 2|
| 3| alex| 2|
| 2| Raj| 5|
| 4| arpit| 3|
| 1| darsh| 2|
| 1| rahul| 3|
| 2| kiran| 4|
| 3| baba| 4|
| 4| alok| 3|
| 1| Nakul| 5|
+--------+--------+---------+
scala> df.createOrReplaceTempView("tmptable")
scala> spark.sql(s"""SELECT postcode,customer, accidents FROM (SELECT postcode,customer, accidents, row_number() over (PARTITION BY postcode ORDER BY accidents desc) as rn from tmptable) WHERE rn = 1""").show(false)
+--------+--------+---------+
|postcode|customer|accidents|
+--------+--------+---------+
|3 |baba |4 |
|1 |Nikhil |5 |
|4 |arpit |3 |
|2 |Raj |5 |
+--------+--------+---------+

You can get the result with the following code in Python:
from pyspark.sql import Row, Window
import pyspark.sql.functions as F

l = [(1, '682308', 25), (1, '682308', 23), (2, '682309', 23), (1, '682309', 27), (2, '682309', 22)]
rdd = sc.parallelize(l)
people = rdd.map(lambda x: Row(c_id=int(x[0]), postcode=x[1], accident=int(x[2])))
schemaPeople = sqlContext.createDataFrame(people)
# highest accident count per (postcode, customer)
result = schemaPeople.groupby("postcode", "c_id").agg(F.max("accident").alias("accident"))
# keep the top customer per postcode
new_result = result.withColumn("row_num", F.row_number().over(Window.partitionBy("postcode").orderBy(F.desc("accident")))).filter("row_num == 1")
new_result.show()
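With a newer Spark setup, the same result can be obtained without the RDD and sqlContext steps. A minimal sketch, assuming an active SparkSession named spark and the same toy data:
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, "682308", 25), (1, "682308", 23), (2, "682309", 23), (1, "682309", 27), (2, "682309", 22)],
    ["c_id", "postcode", "accident"],
)

# highest accident count per (postcode, customer), then the top customer per postcode
agg = df.groupBy("postcode", "c_id").agg(F.max("accident").alias("accident"))
w = Window.partitionBy("postcode").orderBy(F.desc("accident"))
agg.withColumn("rn", F.row_number().over(w)).filter("rn = 1") \
   .select("postcode", "c_id", "accident").show()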

Related

PySpark: How to concatenate two distinct dataframes?

I have multiple dataframes that I need to concatenate together column-wise, matching rows by position. In pandas, we would typically write: pd.concat([df1, df2], axis=1).
This thread: How to concatenate/append multiple Spark dataframes column wise in Pyspark? appears close, but its respective answer:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
df1_schema = StructType([StructField("id",IntegerType()),StructField("name",StringType())])
df1 = spark.sparkContext.parallelize([(1, "sammy"),(2, "jill"),(3, "john")])
df1 = spark.createDataFrame(df1, schema=df1_schema)
df2_schema = StructType([StructField("secNo",IntegerType()),StructField("city",StringType())])
df2 = spark.sparkContext.parallelize([(101, "LA"),(102, "CA"),(103,"DC")])
df2 = spark.createDataFrame(df2, schema=df2_schema)
schema = StructType(df1.schema.fields + df2.schema.fields)
df1df2 = df1.rdd.zip(df2.rdd).map(lambda x: x[0]+x[1])
spark.createDataFrame(df1df2, schema).show()
Yields the following error when done on my data at scale: Can only zip RDDs with same number of elements in each partition
How can I join 2 or more data frames that are identical in row length but are otherwise independent of content (they share a similar repeating structure/order but contain no shared data)?
Example expected data looks like:
+---+-----+ +-----+----+ +---+-----+-----+----+
| id| name| |secNo|city| | id| name|secNo|city|
+---+-----+ +-----+----+ +---+-----+-----+----+
| 1|sammy| + | 101| LA| => | 1|sammy| 101| LA|
| 2| jill| | 102| CA| | 2| jill| 102| CA|
| 3| john| | 103| DC| | 3| john| 103| DC|
+---+-----+ +-----+----+ +---+-----+-----+----+
You can create unique IDs with
from pyspark.sql.functions import expr
df1 = df1.withColumn("unique_id", expr("row_number() over (order by (select null))"))
df2 = df2.withColumn("unique_id", expr("row_number() over (order by (select null))"))
then you can left-join them:
df1.join(df2, ["unique_id"], "left").drop("unique_id")
Final output looks like:
+---+-----+-----+----+
| id| name|secNo|city|
+---+-----+-----+----+
|  1|sammy|  101|  LA|
|  2| jill|  102|  CA|
|  3| john|  103|  DC|
+---+-----+-----+----+
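For reference, here is a self-contained PySpark sketch of the same idea applied to the example data from the question, swapping in monotonically_increasing_id() as the ordering key for row_number(). It assumes an active SparkSession named spark; note that an unpartitioned window pulls all rows into a single partition, which may matter at scale:
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([(1, "sammy"), (2, "jill"), (3, "john")], ["id", "name"])
df2 = spark.createDataFrame([(101, "LA"), (102, "CA"), (103, "DC")], ["secNo", "city"])

# monotonically_increasing_id() is not consecutive, so convert it into a
# consecutive row number with a window ordered by it. This assumes the existing
# row order is the order you want; Spark does not strictly guarantee row order.
w = Window.orderBy(F.monotonically_increasing_id())
df1_id = df1.withColumn("unique_id", F.row_number().over(w))
df2_id = df2.withColumn("unique_id", F.row_number().over(w))

df1_id.join(df2_id, "unique_id", "left").drop("unique_id").show()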

Determine if pyspark DataFrame row value is present in other columns

I'm working with a dataframe in pyspark, and need to evaluate row by row if a value is present in other columns of the dataframe. As an example, given this dataframe:
df:
+---------+--------------+-------+-------+-------+
|Subject |SubjectTotal |TypeA |TypeB |TypeC |
+---------+--------------+-------+-------+-------+
|Subject1 |10 |5 |3 |2 |
+---------+--------------+-------+-------+-------+
|Subject2 |15 |0 |15 |0 |
+---------+--------------+-------+-------+-------+
|Subject3 |5 |0 |0 |5 |
+---------+--------------+-------+-------+-------+
As an output, I need to determine which Type has 100% of the SubjectTotal. So my output would look like this:
df_output:
+---------+--------------+
|Subject |Type |
+---------+--------------+
|Subject2 |TypeB |
+---------+--------------+
|Subject3 |TypeC |
+---------+--------------+
Is it even possible?
Thanks!
You can try the when().otherwise() PySpark SQL function, or a case statement in SQL:
import pyspark.sql.functions as F

df = spark.createDataFrame(
    [
        ("Subject1", 10, 5, 3, 2),
        ("Subject2", 15, 0, 15, 0),
        ("Subject3", 5, 0, 0, 5)
    ],
    ("subject", "subjectTotal", "TypeA", "TypeB", "TypeC"))
df.show()
+--------+------------+-----+-----+-----+
| subject|subjectTotal|TypeA|TypeB|TypeC|
+--------+------------+-----+-----+-----+
|Subject1| 10| 5| 3| 2|
|Subject2| 15| 0| 15| 0|
|Subject3| 5| 0| 0| 5|
+--------+------------+-----+-----+-----+
df.withColumn("Type", F.
when(F.col("subjectTotal") == F.col("TypeA"), "TypeA").
when(F.col("subjectTotal") == F.col("TypeB"), "TypeB").
when(F.col("subjectTotal") == F.col("TypeC"), "TypeC").
otherwise(None)).show()
+--------+------------+-----+-----+-----+-----+
| subject|subjectTotal|TypeA|TypeB|TypeC| Type|
+--------+------------+-----+-----+-----+-----+
|Subject1| 10| 5| 3| 2| null|
|Subject2| 15| 0| 15| 0|TypeB|
|Subject3| 5| 0| 0| 5|TypeC|
+--------+------------+-----+-----+-----+-----+
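The SQL case-statement variant mentioned above would look roughly like this (a sketch; the temp view name subjects is just an illustration):
# Same logic via Spark SQL; the view name "subjects" is an assumption.
df.createOrReplaceTempView("subjects")
spark.sql("""
    SELECT subject,
           CASE WHEN subjectTotal = TypeA THEN 'TypeA'
                WHEN subjectTotal = TypeB THEN 'TypeB'
                WHEN subjectTotal = TypeC THEN 'TypeC'
           END AS Type
    FROM subjects
""").show()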
You can use a when expression inside a list comprehension over all the TypeX columns, then coalesce the list of expressions:
from pyspark.sql import functions as F
df1 = df.select(
    F.col("Subject"),
    F.coalesce(*[F.when(F.col(c) == F.col("SubjectTotal"), F.lit(c)) for c in df.columns[2:]]).alias("Type")
).filter("Type is not null")
df1.show()
#+--------+-----+
#| Subject| Type|
#+--------+-----+
#|Subject2|TypeB|
#|Subject3|TypeC|
#+--------+-----+
You can unpivot the dataframe using stack and filter the rows where SubjectTotal is equal to the value in the type columns:
df2 = df.selectExpr(
    'Subject',
    'SubjectTotal',
    "stack(3, 'TypeA', TypeA, 'TypeB', TypeB, 'TypeC', TypeC) as (type, val)"
).filter('SubjectTotal = val').select('Subject', 'type')
df2.show()
+--------+-----+
| Subject| type|
+--------+-----+
|Subject2|TypeB|
|Subject3|TypeC|
+--------+-----+

How to get value_counts for a spark row?

I have a Spark dataframe with 3 columns storing 3 different predictions. I want to know the count of each output value so that I can pick the value that occurs most often as the final output.
I can do this easily in pandas by calling a lambda function for each row to get value_counts, as shown below. I have converted my Spark df to a pandas df here, but I need to be able to perform a similar operation on the Spark df directly.
from pyspark.sql import Row

r = [Row(run_1=1, run_2=2, run_3=1, name='test run', id=1)]
df1 = spark.createDataFrame(r)
df1.show()
df2 = df1.toPandas()
r = df2.iloc[0]
val_counts = r[['run_1', 'run_2', 'run_3']].value_counts()
print(val_counts)
top_val = val_counts.index[0]
top_val_cnt = val_counts.values[0]
print('Majority output = %s, occurred %s out of 3 times' % (top_val, top_val_cnt))
The output tells me that the value 1 occurred the most times (twice in this case):
+---+--------+-----+-----+-----+
| id| name|run_1|run_2|run_3|
+---+--------+-----+-----+-----+
| 1|test run| 1| 2| 1|
+---+--------+-----+-----+-----+
1 2
2 1
Name: 0, dtype: int64
Majority output = 1, occurred 2 out of 3 times
I am trying to write a UDF which can take each of the df1 rows and get the top_val and top_val_cnt. Is there a way to achieve this using the Spark df?
This is Scala, but the Python code should be similar; maybe it will help you:
import org.apache.spark.sql.functions._
import spark.implicits._

val df1 = Seq((1, 1, 1, 2), (1, 2, 3, 3), (2, 2, 2, 2)).toDF()
df1.show()
// build a per-row list of (value, count) pairs from the array of all columns
df1.select(array('*)).map(s => {
  val list = s.getList(0)
  (list.toString(), list.toArray.groupBy(i => i).mapValues(_.size).toList.toString())
}).show(false)
output:
+---+---+---+---+
| _1| _2| _3| _4|
+---+---+---+---+
| 1| 1| 1| 2|
| 1| 2| 3| 3|
| 2| 2| 2| 2|
+---+---+---+---+
+------------+-------------------------+
|_1 |_2 |
+------------+-------------------------+
|[1, 1, 1, 2]|List((2,1), (1,3)) |
|[1, 2, 3, 3]|List((2,1), (1,1), (3,2))|
|[2, 2, 2, 2]|List((2,4)) |
+------------+-------------------------+
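A rough PySpark equivalent of the Scala snippet above (a sketch, using the same toy data and assuming an active SparkSession named spark):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame([(1, 1, 1, 2), (1, 2, 3, 3), (2, 2, 2, 2)])

# count occurrences of each value within a row
counts = df1.rdd.map(lambda row: (list(row), {v: list(row).count(v) for v in set(row)}))
counts.toDF(["values", "value_counts"]).show(truncate=False)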
Let's create a test dataframe similar to yours.
data = [(1, 'test run', 1, 2, 1), (2, 'test run', 3, 2, 3), (3, 'test run', 4, 4, 4)]
df = spark.createDataFrame(data, ['id', 'name', 'run_1', 'run_2', 'run_3'])
# keep the original columns and append the value that occurs most often among run_1..run_3
newdf = df.rdd.map(lambda x: (x[0], x[1], x[2:])) \
    .map(lambda x: (x[0], x[1], x[2][0], x[2][1], x[2][2], [max(set(x[2]), key=x[2].count)])) \
    .toDF(['id', 'test', 'run_1', 'run_2', 'run_3', 'most_frequent'])
newdf.show()
+---+--------+-----+-----+-----+-------------+
| id| test|run_1|run_2|run_3|most_frequent|
+---+--------+-----+-----+-----+-------------+
| 1|test run| 1| 2| 1| [1]|
| 2|test run| 3| 2| 3| [3]|
| 3|test run| 4| 4| 4| [4]|
+---+--------+-----+-----+-----+-------------+
Or, if you need to handle the case where every item in the list is different, i.e. return a null:
data = [(1, 'test run', 1, 2, 1), (2, 'test run', 3, 2, 3), (3, 'test run', 4, 4, 4), (4, 'test run', 1, 2, 3)]
df = spark.createDataFrame(data, ['id', 'name', 'run_1', 'run_2', 'run_3'])

from pyspark.sql.functions import udf

@udf
def most_frequent(*mylist):
    counter = 1
    num = mylist[0]
    for i in mylist:
        curr_frequency = mylist.count(i)
        if curr_frequency > counter:
            counter = curr_frequency
            num = i
            return num
    else:
        return None
The counter is initialized to 1, and a value is returned only if its count is greater than 1.
df.withColumn('most_frequent', most_frequent('run_1', 'run_2', 'run_3')).show()
+---+--------+-----+-----+-----+-------------+
| id| name|run_1|run_2|run_3|most_frequent|
+---+--------+-----+-----+-----+-------------+
| 1|test run| 1| 2| 1| 1|
| 2|test run| 3| 2| 3| 3|
| 3|test run| 4| 4| 4| 4|
| 4|test run| 1| 2| 3| null|
+---+--------+-----+-----+-----+-------------+
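For the specific ask in the question (both top_val and top_val_cnt per row), here is a minimal sketch, not taken from the answers above, using a UDF that returns a struct, applied to the df1 from the question. The names vc, top_val, and top_val_cnt are just illustrative:
from collections import Counter
from pyspark.sql import functions as F, types as T

result_schema = T.StructType([
    T.StructField("top_val", T.IntegerType()),
    T.StructField("top_val_cnt", T.IntegerType()),
])

@F.udf(returnType=result_schema)
def value_counts_udf(*runs):
    # Counter.most_common(1) gives the (value, count) pair of the majority value
    val, cnt = Counter(runs).most_common(1)[0]
    return int(val), int(cnt)

df1.withColumn("vc", value_counts_udf("run_1", "run_2", "run_3")) \
   .select("id", "name", "run_1", "run_2", "run_3", "vc.top_val", "vc.top_val_cnt") \
   .show()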

Partition PySpark DataFrame depending on unique values in column (Custom Partitioning)

I have a PySpark data frame in which I have separate columns for names, types, days and values. An example of the dataframe can be seen below:
+------+----+---+-----+
| Name|Type|Day|Value|
+------+----+---+-----+
| name1| a| 1| 140|
| name2| a| 1| 180|
| name3| a| 1| 150|
| name4| b| 1| 145|
| name5| b| 1| 185|
| name6| c| 1| 155|
| name7| c| 1| 160|
| name8| a| 2| 120|
| name9| a| 2| 110|
|name10| b| 2| 125|
|name11| b| 2| 185|
|name12| c| 3| 195|
+------+----+---+-----+
For a selected value of Type, I want to create separate dataframes depending on the unique values of the column titled Day. Let's say I have chosen a as my preferred Type. In the example above, there are three unique values of Day (viz. 1, 2, 3). For each unique value of Day that has a row with the chosen Type a (that is, days 1 and 2 in the above data), I want to create a dataframe containing all rows with the chosen Type and Day. In the example mentioned above, I will have two dataframes, which will look as below:
+------+----+---+-----+
| Name|Type|Day|Value|
+------+----+---+-----+
| name1| a| 1| 140|
| name2| a| 1| 180|
| name3| a| 1| 150|
+------+----+---+-----+
and
+------+----+---+-----+
| Name|Type|Day|Value|
+------+----+---+-----+
| name8| a| 2| 120|
| name9| a| 2| 110|
+------+----+---+-----+
How can I do this? The actual data I will be working with has millions of rows, so I want to know the most efficient way to achieve this.
You can use the code below to generate the example given above:
from pyspark.sql import Row

Stats = Row("Name", "Type", "Day", "Value")
stat1 = Stats('name1', 'a', 1, 140)
stat2 = Stats('name2', 'a', 1, 180)
stat3 = Stats('name3', 'a', 1, 150)
stat4 = Stats('name4', 'b', 1, 145)
stat5 = Stats('name5', 'b', 1, 185)
stat6 = Stats('name6', 'c', 1, 155)
stat7 = Stats('name7', 'c', 1, 160)
stat8 = Stats('name8', 'a', 2, 120)
stat9 = Stats('name9', 'a', 2, 110)
stat10 = Stats('name10', 'b', 2, 125)
stat11 = Stats('name11', 'b', 2, 185)
stat12 = Stats('name12', 'c', 3, 195)
df = spark.createDataFrame([stat1, stat2, stat3, stat4, stat5, stat6,
                            stat7, stat8, stat9, stat10, stat11, stat12])
You can just use df.repartition("Type", "Day"); see the Spark documentation for repartition.
When I validate using the following function, I get the output shown below.
def validate(partition):
    count = 0
    for row in partition:
        print(row)
        count += 1
    print(count)
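Presumably the function is applied per partition along these lines (a guess at the call that was omitted, assuming the data shown below is in a DataFrame named df):
# invoke validate on every partition after repartitioning by user_id
df.repartition("user_id").foreachPartition(validate)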
My data
+------+--------------------+-------+-------+
|amount| trans_date|user_id|row_num|
+------+--------------------+-------+-------+
| 99.1|2019-06-04T00:00:...| 101| 1|
| 89.27|2019-06-04T00:00:...| 102| 2|
| 89.1|2019-03-04T00:00:...| 102| 3|
| 73.11|2019-09-10T00:00:...| 103| 4|
|-69.81|2019-09-11T00:00:...| 101| 5|
| 12.51|2018-12-14T00:00:...| 101| 6|
| 43.23|2018-09-11T00:00:...| 101| 7|
+------+--------------------+-------+-------+
After df.repartition("user_id") I get the following:
Output
Row(amount=73.11, trans_date='2019-09-10T00:00:00.000+05:30', user_id='103', row_num=4)
1
Row(amount=89.27, trans_date='2019-06-04T00:00:00.000+05:30', user_id='102', row_num=2)
Row(amount=89.1, trans_date='2019-03-04T00:00:00.000+05:30', user_id='102', row_num=3)
2
Row(amount=99.1, trans_date='2019-06-04T00:00:00.000+05:30', user_id='101', row_num=1)
Row(amount=-69.81, trans_date='2019-09-11T00:00:00.000+05:30', user_id='101', row_num=5)
Row(amount=12.51, trans_date='2018-12-14T00:00:00.000+05:30', user_id='101', row_num=6)
Row(amount=43.23, trans_date='2018-09-11T00:00:00.000+05:30', user_id='101', row_num=7)
4
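If the goal really is separate DataFrames per Day for a chosen Type (rather than just repartitioning), a straightforward sketch is to collect the distinct days for that type and filter once per day. This assumes the df built from the example rows above; chosen_type and frames are illustrative names:
from pyspark.sql import functions as F

chosen_type = "a"

# distinct Day values that actually occur with the chosen Type
days = [r["Day"] for r in df.filter(F.col("Type") == chosen_type).select("Day").distinct().collect()]

# one filtered DataFrame per day
frames = {day: df.filter((F.col("Type") == chosen_type) & (F.col("Day") == day)) for day in days}
for day, frame in frames.items():
    frame.show()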

group by and filter highest value in data frame in scala

I have some data like this:
a,timestamp,list,rid,sbid,avgvalue
1,1011,1001,4,4,1.20
2,1000,819,2,3,2.40
1,1011,107,1,3,5.40
1,1021,819,1,1,2.10
In the data above, for each timestamp and tag (column a), I want to find the row with the highest avgvalue. Like this:
For timestamp 1011 and a = 1:
1,1011,1001,4,4,1.20
1,1011,107,1,3,5.40
The output would be:
1,1011,107,1,3,5.40 //because for timestamp 1011 and tag 1 the highest avg value is 5.40
So I need to pick this row.
I tried this, but it still does not work properly:
val highvaluetable = df.registerTempTable("high_value")
val highvalue = sqlContext.sql("select a,timestamp,list,rid,sbid,avgvalue from high_value")
highvalue.select($"a", $"timestamp", $"list", $"rid", $"sbid", $"avgvalue".cast(IntegerType).as("higher_value"))
  .groupBy("a", "timestamp")
  .max("higher_value")
highvalue.collect.foreach(println)
Any help will be appreciated.
After I applied some of your suggestions, I am still getting duplicates in my data.
+---+----------+----+----+----+----+
|a| timestamp| list|rid|sbid|avgvalue|
+---+----------+----+----+----+----+
| 4|1496745915| 718| 4| 3|0.30|
| 4|1496745918| 362| 4| 3|0.60|
| 4|1496745913| 362| 4| 3|0.60|
| 2|1496745918| 362| 4| 3|0.10|
| 3|1496745912| 718| 4| 3|0.05|
| 2|1496745918| 718| 4| 3|0.30|
| 4|1496745911|1901| 4| 3|0.60|
| 4|1496745912| 718| 4| 3|0.60|
| 2|1496745915| 362| 4| 3|0.30|
| 2|1496745915|1901| 4| 3|0.30|
| 2|1496745910|1901| 4| 3|0.30|
| 3|1496745915| 362| 4| 3|0.10|
| 4|1496745918|3878| 4| 3|0.10|
| 4|1496745915|1901| 4| 3|0.60|
| 4|1496745912| 362| 4| 3|0.60|
| 4|1496745914|1901| 4| 3|0.60|
| 4|1496745912|3878| 4| 3|0.10|
| 4|1496745912| 718| 4| 3|0.30|
| 3|1496745915|3878| 4| 3|0.05|
| 4|1496745914| 362| 4| 3|0.60|
+---+----------+----+----+----+----+
| 4|1496745918| 362| 4| 3|0.60|
| 4|1496745918|3878| 4| 3|0.10|
These two rows have the same timestamp and the same tag; that is what I consider a duplicate.
This is my code:
rdd.createTempView("v1")
val rdd2=sqlContext.sql("select max(avgvalue) as max from v1 group by (a,timestamp)")
rdd2.createTempView("v2")
val rdd3=sqlContext.sql("select a,timestamp,list,rid,sbid,avgvalue from v1 join v2 on v2.max=v1.avgvalue").show()
You can use the DataFrame API to find the max as below:
df.groupBy("timestamp").agg(max("avgvalue"))
this will give you output as
+---------+-------------+
|timestamp|max(avgvalue)|
+---------+-------------+
|1021 |2.1 |
|1000 |2.4 |
|1011 |5.4 |
+---------+-------------+
which doesn't include the other fields you require, so you can use first as
df.groupBy("timestamp").agg(max("avgvalue") as "avgvalue", first("a") as "a", first("list") as "list", first("rid") as "rid", first("sbid") as "sbid")
you should have output as
+---------+--------+---+----+---+----+
|timestamp|avgvalue|a |list|rid|sbid|
+---------+--------+---+----+---+----+
|1021 |2.1 |1 |819 |1 |1 |
|1000 |2.4 |2 |819 |2 |3 |
|1011 |5.4 |1 |1001|4 |4 |
+---------+--------+---+----+---+----+
The above solution still would not give you the correct row-wise output, so what you can do is use a window function and select the correct row as
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

val windowSpec = Window.partitionBy("timestamp").orderBy("a")
df.withColumn("newavg", max("avgvalue") over windowSpec)
  .filter(col("newavg") === col("avgvalue"))
  .drop("newavg").show(false)
This will give row-wise correct data as
+---+---------+----+---+----+--------+
|a |timestamp|list|rid|sbid|avgvalue|
+---+---------+----+---+----+--------+
|1 |1021 |819 |1 |1 |2.1 |
|2 |1000 |819 |2 |3 |2.4 |
|1 |1011 |107 |1 |3 |5.4 |
+---+---------+----+---+----+--------+
You can use groupBy and find the max value for that particular group as
//If you have the dataframe as df than
df.groupBy("a", "timestamp").agg(max($"avgvalue").alias("maxAvgValue"))
Hope this helps
I saw the above answers. Below is one more that you can try as well:
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

case class Tags(a: Int, timestamp: Int, list: Int, rid: Int, sbid: Int, avgvalue: Double)
val rdd = sc.textFile("file:/home/hdfs/stackOverFlow")
  .map(x => x.split(","))
  .map(x => Tags(x(0).toInt, x(1).toInt, x(2).toInt, x(3).toInt, x(4).toInt, x(5).toDouble))
  .toDF
rdd.createTempView("v1")
val rdd2 = sqlContext.sql("select max(avgvalue) as max from v1 group by (a,timestamp)")
rdd2.createTempView("v2")
val rdd3 = sqlContext.sql("select a,timestamp,list,rid,sbid,avgvalue from v1 join v2 on v2.max=v1.avgvalue").show()
Output
+---+---------+----+---+----+--------+
| a|timestamp|list|rid|sbid|avgvalue|
+---+---------+----+---+----+--------+
| 2| 1000| 819| 2| 3| 2.4|
| 1| 1011| 107| 1| 3| 5.4|
| 1| 1021| 819| 1| 1| 2.1|
+---+---------+----+---+----+--------+
None of the other solutions provided here gave me the correct answer, so this is what worked for me, using row_number():
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val windowSpec = Window.partitionBy("timestamp").orderBy(desc("avgvalue"))
df.select("a", "timestamp", "list", "rid", "sbid", "avgvalue")
.withColumn("largest_avgvalue", row_number().over( windowSpec ))
.filter($"largest_avgvalue" === 1)
.drop("largest_avgvalue")
The other solutions had the following problems in my tests:
The solution with .agg(max(x).as(x), first(y).as(y), ...) doesn't work, because the first() function "will return the first value it sees" according to the documentation, which means it is non-deterministic.
The solution with .withColumn("x", max("y") over windowSpec.orderBy("m")) doesn't work, because the result of the max will be the same as the value already selected for the row. I believe the problem there is the orderBy().
Hence, the following also gives the correct answer, with max():
val windowSpec = Window.partitionBy("timestamp").orderBy(desc("avgvalue"))
df.select("a", "timestamp", "list", "rid", "sbid", "avgvalue")
.withColumn("largest_avgvalue", max("avgvalue").over( windowSpec ))
.filter($"largest_avgvalue" === $"avgvalue")
.drop("largest_avgvalue")