Finding the value of a column based on other 2 columns - dataframe

I have a specific problem: I want to derive the value of a bu_id field from the id and matched_id columns.
When matched_id has a value, that id and all the ids listed in its matched_id should share the same bu_id (in the expected output below, the smallest id of the group).
When matched_id is blank, bu_id should be the same as id.
input
+---+------------+
|id |matched_id |
+---+------------+
|0 |7,8 |
|1 | |
|2 |4 |
|3 |5,9 |
|4 |2 |
|5 |3,9 |
|6 | |
|7 |0,8 |
|8 |0,7 |
|9 |3,5 |
+---+------------+
output
+---+------------+-----+
|id |matched_id |bu_id|
+---+------------+-----+
|0 |7,8 |0 |
|1 | |1 |
|2 |4 |2 |
|3 |5,9 |3 |
|4 |2 |2 |
|5 |3,9 |3 |
|6 | |6 |
|7 |0,8 |0 |
|8 |0,7 |0 |
|9 |3,5 |3 |
+---+------------+-----+
Can anyone suggest how to approach this problem? Thanks in advance.

We should try to use functions exclusively from the pyspark.sql.functions module, because these are optimized for PySpark dataframes, whereas UDFs are not and should be avoided when possible.
To get the desired output, we can concatenate the "id" and "matched_id" columns with a comma, turn that string into a list of strings using split, cast the result to an array of integers, and take the minimum of the array. We don't have to worry about the blank strings: when ANSI mode is off they are cast to null, and F.array_min ignores nulls. This can be done with the following code (it is a little hard to read, but it gets the job done):
import pyspark.sql.functions as F

df = spark.createDataFrame(
    [
        ("0", "7,8"),
        ("1", ""),
        ("2", "4"),
        ("3", "5,9"),
        ("4", "2"),
        ("5", "3,9"),
        ("6", ""),
        ("7", "0,8"),
        ("8", "0,7"),
        ("9", "3,5"),
    ],
    ["id", "matched_id"]
)

# "id,matched_id" -> array<string> -> array<int> -> minimum (nulls ignored)
df.withColumn(
    "bu_id",
    F.array_min(
        F.split(
            F.concat(F.col("id"), F.lit(","), F.col("matched_id")), ","
        ).cast("array<int>")
    ),
).show()
Output:
+---+----------+-----+
| id|matched_id|bu_id|
+---+----------+-----+
| 0| 7,8| 0|
| 1| | 1|
| 2| 4| 2|
| 3| 5,9| 3|
| 4| 2| 2|
| 5| 3,9| 3|
| 6| | 6|
| 7| 0,8| 0|
| 8| 0,7| 0|
| 9| 3,5| 3|
+---+----------+-----+
Update: if the "id" and "matched_id" columns contain non-numeric strings, we can no longer cast to an array of integers. Instead, we can use the PySpark functions F.when and .otherwise to set the new column to "id" when "matched_id" is an empty string "", and apply the longer nested expression (without the cast) when "matched_id" is non-empty.
df2 = spark.createDataFrame(
    [
        ("0", "7,8"),
        ("1", ""),
        ("2", "4"),
        ("3", "5,9"),
        ("4", "2"),
        ("5", "3,9"),
        ("6", ""),
        ("7", "0,8"),
        ("8", "0,7"),
        ("9", "3,5"),
        ("x", ""),
        ("x", "y,z")
    ],
    ["id", "matched_id"]
)

# no cast here, so array_min works on an array of strings
df2.withColumn(
    "bu_id",
    F.when(
        F.col("matched_id") != "",
        F.array_min(
            F.split(F.concat(F.col("id"), F.lit(","), F.col("matched_id")), ",")
        ),
    ).otherwise(F.col("id")),
).show()
Output:
+---+----------+-----+
| id|matched_id|bu_id|
+---+----------+-----+
| 0| 7,8| 0|
| 1| | 1|
| 2| 4| 2|
| 3| 5,9| 3|
| 4| 2| 2|
| 5| 3,9| 3|
| 6| | 6|
| 7| 0,8| 0|
| 8| 0,7| 0|
| 9| 3,5| 3|
| x| | x|
| x| y,z| x|
+---+----------+-----+
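One caveat with the string version: without the cast, F.array_min compares strings rather than numbers, so for ids with more than one digit the result can differ from the numeric minimum (for example, "10" sorts before "9"). The plain-Python sketch below illustrates the same ordering rule outside Spark; the sample values are made up for illustration and do not come from the question's data.

```python
# Plain-Python illustration of string vs numeric ordering; no Spark required.
# The ids below are hypothetical sample values.
ids_as_strings = ["9", "10"]

# Strings are compared character by character, so "10" sorts before "9".
lexicographic_min = min(ids_as_strings)

# Converting to int first gives the numeric minimum.
numeric_min = min(int(x) for x in ids_as_strings)

print(lexicographic_min)  # 10
print(numeric_min)        # 9
```

If the ids are numeric but simply stored as strings, the first version with the cast to array<int> is the safer choice.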

To answer this question, I assumed that the logic you are looking to implement is:
If the matched_id column is empty, then bu_id should be the same as id.
If the matched_id column is not empty, we should consider the values listed in both the id and matched_id columns, and bu_id should be the minimum of those values.
The Set-Up
# imports to include
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType
# making your dataframe
df = spark.createDataFrame(
[
('0','7,8'),
('1',''),
('2','4'),
('3','5,9'),
('4','2'),
('5','3,9'),
('6',''),
('7','0,8'),
('8','0,7'),
('9','3,5'),
],
['id', 'matched_id'])
print(df.schema.fields)
df.show(truncate=False)
In this df, both the id and matched_id columns are StringType. The code that follows builds on this assumption. You can check the column types in your df by running print(df.schema.fields).
+---+----------+
|id |matched_id|
+---+----------+
|0  |7,8       |
|1  |          |
|2  |4         |
|3  |5,9       |
|4  |2         |
|5  |3,9       |
|6  |          |
|7  |0,8       |
|8  |0,7       |
|9  |3,5       |
+---+----------+
The Logic
To implement the logic for bu_id, we create a function called bu_calculation that defines the logic, then wrap that function in a PySpark SQL UDF. The bu_id column is then created by passing the columns we need to evaluate (the id and matched_id columns) into the UDF.
# create custom function with the logic for bu_id
def bu_calculation(id_col, matched_id_col):
    id_int = int(id_col)
    # turn the string in the matched_id column into a list and remove empty values from the list
    matched_id_list = list(filter(None, matched_id_col.split(",")))
    if len(matched_id_list) > 0:
        # if matched_id column has values, convert strings to ints
        all_ids = [int(x) for x in matched_id_list]
        # join id column values with matched_id column values
        all_ids.append(id_int)
        # return minimum value
        return min(all_ids)
    else:
        # if matched_id column is empty return the id column value
        return id_int

# apply custom bu_calculation function to pyspark sql udf
# IntegerType() declares the UDF's return type, so bu_calculation has to return an int
bu_udf = F.udf(bu_calculation, IntegerType())
# make a new column called bu_id using the pyspark sql udf we created called bu_udf
df = df.withColumn('bu_id', bu_udf('id', 'matched_id'))
df.show(truncate=False)
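+---+----------+-----+
|id |matched_id|bu_id|
+---+----------+-----+
|0  |7,8       |0    |
|1  |          |1    |
|2  |4         |2    |
|3  |5,9       |3    |
|4  |2         |2    |
|5  |3,9       |3    |
|6  |          |6    |
|7  |0,8       |0    |
|8  |0,7       |0    |
|9  |3,5       |3    |
+---+----------+-----+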
More about the pyspark sql udf function here: https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.functions.udf.html
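Because bu_calculation is an ordinary Python function, it can be checked without Spark before it is wrapped in a UDF. The sketch below is a hypothetical variant, not part of the answer above: the name bu_calculation_null_safe and the extra None check are assumptions, added because calling .split on a real null (None) would raise an error, whereas the sample data only contains empty strings.

```python
def bu_calculation_null_safe(id_col, matched_id_col):
    """Hypothetical null-safe variant of bu_calculation (same rule, extra None check)."""
    id_int = int(id_col)
    # Treat None (a Spark null) like an empty string, then drop empty pieces.
    matched = [int(x) for x in (matched_id_col or "").split(",") if x.strip()]
    # The row's own id always takes part in the minimum.
    return min(matched + [id_int])


print(bu_calculation_null_safe("7", "0,8"))  # 0
print(bu_calculation_null_safe("1", ""))     # 1
print(bu_calculation_null_safe("6", None))   # 6
```

It could be registered the same way as the original, with F.udf(bu_calculation_null_safe, IntegerType()).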

Related

Convert Column of List to a Dataframe Column

I have a column of lists in a spark dataframe.
+---+------------+
|c1 |c2          |
+---+------------+
|a  |[1, 0, 1, 1]|
|b  |[0, 1, 1, 0]|
|c  |[1, 1, 0, 0]|
+---+------------+
How do I convert this into another spark dataframe where each list is turned into a dataframe column? Also each entry from column 'c1' is the name of the new column created. Something like below.
+---+---+---+
|a  |b  |c  |
+---+---+---+
|1  |0  |1  |
|0  |1  |1  |
|1  |1  |0  |
|1  |0  |0  |
+---+---+---+
Note: I did think about following this: Convert Column of List to Dataframe and then taking a transpose of the resultant matrix. But, this creates quite a lot of columns [as the size of the list data I have is pretty huge] and therefore isn't an efficient solution.
Any help is welcome.
import pyspark.sql.functions as F
#Not a part of the solution, only used to generate the data sample
df = spark.sql("select stack(3 ,'a',array(1, 0, 1, 1), 'b',array(0, 1, 1, 0) ,'c',array(1, 1, 0, 0)) as (c1,c2)")
df.groupBy().pivot('c1').agg(F.first('c2')).selectExpr('inline(arrays_zip(*))').show()
+---+---+---+
| a| b| c|
+---+---+---+
| 1| 0| 1|
| 0| 1| 1|
| 1| 1| 0|
| 1| 0| 0|
+---+---+---+
This can easily be tested on large datasets:
df = spark.sql("select id as c1, transform(sequence(1,10000), e -> tinyint(round(rand()))) as c2 from range(10000)")
I just completed a successful run with 10K arrays of 10K elements each, on a VM with 4 cores and 32 GB RAM (Azure Databricks).
It took 5.35 minutes.
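Conceptually, the pivot / arrays_zip / inline chain above is a transpose: each list becomes a column named after its c1 value, and the i-th elements of all lists form the i-th output row. A minimal plain-Python sketch of that idea, using the sample data from the question (no Spark involved, so it says nothing about performance):

```python
# Sample data from the question: (c1, c2) pairs.
rows = [
    ("a", [1, 0, 1, 1]),
    ("b", [0, 1, 1, 0]),
    ("c", [1, 1, 0, 0]),
]

# Column names come from c1.
column_names = [name for name, _ in rows]

# zip(*lists) pairs up the i-th element of every list, i.e. a transpose.
transposed = list(zip(*(values for _, values in rows)))

print(column_names)  # ['a', 'b', 'c']
for record in transposed:
    print(record)
```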

How to append data to a column value in dataframe

In Spark, I have a dataframe with a column named goals that holds a numeric value. I just want to append the string "goal" or "goals" to the actual value.
I want to print it as follows:
value = 1 gives 1 goal
value = 2 gives 2 goals, and so on.
My data looks like this
val goalsDF = Seq(("meg", 2), ("meg", 4), ("min", 3),
("min2", 1), ("ss", 1)).toDF("name", "goals")
goalsDF.show()
+-----+-----+
|name |goals|
+-----+-----+
|meg |2 |
|meg |4 |
|min |3 |
|min2 |1 |
|ss |1 |
+-----+-----+
Expected Output:
+-----+---------+
|name |goals |
+-----+---------+
|meg |2 goals |
|meg |4 goals |
|min |3 goals |
|min2 |1 goal |
|ss |1 goal |
+-----+---------+
I tried the code below, but it doesn't work and prints the data as null:
goalsDF.withColumn("goals", col("goals") + lit("goals")).show()
+----+-----+
|name|goals|
+----+-----+
| meg| null|
| meg| null|
| min| null|
|min2| null|
| ss| null|
+----+-----+
Please suggest whether we can do this inside .withColumn() without any additional user-defined method.
You should use a case-when expression. This is a PySpark example, but you should be able to adapt it to Scala.
import pyspark.sql.functions as F
goalsDF.withColumn(
    "goals",
    F.when(F.col("goals") == 1, F.lit("1 goal"))
    .otherwise(F.concat_ws(" ", F.col("goals").cast("string"), F.lit("goals")))
)
For a Scala example, see here: https://stackoverflow.com/a/37108127/5899997
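The attempt in the question returns null because + on columns is arithmetic addition: Spark tries to treat the string "goals" as a number, which gives null, so the sum is null as well. String concatenation needs concat or concat_ws instead. The per-row rule itself is small; here it is as a plain-Python sketch (the helper name goal_label is hypothetical and only mirrors the when/otherwise logic):

```python
def goal_label(goals):
    # Mirrors: when(goals == 1, "1 goal").otherwise(concat_ws(" ", goals, "goals"))
    return "1 goal" if goals == 1 else f"{goals} goals"


# Sample rows from the question.
rows = [("meg", 2), ("meg", 4), ("min", 3), ("min2", 1), ("ss", 1)]
labelled = [(name, goal_label(goals)) for name, goals in rows]

for name, label in labelled:
    print(name, label)
```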

Determine if pyspark DataFrame row value is present in other columns

I'm working with a dataframe in pyspark, and need to evaluate row by row if a value is present in other columns of the dataframe. As an example, given this dataframe:
df:
+---------+--------------+-------+-------+-------+
|Subject |SubjectTotal |TypeA |TypeB |TypeC |
+---------+--------------+-------+-------+-------+
|Subject1 |10 |5 |3 |2 |
+---------+--------------+-------+-------+-------+
|Subject2 |15 |0 |15 |0 |
+---------+--------------+-------+-------+-------+
|Subject3 |5 |0 |0 |5 |
+---------+--------------+-------+-------+-------+
As an output, I need to determine which Type has 100% of the SubjectTotal. So my output would look like this:
df_output:
+---------+--------------+
|Subject |Type |
+---------+--------------+
|Subject2 |TypeB |
+---------+--------------+
|Subject3 |TypeC |
+---------+--------------+
Is it even possible?
Thanks!
You can try the when().otherwise() PySpark SQL function, or a CASE statement in SQL:
import pyspark.sql.functions as F
df = spark.createDataFrame(
[
("Subject1", 10, 5, 3, 2),
("Subject2", 15, 0, 15, 0),
("Subject3", 5, 0, 0, 5)
],
("subject", "subjectTotal", "TypeA", "TypeB", "TypeC"))
df.show()
+--------+------------+-----+-----+-----+
| subject|subjectTotal|TypeA|TypeB|TypeC|
+--------+------------+-----+-----+-----+
|Subject1| 10| 5| 3| 2|
|Subject2| 15| 0| 15| 0|
|Subject3| 5| 0| 0| 5|
+--------+------------+-----+-----+-----+
df.withColumn(
    "Type",
    F.when(F.col("subjectTotal") == F.col("TypeA"), "TypeA")
    .when(F.col("subjectTotal") == F.col("TypeB"), "TypeB")
    .when(F.col("subjectTotal") == F.col("TypeC"), "TypeC")
    .otherwise(None),
).show()
+--------+------------+-----+-----+-----+-----+
| subject|subjectTotal|TypeA|TypeB|TypeC| Type|
+--------+------------+-----+-----+-----+-----+
|Subject1| 10| 5| 3| 2| null|
|Subject2| 15| 0| 15| 0|TypeB|
|Subject3| 5| 0| 0| 5|TypeC|
+--------+------------+-----+-----+-----+-----+
You can use when expression within a list comprehension over all columns TypeX, then coalesce the list of expressions:
from pyspark.sql import functions as F
df1 = df.select(
F.col("Subject"),
F.coalesce(*[F.when(F.col(c) == F.col("SubjectTotal"), F.lit(c)) for c in df.columns[2:]]).alias("Type")
).filter("Type is not null")
df1.show()
#+--------+-----+
#| Subject| Type|
#+--------+-----+
#|Subject2|TypeB|
#|Subject3|TypeC|
#+--------+-----+
You can unpivot the dataframe using stack and filter the rows where SubjectTotal is equal to the value in the type columns:
df2 = df.selectExpr(
'Subject',
'SubjectTotal',
"stack(3, 'TypeA', TypeA, 'TypeB', TypeB, 'TypeC', TypeC) as (type, val)"
).filter('SubjectTotal = val').select('Subject', 'type')
df2.show()
+--------+-----+
| Subject| type|
+--------+-----+
|Subject2|TypeB|
|Subject3|TypeC|
+--------+-----+
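All three answers apply the same row-level rule: report the first Type column whose value equals SubjectTotal, and drop rows where none does. A plain-Python sketch of that rule on the question's data (the helper name full_share_type is hypothetical):

```python
def full_share_type(row, type_columns):
    """Return the first type column whose value equals SubjectTotal, else None."""
    for column in type_columns:
        if row[column] == row["SubjectTotal"]:
            return column
    return None


rows = [
    {"Subject": "Subject1", "SubjectTotal": 10, "TypeA": 5, "TypeB": 3, "TypeC": 2},
    {"Subject": "Subject2", "SubjectTotal": 15, "TypeA": 0, "TypeB": 15, "TypeC": 0},
    {"Subject": "Subject3", "SubjectTotal": 5, "TypeA": 0, "TypeB": 0, "TypeC": 5},
]
type_columns = ["TypeA", "TypeB", "TypeC"]

result = []
for row in rows:
    matched_type = full_share_type(row, type_columns)
    if matched_type is not None:  # same effect as filter("Type is not null")
        result.append((row["Subject"], matched_type))

print(result)
```

Note that a row with SubjectTotal equal to 0 would match every Type column that is also 0; the when and coalesce versions would then report only the first such column, while the stack version would return one row per matching column.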

Apache Spark SQL: How to use GroupBy and Max to filter data

I have a given dataset with the following structure:
https://i.imgur.com/Kk7I1S1.png
I need to solve the below problem using SparkSQL: Dataframes
For each postcode find the customer that has had the most number of previous accidents. In the case of a tie, meaning more than one customer have the same highest number of accidents, just return any one of them. For each of these selected customers output the following columns: postcode, customer id, number of previous accidents.
I think you forgot to provide the data mentioned in the image link, so I created my own dataset using your problem as a reference. You can use the code snippet below as a reference, replacing the df dataframe with your own dataset and adding any required columns such as id.
scala> val df = spark.read.format("csv").option("header","true").load("/user/nikhil/acc.csv")
df: org.apache.spark.sql.DataFrame = [postcode: string, customer: string ... 1 more field]
scala> df.show()
+--------+--------+---------+
|postcode|customer|accidents|
+--------+--------+---------+
| 1| Nikhil| 5|
| 2| Ram| 4|
| 1| Shyam| 3|
| 3| pranav| 1|
| 1| Suman| 2|
| 3| alex| 2|
| 2| Raj| 5|
| 4| arpit| 3|
| 1| darsh| 2|
| 1| rahul| 3|
| 2| kiran| 4|
| 3| baba| 4|
| 4| alok| 3|
| 1| Nakul| 5|
+--------+--------+---------+
scala> df.createOrReplaceTempView("tmptable")
scala> spark.sql(s"""SELECT postcode,customer, accidents FROM (SELECT postcode,customer, accidents, row_number() over (PARTITION BY postcode ORDER BY accidents desc) as rn from tmptable) WHERE rn = 1""").show(false)
+--------+--------+---------+
|postcode|customer|accidents|
+--------+--------+---------+
|3 |baba |4 |
|1 |Nikhil |5 |
|4 |arpit |3 |
|2 |Raj |5 |
+--------+--------+---------+
You can get the result with the following code in python:
from pyspark.sql import Row, Window
import pyspark.sql.functions as F
from pyspark.sql.window import *
l = [(1, '682308', 25), (1, '682308', 23), (2, '682309', 23), (1, '682309', 27), (2, '682309', 22)]
rdd = sc.parallelize(l)
people = rdd.map(lambda x: Row(c_id=int(x[0]), postcode=x[1], accident=int(x[2])))
schemaPeople = sqlContext.createDataFrame(people)
result = schemaPeople.groupby("postcode", "c_id").agg(F.max("accident").alias("accident"))
new_result = result.withColumn("row_num", F.row_number().over(Window.partitionBy("postcode").orderBy(F.desc("accident")))).filter("row_num==1")
new_result.show()
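Both answers rely on the same idea: number the rows of each postcode by descending accident count and keep row number 1. The plain-Python sketch below shows the equivalent "keep the best row per key" logic on a subset of the sample data from the Scala answer; it only illustrates the selection rule and is not a replacement for the window function.

```python
# (postcode, customer, accidents): a subset of the sample rows above.
rows = [
    ("1", "Nikhil", 5),
    ("2", "Ram", 4),
    ("1", "Shyam", 3),
    ("3", "pranav", 1),
    ("3", "baba", 4),
    ("2", "Raj", 5),
    ("4", "arpit", 3),
    ("4", "alok", 3),
]

best = {}
for postcode, customer, accidents in rows:
    # Strict ">" keeps the first customer seen when there is a tie,
    # which is fine because any one of the tied customers is acceptable.
    if postcode not in best or accidents > best[postcode][1]:
        best[postcode] = (customer, accidents)

for postcode in sorted(best):
    print(postcode, *best[postcode])
```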

group by and filter highest value in data frame in scala

I have some data like this:
a,timestamp,list,rid,sbid,avgvalue
1,1011,1001,4,4,1.20
2,1000,819,2,3,2.40
1,1011,107,1,3,5.40
1,1021,819,1,1,2.10
In the data above, I want to find, for each tag (a) and timestamp, the row with the highest avg value. For example:
For timestamp 1011 and a = 1:
1,1011,1001,4,4,1.20
1,1011,107,1,3,5.40
The output would be:
1,1011,107,1,3,5.40 // because for timestamp 1011 and tag 1 the highest avg value is 5.40
So I need to pick this row.
I tried this statement, but it still does not work properly:
val highvaluetable = df.registerTempTable("high_value")
val highvalue = sqlContext.sql("select a,timestamp,list,rid,sbid,avgvalue from high_value")
highvalue.select($"a",$"timestamp",$"list",$"rid",$"sbid",$"avgvalue".cast(IntegerType).as("higher_value")).groupBy("a","timestamp").max("higher_value")
highvalue.collect.foreach(println)
Any help will be appreciated.
After I applied some of your suggestions, I am still getting duplicates in my data.
+---+----------+----+----+----+----+
|a| timestamp| list|rid|sbid|avgvalue|
+---+----------+----+----+----+----+
| 4|1496745915| 718| 4| 3|0.30|
| 4|1496745918| 362| 4| 3|0.60|
| 4|1496745913| 362| 4| 3|0.60|
| 2|1496745918| 362| 4| 3|0.10|
| 3|1496745912| 718| 4| 3|0.05|
| 2|1496745918| 718| 4| 3|0.30|
| 4|1496745911|1901| 4| 3|0.60|
| 4|1496745912| 718| 4| 3|0.60|
| 2|1496745915| 362| 4| 3|0.30|
| 2|1496745915|1901| 4| 3|0.30|
| 2|1496745910|1901| 4| 3|0.30|
| 3|1496745915| 362| 4| 3|0.10|
| 4|1496745918|3878| 4| 3|0.10|
| 4|1496745915|1901| 4| 3|0.60|
| 4|1496745912| 362| 4| 3|0.60|
| 4|1496745914|1901| 4| 3|0.60|
| 4|1496745912|3878| 4| 3|0.10|
| 4|1496745912| 718| 4| 3|0.30|
| 3|1496745915|3878| 4| 3|0.05|
| 4|1496745914| 362| 4| 3|0.60|
+---+----------+----+----+----+----+
4|1496745918| 362| 4| 3|0.60|
4|1496745918|3878| 4| 3|0.10|
These two rows have the same timestamp and the same tag, so they are considered duplicates.
This is my code:
rdd.createTempView("v1")
val rdd2=sqlContext.sql("select max(avgvalue) as max from v1 group by (a,timestamp)")
rdd2.createTempView("v2")
val rdd3=sqlContext.sql("select a,timestamp,list,rid,sbid,avgvalue from v1 join v2 on v2.max=v1.avgvalue").show()
You can use the DataFrame API to find the max, as below:
df.groupBy("timestamp").agg(max("avgvalue"))
This will give you the following output:
+---------+-------------+
|timestamp|max(avgvalue)|
+---------+-------------+
|1021 |2.1 |
|1000 |2.4 |
|1011 |5.4 |
+---------+-------------+
This doesn't include the other fields you require, so you can add them with first:
df.groupBy("timestamp").agg(max("avgvalue") as "avgvalue", first("a") as "a", first("list") as "list", first("rid") as "rid", first("sbid") as "sbid")
You should get this output:
+---------+--------+---+----+---+----+
|timestamp|avgvalue|a |list|rid|sbid|
+---------+--------+---+----+---+----+
|1021 |2.1 |1 |819 |1 |1 |
|1000 |2.4 |2 |819 |2 |3 |
|1011 |5.4 |1 |1001|4 |4 |
+---------+--------+---+----+---+----+
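The above solution still would not give you the correct row-wise output (first does not necessarily pick its values from the row that holds the max), so you can instead use a window function and select the correct row:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window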
val windowSpec = Window.partitionBy("timestamp").orderBy("a")
df.withColumn("newavg", max("avgvalue") over windowSpec)
.filter(col("newavg") === col("avgvalue"))
.drop("newavg").show(false)
This will give the correct row-wise data:
+---+---------+----+---+----+--------+
|a |timestamp|list|rid|sbid|avgvalue|
+---+---------+----+---+----+--------+
|1 |1021 |819 |1 |1 |2.1 |
|2 |1000 |819 |2 |3 |2.4 |
|1 |1011 |107 |1 |3 |5.4 |
+---+---------+----+---+----+--------+
You can use groupBy and find the max value for each particular group:
// If you have the dataframe as df, then
df.groupBy("a", "timestamp").agg(max($"avgvalue").alias("maxAvgValue"))
Hope this helps.
I saw the above answers. Below is another one you can try as well:
val sqlContext=new SQLContext(sc)
case class Tags(a:Int,timestamp:Int,list:Int,rid:Int,sbid:Int,avgvalue:Double)
val rdd=sc.textFile("file:/home/hdfs/stackOverFlow").map(x=>x.split(",")).map(x=>Tags(x(0).toInt,x(1).toInt,x(2).toInt,x(3).toInt,x(4).toInt,x(5).toDouble)).toDF
rdd.createTempView("v1")
val rdd2=sqlContext.sql("select a, timestamp, max(avgvalue) as max from v1 group by a, timestamp")
rdd2.createTempView("v2")
val rdd3=sqlContext.sql("select v1.a, v1.timestamp, v1.list, v1.rid, v1.sbid, v1.avgvalue from v1 join v2 on v1.a = v2.a and v1.timestamp = v2.timestamp and v1.avgvalue = v2.max").show()
Output
+---+---------+----+---+----+--------+
| a|timestamp|list|rid|sbid|avgvalue|
+---+---------+----+---+----+--------+
| 2| 1000| 819| 2| 3| 2.4|
| 1| 1011| 107| 1| 3| 5.4|
| 1| 1021| 819| 1| 1| 2.1|
+---+---------+----+---+----+--------+
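A likely cause of the duplicates reported in the question is the join condition v2.max = v1.avgvalue with no group keys: a row then matches whenever its avgvalue equals the max of any group, not just its own. The plain-Python sketch below reproduces that effect on four rows taken from the question's table; the variable names are made up for illustration.

```python
# (a, timestamp, list, avgvalue): four rows from the question's table.
rows = [
    (4, 1496745918, 362, 0.60),
    (4, 1496745918, 3878, 0.10),
    (3, 1496745915, 362, 0.10),
    (3, 1496745915, 3878, 0.05),
]

# Max avgvalue per (a, timestamp) group.
group_max = {}
for a, ts, _list_id, avg in rows:
    key = (a, ts)
    group_max[key] = max(avg, group_max.get(key, avg))

# Join on the value only: a row matches the max of ANY group.
value_only = [r for r in rows if r[3] in group_max.values()]

# Join on the group keys as well: a row matches only its own group's max.
keyed = [r for r in rows if r[3] == group_max[(r[0], r[1])]]

print(len(value_only))  # 3: (4, 1496745918, 3878, 0.10) slips through
print(len(keyed))       # 2: one row per group
```

Even with the keys in the join, two rows of the same group that tie on the max value would both be returned; row_number() is one way to keep exactly one.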
None of the other solutions provided here gave me the correct answer, so this is what worked for me, using row_number():
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val windowSpec = Window.partitionBy("timestamp").orderBy(desc("avgvalue"))
df.select("a", "timestamp", "list", "rid", "sbid", "avgvalue")
.withColumn("largest_avgvalue", row_number().over( windowSpec ))
.filter($"largest_avgvalue" === 1)
.drop("largest_avgvalue")
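The other solutions had the following problems in my tests:
The solution with .agg( max(x).as(x), first(y).as(y), ... ) doesn't work because the first() function "will return the first value it sees" according to the documentation, which means it is non-deterministic and need not come from the row that holds the max.
The solution with .withColumn("x", max("y") over windowSpec.orderBy("m") ) doesn't work because the max can come out equal to the row's own value, so the filter keeps rows it should drop. I believe the problem there is the orderBy(): once a window is ordered, the default frame only runs from the start of the partition up to the current row, so max() becomes a running max over the rows in "m" order.
With the window ordered by desc("avgvalue"), the first row of each partition already holds the largest value, so the following also gives the correct answer with max():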
val windowSpec = Window.partitionBy("timestamp").orderBy(desc("avgvalue"))
df.select("a", "timestamp", "list", "rid", "sbid", "avgvalue")
.withColumn("largest_avgvalue", max("avgvalue").over( windowSpec ))
.filter($"largest_avgvalue" === $"avgvalue")
.drop("largest_avgvalue")
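To see why the ordering matters, it helps to model max() over an ordered window as a running max. The plain-Python sketch below does that for a single partition; it assumes the ordering keys are all distinct (rows that tie on the ordering key share a frame in Spark, which this sketch ignores), and the values and helper name are made up for illustration.

```python
from itertools import accumulate


def keep_rows_matching_running_max(ordered_values):
    """Keep the values that equal the running max at their position."""
    running_max = list(accumulate(ordered_values, max))
    return [v for v, m in zip(ordered_values, running_max) if v == m]


# avgvalue of one partition, in the order given by some unrelated column.
ordered_by_other_column = [1.2, 5.4, 2.1]
# The same values ordered by avgvalue descending.
ordered_by_value_desc = sorted(ordered_by_other_column, reverse=True)

print(keep_rows_matching_running_max(ordered_by_other_column))  # [1.2, 5.4]
print(keep_rows_matching_running_max(ordered_by_value_desc))    # [5.4]
```

With an unrelated ordering, the first row always equals its own running max and survives the filter; with a descending order on the value itself, only the true maximum does.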