I have two PySpark dataframes, which are given below.
The first is df1:
+-----+-----+----------+-----+
| name| type|timestamp1|score|
+-----+-----+----------+-----+
|name1|type1|2012-01-10|   11|
|name2|type1|2012-01-10|   14|
|name3|type2|2012-01-10|    2|
|name3|type2|2012-01-17|    3|
|name1|type1|2012-01-18|   55|
|name1|type1|2012-01-19|   10|
+-----+-----+----------+-----+
The second is df2:
+-----+-------------------+-------+-------+
| name| timestamp2|string1|string2|
+-----+-------------------+-------+-------+
|name1|2012-01-10 00:00:00|      A|     aa|
|name2|2012-01-10 00:00:00|      A|     bb|
|name3|2012-01-10 00:00:00|      C|     cc|
|name4|2012-01-17 00:00:00|      D|     dd|
|name3|2012-01-10 00:00:00|      C|     cc|
|name2|2012-01-17 00:00:00|      A|     bb|
|name2|2012-01-17 00:00:00|      A|     bb|
|name4|2012-01-10 00:00:00|      D|     dd|
|name3|2012-01-17 00:00:00|      C|     cc|
+-----+-------------------+-------+-------+
These two dataframes share one common column, name. Each unique value of name in df2 has a single pair of string1 and string2 values.
I want to join df1 and df2 into a new dataframe df3 such that df3 contains all the rows of df1 (same structure and number of rows as df1) but with the string1 and string2 values from df2 attached to the matching name in each row. The following is how I want the combined dataframe (df3) to look.
+-----+-----+----------+-----+-------+-------+
| name| type|timestamp1|score|string1|string2|
+-----+-----+----------+-----+-------+-------+
|name1|type1|2012-01-10|   11|      A|     aa|
|name2|type1|2012-01-10|   14|      A|     bb|
|name3|type2|2012-01-10|    2|      C|     cc|
|name3|type2|2012-01-17|    3|      C|     cc|
|name1|type1|2012-01-18|   55|      A|     aa|
|name1|type1|2012-01-19|   10|      A|     aa|
+-----+-----+----------+-----+-------+-------+
How can I get the above-mentioned dataframe (df3)?
I tried df3 = df1.join(df2.select("name", "string1", "string2"), on=["name"], how="left"), but that gives me a dataframe with 14 rows containing duplicate entries.
You can use the code below to generate df1 and df2.
from pyspark.sql import Row
df1_Stats = Row("name", "type", "timestamp1", "score")
df1_stat1 = df1_Stats('name1', 'type1', "2012-01-10", 11)
df1_stat2 = df1_Stats('name2', 'type1', "2012-01-10", 14)
df1_stat3 = df1_Stats('name3', 'type2', "2012-01-10", 2)
df1_stat4 = df1_Stats('name3', 'type2', "2012-01-17", 3)
df1_stat5 = df1_Stats('name1', 'type1', "2012-01-18", 55)
df1_stat6 = df1_Stats('name1', 'type1', "2012-01-19", 10)
df1_stat_lst = [df1_stat1 , df1_stat2, df1_stat3, df1_stat4, df1_stat5, df1_stat6]
df1 = spark.createDataFrame(df1_stat_lst)
df2_Stats = Row("name", "timestamp2", "string1", "string2")
df2_stat1 = df2_Stats("name1", "2012-01-10 00:00:00", "A", "aa")
df2_stat2 = df2_Stats("name2", "2012-01-10 00:00:00", "A", "bb")
df2_stat3 = df2_Stats("name3", "2012-01-10 00:00:00", "C", "cc")
df2_stat4 = df2_Stats("name4", "2012-01-17 00:00:00", "D", "dd")
df2_stat5 = df2_Stats("name3", "2012-01-10 00:00:00", "C", "cc")
df2_stat6 = df2_Stats("name2", "2012-01-17 00:00:00", "A", "bb")
df2_stat7 = df2_Stats("name2", "2012-01-17 00:00:00", "A", "bb")
df2_stat8 = df2_Stats("name4", "2012-01-10 00:00:00", "D", "dd")
df2_stat9 = df2_Stats("name3", "2012-01-17 00:00:00", "C", "cc")
df2_stat_lst = [
    df2_stat1,
    df2_stat2,
    df2_stat3,
    df2_stat4,
    df2_stat5,
    df2_stat6,
    df2_stat7,
    df2_stat8,
    df2_stat9,
]
df2 = spark.createDataFrame(df2_stat_lst)
It would be better to remove duplicates before joining, so that you join against a smaller table:
df3 = df1.join(df2.select("name", "string1", "string2").distinct(), on=["name"], how="left")
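Since each name in df2 carries a single (string1, string2) pair, the deduplicated lookup has exactly one row per name, so the left join cannot multiply rows. A quick sanity check (a sketch, assuming df1, df2 and df3 from the snippets above):
# df3 should keep exactly df1's rows, just widened with string1/string2.
assert df3.count() == df1.count()
df3.show()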
Apparently the following technique does it:
df3 = df1.join(
df2.select("name", "string1", "string2"), on=["name"], how="left"
).dropDuplicates()
df3.show()
+-----+-----+----------+-----+-------+-------+
| name| type|timestamp1|score|string1|string2|
+-----+-----+----------+-----+-------+-------+
|name2|type1|2012-01-10|   14|      A|     bb|
|name3|type2|2012-01-10|    2|      C|     cc|
|name1|type1|2012-01-18|   55|      A|     aa|
|name1|type1|2012-01-10|   11|      A|     aa|
|name3|type2|2012-01-17|    3|      C|     cc|
|name1|type1|2012-01-19|   10|      A|     aa|
+-----+-----+----------+-----+-------+-------+
I am still open to answers, so if you have a more efficient method, please feel free to post it.
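One further tweak worth trying (an assumption about relative table sizes, not something verified on your data): after deduplication the (name, string1, string2) lookup is tiny, so broadcasting it lets the join run without shuffling df1. A minimal sketch:
from pyspark.sql.functions import broadcast

lookup = df2.select("name", "string1", "string2").distinct()
# Broadcast the small lookup so the join happens map-side, without shuffling df1.
df3 = df1.join(broadcast(lookup), on=["name"], how="left")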
Related
I am doing an aggregation with groupBy and applying some filters on my existing Spark/Scala DataFrame, but while executing my code I get "cannot resolve 'flag' given input columns".
val someDF = Seq(
  (1, 111, 100, 100, "C", "5th", "Y", 11),
  (1, 111, 100, 100, "C", "5th", "Y", 11),
  (2, 222, 200, 200, "C", "5th", "Y", 22),
  (2, 222, 200, 200, "C", "5th", "Y", 22)
).toDF("id", "rollno", "sub1", "sub2", "flag", "class", "status", "sno")
var df2 = someDF.groupBy("id","rollno")
.agg(sum("sub1").alias("sub1"),sum("sub2").alias("sub2"))
.filter(col("flag") === "C")
.filter(length(col("rollno")) >= 2)
.filter(col("class") === ("5th") || col("class") === ("6th"))
.filter(substring(col("rollno"), 1, 2) === col("sno"))
.filter(col("status") === "Y")
.select("id", "rollno", "sub1", "sub2", "flag", "class", "sno", "status")
Error:
org.apache.spark.sql.AnalysisException: cannot resolve 'flag' given input columns: [id, rollno, sub1, sub2];;
'Filter ('flag = C)
Expected result:
+---+------+----+----+----+-----+------+---+
| id|rollno|sub1|sub2|flag|class|status|sno|
+---+------+----+----+----+-----+------+---+
|  1|   111| 200| 200|   C|  5th|     Y| 11|
|  2|   222| 400| 400|   C|  5th|     Y| 22|
+---+------+----+----+----+-----+------+---+
After the aggregation the other columns have disappeared, so you can't filter on them. You need to filter before the groupBy, and you also need to group by the other columns if you want to keep them.
var df2 = someDF
.filter(col("flag") === "C")
.filter(length(col("rollno")) >= 2)
.filter(col("class") === ("5th") || col("class") === ("6th"))
.filter(substring(col("rollno"), 1, 2) === col("sno"))
.filter(col("status") === "Y")
.groupBy("id", "rollno", "flag", "class", "sno", "status")
.agg(sum("sub1").alias("sub1"),sum("sub2").alias("sub2"))
.select("id", "rollno", "sub1", "sub2", "flag", "class", "sno", "status")
df2.show
+---+------+----+----+----+-----+---+------+
| id|rollno|sub1|sub2|flag|class|sno|status|
+---+------+----+----+----+-----+---+------+
|  1|   111| 200| 200|   C|  5th| 11|     Y|
|  2|   222| 400| 400|   C|  5th| 22|     Y|
+---+------+----+----+----+-----+---+------+
I have a PySpark dataframe:
+---+----+-----+----+
| ID|colA| colB|colC|
+---+----+-----+----+
|ID1|   3| 5.85|  LB|
|ID2|   4|12.67|  RF|
|ID3|   2|20.78| LCM|
|ID4|   1|    2| LWB|
|ID5|   6|    3|  LF|
|ID6|   7|    4|  LM|
|ID7|   8|    5|  RS|
+---+----+-----+----+
My goal is to replace the values in colC, e.g. replace the values LB, LWB, and LF with x, and so on as shown below.
x = ['LB', 'LWB', 'LF']
y = ['RF', 'LCM']
z = ['LM', 'RS']
Currently I'm able to achieve this by replacing each of the values manually, as in the code below:
# Replacing the values LB, LWB, LF with x
df_new = df.withColumn('ColC', f.when(
    (f.col('ColC') == 'LB') | (f.col('ColC') == 'LWB') | (f.col('ColC') == 'LF'), 'x'
).otherwise(df.ColC))
My question here is: how can we replace the values of a column (colC in my example) by iterating through a list (x, y, z) dynamically, all at once, using PySpark? What is the time complexity involved? Also, how can we truncate the decimal values in colB to 1 decimal place?
You can coalesce the when statements if you have many conditions to match. You can also use a dictionary to hold the label-to-values mappings and construct the when statements dynamically with a comprehension over its items. As for rounding to 1 decimal place, you can use round.
import pyspark.sql.functions as F

xyz_dict = {'x': ['LB', 'LWB', 'LF'],
            'y': ['RF', 'LCM'],
            'z': ['LM', 'RS']}

df2 = df.withColumn(
    'colC',
    F.coalesce(*[F.when(F.col('colC').isin(v), k) for (k, v) in xyz_dict.items()])
).withColumn(
    'colB',
    F.round('colB', 1)
)
df2.show()
+---+----+----+----+
| ID|colA|colB|colC|
+---+----+----+----+
|ID1|   3| 5.9|   x|
|ID2|   4|12.7|   y|
|ID3|   2|20.8|   y|
|ID4|   1| 2.0|   x|
|ID5|   6| 3.0|   x|
|ID6|   7| 4.0|   z|
|ID7|   8| 5.0|   z|
+---+----+----+----+
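One detail to be aware of: with the coalesce above, any colC value that is not in one of the three lists becomes null. If such values should be kept as-is, a hedged variant appends the original column as the final fallback (assuming the same df and xyz_dict as above):
import pyspark.sql.functions as F

df2 = df.withColumn(
    'colC',
    F.coalesce(
        *[F.when(F.col('colC').isin(v), k) for (k, v) in xyz_dict.items()],
        F.col('colC')  # fallback: keep the original value when no list matches
    )
).withColumn('colB', F.round('colB', 1))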
You can use replace on the dataframe to replace the values in colC, passing a dict object for the mappings, and the round function to limit the number of decimals in colB:
from pyspark.sql import functions as F
replacement = {
    "LB": "x", "LWB": "x", "LF": "x",
    "RF": "y", "LCM": "y",
    "LM": "z", "RS": "z"
}
df1 = df.replace(replacement, subset=["colC"]).withColumn("colB", F.round("colB", 1))
df1.show()
#+---+----+----+----+
#| ID|colA|colB|colC|
#+---+----+----+----+
#|ID1|   3| 5.9|   x|
#|ID2|   4|12.7|   y|
#|ID3|   2|20.8|   y|
#|ID4|   1| 2.0|   x|
#|ID5|   6| 3.0|   x|
#|ID6|   7| 4.0|   z|
#|ID7|   8| 5.0|   z|
#+---+----+----+----+
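If the groupings are already held in the Python lists from the question, the mapping dict for replace can be built from them instead of written by hand. A small sketch (assuming the x, y, z lists as defined in the question):
from pyspark.sql import functions as F

x = ['LB', 'LWB', 'LF']
y = ['RF', 'LCM']
z = ['LM', 'RS']

# Invert the {label: values} groupings into the {value: label} dict that replace expects.
replacement = {v: k for k, vals in {'x': x, 'y': y, 'z': z}.items() for v in vals}
df1 = df.replace(replacement, subset=["colC"]).withColumn("colB", F.round("colB", 1))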
You can also use the isin function:
from pyspark.sql.functions import col, when
x = ['LB','LWB','LF']
y = ['LCM','RF']
z = ['LM','RS']
df = df.withColumn('ColC', when(col('colC').isin(x), "x")
                   .otherwise(when(col('colC').isin(y), "y")
                   .otherwise(when(col('colC').isin(z), "z")
                   .otherwise(df.ColC))))
If you have a few lists, each with many values, this keeps the code more compact than spelling out every value as in blackbishop's answer, but for this problem his answer is simpler.
You can also try a regular expression, using regexp_replace:
import pyspark.sql.functions as f
replacements = [
    ("(LB)|(LWB)|(LF)", "x"),
    ("(LCM)|(RF)", "y"),
    ("(LM)|(RS)", "z")
]

for x, y in replacements:
    df = df.withColumn("colC", f.regexp_replace("colC", x, y))
I have a PySpark dataframe that has 7 columns. I have to add a new column named "sum" that holds, for each row, the count of columns that have data (are not null).
This sum can be calculated like this:
df = spark.createDataFrame([
    (1, "a", "xxx", None, "abc", "xyz", "fgh"),
    (2, "b", None, 3, "abc", "xyz", "fgh"),
    (3, "c", "a23", None, None, "xyz", "fgh")
], ("ID", "flag", "col1", "col2", "col3", "col4", "col5"))

from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

# Cast each "is not null" flag to 0/1 and add them up across all columns.
df2 = df.withColumn("sum", sum([(~F.isnull(df[col])).cast(IntegerType()) for col in df.columns]))
df2.show()
+---+----+----+----+----+----+----+---+
| ID|flag|col1|col2|col3|col4|col5|sum|
+---+----+----+----+----+----+----+---+
|  1|   a| xxx|null| abc| xyz| fgh|  6|
|  2|   b|null|   3| abc| xyz| fgh|  6|
|  3|   c| a23|null|null| xyz| fgh|  5|
+---+----+----+----+----+----+----+---+
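If only the data columns should contribute to the count (leaving ID and flag out of it), the same pattern works over an explicit column list. A sketch, assuming the df and column names used above (df3 is just an illustrative name):
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

data_cols = ["col1", "col2", "col3", "col4", "col5"]  # columns to check for nulls
df3 = df.withColumn(
    "sum",
    sum((~F.isnull(F.col(c))).cast(IntegerType()) for c in data_cols)
)
df3.show()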
Hope this helps!
I have a dataframe and I need to get the row number / index of each row. I would like to add a new column that includes the Letter as well as the row number/index, e.g. "A - 1", "B - 2".
# sample data
a = sqlContext.createDataFrame([("A", 20), ("B", 30), ("D", 80)], ["Letter", "distances"])
with output
+------+---------+
|Letter|distances|
+------+---------+
|     A|       20|
|     B|       30|
|     D|       80|
+------+---------+
I would like the new output to be something like this:
+------+---------+-----+
|Letter|distances|index|
+------+---------+-----+
|     A|       20|A - 1|
|     B|       30|B - 2|
|     D|       80|D - 3|
+------+---------+-----+
This is a function I have been working on
def cate(letter):
    return letter + " - " + #index
a.withColumn("index", cate(a["Letter"])).show()
Since you want to achieve the result using a UDF (only), let's try this:
from pyspark.sql.functions import udf, monotonically_increasing_id
from pyspark.sql.types import StringType

# sample data
a = sqlContext.createDataFrame([("A", 20), ("B", 30), ("D", 80)], ["Letter", "distances"])

def cate(letter, idx):
    return letter + " - " + str(idx)

cate_udf = udf(cate, StringType())

a = a.withColumn("temp_index", monotonically_increasing_id())
a = a.withColumn("index", cate_udf(a.Letter, a.temp_index)).drop("temp_index")
a.show()
Output is:
+------+---------+--------------+
|Letter|distances|         index|
+------+---------+--------------+
|     A|       20|         A - 0|
|     B|       30|B - 8589934592|
|     D|       80|D - 8589934593|
+------+---------+--------------+
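If consecutive 1-based positions are wanted instead of the raw monotonically_increasing_id values, a row_number over a window gives them through the DataFrame API as well. A sketch, assuming the dataframe a from above (a_indexed is just an illustrative name; note that an unpartitioned window pulls all rows into a single partition, which is fine for small data but does not scale):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.orderBy("Letter")  # global ordering, single partition
a_indexed = a.withColumn(
    "index",
    F.concat(F.col("Letter"), F.lit(" - "), F.row_number().over(w).cast("string"))
)
a_indexed.show()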
This should work
df = spark.createDataFrame([("A", 20), ("B", 30), ("D", 80)],["Letter", "distances"])
df.createOrReplaceTempView("df")
spark.sql("select concat(Letter,' - ',row_number() over (order by Letter)) as num, * from df").show()
+-----+------+---------+
|  num|Letter|distances|
+-----+------+---------+
|A - 1|     A|       20|
|B - 2|     B|       30|
|D - 3|     D|       80|
+-----+------+---------+
I need to implement the SQL logic below in a Spark DataFrame:
SELECT KEY,
       CASE WHEN tc in ('a','b') THEN 'Y'
            WHEN tc in ('a') AND amt > 0 THEN 'N'
            ELSE NULL END REASON
FROM dataset1;
My input DataFrame is as below:
val dataset1 = Seq((66, "a", "4"), (67, "a", "0"), (70, "b", "4"), (71, "d", "4")).toDF("KEY", "tc", "amt")
dataset1.show()
+---+---+---+
|KEY| tc|amt|
+---+---+---+
| 66|  a|  4|
| 67|  a|  0|
| 70|  b|  4|
| 71|  d|  4|
+---+---+---+
I have implemented the nested case when statement as:
dataset1.withColumn("REASON", when(col("tc").isin("a", "b"), "Y")
.otherwise(when(col("tc").equalTo("a") && col("amt").geq(0), "N")
.otherwise(null))).show()
+---+---+---+------+
|KEY| tc|amt|REASON|
+---+---+---+------+
| 66|  a|  4|     Y|
| 67|  a|  0|     Y|
| 70|  b|  4|     Y|
| 71|  d|  4|  null|
+---+---+---+------+
Readability of the above logic with the "otherwise" statements gets messy if the nested when statements go any deeper.
Is there any better way of implementing nested case when statements in Spark DataFrames?
There is no nesting here, therefore there is no need for otherwise. All you need is chained when:
import spark.implicits._
when($"tc" isin ("a", "b"), "Y")
.when($"tc" === "a" && $"amt" >= 0, "N")
ELSE NULL is implicit so you can omit it completely.
The pattern you use is more applicable for folding over a data structure:
val cases = Seq(
  ($"tc" isin ("a", "b"), "Y"),
  ($"tc" === "a" && $"amt" >= 0, "N")
)
where when / otherwise naturally follows the recursion pattern and null provides the base case.
cases.foldLeft(lit(null)) {
  case (acc, (expr, value)) => when(expr, value).otherwise(acc)
}
Please note that it is impossible to reach the "N" outcome with this chain of conditions: if tc is equal to "a", it will be captured by the first clause, and if it is not, it will fail to satisfy both predicates and default to NULL. You should rather use:
when($"tc" === "a" && $"amt" >= 0, "N")
.when($"tc" isin ("a", "b"), "Y")
For more complex logic, I prefer to use UDFs for better readability:
val selectCase = udf((tc: String, amt: String) =>
  if (Seq("a", "b").contains(tc)) "Y"
  else if (tc == "a" && amt.toInt <= 0) "N"
  else null
)

dataset1.withColumn("REASON", selectCase(col("tc"), col("amt")))
  .show
You can simply use selectExpr on your dataset:
dataset1.selectExpr("*", "CASE WHEN tc in ('a') AND amt > 0 THEN 'N' WHEN tc in ('a','b') THEN 'Y' ELSE NULL END
REASON").show()
+---+---+---+------+
|KEY| tc|amt|REASON|
+---+---+---+------+
| 66|  a|  4|     N|
| 67|  a|  0|     Y|
| 70|  b|  4|     Y|
| 71|  d|  4|  null|
+---+---+---+------+
The second condition should be placed before the first one, as the first condition is the more generic one:
WHEN tc in ('a') AND amt > 0 THEN 'N'