Why doesn't a left_anti join work as expected in PySpark? - apache-spark-sql

In a dataframe I'm trying to identify the rows whose value in column C2 does not appear in column C1 of any other row. I tried the following code:
in_df = sqlContext.createDataFrame([[1,None,'A'],[2,1,'B'],[3,None,'C'],[4,11,'D']],['C1','C2','C3'])
in_df.show()
+---+----+---+
| C1| C2| C3|
+---+----+---+
| 1|null| A|
| 2| 1| B|
| 3|null| C|
| 4| 11| D|
+---+----+---+
filtered = in_df.filter(in_df.C2.isNotNull())
filtered.show()
+---+---+---+
| C1| C2| C3|
+---+---+---+
| 2| 1| B|
| 4| 11| D|
+---+---+---+
Now, applying a left_anti join should return only row 4, yet I also get row 2:
filtered.join(in_df,(in_df.C1 == filtered.C2), 'left_anti').show()
+---+---+---+
| C1| C2| C3|
+---+---+---+
| 2| 1| B|
| 4| 11| D|
+---+---+---+
If I 'materialize' the filtered DF, the result is as expected:
filtered = filtered.toDF(*filtered.columns)
filtered.join(in_df,(in_df.C1 == filtered.C2), 'left_anti').show()
+---+---+---+
| C1| C2| C3|
+---+---+---+
| 4| 11| D|
+---+---+---+
Why is this .toDF needed?

in_df.C1 actually refers to a column of filtered, as the following code shows:
in_df = sqlContext.createDataFrame([[1,None,'A'],[2,1,'B'],[3,None,'C'],[4,11,'D']],['C1','C2','C3'])
filtered = in_df.filter(in_df.C2.isNotNull()).select("C2")
filtered.join(in_df,(in_df.C1 == filtered.C2), 'left_anti').show()
Py4JJavaError: An error occurred while calling o699.join.
: org.apache.spark.sql.AnalysisException: cannot resolve 'in_df.C1' given input columns: [C2, C1, C2, C3];;
'Join LeftAnti, ('in_df.C1 = 'filtered.C2)
:- Project [C2#891L]
: +- Filter isnotnull(C2#891L)
: +- LogicalRDD [C1#890L, C2#891L, C3#892]
+- LogicalRDD [C1#900L, C2#901L, C3#902]
So when joining the two dataframes you are actually using the condition filtered.C1 == filtered.C2:
filtered = in_df.filter(in_df.C2.isNotNull())
filtered.join(in_df,(filtered.C1 == filtered.C2), 'left_anti').show()
+---+---+---+
| C1| C2| C3|
+---+---+---+
| 2| 1| B|
| 4| 11| D|
+---+---+---+
You may have given the dataframe a new name, but its columns can still be referred to as in_df.Ci. To make sure you're referring to the right dataframe, you can use aliases:
import pyspark.sql.functions as psf
filtered.alias("filtered").join(in_df.alias("in_df"),(psf.col("in_df.C1") == psf.col("filtered.C2")), 'left_anti').show()
+---+---+---+
| C1| C2| C3|
+---+---+---+
| 4| 11| D|
+---+---+---+
The best way to deal with column-name ambiguities is to avoid them from the start, by renaming columns or aliasing your dataframes.
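For example, a minimal sketch of the renaming route (the _f suffix and the renamed variable are purely illustrative): once the two sides share no column names, the join condition can be written with plain, unambiguous column references.
import pyspark.sql.functions as psf
# rename every column of the filtered side so the two sides share no column names
renamed = filtered.select([psf.col(c).alias(c + '_f') for c in filtered.columns])
renamed.join(in_df, psf.col('C2_f') == psf.col('C1'), 'left_anti').show()
# only the row with C1_f=4 (i.e. C2=11) survives; its columns now carry the _f suffix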

Related

Pyspark crossJoin with specific condition

The crossJoin of two dataframes of 5 rows each gives a dataframe of 25 rows (5*5).
What I want is to do a crossJoin that is "not full".
For example:
df1:        df2:
+-----+     +-----+
|index|     |value|
+-----+     +-----+
|    0|     |    A|
|    1|     |    B|
|    2|     |    C|
|    3|     |    D|
|    4|     |    E|
+-----+     +-----+
The result should be a dataframe with fewer than 25 rows, where for each row in index a random number of rows from value is chosen for the crossJoin.
It will be something like this:
+-----+-----+
|index|value|
+-----+-----+
| 0| D|
| 0| A|
| 1| A|
| 1| D|
| 1| B|
| 1| C|
| 2| A|
| 2| E|
| 3| D|
| 4| A|
| 4| B|
| 4| E|
+-----+-----+
Thank you
You can try sample(withReplacement, fraction, seed=None) to get a smaller number of rows after the cross join.
Example:
spark.sql("set spark.sql.crossJoin.enabled=true")
df.join(df1).sample(False,0.6).show()
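For illustration, a runnable sketch against the question's two frames (a SparkSession named spark is assumed; df1 holds index and df2 holds value as in the question):
df1 = spark.createDataFrame([(i,) for i in range(5)], ['index'])
df2 = spark.createDataFrame([(v,) for v in ['A', 'B', 'C', 'D', 'E']], ['value'])
spark.sql("set spark.sql.crossJoin.enabled=true")
# keep roughly 60% of the 25 cross-joined pairs, without replacement
df1.join(df2).sample(False, 0.6).show()
Keep in mind that sample drops pairs uniformly at random over the whole cross join, so the number of value rows kept per index varies (and an index can occasionally end up with no rows at all); if each index must keep at least one row, an extra step would be needed.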

SQL or Pyspark - Get the last time a column had a different value for each ID

I am using pyspark so I have tried both pyspark code and SQL.
I am trying to get the last time the ADDRESS column had a different value, grouped by USER_ID. The rows are ordered by TIME. Take the table below:
+---+-------+-------+----+
| ID|USER_ID|ADDRESS|TIME|
+---+-------+-------+----+
| 1| 1| A| 10|
| 2| 1| B| 15|
| 3| 1| A| 20|
| 4| 1| A| 40|
| 5| 1| A| 45|
+---+-------+-------+----+
The correct new column I would like is as below:
+---+-------+-------+----+---------+
| ID|USER_ID|ADDRESS|TIME|LAST_DIFF|
+---+-------+-------+----+---------+
| 1| 1| A| 10| null|
| 2| 1| B| 15| 10|
| 3| 1| A| 20| 15|
| 4| 1| A| 40| 15|
| 5| 1| A| 45| 15|
+---+-------+-------+----+---------+
I have tried using different windows but none ever seem to get exactly what I want. Any ideas?
A simplified version of #jxc's answer.
from pyspark.sql.functions import *
from pyspark.sql import Window
# Window definition
w = Window.partitionBy(col('user_id')).orderBy(col('id'))
# Getting the previous time and classifying rows into groups
grp_df = df.withColumn('grp', sum(when(lag(col('address')).over(w) == col('address'), 0).otherwise(1)).over(w)) \
           .withColumn('prev_time', lag(col('time')).over(w))
# Window definition with groups
w_grp = Window.partitionBy(col('user_id'), col('grp')).orderBy(col('id'))
grp_df.withColumn('last_addr_change_time', min(col('prev_time')).over(w_grp)).show()
Use lag together with a running sum to assign a group label whenever the column value changes (based on the defined window), and capture the time of the previous row, which is used in the next step.
Once you have the groups, use a running minimum to get the timestamp of the last column-value change. (I suggest looking at the intermediate results to understand the transformations better.)
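For completeness, a sketch of how the question's dataframe might be created so the snippet above runs end to end (create df first; a SparkSession named spark is assumed), plus a final step that drops the helper columns and names the result last_diff as in the question:
df = spark.createDataFrame(
    [(1, 1, 'A', 10), (2, 1, 'B', 15), (3, 1, 'A', 20), (4, 1, 'A', 40), (5, 1, 'A', 45)],
    ['id', 'user_id', 'address', 'time'])
# after computing grp_df and w_grp as above:
result = grp_df.withColumn('last_diff', min(col('prev_time')).over(w_grp)) \
               .drop('grp', 'prev_time')
result.show()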
One way using two Window specs:
from pyspark.sql.functions import when, col, lag, sum as fsum
from pyspark.sql import Window
w1 = Window.partitionBy('USER_ID').orderBy('ID')
w2 = Window.partitionBy('USER_ID').orderBy('g')
# create a new sub-group label based on the values of ADDRESS and Previous ADDRESS
df1 = df.withColumn('g', fsum(when(col('ADDRESS') == lag('ADDRESS').over(w1), 0).otherwise(1)).over(w1))
# group by USER_ID and the above sub-group label and calculate the sum of time in the group as diff
# calculate the last_diff and then join the data back to the df1
df2 = df1.groupby('USER_ID', 'g').agg(fsum('Time').alias('diff')).withColumn('last_diff', lag('diff').over(w2))
df1.join(df2, on=['USER_ID', 'g']).show()
+-------+---+---+-------+----+----+---------+
|USER_ID| g| ID|ADDRESS|TIME|diff|last_diff|
+-------+---+---+-------+----+----+---------+
| 1| 1| 1| A| 10| 10| null|
| 1| 2| 2| B| 15| 15| 10|
| 1| 3| 3| A| 20| 105| 15|
| 1| 3| 4| A| 40| 105| 15|
| 1| 3| 5| A| 45| 105| 15|
+-------+---+---+-------+----+----+---------+
df_new = df1.join(df2, on=['USER_ID', 'g']).drop('g', 'diff')

Need to perform a multi-column join on a dataframe with a look-up dataframe

I have two dataframes like so
+---+---+---+---+---+
| c1| c2| c3| c4| c5|
+---+---+---+---+---+
| 0| 1| 2| 3| 4|
| 5| 6| 7| 8| 9|
+---+---+---+---+---+
+---+---+
|key|val|
+---+---+
| 0| A|
| 1| B|
| 2| C|
| 3| D|
| 4| E|
| 5| F|
| 6| G|
| 7| H|
| 8| I|
| 9| J|
+---+---+
I want to look up each column of df1 against the key in df2 and return the corresponding val from df2 for each.
Here is the code to produce the two input dataframes
df1 = sc.parallelize([('0','1','2','3','4',), ('5','6','7','8','9',)]).toDF(['c1','c2','c3','c4','c5'])
df1.show()
df2 = sc.parallelize([('0','A',), ('1','B',), ('2','C',), ('3','D',), ('4','E',),
                      ('5','F',), ('6','G',), ('7','H',), ('8','I',), ('9','J',)]).toDF(['key','val'])
df2.show()
I want to join the above to produce the following
+---+---+---+---+---+---+---+---+---+---+
| c1| c2| c3| c4| c5|lu1|lu2|lu3|lu4|lu5|
+---+---+---+---+---+---+---+---+---+---+
|  0|  1|  2|  3|  4|  A|  B|  C|  D|  E|
|  5|  6|  7|  8|  9|  F|  G|  H|  I|  J|
+---+---+---+---+---+---+---+---+---+---+
I can get it to work for a single column like so, but I'm not sure how to extend it to all columns:
df1.join(df2, df1.c1 == df2.key).select('c1','val').show()
+---+---+
| c1|val|
+---+---+
| 0| A|
| 5| F|
+---+---+
You can just chain the joins:
df1 \
    .join(df2, on=df1.c1 == df2.key, how='left') \
    .withColumnRenamed('val', 'lu1') \
    .join(df2, on=df1.c2 == df2.key, how='left') \
    .withColumnRenamed('val', 'lu2')
    # ... and so on for c3, c4, c5
You can even do it in a loop, but don't do it with too many columns:
from pyspark.sql import functions as f
df = df1
for i in range(1, 6):
    df = df \
        .join(df2.alias(str(i)), on=f.col('c{}'.format(i)) == f.col("{}.key".format(i)), how='left') \
        .withColumnRenamed('val', 'lu{}'.format(i))
df \
    .select('c1', 'c2', 'c3', 'c4', 'c5', 'lu1', 'lu2', 'lu3', 'lu4', 'lu5') \
    .show()
Output:
+---+---+---+---+---+---+---+---+---+---+
| c1| c2| c3| c4| c5|lu1|lu2|lu3|lu4|lu5|
+---+---+---+---+---+---+---+---+---+---+
| 5| 6| 7| 8| 9| F| G| H| I| J|
| 0| 1| 2| 3| 4| A| B| C| D| E|
+---+---+---+---+---+---+---+---+---+---+
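If df2 stays as small as in this lookup example, a broadcast join hint is a common refinement; here is a sketch of the same loop with the hint (pyspark.sql.functions.broadcast only hints the optimizer to ship df2 to every executor instead of shuffling, the result is unchanged):
from pyspark.sql import functions as f
df = df1
for i in range(1, 6):
    df = df.join(f.broadcast(df2).alias(str(i)),
                 on=f.col('c{}'.format(i)) == f.col('{}.key'.format(i)),
                 how='left') \
           .withColumnRenamed('val', 'lu{}'.format(i))
# then select the c* and lu* columns and show(), exactly as above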

How to flatten a pyspark dataframe? (spark 1.6)

I'm working with Spark 1.6
Here are my data:
from pyspark.sql import Row
eDF = sqlsc.createDataFrame([Row(v=1, eng_1=10, eng_2=20),
                             Row(v=2, eng_1=15, eng_2=30),
                             Row(v=3, eng_1=8, eng_2=12)])
eDF.select('v', 'eng_1', 'eng_2').show()
+---+-----+-----+
| v|eng_1|eng_2|
+---+-----+-----+
| 1| 10| 20|
| 2| 15| 30|
| 3| 8| 12|
+---+-----+-----+
I would like to 'flatten' this table.
That is to say:
+---+-----+---+
| v| key|val|
+---+-----+---+
| 1|eng_1| 10|
| 1|eng_2| 20|
| 2|eng_1| 15|
| 2|eng_2| 30|
| 3|eng_1| 8|
| 3|eng_2| 12|
+---+-----+---+
Note that since I'm working with Spark 1.6, I can't use pyspark.sql.functions.create_map or pyspark.sql.functions.posexplode.
Use rdd.flatMap to flatten it:
# build the flattened rows with flatMap; use the SQLContext (sqlsc) from the question, since SparkSession only exists in Spark 2.x
df = sqlsc.createDataFrame(
    eDF.rdd.flatMap(
        lambda r: [Row(v=r.v, key=col, val=r[col]) for col in ['eng_1', 'eng_2']]
    )
)
df.show()
+-----+---+---+
| key| v|val|
+-----+---+---+
|eng_1| 1| 10|
|eng_2| 1| 20|
|eng_1| 2| 15|
|eng_2| 2| 30|
|eng_1| 3| 8|
|eng_2| 3| 12|
+-----+---+---+
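If you prefer to stay in the DataFrame API, a possible alternative that only uses functions already present in Spark 1.6 (explode, array, struct, lit) is to build a key/val struct per column and explode the array; a sketch:
from pyspark.sql import functions as F
flat = eDF.select(
    'v',
    F.explode(F.array(
        F.struct(F.lit('eng_1').alias('key'), F.col('eng_1').alias('val')),
        F.struct(F.lit('eng_2').alias('key'), F.col('eng_2').alias('val'))
    )).alias('kv')
).select('v', 'kv.key', 'kv.val')
flat.show()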

Drop multiple columns from DataFrame recursively in SPARK <= version 1.6.0

I want to drop multiple columns from the data frame in one go; I don't want to write .drop("col1").drop("col2").
Note: I am using spark-1.6.0
This functionality is available natively in current Spark versions (2.0 onwards); for earlier versions we can make use of the code below.
1.
import scala.annotation.tailrec
implicit class DataFrameOperation(df: DataFrame) {
  def dropCols(cols: String*): DataFrame = {
    @tailrec def deleteCol(df: DataFrame, cols: Seq[String]): DataFrame =
      if (cols.isEmpty) df else deleteCol(df.drop(cols.head), cols.tail)
    deleteCol(df, cols)
  }
}
To call the method
val finalDF = dataFrame.dropCols("col1","col2","col3")
This method is a workaround.
public static DataFrame drop(DataFrame dataFrame, List<String> dropCol) {
    // colname will contain the names of all columns except the ones to be dropped
    List<String> colname = Arrays.stream(dataFrame.columns())
            .filter(col -> !dropCol.contains(col))
            .collect(Collectors.toList());
    return dataFrame.selectExpr(JavaConversions.asScalaBuffer(colname));
}
inputDataFrame:
+---+---+---+---+---+
| C0| C1| C2| C3| C4|
+---+---+---+---+---+
| 0| 0| 0| 0| 1|
| 1| 5| 6| 0| 14|
| 1| 6| 1| 0| 3|
| 1| 0| 1| 0| 1|
| 1| 37| 9| 0| 19|
+---+---+---+---+---+
If you want to drop columns C0, C2 and C4:
colDroppedDataFrame:
+---+---+
| C1| C3|
+---+---+
| 0| 0|
| 5| 0|
| 6| 0|
| 0| 0|
| 37| 0|
+---+---+
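For PySpark users on 1.6, the same select-based workaround fits in a few lines; a sketch (drop_cols and input_df are illustrative names, not from the original post). On Spark 2.0+ none of this is needed, since drop accepts several column names at once, e.g. df.drop("C0", "C2", "C4").
def drop_cols(df, cols_to_drop):
    # keep every column whose name is not in the drop list
    return df.select([c for c in df.columns if c not in cols_to_drop])
col_dropped_df = drop_cols(input_df, ['C0', 'C2', 'C4'])
col_dropped_df.show()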