Compare columns from two different dataframes based on id - dataframe

I have two dataframes to compare; the order of the records differs and the column names might differ. I have to compare several columns based on the unique key (id).
Example: consider dataframes df1 and df2
df1:
+---+-------+-----+
| id|student|marks|
+---+-------+-----+
| 1| Vijay| 23|
| 4| Vithal| 24|
| 2| Ram| 21|
| 3| Rahul| 25|
+---+-------+-----+
df2:
+-----+--------+------+
|newId|student1|marks1|
+-----+--------+------+
| 3| Rahul| 25|
| 2| Ram| 23|
| 1| Vijay| 23|
| 4| Vithal| 24|
+-----+--------+------+
Here, based on id and newId, I need to compare the student and marks values and check whether the student with the same id has the same name and marks.
In this example the student with id 2 has 21 marks in df1 but 23 marks in df2.

df1.exceptAll(df2).show()
// +---+-------+-----+
// | id|student|marks|
// +---+-------+-----+
// | 2| Ram| 21|
// +---+-------+-----+
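Note that exceptAll compares whole rows and only reports the rows of df1 that have no exact match in df2; it does not align records by the key or tell you which column differs. A minimal PySpark sketch of a key-based comparison (assuming the column names shown above; the equivalent Scala version uses === in the join condition):

# Join on the two key columns, which have different names in the two dataframes
joined = df1.join(df2, df1["id"] == df2["newId"], "inner")

# Keep only the ids where the name or the marks disagree
mismatches = joined.filter(
    (df1["student"] != df2["student1"]) | (df1["marks"] != df2["marks1"])
).select(df1["id"], df1["student"], df1["marks"], df2["student1"], df2["marks1"])

mismatches.show()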

I think diff will give the result you are looking for.
scala> df1.diff(df2)
res0: Seq[org.apache.spark.sql.Row] = List([2,Ram,21])


PySpark: How to concatenate two distinct dataframes?

I have multiple dataframes that I need to concatenate side by side, matching rows by position. In pandas, we would typically write: pd.concat([df1, df2], axis=1).
This thread, How to concatenate/append multiple Spark dataframes column wise in Pyspark?, appears close, but its answer:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

df1_schema = StructType([StructField("id",IntegerType()),StructField("name",StringType())])
df1 = spark.sparkContext.parallelize([(1, "sammy"),(2, "jill"),(3, "john")])
df1 = spark.createDataFrame(df1, schema=df1_schema)
df2_schema = StructType([StructField("secNo",IntegerType()),StructField("city",StringType())])
df2 = spark.sparkContext.parallelize([(101, "LA"),(102, "CA"),(103,"DC")])
df2 = spark.createDataFrame(df2, schema=df2_schema)
schema = StructType(df1.schema.fields + df2.schema.fields)
df1df2 = df1.rdd.zip(df2.rdd).map(lambda x: x[0]+x[1])
spark.createDataFrame(df1df2, schema).show()
yields the following error when run on my data at scale: Can only zip RDDs with same number of elements in each partition
How can I join two or more dataframes that are identical in row length but are otherwise independent of content (they share a similar repeating structure/order but contain no shared data)?
Example expected data looks like:
+---+-----+ +-----+----+ +---+-----+-----+----+
| id| name| |secNo|city| | id| name|secNo|city|
+---+-----+ +-----+----+ +---+-----+-----+----+
| 1|sammy| + | 101| LA| => | 1|sammy| 101| LA|
| 2| jill| | 102| CA| | 2| jill| 102| CA|
| 3| john| | 103| DC| | 3| john| 103| DC|
+---+-----+ +-----+----+ +---+-----+-----+----+
You can create unique IDs with:
from pyspark.sql.functions import expr

df1 = df1.withColumn("unique_id", expr("row_number() over (order by (select null))"))
df2 = df2.withColumn("unique_id", expr("row_number() over (order by (select null))"))
Then you can left join them:
df1.join(df2, ["unique_id"], "left").drop("unique_id")
The final output looks like:
+---+-----+-----+----+
| id| name|secNo|city|
+---+-----+-----+----+
|  1|sammy|  101|  LA|
|  2| jill|  102|  CA|
|  3| john|  103|  DC|
+---+-----+-----+----+
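If that window expression gives trouble, a minimal alternative sketch (my own suggestion, not from the linked thread) is to number the rows of each dataframe with row_number() over monotonically_increasing_id() and join on that index. Note that an un-partitioned window pulls all rows into a single partition, so it trades scalability for the alignment guarantee:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# monotonically_increasing_id() preserves the existing row order but is not
# consecutive, so row_number() over it is used to get 1, 2, 3, ...
w = Window.orderBy(F.monotonically_increasing_id())
df1_idx = df1.withColumn("row_idx", F.row_number().over(w))
df2_idx = df2.withColumn("row_idx", F.row_number().over(w))

# Align the two dataframes row by row and drop the helper column
df1df2 = df1_idx.join(df2_idx, "row_idx", "inner").drop("row_idx")
df1df2.show()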

Compare 2 dataframes and create an output dataframe containing the name of the columns that contain differences and their values

Using Spark and Scala
I have df1 and df2 as follows:
df1
+---+----+----+------+
| ID|colA|colB|  colC|
+---+----+----+------+
|  1|   0|  10|APPLES|
|  2|   0|  20|APPLES|
|  3|   0|  30| PEARS|
+---+----+----+------+
df2
+---+----+----+------+
| ID|colA|colB|  colC|
+---+----+----+------+
|  1|   0|  10|APPLES|
|  2|   0|  20| PEARS|
|  3|   0|  10|APPLES|
+---+----+----+------+
I need to compare these two dataframes and extract the differences into a df3 that contains 4 columns: the name of the column that contains a difference, the value from df1, the value from df2, and the ID.
How can I achieve this without hard coding the column names? Only the ID column may be hard coded.
+-----------+--------------+--------------+---+
|Column Name|Value from df1|Value from df2| ID|
+-----------+--------------+--------------+---+
|       colB|            30|            10|  3|
|       colC|        APPLES|         PEARS|  2|
|       colC|         PEARS|        APPLES|  3|
+-----------+--------------+--------------+---+
What I have done so far is extract the names of the columns that contain differences, but I'm stuck on how to get the values.
val columns = df1.columns
val df_join = df1.alias("d1").join(df2.alias("d2"), col("d1.id") === col("d2.id"), "left")
val test = columns
  .foldLeft(df_join) { (df_join, name) =>
    df_join.withColumn(name + "_temp", when(col("d1." + name) =!= col("d2." + name), lit(name)))
  }
  .withColumn("Col Name", concat_ws(",", columns.map(name => col(name + "_temp")): _*))
You can try it this way:
// Consider the below dataframes
df1.show()
+---+----+----+------+
| ID|colA|colB| colC|
+---+----+----+------+
| 1| 0| 10|APPLES|
| 2| 0| 20|APPLES|
| 3| 0| 30| PEARS|
+---+----+----+------+
df2.show()
+---+----+----+------+
| ID|colA|colB| colC|
+---+----+----+------+
| 1| 0| 10|APPLES|
| 2| 0| 20| PEARS|
| 3| 0| 10|APPLES|
+---+----+----+------+
// Since the ID column can be hard coded, we exclude it from the list of columns so that only the remaining columns are compared
import scala.collection.mutable.ListBuffer

val df1_columns = df1.columns.to[ListBuffer] -= "ID"
val df2_columns = df2.columns.to[ListBuffer] -= "ID"
// obtain the number of columns to use it in the stack function later
val df1_columns_count = df1_columns.length
val df2_columns_count = df2_columns.length
// obtain the columns in dynamic way to use in the stack function
var df1_stack_str = ""
var df2_stack_str = ""
// Typecasting columns to string type to avoid conflicts
df1_columns.foreach { column =>
df1_stack_str += s"'$column',cast($column as string),"
}
df1_stack_str = df1_stack_str.substring(0,df1_stack_str.lastIndexOf(","))
// Typecasting columns to string type to avoid conflicts
df2_columns.foreach { column =>
df2_stack_str += s"'$column',cast($column as string),"
}
df2_stack_str = df2_stack_str.substring(0,df2_stack_str.lastIndexOf(","))
/*
In this case the stack function implementation would look like this
val df11 = df1.selectExpr("id","stack(3,'colA',cast(colA as string),'colB',cast(colB as string),'colC',cast(colC as string)) as (column_name,value_from_df1)")
val df21 = df2.selectExpr("id id_","stack(3,'colA',cast(colA as string),'colB',cast(colB as string),'colC',cast(colC as string)) as (column_name_,value_from_df2)")
*/
val df11 = df1.selectExpr("id",s"stack($df1_columns_count,$df1_stack_str) as (column_name,value_from_df1)")
val df21 = df2.selectExpr("id id_",s"stack($df2_columns_count,$df2_stack_str) as (column_name_,value_from_df2)")
// use inner join to get value_from_df1 and value_from_df2 in one dataframe and apply the filter
df11.as("df11").join(df21.as("df21"),expr("df11.id=df21.id_ and df11.column_name=df21.column_name_"))
.drop("id_","column_name_")
.filter("value_from_df1!=value_from_df2")
.show
// Final output
+---+-----------+--------------+--------------+
| id|column_name|value_from_df1|value_from_df2|
+---+-----------+--------------+--------------+
| 2| colC| APPLES| PEARS|
| 3| colB| 30| 10|
| 3| colC| PEARS| APPLES|
+---+-----------+--------------+--------------+

PySpark Dataframe: Unify Certain Rows

I'm having some trouble figuring this one out
Here's a simple example:
+---+----+-----+
| Id|Rank|State|
+---+----+-----+
|  a|   5|   NJ|
|  a|   7|   GA|
|  b|   8|   CA|
|  b|   1|   CA|
+---+----+-----+
I'd like to transform this dataframe so that, if the same Id appears in multiple states, only one state is kept for all of its rows. In this example, every row with Id "a" should have state "NJ" instead of "NJ" and "GA".
The result should be something like:
+---+----+-----+
| Id|Rank|State|
+---+----+-----+
|  a|   5|   NJ|
|  a|   7|   NJ|
|  b|   8|   CA|
|  b|   1|   CA|
+---+----+-----+
How can this be accomplished? Thanks!
Try the first window function, for example:
from pyspark.sql import Window
from pyspark.sql.functions import col, first

w = Window.partitionBy("Id").orderBy("Rank")
df.select(col("Id"), col("Rank"), first("State", True).over(w).alias("NewState"))
This will put into the "NewState" column the first state, according to the rank, within each Id group.
The same thing can easily be expressed in pure SQL, if you want to use it.
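For example, a rough sketch of the SQL version (assuming the dataframe is registered as a temp view named ranks, a name chosen here for illustration):

df.createOrReplaceTempView("ranks")  # hypothetical view name
spark.sql("""
    SELECT Id, Rank,
           FIRST_VALUE(State) OVER (PARTITION BY Id ORDER BY Rank) AS NewState
    FROM ranks
""").show()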
BTW, welcome to StackOverflow community!

Join two data frames, select all columns from one and some columns from the other

Let's say I have a Spark dataframe df1 with several columns (among which is the column id) and a dataframe df2 with two columns, id and other.
Is there a way to replicate the following command:
sqlContext.sql("SELECT df1.*, df2.other FROM df1 JOIN df2 ON df1.id = df2.id")
by using only pyspark functions such as join(), select() and the like?
I have to implement this join in a function and I don't want to be forced to have sqlContext as a function parameter.
The asterisk (*) works with an alias. For example:
from pyspark.sql.functions import *
df1 = df1.alias('df1')
df2 = df2.alias('df2')
df1.join(df2, df1.id == df2.id).select('df1.*')
Not sure if the most efficient way, but this worked for me:
from pyspark.sql.functions import col
df1.alias('a').join(df2.alias('b'),col('b.id') == col('a.id')).select([col('a.'+xx) for xx in a.columns] + [col('b.other1'),col('b.other2')])
The trick is in:
[col('a.'+xx) for xx in a.columns] : all columns in a
[col('b.other1'),col('b.other2')] : some columns of b
Without using alias.
df1.join(df2, df1.id == df2.id).select(df1["*"],df2["other"])
Here is a solution that does not require a SQL context, but maintains the metadata of a DataFrame.
a = sc.parallelize([['a', 'foo'], ['b', 'hem'], ['c', 'haw']]).toDF(['a_id', 'extra'])
b = sc.parallelize([['p1', 'a'], ['p2', 'b'], ['p3', 'c']]).toDF(["other", "b_id"])
c = a.join(b, a.a_id == b.b_id)
Then, c.show() yields:
+----+-----+-----+----+
|a_id|extra|other|b_id|
+----+-----+-----+----+
| a| foo| p1| a|
| b| hem| p2| b|
| c| haw| p3| c|
+----+-----+-----+----+
I believe that this would be the easiest and most intuitive way:
final = (
    df1.alias('df1')
       .join(df2.alias('df2'), on=df1['id'] == df2['id'], how='inner')
       .select('df1.*', 'df2.other')
)
Drop the duplicate b_id:
c = a.join(b, a.a_id == b.b_id).drop(b.b_id)
Here is a code snippet that does the inner join, selects the columns from both dataframes, and aliases the column that exists in both (Name) to a different name (DName).
emp_df = spark.read.csv('Employees.csv', header=True)
dept_df = spark.read.csv('dept.csv', header=True)
emp_dept_df = emp_df.join(dept_df,'DeptID').select(emp_df['*'], dept_df['Name'].alias('DName'))
emp_df.show()
dept_df.show()
emp_dept_df.show()
Output for 'emp_df.show()':
+---+---------+------+------+
| ID| Name|Salary|DeptID|
+---+---------+------+------+
| 1| John| 20000| 1|
| 2| Rohit| 15000| 2|
| 3| Parth| 14600| 3|
| 4| Rishabh| 20500| 1|
| 5| Daisy| 34000| 2|
| 6| Annie| 23000| 1|
| 7| Sushmita| 50000| 3|
| 8| Kaivalya| 20000| 1|
| 9| Varun| 70000| 3|
| 10|Shambhavi| 21500| 2|
| 11| Johnson| 25500| 3|
| 12| Riya| 17000| 2|
| 13| Krish| 17000| 1|
| 14| Akanksha| 20000| 2|
| 15| Rutuja| 21000| 3|
+---+---------+------+------+
Output for 'dept_df.show()':
+------+----------+
|DeptID| Name|
+------+----------+
| 1| Sales|
| 2|Accounting|
| 3| Marketing|
+------+----------+
Join Output:
+---+---------+------+------+----------+
| ID| Name|Salary|DeptID| DName|
+---+---------+------+------+----------+
| 1| John| 20000| 1| Sales|
| 2| Rohit| 15000| 2|Accounting|
| 3| Parth| 14600| 3| Marketing|
| 4| Rishabh| 20500| 1| Sales|
| 5| Daisy| 34000| 2|Accounting|
| 6| Annie| 23000| 1| Sales|
| 7| Sushmita| 50000| 3| Marketing|
| 8| Kaivalya| 20000| 1| Sales|
| 9| Varun| 70000| 3| Marketing|
| 10|Shambhavi| 21500| 2|Accounting|
| 11| Johnson| 25500| 3| Marketing|
| 12| Riya| 17000| 2|Accounting|
| 13| Krish| 17000| 1| Sales|
| 14| Akanksha| 20000| 2|Accounting|
| 15| Rutuja| 21000| 3| Marketing|
+---+---------+------+------+----------+
I got an error: 'a not found' using the suggested code:
from pyspark.sql.functions import col
df1.alias('a').join(df2.alias('b'), col('b.id') == col('a.id')).select([col('a.'+xx) for xx in a.columns] + [col('b.other1'), col('b.other2')])
I changed a.columns to df1.columns and it worked out.
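For reference, the corrected call (with df1.columns substituted, using the same hypothetical other1/other2 columns as above) looks like:

from pyspark.sql.functions import col
df1.alias('a').join(df2.alias('b'), col('b.id') == col('a.id')).select(
    [col('a.' + xx) for xx in df1.columns] + [col('b.other1'), col('b.other2')]
)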
Here is a function to drop duplicate columns after joining:
def dropDupeDfCols(df):
    # Drop duplicate columns (same name), keeping the first occurrence of each.
    newcols = []   # distinct column names, in order of first appearance
    dupcols = []   # positional indices of the duplicated columns

    for i in range(len(df.columns)):
        if df.columns[i] not in newcols:
            newcols.append(df.columns[i])
        else:
            dupcols.append(i)

    # Temporarily rename every column to its positional index so the duplicated
    # names can be dropped unambiguously, then restore the original names.
    df = df.toDF(*[str(i) for i in range(len(df.columns))])
    for dupcol in dupcols:
        df = df.drop(str(dupcol))

    return df.toDF(*newcols)
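A hypothetical usage sketch, dropping the duplicated id column after a join:

joined = df1.join(df2, df1.id == df2.id)   # result contains two columns named "id"
deduped = dropDupeDfCols(joined)           # keeps only the first "id" column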
I just dropped the columns I didn't need from df2 and joined:
sliced_df = df2.select(columns_of_interest)
df1.join(sliced_df, on=['id'], how='left')
Note that id should be included in columns_of_interest.
df1.join(df2, ['id']).drop(df2.id)
If you need multiple columns from the other PySpark dataframe, you can use this.
Based on a single join condition:
x.join(y, x.id == y.id,"left").select(x["*"],y["col1"],y["col2"],y["col3"])
Based on multiple join conditions:
x.join(y, (x.id == y.id) & (x.no == y.no),"left").select(x["*"],y["col1"],y["col2"],y["col3"])
I very much like Xehron's answer above, and I suspect it's mechanically identical to my solution. This works in Databricks, and presumably works in a typical Spark environment (replacing the keyword "spark" with "sqlContext"):
df.createOrReplaceTempView('t1')   # temp table t1
df2.createOrReplaceTempView('t2')  # temp table t2

output = spark.sql("""
    select
        t1.*,
        t2.other
    from t1
    left join t2 on t1.id = t2.id
""")
You could just perform the join and then select the wanted columns: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=dataframe%20join#pyspark.sql.DataFrame.join

concatenate arbitrary long list of matches in SQL subquery

Imagine two tables (a rather contrived example, but for the sake of simplicity, here you go):
words
  word_id
letters
  letter
  word_id
How can I select all words while selecting all the letters that belong to a word and concatenating them into said word? It is important that the letters are returned in the order they appear in the table; letters of different words may be interleaved, but the overall order is correct.
|word_id|     |word_id|letter|
+-------+     +-------+------+
|      1|     |      1|     H|
|      2|     |      2|     B|
              |      2|     Y|
              |      1|     I|
              |      2|     E|
should return
|word_id|word|
+-------+----+
| 1| HI|
| 2| BYE|
any way to accomplish this in pure SQL?
Try this:
SELECT word_id, group_concat(letter, '') FROM letters GROUP BY word_id;