Join two dataframes based on common value in column (which is array) - pandas

I have one dataframe - df_similar_strings, which looks like this:
|--------------------|
| string_values      |
|--------------------|
| ['catish', 'cat']  |
|--------------------|
| ['doggo', 'dogy']  |
|--------------------|
and the other one - df_source:
|---------------------------|-----------|
| values                    | key_value |
|---------------------------|-----------|
| ['catish', 'cat', 'cat-'] | cat       |
|---------------------------|-----------|
| ['doggo', 'dogy', 'dog']  | dog       |
|---------------------------|-----------|
I would like to join those dataframes on the columns string_values and values so that there is at least one matching value.
I have no idea how to do this since the columns are nested as arrays.

You just need to cast your lists to tuples and then try merging. A list is unhashable, so the merge operation can't be applied to it directly. Try this:
df_source["values"] = df_source["values"].apply(lambda x: tuple(x))
Do the same with the other dataframe and then merge with pd.merge.
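A minimal sketch of that idea, using the column names from the question (note: this only matches rows whose lists are exactly equal, not rows that merely share one element):
import pandas as pd

# Tuples are hashable, so they can be used as merge keys.
df_source['values'] = df_source['values'].apply(tuple)
df_similar_strings['string_values'] = df_similar_strings['string_values'].apply(tuple)

merged = pd.merge(df_source, df_similar_strings,
                  left_on='values', right_on='string_values', how='inner')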

You can solve it by first taking the cartesian product of your two dataframes and then dropping every row that doesn't have any shared value.
For simplicity, I assume the columns in both datasets have the same name ("values"). I also assume the lists don't have repeated values (each value appears only once).
from collections import Counter

def find_duplicates(arr):
    return [item for item, count in Counter(arr).items() if count == 2]

df1['key'] = 1
df2['key'] = 1
cartes_prod_df = df1.merge(df2, on=['key'], how='outer').drop(columns=['key'])
duplicate_values = (cartes_prod_df.values_x + cartes_prod_df.values_y).apply(find_duplicates)
merged_df = cartes_prod_df[duplicate_values.apply(lambda x: len(x) > 0)]
I've used a little trick to get the cartesian product (adding the key column). The duplicate_values are then found in the concatenated arrays (built with the + operator): they are the values that appear twice in the concatenation, i.e. once on each side.
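As a side note (pandas 1.2+ only), the same cartesian product can be written without the helper key column by using how='cross'; a sketch of the equivalent call:
# Equivalent cartesian product on pandas >= 1.2; the rest of the logic is unchanged.
cartes_prod_df = df1.merge(df2, how='cross')
duplicate_values = (cartes_prod_df.values_x + cartes_prod_df.values_y).apply(find_duplicates)
merged_df = cartes_prod_df[duplicate_values.apply(lambda x: len(x) > 0)]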
UPDATE
In order to supply a full example, here's an example of df1 and df2:
import pandas as pd

d1 = {'values': [['A', 'B'], ['B', 'C'], ['D']], 'otherkey': [1, 2, 3]}
d2 = {'values': [['A'], ['B'], ['A', 'C'], ['D']], 'otherkey': [4, 5, 3, 6]}
df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)
Now, merged_df would give the output:
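Working this example through by hand (worth re-running the snippet to confirm), merged_df should keep the six cartesian-product rows whose lists share at least one value: ['A','B'] paired with ['A'], ['B'] and ['A','C']; ['B','C'] paired with ['B'] and ['A','C']; and ['D'] paired with ['D'].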

Related

How to add multiple columns dynamically based on a filter condition

I am trying to create multiple columns dynamically, based on a filter condition, after comparing two data frames with the code below.
source_df
+---+-----+-----+------+
|key|val11|val12|date  |
+---+-----+-----+------+
|abc|  1.1| john|2-3-21|
|def|  3.0| dani|2-2-21|
+---+-----+-----+------+
dest_df
+---+-----+-----+------+
|key|val11|val12|date  |
+---+-----+-----+------+
|abc|  2.1| jack|2-3-21|
|def|  3.0| dani|2-2-21|
+---+-----+-----+------+
columns = source_df.columns[1:]
joined_df = source_df.join(dest_df, 'key', 'full')
for column in columns:
    column_name = "difference_in_" + str(column)
    report = joined_df\
        .filter((source_df[column] != dest_df[column]))\
        .withColumn(column_name, F.concat(F.lit('[src:'), source_df[column],
                                          F.lit(',dst:'), dest_df[column], F.lit(']')))
The output I expect is:
#Expected
+---+-------------------+-------------------+
|key|difference_in_val11|difference_in_val12|
+---+-------------------+-------------------+
|abc|  [src:1.1,dst:2.1]|[src:john,dst:jack]|
+---+-------------------+-------------------+
But I get only one column in the result:
#Actual
+---+-------------------+
|key|difference_in_val12|
+---+-------------------+
|abc|[src:john,dst:jack]|
+---+-------------------+
How to generate multiple columns based on filter condition dynamically?
DataFrames are immutable objects, so you need to create another dataframe from the one that was generated in the first iteration. Something like below -
from pyspark.sql import functions as F

columns = source_df.columns[1:]
joined_df = source_df.join(dest_df, 'key', 'full')
for column in columns:
    column_name = "difference_in_" + str(column)
    if column != columns[-1]:
        report = joined_df\
            .filter((source_df[column] != dest_df[column]))\
            .withColumn(column_name, F.concat(F.lit('[src:'), source_df[column],
                                              F.lit(',dst:'), dest_df[column], F.lit(']')))
    else:
        report1 = report\
            .filter((source_df[column] != dest_df[column]))\
            .withColumn(column_name, F.concat(F.lit('[src:'), source_df[column],
                                              F.lit(',dst:'), dest_df[column], F.lit(']')))
report1.show()
#report.show()
Output -
+---+-----+-----+-----+-----+-------------------+-------------------+
|key|val11|val12|val11|val12|difference_in_val11|difference_in_val12|
+---+-----+-----+-----+-----+-------------------+-------------------+
|abc| 1.1| john| 2.1| jack| [src:1.1,dst:2.1]|[src:john,dst:jack]|
+---+-----+-----+-----+-----+-------------------+-------------------+
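A more general pattern (a sketch, not the original answer's code) is to keep reassigning the accumulating DataFrame inside the loop, so a difference column is added for every compared column no matter how many there are; here F.when is used instead of filter, so rows whose values match simply get null in the difference column:
from pyspark.sql import functions as F

report = source_df.join(dest_df, 'key', 'full')
for column in source_df.columns[1:]:
    # reassign so every iteration builds on the previous result
    report = report.withColumn(
        "difference_in_" + column,
        F.when(source_df[column] != dest_df[column],
               F.concat(F.lit('[src:'), source_df[column],
                        F.lit(',dst:'), dest_df[column], F.lit(']'))))
report.show()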
You could also do this with a union of both dataframes and then collect the list only when the collect_set size is greater than 1; this avoids joining the dataframes:
from pyspark.sql import functions as F

cols = source_df.drop("key").columns
output = (source_df.withColumn("ref", F.lit("src:"))
          .unionByName(dest_df.withColumn("ref", F.lit("dst:")))
          .groupBy("key")
          .agg(*[F.when(F.size(F.collect_set(i)) > 1,
                        F.collect_list(F.concat("ref", i))).alias(i)
                 for i in cols])
          .dropna(subset=cols, how='all')
)
output.show()
+---+------------------+--------------------+
|key| val11| val12|
+---+------------------+--------------------+
|abc|[src:1.1, dst:2.1]|[src:john, dst:jack]|
+---+------------------+--------------------+

Compare two dataframes in pandas

I am a beginner. I have two dataframes in pandas and I would like to identify what has changed from the original dataframe to the new one.
Rows: products
Columns: demand for future periods
The differences could be: new rows, deleted rows, and changed demand.
Ideally I would make a heatmap (showing the changes)... but I'm stuck - unsure whether I have to iterate over the frames or not...
A record in a dataframe is:
ProductId | demand_Month1 | demand_Month2 | demand_Month3 ... MonthX
This data is updated monthly. I would like to generate the following table:
productID | old - new (demand) ... for each month.
Both dataframes contain demand data for the same months.
import pandas as pd

def dataframe_difference(df1: pd.DataFrame, df2: pd.DataFrame, which=None):
    """Find rows which are different between two DataFrames."""
    comparison_df = df1.merge(
        df2,
        indicator=True,
        how='outer'
    )
    if which is None:
        diff_df = comparison_df[comparison_df['_merge'] != 'both']
    else:
        diff_df = comparison_df[comparison_df['_merge'] == which]
    diff_df.to_csv('data/diff.csv')
    return diff_df
Look at this for a start
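For the "old - new demand per month" part of the question specifically, a minimal sketch (old_df and new_df are placeholder names for the two monthly snapshots, both with a ProductId column and the same month columns):
# Align both snapshots on ProductId; subtracting gives the per-month change.
# Products present in only one snapshot show up as NaN rows (added/removed products).
old_idx = old_df.set_index('ProductId')
new_idx = new_df.set_index('ProductId')
demand_diff = old_idx.subtract(new_idx)

# Optional: visualise the changes as a heatmap (requires seaborn/matplotlib).
# import seaborn as sns
# sns.heatmap(demand_diff, center=0)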

Merge multiple spark rows to one

I have a dataframe which looks like the one given below. All the values for a given id are the same except for the mappingcol field.
+--------------------+----------------+--------------------+-------+
|misc |fruit |mappingcol |id |
+--------------------+----------------+--------------------+-------+
|ddd |apple |Map("name"->"Sameer"| 1 |
|ref |banana |Map("name"->"Riyazi"| 2 |
|ref |banana |Map("lname"->"Nikki"| 2 |
|ddd |apple |Map("lname"->"tenka"| 1 |
+--------------------+----------------+--------------------+-------+
I want to merge the rows with the same id in such a way that I get exactly one row per id, with the values of mappingcol merged. The output should look like:
+--------------------+----------------+--------------------+-------+
|misc |fruit |mappingcol |id |
+--------------------+----------------+--------------------+-------+
|ddd |apple |Map("name"->"Sameer"| 1 |
|ref |banana |Map("name"->"Riyazi"| 2 |
+--------------------+----------------+--------------------+-------+
the value for mappingcol for id = 1 would be :
Map(
"name" -> "Sameer",
"lname" -> "tenka"
)
I know that maps can be merged using the ++ operator, so that's not what I'm worried about. I just can't understand how to merge the rows, because if I use a groupBy, I have nothing to aggregate the rows on.
You can use groupBy and then do a little processing on the map:
df.groupBy("id", "fruit", "misc")
  .agg(collect_list("mappingcol"))
  .as[(Int, String, String, Seq[Map[String, String]])]
  .map { case (id, fruit, misc, list) => (id, fruit, misc, list.reduce(_ ++ _)) }
  .toDF("id", "fruit", "misc", "mappingColumn")
With the first line, you group by your desired columns and aggregate the map pairs into a single element (an array).
With the second line (as), you convert your structure to a Dataset of a Tuple4, with the last element being a sequence of maps.
With the third line (map), you merge all the maps into a single map.
With the last line (toDF), you give the columns their original names.
OUTPUT
+---+------+----+--------------------------------+
|id |fruit |misc|mappingColumn |
+---+------+----+--------------------------------+
|1 |apple |ddd |[name -> Sameer, lname -> tenka]|
|2 |banana|ref |[name -> Riyazi, lname -> Nikki]|
+---+------+----+--------------------------------+
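If a PySpark version of the same groupBy-and-merge idea is useful, here is a rough sketch (assuming the column name mappingcol from the question, and Spark 2.4+ for map_entries / map_from_entries):
from pyspark.sql import functions as F

# Turn each map into its entries, collect and flatten them per group,
# then rebuild a single map per id.
merged = (df.groupBy("id", "fruit", "misc")
            .agg(F.map_from_entries(
                     F.flatten(F.collect_list(F.map_entries("mappingcol")))
                 ).alias("mappingcol")))
merged.show(truncate=False)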
You can definitely do the above with a Window function!
This is in PySpark, not Scala, but there's almost no difference when only using native Spark functions.
The code below only works on a map column that has one key/value pair per row, as in your example data, but it can be adapted to map columns with multiple entries.
from pyspark.sql import Window
from pyspark.sql import functions as F

map_col = 'mappingColumn'
group_cols = ['id', 'fruit', 'misc']
# or, a lazier way if you have a lot of columns to group on:
# keep everything except the map column
group_cols_2 = [c for c in df.columns if c != map_col]

w = Window.partitionBy(group_cols)

# unpack each map's key and value into a (key, value) struct column
df1 = df.withColumn(map_col, F.struct(F.map_keys(map_col)[0], F.map_values(map_col)[0]))
# collect all key/value structs into an array; each row now
# contains the map entries for all rows in its group/window
df1 = df1.withColumn(map_col, F.collect_list(map_col).over(w))
# drop duplicate rows, as you only want one row per group
df1 = df1.dropDuplicates(group_cols)
# turn the array of structs back into a map
df1 = df1.withColumn(map_col, F.map_from_entries(map_col))
You can save the output of each step to a new column to see how each step works, as I have done below.
from pyspark.sql import Window
from pyspark.sql import functions as F

map_col = 'mappingColumn'
group_cols = ['id', 'fruit', 'misc']
w = Window.partitionBy(group_cols)

df1 = df.withColumn('test', F.struct(F.map_keys(map_col)[0], F.map_values(map_col)[0]))
df1 = df1.withColumn('test1', F.collect_list('test').over(w))
df1 = df1.withColumn('test2', F.map_from_entries('test1'))
df1.show(truncate=False)
df1.printSchema()

df1 = df1.dropDuplicates(group_cols)

Adding column from dataframe(df1) to another dataframe (df2)

I need some help with this Apache Spark (pyspark) issue.
I have a DataFrame (df1) which has a single column and a single row; it contains max_timestamp:
+-------------------+
|max_timestamp      |
+-------------------+
|2019-10-24 21:18:26|
+-------------------+
I have another DataFrame, which contains two columns - EmpId & Timestamp:
from pyspark.sql.functions import col
from pyspark.sql.types import TimestampType

masterData = [(1, '1999-10-24 21:18:23',), (1, '2019-10-24 21:18:26',), (2, '2020-01-24 21:18:26',)]
df_masterdata = spark.createDataFrame(masterData, ['dsid', 'txnTime_str'])
df_masterdata = df_masterdata.withColumn('txnTime_ts', col('txnTime_str').cast(TimestampType())).drop('txnTime_str')
df_masterdata.show(5, False)
+----+-------------------+
|dsid|txnTime_ts |
+----+-------------------+
|1 |1999-10-24 21:18:23|
|1 |2019-10-24 21:18:26|
|2 |2020-01-24 21:18:26|
+----+-------------------+
The objective is to filter the records in the 2nd DataFrame based on the condition txnTime_ts < max_timestamp.
What I'm trying to do -> add the column 'max_timestamp' to the 2nd DataFrame, and filter records by comparing the two values.
df_masterdata1 = df_masterdata.withColumn('maxTime', maxTS2['TEMP_MAX'])
PySpark does not let me add the column from maxTS2 to the DataFrame df_masterdata.
Error -
AnalysisException: 'Resolved attribute(s) TEMP_MAX#207255 missing from dsid#207263L,txnTime_ts#207267 in operator
!Project [dsid#207263L, txnTime_ts#207267, TEMP_MAX#207255 AS maxTime#207280].;;\n!Project [dsid#207263L,
txnTime_ts#207267, TEMP_MAX#207255 AS maxTime#207280]\n+- Project [dsid#207263L, txnTime_ts#207267]\n +- Project
[dsid#207263L, txnTime_str#207264, cast(txnTime_str#207264 as timestamp) AS txnTime_ts#207267]\n +- LogicalRDD
[dsid#207263L, txnTime_str#207264], false\n'
Any ideas on how to resolve this issue?
If you actually have a DataFrame with a single row/column, the most efficient way to accomplish this would be to extract the value from the dataframe and then filter df_masterdata against it. If you nevertheless need to do this within the context of a dataframe, you should use a join, e.g.:
df_masterdata1 = df_masterdata.join(df1, df_masterdata.txnTime_ts <= df1.max_timestamp)
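And a sketch of the "extract the value first" approach mentioned above (assuming df1 really is a 1x1 DataFrame, and using the strict < condition from the question):
max_ts = df1.first()['max_timestamp']   # pull the scalar out of the 1x1 DataFrame
df_masterdata1 = df_masterdata.filter(df_masterdata.txnTime_ts < max_ts)
df_masterdata1.show(truncate=False)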

Join two dataframes on a common column

I want to join two data sources, orders and customers:
orders is an SQL Server table:
orderid | customerid | orderdate  | ordercost
--------|------------|------------|----------
12000   | 1500       | 2008-08-09 | 38610
and customers is a csv file:
customerid,first_name,last_name,starting_date,ending_date,country
1500,Sian,Read,2008-01-07,2010-01-07,Greenland
I want to join these two tables in my Python application, so I wrote the following code:
import pandas as pd
import pypyodbc

# Connect to SQL Server with the pypyodbc library
connection = pypyodbc.connect("connection string here")
cursor = connection.cursor()
cursor.execute("SELECT * FROM orders")
result = cursor.fetchall()
# Convert the result to a pandas DataFrame
df1 = pd.DataFrame(result, columns=['orderid', 'customerid', 'orderdate', 'ordercost'])
# Read the CSV file
df2 = pd.read_csv(customer_csv)
# Merge the two dataframes
merged = pd.merge(df1, df2, on='customerid', how='inner')
print(merged[['first_name', 'country']])
I expect:
first_name | country
-----------|----------
Sian       | Greenland
But I get an empty result.
When I run this code with two dataframes that both come from CSV files, it works fine. Any help?
Thanks.
I think the problem is that the customerid column has different dtypes in the two DataFrames, so nothing matches.
You need to convert both columns to int, or both to str:
df1['customerid'] = df1['customerid'].astype(int)
df2['customerid'] = df2['customerid'].astype(int)
Or:
df1['customerid'] = df1['customerid'].astype(str)
df2['customerid'] = df2['customerid'].astype(str)
It is also possible to omit how='inner', because it is the default value for merge:
merged= pd.merge( df1, df2, on= 'customerid')
An empty dataframe result from pd.merge means you don't have any matching values across the two frames. Have you checked the type of the data? Use
df1['customerid'].dtype
to check.
As well as converting after importing (as suggested in the other answer), you can also tell pandas what dtype you want when you read the csv:
df2 = pd.read_csv(customer_csv, dtype={'customerid': str})