Grouping alternative items with PySpark - dataframe

The sample of the dataset I am working on:
# Creating the DataFrame
test = sqlContext.createDataFrame([(1,2),(2,1),
                                   (1,3),(2,3),
                                   (3,2),(3,1),
                                   (4,5),(5,4)],
                                  ['cod_item','alter_cod'])
And it looks like this after grouping the equivalent items in lists:
test.createOrReplaceTempView("teste")
teste = spark.sql("""select cod_item,
                            collect_list(alter_cod) as alternative_item
                     from teste
                     group by cod_item""")
In the first column, I have certain items and in the second column, I have items that are equivalent. I would like, for each list, to have only one item that represents it.
I would like the final dataframe to look like this:
or
Where the items on the right are the items representing their respective equivalent items.

After collect_list, keep only the rows where every alter_cod is bigger than cod_item (in other words, filter out rows where any alter_cod is smaller). This method works on strings too.
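from pyspark.sql import functions as F
# F.forall requires Spark 3.1+
test = (test
    .groupBy('cod_item')
    .agg(F.collect_list('alter_cod').alias('alter_cod'))
    # keep a row only if every alternative is bigger than cod_item
    .filter(F.forall('alter_cod', lambda x: x > F.col('cod_item')))
)
test.show()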
# +--------+---------+
# |cod_item|alter_cod|
# +--------+---------+
# |       1|   [2, 3]|
# |       4|      [5]|
# +--------+---------+
Or add one line to your SQL:
select cod_item,
collect_list(alter_cod) as alternative_item
from teste
group by cod_item
having forall(alternative_item, x -> x > cod_item)
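To see which groups the forall condition keeps, here is a minimal sketch of the same logic in plain Python, with no Spark session needed. The function name representatives is hypothetical and only used for illustration; the pairs are the sample data from the question.

```python
from collections import defaultdict

# Sample (cod_item, alter_cod) pairs from the question
pairs = [(1, 2), (2, 1), (1, 3), (2, 3), (3, 2), (3, 1), (4, 5), (5, 4)]

def representatives(pairs):
    """Group alternatives by item, then keep only the groups in which
    every alternative is bigger than the item itself."""
    groups = defaultdict(list)
    for cod_item, alter_cod in pairs:
        groups[cod_item].append(alter_cod)
    return {item: alts for item, alts in groups.items()
            if all(alt > item for alt in alts)}

result = representatives(pairs)
print(result)  # {1: [2, 3], 4: [5]}
```

Only the smallest item of each equivalence group passes the test, which is why it can serve as the representative.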

Related

Extract key value from dataframe in PySpark

I have the below dataframe which I have read from a JSON file.
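+-----------------------------+-------------------------+--------------------------+----------------------------+
|1                            |2                        |3                         |4                           |
+-----------------------------+-------------------------+--------------------------+----------------------------+
|{"todo":["wakeup", "shower"]}|{"todo":["brush", "eat"]}|{"todo":["read", "write"]}|{"todo":["sleep", "snooze"]}|
+-----------------------------+-------------------------+--------------------------+----------------------------+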
I need my output to be as below Key and Value. How do I do this? Do I need to create a schema?
+---+--------------+
|ID |todo          |
+---+--------------+
|1  |wakeup, shower|
|2  |brush, eat    |
|3  |read, write   |
|4  |sleep, snooze |
+---+--------------+
The key-value which you refer to is a struct. "keys" are struct field names, while "values" are field values.
What you want to do is called unpivoting. One of the ways to do it in PySpark is using stack. The following is a dynamic approach, where you don't need to hard-code the existing column names.
Input dataframe:
df = spark.createDataFrame(
[((['wakeup', 'shower'],),(['brush', 'eat'],),(['read', 'write'],),(['sleep', 'snooze'],))],
'`1` struct<todo:array<string>>, `2` struct<todo:array<string>>, `3` struct<todo:array<string>>, `4` struct<todo:array<string>>')
Script:
to_melt = [f"\'{c}\', `{c}`.todo" for c in df.columns]
df = df.selectExpr(f"stack({len(to_melt)}, {','.join(to_melt)}) (ID, todo)")
df.show()
# +---+----------------+
# | ID|            todo|
# +---+----------------+
# |  1|[wakeup, shower]|
# |  2|    [brush, eat]|
# |  3|   [read, write]|
# |  4| [sleep, snooze]|
# +---+----------------+
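It can help to look at the expression string that the script hands to selectExpr. The sketch below rebuilds it in plain Python; the list columns is an assumption that stands in for df.columns of the input dataframe above.

```python
# Stand-in for df.columns of the input dataframe (assumption for illustration)
columns = ['1', '2', '3', '4']

# One "'<name>', `<name>`.todo" pair per column: a literal ID plus the struct field
to_melt = [f"'{c}', `{c}`.todo" for c in columns]

# stack(n, ...) turns the n (ID, todo) pairs into n rows
expr = f"stack({len(to_melt)}, {','.join(to_melt)}) (ID, todo)"
print(expr)
# stack(4, '1', `1`.todo,'2', `2`.todo,'3', `3`.todo,'4', `4`.todo) (ID, todo)
```

Because the string is generated from the column list, the same script keeps working when columns are added or removed.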
Use from_json to parse each JSON string, then explode to put each element on its own row.
data
df = spark.createDataFrame(
[(('{"todo":"[wakeup, shower]"}'),('{"todo":"[brush, eat]"}'),('{"todo":"[read, write]"}'),('{"todo":"[sleep, snooze]"}'))],
('value1','values2','value3','value4'))
code
from pyspark.sql.functions import array, explode, flatten, from_json, map_values, translate
new = (df.withColumn('todo', explode(flatten(array(*[map_values(from_json(x, "MAP<STRING,STRING>")) for x in df.columns]))))  # From string to array to individual row
    .withColumn('todo', translate('todo', "[]", ''))  # Remove square brackets
)
new.show(truncate=False)
outcome
+---------------------------+-----------------------+------------------------+--------------------------+--------------+
|value1 |values2 |value3 |value4 |todo |
+---------------------------+-----------------------+------------------------+--------------------------+--------------+
|{"todo":"[wakeup, shower]"}|{"todo":"[brush, eat]"}|{"todo":"[read, write]"}|{"todo":"[sleep, snooze]"}|wakeup, shower|
|{"todo":"[wakeup, shower]"}|{"todo":"[brush, eat]"}|{"todo":"[read, write]"}|{"todo":"[sleep, snooze]"}|brush, eat |
|{"todo":"[wakeup, shower]"}|{"todo":"[brush, eat]"}|{"todo":"[read, write]"}|{"todo":"[sleep, snooze]"}|read, write |
|{"todo":"[wakeup, shower]"}|{"todo":"[brush, eat]"}|{"todo":"[read, write]"}|{"todo":"[sleep, snooze]"}|sleep, snooze |
+---------------------------+-----------------------+------------------------+--------------------------+--------------+
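The parse, flatten, and strip steps can be sketched without Spark using the standard json module. This is only an illustration of the idea under the same sample data; extract_todos is a hypothetical helper name.

```python
import json

# One row of JSON strings, copied from the sample data above
row = ['{"todo":"[wakeup, shower]"}', '{"todo":"[brush, eat]"}',
       '{"todo":"[read, write]"}', '{"todo":"[sleep, snooze]"}']

def extract_todos(json_strings):
    """Parse each JSON string, take its values, and strip the square brackets."""
    strip_brackets = str.maketrans('', '', '[]')
    return [value.translate(strip_brackets)
            for s in json_strings
            for value in json.loads(s).values()]

todos = extract_todos(row)
print(todos)  # ['wakeup, shower', 'brush, eat', 'read, write', 'sleep, snooze']
```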

Filter out rows in Spark dataframe based on condition

Example Spark dataframe:
product type
table Furniture
chair Furniture
TV Electronic
.
.
I want to drop all the rows with type as Electronic if there exists any row where type is Furniture.
The real data here has millions of rows.
The easy way is to count the rows with type Furniture and, if the count is greater than zero, drop the rows with type Electronic, but this would be inefficient.
Is there a way to do this efficiently?
Not sure if it's exposed in the PySpark API, but you can use ANY in an expression:
from pyspark.sql.functions import col
# collect() is a method; it returns a list of Row objects
chk = df.selectExpr('ANY(type = "Furniture") as chk').collect()[0]["chk"]
if chk:
    df_filtered = df.where(col("type") != "Electronic")
else:
    df_filtered = df
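The control flow above (one pass to check for Furniture, then a conditional filter) looks like this in plain Python. It is a sketch on the sample rows from the question, and drop_electronic_if_furniture is a hypothetical name.

```python
# Sample (product, type) rows from the question
rows = [("table", "Furniture"), ("chair", "Furniture"), ("TV", "Electronic")]

def drop_electronic_if_furniture(rows):
    """Drop every Electronic row, but only if at least one Furniture row exists."""
    # any() stops at the first match, so it does not need a full count
    has_furniture = any(type_ == "Furniture" for _, type_ in rows)
    if has_furniture:
        return [(p, t) for p, t in rows if t != "Electronic"]
    return list(rows)

filtered = drop_electronic_if_furniture(rows)
print(filtered)  # [('table', 'Furniture'), ('chair', 'Furniture')]
```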
As far as I can understand, if any product is classified as Furniture, you want to remove that product's classification as Electronic. E.g., if TV is classified as both Electronic and Furniture, you would like to remove the Electronic classification, so that TV would only be classified as Furniture.
You will have to do some kind of aggregation. The following is a way using window functions:
from pyspark.sql import functions as F, Window as W
df = spark.createDataFrame(
[('table', 'Furniture'),
('chair', 'Furniture'),
('TV', 'Electronic'),
('TV', 'Furniture')],
['product', 'type'])
w = W.partitionBy('product').rowsBetween(W.unboundedPreceding, W.unboundedFollowing)
df = df.withColumn('_types', F.collect_set('type').over(w))
df = df.filter((F.col('type') != 'Electronic') | F.forall('_types', lambda x: x != 'Furniture'))
df = df.drop('_types')
df.show()
# +-------+---------+
# |product|     type|
# +-------+---------+
# |     TV|Furniture|
# |  chair|Furniture|
# |  table|Furniture|
# +-------+---------+
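The per-product reading of the question can be sketched in plain Python as well: collect the set of types per product (the role of collect_set over the window), then apply the same keep condition as the filter. The function name and the extra radio row are assumptions added for illustration.

```python
from collections import defaultdict

# Sample rows from the answer, plus a hypothetical product that is only Electronic
rows = [("table", "Furniture"), ("chair", "Furniture"),
        ("TV", "Electronic"), ("TV", "Furniture"), ("radio", "Electronic")]

def drop_electronic_per_product(rows):
    """Drop a product's Electronic row only if that product also has a Furniture row."""
    types = defaultdict(set)
    for product, type_ in rows:
        types[product].add(type_)
    return [(p, t) for p, t in rows
            if t != "Electronic" or "Furniture" not in types[p]]

kept = drop_electronic_per_product(rows)
print(kept)
# [('table', 'Furniture'), ('chair', 'Furniture'), ('TV', 'Furniture'), ('radio', 'Electronic')]
```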

How to add multiple column dynamically based on filter condition

I am trying to create multiple columns dynamically, based on a filter condition, after comparing two DataFrames with the code below.
source_df
+---+-----+-----+------+
|key|val11|val12|date  |
+---+-----+-----+------+
|abc|  1.1| john|2-3-21|
|def|  3.0| dani|2-2-21|
+---+-----+-----+------+
dest_df
+---+-----+-----+------+
|key|val11|val12|date  |
+---+-----+-----+------+
|abc|  2.1| jack|2-3-21|
|def|  3.0| dani|2-2-21|
+---+-----+-----+------+
columns = source_df.columns[1:]
joined_df = source_df\
    .join(dest_df, 'key', 'full')
for column in columns:
    column_name = "difference_in_" + str(column)
    report = joined_df\
        .filter((source_df[column] != dest_df[column]))\
        .withColumn(column_name, F.concat(F.lit('[src:'), source_df[column], F.lit(',dst:'), dest_df[column], F.lit(']')))
The output I expect is
#Expected
+---+-------------------+-------------------+
|key|difference_in_val11|difference_in_val12|
+---+-------------------+-------------------+
|abc|  [src:1.1,dst:2.1]|[src:john,dst:jack]|
+---+-------------------+-------------------+
I get only one column in the result:
#Actual
+---+-------------------+
|key|difference_in_val12|
+---+-------------------+
|abc|[src:john,dst:jack]|
+---+-------------------+
How can I generate multiple columns dynamically based on a filter condition?
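DataFrames are immutable, so withColumn returns a new DataFrame rather than modifying joined_df. In your loop, every iteration starts again from joined_df and overwrites report, so only the column from the last iteration survives. You need to build each iteration on the DataFrame produced by the previous one. Something like this: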
from pyspark.sql import functions as F
columns = source_df.columns[1:]
# Start from the joined DataFrame and keep reassigning `report`,
# so each iteration builds on the result of the previous one
report = source_df.join(dest_df, 'key', 'full')
for column in columns:
    column_name = "difference_in_" + str(column)
    # Chained filters keep only the rows where every compared column differs
    report = report\
        .filter(source_df[column] != dest_df[column])\
        .withColumn(column_name, F.concat(F.lit('[src:'), source_df[column], F.lit(',dst:'), dest_df[column], F.lit(']')))
report.show()
Output -
+---+-----+-----+-----+-----+-------------------+-------------------+
|key|val11|val12|val11|val12|difference_in_val11|difference_in_val12|
+---+-----+-----+-----+-----+-------------------+-------------------+
|abc|  1.1| john|  2.1| jack|  [src:1.1,dst:2.1]|[src:john,dst:jack]|
+---+-----+-----+-----+-----+-------------------+-------------------+
You could also do this with a union of both DataFrames, collecting the list only if the size of collect_set is greater than 1. This avoids joining the DataFrames:
from pyspark.sql import functions as F
cols = source_df.drop("key").columns
output = (source_df.withColumn("ref", F.lit("src:"))
    .unionByName(dest_df.withColumn("ref", F.lit("dst:")))
    .groupBy("key")
    # collect_set has more than one element only when source and destination differ
    .agg(*[F.when(F.size(F.collect_set(i)) > 1, F.collect_list(F.concat("ref", i))).alias(i)
           for i in cols])
    .dropna(subset=cols, how='all')
)
output.show()
+---+------------------+--------------------+
|key|             val11|               val12|
+---+------------------+--------------------+
|abc|[src:1.1, dst:2.1]|[src:john, dst:jack]|
+---+------------------+--------------------+
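For comparison, here is a minimal plain-Python sketch of a per-column difference report on the sample data. It assumes each DataFrame is represented as a dict keyed by key; diff_report is a hypothetical name. Unlike chained filters, it keeps a key if at least one column differs and lists only the differing columns.

```python
# Sample data from the question, as dicts keyed by `key` (assumed representation)
source = {"abc": {"val11": 1.1, "val12": "john", "date": "2-3-21"},
          "def": {"val11": 3.0, "val12": "dani", "date": "2-2-21"}}
dest = {"abc": {"val11": 2.1, "val12": "jack", "date": "2-3-21"},
        "def": {"val11": 3.0, "val12": "dani", "date": "2-2-21"}}

def diff_report(source, dest, columns):
    """Return {key: {difference_in_<col>: '[src:..,dst:..]'}} for differing columns."""
    report = {}
    for key in sorted(source.keys() & dest.keys()):
        diffs = {f"difference_in_{c}": f"[src:{source[key][c]},dst:{dest[key][c]}]"
                 for c in columns if source[key][c] != dest[key][c]}
        if diffs:  # skip keys where nothing differs
            report[key] = diffs
    return report

report = diff_report(source, dest, ["val11", "val12", "date"])
print(report)
# {'abc': {'difference_in_val11': '[src:1.1,dst:2.1]', 'difference_in_val12': '[src:john,dst:jack]'}}
```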

Join two dataframes based on common value in column (which is array)

I have one dataframe - df_similar_strings, which looks like this:
|---------------------|
| string_values |
|---------------------|
| ['catish', 'cat'] |
|---------------------|
| ['doggo', 'dogy'] |
|---------------------|
and the other one - df_source:
|-----------------------------|------------------|
| values | key_value |
|-----------------------------|------------------|
| ['catish', 'cat', 'cat-'] | cat |
|-----------------------------|------------------|
| ['doggo', 'dogy', 'dog'] | dog |
|-----------------------------|------------------|
I would like to join those data frames based on the column string_values and values so that there is at least one value matching.
I have no idea how to do this since the columns are nested as arrays.
You just need to cast your lists to tuples and then try merging. Lists are unhashable, so a merge can't be applied to them directly. Try this:
df_source["values"] = df_source["values"].apply(tuple)
Do the same with the other DataFrame, then merge using pd.merge. Note that a merge on tuples only matches rows whose tuples are exactly equal.
You can solve it by first doing a Cartesian product of your two DataFrames and then dropping all rows that don't have any shared value.
For simplicity, I assume the columns in both datasets have the same name ("values"). I also assume the lists don't have repeated values (each value appears once).
from collections import Counter
def find_duplicates(arr):
    # values that appear twice in the concatenated list are shared by both lists
    return [item for item, count in Counter(arr).items() if count == 2]
df1['key'] = 1
df2['key'] = 1
cartes_prod_df = df1.merge(df2, on=['key'], how='outer').drop(columns=['key'])
duplicate_values = (cartes_prod_df.values_x + cartes_prod_df.values_y).apply(find_duplicates)
merged_df = cartes_prod_df[duplicate_values.apply(lambda x: len(x) > 0)]
I've used a little trick in order to do the cartesian product (Adding the key column), and then the duplicate_values found from the joint array (using the + operator) are the values which appeared twice in the joint array.
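The same "pair everything, keep the pairs that share a value" idea can be sketched in plain Python with set intersection, using the data from the question. This is an illustration only; overlap_join is a hypothetical name, and like the Cartesian product it compares every pair of rows.

```python
# Data from the question
similar_strings = [['catish', 'cat'], ['doggo', 'dogy']]
source = [(['catish', 'cat', 'cat-'], 'cat'), (['doggo', 'dogy', 'dog'], 'dog')]

def overlap_join(similar_strings, source):
    """Pair each list of similar strings with every source row that shares a value."""
    return [(strings, values, key_value)
            for strings in similar_strings
            for values, key_value in source
            if set(strings) & set(values)]  # non-empty intersection means a match

joined = overlap_join(similar_strings, source)
for strings, values, key_value in joined:
    print(strings, values, key_value)
# ['catish', 'cat'] ['catish', 'cat', 'cat-'] cat
# ['doggo', 'dogy'] ['doggo', 'dogy', 'dog'] dog
```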
UPDATE
In order to supply a full example, here's an example of df1 and df2:
d1 = {'values': [['A','B'],['B','C'],['D']],'otherkey':[1,2,3]}
d2 = {'values': [['A'],['B'],['A','C'],['D']],'otherkey':[4,5,3,6]}
df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)
Now, merged_df would give the output:

Merge multiple spark rows to one

I have a dataframe which looks like the one given below. All the values for a given id are the same except for the mappingcol field.
+--------------------+----------------+--------------------+-------+
|misc |fruit |mappingcol |id |
+--------------------+----------------+--------------------+-------+
|ddd |apple |Map("name"->"Sameer"| 1 |
|ref |banana |Map("name"->"Riyazi"| 2 |
|ref |banana |Map("lname"->"Nikki"| 2 |
|ddd |apple |Map("lname"->"tenka"| 1 |
+--------------------+----------------+--------------------+-------+
I want to merge the rows with the same id so that I get exactly one row per id, with the values of mappingcol merged. The output should look like:
+--------------------+----------------+--------------------+-------+
|misc |fruit |mappingcol |id |
+--------------------+----------------+--------------------+-------+
|ddd |apple |Map("name"->"Sameer"| 1 |
|ref |banana |Map("name"->"Riyazi"| 2 |
+--------------------+----------------+--------------------+-------+
the value for mappingcol for id = 1 would be :
Map(
"name" -> "Sameer",
"lname" -> "tenka"
)
I know that maps can be merged using the ++ operator, so that's not what I'm worried about. I just can't understand how to merge the rows, because if I use a groupBy, I have nothing to aggregate the rows on.
You can use groupBy and then manipulate the map a little:
df.groupBy("id", "fruit", "misc").agg(collect_list("mappingcol"))
  .as[(Int, String, String, Seq[Map[String, String]])]
  .map { case (id, fruit, misc, list) => (id, fruit, misc, list.reduce(_ ++ _)) }
  .toDF("id", "fruit", "misc", "mappingColumn")
With the first line, you group by your desired columns and aggregate the map pairs into the same element (an array).
With the second line (as), you convert your structure to a Dataset of Tuple4, with the last element being a sequence of maps.
With the third line (map), you merge all the elements into a single map.
With the last line (toDF), you give the columns their original names.
OUTPUT
+---+------+----+--------------------------------+
|id |fruit |misc|mappingColumn |
+---+------+----+--------------------------------+
|1 |apple |ddd |[name -> Sameer, lname -> tenka]|
|2 |banana|ref |[name -> Riyazi, lname -> Nikki]|
+---+------+----+--------------------------------+
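The group-then-merge idea can be sketched in plain Python too, with dict.update playing the role of ++ on maps. The rows below are the sample data from the question, and merge_rows is a hypothetical name.

```python
from collections import defaultdict

# Sample rows from the question: (misc, fruit, mappingcol, id)
rows = [("ddd", "apple", {"name": "Sameer"}, 1),
        ("ref", "banana", {"name": "Riyazi"}, 2),
        ("ref", "banana", {"lname": "Nikki"}, 2),
        ("ddd", "apple", {"lname": "tenka"}, 1)]

def merge_rows(rows):
    """Group by (id, fruit, misc) and merge the maps of each group into one."""
    merged = defaultdict(dict)
    for misc, fruit, mapping, id_ in rows:
        # on duplicate keys, later rows overwrite earlier ones
        merged[(id_, fruit, misc)].update(mapping)
    return dict(merged)

merged = merge_rows(rows)
for group, mapping in merged.items():
    print(group, mapping)
# (1, 'apple', 'ddd') {'name': 'Sameer', 'lname': 'tenka'}
# (2, 'banana', 'ref') {'name': 'Riyazi', 'lname': 'Nikki'}
```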
You can definitely do the above with a Window function!
This is in PySpark, not Scala, but there's almost no difference when only using native Spark functions.
The code below only works on a map column that has one key-value pair per row, as in your example data, but it can be made to work with map columns that have multiple entries.
from pyspark.sql import Window, functions as F
map_col = 'mappingColumn'
group_cols = ['id', 'fruit', 'misc']
# or, a lazier way if you have a lot of columns to group on:
# keep every column except the map column
# (list.remove() returns None, so filter the list instead)
group_cols_2 = [c for c in df.columns if c != map_col]
w = Window.partitionBy(group_cols)
# unpack map value and key into a pair struct column
df1 = df.withColumn(map_col, F.struct(F.map_keys(map_col)[0], F.map_values(map_col)[0]))
# Collect all key values into an array of structs, here each row
# contains the map entries for all rows in the group/window
df1 = df1.withColumn(map_col, F.collect_list(map_col).over(w))
# drop duplicate values, as you only want one row per group
df1 = df1.dropDuplicates(group_cols)
# return the values for map type
df1 = df1.withColumn(map_col, F.map_from_entries(map_col))
You can save the output of each step to a new column to see how each step works, as I have done below.
from pyspark.sql import Window, functions as F
map_col = 'mappingColumn'
# list('id', 'fruit', 'misc') raises a TypeError; use a list literal
group_cols = ['id', 'fruit', 'misc']
w = Window.partitionBy(group_cols)
df1 = df.withColumn('test', F.struct(F.map_keys(map_col)[0], F.map_values(map_col)[0]))
df1 = df1.withColumn('test1', F.collect_list('test').over(w))
df1 = df1.withColumn('test2', F.map_from_entries('test1'))
df1.show(truncate=False)
df1.printSchema()
df1 = df1.dropDuplicates(group_cols)