Need to perform a multi-column join on a dataframe with a look-up dataframe

I have two dataframes like so
+---+---+---+---+---+
| c1| c2| c3| c4| c5|
+---+---+---+---+---+
| 0| 1| 2| 3| 4|
| 5| 6| 7| 8| 9|
+---+---+---+---+---+
+---+---+
|key|val|
+---+---+
| 0| A|
| 1| B|
| 2| C|
| 3| D|
| 4| E|
| 5| F|
| 6| G|
| 7| H|
| 8| I|
| 9| J|
+---+---+
I want to look up each column of df1 against the key column in df2 and return the corresponding val from df2 for each.
Here is the code to produce the two input dataframes:
df1 = sc.parallelize([('0','1','2','3','4',), ('5','6','7','8','9',)]).toDF(['c1','c2','c3','c4','c5'])
df1.show()
df2 = sc.parallelize([('0','A',), ('1','B',), ('2','C',), ('3','D',), ('4','E',),
                      ('5','F',), ('6','G',), ('7','H',), ('8','I',), ('9','J',)]).toDF(['key','val'])
df2.show()
I want to join the above to produce the following
+---+---+---+---+---+---+---+---+---+---+
| c1| c2| c3| c4| c5|lu1|lu2|lu3|lu4|lu5|
+---+---+---+---+---+---+---+---+---+---+
|  0|  1|  2|  3|  4|  A|  B|  C|  D|  E|
|  5|  6|  7|  8|  9|  F|  G|  H|  I|  J|
+---+---+---+---+---+---+---+---+---+---+
I can get it to work for a single column like so, but I'm not sure how to extend it to all columns:
df1.join(df2, df1.c1 == df2.key).select('c1','val').show()
+---+---+
| c1|val|
+---+---+
| 0| A|
| 5| F|
+---+---+

You can just chain the joins:
df1 \
    .join(df2, on=df1.c1 == df2.key, how='left') \
    .withColumnRenamed('val', 'lu1') \
    .join(df2, on=df1.c2 == df2.key, how='left') \
    .withColumnRenamed('val', 'lu2')
    # ...and so on for c3, c4 and c5
You can even do it in a loop, but don't do it with too many columns:
from pyspark.sql import functions as f
df = df1
for i in range(1, 6):
    df = df \
        .join(df2.alias(str(i)), on=f.col('c{}'.format(i)) == f.col('{}.key'.format(i)), how='left') \
        .withColumnRenamed('val', 'lu{}'.format(i))

df \
    .select('c1', 'c2', 'c3', 'c4', 'c5', 'lu1', 'lu2', 'lu3', 'lu4', 'lu5') \
    .show()
output
+---+---+---+---+---+---+---+---+---+---+
| c1| c2| c3| c4| c5|lu1|lu2|lu3|lu4|lu5|
+---+---+---+---+---+---+---+---+---+---+
| 5| 6| 7| 8| 9| F| G| H| I| J|
| 0| 1| 2| 3| 4| A| B| C| D| E|
+---+---+---+---+---+---+---+---+---+---+
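If df2 really is a small lookup table, a broadcast hint keeps each of those chained joins cheap, because the lookup is shipped to the executors once instead of being shuffled for every join. Below is a minimal sketch of the same loop with the hint added (assuming df1 and df2 as defined in the question):
from pyspark.sql import functions as f

df = df1
for i in range(1, 6):
    df = df \
        .join(f.broadcast(df2.alias(str(i))),  # hint: df2 is small, broadcast it
              on=f.col('c{}'.format(i)) == f.col('{}.key'.format(i)),
              how='left') \
        .withColumnRenamed('val', 'lu{}'.format(i))

df.select('c1', 'c2', 'c3', 'c4', 'c5', 'lu1', 'lu2', 'lu3', 'lu4', 'lu5').show()
The hint is only worthwhile when df2 comfortably fits in executor memory, which is certainly the case for a 10-row lookup table.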

Related

Strategy to convert a correlated Hive query to a pyspark dataframe transformation?

I need to convert the SQL query below into a pyspark dataframe transformation. There is a correlated subquery defined inside the select clause. Is there any way to convert this to pyspark dataframe transformations? I'd appreciate it if you could share articles about this.
Note: the acc_cap table is also created from test_db.test_table after adding a prev_time column using the lag window function on the time column.
Query:
SELECT
  A.id,
  "psmark" fid,
  (
    SELECT DISTINCT psmark
    FROM test_db.test_table
    WHERE id = A.id AND time = A.prev_time AND rnk = 1
  ) AS fromvalue
FROM acc_cap A;
I have taken two sample dataframes, acc_cap_df and test_table_df.
Below is the pyspark equivalent of your query:
>>> acc_cap_df=spark.read.csv("/path to/sample2.csv",header=True)
>>> acc_cap_df.show()
+---+-------+
| id| time|
+---+-------+
|003| 174256|
|003| 174267|
|003| 17429|
|003|1747567|
|001| 10|
|001| 10|
|004| 12719|
|002| 11|
|002| 117|
|002| 11878|
+---+-------+
>>> from pyspark.sql.functions import rank, lag, col
>>> from pyspark.sql.window import Window
>>> windowSpec = Window.partitionBy("id").orderBy("time")
>>> acc_cap_df = acc_cap_df.withColumn("rank", rank().over(windowSpec))
>>> acc_cap_df.show()
+---+-------+----+
| id| time|rank|
+---+-------+----+
|003| 174256| 1|
|003| 174267| 2|
|003| 17429| 3|
|003|1747567| 4|
|001| 10| 1|
|001| 10| 1|
|004| 12719| 1|
|002| 11| 1|
|002| 117| 2|
|002| 11878| 3|
+---+-------+----+
>>> acc_cap_df=acc_cap_df.withColumn("prev_time",lag("time",2).over(windowSpec))
>>> acc_cap_df.show()
+---+-------+----+---------+
| id| time|rank|prev_time|
+---+-------+----+---------+
|003| 174256| 1| null|
|003| 174267| 2| 174256|
|003| 17429| 3| 174267|
|003|1747567| 4| 17429|
|001| 10| 1| null|
|001| 10| 1| 10|
|004| 12719| 1| null|
|002| 11| 1| null|
|002| 117| 2| 11|
|002| 11878| 3| 117|
+---+-------+----+---------+
>>> test_table_df=spark.read.csv("/path to/sample1.csv",header=True)
>>> test_table_df.show()
+--------+---+----+
| psmark| id|time|
+--------+---+----+
| csvvsw|001| 10|
|csvvswfw|002| 11|
| csvvsgg|003| 12|
| csvvser|004| 13|
+--------+---+----+
>>> acc_cap_df=acc_cap_df.withColumnRenamed("id","acc_cap_id")
>>> acc_cap_df.show()
+----------+-------+----+
|acc_cap_id| time|rank|
+----------+-------+----+
| 003| 174256| 1|
| 003| 174267| 2|
| 003| 17429| 3|
| 003|1747567| 4|
| 001| 10| 1|
| 001| 10| 1|
| 004| 12719| 1|
| 002| 11| 1|
| 002| 117| 2|
| 002| 11878| 3|
+----------+-------+----+
>>> df_join = test_table_df.join(acc_cap_df, (test_table_df.id == acc_cap_df.acc_cap_id) & (test_table_df.time == acc_cap_df.time)).filter(col("rank") == 1)
>>> df_join.show()
+--------+---+----+----------+----+----+
| psmark| id|time|acc_cap_id|time|rank|
+--------+---+----+----------+----+----+
| csvvsw|001| 10| 001| 10| 1|
| csvvsw|001| 10| 001| 10| 1|
|csvvswfw|002| 11| 002| 11| 1|
+--------+---+----+----------+----+----+
>>> df_output=df_join.select('psmark','id').distinct()
>>> df_output=df_output.withColumnRenamed("psmark","fid")
>>> df_output.show()
+--------+---+
| fid| id|
+--------+---+
| csvvsw|001|
|csvvswfw|002|
+--------+---+
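If you need the predicate from the original query verbatim (time = A.prev_time rather than a plain time-to-time match), the lookup can be joined against the prev_time column computed with lag above. A hedged sketch reusing acc_cap_df and test_table_df, assuming prev_time is still present on acc_cap_df; the rnk = 1 filter is left out because the sample test_table has no rnk column:
>>> from pyspark.sql.functions import lit
>>> df_fromvalue = acc_cap_df.join(
...     test_table_df,
...     (acc_cap_df.acc_cap_id == test_table_df.id) & (acc_cap_df.prev_time == test_table_df.time),
...     "left"
... ).select(acc_cap_df.acc_cap_id.alias("id"),
...          lit("psmark").alias("fid"),
...          test_table_df.psmark.alias("fromvalue")).distinct()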

Pyspark sum of columns after union of dataframe

How can I sum all columns after unioning two dataframes?
I have this first df with one row per user:
df = sqlContext.createDataFrame([("2022-01-10", 3, 2,"a"),("2022-01-10",3,4,"b"),("2022-01-10", 1,3,"c")], ["date", "value1", "value2", "userid"])
df.show()
+----------+------+------+------+
| date|value1|value2|userid|
+----------+------+------+------+
|2022-01-10| 3| 2| a|
|2022-01-10| 3| 4| b|
|2022-01-10| 1| 3| c|
+----------+------+------+------+
The date value will always be today's date.
And I have another df, with multiple rows per userid this time, one value for each day:
df2 = sqlContext.createDataFrame([("2022-01-01", 13, 12, "a"), ("2022-01-02", 13, 14, "b"), ("2022-01-03", 11, 13, "c"),
                                  ("2022-01-04", 3, 2, "a"), ("2022-01-05", 3, 4, "b"), ("2022-01-06", 1, 3, "c"),
                                  ("2022-01-10", 31, 21, "a"), ("2022-01-07", 31, 41, "b"), ("2022-01-09", 11, 31, "c")],
                                 ["date", "value3", "value4", "userid"])
df2.show()
+----------+------+------+------+
| date|value3|value4|userid|
+----------+------+------+------+
|2022-01-01| 13| 12| a|
|2022-01-02| 13| 14| b|
|2022-01-03| 11| 13| c|
|2022-01-04| 3| 2| a|
|2022-01-05| 3| 4| b|
|2022-01-06| 1| 3| c|
|2022-01-10| 31| 21| a|
|2022-01-07| 31| 41| b|
|2022-01-09| 11| 31| c|
+----------+------+------+------+
After unioning the two of them with this function, here is what I get:
def union_different_tables(df1, df2):
    columns_df1 = df1.columns
    columns_df2 = df2.columns
    data_types_df1 = [i.dataType for i in df1.schema.fields]
    data_types_df2 = [i.dataType for i in df2.schema.fields]
    for col, _type in zip(columns_df1, data_types_df1):
        if col not in df2.columns:
            df2 = df2.withColumn(col, f.lit(None).cast(_type))
    for col, _type in zip(columns_df2, data_types_df2):
        if col not in df1.columns:
            df1 = df1.withColumn(col, f.lit(None).cast(_type))
    union = df1.unionByName(df2)
    return union
+----------+------+------+------+------+------+
| date|value1|value2|userid|value3|value4|
+----------+------+------+------+------+------+
|2022-01-10| 3| 2| a| null| null|
|2022-01-10| 3| 4| b| null| null|
|2022-01-10| 1| 3| c| null| null|
|2022-01-01| null| null| a| 13| 12|
|2022-01-02| null| null| b| 13| 14|
|2022-01-03| null| null| c| 11| 13|
|2022-01-04| null| null| a| 3| 2|
|2022-01-05| null| null| b| 3| 4|
|2022-01-06| null| null| c| 1| 3|
|2022-01-10| null| null| a| 31| 21|
|2022-01-07| null| null| b| 31| 41|
|2022-01-09| null| null| c| 11| 31|
+----------+------+------+------+------+------+
What I want to get is the sum of all the columns of df2 (I have 10 of them in the real case) up to today's date for each userid, so one row per user:
+----------+------+------+------+------+------+
| date|value1|value2|userid|value3|value4|
+----------+------+------+------+------+------+
|2022-01-10| 3| 2| a| 47 | 35 |
|2022-01-10| 3| 4| b| 47 | 59 |
|2022-01-10| 1| 3| c| 23 | 47 |
+----------+------+------+------+------+------+
Since I have to do this operation for multiple tables, here is what I tried:
user_window = Window.partitionBy(['userid']).orderBy('date')
list_tables = [df2]
list_col_original = df.columns
for table in list_tables:
    df = union_different_tables(df, table)
    list_column = list(set(table.columns) - set(list_col_original))
    list_col_original.extend(list_column)
    df = df.select('userid',
                   *[f.sum(f.col(col_name)).over(user_window).alias(col_name)
                     for col_name in list_column])
df.show()
+------+------+------+
|userid|value4|value3|
+------+------+------+
| c| 13| 11|
| c| 16| 12|
| c| 47| 23|
| c| 47| 23|
| b| 14| 13|
| b| 18| 16|
| b| 59| 47|
| b| 59| 47|
| a| 12| 13|
| a| 14| 16|
| a| 35| 47|
| a| 35| 47|
+------+------+------+
But that gives me a sort of cumulative sum, and I haven't found a way to keep all the columns in the resulting df.
The only constraint is that I can't do any join! My dataframes are very large and any join takes too long to compute.
Do you know how I can fix my code to have the result I want ?
After the union of df1 and df2, you can group by userid and sum all columns except date, for which you take the max.
Note that for the union part, you can simply use DataFrame.unionByName with allowMissingColumns=True when the shared columns have the same data types and only the set of columns differs:
df = df1.unionByName(df2, allowMissingColumns=True)
Then group by and agg:
import pyspark.sql.functions as F
result = df.groupBy("userid").agg(
    F.max("date").alias("date"),
    *[F.sum(c).alias(c) for c in df.columns if c not in ("date", "userid")]
)
result.show()
#+------+----------+------+------+------+------+
#|userid| date|value1|value2|value3|value4|
#+------+----------+------+------+------+------+
#| a|2022-01-10| 3| 2| 47| 35|
#| b|2022-01-10| 3| 4| 47| 59|
#| c|2022-01-10| 1| 3| 23| 47|
#+------+----------+------+------+------+------+
This supposes the second dataframe contains only dates up to the current date in the first one. Otherwise, you'll need to filter df2 before the union.
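If that filter is needed, here is a minimal sketch of the pre-filter, assuming the date column is an ISO-formatted (yyyy-MM-dd) string as in the sample data:
import pyspark.sql.functions as F

# keep only df2 rows up to (and including) today's date before the union
df2_filtered = df2.filter(F.to_date("date") <= F.current_date())
df = df1.unionByName(df2_filtered, allowMissingColumns=True)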

Map Spark DF to (row_number, column_number, value) format

I have a Dataframe in the following shape
1 2
5 9
How can I convert it to (row_num, col_num, value) format?
0 0 1
0 1 2
1 0 5
1 1 9
Is there any way to apply some function or any mapper?
Thanks in advance
Check the code below.
scala> import org.apache.spark.sql.expressions._
import org.apache.spark.sql.expressions._
scala> val colExpr = array(df.columns.zipWithIndex.map(c => struct(lit(c._2).as("col_name"),col(c._1).as("value"))):_*)
colExpr: org.apache.spark.sql.Column = array(named_struct(col_name, 0 AS `col_name`, NamePlaceholder(), a AS `value`), named_struct(col_name, 1 AS `col_name`, NamePlaceholder(), b AS `value`))
scala> df.withColumn("row_number",lit(row_number().over(Window.orderBy(lit(1)))-1)).withColumn("data",explode(colExpr)).select($"row_number",$"data.*").show(false)
+----------+--------+-----+
|row_number|col_name|value|
+----------+--------+-----+
|0 |0 |1 |
|0 |1 |2 |
|1 |0 |5 |
|1 |1 |9 |
+----------+--------+-----+
You can do it by transposing the data as:
from pyspark.sql.functions import *
from pyspark.sql import Window
df = spark.createDataFrame([(1,2),(5,9)],['col1','col2'])
#renaming the columns based on their position
df = df.toDF(*list(map(lambda x: str(x),[*range(len(df.columns))])))
#Transposing the dataframe as required
col_list = ','.join([f'{i},`{i}`' for i in df.columns])
n_cols = len(df.columns)  # number of (col_id, col_value) pairs for stack()
df.withColumn('row_id', lit(row_number().over(Window.orderBy(lit(1))) - 1)) \
  .select('row_id', expr(f'''stack({n_cols},{col_list}) as (col_id,col_value)''')).show()
+------+------+---------+
|row_id|col_id|col_value|
+------+------+---------+
| 0| 0| 1|
| 0| 1| 2|
| 1| 0| 5|
| 1| 1| 9|
+------+------+---------+
In pyspark, row_number() and posexplode() will be helpful. Try this:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
tst = sqlContext.createDataFrame([(1,7,80),(1,8,40),(1,5,100),(5,8,90),(7,6,50),(0,3,60)], schema=['col1','col2','col3'])
tst1 = tst.withColumn("row_number", F.row_number().over(Window.orderBy(F.lit(1))) - 1)
#%%
tst_arr = tst1.withColumn("arr", F.array(tst.columns))
tst_new = tst_arr.select('row_number', 'arr').select('row_number', F.posexplode('arr'))
results:
In [47]: tst_new.show()
+----------+---+---+
|row_number|pos|col|
+----------+---+---+
| 0| 0| 1|
| 0| 1| 7|
| 0| 2| 80|
| 1| 0| 1|
| 1| 1| 8|
| 1| 2| 40|
| 2| 0| 1|
| 2| 1| 5|
| 2| 2|100|
| 3| 0| 5|
| 3| 1| 8|
| 3| 2| 90|
| 4| 0| 7|
| 4| 1| 6|
| 4| 2| 50|
| 5| 0| 0|
| 5| 1| 3|
| 5| 2| 60|
+----------+---+---+

Pyspark dataframes group by

I have a dataframe like below:
+-----+-----+-----+
|  123|  124|  125|
+-----+-----+-----+
|    1|    2|    3|
|    9|    9|    4|
|    4|   12|    1|
|    2|    4|    8|
|    7|    6|    3|
|   19|   11|    2|
|   21|   10|   10|
+-----+-----+-----+
I need the data to be in the form
1:[123,125]
2:[123,124,125]
3:[125]
The order is not required to be sorted. I am new to dataframes in pyspark; any help would be appreciated.
There are no melt or pivot APIs in pyspark that will accomplish this directly. Instead, flatMap from the RDD into a new dataframe and aggregate:
df.show()
+---+---+---+
|123|124|125|
+---+---+---+
| 1| 2| 3|
| 9| 9| 4|
| 4| 12| 1|
| 2| 4| 8|
| 7| 6| 3|
| 19| 11| 2|
| 21| 10| 10|
+---+---+---+
For each column of each row in the RDD, output a row with two columns: the value and the column name:
cols = df.columns
(df.rdd
   .flatMap(lambda row: [(row[c], c) for c in cols])
   .toDF(["value", "column_name"])
   .show())
+-----+-----------+
|value|column_name|
+-----+-----------+
| 1| 123|
| 2| 124|
| 3| 125|
| 9| 123|
| 9| 124|
| 4| 125|
| 4| 123|
| 12| 124|
| 1| 125|
| 2| 123|
| 4| 124|
| 8| 125|
| 7| 123|
| 6| 124|
| 3| 125|
| 19| 123|
| 11| 124|
| 2| 125|
| 21| 123|
| 10| 124|
+-----+-----------+
Then, group by the value and aggregate the column names into a list:
from pyspark.sql import functions as f
(df.rdd
   .flatMap(lambda row: [(row[c], c) for c in cols])
   .toDF(["value", "column_name"])
   .groupby("value").agg(f.collect_list("column_name"))
   .show())
+-----+-------------------------+
|value|collect_list(column_name)|
+-----+-------------------------+
| 19| [123]|
| 7| [123]|
| 6| [124]|
| 9| [123, 124]|
| 1| [123, 125]|
| 10| [124, 125]|
| 3| [125, 125]|
| 12| [124]|
| 8| [125]|
| 11| [124]|
| 2| [124, 123, 125]|
| 4| [125, 123, 124]|
| 21| [123]|
+-----+-------------------------+
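If you prefer to stay in the DataFrame API rather than dropping to the RDD, the same melt can be written by exploding an array of (value, column_name) structs. A hedged sketch against the same df; collect_set is used so a value appearing twice in one column (like 3 in column 125) lists that column only once, matching the requested output, while collect_list would keep duplicates as above:
from pyspark.sql import functions as f

melted = df.select(
    f.explode(f.array(*[
        f.struct(f.col(c).alias("value"), f.lit(c).alias("column_name"))
        for c in df.columns
    ])).alias("kv")
).select("kv.value", "kv.column_name")

melted.groupby("value").agg(f.collect_set("column_name").alias("columns")).show()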

How to flatten a pyspark dataframe? (spark 1.6)

I'm working with Spark 1.6
Here are my data:
eDF = sqlsc.createDataFrame([Row(v=1, eng_1=10, eng_2=20),
                             Row(v=2, eng_1=15, eng_2=30),
                             Row(v=3, eng_1=8, eng_2=12)])
eDF.select('v','eng_1','eng_2').show()
+---+-----+-----+
| v|eng_1|eng_2|
+---+-----+-----+
| 1| 10| 20|
| 2| 15| 30|
| 3| 8| 12|
+---+-----+-----+
I would like to 'flatten' this table.
That is to say:
+---+-----+---+
| v| key|val|
+---+-----+---+
| 1|eng_1| 10|
| 1|eng_2| 20|
| 2|eng_1| 15|
| 2|eng_2| 30|
| 3|eng_1| 8|
| 3|eng_2| 12|
+---+-----+---+
Note that since I'm working with Spark 1.6, I can't use pyspark.sql.functions.create_map or pyspark.sql.functions.posexplode.
Use rdd.flatMap to flatten it:
df = sqlsc.createDataFrame(
    eDF.rdd.flatMap(
        lambda r: [Row(v=r.v, key=col, val=r[col]) for col in ['eng_1', 'eng_2']]
    )
)
df.show()
+-----+---+---+
| key| v|val|
+-----+---+---+
|eng_1| 1| 10|
|eng_2| 1| 20|
|eng_1| 2| 15|
|eng_2| 2| 30|
|eng_1| 3| 8|
|eng_2| 3| 12|
+-----+---+---+
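For reference, explode over an array of structs is also available in Spark 1.6 (explode, array, struct and lit all exist in pyspark.sql.functions there), so the same flattening can be done without leaving the DataFrame API. A hedged sketch, reusing eDF from the question:
from pyspark.sql import functions as F

cols = ['eng_1', 'eng_2']
flat = eDF.select(
    'v',
    F.explode(F.array(*[
        # one (key, val) struct per engine column
        F.struct(F.lit(c).alias('key'), F.col(c).alias('val')) for c in cols
    ])).alias('kv')
).select('v', 'kv.key', 'kv.val')
flat.show()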